Implementing Communication-Avoiding Algorithms
Jim Demmel, EECS & Math Departments, UC Berkeley
Dec 24, 2015
Why avoid communication?
• Communication = moving data
  – Between levels of the memory hierarchy
  – Between processors over a network
• Running time of an algorithm is the sum of 3 terms:
  – #flops * time_per_flop
  – #words moved / bandwidth   (communication)
  – #messages * latency        (communication)
• time_per_flop << 1/bandwidth << latency
  – Gaps growing exponentially with time [FOSC]
• Avoid communication to save time
• Same story for energy:
• Avoid communication to save energy
Goals
• Redesign algorithms to avoid communication
  – Between all memory hierarchy levels
    • L1, L2, DRAM, network, etc.
• Attain lower bounds if possible
• Current algorithms often far from lower bounds
  – Large speedups and energy savings possible
Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity
Lower bound for all "n³-like" linear algebra
• Holds for
  – Matmul, BLAS, LU, QR, eig, SVD, tensor contractions, …
  – Some whole programs (sequences of these operations, no matter how individual ops are interleaved, e.g. computing A^k)
  – Dense and sparse matrices (where #flops << n³)
  – Sequential and parallel algorithms
  – Some graph-theoretic algorithms (e.g. Floyd-Warshall)
• Let M = "fast" memory size (per processor)
  #words_moved (per processor) = Ω( #flops (per processor) / M^(1/2) )
  #messages_sent (per processor) = Ω( #flops (per processor) / M^(3/2) ),
  since #messages_sent ≥ #words_moved / largest_message_size
• Parallel case: assume either load or memory balanced
• SIAM SIAG/Linear Algebra Prize, 2012 (Ballard, D., Holtz, Schwartz)
Limits to parallel scaling (1/2)
• Consider the dense case, #flops_per_proc = n³/P
  – #Words = Ω( n³/(P·M^(1/2)) )
  – #Messages = Ω( n³/(P·M^(3/2)) )
• What is M? Must be at least n²/P to hold the data
  – #Words = Ω( n²/P^(1/2) )
  – #Messages = Ω( P^(1/2) )
• But if M is fixed, this looks like perfect strong scaling in time
  – #Flops, #Words, #Messages all proportional to 1/P
• Ditto for energy, if we count energy costs in joules …
  – Per flop, per word moved, per message
  – Per word per second, for data stored in memory M
  – Per second, for leakage, cooling, …
• How big can we make P? and M?
Limits to parallel scaling (2/2)
• Consider the dense case, #flops_per_proc = n³/P
  – #Words = Ω( n³/(P·M^(1/2)) )
  – #Messages = Ω( n³/(P·M^(3/2)) )
• How big can we make P? and M?
• Assume we start with 1 copy of the inputs A and B
  – Otherwise no communication may be needed
• Thm: #Words = Ω( n²/P^(2/3) ), independent of M
• Reached when M = n²/P^(2/3) too (check: n³/(P·M^(1/2)) = n³/(P·n/P^(1/3)) = n²/P^(2/3)), or P = n³/M^(3/2), and #Messages = Ω(1) (log P in practice)
• Attained by the 2.5D algorithm when c = P^(1/3) ("3D alg")
• Can keep increasing P until P = n³; then #Words = #Messages = Ω(1) (log n in practice); a small numerical illustration of these bounds follows below
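To make the scale of these bounds concrete, here is a small sketch (every number below is a hypothetical illustration, not a value from the slides) that evaluates the per-processor lower bounds and the memory-independent limit for a dense problem.

import math

n, P, c = 10**5, 10**4, 4              # illustrative problem size, processor count, #copies
M = c * n**2 / P                       # fast memory per processor: room for c copies of the data
flops_per_proc = n**3 / P
words    = flops_per_proc / M**0.5     # Omega( n^3 / (P * M^(1/2)) )
messages = flops_per_proc / M**1.5     # Omega( n^3 / (P * M^(3/2)) )
words_limit = n**2 / P**(2.0/3)        # memory-independent limit Omega( n^2 / P^(2/3) )
print(f"words >= {words:.3g}, messages >= {messages:.3g}, limit >= {words_limit:.3g}")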
Can we attain these lower bounds?
• Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these bounds?
  – Often not
• If not, are there other algorithms that do?
  – Yes, for much of dense linear algebra
  – New algorithms, with new numerical properties, new ways to encode answers, new data structures
  – Not just loop transformations (need those too)
• Only a few sparse algorithms so far
• Lots of work in progress
  – Algorithms, Energy, Heterogeneous Processors, …
2.5D Matrix Multiplication
• Assume we can fit cn²/P data per processor, c > 1
• Processors form a (P/c)^(1/2) x (P/c)^(1/2) x c grid
[Figure: processor grid of dimensions (P/c)^(1/2) x (P/c)^(1/2) x c; example: P = 32, c = 2]
2.5D Matrix Multiplication (algorithm)
• Initially P(i,j,0) owns A(i,j) and B(i,j), each of size n(c/P)^(1/2) x n(c/P)^(1/2)
  (1) P(i,j,0) broadcasts A(i,j) and B(i,j) to P(i,j,k)
  (2) Processors at level k perform 1/c-th of SUMMA, i.e. 1/c-th of Σm A(i,m)·B(m,j)
  (3) Sum-reduce the partial sums Σm A(i,m)·B(m,j) along the k-axis, so P(i,j,0) owns C(i,j)
  (a serial sketch of this schedule follows below)
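A minimal serial simulation of the three steps above (a sketch only, not the parallel implementation; it assumes a virtual Pr x Pr x c processor grid with Pr = (P/c)^(1/2), c dividing Pr, and Pr dividing n):

import numpy as np

def matmul_25d_sim(A, B, Pr, c):
    # Serial simulation of the 2.5D schedule over a virtual Pr x Pr x c grid.
    n = A.shape[0]
    b = n // Pr                              # block size owned by each virtual processor
    Cpart = np.zeros((c, n, n))              # step (1): A, B replicated to all c layers
    steps_per_layer = Pr // c
    for k in range(c):                       # step (2): layer k does 1/c-th of SUMMA
        for m in range(k * steps_per_layer, (k + 1) * steps_per_layer):
            for i in range(Pr):
                for j in range(Pr):
                    Ablk = A[i*b:(i+1)*b, m*b:(m+1)*b]    # broadcast along row i of layer k
                    Bblk = B[m*b:(m+1)*b, j*b:(j+1)*b]    # broadcast along column j of layer k
                    Cpart[k, i*b:(i+1)*b, j*b:(j+1)*b] += Ablk @ Bblk
    return Cpart.sum(axis=0)                 # step (3): sum-reduce along the k-axis

n, Pr, c = 8, 4, 2                           # virtual grid 4 x 4 x 2 (i.e. P = 32, c = 2)
A, B = np.random.rand(n, n), np.random.rand(n, n)
assert np.allclose(matmul_25d_sim(A, B, Pr, c), A @ B)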
2.5D Matmul on BG/P, 16K nodes / 64K cores, c = 16 copies
[Plot: strong scaling; annotations: 12x faster, 2.7x faster]
Distinguished Paper Award, EuroPar'11 (Solomonik, D.)
SC'11 paper by Solomonik, Bhatele, D.
Perfect Strong Scaling – in Time and Energy (1/2)
• Every time you add a processor, you should use its memory M too
• Start with the minimal number of procs: P·M = 3n²
• Increase P by a factor of c => total memory increases by a factor of c
• Notation for timing model
  – γT, βT, αT = secs per flop, per word_moved, per message of size m
  – T(cP) = n³/(cP) · [ γT + βT/M^(1/2) + αT/(m·M^(1/2)) ] = T(P)/c
• Notation for energy model
  – γE, βE, αE = joules for the same operations
  – δE = joules per word of memory used per sec
  – εE = joules per sec, for leakage etc.
  – E(cP) = cP · { n³/(cP) · [ γE + βE/M^(1/2) + αE/(m·M^(1/2)) ] + δE·M·T(cP) + εE·T(cP) } = E(P)
• Perfect scaling extends to N-body, Strassen, …
Perfect Strong Scaling – in Time and Energy (2/2)
• T(cP) = n³/(cP) · [ γT + βT/M^(1/2) + αT/(m·M^(1/2)) ] = T(P)/c
• E(cP) = cP · { n³/(cP) · [ γE + βE/M^(1/2) + αE/(m·M^(1/2)) ] + δE·M·T(cP) + εE·T(cP) } = E(P)
• Can use these formulas to answer many questions, such as
  – How to choose P and M to minimize the energy E needed for a computation?
  – Given a max allowed runtime T, what is the minimum energy E needed to achieve it?
  – Given a max allowed energy E, what is the minimum runtime T attainable?
  – Can we minimize the average power P = E/T?
  – Given a target energy efficiency, what architectural parameters are needed to achieve it?
    • Can we attain 75 Gflops/Watt?
    • Can we attain an exaflop for 20 MWatts?
  (a small evaluation of these formulas appears below)
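As a sanity check on the model, here is a tiny sketch that evaluates T(cP) and E(cP) from the formulas above; every machine parameter below is a made-up placeholder, not measured data, and the function name is hypothetical.

def time_energy(n, P, M, m, gT, bT, aT, gE, bE, aE, dE, eE):
    # T and E per the timing/energy model on this slide (sketch).
    flops = n**3 / P                                   # flops per processor
    T = flops * (gT + bT / M**0.5 + aT / (m * M**0.5))
    E = P * (flops * (gE + bE / M**0.5 + aE / (m * M**0.5)) + dE * M * T + eE * T)
    return T, E

n = 10**5
M, m = 2**27, 2**15                                    # memory and message size, in words (hypothetical)
params = (1e-11, 1e-10, 1e-6, 1e-10, 1e-9, 1e-5, 1e-12, 1.0)   # gT, bT, aT, gE, bE, aE, dE, eE
T1, E1 = time_energy(n, 1000, M, m, *params)
T4, E4 = time_energy(n, 4000, M, m, *params)           # 4x the processors, same M per processor
print(T1 / T4, E4 / E1)                                # expect ~4 (perfect time scaling) and ~1 (constant energy)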
Handling Heterogeneity
• Suppose each of the P processors could differ
  – γi = sec/flop, βi = sec/word, αi = sec/message, Mi = memory
• What is the optimal assignment of work Fi to minimize time?
  – Ti = Fi·γi + Fi·βi/Mi^(1/2) + Fi·αi/Mi^(3/2) = Fi·[ γi + βi/Mi^(1/2) + αi/Mi^(3/2) ] = Fi·ξi
  – Choose Fi so that Σi Fi = n³ and T = maxi Ti is minimized
  – Answer: Fi = n³·(1/ξi)/Σj(1/ξj) and T = n³/Σj(1/ξj)  (see the sketch below)
• Optimal algorithm for n x n matmul
  – Recursively divide into 8 half-sized subproblems
  – Assign subproblems to processor i so they add up to Fi flops
• Works for Strassen, other algorithms …
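A small sketch of the work-split formula above; the per-processor parameters are invented purely for illustration.

n = 4096
gamma = [1e-11, 2e-11, 5e-11]   # secs per flop (hypothetical)
beta  = [1e-9,  4e-9,  2e-9]    # secs per word moved (hypothetical)
alpha = [1e-6,  2e-6,  1e-6]    # secs per message (hypothetical)
M     = [2**25, 2**23, 2**24]   # fast memory sizes in words (hypothetical)

xi = [g + b / m**0.5 + a / m**1.5 for g, b, a, m in zip(gamma, beta, alpha, M)]
T  = n**3 / sum(1.0 / x for x in xi)       # minimized T = max_i T_i (all T_i equal at the optimum)
F  = [T / x for x in xi]                   # F_i = n^3 * (1/xi_i) / sum_j (1/xi_j)
assert abs(sum(F) - n**3) < 1e-6 * n**3
print([f / n**3 for f in F])               # fraction of the flops assigned to each processor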
Application to Tensor Contractions
• Ex: C(i,j,k) = Σmn A(i,j,m,n)·B(m,n,k)
  – Communication lower bounds apply
• Complex symmetries possible
  – Ex: B(m,n,k) = B(k,m,n) = …
  – d-fold symmetry can save up to d-fold flops/memory
• Heavily used in electronic structure calculations
  – Ex: NWChem, for the coupled cluster (CC) approach to the Schroedinger equation
• CTF: Cyclops Tensor Framework
  – Exploits 2.5D algorithms, symmetries
  – Up to 3x faster running CC than NWChem on 3072 cores of a Cray XE6
  – Solomonik, Hammond, Matthews
[Figure: contraction C(i,j,k) = Σm A(i,j,m)·B(m,k), with A 3-fold symmetric, B 2-fold symmetric, C 2-fold symmetric]
Communication Lower Bounds for Strassen-like matmul algorithms
• Proof: graph expansion (different from classical matmul)
  – Strassen-like: the DAG must be "regular" and connected
• Extends up to M = n²/p^(2/ω)
• Extends to the rectangular case: multiply (m x n)·(n x p) in q mults
  – #words_moved = Ω( #flops / M^(log_mp(q) − 1) )
• Best Paper Prize (SPAA'11), Ballard, D., Holtz, Schwartz; also in JACM
• Is the lower bound attainable?
Classical O(n³) matmul: #words_moved = Ω( M·(n/M^(1/2))³ / P )
Strassen's O(n^lg7) matmul: #words_moved = Ω( M·(n/M^(1/2))^lg7 / P )
Strassen-like O(n^ω) matmul: #words_moved = Ω( M·(n/M^(1/2))^ω / P )
Communication Avoiding Parallel Strassen (CAPS)
• BFS step: runs all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory
  vs
• DFS step: runs all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory
• CAPS: if EnoughMemory and P ≥ 7, then BFS step, else DFS step, end if
• The best way to interleave BFS and DFS steps is a tuning parameter
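A serial sketch of the CAPS control flow: Strassen's recursion in which each level takes a "BFS step" (all 7 subproblems formed together, as the parallel algorithm does across P/7-processor groups) or a "DFS step" (subproblems handled one at a time), chosen by a crude, assumed memory test. This only illustrates the decision structure and the Strassen recursion; it is not the parallel implementation, and the memory bookkeeping is invented for illustration.

import numpy as np

def caps_sketch(A, B, mem_words, base=64):
    # Strassen's recursion; per level, choose "BFS" vs "DFS" from a rough memory budget.
    n = A.shape[0]
    if n <= base:
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    pairs = [(A11 + A22, B11 + B22), (A21 + A22, B11), (A11, B12 - B22),
             (A22, B21 - B11), (A11 + A12, B22), (A21 - A11, B11 + B12),
             (A12 - A22, B21 + B22)]
    if mem_words >= 7 * h * h * 4:         # crude stand-in for "EnoughMemory"
        # "BFS step": all 7 products formed together; each gets 1/7 of the budget
        M1, M2, M3, M4, M5, M6, M7 = [caps_sketch(X, Y, mem_words // 7, base) for X, Y in pairs]
    else:
        # "DFS step": the 7 products one at a time, reusing the full budget
        M1, M2, M3, M4, M5, M6, M7 = [caps_sketch(X, Y, mem_words, base) for X, Y in pairs]
    C = np.empty((n, n))
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C

n = 256
A, B = np.random.rand(n, n), np.random.rand(n, n)
assert np.allclose(caps_sketch(A, B, mem_words=10**6), A @ B)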
Performance Benchmarking, Strong Scaling Plot
Franklin (Cray XT4), n = 94080
Speedups: 24%–184% (over previous Strassen-based algorithms)
Invited to appear as a Research Highlight in CACM
Strassen-like beyond matmul
• Thm (D., Dumitriu, Holtz '07): Any Strassen-like O(n^ω) matmul algorithm can be used to build a numerically stable O(n^(ω+η)) algorithm, for any η > 0, for Ax=b, least squares, eig, SVD, …
  – η > 0 needed to deal with numerical stability
  – Strassen already stable, so η = 0
• Thm: For sequential versions of these algorithms, #Words_moved = O( n^(ω+η)/M^((ω+η)/2 − 1) + n² log n ), i.e. they attain the expected lower bound
(Ballard, D., Holtz, Schwartz)
Cache and Network Oblivious Algorithms
• Motivation: minimize communication at every level of a hierarchical system without tuning parameters (in theory)
  – Not always: 2.5D Matmul on BG/P was topology-aware
• CAPS: divide-and-conquer; choose BFS or DFS to adapt to #processors, available memory
• CARMA
  – Divide-and-conquer classical matmul: divide the largest of the 3 dimensions to create two subproblems (see the sketch below)
  – Choose BFS or DFS to adapt to #processors, available memory
CARMA Performance: Distributed Memory
Square: m = k = n = 6144
[Plot (log-log): ScaLAPACK vs CARMA vs Peak]
Cray XE6 (Hopper), each node 2 x 12-core, 4 x NUMA
CARMA Performance: Distributed Memory
Inner Product: m = n = 192, k = 6,291,456
[Plot (log-log): ScaLAPACK vs CARMA vs Peak]
Cray XE6 (Hopper), each node 2 x 12-core, 4 x NUMA
CARMA Performance: Shared Memory
Square: m = k = n
[Plot (log x-axis, linear y-axis): MKL vs CARMA, single and double precision, with single/double peak lines]
Intel Emerald: 4 Intel Xeon X7560 x 8 cores, 4 x NUMA
CARMA Performance: Shared Memory
Inner Product: m = n = 64
[Plot (log x-axis, linear y-axis): MKL vs CARMA, single and double precision]
Intel Emerald: 4 Intel Xeon X7560 x 8 cores, 4 x NUMA
Why is CARMA Faster in Shared Memory? L3 Cache Misses
Shared Memory Inner Product (m = n = 64, k = 524288)
[Plot (linear): 97% fewer misses, 86% fewer misses]
One-sided Factorizations (LU, QR), so far
• Classical approach:
    for i = 1 to n
      update column i
      update trailing matrix
  #words_moved = O(n³)
• Blocked approach (LAPACK):
    for i = 1 to n/b
      update block i of b columns
      update trailing matrix
  #words_moved = O(n³/M^(1/3))
• Recursive approach:
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        update right half of A
        factor(right half of A)
  #words_moved = O(n³/M^(1/2))  (a runnable sketch follows below)
• None of these approaches minimizes #messages
• Parallel case: Partial Pivoting => n reductions
• Need another idea
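A minimal in-place sketch of the recursive approach, specialized to LU without pivoting (so it is only safe on matrices that do not need pivoting, e.g. diagonally dominant ones); it follows the "factor left half, update right half, factor right half" structure above.

import numpy as np
from scipy.linalg import solve_triangular

def rec_lu(A):
    # In-place recursive LU without pivoting: A is overwritten by L (unit lower,
    # strictly below the diagonal) and U (upper triangle).
    m, n = A.shape
    if n == 1:
        A[1:, 0] /= A[0, 0]
        return
    h = n // 2
    rec_lu(A[:, :h])                                   # factor left half (numpy slice = view)
    # update right half: U12 = L11^{-1} A12, then the Schur complement of A22
    A[:h, h:] = solve_triangular(A[:h, :h], A[:h, h:], lower=True, unit_diagonal=True)
    A[h:, h:] -= A[h:, :h] @ A[:h, h:]
    rec_lu(A[h:, h:])                                  # factor the (updated) right half

n = 8
M = np.random.rand(n, n) + n * np.eye(n)               # diagonally dominant, so no pivoting needed
A = M.copy(); rec_lu(A)
L = np.tril(A, -1) + np.eye(n); U = np.triu(A)
assert np.allclose(L @ U, M)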
TSQR: An Architecture-Dependent Algorithm
W = [ W0; W1; W2; W3 ]  (tall-skinny matrix, split into row blocks)
• Parallel (binary reduction tree): each Wi -> Ri0 by a local QR; combine pairs: [R00; R10] -> R01 and [R20; R30] -> R11; then [R01; R11] -> R02
• Sequential / Streaming (flat tree): W0 -> R00; fold in W1 -> R01; fold in W2 -> R02; fold in W3 -> R03
• Dual Core (hybrid tree): a mix of the two
• Can choose the reduction tree dynamically
• Multicore, Multisocket, Multirack, Multisite, Out-of-core: ?
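A small sketch of the TSQR binary reduction tree, computing only the R factor (the local Q factors are not assembled here); it assumes the number of row blocks is a power of two.

import numpy as np

def tsqr_R(W, nblocks=4):
    # Leaves: QR of each row block; then combine pairs of R factors up a binary tree.
    Rs = [np.linalg.qr(Wi, mode='r') for Wi in np.array_split(W, nblocks)]
    while len(Rs) > 1:
        Rs = [np.linalg.qr(np.vstack(pair), mode='r')
              for pair in zip(Rs[0::2], Rs[1::2])]
    return Rs[0]

W = np.random.rand(1000, 6)
R_tree = tsqr_R(W)
R_ref = np.linalg.qr(W, mode='r')
# For full-rank W, R is unique up to the signs of its rows, so compare absolute values.
assert np.allclose(np.abs(R_tree), np.abs(R_ref), atol=1e-8)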
Back to LU: using a similar idea for TSLU as for TSQR: use a reduction tree to do "Tournament Pivoting"
W (n x b) = [ W1; W2; W3; W4 ]
• Factor each block Wi = Pi·Li·Ui and choose b pivot rows of Wi, call them Wi'
• Combine pairs: [W1'; W2'] = P12·L12·U12, choose b pivot rows W12'; [W3'; W4'] = P34·L34·U34, choose b pivot rows W34'
• Final round: [W12'; W34'] = P1234·L1234·U1234, choose b pivot rows
• Go back to W and use these b pivot rows (move them to the top, do LU without pivoting)
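A sketch of the row-selection part of tournament pivoting: each "game" runs ordinary GEPP on a set of candidate rows and keeps the b rows that partial pivoting moves to the top. The reduction below is a flat tree for brevity (a binary tree would pair winners instead); the function names are hypothetical.

import numpy as np
from scipy.linalg import lu

def best_rows(row_idx, W, b):
    # One tournament game: GEPP on the candidate rows; return the b winning row indices.
    P, L, U = lu(W[row_idx])                        # W[row_idx] = P @ L @ U
    order = P.T @ np.arange(len(row_idx))           # position i of P.T @ W holds the i-th pivot row
    return [row_idx[int(k)] for k in order[:b]]

def tournament_pivot_rows(W, b, nblocks=4):
    # Flat-tree tournament over the row blocks of W (sketch).
    cands = [best_rows(rows, W, b)
             for rows in np.array_split(np.arange(W.shape[0]), nblocks)]
    winners = cands[0]
    for c in cands[1:]:
        winners = best_rows(np.array(winners + c), W, b)
    return winners                                   # b pivot rows to move to the top of W

W = np.random.rand(64, 4)
print(tournament_pivot_rows(W, b=4))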
Minimizing Communication in TSLU
W = [ W1; W2; W3; W4 ]
• Parallel (binary tree): LU on each Wi; combine the chosen rows of pairs with LU; one final LU
• Sequential / Streaming (flat tree): LU on W1, then fold in W2, W3, W4, one LU at a time
• Dual Core (hybrid tree): a mix of the two
• Can choose the reduction tree dynamically, to match the architecture, as before
Making TSLU Numerically Stable
• Details matter
  – Going up the tree, we could do LU either on the original rows of A (tournament pivoting) or on the computed rows of U
  – Only tournament pivoting is stable
• "Thm": The new scheme is as stable as Partial Pivoting (GEPP) in the following sense: it gets the same Schur complements as GEPP applied to a different input matrix whose entries are blocks taken from the input A
• Why just a "Thm"?
Stability of LU using TSLU: CALU
• Empirical testing
  – Both random matrices and "special ones"
  – Both binary tree (BCALU) and flat-tree (FCALU)
  – 3 metrics: ||PA−LU||/||A||, normwise and componentwise backward errors
  – See [D., Grigori, Xiang, 2010] for details
Why is stability of TSLU just a "Thm"?
• Proof is correct – in exact arithmetic
• Experiment
  – Generate 100 random 6x6, rank-3 matrices in Matlab
  – [L,U,P] = lu(A); do LU without pivoting on P·A; compare L factors: are they the same?
    • Compute || L – Lnp ||: a few 0's, a few ∞'s, a few NaNs
    • Rest mostly O(1)
  – Why? Floating point is nonassociative; doing arithmetic in a different order gives different rounding errors
  – Same experiment with rank-6 matrices: || L – Lnp || usually nonzero, O(macheps)
  – Same experiment with 20x20 rank-4 matrices: || L – Lnp || often O(10^3)
• Much harder to break TSLU, but possible
  – Occurred when using TSLU to factorize a low-rank subdiagonal panel in a symmetric-indefinite factorization
Fixing TSLU
• Run TSLU quickly, test for stability, fix if necessary (rare)
• Test conditioning of U; if not tiny (usual case), proceed, else
• Compute || L ||; if not big (usual case), proceed, else
• Factor A = QR using TSQR, then
• Factor Q = PLU using TSLU, then
• A = P·L·(U·R), with U·R as the upper triangular factor
• Last topic in the lecture: how to guarantee floating point reproducibility
2D CALU with Tournament Pivoting
[Figure]
2.5D CALU with Tournament Pivoting (c = 4 copies)
[Figure]
Exascale Machine Parameters (Source: DOE Exascale Workshop)
• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node
Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU
[Heat map over log2(p) and log2(n²/p) = log2(memory_per_proc); up to 29x speedup]
2.5D vs 2D LU, With and Without Pivoting
[Figure]
Other CA algorithms for Ax=b, least squares (1/3)
• A symmetric and indefinite
  – Seek a factorization that retains symmetry: P·A·P^T = L·D·L^T, with D "simple"
    • Saves half the flops, preserves inertia
  – Usual approach: Bunch-Kaufman
    • D block diagonal with 1x1 and 2x2 blocks
    • Pivot search down column, along row (lots of communication)
  – Alternative: Aasen
    • D = tridiagonal = T
    • Two steps:
      – P·A·P^T = L·T·L^T where T is banded, using TSLU
        [Figure: band structure of T]
      – Solve/factor the narrow-band problem with T
    • Up to 2.8x faster than MKL; Best Paper at IPDPS'13
Other CA algorithms for Ax=b, least squares (2/3)
• Minimizing bandwidth and latency for sequential GEPP
  – So far, could not do partial pivoting and minimize #messages, just #words
  – Challenge:
    • Column layout good for choosing pivots, bad for matmul
    • Blocked layout good for matmul, bad for choosing pivots
  – Solution: use both layouts, switching between them
    • "Shape Morphing LU", or SMLU

• Ordinary recursive LU:
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        update right half of A
        factor(right half of A)
  #Words = O(n³/M^(1/2))
  #Messages = O(n³/M)

• Shape Morphing LU:
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        reshape to recursive block format
        update right half of A
        reshape to columnwise format
        factor(right half of A)
  #Words = O(n³/M^(1/2))
  #Messages = O(n³/M^(3/2))
Other CA algorithms for Ax=b, least squares (3/3)
• Need for pivoting arises beyond LU, in QR
  – Choose permutation P so that the leading columns of A·P = Q·R span the column space of A – Rank Revealing QR (RRQR)
  – Usual approach, like Partial Pivoting:
    • Put the longest column first, update the rest of the matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit the lower bound
  – Use Tournament Pivoting
    • Each round of the tournament selects the best b columns from two groups of b columns, either using the usual approach or something better (Gu/Eisenstat)
    • Thm: This approach "reveals the rank" of A, in the sense that the leading r x r submatrix of R has singular values "near" the largest r singular values of A; ditto for the trailing submatrix
  – The idea extends to other pivoting schemes
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDL^T with complete pivoting
What about sparse matrices? (1/3)
• If the matrix quickly becomes dense, use a dense algorithm
• Ex: All-Pairs Shortest Path using Floyd-Warshall
• Similar to matmul: let D = A, then
    for k = 1:n
      for i = 1:n
        for j = 1:n
          D(i,j) = min( D(i,j), D(i,k) + D(k,j) )
• But we can't reorder the outer loop for 2.5D; need another idea
• Abbreviate D(i,j) = min( D(i,j), min_k( A(i,k) + B(k,j) ) ) by D = A⊗B
  – Dependencies ok, 2.5D works, just a different semiring
• Kleene's Algorithm:
    D = DC-APSP(A, n)
      D = A
      Partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
      D11 = DC-APSP(D11, n/2)
      D12 = D11 ⊗ D12
      D21 = D21 ⊗ D11
      D22 = D21 ⊗ D12
      D22 = DC-APSP(D22, n/2)
      D21 = D22 ⊗ D21
      D12 = D12 ⊗ D22
      D11 = D12 ⊗ D21
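A runnable serial sketch of this recursion over the (min,+) semiring, checked against the Floyd-Warshall triple loop above; it assumes nonnegative edge weights, a zero diagonal, and n a power of two.

import numpy as np

def semiring_mm(D, A, B):
    # D(i,j) = min( D(i,j), min_k A(i,k) + B(k,j) )  -- the (min,+) product with accumulation
    return np.minimum(D, (A[:, :, None] + B[None, :, :]).min(axis=1))

def dc_apsp(A):
    # Divide-and-conquer APSP (Kleene's algorithm), following the slide's block order.
    n = A.shape[0]
    if n == 1:
        return A.copy()                              # diagonal assumed zero
    h = n // 2
    D11, D12 = A[:h, :h].copy(), A[:h, h:].copy()
    D21, D22 = A[h:, :h].copy(), A[h:, h:].copy()
    D11 = dc_apsp(D11)
    D12 = semiring_mm(D12, D11, D12)
    D21 = semiring_mm(D21, D21, D11)
    D22 = semiring_mm(D22, D21, D12)
    D22 = dc_apsp(D22)
    D21 = semiring_mm(D21, D22, D21)
    D12 = semiring_mm(D12, D12, D22)
    D11 = semiring_mm(D11, D12, D21)
    return np.block([[D11, D12], [D21, D22]])

n = 16
A = np.random.default_rng(0).uniform(1, 10, (n, n)); np.fill_diagonal(A, 0.0)
D = A.copy()                                         # reference: Floyd-Warshall triple loop
for k in range(n):
    for i in range(n):
        for j in range(n):
            D[i, j] = min(D[i, j], D[i, k] + D[k, j])
assert np.allclose(dc_apsp(A), D)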
Performance of 2.5D APSP using Kleene's algorithm
Strong scaling on Hopper (Cray XE6 with 1024 nodes = 24,576 cores)
[Plot annotations: 6.2x speedup, 2x speedup]
What about sparse matrices? (2/3)
• If parts of the matrix become dense, optimize those
• Ex: Cholesky on a matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): If all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on a w x w submatrix
• Thm: #Words_moved = Ω( w³/M^(1/2) ), etc.
• Thm (George '73): Nested dissection gives the optimal ordering for a 2D grid, 3D grid, similar matrices
  – w = n for a 2D n x n grid; w = n² for a 3D n x n x n grid
• Sequential multifrontal Cholesky attains the bounds
• PSPASES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains the 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)
What about sparse matrices? (3/3)
• If the matrix stays very sparse, the lower bound is unattainable; need a new one
• Ex: A, B both diagonal: no communication in the parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), i.i.d.
• Assumption: the algorithm is sparsity-independent: the assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
    #Words_moved = Ω( min( d·n/P^(1/2), d²·n/P ) )
  – Proof exploits the fact that reuse of entries of C = A·B is unlikely
• Contrast the general lower bound: #Words_moved = Ω( d²·n/(P·M^(1/2)) )
• Attained by a divide-and-conquer algorithm that splits the matrices along the dimensions most likely to minimize cost
Symmetric Eigenproblem and SVD
• Usual approach for A = A^T (SVD similar)
  – A -> Q^T·A·Q = T, where Q orthogonal, T tridiagonal
  – T -> U^T·T·U = Λ, where U orthogonal, Λ diagonal
  – (Q·U)'s columns are the eigenvectors, Λ the eigenvalues
  – Dense -> Tridiagonal -> Diagonal
  – Only half BLAS3, half BLAS2, in LAPACK's sytrd
• Communication-Avoiding Approach
  – A -> Q·A·Q^T = B, where B = B^T is banded, of bandwidth M^(1/2)
  – Continue as above, starting with B
  – Dense -> Banded -> Tridiagonal -> Diagonal
  – Dense -> Banded: use TSQR to zero out M^(1/2) cols/rows at a time
  – Banded -> Tridiagonal: need a new(ish) idea
Successive Band Reduction (Bischof/Lang/Sun)
b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b
[Figure sequence: successive sweeps 1–6 of band reduction; orthogonal transformations Q1, Q1^T, Q2, Q2^T, … eliminate c columns of the band at a time and chase the resulting bulges of size d+c down the band]
Conventional vs CA-SBR
  Conventional: touch all data 4 times
  Communication-Avoiding: touch all data once
Why avoid communication
bull Communication = moving datandash Between level of memory hierarchyndash Between processors over a network
bull Running time of an algorithm is sum of 3 termsndash flops time_per_flopndash words moved bandwidthndash messages latency
2
communication
bull Time_per_flop ltlt 1 bandwidth ltlt latencybull Gaps growing exponentially with time [FOSC]
bull Avoid communication to save timebull Same story for energy
bull Avoid communication to save energy
Goals
3
bull Redesign algorithms to avoid communicationbull Between all memory hierarchy levels
bull L1 L2 DRAM network etc bull Attain lower bounds if possible
bull Current algorithms often far from lower boundsbull Large speedups and energy savings possible
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
Lower bound for all ldquon3-likerdquo linear algebra
bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no
matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)
6
bull Let M = ldquofastrdquo memory size (per processor)
words_moved (per processor) = (flops (per processor) M12 )
messages_sent (per processor) = (flops (per processor) M32 )
bull Parallel case assume either load or memory balanced
Lower bound for all ldquon3-likerdquo linear algebra
bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no
matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)
7
bull Let M = ldquofastrdquo memory size (per processor)
words_moved (per processor) = (flops (per processor) M12 )
messages_sent ge words_moved largest_message_size
bull Parallel case assume either load or memory balanced
Lower bound for all ldquon3-likerdquo linear algebra
bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no
matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)
8
bull Let M = ldquofastrdquo memory size (per processor)
words_moved (per processor) = (flops (per processor) M12 )
messages_sent (per processor) = (flops (per processor) M32 )
bull Parallel case assume either load or memory balanced
SIAM SIAGLinear Algebra Prize 2012Ballard D Holtz Schwartz
Limits to parallel scaling (12)
bull Consider dense case flops_per_proc = n3Pndash Words = (n3(PM12 ))ndash Messages = (n3(PM32 ))
bull What is M Must be at least n2P to hold datandash Words = (n2P12 )ndash Messages = (P12 )
bull But if M fixed looks like perfect strong scaling in timendash Flops Words Messages all proportional to 1P
bull Ditto for energy if we count energy costs in joules hellipndash Per flop per word moved per messagendash Per word per second for data stored in memory Mndash Per second for leakage cooling hellip
bull How big can we make P and M
Limits to parallel scaling (22)
bull Consider dense case flops_per_proc = n3Pndash Words = (n3(PM12 ))ndash Messages = (n3(PM32 ))
bull How big can we make P and Mbull Assume we start with 1 copy of inputs A and B
ndash Otherwise no communication may be needed
bull Thm Words= (n2P23 ) independent of Mbull Reached when M = n2P23 too or P = n3M32 and Messages = (1) (log P in practice)bull Attained by 25D algorithm when c=P13 (ldquo3D algrdquo)bull Can keep increasing P until P = n3 Words = Messages = (1) (log n in practice)
Can we attain these lower bounds
bull Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these boundsndash Often not
bull If not are there other algorithms that dondash Yes for much of dense linear algebrandash New algorithms with new numerical properties
new ways to encode answers new data structures
ndash Not just loop transformations (need those too)bull Only a few sparse algorithms so farbull Lots of work in progress
ndash Algorithms Energy Heterogeneous Processors hellip11
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
25D Matrix Multiplication
bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid
c
(Pc)12
(Pc)12
Example P = 32 c = 2
25D Matrix Multiplication
bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid
k
j
iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12
(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)
(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)
(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)
25D Matmul on BGP 16K nodes 64K coresc = 16 copies
Distinguished Paper Award EuroParrsquo11 (Solomonik D)SCrsquo11 paper by Solomonik Bhatele D
12x faster
27x faster
Perfect Strong Scaling ndash in Time and Energy (12)
bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2
bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model
ndash γT βT αT = secs per flop per word_moved per message of size m
bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ]
= T(P)cbull Notation for energy model
ndash γE βE αE = joules for same operations
ndash δE = joules per word of memory used per sec
ndash εE = joules per sec for leakage etc
bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP)
= E(P)bull Perfect scaling extends to N-body Strassen hellip
Perfect Strong Scaling ndash in Time and Energy (22)
bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ] = T(P)c
bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP) = E(P)
bull Can use these formulas to answer many questions such asndash How to choose p and M to minimize energy E needed for computationndash Given max allowed runtime T what is minimum energy E needed to achieve
itndash Given max allowed energy E what is the minimum runtime T attainablendash Can we minimize the average power P = ETndash Given target energy efficiency what architectural parameters are needed to
achieve itbull Can we attain 75 GflopsWattbull Can we attain an exaflop for 20 MWatts
Handling Heterogeneity
bull Suppose each of P processors could differndash γi = secflop βi = secword αi = secmessage Mi = memory
bull What is optimal assignment of work Fi to minimize timendash Ti = Fi γi + Fi βi Mi
12 + Fi αi Mi32 = Fi [γi + βi Mi
12 + αi Mi32] = Fi ξi
ndash Choose Fi so Σi Fi = n3 and minimizing T = maxi Ti
ndash Answer Fi = n3(1ξi)Σj(1ξj) and T = n3Σj(1ξj)
bull Optimal Algorithm for nxn matmulndash Recursively divide into 8 half-sized subproblemsndash Assign subproblems to processor i to add up to Fi flops
bull Works for Strassen other algorithmshellip
Application to Tensor Contractions
bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply
bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory
bull Heavily used in electronic structure calculationsndash Ex NWChem
bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries
ndash Solomonik Hammond Matthews
C(ijk) = Σm A(ijm)B(mk)
A3-fold symm
B2-fold symm
C2-fold symm
Application to Tensor Contractions
bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply
bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory
bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn
bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews
Communication Lower Bounds for Strassen-like matmul algorithms
bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected
bull Extends up to M = n2 p2ω bull Extends to rectangular case multiply (mxn)(nxp) in q mults
ndash words_moved = Ω (flopsM^(logmpq -1))
bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz also in JACMbull Is the lower bound attainable
Classical O(n3) matmul
words_moved =Ω (M(nM12)3P)
Strassenrsquos O(nlg7) matmul
words_moved =Ω (M(nM12)lg7P)
Strassen-like O(nω) matmul
words_moved =Ω (M(nM12)ωP)
vs
Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory
Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory
CAPS If EnoughMemory and P 7 then BFS step else DFS step end if
Communication Avoiding Parallel Strassen (CAPS)
Best way to interleaveBFS and DFS is an tuning parameter
26
Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080
Speedups 24-184(over previous Strassen-based algorithms)
Invited to appear as Research Highlight in CACM
Strassen-like beyond matmul
bull Thm (D Dumitriu Holtzrsquo07) Any Strassen-like O(nω) matmul algorithm can be used to build a numerically stable O(nω+η) algorithm for any ηgt0 for Ax=b least squares eig SVD hellipndash ηgt0 needed to deal with numerical stabilityndash Strassen already stable so η=0
bull Thm For sequential versions of these algorithms Words_moved = O(nω+ηM(ω+η)2 ndash 1 + n2 log n) ie attain expected lower bound
Ballard D Holtz Schwartz
Cache and Network Oblivious Algorithms
bull Motivation Minimizes communication at every level of a hierarchical system without tuning parameters (in theory)ndash Not always 25D Matmul on BGP was topology aware
bull CAPS Divide-and-conquer choose BFS or DFS to adapt to processors available memory
bull CARMAndash Divide-and-conquer classical matmul divide largest of 3
dimensions to create two subproblemsndash Choose BFS or DFS to adapt to processors available memory
CARMA Performance Distributed Memory
Square m = k = n = 6144
ScaLAPACK
CARMA
Peak
(log)
(log)
Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA
CARMA Performance Distributed Memory
Inner Product m = n = 192 k = 6291456
ScaLAPACK
CARMAPeak
(log)
(log)
Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA
CARMA Performance Shared Memory
Square m = k = n
MKL (double)CARMA (double)
MKL (single)CARMA (single)
Peak (single)
Peak (double)
(log)
(linear)
Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA
CARMA Performance Shared Memory
Inner Product m = n = 64
MKL (double)
CARMA (double)
MKL (single)
CARMA (single)
(log)
(linear)
Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA
Why is CARMA Faster in Shared MemoryL3 Cache Misses
Shared Memory Inner Product (m = n = 64 k = 524288)
97 Fewer Misses
86 Fewer Misses
(linear)
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
One-sided Factorizations (LU QR) so farbull Classical Approach for i=1 to n update column i update trailing matrixbull words_moved = O(n3)
35
bull Blocked Approach (LAPACK) for i=1 to nb update block i of b columns update trailing matrixbull words moved = O(n3M13)
bull Recursive Approach func factor(A) if A has 1 column update it
else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)
bull None of these approaches minimizes messagesbull Parallel case Partial
Pivoting =gt n reductionsbull Need another idea
TSQR An Architecture-Dependent Algorithm
W =
W0
W1
W2
W3
R00
R10
R20
R30
R01
R11
R02Parallel
W =
W0
W1
W2
W3
R01R02
R00
R03
SequentialStreaming
W =
W0
W1
W2
W3
R00
R01
R01
R11
R02
R11
R03
Dual Core
Can choose reduction tree dynamically
Multicore Multisocket Multirack Multisite Out-of-core
Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo
Wnxb =
W1
W2
W3
W4
P1middotL1middotU1
P2middotL2middotU2
P3middotL3middotU3
P4middotL4middotU4
=
Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo
W1rsquoW2rsquoW3rsquoW4rsquo
P12middotL12middotU12
P34middotL34middotU34
=Choose b pivot rows call them W12rsquo
Choose b pivot rows call them W34rsquo
W12rsquoW34rsquo
= P1234middotL1234middotU1234
Choose b pivot rows
Go back to W and use these b pivot rows (move them to top do LU without pivoting)
37
Minimizing Communication in TSLU
W = W1
W2
W3
W4
LULULULU
LU
LULUParallel
W = W1
W2
W3
W4
LULU
LU
LUSequentialStreaming
W = W1
W2
W3
W4
LULU LU
LULU
LULU
Dual Core
Can choose reduction tree dynamically to match architecture as before
38
Making TSLU Numerically Stable
bull Details matterndash Going up the tree we could do LU either on original rows of A
(tournament pivoting) or computed rows of Undash Only tournament pivoting stable
bull ldquoThmrdquo New scheme as stable as Partial Pivoting (GEPP) in following sense Get same Schur complements as GEPP applied to different input matrix whose entries are blocks taken from input A
bull Why just a ldquoThmrdquo
39
Stability of LU using TSLU CALU
Summer School Lecture 4 40
bull Empirical testingndash Both random matrices and ldquospecial onesrdquondash Both binary tree (BCALU) and flat-tree (FCALU)ndash 3 metrics ||PA-LU||||A|| normwise and componentwise backward errorsndash See [D Grigori Xiang 2010] for details
Why is stability of TSLU just a ldquoThmrdquo
bull Proof is correct ndash in exact arithmeticbull Experiment
ndash Generate 100 random 6x6 rank 3 matrices in Matlabndash [LUP] = lu(A) do LU without pivoting on PA compare L factors are
they the samebull Compute || L ndash Lnp || A few 0rsquos A few infinrsquos a few NaNsbull Rest mostly O(1)
ndash Why Floating point is nonassociative doing arithmetic in different order gives different rounding errors
ndash Same experiment with rank 6 matrices || L ndash Lnp || usually nonzero O(macheps)
ndash Same experiment with 20x20 rank 4 matrices || L ndash Lnp || often O(103)
bull Much harder to break TSLU but possiblendash Occurred when using TSLU to factorize a low-rank subdiagonal
panel in symmetric-indefinite factorization41
Fixing TSLU
bull Run TSLU quickly test for stability fix if necessary (rare)
bull Test conditioning of U if not tiny (usual case) proceed elsebull Compute || L || if not big (usual case) proceed elsebull Factor A = QR using TSQR thenbull Factor Q = PLU using TSLU thenbull A = PL(UR) with UR as upper triangular factor
bull Last topic in lecture how to guarantee floating point reproducibility
42
2D CALU with Tournament Pivoting
43
25D CALU with Tournament Pivoting (c=4 copies)
44
Exascale Machine ParametersSource DOE Exascale Workshop
bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node
Exascale predicted speedupsfor Gaussian Elimination
2D CA-LU vs ScaLAPACK-LU
log2 (p)
log
2 (
n2p
) =
log
2 (m
emo
ry_p
er_p
roc)
Up to 29x
25D vs 2D LUWith and Without Pivoting
Other CA algorithms for Ax=b least squares(13)
bull A symmetric and indefinitendash Seek factorization that retains symmetry PAPT = LDLT D
ldquosimplerdquobull Save frac12 flops preserve inertia
ndash Usual approach Bunch-Kaufmanbull D block diagonal with 1x1 and 2x2 blocksbull Pivot search down column along row (lots of communication)
ndash Alternative Aasenbull D = tridiagonal = Tbull Two steps
ndash PAPT = LTLT where T is banded using TSLU
48
0 0
0
0 0
0
0
hellip
hellip
ndash Solvefactor narrow band problem with Tbull Up to 28x faster than MKL Best Paper at IPDPSrsquo13
Other CA algorithms for Ax=b least squares (23)bull Minimizing bandwidth and latency for sequential GEPP
ndash So far could not do partial pivoting and minimize messages just words
ndash Challengebull Column layout good for choosing pivots bad for matmulbull Blocked layout good for matmul bad for choosing pivots
ndash Solution use both layouts switching between thembull ldquoShape Morphing LUrdquo or SMLU
49
bull func factor(A) if A has 1 column update it else factor(left half of A)
update right half of A
factor(right half of A)
bull Words = O(n3M12)
bull Messages = O(n3M)
bull func factor(A) if A has 1 column update it else factor(left half of A) reshape to recursive block format update right half of A reshape to columnwise format factor(right half of A)
bull Words = O(n3M12)
bull Messages = O(n3M32)
Other CA algorithms for Ax=b least squares (33)bull Need for pivoting arises beyond LU in QR
ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)
ndash Usual approach like Partial Pivoting
bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound
ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two
groups of b columns either using usual approach or something better (GuEisenstat)
bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix
ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 50
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
What about sparse matrices (13)
bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then
bull But canrsquot reorder outer loop for 25D need another idea
bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring
bull Kleenersquos Algorithm
52
for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)
D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21
Performance of 25D APSP using Kleene
53
Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)
62xspeedup
2x speedup
What about sparse matrices (23)
bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of
G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix
bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering
for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid
bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse
multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on
separators) 54
What about sparse matrices (33)
bull If matrix stays very sparse lower bound unattainable new one
bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of
data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely
bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices
along dimensions most likely to minimize cost
55
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
Symmetric Eigenproblem and SVD
bull Usual approach for A=AT (SVD similar)ndash A QTAQ = T where Q orthogonal T tridiagonalndash T UTTU = Λ where U orthogonal Λ diagonalndash QUrsquos columns are eigenvectors Λ eigenvaluesndash Dense Tridiagonal Diagonalndash Only half BLAS3 half BLAS2 in LAPACKrsquos sytrd
bull Communication-Avoiding Approachndash A QAQT = B where B=BT banded of bandwidth M12
ndash Continue as above starting with Bndash Dense Banded Tridiagonal Diagonalndash Dense Banded use TSQR to zero out M12 colsrows at a timendash Banded Tridiagonal need new(ish) idea
b+1
b+1
Successive Band Reduction (BischofLangSun)
1
b+1
b+1
d+1
c
Successive Band Reduction (BischofLangSun)
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1Q1
b+1
b+1
d+1
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
12
Q1
b+1
b+1
d+1
d+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
12
Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
d+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
Q1
Q1T
Q2
Q2T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
Q1
Q1T
Q2
Q2T
Q3
Q3T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
6
6
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
Conventional vs CA - SBR
Conventional Communication-Avoiding
Touch all data 4 times Touch all data once
Goals
3
bull Redesign algorithms to avoid communicationbull Between all memory hierarchy levels
bull L1 L2 DRAM network etc bull Attain lower bounds if possible
bull Current algorithms often far from lower boundsbull Large speedups and energy savings possible
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
Lower bound for all ldquon3-likerdquo linear algebra
bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no
matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)
6
bull Let M = ldquofastrdquo memory size (per processor)
words_moved (per processor) = (flops (per processor) M12 )
messages_sent (per processor) = (flops (per processor) M32 )
bull Parallel case assume either load or memory balanced
Lower bound for all ldquon3-likerdquo linear algebra
bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no
matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)
7
bull Let M = ldquofastrdquo memory size (per processor)
words_moved (per processor) = (flops (per processor) M12 )
messages_sent ge words_moved largest_message_size
bull Parallel case assume either load or memory balanced
Lower bound for all ldquon3-likerdquo linear algebra
bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no
matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)
8
bull Let M = ldquofastrdquo memory size (per processor)
words_moved (per processor) = (flops (per processor) M12 )
messages_sent (per processor) = (flops (per processor) M32 )
bull Parallel case assume either load or memory balanced
SIAM SIAGLinear Algebra Prize 2012Ballard D Holtz Schwartz
Limits to parallel scaling (12)
bull Consider dense case flops_per_proc = n3Pndash Words = (n3(PM12 ))ndash Messages = (n3(PM32 ))
bull What is M Must be at least n2P to hold datandash Words = (n2P12 )ndash Messages = (P12 )
bull But if M fixed looks like perfect strong scaling in timendash Flops Words Messages all proportional to 1P
bull Ditto for energy if we count energy costs in joules hellipndash Per flop per word moved per messagendash Per word per second for data stored in memory Mndash Per second for leakage cooling hellip
bull How big can we make P and M
Limits to parallel scaling (22)
bull Consider dense case flops_per_proc = n3Pndash Words = (n3(PM12 ))ndash Messages = (n3(PM32 ))
bull How big can we make P and Mbull Assume we start with 1 copy of inputs A and B
ndash Otherwise no communication may be needed
bull Thm Words= (n2P23 ) independent of Mbull Reached when M = n2P23 too or P = n3M32 and Messages = (1) (log P in practice)bull Attained by 25D algorithm when c=P13 (ldquo3D algrdquo)bull Can keep increasing P until P = n3 Words = Messages = (1) (log n in practice)
Can we attain these lower bounds
bull Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these boundsndash Often not
bull If not are there other algorithms that dondash Yes for much of dense linear algebrandash New algorithms with new numerical properties
new ways to encode answers new data structures
ndash Not just loop transformations (need those too)bull Only a few sparse algorithms so farbull Lots of work in progress
ndash Algorithms Energy Heterogeneous Processors hellip11
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
25D Matrix Multiplication
bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid
c
(Pc)12
(Pc)12
Example P = 32 c = 2
2.5D Matrix Multiplication
• Assume can fit c·n^2/P data per processor, c > 1
• Processors form a (P/c)^(1/2) x (P/c)^(1/2) x c grid, indexed (i,j,k)
• Initially P(i,j,0) owns A(i,j) and B(i,j), each of size n·(c/P)^(1/2) x n·(c/P)^(1/2)
  (1) P(i,j,0) broadcasts A(i,j) and B(i,j) to P(i,j,k)
  (2) Processors at level k perform 1/c-th of SUMMA, i.e. 1/c-th of Σm A(i,m)·B(m,j)
  (3) Sum-reduce partial sums Σm A(i,m)·B(m,j) along k-axis so P(i,j,0) owns C(i,j)
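The three steps can be checked with a small serial NumPy simulation (my own sketch, not from the slides): the (P/c)^(1/2) x (P/c)^(1/2) x c grid is emulated with loops, each layer k does 1/c-th of the block inner products, and the partial sums are then reduced along the k-axis.

    import numpy as np

    def matmul_25d_sim(A, B, P=8, c=2):
        """Serial emulation of the 2.5D algorithm's arithmetic and layout.
        Each layer k of the q x q x c grid computes 1/c-th of the block
        products, then partial results are summed along the k-axis."""
        q = int((P // c) ** 0.5)                 # grid is q x q x c
        n = A.shape[0]
        nb = n // q                              # block size owned by P(i,j,0)
        Ab = lambda i, j: A[i*nb:(i+1)*nb, j*nb:(j+1)*nb]
        Bb = lambda i, j: B[i*nb:(i+1)*nb, j*nb:(j+1)*nb]
        C = np.zeros_like(A)
        for i in range(q):
            for j in range(q):
                partial = [np.zeros((nb, nb)) for _ in range(c)]   # one per layer k
                for k in range(c):
                    for m in range(k, q, c):     # layer k handles m = k, k+c, ...
                        partial[k] += Ab(i, m) @ Bb(m, j)
                # step (3): sum-reduce along the k-axis back to P(i,j,0)
                C[i*nb:(i+1)*nb, j*nb:(j+1)*nb] = sum(partial)
        return C

    if __name__ == "__main__":
        n, P, c = 8, 8, 2
        A, B = np.random.rand(n, n), np.random.rand(n, n)
        assert np.allclose(matmul_25d_sim(A, B, P, c), A @ B)

A real implementation would run the per-layer work as SUMMA over MPI; the simulation only illustrates which processor owns which partial sum.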
2.5D Matmul on BG/P, 16K nodes / 64K cores; c = 16 copies
Distinguished Paper Award, EuroPar'11 (Solomonik, D.); SC'11 paper by Solomonik, Bhatele, D.
[Strong-scaling plot, annotations: "12x faster", "2.7x faster"]
Perfect Strong Scaling – in Time and Energy (1/2)
• Every time you add a processor, you should use its memory M too
• Start with minimal number of procs: P·M = 3n^2
• Increase P by a factor of c; total memory increases by a factor of c
• Notation for timing model:
  – γT, βT, αT = secs per flop, per word_moved, per message of size m
  – T(cP) = n^3/(cP) · [ γT + βT/M^(1/2) + αT/(m·M^(1/2)) ] = T(P)/c
• Notation for energy model:
  – γE, βE, αE = joules for same operations
  – δE = joules per word of memory used per sec
  – εE = joules per sec, for leakage, etc.
  – E(cP) = cP · { n^3/(cP) · [ γE + βE/M^(1/2) + αE/(m·M^(1/2)) ] + δE·M·T(cP) + εE·T(cP) } = E(P)
• Perfect scaling extends to N-body, Strassen, …
Perfect Strong Scaling – in Time and Energy (2/2)
• T(cP) = n^3/(cP) · [ γT + βT/M^(1/2) + αT/(m·M^(1/2)) ] = T(P)/c
• E(cP) = cP · { n^3/(cP) · [ γE + βE/M^(1/2) + αE/(m·M^(1/2)) ] + δE·M·T(cP) + εE·T(cP) } = E(P)
• Can use these formulas to answer many questions, such as
  – How to choose p and M to minimize energy E needed for computation?
  – Given max allowed runtime T, what is minimum energy E needed to achieve it?
  – Given max allowed energy E, what is the minimum runtime T attainable?
  – Can we minimize the average power P = E/T?
  – Given target energy efficiency, what architectural parameters are needed to achieve it?
    • Can we attain 75 Gflops/Watt?
    • Can we attain an exaflop for 20 MWatts?
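A tiny model of the two formulas above makes the "perfect strong scaling" claim easy to check numerically; the machine constants below are made up purely for illustration.

    import numpy as np

    # Illustrative (made-up) machine constants -- not from the slides.
    gT, bT, aT = 1e-11, 1e-9, 1e-6     # secs per flop / word / message
    gE, bE, aE = 1e-10, 1e-8, 1e-5     # joules per flop / word / message
    dE, eE = 1e-9, 1.0                 # joules per word of memory per sec; leakage joules per sec
    m = 1e4                            # message size in words

    def T(P, n, M):
        return n**3 / P * (gT + bT / M**0.5 + aT / (m * M**0.5))

    def E(P, n, M):
        t = T(P, n, M)
        return P * (n**3 / P * (gE + bE / M**0.5 + aE / (m * M**0.5)) + dE * M * t + eE * t)

    n, P0 = 10_000, 4
    M = 3 * n**2 / P0                  # start with minimal memory: P*M = 3n^2
    for c in (1, 2, 4, 8):             # add processors but keep M per processor fixed
        print(c * P0, T(c * P0, n, M), E(c * P0, n, M))
    # Time drops by 1/c while energy stays constant, as the slide claims.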
Handling Heterogeneity
• Suppose each of P processors could differ
  – γi = sec/flop, βi = sec/word, αi = sec/message, Mi = memory
• What is optimal assignment of work Fi to minimize time? (see sketch below)
  – Ti = Fi·γi + Fi·βi/Mi^(1/2) + Fi·αi/Mi^(3/2) = Fi·[ γi + βi/Mi^(1/2) + αi/Mi^(3/2) ] = Fi·ξi
  – Choose Fi so Σi Fi = n^3 and minimizing T = maxi Ti
  – Answer: Fi = n^3·(1/ξi)/Σj(1/ξj) and T = n^3/Σj(1/ξj)
• Optimal Algorithm for n x n matmul
  – Recursively divide into 8 half-sized subproblems
  – Assign subproblems to processor i to add up to Fi flops
• Works for Strassen, other algorithms, …
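A short NumPy sketch of the work-assignment formula; the per-processor parameter values are hypothetical.

    import numpy as np

    # Hypothetical per-processor parameters (gamma_i, beta_i, alpha_i, M_i).
    gamma = np.array([1e-11, 2e-11, 5e-11])
    beta  = np.array([1e-9,  1e-9,  2e-9])
    alpha = np.array([1e-6,  2e-6,  1e-6])
    M     = np.array([1e8,   1e7,   1e9])

    xi = gamma + beta / np.sqrt(M) + alpha / M**1.5   # effective sec-per-flop of processor i
    n = 4096
    F = n**3 * (1.0 / xi) / np.sum(1.0 / xi)          # optimal flop assignment F_i
    T = n**3 / np.sum(1.0 / xi)                       # resulting finish time

    # Every processor finishes at the same time, which is what makes it optimal:
    assert np.allclose(F * xi, T)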
Application to Tensor Contractions
• Ex: C(i,j,k) = Σmn A(i,j,m,n)·B(m,n,k)
  – Communication lower bounds apply
• Complex symmetries possible
  – Ex: B(m,n,k) = B(k,m,n) = …
  – d-fold symmetry can save up to d-fold flops/memory
• Heavily used in electronic structure calculations
  – Ex: NWChem
• CTF: Cyclops Tensor Framework
  – Exploits 2.5D algorithms, symmetries
  – Solomonik, Hammond, Matthews
[Figure: contraction C(i,j,k) = Σm A(i,j,m)·B(m,k) with A 3-fold symmetric, B 2-fold symmetric, C 2-fold symmetric]
Application to Tensor Contractions
• Ex: C(i,j,k) = Σmn A(i,j,m,n)·B(m,n,k)
  – Communication lower bounds apply
• Complex symmetries possible
  – Ex: B(m,n,k) = B(k,m,n) = …
  – d-fold symmetry can save up to d-fold flops/memory
• Heavily used in electronic structure calculations
  – Ex: NWChem, for coupled cluster (CC) approach to Schroedinger eqn
• CTF: Cyclops Tensor Framework
  – Exploits 2.5D algorithms, symmetries
  – Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6
  – Solomonik, Hammond, Matthews
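For reference, the contraction on this slide can be written directly with NumPy's einsum; this only illustrates the operation itself (not CTF), and shows why the matmul-style lower bounds apply: it is a matmul over the flattened (m,n) index.

    import numpy as np

    ni, nj, nk, nm, nn = 4, 4, 4, 3, 3
    A = np.random.rand(ni, nj, nm, nn)
    B = np.random.rand(nm, nn, nk)

    # C(i,j,k) = sum_{m,n} A(i,j,m,n) * B(m,n,k)
    C = np.einsum('ijmn,mnk->ijk', A, B)

    # Same contraction as a matmul over the flattened (m,n) index:
    C2 = (A.reshape(ni * nj, nm * nn) @ B.reshape(nm * nn, nk)).reshape(ni, nj, nk)
    assert np.allclose(C, C2)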
Communication Lower Bounds for Strassen-like matmul algorithms
• Proof: graph expansion (different from classical matmul)
  – Strassen-like: DAG must be "regular" and connected
• Extends up to M = n^2 / P^(2/ω)
• Extends to rectangular case: multiply (m x n)·(n x p) in q mults
  – words_moved = Ω( #flops / M^(log_mp(q) – 1) )
• Best Paper Prize (SPAA'11), Ballard, D., Holtz, Schwartz; also in JACM
• Is the lower bound attainable?

Classical O(n^3) matmul:       words_moved = Ω( M·(n/M^(1/2))^3 / P )
Strassen's O(n^lg7) matmul:    words_moved = Ω( M·(n/M^(1/2))^lg7 / P )
Strassen-like O(n^ω) matmul:   words_moved = Ω( M·(n/M^(1/2))^ω / P )
Communication Avoiding Parallel Strassen (CAPS)
• BFS step: runs all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory
• DFS step: runs all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory
• CAPS: if EnoughMemory and P ≥ 7 then BFS step, else DFS step
• Best way to interleave BFS and DFS is a tuning parameter (see sketch below)
26
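For concreteness, here is a plain serial NumPy version of the 7-multiply recursion that CAPS schedules across processors; running the multiplies one after another corresponds to an all-DFS schedule, and CAPS's contribution is deciding, level by level, whether to run them concurrently (BFS) or sequentially (DFS). This sketch is not the CAPS implementation itself.

    import numpy as np

    def strassen(A, B, n_min=64):
        """One level of Strassen's 7-multiply recursion (n a power of 2)."""
        n = A.shape[0]
        if n <= n_min:                      # base case: classical matmul
            return A @ B
        h = n // 2
        A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
        B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
        # In CAPS these 7 calls are either a BFS step (concurrent on P/7
        # processor groups) or a DFS step (sequential on all P processors).
        M1 = strassen(A11 + A22, B11 + B22, n_min)
        M2 = strassen(A21 + A22, B11,       n_min)
        M3 = strassen(A11,       B12 - B22, n_min)
        M4 = strassen(A22,       B21 - B11, n_min)
        M5 = strassen(A11 + A12, B22,       n_min)
        M6 = strassen(A21 - A11, B11 + B12, n_min)
        M7 = strassen(A12 - A22, B21 + B22, n_min)
        C = np.empty_like(A)
        C[:h, :h] = M1 + M4 - M5 + M7
        C[:h, h:] = M3 + M5
        C[h:, :h] = M2 + M4
        C[h:, h:] = M1 - M2 + M3 + M6
        return C

    A = np.random.rand(256, 256); B = np.random.rand(256, 256)
    assert np.allclose(strassen(A, B), A @ B)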
Performance Benchmarking, Strong Scaling Plot
Franklin (Cray XT4), n = 94080
Speedups: 24%–184% (over previous Strassen-based algorithms)
Invited to appear as Research Highlight in CACM
Strassen-like beyond matmul
• Thm (D., Dumitriu, Holtz '07): Any Strassen-like O(n^ω) matmul algorithm can be used to build a numerically stable O(n^(ω+η)) algorithm, for any η > 0, for Ax=b, least squares, eig, SVD, …
  – η > 0 needed to deal with numerical stability
  – Strassen already stable, so η = 0
• Thm: For sequential versions of these algorithms, Words_moved = O( n^(ω+η)/M^((ω+η)/2 – 1) + n^2·log n ), i.e. attain expected lower bound
Ballard, D., Holtz, Schwartz
Cache and Network Oblivious Algorithms
• Motivation: Minimize communication at every level of a hierarchical system, without tuning parameters (in theory)
  – Not always: 2.5D Matmul on BG/P was topology-aware
• CAPS: Divide-and-conquer, choose BFS or DFS to adapt to #processors, available memory
• CARMA:
  – Divide-and-conquer classical matmul: divide largest of 3 dimensions to create two subproblems
  – Choose BFS or DFS to adapt to #processors, available memory
CARMA Performance: Distributed Memory
[Log-log strong-scaling plot, square case m = k = n = 6144: CARMA vs ScaLAPACK vs peak]
Cray XE6 (Hopper), each node 2 x 12-core, 4 x NUMA
CARMA Performance: Distributed Memory
[Log-log strong-scaling plot, inner-product-shaped case m = n = 192, k = 6,291,456: CARMA vs ScaLAPACK vs peak]
Cray XE6 (Hopper), each node 2 x 12-core, 4 x NUMA
CARMA Performance: Shared Memory
[Plot (log x-axis, linear y-axis), square case m = k = n: MKL vs CARMA, single and double precision, with single/double peak lines]
Intel Emerald: 4 x Intel Xeon X7560 x 8 cores, 4 x NUMA
CARMA Performance: Shared Memory
[Plot (log x-axis, linear y-axis), inner-product-shaped case m = n = 64: MKL vs CARMA, single and double precision]
Intel Emerald: 4 x Intel Xeon X7560 x 8 cores, 4 x NUMA
Why is CARMA Faster in Shared Memory? L3 Cache Misses
[Plot (linear axes): shared-memory inner product, m = n = 64, k = 524,288; CARMA incurs 97% fewer misses / 86% fewer misses than MKL]
Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity
One-sided Factorizations (LU, QR), so far
• Classical Approach:
    for i = 1 to n
      update column i
      update trailing matrix
  – #words_moved = O(n^3)
• Blocked Approach (LAPACK):
    for i = 1 to n/b
      update block i of b columns
      update trailing matrix
  – #words_moved = O(n^3/M^(1/3))
• Recursive Approach (see sketch below):
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        update right half of A
        factor(right half of A)
  – #words_moved = O(n^3/M^(1/2))
• None of these approaches minimizes #messages
• Parallel case: Partial Pivoting => n reductions
• Need another idea
35
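The recursive approach above, written out for LU in NumPy (pivoting omitted purely to keep the sketch short, so this is illustrative only and assumes a matrix safe for unpivoted elimination):

    import numpy as np

    def factor(A):
        """Recursive (unpivoted) LU: factor left half, update right half,
        factor right half.  In place: L below the diagonal (unit diag), U on/above."""
        n = A.shape[1]
        if n == 1:
            A[1:, 0] /= A[0, 0]                       # one column: multipliers
            return A
        h = n // 2
        factor(A[:, :h])                              # factor left half
        L11 = np.tril(A[:h, :h], -1) + np.eye(h)
        A[:h, h:] = np.linalg.solve(L11, A[:h, h:])   # U12 = L11^{-1} A12
        A[h:, h:] -= A[h:, :h] @ A[:h, h:]            # Schur complement update
        factor(A[h:, h:])                             # factor trailing part
        return A

    n = 8
    A = np.random.rand(n, n) + n * np.eye(n)          # diagonally dominant: no pivoting needed
    LU = factor(A.copy())
    L = np.tril(LU, -1) + np.eye(n); U = np.triu(LU)
    assert np.allclose(L @ U, A)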
TSQR: An Architecture-Dependent Algorithm
[Figure: W = [W0; W1; W2; W3], factored by local QRs, with R factors combined up a reduction tree.
 Parallel (binary tree): R00, R10, R20, R30 → R01, R11 → R02.
 Sequential/Streaming (flat tree): R00 → R01 → R02 → R03.
 Dual Core (hybrid tree): mixes binary and flat combining.]
Can choose reduction tree dynamically
Multicore / Multisocket / Multirack / Multisite / Out-of-core: ?
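A minimal NumPy sketch of binary-tree TSQR, returning only the R factor (the implicit Q factors that would be composed into the full Q are omitted for brevity):

    import numpy as np

    def tsqr(W, P=4):
        """Binary-tree TSQR: each 'processor' does a local QR of its row block,
        then pairs of R factors are stacked and re-factored up the tree."""
        blocks = np.array_split(W, P, axis=0)
        Rs = [np.linalg.qr(b, mode='r') for b in blocks]       # leaf QRs (R00..R30)
        while len(Rs) > 1:                                     # combine pairwise
            Rs = [np.linalg.qr(np.vstack(Rs[i:i+2]), mode='r')
                  for i in range(0, len(Rs), 2)]
        return Rs[0]

    W = np.random.rand(1000, 6)
    R_tsqr = tsqr(W)
    R_ref = np.linalg.qr(W, mode='r')
    # R is unique up to the sign of each row, so compare absolute values:
    assert np.allclose(np.abs(R_tsqr), np.abs(R_ref))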
Back to LU: Using similar idea for TSLU as TSQR: use reduction tree to do "Tournament Pivoting"
[Figure: W (n x b) = [W1; W2; W3; W4], each block factored as Wi = Pi·Li·Ui.
 Round 1: choose b pivot rows of each Wi, call them Wi'.
 Round 2: [W1'; W2'] = P12·L12·U12 and [W3'; W4'] = P34·L34·U34; choose b pivot rows of each, call them W12' and W34'.
 Round 3: [W12'; W34'] = P1234·L1234·U1234; choose the final b pivot rows.]
Go back to W and use these b pivot rows (move them to top, do LU without pivoting); a small sketch of the tournament follows.
37
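A small NumPy sketch of the tournament itself: pick b pivot rows per block with ordinary partial pivoting, then combine winners pairwise up a binary tree. Block sizes and the explicit GEPP loop are my own illustrative choices.

    import numpy as np

    def select_pivot_rows(W, b):
        """Indices of the b rows GEPP would pick in the first b steps on W."""
        A = W.copy()
        rows = np.arange(A.shape[0])
        for k in range(b):
            p = k + np.argmax(np.abs(A[k:, k]))   # largest entry in column k
            A[[k, p]] = A[[p, k]]
            rows[[k, p]] = rows[[p, k]]
            A[k+1:, k:] -= np.outer(A[k+1:, k] / A[k, k], A[k, k:])   # eliminate
        return rows[:b]

    def tournament_pivot_rows(W, b, P=4):
        """One tournament on W (tall, b columns): b candidates per block,
        then pairwise combining up a binary tree."""
        blocks = np.array_split(np.arange(W.shape[0]), P)
        cands = [blk[select_pivot_rows(W[blk], b)] for blk in blocks]   # leaves
        while len(cands) > 1:
            nxt = []
            for i in range(0, len(cands), 2):
                idx = np.concatenate(cands[i:i+2])
                nxt.append(idx[select_pivot_rows(W[idx], b)])
            cands = nxt
        return cands[0]      # global indices of the b winning pivot rows

    W = np.random.rand(64, 4)
    print(tournament_pivot_rows(W, b=4))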
Minimizing Communication in TSLU
[Figure: same reduction trees as TSQR, with a local LU at each node.
 Parallel: binary tree of LUs.  Sequential/Streaming: flat tree.  Dual Core: hybrid tree.]
Can choose reduction tree dynamically, to match architecture, as before
38
Making TSLU Numerically Stable
• Details matter
  – Going up the tree, we could do LU either on original rows of A (tournament pivoting) or computed rows of U
  – Only tournament pivoting stable
• "Thm": New scheme as stable as Partial Pivoting (GEPP) in following sense: get same Schur complements as GEPP applied to a different input matrix whose entries are blocks taken from input A
• Why just a "Thm"?
39
Stability of LU using TSLU: CALU
• Empirical testing
  – Both random matrices and "special ones"
  – Both binary tree (BCALU) and flat-tree (FCALU)
  – 3 metrics: ||PA-LU||/||A||, normwise and componentwise backward errors
  – See [D., Grigori, Xiang, 2010] for details
40
Why is stability of TSLU just a "Thm"?
• Proof is correct – in exact arithmetic
• Experiment (see sketch below):
  – Generate 100 random 6x6, rank-3 matrices in Matlab
  – [L,U,P] = lu(A); do LU without pivoting on P·A; compare L factors: are they the same?
    • Compute || L – Lnp ||: a few 0's, a few ∞'s, a few NaNs
    • Rest mostly O(1)
  – Why? Floating point is nonassociative; doing arithmetic in different order gives different rounding errors
  – Same experiment with rank-6 matrices: || L – Lnp || usually nonzero, O(macheps)
  – Same experiment with 20x20 rank-4 matrices: || L – Lnp || often O(10^3)
• Much harder to break TSLU, but possible
  – Occurred when using TSLU to factorize a low-rank subdiagonal panel in symmetric-indefinite factorization
41
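The experiment is easy to reproduce; below is a NumPy/SciPy translation of the Matlab steps described above (rank-3 6x6 matrices, GEPP via SciPy, then LU without pivoting on the pre-permuted matrix):

    import numpy as np
    import scipy.linalg as sla

    def lu_nopivot(A):
        """Plain LU without pivoting; returns L with unit diagonal."""
        A = A.copy(); n = A.shape[0]
        for k in range(n - 1):
            A[k+1:, k] /= A[k, k]
            A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])
        return np.tril(A, -1) + np.eye(n)

    rng = np.random.default_rng(0)
    diffs = []
    for _ in range(100):
        A = rng.standard_normal((6, 3)) @ rng.standard_normal((3, 6))   # rank 3
        P, L, U = sla.lu(A)                  # GEPP: A = P @ L @ U
        with np.errstate(all='ignore'):      # expect 0/0 etc. on the rank-deficient part
            Lnp = lu_nopivot(P.T @ A)        # unpivoted LU on the pre-permuted matrix
            diffs.append(np.max(np.abs(L - Lnp)))
    print(diffs[:10])   # a mix of 0's, O(1) values, infs, and NaNs, as the slide reports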
Fixing TSLU
• Run TSLU, quickly test for stability, fix if necessary (rare)
• Test conditioning of U; if not tiny (usual case), proceed, else
• Compute || L ||; if not big (usual case), proceed, else
• Factor A = QR using TSQR, then
• Factor Q = PLU using TSLU, then
• A = P·L·(U·R), with U·R as upper triangular factor
• Last topic in lecture: how to guarantee floating point reproducibility
42
2D CALU with Tournament Pivoting
43
2.5D CALU with Tournament Pivoting (c = 4 copies)
44
Exascale Machine Parameters (Source: DOE Exascale Workshop)
• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node
Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU
[Contour plot over log2(p) and log2(n^2/p) = log2(memory_per_proc); speedups up to 29x]
2.5D vs 2D LU, With and Without Pivoting
Other CA algorithms for Ax=b, least squares (1/3)
• A symmetric and indefinite
  – Seek factorization that retains symmetry: P·A·P^T = L·D·L^T, D "simple"
    • Save 1/2 the flops, preserve inertia
  – Usual approach: Bunch-Kaufman
    • D block diagonal with 1x1 and 2x2 blocks
    • Pivot search down column, along row (lots of communication)
  – Alternative: Aasen
    • D = tridiagonal = T
    • Two steps:
      – P·A·P^T = L·T·L^T where T is banded, using TSLU
      – Solve/factor narrow band problem with T
    • Up to 2.8x faster than MKL; Best Paper at IPDPS'13
[Figure: band structure of T]
48
Other CA algorithms for Ax=b, least squares (2/3)
• Minimizing bandwidth and latency for sequential GEPP
  – So far, could not do partial pivoting and minimize #messages, just #words
  – Challenge:
    • Column layout good for choosing pivots, bad for matmul
    • Blocked layout good for matmul, bad for choosing pivots
  – Solution: use both layouts, switching between them
    • "Shape Morphing LU" or SMLU
• Plain recursive LU:
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        update right half of A
        factor(right half of A)
  – #Words = O(n^3/M^(1/2))
  – #Messages = O(n^3/M)
• Shape Morphing LU (SMLU):
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        reshape to recursive block format
        update right half of A
        reshape to columnwise format
        factor(right half of A)
  – #Words = O(n^3/M^(1/2))
  – #Messages = O(n^3/M^(3/2))
49
Other CA algorithms for Ax=b, least squares (3/3)
• Need for pivoting arises beyond LU, in QR
  – Choose permutation P so that leading columns of A·P = Q·R span column space of A – Rank Revealing QR (RRQR)
  – Usual approach, like Partial Pivoting:
    • Put longest column first, update rest of matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit lower bound
  – Use Tournament Pivoting (see sketch below)
    • Each round of tournament selects best b columns from two groups of b columns, either using usual approach or something better (Gu/Eisenstat)
    • Thm: This approach "reveals the rank" of A, in the sense that the leading r x r submatrix of R has singular values "near" the largest r singular values of A; ditto for trailing submatrix
  – Idea extends to other pivoting schemes
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDL^T with complete pivoting
50
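A sketch of tournament pivoting on columns, using SciPy's QR with column pivoting as the "usual approach" selector inside each round; the group size and the binary combining tree are my own illustrative choices.

    import numpy as np
    import scipy.linalg as sla

    def best_b_columns(A, cols, b):
        """Pick b 'best' columns among the candidates via QR with column pivoting."""
        _, _, perm = sla.qr(A[:, cols], pivoting=True, mode='economic')
        return cols[perm[:b]]

    def tournament_pivot_columns(A, b):
        """Tournament over column groups of size b: winners of pairs meet again
        until b columns remain -- the column analogue of tournament pivoting."""
        groups = np.array_split(np.arange(A.shape[1]), A.shape[1] // b)
        winners = [best_b_columns(A, g, b) for g in groups]
        while len(winners) > 1:
            winners = [best_b_columns(A, np.concatenate(winners[i:i+2]), b)
                       for i in range(0, len(winners), 2)]
        return winners[0]

    A = np.random.rand(100, 32)
    print(tournament_pivot_columns(A, b=4))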
Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity
What about sparse matrices? (1/3)
• If matrix quickly becomes dense, use dense algorithm
• Ex: All Pairs Shortest Path using Floyd-Warshall
• Similar to matmul: Let D = A, then
    for k = 1:n
      for i = 1:n
        for j = 1:n
          D(i,j) = min( D(i,j), D(i,k) + D(k,j) )
• But can't reorder outer loop for 2.5D, need another idea
• Abbreviate D(i,j) = min( D(i,j), mink( A(i,k) + B(k,j) ) ) by D = A⊙B
  – Dependencies ok, 2.5D works, just different semiring
• Kleene's Algorithm (see sketch below):
    D = DC-APSP(A, n)
      D = A
      Partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
      D11 = DC-APSP(D11, n/2)
      D12 = D11 ⊙ D12
      D21 = D21 ⊙ D11
      D22 = D21 ⊙ D12
      D22 = DC-APSP(D22, n/2)
      D21 = D22 ⊙ D21
      D12 = D12 ⊙ D22
      D11 = D12 ⊙ D21
52
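Both the semiring product and the DC-APSP recursion translate directly to NumPy; the sketch below follows the pseudocode above and checks it against Floyd-Warshall.

    import numpy as np

    def semiring_update(D, A, B):
        """D(i,j) = min( D(i,j), min_k( A(i,k) + B(k,j) ) ) -- the (min,+) 'D = A (*) B'."""
        return np.minimum(D, np.min(A[:, :, None] + B[None, :, :], axis=1))

    def dc_apsp(D):
        """Divide-and-conquer APSP (Kleene's algorithm), as on the slide."""
        n = D.shape[0]
        if n == 1:
            return D
        h = n // 2
        D = D.copy()
        D11, D12, D21, D22 = D[:h, :h], D[:h, h:], D[h:, :h], D[h:, h:]
        D[:h, :h] = D11 = dc_apsp(D11)
        D[:h, h:] = D12 = semiring_update(D12, D11, D12)
        D[h:, :h] = D21 = semiring_update(D21, D21, D11)
        D[h:, h:] = D22 = semiring_update(D22, D21, D12)
        D[h:, h:] = D22 = dc_apsp(D22)
        D[h:, :h] = D21 = semiring_update(D21, D22, D21)
        D[:h, h:] = D12 = semiring_update(D12, D12, D22)
        D[:h, :h] = semiring_update(D11, D12, D21)
        return D

    # Check against Floyd-Warshall on a small random distance matrix.
    n = 8
    A = np.random.rand(n, n) * 10
    np.fill_diagonal(A, 0)
    FW = A.copy()
    for k in range(n):
        FW = np.minimum(FW, FW[:, [k]] + FW[[k], :])
    assert np.allclose(dc_apsp(A), FW)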
Performance of 2.5D APSP using Kleene
[Strong-scaling plot on Hopper (Cray XE6 with 1024 nodes = 24,576 cores); annotations: "6.2x speedup", "2x speedup"]
53
What about sparse matrices? (2/3)
• If parts of matrix become dense, optimize those
• Ex: Cholesky on matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): If all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on a w x w submatrix
• Thm: Words_moved = Ω( w^3/M^(1/2) ), etc.
• Thm (George '73): Nested dissection gives optimal ordering for 2D grid, 3D grid, similar matrices
  – w = n for 2D n x n grid, w = n^2 for 3D n x n x n grid
• Sequential multifrontal Cholesky attains bounds
• PSPACES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)
54
What about sparse matrices? (3/3)
• If matrix stays very sparse, lower bound unattainable; new one:
• Ex: A, B both diagonal: no communication in parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), i.i.d.
• Assumption: Algorithm is sparsity-independent: assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
    Words_moved = Ω( min( d·n/P^(1/2), d^2·n/P ) )
  – Proof exploits fact that reuse of entries of C = A·B is unlikely
• Contrast general lower bound: Words_moved = Ω( d^2·n/(P·M^(1/2)) )
• Attained by divide-and-conquer algorithm that splits matrices along dimensions most likely to minimize cost
55
Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity
Symmetric Eigenproblem and SVD
• Usual approach for A = A^T (SVD similar)
  – A → Q^T·A·Q = T, where Q orthogonal, T tridiagonal
  – T → U^T·T·U = Λ, where U orthogonal, Λ diagonal
  – Q·U's columns are eigenvectors, Λ eigenvalues
  – Dense → Tridiagonal → Diagonal
  – Only half BLAS3, half BLAS2, in LAPACK's sytrd
• Communication-Avoiding Approach
  – A → Q·A·Q^T = B, where B = B^T banded, of bandwidth M^(1/2)
  – Continue as above, starting with B
  – Dense → Banded → Tridiagonal → Diagonal
  – Dense → Banded: use TSQR to zero out M^(1/2) cols/rows at a time
  – Banded → Tridiagonal: need new(ish) idea
Successive Band Reduction (Bischof/Lang/Sun)
[Animation over several slides: bulge-chasing sweeps 1–6 applied to a symmetric band matrix by orthogonal transforms Q1, Q1^T, Q2, Q2^T, …, Q5, Q5^T.
 Notation: b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b]

Conventional vs CA-SBR
  Conventional: touch all data 4 times
  Communication-Avoiding: touch all data once
Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
Lower bound for all ldquon3-likerdquo linear algebra
bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no
matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)
6
bull Let M = ldquofastrdquo memory size (per processor)
words_moved (per processor) = (flops (per processor) M12 )
messages_sent (per processor) = (flops (per processor) M32 )
bull Parallel case assume either load or memory balanced
Lower bound for all ldquon3-likerdquo linear algebra
bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no
matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)
7
bull Let M = ldquofastrdquo memory size (per processor)
words_moved (per processor) = (flops (per processor) M12 )
messages_sent ge words_moved largest_message_size
bull Parallel case assume either load or memory balanced
Lower bound for all ldquon3-likerdquo linear algebra
bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no
matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)
8
bull Let M = ldquofastrdquo memory size (per processor)
words_moved (per processor) = (flops (per processor) M12 )
messages_sent (per processor) = (flops (per processor) M32 )
bull Parallel case assume either load or memory balanced
SIAM SIAGLinear Algebra Prize 2012Ballard D Holtz Schwartz
Limits to parallel scaling (12)
bull Consider dense case flops_per_proc = n3Pndash Words = (n3(PM12 ))ndash Messages = (n3(PM32 ))
bull What is M Must be at least n2P to hold datandash Words = (n2P12 )ndash Messages = (P12 )
bull But if M fixed looks like perfect strong scaling in timendash Flops Words Messages all proportional to 1P
bull Ditto for energy if we count energy costs in joules hellipndash Per flop per word moved per messagendash Per word per second for data stored in memory Mndash Per second for leakage cooling hellip
bull How big can we make P and M
Limits to parallel scaling (22)
bull Consider dense case flops_per_proc = n3Pndash Words = (n3(PM12 ))ndash Messages = (n3(PM32 ))
bull How big can we make P and Mbull Assume we start with 1 copy of inputs A and B
ndash Otherwise no communication may be needed
bull Thm Words= (n2P23 ) independent of Mbull Reached when M = n2P23 too or P = n3M32 and Messages = (1) (log P in practice)bull Attained by 25D algorithm when c=P13 (ldquo3D algrdquo)bull Can keep increasing P until P = n3 Words = Messages = (1) (log n in practice)
Can we attain these lower bounds
bull Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these boundsndash Often not
bull If not are there other algorithms that dondash Yes for much of dense linear algebrandash New algorithms with new numerical properties
new ways to encode answers new data structures
ndash Not just loop transformations (need those too)bull Only a few sparse algorithms so farbull Lots of work in progress
ndash Algorithms Energy Heterogeneous Processors hellip11
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
25D Matrix Multiplication
bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid
c
(Pc)12
(Pc)12
Example P = 32 c = 2
25D Matrix Multiplication
bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid
k
j
iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12
(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)
(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)
(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)
25D Matmul on BGP 16K nodes 64K coresc = 16 copies
Distinguished Paper Award EuroParrsquo11 (Solomonik D)SCrsquo11 paper by Solomonik Bhatele D
12x faster
27x faster
Perfect Strong Scaling ndash in Time and Energy (12)
bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2
bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model
ndash γT βT αT = secs per flop per word_moved per message of size m
bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ]
= T(P)cbull Notation for energy model
ndash γE βE αE = joules for same operations
ndash δE = joules per word of memory used per sec
ndash εE = joules per sec for leakage etc
bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP)
= E(P)bull Perfect scaling extends to N-body Strassen hellip
Perfect Strong Scaling ndash in Time and Energy (22)
bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ] = T(P)c
bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP) = E(P)
bull Can use these formulas to answer many questions such asndash How to choose p and M to minimize energy E needed for computationndash Given max allowed runtime T what is minimum energy E needed to achieve
itndash Given max allowed energy E what is the minimum runtime T attainablendash Can we minimize the average power P = ETndash Given target energy efficiency what architectural parameters are needed to
achieve itbull Can we attain 75 GflopsWattbull Can we attain an exaflop for 20 MWatts
Handling Heterogeneity
bull Suppose each of P processors could differndash γi = secflop βi = secword αi = secmessage Mi = memory
bull What is optimal assignment of work Fi to minimize timendash Ti = Fi γi + Fi βi Mi
12 + Fi αi Mi32 = Fi [γi + βi Mi
12 + αi Mi32] = Fi ξi
ndash Choose Fi so Σi Fi = n3 and minimizing T = maxi Ti
ndash Answer Fi = n3(1ξi)Σj(1ξj) and T = n3Σj(1ξj)
bull Optimal Algorithm for nxn matmulndash Recursively divide into 8 half-sized subproblemsndash Assign subproblems to processor i to add up to Fi flops
bull Works for Strassen other algorithmshellip
Application to Tensor Contractions
bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply
bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory
bull Heavily used in electronic structure calculationsndash Ex NWChem
bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries
ndash Solomonik Hammond Matthews
C(ijk) = Σm A(ijm)B(mk)
A3-fold symm
B2-fold symm
C2-fold symm
Application to Tensor Contractions
bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply
bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory
bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn
bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews
Communication Lower Bounds for Strassen-like matmul algorithms
bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected
bull Extends up to M = n2 p2ω bull Extends to rectangular case multiply (mxn)(nxp) in q mults
ndash words_moved = Ω (flopsM^(logmpq -1))
bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz also in JACMbull Is the lower bound attainable
Classical O(n3) matmul
words_moved =Ω (M(nM12)3P)
Strassenrsquos O(nlg7) matmul
words_moved =Ω (M(nM12)lg7P)
Strassen-like O(nω) matmul
words_moved =Ω (M(nM12)ωP)
vs
Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory
Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory
CAPS If EnoughMemory and P 7 then BFS step else DFS step end if
Communication Avoiding Parallel Strassen (CAPS)
Best way to interleaveBFS and DFS is an tuning parameter
26
Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080
Speedups 24-184(over previous Strassen-based algorithms)
Invited to appear as Research Highlight in CACM
Strassen-like beyond matmul
bull Thm (D Dumitriu Holtzrsquo07) Any Strassen-like O(nω) matmul algorithm can be used to build a numerically stable O(nω+η) algorithm for any ηgt0 for Ax=b least squares eig SVD hellipndash ηgt0 needed to deal with numerical stabilityndash Strassen already stable so η=0
bull Thm For sequential versions of these algorithms Words_moved = O(nω+ηM(ω+η)2 ndash 1 + n2 log n) ie attain expected lower bound
Ballard D Holtz Schwartz
Cache and Network Oblivious Algorithms
bull Motivation Minimizes communication at every level of a hierarchical system without tuning parameters (in theory)ndash Not always 25D Matmul on BGP was topology aware
bull CAPS Divide-and-conquer choose BFS or DFS to adapt to processors available memory
bull CARMAndash Divide-and-conquer classical matmul divide largest of 3
dimensions to create two subproblemsndash Choose BFS or DFS to adapt to processors available memory
CARMA Performance Distributed Memory
Square m = k = n = 6144
ScaLAPACK
CARMA
Peak
(log)
(log)
Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA
CARMA Performance Distributed Memory
Inner Product m = n = 192 k = 6291456
ScaLAPACK
CARMAPeak
(log)
(log)
Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA
CARMA Performance Shared Memory
Square m = k = n
MKL (double)CARMA (double)
MKL (single)CARMA (single)
Peak (single)
Peak (double)
(log)
(linear)
Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA
CARMA Performance Shared Memory
Inner Product m = n = 64
MKL (double)
CARMA (double)
MKL (single)
CARMA (single)
(log)
(linear)
Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA
Why is CARMA Faster in Shared MemoryL3 Cache Misses
Shared Memory Inner Product (m = n = 64 k = 524288)
97 Fewer Misses
86 Fewer Misses
(linear)
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
One-sided Factorizations (LU QR) so farbull Classical Approach for i=1 to n update column i update trailing matrixbull words_moved = O(n3)
35
bull Blocked Approach (LAPACK) for i=1 to nb update block i of b columns update trailing matrixbull words moved = O(n3M13)
bull Recursive Approach func factor(A) if A has 1 column update it
else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)
bull None of these approaches minimizes messagesbull Parallel case Partial
Pivoting =gt n reductionsbull Need another idea
TSQR An Architecture-Dependent Algorithm
W =
W0
W1
W2
W3
R00
R10
R20
R30
R01
R11
R02Parallel
W =
W0
W1
W2
W3
R01R02
R00
R03
SequentialStreaming
W =
W0
W1
W2
W3
R00
R01
R01
R11
R02
R11
R03
Dual Core
Can choose reduction tree dynamically
Multicore Multisocket Multirack Multisite Out-of-core
Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo
Wnxb =
W1
W2
W3
W4
P1middotL1middotU1
P2middotL2middotU2
P3middotL3middotU3
P4middotL4middotU4
=
Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo
W1rsquoW2rsquoW3rsquoW4rsquo
P12middotL12middotU12
P34middotL34middotU34
=Choose b pivot rows call them W12rsquo
Choose b pivot rows call them W34rsquo
W12rsquoW34rsquo
= P1234middotL1234middotU1234
Choose b pivot rows
Go back to W and use these b pivot rows (move them to top do LU without pivoting)
37
Minimizing Communication in TSLU
W = W1
W2
W3
W4
LULULULU
LU
LULUParallel
W = W1
W2
W3
W4
LULU
LU
LUSequentialStreaming
W = W1
W2
W3
W4
LULU LU
LULU
LULU
Dual Core
Can choose reduction tree dynamically to match architecture as before
38
Making TSLU Numerically Stable
bull Details matterndash Going up the tree we could do LU either on original rows of A
(tournament pivoting) or computed rows of Undash Only tournament pivoting stable
bull ldquoThmrdquo New scheme as stable as Partial Pivoting (GEPP) in following sense Get same Schur complements as GEPP applied to different input matrix whose entries are blocks taken from input A
bull Why just a ldquoThmrdquo
39
Stability of LU using TSLU CALU
Summer School Lecture 4 40
bull Empirical testingndash Both random matrices and ldquospecial onesrdquondash Both binary tree (BCALU) and flat-tree (FCALU)ndash 3 metrics ||PA-LU||||A|| normwise and componentwise backward errorsndash See [D Grigori Xiang 2010] for details
Why is stability of TSLU just a ldquoThmrdquo
bull Proof is correct ndash in exact arithmeticbull Experiment
ndash Generate 100 random 6x6 rank 3 matrices in Matlabndash [LUP] = lu(A) do LU without pivoting on PA compare L factors are
they the samebull Compute || L ndash Lnp || A few 0rsquos A few infinrsquos a few NaNsbull Rest mostly O(1)
ndash Why Floating point is nonassociative doing arithmetic in different order gives different rounding errors
ndash Same experiment with rank 6 matrices || L ndash Lnp || usually nonzero O(macheps)
ndash Same experiment with 20x20 rank 4 matrices || L ndash Lnp || often O(103)
bull Much harder to break TSLU but possiblendash Occurred when using TSLU to factorize a low-rank subdiagonal
panel in symmetric-indefinite factorization41
Fixing TSLU
bull Run TSLU quickly test for stability fix if necessary (rare)
bull Test conditioning of U if not tiny (usual case) proceed elsebull Compute || L || if not big (usual case) proceed elsebull Factor A = QR using TSQR thenbull Factor Q = PLU using TSLU thenbull A = PL(UR) with UR as upper triangular factor
bull Last topic in lecture how to guarantee floating point reproducibility
42
2D CALU with Tournament Pivoting
43
25D CALU with Tournament Pivoting (c=4 copies)
44
Exascale Machine ParametersSource DOE Exascale Workshop
bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node
Exascale predicted speedupsfor Gaussian Elimination
2D CA-LU vs ScaLAPACK-LU
log2 (p)
log
2 (
n2p
) =
log
2 (m
emo
ry_p
er_p
roc)
Up to 29x
25D vs 2D LUWith and Without Pivoting
Other CA algorithms for Ax=b least squares(13)
bull A symmetric and indefinitendash Seek factorization that retains symmetry PAPT = LDLT D
ldquosimplerdquobull Save frac12 flops preserve inertia
ndash Usual approach Bunch-Kaufmanbull D block diagonal with 1x1 and 2x2 blocksbull Pivot search down column along row (lots of communication)
ndash Alternative Aasenbull D = tridiagonal = Tbull Two steps
ndash PAPT = LTLT where T is banded using TSLU
48
0 0
0
0 0
0
0
hellip
hellip
ndash Solvefactor narrow band problem with Tbull Up to 28x faster than MKL Best Paper at IPDPSrsquo13
Other CA algorithms for Ax=b least squares (23)bull Minimizing bandwidth and latency for sequential GEPP
ndash So far could not do partial pivoting and minimize messages just words
ndash Challengebull Column layout good for choosing pivots bad for matmulbull Blocked layout good for matmul bad for choosing pivots
ndash Solution use both layouts switching between thembull ldquoShape Morphing LUrdquo or SMLU
49
bull func factor(A) if A has 1 column update it else factor(left half of A)
update right half of A
factor(right half of A)
bull Words = O(n3M12)
bull Messages = O(n3M)
bull func factor(A) if A has 1 column update it else factor(left half of A) reshape to recursive block format update right half of A reshape to columnwise format factor(right half of A)
bull Words = O(n3M12)
bull Messages = O(n3M32)
Other CA algorithms for Ax=b least squares (33)bull Need for pivoting arises beyond LU in QR
ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)
ndash Usual approach like Partial Pivoting
bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound
ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two
groups of b columns either using usual approach or something better (GuEisenstat)
bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix
ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 50
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
What about sparse matrices (13)
bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then
bull But canrsquot reorder outer loop for 25D need another idea
bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring
bull Kleenersquos Algorithm
52
for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)
D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21
Performance of 25D APSP using Kleene
53
Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)
62xspeedup
2x speedup
What about sparse matrices (23)
bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of
G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix
bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering
for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid
bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse
multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on
separators) 54
What about sparse matrices (33)
bull If matrix stays very sparse lower bound unattainable new one
bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of
data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely
bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices
along dimensions most likely to minimize cost
55
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
Symmetric Eigenproblem and SVD
bull Usual approach for A=AT (SVD similar)ndash A QTAQ = T where Q orthogonal T tridiagonalndash T UTTU = Λ where U orthogonal Λ diagonalndash QUrsquos columns are eigenvectors Λ eigenvaluesndash Dense Tridiagonal Diagonalndash Only half BLAS3 half BLAS2 in LAPACKrsquos sytrd
bull Communication-Avoiding Approachndash A QAQT = B where B=BT banded of bandwidth M12
ndash Continue as above starting with Bndash Dense Banded Tridiagonal Diagonalndash Dense Banded use TSQR to zero out M12 colsrows at a timendash Banded Tridiagonal need new(ish) idea
b+1
b+1
Successive Band Reduction (BischofLangSun)
1
b+1
b+1
d+1
c
Successive Band Reduction (BischofLangSun)
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1Q1
b+1
b+1
d+1
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
12
Q1
b+1
b+1
d+1
d+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
12
Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
d+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
Q1
Q1T
Q2
Q2T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
Q1
Q1T
Q2
Q2T
Q3
Q3T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
6
6
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
Conventional vs CA - SBR
Conventional Communication-Avoiding
Touch all data 4 times Touch all data once
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
Lower bound for all ldquon3-likerdquo linear algebra
bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no
matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)
6
bull Let M = ldquofastrdquo memory size (per processor)
words_moved (per processor) = (flops (per processor) M12 )
messages_sent (per processor) = (flops (per processor) M32 )
bull Parallel case assume either load or memory balanced
Lower bound for all ldquon3-likerdquo linear algebra
bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no
matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)
7
bull Let M = ldquofastrdquo memory size (per processor)
words_moved (per processor) = (flops (per processor) M12 )
messages_sent ge words_moved largest_message_size
bull Parallel case assume either load or memory balanced
Lower bound for all ldquon3-likerdquo linear algebra
bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no
matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)
8
bull Let M = ldquofastrdquo memory size (per processor)
words_moved (per processor) = (flops (per processor) M12 )
messages_sent (per processor) = (flops (per processor) M32 )
bull Parallel case assume either load or memory balanced
SIAM SIAGLinear Algebra Prize 2012Ballard D Holtz Schwartz
Limits to parallel scaling (12)
bull Consider dense case flops_per_proc = n3Pndash Words = (n3(PM12 ))ndash Messages = (n3(PM32 ))
bull What is M Must be at least n2P to hold datandash Words = (n2P12 )ndash Messages = (P12 )
bull But if M fixed looks like perfect strong scaling in timendash Flops Words Messages all proportional to 1P
bull Ditto for energy if we count energy costs in joules hellipndash Per flop per word moved per messagendash Per word per second for data stored in memory Mndash Per second for leakage cooling hellip
bull How big can we make P and M
Limits to parallel scaling (22)
bull Consider dense case flops_per_proc = n3Pndash Words = (n3(PM12 ))ndash Messages = (n3(PM32 ))
bull How big can we make P and Mbull Assume we start with 1 copy of inputs A and B
ndash Otherwise no communication may be needed
bull Thm Words= (n2P23 ) independent of Mbull Reached when M = n2P23 too or P = n3M32 and Messages = (1) (log P in practice)bull Attained by 25D algorithm when c=P13 (ldquo3D algrdquo)bull Can keep increasing P until P = n3 Words = Messages = (1) (log n in practice)
Can we attain these lower bounds
bull Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these boundsndash Often not
bull If not are there other algorithms that dondash Yes for much of dense linear algebrandash New algorithms with new numerical properties
new ways to encode answers new data structures
ndash Not just loop transformations (need those too)bull Only a few sparse algorithms so farbull Lots of work in progress
ndash Algorithms Energy Heterogeneous Processors hellip11
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
25D Matrix Multiplication
bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid
c
(Pc)12
(Pc)12
Example P = 32 c = 2
25D Matrix Multiplication
bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid
k
j
iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12
(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)
(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)
(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)
25D Matmul on BGP 16K nodes 64K coresc = 16 copies
Distinguished Paper Award EuroParrsquo11 (Solomonik D)SCrsquo11 paper by Solomonik Bhatele D
12x faster
27x faster
Perfect Strong Scaling ndash in Time and Energy (12)
bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2
bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model
ndash γT βT αT = secs per flop per word_moved per message of size m
bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ]
= T(P)cbull Notation for energy model
ndash γE βE αE = joules for same operations
ndash δE = joules per word of memory used per sec
ndash εE = joules per sec for leakage etc
bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP)
= E(P)bull Perfect scaling extends to N-body Strassen hellip
Perfect Strong Scaling ndash in Time and Energy (22)
bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ] = T(P)c
bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP) = E(P)
bull Can use these formulas to answer many questions such asndash How to choose p and M to minimize energy E needed for computationndash Given max allowed runtime T what is minimum energy E needed to achieve
itndash Given max allowed energy E what is the minimum runtime T attainablendash Can we minimize the average power P = ETndash Given target energy efficiency what architectural parameters are needed to
achieve itbull Can we attain 75 GflopsWattbull Can we attain an exaflop for 20 MWatts
Handling Heterogeneity
bull Suppose each of P processors could differndash γi = secflop βi = secword αi = secmessage Mi = memory
bull What is optimal assignment of work Fi to minimize timendash Ti = Fi γi + Fi βi Mi
12 + Fi αi Mi32 = Fi [γi + βi Mi
12 + αi Mi32] = Fi ξi
ndash Choose Fi so Σi Fi = n3 and minimizing T = maxi Ti
ndash Answer Fi = n3(1ξi)Σj(1ξj) and T = n3Σj(1ξj)
bull Optimal Algorithm for nxn matmulndash Recursively divide into 8 half-sized subproblemsndash Assign subproblems to processor i to add up to Fi flops
bull Works for Strassen other algorithmshellip
Application to Tensor Contractions
bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply
bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory
bull Heavily used in electronic structure calculationsndash Ex NWChem
bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries
ndash Solomonik Hammond Matthews
C(ijk) = Σm A(ijm)B(mk)
A3-fold symm
B2-fold symm
C2-fold symm
Application to Tensor Contractions
bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply
bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory
bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn
bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews
Communication Lower Bounds for Strassen-like matmul algorithms
bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected
bull Extends up to M = n2 p2ω bull Extends to rectangular case multiply (mxn)(nxp) in q mults
ndash words_moved = Ω (flopsM^(logmpq -1))
bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz also in JACMbull Is the lower bound attainable
Classical O(n3) matmul
words_moved =Ω (M(nM12)3P)
Strassenrsquos O(nlg7) matmul
words_moved =Ω (M(nM12)lg7P)
Strassen-like O(nω) matmul
words_moved =Ω (M(nM12)ωP)
vs
Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory
Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory
CAPS If EnoughMemory and P 7 then BFS step else DFS step end if
Communication Avoiding Parallel Strassen (CAPS)
Best way to interleaveBFS and DFS is an tuning parameter
26
Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080
Speedups 24-184(over previous Strassen-based algorithms)
Invited to appear as Research Highlight in CACM
Strassen-like beyond matmul
bull Thm (D Dumitriu Holtzrsquo07) Any Strassen-like O(nω) matmul algorithm can be used to build a numerically stable O(nω+η) algorithm for any ηgt0 for Ax=b least squares eig SVD hellipndash ηgt0 needed to deal with numerical stabilityndash Strassen already stable so η=0
bull Thm For sequential versions of these algorithms Words_moved = O(nω+ηM(ω+η)2 ndash 1 + n2 log n) ie attain expected lower bound
Ballard D Holtz Schwartz
Cache and Network Oblivious Algorithms
bull Motivation Minimizes communication at every level of a hierarchical system without tuning parameters (in theory)ndash Not always 25D Matmul on BGP was topology aware
bull CAPS Divide-and-conquer choose BFS or DFS to adapt to processors available memory
bull CARMAndash Divide-and-conquer classical matmul divide largest of 3
dimensions to create two subproblemsndash Choose BFS or DFS to adapt to processors available memory
CARMA Performance Distributed Memory
Square m = k = n = 6144
ScaLAPACK
CARMA
Peak
(log)
(log)
Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA
CARMA Performance Distributed Memory
Inner Product m = n = 192 k = 6291456
ScaLAPACK
CARMAPeak
(log)
(log)
Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA
CARMA Performance Shared Memory
Square m = k = n
MKL (double)CARMA (double)
MKL (single)CARMA (single)
Peak (single)
Peak (double)
(log)
(linear)
Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA
CARMA Performance Shared Memory
Inner Product m = n = 64
MKL (double)
CARMA (double)
MKL (single)
CARMA (single)
(log)
(linear)
Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA
Why is CARMA Faster in Shared MemoryL3 Cache Misses
Shared Memory Inner Product (m = n = 64 k = 524288)
97 Fewer Misses
86 Fewer Misses
(linear)
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
One-sided Factorizations (LU QR) so farbull Classical Approach for i=1 to n update column i update trailing matrixbull words_moved = O(n3)
35
bull Blocked Approach (LAPACK) for i=1 to nb update block i of b columns update trailing matrixbull words moved = O(n3M13)
bull Recursive Approach func factor(A) if A has 1 column update it
else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)
bull None of these approaches minimizes messagesbull Parallel case Partial
Pivoting =gt n reductionsbull Need another idea
TSQR An Architecture-Dependent Algorithm
W =
W0
W1
W2
W3
R00
R10
R20
R30
R01
R11
R02Parallel
W =
W0
W1
W2
W3
R01R02
R00
R03
SequentialStreaming
W =
W0
W1
W2
W3
R00
R01
R01
R11
R02
R11
R03
Dual Core
Can choose reduction tree dynamically
Multicore Multisocket Multirack Multisite Out-of-core
Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo
Wnxb =
W1
W2
W3
W4
P1middotL1middotU1
P2middotL2middotU2
P3middotL3middotU3
P4middotL4middotU4
=
Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo
W1rsquoW2rsquoW3rsquoW4rsquo
P12middotL12middotU12
P34middotL34middotU34
=Choose b pivot rows call them W12rsquo
Choose b pivot rows call them W34rsquo
W12rsquoW34rsquo
= P1234middotL1234middotU1234
Choose b pivot rows
Go back to W and use these b pivot rows (move them to top do LU without pivoting)
37
Minimizing Communication in TSLU
[figure: the same reduction trees as for TSQR, with an LU factorization at each node]
  – Parallel: binary tree of LUs
  – Sequential/Streaming: flat tree of LUs
  – Dual Core: hybrid tree
Can choose the reduction tree dynamically to match the architecture, as before
Making TSLU Numerically Stable
• Details matter
  – Going up the tree, we could do LU either on the original rows of A (tournament pivoting) or on the computed rows of U
  – Only tournament pivoting is stable
• "Thm": The new scheme is as stable as Partial Pivoting (GEPP) in the following sense: it gets the same Schur complements as GEPP applied to a different input matrix whose entries are blocks taken from the input A
• Why just a "Thm"?
Stability of LU using TSLU: CALU
• Empirical testing
  – Both random matrices and "special ones"
  – Both binary tree (BCALU) and flat tree (FCALU)
  – 3 metrics: ||PA – LU|| / ||A||, normwise and componentwise backward errors
  – See [D., Grigori, Xiang, 2010] for details
Why is stability of TSLU just a "Thm"?
• Proof is correct – in exact arithmetic
• Experiment (reproduced in the sketch below):
  – Generate 100 random 6x6, rank-3 matrices in Matlab
  – [L,U,P] = lu(A); do LU without pivoting on P*A; compare L factors: are they the same?
    • Compute || L – Lnp ||: a few 0's, a few ∞'s, a few NaNs
    • Rest mostly O(1)
  – Why? Floating point is nonassociative; doing arithmetic in a different order gives different rounding errors
  – Same experiment with rank-6 matrices: || L – Lnp || usually nonzero, O(macheps)
  – Same experiment with 20x20, rank-4 matrices: || L – Lnp || often O(10^3)
• Much harder to break TSLU, but possible
  – Occurred when using TSLU to factorize a low-rank subdiagonal panel in a symmetric-indefinite factorization
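The experiment is easy to reproduce. Here is a rough Python/SciPy version of it: scipy.linalg.lu plays the role of Matlab's lu, and lu_nopivot is a textbook no-pivoting LU written out for the comparison.

    import numpy as np
    import scipy.linalg as sla

    def lu_nopivot(A):
        # Textbook LU without pivoting; breaks down when a leading pivot is (near) zero.
        A = A.astype(float).copy()
        n = A.shape[0]
        L = np.eye(n)
        for k in range(n - 1):
            L[k+1:, k] = A[k+1:, k] / A[k, k]
            A[k+1:, k:] -= np.outer(L[k+1:, k], A[k, k:])
        return L

    rng = np.random.default_rng(0)
    diffs = []
    with np.errstate(divide='ignore', invalid='ignore'):
        for _ in range(100):
            A = rng.standard_normal((6, 3)) @ rng.standard_normal((3, 6))   # random 6x6, rank 3
            P, L, U = sla.lu(A)                 # GEPP: A = P @ L @ U
            Lnp = lu_nopivot(P.T @ A)           # LU without pivoting on the permuted matrix
            diffs.append(np.linalg.norm(L - Lnp))
    print(np.array(diffs)[:10])   # values range from ~0 and O(1) up to huge, inf and nan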
Fixing TSLU
• Run TSLU, quickly test for stability, fix if necessary (rare)
  – Test conditioning of U; if not tiny (usual case), proceed, else
  – Compute || L ||; if not big (usual case), proceed, else
  – Factor A = QR using TSQR, then
  – Factor Q = PLU using TSLU, then
  – A = P·L·(U·R), with U·R as the upper triangular factor
• Last topic in lecture: how to guarantee floating point reproducibility
2D CALU with Tournament Pivoting [figure]

2.5D CALU with Tournament Pivoting (c = 4 copies) [figure]
Exascale Machine Parameters (Source: DOE Exascale Workshop)
• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node
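As a back-of-the-envelope consequence of these numbers (8-byte words assumed, which is my assumption rather than the slide's), the latency × bandwidth products show how many words must travel in each message before bandwidth, rather than latency, dominates:

    # Rough balance calculation from the parameters above (8-byte words assumed).
    interconnect_bw  = 100e9    # bytes/sec
    interconnect_lat = 1e-6     # sec per message
    dram_bw  = 400e9            # bytes/sec
    dram_lat = 50e-9            # sec per access

    # Bytes (and ~8-byte words) that could stream by during one latency period:
    print(interconnect_bw * interconnect_lat)   # 1.0e5 bytes, ~12,500 words per message latency
    print(dram_bw * dram_lat)                   # 2.0e4 bytes, ~2,500 words per memory latency
    # Unless messages carry on the order of 10^4-10^5 bytes, latency dominates,
    # so minimizing #messages matters as much as minimizing #words moved.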
Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU
[figure: predicted speedup over a grid of log2(p) vs log2(n^2/p) = log2(memory_per_proc); up to 29x]
2.5D vs 2D LU, With and Without Pivoting [figure]
Other CA algorithms for Ax=b, least squares (1/3)
• A symmetric and indefinite
  – Seek a factorization that retains symmetry: P·A·P^T = L·D·L^T, with D "simple"
    • Saves half the flops, preserves inertia
  – Usual approach: Bunch-Kaufman (illustrated below)
    • D block diagonal with 1x1 and 2x2 blocks
    • Pivot search down column, along row (lots of communication)
  – Alternative: Aasen
    • D = tridiagonal = T
    • Two steps:
      – P·A·P^T = L·T·L^T where T is banded, using TSLU [figure: band matrix]
      – Solve/factor the narrow band problem with T
    • Up to 2.8x faster than MKL; Best Paper at IPDPS'13
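For a concrete look at the "usual approach", the snippet below runs SciPy's symmetric-indefinite LDL^T routine (Bunch-Kaufman-style pivoting) on a small matrix chosen with a zero diagonal so that a 2x2 pivot block must appear in D. This only illustrates the shape of the factorization, not the communication-avoiding Aasen variant.

    import numpy as np
    from scipy.linalg import ldl

    # Symmetric indefinite matrix with a zero diagonal: a 2x2 pivot block is forced.
    A = np.array([[0., 1., 2.],
                  [1., 0., 3.],
                  [2., 3., 0.]])
    lu, d, perm = ldl(A)                    # SciPy's symmetric-indefinite LDL^T
    assert np.allclose(lu @ d @ lu.T, A)    # A = L D L^T (permutation folded into lu)
    print(np.round(d, 3))                   # D is block diagonal with 1x1 and 2x2 blocks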
Other CA algorithms for Ax=b, least squares (2/3)
• Minimizing bandwidth and latency for sequential GEPP
  – So far, could not do partial pivoting and minimize #messages, just #words
  – Challenge:
    • Column layout good for choosing pivots, bad for matmul
    • Blocked layout good for matmul, bad for choosing pivots
  – Solution: use both layouts, switching between them
    • "Shape Morphing LU", or SMLU

Recursive LU (one layout):
    func factor(A)
        if A has 1 column, update it
        else
            factor(left half of A)
            update right half of A
            factor(right half of A)
• Words = O(n^3 / M^(1/2))
• Messages = O(n^3 / M)

Shape Morphing LU (SMLU):
    func factor(A)
        if A has 1 column, update it
        else
            factor(left half of A)
            reshape to recursive block format
            update right half of A
            reshape to columnwise format
            factor(right half of A)
• Words = O(n^3 / M^(1/2))
• Messages = O(n^3 / M^(3/2))
Other CA algorithms for Ax=b, least squares (3/3)
• Need for pivoting arises beyond LU, in QR
  – Choose permutation P so that the leading columns of A·P = Q·R span the column space of A – Rank-Revealing QR (RRQR)
  – Usual approach, like Partial Pivoting:
    • Put the longest column first, update the rest of the matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit the lower bound
  – Use Tournament Pivoting (sketch below):
    • Each round of the tournament selects the best b columns from two groups of b columns, either using the usual approach or something better (Gu/Eisenstat)
    • Thm: This approach "reveals the rank" of A in the sense that the leading r x r submatrix of R has singular values "near" the largest r singular values of A; ditto for the trailing submatrix
  – Idea extends to other pivoting schemes:
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDL^T with complete pivoting
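A sketch of tournament pivoting for column selection, using QR with column pivoting (the "usual approach") as the local selector at each round; the Gu/Eisenstat refinement is not implemented, and the function name is mine.

    import numpy as np
    import scipy.linalg as sla

    def tournament_pivot_cols(A, b):
        # Each round runs QR with column pivoting on at most 2b candidate columns
        # and keeps the b columns it ranks first, up a binary reduction tree.
        n = A.shape[1]
        groups = [np.arange(j, min(j + 2*b, n)) for j in range(0, n, 2*b)]
        groups = [idx[sla.qr(A[:, idx], mode='economic', pivoting=True)[2][:b]]
                  for idx in groups]
        while len(groups) > 1:
            pairs  = [np.concatenate(groups[i:i+2]) for i in range(0, len(groups), 2)]
            groups = [idx[sla.qr(A[:, idx], mode='economic', pivoting=True)[2][:b]]
                      for idx in pairs]
        return groups[0]

    A = np.random.randn(200, 64)
    print(tournament_pivot_cols(A, 8))   # b candidate leading columns for a rank-revealing QR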
Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity
What about sparse matrices? (1/3)
• If the matrix quickly becomes dense, use a dense algorithm
• Ex: All-Pairs Shortest Path using Floyd-Warshall
• Similar to matmul: let D = A, then

    for k = 1:n
      for i = 1:n
        for j = 1:n
          D(i,j) = min(D(i,j), D(i,k) + D(k,j))

• But can't reorder the outer loop for 2.5D; need another idea
• Abbreviate D(i,j) = min(D(i,j), min_k(A(i,k) + B(k,j))) by D = A*B
  – Dependencies OK, 2.5D works, just a different semiring
• Kleene's Algorithm (a runnable sketch follows), where * is the min-plus product/update defined above:

    D = DC-APSP(A, n)
      D = A
      Partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
      D11 = DC-APSP(D11, n/2)
      D12 = D11 * D12
      D21 = D21 * D11
      D22 = D21 * D12
      D22 = DC-APSP(D22, n/2)
      D21 = D22 * D21
      D12 = D12 * D22
      D11 = D12 * D21
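A compact NumPy version of the recursion above, with the min-plus product written out explicitly and a direct Floyd-Warshall loop used as a check. The names minplus and dc_apsp are mine; this is a serial sketch, not the 2.5D implementation.

    import numpy as np

    def minplus(D, A, B):
        # D(i,j) = min(D(i,j), min_k A(i,k) + B(k,j)): the "matmul" of the min-plus semiring.
        return np.minimum(D, np.min(A[:, :, None] + B[None, :, :], axis=1))

    def dc_apsp(A):
        # Recursive (Kleene-style) all-pairs shortest paths; A is an n x n cost
        # matrix with np.inf for missing edges and no negative cycles.
        n = A.shape[0]
        if n == 1:
            return np.minimum(A, 0.0)
        m = n // 2
        D = A.copy()
        D[:m, :m] = dc_apsp(D[:m, :m])
        D[:m, m:] = minplus(D[:m, m:], D[:m, :m], D[:m, m:])
        D[m:, :m] = minplus(D[m:, :m], D[m:, :m], D[:m, :m])
        D[m:, m:] = minplus(D[m:, m:], D[m:, :m], D[:m, m:])
        D[m:, m:] = dc_apsp(D[m:, m:])
        D[m:, :m] = minplus(D[m:, :m], D[m:, m:], D[m:, :m])
        D[:m, m:] = minplus(D[:m, m:], D[:m, m:], D[m:, m:])
        D[:m, :m] = minplus(D[:m, :m], D[:m, m:], D[m:, :m])
        return D

    # quick check against the triply nested Floyd-Warshall loop above
    n = 8
    G = np.where(np.random.rand(n, n) < 0.4, np.random.rand(n, n), np.inf)
    np.fill_diagonal(G, 0.0)
    FW = G.copy()
    for k in range(n):
        FW = np.minimum(FW, FW[:, [k]] + FW[[k], :])
    assert np.allclose(dc_apsp(G), FW)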
Performance of 2.5D APSP using Kleene
[figure: strong scaling on Hopper (Cray XE6 with 1024 nodes = 24,576 cores); annotations: 6.2x speedup, 2x speedup]
What about sparse matrices? (2/3)
• If parts of the matrix become dense, optimize those
• Ex: Cholesky on a matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): If all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on a w x w submatrix
• Thm: Words_moved = Ω(w^3 / M^(1/2)), etc.
• Thm (George '73): Nested dissection gives an optimal ordering for 2D grids, 3D grids, similar matrices
  – w = n for a 2D n x n grid; w = n^2 for a 3D n x n x n grid
• Sequential multifrontal Cholesky attains the bounds
• PSPASES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains the 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)
What about sparse matrices? (3/3)
• If the matrix stays very sparse, the lower bound is unattainable; need a new one
• Ex: A, B both diagonal: no communication in the parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), i.i.d.
• Assumption: the algorithm is sparsity-independent: the assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
    Words_moved = Ω( min( d·n / P^(1/2), d^2·n / P ) )
  – Proof exploits the fact that reuse of entries of C = A·B is unlikely
• Contrast the general lower bound: Words_moved = Ω( d^2·n / (P·M^(1/2)) )
• Attained by a divide-and-conquer algorithm that splits the matrices along the dimensions most likely to minimize cost
Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity
Symmetric Eigenproblem and SVD
• Usual approach for A = A^T (SVD similar):
  – A -> Q^T·A·Q = T, where Q orthogonal, T tridiagonal
  – T -> U^T·T·U = Λ, where U orthogonal, Λ diagonal
  – (Q·U)'s columns are the eigenvectors, Λ the eigenvalues
  – Dense -> Tridiagonal -> Diagonal
  – Only half BLAS3, half BLAS2, in LAPACK's sytrd
• Communication-Avoiding Approach:
  – A -> Q·A·Q^T = B, where B = B^T banded, of bandwidth M^(1/2)
  – Continue as above, starting with B
  – Dense -> Banded -> Tridiagonal -> Diagonal
  – Dense -> Banded: use TSQR to zero out M^(1/2) cols/rows at a time (see the sketch below)
  – Banded -> Tridiagonal: need a new(ish) idea
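A small NumPy sketch of the Dense -> Banded step: each panel QR below is the tall-skinny factorization where TSQR would be used, and applying Q from both sides preserves the eigenvalues while leaving a symmetric band of semi-bandwidth b. The function name is mine, and the sketch ignores the symmetry savings and the blocked/compact application of Q that a real implementation would exploit.

    import numpy as np

    def full_to_band(A, b):
        # Reduce a dense symmetric matrix to symmetric band form (semi-bandwidth b)
        # with two-sided orthogonal transformations.
        A = A.copy()
        n = A.shape[0]
        for j in range(0, n - b - 1, b):                          # panel of b columns starting at j
            Q, _ = np.linalg.qr(A[j+b:, j:j+b], mode='complete')  # tall-skinny QR (TSQR here)
            A[j+b:, :] = Q.T @ A[j+b:, :]                         # apply Q^T from the left ...
            A[:, j+b:] = A[:, j+b:] @ Q                           # ... and Q from the right
        return A

    # sanity check: band structure and unchanged eigenvalues
    n, b = 64, 8
    A = np.random.randn(n, n); A = (A + A.T) / 2
    B = full_to_band(A, b)
    assert np.allclose(np.tril(B, -(b + 1)), 0, atol=1e-10)
    assert np.allclose(np.linalg.eigvalsh(B), np.linalg.eigvalsh(A))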
Successive Band Reduction (Bischof/Lang/Sun)
[sequence of build figures: starting from a band of bandwidth b, annihilate d diagonals, c columns at a time, and chase the resulting fill (bulge) down the band with two-sided orthogonal sweeps Q1, Q1^T, Q2, Q2^T, ..., Q5, Q5^T; legend on each frame: b = bandwidth, c = #columns, d = #diagonals, constraint: c + d ≤ b]
Conventional vs CA-SBR
  Conventional: touch all data 4 times
  Communication-Avoiding: touch all data once
bull Thm For sequential versions of these algorithms Words_moved = O(nω+ηM(ω+η)2 ndash 1 + n2 log n) ie attain expected lower bound
Ballard D Holtz Schwartz
Cache and Network Oblivious Algorithms
bull Motivation Minimizes communication at every level of a hierarchical system without tuning parameters (in theory)ndash Not always 25D Matmul on BGP was topology aware
bull CAPS Divide-and-conquer choose BFS or DFS to adapt to processors available memory
bull CARMAndash Divide-and-conquer classical matmul divide largest of 3
dimensions to create two subproblemsndash Choose BFS or DFS to adapt to processors available memory
CARMA Performance Distributed Memory
Square m = k = n = 6144
ScaLAPACK
CARMA
Peak
(log)
(log)
Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA
CARMA Performance Distributed Memory
Inner Product m = n = 192 k = 6291456
ScaLAPACK
CARMAPeak
(log)
(log)
Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA
CARMA Performance Shared Memory
Square m = k = n
MKL (double)CARMA (double)
MKL (single)CARMA (single)
Peak (single)
Peak (double)
(log)
(linear)
Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA
CARMA Performance Shared Memory
Inner Product m = n = 64
MKL (double)
CARMA (double)
MKL (single)
CARMA (single)
(log)
(linear)
Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA
Why is CARMA Faster in Shared MemoryL3 Cache Misses
Shared Memory Inner Product (m = n = 64 k = 524288)
97 Fewer Misses
86 Fewer Misses
(linear)
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
One-sided Factorizations (LU QR) so farbull Classical Approach for i=1 to n update column i update trailing matrixbull words_moved = O(n3)
35
bull Blocked Approach (LAPACK) for i=1 to nb update block i of b columns update trailing matrixbull words moved = O(n3M13)
bull Recursive Approach func factor(A) if A has 1 column update it
else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)
bull None of these approaches minimizes messagesbull Parallel case Partial
Pivoting =gt n reductionsbull Need another idea
TSQR An Architecture-Dependent Algorithm
W =
W0
W1
W2
W3
R00
R10
R20
R30
R01
R11
R02Parallel
W =
W0
W1
W2
W3
R01R02
R00
R03
SequentialStreaming
W =
W0
W1
W2
W3
R00
R01
R01
R11
R02
R11
R03
Dual Core
Can choose reduction tree dynamically
Multicore Multisocket Multirack Multisite Out-of-core
Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo
Wnxb =
W1
W2
W3
W4
P1middotL1middotU1
P2middotL2middotU2
P3middotL3middotU3
P4middotL4middotU4
=
Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo
W1rsquoW2rsquoW3rsquoW4rsquo
P12middotL12middotU12
P34middotL34middotU34
=Choose b pivot rows call them W12rsquo
Choose b pivot rows call them W34rsquo
W12rsquoW34rsquo
= P1234middotL1234middotU1234
Choose b pivot rows
Go back to W and use these b pivot rows (move them to top do LU without pivoting)
37
Minimizing Communication in TSLU
W = W1
W2
W3
W4
LULULULU
LU
LULUParallel
W = W1
W2
W3
W4
LULU
LU
LUSequentialStreaming
W = W1
W2
W3
W4
LULU LU
LULU
LULU
Dual Core
Can choose reduction tree dynamically to match architecture as before
38
Making TSLU Numerically Stable
bull Details matterndash Going up the tree we could do LU either on original rows of A
(tournament pivoting) or computed rows of Undash Only tournament pivoting stable
bull ldquoThmrdquo New scheme as stable as Partial Pivoting (GEPP) in following sense Get same Schur complements as GEPP applied to different input matrix whose entries are blocks taken from input A
bull Why just a ldquoThmrdquo
39
Stability of LU using TSLU CALU
Summer School Lecture 4 40
bull Empirical testingndash Both random matrices and ldquospecial onesrdquondash Both binary tree (BCALU) and flat-tree (FCALU)ndash 3 metrics ||PA-LU||||A|| normwise and componentwise backward errorsndash See [D Grigori Xiang 2010] for details
Why is stability of TSLU just a ldquoThmrdquo
bull Proof is correct ndash in exact arithmeticbull Experiment
ndash Generate 100 random 6x6 rank 3 matrices in Matlabndash [LUP] = lu(A) do LU without pivoting on PA compare L factors are
they the samebull Compute || L ndash Lnp || A few 0rsquos A few infinrsquos a few NaNsbull Rest mostly O(1)
ndash Why Floating point is nonassociative doing arithmetic in different order gives different rounding errors
ndash Same experiment with rank 6 matrices || L ndash Lnp || usually nonzero O(macheps)
ndash Same experiment with 20x20 rank 4 matrices || L ndash Lnp || often O(103)
bull Much harder to break TSLU but possiblendash Occurred when using TSLU to factorize a low-rank subdiagonal
panel in symmetric-indefinite factorization41
Fixing TSLU
bull Run TSLU quickly test for stability fix if necessary (rare)
bull Test conditioning of U if not tiny (usual case) proceed elsebull Compute || L || if not big (usual case) proceed elsebull Factor A = QR using TSQR thenbull Factor Q = PLU using TSLU thenbull A = PL(UR) with UR as upper triangular factor
bull Last topic in lecture how to guarantee floating point reproducibility
42
2D CALU with Tournament Pivoting
43
25D CALU with Tournament Pivoting (c=4 copies)
44
Exascale Machine ParametersSource DOE Exascale Workshop
bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node
Exascale predicted speedupsfor Gaussian Elimination
2D CA-LU vs ScaLAPACK-LU
log2 (p)
log
2 (
n2p
) =
log
2 (m
emo
ry_p
er_p
roc)
Up to 29x
25D vs 2D LUWith and Without Pivoting
Other CA algorithms for Ax=b least squares(13)
bull A symmetric and indefinitendash Seek factorization that retains symmetry PAPT = LDLT D
ldquosimplerdquobull Save frac12 flops preserve inertia
ndash Usual approach Bunch-Kaufmanbull D block diagonal with 1x1 and 2x2 blocksbull Pivot search down column along row (lots of communication)
ndash Alternative Aasenbull D = tridiagonal = Tbull Two steps
ndash PAPT = LTLT where T is banded using TSLU
48
0 0
0
0 0
0
0
hellip
hellip
ndash Solvefactor narrow band problem with Tbull Up to 28x faster than MKL Best Paper at IPDPSrsquo13
Other CA algorithms for Ax=b least squares (23)bull Minimizing bandwidth and latency for sequential GEPP
ndash So far could not do partial pivoting and minimize messages just words
ndash Challengebull Column layout good for choosing pivots bad for matmulbull Blocked layout good for matmul bad for choosing pivots
ndash Solution use both layouts switching between thembull ldquoShape Morphing LUrdquo or SMLU
49
bull func factor(A) if A has 1 column update it else factor(left half of A)
update right half of A
factor(right half of A)
bull Words = O(n3M12)
bull Messages = O(n3M)
bull func factor(A) if A has 1 column update it else factor(left half of A) reshape to recursive block format update right half of A reshape to columnwise format factor(right half of A)
bull Words = O(n3M12)
bull Messages = O(n3M32)
Other CA algorithms for Ax=b least squares (33)bull Need for pivoting arises beyond LU in QR
ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)
ndash Usual approach like Partial Pivoting
bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound
ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two
groups of b columns either using usual approach or something better (GuEisenstat)
bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix
ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 50
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
What about sparse matrices (13)
bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then
bull But canrsquot reorder outer loop for 25D need another idea
bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring
bull Kleenersquos Algorithm
52
for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)
D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21
Performance of 25D APSP using Kleene
53
Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)
62xspeedup
2x speedup
What about sparse matrices (23)
bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of
G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix
bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering
for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid
bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse
multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on
separators) 54
What about sparse matrices (33)
bull If matrix stays very sparse lower bound unattainable new one
bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of
data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely
bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices
along dimensions most likely to minimize cost
55
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
Symmetric Eigenproblem and SVD
bull Usual approach for A=AT (SVD similar)ndash A QTAQ = T where Q orthogonal T tridiagonalndash T UTTU = Λ where U orthogonal Λ diagonalndash QUrsquos columns are eigenvectors Λ eigenvaluesndash Dense Tridiagonal Diagonalndash Only half BLAS3 half BLAS2 in LAPACKrsquos sytrd
bull Communication-Avoiding Approachndash A QAQT = B where B=BT banded of bandwidth M12
ndash Continue as above starting with Bndash Dense Banded Tridiagonal Diagonalndash Dense Banded use TSQR to zero out M12 colsrows at a timendash Banded Tridiagonal need new(ish) idea
b+1
b+1
Successive Band Reduction (BischofLangSun)
1
b+1
b+1
d+1
c
Successive Band Reduction (BischofLangSun)
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1Q1
b+1
b+1
d+1
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
12
Q1
b+1
b+1
d+1
d+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
12
Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
d+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
Q1
Q1T
Q2
Q2T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
Q1
Q1T
Q2
Q2T
Q3
Q3T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
6
6
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
Conventional vs CA - SBR
Conventional Communication-Avoiding
Touch all data 4 times Touch all data once
Lower bound for all ldquon3-likerdquo linear algebra
bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no
matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)
7
bull Let M = ldquofastrdquo memory size (per processor)
words_moved (per processor) = (flops (per processor) M12 )
messages_sent ge words_moved largest_message_size
bull Parallel case assume either load or memory balanced
Lower bound for all ldquon3-likerdquo linear algebra
bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no
matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)
8
bull Let M = ldquofastrdquo memory size (per processor)
words_moved (per processor) = (flops (per processor) M12 )
messages_sent (per processor) = (flops (per processor) M32 )
bull Parallel case assume either load or memory balanced
SIAM SIAGLinear Algebra Prize 2012Ballard D Holtz Schwartz
Limits to parallel scaling (12)
bull Consider dense case flops_per_proc = n3Pndash Words = (n3(PM12 ))ndash Messages = (n3(PM32 ))
bull What is M Must be at least n2P to hold datandash Words = (n2P12 )ndash Messages = (P12 )
bull But if M fixed looks like perfect strong scaling in timendash Flops Words Messages all proportional to 1P
bull Ditto for energy if we count energy costs in joules hellipndash Per flop per word moved per messagendash Per word per second for data stored in memory Mndash Per second for leakage cooling hellip
bull How big can we make P and M
Limits to parallel scaling (22)
bull Consider dense case flops_per_proc = n3Pndash Words = (n3(PM12 ))ndash Messages = (n3(PM32 ))
bull How big can we make P and Mbull Assume we start with 1 copy of inputs A and B
ndash Otherwise no communication may be needed
bull Thm Words= (n2P23 ) independent of Mbull Reached when M = n2P23 too or P = n3M32 and Messages = (1) (log P in practice)bull Attained by 25D algorithm when c=P13 (ldquo3D algrdquo)bull Can keep increasing P until P = n3 Words = Messages = (1) (log n in practice)
Can we attain these lower bounds
bull Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these boundsndash Often not
bull If not are there other algorithms that dondash Yes for much of dense linear algebrandash New algorithms with new numerical properties
new ways to encode answers new data structures
ndash Not just loop transformations (need those too)bull Only a few sparse algorithms so farbull Lots of work in progress
ndash Algorithms Energy Heterogeneous Processors hellip11
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
25D Matrix Multiplication
bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid
c
(Pc)12
(Pc)12
Example P = 32 c = 2
25D Matrix Multiplication
bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid
k
j
iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12
(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)
(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)
(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)
25D Matmul on BGP 16K nodes 64K coresc = 16 copies
Distinguished Paper Award EuroParrsquo11 (Solomonik D)SCrsquo11 paper by Solomonik Bhatele D
12x faster
27x faster
Perfect Strong Scaling ndash in Time and Energy (12)
bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2
bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model
ndash γT βT αT = secs per flop per word_moved per message of size m
bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ]
= T(P)cbull Notation for energy model
ndash γE βE αE = joules for same operations
ndash δE = joules per word of memory used per sec
ndash εE = joules per sec for leakage etc
bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP)
= E(P)bull Perfect scaling extends to N-body Strassen hellip
Perfect Strong Scaling ndash in Time and Energy (22)
bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ] = T(P)c
bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP) = E(P)
bull Can use these formulas to answer many questions such asndash How to choose p and M to minimize energy E needed for computationndash Given max allowed runtime T what is minimum energy E needed to achieve
itndash Given max allowed energy E what is the minimum runtime T attainablendash Can we minimize the average power P = ETndash Given target energy efficiency what architectural parameters are needed to
achieve itbull Can we attain 75 GflopsWattbull Can we attain an exaflop for 20 MWatts
Handling Heterogeneity
bull Suppose each of P processors could differndash γi = secflop βi = secword αi = secmessage Mi = memory
bull What is optimal assignment of work Fi to minimize timendash Ti = Fi γi + Fi βi Mi
12 + Fi αi Mi32 = Fi [γi + βi Mi
12 + αi Mi32] = Fi ξi
ndash Choose Fi so Σi Fi = n3 and minimizing T = maxi Ti
ndash Answer Fi = n3(1ξi)Σj(1ξj) and T = n3Σj(1ξj)
bull Optimal Algorithm for nxn matmulndash Recursively divide into 8 half-sized subproblemsndash Assign subproblems to processor i to add up to Fi flops
bull Works for Strassen other algorithmshellip
Application to Tensor Contractions
bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply
bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory
bull Heavily used in electronic structure calculationsndash Ex NWChem
bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries
ndash Solomonik Hammond Matthews
C(ijk) = Σm A(ijm)B(mk)
A3-fold symm
B2-fold symm
C2-fold symm
Application to Tensor Contractions
bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply
bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory
bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn
bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews
Communication Lower Bounds for Strassen-like matmul algorithms
bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected
bull Extends up to M = n2 p2ω bull Extends to rectangular case multiply (mxn)(nxp) in q mults
ndash words_moved = Ω (flopsM^(logmpq -1))
bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz also in JACMbull Is the lower bound attainable
Classical O(n3) matmul
words_moved =Ω (M(nM12)3P)
Strassenrsquos O(nlg7) matmul
words_moved =Ω (M(nM12)lg7P)
Strassen-like O(nω) matmul
words_moved =Ω (M(nM12)ωP)
vs
Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory
Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory
CAPS If EnoughMemory and P 7 then BFS step else DFS step end if
Communication Avoiding Parallel Strassen (CAPS)
Best way to interleaveBFS and DFS is an tuning parameter
26
Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080
Speedups 24-184(over previous Strassen-based algorithms)
Invited to appear as Research Highlight in CACM
Strassen-like beyond matmul
bull Thm (D Dumitriu Holtzrsquo07) Any Strassen-like O(nω) matmul algorithm can be used to build a numerically stable O(nω+η) algorithm for any ηgt0 for Ax=b least squares eig SVD hellipndash ηgt0 needed to deal with numerical stabilityndash Strassen already stable so η=0
bull Thm For sequential versions of these algorithms Words_moved = O(nω+ηM(ω+η)2 ndash 1 + n2 log n) ie attain expected lower bound
Ballard D Holtz Schwartz
Cache and Network Oblivious Algorithms
bull Motivation Minimizes communication at every level of a hierarchical system without tuning parameters (in theory)ndash Not always 25D Matmul on BGP was topology aware
bull CAPS Divide-and-conquer choose BFS or DFS to adapt to processors available memory
bull CARMAndash Divide-and-conquer classical matmul divide largest of 3
dimensions to create two subproblemsndash Choose BFS or DFS to adapt to processors available memory
CARMA Performance Distributed Memory
Square m = k = n = 6144
ScaLAPACK
CARMA
Peak
(log)
(log)
Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA
CARMA Performance Distributed Memory
Inner Product m = n = 192 k = 6291456
ScaLAPACK
CARMAPeak
(log)
(log)
Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA
CARMA Performance Shared Memory
Square m = k = n
MKL (double)CARMA (double)
MKL (single)CARMA (single)
Peak (single)
Peak (double)
(log)
(linear)
Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA
CARMA Performance Shared Memory
Inner Product m = n = 64
MKL (double)
CARMA (double)
MKL (single)
CARMA (single)
(log)
(linear)
Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA
Why is CARMA Faster in Shared MemoryL3 Cache Misses
Shared Memory Inner Product (m = n = 64 k = 524288)
97 Fewer Misses
86 Fewer Misses
(linear)
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
One-sided Factorizations (LU QR) so farbull Classical Approach for i=1 to n update column i update trailing matrixbull words_moved = O(n3)
35
bull Blocked Approach (LAPACK) for i=1 to nb update block i of b columns update trailing matrixbull words moved = O(n3M13)
bull Recursive Approach func factor(A) if A has 1 column update it
else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)
bull None of these approaches minimizes messagesbull Parallel case Partial
Pivoting =gt n reductionsbull Need another idea
TSQR An Architecture-Dependent Algorithm
W =
W0
W1
W2
W3
R00
R10
R20
R30
R01
R11
R02Parallel
W =
W0
W1
W2
W3
R01R02
R00
R03
SequentialStreaming
W =
W0
W1
W2
W3
R00
R01
R01
R11
R02
R11
R03
Dual Core
Can choose reduction tree dynamically
Multicore Multisocket Multirack Multisite Out-of-core
Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo
Wnxb =
W1
W2
W3
W4
P1middotL1middotU1
P2middotL2middotU2
P3middotL3middotU3
P4middotL4middotU4
=
Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo
W1rsquoW2rsquoW3rsquoW4rsquo
P12middotL12middotU12
P34middotL34middotU34
=Choose b pivot rows call them W12rsquo
Choose b pivot rows call them W34rsquo
W12rsquoW34rsquo
= P1234middotL1234middotU1234
Choose b pivot rows
Go back to W and use these b pivot rows (move them to top do LU without pivoting)
37
Minimizing Communication in TSLU
W = W1
W2
W3
W4
LULULULU
LU
LULUParallel
W = W1
W2
W3
W4
LULU
LU
LUSequentialStreaming
W = W1
W2
W3
W4
LULU LU
LULU
LULU
Dual Core
Can choose reduction tree dynamically to match architecture as before
38
Making TSLU Numerically Stable
bull Details matterndash Going up the tree we could do LU either on original rows of A
(tournament pivoting) or computed rows of Undash Only tournament pivoting stable
bull ldquoThmrdquo New scheme as stable as Partial Pivoting (GEPP) in following sense Get same Schur complements as GEPP applied to different input matrix whose entries are blocks taken from input A
bull Why just a ldquoThmrdquo
39
Stability of LU using TSLU CALU
Summer School Lecture 4 40
bull Empirical testingndash Both random matrices and ldquospecial onesrdquondash Both binary tree (BCALU) and flat-tree (FCALU)ndash 3 metrics ||PA-LU||||A|| normwise and componentwise backward errorsndash See [D Grigori Xiang 2010] for details
Why is stability of TSLU just a ldquoThmrdquo
bull Proof is correct ndash in exact arithmeticbull Experiment
ndash Generate 100 random 6x6 rank 3 matrices in Matlabndash [LUP] = lu(A) do LU without pivoting on PA compare L factors are
they the samebull Compute || L ndash Lnp || A few 0rsquos A few infinrsquos a few NaNsbull Rest mostly O(1)
ndash Why Floating point is nonassociative doing arithmetic in different order gives different rounding errors
ndash Same experiment with rank 6 matrices || L ndash Lnp || usually nonzero O(macheps)
ndash Same experiment with 20x20 rank 4 matrices || L ndash Lnp || often O(103)
bull Much harder to break TSLU but possiblendash Occurred when using TSLU to factorize a low-rank subdiagonal
panel in symmetric-indefinite factorization41
Fixing TSLU
bull Run TSLU quickly test for stability fix if necessary (rare)
bull Test conditioning of U if not tiny (usual case) proceed elsebull Compute || L || if not big (usual case) proceed elsebull Factor A = QR using TSQR thenbull Factor Q = PLU using TSLU thenbull A = PL(UR) with UR as upper triangular factor
bull Last topic in lecture how to guarantee floating point reproducibility
42
2D CALU with Tournament Pivoting
43
25D CALU with Tournament Pivoting (c=4 copies)
44
Exascale Machine ParametersSource DOE Exascale Workshop
bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node
Exascale predicted speedupsfor Gaussian Elimination
2D CA-LU vs ScaLAPACK-LU
log2 (p)
log
2 (
n2p
) =
log
2 (m
emo
ry_p
er_p
roc)
Up to 29x
25D vs 2D LUWith and Without Pivoting
Other CA algorithms for Ax=b least squares(13)
bull A symmetric and indefinitendash Seek factorization that retains symmetry PAPT = LDLT D
ldquosimplerdquobull Save frac12 flops preserve inertia
ndash Usual approach Bunch-Kaufmanbull D block diagonal with 1x1 and 2x2 blocksbull Pivot search down column along row (lots of communication)
ndash Alternative Aasenbull D = tridiagonal = Tbull Two steps
ndash PAPT = LTLT where T is banded using TSLU
48
0 0
0
0 0
0
0
hellip
hellip
ndash Solvefactor narrow band problem with Tbull Up to 28x faster than MKL Best Paper at IPDPSrsquo13
Other CA algorithms for Ax=b least squares (23)bull Minimizing bandwidth and latency for sequential GEPP
ndash So far could not do partial pivoting and minimize messages just words
ndash Challengebull Column layout good for choosing pivots bad for matmulbull Blocked layout good for matmul bad for choosing pivots
ndash Solution use both layouts switching between thembull ldquoShape Morphing LUrdquo or SMLU
49
bull func factor(A) if A has 1 column update it else factor(left half of A)
update right half of A
factor(right half of A)
bull Words = O(n3M12)
bull Messages = O(n3M)
bull func factor(A) if A has 1 column update it else factor(left half of A) reshape to recursive block format update right half of A reshape to columnwise format factor(right half of A)
bull Words = O(n3M12)
bull Messages = O(n3M32)
Other CA algorithms for Ax=b least squares (33)bull Need for pivoting arises beyond LU in QR
ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)
ndash Usual approach like Partial Pivoting
bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound
ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two
groups of b columns either using usual approach or something better (GuEisenstat)
bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix
ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 50
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
What about sparse matrices (13)
bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then
bull But canrsquot reorder outer loop for 25D need another idea
bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring
bull Kleenersquos Algorithm
52
for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)
D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21
Performance of 25D APSP using Kleene
53
Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)
62xspeedup
2x speedup
What about sparse matrices (23)
bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of
G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix
bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering
for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid
bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse
multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on
separators) 54
What about sparse matrices (33)
bull If matrix stays very sparse lower bound unattainable new one
bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of
data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely
bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices
along dimensions most likely to minimize cost
55
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
Symmetric Eigenproblem and SVD
bull Usual approach for A=AT (SVD similar)ndash A QTAQ = T where Q orthogonal T tridiagonalndash T UTTU = Λ where U orthogonal Λ diagonalndash QUrsquos columns are eigenvectors Λ eigenvaluesndash Dense Tridiagonal Diagonalndash Only half BLAS3 half BLAS2 in LAPACKrsquos sytrd
bull Communication-Avoiding Approachndash A QAQT = B where B=BT banded of bandwidth M12
ndash Continue as above starting with Bndash Dense Banded Tridiagonal Diagonalndash Dense Banded use TSQR to zero out M12 colsrows at a timendash Banded Tridiagonal need new(ish) idea
b+1
b+1
Successive Band Reduction (BischofLangSun)
1
b+1
b+1
d+1
c
Successive Band Reduction (BischofLangSun)
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1Q1
b+1
b+1
d+1
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
12
Q1
b+1
b+1
d+1
d+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
12
Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
d+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
Q1
Q1T
Q2
Q2T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
Q1
Q1T
Q2
Q2T
Q3
Q3T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
6
6
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
Conventional vs CA-SBR
  Conventional:            Touch all data 4 times
  Communication-Avoiding:  Touch all data once
Lower bound for all "n^3-like" linear algebra
• Holds for
  – Matmul, BLAS, LU, QR, eig, SVD, tensor contractions, …
  – Some whole programs (sequences of these operations, no matter how individual ops are interleaved, e.g. A^k)
  – Dense and sparse matrices (where #flops << n^3)
  – Sequential and parallel algorithms
  – Some graph-theoretic algorithms (e.g. Floyd-Warshall)
• Let M = "fast" memory size (per processor)
    words_moved (per processor) = Ω(#flops (per processor) / M^(1/2))
    messages_sent (per processor) = Ω(#flops (per processor) / M^(3/2))
• Parallel case: assume either load or memory balanced
SIAM SIAG/Linear Algebra Prize 2012: Ballard, D., Holtz, Schwartz
Limits to parallel scaling (1/2)
• Consider dense case: flops_per_proc = n^3/P
  – Words = Ω(n^3/(P·M^(1/2)))
  – Messages = Ω(n^3/(P·M^(3/2)))
• What is M? Must be at least n^2/P to hold data
  – Words = Ω(n^2/P^(1/2))
  – Messages = Ω(P^(1/2))
• But if M fixed, looks like perfect strong scaling in time
  – Flops, Words, Messages all proportional to 1/P
• Ditto for energy, if we count energy costs in joules …
  – Per flop, per word moved, per message
  – Per word per second, for data stored in memory M
  – Per second, for leakage, cooling, …
• How big can we make P? and M?
Limits to parallel scaling (2/2)
• Consider dense case: flops_per_proc = n^3/P
  – Words = Ω(n^3/(P·M^(1/2)))
  – Messages = Ω(n^3/(P·M^(3/2)))
• How big can we make P? and M?
• Assume we start with 1 copy of inputs A and B
  – Otherwise, no communication may be needed
• Thm: Words = Ω(n^2/P^(2/3)), independent of M
• Reached when M = n^2/P^(2/3) too, or P = n^3/M^(3/2), and Messages = Ω(1) (log P in practice)
• Attained by 2.5D algorithm when c = P^(1/3) ("3D alg")
• Can keep increasing P until P = n^3; then Words = Messages = Ω(1) (log n in practice)
Can we attain these lower bounds?
• Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these bounds?
  – Often not
• If not, are there other algorithms that do?
  – Yes, for much of dense linear algebra
  – New algorithms, with new numerical properties, new ways to encode answers, new data structures
  – Not just loop transformations (need those too)
• Only a few sparse algorithms so far
• Lots of work in progress
  – Algorithms, Energy, Heterogeneous Processors, …
Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity
2.5D Matrix Multiplication
• Assume can fit c·n^2/P data per processor, c > 1
• Processors form (P/c)^(1/2) x (P/c)^(1/2) x c grid
[Figure: processor grid with sides (P/c)^(1/2), (P/c)^(1/2), and c; Example: P = 32, c = 2]
2.5D Matrix Multiplication
• Assume can fit c·n^2/P data per processor, c > 1
• Processors form (P/c)^(1/2) x (P/c)^(1/2) x c grid, indexed by (i, j, k)
• Initially P(i,j,0) owns A(i,j) and B(i,j), each of size n·(c/P)^(1/2) x n·(c/P)^(1/2)
(1) P(i,j,0) broadcasts A(i,j) and B(i,j) to P(i,j,k)
(2) Processors at level k perform 1/c-th of SUMMA, i.e. 1/c-th of Σm A(i,m)·B(m,j)
(3) Sum-reduce partial sums Σm A(i,m)·B(m,j) along k-axis so that P(i,j,0) owns C(i,j)
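To make the three steps concrete, here is a minimal serial NumPy sketch of the arithmetic they describe (no MPI; the broadcast and sum-reduce are only modeled by splitting the m-summation across the c grid layers). The function name and the contiguous splitting of m are illustrative assumptions, not the data layout of the real algorithm.

import numpy as np

def matmul_25d_simulated(A, B, P=32, c=2):
    """Serial simulation of the 2.5D algorithm's arithmetic: the m-summation is
    split into c pieces (one per grid layer), each layer forms its partial sum,
    and the layers are then sum-reduced along the k-axis."""
    n = A.shape[0]
    grid_side = int(round((P / c) ** 0.5))        # (P/c)^(1/2) x (P/c)^(1/2) x c grid
    assert grid_side * grid_side * c == P
    m_ranges = np.array_split(np.arange(n), c)    # layer k owns 1/c-th of the m index
    partial = np.zeros((c, n, n))
    for k, ms in enumerate(m_ranges):
        partial[k] = A[:, ms] @ B[ms, :]          # 1/c-th of SUMMA on layer k
    return partial.sum(axis=0)                    # sum-reduce along k

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A, B = rng.standard_normal((64, 64)), rng.standard_normal((64, 64))
    assert np.allclose(matmul_25d_simulated(A, B), A @ B)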
2.5D Matmul on BG/P, 16K nodes / 64K cores, c = 16 copies
[Performance plot; annotations: "12x faster", "2.7x faster" vs. 2D]
Distinguished Paper Award, EuroPar'11 (Solomonik, D.)
SC'11 paper by Solomonik, Bhatele, D.
Perfect Strong Scaling – in Time and Energy (1/2)
• Every time you add a processor, you should use its memory M too
• Start with minimal number of procs: P·M = 3n^2
• Increase P by a factor of c → total memory increases by a factor of c
• Notation for timing model
  – γ_T, β_T, α_T = secs per flop, per word_moved, per message of size m
• T(cP) = n^3/(cP) · [ γ_T + β_T/M^(1/2) + α_T/(m·M^(1/2)) ] = T(P)/c
• Notation for energy model
  – γ_E, β_E, α_E = joules for same operations
  – δ_E = joules per word of memory used per sec
  – ε_E = joules per sec, for leakage, etc.
• E(cP) = cP · { n^3/(cP) · [ γ_E + β_E/M^(1/2) + α_E/(m·M^(1/2)) ] + δ_E·M·T(cP) + ε_E·T(cP) } = E(P)
• Perfect scaling extends to N-body, Strassen, …
Perfect Strong Scaling – in Time and Energy (2/2)
• T(cP) = n^3/(cP) · [ γ_T + β_T/M^(1/2) + α_T/(m·M^(1/2)) ] = T(P)/c
• E(cP) = cP · { n^3/(cP) · [ γ_E + β_E/M^(1/2) + α_E/(m·M^(1/2)) ] + δ_E·M·T(cP) + ε_E·T(cP) } = E(P)
• Can use these formulas to answer many questions, such as
  – How to choose p and M to minimize energy E needed for computation
  – Given max allowed runtime T, what is minimum energy E needed to achieve it?
  – Given max allowed energy E, what is the minimum runtime T attainable?
  – Can we minimize the average power P = E/T?
  – Given target energy efficiency, what architectural parameters are needed to achieve it?
    • Can we attain 75 Gflops/Watt?
    • Can we attain an exaflop for 20 MWatts?
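The two model formulas above are easy to play with numerically. A small sketch follows; all machine constants in it are made-up placeholders, and it only checks the perfect-scaling claims T(cP) = T(P)/c and E(cP) = E(P) when M is held fixed.

def T_model(n, P, c, M, m, gT, bT, aT):
    """T(cP) = n^3/(cP) * [gT + bT/sqrt(M) + aT/(m*sqrt(M))]"""
    return n**3 / (c * P) * (gT + bT / M**0.5 + aT / (m * M**0.5))

def E_model(n, P, c, M, m, gE, bE, aE, dE, eE, gT, bT, aT):
    """E(cP) = cP * { n^3/(cP)*[gE + bE/sqrt(M) + aE/(m*sqrt(M))] + dE*M*T + eE*T }"""
    T = T_model(n, P, c, M, m, gT, bT, aT)
    return c * P * (n**3 / (c * P) * (gE + bE / M**0.5 + aE / (m * M**0.5))
                    + dE * M * T + eE * T)

# Illustrative (invented) constants; only the scaling behavior matters here.
targs = dict(n=4096, P=64, M=2**20, m=2**10, gT=1e-11, bT=1e-9, aT=1e-6)
eargs = dict(gE=1e-10, bE=1e-8, aE=1e-5, dE=1e-12, eE=1.0)
t1, t4 = T_model(c=1, **targs), T_model(c=4, **targs)
e1, e4 = E_model(c=1, **targs, **eargs), E_model(c=4, **targs, **eargs)
print(t1 / t4)   # = 4: perfect strong scaling in time
print(e4 / e1)   # = 1: energy stays constant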
Handling Heterogeneity
• Suppose each of P processors could differ
  – γi = sec/flop, βi = sec/word, αi = sec/message, Mi = memory
• What is optimal assignment of work Fi to minimize time?
  – Ti = Fi·γi + Fi·βi/Mi^(1/2) + Fi·αi/Mi^(3/2) = Fi·[γi + βi/Mi^(1/2) + αi/Mi^(3/2)] = Fi·ξi
  – Choose Fi so Σi Fi = n^3, minimizing T = maxi Ti
  – Answer: Fi = n^3·(1/ξi)/Σj(1/ξj) and T = n^3/Σj(1/ξj)
• Optimal Algorithm for n x n matmul
  – Recursively divide into 8 half-sized subproblems
  – Assign subproblems to processor i to add up to Fi flops
• Works for Strassen, other algorithms…
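A tiny sketch of the optimal work split; the processor parameters below are invented, only the formulas Fi = n^3·(1/ξi)/Σj(1/ξj) and T = n^3/Σj(1/ξj) come from the slide.

def heterogeneous_split(n, gammas, betas, alphas, Ms):
    # xi_i = gamma_i + beta_i/Mi^(1/2) + alpha_i/Mi^(3/2): effective time per flop on proc i
    xis = [g + b / M**0.5 + a / M**1.5 for g, b, a, M in zip(gammas, betas, alphas, Ms)]
    denom = sum(1.0 / xi for xi in xis)
    F = [n**3 * (1.0 / xi) / denom for xi in xis]   # work assigned to proc i
    T = n**3 / denom                                # every processor finishes at time T
    return F, T

F, T = heterogeneous_split(n=1024,
                           gammas=[1e-11, 2e-11, 5e-11],
                           betas=[1e-9, 1e-9, 2e-9],
                           alphas=[1e-6, 2e-6, 1e-6],
                           Ms=[2**20, 2**22, 2**18])
print(F, T)   # faster processors get proportionally more flops; sum(F) == n^3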
Application to Tensor Contractions
• Ex: C(i,j,k) = Σmn A(i,j,m,n)·B(m,n,k)
  – Communication lower bounds apply
• Complex symmetries possible
  – Ex: B(m,n,k) = B(k,m,n) = …
  – d-fold symmetry can save up to d-fold flops/memory
• Heavily used in electronic structure calculations
  – Ex: NWChem
• CTF: Cyclops Tensor Framework
  – Exploits 2.5D algorithms, symmetries
  – Solomonik, Hammond, Matthews
[Figure: C(i,j,k) = Σm A(i,j,m)·B(m,k), with A 3-fold symmetric, B 2-fold symmetric, C 2-fold symmetric]
Application to Tensor Contractions
• Ex: C(i,j,k) = Σmn A(i,j,m,n)·B(m,n,k)
  – Communication lower bounds apply
• Complex symmetries possible
  – Ex: B(m,n,k) = B(k,m,n) = …
  – d-fold symmetry can save up to d-fold flops/memory
• Heavily used in electronic structure calculations
  – Ex: NWChem, for coupled cluster (CC) approach to Schroedinger eqn
• CTF: Cyclops Tensor Framework
  – Exploits 2.5D algorithms, symmetries
  – Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6
  – Solomonik, Hammond, Matthews
Communication Lower Bounds for Strassen-like matmul algorithms
• Proof: graph expansion (different from classical matmul)
  – Strassen-like: DAG must be "regular" and connected
• Extends up to M = n^2/P^(2/ω)
• Extends to rectangular case: multiply (m x n)·(n x p) in q mults
  – words_moved = Ω(#flops / M^(log_mp q − 1))
• Best Paper Prize (SPAA'11), Ballard, D., Holtz, Schwartz; also in JACM
• Is the lower bound attainable?
    Classical O(n^3) matmul:       words_moved = Ω(M·(n/M^(1/2))^3 / P)
    Strassen's O(n^lg 7) matmul:   words_moved = Ω(M·(n/M^(1/2))^lg 7 / P)
    Strassen-like O(n^ω) matmul:   words_moved = Ω(M·(n/M^(1/2))^ω / P)
Communication Avoiding Parallel Strassen (CAPS)
BFS step vs. DFS step:
  – BFS: runs all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory
  – DFS: runs all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory
CAPS: if EnoughMemory and P ≥ 7 then BFS step else DFS step end if
Best way to interleave BFS and DFS is a tuning parameter
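For reference, a plain serial Strassen recursion showing the 7 half-sized multiplies that a BFS step would farm out to P/7 processor groups and a DFS step would run one after another on all P processors; this sketch is only the recursion CAPS schedules, not the CAPS scheduler itself.

import numpy as np

def strassen(A, B, cutoff=64):
    """Recursive Strassen: 7 half-sized multiplies per level."""
    n = A.shape[0]
    if n <= cutoff or n % 2:          # fall back to classical matmul
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C

rng = np.random.default_rng(1)
A, B = rng.standard_normal((256, 256)), rng.standard_normal((256, 256))
assert np.allclose(strassen(A, B), A @ B)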
Performance Benchmarking, Strong Scaling Plot
Franklin (Cray XT4), n = 94080
Speedups: 24%–184% (over previous Strassen-based algorithms)
Invited to appear as Research Highlight in CACM
Strassen-like beyond matmul
• Thm (D., Dumitriu, Holtz '07): Any Strassen-like O(n^ω) matmul algorithm can be used to build a numerically stable O(n^(ω+η)) algorithm, for any η > 0, for Ax=b, least squares, eig, SVD, …
  – η > 0 needed to deal with numerical stability
  – Strassen already stable, so η = 0
• Thm: For sequential versions of these algorithms, Words_moved = O(n^(ω+η)/M^((ω+η)/2 − 1) + n^2 log n), i.e. attain expected lower bound
Ballard, D., Holtz, Schwartz
Cache and Network Oblivious Algorithms
• Motivation: Minimize communication at every level of a hierarchical system without tuning parameters (in theory)
  – Not always: 2.5D Matmul on BG/P was topology-aware
• CAPS: Divide-and-conquer, choose BFS or DFS to adapt to #processors, available memory
• CARMA
  – Divide-and-conquer classical matmul: divide largest of 3 dimensions to create two subproblems
  – Choose BFS or DFS to adapt to #processors, available memory
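A minimal sketch of CARMA's recursion rule (always split whichever of m, k, n is largest); the BFS/DFS decision and all data-layout details are omitted, so this is only the shape of the recursion, not the distributed algorithm.

import numpy as np

def carma(A, B, cutoff=64):
    """Split the largest of the three dimensions in half, giving two subproblems."""
    m, k = A.shape
    _, n = B.shape
    if max(m, k, n) <= cutoff:
        return A @ B
    if m >= k and m >= n:            # split rows of A (and C)
        h = m // 2
        return np.vstack([carma(A[:h], B, cutoff), carma(A[h:], B, cutoff)])
    if n >= k:                       # split columns of B (and C)
        h = n // 2
        return np.hstack([carma(A, B[:, :h], cutoff), carma(A, B[:, h:], cutoff)])
    h = k // 2                       # split the contraction dimension, sum the halves
    return carma(A[:, :h], B[:h], cutoff) + carma(A[:, h:], B[h:], cutoff)

rng = np.random.default_rng(2)
A, B = rng.standard_normal((192, 96)), rng.standard_normal((96, 300))
assert np.allclose(carma(A, B), A @ B)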
CARMA Performance: Distributed Memory
[Plot: Square, m = k = n = 6144; ScaLAPACK vs. CARMA vs. Peak (log-log axes); Cray XE6 (Hopper), each node 2 x 12 core, 4 x NUMA]
CARMA Performance: Distributed Memory
[Plot: Inner Product, m = n = 192, k = 6291456; ScaLAPACK vs. CARMA vs. Peak (log-log axes); Cray XE6 (Hopper), each node 2 x 12 core, 4 x NUMA]
CARMA Performance: Shared Memory
[Plot: Square, m = k = n; MKL and CARMA, single and double precision, vs. Peak (log x-axis, linear y-axis); Intel Emerald: 4 Intel Xeon X7560 x 8 cores, 4 x NUMA]
CARMA Performance: Shared Memory
[Plot: Inner Product, m = n = 64; MKL and CARMA, single and double precision (log x-axis, linear y-axis); Intel Emerald: 4 Intel Xeon X7560 x 8 cores, 4 x NUMA]
Why is CARMA Faster in Shared Memory? L3 Cache Misses
[Plot: Shared Memory Inner Product (m = n = 64, k = 524288); CARMA incurs 97% fewer misses and 86% fewer misses than MKL (linear axes)]
Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity
One-sided Factorizations (LU, QR), so far
• Classical Approach
    for i = 1 to n
      update column i
      update trailing matrix
  – #words_moved = O(n^3)
• Blocked Approach (LAPACK)
    for i = 1 to n/b
      update block i of b columns
      update trailing matrix
  – #words_moved = O(n^3/M^(1/3))
• Recursive Approach
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        update right half of A
        factor(right half of A)
  – #words_moved = O(n^3/M^(1/2))
• None of these approaches minimizes #messages
• Parallel case: Partial Pivoting => n reductions
• Need another idea
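As an illustration of the recursive approach, a short in-place recursive LU sketch follows. It does no pivoting (which is exactly what the "need another idea" refers to), so it assumes the leading principal minors are nonsingular.

import numpy as np

def recursive_lu(A):
    """In-place recursive LU without pivoting: on return, the strict lower
    triangle of A holds L (unit diagonal) and the upper triangle holds U."""
    n = A.shape[1]
    if n == 1:
        A[1:, 0] /= A[0, 0]                          # scale the multipliers
        return A
    k = n // 2
    recursive_lu(A[:, :k])                           # factor left half (tall panel)
    L11 = np.tril(A[:k, :k], -1) + np.eye(k)
    A[:k, k:] = np.linalg.solve(L11, A[:k, k:])      # U12 = L11^{-1} A12
    A[k:, k:] -= A[k:, :k] @ A[:k, k:]               # Schur complement update
    recursive_lu(A[k:, k:])                          # factor right half
    return A

rng = np.random.default_rng(3)
A = rng.standard_normal((8, 8)) + 8 * np.eye(8)      # diagonally dominant: no pivoting needed
LU = recursive_lu(A.copy())
L = np.tril(LU, -1) + np.eye(8)
U = np.triu(LU)
assert np.allclose(L @ U, A)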
TSQR: An Architecture-Dependent Algorithm
[Figure: W = [W0; W1; W2; W3] factored with different reduction trees]
  – Parallel (binary tree): QR of each Wi gives R00, R10, R20, R30; pairs combine to R01, R11; final combine gives R02
  – Sequential/Streaming (flat tree): R00 from W0, then fold in W1, W2, W3 to get R01, R02, R03
  – Dual Core: a hybrid of the two trees
Can choose reduction tree dynamically
Multicore / Multisocket / Multirack / Multisite / Out-of-core
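A serial sketch of TSQR with a binary reduction tree, tracking only the R factors (the implicit Q factors are not assembled here):

import numpy as np

def tsqr_binary_tree(W, nblocks=4):
    """QR each row block of the tall-skinny W, then repeatedly stack pairs of
    R factors and re-QR them until a single R remains."""
    blocks = np.array_split(W, nblocks, axis=0)
    Rs = [np.linalg.qr(B, mode='r') for B in blocks]          # leaves of the tree
    while len(Rs) > 1:
        Rs = [np.linalg.qr(np.vstack(Rs[i:i + 2]), mode='r')  # combine neighboring pairs
              for i in range(0, len(Rs), 2)]
    return Rs[0]

rng = np.random.default_rng(4)
W = rng.standard_normal((4000, 8))                            # tall and skinny
R_tree = tsqr_binary_tree(W)
R_ref = np.linalg.qr(W, mode='r')
# R is unique up to the signs of its rows, so compare magnitudes:
assert np.allclose(np.abs(R_tree), np.abs(R_ref), atol=1e-8)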
Back to LU: Using similar idea for TSLU as TSQR: Use reduction tree to do "Tournament Pivoting"
  W (n x b) = [W1; W2; W3; W4]
  – Factor each block Wi = Pi·Li·Ui; choose b pivot rows of Wi, call them Wi'
  – Stack pairs: [W1'; W2'] = P12·L12·U12, choose b pivot rows, call them W12'
                 [W3'; W4'] = P34·L34·U34, choose b pivot rows, call them W34'
  – Stack winners: [W12'; W34'] = P1234·L1234·U1234, choose b pivot rows
  – Go back to W and use these b pivot rows (move them to top, do LU without pivoting)
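A small sketch of the tournament itself: each block nominates the b pivot rows that ordinary partial pivoting would choose, and the nominees are re-pivoted pairwise up a binary tree. Function names are illustrative; this returns only the winning row indices, not the full CALU factorization.

import numpy as np

def gepp_pivot_rows(W):
    """Row indices of W chosen as pivots by Gaussian elimination with
    partial pivoting (one pivot per column)."""
    A = W.astype(float).copy()
    rows = np.arange(A.shape[0])
    b = A.shape[1]
    for j in range(b):
        p = j + np.argmax(np.abs(A[j:, j]))          # largest entry in column j
        A[[j, p]] = A[[p, j]]
        rows[[j, p]] = rows[[p, j]]
        if A[j, j] != 0:
            A[j+1:, j] /= A[j, j]
            A[j+1:, j+1:] -= np.outer(A[j+1:, j], A[j, j+1:])
    return rows[:b]

def tournament_pivots(W, nblocks=4):
    """TSLU-style tournament: leaves nominate b rows each, winners are
    stacked pairwise and re-pivoted until b rows remain."""
    block_rows = np.array_split(np.arange(W.shape[0]), nblocks)
    cands = [idx[gepp_pivot_rows(W[idx])] for idx in block_rows]
    while len(cands) > 1:
        cands = [np.concatenate(cands[i:i + 2])[gepp_pivot_rows(W[np.concatenate(cands[i:i + 2])])]
                 for i in range(0, len(cands), 2)]
    return cands[0]

rng = np.random.default_rng(5)
W = rng.standard_normal((1024, 8))
print(tournament_pivots(W))      # indices of the b = 8 rows that win the tournament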
Minimizing Communication in TSLU
[Figure: W = [W1; W2; W3; W4] reduced with different trees of local LU factorizations – Parallel (binary tree), Sequential/Streaming (flat tree), Dual Core (hybrid)]
Can choose reduction tree dynamically to match architecture, as before
Making TSLU Numerically Stable
• Details matter
  – Going up the tree, we could do LU either on original rows of A (tournament pivoting) or on computed rows of U
  – Only tournament pivoting is stable
• "Thm": New scheme as stable as Partial Pivoting (GEPP) in the following sense: get the same Schur complements as GEPP applied to a different input matrix whose entries are blocks taken from the input A
• Why just a "Thm"?
Stability of LU using TSLU: CALU
• Empirical testing
  – Both random matrices and "special ones"
  – Both binary tree (BCALU) and flat-tree (FCALU)
  – 3 metrics: ||PA−LU||/||A||, normwise and componentwise backward errors
  – See [D., Grigori, Xiang, 2010] for details
Why is stability of TSLU just a "Thm"?
• Proof is correct – in exact arithmetic
• Experiment
  – Generate 100 random 6x6, rank-3 matrices in Matlab
  – [L,U,P] = lu(A), do LU without pivoting on P·A, compare L factors: are they the same?
    • Compute || L – Lnp ||: a few 0's, a few ∞'s, a few NaNs
    • Rest mostly O(1)
  – Why? Floating point is nonassociative; doing arithmetic in a different order gives different rounding errors
  – Same experiment with rank-6 matrices: || L – Lnp || usually nonzero, O(macheps)
  – Same experiment with 20x20 rank-4 matrices: || L – Lnp || often O(10^3)
• Much harder to break TSLU, but possible
  – Occurred when using TSLU to factorize a low-rank subdiagonal panel in symmetric-indefinite factorization
Fixing TSLU
• Run TSLU, quickly test for stability, fix if necessary (rare)
• Test conditioning of U; if not tiny (usual case), proceed, else
• Compute || L ||; if not big (usual case), proceed, else
• Factor A = QR using TSQR, then
• Factor Q = PLU using TSLU, then
• A = P·L·(U·R) with U·R as upper triangular factor
• Last topic in lecture: how to guarantee floating point reproducibility
2D CALU with Tournament Pivoting
[Figure]
2.5D CALU with Tournament Pivoting (c = 4 copies)
[Figure]
Exascale Machine Parameters (Source: DOE Exascale Workshop)
• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node
Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU
[Plot: x-axis log2(p), y-axis log2(n^2/p) = log2(memory_per_proc); up to 29x speedup]
2.5D vs 2D LU, With and Without Pivoting
Other CA algorithms for Ax=b, least squares (1/3)
• A symmetric and indefinite
  – Seek factorization that retains symmetry: P·A·P^T = L·D·L^T, D "simple"
    • Save 1/2 the flops, preserve inertia
  – Usual approach: Bunch-Kaufman
    • D block diagonal with 1x1 and 2x2 blocks
    • Pivot search down column, along row (lots of communication)
  – Alternative: Aasen
    • D = tridiagonal = T
    • Two steps:
      – P·A·P^T = L·T·L^T where T is banded, using TSLU
      – Solve/factor narrow band problem with T
    • Up to 2.8x faster than MKL; Best Paper at IPDPS'13
Other CA algorithms for Ax=b, least squares (2/3)
• Minimizing bandwidth and latency for sequential GEPP
  – So far, could not do partial pivoting and minimize #messages, just #words
  – Challenge:
    • Column layout good for choosing pivots, bad for matmul
    • Blocked layout good for matmul, bad for choosing pivots
  – Solution: use both layouts, switching between them
    • "Shape Morphing LU" or SMLU
• Plain recursive LU:
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        update right half of A
        factor(right half of A)
  – #Words = O(n^3/M^(1/2))
  – #Messages = O(n^3/M)
• Shape Morphing LU:
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        reshape to recursive block format
        update right half of A
        reshape to columnwise format
        factor(right half of A)
  – #Words = O(n^3/M^(1/2))
  – #Messages = O(n^3/M^(3/2))
Other CA algorithms for Ax=b, least squares (3/3)
• Need for pivoting arises beyond LU, in QR
  – Choose permutation P so that leading columns of A·P = Q·R span column space of A – Rank Revealing QR (RRQR)
  – Usual approach, like Partial Pivoting
    • Put longest column first, update rest of matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit lower bound
  – Use Tournament Pivoting
    • Each round of tournament selects best b columns from two groups of b columns, either using the usual approach or something better (Gu/Eisenstat)
    • Thm: This approach "reveals the rank" of A in the sense that the leading r x r submatrix of R has singular values "near" the largest r singular values of A; ditto for trailing submatrix
  – Idea extends to other pivoting schemes
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDL^T with complete pivoting
Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity
What about sparse matrices? (1/3)
• If matrix quickly becomes dense, use dense algorithm
• Ex: All Pairs Shortest Path using Floyd-Warshall
• Similar to matmul: Let D = A, then
    for k = 1:n
      for i = 1:n
        for j = 1:n
          D(i,j) = min(D(i,j), D(i,k) + D(k,j))
• But can't reorder outer loop for 2.5D, need another idea
• Abbreviate D(i,j) = min(D(i,j), min_k(A(i,k)+B(k,j))) by D = A*B
  – Dependencies ok, 2.5D works, just a different semiring
• Kleene's Algorithm:
    D = DC-APSP(A, n)
      D = A
      Partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
      D11 = DC-APSP(D11, n/2)
      D12 = D11 * D12
      D21 = D21 * D11
      D22 = D21 * D12
      D22 = DC-APSP(D22, n/2)
      D21 = D22 * D21
      D12 = D12 * D22
      D11 = D12 * D21
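A runnable sketch of the same idea: the "*" operation over the (min,+) semiring and the divide-and-conquer APSP, checked against the Floyd-Warshall loop. The helper names are illustrative.

import numpy as np

def minplus(D, A, B):
    """D(i,j) = min(D(i,j), min_k A(i,k) + B(k,j)) — the '*' of the slide."""
    return np.minimum(D, (A[:, :, None] + B[None, :, :]).min(axis=1))

def dc_apsp(A):
    """Divide-and-conquer APSP (Kleene) over (min,+); A has np.inf for
    missing edges and 0 on the diagonal."""
    D = A.copy()
    n = D.shape[0]
    if n == 1:
        return D
    h = n // 2
    D11, D12, D21, D22 = D[:h, :h], D[:h, h:], D[h:, :h], D[h:, h:]
    D11[:] = dc_apsp(D11)
    D12[:] = minplus(D12, D11, D12)
    D21[:] = minplus(D21, D21, D11)
    D22[:] = minplus(D22, D21, D12)
    D22[:] = dc_apsp(D22)
    D21[:] = minplus(D21, D22, D21)
    D12[:] = minplus(D12, D12, D22)
    D11[:] = minplus(D11, D12, D21)
    return D

def floyd_warshall(A):
    D = A.copy()
    n = D.shape[0]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                D[i, j] = min(D[i, j], D[i, k] + D[k, j])
    return D

rng = np.random.default_rng(6)
W = rng.integers(1, 10, (16, 16)).astype(float)
W[rng.random((16, 16)) < 0.5] = np.inf        # drop some edges
np.fill_diagonal(W, 0.0)
assert np.array_equal(dc_apsp(W), floyd_warshall(W))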
Performance of 2.5D APSP using Kleene
[Plot: Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24,576 cores); annotations: 6.2x speedup, 2x speedup]
What about sparse matrices? (2/3)
• If parts of matrix become dense, optimize those
• Ex: Cholesky on matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): If all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on w x w submatrix
• Thm: Words_moved = Ω(w^3/M^(1/2)), etc.
• Thm (George '73): Nested dissection gives optimal ordering for 2D grid, 3D grid, similar matrices
  – w = n for 2D n x n grid, w = n^2 for 3D n x n x n grid
• Sequential multifrontal Cholesky attains bounds
• PSPACES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)
What about sparse matrices? (3/3)
• If matrix stays very sparse, lower bound unattainable, new one:
• Ex: A, B both diagonal: no communication in parallel case
• Ex: A, B both are Erdos-Renyi: Prob(A(i,j)≠0) = d/n, d << n^(1/2), i.i.d.
• Assumption: Algorithm is sparsity-independent: assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
    Words_moved = Ω(min( d·n/P^(1/2), d^2·n/P ))
  – Proof exploits fact that reuse of entries of C = A*B is unlikely
• Contrast general lower bound: Words_moved = Ω(d^2·n/(P·M^(1/2)))
• Attained by divide-and-conquer algorithm that splits matrices along dimensions most likely to minimize cost
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
Symmetric Eigenproblem and SVD
bull Usual approach for A=AT (SVD similar)ndash A QTAQ = T where Q orthogonal T tridiagonalndash T UTTU = Λ where U orthogonal Λ diagonalndash QUrsquos columns are eigenvectors Λ eigenvaluesndash Dense Tridiagonal Diagonalndash Only half BLAS3 half BLAS2 in LAPACKrsquos sytrd
bull Communication-Avoiding Approachndash A QAQT = B where B=BT banded of bandwidth M12
ndash Continue as above starting with Bndash Dense Banded Tridiagonal Diagonalndash Dense Banded use TSQR to zero out M12 colsrows at a timendash Banded Tridiagonal need new(ish) idea
b+1
b+1
Successive Band Reduction (BischofLangSun)
1
b+1
b+1
d+1
c
Successive Band Reduction (BischofLangSun)
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1Q1
b+1
b+1
d+1
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
12
Q1
b+1
b+1
d+1
d+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
12
Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
d+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
Q1
Q1T
Q2
Q2T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
Q1
Q1T
Q2
Q2T
Q3
Q3T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
6
6
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
Conventional vs CA - SBR
Conventional Communication-Avoiding
Touch all data 4 times Touch all data once
Limits to parallel scaling (22)
bull Consider dense case flops_per_proc = n3Pndash Words = (n3(PM12 ))ndash Messages = (n3(PM32 ))
bull How big can we make P and Mbull Assume we start with 1 copy of inputs A and B
ndash Otherwise no communication may be needed
bull Thm Words= (n2P23 ) independent of Mbull Reached when M = n2P23 too or P = n3M32 and Messages = (1) (log P in practice)bull Attained by 25D algorithm when c=P13 (ldquo3D algrdquo)bull Can keep increasing P until P = n3 Words = Messages = (1) (log n in practice)
Can we attain these lower bounds
bull Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these boundsndash Often not
bull If not are there other algorithms that dondash Yes for much of dense linear algebrandash New algorithms with new numerical properties
new ways to encode answers new data structures
ndash Not just loop transformations (need those too)bull Only a few sparse algorithms so farbull Lots of work in progress
ndash Algorithms Energy Heterogeneous Processors hellip11
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
25D Matrix Multiplication
bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid
c
(Pc)12
(Pc)12
Example P = 32 c = 2
25D Matrix Multiplication
bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid
k
j
iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12
(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)
(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)
(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)
25D Matmul on BGP 16K nodes 64K coresc = 16 copies
Distinguished Paper Award EuroParrsquo11 (Solomonik D)SCrsquo11 paper by Solomonik Bhatele D
12x faster
27x faster
Perfect Strong Scaling ndash in Time and Energy (12)
bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2
bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model
ndash γT βT αT = secs per flop per word_moved per message of size m
bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ]
= T(P)cbull Notation for energy model
ndash γE βE αE = joules for same operations
ndash δE = joules per word of memory used per sec
ndash εE = joules per sec for leakage etc
bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP)
= E(P)bull Perfect scaling extends to N-body Strassen hellip
Perfect Strong Scaling ndash in Time and Energy (22)
bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ] = T(P)c
bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP) = E(P)
bull Can use these formulas to answer many questions such asndash How to choose p and M to minimize energy E needed for computationndash Given max allowed runtime T what is minimum energy E needed to achieve
itndash Given max allowed energy E what is the minimum runtime T attainablendash Can we minimize the average power P = ETndash Given target energy efficiency what architectural parameters are needed to
achieve itbull Can we attain 75 GflopsWattbull Can we attain an exaflop for 20 MWatts
Handling Heterogeneity
bull Suppose each of P processors could differndash γi = secflop βi = secword αi = secmessage Mi = memory
bull What is optimal assignment of work Fi to minimize timendash Ti = Fi γi + Fi βi Mi
12 + Fi αi Mi32 = Fi [γi + βi Mi
12 + αi Mi32] = Fi ξi
ndash Choose Fi so Σi Fi = n3 and minimizing T = maxi Ti
ndash Answer Fi = n3(1ξi)Σj(1ξj) and T = n3Σj(1ξj)
bull Optimal Algorithm for nxn matmulndash Recursively divide into 8 half-sized subproblemsndash Assign subproblems to processor i to add up to Fi flops
bull Works for Strassen other algorithmshellip
Application to Tensor Contractions
bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply
bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory
bull Heavily used in electronic structure calculationsndash Ex NWChem
bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries
ndash Solomonik Hammond Matthews
C(ijk) = Σm A(ijm)B(mk)
A3-fold symm
B2-fold symm
C2-fold symm
Application to Tensor Contractions
bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply
bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory
bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn
bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews
Communication Lower Bounds for Strassen-like matmul algorithms
bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected
bull Extends up to M = n2 p2ω bull Extends to rectangular case multiply (mxn)(nxp) in q mults
ndash words_moved = Ω (flopsM^(logmpq -1))
bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz also in JACMbull Is the lower bound attainable
Classical O(n3) matmul
words_moved =Ω (M(nM12)3P)
Strassenrsquos O(nlg7) matmul
words_moved =Ω (M(nM12)lg7P)
Strassen-like O(nω) matmul
words_moved =Ω (M(nM12)ωP)
vs
Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory
Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory
CAPS If EnoughMemory and P 7 then BFS step else DFS step end if
Communication Avoiding Parallel Strassen (CAPS)
Best way to interleaveBFS and DFS is an tuning parameter
26
Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080
Speedups 24-184(over previous Strassen-based algorithms)
Invited to appear as Research Highlight in CACM
Strassen-like beyond matmul
bull Thm (D Dumitriu Holtzrsquo07) Any Strassen-like O(nω) matmul algorithm can be used to build a numerically stable O(nω+η) algorithm for any ηgt0 for Ax=b least squares eig SVD hellipndash ηgt0 needed to deal with numerical stabilityndash Strassen already stable so η=0
bull Thm For sequential versions of these algorithms Words_moved = O(nω+ηM(ω+η)2 ndash 1 + n2 log n) ie attain expected lower bound
Ballard D Holtz Schwartz
Cache and Network Oblivious Algorithms
bull Motivation Minimizes communication at every level of a hierarchical system without tuning parameters (in theory)ndash Not always 25D Matmul on BGP was topology aware
bull CAPS Divide-and-conquer choose BFS or DFS to adapt to processors available memory
bull CARMAndash Divide-and-conquer classical matmul divide largest of 3
dimensions to create two subproblemsndash Choose BFS or DFS to adapt to processors available memory
CARMA Performance Distributed Memory
Square m = k = n = 6144
ScaLAPACK
CARMA
Peak
(log)
(log)
Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA
CARMA Performance Distributed Memory
Inner Product m = n = 192 k = 6291456
ScaLAPACK
CARMAPeak
(log)
(log)
Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA
CARMA Performance Shared Memory
Square m = k = n
MKL (double)CARMA (double)
MKL (single)CARMA (single)
Peak (single)
Peak (double)
(log)
(linear)
Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA
CARMA Performance Shared Memory
Inner Product m = n = 64
MKL (double)
CARMA (double)
MKL (single)
CARMA (single)
(log)
(linear)
Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA
Why is CARMA Faster in Shared MemoryL3 Cache Misses
Shared Memory Inner Product (m = n = 64 k = 524288)
97 Fewer Misses
86 Fewer Misses
(linear)
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
One-sided Factorizations (LU QR) so farbull Classical Approach for i=1 to n update column i update trailing matrixbull words_moved = O(n3)
35
bull Blocked Approach (LAPACK) for i=1 to nb update block i of b columns update trailing matrixbull words moved = O(n3M13)
bull Recursive Approach func factor(A) if A has 1 column update it
else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)
bull None of these approaches minimizes messagesbull Parallel case Partial
Pivoting =gt n reductionsbull Need another idea
TSQR An Architecture-Dependent Algorithm
W =
W0
W1
W2
W3
R00
R10
R20
R30
R01
R11
R02Parallel
W =
W0
W1
W2
W3
R01R02
R00
R03
SequentialStreaming
W =
W0
W1
W2
W3
R00
R01
R01
R11
R02
R11
R03
Dual Core
Can choose reduction tree dynamically
Multicore Multisocket Multirack Multisite Out-of-core
Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo
Wnxb =
W1
W2
W3
W4
P1middotL1middotU1
P2middotL2middotU2
P3middotL3middotU3
P4middotL4middotU4
=
Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo
W1rsquoW2rsquoW3rsquoW4rsquo
P12middotL12middotU12
P34middotL34middotU34
=Choose b pivot rows call them W12rsquo
Choose b pivot rows call them W34rsquo
W12rsquoW34rsquo
= P1234middotL1234middotU1234
Choose b pivot rows
Go back to W and use these b pivot rows (move them to top do LU without pivoting)
37
Minimizing Communication in TSLU
W = W1
W2
W3
W4
LULULULU
LU
LULUParallel
W = W1
W2
W3
W4
LULU
LU
LUSequentialStreaming
W = W1
W2
W3
W4
LULU LU
LULU
LULU
Dual Core
Can choose reduction tree dynamically to match architecture as before
38
Making TSLU Numerically Stable
bull Details matterndash Going up the tree we could do LU either on original rows of A
(tournament pivoting) or computed rows of Undash Only tournament pivoting stable
bull ldquoThmrdquo New scheme as stable as Partial Pivoting (GEPP) in following sense Get same Schur complements as GEPP applied to different input matrix whose entries are blocks taken from input A
bull Why just a ldquoThmrdquo
39
Stability of LU using TSLU CALU
Summer School Lecture 4 40
bull Empirical testingndash Both random matrices and ldquospecial onesrdquondash Both binary tree (BCALU) and flat-tree (FCALU)ndash 3 metrics ||PA-LU||||A|| normwise and componentwise backward errorsndash See [D Grigori Xiang 2010] for details
Why is stability of TSLU just a ldquoThmrdquo
bull Proof is correct ndash in exact arithmeticbull Experiment
ndash Generate 100 random 6x6 rank 3 matrices in Matlabndash [LUP] = lu(A) do LU without pivoting on PA compare L factors are
they the samebull Compute || L ndash Lnp || A few 0rsquos A few infinrsquos a few NaNsbull Rest mostly O(1)
ndash Why Floating point is nonassociative doing arithmetic in different order gives different rounding errors
ndash Same experiment with rank 6 matrices || L ndash Lnp || usually nonzero O(macheps)
ndash Same experiment with 20x20 rank 4 matrices || L ndash Lnp || often O(103)
bull Much harder to break TSLU but possiblendash Occurred when using TSLU to factorize a low-rank subdiagonal
panel in symmetric-indefinite factorization41
Fixing TSLU
bull Run TSLU quickly test for stability fix if necessary (rare)
bull Test conditioning of U if not tiny (usual case) proceed elsebull Compute || L || if not big (usual case) proceed elsebull Factor A = QR using TSQR thenbull Factor Q = PLU using TSLU thenbull A = PL(UR) with UR as upper triangular factor
bull Last topic in lecture how to guarantee floating point reproducibility
42
2D CALU with Tournament Pivoting
43
25D CALU with Tournament Pivoting (c=4 copies)
44
Exascale Machine ParametersSource DOE Exascale Workshop
bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node
Exascale predicted speedupsfor Gaussian Elimination
2D CA-LU vs ScaLAPACK-LU
log2 (p)
log
2 (
n2p
) =
log
2 (m
emo
ry_p
er_p
roc)
Up to 29x
25D vs 2D LUWith and Without Pivoting
Other CA algorithms for Ax=b, least squares (1/3)

• A symmetric and indefinite
  – Seek a factorization that retains symmetry, P·A·P^T = L·D·L^T, with D "simple"
    • Saves half the flops, preserves inertia
  – Usual approach: Bunch-Kaufman
    • D block diagonal with 1x1 and 2x2 blocks
    • Pivot search down a column and along a row (lots of communication)
  – Alternative: Aasen
    • D = tridiagonal = T
    • Two steps:
      – P·A·P^T = L·T·L^T where T is banded, using TSLU
        [figure: the banded matrix T]
      – Solve/factor the narrow-band problem with T
    • Up to 2.8x faster than MKL; Best Paper at IPDPS'13
Other CA algorithms for Ax=b, least squares (2/3)

• Minimizing bandwidth and latency for sequential GEPP
  – So far, could not do partial pivoting and minimize messages, just words
  – Challenge:
    • Column layout good for choosing pivots, bad for matmul
    • Blocked layout good for matmul, bad for choosing pivots
  – Solution: use both layouts, switching between them
    • "Shape Morphing LU", or SMLU

• Recursive LU:
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        update right half of A
        factor(right half of A)
  Words = O(n^3 / M^(1/2)),  Messages = O(n^3 / M)

• Shape Morphing LU (SMLU):
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        reshape to recursive block format
        update right half of A
        reshape to columnwise format
        factor(right half of A)
  Words = O(n^3 / M^(1/2)),  Messages = O(n^3 / M^(3/2))

(A runnable serial version of the recursion follows.)
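The left/right recursion above is easy to run serially; below is a runnable, unpivoted rendering of it (my own) that shows the data-access pattern — SMLU additionally reshapes the storage layout between the two halves, which this sketch does not model.

```python
import numpy as np

def recursive_lu(A):
    """In-place recursive LU without pivoting on an m x n float array (m >= n).
    Needs nonzero leading pivots (e.g. a diagonally dominant matrix)."""
    n = A.shape[1]
    if n == 1:
        A[1:, 0] /= A[0, 0]                        # update the single column
        return A
    k = n // 2
    recursive_lu(A[:, :k])                         # factor left half
    L11 = np.tril(A[:k, :k], -1) + np.eye(k)       # unit lower triangle of the left factor
    A[:k, k:] = np.linalg.solve(L11, A[:k, k:])    # update right half: U12 = L11^{-1} A12
    A[k:, k:] -= A[k:, :k] @ A[:k, k:]             # Schur complement: A22 -= L21 U12
    recursive_lu(A[k:, k:])                        # factor right half
    return A

A = np.random.rand(8, 8) + 8 * np.eye(8)           # diagonally dominant: safe without pivoting
LU = recursive_lu(A.copy())
L, U = np.tril(LU, -1) + np.eye(8), np.triu(LU)
assert np.allclose(L @ U, A)
```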
Other CA algorithms for Ax=b, least squares (3/3)

• The need for pivoting arises beyond LU, e.g. in QR
  – Choose a permutation P so that the leading columns of AP = QR span the column space of A — Rank-Revealing QR (RRQR)
  – Usual approach, like partial pivoting:
    • Put the longest column first, update the rest of the matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit the lower bound
  – Use Tournament Pivoting (see the sketch after this list):
    • Each round of the tournament selects the best b columns from two groups of b columns, either using the usual approach or something better (Gu/Eisenstat)
    • Thm: this approach "reveals the rank" of A, in the sense that the leading r x r submatrix of R has singular values "near" the largest r singular values of A; ditto for the trailing submatrix
  – The idea extends to other pivoting schemes:
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDL^T with complete pivoting
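For the column version, a minimal sketch of one tournament, using QR with column pivoting as the "usual approach" on each small group (Gu/Eisenstat's strong RRQR would be the better local selector); the group sizes and helper names are illustrative.

```python
import numpy as np
from scipy.linalg import qr

def best_b_columns(A, cols, b):
    """Pick b columns among `cols` using QR with column pivoting on that subset."""
    _, _, piv = qr(A[:, cols], mode='economic', pivoting=True)
    return [cols[p] for p in piv[:b]]

def column_tournament(A, b):
    """Pairwise tournament over groups of b columns; the winners are candidate
    leading columns for a rank-revealing AP = QR."""
    groups = [list(range(i, min(i + b, A.shape[1]))) for i in range(0, A.shape[1], b)]
    while len(groups) > 1:
        nxt = [best_b_columns(A, groups[i] + groups[i + 1], b)
               for i in range(0, len(groups) - 1, 2)]
        if len(groups) % 2:
            nxt.append(groups[-1])
        groups = nxt
    return groups[0]
```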
Outline
• Review / extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism / nonassociativity
What about sparse matrices? (1/3)

• If the matrix quickly becomes dense, use a dense algorithm
• Ex: All-Pairs Shortest Path using Floyd-Warshall
• Similar to matmul: let D = A, then
    for k = 1:n
      for i = 1:n
        for j = 1:n
          D(i,j) = min( D(i,j), D(i,k) + D(k,j) )
• But we can't reorder the outer loop for 2.5D; need another idea
• Abbreviate D(i,j) = min( D(i,j), min_k( A(i,k) + B(k,j) ) ) by D = A*B
  – Dependencies ok, 2.5D works, just a different semiring
• Kleene's Algorithm (a runnable sketch follows):
    D = DC-APSP(A, n)
      D = A
      Partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
      D11 = DC-APSP(D11, n/2)
      D12 = D11 * D12
      D21 = D21 * D11
      D22 = D21 * D12
      D22 = DC-APSP(D22, n/2)
      D21 = D22 * D21
      D12 = D12 * D22
      D11 = D12 * D21
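A serial NumPy rendering of Kleene's blocked recursion, using the slide's min-plus "multiply" (which also mins against the destination); the diagonal of the input should be 0 and missing edges np.inf. The brute-force Floyd-Warshall check at the end is mine.

```python
import numpy as np

def semiring_mult(D, A, B):
    """D(i,j) = min( D(i,j), min_k A(i,k) + B(k,j) ) -- the slide's D = A*B."""
    return np.minimum(D, (A[:, :, None] + B[None, :, :]).min(axis=1))

def dc_apsp(D):
    n = D.shape[0]
    if n == 1:
        return D
    m = n // 2
    D11, D12, D21, D22 = D[:m, :m], D[:m, m:], D[m:, :m], D[m:, m:]
    D11 = dc_apsp(D11)
    D12 = semiring_mult(D12, D11, D12)
    D21 = semiring_mult(D21, D21, D11)
    D22 = semiring_mult(D22, D21, D12)
    D22 = dc_apsp(D22)
    D21 = semiring_mult(D21, D22, D21)
    D12 = semiring_mult(D12, D12, D22)
    D11 = semiring_mult(D11, D12, D21)
    return np.block([[D11, D12], [D21, D22]])

# quick check against the Floyd-Warshall triple loop
n, rng = 8, np.random.default_rng(1)
D0 = np.where(rng.random((n, n)) < 0.4, rng.random((n, n)), np.inf)
np.fill_diagonal(D0, 0.0)
F = D0.copy()
for k in range(n):
    F = np.minimum(F, F[:, [k]] + F[[k], :])
assert np.allclose(dc_apsp(D0), F)
```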
Performance of 2.5D APSP using Kleene's algorithm
[Figure: strong scaling on Hopper (Cray XE6, 1024 nodes = 24,576 cores); annotations "2x speedup" and "6.2x speedup".]
What about sparse matrices? (2/3)

• If parts of the matrix become dense, optimize those
• Ex: Cholesky on a matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): if all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on a w x w submatrix
• Thm: Words_moved = Ω( w^3 / M^(1/2) ), etc.
• Thm (George '73): nested dissection gives an optimal ordering for a 2D grid, a 3D grid, and similar matrices
  – w = n for a 2D n x n grid; w = n^2 for a 3D n x n x n grid
• Sequential multifrontal Cholesky attains the bounds
• PSPASES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains the 2D and 2.5D lower bounds (using optimal dense Cholesky on the separators)
What about sparse matrices? (3/3)

• If the matrix stays very sparse, the lower bound is unattainable; need a new one
• Ex: A, B both diagonal: no communication in the parallel case
• Ex: A, B both Erdős–Rényi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), i.i.d.
• Assumption: the algorithm is sparsity-independent: the assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: a parallel algorithm that is sparsity-independent and load balanced for Erdős–Rényi matmul satisfies (in expectation)
    Words_moved = Ω( min( d·n / P^(1/2), d^2·n / P ) )
  – Proof exploits the fact that reuse of entries of C = A·B is unlikely
• Contrast the general lower bound: Words_moved = Ω( d^2·n / (P·M^(1/2)) )
• Attained by a divide-and-conquer algorithm that splits matrices along the dimensions most likely to minimize cost
Outline
• Review / extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism / nonassociativity
Symmetric Eigenproblem and SVD

• Usual approach for A = A^T (SVD similar)
  – A → Q^T·A·Q = T, where Q orthogonal, T tridiagonal
  – T → U^T·T·U = Λ, where U orthogonal, Λ diagonal
  – (Q·U)'s columns are the eigenvectors, Λ the eigenvalues
  – Dense → Tridiagonal → Diagonal
  – Only half BLAS3, half BLAS2, in LAPACK's sytrd
• Communication-Avoiding approach
  – A → Q·A·Q^T = B, where B = B^T banded, of bandwidth M^(1/2)
  – Continue as above, starting with B
  – Dense → Banded → Tridiagonal → Diagonal
  – Dense → Banded: use TSQR to zero out M^(1/2) cols/rows at a time
  – Banded → Tridiagonal: needs a new(ish) idea
Successive Band Reduction (Bischof/Lang/Sun)

[Figure: animation of bulge-chasing on a symmetric band matrix. Notation: b = bandwidth, c = #columns, d = #diagonals eliminated per sweep, with the constraint c + d ≤ b. Each orthogonal transform Q_i is applied from both sides (Q_i ... Q_i^T): Q_1 eliminates a parallelogram of c columns and d diagonals from the band, creating a bulge of size d+c; Q_2, Q_3, ... chase the bulge down the band, and the process repeats (bulges 1 through 6 shown; block sizes annotated b+1, d+1, c, d+c).]

Conventional vs CA-SBR: the conventional scheme touches all the data 4 times; the communication-avoiding scheme touches all the data once.
Can we attain these lower bounds?

• Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these bounds?
  – Often not
• If not, are there other algorithms that do?
  – Yes, for much of dense linear algebra
  – New algorithms, with new numerical properties, new ways to encode answers, new data structures
  – Not just loop transformations (need those too)
• Only a few sparse algorithms so far
• Lots of work in progress
  – Algorithms, Energy, Heterogeneous Processors, ...
Outline
• Review / extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism / nonassociativity
2.5D Matrix Multiplication

• Assume we can fit c·n^2/P data per processor, c > 1
• Processors form a (P/c)^(1/2) x (P/c)^(1/2) x c grid
  [Figure: processor grid; example P = 32, c = 2.]
2.5D Matrix Multiplication

• Assume we can fit c·n^2/P data per processor, c > 1
• Processors form a (P/c)^(1/2) x (P/c)^(1/2) x c grid, indexed by (i, j, k)
• Initially P(i,j,0) owns A(i,j) and B(i,j), each of size n·(c/P)^(1/2) x n·(c/P)^(1/2)

(1) P(i,j,0) broadcasts A(i,j) and B(i,j) to P(i,j,k)
(2) Processors at level k perform 1/c-th of SUMMA, i.e. 1/c-th of Σ_m A(i,m)·B(m,j)
(3) Sum-reduce the partial sums Σ_m A(i,m)·B(m,j) along the k-axis, so that P(i,j,0) owns C(i,j)
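A serial simulation of steps (1)–(3), just to show how the m-summation is split across the c layers and then sum-reduced; the grid sizes below are illustrative, and a real implementation distributes the blocks over MPI ranks rather than looping.

```python
import numpy as np

def matmul_25d_sim(A, B, p1=4, c=2):
    """Simulate C = A @ B on a logical p1 x p1 x c grid: layer k does 1/c of the
    SUMMA summation over m, then the layers are sum-reduced (this toy version
    requires p1 % c == 0 and n % p1 == 0)."""
    n = A.shape[0]
    bs = n // p1
    C_layers = np.zeros((c, n, n))
    for k in range(c):
        for m in range(k * p1 // c, (k + 1) * p1 // c):          # this layer's share of m
            for i in range(p1):
                for j in range(p1):
                    C_layers[k, i*bs:(i+1)*bs, j*bs:(j+1)*bs] += (
                        A[i*bs:(i+1)*bs, m*bs:(m+1)*bs] @ B[m*bs:(m+1)*bs, j*bs:(j+1)*bs])
    return C_layers.sum(axis=0)                                  # sum-reduce along the k-axis

A, B = np.random.rand(64, 64), np.random.rand(64, 64)
assert np.allclose(matmul_25d_sim(A, B), A @ B)
```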
2.5D Matmul on BG/P, 16K nodes / 64K cores, c = 16 copies
[Figure: strong scaling of 2.5D vs 2D matmul; annotations "12x faster" and "2.7x faster".]
Distinguished Paper Award, EuroPar'11 (Solomonik, D.); SC'11 paper by Solomonik, Bhatele, D.
Perfect Strong Scaling — in Time and Energy (1/2)

• Every time you add a processor, you should use its memory M too
• Start with the minimal number of procs: P·M = 3n^2
• Increase P by a factor of c ⇒ total memory increases by a factor of c
• Notation for the timing model:
  – γ_T, β_T, α_T = seconds per flop, per word moved, per message of size m
  – T(cP) = n^3/(cP) · [ γ_T + β_T/M^(1/2) + α_T/(m·M^(1/2)) ] = T(P)/c
• Notation for the energy model:
  – γ_E, β_E, α_E = joules for the same operations
  – δ_E = joules per word of memory used per second
  – ε_E = joules per second for leakage, etc.
  – E(cP) = cP · { n^3/(cP) · [ γ_E + β_E/M^(1/2) + α_E/(m·M^(1/2)) ] + δ_E·M·T(cP) + ε_E·T(cP) } = E(P)
• Perfect scaling extends to N-body, Strassen, ...
Perfect Strong Scaling — in Time and Energy (2/2)

• T(cP) = n^3/(cP) · [ γ_T + β_T/M^(1/2) + α_T/(m·M^(1/2)) ] = T(P)/c
• E(cP) = cP · { n^3/(cP) · [ γ_E + β_E/M^(1/2) + α_E/(m·M^(1/2)) ] + δ_E·M·T(cP) + ε_E·T(cP) } = E(P)
• Can use these formulas to answer many questions, such as (a small evaluator follows):
  – How to choose P and M to minimize the energy E needed for a computation?
  – Given a maximum allowed runtime T, what is the minimum energy E needed to achieve it?
  – Given a maximum allowed energy E, what is the minimum runtime T attainable?
  – Can we minimize the average power P = E/T?
  – Given a target energy efficiency, what architectural parameters are needed to achieve it?
    • Can we attain 75 Gflops/Watt?
    • Can we attain an exaflop for 20 MWatts?
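These two formulas are easy to play with numerically; the helper below just evaluates them for user-supplied machine constants (all parameter values in the demo are made up), reading the energy expression as cP times all three bracketed terms so that E(cP) = E(P) holds.

```python
def time_and_energy(n, P, c, M, m, gT, bT, aT, gE, bE, aE, dE, eE):
    """T(cP) and E(cP) from the slide's models (gT..aT in seconds, gE..eE in joules)."""
    flops = n**3 / (c * P)                       # flops per processor
    T = flops * (gT + bT / M**0.5 + aT / (m * M**0.5))
    E = c * P * (flops * (gE + bE / M**0.5 + aE / (m * M**0.5)) + dE * M * T + eE * T)
    return T, E

base = dict(n=4096, P=64, M=3 * 4096**2 / 64, m=1000,
            gT=1e-11, bT=1e-9, aT=1e-6, gE=1e-10, bE=1e-9, aE=1e-7, dE=1e-12, eE=10.0)
T1, E1 = time_and_energy(c=1, **base)
T4, E4 = time_and_energy(c=4, **base)
# perfect strong scaling: T4 == T1/4 and E4 == E1 (up to roundoff)
```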
Handling Heterogeneity

• Suppose each of the P processors could differ
  – γ_i = sec/flop, β_i = sec/word, α_i = sec/message, M_i = memory
• What is the optimal assignment of work F_i to minimize the time?
  – T_i = F_i·γ_i + F_i·β_i/M_i^(1/2) + F_i·α_i/M_i^(3/2) = F_i·[ γ_i + β_i/M_i^(1/2) + α_i/M_i^(3/2) ] = F_i·ξ_i
  – Choose F_i so that Σ_i F_i = n^3 and T = max_i T_i is minimized
  – Answer: F_i = n^3·(1/ξ_i)/Σ_j(1/ξ_j) and T = n^3/Σ_j(1/ξ_j)  (evaluated in the sketch below)
• Optimal algorithm for n x n matmul:
  – Recursively divide into 8 half-sized subproblems
  – Assign subproblems to processor i so they add up to F_i flops
• Works for Strassen, other algorithms, ...
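The closed-form answer is a one-liner to evaluate; this sketch (names are mine) computes ξ_i and the resulting work split and time.

```python
def split_work(n, gammas, betas, alphas, mems):
    """F_i = n^3 (1/xi_i) / sum_j (1/xi_j) and T = n^3 / sum_j (1/xi_j),
    with xi_i = gamma_i + beta_i / M_i^0.5 + alpha_i / M_i^1.5."""
    xi = [g + b / M**0.5 + a / M**1.5 for g, b, a, M in zip(gammas, betas, alphas, mems)]
    inv = [1.0 / x for x in xi]
    T = n**3 / sum(inv)
    F = [T * v for v in inv]          # so that F_i * xi_i = T for every processor
    return F, T
```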
Application to Tensor Contractions

• Ex: C(i,j,k) = Σ_{m,n} A(i,j,m,n)·B(m,n,k)  (see the einsum snippet below)
  – Communication lower bounds apply
• Complex symmetries possible
  – Ex: B(m,n,k) = B(k,m,n) = ...
  – d-fold symmetry can save up to a d-fold factor in flops/memory
• Heavily used in electronic structure calculations
  – Ex: NWChem, for the coupled cluster (CC) approach to the Schrödinger equation
• CTF: Cyclops Tensor Framework
  – Exploits 2.5D algorithms, symmetries
  – Up to 3x faster running CC than NWChem on 3072 cores of a Cray XE6
  – Solomonik, Hammond, Matthews

[Figure: C(i,j,k) = Σ_m A(i,j,m)·B(m,k), with A 3-fold symmetric, B 2-fold symmetric, C 2-fold symmetric.]
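In NumPy the first contraction is a single einsum call (the shapes below are arbitrary); exploiting the symmetries to save flops and memory is exactly what CTF adds on top.

```python
import numpy as np

A = np.random.rand(4, 4, 5, 6)          # A(i,j,m,n)
B = np.random.rand(5, 6, 7)             # B(m,n,k)
C = np.einsum('ijmn,mnk->ijk', A, B)    # C(i,j,k) = sum_{m,n} A(i,j,m,n) * B(m,n,k)
```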
Communication Lower Bounds for Strassen-like Matmul Algorithms

• Classical O(n^3) matmul:        words_moved = Ω( M·(n/M^(1/2))^3 / P )
• Strassen's O(n^lg7) matmul:     words_moved = Ω( M·(n/M^(1/2))^lg7 / P )
• Strassen-like O(n^ω) matmul:    words_moved = Ω( M·(n/M^(1/2))^ω / P )

• Proof: graph expansion (different from classical matmul)
  – The Strassen-like DAG must be "regular" and connected
• Extends up to M = n^2 / P^(2/ω)
• Extends to the rectangular case: multiply (m x n)·(n x p) in q mults
  – words_moved = Ω( flops / M^(log_{mp} q − 1) )
• Best Paper Prize (SPAA'11), Ballard, D., Holtz, Schwartz; also in JACM
• Is the lower bound attainable?
Communication-Avoiding Parallel Strassen (CAPS)

• BFS step: run all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory
• DFS step: run all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory

    CAPS: if EnoughMemory and P ≥ 7 then BFS step else DFS step

• The best way to interleave BFS and DFS steps is a tuning parameter
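For reference, a serial Strassen recursion (n a power of two; the cutoff is my choice); in CAPS the only extra decision is whether the seven recursive products run concurrently on P/7 processors (a BFS step) or one after another on all P (a DFS step).

```python
import numpy as np

def strassen(A, B, cutoff=64):
    n = A.shape[0]
    if n <= cutoff:
        return A @ B                                  # classical base case
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    M1 = strassen(A11 + A22, B11 + B22, cutoff)       # the 7 products: run in
    M2 = strassen(A21 + A22, B11, cutoff)             # parallel (BFS) or in
    M3 = strassen(A11, B12 - B22, cutoff)             # sequence (DFS) in CAPS
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)
    return np.block([[M1 + M4 - M5 + M7, M3 + M5],
                     [M2 + M4,           M1 - M2 + M3 + M6]])

A, B = np.random.rand(128, 128), np.random.rand(128, 128)
assert np.allclose(strassen(A, B, cutoff=32), A @ B)
```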
Performance Benchmarking, Strong Scaling Plot: Franklin (Cray XT4), n = 94080
[Figure: strong scaling; speedups of 24%–184% over previous Strassen-based algorithms.]
Invited to appear as a Research Highlight in CACM.
Strassen-like Beyond Matmul

• Thm (D., Dumitriu, Holtz '07): any Strassen-like O(n^ω) matmul algorithm can be used to build a numerically stable O(n^(ω+η)) algorithm, for any η > 0, for Ax=b, least squares, eig, SVD, ...
  – η > 0 is needed to deal with numerical stability
  – Strassen itself is already stable, so η = 0
• Thm: for sequential versions of these algorithms,
    Words_moved = O( n^(ω+η) / M^((ω+η)/2 − 1) + n^2·log n ),
  i.e. they attain the expected lower bound
  [Ballard, D., Holtz, Schwartz]
Cache- and Network-Oblivious Algorithms

• Motivation: minimize communication at every level of a hierarchical system without tuning parameters (in theory)
  – Not always: 2.5D Matmul on BG/P was topology-aware
• CAPS: divide-and-conquer; choose BFS or DFS to adapt to #processors and available memory
• CARMA (see the sketch below):
  – Divide-and-conquer classical matmul: divide the largest of the 3 dimensions to create two subproblems
  – Choose BFS or DFS to adapt to #processors and available memory
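A serial sketch of the CARMA splitting rule (always bisect the largest of m, k, n); the BFS/DFS choice and the communication-avoiding data layout are not modeled here.

```python
import numpy as np

def carma(A, B, cutoff=64):
    m, k = A.shape
    n = B.shape[1]
    if max(m, k, n) <= cutoff:
        return A @ B
    if m >= k and m >= n:                 # split the rows of A
        h = m // 2
        return np.vstack([carma(A[:h], B, cutoff), carma(A[h:], B, cutoff)])
    if n >= k:                            # split the columns of B
        h = n // 2
        return np.hstack([carma(A, B[:, :h], cutoff), carma(A, B[:, h:], cutoff)])
    h = k // 2                            # split the shared dimension, add the halves
    return carma(A[:, :h], B[:h], cutoff) + carma(A[:, h:], B[h:], cutoff)

A, B = np.random.rand(100, 30), np.random.rand(30, 200)
assert np.allclose(carma(A, B, cutoff=16), A @ B)
```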
CARMA Performance, Distributed Memory (Cray XE6 "Hopper"; each node 2 x 12-core, 4 x NUMA)
[Figure: square case, m = k = n = 6144 — CARMA vs ScaLAPACK vs peak, log–log axes.]
[Figure: inner-product-shaped case, m = n = 192, k = 6,291,456 — CARMA vs ScaLAPACK vs peak, log–log axes.]

CARMA Performance, Shared Memory (Intel Emerald: 4 x Intel Xeon X7560, 8 cores each, 4 x NUMA)
[Figure: square case, m = k = n — CARMA vs MKL in single and double precision, against single/double peak.]
[Figure: inner-product-shaped case, m = n = 64 — CARMA vs MKL in single and double precision.]

Why is CARMA faster in shared memory? L3 cache misses.
[Figure: shared-memory inner product (m = n = 64, k = 524288); annotations "97% fewer misses" and "86% fewer misses".]
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
One-sided Factorizations (LU QR) so farbull Classical Approach for i=1 to n update column i update trailing matrixbull words_moved = O(n3)
35
bull Blocked Approach (LAPACK) for i=1 to nb update block i of b columns update trailing matrixbull words moved = O(n3M13)
bull Recursive Approach func factor(A) if A has 1 column update it
else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)
bull None of these approaches minimizes messagesbull Parallel case Partial
Pivoting =gt n reductionsbull Need another idea
TSQR An Architecture-Dependent Algorithm
W =
W0
W1
W2
W3
R00
R10
R20
R30
R01
R11
R02Parallel
W =
W0
W1
W2
W3
R01R02
R00
R03
SequentialStreaming
W =
W0
W1
W2
W3
R00
R01
R01
R11
R02
R11
R03
Dual Core
Can choose reduction tree dynamically
Multicore Multisocket Multirack Multisite Out-of-core
Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo
Wnxb =
W1
W2
W3
W4
P1middotL1middotU1
P2middotL2middotU2
P3middotL3middotU3
P4middotL4middotU4
=
Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo
W1rsquoW2rsquoW3rsquoW4rsquo
P12middotL12middotU12
P34middotL34middotU34
=Choose b pivot rows call them W12rsquo
Choose b pivot rows call them W34rsquo
W12rsquoW34rsquo
= P1234middotL1234middotU1234
Choose b pivot rows
Go back to W and use these b pivot rows (move them to top do LU without pivoting)
37
Minimizing Communication in TSLU
W = W1
W2
W3
W4
LULULULU
LU
LULUParallel
W = W1
W2
W3
W4
LULU
LU
LUSequentialStreaming
W = W1
W2
W3
W4
LULU LU
LULU
LULU
Dual Core
Can choose reduction tree dynamically to match architecture as before
38
Making TSLU Numerically Stable
bull Details matterndash Going up the tree we could do LU either on original rows of A
(tournament pivoting) or computed rows of Undash Only tournament pivoting stable
bull ldquoThmrdquo New scheme as stable as Partial Pivoting (GEPP) in following sense Get same Schur complements as GEPP applied to different input matrix whose entries are blocks taken from input A
bull Why just a ldquoThmrdquo
39
Stability of LU using TSLU CALU
Summer School Lecture 4 40
bull Empirical testingndash Both random matrices and ldquospecial onesrdquondash Both binary tree (BCALU) and flat-tree (FCALU)ndash 3 metrics ||PA-LU||||A|| normwise and componentwise backward errorsndash See [D Grigori Xiang 2010] for details
Why is stability of TSLU just a ldquoThmrdquo
bull Proof is correct ndash in exact arithmeticbull Experiment
ndash Generate 100 random 6x6 rank 3 matrices in Matlabndash [LUP] = lu(A) do LU without pivoting on PA compare L factors are
they the samebull Compute || L ndash Lnp || A few 0rsquos A few infinrsquos a few NaNsbull Rest mostly O(1)
ndash Why Floating point is nonassociative doing arithmetic in different order gives different rounding errors
ndash Same experiment with rank 6 matrices || L ndash Lnp || usually nonzero O(macheps)
ndash Same experiment with 20x20 rank 4 matrices || L ndash Lnp || often O(103)
bull Much harder to break TSLU but possiblendash Occurred when using TSLU to factorize a low-rank subdiagonal
panel in symmetric-indefinite factorization41
Fixing TSLU
bull Run TSLU quickly test for stability fix if necessary (rare)
bull Test conditioning of U if not tiny (usual case) proceed elsebull Compute || L || if not big (usual case) proceed elsebull Factor A = QR using TSQR thenbull Factor Q = PLU using TSLU thenbull A = PL(UR) with UR as upper triangular factor
bull Last topic in lecture how to guarantee floating point reproducibility
42
2D CALU with Tournament Pivoting
43
25D CALU with Tournament Pivoting (c=4 copies)
44
Exascale Machine ParametersSource DOE Exascale Workshop
bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node
Exascale predicted speedupsfor Gaussian Elimination
2D CA-LU vs ScaLAPACK-LU
log2 (p)
log
2 (
n2p
) =
log
2 (m
emo
ry_p
er_p
roc)
Up to 29x
25D vs 2D LUWith and Without Pivoting
Other CA algorithms for Ax=b least squares(13)
bull A symmetric and indefinitendash Seek factorization that retains symmetry PAPT = LDLT D
ldquosimplerdquobull Save frac12 flops preserve inertia
ndash Usual approach Bunch-Kaufmanbull D block diagonal with 1x1 and 2x2 blocksbull Pivot search down column along row (lots of communication)
ndash Alternative Aasenbull D = tridiagonal = Tbull Two steps
ndash PAPT = LTLT where T is banded using TSLU
48
0 0
0
0 0
0
0
hellip
hellip
ndash Solvefactor narrow band problem with Tbull Up to 28x faster than MKL Best Paper at IPDPSrsquo13
Other CA algorithms for Ax=b least squares (23)bull Minimizing bandwidth and latency for sequential GEPP
ndash So far could not do partial pivoting and minimize messages just words
ndash Challengebull Column layout good for choosing pivots bad for matmulbull Blocked layout good for matmul bad for choosing pivots
ndash Solution use both layouts switching between thembull ldquoShape Morphing LUrdquo or SMLU
49
bull func factor(A) if A has 1 column update it else factor(left half of A)
update right half of A
factor(right half of A)
bull Words = O(n3M12)
bull Messages = O(n3M)
bull func factor(A) if A has 1 column update it else factor(left half of A) reshape to recursive block format update right half of A reshape to columnwise format factor(right half of A)
bull Words = O(n3M12)
bull Messages = O(n3M32)
Other CA algorithms for Ax=b least squares (33)bull Need for pivoting arises beyond LU in QR
ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)
ndash Usual approach like Partial Pivoting
bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound
ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two
groups of b columns either using usual approach or something better (GuEisenstat)
bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix
ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 50
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
What about sparse matrices (13)
bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then
bull But canrsquot reorder outer loop for 25D need another idea
bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring
bull Kleenersquos Algorithm
52
for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)
D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21
Performance of 25D APSP using Kleene
53
Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)
62xspeedup
2x speedup
What about sparse matrices (23)
bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of
G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix
bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering
for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid
bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse
multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on
separators) 54
What about sparse matrices (33)
bull If matrix stays very sparse lower bound unattainable new one
bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of
data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely
bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices
along dimensions most likely to minimize cost
55
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
Symmetric Eigenproblem and SVD
bull Usual approach for A=AT (SVD similar)ndash A QTAQ = T where Q orthogonal T tridiagonalndash T UTTU = Λ where U orthogonal Λ diagonalndash QUrsquos columns are eigenvectors Λ eigenvaluesndash Dense Tridiagonal Diagonalndash Only half BLAS3 half BLAS2 in LAPACKrsquos sytrd
bull Communication-Avoiding Approachndash A QAQT = B where B=BT banded of bandwidth M12
ndash Continue as above starting with Bndash Dense Banded Tridiagonal Diagonalndash Dense Banded use TSQR to zero out M12 colsrows at a timendash Banded Tridiagonal need new(ish) idea
b+1
b+1
Successive Band Reduction (BischofLangSun)
1
b+1
b+1
d+1
c
Successive Band Reduction (BischofLangSun)
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1Q1
b+1
b+1
d+1
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
12
Q1
b+1
b+1
d+1
d+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
12
Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
d+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
Q1
Q1T
Q2
Q2T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
Q1
Q1T
Q2
Q2T
Q3
Q3T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
6
6
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
Conventional vs CA - SBR
Conventional Communication-Avoiding
Touch all data 4 times Touch all data once
25D Matrix Multiplication
bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid
c
(Pc)12
(Pc)12
Example P = 32 c = 2
25D Matrix Multiplication
bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid
k
j
iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12
(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)
(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)
(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)
25D Matmul on BGP 16K nodes 64K coresc = 16 copies
Distinguished Paper Award EuroParrsquo11 (Solomonik D)SCrsquo11 paper by Solomonik Bhatele D
12x faster
27x faster
Perfect Strong Scaling ndash in Time and Energy (12)
bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2
bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model
ndash γT βT αT = secs per flop per word_moved per message of size m
bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ]
= T(P)cbull Notation for energy model
ndash γE βE αE = joules for same operations
ndash δE = joules per word of memory used per sec
ndash εE = joules per sec for leakage etc
bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP)
= E(P)bull Perfect scaling extends to N-body Strassen hellip
Perfect Strong Scaling ndash in Time and Energy (22)
bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ] = T(P)c
bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP) = E(P)
bull Can use these formulas to answer many questions such asndash How to choose p and M to minimize energy E needed for computationndash Given max allowed runtime T what is minimum energy E needed to achieve
itndash Given max allowed energy E what is the minimum runtime T attainablendash Can we minimize the average power P = ETndash Given target energy efficiency what architectural parameters are needed to
achieve itbull Can we attain 75 GflopsWattbull Can we attain an exaflop for 20 MWatts
Handling Heterogeneity
bull Suppose each of P processors could differndash γi = secflop βi = secword αi = secmessage Mi = memory
bull What is optimal assignment of work Fi to minimize timendash Ti = Fi γi + Fi βi Mi
12 + Fi αi Mi32 = Fi [γi + βi Mi
12 + αi Mi32] = Fi ξi
ndash Choose Fi so Σi Fi = n3 and minimizing T = maxi Ti
ndash Answer Fi = n3(1ξi)Σj(1ξj) and T = n3Σj(1ξj)
bull Optimal Algorithm for nxn matmulndash Recursively divide into 8 half-sized subproblemsndash Assign subproblems to processor i to add up to Fi flops
bull Works for Strassen other algorithmshellip
Application to Tensor Contractions
bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply
bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory
bull Heavily used in electronic structure calculationsndash Ex NWChem
bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries
ndash Solomonik Hammond Matthews
C(ijk) = Σm A(ijm)B(mk)
A3-fold symm
B2-fold symm
C2-fold symm
Application to Tensor Contractions
bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply
bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory
bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn
bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews
Communication Lower Bounds for Strassen-like matmul algorithms
bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected
bull Extends up to M = n2 p2ω bull Extends to rectangular case multiply (mxn)(nxp) in q mults
ndash words_moved = Ω (flopsM^(logmpq -1))
bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz also in JACMbull Is the lower bound attainable
Classical O(n3) matmul
words_moved =Ω (M(nM12)3P)
Strassenrsquos O(nlg7) matmul
words_moved =Ω (M(nM12)lg7P)
Strassen-like O(nω) matmul
words_moved =Ω (M(nM12)ωP)
vs
Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory
Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory
CAPS If EnoughMemory and P 7 then BFS step else DFS step end if
Communication Avoiding Parallel Strassen (CAPS)
Best way to interleaveBFS and DFS is an tuning parameter
26
Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080
Speedups 24-184(over previous Strassen-based algorithms)
Invited to appear as Research Highlight in CACM
Strassen-like beyond matmul
bull Thm (D Dumitriu Holtzrsquo07) Any Strassen-like O(nω) matmul algorithm can be used to build a numerically stable O(nω+η) algorithm for any ηgt0 for Ax=b least squares eig SVD hellipndash ηgt0 needed to deal with numerical stabilityndash Strassen already stable so η=0
bull Thm For sequential versions of these algorithms Words_moved = O(nω+ηM(ω+η)2 ndash 1 + n2 log n) ie attain expected lower bound
Ballard D Holtz Schwartz
Cache and Network Oblivious Algorithms
bull Motivation Minimizes communication at every level of a hierarchical system without tuning parameters (in theory)ndash Not always 25D Matmul on BGP was topology aware
bull CAPS Divide-and-conquer choose BFS or DFS to adapt to processors available memory
bull CARMAndash Divide-and-conquer classical matmul divide largest of 3
dimensions to create two subproblemsndash Choose BFS or DFS to adapt to processors available memory
CARMA Performance Distributed Memory
Square m = k = n = 6144
ScaLAPACK
CARMA
Peak
(log)
(log)
Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA
CARMA Performance Distributed Memory
Inner Product m = n = 192 k = 6291456
ScaLAPACK
CARMAPeak
(log)
(log)
Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA
CARMA Performance Shared Memory
Square m = k = n
MKL (double)CARMA (double)
MKL (single)CARMA (single)
Peak (single)
Peak (double)
(log)
(linear)
Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA
CARMA Performance Shared Memory
Inner Product m = n = 64
MKL (double)
CARMA (double)
MKL (single)
CARMA (single)
(log)
(linear)
Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA
Why is CARMA Faster in Shared MemoryL3 Cache Misses
Shared Memory Inner Product (m = n = 64 k = 524288)
97 Fewer Misses
86 Fewer Misses
(linear)
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
One-sided Factorizations (LU QR) so farbull Classical Approach for i=1 to n update column i update trailing matrixbull words_moved = O(n3)
35
bull Blocked Approach (LAPACK) for i=1 to nb update block i of b columns update trailing matrixbull words moved = O(n3M13)
bull Recursive Approach func factor(A) if A has 1 column update it
else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)
bull None of these approaches minimizes messagesbull Parallel case Partial
Pivoting =gt n reductionsbull Need another idea
TSQR An Architecture-Dependent Algorithm
W =
W0
W1
W2
W3
R00
R10
R20
R30
R01
R11
R02Parallel
W =
W0
W1
W2
W3
R01R02
R00
R03
SequentialStreaming
W =
W0
W1
W2
W3
R00
R01
R01
R11
R02
R11
R03
Dual Core
Can choose reduction tree dynamically
Multicore Multisocket Multirack Multisite Out-of-core
Back to LU: Using similar idea for TSLU as TSQR: use reduction tree to do "Tournament Pivoting"
W (n x b) = [W1; W2; W3; W4]
• Factor each block Wi = Pi·Li·Ui and choose b pivot rows of Wi, call them Wi'
• Stack pairs and factor: [W1'; W2'] = P12·L12·U12 and [W3'; W4'] = P34·L34·U34; choose b pivot rows from each, call them W12' and W34'
• Factor [W12'; W34'] = P1234·L1234·U1234 and choose the final b pivot rows
• Go back to W and use these b pivot rows (move them to the top, then do LU without pivoting)
A serial sketch of one such tournament follows below.
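A serial sketch of tournament pivoting on a tall-skinny panel: each leaf selects b candidate pivot rows with ordinary partial pivoting, and candidates are merged pairwise up a binary tree (block count, panel size, and the helper names are illustrative assumptions):

    import numpy as np

    def pivot_rows(W, b):
        # Rows that Gaussian elimination with partial pivoting would pick as
        # the first b pivot rows of W (stand-in for the local Pi*Li*Ui step).
        A, rows = W.copy(), np.arange(W.shape[0])
        for k in range(b):
            p = k + np.argmax(np.abs(A[k:, k]))               # largest entry in column k
            A[[k, p]], rows[[k, p]] = A[[p, k]], rows[[p, k]]  # swap rows k and p
            A[k+1:, k:] -= np.outer(A[k+1:, k] / A[k, k], A[k, k:])
        return W[np.sort(rows[:b])]

    def tournament(W, b, nblocks=4):
        cands = [pivot_rows(Wi, b) for Wi in np.array_split(W, nblocks, axis=0)]
        while len(cands) > 1:                                  # pairwise merge up the tree
            cands = [pivot_rows(np.vstack(cands[i:i+2]), b)
                     for i in range(0, len(cands), 2)]
        return cands[0]

    W = np.random.rand(4096, 8)
    print(tournament(W, b=8).shape)                            # (8, 8): the chosen pivot rows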
Minimizing Communication in TSLU
W = [W1; W2; W3; W4]
• Parallel (binary tree): LU of each Wi, then LU of stacked candidate pivot rows, pairwise up the tree
• Sequential / streaming (flat tree): LU of W1, then fold in W2, W3, W4 one block at a time
• Dual core: a hybrid of the two trees
Can choose reduction tree dynamically, to match architecture, as before
Making TSLU Numerically Stable
• Details matter
– Going up the tree, we could do LU either on original rows of A (tournament pivoting) or on computed rows of U
– Only tournament pivoting is stable
• "Thm": New scheme as stable as Partial Pivoting (GEPP) in the following sense: we get the same Schur complements as GEPP applied to a different input matrix whose entries are blocks taken from the input A
• Why just a "Thm"?
Stability of LU using TSLU: CALU
• Empirical testing
– Both random matrices and "special ones"
– Both binary tree (BCALU) and flat-tree (FCALU)
– 3 metrics: ||PA−LU||/||A||, normwise and componentwise backward errors
– See [D., Grigori, Xiang, 2010] for details
Why is stability of TSLU just a "Thm"?
• Proof is correct – in exact arithmetic
• Experiment
– Generate 100 random 6x6, rank-3 matrices in Matlab
– [L,U,P] = lu(A), do LU without pivoting on P·A, compare L factors: are they the same?
• Compute || L – Lnp ||: a few 0's, a few ∞'s, a few NaNs
• Rest mostly O(1)
– Why? Floating point is nonassociative; doing arithmetic in a different order gives different rounding errors
– Same experiment with rank-6 matrices: || L – Lnp || usually nonzero, O(macheps)
– Same experiment with 20x20 rank-4 matrices: || L – Lnp || often O(10^3)
• Much harder to break TSLU, but possible
– Occurred when using TSLU to factorize a low-rank subdiagonal panel in symmetric-indefinite factorization
Fixing TSLU
• Run TSLU quickly, test for stability, fix if necessary (rare)
• Test conditioning of U; if not tiny (usual case), proceed, else
• Compute || L ||; if not big (usual case), proceed, else
• Factor A = QR using TSQR, then
• Factor Q = PLU using TSLU, then
• A = P·L·(U·R), with U·R as the upper triangular factor
• Last topic in lecture: how to guarantee floating-point reproducibility
2D CALU with Tournament Pivoting
43
2.5D CALU with Tournament Pivoting (c = 4 copies)
44
Exascale Machine Parameters (Source: DOE Exascale Workshop)
• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node
Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU
[contour plot: x-axis log2(p), y-axis log2(n^2/p) = log2(memory_per_proc); predicted speedups up to 29x]
2.5D vs 2D LU, With and Without Pivoting
Other CA algorithms for Ax=b, least squares (1/3)
• A symmetric and indefinite
– Seek factorization that retains symmetry: P·A·P^T = L·D·L^T, with D "simple"
• Save 1/2 the flops, preserve inertia
– Usual approach: Bunch-Kaufman
• D block diagonal with 1x1 and 2x2 blocks
• Pivot search down column, along row (lots of communication)
– Alternative: Aasen
• D = tridiagonal = T
• Two steps:
– P·A·P^T = L·T·L^T where T is banded, using TSLU
[figure: the banded matrix T, zero outside a narrow band]
– Solve/factor the narrow-band problem with T
• Up to 2.8x faster than MKL; Best Paper at IPDPS'13
Other CA algorithms for Ax=b, least squares (2/3)
• Minimizing bandwidth and latency for sequential GEPP
– So far, could not do partial pivoting and minimize #messages, just #words
– Challenge:
• Column layout good for choosing pivots, bad for matmul
• Blocked layout good for matmul, bad for choosing pivots
– Solution: use both layouts, switching between them
• "Shape Morphing LU" or SMLU
• Plain recursive LU:
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        update right half of A
        factor(right half of A)
  #Words = O(n^3/M^(1/2)), #Messages = O(n^3/M)
• SMLU:
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        reshape to recursive block format
        update right half of A
        reshape to columnwise format
        factor(right half of A)
  #Words = O(n^3/M^(1/2)), #Messages = O(n^3/M^(3/2))
Other CA algorithms for Ax=b, least squares (3/3)
• Need for pivoting arises beyond LU, in QR
– Choose permutation P so that leading columns of A·P = Q·R span the column space of A – Rank Revealing QR (RRQR)
– Usual approach, like Partial Pivoting:
• Put longest column first, update rest of matrix, repeat
• Hard to do using BLAS3 at all, let alone hit the lower bound
– Use Tournament Pivoting:
• Each round of the tournament selects the best b columns from two groups of b columns, either using the usual approach or something better (Gu/Eisenstat)
• Thm: This approach "reveals the rank" of A, in the sense that the leading r x r submatrix of R has singular values "near" the largest r singular values of A; ditto for the trailing submatrix
– Idea extends to other pivoting schemes
• Cholesky with diagonal pivoting
• LU with complete pivoting
• LDL^T with complete pivoting
Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
– Matmul
• classical & Strassen-like, heterogeneous, tensors, oblivious
– LU & QR (tournament pivoting)
– Sparse matrices
– Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
– Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
– Reorganizing Krylov methods – Conjugate Gradients
– Stability challenges and approaches
– What is a "sparse matrix"?
• Floating-point reproducibility
– Despite nondeterminism/nonassociativity
What about sparse matrices? (1/3)
• If the matrix quickly becomes dense, use a dense algorithm
• Ex: All Pairs Shortest Paths (APSP) using Floyd-Warshall
• Similar to matmul: let D = A, then
    for k = 1:n
      for i = 1:n
        for j = 1:n
          D(i,j) = min(D(i,j), D(i,k) + D(k,j))
• But can't reorder the outer loop for 2.5D, need another idea
• Abbreviate D(i,j) = min(D(i,j), min_k(A(i,k)+B(k,j))) by D = A*B
– Dependencies ok, 2.5D works, just a different semiring
• Kleene's Algorithm (serial sketch below):
    D = DC-APSP(A, n)
      D = A
      Partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
      D11 = DC-APSP(D11, n/2)
      D12 = D11 * D12
      D21 = D21 * D11
      D22 = D21 * D12
      D22 = DC-APSP(D22, n/2)
      D21 = D22 * D21
      D12 = D12 * D22
      D11 = D12 * D21
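A small serial numpy sketch of DC-APSP, using the slide's convention that D = A*B means D(i,j) = min(D(i,j), min_k(A(i,k)+B(k,j))); the 2.5D distribution of the block multiplies is not shown, and the test graph is made up:

    import numpy as np

    def minplus(D, A, B):
        # D = A*B in the slide's notation: min of current D and the (min,+) product of A and B
        return np.minimum(D, (A[:, :, None] + B[None, :, :]).min(axis=1))

    def dc_apsp(A):
        n = A.shape[0]
        if n == 1:
            return A.copy()
        h = n // 2
        D = A.copy()
        D[:h, :h] = dc_apsp(D[:h, :h])
        D[:h, h:] = minplus(D[:h, h:], D[:h, :h], D[:h, h:])     # D12 = D11*D12
        D[h:, :h] = minplus(D[h:, :h], D[h:, :h], D[:h, :h])     # D21 = D21*D11
        D[h:, h:] = minplus(D[h:, h:], D[h:, :h], D[:h, h:])     # D22 = D21*D12
        D[h:, h:] = dc_apsp(D[h:, h:])
        D[h:, :h] = minplus(D[h:, :h], D[h:, h:], D[h:, :h])     # D21 = D22*D21
        D[:h, h:] = minplus(D[:h, h:], D[:h, h:], D[h:, h:])     # D12 = D12*D22
        D[:h, :h] = minplus(D[:h, :h], D[:h, h:], D[h:, :h])     # D11 = D12*D21
        return D

    # Check against Floyd-Warshall on a small complete graph with random weights
    n = 8
    A = np.random.rand(n, n) * 10
    np.fill_diagonal(A, 0)
    FW = A.copy()
    for k in range(n):
        FW = np.minimum(FW, FW[:, [k]] + FW[[k], :])
    assert np.allclose(dc_apsp(A), FW)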
Performance of 2.5D APSP using Kleene
[plot: strong scaling on Hopper (Cray XE6, 1024 nodes = 24,576 cores); annotations: 6.2x speedup, 2x speedup]
What about sparse matrices? (2/3)
• If parts of the matrix become dense, optimize those
• Ex: Cholesky on matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): If all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
– Need to do dense Cholesky on a w x w submatrix
• Thm: Words_moved = Ω(w^3/M^(1/2)), etc.
• Thm (George '73): Nested dissection gives optimal ordering for 2D grid, 3D grid, similar matrices
– w = n for a 2D n x n grid, w = n^2 for a 3D n x n x n grid
• Sequential multifrontal Cholesky attains the bounds
• PSPACES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
– Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)
What about sparse matrices? (3/3)
• If the matrix stays very sparse, the lower bound is unattainable; need a new one
• Ex: A, B both diagonal: no communication in the parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), iid
• Assumption: algorithm is sparsity-independent: assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
  Words_moved = Ω( min( d·n/P^(1/2), d^2·n/P ) )
– Proof exploits the fact that reuse of entries of C = A·B is unlikely
• Contrast general lower bound: Words_moved = Ω( d^2·n/(P·M^(1/2)) )
• Attained by a divide-and-conquer algorithm that splits matrices along the dimensions most likely to minimize cost
Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
– Matmul
• classical & Strassen-like, heterogeneous, tensors, oblivious
– LU & QR (tournament pivoting)
– Sparse matrices
– Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
– Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
– Reorganizing Krylov methods – Conjugate Gradients
– Stability challenges and approaches
– What is a "sparse matrix"?
• Floating-point reproducibility
– Despite nondeterminism/nonassociativity
Symmetric Eigenproblem and SVD
• Usual approach for A = A^T (SVD similar)
– A → Q^T·A·Q = T, where Q orthogonal, T tridiagonal
– T → U^T·T·U = Λ, where U orthogonal, Λ diagonal
– Q·U's columns are eigenvectors, Λ the eigenvalues
– Dense → Tridiagonal → Diagonal
– Only half BLAS3, half BLAS2, in LAPACK's sytrd
• Communication-Avoiding Approach
– A → Q·A·Q^T = B, where B = B^T is banded, of bandwidth M^(1/2)
– Continue as above, starting with B
– Dense → Banded → Tridiagonal → Diagonal
– Dense → Banded: use TSQR to zero out M^(1/2) cols/rows at a time
– Banded → Tridiagonal: need a new(ish) idea
Successive Band Reduction (Bischof/Lang/Sun)
b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b
[sequence of figures: sweeps 1–6 of bulge chasing on a band of width b+1, applying orthogonal transforms Q1, Q1^T, Q2, Q2^T, …, Q5, Q5^T; each sweep eliminates d diagonals from c columns and chases the resulting bulge of width d+c down the band]
Conventional vs CA-SBR
– Conventional: touch all data 4 times
– Communication-Avoiding: touch all data once
2.5D Matrix Multiplication
• Assume we can fit c·n^2/P data per processor, c > 1
• Processors form a (P/c)^(1/2) x (P/c)^(1/2) x c grid, with axes i, j, k
• Initially P(i,j,0) owns A(i,j) and B(i,j), each of size n·(c/P)^(1/2) x n·(c/P)^(1/2)
(1) P(i,j,0) broadcasts A(i,j) and B(i,j) to P(i,j,k)
(2) Processors at level k perform 1/c-th of SUMMA, i.e. 1/c-th of Σ_m A(i,m)·B(m,j)
(3) Sum-reduce partial sums Σ_m A(i,m)·B(m,j) along the k-axis, so P(i,j,0) owns C(i,j)
A serial sketch of this decomposition follows below.
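A serial numpy sketch that mimics the arithmetic of steps (1)-(3) on an s x s x c grid (so P = s^2·c); the broadcasts, intra-layer shifts, and the final reduction are just loops and array indexing here, and the grid sizes are made up:

    import numpy as np

    def matmul_25d(A, B, s=4, c=2):
        n = A.shape[0]
        assert n % s == 0 and s % c == 0
        nb = n // s                                  # block size owned by each processor
        blk = lambda M, i, j: M[i*nb:(i+1)*nb, j*nb:(j+1)*nb]
        C = np.zeros((n, n))
        for k in range(c):                           # layer k of the processor grid
            ms = range(k * (s // c), (k + 1) * (s // c))   # its 1/c-th of the SUMMA sum
            for i in range(s):
                for j in range(s):
                    # Partial sum computed by processor (i,j,k); in the real algorithm
                    # these blocks arrive via the broadcast in step (1) plus shifts
                    # within layer k, here we simply index into the global arrays.
                    Cij_k = sum(blk(A, i, m) @ blk(B, m, j) for m in ms)
                    C[i*nb:(i+1)*nb, j*nb:(j+1)*nb] += Cij_k   # step (3): reduce along k
        return C

    n = 64
    A, B = np.random.rand(n, n), np.random.rand(n, n)
    assert np.allclose(matmul_25d(A, B), A @ B)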
2.5D Matmul on BG/P, 16K nodes / 64K cores, c = 16 copies
[plot: 2.5D vs 2D execution time; annotations: 12x faster, 2.7x faster]
Distinguished Paper Award, EuroPar'11 (Solomonik, D.); SC'11 paper by Solomonik, Bhatele, D.
Perfect Strong Scaling – in Time and Energy (1/2)
• Every time you add a processor, you should use its memory M too
• Start with the minimal number of procs: P·M = 3n^2
• Increase P by a factor of c ⇒ total memory increases by a factor of c
• Notation for timing model:
– γ_T, β_T, α_T = secs per flop, per word_moved, per message of size m
• T(cP) = n^3/(cP) · [ γ_T + β_T/M^(1/2) + α_T/(m·M^(1/2)) ] = T(P)/c
• Notation for energy model:
– γ_E, β_E, α_E = joules for the same operations
– δ_E = joules per word of memory used per sec
– ε_E = joules per sec, for leakage etc.
• E(cP) = cP · { n^3/(cP) · [ γ_E + β_E/M^(1/2) + α_E/(m·M^(1/2)) ] } + δ_E·M·T(cP) + ε_E·T(cP) = E(P)
• Perfect scaling extends to N-body, Strassen, …
Perfect Strong Scaling – in Time and Energy (2/2)
• T(cP) = n^3/(cP) · [ γ_T + β_T/M^(1/2) + α_T/(m·M^(1/2)) ] = T(P)/c
• E(cP) = cP · { n^3/(cP) · [ γ_E + β_E/M^(1/2) + α_E/(m·M^(1/2)) ] } + δ_E·M·T(cP) + ε_E·T(cP) = E(P)
• Can use these formulas to answer many questions, such as (see the code sketch below):
– How to choose p and M to minimize the energy E needed for a computation?
– Given max allowed runtime T, what is the minimum energy E needed to achieve it?
– Given max allowed energy E, what is the minimum runtime T attainable?
– Can we minimize the average power P = E/T?
– Given a target energy efficiency, what architectural parameters are needed to achieve it?
• Can we attain 75 Gflops/Watt?
• Can we attain an exaflop for 20 MWatts?
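A direct transcription of the two formulas into Python, useful for playing with the what-if questions above (all machine constants in the example are invented placeholders, not measurements):

    def T(P, c, n, M, m, gT, bT, aT):
        # T(cP) = n^3/(cP) * [ gamma_T + beta_T/M^(1/2) + alpha_T/(m*M^(1/2)) ]
        return n**3 / (c * P) * (gT + bT / M**0.5 + aT / (m * M**0.5))

    def E(P, c, n, M, m, gE, bE, aE, dE, eE, gT, bT, aT):
        # E(cP) = cP * n^3/(cP) * [ gamma_E + beta_E/M^(1/2) + alpha_E/(m*M^(1/2)) ]
        #         + delta_E*M*T(cP) + epsilon_E*T(cP)
        work_energy = (c * P) * (n**3 / (c * P)) * (gE + bE / M**0.5 + aE / (m * M**0.5))
        t = T(P, c, n, M, m, gT, bT, aT)           # runtime from the timing model above
        return work_energy + dE * M * t + eE * t

    # Example: doubling c halves the runtime, as the perfect-scaling result states
    args = dict(n=2**14, M=2**26, m=2**13, gT=1e-11, bT=1e-9, aT=1e-6)
    print(T(P=1024, c=2, **args) / T(P=1024, c=1, **args))    # -> 0.5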
Handling Heterogeneity
• Suppose each of P processors could differ
– γ_i = sec/flop, β_i = sec/word, α_i = sec/message, M_i = memory
• What is the optimal assignment of work F_i to minimize time?
– T_i = F_i·γ_i + F_i·β_i/M_i^(1/2) + F_i·α_i/M_i^(3/2) = F_i·[γ_i + β_i/M_i^(1/2) + α_i/M_i^(3/2)] = F_i·ξ_i
– Choose F_i so that Σ_i F_i = n^3 and T = max_i T_i is minimized
– Answer: F_i = n^3·(1/ξ_i)/Σ_j(1/ξ_j) and T = n^3/Σ_j(1/ξ_j)  (sketch below)
• Optimal algorithm for n x n matmul:
– Recursively divide into 8 half-sized subproblems
– Assign subproblems to processor i so they add up to F_i flops
• Works for Strassen, other algorithms, …
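A tiny sketch of the optimal split F_i = n^3·(1/ξ_i)/Σ_j(1/ξ_j); the four processors' constants are invented for illustration:

    import numpy as np

    def optimal_split(n, gamma, beta, alpha, M):
        xi = gamma + beta / np.sqrt(M) + alpha / M**1.5   # xi_i per processor
        inv = 1.0 / xi
        F = n**3 * inv / inv.sum()                        # flops assigned to processor i
        T = n**3 / inv.sum()                              # common finish time
        return F, T

    gamma = np.array([1e-11, 2e-11, 1e-10, 5e-11])        # sec/flop
    beta  = np.array([1e-9,  1e-9,  2e-9,  1e-9 ])        # sec/word
    alpha = np.array([1e-6,  1e-6,  1e-6,  2e-6 ])        # sec/message
    M     = np.array([2**26, 2**25, 2**24, 2**26])        # words of fast memory
    F, T = optimal_split(4096, gamma, beta, alpha, M)
    print(F / F.sum())                                    # share of the n^3 flops per processor
    print(np.allclose(F * (gamma + beta/np.sqrt(M) + alpha/M**1.5), T))  # all finish together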
Application to Tensor Contractions
• Ex: C(i,j,k) = Σ_mn A(i,j,m,n)·B(m,n,k)
– Communication lower bounds apply
• Complex symmetries possible
– Ex: B(m,n,k) = B(k,m,n) = …
– d-fold symmetry can save up to d-fold flops/memory
• Heavily used in electronic structure calculations
– Ex: NWChem, for the coupled cluster (CC) approach to the Schroedinger eqn
• CTF: Cyclops Tensor Framework
– Exploits 2.5D algorithms, symmetries
– Up to 3x faster running CC than NWChem on 3072 cores of a Cray XE6
– Solomonik, Hammond, Matthews
[figure: C(i,j,k) = Σ_m A(i,j,m)·B(m,k), with A 3-fold symmetric, B 2-fold symmetric, C 2-fold symmetric]
Communication Lower Bounds for Strassen-like matmul algorithms
• Classical O(n^3) matmul: words_moved = Ω( M·(n/M^(1/2))^3 / P )
• Strassen's O(n^(lg 7)) matmul: words_moved = Ω( M·(n/M^(1/2))^(lg 7) / P )
• Strassen-like O(n^ω) matmul: words_moved = Ω( M·(n/M^(1/2))^ω / P )
• Proof: graph expansion (different from classical matmul)
– Strassen-like: DAG must be "regular" and connected
• Extends up to M = n^2/p^(2/ω)
• Extends to rectangular case: multiply (m x n)·(n x p) in q mults
– words_moved = Ω( flops / M^(log_mp q − 1) )
• Best Paper Prize (SPAA'11), Ballard, D., Holtz, Schwartz; also in JACM
• Is the lower bound attainable?
Communication Avoiding Parallel Strassen (CAPS)
• BFS step: run all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory
• DFS step: run all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory
CAPS: if EnoughMemory and P ≥ 7 then BFS step, else DFS step
Best way to interleave BFS and DFS is a tuning parameter
26
Performance Benchmarking, Strong Scaling Plot: Franklin (Cray XT4), n = 94080
Speedups: 24%–184% (over previous Strassen-based algorithms)
Invited to appear as a Research Highlight in CACM
Strassen-like beyond matmul
bull Thm (D Dumitriu Holtzrsquo07) Any Strassen-like O(nω) matmul algorithm can be used to build a numerically stable O(nω+η) algorithm for any ηgt0 for Ax=b least squares eig SVD hellipndash ηgt0 needed to deal with numerical stabilityndash Strassen already stable so η=0
bull Thm For sequential versions of these algorithms Words_moved = O(nω+ηM(ω+η)2 ndash 1 + n2 log n) ie attain expected lower bound
Ballard D Holtz Schwartz
Cache and Network Oblivious Algorithms
bull Motivation Minimizes communication at every level of a hierarchical system without tuning parameters (in theory)ndash Not always 25D Matmul on BGP was topology aware
bull CAPS Divide-and-conquer choose BFS or DFS to adapt to processors available memory
bull CARMAndash Divide-and-conquer classical matmul divide largest of 3
dimensions to create two subproblemsndash Choose BFS or DFS to adapt to processors available memory
CARMA Performance Distributed Memory
Square m = k = n = 6144
ScaLAPACK
CARMA
Peak
(log)
(log)
Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA
CARMA Performance Distributed Memory
Inner Product m = n = 192 k = 6291456
ScaLAPACK
CARMAPeak
(log)
(log)
Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA
CARMA Performance Shared Memory
Square m = k = n
MKL (double)CARMA (double)
MKL (single)CARMA (single)
Peak (single)
Peak (double)
(log)
(linear)
Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA
CARMA Performance Shared Memory
Inner Product m = n = 64
MKL (double)
CARMA (double)
MKL (single)
CARMA (single)
(log)
(linear)
Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA
Why is CARMA Faster in Shared MemoryL3 Cache Misses
Shared Memory Inner Product (m = n = 64 k = 524288)
97 Fewer Misses
86 Fewer Misses
(linear)
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
One-sided Factorizations (LU QR) so farbull Classical Approach for i=1 to n update column i update trailing matrixbull words_moved = O(n3)
35
bull Blocked Approach (LAPACK) for i=1 to nb update block i of b columns update trailing matrixbull words moved = O(n3M13)
bull Recursive Approach func factor(A) if A has 1 column update it
else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)
bull None of these approaches minimizes messagesbull Parallel case Partial
Pivoting =gt n reductionsbull Need another idea
TSQR An Architecture-Dependent Algorithm
W =
W0
W1
W2
W3
R00
R10
R20
R30
R01
R11
R02Parallel
W =
W0
W1
W2
W3
R01R02
R00
R03
SequentialStreaming
W =
W0
W1
W2
W3
R00
R01
R01
R11
R02
R11
R03
Dual Core
Can choose reduction tree dynamically
Multicore Multisocket Multirack Multisite Out-of-core
Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo
Wnxb =
W1
W2
W3
W4
P1middotL1middotU1
P2middotL2middotU2
P3middotL3middotU3
P4middotL4middotU4
=
Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo
W1rsquoW2rsquoW3rsquoW4rsquo
P12middotL12middotU12
P34middotL34middotU34
=Choose b pivot rows call them W12rsquo
Choose b pivot rows call them W34rsquo
W12rsquoW34rsquo
= P1234middotL1234middotU1234
Choose b pivot rows
Go back to W and use these b pivot rows (move them to top do LU without pivoting)
37
Minimizing Communication in TSLU
W = W1
W2
W3
W4
LULULULU
LU
LULUParallel
W = W1
W2
W3
W4
LULU
LU
LUSequentialStreaming
W = W1
W2
W3
W4
LULU LU
LULU
LULU
Dual Core
Can choose reduction tree dynamically to match architecture as before
38
Making TSLU Numerically Stable
bull Details matterndash Going up the tree we could do LU either on original rows of A
(tournament pivoting) or computed rows of Undash Only tournament pivoting stable
bull ldquoThmrdquo New scheme as stable as Partial Pivoting (GEPP) in following sense Get same Schur complements as GEPP applied to different input matrix whose entries are blocks taken from input A
bull Why just a ldquoThmrdquo
39
Stability of LU using TSLU CALU
Summer School Lecture 4 40
bull Empirical testingndash Both random matrices and ldquospecial onesrdquondash Both binary tree (BCALU) and flat-tree (FCALU)ndash 3 metrics ||PA-LU||||A|| normwise and componentwise backward errorsndash See [D Grigori Xiang 2010] for details
Why is stability of TSLU just a ldquoThmrdquo
bull Proof is correct ndash in exact arithmeticbull Experiment
ndash Generate 100 random 6x6 rank 3 matrices in Matlabndash [LUP] = lu(A) do LU without pivoting on PA compare L factors are
they the samebull Compute || L ndash Lnp || A few 0rsquos A few infinrsquos a few NaNsbull Rest mostly O(1)
ndash Why Floating point is nonassociative doing arithmetic in different order gives different rounding errors
ndash Same experiment with rank 6 matrices || L ndash Lnp || usually nonzero O(macheps)
ndash Same experiment with 20x20 rank 4 matrices || L ndash Lnp || often O(103)
bull Much harder to break TSLU but possiblendash Occurred when using TSLU to factorize a low-rank subdiagonal
panel in symmetric-indefinite factorization41
Fixing TSLU
bull Run TSLU quickly test for stability fix if necessary (rare)
bull Test conditioning of U if not tiny (usual case) proceed elsebull Compute || L || if not big (usual case) proceed elsebull Factor A = QR using TSQR thenbull Factor Q = PLU using TSLU thenbull A = PL(UR) with UR as upper triangular factor
bull Last topic in lecture how to guarantee floating point reproducibility
42
2D CALU with Tournament Pivoting
43
25D CALU with Tournament Pivoting (c=4 copies)
44
Exascale Machine ParametersSource DOE Exascale Workshop
bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node
Exascale predicted speedupsfor Gaussian Elimination
2D CA-LU vs ScaLAPACK-LU
log2 (p)
log
2 (
n2p
) =
log
2 (m
emo
ry_p
er_p
roc)
Up to 29x
25D vs 2D LUWith and Without Pivoting
Other CA algorithms for Ax=b least squares(13)
bull A symmetric and indefinitendash Seek factorization that retains symmetry PAPT = LDLT D
ldquosimplerdquobull Save frac12 flops preserve inertia
ndash Usual approach Bunch-Kaufmanbull D block diagonal with 1x1 and 2x2 blocksbull Pivot search down column along row (lots of communication)
ndash Alternative Aasenbull D = tridiagonal = Tbull Two steps
ndash PAPT = LTLT where T is banded using TSLU
48
0 0
0
0 0
0
0
hellip
hellip
ndash Solvefactor narrow band problem with Tbull Up to 28x faster than MKL Best Paper at IPDPSrsquo13
Other CA algorithms for Ax=b least squares (23)bull Minimizing bandwidth and latency for sequential GEPP
ndash So far could not do partial pivoting and minimize messages just words
ndash Challengebull Column layout good for choosing pivots bad for matmulbull Blocked layout good for matmul bad for choosing pivots
ndash Solution use both layouts switching between thembull ldquoShape Morphing LUrdquo or SMLU
49
bull func factor(A) if A has 1 column update it else factor(left half of A)
update right half of A
factor(right half of A)
bull Words = O(n3M12)
bull Messages = O(n3M)
bull func factor(A) if A has 1 column update it else factor(left half of A) reshape to recursive block format update right half of A reshape to columnwise format factor(right half of A)
bull Words = O(n3M12)
bull Messages = O(n3M32)
Other CA algorithms for Ax=b least squares (33)bull Need for pivoting arises beyond LU in QR
ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)
ndash Usual approach like Partial Pivoting
bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound
ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two
groups of b columns either using usual approach or something better (GuEisenstat)
bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix
ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 50
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
What about sparse matrices (13)
bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then
bull But canrsquot reorder outer loop for 25D need another idea
bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring
bull Kleenersquos Algorithm
52
for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)
D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21
Performance of 25D APSP using Kleene
53
Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)
62xspeedup
2x speedup
What about sparse matrices (23)
bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of
G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix
bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering
for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid
bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse
multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on
separators) 54
What about sparse matrices (33)
bull If matrix stays very sparse lower bound unattainable new one
bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of
data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely
bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices
along dimensions most likely to minimize cost
55
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
Symmetric Eigenproblem and SVD
bull Usual approach for A=AT (SVD similar)ndash A QTAQ = T where Q orthogonal T tridiagonalndash T UTTU = Λ where U orthogonal Λ diagonalndash QUrsquos columns are eigenvectors Λ eigenvaluesndash Dense Tridiagonal Diagonalndash Only half BLAS3 half BLAS2 in LAPACKrsquos sytrd
bull Communication-Avoiding Approachndash A QAQT = B where B=BT banded of bandwidth M12
ndash Continue as above starting with Bndash Dense Banded Tridiagonal Diagonalndash Dense Banded use TSQR to zero out M12 colsrows at a timendash Banded Tridiagonal need new(ish) idea
b+1
b+1
Successive Band Reduction (BischofLangSun)
1
b+1
b+1
d+1
c
Successive Band Reduction (BischofLangSun)
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1Q1
b+1
b+1
d+1
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
12
Q1
b+1
b+1
d+1
d+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
12
Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
d+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
Q1
Q1T
Q2
Q2T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
Q1
Q1T
Q2
Q2T
Q3
Q3T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
6
6
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
Conventional vs CA - SBR
Conventional Communication-Avoiding
Touch all data 4 times Touch all data once
25D Matmul on BGP 16K nodes 64K coresc = 16 copies
Distinguished Paper Award EuroParrsquo11 (Solomonik D)SCrsquo11 paper by Solomonik Bhatele D
12x faster
27x faster
Perfect Strong Scaling ndash in Time and Energy (12)
bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2
bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model
ndash γT βT αT = secs per flop per word_moved per message of size m
bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ]
= T(P)cbull Notation for energy model
ndash γE βE αE = joules for same operations
ndash δE = joules per word of memory used per sec
ndash εE = joules per sec for leakage etc
bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP)
= E(P)bull Perfect scaling extends to N-body Strassen hellip
Perfect Strong Scaling ndash in Time and Energy (22)
bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ] = T(P)c
bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP) = E(P)
bull Can use these formulas to answer many questions such asndash How to choose p and M to minimize energy E needed for computationndash Given max allowed runtime T what is minimum energy E needed to achieve
itndash Given max allowed energy E what is the minimum runtime T attainablendash Can we minimize the average power P = ETndash Given target energy efficiency what architectural parameters are needed to
achieve itbull Can we attain 75 GflopsWattbull Can we attain an exaflop for 20 MWatts
Handling Heterogeneity
bull Suppose each of P processors could differndash γi = secflop βi = secword αi = secmessage Mi = memory
bull What is optimal assignment of work Fi to minimize timendash Ti = Fi γi + Fi βi Mi
12 + Fi αi Mi32 = Fi [γi + βi Mi
12 + αi Mi32] = Fi ξi
ndash Choose Fi so Σi Fi = n3 and minimizing T = maxi Ti
ndash Answer Fi = n3(1ξi)Σj(1ξj) and T = n3Σj(1ξj)
bull Optimal Algorithm for nxn matmulndash Recursively divide into 8 half-sized subproblemsndash Assign subproblems to processor i to add up to Fi flops
bull Works for Strassen other algorithmshellip
Application to Tensor Contractions
bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply
bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory
bull Heavily used in electronic structure calculationsndash Ex NWChem
bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries
ndash Solomonik Hammond Matthews
C(ijk) = Σm A(ijm)B(mk)
A3-fold symm
B2-fold symm
C2-fold symm
Application to Tensor Contractions
bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply
bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory
bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn
bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews
Communication Lower Bounds for Strassen-like matmul algorithms
bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected
bull Extends up to M = n2 p2ω bull Extends to rectangular case multiply (mxn)(nxp) in q mults
ndash words_moved = Ω (flopsM^(logmpq -1))
bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz also in JACMbull Is the lower bound attainable
Classical O(n3) matmul
words_moved =Ω (M(nM12)3P)
Strassenrsquos O(nlg7) matmul
words_moved =Ω (M(nM12)lg7P)
Strassen-like O(nω) matmul
words_moved =Ω (M(nM12)ωP)
vs
Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory
Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory
CAPS If EnoughMemory and P 7 then BFS step else DFS step end if
Communication Avoiding Parallel Strassen (CAPS)
Best way to interleaveBFS and DFS is an tuning parameter
26
Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080
Speedups 24-184(over previous Strassen-based algorithms)
Invited to appear as Research Highlight in CACM
Strassen-like beyond matmul
bull Thm (D Dumitriu Holtzrsquo07) Any Strassen-like O(nω) matmul algorithm can be used to build a numerically stable O(nω+η) algorithm for any ηgt0 for Ax=b least squares eig SVD hellipndash ηgt0 needed to deal with numerical stabilityndash Strassen already stable so η=0
bull Thm For sequential versions of these algorithms Words_moved = O(nω+ηM(ω+η)2 ndash 1 + n2 log n) ie attain expected lower bound
Ballard D Holtz Schwartz
Cache and Network Oblivious Algorithms
bull Motivation Minimizes communication at every level of a hierarchical system without tuning parameters (in theory)ndash Not always 25D Matmul on BGP was topology aware
bull CAPS Divide-and-conquer choose BFS or DFS to adapt to processors available memory
bull CARMAndash Divide-and-conquer classical matmul divide largest of 3
dimensions to create two subproblemsndash Choose BFS or DFS to adapt to processors available memory
CARMA Performance Distributed Memory
Square m = k = n = 6144
ScaLAPACK
CARMA
Peak
(log)
(log)
Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA
CARMA Performance Distributed Memory
Inner Product m = n = 192 k = 6291456
ScaLAPACK
CARMAPeak
(log)
(log)
Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA
CARMA Performance Shared Memory
Square m = k = n
MKL (double)CARMA (double)
MKL (single)CARMA (single)
Peak (single)
Peak (double)
(log)
(linear)
Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA
CARMA Performance Shared Memory
Inner Product m = n = 64
MKL (double)
CARMA (double)
MKL (single)
CARMA (single)
(log)
(linear)
Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA
Why is CARMA Faster in Shared MemoryL3 Cache Misses
Shared Memory Inner Product (m = n = 64 k = 524288)
97 Fewer Misses
86 Fewer Misses
(linear)
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
One-sided Factorizations (LU QR) so farbull Classical Approach for i=1 to n update column i update trailing matrixbull words_moved = O(n3)
35
bull Blocked Approach (LAPACK) for i=1 to nb update block i of b columns update trailing matrixbull words moved = O(n3M13)
bull Recursive Approach func factor(A) if A has 1 column update it
else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)
bull None of these approaches minimizes messagesbull Parallel case Partial
Pivoting =gt n reductionsbull Need another idea
TSQR An Architecture-Dependent Algorithm
W =
W0
W1
W2
W3
R00
R10
R20
R30
R01
R11
R02Parallel
W =
W0
W1
W2
W3
R01R02
R00
R03
SequentialStreaming
W =
W0
W1
W2
W3
R00
R01
R01
R11
R02
R11
R03
Dual Core
Can choose reduction tree dynamically
Multicore Multisocket Multirack Multisite Out-of-core
Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo
Wnxb =
W1
W2
W3
W4
P1middotL1middotU1
P2middotL2middotU2
P3middotL3middotU3
P4middotL4middotU4
=
Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo
W1rsquoW2rsquoW3rsquoW4rsquo
P12middotL12middotU12
P34middotL34middotU34
=Choose b pivot rows call them W12rsquo
Choose b pivot rows call them W34rsquo
W12rsquoW34rsquo
= P1234middotL1234middotU1234
Choose b pivot rows
Go back to W and use these b pivot rows (move them to top do LU without pivoting)
37
Minimizing Communication in TSLU
W = W1
W2
W3
W4
LULULULU
LU
LULUParallel
W = W1
W2
W3
W4
LULU
LU
LUSequentialStreaming
W = W1
W2
W3
W4
LULU LU
LULU
LULU
Dual Core
Can choose reduction tree dynamically to match architecture as before
38
Making TSLU Numerically Stable
bull Details matterndash Going up the tree we could do LU either on original rows of A
(tournament pivoting) or computed rows of Undash Only tournament pivoting stable
bull ldquoThmrdquo New scheme as stable as Partial Pivoting (GEPP) in following sense Get same Schur complements as GEPP applied to different input matrix whose entries are blocks taken from input A
bull Why just a ldquoThmrdquo
39
Stability of LU using TSLU CALU
Summer School Lecture 4 40
bull Empirical testingndash Both random matrices and ldquospecial onesrdquondash Both binary tree (BCALU) and flat-tree (FCALU)ndash 3 metrics ||PA-LU||||A|| normwise and componentwise backward errorsndash See [D Grigori Xiang 2010] for details
Why is stability of TSLU just a ldquoThmrdquo
bull Proof is correct ndash in exact arithmeticbull Experiment
ndash Generate 100 random 6x6 rank 3 matrices in Matlabndash [LUP] = lu(A) do LU without pivoting on PA compare L factors are
they the samebull Compute || L ndash Lnp || A few 0rsquos A few infinrsquos a few NaNsbull Rest mostly O(1)
ndash Why Floating point is nonassociative doing arithmetic in different order gives different rounding errors
ndash Same experiment with rank 6 matrices || L ndash Lnp || usually nonzero O(macheps)
ndash Same experiment with 20x20 rank 4 matrices || L ndash Lnp || often O(103)
bull Much harder to break TSLU but possiblendash Occurred when using TSLU to factorize a low-rank subdiagonal
panel in symmetric-indefinite factorization41
Fixing TSLU
bull Run TSLU quickly test for stability fix if necessary (rare)
bull Test conditioning of U if not tiny (usual case) proceed elsebull Compute || L || if not big (usual case) proceed elsebull Factor A = QR using TSQR thenbull Factor Q = PLU using TSLU thenbull A = PL(UR) with UR as upper triangular factor
bull Last topic in lecture how to guarantee floating point reproducibility
42
2D CALU with Tournament Pivoting
43
25D CALU with Tournament Pivoting (c=4 copies)
44
Exascale Machine ParametersSource DOE Exascale Workshop
bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node
Exascale predicted speedupsfor Gaussian Elimination
2D CA-LU vs ScaLAPACK-LU
log2 (p)
log
2 (
n2p
) =
log
2 (m
emo
ry_p
er_p
roc)
Up to 29x
25D vs 2D LUWith and Without Pivoting
Other CA algorithms for Ax=b least squares(13)
bull A symmetric and indefinitendash Seek factorization that retains symmetry PAPT = LDLT D
ldquosimplerdquobull Save frac12 flops preserve inertia
ndash Usual approach Bunch-Kaufmanbull D block diagonal with 1x1 and 2x2 blocksbull Pivot search down column along row (lots of communication)
ndash Alternative Aasenbull D = tridiagonal = Tbull Two steps
ndash PAPT = LTLT where T is banded using TSLU
48
0 0
0
0 0
0
0
hellip
hellip
ndash Solvefactor narrow band problem with Tbull Up to 28x faster than MKL Best Paper at IPDPSrsquo13
Other CA algorithms for Ax=b least squares (23)bull Minimizing bandwidth and latency for sequential GEPP
ndash So far could not do partial pivoting and minimize messages just words
ndash Challengebull Column layout good for choosing pivots bad for matmulbull Blocked layout good for matmul bad for choosing pivots
ndash Solution use both layouts switching between thembull ldquoShape Morphing LUrdquo or SMLU
49
bull func factor(A) if A has 1 column update it else factor(left half of A)
update right half of A
factor(right half of A)
bull Words = O(n3M12)
bull Messages = O(n3M)
bull func factor(A) if A has 1 column update it else factor(left half of A) reshape to recursive block format update right half of A reshape to columnwise format factor(right half of A)
bull Words = O(n3M12)
bull Messages = O(n3M32)
Other CA algorithms for Ax=b least squares (33)bull Need for pivoting arises beyond LU in QR
ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)
ndash Usual approach like Partial Pivoting
bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound
ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two
groups of b columns either using usual approach or something better (GuEisenstat)
bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix
ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 50
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
What about sparse matrices (13)
bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then
bull But canrsquot reorder outer loop for 25D need another idea
bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring
bull Kleenersquos Algorithm
52
for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)
D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21
Performance of 25D APSP using Kleene
53
Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)
62xspeedup
2x speedup
What about sparse matrices (23)
bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of
G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix
bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering
for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid
bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse
multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on
separators) 54
What about sparse matrices (33)
bull If matrix stays very sparse lower bound unattainable new one
bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of
data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely
bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices
along dimensions most likely to minimize cost
55
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
Symmetric Eigenproblem and SVD
bull Usual approach for A=AT (SVD similar)ndash A QTAQ = T where Q orthogonal T tridiagonalndash T UTTU = Λ where U orthogonal Λ diagonalndash QUrsquos columns are eigenvectors Λ eigenvaluesndash Dense Tridiagonal Diagonalndash Only half BLAS3 half BLAS2 in LAPACKrsquos sytrd
bull Communication-Avoiding Approachndash A QAQT = B where B=BT banded of bandwidth M12
ndash Continue as above starting with Bndash Dense Banded Tridiagonal Diagonalndash Dense Banded use TSQR to zero out M12 colsrows at a timendash Banded Tridiagonal need new(ish) idea
b+1
b+1
Successive Band Reduction (BischofLangSun)
1
b+1
b+1
d+1
c
Successive Band Reduction (BischofLangSun)
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1Q1
b+1
b+1
d+1
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
12
Q1
b+1
b+1
d+1
d+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
12
Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
d+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
Q1
Q1T
Q2
Q2T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
Q1
Q1T
Q2
Q2T
Q3
Q3T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
6
6
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
Conventional vs CA - SBR
Conventional Communication-Avoiding
Touch all data 4 times Touch all data once
Perfect Strong Scaling ndash in Time and Energy (12)
bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2
bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model
ndash γT βT αT = secs per flop per word_moved per message of size m
bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ]
= T(P)cbull Notation for energy model
ndash γE βE αE = joules for same operations
ndash δE = joules per word of memory used per sec
ndash εE = joules per sec for leakage etc
bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP)
= E(P)bull Perfect scaling extends to N-body Strassen hellip
Perfect Strong Scaling ndash in Time and Energy (22)
bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ] = T(P)c
bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP) = E(P)
bull Can use these formulas to answer many questions such asndash How to choose p and M to minimize energy E needed for computationndash Given max allowed runtime T what is minimum energy E needed to achieve
itndash Given max allowed energy E what is the minimum runtime T attainablendash Can we minimize the average power P = ETndash Given target energy efficiency what architectural parameters are needed to
achieve itbull Can we attain 75 GflopsWattbull Can we attain an exaflop for 20 MWatts
Handling Heterogeneity
bull Suppose each of P processors could differndash γi = secflop βi = secword αi = secmessage Mi = memory
bull What is optimal assignment of work Fi to minimize timendash Ti = Fi γi + Fi βi Mi
12 + Fi αi Mi32 = Fi [γi + βi Mi
12 + αi Mi32] = Fi ξi
ndash Choose Fi so Σi Fi = n3 and minimizing T = maxi Ti
ndash Answer Fi = n3(1ξi)Σj(1ξj) and T = n3Σj(1ξj)
bull Optimal Algorithm for nxn matmulndash Recursively divide into 8 half-sized subproblemsndash Assign subproblems to processor i to add up to Fi flops
bull Works for Strassen other algorithmshellip
Application to Tensor Contractions
bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply
bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory
bull Heavily used in electronic structure calculationsndash Ex NWChem
bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries
ndash Solomonik Hammond Matthews
C(ijk) = Σm A(ijm)B(mk)
A3-fold symm
B2-fold symm
C2-fold symm
Application to Tensor Contractions
bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply
bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory
bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn
bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews
Communication Lower Bounds for Strassen-like matmul algorithms
bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected
bull Extends up to M = n2 p2ω bull Extends to rectangular case multiply (mxn)(nxp) in q mults
ndash words_moved = Ω (flopsM^(logmpq -1))
bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz also in JACMbull Is the lower bound attainable
Classical O(n3) matmul
words_moved =Ω (M(nM12)3P)
Strassenrsquos O(nlg7) matmul
words_moved =Ω (M(nM12)lg7P)
Strassen-like O(nω) matmul
words_moved =Ω (M(nM12)ωP)
vs
Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory
Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory
CAPS If EnoughMemory and P 7 then BFS step else DFS step end if
Communication Avoiding Parallel Strassen (CAPS)
Best way to interleaveBFS and DFS is an tuning parameter
26
Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080
Speedups 24-184(over previous Strassen-based algorithms)
Invited to appear as Research Highlight in CACM
Strassen-like beyond matmul
bull Thm (D Dumitriu Holtzrsquo07) Any Strassen-like O(nω) matmul algorithm can be used to build a numerically stable O(nω+η) algorithm for any ηgt0 for Ax=b least squares eig SVD hellipndash ηgt0 needed to deal with numerical stabilityndash Strassen already stable so η=0
bull Thm For sequential versions of these algorithms Words_moved = O(nω+ηM(ω+η)2 ndash 1 + n2 log n) ie attain expected lower bound
Ballard D Holtz Schwartz
Cache and Network Oblivious Algorithms
bull Motivation Minimizes communication at every level of a hierarchical system without tuning parameters (in theory)ndash Not always 25D Matmul on BGP was topology aware
bull CAPS Divide-and-conquer choose BFS or DFS to adapt to processors available memory
bull CARMAndash Divide-and-conquer classical matmul divide largest of 3
dimensions to create two subproblemsndash Choose BFS or DFS to adapt to processors available memory
CARMA Performance Distributed Memory
Square m = k = n = 6144
ScaLAPACK
CARMA
Peak
(log)
(log)
Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA
CARMA Performance Distributed Memory
Inner Product m = n = 192 k = 6291456
ScaLAPACK
CARMAPeak
(log)
(log)
Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA
CARMA Performance Shared Memory
Square m = k = n
MKL (double)CARMA (double)
MKL (single)CARMA (single)
Peak (single)
Peak (double)
(log)
(linear)
Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA
CARMA Performance Shared Memory
Inner Product m = n = 64
MKL (double)
CARMA (double)
MKL (single)
CARMA (single)
(log)
(linear)
Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA
Why is CARMA Faster in Shared MemoryL3 Cache Misses
Shared Memory Inner Product (m = n = 64 k = 524288)
97 Fewer Misses
86 Fewer Misses
(linear)
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
One-sided Factorizations (LU QR) so farbull Classical Approach for i=1 to n update column i update trailing matrixbull words_moved = O(n3)
35
bull Blocked Approach (LAPACK) for i=1 to nb update block i of b columns update trailing matrixbull words moved = O(n3M13)
bull Recursive Approach func factor(A) if A has 1 column update it
else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)
bull None of these approaches minimizes messagesbull Parallel case Partial
Pivoting =gt n reductionsbull Need another idea
TSQR An Architecture-Dependent Algorithm
W =
W0
W1
W2
W3
R00
R10
R20
R30
R01
R11
R02Parallel
W =
W0
W1
W2
W3
R01R02
R00
R03
SequentialStreaming
W =
W0
W1
W2
W3
R00
R01
R01
R11
R02
R11
R03
Dual Core
Can choose reduction tree dynamically
Multicore Multisocket Multirack Multisite Out-of-core
Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo
Wnxb =
W1
W2
W3
W4
P1middotL1middotU1
P2middotL2middotU2
P3middotL3middotU3
P4middotL4middotU4
=
Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo
W1rsquoW2rsquoW3rsquoW4rsquo
P12middotL12middotU12
P34middotL34middotU34
=Choose b pivot rows call them W12rsquo
Choose b pivot rows call them W34rsquo
W12rsquoW34rsquo
= P1234middotL1234middotU1234
Choose b pivot rows
Go back to W and use these b pivot rows (move them to top do LU without pivoting)
37
Minimizing Communication in TSLU
  [Diagrams: W = [W1; W2; W3; W4], reduced by local LUs combined up different trees]
  Parallel: binary reduction tree of LUs
  Sequential / streaming: flat tree of LUs
  Dual core: hybrid tree
Can choose the reduction tree dynamically to match the architecture, as before
Making TSLU Numerically Stable
• Details matter
  – Going up the tree, we could do LU either on original rows of A (tournament pivoting) or on computed rows of U
  – Only tournament pivoting is stable
• "Thm": The new scheme is as stable as Partial Pivoting (GEPP) in the following sense: it gets the same Schur complements as GEPP applied to a different input matrix whose entries are blocks taken from the input A
• Why just a "Thm"?
Stability of LU using TSLU: CALU
• Empirical testing
  – Both random matrices and "special ones"
  – Both binary tree (BCALU) and flat-tree (FCALU)
  – 3 metrics: ||PA-LU||/||A||, normwise and componentwise backward errors
  – See [D., Grigori, Xiang, 2010] for details
Why is stability of TSLU just a "Thm"?
• The proof is correct – in exact arithmetic
• Experiment:
  – Generate 100 random 6x6, rank-3 matrices in Matlab
  – [L,U,P] = lu(A), then do LU without pivoting on P*A and compare L factors: are they the same?
    • Compute || L – Lnp ||: a few 0's, a few ∞'s, a few NaNs
    • Rest mostly O(1)
  – Why? Floating point is nonassociative: doing the arithmetic in a different order gives different rounding errors
  – Same experiment with rank-6 matrices: || L – Lnp || usually nonzero, O(macheps)
  – Same experiment with 20x20, rank-4 matrices: || L – Lnp || often O(10^3)
• Much harder to break TSLU, but possible
  – Occurred when using TSLU to factorize a low-rank subdiagonal panel in symmetric-indefinite factorization
Fixing TSLU
• Run TSLU quickly, test for stability, fix if necessary (rare)
• Test conditioning of U: if not tiny (usual case), proceed, else
• Compute || L ||: if not big (usual case), proceed, else
• Factor A = QR using TSQR, then
• Factor Q = PLU using TSLU, then
• A = P·L·(U·R), with U·R as the upper triangular factor
(A sketch of this fallback logic follows.)
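A sketch of the fallback, with numpy/scipy routines standing in for TSQR and TSLU; the function name and the two thresholds are illustrative assumptions, not values from the slide:

import numpy as np
from scipy.linalg import lu

def robust_panel_factor(A, cond_tol=1e12, growth_tol=1e6):
    # Try the fast (tournament-pivoted) LU first; in the rare unstable case
    # fall back through QR, exactly as outlined above.
    P, L, U = lu(A)                       # stand-in for TSLU
    if np.linalg.cond(U) < cond_tol and np.abs(L).max() < growth_tol:
        return P, L, U                    # usual case: accept the fast factorization
    Q, R = np.linalg.qr(A)                # rare case: A = QR via TSQR ...
    P, L, U1 = lu(Q)                      # ... then Q = P L U1 via TSLU ...
    return P, L, U1 @ R                   # ... so A = P L (U1 R), U1 R upper triangular

A = np.random.rand(500, 16)
P, L, U = robust_panel_factor(A)
assert np.allclose(P @ L @ U, A)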
2D CALU with Tournament Pivoting
2.5D CALU with Tournament Pivoting (c=4 copies)
Exascale Machine Parameters (Source: DOE Exascale Workshop)
• 2^20 ≈ 1,000,000 nodes
• 1,024 cores/node (a billion cores)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node
Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU
  [Heat map over log2(p) and log2(n^2/p) = log2(memory_per_proc); predicted speedups up to 29x]
2.5D vs 2D LU, With and Without Pivoting
Other CA algorithms for Ax=b, least squares (1/3)
• A symmetric and indefinite
  – Seek a factorization that retains symmetry, P·A·P^T = L·D·L^T, with D "simple"
    • Saves 1/2 the flops, preserves inertia
  – Usual approach: Bunch-Kaufman
    • D block diagonal with 1x1 and 2x2 blocks
    • Pivot search down a column, along a row (lots of communication)
  – Alternative: Aasen
    • D = tridiagonal = T
    • Two steps:
      – P·A·P^T = L·T·L^T where T is banded, using TSLU
        [Sketch: narrow banded T, zeros outside the band]
      – Solve/factor the narrow-band problem with T
    • Up to 2.8x faster than MKL; Best Paper at IPDPS'13
Other CA algorithms for Ax=b, least squares (2/3)
• Minimizing bandwidth and latency for sequential GEPP
  – So far, could not do partial pivoting and minimize #messages, just #words
  – Challenge:
    • Column layout good for choosing pivots, bad for matmul
    • Blocked layout good for matmul, bad for choosing pivots
  – Solution: use both layouts, switching between them
    • "Shape Morphing LU", or SMLU
• Standard recursive algorithm:
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        update right half of A
        factor(right half of A)
  Words = O(n^3 / M^(1/2)), Messages = O(n^3 / M)
• Shape-morphing version:
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        reshape to recursive block format
        update right half of A
        reshape to columnwise format
        factor(right half of A)
  Words = O(n^3 / M^(1/2)), Messages = O(n^3 / M^(3/2))
Other CA algorithms for Ax=b, least squares (3/3)
• The need for pivoting arises beyond LU, in QR
  – Choose a permutation P so that the leading columns of A·P = Q·R span the column space of A – Rank Revealing QR (RRQR)
  – Usual approach, like Partial Pivoting:
    • Put the longest column first, update the rest of the matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit the lower bound
  – Use Tournament Pivoting (see the column-selection sketch below)
    • Each round of the tournament selects the best b columns from two groups of b columns, either using the usual approach or something better (Gu/Eisenstat)
    • Thm: This approach "reveals the rank" of A, in the sense that the leading r x r submatrix of R has singular values "near" the largest r singular values of A; ditto for the trailing submatrix
  – The idea extends to other pivoting schemes
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDL^T with complete pivoting
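A sketch of tournament pivoting on columns: each round uses ordinary column-pivoted QR (scipy) as the "usual approach" to pick the b strongest columns from a merged group; function names and the example matrix are illustrative assumptions:

import numpy as np
from scipy.linalg import qr

def best_b_cols(A, cols, b):
    # Column-pivoted QR on the candidate columns; keep the b columns chosen first.
    _, _, perm = qr(A[:, cols], mode='economic', pivoting=True)
    return cols[perm[:b]]

def tournament_pivot_cols(A, b):
    groups = [np.arange(j, min(j + b, A.shape[1])) for j in range(0, A.shape[1], b)]
    while len(groups) > 1:
        pairs = [groups[i:i + 2] for i in range(0, len(groups), 2)]
        groups = [best_b_cols(A, np.concatenate(p), b) if len(p) == 2 else p[0]
                  for p in pairs]
    return groups[0]   # indices of b columns that approximately span the column space

# example: a numerically rank-8 matrix with 64 columns
A = np.random.rand(200, 8) @ np.random.rand(8, 64) + 1e-10 * np.random.rand(200, 64)
print(tournament_pivot_cols(A, b=8))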
Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity
What about sparse matrices? (1/3)
• If the matrix quickly becomes dense, use a dense algorithm
• Ex: All Pairs Shortest Paths using Floyd-Warshall
• Similar to matmul: let D = A, then
    for k = 1:n
      for i = 1:n
        for j = 1:n
          D(i,j) = min( D(i,j), D(i,k) + D(k,j) )
• But we can't reorder the outer loop for 2.5D; need another idea
• Abbreviate D(i,j) = min( D(i,j), min_k( A(i,k) + B(k,j) ) ) by D = A*B
  – Dependencies are ok, 2.5D works, just a different semiring
• Kleene's Algorithm (a runnable sketch follows):
    D = DC-APSP(A, n)
      D = A
      Partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
      D11 = DC-APSP(D11, n/2)
      D12 = D11 * D12
      D21 = D21 * D11
      D22 = D21 * D12
      D22 = DC-APSP(D22, n/2)
      D21 = D22 * D21
      D12 = D12 * D22
      D11 = D12 * D21
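A runnable sequential version of the divide-and-conquer APSP over the (min,+) semiring, assuming D holds edge weights with 0 on the diagonal and +inf for missing edges; where the slide's pseudocode leaves the accumulation into the existing block implicit, the min is written explicitly here (the function names are illustrative):

import numpy as np

def minplus(A, B):
    # (min,+) "matrix multiply": C[i,j] = min_k A[i,k] + B[k,j]
    return np.min(A[:, :, None] + B[None, :, :], axis=1)

def dc_apsp(D):
    n = D.shape[0]
    if n == 1:
        return D.copy()
    k = n // 2
    D = D.copy()
    D11, D12, D21, D22 = D[:k, :k], D[:k, k:], D[k:, :k], D[k:, k:]
    D11[:] = dc_apsp(D11)                                  # close the first block
    D12[:] = minplus(D11, D12)                             # 1 -> 2 paths through block 1
    D21[:] = minplus(D21, D11)                             # 2 -> 1 paths through block 1
    D22[:] = np.minimum(D22, minplus(D21, D12))            # detours through block 1
    D22[:] = dc_apsp(D22)                                  # close the second block
    D21[:] = minplus(D22, D21)                             # now allow block-2 intermediates
    D12[:] = minplus(D12, D22)
    D11[:] = np.minimum(D11, minplus(D12, D21))            # detours through block 2
    return D

# check against Floyd-Warshall on a small random graph
n, INF = 16, np.inf
G = np.where(np.random.rand(n, n) < 0.3, np.random.rand(n, n), INF)
np.fill_diagonal(G, 0.0)
FW = G.copy()
for kk in range(n):
    FW = np.minimum(FW, FW[:, kk:kk+1] + FW[kk:kk+1, :])
assert np.allclose(dc_apsp(G), FW)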
Performance of 2.5D APSP using Kleene
  [Plot: strong scaling on Hopper (Cray XE6 with 1,024 nodes = 24,576 cores); annotated speedups of 6.2x and 2x]
What about sparse matrices? (2/3)
• If parts of the matrix become dense, optimize those
• Ex: Cholesky on a matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): If all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on a w x w submatrix
• Thm: Words_moved = Ω(w^3 / M^(1/2)), etc.
• Thm (George '73): Nested dissection gives an optimal ordering for a 2D grid, 3D grid, and similar matrices
  – w = n for a 2D n x n grid, w = n^2 for a 3D n x n x n grid
• Sequential multifrontal Cholesky attains the bounds
• PSPASES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains the 2D and 2.5D lower bounds (using optimal dense Cholesky on the separators)
What about sparse matrices? (3/3)
• If the matrix stays very sparse, the lower bound is unattainable; need a new one
• Ex: A, B both diagonal: no communication in the parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), i.i.d. (a small scipy illustration follows)
• Assumption: the algorithm is sparsity-independent: the assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
    Words_moved = Ω( min( dn/P^(1/2), d^2·n/P ) )
  – Proof exploits the fact that reuse of entries of C = A*B is unlikely
• Contrast the general lower bound: Words_moved = Ω( d^2·n/(P·M^(1/2)) )
• Attained by a divide-and-conquer algorithm that splits the matrices along the dimensions most likely to minimize cost
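A small illustration of the regime in the theorem (nnz ≈ dn, flops ≈ d^2·n, so communication rather than arithmetic dominates), using scipy.sparse; the sizes are arbitrary assumptions:

import scipy.sparse as sp

n, d = 4096, 8                              # Erdos-Renyi density d/n, with d << sqrt(n)
A = sp.random(n, n, density=d / n, format='csr', random_state=0)
B = sp.random(n, n, density=d / n, format='csr', random_state=1)
C = A @ B                                   # classical sparse-sparse multiply
print("nnz(A) ~ d*n:", A.nnz, " flops ~ d^2*n:", d * d * n, " nnz(C):", C.nnz)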
Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity
Symmetric Eigenproblem and SVD
• Usual approach for A = A^T (SVD similar)
  – A → Q^T·A·Q = T, where Q orthogonal, T tridiagonal
  – T → U^T·T·U = Λ, where U orthogonal, Λ diagonal
  – (Q·U)'s columns are the eigenvectors, Λ holds the eigenvalues
  – Dense → Tridiagonal → Diagonal
  – Only half BLAS3, half BLAS2, in LAPACK's sytrd
• Communication-Avoiding approach
  – A → Q·A·Q^T = B, where B = B^T, banded with bandwidth M^(1/2)
  – Continue as above, starting with B
  – Dense → Banded → Tridiagonal → Diagonal
  – Dense → Banded: use TSQR to zero out M^(1/2) cols/rows at a time
  – Banded → Tridiagonal: need a new(ish) idea
Successive Band Reduction (Bischof/Lang/Sun)
  b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b
  [Animation, one frame per step: eliminate c columns of the band with an orthogonal transform Q1 (creating a bulge of width d+c), apply Q1^T, then chase the bulge down the band with Q2, Q2^T, Q3, Q3^T, ..., reducing the bandwidth from b to d]
Conventional vs CA-SBR
  Conventional: touch all data 4 times
  Communication-Avoiding: touch all data once
Perfect Strong Scaling – in Time and Energy (2/2)
• T(cP) = n^3/(cP) · [ γ_T + β_T / M^(1/2) + α_T / (m·M^(1/2)) ] = T(P)/c
• E(cP) = cP · { n^3/(cP) · [ γ_E + β_E / M^(1/2) + α_E / (m·M^(1/2)) ] } + δ_E·M·T(cP) + ε_E·T(cP) = E(P)
• Can use these formulas to answer many questions, such as
  – How to choose p and M to minimize the energy E needed for a computation
  – Given a max allowed runtime T, what is the minimum energy E needed to achieve it?
  – Given a max allowed energy E, what is the minimum runtime T attainable?
  – Can we minimize the average power P = E/T?
  – Given a target energy efficiency, what architectural parameters are needed to achieve it?
    • Can we attain 75 Gflops/Watt?
    • Can we attain an exaflop for 20 MWatts?
Handling Heterogeneity
• Suppose each of P processors could differ
  – γ_i = sec/flop, β_i = sec/word, α_i = sec/message, M_i = memory
• What is the optimal assignment of work F_i to minimize time?
  – T_i = F_i·γ_i + F_i·β_i / M_i^(1/2) + F_i·α_i / M_i^(3/2) = F_i·[ γ_i + β_i / M_i^(1/2) + α_i / M_i^(3/2) ] = F_i·ξ_i
  – Choose F_i so that Σ_i F_i = n^3, minimizing T = max_i T_i
  – Answer: F_i = n^3·(1/ξ_i) / Σ_j(1/ξ_j) and T = n^3 / Σ_j(1/ξ_j)   (evaluated in the sketch below)
• Optimal algorithm for n x n matmul
  – Recursively divide into 8 half-sized subproblems
  – Assign subproblems to processor i to add up to F_i flops
• Works for Strassen, other algorithms…
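The closed-form work assignment above, evaluated directly; the hardware numbers below are made-up illustrations:

import numpy as np

def heterogeneous_split(n, gamma, beta, alpha, M):
    # cost per flop on processor i: xi_i = gamma_i + beta_i/M_i^(1/2) + alpha_i/M_i^(3/2)
    xi = gamma + beta / np.sqrt(M) + alpha / M**1.5
    inv = 1.0 / xi
    F = n**3 * inv / inv.sum()        # flops assigned to processor i
    T = n**3 / inv.sum()              # balanced runtime: T_i = F_i * xi_i = T for every i
    return F, T

gamma = np.array([1e-11, 2e-11, 5e-11])      # sec/flop
beta  = np.array([1e-9,  2e-9,  2e-9])       # sec/word
alpha = np.array([1e-6,  1e-6,  5e-6])       # sec/message
M     = np.array([1e8,   1e8,   1e7])        # words of fast memory
F, T = heterogeneous_split(4096, gamma, beta, alpha, M)
print(F, T)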
Application to Tensor Contractions
• Ex: C(i,j,k) = Σ_{m,n} A(i,j,m,n)·B(m,n,k)
  – Communication lower bounds apply
• Complex symmetries possible
  – Ex: B(m,n,k) = B(k,m,n) = …
  – d-fold symmetry can save up to d-fold flops/memory
• Heavily used in electronic structure calculations
  – Ex: NWChem, for the coupled cluster (CC) approach to the Schroedinger equation
• CTF: Cyclops Tensor Framework
  – Exploits 2.5D algorithms, symmetries
  – Up to 3x faster running CC than NWChem on 3072 cores of a Cray XE6
  – Solomonik, Hammond, Matthews
  [Diagram: C(i,j,k) = Σ_m A(i,j,m)·B(m,k), with A 3-fold symmetric, B 2-fold symmetric, C 2-fold symmetric]
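The example contraction written out with numpy.einsum; this naive version exploits no symmetry (which is exactly what CTF adds), and the dimension sizes are arbitrary:

import numpy as np

ni, nj, nk, nm, nn = 6, 7, 8, 4, 5
A = np.random.rand(ni, nj, nm, nn)
B = np.random.rand(nm, nn, nk)
# C(i,j,k) = sum over m,n of A(i,j,m,n) * B(m,n,k)
C = np.einsum('ijmn,mnk->ijk', A, B)
print(C.shape)   # (6, 7, 8)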
Communication Lower Bounds for Strassen-like matmul algorithms
• Proof: graph expansion (different from classical matmul)
  – Strassen-like: the DAG must be "regular" and connected
• Extends up to M = n^2 / p^(2/ω)
• Extends to the rectangular case: multiply (m x n)·(n x p) in q mults
  – words_moved = Ω( #flops / M^(log_mp q − 1) )
• Best Paper Prize (SPAA'11), Ballard, D., Holtz, Schwartz; also appeared in JACM
• Is the lower bound attainable?
  Classical O(n^3) matmul: words_moved = Ω( M·(n/M^(1/2))^3 / P )
  Strassen's O(n^lg7) matmul: words_moved = Ω( M·(n/M^(1/2))^lg7 / P )
  Strassen-like O(n^ω) matmul: words_moved = Ω( M·(n/M^(1/2))^ω / P )

Communication Avoiding Parallel Strassen (CAPS)
  BFS step: runs all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory
    vs
  DFS step: runs all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory
  CAPS: if EnoughMemory and P ≥ 7 then BFS step else DFS step end if
  The best way to interleave BFS and DFS is a tuning parameter (a sequential sketch of the 7 multiplies follows)
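A sequential Strassen sketch showing the 7 recursive products; in CAPS, a BFS step runs these 7 calls simultaneously on P/7 processor groups, while a DFS step runs them one after another on all P processors. This assumes n is a power of 2, and the cutoff is illustrative:

import numpy as np

def strassen(A, B, cutoff=64):
    n = A.shape[0]
    if n <= cutoff:
        return A @ B
    k = n // 2
    A11, A12, A21, A22 = A[:k, :k], A[:k, k:], A[k:, :k], A[k:, k:]
    B11, B12, B21, B22 = B[:k, :k], B[:k, k:], B[k:, :k], B[k:, k:]
    # the 7 subproducts: CAPS decides whether to run them breadth-first or depth-first
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)
    C = np.empty_like(A)
    C[:k, :k] = M1 + M4 - M5 + M7
    C[:k, k:] = M3 + M5
    C[k:, :k] = M2 + M4
    C[k:, k:] = M1 - M2 + M3 + M6
    return C

A = np.random.rand(256, 256); B = np.random.rand(256, 256)
assert np.allclose(strassen(A, B), A @ B)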
Performance Benchmarking, Strong Scaling Plot: Franklin (Cray XT4), n = 94080
  [Plot: speedups of 24%–184% over previous Strassen-based algorithms]
Invited to appear as a Research Highlight in CACM
Strassen-like beyond matmul
• Thm (D., Dumitriu, Holtz '07): Any Strassen-like O(n^ω) matmul algorithm can be used to build a numerically stable O(n^(ω+η)) algorithm, for any η>0, for Ax=b, least squares, eig, SVD, …
  – η>0 is needed to deal with numerical stability
  – Strassen itself is already stable, so η=0
• Thm: For sequential versions of these algorithms, Words_moved = O( n^(ω+η) / M^((ω+η)/2 − 1) + n^2·log n ), i.e. they attain the expected lower bound
  (Ballard, D., Holtz, Schwartz)
Cache and Network Oblivious Algorithms
bull Motivation Minimizes communication at every level of a hierarchical system without tuning parameters (in theory)ndash Not always 25D Matmul on BGP was topology aware
bull CAPS Divide-and-conquer choose BFS or DFS to adapt to processors available memory
bull CARMAndash Divide-and-conquer classical matmul divide largest of 3
dimensions to create two subproblemsndash Choose BFS or DFS to adapt to processors available memory
CARMA Performance Distributed Memory
Square m = k = n = 6144
ScaLAPACK
CARMA
Peak
(log)
(log)
Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA
CARMA Performance Distributed Memory
Inner Product m = n = 192 k = 6291456
ScaLAPACK
CARMAPeak
(log)
(log)
Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA
CARMA Performance Shared Memory
Square m = k = n
MKL (double)CARMA (double)
MKL (single)CARMA (single)
Peak (single)
Peak (double)
(log)
(linear)
Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA
CARMA Performance Shared Memory
Inner Product m = n = 64
MKL (double)
CARMA (double)
MKL (single)
CARMA (single)
(log)
(linear)
Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA
Why is CARMA Faster in Shared MemoryL3 Cache Misses
Shared Memory Inner Product (m = n = 64 k = 524288)
97 Fewer Misses
86 Fewer Misses
(linear)
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
One-sided Factorizations (LU QR) so farbull Classical Approach for i=1 to n update column i update trailing matrixbull words_moved = O(n3)
35
bull Blocked Approach (LAPACK) for i=1 to nb update block i of b columns update trailing matrixbull words moved = O(n3M13)
bull Recursive Approach func factor(A) if A has 1 column update it
else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)
bull None of these approaches minimizes messagesbull Parallel case Partial
Pivoting =gt n reductionsbull Need another idea
TSQR An Architecture-Dependent Algorithm
W =
W0
W1
W2
W3
R00
R10
R20
R30
R01
R11
R02Parallel
W =
W0
W1
W2
W3
R01R02
R00
R03
SequentialStreaming
W =
W0
W1
W2
W3
R00
R01
R01
R11
R02
R11
R03
Dual Core
Can choose reduction tree dynamically
Multicore Multisocket Multirack Multisite Out-of-core
Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo
Wnxb =
W1
W2
W3
W4
P1middotL1middotU1
P2middotL2middotU2
P3middotL3middotU3
P4middotL4middotU4
=
Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo
W1rsquoW2rsquoW3rsquoW4rsquo
P12middotL12middotU12
P34middotL34middotU34
=Choose b pivot rows call them W12rsquo
Choose b pivot rows call them W34rsquo
W12rsquoW34rsquo
= P1234middotL1234middotU1234
Choose b pivot rows
Go back to W and use these b pivot rows (move them to top do LU without pivoting)
37
Minimizing Communication in TSLU
W = W1
W2
W3
W4
LULULULU
LU
LULUParallel
W = W1
W2
W3
W4
LULU
LU
LUSequentialStreaming
W = W1
W2
W3
W4
LULU LU
LULU
LULU
Dual Core
Can choose reduction tree dynamically to match architecture as before
38
Making TSLU Numerically Stable
bull Details matterndash Going up the tree we could do LU either on original rows of A
(tournament pivoting) or computed rows of Undash Only tournament pivoting stable
bull ldquoThmrdquo New scheme as stable as Partial Pivoting (GEPP) in following sense Get same Schur complements as GEPP applied to different input matrix whose entries are blocks taken from input A
bull Why just a ldquoThmrdquo
39
Stability of LU using TSLU CALU
Summer School Lecture 4 40
bull Empirical testingndash Both random matrices and ldquospecial onesrdquondash Both binary tree (BCALU) and flat-tree (FCALU)ndash 3 metrics ||PA-LU||||A|| normwise and componentwise backward errorsndash See [D Grigori Xiang 2010] for details
Why is stability of TSLU just a ldquoThmrdquo
bull Proof is correct ndash in exact arithmeticbull Experiment
ndash Generate 100 random 6x6 rank 3 matrices in Matlabndash [LUP] = lu(A) do LU without pivoting on PA compare L factors are
they the samebull Compute || L ndash Lnp || A few 0rsquos A few infinrsquos a few NaNsbull Rest mostly O(1)
ndash Why Floating point is nonassociative doing arithmetic in different order gives different rounding errors
ndash Same experiment with rank 6 matrices || L ndash Lnp || usually nonzero O(macheps)
ndash Same experiment with 20x20 rank 4 matrices || L ndash Lnp || often O(103)
bull Much harder to break TSLU but possiblendash Occurred when using TSLU to factorize a low-rank subdiagonal
panel in symmetric-indefinite factorization41
Fixing TSLU
bull Run TSLU quickly test for stability fix if necessary (rare)
bull Test conditioning of U if not tiny (usual case) proceed elsebull Compute || L || if not big (usual case) proceed elsebull Factor A = QR using TSQR thenbull Factor Q = PLU using TSLU thenbull A = PL(UR) with UR as upper triangular factor
bull Last topic in lecture how to guarantee floating point reproducibility
42
2D CALU with Tournament Pivoting
43
25D CALU with Tournament Pivoting (c=4 copies)
44
Exascale Machine ParametersSource DOE Exascale Workshop
bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node
Exascale predicted speedupsfor Gaussian Elimination
2D CA-LU vs ScaLAPACK-LU
log2 (p)
log
2 (
n2p
) =
log
2 (m
emo
ry_p
er_p
roc)
Up to 29x
25D vs 2D LUWith and Without Pivoting
Other CA algorithms for Ax=b least squares(13)
bull A symmetric and indefinitendash Seek factorization that retains symmetry PAPT = LDLT D
ldquosimplerdquobull Save frac12 flops preserve inertia
ndash Usual approach Bunch-Kaufmanbull D block diagonal with 1x1 and 2x2 blocksbull Pivot search down column along row (lots of communication)
ndash Alternative Aasenbull D = tridiagonal = Tbull Two steps
ndash PAPT = LTLT where T is banded using TSLU
48
0 0
0
0 0
0
0
hellip
hellip
ndash Solvefactor narrow band problem with Tbull Up to 28x faster than MKL Best Paper at IPDPSrsquo13
Other CA algorithms for Ax=b least squares (23)bull Minimizing bandwidth and latency for sequential GEPP
ndash So far could not do partial pivoting and minimize messages just words
ndash Challengebull Column layout good for choosing pivots bad for matmulbull Blocked layout good for matmul bad for choosing pivots
ndash Solution use both layouts switching between thembull ldquoShape Morphing LUrdquo or SMLU
49
bull func factor(A) if A has 1 column update it else factor(left half of A)
update right half of A
factor(right half of A)
bull Words = O(n3M12)
bull Messages = O(n3M)
bull func factor(A) if A has 1 column update it else factor(left half of A) reshape to recursive block format update right half of A reshape to columnwise format factor(right half of A)
bull Words = O(n3M12)
bull Messages = O(n3M32)
Other CA algorithms for Ax=b least squares (33)bull Need for pivoting arises beyond LU in QR
ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)
ndash Usual approach like Partial Pivoting
bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound
ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two
groups of b columns either using usual approach or something better (GuEisenstat)
bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix
ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 50
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
What about sparse matrices (13)
bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then
bull But canrsquot reorder outer loop for 25D need another idea
bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring
bull Kleenersquos Algorithm
52
for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)
D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21
Performance of 25D APSP using Kleene
53
Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)
62xspeedup
2x speedup
What about sparse matrices (23)
bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of
G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix
bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering
for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid
bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse
multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on
separators) 54
What about sparse matrices (33)
bull If matrix stays very sparse lower bound unattainable new one
bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of
data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely
bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices
along dimensions most likely to minimize cost
55
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
Symmetric Eigenproblem and SVD
bull Usual approach for A=AT (SVD similar)ndash A QTAQ = T where Q orthogonal T tridiagonalndash T UTTU = Λ where U orthogonal Λ diagonalndash QUrsquos columns are eigenvectors Λ eigenvaluesndash Dense Tridiagonal Diagonalndash Only half BLAS3 half BLAS2 in LAPACKrsquos sytrd
bull Communication-Avoiding Approachndash A QAQT = B where B=BT banded of bandwidth M12
ndash Continue as above starting with Bndash Dense Banded Tridiagonal Diagonalndash Dense Banded use TSQR to zero out M12 colsrows at a timendash Banded Tridiagonal need new(ish) idea
b+1
b+1
Successive Band Reduction (BischofLangSun)
1
b+1
b+1
d+1
c
Successive Band Reduction (BischofLangSun)
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1Q1
b+1
b+1
d+1
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
12
Q1
b+1
b+1
d+1
d+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
12
Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
d+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
Q1
Q1T
Q2
Q2T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
Q1
Q1T
Q2
Q2T
Q3
Q3T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
6
6
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
Conventional vs CA - SBR
Conventional Communication-Avoiding
Touch all data 4 times Touch all data once
Handling Heterogeneity
bull Suppose each of P processors could differndash γi = secflop βi = secword αi = secmessage Mi = memory
bull What is optimal assignment of work Fi to minimize timendash Ti = Fi γi + Fi βi Mi
12 + Fi αi Mi32 = Fi [γi + βi Mi
12 + αi Mi32] = Fi ξi
ndash Choose Fi so Σi Fi = n3 and minimizing T = maxi Ti
ndash Answer Fi = n3(1ξi)Σj(1ξj) and T = n3Σj(1ξj)
bull Optimal Algorithm for nxn matmulndash Recursively divide into 8 half-sized subproblemsndash Assign subproblems to processor i to add up to Fi flops
bull Works for Strassen other algorithmshellip
Application to Tensor Contractions
bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply
bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory
bull Heavily used in electronic structure calculationsndash Ex NWChem
bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries
ndash Solomonik Hammond Matthews
C(ijk) = Σm A(ijm)B(mk)
A3-fold symm
B2-fold symm
C2-fold symm
Application to Tensor Contractions
bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply
bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory
bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn
bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews
Communication Lower Bounds for Strassen-like matmul algorithms
bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected
bull Extends up to M = n2 p2ω bull Extends to rectangular case multiply (mxn)(nxp) in q mults
ndash words_moved = Ω (flopsM^(logmpq -1))
bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz also in JACMbull Is the lower bound attainable
Classical O(n3) matmul
words_moved =Ω (M(nM12)3P)
Strassenrsquos O(nlg7) matmul
words_moved =Ω (M(nM12)lg7P)
Strassen-like O(nω) matmul
words_moved =Ω (M(nM12)ωP)
vs
Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory
Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory
CAPS If EnoughMemory and P 7 then BFS step else DFS step end if
Communication Avoiding Parallel Strassen (CAPS)
Best way to interleaveBFS and DFS is an tuning parameter
26
Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080
Speedups 24-184(over previous Strassen-based algorithms)
Invited to appear as Research Highlight in CACM
Strassen-like beyond matmul
bull Thm (D Dumitriu Holtzrsquo07) Any Strassen-like O(nω) matmul algorithm can be used to build a numerically stable O(nω+η) algorithm for any ηgt0 for Ax=b least squares eig SVD hellipndash ηgt0 needed to deal with numerical stabilityndash Strassen already stable so η=0
bull Thm For sequential versions of these algorithms Words_moved = O(nω+ηM(ω+η)2 ndash 1 + n2 log n) ie attain expected lower bound
Ballard D Holtz Schwartz
Cache and Network Oblivious Algorithms
bull Motivation Minimizes communication at every level of a hierarchical system without tuning parameters (in theory)ndash Not always 25D Matmul on BGP was topology aware
bull CAPS Divide-and-conquer choose BFS or DFS to adapt to processors available memory
bull CARMAndash Divide-and-conquer classical matmul divide largest of 3
dimensions to create two subproblemsndash Choose BFS or DFS to adapt to processors available memory
CARMA Performance Distributed Memory
Square m = k = n = 6144
ScaLAPACK
CARMA
Peak
(log)
(log)
Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA
CARMA Performance Distributed Memory
Inner Product m = n = 192 k = 6291456
ScaLAPACK
CARMAPeak
(log)
(log)
Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA
CARMA Performance Shared Memory
Square m = k = n
MKL (double)CARMA (double)
MKL (single)CARMA (single)
Peak (single)
Peak (double)
(log)
(linear)
Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA
CARMA Performance Shared Memory
Inner Product m = n = 64
MKL (double)
CARMA (double)
MKL (single)
CARMA (single)
(log)
(linear)
Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA
Why is CARMA Faster in Shared MemoryL3 Cache Misses
Shared Memory Inner Product (m = n = 64 k = 524288)
97 Fewer Misses
86 Fewer Misses
(linear)
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
One-sided Factorizations (LU QR) so farbull Classical Approach for i=1 to n update column i update trailing matrixbull words_moved = O(n3)
35
bull Blocked Approach (LAPACK) for i=1 to nb update block i of b columns update trailing matrixbull words moved = O(n3M13)
bull Recursive Approach func factor(A) if A has 1 column update it
else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)
bull None of these approaches minimizes messagesbull Parallel case Partial
Pivoting =gt n reductionsbull Need another idea
TSQR An Architecture-Dependent Algorithm
W =
W0
W1
W2
W3
R00
R10
R20
R30
R01
R11
R02Parallel
W =
W0
W1
W2
W3
R01R02
R00
R03
SequentialStreaming
W =
W0
W1
W2
W3
R00
R01
R01
R11
R02
R11
R03
Dual Core
Can choose reduction tree dynamically
Multicore Multisocket Multirack Multisite Out-of-core
Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo
Wnxb =
W1
W2
W3
W4
P1middotL1middotU1
P2middotL2middotU2
P3middotL3middotU3
P4middotL4middotU4
=
Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo
W1rsquoW2rsquoW3rsquoW4rsquo
P12middotL12middotU12
P34middotL34middotU34
=Choose b pivot rows call them W12rsquo
Choose b pivot rows call them W34rsquo
W12rsquoW34rsquo
= P1234middotL1234middotU1234
Choose b pivot rows
Go back to W and use these b pivot rows (move them to top do LU without pivoting)
37
Minimizing Communication in TSLU
W = W1
W2
W3
W4
LULULULU
LU
LULUParallel
W = W1
W2
W3
W4
LULU
LU
LUSequentialStreaming
W = W1
W2
W3
W4
LULU LU
LULU
LULU
Dual Core
Can choose reduction tree dynamically to match architecture as before
38
Making TSLU Numerically Stable
bull Details matterndash Going up the tree we could do LU either on original rows of A
(tournament pivoting) or computed rows of Undash Only tournament pivoting stable
bull ldquoThmrdquo New scheme as stable as Partial Pivoting (GEPP) in following sense Get same Schur complements as GEPP applied to different input matrix whose entries are blocks taken from input A
bull Why just a ldquoThmrdquo
39
Stability of LU using TSLU CALU
Summer School Lecture 4 40
bull Empirical testingndash Both random matrices and ldquospecial onesrdquondash Both binary tree (BCALU) and flat-tree (FCALU)ndash 3 metrics ||PA-LU||||A|| normwise and componentwise backward errorsndash See [D Grigori Xiang 2010] for details
Why is stability of TSLU just a ldquoThmrdquo
bull Proof is correct ndash in exact arithmeticbull Experiment
ndash Generate 100 random 6x6 rank 3 matrices in Matlabndash [LUP] = lu(A) do LU without pivoting on PA compare L factors are
they the samebull Compute || L ndash Lnp || A few 0rsquos A few infinrsquos a few NaNsbull Rest mostly O(1)
ndash Why Floating point is nonassociative doing arithmetic in different order gives different rounding errors
ndash Same experiment with rank 6 matrices || L ndash Lnp || usually nonzero O(macheps)
ndash Same experiment with 20x20 rank 4 matrices || L ndash Lnp || often O(103)
bull Much harder to break TSLU but possiblendash Occurred when using TSLU to factorize a low-rank subdiagonal
panel in symmetric-indefinite factorization41
Fixing TSLU
bull Run TSLU quickly test for stability fix if necessary (rare)
bull Test conditioning of U if not tiny (usual case) proceed elsebull Compute || L || if not big (usual case) proceed elsebull Factor A = QR using TSQR thenbull Factor Q = PLU using TSLU thenbull A = PL(UR) with UR as upper triangular factor
bull Last topic in lecture how to guarantee floating point reproducibility
42
2D CALU with Tournament Pivoting
43
25D CALU with Tournament Pivoting (c=4 copies)
44
Exascale Machine ParametersSource DOE Exascale Workshop
bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node
Exascale predicted speedupsfor Gaussian Elimination
2D CA-LU vs ScaLAPACK-LU
log2 (p)
log
2 (
n2p
) =
log
2 (m
emo
ry_p
er_p
roc)
Up to 29x
25D vs 2D LUWith and Without Pivoting
Other CA algorithms for Ax=b least squares(13)
bull A symmetric and indefinitendash Seek factorization that retains symmetry PAPT = LDLT D
ldquosimplerdquobull Save frac12 flops preserve inertia
ndash Usual approach Bunch-Kaufmanbull D block diagonal with 1x1 and 2x2 blocksbull Pivot search down column along row (lots of communication)
ndash Alternative Aasenbull D = tridiagonal = Tbull Two steps
ndash PAPT = LTLT where T is banded using TSLU
48
0 0
0
0 0
0
0
hellip
hellip
ndash Solvefactor narrow band problem with Tbull Up to 28x faster than MKL Best Paper at IPDPSrsquo13
Other CA algorithms for Ax=b least squares (23)bull Minimizing bandwidth and latency for sequential GEPP
ndash So far could not do partial pivoting and minimize messages just words
ndash Challengebull Column layout good for choosing pivots bad for matmulbull Blocked layout good for matmul bad for choosing pivots
ndash Solution use both layouts switching between thembull ldquoShape Morphing LUrdquo or SMLU
49
bull func factor(A) if A has 1 column update it else factor(left half of A)
update right half of A
factor(right half of A)
bull Words = O(n3M12)
bull Messages = O(n3M)
bull func factor(A) if A has 1 column update it else factor(left half of A) reshape to recursive block format update right half of A reshape to columnwise format factor(right half of A)
bull Words = O(n3M12)
bull Messages = O(n3M32)
Other CA algorithms for Ax=b least squares (33)bull Need for pivoting arises beyond LU in QR
ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)
ndash Usual approach like Partial Pivoting
bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound
ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two
groups of b columns either using usual approach or something better (GuEisenstat)
bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix
ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 50
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
What about sparse matrices (13)
bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then
bull But canrsquot reorder outer loop for 25D need another idea
bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring
bull Kleenersquos Algorithm
52
for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)
D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21
Performance of 25D APSP using Kleene
53
Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)
62xspeedup
2x speedup
What about sparse matrices (23)
bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of
G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix
bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering
for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid
bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse
multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on
separators) 54
What about sparse matrices (33)
bull If matrix stays very sparse lower bound unattainable new one
bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of
data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely
bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices
along dimensions most likely to minimize cost
55
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
Symmetric Eigenproblem and SVD
bull Usual approach for A=AT (SVD similar)ndash A QTAQ = T where Q orthogonal T tridiagonalndash T UTTU = Λ where U orthogonal Λ diagonalndash QUrsquos columns are eigenvectors Λ eigenvaluesndash Dense Tridiagonal Diagonalndash Only half BLAS3 half BLAS2 in LAPACKrsquos sytrd
bull Communication-Avoiding Approachndash A QAQT = B where B=BT banded of bandwidth M12
ndash Continue as above starting with Bndash Dense Banded Tridiagonal Diagonalndash Dense Banded use TSQR to zero out M12 colsrows at a timendash Banded Tridiagonal need new(ish) idea
b+1
b+1
Successive Band Reduction (BischofLangSun)
1
b+1
b+1
d+1
c
Successive Band Reduction (BischofLangSun)
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1Q1
b+1
b+1
d+1
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
12
Q1
b+1
b+1
d+1
d+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
12
Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
d+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
Q1
Q1T
Q2
Q2T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
Q1
Q1T
Q2
Q2T
Q3
Q3T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
6
6
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
Conventional vs CA - SBR
Conventional Communication-Avoiding
Touch all data 4 times Touch all data once
Application to Tensor Contractions
bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply
bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory
bull Heavily used in electronic structure calculationsndash Ex NWChem
bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries
ndash Solomonik Hammond Matthews
C(ijk) = Σm A(ijm)B(mk)
A3-fold symm
B2-fold symm
C2-fold symm
Application to Tensor Contractions
bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply
bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory
bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn
bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews
Communication Lower Bounds for Strassen-like matmul algorithms
bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected
bull Extends up to M = n2 p2ω bull Extends to rectangular case multiply (mxn)(nxp) in q mults
ndash words_moved = Ω (flopsM^(logmpq -1))
bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz also in JACMbull Is the lower bound attainable
Classical O(n3) matmul
words_moved =Ω (M(nM12)3P)
Strassenrsquos O(nlg7) matmul
words_moved =Ω (M(nM12)lg7P)
Strassen-like O(nω) matmul
words_moved =Ω (M(nM12)ωP)
vs
Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory
Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory
CAPS If EnoughMemory and P 7 then BFS step else DFS step end if
Communication Avoiding Parallel Strassen (CAPS)
Best way to interleaveBFS and DFS is an tuning parameter
26
Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080
Speedups 24-184(over previous Strassen-based algorithms)
Invited to appear as Research Highlight in CACM
Strassen-like beyond matmul
bull Thm (D Dumitriu Holtzrsquo07) Any Strassen-like O(nω) matmul algorithm can be used to build a numerically stable O(nω+η) algorithm for any ηgt0 for Ax=b least squares eig SVD hellipndash ηgt0 needed to deal with numerical stabilityndash Strassen already stable so η=0
bull Thm For sequential versions of these algorithms Words_moved = O(nω+ηM(ω+η)2 ndash 1 + n2 log n) ie attain expected lower bound
Ballard D Holtz Schwartz
Cache and Network Oblivious Algorithms
bull Motivation Minimizes communication at every level of a hierarchical system without tuning parameters (in theory)ndash Not always 25D Matmul on BGP was topology aware
bull CAPS Divide-and-conquer choose BFS or DFS to adapt to processors available memory
bull CARMAndash Divide-and-conquer classical matmul divide largest of 3
dimensions to create two subproblemsndash Choose BFS or DFS to adapt to processors available memory
CARMA Performance Distributed Memory
Square m = k = n = 6144
ScaLAPACK
CARMA
Peak
(log)
(log)
Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA
CARMA Performance Distributed Memory
Inner Product m = n = 192 k = 6291456
ScaLAPACK
CARMAPeak
(log)
(log)
Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA
CARMA Performance Shared Memory
Square m = k = n
MKL (double)CARMA (double)
MKL (single)CARMA (single)
Peak (single)
Peak (double)
(log)
(linear)
Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA
CARMA Performance Shared Memory
Inner Product m = n = 64
MKL (double)
CARMA (double)
MKL (single)
CARMA (single)
(log)
(linear)
Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA
Why is CARMA Faster in Shared MemoryL3 Cache Misses
Shared Memory Inner Product (m = n = 64 k = 524288)
97 Fewer Misses
86 Fewer Misses
(linear)
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
One-sided Factorizations (LU QR) so farbull Classical Approach for i=1 to n update column i update trailing matrixbull words_moved = O(n3)
35
bull Blocked Approach (LAPACK) for i=1 to nb update block i of b columns update trailing matrixbull words moved = O(n3M13)
bull Recursive Approach func factor(A) if A has 1 column update it
else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)
bull None of these approaches minimizes messagesbull Parallel case Partial
Pivoting =gt n reductionsbull Need another idea
TSQR An Architecture-Dependent Algorithm
W =
W0
W1
W2
W3
R00
R10
R20
R30
R01
R11
R02Parallel
W =
W0
W1
W2
W3
R01R02
R00
R03
SequentialStreaming
W =
W0
W1
W2
W3
R00
R01
R01
R11
R02
R11
R03
Dual Core
Can choose reduction tree dynamically
Multicore Multisocket Multirack Multisite Out-of-core
Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo
Wnxb =
W1
W2
W3
W4
P1middotL1middotU1
P2middotL2middotU2
P3middotL3middotU3
P4middotL4middotU4
=
Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo
W1rsquoW2rsquoW3rsquoW4rsquo
P12middotL12middotU12
P34middotL34middotU34
=Choose b pivot rows call them W12rsquo
Choose b pivot rows call them W34rsquo
W12rsquoW34rsquo
= P1234middotL1234middotU1234
Choose b pivot rows
Go back to W and use these b pivot rows (move them to top do LU without pivoting)
37
Minimizing Communication in TSLU
W = W1
W2
W3
W4
LULULULU
LU
LULUParallel
W = W1
W2
W3
W4
LULU
LU
LUSequentialStreaming
W = W1
W2
W3
W4
LULU LU
LULU
LULU
Dual Core
Can choose reduction tree dynamically to match architecture as before
38
Making TSLU Numerically Stable
bull Details matterndash Going up the tree we could do LU either on original rows of A
(tournament pivoting) or computed rows of Undash Only tournament pivoting stable
bull ldquoThmrdquo New scheme as stable as Partial Pivoting (GEPP) in following sense Get same Schur complements as GEPP applied to different input matrix whose entries are blocks taken from input A
bull Why just a ldquoThmrdquo
39
Stability of LU using TSLU CALU
Summer School Lecture 4 40
bull Empirical testingndash Both random matrices and ldquospecial onesrdquondash Both binary tree (BCALU) and flat-tree (FCALU)ndash 3 metrics ||PA-LU||||A|| normwise and componentwise backward errorsndash See [D Grigori Xiang 2010] for details
Why is stability of TSLU just a ldquoThmrdquo
bull Proof is correct ndash in exact arithmeticbull Experiment
ndash Generate 100 random 6x6 rank 3 matrices in Matlabndash [LUP] = lu(A) do LU without pivoting on PA compare L factors are
they the samebull Compute || L ndash Lnp || A few 0rsquos A few infinrsquos a few NaNsbull Rest mostly O(1)
ndash Why Floating point is nonassociative doing arithmetic in different order gives different rounding errors
ndash Same experiment with rank 6 matrices || L ndash Lnp || usually nonzero O(macheps)
ndash Same experiment with 20x20 rank 4 matrices || L ndash Lnp || often O(103)
bull Much harder to break TSLU but possiblendash Occurred when using TSLU to factorize a low-rank subdiagonal
panel in symmetric-indefinite factorization41
Fixing TSLU
bull Run TSLU quickly test for stability fix if necessary (rare)
bull Test conditioning of U if not tiny (usual case) proceed elsebull Compute || L || if not big (usual case) proceed elsebull Factor A = QR using TSQR thenbull Factor Q = PLU using TSLU thenbull A = PL(UR) with UR as upper triangular factor
bull Last topic in lecture how to guarantee floating point reproducibility
42
2D CALU with Tournament Pivoting
43
25D CALU with Tournament Pivoting (c=4 copies)
44
Exascale Machine ParametersSource DOE Exascale Workshop
bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node
Exascale predicted speedupsfor Gaussian Elimination
2D CA-LU vs ScaLAPACK-LU
log2 (p)
log
2 (
n2p
) =
log
2 (m
emo
ry_p
er_p
roc)
Up to 29x
25D vs 2D LUWith and Without Pivoting
Other CA algorithms for Ax=b least squares(13)
bull A symmetric and indefinitendash Seek factorization that retains symmetry PAPT = LDLT D
ldquosimplerdquobull Save frac12 flops preserve inertia
ndash Usual approach Bunch-Kaufmanbull D block diagonal with 1x1 and 2x2 blocksbull Pivot search down column along row (lots of communication)
ndash Alternative Aasenbull D = tridiagonal = Tbull Two steps
ndash PAPT = LTLT where T is banded using TSLU
48
0 0
0
0 0
0
0
hellip
hellip
ndash Solvefactor narrow band problem with Tbull Up to 28x faster than MKL Best Paper at IPDPSrsquo13
Other CA algorithms for Ax=b least squares (23)bull Minimizing bandwidth and latency for sequential GEPP
ndash So far could not do partial pivoting and minimize messages just words
ndash Challengebull Column layout good for choosing pivots bad for matmulbull Blocked layout good for matmul bad for choosing pivots
ndash Solution use both layouts switching between thembull ldquoShape Morphing LUrdquo or SMLU
49
bull func factor(A) if A has 1 column update it else factor(left half of A)
update right half of A
factor(right half of A)
bull Words = O(n3M12)
bull Messages = O(n3M)
bull func factor(A) if A has 1 column update it else factor(left half of A) reshape to recursive block format update right half of A reshape to columnwise format factor(right half of A)
bull Words = O(n3M12)
bull Messages = O(n3M32)
Other CA algorithms for Ax=b least squares (33)bull Need for pivoting arises beyond LU in QR
ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)
ndash Usual approach like Partial Pivoting
bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound
ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two
groups of b columns either using usual approach or something better (GuEisenstat)
bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix
ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 50
Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
– Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
– LU & QR (tournament pivoting)
– Sparse matrices
– Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
– Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
– Reorganizing Krylov methods – Conjugate Gradients
– Stability challenges and approaches
– What is a "sparse matrix"?
• Floating-point reproducibility
– Despite nondeterminism/nonassociativity
What about sparse matrices? (1/3)
• If matrix quickly becomes dense, use dense algorithm
• Ex: All Pairs Shortest Path using Floyd-Warshall
• Similar to matmul: Let D = A, then:
• But can't reorder outer loop for 2.5D, need another idea
• Abbreviate D(i,j) = min( D(i,j), min_k( A(i,k) + B(k,j) ) ) by D = A·B
– Dependencies ok, 2.5D works, just different semiring
• Kleene's Algorithm:
52
for k = 1:n
  for i = 1:n
    for j = 1:n
      D(i,j) = min( D(i,j), D(i,k) + D(k,j) )
D = DC-APSP(A, n)
  D = A
  Partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
  D11 = DC-APSP(D11, n/2)
  D12 = D11 · D12
  D21 = D21 · D11
  D22 = D21 · D12
  D22 = DC-APSP(D22, n/2)
  D21 = D22 · D21
  D12 = D12 · D22
  D11 = D12 · D21
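A minimal numpy sketch of the min-plus product and the divide-and-conquer APSP recursion above; dense and sequential, purely to illustrate the data dependencies, not the 2.5D implementation. The distance matrix uses 0 on the diagonal and np.inf for missing edges.

```python
# Min-plus ("tropical") product and Kleene's DC-APSP, following the pseudocode above.
import numpy as np

def minplus(D, A, B):
    # D(i,j) = min( D(i,j), min_k A(i,k) + B(k,j) )
    return np.minimum(D, (A[:, :, None] + B[None, :, :]).min(axis=1))

def dc_apsp(D):
    n = D.shape[0]
    if n == 1:
        return D.copy()
    D = D.copy(); h = n // 2
    D[:h, :h] = dc_apsp(D[:h, :h])
    D[:h, h:] = minplus(D[:h, h:], D[:h, :h], D[:h, h:])
    D[h:, :h] = minplus(D[h:, :h], D[h:, :h], D[:h, :h])
    D[h:, h:] = minplus(D[h:, h:], D[h:, :h], D[:h, h:])
    D[h:, h:] = dc_apsp(D[h:, h:])
    D[h:, :h] = minplus(D[h:, :h], D[h:, h:], D[h:, :h])
    D[:h, h:] = minplus(D[:h, h:], D[:h, h:], D[h:, h:])
    D[:h, :h] = minplus(D[:h, :h], D[:h, h:], D[h:, :h])
    return D

INF = np.inf
A = np.array([[0, 3, INF], [INF, 0, 1], [2, INF, 0]], float)
print(dc_apsp(A))   # all-pairs shortest path distances
```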
Performance of 2.5D APSP using Kleene
53
Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)
6.2x speedup
2x speedup
What about sparse matrices? (2/3)
• If parts of matrix become dense, optimize those
• Ex: Cholesky on matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): If all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
– Need to do dense Cholesky on w x w submatrix
• Thm: Words_moved = Ω( w^3 / M^(1/2) ), etc.
• Thm (George '73): Nested dissection gives optimal ordering for 2D grid, 3D grid, similar matrices
– w = n for 2D n x n grid; w = n^2 for 3D n x n x n grid
• Sequential multifrontal Cholesky attains bounds
• PSPASES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
– Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)
54
What about sparse matrices? (3/3)
• If matrix stays very sparse, lower bound unattainable; need a new one
• Ex: A, B both diagonal: no communication in parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), i.i.d.
• Assumption: Algorithm is sparsity-independent: assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation):
Words_moved = Ω( min( d·n/P^(1/2), d^2·n/P ) )
– Proof exploits fact that reuse of entries of C = A·B is unlikely
• Contrast general lower bound: Words_moved = Ω( d^2·n / (P·M^(1/2)) )
• Attained by divide-and-conquer algorithm that splits matrices along dimensions most likely to minimize cost
55
Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
– Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
– LU & QR (tournament pivoting)
– Sparse matrices
– Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
– Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
– Reorganizing Krylov methods – Conjugate Gradients
– Stability challenges and approaches
– What is a "sparse matrix"?
• Floating-point reproducibility
– Despite nondeterminism/nonassociativity
Symmetric Eigenproblem and SVD
• Usual approach for A = A^T (SVD similar):
– A → Q^T·A·Q = T, where Q orthogonal, T tridiagonal
– T → U^T·T·U = Λ, where U orthogonal, Λ diagonal
– (Q·U)'s columns are eigenvectors, Λ holds eigenvalues
– Dense → Tridiagonal → Diagonal
– Only half BLAS3, half BLAS2, in LAPACK's sytrd
• Communication-Avoiding Approach:
– A → Q·A·Q^T = B, where B = B^T banded, of bandwidth M^(1/2)
– Continue as above, starting with B
– Dense → Banded → Tridiagonal → Diagonal
– Dense → Banded: use TSQR to zero out M^(1/2) cols/rows at a time
– Banded → Tridiagonal: need new(ish) idea
Successive Band Reduction (Bischof/Lang/Sun)
b = bandwidth, c = columns, d = diagonals; constraint: c + d ≤ b
[Figure: sequence of build slides showing band reduction by bulge chasing. Orthogonal sweeps Q1, Q1^T, Q2, Q2^T, ..., Q5, Q5^T eliminate c columns of the band (width b+1) at a time; each sweep creates a bulge of width d+c that is chased down the band in numbered steps 1 through 6.]
Conventional vs CA-SBR
Conventional: touch all data 4 times; Communication-Avoiding: touch all data once.
C(i,j,k) = Σ_m A(i,j,m)·B(m,k), with A 3-fold symmetric, B 2-fold symmetric, C 2-fold symmetric
Application to Tensor Contractions
• Ex: C(i,j,k) = Σ_{m,n} A(i,j,m,n)·B(m,n,k)
– Communication lower bounds apply
• Complex symmetries possible
– Ex: B(m,n,k) = B(k,m,n) = ...
– d-fold symmetry can save up to d-fold flops/memory
• Heavily used in electronic structure calculations
– Ex: NWChem, for coupled cluster (CC) approach to the Schroedinger eqn
• CTF: Cyclops Tensor Framework
– Exploits 2.5D algorithms, symmetries
– Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6
– Solomonik, Hammond, Matthews
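A minimal dense illustration of the contraction above via numpy.einsum; array sizes are made up, and no symmetry or 2.5D distribution is exploited (unlike CTF).

```python
# Dense sketch of C(i,j,k) = sum_{m,n} A(i,j,m,n) * B(m,n,k).
import numpy as np

I, J, K, M, N = 4, 4, 5, 6, 6
A = np.random.randn(I, J, M, N)
B = np.random.randn(M, N, K)
C = np.einsum('ijmn,mnk->ijk', A, B)
print(C.shape)   # (4, 4, 5)
```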
Communication Lower Bounds for Strassen-like matmul algorithms
• Proof: graph expansion (different from classical matmul)
– Strassen-like DAG must be "regular" and connected
• Extends up to M = n^2 / p^(2/ω)
• Extends to rectangular case: multiply (m x n)·(n x p) in q mults
– words_moved = Ω( flops / M^(log_mp(q) − 1) )
• Best Paper Prize (SPAA'11), Ballard, D., Holtz, Schwartz; also in JACM
• Is the lower bound attainable?
Classical O(n^3) matmul:
words_moved = Ω( M·(n/M^(1/2))^3 / P )
Strassen's O(n^lg7) matmul:
words_moved = Ω( M·(n/M^(1/2))^lg7 / P )
Strassen-like O(n^ω) matmul:
words_moved = Ω( M·(n/M^(1/2))^ω / P )
BFS step vs DFS step:
• BFS: runs all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory
• DFS: runs all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory
CAPS:
  if EnoughMemory and P ≥ 7
    then BFS step
    else DFS step
  end if
Communication Avoiding Parallel Strassen (CAPS)
Best way to interleave BFS and DFS is a tuning parameter
26
Performance Benchmarking, Strong Scaling Plot: Franklin (Cray XT4), n = 94080
Speedups: 24%–184% (over previous Strassen-based algorithms)
Invited to appear as Research Highlight in CACM
Strassen-like beyond matmul
• Thm (D., Dumitriu, Holtz '07): Any Strassen-like O(n^ω) matmul algorithm can be used to build a numerically stable O(n^(ω+η)) algorithm, for any η>0, for Ax=b, least squares, eig, SVD, ...
– η>0 needed to deal with numerical stability
– Strassen already stable, so η=0
• Thm: For sequential versions of these algorithms, Words_moved = O( n^(ω+η) / M^((ω+η)/2 − 1) + n^2·log n ), i.e. they attain the expected lower bound
Ballard, D., Holtz, Schwartz
Cache and Network Oblivious Algorithms
• Motivation: Minimize communication at every level of a hierarchical system without tuning parameters (in theory)
– Not always: 2.5D Matmul on BG/P was topology aware
• CAPS: Divide-and-conquer; choose BFS or DFS to adapt to processors, available memory
• CARMA
– Divide-and-conquer classical matmul: divide largest of 3 dimensions to create two subproblems
– Choose BFS or DFS to adapt to processors, available memory (see the sketch below)
CARMA Performance Distributed Memory
Square: m = k = n = 6144
[Figure: performance (log scale) vs. number of cores (log scale) for CARMA and ScaLAPACK, with machine peak indicated. Cray XE6 (Hopper), each node 2 x 12 cores, 4 x NUMA.]
CARMA Performance Distributed Memory
Inner Product: m = n = 192, k = 6291456
[Figure: performance (log scale) vs. number of cores (log scale) for CARMA and ScaLAPACK, with peak indicated. Cray XE6 (Hopper), each node 2 x 12 cores, 4 x NUMA.]
CARMA Performance Shared Memory
Square: m = k = n
[Figure: performance (linear scale) vs. problem size (log scale) for CARMA and MKL, single and double precision, with single- and double-precision peaks. Intel Emerald: 4 x Intel Xeon X7560, 8 cores each, 4 x NUMA.]
CARMA Performance Shared Memory
Inner Product: m = n = 64
[Figure: performance (linear scale) vs. k (log scale) for CARMA and MKL, single and double precision. Intel Emerald: 4 x Intel Xeon X7560, 8 cores each, 4 x NUMA.]
Why is CARMA Faster in Shared Memory? L3 Cache Misses
Shared Memory Inner Product (m = n = 64, k = 524288)
[Figure: L3 cache misses (linear scale); CARMA incurs 97% and 86% fewer misses than MKL.]
Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
– Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
– LU & QR (tournament pivoting)
– Sparse matrices
– Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
– Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
– Reorganizing Krylov methods – Conjugate Gradients
– Stability challenges and approaches
– What is a "sparse matrix"?
• Floating-point reproducibility
– Despite nondeterminism/nonassociativity
One-sided Factorizations (LU, QR), so far
• Classical Approach:
    for i = 1 to n
      update column i
      update trailing matrix
• words_moved = O(n^3)
35
• Blocked Approach (LAPACK):
    for i = 1 to n/b
      update block i of b columns
      update trailing matrix
• words_moved = O(n^3 / M^(1/3))
• Recursive Approach:
    func factor(A):
      if A has 1 column, update it
      else:
        factor(left half of A)
        update right half of A
        factor(right half of A)
• words_moved = O(n^3 / M^(1/2))
• None of these approaches minimizes messages
• Parallel case: Partial Pivoting => n reductions
• Need another idea
TSQR: An Architecture-Dependent Algorithm
W = [W0; W1; W2; W3]
[Figure: three TSQR reduction trees.
Parallel (binary tree): QR each Wi to get R00, R10, R20, R30; combine pairs to get R01, R11; combine those to get R02.
Sequential/Streaming (flat tree): QR(W0) = R00, then QR([R00; W1]) = R01, then QR([R01; W2]) = R02, then QR([R02; W3]) = R03.
Dual Core: a hybrid of the two trees.]
Can choose reduction tree dynamically: multicore, multisocket, multirack, multisite, out-of-core
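A minimal numpy sketch of TSQR on a tall-skinny W split into 4 row blocks, using the parallel (binary-tree) reduction shown above; in the real algorithm each block QR would run on a different processor, and here everything is sequential.

```python
# TSQR sketch: local QRs at the leaves, then pairwise QRs of stacked R factors.
import numpy as np

def tsqr_R(W, splits=4):
    """Return the R factor of W via a binary reduction tree of local QRs."""
    Rs = [np.linalg.qr(Wi)[1] for Wi in np.array_split(W, splits)]  # leaves
    while len(Rs) > 1:                       # combine pairs of R factors
        Rs = [np.linalg.qr(np.vstack(pair))[1]
              for pair in zip(Rs[0::2], Rs[1::2])]
    return Rs[0]

W = np.random.randn(10000, 50)
R = tsqr_R(W)
# Same R (up to signs of its rows) as a direct QR of W:
print(np.allclose(np.abs(R), np.abs(np.linalg.qr(W)[1])))
```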
Back to LU: using a similar idea for TSLU as for TSQR, use a reduction tree to do "Tournament Pivoting"
W (n x b) = [W1; W2; W3; W4]
Step 1: factor each block Wi = Pi·Li·Ui; choose b pivot rows of each Wi, call them Wi'
Step 2: factor [W1'; W2'] = P12·L12·U12 and [W3'; W4'] = P34·L34·U34; choose b pivot rows of each, call them W12' and W34'
Step 3: factor [W12'; W34'] = P1234·L1234·U1234; choose b pivot rows
Go back to W and use these b pivot rows (move them to top, do LU without pivoting). A sequential sketch follows the slide.
37
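A minimal sketch of one tournament: each block of rows nominates b candidate pivot rows via Gaussian elimination with partial pivoting, and the candidates are played off pairwise up the tree. Sequential and dense; block count and sizes are illustrative assumptions, and rows (not indices) are returned for simplicity.

```python
# Tournament pivoting sketch using GEPP (scipy.linalg.lu) as the local selector.
import numpy as np
from scipy.linalg import lu

def choose_pivot_rows(block, b):
    """Rows of `block` chosen as pivots by GEPP (first b rows after pivoting)."""
    P, L, U = lu(block)          # block = P @ L @ U
    return (P.T @ block)[:b]     # pivoted ordering of the block's rows

def tournament_pivot_rows(W, b, nblocks=4):
    candidates = [choose_pivot_rows(Wi, b) for Wi in np.array_split(W, nblocks)]
    while len(candidates) > 1:   # pairwise playoffs up the reduction tree
        candidates = [choose_pivot_rows(np.vstack(pair), b)
                      for pair in zip(candidates[0::2], candidates[1::2])]
    return candidates[0]         # b rows to move to the top of W

W = np.random.randn(1024, 8)
print(tournament_pivot_rows(W, b=8).shape)   # (8, 8)
```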
Minimizing Communication in TSLU
[Figure: the same three reduction trees as for TSQR (parallel binary tree, sequential/streaming flat tree, dual-core hybrid) applied to W = [W1; W2; W3; W4], with a local LU at each node of the tree.]
Can choose reduction tree dynamically to match architecture as before
38
Making TSLU Numerically Stable
• Details matter
– Going up the tree, we could do LU either on original rows of A (tournament pivoting) or on computed rows of U
– Only tournament pivoting is stable
• "Thm": New scheme as stable as Partial Pivoting (GEPP) in the following sense: get the same Schur complements as GEPP applied to a different input matrix whose entries are blocks taken from the input A
• Why just a "Thm"?
39
Stability of LU using TSLU: CALU
40
• Empirical testing
– Both random matrices and "special ones"
– Both binary tree (BCALU) and flat-tree (FCALU)
– 3 metrics: ||PA-LU|| / ||A||, normwise and componentwise backward errors
– See [D., Grigori, Xiang, 2010] for details
Why is stability of TSLU just a "Thm"?
• Proof is correct – in exact arithmetic
• Experiment (a numpy reproduction appears after this slide):
– Generate 100 random 6x6, rank-3 matrices in Matlab
– [L,U,P] = lu(A); do LU without pivoting on P·A; compare L factors: are they the same?
• Compute || L – Lnp ||: a few 0's, a few ∞'s, a few NaNs
• Rest mostly O(1)
– Why? Floating point is nonassociative; doing arithmetic in a different order gives different rounding errors
– Same experiment with rank-6 matrices: || L – Lnp || usually nonzero, O(macheps)
– Same experiment with 20x20 rank-4 matrices: || L – Lnp || often O(10^3)
• Much harder to break TSLU, but possible
– Occurred when using TSLU to factorize a low-rank subdiagonal panel in symmetric-indefinite factorization
41
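A hedged numpy reproduction of the experiment above (the original was in Matlab): random low-rank matrices, GEPP via scipy.linalg.lu, then LU without pivoting on P^T·A, comparing the two L factors. The no-pivot LU is a small hand-written Doolittle routine, since library LU always pivots.

```python
# Reproduces the "random 6x6 rank-3" stability experiment in numpy/scipy.
import numpy as np
from scipy.linalg import lu

def lu_nopivot(A):
    """Doolittle LU without pivoting; returns the unit-lower factor L."""
    A = A.astype(float).copy()
    n = A.shape[0]
    for k in range(n - 1):
        A[k+1:, k] /= A[k, k]                      # near-zero pivots give inf/NaN
        A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])
    return np.tril(A, -1) + np.eye(n)

rng = np.random.default_rng(0)
diffs = []
for _ in range(100):
    A = rng.standard_normal((6, 3)) @ rng.standard_normal((3, 6))   # rank 3
    P, L, U = lu(A)                                # GEPP: A = P @ L @ U
    Lnp = lu_nopivot(P.T @ A)                      # no-pivot LU of the permuted A
    diffs.append(np.linalg.norm(L - Lnp))
print(np.round(diffs[:10], 2))    # a mix of ~0, O(1), inf, and nan values
```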
Fixing TSLU
bull Run TSLU quickly test for stability fix if necessary (rare)
bull Test conditioning of U if not tiny (usual case) proceed elsebull Compute || L || if not big (usual case) proceed elsebull Factor A = QR using TSQR thenbull Factor Q = PLU using TSLU thenbull A = PL(UR) with UR as upper triangular factor
bull Last topic in lecture how to guarantee floating point reproducibility
42
2D CALU with Tournament Pivoting
43
25D CALU with Tournament Pivoting (c=4 copies)
44
Exascale Machine ParametersSource DOE Exascale Workshop
bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node
Exascale predicted speedupsfor Gaussian Elimination
2D CA-LU vs ScaLAPACK-LU
log2 (p)
log
2 (
n2p
) =
log
2 (m
emo
ry_p
er_p
roc)
Up to 29x
25D vs 2D LUWith and Without Pivoting
Other CA algorithms for Ax=b least squares(13)
bull A symmetric and indefinitendash Seek factorization that retains symmetry PAPT = LDLT D
ldquosimplerdquobull Save frac12 flops preserve inertia
ndash Usual approach Bunch-Kaufmanbull D block diagonal with 1x1 and 2x2 blocksbull Pivot search down column along row (lots of communication)
ndash Alternative Aasenbull D = tridiagonal = Tbull Two steps
ndash PAPT = LTLT where T is banded using TSLU
48
0 0
0
0 0
0
0
hellip
hellip
ndash Solvefactor narrow band problem with Tbull Up to 28x faster than MKL Best Paper at IPDPSrsquo13
Other CA algorithms for Ax=b least squares (23)bull Minimizing bandwidth and latency for sequential GEPP
ndash So far could not do partial pivoting and minimize messages just words
ndash Challengebull Column layout good for choosing pivots bad for matmulbull Blocked layout good for matmul bad for choosing pivots
ndash Solution use both layouts switching between thembull ldquoShape Morphing LUrdquo or SMLU
49
bull func factor(A) if A has 1 column update it else factor(left half of A)
update right half of A
factor(right half of A)
bull Words = O(n3M12)
bull Messages = O(n3M)
bull func factor(A) if A has 1 column update it else factor(left half of A) reshape to recursive block format update right half of A reshape to columnwise format factor(right half of A)
bull Words = O(n3M12)
bull Messages = O(n3M32)
Other CA algorithms for Ax=b least squares (33)bull Need for pivoting arises beyond LU in QR
ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)
ndash Usual approach like Partial Pivoting
bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound
ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two
groups of b columns either using usual approach or something better (GuEisenstat)
bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix
ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 50
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
What about sparse matrices (13)
bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then
bull But canrsquot reorder outer loop for 25D need another idea
bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring
bull Kleenersquos Algorithm
52
for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)
D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21
Performance of 25D APSP using Kleene
53
Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)
62xspeedup
2x speedup
What about sparse matrices (23)
bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of
G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix
bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering
for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid
bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse
multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on
separators) 54
What about sparse matrices (33)
bull If matrix stays very sparse lower bound unattainable new one
bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of
data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely
bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices
along dimensions most likely to minimize cost
55
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
Symmetric Eigenproblem and SVD
bull Usual approach for A=AT (SVD similar)ndash A QTAQ = T where Q orthogonal T tridiagonalndash T UTTU = Λ where U orthogonal Λ diagonalndash QUrsquos columns are eigenvectors Λ eigenvaluesndash Dense Tridiagonal Diagonalndash Only half BLAS3 half BLAS2 in LAPACKrsquos sytrd
bull Communication-Avoiding Approachndash A QAQT = B where B=BT banded of bandwidth M12
ndash Continue as above starting with Bndash Dense Banded Tridiagonal Diagonalndash Dense Banded use TSQR to zero out M12 colsrows at a timendash Banded Tridiagonal need new(ish) idea
b+1
b+1
Successive Band Reduction (BischofLangSun)
1
b+1
b+1
d+1
c
Successive Band Reduction (BischofLangSun)
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1Q1
b+1
b+1
d+1
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
12
Q1
b+1
b+1
d+1
d+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
12
Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
d+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
Q1
Q1T
Q2
Q2T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
Q1
Q1T
Q2
Q2T
Q3
Q3T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
6
6
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
Conventional vs CA - SBR
Conventional Communication-Avoiding
Touch all data 4 times Touch all data once
Application to Tensor Contractions
bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply
bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory
bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn
bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews
Communication Lower Bounds for Strassen-like matmul algorithms
bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected
bull Extends up to M = n2 p2ω bull Extends to rectangular case multiply (mxn)(nxp) in q mults
ndash words_moved = Ω (flopsM^(logmpq -1))
bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz also in JACMbull Is the lower bound attainable
Classical O(n3) matmul
words_moved =Ω (M(nM12)3P)
Strassenrsquos O(nlg7) matmul
words_moved =Ω (M(nM12)lg7P)
Strassen-like O(nω) matmul
words_moved =Ω (M(nM12)ωP)
vs
Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory
Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory
CAPS If EnoughMemory and P 7 then BFS step else DFS step end if
Communication Avoiding Parallel Strassen (CAPS)
Best way to interleaveBFS and DFS is an tuning parameter
26
Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080
Speedups 24-184(over previous Strassen-based algorithms)
Invited to appear as Research Highlight in CACM
Strassen-like beyond matmul
bull Thm (D Dumitriu Holtzrsquo07) Any Strassen-like O(nω) matmul algorithm can be used to build a numerically stable O(nω+η) algorithm for any ηgt0 for Ax=b least squares eig SVD hellipndash ηgt0 needed to deal with numerical stabilityndash Strassen already stable so η=0
bull Thm For sequential versions of these algorithms Words_moved = O(nω+ηM(ω+η)2 ndash 1 + n2 log n) ie attain expected lower bound
Ballard D Holtz Schwartz
Cache and Network Oblivious Algorithms
bull Motivation Minimizes communication at every level of a hierarchical system without tuning parameters (in theory)ndash Not always 25D Matmul on BGP was topology aware
bull CAPS Divide-and-conquer choose BFS or DFS to adapt to processors available memory
bull CARMAndash Divide-and-conquer classical matmul divide largest of 3
dimensions to create two subproblemsndash Choose BFS or DFS to adapt to processors available memory
CARMA Performance Distributed Memory
Square m = k = n = 6144
ScaLAPACK
CARMA
Peak
(log)
(log)
Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA
CARMA Performance Distributed Memory
Inner Product m = n = 192 k = 6291456
ScaLAPACK
CARMAPeak
(log)
(log)
Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA
CARMA Performance Shared Memory
Square m = k = n
MKL (double)CARMA (double)
MKL (single)CARMA (single)
Peak (single)
Peak (double)
(log)
(linear)
Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA
CARMA Performance Shared Memory
Inner Product m = n = 64
MKL (double)
CARMA (double)
MKL (single)
CARMA (single)
(log)
(linear)
Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA
Why is CARMA Faster in Shared MemoryL3 Cache Misses
Shared Memory Inner Product (m = n = 64 k = 524288)
97 Fewer Misses
86 Fewer Misses
(linear)
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
One-sided Factorizations (LU QR) so farbull Classical Approach for i=1 to n update column i update trailing matrixbull words_moved = O(n3)
35
bull Blocked Approach (LAPACK) for i=1 to nb update block i of b columns update trailing matrixbull words moved = O(n3M13)
bull Recursive Approach func factor(A) if A has 1 column update it
else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)
bull None of these approaches minimizes messagesbull Parallel case Partial
Pivoting =gt n reductionsbull Need another idea
TSQR An Architecture-Dependent Algorithm
W =
W0
W1
W2
W3
R00
R10
R20
R30
R01
R11
R02Parallel
W =
W0
W1
W2
W3
R01R02
R00
R03
SequentialStreaming
W =
W0
W1
W2
W3
R00
R01
R01
R11
R02
R11
R03
Dual Core
Can choose reduction tree dynamically
Multicore Multisocket Multirack Multisite Out-of-core
Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo
Wnxb =
W1
W2
W3
W4
P1middotL1middotU1
P2middotL2middotU2
P3middotL3middotU3
P4middotL4middotU4
=
Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo
W1rsquoW2rsquoW3rsquoW4rsquo
P12middotL12middotU12
P34middotL34middotU34
=Choose b pivot rows call them W12rsquo
Choose b pivot rows call them W34rsquo
W12rsquoW34rsquo
= P1234middotL1234middotU1234
Choose b pivot rows
Go back to W and use these b pivot rows (move them to top do LU without pivoting)
37
Minimizing Communication in TSLU
W = W1
W2
W3
W4
LULULULU
LU
LULUParallel
W = W1
W2
W3
W4
LULU
LU
LUSequentialStreaming
W = W1
W2
W3
W4
LULU LU
LULU
LULU
Dual Core
Can choose reduction tree dynamically to match architecture as before
38
Making TSLU Numerically Stable
bull Details matterndash Going up the tree we could do LU either on original rows of A
(tournament pivoting) or computed rows of Undash Only tournament pivoting stable
bull ldquoThmrdquo New scheme as stable as Partial Pivoting (GEPP) in following sense Get same Schur complements as GEPP applied to different input matrix whose entries are blocks taken from input A
bull Why just a ldquoThmrdquo
39
Stability of LU using TSLU CALU
Summer School Lecture 4 40
bull Empirical testingndash Both random matrices and ldquospecial onesrdquondash Both binary tree (BCALU) and flat-tree (FCALU)ndash 3 metrics ||PA-LU||||A|| normwise and componentwise backward errorsndash See [D Grigori Xiang 2010] for details
Why is stability of TSLU just a ldquoThmrdquo
bull Proof is correct ndash in exact arithmeticbull Experiment
ndash Generate 100 random 6x6 rank 3 matrices in Matlabndash [LUP] = lu(A) do LU without pivoting on PA compare L factors are
they the samebull Compute || L ndash Lnp || A few 0rsquos A few infinrsquos a few NaNsbull Rest mostly O(1)
ndash Why Floating point is nonassociative doing arithmetic in different order gives different rounding errors
ndash Same experiment with rank 6 matrices || L ndash Lnp || usually nonzero O(macheps)
ndash Same experiment with 20x20 rank 4 matrices || L ndash Lnp || often O(103)
bull Much harder to break TSLU but possiblendash Occurred when using TSLU to factorize a low-rank subdiagonal
panel in symmetric-indefinite factorization41
Fixing TSLU
bull Run TSLU quickly test for stability fix if necessary (rare)
bull Test conditioning of U if not tiny (usual case) proceed elsebull Compute || L || if not big (usual case) proceed elsebull Factor A = QR using TSQR thenbull Factor Q = PLU using TSLU thenbull A = PL(UR) with UR as upper triangular factor
bull Last topic in lecture how to guarantee floating point reproducibility
42
2D CALU with Tournament Pivoting
43
25D CALU with Tournament Pivoting (c=4 copies)
44
Exascale Machine ParametersSource DOE Exascale Workshop
bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node
Exascale predicted speedupsfor Gaussian Elimination
2D CA-LU vs ScaLAPACK-LU
log2 (p)
log
2 (
n2p
) =
log
2 (m
emo
ry_p
er_p
roc)
Up to 29x
25D vs 2D LUWith and Without Pivoting
Other CA algorithms for Ax=b least squares(13)
bull A symmetric and indefinitendash Seek factorization that retains symmetry PAPT = LDLT D
ldquosimplerdquobull Save frac12 flops preserve inertia
ndash Usual approach Bunch-Kaufmanbull D block diagonal with 1x1 and 2x2 blocksbull Pivot search down column along row (lots of communication)
ndash Alternative Aasenbull D = tridiagonal = Tbull Two steps
ndash PAPT = LTLT where T is banded using TSLU
48
0 0
0
0 0
0
0
hellip
hellip
ndash Solvefactor narrow band problem with Tbull Up to 28x faster than MKL Best Paper at IPDPSrsquo13
Other CA algorithms for Ax=b least squares (23)bull Minimizing bandwidth and latency for sequential GEPP
ndash So far could not do partial pivoting and minimize messages just words
ndash Challengebull Column layout good for choosing pivots bad for matmulbull Blocked layout good for matmul bad for choosing pivots
ndash Solution use both layouts switching between thembull ldquoShape Morphing LUrdquo or SMLU
49
bull func factor(A) if A has 1 column update it else factor(left half of A)
update right half of A
factor(right half of A)
bull Words = O(n3M12)
bull Messages = O(n3M)
bull func factor(A) if A has 1 column update it else factor(left half of A) reshape to recursive block format update right half of A reshape to columnwise format factor(right half of A)
bull Words = O(n3M12)
bull Messages = O(n3M32)
Other CA algorithms for Ax=b least squares (33)bull Need for pivoting arises beyond LU in QR
ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)
ndash Usual approach like Partial Pivoting
bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound
ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two
groups of b columns either using usual approach or something better (GuEisenstat)
bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix
ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 50
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
What about sparse matrices (13)
bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then
bull But canrsquot reorder outer loop for 25D need another idea
bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring
bull Kleenersquos Algorithm
52
for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)
D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21
Performance of 25D APSP using Kleene
53
Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)
62xspeedup
2x speedup
What about sparse matrices (23)
bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of
G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix
bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering
for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid
bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse
multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on
separators) 54
What about sparse matrices (33)
bull If matrix stays very sparse lower bound unattainable new one
bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of
data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely
bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices
along dimensions most likely to minimize cost
55
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
Symmetric Eigenproblem and SVD
bull Usual approach for A=AT (SVD similar)ndash A QTAQ = T where Q orthogonal T tridiagonalndash T UTTU = Λ where U orthogonal Λ diagonalndash QUrsquos columns are eigenvectors Λ eigenvaluesndash Dense Tridiagonal Diagonalndash Only half BLAS3 half BLAS2 in LAPACKrsquos sytrd
bull Communication-Avoiding Approachndash A QAQT = B where B=BT banded of bandwidth M12
ndash Continue as above starting with Bndash Dense Banded Tridiagonal Diagonalndash Dense Banded use TSQR to zero out M12 colsrows at a timendash Banded Tridiagonal need new(ish) idea
b+1
b+1
Successive Band Reduction (BischofLangSun)
1
b+1
b+1
d+1
c
Successive Band Reduction (BischofLangSun)
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1Q1
b+1
b+1
d+1
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
12
Q1
b+1
b+1
d+1
d+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
12
Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
d+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
Q1
Q1T
Q2
Q2T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
Q1
Q1T
Q2
Q2T
Q3
Q3T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
6
6
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
Conventional vs CA - SBR
Conventional Communication-Avoiding
Touch all data 4 times Touch all data once
Communication Lower Bounds for Strassen-like matmul algorithms
bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected
bull Extends up to M = n2 p2ω bull Extends to rectangular case multiply (mxn)(nxp) in q mults
ndash words_moved = Ω (flopsM^(logmpq -1))
bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz also in JACMbull Is the lower bound attainable
Classical O(n3) matmul
words_moved =Ω (M(nM12)3P)
Strassenrsquos O(nlg7) matmul
words_moved =Ω (M(nM12)lg7P)
Strassen-like O(nω) matmul
words_moved =Ω (M(nM12)ωP)
vs
Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory
Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory
CAPS If EnoughMemory and P 7 then BFS step else DFS step end if
Communication Avoiding Parallel Strassen (CAPS)
Best way to interleaveBFS and DFS is an tuning parameter
26
Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080
Speedups 24-184(over previous Strassen-based algorithms)
Invited to appear as Research Highlight in CACM
Strassen-like beyond matmul
bull Thm (D Dumitriu Holtzrsquo07) Any Strassen-like O(nω) matmul algorithm can be used to build a numerically stable O(nω+η) algorithm for any ηgt0 for Ax=b least squares eig SVD hellipndash ηgt0 needed to deal with numerical stabilityndash Strassen already stable so η=0
bull Thm For sequential versions of these algorithms Words_moved = O(nω+ηM(ω+η)2 ndash 1 + n2 log n) ie attain expected lower bound
Ballard D Holtz Schwartz
Cache and Network Oblivious Algorithms
bull Motivation Minimizes communication at every level of a hierarchical system without tuning parameters (in theory)ndash Not always 25D Matmul on BGP was topology aware
bull CAPS Divide-and-conquer choose BFS or DFS to adapt to processors available memory
bull CARMAndash Divide-and-conquer classical matmul divide largest of 3
dimensions to create two subproblemsndash Choose BFS or DFS to adapt to processors available memory
CARMA Performance Distributed Memory
Square m = k = n = 6144
ScaLAPACK
CARMA
Peak
(log)
(log)
Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA
CARMA Performance Distributed Memory
Inner Product m = n = 192 k = 6291456
ScaLAPACK
CARMAPeak
(log)
(log)
Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA
CARMA Performance Shared Memory
Square m = k = n
MKL (double)CARMA (double)
MKL (single)CARMA (single)
Peak (single)
Peak (double)
(log)
(linear)
Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA
CARMA Performance Shared Memory
Inner Product m = n = 64
MKL (double)
CARMA (double)
MKL (single)
CARMA (single)
(log)
(linear)
Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA
Why is CARMA Faster in Shared MemoryL3 Cache Misses
Shared Memory Inner Product (m = n = 64 k = 524288)
97 Fewer Misses
86 Fewer Misses
(linear)
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
One-sided Factorizations (LU QR) so farbull Classical Approach for i=1 to n update column i update trailing matrixbull words_moved = O(n3)
35
bull Blocked Approach (LAPACK) for i=1 to nb update block i of b columns update trailing matrixbull words moved = O(n3M13)
bull Recursive Approach func factor(A) if A has 1 column update it
else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)
bull None of these approaches minimizes messagesbull Parallel case Partial
Pivoting =gt n reductionsbull Need another idea
TSQR An Architecture-Dependent Algorithm
W =
W0
W1
W2
W3
R00
R10
R20
R30
R01
R11
R02Parallel
W =
W0
W1
W2
W3
R01R02
R00
R03
SequentialStreaming
W =
W0
W1
W2
W3
R00
R01
R01
R11
R02
R11
R03
Dual Core
Can choose reduction tree dynamically
Multicore Multisocket Multirack Multisite Out-of-core
Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo
Wnxb =
W1
W2
W3
W4
P1middotL1middotU1
P2middotL2middotU2
P3middotL3middotU3
P4middotL4middotU4
=
Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo
W1rsquoW2rsquoW3rsquoW4rsquo
P12middotL12middotU12
P34middotL34middotU34
=Choose b pivot rows call them W12rsquo
Choose b pivot rows call them W34rsquo
W12rsquoW34rsquo
= P1234middotL1234middotU1234
Choose b pivot rows
Go back to W and use these b pivot rows (move them to top do LU without pivoting)
37
Minimizing Communication in TSLU
W = W1
W2
W3
W4
LULULULU
LU
LULUParallel
W = W1
W2
W3
W4
LULU
LU
LUSequentialStreaming
W = W1
W2
W3
W4
LULU LU
LULU
LULU
Dual Core
Can choose reduction tree dynamically to match architecture as before
38
Making TSLU Numerically Stable
bull Details matterndash Going up the tree we could do LU either on original rows of A
(tournament pivoting) or computed rows of Undash Only tournament pivoting stable
bull ldquoThmrdquo New scheme as stable as Partial Pivoting (GEPP) in following sense Get same Schur complements as GEPP applied to different input matrix whose entries are blocks taken from input A
bull Why just a ldquoThmrdquo
39
Stability of LU using TSLU CALU
Summer School Lecture 4 40
bull Empirical testingndash Both random matrices and ldquospecial onesrdquondash Both binary tree (BCALU) and flat-tree (FCALU)ndash 3 metrics ||PA-LU||||A|| normwise and componentwise backward errorsndash See [D Grigori Xiang 2010] for details
Why is stability of TSLU just a ldquoThmrdquo
bull Proof is correct ndash in exact arithmeticbull Experiment
ndash Generate 100 random 6x6 rank 3 matrices in Matlabndash [LUP] = lu(A) do LU without pivoting on PA compare L factors are
they the samebull Compute || L ndash Lnp || A few 0rsquos A few infinrsquos a few NaNsbull Rest mostly O(1)
ndash Why Floating point is nonassociative doing arithmetic in different order gives different rounding errors
ndash Same experiment with rank 6 matrices || L ndash Lnp || usually nonzero O(macheps)
ndash Same experiment with 20x20 rank 4 matrices || L ndash Lnp || often O(103)
bull Much harder to break TSLU but possiblendash Occurred when using TSLU to factorize a low-rank subdiagonal
panel in symmetric-indefinite factorization41
Fixing TSLU
bull Run TSLU quickly test for stability fix if necessary (rare)
bull Test conditioning of U if not tiny (usual case) proceed elsebull Compute || L || if not big (usual case) proceed elsebull Factor A = QR using TSQR thenbull Factor Q = PLU using TSLU thenbull A = PL(UR) with UR as upper triangular factor
bull Last topic in lecture how to guarantee floating point reproducibility
42
2D CALU with Tournament Pivoting
43
25D CALU with Tournament Pivoting (c=4 copies)
44
Exascale Machine ParametersSource DOE Exascale Workshop
bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node
Exascale predicted speedupsfor Gaussian Elimination
2D CA-LU vs ScaLAPACK-LU
log2 (p)
log
2 (
n2p
) =
log
2 (m
emo
ry_p
er_p
roc)
Up to 29x
25D vs 2D LUWith and Without Pivoting
Other CA algorithms for Ax=b least squares(13)
bull A symmetric and indefinitendash Seek factorization that retains symmetry PAPT = LDLT D
ldquosimplerdquobull Save frac12 flops preserve inertia
ndash Usual approach Bunch-Kaufmanbull D block diagonal with 1x1 and 2x2 blocksbull Pivot search down column along row (lots of communication)
ndash Alternative Aasenbull D = tridiagonal = Tbull Two steps
ndash PAPT = LTLT where T is banded using TSLU
48
0 0
0
0 0
0
0
hellip
hellip
ndash Solvefactor narrow band problem with Tbull Up to 28x faster than MKL Best Paper at IPDPSrsquo13
Other CA algorithms for Ax=b least squares (23)bull Minimizing bandwidth and latency for sequential GEPP
ndash So far could not do partial pivoting and minimize messages just words
ndash Challengebull Column layout good for choosing pivots bad for matmulbull Blocked layout good for matmul bad for choosing pivots
ndash Solution use both layouts switching between thembull ldquoShape Morphing LUrdquo or SMLU
49
bull func factor(A) if A has 1 column update it else factor(left half of A)
update right half of A
factor(right half of A)
bull Words = O(n3M12)
bull Messages = O(n3M)
bull func factor(A) if A has 1 column update it else factor(left half of A) reshape to recursive block format update right half of A reshape to columnwise format factor(right half of A)
bull Words = O(n3M12)
bull Messages = O(n3M32)
Other CA algorithms for Ax=b least squares (33)bull Need for pivoting arises beyond LU in QR
ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)
ndash Usual approach like Partial Pivoting
bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound
ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two
groups of b columns either using usual approach or something better (GuEisenstat)
bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix
ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 50
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
What about sparse matrices (13)
bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then
bull But canrsquot reorder outer loop for 25D need another idea
bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring
bull Kleenersquos Algorithm
52
for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)
D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21
Performance of 25D APSP using Kleene
53
Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)
62xspeedup
2x speedup
What about sparse matrices (23)
bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of
G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix
bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering
for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid
bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse
multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on
separators) 54
What about sparse matrices (33)
bull If matrix stays very sparse lower bound unattainable new one
bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of
data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely
bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices
along dimensions most likely to minimize cost
55
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
Symmetric Eigenproblem and SVD
bull Usual approach for A=AT (SVD similar)ndash A QTAQ = T where Q orthogonal T tridiagonalndash T UTTU = Λ where U orthogonal Λ diagonalndash QUrsquos columns are eigenvectors Λ eigenvaluesndash Dense Tridiagonal Diagonalndash Only half BLAS3 half BLAS2 in LAPACKrsquos sytrd
bull Communication-Avoiding Approachndash A QAQT = B where B=BT banded of bandwidth M12
ndash Continue as above starting with Bndash Dense Banded Tridiagonal Diagonalndash Dense Banded use TSQR to zero out M12 colsrows at a timendash Banded Tridiagonal need new(ish) idea
b+1
b+1
Successive Band Reduction (BischofLangSun)
1
b+1
b+1
d+1
c
Successive Band Reduction (BischofLangSun)
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1Q1
b+1
b+1
d+1
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
12
Q1
b+1
b+1
d+1
d+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
12
Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
d+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
Q1
Q1T
Q2
Q2T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
Q1
Q1T
Q2
Q2T
Q3
Q3T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
6
6
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
Conventional vs CA - SBR
Conventional Communication-Avoiding
Touch all data 4 times Touch all data once
vs
Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory
Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory
CAPS If EnoughMemory and P 7 then BFS step else DFS step end if
Communication Avoiding Parallel Strassen (CAPS)
Best way to interleaveBFS and DFS is an tuning parameter
26
Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080
Speedups 24-184(over previous Strassen-based algorithms)
Invited to appear as Research Highlight in CACM
Strassen-like beyond matmul
bull Thm (D Dumitriu Holtzrsquo07) Any Strassen-like O(nω) matmul algorithm can be used to build a numerically stable O(nω+η) algorithm for any ηgt0 for Ax=b least squares eig SVD hellipndash ηgt0 needed to deal with numerical stabilityndash Strassen already stable so η=0
bull Thm For sequential versions of these algorithms Words_moved = O(nω+ηM(ω+η)2 ndash 1 + n2 log n) ie attain expected lower bound
Ballard D Holtz Schwartz
Cache and Network Oblivious Algorithms
bull Motivation Minimizes communication at every level of a hierarchical system without tuning parameters (in theory)ndash Not always 25D Matmul on BGP was topology aware
bull CAPS Divide-and-conquer choose BFS or DFS to adapt to processors available memory
bull CARMAndash Divide-and-conquer classical matmul divide largest of 3
dimensions to create two subproblemsndash Choose BFS or DFS to adapt to processors available memory
CARMA Performance Distributed Memory
Square m = k = n = 6144
ScaLAPACK
CARMA
Peak
(log)
(log)
Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA
CARMA Performance Distributed Memory
Inner Product m = n = 192 k = 6291456
ScaLAPACK
CARMAPeak
(log)
(log)
Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA
CARMA Performance Shared Memory
Square m = k = n
MKL (double)CARMA (double)
MKL (single)CARMA (single)
Peak (single)
Peak (double)
(log)
(linear)
Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA
CARMA Performance Shared Memory
Inner Product m = n = 64
MKL (double)
CARMA (double)
MKL (single)
CARMA (single)
(log)
(linear)
Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA
Why is CARMA Faster in Shared MemoryL3 Cache Misses
Shared Memory Inner Product (m = n = 64 k = 524288)
97 Fewer Misses
86 Fewer Misses
(linear)
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
One-sided Factorizations (LU, QR), so far
• Classical Approach:
    for i = 1 to n
      update column i
      update trailing matrix
  words_moved = O(n³)
• Blocked Approach (LAPACK):
    for i = 1 to n/b
      update block i of b columns
      update trailing matrix
  words_moved = O(n³/M^(1/3))
• Recursive Approach:
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        update right half of A
        factor(right half of A)
  words_moved = O(n³/M^(1/2))
• None of these approaches minimizes #messages
• Parallel case: Partial Pivoting => n reductions
• Need another idea
35
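For concreteness, a minimal numpy/scipy sketch of the recursive approach above, with no pivoting (the function name and the unpivoted leaf step are my simplifications; real algorithms pivot, which is exactly the difficulty addressed next).

    import numpy as np
    from scipy.linalg import solve_triangular

    def recursive_lu(A):
        # In-place recursive LU without pivoting on an m x n panel (m >= n).
        # On return, A holds the unit-lower-triangular L and U packed together.
        m, n = A.shape
        if n == 1:
            A[1:, 0] /= A[0, 0]              # "update it": scale the single column
            return
        k = n // 2
        recursive_lu(A[:, :k])               # factor(left half of A)
        # update right half of A: U12 = L11^{-1} A12, then A22 -= L21 * U12
        A[:k, k:] = solve_triangular(A[:k, :k], A[:k, k:],
                                     lower=True, unit_diagonal=True)
        A[k:, k:] -= A[k:, :k] @ A[:k, k:]
        recursive_lu(A[k:, k:])              # factor(right half of A)

Usage: run recursive_lu on a copy of a square A; then np.tril(A, -1) + np.eye(n) and np.triu(A) recover L and U with L @ U equal to the original matrix (up to roundoff).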
TSQR: An Architecture-Dependent Algorithm
Parallel (binary tree): W = [W0; W1; W2; W3]; local QRs give R00, R10, R20, R30; pairwise combine to R01, R11; combine again to R02.
Sequential/Streaming (flat tree): W = [W0; W1; W2; W3]; factor W0 to get R00, fold in W1 to get R01, then W2 to get R02, then W3 to get R03.
Dual Core: a hybrid tree mixing the two patterns (intermediate factors R00, R01, R11, R02, R03).
Can choose reduction tree dynamically
Multicore, Multisocket, Multirack, Multisite, Out-of-core
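A serial numpy sketch of the binary-tree (parallel) variant above; in a real TSQR the leaf QRs run on different processors, the loop is a tree reduction, and the Q factor is kept implicitly (block count and function name here are illustrative).

    import numpy as np

    def tsqr(W, num_blocks=4):
        # Binary-tree TSQR on a tall-skinny W; returns only the final R factor.
        blocks = np.array_split(W, num_blocks, axis=0)
        # Leaf step: independent local QRs (these would run on separate processors)
        rs = [np.linalg.qr(b, mode='r') for b in blocks]
        # Reduction: stack pairs of R factors and re-factor until one R remains
        while len(rs) > 1:
            rs = [np.linalg.qr(np.vstack(rs[i:i + 2]), mode='r')
                  for i in range(0, len(rs), 2)]
        return rs[0]

    # Usage: R from TSQR matches the R of a direct QR up to row signs
    W = np.random.randn(1000, 8)
    print(np.allclose(np.abs(tsqr(W)), np.abs(np.linalg.qr(W, mode='r'))))

Because the R factors combine associatively, any tree shape (binary, flat, or a hybrid matched to the machine) yields the same final R up to sign choices.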
Back to LU: Using similar idea for TSLU as TSQR: use reduction tree to do "Tournament Pivoting"
W (n x b) = [W1; W2; W3; W4] = [P1·L1·U1; P2·L2·U2; P3·L3·U3; P4·L4·U4]
Choose b pivot rows of W1, call them W1'; likewise W2', W3', W4'.
[W1'; W2'] = P12·L12·U12 and [W3'; W4'] = P34·L34·U34; choose b pivot rows of each, call them W12' and W34'.
[W12'; W34'] = P1234·L1234·U1234; choose the final b pivot rows.
Go back to W and use these b pivot rows (move them to top, do LU without pivoting).
37
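A rough serial numpy/scipy sketch of this reduction tree, assuming each "game" just runs GEPP (scipy's lu) on its candidate rows and keeps the b rows it chose as pivots; the function names and block count are mine, and a real CALU panel factorization would run the games in parallel.

    import numpy as np
    from scipy.linalg import lu

    def pivot_rows(candidates, W, b):
        # One game: GEPP on the candidate rows; keep the b rows chosen as pivots.
        P, L, U = lu(W[candidates, :])               # W[candidates] = P @ L @ U
        order = P.T @ np.arange(len(candidates))     # row k of P^T W is candidate order[k]
        return [candidates[int(i)] for i in order[:b]]

    def tournament_pivoting(W, num_blocks=4):
        # Pick b pivot rows of the tall-skinny panel W by a binary tournament.
        # Assumes each initial block has at least b rows.
        n, b = W.shape
        groups = [pivot_rows(list(g), W, b)
                  for g in np.array_split(np.arange(n), num_blocks)]
        while len(groups) > 1:
            merged = []
            for i in range(0, len(groups), 2):
                if i + 1 < len(groups):
                    merged.append(pivot_rows(groups[i] + groups[i + 1], W, b))
                else:
                    merged.append(groups[i])
            groups = merged
        return groups[0]                              # indices of the b pivot rows

The returned indices are then moved to the top of W, and the panel is factored with LU without pivoting, as the slide describes.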
Minimizing Communication in TSLU
Parallel (binary tree): W = [W1; W2; W3; W4]; independent local LUs at the leaves, pairwise LUs up the tree.
Sequential/Streaming (flat tree): W = [W1; W2; W3; W4]; LU of W1, then fold in W2, W3, W4 one at a time.
Dual Core: a hybrid of the two trees.
Can choose reduction tree dynamically to match architecture, as before
38
Making TSLU Numerically Stable
• Details matter
  – Going up the tree, we could do LU either on original rows of A (tournament pivoting) or on computed rows of U
  – Only tournament pivoting is stable
• "Thm": New scheme is as stable as Partial Pivoting (GEPP) in the following sense: it gets the same Schur complements as GEPP applied to a different input matrix whose entries are blocks taken from the input A
• Why just a "Thm"?
39
Stability of LU using TSLU CALU
40
• Empirical testing
  – Both random matrices and "special ones"
  – Both binary tree (BCALU) and flat-tree (FCALU)
  – 3 metrics: ||PA-LU||/||A||, normwise and componentwise backward errors
  – See [D., Grigori, Xiang, 2010] for details
Why is stability of TSLU just a "Thm"?
• Proof is correct – in exact arithmetic
• Experiment:
  – Generate 100 random 6x6, rank-3 matrices in Matlab
  – [L,U,P] = lu(A); do LU without pivoting on P·A; compare L factors: are they the same?
    • Compute || L - Lnp ||: a few 0's, a few ∞'s, a few NaNs; rest mostly O(1)
  – Why? Floating point is nonassociative; doing arithmetic in a different order gives different rounding errors
  – Same experiment with rank-6 matrices: || L - Lnp || usually nonzero, O(macheps)
  – Same experiment with 20x20 rank-4 matrices: || L - Lnp || often O(10³)
• Much harder to break TSLU, but possible
  – Occurred when using TSLU to factorize a low-rank subdiagonal panel in symmetric-indefinite factorization
41
Fixing TSLU
• Run TSLU quickly, test for stability, fix if necessary (rare)
• Test conditioning of U; if not tiny (usual case), proceed, else
• Compute || L ||; if not big (usual case), proceed, else
• Factor A = QR using TSQR, then
• Factor Q = P·L·U using TSLU, then
• A = P·L·(U·R) with U·R as upper triangular factor
• Last topic in lecture: how to guarantee floating point reproducibility
42
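A sketch of this fallback logic, assuming scipy's GEPP and QR stand in for TSLU and TSQR, and the two stability tests are simple threshold checks; the thresholds and test details are my guesses, not the published criteria.

    import numpy as np
    from scipy.linalg import lu, qr

    def tslu(A):
        # Stand-in for a tournament-pivoted TSLU; GEPP is enough to show the flow.
        return lu(A)

    def robust_panel_lu(A, cond_tol=1e12, growth_tol=1e2):
        P, L, U = tslu(A)                       # fast path: plain TSLU
        d = np.abs(np.diag(U))
        u_ok = d.min() > d.max() / cond_tol     # "conditioning of U not tiny"
        l_ok = np.abs(L).max() < growth_tol     # "|| L || not big"
        if u_ok and l_ok:
            return P, L, U                      # usual case: accept TSLU result
        # Rare fallback: A = Q R (TSQR), Q = P L U1 (TSLU), so A = P L (U1 R)
        Q, R = qr(A, mode='economic')
        P, L, U1 = tslu(Q)
        return P, L, U1 @ R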
2D CALU with Tournament Pivoting
43
2.5D CALU with Tournament Pivoting (c = 4 copies)
44
Exascale Machine Parameters (Source: DOE Exascale Workshop)
• 2^20 = 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node
Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU
[Plot: x-axis log2(p), y-axis log2(n²/p) = log2(memory_per_proc); speedups up to 29x.]
2.5D vs 2D LU, With and Without Pivoting
Other CA algorithms for Ax=b, least squares (1/3)
• A symmetric and indefinite
  – Seek factorization that retains symmetry: P·A·P^T = L·D·L^T, D "simple"
    • Save half the flops, preserve inertia
  – Usual approach: Bunch-Kaufman
    • D block diagonal with 1x1 and 2x2 blocks
    • Pivot search down column, along row (lots of communication)
  – Alternative: Aasen
    • D = tridiagonal = T
    • Two steps:
      – P·A·P^T = L·T·L^T where T is banded, using TSLU
        [Figure: the banded T, zero outside the band]
      – Solve/factor narrow band problem with T
    • Up to 2.8x faster than MKL, Best Paper at IPDPS'13
48
Other CA algorithms for Ax=b, least squares (2/3)
• Minimizing bandwidth and latency for sequential GEPP
  – So far, could not do partial pivoting and minimize #messages, just #words
  – Challenge:
    • Column layout good for choosing pivots, bad for matmul
    • Blocked layout good for matmul, bad for choosing pivots
  – Solution: use both layouts, switching between them
    • "Shape Morphing LU" or SMLU
• Recursive LU:
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        update right half of A
        factor(right half of A)
  Words = O(n³/M^(1/2)), Messages = O(n³/M)
• Shape Morphing LU:
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        reshape to recursive block format
        update right half of A
        reshape to columnwise format
        factor(right half of A)
  Words = O(n³/M^(1/2)), Messages = O(n³/M^(3/2))
49
Other CA algorithms for Ax=b, least squares (3/3)
• Need for pivoting arises beyond LU, in QR
  – Choose permutation P so that leading columns of A·P = Q·R span the column space of A – Rank Revealing QR (RRQR)
  – Usual approach, like Partial Pivoting:
    • Put longest column first, update rest of matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit lower bound
  – Use Tournament Pivoting (see the sketch after this list)
    • Each round of the tournament selects the best b columns from two groups of b columns, either using the usual approach or something better (Gu/Eisenstat)
    • Thm: This approach "reveals the rank" of A in the sense that the leading r x r submatrix of R has singular values "near" the largest r singular values of A; ditto for the trailing submatrix
  – Idea extends to other pivoting schemes
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDL^T with complete pivoting
50
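A rough numpy/scipy sketch of tournament pivoting on columns, using LAPACK's column-pivoted QR as the "usual approach" inside each game (names and group sizes are illustrative; a Gu/Eisenstat-style selection would replace the local QR).

    import numpy as np
    from scipy.linalg import qr

    def column_game(A, group1, group2, b):
        # One game: column-pivoted QR on the candidate columns; keep the
        # b columns it ranks first.
        cand = list(group1) + list(group2)
        _, _, piv = qr(A[:, cand], mode='economic', pivoting=True)
        return [cand[int(j)] for j in piv[:b]]

    def tournament_select_columns(A, b):
        # Pick b representative columns of A by a binary tournament.
        n = A.shape[1]
        groups = [list(g) for g in np.array_split(np.arange(n), max(1, n // b))]
        while len(groups) > 1:
            merged = []
            for i in range(0, len(groups), 2):
                if i + 1 < len(groups):
                    merged.append(column_game(A, groups[i], groups[i + 1], b))
                else:
                    merged.append(groups[i])
            groups = merged
        return groups[0]                        # indices of the b selected columns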
Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity
What about sparse matrices? (1/3)
• If the matrix quickly becomes dense, use a dense algorithm
• Ex: All Pairs Shortest Paths using Floyd-Warshall
• Similar to matmul: let D = A, then
    for k = 1:n
      for i = 1:n
        for j = 1:n
          D(i,j) = min(D(i,j), D(i,k) + D(k,j))
• But can't reorder the outer loop for 2.5D; need another idea
• Abbreviate D(i,j) = min(D(i,j), min_k(A(i,k) + B(k,j))) by D = A⊗B
  – Dependencies ok, 2.5D works, just a different semiring
• Kleene's Algorithm:
    D = DC-APSP(A, n)
      D = A
      Partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
      D11 = DC-APSP(D11, n/2)
      D12 = D11 ⊗ D12
      D21 = D21 ⊗ D11
      D22 = D21 ⊗ D12
      D22 = DC-APSP(D22, n/2)
      D21 = D22 ⊗ D21
      D12 = D12 ⊗ D22
      D11 = D12 ⊗ D21
52
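A serial numpy sketch of the recursion above over the (min, +) semiring, where ⊗ also accumulates into its destination block; the point of the slide is that every ⊗ is a semiring matmul, so a 2.5D matmul algorithm applies (function names are mine).

    import numpy as np

    def minplus(D, A, B):
        # D(i,j) = min(D(i,j), min_k A(i,k) + B(k,j)): the accumulating "⊗"
        return np.minimum(D, (A[:, :, None] + B[None, :, :]).min(axis=1))

    def dc_apsp(A):
        # Divide-and-conquer all-pairs shortest paths (Kleene's algorithm)
        n = A.shape[0]
        if n == 1:
            return A.copy()
        D, h = A.copy(), n // 2
        D[:h, :h] = dc_apsp(D[:h, :h])
        D[:h, h:] = minplus(D[:h, h:], D[:h, :h], D[:h, h:])
        D[h:, :h] = minplus(D[h:, :h], D[h:, :h], D[:h, :h])
        D[h:, h:] = minplus(D[h:, h:], D[h:, :h], D[:h, h:])
        D[h:, h:] = dc_apsp(D[h:, h:])
        D[h:, :h] = minplus(D[h:, :h], D[h:, h:], D[h:, :h])
        D[:h, h:] = minplus(D[:h, h:], D[:h, h:], D[h:, h:])
        D[:h, :h] = minplus(D[:h, :h], D[:h, h:], D[h:, :h])
        return D

    # Usage: adjacency matrix with np.inf for "no edge" and 0 on the diagonal
    A = np.array([[0, 3, np.inf],
                  [np.inf, 0, 1],
                  [2, np.inf, 0]], dtype=float)
    print(dc_apsp(A))       # all-pairs shortest path distances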
Performance of 2.5D APSP using Kleene
53
[Plot: Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24,576 cores); annotations: 6.2x speedup, 2x speedup.]
What about sparse matrices? (2/3)
• If parts of the matrix become dense, optimize those
• Ex: Cholesky on a matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): If all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on a w x w submatrix
• Thm: Words_moved = Ω(w³/M^(1/2)), etc.
• Thm (George '73): Nested dissection gives optimal ordering for 2D grid, 3D grid, similar matrices
  – w = n for a 2D n x n grid, w = n² for a 3D n x n x n grid
• Sequential multifrontal Cholesky attains the bounds
• PSPACES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)
54
What about sparse matrices? (3/3)
• If the matrix stays very sparse, lower bound unattainable; new one:
• Ex: A, B both diagonal: no communication in parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), i.i.d.
• Assumption: algorithm is sparsity-independent: assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
    Words_moved = Ω(min( d·n/P^(1/2), d²·n/P ))
  – Proof exploits the fact that reuse of entries of C = A·B is unlikely
• Contrast general lower bound: Words_moved = Ω(d²·n/(P·M^(1/2)))
• Attained by a divide-and-conquer algorithm that splits matrices along the dimensions most likely to minimize cost
55
Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity
Symmetric Eigenproblem and SVD
• Usual approach for A = A^T (SVD similar):
  – A → Q^T·A·Q = T where Q orthogonal, T tridiagonal
  – T → U^T·T·U = Λ where U orthogonal, Λ diagonal
  – (Q·U)'s columns are eigenvectors, Λ holds the eigenvalues
  – Dense → Tridiagonal → Diagonal
  – Only half BLAS3, half BLAS2 in LAPACK's sytrd
• Communication-Avoiding Approach:
  – A → Q·A·Q^T = B where B = B^T banded, of bandwidth M^(1/2)
  – Continue as above, starting with B
  – Dense → Banded → Tridiagonal → Diagonal
  – Dense → Banded: use TSQR to zero out M^(1/2) cols/rows at a time
  – Banded → Tridiagonal: need a new(ish) idea
[Figure: Successive Band Reduction (Bischof/Lang/Sun) – animation frames showing bulge-chasing sweeps 1 through 6, applying orthogonal updates Q1, Q1^T, Q2, Q2^T, ..., Q5, Q5^T to the band. Legend: b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b.]
Conventional vs CA-SBR
Conventional: touch all data 4 times. Communication-Avoiding: touch all data once.
Strassen-like beyond matmul
bull Thm (D Dumitriu Holtzrsquo07) Any Strassen-like O(nω) matmul algorithm can be used to build a numerically stable O(nω+η) algorithm for any ηgt0 for Ax=b least squares eig SVD hellipndash ηgt0 needed to deal with numerical stabilityndash Strassen already stable so η=0
bull Thm For sequential versions of these algorithms Words_moved = O(nω+ηM(ω+η)2 ndash 1 + n2 log n) ie attain expected lower bound
Ballard D Holtz Schwartz
Cache and Network Oblivious Algorithms
bull Motivation Minimizes communication at every level of a hierarchical system without tuning parameters (in theory)ndash Not always 25D Matmul on BGP was topology aware
bull CAPS Divide-and-conquer choose BFS or DFS to adapt to processors available memory
bull CARMAndash Divide-and-conquer classical matmul divide largest of 3
dimensions to create two subproblemsndash Choose BFS or DFS to adapt to processors available memory
CARMA Performance Distributed Memory
Square m = k = n = 6144
ScaLAPACK
CARMA
Peak
(log)
(log)
Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA
CARMA Performance Distributed Memory
Inner Product m = n = 192 k = 6291456
ScaLAPACK
CARMAPeak
(log)
(log)
Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA
CARMA Performance Shared Memory
Square m = k = n
MKL (double)CARMA (double)
MKL (single)CARMA (single)
Peak (single)
Peak (double)
(log)
(linear)
Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA
CARMA Performance Shared Memory
Inner Product m = n = 64
MKL (double)
CARMA (double)
MKL (single)
CARMA (single)
(log)
(linear)
Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA
Why is CARMA Faster in Shared MemoryL3 Cache Misses
Shared Memory Inner Product (m = n = 64 k = 524288)
97 Fewer Misses
86 Fewer Misses
(linear)
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
One-sided Factorizations (LU QR) so farbull Classical Approach for i=1 to n update column i update trailing matrixbull words_moved = O(n3)
35
bull Blocked Approach (LAPACK) for i=1 to nb update block i of b columns update trailing matrixbull words moved = O(n3M13)
bull Recursive Approach func factor(A) if A has 1 column update it
else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)
bull None of these approaches minimizes messagesbull Parallel case Partial
Pivoting =gt n reductionsbull Need another idea
TSQR An Architecture-Dependent Algorithm
W =
W0
W1
W2
W3
R00
R10
R20
R30
R01
R11
R02Parallel
W =
W0
W1
W2
W3
R01R02
R00
R03
SequentialStreaming
W =
W0
W1
W2
W3
R00
R01
R01
R11
R02
R11
R03
Dual Core
Can choose reduction tree dynamically
Multicore Multisocket Multirack Multisite Out-of-core
Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo
Wnxb =
W1
W2
W3
W4
P1middotL1middotU1
P2middotL2middotU2
P3middotL3middotU3
P4middotL4middotU4
=
Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo
W1rsquoW2rsquoW3rsquoW4rsquo
P12middotL12middotU12
P34middotL34middotU34
=Choose b pivot rows call them W12rsquo
Choose b pivot rows call them W34rsquo
W12rsquoW34rsquo
= P1234middotL1234middotU1234
Choose b pivot rows
Go back to W and use these b pivot rows (move them to top do LU without pivoting)
37
Minimizing Communication in TSLU
W = W1
W2
W3
W4
LULULULU
LU
LULUParallel
W = W1
W2
W3
W4
LULU
LU
LUSequentialStreaming
W = W1
W2
W3
W4
LULU LU
LULU
LULU
Dual Core
Can choose reduction tree dynamically to match architecture as before
38
Making TSLU Numerically Stable
bull Details matterndash Going up the tree we could do LU either on original rows of A
(tournament pivoting) or computed rows of Undash Only tournament pivoting stable
bull ldquoThmrdquo New scheme as stable as Partial Pivoting (GEPP) in following sense Get same Schur complements as GEPP applied to different input matrix whose entries are blocks taken from input A
bull Why just a ldquoThmrdquo
39
Stability of LU using TSLU CALU
Summer School Lecture 4 40
bull Empirical testingndash Both random matrices and ldquospecial onesrdquondash Both binary tree (BCALU) and flat-tree (FCALU)ndash 3 metrics ||PA-LU||||A|| normwise and componentwise backward errorsndash See [D Grigori Xiang 2010] for details
Why is stability of TSLU just a ldquoThmrdquo
bull Proof is correct ndash in exact arithmeticbull Experiment
ndash Generate 100 random 6x6 rank 3 matrices in Matlabndash [LUP] = lu(A) do LU without pivoting on PA compare L factors are
they the samebull Compute || L ndash Lnp || A few 0rsquos A few infinrsquos a few NaNsbull Rest mostly O(1)
ndash Why Floating point is nonassociative doing arithmetic in different order gives different rounding errors
ndash Same experiment with rank 6 matrices || L ndash Lnp || usually nonzero O(macheps)
ndash Same experiment with 20x20 rank 4 matrices || L ndash Lnp || often O(103)
bull Much harder to break TSLU but possiblendash Occurred when using TSLU to factorize a low-rank subdiagonal
panel in symmetric-indefinite factorization41
Fixing TSLU
bull Run TSLU quickly test for stability fix if necessary (rare)
bull Test conditioning of U if not tiny (usual case) proceed elsebull Compute || L || if not big (usual case) proceed elsebull Factor A = QR using TSQR thenbull Factor Q = PLU using TSLU thenbull A = PL(UR) with UR as upper triangular factor
bull Last topic in lecture how to guarantee floating point reproducibility
42
2D CALU with Tournament Pivoting
43
25D CALU with Tournament Pivoting (c=4 copies)
44
Exascale Machine ParametersSource DOE Exascale Workshop
bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node
Exascale predicted speedupsfor Gaussian Elimination
2D CA-LU vs ScaLAPACK-LU
log2 (p)
log
2 (
n2p
) =
log
2 (m
emo
ry_p
er_p
roc)
Up to 29x
25D vs 2D LUWith and Without Pivoting
Other CA algorithms for Ax=b least squares(13)
bull A symmetric and indefinitendash Seek factorization that retains symmetry PAPT = LDLT D
ldquosimplerdquobull Save frac12 flops preserve inertia
ndash Usual approach Bunch-Kaufmanbull D block diagonal with 1x1 and 2x2 blocksbull Pivot search down column along row (lots of communication)
ndash Alternative Aasenbull D = tridiagonal = Tbull Two steps
ndash PAPT = LTLT where T is banded using TSLU
48
0 0
0
0 0
0
0
hellip
hellip
ndash Solvefactor narrow band problem with Tbull Up to 28x faster than MKL Best Paper at IPDPSrsquo13
Other CA algorithms for Ax=b least squares (23)bull Minimizing bandwidth and latency for sequential GEPP
ndash So far could not do partial pivoting and minimize messages just words
ndash Challengebull Column layout good for choosing pivots bad for matmulbull Blocked layout good for matmul bad for choosing pivots
ndash Solution use both layouts switching between thembull ldquoShape Morphing LUrdquo or SMLU
49
bull func factor(A) if A has 1 column update it else factor(left half of A)
update right half of A
factor(right half of A)
bull Words = O(n3M12)
bull Messages = O(n3M)
bull func factor(A) if A has 1 column update it else factor(left half of A) reshape to recursive block format update right half of A reshape to columnwise format factor(right half of A)
bull Words = O(n3M12)
bull Messages = O(n3M32)
Other CA algorithms for Ax=b least squares (33)bull Need for pivoting arises beyond LU in QR
ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)
ndash Usual approach like Partial Pivoting
bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound
ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two
groups of b columns either using usual approach or something better (GuEisenstat)
bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix
ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 50
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
What about sparse matrices (13)
bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then
bull But canrsquot reorder outer loop for 25D need another idea
bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring
bull Kleenersquos Algorithm
52
for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)
D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21
Performance of 25D APSP using Kleene
53
Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)
62xspeedup
2x speedup
What about sparse matrices (23)
bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of
G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix
bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering
for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid
bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse
multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on
separators) 54
What about sparse matrices (33)
bull If matrix stays very sparse lower bound unattainable new one
bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of
data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely
bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices
along dimensions most likely to minimize cost
55
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
Symmetric Eigenproblem and SVD
bull Usual approach for A=AT (SVD similar)ndash A QTAQ = T where Q orthogonal T tridiagonalndash T UTTU = Λ where U orthogonal Λ diagonalndash QUrsquos columns are eigenvectors Λ eigenvaluesndash Dense Tridiagonal Diagonalndash Only half BLAS3 half BLAS2 in LAPACKrsquos sytrd
bull Communication-Avoiding Approachndash A QAQT = B where B=BT banded of bandwidth M12
ndash Continue as above starting with Bndash Dense Banded Tridiagonal Diagonalndash Dense Banded use TSQR to zero out M12 colsrows at a timendash Banded Tridiagonal need new(ish) idea
b+1
b+1
Successive Band Reduction (BischofLangSun)
1
b+1
b+1
d+1
c
Successive Band Reduction (BischofLangSun)
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1Q1
b+1
b+1
d+1
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
12
Q1
b+1
b+1
d+1
d+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
12
Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
d+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
Q1
Q1T
Q2
Q2T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
Q1
Q1T
Q2
Q2T
Q3
Q3T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
6
6
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
Conventional vs CA - SBR
Conventional Communication-Avoiding
Touch all data 4 times Touch all data once
Strassen-like beyond matmul
bull Thm (D Dumitriu Holtzrsquo07) Any Strassen-like O(nω) matmul algorithm can be used to build a numerically stable O(nω+η) algorithm for any ηgt0 for Ax=b least squares eig SVD hellipndash ηgt0 needed to deal with numerical stabilityndash Strassen already stable so η=0
bull Thm For sequential versions of these algorithms Words_moved = O(nω+ηM(ω+η)2 ndash 1 + n2 log n) ie attain expected lower bound
Ballard D Holtz Schwartz
Cache and Network Oblivious Algorithms
bull Motivation Minimizes communication at every level of a hierarchical system without tuning parameters (in theory)ndash Not always 25D Matmul on BGP was topology aware
bull CAPS Divide-and-conquer choose BFS or DFS to adapt to processors available memory
bull CARMAndash Divide-and-conquer classical matmul divide largest of 3
dimensions to create two subproblemsndash Choose BFS or DFS to adapt to processors available memory
CARMA Performance Distributed Memory
Square m = k = n = 6144
ScaLAPACK
CARMA
Peak
(log)
(log)
Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA
CARMA Performance Distributed Memory
Inner Product m = n = 192 k = 6291456
ScaLAPACK
CARMAPeak
(log)
(log)
Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA
CARMA Performance Shared Memory
Square m = k = n
MKL (double)CARMA (double)
MKL (single)CARMA (single)
Peak (single)
Peak (double)
(log)
(linear)
Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA
CARMA Performance Shared Memory
Inner Product m = n = 64
MKL (double)
CARMA (double)
MKL (single)
CARMA (single)
(log)
(linear)
Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA
Why is CARMA Faster in Shared MemoryL3 Cache Misses
Shared Memory Inner Product (m = n = 64 k = 524288)
97 Fewer Misses
86 Fewer Misses
(linear)
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
One-sided Factorizations (LU QR) so farbull Classical Approach for i=1 to n update column i update trailing matrixbull words_moved = O(n3)
35
bull Blocked Approach (LAPACK) for i=1 to nb update block i of b columns update trailing matrixbull words moved = O(n3M13)
bull Recursive Approach func factor(A) if A has 1 column update it
else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)
bull None of these approaches minimizes messagesbull Parallel case Partial
Pivoting =gt n reductionsbull Need another idea
TSQR An Architecture-Dependent Algorithm
W =
W0
W1
W2
W3
R00
R10
R20
R30
R01
R11
R02Parallel
W =
W0
W1
W2
W3
R01R02
R00
R03
SequentialStreaming
W =
W0
W1
W2
W3
R00
R01
R01
R11
R02
R11
R03
Dual Core
Can choose reduction tree dynamically
Multicore Multisocket Multirack Multisite Out-of-core
Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo
Wnxb =
W1
W2
W3
W4
P1middotL1middotU1
P2middotL2middotU2
P3middotL3middotU3
P4middotL4middotU4
=
Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo
W1rsquoW2rsquoW3rsquoW4rsquo
P12middotL12middotU12
P34middotL34middotU34
=Choose b pivot rows call them W12rsquo
Choose b pivot rows call them W34rsquo
W12rsquoW34rsquo
= P1234middotL1234middotU1234
Choose b pivot rows
Go back to W and use these b pivot rows (move them to top do LU without pivoting)
37
Minimizing Communication in TSLU
W = W1
W2
W3
W4
LULULULU
LU
LULUParallel
W = W1
W2
W3
W4
LULU
LU
LUSequentialStreaming
W = W1
W2
W3
W4
LULU LU
LULU
LULU
Dual Core
Can choose reduction tree dynamically to match architecture as before
38
Making TSLU Numerically Stable
bull Details matterndash Going up the tree we could do LU either on original rows of A
(tournament pivoting) or computed rows of Undash Only tournament pivoting stable
bull ldquoThmrdquo New scheme as stable as Partial Pivoting (GEPP) in following sense Get same Schur complements as GEPP applied to different input matrix whose entries are blocks taken from input A
bull Why just a ldquoThmrdquo
39
Stability of LU using TSLU CALU
Summer School Lecture 4 40
bull Empirical testingndash Both random matrices and ldquospecial onesrdquondash Both binary tree (BCALU) and flat-tree (FCALU)ndash 3 metrics ||PA-LU||||A|| normwise and componentwise backward errorsndash See [D Grigori Xiang 2010] for details
Why is stability of TSLU just a ldquoThmrdquo
bull Proof is correct ndash in exact arithmeticbull Experiment
ndash Generate 100 random 6x6 rank 3 matrices in Matlabndash [LUP] = lu(A) do LU without pivoting on PA compare L factors are
they the samebull Compute || L ndash Lnp || A few 0rsquos A few infinrsquos a few NaNsbull Rest mostly O(1)
ndash Why Floating point is nonassociative doing arithmetic in different order gives different rounding errors
ndash Same experiment with rank 6 matrices || L ndash Lnp || usually nonzero O(macheps)
ndash Same experiment with 20x20 rank 4 matrices || L ndash Lnp || often O(103)
bull Much harder to break TSLU but possiblendash Occurred when using TSLU to factorize a low-rank subdiagonal
panel in symmetric-indefinite factorization41
Fixing TSLU
bull Run TSLU quickly test for stability fix if necessary (rare)
bull Test conditioning of U if not tiny (usual case) proceed elsebull Compute || L || if not big (usual case) proceed elsebull Factor A = QR using TSQR thenbull Factor Q = PLU using TSLU thenbull A = PL(UR) with UR as upper triangular factor
bull Last topic in lecture how to guarantee floating point reproducibility
42
2D CALU with Tournament Pivoting
43
25D CALU with Tournament Pivoting (c=4 copies)
44
Exascale Machine ParametersSource DOE Exascale Workshop
bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node
Exascale predicted speedupsfor Gaussian Elimination
2D CA-LU vs ScaLAPACK-LU
log2 (p)
log
2 (
n2p
) =
log
2 (m
emo
ry_p
er_p
roc)
Up to 29x
25D vs 2D LUWith and Without Pivoting
Other CA algorithms for Ax=b least squares(13)
bull A symmetric and indefinitendash Seek factorization that retains symmetry PAPT = LDLT D
ldquosimplerdquobull Save frac12 flops preserve inertia
ndash Usual approach Bunch-Kaufmanbull D block diagonal with 1x1 and 2x2 blocksbull Pivot search down column along row (lots of communication)
ndash Alternative Aasenbull D = tridiagonal = Tbull Two steps
ndash PAPT = LTLT where T is banded using TSLU
48
0 0
0
0 0
0
0
hellip
hellip
ndash Solvefactor narrow band problem with Tbull Up to 28x faster than MKL Best Paper at IPDPSrsquo13
Other CA algorithms for Ax=b least squares (23)bull Minimizing bandwidth and latency for sequential GEPP
ndash So far could not do partial pivoting and minimize messages just words
ndash Challengebull Column layout good for choosing pivots bad for matmulbull Blocked layout good for matmul bad for choosing pivots
ndash Solution use both layouts switching between thembull ldquoShape Morphing LUrdquo or SMLU
49
bull func factor(A) if A has 1 column update it else factor(left half of A)
update right half of A
factor(right half of A)
bull Words = O(n3M12)
bull Messages = O(n3M)
bull func factor(A) if A has 1 column update it else factor(left half of A) reshape to recursive block format update right half of A reshape to columnwise format factor(right half of A)
bull Words = O(n3M12)
bull Messages = O(n3M32)
Other CA algorithms for Ax=b least squares (33)bull Need for pivoting arises beyond LU in QR
ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)
ndash Usual approach like Partial Pivoting
bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound
ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two
groups of b columns either using usual approach or something better (GuEisenstat)
bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix
ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 50
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
What about sparse matrices (13)
bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then
bull But canrsquot reorder outer loop for 25D need another idea
bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring
bull Kleenersquos Algorithm
52
for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)
D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21
Performance of 25D APSP using Kleene
53
Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)
62xspeedup
2x speedup
What about sparse matrices (23)
bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of
G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix
bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering
for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid
bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse
multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on
separators) 54
What about sparse matrices (33)
bull If matrix stays very sparse lower bound unattainable new one
bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of
data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely
bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices
along dimensions most likely to minimize cost
55
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
Symmetric Eigenproblem and SVD
bull Usual approach for A=AT (SVD similar)ndash A QTAQ = T where Q orthogonal T tridiagonalndash T UTTU = Λ where U orthogonal Λ diagonalndash QUrsquos columns are eigenvectors Λ eigenvaluesndash Dense Tridiagonal Diagonalndash Only half BLAS3 half BLAS2 in LAPACKrsquos sytrd
bull Communication-Avoiding Approachndash A QAQT = B where B=BT banded of bandwidth M12
ndash Continue as above starting with Bndash Dense Banded Tridiagonal Diagonalndash Dense Banded use TSQR to zero out M12 colsrows at a timendash Banded Tridiagonal need new(ish) idea
b+1
b+1
Successive Band Reduction (BischofLangSun)
1
b+1
b+1
d+1
c
Successive Band Reduction (BischofLangSun)
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1Q1
b+1
b+1
d+1
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
12
Q1
b+1
b+1
d+1
d+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
12
Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
d+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
Q1
Q1T
Q2
Q2T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
Q1
Q1T
Q2
Q2T
Q3
Q3T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
6
6
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
Conventional vs CA - SBR
Conventional Communication-Avoiding
Touch all data 4 times Touch all data once
Cache and Network Oblivious Algorithms
bull Motivation Minimizes communication at every level of a hierarchical system without tuning parameters (in theory)ndash Not always 25D Matmul on BGP was topology aware
bull CAPS Divide-and-conquer choose BFS or DFS to adapt to processors available memory
bull CARMAndash Divide-and-conquer classical matmul divide largest of 3
dimensions to create two subproblemsndash Choose BFS or DFS to adapt to processors available memory
CARMA Performance Distributed Memory
Square m = k = n = 6144
ScaLAPACK
CARMA
Peak
(log)
(log)
Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA
CARMA Performance Distributed Memory
Inner Product m = n = 192 k = 6291456
ScaLAPACK
CARMAPeak
(log)
(log)
Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA
CARMA Performance Shared Memory
Square m = k = n
MKL (double)CARMA (double)
MKL (single)CARMA (single)
Peak (single)
Peak (double)
(log)
(linear)
Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA
CARMA Performance Shared Memory
Inner Product m = n = 64
MKL (double)
CARMA (double)
MKL (single)
CARMA (single)
(log)
(linear)
Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA
Why is CARMA Faster in Shared MemoryL3 Cache Misses
Shared Memory Inner Product (m = n = 64 k = 524288)
97 Fewer Misses
86 Fewer Misses
(linear)
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
One-sided Factorizations (LU QR) so farbull Classical Approach for i=1 to n update column i update trailing matrixbull words_moved = O(n3)
35
bull Blocked Approach (LAPACK) for i=1 to nb update block i of b columns update trailing matrixbull words moved = O(n3M13)
bull Recursive Approach func factor(A) if A has 1 column update it
else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)
bull None of these approaches minimizes messagesbull Parallel case Partial
Pivoting =gt n reductionsbull Need another idea
TSQR An Architecture-Dependent Algorithm
W =
W0
W1
W2
W3
R00
R10
R20
R30
R01
R11
R02Parallel
W =
W0
W1
W2
W3
R01R02
R00
R03
SequentialStreaming
W =
W0
W1
W2
W3
R00
R01
R01
R11
R02
R11
R03
Dual Core
Can choose reduction tree dynamically
Multicore Multisocket Multirack Multisite Out-of-core
Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo
Wnxb =
W1
W2
W3
W4
P1middotL1middotU1
P2middotL2middotU2
P3middotL3middotU3
P4middotL4middotU4
=
Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo
W1rsquoW2rsquoW3rsquoW4rsquo
P12middotL12middotU12
P34middotL34middotU34
=Choose b pivot rows call them W12rsquo
Choose b pivot rows call them W34rsquo
W12rsquoW34rsquo
= P1234middotL1234middotU1234
Choose b pivot rows
Go back to W and use these b pivot rows (move them to top do LU without pivoting)
37
Minimizing Communication in TSLU
W = W1
W2
W3
W4
LULULULU
LU
LULUParallel
W = W1
W2
W3
W4
LULU
LU
LUSequentialStreaming
W = W1
W2
W3
W4
LULU LU
LULU
LULU
Dual Core
Can choose reduction tree dynamically to match architecture as before
38
Making TSLU Numerically Stable
bull Details matterndash Going up the tree we could do LU either on original rows of A
(tournament pivoting) or computed rows of Undash Only tournament pivoting stable
bull ldquoThmrdquo New scheme as stable as Partial Pivoting (GEPP) in following sense Get same Schur complements as GEPP applied to different input matrix whose entries are blocks taken from input A
bull Why just a ldquoThmrdquo
39
Stability of LU using TSLU CALU
Summer School Lecture 4 40
bull Empirical testingndash Both random matrices and ldquospecial onesrdquondash Both binary tree (BCALU) and flat-tree (FCALU)ndash 3 metrics ||PA-LU||||A|| normwise and componentwise backward errorsndash See [D Grigori Xiang 2010] for details
Why is stability of TSLU just a ldquoThmrdquo
bull Proof is correct ndash in exact arithmeticbull Experiment
ndash Generate 100 random 6x6 rank 3 matrices in Matlabndash [LUP] = lu(A) do LU without pivoting on PA compare L factors are
they the samebull Compute || L ndash Lnp || A few 0rsquos A few infinrsquos a few NaNsbull Rest mostly O(1)
ndash Why Floating point is nonassociative doing arithmetic in different order gives different rounding errors
ndash Same experiment with rank 6 matrices || L ndash Lnp || usually nonzero O(macheps)
ndash Same experiment with 20x20 rank 4 matrices || L ndash Lnp || often O(103)
bull Much harder to break TSLU but possiblendash Occurred when using TSLU to factorize a low-rank subdiagonal
panel in symmetric-indefinite factorization41
Fixing TSLU
bull Run TSLU quickly test for stability fix if necessary (rare)
bull Test conditioning of U if not tiny (usual case) proceed elsebull Compute || L || if not big (usual case) proceed elsebull Factor A = QR using TSQR thenbull Factor Q = PLU using TSLU thenbull A = PL(UR) with UR as upper triangular factor
bull Last topic in lecture how to guarantee floating point reproducibility
42
2D CALU with Tournament Pivoting
43
25D CALU with Tournament Pivoting (c=4 copies)
44
Exascale Machine ParametersSource DOE Exascale Workshop
bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node
Exascale predicted speedupsfor Gaussian Elimination
2D CA-LU vs ScaLAPACK-LU
log2 (p)
log
2 (
n2p
) =
log
2 (m
emo
ry_p
er_p
roc)
Up to 29x
25D vs 2D LUWith and Without Pivoting
Other CA algorithms for Ax=b least squares(13)
bull A symmetric and indefinitendash Seek factorization that retains symmetry PAPT = LDLT D
ldquosimplerdquobull Save frac12 flops preserve inertia
ndash Usual approach Bunch-Kaufmanbull D block diagonal with 1x1 and 2x2 blocksbull Pivot search down column along row (lots of communication)
ndash Alternative Aasenbull D = tridiagonal = Tbull Two steps
ndash PAPT = LTLT where T is banded using TSLU
48
0 0
0
0 0
0
0
hellip
hellip
ndash Solvefactor narrow band problem with Tbull Up to 28x faster than MKL Best Paper at IPDPSrsquo13
Other CA algorithms for Ax=b least squares (23)bull Minimizing bandwidth and latency for sequential GEPP
ndash So far could not do partial pivoting and minimize messages just words
ndash Challengebull Column layout good for choosing pivots bad for matmulbull Blocked layout good for matmul bad for choosing pivots
ndash Solution use both layouts switching between thembull ldquoShape Morphing LUrdquo or SMLU
49
bull func factor(A) if A has 1 column update it else factor(left half of A)
update right half of A
factor(right half of A)
bull Words = O(n3M12)
bull Messages = O(n3M)
bull func factor(A) if A has 1 column update it else factor(left half of A) reshape to recursive block format update right half of A reshape to columnwise format factor(right half of A)
bull Words = O(n3M12)
bull Messages = O(n3M32)
Other CA algorithms for Ax=b least squares (33)bull Need for pivoting arises beyond LU in QR
ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)
ndash Usual approach like Partial Pivoting
bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound
ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two
groups of b columns either using usual approach or something better (GuEisenstat)
bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix
ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 50
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
What about sparse matrices (13)
bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then
bull But canrsquot reorder outer loop for 25D need another idea
bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring
bull Kleenersquos Algorithm
52
for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)
D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21
Performance of 25D APSP using Kleene
53
Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)
62xspeedup
2x speedup
What about sparse matrices (23)
bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of
G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix
bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering
for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid
bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse
multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on
separators) 54
What about sparse matrices (33)
bull If matrix stays very sparse lower bound unattainable new one
bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of
data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely
bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices
along dimensions most likely to minimize cost
55
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
Symmetric Eigenproblem and SVD
bull Usual approach for A=AT (SVD similar)ndash A QTAQ = T where Q orthogonal T tridiagonalndash T UTTU = Λ where U orthogonal Λ diagonalndash QUrsquos columns are eigenvectors Λ eigenvaluesndash Dense Tridiagonal Diagonalndash Only half BLAS3 half BLAS2 in LAPACKrsquos sytrd
bull Communication-Avoiding Approachndash A QAQT = B where B=BT banded of bandwidth M12
ndash Continue as above starting with Bndash Dense Banded Tridiagonal Diagonalndash Dense Banded use TSQR to zero out M12 colsrows at a timendash Banded Tridiagonal need new(ish) idea
b+1
b+1
Successive Band Reduction (BischofLangSun)
1
b+1
b+1
d+1
c
Successive Band Reduction (BischofLangSun)
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1Q1
b+1
b+1
d+1
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
12
Q1
b+1
b+1
d+1
d+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
12
Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
d+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
Q1
Q1T
Q2
Q2T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
Q1
Q1T
Q2
Q2T
Q3
Q3T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
6
6
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
Conventional vs CA - SBR
Conventional Communication-Avoiding
Touch all data 4 times Touch all data once
CARMA Performance Distributed Memory
Square m = k = n = 6144
ScaLAPACK
CARMA
Peak
(log)
(log)
Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA
CARMA Performance Distributed Memory
Inner Product m = n = 192 k = 6291456
ScaLAPACK
CARMAPeak
(log)
(log)
Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA
CARMA Performance Shared Memory
Square m = k = n
MKL (double)CARMA (double)
MKL (single)CARMA (single)
Peak (single)
Peak (double)
(log)
(linear)
Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA
CARMA Performance Shared Memory
Inner Product m = n = 64
MKL (double)
CARMA (double)
MKL (single)
CARMA (single)
(log)
(linear)
Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA
Why is CARMA Faster in Shared MemoryL3 Cache Misses
Shared Memory Inner Product (m = n = 64 k = 524288)
97 Fewer Misses
86 Fewer Misses
(linear)
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
One-sided Factorizations (LU, QR), so far
• Classical Approach:
    for i = 1 to n
      update column i
      update trailing matrix
  #words_moved = O(n^3)
• Blocked Approach (LAPACK):
    for i = 1 to n/b
      update block i of b columns
      update trailing matrix
  #words_moved = O(n^3 / M^(1/3))
• Recursive Approach:
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        update right half of A
        factor(right half of A)
  #words_moved = O(n^3 / M^(1/2))
• None of these approaches minimizes #messages
• Parallel case: Partial Pivoting => n reductions
• Need another idea
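A minimal sketch of the recursive approach above, assuming no pivoting is needed (the function name and in-place layout are illustrative); it shows the factor/update recursion but, as noted, does nothing for #messages:

```python
import numpy as np

def recursive_lu(A):
    """Recursive LU without pivoting, following the slide's factor():
    factor left half of the columns, update the right half, factor it."""
    A = np.array(A, dtype=float)
    n = A.shape[0]

    def factor(lo, hi):
        if hi - lo == 1:
            A[lo + 1:, lo] /= A[lo, lo]                  # single column: scale below diagonal
            return
        mid = (lo + hi) // 2
        factor(lo, mid)                                  # factor left half
        L11 = np.tril(A[lo:mid, lo:mid], -1) + np.eye(mid - lo)
        A[lo:mid, mid:hi] = np.linalg.solve(L11, A[lo:mid, mid:hi])   # U12 = L11^{-1} A12
        A[mid:, mid:hi] -= A[mid:, lo:mid] @ A[lo:mid, mid:hi]        # Schur complement
        factor(mid, hi)                                  # factor right half

    factor(0, n)
    return np.tril(A, -1) + np.eye(n), np.triu(A)        # unit-lower L, upper U

# usage: fine when no pivoting is needed, e.g. a diagonally dominant matrix
A = np.random.rand(8, 8) + 8 * np.eye(8)
L, U = recursive_lu(A)
assert np.allclose(L @ U, A)
```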
TSQR: An Architecture-Dependent Algorithm
[figure: QR of a tall-skinny W = [W0; W1; W2; W3] via different reduction trees of small local QRs]
  Parallel (binary tree): W0→R00, W1→R10, W2→R20, W3→R30; (R00, R10)→R01, (R20, R30)→R11; (R01, R11)→R02
  Sequential/Streaming (flat tree): W0→R00; (R00, W1)→R01; (R01, W2)→R02; (R02, W3)→R03
  Dual Core: a hybrid of the two
Can choose reduction tree dynamically
Multicore / Multisocket / Multirack / Multisite / Out-of-core: ?
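A serial NumPy sketch of the binary-tree (parallel-flavor) reduction, with np.linalg.qr standing in for the per-processor local QRs; the block count and names are illustrative:

```python
import numpy as np

def tsqr(W, block_rows=4):
    """Sketch of TSQR: QR of tall-skinny W via a binary reduction tree of small
    local QR factorizations; returns only the final R factor here."""
    # local QRs on row blocks (one per "processor")
    Rs = [np.linalg.qr(B, mode='r') for B in np.array_split(W, block_rows, axis=0)]
    # binary reduction tree: stack pairs of R factors and re-factor
    while len(Rs) > 1:
        nxt = []
        for i in range(0, len(Rs) - 1, 2):
            nxt.append(np.linalg.qr(np.vstack([Rs[i], Rs[i + 1]]), mode='r'))
        if len(Rs) % 2:
            nxt.append(Rs[-1])          # odd block carried to the next round
        Rs = nxt
    return Rs[0]

# usage: R from TSQR matches R from a direct QR, up to signs of its rows
W = np.random.randn(1000, 8)
assert np.allclose(np.abs(tsqr(W)), np.abs(np.linalg.qr(W, mode='r')))
```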
Back to LU: Using similar idea for TSLU as TSQR: use reduction tree to do "Tournament Pivoting"
  W (n x b) = [W1; W2; W3; W4], with Wi = Pi·Li·Ui
  Choose b pivot rows of W1, call them W1'; likewise W2', W3', W4'
  [W1'; W2'] = P12·L12·U12, choose b pivot rows, call them W12'
  [W3'; W4'] = P34·L34·U34, choose b pivot rows, call them W34'
  [W12'; W34'] = P1234·L1234·U1234, choose b pivot rows
Go back to W and use these b pivot rows (move them to top, do LU without pivoting)
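A serial SciPy sketch of the tournament, with ordinary GEPP as the local pivot selector and the binary tree flattened into rounds; the group size 2b and helper names are illustrative assumptions:

```python
import numpy as np
from scipy.linalg import lu

def local_winners(block, b):
    """First b pivot rows chosen by ordinary GEPP on a small block."""
    p, _, _ = lu(block)                  # block = p @ l @ u
    order = p.argmax(axis=0)             # row order after pivoting
    return order[:min(b, len(order))]

def tournament_pivot_rows(W, b):
    """Sketch of TSLU's tournament: groups of rows play off pairwise, each round
    keeping the b rows local GEPP would pick, until b global candidates remain.
    Returns their row indices in W (serial stand-in for the reduction tree)."""
    groups = [np.arange(i, min(i + 2 * b, W.shape[0]))
              for i in range(0, W.shape[0], 2 * b)]
    while True:
        winners = [idx[local_winners(W[idx], b)] for idx in groups]
        if len(winners) == 1:
            return winners[0]
        groups = [np.concatenate(winners[i:i + 2])   # pair winners for the next round
                  for i in range(0, len(winners), 2)]

# usage: pick b pivot rows of a tall-skinny panel, move them to the top,
# then do LU without pivoting (as on the slide)
W = np.random.randn(64, 4)
rows = tournament_pivot_rows(W, 4)
reordered = W[np.concatenate([rows, np.setdiff1d(np.arange(64), rows)])]
```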
Minimizing Communication in TSLU
[figure: the same reduction trees as for TSQR, with a local LU at each node — Parallel (binary tree), Sequential/Streaming (flat tree), Dual Core (hybrid)]
Can choose reduction tree dynamically, to match architecture, as before
Making TSLU Numerically Stable
• Details matter
  – Going up the tree, we could do LU either on original rows of A (tournament pivoting) or on computed rows of U
  – Only tournament pivoting is stable
• "Thm": New scheme is as stable as Partial Pivoting (GEPP) in the following sense: it gets the same Schur complements as GEPP applied to a different input matrix whose entries are blocks taken from the input A
• Why just a "Thm"?
Stability of LU using TSLU: CALU
• Empirical testing
  – Both random matrices and "special ones"
  – Both binary tree (BCALU) and flat-tree (FCALU)
  – 3 metrics: ||PA − LU|| / ||A||, normwise and componentwise backward errors
  – See [D., Grigori, Xiang, 2010] for details
Why is stability of TSLU just a "Thm"?
• Proof is correct – in exact arithmetic
• Experiment (a Python reproduction is sketched below):
  – Generate 100 random 6x6, rank-3 matrices in Matlab
  – [L,U,P] = lu(A); do LU without pivoting on P·A; compare L factors: are they the same?
    • Compute || L – Lnp ||: a few 0's, a few ∞'s, a few NaNs; the rest mostly O(1)
  – Why? Floating point is nonassociative; doing arithmetic in a different order gives different rounding errors
  – Same experiment with rank-6 matrices: || L – Lnp || usually nonzero, O(macheps)
  – Same experiment with 20x20, rank-4 matrices: || L – Lnp || often O(10^3)
• Much harder to break TSLU, but possible
  – Occurred when using TSLU to factorize a low-rank subdiagonal panel in a symmetric-indefinite factorization
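A minimal Python/SciPy stand-in for the Matlab experiment above; the seed, loop count, and the textbook no-pivot LU helper are assumptions, and the exact mix of 0's, ∞'s, and NaNs will vary run to run:

```python
import numpy as np
from scipy.linalg import lu

def lu_nopivot(A):
    """Textbook LU with no pivoting; blows up (inf/NaN) when a pivot is ~0."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    for k in range(n - 1):
        A[k + 1:, k] /= A[k, k]
        A[k + 1:, k + 1:] -= np.outer(A[k + 1:, k], A[k, k + 1:])
    return np.tril(A, -1) + np.eye(n)

# rank-3 6x6 matrices: GEPP's L vs the L from unpivoted LU of the already-pivoted P*A
np.random.seed(0)
diffs = []
with np.errstate(divide='ignore', invalid='ignore'):
    for _ in range(100):
        A = np.random.randn(6, 3) @ np.random.randn(3, 6)   # rank 3
        p, L, _ = lu(A)                                      # A = p @ L @ U  (GEPP)
        Lnp = lu_nopivot(p.T @ A)                            # no-pivot LU of P*A
        diffs.append(np.linalg.norm(L - Lnp))
# as on the slide: some 0's, infs, NaNs; the rest mostly O(1)
```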
Fixing TSLU
• Run TSLU, quickly test for stability, fix if necessary (rare)
• Test conditioning of U; if not tiny (usual case), proceed, else
• Compute || L ||; if not big (usual case), proceed, else
• Factor A = QR using TSQR, then
• Factor Q = PLU using TSLU, then
• A = P·L·(U·R), with U·R as the upper triangular factor
• Last topic in lecture: how to guarantee floating point reproducibility
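A compact sketch of that fallback logic, with ordinary GEPP and np.linalg.qr standing in for TSLU and TSQR; the tolerances and the function name are assumptions:

```python
import numpy as np
from scipy.linalg import lu

def robust_panel_factor(A, cond_tol=1e12, growth_tol=1e3):
    """Sketch of the slide's fix-up: accept the fast factorization when U is
    well enough conditioned and L shows no large growth, else redo via QR."""
    P, L, U = lu(A)                             # fast path (stand-in for TSLU)
    if np.linalg.cond(U) < cond_tol and np.abs(L).max() < growth_tol:
        return P, L, U                          # usual case: keep it
    Q, R = np.linalg.qr(A)                      # rare case: factor A = QR ...
    Pq, Lq, Uq = lu(Q)                          # ... then Q = Pq Lq Uq
    return Pq, Lq, Uq @ R                       # A = Pq Lq (Uq R), Uq R upper triangular
```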
2D CALU with Tournament Pivoting
[figure]
2.5D CALU with Tournament Pivoting (c = 4 copies)
[figure]
Exascale Machine Parameters (Source: DOE Exascale Workshop)
• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node
Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU
[contour plot over log2(p) and log2(n^2/p) = log2(memory_per_proc); speedups up to 29x]
2.5D vs 2D LU, With and Without Pivoting
[figure]
Other CA algorithms for Ax=b, least squares (1/3)
• A symmetric and indefinite
  – Seek factorization that retains symmetry: P·A·P^T = L·D·L^T, with D "simple"
    • Save half the flops, preserve inertia
  – Usual approach: Bunch-Kaufman
    • D block diagonal with 1x1 and 2x2 blocks
    • Pivot search down column, along row (lots of communication)
  – Alternative: Aasen
    • D = tridiagonal = T
    • Two steps:
      – P·A·P^T = L·T·L^T where T is banded, using TSLU
        [figure: the banded matrix T]
      – Solve/factor narrow band problem with T
    • Up to 2.8x faster than MKL, Best Paper at IPDPS'13
Other CA algorithms for Ax=b, least squares (2/3)
• Minimizing bandwidth and latency for sequential GEPP
  – So far, could not do partial pivoting and minimize #messages, just #words
  – Challenge:
    • Column layout good for choosing pivots, bad for matmul
    • Blocked layout good for matmul, bad for choosing pivots
  – Solution: use both layouts, switching between them
    • "Shape Morphing LU", or SMLU
• func factor(A)
    if A has 1 column, update it
    else
      factor(left half of A)
      update right half of A
      factor(right half of A)
  #Words = O(n^3 / M^(1/2))
  #Messages = O(n^3 / M)
• func factor(A)    // shape morphing
    if A has 1 column, update it
    else
      factor(left half of A)
      reshape to recursive block format
      update right half of A
      reshape to columnwise format
      factor(right half of A)
  #Words = O(n^3 / M^(1/2))
  #Messages = O(n^3 / M^(3/2))
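A small sketch of what "reshape to recursive block format" could look like; the quadtree ordering, leaf size, and helper name are illustrative assumptions, not SMLU's actual layout code:

```python
import numpy as np

def to_recursive_blocks(A, leaf=32):
    """Sketch: store A as recursively ordered quadrant blocks (column-major
    inside each leaf), the layout SMLU switches into before matmul-rich updates."""
    def rec(block):
        m, n = block.shape
        if max(m, n) <= leaf:
            return [block.ravel(order='F')]   # leaf: plain column-major
        i, j = m // 2, n // 2
        return (rec(block[:i, :j]) + rec(block[:i, j:]) +
                rec(block[i:, :j]) + rec(block[i:, j:]))
    return np.concatenate(rec(A))

# the matching inverse would be the "reshape to columnwise format" step;
# switching layouts buys matmul-friendly updates while keeping a column
# layout available when pivots must be chosen.
```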
Other CA algorithms for Ax=b, least squares (3/3)
• Need for pivoting arises beyond LU, in QR
  – Choose permutation P so that leading columns of A·P = Q·R span the column space of A – Rank Revealing QR (RRQR)
  – Usual approach, like Partial Pivoting
    • Put longest column first, update rest of matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit the lower bound
  – Use Tournament Pivoting
    • Each round of the tournament selects the best b columns from two groups of b columns, either using the usual approach or something better (Gu/Eisenstat)
    • Thm: This approach "reveals the rank" of A, in the sense that the leading r x r submatrix of R has singular values "near" the largest r singular values of A; ditto for the trailing submatrix
  – Idea extends to other pivoting schemes
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDL^T with complete pivoting
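A serial sketch of tournament pivoting for columns, using SciPy's column-pivoted QR as the local selector in place of a Gu/Eisenstat strong RRQR; the group pairing and names are illustrative:

```python
import numpy as np
from scipy.linalg import qr

def tournament_select_columns(A, b):
    """Sketch of column tournament pivoting: pairs of b-column groups play off,
    each round keeping the b columns the local column-pivoted QR ranks first;
    returns the b winning column indices of A."""
    groups = [np.arange(i, min(i + b, A.shape[1])) for i in range(0, A.shape[1], b)]
    while len(groups) > 1:
        nxt = []
        for i in range(0, len(groups), 2):
            cols = np.concatenate(groups[i:i + 2])
            _, _, piv = qr(A[:, cols], mode='economic', pivoting=True)
            nxt.append(cols[piv[:b]])     # winners of this round (global indices)
        groups = nxt
    return groups[0]

# usage: the selected columns should approximately span the numerical column space
A = np.random.randn(100, 6) @ np.random.randn(6, 64)   # numerical rank ~6
cols = tournament_select_columns(A, 8)
```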
Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity
What about sparse matrices? (1/3)
• If matrix quickly becomes dense, use dense algorithm
• Ex: All Pairs Shortest Paths using Floyd-Warshall
• Similar to matmul: Let D = A, then:
    for k = 1 to n
      for i = 1 to n
        for j = 1 to n
          D(i,j) = min( D(i,j), D(i,k) + D(k,j) )
• But can't reorder the outer loop for 2.5D, need another idea
• Abbreviate D(i,j) = min( D(i,j), min_k( A(i,k) + B(k,j) ) ) by D = A*B
  – Dependencies ok, 2.5D works, just a different semiring
• Kleene's Algorithm:
    D = DC-APSP(A, n)
      D = A
      Partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
      D11 = DC-APSP(D11, n/2)
      D12 = D11 * D12
      D21 = D21 * D11
      D22 = D21 * D12
      D22 = DC-APSP(D22, n/2)
      D21 = D22 * D21
      D12 = D12 * D22
      D11 = D12 * D21
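A NumPy sketch of the recursion in the (min, +) semiring; the explicit elementwise min makes the accumulation implied by the slide's D22 = D21*D12 etc. visible (function names and the small Floyd-Warshall cross-check are illustrative):

```python
import numpy as np

def minplus(A, B):
    """Semiring 'matmul': C[i,j] = min_k A[i,k] + B[k,j]."""
    return (A[:, :, None] + B[None, :, :]).min(axis=1)

def dc_apsp(D):
    """DC-APSP (Kleene) recursion on a dense distance matrix
    (0 on the diagonal, np.inf for missing edges)."""
    D = np.array(D, dtype=float)
    n = D.shape[0]
    if n == 1:
        return D
    m = n // 2
    D11, D12 = D[:m, :m], D[:m, m:]
    D21, D22 = D[m:, :m], D[m:, m:]
    D11[:] = dc_apsp(D11)
    D12[:] = np.minimum(D12, minplus(D11, D12))
    D21[:] = np.minimum(D21, minplus(D21, D11))
    D22[:] = np.minimum(D22, minplus(D21, D12))
    D22[:] = dc_apsp(D22)
    D21[:] = np.minimum(D21, minplus(D22, D21))
    D12[:] = np.minimum(D12, minplus(D12, D22))
    D11[:] = np.minimum(D11, minplus(D12, D21))
    return D

# cross-check against plain Floyd-Warshall on a small random graph
n = 16
G = np.where(np.random.rand(n, n) < 0.3, np.random.rand(n, n), np.inf)
np.fill_diagonal(G, 0.0)
F = G.copy()
for k in range(n):
    F = np.minimum(F, F[:, k:k + 1] + F[k:k + 1, :])
assert np.allclose(dc_apsp(G), F)
```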
Performance of 2.5D APSP using Kleene
[plot: strong scaling on Hopper (Cray XE6 with 1024 nodes = 24,576 cores); callouts of 6.2x speedup and 2x speedup]
What about sparse matrices? (2/3)
• If parts of the matrix become dense, optimize those
• Ex: Cholesky on a matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): If all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on a w x w submatrix
• Thm: Words_moved = Ω(w^3 / M^(1/2)), etc.
• Thm (George '73): Nested dissection gives optimal ordering for 2D grid, 3D grid, similar matrices
  – w = n for a 2D n x n grid, w = n^2 for a 3D n x n x n grid
• Sequential multifrontal Cholesky attains the bounds
• PSPACES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)
What about sparse matrices? (3/3)
• If the matrix stays very sparse, the lower bound is unattainable; a new one:
• Ex: A, B both diagonal: no communication in the parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), iid
• Assumption: the algorithm is sparsity-independent: assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
    Words_moved = Ω( min( d·n / P^(1/2), d^2·n / P ) )
  – Proof exploits the fact that reuse of entries of C = A·B is unlikely
• Contrast the general lower bound: Words_moved = Ω( d^2·n / (P·M^(1/2)) )
• Attained by a divide-and-conquer algorithm that splits matrices along the dimensions most likely to minimize cost
Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity
Symmetric Eigenproblem and SVD
bull Usual approach for A=AT (SVD similar)ndash A QTAQ = T where Q orthogonal T tridiagonalndash T UTTU = Λ where U orthogonal Λ diagonalndash QUrsquos columns are eigenvectors Λ eigenvaluesndash Dense Tridiagonal Diagonalndash Only half BLAS3 half BLAS2 in LAPACKrsquos sytrd
bull Communication-Avoiding Approachndash A QAQT = B where B=BT banded of bandwidth M12
ndash Continue as above starting with Bndash Dense Banded Tridiagonal Diagonalndash Dense Banded use TSQR to zero out M12 colsrows at a timendash Banded Tridiagonal need new(ish) idea
b+1
b+1
Successive Band Reduction (BischofLangSun)
1
b+1
b+1
d+1
c
Successive Band Reduction (BischofLangSun)
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1Q1
b+1
b+1
d+1
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
12
Q1
b+1
b+1
d+1
d+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
12
Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
d+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
Q1
Q1T
Q2
Q2T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
Q1
Q1T
Q2
Q2T
Q3
Q3T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
6
6
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
Conventional vs CA - SBR
Conventional Communication-Avoiding
Touch all data 4 times Touch all data once
Why is CARMA Faster in Shared MemoryL3 Cache Misses
Shared Memory Inner Product (m = n = 64 k = 524288)
97 Fewer Misses
86 Fewer Misses
(linear)
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
One-sided Factorizations (LU QR) so farbull Classical Approach for i=1 to n update column i update trailing matrixbull words_moved = O(n3)
35
bull Blocked Approach (LAPACK) for i=1 to nb update block i of b columns update trailing matrixbull words moved = O(n3M13)
bull Recursive Approach func factor(A) if A has 1 column update it
else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)
bull None of these approaches minimizes messagesbull Parallel case Partial
Pivoting =gt n reductionsbull Need another idea
TSQR An Architecture-Dependent Algorithm
W =
W0
W1
W2
W3
R00
R10
R20
R30
R01
R11
R02Parallel
W =
W0
W1
W2
W3
R01R02
R00
R03
SequentialStreaming
W =
W0
W1
W2
W3
R00
R01
R01
R11
R02
R11
R03
Dual Core
Can choose reduction tree dynamically
Multicore Multisocket Multirack Multisite Out-of-core
Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo
Wnxb =
W1
W2
W3
W4
P1middotL1middotU1
P2middotL2middotU2
P3middotL3middotU3
P4middotL4middotU4
=
Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo
W1rsquoW2rsquoW3rsquoW4rsquo
P12middotL12middotU12
P34middotL34middotU34
=Choose b pivot rows call them W12rsquo
Choose b pivot rows call them W34rsquo
W12rsquoW34rsquo
= P1234middotL1234middotU1234
Choose b pivot rows
Go back to W and use these b pivot rows (move them to top do LU without pivoting)
37
Minimizing Communication in TSLU
W = W1
W2
W3
W4
LULULULU
LU
LULUParallel
W = W1
W2
W3
W4
LULU
LU
LUSequentialStreaming
W = W1
W2
W3
W4
LULU LU
LULU
LULU
Dual Core
Can choose reduction tree dynamically to match architecture as before
38
Making TSLU Numerically Stable
bull Details matterndash Going up the tree we could do LU either on original rows of A
(tournament pivoting) or computed rows of Undash Only tournament pivoting stable
bull ldquoThmrdquo New scheme as stable as Partial Pivoting (GEPP) in following sense Get same Schur complements as GEPP applied to different input matrix whose entries are blocks taken from input A
bull Why just a ldquoThmrdquo
39
Stability of LU using TSLU CALU
Summer School Lecture 4 40
bull Empirical testingndash Both random matrices and ldquospecial onesrdquondash Both binary tree (BCALU) and flat-tree (FCALU)ndash 3 metrics ||PA-LU||||A|| normwise and componentwise backward errorsndash See [D Grigori Xiang 2010] for details
Why is stability of TSLU just a ldquoThmrdquo
bull Proof is correct ndash in exact arithmeticbull Experiment
ndash Generate 100 random 6x6 rank 3 matrices in Matlabndash [LUP] = lu(A) do LU without pivoting on PA compare L factors are
they the samebull Compute || L ndash Lnp || A few 0rsquos A few infinrsquos a few NaNsbull Rest mostly O(1)
ndash Why Floating point is nonassociative doing arithmetic in different order gives different rounding errors
ndash Same experiment with rank 6 matrices || L ndash Lnp || usually nonzero O(macheps)
ndash Same experiment with 20x20 rank 4 matrices || L ndash Lnp || often O(103)
bull Much harder to break TSLU but possiblendash Occurred when using TSLU to factorize a low-rank subdiagonal
panel in symmetric-indefinite factorization41
Fixing TSLU
bull Run TSLU quickly test for stability fix if necessary (rare)
bull Test conditioning of U if not tiny (usual case) proceed elsebull Compute || L || if not big (usual case) proceed elsebull Factor A = QR using TSQR thenbull Factor Q = PLU using TSLU thenbull A = PL(UR) with UR as upper triangular factor
bull Last topic in lecture how to guarantee floating point reproducibility
42
2D CALU with Tournament Pivoting
43
25D CALU with Tournament Pivoting (c=4 copies)
44
Exascale Machine ParametersSource DOE Exascale Workshop
bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node
Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU

[Heat map over log2(p) on one axis and log2(n^2/p) = log2(memory_per_proc) on the other; speedup up to 29x.]
2.5D vs 2D LU, With and Without Pivoting
Other CA algorithms for Ax=b, least squares (1/3)
• A symmetric and indefinite
  – Seek a factorization that retains symmetry, P·A·P^T = L·D·L^T, with D "simple"
    • Saves half the flops, preserves inertia
  – Usual approach: Bunch-Kaufman
    • D block diagonal with 1x1 and 2x2 blocks
    • Pivot search down column and along row (lots of communication)
  – Alternative: Aasen
    • D = tridiagonal = T
    • Two steps:
      – P·A·P^T = L·T·L^T, where T is banded, using TSLU
        [figure: the banded matrix T]
      – Solve/factor the narrow-band problem with T
    • Up to 2.8x faster than MKL; Best Paper at IPDPS'13
Other CA algorithms for Ax=b, least squares (2/3)
• Minimizing bandwidth and latency for sequential GEPP
  – So far, could not do partial pivoting and minimize #messages, just #words
  – Challenge:
    • Column layout good for choosing pivots, bad for matmul
    • Blocked layout good for matmul, bad for choosing pivots
  – Solution: use both layouts, switching between them
    • "Shape Morphing LU", or SMLU

• Recursive GEPP (column layout only):
      func factor(A)
        if A has 1 column, update it
        else
          factor(left half of A)
          update right half of A
          factor(right half of A)
  Words = O(n^3 / M^(1/2)), Messages = O(n^3 / M)

• SMLU:
      func factor(A)
        if A has 1 column, update it
        else
          factor(left half of A)
          reshape to recursive block format
          update right half of A
          reshape to columnwise format
          factor(right half of A)
  Words = O(n^3 / M^(1/2)), Messages = O(n^3 / M^(3/2))

(A runnable sketch of the recursive factorization follows below.)
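A runnable NumPy/SciPy sketch of the recursive factorization above, without pivoting (a diagonally dominant test matrix avoids zero pivots). The SMLU layout reshapes, which are the whole point of SMLU, are indicated only by the comment on the update step and are not modeled here.

    import numpy as np
    from scipy.linalg import solve_triangular

    def recursive_lu(A):
        # Recursive LU without pivoting, in place on a panel with rows >= cols:
        # on return, the strict lower triangle holds L (unit diagonal implied)
        # and the upper triangle holds U.
        cols = A.shape[1]
        if cols == 1:
            A[1:, 0] /= A[0, 0]                 # single column: form L entries
            return
        m = cols // 2
        recursive_lu(A[:, :m])                  # factor(left half of A)
        # update right half of A (SMLU would reshape layouts around this step):
        # U12 = L11^{-1} A12, then A22 -= L21 @ U12
        A[:m, m:] = solve_triangular(A[:m, :m], A[:m, m:],
                                     lower=True, unit_diagonal=True)
        A[m:, m:] -= A[m:, :m] @ A[:m, m:]
        recursive_lu(A[m:, m:])                 # factor(right half of A)

    n = 256
    A0 = np.random.randn(n, n) + n * np.eye(n)  # diagonally dominant: no pivoting needed
    A = A0.copy()
    recursive_lu(A)
    L = np.tril(A, -1) + np.eye(n)
    U = np.triu(A)
    print(np.allclose(L @ U, A0))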
Other CA algorithms for Ax=b, least squares (3/3)
• Need for pivoting arises beyond LU, in QR
  – Choose a permutation P so that the leading columns of A·P = Q·R span the column space of A – Rank-Revealing QR (RRQR)
  – Usual approach, like partial pivoting:
    • Put the longest column first, update the rest of the matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit the lower bound
  – Use Tournament Pivoting (selection sketch below):
    • Each round of the tournament selects the best b columns from two groups of b columns, either using the usual approach or something better (Gu/Eisenstat)
    • Thm: this approach "reveals the rank" of A, in the sense that the leading r x r submatrix of R has singular values "near" the largest r singular values of A; ditto for the trailing submatrix
  – Idea extends to other pivoting schemes:
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDL^T with complete pivoting
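A SciPy sketch of the column tournament only (selecting b columns), using ordinary column-pivoted QR on small groups as the "usual approach" for each round. It does not build the full RRQR factorization, and best_b_columns / tournament_columns are illustrative names, not from the slides.

    import numpy as np
    from scipy.linalg import qr

    def best_b_columns(A, cols, b):
        # One tournament round: from candidate columns `cols` of A,
        # keep the b picked first by QR with column pivoting.
        _, _, piv = qr(A[:, cols], mode="economic", pivoting=True)
        return [cols[j] for j in piv[:b]]

    def tournament_columns(A, b):
        # Select b columns of A by a binary tournament over groups of b columns.
        n = A.shape[1]
        groups = [list(range(i, min(i + b, n))) for i in range(0, n, b)]
        while len(groups) > 1:
            groups = [best_b_columns(A, groups[i] +
                                     (groups[i + 1] if i + 1 < len(groups) else []), b)
                      for i in range(0, len(groups), 2)]
        return groups[0]

    A = np.random.randn(200, 64)
    print(sorted(tournament_columns(A, b=8)))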
Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity
What about sparse matrices? (1/3)
• If the matrix quickly becomes dense, use a dense algorithm
• Ex: All-Pairs Shortest Paths using Floyd-Warshall
• Similar to matmul: let D = A, then

      for k = 1:n
        for i = 1:n
          for j = 1:n
            D(i,j) = min(D(i,j), D(i,k) + D(k,j))

• But can't reorder the outer loop for 2.5D; need another idea
• Abbreviate D(i,j) = min(D(i,j), min_k(A(i,k) + B(k,j))) by D = A⊗B
  – Dependencies OK, 2.5D works, just a different semiring
• Kleene's Algorithm (a NumPy sketch of ⊗ and this recursion follows below):

      D = DC-APSP(A, n)
        D = A
        Partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
        D11 = DC-APSP(D11, n/2)
        D12 = D11 ⊗ D12
        D21 = D21 ⊗ D11
        D22 = D21 ⊗ D12
        D22 = DC-APSP(D22, n/2)
        D21 = D22 ⊗ D21
        D12 = D12 ⊗ D22
        D11 = D12 ⊗ D21
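A serial NumPy sketch of the (min,+) semiring "matmul" and the divide-and-conquer recursion above, checked against Floyd-Warshall. Following the slide's convention, ⊗ accumulates into the left-hand side (D = A⊗B means D(i,j) = min(D(i,j), min_k A(i,k)+B(k,j))). Assumes nonnegative weights and a zero diagonal; the 2.5D replication is not modeled.

    import numpy as np

    def minplus(D, A, B):
        # D(i,j) = min(D(i,j), min_k A(i,k) + B(k,j))
        return np.minimum(D, (A[:, :, None] + B[None, :, :]).min(axis=1))

    def dc_apsp(D):
        # Kleene's divide-and-conquer APSP; D is square with a zero diagonal.
        n = D.shape[0]
        if n == 1:
            return D
        m = n // 2
        D = D.copy()
        D11, D12, D21, D22 = D[:m, :m], D[:m, m:], D[m:, :m], D[m:, m:]
        D11 = dc_apsp(D11)
        D12 = minplus(D12, D11, D12)
        D21 = minplus(D21, D21, D11)
        D22 = minplus(D22, D21, D12)
        D22 = dc_apsp(D22)
        D21 = minplus(D21, D22, D21)
        D12 = minplus(D12, D12, D22)
        D11 = minplus(D11, D12, D21)
        return np.block([[D11, D12], [D21, D22]])

    def floyd_warshall(D):
        D = D.copy()
        for k in range(D.shape[0]):
            D = np.minimum(D, D[:, k:k+1] + D[k:k+1, :])
        return D

    rng = np.random.default_rng(1)
    A = rng.uniform(1, 10, size=(64, 64))     # positive edge weights
    np.fill_diagonal(A, 0.0)
    print(np.allclose(dc_apsp(A), floyd_warshall(A)))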
Performance of 2.5D APSP using Kleene

[Strong scaling plot on Hopper (Cray XE6 with 1024 nodes = 24,576 cores); annotations show a 6.2x speedup and a 2x speedup.]
What about sparse matrices? (2/3)
• If parts of the matrix become dense, optimize those
• Ex: Cholesky on a matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): if all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on a w x w submatrix
• Thm: Words_moved = Ω(w^3 / M^(1/2)), etc.
• Thm (George '73): nested dissection gives an optimal ordering for 2D grids, 3D grids, and similar matrices
  – w = n for a 2D n x n grid; w = n^2 for a 3D n x n x n grid
• Sequential multifrontal Cholesky attains these bounds
• PSPASES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)
What about sparse matrices? (3/3)
• If the matrix stays very sparse, the lower bound is unattainable; need a new one
• Ex: A, B both diagonal: no communication in the parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), i.i.d. (a small scipy.sparse check follows below)
• Assumption: the algorithm is sparsity-independent: the assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: a parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
      Words_moved = Ω(min( d·n/P^(1/2), d^2·n/P ))
  – Proof exploits the fact that reuse of entries of C = A·B is unlikely
• Contrast the general lower bound: Words_moved = Ω(d^2·n / (P·M^(1/2)))
• Attained by a divide-and-conquer algorithm that splits the matrices along the dimensions most likely to minimize cost
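A quick scipy.sparse check of the regime the bound describes, under the assumption Prob(nonzero) = d/n: each factor has about d·n nonzeros and the product has on the order of d^2·n, so there is essentially no reuse of entries of C to exploit. Parameter values are illustrative.

    import numpy as np
    import scipy.sparse as sp

    n, d = 10_000, 8                     # Prob(A(i,j) != 0) = d/n, d << sqrt(n)
    rng = np.random.default_rng(0)
    A = sp.random(n, n, density=d / n, random_state=rng, format="csr")
    B = sp.random(n, n, density=d / n, random_state=rng, format="csr")
    C = A @ B
    print(A.nnz, B.nnz, C.nnz, d * d * n)   # nnz(C) is on the order of d^2 * n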
Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity
Symmetric Eigenproblem and SVD
• Usual approach for A = A^T (SVD similar):
  – A → Q^T·A·Q = T, where Q orthogonal, T tridiagonal
  – T → U^T·T·U = Λ, where U orthogonal, Λ diagonal
  – (Q·U)'s columns are the eigenvectors, Λ the eigenvalues
  – Dense → Tridiagonal → Diagonal
  – Only half BLAS3, half BLAS2, in LAPACK's sytrd
  (a small sketch of the dense → tridiagonal step follows below)
• Communication-Avoiding Approach:
  – A → Q·A·Q^T = B, where B = B^T banded, of bandwidth M^(1/2)
  – Continue as above, starting with B
  – Dense → Banded → Tridiagonal → Diagonal
  – Dense → Banded: use TSQR to zero out M^(1/2) cols/rows at a time
  – Banded → Tridiagonal: need a new(ish) idea
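A small NumPy sketch of the classical one-stage dense → tridiagonal reduction only, using explicit Householder reflectors; it illustrates the similarity transform and that the spectrum is preserved. This is the BLAS2-like conventional path, not the communication-avoiding two-stage (dense → banded → tridiagonal) algorithm.

    import numpy as np

    def tridiagonalize(A):
        # Classical reduction A -> Q^T A Q = T (T tridiagonal) via Householder
        # reflectors applied explicitly; for illustration only.
        T = np.array(A, dtype=float, copy=True)
        n = T.shape[0]
        Q = np.eye(n)
        for k in range(n - 2):
            x = T[k+1:, k]
            v = x.copy()
            v[0] += (1.0 if x[0] >= 0 else -1.0) * np.linalg.norm(x)
            nv = np.linalg.norm(v)
            if nv == 0.0:                       # column already in the right form
                continue
            v /= nv
            H = np.eye(n)
            H[k+1:, k+1:] -= 2.0 * np.outer(v, v)
            T = H @ T @ H                       # symmetric similarity transform
            Q = Q @ H
        return T, Q

    rng = np.random.default_rng(0)
    A = rng.standard_normal((8, 8))
    A = (A + A.T) / 2
    T, Q = tridiagonalize(A)
    print(np.allclose(Q.T @ A @ Q, T))                          # similarity holds
    print(np.allclose(np.linalg.eigvalsh(T), np.linalg.eigvalsh(A)))  # same spectrum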
Successive Band Reduction (Bischof/Lang/Sun)

[Animation sequence: a banded symmetric matrix of bandwidth b+1 is reduced by repeatedly eliminating d diagonals from a block of c columns and chasing the resulting bulge down the band with orthogonal transforms Q1, Q1^T, Q2, Q2^T, ..., Q5, Q5^T. Legend on every frame: b = bandwidth, c = #columns, d = #diagonals, constraint c + d ≤ b.]
Conventional vs CA-SBR
  Conventional: touch all data 4 times
  Communication-Avoiding: touch all data once
Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity
One-sided Factorizations (LU, QR), so far
• Classical approach:
      for i = 1 to n
        update column i
        update trailing matrix
  words_moved = O(n^3)
• Blocked approach (LAPACK):
      for i = 1 to n/b
        update block i of b columns
        update trailing matrix
  words_moved = O(n^3 / M^(1/3))
• Recursive approach:
      func factor(A)
        if A has 1 column, update it
        else
          factor(left half of A)
          update right half of A
          factor(right half of A)
  words_moved = O(n^3 / M^(1/2))
• None of these approaches minimizes #messages
• Parallel case: partial pivoting ⇒ n reductions
• Need another idea
TSQR An Architecture-Dependent Algorithm
W =
W0
W1
W2
W3
R00
R10
R20
R30
R01
R11
R02Parallel
W =
W0
W1
W2
W3
R01R02
R00
R03
SequentialStreaming
W =
W0
W1
W2
W3
R00
R01
R01
R11
R02
R11
R03
Dual Core
Can choose reduction tree dynamically
Multicore Multisocket Multirack Multisite Out-of-core
Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo
Wnxb =
W1
W2
W3
W4
P1middotL1middotU1
P2middotL2middotU2
P3middotL3middotU3
P4middotL4middotU4
=
Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo
W1rsquoW2rsquoW3rsquoW4rsquo
P12middotL12middotU12
P34middotL34middotU34
=Choose b pivot rows call them W12rsquo
Choose b pivot rows call them W34rsquo
W12rsquoW34rsquo
= P1234middotL1234middotU1234
Choose b pivot rows
Go back to W and use these b pivot rows (move them to top do LU without pivoting)
37
Minimizing Communication in TSLU
W = W1
W2
W3
W4
LULULULU
LU
LULUParallel
W = W1
W2
W3
W4
LULU
LU
LUSequentialStreaming
W = W1
W2
W3
W4
LULU LU
LULU
LULU
Dual Core
Can choose reduction tree dynamically to match architecture as before
38
Making TSLU Numerically Stable
bull Details matterndash Going up the tree we could do LU either on original rows of A
(tournament pivoting) or computed rows of Undash Only tournament pivoting stable
bull ldquoThmrdquo New scheme as stable as Partial Pivoting (GEPP) in following sense Get same Schur complements as GEPP applied to different input matrix whose entries are blocks taken from input A
bull Why just a ldquoThmrdquo
39
Stability of LU using TSLU CALU
Summer School Lecture 4 40
bull Empirical testingndash Both random matrices and ldquospecial onesrdquondash Both binary tree (BCALU) and flat-tree (FCALU)ndash 3 metrics ||PA-LU||||A|| normwise and componentwise backward errorsndash See [D Grigori Xiang 2010] for details
Why is stability of TSLU just a ldquoThmrdquo
bull Proof is correct ndash in exact arithmeticbull Experiment
ndash Generate 100 random 6x6 rank 3 matrices in Matlabndash [LUP] = lu(A) do LU without pivoting on PA compare L factors are
they the samebull Compute || L ndash Lnp || A few 0rsquos A few infinrsquos a few NaNsbull Rest mostly O(1)
ndash Why Floating point is nonassociative doing arithmetic in different order gives different rounding errors
ndash Same experiment with rank 6 matrices || L ndash Lnp || usually nonzero O(macheps)
ndash Same experiment with 20x20 rank 4 matrices || L ndash Lnp || often O(103)
bull Much harder to break TSLU but possiblendash Occurred when using TSLU to factorize a low-rank subdiagonal
panel in symmetric-indefinite factorization41
Fixing TSLU
bull Run TSLU quickly test for stability fix if necessary (rare)
bull Test conditioning of U if not tiny (usual case) proceed elsebull Compute || L || if not big (usual case) proceed elsebull Factor A = QR using TSQR thenbull Factor Q = PLU using TSLU thenbull A = PL(UR) with UR as upper triangular factor
bull Last topic in lecture how to guarantee floating point reproducibility
42
2D CALU with Tournament Pivoting
43
25D CALU with Tournament Pivoting (c=4 copies)
44
Exascale Machine ParametersSource DOE Exascale Workshop
bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node
Exascale predicted speedupsfor Gaussian Elimination
2D CA-LU vs ScaLAPACK-LU
log2 (p)
log
2 (
n2p
) =
log
2 (m
emo
ry_p
er_p
roc)
Up to 29x
25D vs 2D LUWith and Without Pivoting
Other CA algorithms for Ax=b least squares(13)
bull A symmetric and indefinitendash Seek factorization that retains symmetry PAPT = LDLT D
ldquosimplerdquobull Save frac12 flops preserve inertia
ndash Usual approach Bunch-Kaufmanbull D block diagonal with 1x1 and 2x2 blocksbull Pivot search down column along row (lots of communication)
ndash Alternative Aasenbull D = tridiagonal = Tbull Two steps
ndash PAPT = LTLT where T is banded using TSLU
48
0 0
0
0 0
0
0
hellip
hellip
ndash Solvefactor narrow band problem with Tbull Up to 28x faster than MKL Best Paper at IPDPSrsquo13
Other CA algorithms for Ax=b least squares (23)bull Minimizing bandwidth and latency for sequential GEPP
ndash So far could not do partial pivoting and minimize messages just words
ndash Challengebull Column layout good for choosing pivots bad for matmulbull Blocked layout good for matmul bad for choosing pivots
ndash Solution use both layouts switching between thembull ldquoShape Morphing LUrdquo or SMLU
49
bull func factor(A) if A has 1 column update it else factor(left half of A)
update right half of A
factor(right half of A)
bull Words = O(n3M12)
bull Messages = O(n3M)
bull func factor(A) if A has 1 column update it else factor(left half of A) reshape to recursive block format update right half of A reshape to columnwise format factor(right half of A)
bull Words = O(n3M12)
bull Messages = O(n3M32)
Other CA algorithms for Ax=b least squares (33)bull Need for pivoting arises beyond LU in QR
ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)
ndash Usual approach like Partial Pivoting
bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound
ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two
groups of b columns either using usual approach or something better (GuEisenstat)
bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix
ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 50
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
What about sparse matrices (13)
bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then
bull But canrsquot reorder outer loop for 25D need another idea
bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring
bull Kleenersquos Algorithm
52
for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)
D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21
Performance of 25D APSP using Kleene
53
Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)
62xspeedup
2x speedup
What about sparse matrices (23)
bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of
G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix
bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering
for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid
bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse
multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on
separators) 54
What about sparse matrices (33)
bull If matrix stays very sparse lower bound unattainable new one
bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of
data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely
bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices
along dimensions most likely to minimize cost
55
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
Symmetric Eigenproblem and SVD
bull Usual approach for A=AT (SVD similar)ndash A QTAQ = T where Q orthogonal T tridiagonalndash T UTTU = Λ where U orthogonal Λ diagonalndash QUrsquos columns are eigenvectors Λ eigenvaluesndash Dense Tridiagonal Diagonalndash Only half BLAS3 half BLAS2 in LAPACKrsquos sytrd
bull Communication-Avoiding Approachndash A QAQT = B where B=BT banded of bandwidth M12
ndash Continue as above starting with Bndash Dense Banded Tridiagonal Diagonalndash Dense Banded use TSQR to zero out M12 colsrows at a timendash Banded Tridiagonal need new(ish) idea
b+1
b+1
Successive Band Reduction (BischofLangSun)
1
b+1
b+1
d+1
c
Successive Band Reduction (BischofLangSun)
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1Q1
b+1
b+1
d+1
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
12
Q1
b+1
b+1
d+1
d+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
12
Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
d+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
Q1
Q1T
Q2
Q2T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
Q1
Q1T
Q2
Q2T
Q3
Q3T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
6
6
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
Conventional vs CA - SBR
Conventional Communication-Avoiding
Touch all data 4 times Touch all data once
One-sided Factorizations (LU QR) so farbull Classical Approach for i=1 to n update column i update trailing matrixbull words_moved = O(n3)
35
bull Blocked Approach (LAPACK) for i=1 to nb update block i of b columns update trailing matrixbull words moved = O(n3M13)
bull Recursive Approach func factor(A) if A has 1 column update it
else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)
bull None of these approaches minimizes messagesbull Parallel case Partial
Pivoting =gt n reductionsbull Need another idea
TSQR An Architecture-Dependent Algorithm
W =
W0
W1
W2
W3
R00
R10
R20
R30
R01
R11
R02Parallel
W =
W0
W1
W2
W3
R01R02
R00
R03
SequentialStreaming
W =
W0
W1
W2
W3
R00
R01
R01
R11
R02
R11
R03
Dual Core
Can choose reduction tree dynamically
Multicore Multisocket Multirack Multisite Out-of-core
Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo
Wnxb =
W1
W2
W3
W4
P1middotL1middotU1
P2middotL2middotU2
P3middotL3middotU3
P4middotL4middotU4
=
Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo
W1rsquoW2rsquoW3rsquoW4rsquo
P12middotL12middotU12
P34middotL34middotU34
=Choose b pivot rows call them W12rsquo
Choose b pivot rows call them W34rsquo
W12rsquoW34rsquo
= P1234middotL1234middotU1234
Choose b pivot rows
Go back to W and use these b pivot rows (move them to top do LU without pivoting)
37
Minimizing Communication in TSLU
W = W1
W2
W3
W4
LULULULU
LU
LULUParallel
W = W1
W2
W3
W4
LULU
LU
LUSequentialStreaming
W = W1
W2
W3
W4
LULU LU
LULU
LULU
Dual Core
Can choose reduction tree dynamically to match architecture as before
38
Making TSLU Numerically Stable
bull Details matterndash Going up the tree we could do LU either on original rows of A
(tournament pivoting) or computed rows of Undash Only tournament pivoting stable
bull ldquoThmrdquo New scheme as stable as Partial Pivoting (GEPP) in following sense Get same Schur complements as GEPP applied to different input matrix whose entries are blocks taken from input A
bull Why just a ldquoThmrdquo
39
Stability of LU using TSLU CALU
Summer School Lecture 4 40
bull Empirical testingndash Both random matrices and ldquospecial onesrdquondash Both binary tree (BCALU) and flat-tree (FCALU)ndash 3 metrics ||PA-LU||||A|| normwise and componentwise backward errorsndash See [D Grigori Xiang 2010] for details
Why is stability of TSLU just a ldquoThmrdquo
bull Proof is correct ndash in exact arithmeticbull Experiment
ndash Generate 100 random 6x6 rank 3 matrices in Matlabndash [LUP] = lu(A) do LU without pivoting on PA compare L factors are
they the samebull Compute || L ndash Lnp || A few 0rsquos A few infinrsquos a few NaNsbull Rest mostly O(1)
ndash Why Floating point is nonassociative doing arithmetic in different order gives different rounding errors
ndash Same experiment with rank 6 matrices || L ndash Lnp || usually nonzero O(macheps)
ndash Same experiment with 20x20 rank 4 matrices || L ndash Lnp || often O(103)
bull Much harder to break TSLU but possiblendash Occurred when using TSLU to factorize a low-rank subdiagonal
panel in symmetric-indefinite factorization41
Fixing TSLU
bull Run TSLU quickly test for stability fix if necessary (rare)
bull Test conditioning of U if not tiny (usual case) proceed elsebull Compute || L || if not big (usual case) proceed elsebull Factor A = QR using TSQR thenbull Factor Q = PLU using TSLU thenbull A = PL(UR) with UR as upper triangular factor
bull Last topic in lecture how to guarantee floating point reproducibility
42
2D CALU with Tournament Pivoting
43
25D CALU with Tournament Pivoting (c=4 copies)
44
Exascale Machine ParametersSource DOE Exascale Workshop
bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node
Exascale predicted speedupsfor Gaussian Elimination
2D CA-LU vs ScaLAPACK-LU
log2 (p)
log
2 (
n2p
) =
log
2 (m
emo
ry_p
er_p
roc)
Up to 29x
25D vs 2D LUWith and Without Pivoting
Other CA algorithms for Ax=b least squares(13)
bull A symmetric and indefinitendash Seek factorization that retains symmetry PAPT = LDLT D
ldquosimplerdquobull Save frac12 flops preserve inertia
ndash Usual approach Bunch-Kaufmanbull D block diagonal with 1x1 and 2x2 blocksbull Pivot search down column along row (lots of communication)
ndash Alternative Aasenbull D = tridiagonal = Tbull Two steps
ndash PAPT = LTLT where T is banded using TSLU
48
0 0
0
0 0
0
0
hellip
hellip
ndash Solvefactor narrow band problem with Tbull Up to 28x faster than MKL Best Paper at IPDPSrsquo13
Other CA algorithms for Ax=b least squares (23)bull Minimizing bandwidth and latency for sequential GEPP
ndash So far could not do partial pivoting and minimize messages just words
ndash Challengebull Column layout good for choosing pivots bad for matmulbull Blocked layout good for matmul bad for choosing pivots
ndash Solution use both layouts switching between thembull ldquoShape Morphing LUrdquo or SMLU
49
bull func factor(A) if A has 1 column update it else factor(left half of A)
update right half of A
factor(right half of A)
bull Words = O(n3M12)
bull Messages = O(n3M)
bull func factor(A) if A has 1 column update it else factor(left half of A) reshape to recursive block format update right half of A reshape to columnwise format factor(right half of A)
bull Words = O(n3M12)
bull Messages = O(n3M32)
Other CA algorithms for Ax=b least squares (33)bull Need for pivoting arises beyond LU in QR
ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)
ndash Usual approach like Partial Pivoting
bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound
ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two
groups of b columns either using usual approach or something better (GuEisenstat)
bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix
ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 50
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
What about sparse matrices (13)
bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then
bull But canrsquot reorder outer loop for 25D need another idea
bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring
bull Kleenersquos Algorithm
52
for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)
D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21
Performance of 25D APSP using Kleene
53
Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)
62xspeedup
2x speedup
What about sparse matrices (23)
bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of
G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix
bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering
for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid
bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse
multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on
separators) 54
What about sparse matrices (33)
bull If matrix stays very sparse lower bound unattainable new one
bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of
data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely
bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices
along dimensions most likely to minimize cost
55
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
Symmetric Eigenproblem and SVD
bull Usual approach for A=AT (SVD similar)ndash A QTAQ = T where Q orthogonal T tridiagonalndash T UTTU = Λ where U orthogonal Λ diagonalndash QUrsquos columns are eigenvectors Λ eigenvaluesndash Dense Tridiagonal Diagonalndash Only half BLAS3 half BLAS2 in LAPACKrsquos sytrd
bull Communication-Avoiding Approachndash A QAQT = B where B=BT banded of bandwidth M12
ndash Continue as above starting with Bndash Dense Banded Tridiagonal Diagonalndash Dense Banded use TSQR to zero out M12 colsrows at a timendash Banded Tridiagonal need new(ish) idea
b+1
b+1
Successive Band Reduction (BischofLangSun)
1
b+1
b+1
d+1
c
Successive Band Reduction (BischofLangSun)
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1Q1
b+1
b+1
d+1
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
12
Q1
b+1
b+1
d+1
d+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
12
Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
d+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
Q1
Q1T
Q2
Q2T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
Q1
Q1T
Q2
Q2T
Q3
Q3T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
6
6
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
Conventional vs CA - SBR
Conventional Communication-Avoiding
Touch all data 4 times Touch all data once
TSQR An Architecture-Dependent Algorithm
W =
W0
W1
W2
W3
R00
R10
R20
R30
R01
R11
R02Parallel
W =
W0
W1
W2
W3
R01R02
R00
R03
SequentialStreaming
W =
W0
W1
W2
W3
R00
R01
R01
R11
R02
R11
R03
Dual Core
Can choose reduction tree dynamically
Multicore Multisocket Multirack Multisite Out-of-core
Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo
Wnxb =
W1
W2
W3
W4
P1middotL1middotU1
P2middotL2middotU2
P3middotL3middotU3
P4middotL4middotU4
=
Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo
W1rsquoW2rsquoW3rsquoW4rsquo
P12middotL12middotU12
P34middotL34middotU34
=Choose b pivot rows call them W12rsquo
Choose b pivot rows call them W34rsquo
W12rsquoW34rsquo
= P1234middotL1234middotU1234
Choose b pivot rows
Go back to W and use these b pivot rows (move them to top do LU without pivoting)
37
Minimizing Communication in TSLU
W = W1
W2
W3
W4
LULULULU
LU
LULUParallel
W = W1
W2
W3
W4
LULU
LU
LUSequentialStreaming
W = W1
W2
W3
W4
LULU LU
LULU
LULU
Dual Core
Can choose reduction tree dynamically to match architecture as before
38
Making TSLU Numerically Stable
bull Details matterndash Going up the tree we could do LU either on original rows of A
(tournament pivoting) or computed rows of Undash Only tournament pivoting stable
bull ldquoThmrdquo New scheme as stable as Partial Pivoting (GEPP) in following sense Get same Schur complements as GEPP applied to different input matrix whose entries are blocks taken from input A
bull Why just a ldquoThmrdquo
39
Stability of LU using TSLU CALU
Summer School Lecture 4 40
bull Empirical testingndash Both random matrices and ldquospecial onesrdquondash Both binary tree (BCALU) and flat-tree (FCALU)ndash 3 metrics ||PA-LU||||A|| normwise and componentwise backward errorsndash See [D Grigori Xiang 2010] for details
Why is stability of TSLU just a ldquoThmrdquo
bull Proof is correct ndash in exact arithmeticbull Experiment
ndash Generate 100 random 6x6 rank 3 matrices in Matlabndash [LUP] = lu(A) do LU without pivoting on PA compare L factors are
they the samebull Compute || L ndash Lnp || A few 0rsquos A few infinrsquos a few NaNsbull Rest mostly O(1)
ndash Why Floating point is nonassociative doing arithmetic in different order gives different rounding errors
ndash Same experiment with rank 6 matrices || L ndash Lnp || usually nonzero O(macheps)
ndash Same experiment with 20x20 rank 4 matrices || L ndash Lnp || often O(103)
bull Much harder to break TSLU but possiblendash Occurred when using TSLU to factorize a low-rank subdiagonal
panel in symmetric-indefinite factorization41
Fixing TSLU
bull Run TSLU quickly test for stability fix if necessary (rare)
bull Test conditioning of U if not tiny (usual case) proceed elsebull Compute || L || if not big (usual case) proceed elsebull Factor A = QR using TSQR thenbull Factor Q = PLU using TSLU thenbull A = PL(UR) with UR as upper triangular factor
bull Last topic in lecture how to guarantee floating point reproducibility
42
2D CALU with Tournament Pivoting
43
25D CALU with Tournament Pivoting (c=4 copies)
44
Exascale Machine ParametersSource DOE Exascale Workshop
bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node
Exascale predicted speedupsfor Gaussian Elimination
2D CA-LU vs ScaLAPACK-LU
log2 (p)
log
2 (
n2p
) =
log
2 (m
emo
ry_p
er_p
roc)
Up to 29x
25D vs 2D LUWith and Without Pivoting
Other CA algorithms for Ax=b least squares(13)
bull A symmetric and indefinitendash Seek factorization that retains symmetry PAPT = LDLT D
ldquosimplerdquobull Save frac12 flops preserve inertia
ndash Usual approach Bunch-Kaufmanbull D block diagonal with 1x1 and 2x2 blocksbull Pivot search down column along row (lots of communication)
ndash Alternative Aasenbull D = tridiagonal = Tbull Two steps
ndash PAPT = LTLT where T is banded using TSLU
48
0 0
0
0 0
0
0
hellip
hellip
ndash Solvefactor narrow band problem with Tbull Up to 28x faster than MKL Best Paper at IPDPSrsquo13
Other CA algorithms for Ax=b least squares (23)bull Minimizing bandwidth and latency for sequential GEPP
ndash So far could not do partial pivoting and minimize messages just words
ndash Challengebull Column layout good for choosing pivots bad for matmulbull Blocked layout good for matmul bad for choosing pivots
ndash Solution use both layouts switching between thembull ldquoShape Morphing LUrdquo or SMLU
49
bull func factor(A) if A has 1 column update it else factor(left half of A)
update right half of A
factor(right half of A)
bull Words = O(n3M12)
bull Messages = O(n3M)
bull func factor(A) if A has 1 column update it else factor(left half of A) reshape to recursive block format update right half of A reshape to columnwise format factor(right half of A)
bull Words = O(n3M12)
bull Messages = O(n3M32)
Other CA algorithms for Ax=b least squares (33)bull Need for pivoting arises beyond LU in QR
ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)
ndash Usual approach like Partial Pivoting
bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound
ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two
groups of b columns either using usual approach or something better (GuEisenstat)
bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix
ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 50
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
What about sparse matrices (13)
bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then
bull But canrsquot reorder outer loop for 25D need another idea
bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring
bull Kleenersquos Algorithm
52
for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)
D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21
Performance of 25D APSP using Kleene
53
Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)
62xspeedup
2x speedup
What about sparse matrices (23)
bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of
G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix
bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering
for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid
bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse
multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on
separators) 54
What about sparse matrices (33)
bull If matrix stays very sparse lower bound unattainable new one
bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of
data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely
bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices
along dimensions most likely to minimize cost
55
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
Symmetric Eigenproblem and SVD
bull Usual approach for A=AT (SVD similar)ndash A QTAQ = T where Q orthogonal T tridiagonalndash T UTTU = Λ where U orthogonal Λ diagonalndash QUrsquos columns are eigenvectors Λ eigenvaluesndash Dense Tridiagonal Diagonalndash Only half BLAS3 half BLAS2 in LAPACKrsquos sytrd
bull Communication-Avoiding Approachndash A QAQT = B where B=BT banded of bandwidth M12
ndash Continue as above starting with Bndash Dense Banded Tridiagonal Diagonalndash Dense Banded use TSQR to zero out M12 colsrows at a timendash Banded Tridiagonal need new(ish) idea
b+1
b+1
Successive Band Reduction (BischofLangSun)
1
b+1
b+1
d+1
c
Successive Band Reduction (BischofLangSun)
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1Q1
b+1
b+1
d+1
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
12
Q1
b+1
b+1
d+1
d+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
12
Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
d+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
Q1
Q1T
Q2
Q2T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
Q1
Q1T
Q2
Q2T
Q3
Q3T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
6
6
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
Conventional vs CA - SBR
Conventional Communication-Avoiding
Touch all data 4 times Touch all data once
Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo
Wnxb =
W1
W2
W3
W4
P1middotL1middotU1
P2middotL2middotU2
P3middotL3middotU3
P4middotL4middotU4
=
Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo
W1rsquoW2rsquoW3rsquoW4rsquo
P12middotL12middotU12
P34middotL34middotU34
=Choose b pivot rows call them W12rsquo
Choose b pivot rows call them W34rsquo
W12rsquoW34rsquo
= P1234middotL1234middotU1234
Choose b pivot rows
Go back to W and use these b pivot rows (move them to top do LU without pivoting)
37
Minimizing Communication in TSLU
W = W1
W2
W3
W4
LULULULU
LU
LULUParallel
W = W1
W2
W3
W4
LULU
LU
LUSequentialStreaming
W = W1
W2
W3
W4
LULU LU
LULU
LULU
Dual Core
Can choose reduction tree dynamically to match architecture as before
38
Making TSLU Numerically Stable
bull Details matterndash Going up the tree we could do LU either on original rows of A
(tournament pivoting) or computed rows of Undash Only tournament pivoting stable
bull ldquoThmrdquo New scheme as stable as Partial Pivoting (GEPP) in following sense Get same Schur complements as GEPP applied to different input matrix whose entries are blocks taken from input A
bull Why just a ldquoThmrdquo
39
Stability of LU using TSLU CALU
Summer School Lecture 4 40
bull Empirical testingndash Both random matrices and ldquospecial onesrdquondash Both binary tree (BCALU) and flat-tree (FCALU)ndash 3 metrics ||PA-LU||||A|| normwise and componentwise backward errorsndash See [D Grigori Xiang 2010] for details
Why is stability of TSLU just a ldquoThmrdquo
bull Proof is correct ndash in exact arithmeticbull Experiment
ndash Generate 100 random 6x6 rank 3 matrices in Matlabndash [LUP] = lu(A) do LU without pivoting on PA compare L factors are
they the samebull Compute || L ndash Lnp || A few 0rsquos A few infinrsquos a few NaNsbull Rest mostly O(1)
ndash Why Floating point is nonassociative doing arithmetic in different order gives different rounding errors
ndash Same experiment with rank 6 matrices || L ndash Lnp || usually nonzero O(macheps)
ndash Same experiment with 20x20 rank 4 matrices || L ndash Lnp || often O(103)
bull Much harder to break TSLU but possiblendash Occurred when using TSLU to factorize a low-rank subdiagonal
panel in symmetric-indefinite factorization41
Fixing TSLU
bull Run TSLU quickly test for stability fix if necessary (rare)
bull Test conditioning of U if not tiny (usual case) proceed elsebull Compute || L || if not big (usual case) proceed elsebull Factor A = QR using TSQR thenbull Factor Q = PLU using TSLU thenbull A = PL(UR) with UR as upper triangular factor
bull Last topic in lecture how to guarantee floating point reproducibility
42
2D CALU with Tournament Pivoting
43
25D CALU with Tournament Pivoting (c=4 copies)
44
Exascale Machine ParametersSource DOE Exascale Workshop
bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node
Exascale predicted speedupsfor Gaussian Elimination
2D CA-LU vs ScaLAPACK-LU
log2 (p)
log
2 (
n2p
) =
log
2 (m
emo
ry_p
er_p
roc)
Up to 29x
25D vs 2D LUWith and Without Pivoting
Other CA algorithms for Ax=b least squares(13)
bull A symmetric and indefinitendash Seek factorization that retains symmetry PAPT = LDLT D
ldquosimplerdquobull Save frac12 flops preserve inertia
ndash Usual approach Bunch-Kaufmanbull D block diagonal with 1x1 and 2x2 blocksbull Pivot search down column along row (lots of communication)
ndash Alternative Aasenbull D = tridiagonal = Tbull Two steps
ndash PAPT = LTLT where T is banded using TSLU
48
0 0
0
0 0
0
0
hellip
hellip
ndash Solvefactor narrow band problem with Tbull Up to 28x faster than MKL Best Paper at IPDPSrsquo13
Other CA algorithms for Ax=b least squares (23)bull Minimizing bandwidth and latency for sequential GEPP
ndash So far could not do partial pivoting and minimize messages just words
ndash Challengebull Column layout good for choosing pivots bad for matmulbull Blocked layout good for matmul bad for choosing pivots
ndash Solution use both layouts switching between thembull ldquoShape Morphing LUrdquo or SMLU
49
bull func factor(A) if A has 1 column update it else factor(left half of A)
update right half of A
factor(right half of A)
bull Words = O(n3M12)
bull Messages = O(n3M)
bull func factor(A) if A has 1 column update it else factor(left half of A) reshape to recursive block format update right half of A reshape to columnwise format factor(right half of A)
bull Words = O(n3M12)
bull Messages = O(n3M32)
Other CA algorithms for Ax=b least squares (33)bull Need for pivoting arises beyond LU in QR
ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)
ndash Usual approach like Partial Pivoting
bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound
ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two
groups of b columns either using usual approach or something better (GuEisenstat)
bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix
ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 50
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
What about sparse matrices (13)
bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then
bull But canrsquot reorder outer loop for 25D need another idea
bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring
bull Kleenersquos Algorithm
52
for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)
D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21
Performance of 25D APSP using Kleene
53
Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)
62xspeedup
2x speedup
What about sparse matrices (23)
bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of
G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix
bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering
for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid
bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse
multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on
separators) 54
What about sparse matrices (33)
bull If matrix stays very sparse lower bound unattainable new one
bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of
data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely
bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices
along dimensions most likely to minimize cost
55
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
Symmetric Eigenproblem and SVD
bull Usual approach for A=AT (SVD similar)ndash A QTAQ = T where Q orthogonal T tridiagonalndash T UTTU = Λ where U orthogonal Λ diagonalndash QUrsquos columns are eigenvectors Λ eigenvaluesndash Dense Tridiagonal Diagonalndash Only half BLAS3 half BLAS2 in LAPACKrsquos sytrd
bull Communication-Avoiding Approachndash A QAQT = B where B=BT banded of bandwidth M12
ndash Continue as above starting with Bndash Dense Banded Tridiagonal Diagonalndash Dense Banded use TSQR to zero out M12 colsrows at a timendash Banded Tridiagonal need new(ish) idea
b+1
b+1
Successive Band Reduction (BischofLangSun)
1
b+1
b+1
d+1
c
Successive Band Reduction (BischofLangSun)
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1Q1
b+1
b+1
d+1
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
12
Q1
b+1
b+1
d+1
d+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
12
Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
d+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
Q1
Q1T
Q2
Q2T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
Q1
Q1T
Q2
Q2T
Q3
Q3T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
6
6
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
Conventional vs CA - SBR
Conventional Communication-Avoiding
Touch all data 4 times Touch all data once
Minimizing Communication in TSLU
W = W1
W2
W3
W4
LULULULU
LU
LULUParallel
W = W1
W2
W3
W4
LULU
LU
LUSequentialStreaming
W = W1
W2
W3
W4
LULU LU
LULU
LULU
Dual Core
Can choose reduction tree dynamically to match architecture as before
38
Making TSLU Numerically Stable
• Details matter
  – Going up the tree, we could do LU either on the original rows of A (tournament pivoting) or on the computed rows of U
  – Only tournament pivoting is stable
• "Thm": the new scheme is as stable as Partial Pivoting (GEPP) in the following sense: it produces the same Schur complements as GEPP applied to a different input matrix whose entries are blocks taken from the input A
• Why just a "Thm"?
39
Stability of LU using TSLU: CALU
• Empirical testing
  – Both random matrices and "special ones"
  – Both binary tree (BCALU) and flat tree (FCALU)
  – 3 metrics: ||PA−LU||/||A||, normwise and componentwise backward errors
  – See [D., Grigori, Xiang 2010] for details
40
Why is stability of TSLU just a "Thm"?
• The proof is correct – in exact arithmetic
• Experiment:
  – Generate 100 random 6x6, rank-3 matrices in Matlab
  – [L,U,P] = lu(A); do LU without pivoting on P·A; compare L factors: are they the same?
  – Compute ||L − Lnp||: a few 0's, a few ∞'s, a few NaNs; the rest mostly O(1)
  – Why? Floating point is nonassociative; doing arithmetic in a different order gives different rounding errors
  – Same experiment with rank-6 matrices: ||L − Lnp|| usually nonzero, O(macheps)
  – Same experiment with 20x20, rank-4 matrices: ||L − Lnp|| often O(10³)
• Much harder to break TSLU, but possible
  – Occurred when using TSLU to factorize a low-rank subdiagonal panel in a symmetric-indefinite factorization
41
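A rough Python analogue of that Matlab experiment (the unpivoted LU below is a textbook loop written for this sketch, not a library routine), showing how rank deficiency makes the unpivoted L factor of P·A differ wildly from GEPP's L:

import numpy as np
from scipy.linalg import lu

def lu_nopivot(A):
    # Textbook LU without pivoting; divides by ~0 pivots on rank-deficient input
    A = A.astype(float).copy()
    n = A.shape[0]
    L = np.eye(n)
    for k in range(n - 1):
        L[k+1:, k] = A[k+1:, k] / A[k, k]
        A[k+1:, k:] -= np.outer(L[k+1:, k], A[k, k:])
    return L, np.triu(A)

rng = np.random.default_rng(0)
diffs = []
with np.errstate(divide='ignore', invalid='ignore'):
    for _ in range(100):
        A = rng.standard_normal((6, 3)) @ rng.standard_normal((3, 6))  # rank 3
        P, L, U = lu(A)                   # GEPP: A = P @ L @ U
        Lnp, Unp = lu_nopivot(P.T @ A)    # LU of the already-pivoted matrix
        diffs.append(np.linalg.norm(L - Lnp))
print(np.nanmin(diffs), np.nanmax(diffs))  # anything from ~0 up to inf (or NaN)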
Fixing TSLU
• Run TSLU quickly, test for stability, fix if necessary (rare):
  – Test the conditioning of U: if not tiny (usual case), proceed; else
  – Compute ||L||: if not big (usual case), proceed; else
  – Factor A = QR using TSQR, then
  – Factor Q = PLU using TSLU, then
  – A = P·L·(U·R), with U·R as the upper triangular factor
• Last topic in lecture: how to guarantee floating-point reproducibility
42
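A schematic of that fallback in Python; the thresholds and the use of dense lu/qr as stand-ins for TSLU/TSQR are illustrative assumptions for this sketch, not the actual implementation.

import numpy as np
from scipy.linalg import lu, qr

def robust_panel_lu(A, cond_tol=1e12, lnorm_tol=1e3):
    # Sketch of the TSLU fallback; thresholds are illustrative
    P, L, U = lu(A)                                   # stand-in for TSLU
    ok_U = np.linalg.cond(U) < cond_tol               # U not too ill-conditioned
    ok_L = np.linalg.norm(L, np.inf) < lnorm_tol      # L not too big
    if ok_U and ok_L:
        return P, L, U                                # usual case: keep TSLU result
    # Rare case: A = QR (stand-in for TSQR), then Q = P L Uq, so A = P L (Uq R)
    Q, R = qr(A, mode='economic')
    P, L, Uq = lu(Q)
    return P, L, Uq @ R

A = np.random.default_rng(2).standard_normal((500, 8))
P, L, U = robust_panel_lu(A)
print(np.allclose(P @ L @ U, A))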
2D CALU with Tournament Pivoting
43
2.5D CALU with Tournament Pivoting (c = 4 copies)
44
Exascale Machine Parameters (source: DOE Exascale Workshop)
• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node
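For a rough feel of what these numbers imply, here is a tiny sketch of the communication part of the cost model (words/bandwidth + messages·latency) using only the interconnect figures above; the message counts in the example are made up.

# Communication time per node: words / bandwidth + messages * latency
bandwidth = 100e9          # 100 GB/sec interconnect bandwidth, in bytes/sec
latency = 1e-6             # 1 microsec interconnect latency, in seconds

def comm_time(words, messages, bytes_per_word=8):
    # Time to move `words` doubles in `messages` messages (no overlap assumed)
    return words * bytes_per_word / bandwidth + messages * latency

# Moving 10^9 doubles: packed into 10^3 large messages vs 10^6 small ones
print(comm_time(1e9, 1e3))   # ~0.081 s, bandwidth-dominated
print(comm_time(1e9, 1e6))   # ~1.08 s, latency adds a full second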
Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU
[Heat map of modeled speedup over log2(p) and log2(n²/p) = log2(memory_per_proc)]
Up to 29x
2.5D vs 2D LU, With and Without Pivoting
Other CA algorithms for Ax=b, least squares (1/3)
• A symmetric and indefinite
  – Seek a factorization that retains symmetry, PAPᵀ = LDLᵀ, with D "simple"
    • Saves half the flops, preserves inertia
  – Usual approach: Bunch-Kaufman
    • D block diagonal with 1x1 and 2x2 blocks
    • Pivot search down a column and along a row (lots of communication)
  – Alternative: Aasen
    • D = tridiagonal = T
    • Two steps:
      – PAPᵀ = LTLᵀ, where T is banded, using TSLU  [figure: the banded T]
      – Solve/factor the narrow-band problem with T
    • Up to 2.8x faster than MKL; Best Paper at IPDPS'13
48
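The "usual approach" is what SciPy exposes as scipy.linalg.ldl (Bunch-Kaufman via LAPACK sytrf); a quick sketch checking that it keeps symmetry and inertia:

import numpy as np
from scipy.linalg import ldl

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 8))
A = A + A.T                       # symmetric, typically indefinite

L, D, perm = ldl(A)               # Bunch-Kaufman: D has 1x1 and 2x2 blocks
print(np.allclose(L @ D @ L.T, A))

# Inertia (counts of +/- eigenvalues) of D matches that of A (Sylvester's law)
sa = np.sign(np.linalg.eigvalsh(A))
sd = np.sign(np.linalg.eigvalsh(D))
print(np.sort(sa).tolist() == np.sort(sd).tolist())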
Other CA algorithms for Ax=b, least squares (2/3)
• Minimizing bandwidth and latency for sequential GEPP
  – So far, could not do partial pivoting and minimize #messages, just #words
  – Challenge:
    • Column layout good for choosing pivots, bad for matmul
    • Blocked layout good for matmul, bad for choosing pivots
  – Solution: use both layouts, switching between them
    • "Shape Morphing LU", or SMLU
• Recursive GEPP, columnwise layout only:
    func factor(A):
      if A has 1 column, update it
      else
        factor(left half of A)
        update right half of A
        factor(right half of A)
  Words = O(n³/M^(1/2)), Messages = O(n³/M)
• SMLU, switching layouts:
    func factor(A):
      if A has 1 column, update it
      else
        factor(left half of A)
        reshape to recursive block format
        update right half of A
        reshape to columnwise format
        factor(right half of A)
  Words = O(n³/M^(1/2)), Messages = O(n³/M^(3/2))
49
Other CA algorithms for Ax=b, least squares (3/3)
• Need for pivoting arises beyond LU, e.g. in QR
  – Choose a permutation P so that the leading columns of AP = QR span the column space of A – Rank-Revealing QR (RRQR)
  – Usual approach, like Partial Pivoting:
    • Put the longest column first, update the rest of the matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit the lower bound
  – Use Tournament Pivoting:
    • Each round of the tournament selects the best b columns from two groups of b columns, either using the usual approach or something better (Gu/Eisenstat)
    • Thm: this approach "reveals the rank" of A, in the sense that the leading r×r submatrix of R has singular values "near" the largest r singular values of A; ditto for the trailing submatrix
  – Idea extends to other pivoting schemes:
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDLᵀ with complete pivoting
50
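Column-pivoted QR (the "usual approach") is available in SciPy and illustrates what rank-revealing pivoting is supposed to achieve; tournament pivoting itself is not in SciPy, so this is only a sketch of the goal.

import numpy as np
from scipy.linalg import qr

rng = np.random.default_rng(0)
# Numerically rank-3 matrix: 3 dominant singular values, the rest tiny
A = (rng.standard_normal((50, 3)) @ rng.standard_normal((3, 20))
     + 1e-10 * rng.standard_normal((50, 20)))

Q, R, piv = qr(A, pivoting=True)            # A[:, piv] = Q R, column-pivoted
print(np.abs(np.diag(R))[:6])               # sharp drop after the 3rd diagonal entry
print(np.linalg.matrix_rank(A, tol=1e-8))   # 3, the numerical rank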
Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity
What about sparse matrices? (1/3)
• If the matrix quickly becomes dense, use a dense algorithm
• Ex: All-Pairs Shortest Paths using Floyd-Warshall
• Similar to matmul: let D = A, then
    for k = 1:n
      for i = 1:n
        for j = 1:n
          D(i,j) = min(D(i,j), D(i,k) + D(k,j))
• But can't reorder the outer loop for 2.5D; need another idea
• Abbreviate D(i,j) = min(D(i,j), min_k(A(i,k) + B(k,j))) by D = A⊗B
  – Dependencies ok, 2.5D works, just a different semiring
• Kleene's Algorithm:
    D = DC-APSP(A, n)
      D = A
      Partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
      D11 = DC-APSP(D11, n/2)
      D12 = D11 ⊗ D12
      D21 = D21 ⊗ D11
      D22 = D21 ⊗ D12
      D22 = DC-APSP(D22, n/2)
      D21 = D22 ⊗ D21
      D12 = D12 ⊗ D22
      D11 = D12 ⊗ D21
52
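A small sequential Python rendering of DC-APSP over the (min, +) semiring, as a sketch (no 2.5D distribution). The ⊗-updates into D22 and D11 accumulate with an elementwise min, and the zero diagonal of a closed block makes the remaining ⊗-updates subsume the old entries; the result is checked against SciPy's Floyd-Warshall.

import numpy as np
from scipy.sparse.csgraph import floyd_warshall

def minplus(A, B):
    # Min-plus product: C[i, j] = min_k (A[i, k] + B[k, j])
    return (A[:, :, None] + B[None, :, :]).min(axis=1)

def dc_apsp(D):
    # Divide-and-conquer APSP (Kleene); D must have a zero diagonal
    n = len(D)
    if n == 1:
        return D.copy()
    m = n // 2
    D = D.copy()
    D11, D12, D21, D22 = D[:m, :m], D[:m, m:], D[m:, :m], D[m:, m:]
    D11[:] = dc_apsp(D11)
    D12[:] = minplus(D11, D12)                    # zero diagonal keeps old D12
    D21[:] = minplus(D21, D11)
    D22[:] = np.minimum(D22, minplus(D21, D12))   # accumulate with old D22
    D22[:] = dc_apsp(D22)
    D21[:] = minplus(D22, D21)
    D12[:] = minplus(D12, D22)
    D11[:] = np.minimum(D11, minplus(D12, D21))   # accumulate with old D11
    return D

rng = np.random.default_rng(0)
n = 16
G = np.where(rng.random((n, n)) < 0.3, rng.random((n, n)), np.inf)
np.fill_diagonal(G, 0.0)
print(np.allclose(dc_apsp(G), floyd_warshall(G)))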
Performance of 2.5D APSP using Kleene
[Strong-scaling plot on Hopper (Cray XE6, 1024 nodes = 24,576 cores); annotations: 6.2x speedup and 2x speedup]
53
What about sparse matrices? (2/3)
• If parts of the matrix become dense, optimize those
• Ex: Cholesky on a matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): if all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on a w x w submatrix
• Thm: Words_moved = Ω(w³/M^(1/2)), etc.
• Thm (George '73): nested dissection gives the optimal ordering for a 2D grid; 3D grids and similar matrices too
  – w = n for a 2D n x n grid; w = n² for a 3D n x n x n grid
• Sequential multifrontal Cholesky attains the bounds
• PSPACES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains the 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)
54
What about sparse matrices? (3/3)
• If the matrix stays very sparse, the lower bound is unattainable; need a new one
• Ex: A, B both diagonal: no communication in the parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), i.i.d.
• Assumption: the algorithm is sparsity-independent: the assignment of data and work to processors is independent of the sparsity pattern (but zero entries need not be communicated or operated on)
• Thm: a parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
    Words_moved = Ω(min( d·n/P^(1/2), d²·n/P ))
  – Proof exploits the fact that reuse of entries of C = A·B is unlikely
• Contrast with the general lower bound: Words_moved = Ω(d²n/(P·M^(1/2)))
• Attained by a divide-and-conquer algorithm that splits matrices along the dimensions most likely to minimize cost
55
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
Symmetric Eigenproblem and SVD
bull Usual approach for A=AT (SVD similar)ndash A QTAQ = T where Q orthogonal T tridiagonalndash T UTTU = Λ where U orthogonal Λ diagonalndash QUrsquos columns are eigenvectors Λ eigenvaluesndash Dense Tridiagonal Diagonalndash Only half BLAS3 half BLAS2 in LAPACKrsquos sytrd
bull Communication-Avoiding Approachndash A QAQT = B where B=BT banded of bandwidth M12
ndash Continue as above starting with Bndash Dense Banded Tridiagonal Diagonalndash Dense Banded use TSQR to zero out M12 colsrows at a timendash Banded Tridiagonal need new(ish) idea
b+1
b+1
Successive Band Reduction (BischofLangSun)
1
b+1
b+1
d+1
c
Successive Band Reduction (BischofLangSun)
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1Q1
b+1
b+1
d+1
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
12
Q1
b+1
b+1
d+1
d+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
12
Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
d+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
Q1
Q1T
Q2
Q2T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
Q1
Q1T
Q2
Q2T
Q3
Q3T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
6
6
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
Conventional vs CA - SBR
Conventional Communication-Avoiding
Touch all data 4 times Touch all data once
Making TSLU Numerically Stable
bull Details matterndash Going up the tree we could do LU either on original rows of A
(tournament pivoting) or computed rows of Undash Only tournament pivoting stable
bull ldquoThmrdquo New scheme as stable as Partial Pivoting (GEPP) in following sense Get same Schur complements as GEPP applied to different input matrix whose entries are blocks taken from input A
bull Why just a ldquoThmrdquo
39
Stability of LU using TSLU CALU
Summer School Lecture 4 40
bull Empirical testingndash Both random matrices and ldquospecial onesrdquondash Both binary tree (BCALU) and flat-tree (FCALU)ndash 3 metrics ||PA-LU||||A|| normwise and componentwise backward errorsndash See [D Grigori Xiang 2010] for details
Why is stability of TSLU just a ldquoThmrdquo
bull Proof is correct ndash in exact arithmeticbull Experiment
ndash Generate 100 random 6x6 rank 3 matrices in Matlabndash [LUP] = lu(A) do LU without pivoting on PA compare L factors are
they the samebull Compute || L ndash Lnp || A few 0rsquos A few infinrsquos a few NaNsbull Rest mostly O(1)
ndash Why Floating point is nonassociative doing arithmetic in different order gives different rounding errors
ndash Same experiment with rank 6 matrices || L ndash Lnp || usually nonzero O(macheps)
ndash Same experiment with 20x20 rank 4 matrices || L ndash Lnp || often O(103)
bull Much harder to break TSLU but possiblendash Occurred when using TSLU to factorize a low-rank subdiagonal
panel in symmetric-indefinite factorization41
Fixing TSLU
bull Run TSLU quickly test for stability fix if necessary (rare)
bull Test conditioning of U if not tiny (usual case) proceed elsebull Compute || L || if not big (usual case) proceed elsebull Factor A = QR using TSQR thenbull Factor Q = PLU using TSLU thenbull A = PL(UR) with UR as upper triangular factor
bull Last topic in lecture how to guarantee floating point reproducibility
42
2D CALU with Tournament Pivoting
43
25D CALU with Tournament Pivoting (c=4 copies)
44
Exascale Machine ParametersSource DOE Exascale Workshop
bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node
Exascale predicted speedupsfor Gaussian Elimination
2D CA-LU vs ScaLAPACK-LU
log2 (p)
log
2 (
n2p
) =
log
2 (m
emo
ry_p
er_p
roc)
Up to 29x
25D vs 2D LUWith and Without Pivoting
Other CA algorithms for Ax=b least squares(13)
bull A symmetric and indefinitendash Seek factorization that retains symmetry PAPT = LDLT D
ldquosimplerdquobull Save frac12 flops preserve inertia
ndash Usual approach Bunch-Kaufmanbull D block diagonal with 1x1 and 2x2 blocksbull Pivot search down column along row (lots of communication)
ndash Alternative Aasenbull D = tridiagonal = Tbull Two steps
ndash PAPT = LTLT where T is banded using TSLU
48
0 0
0
0 0
0
0
hellip
hellip
ndash Solvefactor narrow band problem with Tbull Up to 28x faster than MKL Best Paper at IPDPSrsquo13
Other CA algorithms for Ax=b least squares (23)bull Minimizing bandwidth and latency for sequential GEPP
ndash So far could not do partial pivoting and minimize messages just words
ndash Challengebull Column layout good for choosing pivots bad for matmulbull Blocked layout good for matmul bad for choosing pivots
ndash Solution use both layouts switching between thembull ldquoShape Morphing LUrdquo or SMLU
49
bull func factor(A) if A has 1 column update it else factor(left half of A)
update right half of A
factor(right half of A)
bull Words = O(n3M12)
bull Messages = O(n3M)
bull func factor(A) if A has 1 column update it else factor(left half of A) reshape to recursive block format update right half of A reshape to columnwise format factor(right half of A)
bull Words = O(n3M12)
bull Messages = O(n3M32)
Other CA algorithms for Ax=b least squares (33)bull Need for pivoting arises beyond LU in QR
ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)
ndash Usual approach like Partial Pivoting
bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound
ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two
groups of b columns either using usual approach or something better (GuEisenstat)
bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix
ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 50
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
What about sparse matrices (13)
bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then
bull But canrsquot reorder outer loop for 25D need another idea
bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring
bull Kleenersquos Algorithm
52
for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)
D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21
Performance of 25D APSP using Kleene
53
Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)
62xspeedup
2x speedup
What about sparse matrices (23)
bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of
G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix
bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering
for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid
bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse
multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on
separators) 54
What about sparse matrices (33)
bull If matrix stays very sparse lower bound unattainable new one
bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of
data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely
bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices
along dimensions most likely to minimize cost
55
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
Symmetric Eigenproblem and SVD
bull Usual approach for A=AT (SVD similar)ndash A QTAQ = T where Q orthogonal T tridiagonalndash T UTTU = Λ where U orthogonal Λ diagonalndash QUrsquos columns are eigenvectors Λ eigenvaluesndash Dense Tridiagonal Diagonalndash Only half BLAS3 half BLAS2 in LAPACKrsquos sytrd
bull Communication-Avoiding Approachndash A QAQT = B where B=BT banded of bandwidth M12
ndash Continue as above starting with Bndash Dense Banded Tridiagonal Diagonalndash Dense Banded use TSQR to zero out M12 colsrows at a timendash Banded Tridiagonal need new(ish) idea
b+1
b+1
Successive Band Reduction (BischofLangSun)
1
b+1
b+1
d+1
c
Successive Band Reduction (BischofLangSun)
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1Q1
b+1
b+1
d+1
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
12
Q1
b+1
b+1
d+1
d+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
12
Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
d+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
Q1
Q1T
Q2
Q2T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
Q1
Q1T
Q2
Q2T
Q3
Q3T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
6
6
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
Conventional vs CA - SBR
Conventional Communication-Avoiding
Touch all data 4 times Touch all data once
Stability of LU using TSLU CALU
Summer School Lecture 4 40
bull Empirical testingndash Both random matrices and ldquospecial onesrdquondash Both binary tree (BCALU) and flat-tree (FCALU)ndash 3 metrics ||PA-LU||||A|| normwise and componentwise backward errorsndash See [D Grigori Xiang 2010] for details
Why is stability of TSLU just a ldquoThmrdquo
bull Proof is correct ndash in exact arithmeticbull Experiment
ndash Generate 100 random 6x6 rank 3 matrices in Matlabndash [LUP] = lu(A) do LU without pivoting on PA compare L factors are
they the samebull Compute || L ndash Lnp || A few 0rsquos A few infinrsquos a few NaNsbull Rest mostly O(1)
ndash Why Floating point is nonassociative doing arithmetic in different order gives different rounding errors
ndash Same experiment with rank 6 matrices || L ndash Lnp || usually nonzero O(macheps)
ndash Same experiment with 20x20 rank 4 matrices || L ndash Lnp || often O(103)
bull Much harder to break TSLU but possiblendash Occurred when using TSLU to factorize a low-rank subdiagonal
panel in symmetric-indefinite factorization41
Fixing TSLU
bull Run TSLU quickly test for stability fix if necessary (rare)
bull Test conditioning of U if not tiny (usual case) proceed elsebull Compute || L || if not big (usual case) proceed elsebull Factor A = QR using TSQR thenbull Factor Q = PLU using TSLU thenbull A = PL(UR) with UR as upper triangular factor
bull Last topic in lecture how to guarantee floating point reproducibility
42
2D CALU with Tournament Pivoting
43
25D CALU with Tournament Pivoting (c=4 copies)
44
Exascale Machine ParametersSource DOE Exascale Workshop
bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node
Exascale predicted speedupsfor Gaussian Elimination
2D CA-LU vs ScaLAPACK-LU
log2 (p)
log
2 (
n2p
) =
log
2 (m
emo
ry_p
er_p
roc)
Up to 29x
25D vs 2D LUWith and Without Pivoting
Other CA algorithms for Ax=b least squares(13)
bull A symmetric and indefinitendash Seek factorization that retains symmetry PAPT = LDLT D
ldquosimplerdquobull Save frac12 flops preserve inertia
ndash Usual approach Bunch-Kaufmanbull D block diagonal with 1x1 and 2x2 blocksbull Pivot search down column along row (lots of communication)
ndash Alternative Aasenbull D = tridiagonal = Tbull Two steps
ndash PAPT = LTLT where T is banded using TSLU
48
0 0
0
0 0
0
0
hellip
hellip
ndash Solvefactor narrow band problem with Tbull Up to 28x faster than MKL Best Paper at IPDPSrsquo13
Other CA algorithms for Ax=b least squares (23)bull Minimizing bandwidth and latency for sequential GEPP
ndash So far could not do partial pivoting and minimize messages just words
ndash Challengebull Column layout good for choosing pivots bad for matmulbull Blocked layout good for matmul bad for choosing pivots
ndash Solution use both layouts switching between thembull ldquoShape Morphing LUrdquo or SMLU
49
bull func factor(A) if A has 1 column update it else factor(left half of A)
update right half of A
factor(right half of A)
bull Words = O(n3M12)
bull Messages = O(n3M)
bull func factor(A) if A has 1 column update it else factor(left half of A) reshape to recursive block format update right half of A reshape to columnwise format factor(right half of A)
bull Words = O(n3M12)
bull Messages = O(n3M32)
Other CA algorithms for Ax=b least squares (33)bull Need for pivoting arises beyond LU in QR
ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)
ndash Usual approach like Partial Pivoting
bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound
ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two
groups of b columns either using usual approach or something better (GuEisenstat)
bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix
ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 50
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
What about sparse matrices (13)
bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then
bull But canrsquot reorder outer loop for 25D need another idea
bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring
bull Kleenersquos Algorithm
52
for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)
D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21
Performance of 25D APSP using Kleene
53
Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)
62xspeedup
2x speedup
What about sparse matrices (23)
bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of
G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix
bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering
for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid
bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse
multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on
separators) 54
What about sparse matrices (33)
bull If matrix stays very sparse lower bound unattainable new one
bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of
data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely
bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices
along dimensions most likely to minimize cost
55
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
Symmetric Eigenproblem and SVD
bull Usual approach for A=AT (SVD similar)ndash A QTAQ = T where Q orthogonal T tridiagonalndash T UTTU = Λ where U orthogonal Λ diagonalndash QUrsquos columns are eigenvectors Λ eigenvaluesndash Dense Tridiagonal Diagonalndash Only half BLAS3 half BLAS2 in LAPACKrsquos sytrd
bull Communication-Avoiding Approachndash A QAQT = B where B=BT banded of bandwidth M12
ndash Continue as above starting with Bndash Dense Banded Tridiagonal Diagonalndash Dense Banded use TSQR to zero out M12 colsrows at a timendash Banded Tridiagonal need new(ish) idea
b+1
b+1
Successive Band Reduction (BischofLangSun)
1
b+1
b+1
d+1
c
Successive Band Reduction (BischofLangSun)
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1Q1
b+1
b+1
d+1
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
12
Q1
b+1
b+1
d+1
d+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
12
Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
d+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
Q1
Q1T
Q2
Q2T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
Q1
Q1T
Q2
Q2T
Q3
Q3T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
6
6
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
Conventional vs CA - SBR
Conventional Communication-Avoiding
Touch all data 4 times Touch all data once
Why is stability of TSLU just a ldquoThmrdquo
bull Proof is correct ndash in exact arithmeticbull Experiment
ndash Generate 100 random 6x6 rank 3 matrices in Matlabndash [LUP] = lu(A) do LU without pivoting on PA compare L factors are
they the samebull Compute || L ndash Lnp || A few 0rsquos A few infinrsquos a few NaNsbull Rest mostly O(1)
ndash Why Floating point is nonassociative doing arithmetic in different order gives different rounding errors
ndash Same experiment with rank 6 matrices || L ndash Lnp || usually nonzero O(macheps)
ndash Same experiment with 20x20 rank 4 matrices || L ndash Lnp || often O(103)
bull Much harder to break TSLU but possiblendash Occurred when using TSLU to factorize a low-rank subdiagonal
panel in symmetric-indefinite factorization41
Fixing TSLU
bull Run TSLU quickly test for stability fix if necessary (rare)
bull Test conditioning of U if not tiny (usual case) proceed elsebull Compute || L || if not big (usual case) proceed elsebull Factor A = QR using TSQR thenbull Factor Q = PLU using TSLU thenbull A = PL(UR) with UR as upper triangular factor
bull Last topic in lecture how to guarantee floating point reproducibility
42
2D CALU with Tournament Pivoting
43
25D CALU with Tournament Pivoting (c=4 copies)
44
Exascale Machine ParametersSource DOE Exascale Workshop
bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node
Exascale predicted speedupsfor Gaussian Elimination
2D CA-LU vs ScaLAPACK-LU
log2 (p)
log
2 (
n2p
) =
log
2 (m
emo
ry_p
er_p
roc)
Up to 29x
25D vs 2D LUWith and Without Pivoting
Other CA algorithms for Ax=b least squares(13)
bull A symmetric and indefinitendash Seek factorization that retains symmetry PAPT = LDLT D
ldquosimplerdquobull Save frac12 flops preserve inertia
ndash Usual approach Bunch-Kaufmanbull D block diagonal with 1x1 and 2x2 blocksbull Pivot search down column along row (lots of communication)
ndash Alternative Aasenbull D = tridiagonal = Tbull Two steps
ndash PAPT = LTLT where T is banded using TSLU
48
0 0
0
0 0
0
0
hellip
hellip
ndash Solvefactor narrow band problem with Tbull Up to 28x faster than MKL Best Paper at IPDPSrsquo13
Other CA algorithms for Ax=b least squares (23)bull Minimizing bandwidth and latency for sequential GEPP
ndash So far could not do partial pivoting and minimize messages just words
ndash Challengebull Column layout good for choosing pivots bad for matmulbull Blocked layout good for matmul bad for choosing pivots
ndash Solution use both layouts switching between thembull ldquoShape Morphing LUrdquo or SMLU
49
bull func factor(A) if A has 1 column update it else factor(left half of A)
update right half of A
factor(right half of A)
bull Words = O(n3M12)
bull Messages = O(n3M)
bull func factor(A) if A has 1 column update it else factor(left half of A) reshape to recursive block format update right half of A reshape to columnwise format factor(right half of A)
bull Words = O(n3M12)
bull Messages = O(n3M32)
Other CA algorithms for Ax=b least squares (33)bull Need for pivoting arises beyond LU in QR
ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)
ndash Usual approach like Partial Pivoting
bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound
ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two
groups of b columns either using usual approach or something better (GuEisenstat)
bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix
ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 50
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
What about sparse matrices (13)
bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then
bull But canrsquot reorder outer loop for 25D need another idea
bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring
bull Kleenersquos Algorithm
52
for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)
D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21
Performance of 25D APSP using Kleene
53
Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)
62xspeedup
2x speedup
What about sparse matrices (23)
bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of
G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix
bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering
for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid
bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse
multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on
separators) 54
What about sparse matrices (33)
bull If matrix stays very sparse lower bound unattainable new one
bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of
data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely
bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices
along dimensions most likely to minimize cost
55
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
Symmetric Eigenproblem and SVD
bull Usual approach for A=AT (SVD similar)ndash A QTAQ = T where Q orthogonal T tridiagonalndash T UTTU = Λ where U orthogonal Λ diagonalndash QUrsquos columns are eigenvectors Λ eigenvaluesndash Dense Tridiagonal Diagonalndash Only half BLAS3 half BLAS2 in LAPACKrsquos sytrd
bull Communication-Avoiding Approachndash A QAQT = B where B=BT banded of bandwidth M12
ndash Continue as above starting with Bndash Dense Banded Tridiagonal Diagonalndash Dense Banded use TSQR to zero out M12 colsrows at a timendash Banded Tridiagonal need new(ish) idea
b+1
b+1
Successive Band Reduction (BischofLangSun)
1
b+1
b+1
d+1
c
Successive Band Reduction (BischofLangSun)
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1Q1
b+1
b+1
d+1
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
12
Q1
b+1
b+1
d+1
d+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
12
Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
d+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
Q1
Q1T
Q2
Q2T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
Q1
Q1T
Q2
Q2T
Q3
Q3T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
6
6
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
Conventional vs CA - SBR
Conventional Communication-Avoiding
Touch all data 4 times Touch all data once
Fixing TSLU
bull Run TSLU quickly test for stability fix if necessary (rare)
bull Test conditioning of U if not tiny (usual case) proceed elsebull Compute || L || if not big (usual case) proceed elsebull Factor A = QR using TSQR thenbull Factor Q = PLU using TSLU thenbull A = PL(UR) with UR as upper triangular factor
bull Last topic in lecture how to guarantee floating point reproducibility
42
2D CALU with Tournament Pivoting
43
25D CALU with Tournament Pivoting (c=4 copies)
44
Exascale Machine ParametersSource DOE Exascale Workshop
bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node
Exascale predicted speedupsfor Gaussian Elimination
2D CA-LU vs ScaLAPACK-LU
log2 (p)
log
2 (
n2p
) =
log
2 (m
emo
ry_p
er_p
roc)
Up to 29x
25D vs 2D LUWith and Without Pivoting
Other CA algorithms for Ax=b least squares(13)
bull A symmetric and indefinitendash Seek factorization that retains symmetry PAPT = LDLT D
ldquosimplerdquobull Save frac12 flops preserve inertia
ndash Usual approach Bunch-Kaufmanbull D block diagonal with 1x1 and 2x2 blocksbull Pivot search down column along row (lots of communication)
ndash Alternative Aasenbull D = tridiagonal = Tbull Two steps
ndash PAPT = LTLT where T is banded using TSLU
48
0 0
0
0 0
0
0
hellip
hellip
ndash Solvefactor narrow band problem with Tbull Up to 28x faster than MKL Best Paper at IPDPSrsquo13
Other CA algorithms for Ax=b least squares (23)bull Minimizing bandwidth and latency for sequential GEPP
ndash So far could not do partial pivoting and minimize messages just words
ndash Challengebull Column layout good for choosing pivots bad for matmulbull Blocked layout good for matmul bad for choosing pivots
ndash Solution use both layouts switching between thembull ldquoShape Morphing LUrdquo or SMLU
49
bull func factor(A) if A has 1 column update it else factor(left half of A)
update right half of A
factor(right half of A)
bull Words = O(n3M12)
bull Messages = O(n3M)
bull func factor(A) if A has 1 column update it else factor(left half of A) reshape to recursive block format update right half of A reshape to columnwise format factor(right half of A)
bull Words = O(n3M12)
bull Messages = O(n3M32)
Other CA algorithms for Ax=b least squares (33)bull Need for pivoting arises beyond LU in QR
ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)
ndash Usual approach like Partial Pivoting
bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound
ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two
groups of b columns either using usual approach or something better (GuEisenstat)
bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix
ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 50
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
What about sparse matrices (13)
bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then
bull But canrsquot reorder outer loop for 25D need another idea
bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring
bull Kleenersquos Algorithm
52
for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)
D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21
Performance of 25D APSP using Kleene
53
Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)
62xspeedup
2x speedup
What about sparse matrices (23)
bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of
G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix
bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering
for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid
bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse
multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on
separators) 54
What about sparse matrices (33)
bull If matrix stays very sparse lower bound unattainable new one
bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of
data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely
bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices
along dimensions most likely to minimize cost
55
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
Symmetric Eigenproblem and SVD
bull Usual approach for A=AT (SVD similar)ndash A QTAQ = T where Q orthogonal T tridiagonalndash T UTTU = Λ where U orthogonal Λ diagonalndash QUrsquos columns are eigenvectors Λ eigenvaluesndash Dense Tridiagonal Diagonalndash Only half BLAS3 half BLAS2 in LAPACKrsquos sytrd
bull Communication-Avoiding Approachndash A QAQT = B where B=BT banded of bandwidth M12
ndash Continue as above starting with Bndash Dense Banded Tridiagonal Diagonalndash Dense Banded use TSQR to zero out M12 colsrows at a timendash Banded Tridiagonal need new(ish) idea
b+1
b+1
Successive Band Reduction (BischofLangSun)
1
b+1
b+1
d+1
c
Successive Band Reduction (BischofLangSun)
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1Q1
b+1
b+1
d+1
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
12
Q1
b+1
b+1
d+1
d+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
12
Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
d+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
Q1
Q1T
Q2
Q2T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
Q1
Q1T
Q2
Q2T
Q3
Q3T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
6
6
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
Conventional vs CA - SBR
Conventional Communication-Avoiding
Touch all data 4 times Touch all data once
2D CALU with Tournament Pivoting
43
25D CALU with Tournament Pivoting (c=4 copies)
44
Exascale Machine ParametersSource DOE Exascale Workshop
bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node
Exascale predicted speedupsfor Gaussian Elimination
2D CA-LU vs ScaLAPACK-LU
log2 (p)
log
2 (
n2p
) =
log
2 (m
emo
ry_p
er_p
roc)
Up to 29x
25D vs 2D LUWith and Without Pivoting
Other CA algorithms for Ax=b least squares(13)
bull A symmetric and indefinitendash Seek factorization that retains symmetry PAPT = LDLT D
ldquosimplerdquobull Save frac12 flops preserve inertia
ndash Usual approach Bunch-Kaufmanbull D block diagonal with 1x1 and 2x2 blocksbull Pivot search down column along row (lots of communication)
ndash Alternative Aasenbull D = tridiagonal = Tbull Two steps
ndash PAPT = LTLT where T is banded using TSLU
48
0 0
0
0 0
0
0
hellip
hellip
ndash Solvefactor narrow band problem with Tbull Up to 28x faster than MKL Best Paper at IPDPSrsquo13
Other CA algorithms for Ax=b least squares (23)bull Minimizing bandwidth and latency for sequential GEPP
ndash So far could not do partial pivoting and minimize messages just words
ndash Challengebull Column layout good for choosing pivots bad for matmulbull Blocked layout good for matmul bad for choosing pivots
ndash Solution use both layouts switching between thembull ldquoShape Morphing LUrdquo or SMLU
49
bull func factor(A) if A has 1 column update it else factor(left half of A)
update right half of A
factor(right half of A)
bull Words = O(n3M12)
bull Messages = O(n3M)
bull func factor(A) if A has 1 column update it else factor(left half of A) reshape to recursive block format update right half of A reshape to columnwise format factor(right half of A)
bull Words = O(n3M12)
bull Messages = O(n3M32)
Other CA algorithms for Ax=b least squares (33)bull Need for pivoting arises beyond LU in QR
ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)
ndash Usual approach like Partial Pivoting
bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound
ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two
groups of b columns either using usual approach or something better (GuEisenstat)
bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix
ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 50
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
What about sparse matrices (13)
bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then
bull But canrsquot reorder outer loop for 25D need another idea
bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring
bull Kleenersquos Algorithm
52
for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)
D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21
Performance of 25D APSP using Kleene
53
Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)
62xspeedup
2x speedup
What about sparse matrices (23)
bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of
G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix
bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering
for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid
bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse
multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on
separators) 54
What about sparse matrices (33)
bull If matrix stays very sparse lower bound unattainable new one
bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of
data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely
bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices
along dimensions most likely to minimize cost
55
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
Symmetric Eigenproblem and SVD
bull Usual approach for A=AT (SVD similar)ndash A QTAQ = T where Q orthogonal T tridiagonalndash T UTTU = Λ where U orthogonal Λ diagonalndash QUrsquos columns are eigenvectors Λ eigenvaluesndash Dense Tridiagonal Diagonalndash Only half BLAS3 half BLAS2 in LAPACKrsquos sytrd
bull Communication-Avoiding Approachndash A QAQT = B where B=BT banded of bandwidth M12
ndash Continue as above starting with Bndash Dense Banded Tridiagonal Diagonalndash Dense Banded use TSQR to zero out M12 colsrows at a timendash Banded Tridiagonal need new(ish) idea
Successive Band Reduction (Bischof/Lang/Sun)
[Figure: sequence of animation frames showing one pass of band reduction. Legend: b = bandwidth, c = #columns, d = #diagonals, constraint c + d ≤ b. Numbered sweeps 1–6 apply orthogonal transformations Q1, …, Q5 (and their transposes Q1^T, …, Q5^T) to blocks of the band, eliminating d diagonals from c columns at a time and chasing the resulting (d+c)-sized fill down the band.]
Conventional vs CA-SBR
    Conventional: touch all data 4 times
    Communication-Avoiding: touch all data once
2.5D CALU with Tournament Pivoting (c = 4 copies)
44
Exascale Machine Parameters (source: DOE Exascale Workshop)
• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node
(a back-of-envelope with these numbers follows below)
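A back-of-envelope from the parameters above (my arithmetic, not on the slide): the latency terms are what make small messages and small memory transfers so expensive on such a machine.

interconnect_bw  = 100e9    # bytes/sec
interconnect_lat = 1e-6     # sec
dram_bw          = 400e9    # bytes/sec
dram_lat         = 50e-9    # sec
word             = 8        # bytes, double precision

print("words that could cross the network during one message latency:",
      int(interconnect_bw * interconnect_lat / word))    # 12500
print("words that could stream from DRAM during one memory latency: ",
      int(dram_bw * dram_lat / word))                    # 2500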
Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU
[Figure: predicted speedup as a function of log2(p) and log2(n^2/p) = log2(memory_per_proc); up to 29x.]
2.5D vs 2D LU, With and Without Pivoting
Other CA algorithms for Ax=b, least squares (1/3)
• A symmetric and indefinite
    – Seek a factorization that retains symmetry: P A P^T = L D L^T, with D "simple"
        • Saves half the flops, preserves inertia
    – Usual approach: Bunch-Kaufman (a SciPy sketch follows below)
        • D block diagonal with 1x1 and 2x2 blocks
        • Pivot search down a column, then along a row (lots of communication)
    – Alternative: Aasen
        • D = tridiagonal = T
        • Two steps:
            – P A P^T = L T L^T, where T is banded, using TSLU
              [Figure: the banded matrix T, zero outside the band]
            – Solve/factor the narrow-band problem with T
        • Up to 2.8x faster than MKL; Best Paper at IPDPS'13
48
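The Bunch-Kaufman factorization mentioned above is exposed in SciPy as scipy.linalg.ldl; a small usage sketch for a symmetric indefinite matrix (the communication-avoiding Aasen-with-TSLU variant is not in SciPy).

import numpy as np
from scipy.linalg import ldl

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 6))
A = A + A.T                            # symmetric, generally indefinite

L, D, perm = ldl(A, lower=True)        # Bunch-Kaufman: D has 1x1 and 2x2 blocks
print(np.allclose(L @ D @ L.T, A))     # True; L[perm, :] is lower triangular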
Other CA algorithms for Ax=b, least squares (2/3)
• Minimizing bandwidth and latency for sequential GEPP
    – So far, could not do partial pivoting and minimize #messages, just #words
    – Challenge:
        • Column layout: good for choosing pivots, bad for matmul
        • Blocked layout: good for matmul, bad for choosing pivots
    – Solution: use both layouts, switching between them
        • "Shape Morphing LU", or SMLU
49
• Columnwise layout only:
    func factor(A)
        if A has 1 column, update it
        else
            factor(left half of A)
            update right half of A
            factor(right half of A)
    Words = O(n^3 / M^(1/2))
    Messages = O(n^3 / M)
• Shape Morphing LU:
    func factor(A)
        if A has 1 column, update it
        else
            factor(left half of A)
            reshape to recursive block format
            update right half of A
            reshape to columnwise format
            factor(right half of A)
    Words = O(n^3 / M^(1/2))
    Messages = O(n^3 / M^(3/2))
(a NumPy sketch of the recursion follows below)
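A hedged NumPy sketch of the recursive skeleton in the first variant (no pivoting, and data layout is not modeled, since NumPy hides it; the layout is exactly what SMLU's reshape steps manipulate). It only illustrates the factor / update-right-half / factor recursion.

import numpy as np

def factor(A):
    # Recursive column-halving LU without pivoting; overwrites A with L (unit lower) and U.
    m, n = A.shape
    if n == 1:
        A[1:, 0] /= A[0, 0]                       # "if A has 1 column, update it"
        return
    k = n // 2
    factor(A[:, :k])                              # factor(left half of A)
    L11 = np.tril(A[:k, :k], -1) + np.eye(k)
    A[:k, k:] = np.linalg.solve(L11, A[:k, k:])   # update right half: U12 = inv(L11) * A12
    A[k:, k:] -= A[k:, :k] @ A[:k, k:]            #                    Schur complement
    factor(A[k:, k:])                             # factor(right half of A)

# usage: A = np.array([[4., 2.], [2., 3.]]); factor(A)   -> L and U stored in A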
Other CA algorithms for Ax=b, least squares (3/3)
• The need for pivoting arises beyond LU, e.g. in QR
    – Choose a permutation P so that the leading columns of AP = QR span the column space of A – Rank-Revealing QR (RRQR)
    – Usual approach: like partial pivoting
        • Put the longest column first, update the rest of the matrix, repeat
        • Hard to do using BLAS3 at all, let alone hit the lower bound
    – Use Tournament Pivoting
        • Each round of the tournament selects the best b columns from two groups of b columns, using either the usual approach or something better (Gu/Eisenstat)
        • Thm: this approach "reveals the rank" of A in the sense that the leading r x r submatrix of R has singular values "near" the largest r singular values of A; ditto for the trailing submatrix
    – The idea extends to other pivoting schemes:
        • Cholesky with diagonal pivoting
        • LU with complete pivoting
        • LDL^T with complete pivoting
50
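The "usual approach" above (column pivoting) is what SciPy/LAPACK expose; tournament-pivoted RRQR is not in SciPy, so this sketch only shows the rank-revealing behavior of the baseline being improved upon.

import numpy as np
from scipy.linalg import qr

rng = np.random.default_rng(2)
A = rng.standard_normal((100, 20)) @ rng.standard_normal((20, 50))   # numerical rank 20
Q, R, piv = qr(A, mode='economic', pivoting=True)                    # column-pivoted QR
print(np.abs(np.diag(R)).round(6))   # |R[i,i]| is O(1) for i < 20, then ~0: reveals the rank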
Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
    – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
    – LU & QR (tournament pivoting)
    – Sparse matrices
    – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
    – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
    – Reorganizing Krylov methods – Conjugate Gradients
    – Stability challenges and approaches
    – What is a "sparse matrix"?
• Floating-point reproducibility
    – Despite nondeterminism/nonassociativity
What about sparse matrices? (1/3)
• If the matrix quickly becomes dense, use a dense algorithm
• Ex: All-Pairs Shortest Paths using Floyd-Warshall
• Similar to matmul: let D = A, then (vectorized version below)
    for k = 1:n
        for i = 1:n
            for j = 1:n
                D(i,j) = min( D(i,j), D(i,k) + D(k,j) )
• But the outer loop can't be reordered for 2.5D; need another idea
• Abbreviate D(i,j) = min( D(i,j), min_k( A(i,k) + B(k,j) ) ) by D = A ⊗ B
    – Dependencies OK, 2.5D works, just a different semiring
• Kleene's Algorithm: the divide-and-conquer DC-APSP shown above
52
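For reference, a direct (vectorized) translation of the triple loop above; an illustration only, with np.inf for missing edges and 0 on the diagonal.

import numpy as np

def floyd_warshall(D):
    # D[i,j] = weight of edge i->j, np.inf if absent, 0 on the diagonal
    D = D.copy()
    for k in range(D.shape[0]):
        D = np.minimum(D, D[:, [k]] + D[[k], :])   # D(i,j) = min(D(i,j), D(i,k) + D(k,j))
    return D

# e.g. np.allclose(floyd_warshall(W), dc_apsp(W)) for any W whose size is a power of 2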
D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21
Performance of 25D APSP using Kleene
53
Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)
62xspeedup
2x speedup
What about sparse matrices (23)
bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of
G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix
bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering
for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid
bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse
multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on
separators) 54
What about sparse matrices (33)
bull If matrix stays very sparse lower bound unattainable new one
bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of
data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely
bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices
along dimensions most likely to minimize cost
55
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
Symmetric Eigenproblem and SVD
bull Usual approach for A=AT (SVD similar)ndash A QTAQ = T where Q orthogonal T tridiagonalndash T UTTU = Λ where U orthogonal Λ diagonalndash QUrsquos columns are eigenvectors Λ eigenvaluesndash Dense Tridiagonal Diagonalndash Only half BLAS3 half BLAS2 in LAPACKrsquos sytrd
bull Communication-Avoiding Approachndash A QAQT = B where B=BT banded of bandwidth M12
ndash Continue as above starting with Bndash Dense Banded Tridiagonal Diagonalndash Dense Banded use TSQR to zero out M12 colsrows at a timendash Banded Tridiagonal need new(ish) idea
b+1
b+1
Successive Band Reduction (BischofLangSun)
1
b+1
b+1
d+1
c
Successive Band Reduction (BischofLangSun)
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1Q1
b+1
b+1
d+1
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
12
Q1
b+1
b+1
d+1
d+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
12
Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
d+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
Q1
Q1T
Q2
Q2T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
Q1
Q1T
Q2
Q2T
Q3
Q3T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
6
6
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
Conventional vs CA - SBR
Conventional Communication-Avoiding
Touch all data 4 times Touch all data once
Exascale Machine ParametersSource DOE Exascale Workshop
bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node
Exascale predicted speedupsfor Gaussian Elimination
2D CA-LU vs ScaLAPACK-LU
log2 (p)
log
2 (
n2p
) =
log
2 (m
emo
ry_p
er_p
roc)
Up to 29x
25D vs 2D LUWith and Without Pivoting
Other CA algorithms for Ax=b least squares(13)
bull A symmetric and indefinitendash Seek factorization that retains symmetry PAPT = LDLT D
ldquosimplerdquobull Save frac12 flops preserve inertia
ndash Usual approach Bunch-Kaufmanbull D block diagonal with 1x1 and 2x2 blocksbull Pivot search down column along row (lots of communication)
ndash Alternative Aasenbull D = tridiagonal = Tbull Two steps
ndash PAPT = LTLT where T is banded using TSLU
48
0 0
0
0 0
0
0
hellip
hellip
ndash Solvefactor narrow band problem with Tbull Up to 28x faster than MKL Best Paper at IPDPSrsquo13
Other CA algorithms for Ax=b least squares (23)bull Minimizing bandwidth and latency for sequential GEPP
ndash So far could not do partial pivoting and minimize messages just words
ndash Challengebull Column layout good for choosing pivots bad for matmulbull Blocked layout good for matmul bad for choosing pivots
ndash Solution use both layouts switching between thembull ldquoShape Morphing LUrdquo or SMLU
49
bull func factor(A) if A has 1 column update it else factor(left half of A)
update right half of A
factor(right half of A)
bull Words = O(n3M12)
bull Messages = O(n3M)
bull func factor(A) if A has 1 column update it else factor(left half of A) reshape to recursive block format update right half of A reshape to columnwise format factor(right half of A)
bull Words = O(n3M12)
bull Messages = O(n3M32)
Other CA algorithms for Ax=b least squares (33)bull Need for pivoting arises beyond LU in QR
ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)
ndash Usual approach like Partial Pivoting
bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound
ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two
groups of b columns either using usual approach or something better (GuEisenstat)
bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix
ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 50
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
What about sparse matrices (13)
bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then
bull But canrsquot reorder outer loop for 25D need another idea
bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring
bull Kleenersquos Algorithm
52
for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)
D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21
Performance of 25D APSP using Kleene
53
Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)
62xspeedup
2x speedup
What about sparse matrices (23)
bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of
G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix
bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering
for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid
bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse
multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on
separators) 54
What about sparse matrices (33)
bull If matrix stays very sparse lower bound unattainable new one
bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of
data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely
bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices
along dimensions most likely to minimize cost
55
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
Symmetric Eigenproblem and SVD
bull Usual approach for A=AT (SVD similar)ndash A QTAQ = T where Q orthogonal T tridiagonalndash T UTTU = Λ where U orthogonal Λ diagonalndash QUrsquos columns are eigenvectors Λ eigenvaluesndash Dense Tridiagonal Diagonalndash Only half BLAS3 half BLAS2 in LAPACKrsquos sytrd
bull Communication-Avoiding Approachndash A QAQT = B where B=BT banded of bandwidth M12
ndash Continue as above starting with Bndash Dense Banded Tridiagonal Diagonalndash Dense Banded use TSQR to zero out M12 colsrows at a timendash Banded Tridiagonal need new(ish) idea
b+1
b+1
Successive Band Reduction (BischofLangSun)
1
b+1
b+1
d+1
c
Successive Band Reduction (BischofLangSun)
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1Q1
b+1
b+1
d+1
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
12
Q1
b+1
b+1
d+1
d+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
12
Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
d+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
Q1
Q1T
Q2
Q2T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
Q1
Q1T
Q2
Q2T
Q3
Q3T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
6
6
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
Conventional vs CA - SBR
Conventional Communication-Avoiding
Touch all data 4 times Touch all data once
Exascale predicted speedupsfor Gaussian Elimination
2D CA-LU vs ScaLAPACK-LU
log2 (p)
log
2 (
n2p
) =
log
2 (m
emo
ry_p
er_p
roc)
Up to 29x
25D vs 2D LUWith and Without Pivoting
Other CA algorithms for Ax=b least squares(13)
bull A symmetric and indefinitendash Seek factorization that retains symmetry PAPT = LDLT D
ldquosimplerdquobull Save frac12 flops preserve inertia
ndash Usual approach Bunch-Kaufmanbull D block diagonal with 1x1 and 2x2 blocksbull Pivot search down column along row (lots of communication)
ndash Alternative Aasenbull D = tridiagonal = Tbull Two steps
ndash PAPT = LTLT where T is banded using TSLU
48
0 0
0
0 0
0
0
hellip
hellip
ndash Solvefactor narrow band problem with Tbull Up to 28x faster than MKL Best Paper at IPDPSrsquo13
Other CA algorithms for Ax=b least squares (23)bull Minimizing bandwidth and latency for sequential GEPP
ndash So far could not do partial pivoting and minimize messages just words
ndash Challengebull Column layout good for choosing pivots bad for matmulbull Blocked layout good for matmul bad for choosing pivots
ndash Solution use both layouts switching between thembull ldquoShape Morphing LUrdquo or SMLU
49
bull func factor(A) if A has 1 column update it else factor(left half of A)
update right half of A
factor(right half of A)
bull Words = O(n3M12)
bull Messages = O(n3M)
bull func factor(A) if A has 1 column update it else factor(left half of A) reshape to recursive block format update right half of A reshape to columnwise format factor(right half of A)
bull Words = O(n3M12)
bull Messages = O(n3M32)
Other CA algorithms for Ax=b least squares (33)bull Need for pivoting arises beyond LU in QR
ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)
ndash Usual approach like Partial Pivoting
bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound
ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two
groups of b columns either using usual approach or something better (GuEisenstat)
bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix
ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 50
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
What about sparse matrices (13)
bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then
bull But canrsquot reorder outer loop for 25D need another idea
bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring
bull Kleenersquos Algorithm
52
for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)
D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21
Performance of 25D APSP using Kleene
53
Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)
62xspeedup
2x speedup
What about sparse matrices (23)
bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of
G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix
bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering
for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid
bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse
multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on
separators) 54
What about sparse matrices (33)
bull If matrix stays very sparse lower bound unattainable new one
bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of
data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely
bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices
along dimensions most likely to minimize cost
55
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
Symmetric Eigenproblem and SVD
bull Usual approach for A=AT (SVD similar)ndash A QTAQ = T where Q orthogonal T tridiagonalndash T UTTU = Λ where U orthogonal Λ diagonalndash QUrsquos columns are eigenvectors Λ eigenvaluesndash Dense Tridiagonal Diagonalndash Only half BLAS3 half BLAS2 in LAPACKrsquos sytrd
bull Communication-Avoiding Approachndash A QAQT = B where B=BT banded of bandwidth M12
ndash Continue as above starting with Bndash Dense Banded Tridiagonal Diagonalndash Dense Banded use TSQR to zero out M12 colsrows at a timendash Banded Tridiagonal need new(ish) idea
b+1
b+1
Successive Band Reduction (BischofLangSun)
1
b+1
b+1
d+1
c
Successive Band Reduction (BischofLangSun)
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1Q1
b+1
b+1
d+1
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
12
Q1
b+1
b+1
d+1
d+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
12
Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
d+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
Q1
Q1T
Q2
Q2T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
Q1
Q1T
Q2
Q2T
Q3
Q3T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
6
6
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
Conventional vs CA - SBR
Conventional Communication-Avoiding
Touch all data 4 times Touch all data once
25D vs 2D LUWith and Without Pivoting
Other CA algorithms for Ax=b least squares(13)
bull A symmetric and indefinitendash Seek factorization that retains symmetry PAPT = LDLT D
ldquosimplerdquobull Save frac12 flops preserve inertia
ndash Usual approach Bunch-Kaufmanbull D block diagonal with 1x1 and 2x2 blocksbull Pivot search down column along row (lots of communication)
ndash Alternative Aasenbull D = tridiagonal = Tbull Two steps
ndash PAPT = LTLT where T is banded using TSLU
48
0 0
0
0 0
0
0
hellip
hellip
ndash Solvefactor narrow band problem with Tbull Up to 28x faster than MKL Best Paper at IPDPSrsquo13
Other CA algorithms for Ax=b least squares (23)bull Minimizing bandwidth and latency for sequential GEPP
ndash So far could not do partial pivoting and minimize messages just words
ndash Challengebull Column layout good for choosing pivots bad for matmulbull Blocked layout good for matmul bad for choosing pivots
ndash Solution use both layouts switching between thembull ldquoShape Morphing LUrdquo or SMLU
49
bull func factor(A) if A has 1 column update it else factor(left half of A)
update right half of A
factor(right half of A)
bull Words = O(n3M12)
bull Messages = O(n3M)
bull func factor(A) if A has 1 column update it else factor(left half of A) reshape to recursive block format update right half of A reshape to columnwise format factor(right half of A)
bull Words = O(n3M12)
bull Messages = O(n3M32)
Other CA algorithms for Ax=b least squares (33)bull Need for pivoting arises beyond LU in QR
ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)
ndash Usual approach like Partial Pivoting
bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound
ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two
groups of b columns either using usual approach or something better (GuEisenstat)
bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix
ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 50
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
What about sparse matrices (13)
bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then
bull But canrsquot reorder outer loop for 25D need another idea
bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring
bull Kleenersquos Algorithm
52
for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)
D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21
Performance of 25D APSP using Kleene
53
Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)
62xspeedup
2x speedup
What about sparse matrices (23)
bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of
G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix
bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering
for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid
bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse
multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on
separators) 54
What about sparse matrices (33)
bull If matrix stays very sparse lower bound unattainable new one
bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of
data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely
bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices
along dimensions most likely to minimize cost
55
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
Symmetric Eigenproblem and SVD
bull Usual approach for A=AT (SVD similar)ndash A QTAQ = T where Q orthogonal T tridiagonalndash T UTTU = Λ where U orthogonal Λ diagonalndash QUrsquos columns are eigenvectors Λ eigenvaluesndash Dense Tridiagonal Diagonalndash Only half BLAS3 half BLAS2 in LAPACKrsquos sytrd
bull Communication-Avoiding Approachndash A QAQT = B where B=BT banded of bandwidth M12
ndash Continue as above starting with Bndash Dense Banded Tridiagonal Diagonalndash Dense Banded use TSQR to zero out M12 colsrows at a timendash Banded Tridiagonal need new(ish) idea
b+1
b+1
Successive Band Reduction (BischofLangSun)
1
b+1
b+1
d+1
c
Successive Band Reduction (BischofLangSun)
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1Q1
b+1
b+1
d+1
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
12
Q1
b+1
b+1
d+1
d+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
12
Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
d+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
Q1
Q1T
Q2
Q2T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
Q1
Q1T
Q2
Q2T
Q3
Q3T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
6
6
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
Conventional vs CA - SBR
Conventional Communication-Avoiding
Touch all data 4 times Touch all data once
Other CA algorithms for Ax=b least squares(13)
bull A symmetric and indefinitendash Seek factorization that retains symmetry PAPT = LDLT D
ldquosimplerdquobull Save frac12 flops preserve inertia
ndash Usual approach Bunch-Kaufmanbull D block diagonal with 1x1 and 2x2 blocksbull Pivot search down column along row (lots of communication)
ndash Alternative Aasenbull D = tridiagonal = Tbull Two steps
ndash PAPT = LTLT where T is banded using TSLU
48
0 0
0
0 0
0
0
hellip
hellip
ndash Solvefactor narrow band problem with Tbull Up to 28x faster than MKL Best Paper at IPDPSrsquo13
Other CA algorithms for Ax=b least squares (23)bull Minimizing bandwidth and latency for sequential GEPP
ndash So far could not do partial pivoting and minimize messages just words
ndash Challengebull Column layout good for choosing pivots bad for matmulbull Blocked layout good for matmul bad for choosing pivots
ndash Solution use both layouts switching between thembull ldquoShape Morphing LUrdquo or SMLU
49
bull func factor(A) if A has 1 column update it else factor(left half of A)
update right half of A
factor(right half of A)
bull Words = O(n3M12)
bull Messages = O(n3M)
bull func factor(A) if A has 1 column update it else factor(left half of A) reshape to recursive block format update right half of A reshape to columnwise format factor(right half of A)
bull Words = O(n3M12)
bull Messages = O(n3M32)
Other CA algorithms for Ax=b least squares (33)bull Need for pivoting arises beyond LU in QR
ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)
ndash Usual approach like Partial Pivoting
bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound
ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two
groups of b columns either using usual approach or something better (GuEisenstat)
bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix
ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 50
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
What about sparse matrices (13)
bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then
bull But canrsquot reorder outer loop for 25D need another idea
bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring
bull Kleenersquos Algorithm
52
for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)
D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21
Performance of 25D APSP using Kleene
53
Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)
62xspeedup
2x speedup
What about sparse matrices (23)
bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of
G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix
bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering
for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid
bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse
multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on
separators) 54
What about sparse matrices (33)
bull If matrix stays very sparse lower bound unattainable new one
bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of
data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely
bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices
along dimensions most likely to minimize cost
55
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
Symmetric Eigenproblem and SVD
bull Usual approach for A=AT (SVD similar)ndash A QTAQ = T where Q orthogonal T tridiagonalndash T UTTU = Λ where U orthogonal Λ diagonalndash QUrsquos columns are eigenvectors Λ eigenvaluesndash Dense Tridiagonal Diagonalndash Only half BLAS3 half BLAS2 in LAPACKrsquos sytrd
bull Communication-Avoiding Approachndash A QAQT = B where B=BT banded of bandwidth M12
ndash Continue as above starting with Bndash Dense Banded Tridiagonal Diagonalndash Dense Banded use TSQR to zero out M12 colsrows at a timendash Banded Tridiagonal need new(ish) idea
b+1
b+1
Successive Band Reduction (BischofLangSun)
1
b+1
b+1
d+1
c
Successive Band Reduction (BischofLangSun)
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1Q1
b+1
b+1
d+1
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
12
Q1
b+1
b+1
d+1
d+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
12
Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
d+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
Q1
Q1T
Q2
Q2T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
Q1
Q1T
Q2
Q2T
Q3
Q3T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
6
6
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
Conventional vs CA - SBR
Conventional Communication-Avoiding
Touch all data 4 times Touch all data once
Other CA algorithms for Ax=b least squares (23)bull Minimizing bandwidth and latency for sequential GEPP
ndash So far could not do partial pivoting and minimize messages just words
ndash Challengebull Column layout good for choosing pivots bad for matmulbull Blocked layout good for matmul bad for choosing pivots
ndash Solution use both layouts switching between thembull ldquoShape Morphing LUrdquo or SMLU
49
bull func factor(A) if A has 1 column update it else factor(left half of A)
update right half of A
factor(right half of A)
bull Words = O(n3M12)
bull Messages = O(n3M)
bull func factor(A) if A has 1 column update it else factor(left half of A) reshape to recursive block format update right half of A reshape to columnwise format factor(right half of A)
bull Words = O(n3M12)
bull Messages = O(n3M32)
Other CA algorithms for Ax=b least squares (33)bull Need for pivoting arises beyond LU in QR
ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)
ndash Usual approach like Partial Pivoting
bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound
ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two
groups of b columns either using usual approach or something better (GuEisenstat)
bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix
ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 50
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
What about sparse matrices (13)
bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then
bull But canrsquot reorder outer loop for 25D need another idea
bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring
bull Kleenersquos Algorithm
52
for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)
D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21
Performance of 25D APSP using Kleene
53
Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)
62xspeedup
2x speedup
What about sparse matrices (23)
bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of
G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix
bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering
for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid
bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse
multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on
separators) 54
What about sparse matrices (33)
bull If matrix stays very sparse lower bound unattainable new one
bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of
data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely
bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices
along dimensions most likely to minimize cost
55
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
Symmetric Eigenproblem and SVD
bull Usual approach for A=AT (SVD similar)ndash A QTAQ = T where Q orthogonal T tridiagonalndash T UTTU = Λ where U orthogonal Λ diagonalndash QUrsquos columns are eigenvectors Λ eigenvaluesndash Dense Tridiagonal Diagonalndash Only half BLAS3 half BLAS2 in LAPACKrsquos sytrd
bull Communication-Avoiding Approachndash A QAQT = B where B=BT banded of bandwidth M12
ndash Continue as above starting with Bndash Dense Banded Tridiagonal Diagonalndash Dense Banded use TSQR to zero out M12 colsrows at a timendash Banded Tridiagonal need new(ish) idea
b+1
b+1
Successive Band Reduction (BischofLangSun)
1
b+1
b+1
d+1
c
Successive Band Reduction (BischofLangSun)
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1Q1
b+1
b+1
d+1
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
12
Q1
b+1
b+1
d+1
d+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
12
Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
d+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
Q1
Q1T
Q2
Q2T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
Q1
Q1T
Q2
Q2T
Q3
Q3T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
6
6
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
Conventional vs CA - SBR
Conventional Communication-Avoiding
Touch all data 4 times Touch all data once
Other CA algorithms for Ax=b least squares (33)bull Need for pivoting arises beyond LU in QR
ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)
ndash Usual approach like Partial Pivoting
bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound
ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two
groups of b columns either using usual approach or something better (GuEisenstat)
bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix
ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 50
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
What about sparse matrices (13)
bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then
bull But canrsquot reorder outer loop for 25D need another idea
bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring
bull Kleenersquos Algorithm
52
for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)
D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21
Performance of 25D APSP using Kleene
53
Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)
62xspeedup
2x speedup
What about sparse matrices (23)
bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of
G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix
bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering
for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid
bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse
multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on
separators) 54
What about sparse matrices (33)
bull If matrix stays very sparse lower bound unattainable new one
bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of
data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely
bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices
along dimensions most likely to minimize cost
55
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
Symmetric Eigenproblem and SVD
bull Usual approach for A=AT (SVD similar)ndash A QTAQ = T where Q orthogonal T tridiagonalndash T UTTU = Λ where U orthogonal Λ diagonalndash QUrsquos columns are eigenvectors Λ eigenvaluesndash Dense Tridiagonal Diagonalndash Only half BLAS3 half BLAS2 in LAPACKrsquos sytrd
bull Communication-Avoiding Approachndash A QAQT = B where B=BT banded of bandwidth M12
ndash Continue as above starting with Bndash Dense Banded Tridiagonal Diagonalndash Dense Banded use TSQR to zero out M12 colsrows at a timendash Banded Tridiagonal need new(ish) idea
b+1
b+1
Successive Band Reduction (BischofLangSun)
1
b+1
b+1
d+1
c
Successive Band Reduction (BischofLangSun)
b = bandwidthc = columnsd = diagonalsConstraint c+d b
1Q1
b+1
b+1
d+1
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
12
Q1
b+1
b+1
d+1
d+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
12
Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2Q1
Q1T
b+1
b+1
d+1
d+1
cd+c
d+c
d+c
d+c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
Q1
Q1T
Q2
Q2T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
Q1
Q1T
Q2
Q2T
Q3
Q3T
b+1
b+1
d+1
d+1
d+c
d+c
d+c
d+c
c
c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
1
1
2
2
3
3
4
4
5
5
6
6
Q5T
Q1
Q1T
Q2
Q2T
Q3
Q3T
Q5
Q4
Q4T
b+1
b+1
d+1
d+1
c
c
d+c
d+c
d+c
d+c
b = bandwidthc = columnsd = diagonalsConstraint c+d b
Successive Band Reduction (BischofLangSun)
Conventional vs CA - SBR
Conventional Communication-Avoiding
Touch all data 4 times Touch all data once
Speedups of Sym. Band Reduction vs DSBTRD
• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n = 12000, b = 500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n = 12000, b = 200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n = 9000, b = 500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n = 12000, b = 500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x
Nonsymmetric Eigenproblem
• No apparent way to modify the standard algorithm
• Instead: Spectral Divide-and-Conquer
  – Find an orthogonal matrix Q whose leading columns span an invariant subspace of A
  – Q^T A Q will be block upper triangular:
        [ A11  A12 ]
        [  ε   A22 ]
  – Apply recursively to A11, A22
  – Depends on randomization:
    1. Randomized Rank-Revealing QR decomposition
    2. Randomized location to try splitting the spectrum
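An illustrative sketch (NumPy/SciPy; it uses the Schur form rather than the slide's randomized RRQR) of the structural fact being exploited: if the leading k columns of an orthogonal/unitary Q span an invariant subspace of A, then Q^T A Q is block upper triangular, so one can recurse on the diagonal blocks.

```python
# Structural fact behind spectral divide-and-conquer (illustration via the Schur
# form, NOT the randomized RRQR algorithm on the slide).
import numpy as np
from scipy.linalg import schur

n, k = 8, 3
A = np.random.randn(n, n)

T, Z = schur(A, output='complex')    # A = Z T Z^H, T upper triangular
B = Z.conj().T @ A @ Z               # leading k columns of Z span an invariant subspace

print(np.linalg.norm(B[k:, :k]))     # ~1e-14: the block below the leading k x k block vanishes
```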
Attaining the Lower Bounds: Sequential
Legend: [Existing], [Ours], [Math-Lib], [Random]
Columns: Two Levels (Words | Messages), Memory Hierarchy (Words | Messages)
BLAS-3: [FLPR'99][BDLST'13][MKL etc.] | [FLPR'99][BDLST'13][MKL etc.]
Cholesky: [G'97][AP'00][LAPACK][BDHS'09] | [G'97][AP'00][BDHS'09] | [G'97][AP'00][BDHS'09]
Sym. Indefinite: [BBDDDPSTY'13] | [BBDDDPSTY'13]
LU: [G'97][T'97][GDX'11][BDLST'13] | [GDX'11][BDLST'13] | [G'97][T'97][BDLST'13] | [BDLST'13]
QR: [EG'98][FW'03][DGHL'12][BDLST'13] | [FW'03][DGHL'12][BDLST'13] | [EG'98][FW'03][BDLST'13] | [FW'03][BDLST'13]
Rank-Revealing QR: [BDD'11][DGGX'13]
Sym. Eig & SVD: [BDD'11][BDK'13] | [BDD'11]
Non-Sym. Eig: [BDD'11] | [BDD'11]
Attaining the Lower Bounds: Parallel 2D, M = n²/P
(Ignoring poly-log(P) factors; words = n²/P^(1/2), messages = P^(1/2))
Legend: [Existing], [Ours], [Math-Lib], [Random]
Columns: Words (BW) | Messages (L) | Saving factor
BLAS-3: [AGZ'94][MT'99][ScaLAPACK] | [C'69][vGW'97][SD'11] | L: n/P^(1/2)
Cholesky: [ScaLAPACK] | [T'99][SD'11] | L: n/P^(1/2)
Sym. Indefinite: [BBDDDPSTY'13][ScaLAPACK] | [BBDDDPSTY'13] | L: n/P^(1/2)
LU: [ScaLAPACK][GDX'11][T'99][SD'11] | [GDX'11][T'99][SD'11] | L: n/P^(1/2)
QR: [ScaLAPACK][DGHL'12][T'99] | [DGHL'12][T'99] | L: n/P^(1/2)
Rank-Revealing QR: [BDD'11][DGGX'13]
Sym. Eig & SVD: [BDD'11][BDK'13][ScaLAPACK] | [BDD'11][BDK'13] | L: n/P^(1/2)
Non-Sym. Eig: [BDD'11] | [BDD'11] | BW: P^(1/2), L: n
Also attainable with extra memory: 2.5D, M = c·n²/P
Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity
Avoiding Communication in Iterative Linear Algebra
• k steps of an iterative solver for sparse Ax = b or Ax = λx
  – Does k SpMVs with A and the starting vector
  – Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors
    • Conventional: O(k log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability
75
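A naive sketch (Python/SciPy, not from the slides) of what the communication-avoiding matrix powers kernel computes: the Krylov basis [x, Ax, …, A^k x]. The CA version produces the same vectors while reading A once (serial) or with one round of ghost-zone exchange (parallel); this loop does not attempt that.

```python
# What the matrix powers kernel computes: the Krylov basis [x, Ax, ..., A^k x].
# This naive loop sweeps over A k times; a CA kernel would read A only once.
import numpy as np
import scipy.sparse as sp

def krylov_basis(A, x, k):
    V = np.zeros((x.size, k + 1))
    V[:, 0] = x
    for j in range(k):
        V[:, j + 1] = A @ V[:, j]
    return V

A = sp.random(1000, 1000, density=1e-3, format='csr') + sp.eye(1000)
x = np.random.randn(1000)
print(krylov_basis(A, x, 8).shape)    # (1000, 9): columns x, Ax, ..., A^8 x
```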
Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity
Example: The Difficulty of Tuning SpMV
• n = 21200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
77
Example: The Difficulty of Tuning
• n = 21200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs
78
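A sketch of the register-blocking idea using SciPy's Block CSR format (toy matrix and sizes, not the raefsky matrix itself): each stored block becomes a small dense multiply, cutting index traffic per flop.

```python
# Register blocking with SciPy's Block CSR (toy sizes): one column index per r x r
# block instead of per nonzero, so SpMV does small dense multiplies with less
# index traffic. The slide's raefsky matrix has natural 8x8 blocks.
import numpy as np
import scipy.sparse as sp

n, r = 2048, 8
blocks = sp.random(n // r, n // r, density=0.01, format='csr')
A_csr = sp.kron(blocks, np.ones((r, r))).tocsr()   # every nonzero becomes a dense 8x8 block
A_bsr = A_csr.tobsr(blocksize=(r, r))              # blocked storage

x = np.random.randn(n)
print(np.allclose(A_csr @ x, A_bsr @ x))           # same SpMV result
print(A_csr.indices.size // A_bsr.indices.size)    # 64x fewer stored column indices
```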
Speedups on Itanium 2: The Need for Search
[Figure: performance of the reference implementation vs the best 4x2 register blocking, in Mflops]
79
Register Profile: Itanium 2
[Figure: heat map of SpMV performance over register block sizes, from 190 Mflops to 1190 Mflops]
80
Register Profiles: IBM and Intel IA-64
[Figure: four heat maps, Power3 (17), Power4 (16), Itanium 1 (8), Itanium 2 (33); annotated rates 252, 122, 820, 459, 247, 107, 190 Mflops and 1.2 Gflops]
Another example of tuning challenges for SpMV
• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16614
• NNZ = 1.1 M
82
Zoom in to top corner
• More complicated non-zero structure in general
• N = 16614
• NNZ = 1.1 M
83
3x3 blocks look natural, but…
• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But it would lead to lots of "fill-in"
84
Extra Work Can Improve Efficiency
• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup!
  – Actual Mflop rate is 1.5² = 2.25x higher
85
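A small sketch (SciPy, with a random test matrix rather than the slide's FEM matrix) of how one could measure the fill ratio for a candidate r × c blocking; blocking wins when the blocked kernel's speed advantage exceeds this factor.

```python
# Fill ratio for an r x c blocking: stored entries (including the explicit zeros
# that pad partially full blocks) divided by true nonzeros. Blocking pays off when
# the blocked kernel's Mflop-rate gain exceeds this factor (1.5 on the slide).
import scipy.sparse as sp

def fill_ratio(A_csr, r, c):
    A_bsr = A_csr.tobsr(blocksize=(r, c))
    stored = A_bsr.indices.size * r * c      # each stored block holds r*c entries
    return stored / A_csr.nnz

A = sp.random(3000, 3000, density=5e-4, format='csr')   # random matrix, not the slide's
for r, c in [(1, 1), (2, 2), (3, 3)]:
    print((r, c), round(fill_ratio(A, r, c), 2))
```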
Source: Accelerator Cavity Design Problem (Ko via Husbands)
[Figure: sparsity plot of the matrix]
86
100x100 Submatrix Along Diagonal
[Figure: zoomed sparsity plot]
87
Post-RCM Reordering
[Figure: sparsity plot after reverse Cuthill-McKee reordering]
88
Effect of Combined RCM+TSP Reordering
[Figure: before = green + red, after = green + blue]
89
2x speedups on Pentium 4, Power 4, …
Summary of Other Performance Optimizations
• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·Aᵀ·x, Aᵀ·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later …
90
Optimized Sparse Kernel Interface - OSKI
• Provides sparse kernels automatically tuned for the user's matrix & machine
  – BLAS-style functionality: SpMV (A·x & Aᵀ·y), TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as a stand-alone library
  – Available as a PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski
91
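The tuning loop in miniature (plain Python/SciPy; this is not OSKI's C API): time SpMV on the user's matrix for a few candidate block sizes and keep the fastest layout, which is the spirit of OSKI's combined off-line and run-time tuning.

```python
# The autotuning idea in miniature (not OSKI's C API): time SpMV for several
# candidate block sizes on the user's matrix and keep the fastest layout.
import time
import numpy as np
import scipy.sparse as sp

def tune_spmv(A_csr, block_sizes=((1, 1), (2, 2), (4, 4), (8, 8)), trials=20):
    x = np.random.randn(A_csr.shape[1])
    best = (float('inf'), None)
    for r, c in block_sizes:
        if A_csr.shape[0] % r or A_csr.shape[1] % c:
            continue                                   # blocksize must divide the shape
        A = A_csr.tobsr(blocksize=(r, c))
        t0 = time.perf_counter()
        for _ in range(trials):
            A @ x
        best = min(best, (time.perf_counter() - t0, (r, c)))
    return best                                        # (seconds, winning block size)

A = sp.kron(sp.random(512, 512, density=0.01), np.ones((4, 4))).tocsr()
print(tune_spmv(A))    # a blocked layout often wins for this block-structured matrix
```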
Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity
93
Example: Classical Conjugate Gradient (CG)
[Algorithm figure: the SpMV and the dot products require communication in each iteration]
94
Example: CA-Conjugate Gradient
[Algorithm figure: SpMVs done via the CA matrix powers kernel; one global reduction computes the Gram matrix G; local computations within the inner loop require no communication]
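For reference, classical CG with its per-iteration communication points marked (a Python/SciPy sketch, not the CA-CG reorganization itself); CA-CG replaces s such iterations with one matrix-powers-kernel call plus one block reduction that forms the Gram matrix G.

```python
# Classical CG with its per-iteration communication points marked. CA-CG replaces s
# of these iterations by one matrix-powers-kernel call plus one block reduction
# (the Gram matrix G); that reorganization is not shown here.
import numpy as np
import scipy.sparse as sp

def cg(A, b, tol=1e-8, maxit=1000):
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rs = r @ r
    for _ in range(maxit):
        Ap = A @ p                    # SpMV: neighbor communication in parallel
        alpha = rs / (p @ Ap)         # dot product: global reduction
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r                # dot product: global reduction
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

n = 200                               # 1D Poisson test problem
A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format='csr')
b = np.ones(n)
print(np.linalg.norm(A @ cg(A, b) - b))   # small residual
```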
Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity
96
[Figure: convergence of CG vs CA-CG with the monomial basis. Model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400. CA-CG shows slower convergence and loss of accuracy due to roundoff; at s = 16 the monomial basis is rank deficient and the method breaks down. Horizontal line: machine precision.]
97
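A small experiment (Python/SciPy, assuming the same 30×30 2D Poisson model problem) showing where the breakdown comes from: the condition number of the monomial basis [p, Ap, …, A^s p] grows rapidly with s, consistent with the rank deficiency observed near s = 16.

```python
# Why the monomial basis breaks down: the Krylov basis [p, Ap, ..., A^s p]
# becomes ill-conditioned as s grows. 2D Poisson, 5-point stencil, 30x30 grid,
# as in the model problem above.
import numpy as np
import scipy.sparse as sp

m = 30
T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(m, m))
A = (sp.kron(sp.eye(m), T) + sp.kron(T, sp.eye(m))).tocsr()   # 900 x 900

p = np.random.randn(A.shape[0])
V = [p / np.linalg.norm(p)]
for s in range(1, 17):
    v = A @ V[-1]
    V.append(v / np.linalg.norm(v))             # normalized but not orthogonalized
    print(s, f"{np.linalg.cond(np.column_stack(V)):.2e}")
# the condition number grows rapidly, consistent with the rank deficiency seen at s = 16
```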
Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity
What is a "sparse matrix"?
• Requires o(n²) data/indices to store
• Nonzero entries and indices could each be explicit or implicit:
  – entries explicit (O(nnz)), indices explicit (O(nnz)): CSR and variations
  – entries explicit (O(nnz)), indices implicit (o(nnz)): vision, climate, AMR, …
  – entries implicit (o(nnz)), indices explicit (O(nnz)): graph Laplacian
  – entries implicit (o(nnz)), indices implicit (o(nnz)): stencils
• Matrix could be a sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + U·D·Vᵀ, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write Aᵏ = (S + U·D·Vᵀ)ᵏ as a sum of Sᵏ and low-rank matrices
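A sketch (NumPy/SciPy, made-up sizes) of the "sparse + low rank" case: A = S + U·D·Vᵀ is never formed; A·x is applied as S·x + U(D(Vᵀx)), and powers Aᵏx repeat the same split application.

```python
# "Sparse + low rank" without forming A: for A = S + U D V^T, apply
# A x = S x + U (D (V^T x)); powers A^k x just repeat the split application.
import numpy as np
import scipy.sparse as sp

n, r = 1000, 5
S = sp.random(n, n, density=1e-3, format='csr')
U, V = np.random.randn(n, r), np.random.randn(n, r)
D = np.diag(np.random.randn(r))

def apply_A(x):
    return S @ x + U @ (D @ (V.T @ x))   # O(nnz(S) + n*r) work, no dense n x n matrix

y = np.random.randn(n)
for _ in range(3):                        # y := A^3 * y0
    y = apply_A(y)
print(y[:3])
```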
Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity
101
Reproducible Floating Point Computation
• Get a bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers), otherwise they don't believe results
  – Willing to sacrifice 40% - 50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What? How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection
Intel MKL non-reproducibility
[Figure: absolute error for random vectors (same magnitude, opposite signs) and relative error for orthogonal vectors. Vector size 1e6, data aligned to 16-byte boundaries. For each input vector, dot products are computed using 1, 2, 3, or 4 threads; absolute error = maximum - minimum; relative error = absolute error / maximum absolute value. Even the sign is not reproducible.]
103
Goals/Approaches for Reproducibility
• Consider summation or dot product
• Goals
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches
  – Guarantee fixed reduction tree (not 2 or 3)
  – Use (very) high precision to get exact answer (not 2)
  – Prerounding technique (Nguyen, D.)
104
Performance results on 1024 processors of a Cray XC30: 1.2x to 3.2x slowdown vs the fastest code, for n = 1M
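A tiny demonstration (plain Python) of the root cause and of one "exact answer" remedy: floating-point addition is not associative, so different summation orders give different bits, while a correctly rounded sum (math.fsum here) is reproducible regardless of order.

```python
# Floating-point addition is not associative, so summation order (threads, blocking,
# data layout) changes the bits of the result; a correctly rounded sum is one way
# to make the answer reproducible regardless of order.
import math
import random

random.seed(0)
x = [random.uniform(-1, 1) * 10**random.randint(-8, 8) for _ in range(100_000)]

print(sum(x) == sum(reversed(x)))              # often False: order-dependent rounding
print(math.fsum(x) == math.fsum(reversed(x)))  # True: correctly rounded, order-independent
```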
Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu
Summary
Don't Communic…
106
Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)
Nonsymmetric Eigenproblem
bull No apparent way to modify standard algorithmbull Instead Spectral Divide-and-Conquer
ndash Find orthogonal matrix Q whose leading columns span an invariant subspace of A
ndash QTAQ will be block upper triangular
ndash Apply recursively to A11 A22
ndash Depends on randomization1 Randomized Rank Revealing QR decomposition2 Randomized location to try splitting spectrum
A11 A12
ε A22
Attaining the Lower bounds SequentialLegend[Existing][Ours][Math-Lib][Random]
Two Levels Memory Hierarchy
Words Messages Words Messages
BLAS-3 [FLPRrsquo99][BDLSTrsquo13][MKL etc] [FLPRrsquo99][BDLSTrsquo13][MKL etc]
Cholesky[Grsquo97][APrsquo00]
[LAPACK][BDHSrsquo09]
[Grsquo97][APrsquo00][BDHSrsquo09] [Grsquo97][APrsquo00][BDHSrsquo09]
Sym Indefinite [BBDDDPSTYrsquo13] [BBDDDPSTYrsquo13]
LU[Grsquo97][Trsquo97]
[GDXrsquo11][BDLSTrsquo13]
[GDXrsquo11][BDLSTrsquo13]
[Grsquo97][Trsquo97] [BDLSTrsquo13] [BDLSTrsquo13]
QR[EGrsquo98][FWrsquo03]
[DGHLrsquo12][BDLSTrsquo13]
[FWrsquo03][DGHLrsquo12][BDLSTrsquo13]
[EGrsquo98][FWrsquo03][BDLSTrsquo13]
[FWrsquo03][BDLSTrsquo13]
Rank Revealing QR [BDDrsquo11][DGGXrsquo13]
Sym Eig amp SVD [BDDrsquo11][BDKrsquo13] [BDDrsquo11]
Non Sym Eig [BDDrsquo11] [BDDrsquo11]
Legend[Existing][Ours][Math-Lib][Random]
Words (BW) Messages (L) Saving factor
BLAS-3 [AGZrsquo94][MTrsquo99][ScaLAPACK][Crsquo69][vGWrsquo97][SDrsquo11] L nP12
Cholesky [ScaLAPACK][Trsquo99][SDrsquo11] L nP12
Sym Indefinite [BBDDDPSTYrsquo13][ScaLAPACK] [BBDDDPSTYrsquo13] L nP12
LU [ScaLAPACK][GDXrsquo11][Trsquo99][SDrsquo11] [GDXrsquo11][Trsquo99][SDrsquo11] L nP12
QR [ScaLAPACK][DGHLrsquo12] [Trsquo99] [DGHLrsquo12][Trsquo99] L nP12
Rank Revealing QR [BDDrsquo11][DGGXrsquo13]
Sym Eig amp SVD [BDDrsquo11][BDKrsquo13][ScaLAPACK] [BDDrsquo11][BDKrsquo13] L nP12
Non-Sym Eig [BDDrsquo11] [BDDrsquo11] BW P12 L n
Attaining with extra memory 25D M=(cn2P)
Attaining the Lower bounds Parallel 2DM=(n2P)(Ignoring poly-log(P) factors words = ( n2 P12) messages = (P12)
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
Avoiding Communication in Iterative Linear Algebra
bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo
bull Conjugate Gradients (CG) GMRES Lanczos Arnoldi hellip bull Goal minimize communication
ndash Assume matrix ldquowell-partitionedrdquondash Serial implementation
bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal
ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal
bull Lots of speed up possible (modeled and measured)ndash Price some redundant computationndash Challenges Poor partitioning Preconditioning Num Stability
75
Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
      • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity
Example: The Difficulty of Tuning SpMV
• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
Example: The Difficulty of Tuning
• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit mem_refs
Speedups on Itanium 2: The Need for Search
[Register-profile plot, in Mflops: reference implementation vs. best blocking found by search (4x2)]
(A toy version of such a search follows below.)
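A toy version of that search, using SciPy's BSR format as a stand-in for a register-blocked kernel. OSKI's real heuristic combines an off-line machine profile with an estimated fill ratio rather than timing every block size at run time, so this is only a sketch of the "search" idea.

```python
import time
import numpy as np
import scipy.sparse as sp

A = sp.random(4096, 4096, density=0.004, format="csr", random_state=0)
x = np.random.default_rng(0).standard_normal(4096)

def mflops(M, reps=50):
    M @ x                                    # warm-up
    t0 = time.perf_counter()
    for _ in range(reps):
        M @ x
    dt = (time.perf_counter() - t0) / reps
    return 2e-6 * A.nnz / dt                 # 2 flops per true nonzero

results = {(1, 1): mflops(A)}                # unblocked CSR reference
for r in (2, 4, 8):
    for c in (2, 4, 8):
        results[(r, c)] = mflops(sp.bsr_matrix(A, blocksize=(r, c)))
best = max(results, key=results.get)
print("best block size:", best, f"{results[best]:.0f} Mflops vs {results[(1, 1)]:.0f} reference")
```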
Register Profile: Itanium 2
[Register-profile plot: performance ranges from 190 Mflops to 1190 Mflops depending on the block size chosen]
Register Profiles: IBM and Intel IA-64
[Four register-profile plots: Power3 - 17, Power4 - 16, Itanium 1 - 8, Itanium 2 - 33;
 performance spans roughly 122-252 Mflops on Power3, 459-820 Mflops on Power4,
 107-247 Mflops on Itanium 1, and 190 Mflops to 1.2 Gflops on Itanium 2]
Another example of tuning challenges for SpMV
• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16,614
• NNZ = 1.1 M
Zoom in to top corner
• More complicated non-zero structure in general
• N = 16,614
• NNZ = 1.1 M
3x3 blocks look natural, but…
• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"
Extra Work Can Improve Efficiency
• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup!
  – Actual mflop rate 1.5² = 2.25x higher
(A small fill-ratio sketch follows below.)
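A minimal SciPy sketch of the same idea, letting the library's BSR format do the explicit zero fill. The 1.5 fill ratio and 1.5x speedup quoted above are for the tuned kernels on that matrix, not for this toy example.

```python
import numpy as np
import scipy.sparse as sp

A = sp.random(300, 300, density=0.01, format="csr", random_state=0)
B = sp.bsr_matrix(A, blocksize=(3, 3))        # 3x3 blocking; zeros filled in explicitly

fill_ratio = B.nnz / A.nnz                    # stored entries (incl. explicit zeros) / true nnz
print(f"fill ratio = {fill_ratio:.2f}")

x = np.random.default_rng(0).standard_normal(300)
assert np.allclose(A @ x, B @ x)              # same result: extra flops on explicit zeros are
                                              # traded for fewer index loads per stored entry
```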
Source: Accelerator Cavity Design Problem (Ko via Husbands)
100x100 Submatrix Along Diagonal
Post-RCM Reordering
Effect of Combined RCM+TSP Reordering
[Matrix structure plots. Before: Green + Red; After: Green + Blue]
2x speedups on Pentium 4, Power 4, …
(An RCM sketch follows below.)
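A minimal SciPy sketch of the RCM step alone (the TSP-based ordering above is a separate, more expensive pass): scramble a 2D-mesh matrix and let reverse Cuthill-McKee pull the nonzeros back toward the diagonal.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import reverse_cuthill_mckee

def bandwidth(M):
    C = M.tocoo()
    return int(np.max(np.abs(C.row - C.col)))

k = 40
L1 = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(k, k))
A = (sp.kron(sp.identity(k), L1) + sp.kron(L1, sp.identity(k))).tocsr()   # 2D Poisson mesh

p = np.random.default_rng(0).permutation(k * k)
B = A[p, :][:, p].tocsr()                       # scrambled ordering: bandwidth close to n
q = reverse_cuthill_mckee(B, symmetric_mode=True)
C = B[q, :][:, q]                               # RCM ordering restores locality
print(bandwidth(B), "->", bandwidth(C))         # bandwidth drops back to roughly O(k)
```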
Summary of Other Performance Optimizations
• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·Aᵀ·x, Aᵀ·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later…
(An SpMM sketch follows below.)
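To illustrate the "multiple vectors" entry, a small SciPy timing sketch: one SpMM reads A once for all right-hand vectors, whereas k separate SpMVs read it k times (and pay per-call overhead). The 7x figure above is for the tuned kernels, not this toy.

```python
import time
import numpy as np
import scipy.sparse as sp

A = sp.random(20000, 20000, density=0.0005, format="csr", random_state=0)
X = np.random.default_rng(0).standard_normal((20000, 8))

t0 = time.perf_counter()
Y1 = np.column_stack([A @ X[:, j] for j in range(X.shape[1])])   # 8 separate SpMVs
t1 = time.perf_counter()
Y2 = np.asarray(A @ X)                                           # one SpMM
t2 = time.perf_counter()

assert np.allclose(Y1, Y2)
print(f"8 SpMVs: {t1 - t0:.4f}s   SpMM: {t2 - t1:.4f}s")
```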
Optimized Sparse Kernel Interface - OSKI
• Provides sparse kernels automatically tuned for user's matrix & machine
  – BLAS-style functionality: SpMV (Ax & Aᵀy), TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as stand-alone library
  – Available as PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski
Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
      • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity
Example: Classical Conjugate Gradient (CG)
[Algorithm listing: the SpMVs and dot products require communication in each iteration]

Example: CA-Conjugate Gradient
[Algorithm listing: the basis vectors are computed via the CA Matrix Powers Kernel, a single
 global reduction computes the Gram matrix G, and the local computations within the inner
 loop require no communication]
(A sketch of the Gram-matrix trick follows below.)
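A minimal NumPy sketch of why one reduction suffices: once the basis V is known, G = VᵀV is formed with a single (all-)reduction per outer iteration, and every dot product between vectors that live in span(V) becomes a small local computation on their coefficient vectors. Here V is just a random stand-in for the matrix powers kernel output.

```python
import numpy as np

rng = np.random.default_rng(0)
n, s = 1000, 4
V = rng.standard_normal((n, 2 * s + 1))    # stand-in for the s-step Krylov basis
G = V.T @ V                                # one global reduction per outer iteration

a = rng.standard_normal(2 * s + 1)         # coefficients of x = V a
b = rng.standard_normal(2 * s + 1)         # coefficients of y = V b
x, y = V @ a, V @ b
print(x @ y, a @ G @ b)                    # equal up to roundoff: no length-n dot product needed
```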
Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
      • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity
CA-CG (monomial basis) vs. CG: convergence for a model problem
[Convergence plot: CA-CG with the monomial basis shows slower convergence due to roundoff
 and loss of accuracy due to roundoff; at s = 16 the monomial basis is rank deficient and
 the method breaks down; CG converges to roughly machine precision]
Model problem:
• 2D Poisson, 5-point stencil
• 30x30 grid
• Cond(A) ~ 400
(A conditioning sketch follows below.)
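A small SciPy experiment in the same setting, showing how quickly the monomial basis loses numerical rank. The rank deficiency at s = 16 quoted above is from the referenced experiments; this sketch just prints the condition-number growth for a random starting vector.

```python
import numpy as np
import scipy.sparse as sp

n = 30
L1 = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n))
A = (sp.kron(sp.identity(n), L1) + sp.kron(L1, sp.identity(n))).tocsr()   # 2D Poisson, 30x30 grid

x = np.random.default_rng(1).standard_normal(n * n)
V = [x / np.linalg.norm(x)]
for s in range(1, 17):
    v = A @ V[-1]
    V.append(v / np.linalg.norm(v))               # scaling only, as in a monomial basis
    sig = np.linalg.svd(np.column_stack(V), compute_uv=False)
    print(f"s = {s:2d}   cond(V) = {sig[0] / sig[-1]:.2e}")   # the vectors become nearly parallel
```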
Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
      • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity
What is a "sparse matrix"?
• Requires o(n²) data/indices to store
• Nonzero entries and indices could be explicit or implicit:

                              Indices explicit (O(nnz))   Indices implicit (o(nnz))
  Entries explicit (O(nnz))   CSR and variations           Vision, climate, AMR, …
  Entries implicit (o(nnz))   Graph Laplacian              Stencils

• Matrix could be sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + UDVᵀ, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write Aᵏ = (S + UDVᵀ)ᵏ as sum of Sᵏ and low-rank matrices
(A sparse-plus-low-rank sketch follows below.)
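A minimal NumPy/SciPy sketch of the sparse-plus-low-rank idea: apply A = S + UDVᵀ (and hence Aᵏ, by repeated application) to a vector without ever forming the dense matrix. Sizes and names here are illustrative.

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(2)
n, r = 500, 3
S = sp.random(n, n, density=0.01, format="csr", random_state=2)   # sparse part
U = rng.standard_normal((n, r))
V = rng.standard_normal((n, r))
D = np.diag(rng.standard_normal(r))                               # small & square

def apply_A(x):
    # A x = S x + U (D (V^T x)): O(nnz(S) + n*r) work, no dense n-by-n matrix formed
    return S @ x + U @ (D @ (V.T @ x))

x = rng.standard_normal(n)
y = x
for _ in range(3):                  # y = A^3 x via repeated application
    y = apply_A(y)

A_dense = S.toarray() + U @ D @ V.T                               # check (small n only)
assert np.allclose(y, np.linalg.matrix_power(A_dense, 3) @ x)
```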
Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
      • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity
Reproducible Floating Point Computation
• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought a reproducible parallel sparse linear equation solver, demanded by customers
    (construction engineers); otherwise they don't believe results
  – Willing to sacrifice 40% - 50% of performance for it
• Email to ~110 Berkeley CSE faculty, asking about it
  – Most: "What? How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection
Intel MKL non-reproducibility
[Plots: Absolute Error for Random Vectors (same magnitude, opposite signs);
 Relative Error for Orthogonal Vectors (sign not reproducible)]
Vector size 1e6, data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum – minimum
• Relative error = Absolute error / maximum absolute value
Goals/Approaches for Reproducibility
• Consider summation or dot product
• Goals
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches
  – Guarantee fixed reduction tree (not 2. or 3.)
  – Use (very) high precision to get exact answer (not 2.)
  – Pre-rounding technique (Nguyen, D.)
(A pre-rounding sketch follows below.)
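A single-bin toy version of the pre-rounding idea, assuming only IEEE 754 doubles: every summand is rounded to a grid determined solely by max|xᵢ| and n, so each addition is exact and the result cannot depend on the order of summation. The real algorithm (Demmel & Nguyen) uses several bins to retain more accuracy and a fixed small number of reductions; `reproducible_sum` and `bits` are illustrative names.

```python
import math
import random

def reproducible_sum(xs, bits=30):
    # Pre-round each x to an integer multiple of a power-of-two "ulp" chosen from
    # max|x| alone. With n * 2**(bits+1) <= 2**53 every partial sum is exact, so the
    # floating-point result is independent of the order of the additions.
    m = max(abs(x) for x in xs)
    if m == 0.0:
        return 0.0
    ulp = 2.0 ** (math.floor(math.log2(m)) - bits)
    return sum(round(x / ulp) * ulp for x in xs)

xs = [random.uniform(-1, 1) for _ in range(100_000)]
print(reproducible_sum(xs) == reproducible_sum(list(reversed(xs))))   # True: order-independent
print(abs(reproducible_sum(xs) - math.fsum(xs)))                      # small; accuracy set by `bits`
```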
Performance results on 1024 processors of a Cray XC30:
1.2x to 3.2x slowdown vs. fastest code, for n = 1M
Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu
Summary
Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)
Don't Communic…
Attaining the Lower bounds SequentialLegend[Existing][Ours][Math-Lib][Random]
Two Levels Memory Hierarchy
Words Messages Words Messages
BLAS-3 [FLPRrsquo99][BDLSTrsquo13][MKL etc] [FLPRrsquo99][BDLSTrsquo13][MKL etc]
Cholesky[Grsquo97][APrsquo00]
[LAPACK][BDHSrsquo09]
[Grsquo97][APrsquo00][BDHSrsquo09] [Grsquo97][APrsquo00][BDHSrsquo09]
Sym Indefinite [BBDDDPSTYrsquo13] [BBDDDPSTYrsquo13]
LU[Grsquo97][Trsquo97]
[GDXrsquo11][BDLSTrsquo13]
[GDXrsquo11][BDLSTrsquo13]
[Grsquo97][Trsquo97] [BDLSTrsquo13] [BDLSTrsquo13]
QR[EGrsquo98][FWrsquo03]
[DGHLrsquo12][BDLSTrsquo13]
[FWrsquo03][DGHLrsquo12][BDLSTrsquo13]
[EGrsquo98][FWrsquo03][BDLSTrsquo13]
[FWrsquo03][BDLSTrsquo13]
Rank Revealing QR [BDDrsquo11][DGGXrsquo13]
Sym Eig amp SVD [BDDrsquo11][BDKrsquo13] [BDDrsquo11]
Non Sym Eig [BDDrsquo11] [BDDrsquo11]
Legend[Existing][Ours][Math-Lib][Random]
Words (BW) Messages (L) Saving factor
BLAS-3 [AGZrsquo94][MTrsquo99][ScaLAPACK][Crsquo69][vGWrsquo97][SDrsquo11] L nP12
Cholesky [ScaLAPACK][Trsquo99][SDrsquo11] L nP12
Sym Indefinite [BBDDDPSTYrsquo13][ScaLAPACK] [BBDDDPSTYrsquo13] L nP12
LU [ScaLAPACK][GDXrsquo11][Trsquo99][SDrsquo11] [GDXrsquo11][Trsquo99][SDrsquo11] L nP12
QR [ScaLAPACK][DGHLrsquo12] [Trsquo99] [DGHLrsquo12][Trsquo99] L nP12
Rank Revealing QR [BDDrsquo11][DGGXrsquo13]
Sym Eig amp SVD [BDDrsquo11][BDKrsquo13][ScaLAPACK] [BDDrsquo11][BDKrsquo13] L nP12
Non-Sym Eig [BDDrsquo11] [BDDrsquo11] BW P12 L n
Attaining with extra memory 25D M=(cn2P)
Attaining the Lower bounds Parallel 2DM=(n2P)(Ignoring poly-log(P) factors words = ( n2 P12) messages = (P12)
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
Avoiding Communication in Iterative Linear Algebra
bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo
bull Conjugate Gradients (CG) GMRES Lanczos Arnoldi hellip bull Goal minimize communication
ndash Assume matrix ldquowell-partitionedrdquondash Serial implementation
bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal
ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal
bull Lots of speed up possible (modeled and measured)ndash Price some redundant computationndash Challenges Poor partitioning Preconditioning Num Stability
75
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
Example The Difficulty of Tuning SpMV
bull n = 21200bull nnz = 15 M
bull Source NASA structural analysis problem (raefsky)
77
Example The Difficulty of Tuning
bull n = 21200bull nnz = 15 M
bull Source NASA structural analysis problem (raefsky)
bull 8x8 dense substructure exploit this to limit mem_refs
78
Speedups on Itanium 2 The Need for Search
Reference
Best 4x2
Mflops
Mflops
79
Register Profile Itanium 2
190 Mflops
1190 Mflops
80
Register Profiles IBM and Intel IA-64Power3 - 17 Power4 - 16
Itanium 2 - 33Itanium 1 - 8
252 Mflops
122 Mflops
820 Mflops
459 Mflops
247 Mflops
107 Mflops
12 Gflops
190 Mflops
Another example of tuning challenges for SpMV
bull Ex11 matrix (fluid flow)
bull More complicated non-zero structure in general
bull N = 16614bull NNZ = 11M
82
Zoom in to top corner
bull More complicated non-zero structure in general
bull N = 16614bull NNZ = 11M
83
3x3 blocks look natural buthellip
bull Example 3x3 blockingndash Logical grid of 3x3 cells
bull But would lead to lots of ldquofill-inrdquo
84
Extra Work Can Improve Efficiency
bull Example 3x3 blockingndash Logical grid of 3x3 cellsndash Fill-in explicit zerosndash Unroll 3x3 block multipliesndash ldquoFill ratiordquo = 15
bull On Pentium III 15x speedup
ndash Actual mflop rate 152 = 225 higher
85
Source Accelerator Cavity Design Problem (Ko via Husbands)
86
100x100 Submatrix Along Diagonal
Summer School Lecture 7
87
Post-RCM Reordering
88
Effect of Combined RCM+TSP Reordering
Before Green + RedAfter Green + Blue
Summer School Lecture 7
892x speedups on Pentium 4 Power 4 hellip
Summary of Other Performance Optimizations
bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip
bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR
bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip
90
Optimized Sparse Kernel Interface - OSKI
bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning
bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski
bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression
software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki
91
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
93
Example Classical Conjugate Gradient (CG)
SpMVs and dot products require communication in
each iteration
via CA Matrix Powers Kernel
Global reduction to compute G
94
Example CA-Conjugate Gradient
Local computations within inner loop require
no communication
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
96
Slower convergence due
to roundoff
Loss of accuracy due to roundoff
At s = 16 monomial basis is rank deficient Method breaks down
Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400
CA-CG (monomial)CG
machine precision
97
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit
bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square
bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank
matrices
Explicit (O(nnz)) Implicit (o(nnz))
Explicit (O(nnz)) CSR and variations Vision climate AMRhellip
Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries
Indices
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
101
bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010
ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver
demanded by customers (construction engineers) otherwise they donrsquot believe results
ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it
ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection
Reproducible Floating Point Computation
Absolute Error for Random Vectors
Same magnitude opposite signs
Intel MKL non-reproducibility
Relative Error for Orthogonal vectors
Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value
Sign notreproducible
103
bull Consider summation or dot productbull Goals
1 Same answer independent of layout processors order of summands
2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy
bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)
GoalsApproaches for Reproducibility
104
Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M
Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya
Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger
bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David
Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang
bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou
Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National
Instruments NEC Nokia NVIDIA Samsung Oracle
bull bebopcsberkeleyedu
Summary
Donrsquot Communichellip
106
Time to redesign all linear algebra n-body hellip algorithms and software
(and compilers)
Legend[Existing][Ours][Math-Lib][Random]
Words (BW) Messages (L) Saving factor
BLAS-3 [AGZrsquo94][MTrsquo99][ScaLAPACK][Crsquo69][vGWrsquo97][SDrsquo11] L nP12
Cholesky [ScaLAPACK][Trsquo99][SDrsquo11] L nP12
Sym Indefinite [BBDDDPSTYrsquo13][ScaLAPACK] [BBDDDPSTYrsquo13] L nP12
LU [ScaLAPACK][GDXrsquo11][Trsquo99][SDrsquo11] [GDXrsquo11][Trsquo99][SDrsquo11] L nP12
QR [ScaLAPACK][DGHLrsquo12] [Trsquo99] [DGHLrsquo12][Trsquo99] L nP12
Rank Revealing QR [BDDrsquo11][DGGXrsquo13]
Sym Eig amp SVD [BDDrsquo11][BDKrsquo13][ScaLAPACK] [BDDrsquo11][BDKrsquo13] L nP12
Non-Sym Eig [BDDrsquo11] [BDDrsquo11] BW P12 L n
Attaining with extra memory 25D M=(cn2P)
Attaining the Lower bounds Parallel 2DM=(n2P)(Ignoring poly-log(P) factors words = ( n2 P12) messages = (P12)
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
Avoiding Communication in Iterative Linear Algebra
bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo
bull Conjugate Gradients (CG) GMRES Lanczos Arnoldi hellip bull Goal minimize communication
ndash Assume matrix ldquowell-partitionedrdquondash Serial implementation
bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal
ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal
bull Lots of speed up possible (modeled and measured)ndash Price some redundant computationndash Challenges Poor partitioning Preconditioning Num Stability
75
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
Example The Difficulty of Tuning SpMV
bull n = 21200bull nnz = 15 M
bull Source NASA structural analysis problem (raefsky)
77
Example The Difficulty of Tuning
bull n = 21200bull nnz = 15 M
bull Source NASA structural analysis problem (raefsky)
bull 8x8 dense substructure exploit this to limit mem_refs
78
Speedups on Itanium 2 The Need for Search
Reference
Best 4x2
Mflops
Mflops
79
Register Profile Itanium 2
190 Mflops
1190 Mflops
80
Register Profiles IBM and Intel IA-64Power3 - 17 Power4 - 16
Itanium 2 - 33Itanium 1 - 8
252 Mflops
122 Mflops
820 Mflops
459 Mflops
247 Mflops
107 Mflops
12 Gflops
190 Mflops
Another example of tuning challenges for SpMV
bull Ex11 matrix (fluid flow)
bull More complicated non-zero structure in general
bull N = 16614bull NNZ = 11M
82
Zoom in to top corner
bull More complicated non-zero structure in general
bull N = 16614bull NNZ = 11M
83
3x3 blocks look natural buthellip
bull Example 3x3 blockingndash Logical grid of 3x3 cells
bull But would lead to lots of ldquofill-inrdquo
84
Extra Work Can Improve Efficiency
bull Example 3x3 blockingndash Logical grid of 3x3 cellsndash Fill-in explicit zerosndash Unroll 3x3 block multipliesndash ldquoFill ratiordquo = 15
bull On Pentium III 15x speedup
ndash Actual mflop rate 152 = 225 higher
85
Source Accelerator Cavity Design Problem (Ko via Husbands)
86
100x100 Submatrix Along Diagonal
Summer School Lecture 7
87
Post-RCM Reordering
88
Effect of Combined RCM+TSP Reordering
Before Green + RedAfter Green + Blue
Summer School Lecture 7
892x speedups on Pentium 4 Power 4 hellip
Summary of Other Performance Optimizations
bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip
bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR
bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip
90
Optimized Sparse Kernel Interface - OSKI
bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning
bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski
bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression
software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki
91
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
93
Example Classical Conjugate Gradient (CG)
SpMVs and dot products require communication in
each iteration
via CA Matrix Powers Kernel
Global reduction to compute G
94
Example CA-Conjugate Gradient
Local computations within inner loop require
no communication
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
96
Slower convergence due
to roundoff
Loss of accuracy due to roundoff
At s = 16 monomial basis is rank deficient Method breaks down
Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400
CA-CG (monomial)CG
machine precision
97
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit
bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square
bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank
matrices
Explicit (O(nnz)) Implicit (o(nnz))
Explicit (O(nnz)) CSR and variations Vision climate AMRhellip
Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries
Indices
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
101
bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010
ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver
demanded by customers (construction engineers) otherwise they donrsquot believe results
ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it
ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection
Reproducible Floating Point Computation
Absolute Error for Random Vectors
Same magnitude opposite signs
Intel MKL non-reproducibility
Relative Error for Orthogonal vectors
Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value
Sign notreproducible
103
bull Consider summation or dot productbull Goals
1 Same answer independent of layout processors order of summands
2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy
bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)
GoalsApproaches for Reproducibility
104
Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M
Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya
Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger
bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David
Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang
bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou
Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National
Instruments NEC Nokia NVIDIA Samsung Oracle
bull bebopcsberkeleyedu
Summary
Donrsquot Communichellip
106
Time to redesign all linear algebra n-body hellip algorithms and software
(and compilers)
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
Avoiding Communication in Iterative Linear Algebra
bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo
bull Conjugate Gradients (CG) GMRES Lanczos Arnoldi hellip bull Goal minimize communication
ndash Assume matrix ldquowell-partitionedrdquondash Serial implementation
bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal
ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal
bull Lots of speed up possible (modeled and measured)ndash Price some redundant computationndash Challenges Poor partitioning Preconditioning Num Stability
75
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
Example The Difficulty of Tuning SpMV
bull n = 21200bull nnz = 15 M
bull Source NASA structural analysis problem (raefsky)
77
Example The Difficulty of Tuning
bull n = 21200bull nnz = 15 M
bull Source NASA structural analysis problem (raefsky)
bull 8x8 dense substructure exploit this to limit mem_refs
78
Speedups on Itanium 2 The Need for Search
Reference
Best 4x2
Mflops
Mflops
79
Register Profile Itanium 2
190 Mflops
1190 Mflops
80
Register Profiles IBM and Intel IA-64Power3 - 17 Power4 - 16
Itanium 2 - 33Itanium 1 - 8
252 Mflops
122 Mflops
820 Mflops
459 Mflops
247 Mflops
107 Mflops
12 Gflops
190 Mflops
Another example of tuning challenges for SpMV
bull Ex11 matrix (fluid flow)
bull More complicated non-zero structure in general
bull N = 16614bull NNZ = 11M
82
Zoom in to top corner
bull More complicated non-zero structure in general
bull N = 16614bull NNZ = 11M
83
3x3 blocks look natural buthellip
bull Example 3x3 blockingndash Logical grid of 3x3 cells
bull But would lead to lots of ldquofill-inrdquo
84
Extra Work Can Improve Efficiency
bull Example 3x3 blockingndash Logical grid of 3x3 cellsndash Fill-in explicit zerosndash Unroll 3x3 block multipliesndash ldquoFill ratiordquo = 15
bull On Pentium III 15x speedup
ndash Actual mflop rate 152 = 225 higher
85
Source Accelerator Cavity Design Problem (Ko via Husbands)
86
100x100 Submatrix Along Diagonal
Summer School Lecture 7
87
Post-RCM Reordering
88
Effect of Combined RCM+TSP Reordering
Before Green + RedAfter Green + Blue
Summer School Lecture 7
892x speedups on Pentium 4 Power 4 hellip
Summary of Other Performance Optimizations
bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip
bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR
bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip
90
Optimized Sparse Kernel Interface - OSKI
bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning
bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski
bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression
software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki
91
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
93
Example Classical Conjugate Gradient (CG)
SpMVs and dot products require communication in
each iteration
via CA Matrix Powers Kernel
Global reduction to compute G
94
Example CA-Conjugate Gradient
Local computations within inner loop require
no communication
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
96
Slower convergence due
to roundoff
Loss of accuracy due to roundoff
At s = 16 monomial basis is rank deficient Method breaks down
Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400
CA-CG (monomial)CG
machine precision
97
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit
bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square
bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank
matrices
Explicit (O(nnz)) Implicit (o(nnz))
Explicit (O(nnz)) CSR and variations Vision climate AMRhellip
Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries
Indices
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
101
bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010
ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver
demanded by customers (construction engineers) otherwise they donrsquot believe results
ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it
ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection
Reproducible Floating Point Computation
Absolute Error for Random Vectors
Same magnitude opposite signs
Intel MKL non-reproducibility
Relative Error for Orthogonal vectors
Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value
Sign notreproducible
103
bull Consider summation or dot productbull Goals
1 Same answer independent of layout processors order of summands
2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy
bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)
GoalsApproaches for Reproducibility
104
Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M
Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya
Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger
bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David
Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang
bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou
Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National
Instruments NEC Nokia NVIDIA Samsung Oracle
bull bebopcsberkeleyedu
Summary
Donrsquot Communichellip
106
Time to redesign all linear algebra n-body hellip algorithms and software
(and compilers)
Avoiding Communication in Iterative Linear Algebra
bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo
bull Conjugate Gradients (CG) GMRES Lanczos Arnoldi hellip bull Goal minimize communication
ndash Assume matrix ldquowell-partitionedrdquondash Serial implementation
bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal
ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal
bull Lots of speed up possible (modeled and measured)ndash Price some redundant computationndash Challenges Poor partitioning Preconditioning Num Stability
75
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
Example The Difficulty of Tuning SpMV
bull n = 21200bull nnz = 15 M
bull Source NASA structural analysis problem (raefsky)
77
Example The Difficulty of Tuning
bull n = 21200bull nnz = 15 M
bull Source NASA structural analysis problem (raefsky)
bull 8x8 dense substructure exploit this to limit mem_refs
78
Speedups on Itanium 2 The Need for Search
Reference
Best 4x2
Mflops
Mflops
79
Register Profile Itanium 2
190 Mflops
1190 Mflops
80
Register Profiles IBM and Intel IA-64Power3 - 17 Power4 - 16
Itanium 2 - 33Itanium 1 - 8
252 Mflops
122 Mflops
820 Mflops
459 Mflops
247 Mflops
107 Mflops
12 Gflops
190 Mflops
Another example of tuning challenges for SpMV
bull Ex11 matrix (fluid flow)
bull More complicated non-zero structure in general
bull N = 16614bull NNZ = 11M
82
Zoom in to top corner
bull More complicated non-zero structure in general
bull N = 16614bull NNZ = 11M
83
3x3 blocks look natural buthellip
bull Example 3x3 blockingndash Logical grid of 3x3 cells
bull But would lead to lots of ldquofill-inrdquo
84
Extra Work Can Improve Efficiency
bull Example 3x3 blockingndash Logical grid of 3x3 cellsndash Fill-in explicit zerosndash Unroll 3x3 block multipliesndash ldquoFill ratiordquo = 15
bull On Pentium III 15x speedup
ndash Actual mflop rate 152 = 225 higher
85
Source Accelerator Cavity Design Problem (Ko via Husbands)
86
100x100 Submatrix Along Diagonal
Summer School Lecture 7
87
Post-RCM Reordering
88
Effect of Combined RCM+TSP Reordering
Before Green + RedAfter Green + Blue
Summer School Lecture 7
892x speedups on Pentium 4 Power 4 hellip
Summary of Other Performance Optimizations
bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip
bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR
bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip
90
Optimized Sparse Kernel Interface - OSKI
bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning
bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski
bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression
software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki
91
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
93
Example Classical Conjugate Gradient (CG)
SpMVs and dot products require communication in
each iteration
via CA Matrix Powers Kernel
Global reduction to compute G
94
Example CA-Conjugate Gradient
Local computations within inner loop require
no communication
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
96
Slower convergence due
to roundoff
Loss of accuracy due to roundoff
At s = 16 monomial basis is rank deficient Method breaks down
Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400
CA-CG (monomial)CG
machine precision
97
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit
bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square
bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank
matrices
Explicit (O(nnz)) Implicit (o(nnz))
Explicit (O(nnz)) CSR and variations Vision climate AMRhellip
Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries
Indices
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
101
bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010
ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver
demanded by customers (construction engineers) otherwise they donrsquot believe results
ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it
ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection
Reproducible Floating Point Computation
Absolute Error for Random Vectors
Same magnitude opposite signs
Intel MKL non-reproducibility
Relative Error for Orthogonal vectors
Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value
Sign notreproducible
103
bull Consider summation or dot productbull Goals
1 Same answer independent of layout processors order of summands
2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy
bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)
GoalsApproaches for Reproducibility
104
Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M
Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya
Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger
bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David
Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang
bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou
Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National
Instruments NEC Nokia NVIDIA Samsung Oracle
bull bebopcsberkeleyedu
Summary
Donrsquot Communichellip
106
Time to redesign all linear algebra n-body hellip algorithms and software
(and compilers)
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
Example The Difficulty of Tuning SpMV
bull n = 21200bull nnz = 15 M
bull Source NASA structural analysis problem (raefsky)
77
Example The Difficulty of Tuning
bull n = 21200bull nnz = 15 M
bull Source NASA structural analysis problem (raefsky)
bull 8x8 dense substructure exploit this to limit mem_refs
78
Speedups on Itanium 2 The Need for Search
Reference
Best 4x2
Mflops
Mflops
79
Register Profile Itanium 2
190 Mflops
1190 Mflops
80
Register Profiles IBM and Intel IA-64Power3 - 17 Power4 - 16
Itanium 2 - 33Itanium 1 - 8
252 Mflops
122 Mflops
820 Mflops
459 Mflops
247 Mflops
107 Mflops
12 Gflops
190 Mflops
Another example of tuning challenges for SpMV
bull Ex11 matrix (fluid flow)
bull More complicated non-zero structure in general
bull N = 16614bull NNZ = 11M
82
Zoom in to top corner
bull More complicated non-zero structure in general
bull N = 16614bull NNZ = 11M
83
3x3 blocks look natural buthellip
bull Example 3x3 blockingndash Logical grid of 3x3 cells
bull But would lead to lots of ldquofill-inrdquo
84
Extra Work Can Improve Efficiency
bull Example 3x3 blockingndash Logical grid of 3x3 cellsndash Fill-in explicit zerosndash Unroll 3x3 block multipliesndash ldquoFill ratiordquo = 15
bull On Pentium III 15x speedup
ndash Actual mflop rate 152 = 225 higher
85
Source Accelerator Cavity Design Problem (Ko via Husbands)
86
100x100 Submatrix Along Diagonal
Summer School Lecture 7
87
Post-RCM Reordering
88
Effect of Combined RCM+TSP Reordering
Before Green + RedAfter Green + Blue
Summer School Lecture 7
892x speedups on Pentium 4 Power 4 hellip
Summary of Other Performance Optimizations
bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip
bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR
bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip
90
Optimized Sparse Kernel Interface – OSKI
• Provides sparse kernels automatically tuned for user's matrix & machine
  – BLAS-style functionality: SpMV (Ax & A^T y), TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as stand-alone library
  – Available as PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski
91
Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity
93
Example: Classical Conjugate Gradient (CG)
[Algorithm figure; callout: SpMVs and dot products require communication in each iteration]
94
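For reference, a self-contained C sketch of classical CG on a 1D Poisson model problem (an assumption made only to keep the example small; the slides' model problem is 2D Poisson), with comments marking where a distributed-memory run would communicate: each SpMV implies a halo exchange with neighbors, and each dot product implies a global all-reduce.

/* Classical CG for A = tridiag(-1, 2, -1), b = ones, x0 = 0. */
#include <stdio.h>
#include <math.h>
#define N 100

static void spmv_poisson1d(const double *x, double *y) {
    /* y = A*x; in parallel this step exchanges boundary entries
       with neighboring processors (latency + bandwidth) */
    for (int i = 0; i < N; i++) {
        double l = (i > 0)   ? x[i-1] : 0.0;
        double r = (i < N-1) ? x[i+1] : 0.0;
        y[i] = 2.0*x[i] - l - r;
    }
}

static double dot(const double *u, const double *v) {
    /* in parallel this is a global all-reduce (latency-bound) */
    double s = 0.0;
    for (int i = 0; i < N; i++) s += u[i]*v[i];
    return s;
}

int main(void) {
    double x[N] = {0}, r[N], p[N], Ap[N];
    for (int i = 0; i < N; i++) r[i] = p[i] = 1.0;
    double rr = dot(r, r);
    for (int it = 0; it < 200 && sqrt(rr) > 1e-10; it++) {
        spmv_poisson1d(p, Ap);             /* 1 SpMV -> neighbor comm.   */
        double alpha = rr / dot(p, Ap);    /* 1 dot  -> global reduce    */
        for (int i = 0; i < N; i++) { x[i] += alpha*p[i]; r[i] -= alpha*Ap[i]; }
        double rr_new = dot(r, r);         /* 1 dot  -> global reduce    */
        double beta = rr_new / rr;
        for (int i = 0; i < N; i++) p[i] = r[i] + beta*p[i];
        rr = rr_new;
    }
    printf("final residual norm: %g\n", sqrt(rr));
    return 0;
}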
Example: CA-Conjugate Gradient
[Algorithm figure; callouts: via CA Matrix Powers Kernel; global reduction to compute G; local computations within inner loop require no communication]
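A minimal sequential C sketch of the basis computation behind CA-CG: the monomial basis [p, Ap, ..., A^s p] is computed up front (the CA matrix powers kernel does this with a single round of ghost-zone communication of depth s for stencil-like matrices), so the s inner-loop iterations that follow need no further SpMVs. The 1D Poisson stencil and all names are illustrative assumptions, not the slides' pseudocode.

/* Compute V[k] = A^k * p for k = 0..s, here for A = tridiag(-1,2,-1). */
#include <string.h>
#define N 100

void spmv_poisson1d(const double *x, double *y) {
    for (int i = 0; i < N; i++) {
        double l = (i > 0) ? x[i-1] : 0.0, r = (i < N-1) ? x[i+1] : 0.0;
        y[i] = 2.0*x[i] - l - r;
    }
}

/* V has s+1 rows of length N; V[0] = p, V[k] = A * V[k-1] */
void matrix_powers(const double *p, int s, double V[][N]) {
    memcpy(V[0], p, sizeof(double)*N);
    for (int k = 1; k <= s; k++)
        spmv_poisson1d(V[k-1], V[k]);
}

int main(void) {
    double p[N], V[17][N];                 /* basis for s = 16 */
    for (int i = 0; i < N; i++) p[i] = 1.0;
    matrix_powers(p, 16, V);
    return 0;
}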
Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity
96
[Convergence plot: CA-CG (monomial basis) vs. CG; the horizontal line marks machine precision]
• Model problem: 2D Poisson, 5-point stencil, 30x30 grid, Cond(A) ~ 400
• Slower convergence due to roundoff
• Loss of accuracy due to roundoff
• At s = 16, the monomial basis is rank deficient; the method breaks down
97
Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity
What is a "sparse matrix"?
• Requires o(n^2) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be a sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + UDV^T, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write A^k = (S + UDV^T)^k as a sum of S^k and low-rank matrices (a matvec sketch follows this slide)

  Indices:                  Explicit (O(nnz))      Implicit (o(nnz))
  Nonzero entries explicit: CSR and variations     Vision, climate, AMR, …
  Nonzero entries implicit: Graph Laplacian        Stencils
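To illustrate the sparse-plus-low-rank case, a C sketch of applying A = S + U D V^T to a vector without ever forming A densely: S is stored in CSR, U and V have r columns, D is r-by-r with r small. All names are illustrative assumptions. The cost is O(nnz(S) + n*r) rather than O(n^2), and this is the structure a matrix powers kernel would have to exploit to apply A^k.

#include <stddef.h>

void sparse_plus_lowrank_matvec(
    size_t n, size_t r,
    const size_t *rowptr, const size_t *colidx, const double *val, /* S (CSR) */
    const double *U, const double *D, const double *V,             /* row-major */
    const double *x, double *y, double *work /* scratch, length 2*r */)
{
    double *t = work, *Dt = work + r;

    /* t = V^T x  (r dot products of length n) */
    for (size_t j = 0; j < r; j++) {
        t[j] = 0.0;
        for (size_t i = 0; i < n; i++) t[j] += V[i*r + j] * x[i];
    }
    /* Dt = D t  (small r-by-r multiply) */
    for (size_t j = 0; j < r; j++) {
        Dt[j] = 0.0;
        for (size_t k = 0; k < r; k++) Dt[j] += D[j*r + k] * t[k];
    }
    /* y = S x + U Dt */
    for (size_t i = 0; i < n; i++) {
        double s = 0.0;
        for (size_t k = rowptr[i]; k < rowptr[i+1]; k++)
            s += val[k] * x[colidx[k]];
        for (size_t j = 0; j < r; j++) s += U[i*r + j] * Dt[j];
        y[i] = s;
    }
}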
Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity
101
Reproducible Floating Point Computation
• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe results
  – Willing to sacrifice 40%-50% of performance for it
• Email to ~110 Berkeley CSE faculty, asking about it
  – Most: "What? How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection
Intel MKL non-reproducibility
[Two plots: Absolute Error for Random Vectors (same magnitude, opposite signs); Relative Error for Orthogonal Vectors (sign not reproducible)]
• Vector size 1e6, data aligned to 16-byte boundaries. For each input vector:
  – Dot products are computed using 1, 2, 3, or 4 threads
  – Absolute error = maximum - minimum
  – Relative error = Absolute error / maximum absolute value
103
Goals / Approaches for Reproducibility
• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (fails 2. or 3.)
  – Use (very) high precision to get exact answer (fails 2.)
  – Prerounding technique (Nguyen, D.) (a single-bin sketch follows this slide)
104
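A minimal single-bin C sketch of the prerounding idea (the real Demmel-Nguyen / ReproBLAS algorithm uses several bins and is both faster and far more accurate; this only shows the mechanism). Every summand is first rounded to a multiple of a common unit derived from max|x[i]| and n; after that every addition is exact, so the result no longer depends on the order of the summands or on the reduction tree.

/* Compile without value-changing FP optimizations (no -ffast-math).
 * Accuracy of this one-bin version is limited; it trades precision for
 * exact, order-independent accumulation. */
#include <stdio.h>
#include <math.h>

double reproducible_sum(const double *x, int n) {
    double m = 0.0;
    for (int i = 0; i < n; i++)              /* max is order-independent  */
        if (fabs(x[i]) > m) m = fabs(x[i]);
    if (m == 0.0) return 0.0;

    int e;                                   /* M = power of two >= 2*n*m */
    (void)frexp(2.0 * n * m, &e);
    volatile double M = ldexp(1.0, e);       /* volatile keeps (x+M)-M    */

    double s = 0.0;
    for (int i = 0; i < n; i++) {
        volatile double t = x[i] + M;        /* rounds x[i] to a multiple */
        double q = t - M;                    /* of ulp(M); exact subtract */
        s += q;                              /* every addition is exact   */
    }
    return s;
}

int main(void) {
    double x[1000];
    for (int i = 0; i < 1000; i++) x[i] = sin(i * 0.1) * pow(10.0, i % 5);
    printf("reproducible sum = %.17g\n", reproducible_sum(x, 1000));
    return 0;
}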
Performance results on 1024-proc Cray XC30
1.2x to 3.2x slowdown vs. fastest code, for n = 1M
Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu
Summary
Don't Communic…
Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)
106
Example The Difficulty of Tuning SpMV
bull n = 21200bull nnz = 15 M
bull Source NASA structural analysis problem (raefsky)
77
Example The Difficulty of Tuning
bull n = 21200bull nnz = 15 M
bull Source NASA structural analysis problem (raefsky)
bull 8x8 dense substructure exploit this to limit mem_refs
78
Speedups on Itanium 2 The Need for Search
Reference
Best 4x2
Mflops
Mflops
79
Register Profile Itanium 2
190 Mflops
1190 Mflops
80
Register Profiles IBM and Intel IA-64Power3 - 17 Power4 - 16
Itanium 2 - 33Itanium 1 - 8
252 Mflops
122 Mflops
820 Mflops
459 Mflops
247 Mflops
107 Mflops
12 Gflops
190 Mflops
Another example of tuning challenges for SpMV
bull Ex11 matrix (fluid flow)
bull More complicated non-zero structure in general
bull N = 16614bull NNZ = 11M
82
Zoom in to top corner
bull More complicated non-zero structure in general
bull N = 16614bull NNZ = 11M
83
3x3 blocks look natural buthellip
bull Example 3x3 blockingndash Logical grid of 3x3 cells
bull But would lead to lots of ldquofill-inrdquo
84
Extra Work Can Improve Efficiency
bull Example 3x3 blockingndash Logical grid of 3x3 cellsndash Fill-in explicit zerosndash Unroll 3x3 block multipliesndash ldquoFill ratiordquo = 15
bull On Pentium III 15x speedup
ndash Actual mflop rate 152 = 225 higher
85
Source Accelerator Cavity Design Problem (Ko via Husbands)
86
100x100 Submatrix Along Diagonal
Summer School Lecture 7
87
Post-RCM Reordering
88
Effect of Combined RCM+TSP Reordering
Before Green + RedAfter Green + Blue
Summer School Lecture 7
892x speedups on Pentium 4 Power 4 hellip
Summary of Other Performance Optimizations
bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip
bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR
bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip
90
Optimized Sparse Kernel Interface - OSKI
bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning
bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski
bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression
software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki
91
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
93
Example Classical Conjugate Gradient (CG)
SpMVs and dot products require communication in
each iteration
via CA Matrix Powers Kernel
Global reduction to compute G
94
Example CA-Conjugate Gradient
Local computations within inner loop require
no communication
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
96
Slower convergence due
to roundoff
Loss of accuracy due to roundoff
At s = 16 monomial basis is rank deficient Method breaks down
Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400
CA-CG (monomial)CG
machine precision
97
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit
bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square
bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank
matrices
Explicit (O(nnz)) Implicit (o(nnz))
Explicit (O(nnz)) CSR and variations Vision climate AMRhellip
Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries
Indices
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
101
bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010
ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver
demanded by customers (construction engineers) otherwise they donrsquot believe results
ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it
ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection
Reproducible Floating Point Computation
Absolute Error for Random Vectors
Same magnitude opposite signs
Intel MKL non-reproducibility
Relative Error for Orthogonal vectors
Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value
Sign notreproducible
103
bull Consider summation or dot productbull Goals
1 Same answer independent of layout processors order of summands
2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy
bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)
GoalsApproaches for Reproducibility
104
Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M
Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya
Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger
bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David
Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang
bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou
Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National
Instruments NEC Nokia NVIDIA Samsung Oracle
bull bebopcsberkeleyedu
Summary
Donrsquot Communichellip
106
Time to redesign all linear algebra n-body hellip algorithms and software
(and compilers)
Example The Difficulty of Tuning
bull n = 21200bull nnz = 15 M
bull Source NASA structural analysis problem (raefsky)
bull 8x8 dense substructure exploit this to limit mem_refs
78
Speedups on Itanium 2 The Need for Search
Reference
Best 4x2
Mflops
Mflops
79
Register Profile Itanium 2
190 Mflops
1190 Mflops
80
Register Profiles IBM and Intel IA-64Power3 - 17 Power4 - 16
Itanium 2 - 33Itanium 1 - 8
252 Mflops
122 Mflops
820 Mflops
459 Mflops
247 Mflops
107 Mflops
12 Gflops
190 Mflops
Another example of tuning challenges for SpMV
bull Ex11 matrix (fluid flow)
bull More complicated non-zero structure in general
bull N = 16614bull NNZ = 11M
82
Zoom in to top corner
bull More complicated non-zero structure in general
bull N = 16614bull NNZ = 11M
83
3x3 blocks look natural buthellip
bull Example 3x3 blockingndash Logical grid of 3x3 cells
bull But would lead to lots of ldquofill-inrdquo
84
Extra Work Can Improve Efficiency
bull Example 3x3 blockingndash Logical grid of 3x3 cellsndash Fill-in explicit zerosndash Unroll 3x3 block multipliesndash ldquoFill ratiordquo = 15
bull On Pentium III 15x speedup
ndash Actual mflop rate 152 = 225 higher
85
Source Accelerator Cavity Design Problem (Ko via Husbands)
86
100x100 Submatrix Along Diagonal
Summer School Lecture 7
87
Post-RCM Reordering
88
Effect of Combined RCM+TSP Reordering
Before Green + RedAfter Green + Blue
Summer School Lecture 7
892x speedups on Pentium 4 Power 4 hellip
Summary of Other Performance Optimizations
bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip
bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR
bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip
90
Optimized Sparse Kernel Interface - OSKI
bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning
bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski
bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression
software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki
91
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
93
Example Classical Conjugate Gradient (CG)
SpMVs and dot products require communication in
each iteration
via CA Matrix Powers Kernel
Global reduction to compute G
94
Example CA-Conjugate Gradient
Local computations within inner loop require
no communication
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
96
Slower convergence due
to roundoff
Loss of accuracy due to roundoff
At s = 16 monomial basis is rank deficient Method breaks down
Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400
CA-CG (monomial)CG
machine precision
97
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit
bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square
bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank
matrices
Explicit (O(nnz)) Implicit (o(nnz))
Explicit (O(nnz)) CSR and variations Vision climate AMRhellip
Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries
Indices
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
101
bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010
ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver
demanded by customers (construction engineers) otherwise they donrsquot believe results
ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it
ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection
Reproducible Floating Point Computation
Absolute Error for Random Vectors
Same magnitude opposite signs
Intel MKL non-reproducibility
Relative Error for Orthogonal vectors
Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value
Sign notreproducible
103
bull Consider summation or dot productbull Goals
1 Same answer independent of layout processors order of summands
2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy
bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)
GoalsApproaches for Reproducibility
104
Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M
Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya
Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger
bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David
Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang
bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou
Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National
Instruments NEC Nokia NVIDIA Samsung Oracle
bull bebopcsberkeleyedu
Summary
Donrsquot Communichellip
106
Time to redesign all linear algebra n-body hellip algorithms and software
(and compilers)
Speedups on Itanium 2 The Need for Search
Reference
Best 4x2
Mflops
Mflops
79
Register Profile Itanium 2
190 Mflops
1190 Mflops
80
Register Profiles IBM and Intel IA-64Power3 - 17 Power4 - 16
Itanium 2 - 33Itanium 1 - 8
252 Mflops
122 Mflops
820 Mflops
459 Mflops
247 Mflops
107 Mflops
12 Gflops
190 Mflops
Another example of tuning challenges for SpMV
bull Ex11 matrix (fluid flow)
bull More complicated non-zero structure in general
bull N = 16614bull NNZ = 11M
82
Zoom in to top corner
bull More complicated non-zero structure in general
bull N = 16614bull NNZ = 11M
83
3x3 blocks look natural buthellip
bull Example 3x3 blockingndash Logical grid of 3x3 cells
bull But would lead to lots of ldquofill-inrdquo
84
Extra Work Can Improve Efficiency
bull Example 3x3 blockingndash Logical grid of 3x3 cellsndash Fill-in explicit zerosndash Unroll 3x3 block multipliesndash ldquoFill ratiordquo = 15
bull On Pentium III 15x speedup
ndash Actual mflop rate 152 = 225 higher
85
Source Accelerator Cavity Design Problem (Ko via Husbands)
86
100x100 Submatrix Along Diagonal
Summer School Lecture 7
87
Post-RCM Reordering
88
Effect of Combined RCM+TSP Reordering
Before Green + RedAfter Green + Blue
Summer School Lecture 7
892x speedups on Pentium 4 Power 4 hellip
Summary of Other Performance Optimizations
bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip
bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR
bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip
90
Optimized Sparse Kernel Interface - OSKI
bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning
bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski
bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression
software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki
91
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
93
Example Classical Conjugate Gradient (CG)
SpMVs and dot products require communication in
each iteration
via CA Matrix Powers Kernel
Global reduction to compute G
94
Example CA-Conjugate Gradient
Local computations within inner loop require
no communication
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
96
Slower convergence due
to roundoff
Loss of accuracy due to roundoff
At s = 16 monomial basis is rank deficient Method breaks down
Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400
CA-CG (monomial)CG
machine precision
97
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit
bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square
bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank
matrices
Explicit (O(nnz)) Implicit (o(nnz))
Explicit (O(nnz)) CSR and variations Vision climate AMRhellip
Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries
Indices
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
101
bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010
ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver
demanded by customers (construction engineers) otherwise they donrsquot believe results
ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it
ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection
Reproducible Floating Point Computation
Absolute Error for Random Vectors
Same magnitude opposite signs
Intel MKL non-reproducibility
Relative Error for Orthogonal vectors
Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value
Sign notreproducible
103
bull Consider summation or dot productbull Goals
1 Same answer independent of layout processors order of summands
2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy
bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)
GoalsApproaches for Reproducibility
104
Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M
Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya
Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger
bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David
Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang
bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou
Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National
Instruments NEC Nokia NVIDIA Samsung Oracle
bull bebopcsberkeleyedu
Summary
Donrsquot Communichellip
106
Time to redesign all linear algebra n-body hellip algorithms and software
(and compilers)
Register Profile Itanium 2
190 Mflops
1190 Mflops
80
Register Profiles IBM and Intel IA-64Power3 - 17 Power4 - 16
Itanium 2 - 33Itanium 1 - 8
252 Mflops
122 Mflops
820 Mflops
459 Mflops
247 Mflops
107 Mflops
12 Gflops
190 Mflops
Another example of tuning challenges for SpMV
bull Ex11 matrix (fluid flow)
bull More complicated non-zero structure in general
bull N = 16614bull NNZ = 11M
82
Zoom in to top corner
bull More complicated non-zero structure in general
bull N = 16614bull NNZ = 11M
83
3x3 blocks look natural buthellip
bull Example 3x3 blockingndash Logical grid of 3x3 cells
bull But would lead to lots of ldquofill-inrdquo
84
Extra Work Can Improve Efficiency
bull Example 3x3 blockingndash Logical grid of 3x3 cellsndash Fill-in explicit zerosndash Unroll 3x3 block multipliesndash ldquoFill ratiordquo = 15
bull On Pentium III 15x speedup
ndash Actual mflop rate 152 = 225 higher
85
Source Accelerator Cavity Design Problem (Ko via Husbands)
86
100x100 Submatrix Along Diagonal
Summer School Lecture 7
87
Post-RCM Reordering
88
Effect of Combined RCM+TSP Reordering
Before Green + RedAfter Green + Blue
Summer School Lecture 7
892x speedups on Pentium 4 Power 4 hellip
Summary of Other Performance Optimizations
bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip
bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR
bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip
90
Optimized Sparse Kernel Interface - OSKI
bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning
bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski
bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression
software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki
91
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
93
Example Classical Conjugate Gradient (CG)
SpMVs and dot products require communication in
each iteration
via CA Matrix Powers Kernel
Global reduction to compute G
94
Example CA-Conjugate Gradient
Local computations within inner loop require
no communication
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
96
Slower convergence due
to roundoff
Loss of accuracy due to roundoff
At s = 16 monomial basis is rank deficient Method breaks down
Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400
CA-CG (monomial)CG
machine precision
97
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit
bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square
bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank
matrices
Explicit (O(nnz)) Implicit (o(nnz))
Explicit (O(nnz)) CSR and variations Vision climate AMRhellip
Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries
Indices
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
101
bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010
ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver
demanded by customers (construction engineers) otherwise they donrsquot believe results
ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it
ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection
Reproducible Floating Point Computation
Absolute Error for Random Vectors
Same magnitude opposite signs
Intel MKL non-reproducibility
Relative Error for Orthogonal vectors
Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value
Sign notreproducible
103
bull Consider summation or dot productbull Goals
1 Same answer independent of layout processors order of summands
2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy
bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)
GoalsApproaches for Reproducibility
104
Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M
Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya
Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger
bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David
Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang
bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou
Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National
Instruments NEC Nokia NVIDIA Samsung Oracle
bull bebopcsberkeleyedu
Summary
Donrsquot Communichellip
106
Time to redesign all linear algebra n-body hellip algorithms and software
(and compilers)
Register Profiles IBM and Intel IA-64Power3 - 17 Power4 - 16
Itanium 2 - 33Itanium 1 - 8
252 Mflops
122 Mflops
820 Mflops
459 Mflops
247 Mflops
107 Mflops
12 Gflops
190 Mflops
Another example of tuning challenges for SpMV
bull Ex11 matrix (fluid flow)
bull More complicated non-zero structure in general
bull N = 16614bull NNZ = 11M
82
Zoom in to top corner
bull More complicated non-zero structure in general
bull N = 16614bull NNZ = 11M
83
3x3 blocks look natural buthellip
bull Example 3x3 blockingndash Logical grid of 3x3 cells
bull But would lead to lots of ldquofill-inrdquo
84
Extra Work Can Improve Efficiency
bull Example 3x3 blockingndash Logical grid of 3x3 cellsndash Fill-in explicit zerosndash Unroll 3x3 block multipliesndash ldquoFill ratiordquo = 15
bull On Pentium III 15x speedup
ndash Actual mflop rate 152 = 225 higher
85
Source Accelerator Cavity Design Problem (Ko via Husbands)
86
100x100 Submatrix Along Diagonal
Summer School Lecture 7
87
Post-RCM Reordering
88
Effect of Combined RCM+TSP Reordering
Before Green + RedAfter Green + Blue
Summer School Lecture 7
892x speedups on Pentium 4 Power 4 hellip
Summary of Other Performance Optimizations
bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip
bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR
bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip
90
Optimized Sparse Kernel Interface - OSKI
bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning
bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski
bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression
software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki
91
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
93
Example Classical Conjugate Gradient (CG)
SpMVs and dot products require communication in
each iteration
via CA Matrix Powers Kernel
Global reduction to compute G
94
Example CA-Conjugate Gradient
Local computations within inner loop require
no communication
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
96
Slower convergence due
to roundoff
Loss of accuracy due to roundoff
At s = 16 monomial basis is rank deficient Method breaks down
Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400
CA-CG (monomial)CG
machine precision
97
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit
bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square
bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank
matrices
Explicit (O(nnz)) Implicit (o(nnz))
Explicit (O(nnz)) CSR and variations Vision climate AMRhellip
Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries
Indices
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
101
bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010
ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver
demanded by customers (construction engineers) otherwise they donrsquot believe results
ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it
ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection
Reproducible Floating Point Computation
Absolute Error for Random Vectors
Same magnitude opposite signs
Intel MKL non-reproducibility
Relative Error for Orthogonal vectors
Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value
Sign notreproducible
103
bull Consider summation or dot productbull Goals
1 Same answer independent of layout processors order of summands
2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy
bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)
GoalsApproaches for Reproducibility
104
Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M
Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya
Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger
bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David
Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang
bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou
Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National
Instruments NEC Nokia NVIDIA Samsung Oracle
bull bebopcsberkeleyedu
Summary
Donrsquot Communichellip
106
Time to redesign all linear algebra n-body hellip algorithms and software
(and compilers)
Another example of tuning challenges for SpMV
bull Ex11 matrix (fluid flow)
bull More complicated non-zero structure in general
bull N = 16614bull NNZ = 11M
82
Zoom in to top corner
bull More complicated non-zero structure in general
bull N = 16614bull NNZ = 11M
83
3x3 blocks look natural buthellip
bull Example 3x3 blockingndash Logical grid of 3x3 cells
bull But would lead to lots of ldquofill-inrdquo
84
Extra Work Can Improve Efficiency
bull Example 3x3 blockingndash Logical grid of 3x3 cellsndash Fill-in explicit zerosndash Unroll 3x3 block multipliesndash ldquoFill ratiordquo = 15
bull On Pentium III 15x speedup
ndash Actual mflop rate 152 = 225 higher
85
Source Accelerator Cavity Design Problem (Ko via Husbands)
86
100x100 Submatrix Along Diagonal
Summer School Lecture 7
87
Post-RCM Reordering
88
Effect of Combined RCM+TSP Reordering
Before Green + RedAfter Green + Blue
Summer School Lecture 7
892x speedups on Pentium 4 Power 4 hellip
Summary of Other Performance Optimizations
bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip
bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR
bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip
90
Optimized Sparse Kernel Interface - OSKI
bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning
bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski
bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression
software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki
91
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
93
Example Classical Conjugate Gradient (CG)
SpMVs and dot products require communication in
each iteration
via CA Matrix Powers Kernel
Global reduction to compute G
94
Example CA-Conjugate Gradient
Local computations within inner loop require
no communication
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
96
Slower convergence due
to roundoff
Loss of accuracy due to roundoff
At s = 16 monomial basis is rank deficient Method breaks down
Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400
CA-CG (monomial)CG
machine precision
97
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
Zoom in to top corner
• More complicated non-zero structure in general
• N = 16,614 • NNZ = 1.1M
83
3x3 blocks look natural but…
• Example: 3x3 blocking – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"
84
Extra Work Can Improve Efficiency
• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup
  – Actual Mflop rate is 1.5² = 2.25x higher
85
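To make the register-blocking idea concrete, here is a minimal 3x3 BCSR (block CSR) SpMV sketch in C. The array names and layout are illustrative assumptions, not the tuned library code: each stored block is a dense 3x3 tile with explicit zeros filled in, and the 3x3 multiply is fully unrolled so the tile and the three destination values stay in registers.

```c
/* Minimal 3x3 BCSR (block CSR) SpMV sketch: y += A*x.
 * Assumed layout: b_rows block rows; brow_ptr/bcol_idx index 3x3 blocks;
 * val stores each block as a dense row-major 3x3 tile (explicit zeros filled). */
void spmv_bcsr_3x3(int b_rows, const int *brow_ptr, const int *bcol_idx,
                   const double *val, const double *x, double *y)
{
    for (int bi = 0; bi < b_rows; bi++) {
        /* keep the 3 destination entries in registers */
        double y0 = y[3*bi], y1 = y[3*bi + 1], y2 = y[3*bi + 2];
        for (int k = brow_ptr[bi]; k < brow_ptr[bi + 1]; k++) {
            const double *b  = &val[9*k];            /* one dense 3x3 block */
            const double *xp = &x[3*bcol_idx[k]];
            double x0 = xp[0], x1 = xp[1], x2 = xp[2];
            /* unrolled 3x3 block multiply; the "extra work" is the
               multiplies by the explicitly stored zeros */
            y0 += b[0]*x0 + b[1]*x1 + b[2]*x2;
            y1 += b[3]*x0 + b[4]*x1 + b[5]*x2;
            y2 += b[6]*x0 + b[7]*x1 + b[8]*x2;
        }
        y[3*bi] = y0; y[3*bi + 1] = y1; y[3*bi + 2] = y2;
    }
}
```

The payoff for the extra flops is one column index load per 3x3 block instead of per nonzero, plus better register reuse, which is where the 1.5x speedup on the slide comes from.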
Source: Accelerator Cavity Design Problem (Ko, via Husbands)
86
100x100 Submatrix Along Diagonal
87
Post-RCM Reordering
88
Effect of Combined RCM+TSP Reordering
Before: Green + Red   After: Green + Blue
2x speedups on Pentium 4, Power 4, …
89
Summary of Other Performance Optimizations
• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·Aᵀ·x, Aᵀ·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later…
90
Optimized Sparse Kernel Interface - OSKI
• Provides sparse kernels automatically tuned for user's matrix & machine
  – BLAS-style functionality: SpMV (A·x & Aᵀ·y), TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as stand-alone library
  – Available as PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski
91
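A rough sketch of the OSKI call sequence, in the style of the OSKI User's Guide. The exact constants and signatures (e.g. SHARE_INPUTMAT, SYMBOLIC_VEC) should be checked against the installed headers; treat this as an assumption-laden illustration, not a verified program.

```c
#include <oski/oski.h>

/* Tuned SpMV y = A*x + y on a matrix held in CSR arrays
 * (Aptr, Aind, Aval of an n-by-n matrix); illustrative only. */
void oski_spmv_example(int n, int *Aptr, int *Aind, double *Aval,
                       double *x, double *y)
{
    oski_Init();
    oski_matrix_t A   = oski_CreateMatCSR(Aptr, Aind, Aval, n, n,
                                          SHARE_INPUTMAT, 1, INDEX_ZERO_BASED);
    oski_vecview_t xv = oski_CreateVecView(x, n, STRIDE_UNIT);
    oski_vecview_t yv = oski_CreateVecView(y, n, STRIDE_UNIT);

    /* Hint how the matrix will be used, then let OSKI tune
       (e.g. pick a register block size) if the cost is worth it. */
    oski_SetHintMatMult(A, OP_NORMAL, 1.0, SYMBOLIC_VEC, 1.0, SYMBOLIC_VEC, 500);
    oski_TuneMat(A);

    oski_MatMult(A, OP_NORMAL, 1.0, xv, 1.0, yv);   /* y = A*x + y */

    oski_DestroyMat(A);
    oski_DestroyVecView(xv);
    oski_DestroyVecView(yv);
    oski_Close();
}
```

The key design point is that tuning happens at run time against the user's actual matrix, and the hint/tune calls let the library amortize that cost over the expected number of SpMVs.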
Outline
• Review / extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity
93
Example: Classical Conjugate Gradient (CG)
SpMVs and dot products require communication in each iteration
94
Example: CA-Conjugate Gradient
Matrix powers are computed via the CA matrix-powers kernel; one global reduction computes G
Local computations within the inner loop require no communication
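As a point of reference (added here, not code from the talk), a minimal serial CG on a 1D Poisson model problem, with comments marking where a distributed-memory run communicates every iteration: one SpMV (neighbor/halo exchange) and two dot products (global reductions). These are exactly the operations CA-CG batches into one matrix-powers kernel call and one block reduction per s steps.

```c
/* Classical CG on a 1D Poisson model problem (tridiagonal, N unknowns). */
#include <stdio.h>
#include <math.h>

#define N 100

static void spmv_poisson1d(const double *p, double *Ap) {
    for (int i = 0; i < N; i++) {              /* SpMV: halo exchange in parallel */
        double left  = (i > 0)     ? p[i - 1] : 0.0;
        double right = (i < N - 1) ? p[i + 1] : 0.0;
        Ap[i] = 2.0 * p[i] - left - right;
    }
}

static double dot(const double *u, const double *v) {
    double s = 0.0;                            /* dot: global reduction in parallel */
    for (int i = 0; i < N; i++) s += u[i] * v[i];
    return s;
}

int main(void) {
    double x[N] = {0}, r[N], p[N], Ap[N];
    for (int i = 0; i < N; i++) r[i] = p[i] = 1.0;   /* b = ones, x0 = 0 */
    double rr = dot(r, r);
    for (int it = 0; it < 200 && sqrt(rr) > 1e-10; it++) {
        spmv_poisson1d(p, Ap);                 /* communication #1: SpMV        */
        double alpha = rr / dot(p, Ap);        /* communication #2: reduction   */
        for (int i = 0; i < N; i++) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
        double rr_new = dot(r, r);             /* communication #3: reduction   */
        double beta = rr_new / rr;
        for (int i = 0; i < N; i++) p[i] = r[i] + beta * p[i];
        rr = rr_new;
        if (it % 10 == 0) printf("iter %3d  ||r|| = %.3e\n", it, sqrt(rr));
    }
    return 0;
}
```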
Outline
• Review / extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity
96
[Convergence plot: CA-CG (monomial basis) vs. CG, residual vs. iteration, with a line at machine precision.]
Slower convergence and loss of accuracy due to roundoff; at s = 16 the monomial basis is rank deficient and the method breaks down.
Model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400
97
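For reference, the "monomial basis" here is the s-step Krylov basis built from plain powers of A; a brief sketch of why it degrades (added explanation, not from the slide):

```latex
% Monomial s-step basis used by CA-CG (monomial):
K_{s+1} = [\, p,\; Ap,\; A^{2}p,\; \dots,\; A^{s}p \,]
% Each new column is one more power-method step, so the columns all lean
% toward the dominant eigenvector and cond(K_{s+1}) grows rapidly with s;
% in this model problem the basis is numerically rank deficient by s = 16.
% Better-conditioned polynomial bases,
%   K_{s+1} = [\, \rho_0(A)p,\; \rho_1(A)p,\; \dots,\; \rho_s(A)p \,]
% (e.g. Newton or Chebyshev polynomials \rho_j), are the usual remedy.
```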
Outline
• Review / extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity
What is a "sparse matrix"?
• Requires o(n^2) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be a sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + UDV^T, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write A^k = (S + UDV^T)^k as a sum of S^k and low-rank matrices (see the expansion below)

                          Indices explicit (O(nnz))   Indices implicit (o(nnz))
  Nonzero entries
    Explicit (O(nnz))     CSR and variations           Vision, climate, AMR, …
    Implicit (o(nnz))     Graph Laplacian              Stencils
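A short worked expansion (added for clarity, with L = UDV^T of rank r) showing why the powers stay "sparse + low rank":

```latex
(S + L)^2 = S^2 + SL + LS + L^2
          = S^2 \;+\;
            \underbrace{\begin{bmatrix} SU & U \end{bmatrix}}_{n \times 2r}\,
            \underbrace{\begin{bmatrix} DV^{T} \\ DV^{T}S + DV^{T}UDV^{T} \end{bmatrix}}_{2r \times n}
% The correction to S^2 has rank at most 2r; the same grouping applied
% inductively keeps (S + L)^k = S^k + (low rank) for larger k.
```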
Outline
• Review / extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity
101
• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm at GNS-MBH
  – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe the results
  – Willing to sacrifice 40%–50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What? How will I debug without reproducibility?"
  – Few: "I know better and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection
Reproducible Floating Point Computation
Intel MKL non-reproducibility
[Plots: absolute error for random vectors (same magnitude, opposite signs) and relative error for orthogonal vectors; even the sign is not reproducible.]
Vector size 1e6, data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum – minimum
• Relative error = absolute error / maximum absolute value
103
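The underlying cause is that floating-point addition is not associative, so a different thread count means a different reduction order and hence different bits. A two-line illustration (added here, not from the slide):

```c
#include <stdio.h>

int main(void) {
    double a = 1.0, b = 1e-16, c = -1.0;
    printf("(a+b)+c = %.17e\n", (a + b) + c);   /* 0.0, since 1+1e-16 rounds to 1 */
    printf("a+(b+c) = %.17e\n", a + (b + c));   /* ~1.11e-16                      */
    return 0;
}
```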
Goals/Approaches for Reproducibility
• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee a fixed reduction tree (fails goals 2 and 3)
  – Use (very) high precision to get the exact answer (fails goal 2)
  – Prerounding technique (Nguyen, D.) – see the sketch below
104
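A minimal serial sketch of the prerounding idea (an assumption-laden illustration of the approach, not the ReproBLAS code, which uses several bins and handles scaling and overflow carefully): round every summand onto a common grid chosen from the global maximum, so that every subsequent addition is exact and the result no longer depends on summation order.

```c
#include <math.h>

/* Reproducible (order-independent) summation sketch, IEEE double assumed. */
double repro_sum(const double *x, int n)
{
    if (n == 0) return 0.0;

    /* Step 1: one reduction to get M = max_i |x_i| (max is order-independent). */
    double M = 0.0;
    for (int i = 0; i < n; i++) { double a = fabs(x[i]); if (a > M) M = a; }
    if (M == 0.0) return 0.0;

    /* Step 2: choose S = 2^k >= 2*n*M; ulp(S) defines the rounding grid, and
       the margin keeps every partial sum below S so grid additions stay exact. */
    double S = ldexp(1.0, (int)ceil(log2(2.0 * n * M)));

    /* Step 3: preround each x_i to the grid via (x_i + S) - S, then sum in any
       order: all terms lie on the same grid, so the bits of the result are
       identical for every ordering (at the cost of some accuracy). */
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        volatile double t = x[i] + S;   /* volatile: block FMA / reassociation */
        sum += (double)t - S;
    }
    return sum;
}
```

The accuracy lost is roughly n·ulp(S) in absolute terms; letting the user trade accuracy for speed (goal 4 above) is handled in the full algorithms by using more bins.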
Performance results on 1024 processors of a Cray XC30: 1.2x to 3.2x slowdown vs. the fastest code, for n = 1M
Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu
Summary
Don't Communic…
106
Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)
3x3 blocks look natural buthellip
bull Example 3x3 blockingndash Logical grid of 3x3 cells
bull But would lead to lots of ldquofill-inrdquo
84
Extra Work Can Improve Efficiency
bull Example 3x3 blockingndash Logical grid of 3x3 cellsndash Fill-in explicit zerosndash Unroll 3x3 block multipliesndash ldquoFill ratiordquo = 15
bull On Pentium III 15x speedup
ndash Actual mflop rate 152 = 225 higher
85
Source Accelerator Cavity Design Problem (Ko via Husbands)
86
100x100 Submatrix Along Diagonal
Summer School Lecture 7
87
Post-RCM Reordering
88
Effect of Combined RCM+TSP Reordering
Before Green + RedAfter Green + Blue
Summer School Lecture 7
892x speedups on Pentium 4 Power 4 hellip
Summary of Other Performance Optimizations
bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip
bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR
bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip
90
Optimized Sparse Kernel Interface - OSKI
bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning
bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski
bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression
software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki
91
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
93
Example Classical Conjugate Gradient (CG)
SpMVs and dot products require communication in
each iteration
via CA Matrix Powers Kernel
Global reduction to compute G
94
Example CA-Conjugate Gradient
Local computations within inner loop require
no communication
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
96
Slower convergence due
to roundoff
Loss of accuracy due to roundoff
At s = 16 monomial basis is rank deficient Method breaks down
Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400
CA-CG (monomial)CG
machine precision
97
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit
bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square
bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank
matrices
Explicit (O(nnz)) Implicit (o(nnz))
Explicit (O(nnz)) CSR and variations Vision climate AMRhellip
Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries
Indices
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
101
bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010
ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver
demanded by customers (construction engineers) otherwise they donrsquot believe results
ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it
ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection
Reproducible Floating Point Computation
Absolute Error for Random Vectors
Same magnitude opposite signs
Intel MKL non-reproducibility
Relative Error for Orthogonal vectors
Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value
Sign notreproducible
103
bull Consider summation or dot productbull Goals
1 Same answer independent of layout processors order of summands
2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy
bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)
GoalsApproaches for Reproducibility
104
Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M
Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya
Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger
bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David
Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang
bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou
Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National
Instruments NEC Nokia NVIDIA Samsung Oracle
bull bebopcsberkeleyedu
Summary
Donrsquot Communichellip
106
Time to redesign all linear algebra n-body hellip algorithms and software
(and compilers)
Extra Work Can Improve Efficiency
bull Example 3x3 blockingndash Logical grid of 3x3 cellsndash Fill-in explicit zerosndash Unroll 3x3 block multipliesndash ldquoFill ratiordquo = 15
bull On Pentium III 15x speedup
ndash Actual mflop rate 152 = 225 higher
85
Source Accelerator Cavity Design Problem (Ko via Husbands)
86
100x100 Submatrix Along Diagonal
Summer School Lecture 7
87
Post-RCM Reordering
88
Effect of Combined RCM+TSP Reordering
Before Green + RedAfter Green + Blue
Summer School Lecture 7
892x speedups on Pentium 4 Power 4 hellip
Summary of Other Performance Optimizations
bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip
bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR
bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip
90
Optimized Sparse Kernel Interface - OSKI
bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning
bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski
bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression
software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki
91
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
93
Example Classical Conjugate Gradient (CG)
SpMVs and dot products require communication in
each iteration
via CA Matrix Powers Kernel
Global reduction to compute G
94
Example CA-Conjugate Gradient
Local computations within inner loop require
no communication
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
96
Slower convergence due
to roundoff
Loss of accuracy due to roundoff
At s = 16 monomial basis is rank deficient Method breaks down
Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400
CA-CG (monomial)CG
machine precision
97
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit
bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square
bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank
matrices
Explicit (O(nnz)) Implicit (o(nnz))
Explicit (O(nnz)) CSR and variations Vision climate AMRhellip
Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries
Indices
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
101
bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010
ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver
demanded by customers (construction engineers) otherwise they donrsquot believe results
ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it
ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection
Reproducible Floating Point Computation
Absolute Error for Random Vectors
Same magnitude opposite signs
Intel MKL non-reproducibility
Relative Error for Orthogonal vectors
Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value
Sign notreproducible
103
bull Consider summation or dot productbull Goals
1 Same answer independent of layout processors order of summands
2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy
bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)
GoalsApproaches for Reproducibility
104
Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M
Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya
Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger
bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David
Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang
bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou
Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National
Instruments NEC Nokia NVIDIA Samsung Oracle
bull bebopcsberkeleyedu
Summary
Donrsquot Communichellip
106
Time to redesign all linear algebra n-body hellip algorithms and software
(and compilers)
Source Accelerator Cavity Design Problem (Ko via Husbands)
86
100x100 Submatrix Along Diagonal
Summer School Lecture 7
87
Post-RCM Reordering
88
Effect of Combined RCM+TSP Reordering
Before Green + RedAfter Green + Blue
Summer School Lecture 7
892x speedups on Pentium 4 Power 4 hellip
Summary of Other Performance Optimizations
bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip
bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR
bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip
90
Optimized Sparse Kernel Interface - OSKI
bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning
bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski
bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression
software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki
91
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
93
Example Classical Conjugate Gradient (CG)
SpMVs and dot products require communication in
each iteration
via CA Matrix Powers Kernel
Global reduction to compute G
94
Example CA-Conjugate Gradient
Local computations within inner loop require
no communication
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
96
Slower convergence due
to roundoff
Loss of accuracy due to roundoff
At s = 16 monomial basis is rank deficient Method breaks down
Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400
CA-CG (monomial)CG
machine precision
97
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit
bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square
bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank
matrices
Explicit (O(nnz)) Implicit (o(nnz))
Explicit (O(nnz)) CSR and variations Vision climate AMRhellip
Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries
Indices
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
101
bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010
ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver
demanded by customers (construction engineers) otherwise they donrsquot believe results
ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it
ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection
Reproducible Floating Point Computation
Absolute Error for Random Vectors
Same magnitude opposite signs
Intel MKL non-reproducibility
Relative Error for Orthogonal vectors
Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value
Sign notreproducible
103
bull Consider summation or dot productbull Goals
1 Same answer independent of layout processors order of summands
2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy
bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)
GoalsApproaches for Reproducibility
104
Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M
Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya
Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger
bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David
Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang
bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou
Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National
Instruments NEC Nokia NVIDIA Samsung Oracle
bull bebopcsberkeleyedu
Summary
Donrsquot Communichellip
106
Time to redesign all linear algebra n-body hellip algorithms and software
(and compilers)
100x100 Submatrix Along Diagonal
Summer School Lecture 7
87
Post-RCM Reordering
88
Effect of Combined RCM+TSP Reordering
Before Green + RedAfter Green + Blue
Summer School Lecture 7
892x speedups on Pentium 4 Power 4 hellip
Summary of Other Performance Optimizations
bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip
bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR
bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip
90
Optimized Sparse Kernel Interface - OSKI
bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning
bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski
bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression
software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki
91
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
93
Example Classical Conjugate Gradient (CG)
SpMVs and dot products require communication in
each iteration
via CA Matrix Powers Kernel
Global reduction to compute G
94
Example CA-Conjugate Gradient
Local computations within inner loop require
no communication
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
96
Slower convergence due
to roundoff
Loss of accuracy due to roundoff
At s = 16 monomial basis is rank deficient Method breaks down
Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400
CA-CG (monomial)CG
machine precision
97
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit
bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square
bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank
matrices
Explicit (O(nnz)) Implicit (o(nnz))
Explicit (O(nnz)) CSR and variations Vision climate AMRhellip
Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries
Indices
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
101
bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010
ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver
demanded by customers (construction engineers) otherwise they donrsquot believe results
ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it
ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection
Reproducible Floating Point Computation
Absolute Error for Random Vectors
Same magnitude opposite signs
Intel MKL non-reproducibility
Relative Error for Orthogonal vectors
Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value
Sign notreproducible
103
bull Consider summation or dot productbull Goals
1 Same answer independent of layout processors order of summands
2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy
bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)
GoalsApproaches for Reproducibility
104
Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M
Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya
Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger
bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David
Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang
bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou
Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National
Instruments NEC Nokia NVIDIA Samsung Oracle
bull bebopcsberkeleyedu
Summary
Donrsquot Communichellip
106
Time to redesign all linear algebra n-body hellip algorithms and software
(and compilers)
Post-RCM Reordering
88
Effect of Combined RCM+TSP Reordering
Before Green + RedAfter Green + Blue
Summer School Lecture 7
892x speedups on Pentium 4 Power 4 hellip
Summary of Other Performance Optimizations
bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip
bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR
bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip
90
Optimized Sparse Kernel Interface - OSKI
bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning
bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski
bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression
software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki
91
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
93
Example Classical Conjugate Gradient (CG)
SpMVs and dot products require communication in
each iteration
via CA Matrix Powers Kernel
Global reduction to compute G
94
Example CA-Conjugate Gradient
Local computations within inner loop require
no communication
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
96
Slower convergence due
to roundoff
Loss of accuracy due to roundoff
At s = 16 monomial basis is rank deficient Method breaks down
Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400
CA-CG (monomial)CG
machine precision
97
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit
bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square
bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank
matrices
Explicit (O(nnz)) Implicit (o(nnz))
Explicit (O(nnz)) CSR and variations Vision climate AMRhellip
Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries
Indices
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
101
bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010
ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver
demanded by customers (construction engineers) otherwise they donrsquot believe results
ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it
ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection
Reproducible Floating Point Computation
Absolute Error for Random Vectors
Same magnitude opposite signs
Intel MKL non-reproducibility
Relative Error for Orthogonal vectors
Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value
Sign notreproducible
103
bull Consider summation or dot productbull Goals
1 Same answer independent of layout processors order of summands
2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy
bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)
GoalsApproaches for Reproducibility
104
Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M
Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya
Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger
bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David
Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang
bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou
Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National
Instruments NEC Nokia NVIDIA Samsung Oracle
bull bebopcsberkeleyedu
Summary
Donrsquot Communichellip
106
Time to redesign all linear algebra n-body hellip algorithms and software
(and compilers)
Effect of Combined RCM+TSP Reordering
Before Green + RedAfter Green + Blue
Summer School Lecture 7
892x speedups on Pentium 4 Power 4 hellip
Summary of Other Performance Optimizations
bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip
bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR
bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip
90
Optimized Sparse Kernel Interface - OSKI
bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning
bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski
bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression
software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki
91
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
93
Example Classical Conjugate Gradient (CG)
SpMVs and dot products require communication in
each iteration
via CA Matrix Powers Kernel
Global reduction to compute G
94
Example CA-Conjugate Gradient
Local computations within inner loop require
no communication
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
96
Slower convergence due
to roundoff
Loss of accuracy due to roundoff
At s = 16 monomial basis is rank deficient Method breaks down
Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400
CA-CG (monomial)CG
machine precision
97
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit
bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square
bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank
matrices
Explicit (O(nnz)) Implicit (o(nnz))
Explicit (O(nnz)) CSR and variations Vision climate AMRhellip
Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries
Indices
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
101
bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010
ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver
demanded by customers (construction engineers) otherwise they donrsquot believe results
ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it
ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection
Reproducible Floating Point Computation
Absolute Error for Random Vectors
Same magnitude opposite signs
Intel MKL non-reproducibility
Relative Error for Orthogonal vectors
Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value
Sign notreproducible
103
bull Consider summation or dot productbull Goals
1 Same answer independent of layout processors order of summands
2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy
bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)
GoalsApproaches for Reproducibility
104
Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M
Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya
Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger
bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David
Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang
bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou
Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National
Instruments NEC Nokia NVIDIA Samsung Oracle
bull bebopcsberkeleyedu
Summary
Donrsquot Communichellip
106
Time to redesign all linear algebra n-body hellip algorithms and software
(and compilers)
Summary of Other Performance Optimizations
bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip
bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR
bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip
90
Optimized Sparse Kernel Interface - OSKI
bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning
bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski
bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression
software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki
91
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
93
Example Classical Conjugate Gradient (CG)
SpMVs and dot products require communication in
each iteration
via CA Matrix Powers Kernel
Global reduction to compute G
94
Example CA-Conjugate Gradient
Local computations within inner loop require
no communication
Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity
96
[Convergence plot: CA-CG (monomial basis) vs. CG]
Model problem: 2D Poisson, 5-point stencil, 30×30 grid, cond(A) ≈ 400.
CA-CG with the monomial basis shows slower convergence and loss of accuracy (relative to machine precision) due to roundoff; at s = 16 the monomial basis is rank deficient and the method breaks down.
97
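The breakdown can be checked numerically: for the same model problem (assumed here to be the 30×30 5-point Poisson matrix), the column-normalized monomial basis [p, Ap, …, A^s p] loses full numerical rank as s grows. A small SciPy experiment:

```python
import numpy as np
import scipy.sparse as sp

# 2D Poisson, 5-point stencil, 30x30 grid (n = 900), as in the model problem above.
m = 30
T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(m, m))
A = (sp.kron(sp.identity(m), T) + sp.kron(T, sp.identity(m))).tocsr()

rng = np.random.default_rng(0)
v = rng.standard_normal(A.shape[0])
for s in (4, 8, 12, 16):
    V = [v]
    for _ in range(s):
        V.append(A @ V[-1])                    # monomial Krylov basis vectors
    B = np.column_stack([w / np.linalg.norm(w) for w in V])
    print(f"s = {s:2d}: cond(basis) = {np.linalg.cond(B):.2e}, "
          f"numerical rank = {np.linalg.matrix_rank(B)} of {s + 1}")
```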
Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity
What is a "sparse matrix"?
• Requires o(n^2) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be a sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + UDV^T, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write A^k = (S + UDV^T)^k as a sum of S^k and low-rank matrices
                             Indices explicit (O(nnz))   Indices implicit (o(nnz))
Nonzero entries explicit     CSR and variations           Vision, climate, AMR, …
Nonzero entries implicit     Graph Laplacian              Stencils
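The point of the S + UDV^T splitting is that the dense n×n matrix is never formed: applying it (or its powers, kept in factored form) costs one SpMV plus a few thin dense products. A minimal sketch, with made-up shapes for illustration:

```python
import numpy as np
import scipy.sparse as sp

def apply_sparse_plus_lowrank(S, U, D, V, x):
    """y = (S + U D V^T) x without ever forming the dense n-by-n matrix."""
    return S @ x + U @ (D @ (V.T @ x))        # SpMV + three thin dense products

# Example shapes: S is n x n sparse, U and V are n x k with k << n, D is k x k.
n, k = 2000, 5
S = sp.random(n, n, density=1e-3, format="csr", random_state=0)
U, V = np.random.rand(n, k), np.random.rand(n, k)
D = np.diag(np.random.rand(k))
y = apply_sparse_plus_lowrank(S, U, D, V, np.random.rand(n))
```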
Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity
101
Reproducible Floating Point Computation
• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm at GNS-MBH
  – Sought reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe results
  – Willing to sacrifice 40%–50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What? How will I debug without reproducibility?"
  – Few: "I know better and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection
Intel MKL non-reproducibility
[Plots: absolute error for random vectors (same magnitude, opposite signs); relative error for orthogonal vectors (even the sign is not reproducible)]
Vector size 10^6, data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum – minimum
• Relative error = absolute error / maximum absolute value
103
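The root cause is that floating-point addition is not associative, so a different thread count (hence a different reduction order) typically changes the last bits of the result. A small single-machine illustration of the effect, assuming nothing about MKL itself:

```python
import math
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(10**6)

exact = math.fsum(x)                       # correctly rounded reference sum
orders = {
    "left-to-right": sum(x.tolist()),
    "2 chunks":      sum(float(c.sum()) for c in np.array_split(x, 2)),
    "4 chunks":      sum(float(c.sum()) for c in np.array_split(x, 4)),
}
for name, s in orders.items():
    print(f"{name:14s} error vs fsum: {s - exact:+.3e}")
# The "thread-like" chunked orders generally disagree with each other in the last
# bits -- none is wrong, but the answer is not reproducible across orders.
```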
Goals/Approaches for Reproducibility
• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, number of processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (not 2 or 3)
  – Use (very) high precision to get exact answer (not 2)
  – Prerounding technique (Nguyen, D.)
104
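The pre-rounding idea can be sketched in a few lines: round every summand to a common, data-dependent granularity so that all subsequent additions are exact and therefore order-independent. The one-bin toy below (granularity chosen from max|x_i| and n) trades accuracy for reproducibility and is far simpler than the multi-bin algorithm behind the Cray XC30 performance numbers that follow.

```python
import numpy as np

def reproducible_sum(x):
    """One-bin pre-rounding sketch: returns the same bits for any summation order."""
    x = np.asarray(x, dtype=np.float64)
    n = x.size
    m = np.max(np.abs(x))
    if m == 0.0:
        return 0.0
    # Quantum q chosen so every pre-rounded summand is an integer multiple of q and
    # all partial sums stay below 2^53 * q, hence every addition is exact.
    q = 2.0 ** (np.ceil(np.log2(m)) + np.ceil(np.log2(n)) - 52)
    return float(np.sum(np.round(x / q) * q))

rng = np.random.default_rng(2)
x = rng.standard_normal(10**6)
print(reproducible_sum(x) == reproducible_sum(x[::-1]))   # True: order-independent
print(sum(x.tolist()) == sum(x[::-1].tolist()))           # typically False
```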
Performance results on 1024-processor Cray XC30: 1.2x to 3.2x slowdown vs fastest code, for n = 1M
Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya
Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger
bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David
Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang
bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou
Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National
Instruments NEC Nokia NVIDIA Samsung Oracle
bull bebopcsberkeleyedu
Summary
Donrsquot Communichellip
106
Time to redesign all linear algebra n-body hellip algorithms and software
(and compilers)
Optimized Sparse Kernel Interface - OSKI
bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning
bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski
bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression
software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki
91
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
93
Example Classical Conjugate Gradient (CG)
SpMVs and dot products require communication in
each iteration
via CA Matrix Powers Kernel
Global reduction to compute G
94
Example CA-Conjugate Gradient
Local computations within inner loop require
no communication
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
96
Slower convergence due
to roundoff
Loss of accuracy due to roundoff
At s = 16 monomial basis is rank deficient Method breaks down
Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400
CA-CG (monomial)CG
machine precision
97
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit
bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square
bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank
matrices
Explicit (O(nnz)) Implicit (o(nnz))
Explicit (O(nnz)) CSR and variations Vision climate AMRhellip
Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries
Indices
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
101
bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010
ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver
demanded by customers (construction engineers) otherwise they donrsquot believe results
ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it
ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection
Reproducible Floating Point Computation
Absolute Error for Random Vectors
Same magnitude opposite signs
Intel MKL non-reproducibility
Relative Error for Orthogonal vectors
Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value
Sign notreproducible
103
bull Consider summation or dot productbull Goals
1 Same answer independent of layout processors order of summands
2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy
bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)
GoalsApproaches for Reproducibility
104
Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M
Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya
Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger
bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David
Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang
bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou
Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National
Instruments NEC Nokia NVIDIA Samsung Oracle
bull bebopcsberkeleyedu
Summary
Donrsquot Communichellip
106
Time to redesign all linear algebra n-body hellip algorithms and software
(and compilers)
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
93
Example Classical Conjugate Gradient (CG)
SpMVs and dot products require communication in
each iteration
via CA Matrix Powers Kernel
Global reduction to compute G
94
Example CA-Conjugate Gradient
Local computations within inner loop require
no communication
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
96
Slower convergence due
to roundoff
Loss of accuracy due to roundoff
At s = 16 monomial basis is rank deficient Method breaks down
Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400
CA-CG (monomial)CG
machine precision
97
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit
bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square
bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank
matrices
Explicit (O(nnz)) Implicit (o(nnz))
Explicit (O(nnz)) CSR and variations Vision climate AMRhellip
Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries
Indices
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
101
bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010
ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver
demanded by customers (construction engineers) otherwise they donrsquot believe results
ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it
ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection
Reproducible Floating Point Computation
Absolute Error for Random Vectors
Same magnitude opposite signs
Intel MKL non-reproducibility
Relative Error for Orthogonal vectors
Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value
Sign notreproducible
103
bull Consider summation or dot productbull Goals
1 Same answer independent of layout processors order of summands
2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy
bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)
GoalsApproaches for Reproducibility
104
Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M
Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya
Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger
bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David
Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang
bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou
Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National
Instruments NEC Nokia NVIDIA Samsung Oracle
bull bebopcsberkeleyedu
Summary
Donrsquot Communichellip
106
Time to redesign all linear algebra n-body hellip algorithms and software
(and compilers)
93
Example Classical Conjugate Gradient (CG)
SpMVs and dot products require communication in
each iteration
via CA Matrix Powers Kernel
Global reduction to compute G
94
Example CA-Conjugate Gradient
Local computations within inner loop require
no communication
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
96
Slower convergence due
to roundoff
Loss of accuracy due to roundoff
At s = 16 monomial basis is rank deficient Method breaks down
Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400
CA-CG (monomial)CG
machine precision
97
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit
bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square
bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank
matrices
Explicit (O(nnz)) Implicit (o(nnz))
Explicit (O(nnz)) CSR and variations Vision climate AMRhellip
Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries
Indices
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
101
bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010
ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver
demanded by customers (construction engineers) otherwise they donrsquot believe results
ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it
ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection
Reproducible Floating Point Computation
Absolute Error for Random Vectors
Same magnitude opposite signs
Intel MKL non-reproducibility
Relative Error for Orthogonal vectors
Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value
Sign notreproducible
103
bull Consider summation or dot productbull Goals
1 Same answer independent of layout processors order of summands
2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy
bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)
GoalsApproaches for Reproducibility
104
Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M
Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya
Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger
bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David
Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang
bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou
Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National
Instruments NEC Nokia NVIDIA Samsung Oracle
bull bebopcsberkeleyedu
Summary
Donrsquot Communichellip
106
Time to redesign all linear algebra n-body hellip algorithms and software
(and compilers)
via CA Matrix Powers Kernel
Global reduction to compute G
94
Example CA-Conjugate Gradient
Local computations within inner loop require
no communication
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
96
Slower convergence due
to roundoff
Loss of accuracy due to roundoff
At s = 16 monomial basis is rank deficient Method breaks down
Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400
CA-CG (monomial)CG
machine precision
97
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit
bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square
bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank
matrices
Explicit (O(nnz)) Implicit (o(nnz))
Explicit (O(nnz)) CSR and variations Vision climate AMRhellip
Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries
Indices
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
101
bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010
ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver
demanded by customers (construction engineers) otherwise they donrsquot believe results
ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it
ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection
Reproducible Floating Point Computation
Absolute Error for Random Vectors
Same magnitude opposite signs
Intel MKL non-reproducibility
Relative Error for Orthogonal vectors
Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value
Sign notreproducible
103
bull Consider summation or dot productbull Goals
1 Same answer independent of layout processors order of summands
2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy
bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)
GoalsApproaches for Reproducibility
104
Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M
Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya
Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger
bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David
Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang
bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou
Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National
Instruments NEC Nokia NVIDIA Samsung Oracle
bull bebopcsberkeleyedu
Summary
Donrsquot Communichellip
106
Time to redesign all linear algebra n-body hellip algorithms and software
(and compilers)
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
96
Slower convergence due
to roundoff
Loss of accuracy due to roundoff
At s = 16 monomial basis is rank deficient Method breaks down
Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400
CA-CG (monomial)CG
machine precision
97
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit
bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square
bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank
matrices
Explicit (O(nnz)) Implicit (o(nnz))
Explicit (O(nnz)) CSR and variations Vision climate AMRhellip
Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries
Indices
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
101
bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010
ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver
demanded by customers (construction engineers) otherwise they donrsquot believe results
ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it
ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection
Reproducible Floating Point Computation
Absolute Error for Random Vectors
Same magnitude opposite signs
Intel MKL non-reproducibility
Relative Error for Orthogonal vectors
Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value
Sign notreproducible
103
bull Consider summation or dot productbull Goals
1 Same answer independent of layout processors order of summands
2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy
bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)
GoalsApproaches for Reproducibility
104
Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M
Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya
Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger
bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David
Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang
bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou
Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National
Instruments NEC Nokia NVIDIA Samsung Oracle
bull bebopcsberkeleyedu
Summary
Donrsquot Communichellip
106
Time to redesign all linear algebra n-body hellip algorithms and software
(and compilers)
96
Slower convergence due
to roundoff
Loss of accuracy due to roundoff
At s = 16 monomial basis is rank deficient Method breaks down
Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400
CA-CG (monomial)CG
machine precision
97
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit
bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square
bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank
matrices
Explicit (O(nnz)) Implicit (o(nnz))
Explicit (O(nnz)) CSR and variations Vision climate AMRhellip
Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries
Indices
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
101
bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010
ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver
demanded by customers (construction engineers) otherwise they donrsquot believe results
ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it
ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection
Reproducible Floating Point Computation
Absolute Error for Random Vectors
Same magnitude opposite signs
Intel MKL non-reproducibility
Relative Error for Orthogonal vectors
Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value
Sign notreproducible
103
bull Consider summation or dot productbull Goals
1 Same answer independent of layout processors order of summands
2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy
bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)
GoalsApproaches for Reproducibility
104
Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M
Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya
Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger
bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David
Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang
bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou
Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National
Instruments NEC Nokia NVIDIA Samsung Oracle
bull bebopcsberkeleyedu
Summary
Donrsquot Communichellip
106
Time to redesign all linear algebra n-body hellip algorithms and software
(and compilers)
97
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit
bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square
bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank
matrices
Explicit (O(nnz)) Implicit (o(nnz))
Explicit (O(nnz)) CSR and variations Vision climate AMRhellip
Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries
Indices
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
101
bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010
ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver
demanded by customers (construction engineers) otherwise they donrsquot believe results
ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it
ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection
Reproducible Floating Point Computation
Absolute Error for Random Vectors
Same magnitude opposite signs
Intel MKL non-reproducibility
Relative Error for Orthogonal vectors
Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value
Sign notreproducible
103
bull Consider summation or dot productbull Goals
1 Same answer independent of layout processors order of summands
2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy
bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)
GoalsApproaches for Reproducibility
104
Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M
Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya
Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger
bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David
Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang
bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou
Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National
Instruments NEC Nokia NVIDIA Samsung Oracle
bull bebopcsberkeleyedu
Summary
Donrsquot Communichellip
106
Time to redesign all linear algebra n-body hellip algorithms and software
(and compilers)
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit
bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square
bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank
matrices
Explicit (O(nnz)) Implicit (o(nnz))
Explicit (O(nnz)) CSR and variations Vision climate AMRhellip
Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries
Indices
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
101
bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010
ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver
demanded by customers (construction engineers) otherwise they donrsquot believe results
ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it
ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection
Reproducible Floating Point Computation
Absolute Error for Random Vectors
Same magnitude opposite signs
Intel MKL non-reproducibility
Relative Error for Orthogonal vectors
Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value
Sign notreproducible
103
bull Consider summation or dot productbull Goals
1 Same answer independent of layout processors order of summands
2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy
bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)
GoalsApproaches for Reproducibility
104
Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M
Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya
Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger
bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David
Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang
bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou
Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National
Instruments NEC Nokia NVIDIA Samsung Oracle
bull bebopcsberkeleyedu
Summary
Donrsquot Communichellip
106
Time to redesign all linear algebra n-body hellip algorithms and software
(and compilers)
What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit
bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square
bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank
matrices
Explicit (O(nnz)) Implicit (o(nnz))
Explicit (O(nnz)) CSR and variations Vision climate AMRhellip
Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries
Indices
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
101
bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010
ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver
demanded by customers (construction engineers) otherwise they donrsquot believe results
ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it
ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection
Reproducible Floating Point Computation
Absolute Error for Random Vectors
Same magnitude opposite signs
Intel MKL non-reproducibility
Relative Error for Orthogonal vectors
Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value
Sign notreproducible
103
bull Consider summation or dot productbull Goals
1 Same answer independent of layout processors order of summands
2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy
bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)
GoalsApproaches for Reproducibility
104
Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M
Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya
Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger
bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David
Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang
bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou
Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National
Instruments NEC Nokia NVIDIA Samsung Oracle
bull bebopcsberkeleyedu
Summary
Donrsquot Communichellip
106
Time to redesign all linear algebra n-body hellip algorithms and software
(and compilers)
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms
ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious
ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)
bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo
bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity
101
bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010
ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver
demanded by customers (construction engineers) otherwise they donrsquot believe results
ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it
ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection
Reproducible Floating Point Computation
Absolute Error for Random Vectors
Same magnitude opposite signs
Intel MKL non-reproducibility
Relative Error for Orthogonal vectors
Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value
Sign notreproducible
103
bull Consider summation or dot productbull Goals
1 Same answer independent of layout processors order of summands
2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy
bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)
GoalsApproaches for Reproducibility
104
Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M
Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya
Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger
bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David
Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang
bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou
Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National
Instruments NEC Nokia NVIDIA Samsung Oracle
bull bebopcsberkeleyedu
Summary
Donrsquot Communichellip
106
Time to redesign all linear algebra n-body hellip algorithms and software
(and compilers)
101
bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010
ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver
demanded by customers (construction engineers) otherwise they donrsquot believe results
ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it
ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection
Reproducible Floating Point Computation
Absolute Error for Random Vectors
Same magnitude opposite signs
Intel MKL non-reproducibility
Relative Error for Orthogonal vectors
Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value
Sign notreproducible
103
bull Consider summation or dot productbull Goals
1 Same answer independent of layout processors order of summands
2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy
bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)
GoalsApproaches for Reproducibility
104
Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M
Summary
Don't Communic…
Time to redesign all linear algebra, n-body, … algorithms and software
(and compilers)
106