
Implementing Communication-Avoiding Algorithms

Jim Demmel
EECS & Math Departments
UC Berkeley

Why avoid communication?

• Communication = moving data
  – Between levels of the memory hierarchy
  – Between processors over a network
• Running time of an algorithm is the sum of 3 terms:
  – # flops × time_per_flop
  – # words moved / bandwidth        (communication)
  – # messages × latency             (communication)
• time_per_flop << 1/bandwidth << latency
  – Gaps growing exponentially with time [FOSC]
• Avoid communication to save time
• Same story for energy:
  – Avoid communication to save energy
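A minimal Python sketch of this three-term cost model; the machine parameters below are made-up illustrative values, not measurements of any real system:

    # Three-term runtime model: T = flops*gamma + words*beta + messages*alpha
    gamma = 1e-11   # seconds per flop            (assumed)
    beta  = 1e-9    # seconds per word moved, i.e. 1/bandwidth (assumed)
    alpha = 1e-6    # seconds per message, i.e. latency        (assumed)

    def runtime(flops, words, messages):
        return flops * gamma + words * beta + messages * alpha

    # Example: 1 Gflop of work, 10 M words moved, 10 K messages
    print(runtime(1e9, 1e7, 1e4))   # the communication terms dominate the flop term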

Goals

• Redesign algorithms to avoid communication
  – Between all memory hierarchy levels
  – L1, L2, DRAM, network, etc.
• Attain lower bounds if possible
  – Current algorithms often far from lower bounds
  – Large speedups and energy savings possible

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity


Lower bound for all "n^3-like" linear algebra

• Holds for
  – Matmul, BLAS, LU, QR, eig, SVD, tensor contractions, …
  – Some whole programs (sequences of these operations, no matter how individual ops are interleaved, e.g. A^k)
  – Dense and sparse matrices (where #flops << n^3)
  – Sequential and parallel algorithms
  – Some graph-theoretic algorithms (e.g. Floyd-Warshall)
• Let M = "fast" memory size (per processor)
    #words_moved (per processor) = Ω(#flops (per processor) / M^(1/2))
    #messages_sent (per processor) = Ω(#flops (per processor) / M^(3/2))
• Parallel case: assume either load or memory balanced
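A small sketch evaluating these bounds for classical dense n x n matmul (2n^3 flops); the fast-memory size M is an assumed value and constants are omitted:

    # words >= Omega(flops / sqrt(M)),  messages >= Omega(flops / M^(3/2))
    def lower_bounds(flops, M):
        words = flops / M**0.5
        messages = flops / M**1.5
        return words, messages

    n = 4096
    M = 2**20                 # assumed fast memory: 1 M words per processor
    flops = 2 * n**3          # classical dense matmul
    print(lower_bounds(flops, M))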

Lower bound for all "n^3-like" linear algebra (2)

• Holds for the same operations and programs as above
• Let M = "fast" memory size (per processor)
    #words_moved (per processor) = Ω(#flops (per processor) / M^(1/2))
    #messages_sent ≥ #words_moved / largest_message_size
• Parallel case: assume either load or memory balanced

Lower bound for all "n^3-like" linear algebra (3)

• Holds for the same operations and programs as above
• Let M = "fast" memory size (per processor)
    #words_moved (per processor) = Ω(#flops (per processor) / M^(1/2))
    #messages_sent (per processor) = Ω(#flops (per processor) / M^(3/2))
• Parallel case: assume either load or memory balanced
• SIAM SIAG/Linear Algebra Prize, 2012: Ballard, D., Holtz, Schwartz

Limits to parallel scaling (1/2)

• Consider dense case, #flops_per_proc = n^3/P
  – #Words = Ω(n^3/(P·M^(1/2)))
  – #Messages = Ω(n^3/(P·M^(3/2)))
• What is M? Must be at least n^2/P to hold data
  – #Words = Ω(n^2/P^(1/2))
  – #Messages = Ω(P^(1/2))
• But if M fixed, looks like perfect strong scaling in time
  – #Flops, #Words, #Messages all proportional to 1/P
• Ditto for energy, if we count energy costs in joules …
  – Per flop, per word moved, per message
  – Per word per second, for data stored in memory M
  – Per second, for leakage, cooling, …
• How big can we make P? and M?

Limits to parallel scaling (2/2)

• Consider dense case, #flops_per_proc = n^3/P
  – #Words = Ω(n^3/(P·M^(1/2)))
  – #Messages = Ω(n^3/(P·M^(3/2)))
• How big can we make P? and M?
• Assume we start with 1 copy of inputs A and B
  – Otherwise no communication may be needed
• Thm: #Words = Ω(n^2/P^(2/3)), independent of M
• Reached when M = n^2/P^(2/3) too, or P = n^3/M^(3/2), and #Messages = Ω(1) (log P in practice)
• Attained by 2.5D algorithm when c = P^(1/3) ("3D alg")
• Can keep increasing P until P = n^3; then #Words = #Messages = Ω(1) (log n in practice)

Can we attain these lower bounds?

• Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these bounds?
  – Often not
• If not, are there other algorithms that do?
  – Yes, for much of dense linear algebra
  – New algorithms, with new numerical properties, new ways to encode answers, new data structures
  – Not just loop transformations (need those too)
• Only a few sparse algorithms so far
• Lots of work in progress
  – Algorithms, Energy, Heterogeneous Processors, …


2.5D Matrix Multiplication

• Assume we can fit cn^2/P data per processor, c > 1
• Processors form a (P/c)^(1/2) x (P/c)^(1/2) x c grid

[Figure: processor grid of dimensions (P/c)^(1/2) x (P/c)^(1/2) x c. Example: P = 32, c = 2]

2.5D Matrix Multiplication

• Assume we can fit cn^2/P data per processor, c > 1
• Processors form a (P/c)^(1/2) x (P/c)^(1/2) x c grid, indexed by (i,j,k)

Initially P(i,j,0) owns A(i,j) and B(i,j), each of size n(c/P)^(1/2) x n(c/P)^(1/2)

(1) P(i,j,0) broadcasts A(i,j) and B(i,j) to P(i,j,k)
(2) Processors at level k perform 1/c-th of SUMMA, i.e. 1/c-th of Σ_m A(i,m)·B(m,j)
(3) Sum-reduce partial sums Σ_m A(i,m)·B(m,j) along the k-axis, so P(i,j,0) owns C(i,j)
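A serial numpy sketch of the arithmetic in steps (2)-(3): each of the c layers computes a partial product over its own slice of the summation index, and the layer results are then summed (the reduce along k). Communication and data placement are not modeled; this only illustrates why the answer is unchanged:

    import numpy as np

    def matmul_25d_layers(A, B, c):
        """Each of c 'layers' multiplies its own slice of the summation index;
        summing the partial products is the sum-reduce along the k-axis."""
        ks = np.array_split(np.arange(A.shape[1]), c)
        partials = [A[:, s] @ B[s, :] for s in ks]   # step (2): 1/c-th of SUMMA per layer
        return sum(partials)                          # step (3): reduce over layers

    A = np.random.rand(6, 6); B = np.random.rand(6, 6)
    assert np.allclose(matmul_25d_layers(A, B, c=3), A @ B)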

2.5D Matmul on BG/P, 16K nodes / 64K cores, c = 16 copies

[Figure: execution-time comparison; annotations: 2.7x faster, 12x faster]

• Distinguished Paper Award, EuroPar'11 (Solomonik, D.)
• SC'11 paper by Solomonik, Bhatele, D.

Perfect Strong Scaling – in Time and Energy (1/2)

• Every time you add a processor, you should use its memory M too
• Start with minimal number of procs: P·M = 3n^2
• Increase P by a factor of c: total memory increases by a factor of c
• Notation for timing model:
  – γ_T, β_T, α_T = secs per flop, per word_moved, per message of size m
• T(cP) = n^3/(cP) · [ γ_T + β_T/M^(1/2) + α_T/(m·M^(1/2)) ] = T(P)/c
• Notation for energy model:
  – γ_E, β_E, α_E = joules for same operations
  – δ_E = joules per word of memory used per sec
  – ε_E = joules per sec, for leakage, etc.
• E(cP) = cP · { n^3/(cP) · [ γ_E + β_E/M^(1/2) + α_E/(m·M^(1/2)) ] + δ_E·M·T(cP) + ε_E·T(cP) } = E(P)
• Perfect scaling extends to N-body, Strassen, …
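A quick numerical check of the model above; all machine constants are assumed placeholder values. Scaling P by c while keeping M fixed per processor divides T by c and leaves E unchanged:

    gamma_T, beta_T, alpha_T = 1e-11, 1e-9, 1e-6   # sec per flop / word / message (assumed)
    gamma_E, beta_E, alpha_E = 1e-10, 1e-9, 1e-6   # joules per flop / word / message (assumed)
    delta_E, eps_E = 1e-12, 1.0                    # joules/word/sec in memory; joules/sec leakage (assumed)
    n, m = 10_000, 1_000                           # problem size; message size in words (assumed)

    def T(P, M):
        return (n**3 / P) * (gamma_T + beta_T / M**0.5 + alpha_T / (m * M**0.5))

    def E(P, M):
        return P * ((n**3 / P) * (gamma_E + beta_E / M**0.5 + alpha_E / (m * M**0.5))
                    + delta_E * M * T(P, M) + eps_E * T(P, M))

    P0 = 64
    M = 3 * n**2 // P0                 # start with P*M = 3n^2
    for c in (1, 2, 4):
        print(c, c * T(c * P0, M) / T(P0, M), E(c * P0, M) / E(P0, M))   # both ratios = 1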

Perfect Strong Scaling – in Time and Energy (2/2)

• T(cP) = n^3/(cP) · [ γ_T + β_T/M^(1/2) + α_T/(m·M^(1/2)) ] = T(P)/c
• E(cP) = cP · { n^3/(cP) · [ γ_E + β_E/M^(1/2) + α_E/(m·M^(1/2)) ] + δ_E·M·T(cP) + ε_E·T(cP) } = E(P)
• Can use these formulas to answer many questions, such as
  – How to choose p and M to minimize energy E needed for computation?
  – Given max allowed runtime T, what is minimum energy E needed to achieve it?
  – Given max allowed energy E, what is the minimum runtime T attainable?
  – Can we minimize the average power P = E/T?
  – Given target energy efficiency, what architectural parameters are needed to achieve it?
    • Can we attain 75 Gflops/Watt?
    • Can we attain an exaflop for 20 MWatts?

Handling Heterogeneity

• Suppose each of P processors could differ
  – γ_i = sec/flop, β_i = sec/word, α_i = sec/message, M_i = memory
• What is the optimal assignment of work F_i to minimize time?
  – T_i = F_i·γ_i + F_i·β_i/M_i^(1/2) + F_i·α_i/M_i^(3/2)
        = F_i·[γ_i + β_i/M_i^(1/2) + α_i/M_i^(3/2)] = F_i·ξ_i
  – Choose F_i so Σ_i F_i = n^3, minimizing T = max_i T_i
  – Answer: F_i = n^3·(1/ξ_i)/Σ_j(1/ξ_j) and T = n^3/Σ_j(1/ξ_j)   (see the sketch below)
• Optimal Algorithm for n x n matmul
  – Recursively divide into 8 half-sized subproblems
  – Assign subproblems to processor i to add up to F_i flops
• Works for Strassen, other algorithms, …
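A small sketch of the optimal work split F_i = n^3·(1/ξ_i)/Σ_j(1/ξ_j) from this slide, using made-up per-processor parameters:

    # Per-processor parameters (assumed illustrative values): sec/flop, sec/word, sec/message, memory
    procs = [(1e-11, 1e-9, 1e-6, 2**20),
             (2e-11, 4e-9, 2e-6, 2**18),
             (5e-12, 1e-9, 1e-6, 2**22)]
    n = 4096

    xi = [g + b / M**0.5 + a / M**1.5 for (g, b, a, M) in procs]   # effective sec per flop
    total_inv = sum(1.0 / x for x in xi)
    F = [n**3 * (1.0 / x) / total_inv for x in xi]                  # flops assigned to each processor
    T = n**3 / total_inv                                            # common finish time

    assert abs(sum(F) - n**3) < 1e-3 * n**3
    print([f * x for f, x in zip(F, xi)], T)   # each F_i * xi_i equals T: all finish together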

Application to Tensor Contractions

• Ex: C(i,j,k) = Σ_{m,n} A(i,j,m,n)·B(m,n,k)
  – Communication lower bounds apply
• Complex symmetries possible
  – Ex: B(m,n,k) = B(k,m,n) = …
  – d-fold symmetry can save up to d-fold flops/memory
• Heavily used in electronic structure calculations
  – Ex: NWChem
• CTF: Cyclops Tensor Framework
  – Exploits 2.5D algorithms, symmetries
  – Solomonik, Hammond, Matthews

[Figure: contraction C(i,j,k) = Σ_m A(i,j,m)·B(m,k), with A 3-fold symmetric, B 2-fold symmetric, C 2-fold symmetric]

Application to Tensor Contractions (2)

• Ex: C(i,j,k) = Σ_{m,n} A(i,j,m,n)·B(m,n,k)
  – Communication lower bounds apply
• Complex symmetries possible
  – Ex: B(m,n,k) = B(k,m,n) = …
  – d-fold symmetry can save up to d-fold flops/memory
• Heavily used in electronic structure calculations
  – Ex: NWChem, for the coupled cluster (CC) approach to the Schroedinger equation
• CTF: Cyclops Tensor Framework
  – Exploits 2.5D algorithms, symmetries
  – Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6
  – Solomonik, Hammond, Matthews

Communication Lower Bounds for Strassen-like matmul algorithms

• Classical O(n^3) matmul:      #words_moved = Ω(M·(n/M^(1/2))^3 / P)
• Strassen's O(n^lg7) matmul:   #words_moved = Ω(M·(n/M^(1/2))^lg7 / P)
• Strassen-like O(n^ω) matmul:  #words_moved = Ω(M·(n/M^(1/2))^ω / P)

• Proof: graph expansion (different from classical matmul)
  – Strassen-like: DAG must be "regular" and connected
• Extends up to M = n^2/P^(2/ω)
• Extends to rectangular case: multiply (m x n)·(n x p) in q mults
  – #words_moved = Ω(#flops / M^(log_{mp} q − 1))
• Best Paper Prize (SPAA'11), Ballard, D., Holtz, Schwartz; also appeared in JACM
• Is the lower bound attainable?

Communication Avoiding Parallel Strassen (CAPS)

• BFS step: runs all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory
• DFS step: runs all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory
• CAPS: if enough memory and P ≥ 7, then BFS step, else DFS step
• Best way to interleave BFS and DFS is a tuning parameter

Performance Benchmarking, Strong Scaling Plot
Franklin (Cray XT4), n = 94080

[Figure: strong-scaling performance plot]
• Speedups: 24–184% (over previous Strassen-based algorithms)
• Invited to appear as Research Highlight in CACM

Strassen-like beyond matmul

• Thm (D., Dumitriu, Holtz '07): Any Strassen-like O(n^ω) matmul algorithm can be used to build a numerically stable O(n^(ω+η)) algorithm, for any η > 0, for Ax=b, least squares, eig, SVD, …
  – η > 0 needed to deal with numerical stability
  – Strassen already stable, so η = 0
• Thm: For sequential versions of these algorithms,
    #Words_moved = O(n^(ω+η)/M^((ω+η)/2 − 1) + n^2 log n),
  i.e. they attain the expected lower bound
• Ballard, D., Holtz, Schwartz

Cache and Network Oblivious Algorithms

• Motivation: minimize communication at every level of a hierarchical system without tuning parameters (in theory)
  – Not always: 2.5D Matmul on BG/P was topology aware
• CAPS: divide-and-conquer, choose BFS or DFS to adapt to #processors, available memory
• CARMA
  – Divide-and-conquer classical matmul: divide the largest of the 3 dimensions to create two subproblems
  – Choose BFS or DFS to adapt to #processors, available memory   (see the sketch below)
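A serial recursive sketch of CARMA's splitting rule (always split the largest of the three dimensions m, k, n); the BFS/DFS choice and the parallel data layout are omitted, and the base-case size is an arbitrary assumption:

    import numpy as np

    def carma(A, B, base=64):
        """Recursive classical matmul that always splits the largest of m, k, n."""
        m, k = A.shape
        _, n = B.shape
        if max(m, k, n) <= base:
            return A @ B
        if m >= k and m >= n:            # split rows of A
            h = m // 2
            return np.vstack([carma(A[:h], B, base), carma(A[h:], B, base)])
        if n >= k:                        # split columns of B
            h = n // 2
            return np.hstack([carma(A, B[:, :h], base), carma(A, B[:, h:], base)])
        h = k // 2                        # split the shared (inner) dimension, add results
        return carma(A[:, :h], B[:h], base) + carma(A[:, h:], B[h:], base)

    A = np.random.rand(100, 300); B = np.random.rand(300, 50)
    assert np.allclose(carma(A, B), A @ B)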

CARMA Performance: Distributed Memory

[Figure: log-log strong-scaling plot, Square case: m = k = n = 6144; curves: Peak, CARMA, ScaLAPACK]
Cray XE6 (Hopper), each node 2 x 12-core, 4 x NUMA

CARMA Performance: Distributed Memory

[Figure: log-log strong-scaling plot, Inner Product case: m = n = 192, k = 6,291,456; curves: Peak, CARMA, ScaLAPACK]
Cray XE6 (Hopper), each node 2 x 12-core, 4 x NUMA

CARMA Performance: Shared Memory

[Figure: plot (log x-axis, linear y-axis), Square case: m = k = n; curves: Peak (single), Peak (double), MKL (single), CARMA (single), MKL (double), CARMA (double)]
Intel Emerald: 4 Intel Xeon X7560 x 8 cores, 4 x NUMA

CARMA Performance: Shared Memory

[Figure: plot (log x-axis, linear y-axis), Inner Product case: m = n = 64; curves: MKL (single), CARMA (single), MKL (double), CARMA (double)]
Intel Emerald: 4 Intel Xeon X7560 x 8 cores, 4 x NUMA

Why is CARMA Faster in Shared Memory? L3 Cache Misses

[Figure: L3 cache misses, Shared Memory Inner Product (m = n = 64, k = 524288); CARMA incurs 97% fewer misses and 86% fewer misses than MKL in the two precisions shown]


One-sided Factorizations (LU, QR), so far

• Classical Approach:
    for i = 1 to n
      update column i
      update trailing matrix
  #words_moved = O(n^3)
• Blocked Approach (LAPACK):
    for i = 1 to n/b
      update block i of b columns
      update trailing matrix
  #words_moved = O(n^3/M^(1/3))
• Recursive Approach:
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        update right half of A
        factor(right half of A)
  #words_moved = O(n^3/M^(1/2))
• None of these approaches minimizes #messages
• Parallel case: Partial Pivoting => n reductions
• Need another idea

TSQR: An Architecture-Dependent Algorithm

[Figure: reduction trees for TSQR on W = [W0; W1; W2; W3]:
 – Parallel (binary tree): local QRs give R00, R10, R20, R30; pairs combine to R01, R11; final factor R02
 – Sequential/Streaming (flat tree): R00 from W0, then folding in W1 gives R01, W2 gives R02, W3 gives R03
 – Dual Core: a hybrid of the two trees]

• Can choose reduction tree dynamically
• Same idea for Multicore, Multisocket, Multirack, Multisite, Out-of-core
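A serial numpy sketch of the parallel (binary-tree) TSQR reduction on the R factors; only the final R is formed here, and the Q factors that a full TSQR would retain are discarded for brevity:

    import numpy as np

    def tsqr_R(W_blocks):
        """Binary-tree TSQR: QR each block, then repeatedly QR stacked pairs of R factors."""
        Rs = [np.linalg.qr(W)[1] for W in W_blocks]          # local QRs (leaf level)
        while len(Rs) > 1:
            pairs = [np.vstack(Rs[i:i+2]) for i in range(0, len(Rs), 2)]
            Rs = [np.linalg.qr(p)[1] for p in pairs]         # combine pairs up the tree
        return Rs[0]

    W = np.random.rand(4000, 50)
    R_tree = tsqr_R(np.array_split(W, 4))
    R_ref = np.linalg.qr(W)[1]
    # R is unique up to the signs of its rows, so compare absolute values
    assert np.allclose(np.abs(R_tree), np.abs(R_ref))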

Back to LU: Using a similar idea for TSLU as TSQR: use a reduction tree to do "Tournament Pivoting"

W_{n x b} = [W1; W2; W3; W4]
  W1 = P1·L1·U1 → choose b pivot rows of W1, call them W1'
  W2 = P2·L2·U2 → choose b pivot rows of W2, call them W2'
  W3 = P3·L3·U3 → choose b pivot rows of W3, call them W3'
  W4 = P4·L4·U4 → choose b pivot rows of W4, call them W4'

[W1'; W2'] = P12·L12·U12 → choose b pivot rows, call them W12'
[W3'; W4'] = P34·L34·U34 → choose b pivot rows, call them W34'

[W12'; W34'] = P1234·L1234·U1234 → choose b pivot rows

Go back to W and use these b pivot rows (move them to the top, do LU without pivoting)
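A serial sketch of one tournament using scipy's LU with partial pivoting to pick b candidate pivot rows from each block, then playing the winners off pairwise; this illustrates only the selection logic, not the parallel reduction tree or the final panel factorization:

    import numpy as np
    from scipy.linalg import lu

    def choose_pivot_rows(W, b):
        """Return the b rows of W selected by LU with partial pivoting."""
        P, L, U = lu(W)                      # W = P @ L @ U
        order = np.argmax(P, axis=0)[:b]     # original rows moved to the top b positions
        return W[order]

    def tournament_pivot_rows(W, b, nblocks=4):
        blocks = np.array_split(W, nblocks)
        winners = [choose_pivot_rows(Wi, b) for Wi in blocks]     # leaf round
        while len(winners) > 1:                                   # pairwise playoffs
            winners = [choose_pivot_rows(np.vstack(winners[i:i+2]), b)
                       for i in range(0, len(winners), 2)]
        return winners[0]                                          # final b pivot rows

    W = np.random.rand(1024, 8)
    print(tournament_pivot_rows(W, b=8).shape)   # (8, 8)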

Minimizing Communication in TSLU

[Figure: the same reduction trees as for TSQR, with a local LU at each node of the tree:
 – Parallel (binary tree), Sequential/Streaming (flat tree), Dual Core (hybrid)]

• Can choose the reduction tree dynamically to match the architecture, as before

Making TSLU Numerically Stable

• Details matter
  – Going up the tree, we could do LU either on original rows of A (tournament pivoting) or on computed rows of U
  – Only tournament pivoting is stable
• "Thm": New scheme is as stable as Partial Pivoting (GEPP) in the following sense: we get the same Schur complements as GEPP applied to a different input matrix whose entries are blocks taken from the input A
• Why just a "Thm"?

Stability of LU using TSLU: CALU

• Empirical testing
  – Both random matrices and "special ones"
  – Both binary tree (BCALU) and flat-tree (FCALU)
  – 3 metrics: ||PA−LU||/||A||, normwise and componentwise backward errors
  – See [D., Grigori, Xiang, 2010] for details

Why is stability of TSLU just a "Thm"?

• Proof is correct, in exact arithmetic
• Experiment:
  – Generate 100 random 6x6, rank-3 matrices in Matlab
  – [L,U,P] = lu(A), do LU without pivoting on P·A, compare L factors: are they the same?
    • Compute ||L – Lnp||: a few 0's, a few ∞'s, a few NaNs; the rest mostly O(1)
  – Why? Floating point is nonassociative; doing arithmetic in a different order gives different rounding errors
  – Same experiment with rank-6 matrices: ||L – Lnp|| usually nonzero, O(macheps)
  – Same experiment with 20x20 rank-4 matrices: ||L – Lnp|| often O(10^3)
• Much harder to break TSLU, but possible
  – Occurred when using TSLU to factorize a low-rank subdiagonal panel in a symmetric-indefinite factorization

Fixing TSLU

• Run TSLU, quickly test for stability, fix if necessary (rare)
  – Test conditioning of U; if not tiny (usual case), proceed, else
  – Compute ||L||; if not big (usual case), proceed, else
  – Factor A = QR using TSQR, then
  – Factor Q = PLU using TSLU, then
  – A = P·L·(U·R), with U·R as the upper triangular factor
• Last topic in lecture: how to guarantee floating point reproducibility
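A hedged sketch of the fallback logic above, with numpy/scipy stand-ins for TSQR/TSLU and arbitrary thresholds (the tests and constants are assumptions for illustration only):

    import numpy as np
    from scipy.linalg import lu, qr

    def lu_with_fallback(A, cond_tol=1e8, growth_tol=1e8):
        """Try plain LU first; if U looks ill-conditioned or L has grown too much,
        redo via A = QR, Q = P L U, so that A = P L (U R)."""
        P, L, U = lu(A)
        ok = np.linalg.cond(U) < cond_tol and np.abs(L).max() < growth_tol
        if ok:
            return P, L, U
        Q, R = qr(A)                 # stand-in for TSQR
        P, L, U = lu(Q)              # stand-in for TSLU applied to Q
        return P, L, U @ R           # upper triangular factor is U·R

    A = np.random.rand(300, 300)
    P, L, U = lu_with_fallback(A)
    assert np.allclose(P @ L @ U, A)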

2D CALU with Tournament Pivoting

[Figure: 2D block layout for CALU with tournament pivoting]

2.5D CALU with Tournament Pivoting (c = 4 copies)

[Figure: 2.5D layout for CALU with 4 replicated copies]

Exascale Machine Parameters (Source: DOE Exascale Workshop)

• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node

Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU

[Figure: predicted speedup map; x-axis log2(p), y-axis log2(n^2/p) = log2(memory_per_proc); up to 29x speedup]

2.5D vs 2D LU, With and Without Pivoting

[Figure: performance comparison of 2.5D and 2D LU, with and without pivoting]

Other CA algorithms for Ax=b, least squares (1/3)

• A symmetric and indefinite
  – Seek a factorization that retains symmetry: P·A·P^T = L·D·L^T, with D "simple"
    • Save 1/2 the flops, preserve inertia
  – Usual approach: Bunch-Kaufman
    • D block diagonal with 1x1 and 2x2 blocks
    • Pivot search down column, along row (lots of communication)
  – Alternative: Aasen
    • D = tridiagonal = T
    • Two steps:
      – P·A·P^T = L·T·L^T where T is banded, using TSLU
        [Figure: banded matrix T]
      – Solve/factor the narrow band problem with T
    • Up to 2.8x faster than MKL; Best Paper at IPDPS'13

Other CA algorithms for Ax=b, least squares (2/3)

• Minimizing bandwidth and latency for sequential GEPP
  – So far, could not do partial pivoting and minimize #messages, just #words
  – Challenge:
    • Column layout good for choosing pivots, bad for matmul
    • Blocked layout good for matmul, bad for choosing pivots
  – Solution: use both layouts, switching between them
    • "Shape Morphing LU", or SMLU

• Recursive LU (column layout):
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        update right half of A
        factor(right half of A)
  #Words = O(n^3/M^(1/2)), #Messages = O(n^3/M)

• Shape Morphing LU:
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        reshape to recursive block format
        update right half of A
        reshape to columnwise format
        factor(right half of A)
  #Words = O(n^3/M^(1/2)), #Messages = O(n^3/M^(3/2))

Other CA algorithms for Ax=b, least squares (3/3)

• Need for pivoting arises beyond LU, in QR
  – Choose permutation P so that the leading columns of A·P = Q·R span the column space of A: Rank Revealing QR (RRQR)
  – Usual approach, like Partial Pivoting:
    • Put longest column first, update rest of matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit the lower bound
  – Use Tournament Pivoting
    • Each round of the tournament selects the best b columns from two groups of b columns, either using the usual approach or something better (Gu/Eisenstat)
    • Thm: This approach "reveals the rank" of A in the sense that the leading r x r submatrix of R has singular values "near" the largest r singular values of A; ditto for the trailing submatrix
  – Idea extends to other pivoting schemes
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDL^T with complete pivoting


What about sparse matrices? (1/3)

• If the matrix quickly becomes dense, use a dense algorithm
• Ex: All-Pairs Shortest Path using Floyd-Warshall
• Similar to matmul: let D = A, then

    for k = 1:n
      for i = 1:n
        for j = 1:n
          D(i,j) = min(D(i,j), D(i,k) + D(k,j))

• But can't reorder the outer loop for 2.5D; need another idea
• Abbreviate D(i,j) = min(D(i,j), min_k(A(i,k) + B(k,j))) by D = A⊗B
  – Dependencies ok, 2.5D works, just a different semiring
• Kleene's Algorithm (divide-and-conquer APSP):

    D = DC-APSP(A, n)
      D = A
      Partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
      D11 = DC-APSP(D11, n/2)
      D12 = D11 ⊗ D12
      D21 = D21 ⊗ D11
      D22 = D21 ⊗ D12
      D22 = DC-APSP(D22, n/2)
      D21 = D22 ⊗ D21
      D12 = D12 ⊗ D22
      D11 = D12 ⊗ D21
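A small numpy sketch of the min-plus "matmul" D = A⊗B that Kleene's algorithm builds on (as defined above, ⊗ also takes the elementwise min with the existing D):

    import numpy as np

    def minplus(D, A, B):
        """D(i,j) = min(D(i,j), min_k A(i,k) + B(k,j)) -- matmul over the (min, +) semiring."""
        # A[:, :, None] + B[None, :, :] has entries A(i,k)+B(k,j); take the min over k
        return np.minimum(D, (A[:, :, None] + B[None, :, :]).min(axis=1))

    INF = np.inf
    A = np.array([[0, 3, INF],
                  [INF, 0, 1],
                  [2, INF, 0]], dtype=float)
    D = A.copy()
    for _ in range(2):            # repeated squaring: paths of length up to 4
        D = minplus(D, D, D)
    print(D)                      # all-pairs shortest path distances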

Performance of 2.5D APSP using Kleene

[Figure: strong scaling on Hopper (Cray XE6 with 1024 nodes = 24,576 cores); annotations: 6.2x speedup, 2x speedup]

What about sparse matrices? (2/3)

• If parts of the matrix become dense, optimize those
• Ex: Cholesky on a matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): If all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on a w x w submatrix
• Thm: #Words_moved = Ω(w^3/M^(1/2)), etc.
• Thm (George '73): Nested dissection gives optimal ordering for 2D grid, 3D grid, similar matrices
  – w = n for a 2D n x n grid, w = n^2 for a 3D n x n x n grid
• Sequential multifrontal Cholesky attains the bounds
• PSPASES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)

What about sparse matrices? (3/3)

• If the matrix stays very sparse, the lower bound is unattainable; need a new one
• Ex: A, B both diagonal: no communication in the parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), i.i.d.
• Assumption: the algorithm is sparsity-independent: assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
    #Words_moved = Ω(min( d·n/P^(1/2), d^2·n/P ))
  – Proof exploits the fact that reuse of entries of C = A·B is unlikely
• Contrast general lower bound: #Words_moved = Ω(d^2·n/(P·M^(1/2)))
• Attained by a divide-and-conquer algorithm that splits matrices along the dimensions most likely to minimize cost


Symmetric Eigenproblem and SVD

• Usual approach for A = A^T (SVD similar)
  – A → Q^T·A·Q = T, where Q orthogonal, T tridiagonal
  – T → U^T·T·U = Λ, where U orthogonal, Λ diagonal
  – Q·U's columns are eigenvectors, Λ the eigenvalues
  – Dense → Tridiagonal → Diagonal
  – Only half BLAS3, half BLAS2 in LAPACK's sytrd
• Communication-Avoiding Approach
  – A → Q·A·Q^T = B, where B = B^T banded, of bandwidth M^(1/2)
  – Continue as above, starting with B
  – Dense → Banded → Tridiagonal → Diagonal
  – Dense → Banded: use TSQR to zero out M^(1/2) cols/rows at a time
  – Banded → Tridiagonal: need a new(ish) idea

Successive Band Reduction (Bischof/Lang/Sun)

[Figure: animation of successive band reduction (bulge chasing) on a banded symmetric matrix. Parameters: b = bandwidth, c = #columns, d = #diagonals, constraint c + d ≤ b. Orthogonal transforms Q1, Q2, Q3, … eliminate d outermost diagonals, working on c columns at a time; the bulges of size d+c that are created are chased down the band.]

Conventional vs CA-SBR

• Conventional: touch all data 4 times
• Communication-Avoiding: touch all data once

Speedups of Sym. Band Reduction vs DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n = 12000, b = 500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n = 12000, b = 200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n = 9000, b = 500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n = 12000, b = 500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem

• No apparent way to modify the standard algorithm
• Instead: Spectral Divide-and-Conquer
  – Find orthogonal matrix Q whose leading columns span an invariant subspace of A
  – Q^T·A·Q will be block upper triangular:
        [ A11  A12 ]
        [  ε   A22 ]
  – Apply recursively to A11, A22
  – Depends on randomization:
    1. Randomized Rank Revealing QR decomposition
    2. Randomized location to try splitting the spectrum

Attaining the Lower Bounds: Sequential
Legend: [Existing] [Ours] [Math-Lib] [Random]
Columns: Two Levels of Memory (Words, Messages) | Memory Hierarchy (Words, Messages)

• BLAS-3: [FLPR'99][BDLST'13][MKL etc.] | [FLPR'99][BDLST'13][MKL etc.]
• Cholesky: [G'97][AP'00] | [LAPACK][BDHS'09] | [G'97][AP'00][BDHS'09] | [G'97][AP'00][BDHS'09]
• Sym Indefinite: [BBDDDPSTY'13] | [BBDDDPSTY'13]
• LU: [G'97][T'97] | [GDX'11][BDLST'13] | [GDX'11][BDLST'13] | [G'97][T'97][BDLST'13] | [BDLST'13]
• QR: [EG'98][FW'03] | [DGHL'12][BDLST'13] | [FW'03][DGHL'12][BDLST'13] | [EG'98][FW'03][BDLST'13] | [FW'03][BDLST'13]
• Rank Revealing QR: [BDD'11][DGGX'13]
• Sym Eig & SVD: [BDD'11][BDK'13] | [BDD'11]
• Non-Sym Eig: [BDD'11] | [BDD'11]

Attaining the Lower Bounds: Parallel 2D, M = Θ(n^2/P)
(Ignoring poly-log(P) factors; #words = Ω(n^2/P^(1/2)), #messages = Ω(P^(1/2)))
Legend: [Existing] [Ours] [Math-Lib] [Random]
Columns: Words (BW) | Messages (L) | Saving factor

• BLAS-3: [AGZ'94][MT'99][ScaLAPACK][C'69][vGW'97][SD'11] | L: n/P^(1/2)
• Cholesky: [ScaLAPACK][T'99][SD'11] | L: n/P^(1/2)
• Sym Indefinite: [BBDDDPSTY'13][ScaLAPACK] | [BBDDDPSTY'13] | L: n/P^(1/2)
• LU: [ScaLAPACK][GDX'11][T'99][SD'11] | [GDX'11][T'99][SD'11] | L: n/P^(1/2)
• QR: [ScaLAPACK][DGHL'12][T'99] | [DGHL'12][T'99] | L: n/P^(1/2)
• Rank Revealing QR: [BDD'11][DGGX'13]
• Sym Eig & SVD: [BDD'11][BDK'13][ScaLAPACK] | [BDD'11][BDK'13] | L: n/P^(1/2)
• Non-Sym Eig: [BDD'11] | [BDD'11] | BW: P^(1/2), L: n

• Attaining with extra memory (2.5D): M = Θ(c·n^2/P)


Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and a starting vector
  – Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors
    • Conventional: O(k log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability
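A small scipy sketch of the k-step idea: the "matrix powers kernel" computes the Krylov basis [x, Ax, A^2·x, …, A^k·x] that a communication-avoiding Krylov method consumes. Here it is written naively (k separate SpMVs); the communication-avoiding blocking of A is not modeled:

    import numpy as np
    import scipy.sparse as sp

    def matrix_powers(A, x, k):
        """Return the monomial Krylov basis [x, Ax, ..., A^k x] as columns."""
        V = [x]
        for _ in range(k):
            V.append(A @ V[-1])      # one SpMV per step
        return np.column_stack(V)

    n, k = 1000, 8
    A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format='csr')  # 1D Poisson example
    x = np.random.rand(n)
    V = matrix_powers(A, x, k)
    print(V.shape)    # (1000, 9)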


Example: The Difficulty of Tuning SpMV

• n = 21200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

[Figure: sparsity pattern of the matrix]

Example: The Difficulty of Tuning (continued)

• n = 21200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs

[Figure: zoomed sparsity pattern showing 8x8 dense blocks]

Speedups on Itanium 2: The Need for Search

[Figure: register-blocking profile (Mflops), marking the reference implementation and the best block size, 4x2]

Register Profile: Itanium 2

[Figure: register-blocking profile, performance ranging from 190 Mflops to 1190 Mflops]

Register Profiles: IBM and Intel IA-64

[Figure: four register-blocking profiles, labeled Power3 - 17, Power4 - 16, Itanium 2 - 33, Itanium 1 - 8; performance ranges roughly 122-252 Mflops (Power3), 459-820 Mflops (Power4), 107-247 Mflops (Itanium 1), 190 Mflops-1.2 Gflops (Itanium 2)]

Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated nonzero structure in general
• N = 16,614
• NNZ = 1.1 M

[Figure: sparsity pattern of ex11]

Zoom in to top corner

• More complicated nonzero structure in general

[Figure: zoomed sparsity pattern]

3x3 blocks look natural, but…

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

Extra Work Can Improve Efficiency

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup!
  – Actual mflop rate is 1.5 x 1.5 = 2.25x higher
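A tiny sketch of the accounting behind this slide: with fill ratio r, blocked SpMV does r times as many flops, so a measured speedup s means the achieved mflop rate rose by r·s (here 1.5 x 1.5 = 2.25x):

    # Fill ratio = (stored entries after adding explicit zeros) / (original nonzeros)
    def blocked_spmv_accounting(fill_ratio, speedup):
        flops_multiplier = fill_ratio              # extra (useless) flops performed
        mflop_rate_gain = fill_ratio * speedup     # the rate counts all flops, useful or not
        return flops_multiplier, mflop_rate_gain

    print(blocked_spmv_accounting(fill_ratio=1.5, speedup=1.5))   # (1.5, 2.25)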

Source: Accelerator Cavity Design Problem (Ko via Husbands)

[Figure: sparsity pattern]

100x100 Submatrix Along Diagonal

[Figure: zoomed sparsity pattern]

Post-RCM Reordering

[Figure: sparsity pattern after RCM (reverse Cuthill-McKee) reordering]

Effect of Combined RCM+TSP Reordering

[Figure: before (green + red) vs after (green + blue) sparsity patterns]
• 2x speedups on Pentium 4, Power 4, …

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·A^T·x, A^T·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later …

Optimized Sparse Kernel Interface - OSKI

• Provides sparse kernels automatically tuned for the user's matrix & machine
  – BLAS-style functionality: SpMV (A·x & A^T·y), TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as a stand-alone library
  – Available as a PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski


Example: Classical Conjugate Gradient (CG)

[Figure: classical CG pseudocode; the SpMVs and dot products require communication in each iteration]

Example: CA-Conjugate Gradient

[Figure: CA-CG pseudocode; the s-step basis is computed via the CA matrix powers kernel, a single global reduction computes G, and the local computations within the inner loop require no communication]


[Figure: convergence of CG vs CA-CG (monomial basis) on a model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400. Annotations: slower convergence due to roundoff; loss of accuracy due to roundoff, relative to machine precision; at s = 16 the monomial basis is rank deficient and the method breaks down]


What is a "sparse matrix"?

• Requires o(n^2) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be a sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + U·D·V^T, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write A^k = (S + U·D·V^T)^k as a sum of S^k and low-rank matrices

                           Indices explicit (O(nnz))   Indices implicit (o(nnz))
  Nonzero entries explicit  CSR and variations          Vision, climate, AMR, …
  Nonzero entries implicit  Graph Laplacian             Stencils


Reproducible Floating Point Computation

• Get a bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe the results
  – Willing to sacrifice 40%-50% of performance for it
• Email to ~110 Berkeley CSE faculty, asking about it
  – Most: "What?! How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility

[Figure: two plots, absolute error for random vectors (same magnitude, opposite signs) and relative error for orthogonal vectors (sign not reproducible)]
• Vector size 1e6, data aligned to 16-byte boundaries; for each input vector:
  – Dot products are computed using 1, 2, 3, or 4 threads
  – Absolute error = maximum – minimum
  – Relative error = absolute error / maximum absolute value

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee a fixed reduction tree (fails 2. or 3.)
  – Use (very) high precision to get the exact answer (fails 2.)
  – Prerounding technique (Nguyen, D.)
• Performance results on 1024-proc Cray XC30: 1.2x to 3.2x slowdown vs fastest code, for n = 1M
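A tiny Python demonstration of the underlying problem: floating-point addition is not associative, so the same data summed in different orders (as different thread counts or reduction trees would do) can give different bits:

    import random

    random.seed(0)
    x = [random.uniform(-1, 1) * 10**random.randint(-8, 8) for _ in range(100_000)]

    def chunked_sum(data, nchunks):
        """Sum in nchunks partial sums, then combine -- mimics different thread counts."""
        step = len(data) // nchunks
        partials = [sum(data[i*step:(i+1)*step]) for i in range(nchunks)]
        partials.append(sum(data[nchunks*step:]))
        return sum(partials)

    results = {n: chunked_sum(x, n) for n in (1, 2, 3, 4)}
    print(results)                       # values typically differ in the last bits
    print(len(set(results.values())))    # typically > 1: not reproducible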

Collaborators and Supporters

• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)

Page 2: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

Why avoid communication

bull Communication = moving datandash Between level of memory hierarchyndash Between processors over a network

bull Running time of an algorithm is sum of 3 termsndash flops time_per_flopndash words moved bandwidthndash messages latency

2

communication

bull Time_per_flop ltlt 1 bandwidth ltlt latencybull Gaps growing exponentially with time [FOSC]

bull Avoid communication to save timebull Same story for energy

bull Avoid communication to save energy

Goals

3

bull Redesign algorithms to avoid communicationbull Between all memory hierarchy levels

bull L1 L2 DRAM network etc bull Attain lower bounds if possible

bull Current algorithms often far from lower boundsbull Large speedups and energy savings possible

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Lower bound for all ldquon3-likerdquo linear algebra

bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no

matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)

6

bull Let M = ldquofastrdquo memory size (per processor)

words_moved (per processor) = (flops (per processor) M12 )

messages_sent (per processor) = (flops (per processor) M32 )

bull Parallel case assume either load or memory balanced

Lower bound for all ldquon3-likerdquo linear algebra

bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no

matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)

7

bull Let M = ldquofastrdquo memory size (per processor)

words_moved (per processor) = (flops (per processor) M12 )

messages_sent ge words_moved largest_message_size

bull Parallel case assume either load or memory balanced

Lower bound for all ldquon3-likerdquo linear algebra

bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no

matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)

8

bull Let M = ldquofastrdquo memory size (per processor)

words_moved (per processor) = (flops (per processor) M12 )

messages_sent (per processor) = (flops (per processor) M32 )

bull Parallel case assume either load or memory balanced

SIAM SIAGLinear Algebra Prize 2012Ballard D Holtz Schwartz

Limits to parallel scaling (12)

bull Consider dense case flops_per_proc = n3Pndash Words = (n3(PM12 ))ndash Messages = (n3(PM32 ))

bull What is M Must be at least n2P to hold datandash Words = (n2P12 )ndash Messages = (P12 )

bull But if M fixed looks like perfect strong scaling in timendash Flops Words Messages all proportional to 1P

bull Ditto for energy if we count energy costs in joules hellipndash Per flop per word moved per messagendash Per word per second for data stored in memory Mndash Per second for leakage cooling hellip

bull How big can we make P and M

Limits to parallel scaling (22)

bull Consider dense case flops_per_proc = n3Pndash Words = (n3(PM12 ))ndash Messages = (n3(PM32 ))

bull How big can we make P and Mbull Assume we start with 1 copy of inputs A and B

ndash Otherwise no communication may be needed

bull Thm Words= (n2P23 ) independent of Mbull Reached when M = n2P23 too or P = n3M32 and Messages = (1) (log P in practice)bull Attained by 25D algorithm when c=P13 (ldquo3D algrdquo)bull Can keep increasing P until P = n3 Words = Messages = (1) (log n in practice)

Can we attain these lower bounds

bull Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these boundsndash Often not

bull If not are there other algorithms that dondash Yes for much of dense linear algebrandash New algorithms with new numerical properties

new ways to encode answers new data structures

ndash Not just loop transformations (need those too)bull Only a few sparse algorithms so farbull Lots of work in progress

ndash Algorithms Energy Heterogeneous Processors hellip11

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

25D Matrix Multiplication

bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid

c

(Pc)12

(Pc)12

Example P = 32 c = 2

25D Matrix Multiplication

bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid

k

j

iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12

(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)

(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)

(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)

25D Matmul on BGP 16K nodes 64K coresc = 16 copies

Distinguished Paper Award EuroParrsquo11 (Solomonik D)SCrsquo11 paper by Solomonik Bhatele D

12x faster

27x faster

Perfect Strong Scaling ndash in Time and Energy (12)

bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2

bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model

ndash γT βT αT = secs per flop per word_moved per message of size m

bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ]

= T(P)cbull Notation for energy model

ndash γE βE αE = joules for same operations

ndash δE = joules per word of memory used per sec

ndash εE = joules per sec for leakage etc

bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP)

= E(P)bull Perfect scaling extends to N-body Strassen hellip

Perfect Strong Scaling ndash in Time and Energy (22)

bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ] = T(P)c

bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP) = E(P)

bull Can use these formulas to answer many questions such asndash How to choose p and M to minimize energy E needed for computationndash Given max allowed runtime T what is minimum energy E needed to achieve

itndash Given max allowed energy E what is the minimum runtime T attainablendash Can we minimize the average power P = ETndash Given target energy efficiency what architectural parameters are needed to

achieve itbull Can we attain 75 GflopsWattbull Can we attain an exaflop for 20 MWatts

Handling Heterogeneity

bull Suppose each of P processors could differndash γi = secflop βi = secword αi = secmessage Mi = memory

bull What is optimal assignment of work Fi to minimize timendash Ti = Fi γi + Fi βi Mi

12 + Fi αi Mi32 = Fi [γi + βi Mi

12 + αi Mi32] = Fi ξi

ndash Choose Fi so Σi Fi = n3 and minimizing T = maxi Ti

ndash Answer Fi = n3(1ξi)Σj(1ξj) and T = n3Σj(1ξj)

bull Optimal Algorithm for nxn matmulndash Recursively divide into 8 half-sized subproblemsndash Assign subproblems to processor i to add up to Fi flops

bull Works for Strassen other algorithmshellip

Application to Tensor Contractions

bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

bull Heavily used in electronic structure calculationsndash Ex NWChem

bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries

ndash Solomonik Hammond Matthews

C(ijk) = Σm A(ijm)B(mk)

A3-fold symm

B2-fold symm

C2-fold symm

Application to Tensor Contractions

bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn

bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews

Communication Lower Bounds for Strassen-like matmul algorithms

bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected

bull Extends up to M = n2 p2ω bull Extends to rectangular case multiply (mxn)(nxp) in q mults

ndash words_moved = Ω (flopsM^(logmpq -1))

bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz also in JACMbull Is the lower bound attainable

Classical O(n3) matmul

words_moved =Ω (M(nM12)3P)

Strassenrsquos O(nlg7) matmul

words_moved =Ω (M(nM12)lg7P)

Strassen-like O(nω) matmul

words_moved =Ω (M(nM12)ωP)

vs

Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory

Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory

CAPS If EnoughMemory and P 7 then BFS step else DFS step end if

Communication Avoiding Parallel Strassen (CAPS)

Best way to interleaveBFS and DFS is an tuning parameter

26

Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

Speedups 24-184(over previous Strassen-based algorithms)

Invited to appear as Research Highlight in CACM

Strassen-like beyond matmul

bull Thm (D Dumitriu Holtzrsquo07) Any Strassen-like O(nω) matmul algorithm can be used to build a numerically stable O(nω+η) algorithm for any ηgt0 for Ax=b least squares eig SVD hellipndash ηgt0 needed to deal with numerical stabilityndash Strassen already stable so η=0

bull Thm For sequential versions of these algorithms Words_moved = O(nω+ηM(ω+η)2 ndash 1 + n2 log n) ie attain expected lower bound

Ballard D Holtz Schwartz

Cache and Network Oblivious Algorithms

bull Motivation Minimizes communication at every level of a hierarchical system without tuning parameters (in theory)ndash Not always 25D Matmul on BGP was topology aware

bull CAPS Divide-and-conquer choose BFS or DFS to adapt to processors available memory

bull CARMAndash Divide-and-conquer classical matmul divide largest of 3

dimensions to create two subproblemsndash Choose BFS or DFS to adapt to processors available memory

CARMA Performance Distributed Memory

Square m = k = n = 6144

ScaLAPACK

CARMA

Peak

(log)

(log)

Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA

CARMA Performance Distributed Memory

Inner Product m = n = 192 k = 6291456

ScaLAPACK

CARMAPeak

(log)

(log)

Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA

CARMA Performance Shared Memory

Square m = k = n

MKL (double)CARMA (double)

MKL (single)CARMA (single)

Peak (single)

Peak (double)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

CARMA Performance Shared Memory

Inner Product m = n = 64

MKL (double)

CARMA (double)

MKL (single)

CARMA (single)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

Why is CARMA Faster in Shared MemoryL3 Cache Misses

Shared Memory Inner Product (m = n = 64 k = 524288)

97 Fewer Misses

86 Fewer Misses

(linear)

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

One-sided Factorizations (LU, QR), so far
• Classical Approach:
    for i = 1 to n
      update column i
      update trailing matrix
  words_moved = O(n³)

35

• Blocked Approach (LAPACK):
    for i = 1 to n/b
      update block i of b columns
      update trailing matrix
  words_moved = O(n³/M^(1/3))

• Recursive Approach:
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        update right half of A
        factor(right half of A)
  words_moved = O(n³/M^(1/2))

• None of these approaches minimizes #messages
• Parallel case: Partial Pivoting => n reductions
• Need another idea

TSQR: An Architecture-Dependent Algorithm

Parallel: W = [W0; W1; W2; W3] → local QRs give R00, R10, R20, R30 → pairwise reduce to R01, R11 → final R02.

Sequential/Streaming: W = [W0; W1; W2; W3] → R00, fold in W1 → R01, fold in W2 → R02, fold in W3 → R03.

Dual Core: hybrid of the two trees (R00, R01, R11, R02, R03).

Can choose reduction tree dynamically: multicore, multisocket, multirack, multisite, out-of-core
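A minimal sketch of the parallel flavor of TSQR using a binary reduction tree, in Python/NumPy (serial code standing in for the parallel reduction; numpy.linalg.qr plays the role of the local QR):

    import numpy as np

    def tsqr_R(W, nblocks=4):
        """R factor of a tall-skinny W via a TSQR-style binary reduction tree.
        Serial sketch: each 'processor' is a block of rows, and the tree
        levels are executed one after another."""
        blocks = np.array_split(W, nblocks, axis=0)
        Rs = [np.linalg.qr(Wi, mode="r") for Wi in blocks]    # leaf QRs: R00, R10, ...
        while len(Rs) > 1:                                    # pairwise reduce up the tree
            Rs = [np.linalg.qr(np.vstack(Rs[i:i + 2]), mode="r")
                  for i in range(0, len(Rs), 2)]
        return Rs[0]

    W = np.random.rand(1000, 8)
    R_tsqr = tsqr_R(W)
    R_ref = np.linalg.qr(W, mode="r")
    # R is unique only up to the signs of its rows, so compare magnitudes:
    print(np.allclose(np.abs(R_tsqr), np.abs(R_ref)))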

Back to LU: Using similar idea for TSLU as TSQR: Use reduction tree to do "Tournament Pivoting"

W (n x b) = [W1; W2; W3; W4]
  W1 = P1·L1·U1, choose b pivot rows of W1, call them W1'
  W2 = P2·L2·U2, choose b pivot rows of W2, call them W2'
  W3 = P3·L3·U3, choose b pivot rows of W3, call them W3'
  W4 = P4·L4·U4, choose b pivot rows of W4, call them W4'

[W1'; W2'] = P12·L12·U12, choose b pivot rows, call them W12'
[W3'; W4'] = P34·L34·U34, choose b pivot rows, call them W34'

[W12'; W34'] = P1234·L1234·U1234, choose b pivot rows

Go back to W and use these b pivot rows (move them to top, do LU without pivoting)
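A toy Python sketch of one tournament-pivoting reduction (the parallel tree is flattened into a loop; scipy.linalg.lu is assumed as the local GEPP used to pick b candidate rows from each group):

    import numpy as np
    from scipy.linalg import lu

    def gepp_pivot_rows(block, b):
        """Return the b rows of `block` that GEPP would use as pivots."""
        P, L, U = lu(block)            # block = P @ L @ U, so P.T @ block has pivots on top
        return (P.T @ block)[:b]

    def tournament_pivot_rows(W, b, nblocks=4):
        """Tournament pivoting on a tall-skinny panel W (n x b):
        pick b candidates per block, then reduce pairwise until b rows remain."""
        groups = [gepp_pivot_rows(Wi, b) for Wi in np.array_split(W, nblocks, axis=0)]
        while len(groups) > 1:
            groups = [gepp_pivot_rows(np.vstack(groups[i:i + 2]), b)
                      for i in range(0, len(groups), 2)]
        return groups[0]

    W = np.random.rand(64, 4)
    print(tournament_pivot_rows(W, b=4).shape)   # (4, 4): the b selected pivot rows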

37

Minimizing Communication in TSLU

Parallel: W = [W1; W2; W3; W4] → local LU on each block, then pairwise LU reductions up a binary tree.
Sequential/Streaming: LU of W1, then fold in W2, W3, W4 one at a time.
Dual Core: hybrid tree, as for TSQR.

Can choose reduction tree dynamically to match architecture, as before

38

Making TSLU Numerically Stable

• Details matter
  – Going up the tree, we could do LU either on original rows of A (tournament pivoting) or computed rows of U
  – Only tournament pivoting stable

• "Thm": New scheme as stable as Partial Pivoting (GEPP) in following sense: Get same Schur complements as GEPP applied to different input matrix whose entries are blocks taken from input A

• Why just a "Thm"?

39

Stability of LU using TSLU: CALU

40

• Empirical testing
  – Both random matrices and "special ones"
  – Both binary tree (BCALU) and flat-tree (FCALU)
  – 3 metrics: ||PA−LU||/||A||, normwise and componentwise backward errors
  – See [D., Grigori, Xiang, 2010] for details

Why is stability of TSLU just a "Thm"?

• Proof is correct – in exact arithmetic
• Experiment
  – Generate 100 random 6x6, rank-3 matrices in Matlab
  – [L,U,P] = lu(A), do LU without pivoting on P·A, compare L factors: are they the same?
    • Compute || L – Lnp ||: a few 0's, a few ∞'s, a few NaNs; rest mostly O(1)
  – Why? Floating point is nonassociative; doing arithmetic in different order gives different rounding errors
  – Same experiment with rank-6 matrices: || L – Lnp || usually nonzero, O(macheps)
  – Same experiment with 20x20 rank-4 matrices: || L – Lnp || often O(10³)

• Much harder to break TSLU, but possible
  – Occurred when using TSLU to factorize a low-rank subdiagonal panel in symmetric-indefinite factorization

41

Fixing TSLU

• Run TSLU, quickly test for stability, fix if necessary (rare)
• Test conditioning of U; if not tiny (usual case), proceed, else
• Compute || L ||; if not big (usual case), proceed, else
• Factor A = QR using TSQR, then
• Factor Q = PLU using TSLU, then
• A = P·L·(U·R), with U·R as upper triangular factor

• Last topic in lecture: how to guarantee floating point reproducibility

42

2D CALU with Tournament Pivoting

43

2.5D CALU with Tournament Pivoting (c=4 copies)

44

Exascale Machine Parameters (Source: DOE Exascale Workshop)

• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node
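As a back-of-the-envelope illustration of why latency dominates at this scale (using only the interconnect numbers listed above), the message size needed before the bandwidth cost matches the latency cost is:

    # Interconnect from the list above: 100 GB/s bandwidth, 1 microsecond latency.
    bandwidth = 100e9          # bytes / second
    latency = 1e-6             # seconds per message
    # A message "pays off" its latency only once it carries about
    # latency * bandwidth bytes:
    breakeven_bytes = latency * bandwidth
    print(breakeven_bytes)     # 1e5 bytes = 100 KB, i.e. ~12,500 doubles per message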

Exascale predicted speedups for Gaussian Elimination:
2D CA-LU vs ScaLAPACK-LU
[Heatmap: x-axis log2(p), y-axis log2(n²/p) = log2(memory_per_proc); up to 29x speedup]

2.5D vs 2D LU, With and Without Pivoting

Other CA algorithms for Ax=b, least squares (1/3)

• A symmetric and indefinite
  – Seek factorization that retains symmetry: P·A·Pᵀ = L·D·Lᵀ, D "simple"
    • Save 1/2 flops, preserve inertia
  – Usual approach: Bunch-Kaufman
    • D block diagonal with 1x1 and 2x2 blocks
    • Pivot search down column, along row (lots of communication)
  – Alternative: Aasen
    • D = tridiagonal = T
    • Two steps:
      – P·A·Pᵀ = L·T·Lᵀ where T is banded, using TSLU
      – Solve/factor narrow band problem with T
    • Up to 2.8x faster than MKL, Best Paper at IPDPS'13

48

Other CA algorithms for Ax=b, least squares (2/3)
• Minimizing bandwidth and latency for sequential GEPP
  – So far, could not do partial pivoting and minimize #messages, just #words
  – Challenge:
    • Column layout good for choosing pivots, bad for matmul
    • Blocked layout good for matmul, bad for choosing pivots
  – Solution: use both layouts, switching between them
    • "Shape Morphing LU" or SMLU

49

• Recursive LU (columnwise layout throughout):
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        update right half of A
        factor(right half of A)
  #Words = O(n³/M^(1/2)), #Messages = O(n³/M)

• SMLU (switches layouts):
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        reshape to recursive block format
        update right half of A
        reshape to columnwise format
        factor(right half of A)
  #Words = O(n³/M^(1/2)), #Messages = O(n³/M^(3/2))

Other CA algorithms for Ax=b, least squares (3/3)
• Need for pivoting arises beyond LU, in QR
  – Choose permutation P so that leading columns of A·P = Q·R span column space of A – Rank Revealing QR (RRQR)
  – Usual approach, like Partial Pivoting:
    • Put longest column first, update rest of matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit lower bound
  – Use Tournament Pivoting
    • Each round of tournament selects best b columns from two groups of b columns, either using usual approach or something better (Gu/Eisenstat)
    • Thm: This approach "reveals the rank" of A in the sense that the leading r x r submatrix of R has singular values "near" the largest r singular values of A; ditto for trailing submatrix
  – Idea extends to other pivoting schemes
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDLᵀ with complete pivoting

50

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What about sparse matrices? (1/3)

• If matrix quickly becomes dense, use dense algorithm
• Ex: All-Pairs Shortest Path using Floyd-Warshall
• Similar to matmul: Let D = A, then

    for k = 1:n
      for i = 1:n
        for j = 1:n
          D(i,j) = min(D(i,j), D(i,k) + D(k,j))

• But can't reorder outer loop for 2.5D, need another idea
• Abbreviate D(i,j) = min(D(i,j), min_k(A(i,k)+B(k,j))) by D = A⊗B
  – Dependencies ok, 2.5D works, just different semiring
• Kleene's Algorithm:

    D = DC-APSP(A, n):
      D = A
      Partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
      D11 = DC-APSP(D11, n/2)
      D12 = D11 ⊗ D12
      D21 = D21 ⊗ D11
      D22 = D21 ⊗ D12
      D22 = DC-APSP(D22, n/2)
      D21 = D22 ⊗ D21
      D12 = D12 ⊗ D22
      D11 = D12 ⊗ D21

52
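A small runnable version of the recursive (Kleene) APSP in NumPy, to make the block recursion above concrete; the min-plus product is written out directly and no 2.5D distribution is attempted:

    import numpy as np

    def minplus(D, A, B):
        """D(i,j) = min(D(i,j), min_k A(i,k) + B(k,j)) -- the semiring 'matmul'."""
        return np.minimum(D, np.min(A[:, :, None] + B[None, :, :], axis=1))

    def dc_apsp(A):
        """Divide-and-conquer all-pairs shortest paths (Kleene's algorithm),
        for nonnegative edge weights with zero diagonal."""
        n = A.shape[0]
        if n == 1:
            return np.minimum(A, 0.0)
        D = A.copy()
        h = n // 2
        D11, D12, D21, D22 = D[:h, :h], D[:h, h:], D[h:, :h], D[h:, h:]
        D11[:] = dc_apsp(D11)
        D12[:] = minplus(D12, D11, D12)
        D21[:] = minplus(D21, D21, D11)
        D22[:] = minplus(D22, D21, D12)
        D22[:] = dc_apsp(D22)
        D21[:] = minplus(D21, D22, D21)
        D12[:] = minplus(D12, D12, D22)
        D11[:] = minplus(D11, D12, D21)
        return D

    # Check against Floyd-Warshall on a small random graph (inf = no edge):
    n = 8
    mask = np.random.rand(n, n) < 0.4
    G = np.where(mask, np.random.rand(n, n), np.inf)
    np.fill_diagonal(G, 0.0)
    FW = G.copy()
    for k in range(n):
        FW = np.minimum(FW, FW[:, [k]] + FW[[k], :])
    print(np.allclose(dc_apsp(G), FW))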

Performance of 2.5D APSP using Kleene

53

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24,576 cores)
[Plot annotations: 6.2x speedup, 2x speedup]

What about sparse matrices? (2/3)

• If parts of matrix become dense, optimize those
• Ex: Cholesky on matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): If all balanced separators of G(A) have at least w vertices, then G(chol(A)) has clique of size w
  – Need to do dense Cholesky on w x w submatrix
• Thm: Words_moved = Ω(w³/M^(1/2)), etc.
• Thm (George '73): Nested dissection gives optimal ordering for 2D grid, 3D grid, similar matrices
  – w = n for 2D n x n grid, w = n² for 3D n x n x n grid
• Sequential multifrontal Cholesky attains bounds
• PSPASES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)

54

What about sparse matrices? (3/3)

• If matrix stays very sparse, lower bound unattainable; new one:
• Ex: A, B both diagonal: no communication in parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j)≠0) = d/n, d << n^(1/2), i.i.d.
• Assumption: Algorithm is sparsity-independent: assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
    Words_moved = Ω(min( d·n/P^(1/2), d²·n/P ))
  – Proof exploits fact that reuse of entries of C = A·B is unlikely
• Contrast general lower bound: Words_moved = Ω(d²·n/(P·M^(1/2)))
• Attained by divide-and-conquer algorithm that splits matrices along dimensions most likely to minimize cost

55

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Symmetric Eigenproblem and SVD

• Usual approach for A = Aᵀ (SVD similar)
  – A → QᵀAQ = T, where Q orthogonal, T tridiagonal
  – T → UᵀTU = Λ, where U orthogonal, Λ diagonal
  – (QU)'s columns are eigenvectors, Λ eigenvalues
  – Dense → Tridiagonal → Diagonal
  – Only half BLAS3, half BLAS2 in LAPACK's sytrd

• Communication-Avoiding Approach
  – A → QAQᵀ = B, where B = Bᵀ banded, of bandwidth ~M^(1/2)
  – Continue as above, starting with B
  – Dense → Banded → Tridiagonal → Diagonal
  – Dense → Banded: use TSQR to zero out M^(1/2) cols/rows at a time
  – Banded → Tridiagonal: need new(ish) idea

Successive Band Reduction (Bischof/Lang/Sun)
b = bandwidth, c = #columns, d = #diagonals; constraint: c+d ≤ b
[Sequence of figures: bulge-chasing sweeps 1–6 on the band, applying orthogonal transforms Q1, …, Q5 from the left and Q1ᵀ, …, Q5ᵀ from the right]

Conventional vs CA-SBR
  Conventional: touch all data 4 times
  Communication-Avoiding: touch all data once

Speedups of Sym. Band Reduction vs DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem

• No apparent way to modify standard algorithm
• Instead: Spectral Divide-and-Conquer
  – Find orthogonal matrix Q whose leading columns span an invariant subspace of A
  – QᵀAQ will be block upper triangular:
        [ A11  A12 ]
        [  ε   A22 ]
  – Apply recursively to A11, A22
  – Depends on randomization
    1. Randomized Rank Revealing QR decomposition
    2. Randomized location to try splitting spectrum

Attaining the Lower bounds: Sequential
Legend: [Existing], [Ours], [Math-Lib], [Random]
(Table columns: Two Levels of memory vs. Memory Hierarchy; #Words and #Messages for each)

BLAS-3: [FLPR'99] [BDLST'13] [MKL etc.] | [FLPR'99] [BDLST'13] [MKL etc.]
Cholesky: [G'97] [AP'00] [LAPACK] [BDHS'09] | [G'97] [AP'00] [BDHS'09] | [G'97] [AP'00] [BDHS'09]
Sym. Indefinite: [BBDDDPSTY'13] | [BBDDDPSTY'13]
LU: [G'97] [T'97] [GDX'11] [BDLST'13] | [GDX'11] [BDLST'13] | [G'97] [T'97] [BDLST'13] | [BDLST'13]
QR: [EG'98] [FW'03] [DGHL'12] [BDLST'13] | [FW'03] [DGHL'12] [BDLST'13] | [EG'98] [FW'03] [BDLST'13] | [FW'03] [BDLST'13]
Rank-Revealing QR: [BDD'11] [DGGX'13]
Sym. Eig & SVD: [BDD'11] [BDK'13] | [BDD'11]
Non-Sym. Eig: [BDD'11] | [BDD'11]

Attaining the Lower bounds: Parallel 2D, M = Θ(n²/P)
(Ignoring poly-log(P) factors; #words = Ω(n²/P^(1/2)), #messages = Ω(P^(1/2)))
Legend: [Existing], [Ours], [Math-Lib], [Random]
(Table columns: #Words (BW), #Messages (L); last column: saving factor when attaining with extra memory, 2.5D, M = Θ(c·n²/P))

BLAS-3: [AGZ'94] [MT'99] [ScaLAPACK] [C'69] [vGW'97] [SD'11] — saving: L, n/P^(1/2)
Cholesky: [ScaLAPACK] [T'99] [SD'11] — saving: L, n/P^(1/2)
Sym. Indefinite: [BBDDDPSTY'13] [ScaLAPACK] | [BBDDDPSTY'13] — saving: L, n/P^(1/2)
LU: [ScaLAPACK] [GDX'11] [T'99] [SD'11] | [GDX'11] [T'99] [SD'11] — saving: L, n/P^(1/2)
QR: [ScaLAPACK] [DGHL'12] [T'99] | [DGHL'12] [T'99] — saving: L, n/P^(1/2)
Rank-Revealing QR: [BDD'11] [DGGX'13]
Sym. Eig & SVD: [BDD'11] [BDK'13] [ScaLAPACK] | [BDD'11] [BDK'13] — saving: L, n/P^(1/2)
Non-Sym. Eig: [BDD'11] | [BDD'11] — saving: BW, P^(1/2); L, n

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra

• k steps of iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and starting vector
  – Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors
    • Conventional: O(k log p) messages (k SpMV calls, dot prods)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability (see the sketch below)
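A small sketch of the idea behind the "matrix powers kernel" that makes the O(1) / O(log p) counts possible: for a block of rows owned by one processor, the vector entries reachable within k hops in the graph of A are everything that must be gathered once in order to compute the local pieces of [Ax, A²x, …, Aᵏx] with no further communication. The helper below just computes that k-hop dependency set with SciPy; it illustrates the concept and is not a CA implementation.

    import numpy as np
    import scipy.sparse as sp

    def k_hop_dependencies(A, my_rows, k):
        """Column indices (x-entries) a processor owning `my_rows` must gather
        once to compute its rows of A@x, A@(A@x), ..., A^k@x locally --
        the 'ghost zone' of the matrix powers kernel."""
        A = sp.csr_matrix(A)
        needed = set(my_rows)
        frontier = set(my_rows)
        for _ in range(k):
            nxt = set()
            for i in frontier:
                nxt.update(A.indices[A.indptr[i]:A.indptr[i + 1]])  # neighbors of row i
            frontier = nxt - needed
            needed |= nxt
        return np.array(sorted(needed))

    # 1D Laplacian (tridiagonal): each extra power of A widens the ghost zone by one.
    n = 40
    A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n))
    print(k_hop_dependencies(A, my_rows=range(10, 20), k=3))   # rows 7..22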

75

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV

• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

77

Example: The Difficulty of Tuning

• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs

78

Speedups on Itanium 2: The Need for Search
[Heatmap of Mflops over register block sizes: reference vs. best (4x2)]

79

Register Profile: Itanium 2
[Heatmap: 190 Mflops (reference) to 1190 Mflops (best)]

80

Register Profiles: IBM and Intel IA-64
[Four heatmaps; best fraction of peak: Power3 – 17% (122 to 252 Mflops), Power4 – 16% (459 to 820 Mflops), Itanium 1 – 8% (107 to 247 Mflops), Itanium 2 – 33% (190 Mflops to 1.2 Gflops)]

Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16,614
• NNZ = 1.1 M

82

Zoom in to top corner
• More complicated non-zero structure in general
• N = 16,614
• NNZ = 1.1 M

83

3x3 blocks look natural, but…

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

84

Extra Work Can Improve Efficiency

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup!
  – Actual Mflop rate 1.5² = 2.25x higher (as illustrated below)
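To make the register-blocking trade-off concrete, here is a small SciPy sketch (illustration only, not the OSKI heuristic): converting a CSR matrix to the block-sparse (BSR) format fills in explicit zeros, and the ratio of stored values before and after is exactly the "fill ratio" discussed above.

    import numpy as np
    import scipy.sparse as sp

    def fill_ratio(A_csr, blocksize):
        """Stored values in r x c blocked (BSR) format / true nonzeros in CSR."""
        A_bsr = A_csr.tobsr(blocksize=blocksize)      # fills blocks with explicit zeros
        r, c = blocksize
        stored = A_bsr.nnz                             # scipy counts stored *blocks* values via data
        return A_bsr.data.size / A_csr.nnz

    # Random matrix with imperfect 3x3 block structure:
    rng = np.random.default_rng(0)
    n = 300
    dense = np.zeros((n, n))
    for bi in rng.choice(n // 3, size=200):
        for bj in rng.choice(n // 3, size=3):
            block = rng.random((3, 3))
            block[rng.random((3, 3)) < 0.3] = 0.0      # some entries inside blocks are zero
            dense[3 * bi:3 * bi + 3, 3 * bj:3 * bj + 3] = block
    A = sp.csr_matrix(dense)
    print(fill_ratio(A, (3, 3)))   # > 1: extra stored zeros, traded for unrolled 3x3 multiplies

Whether the extra flops on explicit zeros pay off depends on the machine, which is exactly why the register-block size has to be searched for, as in the profiles above.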

85

Source: Accelerator Cavity Design Problem (Ko, via Husbands)

86

100x100 Submatrix Along Diagonal


87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

Before: Green + Red; After: Green + Blue

89
2x speedups on Pentium 4, Power 4, …

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·Aᵀ·x, Aᵀ·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later…

90

Optimized Sparse Kernel Interface – OSKI

• Provides sparse kernels automatically tuned for user's matrix & machine
  – BLAS-style functionality: SpMV (A·x & Aᵀ·y), TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as stand-alone library
  – Available as PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

91

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

93

Example: Classical Conjugate Gradient (CG)
[Algorithm figure: the SpMV and the dot products in each iteration require communication]

94

Example: CA-Conjugate Gradient
[Algorithm figure: the s SpMVs are done via the CA matrix powers kernel, one global reduction computes the Gram matrix G, and the local computations within the inner loop require no communication]
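For reference, here is plain CG in NumPy with comments marking exactly the operations the slide identifies as communication in a distributed setting (one SpMV and two dot products per iteration); the CA-CG reorganization itself (s-step basis plus Gram matrix) is not shown.

    import numpy as np

    def cg(A, b, tol=1e-10, maxit=1000):
        """Classical conjugate gradients for SPD A (dense here for simplicity)."""
        x = np.zeros_like(b)
        r = b.copy()                 # r = b - A@x with x = 0
        p = r.copy()
        rho = r @ r                  # dot product -> global reduction in parallel
        for _ in range(maxit):
            q = A @ p                # SpMV -> neighbor communication in parallel
            alpha = rho / (p @ q)    # dot product -> another global reduction
            x += alpha * p
            r -= alpha * q
            rho_new = r @ r          # dot product (reduction)
            if np.sqrt(rho_new) < tol:
                break
            p = r + (rho_new / rho) * p
            rho = rho_new
        return x

    # 1D Poisson (tridiagonal) SPD test problem:
    n = 100
    A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    b = np.ones(n)
    x = cg(A, b)
    print(np.linalg.norm(A @ x - b))   # small residual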

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

96

[Convergence plot: CA-CG with the monomial basis vs. CG. Slower convergence and loss of accuracy due to roundoff; at s = 16 the monomial basis is rank deficient and the method breaks down. Model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400; machine precision marked.]

97

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?
• Requires o(n²) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + U·D·Vᵀ, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write Aᵏ = (S + U·D·Vᵀ)ᵏ as sum of Sᵏ and low-rank matrices

Examples (rows: nonzero entries; columns: indices):
                               Indices explicit (O(nnz))   Indices implicit (o(nnz))
  Entries explicit (O(nnz)):   CSR and variations          Vision, climate, AMR, …
  Entries implicit (o(nnz)):   Graph Laplacian             Stencils

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

101

Reproducible Floating Point Computation

• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought reproducible parallel sparse linear equation solver, demanded by customers (construction engineers), otherwise they don't believe results
  – Willing to sacrifice 40% – 50% of performance for it
• Email to ~110 Berkeley CSE faculty, asking about it
  – Most: "What?! How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility
[Plots: absolute error for random vectors (same magnitude, opposite signs); relative error for orthogonal vectors (sign not reproducible)]
Vector size 1e6, data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum – minimum
• Relative error = absolute error / maximum absolute value

103

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (fails 2 or 3)
  – Use (very) high precision to get exact answer (fails 2)
  – Prerounding technique (Nguyen, D.) – see the sketch below
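To make the nonassociativity issue and the prerounding idea concrete, here is a toy single-bin version in Python (my own simplification for illustration: the actual Nguyen/Demmel scheme, as in ReproBLAS, uses several bins so that accuracy is not sacrificed the way it is here):

    import math, random

    def preround_sum(xs):
        """Toy 'pre-rounding' reproducible summation (single bin).
        Every term is rounded onto a common coarse grid determined only by
        n and max|x_i|; the rounded terms then add exactly, so the result is
        bit-identical for every summation order.  Accuracy is traded away."""
        n = len(xs)
        m = max(abs(x) for x in xs)
        if m == 0.0:
            return 0.0
        M = math.ldexp(1.0, math.frexp(n * m)[1] + 1)   # power of two > n * max|x_i|
        total = 0.0
        for x in xs:
            total += (x + M) - M        # high-order part of x on a grid of spacing ~ulp(M)
        return total

    xs = [random.uniform(-1, 1) for _ in range(100000)]
    orders = [xs, list(reversed(xs)), sorted(xs)]
    print({sum(o) for o in orders})            # ordinary sums: typically several distinct values
    print({preround_sum(o) for o in orders})   # pre-rounded sums: exactly one value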

104

Performance results on 1024 proc Cray XC30: 1.2x to 3.2x slowdown vs fastest code, for n = 1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle

• bebop.cs.berkeley.edu

Summary

Don't Communic…

106

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)

Communication Lower Bounds for Strassen-like matmul algorithms

bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected

bull Extends up to M = n2 p2ω bull Extends to rectangular case multiply (mxn)(nxp) in q mults

ndash words_moved = Ω (flopsM^(logmpq -1))

bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz also in JACMbull Is the lower bound attainable

Classical O(n3) matmul

words_moved =Ω (M(nM12)3P)

Strassenrsquos O(nlg7) matmul

words_moved =Ω (M(nM12)lg7P)

Strassen-like O(nω) matmul

words_moved =Ω (M(nM12)ωP)

vs

Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory

Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory

CAPS If EnoughMemory and P 7 then BFS step else DFS step end if

Communication Avoiding Parallel Strassen (CAPS)

Best way to interleaveBFS and DFS is an tuning parameter

26

Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

Speedups 24-184(over previous Strassen-based algorithms)

Invited to appear as Research Highlight in CACM

Strassen-like beyond matmul

bull Thm (D Dumitriu Holtzrsquo07) Any Strassen-like O(nω) matmul algorithm can be used to build a numerically stable O(nω+η) algorithm for any ηgt0 for Ax=b least squares eig SVD hellipndash ηgt0 needed to deal with numerical stabilityndash Strassen already stable so η=0

bull Thm For sequential versions of these algorithms Words_moved = O(nω+ηM(ω+η)2 ndash 1 + n2 log n) ie attain expected lower bound

Ballard D Holtz Schwartz

Cache and Network Oblivious Algorithms

bull Motivation Minimizes communication at every level of a hierarchical system without tuning parameters (in theory)ndash Not always 25D Matmul on BGP was topology aware

bull CAPS Divide-and-conquer choose BFS or DFS to adapt to processors available memory

bull CARMAndash Divide-and-conquer classical matmul divide largest of 3

dimensions to create two subproblemsndash Choose BFS or DFS to adapt to processors available memory

CARMA Performance Distributed Memory

Square m = k = n = 6144

ScaLAPACK

CARMA

Peak

(log)

(log)

Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA

CARMA Performance Distributed Memory

Inner Product m = n = 192 k = 6291456

ScaLAPACK

CARMAPeak

(log)

(log)

Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA

CARMA Performance Shared Memory

Square m = k = n

MKL (double)CARMA (double)

MKL (single)CARMA (single)

Peak (single)

Peak (double)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

CARMA Performance Shared Memory

Inner Product m = n = 64

MKL (double)

CARMA (double)

MKL (single)

CARMA (single)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

Why is CARMA Faster in Shared MemoryL3 Cache Misses

Shared Memory Inner Product (m = n = 64 k = 524288)

97 Fewer Misses

86 Fewer Misses

(linear)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

One-sided Factorizations (LU QR) so farbull Classical Approach for i=1 to n update column i update trailing matrixbull words_moved = O(n3)

35

bull Blocked Approach (LAPACK) for i=1 to nb update block i of b columns update trailing matrixbull words moved = O(n3M13)

bull Recursive Approach func factor(A) if A has 1 column update it

else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)

bull None of these approaches minimizes messagesbull Parallel case Partial

Pivoting =gt n reductionsbull Need another idea

TSQR An Architecture-Dependent Algorithm

W =

W0

W1

W2

W3

R00

R10

R20

R30

R01

R11

R02Parallel

W =

W0

W1

W2

W3

R01R02

R00

R03

SequentialStreaming

W =

W0

W1

W2

W3

R00

R01

R01

R11

R02

R11

R03

Dual Core

Can choose reduction tree dynamically

Multicore Multisocket Multirack Multisite Out-of-core

Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo

Wnxb =

W1

W2

W3

W4

P1middotL1middotU1

P2middotL2middotU2

P3middotL3middotU3

P4middotL4middotU4

=

Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo

W1rsquoW2rsquoW3rsquoW4rsquo

P12middotL12middotU12

P34middotL34middotU34

=Choose b pivot rows call them W12rsquo

Choose b pivot rows call them W34rsquo

W12rsquoW34rsquo

= P1234middotL1234middotU1234

Choose b pivot rows

Go back to W and use these b pivot rows (move them to top do LU without pivoting)

37

Minimizing Communication in TSLU

W = W1

W2

W3

W4

LULULULU

LU

LULUParallel

W = W1

W2

W3

W4

LULU

LU

LUSequentialStreaming

W = W1

W2

W3

W4

LULU LU

LULU

LULU

Dual Core

Can choose reduction tree dynamically to match architecture as before

38

Making TSLU Numerically Stable

bull Details matterndash Going up the tree we could do LU either on original rows of A

(tournament pivoting) or computed rows of Undash Only tournament pivoting stable

bull ldquoThmrdquo New scheme as stable as Partial Pivoting (GEPP) in following sense Get same Schur complements as GEPP applied to different input matrix whose entries are blocks taken from input A

bull Why just a ldquoThmrdquo

39

Stability of LU using TSLU CALU

Summer School Lecture 4 40

bull Empirical testingndash Both random matrices and ldquospecial onesrdquondash Both binary tree (BCALU) and flat-tree (FCALU)ndash 3 metrics ||PA-LU||||A|| normwise and componentwise backward errorsndash See [D Grigori Xiang 2010] for details

Why is stability of TSLU just a ldquoThmrdquo

bull Proof is correct ndash in exact arithmeticbull Experiment

ndash Generate 100 random 6x6 rank 3 matrices in Matlabndash [LUP] = lu(A) do LU without pivoting on PA compare L factors are

they the samebull Compute || L ndash Lnp || A few 0rsquos A few infinrsquos a few NaNsbull Rest mostly O(1)

ndash Why Floating point is nonassociative doing arithmetic in different order gives different rounding errors

ndash Same experiment with rank 6 matrices || L ndash Lnp || usually nonzero O(macheps)

ndash Same experiment with 20x20 rank 4 matrices || L ndash Lnp || often O(103)

bull Much harder to break TSLU but possiblendash Occurred when using TSLU to factorize a low-rank subdiagonal

panel in symmetric-indefinite factorization41

Fixing TSLU

bull Run TSLU quickly test for stability fix if necessary (rare)

bull Test conditioning of U if not tiny (usual case) proceed elsebull Compute || L || if not big (usual case) proceed elsebull Factor A = QR using TSQR thenbull Factor Q = PLU using TSLU thenbull A = PL(UR) with UR as upper triangular factor

bull Last topic in lecture how to guarantee floating point reproducibility

42

2D CALU with Tournament Pivoting

43

25D CALU with Tournament Pivoting (c=4 copies)

44

Exascale Machine ParametersSource DOE Exascale Workshop

bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log

2 (

n2p

) =

log

2 (m

emo

ry_p

er_p

roc)

Up to 29x

25D vs 2D LUWith and Without Pivoting

Other CA algorithms for Ax=b least squares(13)

bull A symmetric and indefinitendash Seek factorization that retains symmetry PAPT = LDLT D

ldquosimplerdquobull Save frac12 flops preserve inertia

ndash Usual approach Bunch-Kaufmanbull D block diagonal with 1x1 and 2x2 blocksbull Pivot search down column along row (lots of communication)

ndash Alternative Aasenbull D = tridiagonal = Tbull Two steps

ndash PAPT = LTLT where T is banded using TSLU

48

0 0

0

0 0

0

0

hellip

hellip

ndash Solvefactor narrow band problem with Tbull Up to 28x faster than MKL Best Paper at IPDPSrsquo13

Other CA algorithms for Ax=b least squares (23)bull Minimizing bandwidth and latency for sequential GEPP

ndash So far could not do partial pivoting and minimize messages just words

ndash Challengebull Column layout good for choosing pivots bad for matmulbull Blocked layout good for matmul bad for choosing pivots

ndash Solution use both layouts switching between thembull ldquoShape Morphing LUrdquo or SMLU

49

bull func factor(A) if A has 1 column update it else factor(left half of A)

update right half of A

factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M)

bull func factor(A) if A has 1 column update it else factor(left half of A) reshape to recursive block format update right half of A reshape to columnwise format factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M32)

Other CA algorithms for Ax=b least squares (33)bull Need for pivoting arises beyond LU in QR

ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)

ndash Usual approach like Partial Pivoting

bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound

ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two

groups of b columns either using usual approach or something better (GuEisenstat)

bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix

ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 50

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What about sparse matrices (13)

bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then

bull But canrsquot reorder outer loop for 25D need another idea

bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring

bull Kleenersquos Algorithm

52

for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)

D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21

Performance of 25D APSP using Kleene

53

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)

62xspeedup

2x speedup

What about sparse matrices (23)

bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of

G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix

bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering

for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid

bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse

multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on

separators) 54

What about sparse matrices (33)

bull If matrix stays very sparse lower bound unattainable new one

bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of

data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)

bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)

Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely

bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices

along dimensions most likely to minimize cost

55

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Symmetric Eigenproblem and SVD

bull Usual approach for A=AT (SVD similar)ndash A QTAQ = T where Q orthogonal T tridiagonalndash T UTTU = Λ where U orthogonal Λ diagonalndash QUrsquos columns are eigenvectors Λ eigenvaluesndash Dense Tridiagonal Diagonalndash Only half BLAS3 half BLAS2 in LAPACKrsquos sytrd

bull Communication-Avoiding Approachndash A QAQT = B where B=BT banded of bandwidth M12

ndash Continue as above starting with Bndash Dense Banded Tridiagonal Diagonalndash Dense Banded use TSQR to zero out M12 colsrows at a timendash Banded Tridiagonal need new(ish) idea

b+1

b+1

Successive Band Reduction (BischofLangSun)

1

b+1

b+1

d+1

c

Successive Band Reduction (BischofLangSun)

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1Q1

b+1

b+1

d+1

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

12

Q1

b+1

b+1

d+1

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

12

Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

Q1

Q1T

Q2

Q2T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

Q1

Q1T

Q2

Q2T

Q3

Q3T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

6

6

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

Conventional vs CA - SBR

Conventional Communication-Avoiding

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Nonsymmetric Eigenproblem

bull No apparent way to modify standard algorithmbull Instead Spectral Divide-and-Conquer

ndash Find orthogonal matrix Q whose leading columns span an invariant subspace of A

ndash QTAQ will be block upper triangular

ndash Apply recursively to A11 A22

ndash Depends on randomization1 Randomized Rank Revealing QR decomposition2 Randomized location to try splitting spectrum

A11 A12

ε A22

Attaining the Lower Bounds: Sequential
Legend: [Existing] [Ours] [Math-Lib] [Random]
Columns: Two Levels (Words, Messages) and Memory Hierarchy (Words, Messages); entries below read left to right across these columns.

BLAS-3: [FLPR'99][BDLST'13][MKL etc.]  [FLPR'99][BDLST'13][MKL etc.]
Cholesky: [G'97][AP'00]  [LAPACK][BDHS'09]  [G'97][AP'00][BDHS'09]  [G'97][AP'00][BDHS'09]
Sym. Indefinite: [BBDDDPSTY'13]  [BBDDDPSTY'13]
LU: [G'97][T'97]  [GDX'11][BDLST'13]  [GDX'11][BDLST'13]  [G'97][T'97]  [BDLST'13]  [BDLST'13]
QR: [EG'98][FW'03]  [DGHL'12][BDLST'13]  [FW'03][DGHL'12][BDLST'13]  [EG'98][FW'03][BDLST'13]  [FW'03][BDLST'13]
Rank Revealing QR: [BDD'11][DGGX'13]
Sym. Eig & SVD: [BDD'11][BDK'13]  [BDD'11]
Non-Sym. Eig: [BDD'11]  [BDD'11]

Attaining the Lower Bounds: Parallel 2D, M = Θ(n²/P)
(Ignoring poly-log(P) factors; lower bounds: words = Ω(n²/P^(1/2)), messages = Ω(P^(1/2)).)
Legend: [Existing] [Ours] [Math-Lib] [Random]
Columns: Words (BW), Messages (L), Saving factor.

BLAS-3: [AGZ'94][MT'99][ScaLAPACK][C'69][vGW'97][SD'11]  (saving: L n/P^(1/2))
Cholesky: [ScaLAPACK][T'99][SD'11]  (saving: L n/P^(1/2))
Sym. Indefinite: [BBDDDPSTY'13][ScaLAPACK]  [BBDDDPSTY'13]  (saving: L n/P^(1/2))
LU: [ScaLAPACK][GDX'11][T'99][SD'11]  [GDX'11][T'99][SD'11]  (saving: L n/P^(1/2))
QR: [ScaLAPACK][DGHL'12][T'99]  [DGHL'12][T'99]  (saving: L n/P^(1/2))
Rank Revealing QR: [BDD'11][DGGX'13]
Sym. Eig & SVD: [BDD'11][BDK'13][ScaLAPACK]  [BDD'11][BDK'13]  (saving: L n/P^(1/2))
Non-Sym. Eig: [BDD'11]  [BDD'11]  (saving: BW P^(1/2), L n)

Attaining with extra memory (2.5D): M = Θ(c·n²/P)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and starting vector
  – Many such "Krylov Subspace Methods"
     • Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation
     • Conventional: O(k) moves of data from slow to fast memory
     • New: O(1) moves of data - optimal
  – Parallel implementation on p processors
     • Conventional: O(k log p) messages (k SpMV calls, dot prods)
     • New: O(log p) messages - optimal
  (see the sketch below for the idea in its simplest 1D form)
• Lots of speed up possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: Poor partitioning, Preconditioning, Num. Stability

75
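A minimal sketch of where the O(1)-moves / O(log p)-messages claims come from, in the simplest possible setting: A is the 1D Laplacian (3-point stencil) and one processor owns an interior chunk of x. Fetching k ghost values per neighbor up front (one message each) lets it compute its rows of Ax, ..., A^k x with no further communication, at the price of redundant flops on a shrinking ghost region. The chunk/ghost layout and names are illustrative, not taken from the talk.

```python
import numpy as np

def local_matrix_powers(x_chunk, left_ghosts, right_ghosts, k):
    """Compute this chunk's rows of A x, A^2 x, ..., A^k x for the 1D
    Laplacian A = tridiag(-1, 2, -1), using only k ghost values fetched
    once from each neighbor (assumes the chunk is interior to the domain;
    boundary chunks need the obvious adjustment)."""
    v = np.concatenate([left_ghosts, x_chunk, right_ghosts])   # length m + 2k
    out = []
    for step in range(k):
        v = 2.0 * v[1:-1] - v[:-2] - v[2:]                     # one local stencil sweep
        trim = k - 1 - step                                    # still-valid interior
        out.append(v[trim: len(v) - trim])                     # this chunk's rows
    return out                                                 # out[j] = local part of A^(j+1) x

# Check against a global computation
n, k, lo, hi = 64, 3, 20, 40                 # this "processor" owns indices [lo, hi)
x = np.random.default_rng(1).standard_normal(n)
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
loc = local_matrix_powers(x[lo:hi], x[lo - k:lo], x[hi:hi + k], k)
y = x.copy()
for j in range(k):
    y = A @ y
    assert np.allclose(loc[j], y[lo:hi])
```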

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Example The Difficulty of Tuning SpMV

• n = 21,200
• nnz = 1.5 M

bull Source NASA structural analysis problem (raefsky)

77

Example The Difficulty of Tuning

• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs

78

Speedups on Itanium 2 The Need for Search

[Figure: Mflop/s of the reference implementation vs. the best register blocking (4x2) across matrices: the need for search.]

79

Register Profile Itanium 2

[Figure: register-blocking profile on Itanium 2, from 190 Mflop/s (reference) to 1190 Mflop/s (best block size).]

80

Register Profiles: IBM and Intel IA-64
[Figure: register-blocking profiles and fraction of peak reached. Power3 (17% of peak): 122 to 252 Mflop/s; Power4 (16%): 459 to 820 Mflop/s; Itanium 1 (8%): 107 to 247 Mflop/s; Itanium 2 (33%): 190 Mflop/s to 1.2 Gflop/s.]

Another example of tuning challenges for SpMV

bull Ex11 matrix (fluid flow)

bull More complicated non-zero structure in general

• N = 16,614
• NNZ = 1.1 M

82

Zoom in to top corner

bull More complicated non-zero structure in general

• N = 16,614
• NNZ = 1.1 M

83

3x3 blocks look natural, but…

• Example: 3x3 blocking
  – Logical grid of 3x3 cells

• But would lead to lots of "fill-in"

84

Extra Work Can Improve Efficiency

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5

• On Pentium III: 1.5x speedup!
  – Actual Mflop rate 1.5² = 2.25x higher

85
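Register blocking as described above is available off the shelf via SciPy's BSR format; here is a small sketch of measuring the fill ratio for a chosen block size. (The matrix below is random, so unlike the raefsky example its fill ratio will be far from 1; names and sizes are illustrative.)

```python
import numpy as np
import scipy.sparse as sp

# Store the matrix in r x c register blocks, filling in explicit zeros where
# needed; fewer index loads and unrolled block multiplies can outweigh the
# extra flops when the fill ratio stays modest.
A = sp.random(3000, 3000, density=0.001, format="csr", random_state=0)
B = A.tobsr(blocksize=(3, 3))        # 3x3 blocking, with explicit zero fill

fill_ratio = B.data.size / A.nnz     # stored entries (including fill) / true nnz
print("fill ratio:", fill_ratio)

x = np.random.rand(3000)
assert np.allclose(A @ x, B @ x)     # same operator, different storage
```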

Source Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal


87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

Before: Green + Red.  After: Green + Blue.


89

2x speedups on Pentium 4, Power 4, …
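The RCM reordering behind these figures is available in SciPy; a small sketch, using a scrambled 2D Poisson matrix as a stand-in for the accelerator-cavity matrix (names and sizes are illustrative):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import reverse_cuthill_mckee

def bandwidth(A):
    coo = A.tocoo()
    return int(np.abs(coo.row - coo.col).max())

# 2D Poisson operator on a 40x40 grid, then scramble its natural ordering
n = 40
T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n))
A = (sp.kron(sp.identity(n), T) + sp.kron(T, sp.identity(n))).tocsr()
p = np.random.default_rng(0).permutation(n * n)
A_scrambled = A[p, :][:, p]

# RCM recovers a low-bandwidth ordering, creating dense band/block structure
perm = reverse_cuthill_mckee(A_scrambled, symmetric_mode=True)
A_rcm = A_scrambled[perm, :][:, perm]
print(bandwidth(A), bandwidth(A_scrambled), bandwidth(A_rcm))
```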

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…

• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR

• Higher-level kernels
  – A·Aᵀ·x, Aᵀ·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later…

90
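As an illustration of the higher-level-kernel item above, here is a sketch of computing AᵀAx in a single sweep over the rows of A, reusing each row for both products; the conventional route A.T @ (A @ x) reads A twice. (A pure Python loop is slow; tuned implementations do this in C with register blocking, but the data-reuse idea is the same. Names are illustrative.)

```python
import numpy as np
import scipy.sparse as sp

def ata_x_one_pass(A_csr, x):
    """Compute A^T @ (A @ x) reading each CSR row of A only once:
    t_i = a_i . x, then y += t_i * a_i."""
    y = np.zeros(A_csr.shape[1])
    indptr, indices, data = A_csr.indptr, A_csr.indices, A_csr.data
    for i in range(A_csr.shape[0]):
        lo, hi = indptr[i], indptr[i + 1]
        cols, vals = indices[lo:hi], data[lo:hi]
        t_i = vals @ x[cols]          # row i of A times x
        y[cols] += t_i * vals         # row i of A^T times t_i (cols unique per row)
    return y

A = sp.random(200, 100, density=0.05, format="csr", random_state=0)
x = np.random.rand(100)
assert np.allclose(ata_x_one_pass(A, x), A.T @ (A @ x))
```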

Optimized Sparse Kernel Interface - OSKI

• Provides sparse kernels automatically tuned for user's matrix & machine
  – BLAS-style functionality: SpMV (A·x and Aᵀ·y), TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as stand-alone library
  – Available as PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

91

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

93

Example Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in each iteration

via CA Matrix Powers Kernel

Global reduction to compute G

94

Example CA-Conjugate Gradient

Local computations within inner loop require no communication
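The reason the inner loop needs no communication is the Gram-matrix trick behind "Global reduction to compute G": once every vector of the s inner steps is represented by a short coefficient vector in a common basis V, each dot product is read off G = VᵀV, which costs one global reduction per outer loop. A tiny sketch of that identity (sizes and names are illustrative):

```python
import numpy as np

n, s = 1000, 4
V = np.random.rand(n, 2 * s + 1)       # e.g. Krylov basis vectors for s steps
G = V.T @ V                            # ONE reduction, of size O(s^2) not O(n)

a, b = np.random.rand(2 * s + 1), np.random.rand(2 * s + 1)
x, y = V @ a, V @ b                    # vectors represented in the basis V
assert np.isclose(x @ y, a @ G @ b)    # inner-loop dot products are now local
```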

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

96

[Figure: convergence of CG vs. CA-CG (monomial basis) on a model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400. CA-CG shows slower convergence and loss of accuracy due to roundoff; at s = 16 the monomial basis is rank deficient and the method breaks down. The plot also marks machine precision.]

97
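The breakdown in the figure is easy to reproduce. The sketch below builds the same model problem (2D Poisson, 5-point stencil, 30x30 grid) and the unscaled monomial Krylov basis for s = 16, then checks its conditioning; the exact printed numbers depend on the random starting vector.

```python
import numpy as np
import scipy.sparse as sp

n = 30                                             # 30x30 grid, 5-point stencil
T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n))
A = (sp.kron(sp.identity(n), T) + sp.kron(T, sp.identity(n))).tocsr()
print("cond(A) ~", np.linalg.cond(A.toarray()))    # about 4e2

s = 16
x = np.random.default_rng(0).standard_normal(n * n)
V = np.empty((n * n, s + 1))                       # monomial basis [x, Ax, ..., A^s x]
V[:, 0] = x
for j in range(s):
    V[:, j + 1] = A @ V[:, j]

# Columns align with the dominant eigenvector, so the basis is numerically
# rank deficient in double precision: the source of the breakdown at s = 16.
print("cond(V) =", np.linalg.cond(V))
print("numerical rank =", np.linalg.matrix_rank(V), "out of", s + 1)
```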

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What is a "sparse matrix"?
• Requires o(n²) data/indices to store
• Nonzero entries and indices could be explicit or implicit:

  Nonzero entries \ Indices      Explicit (O(nnz))      Implicit (o(nnz))
  Explicit (O(nnz))              CSR and variations     Vision, climate, AMR, …
  Implicit (o(nnz))              Graph Laplacian        Stencils

• Matrix could be sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + UDVᵀ, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write Aᵏ = (S + UDVᵀ)ᵏ as a sum of Sᵏ and low-rank matrices
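A small sketch of the sparse-plus-low-rank case: keep S, U, D, V separate and apply A = S + UDVᵀ to a vector in O(nnz(S) + n·r) work, never forming A explicitly (sizes and names are illustrative):

```python
import numpy as np
import scipy.sparse as sp

n, r = 10000, 5
S = sp.random(n, n, density=1e-4, format="csr", random_state=0)   # sparse part
U, V = np.random.rand(n, r), np.random.rand(n, r)                 # low-rank part
D = np.diag(np.random.rand(r))                                    # r x r, small and square

def apply_A(x):
    # O(nnz(S) + n*r) work instead of the O(n^2) a dense A would need
    return S @ x + U @ (D @ (V.T @ x))

y = apply_A(np.random.rand(n))   # e.g. one SpMV inside a Krylov method
```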

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

101

• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe results
  – Willing to sacrifice 40-50% of performance for it
• Email to ~110 Berkeley CSE faculty, asking about it
  – Most: "What? How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Reproducible Floating Point Computation

Intel MKL non-reproducibility
[Figure: absolute error for random vectors (same magnitude, opposite signs) and relative error for orthogonal vectors; for the orthogonal vectors even the sign is not reproducible.]
Vector size: 1e6. Data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum - minimum
• Relative error = Absolute error / maximum absolute value

103
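The effect in the table is plain floating-point non-associativity: summing the same products in a different order (which is what different thread counts do) gives a different rounded result. A small sketch; math.fsum is shown only as one slow way to get an order-independent answer, not as the prerounding technique referenced below.

```python
import math
import numpy as np

rng = np.random.default_rng(42)
x = rng.standard_normal(10**6)
y = rng.standard_normal(10**6)

s_fwd = sum(x[i] * y[i] for i in range(len(x)))                    # one order
s_bwd = sum(x[i] * y[i] for i in reversed(range(len(x))))          # reversed order
s_blk = sum(np.dot(x[i:i + 1000], y[i:i + 1000])                   # blocked, like 1000 "threads"
            for i in range(0, len(x), 1000))

print(s_fwd - s_bwd, s_fwd - s_blk)     # generally nonzero
print(math.fsum(x * y))                 # correctly rounded, order-independent (but slow)
```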

• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (not 2 or 3)
  – Use (very) high precision to get exact answer (not 2)
  – Prerounding technique (Nguyen, D.)

GoalsApproaches for Reproducibility

104

Performance results on 1024 procs of Cray XC30: 1.2x to 3.2x slowdown vs fastest code, for n = 1M

Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya

Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger

bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David

Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang

bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou

Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National

Instruments NEC Nokia NVIDIA Samsung Oracle

• bebop.cs.berkeley.edu

Summary

Don't Communic…

106

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all ldquon3-likerdquo linear algebra
  • Lower bound for all ldquon3-likerdquo linear algebra (2)
  • Lower bound for all ldquon3-likerdquo linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a ldquoThmrdquo
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a ldquosparse matrixrdquo
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary
Page 4: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Lower bound for all ldquon3-likerdquo linear algebra

bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no

matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)

6

bull Let M = ldquofastrdquo memory size (per processor)

words_moved (per processor) = (flops (per processor) M12 )

messages_sent (per processor) = (flops (per processor) M32 )

bull Parallel case assume either load or memory balanced

Lower bound for all ldquon3-likerdquo linear algebra

bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no

matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)

7

bull Let M = ldquofastrdquo memory size (per processor)

words_moved (per processor) = (flops (per processor) M12 )

messages_sent ge words_moved largest_message_size

bull Parallel case assume either load or memory balanced

Lower bound for all ldquon3-likerdquo linear algebra

bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no

matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)

8

bull Let M = ldquofastrdquo memory size (per processor)

words_moved (per processor) = (flops (per processor) M12 )

messages_sent (per processor) = (flops (per processor) M32 )

bull Parallel case assume either load or memory balanced

SIAM SIAGLinear Algebra Prize 2012Ballard D Holtz Schwartz

Limits to parallel scaling (12)

bull Consider dense case flops_per_proc = n3Pndash Words = (n3(PM12 ))ndash Messages = (n3(PM32 ))

bull What is M Must be at least n2P to hold datandash Words = (n2P12 )ndash Messages = (P12 )

bull But if M fixed looks like perfect strong scaling in timendash Flops Words Messages all proportional to 1P

bull Ditto for energy if we count energy costs in joules hellipndash Per flop per word moved per messagendash Per word per second for data stored in memory Mndash Per second for leakage cooling hellip

bull How big can we make P and M

Limits to parallel scaling (22)

bull Consider dense case flops_per_proc = n3Pndash Words = (n3(PM12 ))ndash Messages = (n3(PM32 ))

bull How big can we make P and Mbull Assume we start with 1 copy of inputs A and B

ndash Otherwise no communication may be needed

bull Thm Words= (n2P23 ) independent of Mbull Reached when M = n2P23 too or P = n3M32 and Messages = (1) (log P in practice)bull Attained by 25D algorithm when c=P13 (ldquo3D algrdquo)bull Can keep increasing P until P = n3 Words = Messages = (1) (log n in practice)

Can we attain these lower bounds

bull Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these boundsndash Often not

bull If not are there other algorithms that dondash Yes for much of dense linear algebrandash New algorithms with new numerical properties

new ways to encode answers new data structures

ndash Not just loop transformations (need those too)bull Only a few sparse algorithms so farbull Lots of work in progress

ndash Algorithms Energy Heterogeneous Processors hellip11

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

25D Matrix Multiplication

bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid

c

(Pc)12

(Pc)12

Example P = 32 c = 2

25D Matrix Multiplication

bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid

k

j

iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12

(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)

(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)

(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)

25D Matmul on BGP 16K nodes 64K coresc = 16 copies

Distinguished Paper Award EuroParrsquo11 (Solomonik D)SCrsquo11 paper by Solomonik Bhatele D

12x faster

27x faster

Perfect Strong Scaling ndash in Time and Energy (12)

bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2

bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model

ndash γT βT αT = secs per flop per word_moved per message of size m

bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ]

= T(P)cbull Notation for energy model

ndash γE βE αE = joules for same operations

ndash δE = joules per word of memory used per sec

ndash εE = joules per sec for leakage etc

bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP)

= E(P)bull Perfect scaling extends to N-body Strassen hellip

Perfect Strong Scaling ndash in Time and Energy (22)

bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ] = T(P)c

bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP) = E(P)

bull Can use these formulas to answer many questions such asndash How to choose p and M to minimize energy E needed for computationndash Given max allowed runtime T what is minimum energy E needed to achieve

itndash Given max allowed energy E what is the minimum runtime T attainablendash Can we minimize the average power P = ETndash Given target energy efficiency what architectural parameters are needed to

achieve itbull Can we attain 75 GflopsWattbull Can we attain an exaflop for 20 MWatts

Handling Heterogeneity

bull Suppose each of P processors could differndash γi = secflop βi = secword αi = secmessage Mi = memory

bull What is optimal assignment of work Fi to minimize timendash Ti = Fi γi + Fi βi Mi

12 + Fi αi Mi32 = Fi [γi + βi Mi

12 + αi Mi32] = Fi ξi

ndash Choose Fi so Σi Fi = n3 and minimizing T = maxi Ti

ndash Answer Fi = n3(1ξi)Σj(1ξj) and T = n3Σj(1ξj)

bull Optimal Algorithm for nxn matmulndash Recursively divide into 8 half-sized subproblemsndash Assign subproblems to processor i to add up to Fi flops

bull Works for Strassen other algorithmshellip

Application to Tensor Contractions

bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

bull Heavily used in electronic structure calculationsndash Ex NWChem

bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries

ndash Solomonik Hammond Matthews

C(ijk) = Σm A(ijm)B(mk)

A3-fold symm

B2-fold symm

C2-fold symm

Application to Tensor Contractions

bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn

bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews

Communication Lower Bounds for Strassen-like matmul algorithms

bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected

bull Extends up to M = n2 p2ω bull Extends to rectangular case multiply (mxn)(nxp) in q mults

ndash words_moved = Ω (flopsM^(logmpq -1))

bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz also in JACMbull Is the lower bound attainable

Classical O(n3) matmul

words_moved =Ω (M(nM12)3P)

Strassenrsquos O(nlg7) matmul

words_moved =Ω (M(nM12)lg7P)

Strassen-like O(nω) matmul

words_moved =Ω (M(nM12)ωP)

vs

Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory

Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory

CAPS If EnoughMemory and P 7 then BFS step else DFS step end if

Communication Avoiding Parallel Strassen (CAPS)

Best way to interleaveBFS and DFS is an tuning parameter

26

Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

Speedups 24-184(over previous Strassen-based algorithms)

Invited to appear as Research Highlight in CACM

Strassen-like beyond matmul

bull Thm (D Dumitriu Holtzrsquo07) Any Strassen-like O(nω) matmul algorithm can be used to build a numerically stable O(nω+η) algorithm for any ηgt0 for Ax=b least squares eig SVD hellipndash ηgt0 needed to deal with numerical stabilityndash Strassen already stable so η=0

bull Thm For sequential versions of these algorithms Words_moved = O(nω+ηM(ω+η)2 ndash 1 + n2 log n) ie attain expected lower bound

Ballard D Holtz Schwartz

Cache and Network Oblivious Algorithms

bull Motivation Minimizes communication at every level of a hierarchical system without tuning parameters (in theory)ndash Not always 25D Matmul on BGP was topology aware

bull CAPS Divide-and-conquer choose BFS or DFS to adapt to processors available memory

bull CARMAndash Divide-and-conquer classical matmul divide largest of 3

dimensions to create two subproblemsndash Choose BFS or DFS to adapt to processors available memory

CARMA Performance Distributed Memory

Square m = k = n = 6144

ScaLAPACK

CARMA

Peak

(log)

(log)

Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA

CARMA Performance Distributed Memory

Inner Product m = n = 192 k = 6291456

ScaLAPACK

CARMAPeak

(log)

(log)

Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA

CARMA Performance Shared Memory

Square m = k = n

MKL (double)CARMA (double)

MKL (single)CARMA (single)

Peak (single)

Peak (double)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

CARMA Performance Shared Memory

Inner Product m = n = 64

MKL (double)

CARMA (double)

MKL (single)

CARMA (single)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

Why is CARMA Faster in Shared MemoryL3 Cache Misses

Shared Memory Inner Product (m = n = 64 k = 524288)

97 Fewer Misses

86 Fewer Misses

(linear)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

One-sided Factorizations (LU QR) so farbull Classical Approach for i=1 to n update column i update trailing matrixbull words_moved = O(n3)

35

bull Blocked Approach (LAPACK) for i=1 to nb update block i of b columns update trailing matrixbull words moved = O(n3M13)

bull Recursive Approach func factor(A) if A has 1 column update it

else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)

bull None of these approaches minimizes messagesbull Parallel case Partial

Pivoting =gt n reductionsbull Need another idea

TSQR An Architecture-Dependent Algorithm

W =

W0

W1

W2

W3

R00

R10

R20

R30

R01

R11

R02Parallel

W =

W0

W1

W2

W3

R01R02

R00

R03

SequentialStreaming

W =

W0

W1

W2

W3

R00

R01

R01

R11

R02

R11

R03

Dual Core

Can choose reduction tree dynamically

Multicore Multisocket Multirack Multisite Out-of-core

Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo

Wnxb =

W1

W2

W3

W4

P1middotL1middotU1

P2middotL2middotU2

P3middotL3middotU3

P4middotL4middotU4

=

Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo

W1rsquoW2rsquoW3rsquoW4rsquo

P12middotL12middotU12

P34middotL34middotU34

=Choose b pivot rows call them W12rsquo

Choose b pivot rows call them W34rsquo

W12rsquoW34rsquo

= P1234middotL1234middotU1234

Choose b pivot rows

Go back to W and use these b pivot rows (move them to top do LU without pivoting)

37

Minimizing Communication in TSLU

W = W1

W2

W3

W4

LULULULU

LU

LULUParallel

W = W1

W2

W3

W4

LULU

LU

LUSequentialStreaming

W = W1

W2

W3

W4

LULU LU

LULU

LULU

Dual Core

Can choose reduction tree dynamically to match architecture as before

38

Making TSLU Numerically Stable

bull Details matterndash Going up the tree we could do LU either on original rows of A

(tournament pivoting) or computed rows of Undash Only tournament pivoting stable

bull ldquoThmrdquo New scheme as stable as Partial Pivoting (GEPP) in following sense Get same Schur complements as GEPP applied to different input matrix whose entries are blocks taken from input A

bull Why just a ldquoThmrdquo

39

Stability of LU using TSLU CALU

Summer School Lecture 4 40

bull Empirical testingndash Both random matrices and ldquospecial onesrdquondash Both binary tree (BCALU) and flat-tree (FCALU)ndash 3 metrics ||PA-LU||||A|| normwise and componentwise backward errorsndash See [D Grigori Xiang 2010] for details

Why is stability of TSLU just a ldquoThmrdquo

bull Proof is correct ndash in exact arithmeticbull Experiment

ndash Generate 100 random 6x6 rank 3 matrices in Matlabndash [LUP] = lu(A) do LU without pivoting on PA compare L factors are

they the samebull Compute || L ndash Lnp || A few 0rsquos A few infinrsquos a few NaNsbull Rest mostly O(1)

ndash Why Floating point is nonassociative doing arithmetic in different order gives different rounding errors

ndash Same experiment with rank 6 matrices || L ndash Lnp || usually nonzero O(macheps)

ndash Same experiment with 20x20 rank 4 matrices || L ndash Lnp || often O(103)

bull Much harder to break TSLU but possiblendash Occurred when using TSLU to factorize a low-rank subdiagonal

panel in symmetric-indefinite factorization41

Fixing TSLU

bull Run TSLU quickly test for stability fix if necessary (rare)

bull Test conditioning of U if not tiny (usual case) proceed elsebull Compute || L || if not big (usual case) proceed elsebull Factor A = QR using TSQR thenbull Factor Q = PLU using TSLU thenbull A = PL(UR) with UR as upper triangular factor

bull Last topic in lecture how to guarantee floating point reproducibility

42

2D CALU with Tournament Pivoting

43

25D CALU with Tournament Pivoting (c=4 copies)

44

Exascale Machine ParametersSource DOE Exascale Workshop

bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log

2 (

n2p

) =

log

2 (m

emo

ry_p

er_p

roc)

Up to 29x

25D vs 2D LUWith and Without Pivoting

Other CA algorithms for Ax=b least squares(13)

bull A symmetric and indefinitendash Seek factorization that retains symmetry PAPT = LDLT D

ldquosimplerdquobull Save frac12 flops preserve inertia

ndash Usual approach Bunch-Kaufmanbull D block diagonal with 1x1 and 2x2 blocksbull Pivot search down column along row (lots of communication)

ndash Alternative Aasenbull D = tridiagonal = Tbull Two steps

ndash PAPT = LTLT where T is banded using TSLU

48

0 0

0

0 0

0

0

hellip

hellip

ndash Solvefactor narrow band problem with Tbull Up to 28x faster than MKL Best Paper at IPDPSrsquo13

Other CA algorithms for Ax=b least squares (23)bull Minimizing bandwidth and latency for sequential GEPP

ndash So far could not do partial pivoting and minimize messages just words

ndash Challengebull Column layout good for choosing pivots bad for matmulbull Blocked layout good for matmul bad for choosing pivots

ndash Solution use both layouts switching between thembull ldquoShape Morphing LUrdquo or SMLU

49

bull func factor(A) if A has 1 column update it else factor(left half of A)

update right half of A

factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M)

bull func factor(A) if A has 1 column update it else factor(left half of A) reshape to recursive block format update right half of A reshape to columnwise format factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M32)

Other CA algorithms for Ax=b least squares (33)bull Need for pivoting arises beyond LU in QR

ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)

ndash Usual approach like Partial Pivoting

bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound

ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two

groups of b columns either using usual approach or something better (GuEisenstat)

bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix

ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 50

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What about sparse matrices (13)

bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then

bull But canrsquot reorder outer loop for 25D need another idea

bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring

bull Kleenersquos Algorithm

52

for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)

D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21

Performance of 25D APSP using Kleene

53

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)

62xspeedup

2x speedup

What about sparse matrices (23)

bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of

G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix

bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering

for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid

bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse

multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on

separators) 54

What about sparse matrices (33)

bull If matrix stays very sparse lower bound unattainable new one

bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of

data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)

bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)

Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely

bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices

along dimensions most likely to minimize cost

55

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Symmetric Eigenproblem and SVD

bull Usual approach for A=AT (SVD similar)ndash A QTAQ = T where Q orthogonal T tridiagonalndash T UTTU = Λ where U orthogonal Λ diagonalndash QUrsquos columns are eigenvectors Λ eigenvaluesndash Dense Tridiagonal Diagonalndash Only half BLAS3 half BLAS2 in LAPACKrsquos sytrd

bull Communication-Avoiding Approachndash A QAQT = B where B=BT banded of bandwidth M12

ndash Continue as above starting with Bndash Dense Banded Tridiagonal Diagonalndash Dense Banded use TSQR to zero out M12 colsrows at a timendash Banded Tridiagonal need new(ish) idea

b+1

b+1

Successive Band Reduction (BischofLangSun)

1

b+1

b+1

d+1

c

Successive Band Reduction (BischofLangSun)

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1Q1

b+1

b+1

d+1

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

12

Q1

b+1

b+1

d+1

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

12

Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

Q1

Q1T

Q2

Q2T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

Q1

Q1T

Q2

Q2T

Q3

Q3T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

6

6

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

Conventional vs CA - SBR

Conventional Communication-Avoiding

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Nonsymmetric Eigenproblem

bull No apparent way to modify standard algorithmbull Instead Spectral Divide-and-Conquer

ndash Find orthogonal matrix Q whose leading columns span an invariant subspace of A

ndash QTAQ will be block upper triangular

ndash Apply recursively to A11 A22

ndash Depends on randomization1 Randomized Rank Revealing QR decomposition2 Randomized location to try splitting spectrum

A11 A12

ε A22

Attaining the Lower bounds SequentialLegend[Existing][Ours][Math-Lib][Random]

Two Levels Memory Hierarchy

Words Messages Words Messages

BLAS-3 [FLPRrsquo99][BDLSTrsquo13][MKL etc] [FLPRrsquo99][BDLSTrsquo13][MKL etc]

Cholesky[Grsquo97][APrsquo00]

[LAPACK][BDHSrsquo09]

[Grsquo97][APrsquo00][BDHSrsquo09] [Grsquo97][APrsquo00][BDHSrsquo09]

Sym Indefinite [BBDDDPSTYrsquo13] [BBDDDPSTYrsquo13]

LU[Grsquo97][Trsquo97]

[GDXrsquo11][BDLSTrsquo13]

[GDXrsquo11][BDLSTrsquo13]

[Grsquo97][Trsquo97] [BDLSTrsquo13] [BDLSTrsquo13]

QR[EGrsquo98][FWrsquo03]

[DGHLrsquo12][BDLSTrsquo13]

[FWrsquo03][DGHLrsquo12][BDLSTrsquo13]

[EGrsquo98][FWrsquo03][BDLSTrsquo13]

[FWrsquo03][BDLSTrsquo13]

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13] [BDDrsquo11]

Non Sym Eig [BDDrsquo11] [BDDrsquo11]

Legend[Existing][Ours][Math-Lib][Random]

Words (BW) Messages (L) Saving factor

BLAS-3 [AGZrsquo94][MTrsquo99][ScaLAPACK][Crsquo69][vGWrsquo97][SDrsquo11] L nP12

Cholesky [ScaLAPACK][Trsquo99][SDrsquo11] L nP12

Sym Indefinite [BBDDDPSTYrsquo13][ScaLAPACK] [BBDDDPSTYrsquo13] L nP12

LU [ScaLAPACK][GDXrsquo11][Trsquo99][SDrsquo11] [GDXrsquo11][Trsquo99][SDrsquo11] L nP12

QR [ScaLAPACK][DGHLrsquo12] [Trsquo99] [DGHLrsquo12][Trsquo99] L nP12

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13][ScaLAPACK] [BDDrsquo11][BDKrsquo13] L nP12

Non-Sym Eig [BDDrsquo11] [BDDrsquo11] BW P12 L n

Attaining with extra memory 25D M=(cn2P)

Attaining the Lower bounds Parallel 2DM=(n2P)(Ignoring poly-log(P) factors words = ( n2 P12) messages = (P12)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Conjugate Gradients (CG) GMRES Lanczos Arnoldi hellip bull Goal minimize communication

ndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computationndash Challenges Poor partitioning Preconditioning Num Stability

75

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Example The Difficulty of Tuning SpMV

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

77

Example The Difficulty of Tuning

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

bull 8x8 dense substructure exploit this to limit mem_refs

78

Speedups on Itanium 2 The Need for Search

Reference

Best 4x2

Mflops

Mflops

79

Register Profile Itanium 2

190 Mflops

1190 Mflops

80

Register Profiles IBM and Intel IA-64Power3 - 17 Power4 - 16

Itanium 2 - 33Itanium 1 - 8

252 Mflops

122 Mflops

820 Mflops

459 Mflops

247 Mflops

107 Mflops

12 Gflops

190 Mflops

Another example of tuning challenges for SpMV

bull Ex11 matrix (fluid flow)

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

82

Zoom in to top corner

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

83

3x3 blocks look natural buthellip

bull Example 3x3 blockingndash Logical grid of 3x3 cells

bull But would lead to lots of ldquofill-inrdquo

84

Extra Work Can Improve Efficiency

bull Example 3x3 blockingndash Logical grid of 3x3 cellsndash Fill-in explicit zerosndash Unroll 3x3 block multipliesndash ldquoFill ratiordquo = 15

bull On Pentium III 15x speedup

ndash Actual mflop rate 152 = 225 higher

85

Source Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal

Summer School Lecture 7

87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

Before Green + RedAfter Green + Blue

Summer School Lecture 7

892x speedups on Pentium 4 Power 4 hellip

Summary of Other Performance Optimizations

bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip

bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR

bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip

90

Optimized Sparse Kernel Interface - OSKI

bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning

bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski

bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression

software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki

91

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

93

Example Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in

each iteration

via CA Matrix Powers Kernel

Global reduction to compute G

94

Example CA-Conjugate Gradient

Local computations within inner loop require

no communication

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

96

Slower convergence due

to roundoff

Loss of accuracy due to roundoff

At s = 16 monomial basis is rank deficient Method breaks down

Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400

CA-CG (monomial)CG

machine precision

97

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit

bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square

bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank

matrices

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

101

bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010

ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver

demanded by customers (construction engineers) otherwise they donrsquot believe results

ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it

ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection

Reproducible Floating Point Computation

Absolute Error for Random Vectors

Same magnitude opposite signs

Intel MKL non-reproducibility

Relative Error for Orthogonal vectors

Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value

Sign notreproducible

103

bull Consider summation or dot productbull Goals

1 Same answer independent of layout processors order of summands

2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy

bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)

GoalsApproaches for Reproducibility

104

Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M

Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya

Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger

bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David

Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang

bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou

Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National

Instruments NEC Nokia NVIDIA Samsung Oracle

bull bebopcsberkeleyedu

Summary

Donrsquot Communichellip

106

Time to redesign all linear algebra n-body hellip algorithms and software

(and compilers)

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all ldquon3-likerdquo linear algebra
  • Lower bound for all ldquon3-likerdquo linear algebra (2)
  • Lower bound for all ldquon3-likerdquo linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a ldquoThmrdquo
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a ldquosparse matrixrdquo
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Lower bound for all ldquon3-likerdquo linear algebra

bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no

matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)

6

bull Let M = ldquofastrdquo memory size (per processor)

words_moved (per processor) = (flops (per processor) M12 )

messages_sent (per processor) = (flops (per processor) M32 )

bull Parallel case assume either load or memory balanced

Lower bound for all ldquon3-likerdquo linear algebra

bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no

matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)

7

bull Let M = ldquofastrdquo memory size (per processor)

words_moved (per processor) = (flops (per processor) M12 )

messages_sent ge words_moved largest_message_size

bull Parallel case assume either load or memory balanced

Lower bound for all ldquon3-likerdquo linear algebra

bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no

matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)

8

bull Let M = ldquofastrdquo memory size (per processor)

words_moved (per processor) = (flops (per processor) M12 )

messages_sent (per processor) = (flops (per processor) M32 )

bull Parallel case assume either load or memory balanced

SIAM SIAGLinear Algebra Prize 2012Ballard D Holtz Schwartz

Limits to parallel scaling (12)

• Consider dense case: flops_per_proc = n^3/P
  – Words = Ω(n^3/(P·M^(1/2)))
  – Messages = Ω(n^3/(P·M^(3/2)))
• What is M? Must be at least n^2/P to hold data
  – Words = Ω(n^2/P^(1/2))
  – Messages = Ω(P^(1/2))
• But if M fixed, looks like perfect strong scaling in time
  – Flops, Words, Messages all proportional to 1/P
• Ditto for energy, if we count energy costs in joules …
  – Per flop, per word moved, per message
  – Per word per second for data stored in memory M
  – Per second, for leakage, cooling, …
• How big can we make P? and M?

Limits to parallel scaling (22)

• Consider dense case: flops_per_proc = n^3/P
  – Words = Ω(n^3/(P·M^(1/2)))
  – Messages = Ω(n^3/(P·M^(3/2)))
• How big can we make P? and M?
• Assume we start with 1 copy of inputs A and B
  – Otherwise no communication may be needed
• Thm: Words = Ω(n^2/P^(2/3)), independent of M
• Reached when M = n^2/P^(2/3) too, or P = n^3/M^(3/2), and Messages = Ω(1) (log P in practice)
• Attained by 2.5D algorithm when c = P^(1/3) ("3D alg")
• Can keep increasing P until P = n^3; then Words = Messages = Ω(1) (log n in practice)

Can we attain these lower bounds

bull Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these boundsndash Often not

bull If not are there other algorithms that dondash Yes for much of dense linear algebrandash New algorithms with new numerical properties

new ways to encode answers new data structures

ndash Not just loop transformations (need those too)bull Only a few sparse algorithms so farbull Lots of work in progress

ndash Algorithms Energy Heterogeneous Processors hellip11

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

25D Matrix Multiplication

• Assume can fit cn^2/P data per processor, c > 1
• Processors form a (P/c)^(1/2) x (P/c)^(1/2) x c grid

[Figure: processor grid with dimensions (P/c)^(1/2) x (P/c)^(1/2) x c]

Example: P = 32, c = 2

25D Matrix Multiplication

• Assume can fit cn^2/P data per processor, c > 1
• Processors form a (P/c)^(1/2) x (P/c)^(1/2) x c grid, indexed (i,j,k)

Initially P(i,j,0) owns A(i,j) and B(i,j), each of size n(c/P)^(1/2) x n(c/P)^(1/2)

(1) P(i,j,0) broadcasts A(i,j) and B(i,j) to P(i,j,k)
(2) Processors at level k perform 1/c-th of SUMMA, i.e. 1/c-th of Σm A(i,m)·B(m,j)
(3) Sum-reduce partial sums Σm A(i,m)·B(m,j) along the k-axis, so P(i,j,0) owns C(i,j)
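A serial numpy sketch of these three steps (an illustration only, not a parallel implementation: the processor grid is emulated with array blocks, so just the data layout and the 1/c-way split of the summation dimension are shown; the sizes below are assumed to divide evenly):

    import numpy as np

    def matmul_25d(A, B, P=32, c=2):
        """Emulate the 2.5D algorithm: an s x s x c grid (s = sqrt(P/c)) where each
        layer k multiplies a 1/c-th slice of the summation index, and the layers
        are then sum-reduced. No real communication is performed here."""
        n = A.shape[0]
        s = int(round((P / c) ** 0.5))          # grid is s x s x c
        assert s * s * c == P and n % (s * c) == 0
        blk = n // s                            # block owned by P(i,j,0)
        C = np.zeros((n, n))
        for k in range(c):                      # each of the c layers...
            # ...handles a contiguous 1/c-th of the m (summation) dimension
            m_lo, m_hi = k * n // c, (k + 1) * n // c
            partial = np.zeros((n, n))
            for i in range(s):
                for j in range(s):
                    partial[i*blk:(i+1)*blk, j*blk:(j+1)*blk] = (
                        A[i*blk:(i+1)*blk, m_lo:m_hi] @ B[m_lo:m_hi, j*blk:(j+1)*blk])
            C += partial                        # step (3): sum-reduce along the k-axis
        return C

    if __name__ == "__main__":
        n = 64
        A, B = np.random.rand(n, n), np.random.rand(n, n)
        print(np.allclose(matmul_25d(A, B), A @ B))   # True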

2.5D Matmul on BG/P, 16K nodes / 64K cores, c = 16 copies

Distinguished Paper Award, EuroPar'11 (Solomonik, D.); SC'11 paper by Solomonik, Bhatele, D.

[Figure: performance vs problem size; annotations: 12x faster, 2.7x faster]

Perfect Strong Scaling ndash in Time and Energy (12)

• Every time you add a processor, you should use its memory M too
• Start with minimal number of procs: P·M = 3n^2
• Increase P by a factor of c; total memory increases by a factor of c
• Notation for timing model:
  – γT, βT, αT = secs per flop, per word_moved, per message of size m
• T(cP) = n^3/(cP) · [ γT + βT/M^(1/2) + αT/(m·M^(1/2)) ] = T(P)/c
• Notation for energy model:
  – γE, βE, αE = joules for same operations
  – δE = joules per word of memory used per sec
  – εE = joules per sec for leakage, etc.
• E(cP) = cP · { n^3/(cP) · [ γE + βE/M^(1/2) + αE/(m·M^(1/2)) ] + δE·M·T(cP) + εE·T(cP) } = E(P)
• Perfect scaling extends to N-body, Strassen, …

Perfect Strong Scaling ndash in Time and Energy (22)

• T(cP) = n^3/(cP) · [ γT + βT/M^(1/2) + αT/(m·M^(1/2)) ] = T(P)/c
• E(cP) = cP · { n^3/(cP) · [ γE + βE/M^(1/2) + αE/(m·M^(1/2)) ] + δE·M·T(cP) + εE·T(cP) } = E(P)
• Can use these formulas to answer many questions, such as:
  – How to choose p and M to minimize energy E needed for computation?
  – Given max allowed runtime T, what is minimum energy E needed to achieve it?
  – Given max allowed energy E, what is the minimum runtime T attainable?
  – Can we minimize the average power P = E/T?
  – Given target energy efficiency, what architectural parameters are needed to achieve it?
    • Can we attain 75 Gflops/Watt?
    • Can we attain an exaflop for 20 MWatts?
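A tiny calculator for the two formulas above (a sketch; every numerical parameter below is an illustrative placeholder chosen only to exercise the model, not a measurement of any real machine):

    def time_model(n, P, M, m, gT, bT, aT):
        # T = (n^3/P) * [ gT + bT/sqrt(M) + aT/(m*sqrt(M)) ]
        return n**3 / P * (gT + bT / M**0.5 + aT / (m * M**0.5))

    def energy_model(n, P, M, m, gE, bE, aE, dE, eE, T):
        # E = P * { (n^3/P)[gE + bE/sqrt(M) + aE/(m*sqrt(M))] + dE*M*T + eE*T }
        per_proc = (n**3 / P) * (gE + bE / M**0.5 + aE / (m * M**0.5)) + dE * M * T + eE * T
        return P * per_proc

    n, m = 10_000, 1_000                       # problem size, message size in words (assumed)
    gT, bT, aT = 1e-11, 1e-9, 1e-6             # secs per flop / word / message (assumed)
    gE, bE, aE, dE, eE = 1e-10, 1e-8, 1e-5, 1e-9, 1.0   # joules (assumed)
    P0 = 8
    M0 = 3 * n**2 / P0                         # start with P*M = 3n^2
    for c in (1, 2, 4):                        # add processors, keep M per proc fixed
        P = c * P0
        T = time_model(n, P, M0, m, gT, bT, aT)
        E = energy_model(n, P, M0, m, gE, bE, aE, dE, eE, T)
        print(f"c={c}: T={T:.3e} s (drops as T(P)/c), E={E:.3e} J (stays constant)")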

Handling Heterogeneity

• Suppose each of P processors could differ:
  – γi = sec/flop, βi = sec/word, αi = sec/message, Mi = memory
• What is the optimal assignment of work Fi to minimize time?
  – Ti = Fi·γi + Fi·βi/Mi^(1/2) + Fi·αi/Mi^(3/2) = Fi·[ γi + βi/Mi^(1/2) + αi/Mi^(3/2) ] = Fi·ξi
  – Choose Fi so Σi Fi = n^3 and minimizing T = maxi Ti
  – Answer: Fi = n^3·(1/ξi)/Σj(1/ξj) and T = n^3/Σj(1/ξj)
• Optimal algorithm for n x n matmul:
  – Recursively divide into 8 half-sized subproblems
  – Assign subproblems to processor i to add up to Fi flops
• Works for Strassen, other algorithms…
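A short sketch of this load-balancing rule (the per-processor parameters are hypothetical; the point is that choosing Fi proportional to 1/ξi equalizes the per-processor times):

    # Hypothetical heterogeneous machine: per-processor (gamma_i, beta_i, alpha_i, M_i).
    procs = [
        (1e-11, 1e-9, 1e-6, 1e8),   # fast node
        (4e-11, 2e-9, 2e-6, 4e7),   # slower node, less memory
        (8e-11, 4e-9, 4e-6, 1e7),   # slowest node
    ]
    n = 4096
    xi = [g + b / M**0.5 + a / M**1.5 for (g, b, a, M) in procs]   # xi_i
    inv_sum = sum(1.0 / x for x in xi)
    F = [n**3 * (1.0 / x) / inv_sum for x in xi]                   # flops assigned to proc i
    T = n**3 / inv_sum                                             # predicted runtime
    print("work fractions:", [f / n**3 for f in F])
    print("per-proc times:", [f * x for f, x in zip(F, xi)], "-> all equal to", T)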

Application to Tensor Contractions

bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

bull Heavily used in electronic structure calculationsndash Ex NWChem

bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries

ndash Solomonik Hammond Matthews

C(ijk) = Σm A(ijm)B(mk)

A3-fold symm

B2-fold symm

C2-fold symm

Application to Tensor Contractions

bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn

bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews

Communication Lower Bounds for Strassen-like matmul algorithms

• Proof: graph expansion (different from classical matmul)
  – Strassen-like DAG must be "regular" and connected
• Extends up to M = n^2/p^(2/ω)
• Extends to rectangular case: multiply (m x n)·(n x p) in q mults
  – words_moved = Ω(flops / M^(log_mp(q) − 1))
• Best Paper Prize (SPAA'11), Ballard, D., Holtz, Schwartz; also in JACM
• Is the lower bound attainable?

Classical O(n^3) matmul: words_moved = Ω(M·(n/M^(1/2))^3 / P)
Strassen's O(n^lg7) matmul: words_moved = Ω(M·(n/M^(1/2))^lg7 / P)
Strassen-like O(n^ω) matmul: words_moved = Ω(M·(n/M^(1/2))^ω / P)

vs

BFS: runs all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory
DFS: runs all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory

CAPS: if EnoughMemory and P ≥ 7 then BFS step else DFS step end if

Communication-Avoiding Parallel Strassen (CAPS)

Best way to interleave BFS and DFS is a tuning parameter

26
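A sketch of just the BFS/DFS scheduling decision (the memory bookkeeping below is a simplified stand-in for the "EnoughMemory" test, not the paper's exact accounting):

    def caps_schedule(n, P, mem_per_proc, depth=0, steps=None):
        """Recursively choose BFS or DFS steps for Strassen on an n x n problem.
        BFS: split the 7 subproblems across P/7 processor groups (~7/4 the memory).
        DFS: do the 7 subproblems one after another on all P processors (~1/4)."""
        if steps is None:
            steps = []
        if n <= 1 or depth >= 8:                      # coarse base case
            return steps
        footprint = 3 * (n * n) / P                   # words per proc for pieces of A, B, C
        enough_memory = 7.0 / 4.0 * footprint <= mem_per_proc
        if enough_memory and P >= 7:
            steps.append(("BFS", n, P))
            return caps_schedule(n // 2, P // 7, mem_per_proc, depth + 1, steps)
        steps.append(("DFS", n, P))
        return caps_schedule(n // 2, P, mem_per_proc, depth + 1, steps)

    print(caps_schedule(n=2**15, P=7**3, mem_per_proc=2**24))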

Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

Speedups 24-184(over previous Strassen-based algorithms)

Invited to appear as Research Highlight in CACM

Strassen-like beyond matmul

• Thm (D., Dumitriu, Holtz '07): Any Strassen-like O(n^ω) matmul algorithm can be used to build a numerically stable O(n^(ω+η)) algorithm, for any η > 0, for Ax=b, least squares, eig, SVD, …
  – η > 0 needed to deal with numerical stability
  – Strassen already stable, so η = 0
• Thm: For sequential versions of these algorithms, Words_moved = O(n^(ω+η)/M^((ω+η)/2 − 1) + n^2 log n), i.e. they attain the expected lower bound

Ballard D Holtz Schwartz

Cache and Network Oblivious Algorithms

• Motivation: minimize communication at every level of a hierarchical system without tuning parameters (in theory)
  – Not always: 2.5D Matmul on BG/P was topology aware
• CAPS: divide-and-conquer; choose BFS or DFS to adapt to # processors, available memory
• CARMA (see the sketch below):
  – Divide-and-conquer classical matmul: divide the largest of the 3 dimensions to create two subproblems
  – Choose BFS or DFS to adapt to # processors, available memory

CARMA Performance: Distributed Memory — Cray XE6 (Hopper), each node 2 x 12 cores, 4 x NUMA
[Figure: performance vs ScaLAPACK and peak (log-log axes) for a square case, m = k = n = 6144, and an inner-product case, m = n = 192, k = 6291456.]

CARMA Performance: Shared Memory — Intel Emerald, 4 x Intel Xeon X7560 (8 cores each), 4 x NUMA
[Figure: performance vs MKL and peak, single and double precision, for a square case, m = k = n, and an inner-product case, m = n = 64.]

Why is CARMA Faster in Shared Memory? L3 Cache Misses
[Figure: shared-memory inner product (m = n = 64, k = 524288): 97% fewer misses and 86% fewer misses than MKL.]

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

One-sided Factorizations (LU, QR), so far

• Classical approach:
    for i = 1 to n
        update column i
        update trailing matrix
  words_moved = O(n^3)

35

• Blocked approach (LAPACK):
    for i = 1 to n/b
        update block i of b columns
        update trailing matrix
  words_moved = O(n^3/M^(1/3))

• Recursive approach:
    func factor(A)
        if A has 1 column, update it
        else
            factor(left half of A)
            update right half of A
            factor(right half of A)
  words_moved = O(n^3/M^(1/2))

• None of these approaches minimizes # messages
• Parallel case: partial pivoting => n reductions
• Need another idea

TSQR An Architecture-Dependent Algorithm

[Figure: TSQR reduction trees applied to W = [W0; W1; W2; W3].
 Parallel (binary tree): QR each block Wi -> R00, R10, R20, R30; combine pairs -> R01, R11; combine those -> R02.
 Sequential / streaming (flat tree): fold the blocks in one at a time -> R00, R01, R02, R03.
 Dual core: a hybrid of the binary and flat trees.]

Can choose reduction tree dynamically:
Multicore, Multisocket, Multirack, Multisite, Out-of-core
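A numpy sketch of the parallel (binary-tree) TSQR reduction, keeping only the R factors; W is assumed tall and skinny with 4 row blocks, and the Q factors that a full implementation would accumulate implicitly are not formed here:

    import numpy as np

    def tsqr_R(W, nblocks=4):
        """Binary-tree TSQR: QR each row block, then repeatedly stack pairs of R
        factors and QR again. Returns an upper-triangular R with W = Q*R for some
        orthonormal Q (not formed here)."""
        blocks = np.array_split(W, nblocks, axis=0)
        Rs = [np.linalg.qr(b, mode="r") for b in blocks]        # leaf QRs (parallel step)
        while len(Rs) > 1:                                      # combine up the tree
            Rs = [np.linalg.qr(np.vstack(Rs[i:i+2]), mode="r")
                  for i in range(0, len(Rs), 2)]
        return Rs[0]

    W = np.random.rand(4000, 50)
    R_tree = tsqr_R(W)
    R_ref = np.linalg.qr(W, mode="r")
    # R is unique only up to the signs of its rows, so compare magnitudes
    print(np.allclose(np.abs(R_tree), np.abs(R_ref), atol=1e-8))   # True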

Back to LU: use a similar idea for TSLU as for TSQR — use a reduction tree to do "Tournament Pivoting"

W (n x b) = [W1; W2; W3; W4]

Round 1: factor each block, Wi = Pi·Li·Ui; choose b pivot rows of each Wi, call them Wi'
Round 2: stack [W1'; W2'] and [W3'; W4']; factor each as P12·L12·U12 and P34·L34·U34; choose b pivot rows of each, call them W12' and W34'
Round 3: stack [W12'; W34'] = P1234·L1234·U1234; choose the final b pivot rows

Go back to W and use these b pivot rows (move them to the top, then do LU without pivoting)

37
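A small numpy/scipy sketch of this tournament (a binary tree of height 2 on 4 row blocks; LU with partial pivoting on each stacked candidate set selects the b surviving rows — a simplified stand-in for a full CALU panel factorization, and the helper names are hypothetical):

    import numpy as np
    from scipy.linalg import lu

    def pivot_rows(block_rows, W, b):
        """Return the b rows (as indices into W) chosen by LU with partial pivoting
        applied to the candidate rows `block_rows` of the n-by-b panel W."""
        P, L, U = lu(W[block_rows])                    # W[block_rows] = P @ L @ U
        order = P.T @ np.arange(len(block_rows))       # row i of P^T W is the i-th pivot row
        return [block_rows[int(j)] for j in order[:b]]

    def tournament_pivoting(W, nblocks=4):
        n, b = W.shape
        groups = [list(r) for r in np.array_split(np.arange(n), nblocks)]
        winners = [pivot_rows(g, W, b) for g in groups]        # round 1: within each block
        while len(winners) > 1:                                # later rounds: pairwise playoffs
            winners = [pivot_rows(winners[i] + winners[i + 1], W, b)
                       for i in range(0, len(winners), 2)]
        return winners[0]                                      # the b tournament pivot rows

    W = np.random.rand(4000, 8)
    print(sorted(tournament_pivoting(W)))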

Minimizing Communication in TSLU

[Figure: same reduction-tree choices as TSQR, now with an LU at each node of the tree on W = [W1; W2; W3; W4]: parallel (binary tree), sequential/streaming (flat tree), or dual-core hybrid.]

Can choose reduction tree dynamically to match architecture, as before

38

Making TSLU Numerically Stable

• Details matter
  – Going up the tree, we could do LU either on original rows of A (tournament pivoting) or on computed rows of U
  – Only tournament pivoting is stable
• "Thm": New scheme as stable as Partial Pivoting (GEPP) in the following sense: it gets the same Schur complements as GEPP applied to a different input matrix whose entries are blocks taken from the input A
• Why just a "Thm"?

39

Stability of LU using TSLU CALU

40

• Empirical testing
  – Both random matrices and "special ones"
  – Both binary tree (BCALU) and flat-tree (FCALU)
  – 3 metrics: ||PA−LU||/||A||, normwise and componentwise backward errors
  – See [D., Grigori, Xiang, 2010] for details

Why is stability of TSLU just a "Thm"?

• Proof is correct – in exact arithmetic
• Experiment (see the sketch below):
  – Generate 100 random 6x6, rank-3 matrices in Matlab
  – [L,U,P] = lu(A); do LU without pivoting on P·A; compare L factors: are they the same?
    • Compute || L – Lnp ||: a few 0's, a few ∞'s, a few NaNs; the rest mostly O(1)
  – Why? Floating point is nonassociative; doing arithmetic in a different order gives different rounding errors
  – Same experiment with rank-6 matrices: || L – Lnp || usually nonzero, O(macheps)
  – Same experiment with 20x20 rank-4 matrices: || L – Lnp || often O(10^3)
• Much harder to break TSLU, but possible
  – Occurred when using TSLU to factorize a low-rank subdiagonal panel in symmetric-indefinite factorization

41
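A Python/numpy rendition of that experiment (assuming a textbook no-pivot LU as the comparison; with rank-deficient 6x6 matrices the no-pivot L often disagrees with the partial-pivoting L by O(1), or blows up):

    import numpy as np
    from scipy.linalg import lu

    def lu_nopivot(A):
        """Textbook LU without pivoting; returns L. May produce inf/nan if a zero
        or tiny pivot is hit, which is part of the point of the experiment."""
        A = A.astype(float).copy()
        n = A.shape[0]
        L = np.eye(n)
        with np.errstate(divide="ignore", invalid="ignore"):
            for k in range(n - 1):
                L[k+1:, k] = A[k+1:, k] / A[k, k]
                A[k+1:, k:] -= np.outer(L[k+1:, k], A[k, k:])
        return L

    rng = np.random.default_rng(0)
    diffs = []
    for _ in range(100):
        A = rng.standard_normal((6, 3)) @ rng.standard_normal((3, 6))   # rank 3
        P, L, U = lu(A)                       # A = P @ L @ U  (partial pivoting)
        Lnp = lu_nopivot(P.T @ A)             # no-pivot LU on the already-permuted matrix
        diffs.append(np.linalg.norm(L - Lnp))
    diffs = np.array(diffs)
    finite = diffs[np.isfinite(diffs)]
    print("non-finite results:", len(diffs) - len(finite),
          " max finite ||L - Lnp||:", finite.max())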

Fixing TSLU

• Run TSLU quickly, test for stability, fix if necessary (rare)
  • Test conditioning of U; if not tiny (usual case), proceed, else
  • Compute || L ||; if not big (usual case), proceed, else
  • Factor A = QR using TSQR, then
  • Factor Q = PLU using TSLU, then
  • A = P·L·(U·R), with U·R as the upper triangular factor
• Last topic in lecture: how to guarantee floating point reproducibility

42

2D CALU with Tournament Pivoting

43

25D CALU with Tournament Pivoting (c=4 copies)

44

Exascale Machine Parameters (Source: DOE Exascale Workshop)

• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node

Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU

[Figure: predicted speedup as a function of log2(p) and log2(n^2/p) = log2(memory_per_proc); up to 29x]

25D vs 2D LUWith and Without Pivoting

Other CA algorithms for Ax=b, least squares (1/3)

• A symmetric and indefinite
  – Seek factorization that retains symmetry: P·A·P^T = L·D·L^T, D "simple"
    • Save 1/2 the flops, preserve inertia
  – Usual approach: Bunch-Kaufman
    • D block diagonal with 1x1 and 2x2 blocks
    • Pivot search down column, along row (lots of communication)
  – Alternative: Aasen
    • D = tridiagonal = T
    • Two steps:
      – P·A·P^T = L·T·L^T, where T is banded, using TSLU

48

[Figure: banded T, zeros outside the band]

      – Solve/factor narrow band problem with T
    • Up to 2.8x faster than MKL; Best Paper at IPDPS'13

Other CA algorithms for Ax=b, least squares (2/3)

• Minimizing bandwidth and latency for sequential GEPP
  – So far, could not do partial pivoting and minimize # messages, just # words
  – Challenge:
    • Column layout good for choosing pivots, bad for matmul
    • Blocked layout good for matmul, bad for choosing pivots
  – Solution: use both layouts, switching between them
    • "Shape Morphing LU", or SMLU

49

• Recursive GEPP (one layout throughout):
    func factor(A)
        if A has 1 column, update it
        else
            factor(left half of A)
            update right half of A
            factor(right half of A)
  Words = O(n^3/M^(1/2)), Messages = O(n^3/M)

• SMLU (switch layouts inside the recursion):
    func factor(A)
        if A has 1 column, update it
        else
            factor(left half of A)
            reshape to recursive block format
            update right half of A
            reshape to columnwise format
            factor(right half of A)
  Words = O(n^3/M^(1/2)), Messages = O(n^3/M^(3/2))

Other CA algorithms for Ax=b, least squares (3/3)

• Need for pivoting arises beyond LU, in QR
  – Choose permutation P so that the leading columns of A·P = Q·R span the column space of A – Rank Revealing QR (RRQR)
  – Usual approach, like partial pivoting:
    • Put longest column first, update rest of matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit the lower bound
  – Use Tournament Pivoting
    • Each round of the tournament selects the best b columns from two groups of b columns, either using the usual approach or something better (Gu/Eisenstat)
    • Thm: This approach "reveals the rank" of A, in the sense that the leading r x r submatrix of R has singular values "near" the largest r singular values of A; ditto for the trailing submatrix
  – Idea extends to other pivoting schemes:
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDL^T with complete pivoting

50

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What about sparse matrices (13)

• If matrix quickly becomes dense, use a dense algorithm
• Ex: All Pairs Shortest Paths using Floyd-Warshall
• Similar to matmul: let D = A, then:
    for k = 1:n
      for i = 1:n
        for j = 1:n
          D(i,j) = min(D(i,j), D(i,k) + D(k,j))
• But can't reorder the outer loop for 2.5D; need another idea
• Abbreviate D(i,j) = min(D(i,j), min_k(A(i,k)+B(k,j))) by D = A*B
  – Dependencies ok, 2.5D works, just a different semiring
• Kleene's Algorithm (divide-and-conquer; see the sketch below):

52

    D = DC-APSP(A, n):
      D = A
      Partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
      D11 = DC-APSP(D11, n/2)
      D12 = D11 * D12
      D21 = D21 * D11
      D22 = D21 * D12
      D22 = DC-APSP(D22, n/2)
      D21 = D22 * D21
      D12 = D12 * D22
      D11 = D12 * D21
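A numpy sketch of this recursion (dense min-plus products stand in for the 2.5D-parallel ones; `mp` below implements the D = A*B abbreviation used on the slide):

    import numpy as np

    def mp(D, A, B):
        """The slide's D = A*B: D(i,j) = min(D(i,j), min_k A(i,k) + B(k,j))."""
        return np.minimum(D, (A[:, :, None] + B[None, :, :]).min(axis=1))

    def dc_apsp(D):
        """Divide-and-conquer APSP (Kleene) over the (min, +) semiring."""
        n = D.shape[0]
        if n == 1:
            return D.copy()
        D = D.copy()
        h = n // 2
        D11, D12, D21, D22 = D[:h, :h], D[:h, h:], D[h:, :h], D[h:, h:]
        D11[:] = dc_apsp(D11)
        D12[:] = mp(D12, D11, D12)
        D21[:] = mp(D21, D21, D11)
        D22[:] = mp(D22, D21, D12)
        D22[:] = dc_apsp(D22)
        D21[:] = mp(D21, D22, D21)
        D12[:] = mp(D12, D12, D22)
        D11[:] = mp(D11, D12, D21)
        return D

    def floyd_warshall(D):
        D = D.copy()
        for k in range(D.shape[0]):
            D = np.minimum(D, D[:, [k]] + D[[k], :])
        return D

    rng = np.random.default_rng(1)
    G = rng.uniform(1, 10, (17, 17))
    np.fill_diagonal(G, 0.0)
    print(np.allclose(dc_apsp(G), floyd_warshall(G)))   # True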

Performance of 25D APSP using Kleene

53

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)

62xspeedup

2x speedup

What about sparse matrices (23)

• If parts of the matrix become dense, optimize those
• Ex: Cholesky on a matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): If all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on a w x w submatrix
• Thm: Words_moved = Ω(w^3/M^(1/2)), etc.
• Thm (George '73): Nested dissection gives optimal ordering for 2D grid, 3D grid, similar matrices
  – w = n for a 2D n x n grid, w = n^2 for a 3D n x n x n grid
• Sequential multifrontal Cholesky attains the bounds
• PSPASES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)

54

What about sparse matrices (33)

• If matrix stays very sparse, lower bound unattainable; need a new one
• Ex: A, B both diagonal: no communication in parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), i.i.d.
• Assumption: algorithm is sparsity-independent: assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
    Words_moved = Ω(min( d·n/P^(1/2), d^2·n/P ))
  – Proof exploits the fact that reuse of entries of C = A·B is unlikely
• Contrast general lower bound: Words_moved = Ω(d^2·n/(P·M^(1/2)))
• Attained by a divide-and-conquer algorithm that splits the matrices along the dimensions most likely to minimize cost

55

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Symmetric Eigenproblem and SVD

• Usual approach for A = A^T (SVD similar):
  – A -> Q^T·A·Q = T, where Q orthogonal, T tridiagonal
  – T -> U^T·T·U = Λ, where U orthogonal, Λ diagonal
  – Q·U's columns are eigenvectors, Λ the eigenvalues
  – Dense -> Tridiagonal -> Diagonal
  – Only half BLAS3, half BLAS2, in LAPACK's sytrd
• Communication-Avoiding approach:
  – A -> Q·A·Q^T = B, where B = B^T, banded, of bandwidth M^(1/2)
  – Continue as above, starting with B
  – Dense -> Banded -> Tridiagonal -> Diagonal
  – Dense -> Banded: use TSQR to zero out M^(1/2) cols/rows at a time
  – Banded -> Tridiagonal: need a new(ish) idea

Successive Band Reduction (Bischof/Lang/Sun)

[Figure sequence: bulge-chasing sweeps 1 through 6 on a banded symmetric matrix, applying orthogonal transforms Q1, Q1^T, Q2, Q2^T, …, Q5, Q5^T. Annotations in each frame: b = bandwidth, c = # columns, d = # diagonals, constraint c + d <= b; block sizes b+1, d+1, c, d+c recur throughout.]

Conventional vs CA - SBR

Conventional: touch all data 4 times.  Communication-Avoiding: touch all data once.

Speedups of Sym. Band Reduction vs DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem

• No apparent way to modify the standard algorithm
• Instead: Spectral Divide-and-Conquer
  – Find an orthogonal matrix Q whose leading columns span an invariant subspace of A
  – Q^T·A·Q will be block upper triangular:  [ A11  A12 ]
                                             [  ε   A22 ]
  – Apply recursively to A11, A22
  – Depends on randomization:
    1. Randomized Rank Revealing QR decomposition
    2. Randomized location to try splitting the spectrum

Attaining the Lower bounds SequentialLegend[Existing][Ours][Math-Lib][Random]

Two Levels Memory Hierarchy

Words Messages Words Messages

BLAS-3 [FLPRrsquo99][BDLSTrsquo13][MKL etc] [FLPRrsquo99][BDLSTrsquo13][MKL etc]

Cholesky[Grsquo97][APrsquo00]

[LAPACK][BDHSrsquo09]

[Grsquo97][APrsquo00][BDHSrsquo09] [Grsquo97][APrsquo00][BDHSrsquo09]

Sym Indefinite [BBDDDPSTYrsquo13] [BBDDDPSTYrsquo13]

LU[Grsquo97][Trsquo97]

[GDXrsquo11][BDLSTrsquo13]

[GDXrsquo11][BDLSTrsquo13]

[Grsquo97][Trsquo97] [BDLSTrsquo13] [BDLSTrsquo13]

QR[EGrsquo98][FWrsquo03]

[DGHLrsquo12][BDLSTrsquo13]

[FWrsquo03][DGHLrsquo12][BDLSTrsquo13]

[EGrsquo98][FWrsquo03][BDLSTrsquo13]

[FWrsquo03][BDLSTrsquo13]

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13] [BDDrsquo11]

Non Sym Eig [BDDrsquo11] [BDDrsquo11]

Legend[Existing][Ours][Math-Lib][Random]

Words (BW) Messages (L) Saving factor

BLAS-3 [AGZrsquo94][MTrsquo99][ScaLAPACK][Crsquo69][vGWrsquo97][SDrsquo11] L nP12

Cholesky [ScaLAPACK][Trsquo99][SDrsquo11] L nP12

Sym Indefinite [BBDDDPSTYrsquo13][ScaLAPACK] [BBDDDPSTYrsquo13] L nP12

LU [ScaLAPACK][GDXrsquo11][Trsquo99][SDrsquo11] [GDXrsquo11][Trsquo99][SDrsquo11] L nP12

QR [ScaLAPACK][DGHLrsquo12] [Trsquo99] [DGHLrsquo12][Trsquo99] L nP12

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13][ScaLAPACK] [BDDrsquo11][BDKrsquo13] L nP12

Non-Sym Eig [BDDrsquo11] [BDDrsquo11] BW P12 L n

Attaining with extra memory 25D M=(cn2P)

Attaining the Lower bounds Parallel 2DM=(n2P)(Ignoring poly-log(P) factors words = ( n2 P12) messages = (P12)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and a starting vector
  – Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation:
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors:
    • Conventional: O(k log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability

75

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Example The Difficulty of Tuning SpMV

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

77

Example The Difficulty of Tuning

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

bull 8x8 dense substructure exploit this to limit mem_refs

78

Speedups on Itanium 2 The Need for Search

Reference

Best 4x2

Mflops

Mflops

79

Register Profile Itanium 2

190 Mflops

1190 Mflops

80

Register Profiles IBM and Intel IA-64Power3 - 17 Power4 - 16

Itanium 2 - 33Itanium 1 - 8

252 Mflops

122 Mflops

820 Mflops

459 Mflops

247 Mflops

107 Mflops

12 Gflops

190 Mflops

Another example of tuning challenges for SpMV

bull Ex11 matrix (fluid flow)

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

82

Zoom in to top corner

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

83

3x3 blocks look natural, but…

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

84

Extra Work Can Improve Efficiency

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup!
  – Actual mflop rate is 1.5^2 = 2.25x higher

85
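A scipy sketch of register-style blocking: convert a sparse matrix to Block Sparse Row (BSR) format with r x c blocks and measure the resulting fill ratio from the explicit zeros added (the random matrix below is only a stand-in for the structural-analysis matrices shown in the plots):

    import numpy as np
    import scipy.sparse as sp

    def fill_ratio(A_csr, r, c):
        """Stored entries after r x c blocking (BSR pads each block with explicit
        zeros) divided by the true number of nonzeros."""
        A_bsr = A_csr.tobsr(blocksize=(r, c))
        stored = A_bsr.data.shape[0] * r * c      # every stored block holds r*c entries
        return stored / A_csr.nnz

    n = 3000
    A = sp.random(n, n, density=2e-3, format="csr", random_state=0)
    x = np.random.rand(n)
    for (r, c) in [(1, 1), (2, 2), (3, 3), (4, 4)]:
        print((r, c), "fill ratio = %.2f" % fill_ratio(A, r, c))
    # SpMV itself is just A @ x in any of these formats; the payoff of blocking is
    # fewer index loads and unrolled r x c multiplies, at the cost of the extra
    # (fill ratio - 1) redundant flops.
    y = A @ x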

Source Accelerator Cavity Design Problem (Ko via Husbands)

86

[Figure: 100x100 submatrix along the diagonal]

87

[Figure: post-RCM reordering]

88

[Figure: effect of combined RCM+TSP reordering — before: green + red; after: green + blue]

89

2x speedups on Pentium 4, Power 4, …

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·A^T·x, A^T·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later …

90

Optimized Sparse Kernel Interface - OSKI

• Provides sparse kernels automatically tuned for the user's matrix & machine
  – BLAS-style functionality: SpMV (A·x & A^T·y), TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as stand-alone library
  – Available as PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

91

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

93

Example Classical Conjugate Gradient (CG)

[Figure: classical CG pseudocode — the SpMV and the dot products require communication in each iteration.]

94

Example: CA-Conjugate Gradient

[Figure: CA-CG pseudocode — the s SpMVs are replaced by one call to the CA matrix powers kernel, and the dot products by one global reduction to compute a Gram matrix G; the local computations within the inner loop require no communication.]
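For reference, a plain classical CG in numpy with comments marking where communication would occur in a distributed implementation (this is the textbook method, not the CA-CG reorganization itself):

    import numpy as np
    import scipy.sparse as sp

    def cg(A, b, tol=1e-10, maxit=500):
        """Classical CG. In a distributed setting, each iteration needs one SpMV
        (neighbor communication) and two dot products (global reductions)."""
        x = np.zeros_like(b)
        r = b - A @ x
        p = r.copy()
        rr = r @ r                         # dot product -> global reduction
        for _ in range(maxit):
            Ap = A @ p                     # SpMV -> communicate halo/ghost entries
            alpha = rr / (p @ Ap)          # dot product -> global reduction
            x += alpha * p
            r -= alpha * Ap
            rr_new = r @ r                 # dot product -> global reduction
            if rr_new ** 0.5 < tol:
                break
            p = r + (rr_new / rr) * p
            rr = rr_new
        return x

    # 2D Poisson, 5-point stencil on a 30x30 grid (the model problem used below)
    m = 30
    T = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(m, m))
    A = sp.kronsum(T, T).tocsr()
    b = np.ones(A.shape[0])
    x = cg(A, b)
    print(np.linalg.norm(A @ x - b))       # small residual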

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

96

[Figure: convergence of CG vs CA-CG with the monomial basis, for the model problem (2D Poisson, 5-point stencil, 30x30 grid, cond(A) ~ 400). CA-CG (monomial) shows slower convergence and loss of accuracy due to roundoff; at s = 16 the monomial basis is rank deficient and the method breaks down; CG converges to machine precision.]

97
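A quick way to see the underlying issue is to form the monomial Krylov basis [p, Ap, A^2·p, …, A^s·p] for that model problem and watch its conditioning explode with s (a sketch only; practical CA-KSMs use better-conditioned bases, e.g. Newton or Chebyshev):

    import numpy as np
    import scipy.sparse as sp

    m = 30                                           # 30x30 grid, 2D Poisson, 5-point stencil
    T = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(m, m))
    A = sp.kronsum(T, T).tocsr()
    rng = np.random.default_rng(0)
    p = rng.standard_normal(A.shape[0])

    V = [p / np.linalg.norm(p)]
    for s in (4, 8, 16):
        while len(V) < s + 1:
            V.append(A @ V[-1])                      # monomial basis: v_{j+1} = A v_j
        K = np.column_stack(V)
        print(f"s = {s:2d}: cond(basis) = {np.linalg.cond(K):.2e}, "
              f"numerical rank = {np.linalg.matrix_rank(K)} of {K.shape[1]}")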

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What is a "sparse matrix"?

• Requires o(n^2) data/indices to store
• Nonzero entries and indices could be explicit or implicit:

                                Indices
                                Explicit (O(nnz))     Implicit (o(nnz))
  Nonzero   Explicit (O(nnz))   CSR and variations    Vision, climate, AMR, …
  entries   Implicit (o(nnz))   Graph Laplacian       Stencils

• Matrix could be a sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + U·D·V^T, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write A^k = (S + U·D·V^T)^k as a sum of S^k and low-rank matrices

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

101

• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought reproducible parallel sparse linear equation solver, demanded by customers (construction engineers), otherwise they don't believe results
  – Willing to sacrifice 40% - 50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What?! How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Reproducible Floating Point Computation

Intel MKL non-reproducibility

[Figure: absolute error for random vectors (same magnitude, opposite signs) and relative error for orthogonal vectors; for the orthogonal case even the sign is not reproducible.]

Vector size 1e6, data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3 or 4 threads
• Absolute error = maximum – minimum
• Relative error = absolute error / maximum absolute value

103

• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, # processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (not 2 or 3)
  – Use (very) high precision to get exact answer (not 2)
  – Pre-rounding technique (Nguyen, D.)

Goals/Approaches for Reproducibility

104
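A tiny demonstration of the underlying nonassociativity, plus the simplest (slow) reproducible fallback — a correctly rounded sum via math.fsum; this only illustrates the problem and the "high precision / exact" idea, not the pre-rounding technique referenced above:

    import math
    import random

    random.seed(0)
    x = [random.uniform(-1, 1) * 10 ** random.randint(0, 12) for _ in range(10**5)]

    orders = [list(x) for _ in range(4)]
    for xs in orders[1:]:
        random.shuffle(xs)                    # same summands, different orders

    naive = [sum(xs) for xs in orders]        # left-to-right float addition
    exact = [math.fsum(xs) for xs in orders]  # correctly rounded sum, any order

    print("naive sums spread:", max(naive) - min(naive))   # generally nonzero
    print("fsum reproducible:", len(set(exact)) == 1)      # True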

Performance results on 1024 procs of a Cray XC30: 1.2x to 3.2x slowdown vs fastest code, for n = 1M

Collaborators and Supporters

• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle

• bebop.cs.berkeley.edu

Summary

Don't Communic…

106

Time to redesign all linear algebra, n-body, … algorithms and software

(and compilers)

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all ldquon3-likerdquo linear algebra
  • Lower bound for all ldquon3-likerdquo linear algebra (2)
  • Lower bound for all ldquon3-likerdquo linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a ldquoThmrdquo
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a ldquosparse matrixrdquo
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary
Page 6: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

Lower bound for all ldquon3-likerdquo linear algebra

bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no

matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)

6

bull Let M = ldquofastrdquo memory size (per processor)

words_moved (per processor) = (flops (per processor) M12 )

messages_sent (per processor) = (flops (per processor) M32 )

bull Parallel case assume either load or memory balanced

Lower bound for all ldquon3-likerdquo linear algebra

bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no

matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)

7

bull Let M = ldquofastrdquo memory size (per processor)

words_moved (per processor) = (flops (per processor) M12 )

messages_sent ge words_moved largest_message_size

bull Parallel case assume either load or memory balanced

Lower bound for all ldquon3-likerdquo linear algebra

bull Holds forndash Matmul BLAS LU QR eig SVD tensor contractions hellipndash Some whole programs (sequences of these operations no

matter how individual ops are interleaved eg Ak)ndash Dense and sparse matrices (where flops ltlt n3 )ndash Sequential and parallel algorithmsndash Some graph-theoretic algorithms (eg Floyd-Warshall)

8

bull Let M = ldquofastrdquo memory size (per processor)

words_moved (per processor) = (flops (per processor) M12 )

messages_sent (per processor) = (flops (per processor) M32 )

bull Parallel case assume either load or memory balanced

SIAM SIAGLinear Algebra Prize 2012Ballard D Holtz Schwartz

Limits to parallel scaling (12)

bull Consider dense case flops_per_proc = n3Pndash Words = (n3(PM12 ))ndash Messages = (n3(PM32 ))

bull What is M Must be at least n2P to hold datandash Words = (n2P12 )ndash Messages = (P12 )

bull But if M fixed looks like perfect strong scaling in timendash Flops Words Messages all proportional to 1P

bull Ditto for energy if we count energy costs in joules hellipndash Per flop per word moved per messagendash Per word per second for data stored in memory Mndash Per second for leakage cooling hellip

bull How big can we make P and M

Limits to parallel scaling (22)

bull Consider dense case flops_per_proc = n3Pndash Words = (n3(PM12 ))ndash Messages = (n3(PM32 ))

bull How big can we make P and Mbull Assume we start with 1 copy of inputs A and B

ndash Otherwise no communication may be needed

bull Thm Words= (n2P23 ) independent of Mbull Reached when M = n2P23 too or P = n3M32 and Messages = (1) (log P in practice)bull Attained by 25D algorithm when c=P13 (ldquo3D algrdquo)bull Can keep increasing P until P = n3 Words = Messages = (1) (log n in practice)

Can we attain these lower bounds

bull Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these boundsndash Often not

bull If not are there other algorithms that dondash Yes for much of dense linear algebrandash New algorithms with new numerical properties

new ways to encode answers new data structures

ndash Not just loop transformations (need those too)bull Only a few sparse algorithms so farbull Lots of work in progress

ndash Algorithms Energy Heterogeneous Processors hellip11

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

25D Matrix Multiplication

bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid

c

(Pc)12

(Pc)12

Example P = 32 c = 2

25D Matrix Multiplication

bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid

k

j

iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12

(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)

(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)

(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)

25D Matmul on BGP 16K nodes 64K coresc = 16 copies

Distinguished Paper Award EuroParrsquo11 (Solomonik D)SCrsquo11 paper by Solomonik Bhatele D

12x faster

27x faster

Perfect Strong Scaling ndash in Time and Energy (12)

bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2

bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model

ndash γT βT αT = secs per flop per word_moved per message of size m

bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ]

= T(P)cbull Notation for energy model

ndash γE βE αE = joules for same operations

ndash δE = joules per word of memory used per sec

ndash εE = joules per sec for leakage etc

bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP)

= E(P)bull Perfect scaling extends to N-body Strassen hellip

Perfect Strong Scaling ndash in Time and Energy (22)

bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ] = T(P)c

bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP) = E(P)

bull Can use these formulas to answer many questions such asndash How to choose p and M to minimize energy E needed for computationndash Given max allowed runtime T what is minimum energy E needed to achieve

itndash Given max allowed energy E what is the minimum runtime T attainablendash Can we minimize the average power P = ETndash Given target energy efficiency what architectural parameters are needed to

achieve itbull Can we attain 75 GflopsWattbull Can we attain an exaflop for 20 MWatts

Handling Heterogeneity

bull Suppose each of P processors could differndash γi = secflop βi = secword αi = secmessage Mi = memory

bull What is optimal assignment of work Fi to minimize timendash Ti = Fi γi + Fi βi Mi

12 + Fi αi Mi32 = Fi [γi + βi Mi

12 + αi Mi32] = Fi ξi

ndash Choose Fi so Σi Fi = n3 and minimizing T = maxi Ti

ndash Answer Fi = n3(1ξi)Σj(1ξj) and T = n3Σj(1ξj)

bull Optimal Algorithm for nxn matmulndash Recursively divide into 8 half-sized subproblemsndash Assign subproblems to processor i to add up to Fi flops

bull Works for Strassen other algorithmshellip

Application to Tensor Contractions

bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

bull Heavily used in electronic structure calculationsndash Ex NWChem

bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries

ndash Solomonik Hammond Matthews

C(ijk) = Σm A(ijm)B(mk)

A3-fold symm

B2-fold symm

C2-fold symm

Application to Tensor Contractions

bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn

bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews

Communication Lower Bounds for Strassen-like matmul algorithms

bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected

bull Extends up to M = n2 p2ω bull Extends to rectangular case multiply (mxn)(nxp) in q mults

ndash words_moved = Ω (flopsM^(logmpq -1))

bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz also in JACMbull Is the lower bound attainable

Classical O(n3) matmul

words_moved =Ω (M(nM12)3P)

Strassenrsquos O(nlg7) matmul

words_moved =Ω (M(nM12)lg7P)

Strassen-like O(nω) matmul

words_moved =Ω (M(nM12)ωP)

vs

Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory

Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory

CAPS If EnoughMemory and P 7 then BFS step else DFS step end if

Communication Avoiding Parallel Strassen (CAPS)

Best way to interleaveBFS and DFS is an tuning parameter

26

Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

Speedups 24-184(over previous Strassen-based algorithms)

Invited to appear as Research Highlight in CACM

Strassen-like beyond matmul

bull Thm (D Dumitriu Holtzrsquo07) Any Strassen-like O(nω) matmul algorithm can be used to build a numerically stable O(nω+η) algorithm for any ηgt0 for Ax=b least squares eig SVD hellipndash ηgt0 needed to deal with numerical stabilityndash Strassen already stable so η=0

bull Thm For sequential versions of these algorithms Words_moved = O(nω+ηM(ω+η)2 ndash 1 + n2 log n) ie attain expected lower bound

Ballard D Holtz Schwartz

Cache and Network Oblivious Algorithms

bull Motivation Minimizes communication at every level of a hierarchical system without tuning parameters (in theory)ndash Not always 25D Matmul on BGP was topology aware

bull CAPS Divide-and-conquer choose BFS or DFS to adapt to processors available memory

bull CARMAndash Divide-and-conquer classical matmul divide largest of 3

dimensions to create two subproblemsndash Choose BFS or DFS to adapt to processors available memory

CARMA Performance Distributed Memory

Square m = k = n = 6144

ScaLAPACK

CARMA

Peak

(log)

(log)

Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA

CARMA Performance Distributed Memory

Inner Product m = n = 192 k = 6291456

ScaLAPACK

CARMAPeak

(log)

(log)

Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA

CARMA Performance Shared Memory

Square m = k = n

MKL (double)CARMA (double)

MKL (single)CARMA (single)

Peak (single)

Peak (double)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

CARMA Performance Shared Memory

Inner Product m = n = 64

MKL (double)

CARMA (double)

MKL (single)

CARMA (single)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

Why is CARMA Faster in Shared MemoryL3 Cache Misses

Shared Memory Inner Product (m = n = 64 k = 524288)

97 Fewer Misses

86 Fewer Misses

(linear)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

One-sided Factorizations (LU QR) so farbull Classical Approach for i=1 to n update column i update trailing matrixbull words_moved = O(n3)

35

bull Blocked Approach (LAPACK) for i=1 to nb update block i of b columns update trailing matrixbull words moved = O(n3M13)

bull Recursive Approach func factor(A) if A has 1 column update it

else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)

bull None of these approaches minimizes messagesbull Parallel case Partial

Pivoting =gt n reductionsbull Need another idea

TSQR An Architecture-Dependent Algorithm

W =

W0

W1

W2

W3

R00

R10

R20

R30

R01

R11

R02Parallel

W =

W0

W1

W2

W3

R01R02

R00

R03

SequentialStreaming

W =

W0

W1

W2

W3

R00

R01

R01

R11

R02

R11

R03

Dual Core

Can choose reduction tree dynamically

Multicore Multisocket Multirack Multisite Out-of-core

Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo

Wnxb =

W1

W2

W3

W4

P1middotL1middotU1

P2middotL2middotU2

P3middotL3middotU3

P4middotL4middotU4

=

Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo

W1rsquoW2rsquoW3rsquoW4rsquo

P12middotL12middotU12

P34middotL34middotU34

=Choose b pivot rows call them W12rsquo

Choose b pivot rows call them W34rsquo

W12rsquoW34rsquo

= P1234middotL1234middotU1234

Choose b pivot rows

Go back to W and use these b pivot rows (move them to top do LU without pivoting)

37

Minimizing Communication in TSLU

W = W1

W2

W3

W4

LULULULU

LU

LULUParallel

W = W1

W2

W3

W4

LULU

LU

LUSequentialStreaming

W = W1

W2

W3

W4

LULU LU

LULU

LULU

Dual Core

Can choose reduction tree dynamically to match architecture as before

38

Making TSLU Numerically Stable

bull Details matterndash Going up the tree we could do LU either on original rows of A

(tournament pivoting) or computed rows of Undash Only tournament pivoting stable

bull ldquoThmrdquo New scheme as stable as Partial Pivoting (GEPP) in following sense Get same Schur complements as GEPP applied to different input matrix whose entries are blocks taken from input A

bull Why just a ldquoThmrdquo

39

Stability of LU using TSLU CALU

Summer School Lecture 4 40

bull Empirical testingndash Both random matrices and ldquospecial onesrdquondash Both binary tree (BCALU) and flat-tree (FCALU)ndash 3 metrics ||PA-LU||||A|| normwise and componentwise backward errorsndash See [D Grigori Xiang 2010] for details

Why is stability of TSLU just a "Thm"?
• Proof is correct (in exact arithmetic)
• Experiment:
  – Generate 100 random 6x6, rank-3 matrices in Matlab
  – [L,U,P] = lu(A); do LU without pivoting on P·A; compare L factors: are they the same?
    • Compute || L - Lnp ||: a few 0's, a few ∞'s, a few NaNs; the rest mostly O(1)
  – Why? Floating point is nonassociative; doing arithmetic in a different order gives different rounding errors
  – Same experiment with rank-6 matrices: || L - Lnp || usually nonzero, O(macheps)
  – Same experiment with 20x20, rank-4 matrices: || L - Lnp || often O(10^3)
• Much harder to break TSLU, but possible
  – Occurred when using TSLU to factorize a low-rank subdiagonal panel in a symmetric-indefinite factorization

Fixing TSLU
• Run TSLU quickly, test for stability, fix if necessary (rare):
  – Test conditioning of U; if not tiny (usual case), proceed, else
  – Compute || L ||; if not big (usual case), proceed, else
  – Factor A = QR using TSQR, then
  – Factor Q = PLU using TSLU, then
  – A = P·L·(U·R), with U·R as the upper triangular factor
• Last topic in lecture: how to guarantee floating point reproducibility

2D CALU with Tournament Pivoting

2.5D CALU with Tournament Pivoting (c=4 copies)

Exascale Machine Parameters (Source: DOE Exascale Workshop)
• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node
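A back-of-envelope sketch (my own, not the talk's performance model) that plugs these machine parameters into the lower bounds words = flops/M^(1/2) and messages = flops/M^(3/2), for a dense LU sized to fill aggregate memory; every variable name and the 2/3·n^3 flop count are assumptions of the sketch.

```python
P       = 2**20                       # nodes
bw      = 100e9                       # interconnect bandwidth, bytes/s
alpha   = 1e-6                        # interconnect latency, s
mem     = 32e15                       # total memory, bytes
M       = mem / P / 8                 # fast memory per node, in 8-byte words
n       = int((mem / 8) ** 0.5)       # largest dense matrix that fits in aggregate memory

flops    = (2.0 / 3.0) * n**3 / P     # flops per node for LU
words    = flops / M**0.5             # communication lower bound (words per node)
messages = flops / M**1.5             # communication lower bound (messages per node)

t_comm = words * 8 / bw + messages * alpha
print(f"n = {n:.2e}: >= {words:.2e} words, {messages:.2e} messages, "
      f"comm time >= {t_comm:.2e} s per node")
```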

Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU
(Heat map over log2(p) and log2(n^2/p) = log2(memory_per_proc); speedups up to 29x.)

2.5D vs 2D LU, With and Without Pivoting

Other CA algorithms for Ax=b, least squares (1/3)
• A symmetric and indefinite
  – Seek a factorization that retains symmetry: P·A·P^T = L·D·L^T, with D "simple"
    • Saves half the flops, preserves inertia
  – Usual approach: Bunch-Kaufman
    • D block diagonal with 1x1 and 2x2 blocks
    • Pivot search down column and along row (lots of communication)
  – Alternative: Aasen
    • D = tridiagonal = T
    • Two steps: (1) P·A·P^T = L·T·L^T, where T is banded, using TSLU; (2) solve/factor the narrow-band problem with T
    • Up to 2.8x faster than MKL; Best Paper Award at IPDPS'13

Other CA algorithms for Ax=b, least squares (2/3)
• Minimizing bandwidth and latency for sequential GEPP
  – So far, could not do partial pivoting and minimize messages, just words
  – Challenge:
    • Column layout good for choosing pivots, bad for matmul
    • Blocked layout good for matmul, bad for choosing pivots
  – Solution: use both layouts, switching between them: "Shape Morphing LU" (SMLU)

  Recursive LU (columnwise layout):
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        update right half of A
        factor(right half of A)
  Words = O(n^3 / M^(1/2)); Messages = O(n^3 / M)

  Shape Morphing LU (SMLU):
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        reshape to recursive block format
        update right half of A
        reshape to columnwise format
        factor(right half of A)
  Words = O(n^3 / M^(1/2)); Messages = O(n^3 / M^(3/2))

Other CA algorithms for Ax=b, least squares (3/3)
• Need for pivoting arises beyond LU, in QR
  – Choose permutation P so that the leading columns of AP = QR span the column space of A: Rank-Revealing QR (RRQR)
  – Usual approach, like partial pivoting:
    • Put the longest column first, update the rest of the matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit the lower bound
  – Use Tournament Pivoting
    • Each round of the tournament selects the best b columns from two groups of b columns, either using the usual approach or something better (Gu/Eisenstat)
    • Thm: this approach "reveals the rank" of A, in the sense that the leading r x r submatrix of R has singular values "near" the largest r singular values of A; ditto for the trailing submatrix
  – Idea extends to other pivoting schemes
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDL^T with complete pivoting
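A sketch of the column tournament using SciPy's QR with column pivoting as the per-group contest (a Gu/Eisenstat-style strong RRQR would be the better building block); the function names, group count, and test matrix are assumptions.

```python
import numpy as np
from scipy.linalg import qr

def best_cols(block, b):
    # QR with column pivoting picks the b "best" columns within a group.
    _, _, piv = qr(block, mode='economic', pivoting=True)
    return block[:, piv[:b]]

def tournament_cols(A, b, ngroups=4):
    cands = [best_cols(G, b) for G in np.array_split(A, ngroups, axis=1)]
    while len(cands) > 1:                      # combine pairs of groups up the tree
        cands = [best_cols(np.hstack(cands[i:i + 2]), b)
                 for i in range(0, len(cands), 2)]
    return cands[0]                            # b candidate leading columns for AP = QR

A = np.random.randn(200, 64) @ np.random.randn(64, 200)   # numerically low-rank example
leading = tournament_cols(A, b=16)
```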


What about sparse matrices? (1/3)
• If the matrix quickly becomes dense, use a dense algorithm
• Ex: All-Pairs Shortest Path using Floyd-Warshall
• Similar to matmul: let D = A, then
    for k = 1:n
      for i = 1:n
        for j = 1:n
          D(i,j) = min( D(i,j), D(i,k) + D(k,j) )
• But can't reorder the outer loop for 2.5D; need another idea
• Abbreviate D(i,j) = min( D(i,j), min_k( A(i,k) + B(k,j) ) ) by D = A*B
  – Dependencies ok, 2.5D works, just a different semiring
• Kleene's Algorithm:
    D = DC-APSP(A, n)
      D = A
      Partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
      D11 = DC-APSP(D11, n/2)
      D12 = D11 * D12
      D21 = D21 * D11
      D22 = D21 * D12
      D22 = DC-APSP(D22, n/2)
      D21 = D22 * D21
      D12 = D12 * D22
      D11 = D12 * D21
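A sequential NumPy version of the recursion above over the (min,+) semiring, checked against Floyd-Warshall (the talk's point is the 2.5D parallelization; this sketch only demonstrates that the recursion is correct). `minplus` implements the slide's accumulating product D = A*B; the random graph and the large finite value standing in for "no edge" are assumptions.

```python
import numpy as np

def minplus(C, A, B):
    # C(i,j) = min( C(i,j), min_k A(i,k) + B(k,j) ), the "*" of the slide
    return np.minimum(C, (A[:, :, None] + B[None, :, :]).min(axis=1))

def dc_apsp(D):
    n = D.shape[0]
    if n == 1:
        return D
    m = n // 2
    D11, D12, D21, D22 = D[:m, :m], D[:m, m:], D[m:, :m], D[m:, m:]
    D11 = dc_apsp(D11)
    D12 = minplus(D12, D11, D12)
    D21 = minplus(D21, D21, D11)
    D22 = minplus(D22, D21, D12)
    D22 = dc_apsp(D22)
    D21 = minplus(D21, D22, D21)
    D12 = minplus(D12, D12, D22)
    D11 = minplus(D11, D12, D21)
    return np.block([[D11, D12], [D21, D22]])

def floyd_warshall(D):
    D = D.copy()
    for k in range(D.shape[0]):
        D = np.minimum(D, D[:, k:k + 1] + D[k:k + 1, :])
    return D

rng = np.random.default_rng(0)
W = rng.uniform(1, 10, size=(16, 16))
W[rng.random((16, 16)) < 0.6] = 1e9      # "missing" edges get a huge weight
np.fill_diagonal(W, 0.0)
assert np.allclose(dc_apsp(W), floyd_warshall(W))
```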

Performance of 2.5D APSP using Kleene
Strong scaling on Hopper (Cray XE6 with 1024 nodes = 24,576 cores)
(Plot annotations: 6.2x speedup; 2x speedup.)

What about sparse matrices? (2/3)
• If parts of the matrix become dense, optimize those
• Ex: Cholesky on a matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): if all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on a w x w submatrix
• Thm: Words_moved = Ω(w^3 / M^(1/2)), etc.
• Thm (George '73): nested dissection gives the optimal ordering for 2D grids, 3D grids, similar matrices
  – w = n for a 2D n x n grid, w = n^2 for a 3D n x n x n grid
• Sequential multifrontal Cholesky attains the bounds
• PSPASES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains the 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)

What about sparse matrices? (3/3)
• If the matrix stays very sparse, the lower bound is unattainable; need a new one
• Ex: A, B both diagonal: no communication in the parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), i.i.d.
• Assumption: the algorithm is sparsity-independent: the assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: a parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
    Words_moved = Ω( min( d·n / P^(1/2), d^2·n / P ) )
  – Proof exploits the fact that reuse of entries of C = A·B is unlikely
• Contrast the general lower bound: Words_moved = Ω( d^2·n / (P·M^(1/2)) )
• Attained by a divide-and-conquer algorithm that splits the matrices along the dimensions most likely to minimize cost


Symmetric Eigenproblem and SVD
• Usual approach for A = A^T (SVD similar):
  – A → Q^T A Q = T, where Q orthogonal, T tridiagonal
  – T → U^T T U = Λ, where U orthogonal, Λ diagonal
  – QU's columns are eigenvectors, Λ the eigenvalues
  – Dense → Tridiagonal → Diagonal
  – Only half BLAS3, half BLAS2, in LAPACK's sytrd
• Communication-Avoiding approach:
  – A → Q A Q^T = B, where B = B^T banded, of bandwidth M^(1/2)
  – Continue as above, starting with B
  – Dense → Banded → Tridiagonal → Diagonal
  – Dense → Banded: use TSQR to zero out M^(1/2) cols/rows at a time
  – Banded → Tridiagonal: need a new(ish) idea

Successive Band Reduction (Bischof/Lang/Sun)
  b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b
  (Sequence of figures: starting from a band of width b+1, each orthogonal sweep Q1, Q1^T, Q2, Q2^T, ..., Q5, Q5^T eliminates c columns (d diagonals) at a time and chases the resulting bulges of width d+c down the band, in steps 1 through 6.)

Conventional vs. CA-SBR
  Conventional: touch all data 4 times
  Communication-Avoiding: touch all data once

Speedups of Symmetric Band Reduction vs. DSBTRD
• Up to 17x on Intel Gainestown, vs MKL 10.0 (n=12000, b=500, 8 threads)
• Up to 12x on Intel Westmere, vs MKL 10.3 (n=12000, b=200, 10 threads)
• Up to 25x on AMD Budapest, vs ACML 4.4 (n=9000, b=500, 4 threads)
• Up to 30x on AMD Magny-Cours, vs ACML 4.4 (n=12000, b=500, 6 threads)
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem
• No apparent way to modify the standard algorithm
• Instead: Spectral Divide-and-Conquer
  – Find an orthogonal matrix Q whose leading columns span an invariant subspace of A
  – Q^T A Q will be block upper triangular:
        [ A11  A12 ]
        [  ε   A22 ]
  – Apply recursively to A11, A22
  – Depends on randomization:
    1. Randomized Rank-Revealing QR decomposition
    2. Randomized location to try splitting the spectrum

Attaining the Lower Bounds: Sequential
Legend: [Existing], [Ours], [Math-Lib], [Random]
(Table: for each operation, which algorithms attain the words and messages lower bounds, for two-level and hierarchical memory.)
  BLAS-3: [FLPR'99], [BDLST'13], [MKL etc.]
  Cholesky: [G'97], [AP'00], [LAPACK], [BDHS'09]
  Sym. Indefinite: [BBDDDPSTY'13]
  LU: [G'97], [T'97], [GDX'11], [BDLST'13]
  QR: [EG'98], [FW'03], [DGHL'12], [BDLST'13]
  Rank-Revealing QR: [BDD'11], [DGGX'13]
  Sym. Eig & SVD: [BDD'11], [BDK'13]
  Non-Sym. Eig: [BDD'11]

Attaining the Lower Bounds: Parallel 2D, M = Θ(n^2/P)
(Ignoring poly-log(P) factors; words = Ω(n^2 / P^(1/2)), messages = Ω(P^(1/2)).)
Legend: [Existing], [Ours], [Math-Lib], [Random]; columns: Words (BW), Messages (L), saving factor.
  BLAS-3: [AGZ'94], [MT'99], [ScaLAPACK], [C'69], [vGW'97], [SD'11]; latency saving factor n/P^(1/2)
  Cholesky: [ScaLAPACK], [T'99], [SD'11]; latency saving factor n/P^(1/2)
  Sym. Indefinite: [BBDDDPSTY'13], [ScaLAPACK]; latency saving factor n/P^(1/2)
  LU: [ScaLAPACK], [GDX'11], [T'99], [SD'11]; latency saving factor n/P^(1/2)
  QR: [ScaLAPACK], [DGHL'12], [T'99]; latency saving factor n/P^(1/2)
  Rank-Revealing QR: [BDD'11], [DGGX'13]
  Sym. Eig & SVD: [BDD'11], [BDK'13], [ScaLAPACK]; latency saving factor n/P^(1/2)
  Non-Sym. Eig: [BDD'11]; saving factors: bandwidth P^(1/2), latency n
Attaining with extra memory (2.5D): M = Θ(c·n^2/P)


Avoiding Communication in Iterative Linear Algebra
• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and the starting vector
  – Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, ...
• Goal: minimize communication
  – Assume the matrix is "well-partitioned"
  – Serial implementation:
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data, which is optimal
  – Parallel implementation on p processors:
    • Conventional: O(k log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages, which is optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability


Example: The Difficulty of Tuning SpMV
• n = 21,200; nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

Example: The Difficulty of Tuning
• n = 21,200; nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit mem_refs

Speedups on Itanium 2: The Need for Search
  (Register-blocking profile: reference implementation vs. best block size, 4x2; performance in Mflops.)

Register Profile: Itanium 2
  (Performance ranges from 190 Mflops to 1190 Mflops across block sizes.)

Register Profiles: IBM and Intel IA-64
  Power3 - 17, Power4 - 16, Itanium 1 - 8, Itanium 2 - 33
  (Annotated performance extremes per platform: 122-252 Mflops, 459-820 Mflops, 107-247 Mflops, 190 Mflops-1.2 Gflops, respectively.)

Another example of tuning challenges for SpMV
• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16,614; NNZ = 1.1 M

Zoom in to top corner
• More complicated non-zero structure in general
• N = 16,614; NNZ = 1.1 M

3x3 blocks look natural, but...
• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

Extra Work Can Improve Efficiency
• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup!
  – Actual mflop rate is 1.5^2 = 2.25x higher
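SciPy's BSR format implements exactly this register-blocking-with-fill idea; a small sketch (the random matrix is just a stand-in, so its fill ratio will be much worse than the 1.5 quoted above for a matrix with genuine block structure):

```python
import numpy as np
import scipy.sparse as sp

A_csr = sp.random(600, 600, density=0.01, format='csr', random_state=0)
A_bsr = sp.bsr_matrix(A_csr, blocksize=(3, 3))    # 3x3 blocks, explicit zeros filled in

x = np.ones(600)
assert np.allclose(A_csr @ x, A_bsr @ x)          # same mat-vec, different inner loop

fill_ratio = A_bsr.data.size / A_csr.nnz          # stored values (incl. zeros) / true nonzeros
print(f"fill ratio = {fill_ratio:.2f}")
```

Blocking pays off only when the unrolled block multiplies outweigh the cost of the extra stored zeros, which is why the fill ratio has to be measured (or modeled) during tuning.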

Source: Accelerator Cavity Design Problem (Ko via Husbands)
100x100 Submatrix Along Diagonal
Post-RCM Reordering
Effect of Combined RCM+TSP Reordering
  Before: green + red; After: green + blue
  2x speedups on Pentium 4, Power 4, ...

Summary of Other Performance Optimizations
• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations...
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·A^T·x, A^T·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later...

Optimized Sparse Kernel Interface - OSKI
• Provides sparse kernels automatically tuned for the user's matrix & machine
  – BLAS-style functionality: SpMV (Ax & A^T·y), TrSV
  – Does both off-line and run-time tuning
  – Hides the complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as a stand-alone library
  – Available as a PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, ...
  – bebop.cs.berkeley.edu/poski


Example: Classical Conjugate Gradient (CG)
SpMVs and dot products require communication in each iteration.
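For reference, a minimal textbook CG in Python (my sketch, not the slide's listing), with the per-iteration communication points marked; the 2D Poisson matrix mirrors the model problem used later in this section.

```python
import numpy as np
import scipy.sparse as sp

def cg(A, b, tol=1e-8, maxiter=500):
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rr = r @ r
    for _ in range(maxiter):
        Ap = A @ p                    # SpMV: neighbor communication in a parallel code
        alpha = rr / (p @ Ap)         # dot product: global reduction
        x += alpha * p
        r -= alpha * Ap
        rr_new = r @ r                # dot product: global reduction
        if np.sqrt(rr_new) <= tol * np.linalg.norm(b):
            break
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x

n = 30                                # 2D Poisson, 5-point stencil, 30x30 grid
T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n))
A = (sp.kron(sp.eye(n), T) + sp.kron(T, sp.eye(n))).tocsr()
b = np.ones(n * n)
x = cg(A, b)
print(np.linalg.norm(A @ x - b))
```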

Example: CA-Conjugate Gradient
The SpMVs are performed via the CA matrix powers kernel, a single global reduction computes G, and the local computations within the inner loop require no communication.
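A sketch of the two communication-avoiding ingredients: the s-step (monomial) Krylov basis that a matrix powers kernel would produce, and the single Gram-matrix product whose reduction replaces the individual dot products for the next s steps. The function name, the random symmetric test matrix, and s are assumptions; a real CA-CG uses this basis inside a reorganized iteration.

```python
import numpy as np
import scipy.sparse as sp

def monomial_basis(A, v, s):
    # [v, A v, ..., A^s v]; a CA matrix powers kernel computes this with one
    # exchange of ghost zones instead of s separate SpMV communications.
    V = np.empty((v.size, s + 1))
    V[:, 0] = v
    for j in range(s):
        V[:, j + 1] = A @ V[:, j]
    return V

n, s = 900, 4
A = sp.random(n, n, density=0.01, format='csr', random_state=1)
A = A + A.T + 10 * sp.eye(n)          # symmetric, comfortably conditioned
V = monomial_basis(A, np.ones(n), s)
G = V.T @ V                           # one reduction yields all needed inner products
```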


• Slower convergence due to roundoff
• Loss of accuracy due to roundoff
• At s = 16, the monomial basis is rank deficient! Method breaks down
• Model problem:
  – 2D Poisson, 5-point stencil
  – 30x30 grid
  – cond(A) ~ 400
(Plot annotations: CA-CG (monomial), CG, machine precision.)
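The breakdown is easy to reproduce: for the same model problem, the condition number of the (column-normalized) monomial basis grows rapidly with s until the basis is numerically rank deficient. This check is my own sketch, not the code behind the plot.

```python
import numpy as np
import scipy.sparse as sp

n = 30                                                    # 30x30 grid, 2D Poisson
T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n))
A = (sp.kron(sp.eye(n), T) + sp.kron(T, sp.eye(n))).tocsr()

v = np.random.default_rng(0).standard_normal(n * n)
V = [v / np.linalg.norm(v)]
for s in range(1, 17):
    w = A @ V[-1]
    V.append(w / np.linalg.norm(w))                       # monomial basis, columns normalized
    B = np.column_stack(V)
    print(f"s = {s:2d}  rank = {np.linalg.matrix_rank(B):2d}  cond = {np.linalg.cond(B):.1e}")
```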


What is a "sparse matrix"?
• Requires o(n^2) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be a sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + U·D·V^T, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write A^k = (S + U·D·V^T)^k as a sum of S^k and low-rank matrices

                                 Indices explicit (O(nnz))   Indices implicit (o(nnz))
  Nonzero entries explicit (O(nnz)):  CSR and variations       Vision, climate, AMR, ...
  Nonzero entries implicit (o(nnz)):  Graph Laplacian          Stencils
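The "implicit entries, implicit indices" corner of the table is just an operator that applies a stencil; a small sketch using SciPy's LinearOperator (the grid size and boundary handling are arbitrary choices for the example):

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

n = 64    # 2D Laplacian on an n x n grid, stored as nothing but a matvec

def stencil_matvec(v):
    u = v.reshape(n, n)
    out = 4.0 * u
    out[1:, :]  -= u[:-1, :]
    out[:-1, :] -= u[1:, :]
    out[:, 1:]  -= u[:, :-1]
    out[:, :-1] -= u[:, 1:]
    return out.ravel()

A = LinearOperator((n * n, n * n), matvec=stencil_matvec, dtype=np.float64)
x, info = cg(A, np.ones(n * n), atol=1e-10)   # Krylov methods only ever need y = A @ x
```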


Reproducible Floating Point Computation
• Goal: get a bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe the results
  – Willing to sacrifice 40%-50% of performance for it
• Email to ~110 Berkeley CSE faculty, asking about it
  – Most: "What?! How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility
(Plots: absolute error for random vectors, same magnitude but opposite signs; relative error for orthogonal vectors, where even the sign is not reproducible.)
Vector size 1e6, data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum - minimum
• Relative error = absolute error / maximum absolute value
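The effect is easy to see without MKL: just change the summation order of the same dot product (a sketch; the chunk counts stand in for thread counts).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(10**6)
y = rng.standard_normal(10**6)

def dot_chunked(x, y, nchunks):
    # Sum per-chunk partial dot products, mimicking a reduction over `nchunks` threads.
    return sum(np.dot(xc, yc) for xc, yc in zip(np.array_split(x, nchunks),
                                                np.array_split(y, nchunks)))

d1, d2, d4 = np.dot(x, y), dot_chunked(x, y, 2), dot_chunked(x, y, 4)
print(d1 - d2, d1 - d4)   # typically nonzero: rounding depends on the order of the additions
```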

Goals / Approaches for Reproducibility
• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee a fixed reduction tree (not 2. or 3.)
  – Use (very) high precision to get the exact answer (not 2.)
  – Prerounding technique (Nguyen, D.)
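A much-simplified, single-bin sketch of the prerounding idea (not the Nguyen/Demmel algorithm, which uses several bins to keep accuracy and comes with error bounds): round every summand to a common power-of-two grid chosen so that all subsequent additions are exact, hence independent of ordering.

```python
import numpy as np

def reproducible_sum(x):
    x = np.asarray(x, dtype=np.float64)
    m = np.max(np.abs(x))
    if m == 0.0:
        return 0.0
    # Grid spacing delta = 2^k chosen so the rounded values sum without any rounding error.
    k = int(np.ceil(np.log2(m))) + int(np.ceil(np.log2(x.size))) - 52
    delta = 2.0 ** k
    q = np.rint(x / delta)        # integer-valued doubles; prerounding loses at most delta/2 each
    return delta * q.sum()        # exact integer additions => same answer in any order

rng = np.random.default_rng(0)
x = rng.standard_normal(10**6)
assert (reproducible_sum(x) == reproducible_sum(x[::-1])
        == reproducible_sum(np.random.permutation(x)))
```

With a single bin the accuracy is coarse (the prerounding error can reach n·delta/2); the actual algorithm uses a few bins to get much better accuracy, at the cost quoted on the next slide.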

Performance results on 1024 processors of a Cray XC30: 1.2x to 3.2x slowdown vs. fastest code, for n = 1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic...

Time to redesign all linear algebra, n-body, ... algorithms and software (and compilers)



One-sided Factorizations (LU QR) so farbull Classical Approach for i=1 to n update column i update trailing matrixbull words_moved = O(n3)

35

bull Blocked Approach (LAPACK) for i=1 to nb update block i of b columns update trailing matrixbull words moved = O(n3M13)

bull Recursive Approach func factor(A) if A has 1 column update it

else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)

bull None of these approaches minimizes messagesbull Parallel case Partial

Pivoting =gt n reductionsbull Need another idea

TSQR An Architecture-Dependent Algorithm

W =

W0

W1

W2

W3

R00

R10

R20

R30

R01

R11

R02Parallel

W =

W0

W1

W2

W3

R01R02

R00

R03

SequentialStreaming

W =

W0

W1

W2

W3

R00

R01

R01

R11

R02

R11

R03

Dual Core

Can choose reduction tree dynamically

Multicore Multisocket Multirack Multisite Out-of-core

Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo

Wnxb =

W1

W2

W3

W4

P1middotL1middotU1

P2middotL2middotU2

P3middotL3middotU3

P4middotL4middotU4

=

Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo

W1rsquoW2rsquoW3rsquoW4rsquo

P12middotL12middotU12

P34middotL34middotU34

=Choose b pivot rows call them W12rsquo

Choose b pivot rows call them W34rsquo

W12rsquoW34rsquo

= P1234middotL1234middotU1234

Choose b pivot rows

Go back to W and use these b pivot rows (move them to top do LU without pivoting)

37

Minimizing Communication in TSLU

W = W1

W2

W3

W4

LULULULU

LU

LULUParallel

W = W1

W2

W3

W4

LULU

LU

LUSequentialStreaming

W = W1

W2

W3

W4

LULU LU

LULU

LULU

Dual Core

Can choose reduction tree dynamically to match architecture as before

38

Making TSLU Numerically Stable

bull Details matterndash Going up the tree we could do LU either on original rows of A

(tournament pivoting) or computed rows of Undash Only tournament pivoting stable

bull ldquoThmrdquo New scheme as stable as Partial Pivoting (GEPP) in following sense Get same Schur complements as GEPP applied to different input matrix whose entries are blocks taken from input A

bull Why just a ldquoThmrdquo

39

Stability of LU using TSLU CALU

Summer School Lecture 4 40

bull Empirical testingndash Both random matrices and ldquospecial onesrdquondash Both binary tree (BCALU) and flat-tree (FCALU)ndash 3 metrics ||PA-LU||||A|| normwise and componentwise backward errorsndash See [D Grigori Xiang 2010] for details

Why is stability of TSLU just a ldquoThmrdquo

bull Proof is correct ndash in exact arithmeticbull Experiment

ndash Generate 100 random 6x6 rank 3 matrices in Matlabndash [LUP] = lu(A) do LU without pivoting on PA compare L factors are

they the samebull Compute || L ndash Lnp || A few 0rsquos A few infinrsquos a few NaNsbull Rest mostly O(1)

ndash Why Floating point is nonassociative doing arithmetic in different order gives different rounding errors

ndash Same experiment with rank 6 matrices || L ndash Lnp || usually nonzero O(macheps)

ndash Same experiment with 20x20 rank 4 matrices || L ndash Lnp || often O(103)

bull Much harder to break TSLU but possiblendash Occurred when using TSLU to factorize a low-rank subdiagonal

panel in symmetric-indefinite factorization41

Fixing TSLU

bull Run TSLU quickly test for stability fix if necessary (rare)

bull Test conditioning of U if not tiny (usual case) proceed elsebull Compute || L || if not big (usual case) proceed elsebull Factor A = QR using TSQR thenbull Factor Q = PLU using TSLU thenbull A = PL(UR) with UR as upper triangular factor

bull Last topic in lecture how to guarantee floating point reproducibility

42

2D CALU with Tournament Pivoting

43

25D CALU with Tournament Pivoting (c=4 copies)

44

Exascale Machine ParametersSource DOE Exascale Workshop

bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log

2 (

n2p

) =

log

2 (m

emo

ry_p

er_p

roc)

Up to 29x

25D vs 2D LUWith and Without Pivoting

Other CA algorithms for Ax=b least squares(13)

bull A symmetric and indefinitendash Seek factorization that retains symmetry PAPT = LDLT D

ldquosimplerdquobull Save frac12 flops preserve inertia

ndash Usual approach Bunch-Kaufmanbull D block diagonal with 1x1 and 2x2 blocksbull Pivot search down column along row (lots of communication)

ndash Alternative Aasenbull D = tridiagonal = Tbull Two steps

ndash PAPT = LTLT where T is banded using TSLU

48

0 0

0

0 0

0

0

hellip

hellip

ndash Solvefactor narrow band problem with Tbull Up to 28x faster than MKL Best Paper at IPDPSrsquo13

Other CA algorithms for Ax=b least squares (23)bull Minimizing bandwidth and latency for sequential GEPP

ndash So far could not do partial pivoting and minimize messages just words

ndash Challengebull Column layout good for choosing pivots bad for matmulbull Blocked layout good for matmul bad for choosing pivots

ndash Solution use both layouts switching between thembull ldquoShape Morphing LUrdquo or SMLU

49

bull func factor(A) if A has 1 column update it else factor(left half of A)

update right half of A

factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M)

bull func factor(A) if A has 1 column update it else factor(left half of A) reshape to recursive block format update right half of A reshape to columnwise format factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M32)

Other CA algorithms for Ax=b least squares (33)bull Need for pivoting arises beyond LU in QR

ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)

ndash Usual approach like Partial Pivoting

bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound

ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two

groups of b columns either using usual approach or something better (GuEisenstat)

bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix

ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 50

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What about sparse matrices (13)

bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then

bull But canrsquot reorder outer loop for 25D need another idea

bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring

bull Kleenersquos Algorithm

52

for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)

D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21

Performance of 25D APSP using Kleene

53

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)

62xspeedup

2x speedup

What about sparse matrices (23)

bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of

G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix

bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering

for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid

bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse

multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on

separators) 54

What about sparse matrices (33)

bull If matrix stays very sparse lower bound unattainable new one

bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of

data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)

bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)

Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely

bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices

along dimensions most likely to minimize cost

55

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Symmetric Eigenproblem and SVD

bull Usual approach for A=AT (SVD similar)ndash A QTAQ = T where Q orthogonal T tridiagonalndash T UTTU = Λ where U orthogonal Λ diagonalndash QUrsquos columns are eigenvectors Λ eigenvaluesndash Dense Tridiagonal Diagonalndash Only half BLAS3 half BLAS2 in LAPACKrsquos sytrd

bull Communication-Avoiding Approachndash A QAQT = B where B=BT banded of bandwidth M12

ndash Continue as above starting with Bndash Dense Banded Tridiagonal Diagonalndash Dense Banded use TSQR to zero out M12 colsrows at a timendash Banded Tridiagonal need new(ish) idea

b+1

b+1

Successive Band Reduction (BischofLangSun)

1

b+1

b+1

d+1

c

Successive Band Reduction (BischofLangSun)

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1Q1

b+1

b+1

d+1

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

12

Q1

b+1

b+1

d+1

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

12

Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

Q1

Q1T

Q2

Q2T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

Q1

Q1T

Q2

Q2T

Q3

Q3T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

6

6

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

Conventional vs CA - SBR

Conventional Communication-Avoiding

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Nonsymmetric Eigenproblem

bull No apparent way to modify standard algorithmbull Instead Spectral Divide-and-Conquer

ndash Find orthogonal matrix Q whose leading columns span an invariant subspace of A

ndash QTAQ will be block upper triangular

ndash Apply recursively to A11 A22

ndash Depends on randomization1 Randomized Rank Revealing QR decomposition2 Randomized location to try splitting spectrum

A11 A12

ε A22

Attaining the Lower bounds SequentialLegend[Existing][Ours][Math-Lib][Random]

Two Levels Memory Hierarchy

Words Messages Words Messages

BLAS-3 [FLPRrsquo99][BDLSTrsquo13][MKL etc] [FLPRrsquo99][BDLSTrsquo13][MKL etc]

Cholesky[Grsquo97][APrsquo00]

[LAPACK][BDHSrsquo09]

[Grsquo97][APrsquo00][BDHSrsquo09] [Grsquo97][APrsquo00][BDHSrsquo09]

Sym Indefinite [BBDDDPSTYrsquo13] [BBDDDPSTYrsquo13]

LU[Grsquo97][Trsquo97]

[GDXrsquo11][BDLSTrsquo13]

[GDXrsquo11][BDLSTrsquo13]

[Grsquo97][Trsquo97] [BDLSTrsquo13] [BDLSTrsquo13]

QR[EGrsquo98][FWrsquo03]

[DGHLrsquo12][BDLSTrsquo13]

[FWrsquo03][DGHLrsquo12][BDLSTrsquo13]

[EGrsquo98][FWrsquo03][BDLSTrsquo13]

[FWrsquo03][BDLSTrsquo13]

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13] [BDDrsquo11]

Non Sym Eig [BDDrsquo11] [BDDrsquo11]

Legend[Existing][Ours][Math-Lib][Random]

Words (BW) Messages (L) Saving factor

BLAS-3 [AGZrsquo94][MTrsquo99][ScaLAPACK][Crsquo69][vGWrsquo97][SDrsquo11] L nP12

Cholesky [ScaLAPACK][Trsquo99][SDrsquo11] L nP12

Sym Indefinite [BBDDDPSTYrsquo13][ScaLAPACK] [BBDDDPSTYrsquo13] L nP12

LU [ScaLAPACK][GDXrsquo11][Trsquo99][SDrsquo11] [GDXrsquo11][Trsquo99][SDrsquo11] L nP12

QR [ScaLAPACK][DGHLrsquo12] [Trsquo99] [DGHLrsquo12][Trsquo99] L nP12

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13][ScaLAPACK] [BDDrsquo11][BDKrsquo13] L nP12

Non-Sym Eig [BDDrsquo11] [BDDrsquo11] BW P12 L n

Attaining with extra memory 25D M=(cn2P)

Attaining the Lower bounds Parallel 2DM=(n2P)(Ignoring poly-log(P) factors words = ( n2 P12) messages = (P12)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Conjugate Gradients (CG) GMRES Lanczos Arnoldi hellip bull Goal minimize communication

ndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computationndash Challenges Poor partitioning Preconditioning Num Stability

75

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Example The Difficulty of Tuning SpMV

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

77

Example The Difficulty of Tuning

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

bull 8x8 dense substructure exploit this to limit mem_refs

78

Speedups on Itanium 2 The Need for Search

Reference

Best 4x2

Mflops

Mflops

79

Register Profile Itanium 2

190 Mflops

1190 Mflops

80

Register Profiles IBM and Intel IA-64Power3 - 17 Power4 - 16

Itanium 2 - 33Itanium 1 - 8

252 Mflops

122 Mflops

820 Mflops

459 Mflops

247 Mflops

107 Mflops

12 Gflops

190 Mflops

Another example of tuning challenges for SpMV

bull Ex11 matrix (fluid flow)

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

82

Zoom in to top corner

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

83

3x3 blocks look natural buthellip

bull Example 3x3 blockingndash Logical grid of 3x3 cells

bull But would lead to lots of ldquofill-inrdquo

84

Extra Work Can Improve Efficiency

bull Example 3x3 blockingndash Logical grid of 3x3 cellsndash Fill-in explicit zerosndash Unroll 3x3 block multipliesndash ldquoFill ratiordquo = 15

bull On Pentium III 15x speedup

ndash Actual mflop rate 152 = 225 higher

85

Source Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal

Summer School Lecture 7

87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

Before Green + RedAfter Green + Blue

Summer School Lecture 7

892x speedups on Pentium 4 Power 4 hellip

Summary of Other Performance Optimizations

bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip

bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR

bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip

90

Optimized Sparse Kernel Interface - OSKI

• Provides sparse kernels automatically tuned for user's matrix & machine
  – BLAS-style functionality: SpMV (A·x & A^T·y), TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as stand-alone library
  – Available as PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

91

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

93

Example: Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in each iteration.
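For reference, a plain CG iteration in Python/NumPy (a textbook sketch, not the slide's exact pseudocode): every iteration has one SpMV, whose parallel version needs neighbor communication, and two dot products, each of which is a global reduction in parallel.

import numpy as np
import scipy.sparse as sp

def cg(A, b, tol=1e-8, maxiter=500):
    """Classical conjugate gradients for SPD A."""
    x = np.zeros_like(b)
    r = b - A @ x                    # SpMV
    p = r.copy()
    rs = r @ r                       # dot product -> global reduction in parallel
    for _ in range(maxiter):
        Ap = A @ p                   # SpMV: neighbor communication in parallel
        alpha = rs / (p @ Ap)        # dot product -> global reduction
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r               # dot product -> global reduction
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# 2D Poisson, 5-point stencil on a 30x30 grid (the model problem used later)
Id = sp.identity(30); T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(30, 30))
A = (sp.kron(Id, T) + sp.kron(T, Id)).tocsr()
b = np.ones(A.shape[0])
x = cg(A, b)
print("residual:", np.linalg.norm(b - A @ x))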

In CA-CG, the s SpMVs of s iterations are instead computed up front via the CA matrix powers kernel, and the dot products are replaced by one global reduction that computes a Gram matrix G.

94

Example: CA-Conjugate Gradient
Local computations within the inner loop require no communication.
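A sketch of the s-step structure with the monomial basis, mathematically equivalent to CG in exact arithmetic. The basis-building loop below stands in for the CA matrix powers kernel, and G = V^T V is the single global reduction per outer iteration; a production version would use a better-conditioned basis (see the next slide) and a genuinely communication-avoiding powers kernel. Function and variable names are illustrative.

import numpy as np
import scipy.sparse as sp

def ca_cg(A, b, s=4, outer=100, tol=1e-8):
    """s-step (communication-avoiding) CG with the monomial basis: a sketch."""
    n = b.size
    x = np.zeros(n); r = b.copy(); p = r.copy()
    # Change-of-basis matrix: A * (V c) = V * (B c) for the coefficient vectors used below
    B = np.zeros((2 * s + 1, 2 * s + 1))
    for i in range(s):
        B[i + 1, i] = 1.0                      # A * (A^i p) = A^(i+1) p
    for i in range(s - 1):
        B[s + 2 + i, s + 1 + i] = 1.0          # A * (A^i r) = A^(i+1) r
    for _ in range(outer):
        # Basis construction: stands in for the matrix powers kernel (one communication phase)
        V = np.empty((n, 2 * s + 1))
        V[:, 0] = p
        for j in range(s):
            V[:, j + 1] = A @ V[:, j]          # [p, Ap, ..., A^s p]
        V[:, s + 1] = r
        for j in range(s - 1):
            V[:, s + 2 + j] = A @ V[:, s + 1 + j]   # [r, Ar, ..., A^(s-1) r]
        G = V.T @ V                            # Gram matrix: the one global reduction
        # Coefficients of p, r, and the update to x in the basis V
        pc = np.zeros(2 * s + 1); pc[0] = 1.0
        rc = np.zeros(2 * s + 1); rc[s + 1] = 1.0
        xc = np.zeros(2 * s + 1)
        for _ in range(s):                     # inner loop: local work only, no communication
            Bp = B @ pc
            alpha = (rc @ G @ rc) / (pc @ G @ Bp)
            xc = xc + alpha * pc
            rc_new = rc - alpha * Bp
            beta = (rc_new @ G @ rc_new) / (rc @ G @ rc)
            pc = rc_new + beta * pc
            rc = rc_new
        x += V @ xc; r = V @ rc; p = V @ pc
        if np.linalg.norm(r) < tol:
            break
    return x

# Same 2D Poisson model problem as above
Id = sp.identity(30); T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(30, 30))
A = (sp.kron(Id, T) + sp.kron(T, Id)).tocsr()
b = np.ones(A.shape[0])
x = ca_cg(A, b, s=4)
print("true residual:", np.linalg.norm(b - A @ x))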

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

96

[Figure: convergence of CG vs. CA-CG (monomial basis) on the model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400. CA-CG shows slower convergence and loss of accuracy due to roundoff, with attainable accuracy bounded by machine precision; at s = 16 the monomial basis is rank deficient and the method breaks down.]

97
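The breakdown is easy to reproduce: the monomial basis [v, Av, ..., A^s v] becomes numerically rank deficient as s grows, because its columns all turn toward the dominant eigenvector. A quick check on the same model problem (illustrative only; the fix used in practice is a Newton or Chebyshev basis).

import numpy as np
import scipy.sparse as sp

# 2D Poisson, 5-point stencil, 30x30 grid
Id = sp.identity(30); T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(30, 30))
A = (sp.kron(Id, T) + sp.kron(T, Id)).tocsr()

rng = np.random.default_rng(0)
v = rng.standard_normal(A.shape[0])

for s in (4, 8, 16):
    V = np.empty((A.shape[0], s + 1))
    V[:, 0] = v
    for j in range(s):
        V[:, j + 1] = A @ V[:, j]          # monomial basis [v, Av, ..., A^s v]
    print(f"s = {s:2d}: condition number of basis = {np.linalg.cond(V):.2e}")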

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?
• Requires o(n^2) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be a sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + UDV^T, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write A^k = (S + UDV^T)^k as a sum of S^k and low-rank matrices
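Such implicit representations still admit fast multiplication: applying A = S + UDV^T to a vector costs one sparse multiply plus two skinny dense multiplies, and powers of A can be kept in "sparse plus low-rank" form without ever forming A. A minimal sketch of the multiply, with illustrative sizes and random data.

import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
n, k = 1000, 5
S = sp.random(n, n, density=0.005, format="csr", random_state=0)
U = rng.standard_normal((n, k))
D = rng.standard_normal((k, k))
V = rng.standard_normal((n, k))

def apply_A(x):
    """y = (S + U D V^T) x without forming the dense n x n matrix."""
    return S @ x + U @ (D @ (V.T @ x))     # O(nnz(S) + n*k) work

x = rng.standard_normal(n)
dense_A = S.toarray() + U @ D @ V.T         # formed only to check the sketch
assert np.allclose(apply_A(x), dense_A @ x)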

                              Indices explicit (O(nnz))    Indices implicit (o(nnz))
Nonzero entries explicit      CSR and variations           Vision, climate, AMR, …
  (O(nnz))
Nonzero entries implicit      Graph Laplacian              Stencils
  (o(nnz))

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

101

Reproducible Floating Point Computation
• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe the results
  – Willing to sacrifice 40-50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What? How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility
[Figure: absolute error for random vectors (same magnitude, opposite signs) and relative error for orthogonal vectors, where even the sign is not reproducible]
Vector size: 1e6. Data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum – minimum
• Relative error = Absolute error / maximum absolute value

103
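The effect is ordinary floating-point nonassociativity: splitting the same dot product over a different number of threads changes the reduction order and hence the rounding. A self-contained illustration in NumPy, simulating the thread partitioning rather than calling MKL; the function name and strided split are illustrative choices.

import numpy as np

rng = np.random.default_rng(42)
n = 10**6
x = rng.standard_normal(n)
y = rng.standard_normal(n)

def dot_with_threads(x, y, nthreads):
    """Simulate a threaded dot product: one partial sum per thread, then combine."""
    parts = [x[i::nthreads] @ y[i::nthreads] for i in range(nthreads)]
    return sum(parts)

results = [dot_with_threads(x, y, t) for t in (1, 2, 3, 4)]
print("absolute error:", max(results) - min(results))
print("relative error:", (max(results) - min(results)) / max(abs(r) for r in results))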

Goals/Approaches for Reproducibility
• Consider summation or dot product
• Goals:
  1. Same answer independent of layout, number of processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee a fixed reduction tree (fails goal 2 or 3)
  – Use (very) high precision to get the exact answer (fails goal 2)
  – Prerounding technique (Nguyen, D.)

104
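A toy version of the prerounding idea: pick a power-of-two quantum from the largest |x_i|, round every summand to that quantum, and then all partial sums are exact, so any summation order (any reduction tree, any number of threads) returns identical bits. This is a simplification for illustration, not the Nguyen/Demmel ReproBLAS algorithm; the bits parameter controlling the accuracy/range trade-off is an assumption of this sketch.

import numpy as np

def reproducible_sum(x, bits=30):
    """Order-independent summation by prerounding to a common power-of-two quantum.
    Exact (hence reproducible) as long as n * 2**bits < 2**53."""
    m = np.max(np.abs(x))
    if m == 0.0:
        return 0.0
    quantum = 2.0 ** (np.ceil(np.log2(m)) - bits)   # power of two: scaling is exact
    xq = np.round(x / quantum) * quantum            # each term is a small integer times quantum
    return float(np.sum(xq))                        # every partial sum is exact -> order-independent

rng = np.random.default_rng(0)
x = rng.standard_normal(10**6)
perm = rng.permutation(x.size)
print(reproducible_sum(x) == reproducible_sum(x[perm]))   # True: bitwise identical
print(np.sum(x) == np.sum(x[perm]))                       # usually False
print("rounding error introduced:", abs(reproducible_sum(x) - np.sum(x)))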

Performance results on 1024 processors of a Cray XC30: 1.2x to 3.2x slowdown vs. the fastest (non-reproducible) code for n = 1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

106

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)




Limits to parallel scaling (12)

bull Consider dense case flops_per_proc = n3Pndash Words = (n3(PM12 ))ndash Messages = (n3(PM32 ))

bull What is M Must be at least n2P to hold datandash Words = (n2P12 )ndash Messages = (P12 )

bull But if M fixed looks like perfect strong scaling in timendash Flops Words Messages all proportional to 1P

bull Ditto for energy if we count energy costs in joules hellipndash Per flop per word moved per messagendash Per word per second for data stored in memory Mndash Per second for leakage cooling hellip

bull How big can we make P and M

Limits to parallel scaling (22)

bull Consider dense case flops_per_proc = n3Pndash Words = (n3(PM12 ))ndash Messages = (n3(PM32 ))

bull How big can we make P and Mbull Assume we start with 1 copy of inputs A and B

ndash Otherwise no communication may be needed

bull Thm Words= (n2P23 ) independent of Mbull Reached when M = n2P23 too or P = n3M32 and Messages = (1) (log P in practice)bull Attained by 25D algorithm when c=P13 (ldquo3D algrdquo)bull Can keep increasing P until P = n3 Words = Messages = (1) (log n in practice)

Can we attain these lower bounds

bull Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these boundsndash Often not

bull If not are there other algorithms that dondash Yes for much of dense linear algebrandash New algorithms with new numerical properties

new ways to encode answers new data structures

ndash Not just loop transformations (need those too)bull Only a few sparse algorithms so farbull Lots of work in progress

ndash Algorithms Energy Heterogeneous Processors hellip11

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

25D Matrix Multiplication

bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid

c

(Pc)12

(Pc)12

Example P = 32 c = 2

25D Matrix Multiplication

bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid

k

j

iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12

(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)

(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)

(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)

25D Matmul on BGP 16K nodes 64K coresc = 16 copies

Distinguished Paper Award EuroParrsquo11 (Solomonik D)SCrsquo11 paper by Solomonik Bhatele D

12x faster

27x faster

Perfect Strong Scaling ndash in Time and Energy (12)

bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2

bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model

ndash γT βT αT = secs per flop per word_moved per message of size m

bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ]

= T(P)cbull Notation for energy model

ndash γE βE αE = joules for same operations

ndash δE = joules per word of memory used per sec

ndash εE = joules per sec for leakage etc

bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP)

= E(P)bull Perfect scaling extends to N-body Strassen hellip

Perfect Strong Scaling ndash in Time and Energy (22)

bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ] = T(P)c

bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP) = E(P)

bull Can use these formulas to answer many questions such asndash How to choose p and M to minimize energy E needed for computationndash Given max allowed runtime T what is minimum energy E needed to achieve

itndash Given max allowed energy E what is the minimum runtime T attainablendash Can we minimize the average power P = ETndash Given target energy efficiency what architectural parameters are needed to

achieve itbull Can we attain 75 GflopsWattbull Can we attain an exaflop for 20 MWatts

Handling Heterogeneity

bull Suppose each of P processors could differndash γi = secflop βi = secword αi = secmessage Mi = memory

bull What is optimal assignment of work Fi to minimize timendash Ti = Fi γi + Fi βi Mi

12 + Fi αi Mi32 = Fi [γi + βi Mi

12 + αi Mi32] = Fi ξi

ndash Choose Fi so Σi Fi = n3 and minimizing T = maxi Ti

ndash Answer Fi = n3(1ξi)Σj(1ξj) and T = n3Σj(1ξj)

bull Optimal Algorithm for nxn matmulndash Recursively divide into 8 half-sized subproblemsndash Assign subproblems to processor i to add up to Fi flops

bull Works for Strassen other algorithmshellip

Application to Tensor Contractions

bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

bull Heavily used in electronic structure calculationsndash Ex NWChem

bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries

ndash Solomonik Hammond Matthews

C(ijk) = Σm A(ijm)B(mk)

A3-fold symm

B2-fold symm

C2-fold symm

Application to Tensor Contractions

bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn

bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews

Communication Lower Bounds for Strassen-like matmul algorithms

bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected

bull Extends up to M = n2 p2ω bull Extends to rectangular case multiply (mxn)(nxp) in q mults

ndash words_moved = Ω (flopsM^(logmpq -1))

bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz also in JACMbull Is the lower bound attainable

Classical O(n3) matmul

words_moved =Ω (M(nM12)3P)

Strassenrsquos O(nlg7) matmul

words_moved =Ω (M(nM12)lg7P)

Strassen-like O(nω) matmul

words_moved =Ω (M(nM12)ωP)

vs

Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory

Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory

CAPS If EnoughMemory and P 7 then BFS step else DFS step end if

Communication Avoiding Parallel Strassen (CAPS)

Best way to interleaveBFS and DFS is an tuning parameter

26

Performance Benchmarking, Strong Scaling Plot – Franklin (Cray XT4), n = 94080

• Speedups of 24%–184% (over previous Strassen-based algorithms)
• Invited to appear as a Research Highlight in CACM

Strassen-like Beyond Matmul

• Thm (D., Dumitriu, Holtz '07): any Strassen-like O(n^ω) matmul algorithm can be used to build a numerically stable O(n^(ω+η)) algorithm, for any η>0, for Ax=b, least squares, eig, SVD, …
  – η>0 is needed to deal with numerical stability
  – Strassen itself is already stable, so η=0
• Thm: for sequential versions of these algorithms, Words_moved = O( n^(ω+η)/M^((ω+η)/2 − 1) + n^2 log n ), i.e. they attain the expected lower bound
• Ballard, D., Holtz, Schwartz

Cache and Network Oblivious Algorithms

• Motivation: minimize communication at every level of a hierarchical system, without tuning parameters (in theory)
  – Not always: 2.5D Matmul on BG/P was topology aware
• CAPS: divide-and-conquer; choose BFS or DFS to adapt to #processors and available memory
• CARMA:
  – Divide-and-conquer classical matmul: divide the largest of the 3 dimensions to create two subproblems (see the sketch below)
  – Choose BFS or DFS to adapt to #processors and available memory
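A minimal sequential Python sketch of CARMA's recursion: always split the largest of the three dimensions (m, k, n) in half and recurse, with a classical multiply at the base. The real CARMA additionally chooses BFS vs DFS per step to fit processors and memory; this sketch only shows the dimension-splitting rule.

import numpy as np

def carma(A, B, base=64):
    m, k = A.shape
    k2, n = B.shape
    assert k == k2
    if max(m, k, n) <= base:
        return A @ B
    if m >= k and m >= n:              # split the rows of A
        h = m // 2
        return np.vstack([carma(A[:h], B, base), carma(A[h:], B, base)])
    if n >= k:                         # split the columns of B
        h = n // 2
        return np.hstack([carma(A, B[:, :h], base), carma(A, B[:, h:], base)])
    h = k // 2                         # split the shared dimension, add the results
    return carma(A[:, :h], B[:h], base) + carma(A[:, h:], B[h:], base)

A = np.random.rand(300, 70)
B = np.random.rand(70, 500)
assert np.allclose(carma(A, B), A @ B)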

CARMA Performance: Distributed Memory
(Cray XE6 "Hopper"; each node 2 × 12-core, 4 × NUMA)

• Square: m = k = n = 6144 [log–log plot: CARMA vs ScaLAPACK vs peak]
• Inner product: m = n = 192, k = 6,291,456 [log–log plot: CARMA vs ScaLAPACK vs peak]

CARMA Performance: Shared Memory
(Intel "Emerald": 4 × Intel Xeon X7560 × 8 cores, 4 × NUMA)

• Square: m = k = n [log–linear plot: CARMA vs MKL, single and double precision, vs peak]
• Inner product: m = n = 64 [log–linear plot: CARMA vs MKL, single and double precision]

Why is CARMA Faster in Shared Memory? L3 Cache Misses

• Shared-memory inner product (m = n = 64, k = 524,288): 97% fewer misses and 86% fewer misses than MKL [bar chart, linear scale]

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

One-sided Factorizations (LU, QR), so far

• Classical approach:
    for i = 1 to n
      update column i
      update trailing matrix
  – words_moved = O(n^3)
• Blocked approach (LAPACK):
    for i = 1 to n/b
      update block i of b columns
      update trailing matrix
  – words_moved = O(n^3 / M^(1/3))
• Recursive approach:
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        update right half of A
        factor(right half of A)
  – words_moved = O(n^3 / M^(1/2))
• None of these approaches minimizes #messages
• Parallel case: partial pivoting ⇒ n reductions
• Need another idea

TSQR: An Architecture-Dependent Algorithm

[Figure: reduction trees for TSQR on W = [W0; W1; W2; W3]]
• Parallel (binary tree): each W_i is factored locally into R00, R10, R20, R30; pairs combine into R01 and R11; those combine into R02
• Sequential / streaming (flat tree): W0 → R00; [R00; W1] → R01; [R01; W2] → R02; [R02; W3] → R03
• Dual core (hybrid tree): a mix of the two patterns above
• Can choose the reduction tree dynamically
• Multicore, Multisocket, Multirack, Multisite, Out-of-core: same idea
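A small Python sketch of the parallel TSQR reduction for a tall-skinny W (four row blocks, binary tree), using numpy's QR for each small local factorization. Only the R factors travel up the tree; a real implementation keeps the local Q factors implicitly per node, which is omitted here for brevity.

import numpy as np

def tsqr_R(blocks):
    """Return the triangular factor R of [W0; W1; ...] via a binary reduction tree."""
    Rs = [np.linalg.qr(W, mode='r') for W in blocks]       # local QRs (in parallel)
    while len(Rs) > 1:                                      # combine pairs up the tree
        Rs = [np.linalg.qr(np.vstack(Rs[i:i + 2]), mode='r')
              for i in range(0, len(Rs), 2)]
    return Rs[0]

W = np.random.rand(4000, 8)
blocks = np.array_split(W, 4)                               # W0..W3 on 4 "processors"
R_tree = tsqr_R(blocks)
R_ref = np.linalg.qr(W, mode='r')
# R is unique up to the signs of its rows; fix signs before comparing.
sign_fix = np.sign(np.diag(R_tree)) * np.sign(np.diag(R_ref))
assert np.allclose(R_tree * sign_fix[:, None], R_ref)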

Back to LU: use a similar idea for TSLU as for TSQR – use a reduction tree to do "Tournament Pivoting"

  W_(n×b) = [W1; W2; W3; W4] = [P1·L1·U1; P2·L2·U2; P3·L3·U3; P4·L4·U4]
  • Choose b pivot rows of W1, call them W1'; likewise choose W2', W3', W4'

  [W1'; W2'] = P12·L12·U12,   [W3'; W4'] = P34·L34·U34
  • Choose b pivot rows of each, call them W12' and W34'

  [W12'; W34'] = P1234·L1234·U1234
  • Choose b pivot rows

• Go back to W and use these b pivot rows (move them to the top, then do LU without pivoting); a toy selection sketch follows below
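A toy Python sketch of the tournament just described: each "processor" proposes b pivot rows of its block using ordinary partial-pivoting LU as the local selection rule, and pairs of candidate sets play off until b winners remain. This is an illustration of the reduction pattern, not the actual CALU code.

import numpy as np
from scipy.linalg import lu

def candidate_pivots(W_block, b):
    """Indices (within W_block) of the first b rows chosen by partial pivoting."""
    P, L, U = lu(W_block)                      # W_block = P @ L @ U
    return [int(np.argmax(P[:, i])) for i in range(b)]

def tournament_pivots(W, b, nblocks=4):
    """TSLU-style tournament over nblocks row blocks; returns b pivot rows of W."""
    groups = np.array_split(np.arange(W.shape[0]), nblocks)
    cands = [g[candidate_pivots(W[g], b)] for g in groups]   # round 0: local choices
    while len(cands) > 1:                                    # pairwise play-offs
        merged = []
        for i in range(0, len(cands), 2):
            rows = np.concatenate(cands[i:i + 2])
            merged.append(rows[candidate_pivots(W[rows], b)])
        cands = merged
    return cands[0]

W = np.random.rand(1024, 8)
print(tournament_pivots(W, b=8))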

37

Minimizing Communication in TSLU

[Figure: the same reduction trees as for TSQR, with a small LU at every node]
• Parallel (binary tree): local LU on each W_i, then pairwise LUs up the tree
• Sequential / streaming (flat tree): a chain of LUs
• Dual core: hybrid tree
• Can choose the reduction tree dynamically to match the architecture, as before

Making TSLU Numerically Stable

• Details matter
  – Going up the tree, we could do LU either on the original rows of A (tournament pivoting) or on the computed rows of U
  – Only tournament pivoting is stable
• "Thm": the new scheme is as stable as Partial Pivoting (GEPP), in the following sense: it gets the same Schur complements as GEPP applied to a different input matrix whose entries are blocks taken from the input A
• Why just a "Thm"?

Stability of LU using TSLU: CALU

• Empirical testing
  – Both random matrices and "special ones"
  – Both binary tree (BCALU) and flat-tree (FCALU)
  – 3 metrics: ||PA−LU|| / ||A||, normwise and componentwise backward errors
  – See [D., Grigori, Xiang, 2010] for details

Why is the stability of TSLU just a "Thm"?

• The proof is correct – in exact arithmetic
• Experiment:
  – Generate 100 random 6×6, rank-3 matrices in Matlab
  – [L,U,P] = lu(A); then do LU without pivoting on P·A and compare the L factors: are they the same?
    • Compute || L − Lnp ||: a few 0's, a few ∞'s, a few NaNs; the rest mostly O(1)
  – Why? Floating point is nonassociative: doing the arithmetic in a different order gives different rounding errors
  – Same experiment with rank-6 matrices: || L − Lnp || usually nonzero, O(macheps)
  – Same experiment with 20×20, rank-4 matrices: || L − Lnp || often O(10^3)
• Much harder to break TSLU, but possible
  – Occurred when using TSLU to factorize a low-rank subdiagonal panel in a symmetric-indefinite factorization
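For readers without Matlab, here is a numpy/scipy version of the experiment sketched above. It factors each random rank-3 matrix with partial pivoting, reruns LU without pivoting on the pre-permuted matrix, and compares the L factors; the simple no-pivot LU below is an assumption of how an unpivoted code behaves (it pushes on through tiny pivots).

import numpy as np
from scipy.linalg import lu

def lu_nopivot(A):
    A = A.astype(float).copy()
    n = A.shape[0]
    L = np.eye(n)
    for k in range(n - 1):
        piv = A[k, k]
        if piv == 0.0:
            piv = np.finfo(float).tiny     # keep going, as an unpivoted code would
        L[k+1:, k] = A[k+1:, k] / piv
        A[k+1:, k:] -= np.outer(L[k+1:, k], A[k, k:])
    return L, np.triu(A)

rng = np.random.default_rng(0)
diffs = []
for _ in range(100):
    A = rng.standard_normal((6, 3)) @ rng.standard_normal((3, 6))   # rank 3
    P, L, U = lu(A)                        # A = P @ L @ U (partial pivoting)
    Lnp, Unp = lu_nopivot(P.T @ A)         # same row order, no pivoting
    diffs.append(np.max(np.abs(L - Lnp)))
print(min(diffs), max(diffs))              # in practice ranges from O(1) up to inf/NaN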

Fixing TSLU

• Run TSLU quickly; test for stability; fix if necessary (rare)
  – Test the conditioning of U: if not tiny (usual case), proceed; else
  – Compute ||L||: if not big (usual case), proceed; else
  – Factor A = QR using TSQR, then
  – Factor Q = PLU using TSLU, then
  – A = P·L·(U·R), with U·R as the upper triangular factor
• Last topic in the lecture: how to guarantee floating point reproducibility

2D CALU with Tournament Pivoting
[figure]

2.5D CALU with Tournament Pivoting (c = 4 copies)
[figure]

Exascale Machine Parameters (Source: DOE Exascale Workshop)

• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node

Exascale Predicted Speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU

[Contour plot: x-axis log2(p), y-axis log2(n^2/p) = log2(memory_per_proc); up to 29x speedup]

2.5D vs 2D LU, With and Without Pivoting
[figure]

Other CA algorithms for Ax=b, least squares (1/3)

• A symmetric and indefinite
  – Seek a factorization that retains symmetry: P·A·P^T = L·D·L^T, with D "simple"
    • Saves half the flops, preserves inertia
  – Usual approach: Bunch-Kaufman
    • D block diagonal with 1×1 and 2×2 blocks
    • Pivot search down a column and along a row (lots of communication)
  – Alternative: Aasen
    • D = tridiagonal = T
    • Two steps:
      – P·A·P^T = L·T·L^T, where T is banded, using TSLU [figure: band structure of T]
      – Solve/factor the narrow-band problem with T
    • Up to 2.8x faster than MKL; Best Paper at IPDPS'13

Other CA algorithms for Ax=b, least squares (2/3)

• Minimizing bandwidth and latency for sequential GEPP
  – So far, we could not do partial pivoting and minimize #messages, just #words
  – Challenge:
    • Column layout: good for choosing pivots, bad for matmul
    • Blocked layout: good for matmul, bad for choosing pivots
  – Solution: use both layouts, switching between them
    • "Shape Morphing LU", or SMLU

• Recursive LU (columnwise layout):
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        update right half of A
        factor(right half of A)
  – Words = O(n^3 / M^(1/2))
  – Messages = O(n^3 / M)

• Shape Morphing LU:
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        reshape to recursive block format
        update right half of A
        reshape to columnwise format
        factor(right half of A)
  – Words = O(n^3 / M^(1/2))
  – Messages = O(n^3 / M^(3/2))

Other CA algorithms for Ax=b, least squares (3/3)

• The need for pivoting arises beyond LU, e.g. in QR
  – Choose a permutation P so that the leading columns of A·P = Q·R span the column space of A – Rank Revealing QR (RRQR)
  – Usual approach, like partial pivoting:
    • Put the longest column first, update the rest of the matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit the lower bound
  – Use Tournament Pivoting
    • Each round of the tournament selects the best b columns from two groups of b columns, either using the usual approach or something better (Gu/Eisenstat)
    • Thm: this approach "reveals the rank" of A, in the sense that the leading r×r submatrix of R has singular values "near" the largest r singular values of A; ditto for the trailing submatrix
  – The idea extends to other pivoting schemes
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDL^T with complete pivoting

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What about sparse matrices? (1/3)

• If the matrix quickly becomes dense, use a dense algorithm
• Ex: All-Pairs Shortest Path using Floyd-Warshall
• Similar to matmul: let D = A, then
    for k = 1:n
      for i = 1:n
        for j = 1:n
          D(i,j) = min(D(i,j), D(i,k) + D(k,j))
• But we can't reorder the outer loop for 2.5D; need another idea
• Abbreviate D(i,j) = min(D(i,j), min_k(A(i,k)+B(k,j))) by D = A⊗B
  – Dependencies ok, 2.5D works, just a different semiring
• Kleene's Algorithm (a runnable sketch follows below):
    D = DC-APSP(A, n):
      D = A
      Partition D = [[D11, D12], [D21, D22]] into n/2 × n/2 blocks
      D11 = DC-APSP(D11, n/2)
      D12 = D11 ⊗ D12
      D21 = D21 ⊗ D11
      D22 = D21 ⊗ D12
      D22 = DC-APSP(D22, n/2)
      D21 = D22 ⊗ D21
      D12 = D12 ⊗ D22
      D11 = D12 ⊗ D21
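Here is a runnable Python version of the two pieces of pseudocode above: the classical Floyd-Warshall triple loop and the divide-and-conquer (Kleene) form built from (min,+) "matrix multiplies". It shows the algebra only, not the 2.5D distribution; note that the ⊗ used above folds in an elementwise min with the existing block, which the sketch makes explicit where it matters.

import numpy as np

def minplus(A, B):
    """(A (x) B)(i,j) = min_k A(i,k) + B(k,j)."""
    return np.min(A[:, :, None] + B[None, :, :], axis=1)

def floyd_warshall(A):
    D = A.copy()
    for k in range(len(D)):
        D = np.minimum(D, D[:, k:k+1] + D[k:k+1, :])
    return D

def dc_apsp(A):
    n = len(A)
    if n == 1:
        return np.minimum(A, 0.0)      # staying put costs 0
    h = n // 2
    D = A.copy()
    D11, D12, D21, D22 = D[:h, :h], D[:h, h:], D[h:, :h], D[h:, h:]
    D11[:] = dc_apsp(D11)
    D12[:] = minplus(D11, D12)
    D21[:] = minplus(D21, D11)
    D22[:] = np.minimum(D22, minplus(D21, D12))
    D22[:] = dc_apsp(D22)
    D21[:] = minplus(D22, D21)
    D12[:] = minplus(D12, D22)
    D11[:] = np.minimum(D11, minplus(D12, D21))
    return D

INF = 1e18
n = 8
rng = np.random.default_rng(1)
A = np.where(rng.random((n, n)) < 0.4, rng.random((n, n)), INF)
np.fill_diagonal(A, 0.0)
assert np.allclose(floyd_warshall(A), dc_apsp(A))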

Performance of 2.5D APSP using Kleene

• Strong scaling on Hopper (Cray XE6 with 1024 nodes = 24,576 cores)
  [plot annotations: 6.2x speedup, 2x speedup]

What about sparse matrices? (2/3)

• If parts of the matrix become dense, optimize those
• Ex: Cholesky on a matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): if all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on a w × w submatrix
• Thm: Words_moved = Ω(w^3 / M^(1/2)), etc.
• Thm (George '73): nested dissection gives an optimal ordering for 2D grids, 3D grids, and similar matrices
  – w = n for a 2D n×n grid; w = n^2 for a 3D n×n×n grid
• Sequential multifrontal Cholesky attains the bounds
• PSPACES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on the separators)

What about sparse matrices? (3/3)

• If the matrix stays very sparse, the lower bound is unattainable; need a new one
• Ex: A, B both diagonal: no communication in the parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), i.i.d.
• Assumption: the algorithm is sparsity-independent: the assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: a parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
  Words_moved = Ω( min( d·n/P^(1/2), d^2·n/P ) )
  – Proof exploits the fact that reuse of entries of C = A·B is unlikely
• Contrast the general lower bound: Words_moved = Ω( d^2·n/(P·M^(1/2)) )
• Attained by a divide-and-conquer algorithm that splits the matrices along the dimensions most likely to minimize cost

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Symmetric Eigenproblem and SVD

• Usual approach for A = A^T (SVD similar):
  – A → Q^T·A·Q = T, where Q orthogonal, T tridiagonal
  – T → U^T·T·U = Λ, where U orthogonal, Λ diagonal
  – (Q·U)'s columns are the eigenvectors, Λ the eigenvalues
  – Dense → Tridiagonal → Diagonal
  – Only half BLAS3, half BLAS2, in LAPACK's sytrd
• Communication-Avoiding approach:
  – A → Q·A·Q^T = B, where B = B^T is banded, of bandwidth M^(1/2)
  – Continue as above, starting with B
  – Dense → Banded → Tridiagonal → Diagonal
  – Dense → Banded: use TSQR to zero out M^(1/2) cols/rows at a time
  – Banded → Tridiagonal: need a new(ish) idea

Successive Band Reduction (Bischof/Lang/Sun)

• Notation: b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b
• [Sequence of figures (band of half-width b+1): sweeps Q1, Q1^T, Q2, Q2^T, …, Q5, Q5^T each annihilate a c-column, (d+1)-diagonal block, creating a bulge of size d+c that is chased down the band; the chase proceeds in steps labeled 1 through 6]

Conventional vs CA-SBR

• Conventional: touch all data 4 times
• Communication-avoiding: touch all data once

Speedups of Symmetric Band Reduction vs DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem

• No apparent way to modify the standard algorithm
• Instead: Spectral Divide-and-Conquer
  – Find an orthogonal matrix Q whose leading columns span an invariant subspace of A
  – Q^T·A·Q will be block upper triangular: [ A11 A12 ; ε A22 ]
  – Apply recursively to A11, A22
  – Depends on randomization:
    1. Randomized Rank Revealing QR decomposition
    2. Randomized location to try splitting the spectrum

Attaining the Lower Bounds: Sequential
Legend: [Existing], [Ours], [Math-Lib], [Random]
(Each row lists the citation groups, in order, for: Two Levels – Words, Messages; Memory Hierarchy – Words, Messages)

• BLAS-3: [FLPR'99][BDLST'13][MKL etc.] | [FLPR'99][BDLST'13][MKL etc.]
• Cholesky: [G'97][AP'00][LAPACK][BDHS'09] | [G'97][AP'00][BDHS'09] | [G'97][AP'00][BDHS'09]
• Sym. Indefinite: [BBDDDPSTY'13] | [BBDDDPSTY'13]
• LU: [G'97][T'97][GDX'11][BDLST'13] | [GDX'11][BDLST'13] | [G'97][T'97][BDLST'13] | [BDLST'13]
• QR: [EG'98][FW'03][DGHL'12][BDLST'13] | [FW'03][DGHL'12][BDLST'13] | [EG'98][FW'03][BDLST'13] | [FW'03][BDLST'13]
• Rank-Revealing QR: [BDD'11][DGGX'13]
• Sym. Eig & SVD: [BDD'11][BDK'13] | [BDD'11]
• Non-Sym. Eig: [BDD'11] | [BDD'11]

Attaining the Lower Bounds: Parallel 2D, M = Θ(n^2/P)
(Ignoring poly-log(P) factors; words = Ω(n^2/P^(1/2)), messages = Ω(P^(1/2)))
Legend: [Existing], [Ours], [Math-Lib], [Random]
(Columns: Words (BW), Messages (L), and the saving factor when attaining with extra memory: 2.5D, M = Θ(c·n^2/P))

• BLAS-3: [AGZ'94][MT'99][ScaLAPACK] | [C'69][vGW'97][SD'11] | L: n/P^(1/2)
• Cholesky: [ScaLAPACK][T'99][SD'11] | L: n/P^(1/2)
• Sym. Indefinite: [BBDDDPSTY'13][ScaLAPACK] | [BBDDDPSTY'13] | L: n/P^(1/2)
• LU: [ScaLAPACK][GDX'11][T'99][SD'11] | [GDX'11][T'99][SD'11] | L: n/P^(1/2)
• QR: [ScaLAPACK][DGHL'12][T'99] | [DGHL'12][T'99] | L: n/P^(1/2)
• Rank-Revealing QR: [BDD'11][DGGX'13]
• Sym. Eig & SVD: [BDD'11][BDK'13][ScaLAPACK] | [BDD'11][BDK'13] | L: n/P^(1/2)
• Non-Sym. Eig: [BDD'11] | [BDD'11] | BW: P^(1/2), L: n

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and a starting vector
  – Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume the matrix is "well-partitioned"
  – Serial implementation:
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors:
    • Conventional: O(k log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV

• n = 21200, nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• [spy plot of the matrix]
• There is an 8×8 dense substructure: exploit this to limit memory references

Speedups on Itanium 2: The Need for Search

• [Register-blocking profile, in Mflops: reference implementation 190 Mflops; best block size (4×2) 1190 Mflops]

Register Profiles: IBM and Intel IA-64

• [Heat maps of Mflops over register block sizes; best fraction of machine peak: Power3 ≈ 17%, Power4 ≈ 16%, Itanium 1 ≈ 8%, Itanium 2 ≈ 33%; Mflops ranges shown: 122–252 (Power3), 459–820 (Power4), 107–247 (Itanium 1), 190 Mflops–1.2 Gflops (Itanium 2)]

Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• n = 16614, nnz = 1.1 M
• [spy plot; zoom in to the top corner]

3×3 blocks look natural, but…

• Example: 3×3 blocking
  – Logical grid of 3×3 cells
• But it would lead to lots of "fill-in"

Extra Work Can Improve Efficiency

• Example: 3×3 blocking
  – Logical grid of 3×3 cells
  – Fill in explicit zeros
  – Unroll the 3×3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup!
  – Actual Mflop rate is 1.5^2 = 2.25x higher
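A small illustration of register blocking in Python: store the matrix in r×c blocks (filling in explicit zeros where needed) and multiply block by block. SciPy's BSR format provides exactly this storage; the "fill ratio" discussed above is the ratio of stored entries after blocking to the original nonzeros. This only shows the data structure, not the hand-tuned kernels an autotuner would generate.

import numpy as np
import scipy.sparse as sp

A = sp.random(3000, 3000, density=1e-3, format='csr', random_state=0)
x = np.random.rand(3000)

A_bsr = A.tobsr(blocksize=(3, 3))            # 3x3 register blocking, zeros filled in
fill_ratio = A_bsr.data.size / A.nnz         # stored entries (incl. explicit zeros) / nnz
y_csr = A @ x
y_bsr = A_bsr @ x
assert np.allclose(y_csr, y_bsr)
print("fill ratio:", fill_ratio)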

Source: Accelerator Cavity Design Problem (Ko via Husbands)
[spy plot]

100×100 Submatrix Along Diagonal
[spy plot]

Post-RCM Reordering
[spy plot]

Effect of Combined RCM+TSP Reordering

• Before: Green + Red; After: Green + Blue
• 2x speedups on Pentium 4, Power 4, …

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·A^T·x, A^T·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later…

Optimized Sparse Kernel Interface – OSKI

• Provides sparse kernels automatically tuned for the user's matrix & machine
  – BLAS-style functionality: SpMV (A·x & A^T·y), TrSV
  – Does both off-line and run-time tuning
  – Hides the complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as a stand-alone library
  – Available as a PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: Classical Conjugate Gradient (CG)

• [algorithm listing]
• SpMVs and dot products require communication in each iteration

Example: CA-Conjugate Gradient

• [algorithm listing]
• The k SpMVs are done via the CA matrix powers kernel
• The dot products become one global reduction to compute a Gram matrix G per outer iteration
• Local computations within the inner loop require no communication
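For reference, here is a Python sketch of the basis the communication-avoiding matrix powers kernel produces: [x, A·x, A^2·x, …, A^s·x] (the monomial basis; Newton or Chebyshev bases are used in practice for stability). The sketch simply does s repeated SpMVs; the CA version computes the same vectors with O(1) passes over A by working on overlapping partitions.

import numpy as np
import scipy.sparse as sp

def monomial_basis(A, x, s):
    V = np.empty((s + 1, len(x)))
    V[0] = x
    for j in range(s):
        V[j + 1] = A @ V[j]
    return V

n = 900
A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format='csr')  # 1D Poisson model
x = np.random.rand(n)
V = monomial_basis(A, x, s=8)
print(V.shape)        # (9, 900): the Krylov basis vectors consumed by CA-CG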

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

[Convergence plot: CA-CG (monomial basis) vs CG]
• Model problem: 2D Poisson, 5-point stencil, 30×30 grid, cond(A) ≈ 400
• Slower convergence due to roundoff; loss of accuracy due to roundoff (relative to machine precision)
• At s = 16 the monomial basis is rank deficient! The method breaks down.

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?

• Requires o(n^2) data/indices to store
• Nonzero entries and indices could be explicit or implicit:

                           Indices explicit (O(nnz))    Indices implicit (o(nnz))
  Entries explicit (O(nnz))   CSR and variations           Vision, climate, AMR, …
  Entries implicit (o(nnz))   Graph Laplacian              Stencils

• The matrix could be a sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + U·D·V^T, with D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write A^k = (S + U·D·V^T)^k as a sum of S^k and low-rank matrices

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Reproducible Floating Point Computation

• Goal: get a bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe the results
  – Willing to sacrifice 40%–50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What?! How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility

• [Plots: absolute error for random vectors; relative error for orthogonal vectors]
  – Same magnitude, opposite signs: even the sign is not reproducible
• Vector size 1e6, data aligned to 16-byte boundaries; for each input vector:
  – Dot products are computed using 1, 2, 3, or 4 threads
  – Absolute error = maximum – minimum
  – Relative error = absolute error / maximum absolute value

Goals/Approaches for Reproducibility

• Consider summation or a dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee a fixed reduction tree (fails goals 2 and 3)
  – Use (very) high precision to get the exact answer (fails goal 2)
  – Prerounding technique (Nguyen, D.) – see the sketch below

• Performance results on 1024 processors of a Cray XC30: 1.2x to 3.2x slowdown vs the fastest code, for n = 1M
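A tiny Python demonstration of the problem and of the "fixed reduction tree" idea listed above (the prerounding technique of Nguyen and D. is more sophisticated and also gets good performance; it is not implemented here). The chunked sum mimics what a naive parallel reduction does for different processor counts; the fixed pairwise tree is independent of the layout by construction.

import numpy as np

def sum_by_chunks(x, nchunks):
    """Per-'processor' partial sums, then combine (answer depends on nchunks)."""
    return float(sum(chunk.sum() for chunk in np.array_split(x, nchunks)))

def fixed_tree_sum(x):
    """Pairwise summation in one fixed order, independent of the data layout."""
    vals = list(map(float, x))
    while len(vals) > 1:
        vals = [vals[i] + vals[i + 1] if i + 1 < len(vals) else vals[i]
                for i in range(0, len(vals), 2)]
    return vals[0]

rng = np.random.default_rng(0)
x = rng.standard_normal(10**5) * 10.0**rng.integers(-8, 8, 10**5)

print(sorted({sum_by_chunks(x, p) for p in (1, 2, 3, 4, 8)}))  # often several distinct values
print(fixed_tree_sum(x))   # one value, no matter how many "processors" we simulate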

Collaborators and Supporters

• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

Time to redesign all linear algebra and n-body algorithms and software
(and compilers)

Page 10: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

Limits to parallel scaling (22)

bull Consider dense case flops_per_proc = n3Pndash Words = (n3(PM12 ))ndash Messages = (n3(PM32 ))

bull How big can we make P and Mbull Assume we start with 1 copy of inputs A and B

ndash Otherwise no communication may be needed

bull Thm Words= (n2P23 ) independent of Mbull Reached when M = n2P23 too or P = n3M32 and Messages = (1) (log P in practice)bull Attained by 25D algorithm when c=P13 (ldquo3D algrdquo)bull Can keep increasing P until P = n3 Words = Messages = (1) (log n in practice)

Can we attain these lower bounds

bull Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these boundsndash Often not

bull If not are there other algorithms that dondash Yes for much of dense linear algebrandash New algorithms with new numerical properties

new ways to encode answers new data structures

ndash Not just loop transformations (need those too)bull Only a few sparse algorithms so farbull Lots of work in progress

ndash Algorithms Energy Heterogeneous Processors hellip11

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

25D Matrix Multiplication

bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid

c

(Pc)12

(Pc)12

Example P = 32 c = 2

25D Matrix Multiplication

bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid

k

j

iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12

(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)

(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)

(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)

25D Matmul on BGP 16K nodes 64K coresc = 16 copies

Distinguished Paper Award EuroParrsquo11 (Solomonik D)SCrsquo11 paper by Solomonik Bhatele D

12x faster

27x faster

Perfect Strong Scaling ndash in Time and Energy (12)

bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2

bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model

ndash γT βT αT = secs per flop per word_moved per message of size m

bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ]

= T(P)cbull Notation for energy model

ndash γE βE αE = joules for same operations

ndash δE = joules per word of memory used per sec

ndash εE = joules per sec for leakage etc

bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP)

= E(P)bull Perfect scaling extends to N-body Strassen hellip

Perfect Strong Scaling ndash in Time and Energy (22)

bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ] = T(P)c

bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP) = E(P)

bull Can use these formulas to answer many questions such asndash How to choose p and M to minimize energy E needed for computationndash Given max allowed runtime T what is minimum energy E needed to achieve

itndash Given max allowed energy E what is the minimum runtime T attainablendash Can we minimize the average power P = ETndash Given target energy efficiency what architectural parameters are needed to

achieve itbull Can we attain 75 GflopsWattbull Can we attain an exaflop for 20 MWatts

Handling Heterogeneity

bull Suppose each of P processors could differndash γi = secflop βi = secword αi = secmessage Mi = memory

bull What is optimal assignment of work Fi to minimize timendash Ti = Fi γi + Fi βi Mi

12 + Fi αi Mi32 = Fi [γi + βi Mi

12 + αi Mi32] = Fi ξi

ndash Choose Fi so Σi Fi = n3 and minimizing T = maxi Ti

ndash Answer Fi = n3(1ξi)Σj(1ξj) and T = n3Σj(1ξj)

bull Optimal Algorithm for nxn matmulndash Recursively divide into 8 half-sized subproblemsndash Assign subproblems to processor i to add up to Fi flops

bull Works for Strassen other algorithmshellip

Application to Tensor Contractions

bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

bull Heavily used in electronic structure calculationsndash Ex NWChem

bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries

ndash Solomonik Hammond Matthews

C(ijk) = Σm A(ijm)B(mk)

A3-fold symm

B2-fold symm

C2-fold symm

Application to Tensor Contractions

bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn

bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews

Communication Lower Bounds for Strassen-like matmul algorithms

bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected

bull Extends up to M = n2 p2ω bull Extends to rectangular case multiply (mxn)(nxp) in q mults

ndash words_moved = Ω (flopsM^(logmpq -1))

bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz also in JACMbull Is the lower bound attainable

Classical O(n3) matmul

words_moved =Ω (M(nM12)3P)

Strassenrsquos O(nlg7) matmul

words_moved =Ω (M(nM12)lg7P)

Strassen-like O(nω) matmul

words_moved =Ω (M(nM12)ωP)

vs

Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory

Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory

CAPS If EnoughMemory and P 7 then BFS step else DFS step end if

Communication Avoiding Parallel Strassen (CAPS)

Best way to interleaveBFS and DFS is an tuning parameter

26

Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

Speedups 24-184(over previous Strassen-based algorithms)

Invited to appear as Research Highlight in CACM

Strassen-like beyond matmul

bull Thm (D Dumitriu Holtzrsquo07) Any Strassen-like O(nω) matmul algorithm can be used to build a numerically stable O(nω+η) algorithm for any ηgt0 for Ax=b least squares eig SVD hellipndash ηgt0 needed to deal with numerical stabilityndash Strassen already stable so η=0

bull Thm For sequential versions of these algorithms Words_moved = O(nω+ηM(ω+η)2 ndash 1 + n2 log n) ie attain expected lower bound

Ballard D Holtz Schwartz

Cache and Network Oblivious Algorithms

bull Motivation Minimizes communication at every level of a hierarchical system without tuning parameters (in theory)ndash Not always 25D Matmul on BGP was topology aware

bull CAPS Divide-and-conquer choose BFS or DFS to adapt to processors available memory

bull CARMAndash Divide-and-conquer classical matmul divide largest of 3

dimensions to create two subproblemsndash Choose BFS or DFS to adapt to processors available memory

CARMA Performance Distributed Memory

Square m = k = n = 6144

ScaLAPACK

CARMA

Peak

(log)

(log)

Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA

CARMA Performance Distributed Memory

Inner Product m = n = 192 k = 6291456

ScaLAPACK

CARMAPeak

(log)

(log)

Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA

CARMA Performance Shared Memory

Square m = k = n

MKL (double)CARMA (double)

MKL (single)CARMA (single)

Peak (single)

Peak (double)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

CARMA Performance Shared Memory

Inner Product m = n = 64

MKL (double)

CARMA (double)

MKL (single)

CARMA (single)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

Why is CARMA Faster in Shared MemoryL3 Cache Misses

Shared Memory Inner Product (m = n = 64 k = 524288)

97 Fewer Misses

86 Fewer Misses

(linear)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

One-sided Factorizations (LU QR) so farbull Classical Approach for i=1 to n update column i update trailing matrixbull words_moved = O(n3)

35

bull Blocked Approach (LAPACK) for i=1 to nb update block i of b columns update trailing matrixbull words moved = O(n3M13)

bull Recursive Approach func factor(A) if A has 1 column update it

else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)

bull None of these approaches minimizes messagesbull Parallel case Partial

Pivoting =gt n reductionsbull Need another idea

TSQR An Architecture-Dependent Algorithm

W =

W0

W1

W2

W3

R00

R10

R20

R30

R01

R11

R02Parallel

W =

W0

W1

W2

W3

R01R02

R00

R03

SequentialStreaming

W =

W0

W1

W2

W3

R00

R01

R01

R11

R02

R11

R03

Dual Core

Can choose reduction tree dynamically

Multicore Multisocket Multirack Multisite Out-of-core

Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo

Wnxb =

W1

W2

W3

W4

P1middotL1middotU1

P2middotL2middotU2

P3middotL3middotU3

P4middotL4middotU4

=

Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo

W1rsquoW2rsquoW3rsquoW4rsquo

P12middotL12middotU12

P34middotL34middotU34

=Choose b pivot rows call them W12rsquo

Choose b pivot rows call them W34rsquo

W12rsquoW34rsquo

= P1234middotL1234middotU1234

Choose b pivot rows

Go back to W and use these b pivot rows (move them to top do LU without pivoting)

37

Minimizing Communication in TSLU

W = W1

W2

W3

W4

LULULULU

LU

LULUParallel

W = W1

W2

W3

W4

LULU

LU

LUSequentialStreaming

W = W1

W2

W3

W4

LULU LU

LULU

LULU

Dual Core

Can choose reduction tree dynamically to match architecture as before

38

Making TSLU Numerically Stable

bull Details matterndash Going up the tree we could do LU either on original rows of A

(tournament pivoting) or computed rows of Undash Only tournament pivoting stable

bull ldquoThmrdquo New scheme as stable as Partial Pivoting (GEPP) in following sense Get same Schur complements as GEPP applied to different input matrix whose entries are blocks taken from input A

bull Why just a ldquoThmrdquo

39

Stability of LU using TSLU CALU

Summer School Lecture 4 40

bull Empirical testingndash Both random matrices and ldquospecial onesrdquondash Both binary tree (BCALU) and flat-tree (FCALU)ndash 3 metrics ||PA-LU||||A|| normwise and componentwise backward errorsndash See [D Grigori Xiang 2010] for details

Why is stability of TSLU just a ldquoThmrdquo

bull Proof is correct ndash in exact arithmeticbull Experiment

ndash Generate 100 random 6x6 rank 3 matrices in Matlabndash [LUP] = lu(A) do LU without pivoting on PA compare L factors are

they the samebull Compute || L ndash Lnp || A few 0rsquos A few infinrsquos a few NaNsbull Rest mostly O(1)

ndash Why Floating point is nonassociative doing arithmetic in different order gives different rounding errors

ndash Same experiment with rank 6 matrices || L ndash Lnp || usually nonzero O(macheps)

ndash Same experiment with 20x20 rank 4 matrices || L ndash Lnp || often O(103)

bull Much harder to break TSLU but possiblendash Occurred when using TSLU to factorize a low-rank subdiagonal

panel in symmetric-indefinite factorization41

Fixing TSLU

bull Run TSLU quickly test for stability fix if necessary (rare)

bull Test conditioning of U if not tiny (usual case) proceed elsebull Compute || L || if not big (usual case) proceed elsebull Factor A = QR using TSQR thenbull Factor Q = PLU using TSLU thenbull A = PL(UR) with UR as upper triangular factor

bull Last topic in lecture how to guarantee floating point reproducibility

42

2D CALU with Tournament Pivoting

43

25D CALU with Tournament Pivoting (c=4 copies)

44

Exascale Machine ParametersSource DOE Exascale Workshop

bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log

2 (

n2p

) =

log

2 (m

emo

ry_p

er_p

roc)

Up to 29x

25D vs 2D LUWith and Without Pivoting

Other CA algorithms for Ax=b least squares(13)

bull A symmetric and indefinitendash Seek factorization that retains symmetry PAPT = LDLT D

ldquosimplerdquobull Save frac12 flops preserve inertia

ndash Usual approach Bunch-Kaufmanbull D block diagonal with 1x1 and 2x2 blocksbull Pivot search down column along row (lots of communication)

ndash Alternative Aasenbull D = tridiagonal = Tbull Two steps

ndash PAPT = LTLT where T is banded using TSLU

48

0 0

0

0 0

0

0

hellip

hellip

ndash Solvefactor narrow band problem with Tbull Up to 28x faster than MKL Best Paper at IPDPSrsquo13

Other CA algorithms for Ax=b least squares (23)bull Minimizing bandwidth and latency for sequential GEPP

ndash So far could not do partial pivoting and minimize messages just words

ndash Challengebull Column layout good for choosing pivots bad for matmulbull Blocked layout good for matmul bad for choosing pivots

ndash Solution use both layouts switching between thembull ldquoShape Morphing LUrdquo or SMLU

49

bull func factor(A) if A has 1 column update it else factor(left half of A)

update right half of A

factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M)

bull func factor(A) if A has 1 column update it else factor(left half of A) reshape to recursive block format update right half of A reshape to columnwise format factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M32)

Other CA algorithms for Ax=b least squares (33)bull Need for pivoting arises beyond LU in QR

ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)

ndash Usual approach like Partial Pivoting

bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound

ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two

groups of b columns either using usual approach or something better (GuEisenstat)

bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix

ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 50

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What about sparse matrices (13)

bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then

bull But canrsquot reorder outer loop for 25D need another idea

bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring

bull Kleenersquos Algorithm

52

for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)

D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21

Performance of 25D APSP using Kleene

53

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)

62xspeedup

2x speedup

What about sparse matrices (23)

bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of

G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix

bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering

for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid

bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse

multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on

separators) 54

What about sparse matrices (33)

bull If matrix stays very sparse lower bound unattainable new one

bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of

data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)

bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)

Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely

bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices

along dimensions most likely to minimize cost

55

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Symmetric Eigenproblem and SVD

bull Usual approach for A=AT (SVD similar)ndash A QTAQ = T where Q orthogonal T tridiagonalndash T UTTU = Λ where U orthogonal Λ diagonalndash QUrsquos columns are eigenvectors Λ eigenvaluesndash Dense Tridiagonal Diagonalndash Only half BLAS3 half BLAS2 in LAPACKrsquos sytrd

bull Communication-Avoiding Approachndash A QAQT = B where B=BT banded of bandwidth M12

ndash Continue as above starting with Bndash Dense Banded Tridiagonal Diagonalndash Dense Banded use TSQR to zero out M12 colsrows at a timendash Banded Tridiagonal need new(ish) idea

b+1

b+1

Successive Band Reduction (BischofLangSun)

1

b+1

b+1

d+1

c

Successive Band Reduction (BischofLangSun)

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1Q1

b+1

b+1

d+1

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

12

Q1

b+1

b+1

d+1

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

12

Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

Q1

Q1T

Q2

Q2T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

Q1

Q1T

Q2

Q2T

Q3

Q3T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

6

6

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

Conventional vs CA - SBR

Conventional Communication-Avoiding

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Nonsymmetric Eigenproblem

bull No apparent way to modify standard algorithmbull Instead Spectral Divide-and-Conquer

ndash Find orthogonal matrix Q whose leading columns span an invariant subspace of A

ndash QTAQ will be block upper triangular

ndash Apply recursively to A11 A22

ndash Depends on randomization1 Randomized Rank Revealing QR decomposition2 Randomized location to try splitting spectrum

A11 A12

ε A22

Attaining the Lower bounds SequentialLegend[Existing][Ours][Math-Lib][Random]

Two Levels Memory Hierarchy

Words Messages Words Messages

BLAS-3 [FLPRrsquo99][BDLSTrsquo13][MKL etc] [FLPRrsquo99][BDLSTrsquo13][MKL etc]

Cholesky[Grsquo97][APrsquo00]

[LAPACK][BDHSrsquo09]

[Grsquo97][APrsquo00][BDHSrsquo09] [Grsquo97][APrsquo00][BDHSrsquo09]

Sym Indefinite [BBDDDPSTYrsquo13] [BBDDDPSTYrsquo13]

LU[Grsquo97][Trsquo97]

[GDXrsquo11][BDLSTrsquo13]

[GDXrsquo11][BDLSTrsquo13]

[Grsquo97][Trsquo97] [BDLSTrsquo13] [BDLSTrsquo13]

QR[EGrsquo98][FWrsquo03]

[DGHLrsquo12][BDLSTrsquo13]

[FWrsquo03][DGHLrsquo12][BDLSTrsquo13]

[EGrsquo98][FWrsquo03][BDLSTrsquo13]

[FWrsquo03][BDLSTrsquo13]

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13] [BDDrsquo11]

Non Sym Eig [BDDrsquo11] [BDDrsquo11]

Legend[Existing][Ours][Math-Lib][Random]

Words (BW) Messages (L) Saving factor

BLAS-3 [AGZrsquo94][MTrsquo99][ScaLAPACK][Crsquo69][vGWrsquo97][SDrsquo11] L nP12

Cholesky [ScaLAPACK][Trsquo99][SDrsquo11] L nP12

Sym Indefinite [BBDDDPSTYrsquo13][ScaLAPACK] [BBDDDPSTYrsquo13] L nP12

LU [ScaLAPACK][GDXrsquo11][Trsquo99][SDrsquo11] [GDXrsquo11][Trsquo99][SDrsquo11] L nP12

QR [ScaLAPACK][DGHLrsquo12] [Trsquo99] [DGHLrsquo12][Trsquo99] L nP12

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13][ScaLAPACK] [BDDrsquo11][BDKrsquo13] L nP12

Non-Sym Eig [BDDrsquo11] [BDDrsquo11] BW P12 L n

Attaining with extra memory 25D M=(cn2P)

Attaining the Lower bounds Parallel 2DM=(n2P)(Ignoring poly-log(P) factors words = ( n2 P12) messages = (P12)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Avoiding Communication in Iterative Linear Algebra
• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and the starting vector
  – Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, ...
• Goal: minimize communication
  – Assume the matrix is "well-partitioned"
  – Serial implementation: conventional O(k) moves of data from slow to fast memory; new O(1) moves of data (optimal)
  – Parallel implementation on p processors: conventional O(k log p) messages (k SpMV calls, dot products); new O(log p) messages (optimal)
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability
A sketch of the ghost-zone idea behind the O(1)/O(log p) counts follows.
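A hedged sketch of why O(1) data movement (serial) and O(log p) messages (parallel) are possible: for a matrix whose graph is a 1D mesh, a processor owning a contiguous block of rows can take k SpMV steps entirely locally after fetching k extra "ghost" rows from each neighbor once, instead of exchanging a halo every step. The function below only computes that index reach; the partitioning and halo depth are illustrative, not the actual matrix powers kernel.

```python
def ghost_range(my_rows, k, halo_per_step, n):
    """Index range a processor must hold to compute k SpMV steps locally.

    my_rows       : (lo, hi) half-open range of rows this processor owns
    k             : number of SpMV steps blocked together
    halo_per_step : graph distance one SpMV reaches (1 for a 1D 3-point stencil,
                    b for a bandwidth-b matrix)
    n             : total number of rows
    """
    lo, hi = my_rows
    reach = k * halo_per_step
    return max(0, lo - reach), min(n, hi + reach)

# 1D 3-point stencil, n = 1,000,000 rows, 100 processors, k = 8 steps:
n, p, k = 1_000_000, 100, 8
rows_per_proc = n // p
lo, hi = 42 * rows_per_proc, 43 * rows_per_proc      # processor 42's block
glo, ghi = ghost_range((lo, hi), k, halo_per_step=1, n=n)
extra = (ghi - glo) - (hi - lo)
print(f"owned rows: {hi - lo}, extra ghost rows for k={k}: {extra}")
# 16 extra rows out of 10,000 owned: one exchange per neighbor replaces k of them.
```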


Example: The Difficulty of Tuning SpMV
• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

Example: The Difficulty of Tuning
• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs (see the block-CSR sketch below)
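Register blocking exploits exactly that dense substructure: the matrix is stored as small r x c dense blocks (BCSR), so one column index is loaded per block instead of per nonzero, and the r x c multiply can be unrolled in registers. A minimal sketch of the format and its SpMV, with a toy 2x2-blocked matrix (a production kernel would be unrolled C, not Python):

```python
import numpy as np

def bcsr_spmv(n, r, c, row_ptr, block_col, block_val, x):
    """y = A @ x for a matrix stored in r x c block-CSR (BCSR) format.

    row_ptr   : block-row pointers, length n//r + 1
    block_col : block-column index (in units of c columns) per stored block
    block_val : dense r x c blocks, shape (num_blocks, r, c); explicit zeros
                are stored so every block is full (this is the "fill")
    """
    y = np.zeros(n)
    for bi in range(n // r):                 # one block row at a time
        acc = np.zeros(r)                    # r running sums stay "in registers"
        for k in range(row_ptr[bi], row_ptr[bi + 1]):
            j = block_col[k] * c             # one column index per r*c entries
            acc += block_val[k] @ x[j:j + c] # unrolled r x c multiply in real kernels
        y[bi * r:(bi + 1) * r] = acc
    return y

# Tiny example: a 4x4 matrix with two 2x2 blocks on the diagonal.
r = c = 2
row_ptr   = np.array([0, 1, 2])
block_col = np.array([0, 1])
block_val = np.array([[[4., 1.], [1., 4.]],
                      [[2., 0.], [0., 2.]]])   # the zeros here are "fill"
x = np.arange(4, dtype=float)
print(bcsr_spmv(4, r, c, row_ptr, block_col, block_val, x))
```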

Speedups on Itanium 2: The Need for Search
[Figure: SpMV performance (Mflop/s) for each r x c register block size on Itanium 2; the unblocked reference and the best block size (4x2) are marked. The performance surface is irregular, so the best block size has to be found by search; a sketch of such a search follows.]
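That search can be done empirically, which is roughly what autotuners such as OSKI do off-line: convert the matrix to each candidate block size, time SpMV, and keep the fastest. A rough sketch using SciPy's BSR format as the blocked storage; the candidate sizes, test matrix, and timing loop are illustrative, and a real autotuner combines an off-line machine profile with an estimated fill ratio instead of exhaustive conversion.

```python
import time
import numpy as np
import scipy.sparse as sp

def pick_block_size(A_csr, sizes=((1, 1), (2, 2), (2, 4), (4, 2), (4, 4), (8, 8)),
                    trials=5):
    """Empirically choose an r x c register block size for SpMV on this machine."""
    x = np.random.default_rng(0).standard_normal(A_csr.shape[1])
    best = None
    for r, c in sizes:
        A_blocked = sp.bsr_matrix(A_csr, blocksize=(r, c))   # explicit zeros = fill
        t0 = time.perf_counter()
        for _ in range(trials):
            A_blocked @ x
        t = (time.perf_counter() - t0) / trials
        fill = A_blocked.nnz / A_csr.nnz   # stored entries (incl. fill) / true nonzeros
        if best is None or t < best[1]:
            best = ((r, c), t, fill)
    return best

# Matrix with natural 4x4 dense substructure (block-diagonal demo, 8000 x 8000).
blocks = [np.random.rand(4, 4) for _ in range(2000)]
A = sp.block_diag(blocks, format="csr")
(rc, secs, fill) = pick_block_size(A)
print(f"best block {rc}, {secs * 1e3:.2f} ms/SpMV, fill ratio {fill:.2f}")
```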

Register Profile: Itanium 2
[Figure: SpMV Mflop/s over all register block sizes on Itanium 2, ranging from 190 Mflop/s (worst) to 1190 Mflop/s (best).]

Register Profiles: IBM and Intel IA-64
[Figure: register-blocking profiles on four platforms, annotated with the best fraction of machine peak: Power3 (17, 107-247 Mflop/s), Power4 (16, 459-820 Mflop/s), Itanium 1 (8, 122-252 Mflop/s), Itanium 2 (33, 190 Mflop/s-1.2 Gflop/s).]

Another example of tuning challenges for SpMV
• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16,614
• NNZ = 1.1 M

Zoom in to top corner
• More complicated non-zero structure in general
• N = 16,614
• NNZ = 1.1 M

3x3 blocks look natural, but...
• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

Extra Work Can Improve Efficiency
• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5 (stored entries / true nonzeros)
• On Pentium III: 1.5x speedup
  – Actual Mflop rate is 1.5^2 = 2.25x higher (1.5x more flops done in 1/1.5 the time)

Source: Accelerator Cavity Design Problem (Ko via Husbands)
[Figure: spy plot of the matrix.]

100x100 Submatrix Along Diagonal
[Figure: spy plot of the submatrix.]

Post-RCM Reordering
[Figure: spy plot after reverse Cuthill-McKee reordering.]

Effect of Combined RCM+TSP Reordering
Before: green + red. After: green + blue.
[Figure: spy plots before and after reordering.]
2x speedups on Pentium 4, Power 4, ...

Summary of Other Performance Optimizations
• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations...
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·AT·x, AT·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later...

Optimized Sparse Kernel Interface - OSKI
• Provides sparse kernels automatically tuned for the user's matrix & machine
  – BLAS-style functionality: SpMV (Ax & ATy), TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as a stand-alone library
  – Available as a PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, ...
  – bebop.cs.berkeley.edu/poski


Example: Classical Conjugate Gradient (CG)
[Algorithm listing in the original slide.] SpMVs and dot products require communication in each iteration.

Example: CA-Conjugate Gradient
[Algorithm listing in the original slide.] The SpMVs are done via the CA matrix powers kernel, one global reduction computes the Gram matrix G, and the local computations within the inner loop require no communication. A serial CG reference with the per-iteration communication points marked follows.
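For reference, here is a minimal serial CG in NumPy with the operations that would require communication in a distributed run marked in comments; the matrix is the model problem used on the next slide. CA-CG restructures this loop so that s iterations' worth of SpMVs become one matrix powers call and the dot products become one block reduction (this sketch is the classical method, not the CA variant).

```python
import numpy as np
import scipy.sparse as sp

def cg(A, b, tol=1e-8, maxit=500):
    """Classical conjugate gradient for SPD A (serial reference version)."""
    x = np.zeros_like(b)
    r = b.copy()                 # r = b - A @ x with x = 0
    p = r.copy()
    rr = r @ r                   # dot product -> global reduction
    for k in range(maxit):
        Ap = A @ p               # SpMV -> neighbor (halo) communication
        alpha = rr / (p @ Ap)    # dot product -> global reduction
        x += alpha * p
        r -= alpha * Ap
        rr_new = r @ r           # dot product -> global reduction
        if np.sqrt(rr_new) < tol * np.sqrt(b @ b):
            return x, k + 1
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x, maxit

# 2D Poisson, 5-point stencil on a 30x30 grid (the model problem on the next slide).
m = 30
T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(m, m))
A = sp.kronsum(T, T).tocsr()             # 900 x 900, SPD
b = np.ones(A.shape[0])
x, iters = cg(A, b)
print(f"converged in {iters} iterations, residual {np.linalg.norm(b - A @ x):.2e}")
```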


[Figure: convergence of CG vs. CA-CG with the monomial basis. Model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ~ 400. CA-CG shows slower convergence and loss of accuracy due to roundoff; at s = 16 the monomial basis is rank deficient and the method breaks down. Machine precision is marked on the plot.] A small experiment reproducing the basis breakdown follows.
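The breakdown is easy to reproduce: the monomial basis [p, Ap, A^2 p, ..., A^s p] behaves like the power method, so its columns align and the basis loses numerical rank as s grows. A small check on the same model problem (sizes illustrative; Newton or Chebyshev bases are the usual better-conditioned fix):

```python
import numpy as np
import scipy.sparse as sp

def monomial_basis(A, v, s):
    """Columns [v, Av, A^2 v, ..., A^s v] of the monomial Krylov basis."""
    V = np.empty((A.shape[0], s + 1))
    V[:, 0] = v
    for j in range(s):
        V[:, j + 1] = A @ V[:, j]
    return V

# 2D Poisson, 5-point stencil, 30x30 grid => n = 900, cond(A) ~ 400.
m = 30
T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(m, m))
A = sp.kronsum(T, T).tocsr()
v = np.random.default_rng(0).standard_normal(A.shape[0])

for s in (4, 8, 12, 16):
    V = monomial_basis(A, v, s)
    print(f"s = {s:2d}: cond(basis) = {np.linalg.cond(V):.2e}")
# The condition number grows rapidly with s (from column scaling and from the
# columns aligning with the dominant eigenvector), which is why the monomial
# basis loses accuracy and, as in the plot above, becomes rank deficient.
```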


What is a "sparse matrix"?
• Requires o(n^2) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be a sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + UDVT, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write Ak = (S + UDVT)k as a sum of Sk and low-rank matrices
Examples, by how the nonzero entries and the indices are represented:
  – Entries explicit, indices explicit (O(nnz) for both): CSR and variations
  – Entries explicit, indices implicit: vision, climate, AMR, ...
  – Entries implicit, indices explicit: graph Laplacians
  – Entries implicit, indices implicit (o(nnz)): stencils
(See the sketch below for matrix-free examples of the implicit cases.)
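Both implicit cases are easy to illustrate: a stencil needs neither stored entries nor stored indices to apply A, and a sparse-plus-low-rank A = S + UDVT can be applied, and its powers accumulated, without ever forming A. A small sketch with shapes chosen only for illustration:

```python
import numpy as np
import scipy.sparse as sp

def apply_laplacian_1d(x):
    """y = A @ x for the 1D 3-point Laplacian stencil [-1, 2, -1]:
    entries and indices are both implicit (no matrix storage at all)."""
    y = 2.0 * x
    y[1:]  -= x[:-1]
    y[:-1] -= x[1:]
    return y

def apply_sparse_plus_low_rank(S, U, D, V, x):
    """y = (S + U D V^T) @ x without forming the (possibly dense) sum."""
    return S @ x + U @ (D @ (V.T @ x))

n, r = 1000, 3
rng = np.random.default_rng(1)
S = sp.random(n, n, density=5 / n, format="csr", random_state=1)
U, V = rng.standard_normal((n, r)), rng.standard_normal((n, r))
D = np.diag(rng.standard_normal(r))
x = rng.standard_normal(n)

# k repeated applications give A^k x using only SpMVs and small dense products.
y = x.copy()
for _ in range(4):
    y = apply_sparse_plus_low_rank(S, U, D, V, y)
print(np.linalg.norm(y), np.linalg.norm(apply_laplacian_1d(x)))
```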


Reproducible Floating Point Computation
• Goal: get a bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm at GNS-MBH
  – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe the results
  – Willing to sacrifice 40%-50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What? How will I debug without reproducibility?"
  – Few: "I know better and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility
Experiment: vector size 1e6, data aligned to 16-byte boundaries; for each input vector, dot products are computed using 1, 2, 3, or 4 threads.
• Absolute error = maximum - minimum
• Relative error = absolute error / maximum absolute value
[Figure: absolute error for random vectors (results of the same magnitude but opposite signs) and relative error for orthogonal vectors (even the sign is not reproducible).]

Goals/Approaches for Reproducibility
• Consider summation or dot product
• Goals
  1. Same answer, independent of layout, number of processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches
  – Guarantee a fixed reduction tree (does not meet goals 2 or 3)
  – Use (very) high precision to get the exact answer (does not meet goal 2)
  – Prerounding technique (Nguyen, D.)
A short demonstration of the nonassociativity problem, and of goal 1, follows.
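The root cause is that floating-point addition is not associative, so the summation order chosen by the runtime changes the bits of the result. The demonstration below shows three orderings disagreeing and one reproducible reference point (exact summation via math.fsum); it illustrates goal 1 only, not the prerounding algorithm, which is the approach that aims to meet all four goals.

```python
import math
import random

random.seed(42)
x = [random.uniform(-1, 1) * 10.0 ** random.randint(-8, 8) for _ in range(100_000)]

def tree_sum(v):
    """Pairwise (reduction-tree) summation, like a parallel reduce."""
    if len(v) == 1:
        return v[0]
    mid = len(v) // 2
    return tree_sum(v[:mid]) + tree_sum(v[mid:])

left_to_right = sum(x)
reversed_sum  = sum(reversed(x))
pairwise      = tree_sum(x)
exact         = math.fsum(x)          # correctly rounded, order-independent

print(f"left-to-right : {left_to_right:.17g}")
print(f"reversed      : {reversed_sum:.17g}")
print(f"pairwise tree : {pairwise:.17g}")
print(f"exact (fsum)  : {exact:.17g}")
# The first three generally differ in their last bits; fsum (like reproducible
# summation) returns the same bits regardless of ordering.
```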

Performance results on 1024 processors of a Cray XC30: 1.2x to 3.2x slowdown vs. the fastest code, for n = 1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic...

Time to redesign all linear algebra, n-body, ... algorithms and software (and compilers)

Page 11: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

Can we attain these lower bounds

bull Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these boundsndash Often not

bull If not are there other algorithms that dondash Yes for much of dense linear algebrandash New algorithms with new numerical properties

new ways to encode answers new data structures

ndash Not just loop transformations (need those too)bull Only a few sparse algorithms so farbull Lots of work in progress

ndash Algorithms Energy Heterogeneous Processors hellip11

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

25D Matrix Multiplication

bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid

c

(Pc)12

(Pc)12

Example P = 32 c = 2

25D Matrix Multiplication

bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid

k

j

iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12

(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)

(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)

(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)

25D Matmul on BGP 16K nodes 64K coresc = 16 copies

Distinguished Paper Award EuroParrsquo11 (Solomonik D)SCrsquo11 paper by Solomonik Bhatele D

12x faster

27x faster

Perfect Strong Scaling ndash in Time and Energy (12)

bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2

bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model

ndash γT βT αT = secs per flop per word_moved per message of size m

bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ]

= T(P)cbull Notation for energy model

ndash γE βE αE = joules for same operations

ndash δE = joules per word of memory used per sec

ndash εE = joules per sec for leakage etc

bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP)

= E(P)bull Perfect scaling extends to N-body Strassen hellip

Perfect Strong Scaling ndash in Time and Energy (22)

bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ] = T(P)c

bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP) = E(P)

bull Can use these formulas to answer many questions such asndash How to choose p and M to minimize energy E needed for computationndash Given max allowed runtime T what is minimum energy E needed to achieve

itndash Given max allowed energy E what is the minimum runtime T attainablendash Can we minimize the average power P = ETndash Given target energy efficiency what architectural parameters are needed to

achieve itbull Can we attain 75 GflopsWattbull Can we attain an exaflop for 20 MWatts

Handling Heterogeneity

bull Suppose each of P processors could differndash γi = secflop βi = secword αi = secmessage Mi = memory

bull What is optimal assignment of work Fi to minimize timendash Ti = Fi γi + Fi βi Mi

12 + Fi αi Mi32 = Fi [γi + βi Mi

12 + αi Mi32] = Fi ξi

ndash Choose Fi so Σi Fi = n3 and minimizing T = maxi Ti

ndash Answer Fi = n3(1ξi)Σj(1ξj) and T = n3Σj(1ξj)

bull Optimal Algorithm for nxn matmulndash Recursively divide into 8 half-sized subproblemsndash Assign subproblems to processor i to add up to Fi flops

bull Works for Strassen other algorithmshellip

Application to Tensor Contractions

bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

bull Heavily used in electronic structure calculationsndash Ex NWChem

bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries

ndash Solomonik Hammond Matthews

C(ijk) = Σm A(ijm)B(mk)

A3-fold symm

B2-fold symm

C2-fold symm

Application to Tensor Contractions

bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn

bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews

Communication Lower Bounds for Strassen-like matmul algorithms

bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected

bull Extends up to M = n2 p2ω bull Extends to rectangular case multiply (mxn)(nxp) in q mults

ndash words_moved = Ω (flopsM^(logmpq -1))

bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz also in JACMbull Is the lower bound attainable

Classical O(n3) matmul

words_moved =Ω (M(nM12)3P)

Strassenrsquos O(nlg7) matmul

words_moved =Ω (M(nM12)lg7P)

Strassen-like O(nω) matmul

words_moved =Ω (M(nM12)ωP)

vs

Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory

Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory

CAPS If EnoughMemory and P 7 then BFS step else DFS step end if

Communication Avoiding Parallel Strassen (CAPS)

Best way to interleaveBFS and DFS is an tuning parameter

26

Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

Speedups 24-184(over previous Strassen-based algorithms)

Invited to appear as Research Highlight in CACM

Strassen-like beyond matmul

bull Thm (D Dumitriu Holtzrsquo07) Any Strassen-like O(nω) matmul algorithm can be used to build a numerically stable O(nω+η) algorithm for any ηgt0 for Ax=b least squares eig SVD hellipndash ηgt0 needed to deal with numerical stabilityndash Strassen already stable so η=0

bull Thm For sequential versions of these algorithms Words_moved = O(nω+ηM(ω+η)2 ndash 1 + n2 log n) ie attain expected lower bound

Ballard D Holtz Schwartz

Cache and Network Oblivious Algorithms

bull Motivation Minimizes communication at every level of a hierarchical system without tuning parameters (in theory)ndash Not always 25D Matmul on BGP was topology aware

bull CAPS Divide-and-conquer choose BFS or DFS to adapt to processors available memory

bull CARMAndash Divide-and-conquer classical matmul divide largest of 3

dimensions to create two subproblemsndash Choose BFS or DFS to adapt to processors available memory

CARMA Performance Distributed Memory

Square m = k = n = 6144

ScaLAPACK

CARMA

Peak

(log)

(log)

Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA

CARMA Performance Distributed Memory

Inner Product m = n = 192 k = 6291456

ScaLAPACK

CARMAPeak

(log)

(log)

Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA

CARMA Performance Shared Memory

Square m = k = n

MKL (double)CARMA (double)

MKL (single)CARMA (single)

Peak (single)

Peak (double)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

CARMA Performance Shared Memory

Inner Product m = n = 64

MKL (double)

CARMA (double)

MKL (single)

CARMA (single)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

Why is CARMA Faster in Shared MemoryL3 Cache Misses

Shared Memory Inner Product (m = n = 64 k = 524288)

97 Fewer Misses

86 Fewer Misses

(linear)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

One-sided Factorizations (LU QR) so farbull Classical Approach for i=1 to n update column i update trailing matrixbull words_moved = O(n3)

35

bull Blocked Approach (LAPACK) for i=1 to nb update block i of b columns update trailing matrixbull words moved = O(n3M13)

bull Recursive Approach func factor(A) if A has 1 column update it

else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)

bull None of these approaches minimizes messagesbull Parallel case Partial

Pivoting =gt n reductionsbull Need another idea

TSQR An Architecture-Dependent Algorithm

W =

W0

W1

W2

W3

R00

R10

R20

R30

R01

R11

R02Parallel

W =

W0

W1

W2

W3

R01R02

R00

R03

SequentialStreaming

W =

W0

W1

W2

W3

R00

R01

R01

R11

R02

R11

R03

Dual Core

Can choose reduction tree dynamically

Multicore Multisocket Multirack Multisite Out-of-core

Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo

Wnxb =

W1

W2

W3

W4

P1middotL1middotU1

P2middotL2middotU2

P3middotL3middotU3

P4middotL4middotU4

=

Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo

W1rsquoW2rsquoW3rsquoW4rsquo

P12middotL12middotU12

P34middotL34middotU34

=Choose b pivot rows call them W12rsquo

Choose b pivot rows call them W34rsquo

W12rsquoW34rsquo

= P1234middotL1234middotU1234

Choose b pivot rows

Go back to W and use these b pivot rows (move them to top do LU without pivoting)

37

Minimizing Communication in TSLU

W = W1

W2

W3

W4

LULULULU

LU

LULUParallel

W = W1

W2

W3

W4

LULU

LU

LUSequentialStreaming

W = W1

W2

W3

W4

LULU LU

LULU

LULU

Dual Core

Can choose reduction tree dynamically to match architecture as before

38

Making TSLU Numerically Stable

bull Details matterndash Going up the tree we could do LU either on original rows of A

(tournament pivoting) or computed rows of Undash Only tournament pivoting stable

bull ldquoThmrdquo New scheme as stable as Partial Pivoting (GEPP) in following sense Get same Schur complements as GEPP applied to different input matrix whose entries are blocks taken from input A

bull Why just a ldquoThmrdquo

39

Stability of LU using TSLU CALU

Summer School Lecture 4 40

bull Empirical testingndash Both random matrices and ldquospecial onesrdquondash Both binary tree (BCALU) and flat-tree (FCALU)ndash 3 metrics ||PA-LU||||A|| normwise and componentwise backward errorsndash See [D Grigori Xiang 2010] for details

Why is stability of TSLU just a ldquoThmrdquo

bull Proof is correct ndash in exact arithmeticbull Experiment

ndash Generate 100 random 6x6 rank 3 matrices in Matlabndash [LUP] = lu(A) do LU without pivoting on PA compare L factors are

they the samebull Compute || L ndash Lnp || A few 0rsquos A few infinrsquos a few NaNsbull Rest mostly O(1)

ndash Why Floating point is nonassociative doing arithmetic in different order gives different rounding errors

ndash Same experiment with rank 6 matrices || L ndash Lnp || usually nonzero O(macheps)

ndash Same experiment with 20x20 rank 4 matrices || L ndash Lnp || often O(103)

bull Much harder to break TSLU but possiblendash Occurred when using TSLU to factorize a low-rank subdiagonal

panel in symmetric-indefinite factorization41

Fixing TSLU

bull Run TSLU quickly test for stability fix if necessary (rare)

bull Test conditioning of U if not tiny (usual case) proceed elsebull Compute || L || if not big (usual case) proceed elsebull Factor A = QR using TSQR thenbull Factor Q = PLU using TSLU thenbull A = PL(UR) with UR as upper triangular factor

bull Last topic in lecture how to guarantee floating point reproducibility

42

2D CALU with Tournament Pivoting

43

25D CALU with Tournament Pivoting (c=4 copies)

44

Exascale Machine ParametersSource DOE Exascale Workshop

bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log

2 (

n2p

) =

log

2 (m

emo

ry_p

er_p

roc)

Up to 29x

25D vs 2D LUWith and Without Pivoting

Other CA algorithms for Ax=b least squares(13)

bull A symmetric and indefinitendash Seek factorization that retains symmetry PAPT = LDLT D

ldquosimplerdquobull Save frac12 flops preserve inertia

ndash Usual approach Bunch-Kaufmanbull D block diagonal with 1x1 and 2x2 blocksbull Pivot search down column along row (lots of communication)

ndash Alternative Aasenbull D = tridiagonal = Tbull Two steps

ndash PAPT = LTLT where T is banded using TSLU

48

0 0

0

0 0

0

0

hellip

hellip

ndash Solvefactor narrow band problem with Tbull Up to 28x faster than MKL Best Paper at IPDPSrsquo13

Other CA algorithms for Ax=b least squares (23)bull Minimizing bandwidth and latency for sequential GEPP

ndash So far could not do partial pivoting and minimize messages just words

ndash Challengebull Column layout good for choosing pivots bad for matmulbull Blocked layout good for matmul bad for choosing pivots

ndash Solution use both layouts switching between thembull ldquoShape Morphing LUrdquo or SMLU

49

bull func factor(A) if A has 1 column update it else factor(left half of A)

update right half of A

factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M)

bull func factor(A) if A has 1 column update it else factor(left half of A) reshape to recursive block format update right half of A reshape to columnwise format factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M32)

Other CA algorithms for Ax=b least squares (33)bull Need for pivoting arises beyond LU in QR

ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)

ndash Usual approach like Partial Pivoting

bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound

ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two

groups of b columns either using usual approach or something better (GuEisenstat)

bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix

ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 50

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What about sparse matrices (13)

bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then

bull But canrsquot reorder outer loop for 25D need another idea

bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring

bull Kleenersquos Algorithm

52

for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)

D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21

Performance of 25D APSP using Kleene

53

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)

62xspeedup

2x speedup

What about sparse matrices (23)

bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of

G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix

bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering

for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid

bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse

multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on

separators) 54

What about sparse matrices (33)

bull If matrix stays very sparse lower bound unattainable new one

bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of

data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)

bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)

Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely

bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices

along dimensions most likely to minimize cost

55

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Symmetric Eigenproblem and SVD

bull Usual approach for A=AT (SVD similar)ndash A QTAQ = T where Q orthogonal T tridiagonalndash T UTTU = Λ where U orthogonal Λ diagonalndash QUrsquos columns are eigenvectors Λ eigenvaluesndash Dense Tridiagonal Diagonalndash Only half BLAS3 half BLAS2 in LAPACKrsquos sytrd

bull Communication-Avoiding Approachndash A QAQT = B where B=BT banded of bandwidth M12

ndash Continue as above starting with Bndash Dense Banded Tridiagonal Diagonalndash Dense Banded use TSQR to zero out M12 colsrows at a timendash Banded Tridiagonal need new(ish) idea

b+1

b+1

Successive Band Reduction (BischofLangSun)

1

b+1

b+1

d+1

c

Successive Band Reduction (BischofLangSun)

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1Q1

b+1

b+1

d+1

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

12

Q1

b+1

b+1

d+1

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

12

Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

Q1

Q1T

Q2

Q2T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

Q1

Q1T

Q2

Q2T

Q3

Q3T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

6

6

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

Conventional vs CA - SBR

Conventional Communication-Avoiding

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Nonsymmetric Eigenproblem

bull No apparent way to modify standard algorithmbull Instead Spectral Divide-and-Conquer

ndash Find orthogonal matrix Q whose leading columns span an invariant subspace of A

ndash QTAQ will be block upper triangular

ndash Apply recursively to A11 A22

ndash Depends on randomization1 Randomized Rank Revealing QR decomposition2 Randomized location to try splitting spectrum

A11 A12

ε A22

Attaining the Lower bounds SequentialLegend[Existing][Ours][Math-Lib][Random]

Two Levels Memory Hierarchy

Words Messages Words Messages

BLAS-3 [FLPRrsquo99][BDLSTrsquo13][MKL etc] [FLPRrsquo99][BDLSTrsquo13][MKL etc]

Cholesky[Grsquo97][APrsquo00]

[LAPACK][BDHSrsquo09]

[Grsquo97][APrsquo00][BDHSrsquo09] [Grsquo97][APrsquo00][BDHSrsquo09]

Sym Indefinite [BBDDDPSTYrsquo13] [BBDDDPSTYrsquo13]

LU[Grsquo97][Trsquo97]

[GDXrsquo11][BDLSTrsquo13]

[GDXrsquo11][BDLSTrsquo13]

[Grsquo97][Trsquo97] [BDLSTrsquo13] [BDLSTrsquo13]

QR[EGrsquo98][FWrsquo03]

[DGHLrsquo12][BDLSTrsquo13]

[FWrsquo03][DGHLrsquo12][BDLSTrsquo13]

[EGrsquo98][FWrsquo03][BDLSTrsquo13]

[FWrsquo03][BDLSTrsquo13]

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13] [BDDrsquo11]

Non Sym Eig [BDDrsquo11] [BDDrsquo11]

Legend[Existing][Ours][Math-Lib][Random]

Words (BW) Messages (L) Saving factor

BLAS-3 [AGZrsquo94][MTrsquo99][ScaLAPACK][Crsquo69][vGWrsquo97][SDrsquo11] L nP12

Cholesky [ScaLAPACK][Trsquo99][SDrsquo11] L nP12

Sym Indefinite [BBDDDPSTYrsquo13][ScaLAPACK] [BBDDDPSTYrsquo13] L nP12

LU [ScaLAPACK][GDXrsquo11][Trsquo99][SDrsquo11] [GDXrsquo11][Trsquo99][SDrsquo11] L nP12

QR [ScaLAPACK][DGHLrsquo12] [Trsquo99] [DGHLrsquo12][Trsquo99] L nP12

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13][ScaLAPACK] [BDDrsquo11][BDKrsquo13] L nP12

Non-Sym Eig [BDDrsquo11] [BDDrsquo11] BW P12 L n

Attaining with extra memory 25D M=(cn2P)

Attaining the Lower bounds Parallel 2DM=(n2P)(Ignoring poly-log(P) factors words = ( n2 P12) messages = (P12)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Conjugate Gradients (CG) GMRES Lanczos Arnoldi hellip bull Goal minimize communication

ndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computationndash Challenges Poor partitioning Preconditioning Num Stability

75

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Example The Difficulty of Tuning SpMV

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

77

Example The Difficulty of Tuning

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

bull 8x8 dense substructure exploit this to limit mem_refs

78

Speedups on Itanium 2 The Need for Search

Reference

Best 4x2

Mflops

Mflops

79

Register Profile Itanium 2

190 Mflops

1190 Mflops

80

Register Profiles IBM and Intel IA-64Power3 - 17 Power4 - 16

Itanium 2 - 33Itanium 1 - 8

252 Mflops

122 Mflops

820 Mflops

459 Mflops

247 Mflops

107 Mflops

12 Gflops

190 Mflops

Another example of tuning challenges for SpMV

bull Ex11 matrix (fluid flow)

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

82

Zoom in to top corner

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

83

3x3 blocks look natural buthellip

bull Example 3x3 blockingndash Logical grid of 3x3 cells

bull But would lead to lots of ldquofill-inrdquo

84

Extra Work Can Improve Efficiency

bull Example 3x3 blockingndash Logical grid of 3x3 cellsndash Fill-in explicit zerosndash Unroll 3x3 block multipliesndash ldquoFill ratiordquo = 15

bull On Pentium III 15x speedup

ndash Actual mflop rate 152 = 225 higher

85

Source Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal

Summer School Lecture 7

87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

Before Green + RedAfter Green + Blue

Summer School Lecture 7

892x speedups on Pentium 4 Power 4 hellip

Summary of Other Performance Optimizations

bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip

bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR

bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip

90

Optimized Sparse Kernel Interface - OSKI

bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning

bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski

bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression

software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki

91

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

93

Example Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in

each iteration

via CA Matrix Powers Kernel

Global reduction to compute G

94

Example CA-Conjugate Gradient

Local computations within inner loop require

no communication

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

96

Slower convergence due

to roundoff

Loss of accuracy due to roundoff

At s = 16 monomial basis is rank deficient Method breaks down

Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400

CA-CG (monomial)CG

machine precision

97

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit

bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square

bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank

matrices

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

101

bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010

ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver

demanded by customers (construction engineers) otherwise they donrsquot believe results

ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it

ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection

Reproducible Floating Point Computation

Absolute Error for Random Vectors

Same magnitude opposite signs

Intel MKL non-reproducibility

Relative Error for Orthogonal vectors

Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value

Sign notreproducible

103

bull Consider summation or dot productbull Goals

1 Same answer independent of layout processors order of summands

2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy

bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)

GoalsApproaches for Reproducibility

104

Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M

Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya

Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger

bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David

Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang

bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou

Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National

Instruments NEC Nokia NVIDIA Samsung Oracle

bull bebopcsberkeleyedu

Summary

Donrsquot Communichellip

106

Time to redesign all linear algebra n-body hellip algorithms and software

(and compilers)

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all ldquon3-likerdquo linear algebra
  • Lower bound for all ldquon3-likerdquo linear algebra (2)
  • Lower bound for all ldquon3-likerdquo linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a ldquoThmrdquo
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a ldquosparse matrixrdquo
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary
Page 12: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

25D Matrix Multiplication

bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid

c

(Pc)12

(Pc)12

Example P = 32 c = 2

25D Matrix Multiplication

bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid

k

j

iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12

(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)

(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)

(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)

25D Matmul on BGP 16K nodes 64K coresc = 16 copies

Distinguished Paper Award EuroParrsquo11 (Solomonik D)SCrsquo11 paper by Solomonik Bhatele D

12x faster

27x faster

Perfect Strong Scaling ndash in Time and Energy (12)

bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2

bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model

ndash γT βT αT = secs per flop per word_moved per message of size m

bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ]

= T(P)cbull Notation for energy model

ndash γE βE αE = joules for same operations

ndash δE = joules per word of memory used per sec

ndash εE = joules per sec for leakage etc

bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP)

= E(P)bull Perfect scaling extends to N-body Strassen hellip

Perfect Strong Scaling ndash in Time and Energy (22)

bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ] = T(P)c

bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP) = E(P)

bull Can use these formulas to answer many questions such asndash How to choose p and M to minimize energy E needed for computationndash Given max allowed runtime T what is minimum energy E needed to achieve

itndash Given max allowed energy E what is the minimum runtime T attainablendash Can we minimize the average power P = ETndash Given target energy efficiency what architectural parameters are needed to

achieve itbull Can we attain 75 GflopsWattbull Can we attain an exaflop for 20 MWatts

Handling Heterogeneity

bull Suppose each of P processors could differndash γi = secflop βi = secword αi = secmessage Mi = memory

bull What is optimal assignment of work Fi to minimize timendash Ti = Fi γi + Fi βi Mi

12 + Fi αi Mi32 = Fi [γi + βi Mi

12 + αi Mi32] = Fi ξi

ndash Choose Fi so Σi Fi = n3 and minimizing T = maxi Ti

ndash Answer Fi = n3(1ξi)Σj(1ξj) and T = n3Σj(1ξj)

bull Optimal Algorithm for nxn matmulndash Recursively divide into 8 half-sized subproblemsndash Assign subproblems to processor i to add up to Fi flops

bull Works for Strassen other algorithmshellip

Application to Tensor Contractions

bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

bull Heavily used in electronic structure calculationsndash Ex NWChem

bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries

ndash Solomonik Hammond Matthews

C(ijk) = Σm A(ijm)B(mk)

A3-fold symm

B2-fold symm

C2-fold symm

Application to Tensor Contractions

bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn

bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews

Communication Lower Bounds for Strassen-like matmul algorithms

bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected

bull Extends up to M = n2 p2ω bull Extends to rectangular case multiply (mxn)(nxp) in q mults

ndash words_moved = Ω (flopsM^(logmpq -1))

bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz also in JACMbull Is the lower bound attainable

Classical O(n3) matmul

words_moved =Ω (M(nM12)3P)

Strassenrsquos O(nlg7) matmul

words_moved =Ω (M(nM12)lg7P)

Strassen-like O(nω) matmul

words_moved =Ω (M(nM12)ωP)

Communication Avoiding Parallel Strassen (CAPS)

Divide-and-conquer with two kinds of recursion steps:
• BFS step: runs all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory
• DFS step: runs all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory

CAPS: if EnoughMemory and P ≥ 7 then BFS step else DFS step end if

Best way to interleave BFS and DFS is a tuning parameter

Performance Benchmarking, Strong Scaling Plot: Franklin (Cray XT4), n = 94080

Speedups: 24%–184% (over previous Strassen-based algorithms)

Invited to appear as Research Highlight in CACM

Strassen-like beyond matmul

• Thm (D., Dumitriu, Holtz '07): Any Strassen-like O(n^ω) matmul algorithm can be used to build a numerically stable O(n^(ω+η)) algorithm, for any η > 0, for Ax=b, least squares, eig, SVD, …
  – η > 0 needed to deal with numerical stability
  – Strassen already stable, so η = 0
• Thm: For sequential versions of these algorithms, Words_moved = O( n^(ω+η)/M^((ω+η)/2 − 1) + n^2·log n ), i.e. they attain the expected lower bound

Ballard, D., Holtz, Schwartz

Cache and Network Oblivious Algorithms

• Motivation: minimize communication at every level of a hierarchical system, without tuning parameters (in theory)
  – Not always: 2.5D Matmul on BG/P was topology aware
• CAPS: divide-and-conquer, choose BFS or DFS to adapt to #processors, available memory
• CARMA
  – Divide-and-conquer classical matmul: divide the largest of the 3 dimensions to create two subproblems (a sequential sketch of this recursion follows below)
  – Choose BFS or DFS to adapt to #processors, available memory
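The following is a sequential Python sketch of the CARMA recursion (always split the largest of m, k, n). The parallel BFS/DFS choice and the data layout are omitted, and the base-case threshold is arbitrary, so this is only an illustration of the recursive structure, not the CARMA implementation itself.

```python
import numpy as np

def carma_like(A, B, C, threshold=64):
    """C += A @ B, always splitting the largest of the three dimensions in half."""
    m, k = A.shape
    _, n = B.shape
    if max(m, k, n) <= threshold:                      # small enough: one BLAS call
        C += A @ B
        return
    if m >= k and m >= n:                              # split m: independent halves of C
        carma_like(A[:m // 2], B, C[:m // 2], threshold)
        carma_like(A[m // 2:], B, C[m // 2:], threshold)
    elif n >= k:                                       # split n: independent halves of C
        carma_like(A, B[:, :n // 2], C[:, :n // 2], threshold)
        carma_like(A, B[:, n // 2:], C[:, n // 2:], threshold)
    else:                                              # split k: both halves update all of C
        carma_like(A[:, :k // 2], B[:k // 2], C, threshold)
        carma_like(A[:, k // 2:], B[k // 2:], C, threshold)

A, B = np.random.rand(300, 70), np.random.rand(70, 500)
C = np.zeros((300, 500))
carma_like(A, B, C)
assert np.allclose(C, A @ B)
```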

CARMA Performance: Distributed Memory
[Plot: square case, m = k = n = 6144; CARMA vs ScaLAPACK vs peak, log-log axes; Cray XE6 (Hopper), each node 2 x 12-core, 4 x NUMA.]

CARMA Performance: Distributed Memory
[Plot: inner-product-shaped case, m = n = 192, k = 6291456; CARMA vs ScaLAPACK vs peak, log-log axes; Cray XE6 (Hopper), each node 2 x 12-core, 4 x NUMA.]

CARMA Performance: Shared Memory
[Plot: square case, m = k = n; MKL and CARMA in single and double precision vs peak; Intel Emerald: 4 Intel Xeon X7560 x 8 cores, 4 x NUMA.]

CARMA Performance: Shared Memory
[Plot: inner-product-shaped case, m = n = 64; MKL and CARMA in single and double precision; Intel Emerald: 4 Intel Xeon X7560 x 8 cores, 4 x NUMA.]

Why is CARMA Faster in Shared Memory? L3 Cache Misses
[Plot: shared-memory inner product, m = n = 64, k = 524288; CARMA incurs 97% and 86% fewer L3 misses than MKL in the two precisions.]

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

One-sided Factorizations (LU, QR), so far

• Classical Approach:
    for i = 1 to n
      update column i
      update trailing matrix
  #words_moved = O(n^3)
• Blocked Approach (LAPACK):
    for i = 1 to n/b
      update block i of b columns
      update trailing matrix
  #words_moved = O(n^3/M^(1/3))
• Recursive Approach:
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        update right half of A
        factor(right half of A)
  #words_moved = O(n^3/M^(1/2))
• None of these approaches minimizes #messages
• Parallel case: partial pivoting ⇒ n reductions
• Need another idea

TSQR: An Architecture-Dependent Algorithm

[Figure: W = [W0; W1; W2; W3] is factored by local QRs whose R factors are combined up a reduction tree —
  – Parallel: binary tree (R00, R10, R20, R30 at the leaves; R01, R11 at the next level; R02 at the root).
  – Sequential/streaming: flat tree (R00, R01, R02, R03 accumulated in order).
  – Dual core: hybrid tree mixing the two.]

Can choose reduction tree dynamically

Multicore / Multisocket / Multirack / Multisite / Out-of-core
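A serial Python sketch of TSQR's binary reduction tree over four block rows, using dense local QRs; a real implementation would run the leaves on separate processors and would also assemble (or implicitly represent) the Q factor.

```python
import numpy as np

def tsqr_R(blocks):
    """R factor of the tall-skinny matrix formed by stacking `blocks`."""
    Rs = [np.linalg.qr(W, mode='r') for W in blocks]          # local QR of each W_i
    while len(Rs) > 1:                                        # binary reduction tree
        Rs = [np.linalg.qr(np.vstack(Rs[i:i + 2]), mode='r')
              for i in range(0, len(Rs), 2)]
    return Rs[0]

W = np.random.rand(4000, 50)
blocks = np.split(W, 4)                    # W0..W3, one per simulated processor
R = tsqr_R(blocks)
assert np.allclose(R.T @ R, W.T @ W)       # R is a valid R factor of W
```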

Back to LU: Using similar idea for TSLU as TSQR: use a reduction tree to do "Tournament Pivoting"

W (n x b) = [W1; W2; W3; W4]
• Factor each block: W1 = P1·L1·U1, W2 = P2·L2·U2, W3 = P3·L3·U3, W4 = P4·L4·U4
• Choose b pivot rows of W1 (call them W1'), of W2 (W2'), of W3 (W3'), of W4 (W4')
• Factor [W1'; W2'] = P12·L12·U12 and [W3'; W4'] = P34·L34·U34; choose b pivot rows from each, call them W12' and W34'
• Factor [W12'; W34'] = P1234·L1234·U1234; choose b pivot rows
• Go back to W and use these b pivot rows (move them to the top, do LU without pivoting)
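A serial Python sketch of one such tournament over four row blocks: each block proposes b pivot rows via ordinary partial pivoting (GEPP), and the winners play off up a binary tree. The helper gepp_pivot_rows is purely illustrative and not from any library.

```python
import numpy as np

def gepp_pivot_rows(block_rows, A, b):
    """Global indices of the b rows that partial pivoting picks on A[block_rows]."""
    M = A[block_rows].copy()
    rows = list(block_rows)
    for k in range(b):
        p = k + int(np.argmax(np.abs(M[k:, k])))      # largest entry in column k
        M[[k, p]] = M[[p, k]]
        rows[k], rows[p] = rows[p], rows[k]
        M[k + 1:, k:] -= np.outer(M[k + 1:, k] / M[k, k], M[k, k:])   # eliminate
    return rows[:b]

n, b = 1024, 8
W = np.random.rand(n, b)                               # the tall panel
groups = np.array_split(np.arange(n), 4)               # row blocks W1..W4
winners = [gepp_pivot_rows(g, W, b) for g in groups]   # leaves of the tournament
while len(winners) > 1:                                # play off up the binary tree
    winners = [gepp_pivot_rows(np.concatenate(winners[i:i + 2]), W, b)
               for i in range(0, len(winners), 2)]
print("pivot rows chosen by the tournament:", winners[0])
# These b rows are moved to the top of W, then LU without pivoting finishes the panel.
```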

Minimizing Communication in TSLU

[Figure: the same reduction trees as for TSQR, with LU as the local factorization at each node —
  – Parallel: binary tree of local LUs on W1..W4.
  – Sequential/streaming: flat tree of LUs.
  – Dual core: hybrid tree.]

Can choose the reduction tree dynamically to match the architecture, as before

38

Making TSLU Numerically Stable

• Details matter
  – Going up the tree, we could do LU either on original rows of A (tournament pivoting) or on computed rows of U
  – Only tournament pivoting is stable
• "Thm": New scheme is as stable as Partial Pivoting (GEPP) in the following sense: we get the same Schur complements as GEPP applied to a different input matrix whose entries are blocks taken from the input A
• Why just a "Thm"?

39

Stability of LU using TSLU: CALU

• Empirical testing
  – Both random matrices and "special ones"
  – Both binary tree (BCALU) and flat-tree (FCALU)
  – 3 metrics: ||PA − LU|| / ||A||, normwise and componentwise backward errors
  – See [D., Grigori, Xiang, 2010] for details

Why is stability of TSLU just a "Thm"?

• Proof is correct – in exact arithmetic
• Experiment
  – Generate 100 random 6x6, rank-3 matrices in Matlab
  – [L,U,P] = lu(A), then do LU without pivoting on P·A and compare the L factors: are they the same?
    • Compute || L − Lnp ||: a few 0's, a few ∞'s, a few NaNs
    • Rest mostly O(1)
  – Why? Floating point is nonassociative; doing the arithmetic in a different order gives different rounding errors
  – Same experiment with rank-6 matrices: || L − Lnp || usually nonzero, O(macheps)
  – Same experiment with 20x20, rank-4 matrices: || L − Lnp || often O(10^3)
• Much harder to break TSLU, but possible
  – Occurred when using TSLU to factorize a low-rank subdiagonal panel in symmetric-indefinite factorization

Fixing TSLU

• Run TSLU, quickly test for stability, fix if necessary (rare)
• Test conditioning of U; if not tiny (usual case), proceed, else
• Compute ||L||; if not big (usual case), proceed, else
• Factor A = QR using TSQR, then
• Factor Q = PLU using TSLU, then
• A = P·L·(U·R) with U·R as the upper triangular factor
• Last topic in lecture: how to guarantee floating point reproducibility

42

2D CALU with Tournament Pivoting

43

25D CALU with Tournament Pivoting (c=4 copies)

44

Exascale Machine Parameters (Source: DOE Exascale Workshop)

• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node

Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU

[Heatmap: x-axis log2(p), y-axis log2(n^2/p) = log2(memory_per_proc); predicted speedups up to 29x.]

2.5D vs 2D LU, With and Without Pivoting

Other CA algorithms for Ax=b, least squares (1/3)

• A symmetric and indefinite
  – Seek factorization that retains symmetry: P·A·P^T = L·D·L^T, with D "simple"
    • Save half the flops, preserve inertia
  – Usual approach: Bunch-Kaufman
    • D block diagonal with 1x1 and 2x2 blocks
    • Pivot search down column, along row (lots of communication)
  – Alternative: Aasen
    • D = tridiagonal = T
    • Two steps:
      – P·A·P^T = L·T·L^T where T is banded, using TSLU [band-matrix sketch omitted]
      – Solve/factor the narrow band problem with T
    • Up to 2.8x faster than MKL; Best Paper at IPDPS'13

Other CA algorithms for Ax=b, least squares (2/3)

• Minimizing bandwidth and latency for sequential GEPP
  – So far, could not do partial pivoting and minimize #messages, just #words
  – Challenge:
    • Column layout good for choosing pivots, bad for matmul
    • Blocked layout good for matmul, bad for choosing pivots
  – Solution: use both layouts, switching between them: "Shape Morphing LU" or SMLU

• Recursive LU:
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        update right half of A
        factor(right half of A)
  #Words = O(n^3/M^(1/2)), #Messages = O(n^3/M)

• SMLU:
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        reshape to recursive block format
        update right half of A
        reshape to columnwise format
        factor(right half of A)
  #Words = O(n^3/M^(1/2)), #Messages = O(n^3/M^(3/2))

Other CA algorithms for Ax=b, least squares (3/3)

• Need for pivoting arises beyond LU, in QR
  – Choose permutation P so that leading columns of A·P = Q·R span the column space of A – Rank Revealing QR (RRQR)
  – Usual approach, like Partial Pivoting
    • Put longest column first, update rest of matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit the lower bound
  – Use Tournament Pivoting
    • Each round of the tournament selects the best b columns from two groups of b columns, either using the usual approach or something better (Gu/Eisenstat)
    • Thm: This approach "reveals the rank" of A, in the sense that the leading r x r submatrix of R has singular values "near" the largest r singular values of A; ditto for the trailing submatrix
  – Idea extends to other pivoting schemes
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDL^T with complete pivoting

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What about sparse matrices? (1/3)

• If the matrix quickly becomes dense, use a dense algorithm
• Ex: All Pairs Shortest Path using Floyd-Warshall
• Similar to matmul: let D = A, then

    for k = 1:n
      for i = 1:n
        for j = 1:n
          D(i,j) = min(D(i,j), D(i,k) + D(k,j))

• But can't reorder the outer loop for 2.5D; need another idea
• Abbreviate D(i,j) = min(D(i,j), min_k(A(i,k) + B(k,j))) by D = A⊛B
  – Dependencies ok, 2.5D works, just a different semiring
• Kleene's Algorithm (a runnable sketch follows below):

    D = DC-APSP(A, n)
      D = A
      Partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
      D11 = DC-APSP(D11, n/2)
      D12 = D11 ⊛ D12
      D21 = D21 ⊛ D11
      D22 = D21 ⊛ D12
      D22 = DC-APSP(D22, n/2)
      D21 = D22 ⊛ D21
      D12 = D12 ⊛ D22
      D11 = D12 ⊛ D21
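The following is a runnable Python sketch of the two algorithms above in the min-plus semiring, using the slide's convention that D = A⊛B also takes the min with the existing D(i,j). The graph and its sparsity are made up; only the recursion structure matches the pseudocode.

```python
import numpy as np

def semiring_update(D, A, B):
    """The slide's 'D = A (*) B': D[i,j] = min(D[i,j], min_k A[i,k] + B[k,j])."""
    return np.minimum(D, np.min(A[:, :, None] + B[None, :, :], axis=1))

def dc_apsp(D):
    n = len(D)
    if n == 1:
        return D
    h = n // 2
    D = D.copy()
    D11, D12 = D[:h, :h], D[:h, h:]
    D21, D22 = D[h:, :h], D[h:, h:]
    D11[:] = dc_apsp(D11)
    D12[:] = semiring_update(D12, D11, D12)
    D21[:] = semiring_update(D21, D21, D11)
    D22[:] = semiring_update(D22, D21, D12)
    D22[:] = dc_apsp(D22)
    D21[:] = semiring_update(D21, D22, D21)
    D12[:] = semiring_update(D12, D12, D22)
    D11[:] = semiring_update(D11, D12, D21)
    return D

def floyd_warshall(A):
    D = A.copy()
    for k in range(len(D)):
        D = np.minimum(D, D[:, k:k + 1] + D[k:k + 1, :])
    return D

rng = np.random.default_rng(0)
n = 32
A = np.where(rng.random((n, n)) < 0.3, rng.random((n, n)), np.inf)   # random weighted graph
np.fill_diagonal(A, 0.0)
assert np.allclose(dc_apsp(A), floyd_warshall(A))
```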

Performance of 2.5D APSP using Kleene

[Plot: strong scaling on Hopper (Cray XE6 with 1024 nodes = 24,576 cores); annotated speedups of 6.2x and 2x.]

What about sparse matrices? (2/3)

• If parts of the matrix become dense, optimize those
• Ex: Cholesky on a matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): If all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on a w x w submatrix
• Thm: Words_moved = Ω(w^3/M^(1/2)), etc.
• Thm (George '73): Nested dissection gives an optimal ordering for 2D grid, 3D grid, similar matrices
  – w = n for a 2D n x n grid, w = n^2 for a 3D n x n x n grid
• Sequential multifrontal Cholesky attains the bounds
• PSPASES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)

What about sparse matrices? (3/3)

• If the matrix stays very sparse, the lower bound is unattainable; need a new one
• Ex: A, B both diagonal: no communication in the parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), i.i.d.
• Assumption: Algorithm is sparsity-independent: assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
    Words_moved = Ω( min( d·n/P^(1/2), d^2·n/P ) )
  – Proof exploits the fact that reuse of entries of C = A·B is unlikely
• Contrast general lower bound: Words_moved = Ω( d^2·n/(P·M^(1/2)) )
• Attained by a divide-and-conquer algorithm that splits matrices along the dimensions most likely to minimize cost

55

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Symmetric Eigenproblem and SVD

• Usual approach for A = A^T (SVD similar):
  – A → Q^T·A·Q = T where Q orthogonal, T tridiagonal
  – T → U^T·T·U = Λ where U orthogonal, Λ diagonal
  – (QU)'s columns are eigenvectors, Λ holds the eigenvalues
  – Dense → Tridiagonal → Diagonal
  – Only half BLAS3, half BLAS2, in LAPACK's sytrd
• Communication-Avoiding Approach:
  – A → Q·A·Q^T = B where B = B^T banded, of bandwidth M^(1/2)
  – Continue as above, starting with B
  – Dense → Banded → Tridiagonal → Diagonal
  – Dense → Banded: use TSQR to zero out M^(1/2) cols/rows at a time
  – Banded → Tridiagonal: need a new(ish) idea

Successive Band Reduction (Bischof/Lang/Sun)

[Figure sequence: a band of bandwidth b+1 is reduced by applying orthogonal transformations Q1, Q1^T, Q2, Q2^T, … from both sides; each QR on a c-column parallelogram annihilates d diagonals and creates a (d+c) x (d+c) bulge that is chased down the band in steps 1, 2, 3, 4, 5, 6, … Legend: b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b.]

Conventional vs CA-SBR

• Conventional: touch all data 4 times
• Communication-Avoiding: touch all data once

Speedups of Sym. Band Reduction vs DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem

• No apparent way to modify the standard algorithm
• Instead: Spectral Divide-and-Conquer
  – Find an orthogonal matrix Q whose leading columns span an invariant subspace of A
  – Q^T·A·Q will be block upper triangular:

        Q^T·A·Q = [ A11  A12 ]
                  [  ε   A22 ]

  – Apply recursively to A11, A22
  – Depends on randomization:
    1. Randomized Rank Revealing QR decomposition
    2. Randomized location to try splitting the spectrum

Attaining the Lower Bounds: Sequential
Legend: [Existing], [Ours], [Math-Lib], [Random]

[Table: which algorithms/papers attain the #words and #messages lower bounds, for two-level and hierarchical memory —
• BLAS-3: [FLPR'99], [BDLST'13], [MKL etc.]
• Cholesky: [G'97], [AP'00], [LAPACK], [BDHS'09]
• Sym. Indefinite: [BBDDDPSTY'13]
• LU: [G'97], [T'97], [GDX'11], [BDLST'13]
• QR: [EG'98], [FW'03], [DGHL'12], [BDLST'13]
• Rank Revealing QR: [BDD'11], [DGGX'13]
• Sym. Eig & SVD: [BDD'11], [BDK'13]
• Non-Sym. Eig: [BDD'11] ]

Attaining the Lower Bounds: Parallel 2D, M = Θ(n^2/P)
(Ignoring poly-log(P) factors; #words = Ω(n^2/P^(1/2)), #messages = Ω(P^(1/2)))
Legend: [Existing], [Ours], [Math-Lib], [Random]

[Table: which algorithms/papers attain the #words (BW) and #messages (L) lower bounds, and the saving factor attainable with extra memory (2.5D, M = Θ(c·n^2/P)) —
• BLAS-3: [AGZ'94][MT'99][ScaLAPACK], [C'69][vGW'97][SD'11] — saving factor L: n/P^(1/2)
• Cholesky: [ScaLAPACK][T'99][SD'11] — L: n/P^(1/2)
• Sym. Indefinite: [BBDDDPSTY'13][ScaLAPACK], [BBDDDPSTY'13] — L: n/P^(1/2)
• LU: [ScaLAPACK][GDX'11][T'99][SD'11], [GDX'11][T'99][SD'11] — L: n/P^(1/2)
• QR: [ScaLAPACK][DGHL'12][T'99], [DGHL'12][T'99] — L: n/P^(1/2)
• Rank Revealing QR: [BDD'11][DGGX'13]
• Sym. Eig & SVD: [BDD'11][BDK'13][ScaLAPACK], [BDD'11][BDK'13] — L: n/P^(1/2)
• Non-Sym. Eig: [BDD'11], [BDD'11] — BW: P^(1/2), L: n ]

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and a starting vector
  – Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors
    • Conventional: O(k·log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation (a serial sketch of the "matrix powers" idea follows below)
  – Challenges: poor partitioning, preconditioning, numerical stability

75
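A serial Python sketch of the matrix powers idea for a tridiagonal (1D stencil) matrix: each simulated processor copies its owned rows plus k ghost layers once (the single "message"), then computes its rows of x, Ax, …, A^k x locally, paying some redundant flops near the block edges. The sizes are made up; real implementations handle general well-partitioned sparsity.

```python
import numpy as np
from scipy.sparse import diags

n, k, nprocs = 64, 4, 4
A = diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format='csr')
x = np.random.rand(n)

# Reference: k separate SpMVs (each would need neighbor communication)
V_ref = [x.copy()]
for _ in range(k):
    V_ref.append(A @ V_ref[-1])

V = np.zeros((k + 1, n))
chunk = n // nprocs
for p in range(nprocs):
    lo, hi = p * chunk, (p + 1) * chunk
    glo, ghi = max(0, lo - k), min(n, hi + k)   # owned rows + k ghost layers
    local = x[glo:ghi].copy()                   # the one "message" per processor
    A_loc = A[glo:ghi, glo:ghi]                 # locally available part of A
    V[0, lo:hi] = x[lo:hi]
    for j in range(1, k + 1):
        local = A_loc @ local                   # redundant flops near the block edges
        V[j, lo:hi] = local[lo - glo:hi - glo]  # the owned rows are still exact

for j in range(k + 1):
    assert np.allclose(V[j], V_ref[j])
```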

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV

• n = 21200, nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs

78

Speedups on Itanium 2: The Need for Search
[Plot: register-blocking profile; reference implementation vs best (4x2) blocking, in Mflops.]

Register Profile: Itanium 2
[Heatmap: performance ranges from 190 Mflops (reference) to 1190 Mflops (best blocking).]

Register Profiles: IBM and Intel IA-64
[Heatmaps, best vs reference: Power3 (17%; 252 vs 122 Mflops), Power4 (16%; 820 vs 459 Mflops), Itanium 1 (8%; 247 vs 107 Mflops), Itanium 2 (33%; 1.2 Gflops vs 190 Mflops).]

Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16614, NNZ = 1.1 M
• [Spy plots: full matrix, and zoom in to the top corner]

83

3x3 blocks look natural, but…

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

Extra Work Can Improve Efficiency!

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup!
  – Actual mflop rate is 1.5^2 = 2.25x higher

85
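A small sketch of the register-blocking trade-off using scipy's BSR (block sparse row) format: blocking stores explicit zeros (the "fill"), and the fill ratio is easy to measure. A random matrix is used as a stand-in, so its fill ratio is far worse than the 1.5 of the FEM matrix above; that is exactly the "blocks look natural, but…" point.

```python
import numpy as np
import scipy.sparse as sp

n = 300
A = sp.random(n, n, density=0.01, format='csr', random_state=0)   # stand-in matrix
A_bsr = A.tobsr(blocksize=(3, 3))       # 3x3 blocks; explicit zeros pad the blocks

fill_ratio = A_bsr.nnz / A.nnz          # BSR nnz counts stored values, incl. explicit zeros
print("fill ratio =", fill_ratio)       # ~1.5 for the FEM matrix above; much larger here

x = np.random.rand(n)
assert np.allclose(A @ x, A_bsr @ x)    # blocking changes storage, not the SpMV result
```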

[Spy plots: Accelerator Cavity Design Problem (Ko, via Husbands); a 100x100 submatrix along the diagonal; post-RCM reordering; and the effect of combined RCM+TSP reordering (before: green + red, after: green + blue). 2x speedups on Pentium 4, Power 4, …]

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·A^T·x, A^T·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later…

90

Optimized Sparse Kernel Interface - OSKI

• Provides sparse kernels automatically tuned for the user's matrix & machine
  – BLAS-style functionality: SpMV (Ax & A^T·y), TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as stand-alone library
  – Available as PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

91

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

93

Example: Classical Conjugate Gradient (CG)

[Algorithm listing omitted: in each iteration, the SpMV and the dot products require communication.]

Example: CA-Conjugate Gradient

[Algorithm listing omitted: the k SpMVs are replaced by one call to the CA matrix powers kernel, the dot products become a single global reduction that computes the Gram matrix G, and the local computations within the inner loop require no communication.]
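A tiny sketch of why the single Gram-matrix reduction suffices: once V = [x, Ax, …, A^s x] and G = V^T·V are in hand, any dot product of vectors expressed in the basis V becomes a small dense computation with G, with no further global reduction. The sizes and the test matrix below are made up.

```python
import numpy as np

n, s = 200, 4
A = np.diag(2.0 * np.ones(n)) + np.diag(-np.ones(n - 1), 1) + np.diag(-np.ones(n - 1), -1)
x = np.random.rand(n)

V = np.empty((n, s + 1))                # monomial basis [x, Ax, ..., A^s x]
V[:, 0] = x
for j in range(s):
    V[:, j + 1] = A @ V[:, j]           # produced by the matrix powers kernel

G = V.T @ V                             # ONE reduction, of size (s+1) x (s+1)

# Iterates are carried as short coefficient vectors in the basis V, so any dot
# product CG needs over the next s steps is a small dense computation with G:
a, b = np.random.rand(s + 1), np.random.rand(s + 1)
u, w = V @ a, V @ b                     # the length-n vectors classical CG would dot
assert np.isclose(u @ w, a @ G @ b)     # same value, no new global reduction
```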

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

96

Slower convergence due

to roundoff

Loss of accuracy due to roundoff

At s = 16 monomial basis is rank deficient Method breaks down

Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400

CA-CG (monomial)CG

machine precision

97

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?

• Requires o(n^2) data/indices to store
• Nonzero entries and indices could be explicit or implicit:

                          Indices explicit (O(nnz))   Indices implicit (o(nnz))
  Entries explicit:       CSR and variations          Vision, climate, AMR, …
  Entries implicit:       Graph Laplacian             Stencils

• Matrix could be a sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + U·D·V^T, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write A^k = (S + U·D·V^T)^k as a sum of S^k and low-rank matrices
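A small sketch of applying A = S + U·D·V^T to a vector without ever forming the dense matrix, which is the operation a Krylov method would repeat to build A^k·x. The sizes and matrices are made up.

```python
import numpy as np
import scipy.sparse as sp

n, r = 1000, 5
S = sp.random(n, n, density=0.005, format='csr', random_state=1)   # sparse part
U, Vt = np.random.rand(n, r), np.random.rand(r, n)                 # low-rank part
D = np.diag(np.random.rand(r))

def apply_A(x):
    # O(nnz(S) + n*r) work; repeated calls give A^k x for a Krylov method
    return S @ x + U @ (D @ (Vt @ x))

x = np.random.rand(n)
A_dense = S.toarray() + U @ D @ Vt
assert np.allclose(apply_A(x), A_dense @ x)
```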

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

101

Reproducible Floating Point Computation

• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers): otherwise they don't believe the results
  – Willing to sacrifice 40%-50% of performance for it
• Email to ~110 Berkeley CSE faculty, asking about it
  – Most: "What?! How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility

[Plots: absolute error for random vectors (same magnitude, opposite signs) and relative error for orthogonal vectors (sign not reproducible). Vector size 1e6, data aligned to 16-byte boundaries; for each input vector, dot products are computed using 1, 2, 3, or 4 threads; absolute error = maximum − minimum; relative error = absolute error / maximum absolute value.]

103

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches
  – Guarantee a fixed reduction tree (not 2 or 3)
  – Use (very) high precision to get the exact answer (not 2)
  – Prerounding technique (Nguyen, D.) — a tiny demonstration of the underlying nonassociativity problem follows below
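The following tiny Python demonstration simulates the same left-to-right data order reduced by different numbers of "processors": the computed sum generally changes, because floating-point addition is not associative. Fixing the reduction tree independently of p (approach 1) or using the prerounding technique restores bit-wise reproducibility.

```python
import random

def chunked_sum(x, p):
    """Simulate a p-processor reduction: p local left-to-right sums, then combine."""
    chunk = (len(x) + p - 1) // p
    partials = [sum(x[i:i + chunk]) for i in range(0, len(x), chunk)]
    return sum(partials)

random.seed(0)
data = [random.uniform(-1, 1) * 10 ** random.randint(-8, 8) for _ in range(10 ** 5)]

sums = {p: chunked_sum(data, p) for p in (1, 2, 3, 4, 8)}
print(sums)                      # same data, same order -- different bits per p
print(len(set(sums.values())))   # typically > 1: the answer depends on #processors
# A reduction tree fixed independently of p (or prerounding to make the additions
# exact) makes every run return identical bits, at some cost in performance.
```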

104

Performance results on 1024 proc. Cray XC30: 1.2x to 3.2x slowdown vs fastest code, for n = 1M

Collaborators and Supporters

• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)


25D Matrix Multiplication

bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid

c

(Pc)12

(Pc)12

Example P = 32 c = 2

25D Matrix Multiplication

bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid

k

j

iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12

(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)

(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)

(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)

25D Matmul on BGP 16K nodes 64K coresc = 16 copies

Distinguished Paper Award EuroParrsquo11 (Solomonik D)SCrsquo11 paper by Solomonik Bhatele D

12x faster

27x faster

Perfect Strong Scaling ndash in Time and Energy (12)

bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2

bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model

ndash γT βT αT = secs per flop per word_moved per message of size m

bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ]

= T(P)cbull Notation for energy model

ndash γE βE αE = joules for same operations

ndash δE = joules per word of memory used per sec

ndash εE = joules per sec for leakage etc

bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP)

= E(P)bull Perfect scaling extends to N-body Strassen hellip

Perfect Strong Scaling ndash in Time and Energy (22)

bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ] = T(P)c

bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP) = E(P)

bull Can use these formulas to answer many questions such asndash How to choose p and M to minimize energy E needed for computationndash Given max allowed runtime T what is minimum energy E needed to achieve

itndash Given max allowed energy E what is the minimum runtime T attainablendash Can we minimize the average power P = ETndash Given target energy efficiency what architectural parameters are needed to

achieve itbull Can we attain 75 GflopsWattbull Can we attain an exaflop for 20 MWatts

Handling Heterogeneity

bull Suppose each of P processors could differndash γi = secflop βi = secword αi = secmessage Mi = memory

bull What is optimal assignment of work Fi to minimize timendash Ti = Fi γi + Fi βi Mi

12 + Fi αi Mi32 = Fi [γi + βi Mi

12 + αi Mi32] = Fi ξi

ndash Choose Fi so Σi Fi = n3 and minimizing T = maxi Ti

ndash Answer Fi = n3(1ξi)Σj(1ξj) and T = n3Σj(1ξj)

bull Optimal Algorithm for nxn matmulndash Recursively divide into 8 half-sized subproblemsndash Assign subproblems to processor i to add up to Fi flops

bull Works for Strassen other algorithmshellip

Application to Tensor Contractions

bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

bull Heavily used in electronic structure calculationsndash Ex NWChem

bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries

ndash Solomonik Hammond Matthews

C(ijk) = Σm A(ijm)B(mk)

A3-fold symm

B2-fold symm

C2-fold symm

Application to Tensor Contractions

bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn

bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews

Communication Lower Bounds for Strassen-like matmul algorithms

bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected

bull Extends up to M = n2 p2ω bull Extends to rectangular case multiply (mxn)(nxp) in q mults

ndash words_moved = Ω (flopsM^(logmpq -1))

bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz also in JACMbull Is the lower bound attainable

Classical O(n3) matmul

words_moved =Ω (M(nM12)3P)

Strassenrsquos O(nlg7) matmul

words_moved =Ω (M(nM12)lg7P)

Strassen-like O(nω) matmul

words_moved =Ω (M(nM12)ωP)

vs

Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory

Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory

CAPS If EnoughMemory and P 7 then BFS step else DFS step end if

Communication Avoiding Parallel Strassen (CAPS)

Best way to interleaveBFS and DFS is an tuning parameter

26

Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

Speedups 24-184(over previous Strassen-based algorithms)

Invited to appear as Research Highlight in CACM

Strassen-like beyond matmul

bull Thm (D Dumitriu Holtzrsquo07) Any Strassen-like O(nω) matmul algorithm can be used to build a numerically stable O(nω+η) algorithm for any ηgt0 for Ax=b least squares eig SVD hellipndash ηgt0 needed to deal with numerical stabilityndash Strassen already stable so η=0

bull Thm For sequential versions of these algorithms Words_moved = O(nω+ηM(ω+η)2 ndash 1 + n2 log n) ie attain expected lower bound

Ballard D Holtz Schwartz

Cache and Network Oblivious Algorithms

bull Motivation Minimizes communication at every level of a hierarchical system without tuning parameters (in theory)ndash Not always 25D Matmul on BGP was topology aware

bull CAPS Divide-and-conquer choose BFS or DFS to adapt to processors available memory

bull CARMAndash Divide-and-conquer classical matmul divide largest of 3

dimensions to create two subproblemsndash Choose BFS or DFS to adapt to processors available memory

CARMA Performance Distributed Memory

Square m = k = n = 6144

ScaLAPACK

CARMA

Peak

(log)

(log)

Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA

CARMA Performance Distributed Memory

Inner Product m = n = 192 k = 6291456

ScaLAPACK

CARMAPeak

(log)

(log)

Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA

CARMA Performance Shared Memory

Square m = k = n

MKL (double)CARMA (double)

MKL (single)CARMA (single)

Peak (single)

Peak (double)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

CARMA Performance Shared Memory

Inner Product m = n = 64

MKL (double)

CARMA (double)

MKL (single)

CARMA (single)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

Why is CARMA Faster in Shared MemoryL3 Cache Misses

Shared Memory Inner Product (m = n = 64 k = 524288)

97 Fewer Misses

86 Fewer Misses

(linear)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

One-sided Factorizations (LU QR) so farbull Classical Approach for i=1 to n update column i update trailing matrixbull words_moved = O(n3)

35

bull Blocked Approach (LAPACK) for i=1 to nb update block i of b columns update trailing matrixbull words moved = O(n3M13)

bull Recursive Approach func factor(A) if A has 1 column update it

else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)

bull None of these approaches minimizes messagesbull Parallel case Partial

Pivoting =gt n reductionsbull Need another idea

TSQR An Architecture-Dependent Algorithm

W =

W0

W1

W2

W3

R00

R10

R20

R30

R01

R11

R02Parallel

W =

W0

W1

W2

W3

R01R02

R00

R03

SequentialStreaming

W =

W0

W1

W2

W3

R00

R01

R01

R11

R02

R11

R03

Dual Core

Can choose reduction tree dynamically

Multicore Multisocket Multirack Multisite Out-of-core

Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo

Wnxb =

W1

W2

W3

W4

P1middotL1middotU1

P2middotL2middotU2

P3middotL3middotU3

P4middotL4middotU4

=

Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo

W1rsquoW2rsquoW3rsquoW4rsquo

P12middotL12middotU12

P34middotL34middotU34

=Choose b pivot rows call them W12rsquo

Choose b pivot rows call them W34rsquo

W12rsquoW34rsquo

= P1234middotL1234middotU1234

Choose b pivot rows

Go back to W and use these b pivot rows (move them to top do LU without pivoting)

37

Minimizing Communication in TSLU

W = W1

W2

W3

W4

LULULULU

LU

LULUParallel

W = W1

W2

W3

W4

LULU

LU

LUSequentialStreaming

W = W1

W2

W3

W4

LULU LU

LULU

LULU

Dual Core

Can choose reduction tree dynamically to match architecture as before

38

Making TSLU Numerically Stable

bull Details matterndash Going up the tree we could do LU either on original rows of A

(tournament pivoting) or computed rows of Undash Only tournament pivoting stable

bull ldquoThmrdquo New scheme as stable as Partial Pivoting (GEPP) in following sense Get same Schur complements as GEPP applied to different input matrix whose entries are blocks taken from input A

bull Why just a ldquoThmrdquo

39

Stability of LU using TSLU CALU

Summer School Lecture 4 40

bull Empirical testingndash Both random matrices and ldquospecial onesrdquondash Both binary tree (BCALU) and flat-tree (FCALU)ndash 3 metrics ||PA-LU||||A|| normwise and componentwise backward errorsndash See [D Grigori Xiang 2010] for details

Why is stability of TSLU just a ldquoThmrdquo

bull Proof is correct ndash in exact arithmeticbull Experiment

ndash Generate 100 random 6x6 rank 3 matrices in Matlabndash [LUP] = lu(A) do LU without pivoting on PA compare L factors are

they the samebull Compute || L ndash Lnp || A few 0rsquos A few infinrsquos a few NaNsbull Rest mostly O(1)

ndash Why Floating point is nonassociative doing arithmetic in different order gives different rounding errors

ndash Same experiment with rank 6 matrices || L ndash Lnp || usually nonzero O(macheps)

ndash Same experiment with 20x20 rank 4 matrices || L ndash Lnp || often O(103)

bull Much harder to break TSLU but possiblendash Occurred when using TSLU to factorize a low-rank subdiagonal

panel in symmetric-indefinite factorization41

Fixing TSLU

bull Run TSLU quickly test for stability fix if necessary (rare)

bull Test conditioning of U if not tiny (usual case) proceed elsebull Compute || L || if not big (usual case) proceed elsebull Factor A = QR using TSQR thenbull Factor Q = PLU using TSLU thenbull A = PL(UR) with UR as upper triangular factor

bull Last topic in lecture how to guarantee floating point reproducibility

42

2D CALU with Tournament Pivoting

43

25D CALU with Tournament Pivoting (c=4 copies)

44

Exascale Machine ParametersSource DOE Exascale Workshop

bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log

2 (

n2p

) =

log

2 (m

emo

ry_p

er_p

roc)

Up to 29x

25D vs 2D LUWith and Without Pivoting

Other CA algorithms for Ax=b least squares(13)

bull A symmetric and indefinitendash Seek factorization that retains symmetry PAPT = LDLT D

ldquosimplerdquobull Save frac12 flops preserve inertia

ndash Usual approach Bunch-Kaufmanbull D block diagonal with 1x1 and 2x2 blocksbull Pivot search down column along row (lots of communication)

ndash Alternative Aasenbull D = tridiagonal = Tbull Two steps

ndash PAPT = LTLT where T is banded using TSLU

48

0 0

0

0 0

0

0

hellip

hellip

ndash Solvefactor narrow band problem with Tbull Up to 28x faster than MKL Best Paper at IPDPSrsquo13

Other CA algorithms for Ax=b least squares (23)bull Minimizing bandwidth and latency for sequential GEPP

ndash So far could not do partial pivoting and minimize messages just words

ndash Challengebull Column layout good for choosing pivots bad for matmulbull Blocked layout good for matmul bad for choosing pivots

ndash Solution use both layouts switching between thembull ldquoShape Morphing LUrdquo or SMLU

49

bull func factor(A) if A has 1 column update it else factor(left half of A)

update right half of A

factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M)

bull func factor(A) if A has 1 column update it else factor(left half of A) reshape to recursive block format update right half of A reshape to columnwise format factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M32)

Other CA algorithms for Ax=b least squares (33)bull Need for pivoting arises beyond LU in QR

ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)

ndash Usual approach like Partial Pivoting

bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound

ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two

groups of b columns either using usual approach or something better (GuEisenstat)

bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix

ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 50

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What about sparse matrices (13)

bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then

bull But canrsquot reorder outer loop for 25D need another idea

bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring

bull Kleenersquos Algorithm

52

for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)

D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21

Performance of 25D APSP using Kleene

53

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)

62xspeedup

2x speedup

What about sparse matrices (23)

bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of

G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix

bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering

for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid

bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse

multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on

separators) 54

What about sparse matrices (33)

bull If matrix stays very sparse lower bound unattainable new one

bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of

data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)

bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)

Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely

bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices

along dimensions most likely to minimize cost

55

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Symmetric Eigenproblem and SVD

bull Usual approach for A=AT (SVD similar)ndash A QTAQ = T where Q orthogonal T tridiagonalndash T UTTU = Λ where U orthogonal Λ diagonalndash QUrsquos columns are eigenvectors Λ eigenvaluesndash Dense Tridiagonal Diagonalndash Only half BLAS3 half BLAS2 in LAPACKrsquos sytrd

bull Communication-Avoiding Approachndash A QAQT = B where B=BT banded of bandwidth M12

ndash Continue as above starting with Bndash Dense Banded Tridiagonal Diagonalndash Dense Banded use TSQR to zero out M12 colsrows at a timendash Banded Tridiagonal need new(ish) idea

b+1

b+1

Successive Band Reduction (BischofLangSun)

1

b+1

b+1

d+1

c

Successive Band Reduction (BischofLangSun)

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1Q1

b+1

b+1

d+1

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

12

Q1

b+1

b+1

d+1

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

12

Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

Q1

Q1T

Q2

Q2T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

Q1

Q1T

Q2

Q2T

Q3

Q3T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

6

6

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

Conventional vs CA - SBR

Conventional Communication-Avoiding

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Nonsymmetric Eigenproblem

bull No apparent way to modify standard algorithmbull Instead Spectral Divide-and-Conquer

ndash Find orthogonal matrix Q whose leading columns span an invariant subspace of A

ndash QTAQ will be block upper triangular

ndash Apply recursively to A11 A22

ndash Depends on randomization1 Randomized Rank Revealing QR decomposition2 Randomized location to try splitting spectrum

A11 A12

ε A22

Attaining the Lower bounds SequentialLegend[Existing][Ours][Math-Lib][Random]

Two Levels Memory Hierarchy

Words Messages Words Messages

BLAS-3 [FLPRrsquo99][BDLSTrsquo13][MKL etc] [FLPRrsquo99][BDLSTrsquo13][MKL etc]

Cholesky[Grsquo97][APrsquo00]

[LAPACK][BDHSrsquo09]

[Grsquo97][APrsquo00][BDHSrsquo09] [Grsquo97][APrsquo00][BDHSrsquo09]

Sym Indefinite [BBDDDPSTYrsquo13] [BBDDDPSTYrsquo13]

LU[Grsquo97][Trsquo97]

[GDXrsquo11][BDLSTrsquo13]

[GDXrsquo11][BDLSTrsquo13]

[Grsquo97][Trsquo97] [BDLSTrsquo13] [BDLSTrsquo13]

QR[EGrsquo98][FWrsquo03]

[DGHLrsquo12][BDLSTrsquo13]

[FWrsquo03][DGHLrsquo12][BDLSTrsquo13]

[EGrsquo98][FWrsquo03][BDLSTrsquo13]

[FWrsquo03][BDLSTrsquo13]

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13] [BDDrsquo11]

Non Sym Eig [BDDrsquo11] [BDDrsquo11]

Legend[Existing][Ours][Math-Lib][Random]

Words (BW) Messages (L) Saving factor

BLAS-3 [AGZrsquo94][MTrsquo99][ScaLAPACK][Crsquo69][vGWrsquo97][SDrsquo11] L nP12

Cholesky [ScaLAPACK][Trsquo99][SDrsquo11] L nP12

Sym Indefinite [BBDDDPSTYrsquo13][ScaLAPACK] [BBDDDPSTYrsquo13] L nP12

LU [ScaLAPACK][GDXrsquo11][Trsquo99][SDrsquo11] [GDXrsquo11][Trsquo99][SDrsquo11] L nP12

QR [ScaLAPACK][DGHLrsquo12] [Trsquo99] [DGHLrsquo12][Trsquo99] L nP12

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13][ScaLAPACK] [BDDrsquo11][BDKrsquo13] L nP12

Non-Sym Eig [BDDrsquo11] [BDDrsquo11] BW P12 L n

Attaining with extra memory 25D M=(cn2P)

Attaining the Lower bounds Parallel 2DM=(n2P)(Ignoring poly-log(P) factors words = ( n2 P12) messages = (P12)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Conjugate Gradients (CG) GMRES Lanczos Arnoldi hellip bull Goal minimize communication

ndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computationndash Challenges Poor partitioning Preconditioning Num Stability

75

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV
• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

77

Example: The Difficulty of Tuning
• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs (see the BSR sketch below)

78
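A small SciPy sketch of the register-blocking idea; a synthetic matrix with 8x8 dense blocks stands in for raefsky, and the sizes below are made up:

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
# Synthetic stand-in: a sparse pattern of fully dense 8x8 blocks.
S = sp.random(256, 256, density=0.02, random_state=rng, format="csr")
A_csr = sp.kron(S, np.ones((8, 8)), format="csr")   # 8x8 dense substructure

# Register blocking: store and multiply by 8x8 blocks (BSR) instead of by
# entry (CSR) - one column index per block rather than per entry.
A_bsr = A_csr.tobsr(blocksize=(8, 8))

x = rng.standard_normal(A_csr.shape[1])
assert np.allclose(A_csr @ x, A_bsr @ x)            # same SpMV result
print("index entries: CSR", A_csr.indices.size, "vs BSR", A_bsr.indices.size)
```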

Speedups on Itanium 2: The Need for Search
[Figure: Mflops achieved by every register block size; the reference (unblocked) code vs the best block size, 4x2.]

79

Register Profile: Itanium 2
[Figure: SpMV performance over all register block sizes, ranging from 190 Mflops to 1190 Mflops.]

80

Register Profiles: IBM and Intel IA-64
[Figure: four register-profile heatmaps. Power3 (17% of peak): 122 to 252 Mflops; Power4 (16%): 459 to 820 Mflops; Itanium 1 (8%): 107 to 247 Mflops; Itanium 2 (33%): 190 Mflops to 1.2 Gflops.]

Another example of tuning challenges for SpMV
• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16,614
• NNZ = 1.1 M

82

Zoom in to top corner
• More complicated non-zero structure in general
• N = 16,614
• NNZ = 1.1 M

83

3x3 blocks look natural, but…
• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

84

Extra Work Can Improve Efficiency
• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup
  – Actual Mflop rate 1.5² = 2.25x higher
(A sketch of measuring the fill ratio with BSR follows below.)

85
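A sketch of how the fill ratio can be measured with SciPy's BSR format (synthetic matrix, not Ex11):

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(1)
A = sp.random(900, 900, density=0.01, random_state=rng, format="csr")

A3 = A.tobsr(blocksize=(3, 3))           # impose a logical 3x3 grid
fill_ratio = A3.data.size / A.nnz        # stored entries (incl. explicit zeros) / true nnz
print("fill ratio:", round(fill_ratio, 2))

# If the unrolled 3x3 kernel raises the raw Mflop rate by a factor f, the net
# speedup is roughly f / fill_ratio; on the slide, 2.25 / 1.5 = 1.5x.
```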

Source: Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal

87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering
[Figure: nonzero pattern before reordering (green + red) and after (green + blue).]
2x speedups on Pentium 4, Power 4, … (an RCM sketch in SciPy follows below)

89
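An RCM reordering sketch with SciPy; the TSP half of the combined ordering is not shown, and the matrix below is a random stand-in, not the cavity problem:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import reverse_cuthill_mckee

def bandwidth(A):
    """Maximum |i - j| over stored entries."""
    C = A.tocoo()
    return int(np.abs(C.row - C.col).max())

rng = np.random.default_rng(0)
A = sp.random(2000, 2000, density=0.002, random_state=rng, format="csr")
A = (A + A.T).tocsr()                                  # symmetric pattern

perm = reverse_cuthill_mckee(A, symmetric_mode=True)   # RCM ordering
B = A[perm, :][:, perm]                                # symmetric permutation

print("bandwidth before:", bandwidth(A), " after RCM:", bandwidth(B))
```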

Summary of Other Performance Optimizations
• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·AT·x, AT·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later…

90

Optimized Sparse Kernel Interface – OSKI
• Provides sparse kernels automatically tuned for user's matrix & machine
  – BLAS-style functionality: SpMV (Ax & ATy), TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as stand-alone library
  – Available as PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

91

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

93

Example: Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in each iteration (marked in the sketch below).

94
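Since the slide's pseudocode is an image, here is a plain NumPy sketch of classical CG with the communication points marked; A is assumed symmetric positive definite:

```python
import numpy as np

def cg(A, b, x0, tol=1e-8, maxiter=200):
    x = x0.copy()
    r = b - A @ x                    # SpMV: neighbor communication
    p = r.copy()
    rr = r @ r                       # dot product: global reduction
    for _ in range(maxiter):
        Ap = A @ p                   # SpMV: neighbor communication
        alpha = rr / (p @ Ap)        # dot product: global reduction
        x += alpha * p
        r -= alpha * Ap
        rr_new = r @ r               # dot product: global reduction
        if np.sqrt(rr_new) < tol:
            break
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x
```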

Example: CA-Conjugate Gradient

The s SpMVs are performed via the CA matrix powers kernel, one global reduction computes the Gram matrix G, and the local computations within the inner loop require no communication (see the sketch below).
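A sketch of the two communication-avoiding ingredients only; the CA-CG coefficient recurrences that consume G are omitted, so this is an illustration, not the full method:

```python
import numpy as np

def krylov_basis(A, v, k):
    """[v, Av, ..., A^k v] as columns; in CA-CG this comes from the matrix
    powers kernel with O(1) communication instead of k rounds."""
    V = [v]
    for _ in range(k):
        V.append(A @ V[-1])
    return np.column_stack(V)

def ca_cg_setup(A, p, r, s):
    P = krylov_basis(A, p, s)        # [p, Ap, ..., A^s p]
    R = krylov_basis(A, r, s - 1)    # [r, Ar, ..., A^(s-1) r]
    V = np.hstack([P, R])
    G = V.T @ V                      # one Gram matrix = one global reduction
    return V, G                      # the s inner steps then use only G + local data
```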

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

96

[Figure: convergence of CG vs CA-CG (monomial basis) on the model problem. CA-CG converges more slowly and loses accuracy due to roundoff; at s = 16 the monomial basis is rank deficient and the method breaks down. The horizontal line marks machine precision.]
Model problem:
• 2D Poisson, 5-point stencil
• 30x30 grid
• Cond(A) ~ 400
(A small reproduction of the basis breakdown follows below.)

97
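The breakdown is easy to reproduce: build the slide's model problem and watch the conditioning of the monomial basis explode (sketch; exact numbers will differ from the plot):

```python
import numpy as np
import scipy.sparse as sp

m = 30                                              # 2D Poisson, 5-point stencil, 30x30 grid
T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(m, m))
A = sp.kron(sp.eye(m), T) + sp.kron(T, sp.eye(m))   # n = 900

x = np.ones(A.shape[0]) / np.sqrt(A.shape[0])
V = [x]
for s in range(1, 17):
    V.append(A @ V[-1])
    K = np.column_stack(V)                 # monomial basis [x, Ax, ..., A^s x]
    print(s, f"{np.linalg.cond(K):.2e}")   # grows geometrically; near s = 16 it
                                           # passes 1/eps: numerically rank deficient
```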

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?
• Requires o(n²) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + UDV^T, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write A^k = (S + UDV^T)^k as sum of S^k and low-rank matrices (see the sketch below)

Classification (rows: nonzero entries, columns: indices):
                      Indices explicit (O(nnz))   Indices implicit (o(nnz))
Entries explicit      CSR and variations          Vision, climate, AMR, …
Entries implicit      Graph Laplacian             Stencils
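A sketch of the kernel the last bullet asks for, applying A = S + U D V^T (and its powers) to a vector without ever forming the dense matrix; the sizes and factors below are made up for illustration:

```python
import numpy as np
import scipy.sparse as sp

def apply_power(S, U, D, V, x, k):
    """y = (S + U D V^T)^k x using one SpMV plus small dense products per step."""
    y = x
    for _ in range(k):
        y = S @ y + U @ (D @ (V.T @ y))
    return y

rng = np.random.default_rng(0)
n, r = 200, 3
S = sp.random(n, n, density=0.02, random_state=rng, format="csr")
U = rng.standard_normal((n, r))
V = rng.standard_normal((n, r))
D = np.diag(rng.standard_normal(r))
y = apply_power(S, U, D, V, rng.standard_normal(n), k=4)
```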

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

101

Reproducible Floating Point Computation
• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm at GNS-MBH
  – Sought reproducible parallel sparse linear equation solver, demanded by customers (construction engineers), otherwise they don't believe results
  – Willing to sacrifice 40% - 50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What? How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility
[Figure: absolute error for random vectors (same magnitude, opposite signs) and relative error for orthogonal vectors (sign not reproducible).]
Vector size 1e6, data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3 or 4 threads
• Absolute error = maximum – minimum
• Relative error = absolute error / maximum absolute value
A tiny demonstration of the underlying nonassociativity follows below.

103
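The root cause is easy to demonstrate without MKL: floating-point addition is not associative, so different summation orders (different thread counts or layouts) give different answers:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(10**6)

s_forward  = sum(x.tolist())        # left-to-right
s_reversed = sum(x[::-1].tolist())  # opposite order
s_pairwise = float(np.sum(x))       # numpy's pairwise/blocked reduction

print(s_forward - s_reversed)       # typically a small nonzero difference
print(s_forward - s_pairwise)
```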

Goals/Approaches for Reproducibility
• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (fails goal 2 or 3)
  – Use (very) high precision to get exact answer (fails goal 2)
  – Pre-rounding technique (Nguyen, D.) – sketched below

104
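A simplified, single-bin sketch of the pre-rounding idea (the full algorithm uses several bins and handles overflow and exceptions; this is an illustration, not the authors' implementation):

```python
import numpy as np

def prerounded_sum(x):
    """Order-independent summation via pre-rounding.

    Each x[i] is rounded onto a fixed absolute grid determined only by max|x|
    and len(x); the rounded values then add exactly, so the result does not
    depend on summation order (layout, #threads).  Accuracy: absolute error
    up to about n * ulp(B), adjustable by refining the grid.
    """
    x = np.asarray(x, dtype=np.float64)
    n = len(x)
    m = np.max(np.abs(x))
    if m == 0.0:
        return 0.0
    # Power-of-two boundary B >= 2 * n * max|x|: every partial sum of the
    # pre-rounded values is then an exactly representable multiple of ulp(B)/2.
    B = 2.0 ** (np.ceil(np.log2(m)) + np.ceil(np.log2(n)) + 1)
    t = (x + B) - B          # rounds each x[i] onto the grid; the subtraction is exact
    return float(np.sum(t))  # exact, order-independent sum of the rounded values
```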

Performance results on 1024 proc Cray XC30: 1.2x to 3.2x slowdown vs fastest code for n=1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

106

Time to redesign all linear algebra, n-body, … algorithms and software

(and compilers)

Page 14: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

25D Matrix Multiplication

bull Assume can fit cn2P data per processor c gt 1bull Processors form (Pc)12 x (Pc)12 x c grid

k

j

iInitially P(ij0) owns A(ij) and B(ij) each of size n(cP)12 x n(cP)12

(1) P(ij0) broadcasts A(ij) and B(ij) to P(ijk)

(2) Processors at level k perform 1c-th of SUMMA ie 1c-th of Σm A(im)B(mj)

(3) Sum-reduce partial sums Σm A(im)B(mj) along k-axis so P(ij0) owns C(ij)

25D Matmul on BGP 16K nodes 64K coresc = 16 copies

Distinguished Paper Award EuroParrsquo11 (Solomonik D)SCrsquo11 paper by Solomonik Bhatele D

12x faster

27x faster

Perfect Strong Scaling ndash in Time and Energy (12)

bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2

bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model

ndash γT βT αT = secs per flop per word_moved per message of size m

bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ]

= T(P)cbull Notation for energy model

ndash γE βE αE = joules for same operations

ndash δE = joules per word of memory used per sec

ndash εE = joules per sec for leakage etc

bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP)

= E(P)bull Perfect scaling extends to N-body Strassen hellip

Perfect Strong Scaling ndash in Time and Energy (22)

bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ] = T(P)c

bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP) = E(P)

bull Can use these formulas to answer many questions such asndash How to choose p and M to minimize energy E needed for computationndash Given max allowed runtime T what is minimum energy E needed to achieve

itndash Given max allowed energy E what is the minimum runtime T attainablendash Can we minimize the average power P = ETndash Given target energy efficiency what architectural parameters are needed to

achieve itbull Can we attain 75 GflopsWattbull Can we attain an exaflop for 20 MWatts

Handling Heterogeneity

bull Suppose each of P processors could differndash γi = secflop βi = secword αi = secmessage Mi = memory

bull What is optimal assignment of work Fi to minimize timendash Ti = Fi γi + Fi βi Mi

12 + Fi αi Mi32 = Fi [γi + βi Mi

12 + αi Mi32] = Fi ξi

ndash Choose Fi so Σi Fi = n3 and minimizing T = maxi Ti

ndash Answer Fi = n3(1ξi)Σj(1ξj) and T = n3Σj(1ξj)

bull Optimal Algorithm for nxn matmulndash Recursively divide into 8 half-sized subproblemsndash Assign subproblems to processor i to add up to Fi flops

bull Works for Strassen other algorithmshellip

Application to Tensor Contractions

bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

bull Heavily used in electronic structure calculationsndash Ex NWChem

bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries

ndash Solomonik Hammond Matthews

C(ijk) = Σm A(ijm)B(mk)

A3-fold symm

B2-fold symm

C2-fold symm

Application to Tensor Contractions

bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn

bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews

Communication Lower Bounds for Strassen-like matmul algorithms

bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected

bull Extends up to M = n2 p2ω bull Extends to rectangular case multiply (mxn)(nxp) in q mults

ndash words_moved = Ω (flopsM^(logmpq -1))

bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz also in JACMbull Is the lower bound attainable

Classical O(n3) matmul

words_moved =Ω (M(nM12)3P)

Strassenrsquos O(nlg7) matmul

words_moved =Ω (M(nM12)lg7P)

Strassen-like O(nω) matmul

words_moved =Ω (M(nM12)ωP)

vs

Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory

Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory

CAPS If EnoughMemory and P 7 then BFS step else DFS step end if

Communication Avoiding Parallel Strassen (CAPS)

Best way to interleaveBFS and DFS is an tuning parameter

26

Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

Speedups 24-184(over previous Strassen-based algorithms)

Invited to appear as Research Highlight in CACM

Strassen-like beyond matmul

bull Thm (D Dumitriu Holtzrsquo07) Any Strassen-like O(nω) matmul algorithm can be used to build a numerically stable O(nω+η) algorithm for any ηgt0 for Ax=b least squares eig SVD hellipndash ηgt0 needed to deal with numerical stabilityndash Strassen already stable so η=0

bull Thm For sequential versions of these algorithms Words_moved = O(nω+ηM(ω+η)2 ndash 1 + n2 log n) ie attain expected lower bound

Ballard D Holtz Schwartz

Cache and Network Oblivious Algorithms

bull Motivation Minimizes communication at every level of a hierarchical system without tuning parameters (in theory)ndash Not always 25D Matmul on BGP was topology aware

bull CAPS Divide-and-conquer choose BFS or DFS to adapt to processors available memory

bull CARMAndash Divide-and-conquer classical matmul divide largest of 3

dimensions to create two subproblemsndash Choose BFS or DFS to adapt to processors available memory

CARMA Performance Distributed Memory

Square m = k = n = 6144

ScaLAPACK

CARMA

Peak

(log)

(log)

Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA

CARMA Performance Distributed Memory

Inner Product m = n = 192 k = 6291456

ScaLAPACK

CARMAPeak

(log)

(log)

Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA

CARMA Performance Shared Memory

Square m = k = n

MKL (double)CARMA (double)

MKL (single)CARMA (single)

Peak (single)

Peak (double)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

CARMA Performance Shared Memory

Inner Product m = n = 64

MKL (double)

CARMA (double)

MKL (single)

CARMA (single)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

Why is CARMA Faster in Shared MemoryL3 Cache Misses

Shared Memory Inner Product (m = n = 64 k = 524288)

97 Fewer Misses

86 Fewer Misses

(linear)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

One-sided Factorizations (LU QR) so farbull Classical Approach for i=1 to n update column i update trailing matrixbull words_moved = O(n3)

35

bull Blocked Approach (LAPACK) for i=1 to nb update block i of b columns update trailing matrixbull words moved = O(n3M13)

bull Recursive Approach func factor(A) if A has 1 column update it

else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)

bull None of these approaches minimizes messagesbull Parallel case Partial

Pivoting =gt n reductionsbull Need another idea

TSQR An Architecture-Dependent Algorithm

W =

W0

W1

W2

W3

R00

R10

R20

R30

R01

R11

R02Parallel

W =

W0

W1

W2

W3

R01R02

R00

R03

SequentialStreaming

W =

W0

W1

W2

W3

R00

R01

R01

R11

R02

R11

R03

Dual Core

Can choose reduction tree dynamically

Multicore Multisocket Multirack Multisite Out-of-core

Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo

Wnxb =

W1

W2

W3

W4

P1middotL1middotU1

P2middotL2middotU2

P3middotL3middotU3

P4middotL4middotU4

=

Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo

W1rsquoW2rsquoW3rsquoW4rsquo

P12middotL12middotU12

P34middotL34middotU34

=Choose b pivot rows call them W12rsquo

Choose b pivot rows call them W34rsquo

W12rsquoW34rsquo

= P1234middotL1234middotU1234

Choose b pivot rows

Go back to W and use these b pivot rows (move them to top do LU without pivoting)

37

Minimizing Communication in TSLU

W = W1

W2

W3

W4

LULULULU

LU

LULUParallel

W = W1

W2

W3

W4

LULU

LU

LUSequentialStreaming

W = W1

W2

W3

W4

LULU LU

LULU

LULU

Dual Core

Can choose reduction tree dynamically to match architecture as before

38

Making TSLU Numerically Stable

bull Details matterndash Going up the tree we could do LU either on original rows of A

(tournament pivoting) or computed rows of Undash Only tournament pivoting stable

bull ldquoThmrdquo New scheme as stable as Partial Pivoting (GEPP) in following sense Get same Schur complements as GEPP applied to different input matrix whose entries are blocks taken from input A

bull Why just a ldquoThmrdquo

39

Stability of LU using TSLU CALU

Summer School Lecture 4 40

bull Empirical testingndash Both random matrices and ldquospecial onesrdquondash Both binary tree (BCALU) and flat-tree (FCALU)ndash 3 metrics ||PA-LU||||A|| normwise and componentwise backward errorsndash See [D Grigori Xiang 2010] for details

Why is stability of TSLU just a ldquoThmrdquo

bull Proof is correct ndash in exact arithmeticbull Experiment

ndash Generate 100 random 6x6 rank 3 matrices in Matlabndash [LUP] = lu(A) do LU without pivoting on PA compare L factors are

they the samebull Compute || L ndash Lnp || A few 0rsquos A few infinrsquos a few NaNsbull Rest mostly O(1)

ndash Why Floating point is nonassociative doing arithmetic in different order gives different rounding errors

ndash Same experiment with rank 6 matrices || L ndash Lnp || usually nonzero O(macheps)

ndash Same experiment with 20x20 rank 4 matrices || L ndash Lnp || often O(103)

bull Much harder to break TSLU but possiblendash Occurred when using TSLU to factorize a low-rank subdiagonal

panel in symmetric-indefinite factorization41

Fixing TSLU

bull Run TSLU quickly test for stability fix if necessary (rare)

bull Test conditioning of U if not tiny (usual case) proceed elsebull Compute || L || if not big (usual case) proceed elsebull Factor A = QR using TSQR thenbull Factor Q = PLU using TSLU thenbull A = PL(UR) with UR as upper triangular factor

bull Last topic in lecture how to guarantee floating point reproducibility

42

2D CALU with Tournament Pivoting

43

25D CALU with Tournament Pivoting (c=4 copies)

44

Exascale Machine ParametersSource DOE Exascale Workshop

bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log

2 (

n2p

) =

log

2 (m

emo

ry_p

er_p

roc)

Up to 29x

25D vs 2D LUWith and Without Pivoting

Other CA algorithms for Ax=b least squares(13)

bull A symmetric and indefinitendash Seek factorization that retains symmetry PAPT = LDLT D

ldquosimplerdquobull Save frac12 flops preserve inertia

ndash Usual approach Bunch-Kaufmanbull D block diagonal with 1x1 and 2x2 blocksbull Pivot search down column along row (lots of communication)

ndash Alternative Aasenbull D = tridiagonal = Tbull Two steps

ndash PAPT = LTLT where T is banded using TSLU

48

0 0

0

0 0

0

0

hellip

hellip

ndash Solvefactor narrow band problem with Tbull Up to 28x faster than MKL Best Paper at IPDPSrsquo13

Other CA algorithms for Ax=b least squares (23)bull Minimizing bandwidth and latency for sequential GEPP

ndash So far could not do partial pivoting and minimize messages just words

ndash Challengebull Column layout good for choosing pivots bad for matmulbull Blocked layout good for matmul bad for choosing pivots

ndash Solution use both layouts switching between thembull ldquoShape Morphing LUrdquo or SMLU

49

bull func factor(A) if A has 1 column update it else factor(left half of A)

update right half of A

factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M)

bull func factor(A) if A has 1 column update it else factor(left half of A) reshape to recursive block format update right half of A reshape to columnwise format factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M32)

Other CA algorithms for Ax=b least squares (33)bull Need for pivoting arises beyond LU in QR

ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)

ndash Usual approach like Partial Pivoting

bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound

ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two

groups of b columns either using usual approach or something better (GuEisenstat)

bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix

ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 50

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What about sparse matrices (13)

bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then

bull But canrsquot reorder outer loop for 25D need another idea

bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring

bull Kleenersquos Algorithm

52

for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)

D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21

Performance of 25D APSP using Kleene

53

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)

62xspeedup

2x speedup

What about sparse matrices (23)

bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of

G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix

bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering

for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid

bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse

multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on

separators) 54

What about sparse matrices (33)

bull If matrix stays very sparse lower bound unattainable new one

bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of

data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)

bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)

Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely

bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices

along dimensions most likely to minimize cost

55

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Symmetric Eigenproblem and SVD

bull Usual approach for A=AT (SVD similar)ndash A QTAQ = T where Q orthogonal T tridiagonalndash T UTTU = Λ where U orthogonal Λ diagonalndash QUrsquos columns are eigenvectors Λ eigenvaluesndash Dense Tridiagonal Diagonalndash Only half BLAS3 half BLAS2 in LAPACKrsquos sytrd

bull Communication-Avoiding Approachndash A QAQT = B where B=BT banded of bandwidth M12

ndash Continue as above starting with Bndash Dense Banded Tridiagonal Diagonalndash Dense Banded use TSQR to zero out M12 colsrows at a timendash Banded Tridiagonal need new(ish) idea

b+1

b+1

Successive Band Reduction (BischofLangSun)

1

b+1

b+1

d+1

c

Successive Band Reduction (BischofLangSun)

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1Q1

b+1

b+1

d+1

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

12

Q1

b+1

b+1

d+1

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

12

Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

Q1

Q1T

Q2

Q2T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

Q1

Q1T

Q2

Q2T

Q3

Q3T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

6

6

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

Conventional vs CA - SBR

Conventional Communication-Avoiding

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Nonsymmetric Eigenproblem

bull No apparent way to modify standard algorithmbull Instead Spectral Divide-and-Conquer

ndash Find orthogonal matrix Q whose leading columns span an invariant subspace of A

ndash QTAQ will be block upper triangular

ndash Apply recursively to A11 A22

ndash Depends on randomization1 Randomized Rank Revealing QR decomposition2 Randomized location to try splitting spectrum

A11 A12

ε A22

Attaining the Lower bounds SequentialLegend[Existing][Ours][Math-Lib][Random]

Two Levels Memory Hierarchy

Words Messages Words Messages

BLAS-3 [FLPRrsquo99][BDLSTrsquo13][MKL etc] [FLPRrsquo99][BDLSTrsquo13][MKL etc]

Cholesky[Grsquo97][APrsquo00]

[LAPACK][BDHSrsquo09]

[Grsquo97][APrsquo00][BDHSrsquo09] [Grsquo97][APrsquo00][BDHSrsquo09]

Sym Indefinite [BBDDDPSTYrsquo13] [BBDDDPSTYrsquo13]

LU[Grsquo97][Trsquo97]

[GDXrsquo11][BDLSTrsquo13]

[GDXrsquo11][BDLSTrsquo13]

[Grsquo97][Trsquo97] [BDLSTrsquo13] [BDLSTrsquo13]

QR[EGrsquo98][FWrsquo03]

[DGHLrsquo12][BDLSTrsquo13]

[FWrsquo03][DGHLrsquo12][BDLSTrsquo13]

[EGrsquo98][FWrsquo03][BDLSTrsquo13]

[FWrsquo03][BDLSTrsquo13]

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13] [BDDrsquo11]

Non Sym Eig [BDDrsquo11] [BDDrsquo11]

Legend[Existing][Ours][Math-Lib][Random]

Words (BW) Messages (L) Saving factor

BLAS-3 [AGZrsquo94][MTrsquo99][ScaLAPACK][Crsquo69][vGWrsquo97][SDrsquo11] L nP12

Cholesky [ScaLAPACK][Trsquo99][SDrsquo11] L nP12

Sym Indefinite [BBDDDPSTYrsquo13][ScaLAPACK] [BBDDDPSTYrsquo13] L nP12

LU [ScaLAPACK][GDXrsquo11][Trsquo99][SDrsquo11] [GDXrsquo11][Trsquo99][SDrsquo11] L nP12

QR [ScaLAPACK][DGHLrsquo12] [Trsquo99] [DGHLrsquo12][Trsquo99] L nP12

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13][ScaLAPACK] [BDDrsquo11][BDKrsquo13] L nP12

Non-Sym Eig [BDDrsquo11] [BDDrsquo11] BW P12 L n

Attaining with extra memory 25D M=(cn2P)

Attaining the Lower bounds Parallel 2DM=(n2P)(Ignoring poly-log(P) factors words = ( n2 P12) messages = (P12)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Conjugate Gradients (CG) GMRES Lanczos Arnoldi hellip bull Goal minimize communication

ndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computationndash Challenges Poor partitioning Preconditioning Num Stability

75

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Example The Difficulty of Tuning SpMV

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

77

Example The Difficulty of Tuning

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

bull 8x8 dense substructure exploit this to limit mem_refs

78

Speedups on Itanium 2 The Need for Search

Reference

Best 4x2

Mflops

Mflops

79

Register Profile Itanium 2

190 Mflops

1190 Mflops

80

Register Profiles IBM and Intel IA-64Power3 - 17 Power4 - 16

Itanium 2 - 33Itanium 1 - 8

252 Mflops

122 Mflops

820 Mflops

459 Mflops

247 Mflops

107 Mflops

12 Gflops

190 Mflops

Another example of tuning challenges for SpMV

bull Ex11 matrix (fluid flow)

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

82

Zoom in to top corner

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

83

3x3 blocks look natural buthellip

bull Example 3x3 blockingndash Logical grid of 3x3 cells

bull But would lead to lots of ldquofill-inrdquo

84

Extra Work Can Improve Efficiency

bull Example 3x3 blockingndash Logical grid of 3x3 cellsndash Fill-in explicit zerosndash Unroll 3x3 block multipliesndash ldquoFill ratiordquo = 15

bull On Pentium III 15x speedup

ndash Actual mflop rate 152 = 225 higher

85

Source Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal

Summer School Lecture 7

87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

Before Green + RedAfter Green + Blue

Summer School Lecture 7

892x speedups on Pentium 4 Power 4 hellip

Summary of Other Performance Optimizations

bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip

bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR

bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip

90

Optimized Sparse Kernel Interface - OSKI

bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning

bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski

bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression

software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki

91

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

93

Example Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in

each iteration

via CA Matrix Powers Kernel

Global reduction to compute G

94

Example CA-Conjugate Gradient

Local computations within inner loop require

no communication

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

96

Slower convergence due

to roundoff

Loss of accuracy due to roundoff

At s = 16 monomial basis is rank deficient Method breaks down

Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400

CA-CG (monomial)CG

machine precision

97

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit

bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square

bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank

matrices

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

101

bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010

ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver

demanded by customers (construction engineers) otherwise they donrsquot believe results

ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it

ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection

Reproducible Floating Point Computation

Absolute Error for Random Vectors

Same magnitude opposite signs

Intel MKL non-reproducibility

Relative Error for Orthogonal vectors

Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value

Sign notreproducible

103

bull Consider summation or dot productbull Goals

1 Same answer independent of layout processors order of summands

2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy

bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)

GoalsApproaches for Reproducibility

104

Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M

Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya

Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger

bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David

Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang

bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou

Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National

Instruments NEC Nokia NVIDIA Samsung Oracle

bull bebopcsberkeleyedu

Summary

Donrsquot Communichellip

106

Time to redesign all linear algebra n-body hellip algorithms and software

(and compilers)

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all ldquon3-likerdquo linear algebra
  • Lower bound for all ldquon3-likerdquo linear algebra (2)
  • Lower bound for all ldquon3-likerdquo linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a ldquoThmrdquo
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a ldquosparse matrixrdquo
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary
Page 15: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

25D Matmul on BGP 16K nodes 64K coresc = 16 copies

Distinguished Paper Award EuroParrsquo11 (Solomonik D)SCrsquo11 paper by Solomonik Bhatele D

12x faster

27x faster

Perfect Strong Scaling ndash in Time and Energy (12)

bull Every time you add a processor you should use its memory M toobull Start with minimal number of procs PM = 3n2

bull Increase P by a factor of c total memory increases by a factor of cbull Notation for timing model

ndash γT βT αT = secs per flop per word_moved per message of size m

bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ]

= T(P)cbull Notation for energy model

ndash γE βE αE = joules for same operations

ndash δE = joules per word of memory used per sec

ndash εE = joules per sec for leakage etc

bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP)

= E(P)bull Perfect scaling extends to N-body Strassen hellip

Perfect Strong Scaling ndash in Time and Energy (22)

bull T(cP) = n3(cP) [ γT+ βTM12 + αT(mM12) ] = T(P)c

bull E(cP) = cP n3(cP) [ γE+ βEM12 + αE(mM12) ] + δEMT(cP) + εET(cP) = E(P)

bull Can use these formulas to answer many questions such asndash How to choose p and M to minimize energy E needed for computationndash Given max allowed runtime T what is minimum energy E needed to achieve

itndash Given max allowed energy E what is the minimum runtime T attainablendash Can we minimize the average power P = ETndash Given target energy efficiency what architectural parameters are needed to

achieve itbull Can we attain 75 GflopsWattbull Can we attain an exaflop for 20 MWatts

Handling Heterogeneity

bull Suppose each of P processors could differndash γi = secflop βi = secword αi = secmessage Mi = memory

bull What is optimal assignment of work Fi to minimize timendash Ti = Fi γi + Fi βi Mi

12 + Fi αi Mi32 = Fi [γi + βi Mi

12 + αi Mi32] = Fi ξi

ndash Choose Fi so Σi Fi = n3 and minimizing T = maxi Ti

ndash Answer Fi = n3(1ξi)Σj(1ξj) and T = n3Σj(1ξj)

bull Optimal Algorithm for nxn matmulndash Recursively divide into 8 half-sized subproblemsndash Assign subproblems to processor i to add up to Fi flops

bull Works for Strassen other algorithmshellip

Application to Tensor Contractions

bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

bull Heavily used in electronic structure calculationsndash Ex NWChem

bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries

ndash Solomonik Hammond Matthews

C(ijk) = Σm A(ijm)B(mk)

A3-fold symm

B2-fold symm

C2-fold symm

Application to Tensor Contractions

bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn

bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews

Communication Lower Bounds for Strassen-like matmul algorithms

bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected

bull Extends up to M = n2 p2ω bull Extends to rectangular case multiply (mxn)(nxp) in q mults

ndash words_moved = Ω (flopsM^(logmpq -1))

bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz also in JACMbull Is the lower bound attainable

Classical O(n3) matmul

words_moved =Ω (M(nM12)3P)

Strassenrsquos O(nlg7) matmul

words_moved =Ω (M(nM12)lg7P)

Strassen-like O(nω) matmul

words_moved =Ω (M(nM12)ωP)

vs

Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory

Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory

CAPS If EnoughMemory and P 7 then BFS step else DFS step end if

Communication Avoiding Parallel Strassen (CAPS)

Best way to interleaveBFS and DFS is an tuning parameter

26

Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

Speedups 24-184(over previous Strassen-based algorithms)

Invited to appear as Research Highlight in CACM

Strassen-like beyond matmul

bull Thm (D Dumitriu Holtzrsquo07) Any Strassen-like O(nω) matmul algorithm can be used to build a numerically stable O(nω+η) algorithm for any ηgt0 for Ax=b least squares eig SVD hellipndash ηgt0 needed to deal with numerical stabilityndash Strassen already stable so η=0

bull Thm For sequential versions of these algorithms Words_moved = O(nω+ηM(ω+η)2 ndash 1 + n2 log n) ie attain expected lower bound

Ballard D Holtz Schwartz

Cache and Network Oblivious Algorithms

bull Motivation Minimizes communication at every level of a hierarchical system without tuning parameters (in theory)ndash Not always 25D Matmul on BGP was topology aware

bull CAPS Divide-and-conquer choose BFS or DFS to adapt to processors available memory

bull CARMAndash Divide-and-conquer classical matmul divide largest of 3

dimensions to create two subproblemsndash Choose BFS or DFS to adapt to processors available memory

CARMA Performance Distributed Memory

Square m = k = n = 6144

ScaLAPACK

CARMA

Peak

(log)

(log)

Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA

CARMA Performance Distributed Memory

Inner Product m = n = 192 k = 6291456

ScaLAPACK

CARMAPeak

(log)

(log)

Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA

CARMA Performance Shared Memory

Square m = k = n

MKL (double)CARMA (double)

MKL (single)CARMA (single)

Peak (single)

Peak (double)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

CARMA Performance Shared Memory

Inner Product m = n = 64

MKL (double)

CARMA (double)

MKL (single)

CARMA (single)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

Why is CARMA Faster in Shared MemoryL3 Cache Misses

Shared Memory Inner Product (m = n = 64 k = 524288)

97 Fewer Misses

86 Fewer Misses

(linear)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

One-sided Factorizations (LU QR) so farbull Classical Approach for i=1 to n update column i update trailing matrixbull words_moved = O(n3)

35

bull Blocked Approach (LAPACK) for i=1 to nb update block i of b columns update trailing matrixbull words moved = O(n3M13)

bull Recursive Approach func factor(A) if A has 1 column update it

else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)

bull None of these approaches minimizes messagesbull Parallel case Partial

Pivoting =gt n reductionsbull Need another idea

TSQR: An Architecture-Dependent Algorithm

[Figure: W = [W0; W1; W2; W3] factored with different reduction trees]
  Parallel (binary tree): local QR of each Wi gives R00, R10, R20, R30; pairs are combined into R01, R11; these combine into R02.
  Sequential / streaming (flat tree): R00 is combined with W1 to give R01, then with W2 to give R02, then with W3 to give R03.
  Dual core: a hybrid tree mixing the two (R00, R01, R02, R03 combined via R01, R11 into the final factor).

Can choose reduction tree dynamically (a small sketch of the binary-tree variant follows below)

Multicore / Multisocket / Multirack / Multisite / Out-of-core
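The binary-tree variant can be sketched with NumPy's QR as the local factorization; this is a serial illustration of the reduction tree, not the communication-optimized kernel, and it returns only the R factor (the Q factor stays implicit in the tree).

    import numpy as np

    def tsqr_binary(blocks):
        # blocks: list of tall-skinny row blocks W0..W_{p-1}, all with the same column count
        Rs = [np.linalg.qr(W, mode='r') for W in blocks]        # leaf QRs keep only R
        while len(Rs) > 1:                                      # pairwise reduction tree
            Rs = [np.linalg.qr(np.vstack(Rs[i:i + 2]), mode='r')
                  for i in range(0, len(Rs), 2)]
        return Rs[0]                                            # R factor of the stacked W

    # Example: a 4000 x 50 matrix split into 4 row blocks
    W = np.random.rand(4000, 50)
    R = tsqr_binary(np.array_split(W, 4))
    # R agrees with np.linalg.qr(W, mode='r') up to signs of its rows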

Back to LU: Using a similar idea for TSLU as TSQR: use a reduction tree to do "tournament pivoting" (a sketch of one such tournament follows below)

W (n x b) = [W1; W2; W3; W4]

Round 1: factor each block Wi = Pi·Li·Ui and choose b pivot rows of Wi, call them Wi'.

Round 2: stack the winners pairwise:
  [W1'; W2'] = P12·L12·U12, choose b pivot rows, call them W12'
  [W3'; W4'] = P34·L34·U34, choose b pivot rows, call them W34'

Round 3: [W12'; W34'] = P1234·L1234·U1234, choose b pivot rows.

Go back to W and use these b pivot rows (move them to the top, do LU without pivoting).

37
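A hedged sketch of the tournament structure above, using ordinary GEPP (scipy.linalg.lu_factor) as the local pivot-row selector; in the real algorithm each (indices, block) pair lives on its own processor and the pairwise rounds follow the chosen reduction tree.

    import numpy as np
    import scipy.linalg as sla

    def pivot_rows(W, b):
        # Rows that GEPP would move into the first b positions of W
        _, piv = sla.lu_factor(W)                  # piv is in LAPACK interchange form
        perm = np.arange(W.shape[0])
        for i, p in enumerate(piv):
            perm[i], perm[p] = perm[p], perm[i]
        return perm[:b]

    def tournament_pivoting(blocks, b):
        # blocks: list of (global_row_indices, block) pairs for W1, W2, ...
        while len(blocks) > 1:
            nxt = []
            for j in range(0, len(blocks), 2):     # combine pairs, as in the binary tree
                idx = np.concatenate([blk[0] for blk in blocks[j:j + 2]])
                W = np.vstack([blk[1] for blk in blocks[j:j + 2]])
                win = pivot_rows(W, b)
                nxt.append((idx[win], W[win]))
            blocks = nxt
        return blocks[0][0]                        # global indices of the b winning pivot rows

The b winning rows are then moved to the top of W, and the panel is factored with LU without pivoting, as stated above.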

Minimizing Communication in TSLU

[Figure: W = [W1; W2; W3; W4] reduced with the same trees as TSQR, with local LU in place of QR]
  Parallel: binary tree of LU factorizations
  Sequential / streaming: flat tree of LU factorizations
  Dual core: hybrid tree

Can choose reduction tree dynamically, to match architecture, as before

38

Making TSLU Numerically Stable

• Details matter
  – Going up the tree, we could do LU either on original rows of A (tournament pivoting) or on computed rows of U
  – Only tournament pivoting is stable

• "Thm": New scheme is as stable as Partial Pivoting (GEPP) in the following sense: it gets the same Schur complements as GEPP applied to a different input matrix whose entries are blocks taken from the input A

• Why just a "Thm"?

39

Stability of LU using TSLU: CALU

40

• Empirical testing
  – Both random matrices and "special ones"
  – Both binary tree (BCALU) and flat-tree (FCALU)
  – 3 metrics: ||PA - LU|| / ||A||, normwise and componentwise backward errors
  – See [D., Grigori, Xiang, 2010] for details

Why is stability of TSLU just a "Thm"?

• Proof is correct – in exact arithmetic
• Experiment (a NumPy version is sketched below):
  – Generate 100 random 6x6, rank-3 matrices in Matlab
  – [L,U,P] = lu(A); do LU without pivoting on P·A; compare L factors: are they the same?
    • Compute || L - Lnp ||: a few 0's, a few ∞'s, a few NaNs
    • Rest mostly O(1)
  – Why? Floating point is nonassociative; doing arithmetic in a different order gives different rounding errors
  – Same experiment with rank-6 matrices: || L - Lnp || usually nonzero, O(macheps)
  – Same experiment with 20x20 rank-4 matrices: || L - Lnp || often O(10^3)
• Much harder to break TSLU, but possible
  – Occurred when using TSLU to factorize a low-rank subdiagonal panel in symmetric-indefinite factorization

41
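A sketch of that experiment in NumPy/SciPy (lu_nopivot is a textbook unpivoted elimination written here for illustration; SciPy's lu uses the convention A = P·L·U, so the unpivoted factorization is applied to P^T·A):

    import numpy as np
    import scipy.linalg as sla

    def lu_nopivot(A):
        # Textbook LU without pivoting; zero/tiny pivots produce infs and NaNs.
        U = A.astype(float).copy(); n = U.shape[0]; L = np.eye(n)
        for k in range(n - 1):
            L[k+1:, k] = U[k+1:, k] / U[k, k]
            U[k+1:, k:] -= np.outer(L[k+1:, k], U[k, k:])
        return L, np.triu(U)

    rng = np.random.default_rng(0)
    diffs = []
    for _ in range(100):
        A = rng.standard_normal((6, 3)) @ rng.standard_normal((3, 6))   # rank 3
        P, L, U = sla.lu(A)                      # GEPP: A = P L U
        Lnp, _ = lu_nopivot(P.T @ A)             # unpivoted LU of the pre-permuted matrix
        diffs.append(np.abs(L - Lnp).max())
    # diffs contains some 0's, some infs/NaNs, and many O(1) values:
    # a different order of arithmetic gives a different L in floating point.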

Fixing TSLU

• Run TSLU quickly; test for stability; fix if necessary (rare); see the sketch below
• Test conditioning of U; if not tiny (usual case), proceed, else
• Compute || L ||; if not big (usual case), proceed, else
• Factor A = QR using TSQR, then
• Factor Q = PLU using TSLU, then
• A = P·L·(U·R), with U·R as upper triangular factor

• Last topic in lecture: how to guarantee floating point reproducibility

42
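The fallback logic above, written as a small sketch; tslu, tsqr, u_looks_singular and l_is_large are placeholder names for the kernels and tests described in the bullets.

    def stable_panel_factor(A):
        P, L, U = tslu(A)                          # fast path: TSLU with tournament pivoting
        if not u_looks_singular(U) and not l_is_large(L):
            return P, L, U                         # usual case: accept the cheap factorization
        # rare fallback: TSQR is unconditionally stable, then factor its Q with TSLU
        Q, R = tsqr(A)
        P, L, U = tslu(Q)
        return P, L, U @ R                         # A = P·L·(U·R); U·R stays upper triangular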

2D CALU with Tournament Pivoting

43

2.5D CALU with Tournament Pivoting (c=4 copies)

44

Exascale Machine Parameters (Source: DOE Exascale Workshop)

• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node

Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU

[Heat map over log2(p) and log2(n^2/p) = log2(memory_per_proc)]

Up to 29x

2.5D vs 2D LU, With and Without Pivoting

Other CA algorithms for Ax=b, least squares (1/3)

• A symmetric and indefinite
  – Seek factorization that retains symmetry: P·A·P^T = L·D·L^T, D "simple"
    • Save 1/2 the flops, preserve inertia
  – Usual approach: Bunch-Kaufman
    • D block diagonal with 1x1 and 2x2 blocks
    • Pivot search down column, along row (lots of communication)
  – Alternative: Aasen
    • D = tridiagonal = T
    • Two steps:
      – P·A·P^T = L·T·L^T, where T is banded, using TSLU
        [figure: banded matrix T]
      – Solve/factor the narrow band problem with T
    • Up to 2.8x faster than MKL; Best Paper at IPDPS'13

48

Other CA algorithms for Ax=b, least squares (2/3)
• Minimizing bandwidth and latency for sequential GEPP
  – So far, could not do partial pivoting and minimize #messages, just #words
  – Challenge:
    • Column layout good for choosing pivots, bad for matmul
    • Blocked layout good for matmul, bad for choosing pivots
  – Solution: use both layouts, switching between them
    • "Shape Morphing LU", or SMLU

49

• Recursive approach (columnwise layout):
    func factor(A)
        if A has 1 column, update it
        else
            factor(left half of A)
            update right half of A
            factor(right half of A)
  #Words = O(n^3 / M^(1/2));  #Messages = O(n^3 / M)

• Shape Morphing LU:
    func factor(A)
        if A has 1 column, update it
        else
            factor(left half of A)
            reshape to recursive block format
            update right half of A
            reshape to columnwise format
            factor(right half of A)
  #Words = O(n^3 / M^(1/2));  #Messages = O(n^3 / M^(3/2))

Other CA algorithms for Ax=b, least squares (3/3)
• Need for pivoting arises beyond LU, in QR
  – Choose permutation P so that the leading columns of A·P = Q·R span the column space of A – Rank Revealing QR (RRQR)
  – Usual approach, like partial pivoting:
    • Put longest column first, update rest of matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit the lower bound
  – Use tournament pivoting (sketched after this list):
    • Each round of the tournament selects the best b columns from two groups of b columns, either using the usual approach or something better (Gu/Eisenstat)
    • Thm: This approach "reveals the rank" of A, in the sense that the leading r x r submatrix of R has singular values "near" the largest r singular values of A; ditto for the trailing submatrix
  – Idea extends to other pivoting schemes:
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDL^T with complete pivoting

50
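A hedged sketch of tournament pivoting for column selection, using ordinary QR with column pivoting (scipy.linalg.qr(..., pivoting=True)) as the local selector; the Gu/Eisenstat strong RRQR mentioned above could be substituted for best_columns, and in the parallel setting each column group lives on its own processor.

    import numpy as np
    import scipy.linalg as sla

    def best_columns(W, b):
        # Indices of the b columns that column-pivoted QR ranks first
        _, _, piv = sla.qr(W, mode='economic', pivoting=True)
        return piv[:b]

    def tournament_columns(A, b):
        # Start from groups of b consecutive columns; combine winners pairwise.
        cols = [np.arange(j, min(j + b, A.shape[1])) for j in range(0, A.shape[1], b)]
        while len(cols) > 1:
            nxt = []
            for j in range(0, len(cols), 2):
                idx = np.concatenate(cols[j:j + 2])
                win = best_columns(A[:, idx], b)
                nxt.append(idx[win])
            cols = nxt
        return cols[0]          # global indices of b "rank-revealing" columns of A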

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What about sparse matrices? (1/3)

• If the matrix quickly becomes dense, use a dense algorithm
• Ex: All-Pairs Shortest Path using Floyd-Warshall
• Similar to matmul: let D = A, then run the triple loop below
• But can't reorder the outer loop for 2.5D; need another idea
• Abbreviate D(i,j) = min(D(i,j), min_k(A(i,k)+B(k,j))) by D = A*B
  – Dependencies ok, 2.5D works, just a different semiring
• Kleene's Algorithm (DC-APSP below; a NumPy sketch follows)

52

    Floyd-Warshall:
      for k = 1:n
        for i = 1:n
          for j = 1:n
            D(i,j) = min(D(i,j), D(i,k) + D(k,j))

    D = DC-APSP(A, n):
      D = A;  partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
      D11 = DC-APSP(D11, n/2)
      D12 = D11 * D12
      D21 = D21 * D11
      D22 = D21 * D12
      D22 = DC-APSP(D22, n/2)
      D21 = D22 * D21
      D12 = D12 * D22
      D11 = D12 * D21
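A serial NumPy sketch of the min-plus product and the DC-APSP recursion above; it assumes a zero diagonal and np.inf for absent edges, does not model the 2.5D data distribution, and the broadcasted min-plus product allocates an O(n^3) temporary, so it is only suitable for small examples.

    import numpy as np

    def minplus(D, A, B):
        # D(i,j) = min(D(i,j), min_k A(i,k) + B(k,j))
        return np.minimum(D, (A[:, :, None] + B[None, :, :]).min(axis=1))

    def dc_apsp(A):
        n = A.shape[0]
        if n == 1:
            return A.astype(float).copy()
        h = n // 2
        D = A.astype(float).copy()
        D[:h, :h] = D11 = dc_apsp(D[:h, :h])
        D[:h, h:] = D12 = minplus(D[:h, h:], D11, D[:h, h:])
        D[h:, :h] = D21 = minplus(D[h:, :h], D[h:, :h], D11)
        D[h:, h:] = D22 = minplus(D[h:, h:], D21, D12)
        D[h:, h:] = D22 = dc_apsp(D22)
        D[h:, :h] = D21 = minplus(D21, D22, D21)
        D[:h, h:] = D12 = minplus(D12, D12, D22)
        D[:h, :h] = minplus(D11, D12, D21)
        return D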

Performance of 2.5D APSP using Kleene

53

Strong scaling on Hopper (Cray XE6 with 1024 nodes = 24,576 cores)
[Plot annotations: 6.2x speedup; 2x speedup]

What about sparse matrices? (2/3)

• If parts of the matrix become dense, optimize those
• Ex: Cholesky on a matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): If all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on a w x w submatrix
• Thm: Words_moved = Ω(w^3 / M^(1/2)), etc.
• Thm (George '73): Nested dissection gives the optimal ordering for 2D grids, 3D grids, similar matrices
  – w = n for a 2D n x n grid; w = n^2 for a 3D n x n x n grid
• Sequential multifrontal Cholesky attains the bounds
• PSPASES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)

54

What about sparse matrices? (3/3)

• If the matrix stays very sparse, the lower bound is unattainable; need a new one
• Ex: A, B both diagonal: no communication in the parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), i.i.d.
• Assumption: the algorithm is sparsity-independent: the assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
    Words_moved = Ω( min( d·n / P^(1/2), d^2·n / P ) )
  – Proof exploits the fact that reuse of entries of C = A·B is unlikely
• Contrast the general lower bound: Words_moved = Ω( d^2·n / (P·M^(1/2)) )
• Attained by a divide-and-conquer algorithm that splits matrices along the dimensions most likely to minimize cost

55

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Symmetric Eigenproblem and SVD

• Usual approach for A = A^T (SVD similar)
  – A → Q^T·A·Q = T, where Q orthogonal, T tridiagonal
  – T → U^T·T·U = Λ, where U orthogonal, Λ diagonal
  – (Q·U)'s columns are eigenvectors, Λ holds the eigenvalues
  – Dense → Tridiagonal → Diagonal
  – Only half BLAS3, half BLAS2, in LAPACK's sytrd

• Communication-Avoiding Approach
  – A → Q·A·Q^T = B, where B = B^T banded, of bandwidth M^(1/2)
  – Continue as above, starting with B
  – Dense → Banded → Tridiagonal → Diagonal
  – Dense → Banded: use TSQR to zero out M^(1/2) cols/rows at a time
  – Banded → Tridiagonal: need a new(ish) idea

Successive Band Reduction (Bischof/Lang/Sun)

[Sequence of figures: a band of width b+1 is narrowed by applying orthogonal transforms Q1, Q1^T, Q2, Q2^T, ..., Q5, Q5^T that eliminate c columns (d diagonals at a time) and then chase the resulting (d+c) x (d+c) bulges down the band in sweeps labeled 1 through 6. Notation: b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b.]

Conventional vs CA-SBR

[Figures comparing the two sweep patterns]
  Conventional: touch all data 4 times
  Communication-avoiding: touch all data once

Speedups of Symmetric Band Reduction vs DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads

• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem

• No apparent way to modify the standard algorithm
• Instead: Spectral Divide-and-Conquer
  – Find an orthogonal matrix Q whose leading columns span an invariant subspace of A
  – Q^T·A·Q will be block upper triangular:
        [ A11  A12 ]
        [  ε   A22 ]
  – Apply recursively to A11, A22
  – Depends on randomization:
    1. Randomized Rank Revealing QR decomposition
    2. Randomized location to try splitting the spectrum

Attaining the Lower Bounds: Sequential
Legend: [Existing], [Ours], [Math-Lib], [Random]
Columns: Two Levels (Words, Messages) and Memory Hierarchy (Words, Messages); citations per row, as in the original table:
  BLAS-3: [FLPR'99], [BDLST'13], [MKL etc.]
  Cholesky: [G'97], [AP'00], [LAPACK], [BDHS'09]
  Sym Indefinite: [BBDDDPSTY'13]
  LU: [G'97], [T'97], [GDX'11], [BDLST'13]
  QR: [EG'98], [FW'03], [DGHL'12], [BDLST'13]
  Rank Revealing QR: [BDD'11], [DGGX'13]
  Sym Eig & SVD: [BDD'11], [BDK'13]
  Non Sym Eig: [BDD'11]

Attaining the Lower Bounds: Parallel 2D, M = Θ(n^2/P)
(Ignoring poly-log(P) factors; #words = Ω(n^2 / P^(1/2)), #messages = Ω(P^(1/2)))
Legend: [Existing], [Ours], [Math-Lib], [Random]
Columns: Words (BW), Messages (L), and the saving factor attainable with extra memory (2.5D, M = Θ(c·n^2/P)):
  BLAS-3: [AGZ'94], [MT'99], [ScaLAPACK], [C'69], [vGW'97], [SD'11]; saving: L: n/P^(1/2)
  Cholesky: [ScaLAPACK], [T'99], [SD'11]; saving: L: n/P^(1/2)
  Sym Indefinite: [BBDDDPSTY'13], [ScaLAPACK]; saving: L: n/P^(1/2)
  LU: [ScaLAPACK], [GDX'11], [T'99], [SD'11]; saving: L: n/P^(1/2)
  QR: [ScaLAPACK], [DGHL'12], [T'99]; saving: L: n/P^(1/2)
  Rank Revealing QR: [BDD'11], [DGGX'13]
  Sym Eig & SVD: [BDD'11], [BDK'13], [ScaLAPACK]; saving: L: n/P^(1/2)
  Non-Sym Eig: [BDD'11]; saving: BW: P^(1/2), L: n

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and a starting vector
  – Many such "Krylov Subspace Methods"
    • Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors
    • Conventional: O(k log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability

75

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV

• n = 21200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

77

Example: The Difficulty of Tuning

• n = 21200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs

78

Speedups on Itanium 2: The Need for Search

[Plot: reference implementation vs best register blocking (4x2), in Mflop/s]

79

Register Profile: Itanium 2

[Heat map over register block sizes: 190 Mflop/s (worst) to 1190 Mflop/s (best)]

80

Register Profiles: IBM and Intel IA-64

[Heat maps over register block sizes for four platforms: Power3 (17), Power4 (16), Itanium 1 (8), Itanium 2 (33); measured rates range from 122 to 252 Mflop/s (Power3), 459 to 820 Mflop/s (Power4), 107 to 247 Mflop/s (Itanium 1), and 190 Mflop/s to 1.2 Gflop/s (Itanium 2)]

Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16614
• NNZ = 1.1 M

82

Zoom in to top corner

• More complicated non-zero structure in general
• N = 16614
• NNZ = 1.1 M

83

3x3 blocks look natural, but…

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

84

Extra Work Can Improve Efficiency (see the BCSR sketch below)

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5

• On Pentium III: 1.5x speedup!
  – Actual Mflop rate 1.5^2 = 2.25x higher

85
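The register-blocking idea above can be tried directly with SciPy's BSR format, which likewise stores explicit zeros to complete r x c blocks; the hand-tuned kernels go further, but the fill/speed trade-off is already visible. The matrix here is a random stand-in, not the Ex11 or raefsky matrix.

    import numpy as np
    import scipy.sparse as sp

    A = sp.random(3000, 3000, density=0.002, format='csr', random_state=0)
    x = np.random.rand(3000)

    A_bsr = A.tobsr(blocksize=(3, 3))        # 3x3 blocking; zeros are filled in explicitly
    fill_ratio = A_bsr.nnz / A.nnz           # stored values / true nonzeros (>= 1)

    y = A_bsr @ x                            # blocked SpMV
    assert np.allclose(y, A @ x)             # same result as the unblocked CSR SpMV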

Source: Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal

87

Post-RCM Reordering (an RCM example in SciPy follows below)

88

Effect of Combined RCM+TSP Reordering

[Plot: before = green + red; after = green + blue]

89

2x speedups on Pentium 4, Power 4, …
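For reference, the RCM step of the reordering shown above is available in SciPy; the TSP-based refinement used in these experiments is a separate, custom step that SciPy does not provide, and the matrix below is a random stand-in.

    import scipy.sparse as sp
    from scipy.sparse.csgraph import reverse_cuthill_mckee

    A = sp.random(2000, 2000, density=0.002, format='csr', random_state=0)
    A = (A + A.T).tocsr()                          # symmetrize so the graph is undirected
    perm = reverse_cuthill_mckee(A, symmetric_mode=True)
    A_rcm = A[perm, :][:, perm]                    # reordered matrix has much smaller bandwidth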

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·A^T·x, A^T·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later…

90

Optimized Sparse Kernel Interface - OSKI

• Provides sparse kernels automatically tuned for the user's matrix & machine
  – BLAS-style functionality: SpMV (A·x & A^T·y), TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as a stand-alone library
  – Available as a PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

91

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

93

Example: Classical Conjugate Gradient (CG)

[Algorithm listing; the SpMVs and dot products require communication in each iteration]

94

Example: CA-Conjugate Gradient

[Algorithm listing; the s SpMVs are replaced by a CA matrix powers kernel, one global reduction computes the Gram matrix G, and the local computations within the inner loop require no communication. A sketch of these two building blocks follows below.]
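The two building blocks referred to above can be sketched as follows; this serial fragment only shows their interface (an s-step Krylov basis and a single Gram-matrix reduction) and omits both the blocked, communication-avoiding implementation of the matrix powers kernel and the CA-CG coefficient recurrences themselves.

    import numpy as np

    def matrix_powers(A, v, s):
        # V = [v, A v, A^2 v, ..., A^s v] as columns (monomial basis; better-conditioned
        # Newton or Chebyshev bases are used in practice, as the next slide suggests)
        V = np.empty((v.size, s + 1))
        V[:, 0] = v
        for j in range(s):
            V[:, j + 1] = A @ V[:, j]
        return V

    def gram(V):
        # One global reduction: G = V^T V holds every inner product needed for s steps
        return V.T @ V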

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

96

[Plot: convergence of CG vs CA-CG (monomial basis), down to machine precision]
  Slower convergence due to roundoff; loss of accuracy due to roundoff
  At s = 16 the monomial basis is rank deficient! The method breaks down.

Model problem:
• 2D Poisson, 5-point stencil
• 30x30 grid
• cond(A) ~ 400

97

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?
• Requires o(n^2) data/indices to store
• Nonzero entries and indices could be explicit or implicit:

                              Indices explicit (O(nnz))    Indices implicit (o(nnz))
  Nonzeros explicit (O(nnz))  CSR and variations           Vision, climate, AMR, …
  Nonzeros implicit (o(nnz))  Graph Laplacian              Stencils

• Matrix could be a sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + U·D·V^T, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write A^k = (S + U·D·V^T)^k as a sum of S^k and low-rank matrices

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

101

Reproducible Floating Point Computation

• Get a bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe the results
  – Willing to sacrifice 40% - 50% of performance for it
• Email to ~110 Berkeley CSE faculty, asking about it
  – Most: "What? How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility

[Plots: absolute error for random vectors; relative error for orthogonal vectors. Annotations: same magnitude, opposite signs; sign not reproducible]

Vector size: 1e6. Data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3 or 4 threads
• Absolute error = maximum - minimum
• Relative error = absolute error / maximum absolute value

103

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee a fixed reduction tree (fails 2. or 3.)
  – Use (very) high precision to get the exact answer (fails 2.)
  – Prerounding technique (Nguyen, D.); a toy version is sketched below

104

Performance results on 1024 processors of a Cray XC30: 1.2x to 3.2x slowdown vs the fastest code, for n = 1M
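A toy illustration of the pre-rounding idea (much simplified relative to the Nguyen/Demmel algorithm, which uses a few bins, handles exceptions, and retains far more accuracy): if every summand is first rounded to a common absolute grid chosen from one global bound, all subsequent additions are exact, so any summation order returns the same bits.

    import math

    def reproducible_sum(x):
        m = max(abs(v) for v in x)                     # one extra reduction: a global bound
        if m == 0.0:
            return 0.0
        # grid coarse enough that len(x) rounded terms accumulate with no rounding error
        grid = 2.0 ** (math.floor(math.log2(m)) - 52 + math.ceil(math.log2(len(x))) + 1)
        return sum(round(v / grid) * grid for v in x)  # every partial sum is exact

    # Any permutation of x now yields bitwise the same result, at the cost of
    # discarding the lowest-order bits of the smallest summands.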

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

106

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)

Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory

Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory

CAPS If EnoughMemory and P 7 then BFS step else DFS step end if

Communication Avoiding Parallel Strassen (CAPS)

Best way to interleaveBFS and DFS is an tuning parameter

26

Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

Speedups 24-184(over previous Strassen-based algorithms)

Invited to appear as Research Highlight in CACM

Strassen-like beyond matmul

bull Thm (D Dumitriu Holtzrsquo07) Any Strassen-like O(nω) matmul algorithm can be used to build a numerically stable O(nω+η) algorithm for any ηgt0 for Ax=b least squares eig SVD hellipndash ηgt0 needed to deal with numerical stabilityndash Strassen already stable so η=0

bull Thm For sequential versions of these algorithms Words_moved = O(nω+ηM(ω+η)2 ndash 1 + n2 log n) ie attain expected lower bound

Ballard D Holtz Schwartz

Cache and Network Oblivious Algorithms

bull Motivation Minimizes communication at every level of a hierarchical system without tuning parameters (in theory)ndash Not always 25D Matmul on BGP was topology aware

bull CAPS Divide-and-conquer choose BFS or DFS to adapt to processors available memory

bull CARMAndash Divide-and-conquer classical matmul divide largest of 3

dimensions to create two subproblemsndash Choose BFS or DFS to adapt to processors available memory

CARMA Performance Distributed Memory

Square m = k = n = 6144

ScaLAPACK

CARMA

Peak

(log)

(log)

Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA

CARMA Performance Distributed Memory

Inner Product m = n = 192 k = 6291456

ScaLAPACK

CARMAPeak

(log)

(log)

Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA

CARMA Performance Shared Memory

Square m = k = n

MKL (double)CARMA (double)

MKL (single)CARMA (single)

Peak (single)

Peak (double)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

CARMA Performance Shared Memory

Inner Product m = n = 64

MKL (double)

CARMA (double)

MKL (single)

CARMA (single)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

Why is CARMA Faster in Shared MemoryL3 Cache Misses

Shared Memory Inner Product (m = n = 64 k = 524288)

97 Fewer Misses

86 Fewer Misses

(linear)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

One-sided Factorizations (LU QR) so farbull Classical Approach for i=1 to n update column i update trailing matrixbull words_moved = O(n3)

35

bull Blocked Approach (LAPACK) for i=1 to nb update block i of b columns update trailing matrixbull words moved = O(n3M13)

bull Recursive Approach func factor(A) if A has 1 column update it

else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)

bull None of these approaches minimizes messagesbull Parallel case Partial

Pivoting =gt n reductionsbull Need another idea

TSQR An Architecture-Dependent Algorithm

W =

W0

W1

W2

W3

R00

R10

R20

R30

R01

R11

R02Parallel

W =

W0

W1

W2

W3

R01R02

R00

R03

SequentialStreaming

W =

W0

W1

W2

W3

R00

R01

R01

R11

R02

R11

R03

Dual Core

Can choose reduction tree dynamically

Multicore Multisocket Multirack Multisite Out-of-core

Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo

Wnxb =

W1

W2

W3

W4

P1middotL1middotU1

P2middotL2middotU2

P3middotL3middotU3

P4middotL4middotU4

=

Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo

W1rsquoW2rsquoW3rsquoW4rsquo

P12middotL12middotU12

P34middotL34middotU34

=Choose b pivot rows call them W12rsquo

Choose b pivot rows call them W34rsquo

W12rsquoW34rsquo

= P1234middotL1234middotU1234

Choose b pivot rows

Go back to W and use these b pivot rows (move them to top do LU without pivoting)

37

Minimizing Communication in TSLU

W = W1

W2

W3

W4

LULULULU

LU

LULUParallel

W = W1

W2

W3

W4

LULU

LU

LUSequentialStreaming

W = W1

W2

W3

W4

LULU LU

LULU

LULU

Dual Core

Can choose reduction tree dynamically to match architecture as before

38

Making TSLU Numerically Stable

bull Details matterndash Going up the tree we could do LU either on original rows of A

(tournament pivoting) or computed rows of Undash Only tournament pivoting stable

bull ldquoThmrdquo New scheme as stable as Partial Pivoting (GEPP) in following sense Get same Schur complements as GEPP applied to different input matrix whose entries are blocks taken from input A

bull Why just a ldquoThmrdquo

39

Stability of LU using TSLU CALU

Summer School Lecture 4 40

bull Empirical testingndash Both random matrices and ldquospecial onesrdquondash Both binary tree (BCALU) and flat-tree (FCALU)ndash 3 metrics ||PA-LU||||A|| normwise and componentwise backward errorsndash See [D Grigori Xiang 2010] for details

Why is stability of TSLU just a ldquoThmrdquo

bull Proof is correct ndash in exact arithmeticbull Experiment

ndash Generate 100 random 6x6 rank 3 matrices in Matlabndash [LUP] = lu(A) do LU without pivoting on PA compare L factors are

they the samebull Compute || L ndash Lnp || A few 0rsquos A few infinrsquos a few NaNsbull Rest mostly O(1)

ndash Why Floating point is nonassociative doing arithmetic in different order gives different rounding errors

ndash Same experiment with rank 6 matrices || L ndash Lnp || usually nonzero O(macheps)

ndash Same experiment with 20x20 rank 4 matrices || L ndash Lnp || often O(103)

bull Much harder to break TSLU but possiblendash Occurred when using TSLU to factorize a low-rank subdiagonal

panel in symmetric-indefinite factorization41

Fixing TSLU

bull Run TSLU quickly test for stability fix if necessary (rare)

bull Test conditioning of U if not tiny (usual case) proceed elsebull Compute || L || if not big (usual case) proceed elsebull Factor A = QR using TSQR thenbull Factor Q = PLU using TSLU thenbull A = PL(UR) with UR as upper triangular factor

bull Last topic in lecture how to guarantee floating point reproducibility

42

2D CALU with Tournament Pivoting

43

25D CALU with Tournament Pivoting (c=4 copies)

44

Exascale Machine ParametersSource DOE Exascale Workshop

bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log

2 (

n2p

) =

log

2 (m

emo

ry_p

er_p

roc)

Up to 29x

25D vs 2D LUWith and Without Pivoting

Other CA algorithms for Ax=b least squares(13)

bull A symmetric and indefinitendash Seek factorization that retains symmetry PAPT = LDLT D

ldquosimplerdquobull Save frac12 flops preserve inertia

ndash Usual approach Bunch-Kaufmanbull D block diagonal with 1x1 and 2x2 blocksbull Pivot search down column along row (lots of communication)

ndash Alternative Aasenbull D = tridiagonal = Tbull Two steps

ndash PAPT = LTLT where T is banded using TSLU

48

0 0

0

0 0

0

0

hellip

hellip

ndash Solvefactor narrow band problem with Tbull Up to 28x faster than MKL Best Paper at IPDPSrsquo13

Other CA algorithms for Ax=b least squares (23)bull Minimizing bandwidth and latency for sequential GEPP

ndash So far could not do partial pivoting and minimize messages just words

ndash Challengebull Column layout good for choosing pivots bad for matmulbull Blocked layout good for matmul bad for choosing pivots

ndash Solution use both layouts switching between thembull ldquoShape Morphing LUrdquo or SMLU

49

bull func factor(A) if A has 1 column update it else factor(left half of A)

update right half of A

factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M)

bull func factor(A) if A has 1 column update it else factor(left half of A) reshape to recursive block format update right half of A reshape to columnwise format factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M32)

Other CA algorithms for Ax=b least squares (33)bull Need for pivoting arises beyond LU in QR

ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)

ndash Usual approach like Partial Pivoting

bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound

ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two

groups of b columns either using usual approach or something better (GuEisenstat)

bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix

ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 50

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What about sparse matrices (13)

bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then

bull But canrsquot reorder outer loop for 25D need another idea

bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring

bull Kleenersquos Algorithm

52

for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)

D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21

Performance of 25D APSP using Kleene

53

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)

62xspeedup

2x speedup

What about sparse matrices (23)

bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of

G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix

bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering

for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid

bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse

multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on

separators) 54

What about sparse matrices (33)

bull If matrix stays very sparse lower bound unattainable new one

bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of

data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)

bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)

Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely

bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices

along dimensions most likely to minimize cost

55

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Symmetric Eigenproblem and SVD

bull Usual approach for A=AT (SVD similar)ndash A QTAQ = T where Q orthogonal T tridiagonalndash T UTTU = Λ where U orthogonal Λ diagonalndash QUrsquos columns are eigenvectors Λ eigenvaluesndash Dense Tridiagonal Diagonalndash Only half BLAS3 half BLAS2 in LAPACKrsquos sytrd

bull Communication-Avoiding Approachndash A QAQT = B where B=BT banded of bandwidth M12

ndash Continue as above starting with Bndash Dense Banded Tridiagonal Diagonalndash Dense Banded use TSQR to zero out M12 colsrows at a timendash Banded Tridiagonal need new(ish) idea

b+1

b+1

Successive Band Reduction (BischofLangSun)

1

b+1

b+1

d+1

c

Successive Band Reduction (BischofLangSun)

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1Q1

b+1

b+1

d+1

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

12

Q1

b+1

b+1

d+1

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

12

Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

Q1

Q1T

Q2

Q2T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

Q1

Q1T

Q2

Q2T

Q3

Q3T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

6

6

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

Conventional vs CA - SBR

Conventional Communication-Avoiding

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Nonsymmetric Eigenproblem

bull No apparent way to modify standard algorithmbull Instead Spectral Divide-and-Conquer

ndash Find orthogonal matrix Q whose leading columns span an invariant subspace of A

ndash QTAQ will be block upper triangular

ndash Apply recursively to A11 A22

ndash Depends on randomization1 Randomized Rank Revealing QR decomposition2 Randomized location to try splitting spectrum

A11 A12

ε A22

Attaining the Lower bounds SequentialLegend[Existing][Ours][Math-Lib][Random]

Two Levels Memory Hierarchy

Words Messages Words Messages

BLAS-3 [FLPRrsquo99][BDLSTrsquo13][MKL etc] [FLPRrsquo99][BDLSTrsquo13][MKL etc]

Cholesky[Grsquo97][APrsquo00]

[LAPACK][BDHSrsquo09]

[Grsquo97][APrsquo00][BDHSrsquo09] [Grsquo97][APrsquo00][BDHSrsquo09]

Sym Indefinite [BBDDDPSTYrsquo13] [BBDDDPSTYrsquo13]

LU[Grsquo97][Trsquo97]

[GDXrsquo11][BDLSTrsquo13]

[GDXrsquo11][BDLSTrsquo13]

[Grsquo97][Trsquo97] [BDLSTrsquo13] [BDLSTrsquo13]

QR[EGrsquo98][FWrsquo03]

[DGHLrsquo12][BDLSTrsquo13]

[FWrsquo03][DGHLrsquo12][BDLSTrsquo13]

[EGrsquo98][FWrsquo03][BDLSTrsquo13]

[FWrsquo03][BDLSTrsquo13]

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13] [BDDrsquo11]

Non Sym Eig [BDDrsquo11] [BDDrsquo11]

Legend[Existing][Ours][Math-Lib][Random]

Words (BW) Messages (L) Saving factor

BLAS-3 [AGZrsquo94][MTrsquo99][ScaLAPACK][Crsquo69][vGWrsquo97][SDrsquo11] L nP12

Cholesky [ScaLAPACK][Trsquo99][SDrsquo11] L nP12

Sym Indefinite [BBDDDPSTYrsquo13][ScaLAPACK] [BBDDDPSTYrsquo13] L nP12

LU [ScaLAPACK][GDXrsquo11][Trsquo99][SDrsquo11] [GDXrsquo11][Trsquo99][SDrsquo11] L nP12

QR [ScaLAPACK][DGHLrsquo12] [Trsquo99] [DGHLrsquo12][Trsquo99] L nP12

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13][ScaLAPACK] [BDDrsquo11][BDKrsquo13] L nP12

Non-Sym Eig [BDDrsquo11] [BDDrsquo11] BW P12 L n

Attaining with extra memory 25D M=(cn2P)

Attaining the Lower bounds Parallel 2DM=(n2P)(Ignoring poly-log(P) factors words = ( n2 P12) messages = (P12)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Conjugate Gradients (CG) GMRES Lanczos Arnoldi hellip bull Goal minimize communication

ndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computationndash Challenges Poor partitioning Preconditioning Num Stability

75

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Example The Difficulty of Tuning SpMV

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

77

Example The Difficulty of Tuning

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

bull 8x8 dense substructure exploit this to limit mem_refs

78

Speedups on Itanium 2 The Need for Search

Reference

Best 4x2

Mflops

Mflops

79

Register Profile Itanium 2

190 Mflops

1190 Mflops

80

Register Profiles IBM and Intel IA-64Power3 - 17 Power4 - 16

Itanium 2 - 33Itanium 1 - 8

252 Mflops

122 Mflops

820 Mflops

459 Mflops

247 Mflops

107 Mflops

12 Gflops

190 Mflops

Another example of tuning challenges for SpMV

bull Ex11 matrix (fluid flow)

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

82

Zoom in to top corner

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

83

3x3 blocks look natural buthellip

bull Example 3x3 blockingndash Logical grid of 3x3 cells

bull But would lead to lots of ldquofill-inrdquo

84

Extra Work Can Improve Efficiency

bull Example 3x3 blockingndash Logical grid of 3x3 cellsndash Fill-in explicit zerosndash Unroll 3x3 block multipliesndash ldquoFill ratiordquo = 15

bull On Pentium III 15x speedup

ndash Actual mflop rate 152 = 225 higher

85

Source Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal

Summer School Lecture 7

87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

Before Green + RedAfter Green + Blue

Summer School Lecture 7

892x speedups on Pentium 4 Power 4 hellip

Summary of Other Performance Optimizations

bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip

bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR

bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip

90

Optimized Sparse Kernel Interface - OSKI

bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning

bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski

bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression

software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki

91

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

93

Example Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in

each iteration

via CA Matrix Powers Kernel

Global reduction to compute G

94

Example CA-Conjugate Gradient

Local computations within inner loop require

no communication

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

96

[Convergence plot: CG vs. CA-CG with the monomial basis on a model problem (2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400). CA-CG shows slower convergence and loss of accuracy due to roundoff; at s = 16 the monomial basis is rank deficient and the method breaks down. Machine precision is marked for reference.]

97
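A small experiment (assumed setup, matching the slide's model problem) showing why the monomial basis fails: build K = [p, Ap, ..., A^s p] for the 2D Poisson matrix and watch its condition number grow rapidly with s. The vectors line up with the dominant eigenvector, and in finite precision the basis eventually becomes numerically rank deficient; Newton or Chebyshev bases are the usual remedy.

import numpy as np
import scipy.sparse as sp

m = 30                                             # 30x30 grid, 5-point stencil
I = sp.identity(m)
T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(m, m))
A = (sp.kron(I, T) + sp.kron(T, I)).tocsr()        # 2D Poisson, n = 900

rng = np.random.default_rng(0)
p = rng.standard_normal(A.shape[0])
for s in (4, 8, 12, 16):
    K = np.empty((A.shape[0], s + 1))
    K[:, 0] = p / np.linalg.norm(p)
    for j in range(s):
        v = A @ K[:, j]                            # monomial basis: repeated SpMV
        K[:, j + 1] = v / np.linalg.norm(v)        # scale columns so norm growth is not the issue
    sv = np.linalg.svd(K, compute_uv=False)
    print(f"s = {s:2d}   cond(K) = {sv[0] / sv[-1]:.2e}")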

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a “sparse matrix”?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a “sparse matrix”?
• Requires o(n²) data/indices to store
• Nonzero entries and indices could be explicit or implicit

• Matrix could be sum of “sparse” matrices
  – Ex: A = sparse + low rank = S + U·D·Vᵀ, D small & square

• Semiseparable matrices arise as preconditioners
  – Need to write Aᵏ = (S + U·D·Vᵀ)ᵏ as a sum of Sᵏ and low-rank matrices

Nonzero entries \ Indices     Explicit (O(nnz))        Implicit (o(nnz))
Explicit (O(nnz))             CSR and variations       Vision, climate, AMR, …
Implicit (o(nnz))             Graph Laplacian          Stencils
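A sketch of the "sparse + low rank" case from the bullets above (illustrative sizes and names): apply A = S + U·D·Vᵀ to a vector without ever forming A, and get Aᵏ·x by repeated application, at O(nnz + n·r) work per apply instead of O(n²).

import numpy as np
import scipy.sparse as sp

n, rank = 5000, 10
rng = np.random.default_rng(0)
S = sp.random(n, n, density=1e-3, format="csr", random_state=0)   # sparse part
U = rng.standard_normal((n, rank))
D = np.diag(rng.standard_normal(rank))                            # small & square
V = rng.standard_normal((n, rank))

def apply_A(x):
    return S @ x + U @ (D @ (V.T @ x))       # never forms the dense n x n matrix

def apply_Ak(x, k):                          # A^k x by k applications
    for _ in range(k):
        x = apply_A(x)
    return x

x = rng.standard_normal(n)
print("||A^3 x|| =", np.linalg.norm(apply_Ak(x, 3)))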

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a “sparse matrix”?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

101

Reproducible Floating Point Computation

• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm at GNS-MBH
  – Sought reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe results
  – Willing to sacrifice 40% - 50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: “What?! How will I debug without reproducibility?”
  – Few: “I know better, and do careful error analysis”
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility

Vector size 1e6, data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum - minimum
• Relative error = absolute error / maximum absolute value

[Plots: absolute error for random vectors (results of the same magnitude but opposite signs) and relative error for orthogonal vectors (even the sign is not reproducible).]

103
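The effect is easy to reproduce without MKL (a toy illustration, not the experiment on the slide): floating-point addition is not associative, so partial sums formed per "thread" and then combined give a slightly different answer than a straight serial sum over the same data.

import numpy as np

rng = np.random.default_rng(42)
x = rng.standard_normal(10**6)

serial = np.sum(x)
chunked = sum(np.sum(c) for c in np.array_split(x, 4))   # emulate 4 threads' partial sums
reversed_order = np.sum(x[::-1])                          # same data, different order

print(serial - chunked)          # typically a few ulps away from 0
print(serial - reversed_order)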

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, number of processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (gives up 2 or 3)
  – Use (very) high precision to get exact answer (gives up 2)
  – Prerounding technique (Nguyen, D.)

104
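A single-bin, hedged sketch of the prerounding idea (real implementations in the ReproBLAS style use several bins to recover full accuracy; the function and variable names here are mine): round every summand onto one coarse grid chosen from max|x_i|, so that adding the rounded parts is exact and therefore independent of the order of the summands and the number of processors.

import numpy as np

def prerounded_sum(x, n_max=2**20):
    x = np.asarray(x, dtype=np.float64)
    m = np.max(np.abs(x))                    # max is itself order-independent
    if m == 0.0:
        return 0.0
    # Coarse grid: each rounded summand keeps few enough bits that up to
    # n_max of them add with no rounding error at all.
    boundary = 2.0 ** (np.ceil(np.log2(m)) + np.ceil(np.log2(n_max)))
    q = (x + boundary) - boundary            # round each x_i to a multiple of ulp(boundary)
    return float(np.sum(q))                  # every addition is exact, hence reproducible

rng = np.random.default_rng(0)
x = rng.standard_normal(10**6)
s1 = prerounded_sum(x)
s2 = prerounded_sum(x[::-1])                 # different order, bit-identical result
print(s1 == s2, abs(s1 - np.sum(x)))         # True, plus a small bounded rounding error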

Performance results on 1024-processor Cray XC30: 1.2x to 3.2x slowdown vs. fastest code, for n = 1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle

• bebop.cs.berkeley.edu

Summary

Don't Communic…

106

Time to redesign all linear algebra, n-body, … algorithms and software

(and compilers)



Handling Heterogeneity

bull Suppose each of P processors could differndash γi = secflop βi = secword αi = secmessage Mi = memory

bull What is optimal assignment of work Fi to minimize timendash Ti = Fi γi + Fi βi Mi

12 + Fi αi Mi32 = Fi [γi + βi Mi

12 + αi Mi32] = Fi ξi

ndash Choose Fi so Σi Fi = n3 and minimizing T = maxi Ti

ndash Answer Fi = n3(1ξi)Σj(1ξj) and T = n3Σj(1ξj)

bull Optimal Algorithm for nxn matmulndash Recursively divide into 8 half-sized subproblemsndash Assign subproblems to processor i to add up to Fi flops

bull Works for Strassen other algorithmshellip

Application to Tensor Contractions

bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

bull Heavily used in electronic structure calculationsndash Ex NWChem

bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries

ndash Solomonik Hammond Matthews

C(ijk) = Σm A(ijm)B(mk)

A3-fold symm

B2-fold symm

C2-fold symm

Application to Tensor Contractions

bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn

bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews

Communication Lower Bounds for Strassen-like matmul algorithms

bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected

bull Extends up to M = n2 p2ω bull Extends to rectangular case multiply (mxn)(nxp) in q mults

ndash words_moved = Ω (flopsM^(logmpq -1))

bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz also in JACMbull Is the lower bound attainable

Classical O(n3) matmul

words_moved =Ω (M(nM12)3P)

Strassenrsquos O(nlg7) matmul

words_moved =Ω (M(nM12)lg7P)

Strassen-like O(nω) matmul

words_moved =Ω (M(nM12)ωP)

vs

Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory

Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory

CAPS If EnoughMemory and P 7 then BFS step else DFS step end if

Communication Avoiding Parallel Strassen (CAPS)

Best way to interleaveBFS and DFS is an tuning parameter

26

Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

Speedups 24-184(over previous Strassen-based algorithms)

Invited to appear as Research Highlight in CACM

Strassen-like beyond matmul

bull Thm (D Dumitriu Holtzrsquo07) Any Strassen-like O(nω) matmul algorithm can be used to build a numerically stable O(nω+η) algorithm for any ηgt0 for Ax=b least squares eig SVD hellipndash ηgt0 needed to deal with numerical stabilityndash Strassen already stable so η=0

bull Thm For sequential versions of these algorithms Words_moved = O(nω+ηM(ω+η)2 ndash 1 + n2 log n) ie attain expected lower bound

Ballard D Holtz Schwartz

Cache and Network Oblivious Algorithms

bull Motivation Minimizes communication at every level of a hierarchical system without tuning parameters (in theory)ndash Not always 25D Matmul on BGP was topology aware

bull CAPS Divide-and-conquer choose BFS or DFS to adapt to processors available memory

bull CARMAndash Divide-and-conquer classical matmul divide largest of 3

dimensions to create two subproblemsndash Choose BFS or DFS to adapt to processors available memory

CARMA Performance Distributed Memory

Square m = k = n = 6144

ScaLAPACK

CARMA

Peak

(log)

(log)

Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA

CARMA Performance Distributed Memory

Inner Product m = n = 192 k = 6291456

ScaLAPACK

CARMAPeak

(log)

(log)

Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA

CARMA Performance Shared Memory

Square m = k = n

MKL (double)CARMA (double)

MKL (single)CARMA (single)

Peak (single)

Peak (double)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

CARMA Performance Shared Memory

Inner Product m = n = 64

MKL (double)

CARMA (double)

MKL (single)

CARMA (single)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

Why is CARMA Faster in Shared MemoryL3 Cache Misses

Shared Memory Inner Product (m = n = 64 k = 524288)

97 Fewer Misses

86 Fewer Misses

(linear)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

One-sided Factorizations (LU QR) so farbull Classical Approach for i=1 to n update column i update trailing matrixbull words_moved = O(n3)

35

bull Blocked Approach (LAPACK) for i=1 to nb update block i of b columns update trailing matrixbull words moved = O(n3M13)

bull Recursive Approach func factor(A) if A has 1 column update it

else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)

bull None of these approaches minimizes messagesbull Parallel case Partial

Pivoting =gt n reductionsbull Need another idea

TSQR An Architecture-Dependent Algorithm

W =

W0

W1

W2

W3

R00

R10

R20

R30

R01

R11

R02Parallel

W =

W0

W1

W2

W3

R01R02

R00

R03

SequentialStreaming

W =

W0

W1

W2

W3

R00

R01

R01

R11

R02

R11

R03

Dual Core

Can choose reduction tree dynamically

Multicore Multisocket Multirack Multisite Out-of-core

Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo

Wnxb =

W1

W2

W3

W4

P1middotL1middotU1

P2middotL2middotU2

P3middotL3middotU3

P4middotL4middotU4

=

Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo

W1rsquoW2rsquoW3rsquoW4rsquo

P12middotL12middotU12

P34middotL34middotU34

=Choose b pivot rows call them W12rsquo

Choose b pivot rows call them W34rsquo

W12rsquoW34rsquo

= P1234middotL1234middotU1234

Choose b pivot rows

Go back to W and use these b pivot rows (move them to top do LU without pivoting)

37

Minimizing Communication in TSLU

W = W1

W2

W3

W4

LULULULU

LU

LULUParallel

W = W1

W2

W3

W4

LULU

LU

LUSequentialStreaming

W = W1

W2

W3

W4

LULU LU

LULU

LULU

Dual Core

Can choose reduction tree dynamically to match architecture as before

38

Making TSLU Numerically Stable

bull Details matterndash Going up the tree we could do LU either on original rows of A

(tournament pivoting) or computed rows of Undash Only tournament pivoting stable

bull ldquoThmrdquo New scheme as stable as Partial Pivoting (GEPP) in following sense Get same Schur complements as GEPP applied to different input matrix whose entries are blocks taken from input A

bull Why just a ldquoThmrdquo

39

Stability of LU using TSLU CALU

Summer School Lecture 4 40

bull Empirical testingndash Both random matrices and ldquospecial onesrdquondash Both binary tree (BCALU) and flat-tree (FCALU)ndash 3 metrics ||PA-LU||||A|| normwise and componentwise backward errorsndash See [D Grigori Xiang 2010] for details

Why is stability of TSLU just a ldquoThmrdquo

bull Proof is correct ndash in exact arithmeticbull Experiment

ndash Generate 100 random 6x6 rank 3 matrices in Matlabndash [LUP] = lu(A) do LU without pivoting on PA compare L factors are

they the samebull Compute || L ndash Lnp || A few 0rsquos A few infinrsquos a few NaNsbull Rest mostly O(1)

ndash Why Floating point is nonassociative doing arithmetic in different order gives different rounding errors

ndash Same experiment with rank 6 matrices || L ndash Lnp || usually nonzero O(macheps)

ndash Same experiment with 20x20 rank 4 matrices || L ndash Lnp || often O(103)

bull Much harder to break TSLU but possiblendash Occurred when using TSLU to factorize a low-rank subdiagonal

panel in symmetric-indefinite factorization41

Fixing TSLU

bull Run TSLU quickly test for stability fix if necessary (rare)

bull Test conditioning of U if not tiny (usual case) proceed elsebull Compute || L || if not big (usual case) proceed elsebull Factor A = QR using TSQR thenbull Factor Q = PLU using TSLU thenbull A = PL(UR) with UR as upper triangular factor

bull Last topic in lecture how to guarantee floating point reproducibility

42

2D CALU with Tournament Pivoting

43

25D CALU with Tournament Pivoting (c=4 copies)

44

Exascale Machine ParametersSource DOE Exascale Workshop

bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log

2 (

n2p

) =

log

2 (m

emo

ry_p

er_p

roc)

Up to 29x

25D vs 2D LUWith and Without Pivoting

Other CA algorithms for Ax=b least squares(13)

bull A symmetric and indefinitendash Seek factorization that retains symmetry PAPT = LDLT D

ldquosimplerdquobull Save frac12 flops preserve inertia

ndash Usual approach Bunch-Kaufmanbull D block diagonal with 1x1 and 2x2 blocksbull Pivot search down column along row (lots of communication)

ndash Alternative Aasenbull D = tridiagonal = Tbull Two steps

ndash PAPT = LTLT where T is banded using TSLU

48

0 0

0

0 0

0

0

hellip

hellip

ndash Solvefactor narrow band problem with Tbull Up to 28x faster than MKL Best Paper at IPDPSrsquo13

Other CA algorithms for Ax=b least squares (23)bull Minimizing bandwidth and latency for sequential GEPP

ndash So far could not do partial pivoting and minimize messages just words

ndash Challengebull Column layout good for choosing pivots bad for matmulbull Blocked layout good for matmul bad for choosing pivots

ndash Solution use both layouts switching between thembull ldquoShape Morphing LUrdquo or SMLU

49

bull func factor(A) if A has 1 column update it else factor(left half of A)

update right half of A

factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M)

bull func factor(A) if A has 1 column update it else factor(left half of A) reshape to recursive block format update right half of A reshape to columnwise format factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M32)

Other CA algorithms for Ax=b least squares (33)bull Need for pivoting arises beyond LU in QR

ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)

ndash Usual approach like Partial Pivoting

bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound

ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two

groups of b columns either using usual approach or something better (GuEisenstat)

bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix

ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 50

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What about sparse matrices (13)

bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then

bull But canrsquot reorder outer loop for 25D need another idea

bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring

bull Kleenersquos Algorithm

52

for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)

D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21

Performance of 2.5D APSP using Kleene

[Plot: strong scaling on Hopper (Cray XE6 with 1024 nodes = 24,576 cores); annotations: 6.2x speedup, 2x speedup]

53

What about sparse matrices? (2/3)

• If parts of the matrix become dense, optimize those
• Ex: Cholesky on a matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): If all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on a w x w submatrix
• Thm: #Words_moved = Ω(w^3 / M^(1/2)), etc.
• Thm (George '73): Nested dissection gives an optimal ordering for 2D grids, 3D grids, similar matrices
  – w = n for a 2D n x n grid, w = n^2 for a 3D n x n x n grid
• Sequential multifrontal Cholesky attains the bounds
• PSPASES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains the 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)

54

What about sparse matrices? (3/3)

• If the matrix stays very sparse, the lower bound is unattainable; need a new one
• Ex: A, B both diagonal: no communication in the parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), i.i.d.
• Assumption: the algorithm is sparsity-independent: the assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
  #Words_moved = Ω( min( d·n / P^(1/2), d^2·n / P ) )
  – Proof exploits the fact that reuse of entries of C = A·B is unlikely
• Contrast the general lower bound: #Words_moved = Ω( d^2·n / (P·M^(1/2)) )
• Attained by a divide-and-conquer algorithm that splits the matrices along the dimensions most likely to minimize cost

55

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Symmetric Eigenproblem and SVD

• Usual approach for A = A^T (SVD similar)
  – A → Q^T·A·Q = T where Q orthogonal, T tridiagonal
  – T → U^T·T·U = Λ where U orthogonal, Λ diagonal
  – Q·U's columns are eigenvectors, Λ eigenvalues
  – Dense → Tridiagonal → Diagonal
  – Only half BLAS3, half BLAS2, in LAPACK's sytrd
• Communication-Avoiding Approach
  – A → Q·A·Q^T = B where B = B^T banded, of bandwidth M^(1/2)
  – Continue as above, starting with B
  – Dense → Banded → Tridiagonal → Diagonal
  – Dense → Banded: use TSQR to zero out M^(1/2) cols/rows at a time
  – Banded → Tridiagonal: need a new(ish) idea

[Slides: Successive Band Reduction (Bischof/Lang/Sun), shown as an animated sequence. Sweeps Q1, Q1^T, Q2, Q2^T, … chase the bulges (labeled 1, 2, 3, …) created while eliminating c columns of the band at a time. Legend: b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b]

Conventional vs. CA-SBR

Conventional: touch all data 4 times.    Communication-Avoiding: touch all data once.

[Two animations comparing the conventional and communication-avoiding bulge-chasing schedules]

Speedups of Sym. Band Reduction vs. DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem

• No apparent way to modify the standard algorithm
• Instead: Spectral Divide-and-Conquer
  – Find an orthogonal matrix Q whose leading columns span an invariant subspace of A
  – Q^T·A·Q will be block upper triangular:

        [ A11  A12 ]
        [  ε   A22 ]

  – Apply recursively to A11, A22
  – Depends on randomization:
    1. Randomized Rank Revealing QR decomposition
    2. Randomized location to try splitting the spectrum
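The sketch below only illustrates what one splitting step computes; it uses the classical sign-function iteration with an explicit inverse, not the communication-avoiding randomized algorithm referenced above (which replaces the inverse by randomized RRQR and implicit repeated squaring). It assumes no eigenvalues sit near the splitting line Re(z) = sigma.

    import numpy as np
    from scipy.linalg import qr

    def split_spectrum(A, sigma=0.0, iters=40):
        """One spectral divide-and-conquer step: split eigenvalues at Re(z) = sigma."""
        n = A.shape[0]
        S = A - sigma * np.eye(n)
        for _ in range(iters):                 # Newton iteration for sign(A - sigma*I)
            S = 0.5 * (S + np.linalg.inv(S))
        P = 0.5 * (np.eye(n) + S)              # projector onto the invariant subspace
        Q, _, _ = qr(P, pivoting=True)         # leading columns of Q span range(P)
        T = Q.T @ A @ Q                        # block upper triangular up to roundoff
        k = int(round(np.trace(P)))            # dimension of A11
        return Q, T, k

    A = np.random.rand(6, 6)
    Q, T, k = split_spectrum(A, sigma=np.trace(A) / 6)
    # ||T[k:, :k]|| plays the role of the epsilon block; it is tiny when the
    # spectrum is well separated from the splitting line
    print(k, np.linalg.norm(T[k:, :k]))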

Attaining the Lower Bounds: Sequential
Legend: [Existing], [Ours], [Math-Lib], [Random]
(Columns: Two Levels of memory — #Words, #Messages; full Memory Hierarchy — #Words, #Messages)

BLAS-3:            [FLPR'99] [BDLST'13] [MKL etc.]  |  [FLPR'99] [BDLST'13] [MKL etc.]
Cholesky:          [G'97] [AP'00] [LAPACK] [BDHS'09]  |  [G'97] [AP'00] [BDHS'09]  |  [G'97] [AP'00] [BDHS'09]
Sym. Indefinite:   [BBDDDPSTY'13]  |  [BBDDDPSTY'13]
LU:                [G'97] [T'97] [GDX'11] [BDLST'13]  |  [GDX'11] [BDLST'13]  |  [G'97] [T'97] [BDLST'13]  |  [BDLST'13]
QR:                [EG'98] [FW'03] [DGHL'12] [BDLST'13]  |  [FW'03] [DGHL'12] [BDLST'13]  |  [EG'98] [FW'03] [BDLST'13]  |  [FW'03] [BDLST'13]
Rank-Revealing QR: [BDD'11] [DGGX'13]
Sym. Eig & SVD:    [BDD'11] [BDK'13]  |  [BDD'11]
Non-Sym. Eig:      [BDD'11]  |  [BDD'11]

Attaining the Lower Bounds: Parallel 2D, M = Ω(n^2/P)
(Ignoring poly-log(P) factors; #words = Ω(n^2/P^(1/2)), #messages = Ω(P^(1/2)))
Legend: [Existing], [Ours], [Math-Lib], [Random]
(Columns: #Words (BW), #Messages (L); last column = saving factor attained with extra memory, 2.5D, M = Ω(c·n^2/P))

BLAS-3:            [AGZ'94] [MT'99] [ScaLAPACK]  |  [C'69] [vGW'97] [SD'11]  |  L: n/P^(1/2)
Cholesky:          [ScaLAPACK]  |  [T'99] [SD'11]  |  L: n/P^(1/2)
Sym. Indefinite:   [BBDDDPSTY'13] [ScaLAPACK]  |  [BBDDDPSTY'13]  |  L: n/P^(1/2)
LU:                [ScaLAPACK] [GDX'11] [T'99] [SD'11]  |  [GDX'11] [T'99] [SD'11]  |  L: n/P^(1/2)
QR:                [ScaLAPACK] [DGHL'12] [T'99]  |  [DGHL'12] [T'99]  |  L: n/P^(1/2)
Rank-Revealing QR: [BDD'11] [DGGX'13]
Sym. Eig & SVD:    [BDD'11] [BDK'13] [ScaLAPACK]  |  [BDD'11] [BDK'13]  |  L: n/P^(1/2)
Non-Sym. Eig:      [BDD'11]  |  [BDD'11]  |  BW: P^(1/2), L: n

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and a starting vector
  – Many such "Krylov Subspace Methods"
    • Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume the matrix is "well-partitioned"
  – Serial implementation
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors
    • Conventional: O(k log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability

75
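A minimal sketch of the serial "matrix powers kernel" idea follows, for a 1D stencil matrix (an assumption for illustration; the real kernel works on general graph partitions). To compute x, Ax, ..., A^k x on one block of rows while reading that block of x only once, fetch the block plus a depth-k ghost zone and do k local SpMVs on the widened block.

    import numpy as np

    def local_matrix_powers(x, lo, hi, k):
        """(A^j x)[lo:hi] for j = 0..k, where (A x)[i] = 2 x[i] - x[i-1] - x[i+1]
        with zero boundary; x's entries are read from slow memory only once."""
        n = len(x)
        g_lo, g_hi = max(0, lo - k), min(n, hi + k)   # ghost zone of depth k
        v = x[g_lo:g_hi].copy()                       # the only read of this block
        out = [v[lo - g_lo:hi - g_lo].copy()]
        for _ in range(k):
            w = 2 * v
            w[1:] -= v[:-1]
            w[:-1] -= v[1:]
            v = w                                     # edges of the ghost zone become
            out.append(v[lo - g_lo:hi - g_lo].copy()) # wrong, but [lo:hi] stays exact
        return out

    # check against explicit matrix powers
    n, k = 40, 3
    A = 2*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    x = np.random.rand(n)
    blocks = local_matrix_powers(x, 10, 20, k)
    for j in range(k + 1):
        assert np.allclose(blocks[j], (np.linalg.matrix_power(A, j) @ x)[10:20])

The redundant work is the computation repeated inside overlapping ghost zones; that is the "price: some redundant computation" above.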

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV

• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

77

Example: The Difficulty of Tuning

• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs

78

Speedups on Itanium 2: The Need for Search

[Heat map of Mflop/s over register block sizes; annotations mark the reference implementation and the best block size, 4x2]

79

Register Profile: Itanium 2

[Heat map: register blocking profile on Itanium 2, ranging from 190 Mflop/s to 1190 Mflop/s]

80

Register Profiles: IBM and Intel IA-64

[Four heat maps of register-blocking performance: Power3 (17), Power4 (16), Itanium 1 (8), Itanium 2 (33); best vs. reference rates per platform include 252 vs 122 Mflop/s, 820 vs 459 Mflop/s, 247 vs 107 Mflop/s, and 1.2 Gflop/s vs 190 Mflop/s]

Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16,614
• NNZ = 1.1 M

82

Zoom in to top corner

• More complicated non-zero structure in general
• N = 16,614
• NNZ = 1.1 M

83

3x3 blocks look natural, but…

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

84

Extra Work Can Improve Efficiency

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup!
  – Actual mflop rate 1.5^2 = 2.25x higher

85
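A small sketch of the register-blocking idea using scipy's BSR format follows (3x3 blocks, as in the example above). The explicit zeros scipy pads into partially full blocks are exactly the "fill", and the fill ratio below measures the extra work traded for the dense, unrollable inner kernel.

    import numpy as np
    import scipy.sparse as sp

    A = sp.random(300, 300, density=0.02, format='csr', random_state=0)
    B = A.tobsr(blocksize=(3, 3))              # pads blocks with explicit zeros

    def bsr_spmv(B, x):
        """Reference BSR SpMV: the inner operation is a dense r x c block
        multiply; this is the loop a tuned kernel would unroll."""
        r, c = B.blocksize
        y = np.zeros(B.shape[0])
        for ib in range(B.shape[0] // r):
            for k in range(B.indptr[ib], B.indptr[ib + 1]):
                jb = B.indices[k]
                y[ib*r:(ib+1)*r] += B.data[k] @ x[jb*c:(jb+1)*c]
        return y

    x = np.random.rand(300)
    assert np.allclose(bsr_spmv(B, x), A @ x)
    print("fill ratio:", B.data.size / A.nnz)  # extra flops traded for locality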

[Matrix nonzero-structure plot] Source: Accelerator Cavity Design Problem (Ko, via Husbands)

86

100x100 Submatrix Along Diagonal

87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

Before: Green + Red.  After: Green + Blue.
2x speedups on Pentium 4, Power 4, …

89

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·A^T·x, A^T·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later …

90

Optimized Sparse Kernel Interface - OSKI

• Provides sparse kernels automatically tuned for the user's matrix & machine
  – BLAS-style functionality: SpMV (Ax & A^T·y), TrSV
  – Does both off-line and run-time tuning
  – Hides the complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as a stand-alone library
  – Available as a PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

91

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

93

Example: Classical Conjugate Gradient (CG)

[Algorithm listing (image); annotation: the SpMVs and dot products require communication in each iteration]

94

Example: CA-Conjugate Gradient

[Algorithm listing (image); annotations: "via CA Matrix Powers Kernel", "Global reduction to compute G", "Local computations within inner loop require no communication"]
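The full CA-CG recurrence is not reproduced here, but a short sketch (dense A and illustrative names, not the talk's code) shows the two ingredients the annotations refer to: one matrix-powers call builds the s-step basis V, one reduction forms the Gram matrix G = V^T V, and afterwards dot products of vectors kept as coefficients in that basis are small local computations.

    import numpy as np

    def krylov_basis(A, p, s):
        V = [p]
        for _ in range(s):
            V.append(A @ V[-1])          # in CA-CG: the matrix powers kernel
        return np.column_stack(V)        # n x (s+1)

    n, s = 100, 4
    A = np.diag(2*np.ones(n)) - np.diag(np.ones(n-1), 1) - np.diag(np.ones(n-1), -1)
    p = np.random.rand(n)
    V = krylov_basis(A, p, s)
    G = V.T @ V                          # one global reduction in the parallel case

    a, b = np.random.rand(s+1), np.random.rand(s+1)
    x, y = V @ a, V @ b
    assert np.isclose(x @ y, a @ G @ b)  # dot products become tiny local products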

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

96

[Convergence plot: CA-CG (monomial basis) vs. CG on a model problem (2D Poisson, 5-point stencil, 30x30 grid, cond(A) ~ 400). Annotations: slower convergence due to roundoff; loss of accuracy due to roundoff; at s = 16 the monomial basis is rank deficient and the method breaks down; the machine-precision level is marked]

97
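A quick, illustrative check (not the experiment in the plot) of why the monomial basis fails: its condition number grows rapidly with s, and around s = 16 it reaches the order of 1/machine-epsilon, i.e. the basis is numerically rank deficient. The model problem below matches the one described above.

    import numpy as np

    def poisson2d(m):
        """5-point Laplacian on an m x m grid (the model problem in the plot)."""
        T = 2*np.eye(m) - np.eye(m, k=1) - np.eye(m, k=-1)
        return np.kron(np.eye(m), T) + np.kron(T, np.eye(m))

    A = poisson2d(30)                    # cond(A) ~ 400
    v = np.random.rand(A.shape[0])
    V = [v / np.linalg.norm(v)]
    for s in range(1, 17):
        w = A @ V[-1]
        V.append(w / np.linalg.norm(w))
        if s in (4, 8, 16):
            B = np.column_stack(V)
            print(s, np.linalg.cond(B), np.linalg.matrix_rank(B))
    # The printed condition number grows by many orders of magnitude with s;
    # by s = 16 the normalized monomial basis is numerically rank deficient,
    # which is the breakdown seen in the plot (and motivates Newton or
    # Chebyshev bases in practice).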

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?

• Requires o(n^2) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be a sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + U·D·V^T, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write A^k = (S + U·D·V^T)^k as a sum of S^k and low-rank matrices

Examples (nonzero entries vs. indices):
                             Indices explicit (O(nnz))   Indices implicit (o(nnz))
  Entries explicit (O(nnz)): CSR and variations          Vision, climate, AMR, …
  Entries implicit (o(nnz)): Graph Laplacian             Stencils
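A minimal sketch of exploiting the "sparse + low rank" structure above: A = S + U·D·V^T is never formed, and applying A (or its powers) to a vector needs only an SpMV with S plus small dense products with U, D, V. Sizes and names below are illustrative.

    import numpy as np
    import scipy.sparse as sp

    n, r = 1000, 5
    S = sp.random(n, n, density=0.01, format='csr', random_state=0)
    U, V = np.random.rand(n, r), np.random.rand(n, r)
    D = np.random.rand(r, r)

    def apply_A(x):
        # O(nnz(S) + n*r) work; no dense n x n matrix is ever built
        return S @ x + U @ (D @ (V.T @ x))

    def apply_Ak(x, k):
        for _ in range(k):
            x = apply_A(x)
        return x

    x = np.random.rand(n)
    A_dense = S.toarray() + U @ D @ V.T
    assert np.allclose(apply_Ak(x, 3), np.linalg.matrix_power(A_dense, 3) @ x)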

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Reproducible Floating Point Computation

• Get a bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe the results
  – Willing to sacrifice 40% - 50% of performance for it
• Email to ~110 Berkeley CSE faculty, asking about it
  – Most: "What? How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

101

Intel MKL non-reproducibility

[Plots: absolute error for random vectors ("same magnitude, opposite signs"); relative error for orthogonal vectors ("sign not reproducible")]

Vector size: 1e6. Data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3 or 4 threads
• Absolute error = maximum – minimum
• Relative error = Absolute error / maximum absolute value

103
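A tiny demonstration (illustrative only, not the MKL experiment above) of the underlying cause: floating-point summation is not associative, so changing the number of threads, i.e. the reduction order, can change the computed dot product.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(10**6)
    y = rng.standard_normal(10**6)
    terms = x * y

    serial    = terms.sum()
    chunked_2 = sum(c.sum() for c in np.array_split(terms, 2))   # "2 threads"
    chunked_4 = sum(c.sum() for c in np.array_split(terms, 4))   # "4 threads"
    # Typically nonzero: same data, different reduction order, different bits
    print(serial - chunked_2, serial - chunked_4)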

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee a fixed reduction tree (fails 2. or 3.)
  – Use (very) high precision to get the exact answer (fails 2.)
  – Prerounding technique (Nguyen, D.)

104
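A simplified sketch of the pre-rounding idea follows (two passes, a single "bin"; the Nguyen/Demmel algorithm uses a few bins and a single reduction, and ReproBLAS implements it efficiently). Each term is rounded to a multiple of a common power-of-two unit derived from max|x_i| and n, so every addition of the pre-rounded terms is exact and the result is bitwise identical in any order; accuracy is traded for reproducibility.

    import math, random

    def reproducible_sum(x):
        n = len(x)
        if n == 0:
            return 0.0
        M = max(abs(v) for v in x)
        if M == 0.0:
            return 0.0
        # Headroom so every partial sum of pre-rounded terms is an exact
        # multiple of `unit` representable in double precision.
        keep_bits = 52 - (n.bit_length() + 1)
        unit = 2.0 ** (math.floor(math.log2(M)) + 1 - keep_bits)
        total = 0.0
        for v in x:
            total += math.floor(v / unit) * unit   # exact additions, any order
        return total

    data = [random.uniform(-1, 1) for _ in range(10**5)]
    s1 = reproducible_sum(data)
    random.shuffle(data)
    s2 = reproducible_sum(data)
    assert s1 == s2                      # bitwise identical despite reordering
    print(abs(s1 - math.fsum(data)))     # small: the accuracy/reproducibility tradeoff

In a distributed setting, computing max|x_i| would cost an extra reduction; avoiding that second reduction is one of the contributions of the pre-rounding algorithm referenced above.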

Performance results on 1024 processors of a Cray XC30: 1.2x to 3.2x slowdown vs. the fastest code, for n = 1M

Collaborators and Supporters

• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)

106

Page 19: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

Application to Tensor Contractions

bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

bull Heavily used in electronic structure calculationsndash Ex NWChem

bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetries

ndash Solomonik Hammond Matthews

C(ijk) = Σm A(ijm)B(mk)

A3-fold symm

B2-fold symm

C2-fold symm

Application to Tensor Contractions

bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn

bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews

Communication Lower Bounds for Strassen-like matmul algorithms

bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected

bull Extends up to M = n2 p2ω bull Extends to rectangular case multiply (mxn)(nxp) in q mults

ndash words_moved = Ω (flopsM^(logmpq -1))

bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz also in JACMbull Is the lower bound attainable

Classical O(n3) matmul

words_moved =Ω (M(nM12)3P)

Strassenrsquos O(nlg7) matmul

words_moved =Ω (M(nM12)lg7P)

Strassen-like O(nω) matmul

words_moved =Ω (M(nM12)ωP)

vs

Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory

Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory

CAPS If EnoughMemory and P 7 then BFS step else DFS step end if

Communication Avoiding Parallel Strassen (CAPS)

Best way to interleaveBFS and DFS is an tuning parameter

26

Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

Speedups 24-184(over previous Strassen-based algorithms)

Invited to appear as Research Highlight in CACM

Strassen-like beyond matmul

bull Thm (D Dumitriu Holtzrsquo07) Any Strassen-like O(nω) matmul algorithm can be used to build a numerically stable O(nω+η) algorithm for any ηgt0 for Ax=b least squares eig SVD hellipndash ηgt0 needed to deal with numerical stabilityndash Strassen already stable so η=0

bull Thm For sequential versions of these algorithms Words_moved = O(nω+ηM(ω+η)2 ndash 1 + n2 log n) ie attain expected lower bound

Ballard D Holtz Schwartz

Cache and Network Oblivious Algorithms

bull Motivation Minimizes communication at every level of a hierarchical system without tuning parameters (in theory)ndash Not always 25D Matmul on BGP was topology aware

bull CAPS Divide-and-conquer choose BFS or DFS to adapt to processors available memory

bull CARMAndash Divide-and-conquer classical matmul divide largest of 3

dimensions to create two subproblemsndash Choose BFS or DFS to adapt to processors available memory

CARMA Performance Distributed Memory

Square m = k = n = 6144

ScaLAPACK

CARMA

Peak

(log)

(log)

Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA

CARMA Performance Distributed Memory

Inner Product m = n = 192 k = 6291456

ScaLAPACK

CARMAPeak

(log)

(log)

Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA

CARMA Performance Shared Memory

Square m = k = n

MKL (double)CARMA (double)

MKL (single)CARMA (single)

Peak (single)

Peak (double)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

CARMA Performance Shared Memory

Inner Product m = n = 64

MKL (double)

CARMA (double)

MKL (single)

CARMA (single)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

Why is CARMA Faster in Shared MemoryL3 Cache Misses

Shared Memory Inner Product (m = n = 64 k = 524288)

97 Fewer Misses

86 Fewer Misses

(linear)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

One-sided Factorizations (LU QR) so farbull Classical Approach for i=1 to n update column i update trailing matrixbull words_moved = O(n3)

35

bull Blocked Approach (LAPACK) for i=1 to nb update block i of b columns update trailing matrixbull words moved = O(n3M13)

bull Recursive Approach func factor(A) if A has 1 column update it

else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)

bull None of these approaches minimizes messagesbull Parallel case Partial

Pivoting =gt n reductionsbull Need another idea

TSQR An Architecture-Dependent Algorithm

W =

W0

W1

W2

W3

R00

R10

R20

R30

R01

R11

R02Parallel

W =

W0

W1

W2

W3

R01R02

R00

R03

SequentialStreaming

W =

W0

W1

W2

W3

R00

R01

R01

R11

R02

R11

R03

Dual Core

Can choose reduction tree dynamically

Multicore Multisocket Multirack Multisite Out-of-core

Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo

Wnxb =

W1

W2

W3

W4

P1middotL1middotU1

P2middotL2middotU2

P3middotL3middotU3

P4middotL4middotU4

=

Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo

W1rsquoW2rsquoW3rsquoW4rsquo

P12middotL12middotU12

P34middotL34middotU34

=Choose b pivot rows call them W12rsquo

Choose b pivot rows call them W34rsquo

W12rsquoW34rsquo

= P1234middotL1234middotU1234

Choose b pivot rows

Go back to W and use these b pivot rows (move them to top do LU without pivoting)

37

Minimizing Communication in TSLU

W = W1

W2

W3

W4

LULULULU

LU

LULUParallel

W = W1

W2

W3

W4

LULU

LU

LUSequentialStreaming

W = W1

W2

W3

W4

LULU LU

LULU

LULU

Dual Core

Can choose reduction tree dynamically to match architecture as before

38

Making TSLU Numerically Stable

bull Details matterndash Going up the tree we could do LU either on original rows of A

(tournament pivoting) or computed rows of Undash Only tournament pivoting stable

bull ldquoThmrdquo New scheme as stable as Partial Pivoting (GEPP) in following sense Get same Schur complements as GEPP applied to different input matrix whose entries are blocks taken from input A

bull Why just a ldquoThmrdquo

39

Stability of LU using TSLU CALU

Summer School Lecture 4 40

bull Empirical testingndash Both random matrices and ldquospecial onesrdquondash Both binary tree (BCALU) and flat-tree (FCALU)ndash 3 metrics ||PA-LU||||A|| normwise and componentwise backward errorsndash See [D Grigori Xiang 2010] for details

Why is stability of TSLU just a ldquoThmrdquo

bull Proof is correct ndash in exact arithmeticbull Experiment

ndash Generate 100 random 6x6 rank 3 matrices in Matlabndash [LUP] = lu(A) do LU without pivoting on PA compare L factors are

they the samebull Compute || L ndash Lnp || A few 0rsquos A few infinrsquos a few NaNsbull Rest mostly O(1)

ndash Why Floating point is nonassociative doing arithmetic in different order gives different rounding errors

ndash Same experiment with rank 6 matrices || L ndash Lnp || usually nonzero O(macheps)

ndash Same experiment with 20x20 rank 4 matrices || L ndash Lnp || often O(103)

bull Much harder to break TSLU but possiblendash Occurred when using TSLU to factorize a low-rank subdiagonal

panel in symmetric-indefinite factorization41

Fixing TSLU

bull Run TSLU quickly test for stability fix if necessary (rare)

bull Test conditioning of U if not tiny (usual case) proceed elsebull Compute || L || if not big (usual case) proceed elsebull Factor A = QR using TSQR thenbull Factor Q = PLU using TSLU thenbull A = PL(UR) with UR as upper triangular factor

bull Last topic in lecture how to guarantee floating point reproducibility

42

2D CALU with Tournament Pivoting

43

25D CALU with Tournament Pivoting (c=4 copies)

44

Exascale Machine ParametersSource DOE Exascale Workshop

bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log

2 (

n2p

) =

log

2 (m

emo

ry_p

er_p

roc)

Up to 29x

25D vs 2D LUWith and Without Pivoting

Other CA algorithms for Ax=b least squares(13)

bull A symmetric and indefinitendash Seek factorization that retains symmetry PAPT = LDLT D

ldquosimplerdquobull Save frac12 flops preserve inertia

ndash Usual approach Bunch-Kaufmanbull D block diagonal with 1x1 and 2x2 blocksbull Pivot search down column along row (lots of communication)

ndash Alternative Aasenbull D = tridiagonal = Tbull Two steps

ndash PAPT = LTLT where T is banded using TSLU

48

0 0

0

0 0

0

0

hellip

hellip

ndash Solvefactor narrow band problem with Tbull Up to 28x faster than MKL Best Paper at IPDPSrsquo13

Other CA algorithms for Ax=b least squares (23)bull Minimizing bandwidth and latency for sequential GEPP

ndash So far could not do partial pivoting and minimize messages just words

ndash Challengebull Column layout good for choosing pivots bad for matmulbull Blocked layout good for matmul bad for choosing pivots

ndash Solution use both layouts switching between thembull ldquoShape Morphing LUrdquo or SMLU

49

bull func factor(A) if A has 1 column update it else factor(left half of A)

update right half of A

factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M)

bull func factor(A) if A has 1 column update it else factor(left half of A) reshape to recursive block format update right half of A reshape to columnwise format factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M32)

Other CA algorithms for Ax=b least squares (33)bull Need for pivoting arises beyond LU in QR

ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)

ndash Usual approach like Partial Pivoting

bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound

ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two

groups of b columns either using usual approach or something better (GuEisenstat)

bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix

ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 50

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What about sparse matrices (13)

bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then

bull But canrsquot reorder outer loop for 25D need another idea

bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring

bull Kleenersquos Algorithm

52

for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)

D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21

Performance of 25D APSP using Kleene

53

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)

62xspeedup

2x speedup

What about sparse matrices (23)

bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of

G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix

bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering

for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid

bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse

multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on

separators) 54

What about sparse matrices (33)

bull If matrix stays very sparse lower bound unattainable new one

bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of

data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)

bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)

Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely

bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices

along dimensions most likely to minimize cost

55

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Symmetric Eigenproblem and SVD

bull Usual approach for A=AT (SVD similar)ndash A QTAQ = T where Q orthogonal T tridiagonalndash T UTTU = Λ where U orthogonal Λ diagonalndash QUrsquos columns are eigenvectors Λ eigenvaluesndash Dense Tridiagonal Diagonalndash Only half BLAS3 half BLAS2 in LAPACKrsquos sytrd

bull Communication-Avoiding Approachndash A QAQT = B where B=BT banded of bandwidth M12

ndash Continue as above starting with Bndash Dense Banded Tridiagonal Diagonalndash Dense Banded use TSQR to zero out M12 colsrows at a timendash Banded Tridiagonal need new(ish) idea

b+1

b+1

Successive Band Reduction (BischofLangSun)

1

b+1

b+1

d+1

c

Successive Band Reduction (BischofLangSun)

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1Q1

b+1

b+1

d+1

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

12

Q1

b+1

b+1

d+1

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

12

Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

Q1

Q1T

Q2

Q2T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

Q1

Q1T

Q2

Q2T

Q3

Q3T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

6

6

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

Conventional vs CA - SBR

Conventional Communication-Avoiding

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Nonsymmetric Eigenproblem

bull No apparent way to modify standard algorithmbull Instead Spectral Divide-and-Conquer

ndash Find orthogonal matrix Q whose leading columns span an invariant subspace of A

ndash QTAQ will be block upper triangular

ndash Apply recursively to A11 A22

ndash Depends on randomization1 Randomized Rank Revealing QR decomposition2 Randomized location to try splitting spectrum

A11 A12

ε A22

Attaining the Lower bounds SequentialLegend[Existing][Ours][Math-Lib][Random]

Two Levels Memory Hierarchy

Words Messages Words Messages

BLAS-3 [FLPRrsquo99][BDLSTrsquo13][MKL etc] [FLPRrsquo99][BDLSTrsquo13][MKL etc]

Cholesky[Grsquo97][APrsquo00]

[LAPACK][BDHSrsquo09]

[Grsquo97][APrsquo00][BDHSrsquo09] [Grsquo97][APrsquo00][BDHSrsquo09]

Sym Indefinite [BBDDDPSTYrsquo13] [BBDDDPSTYrsquo13]

LU[Grsquo97][Trsquo97]

[GDXrsquo11][BDLSTrsquo13]

[GDXrsquo11][BDLSTrsquo13]

[Grsquo97][Trsquo97] [BDLSTrsquo13] [BDLSTrsquo13]

QR[EGrsquo98][FWrsquo03]

[DGHLrsquo12][BDLSTrsquo13]

[FWrsquo03][DGHLrsquo12][BDLSTrsquo13]

[EGrsquo98][FWrsquo03][BDLSTrsquo13]

[FWrsquo03][BDLSTrsquo13]

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13] [BDDrsquo11]

Non Sym Eig [BDDrsquo11] [BDDrsquo11]

Legend[Existing][Ours][Math-Lib][Random]

Words (BW) Messages (L) Saving factor

BLAS-3 [AGZrsquo94][MTrsquo99][ScaLAPACK][Crsquo69][vGWrsquo97][SDrsquo11] L nP12

Cholesky [ScaLAPACK][Trsquo99][SDrsquo11] L nP12

Sym Indefinite [BBDDDPSTYrsquo13][ScaLAPACK] [BBDDDPSTYrsquo13] L nP12

LU [ScaLAPACK][GDXrsquo11][Trsquo99][SDrsquo11] [GDXrsquo11][Trsquo99][SDrsquo11] L nP12

QR [ScaLAPACK][DGHLrsquo12] [Trsquo99] [DGHLrsquo12][Trsquo99] L nP12

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13][ScaLAPACK] [BDDrsquo11][BDKrsquo13] L nP12

Non-Sym Eig [BDDrsquo11] [BDDrsquo11] BW P12 L n

Attaining with extra memory 25D M=(cn2P)

Attaining the Lower bounds Parallel 2DM=(n2P)(Ignoring poly-log(P) factors words = ( n2 P12) messages = (P12)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Conjugate Gradients (CG) GMRES Lanczos Arnoldi hellip bull Goal minimize communication

ndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computationndash Challenges Poor partitioning Preconditioning Num Stability

75

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Example The Difficulty of Tuning SpMV

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

77

Example The Difficulty of Tuning

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

bull 8x8 dense substructure exploit this to limit mem_refs

78

Speedups on Itanium 2 The Need for Search

Reference

Best 4x2

Mflops

Mflops

79

Register Profile Itanium 2

190 Mflops

1190 Mflops

80

Register Profiles IBM and Intel IA-64Power3 - 17 Power4 - 16

Itanium 2 - 33Itanium 1 - 8

252 Mflops

122 Mflops

820 Mflops

459 Mflops

247 Mflops

107 Mflops

12 Gflops

190 Mflops

Another example of tuning challenges for SpMV

bull Ex11 matrix (fluid flow)

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

82

Zoom in to top corner

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

83

3x3 blocks look natural buthellip

bull Example 3x3 blockingndash Logical grid of 3x3 cells

bull But would lead to lots of ldquofill-inrdquo

84

Extra Work Can Improve Efficiency

bull Example 3x3 blockingndash Logical grid of 3x3 cellsndash Fill-in explicit zerosndash Unroll 3x3 block multipliesndash ldquoFill ratiordquo = 15

bull On Pentium III 15x speedup

ndash Actual mflop rate 152 = 225 higher

85

Source Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal

Summer School Lecture 7

87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

Before Green + RedAfter Green + Blue

Summer School Lecture 7

892x speedups on Pentium 4 Power 4 hellip

Summary of Other Performance Optimizations

bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip

bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR

bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip

90

Optimized Sparse Kernel Interface - OSKI

bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning

bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski

bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression

software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki

91

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

93

Example Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in

each iteration

via CA Matrix Powers Kernel

Global reduction to compute G

94

Example CA-Conjugate Gradient

Local computations within inner loop require

no communication

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

96

Slower convergence due

to roundoff

Loss of accuracy due to roundoff

At s = 16 monomial basis is rank deficient Method breaks down

Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400

CA-CG (monomial)CG

machine precision

97

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit

bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square

bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank

matrices

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

101

bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010

ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver

demanded by customers (construction engineers) otherwise they donrsquot believe results

ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it

ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection

Reproducible Floating Point Computation

Absolute Error for Random Vectors

Same magnitude opposite signs

Intel MKL non-reproducibility

Relative Error for Orthogonal vectors

Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value

Sign notreproducible

103

bull Consider summation or dot productbull Goals

1 Same answer independent of layout processors order of summands

2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy

bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)

GoalsApproaches for Reproducibility

104

Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M

Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya

Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger

bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David

Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang

bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou

Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National

Instruments NEC Nokia NVIDIA Samsung Oracle

bull bebopcsberkeleyedu

Summary

Donrsquot Communichellip

106

Time to redesign all linear algebra n-body hellip algorithms and software

(and compilers)

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all ldquon3-likerdquo linear algebra
  • Lower bound for all ldquon3-likerdquo linear algebra (2)
  • Lower bound for all ldquon3-likerdquo linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a ldquoThmrdquo
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a ldquosparse matrixrdquo
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary
Page 20: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

C(ijk) = Σm A(ijm)B(mk)

A3-fold symm

B2-fold symm

C2-fold symm

Application to Tensor Contractions

bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn

bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews

Communication Lower Bounds for Strassen-like matmul algorithms

• Proof: graph expansion (different from classical matmul)
  – Strassen-like DAG must be "regular" and connected
• Extends up to M = n^2 / p^(2/ω)
• Extends to rectangular case: multiply (m x n)·(n x p) in q mults
  – words_moved = Ω( #flops / M^(log_mp q – 1) )
• Best Paper Prize (SPAA'11), Ballard, D., Holtz, Schwartz; also in JACM
• Is the lower bound attainable?

Classical O(n^3) matmul:
words_moved = Ω( M·(n/M^(1/2))^3 / P )

Strassen's O(n^lg7) matmul:
words_moved = Ω( M·(n/M^(1/2))^lg7 / P )

Strassen-like O(n^ω) matmul:
words_moved = Ω( M·(n/M^(1/2))^ω / P )

Communication Avoiding Parallel Strassen (CAPS)

BFS step vs. DFS step:
– BFS: runs all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory
– DFS: runs all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory

CAPS: if EnoughMemory and P ≥ 7, then BFS step, else DFS step, end if

Best way to interleave BFS and DFS is a tuning parameter

26

Performance Benchmarking, Strong Scaling Plot: Franklin (Cray XT4), n = 94080

Speedups: 24%–184% (over previous Strassen-based algorithms)

Invited to appear as Research Highlight in CACM

Strassen-like beyond matmul

• Thm (D., Dumitriu, Holtz '07): Any Strassen-like O(n^ω) matmul algorithm can be used to build a numerically stable O(n^(ω+η)) algorithm, for any η > 0, for Ax=b, least squares, eig, SVD, …
  – η > 0 needed to deal with numerical stability
  – Strassen already stable, so η = 0
• Thm: For sequential versions of these algorithms, Words_moved = O( n^(ω+η) / M^((ω+η)/2 – 1) + n^2 log n ), i.e. they attain the expected lower bound

Ballard, D., Holtz, Schwartz

Cache and Network Oblivious Algorithms

• Motivation: minimize communication at every level of a hierarchical system, without tuning parameters (in theory)
  – Not always: 2.5D Matmul on BG/P was topology aware
• CAPS: divide-and-conquer, choose BFS or DFS to adapt to #processors, available memory
• CARMA
  – Divide-and-conquer classical matmul: divide largest of the 3 dimensions to create two subproblems
  – Choose BFS or DFS to adapt to #processors, available memory
  (a sequential sketch of the splitting rule follows below)
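A sequential, cache-oblivious sketch of the CARMA splitting rule described above: always split the largest of the three dimensions. The function name and base-case threshold are illustrative assumptions; the real CARMA also chooses BFS or DFS steps to adapt to processors and memory, which is not modeled here.

    import numpy as np

    def carma_like_matmul(A, B, threshold=64):
        """Recursive classical matmul that splits the largest of (m, k, n)."""
        m, k = A.shape
        k2, n = B.shape
        assert k == k2
        if max(m, k, n) <= threshold:                # small enough: call BLAS
            return A @ B
        if m >= k and m >= n:                        # split rows of A
            mid = m // 2
            return np.vstack([carma_like_matmul(A[:mid], B, threshold),
                              carma_like_matmul(A[mid:], B, threshold)])
        elif n >= k:                                 # split columns of B
            mid = n // 2
            return np.hstack([carma_like_matmul(A, B[:, :mid], threshold),
                              carma_like_matmul(A, B[:, mid:], threshold)])
        else:                                        # split the shared dimension k
            mid = k // 2
            return (carma_like_matmul(A[:, :mid], B[:mid], threshold) +
                    carma_like_matmul(A[:, mid:], B[mid:], threshold))

    A = np.random.randn(192, 3000)                   # "inner product" shape: k dominates
    B = np.random.randn(3000, 192)
    print(np.allclose(carma_like_matmul(A, B), A @ B))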

CARMA Performance Distributed Memory

Figure: square case, m = k = n = 6144; CARMA vs ScaLAPACK, relative to peak (log–log axes).

Cray XE6 (Hopper), each node 2 x 12 core, 4 x NUMA

CARMA Performance Distributed Memory

Figure: inner-product-shaped case, m = n = 192, k = 6,291,456; CARMA vs ScaLAPACK, relative to peak (log–log axes).

Cray XE6 (Hopper), each node 2 x 12 core, 4 x NUMA

CARMA Performance Shared Memory

Figure: square case, m = k = n; MKL vs CARMA in single and double precision, with single- and double-precision peak shown (log x-axis, linear y-axis).

Intel Emerald: 4 Intel Xeon X7560 x 8 cores, 4 x NUMA

CARMA Performance Shared Memory

Figure: inner-product-shaped case, m = n = 64; MKL vs CARMA in single and double precision (log x-axis, linear y-axis).

Intel Emerald: 4 Intel Xeon X7560 x 8 cores, 4 x NUMA

Why is CARMA Faster in Shared Memory? L3 Cache Misses

Figure: shared-memory inner product (m = n = 64, k = 524288), linear scale; CARMA incurs 97% and 86% fewer L3 cache misses than MKL in the two panels.

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

One-sided Factorizations (LU, QR), so far
• Classical Approach:
    for i = 1 to n
      update column i
      update trailing matrix
  #words_moved = O(n^3)
• Blocked Approach (LAPACK):
    for i = 1 to n/b
      update block i of b columns
      update trailing matrix
  #words_moved = O(n^3 / M^(1/3))
• Recursive Approach:
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        update right half of A
        factor(right half of A)
  #words_moved = O(n^3 / M^(1/2))
• None of these approaches minimizes #messages
• Parallel case: Partial Pivoting => n reductions
• Need another idea

35

TSQR An Architecture-Dependent Algorithm

Figure: TSQR reduction trees applied to W = [W0; W1; W2; W3]:
– Parallel (binary tree): QR each block Wi → Ri0; combine pairs → R01, R11; combine again → R02.
– Sequential / Streaming (flat tree): R00 is combined with W1, W2, W3 in turn → R01, R02, R03.
– Dual Core: a hybrid of the two trees.
(A one-level NumPy sketch of the parallel tree follows below.)

Can choose reduction tree dynamically:
Multicore / Multisocket / Multirack / Multisite / Out-of-core
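A tiny NumPy sketch of the parallel (one-level) TSQR reduction shown in the figure: QR each row block, stack the R factors, and do one more QR. This is an assumption-laden toy: it returns only R, uses a single reduction level, and the leaf QRs are merely parallelizable rather than actually parallel.

    import numpy as np

    def tsqr_r_factor(W, num_blocks=4):
        """R factor of a tall-skinny W via blockwise QR + one combining QR."""
        blocks = np.array_split(W, num_blocks, axis=0)
        local_Rs = [np.linalg.qr(Wi, mode='r') for Wi in blocks]   # leaf QRs
        return np.linalg.qr(np.vstack(local_Rs), mode='r')         # root of the tree

    W = np.random.randn(1000, 6)
    R_tsqr = tsqr_r_factor(W)
    R_ref = np.linalg.qr(W, mode='r')
    # R is unique up to signs of its rows, so compare absolute values
    print(np.allclose(np.abs(R_tsqr), np.abs(R_ref)))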

Back to LU: using a similar idea for TSLU as TSQR: use a reduction tree to do "Tournament Pivoting"

Figure: W (n x b) = [W1; W2; W3; W4].
– Factor each block Wi = Pi·Li·Ui and choose b pivot rows of Wi; call them Wi'.
– Stack and factor [W1'; W2'] = P12·L12·U12 and [W3'; W4'] = P34·L34·U34; choose b pivot rows W12' and W34'.
– Stack and factor [W12'; W34'] = P1234·L1234·U1234; choose the final b pivot rows.
– Go back to W and use these b pivot rows (move them to the top, do LU without pivoting).
(A small sketch of this tournament appears below.)

37
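A small NumPy/SciPy sketch of the pivot-selection tournament described above: each block of rows nominates b candidates via ordinary GEPP, and winners are paired off up a binary tree. Function names are made up, the block count is assumed to be a power of two, and W is assumed tall enough that every block has at least b rows; the real TSLU then does LU without pivoting after moving the winning rows to the top.

    import numpy as np
    from scipy.linalg import lu

    def best_b_rows(W, candidate_rows, b):
        """One 'game': GEPP on the candidate rows; return the b pivot rows
        as indices into the original matrix W."""
        P, L, U = lu(W[candidate_rows])         # W[candidate_rows] = P @ L @ U
        pivot_order = np.argmax(P, axis=0)      # block row used as the j-th pivot
        return [candidate_rows[i] for i in pivot_order[:b]]

    def tournament_pivoting(W, b, num_blocks=4):
        """Binary-tree tournament; num_blocks assumed to be a power of two."""
        groups = [list(g) for g in np.array_split(np.arange(W.shape[0]), num_blocks)]
        winners = [best_b_rows(W, g, b) for g in groups]            # leaves
        while len(winners) > 1:                                     # internal tree nodes
            winners = [best_b_rows(W, winners[i] + winners[i + 1], b)
                       for i in range(0, len(winners), 2)]
        return winners[0]

    W = np.random.randn(64, 4)
    print(tournament_pivoting(W, b=4))          # 4 pivot-row indices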

Minimizing Communication in TSLU

Figure: TSLU reduction trees on W = [W1; W2; W3; W4], analogous to TSQR: parallel (binary tree of local LU factorizations), sequential/streaming (flat tree), and dual-core (hybrid).

Can choose reduction tree dynamically, to match architecture, as before

38

Making TSLU Numerically Stable

• Details matter
  – Going up the tree, we could do LU either on original rows of A (tournament pivoting) or on computed rows of U
  – Only tournament pivoting is stable
• "Thm": New scheme is as stable as Partial Pivoting (GEPP) in the following sense: we get the same Schur complements as GEPP applied to a different input matrix whose entries are blocks taken from the input A
• Why just a "Thm"?

39

Stability of LU using TSLU CALU

40

• Empirical testing
  – Both random matrices and "special ones"
  – Both binary tree (BCALU) and flat-tree (FCALU)
  – 3 metrics: ||PA – LU|| / ||A||, normwise and componentwise backward errors
  – See [D., Grigori, Xiang, 2010] for details

Why is stability of TSLU just a "Thm"?

• Proof is correct – in exact arithmetic
• Experiment:
  – Generate 100 random 6x6, rank-3 matrices in Matlab
  – [L,U,P] = lu(A); do LU without pivoting on P·A, compare L factors: are they the same?
    • Compute || L – Lnp ||: a few 0's, a few ∞'s, a few NaNs
    • Rest mostly O(1)
  – Why? Floating point is nonassociative: doing arithmetic in a different order gives different rounding errors
  – Same experiment with rank-6 matrices: || L – Lnp || usually nonzero, O(macheps)
  – Same experiment with 20x20 rank-4 matrices: || L – Lnp || often O(10^3)
• Much harder to break TSLU, but possible
  – Occurred when using TSLU to factorize a low-rank subdiagonal panel in symmetric-indefinite factorization
(A rough NumPy version of this experiment follows below.)

41
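A rough NumPy/SciPy re-creation of the Matlab experiment described above (rank-deficient random matrices, compare the L from GEPP with the L from LU without pivoting on the pre-permuted matrix). The seed, sizes, and the hand-rolled unpivoted LU are assumptions for illustration; the qualitative outcome (zeros, infinities, NaNs, and O(1) differences) is the point.

    import numpy as np
    from scipy.linalg import lu

    def unpivoted_lu(A):
        """Doolittle LU without pivoting; breaks down (inf/NaN) on tiny pivots."""
        A = A.astype(float).copy()
        n = A.shape[0]
        L, U = np.eye(n), np.zeros((n, n))
        for k in range(n):
            U[k, k:] = A[k, k:]
            L[k+1:, k] = A[k+1:, k] / U[k, k]
            A[k+1:, k+1:] -= np.outer(L[k+1:, k], U[k, k+1:])
        return L, U

    rng = np.random.default_rng(0)
    diffs = []
    with np.errstate(divide='ignore', invalid='ignore'):
        for _ in range(100):
            A = rng.standard_normal((6, 3)) @ rng.standard_normal((3, 6))  # rank 3
            P, L, U = lu(A)                        # GEPP: A = P @ L @ U
            Lnp, Unp = unpivoted_lu(P.T @ A)       # LU without pivoting on the permuted A
            diffs.append(np.linalg.norm(L - Lnp))  # 0 in exact arithmetic
    diffs = np.array(diffs)
    finite = diffs[np.isfinite(diffs)]
    print("non-finite:", np.sum(~np.isfinite(diffs)),
          "median finite difference:", np.median(finite))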

Fixing TSLU

• Run TSLU, quickly test for stability, fix if necessary (rare)
• Test conditioning of U; if not tiny (usual case), proceed, else
• Compute || L ||; if not big (usual case), proceed, else
• Factor A = QR using TSQR, then
• Factor Q = PLU using TSLU, then
• A = P·L·(U·R), with U·R as the upper triangular factor
• Last topic in lecture: how to guarantee floating point reproducibility

42

2D CALU with Tournament Pivoting

43

2.5D CALU with Tournament Pivoting (c=4 copies)

44

Exascale Machine Parameters (Source: DOE Exascale Workshop)

• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node

Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU

Figure: heat map of predicted speedup; x-axis log2(p), y-axis log2(n^2/p) = log2(memory_per_proc). Up to 29x.

2.5D vs 2D LU, With and Without Pivoting

Other CA algorithms for Ax=b, least squares (1/3)

• A symmetric and indefinite
  – Seek factorization that retains symmetry: P·A·P^T = L·D·L^T, with D "simple"
    • Save 1/2 the flops, preserve inertia
  – Usual approach: Bunch-Kaufman
    • D block diagonal with 1x1 and 2x2 blocks
    • Pivot search down column, along row (lots of communication)
  – Alternative: Aasen
    • D = tridiagonal = T
    • Two steps:
      – P·A·P^T = L·T·L^T, where T is banded, using TSLU
      – Solve/factor the narrow band problem with T
    • Up to 2.8x faster than MKL; Best Paper at IPDPS'13

48

Other CA algorithms for Ax=b, least squares (2/3)

• Minimizing bandwidth and latency for sequential GEPP
  – So far, could not do partial pivoting and minimize #messages, just #words
  – Challenge:
    • Column layout good for choosing pivots, bad for matmul
    • Blocked layout good for matmul, bad for choosing pivots
  – Solution: use both layouts, switching between them
    • "Shape Morphing LU", or SMLU

49

• Recursive GEPP (columnwise layout throughout):
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        update right half of A
        factor(right half of A)
  #Words = O(n^3 / M^(1/2)), #Messages = O(n^3 / M)

• Shape Morphing LU:
    func factor(A)
      if A has 1 column, update it
      else
        factor(left half of A)
        reshape to recursive block format
        update right half of A
        reshape to columnwise format
        factor(right half of A)
  #Words = O(n^3 / M^(1/2)), #Messages = O(n^3 / M^(3/2))

Other CA algorithms for Ax=b, least squares (3/3)

• Need for pivoting arises beyond LU, in QR
  – Choose permutation P so that leading columns of A·P = Q·R span the column space of A – Rank Revealing QR (RRQR)
  – Usual approach, like Partial Pivoting:
    • Put longest column first, update rest of matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit lower bound
  – Use Tournament Pivoting:
    • Each round of the tournament selects the best b columns from two groups of b columns, either using the usual approach or something better (Gu/Eisenstat)
    • Thm: This approach "reveals the rank" of A, in the sense that the leading r x r submatrix of R has singular values "near" the largest r singular values of A; ditto for the trailing submatrix
  – Idea extends to other pivoting schemes:
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDL^T with complete pivoting

50

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What about sparse matrices? (1/3)

• If matrix quickly becomes dense, use a dense algorithm
• Ex: All Pairs Shortest Path using Floyd-Warshall
• Similar to matmul: let D = A, then:

    for k = 1:n
      for i = 1:n
        for j = 1:n
          D(i,j) = min(D(i,j), D(i,k) + D(k,j))

• But can't reorder outer loop for 2.5D; need another idea
• Abbreviate D(i,j) = min(D(i,j), min_k(A(i,k)+B(k,j))) by D = A*B
  – Dependencies ok, 2.5D works, just a different semiring
• Kleene's Algorithm (a sequential sketch of the recursion appears below):

    D = DC-APSP(A, n)
      D = A
      Partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
      D11 = DC-APSP(D11, n/2)
      D12 = D11 * D12
      D21 = D21 * D11
      D22 = D21 * D12
      D22 = DC-APSP(D22, n/2)
      D21 = D22 * D21
      D12 = D12 * D22
      D11 = D12 * D21

52
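A sequential NumPy sketch of the DC-APSP/Kleene recursion above in the (min,+) semiring, checked against the Floyd-Warshall triple loop. The blockwise "multiplies" here are dense and in-core; in the CA version they are done with a 2.5D algorithm. Names and the tiny test problem are illustrative.

    import numpy as np

    def minplus(C, A, B):
        """C = min(C, A (min,+) B), i.e. C[i,j] = min(C[i,j], min_k A[i,k]+B[k,j])."""
        return np.minimum(C, (A[:, :, None] + B[None, :, :]).min(axis=1))

    def dc_apsp(D):
        n = D.shape[0]
        if n <= 1:
            return D.copy()
        D = D.copy()
        h = n // 2
        D[:h, :h] = dc_apsp(D[:h, :h])
        D[:h, h:] = minplus(D[:h, h:], D[:h, :h], D[:h, h:])
        D[h:, :h] = minplus(D[h:, :h], D[h:, :h], D[:h, :h])
        D[h:, h:] = minplus(D[h:, h:], D[h:, :h], D[:h, h:])
        D[h:, h:] = dc_apsp(D[h:, h:])
        D[h:, :h] = minplus(D[h:, :h], D[h:, h:], D[h:, :h])
        D[:h, h:] = minplus(D[:h, h:], D[:h, h:], D[h:, h:])
        D[:h, :h] = minplus(D[:h, :h], D[:h, h:], D[h:, :h])
        return D

    rng = np.random.default_rng(1)
    n = 16
    A = rng.uniform(1, 10, (n, n)); np.fill_diagonal(A, 0)
    ref = A.copy()
    for k in range(n):                                   # Floyd-Warshall reference
        ref = np.minimum(ref, ref[:, [k]] + ref[[k], :])
    print(np.allclose(dc_apsp(A), ref))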

Performance of 2.5D APSP using Kleene

53

Figure: strong scaling on Hopper (Cray XE6 with 1024 nodes = 24,576 cores); annotations: 6.2x speedup and 2x speedup.

What about sparse matrices? (2/3)

• If parts of the matrix become dense, optimize those
• Ex: Cholesky on a matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): If all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on a w x w submatrix
• Thm: Words_moved = Ω(w^3 / M^(1/2)), etc.
• Thm (George '73): Nested dissection gives optimal ordering for 2D grid, 3D grid, similar matrices
  – w = n for a 2D n x n grid, w = n^2 for a 3D n x n x n grid
• Sequential multifrontal Cholesky attains the bounds
• PSPASES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)

54

What about sparse matrices? (3/3)

• If the matrix stays very sparse, the lower bound is unattainable; new one:
• Ex: A, B both diagonal: no communication in the parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), iid
• Assumption: algorithm is sparsity-independent: assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
    Words_moved = Ω( min( d·n / P^(1/2), d^2·n / P ) )
  – Proof exploits the fact that reuse of entries of C = A·B is unlikely
• Contrast general lower bound: Words_moved = Ω( d^2·n / (P·M^(1/2)) )
• Attained by a divide-and-conquer algorithm that splits matrices along the dimensions most likely to minimize cost

55

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Symmetric Eigenproblem and SVD

• Usual approach for A = A^T (SVD similar)
  – A → Q^T·A·Q = T, where Q orthogonal, T tridiagonal
  – T → U^T·T·U = Λ, where U orthogonal, Λ diagonal
  – Q·U's columns are eigenvectors, Λ eigenvalues
  – Dense → Tridiagonal → Diagonal
  – Only half BLAS3, half BLAS2, in LAPACK's sytrd
• Communication-Avoiding Approach
  – A → Q·A·Q^T = B, where B = B^T banded, of bandwidth M^(1/2)
  – Continue as above, starting with B
  – Dense → Banded → Tridiagonal → Diagonal
  – Dense → Banded: use TSQR to zero out M^(1/2) cols/rows at a time
  – Banded → Tridiagonal: need new(ish) idea

Successive Band Reduction (Bischof/Lang/Sun)

Figure (animation over several slides): starting from a symmetric band matrix of bandwidth b+1, an orthogonal transformation Q1 eliminates d diagonals from a panel of c columns, creating a bulge of width d+c; further transformations Q2, Q3, … chase the bulge down the band (numbered sweeps 1–6 shown, with Q1, Q1^T, …, Q5, Q5^T applied from the left and right).
Notation: b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b.

Conventional vs CA - SBR

Conventional: touch all data 4 times | Communication-Avoiding: touch all data once

Speedups of Sym. Band Reduction vs DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem

• No apparent way to modify the standard algorithm
• Instead: Spectral Divide-and-Conquer
  – Find an orthogonal matrix Q whose leading columns span an invariant subspace of A
  – Q^T·A·Q will be block upper triangular:
        [ A11  A12 ]
        [  ε   A22 ]
  – Apply recursively to A11, A22
  – Depends on randomization:
    1. Randomized Rank Revealing QR decomposition
    2. Randomized location to try splitting the spectrum

Attaining the Lower bounds: Sequential
Legend: [Existing] [Ours] [Math-Lib] [Random]
Columns: Two Levels of memory (#Words, #Messages) and Memory Hierarchy (#Words, #Messages)

BLAS-3:            [FLPR'99][BDLST'13][MKL etc.] | [FLPR'99][BDLST'13][MKL etc.]
Cholesky:          [G'97][AP'00][LAPACK][BDHS'09] | [G'97][AP'00][BDHS'09] | [G'97][AP'00][BDHS'09]
Sym Indefinite:    [BBDDDPSTY'13] | [BBDDDPSTY'13]
LU:                [G'97][T'97][GDX'11][BDLST'13] | [GDX'11][BDLST'13] | [G'97][T'97][BDLST'13] | [BDLST'13]
QR:                [EG'98][FW'03][DGHL'12][BDLST'13] | [FW'03][DGHL'12][BDLST'13] | [EG'98][FW'03][BDLST'13] | [FW'03][BDLST'13]
Rank Revealing QR: [BDD'11][DGGX'13]
Sym Eig & SVD:     [BDD'11][BDK'13] | [BDD'11]
Non Sym Eig:       [BDD'11] | [BDD'11]

Attaining the Lower bounds: Parallel 2D, M = Θ(n^2/P)
(Ignoring poly-log(P) factors; #words = Ω(n^2/P^(1/2)), #messages = Ω(P^(1/2)))
Legend: [Existing] [Ours] [Math-Lib] [Random]
Columns: #Words (BW), #Messages (L); last column is the saving factor attainable with extra memory (2.5D, M = Θ(c·n^2/P))

BLAS-3:            [AGZ'94][MT'99][ScaLAPACK][C'69][vGW'97][SD'11] | L: n/P^(1/2)
Cholesky:          [ScaLAPACK][T'99][SD'11] | L: n/P^(1/2)
Sym Indefinite:    [BBDDDPSTY'13][ScaLAPACK] | [BBDDDPSTY'13] | L: n/P^(1/2)
LU:                [ScaLAPACK][GDX'11][T'99][SD'11] | [GDX'11][T'99][SD'11] | L: n/P^(1/2)
QR:                [ScaLAPACK][DGHL'12][T'99] | [DGHL'12][T'99] | L: n/P^(1/2)
Rank Revealing QR: [BDD'11][DGGX'13]
Sym Eig & SVD:     [BDD'11][BDK'13][ScaLAPACK] | [BDD'11][BDK'13] | L: n/P^(1/2)
Non-Sym Eig:       [BDD'11] | [BDD'11] | BW: P^(1/2), L: n

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and a starting vector
  – Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation:
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors:
    • Conventional: O(k log p) messages (k SpMV calls, dot prods)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability
(A toy sketch of the k-step "matrix powers" idea appears below.)

75
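A toy sketch of the k-step idea: the monomial Krylov basis [x, Ax, …, A^s x] computed with s plain SpMVs. The communication-avoiding matrix powers kernel computes the same basis, but for a well-partitioned A each processor first gathers the ghost rows it needs for all s steps, so it pays O(1) rounds of messages instead of O(s); that distributed part is only described in the comment, not implemented.

    import numpy as np
    import scipy.sparse as sp

    def monomial_krylov_basis(A, x, s):
        """[x, A x, A^2 x, ..., A^s x] via s ordinary SpMVs (the conventional way)."""
        V = [x]
        for _ in range(s):
            V.append(A @ V[-1])
        return np.column_stack(V)

    A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(100, 100), format='csr')  # 1D Poisson
    V = monomial_krylov_basis(A, np.ones(100), 4)
    print(V.shape)   # (100, 5)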

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example The Difficulty of Tuning SpMV

• n = 21200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

77

Example The Difficulty of Tuning

• n = 21200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs

78

Speedups on Itanium 2 The Need for Search

Figure: SpMV performance (Mflops) on Itanium 2 across register block sizes: reference (unblocked) vs best blocking (4x2).

79

Register Profile Itanium 2

Figure: register blocking profile on Itanium 2; performance ranges from 190 Mflops to 1190 Mflops.

80

Register Profiles: IBM and Intel IA-64

Figure: register-blocking profiles for Power3 (best speedup 1.7x), Power4 (1.6x), Itanium 1 (8x), Itanium 2 (3.3x); annotated per-panel performance extremes: 122/252 Mflops, 459/820 Mflops, 107/247 Mflops, 190 Mflops/1.2 Gflops.

Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16614
• NNZ = 1.1 M

82

Zoom in to top corner

• More complicated non-zero structure in general
• N = 16614
• NNZ = 1.1 M

83

3x3 blocks look natural, but…

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

84

Extra Work Can Improve Efficiency

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup!
  – Actual Mflop rate 1.5^2 = 2.25x higher
(A small SciPy sketch of this blocking-with-fill idea follows below.)

85
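A small SciPy sketch of the blocking-with-fill idea on this slide: store the matrix in 3x3 blocks (BCSR), filling partially populated blocks with explicit zeros, and measure the resulting fill ratio. A random matrix stands in for raefsky/ex11, and SciPy's generic BSR kernel stands in for OSKI's tuned, unrolled one.

    import numpy as np
    import scipy.sparse as sp

    A = sp.random(300, 300, density=0.02, format='csr', random_state=0)
    A_bsr = sp.bsr_matrix(A, blocksize=(3, 3))      # 3x3 blocks with explicit zero fill

    fill_ratio = A_bsr.data.size / A.nnz            # stored entries / true nonzeros
    x = np.random.rand(300)
    y = A_bsr @ x                                   # blocked SpMV
    print(round(fill_ratio, 2), np.allclose(y, A @ x))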

Source Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal


87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

Before: Green + Red; After: Green + Blue

2x speedups on Pentium 4, Power 4, …

89

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·A^T·x, A^T·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later …

90

Optimized Sparse Kernel Interface - OSKI

• Provides sparse kernels automatically tuned for the user's matrix & machine
  – BLAS-style functionality: SpMV (Ax & A^T·y), TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as stand-alone library
  – Available as PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

91

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

93

Example Classical Conjugate Gradient (CG)

Figure annotations: in classical CG, the SpMVs and dot products require communication in each iteration; in CA-CG, the s SpMVs are done via the CA Matrix Powers Kernel, and one global reduction computes G.

94

Example CA-Conjugate Gradient

Figure annotation: local computations within the inner loop require no communication. (A plain-CG sketch with the communication points marked follows below.)
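A textbook conjugate-gradient sketch with the communication points marked in comments, to go with the two slides above. It is the classical method, not CA-CG; in the s-step reorganization the SpMVs become one matrix powers kernel call and the dot products become a single block reduction.

    import numpy as np
    import scipy.sparse as sp

    def cg(A, b, tol=1e-8, maxiter=1000):
        x = np.zeros_like(b)
        r = b - A @ x                       # SpMV: neighbor communication
        p = r.copy()
        rz = r @ r                          # dot product: global reduction
        for _ in range(maxiter):
            Ap = A @ p                      # SpMV (one per iteration)
            alpha = rz / (p @ Ap)           # dot product: global reduction
            x += alpha * p
            r -= alpha * Ap
            rz_new = r @ r                  # dot product: global reduction
            if np.sqrt(rz_new) < tol:
                break
            p = r + (rz_new / rz) * p
            rz = rz_new
        return x

    A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(50, 50), format='csr')   # SPD test matrix
    b = np.ones(50)
    x = cg(A, b)
    print(np.linalg.norm(A @ x - b))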

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

96

Figure: convergence of CG vs CA-CG with the monomial basis on a model problem (2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400). CA-CG shows slower convergence and loss of accuracy due to roundoff; at s = 16 the monomial basis is rank deficient and the method breaks down. Machine precision is marked on the plot.

97

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?
• Requires o(n^2) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be a sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + U·D·V^T, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write A^k = (S + U·D·V^T)^k as a sum of S^k and low-rank matrices
(A small sketch of applying S + U·D·V^T without densifying appears below.)

Examples (rows: nonzero entries; columns: indices):
                            Indices explicit (O(nnz))   Indices implicit (o(nnz))
Entries explicit (O(nnz)):  CSR and variations           Vision, climate, AMR, …
Entries implicit (o(nnz)):  Graph Laplacian              Stencils
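A small sketch of the "sparse + low rank" point above: apply A = S + U·D·V^T (and hence A^k) to a vector without ever forming the dense matrix. Sizes and the random S, U, D, V are made up for illustration.

    import numpy as np
    import scipy.sparse as sp

    n, r = 2000, 5
    S = sp.random(n, n, density=1e-3, format='csr', random_state=0)
    U = np.random.randn(n, r)
    D = np.diag(np.random.randn(r))
    V = np.random.randn(n, r)

    def apply_A(x):
        return S @ x + U @ (D @ (V.T @ x))      # O(nnz(S) + n*r) work, no dense n x n matrix

    def apply_Ak(x, k):
        for _ in range(k):
            x = apply_A(x)
        return x

    x = np.random.randn(n)
    print(np.linalg.norm(apply_Ak(x, 3)))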

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

101

• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought reproducible parallel sparse linear equation solver, demanded by customers (construction engineers), otherwise they don't believe the results
  – Willing to sacrifice 40%–50% of performance for it
• Email to ~110 Berkeley CSE faculty, asking about it
  – Most: "What?! How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Reproducible Floating Point Computation

Intel MKL non-reproducibility

Figure: absolute error for random vectors (same magnitude, opposite signs) and relative error for orthogonal vectors. Vector size 1e6, data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum – minimum
• Relative error = absolute error / maximum absolute value
Sign not reproducible.

103

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (not 2 or 3)
  – Use (very) high precision to get the exact answer (not 2)
  – Prerounding technique (Nguyen, D.)
(A toy demonstration of the nonassociativity problem and of a fixed reduction tree follows below.)

104
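A toy demonstration of the problem and of the simplest approach listed above (a fixed reduction tree). It is not the prerounding technique from the talk: the tree below gives the same bits for any number of threads only if the data order is also fixed, which is exactly the limitation noted on the slide.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(10**6) * 10.0**rng.integers(-8, 8, 10**6)

    # Floating-point addition is not associative: different summation orders
    # (e.g. different thread counts or layouts) give different bits.
    s1 = np.sum(x)
    s2 = np.sum(x.reshape(1000, -1).sum(axis=0))    # blocked order
    s3 = np.sum(np.sort(x))                         # sorted order
    print(s1 == s2, s1 == s3)

    def fixed_tree_sum(x):
        """Deterministic pairwise reduction tree: the combining order is fixed
        by data positions, independent of how leaves are assigned to threads."""
        x = np.asarray(x, dtype=np.float64).copy()
        n = x.size
        while n > 1:
            half = n // 2
            x[:half] += x[half:2*half]              # fixed, data-independent partners
            if n % 2:
                x[half] = x[2*half]                 # carry the odd element
            n = half + (n % 2)
        return x[0]

    print(fixed_tree_sum(x))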

Performance results on 1024-proc Cray XC30: 1.2x to 3.2x slowdown vs fastest code, for n = 1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

106

Time to redesign all linear algebra, n-body, … algorithms and software
(and compilers)

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all ldquon3-likerdquo linear algebra
  • Lower bound for all ldquon3-likerdquo linear algebra (2)
  • Lower bound for all ldquon3-likerdquo linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a ldquoThmrdquo
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a ldquosparse matrixrdquo
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary
Page 21: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

Application to Tensor Contractions

bull Ex C(ijk) = Σmn A(ijmn)B(mnk)ndash Communication lower bounds apply

bull Complex symmetries possiblendash Ex B(mnk) = B(kmn) = hellipndash d-fold symmetry can save up to d-fold flopsmemory

bull Heavily used in electronic structure calculationsndash Ex NWChem for coupled cluster (CC) approach to Schroedinger eqn

bull CTF Cyclops Tensor Frameworkndash Exploits 25D algorithms symmetriesndash Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6 ndash Solomonik Hammond Matthews

Communication Lower Bounds for Strassen-like matmul algorithms

bull Proof graph expansion (different from classical matmul)ndash Strassen-like DAG must be ldquoregularrdquo and connected

bull Extends up to M = n2 p2ω bull Extends to rectangular case multiply (mxn)(nxp) in q mults

ndash words_moved = Ω (flopsM^(logmpq -1))

bull Best Paper Prize (SPAArsquo11) Ballard D Holtz Schwartz also in JACMbull Is the lower bound attainable

Classical O(n3) matmul

words_moved =Ω (M(nM12)3P)

Strassenrsquos O(nlg7) matmul

words_moved =Ω (M(nM12)lg7P)

Strassen-like O(nω) matmul

words_moved =Ω (M(nM12)ωP)

vs

Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory

Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory

CAPS If EnoughMemory and P 7 then BFS step else DFS step end if

Communication Avoiding Parallel Strassen (CAPS)

Best way to interleaveBFS and DFS is an tuning parameter

26

Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

Speedups 24-184(over previous Strassen-based algorithms)

Invited to appear as Research Highlight in CACM

Strassen-like beyond matmul

bull Thm (D Dumitriu Holtzrsquo07) Any Strassen-like O(nω) matmul algorithm can be used to build a numerically stable O(nω+η) algorithm for any ηgt0 for Ax=b least squares eig SVD hellipndash ηgt0 needed to deal with numerical stabilityndash Strassen already stable so η=0

bull Thm For sequential versions of these algorithms Words_moved = O(nω+ηM(ω+η)2 ndash 1 + n2 log n) ie attain expected lower bound

Ballard D Holtz Schwartz

Cache and Network Oblivious Algorithms

bull Motivation Minimizes communication at every level of a hierarchical system without tuning parameters (in theory)ndash Not always 25D Matmul on BGP was topology aware

bull CAPS Divide-and-conquer choose BFS or DFS to adapt to processors available memory

bull CARMAndash Divide-and-conquer classical matmul divide largest of 3

dimensions to create two subproblemsndash Choose BFS or DFS to adapt to processors available memory

CARMA Performance Distributed Memory

Square m = k = n = 6144

ScaLAPACK

CARMA

Peak

(log)

(log)

Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA

CARMA Performance Distributed Memory

Inner Product m = n = 192 k = 6291456

ScaLAPACK

CARMAPeak

(log)

(log)

Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA

CARMA Performance Shared Memory

Square m = k = n

MKL (double)CARMA (double)

MKL (single)CARMA (single)

Peak (single)

Peak (double)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

CARMA Performance Shared Memory

Inner Product m = n = 64

MKL (double)

CARMA (double)

MKL (single)

CARMA (single)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

Why is CARMA Faster in Shared MemoryL3 Cache Misses

Shared Memory Inner Product (m = n = 64 k = 524288)

97 Fewer Misses

86 Fewer Misses

(linear)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

One-sided Factorizations (LU QR) so farbull Classical Approach for i=1 to n update column i update trailing matrixbull words_moved = O(n3)

35

bull Blocked Approach (LAPACK) for i=1 to nb update block i of b columns update trailing matrixbull words moved = O(n3M13)

bull Recursive Approach func factor(A) if A has 1 column update it

else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)

bull None of these approaches minimizes messagesbull Parallel case Partial

Pivoting =gt n reductionsbull Need another idea

TSQR An Architecture-Dependent Algorithm

W =

W0

W1

W2

W3

R00

R10

R20

R30

R01

R11

R02Parallel

W =

W0

W1

W2

W3

R01R02

R00

R03

SequentialStreaming

W =

W0

W1

W2

W3

R00

R01

R01

R11

R02

R11

R03

Dual Core

Can choose reduction tree dynamically

Multicore Multisocket Multirack Multisite Out-of-core

Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo

Wnxb =

W1

W2

W3

W4

P1middotL1middotU1

P2middotL2middotU2

P3middotL3middotU3

P4middotL4middotU4

=

Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo

W1rsquoW2rsquoW3rsquoW4rsquo

P12middotL12middotU12

P34middotL34middotU34

=Choose b pivot rows call them W12rsquo

Choose b pivot rows call them W34rsquo

W12rsquoW34rsquo

= P1234middotL1234middotU1234

Choose b pivot rows

Go back to W and use these b pivot rows (move them to top do LU without pivoting)

37

Minimizing Communication in TSLU

W = W1

W2

W3

W4

LULULULU

LU

LULUParallel

W = W1

W2

W3

W4

LULU

LU

LUSequentialStreaming

W = W1

W2

W3

W4

LULU LU

LULU

LULU

Dual Core

Can choose reduction tree dynamically to match architecture as before

38

Making TSLU Numerically Stable

bull Details matterndash Going up the tree we could do LU either on original rows of A

(tournament pivoting) or computed rows of Undash Only tournament pivoting stable

bull ldquoThmrdquo New scheme as stable as Partial Pivoting (GEPP) in following sense Get same Schur complements as GEPP applied to different input matrix whose entries are blocks taken from input A

bull Why just a ldquoThmrdquo

39

Stability of LU using TSLU CALU

Summer School Lecture 4 40

bull Empirical testingndash Both random matrices and ldquospecial onesrdquondash Both binary tree (BCALU) and flat-tree (FCALU)ndash 3 metrics ||PA-LU||||A|| normwise and componentwise backward errorsndash See [D Grigori Xiang 2010] for details

Why is stability of TSLU just a ldquoThmrdquo

bull Proof is correct ndash in exact arithmeticbull Experiment

ndash Generate 100 random 6x6 rank 3 matrices in Matlabndash [LUP] = lu(A) do LU without pivoting on PA compare L factors are

they the samebull Compute || L ndash Lnp || A few 0rsquos A few infinrsquos a few NaNsbull Rest mostly O(1)

ndash Why Floating point is nonassociative doing arithmetic in different order gives different rounding errors

ndash Same experiment with rank 6 matrices || L ndash Lnp || usually nonzero O(macheps)

ndash Same experiment with 20x20 rank 4 matrices || L ndash Lnp || often O(103)

bull Much harder to break TSLU but possiblendash Occurred when using TSLU to factorize a low-rank subdiagonal

panel in symmetric-indefinite factorization41

Fixing TSLU

bull Run TSLU quickly test for stability fix if necessary (rare)

bull Test conditioning of U if not tiny (usual case) proceed elsebull Compute || L || if not big (usual case) proceed elsebull Factor A = QR using TSQR thenbull Factor Q = PLU using TSLU thenbull A = PL(UR) with UR as upper triangular factor

bull Last topic in lecture how to guarantee floating point reproducibility

42

2D CALU with Tournament Pivoting

43

25D CALU with Tournament Pivoting (c=4 copies)

44

Exascale Machine ParametersSource DOE Exascale Workshop

bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log

2 (

n2p

) =

log

2 (m

emo

ry_p

er_p

roc)

Up to 29x

25D vs 2D LUWith and Without Pivoting

Other CA algorithms for Ax=b least squares(13)

bull A symmetric and indefinitendash Seek factorization that retains symmetry PAPT = LDLT D

ldquosimplerdquobull Save frac12 flops preserve inertia

ndash Usual approach Bunch-Kaufmanbull D block diagonal with 1x1 and 2x2 blocksbull Pivot search down column along row (lots of communication)

ndash Alternative Aasenbull D = tridiagonal = Tbull Two steps

ndash PAPT = LTLT where T is banded using TSLU

48

0 0

0

0 0

0

0

hellip

hellip

ndash Solvefactor narrow band problem with Tbull Up to 28x faster than MKL Best Paper at IPDPSrsquo13

Other CA algorithms for Ax=b least squares (23)bull Minimizing bandwidth and latency for sequential GEPP

ndash So far could not do partial pivoting and minimize messages just words

ndash Challengebull Column layout good for choosing pivots bad for matmulbull Blocked layout good for matmul bad for choosing pivots

ndash Solution use both layouts switching between thembull ldquoShape Morphing LUrdquo or SMLU

49

bull func factor(A) if A has 1 column update it else factor(left half of A)

update right half of A

factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M)

bull func factor(A) if A has 1 column update it else factor(left half of A) reshape to recursive block format update right half of A reshape to columnwise format factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M32)

Other CA algorithms for Ax=b least squares (33)bull Need for pivoting arises beyond LU in QR

ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)

ndash Usual approach like Partial Pivoting

bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound

ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two

groups of b columns either using usual approach or something better (GuEisenstat)

bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix

ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 50

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What about sparse matrices (13)

bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then

bull But canrsquot reorder outer loop for 25D need another idea

bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring

bull Kleenersquos Algorithm

52

for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)

D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21

Performance of 25D APSP using Kleene

53

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)

62xspeedup

2x speedup

What about sparse matrices (23)

bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of

G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix

bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering

for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid

bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse

multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on

separators) 54

What about sparse matrices (33)

bull If matrix stays very sparse lower bound unattainable new one

bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of

data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)

bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)

Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely

bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices

along dimensions most likely to minimize cost

55

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Symmetric Eigenproblem and SVD

bull Usual approach for A=AT (SVD similar)ndash A QTAQ = T where Q orthogonal T tridiagonalndash T UTTU = Λ where U orthogonal Λ diagonalndash QUrsquos columns are eigenvectors Λ eigenvaluesndash Dense Tridiagonal Diagonalndash Only half BLAS3 half BLAS2 in LAPACKrsquos sytrd

bull Communication-Avoiding Approachndash A QAQT = B where B=BT banded of bandwidth M12

ndash Continue as above starting with Bndash Dense Banded Tridiagonal Diagonalndash Dense Banded use TSQR to zero out M12 colsrows at a timendash Banded Tridiagonal need new(ish) idea

b+1

b+1

Successive Band Reduction (BischofLangSun)

1

b+1

b+1

d+1

c

Successive Band Reduction (BischofLangSun)

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1Q1

b+1

b+1

d+1

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

12

Q1

b+1

b+1

d+1

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

12

Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

Q1

Q1T

Q2

Q2T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

Q1

Q1T

Q2

Q2T

Q3

Q3T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

6

6

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

Conventional vs CA - SBR

Conventional Communication-Avoiding

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Nonsymmetric Eigenproblem

bull No apparent way to modify standard algorithmbull Instead Spectral Divide-and-Conquer

ndash Find orthogonal matrix Q whose leading columns span an invariant subspace of A

ndash QTAQ will be block upper triangular

ndash Apply recursively to A11 A22

ndash Depends on randomization1 Randomized Rank Revealing QR decomposition2 Randomized location to try splitting spectrum

A11 A12

ε A22

Attaining the Lower bounds SequentialLegend[Existing][Ours][Math-Lib][Random]

Two Levels Memory Hierarchy

Words Messages Words Messages

BLAS-3 [FLPRrsquo99][BDLSTrsquo13][MKL etc] [FLPRrsquo99][BDLSTrsquo13][MKL etc]

Cholesky[Grsquo97][APrsquo00]

[LAPACK][BDHSrsquo09]

[Grsquo97][APrsquo00][BDHSrsquo09] [Grsquo97][APrsquo00][BDHSrsquo09]

Sym Indefinite [BBDDDPSTYrsquo13] [BBDDDPSTYrsquo13]

LU[Grsquo97][Trsquo97]

[GDXrsquo11][BDLSTrsquo13]

[GDXrsquo11][BDLSTrsquo13]

[Grsquo97][Trsquo97] [BDLSTrsquo13] [BDLSTrsquo13]

QR[EGrsquo98][FWrsquo03]

[DGHLrsquo12][BDLSTrsquo13]

[FWrsquo03][DGHLrsquo12][BDLSTrsquo13]

[EGrsquo98][FWrsquo03][BDLSTrsquo13]

[FWrsquo03][BDLSTrsquo13]

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13] [BDDrsquo11]

Non Sym Eig [BDDrsquo11] [BDDrsquo11]

Legend[Existing][Ours][Math-Lib][Random]

Words (BW) Messages (L) Saving factor

BLAS-3 [AGZrsquo94][MTrsquo99][ScaLAPACK][Crsquo69][vGWrsquo97][SDrsquo11] L nP12

Cholesky [ScaLAPACK][Trsquo99][SDrsquo11] L nP12

Sym Indefinite [BBDDDPSTYrsquo13][ScaLAPACK] [BBDDDPSTYrsquo13] L nP12

LU [ScaLAPACK][GDXrsquo11][Trsquo99][SDrsquo11] [GDXrsquo11][Trsquo99][SDrsquo11] L nP12

QR [ScaLAPACK][DGHLrsquo12] [Trsquo99] [DGHLrsquo12][Trsquo99] L nP12

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13][ScaLAPACK] [BDDrsquo11][BDKrsquo13] L nP12

Non-Sym Eig [BDDrsquo11] [BDDrsquo11] BW P12 L n

Attaining with extra memory 25D M=(cn2P)

Attaining the Lower bounds Parallel 2DM=(n2P)(Ignoring poly-log(P) factors words = ( n2 P12) messages = (P12)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Conjugate Gradients (CG) GMRES Lanczos Arnoldi hellip bull Goal minimize communication

ndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computationndash Challenges Poor partitioning Preconditioning Num Stability

75

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Example The Difficulty of Tuning SpMV

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

77

Example The Difficulty of Tuning

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

bull 8x8 dense substructure exploit this to limit mem_refs

78

Speedups on Itanium 2 The Need for Search

Reference

Best 4x2

Mflops

Mflops

79

Register Profile Itanium 2

190 Mflops

1190 Mflops

80

Register Profiles IBM and Intel IA-64Power3 - 17 Power4 - 16

Itanium 2 - 33Itanium 1 - 8

252 Mflops

122 Mflops

820 Mflops

459 Mflops

247 Mflops

107 Mflops

12 Gflops

190 Mflops

Another example of tuning challenges for SpMV

bull Ex11 matrix (fluid flow)

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

82

Zoom in to top corner

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

83

3x3 blocks look natural buthellip

bull Example 3x3 blockingndash Logical grid of 3x3 cells

bull But would lead to lots of ldquofill-inrdquo

84

Extra Work Can Improve Efficiency

bull Example 3x3 blockingndash Logical grid of 3x3 cellsndash Fill-in explicit zerosndash Unroll 3x3 block multipliesndash ldquoFill ratiordquo = 15

bull On Pentium III 15x speedup

ndash Actual mflop rate 152 = 225 higher

85

Source Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal

Summer School Lecture 7

87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

Before Green + RedAfter Green + Blue

Summer School Lecture 7

892x speedups on Pentium 4 Power 4 hellip

Summary of Other Performance Optimizations

bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip

bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR

bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip

90

Optimized Sparse Kernel Interface - OSKI

• Provides sparse kernels automatically tuned for user's matrix & machine
  – BLAS-style functionality: SpMV (A·x & Aᵀ·y), TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as stand-alone library
  – Available as PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

91

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

93

Example: Classical Conjugate Gradient (CG)
[Algorithm listing: the SpMVs and dot products require communication in each iteration]
94

Example: CA-Conjugate Gradient
[Algorithm listing: the SpMVs are performed via the CA matrix powers kernel; a single global reduction computes the Gram matrix G; local computations within the inner loop require no communication]
(An annotated sketch of classical CG, marking the communication points, appears below.)
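Below is a minimal sketch in C of the classical CG loop, with comments marking where a distributed-memory implementation would communicate. The CSR layout and the sequential spmv()/dot() helpers are assumptions for illustration (dot() stands in for a global all-reduce); the CA reorganization referenced above would replace s of these iterations by one matrix powers kernel call plus one block reduction.

    /* Classical CG for a CSR matrix; comments mark where a parallel run communicates.
       Sequential reference code with an assumed layout, for illustration only. */
    static void spmv(int n, const int *Ap, const int *Ai, const double *Ax,
                     const double *x, double *y)          /* parallel: halo exchange */
    {
        for (int i = 0; i < n; ++i) {
            double s = 0.0;
            for (int k = Ap[i]; k < Ap[i+1]; ++k) s += Ax[k] * x[Ai[k]];
            y[i] = s;
        }
    }

    static double dot(int n, const double *u, const double *v)  /* parallel: all-reduce */
    {
        double s = 0.0;
        for (int i = 0; i < n; ++i) s += u[i] * v[i];
        return s;
    }

    void cg(int n, const int *Ap, const int *Ai, const double *Ax,
            const double *b, double *x,
            double *r, double *p, double *w,   /* caller-provided workspace */
            int maxit, double tol)
    {
        for (int i = 0; i < n; ++i) { x[i] = 0.0; r[i] = b[i]; p[i] = b[i]; }
        double rho = dot(n, r, r);                    /* 1 global reduction   */
        for (int it = 0; it < maxit && rho > tol * tol; ++it) {
            spmv(n, Ap, Ai, Ax, p, w);                /* 1 SpMV per iteration */
            double alpha = rho / dot(n, p, w);        /* 1 global reduction   */
            for (int i = 0; i < n; ++i) { x[i] += alpha * p[i]; r[i] -= alpha * w[i]; }
            double rho_new = dot(n, r, r);            /* 1 global reduction   */
            for (int i = 0; i < n; ++i) p[i] = r[i] + (rho_new / rho) * p[i];
            rho = rho_new;
            /* CA-CG blocks s of these iterations: one matrix powers kernel
               plus one reduction forming the Gram matrix G per block. */
        }
    }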

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

96

[Figure: convergence of CG vs. CA-CG (monomial basis) toward machine precision. CA-CG shows slower convergence and loss of accuracy due to roundoff; at s = 16 the monomial basis is rank deficient and the method breaks down.]
Model problem:
• 2D Poisson, 5-point stencil
• 30x30 grid
• cond(A) ≈ 400

97

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?
• Requires o(n²) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + UDVᵀ, D small & square (a sketch of applying such an A appears after the table below)
• Semiseparable matrices arise as preconditioners
  – Need to write Aᵏ = (S + UDVᵀ)ᵏ as a sum of Sᵏ and low-rank matrices

                               Indices explicit (O(nnz))   Indices implicit (o(nnz))
  Nonzero entries explicit:    CSR and variations          Vision, climate, AMR, …
  Nonzero entries implicit:    Graph Laplacian             Stencils
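As a small illustration of an implicitly represented matrix, here is a C sketch that applies A = S + U·D·Vᵀ to a vector without ever forming A. The CSR arrays for S and the column-major layouts of U, D, V are assumptions for illustration.

    /* y = (S + U*D*V^T) * x without forming A.
       Assumed: S is n-by-n in CSR; U, V are n-by-r and D is r-by-r, column-major. */
    #include <stdlib.h>

    void apply_sparse_plus_lowrank(
        int n, int r,
        const int *Sptr, const int *Sind, const double *Sval,  /* CSR for S          */
        const double *U, const double *D, const double *V,     /* dense, column-major */
        const double *x, double *y)
    {
        double *t = calloc(r, sizeof *t);   /* t = V^T x : only r words, not n^2 */
        double *s = calloc(r, sizeof *s);   /* s = D t                           */
        for (int j = 0; j < r; ++j)
            for (int i = 0; i < n; ++i) t[j] += V[i + j*n] * x[i];
        for (int i = 0; i < r; ++i)
            for (int j = 0; j < r; ++j) s[i] += D[i + j*r] * t[j];
        for (int i = 0; i < n; ++i) {
            double yi = 0.0;
            for (int k = Sptr[i]; k < Sptr[i+1]; ++k) yi += Sval[k] * x[Sind[k]]; /* S*x      */
            for (int j = 0; j < r; ++j) yi += U[i + j*n] * s[j];                  /* + U D V^T x */
            y[i] = yi;
        }
        free(t); free(s);
    }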

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

101

Reproducible Floating Point Computation
• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm at GNS-MBH
  – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe results
  – Willing to sacrifice 40% - 50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What? How will I debug without reproducibility?"
  – Few: "I know better and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility
[Figure, two panels: Absolute Error for Random Vectors (errors of the same magnitude, opposite signs) and Relative Error for Orthogonal Vectors (even the sign is not reproducible)]
Vector size 1e6, data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum – minimum
• Relative error = Absolute error / maximum absolute value
(A tiny demonstration of the underlying nonassociativity appears below.)

103
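The effect above is not specific to MKL: floating-point addition is simply not associative, so any change in the reduction order (thread count, blocking, vectorization) can change the result. The following self-contained C example, with assumed toy values, shows two groupings of the same three summands giving different answers.

    /* Floating-point addition is not associative: two groupings of the same
       three summands give different answers. Toy values are assumed. */
    #include <stdio.h>

    int main(void) {
        double a = 1.0, b = 1e-16, c = -1.0;
        double sum1 = (a + b) + c;   /* a + b rounds to 1.0, so sum1 = 0.0           */
        double sum2 = a + (b + c);   /* b + c rounds to -(1 - 2^-53), so sum2 ~ 1e-16 */
        printf("left  grouping: %.17g\n", sum1);
        printf("right grouping: %.17g\n", sum2);
        return 0;
    }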

Goals/Approaches for Reproducibility
• Consider summation or dot product
• Goals:
  1. Same answer independent of layout, number of processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (does not achieve goals 2 or 3)
  – Use (very) high precision to get exact answer (does not achieve goal 2)
  – Prerounding technique (Nguyen, D.)

104

Performance results on 1024 processors of a Cray XC30: 1.2x to 3.2x slowdown vs. fastest code for n = 1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

106

Time to redesign all linear algebra, n-body, … algorithms and software

(and compilers)


Page 23: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

vs

Runs all 7 multiplies in parallelEach on P7 processorsNeeds 74 as much memory

Runs all 7 multiplies sequentiallyEach on all P processorsNeeds 14 as much memory

CAPS If EnoughMemory and P 7 then BFS step else DFS step end if

Communication Avoiding Parallel Strassen (CAPS)

Best way to interleaveBFS and DFS is an tuning parameter

26

Performance Benchmarking Strong Scaling PlotFranklin (Cray XT4) n = 94080

Speedups 24-184(over previous Strassen-based algorithms)

Invited to appear as Research Highlight in CACM

Strassen-like beyond matmul

bull Thm (D Dumitriu Holtzrsquo07) Any Strassen-like O(nω) matmul algorithm can be used to build a numerically stable O(nω+η) algorithm for any ηgt0 for Ax=b least squares eig SVD hellipndash ηgt0 needed to deal with numerical stabilityndash Strassen already stable so η=0

bull Thm For sequential versions of these algorithms Words_moved = O(nω+ηM(ω+η)2 ndash 1 + n2 log n) ie attain expected lower bound

Ballard D Holtz Schwartz

Cache and Network Oblivious Algorithms

bull Motivation Minimizes communication at every level of a hierarchical system without tuning parameters (in theory)ndash Not always 25D Matmul on BGP was topology aware

bull CAPS Divide-and-conquer choose BFS or DFS to adapt to processors available memory

bull CARMAndash Divide-and-conquer classical matmul divide largest of 3

dimensions to create two subproblemsndash Choose BFS or DFS to adapt to processors available memory

CARMA Performance Distributed Memory

Square m = k = n = 6144

ScaLAPACK

CARMA

Peak

(log)

(log)

Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA

CARMA Performance Distributed Memory

Inner Product m = n = 192 k = 6291456

ScaLAPACK

CARMAPeak

(log)

(log)

Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA

CARMA Performance Shared Memory

Square m = k = n

MKL (double)CARMA (double)

MKL (single)CARMA (single)

Peak (single)

Peak (double)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

CARMA Performance Shared Memory

Inner Product m = n = 64

MKL (double)

CARMA (double)

MKL (single)

CARMA (single)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

Why is CARMA Faster in Shared MemoryL3 Cache Misses

Shared Memory Inner Product (m = n = 64 k = 524288)

97 Fewer Misses

86 Fewer Misses

(linear)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

One-sided Factorizations (LU QR) so farbull Classical Approach for i=1 to n update column i update trailing matrixbull words_moved = O(n3)

35

bull Blocked Approach (LAPACK) for i=1 to nb update block i of b columns update trailing matrixbull words moved = O(n3M13)

bull Recursive Approach func factor(A) if A has 1 column update it

else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)

bull None of these approaches minimizes messagesbull Parallel case Partial

Pivoting =gt n reductionsbull Need another idea

TSQR An Architecture-Dependent Algorithm

W =

W0

W1

W2

W3

R00

R10

R20

R30

R01

R11

R02Parallel

W =

W0

W1

W2

W3

R01R02

R00

R03

SequentialStreaming

W =

W0

W1

W2

W3

R00

R01

R01

R11

R02

R11

R03

Dual Core

Can choose reduction tree dynamically

Multicore Multisocket Multirack Multisite Out-of-core

Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo

Wnxb =

W1

W2

W3

W4

P1middotL1middotU1

P2middotL2middotU2

P3middotL3middotU3

P4middotL4middotU4

=

Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo

W1rsquoW2rsquoW3rsquoW4rsquo

P12middotL12middotU12

P34middotL34middotU34

=Choose b pivot rows call them W12rsquo

Choose b pivot rows call them W34rsquo

W12rsquoW34rsquo

= P1234middotL1234middotU1234

Choose b pivot rows

Go back to W and use these b pivot rows (move them to top do LU without pivoting)

37

Minimizing Communication in TSLU

W = W1

W2

W3

W4

LULULULU

LU

LULUParallel

W = W1

W2

W3

W4

LULU

LU

LUSequentialStreaming

W = W1

W2

W3

W4

LULU LU

LULU

LULU

Dual Core

Can choose reduction tree dynamically to match architecture as before

38

Making TSLU Numerically Stable

bull Details matterndash Going up the tree we could do LU either on original rows of A

(tournament pivoting) or computed rows of Undash Only tournament pivoting stable

bull ldquoThmrdquo New scheme as stable as Partial Pivoting (GEPP) in following sense Get same Schur complements as GEPP applied to different input matrix whose entries are blocks taken from input A

bull Why just a ldquoThmrdquo

39

Stability of LU using TSLU CALU

Summer School Lecture 4 40

bull Empirical testingndash Both random matrices and ldquospecial onesrdquondash Both binary tree (BCALU) and flat-tree (FCALU)ndash 3 metrics ||PA-LU||||A|| normwise and componentwise backward errorsndash See [D Grigori Xiang 2010] for details

Why is stability of TSLU just a ldquoThmrdquo

bull Proof is correct ndash in exact arithmeticbull Experiment

ndash Generate 100 random 6x6 rank 3 matrices in Matlabndash [LUP] = lu(A) do LU without pivoting on PA compare L factors are

they the samebull Compute || L ndash Lnp || A few 0rsquos A few infinrsquos a few NaNsbull Rest mostly O(1)

ndash Why Floating point is nonassociative doing arithmetic in different order gives different rounding errors

ndash Same experiment with rank 6 matrices || L ndash Lnp || usually nonzero O(macheps)

ndash Same experiment with 20x20 rank 4 matrices || L ndash Lnp || often O(103)

bull Much harder to break TSLU but possiblendash Occurred when using TSLU to factorize a low-rank subdiagonal

panel in symmetric-indefinite factorization41

Fixing TSLU

bull Run TSLU quickly test for stability fix if necessary (rare)

bull Test conditioning of U if not tiny (usual case) proceed elsebull Compute || L || if not big (usual case) proceed elsebull Factor A = QR using TSQR thenbull Factor Q = PLU using TSLU thenbull A = PL(UR) with UR as upper triangular factor

bull Last topic in lecture how to guarantee floating point reproducibility

42

2D CALU with Tournament Pivoting

43

25D CALU with Tournament Pivoting (c=4 copies)

44

Exascale Machine ParametersSource DOE Exascale Workshop

bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log

2 (

n2p

) =

log

2 (m

emo

ry_p

er_p

roc)

Up to 29x

25D vs 2D LUWith and Without Pivoting

Other CA algorithms for Ax=b least squares(13)

bull A symmetric and indefinitendash Seek factorization that retains symmetry PAPT = LDLT D

ldquosimplerdquobull Save frac12 flops preserve inertia

ndash Usual approach Bunch-Kaufmanbull D block diagonal with 1x1 and 2x2 blocksbull Pivot search down column along row (lots of communication)

ndash Alternative Aasenbull D = tridiagonal = Tbull Two steps

ndash PAPT = LTLT where T is banded using TSLU

48

0 0

0

0 0

0

0

hellip

hellip

ndash Solvefactor narrow band problem with Tbull Up to 28x faster than MKL Best Paper at IPDPSrsquo13

Other CA algorithms for Ax=b least squares (23)bull Minimizing bandwidth and latency for sequential GEPP

ndash So far could not do partial pivoting and minimize messages just words

ndash Challengebull Column layout good for choosing pivots bad for matmulbull Blocked layout good for matmul bad for choosing pivots

ndash Solution use both layouts switching between thembull ldquoShape Morphing LUrdquo or SMLU

49

bull func factor(A) if A has 1 column update it else factor(left half of A)

update right half of A

factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M)

bull func factor(A) if A has 1 column update it else factor(left half of A) reshape to recursive block format update right half of A reshape to columnwise format factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M32)

Other CA algorithms for Ax=b least squares (33)bull Need for pivoting arises beyond LU in QR

ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)

ndash Usual approach like Partial Pivoting

bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound

ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two

groups of b columns either using usual approach or something better (GuEisenstat)

bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix

ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 50

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What about sparse matrices (13)

bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then

bull But canrsquot reorder outer loop for 25D need another idea

bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring

bull Kleenersquos Algorithm

52

for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)

D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21

Performance of 25D APSP using Kleene

53

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)

62xspeedup

2x speedup

What about sparse matrices (23)

bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of

G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix

bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering

for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid

bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse

multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on

separators) 54

What about sparse matrices (33)

bull If matrix stays very sparse lower bound unattainable new one

bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of

data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)

bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)

Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely

bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices

along dimensions most likely to minimize cost

55

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Symmetric Eigenproblem and SVD

bull Usual approach for A=AT (SVD similar)ndash A QTAQ = T where Q orthogonal T tridiagonalndash T UTTU = Λ where U orthogonal Λ diagonalndash QUrsquos columns are eigenvectors Λ eigenvaluesndash Dense Tridiagonal Diagonalndash Only half BLAS3 half BLAS2 in LAPACKrsquos sytrd

bull Communication-Avoiding Approachndash A QAQT = B where B=BT banded of bandwidth M12

ndash Continue as above starting with Bndash Dense Banded Tridiagonal Diagonalndash Dense Banded use TSQR to zero out M12 colsrows at a timendash Banded Tridiagonal need new(ish) idea

Successive Band Reduction (Bischof/Lang/Sun)

[Sequence of figures, one per slide: with an orthogonal transform Q1, annihilate c columns of the band (bandwidth b+1), creating a bulge of d extra diagonals; then chase the bulges down the band with Q2, Q3, Q4, Q5, …, applying each Qi and Qi^T to preserve symmetry. Legend on each frame: b = bandwidth, c = #columns, d = #diagonals, constraint: c + d ≤ b]

Conventional vs CA-SBR

[Animations comparing the two approaches]
  Conventional:             touch all data 4 times
  Communication-Avoiding:   touch all data once

Speedups of Sym. Band Reduction vs DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem

• No apparent way to modify standard algorithm
• Instead: Spectral Divide-and-Conquer
  – Find orthogonal matrix Q whose leading columns span an invariant subspace of A
  – Q^T A Q will be block upper triangular:

        [ A11  A12 ]
        [  ε   A22 ]

  – Apply recursively to A11, A22
  – Depends on randomization
    1. Randomized Rank Revealing QR decomposition
    2. Randomized location to try splitting spectrum

Attaining the Lower Bounds: Sequential
Legend: [Existing] [Ours] [Math-Lib] [Random]
Columns: Two Levels (Words, Messages) and Memory Hierarchy (Words, Messages)

  BLAS-3:             [FLPR'99][BDLST'13][MKL etc.]   [FLPR'99][BDLST'13][MKL etc.]
  Cholesky:           [G'97][AP'00]   [LAPACK][BDHS'09]   [G'97][AP'00][BDHS'09]   [G'97][AP'00][BDHS'09]
  Sym Indefinite:     [BBDDDPSTY'13]   [BBDDDPSTY'13]
  LU:                 [G'97][T'97]   [GDX'11][BDLST'13]   [GDX'11][BDLST'13]   [G'97][T'97] [BDLST'13] [BDLST'13]
  QR:                 [EG'98][FW'03]   [DGHL'12][BDLST'13]   [FW'03][DGHL'12][BDLST'13]   [EG'98][FW'03][BDLST'13]   [FW'03][BDLST'13]
  Rank Revealing QR:  [BDD'11][DGGX'13]
  Sym Eig & SVD:      [BDD'11][BDK'13]   [BDD'11]
  Non Sym Eig:        [BDD'11]   [BDD'11]

Attaining the Lower Bounds: Parallel 2D, M = Θ(n^2/P)
(Ignoring poly-log(P) factors; #words = Ω(n^2/P^(1/2)), #messages = Ω(P^(1/2)))
Legend: [Existing] [Ours] [Math-Lib] [Random]
Columns: Words (BW), Messages (L); "Saving factor" is what is gained by attaining with extra memory (2.5D, M = Θ(c·n^2/P))

  BLAS-3:             [AGZ'94][MT'99][ScaLAPACK][C'69][vGW'97][SD'11]          L: n/P^(1/2)
  Cholesky:           [ScaLAPACK][T'99][SD'11]                                 L: n/P^(1/2)
  Sym Indefinite:     [BBDDDPSTY'13][ScaLAPACK]   [BBDDDPSTY'13]               L: n/P^(1/2)
  LU:                 [ScaLAPACK][GDX'11][T'99][SD'11]   [GDX'11][T'99][SD'11]   L: n/P^(1/2)
  QR:                 [ScaLAPACK][DGHL'12][T'99]   [DGHL'12][T'99]             L: n/P^(1/2)
  Rank Revealing QR:  [BDD'11][DGGX'13]
  Sym Eig & SVD:      [BDD'11][BDK'13][ScaLAPACK]   [BDD'11][BDK'13]           L: n/P^(1/2)
  Non-Sym Eig:        [BDD'11]   [BDD'11]                                      BW: P^(1/2), L: n

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra

• k steps of iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and starting vector
  – Many such "Krylov Subspace Methods"
    • Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors
    • Conventional: O(k log p) messages (k SpMV calls, dot prods)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability

75

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV

• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

77

Example: The Difficulty of Tuning

• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs

78

Speedups on Itanium 2: The Need for Search

[Figure: register-profile heatmap in Mflops — Reference (unblocked) vs Best (4x2 blocking)]

79

Register Profile: Itanium 2

[Figure: performance across register block sizes, ranging from 190 Mflops to 1190 Mflops]

80

Register Profiles: IBM and Intel IA-64

[Figures: register-blocking profiles for Power3 (17), Power4 (16), Itanium 1 (8), Itanium 2 (33); annotated rates: 252, 122, 820, 459, 247, 107, 190 Mflops and 1.2 Gflops]

Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16,614
• NNZ = 1.1 M

82

Zoom in to top corner

• More complicated non-zero structure in general
• N = 16,614
• NNZ = 1.1 M

83

3x3 blocks look natural, but…

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

84

Extra Work Can Improve Efficiency

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup!
  – Actual mflop rate is 1.5^2 = 2.25x higher

85
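A small SciPy illustration of the same trade-off (my own sketch, not the OSKI heuristic): convert a matrix to r x c blocked storage, padding partial blocks with explicit zeros, and report the resulting fill ratio.

  import scipy.sparse as sp

  def fill_ratio(A, r=3, c=3):
      # Stored entries (including explicit zero padding) divided by true nnz.
      B = A.tobsr(blocksize=(r, c))     # BSR stores every touched r x c block densely
      return B.data.size / A.nnz

  # toy usage: a random sparse matrix whose 3x3 blocks are only partly dense
  A = sp.random(300, 300, density=0.02, format='csr', random_state=0)
  print("3x3 fill ratio:", fill_ratio(A, 3, 3))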

Source: Accelerator Cavity Design Problem (Ko, via Husbands)

86

100x100 Submatrix Along Diagonal


87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

[Spy plots] Before: Green + Red; After: Green + Blue

2x speedups on Pentium 4, Power 4, …

89

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·A^T·x, A^T·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later …

90

Optimized Sparse Kernel Interface – OSKI

• Provides sparse kernels automatically tuned for user's matrix & machine
  – BLAS-style functionality: SpMV (Ax & A^T y), TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as stand-alone library
  – Available as PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

91

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

93

Example: Classical Conjugate Gradient (CG)

[Algorithm listing; annotation: the SpMVs and dot products require communication in each iteration]

94

Example: CA-Conjugate Gradient

[Algorithm listing; annotations: the s SpMVs are computed via the CA Matrix Powers Kernel, a single global reduction computes the Gram matrix G, and the local computations within the inner loop require no communication]
Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

96

[Convergence plot: CA-CG (monomial basis) vs CG, which converges to machine precision. Model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400. Annotations: slower convergence due to roundoff; loss of accuracy due to roundoff; at s = 16 the monomial basis is rank deficient and the method breaks down]

97
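A small NumPy experiment (my own, not from the slides) showing why the monomial basis fails: build the same model problem and watch the condition number of the column-normalized basis [x, Ax, …, A^sx] explode as s grows.

  import numpy as np
  import scipy.sparse as sp

  # 2D Poisson, 5-point stencil, 30x30 grid (cond(A) ~ 400)
  n = 30
  T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n))
  A = sp.kron(sp.identity(n), T) + sp.kron(T, sp.identity(n))

  x = np.random.default_rng(0).standard_normal(n * n)
  V = np.empty((n * n, 17))
  V[:, 0] = x / np.linalg.norm(x)
  for j in range(16):
      V[:, j + 1] = A @ V[:, j]
      V[:, j + 1] /= np.linalg.norm(V[:, j + 1])   # scaling columns doesn't cure the conditioning

  for s in (4, 8, 12, 16):
      print("s =", s, " cond of monomial basis:", np.linalg.cond(V[:, :s + 1]))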

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?
• Requires o(n^2) data/indices to store
• Nonzero entries and indices could be explicit or implicit (see the table below)
• Matrix could be a sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + U·D·V^T, with D small & square (a sketch of applying such an A follows the table)
• Semiseparable matrices arise as preconditioners
  – Need to write A^k = (S + U·D·V^T)^k as a sum of S^k and low-rank matrices

                         Indices explicit (O(nnz))    Indices implicit (o(nnz))
  Entries explicit:      CSR and variations           Vision, climate, AMR, …
  Entries implicit:      Graph Laplacian              Stencils
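A tiny sketch (names and sizes are illustrative) of applying such a sparse-plus-low-rank A to a vector without ever forming the dense sum: one SpMV plus skinny dense products.

  import numpy as np
  import scipy.sparse as sp

  def apply_sparse_plus_lowrank(S, U, D, V, x):
      # y = (S + U D V^T) x
      return S @ x + U @ (D @ (V.T @ x))

  n, k = 1000, 5
  rng = np.random.default_rng(0)
  S = sp.random(n, n, density=0.01, format='csr', random_state=0)
  U, V = rng.standard_normal((n, k)), rng.standard_normal((n, k))
  D = np.diag(rng.standard_normal(k))
  y = apply_sparse_plus_lowrank(S, U, D, V, rng.standard_normal(n))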

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

101

Reproducible Floating Point Computation

• Goal: get a bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought reproducible parallel sparse linear equation solver, demanded by customers (construction engineers), otherwise they don't believe results
  – Willing to sacrifice 40%–50% of performance for it
• Email to ~110 Berkeley CSE faculty, asking about it
  – Most: "What? How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility

[Plots: absolute error for random vectors (same magnitude, opposite signs) and relative error for orthogonal vectors (sign not reproducible)]

Vector size: 1e6. Data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3 or 4 threads
• Absolute error = maximum – minimum
• Relative error = Absolute error / maximum absolute value

103
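The underlying cause is that floating-point addition is not associative, so a reduction whose grouping depends on the number of threads rounds differently. A two-minute illustration, independent of MKL:

  import numpy as np

  x = np.random.default_rng(0).standard_normal(10**6)

  s1 = np.sum(x)                                        # one summation order
  s2 = np.sum(x[::-1])                                  # same numbers, reversed order
  s3 = np.sum(x.reshape(1000, 1000).sum(axis=0))        # a blocked, "parallel-like" order

  # The differences are typically nonzero, at roughly the 1e-13 level here.
  print(s1 - s2, s1 - s3)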

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (not 2. or 3.)
  – Use (very) high precision to get exact answer (not 2.)
  – Prerounding technique (Nguyen, D.)

104
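A deliberately simplified, single-bin sketch of the pre-rounding idea (my own illustration, not the actual Nguyen/Demmel algorithm, which uses several bins to preserve accuracy): pre-round every summand onto a grid chosen from n and max|x_i| so that all subsequent additions are exact, hence independent of the summation order.

  import numpy as np

  def reproducible_sum(x):
      x = np.asarray(x, dtype=np.float64)
      n, m = x.size, np.max(np.abs(x))     # max is exact and associative -> same on any layout
      if m == 0.0:
          return 0.0
      # Every pre-rounded value is an integer multiple of delta, and any partial
      # sum stays below 2^53 in magnitude, so the additions below commit no rounding error.
      delta = 2.0 ** (np.ceil(np.log2(n * m)) - 52)
      xr = np.round(x / delta) * delta     # pre-rounding: accuracy is traded here (<= delta/2 per term)
      return float(np.sum(xr))             # exact, hence order-independent

  x = np.random.default_rng(0).standard_normal(10**6)
  assert reproducible_sum(x) == reproducible_sum(x[::-1])   # bit-for-bit identical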

Performance results on 1024-proc. Cray XC30: 1.2x to 3.2x slowdown vs fastest code, for n=1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)

106

Page 24: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

26

Performance Benchmarking, Strong Scaling Plot
Franklin (Cray XT4), n = 94080

Speedups: 24%–184% (over previous Strassen-based algorithms)

Invited to appear as Research Highlight in CACM

Strassen-like beyond matmul

• Thm (D., Dumitriu, Holtz '07): Any Strassen-like O(n^ω) matmul algorithm can be used to build a numerically stable O(n^(ω+η)) algorithm, for any η>0, for Ax=b, least squares, eig, SVD, …
  – η>0 needed to deal with numerical stability
  – Strassen already stable, so η=0
• Thm: For sequential versions of these algorithms, Words_moved = O( n^(ω+η)/M^((ω+η)/2 – 1) + n^2 log n ), i.e. they attain the expected lower bound

Ballard, D., Holtz, Schwartz

Cache and Network Oblivious Algorithms

• Motivation: minimize communication at every level of a hierarchical system without tuning parameters (in theory)
  – Not always: 2.5D Matmul on BG/P was topology aware
• CAPS: divide-and-conquer, choose BFS or DFS to adapt to #processors, available memory
• CARMA
  – Divide-and-conquer classical matmul: divide largest of 3 dimensions to create two subproblems (a minimal recursive sketch follows)
  – Choose BFS or DFS to adapt to #processors, available memory
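A sequential sketch of the CARMA recursion (illustration only: the real algorithm also chooses breadth-first vs depth-first expansion of subproblems to fit the available processors and memory, which is not modeled here).

  import numpy as np

  def carma(A, B, C, base=64):
      # C += A @ B, recursively splitting the largest of the three dimensions m, k, n.
      m, k = A.shape
      n = B.shape[1]
      if max(m, k, n) <= base:
          C += A @ B
          return
      if m >= k and m >= n:            # split rows of A and C
          carma(A[:m//2], B, C[:m//2], base)
          carma(A[m//2:], B, C[m//2:], base)
      elif n >= k:                     # split columns of B and C
          carma(A, B[:, :n//2], C[:, :n//2], base)
          carma(A, B[:, n//2:], C[:, n//2:], base)
      else:                            # split the shared dimension k: two updates of all of C
          carma(A[:, :k//2], B[:k//2], C, base)
          carma(A[:, k//2:], B[k//2:], C, base)

  # usage
  A, B = np.random.rand(300, 500), np.random.rand(500, 200)
  C = np.zeros((300, 200))
  carma(A, B, C)
  assert np.allclose(C, A @ B)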

CARMA Performance: Distributed Memory
[Plot, log-log: square case, m = k = n = 6144 — CARMA vs ScaLAPACK vs Peak, on Cray XE6 (Hopper), each node 2 x 12 cores, 4 x NUMA]

CARMA Performance: Distributed Memory
[Plot, log-log: inner-product-shaped case, m = n = 192, k = 6,291,456 — CARMA vs ScaLAPACK vs Peak, on Cray XE6 (Hopper), each node 2 x 12 cores, 4 x NUMA]

CARMA Performance: Shared Memory
[Plot, log-linear: square case, m = k = n — CARMA vs MKL, single and double precision, vs Peak, on Intel Emerald: 4 x Intel Xeon X7560, 8 cores each, 4 x NUMA]

CARMA Performance: Shared Memory
[Plot, log-linear: inner-product-shaped case, m = n = 64 — CARMA vs MKL, single and double precision, on Intel Emerald: 4 x Intel Xeon X7560, 8 cores each, 4 x NUMA]

Why is CARMA Faster in Shared Memory? L3 Cache Misses
[Plot: shared-memory inner product (m = n = 64, k = 524,288) — 97% fewer misses and 86% fewer misses than MKL]

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

One-sided Factorizations (LU, QR), so far

• Classical Approach
     for i = 1 to n
        update column i
        update trailing matrix
  #words_moved = O(n^3)

• Blocked Approach (LAPACK)
     for i = 1 to n/b
        update block i of b columns
        update trailing matrix
  #words_moved = O(n^3/M^(1/3))

• Recursive Approach
     func factor(A)
        if A has 1 column, update it
        else
           factor(left half of A)
           update right half of A
           factor(right half of A)
  #words_moved = O(n^3/M^(1/2))

• None of these approaches minimizes #messages
• Parallel case: Partial Pivoting => n reductions
• Need another idea

35
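A toy NumPy version of the recursive approach, for LU without pivoting (my own sketch; the slides' point is the data movement, and pivoting is what TSLU below adds).

  import numpy as np

  def recursive_lu(A):
      # In-place LU without pivoting: factor(left half), update right half, factor(right half).
      n = A.shape[1]
      if n == 1:
          A[1:, 0] /= A[0, 0]                              # "update it"
          return A
      m = n // 2
      recursive_lu(A[:, :m])                               # factor left half
      L11 = np.tril(A[:m, :m], -1) + np.eye(m)
      A[:m, m:] = np.linalg.solve(L11, A[:m, m:])          # U12 = L11^{-1} A12
      A[m:, m:] -= A[m:, :m] @ A[:m, m:]                   # Schur complement update
      recursive_lu(A[m:, m:])                              # factor right half
      return A

  # usage (diagonally dominant, so skipping pivoting is safe)
  n = 8
  A = np.random.rand(n, n) + n * np.eye(n)
  LU = recursive_lu(A.copy())
  L, U = np.tril(LU, -1) + np.eye(n), np.triu(LU)
  assert np.allclose(L @ U, A)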

TSQR: An Architecture-Dependent Algorithm

[Diagrams of reduction trees on W = [W0; W1; W2; W3]:
 Parallel (binary tree): local QR of each Wi gives R00, R10, R20, R30; pairs combine to R01, R11; final combine gives R02
 Sequential / Streaming (flat tree): R00 from W0, then fold in W1, W2, W3 to get R01, R02, R03
 Dual Core: a hybrid of the two trees]

Can choose reduction tree dynamically: Multicore, Multisocket, Multirack, Multisite, Out-of-core
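A sketch of the parallel (binary-tree) variant, simulated sequentially with NumPy: W is split into row blocks (one per hypothetical processor), only R factors travel up the tree, and the Q factors are not assembled.

  import numpy as np

  def tsqr_R(blocks):
      # Local QR on each block, then repeatedly stack and re-factor pairs of R's.
      Rs = [np.linalg.qr(W, mode='r') for W in blocks]
      while len(Rs) > 1:
          nxt = [np.linalg.qr(np.vstack(Rs[i:i+2]), mode='r')
                 for i in range(0, len(Rs) - 1, 2)]
          if len(Rs) % 2:
              nxt.append(Rs[-1])        # odd block out passes through unchanged
          Rs = nxt
      return Rs[0]

  # usage: 4 "processors", tall-skinny 4000 x 50
  W = np.random.rand(4000, 50)
  R = tsqr_R(np.array_split(W, 4))
  R_ref = np.linalg.qr(W, mode='r')
  assert np.allclose(np.abs(R), np.abs(R_ref))   # equal up to row signs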

Back to LU: Using a similar idea for TSLU as TSQR: use a reduction tree to do "Tournament Pivoting"

[Diagram] W (n x b) = [W1; W2; W3; W4]:
• Factor each block Wi = Pi·Li·Ui, and choose b pivot rows of Wi, call them Wi'
• Stack pairs: factor [W1'; W2'] = P12·L12·U12 and [W3'; W4'] = P34·L34·U34, choosing b pivot rows W12' and W34'
• Factor [W12'; W34'] = P1234·L1234·U1234, and choose b final pivot rows
• Go back to W and use these b pivot rows (move them to top, do LU without pivoting)

37
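A toy sketch of one tournament that selects b pivot rows (the selection step only; function names are mine). Each node of the tree runs ordinary GEPP on its candidate rows and passes its b pivot rows upward.

  import numpy as np
  from scipy.linalg import lu

  def gepp_pivot_rows(W, b):
      # Indices (into W) of the b rows that partial pivoting on W would select.
      P, L, U = lu(W)                          # W = P @ L @ U
      perm = np.argmax(P, axis=0)              # perm[j] = row of W used as j-th pivot
      return perm[:b]

  def tournament_pivots(W, b, p=4):
      blocks = np.array_split(np.arange(W.shape[0]), p)
      cand = [blk[gepp_pivot_rows(W[blk], b)] for blk in blocks]   # round 1: local choices
      while len(cand) > 1:                     # later rounds: pairwise "playoffs"
          nxt = []
          for i in range(0, len(cand) - 1, 2):
              rows = np.concatenate([cand[i], cand[i+1]])
              nxt.append(rows[gepp_pivot_rows(W[rows], b)])
          if len(cand) % 2:
              nxt.append(cand[-1])
          cand = nxt
      return cand[0]                           # b global pivot rows: move to top, LU without pivoting

  print(tournament_pivots(np.random.randn(1024, 4), b=4))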

Minimizing Communication in TSLU

[Diagrams: the same reduction trees as for TSQR, with an LU factorization at each node — Parallel (binary tree), Sequential / Streaming (flat tree), Dual Core (hybrid)]

Can choose reduction tree dynamically, to match architecture, as before

38

Making TSLU Numerically Stable

• Details matter
  – Going up the tree, we could do LU either on original rows of A (tournament pivoting) or on computed rows of U
  – Only tournament pivoting is stable
• "Thm": New scheme as stable as Partial Pivoting (GEPP) in the following sense: get same Schur complements as GEPP applied to a different input matrix whose entries are blocks taken from input A
• Why just a "Thm"?

39

Stability of LU using TSLU: CALU

• Empirical testing
  – Both random matrices and "special ones"
  – Both binary tree (BCALU) and flat-tree (FCALU)
  – 3 metrics: ||PA–LU||/||A||, normwise and componentwise backward errors
  – See [D., Grigori, Xiang, 2010] for details

40

Why is stability of TSLU just a "Thm"?

• Proof is correct – in exact arithmetic
• Experiment
  – Generate 100 random 6x6, rank-3 matrices in Matlab
  – [L,U,P] = lu(A), do LU without pivoting on P·A, compare L factors: are they the same?
    • Compute || L – Lnp ||: a few 0's, a few ∞'s, a few NaNs
    • Rest mostly O(1)
  – Why? Floating point is nonassociative: doing arithmetic in a different order gives different rounding errors
  – Same experiment with rank-6 matrices: || L – Lnp || usually nonzero, O(macheps)
  – Same experiment with 20x20 rank-4 matrices: || L – Lnp || often O(10^3)
• Much harder to break TSLU, but possible
  – Occurred when using TSLU to factorize a low-rank subdiagonal panel in symmetric-indefinite factorization

41

Fixing TSLU

• Run TSLU, quickly test for stability, fix if necessary (rare)
• Test conditioning of U; if not tiny (usual case), proceed, else
• Compute || L ||; if not big (usual case), proceed, else
• Factor A = QR using TSQR, then
• Factor Q = PLU using TSLU, then
• A = P·L·(U·R), with U·R as the upper triangular factor
• Last topic in lecture: how to guarantee floating-point reproducibility

42

2D CALU with Tournament Pivoting

43

2.5D CALU with Tournament Pivoting (c=4 copies)

44

Exascale Machine Parameters
(Source: DOE Exascale Workshop)

• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node

Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU

[Heatmap over log2(p) and log2(n^2/p) = log2(memory_per_proc); speedups up to 29x]

2.5D vs 2D LU, With and Without Pivoting

Other CA algorithms for Ax=b, least squares (1/3)

• A symmetric and indefinite
  – Seek factorization that retains symmetry: P·A·P^T = L·D·L^T, D "simple"
    • Save 1/2 the flops, preserve inertia
  – Usual approach: Bunch-Kaufman
    • D block diagonal with 1x1 and 2x2 blocks
    • Pivot search down column, along row (lots of communication)
  – Alternative: Aasen
    • D = tridiagonal = T
    • Two steps:
      – P·A·P^T = L·T·L^T where T is banded, using TSLU
        [figure: banded T]
      – Solve/factor narrow-band problem with T
    • Up to 2.8x faster than MKL; Best Paper at IPDPS'13

48

Other CA algorithms for Ax=b, least squares (2/3)

• Minimizing bandwidth and latency for sequential GEPP
  – So far, could not do partial pivoting and minimize #messages, just #words
  – Challenge:
    • Column layout good for choosing pivots, bad for matmul
    • Blocked layout good for matmul, bad for choosing pivots
  – Solution: use both layouts, switching between them
    • "Shape Morphing LU", or SMLU

• Usual recursive factorization:
     func factor(A)
        if A has 1 column, update it
        else
           factor(left half of A)
           update right half of A
           factor(right half of A)
  #Words = O(n^3/M^(1/2)), #Messages = O(n^3/M)

• Shape Morphing LU:
     func factor(A)
        if A has 1 column, update it
        else
           factor(left half of A)
           reshape to recursive block format
           update right half of A
           reshape to columnwise format
           factor(right half of A)
  #Words = O(n^3/M^(1/2)), #Messages = O(n^3/M^(3/2))

49

Other CA algorithms for Ax=b, least squares (3/3)

• Need for pivoting arises beyond LU, in QR
  – Choose permutation P so that leading columns of A·P = Q·R span the column space of A – Rank Revealing QR (RRQR)
  – Usual approach, like Partial Pivoting
    • Put longest column first, update rest of matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit the lower bound
  – Use Tournament Pivoting
    • Each round of the tournament selects the best b columns from two groups of b columns, either using the usual approach or something better (Gu/Eisenstat)
    • Thm: This approach "reveals the rank" of A, in the sense that the leading r x r submatrix of R has singular values "near" the largest r singular values of A; ditto for the trailing submatrix
  – Idea extends to other pivoting schemes
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDL^T with complete pivoting

50

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What about sparse matrices (13)

bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then

bull But canrsquot reorder outer loop for 25D need another idea

bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring

bull Kleenersquos Algorithm

52

for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)

D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21

Performance of 25D APSP using Kleene

53

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)

62xspeedup

2x speedup

What about sparse matrices (23)

bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of

G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix

bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering

for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid

bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse

multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on

separators) 54

What about sparse matrices (33)

bull If matrix stays very sparse lower bound unattainable new one

bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of

data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)

bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)

Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely

bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices

along dimensions most likely to minimize cost

55

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Symmetric Eigenproblem and SVD

bull Usual approach for A=AT (SVD similar)ndash A QTAQ = T where Q orthogonal T tridiagonalndash T UTTU = Λ where U orthogonal Λ diagonalndash QUrsquos columns are eigenvectors Λ eigenvaluesndash Dense Tridiagonal Diagonalndash Only half BLAS3 half BLAS2 in LAPACKrsquos sytrd

bull Communication-Avoiding Approachndash A QAQT = B where B=BT banded of bandwidth M12

ndash Continue as above starting with Bndash Dense Banded Tridiagonal Diagonalndash Dense Banded use TSQR to zero out M12 colsrows at a timendash Banded Tridiagonal need new(ish) idea

b+1

b+1

Successive Band Reduction (BischofLangSun)

1

b+1

b+1

d+1

c

Successive Band Reduction (BischofLangSun)

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1Q1

b+1

b+1

d+1

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

12

Q1

b+1

b+1

d+1

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

12

Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

Q1

Q1T

Q2

Q2T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

Q1

Q1T

Q2

Q2T

Q3

Q3T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

6

6

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

Conventional vs CA - SBR

Conventional Communication-Avoiding

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Nonsymmetric Eigenproblem

bull No apparent way to modify standard algorithmbull Instead Spectral Divide-and-Conquer

ndash Find orthogonal matrix Q whose leading columns span an invariant subspace of A

ndash QTAQ will be block upper triangular

ndash Apply recursively to A11 A22

ndash Depends on randomization1 Randomized Rank Revealing QR decomposition2 Randomized location to try splitting spectrum

A11 A12

ε A22

Attaining the Lower bounds SequentialLegend[Existing][Ours][Math-Lib][Random]

Two Levels Memory Hierarchy

Words Messages Words Messages

BLAS-3 [FLPRrsquo99][BDLSTrsquo13][MKL etc] [FLPRrsquo99][BDLSTrsquo13][MKL etc]

Cholesky[Grsquo97][APrsquo00]

[LAPACK][BDHSrsquo09]

[Grsquo97][APrsquo00][BDHSrsquo09] [Grsquo97][APrsquo00][BDHSrsquo09]

Sym Indefinite [BBDDDPSTYrsquo13] [BBDDDPSTYrsquo13]

LU[Grsquo97][Trsquo97]

[GDXrsquo11][BDLSTrsquo13]

[GDXrsquo11][BDLSTrsquo13]

[Grsquo97][Trsquo97] [BDLSTrsquo13] [BDLSTrsquo13]

QR[EGrsquo98][FWrsquo03]

[DGHLrsquo12][BDLSTrsquo13]

[FWrsquo03][DGHLrsquo12][BDLSTrsquo13]

[EGrsquo98][FWrsquo03][BDLSTrsquo13]

[FWrsquo03][BDLSTrsquo13]

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13] [BDDrsquo11]

Non Sym Eig [BDDrsquo11] [BDDrsquo11]

Legend[Existing][Ours][Math-Lib][Random]

Words (BW) Messages (L) Saving factor

BLAS-3 [AGZrsquo94][MTrsquo99][ScaLAPACK][Crsquo69][vGWrsquo97][SDrsquo11] L nP12

Cholesky [ScaLAPACK][Trsquo99][SDrsquo11] L nP12

Sym Indefinite [BBDDDPSTYrsquo13][ScaLAPACK] [BBDDDPSTYrsquo13] L nP12

LU [ScaLAPACK][GDXrsquo11][Trsquo99][SDrsquo11] [GDXrsquo11][Trsquo99][SDrsquo11] L nP12

QR [ScaLAPACK][DGHLrsquo12] [Trsquo99] [DGHLrsquo12][Trsquo99] L nP12

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13][ScaLAPACK] [BDDrsquo11][BDKrsquo13] L nP12

Non-Sym Eig [BDDrsquo11] [BDDrsquo11] BW P12 L n

Attaining with extra memory 25D M=(cn2P)

Attaining the Lower bounds Parallel 2DM=(n2P)(Ignoring poly-log(P) factors words = ( n2 P12) messages = (P12)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Conjugate Gradients (CG) GMRES Lanczos Arnoldi hellip bull Goal minimize communication

ndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computationndash Challenges Poor partitioning Preconditioning Num Stability

75

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Example The Difficulty of Tuning SpMV

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

77

Example The Difficulty of Tuning

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

bull 8x8 dense substructure exploit this to limit mem_refs

78

Speedups on Itanium 2 The Need for Search

Reference

Best 4x2

Mflops

Mflops

79

Register Profile Itanium 2

190 Mflops

1190 Mflops

80

Register Profiles IBM and Intel IA-64Power3 - 17 Power4 - 16

Itanium 2 - 33Itanium 1 - 8

252 Mflops

122 Mflops

820 Mflops

459 Mflops

247 Mflops

107 Mflops

12 Gflops

190 Mflops

Another example of tuning challenges for SpMV

bull Ex11 matrix (fluid flow)

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

82

Zoom in to top corner

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

83

3x3 blocks look natural buthellip

bull Example 3x3 blockingndash Logical grid of 3x3 cells

bull But would lead to lots of ldquofill-inrdquo

84

Extra Work Can Improve Efficiency

bull Example 3x3 blockingndash Logical grid of 3x3 cellsndash Fill-in explicit zerosndash Unroll 3x3 block multipliesndash ldquoFill ratiordquo = 15

bull On Pentium III 15x speedup

ndash Actual mflop rate 152 = 225 higher

85

Source Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal

Summer School Lecture 7

87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

Before Green + RedAfter Green + Blue

Summer School Lecture 7

892x speedups on Pentium 4 Power 4 hellip

Summary of Other Performance Optimizations

bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip

bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR

bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip

90

Optimized Sparse Kernel Interface - OSKI

bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning

bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski

bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression

software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki

91

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

93

Example Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in

each iteration

via CA Matrix Powers Kernel

Global reduction to compute G

94

Example CA-Conjugate Gradient

Local computations within inner loop require

no communication

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

96

Slower convergence due

to roundoff

Loss of accuracy due to roundoff

At s = 16 monomial basis is rank deficient Method breaks down

Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400

CA-CG (monomial)CG

machine precision

97

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit

bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square

bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank

matrices

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

101

bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010

ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver

demanded by customers (construction engineers) otherwise they donrsquot believe results

ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it

ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection

Reproducible Floating Point Computation

Absolute Error for Random Vectors

Same magnitude opposite signs

Intel MKL non-reproducibility

Relative Error for Orthogonal vectors

Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value

Sign notreproducible

103

bull Consider summation or dot productbull Goals

1 Same answer independent of layout processors order of summands

2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy

bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)

GoalsApproaches for Reproducibility

104

Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M

Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya

Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger

bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David

Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang

bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou

Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National

Instruments NEC Nokia NVIDIA Samsung Oracle

bull bebopcsberkeleyedu

Summary

Donrsquot Communichellip

106

Time to redesign all linear algebra n-body hellip algorithms and software

(and compilers)

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all ldquon3-likerdquo linear algebra
  • Lower bound for all ldquon3-likerdquo linear algebra (2)
  • Lower bound for all ldquon3-likerdquo linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a ldquoThmrdquo
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a ldquosparse matrixrdquo
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary
Page 25: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

Strassen-like beyond matmul

bull Thm (D Dumitriu Holtzrsquo07) Any Strassen-like O(nω) matmul algorithm can be used to build a numerically stable O(nω+η) algorithm for any ηgt0 for Ax=b least squares eig SVD hellipndash ηgt0 needed to deal with numerical stabilityndash Strassen already stable so η=0

bull Thm For sequential versions of these algorithms Words_moved = O(nω+ηM(ω+η)2 ndash 1 + n2 log n) ie attain expected lower bound

Ballard D Holtz Schwartz

Cache and Network Oblivious Algorithms

bull Motivation Minimizes communication at every level of a hierarchical system without tuning parameters (in theory)ndash Not always 25D Matmul on BGP was topology aware

bull CAPS Divide-and-conquer choose BFS or DFS to adapt to processors available memory

bull CARMAndash Divide-and-conquer classical matmul divide largest of 3

dimensions to create two subproblemsndash Choose BFS or DFS to adapt to processors available memory

CARMA Performance Distributed Memory

Square m = k = n = 6144

ScaLAPACK

CARMA

Peak

(log)

(log)

Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA

CARMA Performance Distributed Memory

Inner Product m = n = 192 k = 6291456

ScaLAPACK

CARMAPeak

(log)

(log)

Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA

CARMA Performance Shared Memory

Square m = k = n

MKL (double)CARMA (double)

MKL (single)CARMA (single)

Peak (single)

Peak (double)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

CARMA Performance Shared Memory

Inner Product m = n = 64

MKL (double)

CARMA (double)

MKL (single)

CARMA (single)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

Why is CARMA Faster in Shared MemoryL3 Cache Misses

Shared Memory Inner Product (m = n = 64 k = 524288)

97 Fewer Misses

86 Fewer Misses

(linear)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

One-sided Factorizations (LU QR) so farbull Classical Approach for i=1 to n update column i update trailing matrixbull words_moved = O(n3)

35

bull Blocked Approach (LAPACK) for i=1 to nb update block i of b columns update trailing matrixbull words moved = O(n3M13)

bull Recursive Approach func factor(A) if A has 1 column update it

else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)

bull None of these approaches minimizes messagesbull Parallel case Partial

Pivoting =gt n reductionsbull Need another idea

TSQR An Architecture-Dependent Algorithm

W =

W0

W1

W2

W3

R00

R10

R20

R30

R01

R11

R02Parallel

W =

W0

W1

W2

W3

R01R02

R00

R03

SequentialStreaming

W =

W0

W1

W2

W3

R00

R01

R01

R11

R02

R11

R03

Dual Core

Can choose reduction tree dynamically

Multicore Multisocket Multirack Multisite Out-of-core

Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo

Wnxb =

W1

W2

W3

W4

P1middotL1middotU1

P2middotL2middotU2

P3middotL3middotU3

P4middotL4middotU4

=

Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo

W1rsquoW2rsquoW3rsquoW4rsquo

P12middotL12middotU12

P34middotL34middotU34

=Choose b pivot rows call them W12rsquo

Choose b pivot rows call them W34rsquo

W12rsquoW34rsquo

= P1234middotL1234middotU1234

Choose b pivot rows

Go back to W and use these b pivot rows (move them to top do LU without pivoting)

37

Minimizing Communication in TSLU

W = W1

W2

W3

W4

LULULULU

LU

LULUParallel

W = W1

W2

W3

W4

LULU

LU

LUSequentialStreaming

W = W1

W2

W3

W4

LULU LU

LULU

LULU

Dual Core

Can choose reduction tree dynamically to match architecture as before

38

Making TSLU Numerically Stable

bull Details matterndash Going up the tree we could do LU either on original rows of A

(tournament pivoting) or computed rows of Undash Only tournament pivoting stable

bull ldquoThmrdquo New scheme as stable as Partial Pivoting (GEPP) in following sense Get same Schur complements as GEPP applied to different input matrix whose entries are blocks taken from input A

bull Why just a ldquoThmrdquo

39

Stability of LU using TSLU CALU

Summer School Lecture 4 40

bull Empirical testingndash Both random matrices and ldquospecial onesrdquondash Both binary tree (BCALU) and flat-tree (FCALU)ndash 3 metrics ||PA-LU||||A|| normwise and componentwise backward errorsndash See [D Grigori Xiang 2010] for details

Why is stability of TSLU just a ldquoThmrdquo

bull Proof is correct ndash in exact arithmeticbull Experiment

ndash Generate 100 random 6x6 rank 3 matrices in Matlabndash [LUP] = lu(A) do LU without pivoting on PA compare L factors are

they the samebull Compute || L ndash Lnp || A few 0rsquos A few infinrsquos a few NaNsbull Rest mostly O(1)

ndash Why Floating point is nonassociative doing arithmetic in different order gives different rounding errors

ndash Same experiment with rank 6 matrices || L ndash Lnp || usually nonzero O(macheps)

ndash Same experiment with 20x20 rank 4 matrices || L ndash Lnp || often O(103)

bull Much harder to break TSLU but possiblendash Occurred when using TSLU to factorize a low-rank subdiagonal

panel in symmetric-indefinite factorization41

Fixing TSLU
• Run TSLU, quickly test for stability, fix if necessary (rare) – see the sketch below
• Test conditioning of U: if not extreme (usual case) proceed, else
• Compute || L ||: if not big (usual case) proceed, else
• Factor A = QR using TSQR, then
• Factor Q = PLU using TSLU, then
• A = P·L·(U·R), with U·R as the upper triangular factor
• Last topic in the lecture: how to guarantee floating point reproducibility

42
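The fallback logic as a hedged control-flow sketch; tslu, tsqr, cond_est and norm_est are assumed helper routines (for example, the sketches above), not a library API:

def factor_with_fallback(A, b, tslu, tsqr, cond_est, norm_est, tol=1e8):
    # Run TSLU, quickly test for stability, and fix it in the rare bad case.
    P, L, U = tslu(A, b)
    if cond_est(U) < tol and norm_est(L) < tol:
        return P, L, U                 # usual case: tournament pivoting was stable
    Q, R = tsqr(A)                     # rare case: factor A = Q R with TSQR ...
    P, L, U = tslu(Q, b)               # ... then Q = P L U with TSLU ...
    return P, L, U @ R                 # ... so A = P L (U R), with U R upper triangular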

2D CALU with Tournament Pivoting

43

2.5D CALU with Tournament Pivoting (c=4 copies)

44

Exascale Machine Parameters (Source: DOE Exascale Workshop)
• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node

Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU
[Plot: speedup as a function of log2(p) and log2(n^2/p) = log2(memory_per_proc).]
Up to 29x

2.5D vs 2D LU, With and Without Pivoting

Other CA algorithms for Ax=b, least squares (1/3)
• A symmetric and indefinite
  – Seek a factorization that retains symmetry, P·A·P^T = L·D·L^T, with D "simple"
    • Saves 1/2 the flops, preserves inertia
  – Usual approach: Bunch-Kaufman
    • D block diagonal with 1x1 and 2x2 blocks
    • Pivot search down column and along row (lots of communication)
  – Alternative: Aasen
    • D = tridiagonal = T
    • Two steps:
      – P·A·P^T = L·T·L^T, where T is banded, using TSLU
      – Solve/factor the narrow band problem with T
      [Figure: the banded matrix T.]
    • Up to 2.8x faster than MKL; Best Paper at IPDPS'13

48

Other CA algorithms for Ax=b, least squares (2/3)
• Minimizing bandwidth and latency for sequential GEPP
  – So far, we could not do partial pivoting and minimize messages, just words
  – Challenge:
    • Column layout good for choosing pivots, bad for matmul
    • Blocked layout good for matmul, bad for choosing pivots
  – Solution: use both layouts, switching between them
    • "Shape Morphing LU", or SMLU

• Recursive GEPP (columnwise layout only):
    func factor(A)
        if A has 1 column, update it
        else
            factor(left half of A)
            update right half of A
            factor(right half of A)
  Words = O(n^3 / M^(1/2)), Messages = O(n^3 / M)

• Shape Morphing LU:
    func factor(A)
        if A has 1 column, update it
        else
            factor(left half of A)
            reshape to recursive block format
            update right half of A
            reshape to columnwise format
            factor(right half of A)
  Words = O(n^3 / M^(1/2)), Messages = O(n^3 / M^(3/2))

49

Other CA algorithms for Ax=b, least squares (3/3)
• Need for pivoting arises beyond LU, in QR
  – Choose a permutation P so that the leading columns of A·P = Q·R span the column space of A – Rank Revealing QR (RRQR)
  – Usual approach, like partial pivoting:
    • Put the longest column first, update the rest of the matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit the lower bound
  – Use Tournament Pivoting:
    • Each round of the tournament selects the best b columns from two groups of b columns, either using the usual approach or something better (Gu/Eisenstat)
    • Thm: this approach "reveals the rank" of A, in the sense that the leading r x r submatrix of R has singular values "near" the largest r singular values of A; ditto for the trailing submatrix
  – Idea extends to other pivoting schemes
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDL^T with complete pivoting

50

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What about sparse matrices? (1/3)
• If the matrix quickly becomes dense, use a dense algorithm
• Ex: All Pairs Shortest Path using Floyd-Warshall
• Similar to matmul: let D = A, then

    for k = 1:n
      for i = 1:n
        for j = 1:n
          D(i,j) = min(D(i,j), D(i,k) + D(k,j))

• But we can't reorder the outer loop for 2.5D; need another idea
• Abbreviate D(i,j) = min(D(i,j), min_k(A(i,k)+B(k,j))) by D = A⊗B
  – Dependencies ok, 2.5D works, just a different semiring
• Kleene's Algorithm:

    D = DC-APSP(A, n)
      D = A
      Partition D = [[D11,D12],[D21,D22]] into n/2 x n/2 blocks
      D11 = DC-APSP(D11, n/2)
      D12 = D11 ⊗ D12
      D21 = D21 ⊗ D11
      D22 = D21 ⊗ D12
      D22 = DC-APSP(D22, n/2)
      D21 = D22 ⊗ D21
      D12 = D12 ⊗ D22
      D11 = D12 ⊗ D21

52
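A compact numpy sketch of Kleene's divide-and-conquer APSP over the (min,+) semiring; C = A⊗B is implemented here with dense broadcasting (fine for small n), and D must hold 0 on the diagonal and +inf for missing edges:

import numpy as np

def semiring_mm(C, A, B):
    # C(i,j) = min( C(i,j), min_k A(i,k) + B(k,j) )  -- the slide's C = A (x) B
    return np.minimum(C, (A[:, :, None] + B[None, :, :]).min(axis=1))

def dc_apsp(D):
    n = D.shape[0]
    if n == 1:
        return D
    k = n // 2
    D11, D12, D21, D22 = D[:k, :k], D[:k, k:], D[k:, :k], D[k:, k:]
    D11 = dc_apsp(D11)
    D12 = semiring_mm(D12, D11, D12)
    D21 = semiring_mm(D21, D21, D11)
    D22 = semiring_mm(D22, D21, D12)
    D22 = dc_apsp(D22)
    D21 = semiring_mm(D21, D22, D21)
    D12 = semiring_mm(D12, D12, D22)
    D11 = semiring_mm(D11, D12, D21)
    return np.block([[D11, D12], [D21, D22]])

# Sanity check: for a small random weight matrix this agrees with the triply
# nested Floyd-Warshall loop shown above.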

Performance of 2.5D APSP using Kleene
[Plot: strong scaling on Hopper (Cray XE6 with 1024 nodes = 24,576 cores); annotated speedups: 62x and 2x.]

53

What about sparse matrices? (2/3)
• If parts of the matrix become dense, optimize those
• Ex: Cholesky on a matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): if all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on a w x w submatrix
• Thm: Words_moved = Ω(w^3 / M^(1/2)), etc.
• Thm (George '73): nested dissection gives the optimal ordering for 2D grids, 3D grids, and similar matrices
  – w = n for a 2D n x n grid, w = n^2 for a 3D n x n x n grid
• Sequential multifrontal Cholesky attains the bounds
• PSPASES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains the 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)

54

What about sparse matrices? (3/3)
• If the matrix stays very sparse, the lower bound is unattainable; need a new one
• Ex: A, B both diagonal: no communication in the parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), i.i.d.
• Assumption: the algorithm is sparsity-independent: the assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: a parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
      Words_moved = Ω( min( d·n / P^(1/2),  d^2·n / P ) )
  – Proof exploits the fact that reuse of entries of C = A·B is unlikely
• Contrast the general lower bound: Words_moved = Ω( d^2·n / (P·M^(1/2)) )
• Attained by a divide-and-conquer algorithm that splits the matrices along the dimensions most likely to minimize cost

55

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Symmetric Eigenproblem and SVD
• Usual approach for A = A^T (SVD similar)
  – A → Q^T·A·Q = T, where Q orthogonal, T tridiagonal
  – T → U^T·T·U = Λ, where U orthogonal, Λ diagonal
  – (Q·U)'s columns are the eigenvectors, Λ holds the eigenvalues
  – Dense → Tridiagonal → Diagonal
  – Only half BLAS3, half BLAS2, in LAPACK's sytrd
• Communication-Avoiding approach
  – A → Q·A·Q^T = B, where B = B^T is banded, of bandwidth ≈ M^(1/2)
  – Continue as above, starting with B
  – Dense → Banded → Tridiagonal → Diagonal
  – Dense → Banded: use TSQR to zero out M^(1/2) cols/rows at a time
  – Banded → Tridiagonal: need a new(ish) idea

Successive Band Reduction (Bischof/Lang/Sun)
[Sequence of figures: a symmetric band matrix of half-bandwidth b is reduced by eliminating c columns at a time; each elimination creates a bulge of d extra diagonals, which is chased down the band by two-sided orthogonal updates Q1, Q1^T, Q2, Q2^T, ..., Q5, Q5^T in numbered sweeps 1-6.]
b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b

Conventional vs CA-SBR
  Conventional: touch all data 4 times
  Communication-Avoiding: touch all data once

Speedups of Sym. Band Reduction vs DSBTRD
• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem
• No apparent way to modify the standard algorithm
• Instead: Spectral Divide-and-Conquer
  – Find an orthogonal matrix Q whose leading columns span an invariant subspace of A
  – Q^T·A·Q will be block upper triangular:
        [ A11  A12 ]
        [  ε   A22 ]
  – Apply recursively to A11, A22
  – Depends on randomization:
    1. Randomized Rank Revealing QR decomposition
    2. Randomized location to try splitting the spectrum

Attaining the Lower Bounds: Sequential
Legend: [Existing] [Ours] [Math-Lib] [Random]
Columns: Two Levels (Words, Messages) and Memory Hierarchy (Words, Messages); citation groups are listed in that order, separated by "/".
• BLAS-3: [FLPR'99][BDLST'13][MKL etc.] / [FLPR'99][BDLST'13][MKL etc.]
• Cholesky: [G'97][AP'00] / [LAPACK][BDHS'09] / [G'97][AP'00][BDHS'09] / [G'97][AP'00][BDHS'09]
• Sym. Indefinite: [BBDDDPSTY'13] / [BBDDDPSTY'13]
• LU: [G'97][T'97] / [GDX'11][BDLST'13] / [GDX'11][BDLST'13] / [G'97][T'97][BDLST'13] / [BDLST'13]
• QR: [EG'98][FW'03] / [DGHL'12][BDLST'13] / [FW'03][DGHL'12][BDLST'13] / [EG'98][FW'03][BDLST'13] / [FW'03][BDLST'13]
• Rank Revealing QR: [BDD'11][DGGX'13]
• Sym. Eig & SVD: [BDD'11][BDK'13] / [BDD'11]
• Non-Sym. Eig: [BDD'11] / [BDD'11]

Attaining the Lower Bounds: Parallel 2D, M = Θ(n^2/P)
(Ignoring poly-log(P) factors; words = Ω(n^2 / P^(1/2)), messages = Ω(P^(1/2)).)
Legend: [Existing] [Ours] [Math-Lib] [Random]
Columns: Words (BW) / Messages (L) / Saving factor
• BLAS-3: [AGZ'94][MT'99][ScaLAPACK][C'69][vGW'97][SD'11] / L: n/P^(1/2)
• Cholesky: [ScaLAPACK][T'99][SD'11] / L: n/P^(1/2)
• Sym. Indefinite: [BBDDDPSTY'13][ScaLAPACK] / [BBDDDPSTY'13] / L: n/P^(1/2)
• LU: [ScaLAPACK][GDX'11][T'99][SD'11] / [GDX'11][T'99][SD'11] / L: n/P^(1/2)
• QR: [ScaLAPACK][DGHL'12][T'99] / [DGHL'12][T'99] / L: n/P^(1/2)
• Rank Revealing QR: [BDD'11][DGGX'13]
• Sym. Eig & SVD: [BDD'11][BDK'13][ScaLAPACK] / [BDD'11][BDK'13] / L: n/P^(1/2)
• Non-Sym. Eig: [BDD'11] / [BDD'11] / BW: P^(1/2), L: n
• Attaining with extra memory: 2.5D, M = Θ(c·n^2/P)

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra
• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and the starting vector
  – Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume the matrix is "well-partitioned"
  – Serial implementation
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors
    • Conventional: O(k log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability

75

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV
• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

77

Example: The Difficulty of Tuning
• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit mem_refs

78

Speedups on Itanium 2: The Need for Search
[Figure: Mflop rates over the space of register block sizes; the reference (unblocked) code vs the best block size, 4x2.]

79

Register Profile: Itanium 2
[Figure: performance over all register block sizes, ranging from 190 Mflops to 1190 Mflops.]

80

Register Profiles: IBM and Intel IA-64 (Power3, Power4, Itanium 1, Itanium 2)
[Figures: register-blocking profiles for the four platforms; labeled rates: 252, 122, 820, 459, 247, 107, 190 Mflops and 1.2 Gflops.]

Another example of tuning challenges for SpMV
• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16,614
• NNZ = 1.1 M

82

Zoom in to top corner
• More complicated non-zero structure in general
• N = 16,614
• NNZ = 1.1 M

83

3x3 blocks look natural, but…
• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

84

Extra Work Can Improve Efficiency
• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5 (see the scipy sketch below)
• On Pentium III: 1.5x speedup!
  – Actual mflop rate is 1.5^2 = 2.25x higher

85

Source: Accelerator Cavity Design Problem (Ko via Husbands)
[Figure: spy plot of the matrix.]

86

100x100 Submatrix Along Diagonal
[Figure.]

87

Post-RCM Reordering
[Figure.]

88

Effect of Combined RCM+TSP Reordering
[Figure. Before: green + red; after: green + blue.]
2x speedups on Pentium 4, Power 4, …

89
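The RCM part of this reordering is a single call in scipy (the TSP-based reordering layered on top of it is not shown); a self-contained toy:

import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import reverse_cuthill_mckee

def bandwidth(M):
    coo = M.tocoo()
    return int(np.abs(coo.row - coo.col).max())

A = sp.random(500, 500, density=0.01, format='csr', random_state=0)
A = (A + A.T).tocsr()                                   # symmetric sparsity pattern
perm = reverse_cuthill_mckee(A, symmetric_mode=True)
A_rcm = A[perm][:, perm]                                # nonzeros pulled toward the diagonal
print(bandwidth(A), bandwidth(A_rcm))                   # bandwidth drops substantially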

Summary of Other Performance Optimizations
• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·A^T·x, A^T·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later …

90

Optimized Sparse Kernel Interface – OSKI
• Provides sparse kernels automatically tuned for the user's matrix & machine
  – BLAS-style functionality: SpMV (A·x & A^T·y), TrSV
  – Does both off-line and run-time tuning
  – Hides the complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as a stand-alone library
  – Available as a PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

91

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

93

Example: Classical Conjugate Gradient (CG)
[Algorithm shown as an image in the slide.] SpMVs and dot products require communication in each iteration.

94
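For reference, the classical CG loop (every iteration has one SpMV and two dot products, i.e., two global reductions in a parallel run); a minimal numpy version:

import numpy as np

def cg(A, b, x0, iters):
    x = x0.copy()
    r = b - A @ x
    p = r.copy()
    rr = r @ r
    for _ in range(iters):
        Ap = A @ p                     # SpMV: neighbor communication
        alpha = rr / (p @ Ap)          # dot product: global reduction
        x = x + alpha * p
        r = r - alpha * Ap
        rr_new = r @ r                 # dot product: global reduction
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x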

Example: CA-Conjugate Gradient
[Algorithm shown as an image in the slide.] The SpMVs are computed via the CA Matrix Powers Kernel, the dot products become one global reduction to compute the Gram matrix G, and the local computations within the inner loop require no communication.
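A naive sketch of the two CA-CG building blocks named above: the matrix powers kernel output (a monomial Krylov basis) and the Gram matrix G. The real communication-avoiding kernels compute the same quantities with O(1) passes over A and one global reduction; the inner-loop recurrences then work on small coefficient vectors only.

import numpy as np

def monomial_basis(A, v, s):
    # V = [v, A v, A^2 v, ..., A^s v]; the CA matrix powers kernel produces this
    # with one pass over A (plus ghost zones) instead of s separate SpMV sweeps.
    V = np.empty((v.size, s + 1))
    V[:, 0] = v
    for j in range(s):
        V[:, j + 1] = A @ V[:, j]
    return V

def gram(V):
    # G = V^T V: one global reduction replaces the 2s dot products of s CG steps.
    return V.T @ V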

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

96

[Convergence plot: CG vs CA-CG (monomial basis).]
Model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400.
CA-CG with the monomial basis shows slower convergence and loss of accuracy (relative to machine precision) due to roundoff; at s = 16 the monomial basis is rank deficient and the method breaks down.

97
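The breakdown is easy to reproduce: build the model problem and look at the conditioning of the normalized monomial basis (sizes follow the slide; the random starting vector is our choice):

import numpy as np
import scipy.sparse as sp

n = 30
T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n))
A = sp.kron(sp.eye(n), T) + sp.kron(T, sp.eye(n))     # 2D Poisson, 5-point stencil; cond ~ 400
rng = np.random.default_rng(1)
v = rng.standard_normal(n * n)
V = [v / np.linalg.norm(v)]
for _ in range(16):
    w = A @ V[-1]
    V.append(w / np.linalg.norm(w))                   # normalized monomial basis vectors
V = np.column_stack(V)
print(np.linalg.cond(V))   # enormous (near 1/macheps): numerically rank deficient at s = 16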

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?
• Requires o(n^2) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be a sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + U·D·V^T, with D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write A^k = (S + U·D·V^T)^k as a sum of S^k and low-rank matrices

                                     Indices
                          Explicit (O(nnz))     Implicit (o(nnz))
Nonzero entries
  Explicit (O(nnz))       CSR and variations    Vision, climate, AMR, …
  Implicit (o(nnz))       Graph Laplacian       Stencils
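For the "sparse + low rank" case, the implicit representation is what you compute with; a tiny sketch of applying A = S + U·D·V^T (and hence its powers) to a vector without ever forming A densely (helper names are ours):

import numpy as np
import scipy.sparse as sp

def apply_A(S, U, D, V, x):
    # y = (S + U D V^T) x, using only an SpMV and small dense products
    return S @ x + U @ (D @ (V.T @ x))

def apply_A_power(S, U, D, V, x, k):
    # A^k x by repeated application; a CA version would expand (S + U D V^T)^k
    # into S^j terms plus low-rank corrections, as described above.
    for _ in range(k):
        x = apply_A(S, U, D, V, x)
    return x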

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

101

Reproducible Floating Point Computation
• Get a bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe the results
  – Willing to sacrifice 40% - 50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What? How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility
[Plots: absolute error for random vectors (same magnitude, opposite signs) and relative error for orthogonal vectors (even the sign is not reproducible).]
Vector size 1e6, data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum – minimum
• Relative error = Absolute error / maximum absolute value

103

Goals/Approaches for Reproducibility
• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, number of processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee a fixed reduction tree (fails goal 2 or 3)
  – Use (very) high precision to get the exact answer (fails goal 2)
  – Pre-rounding technique (Nguyen, D.) – see the sketch below

104
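A toy, single-bin version of the pre-rounding idea (not the actual Nguyen-Demmel algorithm, which uses several bins to preserve more accuracy): after one reduction to find the maximum, every summand is rounded to a common grid so that all subsequent additions are exact, and therefore the result is independent of summation order.

import numpy as np

def reproducible_sum(x):
    x = np.asarray(x, dtype=np.float64)
    n = x.size
    m = np.max(np.abs(x))
    if m == 0.0:
        return 0.0
    # Grid spacing chosen so that all partial sums are exact multiples of delta
    # that fit in 53 bits: |sum| <= n*m <= 2^52 * delta.
    delta = 2.0 ** (np.ceil(np.log2(m)) + np.ceil(np.log2(n)) - 52)
    xr = np.round(x / delta) * delta        # deterministic, element-wise pre-rounding
    return float(np.sum(xr))                # every addition is exact => order-independent

# reproducible_sum(v) returns the bit-wise same value for any permutation of v,
# at the cost of an absolute error of at most n*delta/2 relative to the exact sum.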

Performance results on 1024 processors of a Cray XC30: 1.2x to 3.2x slowdown vs the fastest (non-reproducible) code, for n = 1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

106

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)

Page 26: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

Cache and Network Oblivious Algorithms

bull Motivation Minimizes communication at every level of a hierarchical system without tuning parameters (in theory)ndash Not always 25D Matmul on BGP was topology aware

bull CAPS Divide-and-conquer choose BFS or DFS to adapt to processors available memory

bull CARMAndash Divide-and-conquer classical matmul divide largest of 3

dimensions to create two subproblemsndash Choose BFS or DFS to adapt to processors available memory

CARMA Performance Distributed Memory

Square m = k = n = 6144

ScaLAPACK

CARMA

Peak

(log)

(log)

Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA

CARMA Performance Distributed Memory

Inner Product m = n = 192 k = 6291456

ScaLAPACK

CARMAPeak

(log)

(log)

Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA

CARMA Performance Shared Memory

Square m = k = n

MKL (double)CARMA (double)

MKL (single)CARMA (single)

Peak (single)

Peak (double)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

CARMA Performance Shared Memory

Inner Product m = n = 64

MKL (double)

CARMA (double)

MKL (single)

CARMA (single)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

Why is CARMA Faster in Shared MemoryL3 Cache Misses

Shared Memory Inner Product (m = n = 64 k = 524288)

97 Fewer Misses

86 Fewer Misses

(linear)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

One-sided Factorizations (LU QR) so farbull Classical Approach for i=1 to n update column i update trailing matrixbull words_moved = O(n3)

35

bull Blocked Approach (LAPACK) for i=1 to nb update block i of b columns update trailing matrixbull words moved = O(n3M13)

bull Recursive Approach func factor(A) if A has 1 column update it

else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)

bull None of these approaches minimizes messagesbull Parallel case Partial

Pivoting =gt n reductionsbull Need another idea

TSQR An Architecture-Dependent Algorithm

W =

W0

W1

W2

W3

R00

R10

R20

R30

R01

R11

R02Parallel

W =

W0

W1

W2

W3

R01R02

R00

R03

SequentialStreaming

W =

W0

W1

W2

W3

R00

R01

R01

R11

R02

R11

R03

Dual Core

Can choose reduction tree dynamically

Multicore Multisocket Multirack Multisite Out-of-core

Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo

Wnxb =

W1

W2

W3

W4

P1middotL1middotU1

P2middotL2middotU2

P3middotL3middotU3

P4middotL4middotU4

=

Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo

W1rsquoW2rsquoW3rsquoW4rsquo

P12middotL12middotU12

P34middotL34middotU34

=Choose b pivot rows call them W12rsquo

Choose b pivot rows call them W34rsquo

W12rsquoW34rsquo

= P1234middotL1234middotU1234

Choose b pivot rows

Go back to W and use these b pivot rows (move them to top do LU without pivoting)

37

Minimizing Communication in TSLU

W = W1

W2

W3

W4

LULULULU

LU

LULUParallel

W = W1

W2

W3

W4

LULU

LU

LUSequentialStreaming

W = W1

W2

W3

W4

LULU LU

LULU

LULU

Dual Core

Can choose reduction tree dynamically to match architecture as before

38

Making TSLU Numerically Stable

bull Details matterndash Going up the tree we could do LU either on original rows of A

(tournament pivoting) or computed rows of Undash Only tournament pivoting stable

bull ldquoThmrdquo New scheme as stable as Partial Pivoting (GEPP) in following sense Get same Schur complements as GEPP applied to different input matrix whose entries are blocks taken from input A

bull Why just a ldquoThmrdquo

39

Stability of LU using TSLU CALU

Summer School Lecture 4 40

bull Empirical testingndash Both random matrices and ldquospecial onesrdquondash Both binary tree (BCALU) and flat-tree (FCALU)ndash 3 metrics ||PA-LU||||A|| normwise and componentwise backward errorsndash See [D Grigori Xiang 2010] for details

Why is stability of TSLU just a ldquoThmrdquo

bull Proof is correct ndash in exact arithmeticbull Experiment

ndash Generate 100 random 6x6 rank 3 matrices in Matlabndash [LUP] = lu(A) do LU without pivoting on PA compare L factors are

they the samebull Compute || L ndash Lnp || A few 0rsquos A few infinrsquos a few NaNsbull Rest mostly O(1)

ndash Why Floating point is nonassociative doing arithmetic in different order gives different rounding errors

ndash Same experiment with rank 6 matrices || L ndash Lnp || usually nonzero O(macheps)

ndash Same experiment with 20x20 rank 4 matrices || L ndash Lnp || often O(103)

bull Much harder to break TSLU but possiblendash Occurred when using TSLU to factorize a low-rank subdiagonal

panel in symmetric-indefinite factorization41

Fixing TSLU

bull Run TSLU quickly test for stability fix if necessary (rare)

bull Test conditioning of U if not tiny (usual case) proceed elsebull Compute || L || if not big (usual case) proceed elsebull Factor A = QR using TSQR thenbull Factor Q = PLU using TSLU thenbull A = PL(UR) with UR as upper triangular factor

bull Last topic in lecture how to guarantee floating point reproducibility

42

2D CALU with Tournament Pivoting

43

25D CALU with Tournament Pivoting (c=4 copies)

44

Exascale Machine ParametersSource DOE Exascale Workshop

bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log

2 (

n2p

) =

log

2 (m

emo

ry_p

er_p

roc)

Up to 29x

25D vs 2D LUWith and Without Pivoting

Other CA algorithms for Ax=b least squares(13)

bull A symmetric and indefinitendash Seek factorization that retains symmetry PAPT = LDLT D

ldquosimplerdquobull Save frac12 flops preserve inertia

ndash Usual approach Bunch-Kaufmanbull D block diagonal with 1x1 and 2x2 blocksbull Pivot search down column along row (lots of communication)

ndash Alternative Aasenbull D = tridiagonal = Tbull Two steps

ndash PAPT = LTLT where T is banded using TSLU

48

0 0

0

0 0

0

0

hellip

hellip

ndash Solvefactor narrow band problem with Tbull Up to 28x faster than MKL Best Paper at IPDPSrsquo13

Other CA algorithms for Ax=b least squares (23)bull Minimizing bandwidth and latency for sequential GEPP

ndash So far could not do partial pivoting and minimize messages just words

ndash Challengebull Column layout good for choosing pivots bad for matmulbull Blocked layout good for matmul bad for choosing pivots

ndash Solution use both layouts switching between thembull ldquoShape Morphing LUrdquo or SMLU

49

bull func factor(A) if A has 1 column update it else factor(left half of A)

update right half of A

factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M)

bull func factor(A) if A has 1 column update it else factor(left half of A) reshape to recursive block format update right half of A reshape to columnwise format factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M32)

Other CA algorithms for Ax=b least squares (33)bull Need for pivoting arises beyond LU in QR

ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)

ndash Usual approach like Partial Pivoting

bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound

ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two

groups of b columns either using usual approach or something better (GuEisenstat)

bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix

ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 50

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What about sparse matrices (13)

bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then

bull But canrsquot reorder outer loop for 25D need another idea

bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring

bull Kleenersquos Algorithm

52

for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)

D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21

Performance of 25D APSP using Kleene

53

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)

62xspeedup

2x speedup

What about sparse matrices (23)

bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of

G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix

bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering

for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid

bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse

multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on

separators) 54

What about sparse matrices (33)

bull If matrix stays very sparse lower bound unattainable new one

bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of

data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)

bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)

Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely

bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices

along dimensions most likely to minimize cost

55

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Symmetric Eigenproblem and SVD

bull Usual approach for A=AT (SVD similar)ndash A QTAQ = T where Q orthogonal T tridiagonalndash T UTTU = Λ where U orthogonal Λ diagonalndash QUrsquos columns are eigenvectors Λ eigenvaluesndash Dense Tridiagonal Diagonalndash Only half BLAS3 half BLAS2 in LAPACKrsquos sytrd

bull Communication-Avoiding Approachndash A QAQT = B where B=BT banded of bandwidth M12

ndash Continue as above starting with Bndash Dense Banded Tridiagonal Diagonalndash Dense Banded use TSQR to zero out M12 colsrows at a timendash Banded Tridiagonal need new(ish) idea

b+1

b+1

Successive Band Reduction (BischofLangSun)

1

b+1

b+1

d+1

c

Successive Band Reduction (BischofLangSun)

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1Q1

b+1

b+1

d+1

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

12

Q1

b+1

b+1

d+1

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

12

Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

Q1

Q1T

Q2

Q2T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

Q1

Q1T

Q2

Q2T

Q3

Q3T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

6

6

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

Conventional vs CA - SBR

Conventional Communication-Avoiding

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Nonsymmetric Eigenproblem

bull No apparent way to modify standard algorithmbull Instead Spectral Divide-and-Conquer

ndash Find orthogonal matrix Q whose leading columns span an invariant subspace of A

ndash QTAQ will be block upper triangular

ndash Apply recursively to A11 A22

ndash Depends on randomization1 Randomized Rank Revealing QR decomposition2 Randomized location to try splitting spectrum

A11 A12

ε A22

Attaining the Lower bounds SequentialLegend[Existing][Ours][Math-Lib][Random]

Two Levels Memory Hierarchy

Words Messages Words Messages

BLAS-3 [FLPRrsquo99][BDLSTrsquo13][MKL etc] [FLPRrsquo99][BDLSTrsquo13][MKL etc]

Cholesky[Grsquo97][APrsquo00]

[LAPACK][BDHSrsquo09]

[Grsquo97][APrsquo00][BDHSrsquo09] [Grsquo97][APrsquo00][BDHSrsquo09]

Sym Indefinite [BBDDDPSTYrsquo13] [BBDDDPSTYrsquo13]

LU[Grsquo97][Trsquo97]

[GDXrsquo11][BDLSTrsquo13]

[GDXrsquo11][BDLSTrsquo13]

[Grsquo97][Trsquo97] [BDLSTrsquo13] [BDLSTrsquo13]

QR[EGrsquo98][FWrsquo03]

[DGHLrsquo12][BDLSTrsquo13]

[FWrsquo03][DGHLrsquo12][BDLSTrsquo13]

[EGrsquo98][FWrsquo03][BDLSTrsquo13]

[FWrsquo03][BDLSTrsquo13]

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13] [BDDrsquo11]

Non Sym Eig [BDDrsquo11] [BDDrsquo11]

Legend[Existing][Ours][Math-Lib][Random]

Words (BW) Messages (L) Saving factor

BLAS-3 [AGZrsquo94][MTrsquo99][ScaLAPACK][Crsquo69][vGWrsquo97][SDrsquo11] L nP12

Cholesky [ScaLAPACK][Trsquo99][SDrsquo11] L nP12

Sym Indefinite [BBDDDPSTYrsquo13][ScaLAPACK] [BBDDDPSTYrsquo13] L nP12

LU [ScaLAPACK][GDXrsquo11][Trsquo99][SDrsquo11] [GDXrsquo11][Trsquo99][SDrsquo11] L nP12

QR [ScaLAPACK][DGHLrsquo12] [Trsquo99] [DGHLrsquo12][Trsquo99] L nP12

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13][ScaLAPACK] [BDDrsquo11][BDKrsquo13] L nP12

Non-Sym Eig [BDDrsquo11] [BDDrsquo11] BW P12 L n

Attaining with extra memory 25D M=(cn2P)

Attaining the Lower bounds Parallel 2DM=(n2P)(Ignoring poly-log(P) factors words = ( n2 P12) messages = (P12)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Conjugate Gradients (CG) GMRES Lanczos Arnoldi hellip bull Goal minimize communication

ndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computationndash Challenges Poor partitioning Preconditioning Num Stability

75

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Example The Difficulty of Tuning SpMV

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

77

Example The Difficulty of Tuning

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

bull 8x8 dense substructure exploit this to limit mem_refs

78

Speedups on Itanium 2 The Need for Search

Reference

Best 4x2

Mflops

Mflops

79

Register Profile Itanium 2

190 Mflops

1190 Mflops

80

Register Profiles IBM and Intel IA-64Power3 - 17 Power4 - 16

Itanium 2 - 33Itanium 1 - 8

252 Mflops

122 Mflops

820 Mflops

459 Mflops

247 Mflops

107 Mflops

12 Gflops

190 Mflops

Another example of tuning challenges for SpMV

bull Ex11 matrix (fluid flow)

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

82

Zoom in to top corner

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

83

3x3 blocks look natural buthellip

bull Example 3x3 blockingndash Logical grid of 3x3 cells

bull But would lead to lots of ldquofill-inrdquo

84

Extra Work Can Improve Efficiency

bull Example 3x3 blockingndash Logical grid of 3x3 cellsndash Fill-in explicit zerosndash Unroll 3x3 block multipliesndash ldquoFill ratiordquo = 15

bull On Pentium III 15x speedup

ndash Actual mflop rate 152 = 225 higher

85

Source Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal

Summer School Lecture 7

87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

Before Green + RedAfter Green + Blue

Summer School Lecture 7

892x speedups on Pentium 4 Power 4 hellip

Summary of Other Performance Optimizations

bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip

bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR

bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip

90

Optimized Sparse Kernel Interface - OSKI

bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning

bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski

bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression

software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki

91

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

93

Example Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in

each iteration

via CA Matrix Powers Kernel

Global reduction to compute G

94

Example CA-Conjugate Gradient

Local computations within inner loop require

no communication

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

96

Slower convergence due

to roundoff

Loss of accuracy due to roundoff

At s = 16 monomial basis is rank deficient Method breaks down

Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400

CA-CG (monomial)CG

machine precision

97

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit

bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square

bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank

matrices

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

101

bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010

ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver

demanded by customers (construction engineers) otherwise they donrsquot believe results

ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it

ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection

Reproducible Floating Point Computation

Absolute Error for Random Vectors

Same magnitude opposite signs

Intel MKL non-reproducibility

Relative Error for Orthogonal vectors

Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value

Sign notreproducible

103

bull Consider summation or dot productbull Goals

1 Same answer independent of layout processors order of summands

2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy

bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)

GoalsApproaches for Reproducibility

104

Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M


CARMA Performance: Distributed Memory
[Performance plot] Square case: m = k = n = 6144; curves for ScaLAPACK, CARMA, and machine peak (log-log axes). Cray XE6 (Hopper), each node 2 x 12 cores, 4 x NUMA.

CARMA Performance: Distributed Memory
[Performance plot] Inner-product-shaped case: m = n = 192, k = 6,291,456; curves for ScaLAPACK, CARMA, and machine peak (log-log axes). Cray XE6 (Hopper), each node 2 x 12 cores, 4 x NUMA.

CARMA Performance: Shared Memory
[Performance plot] Square case: m = k = n; curves for MKL and CARMA in single and double precision, plus single and double peak (log-linear axes). Intel Emerald: 4 x (Intel Xeon X7560, 8 cores), 4 x NUMA.

CARMA Performance: Shared Memory
[Performance plot] Inner-product-shaped case: m = n = 64; curves for MKL and CARMA in single and double precision (log-linear axes). Intel Emerald: 4 x (Intel Xeon X7560, 8 cores), 4 x NUMA.

Why is CARMA Faster in Shared Memory? L3 Cache Misses
[Bar chart] Shared-memory inner product (m = n = 64, k = 524,288): CARMA incurs 97% and 86% fewer L3 misses than MKL (the two precisions shown).

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

One-sided Factorizations (LU, QR), so far
• Classical approach:
      for i = 1 to n
        update column i
        update trailing matrix
  – #words_moved = O(n^3)
• Blocked approach (LAPACK):
      for i = 1 to n/b
        update block i of b columns
        update trailing matrix
  – #words_moved = O(n^3 / M^(1/3))
• Recursive approach:
      func factor(A)
        if A has 1 column, update it
        else
          factor(left half of A)
          update right half of A
          factor(right half of A)
  – #words_moved = O(n^3 / M^(1/2))
• None of these approaches minimizes #messages
• Parallel case: Partial Pivoting => n reductions
• Need another idea
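To make the recursive approach above concrete, here is a minimal NumPy sketch of the recursive factorization pattern (LU without pivoting, illustration only – not the communication-optimal CALU of this talk; function and variable names are ours):

    # Minimal sketch of the recursive one-sided factorization pattern (LU, no pivoting).
    import numpy as np

    def recursive_lu(A):
        # In-place LU without pivoting: on return, A holds L (unit lower, below the
        # diagonal) and U (upper). Can break down on a zero pivot - which is exactly
        # why the CALU work discussed here adds (tournament) pivoting.
        n = A.shape[1]
        if n == 1:
            A[1:, 0] /= A[0, 0]                          # update the single column
            return
        k = n // 2
        recursive_lu(A[:, :k])                           # factor left half
        L11 = np.tril(A[:k, :k], -1) + np.eye(k)
        A[:k, k:] = np.linalg.solve(L11, A[:k, k:])      # U12 = L11^{-1} A12
        A[k:, k:] -= A[k:, :k] @ A[:k, k:]               # Schur complement update
        recursive_lu(A[k:, k:])                          # factor right half

    # usage sketch:
    # A = np.random.rand(8, 8) + 8 * np.eye(8)   # diagonally dominant, so no pivoting needed
    # B = A.copy(); recursive_lu(B)
    # L = np.tril(B, -1) + np.eye(8); U = np.triu(B)
    # np.allclose(L @ U, A)   # -> True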

TSQR: An Architecture-Dependent Algorithm
[Diagram] W is split into row blocks W0, W1, W2, W3; each block is QR-factored locally, and the small R factors are combined in a reduction tree:
  – Parallel: binary tree – (R00, R10, R20, R30) -> (R01, R11) -> R02
  – Sequential / streaming: flat tree – R00 combined with W1, W2, W3 in turn, giving R01, R02, R03
  – Dual core: a hybrid of the two
Can choose the reduction tree dynamically to match the architecture: multicore, multisocket, multirack, multisite, out-of-core.
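A serial NumPy simulation of the TSQR communication pattern with a binary reduction tree; only the small R factors move between "processors". Function and variable names are illustrative, and the implicit tree Q factors are not kept.

    # Sketch of TSQR with a binary reduction tree (serial simulation).
    # Each "processor" holds one row block; only n x n R factors go up the tree.
    import numpy as np

    def tsqr_R(blocks):
        # blocks: list of (m_i x n) arrays, m_i >= n; returns the n x n R of the stacked matrix.
        Rs = [np.linalg.qr(W, mode='r') for W in blocks]       # local QRs
        while len(Rs) > 1:                                     # combine pairs up the tree
            nxt = []
            for i in range(0, len(Rs) - 1, 2):
                nxt.append(np.linalg.qr(np.vstack([Rs[i], Rs[i+1]]), mode='r'))
            if len(Rs) % 2:                                    # odd block passes through
                nxt.append(Rs[-1])
            Rs = nxt
        return Rs[0]

    # usage sketch:
    # W = np.random.rand(4000, 8)
    # R = tsqr_R(np.array_split(W, 4))
    # R_ref = np.linalg.qr(W, mode='r')
    # the two R factors agree up to row signs: np.allclose(np.abs(R), np.abs(R_ref))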

Back to LU: using a similar idea for TSLU as TSQR – use a reduction tree to do "Tournament Pivoting"
[Diagram] W (n x b) is split into row blocks W1, W2, W3, W4, each factored as Pi·Li·Ui:
  – Choose b pivot rows of W1, call them W1'; likewise W2', W3', W4'
  – Stack (W1'; W2') and (W3'; W4'), factor them as P12·L12·U12 and P34·L34·U34, and choose b pivot rows from each: W12' and W34'
  – Stack (W12'; W34'), factor as P1234·L1234·U1234, and choose the final b pivot rows
Go back to W and use these b pivot rows (move them to the top, then do LU without pivoting).
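A minimal NumPy/SciPy sketch of one tournament: each round runs LU with partial pivoting on a stack of candidate rows and keeps the b rows it selects as pivots. Only the selection pattern is shown; the real CALU kernels also keep the L/U factors and permutations from each round. Names are ours.

    # Sketch of tournament pivoting: select b pivot rows of a tall panel W via a binary tree.
    import numpy as np
    from scipy.linalg import lu

    def select_pivot_rows(W, b):
        # Indices (into W) of the first b pivot rows chosen by GEPP on W.
        P, L, U = lu(W)                      # W = P @ L @ U, partial pivoting
        order = np.argmax(P, axis=0)         # order[k] = row of W used as k-th pivot
        return order[:b]

    def tournament_pivots(W, b, nblocks):
        # Tournament over row blocks of W; returns b global row indices.
        groups = np.array_split(np.arange(W.shape[0]), nblocks)
        cand = [g[select_pivot_rows(W[g], b)] for g in groups]   # leaf round
        while len(cand) > 1:                                     # combine pairs up the tree
            nxt = []
            for i in range(0, len(cand) - 1, 2):
                rows = np.concatenate([cand[i], cand[i+1]])
                nxt.append(rows[select_pivot_rows(W[rows], b)])
            if len(cand) % 2:
                nxt.append(cand[-1])
            cand = nxt
        return cand[0]

    # usage sketch: rows = tournament_pivots(np.random.rand(1024, 8), b=8, nblocks=4)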

Minimizing Communication in TSLU
[Diagram] The same reduction-tree picture as TSQR, with a local LU at each node:
  – Parallel: binary tree of LUs over W1, W2, W3, W4
  – Sequential / streaming: flat tree
  – Dual core: hybrid tree
Can choose the reduction tree dynamically to match the architecture, as before.

Making TSLU Numerically Stable
• Details matter
  – Going up the tree, we could do LU either on the original rows of A (tournament pivoting) or on the computed rows of U
  – Only tournament pivoting is stable
• "Thm": the new scheme is as stable as Partial Pivoting (GEPP) in the following sense: it gets the same Schur complements as GEPP applied to a different input matrix whose entries are blocks taken from the input A
• Why just a "Thm"?

Stability of LU using TSLU: CALU
• Empirical testing
  – Both random matrices and "special ones"
  – Both binary tree (BCALU) and flat-tree (FCALU)
  – 3 metrics: ||PA-LU|| / ||A||, normwise and componentwise backward errors
  – See [D., Grigori, Xiang, 2010] for details

Why is stability of TSLU just a "Thm"?
• The proof is correct – in exact arithmetic
• Experiment
  – Generate 100 random 6x6, rank-3 matrices in Matlab
  – [L,U,P] = lu(A); do LU without pivoting on P·A; compare L factors: are they the same?
  – Compute || L – Lnp ||: a few 0's, a few ∞'s, a few NaNs; the rest mostly O(1)
  – Why? Floating point is nonassociative; doing arithmetic in a different order gives different rounding errors
  – Same experiment with rank-6 matrices: || L – Lnp || usually nonzero, O(macheps)
  – Same experiment with 20x20, rank-4 matrices: || L – Lnp || often O(10^3)
• Much harder to break TSLU, but possible
  – Occurred when using TSLU to factorize a low-rank subdiagonal panel in a symmetric-indefinite factorization
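A rough Python/NumPy analogue of the Matlab experiment described above (an assumption-laden sketch, not the original script): factor random rank-deficient matrices with partial pivoting, redo LU without pivoting on the pre-permuted matrix, and compare the L factors.

    # Sketch of the rank-deficiency experiment: compare L from GEPP with L from
    # no-pivoting LU applied to the already-permuted matrix P^T A.
    import numpy as np
    from scipy.linalg import lu

    def lu_nopivot_L(A):
        # Textbook LU without pivoting; with rank-deficient input, tiny roundoff
        # pivots produce 0's, infinities, NaNs or O(1) garbage in L.
        A = A.astype(float).copy()
        n = A.shape[0]
        with np.errstate(divide='ignore', invalid='ignore'):
            for k in range(n - 1):
                A[k+1:, k] /= A[k, k]
                A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])
        return np.tril(A, -1) + np.eye(n)

    rng = np.random.default_rng(0)
    diffs = []
    for _ in range(100):
        A = rng.standard_normal((6, 3)) @ rng.standard_normal((3, 6))   # random 6x6, rank 3
        P, L, U = lu(A)                         # A = P @ L @ U   (partial pivoting)
        Lnp = lu_nopivot_L(P.T @ A)             # same row order, but no pivoting
        diffs.append(np.max(np.abs(L - Lnp)))
    diffs = np.array(diffs)
    print("NaNs:", np.isnan(diffs).sum(), " Infs:", np.isinf(diffs).sum(),
          " largest finite difference:", diffs[np.isfinite(diffs)].max())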

Fixing TSLU
• Run TSLU quickly, test for stability, fix if necessary (rare)
• Test conditioning of U; if not tiny (usual case), proceed, else
• Compute || L ||; if not big (usual case), proceed, else
• Factor A = QR using TSQR, then
• Factor Q = PLU using TSLU, then
• A = P·L·(U·R), with U·R as the upper triangular factor
• Last topic in lecture: how to guarantee floating point reproducibility

2D CALU with Tournament Pivoting
[Diagram of the 2D parallel CALU layout]

2.5D CALU with Tournament Pivoting (c = 4 copies)
[Diagram of the 2.5D parallel CALU layout with 4 replicas]

Exascale Machine Parameters (Source: DOE Exascale Workshop)
• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node

Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU
[Contour plot] Axes: log2(p) and log2(n^2/p) = log2(memory_per_proc); predicted speedups up to 29x.

2.5D vs 2D LU, With and Without Pivoting
[Performance plot]

Other CA algorithms for Ax=b, least squares (1/3)
• A symmetric and indefinite
  – Seek a factorization that retains symmetry: P·A·P^T = L·D·L^T, with D "simple"
    • Saves half the flops, preserves inertia
  – Usual approach: Bunch-Kaufman
    • D block diagonal with 1x1 and 2x2 blocks
    • Pivot search down column, along row (lots of communication)
  – Alternative: Aasen
    • D = tridiagonal = T
    • Two steps: P·A·P^T = L·T·L^T where T is banded, using TSLU; then solve/factor the narrow band problem with T
    • Up to 2.8x faster than MKL; Best Paper at IPDPS'13

Other CA algorithms for Ax=b, least squares (2/3)
• Minimizing bandwidth and latency for sequential GEPP
  – So far, could not do partial pivoting and minimize #messages, just #words
  – Challenge:
    • Column layout good for choosing pivots, bad for matmul
    • Blocked layout good for matmul, bad for choosing pivots
  – Solution: use both layouts, switching between them – "Shape Morphing LU" (SMLU)
• Plain recursive GEPP:
      func factor(A)
        if A has 1 column, update it
        else
          factor(left half of A)
          update right half of A
          factor(right half of A)
  – #Words = O(n^3 / M^(1/2)), #Messages = O(n^3 / M)
• Shape Morphing LU:
      func factor(A)
        if A has 1 column, update it
        else
          factor(left half of A)
          reshape to recursive block format
          update right half of A
          reshape to columnwise format
          factor(right half of A)
  – #Words = O(n^3 / M^(1/2)), #Messages = O(n^3 / M^(3/2))

Other CA algorithms for Ax=b, least squares (3/3)
• Need for pivoting arises beyond LU, in QR
  – Choose a permutation P so that the leading columns of A·P = Q·R span the column space of A – Rank Revealing QR (RRQR)
  – Usual approach, like partial pivoting: put the longest column first, update the rest of the matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit the lower bound
  – Use Tournament Pivoting
    • Each round of the tournament selects the best b columns from two groups of b columns, either using the usual approach or something better (Gu/Eisenstat)
    • Thm: this approach "reveals the rank" of A, in the sense that the leading r x r submatrix of R has singular values "near" the largest r singular values of A; ditto for the trailing submatrix
  – Idea extends to other pivoting schemes: Cholesky with diagonal pivoting, LU with complete pivoting, LDL^T with complete pivoting

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What about sparse matrices? (1/3)
• If the matrix quickly becomes dense, use a dense algorithm
• Ex: All-Pairs Shortest Path using Floyd-Warshall
• Similar to matmul: let D = A, then

      for k = 1:n
        for i = 1:n
          for j = 1:n
            D(i,j) = min( D(i,j), D(i,k) + D(k,j) )

• But can't reorder the outer loop for 2.5D; need another idea
• Abbreviate D(i,j) = min( D(i,j), min_k( A(i,k) + B(k,j) ) ) by D = A*B
  – i.e., the (min,+) product, accumulated into D; dependencies are ok, 2.5D works, just a different semiring
• Kleene's Algorithm:

      D = DC-APSP(A, n)
        D = A
        Partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
        D11 = DC-APSP(D11, n/2)
        D12 = D11 * D12
        D21 = D21 * D11
        D22 = D21 * D12
        D22 = DC-APSP(D22, n/2)
        D21 = D22 * D21
        D12 = D12 * D22
        D11 = D12 * D21
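A small NumPy sketch of the same divide-and-conquer APSP over the (min,+) semiring, with the product accumulating into its output as in the abbreviation above. It is serial and O(n^3), purely illustrative of the recursion; the 2.5D version distributes and replicates the blocks.

    # Sketch of Kleene-style divide-and-conquer APSP over the (min, +) semiring.
    import numpy as np

    def minplus(C, A, B):
        # C = min(C, A (min,+) B): the accumulated min-plus product used above.
        return np.minimum(C, (A[:, :, None] + B[None, :, :]).min(axis=1))

    def dc_apsp(D):
        n = D.shape[0]
        if n == 1:
            return np.minimum(D, 0.0)          # distance from a node to itself is 0
        h = n // 2
        D11, D12 = D[:h, :h], D[:h, h:]
        D21, D22 = D[h:, :h], D[h:, h:]
        D11[:] = dc_apsp(D11)
        D12[:] = minplus(D12, D11, D12)
        D21[:] = minplus(D21, D21, D11)
        D22[:] = minplus(D22, D21, D12)
        D22[:] = dc_apsp(D22)
        D21[:] = minplus(D21, D22, D21)
        D12[:] = minplus(D12, D12, D22)
        D11[:] = minplus(D11, D12, D21)
        return D

    # usage sketch: A[i, j] = edge weight (np.inf if absent), A[i, i] = 0
    # D = dc_apsp(A.copy()) gives all-pairs shortest path lengths.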

Performance of 2.5D APSP using Kleene
[Performance plot] Strong scaling on Hopper (Cray XE6 with 1024 nodes = 24,576 cores); annotated speedups of 6.2x and 2x.

What about sparse matrices? (2/3)
• If parts of the matrix become dense, optimize those
• Ex: Cholesky on a matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): if all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on a w x w submatrix
• Thm: #Words_moved = Ω(w^3 / M^(1/2)), etc.
• Thm (George '73): nested dissection gives an optimal ordering for 2D grids, 3D grids, similar matrices
  – w = n for a 2D n x n grid; w = n^2 for a 3D n x n x n grid
• Sequential multifrontal Cholesky attains the bounds
• PSPASES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains the 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)

What about sparse matrices? (3/3)
• If the matrix stays very sparse, the lower bound is unattainable; need a new one
• Ex: A, B both diagonal: no communication in the parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), i.i.d.
• Assumption: the algorithm is sparsity-independent: the assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: a parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
      #Words_moved = Ω( min( d·n / P^(1/2), d^2·n / P ) )
  – Proof exploits the fact that reuse of entries of C = A·B is unlikely
• Contrast the general lower bound: #Words_moved = Ω( d^2·n / (P·M^(1/2)) )
• Attained by a divide-and-conquer algorithm that splits the matrices along the dimensions most likely to minimize cost

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Symmetric Eigenproblem and SVD
• Usual approach for A = A^T (SVD similar)
  – A -> Q^T·A·Q = T, where Q orthogonal, T tridiagonal
  – T -> U^T·T·U = Λ, where U orthogonal, Λ diagonal
  – The columns of Q·U are the eigenvectors, Λ holds the eigenvalues
  – Dense -> Tridiagonal -> Diagonal
  – Only half BLAS3, half BLAS2, in LAPACK's sytrd
• Communication-Avoiding approach
  – A -> Q·A·Q^T = B, where B = B^T is banded, of bandwidth ~M^(1/2)
  – Continue as above, starting with B
  – Dense -> Banded -> Tridiagonal -> Diagonal
  – Dense -> Banded: use TSQR to zero out M^(1/2) cols/rows at a time
  – Banded -> Tridiagonal: need a new(ish) idea

Successive Band Reduction (Bischof/Lang/Sun)
[Sequence of animation frames] Orthogonal transformations Q1, Q1^T, Q2, Q2^T, ... eliminate c columns of the band at a time, creating bulges that are then chased down the band (steps 1 through 6 shown).
b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b

Conventional vs CA - SBR
[Animations] Conventional band reduction touches all the data 4 times; the communication-avoiding version touches all the data once.

Speedups of Sym. Band Reduction vs LAPACK's DSBTRD
• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n = 12000, b = 500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n = 12000, b = 200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n = 9000, b = 500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n = 12000, b = 500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem
• No apparent way to modify the standard algorithm
• Instead: Spectral Divide-and-Conquer
  – Find an orthogonal matrix Q whose leading columns span an invariant subspace of A
  – Q^T·A·Q will be block upper triangular:
        [ A11  A12 ]
        [  ε   A22 ]
  – Apply recursively to A11, A22
  – Depends on randomization:
    1. Randomized Rank Revealing QR decomposition
    2. Randomized location to try splitting the spectrum

Attaining the Lower Bounds: Sequential
Legend: [Existing] [Ours] [Math-Lib] [Random]
(Columns of the original table: #Words and #Messages, for two-level memory and for a full memory hierarchy.)
• BLAS-3: [FLPR'99], [BDLST'13], [MKL etc.]
• Cholesky: [G'97], [AP'00], [LAPACK], [BDHS'09]
• Sym. Indefinite: [BBDDDPSTY'13]
• LU: [G'97], [T'97], [GDX'11], [BDLST'13]
• QR: [EG'98], [FW'03], [DGHL'12], [BDLST'13]
• Rank-Revealing QR: [BDD'11], [DGGX'13]
• Sym. Eig & SVD: [BDD'11], [BDK'13]
• Nonsym. Eig: [BDD'11]

Attaining the Lower Bounds: Parallel 2D, M = Ω(n^2/P)
(Ignoring poly-log(P) factors; the bounds are #words = Ω(n^2 / P^(1/2)) and #messages = Ω(P^(1/2)); the saving-factor column is attained with extra memory: 2.5D, M = Ω(c·n^2/P).)
Legend: [Existing] [Ours] [Math-Lib] [Random]
(Columns of the original table: #Words (BW), #Messages (L), saving factor.)
• BLAS-3: [AGZ'94], [MT'99], [ScaLAPACK], [C'69], [vGW'97], [SD'11] – saving: L by n/P^(1/2)
• Cholesky: [ScaLAPACK], [T'99], [SD'11] – saving: L by n/P^(1/2)
• Sym. Indefinite: [BBDDDPSTY'13], [ScaLAPACK] – saving: L by n/P^(1/2)
• LU: [ScaLAPACK], [GDX'11], [T'99], [SD'11] – saving: L by n/P^(1/2)
• QR: [ScaLAPACK], [DGHL'12], [T'99] – saving: L by n/P^(1/2)
• Rank-Revealing QR: [BDD'11], [DGGX'13]
• Sym. Eig & SVD: [BDD'11], [BDK'13], [ScaLAPACK] – saving: L by n/P^(1/2)
• Nonsym. Eig: [BDD'11] – saving: BW by P^(1/2), L by n

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra
• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and the starting vector
  – Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, ...
• Goal: minimize communication
  – Assume the matrix is "well-partitioned"
  – Serial implementation
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors
    • Conventional: O(k log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability
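A toy NumPy sketch of the "matrix powers kernel" idea for a 1D 3-point stencil (tridiagonal A), partitioned by rows: each "processor" fetches k ghost entries per neighbor once, then computes its rows of A·x, ..., A^k·x with no further communication, at the price of some redundant work near the block boundaries. The 1D setting and names are illustrative, not the production kernel.

    # Sketch of the matrix powers kernel for A = tridiag(-1, 2, -1), simulated serially:
    # one exchange of k ghost values per neighbor, then k local SpMVs per block.
    import numpy as np

    def local_powers(x, lo, hi, k, n):
        # Compute rows lo:hi of A x, ..., A^k x, owning only x[lo:hi] plus ghosts.
        glo, ghi = max(lo - k, 0), min(hi + k, n)     # fetch k ghost entries per side, once
        v = x[glo:ghi].copy()                         # local + ghost copy of the vector
        out = []
        for _ in range(k):
            w = np.zeros_like(v)
            w[1:-1] = -v[:-2] + 2.0*v[1:-1] - v[2:]   # interior of the local stencil
            if glo == 0:  w[0]  = 2.0*v[0]  - v[1]    # true boundary rows of A
            if ghi == n:  w[-1] = 2.0*v[-1] - v[-2]
            out.append(w[lo - glo:hi - glo].copy())   # keep only the owned rows
            v = w                                     # correct ghost region shrinks by 1 per step
        return out   # out[j-1] = rows lo:hi of A^j x

    # usage sketch: n = 16; x = np.random.rand(n)
    # pieces = [local_powers(x, p*4, (p+1)*4, k=3, n=n) for p in range(4)]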

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV
• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
[Spy plot of the sparsity pattern]

Example: The Difficulty of Tuning
• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs
[Zoomed spy plot showing the 8x8 dense blocks]

Speedups on Itanium 2: The Need for Search
[Heat map of register-blocking performance] Reference (1x1): 190 Mflops; best block size (4x2): 1190 Mflops.

Register Profile: Itanium 2
[Heat map] Performance ranges from 190 Mflops to 1190 Mflops across register block sizes.

Register Profiles: IBM and Intel IA-64
[Four heat maps; the percentages are the best fraction of machine peak]
  – Power3 – 17%: 122 to 252 Mflops
  – Power4 – 16%: 459 to 820 Mflops
  – Itanium 1 – 8%: 107 to 247 Mflops
  – Itanium 2 – 33%: 190 Mflops to 1.2 Gflops

Another example of tuning challenges for SpMV
• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• n = 16,614
• nnz = 1.1 M
[Spy plot]

Zoom in to top corner
• More complicated non-zero structure in general
• n = 16,614
• nnz = 1.1 M
[Zoomed spy plot]

3x3 blocks look natural, but...
• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

Extra Work Can Improve Efficiency
• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup!
  – Actual Mflop rate is 1.5^2 = 2.25x higher
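A compact SciPy/NumPy sketch of register-blocked (BCSR-style) SpMV with explicit zero fill, to make the trade-off concrete: more flops (the fill ratio) in exchange for one index per block and a fixed, unrollable inner loop. The r x c = 3x3 choice and names are illustrative.

    # Sketch of BCSR (register-blocked) SpMV: y = A @ x with r x c dense blocks,
    # including explicitly stored zeros ("fill") inside partially dense blocks.
    import numpy as np
    from scipy.sparse import random as sprandom, bsr_matrix

    r, c = 3, 3
    A_csr = sprandom(900, 900, density=0.01, format='csr', random_state=0)
    A_bsr = bsr_matrix(A_csr, blocksize=(r, c))       # CSR -> BCSR, zero-filling blocks
    fill_ratio = A_bsr.data.size / A_csr.nnz          # extra flops paid for blocking
    x = np.random.rand(900)

    def bcsr_spmv(A, x):
        # One block row at a time: a single index per r x c block, small dense inner kernel.
        R, C = A.blocksize
        y = np.zeros(A.shape[0])
        for ib in range(A.shape[0] // R):              # loop over block rows
            yb = np.zeros(R)
            for jj in range(A.indptr[ib], A.indptr[ib+1]):
                jb = A.indices[jj]                     # one column index per block
                yb += A.data[jj] @ x[jb*C:(jb+1)*C]    # dense r x c times c-vector
            y[ib*R:(ib+1)*R] = yb
        return y

    # sanity check / usage:
    # np.allclose(bcsr_spmv(A_bsr, x), A_csr @ x)  -> True; print(fill_ratio)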

Source: Accelerator Cavity Design Problem (Ko via Husbands)
[Spy plot of the matrix]

100x100 Submatrix Along Diagonal
[Zoomed spy plot]

Post-RCM Reordering
[Spy plot after reverse Cuthill-McKee reordering]

Effect of Combined RCM+TSP Reordering
[Spy plots] Before: green + red; after: green + blue. 2x speedups on Pentium 4, Power 4, ...

Summary of Other Performance Optimizations
• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations...
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·A^T·x, A^T·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later ...

Optimized Sparse Kernel Interface - OSKI
• Provides sparse kernels automatically tuned for the user's matrix & machine
  – BLAS-style functionality: SpMV (A·x & A^T·y), TrSV
  – Does both off-line and run-time tuning
  – Hides the complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as a stand-alone library
  – Available as a PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, ...
  – bebop.cs.berkeley.edu/poski

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: Classical Conjugate Gradient (CG)
[Algorithm listing] The SpMV and the dot products require communication in each iteration.

Example: CA-Conjugate Gradient
[Algorithm listing] The SpMVs are replaced by the CA matrix powers kernel, a global reduction computes the Gram matrix G, and the local computations within the inner loop require no communication.
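For reference, a plain NumPy implementation of classical CG with the per-iteration communication points marked in comments; CA-CG replaces these with one matrix-powers call and one block reduction every s steps. This is the textbook method, not the CA-CG recurrence on the slide.

    # Classical CG for symmetric positive definite A, communication points marked.
    import numpy as np

    def cg(A, b, tol=1e-8, maxit=1000):
        x = np.zeros_like(b)
        r = b.copy()                      # r = b - A x0
        p = r.copy()
        rs = r @ r                        # dot product  -> global reduction (communication)
        for _ in range(maxit):
            Ap = A @ p                    # SpMV         -> neighbor communication
            alpha = rs / (p @ Ap)         # dot product  -> global reduction
            x += alpha * p
            r -= alpha * Ap
            rs_new = r @ r                # dot product  -> global reduction
            if np.sqrt(rs_new) <= tol * np.linalg.norm(b):
                break
            p = r + (rs_new / rs) * p
            rs = rs_new
        return x

    # usage sketch: x = cg(A, b) for any symmetric positive definite (sparse) A,
    # e.g. the 2D Poisson model problem discussed below.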

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

[Convergence plot: CA-CG (monomial basis) vs CG] Model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400. CA-CG with the monomial basis shows slower convergence and loss of accuracy (relative to machine precision) due to roundoff; at s = 16 the monomial basis is rank deficient and the method breaks down.

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?
• Requires o(n^2) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be a sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + U·D·V^T, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write A^k = (S + U·D·V^T)^k as a sum of S^k and low-rank matrices

                                  Indices explicit (O(nnz))   Indices implicit (o(nnz))
  Nonzero entries explicit:       CSR and variations           Vision, climate, AMR, ...
  Nonzero entries implicit:       Graph Laplacian              Stencils
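A tiny NumPy/SciPy sketch of the "sparse + low rank" case above: keep S, U, D, V separately and apply A = S + U·D·V^T to a vector without ever forming A. Shapes and names are illustrative.

    # Sketch: apply A = S + U D V^T (sparse + low rank) to a vector without forming A.
    import numpy as np
    from scipy.sparse import random as sprandom

    n, r = 1000, 5
    S = sprandom(n, n, density=0.01, format='csr', random_state=0)   # sparse part
    U = np.random.rand(n, r)
    D = np.diag(np.random.rand(r))                                   # small r x r core
    V = np.random.rand(n, r)

    def apply_A(x):
        # y = A @ x = S @ x + U @ (D @ (V^T @ x)); O(nnz(S) + n r) work, no dense A.
        return S @ x + U @ (D @ (V.T @ x))

    x = np.random.rand(n)
    y = apply_A(x)

    # The same structure lets repeated applications of A (e.g. a Krylov basis) be
    # organized as powers of the sparse part plus low-rank corrections, as noted above.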

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Reproducible Floating Point Computation
• Goal: get a bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe the results
  – Willing to sacrifice 40% - 50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What?! How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility
[Plots: absolute error for random vectors (annotation: "same magnitude, opposite signs"); relative error for orthogonal vectors (annotation: "sign not reproducible")]
Vector size 1e6, data aligned to 16-byte boundaries. For each input vector:
  – Dot products are computed using 1, 2, 3, or 4 threads
  – Absolute error = maximum – minimum
  – Relative error = absolute error / maximum absolute value

Goals/Approaches for Reproducibility
• Consider summation or dot products
• Goals
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose the accuracy
• Approaches
  – Guarantee a fixed reduction tree (fails goals 2 and 3)
  – Use (very) high precision to get the exact answer (fails goal 2)
  – Prerounding technique (Nguyen, D.)
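A small Python illustration of the underlying problem and of the first (rejected) approach: floating-point summation is not associative, so a naive parallel sum depends on the number of threads, while a fixed reduction tree gives the same bits every time – at the cost of the performance and portability goals above. This only demonstrates the issue; it is not the prerounding technique of Nguyen and Demmel.

    # Demonstration: the parallel reduction order changes the floating-point result,
    # while a fixed reduction tree (independent of #threads) is bit-wise reproducible.
    import random

    def threaded_sum(x, nthreads):
        # Mimic a naive parallel sum: each "thread" sums its contiguous chunk, then combine.
        chunk = (len(x) + nthreads - 1) // nthreads
        partials = [sum(x[i:i+chunk]) for i in range(0, len(x), chunk)]
        return sum(partials)

    def tree_sum(x):
        # Fixed pairwise reduction tree over the data in its canonical order.
        while len(x) > 1:
            x = [x[i] + x[i+1] for i in range(0, len(x) - 1, 2)] + ([x[-1]] if len(x) % 2 else [])
        return x[0]

    random.seed(0)
    x = [random.uniform(-1, 1) * 10.0**random.randint(-8, 8) for _ in range(100000)]

    print({threaded_sum(x, p) for p in (1, 2, 3, 4)})   # typically several distinct values
    print(tree_sum(list(x)))   # one canonical result, reproducible if every run uses this same tree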

Performance results on 1024 processors of a Cray XC30
1.2x to 3.2x slowdown vs the fastest code, for n = 1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic...

Time to redesign all linear algebra, n-body, ... algorithms and software
(and compilers)

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all ldquon3-likerdquo linear algebra
  • Lower bound for all ldquon3-likerdquo linear algebra (2)
  • Lower bound for all ldquon3-likerdquo linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a ldquoThmrdquo
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a ldquosparse matrixrdquo
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary
Page 28: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

CARMA Performance Distributed Memory

Inner Product m = n = 192 k = 6291456

ScaLAPACK

CARMAPeak

(log)

(log)

Cray XE6 (Hopper) each node 2 x 12 core 4 x NUMA

CARMA Performance Shared Memory

Square m = k = n

MKL (double)CARMA (double)

MKL (single)CARMA (single)

Peak (single)

Peak (double)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

CARMA Performance Shared Memory

Inner Product m = n = 64

MKL (double)

CARMA (double)

MKL (single)

CARMA (single)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

Why is CARMA Faster in Shared MemoryL3 Cache Misses

Shared Memory Inner Product (m = n = 64 k = 524288)

97 Fewer Misses

86 Fewer Misses

(linear)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

One-sided Factorizations (LU QR) so farbull Classical Approach for i=1 to n update column i update trailing matrixbull words_moved = O(n3)

35

bull Blocked Approach (LAPACK) for i=1 to nb update block i of b columns update trailing matrixbull words moved = O(n3M13)

bull Recursive Approach func factor(A) if A has 1 column update it

else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)

bull None of these approaches minimizes messagesbull Parallel case Partial

Pivoting =gt n reductionsbull Need another idea

TSQR An Architecture-Dependent Algorithm

W =

W0

W1

W2

W3

R00

R10

R20

R30

R01

R11

R02Parallel

W =

W0

W1

W2

W3

R01R02

R00

R03

SequentialStreaming

W =

W0

W1

W2

W3

R00

R01

R01

R11

R02

R11

R03

Dual Core

Can choose reduction tree dynamically

Multicore Multisocket Multirack Multisite Out-of-core

Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo

Wnxb =

W1

W2

W3

W4

P1middotL1middotU1

P2middotL2middotU2

P3middotL3middotU3

P4middotL4middotU4

=

Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo

W1rsquoW2rsquoW3rsquoW4rsquo

P12middotL12middotU12

P34middotL34middotU34

=Choose b pivot rows call them W12rsquo

Choose b pivot rows call them W34rsquo

W12rsquoW34rsquo

= P1234middotL1234middotU1234

Choose b pivot rows

Go back to W and use these b pivot rows (move them to top do LU without pivoting)

37

Minimizing Communication in TSLU

W = W1

W2

W3

W4

LULULULU

LU

LULUParallel

W = W1

W2

W3

W4

LULU

LU

LUSequentialStreaming

W = W1

W2

W3

W4

LULU LU

LULU

LULU

Dual Core

Can choose reduction tree dynamically to match architecture as before

38

Making TSLU Numerically Stable

bull Details matterndash Going up the tree we could do LU either on original rows of A

(tournament pivoting) or computed rows of Undash Only tournament pivoting stable

bull ldquoThmrdquo New scheme as stable as Partial Pivoting (GEPP) in following sense Get same Schur complements as GEPP applied to different input matrix whose entries are blocks taken from input A

bull Why just a ldquoThmrdquo

39

Stability of LU using TSLU CALU

Summer School Lecture 4 40

bull Empirical testingndash Both random matrices and ldquospecial onesrdquondash Both binary tree (BCALU) and flat-tree (FCALU)ndash 3 metrics ||PA-LU||||A|| normwise and componentwise backward errorsndash See [D Grigori Xiang 2010] for details

Why is stability of TSLU just a ldquoThmrdquo

bull Proof is correct ndash in exact arithmeticbull Experiment

ndash Generate 100 random 6x6 rank 3 matrices in Matlabndash [LUP] = lu(A) do LU without pivoting on PA compare L factors are

they the samebull Compute || L ndash Lnp || A few 0rsquos A few infinrsquos a few NaNsbull Rest mostly O(1)

ndash Why Floating point is nonassociative doing arithmetic in different order gives different rounding errors

ndash Same experiment with rank 6 matrices || L ndash Lnp || usually nonzero O(macheps)

ndash Same experiment with 20x20 rank 4 matrices || L ndash Lnp || often O(103)

bull Much harder to break TSLU but possiblendash Occurred when using TSLU to factorize a low-rank subdiagonal

panel in symmetric-indefinite factorization41

Fixing TSLU

bull Run TSLU quickly test for stability fix if necessary (rare)

bull Test conditioning of U if not tiny (usual case) proceed elsebull Compute || L || if not big (usual case) proceed elsebull Factor A = QR using TSQR thenbull Factor Q = PLU using TSLU thenbull A = PL(UR) with UR as upper triangular factor

bull Last topic in lecture how to guarantee floating point reproducibility

42

2D CALU with Tournament Pivoting

43

25D CALU with Tournament Pivoting (c=4 copies)

44

Exascale Machine ParametersSource DOE Exascale Workshop

bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log

2 (

n2p

) =

log

2 (m

emo

ry_p

er_p

roc)

Up to 29x

25D vs 2D LUWith and Without Pivoting

Other CA algorithms for Ax=b least squares(13)

bull A symmetric and indefinitendash Seek factorization that retains symmetry PAPT = LDLT D

ldquosimplerdquobull Save frac12 flops preserve inertia

ndash Usual approach Bunch-Kaufmanbull D block diagonal with 1x1 and 2x2 blocksbull Pivot search down column along row (lots of communication)

ndash Alternative Aasenbull D = tridiagonal = Tbull Two steps

ndash PAPT = LTLT where T is banded using TSLU

48

0 0

0

0 0

0

0

hellip

hellip

ndash Solvefactor narrow band problem with Tbull Up to 28x faster than MKL Best Paper at IPDPSrsquo13

Other CA algorithms for Ax=b least squares (23)bull Minimizing bandwidth and latency for sequential GEPP

ndash So far could not do partial pivoting and minimize messages just words

ndash Challengebull Column layout good for choosing pivots bad for matmulbull Blocked layout good for matmul bad for choosing pivots

ndash Solution use both layouts switching between thembull ldquoShape Morphing LUrdquo or SMLU

49

bull func factor(A) if A has 1 column update it else factor(left half of A)

update right half of A

factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M)

bull func factor(A) if A has 1 column update it else factor(left half of A) reshape to recursive block format update right half of A reshape to columnwise format factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M32)

Other CA algorithms for Ax=b least squares (33)bull Need for pivoting arises beyond LU in QR

ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)

ndash Usual approach like Partial Pivoting

bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound

ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two

groups of b columns either using usual approach or something better (GuEisenstat)

bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix

ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 50

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What about sparse matrices (13)

bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then

bull But canrsquot reorder outer loop for 25D need another idea

bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring

bull Kleenersquos Algorithm

52

for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)

D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21

Performance of 25D APSP using Kleene

53

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)

62xspeedup

2x speedup

What about sparse matrices (23)

bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of

G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix

bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering

for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid

bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse

multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on

separators) 54

What about sparse matrices (33)

bull If matrix stays very sparse lower bound unattainable new one

bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of

data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)

bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)

Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely

bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices

along dimensions most likely to minimize cost

55

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Symmetric Eigenproblem and SVD

bull Usual approach for A=AT (SVD similar)ndash A QTAQ = T where Q orthogonal T tridiagonalndash T UTTU = Λ where U orthogonal Λ diagonalndash QUrsquos columns are eigenvectors Λ eigenvaluesndash Dense Tridiagonal Diagonalndash Only half BLAS3 half BLAS2 in LAPACKrsquos sytrd

bull Communication-Avoiding Approachndash A QAQT = B where B=BT banded of bandwidth M12

ndash Continue as above starting with Bndash Dense Banded Tridiagonal Diagonalndash Dense Banded use TSQR to zero out M12 colsrows at a timendash Banded Tridiagonal need new(ish) idea

b+1

b+1

Successive Band Reduction (BischofLangSun)

1

b+1

b+1

d+1

c

Successive Band Reduction (BischofLangSun)

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1Q1

b+1

b+1

d+1

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

12

Q1

b+1

b+1

d+1

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

12

Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

Q1

Q1T

Q2

Q2T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

Q1

Q1T

Q2

Q2T

Q3

Q3T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

6

6

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

Conventional vs CA - SBR

Conventional Communication-Avoiding

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Nonsymmetric Eigenproblem

bull No apparent way to modify standard algorithmbull Instead Spectral Divide-and-Conquer

ndash Find orthogonal matrix Q whose leading columns span an invariant subspace of A

ndash QTAQ will be block upper triangular

ndash Apply recursively to A11 A22

ndash Depends on randomization1 Randomized Rank Revealing QR decomposition2 Randomized location to try splitting spectrum

A11 A12

ε A22

Attaining the Lower bounds SequentialLegend[Existing][Ours][Math-Lib][Random]

Two Levels Memory Hierarchy

Words Messages Words Messages

BLAS-3 [FLPRrsquo99][BDLSTrsquo13][MKL etc] [FLPRrsquo99][BDLSTrsquo13][MKL etc]

Cholesky[Grsquo97][APrsquo00]

[LAPACK][BDHSrsquo09]

[Grsquo97][APrsquo00][BDHSrsquo09] [Grsquo97][APrsquo00][BDHSrsquo09]

Sym Indefinite [BBDDDPSTYrsquo13] [BBDDDPSTYrsquo13]

LU[Grsquo97][Trsquo97]

[GDXrsquo11][BDLSTrsquo13]

[GDXrsquo11][BDLSTrsquo13]

[Grsquo97][Trsquo97] [BDLSTrsquo13] [BDLSTrsquo13]

QR[EGrsquo98][FWrsquo03]

[DGHLrsquo12][BDLSTrsquo13]

[FWrsquo03][DGHLrsquo12][BDLSTrsquo13]

[EGrsquo98][FWrsquo03][BDLSTrsquo13]

[FWrsquo03][BDLSTrsquo13]

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13] [BDDrsquo11]

Non Sym Eig [BDDrsquo11] [BDDrsquo11]

Legend[Existing][Ours][Math-Lib][Random]

Words (BW) Messages (L) Saving factor

BLAS-3 [AGZrsquo94][MTrsquo99][ScaLAPACK][Crsquo69][vGWrsquo97][SDrsquo11] L nP12

Cholesky [ScaLAPACK][Trsquo99][SDrsquo11] L nP12

Sym Indefinite [BBDDDPSTYrsquo13][ScaLAPACK] [BBDDDPSTYrsquo13] L nP12

LU [ScaLAPACK][GDXrsquo11][Trsquo99][SDrsquo11] [GDXrsquo11][Trsquo99][SDrsquo11] L nP12

QR [ScaLAPACK][DGHLrsquo12] [Trsquo99] [DGHLrsquo12][Trsquo99] L nP12

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13][ScaLAPACK] [BDDrsquo11][BDKrsquo13] L nP12

Non-Sym Eig [BDDrsquo11] [BDDrsquo11] BW P12 L n

Attaining with extra memory 25D M=(cn2P)

Attaining the Lower bounds Parallel 2DM=(n2P)(Ignoring poly-log(P) factors words = ( n2 P12) messages = (P12)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Conjugate Gradients (CG) GMRES Lanczos Arnoldi hellip bull Goal minimize communication

ndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computationndash Challenges Poor partitioning Preconditioning Num Stability

75

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Example The Difficulty of Tuning SpMV

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

77

Example The Difficulty of Tuning

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

bull 8x8 dense substructure exploit this to limit mem_refs

78

Speedups on Itanium 2 The Need for Search

Reference

Best 4x2

Mflops

Mflops

79

Register Profile Itanium 2

190 Mflops

1190 Mflops

80

Register Profiles IBM and Intel IA-64Power3 - 17 Power4 - 16

Itanium 2 - 33Itanium 1 - 8

252 Mflops

122 Mflops

820 Mflops

459 Mflops

247 Mflops

107 Mflops

12 Gflops

190 Mflops

Another example of tuning challenges for SpMV

bull Ex11 matrix (fluid flow)

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

82

Zoom in to top corner

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

83

3x3 blocks look natural buthellip

bull Example 3x3 blockingndash Logical grid of 3x3 cells

bull But would lead to lots of ldquofill-inrdquo

84

Extra Work Can Improve Efficiency

bull Example 3x3 blockingndash Logical grid of 3x3 cellsndash Fill-in explicit zerosndash Unroll 3x3 block multipliesndash ldquoFill ratiordquo = 15

bull On Pentium III 15x speedup

ndash Actual mflop rate 152 = 225 higher

85

Source Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal

Summer School Lecture 7

87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

Before Green + RedAfter Green + Blue

Summer School Lecture 7

892x speedups on Pentium 4 Power 4 hellip

Summary of Other Performance Optimizations

bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip

bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR

bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip

90

Optimized Sparse Kernel Interface - OSKI

bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning

bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski

bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression

software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki

91

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

93

Example Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in

each iteration

via CA Matrix Powers Kernel

Global reduction to compute G

94

Example CA-Conjugate Gradient

Local computations within inner loop require

no communication

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

96

Slower convergence due

to roundoff

Loss of accuracy due to roundoff

At s = 16 monomial basis is rank deficient Method breaks down

Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400

CA-CG (monomial)CG

machine precision

97

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit

bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square

bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank

matrices

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Reproducible Floating Point Computation
• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm at GNS-MBH
  – Sought reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe the results
  – Willing to sacrifice 40%-50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What? How will I debug without reproducibility?"
  – Few: "I know better and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility
Vector size 1e6, data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3 or 4 threads
• Absolute error = maximum – minimum
• Relative error = Absolute error / maximum absolute value
[Plots: "Absolute Error for Random Vectors" – results of the same magnitude but opposite signs; "Relative Error for Orthogonal Vectors" – even the sign is not reproducible.]
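The underlying effect is just nonassociativity of floating-point addition; a tiny Python illustration (not the MKL experiment itself) shows that summing the same data in different orders already changes the computed value:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(10**6)

    s_forward  = float(np.sum(x))            # one summation order
    s_reversed = float(np.sum(x[::-1]))      # same data, reversed order
    s_sorted   = float(np.sum(np.sort(x)))   # yet another order

    # Different orders accumulate different rounding errors, so the results
    # typically differ in the last bits -- exactly what happens when a
    # threaded BLAS changes its reduction tree from run to run.
    print(s_forward - s_reversed, s_forward - s_sorted)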

Goals/Approaches for Reproducibility
• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, number of processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (fails goals 2 or 3)
  – Use (very) high precision to get exact answer (fails goal 2)
  – Pre-rounding technique (Nguyen, D.)
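A minimal sketch of the pre-rounding idea, under my own simplifying assumptions (a single "bin", no zero/overflow handling, and my own choice of boundary); it is not the Nguyen-Demmel algorithm itself, but it shows why pre-rounded summands add exactly, and hence reproducibly, in any order.

    import numpy as np

    def reproducible_sum_1bin(x):
        # Pre-round every summand to a common quantum so that all partial sums
        # are exact and therefore independent of summation order. With only one
        # bin the accuracy is limited; the full algorithm also keeps the
        # low-order residues (x - hi) in further bins to meet a chosen accuracy.
        x = np.asarray(x, dtype=np.float64)
        n = x.size
        m = np.max(np.abs(x))                        # max is order-independent
        # Boundary chosen so any partial sum of the pre-rounded values stays
        # well below it, keeping every addition exact.
        boundary = 2.0 ** (np.ceil(np.log2(m)) + np.ceil(np.log2(n)) + 2)
        hi = (x + boundary) - boundary               # round each x_i to a multiple of ulp(boundary)
        return float(np.sum(hi))                     # same bits for any summation order

    rng = np.random.default_rng(0)
    x = rng.standard_normal(10**6)
    print(reproducible_sum_1bin(x) == reproducible_sum_1bin(x[::-1]))   # True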

Performance results on 1024-proc Cray XC30: 1.2x to 3.2x slowdown vs fastest code for n = 1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary
Don't Communic…
Time to redesign all linear algebra, n-body, … algorithms and software (and compilers).



CARMA Performance Shared Memory

Inner Product m = n = 64

MKL (double)

CARMA (double)

MKL (single)

CARMA (single)

(log)

(linear)

Intel Emerald 4 Intel Xeon X7560 x 8 cores 4 x NUMA

Why is CARMA Faster in Shared MemoryL3 Cache Misses

Shared Memory Inner Product (m = n = 64 k = 524288)

97 Fewer Misses

86 Fewer Misses

(linear)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

One-sided Factorizations (LU QR) so farbull Classical Approach for i=1 to n update column i update trailing matrixbull words_moved = O(n3)

35

bull Blocked Approach (LAPACK) for i=1 to nb update block i of b columns update trailing matrixbull words moved = O(n3M13)

bull Recursive Approach func factor(A) if A has 1 column update it

else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)

bull None of these approaches minimizes messagesbull Parallel case Partial

Pivoting =gt n reductionsbull Need another idea

TSQR An Architecture-Dependent Algorithm

W =

W0

W1

W2

W3

R00

R10

R20

R30

R01

R11

R02Parallel

W =

W0

W1

W2

W3

R01R02

R00

R03

SequentialStreaming

W =

W0

W1

W2

W3

R00

R01

R01

R11

R02

R11

R03

Dual Core

Can choose reduction tree dynamically

Multicore Multisocket Multirack Multisite Out-of-core

Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo

Wnxb =

W1

W2

W3

W4

P1middotL1middotU1

P2middotL2middotU2

P3middotL3middotU3

P4middotL4middotU4

=

Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo

W1rsquoW2rsquoW3rsquoW4rsquo

P12middotL12middotU12

P34middotL34middotU34

=Choose b pivot rows call them W12rsquo

Choose b pivot rows call them W34rsquo

W12rsquoW34rsquo

= P1234middotL1234middotU1234

Choose b pivot rows

Go back to W and use these b pivot rows (move them to top do LU without pivoting)

37

Minimizing Communication in TSLU

W = W1

W2

W3

W4

LULULULU

LU

LULUParallel

W = W1

W2

W3

W4

LULU

LU

LUSequentialStreaming

W = W1

W2

W3

W4

LULU LU

LULU

LULU

Dual Core

Can choose reduction tree dynamically to match architecture as before

38

Making TSLU Numerically Stable

bull Details matterndash Going up the tree we could do LU either on original rows of A

(tournament pivoting) or computed rows of Undash Only tournament pivoting stable

bull ldquoThmrdquo New scheme as stable as Partial Pivoting (GEPP) in following sense Get same Schur complements as GEPP applied to different input matrix whose entries are blocks taken from input A

bull Why just a ldquoThmrdquo

39

Stability of LU using TSLU CALU

Summer School Lecture 4 40

bull Empirical testingndash Both random matrices and ldquospecial onesrdquondash Both binary tree (BCALU) and flat-tree (FCALU)ndash 3 metrics ||PA-LU||||A|| normwise and componentwise backward errorsndash See [D Grigori Xiang 2010] for details

Why is stability of TSLU just a ldquoThmrdquo

bull Proof is correct ndash in exact arithmeticbull Experiment

ndash Generate 100 random 6x6 rank 3 matrices in Matlabndash [LUP] = lu(A) do LU without pivoting on PA compare L factors are

they the samebull Compute || L ndash Lnp || A few 0rsquos A few infinrsquos a few NaNsbull Rest mostly O(1)

ndash Why Floating point is nonassociative doing arithmetic in different order gives different rounding errors

ndash Same experiment with rank 6 matrices || L ndash Lnp || usually nonzero O(macheps)

ndash Same experiment with 20x20 rank 4 matrices || L ndash Lnp || often O(103)

bull Much harder to break TSLU but possiblendash Occurred when using TSLU to factorize a low-rank subdiagonal

panel in symmetric-indefinite factorization41

Fixing TSLU

bull Run TSLU quickly test for stability fix if necessary (rare)

bull Test conditioning of U if not tiny (usual case) proceed elsebull Compute || L || if not big (usual case) proceed elsebull Factor A = QR using TSQR thenbull Factor Q = PLU using TSLU thenbull A = PL(UR) with UR as upper triangular factor

bull Last topic in lecture how to guarantee floating point reproducibility

42

2D CALU with Tournament Pivoting

43

25D CALU with Tournament Pivoting (c=4 copies)

44

Exascale Machine ParametersSource DOE Exascale Workshop

bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log

2 (

n2p

) =

log

2 (m

emo

ry_p

er_p

roc)

Up to 29x

25D vs 2D LUWith and Without Pivoting

Other CA algorithms for Ax=b least squares(13)

bull A symmetric and indefinitendash Seek factorization that retains symmetry PAPT = LDLT D

ldquosimplerdquobull Save frac12 flops preserve inertia

ndash Usual approach Bunch-Kaufmanbull D block diagonal with 1x1 and 2x2 blocksbull Pivot search down column along row (lots of communication)

ndash Alternative Aasenbull D = tridiagonal = Tbull Two steps

ndash PAPT = LTLT where T is banded using TSLU

48

0 0

0

0 0

0

0

hellip

hellip

ndash Solvefactor narrow band problem with Tbull Up to 28x faster than MKL Best Paper at IPDPSrsquo13

Other CA algorithms for Ax=b least squares (23)bull Minimizing bandwidth and latency for sequential GEPP

ndash So far could not do partial pivoting and minimize messages just words

ndash Challengebull Column layout good for choosing pivots bad for matmulbull Blocked layout good for matmul bad for choosing pivots

ndash Solution use both layouts switching between thembull ldquoShape Morphing LUrdquo or SMLU

49

bull func factor(A) if A has 1 column update it else factor(left half of A)

update right half of A

factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M)

bull func factor(A) if A has 1 column update it else factor(left half of A) reshape to recursive block format update right half of A reshape to columnwise format factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M32)

Other CA algorithms for Ax=b least squares (33)bull Need for pivoting arises beyond LU in QR

ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)

ndash Usual approach like Partial Pivoting

bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound

ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two

groups of b columns either using usual approach or something better (GuEisenstat)

bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix

ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 50

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What about sparse matrices (13)

bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then

bull But canrsquot reorder outer loop for 25D need another idea

bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring

bull Kleenersquos Algorithm

52

for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)

D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21

Performance of 25D APSP using Kleene

53

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)

62xspeedup

2x speedup

What about sparse matrices (23)

bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of

G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix

bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering

for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid

bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse

multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on

separators) 54

What about sparse matrices (33)

bull If matrix stays very sparse lower bound unattainable new one

bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of

data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)

bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)

Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely

bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices

along dimensions most likely to minimize cost

55

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Symmetric Eigenproblem and SVD

bull Usual approach for A=AT (SVD similar)ndash A QTAQ = T where Q orthogonal T tridiagonalndash T UTTU = Λ where U orthogonal Λ diagonalndash QUrsquos columns are eigenvectors Λ eigenvaluesndash Dense Tridiagonal Diagonalndash Only half BLAS3 half BLAS2 in LAPACKrsquos sytrd

bull Communication-Avoiding Approachndash A QAQT = B where B=BT banded of bandwidth M12

ndash Continue as above starting with Bndash Dense Banded Tridiagonal Diagonalndash Dense Banded use TSQR to zero out M12 colsrows at a timendash Banded Tridiagonal need new(ish) idea

b+1

b+1

Successive Band Reduction (BischofLangSun)

1

b+1

b+1

d+1

c

Successive Band Reduction (BischofLangSun)

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1Q1

b+1

b+1

d+1

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

12

Q1

b+1

b+1

d+1

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

12

Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

Q1

Q1T

Q2

Q2T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

Q1

Q1T

Q2

Q2T

Q3

Q3T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

6

6

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

Conventional vs CA - SBR

Conventional Communication-Avoiding

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Nonsymmetric Eigenproblem

bull No apparent way to modify standard algorithmbull Instead Spectral Divide-and-Conquer

ndash Find orthogonal matrix Q whose leading columns span an invariant subspace of A

ndash QTAQ will be block upper triangular

ndash Apply recursively to A11 A22

ndash Depends on randomization1 Randomized Rank Revealing QR decomposition2 Randomized location to try splitting spectrum

A11 A12

ε A22

Attaining the Lower bounds SequentialLegend[Existing][Ours][Math-Lib][Random]

Two Levels Memory Hierarchy

Words Messages Words Messages

BLAS-3 [FLPRrsquo99][BDLSTrsquo13][MKL etc] [FLPRrsquo99][BDLSTrsquo13][MKL etc]

Cholesky[Grsquo97][APrsquo00]

[LAPACK][BDHSrsquo09]

[Grsquo97][APrsquo00][BDHSrsquo09] [Grsquo97][APrsquo00][BDHSrsquo09]

Sym Indefinite [BBDDDPSTYrsquo13] [BBDDDPSTYrsquo13]

LU[Grsquo97][Trsquo97]

[GDXrsquo11][BDLSTrsquo13]

[GDXrsquo11][BDLSTrsquo13]

[Grsquo97][Trsquo97] [BDLSTrsquo13] [BDLSTrsquo13]

QR[EGrsquo98][FWrsquo03]

[DGHLrsquo12][BDLSTrsquo13]

[FWrsquo03][DGHLrsquo12][BDLSTrsquo13]

[EGrsquo98][FWrsquo03][BDLSTrsquo13]

[FWrsquo03][BDLSTrsquo13]

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13] [BDDrsquo11]

Non Sym Eig [BDDrsquo11] [BDDrsquo11]

Legend[Existing][Ours][Math-Lib][Random]

Words (BW) Messages (L) Saving factor

BLAS-3 [AGZrsquo94][MTrsquo99][ScaLAPACK][Crsquo69][vGWrsquo97][SDrsquo11] L nP12

Cholesky [ScaLAPACK][Trsquo99][SDrsquo11] L nP12

Sym Indefinite [BBDDDPSTYrsquo13][ScaLAPACK] [BBDDDPSTYrsquo13] L nP12

LU [ScaLAPACK][GDXrsquo11][Trsquo99][SDrsquo11] [GDXrsquo11][Trsquo99][SDrsquo11] L nP12

QR [ScaLAPACK][DGHLrsquo12] [Trsquo99] [DGHLrsquo12][Trsquo99] L nP12

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13][ScaLAPACK] [BDDrsquo11][BDKrsquo13] L nP12

Non-Sym Eig [BDDrsquo11] [BDDrsquo11] BW P12 L n

Attaining with extra memory 25D M=(cn2P)

Attaining the Lower bounds Parallel 2DM=(n2P)(Ignoring poly-log(P) factors words = ( n2 P12) messages = (P12)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Conjugate Gradients (CG) GMRES Lanczos Arnoldi hellip bull Goal minimize communication

ndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computationndash Challenges Poor partitioning Preconditioning Num Stability

75

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Example The Difficulty of Tuning SpMV

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

77

Example The Difficulty of Tuning

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

bull 8x8 dense substructure exploit this to limit mem_refs

78

Speedups on Itanium 2 The Need for Search

Reference

Best 4x2

Mflops

Mflops

79

Register Profile Itanium 2

190 Mflops

1190 Mflops

80

Register Profiles IBM and Intel IA-64Power3 - 17 Power4 - 16

Itanium 2 - 33Itanium 1 - 8

252 Mflops

122 Mflops

820 Mflops

459 Mflops

247 Mflops

107 Mflops

12 Gflops

190 Mflops

Another example of tuning challenges for SpMV

bull Ex11 matrix (fluid flow)

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

82

Zoom in to top corner

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

83

3x3 blocks look natural buthellip

bull Example 3x3 blockingndash Logical grid of 3x3 cells

bull But would lead to lots of ldquofill-inrdquo

84

Extra Work Can Improve Efficiency

bull Example 3x3 blockingndash Logical grid of 3x3 cellsndash Fill-in explicit zerosndash Unroll 3x3 block multipliesndash ldquoFill ratiordquo = 15

bull On Pentium III 15x speedup

ndash Actual mflop rate 152 = 225 higher

85

Source Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal

Summer School Lecture 7

87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

Before: Green + Red.  After: Green + Blue.


2x speedups on Pentium 4, Power 4, …

89

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels (see the sketch after this slide)
  – A·Aᵀ·x, Aᵀ·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later…

90
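The higher-level-kernel win comes from touching each row of A once but using it twice. A sketch for y = AᵀA·x over SciPy's CSR arrays (an illustration of the idea under that assumption, not any library's actual kernel):

```python
import numpy as np
from scipy.sparse import csr_matrix

def ata_x_onepass(A: csr_matrix, x):
    """Compute y = A^T (A x) with a single sweep over the rows of A.

    Each sparse row a_i is read once and used twice:
        t_i = a_i . x, then y += t_i * a_i
    so A streams through memory once instead of twice."""
    y = np.zeros(A.shape[1])
    indptr, indices, data = A.indptr, A.indices, A.data
    for i in range(A.shape[0]):
        lo, hi = indptr[i], indptr[i + 1]
        cols, vals = indices[lo:hi], data[lo:hi]
        t = vals @ x[cols]          # a_i . x
        y[cols] += t * vals         # y += t * a_i
    return y

# sanity check against the two-pass formula
A = csr_matrix(np.random.rand(50, 40) * (np.random.rand(50, 40) < 0.1))
x = np.random.rand(40)
assert np.allclose(ata_x_onepass(A, x), A.T @ (A @ x))
```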

Optimized Sparse Kernel Interface - OSKI

• Provides sparse kernels automatically tuned for user's matrix & machine
  – BLAS-style functionality: SpMV (Ax & Aᵀy), TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as stand-alone library
  – Available as PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

91

Outline
• Review / extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods: Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism / nonassociativity

93

Example Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in each iteration

via CA Matrix Powers Kernel

Global reduction to compute G

94

Example CA-Conjugate Gradient

Local computations within inner loop require no communication
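A minimal sketch of the two building blocks named on these slides, under the assumption of the monomial basis: the matrix powers kernel output V = [p, Ap, ..., A^s p] (computed naively here; a CA implementation forms it locally after one neighbor exchange), and the single Gram matrix G = VᵀV, whose one global reduction replaces the 2s separate dot products. The per-iteration coefficient updates then involve only small s-sized objects and need no communication. Function names are mine.

```python
import numpy as np

def monomial_basis(A, p, s):
    """V = [p, A p, A^2 p, ..., A^s p]: the matrix powers kernel output."""
    V = np.empty((len(p), s + 1))
    V[:, 0] = p
    for j in range(s):
        V[:, j + 1] = A @ V[:, j]
    return V

def gram_matrix(V):
    """G = V^T V: one reduction; every later inner product <V c1, V c2>
    becomes the local, (s+1) x (s+1) computation c1^T G c2."""
    return V.T @ V

# e.g. the CG scalar p^T A p can be read off G, since A p is just a shift
# of p's coefficient vector in the monomial basis:
A = np.diag(np.arange(1.0, 101.0))          # any symmetric matrix works here
p = np.ones(100)
V = monomial_basis(A, p, s=4)
G = gram_matrix(V)
c_p, c_Ap = np.eye(5)[0], np.eye(5)[1]      # coefficients of p and A p in V
assert np.isclose(c_p @ G @ c_Ap, p @ (A @ p))
```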

Outline
• Review / extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods: Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism / nonassociativity

96

(Convergence plot: CG vs CA-CG (monomial basis), residual vs iteration, with machine precision marked.)
Slower convergence due to roundoff; loss of accuracy due to roundoff.
At s = 16 the monomial basis is rank deficient and the method breaks down.
Model problem:
• 2D Poisson, 5-point stencil
• 30x30 grid
• cond(A) ~ 400

97
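A quick numerical check of that breakdown mechanism on (roughly) the slide's model problem: the condition number of the normalized monomial basis [p, Ap, ..., A^s p] grows with s until the basis is numerically rank deficient. A sketch only; the exact s at which it breaks depends on the vector and scaling.

```python
import numpy as np
import scipy.sparse as sp

n = 30
I = sp.identity(n)
T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n))
A = (sp.kron(I, T) + sp.kron(T, I)).tocsr()   # 2D Poisson, 5-point stencil, 30x30 grid

rng = np.random.default_rng(0)
v = rng.standard_normal(A.shape[0])

V = [v / np.linalg.norm(v)]
for s in range(1, 17):
    w = A @ V[-1]
    V.append(w / np.linalg.norm(w))
    cond = np.linalg.cond(np.column_stack(V))
    print(s, f"{cond:.2e}")   # grows steadily; near 1/eps the basis is useless
```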

Outline
• Review / extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods: Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism / nonassociativity

What is a "sparse matrix"?
• Requires o(n²) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be a sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + U·D·Vᵀ, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write Aᵏ = (S + U·D·Vᵀ)ᵏ as a sum of Sᵏ and low-rank matrices

                        Indices explicit (O(nnz))   Indices implicit (o(nnz))
Entries explicit:       CSR and variations          Vision, climate, AMR, …
Entries implicit:       Graph Laplacian             Stencils
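To see why the last bullet needs both pieces, expand the k = 2 case (the general k is analogous; this is just algebra, not from the slides):

```latex
A^2 = (S + UDV^T)^2
    = S^2 \;+\; \bigl( S\,U D V^T + U D V^T S + U\,(D V^T U D)\,V^T \bigr),
```

where each term in the parentheses has rank at most rank(U), so A^k splits into S^k (handled by the sparse matrix powers machinery) plus a low-rank correction that is cheap to apply and communicate.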

Outline
• Review / extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods: Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism / nonassociativity

101

• Get a bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm at GNS-MBH
  – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe the results
  – Willing to sacrifice 40%-50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What? How will I debug without reproducibility?"
  – Few: "I know better and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Reproducible Floating Point Computation

Intel MKL non-reproducibility
(Two plots: absolute error for random vectors, where results of the same magnitude but opposite signs occur, and relative error for orthogonal vectors, where even the sign is not reproducible.)
Vector size: 1e6. Data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum - minimum
• Relative error = Absolute error / maximum absolute value

103

• Consider summation or dot product
• Goals:
  1. Same answer independent of layout, processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (not 2 or 3)
  – Use (very) high precision to get exact answer (not 2)
  – Prerounding technique (Nguyen, D.)

Goals / Approaches for Reproducibility

104
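A toy, single-bin sketch of the pre-rounding idea (the real Nguyen/Demmel scheme uses a few bins and avoids an extra pass over the data): round every summand to a common grid chosen from n and max|x_i|, after which all additions are exact and any summation order returns the same bits. Accuracy is traded away when the data span many magnitudes, which is what the "user can choose accuracy" goal addresses.

```python
import math, random
import numpy as np

def prerounded_sum(x):
    """Order-independent sum via pre-rounding to a common grid (1-bin sketch).

    With M >= 2*n*max|x_i| a power of two, fl((x_i + M) - M) rounds x_i to the
    grid of spacing ~ulp(M) exactly; the rounded values and all partial sums
    are then exactly representable, so every summation order gives the same
    bits (at the cost of absolute error up to ~n*ulp(M))."""
    x = np.asarray(x, dtype=np.float64)
    n = len(x)
    M = 2.0 ** math.ceil(math.log2(2 * n * np.max(np.abs(x)) + 1))
    q = (x + M) - M            # pre-rounded summands
    return float(np.sum(q))    # any order, any reduction tree: same result

vals = [random.uniform(-1, 1) for _ in range(10**5)]
s1 = prerounded_sum(vals)
s2 = prerounded_sum(list(reversed(vals)))
assert s1 == s2               # bitwise identical despite the different order
```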

Performance results on 1024 proc Cray XC30: 1.2x to 3.2x slowdown vs fastest code, for n = 1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

106

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all ldquon3-likerdquo linear algebra
  • Lower bound for all ldquon3-likerdquo linear algebra (2)
  • Lower bound for all ldquon3-likerdquo linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a ldquoThmrdquo
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a ldquosparse matrixrdquo
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary
Page 31: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

Why is CARMA Faster in Shared Memory? L3 Cache Misses
(Shared-memory inner product, m = n = 64, k = 524,288: CARMA incurs 86%-97% fewer L3 misses than the baseline; linear scale.)

Outline
• Review / extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods: Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism / nonassociativity

One-sided Factorizations (LU, QR), so far
• Classical approach:
    for i = 1 to n { update column i; update trailing matrix }
  words_moved = O(n³)

35

• Blocked approach (LAPACK):
    for i = 1 to n/b { update block i of b columns; update trailing matrix }
  words_moved = O(n³/M^(1/3))
• Recursive approach (see the sketch after this slide):
    func factor(A):
      if A has 1 column: update it
      else: factor(left half of A); update right half of A; factor(right half of A)
  words_moved = O(n³/M^(1/2))
• None of these approaches minimizes messages
• Parallel case: partial pivoting => n reductions
• Need another idea
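As a concreteness check, a minimal numpy sketch of that recursive approach, written by me and deliberately without pivoting (pivoting is what TSLU/CALU add later); the point is only the recursion that yields O(n³/M^(1/2)) words moved.

```python
import numpy as np

def recursive_lu(A):
    """In-place recursive LU without pivoting (structure only).
    Overwrites A with L (unit lower triangular part) and U."""
    n = A.shape[1]
    if n == 1:
        A[1:, 0] /= A[0, 0]                 # single column: scale below the diagonal
        return A
    k = n // 2
    recursive_lu(A[:, :k])                                    # factor left half
    L11 = np.tril(A[:k, :k], -1) + np.eye(k)
    A[:k, k:] = np.linalg.solve(L11, A[:k, k:])                # U12 = L11^{-1} A12
    A[k:, k:] -= A[k:, :k] @ A[:k, k:]                         # Schur complement update
    recursive_lu(A[k:, k:])                                    # factor right half
    return A

# sanity check on a matrix that needs no pivoting
A = np.random.rand(6, 6) + 6 * np.eye(6)     # diagonally dominant
LU = recursive_lu(A.copy())
L, U = np.tril(LU, -1) + np.eye(6), np.triu(LU)
assert np.allclose(L @ U, A)
```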

TSQR An Architecture-Dependent Algorithm

(Figure: TSQR reduction trees applied to W = [W0; W1; W2; W3].
Parallel, binary tree: local QRs give R00, R10, R20, R30; pairs are stacked and refactored to give R01, R11; one more step gives the final R02.
Sequential / streaming, flat tree: R00 is folded together with W1, W2, W3 in turn, giving R01, R02, R03.
Dual core: a hybrid of the two trees.)

Can choose reduction tree dynamically

Multicore / Multisocket / Multirack / Multisite / Out-of-core
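A small sketch of the parallel (binary-tree) variant, using numpy's QR as the local factorization and returning only the final R; the W0..W3 splitting mirrors the figure, and the function name is mine. Real TSQR also keeps the implicit Q factors from each node.

```python
import numpy as np

def tsqr_R(blocks):
    """R factor of the tall-skinny matrix [W0; W1; ...; Wk] via a binary
    reduction tree of small QRs: local QRs first, then pairs of b x b R
    factors are stacked and refactored, so only O(log P) small messages
    would be needed in a distributed setting."""
    Rs = [np.linalg.qr(W, mode='r') for W in blocks]     # local QRs, no communication
    while len(Rs) > 1:
        nxt = []
        for i in range(0, len(Rs) - 1, 2):
            nxt.append(np.linalg.qr(np.vstack([Rs[i], Rs[i + 1]]), mode='r'))
        if len(Rs) % 2:                                  # odd block passes through
            nxt.append(Rs[-1])
        Rs = nxt
    return Rs[0]

W = np.random.rand(4000, 8)
blocks = np.array_split(W, 4)                            # W0..W3 as in the figure
R_tree = tsqr_R(blocks)
R_ref  = np.linalg.qr(W, mode='r')
# R is unique only up to the signs of its rows:
assert np.allclose(np.abs(R_tree), np.abs(R_ref))
```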

Back to LU: use a similar idea for TSLU as for TSQR; use a reduction tree to do "Tournament Pivoting"

W (n x b) = [W1; W2; W3; W4]
Step 1: factor each block with GEPP, Wi = Pi·Li·Ui, and choose b pivot rows of Wi; call them Wi'.
Step 2: stack the winners pairwise: [W1'; W2'] = P12·L12·U12 and [W3'; W4'] = P34·L34·U34; choose b pivot rows from each, call them W12' and W34'.
Step 3: [W12'; W34'] = P1234·L1234·U1234; choose the final b pivot rows.
Go back to W and use these b pivot rows (move them to the top, then do LU without pivoting).

37
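A sketch of the pivot-row selection in that tournament, using SciPy's pivoted LU (scipy.linalg.lu) as the GEPP at each tree node; helper names are mine and the block count 4 just mirrors the figure.

```python
import numpy as np
import scipy.linalg as sla

def gepp_pivot_rows(M, b):
    """Indices of the b rows of M that GEPP picks as pivots."""
    P = sla.lu(M)[0]                   # M = P @ L @ U, P a permutation matrix
    return np.argmax(P, axis=0)[:b]    # row of M landing in pivot position i

def tournament_pivot_rows(W, b, n_blocks=4):
    """Global indices of b pivot rows of the tall panel W (n x b), chosen by
    tournament pivoting: GEPP selects b candidates per block, then winners
    are stacked pairwise and GEPP is rerun, up the reduction tree."""
    blocks = np.array_split(np.arange(W.shape[0]), n_blocks)   # W1..W4
    cands = [blk[gepp_pivot_rows(W[blk, :], b)] for blk in blocks]
    while len(cands) > 1:
        nxt = []
        for i in range(0, len(cands) - 1, 2):
            rows = np.concatenate([cands[i], cands[i + 1]])
            nxt.append(rows[gepp_pivot_rows(W[rows, :], b)])
        if len(cands) % 2:
            nxt.append(cands[-1])
        cands = nxt
    return cands[0]    # move these b rows to the top, then LU without pivoting

W = np.random.rand(1024, 8)
print(tournament_pivot_rows(W, 8))
```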

Minimizing Communication in TSLU

(Figure: the same three reduction trees as for TSQR, with an LU factorization at each node of W = [W1; W2; W3; W4]: a binary tree for the parallel case, a flat tree for the sequential / streaming case, and a hybrid for dual core.)

Can choose reduction tree dynamically to match architecture, as before

38

Making TSLU Numerically Stable

• Details matter
  – Going up the tree, we could do LU either on original rows of A (tournament pivoting) or on computed rows of U
  – Only tournament pivoting is stable
• "Thm": the new scheme is as stable as Partial Pivoting (GEPP) in the following sense: it gets the same Schur complements as GEPP applied to a different input matrix whose entries are blocks taken from the input A
• Why just a "Thm"?

39

Stability of LU using TSLU: CALU

40

• Empirical testing
  – Both random matrices and "special ones"
  – Both binary tree (BCALU) and flat-tree (FCALU)
  – 3 metrics: ||PA-LU||/||A||, normwise and componentwise backward errors
  – See [D., Grigori, Xiang, 2010] for details

Why is stability of TSLU just a "Thm"?

• The proof is correct, in exact arithmetic
• Experiment:
  – Generate 100 random 6x6 rank-3 matrices in Matlab
  – [L,U,P] = lu(A); do LU without pivoting on P·A; compare L factors: are they the same?
  – Compute ||L - Lnp||: a few 0's, a few ∞'s, a few NaNs; the rest mostly O(1)
  – Why? Floating point is nonassociative; doing arithmetic in a different order gives different rounding errors
  – Same experiment with rank-6 matrices: ||L - Lnp|| usually nonzero, O(macheps)
  – Same experiment with 20x20 rank-4 matrices: ||L - Lnp|| often O(10³)
• Much harder to break TSLU, but possible
  – Occurred when using TSLU to factorize a low-rank subdiagonal panel in a symmetric-indefinite factorization

41
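A rough Python transcription of that experiment (the slide's version is Matlab; results vary from run to run, which is exactly the point about rounding-error sensitivity):

```python
import numpy as np
import scipy.linalg as sla

def lu_nopivot_L(A):
    """L from LU without pivoting; huge/inf/NaN entries appear if a pivot is ~0."""
    A = A.astype(float).copy()
    n = A.shape[0]
    L = np.eye(n)
    for k in range(n - 1):
        L[k+1:, k] = A[k+1:, k] / A[k, k]
        A[k+1:, k:] -= np.outer(L[k+1:, k], A[k, k:])
    return L

rng = np.random.default_rng(1)
diffs = []
with np.errstate(all='ignore'):
    for _ in range(100):
        A = rng.standard_normal((6, 3)) @ rng.standard_normal((3, 6))   # rank 3
        P, L, U = sla.lu(A)                   # GEPP: A = P L U
        Lnp = lu_nopivot_L(P.T @ A)           # unpivoted LU of the pre-permuted matrix
        diffs.append(np.max(np.abs(L - Lnp)))

d = np.array(diffs)
print("zeros:", np.sum(d == 0),
      "finite nonzero:", np.sum(np.isfinite(d) & (d > 0)),
      "inf/NaN:", np.sum(~np.isfinite(d)))
```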

Fixing TSLU

• Run TSLU quickly, test for stability, fix if necessary (rare):
  – Test conditioning of U; if not tiny (usual case), proceed, else
  – Compute ||L||; if not big (usual case), proceed, else
  – Factor A = QR using TSQR, then
  – Factor Q = PLU using TSLU, then
  – A = P·L·(U·R), with U·R as the upper triangular factor
• Last topic in lecture: how to guarantee floating point reproducibility

42

2D CALU with Tournament Pivoting

43

2.5D CALU with Tournament Pivoting (c = 4 copies)

44

Exascale Machine Parameters (Source: DOE Exascale Workshop)

• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node

Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU
(Heatmap over log2(p) and log2(n²/p) = log2(memory_per_proc); predicted speedups up to 29x.)

2.5D vs 2D LU, With and Without Pivoting

Other CA algorithms for Ax=b, least squares (1/3)

• A symmetric and indefinite
  – Seek a factorization that retains symmetry, P·A·Pᵀ = L·D·Lᵀ, with D "simple"
    • Saves half the flops, preserves inertia
  – Usual approach: Bunch-Kaufman
    • D block diagonal with 1x1 and 2x2 blocks
    • Pivot search down the column and along the row (lots of communication)
  – Alternative: Aasen
    • D = tridiagonal = T
    • Two steps: P·A·Pᵀ = L·T·Lᵀ with T banded, using TSLU; then solve/factor the narrow band problem with T
    • Up to 2.8x faster than MKL; Best Paper at IPDPS'13

48

Other CA algorithms for Ax=b, least squares (2/3)
• Minimizing bandwidth and latency for sequential GEPP
  – So far, could not do partial pivoting and minimize messages, just words
  – Challenge:
    • Column layout good for choosing pivots, bad for matmul
    • Blocked layout good for matmul, bad for choosing pivots
  – Solution: use both layouts, switching between them: "Shape Morphing LU" (SMLU)

49

• Recursive LU (columnwise layout throughout):
    func factor(A):
      if A has 1 column: update it
      else:
        factor(left half of A)
        update right half of A
        factor(right half of A)
  Words = O(n³/M^(1/2)), Messages = O(n³/M)

• Shape Morphing LU:
    func factor(A):
      if A has 1 column: update it
      else:
        factor(left half of A)
        reshape to recursive block format
        update right half of A
        reshape to columnwise format
        factor(right half of A)
  Words = O(n³/M^(1/2)), Messages = O(n³/M^(3/2))

Other CA algorithms for Ax=b, least squares (3/3)
• Need for pivoting arises beyond LU, in QR
  – Choose a permutation P so that the leading columns of A·P = Q·R span the column space of A: Rank Revealing QR (RRQR)
  – Usual approach, like partial pivoting: put the longest column first, update the rest of the matrix, repeat; hard to do using BLAS3 at all, let alone hit the lower bound
  – Use tournament pivoting: each round of the tournament selects the best b columns from two groups of b columns, either using the usual approach or something better (Gu/Eisenstat)
  – Thm: this approach "reveals the rank" of A, in the sense that the leading r x r submatrix of R has singular values "near" the largest r singular values of A; ditto for the trailing submatrix
  – Idea extends to other pivoting schemes: Cholesky with diagonal pivoting, LU with complete pivoting, LDLᵀ with complete pivoting

50

Outline
• Review / extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods: Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism / nonassociativity

What about sparse matrices? (1/3)

• If the matrix quickly becomes dense, use a dense algorithm
• Ex: All-Pairs Shortest Paths using Floyd-Warshall
• Similar to matmul: let D = A, then
    for k = 1:n, for i = 1:n, for j = 1:n
      D(i,j) = min( D(i,j), D(i,k) + D(k,j) )
• But can't reorder the outer loop for 2.5D; need another idea
• Abbreviate D(i,j) = min( D(i,j), min_k( A(i,k) + B(k,j) ) ) by D = A⊗B
  – Dependencies ok, 2.5D works, just a different semiring
• Kleene's Algorithm:
    D = DC-APSP(A, n):
      D = A
      Partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
      D11 = DC-APSP(D11, n/2)
      D12 = D11 ⊗ D12
      D21 = D21 ⊗ D11
      D22 = D21 ⊗ D12
      D22 = DC-APSP(D22, n/2)
      D21 = D22 ⊗ D21
      D12 = D12 ⊗ D22
      D11 = D12 ⊗ D21

52
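A direct (and deliberately naive, dense-numpy) transcription of that recursion, with each ⊗ implemented as "take the min with the existing entry of the (min,+) product", checked against Floyd-Warshall on a small random graph:

```python
import numpy as np

def minplus(A, B):
    """(min,+) matrix product: C[i,j] = min_k A[i,k] + B[k,j]."""
    return np.min(A[:, :, None] + B[None, :, :], axis=1)

def dc_apsp(D):
    """Kleene's divide-and-conquer APSP; D holds edge weights, np.inf = no edge."""
    D = np.array(D, dtype=float)
    n = D.shape[0]
    if n == 1:
        D[0, 0] = min(D[0, 0], 0.0)          # empty path
        return D
    m = n // 2
    i1, i2 = slice(0, m), slice(m, n)
    D[i1, i1] = dc_apsp(D[i1, i1])
    D[i1, i2] = np.minimum(D[i1, i2], minplus(D[i1, i1], D[i1, i2]))
    D[i2, i1] = np.minimum(D[i2, i1], minplus(D[i2, i1], D[i1, i1]))
    D[i2, i2] = np.minimum(D[i2, i2], minplus(D[i2, i1], D[i1, i2]))
    D[i2, i2] = dc_apsp(D[i2, i2])
    D[i2, i1] = np.minimum(D[i2, i1], minplus(D[i2, i2], D[i2, i1]))
    D[i1, i2] = np.minimum(D[i1, i2], minplus(D[i1, i2], D[i2, i2]))
    D[i1, i1] = np.minimum(D[i1, i1], minplus(D[i1, i2], D[i2, i1]))
    return D

def floyd_warshall(D):
    D = np.array(D, dtype=float)
    np.fill_diagonal(D, np.minimum(np.diag(D), 0))
    for k in range(D.shape[0]):
        D = np.minimum(D, D[:, [k]] + D[[k], :])
    return D

n = 64
rng = np.random.default_rng(0)
W = rng.uniform(1, 10, (n, n))
W[rng.random((n, n)) < 0.7] = np.inf        # sparse-ish graph
np.fill_diagonal(W, 0)
assert np.allclose(dc_apsp(W), floyd_warshall(W))
```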

Performance of 25D APSP using Kleene

53

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)

6.2x speedup

2x speedup

What about sparse matrices? (2/3)

• If parts of the matrix become dense, optimize those
• Ex: Cholesky on a matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): if all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on a w x w submatrix
• Thm: Words_moved = Ω(w³/M^(1/2)), etc.
• Thm (George '73): nested dissection gives an optimal ordering for 2D grids, 3D grids, similar matrices
  – w = n for a 2D n x n grid, w = n² for a 3D n x n x n grid
• Sequential multifrontal Cholesky attains the bounds
• PSPACES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)

54

What about sparse matrices? (3/3)

• If the matrix stays very sparse, the lower bound is unattainable; need a new one
• Ex: A, B both diagonal: no communication in the parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), i.i.d.
• Assumption: the algorithm is sparsity-independent: the assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: a parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
    Words_moved = Ω( min( d·n/P^(1/2), d²·n/P ) )
  – Proof exploits the fact that reuse of entries of C = A·B is unlikely
• Contrast the general lower bound: Words_moved = Ω( d²·n/(P·M^(1/2)) )
• Attained by a divide-and-conquer algorithm that splits the matrices along the dimensions most likely to minimize cost

55

Outline
• Review / extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods: Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism / nonassociativity

Symmetric Eigenproblem and SVD

• Usual approach for A = Aᵀ (SVD similar):
  – A → QᵀAQ = T, where Q orthogonal, T tridiagonal
  – T → UᵀTU = Λ, where U orthogonal, Λ diagonal
  – (QU)'s columns are eigenvectors, Λ the eigenvalues
  – Dense → Tridiagonal → Diagonal
  – Only half BLAS3, half BLAS2, in LAPACK's sytrd
• Communication-avoiding approach:
  – A → QAQᵀ = B, where B = Bᵀ banded, of bandwidth M^(1/2)
  – Continue as above, starting with B
  – Dense → Banded → Tridiagonal → Diagonal
  – Dense → Banded: use TSQR to zero out M^(1/2) cols/rows at a time
  – Banded → Tridiagonal: needs a new(ish) idea

Successive Band Reduction (Bischof/Lang/Sun)

(Sequence of figures: b = bandwidth, c = #columns, d = #diagonals, constraint c + d ≤ b.
Orthogonal transformations Q1, Q1ᵀ, Q2, Q2ᵀ, ..., Q5, Q5ᵀ are applied to a symmetric band of width b+1; each annihilates d diagonals from c columns at a time, creating a (d+c)-sized bulge that is chased down the band in steps 1, 2, ..., 6.)

Conventional vs CA-SBR

Conventional: touch all data 4 times.  Communication-avoiding: touch all data once.

Speedups of Sym. Band Reduction vs DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0 (n=12000, b=500, 8 threads)
• Up to 12x on Intel Westmere, vs MKL 10.3 (n=12000, b=200, 10 threads)
• Up to 25x on AMD Budapest, vs ACML 4.4 (n=9000, b=500, 4 threads)
• Up to 30x on AMD Magny-Cours, vs ACML 4.4 (n=12000, b=500, 6 threads)
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem

• No apparent way to modify the standard algorithm
• Instead: Spectral Divide-and-Conquer
  – Find an orthogonal matrix Q whose leading columns span an invariant subspace of A
  – QᵀAQ will be block upper triangular
  – Apply recursively to A11, A22
  – Depends on randomization:
    1. Randomized Rank Revealing QR decomposition
    2. Randomized location to try splitting the spectrum
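In matrix form, the target of one divide-and-conquer step is (writing E for the (2,1) block that the randomized construction drives to negligible size):

```latex
Q^T A Q \;=\; \begin{pmatrix} A_{11} & A_{12} \\ E & A_{22} \end{pmatrix},
\qquad \|E\| = O(\varepsilon)\,\|A\|,
```

so, up to a small backward error, the spectrum of A splits into that of A11 and A22, and each diagonal block is handled recursively.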

Attaining the Lower Bounds: Sequential
Legend: [Existing] [Ours] [Math-Lib] [Random]
Columns: Two Levels (Words, Messages) | Memory Hierarchy (Words, Messages)

BLAS-3: [FLPR'99][BDLST'13][MKL etc.] | [FLPR'99][BDLST'13][MKL etc.]
Cholesky: [G'97][AP'00][LAPACK][BDHS'09] | [G'97][AP'00][BDHS'09] | [G'97][AP'00][BDHS'09]
Sym. Indefinite: [BBDDDPSTY'13] | [BBDDDPSTY'13]
LU: [G'97][T'97][GDX'11][BDLST'13] | [GDX'11][BDLST'13] | [G'97][T'97][BDLST'13] | [BDLST'13]
QR: [EG'98][FW'03][DGHL'12][BDLST'13] | [FW'03][DGHL'12][BDLST'13] | [EG'98][FW'03][BDLST'13] | [FW'03][BDLST'13]
Rank Revealing QR: [BDD'11][DGGX'13]
Sym. Eig & SVD: [BDD'11][BDK'13] | [BDD'11]
Non-Sym. Eig: [BDD'11] | [BDD'11]

Attaining the Lower Bounds: Parallel 2D, M = Θ(n²/P)
(Ignoring poly-log(P) factors; words = Ω(n²/P^(1/2)), messages = Ω(P^(1/2)).)
Legend: [Existing] [Ours] [Math-Lib] [Random]
Columns: Words (BW), Messages (L); last column is the saving factor attainable with extra memory (2.5D, M = Θ(c·n²/P))

BLAS-3: [AGZ'94][MT'99][ScaLAPACK] | [C'69][vGW'97][SD'11] | L: n/P^(1/2)
Cholesky: [ScaLAPACK][T'99][SD'11] | L: n/P^(1/2)
Sym. Indefinite: [BBDDDPSTY'13][ScaLAPACK] | [BBDDDPSTY'13] | L: n/P^(1/2)
LU: [ScaLAPACK][GDX'11][T'99][SD'11] | [GDX'11][T'99][SD'11] | L: n/P^(1/2)
QR: [ScaLAPACK][DGHL'12][T'99] | [DGHL'12][T'99] | L: n/P^(1/2)
Rank Revealing QR: [BDD'11][DGGX'13]
Sym. Eig & SVD: [BDD'11][BDK'13][ScaLAPACK] | [BDD'11][BDK'13] | L: n/P^(1/2)
Non-Sym. Eig: [BDD'11] | [BDD'11] | BW: P^(1/2), L: n

Page 32: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

One-sided Factorizations (LU QR) so farbull Classical Approach for i=1 to n update column i update trailing matrixbull words_moved = O(n3)

35

bull Blocked Approach (LAPACK) for i=1 to nb update block i of b columns update trailing matrixbull words moved = O(n3M13)

bull Recursive Approach func factor(A) if A has 1 column update it

else factor(left half of A) update right half of A factor(right half of A)bull words moved = O(n3M12)

bull None of these approaches minimizes messagesbull Parallel case Partial

Pivoting =gt n reductionsbull Need another idea

TSQR An Architecture-Dependent Algorithm

W =

W0

W1

W2

W3

R00

R10

R20

R30

R01

R11

R02Parallel

W =

W0

W1

W2

W3

R01R02

R00

R03

SequentialStreaming

W =

W0

W1

W2

W3

R00

R01

R01

R11

R02

R11

R03

Dual Core

Can choose reduction tree dynamically

Multicore Multisocket Multirack Multisite Out-of-core

Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo

Wnxb =

W1

W2

W3

W4

P1middotL1middotU1

P2middotL2middotU2

P3middotL3middotU3

P4middotL4middotU4

=

Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo

W1rsquoW2rsquoW3rsquoW4rsquo

P12middotL12middotU12

P34middotL34middotU34

=Choose b pivot rows call them W12rsquo

Choose b pivot rows call them W34rsquo

W12rsquoW34rsquo

= P1234middotL1234middotU1234

Choose b pivot rows

Go back to W and use these b pivot rows (move them to top do LU without pivoting)

37

Minimizing Communication in TSLU

W = W1

W2

W3

W4

LULULULU

LU

LULUParallel

W = W1

W2

W3

W4

LULU

LU

LUSequentialStreaming

W = W1

W2

W3

W4

LULU LU

LULU

LULU

Dual Core

Can choose reduction tree dynamically to match architecture as before

38

Making TSLU Numerically Stable

bull Details matterndash Going up the tree we could do LU either on original rows of A

(tournament pivoting) or computed rows of Undash Only tournament pivoting stable

bull ldquoThmrdquo New scheme as stable as Partial Pivoting (GEPP) in following sense Get same Schur complements as GEPP applied to different input matrix whose entries are blocks taken from input A

bull Why just a ldquoThmrdquo

39

Stability of LU using TSLU CALU

Summer School Lecture 4 40

bull Empirical testingndash Both random matrices and ldquospecial onesrdquondash Both binary tree (BCALU) and flat-tree (FCALU)ndash 3 metrics ||PA-LU||||A|| normwise and componentwise backward errorsndash See [D Grigori Xiang 2010] for details

Why is stability of TSLU just a ldquoThmrdquo

bull Proof is correct ndash in exact arithmeticbull Experiment

ndash Generate 100 random 6x6 rank 3 matrices in Matlabndash [LUP] = lu(A) do LU without pivoting on PA compare L factors are

they the samebull Compute || L ndash Lnp || A few 0rsquos A few infinrsquos a few NaNsbull Rest mostly O(1)

ndash Why Floating point is nonassociative doing arithmetic in different order gives different rounding errors

ndash Same experiment with rank 6 matrices || L ndash Lnp || usually nonzero O(macheps)

ndash Same experiment with 20x20 rank 4 matrices || L ndash Lnp || often O(103)

bull Much harder to break TSLU but possiblendash Occurred when using TSLU to factorize a low-rank subdiagonal

panel in symmetric-indefinite factorization41

Fixing TSLU

bull Run TSLU quickly test for stability fix if necessary (rare)

bull Test conditioning of U if not tiny (usual case) proceed elsebull Compute || L || if not big (usual case) proceed elsebull Factor A = QR using TSQR thenbull Factor Q = PLU using TSLU thenbull A = PL(UR) with UR as upper triangular factor

bull Last topic in lecture how to guarantee floating point reproducibility

42

2D CALU with Tournament Pivoting

43

25D CALU with Tournament Pivoting (c=4 copies)

44

Exascale Machine ParametersSource DOE Exascale Workshop

bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log

2 (

n2p

) =

log

2 (m

emo

ry_p

er_p

roc)

Up to 29x

25D vs 2D LUWith and Without Pivoting

Other CA algorithms for Ax=b least squares(13)

bull A symmetric and indefinitendash Seek factorization that retains symmetry PAPT = LDLT D

ldquosimplerdquobull Save frac12 flops preserve inertia

ndash Usual approach Bunch-Kaufmanbull D block diagonal with 1x1 and 2x2 blocksbull Pivot search down column along row (lots of communication)

ndash Alternative Aasenbull D = tridiagonal = Tbull Two steps

ndash PAPT = LTLT where T is banded using TSLU

48

0 0

0

0 0

0

0

hellip

hellip

ndash Solvefactor narrow band problem with Tbull Up to 28x faster than MKL Best Paper at IPDPSrsquo13

Other CA algorithms for Ax=b least squares (23)bull Minimizing bandwidth and latency for sequential GEPP

ndash So far could not do partial pivoting and minimize messages just words

ndash Challengebull Column layout good for choosing pivots bad for matmulbull Blocked layout good for matmul bad for choosing pivots

ndash Solution use both layouts switching between thembull ldquoShape Morphing LUrdquo or SMLU

49

bull func factor(A) if A has 1 column update it else factor(left half of A)

update right half of A

factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M)

bull func factor(A) if A has 1 column update it else factor(left half of A) reshape to recursive block format update right half of A reshape to columnwise format factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M32)

Other CA algorithms for Ax=b least squares (33)bull Need for pivoting arises beyond LU in QR

ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)

ndash Usual approach like Partial Pivoting

bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound

ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two

groups of b columns either using usual approach or something better (GuEisenstat)

bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix

ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 50

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What about sparse matrices (13)

bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then

bull But canrsquot reorder outer loop for 25D need another idea

bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring

bull Kleenersquos Algorithm

52

for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)

D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21

Performance of 25D APSP using Kleene

53

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)

62xspeedup

2x speedup

What about sparse matrices (23)

bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of

G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix

bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering

for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid

bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse

multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on

separators) 54

What about sparse matrices (33)

bull If matrix stays very sparse lower bound unattainable new one

bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of

data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)

bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)

Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely

bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices

along dimensions most likely to minimize cost

55

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Symmetric Eigenproblem and SVD

bull Usual approach for A=AT (SVD similar)ndash A QTAQ = T where Q orthogonal T tridiagonalndash T UTTU = Λ where U orthogonal Λ diagonalndash QUrsquos columns are eigenvectors Λ eigenvaluesndash Dense Tridiagonal Diagonalndash Only half BLAS3 half BLAS2 in LAPACKrsquos sytrd

bull Communication-Avoiding Approachndash A QAQT = B where B=BT banded of bandwidth M12

ndash Continue as above starting with Bndash Dense Banded Tridiagonal Diagonalndash Dense Banded use TSQR to zero out M12 colsrows at a timendash Banded Tridiagonal need new(ish) idea

b+1

b+1

Successive Band Reduction (BischofLangSun)

1

b+1

b+1

d+1

c

Successive Band Reduction (BischofLangSun)

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1Q1

b+1

b+1

d+1

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

12

Q1

b+1

b+1

d+1

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

12

Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

Q1

Q1T

Q2

Q2T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

Q1

Q1T

Q2

Q2T

Q3

Q3T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

6

6

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

Conventional vs CA - SBR

Conventional Communication-Avoiding

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Nonsymmetric Eigenproblem

bull No apparent way to modify standard algorithmbull Instead Spectral Divide-and-Conquer

ndash Find orthogonal matrix Q whose leading columns span an invariant subspace of A

ndash QTAQ will be block upper triangular

ndash Apply recursively to A11 A22

ndash Depends on randomization1 Randomized Rank Revealing QR decomposition2 Randomized location to try splitting spectrum

A11 A12

ε A22

Attaining the Lower bounds SequentialLegend[Existing][Ours][Math-Lib][Random]

Two Levels Memory Hierarchy

Words Messages Words Messages

BLAS-3 [FLPRrsquo99][BDLSTrsquo13][MKL etc] [FLPRrsquo99][BDLSTrsquo13][MKL etc]

Cholesky[Grsquo97][APrsquo00]

[LAPACK][BDHSrsquo09]

[Grsquo97][APrsquo00][BDHSrsquo09] [Grsquo97][APrsquo00][BDHSrsquo09]

Sym Indefinite [BBDDDPSTYrsquo13] [BBDDDPSTYrsquo13]

LU[Grsquo97][Trsquo97]

[GDXrsquo11][BDLSTrsquo13]

[GDXrsquo11][BDLSTrsquo13]

[Grsquo97][Trsquo97] [BDLSTrsquo13] [BDLSTrsquo13]

QR[EGrsquo98][FWrsquo03]

[DGHLrsquo12][BDLSTrsquo13]

[FWrsquo03][DGHLrsquo12][BDLSTrsquo13]

[EGrsquo98][FWrsquo03][BDLSTrsquo13]

[FWrsquo03][BDLSTrsquo13]

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13] [BDDrsquo11]

Non Sym Eig [BDDrsquo11] [BDDrsquo11]

Attaining the Lower bounds: Parallel 2D, M = Ω(n²/P)
(Ignoring poly-log(P) factors; lower bounds: words = Ω(n²/P^1/2), messages = Ω(P^1/2))
Legend: [Existing] [Ours] [Math-Lib] [Random]
Columns: Words (BW), Messages (L) | Saving factor

• BLAS-3: [AGZ'94][MT'99][ScaLAPACK][C'69][vGW'97][SD'11] | L: n/P^1/2
• Cholesky: [ScaLAPACK][T'99][SD'11] | L: n/P^1/2
• Sym. Indefinite: [BBDDDPSTY'13][ScaLAPACK], [BBDDDPSTY'13] | L: n/P^1/2
• LU: [ScaLAPACK][GDX'11][T'99][SD'11], [GDX'11][T'99][SD'11] | L: n/P^1/2
• QR: [ScaLAPACK][DGHL'12][T'99], [DGHL'12][T'99] | L: n/P^1/2
• Rank-Revealing QR: [BDD'11][DGGX'13]
• Sym. Eig & SVD: [BDD'11][BDK'13][ScaLAPACK], [BDD'11][BDK'13] | L: n/P^1/2
• Non-Sym. Eig: [BDD'11], [BDD'11] | BW: P^1/2, L: n

Attaining with extra memory: 2.5D, M = Ω(c·n²/P)

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra

• k steps of iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and starting vector
  – Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation:
      Conventional: O(k) moves of data from slow to fast memory
      New: O(1) moves of data – optimal
  – Parallel implementation on p processors:
      Conventional: O(k log p) messages (k SpMV calls, dot prods)
      New: O(log p) messages – optimal
• Lots of speed up possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability

75
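The kernel behind the O(1) / O(log p) counts above is the computation of the Krylov basis [x, Ax, A²x, …, A^k x] in one shot. Below is a plain NumPy/SciPy sketch (hypothetical function name) that computes the same basis the straightforward way; a true communication-avoiding matrix powers kernel restructures this loop so that each cache block, or each processor's partition plus its k-deep ghost region, is read once for all k products rather than once per product.

import numpy as np
import scipy.sparse as sp

def krylov_basis(A, x, k):
    # V[:, j] = A^j x for j = 0..k (no orthogonalization -- see the
    # stability discussion later in the talk).
    V = np.empty((x.shape[0], k + 1))
    V[:, 0] = x
    for j in range(k):
        V[:, j + 1] = A @ V[:, j]   # k SpMVs; the CA kernel fuses their data movement
    return V

# Tiny usage example on a 1D Poisson matrix:
n, k = 1000, 8
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")
V = krylov_basis(A, np.ones(n), k)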

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example The Difficulty of Tuning SpMV

• n = 21,200
• nnz = 1.5 M

• Source: NASA structural analysis problem (raefsky)

77

Example The Difficulty of Tuning

• n = 21,200
• nnz = 1.5 M

• Source: NASA structural analysis problem (raefsky)

• 8x8 dense substructure: exploit this to limit #mem_refs

78
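One way to see what "exploit the 8x8 dense substructure" means in code: store the matrix in block CSR (BCSR), so one column index is kept per 8x8 block instead of per nonzero, and the inner multiply becomes a small dense kernel that a compiler or autotuner can unroll. The sketch below uses SciPy's BSR format on a random matrix as a stand-in (the raefsky matrix itself is not included here); it is a readability sketch, not a tuned kernel.

import numpy as np
import scipy.sparse as sp

def bcsr_spmv(A_bsr, x):
    # y = A @ x with A stored as r-by-c blocks: index storage is amortized
    # over r*c values, and the innermost product is a dense block multiply.
    r, c = A_bsr.blocksize
    y = np.zeros(A_bsr.shape[0])
    indptr, indices, blocks = A_bsr.indptr, A_bsr.indices, A_bsr.data
    for bi in range(len(indptr) - 1):                 # block rows
        for t in range(indptr[bi], indptr[bi + 1]):   # blocks in this row
            bj = indices[t]
            y[bi*r:(bi+1)*r] += blocks[t] @ x[bj*c:(bj+1)*c]  # unrollable r x c kernel
    return y

# Usage sketch:
A = sp.random(512, 512, density=0.05, format="csr", random_state=0)
A_bsr = A.tobsr(blocksize=(8, 8))   # explicit zeros are filled in where blocks are ragged
x = np.random.rand(512)
assert np.allclose(bcsr_spmv(A_bsr, x), A @ x)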

Speedups on Itanium 2: The Need for Search

[Figure: register-blocking performance profile on Itanium 2, in Mflops, with the reference (unblocked) implementation and the best block size (4x2) marked.]

79

Register Profile: Itanium 2

[Figure: performance of all register block sizes, ranging from 190 Mflops to 1190 Mflops.]

80

Register Profiles: IBM and Intel IA-64

[Figure: four register-profile heat maps. Power3: 122 to 252 Mflops; Power4: 459 to 820 Mflops; Itanium 1: 107 to 247 Mflops; Itanium 2: 190 Mflops to 1.2 Gflops. Panel labels in the source: Power3 - 17, Power4 - 16, Itanium 1 - 8, Itanium 2 - 33.]

Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)

• More complicated non-zero structure in general

• N = 16,614
• NNZ = 1.1 M

82

Zoom in to top corner

• More complicated non-zero structure in general

• N = 16,614
• NNZ = 1.1 M

83

3x3 blocks look natural, but…

• Example: 3x3 blocking
  – Logical grid of 3x3 cells

• But would lead to lots of "fill-in"

84

Extra Work Can Improve Efficiency

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5

• On Pentium III: 1.5x speedup!
  – Actual Mflop rate is 1.5² = 2.25x higher (the blocked code does 1.5x more flops, so the net speedup is 2.25 / 1.5 = 1.5x)

85
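The tradeoff on this slide can be estimated before committing to a block size: fill ratio = (entries stored after r×c blocking, explicit zeros included) / (true nonzeros), and the net speedup is roughly (raw Mflop-rate gain of the blocked kernel) / (fill ratio). A small SciPy-based sketch (hypothetical helper name) that measures the fill ratio of a candidate blocking:

import scipy.sparse as sp

def fill_ratio(A, r, c):
    # Entries stored after r x c blocking (each block containing any nonzero
    # stores all r*c values) divided by the true number of nonzeros.
    A = sp.coo_matrix(A)
    blocks = set(zip(A.row // r, A.col // c))   # distinct nonempty blocks
    return len(blocks) * r * c / A.nnz

# Usage: blocking is profitable roughly when
#   (blocked kernel Mflop rate / CSR Mflop rate) > fill_ratio(A, r, c);
# on this slide, 2.25 > 1.5, giving the observed net 1.5x speedup.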

Source: Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal

87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

Before: Green + Red
After: Green + Blue

89

2x speedups on Pentium 4, Power 4, …

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…

• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR

• Higher-level kernels
  – A·A^T·x, A^T·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later …

90

Optimized Sparse Kernel Interface - OSKI

• Provides sparse kernels automatically tuned for user's matrix & machine
  – BLAS-style functionality: SpMV (Ax & A^T·y), TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning

• For "advanced" users & solver library writers
  – Available as stand-alone library
  – Available as PETSc extension
  – bebop.cs.berkeley.edu/oski

• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

91

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

93

Example Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in each iteration

via CA Matrix Powers Kernel

Global reduction to compute G

94
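For reference, here is a plain NumPy version of classical CG with the communication points of the slide marked in comments: one SpMV (reading A from slow memory, or a neighbor exchange in parallel) and two dot products (global reductions) per iteration. CA-CG, shown next, reorganizes s of these iterations so the SpMVs become one matrix powers kernel call and the dot products one block reduction.

import numpy as np

def cg(A, b, x0=None, tol=1e-8, maxiter=1000):
    x = np.zeros_like(b) if x0 is None else x0.copy()
    r = b - A @ x                  # SpMV: communication
    p = r.copy()
    rs = r @ r                     # dot product: global reduction
    bnorm = np.sqrt(b @ b)
    for _ in range(maxiter):
        Ap = A @ p                 # SpMV: communication, every iteration
        alpha = rs / (p @ Ap)      # dot product: global reduction
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r             # dot product: global reduction
        if np.sqrt(rs_new) <= tol * bnorm:
            break
        p = r + (rs_new / rs) * p  # vector update: no communication
        rs = rs_new
    return x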

Example CA-Conjugate Gradient

Local computations within inner loop require no communication

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

96

[Convergence plot: CG vs CA-CG with the monomial basis. Model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400. CA-CG (monomial) shows slower convergence and loss of accuracy due to roundoff, relative to machine precision; at s = 16 the monomial basis is rank deficient and the method breaks down.]

97
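The breakdown can be reproduced directly: the columns x, Ax, A²x, … of the monomial basis align as s grows, so the basis matrix's condition number climbs rapidly. The small demo below rebuilds the slide's model problem (2D Poisson, 5-point stencil, 30x30 grid) from its description, an assumption for illustration, and prints the condition number of the column-scaled basis as s grows; swapping in a Newton or Chebyshev basis is the standard fix.

import numpy as np
import scipy.sparse as sp

def poisson2d(m):
    # 5-point stencil on an m x m grid (the slide's model problem)
    T = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(m, m))
    I = sp.identity(m)
    return (sp.kron(I, T) + sp.kron(T, I)).tocsr()

A = poisson2d(30)
rng = np.random.default_rng(0)
v = rng.standard_normal(A.shape[0])
V = [v / np.linalg.norm(v)]
for s in range(1, 17):
    w = A @ V[-1]
    V.append(w / np.linalg.norm(w))        # scaled, but not orthogonalized
    print(s, np.linalg.cond(np.column_stack(V)))
# The condition number climbs toward 1/macheps; by s around 16 the basis is
# numerically rank deficient, consistent with the breakdown shown above.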

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?
• Requires o(n²) data/indices to store
• Nonzero entries and indices could be explicit or implicit

• Matrix could be sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + U·D·V^T, D small & square

• Semiseparable matrices arise as preconditioners
  – Need to write A^k = (S + U·D·V^T)^k as a sum of S^k and low-rank matrices
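For instance, for k = 2 every term except S² carries a factor of U or V^T and therefore has rank at most the (small) number of columns of U, which is what lets a matrix powers kernel keep treating the S^j part as sparse and the remainder as small dense corrections:

\[
(S + U D V^T)^2 \;=\; S^2 \;+\; S\,U D V^T \;+\; U D V^T S \;+\; U\,(D V^T U D)\,V^T .
\]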

                              Indices explicit (O(nnz))   Indices implicit (o(nnz))
Nonzero entries explicit:     CSR and variations          Vision, climate, AMR, …
Nonzero entries implicit:     Graph Laplacian             Stencils

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

101

• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought reproducible parallel sparse linear equation solver, demanded by customers (construction engineers), otherwise they don't believe results
  – Willing to sacrifice 40–50% of performance for it
• Email to ~110 Berkeley CSE faculty, asking about it
  – Most: "What? How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Reproducible Floating Point Computation

Intel MKL non-reproducibility

[Figure: absolute error for random vectors (same magnitude, opposite signs) and relative error for orthogonal vectors (sign not reproducible).]

Vector size: 1e6. Data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum – minimum
• Relative error = Absolute error / maximum absolute value

103
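The root cause in two lines: floating-point addition is not associative, so a different reduction order (for example, a different number of threads) commits different rounding errors. A minimal Python illustration, not the MKL experiment itself:

a, b, c = 0.1, 0.2, 0.3
print((a + b) + c)     # 0.6000000000000001
print(a + (b + c))     # 0.6
# A dot product split across 1, 2, 3, or 4 threads is just a different
# parenthesization of the same sum, hence the differences measured above.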

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (not 2 or 3)
  – Use (very) high precision to get exact answer (not 2)
  – Prerounding technique (Nguyen, D.)

104
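Here is a minimal sketch of the prerounding idea, reduced to a single bin (the actual algorithm of Nguyen and Demmel uses a few bins to preserve accuracy and folds the extra reduction into the existing one): every summand is first rounded onto a grid determined by a power-of-two "boundary" chosen from the global maximum, after which all additions are exact and the result is independent of summation order. This illustrates the mechanism, not the production algorithm.

import math

def reproducible_sum(x):
    # Pass 1: one extra reduction to find max |x_i| (part of the slowdown
    # reported on the next line).
    m = max(abs(v) for v in x)
    if m == 0.0:
        return 0.0
    n = len(x)
    # Power-of-two boundary with headroom (boundary > 2*n*m), so every
    # prerounded value and every partial sum is a multiple of ulp(boundary)
    # that stays exactly representable.
    boundary = math.ldexp(1.0, math.frexp(n * m)[1] + 1)
    total = 0.0
    for v in x:
        q = (boundary + v) - boundary   # v rounded onto the boundary's ulp grid
        total += q                      # exact addition => order-independent
    return total

# The discarded low-order parts (v - q) are what the multi-bin version keeps
# in order to control accuracy.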

Performance results on 1024 processors of a Cray XC30: 1.2x to 3.2x slowdown vs fastest code, for n = 1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Donrsquot Communichellip

106

Time to redesign all linear algebra, n-body, … algorithms and software

(and compilers)




  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all ldquon3-likerdquo linear algebra
  • Lower bound for all ldquon3-likerdquo linear algebra (2)
  • Lower bound for all ldquon3-likerdquo linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a ldquoThmrdquo
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a ldquosparse matrixrdquo
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary
Page 34: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

TSQR An Architecture-Dependent Algorithm

W =

W0

W1

W2

W3

R00

R10

R20

R30

R01

R11

R02Parallel

W =

W0

W1

W2

W3

R01R02

R00

R03

SequentialStreaming

W =

W0

W1

W2

W3

R00

R01

R01

R11

R02

R11

R03

Dual Core

Can choose reduction tree dynamically

Multicore Multisocket Multirack Multisite Out-of-core

Back to LU Using similar idea for TSLU as TSQR Use reduction tree to do ldquoTournament Pivotingrdquo

Wnxb =

W1

W2

W3

W4

P1middotL1middotU1

P2middotL2middotU2

P3middotL3middotU3

P4middotL4middotU4

=

Choose b pivot rows of W1 call them W1rsquoChoose b pivot rows of W2 call them W2rsquoChoose b pivot rows of W3 call them W3rsquoChoose b pivot rows of W4 call them W4rsquo

W1rsquoW2rsquoW3rsquoW4rsquo

P12middotL12middotU12

P34middotL34middotU34

=Choose b pivot rows call them W12rsquo

Choose b pivot rows call them W34rsquo

W12rsquoW34rsquo

= P1234middotL1234middotU1234

Choose b pivot rows

Go back to W and use these b pivot rows (move them to top do LU without pivoting)

37

Minimizing Communication in TSLU

W = W1

W2

W3

W4

LULULULU

LU

LULUParallel

W = W1

W2

W3

W4

LULU

LU

LUSequentialStreaming

W = W1

W2

W3

W4

LULU LU

LULU

LULU

Dual Core

Can choose reduction tree dynamically to match architecture as before

38

Making TSLU Numerically Stable

bull Details matterndash Going up the tree we could do LU either on original rows of A

(tournament pivoting) or computed rows of Undash Only tournament pivoting stable

bull ldquoThmrdquo New scheme as stable as Partial Pivoting (GEPP) in following sense Get same Schur complements as GEPP applied to different input matrix whose entries are blocks taken from input A

bull Why just a ldquoThmrdquo

39

Stability of LU using TSLU CALU

Summer School Lecture 4 40

bull Empirical testingndash Both random matrices and ldquospecial onesrdquondash Both binary tree (BCALU) and flat-tree (FCALU)ndash 3 metrics ||PA-LU||||A|| normwise and componentwise backward errorsndash See [D Grigori Xiang 2010] for details

Why is stability of TSLU just a ldquoThmrdquo

bull Proof is correct ndash in exact arithmeticbull Experiment

ndash Generate 100 random 6x6 rank 3 matrices in Matlabndash [LUP] = lu(A) do LU without pivoting on PA compare L factors are

they the samebull Compute || L ndash Lnp || A few 0rsquos A few infinrsquos a few NaNsbull Rest mostly O(1)

ndash Why Floating point is nonassociative doing arithmetic in different order gives different rounding errors

ndash Same experiment with rank 6 matrices || L ndash Lnp || usually nonzero O(macheps)

ndash Same experiment with 20x20 rank 4 matrices || L ndash Lnp || often O(103)

bull Much harder to break TSLU but possiblendash Occurred when using TSLU to factorize a low-rank subdiagonal

panel in symmetric-indefinite factorization41

Fixing TSLU

bull Run TSLU quickly test for stability fix if necessary (rare)

bull Test conditioning of U if not tiny (usual case) proceed elsebull Compute || L || if not big (usual case) proceed elsebull Factor A = QR using TSQR thenbull Factor Q = PLU using TSLU thenbull A = PL(UR) with UR as upper triangular factor

bull Last topic in lecture how to guarantee floating point reproducibility

42

2D CALU with Tournament Pivoting

43

25D CALU with Tournament Pivoting (c=4 copies)

44

Exascale Machine ParametersSource DOE Exascale Workshop

bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log

2 (

n2p

) =

log

2 (m

emo

ry_p

er_p

roc)

Up to 29x

25D vs 2D LUWith and Without Pivoting

Other CA algorithms for Ax=b least squares(13)

bull A symmetric and indefinitendash Seek factorization that retains symmetry PAPT = LDLT D

ldquosimplerdquobull Save frac12 flops preserve inertia

ndash Usual approach Bunch-Kaufmanbull D block diagonal with 1x1 and 2x2 blocksbull Pivot search down column along row (lots of communication)

ndash Alternative Aasenbull D = tridiagonal = Tbull Two steps

ndash PAPT = LTLT where T is banded using TSLU

48

0 0

0

0 0

0

0

hellip

hellip

ndash Solvefactor narrow band problem with Tbull Up to 28x faster than MKL Best Paper at IPDPSrsquo13

Other CA algorithms for Ax=b least squares (23)bull Minimizing bandwidth and latency for sequential GEPP

ndash So far could not do partial pivoting and minimize messages just words

ndash Challengebull Column layout good for choosing pivots bad for matmulbull Blocked layout good for matmul bad for choosing pivots

ndash Solution use both layouts switching between thembull ldquoShape Morphing LUrdquo or SMLU

49

bull func factor(A) if A has 1 column update it else factor(left half of A)

update right half of A

factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M)

bull func factor(A) if A has 1 column update it else factor(left half of A) reshape to recursive block format update right half of A reshape to columnwise format factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M32)

Other CA algorithms for Ax=b least squares (33)bull Need for pivoting arises beyond LU in QR

ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)

ndash Usual approach like Partial Pivoting

bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound

ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two

groups of b columns either using usual approach or something better (GuEisenstat)

bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix

ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 50

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What about sparse matrices (13)

bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then

bull But canrsquot reorder outer loop for 25D need another idea

bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring

bull Kleenersquos Algorithm

52

for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)

D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21

Performance of 25D APSP using Kleene

53

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)

62xspeedup

2x speedup

What about sparse matrices (23)

bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of

G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix

bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering

for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid

bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse

multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on

separators) 54

What about sparse matrices (33)

bull If matrix stays very sparse lower bound unattainable new one

bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of

data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)

bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)

Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely

bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices

along dimensions most likely to minimize cost

55

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Symmetric Eigenproblem and SVD

bull Usual approach for A=AT (SVD similar)ndash A QTAQ = T where Q orthogonal T tridiagonalndash T UTTU = Λ where U orthogonal Λ diagonalndash QUrsquos columns are eigenvectors Λ eigenvaluesndash Dense Tridiagonal Diagonalndash Only half BLAS3 half BLAS2 in LAPACKrsquos sytrd

bull Communication-Avoiding Approachndash A QAQT = B where B=BT banded of bandwidth M12

ndash Continue as above starting with Bndash Dense Banded Tridiagonal Diagonalndash Dense Banded use TSQR to zero out M12 colsrows at a timendash Banded Tridiagonal need new(ish) idea

b+1

b+1

Successive Band Reduction (BischofLangSun)

1

b+1

b+1

d+1

c

Successive Band Reduction (BischofLangSun)

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1Q1

b+1

b+1

d+1

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

12

Q1

b+1

b+1

d+1

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

12

Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

Q1

Q1T

Q2

Q2T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

Q1

Q1T

Q2

Q2T

Q3

Q3T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

6

6

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

Conventional vs CA - SBR

Conventional Communication-Avoiding

Touch all data 4 times Touch all data once

>
>

Speedups of Sym. Band Reduction vs DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0 – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3 – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4 – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4 – n=12000, b=500, 6 threads

• Neither MKL nor ACML benefits from multithreading in DSBTRD
– Best sequential speedup vs MKL: 1.9x
– Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem

• No apparent way to modify the standard algorithm
• Instead: Spectral Divide-and-Conquer
– Find an orthogonal matrix Q whose leading columns span an invariant subspace of A
– Q^T A Q will then be block upper triangular:

      [ A11  A12 ]
      [  ε   A22 ]

– Apply recursively to A11, A22
– Depends on randomization:
  1. Randomized Rank-Revealing QR decomposition
  2. Randomized location to try splitting the spectrum
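A hedged sketch of the splitting step, not the randomized algorithm on the slide: this uses the classical matrix-sign-function variant of spectral divide-and-conquer (Newton iteration for sign(A), then rank-revealing QR of the spectral projector) to show that Q^T A Q becomes block upper triangular. The test matrix is constructed so its spectrum is well separated from the imaginary axis, which this variant requires.

import numpy as np
from scipy.linalg import qr

def matrix_sign(A, iters=60):
    # Newton iteration X <- (X + inv(X))/2 converges to sign(A)
    # when no eigenvalue lies on the imaginary axis.
    X = A.copy()
    for _ in range(iters):
        X = 0.5 * (X + np.linalg.inv(X))
    return X

rng = np.random.default_rng(1)
A1 = rng.standard_normal((4, 4)) + 5 * np.eye(4)    # eigenvalues with Re > 0
A2 = rng.standard_normal((4, 4)) - 5 * np.eye(4)    # eigenvalues with Re < 0
Vmix = rng.standard_normal((8, 8))
A = Vmix @ np.block([[A1, np.zeros((4, 4))],
                     [np.zeros((4, 4)), A2]]) @ np.linalg.inv(Vmix)

P = 0.5 * (matrix_sign(A) + np.eye(8))              # spectral projector (rank 4)
Q, _, _ = qr(P, pivoting=True)                      # RRQR: leading columns span range(P)
B = Q.T @ A @ Q
k = int(round(np.trace(P)))                         # dimension of the invariant subspace
print("norm of (2,1) block:", np.linalg.norm(B[k:, :k]))   # ~ 0: block upper triangular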

Attaining the Lower Bounds: Sequential
Legend: [Existing] [Ours] [Math-Lib] [Random]
Columns: Two Levels (Words, Messages) | Memory Hierarchy (Words, Messages)

BLAS-3 [FLPRrsquo99][BDLSTrsquo13][MKL etc] [FLPRrsquo99][BDLSTrsquo13][MKL etc]

Cholesky[Grsquo97][APrsquo00]

[LAPACK][BDHSrsquo09]

[Grsquo97][APrsquo00][BDHSrsquo09] [Grsquo97][APrsquo00][BDHSrsquo09]

Sym Indefinite [BBDDDPSTYrsquo13] [BBDDDPSTYrsquo13]

LU[Grsquo97][Trsquo97]

[GDXrsquo11][BDLSTrsquo13]

[GDXrsquo11][BDLSTrsquo13]

[Grsquo97][Trsquo97] [BDLSTrsquo13] [BDLSTrsquo13]

QR[EGrsquo98][FWrsquo03]

[DGHLrsquo12][BDLSTrsquo13]

[FWrsquo03][DGHLrsquo12][BDLSTrsquo13]

[EGrsquo98][FWrsquo03][BDLSTrsquo13]

[FWrsquo03][BDLSTrsquo13]

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13] [BDDrsquo11]

Non Sym Eig [BDDrsquo11] [BDDrsquo11]

Attaining the Lower Bounds: Parallel 2D, M = Θ(n^2/P)
(Ignoring poly-log(P) factors; words = Ω(n^2/P^{1/2}), messages = Ω(P^{1/2}).)
Legend: [Existing] [Ours] [Math-Lib] [Random]
Columns: Words (BW) | Messages (L) | Saving factor

BLAS-3: [AGZ'94][MT'99][ScaLAPACK][C'69][vGW'97][SD'11] | L: n/P^{1/2}
Cholesky: [ScaLAPACK][T'99][SD'11] | L: n/P^{1/2}
Sym Indefinite: [BBDDDPSTY'13][ScaLAPACK] | [BBDDDPSTY'13] | L: n/P^{1/2}
LU: [ScaLAPACK][GDX'11][T'99][SD'11] | [GDX'11][T'99][SD'11] | L: n/P^{1/2}
QR: [ScaLAPACK][DGHL'12] [T'99] | [DGHL'12][T'99] | L: n/P^{1/2}
Rank Revealing QR: [BDD'11][DGGX'13]
Sym Eig & SVD: [BDD'11][BDK'13][ScaLAPACK] | [BDD'11][BDK'13] | L: n/P^{1/2}
Non-Sym Eig: [BDD'11] | [BDD'11] | BW: P^{1/2}, L: n

Attaining with extra memory: 2.5D, M = Θ(c·n^2/P)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
– Does k SpMVs with A and the starting vector
– Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
– Assume the matrix is "well-partitioned"
– Serial implementation:
  • Conventional: O(k) moves of data from slow to fast memory
  • New: O(1) moves of data – optimal
– Parallel implementation on p processors:
  • Conventional: O(k log p) messages (k SpMV calls, dot products)
  • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
– Price: some redundant computation
– Challenges: poor partitioning, preconditioning, numerical stability
(A toy "matrix powers kernel" sketch follows below.)

75
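A minimal sketch of the serial/parallel communication trade-off just described, under simplifying assumptions: the matrix is the 1D Laplacian stencil (so "well-partitioned"), and each "processor" owns a contiguous block of rows. Fetching k ghost values per side once lets it compute its pieces of A x, A^2 x, …, A^k x with purely local (partly redundant) work — the O(1) vs O(k) data-movement idea.

import numpy as np

def spmv_1d_laplacian(x):
    # y = A x for the 1D Laplacian stencil y_i = 2 x_i - x_{i-1} - x_{i+1} (Dirichlet).
    y = 2 * x.copy()
    y[1:]  -= x[:-1]
    y[:-1] -= x[1:]
    return y

def matrix_powers_block(x, lo, hi, k, n):
    # Rows lo:hi of A^1 x, ..., A^k x from ONE fetch of k ghost cells per side.
    glo, ghi = max(0, lo - k), min(n, hi + k)      # the only "communication"
    w = x[glo:ghi].copy()
    out = []
    for _ in range(k):
        w = spmv_1d_laplacian(w)                   # local work only (some redundant)
        out.append(w[(lo - glo):(hi - glo)].copy())
    return out

n, k = 64, 4
x = np.random.default_rng(0).standard_normal(n)

# Reference: k global SpMVs.
ref, v = [], x.copy()
for _ in range(k):
    v = spmv_1d_laplacian(v)
    ref.append(v.copy())

# "Processor" owning rows 16:32 computes the same values from one ghosted copy of x.
loc = matrix_powers_block(x, 16, 32, k, n)
print(all(np.allclose(loc[j], ref[j][16:32]) for j in range(k)))   # True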

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Example: The Difficulty of Tuning SpMV

• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

77

Example: The Difficulty of Tuning

• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit mem_refs

78

Speedups on Itanium 2: The Need for Search
[Figure: SpMV performance for all register block sizes; reference ≈ 190 Mflops, best (4x2 blocking) ≈ 1190 Mflops]

Register Profile: Itanium 2
[Figure: heatmap of Mflop/s over register block sizes, from 190 Mflops up to 1190 Mflops]

Register Profiles: IBM and Intel IA-64 — Power3 (17% of peak), Power4 (16%), Itanium 1 (8%), Itanium 2 (33%)
[Figures: four register-profile heatmaps; best/reference Mflop/s quoted on the panels: 252/122 (Power3), 820/459 (Power4), 247/107 (Itanium 1), 1.2 Gflops/190 Mflops (Itanium 2)]

Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16,614
• NNZ = 1.1 M

82

Zoom in to top corner

• More complicated non-zero structure in general
• N = 16,614
• NNZ = 1.1 M

83

3x3 blocks look natural, but…

• Example: 3x3 blocking
– Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

84

Extra Work Can Improve Efficiency

• Example: 3x3 blocking
– Logical grid of 3x3 cells
– Fill in explicit zeros
– Unroll 3x3 block multiplies
– "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup!
– Actual Mflop rate is 1.5² = 2.25x higher
(A small register-blocking / fill-ratio sketch in SciPy follows below.)

85
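A hedged sketch of register blocking (BCSR) with explicit zero fill using SciPy's BSR format, on an assumed synthetic matrix rather than the raefsky/ex11 matrices above. It shows the fill-ratio bookkeeping; actual speedups depend on the kernel implementation and machine, which SciPy's generic kernels do not necessarily demonstrate.

import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
nb, r = 2000, 3
# Block-structured part: a random 3x3-blocked sparsity pattern ...
pattern = sp.random(nb, nb, density=0.002, random_state=0, format="coo")
A = sp.kron(pattern, np.ones((r, r)), format="csr")
# ... plus a few scattered entries that force explicit zero fill when blocking.
A = (A + sp.random(nb * r, nb * r, density=2e-5, random_state=1, format="csr")).tocsr()
A.data[:] = rng.standard_normal(A.nnz)

A_bsr = A.tobsr(blocksize=(r, r))
nblocks = A_bsr.data.shape[0]
print("fill ratio =", nblocks * r * r / A.nnz)      # > 1: explicit zeros were added

x = rng.standard_normal(A.shape[1])
print("same SpMV result:", np.allclose(A @ x, A_bsr @ x))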

Source Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal

87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

Before: Green + Red.  After: Green + Blue.
2x speedups on Pentium 4, Power 4, …

89

Summary of Other Performance Optimizations

• Optimizations for SpMV
– Register blocking (RB): up to 4x over CSR
– Reordering to create dense structure: 2x over CSR
– Variable block splitting: 2.1x over CSR, 1.8x over RB
– Diagonals: 2x over CSR
– Symmetry: 2.8x over CSR, 2.6x over RB
– Cache blocking: 2.8x over CSR
– Multiple vectors (SpMM): 7x over CSR
– And combinations…
• Sparse triangular solve
– Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
– A·A^T·x, A^T·A·x: 4x over CSR, 1.8x over RB
– More general kernels later …

90

Optimized Sparse Kernel Interface – OSKI

• Provides sparse kernels automatically tuned for the user's matrix & machine
– BLAS-style functionality: SpMV (A·x & A^T·y), TrSV
– Does both off-line and run-time tuning
– Hides the complexity of run-time tuning
• For "advanced" users & solver library writers
– Available as a stand-alone library
– Available as a PETSc extension
– bebop.cs.berkeley.edu/oski
• pOSKI
– Extension to multicore architectures
– OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
– bebop.cs.berkeley.edu/poski

91

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

93

Example: Classical Conjugate Gradient (CG)
[Algorithm figure: each iteration does SpMVs and dot products, and both require communication.]

94

Example: CA-Conjugate Gradient
[Algorithm figure: the k SpMVs are done via the CA matrix powers kernel, the dot products via one global reduction to compute the Gram matrix G; the local computations within the inner loop require no communication.]
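Since the CG figures above survive only as captions, here is a minimal textbook CG in Python with comments marking exactly where a distributed-memory run would communicate each iteration (one SpMV, two dot products). This is only the classical variant; the CA-CG reorganization replaces k SpMVs by one matrix-powers-kernel call and the dot products by one Gram-matrix reduction. The model problem is the 2D Poisson matrix used on the next slides.

import numpy as np
import scipy.sparse as sp

def cg(A, b, tol=1e-8, maxiter=500):
    x = np.zeros_like(b)
    r = b - A @ x                    # SpMV: halo exchange with neighbors
    p = r.copy()
    rs = r @ r                       # dot product: global reduction
    bnorm = np.linalg.norm(b)
    for _ in range(maxiter):
        Ap = A @ p                   # SpMV: halo exchange with neighbors
        alpha = rs / (p @ Ap)        # dot product: global reduction
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r               # dot product: global reduction
        if np.sqrt(rs_new) <= tol * bnorm:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# 2D Poisson, 5-point stencil, 30x30 grid.
m = 30
T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(m, m))
A = (sp.kron(sp.identity(m), T) + sp.kron(T, sp.identity(m))).tocsr()
b = np.ones(A.shape[0])
print(np.linalg.norm(A @ cg(A, b) - b))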

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

96

[Plot: convergence of CG vs CA-CG (monomial basis).
Model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400.
Annotations: "Slower convergence due to roundoff"; "Loss of accuracy due to roundoff"; "At s = 16 the monomial basis is rank deficient — the method breaks down"; reference line at machine precision.]

97
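A small experiment illustrating the breakdown shown in the plot above: build the monomial Krylov basis [x, Ax, …, A^s x] (columns normalized) for the same 2D Poisson model problem and watch its condition number explode as s grows, becoming numerically rank deficient around s ≈ 16. The starting vector is an assumed random vector.

import numpy as np
import scipy.sparse as sp

m = 30
T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(m, m))
A = (sp.kron(sp.identity(m), T) + sp.kron(T, sp.identity(m))).tocsr()

rng = np.random.default_rng(0)
v = rng.standard_normal(A.shape[0])

for s in (4, 8, 12, 16):
    V = np.empty((A.shape[0], s + 1))
    V[:, 0] = v / np.linalg.norm(v)
    for j in range(s):
        w = A @ V[:, j]
        V[:, j + 1] = w / np.linalg.norm(w)      # monomial basis, column-normalized
    print("s =", s, " cond(V) =", np.linalg.cond(V))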

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What is a "sparse matrix"?
• Requires o(n^2) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be a sum of "sparse" matrices
– Ex: A = sparse + low rank = S + U D V^T, D small & square
• Semiseparable matrices arise as preconditioners
– Need to write A^k = (S + U D V^T)^k as a sum of S^k and low-rank matrices

How nonzero entries and indices are represented:

                              Indices explicit (O(nnz))   Indices implicit (o(nnz))
Entries explicit (O(nnz)):    CSR and variations           Vision, climate, AMR, …
Entries implicit (o(nnz)):    Graph Laplacian              Stencils
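A minimal sketch of the "sparse + low rank" case above: apply A = S + U D V^T as an operator in O(nnz(S) + nk) work without ever forming the n x n matrix. The sizes and matrices are assumed test values; SciPy's LinearOperator is used only to package the matvec.

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator

rng = np.random.default_rng(0)
n, k = 500, 5
S = sp.random(n, n, density=0.01, random_state=0, format="csr")
U = rng.standard_normal((n, k))
D = np.diag(rng.standard_normal(k))
V = rng.standard_normal((n, k))

# A = S + U D V^T, applied without assembling it.
A = LinearOperator((n, n), matvec=lambda x: S @ x + U @ (D @ (V.T @ x)))

x = rng.standard_normal(n)
dense = S.toarray() + U @ D @ V.T           # only for checking, on this small example
print(np.linalg.norm(A.matvec(x) - dense @ x))   # ~ 0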

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

101

Reproducible Floating Point Computation

• Goal: get a bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
– From Kai Diethelm, at GNS-MBH
– Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe the results
– Willing to sacrifice 40%–50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
– Most: "What?! How will I debug without reproducibility?"
– Few: "I know better, and do careful error analysis"
– S. Govindjee: needs it for fracture simulations
– S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility

[Figures: absolute error for random vectors (same magnitude, opposite signs) and relative error for orthogonal vectors — the sign is not reproducible.
Setup: vector size 1e6, data aligned to 16-byte boundaries; for each input vector, dot products are computed using 1, 2, 3 or 4 threads; absolute error = maximum – minimum; relative error = absolute error / maximum absolute value.]

103

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals:
1. Same answer, independent of layout, #processors, order of summands
2. Good performance (scales well)
3. Portable (assume IEEE 754 only)
4. User can choose the accuracy
• Approaches:
– Guarantee a fixed reduction tree (fails 2 or 3)
– Use (very) high precision to get the exact answer (fails 2)
– Prerounding technique (Nguyen, D.)

104
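A tiny demonstration of the underlying problem and of goal 1: the same summands in different orders typically give different results with ordinary floating-point summation, but identical results with an exactly rounded sum such as math.fsum. This illustrates non-associativity and the "exact answer" approach only; the prerounding technique on the slide is a different algorithm designed to also meet the performance goal.

import math
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(10**6) * 10.0 ** rng.integers(-8, 8, size=10**6)

orders = [np.arange(x.size), np.argsort(x), np.argsort(x)[::-1], rng.permutation(x.size)]
print("plain np.sum:", {float(np.sum(x[o])) for o in orders})   # typically several values
print("math.fsum  :", {math.fsum(x[o]) for o in orders})        # exactly one value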

Performance results on 1024-proc Cray XC30: 1.2x to 3.2x slowdown vs the fastest (non-reproducible) code, for n = 1M

Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya

Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger

bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David

Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang

bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou

Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National

Instruments NEC Nokia NVIDIA Samsung Oracle

bull bebop.cs.berkeley.edu

Summary

Don't Communic…

106

Time to redesign all linear algebra n-body hellip algorithms and software

(and compilers)

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all ldquon3-likerdquo linear algebra
  • Lower bound for all ldquon3-likerdquo linear algebra (2)
  • Lower bound for all ldquon3-likerdquo linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a ldquoThmrdquo
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a ldquosparse matrixrdquo
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary
Page 35: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

Back to LU: Using a similar idea for TSLU as for TSQR — use a reduction tree to do "Tournament Pivoting"

W (n x b) = [ W1 ; W2 ; W3 ; W4 ],  factor each block: Wi = Pi·Li·Ui
  Choose b pivot rows of W1, call them W1'
  Choose b pivot rows of W2, call them W2'
  Choose b pivot rows of W3, call them W3'
  Choose b pivot rows of W4, call them W4'

[ W1' ; W2' ] = P12·L12·U12   → choose b pivot rows, call them W12'
[ W3' ; W4' ] = P34·L34·U34   → choose b pivot rows, call them W34'

[ W12' ; W34' ] = P1234·L1234·U1234  → choose the final b pivot rows

Go back to W and use these b pivot rows (move them to the top, then do LU without pivoting).
(A small sketch of this row-selection tournament follows the slide.)

37
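A hedged sketch of the tournament above in Python/SciPy, under assumed toy sizes: partial-pivoting LU on each block picks b candidate rows, candidates are combined pairwise up a binary tree, and the final b rows are the tournament pivots. Only the row selection is shown; the final "LU without pivoting on the rearranged panel" step is omitted. The selected rows generally differ from the rows GEPP on the whole panel would pick, but the claim on the slide is that they are comparably good.

import numpy as np
from scipy.linalg import lu_factor

def gepp_pivot_rows(W, b):
    # Indices (into W) of the b pivot rows chosen by LU with partial pivoting.
    _, piv = lu_factor(W)
    perm = np.arange(W.shape[0])
    for i, p in enumerate(piv[:b]):
        perm[i], perm[p] = perm[p], perm[i]
    return perm[:b]

def tournament_pivot_rows(W, b, nblocks):
    # Leaves: b candidates per block; then pairwise reduction up the tree.
    blocks = np.array_split(np.arange(W.shape[0]), nblocks)
    cands = [blk[gepp_pivot_rows(W[blk], b)] for blk in blocks]
    while len(cands) > 1:
        merged = []
        for i in range(0, len(cands), 2):
            pair = np.concatenate(cands[i:i + 2])
            merged.append(pair[gepp_pivot_rows(W[pair], b)])
        cands = merged
    return cands[0]

rng = np.random.default_rng(0)
W = rng.standard_normal((1024, 8))                       # tall-skinny panel
print("tournament pivots:", sorted(tournament_pivot_rows(W, 8, 4).tolist()))
print("GEPP pivots      :", sorted(gepp_pivot_rows(W, 8).tolist()))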

Minimizing Communication in TSLU

[Figures: reduction trees for W = [W1; W2; W3; W4].
 Parallel: LU on each Wi, then combine candidate rows pairwise up a binary tree.
 Sequential/streaming: LU on W1, then fold in W2, W3, W4 one at a time (a flat tree).
 Dual core: a hybrid of the two trees.]

Can choose the reduction tree dynamically to match the architecture, as before (for TSQR).

38

Making TSLU Numerically Stable

• Details matter
– Going up the tree, we could do LU either on the original rows of A (tournament pivoting) or on the computed rows of U
– Only tournament pivoting is stable
• "Thm": the new scheme is as stable as Partial Pivoting (GEPP) in the following sense: it gets the same Schur complements as GEPP applied to a different input matrix whose entries are blocks taken from the input A
• Why just a "Thm"?

39

Stability of LU using TSLU: CALU

40

• Empirical testing
– Both random matrices and "special ones"
– Both binary tree (BCALU) and flat tree (FCALU)
– 3 metrics: ||PA−LU|| / ||A||, normwise and componentwise backward errors
– See [D., Grigori, Xiang, 2010] for details

Why is stability of TSLU just a "Thm"?

• The proof is correct – in exact arithmetic
• Experiment (a Python version is sketched below):
– Generate 100 random 6x6, rank-3 matrices in Matlab
– [L,U,P] = lu(A), then do LU without pivoting on P·A and compare the L factors: are they the same?
• Compute || L − Lnp ||: a few 0's, a few ∞'s, a few NaNs; the rest mostly O(1)
– Why? Floating point is nonassociative: doing the arithmetic in a different order gives different rounding errors
– Same experiment with rank-6 matrices: || L − Lnp || is usually nonzero, O(macheps)
– Same experiment with 20x20 rank-4 matrices: || L − Lnp || is often O(10^3)
• Much harder to break TSLU, but possible
– Occurred when using TSLU to factorize a low-rank subdiagonal panel in a symmetric-indefinite factorization

41
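A Python translation of the experiment described above, under the same assumptions (100 random 6x6 rank-3 matrices); the LU-without-pivoting routine may divide by tiny pivots and produce huge/NaN values — that is exactly the point being made. Expect a spread of differences like the one described on the slide.

import numpy as np
from scipy.linalg import lu

def lu_no_pivot(A):
    # Doolittle LU without pivoting; tiny pivots are not avoided on purpose.
    A = A.astype(float).copy()
    n = A.shape[0]
    L, U = np.eye(n), np.zeros((n, n))
    for k in range(n):
        U[k, k:] = A[k, k:]
        L[k + 1:, k] = A[k + 1:, k] / U[k, k]
        A[k + 1:, k + 1:] -= np.outer(L[k + 1:, k], U[k, k + 1:])
    return L, U

rng = np.random.default_rng(0)
diffs = []
for _ in range(100):
    A = rng.standard_normal((6, 3)) @ rng.standard_normal((3, 6))   # rank 3
    P, L, U = lu(A)                     # A = P @ L @ U (partial pivoting)
    Lnp, _ = lu_no_pivot(P.T @ A)       # no pivoting on the pre-pivoted matrix
    diffs.append(np.linalg.norm(L - Lnp))

print("median ||L - Lnp||:", np.nanmedian(diffs))
print("max    ||L - Lnp||:", np.nanmax(diffs))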

Fixing TSLU

• Run TSLU, quickly test for stability, fix if necessary (rare):
• Test the conditioning of U; if not tiny (usual case) proceed, else
• Compute || L ||; if not big (usual case) proceed, else
• Factor A = QR using TSQR, then
• Factor Q = PLU using TSLU, then
• A = P·L·(U·R), with U·R as the upper triangular factor
• Last topic in the lecture: how to guarantee floating point reproducibility

42

2D CALU with Tournament Pivoting

43

2.5D CALU with Tournament Pivoting (c=4 copies)

44

Exascale Machine Parameters (Source: DOE Exascale Workshop)

• 2^20 ≈ 1,000,000 nodes
• 1024 cores/node (a billion cores!)
• 100 GB/sec interconnect bandwidth
• 400 GB/sec DRAM bandwidth
• 1 microsec interconnect latency
• 50 nanosec memory latency
• 32 Petabytes of memory
• 1/2 GB total L1 on a node

Exascale predicted speedups for Gaussian Elimination: 2D CA-LU vs ScaLAPACK-LU

[Contour plot over log2(p) on the x-axis and log2(n^2/p) = log2(memory_per_proc) on the y-axis; up to 29x predicted speedup.]

2.5D vs 2D LU, With and Without Pivoting
[Performance figure]

Other CA algorithms for Ax=b, least squares (1/3)

• A symmetric and indefinite
– Seek a factorization that retains symmetry: P A P^T = L D L^T, with D "simple"
• Saves half the flops, preserves inertia
– Usual approach: Bunch-Kaufman
• D block diagonal with 1x1 and 2x2 blocks
• Pivot search down a column and along a row (lots of communication)
– Alternative: Aasen
• D = tridiagonal = T
• Two steps:
– P A P^T = L T L^T, where T is banded, using TSLU
  [figure: the banded matrix T]
– Solve/factor the narrow-band problem with T
• Up to 2.8x faster than MKL; Best Paper at IPDPS'13

48

Other CA algorithms for Ax=b, least squares (2/3)

• Minimizing bandwidth and latency for sequential GEPP
– So far, could not do partial pivoting and minimize #messages, just #words
– Challenge:
• Column layout is good for choosing pivots, bad for matmul
• Blocked layout is good for matmul, bad for choosing pivots
– Solution: use both layouts, switching between them
• "Shape Morphing LU", or SMLU

49

• Without shape morphing:
  func factor(A):
    if A has 1 column, update it
    else:
      factor(left half of A)
      update right half of A
      factor(right half of A)
  Words = O(n^3 / M^{1/2}),  Messages = O(n^3 / M)

• With shape morphing (SMLU):
  func factor(A):
    if A has 1 column, update it
    else:
      factor(left half of A)
      reshape to recursive block format
      update right half of A
      reshape to columnwise format
      factor(right half of A)
  Words = O(n^3 / M^{1/2}),  Messages = O(n^3 / M^{3/2})

Other CA algorithms for Ax=b, least squares (3/3)

• The need for pivoting arises beyond LU, e.g. in QR
– Choose a permutation P so that the leading columns of AP = QR span the column space of A – Rank-Revealing QR (RRQR)
– Usual approach, like partial pivoting:
• Put the longest column first, update the rest of the matrix, repeat
• Hard to do using BLAS3 at all, let alone hit the lower bound
– Use Tournament Pivoting (a column-selection sketch follows below)
• Each round of the tournament selects the best b columns from two groups of b columns, either using the usual approach or something better (Gu/Eisenstat)
• Thm: this approach "reveals the rank" of A, in the sense that the leading rxr submatrix of R has singular values "near" the largest r singular values of A; ditto for the trailing submatrix
– The idea extends to other pivoting schemes:
• Cholesky with diagonal pivoting
• LU with complete pivoting
• LDL^T with complete pivoting

50
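A hedged sketch of tournament pivoting for column selection (RRQR-style), using ordinary QR with column pivoting as the per-group selector; the sizes and the test matrix (numerical rank ≈ b) are assumed. The printed singular values of the selected column block should be "near" the b largest singular values of A, illustrating — not proving — the rank-revealing claim.

import numpy as np
from scipy.linalg import qr

def qrcp_cols(A, b):
    # Indices of the b leading columns chosen by QR with column pivoting.
    _, _, piv = qr(A, mode="economic", pivoting=True)
    return piv[:b]

def tournament_cols(A, b, ngroups):
    groups = np.array_split(np.arange(A.shape[1]), ngroups)
    cands = [g[qrcp_cols(A[:, g], b)] for g in groups]          # leaves
    while len(cands) > 1:                                        # pairwise reduction
        merged = []
        for i in range(0, len(cands), 2):
            pair = np.concatenate(cands[i:i + 2])
            merged.append(pair[qrcp_cols(A[:, pair], b)])
        cands = merged
    return cands[0]

rng = np.random.default_rng(0)
m, n, b = 200, 64, 8
A = (rng.standard_normal((m, b)) @ rng.standard_normal((b, n))
     + 1e-6 * rng.standard_normal((m, n)))                       # numerical rank ~ b
cols = tournament_cols(A, b, ngroups=8)
print(np.linalg.svd(A[:, cols], compute_uv=False)[:b])
print(np.linalg.svd(A, compute_uv=False)[:b])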

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What about sparse matrices? (1/3)

• If the matrix quickly becomes dense, use a dense algorithm
• Ex: All-Pairs Shortest Paths using Floyd-Warshall
• Similar to matmul: let D = A, then

  for k = 1:n
    for i = 1:n
      for j = 1:n
        D(i,j) = min( D(i,j), D(i,k) + D(k,j) )

• But we can't reorder the outer loop for 2.5D; we need another idea
• Abbreviate D(i,j) = min( D(i,j), min_k( A(i,k) + B(k,j) ) ) by D = A⊙B
– The dependencies are ok, 2.5D works — it is just a different semiring
• Kleene's Algorithm (a runnable version is sketched below):

52

  D = DC-APSP(A, n):
    D = A
    Partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
    D11 = DC-APSP(D11, n/2)
    D12 = D11 ⊙ D12
    D21 = D21 ⊙ D11
    D22 = D21 ⊙ D12
    D22 = DC-APSP(D22, n/2)
    D21 = D22 ⊙ D21
    D12 = D12 ⊙ D22
    D11 = D12 ⊙ D21
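A runnable version of the pseudocode above, with the convention made explicit that the semiring "matmul" accumulates into the output with min (as in C = A⊙B meaning C = min(C, min-plus product)); it is checked against plain Floyd-Warshall on an assumed small random digraph.

import numpy as np

INF = np.inf

def minplus(A, B):
    # C[i,j] = min_k A[i,k] + B[k,j] (the semiring used above).
    return (A[:, :, None] + B[None, :, :]).min(axis=1)

def floyd_warshall(A):
    D = A.copy()
    for k in range(D.shape[0]):
        D = np.minimum(D, D[:, [k]] + D[[k], :])
    return D

def dc_apsp(A):
    # Divide-and-conquer APSP (Kleene); products accumulate with min into the block.
    n = A.shape[0]
    if n == 1:
        return np.minimum(A, 0.0)
    h = n // 2
    D = A.copy()
    D11, D12, D21, D22 = D[:h, :h], D[:h, h:], D[h:, :h], D[h:, h:]
    D11[:] = dc_apsp(D11)
    D12[:] = minplus(D11, D12)
    D21[:] = minplus(D21, D11)
    D22[:] = np.minimum(D22, minplus(D21, D12))
    D22[:] = dc_apsp(D22)
    D21[:] = minplus(D22, D21)
    D12[:] = minplus(D12, D22)
    D11[:] = np.minimum(D11, minplus(D12, D21))
    return D

rng = np.random.default_rng(0)
n = 32
A = rng.uniform(1, 10, (n, n))
A[rng.random((n, n)) > 0.3] = INF        # sparse-ish random digraph
np.fill_diagonal(A, 0.0)
print(np.allclose(dc_apsp(A.copy()), floyd_warshall(A)))   # True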

Performance of 2.5D APSP using Kleene

53

[Plot: strong scaling on Hopper (Cray XE6 with 1024 nodes = 24,576 cores); annotations: 6.2x speedup, 2x speedup.]

What about sparse matrices? (2/3)

• If parts of the matrix become dense, optimize those
• Ex: Cholesky on a matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): if all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
– Need to do dense Cholesky on a w x w submatrix
• Thm: Words_moved = Ω(w^3 / M^{1/2}), etc.
• Thm (George '73): nested dissection gives an optimal ordering for 2D grids, 3D grids, and similar matrices
– w = n for a 2D n x n grid; w = n^2 for a 3D n x n x n grid
• Sequential multifrontal Cholesky attains the bounds
• PSPASES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
– Attains the 2D and 2.5D lower bounds (using optimal dense Cholesky on the separators)

54

Page 36: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

Minimizing Communication in TSLU

W = W1

W2

W3

W4

LULULULU

LU

LULUParallel

W = W1

W2

W3

W4

LULU

LU

LUSequentialStreaming

W = W1

W2

W3

W4

LULU LU

LULU

LULU

Dual Core

Can choose reduction tree dynamically to match architecture as before

38

Making TSLU Numerically Stable

bull Details matterndash Going up the tree we could do LU either on original rows of A

(tournament pivoting) or computed rows of Undash Only tournament pivoting stable

bull ldquoThmrdquo New scheme as stable as Partial Pivoting (GEPP) in following sense Get same Schur complements as GEPP applied to different input matrix whose entries are blocks taken from input A

bull Why just a ldquoThmrdquo

39

Stability of LU using TSLU CALU

Summer School Lecture 4 40

bull Empirical testingndash Both random matrices and ldquospecial onesrdquondash Both binary tree (BCALU) and flat-tree (FCALU)ndash 3 metrics ||PA-LU||||A|| normwise and componentwise backward errorsndash See [D Grigori Xiang 2010] for details

Why is stability of TSLU just a ldquoThmrdquo

bull Proof is correct ndash in exact arithmeticbull Experiment

ndash Generate 100 random 6x6 rank 3 matrices in Matlabndash [LUP] = lu(A) do LU without pivoting on PA compare L factors are

they the samebull Compute || L ndash Lnp || A few 0rsquos A few infinrsquos a few NaNsbull Rest mostly O(1)

ndash Why Floating point is nonassociative doing arithmetic in different order gives different rounding errors

ndash Same experiment with rank 6 matrices || L ndash Lnp || usually nonzero O(macheps)

ndash Same experiment with 20x20 rank 4 matrices || L ndash Lnp || often O(103)

bull Much harder to break TSLU but possiblendash Occurred when using TSLU to factorize a low-rank subdiagonal

panel in symmetric-indefinite factorization41

Fixing TSLU

bull Run TSLU quickly test for stability fix if necessary (rare)

bull Test conditioning of U if not tiny (usual case) proceed elsebull Compute || L || if not big (usual case) proceed elsebull Factor A = QR using TSQR thenbull Factor Q = PLU using TSLU thenbull A = PL(UR) with UR as upper triangular factor

bull Last topic in lecture how to guarantee floating point reproducibility

42

2D CALU with Tournament Pivoting

43

25D CALU with Tournament Pivoting (c=4 copies)

44

Exascale Machine ParametersSource DOE Exascale Workshop

bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log

2 (

n2p

) =

log

2 (m

emo

ry_p

er_p

roc)

Up to 29x

25D vs 2D LUWith and Without Pivoting

Other CA algorithms for Ax=b least squares(13)

bull A symmetric and indefinitendash Seek factorization that retains symmetry PAPT = LDLT D

ldquosimplerdquobull Save frac12 flops preserve inertia

ndash Usual approach Bunch-Kaufmanbull D block diagonal with 1x1 and 2x2 blocksbull Pivot search down column along row (lots of communication)

ndash Alternative Aasenbull D = tridiagonal = Tbull Two steps

ndash PAPT = LTLT where T is banded using TSLU

48

0 0

0

0 0

0

0

hellip

hellip

ndash Solvefactor narrow band problem with Tbull Up to 28x faster than MKL Best Paper at IPDPSrsquo13

Other CA algorithms for Ax=b least squares (23)bull Minimizing bandwidth and latency for sequential GEPP

ndash So far could not do partial pivoting and minimize messages just words

ndash Challengebull Column layout good for choosing pivots bad for matmulbull Blocked layout good for matmul bad for choosing pivots

ndash Solution use both layouts switching between thembull ldquoShape Morphing LUrdquo or SMLU

49

bull func factor(A) if A has 1 column update it else factor(left half of A)

update right half of A

factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M)

bull func factor(A) if A has 1 column update it else factor(left half of A) reshape to recursive block format update right half of A reshape to columnwise format factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M32)

Other CA algorithms for Ax=b least squares (33)bull Need for pivoting arises beyond LU in QR

ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)

ndash Usual approach like Partial Pivoting

bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound

ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two

groups of b columns either using usual approach or something better (GuEisenstat)

bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix

ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 50

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What about sparse matrices (13)

bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then

bull But canrsquot reorder outer loop for 25D need another idea

bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring

bull Kleenersquos Algorithm

52

for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)

D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21

Performance of 25D APSP using Kleene

53

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)

62xspeedup

2x speedup

What about sparse matrices (23)

bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of

G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix

bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering

for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid

bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse

multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on

separators) 54

What about sparse matrices (33)

bull If matrix stays very sparse lower bound unattainable new one

bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of

data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)

bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)

Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely

bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices

along dimensions most likely to minimize cost

55

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Symmetric Eigenproblem and SVD

bull Usual approach for A=AT (SVD similar)ndash A QTAQ = T where Q orthogonal T tridiagonalndash T UTTU = Λ where U orthogonal Λ diagonalndash QUrsquos columns are eigenvectors Λ eigenvaluesndash Dense Tridiagonal Diagonalndash Only half BLAS3 half BLAS2 in LAPACKrsquos sytrd

bull Communication-Avoiding Approachndash A QAQT = B where B=BT banded of bandwidth M12

ndash Continue as above starting with Bndash Dense Banded Tridiagonal Diagonalndash Dense Banded use TSQR to zero out M12 colsrows at a timendash Banded Tridiagonal need new(ish) idea

b+1

b+1

Successive Band Reduction (BischofLangSun)

1

b+1

b+1

d+1

c

Successive Band Reduction (BischofLangSun)

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1Q1

b+1

b+1

d+1

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

12

Q1

b+1

b+1

d+1

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

12

Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

Q1

Q1T

Q2

Q2T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

Q1

Q1T

Q2

Q2T

Q3

Q3T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

6

6

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

Conventional vs. CA-SBR

Conventional: touch all data 4 times.   Communication-Avoiding: touch all data once.

Speedups of Sym. Band Reduction vs. DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem

• No apparent way to modify the standard algorithm
• Instead: Spectral Divide-and-Conquer
  – Find an orthogonal matrix Q whose leading columns span an invariant subspace of A
  – Then QᵀAQ is block upper triangular:

        [ A11  A12 ]
        [  ε   A22 ]

  – Apply recursively to A11 and A22
  – Depends on randomization:
    1. Randomized Rank-Revealing QR decomposition
    2. Randomized location to try splitting the spectrum
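A small numerical check of the structural claim above (a sketch only: it uses a complex Schur decomposition to manufacture an invariant subspace for illustration, not the randomized RRQR-based splitting described on the slide; n, k and the snippet itself are arbitrary choices):

    import numpy as np
    from scipy.linalg import schur

    rng = np.random.default_rng(0)
    n, k = 8, 3
    A = rng.standard_normal((n, n))

    # Complex Schur form A = Q T Q^H: the leading k columns of Q span an
    # invariant subspace of A, so Q^H A Q must be block upper triangular.
    T, Q = schur(A, output='complex')
    B = Q.conj().T @ A @ Q
    print(np.linalg.norm(B[k:, :k]))   # ~1e-15: the (2,1) block is the "ε" above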

Attaining the Lower Bounds: Sequential
Legend: [Existing] [Ours] [Math-Lib] [Random]
Columns: Two Levels (Words, Messages) | Memory Hierarchy (Words, Messages)

BLAS-3:             [FLPR'99][BDLST'13][MKL etc.] | [FLPR'99][BDLST'13][MKL etc.]
Cholesky:           [G'97][AP'00] | [LAPACK][BDHS'09] | [G'97][AP'00][BDHS'09] | [G'97][AP'00][BDHS'09]
Sym. Indefinite:    [BBDDDPSTY'13] | [BBDDDPSTY'13]
LU:                 [G'97][T'97] | [GDX'11][BDLST'13] | [GDX'11][BDLST'13] | [G'97][T'97] [BDLST'13] [BDLST'13]
QR:                 [EG'98][FW'03] | [DGHL'12][BDLST'13] | [FW'03][DGHL'12][BDLST'13] | [EG'98][FW'03][BDLST'13] | [FW'03][BDLST'13]
Rank-Revealing QR:  [BDD'11][DGGX'13]
Sym. Eig & SVD:     [BDD'11][BDK'13] | [BDD'11]
Non-Sym. Eig:       [BDD'11] | [BDD'11]

Attaining the Lower Bounds: Parallel 2D, M = Θ(n²/P)
(Ignoring poly-log(P) factors; lower bounds: words = Ω(n²/P^{1/2}), messages = Ω(P^{1/2}))
Legend: [Existing] [Ours] [Math-Lib] [Random]
Columns: Words (BW) | Messages (L) | Saving factor (attained with extra memory, 2.5D: M = Θ(c·n²/P))

BLAS-3:             [AGZ'94][MT'99][ScaLAPACK][C'69][vGW'97][SD'11]; saving L: n/P^{1/2}
Cholesky:           [ScaLAPACK][T'99][SD'11]; saving L: n/P^{1/2}
Sym. Indefinite:    [BBDDDPSTY'13][ScaLAPACK] | [BBDDDPSTY'13]; saving L: n/P^{1/2}
LU:                 [ScaLAPACK][GDX'11][T'99][SD'11] | [GDX'11][T'99][SD'11]; saving L: n/P^{1/2}
QR:                 [ScaLAPACK][DGHL'12][T'99] | [DGHL'12][T'99]; saving L: n/P^{1/2}
Rank-Revealing QR:  [BDD'11][DGGX'13]
Sym. Eig & SVD:     [BDD'11][BDK'13][ScaLAPACK] | [BDD'11][BDK'13]; saving L: n/P^{1/2}
Non-Sym. Eig:       [BDD'11] | [BDD'11]; saving BW: P^{1/2}, L: n

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and a starting vector
  – Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume matrix is "well-partitioned"
  – Serial implementation:
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors:
    • Conventional: O(k log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability
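For reference, the object these reorganized methods need is the Krylov basis [x, Ax, A²x, …, Aᵏx]. A naive sketch is below (hypothetical helper name, SciPy assumed); the communication-avoiding matrix powers kernel produces the same vectors, but reads A only O(1) times serially and avoids a message round per SpMV in parallel, by trading redundant "ghost zone" computation for communication:

    import numpy as np
    import scipy.sparse as sp

    def krylov_basis(A, x, k):
        # Return [x, A@x, ..., A^k @ x]; here computed naively with k separate SpMVs.
        V = [x]
        for _ in range(k):
            V.append(A @ V[-1])
        return np.column_stack(V)

    # Example: 1D Poisson matrix, k = 4
    n, k = 100, 4
    A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format='csr')
    V = krylov_basis(A, np.ones(n), k)
    print(V.shape)   # (100, 5)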

75

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV

• n = 21,200; nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

77

Example: The Difficulty of Tuning

• n = 21,200; nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs
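A minimal sketch of how such a dense substructure is exploited in practice (SciPy assumed; the matrix below is a synthetic stand-in, not the raefsky matrix): storing the matrix in Block Sparse Row (BSR) format with 8x8 blocks keeps one column index per block instead of one per nonzero, which is the memory-reference reduction referred to above.

    import numpy as np
    import scipy.sparse as sp

    # Stand-in for a matrix with dense 8x8 substructure.
    blocks = sp.random(64, 64, density=0.05, random_state=0)     # block sparsity pattern
    A_csr = sp.kron(blocks, np.ones((8, 8))).tocsr()              # 512x512, 8x8 dense blocks

    A_bsr = A_csr.tobsr(blocksize=(8, 8))   # register-blocked storage
    rng = np.random.default_rng(0)
    x = rng.standard_normal(A_csr.shape[1])

    # Same result, but BSR stores one index per 8x8 block instead of per entry.
    print(np.allclose(A_csr @ x, A_bsr @ x))
    print(A_csr.nnz, A_bsr.indices.size)    # per-entry vs. per-block column indices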

78

Speedups on Itanium 2: The Need for Search

[Figure: register-blocked SpMV performance over all block sizes, in Mflops; the reference (unblocked) code vs. the best block size found by search (4x2)]

Register Profile: Itanium 2

[Figure: SpMV register-blocking profile on Itanium 2; performance ranges from 190 Mflops to 1190 Mflops across block sizes]

Register Profiles: IBM and Intel IA-64

[Figure, four panels of register-blocking profiles (best fraction of machine peak in parentheses): Power3 (17%), 122 to 252 Mflops; Power4 (16%), 459 to 820 Mflops; Itanium 1 (8%), 107 to 247 Mflops; Itanium 2 (33%), 190 Mflops to 1.2 Gflops]

Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16,614; NNZ = 1.1 M

82

Zoom in to top corner

• More complicated non-zero structure in general
• N = 16,614; NNZ = 1.1 M

83

3x3 blocks look natural, but…

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

84

Extra Work Can Improve Efficiency

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup!
  – Actual Mflop rate is 1.5² = 2.25x higher
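The arithmetic behind that trade-off, as a worked equation with the numbers above (a restatement, not new data):

    net speedup = (Mflop-rate gain) / (fill ratio) = 2.25 / 1.5 = 1.5x

i.e. the blocked kernel runs 2.25x faster per flop, but performs 1.5x as many flops (some on explicit zeros), netting the observed 1.5x.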

85

Source: Accelerator Cavity Design Problem (Ko via Husbands)

[Figure: spy plot of the matrix]

100x100 Submatrix Along Diagonal

[Figure: zoomed spy plot of a 100x100 diagonal submatrix]

Post-RCM Reordering

[Figure: spy plot after RCM reordering]

Effect of Combined RCM+TSP Reordering

[Figure: before = green + red, after = green + blue]

2x speedups on Pentium 4, Power 4, …

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·Aᵀ·x, Aᵀ·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later…

90

Optimized Sparse Kernel Interface - OSKI

• Provides sparse kernels automatically tuned for the user's matrix & machine
  – BLAS-style functionality: SpMV (A·x & Aᵀ·y), TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as a stand-alone library
  – Available as a PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

91

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

93

Example: Classical Conjugate Gradient (CG)

[Algorithm on slide] SpMVs and dot products require communication in each iteration.

Example: CA-Conjugate Gradient

[Algorithm on slide] The basis vectors are computed via the CA matrix powers kernel; a single global reduction computes the Gram matrix G; the local computations within the inner loop require no communication.
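For concreteness, a minimal sketch of the classical CG loop referred to above (standard textbook form, not the slide's exact notation), with comments marking where the communication occurs: one SpMV and two global reductions (dot products) per iteration. CA-CG restructures s such iterations into one matrix-powers-kernel call plus one block reduction.

    import numpy as np

    def cg(A, b, x0, tol=1e-8, maxit=1000):
        # Classical conjugate gradients for symmetric positive definite A
        # (dense ndarray or scipy.sparse matrix).
        x = x0.copy()
        r = b - A @ x              # SpMV: neighbor communication in parallel
        p = r.copy()
        rr = r @ r                 # dot product: global reduction
        for _ in range(maxit):
            Ap = A @ p             # one SpMV per iteration
            alpha = rr / (p @ Ap)  # dot product: global reduction
            x = x + alpha * p
            r = r - alpha * Ap
            rr_new = r @ r         # dot product: global reduction
            if np.sqrt(rr_new) < tol:
                break
            p = r + (rr_new / rr) * p
            rr = rr_new
        return x

    # Example use on a small SPD system (1D Poisson):
    n = 50
    A = np.diag(2.0 * np.ones(n)) + np.diag(-np.ones(n - 1), 1) + np.diag(-np.ones(n - 1), -1)
    b = np.ones(n)
    x = cg(A, b, np.zeros(n))
    print(np.linalg.norm(A @ x - b))   # small (at or below the 1e-8 tolerance)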

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

96

[Figure: convergence of CG vs. CA-CG (monomial basis) on a model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400. CA-CG shows slower convergence and loss of accuracy due to roundoff; at s = 16 the monomial basis is rank deficient and the method breaks down; a horizontal line marks machine precision]
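The source of that breakdown can be checked in a few lines (a sketch on a small 1D Poisson stand-in rather than the slide's 2D model problem, so the exact breakdown point differs): the condition number of the monomial Krylov basis [p, Ap, …, Aˢp] grows rapidly with s, until the basis is numerically rank deficient in double precision.

    import numpy as np
    import scipy.sparse as sp

    n = 400
    A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format='csr')  # 1D Poisson stand-in
    rng = np.random.default_rng(0)
    p = rng.standard_normal(n)
    p /= np.linalg.norm(p)

    for s in (4, 8, 12, 16):
        V = [p]
        for _ in range(s):
            V.append(A @ V[-1])
        K = np.column_stack(V)
        print(s, np.linalg.cond(K))   # condition number grows rapidly with s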

97

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?

• Requires o(n²) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be a sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + UDVᵀ, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write Aᵏ = (S + UDVᵀ)ᵏ as a sum of Sᵏ and low-rank matrices

Examples (rows: nonzero entries; columns: indices):
                              Indices explicit (O(nnz))   Indices implicit (o(nnz))
  Entries explicit (O(nnz)):  CSR and variations           Vision, climate, AMR, …
  Entries implicit (o(nnz)):  Graph Laplacian              Stencils
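Why Aᵏ splits into "Sᵏ plus low rank" (a short derivation consistent with the notation above; r denotes the number of columns of U):

    A^k - S^k = Σ_{j=0}^{k-1} S^j (A - S) A^{k-1-j} = Σ_{j=0}^{k-1} (S^j U) D (Vᵀ A^{k-1-j}),

a telescoping sum of k terms, each of rank at most r; so Aᵏ = Sᵏ + (a matrix of rank at most k·r).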

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

101

Reproducible Floating Point Computation

• Goal: get a bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe the results
  – Willing to sacrifice 40% - 50% of performance for it
• Email to ~110 Berkeley CSE faculty, asking about it
  – Most: "What? How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility

[Figure, two panels: absolute error for random vectors and relative error for orthogonal vectors; annotations: "same magnitude, opposite signs", "sign not reproducible"]

Vector size: 1e6; data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum - minimum
• Relative error = absolute error / maximum absolute value
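The root cause is that floating-point addition is not associative, so summing the same numbers in a different order (which is what a different thread count induces) gives a slightly different result. A minimal stand-alone illustration (not MKL itself):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(10**6) * 10.0 ** rng.integers(-8, 8, 10**6)

    s1 = sum(x.tolist())           # strict left-to-right order
    s2 = float(np.sum(x))          # pairwise summation (a different order)
    s3 = sum(sorted(x.tolist()))   # yet another order
    print(s1, s2, s3)              # typically differ in the trailing digits
    print(abs(s1 - s2), abs(s2 - s3))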

103

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee a fixed reduction tree (sacrifices 2 or 3)
  – Use (very) high precision to get the exact answer (sacrifices 2)
  – Prerounding technique (Nguyen, D.)
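A toy illustration of the first approach (hypothetical helpers, a sketch only): a reduction whose tree shape depends on the number of threads can change its answer, while a summation that always uses the same fixed tree returns one bit-wise identical result, at the cost of constraining the schedule, which is why it conflicts with goals 2-3 above.

    import numpy as np

    def threaded_sum(x, p):
        # Simulate a p-thread reduction: per-thread partial sums, then combine.
        chunks = np.array_split(np.asarray(x, dtype=np.float64), p)
        return sum(float(c.sum()) for c in chunks)

    def fixed_tree_sum(x):
        # Always combine neighbors in the same binary-tree order,
        # no matter how the work is later distributed.
        v = np.asarray(x, dtype=np.float64).copy()
        n = v.size
        while n > 1:
            half = n // 2
            v[:half] = v[:half] + v[half:2 * half]
            if n % 2:              # carry the odd element forward
                v[half] = v[2 * half]
                n = half + 1
            else:
                n = half
        return float(v[0])

    rng = np.random.default_rng(1)
    x = rng.standard_normal(10**5)
    print({p: threaded_sum(x, p) for p in (1, 2, 3, 4)})  # may differ in the last bits
    print(fixed_tree_sum(x))                              # one fixed answer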

104

Performance results on 1024-processor Cray XC30: 1.2x to 3.2x slowdown vs. fastest code, for n = 1M

Collaborators and Supporters

• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

106

Time to redesign all linear algebra, n-body, … algorithms and software

(and compilers)

Page 37: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

Making TSLU Numerically Stable

bull Details matterndash Going up the tree we could do LU either on original rows of A

(tournament pivoting) or computed rows of Undash Only tournament pivoting stable

bull ldquoThmrdquo New scheme as stable as Partial Pivoting (GEPP) in following sense Get same Schur complements as GEPP applied to different input matrix whose entries are blocks taken from input A

bull Why just a ldquoThmrdquo

39

Stability of LU using TSLU CALU

Summer School Lecture 4 40

bull Empirical testingndash Both random matrices and ldquospecial onesrdquondash Both binary tree (BCALU) and flat-tree (FCALU)ndash 3 metrics ||PA-LU||||A|| normwise and componentwise backward errorsndash See [D Grigori Xiang 2010] for details

Why is stability of TSLU just a ldquoThmrdquo

bull Proof is correct ndash in exact arithmeticbull Experiment

ndash Generate 100 random 6x6 rank 3 matrices in Matlabndash [LUP] = lu(A) do LU without pivoting on PA compare L factors are

they the samebull Compute || L ndash Lnp || A few 0rsquos A few infinrsquos a few NaNsbull Rest mostly O(1)

ndash Why Floating point is nonassociative doing arithmetic in different order gives different rounding errors

ndash Same experiment with rank 6 matrices || L ndash Lnp || usually nonzero O(macheps)

ndash Same experiment with 20x20 rank 4 matrices || L ndash Lnp || often O(103)

bull Much harder to break TSLU but possiblendash Occurred when using TSLU to factorize a low-rank subdiagonal

panel in symmetric-indefinite factorization41

Fixing TSLU

bull Run TSLU quickly test for stability fix if necessary (rare)

bull Test conditioning of U if not tiny (usual case) proceed elsebull Compute || L || if not big (usual case) proceed elsebull Factor A = QR using TSQR thenbull Factor Q = PLU using TSLU thenbull A = PL(UR) with UR as upper triangular factor

bull Last topic in lecture how to guarantee floating point reproducibility

42

2D CALU with Tournament Pivoting

43

25D CALU with Tournament Pivoting (c=4 copies)

44

Exascale Machine ParametersSource DOE Exascale Workshop

bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log

2 (

n2p

) =

log

2 (m

emo

ry_p

er_p

roc)

Up to 29x

25D vs 2D LUWith and Without Pivoting

Other CA algorithms for Ax=b least squares(13)

bull A symmetric and indefinitendash Seek factorization that retains symmetry PAPT = LDLT D

ldquosimplerdquobull Save frac12 flops preserve inertia

ndash Usual approach Bunch-Kaufmanbull D block diagonal with 1x1 and 2x2 blocksbull Pivot search down column along row (lots of communication)

ndash Alternative Aasenbull D = tridiagonal = Tbull Two steps

ndash PAPT = LTLT where T is banded using TSLU

48

0 0

0

0 0

0

0

hellip

hellip

ndash Solvefactor narrow band problem with Tbull Up to 28x faster than MKL Best Paper at IPDPSrsquo13

Other CA algorithms for Ax=b least squares (23)bull Minimizing bandwidth and latency for sequential GEPP

ndash So far could not do partial pivoting and minimize messages just words

ndash Challengebull Column layout good for choosing pivots bad for matmulbull Blocked layout good for matmul bad for choosing pivots

ndash Solution use both layouts switching between thembull ldquoShape Morphing LUrdquo or SMLU

49

bull func factor(A) if A has 1 column update it else factor(left half of A)

update right half of A

factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M)

bull func factor(A) if A has 1 column update it else factor(left half of A) reshape to recursive block format update right half of A reshape to columnwise format factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M32)

Other CA algorithms for Ax=b least squares (33)bull Need for pivoting arises beyond LU in QR

ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)

ndash Usual approach like Partial Pivoting

bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound

ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two

groups of b columns either using usual approach or something better (GuEisenstat)

bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix

ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 50

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What about sparse matrices (13)

bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then

bull But canrsquot reorder outer loop for 25D need another idea

bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring

bull Kleenersquos Algorithm

52

for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)

D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21

Performance of 25D APSP using Kleene

53

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)

62xspeedup

2x speedup

What about sparse matrices (23)

bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of

G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix

bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering

for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid

bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse

multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on

separators) 54

What about sparse matrices (33)

bull If matrix stays very sparse lower bound unattainable new one

bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of

data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)

bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)

Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely

bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices

along dimensions most likely to minimize cost

55

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Symmetric Eigenproblem and SVD

bull Usual approach for A=AT (SVD similar)ndash A QTAQ = T where Q orthogonal T tridiagonalndash T UTTU = Λ where U orthogonal Λ diagonalndash QUrsquos columns are eigenvectors Λ eigenvaluesndash Dense Tridiagonal Diagonalndash Only half BLAS3 half BLAS2 in LAPACKrsquos sytrd

bull Communication-Avoiding Approachndash A QAQT = B where B=BT banded of bandwidth M12

ndash Continue as above starting with Bndash Dense Banded Tridiagonal Diagonalndash Dense Banded use TSQR to zero out M12 colsrows at a timendash Banded Tridiagonal need new(ish) idea

b+1

b+1

Successive Band Reduction (BischofLangSun)

1

b+1

b+1

d+1

c

Successive Band Reduction (BischofLangSun)

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1Q1

b+1

b+1

d+1

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

12

Q1

b+1

b+1

d+1

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

12

Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

Q1

Q1T

Q2

Q2T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

Q1

Q1T

Q2

Q2T

Q3

Q3T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

6

6

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

Conventional vs CA - SBR

Conventional Communication-Avoiding

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Nonsymmetric Eigenproblem

bull No apparent way to modify standard algorithmbull Instead Spectral Divide-and-Conquer

ndash Find orthogonal matrix Q whose leading columns span an invariant subspace of A

ndash QTAQ will be block upper triangular

ndash Apply recursively to A11 A22

ndash Depends on randomization1 Randomized Rank Revealing QR decomposition2 Randomized location to try splitting spectrum

A11 A12

ε A22

Attaining the Lower bounds SequentialLegend[Existing][Ours][Math-Lib][Random]

Two Levels Memory Hierarchy

Words Messages Words Messages

BLAS-3 [FLPRrsquo99][BDLSTrsquo13][MKL etc] [FLPRrsquo99][BDLSTrsquo13][MKL etc]

Cholesky[Grsquo97][APrsquo00]

[LAPACK][BDHSrsquo09]

[Grsquo97][APrsquo00][BDHSrsquo09] [Grsquo97][APrsquo00][BDHSrsquo09]

Sym Indefinite [BBDDDPSTYrsquo13] [BBDDDPSTYrsquo13]

LU[Grsquo97][Trsquo97]

[GDXrsquo11][BDLSTrsquo13]

[GDXrsquo11][BDLSTrsquo13]

[Grsquo97][Trsquo97] [BDLSTrsquo13] [BDLSTrsquo13]

QR[EGrsquo98][FWrsquo03]

[DGHLrsquo12][BDLSTrsquo13]

[FWrsquo03][DGHLrsquo12][BDLSTrsquo13]

[EGrsquo98][FWrsquo03][BDLSTrsquo13]

[FWrsquo03][BDLSTrsquo13]

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13] [BDDrsquo11]

Non Sym Eig [BDDrsquo11] [BDDrsquo11]

Legend[Existing][Ours][Math-Lib][Random]

Words (BW) Messages (L) Saving factor

BLAS-3 [AGZrsquo94][MTrsquo99][ScaLAPACK][Crsquo69][vGWrsquo97][SDrsquo11] L nP12

Cholesky [ScaLAPACK][Trsquo99][SDrsquo11] L nP12

Sym Indefinite [BBDDDPSTYrsquo13][ScaLAPACK] [BBDDDPSTYrsquo13] L nP12

LU [ScaLAPACK][GDXrsquo11][Trsquo99][SDrsquo11] [GDXrsquo11][Trsquo99][SDrsquo11] L nP12

QR [ScaLAPACK][DGHLrsquo12] [Trsquo99] [DGHLrsquo12][Trsquo99] L nP12

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13][ScaLAPACK] [BDDrsquo11][BDKrsquo13] L nP12

Non-Sym Eig [BDDrsquo11] [BDDrsquo11] BW P12 L n

Attaining with extra memory 25D M=(cn2P)

Attaining the Lower bounds Parallel 2DM=(n2P)(Ignoring poly-log(P) factors words = ( n2 P12) messages = (P12)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Conjugate Gradients (CG) GMRES Lanczos Arnoldi hellip bull Goal minimize communication

ndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computationndash Challenges Poor partitioning Preconditioning Num Stability

75

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Example The Difficulty of Tuning SpMV

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

77

Example The Difficulty of Tuning

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

bull 8x8 dense substructure exploit this to limit mem_refs

78

Speedups on Itanium 2 The Need for Search

Reference

Best 4x2

Mflops

Mflops

79

Register Profile Itanium 2

190 Mflops

1190 Mflops

80

Register Profiles IBM and Intel IA-64Power3 - 17 Power4 - 16

Itanium 2 - 33Itanium 1 - 8

252 Mflops

122 Mflops

820 Mflops

459 Mflops

247 Mflops

107 Mflops

12 Gflops

190 Mflops

Another example of tuning challenges for SpMV

bull Ex11 matrix (fluid flow)

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

82

Zoom in to top corner

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

83

3x3 blocks look natural buthellip

bull Example 3x3 blockingndash Logical grid of 3x3 cells

bull But would lead to lots of ldquofill-inrdquo

84

Extra Work Can Improve Efficiency

bull Example 3x3 blockingndash Logical grid of 3x3 cellsndash Fill-in explicit zerosndash Unroll 3x3 block multipliesndash ldquoFill ratiordquo = 15

bull On Pentium III 15x speedup

ndash Actual mflop rate 152 = 225 higher

85

Source Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal

Summer School Lecture 7

87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

Before Green + RedAfter Green + Blue

Summer School Lecture 7

892x speedups on Pentium 4 Power 4 hellip

Summary of Other Performance Optimizations

bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip

bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR

bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip

90

Optimized Sparse Kernel Interface - OSKI

bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning

bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski

bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression

software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki

91

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

93

Example Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in

each iteration

via CA Matrix Powers Kernel

Global reduction to compute G

94

Example CA-Conjugate Gradient

Local computations within inner loop require

no communication

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

96

Slower convergence due

to roundoff

Loss of accuracy due to roundoff

At s = 16 monomial basis is rank deficient Method breaks down

Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400

CA-CG (monomial)CG

machine precision

97

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit

bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square

bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank

matrices

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

101

bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010

ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver

demanded by customers (construction engineers) otherwise they donrsquot believe results

ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it

ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection

Reproducible Floating Point Computation

Absolute Error for Random Vectors

Same magnitude opposite signs

Intel MKL non-reproducibility

Relative Error for Orthogonal vectors

Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value

Sign notreproducible

103

bull Consider summation or dot productbull Goals

1 Same answer independent of layout processors order of summands

2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy

bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)

GoalsApproaches for Reproducibility

104

Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M

Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya

Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger

bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David

Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang

bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou

Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National

Instruments NEC Nokia NVIDIA Samsung Oracle

bull bebopcsberkeleyedu

Summary

Donrsquot Communichellip

106

Time to redesign all linear algebra n-body hellip algorithms and software

(and compilers)

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all ldquon3-likerdquo linear algebra
  • Lower bound for all ldquon3-likerdquo linear algebra (2)
  • Lower bound for all ldquon3-likerdquo linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a ldquoThmrdquo
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a ldquosparse matrixrdquo
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary
Page 38: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

Stability of LU using TSLU CALU

Summer School Lecture 4 40

bull Empirical testingndash Both random matrices and ldquospecial onesrdquondash Both binary tree (BCALU) and flat-tree (FCALU)ndash 3 metrics ||PA-LU||||A|| normwise and componentwise backward errorsndash See [D Grigori Xiang 2010] for details

Why is stability of TSLU just a ldquoThmrdquo

bull Proof is correct ndash in exact arithmeticbull Experiment

ndash Generate 100 random 6x6 rank 3 matrices in Matlabndash [LUP] = lu(A) do LU without pivoting on PA compare L factors are

they the samebull Compute || L ndash Lnp || A few 0rsquos A few infinrsquos a few NaNsbull Rest mostly O(1)

ndash Why Floating point is nonassociative doing arithmetic in different order gives different rounding errors

ndash Same experiment with rank 6 matrices || L ndash Lnp || usually nonzero O(macheps)

ndash Same experiment with 20x20 rank 4 matrices || L ndash Lnp || often O(103)

bull Much harder to break TSLU but possiblendash Occurred when using TSLU to factorize a low-rank subdiagonal

panel in symmetric-indefinite factorization41

Fixing TSLU

bull Run TSLU quickly test for stability fix if necessary (rare)

bull Test conditioning of U if not tiny (usual case) proceed elsebull Compute || L || if not big (usual case) proceed elsebull Factor A = QR using TSQR thenbull Factor Q = PLU using TSLU thenbull A = PL(UR) with UR as upper triangular factor

bull Last topic in lecture how to guarantee floating point reproducibility

42

2D CALU with Tournament Pivoting

43

25D CALU with Tournament Pivoting (c=4 copies)

44

Exascale Machine ParametersSource DOE Exascale Workshop

bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log

2 (

n2p

) =

log

2 (m

emo

ry_p

er_p

roc)

Up to 29x

25D vs 2D LUWith and Without Pivoting

Other CA algorithms for Ax=b least squares(13)

bull A symmetric and indefinitendash Seek factorization that retains symmetry PAPT = LDLT D

ldquosimplerdquobull Save frac12 flops preserve inertia

ndash Usual approach Bunch-Kaufmanbull D block diagonal with 1x1 and 2x2 blocksbull Pivot search down column along row (lots of communication)

ndash Alternative Aasenbull D = tridiagonal = Tbull Two steps

ndash PAPT = LTLT where T is banded using TSLU

48

0 0

0

0 0

0

0

hellip

hellip

ndash Solvefactor narrow band problem with Tbull Up to 28x faster than MKL Best Paper at IPDPSrsquo13

Other CA algorithms for Ax=b least squares (23)bull Minimizing bandwidth and latency for sequential GEPP

ndash So far could not do partial pivoting and minimize messages just words

ndash Challengebull Column layout good for choosing pivots bad for matmulbull Blocked layout good for matmul bad for choosing pivots

ndash Solution use both layouts switching between thembull ldquoShape Morphing LUrdquo or SMLU

49

bull func factor(A) if A has 1 column update it else factor(left half of A)

update right half of A

factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M)

bull func factor(A) if A has 1 column update it else factor(left half of A) reshape to recursive block format update right half of A reshape to columnwise format factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M32)

Other CA algorithms for Ax=b least squares (33)bull Need for pivoting arises beyond LU in QR

ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)

ndash Usual approach like Partial Pivoting

bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound

ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two

groups of b columns either using usual approach or something better (GuEisenstat)

bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix

ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 50

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What about sparse matrices (13)

bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then

bull But canrsquot reorder outer loop for 25D need another idea

bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring

bull Kleenersquos Algorithm

52

for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)

D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21

Performance of 25D APSP using Kleene

53

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)

62xspeedup

2x speedup

What about sparse matrices (23)

bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of

G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix

bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering

for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid

bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse

multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on

separators) 54

What about sparse matrices (33)

bull If matrix stays very sparse lower bound unattainable new one

bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of

data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)

bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)

Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely

bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices

along dimensions most likely to minimize cost

55

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Symmetric Eigenproblem and SVD

bull Usual approach for A=AT (SVD similar)ndash A QTAQ = T where Q orthogonal T tridiagonalndash T UTTU = Λ where U orthogonal Λ diagonalndash QUrsquos columns are eigenvectors Λ eigenvaluesndash Dense Tridiagonal Diagonalndash Only half BLAS3 half BLAS2 in LAPACKrsquos sytrd

bull Communication-Avoiding Approachndash A QAQT = B where B=BT banded of bandwidth M12

ndash Continue as above starting with Bndash Dense Banded Tridiagonal Diagonalndash Dense Banded use TSQR to zero out M12 colsrows at a timendash Banded Tridiagonal need new(ish) idea

b+1

b+1

Successive Band Reduction (BischofLangSun)

1

b+1

b+1

d+1

c

Successive Band Reduction (BischofLangSun)

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1Q1

b+1

b+1

d+1

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

12

Q1

b+1

b+1

d+1

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

12

Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

Q1

Q1T

Q2

Q2T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

Q1

Q1T

Q2

Q2T

Q3

Q3T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

6

6

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

Conventional vs CA - SBR

Conventional Communication-Avoiding

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Nonsymmetric Eigenproblem

bull No apparent way to modify standard algorithmbull Instead Spectral Divide-and-Conquer

ndash Find orthogonal matrix Q whose leading columns span an invariant subspace of A

ndash QTAQ will be block upper triangular

ndash Apply recursively to A11 A22

ndash Depends on randomization1 Randomized Rank Revealing QR decomposition2 Randomized location to try splitting spectrum

A11 A12

ε A22

Attaining the Lower bounds: Sequential. Legend: [Existing] [Ours] [Math-Lib] [Random]
(Columns: Two Levels: Words, Messages; Memory Hierarchy: Words, Messages)
BLAS-3: [FLPR'99][BDLST'13][MKL etc.] / [FLPR'99][BDLST'13][MKL etc.]
Cholesky: [G'97][AP'00] / [LAPACK][BDHS'09] / [G'97][AP'00][BDHS'09] / [G'97][AP'00][BDHS'09]
Sym Indefinite: [BBDDDPSTY'13] / [BBDDDPSTY'13]
LU: [G'97][T'97] / [GDX'11][BDLST'13] / [GDX'11][BDLST'13] / [G'97][T'97] [BDLST'13] [BDLST'13]
QR: [EG'98][FW'03] / [DGHL'12][BDLST'13] / [FW'03][DGHL'12][BDLST'13] / [EG'98][FW'03][BDLST'13] / [FW'03][BDLST'13]
Rank Revealing QR: [BDD'11][DGGX'13]
Sym Eig & SVD: [BDD'11][BDK'13] / [BDD'11]
Non Sym Eig: [BDD'11] / [BDD'11]

Attaining the Lower bounds: Parallel 2D, M = Ω(n^2/P) (ignoring poly-log(P) factors; words = Ω(n^2/P^(1/2)), messages = Ω(P^(1/2)))
Legend: [Existing] [Ours] [Math-Lib] [Random]; columns: Words (BW), Messages (L), Saving factor
BLAS-3: [AGZ'94][MT'99][ScaLAPACK][C'69][vGW'97][SD'11] – L: n/P^(1/2)
Cholesky: [ScaLAPACK][T'99][SD'11] – L: n/P^(1/2)
Sym Indefinite: [BBDDDPSTY'13][ScaLAPACK] [BBDDDPSTY'13] – L: n/P^(1/2)
LU: [ScaLAPACK][GDX'11][T'99][SD'11] [GDX'11][T'99][SD'11] – L: n/P^(1/2)
QR: [ScaLAPACK][DGHL'12] [T'99] [DGHL'12][T'99] – L: n/P^(1/2)
Rank Revealing QR: [BDD'11][DGGX'13]
Sym Eig & SVD: [BDD'11][BDK'13][ScaLAPACK] [BDD'11][BDK'13] – L: n/P^(1/2)
Non-Sym Eig: [BDD'11] [BDD'11] – BW: P^(1/2), L: n
Attaining with extra memory (2.5D): M = Ω(c·n^2/P)

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
 – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
 – LU & QR (tournament pivoting)
 – Sparse matrices
 – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
 – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
 – Reorganizing Krylov methods – Conjugate Gradients
 – Stability challenges and approaches
 – What is a "sparse matrix"?
• Floating-point reproducibility
 – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
 – Does k SpMVs with A and starting vector
 – Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
 – Assume matrix "well-partitioned"
 – Serial implementation
   • Conventional: O(k) moves of data from slow to fast memory
   • New: O(1) moves of data, which is optimal
 – Parallel implementation on p processors
   • Conventional: O(k log p) messages (k SpMV calls, dot products)
   • New: O(log p) messages, which is optimal
• Lots of speedup possible (modeled and measured)
 – Price: some redundant computation (see the 1D stencil sketch below)
 – Challenges: poor partitioning, preconditioning, numerical stability

75
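
The parallel O(log p)-message claim rests on the matrix-powers kernel: fetch enough ghost data once, then do k local SpMVs with no further messages. Below is a toy single-process sketch of that idea for a 1D 3-point stencil; names like matrix_powers_block are invented for the illustration, and the real kernel handles general well-partitioned sparse matrices.

import numpy as np

def stencil_apply(v):
    # y_i = 2*v_i - v_{i-1} - v_{i+1}, with zero (Dirichlet) boundaries
    y = 2.0 * v
    y[1:] -= v[:-1]
    y[:-1] -= v[1:]
    return y

def matrix_powers_block(x, lo, hi, k):
    # One processor's share of [Ax, A^2 x, ..., A^k x] restricted to rows [lo, hi):
    # copy k ghost layers on each side once, then sweep locally k times.
    gl, gr = max(lo - k, 0), min(hi + k, x.size)
    w = x[gl:gr].copy()
    out = []
    for _ in range(k):
        w = stencil_apply(w)                 # outer ghost entries go stale, one layer per sweep,
        out.append(w[lo - gl:hi - gl].copy())  # but the owned entries remain exact
    return out

# Check against k global sweeps
x = np.random.default_rng(1).standard_normal(40)
full = [x]
for _ in range(3):
    full.append(stencil_apply(full[-1]))
blk = matrix_powers_block(x, 10, 20, 3)
print(all(np.allclose(blk[j], full[j+1][10:20]) for j in range(3)))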

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
 – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
 – LU & QR (tournament pivoting)
 – Sparse matrices
 – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
 – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
 – Reorganizing Krylov methods – Conjugate Gradients
 – Stability challenges and approaches
 – What is a "sparse matrix"?
• Floating-point reproducibility
 – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV
• n = 21,200; nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

77

Example: The Difficulty of Tuning
• n = 21,200; nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit mem_refs (see the CSR kernel sketch below)

78
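
For reference, a textbook CSR SpMV loop (plain Python, just to show the memory traffic): every nonzero costs one column-index load and one value load with little reuse of x, which is why exploiting dense substructure such as the 8x8 blocks above pays off. The helper spmv_csr and the random test matrix are illustrative, not from the talk.

import numpy as np
import scipy.sparse as sp

def spmv_csr(rowptr, colind, val, x):
    # y = A*x in CSR form: one index + one value per nonzero, irregular access to x
    n = len(rowptr) - 1
    y = np.zeros(n)
    for i in range(n):
        s = 0.0
        for k in range(rowptr[i], rowptr[i + 1]):
            s += val[k] * x[colind[k]]
        y[i] = s
    return y

A = sp.random(50, 50, density=0.1, format='csr', random_state=0)
x = np.ones(50)
print(np.allclose(spmv_csr(A.indptr, A.indices, A.data, x), A @ x))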

Speedups on Itanium 2: The Need for Search
[Figure: register-blocking performance profile, reference implementation vs best block size (4x2), in Mflops.]

79

Register Profile: Itanium 2
[Figure: heat map of SpMV performance over register block sizes, ranging from 190 Mflops to 1190 Mflops.]

80

Register Profiles: IBM and Intel IA-64 (Power3 - 17, Power4 - 16, Itanium 1 - 8, Itanium 2 - 33)
[Figure: four register-profile heat maps; annotated performance values are 122 and 252 Mflops (Power3), 459 and 820 Mflops (Power4), 107 and 247 Mflops (Itanium 1), and 190 Mflops and 1.2 Gflops (Itanium 2).]

Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16,614; NNZ = 1.1 M

82

Zoom in to top corner

• More complicated non-zero structure in general
• N = 16,614; NNZ = 1.1 M

83

3x3 blocks look natural, but…
• Example: 3x3 blocking – logical grid of 3x3 cells
• But would lead to lots of "fill-in"

84

Extra Work Can Improve Efficiency

• Example: 3x3 blocking
 – Logical grid of 3x3 cells
 – Fill in explicit zeros
 – Unroll 3x3 block multiplies
 – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup
 – Actual Mflop rate 1.5^2 = 2.25x higher (see the BSR sketch below)

85
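
The fill-in trade-off is easy to reproduce with scipy's block-sparse (BSR) format, shown here on a random stand-in matrix rather than the ex11 matrix: converting CSR to 3x3 blocks stores explicit zeros (more flops) but needs only one index per block (fewer memory references).

import numpy as np
import scipy.sparse as sp

A = sp.random(3000, 3000, density=0.001, format='csr', random_state=0)
B = A.tobsr(blocksize=(3, 3))          # pads partially filled 3x3 blocks with explicit zeros
x = np.ones(3000)
fill_ratio = B.data.size / A.nnz       # stored entries (incl. explicit zeros) / true nonzeros
print(fill_ratio, np.allclose(A @ x, B @ x))   # same product, more (but more regular) work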

Source Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal


87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

[Figure: spy plots; before: green + red, after: green + blue.]
2x speedups on Pentium 4, Power 4, … (see the RCM sketch below)
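
scipy exposes the RCM step directly; below is a sketch on a random symmetric stand-in (not the cavity matrix) showing how the permutation pulls nonzeros toward the diagonal, which is what creates the dense structure the blocking optimizations and the TSP micro-reordering then exploit. The bandwidth helper is invented for the illustration.

import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import reverse_cuthill_mckee

def bandwidth(M):
    C = M.tocoo()
    return int(np.abs(C.row - C.col).max())

A = sp.random(2000, 2000, density=0.002, format='csr', random_state=0)
A = (A + A.T).tocsr()                      # symmetric pattern
perm = reverse_cuthill_mckee(A, symmetric_mode=True)
B = A[perm, :][:, perm]                    # symmetrically permuted matrix
print(bandwidth(A), bandwidth(B))          # bandwidth drops sharply after RCM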

Summary of Other Performance Optimizations

• Optimizations for SpMV
 – Register blocking (RB): up to 4x over CSR
 – Reordering to create dense structure: 2x over CSR
 – Variable block splitting: 2.1x over CSR, 1.8x over RB
 – Diagonals: 2x over CSR
 – Symmetry: 2.8x over CSR, 2.6x over RB
 – Cache blocking: 2.8x over CSR
 – Multiple vectors (SpMM): 7x over CSR
 – And combinations…
• Sparse triangular solve
 – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
 – A·A^T·x, A^T·A·x: 4x over CSR, 1.8x over RB
 – More general kernels later…

90

Optimized Sparse Kernel Interface - OSKI

• Provides sparse kernels automatically tuned for user's matrix & machine
 – BLAS-style functionality: SpMV (A·x & A^T·y), TrSV
 – Does both off-line and run-time tuning
 – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
 – Available as stand-alone library
 – Available as PETSc extension
 – bebop.cs.berkeley.edu/oski
• pOSKI
 – Extension to multicore architectures
 – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
 – bebop.cs.berkeley.edu/poski

91

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
 – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
 – LU & QR (tournament pivoting)
 – Sparse matrices
 – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
 – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
 – Reorganizing Krylov methods – Conjugate Gradients
 – Stability challenges and approaches
 – What is a "sparse matrix"?
• Floating-point reproducibility
 – Despite nondeterminism/nonassociativity

93

Example: Classical Conjugate Gradient (CG)
[Figure: CG pseudocode; SpMVs and dot products require communication in each iteration.]

94

Example: CA-Conjugate Gradient
[Figure: CA-CG pseudocode; the SpMVs are performed via the CA matrix-powers kernel, one global reduction computes G, and local computations within the inner loop require no communication. A plain CG loop is sketched below for contrast.]
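
For contrast with the CA version, here is a standard CG loop (a dense numpy stand-in for the SpMV): each iteration performs one SpMV and two dot products, i.e. one neighbor exchange plus two global reductions, which is exactly the communication CA-CG batches into groups of s iterations. The function name cg and the 1D Poisson test are illustrative.

import numpy as np

def cg(A_mul, b, x0, tol=1e-10, maxiter=500):
    x = x0.copy()
    r = b - A_mul(x)
    p = r.copy()
    rs = r @ r
    for _ in range(maxiter):
        Ap = A_mul(p)              # SpMV: neighbor communication
        alpha = rs / (p @ Ap)      # dot product: global reduction
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r             # dot product: global reduction
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

n = 100
A = 2*np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # 1D Poisson
x = cg(lambda v: A @ v, np.ones(n), np.zeros(n))
print(np.linalg.norm(A @ x - np.ones(n)))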

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
 – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
 – LU & QR (tournament pivoting)
 – Sparse matrices
 – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
 – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
 – Reorganizing Krylov methods – Conjugate Gradients
 – Stability challenges and approaches
 – What is a "sparse matrix"?
• Floating-point reproducibility
 – Despite nondeterminism/nonassociativity

96

[Figure: convergence of CG vs CA-CG (monomial basis) on the model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ~ 400. CA-CG shows slower convergence and loss of accuracy due to roundoff relative to machine precision; at s = 16 the monomial basis is rank deficient and the method breaks down. See the conditioning sketch below.]

97
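
The breakdown can be reproduced directly: build the model problem from the plot (2D Poisson, 5-point stencil, 30x30 grid, cond(A) ~ 400) and watch the condition number of the normalized monomial basis [p, Ap, ..., A^s p] grow toward 1/eps as s approaches 16; Newton or Chebyshev bases are the usual fix. The dense construction below is a small illustrative stand-in for the actual CA-CG run.

import numpy as np

m = 30
T = 2*np.eye(m) - np.eye(m, k=1) - np.eye(m, k=-1)
A = np.kron(np.eye(m), T) + np.kron(T, np.eye(m))    # 2D Poisson, 5-point stencil, 30x30 grid
v = np.ones(m*m); v /= np.linalg.norm(v)
V = [v]
for s in range(1, 17):
    w = A @ V[-1]
    V.append(w / np.linalg.norm(w))
    if s in (4, 8, 12, 16):
        print(s, np.linalg.cond(np.column_stack(V)))  # grows rapidly toward 1/eps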

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
 – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
 – LU & QR (tournament pivoting)
 – Sparse matrices
 – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
 – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
 – Reorganizing Krylov methods – Conjugate Gradients
 – Stability challenges and approaches
 – What is a "sparse matrix"?
• Floating-point reproducibility
 – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?
• Requires o(n^2) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be a sum of "sparse" matrices
 – Ex: A = sparse + low rank = S + U·D·V^T, D small & square
• Semiseparable matrices arise as preconditioners
 – Need to write A^k = (S + U·D·V^T)^k as a sum of S^k and low-rank matrices (see the operator sketch below)
Taxonomy (rows: nonzero entries, columns: indices):
 – Explicit entries, explicit indices (O(nnz)): CSR and variations
 – Explicit entries, implicit indices (o(nnz)): vision, climate, AMR, …
 – Implicit entries, explicit indices: Graph Laplacian
 – Implicit entries, implicit indices: stencils
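
A sketch of the "sparse + low rank" case with scipy's LinearOperator (sizes and random data are arbitrary stand-ins; the names S, U, D, V follow the slide): A = S + U·D·V^T is applied without ever forming a dense n x n matrix, which is the form a Krylov method or a matrix-powers kernel would consume.

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator

n, r = 10000, 5
S = sp.random(n, n, density=1e-4, format='csr', random_state=0)
U = np.random.default_rng(1).standard_normal((n, r))
D = np.diag(np.arange(1.0, r + 1))
V = np.random.default_rng(2).standard_normal((n, r))
A = LinearOperator((n, n), matvec=lambda x: S @ x + U @ (D @ (V.T @ x)))

x = np.ones(n)
for _ in range(3):            # A^3 * ones, one implicit "SpMV" at a time
    x = A.matvec(x)
print(x.shape, np.isfinite(x).all())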

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
 – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
 – LU & QR (tournament pivoting)
 – Sparse matrices
 – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
 – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
 – Reorganizing Krylov methods – Conjugate Gradients
 – Stability challenges and approaches
 – What is a "sparse matrix"?
• Floating-point reproducibility
 – Despite nondeterminism/nonassociativity

101

• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
 – From Kai Diethelm at GNS-MBH
 – Sought reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe results
 – Willing to sacrifice 40%-50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
 – Most: "What? How will I debug without reproducibility?"
 – Few: "I know better and do careful error analysis"
 – S. Govindjee needs it for fracture simulations
 – S. Russell needs it for nuclear blast detection

Reproducible Floating Point Computation

Intel MKL non-reproducibility
[Figure: absolute error for random vectors (same magnitude, opposite signs) and relative error for orthogonal vectors (even the sign is not reproducible). Vector size 1e6, data aligned to 16-byte boundaries; for each input vector, dot products are computed using 1, 2, 3 or 4 threads; absolute error = maximum - minimum; relative error = absolute error / maximum absolute value. A toy demo of the underlying non-associativity follows.]

103
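
A toy numpy demo of the underlying issue (not MKL itself): summing the same 10^6 values in different orders, or with different accumulation strategies, gives different last bits because floating-point addition is not associative.

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(10**6)
s1 = np.sum(x)                    # numpy's pairwise summation
s2 = float(sum(x.tolist()))       # strict left-to-right summation
s3 = np.sum(np.sort(x))           # yet another order
print(s1 - s2, s1 - s3)           # typically small but nonzero differences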

Goals/Approaches for Reproducibility
• Consider summation or dot product
• Goals:
 1. Same answer, independent of layout, #processors, order of summands
 2. Good performance (scales well)
 3. Portable (assume IEEE 754 only)
 4. User can choose accuracy
• Approaches:
 – Guarantee fixed reduction tree (fails goals 2 or 3)
 – Use (very) high precision to get exact answer (fails goal 2)
 – Pre-rounding technique (Nguyen, D.); a toy sketch follows

104
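
A toy, single-sweep caricature of the pre-rounding idea (the actual Nguyen/Demmel algorithm, as in ReproBLAS, uses several bins, handles overflow/underflow, and needs only one reduction): snap every summand to a grid coarse enough that all later additions are exact, so the result no longer depends on summation order, trading a bounded amount of accuracy for reproducibility. The function prerounded_sum is invented for this sketch.

import numpy as np

def prerounded_sum(x):
    x = np.asarray(x, dtype=np.float64)
    n = x.size
    M = float(np.max(np.abs(x))) if n else 0.0
    if M == 0.0:
        return 0.0
    # grid = smallest power of two >= n*eps*M, so every partial sum of the rounded
    # values is an exact multiple of 'grid' that fits in a double -> exact additions
    grid = 2.0 ** np.ceil(np.log2(n * np.finfo(np.float64).eps * M))
    xr = np.round(x / grid) * grid
    return float(np.sum(xr))

x = np.random.default_rng(0).standard_normal(10**6)
print(prerounded_sum(x) == prerounded_sum(x[::-1]))    # True: bitwise identical
print(abs(prerounded_sum(x) - np.sum(np.sort(x))))     # small, bounded difference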

Performance results on 1024-proc Cray XC30: 1.2x to 3.2x slowdown vs fastest code for n=1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

106

Time to redesign all linear algebra, n-body, … algorithms and software

(and compilers)

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all "n3-like" linear algebra
  • Lower bound for all "n3-like" linear algebra (2)
  • Lower bound for all "n3-like" linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(i,j,k) = Σm A(i,j,m)·B(m,k)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a "Thm"
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a "sparse matrix"
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary
Page 40: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

Fixing TSLU

bull Run TSLU quickly test for stability fix if necessary (rare)

bull Test conditioning of U if not tiny (usual case) proceed elsebull Compute || L || if not big (usual case) proceed elsebull Factor A = QR using TSQR thenbull Factor Q = PLU using TSLU thenbull A = PL(UR) with UR as upper triangular factor

bull Last topic in lecture how to guarantee floating point reproducibility

42

2D CALU with Tournament Pivoting

43

25D CALU with Tournament Pivoting (c=4 copies)

44

Exascale Machine ParametersSource DOE Exascale Workshop

bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log

2 (

n2p

) =

log

2 (m

emo

ry_p

er_p

roc)

Up to 29x

25D vs 2D LUWith and Without Pivoting

Other CA algorithms for Ax=b least squares(13)

bull A symmetric and indefinitendash Seek factorization that retains symmetry PAPT = LDLT D

ldquosimplerdquobull Save frac12 flops preserve inertia

ndash Usual approach Bunch-Kaufmanbull D block diagonal with 1x1 and 2x2 blocksbull Pivot search down column along row (lots of communication)

ndash Alternative Aasenbull D = tridiagonal = Tbull Two steps

ndash PAPT = LTLT where T is banded using TSLU

48

0 0

0

0 0

0

0

hellip

hellip

ndash Solvefactor narrow band problem with Tbull Up to 28x faster than MKL Best Paper at IPDPSrsquo13

Other CA algorithms for Ax=b least squares (23)bull Minimizing bandwidth and latency for sequential GEPP

ndash So far could not do partial pivoting and minimize messages just words

ndash Challengebull Column layout good for choosing pivots bad for matmulbull Blocked layout good for matmul bad for choosing pivots

ndash Solution use both layouts switching between thembull ldquoShape Morphing LUrdquo or SMLU

49

bull func factor(A) if A has 1 column update it else factor(left half of A)

update right half of A

factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M)

bull func factor(A) if A has 1 column update it else factor(left half of A) reshape to recursive block format update right half of A reshape to columnwise format factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M32)

Other CA algorithms for Ax=b least squares (33)bull Need for pivoting arises beyond LU in QR

ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)

ndash Usual approach like Partial Pivoting

bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound

ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two

groups of b columns either using usual approach or something better (GuEisenstat)

bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix

ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 50

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What about sparse matrices (13)

bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then

bull But canrsquot reorder outer loop for 25D need another idea

bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring

bull Kleenersquos Algorithm

52

for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)

D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21

Performance of 25D APSP using Kleene

53

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)

62xspeedup

2x speedup

What about sparse matrices (23)

bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of

G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix

bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering

for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid

bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse

multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on

separators) 54

What about sparse matrices (33)

bull If matrix stays very sparse lower bound unattainable new one

bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of

data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)

bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)

Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely

bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices

along dimensions most likely to minimize cost

55

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Symmetric Eigenproblem and SVD

bull Usual approach for A=AT (SVD similar)ndash A QTAQ = T where Q orthogonal T tridiagonalndash T UTTU = Λ where U orthogonal Λ diagonalndash QUrsquos columns are eigenvectors Λ eigenvaluesndash Dense Tridiagonal Diagonalndash Only half BLAS3 half BLAS2 in LAPACKrsquos sytrd

bull Communication-Avoiding Approachndash A QAQT = B where B=BT banded of bandwidth M12

ndash Continue as above starting with Bndash Dense Banded Tridiagonal Diagonalndash Dense Banded use TSQR to zero out M12 colsrows at a timendash Banded Tridiagonal need new(ish) idea

b+1

b+1

Successive Band Reduction (BischofLangSun)

1

b+1

b+1

d+1

c

Successive Band Reduction (BischofLangSun)

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1Q1

b+1

b+1

d+1

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

12

Q1

b+1

b+1

d+1

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

12

Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

Q1

Q1T

Q2

Q2T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

Q1

Q1T

Q2

Q2T

Q3

Q3T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

6

6

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

Conventional vs CA - SBR

Conventional Communication-Avoiding

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Nonsymmetric Eigenproblem

bull No apparent way to modify standard algorithmbull Instead Spectral Divide-and-Conquer

ndash Find orthogonal matrix Q whose leading columns span an invariant subspace of A

ndash QTAQ will be block upper triangular

ndash Apply recursively to A11 A22

ndash Depends on randomization1 Randomized Rank Revealing QR decomposition2 Randomized location to try splitting spectrum

A11 A12

ε A22

Attaining the Lower bounds SequentialLegend[Existing][Ours][Math-Lib][Random]

Two Levels Memory Hierarchy

Words Messages Words Messages

BLAS-3 [FLPRrsquo99][BDLSTrsquo13][MKL etc] [FLPRrsquo99][BDLSTrsquo13][MKL etc]

Cholesky[Grsquo97][APrsquo00]

[LAPACK][BDHSrsquo09]

[Grsquo97][APrsquo00][BDHSrsquo09] [Grsquo97][APrsquo00][BDHSrsquo09]

Sym Indefinite [BBDDDPSTYrsquo13] [BBDDDPSTYrsquo13]

LU[Grsquo97][Trsquo97]

[GDXrsquo11][BDLSTrsquo13]

[GDXrsquo11][BDLSTrsquo13]

[Grsquo97][Trsquo97] [BDLSTrsquo13] [BDLSTrsquo13]

QR[EGrsquo98][FWrsquo03]

[DGHLrsquo12][BDLSTrsquo13]

[FWrsquo03][DGHLrsquo12][BDLSTrsquo13]

[EGrsquo98][FWrsquo03][BDLSTrsquo13]

[FWrsquo03][BDLSTrsquo13]

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13] [BDDrsquo11]

Non Sym Eig [BDDrsquo11] [BDDrsquo11]

Legend[Existing][Ours][Math-Lib][Random]

Words (BW) Messages (L) Saving factor

BLAS-3 [AGZrsquo94][MTrsquo99][ScaLAPACK][Crsquo69][vGWrsquo97][SDrsquo11] L nP12

Cholesky [ScaLAPACK][Trsquo99][SDrsquo11] L nP12

Sym Indefinite [BBDDDPSTYrsquo13][ScaLAPACK] [BBDDDPSTYrsquo13] L nP12

LU [ScaLAPACK][GDXrsquo11][Trsquo99][SDrsquo11] [GDXrsquo11][Trsquo99][SDrsquo11] L nP12

QR [ScaLAPACK][DGHLrsquo12] [Trsquo99] [DGHLrsquo12][Trsquo99] L nP12

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13][ScaLAPACK] [BDDrsquo11][BDKrsquo13] L nP12

Non-Sym Eig [BDDrsquo11] [BDDrsquo11] BW P12 L n

Attaining with extra memory 25D M=(cn2P)

Attaining the Lower bounds Parallel 2DM=(n2P)(Ignoring poly-log(P) factors words = ( n2 P12) messages = (P12)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Conjugate Gradients (CG) GMRES Lanczos Arnoldi hellip bull Goal minimize communication

ndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computationndash Challenges Poor partitioning Preconditioning Num Stability

75

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Example The Difficulty of Tuning SpMV

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

77

Example The Difficulty of Tuning

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

bull 8x8 dense substructure exploit this to limit mem_refs

78

Speedups on Itanium 2 The Need for Search

Reference

Best 4x2

Mflops

Mflops

79

Register Profile: Itanium 2
[Heat map of SpMV performance over register block sizes shown on slide, ranging from 190 Mflop/s to 1190 Mflop/s.]

Register Profiles: IBM and Intel IA-64
[Heat maps of SpMV performance over register block sizes shown on slide. Power3 - 17: 122 to 252 Mflop/s; Power4 - 16: 459 to 820 Mflop/s; Itanium 1 - 8: 107 to 247 Mflop/s; Itanium 2 - 33: 190 Mflop/s to 1.2 Gflop/s.]

Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16,614
• NNZ = 1.1 M

Zoom in to top corner

• More complicated non-zero structure in general
• N = 16,614
• NNZ = 1.1 M

3x3 blocks look natural but…
• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

Extra Work Can Improve Efficiency

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup
  – Actual Mflop rate is 1.5² = 2.25x higher; the extra factor of 1.5 is spent multiplying the explicitly stored zeros, which leaves a 1.5x speedup on useful work
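
For a rough feel of this trade-off, the sketch below uses scipy's Block CSR (BSR) format; the random test matrix and the 3x3 block size are assumptions for illustration, not the raefsky or Ex11 matrices from the slides.

    import numpy as np
    import scipy.sparse as sp

    n = 3000                                          # divisible by the block size
    A = sp.random(n, n, density=1e-3, format="csr", random_state=0) + sp.eye(n, format="csr")
    x = np.random.rand(n)

    # Register blocking: store A with 3x3 blocks; entries missing inside a
    # block are filled in as explicit zeros.
    A_bsr = sp.bsr_matrix(A, blocksize=(3, 3))

    # Fill ratio = stored values (including explicit zeros) / true nonzeros.
    fill_ratio = A_bsr.data.size / A.nnz
    print("fill ratio:", round(fill_ratio, 2))

    # Same y = A x either way; the blocked format does fill_ratio times more
    # flops, but with fewer index loads and unrolled small block multiplies.
    assert np.allclose(A @ x, A_bsr @ x)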

Source: Accelerator Cavity Design Problem (Ko via Husbands)
[Sparsity plot of the matrix shown on slide.]

100x100 Submatrix Along Diagonal

[Sparsity plot shown on slide.]

Post-RCM Reordering

[Sparsity plot shown on slide.]

Effect of Combined RCM+TSP Reordering
[Sparsity plots shown on slide: before = green + red, after = green + blue.]
2x speedups on Pentium 4, Power 4, …

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·A^T·x, A^T·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later…
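
The "multiple vectors (SpMM)" entry is easy to reproduce in a rough way: applying A to a block of vectors streams the matrix once instead of once per vector. The sketch below is a toy timing (the stand-in tridiagonal matrix and k = 8 vectors are assumptions), not the tuned kernels behind the numbers above.

    import time
    import numpy as np
    import scipy.sparse as sp

    n, k = 200_000, 8
    A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format="csr")  # stand-in matrix
    X = np.random.rand(n, k)

    t0 = time.perf_counter()
    Y1 = np.column_stack([A @ X[:, j] for j in range(k)])   # k SpMVs: A streamed k times
    t1 = time.perf_counter()
    Y2 = A @ X                                              # one SpMM: A streamed once
    t2 = time.perf_counter()

    assert np.allclose(Y1, Y2)
    print(f"{k} SpMVs: {t1 - t0:.3f} s   SpMM: {t2 - t1:.3f} s")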

Optimized Sparse Kernel Interface - OSKI

• Provides sparse kernels automatically tuned for user's matrix & machine
  – BLAS-style functionality: SpMV (A·x and A^T·y), TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as stand-alone library
  – Available as PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: Classical Conjugate Gradient (CG)
[Algorithm listing shown on slide.] The SpMVs and dot products require communication in each iteration.
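
Since the algorithm itself is only an image on the slide, here is a plain serial numpy version of textbook CG (the well-conditioned SPD stand-in matrix and the tolerance are assumptions), with comments marking the operations that communicate in a parallel run: one SpMV and two dot-product reductions per iteration. This is the baseline that CA-CG reorganizes; it is not the CA variant.

    import numpy as np
    import scipy.sparse as sp

    def cg(A, b, x0, tol=1e-8, maxiter=500):
        # Textbook CG; comments mark where a parallel implementation communicates.
        x = x0.copy()
        r = b - A @ x                      # SpMV: neighbor communication
        p = r.copy()
        rr = r @ r                         # dot product: global reduction
        for _ in range(maxiter):
            Ap = A @ p                     # one SpMV per iteration
            alpha = rr / (p @ Ap)          # dot product: global reduction
            x += alpha * p
            r -= alpha * Ap
            rr_new = r @ r                 # dot product: global reduction
            if np.sqrt(rr_new) < tol:
                break
            p = r + (rr_new / rr) * p
            rr = rr_new
        return x

    n = 10_000
    A = sp.diags([-1, 3, -1], [-1, 0, 1], shape=(n, n), format="csr")  # well-conditioned SPD stand-in
    b = np.ones(n)
    x = cg(A, b, np.zeros(n))
    print("final residual:", np.linalg.norm(b - A @ x))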

Example: CA-Conjugate Gradient
[Algorithm listing shown on slide.] The s-step basis is computed via the CA matrix-powers kernel, one global reduction computes the Gram matrix G, and the local computations within the inner loop require no communication.

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

[Convergence plot shown on slide, comparing CG and CA-CG with the monomial basis against machine precision. Model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400. CA-CG shows slower convergence and loss of accuracy due to roundoff; at s = 16 the monomial basis is numerically rank deficient and the method breaks down.]
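
The breakdown can be checked directly: the sketch below builds the same model problem (2D Poisson, 5-point stencil, 30x30 grid) and measures the condition number of the monomial s-step basis [p, Ap, ..., A^s p] for a random starting vector (the random vector and the particular s values are assumptions). The basis conditioning degrades rapidly with s and is near the reciprocal of machine precision around s = 16, consistent with the failure shown on the slide; better-conditioned bases (e.g. Newton or Chebyshev) are the usual remedy.

    import numpy as np
    import scipy.sparse as sp

    m = 30                                   # 30x30 grid, as on the slide
    T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(m, m))
    A = (sp.kron(sp.eye(m), T) + sp.kron(T, sp.eye(m))).tocsr()   # 2D Poisson, 5-point stencil
    print(f"cond(A) ≈ {np.linalg.cond(A.toarray()):.0f}")

    p = np.random.default_rng(1).standard_normal(A.shape[0])
    for s in (4, 8, 12, 16):
        V = np.empty((A.shape[0], s + 1))
        V[:, 0] = p
        for j in range(s):
            V[:, j + 1] = A @ V[:, j]        # monomial basis: p, Ap, A^2 p, ...
        print(f"s = {s:2d}   cond of monomial basis = {np.linalg.cond(V):.2e}")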

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?
• Requires o(n²) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be a sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + U·D·V^T, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write A^k = (S + U·D·V^T)^k as a sum of S^k and low-rank matrices

                               Indices explicit (O(nnz))   Indices implicit (o(nnz))
  Nonzero entries explicit:    CSR and variations           Vision, climate, AMR, …
  Nonzero entries implicit:    Graph Laplacian              Stencils
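
A small sketch of the "sparse + low rank" representation mentioned above: A = S + U·D·V^T is applied to a vector, and by repetition A^k·x is formed, without ever materializing a dense n-by-n matrix. The sizes, the random S, U, D, V, and the helper names apply_A / apply_A_power are all illustrative assumptions; how to keep the low-rank pieces separate from the powers of S across the k applications is exactly the question raised on the slide.

    import numpy as np
    import scipy.sparse as sp

    rng = np.random.default_rng(2)
    n, r = 5000, 10                           # illustrative sizes
    S = sp.random(n, n, density=1e-3, format="csr", random_state=3)   # sparse part
    U = rng.standard_normal((n, r))
    D = rng.standard_normal((r, r))           # small & square, as on the slide
    V = rng.standard_normal((n, r))

    def apply_A(x):
        # A = S + U D V^T applied without forming the dense matrix:
        return S @ x + U @ (D @ (V.T @ x))

    def apply_A_power(x, k):
        # A^k x, as a matrix-powers kernel needs: k sparse + low-rank applies.
        for _ in range(k):
            x = apply_A(x)
        return x

    x = rng.standard_normal(n)
    print(apply_A_power(x, 3)[:3])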

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Reproducible Floating Point Computation
• Get a bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe the results
  – Willing to sacrifice 40% - 50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What? How will I debug without reproducibility?"
  – Few: "I know better and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility
[Plots shown on slide: absolute error for dot products of random vectors (errors of the same magnitude but opposite signs) and relative error for dot products of orthogonal vectors (even the sign is not reproducible).]
Vector size 1e6, data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum – minimum
• Relative error = absolute error / maximum absolute value
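
The effect is easy to see without MKL: floating-point addition is not associative, so splitting the same dot product across a different number of "threads" (here simulated by chunking the reduction; the vector length matches the slide, everything else is an assumption) changes the rounding and hence the bits of the result.

    import numpy as np

    rng = np.random.default_rng(4)
    n = 1_000_000
    x = rng.standard_normal(n)
    y = rng.standard_normal(n)
    xy = x * y

    results = []
    for nthreads in (1, 2, 3, 4):
        # Each "thread" sums a contiguous chunk; the partial sums are then combined.
        # Different chunkings round differently, so the answers differ.
        chunks = np.array_split(xy, nthreads)
        results.append(sum(float(np.sum(c)) for c in chunks))

    abs_err = max(results) - min(results)
    rel_err = abs_err / max(abs(r) for r in results)
    print(results)
    print("absolute error:", abs_err, "  relative error:", rel_err)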

Goals/Approaches for Reproducibility
• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee a fixed reduction tree (not 2 or 3)
  – Use (very) high precision to get the exact answer (not 2)
  – Prerounding technique (Nguyen, D.)
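
As a toy illustration of the pre-rounding approach (a single-bin simplification assumed here for clarity; it is not the Nguyen/Demmel ReproBLAS algorithm), the summands are first rounded to a common power-of-two quantum chosen so that every partial sum is exact. The result is then independent of the order of summation, at the price of a bounded absolute error.

    import math
    import numpy as np

    def reproducible_sum(x):
        # Toy one-level pre-rounding: order-independent result, bounded error.
        x = np.asarray(x, dtype=np.float64)
        m = float(np.max(np.abs(x)))
        if m == 0.0:
            return 0.0
        n = x.size
        # Power-of-two quantum with n * m <= 2^52 * delta, so every partial sum
        # of the pre-rounded values is an integer multiple of delta below 2^53,
        # hence computed exactly in any order.
        delta = 2.0 ** math.ceil(math.log2(n * m)) * 2.0 ** -52
        q = np.rint(x / delta)               # exact: delta is a power of two
        return float(np.sum(q)) * delta      # exact accumulation and scaling

    x = np.random.default_rng(5).standard_normal(1_000_000)
    s1 = reproducible_sum(x)                 # any order / chunking gives the same bits
    s2 = reproducible_sum(x[::-1])
    assert s1 == s2
    print(s1, "  error vs np.sum:", s1 - np.sum(x))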

Performance results on 1024-proc Cray XC30: 1.2x to 3.2x slowdown vs. fastest code, for n = 1M.

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary
Time to redesign all linear algebra, n-body, … algorithms and software (and compilers).
Don't Communic…

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all "n³-like" linear algebra
  • Lower bound for all "n³-like" linear algebra (2)
  • Lower bound for all "n³-like" linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 2.5D Matrix Multiplication
  • 2.5D Matrix Multiplication (2)
  • 2.5D Matmul on BG/P, 16K nodes / 64K cores (2)
  • Perfect Strong Scaling – in Time and Energy (1/2)
  • Perfect Strong Scaling – in Time and Energy (2/2)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(i,j,k) = Σm A(i,j,m)·B(m,k)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a "Thm"?
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 2.5D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 2.5D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b, least squares (1/3)
  • Other CA algorithms for Ax=b, least squares (2/3)
  • Other CA algorithms for Ax=b, least squares (3/3)
  • Outline (5)
  • What about sparse matrices? (1/3)
  • Performance of 2.5D APSP using Kleene
  • What about sparse matrices? (2/3)
  • What about sparse matrices? (3/3)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds: Parallel 2D, M = Θ(n²/P) (Ignoring poly-log(P) factors)
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a "sparse matrix"?
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • Goals/Approaches for Reproducibility
  • Performance results on 1024 proc Cray XC30: 1.2x to 3.2x slowdown
  • Collaborators and Supporters
  • Summary
Page 42: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

25D CALU with Tournament Pivoting (c=4 copies)

44

Exascale Machine ParametersSource DOE Exascale Workshop

bull 2^20 1000000 nodesbull 1024 coresnode (a billion cores)bull 100 GBsec interconnect bandwidthbull 400 GBsec DRAM bandwidthbull 1 microsec interconnect latencybull 50 nanosec memory latencybull 32 Petabytes of memorybull 12 GB total L1 on a node

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log

2 (

n2p

) =

log

2 (m

emo

ry_p

er_p

roc)

Up to 29x

25D vs 2D LUWith and Without Pivoting

Other CA algorithms for Ax=b least squares(13)

bull A symmetric and indefinitendash Seek factorization that retains symmetry PAPT = LDLT D

ldquosimplerdquobull Save frac12 flops preserve inertia

ndash Usual approach Bunch-Kaufmanbull D block diagonal with 1x1 and 2x2 blocksbull Pivot search down column along row (lots of communication)

ndash Alternative Aasenbull D = tridiagonal = Tbull Two steps

ndash PAPT = LTLT where T is banded using TSLU

48

0 0

0

0 0

0

0

hellip

hellip

ndash Solvefactor narrow band problem with Tbull Up to 28x faster than MKL Best Paper at IPDPSrsquo13

Other CA algorithms for Ax=b least squares (23)bull Minimizing bandwidth and latency for sequential GEPP

ndash So far could not do partial pivoting and minimize messages just words

ndash Challengebull Column layout good for choosing pivots bad for matmulbull Blocked layout good for matmul bad for choosing pivots

ndash Solution use both layouts switching between thembull ldquoShape Morphing LUrdquo or SMLU

49

bull func factor(A) if A has 1 column update it else factor(left half of A)

update right half of A

factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M)

bull func factor(A) if A has 1 column update it else factor(left half of A) reshape to recursive block format update right half of A reshape to columnwise format factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M32)

Other CA algorithms for Ax=b least squares (33)bull Need for pivoting arises beyond LU in QR

ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)

ndash Usual approach like Partial Pivoting

bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound

ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two

groups of b columns either using usual approach or something better (GuEisenstat)

bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix

ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 50

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What about sparse matrices (13)

bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then

bull But canrsquot reorder outer loop for 25D need another idea

bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring

bull Kleenersquos Algorithm

52

for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)

D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21

Performance of 25D APSP using Kleene

53

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)

62xspeedup

2x speedup

What about sparse matrices (23)

bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of

G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix

bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering

for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid

bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse

multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on

separators) 54

What about sparse matrices (33)

bull If matrix stays very sparse lower bound unattainable new one

bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of

data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)

bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)

Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely

bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices

along dimensions most likely to minimize cost

55

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Symmetric Eigenproblem and SVD

bull Usual approach for A=AT (SVD similar)ndash A QTAQ = T where Q orthogonal T tridiagonalndash T UTTU = Λ where U orthogonal Λ diagonalndash QUrsquos columns are eigenvectors Λ eigenvaluesndash Dense Tridiagonal Diagonalndash Only half BLAS3 half BLAS2 in LAPACKrsquos sytrd

bull Communication-Avoiding Approachndash A QAQT = B where B=BT banded of bandwidth M12

ndash Continue as above starting with Bndash Dense Banded Tridiagonal Diagonalndash Dense Banded use TSQR to zero out M12 colsrows at a timendash Banded Tridiagonal need new(ish) idea

b+1

b+1

Successive Band Reduction (BischofLangSun)

1

b+1

b+1

d+1

c

Successive Band Reduction (BischofLangSun)

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1Q1

b+1

b+1

d+1

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

12

Q1

b+1

b+1

d+1

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

12

Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

Q1

Q1T

Q2

Q2T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

Q1

Q1T

Q2

Q2T

Q3

Q3T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

6

6

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

Conventional vs CA - SBR

Conventional Communication-Avoiding

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Nonsymmetric Eigenproblem

bull No apparent way to modify standard algorithmbull Instead Spectral Divide-and-Conquer

ndash Find orthogonal matrix Q whose leading columns span an invariant subspace of A

ndash QTAQ will be block upper triangular

ndash Apply recursively to A11 A22

ndash Depends on randomization1 Randomized Rank Revealing QR decomposition2 Randomized location to try splitting spectrum

A11 A12

ε A22

Attaining the Lower bounds SequentialLegend[Existing][Ours][Math-Lib][Random]

Two Levels Memory Hierarchy

Words Messages Words Messages

BLAS-3 [FLPRrsquo99][BDLSTrsquo13][MKL etc] [FLPRrsquo99][BDLSTrsquo13][MKL etc]

Cholesky[Grsquo97][APrsquo00]

[LAPACK][BDHSrsquo09]

[Grsquo97][APrsquo00][BDHSrsquo09] [Grsquo97][APrsquo00][BDHSrsquo09]

Sym Indefinite [BBDDDPSTYrsquo13] [BBDDDPSTYrsquo13]

LU[Grsquo97][Trsquo97]

[GDXrsquo11][BDLSTrsquo13]

[GDXrsquo11][BDLSTrsquo13]

[Grsquo97][Trsquo97] [BDLSTrsquo13] [BDLSTrsquo13]

QR[EGrsquo98][FWrsquo03]

[DGHLrsquo12][BDLSTrsquo13]

[FWrsquo03][DGHLrsquo12][BDLSTrsquo13]

[EGrsquo98][FWrsquo03][BDLSTrsquo13]

[FWrsquo03][BDLSTrsquo13]

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13] [BDDrsquo11]

Non Sym Eig [BDDrsquo11] [BDDrsquo11]

Legend[Existing][Ours][Math-Lib][Random]

Words (BW) Messages (L) Saving factor

BLAS-3 [AGZrsquo94][MTrsquo99][ScaLAPACK][Crsquo69][vGWrsquo97][SDrsquo11] L nP12

Cholesky [ScaLAPACK][Trsquo99][SDrsquo11] L nP12

Sym Indefinite [BBDDDPSTYrsquo13][ScaLAPACK] [BBDDDPSTYrsquo13] L nP12

LU [ScaLAPACK][GDXrsquo11][Trsquo99][SDrsquo11] [GDXrsquo11][Trsquo99][SDrsquo11] L nP12

QR [ScaLAPACK][DGHLrsquo12] [Trsquo99] [DGHLrsquo12][Trsquo99] L nP12

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13][ScaLAPACK] [BDDrsquo11][BDKrsquo13] L nP12

Non-Sym Eig [BDDrsquo11] [BDDrsquo11] BW P12 L n

Attaining with extra memory 25D M=(cn2P)

Attaining the Lower bounds Parallel 2DM=(n2P)(Ignoring poly-log(P) factors words = ( n2 P12) messages = (P12)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Conjugate Gradients (CG) GMRES Lanczos Arnoldi hellip bull Goal minimize communication

ndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computationndash Challenges Poor partitioning Preconditioning Num Stability

75

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Example The Difficulty of Tuning SpMV

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

77

Example The Difficulty of Tuning

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

bull 8x8 dense substructure exploit this to limit mem_refs

78

Speedups on Itanium 2 The Need for Search

Reference

Best 4x2

Mflops

Mflops

79

Register Profile Itanium 2

190 Mflops

1190 Mflops

80

Register Profiles IBM and Intel IA-64Power3 - 17 Power4 - 16

Itanium 2 - 33Itanium 1 - 8

252 Mflops

122 Mflops

820 Mflops

459 Mflops

247 Mflops

107 Mflops

12 Gflops

190 Mflops

Another example of tuning challenges for SpMV

bull Ex11 matrix (fluid flow)

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

82

Zoom in to top corner

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

83

3x3 blocks look natural buthellip

bull Example 3x3 blockingndash Logical grid of 3x3 cells

bull But would lead to lots of ldquofill-inrdquo

84

Extra Work Can Improve Efficiency

bull Example 3x3 blockingndash Logical grid of 3x3 cellsndash Fill-in explicit zerosndash Unroll 3x3 block multipliesndash ldquoFill ratiordquo = 15

bull On Pentium III 15x speedup

ndash Actual mflop rate 152 = 225 higher

85

Source Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal

Summer School Lecture 7

87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

Before Green + RedAfter Green + Blue

Summer School Lecture 7

892x speedups on Pentium 4 Power 4 hellip

Summary of Other Performance Optimizations

bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip

bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR

bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip

90

Optimized Sparse Kernel Interface - OSKI

bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning

bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski

bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression

software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki

91

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

93

Example Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in

each iteration

via CA Matrix Powers Kernel

Global reduction to compute G

94

Example CA-Conjugate Gradient

Local computations within inner loop require

no communication

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

96

Slower convergence due

to roundoff

Loss of accuracy due to roundoff

At s = 16 monomial basis is rank deficient Method breaks down

Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400

CA-CG (monomial)CG

machine precision

97

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What is a "sparse matrix"?
• Requires o(n^2) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + UDV^T, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write A^k = (S + UDV^T)^k as sum of S^k and low-rank matrices

                              Indices explicit (O(nnz))    Indices implicit (o(nnz))
  Nonzero entries explicit    CSR and variations           Vision, climate, AMR, …
  Nonzero entries implicit    Graph Laplacian              Stencils
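To illustrate the "sparse + low rank" case, a small sketch (mine; the sizes and names are arbitrary assumptions) that applies A = S + UDV^T to a vector without ever forming the dense matrix:

    import numpy as np
    import scipy.sparse as sp

    n, r = 10000, 5
    S = sp.random(n, n, density=1e-4, format="csr", random_state=0)  # sparse part
    rng = np.random.default_rng(1)
    U = rng.standard_normal((n, r))
    V = rng.standard_normal((n, r))
    D = np.diag(np.arange(1.0, r + 1))                               # small & square

    x = np.ones(n)
    # y = A x with A = S + U D V^T, in O(nnz(S) + n r) work and storage
    y = S @ x + U @ (D @ (V.T @ x))

Repeated application gives the products A^k·x needed by a matrix powers kernel while keeping the sparse and low-rank pieces separate, in the spirit of the bullet above.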

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Reproducible Floating Point Computation
• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm at GNS-MBH
  – Sought reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe results
  – Willing to sacrifice 40% - 50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What? How will I debug without reproducibility?"
  – Few: "I know better and do careful error analysis"
  – S. Govindjee needs it for fracture simulations
  – S. Russell needs it for nuclear blast detection

Intel MKL non-reproducibility
[Plots: absolute error for random vectors (same magnitude, opposite signs); relative error for orthogonal vectors (sign not reproducible).]
Vector size 1e6, data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum – minimum
• Relative error = Absolute error / maximum absolute value
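The underlying effect is just floating-point non-associativity; a tiny sketch (mine, not MKL-specific) that mimics what different thread counts do by changing only the summation order:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(10**6)

    s1 = float(np.sum(x))                                  # one reduction order
    s2 = float(np.sum(x.reshape(1000, 1000).sum(axis=0)))  # a blocked order, like
                                                           # splitting work across threads
    print(s1 - s2)   # typically a small nonzero difference: fp addition is not associative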

Goals/Approaches for Reproducibility
• Consider summation or dot product
• Goals:
  1. Same answer independent of layout, processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (not 2 or 3)
  – Use (very) high precision to get exact answer (not 2)
  – Prerounding technique (Nguyen, D.)
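For intuition on the "use (very) high precision to get exact answer" approach, Python's math.fsum returns the correctly rounded sum of its inputs, so its result does not depend on the order of the summands; this is only an illustration of goal 1, not the prerounding technique of Nguyen and Demmel.

    import math
    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(10**6)

    a = math.fsum(x)              # correctly rounded sum
    b = math.fsum(np.flip(x))     # same values, reversed order
    print(a == b)                 # True: the exact sum is order independent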

Performance results on 1024 proc. Cray XC30: 1.2x to 3.2x slowdown vs fastest code for n=1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary
Don't Communic…

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)

Summary of Other Performance Optimizations

bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip

bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR

bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip

90

Optimized Sparse Kernel Interface - OSKI

bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning

bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski

bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression

software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki

91

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

93

Example Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in

each iteration

via CA Matrix Powers Kernel

Global reduction to compute G

94

Example CA-Conjugate Gradient

Local computations within inner loop require

no communication

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

96

Slower convergence due

to roundoff

Loss of accuracy due to roundoff

At s = 16 monomial basis is rank deficient Method breaks down

Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400

CA-CG (monomial)CG

machine precision

97

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit

bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square

bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank

matrices

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

101

bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010

ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver

demanded by customers (construction engineers) otherwise they donrsquot believe results

ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it

ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection

Reproducible Floating Point Computation

Absolute Error for Random Vectors

Same magnitude opposite signs

Intel MKL non-reproducibility

Relative Error for Orthogonal vectors

Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value

Sign notreproducible

103

bull Consider summation or dot productbull Goals

1 Same answer independent of layout processors order of summands

2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy

bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)

GoalsApproaches for Reproducibility

104

Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M

Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya

Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger

bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David

Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang

bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou

Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National

Instruments NEC Nokia NVIDIA Samsung Oracle

bull bebopcsberkeleyedu

Summary

Donrsquot Communichellip

106

Time to redesign all linear algebra n-body hellip algorithms and software

(and compilers)

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all ldquon3-likerdquo linear algebra
  • Lower bound for all ldquon3-likerdquo linear algebra (2)
  • Lower bound for all ldquon3-likerdquo linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a ldquoThmrdquo
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a ldquosparse matrixrdquo
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary
Page 44: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

Exascale predicted speedupsfor Gaussian Elimination

2D CA-LU vs ScaLAPACK-LU

log2 (p)

log

2 (

n2p

) =

log

2 (m

emo

ry_p

er_p

roc)

Up to 29x

25D vs 2D LUWith and Without Pivoting

Other CA algorithms for Ax=b least squares(13)

bull A symmetric and indefinitendash Seek factorization that retains symmetry PAPT = LDLT D

ldquosimplerdquobull Save frac12 flops preserve inertia

ndash Usual approach Bunch-Kaufmanbull D block diagonal with 1x1 and 2x2 blocksbull Pivot search down column along row (lots of communication)

ndash Alternative Aasenbull D = tridiagonal = Tbull Two steps

ndash PAPT = LTLT where T is banded using TSLU

48

0 0

0

0 0

0

0

hellip

hellip

ndash Solvefactor narrow band problem with Tbull Up to 28x faster than MKL Best Paper at IPDPSrsquo13

Other CA algorithms for Ax=b least squares (23)bull Minimizing bandwidth and latency for sequential GEPP

ndash So far could not do partial pivoting and minimize messages just words

ndash Challengebull Column layout good for choosing pivots bad for matmulbull Blocked layout good for matmul bad for choosing pivots

ndash Solution use both layouts switching between thembull ldquoShape Morphing LUrdquo or SMLU

49

bull func factor(A) if A has 1 column update it else factor(left half of A)

update right half of A

factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M)

bull func factor(A) if A has 1 column update it else factor(left half of A) reshape to recursive block format update right half of A reshape to columnwise format factor(right half of A)

bull Words = O(n3M12)

bull Messages = O(n3M32)

Other CA algorithms for Ax=b least squares (33)bull Need for pivoting arises beyond LU in QR

ndash Choose permutation P so that leading columns of AP = QR span column space of A ndash Rank Revealing QR (RRQR)

ndash Usual approach like Partial Pivoting

bull Put longest column first update rest of matrix repeatbull Hard to do using BLAS3 at all let alone hit lower bound

ndash Use Tournament Pivotingbull Each round of tournament selects best b columns from two

groups of b columns either using usual approach or something better (GuEisenstat)

bull Thm This approach ``reveals the rankrsquorsquo of A in the sense that the leading rxr submatrix of R has singular values ldquonearrdquo the largest r singular values of A ditto for trailing submatrix

ndash Idea extends to other pivoting schemesbull Cholesky with diagonal pivotingbull LU with complete pivotingbull LDLT with complete pivoting 50

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What about sparse matrices (13)

bull If matrix quickly becomes dense use dense algorithmbull Ex All Pairs Shortest Path using Floyd-Warshallbull Similar to matmul Let D = A then

bull But canrsquot reorder outer loop for 25D need another idea

bull Abbreviate D(ij) = min(D(ij)mink(A(ik)+B(kj)) by D = ABndash Dependencies ok 25D works just different semiring

bull Kleenersquos Algorithm

52

for k = 1n for i = 1n for j=1n D(ij) = min(D(ij) D(ik) + D(kj)

D = DC-APSP(An) D = A Partition D = [[D11D12][D21D22]] into n2 x n2 blocks D11 = DC-APSP(D11n2) D12 = D11 D12 D21 = D21 D11 D22 = D21 D12 D22 = DC-APSP(D22n2) D21 = D22 D21 D12 = D12 D22 D11 = D12 D21

Performance of 25D APSP using Kleene

53

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)

62xspeedup

2x speedup

What about sparse matrices (23)

bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of

G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix

bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering

for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid

bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse

multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on

separators) 54

What about sparse matrices (33)

bull If matrix stays very sparse lower bound unattainable new one

bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of

data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)

bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)

Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely

bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices

along dimensions most likely to minimize cost

55

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Symmetric Eigenproblem and SVD

bull Usual approach for A=AT (SVD similar)ndash A QTAQ = T where Q orthogonal T tridiagonalndash T UTTU = Λ where U orthogonal Λ diagonalndash QUrsquos columns are eigenvectors Λ eigenvaluesndash Dense Tridiagonal Diagonalndash Only half BLAS3 half BLAS2 in LAPACKrsquos sytrd

bull Communication-Avoiding Approachndash A QAQT = B where B=BT banded of bandwidth M12

ndash Continue as above starting with Bndash Dense Banded Tridiagonal Diagonalndash Dense Banded use TSQR to zero out M12 colsrows at a timendash Banded Tridiagonal need new(ish) idea

b+1

b+1

Successive Band Reduction (BischofLangSun)

1

b+1

b+1

d+1

c

Successive Band Reduction (BischofLangSun)

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1Q1

b+1

b+1

d+1

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

12

Q1

b+1

b+1

d+1

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

12

Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

Q1

Q1T

Q2

Q2T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

Q1

Q1T

Q2

Q2T

Q3

Q3T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

6

6

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

Conventional vs CA - SBR

Conventional Communication-Avoiding

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Nonsymmetric Eigenproblem

bull No apparent way to modify standard algorithmbull Instead Spectral Divide-and-Conquer

ndash Find orthogonal matrix Q whose leading columns span an invariant subspace of A

ndash QTAQ will be block upper triangular

ndash Apply recursively to A11 A22

ndash Depends on randomization1 Randomized Rank Revealing QR decomposition2 Randomized location to try splitting spectrum

A11 A12

ε A22

Attaining the Lower bounds SequentialLegend[Existing][Ours][Math-Lib][Random]

Two Levels Memory Hierarchy

Words Messages Words Messages

BLAS-3 [FLPRrsquo99][BDLSTrsquo13][MKL etc] [FLPRrsquo99][BDLSTrsquo13][MKL etc]

Cholesky[Grsquo97][APrsquo00]

[LAPACK][BDHSrsquo09]

[Grsquo97][APrsquo00][BDHSrsquo09] [Grsquo97][APrsquo00][BDHSrsquo09]

Sym Indefinite [BBDDDPSTYrsquo13] [BBDDDPSTYrsquo13]

LU[Grsquo97][Trsquo97]

[GDXrsquo11][BDLSTrsquo13]

[GDXrsquo11][BDLSTrsquo13]

[Grsquo97][Trsquo97] [BDLSTrsquo13] [BDLSTrsquo13]

QR[EGrsquo98][FWrsquo03]

[DGHLrsquo12][BDLSTrsquo13]

[FWrsquo03][DGHLrsquo12][BDLSTrsquo13]

[EGrsquo98][FWrsquo03][BDLSTrsquo13]

[FWrsquo03][BDLSTrsquo13]

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13] [BDDrsquo11]

Non Sym Eig [BDDrsquo11] [BDDrsquo11]

Legend[Existing][Ours][Math-Lib][Random]

Words (BW) Messages (L) Saving factor

BLAS-3 [AGZrsquo94][MTrsquo99][ScaLAPACK][Crsquo69][vGWrsquo97][SDrsquo11] L nP12

Cholesky [ScaLAPACK][Trsquo99][SDrsquo11] L nP12

Sym Indefinite [BBDDDPSTYrsquo13][ScaLAPACK] [BBDDDPSTYrsquo13] L nP12

LU [ScaLAPACK][GDXrsquo11][Trsquo99][SDrsquo11] [GDXrsquo11][Trsquo99][SDrsquo11] L nP12

QR [ScaLAPACK][DGHLrsquo12] [Trsquo99] [DGHLrsquo12][Trsquo99] L nP12

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13][ScaLAPACK] [BDDrsquo11][BDKrsquo13] L nP12

Non-Sym Eig [BDDrsquo11] [BDDrsquo11] BW P12 L n

Attaining with extra memory 25D M=(cn2P)

Attaining the Lower bounds Parallel 2DM=(n2P)(Ignoring poly-log(P) factors words = ( n2 P12) messages = (P12)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Conjugate Gradients (CG) GMRES Lanczos Arnoldi hellip bull Goal minimize communication

ndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computationndash Challenges Poor partitioning Preconditioning Num Stability

75

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Example The Difficulty of Tuning SpMV

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

77

Example The Difficulty of Tuning

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

bull 8x8 dense substructure exploit this to limit mem_refs

78

Speedups on Itanium 2 The Need for Search

Reference

Best 4x2

Mflops

Mflops

79

Register Profile Itanium 2

190 Mflops

1190 Mflops

80

Register Profiles IBM and Intel IA-64Power3 - 17 Power4 - 16

Itanium 2 - 33Itanium 1 - 8

252 Mflops

122 Mflops

820 Mflops

459 Mflops

247 Mflops

107 Mflops

12 Gflops

190 Mflops

Another example of tuning challenges for SpMV

bull Ex11 matrix (fluid flow)

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

82

Zoom in to top corner

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

83

3x3 blocks look natural buthellip

bull Example 3x3 blockingndash Logical grid of 3x3 cells

bull But would lead to lots of ldquofill-inrdquo

84

Extra Work Can Improve Efficiency

bull Example 3x3 blockingndash Logical grid of 3x3 cellsndash Fill-in explicit zerosndash Unroll 3x3 block multipliesndash ldquoFill ratiordquo = 15

bull On Pentium III 15x speedup

ndash Actual mflop rate 152 = 225 higher

85

Source Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal

Summer School Lecture 7

87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

Before Green + RedAfter Green + Blue

Summer School Lecture 7

892x speedups on Pentium 4 Power 4 hellip

Summary of Other Performance Optimizations

bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip

bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR

bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip

90

Optimized Sparse Kernel Interface - OSKI

bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning

bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski

bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression

software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki

91

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

93

Example Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in

each iteration

via CA Matrix Powers Kernel

Global reduction to compute G

94

Example CA-Conjugate Gradient

Local computations within inner loop require

no communication

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

96

Slower convergence due

to roundoff

Loss of accuracy due to roundoff

At s = 16 monomial basis is rank deficient Method breaks down

Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400

CA-CG (monomial)CG

machine precision

97

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit

bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square

bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank

matrices

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

101

bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010

ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver

demanded by customers (construction engineers) otherwise they donrsquot believe results

ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it

ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection

Reproducible Floating Point Computation

Absolute Error for Random Vectors

Same magnitude opposite signs

Intel MKL non-reproducibility

Relative Error for Orthogonal vectors

Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value

Sign notreproducible

103

bull Consider summation or dot productbull Goals

1 Same answer independent of layout processors order of summands

2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy

bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)

GoalsApproaches for Reproducibility

104

Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M

Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya

Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger

bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David

Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang

bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou

Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National

Instruments NEC Nokia NVIDIA Samsung Oracle

bull bebopcsberkeleyedu

Summary

Donrsquot Communichellip

106

Time to redesign all linear algebra n-body hellip algorithms and software

(and compilers)

Page 47: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

Other CA algorithms for Ax=b, least squares (2/3)

• Minimizing bandwidth and latency for sequential GEPP
  – So far, could not do partial pivoting and minimize #messages, just #words
  – Challenge:
    • Column layout good for choosing pivots, bad for matmul
    • Blocked layout good for matmul, bad for choosing pivots
  – Solution: use both layouts, switching between them
    • "Shape Morphing LU" or SMLU

49

• func factor(A)
    if A has 1 column, update it
    else
      factor(left half of A)
      update right half of A
      factor(right half of A)
• Words = O(n^3 / M^(1/2))
• Messages = O(n^3 / M)

• func factor(A)    (SMLU: with layout reshaping)
    if A has 1 column, update it
    else
      factor(left half of A)
      reshape to recursive block format
      update right half of A
      reshape to columnwise format
      factor(right half of A)
• Words = O(n^3 / M^(1/2))
• Messages = O(n^3 / M^(3/2))
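The recursion itself is easy to see in a dense, unpivoted setting; below is a minimal numpy sketch of the "func factor(A)" structure (my own illustration: real SMLU adds partial pivoting and the layout reshaping that delivers the message bound).

    import numpy as np

    def factor(A):
        # Recursive column-splitting LU; A is overwritten by L\U
        # (unit lower triangular L below the diagonal, U on and above it).
        m, n = A.shape
        if n == 1:
            A[1:, 0] /= A[0, 0]                       # update the single column
            return
        k = n // 2
        factor(A[:, :k])                              # factor(left half of A)
        L11 = np.tril(A[:k, :k], -1) + np.eye(k)      # unit-lower part of left factor
        A[:k, k:] = np.linalg.solve(L11, A[:k, k:])   # update right half of A ...
        A[k:, k:] -= A[k:, :k] @ A[:k, k:]            # ... (Schur complement)
        factor(A[k:, k:])                             # factor(right half of A)

For example, factoring a random square matrix in place and multiplying the resulting L and U recovers a copy of the original up to roundoff (no pivoting here, so it assumes nonzero leading minors).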

Other CA algorithms for Ax=b, least squares (3/3)

• Need for pivoting arises beyond LU, in QR
  – Choose permutation P so that leading columns of AP = QR span column space of A: Rank Revealing QR (RRQR)
  – Usual approach, like Partial Pivoting
    • Put longest column first, update rest of matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit lower bound
  – Use Tournament Pivoting
    • Each round of tournament selects best b columns from two groups of b columns, either using usual approach or something better (Gu/Eisenstat)
    • Thm: This approach "reveals the rank" of A, in the sense that the leading r x r submatrix of R has singular values "near" the largest r singular values of A; ditto for trailing submatrix
  – Idea extends to other pivoting schemes
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDL^T with complete pivoting

50
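A small sketch of the tournament for column selection (my own illustration: ordinary QR with column pivoting stands in for the stronger Gu/Eisenstat selection at each node of the reduction tree):

    import numpy as np
    from scipy.linalg import qr

    def best_b_columns(A, cols, b):
        # Local rank-revealing step: QR with column pivoting on the candidates.
        _, _, piv = qr(A[:, cols], mode='economic', pivoting=True)
        return [cols[p] for p in piv[:b]]

    def tournament_pivoting(A, b):
        # Returns indices of b selected columns of A via a reduction tree;
        # in the CA algorithms each "game" runs on a different processor.
        n = A.shape[1]
        groups = [list(range(i, min(i + b, n))) for i in range(0, n, b)]
        winners = [best_b_columns(A, g, b) for g in groups]
        while len(winners) > 1:
            nxt = []
            for i in range(0, len(winners), 2):
                merged = winners[i] + (winners[i + 1] if i + 1 < len(winners) else [])
                nxt.append(best_b_columns(A, merged, b))
            winners = nxt
        return winners[0]

The reduction has the same shape as TSQR's tree, which is what makes the selection communication-avoiding: each game only ever looks at 2b candidate columns.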

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods: Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What about sparse matrices? (1/3)

• If matrix quickly becomes dense, use dense algorithm
• Ex: All Pairs Shortest Path using Floyd-Warshall
• Similar to matmul: Let D = A, then:

    for k = 1:n
      for i = 1:n
        for j = 1:n
          D(i,j) = min(D(i,j), D(i,k) + D(k,j))

• But can't reorder outer loop for 2.5D, need another idea
• Abbreviate D(i,j) = min(D(i,j), min_k(A(i,k) + B(k,j))) by D = A*B
  – Dependencies ok, 2.5D works, just a different semiring
• Kleene's Algorithm:

    D = DC-APSP(A, n)
      D = A
      Partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
      D11 = DC-APSP(D11, n/2)
      D12 = D11 * D12
      D21 = D21 * D11
      D22 = D21 * D12
      D22 = DC-APSP(D22, n/2)
      D21 = D22 * D21
      D12 = D12 * D22
      D11 = D12 * D21

52
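A serial numpy sketch of this recursion over the (min,+) semiring (my own illustration; the 2.5D algorithm distributes the block products over a processor grid with replication). It assumes n is a power of two, a zero diagonal, and np.inf for missing edges.

    import numpy as np

    def minplus(D, A, B):
        # The slide's D = A*B: D(i,j) = min(D(i,j), min_k A(i,k) + B(k,j))
        return np.minimum(D, (A[:, :, None] + B[None, :, :]).min(axis=1))

    def dc_apsp(A):
        # Kleene's divide-and-conquer all-pairs shortest paths.
        D = A.copy()
        n = D.shape[0]
        if n == 1:
            return D
        h = n // 2
        D[:h, :h] = dc_apsp(D[:h, :h])                          # D11 = DC-APSP(D11)
        D[:h, h:] = minplus(D[:h, h:], D[:h, :h], D[:h, h:])    # D12 = D11 * D12
        D[h:, :h] = minplus(D[h:, :h], D[h:, :h], D[:h, :h])    # D21 = D21 * D11
        D[h:, h:] = minplus(D[h:, h:], D[h:, :h], D[:h, h:])    # D22 = D21 * D12
        D[h:, h:] = dc_apsp(D[h:, h:])                          # D22 = DC-APSP(D22)
        D[h:, :h] = minplus(D[h:, :h], D[h:, h:], D[h:, :h])    # D21 = D22 * D21
        D[:h, h:] = minplus(D[:h, h:], D[:h, h:], D[h:, h:])    # D12 = D12 * D22
        D[:h, :h] = minplus(D[:h, :h], D[:h, h:], D[h:, :h])    # D11 = D12 * D21
        return D

Checking against the triple-loop Floyd-Warshall above on a small graph with integer edge weights gives identical distance matrices.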

Performance of 2.5D APSP using Kleene

53

[Figure: strong scaling on Hopper (Cray XE6 with 1024 nodes = 24,576 cores); plot annotations: 6.2x speedup, 2x speedup.]

What about sparse matrices? (2/3)

• If parts of matrix become dense, optimize those
• Ex: Cholesky on matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): If all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on a w x w submatrix
• Thm: Words_moved = Ω(w^3 / M^(1/2)), etc.
• Thm (George '73): Nested dissection gives optimal ordering for 2D grid, 3D grid, similar matrices
  – w = n for 2D n x n grid, w = n^2 for 3D n x n x n grid
• Sequential multifrontal Cholesky attains bounds
• PSPACES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)

54
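To make the ordering concrete, here is a small sketch of nested dissection on a 2D grid graph (my own illustration: vertex (i,j) is a grid point, and each recursion numbers its separator last, which confines the dense work to the w x w separator blocks above):

    def nested_dissection_order(rows, cols):
        # Returns grid points in elimination order: each level orders its two
        # halves first and its separator (a full row or column) last.
        def rec(r0, r1, c0, c1):
            if r1 - r0 <= 2 or c1 - c0 <= 2:
                return [(i, j) for i in range(r0, r1) for j in range(c0, c1)]
            if c1 - c0 >= r1 - r0:                      # split along the longer side
                mid = (c0 + c1) // 2
                halves = rec(r0, r1, c0, mid) + rec(r0, r1, mid + 1, c1)
                sep = [(i, mid) for i in range(r0, r1)]
            else:
                mid = (r0 + r1) // 2
                halves = rec(r0, mid, c0, c1) + rec(mid + 1, r1, c0, c1)
                sep = [(mid, j) for j in range(c0, c1)]
            return halves + sep                          # separator eliminated last
        return rec(0, rows, 0, cols)

For an n x n grid the top-level separator has n vertices, matching w = n in the theorem above.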

What about sparse matrices? (3/3)
• If the matrix stays very sparse, the lower bound is unattainable; a new one:
• Ex: A, B both diagonal: no communication in the parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), i.i.d.
• Assumption: the algorithm is sparsity-independent: the assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
      Words_moved = Ω(min( d*n/P^(1/2), d^2*n/P ))
  – Proof exploits the fact that reuse of entries of C = A*B is unlikely
• Contrast the general lower bound: Words_moved = Ω(d^2*n/(P*M^(1/2)))
• Attained by a divide-and-conquer algorithm that splits the matrices along the dimensions most likely to minimize cost
55
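A small, hedged experiment (sizes and seeds are arbitrary) illustrating why reuse of C's entries is unlikely for Erdos-Renyi inputs: the number of scalar multiplies is close to the number of nonzeros produced in C, i.e. each output entry is updated roughly once.

    import numpy as np
    import scipy.sparse as sp

    n, d = 20000, 8                       # Prob(nonzero) = d/n, with d << sqrt(n)
    A = sp.random(n, n, density=d / n, format='csr', random_state=0)
    B = sp.random(n, n, density=d / n, format='csr', random_state=1)

    C = (A @ B).tocsr()
    C.eliminate_zeros()

    # Any classical algorithm performs sum_k nnz(A(:,k)) * nnz(B(k,:)) multiplies,
    # which is ~ d^2 * n in expectation for these inputs.
    mults = int(np.dot(A.getnnz(axis=0), B.getnnz(axis=1)))
    print("scalar multiplies :", mults)
    print("nonzeros of C     :", C.nnz)
    print("updates per C(i,j):", mults / max(C.nnz, 1))   # close to 1: almost no reuse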

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods: Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Symmetric Eigenproblem and SVD
• Usual approach for A = A^T (SVD similar):
  – A -> Q^T A Q = T, where Q is orthogonal, T tridiagonal
  – T -> U^T T U = Λ, where U is orthogonal, Λ diagonal
  – Q*U's columns are the eigenvectors, Λ's entries the eigenvalues
  – Dense -> Tridiagonal -> Diagonal
  – Only half BLAS3, half BLAS2, in LAPACK's sytrd
• Communication-Avoiding Approach:
  – A -> Q A Q^T = B, where B = B^T is banded, of bandwidth M^(1/2)
  – Continue as above, starting with B
  – Dense -> Banded -> Tridiagonal -> Diagonal
  – Dense -> Banded: use TSQR to zero out M^(1/2) cols/rows at a time
  – Banded -> Tridiagonal: need a new(ish) idea
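For reference, a minimal SciPy sketch of the conventional pipeline the slide starts from (Dense -> Tridiagonal -> Diagonal); the banded intermediate stage of the CA approach is not shown here, and hessenberg() of a symmetric matrix merely plays the role of sytrd.

    import numpy as np
    from scipy.linalg import hessenberg, eigh_tridiagonal, eigh

    rng = np.random.default_rng(0)
    X = rng.standard_normal((300, 300))
    A = (X + X.T) / 2                          # A = A^T

    # Step 1: A -> Q^T A Q = T (the Hessenberg form of a symmetric matrix
    # is tridiagonal; LAPACK's sytrd does this reduction directly).
    T, Q = hessenberg(A, calc_q=True)
    d = np.diag(T).copy()
    e = np.diag(T, -1).copy()

    # Step 2: T -> U^T T U = Lambda (tridiagonal eigensolver).
    lam, U = eigh_tridiagonal(d, e)

    # Eigenvectors of A are the columns of Q @ U.
    V = Q @ U
    print(np.allclose(np.sort(lam), eigh(A, eigvals_only=True)))
    print(np.allclose(A @ V, V * lam, atol=1e-8))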

Successive Band Reduction (Bischof/Lang/Sun)
[figure sequence: animation of successive band reduction; sweeps Q1, Q1^T, Q2, Q2^T, ..., Q5, Q5^T chase bulges (numbered 1-6) down the band; labels: b = bandwidth, c = #columns, d = #diagonals, constraint c+d <= b, with annotated block sizes b+1, d+1, c, d+c]

Conventional vs CA-SBR
  Conventional: touch all data 4 times        Communication-Avoiding: touch all data once
  [side-by-side animations of the two band-reduction schemes]

Speedups of Sym. Band Reduction vs DSBTRD
• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem
• No apparent way to modify the standard algorithm
• Instead: Spectral Divide-and-Conquer
  – Find an orthogonal matrix Q whose leading columns span an invariant subspace of A
  – Q^T A Q will be block upper triangular:
        [ A11  A12 ]
        [  ε   A22 ]
  – Apply recursively to A11, A22
  – Depends on randomization:
    1. Randomized Rank-Revealing QR decomposition
    2. Randomized location to try splitting the spectrum
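Not the randomized CA algorithm, but a hedged illustration of the target it computes: an orthogonal Q whose leading columns span an invariant subspace, so Q^T A Q is block upper triangular with a negligible (2,1) block. Here the subspace comes from an ordered real Schur form (left-half-plane eigenvalues first), purely to show the block structure the recursion works on.

    import numpy as np
    from scipy.linalg import schur

    rng = np.random.default_rng(0)
    A = rng.standard_normal((8, 8))

    # Ordered real Schur form: eigenvalues with negative real part are sorted
    # to the leading block, so the leading k Schur vectors span an invariant
    # subspace (the CA method would instead find Q via randomized RRQR).
    T, Q, k = schur(A, output='real', sort='lhp')

    B = Q.T @ A @ Q
    E = B[k:, :k]                          # the "epsilon" block of the slide
    print("k =", k, "  ||(2,1) block|| =", np.linalg.norm(E))
    # Recurse on B[:k, :k] and B[k:, k:] to continue the divide-and-conquer.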

Attaining the Lower Bounds: Sequential
Legend: [Existing], [Ours], [Math-Lib], [Random]
Columns: Two Levels (Words, Messages) | Memory Hierarchy (Words, Messages)
  BLAS-3:            [FLPR'99][BDLST'13][MKL etc.]    [FLPR'99][BDLST'13][MKL etc.]
  Cholesky:          [G'97][AP'00][LAPACK][BDHS'09]   [G'97][AP'00][BDHS'09]   [G'97][AP'00][BDHS'09]
  Sym. Indefinite:   [BBDDDPSTY'13]                   [BBDDDPSTY'13]
  LU:                [G'97][T'97][GDX'11][BDLST'13]   [GDX'11][BDLST'13]   [G'97][T'97][BDLST'13]   [BDLST'13]
  QR:                [EG'98][FW'03][DGHL'12][BDLST'13]   [FW'03][DGHL'12][BDLST'13]   [EG'98][FW'03][BDLST'13]   [FW'03][BDLST'13]
  Rank-Revealing QR: [BDD'11][DGGX'13]
  Sym. Eig & SVD:    [BDD'11][BDK'13]   [BDD'11]
  Non-Sym. Eig:      [BDD'11]   [BDD'11]

Attaining the Lower Bounds: Parallel 2D, M = Θ(n^2/P)
(Ignoring poly-log(P) factors; words = Ω(n^2/P^(1/2)), messages = Ω(P^(1/2)))
Legend: [Existing], [Ours], [Math-Lib], [Random]
Columns: Words (BW) | Messages (L) | Saving factor
  BLAS-3:            [AGZ'94][MT'99][ScaLAPACK][C'69][vGW'97][SD'11]        L: n/P^(1/2)
  Cholesky:          [ScaLAPACK][T'99][SD'11]                                L: n/P^(1/2)
  Sym. Indefinite:   [BBDDDPSTY'13][ScaLAPACK]   [BBDDDPSTY'13]              L: n/P^(1/2)
  LU:                [ScaLAPACK][GDX'11][T'99][SD'11]   [GDX'11][T'99][SD'11]   L: n/P^(1/2)
  QR:                [ScaLAPACK][DGHL'12][T'99]   [DGHL'12][T'99]            L: n/P^(1/2)
  Rank-Revealing QR: [BDD'11][DGGX'13]
  Sym. Eig & SVD:    [BDD'11][BDK'13][ScaLAPACK]   [BDD'11][BDK'13]          L: n/P^(1/2)
  Non-Sym. Eig:      [BDD'11]   [BDD'11]                                     BW: P^(1/2), L: n
Attaining with extra memory: 2.5D, M = Θ(c*n^2/P)

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods: Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra
• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and a starting vector
  – Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, ...
• Goal: minimize communication
  – Assume the matrix is "well-partitioned"
  – Serial implementation:
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data - optimal
  – Parallel implementation on p processors:
    • Conventional: O(k log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages - optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability
75

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods: Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV
• n = 21200, nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
77
Example: The Difficulty of Tuning
• n = 21200, nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs
78

Speedups on Itanium 2: The Need for Search
[figure: register-blocking performance profile; the reference code runs at 190 Mflops, the best block size (4x2) at 1190 Mflops]
79
Register Profile: Itanium 2
[figure: heatmap of SpMV performance over all register block sizes, ranging from 190 Mflops to 1190 Mflops]
80
Register Profiles: IBM and Intel IA-64
[figure: four register-profile heatmaps, Power3 (1.7x), Power4 (1.6x), Itanium 1 (1.8x), Itanium 2 (3.3x); panel performance labels range from 107 Mflops up to 1.2 Gflops]

Another example of tuning challenges for SpMV
• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16614, NNZ = 1.1 M
82
Zoom in to top corner
• More complicated non-zero structure in general
• N = 16614, NNZ = 1.1 M
83
3x3 blocks look natural, but...
• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"
84

Extra Work Can Improve Efficiency
• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup!
  – Actual mflop rate is 1.5^2 = 2.25x higher
85
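A hedged sketch of the register-blocking trade-off using SciPy's BSR format on a synthetic stand-in matrix (not raefsky or ex11): converting CSR to 3x3 blocks stores explicit zeros, and the fill ratio plus a rough SpMV timing show the cost/benefit being tuned.

    import time
    import numpy as np
    import scipy.sparse as sp

    # Synthetic matrix whose nonzeros cluster in 3x3 patches (about 80% filled).
    n, nblocks = 9000, 30000
    rng = np.random.default_rng(0)
    bi = rng.integers(0, n // 3, size=nblocks)
    bj = rng.integers(0, n // 3, size=nblocks)
    rows = (3 * bi[:, None] + np.arange(3)).repeat(3, axis=1).ravel()
    cols = np.tile(3 * bj[:, None] + np.arange(3), (1, 3)).ravel()
    keep = rng.random(rows.size) < 0.8          # knock out ~20% to force fill-in
    A_csr = sp.coo_matrix((rng.standard_normal(keep.sum()),
                           (rows[keep], cols[keep])), shape=(n, n)).tocsr()

    A_bsr = A_csr.tobsr(blocksize=(3, 3))       # stores explicit zeros inside blocks
    print("fill ratio:", round(A_bsr.nnz / A_csr.nnz, 2))

    x = rng.standard_normal(n)
    for name, M in (("CSR", A_csr), ("BSR 3x3", A_bsr)):
        t0 = time.perf_counter()
        for _ in range(200):
            y = M @ x
        print(name, "SpMV x200:", round(time.perf_counter() - t0, 3), "s")

Whether the blocked kernel wins depends on the machine and the fill ratio, which is exactly why OSKI searches instead of assuming.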

Source: Accelerator Cavity Design Problem (Ko via Husbands)
[figure: spy plot of the matrix]
86
100x100 Submatrix Along Diagonal
[figure: zoomed spy plot of a 100x100 diagonal block]
87
Post-RCM Reordering
[figure: spy plot after reverse Cuthill-McKee reordering]
88
Effect of Combined RCM+TSP Reordering
[figure: nonzero structure before (green + red) and after (green + blue) reordering]
2x speedups on Pentium 4, Power 4, ...
89
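The RCM half of the reordering above can be reproduced with SciPy's reverse Cuthill-McKee (the TSP-based step has no standard library call and is omitted); the matrix and sizes here are stand-ins, not the cavity-design problem.

    import numpy as np
    import scipy.sparse as sp
    from scipy.sparse.csgraph import reverse_cuthill_mckee

    n = 2000
    A = sp.random(n, n, density=5e-3, format='csr', random_state=0)
    A = (A + A.T).tocsr()                        # symmetric sparsity pattern

    perm = reverse_cuthill_mckee(A, symmetric_mode=True)
    A_rcm = A[perm, :][:, perm]                  # symmetric permutation P*A*P^T

    def bandwidth(M):
        coo = M.tocoo()
        return int(np.abs(coo.row - coo.col).max())

    print("bandwidth before RCM:", bandwidth(A))
    print("bandwidth after  RCM:", bandwidth(A_rcm))

Pulling nonzeros toward the diagonal is what creates the dense structure that register and cache blocking can then exploit.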

Summary of Other Performance Optimizations
• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations...
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A*A^T*x, A^T*A*x: 4x over CSR, 1.8x over RB
  – More general kernels later ...
90

Optimized Sparse Kernel Interface - OSKI
• Provides sparse kernels automatically tuned for the user's matrix & machine
  – BLAS-style functionality: SpMV (A*x & A^T*y), TrSV
  – Does both off-line and run-time tuning
  – Hides the complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as a stand-alone library
  – Available as a PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, ...
  – bebop.cs.berkeley.edu/poski
91

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods: Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

93
Example: Classical Conjugate Gradient (CG)
[algorithm listing; annotation: the SpMVs and dot products require communication in each iteration]
94
Example: CA-Conjugate Gradient
[algorithm listing; annotations: the SpMVs are replaced by one call to the CA matrix powers kernel, a single global reduction computes the Gram matrix G, and the local computations within the inner loop require no communication]
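Since the CG and CA-CG listings above were figures, here is a plain textbook CG in NumPy with the per-iteration communication points of a parallel implementation marked in comments (one SpMV, two dot-product reductions); the CA variant, which replaces s SpMVs by one matrix-powers-kernel call and the dot products by one Gram-matrix reduction, is not reproduced here.

    import numpy as np
    import scipy.sparse as sp

    def cg(A, b, tol=1e-8, maxiter=500):
        x = np.zeros_like(b)
        r = b - A @ x
        p = r.copy()
        rr = r @ r
        for _ in range(maxiter):
            Ap = A @ p                 # SpMV: neighbor (halo) communication, in parallel
            alpha = rr / (p @ Ap)      # dot product: global reduction
            x += alpha * p
            r -= alpha * Ap
            rr_new = r @ r             # dot product: global reduction
            if np.sqrt(rr_new) < tol * np.linalg.norm(b):
                break
            p = r + (rr_new / rr) * p
            rr = rr_new
        return x

    # 2D Poisson, 5-point stencil: the model problem used on the next slide.
    m = 30
    T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(m, m))
    A = (sp.kron(sp.identity(m), T) + sp.kron(T, sp.identity(m))).tocsr()
    b = np.ones(m * m)
    x = cg(A, b)
    print("residual:", np.linalg.norm(b - A @ x))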

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods: Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

96
[figure: convergence of CG vs CA-CG (monomial basis) on the model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ~ 400; CA-CG shows slower convergence due to roundoff and loss of accuracy due to roundoff, down to a floor at machine precision; at s = 16 the monomial basis is rank deficient and the method breaks down]
97
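The s = 16 breakdown can be reproduced in a few lines: build the same model problem and watch the conditioning of the monomial Krylov basis [p, Ap, ..., A^s p] blow up. A hedged sketch; the Newton or Chebyshev bases used as the usual fix are not shown.

    import numpy as np
    import scipy.sparse as sp

    m = 30                                            # 30x30 grid, cond(A) ~ 400
    T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(m, m))
    A = (sp.kron(sp.identity(m), T) + sp.kron(T, sp.identity(m))).tocsr()

    rng = np.random.default_rng(0)
    p = rng.standard_normal(m * m)

    for s in (4, 8, 12, 16):
        V = np.empty((m * m, s + 1))
        V[:, 0] = p / np.linalg.norm(p)
        for j in range(s):
            v = A @ V[:, j]
            V[:, j + 1] = v / np.linalg.norm(v)       # normalize, keep monomial directions
        print(f"s = {s:2d}   cond(basis) = {np.linalg.cond(V):.2e}")
    # As the condition number approaches 1/eps, the basis is numerically rank
    # deficient, which is the breakdown the slide reports at s = 16.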

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods: Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?
• Requires o(n^2) data/indices to store
• Nonzero entries and indices could be explicit or implicit:
    Nonzero entries \ Indices      Explicit (O(nnz))       Implicit (o(nnz))
    Explicit (O(nnz))              CSR and variations      Vision, climate, AMR, ...
    Implicit (o(nnz))              Graph Laplacian         Stencils
• Matrix could be a sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + U*D*V^T, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write A^k = (S + U*D*V^T)^k as a sum of S^k and low-rank matrices
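A tiny, hedged sketch of the sparse-plus-low-rank point: apply A = S + U*D*V^T, and powers of it, without ever forming a dense n x n matrix; all names and sizes are illustrative.

    import numpy as np
    import scipy.sparse as sp

    rng = np.random.default_rng(0)
    n, r = 5000, 10
    S = sp.random(n, n, density=1e-3, format='csr', random_state=0)
    U = rng.standard_normal((n, r))
    D = np.diag(rng.standard_normal(r))
    V = rng.standard_normal((n, r))

    def apply_A(x):
        # A x = S x + U (D (V^T x)): O(nnz(S) + n*r) work, no dense matrix
        return S @ x + U @ (D @ (V.T @ x))

    def apply_Ak(x, k):
        # A^k x by repeated application; a CA method would instead expand
        # (S + U D V^T)^k into S^k plus low-rank terms to build its basis.
        for _ in range(k):
            x = apply_A(x)
        return x

    x = rng.standard_normal(n)
    print(apply_Ak(x, 3)[:3])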

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods: Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

101
Reproducible Floating Point Computation
• Get a bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe the results
  – Willing to sacrifice 40% - 50% of performance for it
• Email to ~110 Berkeley CSE faculty, asking about it
  – Most: "What? How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility
[figures: Absolute Error for Random Vectors ("same magnitude, opposite signs") and Relative Error for Orthogonal Vectors ("sign not reproducible")]
Vector size: 1e6. Data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum - minimum
• Relative error = Absolute error / maximum absolute value
103

Goals/Approaches for Reproducibility
• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee a fixed reduction tree (not 2. or 3.)
  – Use (very) high precision to get the exact answer (not 2.)
  – Prerounding technique (Nguyen, D.)
104
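The root cause in one experiment: floating-point addition is not associative, so different reduction orders (different "thread" counts, different layouts) give different bits. A hedged sketch mimicking the shape of the MKL experiment above; the fixed-tree, exact-arithmetic, and pre-rounding remedies listed here are not implemented.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(10**6) * 10.0**rng.integers(-8, 8, size=10**6)

    def tree_sum(v, chunks):
        # Simulate a reduction over `chunks` "processors": local sums, then combine.
        parts = [v[i::chunks].sum() for i in range(chunks)]
        return sum(parts)

    sums = {p: tree_sum(x, p) for p in (1, 2, 3, 4)}
    for p, s in sums.items():
        print(f"{p} 'threads': {s:.17g}")
    vals = list(sums.values())
    print("absolute spread:", max(vals) - min(vals))   # nonzero => not reproducible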

Performance results on 1024 proc Cray XC30: 1.2x to 3.2x slowdown vs the fastest code, for n=1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary
Don't Communic...
106
Time to redesign all linear algebra, n-body, ... algorithms and software (and compilers)


ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

101

bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010

ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver

demanded by customers (construction engineers) otherwise they donrsquot believe results

ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it

ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection

Reproducible Floating Point Computation

Absolute Error for Random Vectors

Same magnitude opposite signs

Intel MKL non-reproducibility

Relative Error for Orthogonal vectors

Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value

Sign notreproducible

103

bull Consider summation or dot productbull Goals

1 Same answer independent of layout processors order of summands

2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy

bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)

GoalsApproaches for Reproducibility

104

Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M

Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya

Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger

bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David

Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang

bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou

Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National

Instruments NEC Nokia NVIDIA Samsung Oracle

bull bebopcsberkeleyedu

Summary

Donrsquot Communichellip

106

Time to redesign all linear algebra n-body hellip algorithms and software

(and compilers)

Page 50: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

What about sparse matrices? (1/3)

• If the matrix quickly becomes dense, use a dense algorithm
• Ex: All-Pairs Shortest Paths (APSP) using Floyd-Warshall
• Similar to matmul: Let D = A, then:

    for k = 1:n
      for i = 1:n
        for j = 1:n
          D(i,j) = min( D(i,j), D(i,k) + D(k,j) )

• But can't reorder the outer loop for 2.5D, need another idea
• Abbreviate D(i,j) = min( D(i,j), min_k( A(i,k) + B(k,j) ) ) by D = A⊗B
  – Dependencies ok, 2.5D works, just a different semiring
• Kleene's Algorithm (a runnable version follows below):

    D = DC-APSP(A, n)
      D = A
      Partition D = [[D11, D12], [D21, D22]] into n/2 x n/2 blocks
      D11 = DC-APSP(D11, n/2)
      D12 = D11 ⊗ D12
      D21 = D21 ⊗ D11
      D22 = D21 ⊗ D12
      D22 = DC-APSP(D22, n/2)
      D21 = D22 ⊗ D21
      D12 = D12 ⊗ D22
      D11 = D12 ⊗ D21

52
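To make the two pieces of pseudocode above concrete, here is a small runnable sketch in plain Python/NumPy (the helper names minplus, floyd_warshall and dc_apsp are just illustrative; this shows the recursion, not a communication-optimal 2.5D implementation). It assumes D(i,i) = 0 and D(i,j) = np.inf where there is no edge.

import numpy as np

def minplus(D, A, B):
    # D(i,j) = min( D(i,j), min_k A(i,k) + B(k,j) )  -- the "A (x) B" update
    return np.minimum(D, np.min(A[:, :, None] + B[None, :, :], axis=1))

def floyd_warshall(A):
    D = A.copy()
    for k in range(D.shape[0]):
        D = np.minimum(D, D[:, k:k+1] + D[k:k+1, :])
    return D

def dc_apsp(A):
    # Kleene-style divide-and-conquer APSP, following the slide's block recursion
    n = A.shape[0]
    if n == 1:
        return A.copy()
    m = n // 2
    D = A.copy()
    D[:m, :m] = dc_apsp(D[:m, :m])
    D[:m, m:] = minplus(D[:m, m:], D[:m, :m], D[:m, m:])
    D[m:, :m] = minplus(D[m:, :m], D[m:, :m], D[:m, :m])
    D[m:, m:] = minplus(D[m:, m:], D[m:, :m], D[:m, m:])
    D[m:, m:] = dc_apsp(D[m:, m:])
    D[m:, :m] = minplus(D[m:, :m], D[m:, m:], D[m:, :m])
    D[:m, m:] = minplus(D[:m, m:], D[:m, m:], D[m:, m:])
    D[:m, :m] = minplus(D[:m, :m], D[:m, m:], D[m:, :m])
    return D

As a quick check, floyd_warshall(A) and dc_apsp(A) agree on random weight matrices with a zero diagonal; the point of the block form is that most of the work is min-plus "matmul", which the 2.5D machinery can then handle.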

Performance of 2.5D APSP using Kleene

53

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24,576 cores)
[Plot annotations: 6.2x speedup, 2x speedup]

What about sparse matrices? (2/3)

• If parts of the matrix become dense, optimize those
• Ex: Cholesky on a matrix A with good separators
• Thm (Lipton/Rose/Tarjan '79): If all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on a w x w submatrix
• Thm: Words_moved = Ω(w³/M^(1/2)), etc.
• Thm (George '73): Nested dissection gives optimal ordering for 2D grid, 3D grid, similar matrices
  – w = n for a 2D n x n grid, w = n² for a 3D n x n x n grid
• Sequential multifrontal Cholesky attains the bounds
• PSPACES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)

54

What about sparse matrices? (3/3)

• If the matrix stays very sparse, the lower bound is unattainable; need a new one
• Ex: A, B both diagonal: no communication in the parallel case
• Ex: A, B both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^(1/2), i.i.d.
• Assumption: Algorithm is sparsity-independent: assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
      Words_moved = Ω( min( d·n/P^(1/2), d²·n/P ) )
  – Proof exploits the fact that reuse of entries of C = A·B is unlikely
• Contrast the general lower bound: Words_moved = Ω( d²·n/(P·M^(1/2)) )
• Attained by a divide-and-conquer algorithm that splits matrices along the dimensions most likely to minimize cost

55

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Symmetric Eigenproblem and SVD

• Usual approach for A = Aᵀ (SVD similar)
  – A → QᵀAQ = T, where Q orthogonal, T tridiagonal
  – T → UᵀTU = Λ, where U orthogonal, Λ diagonal
  – QU's columns are eigenvectors, Λ the eigenvalues
  – Dense → Tridiagonal → Diagonal
  – Only half BLAS3, half BLAS2 in LAPACK's sytrd
• Communication-Avoiding Approach
  – A → QAQᵀ = B, where B = Bᵀ banded, of bandwidth M^(1/2)
  – Continue as above, starting with B
  – Dense → Banded → Tridiagonal → Diagonal
  – Dense → Banded: use TSQR to zero out M^(1/2) cols/rows at a time
  – Banded → Tridiagonal: need a new(ish) idea

Successive Band Reduction (Bischof/Lang/Sun)

[Sequence of annotated figures: the band (width b+1) is reduced by bulge chasing. In each sweep an orthogonal transform Q1 (applied as Q1 and Q1ᵀ) eliminates a block of c columns spanning d diagonals, creating a bulge of size d+c further down the band; subsequent transforms Q2, Q3, Q4, Q5 chase the numbered bulges (1 through 6) off the end of the matrix. Legend: b = bandwidth, c = #columns, d = #diagonals, with the constraint c + d ≤ b.]

Conventional vs CA - SBR

Conventional: touch all data 4 times        Communication-Avoiding: touch all data once

Speedups of Sym. Band Reduction vs DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem

• No apparent way to modify the standard algorithm
• Instead: Spectral Divide-and-Conquer
  – Find orthogonal matrix Q whose leading columns span an invariant subspace of A
  – QᵀAQ will be block upper triangular:

        QᵀAQ = [ A11  A12 ]
               [  ε   A22 ]

  – Apply recursively to A11, A22
  – Depends on randomization:
    1. Randomized Rank-Revealing QR decomposition
    2. Randomized location to try splitting the spectrum

Attaining the Lower Bounds: Sequential
Legend: [Existing], [Ours], [Math-Lib], [Random]
(The original slide is a table listing, for each algorithm, the references that attain the Words and Messages lower bounds in the two-level and hierarchical memory models.)

BLAS-3:              [FLPR'99], [BDLST'13], [MKL etc.]
Cholesky:            [G'97], [AP'00], [LAPACK], [BDHS'09]
Sym. Indefinite:     [BBDDDPSTY'13]
LU:                  [G'97], [T'97], [GDX'11], [BDLST'13]
QR:                  [EG'98], [FW'03], [DGHL'12], [BDLST'13]
Rank Revealing QR:   [BDD'11], [DGGX'13]
Sym. Eig & SVD:      [BDD'11], [BDK'13]
Non-Sym. Eig:        [BDD'11]

Attaining the Lower Bounds: Parallel 2D, M = O(n²/P)
(Ignoring poly-log(P) factors; words = Ω(n²/P^(1/2)), messages = Ω(P^(1/2)). The slide tabulates, per algorithm, the references attaining the word (BW) and message (L) bounds, plus the saving factor over previous algorithms.)
Legend: [Existing], [Ours], [Math-Lib], [Random]

BLAS-3:            [AGZ'94], [MT'99], [ScaLAPACK], [C'69], [vGW'97], [SD'11] – saving factor L: n/P^(1/2)
Cholesky:          [ScaLAPACK], [T'99], [SD'11] – saving factor L: n/P^(1/2)
Sym. Indefinite:   [BBDDDPSTY'13], [ScaLAPACK] – saving factor L: n/P^(1/2)
LU:                [ScaLAPACK], [GDX'11], [T'99], [SD'11] – saving factor L: n/P^(1/2)
QR:                [ScaLAPACK], [DGHL'12], [T'99] – saving factor L: n/P^(1/2)
Rank Revealing QR: [BDD'11], [DGGX'13]
Sym. Eig & SVD:    [BDD'11], [BDK'13], [ScaLAPACK] – saving factor L: n/P^(1/2)
Non-Sym. Eig:      [BDD'11] – saving factors BW: P^(1/2), L: n

Attaining with extra memory (2.5D): M = O(c·n²/P)

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and the starting vector
  – Many such "Krylov Subspace Methods"
    • Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal (see the sketch below)
  – Parallel implementation on p processors
    • Conventional: O(k log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability

75
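To make the O(k)-vs-O(1) data-movement counts above concrete, here is a minimal illustration that is not taken from the slides: a symmetric tridiagonal A (1D 3-point stencil) in NumPy, with hypothetical helper names, comparing k separate SpMVs against a matrix-powers-style version in which each row block reads its piece of the vector plus k ghost entries once and then computes all k steps locally, at the price of redundant work on the overlaps.

import numpy as np

def k_steps_conventional(apply_A, x, k):
    # k separate SpMVs: the matrix and vectors stream from slow memory k times
    V = [x]
    for _ in range(k):
        V.append(apply_A(V[-1]))
    return V

def k_steps_blocked(main, off, x, k, block):
    # Communication-avoiding idea for a symmetric tridiagonal A:
    # each row block fetches block + 2k vector entries once (its block plus
    # k ghost values on each side), then computes x, Ax, ..., A^k x locally.
    n = len(x)
    V = [np.empty(n) for _ in range(k + 1)]
    V[0][:] = x
    for start in range(0, n, block):
        lo, hi = max(0, start - k), min(n, start + block + k)
        v = x[lo:hi].copy()                          # the single "message"
        for j in range(1, k + 1):
            w = main[lo:hi] * v
            w[:-1] += off[lo:hi-1] * v[1:]
            w[1:]  += off[lo:hi-1] * v[:-1]
            stop = min(n, start + block)
            V[j][start:stop] = w[start-lo:stop-lo]   # keep only the trusted interior
            v = w
    return V

The first version moves the data O(k) times; the second touches each entry once plus the k-deep overlaps, which is the O(1) count quoted above. General sparse matrices need a graph-partitioned version of the same idea (the CA matrix powers kernel).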

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV

• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

77

Example: The Difficulty of Tuning

• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs (block-CSR sketch below)

78
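The "exploit the dense substructure" idea is register blocking: store the matrix as small dense blocks so that each block multiply can run out of registers with only one set of indices per block. Below is a minimal block-CSR (BCSR) SpMV sketch, assuming NumPy and illustrative array names; it is not the tuned kernel from the slides.

import numpy as np

def bcsr_spmv(block_ptr, block_col, block_vals, x, r, c):
    # y = A*x with A stored in r-by-c Block CSR:
    #   block row i owns blocks block_ptr[i]:block_ptr[i+1],
    #   block_col[k] is the block-column index of block k,
    #   block_vals[k] is that dense r-by-c block.
    n_block_rows = len(block_ptr) - 1
    y = np.zeros(n_block_rows * r)
    for i in range(n_block_rows):
        yi = np.zeros(r)
        for k in range(block_ptr[i], block_ptr[i+1]):
            j = block_col[k]
            # one dense r-by-c multiply per stored block; in a tuned C kernel
            # these two tiny loops are fully unrolled and kept in registers
            yi += block_vals[k] @ x[j*c:(j+1)*c]
        y[i*r:(i+1)*r] = yi
    return y

One column index per block (instead of per nonzero) is what cuts the memory references; the cost is padding blocks with explicit zeros, which the "fill ratio" slides that follow quantify.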

Speedups on Itanium 2: The Need for Search

[Figure: Mflop/s achieved for each register block size; the Reference (unblocked) code is far from the best block size found by search, Best: 4x2]

79

Register Profile: Itanium 2

[Figure: register-blocking performance profile, ranging from 190 Mflop/s (worst) to 1190 Mflop/s (best)]

80

Register Profiles: IBM and Intel IA-64

[Figure: four register-blocking profiles –
  Power3 (17% of peak): 122 to 252 Mflop/s
  Power4 (16% of peak): 459 to 820 Mflop/s
  Itanium 1 (8% of peak): 107 to 247 Mflop/s
  Itanium 2 (33% of peak): 190 Mflop/s to 1.2 Gflop/s]

Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16,614
• NNZ = 1.1 M

82

Zoom in to top corner

• More complicated non-zero structure in general
• N = 16,614
• NNZ = 1.1 M

83

3x3 blocks look natural, but…

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

84

Extra Work Can Improve Efficiency

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup
  – Actual Mflop rate is 1.5² = 2.25x higher (worked out below)

85
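Spelling out that last line as a tiny calculation (values taken from the slide):

fill_ratio = 1.5      # stored entries (incl. explicit zeros) / true nonzeros
speedup    = 1.5      # measured wall-clock improvement on the Pentium III
# The blocked code does fill_ratio times as many flops in 1/speedup of the time,
# so its sustained Mflop rate is higher by their product:
mflop_gain = fill_ratio * speedup      # = 2.25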

Source: Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal


87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

Before: Green + Red     After: Green + Blue

2x speedups on Pentium 4, Power 4, …

89

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·Aᵀ·x, Aᵀ·A·x: 4x over CSR, 1.8x over RB (sketched below)
  – More general kernels later…

90
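As one example of a higher-level kernel from the list above, here is a minimal sketch (SciPy CSR, a simple row-block loop chosen for clarity) of computing Aᵀ·A·x while reading each row block of A only once, which is where the savings over two separate SpMVs comes from:

import numpy as np
import scipy.sparse as sp

def ata_x_one_pass(A_csr, x, block=1024):
    # Compute A^T (A x) touching each row block of A once:
    # for each block of rows R, t = A[R,:] @ x, then y += A[R,:]^T @ t.
    m, n = A_csr.shape
    y = np.zeros(n)
    for r0 in range(0, m, block):
        Ar = A_csr[r0:r0+block, :]   # this block is read once...
        t = Ar @ x                   # ...used for the A·x piece
        y += Ar.T @ t                # ...and reused for the A^T piece
    return y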

Optimized Sparse Kernel Interface - OSKI

• Provides sparse kernels automatically tuned for the user's matrix & machine
  – BLAS-style functionality: SpMV (Ax & Aᵀy), TrSV
  – Does both off-line and run-time tuning (a toy tuning loop is sketched below)
  – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as a stand-alone library
  – Available as a PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

91
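The run-time tuning step can be pictured with a small sketch: benchmark a few candidate register block sizes on the user's actual matrix and keep the fastest. This illustrates the idea only and is not OSKI's real interface; the function name and candidate list are made up, and SciPy's BSR format stands in for a tuned blocked kernel.

import time
import scipy.sparse as sp

def pick_block_size(A, x, candidates=((1,1),(2,2),(4,2),(4,4),(8,8)), trials=3):
    # Empirically choose a register block size r-by-c for SpMV on matrix A.
    best = None
    for (r, c) in candidates:
        if A.shape[0] % r or A.shape[1] % c:
            continue                          # skip shapes that don't tile A
        Ab = sp.bsr_matrix(A, blocksize=(r, c))
        t0 = time.perf_counter()
        for _ in range(trials):
            Ab @ x
        dt = time.perf_counter() - t0
        if best is None or dt < best[0]:
            best = (dt, (r, c))
    return best[1]

Off-line tuning does the expensive machine benchmarking once per platform; the run-time step above only has to pick among a short list for the matrix at hand.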

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

93

Example: Classical Conjugate Gradient (CG)

[Algorithm figure] SpMVs and dot products require communication in each iteration (a reference CG loop follows below).
(In the CA version on the next slide, the k SpMVs are replaced by one call to the CA Matrix Powers Kernel, and the dot products by a single global reduction to compute the Gram matrix G.)

94
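For reference, a minimal NumPy version of the classical CG loop on this slide (no preconditioning): each pass performs one SpMV (A @ p) and two dot products, which is exactly where the per-iteration communication comes from.

import numpy as np

def cg(A, b, x0=None, tol=1e-8, maxiter=1000):
    x = np.zeros_like(b) if x0 is None else x0.copy()
    r = b - A @ x                  # residual
    p = r.copy()                   # search direction
    rs = r @ r
    for _ in range(maxiter):
        Ap = A @ p                 # the SpMV
        alpha = rs / (p @ Ap)      # dot product 1
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r             # dot product 2
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x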

Example: CA-Conjugate Gradient

Local computations within the inner loop require no communication (a sketch of the basis and Gram-matrix computation follows below).
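A sketch of the data flow in CA-CG's outer loop, under simplifying assumptions (monomial basis, dense NumPy, and no attempt at the full coefficient recurrences): build the s-step Krylov basis with one matrix-powers call, form the Gram matrix with one reduction, and then the inner-loop updates act only on short coefficient vectors.

import numpy as np

def krylov_basis(A, p, r, s):
    # Columns [p, Ap, ..., A^s p, r, Ar, ..., A^(s-1) r]; in CA-CG this is
    # produced by the matrix powers kernel with one round of communication.
    P = [p]
    for _ in range(s):
        P.append(A @ P[-1])
    R = [r]
    for _ in range(s - 1):
        R.append(A @ R[-1])
    return np.column_stack(P + R)

def gram(V):
    # G = V^T V: one global reduction in the parallel setting. Afterwards,
    # for vectors represented in the basis, x = V a and y = V b, the dot
    # product x.y is just a^T G b - a small local computation.
    return V.T @ V

The s inner iterations then update coefficient vectors against G instead of touching A or the long vectors, which is why they need no communication.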

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

96

[Convergence plot – model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400. CG converges steadily toward machine precision; CA-CG with the monomial basis shows slower convergence due to roundoff and loss of accuracy due to roundoff. At s = 16 the monomial basis is rank deficient and the method breaks down (reproduced in the snippet below).]

97
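The breakdown can be reproduced in a few lines: build the 30x30 2D Poisson matrix, form the monomial basis [r, Ar, ..., A^s r], and watch its condition number blow up toward 1/ε as s approaches 16. This is an illustrative sketch; the exact s at which things fail depends on scaling and the starting vector.

import numpy as np
import scipy.sparse as sp

def poisson2d(m):
    # standard 5-point 2D Poisson matrix on an m-by-m grid
    T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(m, m))
    I = sp.identity(m)
    return (sp.kron(I, T) + sp.kron(T, I)).tocsr()

A = poisson2d(30)
rng = np.random.default_rng(0)
r = rng.standard_normal(A.shape[0])

V = [r]
for s in range(1, 17):
    V.append(A @ V[-1])
    cond = np.linalg.cond(np.column_stack(V))
    print(f"s = {s:2d}   cond([r, Ar, ..., A^s r]) = {cond:.2e}")

Once the basis condition number reaches about 1/ε, the Gram matrix is numerically singular, which is the rank deficiency the plot shows; better-conditioned (Newton or Chebyshev) bases are the standard fix.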

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?
• Requires o(n²) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be a sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + UDVᵀ, D small & square (applied without forming A in the sketch below)
• Semiseparable matrices arise as preconditioners
  – Need to write A^k = (S + UDVᵀ)^k as a sum of S^k and low-rank matrices

                                  Indices
                                  Explicit (O(nnz))      Implicit (o(nnz))
  Nonzero    Explicit (O(nnz))    CSR and variations     Vision, climate, AMR, …
  entries    Implicit (o(nnz))    Graph Laplacian        Stencils
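A small sketch of why the "sparse + low rank" view matters operationally: y = (S + U D Vᵀ)x can be applied without ever forming the dense sum, and powers A^k x just repeat the same step. The names and shapes below are illustrative only.

import numpy as np
import scipy.sparse as sp

def apply_sparse_plus_lowrank(S, U, D, V, x):
    # y = (S + U D V^T) x using one sparse SpMV plus small dense products
    return S @ x + U @ (D @ (V.T @ x))

def apply_power(S, U, D, V, x, k):
    # A^k x = (S + U D V^T)^k x, applied factor by factor; A is never formed
    for _ in range(k):
        x = apply_sparse_plus_lowrank(S, U, D, V, x)
    return x

The communication-avoiding question on the slide is the harder one: rewriting A^k itself as S^k plus low-rank pieces so that a matrix powers kernel can still read S only once.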

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

101

• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm at GNS-MBH
  – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe the results
  – Willing to sacrifice 40% - 50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What? How will I debug without reproducibility?"
  – Few: "I know better and do careful error analysis"
  – S. Govindjee needs it for fracture simulations
  – S. Russell needs it for nuclear blast detection

Reproducible Floating Point Computation

Intel MKL non-reproducibility

[Figure: absolute error for random vectors – same magnitude, opposite signs; relative error for orthogonal vectors – sign not reproducible.]

Vector size 1e6, data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum – minimum
• Relative error = absolute error / maximum absolute value
(A toy reproduction of the effect follows below.)

103
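The effect is easy to reproduce without MKL: floating-point addition is not associative, so merely changing the reduction order (as a different thread count would) changes the computed dot product. A toy illustration:

import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(10**6)
y = rng.standard_normal(10**6)

serial = np.dot(x, y)

# simulate a 4-thread reduction: partial dot products combined at the end
parts = [np.dot(xc, yc) for xc, yc in zip(np.array_split(x, 4),
                                          np.array_split(y, 4))]
threaded = sum(parts)

print(serial - threaded)   # typically nonzero: the ordering changed the rounding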

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (not 2 or 3)
  – Use (very) high precision to get the exact answer (not 2)
  – Prerounding technique (Nguyen, D.) – sketched below

104
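A minimal sketch of the prerounding idea, heavily simplified to a single fixed "bin" (the real Nguyen/Demmel algorithm uses several bins derived from a reduction of max|x_i| and handles over/underflow; the function name here is just illustrative): round every summand to a common coarse granularity so the prerounded values add with no rounding error at all, making the result independent of summation order.

import math
import numpy as np

def reproducible_sum(x, n_max=2**20):
    # Order-independent sum of up to n_max doubles (single-bin prerounding sketch).
    # The low-order parts discarded here are what the real algorithm keeps in
    # extra bins to reach the user's chosen accuracy.
    x = np.asarray(x, dtype=np.float64)
    assert x.size <= n_max
    M = np.max(np.abs(x))
    if M == 0.0:
        return 0.0
    # granularity: a power of two large enough that any partial sum of the
    # prerounded terms is an exact multiple of ulp below 2^53 * ulp
    ulp = 2.0 ** (math.frexp(M)[1] - 52 + int(math.log2(n_max)))
    high = np.round(x / ulp) * ulp      # prerounded parts: exact multiples of ulp
    return float(np.sum(high))          # every addition is exact, so any order works

Because all additions of the prerounded parts are exact, any reduction tree, thread count, or data layout returns bit-identical results, which is goal 1 without giving up goals 2-4.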

Performance results on 1024 proc. Cray XC30: 1.2x to 3.2x slowdown vs fastest code, for n = 1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

106

Time to redesign all linear algebra, n-body, … algorithms and software

(and compilers)

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all ldquon3-likerdquo linear algebra
  • Lower bound for all ldquon3-likerdquo linear algebra (2)
  • Lower bound for all ldquon3-likerdquo linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a ldquoThmrdquo
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a ldquosparse matrixrdquo
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary
Page 51: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

Performance of 25D APSP using Kleene

53

Strong Scaling on Hopper (Cray XE6 with 1024 nodes = 24576 cores)

62xspeedup

2x speedup

What about sparse matrices (23)

bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of

G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix

bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering

for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid

bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse

multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on

separators) 54

What about sparse matrices (33)

bull If matrix stays very sparse lower bound unattainable new one

bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of

data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)

bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)

Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely

bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices

along dimensions most likely to minimize cost

55

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Symmetric Eigenproblem and SVD

bull Usual approach for A=AT (SVD similar)ndash A QTAQ = T where Q orthogonal T tridiagonalndash T UTTU = Λ where U orthogonal Λ diagonalndash QUrsquos columns are eigenvectors Λ eigenvaluesndash Dense Tridiagonal Diagonalndash Only half BLAS3 half BLAS2 in LAPACKrsquos sytrd

bull Communication-Avoiding Approachndash A QAQT = B where B=BT banded of bandwidth M12

ndash Continue as above starting with Bndash Dense Banded Tridiagonal Diagonalndash Dense Banded use TSQR to zero out M12 colsrows at a timendash Banded Tridiagonal need new(ish) idea

b+1

b+1

Successive Band Reduction (BischofLangSun)

1

b+1

b+1

d+1

c

Successive Band Reduction (BischofLangSun)

b = bandwidthc = columnsd = diagonalsConstraint c+d b

1Q1

b+1

b+1

d+1

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

12

Q1

b+1

b+1

d+1

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

12

Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

Q1

Q1T

Q2

Q2T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

Q1

Q1T

Q2

Q2T

Q3

Q3T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

6

6

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

Conventional vs CA - SBR

Conventional Communication-Avoiding

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Nonsymmetric Eigenproblem

bull No apparent way to modify standard algorithmbull Instead Spectral Divide-and-Conquer

ndash Find orthogonal matrix Q whose leading columns span an invariant subspace of A

ndash QTAQ will be block upper triangular

ndash Apply recursively to A11 A22

ndash Depends on randomization1 Randomized Rank Revealing QR decomposition2 Randomized location to try splitting spectrum

A11 A12

ε A22

Attaining the Lower bounds SequentialLegend[Existing][Ours][Math-Lib][Random]

Two Levels Memory Hierarchy

Words Messages Words Messages

BLAS-3 [FLPRrsquo99][BDLSTrsquo13][MKL etc] [FLPRrsquo99][BDLSTrsquo13][MKL etc]

Cholesky[Grsquo97][APrsquo00]

[LAPACK][BDHSrsquo09]

[Grsquo97][APrsquo00][BDHSrsquo09] [Grsquo97][APrsquo00][BDHSrsquo09]

Sym Indefinite [BBDDDPSTYrsquo13] [BBDDDPSTYrsquo13]

LU[Grsquo97][Trsquo97]

[GDXrsquo11][BDLSTrsquo13]

[GDXrsquo11][BDLSTrsquo13]

[Grsquo97][Trsquo97] [BDLSTrsquo13] [BDLSTrsquo13]

QR[EGrsquo98][FWrsquo03]

[DGHLrsquo12][BDLSTrsquo13]

[FWrsquo03][DGHLrsquo12][BDLSTrsquo13]

[EGrsquo98][FWrsquo03][BDLSTrsquo13]

[FWrsquo03][BDLSTrsquo13]

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13] [BDDrsquo11]

Non Sym Eig [BDDrsquo11] [BDDrsquo11]

Legend[Existing][Ours][Math-Lib][Random]

Words (BW) Messages (L) Saving factor

BLAS-3 [AGZrsquo94][MTrsquo99][ScaLAPACK][Crsquo69][vGWrsquo97][SDrsquo11] L nP12

Cholesky [ScaLAPACK][Trsquo99][SDrsquo11] L nP12

Sym Indefinite [BBDDDPSTYrsquo13][ScaLAPACK] [BBDDDPSTYrsquo13] L nP12

LU [ScaLAPACK][GDXrsquo11][Trsquo99][SDrsquo11] [GDXrsquo11][Trsquo99][SDrsquo11] L nP12

QR [ScaLAPACK][DGHLrsquo12] [Trsquo99] [DGHLrsquo12][Trsquo99] L nP12

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13][ScaLAPACK] [BDDrsquo11][BDKrsquo13] L nP12

Non-Sym Eig [BDDrsquo11] [BDDrsquo11] BW P12 L n

Attaining with extra memory 25D M=(cn2P)

Attaining the Lower bounds Parallel 2DM=(n2P)(Ignoring poly-log(P) factors words = ( n2 P12) messages = (P12)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Conjugate Gradients (CG) GMRES Lanczos Arnoldi hellip bull Goal minimize communication

ndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computationndash Challenges Poor partitioning Preconditioning Num Stability

75

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Example The Difficulty of Tuning SpMV

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

77

Example The Difficulty of Tuning

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

bull 8x8 dense substructure exploit this to limit mem_refs

78

Speedups on Itanium 2 The Need for Search

Reference

Best 4x2

Mflops

Mflops

79

Register Profile Itanium 2

190 Mflops

1190 Mflops

80

Register Profiles IBM and Intel IA-64Power3 - 17 Power4 - 16

Itanium 2 - 33Itanium 1 - 8

252 Mflops

122 Mflops

820 Mflops

459 Mflops

247 Mflops

107 Mflops

12 Gflops

190 Mflops

Another example of tuning challenges for SpMV

bull Ex11 matrix (fluid flow)

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

82

Zoom in to top corner

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

83

3x3 blocks look natural buthellip

bull Example 3x3 blockingndash Logical grid of 3x3 cells

bull But would lead to lots of ldquofill-inrdquo

84

Extra Work Can Improve Efficiency

bull Example 3x3 blockingndash Logical grid of 3x3 cellsndash Fill-in explicit zerosndash Unroll 3x3 block multipliesndash ldquoFill ratiordquo = 15

bull On Pentium III 15x speedup

ndash Actual mflop rate 152 = 225 higher

85

Source Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal

Summer School Lecture 7

87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

Before Green + RedAfter Green + Blue

Summer School Lecture 7

892x speedups on Pentium 4 Power 4 hellip

Summary of Other Performance Optimizations

bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip

bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR

bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip

90

Optimized Sparse Kernel Interface - OSKI

bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning

bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski

bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression

software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki

91

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

93

Example Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in

each iteration

via CA Matrix Powers Kernel

Global reduction to compute G

94

Example CA-Conjugate Gradient

Local computations within inner loop require

no communication

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

96

Slower convergence due

to roundoff

Loss of accuracy due to roundoff

At s = 16 monomial basis is rank deficient Method breaks down

Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400

CA-CG (monomial)CG

machine precision

97

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit

bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square

bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank

matrices

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

101

bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010

ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver

demanded by customers (construction engineers) otherwise they donrsquot believe results

ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it

ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection

Reproducible Floating Point Computation

Absolute Error for Random Vectors

Same magnitude opposite signs

Intel MKL non-reproducibility

Relative Error for Orthogonal vectors

Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value

Sign notreproducible

103

bull Consider summation or dot productbull Goals

1 Same answer independent of layout processors order of summands

2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy

bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)

GoalsApproaches for Reproducibility

104

Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M

Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya

Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger

bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David

Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang

bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou

Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National

Instruments NEC Nokia NVIDIA Samsung Oracle

bull bebopcsberkeleyedu

Summary

Donrsquot Communichellip

106

Time to redesign all linear algebra n-body hellip algorithms and software

(and compilers)

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all ldquon3-likerdquo linear algebra
  • Lower bound for all ldquon3-likerdquo linear algebra (2)
  • Lower bound for all ldquon3-likerdquo linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a ldquoThmrdquo
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a ldquosparse matrixrdquo
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary
Page 52: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

What about sparse matrices (23)

bull If parts of matrix becomes dense optimize thosebull Ex Cholesky on matrix A with good separatorsbull Thm (LiptonRoseTarjanrsquo79) If all balanced separators of

G(A) have at least w vertices then G(chol(A)) has clique of size wndash Need to do dense Cholesky on w x w submatrix

bull Thm Words_moved = Ω(w3M12) etc bull Thm (Georgersquo73) Nested dissection gives optimal ordering

for 2D grid 3D grid similar matricesndash w = n for 2D n x n grid w = n2 for 3D n x n x n grid

bull Sequential multifrontal Cholesky attains boundsbull PSPACES (Gupta Karypis Kumar) is a parallel sparse

multifrontal Cholesky packagendash Attains 2D and 25D lower bounds (using optimal dense Cholesky on

separators) 54

What about sparse matrices (33)

bull If matrix stays very sparse lower bound unattainable new one

bull Ex AB both diagonal no communication in parallel casebull Ex AB both are Erdos-Renyi Prob(A(ij)ne0) = dn d ltlt n12iidbull Assumption Algorithm is sparsity-independent assignment of

data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)

bull Thm A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)

Words_moved = Ω(min( dnP12 d2nP ) )ndash Proof exploits fact that reuse of entries of C = AB unlikely

bull Contrast general lower bound Words_moved = Ω(d2n(PM12)))bull Attained by divide-and-conquer algorithm that splits matrices

along dimensions most likely to minimize cost

55

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Symmetric Eigenproblem and SVD

• Usual approach for A = A^T (SVD similar):
  – A → Q^T A Q = T, where Q orthogonal, T tridiagonal
  – T → U^T T U = Λ, where U orthogonal, Λ diagonal
  – Columns of Q·U are the eigenvectors, Λ the eigenvalues
  – Dense → Tridiagonal → Diagonal
  – Only half BLAS3, half BLAS2 in LAPACK's sytrd
• Communication-Avoiding Approach:
  – A → Q A Q^T = B, where B = B^T is banded with bandwidth M^(1/2)
  – Continue as above, starting with B
  – Dense → Banded → Tridiagonal → Diagonal
  – Dense → Banded: use TSQR to zero out M^(1/2) cols/rows at a time
  – Banded → Tridiagonal: need a new(ish) idea
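A minimal numpy sketch of the first stage above (Dense → Banded): QR-factor each block column below the b-th subdiagonal and apply the orthogonal factor from both sides, so eigenvalues are preserved and the result has half-bandwidth b. This is an illustrative serial version (plain np.linalg.qr rather than TSQR), not the tuned CA implementation; the test sizes are arbitrary.

```python
import numpy as np
import scipy.linalg as sla

def dense_to_band(A, b):
    """Stage 1 of the two-stage symmetric eigensolver: reduce symmetric A to a
    symmetric banded matrix of half-bandwidth b by two-sided QR-based transforms."""
    A = A.copy()
    n = A.shape[0]
    for k in range(0, n - b, b):
        lo = k + b                                   # first row below the band
        Q, R = np.linalg.qr(A[lo:, k:k + b], mode='complete')
        A[lo:, k:k + b] = R                          # zeros below the b-th subdiagonal
        A[k:k + b, lo:] = R.T                        # keep symmetry
        A[lo:, lo:] = Q.T @ A[lo:, lo:] @ Q          # two-sided trailing update
    return A

rng = np.random.default_rng(0)
n, b = 200, 8
A = rng.standard_normal((n, n)); A = A + A.T
B = dense_to_band(A, b)

# Check: B is banded with half-bandwidth b and has the same eigenvalues as A.
band = np.vstack([np.concatenate([np.diag(B, -i), np.zeros(i)]) for i in range(b + 1)])
lam = sla.eig_banded(band, lower=True, eigvals_only=True)
print(np.allclose(np.sort(lam), np.sort(np.linalg.eigvalsh(A))))
```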

Successive Band Reduction (Bischof/Lang/Sun)

[Sequence of figures omitted: each step applies an orthogonal transform Q_i (and Q_i^T) to annihilate d diagonals of the band, c columns at a time, then chases the resulting (d+c)-wide bulge down the matrix; the steps are labeled 1–6 and the transforms Q1–Q5.
Parameters: b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b.]

Conventional vs CA-SBR

Conventional: touch all data 4 times. Communication-Avoiding: touch all data once.

Speedups of Sym Band Reduction vs DSBTRD

• Up to 17x on Intel Gainestown vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem

• No apparent way to modify the standard algorithm
• Instead: Spectral Divide-and-Conquer
  – Find an orthogonal matrix Q whose leading columns span an invariant subspace of A
  – Q^T A Q will be block upper triangular:
      [ A11  A12 ]
      [  ε   A22 ]
  – Apply recursively to A11, A22
  – Depends on randomization:
    1. Randomized Rank Revealing QR decomposition
    2. Randomized location to try splitting the spectrum
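To make that structure concrete, here is a small illustrative sketch (not the communication-avoiding algorithm from the slide): it builds a spectral projector with a Newton iteration for the matrix sign function of A − σI, uses column-pivoted QR in place of the randomized rank-revealing QR, and checks that Q^T A Q comes out block upper triangular. The shift σ, iteration count, and tolerance are arbitrary choices, and the explicit inverses are exactly what the CA version avoids.

```python
import numpy as np
import scipy.linalg as sla

def split_spectrum(A, sigma=0.0, iters=60, tol=1e-8):
    """One spectral divide-and-conquer step: separate eigenvalues with
    Re(lambda) > sigma from the rest. Illustrative only, not the CA version."""
    n = A.shape[0]
    X = A - sigma * np.eye(n)
    for _ in range(iters):                     # Newton iteration -> sign(A - sigma*I)
        X = 0.5 * (X + np.linalg.inv(X))
    P = 0.5 * (np.eye(n) + X)                  # projector onto the invariant subspace
    Q, R, _ = sla.qr(P, pivoting=True)         # leading columns of Q span range(P)
    r = int(np.sum(np.abs(np.diag(R)) > tol * np.abs(R[0, 0])))
    T = Q.T @ A @ Q                            # should be block upper triangular
    return Q, T, r

rng = np.random.default_rng(1)
A = rng.standard_normal((12, 12))
Q, T, r = split_spectrum(A)
print(r, np.linalg.norm(T[r:, :r]))            # (2,1) block ~ 0: recurse on T[:r,:r], T[r:,r:]
```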

Attaining the Lower Bounds: Sequential
Legend: [Existing], [Ours], [Math-Lib], [Random]
Columns: Words and Messages, for Two Levels of memory and for the full Memory Hierarchy.

• BLAS-3: [FLPR'99][BDLST'13][MKL etc.] | [FLPR'99][BDLST'13][MKL etc.]
• Cholesky: [G'97][AP'00][LAPACK][BDHS'09] | [G'97][AP'00][BDHS'09] | [G'97][AP'00][BDHS'09]
• Sym Indefinite: [BBDDDPSTY'13] | [BBDDDPSTY'13]
• LU: [G'97][T'97][GDX'11][BDLST'13] | [GDX'11][BDLST'13] | [G'97][T'97][BDLST'13] | [BDLST'13]
• QR: [EG'98][FW'03][DGHL'12][BDLST'13] | [FW'03][DGHL'12][BDLST'13] | [EG'98][FW'03][BDLST'13] | [FW'03][BDLST'13]
• Rank Revealing QR: [BDD'11][DGGX'13]
• Sym Eig & SVD: [BDD'11][BDK'13] | [BDD'11]
• Non Sym Eig: [BDD'11] | [BDD'11]

Attaining the Lower Bounds: Parallel 2D, M = Θ(n^2/P)
(Ignoring poly-log(P) factors; words = Ω(n^2/P^(1/2)), messages = Ω(P^(1/2)).)
Legend: [Existing], [Ours], [Math-Lib], [Random]
Columns: Words (BW), Messages (L), Saving factor.

• BLAS-3: [AGZ'94][MT'99][ScaLAPACK][C'69][vGW'97][SD'11]; saving factor L: n/P^(1/2)
• Cholesky: [ScaLAPACK][T'99][SD'11]; saving factor L: n/P^(1/2)
• Sym Indefinite: [BBDDDPSTY'13][ScaLAPACK] | [BBDDDPSTY'13]; saving factor L: n/P^(1/2)
• LU: [ScaLAPACK][GDX'11][T'99][SD'11] | [GDX'11][T'99][SD'11]; saving factor L: n/P^(1/2)
• QR: [ScaLAPACK][DGHL'12][T'99] | [DGHL'12][T'99]; saving factor L: n/P^(1/2)
• Rank Revealing QR: [BDD'11][DGGX'13]
• Sym Eig & SVD: [BDD'11][BDK'13][ScaLAPACK] | [BDD'11][BDK'13]; saving factor L: n/P^(1/2)
• Non-Sym Eig: [BDD'11] | [BDD'11]; saving factor BW: P^(1/2), L: n

Attaining with extra memory (2.5D): M = Θ(c·n^2/P)

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and starting vector
  – Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation:
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors:
    • Conventional: O(k log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability

75
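The serial and parallel savings above come from replacing k separate SpMV sweeps by one "matrix powers kernel" that produces the whole Krylov basis at once. Here is a minimal reference sketch (scipy assumed, sizes arbitrary) of what that kernel computes; the real CA kernel produces the same columns block-by-block with ghost zones so that A is read, and neighbors are contacted, only once.

```python
import numpy as np
import scipy.sparse as sp

def matrix_powers(A, x, k):
    """Return the Krylov basis [x, A x, A^2 x, ..., A^k x] as columns.
    Plain reference version: k sweeps over A. The CA matrix powers kernel
    computes the same columns while reading A / communicating only once."""
    V = np.empty((A.shape[0], k + 1))
    V[:, 0] = x
    for j in range(k):
        V[:, j + 1] = A @ V[:, j]
    return V

# Tiny example: 1D Poisson matrix, basis of dimension k+1 = 5
n, k = 10, 4
A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format='csr')
V = matrix_powers(A, np.ones(n), k)
print(V.shape)   # (10, 5)
```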

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV

• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

77

Example: The Difficulty of Tuning

• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs

78

Speedups on Itanium 2: The Need for Search

[Figure omitted: register-blocking performance profile in Mflop/s, comparing the reference implementation against the best block size found by search (4x2).]

79

Register Profile: Itanium 2

[Figure omitted: performance of all register block sizes, ranging from 190 Mflop/s to 1190 Mflop/s.]

80

Register Profiles: IBM and Intel IA-64

[Figures omitted: register-blocking performance profiles on four machines:
Power3 - 17 (122 to 252 Mflop/s), Power4 - 16 (459 to 820 Mflop/s), Itanium 1 - 8 (107 to 247 Mflop/s), Itanium 2 - 33 (190 Mflop/s to 1.2 Gflop/s).]

Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16,614
• NNZ = 1.1 M

82

Zoom in to top corner

• More complicated non-zero structure in general
• N = 16,614
• NNZ = 1.1 M

83

3x3 blocks look natural, but…

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

84

Extra Work Can Improve Efficiency

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup!
  – Actual Mflop rate 1.5^2 = 2.25x higher

85
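A small scipy sketch of the trade-off described above (illustrative; the matrix and block size are made up): converting CSR to 3x3 BSR stores explicit zeros ("fill"), so stored entries and flops grow by the fill ratio, but each block multiply can be unrolled and streamed, which is what makes register blocking pay off on matrices with dense substructure like raefsky.

```python
import numpy as np
import scipy.sparse as sp

# Random sparse matrix whose nonzeros cluster into small, partly full 3x3 blocks
rng = np.random.default_rng(0)
n, blocks = 900, 600
A = sp.lil_matrix((n, n))
for _ in range(blocks):
    i, j = rng.integers(0, n - 3, size=2)
    A[i:i + 3, j:j + 3] = rng.standard_normal((3, 3)) * (rng.random((3, 3)) < 0.7)
A = A.tocsr()

B = sp.bsr_matrix(A, blocksize=(3, 3))   # register-blocked storage, zeros filled in
fill_ratio = B.nnz / A.nnz               # stored entries (incl. explicit zeros) / true nnz
print(f"fill ratio = {fill_ratio:.2f}")

x = rng.standard_normal(n)
print(np.allclose(A @ x, B @ x))         # same SpMV result, different data structure
```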

Source: Accelerator Cavity Design Problem (Ko via Husbands)
[Figure omitted.]

86

100x100 Submatrix Along Diagonal
[Figure omitted.]

87

Post-RCM Reordering
[Figure omitted.]

88

Effect of Combined RCM+TSP Reordering
• Before: Green + Red
• After: Green + Blue
• 2x speedups on Pentium 4, Power 4, …

89

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·A^T·x, A^T·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later…

90

Optimized Sparse Kernel Interface - OSKI

• Provides sparse kernels automatically tuned for the user's matrix & machine
  – BLAS-style functionality: SpMV (A·x & A^T·y), TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as a stand-alone library
  – Available as a PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

91

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

93

Example: Classical Conjugate Gradient (CG)

[Pseudocode figure omitted.] The SpMVs and dot products require communication in each iteration.

94
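Since the slide's pseudocode did not survive extraction, here is a standard textbook CG loop (a sketch, not the deck's exact pseudocode) with the per-iteration communication points marked; CA-CG reorganizes s of these iterations so each kind of communication happens once per s steps.

```python
import numpy as np
import scipy.sparse as sp

def cg(A, b, tol=1e-10, maxiter=500):
    """Classical conjugate gradient. Each iteration has one SpMV and two dot
    products: in parallel, one neighbor exchange plus two global reductions."""
    x = np.zeros_like(b)
    r = b.copy()
    p = r.copy()
    rho = r @ r                      # dot product -> global reduction
    for _ in range(maxiter):
        q = A @ p                    # SpMV -> neighbor communication
        alpha = rho / (p @ q)        # dot product -> global reduction
        x += alpha * p
        r -= alpha * q
        rho_new = r @ r              # dot product -> global reduction
        if np.sqrt(rho_new) < tol:
            break
        p = r + (rho_new / rho) * p
        rho = rho_new
    return x

n = 100
A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format='csr')
b = np.ones(n)
x = cg(A, b)
print(np.linalg.norm(A @ x - b))
```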

Example: CA-Conjugate Gradient

[Pseudocode figure omitted.] The s SpMVs per outer iteration are done via the CA matrix powers kernel, one global reduction computes the Gram matrix G, and the local computations within the inner loop require no communication.

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

96

[Convergence plot omitted: CG vs CA-CG with the monomial basis on a model problem (2D Poisson, 5-point stencil, 30x30 grid, cond(A) ~ 400). CA-CG shows slower convergence and loss of accuracy, relative to machine precision, due to roundoff; at s = 16 the monomial basis is rank deficient and the method breaks down.]

97
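A quick check (a sketch; the grid size and s values are taken from the model problem above) that the monomial Krylov basis [x, Ax, …, A^s x] indeed becomes numerically rank deficient as s grows, which is why stable CA-CG variants switch to better-conditioned bases such as Newton or Chebyshev polynomials.

```python
import numpy as np
import scipy.sparse as sp

# 2D Poisson, 5-point stencil, 30x30 grid (the slide's model problem)
m = 30
T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(m, m))
A = (sp.kron(sp.identity(m), T) + sp.kron(T, sp.identity(m))).tocsr()

x = np.ones(A.shape[0])
V = [x / np.linalg.norm(x)]
for _ in range(16):
    v = A @ V[-1]
    V.append(v / np.linalg.norm(v))          # normalized monomial basis vectors

for s in (4, 8, 12, 16):
    print(s, np.linalg.cond(np.column_stack(V[:s + 1])))
# The condition number heads toward 1/eps: the basis is numerically rank deficient.
```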

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?

• Requires o(n^2) data/indices to store
• Nonzero entries and indices could be explicit or implicit:
  – Entries explicit, indices explicit (O(nnz) each): CSR and variations
  – Entries explicit, indices implicit: vision, climate, AMR, …
  – Entries implicit, indices explicit: graph Laplacians
  – Entries implicit, indices implicit (o(nnz)): stencils
• Matrix could be a sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + U·D·V^T, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write A^k = (S + U·D·V^T)^k as a sum of S^k and low-rank matrices
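A small sketch of the "sparse + low rank" representation above (all names and sizes are illustrative): apply A = S + U·D·V^T to a vector, and apply A^k, without ever forming A densely; this is the operation a communication-avoiding Krylov method on such matrices has to reorganize.

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
n, r, k = 500, 3, 4
S = sp.random(n, n, density=5 / n, random_state=rng, format='csr')   # sparse part
U = rng.standard_normal((n, r))
D = np.diag(rng.standard_normal(r))                                  # small & square
V = rng.standard_normal((n, r))

def apply_A(x):
    """y = (S + U D V^T) x without forming A: one SpMV + two tall-skinny products."""
    return S @ x + U @ (D @ (V.T @ x))

def apply_A_power(x, k):
    for _ in range(k):
        x = apply_A(x)
    return x

x = rng.standard_normal(n)
A_dense = S.toarray() + U @ D @ V.T
print(np.allclose(apply_A_power(x, k), np.linalg.matrix_power(A_dense, k) @ x))
```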

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

101

Reproducible Floating Point Computation

• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm at GNS-MBH
  – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe results
  – Willing to sacrifice 40% - 50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What? How will I debug without reproducibility?"
  – Few: "I know better and do careful error analysis"
  – S. Govindjee needs it for fracture simulations
  – S. Russell needs it for nuclear blast detection

Intel MKL non-reproducibility

[Plots omitted: absolute error for random vectors (same magnitude, opposite signs) and relative error for orthogonal vectors (sign not reproducible).]

Vector size 1e6, data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum – minimum
• Relative error = absolute error / maximum absolute value

103

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (not 2 or 3)
  – Use (very) high precision to get exact answer (not 2)
  – Prerounding technique (Nguyen, D.)

104
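A minimal sketch of the prerounding idea in its simplest single-bin form (illustrative; the actual algorithm uses several bins to recover accuracy): round every summand to a grid determined by the global maximum, after which every addition is exact and the result is independent of summation order and blocking.

```python
import numpy as np

def reproducible_sum(x):
    """Single-bin prerounding (sketch): the returned value is bitwise identical
    for any ordering of the summands, at the cost of some accuracy.
    The max-abs pass is itself order-independent, so it can be a first reduction."""
    x = np.asarray(x, dtype=np.float64)
    n = x.size
    M = np.max(np.abs(x))
    if M == 0.0:
        return 0.0
    # Big constant: ulp(C) defines the rounding grid. C is chosen large enough
    # that all n pre-rounded summands and their partial sums add exactly.
    C = np.ldexp(3.0, int(np.ceil(np.log2(M))) + int(np.ceil(np.log2(n))) + 2)
    q = (x + C) - C                  # x_i rounded to a multiple of ulp(C)
    return float(np.sum(q))          # every addition of the q_i is exact

rng = np.random.default_rng(0)
x = rng.standard_normal(10**6) * np.logspace(-8, 8, 10**6)
s1 = reproducible_sum(x)
s2 = reproducible_sum(rng.permutation(x))
print(s1 == s2, abs(s1 - np.sum(x)))  # identical results; accuracy limited by the single bin
```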

Performance results on 1024 proc. Cray XC30: 1.2x to 3.2x slowdown vs fastest code, for n = 1M

Collaborators and Supporters

• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)

106




Symmetric Eigenproblem and SVD

• Usual approach for A = Aᵀ (SVD similar)
  – A → QᵀAQ = T, where Q orthogonal, T tridiagonal
  – T → UᵀTU = Λ, where U orthogonal, Λ diagonal
  – QU's columns are the eigenvectors, Λ the eigenvalues
  – Dense → Tridiagonal → Diagonal
  – Only half BLAS3, half BLAS2, in LAPACK's sytrd

• Communication-Avoiding Approach
  – A → QAQᵀ = B, where B = Bᵀ is banded, of bandwidth M^(1/2)
  – Continue as above, starting with B
  – Dense → Banded → Tridiagonal → Diagonal
  – Dense → Banded: use TSQR to zero out M^(1/2) cols/rows at a time (a toy sketch follows below)
  – Banded → Tridiagonal: need new(ish) idea
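
To make the Dense → Banded step concrete, here is a minimal NumPy sketch of my own (not the CA-SBR code from the talk): it zeroes out b columns at a time below the band by QR-factorizing each panel and applying the orthogonal factor from both sides, so the result stays symmetric and becomes banded with bandwidth b. A communication-avoiding version would use TSQR for the tall-skinny panel factorization and apply the updates in a blocked fashion instead of forming an n-by-n Q.

    import numpy as np

    def dense_to_banded(A, b):
        """Reduce symmetric A to a symmetric banded matrix of bandwidth b by
        two-sided orthogonal transformations (illustrative sketch only)."""
        A = A.copy()
        n = A.shape[0]
        for j in range(0, n - b, b):
            panel = A[j + b:, j:j + b]                       # entries below the band in b columns
            Q, _ = np.linalg.qr(panel, mode='complete')      # CA version: TSQR on this tall-skinny panel
            Q_full = np.eye(n)
            Q_full[j + b:, j + b:] = Q                       # embed the panel's Q in the identity
            A = Q_full.T @ A @ Q_full                        # two-sided update keeps symmetry
        return A

    n, b = 12, 3
    A = np.random.default_rng(0).standard_normal((n, n))
    A = A + A.T
    B = dense_to_banded(A, b)
    offband = np.triu(B, k=b + 1)                            # everything more than b above the diagonal
    print(np.linalg.norm(offband))                           # ~1e-15: B is banded with bandwidth b
    print(np.allclose(np.linalg.eigvalsh(A), np.linalg.eigvalsh(B)))   # spectrum preserved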

Successive Band Reduction (Bischof/Lang/Sun)
[Figure sequence: sweeps of orthogonal updates Q1/Q1ᵀ through Q5/Q5ᵀ chase the bulge down the band in steps 1-6. Legend: b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b.]

Conventional vs CA-SBR

  Conventional:            touch all data 4 times
  Communication-Avoiding:  touch all data once

Speedups of Symmetric Band Reduction vs DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0 (n=12000, b=500, 8 threads)
• Up to 12x on Intel Westmere, vs MKL 10.3 (n=12000, b=200, 10 threads)
• Up to 25x on AMD Budapest, vs ACML 4.4 (n=9000, b=500, 4 threads)
• Up to 30x on AMD Magny-Cours, vs ACML 4.4 (n=12000, b=500, 6 threads)
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem

• No apparent way to modify the standard algorithm
• Instead: Spectral Divide-and-Conquer
  – Find an orthogonal matrix Q whose leading columns span an invariant subspace of A
  – QᵀAQ will be block upper triangular:

        QᵀAQ = [ A11  A12 ]
               [  ε   A22 ]

  – Apply recursively to A11, A22
  – Depends on randomization:
    1. Randomized Rank-Revealing QR decomposition
    2. Randomized location to try splitting the spectrum

(A hedged numerical sketch of one splitting step follows below.)
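
The following NumPy sketch is my own toy version of one splitting step, under simplifying assumptions: it uses the unscaled Newton iteration for the matrix sign function and a randomized range finder in place of the randomized rank-revealing QR mentioned on the slide, and the function name and shift parameter are hypothetical. It only shows how Q and the block upper triangular QᵀAQ arise; it is not the communication-avoiding algorithm itself.

    import numpy as np

    def split_spectrum(A, shift=0.0, iters=60):
        """Toy spectral divide-and-conquer step: separate eigenvalues with
        Re(z) > shift from those with Re(z) < shift.  Assumes no eigenvalue has
        real part exactly equal to shift (increase iters if some lie very close)."""
        n = A.shape[0]
        X = A - shift * np.eye(n)
        for _ in range(iters):                    # Newton iteration for sign(X);
            X = 0.5 * (X + np.linalg.inv(X))      # the CA version avoids explicit inverses
        P = 0.5 * (np.eye(n) + X)                 # projector onto the invariant subspace for Re(z) > shift
        r = int(round(np.trace(P)))               # dimension of that subspace
        G = np.random.default_rng(0).standard_normal((n, n))
        Q, _ = np.linalg.qr(P @ G)                # randomized range finder stands in for randomized RRQR
        T = Q.T @ A @ Q                           # block upper triangular: T[r:, :r] is at roundoff level
        return Q, T, r

    A = np.random.default_rng(1).standard_normal((8, 8))
    Q, T, r = split_spectrum(A)
    print(r, np.linalg.norm(T[r:, :r]))           # the (2,1) block ("ε") is tiny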

Attaining the Lower Bounds: Sequential
Legend: [Existing], [Ours], [Math-Lib], [Random]
(Table columns: #words and #messages, for two levels of memory and for a full memory hierarchy; citations listed per row.)

• BLAS-3:            [FLPR'99] [BDLST'13] [MKL etc.]  |  [FLPR'99] [BDLST'13] [MKL etc.]
• Cholesky:          [G'97] [AP'00] [LAPACK] [BDHS'09]  |  [G'97] [AP'00] [BDHS'09]  |  [G'97] [AP'00] [BDHS'09]
• Sym. Indefinite:   [BBDDDPSTY'13]  |  [BBDDDPSTY'13]
• LU:                [G'97] [T'97] [GDX'11] [BDLST'13]  |  [GDX'11] [BDLST'13]  |  [G'97] [T'97] [BDLST'13]  |  [BDLST'13]
• QR:                [EG'98] [FW'03] [DGHL'12] [BDLST'13]  |  [FW'03] [DGHL'12] [BDLST'13]  |  [EG'98] [FW'03] [BDLST'13]  |  [FW'03] [BDLST'13]
• Rank-Revealing QR: [BDD'11] [DGGX'13]
• Sym. Eig & SVD:    [BDD'11] [BDK'13]  |  [BDD'11]
• Non-Sym. Eig:      [BDD'11]  |  [BDD'11]

Attaining the Lower Bounds: Parallel 2D, M = Θ(n²/P)
(Ignoring poly-log(P) factors; #words = Ω(n²/P^(1/2)), #messages = Ω(P^(1/2)).)
Legend: [Existing], [Ours], [Math-Lib], [Random]
(Table columns: #words (BW), #messages (L), and the saving factor over 2D.)

• BLAS-3:            [AGZ'94] [MT'99] [ScaLAPACK] [C'69] [vGW'97] [SD'11]            saving: L: n/P^(1/2)
• Cholesky:          [ScaLAPACK] [T'99] [SD'11]                                      saving: L: n/P^(1/2)
• Sym. Indefinite:   [BBDDDPSTY'13] [ScaLAPACK]  |  [BBDDDPSTY'13]                   saving: L: n/P^(1/2)
• LU:                [ScaLAPACK] [GDX'11] [T'99] [SD'11]  |  [GDX'11] [T'99] [SD'11] saving: L: n/P^(1/2)
• QR:                [ScaLAPACK] [DGHL'12] [T'99]  |  [DGHL'12] [T'99]               saving: L: n/P^(1/2)
• Rank-Revealing QR: [BDD'11] [DGGX'13]
• Sym. Eig & SVD:    [BDD'11] [BDK'13] [ScaLAPACK]  |  [BDD'11] [BDK'13]             saving: L: n/P^(1/2)
• Non-Sym. Eig:      [BDD'11]  |  [BDD'11]                                           saving: BW: P^(1/2), L: n

Attaining with extra memory (2.5D): M = Θ(c·n²/P).
(A numeric illustration of the 2D vs 2.5D costs follows below.)
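
As a quick numeric illustration of the 2D vs 2.5D costs (my own back-of-the-envelope helper, with constants and lower-order terms dropped; the 2.5D message formula is the one known for matmul-like kernels):

    from math import sqrt

    def costs_2d(n, P):
        """Per-processor costs for 2D algorithms: words ~ n^2/sqrt(P), messages ~ sqrt(P)."""
        return n * n / sqrt(P), sqrt(P)

    def costs_25d(n, P, c):
        """2.5D with c data replicas (memory M ~ c*n^2/P):
        words ~ n^2/sqrt(c*P), messages ~ sqrt(P/c^3) for matmul-like kernels."""
        return n * n / sqrt(c * P), sqrt(P / c**3)

    n, P = 10**5, 4096
    print(costs_2d(n, P))        # ~ (1.6e8 words, 64 messages)
    print(costs_25d(n, P, 16))   # replicating c=16 times cuts words by 4x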

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and the starting vector
  – Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors
    • Conventional: O(k log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation (see the ghost-zone sketch below)
  – Challenges: poor partitioning, preconditioning, numerical stability
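
To make the "O(1) data movement at the price of redundant computation" trade concrete, here is a small NumPy sketch for the simplest case, a 1D Poisson (tridiagonal) matrix: a processor that holds k ghost values on each side of its slice can compute its pieces of Ax, A²x, …, Aᵏx with no further communication, redoing a little of its neighbors' stencil work instead. The function name and the toy sizes are mine, not from the talk.

    import numpy as np

    def local_matrix_powers_1d(x_local, ghost_left, ghost_right):
        """Local pieces of A x, A^2 x, ..., A^k x for the 1D Poisson stencil
        A = tridiag(-1, 2, -1), given k ghost values on each side
        (k = len(ghost_left)).  One ghost exchange up front, then no messages."""
        k, n = len(ghost_left), len(x_local)
        z = np.concatenate([ghost_left, x_local, ghost_right])   # length n + 2k
        out = []
        for s in range(1, k + 1):
            z = 2.0 * z[1:-1] - z[:-2] - z[2:]       # one stencil sweep; one entry lost per side
            out.append(z[k - s : k - s + n].copy())  # middle n entries = local part of A^s x
        return out

    # Check against the assembled matrix on one interior "processor" slice
    n, k, lo, m = 12, 3, 4, 4                        # global size, steps, slice start, slice length
    x = np.random.default_rng(0).standard_normal(n)
    A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    local = local_matrix_powers_1d(x[lo:lo + m], x[lo - k:lo], x[lo + m:lo + m + k])
    print(all(np.allclose(local[s - 1], (np.linalg.matrix_power(A, s) @ x)[lo:lo + m])
              for s in range(1, k + 1)))             # True: same values, no extra messages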

Example: The Difficulty of Tuning SpMV

• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

Example: The Difficulty of Tuning (continued)

• n = 21,200, nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs (see the block-format sketch below)
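
A register-blocked (BCSR) layout is what tuned libraries exploit for such matrices. The sketch below only demonstrates the storage effect, using SciPy's built-in BSR format on a synthetic matrix made of 8x8 dense blocks (my own toy construction, not the raefsky matrix; OSKI/pOSKI choose the block size automatically by tuning).

    import numpy as np
    import scipy.sparse as sp

    # Synthetic matrix whose nonzeros come in 8x8 dense blocks
    rng = np.random.default_rng(0)
    nb, bs = 64, 8                                    # 64x64 grid of 8x8 blocks -> n = 512
    dense = np.zeros((nb * bs, nb * bs))
    for i in range(nb):
        for j in rng.choice(nb, size=6, replace=False):       # ~6 nonzero blocks per block row
            dense[i*bs:(i+1)*bs, j*bs:(j+1)*bs] = rng.standard_normal((bs, bs))

    A_csr = sp.csr_matrix(dense)
    A_bsr = A_csr.tobsr(blocksize=(bs, bs))           # register-blocked (BCSR) storage

    x = rng.standard_normal(nb * bs)
    print(np.allclose(A_csr @ x, A_bsr @ x))          # identical SpMV result
    # CSR keeps one column index per nonzero; BSR keeps one per 8x8 block,
    # so index traffic drops by ~64x, which is the memory-reference saving above.
    print(A_csr.indices.size, A_bsr.indices.size)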

Speedups on Itanium 2: The Need for Search
[Figure: SpMV performance across register block sizes (Mflops); the reference implementation and the best block size, 4x2, are marked.]

Register Profile: Itanium 2
[Figure: register-blocking profile on Itanium 2, ranging from 190 Mflops to 1190 Mflops.]

Register Profiles: IBM and Intel IA-64
[Figure: register-blocking profiles labeled Power3 - 17, Power4 - 16, Itanium 1 - 8, Itanium 2 - 33; Mflops labels as shown: 252/122, 820/459, 247/107, 1.2 Gflops/190.]

Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16,614, NNZ = 1.1 M

Zoom in to top corner
• More complicated non-zero structure in general
• N = 16,614, NNZ = 1.1 M

3x3 blocks look natural, but…

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

Extra Work Can Improve Efficiency

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup!
  – Actual Mflop rate is 1.5² = 2.25x higher

(A small fill-ratio calculation is sketched below.)
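
The trade-off can be estimated before committing to a block size: the helper below (my own, not OSKI's heuristic, though OSKI's off-line/run-time model is built on the same two quantities) computes the fill ratio of an r-by-c blocking, i.e. stored entries including explicit zeros divided by true nonzeros. A blocking pays off roughly when the speedup of the r-by-c kernel exceeds the fill ratio.

    import numpy as np
    import scipy.sparse as sp

    def fill_ratio(A, r, c):
        """Stored entries after r-by-c register blocking on a logical grid of
        r-by-c cells (explicit zeros included) divided by the true nnz."""
        coo = sp.csr_matrix(A).tocoo()
        blocks = set(zip(coo.row // r, coo.col // c))   # distinct blocks that contain a nonzero
        return len(blocks) * r * c / coo.nnz

    # 2D Poisson 5-point stencil on a 30x30 grid (the model problem used later in the talk)
    m = 30
    T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(m, m))
    A = sp.kron(sp.eye(m), T) + sp.kron(T, sp.eye(m))
    for r, c in [(1, 1), (2, 2), (3, 3)]:
        print((r, c), round(fill_ratio(A, r, c), 2))
    # The 3x3 blocking of the matrix on the slide has fill ratio ~1.5; it wins
    # if the unrolled 3x3 kernel runs more than 1.5x faster than the CSR kernel.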

Source: Accelerator Cavity Design Problem (Ko, via Husbands)
[Figure: spy plot of the matrix.]

100x100 Submatrix Along Diagonal
[Figure.]

Post-RCM Reordering
[Figure.]

Effect of Combined RCM+TSP Reordering
[Figure: before = green + red, after = green + blue; about 2x speedups on Pentium 4, Power 4, …]

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·Aᵀ·x, Aᵀ·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later …

Optimized Sparse Kernel Interface - OSKI

• Provides sparse kernels automatically tuned for the user's matrix & machine
  – BLAS-style functionality: SpMV (Ax & Aᵀy), TrSV
  – Does both off-line and run-time tuning
  – Hides the complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as a stand-alone library
  – Available as a PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

Example: Classical Conjugate Gradient (CG)
[Algorithm listing as a figure. Callouts: the SpMV and the dot products require communication in each iteration; in the CA version the SpMVs are replaced via the CA matrix powers kernel and the dot products by one global reduction that computes a Gram matrix G.]
(An annotated plain-CG sketch follows below.)
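
Since the algorithm listing itself did not survive extraction, here is a standard textbook CG in NumPy (my own rendering, not the slide's exact listing), with comments marking the two communication points per iteration that CA-CG restructures:

    import numpy as np

    def cg(A, b, tol=1e-10, maxiter=200):
        """Plain conjugate gradients for SPD A; comments flag where a parallel
        implementation communicates each iteration (CA-CG batches these into one
        matrix-powers call plus one reduction per s iterations)."""
        x = np.zeros(len(b))
        r = b - A @ x                 # SpMV: neighbor/halo communication
        p = r.copy()
        rs = r @ r                    # dot product: global reduction
        for _ in range(maxiter):
            Ap = A @ p                # SpMV: neighbor/halo communication
            alpha = rs / (p @ Ap)     # dot product: global reduction
            x += alpha * p
            r -= alpha * Ap
            rs_new = r @ r            # dot product: global reduction
            if np.sqrt(rs_new) < tol:
                break
            p = r + (rs_new / rs) * p
            rs = rs_new
        return x

    # 1D Poisson test problem
    n = 100
    A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    b = np.ones(n)
    x = cg(A, b)
    print(np.linalg.norm(A @ x - b))   # residual norm below the tolerance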

Example: CA-Conjugate Gradient
[Algorithm listing as a figure. Callout: local computations within the inner loop require no communication.]

[Convergence plot: CA-CG with the monomial basis vs. standard CG on a model problem (2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400). Roundoff causes slower convergence and loss of accuracy relative to machine precision; at s = 16 the monomial basis is rank deficient and the method breaks down.]
(A small demonstration of the monomial-basis conditioning follows below.)
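
The breakdown is easy to reproduce: the condition number of the monomial Krylov basis [v, Av, A²v, …, Aˢv] grows geometrically, so in double precision it becomes numerically rank deficient around s = 16 for this operator. The toy check below is mine (this growth is the motivation for the Newton or Chebyshev bases used in practice):

    import numpy as np
    import scipy.sparse as sp

    # 2D Poisson, 5-point stencil on a 30x30 grid (cond(A) ~ 400), as in the model problem
    m = 30
    T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(m, m))
    A = (sp.kron(sp.eye(m), T) + sp.kron(T, sp.eye(m))).tocsr()

    v = np.random.default_rng(0).standard_normal(A.shape[0])
    for s in (4, 8, 12, 16):
        V = [v]
        for _ in range(s):
            V.append(A @ V[-1])                  # monomial basis: v, Av, A^2 v, ...
        K = np.column_stack(V)
        print(s, f"cond = {np.linalg.cond(K):.2e}")
    # cond(K) grows geometrically and reaches ~1/eps (1e16) by s = 16, i.e. the
    # basis is numerically rank deficient, matching the breakdown in the plot.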

What is a "sparse matrix"?

• Requires o(n²) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be a sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + UDVᵀ, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write Aᵏ = (S + UDVᵀ)ᵏ as a sum of Sᵏ and low-rank matrices

Examples, by how the nonzero entries and the indices are represented:

                               Indices explicit (O(nnz))   Indices implicit (o(nnz))
  Entries explicit (O(nnz)):   CSR and variations           Vision, climate, AMR, …
  Entries implicit (o(nnz)):   Graph Laplacian              Stencils

(A small sparse-plus-low-rank matvec sketch follows below.)
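
For the A = S + UDVᵀ case, the point is that one never forms A explicitly: a matvec applies the sparse part and the low-rank part separately, and powers Aᵏx are computed the same way. A minimal sketch of mine (class name and sizes are hypothetical):

    import numpy as np
    import scipy.sparse as sp

    class SparsePlusLowRank:
        """A = S + U @ D @ V.T, applied without ever forming the dense A."""
        def __init__(self, S, U, D, V):
            self.S, self.U, self.D, self.V = sp.csr_matrix(S), U, D, V

        def matvec(self, x):
            return self.S @ x + self.U @ (self.D @ (self.V.T @ x))

        def power_matvec(self, x, k):
            # A^k x by repeated matvec: O(k (nnz(S) + n r)) work instead of O(k n^2)
            for _ in range(k):
                x = self.matvec(x)
            return x

    n, r = 500, 5
    rng = np.random.default_rng(0)
    S = sp.random(n, n, density=1e-2, random_state=0, format="csr")
    U, V = rng.standard_normal((n, r)), rng.standard_normal((n, r))
    D = np.diag(rng.standard_normal(r))
    A = SparsePlusLowRank(S, U, D, V)

    x = rng.standard_normal(n)
    dense = S.toarray() + U @ D @ V.T            # only for checking the small example
    print(np.allclose(A.power_matvec(x, 3), np.linalg.matrix_power(dense, 3) @ x))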

Reproducible Floating Point Computation

• Goal: get a bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers), otherwise they don't believe the results
  – Willing to sacrifice 40% - 50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What?! How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility
[Two plots: absolute error for random vectors (same magnitude, opposite signs) and relative error for orthogonal vectors (sign not reproducible).]

Vector size 1e6, data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum – minimum
• Relative error = absolute error / maximum absolute value

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches
  – Guarantee a fixed reduction tree (fails goal 2 or 3)
  – Use (very) high precision to get the exact answer (fails goal 2)
  – Prerounding technique (Nguyen, D.)

(A tiny demonstration of the underlying non-associativity follows below.)
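
The root cause is that floating-point addition is not associative, so any change in reduction order (different thread counts, different layouts) can change the last bits. A short demonstration of my own, plus the brute-force fix of a correctly rounded sum via math.fsum; this only illustrates the problem and goal 1, not the prerounding technique itself:

    import math
    import random

    random.seed(0)
    x = [random.gauss(0, 1) * 10**random.randint(-8, 8) for _ in range(10**5)]

    s_forward  = sum(x)
    s_backward = sum(reversed(x))
    s_shuffled = sum(random.sample(x, len(x)))          # a different "reduction order"

    print(s_forward == s_backward, s_forward == s_shuffled)   # usually False, False
    print(s_forward - s_shuffled)                             # differs in the last bits

    # math.fsum returns the correctly rounded sum, so it is independent of order:
    print(math.fsum(x) == math.fsum(reversed(x)) == math.fsum(random.sample(x, len(x))))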

Performance results on 1024 proc. Cray XC30: 1.2x to 3.2x slowdown vs the fastest code, for n = 1M

Collaborators and Supporters

• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers).

Don't Communic…

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all "n3-like" linear algebra
  • Lower bound for all "n3-like" linear algebra (2)
  • Lower bound for all "n3-like" linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 2.5D Matrix Multiplication
  • 2.5D Matrix Multiplication (2)
  • 2.5D Matmul on BG/P, 16K nodes / 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a "Thm"?
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 2.5D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 2.5D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 2.5D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds: Parallel 2D, M=Θ(n²/P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a "sparse matrix"?
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30: 1.2x to 3.2x slowdow
  • Collaborators and Supporters
  • Summary
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary
Page 58: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

Successive Band Reduction (Bischof/Lang/Sun)

[Figure: sequence of animation frames showing a banded matrix being reduced. Numbered regions 1-6 mark successive sweeps; orthogonal transforms Q1…Q5 (and Q1T…Q5T) eliminate d diagonals at a time and chase the resulting bulge. Legend: b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b.]

Conventional vs CA-SBR

                      Conventional              Communication-Avoiding
                      Touch all data 4 times    Touch all data once

Speedups of Sym. Band Reduction vs DSBTRD
• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem
• No apparent way to modify standard algorithm
• Instead: Spectral Divide-and-Conquer (a sketch of one splitting step follows below)
  – Find orthogonal matrix Q whose leading columns span an invariant subspace of A
  – Q^T A Q will be block upper triangular:

        Q^T A Q = [ A11  A12 ]
                  [ ε    A22 ]

  – Apply recursively to A11, A22
  – Depends on randomization:
    1. Randomized Rank-Revealing QR decomposition
    2. Randomized location to try splitting spectrum
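
The splitting step above can be pictured with a short sketch. The version below uses the classical matrix sign function (Newton iteration with explicit inverses) plus a randomized range finder as a stand-in for the randomized rank-revealing QR; the communication-avoiding algorithm in the talk instead works implicitly with QR factorizations only. All names and sizes here are hypothetical demo choices, not code from the talk.

    import numpy as np

    def split_spectrum(A, iters=50):
        """One spectral divide-and-conquer step: return Q, T = Q^T A Q (block upper
        triangular) and r = dimension of the invariant subspace for Re(lambda) > 0."""
        n = A.shape[0]
        S = A.copy()
        for _ in range(iters):            # Newton iteration for the matrix sign function
            S = 0.5 * (S + np.linalg.inv(S))
        P = 0.5 * (S + np.eye(n))         # spectral projector onto Re(lambda) > 0
        r = int(round(np.trace(P)))       # dimension of that invariant subspace
        G = np.random.default_rng(0).standard_normal((n, n))
        Q, _ = np.linalg.qr(P @ G)        # leading r columns of Q span range(P)
        T = Q.T @ A @ Q                   # (2,1) block should be ~ machine epsilon
        return Q, T, r

    # Demo: a matrix with four eigenvalues on each side of the imaginary axis.
    rng = np.random.default_rng(1)
    vals = np.array([-3.0, -2.0, -1.5, -1.0, 1.0, 1.5, 2.0, 3.0])
    Vmat = rng.standard_normal((8, 8)) + 8 * np.eye(8)
    A = Vmat @ np.diag(vals) @ np.linalg.inv(Vmat)
    Q, T, r = split_spectrum(A)
    print(r, np.linalg.norm(T[r:, :r]))   # expect r = 4 and a tiny (2,1) block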

Attaining the Lower Bounds: Sequential
Legend: [Existing] [Ours] [Math-Lib] [Random]
(Table columns in the original slide: #words and #messages, for two levels of memory and for a full memory hierarchy; the cell entries are the references listed per row below.)

BLAS-3:              [FLPR'99] [BDLST'13] [MKL etc.]
Cholesky:            [G'97] [AP'00] [LAPACK] [BDHS'09]
Sym. Indefinite:     [BBDDDPSTY'13]
LU:                  [G'97] [T'97] [GDX'11] [BDLST'13]
QR:                  [EG'98] [FW'03] [DGHL'12] [BDLST'13]
Rank-Revealing QR:   [BDD'11] [DGGX'13]
Sym. Eig & SVD:      [BDD'11] [BDK'13]
Non-Sym. Eig:        [BDD'11]

Attaining the Lower Bounds: Parallel 2D, M = Θ(n²/P)
(Ignoring poly-log(P) factors; lower bounds: #words = Ω(n²/P^(1/2)), #messages = Ω(P^(1/2)).)
Legend: [Existing] [Ours] [Math-Lib] [Random]. Columns: Words (BW), Messages (L), Saving factor.

• BLAS-3: [AGZ'94][MT'99][ScaLAPACK][C'69][vGW'97][SD'11]; saving factor L: n/P^(1/2)
• Cholesky: [ScaLAPACK][T'99][SD'11]; saving factor L: n/P^(1/2)
• Sym. Indefinite: Words [BBDDDPSTY'13][ScaLAPACK]; Messages [BBDDDPSTY'13]; saving factor L: n/P^(1/2)
• LU: Words [ScaLAPACK][GDX'11][T'99][SD'11]; Messages [GDX'11][T'99][SD'11]; saving factor L: n/P^(1/2)
• QR: Words [ScaLAPACK][DGHL'12][T'99]; Messages [DGHL'12][T'99]; saving factor L: n/P^(1/2)
• Rank-Revealing QR: [BDD'11][DGGX'13]
• Sym. Eig & SVD: Words [BDD'11][BDK'13][ScaLAPACK]; Messages [BDD'11][BDK'13]; saving factor L: n/P^(1/2)
• Non-Sym. Eig: [BDD'11]; [BDD'11]; saving factor BW: P^(1/2), L: n
• Attaining with extra memory (2.5D): M = Θ(c·n²/P)

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra
• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and starting vector
  – Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication (a small matrix-powers sketch follows below)
  – Assume matrix "well-partitioned"
  – Serial implementation
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors
    • Conventional: O(k log p) messages (k SpMV calls, dot products)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability
75
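
A minimal sketch of the serial/parallel claim above, for an assumed 1D 3-point (tridiagonal) stencil: after reading its own rows plus a halo of depth k once, a processor can compute its pieces of x, Ax, ..., A^k x with no further data movement. The function name and problem sizes are hypothetical.

    import numpy as np

    def local_matrix_powers(x_global, lo, hi, k):
        """Compute rows lo:hi of A^j x for j=0..k, where A is the 1D Laplacian
        y[i] = 2*x[i] - x[i-1] - x[i+1] (zero boundary). Only entries
        lo-k : hi+k of x_global are read: one halo exchange replaces k exchanges."""
        n = len(x_global)
        ext_lo, ext_hi = max(lo - k, 0), min(hi + k, n)
        w = x_global[ext_lo:ext_hi].copy()          # local part + depth-k halo
        out = [w[(lo - ext_lo):(hi - ext_lo)].copy()]
        for j in range(1, k + 1):
            nxt = np.zeros_like(w)
            nxt[1:-1] = 2 * w[1:-1] - w[:-2] - w[2:]
            if ext_lo == 0:                          # true left boundary of the domain
                nxt[0] = 2 * w[0] - w[1]
            if ext_hi == n:                          # true right boundary of the domain
                nxt[-1] = 2 * w[-1] - w[-2]
            # entries within j of the halo edge are now stale, but the reported
            # interior slice lo:hi stays correct for all j <= k
            w = nxt
            out.append(w[(lo - ext_lo):(hi - ext_lo)].copy())
        return out

    # Check against k explicit global SpMVs.
    rng = np.random.default_rng(0)
    n, k, lo, hi = 40, 4, 10, 20
    x = rng.standard_normal(n)
    A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    mine = local_matrix_powers(x, lo, hi, k)
    ref = x.copy()
    for j in range(k + 1):
        assert np.allclose(mine[j], ref[lo:hi])
        ref = A @ ref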

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV
• n = 21200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
77

Example: The Difficulty of Tuning
• n = 21200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit memory references (the untuned CSR baseline is sketched below)
78
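
For reference, the untuned baseline that the following optimizations are measured against is the plain CSR SpMV loop below: one indexed load of x per nonzero, no blocking. This is an illustrative Python sketch, not the BeBOP/OSKI kernels.

    import numpy as np

    def csr_spmv(rowptr, colind, vals, x):
        """y = A*x with A in CSR: vals/colind hold nonzeros row by row,
        rowptr[i]:rowptr[i+1] delimits row i."""
        n = len(rowptr) - 1
        y = np.zeros(n)
        for i in range(n):
            s = 0.0
            for k in range(rowptr[i], rowptr[i + 1]):
                s += vals[k] * x[colind[k]]   # one indexed load of x per nonzero
            y[i] = s
        return y

    # Tiny example: the 3x3 matrix [[4,0,1],[0,3,0],[2,0,5]]
    rowptr = np.array([0, 2, 3, 5])
    colind = np.array([0, 2, 1, 0, 2])
    vals   = np.array([4.0, 1.0, 3.0, 2.0, 5.0])
    x = np.array([1.0, 2.0, 3.0])
    print(csr_spmv(rowptr, colind, vals, x))   # [ 7.  6. 17.]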

Speedups on Itanium 2: The Need for Search
[Figure: register-blocking profile; the reference implementation runs at 190 Mflops, the best block size (4x2) at 1190 Mflops.]
79

Register Profile: Itanium 2
[Figure: performance of all register block sizes, ranging from 190 Mflops to 1190 Mflops.]
80

Register Profiles: IBM and Intel IA-64
[Figure: block-size profile panels for Power3 (17), Power4 (16), Itanium 1 (8), Itanium 2 (33); performance ranges roughly 122-252 Mflops, 459-820 Mflops, 107-247 Mflops, and 190 Mflops-1.2 Gflops respectively.]

Another example of tuning challenges for SpMV
• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16614
• NNZ = 1.1 M
82

Zoom in to top corner
• More complicated non-zero structure in general
• N = 16614
• NNZ = 1.1 M
83

3x3 blocks look natural, but…
• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"
84

Extra Work Can Improve Efficiency
• Example: 3x3 blocking (a blocked-CSR sketch with explicit fill follows below)
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup!
  – Actual mflop rate 1.5² = 2.25x higher
85
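
A sketch of the register-blocking idea above: store the matrix as dense r x c blocks (3x3 here), filling unused positions with explicit zeros, so each block multiply can be unrolled and only one column index is loaded per block. The helper names and the dense-input conversion are demo assumptions; the production kernels are generated, unrolled C.

    import numpy as np

    def to_bcsr(A_dense, r=3, c=3):
        """Block-CSR: keep every r x c block containing at least one nonzero,
        storing the rest of the block as explicit zeros (the 'fill')."""
        m, n = A_dense.shape
        browptr, bcolind, blocks = [0], [], []
        for bi in range(0, m, r):
            for bj in range(0, n, c):
                blk = A_dense[bi:bi+r, bj:bj+c]
                if np.any(blk != 0):
                    bcolind.append(bj // c)
                    blocks.append(blk.copy())
            browptr.append(len(bcolind))
        return np.array(browptr), np.array(bcolind), np.array(blocks)

    def bcsr_spmv(browptr, bcolind, blocks, x, r=3, c=3):
        y = np.zeros(r * (len(browptr) - 1))
        for bi in range(len(browptr) - 1):
            for k in range(browptr[bi], browptr[bi + 1]):
                bj = bcolind[k]
                # in real code this 3x3 multiply is fully unrolled
                y[bi*r:(bi+1)*r] += blocks[k] @ x[bj*c:(bj+1)*c]
        return y

    rng = np.random.default_rng(0)
    A = np.where(rng.random((9, 9)) < 0.2, rng.standard_normal((9, 9)), 0.0)
    bp, bc, bl = to_bcsr(A)
    x = rng.standard_normal(9)
    assert np.allclose(bcsr_spmv(bp, bc, bl, x), A @ x)
    print("fill ratio:", bl.size / max(np.count_nonzero(A), 1))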

Source: Accelerator Cavity Design Problem (Ko via Husbands)
[Figure: spy plot of the matrix.]
86

100x100 Submatrix Along Diagonal
[Figure: spy plot of a 100x100 diagonal submatrix.]
87

Post-RCM Reordering
[Figure: spy plot after reverse Cuthill-McKee reordering.]
88

Effect of Combined RCM+TSP Reordering
[Figure: before = green + red blocks, after = green + blue blocks.]
2x speedups on Pentium 4, Power 4, …
89

Summary of Other Performance Optimizations
• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·A^T·x, A^T·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later…
90

Optimized Sparse Kernel Interface - OSKI
• Provides sparse kernels automatically tuned for user's matrix & machine
  – BLAS-style functionality: SpMV (Ax & A^T y), TrSV
  – Does both off-line and run-time tuning (a generic tuning-loop sketch follows below)
  – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as stand-alone library
  – Available as PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski
91
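
The run-time tuning that OSKI hides can be pictured as the small search loop below: time a few candidate kernels on the user's actual matrix and keep the fastest. This is only the idea in miniature, not OSKI's API (OSKI adds off-line benchmarking, cost models, and tuning hints); the candidate kernels here are arbitrary stand-ins.

    import time
    import numpy as np

    def pick_fastest(kernels, A, x, reps=5):
        """kernels: dict name -> callable(A, x). Returns (best_name, timings)."""
        timings = {}
        for name, kern in kernels.items():
            kern(A, x)                           # warm up / build any blocked structure
            t0 = time.perf_counter()
            for _ in range(reps):
                kern(A, x)
            timings[name] = (time.perf_counter() - t0) / reps
        return min(timings, key=timings.get), timings

    rng = np.random.default_rng(0)
    A = rng.standard_normal((500, 500))
    x = rng.standard_normal(500)
    candidates = {
        "matvec (A @ x)": lambda A, x: A @ x,
        "row-by-row dot": lambda A, x: np.array([row @ x for row in A]),
    }
    print(pick_fastest(candidates, A, x))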

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

93

Example: Classical Conjugate Gradient (CG)
[Pseudocode figure not captured in the transcript.] SpMVs and dot products require communication in each iteration. (A plain CG sketch follows below.)
94

Example: CA-Conjugate Gradient
[Pseudocode figure not captured in the transcript.] The basis vectors are computed via the CA Matrix Powers Kernel, a single global reduction computes the Gram matrix G, and the local computations within the inner loop require no communication.
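
A minimal unpreconditioned CG in NumPy, with comments marking the two kinds of per-iteration communication in a distributed setting (the SpMV's halo exchange and the dot products' global reductions). The test problem mirrors the model problem used later (2D Poisson, 5-point stencil, 30x30 grid); this is a sketch, not the talk's CA-CG code.

    import numpy as np

    def cg(A, b, tol=1e-10, maxit=1000):
        x = np.zeros_like(b)
        r = b - A @ x
        p = r.copy()
        rr = r @ r                          # dot product -> global reduction
        for _ in range(maxit):
            Ap = A @ p                      # SpMV -> halo exchange with neighbors
            alpha = rr / (p @ Ap)           # dot product -> global reduction
            x += alpha * p
            r -= alpha * Ap
            rr_new = r @ r                  # dot product -> global reduction
            if np.sqrt(rr_new) < tol * np.linalg.norm(b):
                break
            p = r + (rr_new / rr) * p
            rr = rr_new
        return x

    # 2D Poisson test problem: 5-point stencil on an m x m grid
    m = 30
    T = 2 * np.eye(m) - np.eye(m, k=1) - np.eye(m, k=-1)
    A = np.kron(np.eye(m), T) + np.kron(T, np.eye(m))
    b = np.ones(m * m)
    x = cg(A, b)
    print(np.linalg.norm(A @ x - b))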

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

96

[Figure: convergence of CG vs CA-CG (monomial basis) on a model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ~ 400. CA-CG shows slower convergence and loss of accuracy due to roundoff, with attainable accuracy bounded by machine precision; at s = 16 the monomial basis is rank deficient and the method breaks down. A small demonstration follows below.]
97
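
The breakdown above can be reproduced in a few lines: build the monomial s-step basis V = [p, Ap, ..., A^s p] for the same model problem (the vectors the matrix powers kernel would compute) and watch its condition number grow with s; the Gram matrix G = V^T V is the quantity CA-CG forms with a single global reduction. Sketch only; sizes and the random seed are arbitrary, and practical CA-CG switches to Newton or Chebyshev bases.

    import numpy as np

    m = 30
    T = 2 * np.eye(m) - np.eye(m, k=1) - np.eye(m, k=-1)
    A = np.kron(np.eye(m), T) + np.kron(T, np.eye(m))      # 2D Poisson, 5-point stencil
    rng = np.random.default_rng(0)
    p = rng.standard_normal(m * m)

    for s in (4, 8, 12, 16):
        V = np.empty((m * m, s + 1))
        V[:, 0] = p
        for j in range(1, s + 1):
            V[:, j] = A @ V[:, j - 1]    # monomial basis: what the powers kernel delivers
        G = V.T @ V                       # Gram matrix: the one global reduction in CA-CG
        print(s, np.linalg.cond(V))       # grows rapidly; near 1/eps means rank deficient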

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?
• Requires o(n²) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + UDV^T, D small & square (a matvec sketch follows below)
• Semiseparable matrices arise as preconditioners
  – Need to write A^k = (S + UDV^T)^k as sum of S^k and low-rank matrices

                              Indices explicit (O(nnz))   Indices implicit (o(nnz))
Nonzero entries
  explicit (O(nnz)):          CSR and variations          Vision, climate, AMR, …
  implicit (o(nnz)):          Graph Laplacian             Stencils
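
A small sketch of the "sum of sparse matrices" case above: apply A = S + U D V^T to a vector without ever forming the dense A, using a CSR matrix for S and skinny dense factors for the low-rank part. SciPy supplies the CSR type; the sizes and rank below are hypothetical.

    import numpy as np
    import scipy.sparse as sp

    def apply_sparse_plus_lowrank(S, U, D, V, x):
        # one sparse matvec plus a few skinny dense products: O(nnz(S) + n*r) work
        return S @ x + U @ (D @ (V.T @ x))

    rng = np.random.default_rng(0)
    n, r = 1000, 5
    S = sp.random(n, n, density=1e-3, random_state=0, format="csr")
    U = rng.standard_normal((n, r))
    D = np.diag(rng.standard_normal(r))
    V = rng.standard_normal((n, r))
    x = rng.standard_normal(n)

    y = apply_sparse_plus_lowrank(S, U, D, V, x)
    # Powers keep the same structure: A^2 = S^2 + (low rank), since each cross term
    # S U D V^T, U D V^T S, and U D (V^T U) D V^T has rank at most r.
    y2 = apply_sparse_plus_lowrank(S, U, D, V, y)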

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

101

Reproducible Floating Point Computation
• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe results
  – Willing to sacrifice 40% - 50% of performance for it
• Email to ~110 Berkeley CSE faculty, asking about it
  – Most: "What? How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility
[Figure: two error distributions for repeated dot products: absolute error for random vectors (results of the same magnitude but opposite signs occur) and relative error for orthogonal vectors (even the sign is not reproducible).]
Vector size: 1e6. Data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum - minimum
• Relative error = Absolute error / maximum absolute value
(A small order-dependence demo follows below.)
103
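
A serial stand-in for the experiment above: summing the same products with different blockings (as different thread counts would) gives different answers because floating-point addition is not associative. The blocking function below is an illustrative assumption about how a static partition behaves, not MKL's actual schedule.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(10**6)
    y = rng.standard_normal(10**6)
    prods = x * y

    def blocked_sum(v, nblocks):
        # sum each contiguous block, then sum the partial sums (roughly what a
        # static partition over `nblocks` threads would compute)
        parts = [float(np.sum(b)) for b in np.array_split(v, nblocks)]
        return float(sum(parts))

    results = [blocked_sum(prods, t) for t in (1, 2, 3, 4)]
    abs_err = max(results) - min(results)
    rel_err = abs_err / max(abs(r) for r in results)
    print(results)
    print("absolute error:", abs_err, "relative error:", rel_err)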

Goals/Approaches for Reproducibility
• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (not 2. or 3.)
  – Use (very) high precision to get exact answer (not 2.)
  – Prerounding technique (Nguyen, D.) (a simplified sketch follows below)
104
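
A greatly simplified, single-"bin" version of the prerounding idea: round every summand to a common grid delta chosen so that all partial sums are exact, hence bit-identical for any summation order or processor count. The real Demmel/Nguyen algorithm keeps several bins to preserve accuracy and handles exceptional values; this sketch trades accuracy (error about n*delta) for brevity, and its parameter choices are assumptions.

    import math
    import random

    def reproducible_sum(xs):
        n = len(xs)
        if n == 0:
            return 0.0
        maxabs = max(abs(v) for v in xs)
        if maxabs == 0.0:
            return 0.0
        _, e = math.frexp(maxabs)             # maxabs <= 2**e
        shift = e + (n - 1).bit_length() - 52
        delta = math.ldexp(1.0, shift)        # common grid spacing
        total = 0.0                            # accumulate in a double, like a reduction
        for v in xs:
            q = math.floor(v / delta)          # |q| small enough that every partial
            total += float(q)                  # sum below stays exact (< 2**53)
        return total * delta                   # exact scaling by a power of two

    vals = [1e-8 * (i % 97) - 3.14 * (i % 13) for i in range(100000)]
    a = reproducible_sum(vals)
    random.Random(0).shuffle(vals)
    b = reproducible_sum(vals)
    print(a == b, a, sum(vals))                # identical bits regardless of order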

Performance results on 1024 proc. Cray XC30: 1.2x to 3.2x slowdown vs fastest code, for n=1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)

Don't Communic…

106

Page 59: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

12

Q1

b+1

b+1

d+1

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

12

Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

Q1

Q1T

Q2

Q2T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

Q1

Q1T

Q2

Q2T

Q3

Q3T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

6

6

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

Conventional vs CA - SBR

Conventional Communication-Avoiding

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Nonsymmetric Eigenproblem

bull No apparent way to modify standard algorithmbull Instead Spectral Divide-and-Conquer

ndash Find orthogonal matrix Q whose leading columns span an invariant subspace of A

ndash QTAQ will be block upper triangular

ndash Apply recursively to A11 A22

ndash Depends on randomization1 Randomized Rank Revealing QR decomposition2 Randomized location to try splitting spectrum

A11 A12

ε A22

Attaining the Lower bounds SequentialLegend[Existing][Ours][Math-Lib][Random]

Two Levels Memory Hierarchy

Words Messages Words Messages

BLAS-3 [FLPRrsquo99][BDLSTrsquo13][MKL etc] [FLPRrsquo99][BDLSTrsquo13][MKL etc]

Cholesky[Grsquo97][APrsquo00]

[LAPACK][BDHSrsquo09]

[Grsquo97][APrsquo00][BDHSrsquo09] [Grsquo97][APrsquo00][BDHSrsquo09]

Sym Indefinite [BBDDDPSTYrsquo13] [BBDDDPSTYrsquo13]

LU[Grsquo97][Trsquo97]

[GDXrsquo11][BDLSTrsquo13]

[GDXrsquo11][BDLSTrsquo13]

[Grsquo97][Trsquo97] [BDLSTrsquo13] [BDLSTrsquo13]

QR[EGrsquo98][FWrsquo03]

[DGHLrsquo12][BDLSTrsquo13]

[FWrsquo03][DGHLrsquo12][BDLSTrsquo13]

[EGrsquo98][FWrsquo03][BDLSTrsquo13]

[FWrsquo03][BDLSTrsquo13]

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13] [BDDrsquo11]

Non Sym Eig [BDDrsquo11] [BDDrsquo11]

Legend[Existing][Ours][Math-Lib][Random]

Words (BW) Messages (L) Saving factor

BLAS-3 [AGZrsquo94][MTrsquo99][ScaLAPACK][Crsquo69][vGWrsquo97][SDrsquo11] L nP12

Cholesky [ScaLAPACK][Trsquo99][SDrsquo11] L nP12

Sym Indefinite [BBDDDPSTYrsquo13][ScaLAPACK] [BBDDDPSTYrsquo13] L nP12

LU [ScaLAPACK][GDXrsquo11][Trsquo99][SDrsquo11] [GDXrsquo11][Trsquo99][SDrsquo11] L nP12

QR [ScaLAPACK][DGHLrsquo12] [Trsquo99] [DGHLrsquo12][Trsquo99] L nP12

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13][ScaLAPACK] [BDDrsquo11][BDKrsquo13] L nP12

Non-Sym Eig [BDDrsquo11] [BDDrsquo11] BW P12 L n

Attaining with extra memory 25D M=(cn2P)

Attaining the Lower bounds Parallel 2DM=(n2P)(Ignoring poly-log(P) factors words = ( n2 P12) messages = (P12)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Conjugate Gradients (CG) GMRES Lanczos Arnoldi hellip bull Goal minimize communication

ndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computationndash Challenges Poor partitioning Preconditioning Num Stability

75

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Example The Difficulty of Tuning SpMV

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

77

Example The Difficulty of Tuning

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

bull 8x8 dense substructure exploit this to limit mem_refs

78

Speedups on Itanium 2 The Need for Search

Reference

Best 4x2

Mflops

Mflops

79

Register Profile Itanium 2

190 Mflops

1190 Mflops

80

Register Profiles IBM and Intel IA-64Power3 - 17 Power4 - 16

Itanium 2 - 33Itanium 1 - 8

252 Mflops

122 Mflops

820 Mflops

459 Mflops

247 Mflops

107 Mflops

12 Gflops

190 Mflops

Another example of tuning challenges for SpMV

bull Ex11 matrix (fluid flow)

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

82

Zoom in to top corner

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

83

3x3 blocks look natural buthellip

bull Example 3x3 blockingndash Logical grid of 3x3 cells

bull But would lead to lots of ldquofill-inrdquo

84

Extra Work Can Improve Efficiency

bull Example 3x3 blockingndash Logical grid of 3x3 cellsndash Fill-in explicit zerosndash Unroll 3x3 block multipliesndash ldquoFill ratiordquo = 15

bull On Pentium III 15x speedup

ndash Actual mflop rate 152 = 225 higher

85

Source Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal

Summer School Lecture 7

87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

Before Green + RedAfter Green + Blue

Summer School Lecture 7

892x speedups on Pentium 4 Power 4 hellip

Summary of Other Performance Optimizations

bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip

bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR

bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip

90

Optimized Sparse Kernel Interface - OSKI

bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning

bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski

bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression

software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki

91

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

93

Example Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in

each iteration

via CA Matrix Powers Kernel

Global reduction to compute G

94

Example CA-Conjugate Gradient

Local computations within inner loop require

no communication

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

96

Slower convergence due

to roundoff

Loss of accuracy due to roundoff

At s = 16 monomial basis is rank deficient Method breaks down

Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400

CA-CG (monomial)CG

machine precision

97

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit

bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square

bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank

matrices

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

101

bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010

ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver

demanded by customers (construction engineers) otherwise they donrsquot believe results

ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it

ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection

Reproducible Floating Point Computation

Absolute Error for Random Vectors

Same magnitude opposite signs

Intel MKL non-reproducibility

Relative Error for Orthogonal vectors

Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value

Sign notreproducible

103

bull Consider summation or dot productbull Goals

1 Same answer independent of layout processors order of summands

2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy

bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)

GoalsApproaches for Reproducibility

104

Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M

Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya

Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger

bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David

Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang

bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou

Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National

Instruments NEC Nokia NVIDIA Samsung Oracle

bull bebopcsberkeleyedu

Summary

Donrsquot Communichellip

106

Time to redesign all linear algebra n-body hellip algorithms and software

(and compilers)

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all ldquon3-likerdquo linear algebra
  • Lower bound for all ldquon3-likerdquo linear algebra (2)
  • Lower bound for all ldquon3-likerdquo linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a ldquoThmrdquo
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a ldquosparse matrixrdquo
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary
Page 60: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

1

12

Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2Q1

Q1T

b+1

b+1

d+1

d+1

cd+c

d+c

d+c

d+c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

Q1

Q1T

Q2

Q2T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

Q1

Q1T

Q2

Q2T

Q3

Q3T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

6

6

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

Conventional vs CA - SBR

Conventional Communication-Avoiding

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Nonsymmetric Eigenproblem

bull No apparent way to modify standard algorithmbull Instead Spectral Divide-and-Conquer

ndash Find orthogonal matrix Q whose leading columns span an invariant subspace of A

ndash QTAQ will be block upper triangular

ndash Apply recursively to A11 A22

ndash Depends on randomization1 Randomized Rank Revealing QR decomposition2 Randomized location to try splitting spectrum

A11 A12

ε A22

Attaining the Lower bounds SequentialLegend[Existing][Ours][Math-Lib][Random]

Two Levels Memory Hierarchy

Words Messages Words Messages

BLAS-3 [FLPRrsquo99][BDLSTrsquo13][MKL etc] [FLPRrsquo99][BDLSTrsquo13][MKL etc]

Cholesky[Grsquo97][APrsquo00]

[LAPACK][BDHSrsquo09]

[Grsquo97][APrsquo00][BDHSrsquo09] [Grsquo97][APrsquo00][BDHSrsquo09]

Sym Indefinite [BBDDDPSTYrsquo13] [BBDDDPSTYrsquo13]

LU[Grsquo97][Trsquo97]

[GDXrsquo11][BDLSTrsquo13]

[GDXrsquo11][BDLSTrsquo13]

[Grsquo97][Trsquo97] [BDLSTrsquo13] [BDLSTrsquo13]

QR[EGrsquo98][FWrsquo03]

[DGHLrsquo12][BDLSTrsquo13]

[FWrsquo03][DGHLrsquo12][BDLSTrsquo13]

[EGrsquo98][FWrsquo03][BDLSTrsquo13]

[FWrsquo03][BDLSTrsquo13]

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13] [BDDrsquo11]

Non Sym Eig [BDDrsquo11] [BDDrsquo11]

Legend[Existing][Ours][Math-Lib][Random]

Words (BW) Messages (L) Saving factor

BLAS-3 [AGZrsquo94][MTrsquo99][ScaLAPACK][Crsquo69][vGWrsquo97][SDrsquo11] L nP12

Cholesky [ScaLAPACK][Trsquo99][SDrsquo11] L nP12

Sym Indefinite [BBDDDPSTYrsquo13][ScaLAPACK] [BBDDDPSTYrsquo13] L nP12

LU [ScaLAPACK][GDXrsquo11][Trsquo99][SDrsquo11] [GDXrsquo11][Trsquo99][SDrsquo11] L nP12

QR [ScaLAPACK][DGHLrsquo12] [Trsquo99] [DGHLrsquo12][Trsquo99] L nP12

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13][ScaLAPACK] [BDDrsquo11][BDKrsquo13] L nP12

Non-Sym Eig [BDDrsquo11] [BDDrsquo11] BW P12 L n

Attaining with extra memory 25D M=(cn2P)

Attaining the Lower bounds Parallel 2DM=(n2P)(Ignoring poly-log(P) factors words = ( n2 P12) messages = (P12)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Conjugate Gradients (CG) GMRES Lanczos Arnoldi hellip bull Goal minimize communication

ndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computationndash Challenges Poor partitioning Preconditioning Num Stability

75

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Example The Difficulty of Tuning SpMV

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

77

Example The Difficulty of Tuning

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

bull 8x8 dense substructure exploit this to limit mem_refs

78

Speedups on Itanium 2 The Need for Search

Reference

Best 4x2

Mflops

Mflops

79

Register Profile Itanium 2

190 Mflops

1190 Mflops

80

Register Profiles IBM and Intel IA-64Power3 - 17 Power4 - 16

Itanium 2 - 33Itanium 1 - 8

252 Mflops

122 Mflops

820 Mflops

459 Mflops

247 Mflops

107 Mflops

12 Gflops

190 Mflops

Another example of tuning challenges for SpMV

bull Ex11 matrix (fluid flow)

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

82

Zoom in to top corner

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

83

3x3 blocks look natural buthellip

bull Example 3x3 blockingndash Logical grid of 3x3 cells

bull But would lead to lots of ldquofill-inrdquo

84

Extra Work Can Improve Efficiency

bull Example 3x3 blockingndash Logical grid of 3x3 cellsndash Fill-in explicit zerosndash Unroll 3x3 block multipliesndash ldquoFill ratiordquo = 15

bull On Pentium III 15x speedup

ndash Actual mflop rate 152 = 225 higher

85

Source Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal

Summer School Lecture 7

87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

Before Green + RedAfter Green + Blue

Summer School Lecture 7

892x speedups on Pentium 4 Power 4 hellip

Summary of Other Performance Optimizations

bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip

bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR

bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip

90

Optimized Sparse Kernel Interface - OSKI

bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning

bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski

bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression

software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki

91

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

93

Example Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in

each iteration

via CA Matrix Powers Kernel

Global reduction to compute G

94

Example CA-Conjugate Gradient

Local computations within inner loop require

no communication

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

96

Slower convergence due

to roundoff

Loss of accuracy due to roundoff

At s = 16 monomial basis is rank deficient Method breaks down

Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400

CA-CG (monomial)CG

machine precision

97

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit

bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square

bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank

matrices

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

101

bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010

ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver

demanded by customers (construction engineers) otherwise they donrsquot believe results

ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it

ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection

Reproducible Floating Point Computation

Absolute Error for Random Vectors

Same magnitude opposite signs

Intel MKL non-reproducibility

Relative Error for Orthogonal vectors

Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value

Sign notreproducible

103

bull Consider summation or dot productbull Goals

1 Same answer independent of layout processors order of summands

2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy

bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)

GoalsApproaches for Reproducibility

104

Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M

Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya

Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger

bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David

Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang

bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou

Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National

Instruments NEC Nokia NVIDIA Samsung Oracle

bull bebopcsberkeleyedu

Summary

Donrsquot Communichellip

106

Time to redesign all linear algebra n-body hellip algorithms and software

(and compilers)


  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all ldquon3-likerdquo linear algebra
  • Lower bound for all ldquon3-likerdquo linear algebra (2)
  • Lower bound for all ldquon3-likerdquo linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a ldquoThmrdquo
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a ldquosparse matrixrdquo
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary
Page 62: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

1

1

2

2

3

3

Q1

Q1T

Q2

Q2T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

Q1

Q1T

Q2

Q2T

Q3

Q3T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

6

6

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

Conventional vs CA - SBR

Conventional Communication-Avoiding

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Nonsymmetric Eigenproblem

bull No apparent way to modify standard algorithmbull Instead Spectral Divide-and-Conquer

ndash Find orthogonal matrix Q whose leading columns span an invariant subspace of A

ndash QTAQ will be block upper triangular

ndash Apply recursively to A11 A22

ndash Depends on randomization1 Randomized Rank Revealing QR decomposition2 Randomized location to try splitting spectrum

A11 A12

ε A22

Attaining the Lower bounds SequentialLegend[Existing][Ours][Math-Lib][Random]

Two Levels Memory Hierarchy

Words Messages Words Messages

BLAS-3 [FLPRrsquo99][BDLSTrsquo13][MKL etc] [FLPRrsquo99][BDLSTrsquo13][MKL etc]

Cholesky[Grsquo97][APrsquo00]

[LAPACK][BDHSrsquo09]

[Grsquo97][APrsquo00][BDHSrsquo09] [Grsquo97][APrsquo00][BDHSrsquo09]

Sym Indefinite [BBDDDPSTYrsquo13] [BBDDDPSTYrsquo13]

LU[Grsquo97][Trsquo97]

[GDXrsquo11][BDLSTrsquo13]

[GDXrsquo11][BDLSTrsquo13]

[Grsquo97][Trsquo97] [BDLSTrsquo13] [BDLSTrsquo13]

QR[EGrsquo98][FWrsquo03]

[DGHLrsquo12][BDLSTrsquo13]

[FWrsquo03][DGHLrsquo12][BDLSTrsquo13]

[EGrsquo98][FWrsquo03][BDLSTrsquo13]

[FWrsquo03][BDLSTrsquo13]

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13] [BDDrsquo11]

Non Sym Eig [BDDrsquo11] [BDDrsquo11]

Legend[Existing][Ours][Math-Lib][Random]

Words (BW) Messages (L) Saving factor

BLAS-3 [AGZrsquo94][MTrsquo99][ScaLAPACK][Crsquo69][vGWrsquo97][SDrsquo11] L nP12

Cholesky [ScaLAPACK][Trsquo99][SDrsquo11] L nP12

Sym Indefinite [BBDDDPSTYrsquo13][ScaLAPACK] [BBDDDPSTYrsquo13] L nP12

LU [ScaLAPACK][GDXrsquo11][Trsquo99][SDrsquo11] [GDXrsquo11][Trsquo99][SDrsquo11] L nP12

QR [ScaLAPACK][DGHLrsquo12] [Trsquo99] [DGHLrsquo12][Trsquo99] L nP12

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13][ScaLAPACK] [BDDrsquo11][BDKrsquo13] L nP12

Non-Sym Eig [BDDrsquo11] [BDDrsquo11] BW P12 L n

Attaining with extra memory 25D M=(cn2P)

Attaining the Lower bounds Parallel 2DM=(n2P)(Ignoring poly-log(P) factors words = ( n2 P12) messages = (P12)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Conjugate Gradients (CG) GMRES Lanczos Arnoldi hellip bull Goal minimize communication

ndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computationndash Challenges Poor partitioning Preconditioning Num Stability

75

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Example The Difficulty of Tuning SpMV

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

77

Example The Difficulty of Tuning

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

bull 8x8 dense substructure exploit this to limit mem_refs

78

Speedups on Itanium 2 The Need for Search

Reference

Best 4x2

Mflops

Mflops

79

Register Profile Itanium 2

190 Mflops

1190 Mflops

80

Register Profiles IBM and Intel IA-64Power3 - 17 Power4 - 16

Itanium 2 - 33Itanium 1 - 8

252 Mflops

122 Mflops

820 Mflops

459 Mflops

247 Mflops

107 Mflops

12 Gflops

190 Mflops

Another example of tuning challenges for SpMV

bull Ex11 matrix (fluid flow)

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

82

Zoom in to top corner

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

83

3x3 blocks look natural buthellip

bull Example 3x3 blockingndash Logical grid of 3x3 cells

bull But would lead to lots of ldquofill-inrdquo

84

Extra Work Can Improve Efficiency

bull Example 3x3 blockingndash Logical grid of 3x3 cellsndash Fill-in explicit zerosndash Unroll 3x3 block multipliesndash ldquoFill ratiordquo = 15

bull On Pentium III 15x speedup

ndash Actual mflop rate 152 = 225 higher

85

Source Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal

Summer School Lecture 7

87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

Before Green + RedAfter Green + Blue

Summer School Lecture 7

892x speedups on Pentium 4 Power 4 hellip

Summary of Other Performance Optimizations

bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip

bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR

bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip

90

Optimized Sparse Kernel Interface - OSKI

bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning

bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski

bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression

software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki

91

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

93

Example Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in

each iteration

via CA Matrix Powers Kernel

Global reduction to compute G

94

Example CA-Conjugate Gradient

Local computations within inner loop require

no communication

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

96

Slower convergence due

to roundoff

Loss of accuracy due to roundoff

At s = 16 monomial basis is rank deficient Method breaks down

Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400

CA-CG (monomial)CG

machine precision

97

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit

bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square

bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank

matrices

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

101

bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010

ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver

demanded by customers (construction engineers) otherwise they donrsquot believe results

ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it

ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection

Reproducible Floating Point Computation

Absolute Error for Random Vectors

Same magnitude opposite signs

Intel MKL non-reproducibility

Relative Error for Orthogonal vectors

Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value

Sign notreproducible

103

bull Consider summation or dot productbull Goals

1 Same answer independent of layout processors order of summands

2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy

bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)

GoalsApproaches for Reproducibility

104

Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M

Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya

Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger

bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David

Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang

bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou

Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National

Instruments NEC Nokia NVIDIA Samsung Oracle

bull bebopcsberkeleyedu

Summary

Donrsquot Communichellip

106

Time to redesign all linear algebra n-body hellip algorithms and software

(and compilers)

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all ldquon3-likerdquo linear algebra
  • Lower bound for all ldquon3-likerdquo linear algebra (2)
  • Lower bound for all ldquon3-likerdquo linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a ldquoThmrdquo
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a ldquosparse matrixrdquo
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary
Page 63: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

1

1

2

2

3

3

4

4

Q1

Q1T

Q2

Q2T

Q3

Q3T

b+1

b+1

d+1

d+1

d+c

d+c

d+c

d+c

c

c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

6

6

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

Conventional vs CA - SBR

Conventional Communication-Avoiding

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Nonsymmetric Eigenproblem

bull No apparent way to modify standard algorithmbull Instead Spectral Divide-and-Conquer

ndash Find orthogonal matrix Q whose leading columns span an invariant subspace of A

ndash QTAQ will be block upper triangular

ndash Apply recursively to A11 A22

ndash Depends on randomization1 Randomized Rank Revealing QR decomposition2 Randomized location to try splitting spectrum

A11 A12

ε A22

Attaining the Lower bounds SequentialLegend[Existing][Ours][Math-Lib][Random]

Two Levels Memory Hierarchy

Words Messages Words Messages

BLAS-3 [FLPRrsquo99][BDLSTrsquo13][MKL etc] [FLPRrsquo99][BDLSTrsquo13][MKL etc]

Cholesky[Grsquo97][APrsquo00]

[LAPACK][BDHSrsquo09]

[Grsquo97][APrsquo00][BDHSrsquo09] [Grsquo97][APrsquo00][BDHSrsquo09]

Sym Indefinite [BBDDDPSTYrsquo13] [BBDDDPSTYrsquo13]

LU[Grsquo97][Trsquo97]

[GDXrsquo11][BDLSTrsquo13]

[GDXrsquo11][BDLSTrsquo13]

[Grsquo97][Trsquo97] [BDLSTrsquo13] [BDLSTrsquo13]

QR[EGrsquo98][FWrsquo03]

[DGHLrsquo12][BDLSTrsquo13]

[FWrsquo03][DGHLrsquo12][BDLSTrsquo13]

[EGrsquo98][FWrsquo03][BDLSTrsquo13]

[FWrsquo03][BDLSTrsquo13]

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13] [BDDrsquo11]

Non Sym Eig [BDDrsquo11] [BDDrsquo11]

Legend[Existing][Ours][Math-Lib][Random]

Words (BW) Messages (L) Saving factor

BLAS-3 [AGZrsquo94][MTrsquo99][ScaLAPACK][Crsquo69][vGWrsquo97][SDrsquo11] L nP12

Cholesky [ScaLAPACK][Trsquo99][SDrsquo11] L nP12

Sym Indefinite [BBDDDPSTYrsquo13][ScaLAPACK] [BBDDDPSTYrsquo13] L nP12

LU [ScaLAPACK][GDXrsquo11][Trsquo99][SDrsquo11] [GDXrsquo11][Trsquo99][SDrsquo11] L nP12

QR [ScaLAPACK][DGHLrsquo12] [Trsquo99] [DGHLrsquo12][Trsquo99] L nP12

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13][ScaLAPACK] [BDDrsquo11][BDKrsquo13] L nP12

Non-Sym Eig [BDDrsquo11] [BDDrsquo11] BW P12 L n

Attaining with extra memory 25D M=(cn2P)

Attaining the Lower bounds Parallel 2DM=(n2P)(Ignoring poly-log(P) factors words = ( n2 P12) messages = (P12)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Conjugate Gradients (CG) GMRES Lanczos Arnoldi hellip bull Goal minimize communication

ndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computationndash Challenges Poor partitioning Preconditioning Num Stability

75

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Example The Difficulty of Tuning SpMV

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

77

Example The Difficulty of Tuning

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

bull 8x8 dense substructure exploit this to limit mem_refs

78

Speedups on Itanium 2 The Need for Search

Reference

Best 4x2

Mflops

Mflops

79

Register Profile Itanium 2

190 Mflops

1190 Mflops

80

Register Profiles IBM and Intel IA-64Power3 - 17 Power4 - 16

Itanium 2 - 33Itanium 1 - 8

252 Mflops

122 Mflops

820 Mflops

459 Mflops

247 Mflops

107 Mflops

12 Gflops

190 Mflops

Another example of tuning challenges for SpMV

bull Ex11 matrix (fluid flow)

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

82

Zoom in to top corner

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

83

3x3 blocks look natural buthellip

bull Example 3x3 blockingndash Logical grid of 3x3 cells

bull But would lead to lots of ldquofill-inrdquo

84

Extra Work Can Improve Efficiency

bull Example 3x3 blockingndash Logical grid of 3x3 cellsndash Fill-in explicit zerosndash Unroll 3x3 block multipliesndash ldquoFill ratiordquo = 15

bull On Pentium III 15x speedup

ndash Actual mflop rate 152 = 225 higher

85

Source Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal

Summer School Lecture 7

87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

Before Green + RedAfter Green + Blue

Summer School Lecture 7

892x speedups on Pentium 4 Power 4 hellip

Summary of Other Performance Optimizations

bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip

bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR

bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip

90

Optimized Sparse Kernel Interface - OSKI

bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning

bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski

bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression

software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki

91

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

93

Example Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in

each iteration

via CA Matrix Powers Kernel

Global reduction to compute G

94

Example CA-Conjugate Gradient

Local computations within inner loop require

no communication

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

96

Slower convergence due

to roundoff

Loss of accuracy due to roundoff

At s = 16 monomial basis is rank deficient Method breaks down

Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400

CA-CG (monomial)CG

machine precision

97

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit

bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square

bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank

matrices

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

101

bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010

ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver

demanded by customers (construction engineers) otherwise they donrsquot believe results

ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it

ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection

Reproducible Floating Point Computation

Absolute Error for Random Vectors

Same magnitude opposite signs

Intel MKL non-reproducibility

Relative Error for Orthogonal vectors

Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value

Sign notreproducible

103

bull Consider summation or dot productbull Goals

1 Same answer independent of layout processors order of summands

2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy

bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)

GoalsApproaches for Reproducibility

104

Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M

Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya

Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger

bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David

Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang

bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou

Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National

Instruments NEC Nokia NVIDIA Samsung Oracle

bull bebopcsberkeleyedu

Summary

Donrsquot Communichellip

106

Time to redesign all linear algebra n-body hellip algorithms and software

(and compilers)

Page 64: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

[Figure, three slides: Successive Band Reduction (Bischof/Lang/Sun) – band sweeps labeled 1–6 with orthogonal updates Q1…Q5 applied from both sides (Qi, QiT). Legend: b = bandwidth, c = #columns, d = #diagonals; constraint: c + d ≤ b.]

Conventional vs CA - SBR

Conventional: touch all data 4 times        Communication-Avoiding: touch all data once

Speedups of Sym. Band Reduction vs DSBTRD

• Up to 17x on Intel Gainestown, vs MKL 10.0
  – n=12000, b=500, 8 threads
• Up to 12x on Intel Westmere, vs MKL 10.3
  – n=12000, b=200, 10 threads
• Up to 25x on AMD Budapest, vs ACML 4.4
  – n=9000, b=500, 4 threads
• Up to 30x on AMD Magny-Cours, vs ACML 4.4
  – n=12000, b=500, 6 threads
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x

Nonsymmetric Eigenproblem

• No apparent way to modify the standard algorithm
• Instead: Spectral Divide-and-Conquer
  – Find an orthogonal matrix Q whose leading columns span an invariant subspace of A
  – QᵀAQ will be block upper triangular:
        [ A11  A12 ]
        [  ε   A22 ]
  – Apply recursively to A11, A22
  – Depends on randomization:
      1. Randomized Rank Revealing QR decomposition
      2. Randomized location to try splitting the spectrum
  (a numerical check of the block structure follows below)
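The block-triangularization step can be checked numerically. The sketch below is only an illustration of that property, not the communication-avoiding spectral divide-and-conquer algorithm itself (which obtains Q via randomized rank-revealing QR rather than a Schur form); the test matrix, sizes, and seed are arbitrary choices.

```python
import numpy as np
from scipy.linalg import schur, block_diag

rng = np.random.default_rng(0)

# Test matrix whose spectrum splits cleanly across the imaginary axis.
A1 = rng.standard_normal((4, 4)) - 5.0 * np.eye(4)   # eigenvalues in the left half-plane
A2 = rng.standard_normal((4, 4)) + 5.0 * np.eye(4)   # eigenvalues in the right half-plane
V, _ = np.linalg.qr(rng.standard_normal((8, 8)))
A = V @ block_diag(A1, A2) @ V.T
n = A.shape[0]

# Ordered real Schur form: the leading k columns of Z span the invariant
# subspace belonging to the left-half-plane eigenvalues.
T, Z, k = schur(A, output='real', sort='lhp')
Q1 = Z[:, :k]

# Complete Q1 to an orthogonal Q = [Q1, Q2]; the trailing columns are arbitrary.
Q, _ = np.linalg.qr(np.hstack([Q1, rng.standard_normal((n, n - k))]))

B = Q.T @ A @ Q
print("k =", k)                                          # expected: 4
print("||B[k:, :k]|| =", np.linalg.norm(B[k:, :k]))      # ~1e-14: block upper triangular
print("||B[:k, k:]|| =", np.linalg.norm(B[:k, k:]))      # O(1): the A12 block is not zero
```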

Attaining the Lower bounds: Sequential
Legend: [Existing] [Ours] [Math-Lib] [Random]
Columns: Two Levels of memory (#Words, #Messages) | Memory Hierarchy (#Words, #Messages); citation groups are listed in slide order and may span several cells.

• BLAS-3: [FLPR'99] [BDLST'13] [MKL etc.] | [FLPR'99] [BDLST'13] [MKL etc.]
• Cholesky: [G'97] [AP'00] [LAPACK] [BDHS'09] | [G'97] [AP'00] [BDHS'09] | [G'97] [AP'00] [BDHS'09]
• Sym. Indefinite: [BBDDDPSTY'13] | [BBDDDPSTY'13]
• LU: [G'97] [T'97] [GDX'11] [BDLST'13] | [GDX'11] [BDLST'13] | [G'97] [T'97] [BDLST'13] | [BDLST'13]
• QR: [EG'98] [FW'03] [DGHL'12] [BDLST'13] | [FW'03] [DGHL'12] [BDLST'13] | [EG'98] [FW'03] [BDLST'13] | [FW'03] [BDLST'13]
• Rank-Revealing QR: [BDD'11] [DGGX'13]
• Sym. Eig & SVD: [BDD'11] [BDK'13] | [BDD'11]
• Non-Sym. Eig: [BDD'11] | [BDD'11]

Attaining the Lower bounds: Parallel 2D, M = n²/P
(Ignoring poly-log(P) factors; #words = Ω(n²/P^(1/2)), #messages = Ω(P^(1/2)))
Legend: [Existing] [Ours] [Math-Lib] [Random]
Columns: #Words (BW), #Messages (L), and the saving factor attained with extra memory (2.5D, M = c·n²/P).

• BLAS-3: [AGZ'94] [MT'99] [ScaLAPACK] [C'69] [vGW'97] [SD'11]; saving: L: n/P^(1/2)
• Cholesky: [ScaLAPACK] [T'99] [SD'11]; saving: L: n/P^(1/2)
• Sym. Indefinite: [BBDDDPSTY'13] [ScaLAPACK] | [BBDDDPSTY'13]; saving: L: n/P^(1/2)
• LU: [ScaLAPACK] [GDX'11] [T'99] [SD'11] | [GDX'11] [T'99] [SD'11]; saving: L: n/P^(1/2)
• QR: [ScaLAPACK] [DGHL'12] [T'99] | [DGHL'12] [T'99]; saving: L: n/P^(1/2)
• Rank-Revealing QR: [BDD'11] [DGGX'13]
• Sym. Eig & SVD: [BDD'11] [BDK'13] [ScaLAPACK] | [BDD'11] [BDK'13]; saving: L: n/P^(1/2)
• Non-Sym. Eig: [BDD'11] | [BDD'11]; saving: BW: P^(1/2), L: n

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and the starting vector
  – Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation
      • Conventional: O(k) moves of data from slow to fast memory
      • New: O(1) moves of data – optimal
  – Parallel implementation on p processors
      • Conventional: O(k log p) messages (k SpMV calls, dot products)
      • New: O(log p) messages – optimal (a 1D sketch of the idea follows below)
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability

75
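A minimal sketch of the matrix powers kernel idea behind these counts, in the simplest possible setting (a 1D 3-point stencil): a process holding its entries of x plus k ghost values on each side can produce its rows of A·x, …, Aᵏ·x with no further communication, because the valid ghost region just shrinks by one layer per sweep. The 1D setting, sizes, and function name are illustrative assumptions, not the general implementation.

```python
import numpy as np

def local_matrix_powers(x_ext, k):
    """x_ext holds this process's owned entries of x plus k ghost values on each
    side (fetched in ONE communication step). Returns the owned entries of
    A@x, A^2@x, ..., A^k@x for the 1D stencil (A y)_i = 2*y_i - y_{i-1} - y_{i+1},
    with no further communication: the valid region shrinks by one layer per sweep."""
    m = len(x_ext) - 2 * k                   # number of owned entries
    out, y = [], x_ext
    for s in range(1, k + 1):
        y = 2.0 * y[1:-1] - y[:-2] - y[2:]   # one stencil sweep; the ends go stale
        out.append(y[k - s : k - s + m])     # still-valid owned entries of A^s @ x
    return out

# Check against explicit SpMVs on the global vector (interior rows only).
n, steps, lo, hi = 64, 3, 20, 30             # this process owns x[lo:hi]
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
x = np.random.default_rng(1).standard_normal(n)

local = local_matrix_powers(x[lo - steps : hi + steps], steps)
y = x.copy()
for s in range(steps):
    y = A @ y
    assert np.allclose(local[s], y[lo:hi])
print("matrix powers kernel matches", steps, "explicit SpMVs")
```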

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV

• n = 21200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

77

Example: The Difficulty of Tuning

• n = 21200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit #mem_refs (see the BSR sketch below)

78
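The raefsky matrix itself is not reproduced here, but the register-blocking idea the slide points at can be sketched with SciPy's block sparse row (BSR) format: when nonzeros come in dense 8x8 blocks, storing one index per block cuts index traffic by roughly 64x and lets the 8x8 multiplies be unrolled. The block pattern below is synthetic and purely illustrative.

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
b, nb = 8, 128                               # 8x8 blocks on a 128x128 block grid
n = b * nb

# Synthetic matrix whose nonzeros come in dense 8x8 blocks (as in raefsky):
# a sparse block pattern Kronecker'd with a dense 8x8 block.
pattern = sp.random(nb, nb, density=0.05, format='csr', random_state=0)
pattern.data[:] = 1.0
A_csr = sp.kron(pattern, rng.standard_normal((b, b)), format='csr')
A_bsr = A_csr.tobsr(blocksize=(b, b))        # register-blocked (BSR) storage

x = rng.standard_normal(n)
assert np.allclose(A_csr @ x, A_bsr @ x)     # same result, different data structure

# BSR stores one column index per 8x8 block instead of one per nonzero,
# so index traffic drops ~64x and the 8x8 block multiplies can be unrolled.
print("CSR column indices stored :", A_csr.indices.size)
print("BSR block indices stored  :", A_bsr.indices.size)
```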

Speedups on Itanium 2: The Need for Search

[Plot: SpMV performance (Mflop/s) over register block sizes; the reference code vs. the best blocking found by search (4x2)]

79

Register Profile: Itanium 2

[Heat map of SpMV performance over all register block sizes, ranging from 190 Mflop/s to 1190 Mflop/s]

80

Register Profiles: IBM and Intel IA-64

[Four heat maps of SpMV performance over register block sizes:
 Power3 – 17, max 252 / min 122 Mflop/s; Power4 – 16, max 820 / min 459 Mflop/s;
 Itanium 1 – 8, max 247 / min 107 Mflop/s; Itanium 2 – 33, max 1.2 Gflop/s / min 190 Mflop/s]

Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16614
• NNZ = 1.1 M

82

Zoom in to top corner

• More complicated non-zero structure in general
• N = 16614
• NNZ = 1.1 M

83

3x3 blocks look natural, but…

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

84

Extra Work Can Improve Efficiency

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5 (a small fill-ratio calculation is sketched below)

• On Pentium III: 1.5x speedup!
  – Actual Mflop rate is 1.5² = 2.25x higher

85
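A small sketch of the trade-off described above, under the assumption that SciPy's BSR conversion (which pads every touched block with explicit zeros) is an acceptable stand-in for register blocking: compute the "fill ratio" for a few candidate block sizes on a synthetic matrix. Blocking pays off roughly when the blocked kernel's speedup exceeds the fill ratio.

```python
import scipy.sparse as sp

def fill_ratio(A_csr, r, c):
    """Stored entries after r-by-c blocking (every touched block is padded with
    explicit zeros) divided by the true nonzero count of A."""
    A_bsr = A_csr.tobsr(blocksize=(r, c))    # requires the shape to divide evenly
    return A_bsr.nnz / A_csr.nnz             # BSR .nnz counts stored entries incl. padding

# Synthetic test matrix; 960 is divisible by every block size tried below.
A = sp.random(960, 960, density=0.002, format='csr', random_state=0)
for r, c in [(1, 1), (2, 2), (3, 3), (4, 2), (8, 8)]:
    print(f"{r}x{c}: fill ratio = {fill_ratio(A, r, c):.2f}")
# Rule of thumb: r-by-c blocking wins when its kernel speedup exceeds its fill ratio.
```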

Source: Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal


87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

Before: Green + Red    After: Green + Blue


2x speedups on Pentium 4, Power 4, …

89

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR (see the timing sketch below)
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·Aᵀ·x, Aᵀ·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later …

90
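To see why "multiple vectors (SpMM)" helps: multiplying by a block of vectors reuses each stored matrix entry several times per read. A rough, machine-dependent timing sketch on an arbitrary random matrix (numbers will vary):

```python
import time
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
A = sp.random(20000, 20000, density=5e-4, format='csr', random_state=0)
X = rng.standard_normal((20000, 8))           # 8 right-hand vectors

t0 = time.perf_counter()
Y1 = np.column_stack([A @ X[:, j] for j in range(X.shape[1])])   # 8 SpMVs: A read 8 times
t1 = time.perf_counter()
Y2 = A @ X                                                       # 1 SpMM: A read once
t2 = time.perf_counter()

assert np.allclose(Y1, Y2)
print(f"8 separate SpMVs: {t1 - t0:.3f} s    one SpMM: {t2 - t1:.3f} s")
```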

Optimized Sparse Kernel Interface - OSKI

• Provides sparse kernels automatically tuned for user's matrix & machine
  – BLAS-style functionality: SpMV (Ax & Aᵀy), TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as stand-alone library
  – Available as PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski
  (an illustrative tuning loop, not the OSKI API, is sketched below)

91
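OSKI's actual C API is not reproduced here; the sketch below only mimics the spirit of its run-time tuning using SciPy, timing SpMV for a few candidate block sizes on the user's matrix and keeping the fastest. The candidate list, matrix, and function name are made up for illustration.

```python
import time
import numpy as np
import scipy.sparse as sp

def pick_blocksize(A_csr, x, candidates=((1, 1), (2, 2), (4, 4), (8, 8)), trials=20):
    """Time SpMV for each candidate storage and keep the fastest - the spirit of
    OSKI's run-time tuning, not its API."""
    best = None
    for r, c in candidates:
        if A_csr.shape[0] % r or A_csr.shape[1] % c:
            continue                              # block size must divide the shape
        M = A_csr if (r, c) == (1, 1) else A_csr.tobsr(blocksize=(r, c))
        t0 = time.perf_counter()
        for _ in range(trials):
            M @ x
        dt = time.perf_counter() - t0
        if best is None or dt < best[0]:
            best = (dt, (r, c))
    return best

rng = np.random.default_rng(0)
A = sp.kron(sp.random(200, 200, density=0.02, format='csr', random_state=0),
            np.ones((4, 4)), format='csr')        # nonzeros arrive in 4x4 blocks
x = rng.standard_normal(A.shape[1])
dt, blocksize = pick_blocksize(A, x)
print("chosen block size:", blocksize, f"({dt * 1e3:.1f} ms for 20 SpMVs)")
```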

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

93

Example: Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in each iteration (a reference implementation follows below).

94
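The algorithm listing is lost in this transcript, so here is a standard textbook CG (dense NumPy, no preconditioning) for reference; each iteration performs one SpMV and two dot products, which are exactly the per-iteration communication events the slide highlights. This is the classical method, not code from the slide.

```python
import numpy as np

def cg(A, b, x0=None, tol=1e-8, maxiter=1000):
    """Classical conjugate gradients for SPD A. One A@p (SpMV) and two dot
    products per iteration -- each a communication event in parallel."""
    x = np.zeros_like(b) if x0 is None else x0.copy()
    r = b - A @ x
    p = r.copy()
    rho = r @ r
    for _ in range(maxiter):
        if np.sqrt(rho) <= tol * np.linalg.norm(b):
            break
        Ap = A @ p                    # SpMV
        alpha = rho / (p @ Ap)        # dot product (global reduction)
        x += alpha * p
        r -= alpha * Ap
        rho_new = r @ r               # dot product (global reduction)
        p = r + (rho_new / rho) * p
        rho = rho_new
    return x

# 1D Poisson test problem.
n = 100
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)
x = cg(A, b)
print("residual:", np.linalg.norm(b - A @ x))
```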

Example: CA-Conjugate Gradient

The k SpMVs of each outer iteration are computed via the CA matrix powers kernel, a single global reduction computes the Gram matrix G, and the local computations within the inner loop require no communication.

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

96

[Convergence plots: CA-CG (monomial basis) vs. CG, with machine precision marked]

• Slower convergence and loss of accuracy due to roundoff
• At s = 16 the monomial basis is rank deficient; the method breaks down
• Model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ~ 400
  (a basis-conditioning sketch follows below)

97
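The breakdown of the monomial basis can be reproduced on the slide's model problem (2D Poisson, 5-point stencil, 30x30 grid): the columns p, Ap, …, Aˢp align with the dominant eigenvector, so the basis condition number grows rapidly and approaches 1/ε in double precision near s = 16. A SciPy sketch, with an arbitrary starting vector:

```python
import numpy as np
import scipy.sparse as sp

# 2D Poisson, 5-point stencil, 30x30 grid (cond(A) ~ 400), as in the model problem.
m = 30
T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(m, m))
A = (sp.kron(sp.eye(m), T) + sp.kron(T, sp.eye(m))).tocsr()

p = np.random.default_rng(0).standard_normal(m * m)

for s in (4, 8, 12, 16):
    V = np.empty((m * m, s + 1))
    V[:, 0] = p
    for j in range(s):
        V[:, j + 1] = A @ V[:, j]        # monomial Krylov basis [p, Ap, ..., A^s p]
    sv = np.linalg.svd(V, compute_uv=False)
    # Once cond(V) reaches ~1/eps (~1e16), the basis is numerically rank deficient.
    print(f"s = {s:2d}: cond([p, Ap, ..., A^s p]) = {sv[0] / sv[-1]:.2e}")
```

In practice, CA-Krylov methods switch to better-conditioned Newton or Chebyshev bases to push s higher; that is the kind of stability fix this part of the talk refers to.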

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?
• Requires o(n²) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be a sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + UDVᵀ, D small & square (a matrix-free sketch follows below)
• Semiseparable matrices arise as preconditioners
  – Need to write Aᵏ = (S + UDVᵀ)ᵏ as a sum of Sᵏ and low-rank matrices

                               Indices explicit (O(nnz))    Indices implicit (o(nnz))
  Nonzero entries explicit:    CSR and variations           Vision, climate, AMR, …
  Nonzero entries implicit:    Graph Laplacian              Stencils
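A tiny sketch of the "sparse + low rank" case: with A = S + UDVᵀ you never form A; you apply it as S@x + U@(D@(Vᵀ@x)), and Aᵏ@x follows by repeated application, which is what keeps the Sᵏ and low-rank pieces separate. Sizes and the random data are arbitrary.

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
n, r = 2000, 5
S = sp.random(n, n, density=1e-3, format='csr', random_state=0)   # sparse part
U = rng.standard_normal((n, r))
D = rng.standard_normal((r, r))                                    # small & square
Vt = rng.standard_normal((r, n))

def apply_A(x):
    """y = (S + U D V^T) x without ever forming the dense n-by-n matrix."""
    return S @ x + U @ (D @ (Vt @ x))

def apply_A_power(x, k):
    """A^k x by repeated application - the building block needed when such a
    matrix (e.g. a semiseparable preconditioner) sits inside a Krylov method."""
    for _ in range(k):
        x = apply_A(x)
    return x

x = rng.standard_normal(n)
A_dense = S.toarray() + U @ D @ Vt                 # only for the correctness check
assert np.allclose(apply_A_power(x, 3), np.linalg.matrix_power(A_dense, 3) @ x)
print("matrix-free (S + U D V^T)^k x agrees with the dense computation")
```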

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

101

Reproducible Floating Point Computation

• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm at GNS-MBH
  – Sought reproducible parallel sparse linear equation solver, demanded by customers (construction engineers), otherwise they don't believe results
  – Willing to sacrifice 40% - 50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What? How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility

[Plots: Absolute Error for Random Vectors (same magnitude, opposite signs) and Relative Error for Orthogonal Vectors (sign not reproducible)]

Vector size 1e6, data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum – minimum
• Relative error = Absolute error / maximum absolute value

103
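The MKL plots cannot be reproduced here, but the root cause is easy to demonstrate: floating-point addition is not associative, so summing the same numbers in a different order (which is what a different thread count does) can change the result. A minimal illustration:

```python
import numpy as np

# Associativity fails already for three constants:
print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))      # False

# "Different thread counts" = different summation orders for the same data.
rng = np.random.default_rng(0)
x = rng.standard_normal(10**5)

serial = np.float64(0.0)
for v in x:                                         # one fixed left-to-right order
    serial += v

chunked = sum(np.float64(c.sum()) for c in np.array_split(x, 4))  # "4 threads"

print(f"serial  : {serial:.17g}")
print(f"chunked : {chunked:.17g}")
print("bitwise identical?", serial == chunked)      # typically False
```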

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals
  1. Same answer, independent of layout, #processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches
  – Guarantee fixed reduction tree (not 2. or 3.)
  – Use (very) high precision to get exact answer (not 2.)
  – Prerounding technique (Nguyen, D.) – a simplified sketch follows below

104
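A deliberately simplified, single-bin sketch of the pre-rounding idea (the real algorithms use several bins and handle scaling carefully; this version trades accuracy for reproducibility and is only meant to show the mechanism): pick a power-of-two boundary from the global maximum and n, round every summand to a multiple of that boundary's ulp, and the remaining additions are exact, so the result no longer depends on summation order.

```python
import numpy as np

def prerounded_sum(x):
    """Order-independent summation via one-bin pre-rounding: round every addend
    to a multiple of ulp(B), with B a power of two comfortably above n*max|x_i|,
    so that all subsequent additions are exact (hence any order gives the same bits)."""
    x = np.asarray(x, dtype=np.float64)
    M = float(np.max(np.abs(x)))
    if M == 0.0:
        return 0.0
    B = np.float64(2.0) ** (np.ceil(np.log2(x.size * M)) + 1)   # one extra binade of margin
    xr = (x + B) - B                 # pre-round each addend; costs some accuracy
    return float(np.sum(xr))         # every addition is now exact

rng = np.random.default_rng(0)
x = rng.standard_normal(10**5)
perm = rng.permutation(x.size)
print(prerounded_sum(x) == prerounded_sum(x[perm]))   # True: reproducible
print(float(np.sum(x)) == float(np.sum(x[perm])))     # often False: order-dependent
```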

Performance results on 1024 proc. Cray XC30: 1.2x to 3.2x slowdown vs fastest code, for n = 1M

Collaborators and Supporters

• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

106

Time to redesign all linear algebra, n-body, … algorithms and software

(and compilers)

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all ldquon3-likerdquo linear algebra
  • Lower bound for all ldquon3-likerdquo linear algebra (2)
  • Lower bound for all ldquon3-likerdquo linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a ldquoThmrdquo
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a ldquosparse matrixrdquo
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary
Page 65: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

1

1

2

2

3

3

4

4

5

5

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

1

1

2

2

3

3

4

4

5

5

6

6

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

Conventional vs CA - SBR

Conventional Communication-Avoiding

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Nonsymmetric Eigenproblem

bull No apparent way to modify standard algorithmbull Instead Spectral Divide-and-Conquer

ndash Find orthogonal matrix Q whose leading columns span an invariant subspace of A

ndash QTAQ will be block upper triangular

ndash Apply recursively to A11 A22

ndash Depends on randomization1 Randomized Rank Revealing QR decomposition2 Randomized location to try splitting spectrum

A11 A12

ε A22

Attaining the Lower bounds SequentialLegend[Existing][Ours][Math-Lib][Random]

Two Levels Memory Hierarchy

Words Messages Words Messages

BLAS-3 [FLPRrsquo99][BDLSTrsquo13][MKL etc] [FLPRrsquo99][BDLSTrsquo13][MKL etc]

Cholesky[Grsquo97][APrsquo00]

[LAPACK][BDHSrsquo09]

[Grsquo97][APrsquo00][BDHSrsquo09] [Grsquo97][APrsquo00][BDHSrsquo09]

Sym Indefinite [BBDDDPSTYrsquo13] [BBDDDPSTYrsquo13]

LU[Grsquo97][Trsquo97]

[GDXrsquo11][BDLSTrsquo13]

[GDXrsquo11][BDLSTrsquo13]

[Grsquo97][Trsquo97] [BDLSTrsquo13] [BDLSTrsquo13]

QR[EGrsquo98][FWrsquo03]

[DGHLrsquo12][BDLSTrsquo13]

[FWrsquo03][DGHLrsquo12][BDLSTrsquo13]

[EGrsquo98][FWrsquo03][BDLSTrsquo13]

[FWrsquo03][BDLSTrsquo13]

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13] [BDDrsquo11]

Non Sym Eig [BDDrsquo11] [BDDrsquo11]

Legend[Existing][Ours][Math-Lib][Random]

Words (BW) Messages (L) Saving factor

BLAS-3 [AGZrsquo94][MTrsquo99][ScaLAPACK][Crsquo69][vGWrsquo97][SDrsquo11] L nP12

Cholesky [ScaLAPACK][Trsquo99][SDrsquo11] L nP12

Sym Indefinite [BBDDDPSTYrsquo13][ScaLAPACK] [BBDDDPSTYrsquo13] L nP12

LU [ScaLAPACK][GDXrsquo11][Trsquo99][SDrsquo11] [GDXrsquo11][Trsquo99][SDrsquo11] L nP12

QR [ScaLAPACK][DGHLrsquo12] [Trsquo99] [DGHLrsquo12][Trsquo99] L nP12

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13][ScaLAPACK] [BDDrsquo11][BDKrsquo13] L nP12

Non-Sym Eig [BDDrsquo11] [BDDrsquo11] BW P12 L n

Attaining with extra memory 25D M=(cn2P)

Attaining the Lower bounds Parallel 2DM=(n2P)(Ignoring poly-log(P) factors words = ( n2 P12) messages = (P12)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Conjugate Gradients (CG) GMRES Lanczos Arnoldi hellip bull Goal minimize communication

ndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computationndash Challenges Poor partitioning Preconditioning Num Stability

75

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Example The Difficulty of Tuning SpMV

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

77

Example The Difficulty of Tuning

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

bull 8x8 dense substructure exploit this to limit mem_refs

78

Speedups on Itanium 2 The Need for Search

Reference

Best 4x2

Mflops

Mflops

79

Register Profile Itanium 2

190 Mflops

1190 Mflops

80

Register Profiles IBM and Intel IA-64Power3 - 17 Power4 - 16

Itanium 2 - 33Itanium 1 - 8

252 Mflops

122 Mflops

820 Mflops

459 Mflops

247 Mflops

107 Mflops

12 Gflops

190 Mflops

Another example of tuning challenges for SpMV

bull Ex11 matrix (fluid flow)

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

82

Zoom in to top corner

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

83

3x3 blocks look natural buthellip

bull Example 3x3 blockingndash Logical grid of 3x3 cells

bull But would lead to lots of ldquofill-inrdquo

84

Extra Work Can Improve Efficiency

bull Example 3x3 blockingndash Logical grid of 3x3 cellsndash Fill-in explicit zerosndash Unroll 3x3 block multipliesndash ldquoFill ratiordquo = 15

bull On Pentium III 15x speedup

ndash Actual mflop rate 152 = 225 higher

85

Source Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal

Summer School Lecture 7

87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

Before Green + RedAfter Green + Blue

Summer School Lecture 7

892x speedups on Pentium 4 Power 4 hellip

Summary of Other Performance Optimizations

bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip

bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR

bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip

90

Optimized Sparse Kernel Interface - OSKI

bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning

bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski

bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression

software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki

91

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

93

Example Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in

each iteration

via CA Matrix Powers Kernel

Global reduction to compute G

94

Example CA-Conjugate Gradient

Local computations within inner loop require

no communication

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

96

Slower convergence due

to roundoff

Loss of accuracy due to roundoff

At s = 16 monomial basis is rank deficient Method breaks down

Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400

CA-CG (monomial)CG

machine precision

97

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit

bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square

bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank

matrices

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

101

bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010

ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver

demanded by customers (construction engineers) otherwise they donrsquot believe results

ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it

ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection

Reproducible Floating Point Computation

Absolute Error for Random Vectors

Same magnitude opposite signs

Intel MKL non-reproducibility

Relative Error for Orthogonal vectors

Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value

Sign notreproducible

103

bull Consider summation or dot productbull Goals

1 Same answer independent of layout processors order of summands

2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy

bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)

GoalsApproaches for Reproducibility

104

Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M

Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya

Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger

bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David

Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang

bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou

Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National

Instruments NEC Nokia NVIDIA Samsung Oracle

bull bebopcsberkeleyedu

Summary

Donrsquot Communichellip

106

Time to redesign all linear algebra n-body hellip algorithms and software

(and compilers)

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all ldquon3-likerdquo linear algebra
  • Lower bound for all ldquon3-likerdquo linear algebra (2)
  • Lower bound for all ldquon3-likerdquo linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a ldquoThmrdquo
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a ldquosparse matrixrdquo
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary
Page 66: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

1

1

2

2

3

3

4

4

5

5

6

6

Q5T

Q1

Q1T

Q2

Q2T

Q3

Q3T

Q5

Q4

Q4T

b+1

b+1

d+1

d+1

c

c

d+c

d+c

d+c

d+c

b = bandwidthc = columnsd = diagonalsConstraint c+d b

Successive Band Reduction (BischofLangSun)

Conventional vs CA - SBR

Conventional Communication-Avoiding

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Nonsymmetric Eigenproblem

bull No apparent way to modify standard algorithmbull Instead Spectral Divide-and-Conquer

ndash Find orthogonal matrix Q whose leading columns span an invariant subspace of A

ndash QTAQ will be block upper triangular

ndash Apply recursively to A11 A22

ndash Depends on randomization1 Randomized Rank Revealing QR decomposition2 Randomized location to try splitting spectrum

A11 A12

ε A22

Attaining the Lower bounds SequentialLegend[Existing][Ours][Math-Lib][Random]

Two Levels Memory Hierarchy

Words Messages Words Messages

BLAS-3 [FLPRrsquo99][BDLSTrsquo13][MKL etc] [FLPRrsquo99][BDLSTrsquo13][MKL etc]

Cholesky[Grsquo97][APrsquo00]

[LAPACK][BDHSrsquo09]

[Grsquo97][APrsquo00][BDHSrsquo09] [Grsquo97][APrsquo00][BDHSrsquo09]

Sym Indefinite [BBDDDPSTYrsquo13] [BBDDDPSTYrsquo13]

LU[Grsquo97][Trsquo97]

[GDXrsquo11][BDLSTrsquo13]

[GDXrsquo11][BDLSTrsquo13]

[Grsquo97][Trsquo97] [BDLSTrsquo13] [BDLSTrsquo13]

QR[EGrsquo98][FWrsquo03]

[DGHLrsquo12][BDLSTrsquo13]

[FWrsquo03][DGHLrsquo12][BDLSTrsquo13]

[EGrsquo98][FWrsquo03][BDLSTrsquo13]

[FWrsquo03][BDLSTrsquo13]

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13] [BDDrsquo11]

Non Sym Eig [BDDrsquo11] [BDDrsquo11]

Legend[Existing][Ours][Math-Lib][Random]

Words (BW) Messages (L) Saving factor

BLAS-3 [AGZrsquo94][MTrsquo99][ScaLAPACK][Crsquo69][vGWrsquo97][SDrsquo11] L nP12

Cholesky [ScaLAPACK][Trsquo99][SDrsquo11] L nP12

Sym Indefinite [BBDDDPSTYrsquo13][ScaLAPACK] [BBDDDPSTYrsquo13] L nP12

LU [ScaLAPACK][GDXrsquo11][Trsquo99][SDrsquo11] [GDXrsquo11][Trsquo99][SDrsquo11] L nP12

QR [ScaLAPACK][DGHLrsquo12] [Trsquo99] [DGHLrsquo12][Trsquo99] L nP12

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13][ScaLAPACK] [BDDrsquo11][BDKrsquo13] L nP12

Non-Sym Eig [BDDrsquo11] [BDDrsquo11] BW P12 L n

Attaining with extra memory 25D M=(cn2P)

Attaining the Lower bounds Parallel 2DM=(n2P)(Ignoring poly-log(P) factors words = ( n2 P12) messages = (P12)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Conjugate Gradients (CG) GMRES Lanczos Arnoldi hellip bull Goal minimize communication

ndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computationndash Challenges Poor partitioning Preconditioning Num Stability

75

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Example The Difficulty of Tuning SpMV

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

77

Example The Difficulty of Tuning

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

bull 8x8 dense substructure exploit this to limit mem_refs

78

Speedups on Itanium 2 The Need for Search

Reference

Best 4x2

Mflops

Mflops

79

Register Profile Itanium 2

190 Mflops

1190 Mflops

80

Register Profiles IBM and Intel IA-64Power3 - 17 Power4 - 16

Itanium 2 - 33Itanium 1 - 8

252 Mflops

122 Mflops

820 Mflops

459 Mflops

247 Mflops

107 Mflops

12 Gflops

190 Mflops

Another example of tuning challenges for SpMV

bull Ex11 matrix (fluid flow)

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

82

Zoom in to top corner

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

83

3x3 blocks look natural buthellip

bull Example 3x3 blockingndash Logical grid of 3x3 cells

bull But would lead to lots of ldquofill-inrdquo

84

Extra Work Can Improve Efficiency

bull Example 3x3 blockingndash Logical grid of 3x3 cellsndash Fill-in explicit zerosndash Unroll 3x3 block multipliesndash ldquoFill ratiordquo = 15

bull On Pentium III 15x speedup

ndash Actual mflop rate 152 = 225 higher

85

Source Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal

Summer School Lecture 7

87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

Before Green + RedAfter Green + Blue

Summer School Lecture 7

892x speedups on Pentium 4 Power 4 hellip

Summary of Other Performance Optimizations

bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip

bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR

bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip

90

Optimized Sparse Kernel Interface - OSKI

bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning

bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski

bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression

software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki

91

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

93

Example Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in

each iteration

via CA Matrix Powers Kernel

Global reduction to compute G

94

Example CA-Conjugate Gradient

Local computations within inner loop require

no communication

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

96

Slower convergence due

to roundoff

Loss of accuracy due to roundoff

At s = 16 monomial basis is rank deficient Method breaks down

Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400

CA-CG (monomial)CG

machine precision

97

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit

bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square

bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank

matrices

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

101

bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010

ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver

demanded by customers (construction engineers) otherwise they donrsquot believe results

ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it

ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection

Reproducible Floating Point Computation

Absolute Error for Random Vectors

Same magnitude opposite signs

Intel MKL non-reproducibility

Relative Error for Orthogonal vectors

Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value

Sign notreproducible

103

bull Consider summation or dot productbull Goals

1 Same answer independent of layout processors order of summands

2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy

bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)

GoalsApproaches for Reproducibility

104

Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M

Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya

Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger

bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David

Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang

bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou

Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National

Instruments NEC Nokia NVIDIA Samsung Oracle

bull bebopcsberkeleyedu

Summary

Donrsquot Communichellip

106

Time to redesign all linear algebra n-body hellip algorithms and software

(and compilers)

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all ldquon3-likerdquo linear algebra
  • Lower bound for all ldquon3-likerdquo linear algebra (2)
  • Lower bound for all ldquon3-likerdquo linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a ldquoThmrdquo
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a ldquosparse matrixrdquo
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary
Page 67: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

Conventional vs CA - SBR

Conventional Communication-Avoiding

Touch all data 4 times Touch all data once

>
>

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Nonsymmetric Eigenproblem

bull No apparent way to modify standard algorithmbull Instead Spectral Divide-and-Conquer

ndash Find orthogonal matrix Q whose leading columns span an invariant subspace of A

ndash QTAQ will be block upper triangular

ndash Apply recursively to A11 A22

ndash Depends on randomization1 Randomized Rank Revealing QR decomposition2 Randomized location to try splitting spectrum

A11 A12

ε A22

Attaining the Lower bounds SequentialLegend[Existing][Ours][Math-Lib][Random]

Two Levels Memory Hierarchy

Words Messages Words Messages

BLAS-3 [FLPRrsquo99][BDLSTrsquo13][MKL etc] [FLPRrsquo99][BDLSTrsquo13][MKL etc]

Cholesky[Grsquo97][APrsquo00]

[LAPACK][BDHSrsquo09]

[Grsquo97][APrsquo00][BDHSrsquo09] [Grsquo97][APrsquo00][BDHSrsquo09]

Sym Indefinite [BBDDDPSTYrsquo13] [BBDDDPSTYrsquo13]

LU[Grsquo97][Trsquo97]

[GDXrsquo11][BDLSTrsquo13]

[GDXrsquo11][BDLSTrsquo13]

[Grsquo97][Trsquo97] [BDLSTrsquo13] [BDLSTrsquo13]

QR[EGrsquo98][FWrsquo03]

[DGHLrsquo12][BDLSTrsquo13]

[FWrsquo03][DGHLrsquo12][BDLSTrsquo13]

[EGrsquo98][FWrsquo03][BDLSTrsquo13]

[FWrsquo03][BDLSTrsquo13]

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13] [BDDrsquo11]

Non Sym Eig [BDDrsquo11] [BDDrsquo11]

Legend[Existing][Ours][Math-Lib][Random]

Words (BW) Messages (L) Saving factor

BLAS-3 [AGZrsquo94][MTrsquo99][ScaLAPACK][Crsquo69][vGWrsquo97][SDrsquo11] L nP12

Cholesky [ScaLAPACK][Trsquo99][SDrsquo11] L nP12

Sym Indefinite [BBDDDPSTYrsquo13][ScaLAPACK] [BBDDDPSTYrsquo13] L nP12

LU [ScaLAPACK][GDXrsquo11][Trsquo99][SDrsquo11] [GDXrsquo11][Trsquo99][SDrsquo11] L nP12

QR [ScaLAPACK][DGHLrsquo12] [Trsquo99] [DGHLrsquo12][Trsquo99] L nP12

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13][ScaLAPACK] [BDDrsquo11][BDKrsquo13] L nP12

Non-Sym Eig [BDDrsquo11] [BDDrsquo11] BW P12 L n

Attaining with extra memory 25D M=(cn2P)

Attaining the Lower bounds Parallel 2DM=(n2P)(Ignoring poly-log(P) factors words = ( n2 P12) messages = (P12)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Avoiding Communication in Iterative Linear Algebra

• k steps of an iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and starting vector
  – Many such "Krylov Subspace Methods": Conjugate Gradients (CG), GMRES, Lanczos, Arnoldi, …
• Goal: minimize communication
  – Assume matrix "well-partitioned"
  – Serial implementation:
      • Conventional: O(k) moves of data from slow to fast memory
      • New: O(1) moves of data – optimal
  – Parallel implementation on p processors:
      • Conventional: O(k log p) messages (k SpMV calls, dot products)
      • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
  – Challenges: poor partitioning, preconditioning, numerical stability
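
To make the counts above concrete, here is a small sketch (plain SciPy, an illustrative `monomial_krylov_basis` helper, and a 1D Poisson matrix as a stand-in) of what the communication-avoiding "matrix powers kernel" must produce. The conventional loop performs k dependent SpMVs, i.e., O(k) reads of A in serial and O(k) message rounds in parallel; the CA kernel produces the same basis [x, Ax, …, Aᵏx] with O(1) passes over A by exploiting the partitioning and some redundant "ghost zone" work.

```python
import numpy as np
import scipy.sparse as sp

def monomial_krylov_basis(A, x, k):
    """Return V with columns [x, Ax, A^2 x, ..., A^k x] via k dependent SpMVs."""
    V = np.empty((x.size, k + 1))
    V[:, 0] = x
    for j in range(k):
        # Each step reads A again (serial) / exchanges halo data again (parallel);
        # the CA matrix powers kernel computes all k+1 columns with O(1) such passes.
        V[:, j + 1] = A @ V[:, j]
    return V

n, k = 1000, 8
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")  # 1D Poisson stand-in
x = np.ones(n)
V = monomial_krylov_basis(A, x, k)
print(V.shape)  # (1000, 9)
```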

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: The Difficulty of Tuning SpMV

• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)

Example: The Difficulty of Tuning

• n = 21,200
• nnz = 1.5 M
• Source: NASA structural analysis problem (raefsky)
• 8x8 dense substructure: exploit this to limit mem_refs
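
A short sketch (synthetic block matrix, not the raefsky problem) of why the 8x8 substructure matters: storing the matrix in Block Sparse Row (BSR) format keeps one column index per block instead of one per nonzero, so SpMV moves far less index data and can unroll dense block multiplies.

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
n, b = 4096, 8                       # matrix dimension and block size
nblocks = n // b
# Block-tridiagonal matrix of dense 8x8 blocks (stand-in for FEM-like structure)
blocks = sp.diags([1.0, 1.0, 1.0], [-1, 0, 1], shape=(nblocks, nblocks), format="csr")
A_csr = sp.kron(blocks, np.ones((b, b)), format="csr")
A_bsr = A_csr.tobsr(blocksize=(b, b))

# Index storage: CSR keeps one column index per nonzero, BSR one per block
print("CSR indices:", A_csr.indices.size)        # = nnz
print("BSR indices:", A_bsr.indices.size)        # = nnz / (8*8)

x = rng.standard_normal(n)
print(np.allclose(A_csr @ x, A_bsr @ x))         # same SpMV result
```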

Speedups on Itanium 2: The Need for Search

[Figure: Mflop rate for every register block size; the reference (unblocked) code and the best block size (4x2) are marked.]

Register Profile: Itanium 2

[Figure: register-blocking profile, ranging from 190 Mflops to 1190 Mflops.]

Register Profiles: IBM and Intel IA-64

[Figure: register-blocking profiles for four machines, best SpMV as a fraction of peak in parentheses:
 Power3 (17%): 122 to 252 Mflops; Power4 (16%): 459 to 820 Mflops;
 Itanium 1 (8%): 107 to 247 Mflops; Itanium 2 (33%): 190 Mflops to 1.2 Gflops.]

Another example of tuning challenges for SpMV

• Ex11 matrix (fluid flow)
• More complicated non-zero structure in general
• N = 16,614
• NNZ = 1.1 M

Zoom in to top corner

• More complicated non-zero structure in general
• N = 16,614
• NNZ = 1.1 M

3x3 blocks look natural, but…

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
• But would lead to lots of "fill-in"

Extra Work Can Improve Efficiency

• Example: 3x3 blocking
  – Logical grid of 3x3 cells
  – Fill in explicit zeros
  – Unroll 3x3 block multiplies
  – "Fill ratio" = 1.5
• On Pentium III: 1.5x speedup
  – Actual Mflop rate is 1.5² = 2.25x higher

[Figure] Source: Accelerator Cavity Design Problem (Ko via Husbands)
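
A small sketch (synthetic random matrix, illustrative numbers only) of the trade-off just described: converting to r x c blocks stores explicit zeros, so the "fill ratio" measures the extra work blocking adds, and blocking wins only when the register-blocked SpMV speedup exceeds that ratio.

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(1)
n = 3000
A = sp.random(n, n, density=2e-3, format="csr", random_state=rng)
A = A + sp.eye(n, format="csr")          # make the diagonal nonzero

for r, c in [(1, 1), (2, 2), (3, 3)]:
    B = A.tobsr(blocksize=(r, c))
    stored = B.data.size                  # includes explicit zeros added by blocking
    fill_ratio = stored / A.nnz
    print(f"{r}x{c} blocking: fill ratio = {fill_ratio:.2f}")
# Blocking pays off when (speedup of the r x c register-blocked SpMV) > fill_ratio.
```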

100x100 Submatrix Along Diagonal

[Figure: spy plot of a 100x100 diagonal submatrix of the accelerator cavity matrix.]

Post-RCM Reordering

[Figure: spy plot of the same submatrix after Reverse Cuthill-McKee reordering.]

Effect of Combined RCM+TSP Reordering

[Figure: before = green + red entries; after = green + blue entries.]
• 2x speedups on Pentium 4, Power 4, …
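
A brief sketch of the bandwidth-reducing step using SciPy's reverse Cuthill-McKee (the TSP-based ordering mentioned above is a separate technique not shown here); the scrambled 2D Laplacian and the `bandwidth` helper are illustrative only.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import reverse_cuthill_mckee

# 2D 5-point Laplacian on a 30x30 grid, then scrambled by a random permutation
m = 30
T = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(m, m))
A = (sp.kron(sp.identity(m), T) + sp.kron(T, sp.identity(m))).tocsr()
rng = np.random.default_rng(2)
p0 = rng.permutation(m * m)
A_scrambled = A[p0, :][:, p0]

perm = reverse_cuthill_mckee(A_scrambled, symmetric_mode=True)
A_rcm = A_scrambled[perm, :][:, perm]

def bandwidth(M):
    C = M.tocoo()
    return int(np.max(np.abs(C.row - C.col)))

print("scrambled bandwidth:", bandwidth(A_scrambled))  # typically near m*m
print("after RCM:          ", bandwidth(A_rcm))        # close to the natural ~m
```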

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·Aᵀ·x, Aᵀ·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later…

Optimized Sparse Kernel Interface – OSKI

• Provides sparse kernels automatically tuned for user's matrix & machine
  – BLAS-style functionality: SpMV (Ax & Aᵀy), TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning
• For "advanced" users & solver library writers
  – Available as stand-alone library
  – Available as PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: Classical Conjugate Gradient (CG)

[Algorithm figure omitted in the transcript.] The SpMV and the dot products require communication in each iteration.
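
As a concrete reference point, here is a plain classical CG sketch (textbook formulation in SciPy, with a 1D Poisson stand-in problem) with comments marking where a parallel implementation communicates: one SpMV (neighbor exchange) and two dot products (global reductions) per iteration, which is exactly what CA-CG reorganizes.

```python
import numpy as np
import scipy.sparse as sp

def cg(A, b, tol=1e-8, maxiter=1000):
    x = np.zeros_like(b)
    r = b - A @ x                 # SpMV: neighbor communication
    p = r.copy()
    rr = r @ r                    # dot product: global reduction
    for _ in range(maxiter):
        Ap = A @ p                # SpMV: neighbor communication
        alpha = rr / (p @ Ap)     # dot product: global reduction
        x += alpha * p
        r -= alpha * Ap
        rr_new = r @ r            # dot product: global reduction
        if np.sqrt(rr_new) < tol:
            break
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x

n = 100
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")  # 1D Poisson
b = np.ones(n)
x = cg(A, b)
print(np.linalg.norm(A @ x - b))  # ~ 1e-8 or smaller
```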

Example: CA-Conjugate Gradient

[Algorithm figure omitted in the transcript.] The s-step Krylov basis is computed via the CA matrix powers kernel, a single global reduction computes the Gram matrix G, and the local computations within the inner loop require no communication.

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

[Figure: convergence of CG vs. CA-CG (monomial basis) on a model problem.]
• Model problem: 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ≈ 400
• CA-CG with the monomial basis shows slower convergence and loss of accuracy due to roundoff; the attainable accuracy stalls above machine precision
• At s = 16 the monomial basis is rank deficient and the method breaks down
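
A quick check (SciPy sketch) of the model problem quoted above: the 5-point 2D Poisson matrix on a 30x30 grid has condition number of roughly 400.

```python
import numpy as np
import scipy.sparse as sp

m = 30
T = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(m, m))
I = sp.identity(m)
A = sp.kron(I, T) + sp.kron(T, I)          # 2D Poisson, 5-point stencil, n = 900

eigs = np.linalg.eigvalsh(A.toarray())     # small enough to compute densely
print(eigs[-1] / eigs[0])                  # ≈ 4 (m+1)^2 / pi^2 ≈ 390
```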

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?

• Requires o(n²) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be a sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + UDVᵀ, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write Aᵏ = (S + UDVᵀ)ᵏ as a sum of Sᵏ and low-rank matrices

                                    Indices explicit (O(nnz))   Indices implicit (o(nnz))
Nonzero entries explicit (O(nnz)):  CSR and variations          Vision, climate, AMR, …
Nonzero entries implicit (o(nnz)):  Graph Laplacian             Stencils
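
A tiny sketch of the "sparse + low rank" case above: keep A = S + UDVᵀ implicitly and apply it to a vector without ever forming A (the names and sizes here are illustrative only).

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(3)
n, r = 2000, 5                                     # r = small rank
S = sp.random(n, n, density=1e-3, format="csr", random_state=rng)
U = rng.standard_normal((n, r))
D = np.diag(rng.standard_normal(r))                # small & square
V = rng.standard_normal((n, r))

def apply_A(x):
    # O(nnz(S) + n*r) work and storage, versus O(n^2) if A were formed explicitly
    return S @ x + U @ (D @ (V.T @ x))

x = rng.standard_normal(n)
dense_check = (S.toarray() + U @ D @ V.T) @ x      # only for verification
print(np.allclose(apply_A(x), dense_check))        # True
```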

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Reproducible Floating Point Computation

• Goal: get a bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm at GNS-MBH
  – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe results
  – Willing to sacrifice 40%–50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What? How will I debug without reproducibility?"
  – Few: "I know better and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility

[Figure: absolute error for random vectors (same magnitude, opposite signs) and relative error for orthogonal vectors (sign not reproducible).]
• Vector size: 1e6, data aligned to 16-byte boundaries
• For each input vector, dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum – minimum
• Relative error = absolute error / maximum absolute value
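
A minimal illustration (pure Python, synthetic data) of the root cause: floating-point addition is not associative, so partial sums formed in different orders, as different thread counts produce, give different bits.

```python
import random

random.seed(0)
x = [random.uniform(-1, 1) * 10.0**random.randint(-8, 8) for _ in range(10**5)]

s_forward  = sum(x)
s_backward = sum(reversed(x))
s_blocked  = sum(sum(x[i:i+1000]) for i in range(0, len(x), 1000))  # thread-style partial sums

print(s_forward == s_backward, s_forward == s_blocked)   # typically False False
print(s_forward - s_backward, s_forward - s_blocked)     # small but nonzero differences
```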

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, # processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (not 2 or 3)
  – Use (very) high precision to get exact answer (not 2)
  – Prerounding technique (Nguyen, D.)
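
A toy sketch of the pre-rounding idea (a single-bin simplification under stated assumptions, not the actual multi-bin algorithm): round every summand to a common grid 2ᵏ chosen so that the rounded values add with no rounding error, making the result independent of summation order at the cost of some accuracy.

```python
import math, random

def reproducible_sum(x):
    n = len(x)
    m = max(abs(v) for v in x)
    if m == 0.0:
        return 0.0
    # Grid spacing 2^k coarse enough that n terms of size <= m sum exactly in a double
    k = math.floor(math.log2(m)) - 51 + (n - 1).bit_length()
    scale = 2.0 ** k
    return sum(round(v / scale) * scale for v in x)   # each term is an exact multiple of 2^k

random.seed(1)
x = [random.uniform(-1, 1) for _ in range(10**4)]
perm = x[:]
random.shuffle(perm)
print(reproducible_sum(x) == reproducible_sum(perm))   # True: order-independent
print(sum(x) == sum(perm))                             # often False
```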

Performance results on 1024 processors of a Cray XC30: 1.2x to 3.2x slowdown vs. fastest code, for n = 1M

Collaborators and Supporters

• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all ldquon3-likerdquo linear algebra
  • Lower bound for all ldquon3-likerdquo linear algebra (2)
  • Lower bound for all ldquon3-likerdquo linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a ldquoThmrdquo
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a ldquosparse matrixrdquo
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary
Page 68: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

Speedups of Sym Band Reductionvs DSBTRD

bull Up to 17x on Intel Gainestown vs MKL 100ndash n=12000 b=500 8 threads

bull Up to 12x on Intel Westmere vs MKL 103ndash n=12000 b=200 10 threads

bull Up to 25x on AMD Budapest vs ACML 44ndash n=9000 b=500 4 threads

bull Up to 30x on AMD Magny-Cours vs ACML 44ndash n=12000 b=500 6 threads

bull Neither MKL nor ACML benefits from multithreading in DSBTRD ndash Best sequential speedup vs MKL 19xndash Best sequential speedup vs ACML 85x

Nonsymmetric Eigenproblem

bull No apparent way to modify standard algorithmbull Instead Spectral Divide-and-Conquer

ndash Find orthogonal matrix Q whose leading columns span an invariant subspace of A

ndash QTAQ will be block upper triangular

ndash Apply recursively to A11 A22

ndash Depends on randomization1 Randomized Rank Revealing QR decomposition2 Randomized location to try splitting spectrum

A11 A12

ε A22

Attaining the Lower bounds SequentialLegend[Existing][Ours][Math-Lib][Random]

Two Levels Memory Hierarchy

Words Messages Words Messages

BLAS-3 [FLPRrsquo99][BDLSTrsquo13][MKL etc] [FLPRrsquo99][BDLSTrsquo13][MKL etc]

Cholesky[Grsquo97][APrsquo00]

[LAPACK][BDHSrsquo09]

[Grsquo97][APrsquo00][BDHSrsquo09] [Grsquo97][APrsquo00][BDHSrsquo09]

Sym Indefinite [BBDDDPSTYrsquo13] [BBDDDPSTYrsquo13]

LU[Grsquo97][Trsquo97]

[GDXrsquo11][BDLSTrsquo13]

[GDXrsquo11][BDLSTrsquo13]

[Grsquo97][Trsquo97] [BDLSTrsquo13] [BDLSTrsquo13]

QR[EGrsquo98][FWrsquo03]

[DGHLrsquo12][BDLSTrsquo13]

[FWrsquo03][DGHLrsquo12][BDLSTrsquo13]

[EGrsquo98][FWrsquo03][BDLSTrsquo13]

[FWrsquo03][BDLSTrsquo13]

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13] [BDDrsquo11]

Non Sym Eig [BDDrsquo11] [BDDrsquo11]

Legend[Existing][Ours][Math-Lib][Random]

Words (BW) Messages (L) Saving factor

BLAS-3 [AGZrsquo94][MTrsquo99][ScaLAPACK][Crsquo69][vGWrsquo97][SDrsquo11] L nP12

Cholesky [ScaLAPACK][Trsquo99][SDrsquo11] L nP12

Sym Indefinite [BBDDDPSTYrsquo13][ScaLAPACK] [BBDDDPSTYrsquo13] L nP12

LU [ScaLAPACK][GDXrsquo11][Trsquo99][SDrsquo11] [GDXrsquo11][Trsquo99][SDrsquo11] L nP12

QR [ScaLAPACK][DGHLrsquo12] [Trsquo99] [DGHLrsquo12][Trsquo99] L nP12

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13][ScaLAPACK] [BDDrsquo11][BDKrsquo13] L nP12

Non-Sym Eig [BDDrsquo11] [BDDrsquo11] BW P12 L n

Attaining with extra memory 25D M=(cn2P)

Attaining the Lower bounds Parallel 2DM=(n2P)(Ignoring poly-log(P) factors words = ( n2 P12) messages = (P12)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Conjugate Gradients (CG) GMRES Lanczos Arnoldi hellip bull Goal minimize communication

ndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computationndash Challenges Poor partitioning Preconditioning Num Stability

75

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Example The Difficulty of Tuning SpMV

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

77

Example The Difficulty of Tuning

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

bull 8x8 dense substructure exploit this to limit mem_refs

78

Speedups on Itanium 2 The Need for Search

Reference

Best 4x2

Mflops

Mflops

79

Register Profile Itanium 2

190 Mflops

1190 Mflops

80

Register Profiles IBM and Intel IA-64Power3 - 17 Power4 - 16

Itanium 2 - 33Itanium 1 - 8

252 Mflops

122 Mflops

820 Mflops

459 Mflops

247 Mflops

107 Mflops

12 Gflops

190 Mflops

Another example of tuning challenges for SpMV

bull Ex11 matrix (fluid flow)

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

82

Zoom in to top corner

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

83

3x3 blocks look natural buthellip

bull Example 3x3 blockingndash Logical grid of 3x3 cells

bull But would lead to lots of ldquofill-inrdquo

84

Extra Work Can Improve Efficiency

bull Example 3x3 blockingndash Logical grid of 3x3 cellsndash Fill-in explicit zerosndash Unroll 3x3 block multipliesndash ldquoFill ratiordquo = 15

bull On Pentium III 15x speedup

ndash Actual mflop rate 152 = 225 higher

85

Source Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal

Summer School Lecture 7

87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

Before Green + RedAfter Green + Blue

Summer School Lecture 7

892x speedups on Pentium 4 Power 4 hellip

Summary of Other Performance Optimizations

bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip

bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR

bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip

90

Optimized Sparse Kernel Interface - OSKI

bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning

bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski

bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression

software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki

91

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

93

Example Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in

each iteration

via CA Matrix Powers Kernel

Global reduction to compute G

94

Example CA-Conjugate Gradient

Local computations within inner loop require

no communication

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

96

Slower convergence due

to roundoff

Loss of accuracy due to roundoff

At s = 16 monomial basis is rank deficient Method breaks down

Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400

CA-CG (monomial)CG

machine precision

97

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit

bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square

bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank

matrices

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

101

bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010

ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver

demanded by customers (construction engineers) otherwise they donrsquot believe results

ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it

ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection

Reproducible Floating Point Computation

Absolute Error for Random Vectors

Same magnitude opposite signs

Intel MKL non-reproducibility

Relative Error for Orthogonal vectors

Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value

Sign notreproducible

103

bull Consider summation or dot productbull Goals

1 Same answer independent of layout processors order of summands

2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy

bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)

GoalsApproaches for Reproducibility

104

Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M

Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya

Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger

bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David

Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang

bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou

Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National

Instruments NEC Nokia NVIDIA Samsung Oracle

bull bebopcsberkeleyedu

Summary

Donrsquot Communichellip

106

Time to redesign all linear algebra n-body hellip algorithms and software

(and compilers)

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all ldquon3-likerdquo linear algebra
  • Lower bound for all ldquon3-likerdquo linear algebra (2)
  • Lower bound for all ldquon3-likerdquo linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a ldquoThmrdquo
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a ldquosparse matrixrdquo
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary
Page 69: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

Nonsymmetric Eigenproblem

bull No apparent way to modify standard algorithmbull Instead Spectral Divide-and-Conquer

ndash Find orthogonal matrix Q whose leading columns span an invariant subspace of A

ndash QTAQ will be block upper triangular

ndash Apply recursively to A11 A22

ndash Depends on randomization1 Randomized Rank Revealing QR decomposition2 Randomized location to try splitting spectrum

A11 A12

ε A22

Attaining the Lower bounds SequentialLegend[Existing][Ours][Math-Lib][Random]

Two Levels Memory Hierarchy

Words Messages Words Messages

BLAS-3 [FLPRrsquo99][BDLSTrsquo13][MKL etc] [FLPRrsquo99][BDLSTrsquo13][MKL etc]

Cholesky[Grsquo97][APrsquo00]

[LAPACK][BDHSrsquo09]

[Grsquo97][APrsquo00][BDHSrsquo09] [Grsquo97][APrsquo00][BDHSrsquo09]

Sym Indefinite [BBDDDPSTYrsquo13] [BBDDDPSTYrsquo13]

LU[Grsquo97][Trsquo97]

[GDXrsquo11][BDLSTrsquo13]

[GDXrsquo11][BDLSTrsquo13]

[Grsquo97][Trsquo97] [BDLSTrsquo13] [BDLSTrsquo13]

QR[EGrsquo98][FWrsquo03]

[DGHLrsquo12][BDLSTrsquo13]

[FWrsquo03][DGHLrsquo12][BDLSTrsquo13]

[EGrsquo98][FWrsquo03][BDLSTrsquo13]

[FWrsquo03][BDLSTrsquo13]

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13] [BDDrsquo11]

Non Sym Eig [BDDrsquo11] [BDDrsquo11]

Legend[Existing][Ours][Math-Lib][Random]

Words (BW) Messages (L) Saving factor

BLAS-3 [AGZrsquo94][MTrsquo99][ScaLAPACK][Crsquo69][vGWrsquo97][SDrsquo11] L nP12

Cholesky [ScaLAPACK][Trsquo99][SDrsquo11] L nP12

Sym Indefinite [BBDDDPSTYrsquo13][ScaLAPACK] [BBDDDPSTYrsquo13] L nP12

LU [ScaLAPACK][GDXrsquo11][Trsquo99][SDrsquo11] [GDXrsquo11][Trsquo99][SDrsquo11] L nP12

QR [ScaLAPACK][DGHLrsquo12] [Trsquo99] [DGHLrsquo12][Trsquo99] L nP12

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13][ScaLAPACK] [BDDrsquo11][BDKrsquo13] L nP12

Non-Sym Eig [BDDrsquo11] [BDDrsquo11] BW P12 L n

Attaining with extra memory 25D M=(cn2P)

Attaining the Lower bounds Parallel 2DM=(n2P)(Ignoring poly-log(P) factors words = ( n2 P12) messages = (P12)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Conjugate Gradients (CG) GMRES Lanczos Arnoldi hellip bull Goal minimize communication

ndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computationndash Challenges Poor partitioning Preconditioning Num Stability

75

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Example The Difficulty of Tuning SpMV

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

77

Example The Difficulty of Tuning

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

bull 8x8 dense substructure exploit this to limit mem_refs

78

Speedups on Itanium 2 The Need for Search

Reference

Best 4x2

Mflops

Mflops

79

Register Profile Itanium 2

190 Mflops

1190 Mflops

80

Register Profiles IBM and Intel IA-64Power3 - 17 Power4 - 16

Itanium 2 - 33Itanium 1 - 8

252 Mflops

122 Mflops

820 Mflops

459 Mflops

247 Mflops

107 Mflops

12 Gflops

190 Mflops

Another example of tuning challenges for SpMV

bull Ex11 matrix (fluid flow)

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

82

Zoom in to top corner

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

83

3x3 blocks look natural buthellip

bull Example 3x3 blockingndash Logical grid of 3x3 cells

bull But would lead to lots of ldquofill-inrdquo

84

Extra Work Can Improve Efficiency

bull Example 3x3 blockingndash Logical grid of 3x3 cellsndash Fill-in explicit zerosndash Unroll 3x3 block multipliesndash ldquoFill ratiordquo = 15

bull On Pentium III 15x speedup

ndash Actual mflop rate 152 = 225 higher

85

Source Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal

Summer School Lecture 7

87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

Before Green + RedAfter Green + Blue

Summer School Lecture 7

892x speedups on Pentium 4 Power 4 hellip

Summary of Other Performance Optimizations

bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip

bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR

bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip

90

Optimized Sparse Kernel Interface - OSKI

bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning

bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski

bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression

software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki

91

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

93

Example Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in

each iteration

via CA Matrix Powers Kernel

Global reduction to compute G

94

Example CA-Conjugate Gradient

Local computations within inner loop require

no communication

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

96

Slower convergence due

to roundoff

Loss of accuracy due to roundoff

At s = 16 monomial basis is rank deficient Method breaks down

Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400

CA-CG (monomial)CG

machine precision

97

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit

bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square

bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank

matrices

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

101

bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010

ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver

demanded by customers (construction engineers) otherwise they donrsquot believe results

ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it

ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection

Reproducible Floating Point Computation

Absolute Error for Random Vectors

Same magnitude opposite signs

Intel MKL non-reproducibility

Relative Error for Orthogonal vectors

Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value

Sign notreproducible

103

bull Consider summation or dot productbull Goals

1 Same answer independent of layout processors order of summands

2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy

bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)

GoalsApproaches for Reproducibility

104

Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M

Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya

Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger

bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David

Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang

bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou

Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National

Instruments NEC Nokia NVIDIA Samsung Oracle

bull bebopcsberkeleyedu

Summary

Donrsquot Communichellip

106

Time to redesign all linear algebra n-body hellip algorithms and software

(and compilers)

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all ldquon3-likerdquo linear algebra
  • Lower bound for all ldquon3-likerdquo linear algebra (2)
  • Lower bound for all ldquon3-likerdquo linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a ldquoThmrdquo
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a ldquosparse matrixrdquo
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary
Page 70: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

Attaining the Lower bounds SequentialLegend[Existing][Ours][Math-Lib][Random]

Two Levels Memory Hierarchy

Words Messages Words Messages

BLAS-3 [FLPRrsquo99][BDLSTrsquo13][MKL etc] [FLPRrsquo99][BDLSTrsquo13][MKL etc]

Cholesky[Grsquo97][APrsquo00]

[LAPACK][BDHSrsquo09]

[Grsquo97][APrsquo00][BDHSrsquo09] [Grsquo97][APrsquo00][BDHSrsquo09]

Sym Indefinite [BBDDDPSTYrsquo13] [BBDDDPSTYrsquo13]

LU[Grsquo97][Trsquo97]

[GDXrsquo11][BDLSTrsquo13]

[GDXrsquo11][BDLSTrsquo13]

[Grsquo97][Trsquo97] [BDLSTrsquo13] [BDLSTrsquo13]

QR[EGrsquo98][FWrsquo03]

[DGHLrsquo12][BDLSTrsquo13]

[FWrsquo03][DGHLrsquo12][BDLSTrsquo13]

[EGrsquo98][FWrsquo03][BDLSTrsquo13]

[FWrsquo03][BDLSTrsquo13]

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13] [BDDrsquo11]

Non Sym Eig [BDDrsquo11] [BDDrsquo11]

Legend[Existing][Ours][Math-Lib][Random]

Words (BW) Messages (L) Saving factor

BLAS-3 [AGZrsquo94][MTrsquo99][ScaLAPACK][Crsquo69][vGWrsquo97][SDrsquo11] L nP12

Cholesky [ScaLAPACK][Trsquo99][SDrsquo11] L nP12

Sym Indefinite [BBDDDPSTYrsquo13][ScaLAPACK] [BBDDDPSTYrsquo13] L nP12

LU [ScaLAPACK][GDXrsquo11][Trsquo99][SDrsquo11] [GDXrsquo11][Trsquo99][SDrsquo11] L nP12

QR [ScaLAPACK][DGHLrsquo12] [Trsquo99] [DGHLrsquo12][Trsquo99] L nP12

Rank Revealing QR [BDDrsquo11][DGGXrsquo13]

Sym Eig amp SVD [BDDrsquo11][BDKrsquo13][ScaLAPACK] [BDDrsquo11][BDKrsquo13] L nP12

Non-Sym Eig [BDDrsquo11] [BDDrsquo11] BW P12 L n

Attaining with extra memory 25D M=(cn2P)

Attaining the Lower bounds Parallel 2DM=(n2P)(Ignoring poly-log(P) factors words = ( n2 P12) messages = (P12)

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Conjugate Gradients (CG) GMRES Lanczos Arnoldi hellip bull Goal minimize communication

ndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computationndash Challenges Poor partitioning Preconditioning Num Stability

75

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Example The Difficulty of Tuning SpMV

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

77

Example The Difficulty of Tuning

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

bull 8x8 dense substructure exploit this to limit mem_refs

78

Speedups on Itanium 2 The Need for Search

Reference

Best 4x2

Mflops

Mflops

79

Register Profile Itanium 2

190 Mflops

1190 Mflops

80

Register Profiles IBM and Intel IA-64Power3 - 17 Power4 - 16

Itanium 2 - 33Itanium 1 - 8

252 Mflops

122 Mflops

820 Mflops

459 Mflops

247 Mflops

107 Mflops

12 Gflops

190 Mflops

Another example of tuning challenges for SpMV

bull Ex11 matrix (fluid flow)

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

82

Zoom in to top corner

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

83

3x3 blocks look natural buthellip

bull Example 3x3 blockingndash Logical grid of 3x3 cells

bull But would lead to lots of ldquofill-inrdquo

84

Extra Work Can Improve Efficiency

bull Example 3x3 blockingndash Logical grid of 3x3 cellsndash Fill-in explicit zerosndash Unroll 3x3 block multipliesndash ldquoFill ratiordquo = 15

bull On Pentium III 15x speedup

ndash Actual mflop rate 152 = 225 higher

85

Source Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal

Summer School Lecture 7

87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

Before Green + RedAfter Green + Blue

Summer School Lecture 7

892x speedups on Pentium 4 Power 4 hellip

Summary of Other Performance Optimizations

bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip

bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR

bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip

90

Optimized Sparse Kernel Interface - OSKI

bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning

bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski

bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression

software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki

91

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

93

Example Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in

each iteration

via CA Matrix Powers Kernel

Global reduction to compute G

94

Example CA-Conjugate Gradient

Local computations within inner loop require

no communication

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

96

Slower convergence due

to roundoff

Loss of accuracy due to roundoff

At s = 16 monomial basis is rank deficient Method breaks down

Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400

CA-CG (monomial)CG

machine precision

97

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit

bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square

bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank

matrices

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

101

Reproducible Floating Point Computation

• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm at GNS-MBH
  – Sought reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe results
  – Willing to sacrifice 40% - 50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What? How will I debug without reproducibility?"
  – Few: "I know better and do careful error analysis"
  – S. Govindjee needs it for fracture simulations
  – S. Russell needs it for nuclear blast detection

Intel MKL non-reproducibility

Vector size: 1e6. Data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum – minimum
• Relative error = Absolute error / maximum absolute value

Plots:
• Absolute Error for Random Vectors: same magnitude, opposite signs
• Relative Error for Orthogonal Vectors: sign not reproducible

103
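The plots summarized above are easy to mimic in miniature: floating-point addition is not associative, so summing the same data with different blockings (as different thread counts do) can change the final bits. The snippet below is my own toy version of that experiment, not the MKL setup; blocked_sum and the random data are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(2)
x = rng.standard_normal(10**6)

def blocked_sum(x, nthreads):
    # one partial sum per simulated "thread", then combine the partials
    partials = [float(x[i::nthreads].sum()) for i in range(nthreads)]
    return sum(partials)

results = {t: blocked_sum(x, t) for t in (1, 2, 3, 4)}
print(results)
print("absolute error (max - min):", max(results.values()) - min(results.values()))

The differences are tiny for random data, but as the orthogonal-vectors panel shows, even the sign of a result near zero can flip.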

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals:
  1. Same answer independent of layout, # processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (not 2 or 3)
  – Use (very) high precision to get exact answer (not 2)
  – Prerounding technique (Nguyen, D.)

104
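To show how the third approach (prerounding) can deliver goal 1 without fixing the reduction tree or resorting to very high precision, here is a simplified sketch for summation, assuming IEEE 754 doubles: round every summand to a common power-of-two granularity chosen so that every addition is exact, which makes the result bitwise independent of summation order. The function reproducible_sum and its granularity rule are my own simplified illustration, not the Demmel/Nguyen library algorithm; it deliberately trades some accuracy for reproducibility and ignores under/overflow corner cases.

import math

def reproducible_sum(x):
    # Prerounding sketch: with every term an exact multiple of 'ulp' and all
    # partial sums within 53 bits, each addition is exact, so any summation
    # order (layout, number of threads, reduction tree) gives identical bits.
    n = len(x)
    if n == 0:
        return 0.0
    m = max(abs(v) for v in x)          # a max-reduction is itself order-independent
    if m == 0.0:
        return 0.0
    e = math.frexp(m)[1] + (n - 1).bit_length() - 52
    ulp = math.ldexp(1.0, e)            # common granularity 2^e
    total = 0.0
    for v in x:
        total += round(v / ulp) * ulp   # preround, then add exactly
    return total

vals = [0.1 * k for k in range(10**5)]
assert reproducible_sum(vals) == reproducible_sum(list(reversed(vals)))

Library implementations (the approach behind the performance numbers on the next slide) keep several prerounded accumulators at different magnitudes to recover accuracy while staying reproducible.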

Performance results on 1024 proc Cray XC30: 1.2x to 3.2x slowdown vs fastest code, for n = 1M

Collaborators and Supporters

• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

106

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)



Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

93

Example Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in

each iteration

via CA Matrix Powers Kernel

Global reduction to compute G

94

Example CA-Conjugate Gradient

Local computations within inner loop require

no communication

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

96

Slower convergence due

to roundoff

Loss of accuracy due to roundoff

At s = 16 monomial basis is rank deficient Method breaks down

Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400

CA-CG (monomial)CG

machine precision

97

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit

bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square

bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank

matrices

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

101

bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010

ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver

demanded by customers (construction engineers) otherwise they donrsquot believe results

ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it

ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection

Reproducible Floating Point Computation

Absolute Error for Random Vectors

Same magnitude opposite signs

Intel MKL non-reproducibility

Relative Error for Orthogonal vectors

Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value

Sign notreproducible

103

bull Consider summation or dot productbull Goals

1 Same answer independent of layout processors order of summands

2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy

bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)

GoalsApproaches for Reproducibility

104

Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M

Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya

Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger

bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David

Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang

bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou

Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National

Instruments NEC Nokia NVIDIA Samsung Oracle

bull bebopcsberkeleyedu

Summary

Donrsquot Communichellip

106

Time to redesign all linear algebra n-body hellip algorithms and software

(and compilers)

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all ldquon3-likerdquo linear algebra
  • Lower bound for all ldquon3-likerdquo linear algebra (2)
  • Lower bound for all ldquon3-likerdquo linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a ldquoThmrdquo
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a ldquosparse matrixrdquo
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary
Page 72: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Conjugate Gradients (CG) GMRES Lanczos Arnoldi hellip bull Goal minimize communication

ndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computationndash Challenges Poor partitioning Preconditioning Num Stability

75

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Example The Difficulty of Tuning SpMV

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

77

Example The Difficulty of Tuning

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

bull 8x8 dense substructure exploit this to limit mem_refs

78

Speedups on Itanium 2 The Need for Search

Reference

Best 4x2

Mflops

Mflops

79

Register Profile Itanium 2

190 Mflops

1190 Mflops

80

Register Profiles IBM and Intel IA-64Power3 - 17 Power4 - 16

Itanium 2 - 33Itanium 1 - 8

252 Mflops

122 Mflops

820 Mflops

459 Mflops

247 Mflops

107 Mflops

12 Gflops

190 Mflops

Another example of tuning challenges for SpMV

bull Ex11 matrix (fluid flow)

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

82

Zoom in to top corner

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

83

3x3 blocks look natural buthellip

bull Example 3x3 blockingndash Logical grid of 3x3 cells

bull But would lead to lots of ldquofill-inrdquo

84

Extra Work Can Improve Efficiency

bull Example 3x3 blockingndash Logical grid of 3x3 cellsndash Fill-in explicit zerosndash Unroll 3x3 block multipliesndash ldquoFill ratiordquo = 15

bull On Pentium III 15x speedup

ndash Actual mflop rate 152 = 225 higher

85

Source Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal

Summer School Lecture 7

87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

Before Green + RedAfter Green + Blue

Summer School Lecture 7

892x speedups on Pentium 4 Power 4 hellip

Summary of Other Performance Optimizations

bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip

bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR

bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip

90

Optimized Sparse Kernel Interface - OSKI

bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning

bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski

bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression

software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki

91

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

93

Example Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in

each iteration

via CA Matrix Powers Kernel

Global reduction to compute G

94

Example CA-Conjugate Gradient

Local computations within inner loop require

no communication

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

96

Slower convergence due

to roundoff

Loss of accuracy due to roundoff

At s = 16 monomial basis is rank deficient Method breaks down

Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400

CA-CG (monomial)CG

machine precision

97

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit

bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square

bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank

matrices

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

101

bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010

ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver

demanded by customers (construction engineers) otherwise they donrsquot believe results

ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it

ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection

Reproducible Floating Point Computation

Absolute Error for Random Vectors

Same magnitude opposite signs

Intel MKL non-reproducibility

Relative Error for Orthogonal vectors

Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value

Sign notreproducible

103

bull Consider summation or dot productbull Goals

1 Same answer independent of layout processors order of summands

2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy

bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)

GoalsApproaches for Reproducibility

104

Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M

Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya

Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger

bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David

Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang

bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou

Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National

Instruments NEC Nokia NVIDIA Samsung Oracle

bull bebopcsberkeleyedu

Summary

Donrsquot Communichellip

106

Time to redesign all linear algebra n-body hellip algorithms and software

(and compilers)

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all ldquon3-likerdquo linear algebra
  • Lower bound for all ldquon3-likerdquo linear algebra (2)
  • Lower bound for all ldquon3-likerdquo linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a ldquoThmrdquo
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a ldquosparse matrixrdquo
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary
Page 73: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

Avoiding Communication in Iterative Linear Algebra

bull k-steps of iterative solver for sparse Ax=b or Ax=λxndash Does k SpMVs with A and starting vectorndash Many such ldquoKrylov Subspace Methodsrdquo

bull Conjugate Gradients (CG) GMRES Lanczos Arnoldi hellip bull Goal minimize communication

ndash Assume matrix ldquowell-partitionedrdquondash Serial implementation

bull Conventional O(k) moves of data from slow to fast memorybull New O(1) moves of data ndash optimal

ndash Parallel implementation on p processorsbull Conventional O(k log p) messages (k SpMV calls dot prods)bull New O(log p) messages - optimal

bull Lots of speed up possible (modeled and measured)ndash Price some redundant computationndash Challenges Poor partitioning Preconditioning Num Stability

75

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Example The Difficulty of Tuning SpMV

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

77

Example The Difficulty of Tuning

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

bull 8x8 dense substructure exploit this to limit mem_refs

78

Speedups on Itanium 2 The Need for Search

Reference

Best 4x2

Mflops

Mflops

79

Register Profile Itanium 2

190 Mflops

1190 Mflops

80

Register Profiles IBM and Intel IA-64Power3 - 17 Power4 - 16

Itanium 2 - 33Itanium 1 - 8

252 Mflops

122 Mflops

820 Mflops

459 Mflops

247 Mflops

107 Mflops

12 Gflops

190 Mflops

Another example of tuning challenges for SpMV

bull Ex11 matrix (fluid flow)

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

82

Zoom in to top corner

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

83

3x3 blocks look natural buthellip

bull Example 3x3 blockingndash Logical grid of 3x3 cells

bull But would lead to lots of ldquofill-inrdquo

84

Extra Work Can Improve Efficiency

bull Example 3x3 blockingndash Logical grid of 3x3 cellsndash Fill-in explicit zerosndash Unroll 3x3 block multipliesndash ldquoFill ratiordquo = 15

bull On Pentium III 15x speedup

ndash Actual mflop rate 152 = 225 higher

85

Source Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal

Summer School Lecture 7

87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

Before Green + RedAfter Green + Blue

Summer School Lecture 7

892x speedups on Pentium 4 Power 4 hellip

Summary of Other Performance Optimizations

bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip

bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR

bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip

90

Optimized Sparse Kernel Interface - OSKI

bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning

bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski

bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression

software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki

91

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

93

Example Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in

each iteration

via CA Matrix Powers Kernel

Global reduction to compute G

94

Example CA-Conjugate Gradient

Local computations within inner loop require

no communication

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

96

Slower convergence due

to roundoff

Loss of accuracy due to roundoff

At s = 16 monomial basis is rank deficient Method breaks down

Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400

CA-CG (monomial)CG

machine precision

97

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit

bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square

bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank

matrices

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

101

bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010

ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver

demanded by customers (construction engineers) otherwise they donrsquot believe results

ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it

ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection

Reproducible Floating Point Computation

Absolute Error for Random Vectors

Same magnitude opposite signs

Intel MKL non-reproducibility

Relative Error for Orthogonal vectors

Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value

Sign notreproducible

103

bull Consider summation or dot productbull Goals

1 Same answer independent of layout processors order of summands

2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy

bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)

GoalsApproaches for Reproducibility

104

Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M

Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya

Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger

bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David

Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang

bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou

Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National

Instruments NEC Nokia NVIDIA Samsung Oracle

bull bebopcsberkeleyedu

Summary

Donrsquot Communichellip

106

Time to redesign all linear algebra n-body hellip algorithms and software

(and compilers)

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all ldquon3-likerdquo linear algebra
  • Lower bound for all ldquon3-likerdquo linear algebra (2)
  • Lower bound for all ldquon3-likerdquo linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a ldquoThmrdquo
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a ldquosparse matrixrdquo
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary
Page 74: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

Example The Difficulty of Tuning SpMV

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

77

Example The Difficulty of Tuning

bull n = 21200bull nnz = 15 M

bull Source NASA structural analysis problem (raefsky)

bull 8x8 dense substructure exploit this to limit mem_refs

78

Speedups on Itanium 2 The Need for Search

Reference

Best 4x2

Mflops

Mflops

79

Register Profile Itanium 2

190 Mflops

1190 Mflops

80

Register Profiles IBM and Intel IA-64Power3 - 17 Power4 - 16

Itanium 2 - 33Itanium 1 - 8

252 Mflops

122 Mflops

820 Mflops

459 Mflops

247 Mflops

107 Mflops

12 Gflops

190 Mflops

Another example of tuning challenges for SpMV

bull Ex11 matrix (fluid flow)

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

82

Zoom in to top corner

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

83

3x3 blocks look natural buthellip

bull Example 3x3 blockingndash Logical grid of 3x3 cells

bull But would lead to lots of ldquofill-inrdquo

84

Extra Work Can Improve Efficiency

bull Example 3x3 blockingndash Logical grid of 3x3 cellsndash Fill-in explicit zerosndash Unroll 3x3 block multipliesndash ldquoFill ratiordquo = 15

bull On Pentium III 15x speedup

ndash Actual mflop rate 152 = 225 higher

85

Source Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal

Summer School Lecture 7

87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

Before Green + RedAfter Green + Blue

Summer School Lecture 7

892x speedups on Pentium 4 Power 4 hellip

Summary of Other Performance Optimizations

bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip

bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR

bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip

90

Optimized Sparse Kernel Interface - OSKI

bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning

bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski

bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression

software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki

91

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

93

Example Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in

each iteration

via CA Matrix Powers Kernel

Global reduction to compute G

94

Example CA-Conjugate Gradient

Local computations within inner loop require

no communication

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

96

Slower convergence due

to roundoff

Loss of accuracy due to roundoff

At s = 16 monomial basis is rank deficient Method breaks down

Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400

CA-CG (monomial)CG

machine precision

97

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit

bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square

bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank

matrices

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

101

Reproducible Floating Point Computation
• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe the results
  – Willing to sacrifice 40%–50% of performance for it
• Email to ~110 Berkeley CSE faculty, asking about it
  – Most: "What?! How will I debug without reproducibility?"
  – Few: "I know better, and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility

Vector size 1e6, data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum – minimum
• Relative error = Absolute error / maximum absolute value

[Plots: Absolute Error for Random Vectors – same magnitude, opposite signs; Relative Error for Orthogonal Vectors – sign not reproducible]

103
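The effect is easy to reproduce without MKL. The toy below (an added sketch, not the slide's experiment) computes the same dot product as 1–4 partial sums, mimicking different thread counts; the results typically differ in the last bits because floating-point addition is not associative.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.standard_normal(10**6)
y = rng.standard_normal(10**6)

def chunked_dot(x, y, nthreads):
    """Dot product computed as `nthreads` partial dot products, combined at the end."""
    xs = np.array_split(x, nthreads)
    ys = np.array_split(y, nthreads)
    return sum(float(np.dot(a, b)) for a, b in zip(xs, ys))

results = [chunked_dot(x, y, t) for t in (1, 2, 3, 4)]
print(results)
print("absolute spread:", max(results) - min(results))   # typically nonzero
```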

Goals/Approaches for Reproducibility
• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (not 2 or 3)
  – Use (very) high precision to get exact answer (not 2)
  – Prerounding technique (Nguyen, D.)

104
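A drastically simplified sketch of the pre-rounding idea (a toy of my own, not the Nguyen/Demmel ReproBLAS algorithm): round every summand to a common power-of-two quantum derived from the maximum magnitude, so that every subsequent addition is exact and the result is bitwise independent of the summation order, at the cost of accuracy.

```python
import math
import numpy as np

def prerounded_sum(x):
    """Order-independent summation by pre-rounding (toy illustration only)."""
    x = np.asarray(x, dtype=np.float64)
    n = x.size
    if n == 0:
        return 0.0
    m = float(np.max(np.abs(x)))      # max is exact and order-independent
    if m == 0.0:
        return 0.0
    # Quantum q (a power of 2) chosen so that every partial sum of the rounded
    # values is an integer multiple of q small enough to fit in the 53-bit
    # significand; hence every addition below is exact, in any order.
    q = 2.0 ** math.ceil(math.log2(n * m) - 52)
    r = np.round(x / q) * q           # pre-round each summand: the only rounding step
    return float(np.sum(r))

rng = np.random.default_rng(1)
x = rng.standard_normal(10**6)
print(np.sum(x) - np.sum(x[::-1]))                   # ordinary sum: may differ
print(prerounded_sum(x) - prerounded_sum(x[::-1]))   # exactly 0.0
```

The production approach keeps far more accuracy (and speed) than this toy, but the principle is the same: make the roundoff independent of the order in which the pieces are combined.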

Performance results on 1024-proc. Cray XC30: 1.2x to 3.2x slowdown vs. fastest code, for n = 1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle

• bebop.cs.berkeley.edu

Summary

Don't Communic…

106

Time to redesign all linear algebra, n-body, … algorithms and software

(and compilers)

Register Profiles IBM and Intel IA-64Power3 - 17 Power4 - 16

Itanium 2 - 33Itanium 1 - 8

252 Mflops

122 Mflops

820 Mflops

459 Mflops

247 Mflops

107 Mflops

12 Gflops

190 Mflops

Another example of tuning challenges for SpMV

bull Ex11 matrix (fluid flow)

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

82

Zoom in to top corner

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

83

3x3 blocks look natural buthellip

bull Example 3x3 blockingndash Logical grid of 3x3 cells

bull But would lead to lots of ldquofill-inrdquo

84

Extra Work Can Improve Efficiency

bull Example 3x3 blockingndash Logical grid of 3x3 cellsndash Fill-in explicit zerosndash Unroll 3x3 block multipliesndash ldquoFill ratiordquo = 15

bull On Pentium III 15x speedup

ndash Actual mflop rate 152 = 225 higher

85

Source: Accelerator Cavity Design Problem (Ko via Husbands)

100x100 Submatrix Along Diagonal

Post-RCM Reordering

Effect of Combined RCM+TSP Reordering
• Before: Green + Red
• After: Green + Blue
• 2x speedups on Pentium 4, Power 4, …

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…
• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR
• Higher-level kernels
  – A·Aᵀ·x, Aᵀ·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later…
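The multiple-vectors (SpMM) entry is easy to illustrate: when the same sparse matrix multiplies k vectors at once, each stored entry of A is loaded once and applied to all k right-hand sides, amortizing the memory traffic that dominates single-vector SpMV. A minimal CSR SpMM sketch in Python (my illustration, not the tuned kernel behind the 7x figure):

```python
import numpy as np
from scipy.sparse import random as sparse_random

def csr_spmm(indptr, indices, data, X):
    """Y = A @ X for a CSR matrix A and a block X of k vectors (n x k).
    Each nonzero of A is read once and applied to all k vectors."""
    n, k = len(indptr) - 1, X.shape[1]
    Y = np.zeros((n, k))
    for i in range(n):
        acc = np.zeros(k)
        for p in range(indptr[i], indptr[i + 1]):
            acc += data[p] * X[indices[p]]   # one matrix entry, k multiply-adds
        Y[i] = acc
    return Y

A = sparse_random(500, 500, density=0.01, random_state=0, format="csr")
X = np.random.default_rng(0).standard_normal((500, 8))
assert np.allclose(csr_spmm(A.indptr, A.indices, A.data, X), A @ X)
```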

Optimized Sparse Kernel Interface – OSKI

• Provides sparse kernels automatically tuned for user’s matrix & machine
  – BLAS-style functionality: SpMV (A·x and Aᵀ·y), TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning
• For “advanced” users & solver library writers
  – Available as stand-alone library
  – Available as PETSc extension
  – bebop.cs.berkeley.edu/oski
• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski
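The off-line vs. run-time tuning split is easy to picture with a toy auto-tuning wrapper. The sketch below is hypothetical Python (the class and method names are mine, not OSKI's C API): the caller supplies a workload hint, tune() pays a one-time cost searching candidate block sizes, and later SpMV calls run on whichever format won the search.

```python
import time
import numpy as np
from scipy.sparse import csr_matrix

class TunedMatrix:
    """Hypothetical OSKI-like handle: hint -> tune -> repeated SpMV calls."""

    def __init__(self, A):
        self.A = csr_matrix(A)
        self.tuned = self.A              # start with plain CSR
        self.expected_calls = 0

    def set_hint_matmult(self, num_calls):
        # run-time tuning only pays off if SpMV will be called many times
        self.expected_calls = num_calls

    def tune(self, block_sizes=((1, 1), (2, 2), (3, 3), (6, 6))):
        if self.expected_calls < 10:     # tuning cost would not be amortized
            return
        x = np.ones(self.A.shape[1])
        best, best_time = self.A, float("inf")
        for r, c in block_sizes:         # empirical search over candidate formats
            if self.A.shape[0] % r or self.A.shape[1] % c:
                continue
            candidate = self.A.tobsr(blocksize=(r, c))
            t0 = time.perf_counter()
            for _ in range(5):
                candidate @ x
            elapsed = time.perf_counter() - t0
            if elapsed < best_time:
                best, best_time = candidate, elapsed
        self.tuned = best

    def matmult(self, x):
        return self.tuned @ x            # uses whatever format tuning selected
```

Usage would look like M = TunedMatrix(A); M.set_hint_matmult(500); M.tune(); y = M.matmult(x), mirroring the hint/tune/call pattern that OSKI hides behind its BLAS-style interface.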

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a “sparse matrix”?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Example: Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in each iteration.
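For reference, here is a textbook CG in Python/NumPy (my own transcription; the algorithm listing on the slide is not in this transcript). The comments flag the operations that force communication in every iteration on a distributed-memory machine: the SpMV needs neighbor exchanges, and each dot product needs a global reduction.

```python
import numpy as np

def cg(A, b, tol=1e-8, maxiter=1000):
    """Textbook conjugate gradient for symmetric positive definite A
    (dense ndarray or scipy.sparse matrix), starting from x = 0."""
    x = np.zeros_like(b)
    r = b - A @ x                    # SpMV: neighbor communication
    p = r.copy()
    rr = r @ r                       # dot product: global reduction
    for _ in range(maxiter):
        Ap = A @ p                   # SpMV: neighbor communication
        alpha = rr / (p @ Ap)        # dot product: global reduction
        x += alpha * p
        r -= alpha * Ap
        rr_new = r @ r               # dot product: global reduction
        if np.sqrt(rr_new) < tol:
            break
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x
```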

Example: CA-Conjugate Gradient

• The s SpMVs of an outer iteration are computed via the CA matrix-powers kernel
• One global reduction computes the Gram matrix G
• Local computations within the inner loop require no communication
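Below is a sketch of the two kernels CA-CG is rearranged around, under simplifying assumptions (monomial basis, no preconditioner, and none of the coefficient bookkeeping of the full method): the matrix-powers kernel produces the s-step Krylov basis, which for a well-partitioned A needs only one round of neighbor messages instead of s, and a single Gram-matrix product supplies all the inner products the next s steps need, replacing their separate global reductions.

```python
import numpy as np

def matrix_powers(A, v, s):
    """Monomial s-step Krylov basis [v, Av, ..., A^s v].
    (The CA matrix-powers kernel computes this with one round of ghost-zone
    exchange for well-partitioned A; here it is written as s plain SpMVs.)"""
    V = np.empty((len(v), s + 1))
    V[:, 0] = v
    for j in range(s):
        V[:, j + 1] = A @ V[:, j]
    return V

def gram_matrix(P, R):
    """G = [P R]^T [P R]: one global reduction supplies every inner product
    the local inner loop needs for its next s steps."""
    B = np.hstack([P, R])
    return B.T @ B
```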

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a “sparse matrix”?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

[Figure: convergence of CG vs. CA-CG (monomial basis) on a model problem – 2D Poisson, 5-point stencil, 30x30 grid, cond(A) ~ 400. CA-CG shows slower convergence and loss of accuracy due to roundoff, measured against machine precision; at s = 16 the monomial basis is rank deficient and the method breaks down.]
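To see why the monomial basis is fragile, here is a small experiment of my own (not the plot from the slide): build the normalized monomial basis for the same model problem and watch its condition number grow with s; once it approaches 1/eps the basis is numerically rank deficient, which is the breakdown the slide reports at s = 16.

```python
import numpy as np

def poisson2d(n):
    """2D Poisson, 5-point stencil, on an n x n grid (dense, for illustration)."""
    T = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    return np.kron(np.eye(n), T) + np.kron(T, np.eye(n))

A = poisson2d(30)                       # cond(A) ~ 400, as on the slide
v = np.random.default_rng(1).standard_normal(A.shape[0])
V = [v / np.linalg.norm(v)]
for s in range(1, 17):
    w = A @ V[-1]
    V.append(w / np.linalg.norm(w))     # normalized monomial basis vector ~ A^s v
    K = np.column_stack(V)
    print(f"s = {s:2d}   cond([v, Av, ..., A^s v]) ~ {np.linalg.cond(K):.2e}")
# The condition number grows exponentially with s; once it nears 1/eps the
# basis is numerically rank deficient, the breakdown the slide shows at s = 16.
```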

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a “sparse matrix”?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a “sparse matrix”?

• Requires o(n²) data/indices to store
• Nonzero entries and indices could be explicit or implicit:
  – Explicit entries, explicit indices (O(nnz) each): CSR and variations
  – Explicit entries, implicit indices: vision, climate, AMR, …
  – Implicit entries, explicit indices: graph Laplacians
  – Implicit entries, implicit indices: stencils
• Matrix could be sum of “sparse” matrices
  – Ex: A = sparse + low rank = S + U·D·V^T, D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write A^k = (S + U·D·V^T)^k as a sum of S^k and low-rank matrices
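As a small illustration of the sparse-plus-low-rank case (my example, with made-up sizes, not from the slide): A = S + U·D·V^T is never formed explicitly; one application costs a sparse multiply plus two skinny dense products, and A^k·x is just k such applications.

```python
import numpy as np
from scipy.sparse import random as sparse_random

rng = np.random.default_rng(2)
n, r, k = 2000, 5, 3                    # made-up sizes for illustration
S = sparse_random(n, n, density=1e-3, random_state=2, format="csr")  # sparse part
U = rng.standard_normal((n, r))
D = np.diag(rng.standard_normal(r))     # small & square
V = rng.standard_normal((n, r))

def apply_A(x):
    """y = (S + U D V^T) x without forming the dense n x n matrix."""
    return S @ x + U @ (D @ (V.T @ x))

def apply_Ak(x, k):
    """A^k x as k repeated applications: sparse + skinny dense work each time."""
    for _ in range(k):
        x = apply_A(x)
    return x

y = apply_Ak(rng.standard_normal(n), k)  # O(k*(nnz(S) + n*r)) work, no n^2 storage
```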

Outline

• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a “sparse matrix”?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

Reproducible Floating Point Computation

• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm, at GNS-MBH
  – Sought reproducible parallel sparse linear equation solver, demanded by customers (construction engineers), otherwise they don’t believe results
  – Willing to sacrifice 40%–50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: “What? How will I debug without reproducibility?”
  – Few: “I know better, and do careful error analysis”
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Intel MKL non-reproducibility

Vector size 1e6, data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum – minimum
• Relative error = absolute error / maximum absolute value

[Figure: two panels – “Absolute Error for Random Vectors” (results of the same magnitude but opposite signs) and “Relative Error for Orthogonal Vectors” (even the sign is not reproducible).]
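The underlying effect is easy to reproduce without MKL; the following is my own demo of floating-point non-associativity, not the slide's experiment. Summing the same dot-product terms in the chunkings that 1, 2, 3, or 4 threads would use gives answers that can differ in the last bits.

```python
import numpy as np

rng = np.random.default_rng(0)
terms = rng.standard_normal(10**6) * rng.standard_normal(10**6)  # dot-product terms

def dot_with_threads(terms, nthreads):
    """Simulate a threaded dot product: each 'thread' sums its contiguous chunk,
    then the partial sums are combined. A different thread count means a
    different summation order, hence possibly different rounding."""
    partials = [float(np.sum(chunk)) for chunk in np.array_split(terms, nthreads)]
    return sum(partials)

results = {t: dot_with_threads(terms, t) for t in (1, 2, 3, 4)}
print(results)
print("absolute error (max - min):", max(results.values()) - min(results.values()))
```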

Goals/Approaches for Reproducibility

• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, # processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (not 2 or 3)
  – Use (very) high precision to get exact answer (not 2)
  – Prerounding technique (Nguyen, D.)
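For the prerounding approach, here is a much-simplified sketch (one extraction level only; the real Demmel/Nguyen algorithm uses several bins, bounds the error, and handles overflow and exceptional values). The point it illustrates: round every summand onto one coarse grid chosen from the global maximum, after which all additions are exact, so any summation order gives bit-identical results.

```python
import numpy as np

def prerounded_sum(x):
    """Order-independent summation via one level of prerounding.

    Pick a power-of-two M with M >= ~2 * n * max|x_i|; then (x_i + M) - M
    rounds each x_i onto a fixed power-of-two grid determined by M, and
    numbers on that grid (and all their partial sums, which stay below M)
    add with no further rounding, so the result does not depend on the
    order of the additions."""
    x = np.asarray(x, dtype=np.float64)
    amax = float(np.max(np.abs(x))) if x.size else 0.0
    if amax == 0.0:
        return 0.0
    M = 2.0 ** (np.ceil(np.log2(amax)) + np.ceil(np.log2(x.size)) + 1)
    rounded = (x + M) - M          # exact extraction of each x_i's leading bits
    return float(np.sum(rounded))  # every addition is exact => reproducible

rng = np.random.default_rng(0)
x = rng.standard_normal(10**6)
perm = rng.permutation(x)
assert prerounded_sum(x) == prerounded_sum(x[::-1]) == prerounded_sum(perm)
```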

Performance results on 1024 proc. Cray XC30

• 1.2x to 3.2x slowdown vs. fastest code, for n = 1M

Collaborators and Supporters

• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don’t Communic…

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers).

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all ldquon3-likerdquo linear algebra
  • Lower bound for all ldquon3-likerdquo linear algebra (2)
  • Lower bound for all ldquon3-likerdquo linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a ldquoThmrdquo
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a ldquosparse matrixrdquo
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary
Page 80: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

Another example of tuning challenges for SpMV

bull Ex11 matrix (fluid flow)

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

82

Zoom in to top corner

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

83

3x3 blocks look natural buthellip

bull Example 3x3 blockingndash Logical grid of 3x3 cells

bull But would lead to lots of ldquofill-inrdquo

84

Extra Work Can Improve Efficiency

bull Example 3x3 blockingndash Logical grid of 3x3 cellsndash Fill-in explicit zerosndash Unroll 3x3 block multipliesndash ldquoFill ratiordquo = 15

bull On Pentium III 15x speedup

ndash Actual mflop rate 152 = 225 higher

85

Source Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal

Summer School Lecture 7

87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

Before Green + RedAfter Green + Blue

Summer School Lecture 7

892x speedups on Pentium 4 Power 4 hellip

Summary of Other Performance Optimizations

bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip

bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR

bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip

90

Optimized Sparse Kernel Interface - OSKI

bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning

bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski

bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression

software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki

91

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

93

Example Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in

each iteration

via CA Matrix Powers Kernel

Global reduction to compute G

94

Example CA-Conjugate Gradient

Local computations within inner loop require

no communication

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

96

Slower convergence due

to roundoff

Loss of accuracy due to roundoff

At s = 16 monomial basis is rank deficient Method breaks down

Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400

CA-CG (monomial)CG

machine precision

97

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit

bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square

bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank

matrices

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

101

bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010

ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver

demanded by customers (construction engineers) otherwise they donrsquot believe results

ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it

ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection

Reproducible Floating Point Computation

Absolute Error for Random Vectors

Same magnitude opposite signs

Intel MKL non-reproducibility

Relative Error for Orthogonal vectors

Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value

Sign notreproducible

103

bull Consider summation or dot productbull Goals

1 Same answer independent of layout processors order of summands

2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy

bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)

GoalsApproaches for Reproducibility

104

Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M

Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya

Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger

bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David

Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang

bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou

Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National

Instruments NEC Nokia NVIDIA Samsung Oracle

bull bebopcsberkeleyedu

Summary

Donrsquot Communichellip

106

Time to redesign all linear algebra n-body hellip algorithms and software

(and compilers)

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all ldquon3-likerdquo linear algebra
  • Lower bound for all ldquon3-likerdquo linear algebra (2)
  • Lower bound for all ldquon3-likerdquo linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a ldquoThmrdquo
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a ldquosparse matrixrdquo
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary
Page 81: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

Zoom in to top corner

bull More complicated non-zero structure in general

bull N = 16614bull NNZ = 11M

83

3x3 blocks look natural buthellip

bull Example 3x3 blockingndash Logical grid of 3x3 cells

bull But would lead to lots of ldquofill-inrdquo

84

Extra Work Can Improve Efficiency

bull Example 3x3 blockingndash Logical grid of 3x3 cellsndash Fill-in explicit zerosndash Unroll 3x3 block multipliesndash ldquoFill ratiordquo = 15

bull On Pentium III 15x speedup

ndash Actual mflop rate 152 = 225 higher

85

Source Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal

Summer School Lecture 7

87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

Before Green + RedAfter Green + Blue

Summer School Lecture 7

892x speedups on Pentium 4 Power 4 hellip

Summary of Other Performance Optimizations

bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip

bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR

bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip

90

Optimized Sparse Kernel Interface - OSKI

bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning

bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski

bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression

software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki

91

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

93

Example Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in

each iteration

via CA Matrix Powers Kernel

Global reduction to compute G

94

Example CA-Conjugate Gradient

Local computations within inner loop require

no communication

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

96

Slower convergence due

to roundoff

Loss of accuracy due to roundoff

At s = 16 monomial basis is rank deficient Method breaks down

Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400

CA-CG (monomial)CG

machine precision

97

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit

bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square

bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank

matrices

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

101

bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010

ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver

demanded by customers (construction engineers) otherwise they donrsquot believe results

ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it

ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection

Reproducible Floating Point Computation

Absolute Error for Random Vectors

Same magnitude opposite signs

Intel MKL non-reproducibility

Relative Error for Orthogonal vectors

Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value

Sign notreproducible

103

bull Consider summation or dot productbull Goals

1 Same answer independent of layout processors order of summands

2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy

bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)

GoalsApproaches for Reproducibility

104

Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M

Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya

Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger

bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David

Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang

bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou

Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National

Instruments NEC Nokia NVIDIA Samsung Oracle

bull bebopcsberkeleyedu

Summary

Donrsquot Communichellip

106

Time to redesign all linear algebra n-body hellip algorithms and software

(and compilers)

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all ldquon3-likerdquo linear algebra
  • Lower bound for all ldquon3-likerdquo linear algebra (2)
  • Lower bound for all ldquon3-likerdquo linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a ldquoThmrdquo
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a ldquosparse matrixrdquo
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary
Page 82: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

3x3 blocks look natural buthellip

bull Example 3x3 blockingndash Logical grid of 3x3 cells

bull But would lead to lots of ldquofill-inrdquo

84

Extra Work Can Improve Efficiency

bull Example 3x3 blockingndash Logical grid of 3x3 cellsndash Fill-in explicit zerosndash Unroll 3x3 block multipliesndash ldquoFill ratiordquo = 15

bull On Pentium III 15x speedup

ndash Actual mflop rate 152 = 225 higher

85

Source Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal

Summer School Lecture 7

87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

Before Green + RedAfter Green + Blue

Summer School Lecture 7

892x speedups on Pentium 4 Power 4 hellip

Summary of Other Performance Optimizations

bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip

bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR

bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip

90

Optimized Sparse Kernel Interface - OSKI

bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning

bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski

bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression

software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki

91

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

93

Example Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in

each iteration

via CA Matrix Powers Kernel

Global reduction to compute G

94

Example CA-Conjugate Gradient

Local computations within inner loop require

no communication

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

96

Slower convergence due

to roundoff

Loss of accuracy due to roundoff

At s = 16 monomial basis is rank deficient Method breaks down

Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400

CA-CG (monomial)CG

machine precision

97

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit

bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square

bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank

matrices

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

101

bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010

ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver

demanded by customers (construction engineers) otherwise they donrsquot believe results

ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it

ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection

Reproducible Floating Point Computation

Absolute Error for Random Vectors

Same magnitude opposite signs

Intel MKL non-reproducibility

Relative Error for Orthogonal vectors

Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value

Sign notreproducible

103

bull Consider summation or dot productbull Goals

1 Same answer independent of layout processors order of summands

2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy

bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)

GoalsApproaches for Reproducibility

104

Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M

Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya

Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger

bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David

Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang

bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou

Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National

Instruments NEC Nokia NVIDIA Samsung Oracle

bull bebopcsberkeleyedu

Summary

Donrsquot Communichellip

106

Time to redesign all linear algebra n-body hellip algorithms and software

(and compilers)

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all ldquon3-likerdquo linear algebra
  • Lower bound for all ldquon3-likerdquo linear algebra (2)
  • Lower bound for all ldquon3-likerdquo linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a ldquoThmrdquo
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a ldquosparse matrixrdquo
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary
Page 83: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

Extra Work Can Improve Efficiency

bull Example 3x3 blockingndash Logical grid of 3x3 cellsndash Fill-in explicit zerosndash Unroll 3x3 block multipliesndash ldquoFill ratiordquo = 15

bull On Pentium III 15x speedup

ndash Actual mflop rate 152 = 225 higher

85

Source Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal

Summer School Lecture 7

87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

Before Green + RedAfter Green + Blue

Summer School Lecture 7

892x speedups on Pentium 4 Power 4 hellip

Summary of Other Performance Optimizations

bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip

bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR

bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip

90

Optimized Sparse Kernel Interface - OSKI

bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning

bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski

bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression

software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki

91

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

93

Example Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in

each iteration

via CA Matrix Powers Kernel

Global reduction to compute G

94

Example CA-Conjugate Gradient

Local computations within inner loop require

no communication

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

96

Slower convergence due

to roundoff

Loss of accuracy due to roundoff

At s = 16 monomial basis is rank deficient Method breaks down

Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400

CA-CG (monomial)CG

machine precision

97

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit

bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square

bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank

matrices

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

101

bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010

ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver

demanded by customers (construction engineers) otherwise they donrsquot believe results

ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it

ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection

Reproducible Floating Point Computation

Absolute Error for Random Vectors

Same magnitude opposite signs

Intel MKL non-reproducibility

Relative Error for Orthogonal vectors

Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value

Sign notreproducible

103

bull Consider summation or dot productbull Goals

1 Same answer independent of layout processors order of summands

2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy

bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)

GoalsApproaches for Reproducibility

104

Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M

Page 84: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

Source: Accelerator Cavity Design Problem (Ko via Husbands)

86

100x100 Submatrix Along Diagonal


87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

Before: Green + Red; After: Green + Blue


89

2x speedups on Pentium 4, Power 4, …

Summary of Other Performance Optimizations

• Optimizations for SpMV
  – Register blocking (RB): up to 4x over CSR (see the sketch after this slide)
  – Reordering to create dense structure: 2x over CSR
  – Variable block splitting: 2.1x over CSR, 1.8x over RB
  – Diagonals: 2x over CSR
  – Symmetry: 2.8x over CSR, 2.6x over RB
  – Cache blocking: 2.8x over CSR
  – Multiple vectors (SpMM): 7x over CSR
  – And combinations…

• Sparse triangular solve
  – Hybrid sparse/dense data structure: 1.8x over CSR

• Higher-level kernels
  – A·Aᵀ·x, Aᵀ·A·x: 4x over CSR, 1.8x over RB
  – More general kernels later …

90
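As a concrete illustration of the first and last SpMV items above, here is a minimal sketch assuming SciPy: register blocking stores the matrix in small dense r×c blocks (BSR) so the inner SpMV loop reuses register-resident values, and SpMM applies the matrix to several vectors at once so A is read once per block of vectors. The 3×3 block size, matrix size, and density are illustrative assumptions, not the tuned choices behind the speedups quoted above.

```python
# Minimal sketch: register blocking (BSR) and multiple-vector SpMM.
# Block size and matrix are illustrative, not autotuned choices.
import numpy as np
import scipy.sparse as sp

n = 3000
A = sp.random(n, n, density=1e-3, format="csr", random_state=0)

A_rb = sp.bsr_matrix(A, blocksize=(3, 3))   # register-blocked ("RB") copy
x = np.ones(n)
assert np.allclose(A @ x, A_rb @ x)         # same SpMV result, denser inner loops

X = np.ones((n, 8))                         # multiple vectors at once (SpMM):
Y = A_rb @ X                                # one pass over A amortized over 8 vectors
```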

Optimized Sparse Kernel Interface - OSKI

• Provides sparse kernels automatically tuned for user's matrix & machine
  – BLAS-style functionality: SpMV (Ax & Aᵀy), TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning

• For "advanced" users & solver library writers
  – Available as stand-alone library
  – Available as PETSc extension
  – bebop.cs.berkeley.edu/oski

• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

91
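OSKI itself is a C library; as a rough stand-in for what its run-time tuning does, the sketch below times a few candidate block sizes for a given matrix on the current machine and keeps the fastest representation. The function name, candidate list, and timing loop are illustrative assumptions, not the real OSKI API.

```python
# Toy stand-in for OSKI-style run-time tuning (not the real OSKI interface):
# benchmark a few block sizes for this matrix on this machine, keep the best.
import time
import numpy as np
import scipy.sparse as sp

def tune_spmv(A_csr, candidates=((1, 1), (2, 2), (3, 3), (4, 4)), trials=20):
    x = np.ones(A_csr.shape[1])
    best, best_t = A_csr, float("inf")
    for r, c in candidates:
        if A_csr.shape[0] % r or A_csr.shape[1] % c:
            continue                          # block size must divide the dimensions
        Ab = sp.bsr_matrix(A_csr, blocksize=(r, c))
        t0 = time.perf_counter()
        for _ in range(trials):
            Ab @ x
        t = time.perf_counter() - t0
        if t < best_t:
            best, best_t = Ab, t
    return best                               # tuned matrix; SpMV results are unchanged

A = sp.random(4096, 4096, density=2e-3, format="csr", random_state=0)
A_tuned = tune_spmv(A)
```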

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

93

Example: Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in each iteration

via CA Matrix Powers Kernel

Global reduction to compute G

94

Example: CA-Conjugate Gradient

Local computations within the inner loop require no communication
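A serial sketch of the two communication patterns being contrasted, assuming SciPy and a monomial s-step basis: classical CG performs one SpMV plus several dot products (global reductions) every iteration, while the CA variant computes an s-step Krylov basis with a single matrix-powers call and replaces the per-iteration dot products with one Gram-matrix reduction per s steps. This is only the data-dependency skeleton, not a full CA-CG implementation; the 1D Poisson test matrix and s = 4 are illustrative.

```python
# Serial sketch of the communication pattern only (no distribution, no preconditioning).
import numpy as np
import scipy.sparse as sp

def classical_cg_step(A, x, r, p):
    """One classical CG iteration: 1 SpMV + dot products (global reductions)."""
    Ap = A @ p                               # SpMV: neighbor communication
    alpha = (r @ r) / (p @ Ap)               # dot products: global reductions
    x = x + alpha * p
    r_new = r - alpha * Ap
    beta = (r_new @ r_new) / (r @ r)         # another global reduction
    return x, r_new, r_new + beta * p

def matrix_powers(A, v, s):
    """CA matrix powers kernel (monomial basis): columns [v, Av, ..., A^s v]."""
    V = [v]
    for _ in range(s):
        V.append(A @ V[-1])
    return np.column_stack(V)

A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(100, 100), format="csr")
r0 = np.ones(100)
V = matrix_powers(A, r0, 4)                  # one communication phase for s SpMVs
G = V.T @ V                                  # one block reduction replaces ~2s dot products
```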

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

96

Slower convergence due to roundoff

Loss of accuracy due to roundoff

At s = 16, the monomial basis is rank deficient; the method breaks down

Model problem:
• 2D Poisson, 5-point stencil
• 30x30 grid
• Cond(A) ~ 400

Plot legend: CA-CG (monomial), CG

machine precision

97
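The rank deficiency noted above can be reproduced in a few lines, assuming SciPy; the 2D Poisson assembly and random starting vector below are stand-ins for the model problem on the slide. The printed condition numbers of the monomial basis [v, Av, …, Aˢv] grow rapidly with s and move toward and past 1/ε (about 4.5e15 in double precision), i.e., the basis becomes numerically rank deficient.

```python
# Condition number of the monomial Krylov basis for a 2D Poisson model problem.
import numpy as np
import scipy.sparse as sp

def poisson2d(n):
    """5-point stencil on an n x n grid (standard Kronecker assembly)."""
    T = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n))
    I = sp.identity(n)
    return (sp.kron(I, T) + sp.kron(T, I)).tocsr()

A = poisson2d(30)                                   # 900 x 900, cond(A) ~ 400
v = np.random.default_rng(0).standard_normal(A.shape[0])
for s in (4, 8, 12, 16):
    V = [v]
    for _ in range(s):
        V.append(A @ V[-1])
    print(s, np.linalg.cond(np.column_stack(V)))    # grows toward/past 1/eps
```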

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?
• Requires o(n²) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + UDVᵀ, with D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write Aᵏ = (S + UDVᵀ)ᵏ as a sum of Sᵏ and low-rank matrices

                                Indices:
                                Explicit (O(nnz))      Implicit (o(nnz))
Nonzero entries:
  Explicit (O(nnz))             CSR and variations     Vision, climate, AMR, …
  Implicit (o(nnz))             Graph Laplacian        Stencils
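A minimal sketch of the "sparse + low rank" point, assuming SciPy: A = S + UDVᵀ is applied to a vector (and powered) without ever forming A explicitly, so storage stays at O(nnz(S) + nr). The sizes, rank, and random data are illustrative.

```python
# Apply A = S + U D V^T (and A^k) to a vector while keeping the sparse + low-rank form.
import numpy as np
import scipy.sparse as sp

n, r = 2000, 5
rng = np.random.default_rng(0)
S = sp.random(n, n, density=1e-3, format="csr", random_state=0)
U, V = rng.standard_normal((n, r)), rng.standard_normal((n, r))
D = np.diag(rng.standard_normal(r))               # small r x r core

def apply_A(x):
    return S @ x + U @ (D @ (V.T @ x))            # O(nnz(S) + n*r) work

def apply_Ak(x, k):
    for _ in range(k):                            # A^k x as repeated applications,
        x = apply_A(x)                            # never densifying A
    return x

y = apply_Ak(np.ones(n), 3)
```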

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul
    • classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

101

• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm at GNS-MBH
  – Sought reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe results
  – Willing to sacrifice 40%-50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What? How will I debug without reproducibility?"
  – Few: "I know better and do careful error analysis"
  – S. Govindjee: needs it for fracture simulations
  – S. Russell: needs it for nuclear blast detection

Reproducible Floating Point Computation

Intel MKL non-reproducibility

Absolute Error for Random Vectors: same magnitude, opposite signs

Relative Error for Orthogonal Vectors: sign not reproducible

Vector size: 1e6. Data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum – minimum
• Relative error = Absolute error / maximum absolute value

103
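The effect on the slide can be mimicked without MKL: below, the same dot product is reduced with different numbers of chunks (standing in for 1-4 threads), and the answers differ in the last bits because floating-point addition is not associative. The vector size matches the slide; everything else is an illustrative stand-in.

```python
# Same data, different reduction orders -> slightly different dot products.
import numpy as np

rng = np.random.default_rng(42)
x, y = rng.standard_normal(10**6), rng.standard_normal(10**6)

def chunked_dot(x, y, nthreads):
    partials = [np.dot(xc, yc) for xc, yc in
                zip(np.array_split(x, nthreads), np.array_split(y, nthreads))]
    return sum(partials)                      # reduction order depends on nthreads

results = [chunked_dot(x, y, t) for t in (1, 2, 3, 4)]
print(max(results) - min(results))            # "absolute error" in the slide's sense
```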

• Consider summation or dot product
• Goals:
  1. Same answer independent of layout, processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (not 2 or 3)
  – Use (very) high precision to get exact answer (not 2)
  – Prerounding technique (Nguyen, D.)

Goals/Approaches for Reproducibility

104
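A toy, single-bin version of the prerounding idea (not the actual Demmel-Nguyen algorithm, which uses a few bins plus error-free transformations to retain more accuracy): every summand is first rounded to a fixed absolute precision chosen from max|xᵢ|, after which all additions are exact, so the result is bitwise independent of summation order. The `bits` parameter and the helper name are assumptions for illustration.

```python
import math
import numpy as np

def reproducible_sum(x, bits=30):
    """Toy prerounding: round to a common grid, then sum exactly in any order.
    Valid when len(x) * 2**bits < 2**53 (all partial sums fit in a double)."""
    x = np.asarray(x, dtype=np.float64)
    if x.size == 0:
        return 0.0
    m = float(np.max(np.abs(x)))
    if m == 0.0:
        return 0.0
    ulp = 2.0 ** (math.ceil(math.log2(m)) - bits)   # grid spacing (power of two)
    rounded = np.rint(x / ulp) * ulp                # the only rounding step
    return float(np.sum(rounded))                   # exact, hence order-independent

rng = np.random.default_rng(1)
x = rng.standard_normal(10**6)
perm = rng.permutation(x.size)
assert reproducible_sum(x) == reproducible_sum(x[perm])   # bitwise identical
```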

Performance results on 1024 proc. Cray XC30: 1.2x to 3.2x slowdown vs. fastest code for n=1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

106

Time to redesign all linear algebra, n-body, … algorithms and software

(and compilers)

Page 85: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

100x100 Submatrix Along Diagonal

Summer School Lecture 7

87

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

Before Green + RedAfter Green + Blue

Summer School Lecture 7

892x speedups on Pentium 4 Power 4 hellip

Summary of Other Performance Optimizations

bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip

bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR

bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip

90

Optimized Sparse Kernel Interface - OSKI

bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning

bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski

bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression

software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki

91

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

93

Example Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in

each iteration

via CA Matrix Powers Kernel

Global reduction to compute G

94

Example CA-Conjugate Gradient

Local computations within inner loop require

no communication

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

96

Slower convergence due

to roundoff

Loss of accuracy due to roundoff

At s = 16 monomial basis is rank deficient Method breaks down

Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400

CA-CG (monomial)CG

machine precision

97

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit

bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square

bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank

matrices

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

101

bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010

ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver

demanded by customers (construction engineers) otherwise they donrsquot believe results

ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it

ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection

Reproducible Floating Point Computation

Absolute Error for Random Vectors

Same magnitude opposite signs

Intel MKL non-reproducibility

Relative Error for Orthogonal vectors

Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value

Sign notreproducible

103

bull Consider summation or dot productbull Goals

1 Same answer independent of layout processors order of summands

2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy

bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)

GoalsApproaches for Reproducibility

104

Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M

Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya

Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger

bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David

Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang

bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou

Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National

Instruments NEC Nokia NVIDIA Samsung Oracle

bull bebopcsberkeleyedu

Summary

Donrsquot Communichellip

106

Time to redesign all linear algebra n-body hellip algorithms and software

(and compilers)

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all ldquon3-likerdquo linear algebra
  • Lower bound for all ldquon3-likerdquo linear algebra (2)
  • Lower bound for all ldquon3-likerdquo linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a ldquoThmrdquo
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a ldquosparse matrixrdquo
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary
Page 86: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

Post-RCM Reordering

88

Effect of Combined RCM+TSP Reordering

Before Green + RedAfter Green + Blue

Summer School Lecture 7

892x speedups on Pentium 4 Power 4 hellip

Summary of Other Performance Optimizations

bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip

bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR

bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip

90

Optimized Sparse Kernel Interface - OSKI

bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning

bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski

bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression

software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki

91

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

93

Example Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in

each iteration

via CA Matrix Powers Kernel

Global reduction to compute G

94

Example CA-Conjugate Gradient

Local computations within inner loop require

no communication

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

96

Slower convergence due

to roundoff

Loss of accuracy due to roundoff

At s = 16 monomial basis is rank deficient Method breaks down

Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400

CA-CG (monomial)CG

machine precision

97

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit

bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square

bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank

matrices

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

101

bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010

ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver

demanded by customers (construction engineers) otherwise they donrsquot believe results

ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it

ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection

Reproducible Floating Point Computation

Absolute Error for Random Vectors

Same magnitude opposite signs

Intel MKL non-reproducibility

Relative Error for Orthogonal vectors

Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value

Sign notreproducible

103

bull Consider summation or dot productbull Goals

1 Same answer independent of layout processors order of summands

2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy

bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)

GoalsApproaches for Reproducibility

104

Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M

Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya

Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger

bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David

Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang

bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou

Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National

Instruments NEC Nokia NVIDIA Samsung Oracle

bull bebopcsberkeleyedu

Summary

Donrsquot Communichellip

106

Time to redesign all linear algebra n-body hellip algorithms and software

(and compilers)

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all ldquon3-likerdquo linear algebra
  • Lower bound for all ldquon3-likerdquo linear algebra (2)
  • Lower bound for all ldquon3-likerdquo linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a ldquoThmrdquo
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a ldquosparse matrixrdquo
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary
Page 87: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

Effect of Combined RCM+TSP Reordering

Before Green + RedAfter Green + Blue

Summer School Lecture 7

892x speedups on Pentium 4 Power 4 hellip

Summary of Other Performance Optimizations

bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip

bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR

bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip

90

Optimized Sparse Kernel Interface - OSKI

bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning

bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski

bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression

software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki

91

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

93

Example Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in

each iteration

via CA Matrix Powers Kernel

Global reduction to compute G

94

Example CA-Conjugate Gradient

Local computations within inner loop require

no communication

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

96

Slower convergence due

to roundoff

Loss of accuracy due to roundoff

At s = 16 monomial basis is rank deficient Method breaks down

Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400

CA-CG (monomial)CG

machine precision

97

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit

bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square

bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank

matrices

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

101

bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010

ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver

demanded by customers (construction engineers) otherwise they donrsquot believe results

ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it

ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection

Reproducible Floating Point Computation

Absolute Error for Random Vectors

Same magnitude opposite signs

Intel MKL non-reproducibility

Relative Error for Orthogonal vectors

Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value

Sign notreproducible

103

bull Consider summation or dot productbull Goals

1 Same answer independent of layout processors order of summands

2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy

bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)

GoalsApproaches for Reproducibility

104

Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M

Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya

Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger

bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David

Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang

bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou

Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National

Instruments NEC Nokia NVIDIA Samsung Oracle

bull bebopcsberkeleyedu

Summary

Donrsquot Communichellip

106

Time to redesign all linear algebra n-body hellip algorithms and software

(and compilers)

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all ldquon3-likerdquo linear algebra
  • Lower bound for all ldquon3-likerdquo linear algebra (2)
  • Lower bound for all ldquon3-likerdquo linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a ldquoThmrdquo
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a ldquosparse matrixrdquo
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary
Page 88: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

Summary of Other Performance Optimizations

bull Optimizations for SpMVndash Register blocking (RB) up to 4x over CSRndash Reordering to create dense structure 2x over CSRndash Variable block splitting 21x over CSR 18x over RBndash Diagonals 2x over CSRndash Symmetry 28x over CSR 26x over RBndash Cache blocking 28x over CSRndash Multiple vectors (SpMM) 7x over CSRndash And combinationshellip

bull Sparse triangular solvendash Hybrid sparsedense data structure 18x over CSR

bull Higher-level kernelsndash AmiddotATmiddotx ATmiddotAmiddotx 4x over CSR 18x over RBndash More general kernels later hellip

90

Optimized Sparse Kernel Interface - OSKI

bull Provides sparse kernels automatically tuned for userrsquos matrix amp machinendash BLAS-style functionality SpMV Ax amp ATy TrSVndash Does both off-line and run-time tuningndash Hides complexity of run-time tuning

bull For ldquoadvancedrdquo users amp solver library writersndash Available as stand-alone libraryndash Available as PETSc extensionndash bebopcsberkeleyeduoski

bull pOSKIndash Extension to multicore architecturesndash OSKI + thread blocking cache blocking matrix compression

software prefetching NUMA SIMD hellipndash bebopcsberkeleyeduposki

91

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

93

Example Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in

each iteration

via CA Matrix Powers Kernel

Global reduction to compute G

94

Example CA-Conjugate Gradient

Local computations within inner loop require

no communication

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

96

Slower convergence due

to roundoff

Loss of accuracy due to roundoff

At s = 16 monomial basis is rank deficient Method breaks down

Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400

CA-CG (monomial)CG

machine precision

97

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit

bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square

bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank

matrices

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

101

bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010

ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver

demanded by customers (construction engineers) otherwise they donrsquot believe results

ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it

ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection

Reproducible Floating Point Computation

Absolute Error for Random Vectors

Same magnitude opposite signs

Intel MKL non-reproducibility

Relative Error for Orthogonal vectors

Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value

Sign notreproducible

103

bull Consider summation or dot productbull Goals

1 Same answer independent of layout processors order of summands

2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy

bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)

GoalsApproaches for Reproducibility

104

Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M

Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya

Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger

bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David

Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang

bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou

Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National

Instruments NEC Nokia NVIDIA Samsung Oracle

bull bebopcsberkeleyedu

Summary

Donrsquot Communichellip

106

Time to redesign all linear algebra n-body hellip algorithms and software

(and compilers)

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all ldquon3-likerdquo linear algebra
  • Lower bound for all ldquon3-likerdquo linear algebra (2)
  • Lower bound for all ldquon3-likerdquo linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a ldquoThmrdquo
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a ldquosparse matrixrdquo
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary
Page 89: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

Optimized Sparse Kernel Interface - OSKI

• Provides sparse kernels automatically tuned for user's matrix & machine
  – BLAS-style functionality: SpMV (Ax & A^T y), TrSV
  – Does both off-line and run-time tuning
  – Hides complexity of run-time tuning

• For "advanced" users & solver library writers
  – Available as stand-alone library
  – Available as PETSc extension
  – bebop.cs.berkeley.edu/oski

• pOSKI
  – Extension to multicore architectures
  – OSKI + thread blocking, cache blocking, matrix compression, software prefetching, NUMA, SIMD, …
  – bebop.cs.berkeley.edu/poski

91
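
To make the off-line/run-time tuning idea concrete, here is a minimal sketch (my illustration, not the OSKI API; the function name tune_spmv and the candidate block sizes are made up) of the kind of search such a library performs: convert the matrix to several register-blocked formats, time SpMV on each, and keep the fastest for this matrix and machine.

# Illustrative only: a toy version of OSKI-style run-time tuning for SpMV.
# scipy's BSR format plays the role of an r x c register-blocked kernel.
import time
import numpy as np
import scipy.sparse as sp

def tune_spmv(A_csr, block_sizes=((1, 1), (2, 2), (3, 3), (4, 4)), trials=10):
    """Pick the register block size whose SpMV runs fastest on this matrix/machine."""
    x = np.random.rand(A_csr.shape[1])
    best = None
    for r, c in block_sizes:
        try:
            A_blk = sp.bsr_matrix(A_csr, blocksize=(r, c))  # may pad with explicit zeros
        except ValueError:
            continue                                        # dimensions not divisible by (r, c)
        t0 = time.perf_counter()
        for _ in range(trials):
            A_blk @ x                                       # the kernel being timed
        dt = time.perf_counter() - t0
        if best is None or dt < best[0]:
            best = (dt, (r, c), A_blk)
    return best[1], best[2]                                 # chosen block size, tuned matrix

A = sp.random(3000, 3000, density=1e-3, format='csr')
blocksize, A_tuned = tune_spmv(A)
print("chosen register block:", blocksize)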

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

93

Example: Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in each iteration.

94

Example: CA-Conjugate Gradient

In each outer iteration the SpMVs are performed via the CA matrix powers kernel and the dot products are replaced by a single global reduction to compute G; the local computations within the inner loop then require no communication.
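
For reference, a compact sketch of classical CG (a textbook formulation with generic variable names, not code from the talk), with the per-iteration communication points marked; CA-CG reorganizes s such iterations so the SpMVs become one matrix powers kernel call and the dot products one global reduction.

# Classical CG, annotated with where communication happens in a
# distributed-memory setting (illustrative sketch, not the CA variant).
import numpy as np
import scipy.sparse as sp

def cg(A, b, tol=1e-8, maxit=1000):
    x = np.zeros_like(b)
    r = b - A @ x                      # SpMV: neighbor communication
    p = r.copy()
    rr = r @ r                         # dot product: global reduction
    for _ in range(maxit):
        Ap = A @ p                     # SpMV: neighbor communication (every iteration)
        alpha = rr / (p @ Ap)          # dot product: global reduction (every iteration)
        x += alpha * p
        r -= alpha * Ap
        rr_new = r @ r                 # dot product: global reduction (every iteration)
        if np.sqrt(rr_new) < tol:
            break
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x

n = 30
T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n))
A = (sp.kron(sp.identity(n), T) + sp.kron(T, sp.identity(n))).tocsr()  # 2D Poisson, 5-point stencil
b = np.ones(n * n)
x = cg(A, b)
print(np.linalg.norm(A @ x - b))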

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

96

[Convergence plot: CA-CG (monomial basis) vs. CG on a model problem (2D Poisson, 5-point stencil, 30x30 grid, cond(A) ~ 400). Relative to machine precision, CA-CG shows slower convergence and loss of accuracy due to roundoff; at s = 16 the monomial basis is rank deficient and the method breaks down.]

97
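
A small numerical illustration (my own check, using the slide's model problem and an arbitrary start vector) of why the monomial basis p, Ap, A^2 p, … eventually fails: its condition number grows exponentially with s, so by s around 16 the basis is numerically rank deficient.

# Build the normalized (but not orthogonalized) monomial Krylov basis for
# 2D Poisson and watch its condition number grow with s.
import numpy as np
import scipy.sparse as sp

n = 30
I = sp.identity(n)
T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n))
A = (sp.kron(I, T) + sp.kron(T, I)).tocsr()   # 2D Poisson, 5-point stencil

rng = np.random.default_rng(0)
p = rng.standard_normal(n * n)
V = [p / np.linalg.norm(p)]
for s in range(1, 17):
    v = A @ V[-1]
    V.append(v / np.linalg.norm(v))           # normalize only; columns align over time
    K = np.column_stack(V)
    print(s, np.linalg.cond(K))               # grows exponentially with s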

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

What is a "sparse matrix"?
• Requires o(n^2) data/indices to store
• Nonzero entries and indices could be explicit or implicit
• Matrix could be a sum of "sparse" matrices
  – Ex: A = sparse + low rank = S + U D V^T, with D small & square
• Semiseparable matrices arise as preconditioners
  – Need to write A^k = (S + U D V^T)^k as a sum of S^k and low-rank matrices

                              Indices explicit (O(nnz))    Indices implicit (o(nnz))
Nonzero entries explicit:     CSR and variations           Vision, climate, AMR, …
Nonzero entries implicit:     Graph Laplacian              Stencils
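
As a concrete example of the "sparse + low rank" structure, a short sketch (sizes and density are made up; S, U, D, V follow the slide's notation) of applying A = S + U D V^T, and its powers, to a vector without ever forming A:

# Apply A = S + U*D*V^T (sparse + low rank) to a vector without forming A.
# Repeated application gives A^k * x, the operation needed when such a
# matrix is used as a preconditioner or inside a Krylov method.
import numpy as np
import scipy.sparse as sp

n, r = 2000, 5                                   # r = rank of the low-rank part
S = sp.random(n, n, density=1e-3, format='csr')  # the sparse part
U = np.random.rand(n, r)
D = np.diag(np.random.rand(r))                   # small & square, as on the slide
V = np.random.rand(n, r)

def apply_A(x):
    # O(nnz(S)) + O(n*r) work instead of O(n^2); in parallel, V.T @ x is the
    # only step that needs a (small) global reduction.
    return S @ x + U @ (D @ (V.T @ x))

def apply_Ak(x, k):
    for _ in range(k):
        x = apply_A(x)
    return x

x = np.random.rand(n)
print(np.linalg.norm(apply_Ak(x, 3)))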

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

101

Reproducible Floating Point Computation
• Get bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm at GNS-MBH
  – Sought reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe results
  – Willing to sacrifice 40%-50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What? How will I debug without reproducibility?"
  – Few: "I know better and do careful error analysis"
  – S. Govindjee needs it for fracture simulations
  – S. Russell needs it for nuclear blast detection

Intel MKL non-reproducibility

[Plots: Absolute Error for Random Vectors (same magnitude, opposite signs); Relative Error for Orthogonal Vectors (sign not reproducible).]

Vector size 1e6, data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum – minimum
• Relative error = Absolute error / maximum absolute value

103
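
The root cause is that the reduction order inside a threaded dot product changes with the thread count, and floating-point addition is not associative; a tiny stand-alone illustration (my own, not MKL-specific):

# Floating-point addition is not associative, so different reduction
# schedules (e.g., different thread counts) can give different answers.
x = [1e16, 1.0, -1e16, 1.0]
left_to_right = ((x[0] + x[1]) + x[2]) + x[3]   # 1.0: the 1.0 added to 1e16 is lost to rounding
pairwise      = (x[0] + x[2]) + (x[1] + x[3])   # 2.0: a different schedule keeps both 1.0's
print(left_to_right, pairwise)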

Goals/Approaches for Reproducibility
• Consider summation or dot product
• Goals:
  1. Same answer, independent of layout, # processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee fixed reduction tree (not 2 or 3)
  – Use (very) high precision to get exact answer (not 2)
  – Prerounding technique (Nguyen, D.)

104
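
A minimal sketch of the pre-rounding idea (a deliberate simplification, not the actual Nguyen/Demmel algorithm; the function name and the bits parameter are illustrative): one reduction finds a global magnitude, each summand is rounded to a common bin width derived from it, and the remaining sum is then exact, hence independent of summand order and processor count.

# Toy pre-rounding summation: reproducible by construction, because after
# rounding to a shared bin width the additions commit no rounding error.
import math
import random

def reproducible_sum(x, bits=40):
    if not x:
        return 0.0
    m = max(abs(v) for v in x)                    # reduction 1: global max magnitude
    if m == 0.0:
        return 0.0
    ulp = 2.0 ** (math.floor(math.log2(m)) - bits)  # common bin width, keeps ~bits leading bits
    q = [round(v / ulp) for v in x]               # pre-round each summand to a multiple of ulp
    return sum(q) * ulp                           # reduction 2: exact integer sum, then scale

vals = [1e-8, 1.0, -1.0, 3.14159, 2.718e-5, -3.14159]
for _ in range(3):
    random.shuffle(vals)
    print(reproducible_sum(vals))                 # identical bits for every ordering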

Performance results on 1024-processor Cray XC30: 1.2x to 3.2x slowdown vs. fastest code, for n = 1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)

106

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all ldquon3-likerdquo linear algebra
  • Lower bound for all ldquon3-likerdquo linear algebra (2)
  • Lower bound for all ldquon3-likerdquo linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a ldquoThmrdquo
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a ldquosparse matrixrdquo
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary
Page 90: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

93

Example Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in

each iteration

via CA Matrix Powers Kernel

Global reduction to compute G

94

Example CA-Conjugate Gradient

Local computations within inner loop require

no communication

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

96

Slower convergence due

to roundoff

Loss of accuracy due to roundoff

At s = 16 monomial basis is rank deficient Method breaks down

Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400

CA-CG (monomial)CG

machine precision

97

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit

bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square

bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank

matrices

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

101

bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010

ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver

demanded by customers (construction engineers) otherwise they donrsquot believe results

ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it

ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection

Reproducible Floating Point Computation

Absolute Error for Random Vectors

Same magnitude opposite signs

Intel MKL non-reproducibility

Relative Error for Orthogonal vectors

Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value

Sign notreproducible

103

bull Consider summation or dot productbull Goals

1 Same answer independent of layout processors order of summands

2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy

bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)

GoalsApproaches for Reproducibility

104

Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M

Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya

Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger

bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David

Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang

bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou

Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National

Instruments NEC Nokia NVIDIA Samsung Oracle

bull bebopcsberkeleyedu

Summary

Donrsquot Communichellip

106

Time to redesign all linear algebra n-body hellip algorithms and software

(and compilers)

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all ldquon3-likerdquo linear algebra
  • Lower bound for all ldquon3-likerdquo linear algebra (2)
  • Lower bound for all ldquon3-likerdquo linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a ldquoThmrdquo
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a ldquosparse matrixrdquo
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary
Page 91: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

93

Example Classical Conjugate Gradient (CG)

SpMVs and dot products require communication in

each iteration

via CA Matrix Powers Kernel

Global reduction to compute G

94

Example CA-Conjugate Gradient

Local computations within inner loop require

no communication

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

96

Slower convergence due

to roundoff

Loss of accuracy due to roundoff

At s = 16 monomial basis is rank deficient Method breaks down

Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400

CA-CG (monomial)CG

machine precision

97

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit

bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square

bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank

matrices

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

101

bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010

ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver

demanded by customers (construction engineers) otherwise they donrsquot believe results

ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it

ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection

Reproducible Floating Point Computation

Absolute Error for Random Vectors

Same magnitude opposite signs

Intel MKL non-reproducibility

Relative Error for Orthogonal vectors

Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value

Sign notreproducible

103

bull Consider summation or dot productbull Goals

1 Same answer independent of layout processors order of summands

2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy

bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)

GoalsApproaches for Reproducibility

104

Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M

Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya

Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger

bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David

Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang

bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou

Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National

Instruments NEC Nokia NVIDIA Samsung Oracle

bull bebopcsberkeleyedu

Summary

Donrsquot Communichellip

106

Time to redesign all linear algebra n-body hellip algorithms and software

(and compilers)

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all ldquon3-likerdquo linear algebra
  • Lower bound for all ldquon3-likerdquo linear algebra (2)
  • Lower bound for all ldquon3-likerdquo linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a ldquoThmrdquo
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a ldquosparse matrixrdquo
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary
Page 92: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

via CA Matrix Powers Kernel

Global reduction to compute G

94

Example CA-Conjugate Gradient

Local computations within inner loop require

no communication

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

96

Slower convergence due

to roundoff

Loss of accuracy due to roundoff

At s = 16 monomial basis is rank deficient Method breaks down

Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400

CA-CG (monomial)CG

machine precision

97

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit

bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square

bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank

matrices

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

101

bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010

ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver

demanded by customers (construction engineers) otherwise they donrsquot believe results

ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it

ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection

Reproducible Floating Point Computation

Absolute Error for Random Vectors

Same magnitude opposite signs

Intel MKL non-reproducibility

Relative Error for Orthogonal vectors

Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value

Sign notreproducible

103

bull Consider summation or dot productbull Goals

1 Same answer independent of layout processors order of summands

2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy

bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)

GoalsApproaches for Reproducibility

104

Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M

Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya

Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger

bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David

Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang

bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou

Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National

Instruments NEC Nokia NVIDIA Samsung Oracle

bull bebopcsberkeleyedu

Summary

Donrsquot Communichellip

106

Time to redesign all linear algebra n-body hellip algorithms and software

(and compilers)

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all ldquon3-likerdquo linear algebra
  • Lower bound for all ldquon3-likerdquo linear algebra (2)
  • Lower bound for all ldquon3-likerdquo linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a ldquoThmrdquo
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a ldquosparse matrixrdquo
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary
Page 93: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuing Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

96

Slower convergence due

to roundoff

Loss of accuracy due to roundoff

At s = 16 monomial basis is rank deficient Method breaks down

Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400

CA-CG (monomial)CG

machine precision

97

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit

bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square

bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank

matrices

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

101

bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010

ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver

demanded by customers (construction engineers) otherwise they donrsquot believe results

ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it

ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection

Reproducible Floating Point Computation

Absolute Error for Random Vectors

Same magnitude opposite signs

Intel MKL non-reproducibility

Relative Error for Orthogonal vectors

Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value

Sign notreproducible

103

bull Consider summation or dot productbull Goals

1 Same answer independent of layout processors order of summands

2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy

bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)

GoalsApproaches for Reproducibility

104

Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M

Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya

Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger

bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David

Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang

bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou

Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National

Instruments NEC Nokia NVIDIA Samsung Oracle

bull bebopcsberkeleyedu

Summary

Donrsquot Communichellip

106

Time to redesign all linear algebra n-body hellip algorithms and software

(and compilers)

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all ldquon3-likerdquo linear algebra
  • Lower bound for all ldquon3-likerdquo linear algebra (2)
  • Lower bound for all ldquon3-likerdquo linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a ldquoThmrdquo
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a ldquosparse matrixrdquo
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary
Page 94: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

96

Slower convergence due

to roundoff

Loss of accuracy due to roundoff

At s = 16 monomial basis is rank deficient Method breaks down

Model problem bull 2D Poisson 5 point stencilbull 30x30 gridbull Cond(A)~400

CA-CG (monomial)CG

machine precision

97

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit

bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square

bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank

matrices

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

101

bull Get bit-wise identical answer when you type aout againbull NA-Digest submission on 8 Sep 2010

ndash From Kai Diethelm at GNS-MBHndash Sought reproducible parallel sparse linear equation solver

demanded by customers (construction engineers) otherwise they donrsquot believe results

ndash Willing to sacrifice 40 - 50 of performance for itbull Email to ~110 Berkeley CSE faculty asking about it

ndash Most ldquoWhat How will I debug without reproducibilityrdquondash Few ldquoI know better and do careful error analysisrdquondash S Govindjee needs it for fracture simulationsndash S Russell needs it for nuclear blast detection

Reproducible Floating Point Computation

Absolute Error for Random Vectors

Same magnitude opposite signs

Intel MKL non-reproducibility

Relative Error for Orthogonal vectors

Vector size 1e6 Data aligned to 16-byte boundaries For each input vectorbull Dot products are computed using 1 2 3 or 4 threadsbull Absolute error = maximum ndash minimumbull Relative error = Absolute error maximum absolute value

Sign notreproducible

103

bull Consider summation or dot productbull Goals

1 Same answer independent of layout processors order of summands

2 Good performance (scales well)3 Portable (assume IEEE 754 only)4 User can choose accuracy

bull Approachesndash Guarantee fixed reduction tree (not 2 or 3)ndash Use (very) high precision to get exact answer (not 2)ndash Prerounding technique (Nguyen D)

GoalsApproaches for Reproducibility

104

Performance results on 1024 proc Cray XC3012x to 32x slowdown vs fastest code for n=1M

Collaborators and Supportersbull James Demmel Kathy Yelick Michael Anderson Grey Ballard Erin Carson Aditya

Devarakonda Michael Driscoll David Eliahu Andrew Gearhart Evangelos Georganas Nicholas Knight Penporn Koanantakool Ben Lipshitz Diep Nguyen Oded Schwartz Edgar Solomonik Omer Spillinger

bull Austin Benson Maryam Dehnavi Mark Hoemmen Shoaib Kamil Marghoob Mohiyuddinbull Abhinav Bhatele Aydin Buluc Michael Christ Ioana Dumitriu Armando Fox David

Gleich Ming Gu Jeff Hammond Mike Heroux Olga Holtz Kurt Keutzer Julien Langou Devin Matthews Tom Scanlon Michelle Strout Sam Williams Hua Xiang

bull Jack Dongarra Dulceneia Becker Ichitaro Yamazakibull Sivan Toledo Alex Druinsky Inon Peled bull Laura Grigori Sebastien Cayrols Simplice Donfack Mathias Jacquelin Amal Khabou

Sophie Moufawad Mikolaj Szydlarskibull Members of ParLab ASPIRE BEBOP CACHE EASI FASTMath MAGMA PLASMAbull Thanks to DOE NSF UC Discovery INRIA Intel Microsoft Mathworks National

Instruments NEC Nokia NVIDIA Samsung Oracle

bull bebopcsberkeleyedu

Summary

Donrsquot Communichellip

106

Time to redesign all linear algebra n-body hellip algorithms and software

(and compilers)

  • Implementing Communication-Avoiding Algorithms
  • Why avoid communication
  • Goals
  • Outline
  • Outline (2)
  • Lower bound for all ldquon3-likerdquo linear algebra
  • Lower bound for all ldquon3-likerdquo linear algebra (2)
  • Lower bound for all ldquon3-likerdquo linear algebra (3)
  • Limits to parallel scaling (12)
  • Limits to parallel scaling (22)
  • Can we attain these lower bounds
  • Outline (3)
  • 25D Matrix Multiplication
  • 25D Matrix Multiplication (2)
  • 25D Matmul on BGP 16K nodes 64K cores (2)
  • Perfect Strong Scaling ndash in Time and Energy (12)
  • Perfect Strong Scaling ndash in Time and Energy (22)
  • Handling Heterogeneity
  • Application to Tensor Contractions
  • C(ijk) = Σm A(ijm)B(mk)
  • Application to Tensor Contractions (2)
  • Communication Lower Bounds for Strassen-like matmul algorithms
  • vs
  • Slide 26
  • Strassen-like beyond matmul
  • Cache and Network Oblivious Algorithms
  • CARMA Performance Distributed Memory
  • CARMA Performance Distributed Memory (2)
  • CARMA Performance Shared Memory
  • CARMA Performance Shared Memory (2)
  • Why is CARMA Faster in Shared Memory
  • Outline (4)
  • One-sided Factorizations (LU QR) so far
  • TSQR An Architecture-Dependent Algorithm
  • Back to LU Using similar idea for TSLU as TSQR Use reduction
  • Minimizing Communication in TSLU
  • Making TSLU Numerically Stable
  • Stability of LU using TSLU CALU
  • Why is stability of TSLU just a ldquoThmrdquo
  • Fixing TSLU
  • 2D CALU with Tournament Pivoting
  • 25D CALU with Tournament Pivoting (c=4 copies)
  • Exascale Machine Parameters Source DOE Exascale Workshop
  • Exascale predicted speedups for Gaussian Elimination 2D CA
  • 25D vs 2D LU With and Without Pivoting
  • Other CA algorithms for Ax=b least squares(13)
  • Other CA algorithms for Ax=b least squares (23)
  • Other CA algorithms for Ax=b least squares (33)
  • Outline (5)
  • What about sparse matrices (13)
  • Performance of 25D APSP using Kleene
  • What about sparse matrices (23)
  • What about sparse matrices (33)
  • Outline (6)
  • Symmetric Eigenproblem and SVD
  • Slide 58
  • Slide 59
  • Slide 60
  • Slide 61
  • Slide 62
  • Slide 63
  • Slide 64
  • Slide 65
  • Slide 66
  • Slide 67
  • Slide 68
  • Conventional vs CA - SBR
  • Speedups of Sym Band Reduction vs DSBTRD
  • Nonsymmetric Eigenproblem
  • Attaining the Lower bounds Sequential
  • Attaining the Lower bounds Parallel 2DM=(n2P) (Ignoring po
  • Outline (7)
  • Avoiding Communication in Iterative Linear Algebra
  • Outline (8)
  • Example The Difficulty of Tuning SpMV
  • Example The Difficulty of Tuning
  • Speedups on Itanium 2 The Need for Search
  • Register Profile Itanium 2
  • Register Profiles IBM and Intel IA-64
  • Another example of tuning challenges for SpMV
  • Zoom in to top corner
  • 3x3 blocks look natural buthellip
  • Extra Work Can Improve Efficiency
  • Slide 86
  • Slide 87
  • Slide 88
  • Slide 89
  • Summary of Other Performance Optimizations
  • Optimized Sparse Kernel Interface - OSKI
  • Outline (9)
  • Example Classical Conjugate Gradient (CG)
  • Example CA-Conjugate Gradient
  • Outline (10)
  • Slide 96
  • Slide 97
  • Outline (11)
  • What is a ldquosparse matrixrdquo
  • Outline (12)
  • Reproducible Floating Point Computation
  • Intel MKL non-reproducibility
  • GoalsApproaches for Reproducibility
  • Performance results on 1024 proc Cray XC30 12x to 32x slowdow
  • Collaborators and Supporters
  • Summary
Page 95: Implementing Communication-Avoiding Algorithms Jim Demmel EECS & Math Departments UC Berkeley.

97

Outlinebull Review extend communication lower boundsbull Direct Linear Algebra Algorithms

ndash Matmul bull classical amp Strassen-like heterogeneous tensors oblivious

ndash LU amp QR (tournament pivoting)ndash Sparse matricesndash Eigenproblems (symmetric and nonsymmetric)

bull Iterative Linear Algebrandash Autotuning Sparse-Matrix-Vector-Multiply (SpMV)ndash Reorganizing Krylov methods ndash Conjugate Gradientsndash Stability challenges and approachesndash What is a ldquosparse matrixrdquo

bull Floating-point reproducibilityndash Despite nondeterminismnonassociativity

What is a ldquosparse matrixrdquobull Requires o(n2) dataindices to storebull Nonzero entries and indices could be explicit or implicit

bull Matrix could be sum of ldquosparserdquo matrices ndash Ex A = sparse + low rank = S + UDVT D small amp square

bull Semiseparable matrices arise as preconditionersndash Need to write Ak = (S + UDVT)k as sum of Sk and low rank

matrices

Explicit (O(nnz)) Implicit (o(nnz))

Explicit (O(nnz)) CSR and variations Vision climate AMRhellip

Implicit (o(nnz)) Graph Laplacian StencilsNonzero entries

Indices

Outline
• Review, extend communication lower bounds
• Direct Linear Algebra Algorithms
  – Matmul: classical & Strassen-like, heterogeneous, tensors, oblivious
  – LU & QR (tournament pivoting)
  – Sparse matrices
  – Eigenproblems (symmetric and nonsymmetric)
• Iterative Linear Algebra
  – Autotuning Sparse-Matrix-Vector-Multiply (SpMV)
  – Reorganizing Krylov methods – Conjugate Gradients
  – Stability challenges and approaches
  – What is a "sparse matrix"?
• Floating-point reproducibility
  – Despite nondeterminism/nonassociativity

101

Reproducible Floating Point Computation
• Get a bit-wise identical answer when you type a.out again
• NA-Digest submission on 8 Sep 2010
  – From Kai Diethelm at GNS-MBH
  – Sought a reproducible parallel sparse linear equation solver, demanded by customers (construction engineers); otherwise they don't believe the results
  – Willing to sacrifice 40-50% of performance for it
• Email to ~110 Berkeley CSE faculty asking about it
  – Most: "What? How will I debug without reproducibility?"
  – Few: "I know better and do careful error analysis"
  – S. Govindjee needs it for fracture simulations
  – S. Russell needs it for nuclear blast detection

Intel MKL non-reproducibility
Vector size: 1e6. Data aligned to 16-byte boundaries. For each input vector:
• Dot products are computed using 1, 2, 3, or 4 threads
• Absolute error = maximum – minimum
• Relative error = Absolute error / maximum absolute value
[Plots: "Absolute Error for Random Vectors" – same magnitude, opposite signs; "Relative Error for Orthogonal Vectors" – sign not reproducible]
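The effect is easy to demonstrate without calling MKL at all: simply change the reduction order the way a threaded dot product would. The Python sketch below is only an emulation of that experiment; dot_with_chunks is a hypothetical stand-in for a threaded BLAS dot, and the vector size and thread counts mirror the setup described above. It computes the same absolute- and relative-error metrics defined on the slide.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(-1.0, 1.0, 10**6)
y = rng.uniform(-1.0, 1.0, 10**6)

def dot_with_chunks(x, y, nthreads):
    """Mimic a threaded dot product: each 'thread' reduces its own
    contiguous chunk, then the partial sums are combined. Changing
    nthreads changes the order of the floating-point additions."""
    partials = [np.dot(cx, cy)
                for cx, cy in zip(np.array_split(x, nthreads),
                                  np.array_split(y, nthreads))]
    return sum(partials)

# One result per simulated thread count, as in the MKL experiment.
results = [dot_with_chunks(x, y, t) for t in (1, 2, 3, 4)]
abs_err = max(results) - min(results)              # Absolute error
rel_err = abs_err / max(abs(r) for r in results)   # Relative error
print(results, abs_err, rel_err)
```

The four results typically differ in the last few bits, which is exactly the nondeterminism the slide's plots show.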

103

Goals/Approaches for Reproducibility
• Consider summation or a dot product
• Goals:
  1. Same answer, independent of layout, number of processors, order of summands
  2. Good performance (scales well)
  3. Portable (assume IEEE 754 only)
  4. User can choose accuracy
• Approaches:
  – Guarantee a fixed reduction tree (fails goal 2 or 3)
  – Use (very) high precision to get the exact answer (fails goal 2)
  – Prerounding technique (Nguyen, D.)
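As a rough illustration of the prerounding idea, here is a single-bin sketch in Python: every summand is first rounded to a common power-of-two grid chosen from n and max|x_i|, so all subsequent additions are exact and the result is independent of summation order. This is an assumption-laden simplification; prerounded_sum is an illustrative name (not the ReproBLAS API), and the real Nguyen/Demmel algorithm uses several such bins to also recover full accuracy (goal 4).

```python
import math
import random

def prerounded_sum(values):
    """Single extraction step of the prerounding idea (sketch only):
    round each summand to a shared power-of-two grid so that every
    partial sum is computed exactly, hence order-independently."""
    n = len(values)
    m = max(abs(v) for v in values)
    if m == 0.0:
        return 0.0
    # sigma = 2^k with sigma >= (n+2)*max|x_i|; the grid spacing is ~ulp(sigma).
    k = math.ceil(math.log2(m)) + math.ceil(math.log2(n + 2))
    sigma = 2.0 ** k
    total = 0.0
    for v in values:
        q = (v + sigma) - sigma   # v rounded to the common grid (exact by Sterbenz)
        total += q                # exact: partial sums are grid multiples below sigma
    return total

data = [random.uniform(-1.0, 1.0) for _ in range(10**5)]
s1 = prerounded_sum(data)
random.shuffle(data)
s2 = prerounded_sum(data)
assert s1 == s2   # bit-wise identical regardless of summation order
```

Because sigma depends only on n and max|x_i|, and the grid-aligned additions commit no rounding error, the result meets goal 1 using only IEEE 754 arithmetic (goal 3); the extra bins of the full algorithm are what buy back accuracy.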

104

Performance results on 1024 proc Cray XC30: 1.2x to 3.2x slowdown vs fastest code for n=1M

Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Diep Nguyen, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Dulceneia Becker, Ichitaro Yamazaki
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu

Summary

Don't Communic…

106

Time to redesign all linear algebra, n-body, … algorithms and software (and compilers)
