Avoiding Communication in Sparse Matrix-Vector Multiply (SpMV)
• Sequential and shared-memory performance is dominated by off-chip communication
• Distributed-memory performance is dominated by network communication
The problem: SpMV has low arithmetic intensity
SpMV cost model:
• floating point operations: 2⋅nnz (overcounts flops by up to n, e.g. diagonal A)
• floating point words moved: nnz + 2⋅n
• Example: dimension n = 5, number of nonzeros nnz = 3n−2 (tridiagonal A)
• Assumption: A is invertible ⇒ nonzero in every row ⇒ nnz ≥ n
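The counting model above can be sketched in a few lines of pure Python; the function name `spmv_model` is ours, not the slides', and it uses the slides' tridiagonal example (n = 5, nnz = 3n−2 = 13):

```python
# Sketch of the slide's SpMV cost model: 2*nnz flops, nnz + 2n words
# moved (read A once, read x, write y). Tridiagonal example: n = 5.

def spmv_model(n, nnz):
    flops = 2 * nnz          # one multiply + one add per stored nonzero
    words = nnz + 2 * n      # A read once, plus the x and y vectors
    return flops, words, flops / words

n = 5
nnz = 3 * n - 2              # tridiagonal A
flops, words, intensity = spmv_model(n, nnz)
print(flops, words, round(intensity, 2))   # 26 23 1.13
```

Note that the intensity 2⋅nnz/(nnz + 2n) is always below 2, no matter how large nnz grows relative to n.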
SpMV Arithmetic Intensity (1)
• Arithmetic intensity := total flops / total DRAM bytes
• Upper bound: compulsory traffic
– further diminished by conflict or capacity misses
[Figure: arithmetic intensity spectrum, increasing from O(1) to O(lg n) to O(n), i.e., more flops per byte: SpMV and BLAS 1,2 at O(1); stencils (PDEs) and lattice methods next; FFTs at O(lg n); dense linear algebra (BLAS3) and particle methods at O(n).]
SpMV: flops = 2⋅nnz, words moved = nnz + 2⋅n ⇒ arithmetic intensity = 2⋅nnz / (nnz + 2⋅n) ≤ 2
SpMV Arithmetic Intensity (2)
[Roofline plot for the Opteron 2356 (Barcelona): attainable Gflop/s vs. actual flop:byte ratio, bounded by stream bandwidth on the left and the peak double-precision floating-point rate on the right.]
In practice, A requires at least nnz words:
• indexing data, zero padding
• depends on nonzero structure, e.g., banded or dense blocks
• depends on data structure, e.g., CSR/CSC, COO, SKY, DIA, JDS, ELL, DCSR/DCSC, …, and their blocked generalizations
• depends on optimizations, e.g., index compression or variable block splitting
• 2 flops per word of data
• 8 bytes per double
• flop:byte ratio ≤ 1/4
• Can't beat 1/16 of peak!
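A minimal CSR SpMV sketch (pure Python, not the slides' code) makes the indexing overhead concrete: besides the nnz values, the kernel must also read the `row_ptr` and `col_idx` arrays, which is why A costs more than nnz words in practice:

```python
# CSR SpMV sketch: y = A*x. Note the indexing data (row_ptr, col_idx)
# read alongside vals -- extra traffic beyond the nnz matrix values.

def csr_spmv(row_ptr, col_idx, vals, x):
    n = len(row_ptr) - 1
    y = [0.0] * n
    for i in range(n):
        s = 0.0
        for jj in range(row_ptr[i], row_ptr[i + 1]):
            s += vals[jj] * x[col_idx[jj]]   # 2 flops per stored nonzero
        y[i] = s
    return y

# 3x3 example: A = [[2,1,0],[0,3,0],[0,4,5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 1, 1, 1, 2]
vals    = [2.0, 1.0, 3.0, 4.0, 5.0]
print(csr_spmv(row_ptr, col_idx, vals, [1.0, 1.0, 1.0]))  # [3.0, 3.0, 9.0]
```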
How to do more flops per byte?
Reuse data (x, y, A) across multiple SpMVs
SpMV Arithmetic Intensity (3)
What if we can amortize the cost of reading A over k SpMVs? (k-fold reuse of A)
Def. Krylov space (given A, x, s): K_s(A, x) := span{x, Ax, A²x, …, Aˢx}
Combining multiple SpMVs:
(1) k independent SpMVs
• used in: block Krylov methods; Krylov methods for multiple systems (AX = B)
(2) k dependent SpMVs
• used in: s-step Krylov methods; communication-avoiding Krylov methods, to compute k Krylov basis vectors
(3) k dependent SpMVs, in-place variant
• used in: multigrid smoothers, power method
• related to the Streaming Matrix Powers optimization for CA-Krylov methods
SpMM optimization:
• Compute row-by-row
• Stream A only once

                                  1 SpMV     k independent SpMVs   k independent SpMVs (using SpMM)
flops                             2⋅nnz      2k⋅nnz                2k⋅nnz
words moved                       nnz + 2n   k⋅nnz + 2kn           nnz + 2kn
arith. intensity, nnz = ω(n)      2          2                     2k

(1) k independent SpMVs (SpMM)
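A pure-Python sketch of the SpMM idea (our own toy code, not the slides'): each nonzero of A is loaded once and reused across all k source vectors, which is exactly the k-fold reuse the table counts:

```python
# SpMM sketch: apply A (in CSR) to k independent vectors at once.
# A is streamed row by row exactly once, so its nnz words are read
# once instead of k times.

def csr_spmm(row_ptr, col_idx, vals, X):
    # X: n rows, each a list of length k (the k source vectors, interleaved)
    n, k = len(row_ptr) - 1, len(X[0])
    Y = [[0.0] * k for _ in range(n)]
    for i in range(n):                       # stream A one row at a time
        for jj in range(row_ptr[i], row_ptr[i + 1]):
            a, j = vals[jj], col_idx[jj]
            for c in range(k):               # reuse nonzero a across all k vectors
                Y[i][c] += a * X[j][c]
    return Y

# 3x3 example A = [[2,1,0],[0,3,0],[0,4,5]], k = 2 source vectors
Y = csr_spmm([0, 2, 3, 5], [0, 1, 1, 1, 2],
             [2.0, 1.0, 3.0, 4.0, 5.0],
             [[1.0, 2.0], [1.0, 0.0], [1.0, 1.0]])
print(Y)   # [[3.0, 4.0], [3.0, 0.0], [9.0, 5.0]]
```

Interleaving the k vectors row-wise (array-of-rows) also keeps the inner reuse loop over contiguous entries.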
Akx optimization (computing x, Ax, …, Aᵏx):
• Must satisfy data dependencies while keeping the working set in cache

Naïve algorithm (no reuse):

                                  1 SpMV     k dependent SpMVs     k dependent SpMVs (using Akx)
flops                             2⋅nnz      2k⋅nnz                2k⋅nnz
words moved                       nnz + 2n   k⋅nnz + 2kn           nnz + (k+1)n
arith. intensity, nnz = ω(n)      2          2                     2k

(2) k dependent SpMVs (Akx)
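For reference, here is the naïve version as a pure-Python sketch (our toy code): k successive SpMVs that re-read A at every level. The Akx optimization instead partitions A into cache blocks and computes all k levels block by block, so A is read only once; that blocking is omitted here.

```python
# Naive Akx sketch: compute the Krylov basis x, Ax, ..., A^k x with
# k dependent SpMVs. A (in CSR) is re-read at every level.

def spmv(row_ptr, col_idx, vals, x):
    n = len(row_ptr) - 1
    y = [0.0] * n
    for i in range(n):
        for jj in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += vals[jj] * x[col_idx[jj]]
    return y

def naive_akx(row_ptr, col_idx, vals, x, k):
    V = [list(x)]
    for _ in range(k):                 # each level depends on the previous one
        V.append(spmv(row_ptr, col_idx, vals, V[-1]))
    return V

# A = 4x4 shift matrix (ones on the subdiagonal): A^j x shifts x down j slots
V = naive_akx([0, 0, 1, 2, 3], [0, 1, 2], [1.0, 1.0, 1.0],
              [1.0, 0.0, 0.0, 0.0], 3)
print(V[3])   # [0.0, 0.0, 0.0, 1.0]
```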
Akx algorithm (reuse nonzeros of A):

                                  1 SpMV     k dependent SpMVs, in-place   Akx, last-vector-only
flops                             2⋅nnz      2k⋅nnz                        2k⋅nnz
words moved                       nnz + 2n   k⋅nnz + 2kn                   nnz + 2n
arith. intensity, nnz = anything  2          2                             2k

Last-vector-only Akx optimization:
• Reuses matrix and vector k times, instead of once
• Overwrites intermediates without memory traffic
• Attains O(k) reuse, even when nnz < n
– e.g., A is a stencil (implicit values and structure)

(3) k dependent SpMVs, in-place (Akx, last-vector-only)
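When only Aᵏx is wanted and A is a stencil, the matrix moves zero words (its values and structure are implicit in the code) and each intermediate vector is discarded as soon as the next is formed. A pure-Python sketch with a hypothetical 3-point averaging stencil:

```python
# Last-vector-only sketch: apply a 3-point averaging stencil k times,
# keeping only the current vector. The "matrix" is implicit (no nnz
# words moved for A), and intermediates are overwritten, so vector
# traffic stays around 2n words regardless of k.

def stencil_power(x, k):
    n = len(x)
    for _ in range(k):
        y = [0.0] * n
        for i in range(n):
            left  = x[i - 1] if i > 0 else 0.0
            right = x[i + 1] if i < n - 1 else 0.0
            y[i] = (left + x[i] + right) / 3.0
        x = y                  # discard the intermediate vector
    return x

print(stencil_power([0.0, 3.0, 0.0], 1))   # [1.0, 1.0, 1.0]
```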
Combining multiple SpMVs (summary of sequential results)

• SpMV: 2⋅nnz flops, nnz + 2n words moved; no optimization applies.
• k independent SpMVs: 2k⋅nnz flops, k⋅nnz + 2kn words moved; with SpMM, nnz + 2kn words.
– relative bandwidth savings (n, nnz ⟶ ∞): k if nnz = ω(n); ≤ min(c, k) if nnz = c⋅n; 1 if nnz = o(n)
• k dependent SpMVs: 2k⋅nnz flops, k⋅nnz + 2kn words moved; with Akx, nnz + (k+1)n words.
– relative bandwidth savings: k if nnz = ω(n); ≤ min(c, k) if nnz = c⋅n; 2 if nnz = o(n)
• k dependent SpMVs, in-place: 2k⋅nnz flops, k⋅nnz + 2kn words moved; with last-vector-only Akx, nnz + 2n words.
– relative bandwidth savings: k in all three regimes
Avoiding Serial Communication
• Reduce compulsory misses by reusing data:
– more efficient use of memory
– decreased bandwidth cost (Akx, asymptotic)
• Must also consider latency cost
– How many cachelines?
– depends on contiguous accesses
• When k = 16 ⇒ compute-bound?
– Fully utilize memory system
– Avoid additional memory traffic like capacity and conflict misses
– Fully utilize in-core parallelism
– (Note: still assumes no indexing data)
• In practice, complex performance tradeoffs
– Autotune to find best k
[Roofline plot for the Opteron 2356 (Barcelona), repeated: attainable Gflop/s vs. actual flop:byte ratio, bounded by stream bandwidth and the peak double-precision rate, with a "?" marking where SpMV lands once k-fold reuse raises its intensity.]
On being memory bound
• Assume that off-chip communication (cache to memory) is the bottleneck
– e.g., that we express sufficient ILP to hide hits in L3
• When your multicore performance is bound by memory operations, is it because of latency or bandwidth?
– Latency-bound: expressed concurrency times the memory access rate does not fully utilize the memory bandwidth
• Traversing a linked list, pointer-chasing benchmarks
– Bandwidth-bound: expressed concurrency times the memory access rate exceeds the memory bandwidth
• SpMV, stream benchmarks
– Either way, manifests as pipeline stalls on loads/stores (suboptimal throughput)
• Caches can improve memory bottlenecks – exploit them whenever possible
– Avoid memory traffic when you have temporal or spatial locality
– Increase memory traffic when cache line entries are unused (no locality)
• Prefetchers can allow you to express more concurrency
– Hide memory traffic when your access pattern has sequential locality (clustered or regularly strided access patterns)
Distributed-memory parallel SpMV
• Harder to make general statements about performance:
– Many ways to partition x, y, and A to P processors
– Communication, computation, and load balance are partition-dependent
– What fits in cache? (What is "cache"?!)
• A parallel SpMV involves 1 or 2 rounds of messages
– (Sparse) collective communication, costly synchronization
– Latency-bound (hard to saturate network bandwidth)
– Scatter entries of x and/or gather entries of y across the network
– k SpMVs cost O(k) rounds of messages
• Can we do k SpMVs in one round of messages?
– k independent vectors? SpMM generalizes:
• Distribute all source vectors in one round of messages
• Avoid further synchronization
– k dependent vectors? Akx generalizes:
• Distribute source vector plus additional ghost zone entries in one round of messages
• Avoid further synchronization
– Last-vector-only Akx ≈ standard Akx in parallel
• No savings from discarding intermediates
Distributed-memory parallel Akx
Example: tridiagonal matrix, k = 3, n = 40, p = 4
• Naïve algorithm: k messages per neighbor
• Akx optimization: 1 message per neighbor
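The ghost-zone bookkeeping behind the one-message version can be sketched in a few lines (pure Python, hypothetical block-row partition): for a tridiagonal matrix (bandwidth 1), a processor owning rows [lo, hi) needs source-vector entries [lo − k, hi + k) up front to compute all k levels locally.

```python
# Ghost-zone sketch for distributed Akx on a tridiagonal matrix:
# to take k dependent SpMV steps without further communication, the
# owner of rows [lo, hi) fetches entries [lo - k, hi + k) of the
# source vector in a single round of messages.

def ghost_range(lo, hi, k, n):
    return max(0, lo - k), min(n, hi + k)

n, p, k = 40, 4, 3                      # the slide's example
for rank in range(p):
    lo, hi = rank * n // p, (rank + 1) * n // p
    print(rank, ghost_range(lo, hi, k, n))
# rank 0 needs [0, 13), rank 1 needs [7, 23), ...
```

For a general sparse A the ghost zone is the set of columns reachable within k hops in the graph of A, which is why partition quality matters so much.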
Polynomial Basis for Akx
• Today we considered the special case of the monomials: x, Ax, A²x, …, Aᵏx
• Stability problems – the monomial basis tends to lose linear independence
– Converges to the principal eigenvector
• Generalization: given A, x, k > 0, compute p₀(A)x, p₁(A)x, …, pₖ(A)x, where pⱼ(A) is a degree-j polynomial in A
– Choose p for stability.
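A tiny numerical demo of the instability (our own toy example, A = diag(1, 0.5)): the normalized monomial basis vectors all collapse toward the dominant eigenvector, so successive vectors become nearly parallel and the basis loses linear independence.

```python
# Why the monomial basis x, Ax, A^2 x, ... is numerically fragile:
# normalized powers collapse toward the dominant eigenvector.
# Toy A = diag(1.0, 0.5), dominant eigenvector e1.
import math

def normalize(v):
    s = math.sqrt(sum(vi * vi for vi in v))
    return [vi / s for vi in v]

x = [1.0, 1.0]
vecs = []
for _ in range(6):
    vecs.append(normalize(x))
    x = [1.0 * x[0], 0.5 * x[1]]      # one "SpMV" with A = diag(1, 0.5)

# cosine between the last two basis vectors approaches 1 (parallel)
cos = sum(a * b for a, b in zip(vecs[-2], vecs[-1]))
print(cos > 0.999)   # True
```

Better-conditioned choices (the pⱼ above) keep the basis vectors well separated.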
Akx: further variants and optimizations
• Partitioning and scheduling:
– Hypergraph partitioning
– Dynamic load balancing
– Overlapped communication and computation
• Algorithmic variants:
– Compositions of distributed-memory parallel, shared-memory parallel, sequential algorithms
– Streaming or explicitly buffered workspace
– Explicit or implicit cache blocks
– Avoiding redundant computation/storage/traffic
– Last-vector-only optimization
– Remove low-rank components (blocking covers)
– Different polynomial bases pⱼ(A)
• Other:
– Preprocessing optimizations
– Extended precision arithmetic
– Scalable data structures (sparse representations)
– Dynamic value and/or pattern updates
Krylov subspace methods (1)
Want to solve Ax = b (still assume A is invertible)
• How accurately can you hope to compute x?
– Depends on the condition number of A and the accuracy of your inputs A and b
– condition number with respect to matrix inversion:
• cond(A) – how much A distorts the unit sphere (in some norm)
• 1/cond(A) – how close A is to a singular matrix
• expect to lose log10(cond(A)) decimal digits relative to (relative) input accuracy
• Idea: make successive approximations, terminate when accuracy is sufficient
– How good is an approximation x0 to x?
– Error: e0 = x0 − x
• If you knew e0, you could compute x = x0 − e0 (and you'd be done); but finding e0 is as hard as finding x, so assume you never have e0
– Residual: r0 = b − Ax0
• r0 = 0 ⇔ e0 = 0, but otherwise they need not be small simultaneously; only when cond(A) is small does a small r0 imply a small e0
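A small ill-conditioned example makes the residual/error gap concrete (our own toy numbers): with A = diag(1, 10⁻⁸), cond(A) = 10⁸, a candidate solution can have a residual of 10⁻⁸ while its error is still O(1).

```python
# Residual vs. error: for ill-conditioned A, a tiny residual
# r0 = b - A*x0 does not imply a tiny error e0 = x0 - x.
# Here A = diag(1, 1e-8), so cond(A) = 1e8.

A_diag = [1.0, 1e-8]
x_true = [1.0, 1.0]
b = [a * xi for a, xi in zip(A_diag, x_true)]

x0 = [1.0, 0.0]                              # candidate solution
r0 = [bi - a * xi for bi, a, xi in zip(b, A_diag, x0)]
e0 = [xi - xt for xi, xt in zip(x0, x_true)]

print(max(abs(v) for v in r0))   # 1e-08 (residual looks converged)
print(max(abs(v) for v in e0))   # 1.0   (error is still O(1))
```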
Krylov subspace methods (2)
1. Given approximation xold, refine by adding a correction: xnew = xold + v
• Pick v as the 'best possible choice' from a search space V
• Krylov subspace methods: V := Kₘ(A, r0) = span{r0, Ar0, …, Aᵐ⁻¹r0}
2. Expand V by one dimension
3. xold = xnew. Repeat.
• Once dim(V) = dim(A) = n, xnew should be exact

Why Krylov subspaces?
• Cheap to compute (via SpMV)
• Search spaces V coincide with the residual spaces, which makes it cheaper to avoid repeating search directions
• K(A, z) = K(c₁A − c₂I, c₃z): invariant under scaling and translation
⇒ without loss of generality, assume |λ(A)| ≤ 1
• As s increases, Kₛ gets closer to the dominant eigenvectors of A
– Intuitively, corrections v should target 'largest-magnitude' residual components
Convergence of Krylov methods
• Convergence = the process by which the residual goes to zero
– If A isn't too poorly conditioned, the error should then be small.
• Convergence is governed only by the angles θₘ between the spaces Kₘ and AKₘ
– How fast does sin(θₘ) go to zero?
– Not eigenvalues! You can construct a unitary system that results in the same sequence of residuals r0, r1, …
– If A is normal, λ(A) provides bounds on convergence.
• Preconditioning
– Transforming A with hopes of 'improving' λ(A) or cond(A)
Conjugate Gradient (CG) Method
Given a starting approximation x0 to Ax = b, let p0 := r0 := b − Ax0.
Each iteration:
• Update the residual according to the new candidate solution
• Expand the search space
Communication-bound:
• 1 SpMV operation per iteration
• 2 dot products per iteration
Plan:
1. Reformulate to use Akx
2. Do something about the dot products
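For reference, a plain CG sketch in pure Python (dense A as a list of rows; a toy stand-in, not the lecture's code), annotated with the per-iteration communication the slide counts:

```python
# Plain CG sketch: per iteration, one SpMV and two dot products,
# each of which is a communication event (memory traffic and, in
# parallel, a global reduction).
import math

def dot(u, v): return sum(a * b for a, b in zip(u, v))
def matvec(A, v): return [dot(row, v) for row in A]

def cg(A, b, x0, tol=1e-12, maxit=100):
    x = list(x0)
    r = [bi - yi for bi, yi in zip(b, matvec(A, x))]
    p = list(r)
    rr = dot(r, r)                     # dot product #1
    for _ in range(maxit):
        if math.sqrt(rr) < tol:
            break
        Ap = matvec(A, p)              # the one SpMV per iteration
        alpha = rr / dot(p, Ap)        # dot product #2
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * ai for ri, ai in zip(r, Ap)]
        rr_new = dot(r, r)
        beta = rr_new / rr
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x

# SPD test system: A = [[4,1],[1,3]], b = [1,2]; exact x = [1/11, 7/11]
x = cg([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0], [0.0, 0.0])
```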
Applying Akx to CG (1)
1. Ignore x, α, and β for now.
2. Unroll the CG loop s times (in your head).
3. Observe that the iterates rₘ₊ⱼ and pₘ₊ⱼ lie in the Krylov spaces generated from rₘ and pₘ, i.e., two Akx calls.
4. This means we can represent rₘ₊ⱼ and pₘ₊ⱼ symbolically as linear combinations of the Krylov basis vectors.
5. And perform SpMV operations symbolically (the same holds for Rⱼ₋₁): the SpMV becomes a shift of coordinates, replacing vectors of length n by coefficient vectors of length 2j+1.

Applying Akx to CG (2)
6. Now substitute coefficient vectors for the vector iterates (e.g., for r).
7. Let's also compute the (2j+1)-by-(2j+1) Gram matrices; now we can perform all dot products symbolically.
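A toy numerical check of the Gram-matrix trick (our own small basis V, not the lecture's): if r = Va and p = Vb for coefficient vectors a and b, then dot(r, p) = aᵀ(VᵀV)b, so once G = VᵀV is formed, all length-n dot products reduce to small-matrix arithmetic.

```python
# Gram-matrix trick: dot(V a, V b) == a^T (V^T V) b, so long dot
# products can be done on short coefficient vectors.

def dot(u, v): return sum(x * y for x, y in zip(u, v))

# a small basis of 3 vectors of length 5 (rows of this list)
V = [[1.0, 0.0, 2.0, 1.0, 0.0],
     [0.0, 1.0, 1.0, 0.0, 2.0],
     [1.0, 1.0, 0.0, 1.0, 1.0]]

G = [[dot(vi, vj) for vj in V] for vi in V]    # 3x3 Gram matrix

a, b = [1.0, 2.0, 0.0], [0.0, 1.0, 1.0]
r = [sum(ai * V[i][j] for i, ai in enumerate(a)) for j in range(5)]
p = [sum(bi * V[i][j] for i, bi in enumerate(b)) for j in range(5)]

long_dot  = dot(r, p)                           # length-n work
short_dot = sum(a[i] * G[i][j] * b[j]
                for i in range(3) for j in range(3))
print(long_dot, short_dot)   # 22.0 22.0
```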
Blocking CG dot products

The CG loop "For m = 0, 1, …, Do" becomes:

For m = 0, s, 2s, …, until convergence, Do {
  For j = 0 to s − 1, Do {
    … (s inner steps, no communication)
  } End For
} End For

Given approximation x0 to Ax = b:
• Expand the Krylov basis, using the SpMM and Akx optimizations
• Represent the SpMV operation as a change of basis (here, a shift)
• Represent the 2s+1 inner products of length n with a (2s+1)-by-(2s+1) Gram matrix
• Represent vector iterates of length n with vectors of length 2s+1 and 2s+2
• Recover vector iterates
• Take s steps of CG without communication
(In the slide, steps are color-coded by where communication occurs: sequential only, or sequential and parallel.)

CA-CG
Kernel costs:
• s dependent SpMVs
– Computation: 2s⋅nnz flops (1 source vector)
– Communication (sequential): read s vectors of length n; write s vectors of length n; read A s times; bandwidth cost ≈ s⋅nnz + 2sn
– Communication (parallel): distribute 1 source vector s times
• Akx
– Computation: 4s⋅nnz flops (2 source vectors)
– Communication (sequential): read 2 vectors of length n; write 2s−1 vectors of length n; read A once (both Akx and SpMM optimizations); bandwidth cost ≈ nnz + (2s+1)n
– Communication (parallel): distribute 2 source vectors once
– Communication volume and number of messages