A benchmark for sparse matrix-vector multiplication
● Hormozd Gahvari and Mark Hoemmen {hormozd|mhoemmen}@eecs
● http://mhoemmen.arete.cc/Report/
● Research made possible by: NSF, Argonne National Lab, a gift from Intel, National Energy Research Scientific Computing Center, and Tyler Berry [email protected]
● Account for miss rates, latencies, and bandwidths
● Sparsity: uses bounds as a heuristic to predict the best block dimensions for a machine
● Upper and lower bounds are not tight, so they are difficult to use for performance prediction
● Sparsity's goal is optimization, not performance prediction
Our SMVM benchmark
● Do SMVM with a BSR matrix: randomly scattered blocks
– BSR format: typically less structured matrices anyway
● “Best” block size, 1x1
– Characterize different matrix types
– Take advantage of potential optimizations (unlike current benchmarks), but in a general way
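The BSR (block compressed sparse row) multiply the benchmark times can be sketched as below. This is a minimal illustration, not the benchmark's actual source; the struct fields and names are assumptions.

```c
/* Sketch of sparse matrix-vector multiply y += A*x in BSR (block
 * compressed sparse row) format with r x c register blocks.
 * Field names are illustrative, not from the benchmark itself. */
typedef struct {
    int mb;            /* number of block rows */
    int r, c;          /* block dimensions (r x c) */
    const int *rowptr; /* mb+1 entries: start of each block row */
    const int *colind; /* block-column index of each stored block */
    const double *val; /* blocks stored contiguously, row-major */
} bsr_t;

void bsr_spmv(const bsr_t *A, const double *x, double *y)
{
    for (int I = 0; I < A->mb; ++I) {            /* each block row */
        for (int k = A->rowptr[I]; k < A->rowptr[I + 1]; ++k) {
            const double *blk = A->val + (long)k * A->r * A->c;
            const double *xs  = x + (long)A->colind[k] * A->c;
            double *ys        = y + (long)I * A->r;
            for (int i = 0; i < A->r; ++i)       /* dense r x c block */
                for (int j = 0; j < A->c; ++j)
                    ys[i] += blk[i * A->c + j] * xs[j];
        }
    }
}
```

Register blocking pays off because the r x c inner loops reuse the r destination values and c source values while streaming the block's entries once.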
Dense matrix in sparse format
● Test this with the optimal block size:
– To show that fill doesn't affect performance much
– Fill affects locality of accesses to the source vector
Data set sizing
● Size vectors to fit in the largest cache, and the matrix out of cache
– Tests “streaming in” of matrix values
– Natural scaling to machine parameters!
● “Inspiration”: SPECfp92 (data small enough that manufacturers could size caches to fit it all) vs. SPECfp95 (data sizes increased)
– Fill is now machine-dependent:
● Tests show fill (locality of source vector accesses) has little effect
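The sizing rule above can be sketched as a small calculation. The exact constants here (two resident vectors, a 4x cache oversizing factor, a double plus an int index per stored entry) are illustrative assumptions, not the benchmark's published parameters.

```c
#include <stddef.h>

/* Hypothetical data-set sizing following the slide: the source and
 * destination vectors together fit in the largest cache, while the
 * matrix is made large enough that it must stream from memory. */

size_t vector_length(size_t cache_bytes)
{
    /* two double vectors (x and y) must both fit in cache */
    return cache_bytes / (2 * sizeof(double));
}

size_t min_nonzeros(size_t cache_bytes)
{
    /* assume each stored entry costs a double plus an int index;
     * oversize the matrix 4x so it cannot stay cache-resident */
    size_t entry_bytes = sizeof(double) + sizeof(int);
    return 4 * cache_bytes / entry_bytes;
}
```

This is what makes the benchmark scale “naturally” to machine parameters: the working set is derived from the cache size rather than fixed, avoiding the SPECfp92 problem of vendors sizing caches to swallow the whole data set.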
Results: “Best” block size
● Highest Mflop/s value over the block sizes tested, for:
– Sparse matrix (fill chosen as above)
– Dense matrix in sparse format (4096 x 4096)
● Compare with Mflop/s for the STREAM Triad (a[i] = b[i] + s * c[i])
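For reference, the STREAM Triad kernel named on the slide is just the loop below; the real STREAM benchmark also times repeated runs and reports sustained bandwidth, which this sketch omits.

```c
/* STREAM Triad: a[i] = b[i] + s * c[i].  Reads two arrays, writes
 * one, doing two flops per iteration -- a pure bandwidth test. */
void stream_triad(double *a, const double *b, const double *c,
                  double s, int n)
{
    for (int i = 0; i < n; ++i)
        a[i] = b[i] + s * c[i];
}
```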
Ranking processors according to the benchmarks:
● For optimized (best block size) SMVM:
– Peak memory bandwidth is a good predictor of the Itanium 2 / P4 / PM relationship
– STREAM mispredicts these
● STREAM:
– Better predicts unoptimized (1 x 1) SMVM
– Peak bandwidth is no longer helpful
Our benchmark: a useful performance indicator
● Comparison with results for “real-life” matrices:
– Works well for FEM matrices
– Not always as well for non-FEM matrices
– Wasted space in the block data structure is directly proportional to the slowdown
Comparison of Benchmark with Real Matrices
● The following two graphs show the MFLOP rate of matrices generated by our benchmark vs. matrices from the BeBOP group and a dense matrix in sparse format
● Plots compare by block size; the matrix “number” is given in parentheses. Matrices 2-9 are FEM matrices.
● A comprehensive list of the BeBOP test suite matrices can be found in Vuduc et al., “Performance Optimizations and Bounds for Sparse Matrix-Vector Multiply,” 2002.
MFLOP Rate of Benchmark vs. Real Matrices: Sun Ultra 3
[Figure: MFLOP/s (0-70) for Real, Benchmark, and Dense series, grouped by block size from 1 x 1 through 6 x 6, with matrix numbers in parentheses]
MFLOP Rate of Benchmark vs. Real Matrices by Block Size: Intel Itanium 2
[Figure: MFLOP/s (0-1400) for Real, Benchmark, and Dense series, at block sizes 1 x 1 (44), 2 x 1 (41), 2 x 1 (42), 6 x 1 (3), 6 x 1 (8), 6 x 1 (9), and 6 x 1 (21)]
Comparison Conclusions
● Our benchmark does a good job of modeling real data
● The dense matrix in sparse format looks good on the Ultra 3, but is noticeably inferior to our benchmark for large block sizes on the Itanium 2
Evaluating SIMD instructions
● SMVM benchmark: a tool to evaluate architectural features
– e.g., desktop SIMD floating point
● SSE-2 ISA:
– Pentium 4, Pentium M; AMD Opteron
– Parallel operations on 2 double-precision floats
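The “parallel ops on 2 doubles” can be illustrated with SSE-2 intrinsics; a minimal sketch, assuming an x86 compiler with SSE-2 support, applied to the Triad kernel (unaligned loads are used so no alignment guarantee is needed):

```c
#include <emmintrin.h>  /* SSE-2 intrinsics */

/* Triad a[i] = b[i] + s * c[i], two doubles per instruction.
 * Assumes n is even; a cleanup loop would handle odd n. */
void triad_sse2(double *a, const double *b, const double *c,
                double s, int n)
{
    __m128d vs = _mm_set1_pd(s);            /* broadcast s into both lanes */
    for (int i = 0; i < n; i += 2) {
        __m128d vb = _mm_loadu_pd(b + i);   /* load 2 doubles */
        __m128d vc = _mm_loadu_pd(c + i);
        _mm_storeu_pd(a + i, _mm_add_pd(vb, _mm_mul_pd(vs, vc)));
    }
}
```

Whether SIMD helps SMVM is exactly what the benchmark is meant to probe: a memory-bound 1 x 1 kernel gains little, while larger register blocks expose pairs of contiguous operands these instructions can exploit.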