Optimizing the Performance of Sparse Matrix-Vector Multiplication

Post on 24-Jan-2016


Transcript

6/13/00 U.C.Berkeley 1

Optimizing the Performance of Sparse Matrix-Vector Multiplication

Eun-Jin Im, U.C. Berkeley


Overview
- Motivation
- Optimization techniques: Register Blocking, Cache Blocking, Multiple Vectors
- Sparsity system
- Related Work
- Contribution
- Conclusion


Motivation : Usage
Sparse matrix-vector multiplication: y = A*x.
Uses of this operation:
- Iterative solvers
- Explicit methods
- Eigenvalue and singular value problems
Applications in structural modeling, fluid dynamics, document retrieval (Latent Semantic Indexing), and many other simulation areas.


Motivation : Performance (1)
Matrix-vector multiplication (BLAS2) is slower than matrix-matrix multiplication (BLAS3). For example, on a 167 MHz UltraSPARC I:
- Vendor-optimized matrix-vector multiplication: 57 Mflops
- Vendor-optimized matrix-matrix multiplication: 185 Mflops
The reason: a lower ratio of floating point operations to memory operations.


Motivation : Performance (2)
Sparse matrix operations are slower than dense matrix operations. For example, on a 167 MHz UltraSPARC I:
- Dense matrix-vector multiplication: naïve implementation 38 Mflops, vendor-optimized implementation 57 Mflops
- Sparse matrix-vector multiplication (naïve implementation): 5.7 - 25 Mflops
The reason: an indirect data structure, and thus inefficient memory accesses.


Motivation : Optimized libraries
Old approach: hand-optimized libraries (vendor-supplied BLAS, LAPACK).
New approach: automatic generation of libraries - PHiPAC (dense linear algebra), ATLAS (dense linear algebra), FFTW (fast Fourier transform).
Our approach: automatic generation of libraries for sparse matrices. The additional dimension: the nonzero structure of sparse matrices.


Sparse Matrix Formats
There are a large number of sparse matrix formats.
- Point-entry: Coordinate (COO), Compressed Sparse Row (CSR), Compressed Sparse Column (CSC), Sparse Diagonal (DIA), …
- Block-entry: Block Coordinate (BCO), Block Sparse Row (BSR), Block Sparse Column (BSC), Block Diagonal (BDI), Variable Block Compressed Sparse Row (VBR), …


Compressed Sparse Row Format
We internally use the CSR format because it is a relatively efficient format.


Optimization Techniques: Register Blocking, Cache Blocking, Multiple Vectors


Register Blocking
Blocked Compressed Sparse Row (BCSR) format.
Advantages of the format:
- Better temporal locality in registers
- The multiplication loop can be unrolled for better performance

[Figure: an example matrix stored in 2x2 BCSR. Block row pointers: 0 2 4; block column indices: 0 4 2 4; block values, one 2x2 block at a time: (A00 A01 A10 A11), (A04 0 0 A15), (A22 0 A32 A33), (A25 0 A34 A35). Explicit zeros are stored to fill out each block.]


Register Blocking : Fill Overhead
We use a uniform block size, which adds fill overhead; in the example, fill overhead = 12/7 = 1.71.
This increases both the storage and the number of floating point operations.


Register Blocking
Dense matrix profile on an UltraSPARC I (input to the performance model).


Register Blocking : Selecting the block size
The hard part of the problem is picking the block size so that:
- It minimizes the fill overhead
- It maximizes the raw performance
Two approaches: exhaustive search, or using a model.


Register Blocking: Performance model
Two components to the performance model:
- Multiplication performance of a dense matrix represented in sparse format
- Estimated fill overhead
Predicted performance for block size r x c:

    predicted performance = (dense r x c blocked performance) / (fill overhead)


Benchmark matrices
- Matrix 1: dense matrix (1000 x 1000)
- Matrices 2-17: Finite Element Method matrices
- Matrices 18-39: matrices from structural engineering and device simulation
- Matrices 40-44: linear programming matrices
- Matrix 45: document retrieval matrix used for Latent Semantic Indexing
- Matrix 46: random matrix (10000 x 10000, 0.15% nonzero)


Register Blocking : Performance

The optimization is most effective on the FEM matrices and the dense matrix (the lower-numbered matrices).


Register Blocking : Performance

Speedup is generally best on the MIPS R10000, where the result is competitive with dense BLAS performance (DGEMV/DGEMM = 0.38).


Register Blocking : Validation of Performance Model

Comparison to the performance of exhaustive search (yellow bars, block sizes in lower row) on a subset of the benchmark matrices

The exhaustive search does not produce a much better result.


Register Blocking : Overhead
Pre-computation overhead:
- Estimating the fill overhead (red bars)
- Reorganizing the matrix (yellow bars)
The ratio gives the number of repetitions for which the optimization is beneficial.


Cache Blocking
Improves temporal locality of accesses to the source vector.

[Figure: the source vector x, the destination vector y, and the matrix layout in memory; each cache block of the matrix reuses a contiguous slice of x.]


Cache Blocking : Performance

The speedup on MIPS is generally better: it has a larger cache and a larger miss penalty (26/589 ns for MIPS, 36/268 ns for the UltraSPARC). The exceptions are the document retrieval and random matrices.


Cache Blocking : Performance on document retrieval matrix

The document retrieval matrix is 10K x 256K with 37M nonzeros; SVD is applied to it for LSI (Latent Semantic Indexing).
The nonzero elements are spread across the matrix, with no dense clusters.
Performance peaks at a 16K x 16K cache block, with a speedup of 3.1.


Cache Blocking : When and how to use cache blocking
From the experiments, the matrices for which cache blocking is most effective are large and "random".
We developed a measure of the "randomness" of a matrix, and we perform a coarse-grained search to decide the cache block size.


Combination of Register and Cache blocking : UltraSPARC
The combination is rarely beneficial, and is often slower than either of the two optimizations alone.


Combination of Register and Cache blocking : MIPS


Multiple Vector Multiplication
A better chance of optimization: BLAS2 vs. BLAS3. Rather than repeating the single-vector multiplication once per vector, the multiple-vector case processes all vectors in a single pass over the matrix.


Multiple Vector Multiplication : Performance (register blocking and cache blocking)


Multiple Vector Multiplication : Register Blocking Performance

The speedup is larger than for single-vector register blocking. Even the matrices that did not speed up in the single-vector case improved (the middle group on the UltraSPARC).


Multiple Vector Multiplication : Cache Blocking Performance

There is a noticeable speedup for the matrices that did not speed up in the single-vector case (UltraSPARC). The best block sizes are much smaller than those for single-vector cache blocking.
(Results shown for both the UltraSPARC and the MIPS.)


Sparsity System : Purpose
- Guide the choice of optimization
- Automatically select optimization parameters such as block size and number of vectors

http://comix.cs.berkeley.edu/~ejim/sparsity


Sparsity System : Organization

[Diagram: the Sparsity machine profiler produces a machine performance profile. The Sparsity optimizer takes that profile, an example matrix, and the maximum number of vectors, and produces optimized code and drivers.]


Summary : Speedup of Sparsity on UltraSPARC
On the UltraSPARC: up to 3x for a single vector, 4.7x for multiple vectors.
(Single-vector and multiple-vector results shown.)


Summary : Speedup of Sparsity on MIPS
On MIPS: up to 3x for a single vector, 6x for multiple vectors.
(Single-vector and multiple-vector results shown.)


Summary : Overhead of Sparsity Optimization

The number of iterations needed to amortize the optimization:

    number of iterations = (overhead time) / (time saved per iteration)

The BLAS Technical Forum includes a parameter in the matrix creation routine to indicate how many times the operation will be performed.


Related Work (1)
- Dense matrix optimization: loop transformations by compilers (M. Wolf, etc.); hand-optimized libraries (BLAS, LAPACK)
- Automatic generation of libraries: PHiPAC, ATLAS, and FFTW
- Sparse matrix standardization and libraries: BLAS Technical Forum; NIST Sparse BLAS, MV++, SparseLib++, TNT
- Hand optimization of sparse matrix-vector multiplication: S. Toledo, Oliker et al.


Related Work (2)
- Sparse matrix packages: SPARSKIT, PSPARSELIB, Aztec, BlockSolve95, Spark98
- Compiling sparse matrix code: Sparse compiler (Bik), Bernoulli compiler (Kotlyar)
- On-demand code generation: NIST Sparse BLAS, Sparse compiler


Contribution
- A thorough investigation of memory hierarchy optimization for sparse matrix-vector multiplication
- A performance study on benchmark matrices
- Development of a performance model to choose optimization parameters
- The Sparsity system for automatic tuning and code generation of sparse matrix-vector multiplication


Conclusion
Memory hierarchy optimization for sparse matrix-vector multiplication:
- Register blocking: matrices with dense local structure benefit.
- Cache blocking: large matrices with random structure benefit.
- Multiple-vector multiplication improves performance further because matrix elements are reused.
The choice of optimization depends on both the matrix structure and the machine architecture. The automated system helps with this complicated and time-consuming process.
