Efficient Sparse Matrix-Matrix Multiplication on Heterogeneous High Performance Systems
AACEC 2010 – Heraklion, Crete, Greece
Jakob Siegel¹, Oreste Villa², Sriram Krishnamoorthy², Antonino Tumeo² and Xiaoming Li¹
¹ University of Delaware
² Pacific Northwest National Laboratory
September 24th, 2010
Overview
- Introduction
- Cluster level
- Node level
- Results
- Conclusion
- Future Work
Sparse Matrix-Matrix Multiply – Challenges
The efficient implementation of sparse matrix-matrix multiplication on HPC systems poses several challenges:
- Large size of the input matrices, e.g. 10⁶×10⁶ with 30×10⁶ nonzero elements
- Compressed representation
- Partitioning
- Density of the output matrices
- Load balancing: large differences in density and computation times
Matrices taken from Timothy A. Davis, University of Florida Sparse Matrix Collection, available online at: http://www.cise.ufl.edu/davis/sparse.
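The compressed representation referred to above is typically a format such as CSR (compressed sparse row). As a minimal sketch, assuming CSR with 0-based indices (the struct and field names below are illustrative, not taken from the talk):

```cpp
#include <cstdint>
#include <vector>

// Compressed Sparse Row (CSR): an n x m matrix with nnz nonzeros is stored in
// three arrays instead of n*m entries, so a 10^6 x 10^6 matrix with 30*10^6
// nonzeros fits in a few hundred MB rather than terabytes.
struct CsrMatrix {
    int64_t rows = 0, cols = 0;
    std::vector<int64_t> row_ptr; // size rows+1; row i occupies [row_ptr[i], row_ptr[i+1])
    std::vector<int64_t> col_idx; // size nnz; column index of each nonzero
    std::vector<double>  val;     // size nnz; value of each nonzero
};
```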
Even inside a node where different types of compute elements are used, the load-balancing mechanism still performs well: the processes using the CUDA devices complete almost 5× more tasks than the pure CPU processes.
[Figure: two bar charts comparing static partitioning with the heterogeneous load balancer (LB-Het) in one of the nodes: "Tasks per Core in one of the nodes" (y-axis: number of tasks, 0–120) and "Time to complete all assigned tasks for each processor" (y-axis: time in sec, 0–25); the category labels recoverable from the residue are CPU0–CPU6 under Static and CUDA1, CPU1, CPU3 under LB-Het.]
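As a rough illustration of why dynamic assignment balances heterogeneous workers (a hypothetical sketch, not the authors' implementation; the talk's cluster-level scheme is not reproduced here), tasks can be claimed from a shared counter so that a faster CUDA worker simply pulls more of them:

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical node-level work queue: tasks (e.g., blocks of output rows) are
// claimed with an atomic fetch-and-add, so a GPU worker that finishes tasks
// faster naturally claims many more of them, with no static split needed.
struct TaskQueue {
    std::atomic<int64_t> next{0};
    int64_t num_tasks;
    explicit TaskQueue(int64_t n) : num_tasks(n) {}
    // Returns the next task id, or -1 once every task has been handed out.
    int64_t claim() {
        int64_t t = next.fetch_add(1, std::memory_order_relaxed);
        return t < num_tasks ? t : -1;
    }
};
```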
Sparse Matrix-Matrix Multiply
We presented a parallel framework, built with a co-design approach, that takes into account the characteristics of:
- The selected application (here SpGEMM)
- The underlying hardware (a heterogeneous cluster)

- The difficulties of static partitioning approaches show that a global load-balancing method is needed.
- Different optimized implementations of the Gustavson algorithm (sketched below) are presented and used depending on the available compute element.
- For the selected case study, optimal load balancing with uniform computation time across all processing elements is achieved.
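As referenced above, Gustavson's algorithm builds C = A·B one row at a time. Below is a minimal sequential sketch over the CsrMatrix layout from the earlier sketch; the dense-accumulator strategy is one common choice, and the talk's optimized CPU and CUDA variants are not reproduced here:

```cpp
#include <cstdint>
#include <vector>

// Sequential Gustavson SpGEMM: row i of C is the sum of the rows B[k,:] scaled
// by A[i,k], accumulated in a dense scratch array of size B.cols. Columns of C
// are emitted in first-touch order (unsorted), which a real kernel may sort.
CsrMatrix spgemm(const CsrMatrix& A, const CsrMatrix& B) {
    CsrMatrix C;
    C.rows = A.rows;
    C.cols = B.cols;
    C.row_ptr.assign(A.rows + 1, 0);
    std::vector<double> acc(B.cols, 0.0);    // dense accumulator for one row of C
    std::vector<int64_t> mark(B.cols, -1);   // mark[j] == i  <=>  column j seen in row i
    std::vector<int64_t> touched;            // columns written while forming row i
    for (int64_t i = 0; i < A.rows; ++i) {
        touched.clear();
        for (int64_t p = A.row_ptr[i]; p < A.row_ptr[i + 1]; ++p) {
            const int64_t k = A.col_idx[p];
            const double a = A.val[p];
            for (int64_t q = B.row_ptr[k]; q < B.row_ptr[k + 1]; ++q) {
                const int64_t j = B.col_idx[q];
                if (mark[j] != i) { mark[j] = i; touched.push_back(j); }
                acc[j] += a * B.val[q];
            }
        }
        for (int64_t j : touched) {
            C.col_idx.push_back(j);
            C.val.push_back(acc[j]);
            acc[j] = 0.0;                    // reset scratch for the next row
        }
        C.row_ptr[i + 1] = static_cast<int64_t>(C.col_idx.size());
    }
    return C;
}
```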
Future Work – General Tasking Framework for Heterogeneous GPU Clusters
- More general task definition
- More flexibility in input and output data definition
- Exploring the limits imposed on tasks by a heterogeneous system
- A feedback loop during execution that allows more efficient assignment of tasks
- Introducing heterogeneous execution on GPU and CPU in one process/core
- Locality-aware task queue(s) and work stealing
- Task reinsertion or generation at the node level