Efficient Sparse Matrix-Matrix Multiplication on Heterogeneous High Performance Systems
AACEC 2010 – Heraklion, Crete, Greece
Jakob Siegel¹, Oreste Villa², Sriram Krishnamoorthy², Antonino Tumeo² and Xiaoming Li¹
¹ University of Delaware
² Pacific Northwest National Laboratory
September 24th, 2010
Overview
- Introduction
- Cluster level
- Node level
- Results
- Conclusion
- Future Work
Sparse Matrix-Matrix Multiply – Challenges
The efficient implementation of sparse matrix-matrix multiplication on HPC systems poses several challenges:
- Large size of the input matrices, e.g. 10⁶×10⁶ with 30×10⁶ nonzero elements
- Compressed representation
- Partitioning
- Density of the output matrices
- Load balancing: large differences in density and computation times
Matrices taken from Timothy A. Davis, University of Florida Sparse Matrix Collection, available online at: http://www.cise.ufl.edu/davis/sparse.
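The compressed representation referred to above is typically a format such as CSR (compressed sparse row). As a minimal sketch, assuming CSR with 0-based indices (the struct and field names below are illustrative, not taken from the talk):

```cpp
#include <cstdint>
#include <vector>

// Compressed Sparse Row (CSR): an n x m matrix with nnz nonzeros is stored in
// three arrays instead of n*m entries, so a 10^6 x 10^6 matrix with 30*10^6
// nonzeros fits in a few hundred MB rather than terabytes.
struct CsrMatrix {
    int64_t rows = 0, cols = 0;
    std::vector<int64_t> row_ptr; // size rows+1; row i occupies [row_ptr[i], row_ptr[i+1])
    std::vector<int64_t> col_idx; // size nnz; column index of each nonzero
    std::vector<double>  val;     // size nnz; value of each nonzero
};
```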
Even inside a node where different types of compute elements are used, the load-balancing mechanism still performs well: the processes using the CUDA devices complete almost 5× more tasks than the pure CPU processes.
[Figure: two bar charts comparing static partitioning with the heterogeneous load balancer (LB-Het) in one of the nodes: "Tasks per Core in one of the nodes" (y-axis: number of tasks, 0–120) and "Time to complete all assigned tasks for each processor" (y-axis: time in sec, 0–25); the category labels recoverable from the residue are CPU0–CPU6 under Static and CUDA1, CPU1, CPU3 under LB-Het.]
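As a rough illustration of why dynamic assignment balances heterogeneous workers (a hypothetical sketch, not the authors' implementation; the talk's cluster-level scheme is not reproduced here), tasks can be claimed from a shared counter so that a faster CUDA worker simply pulls more of them:

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical node-level work queue: tasks (e.g., blocks of output rows) are
// claimed with an atomic fetch-and-add, so a GPU worker that finishes tasks
// faster naturally claims many more of them, with no static split needed.
struct TaskQueue {
    std::atomic<int64_t> next{0};
    int64_t num_tasks;
    explicit TaskQueue(int64_t n) : num_tasks(n) {}
    // Returns the next task id, or -1 once every task has been handed out.
    int64_t claim() {
        int64_t t = next.fetch_add(1, std::memory_order_relaxed);
        return t < num_tasks ? t : -1;
    }
};
```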
Sparse Matrix-Matrix Multiply
We presented a parallel framework, built with a co-design approach, that takes into account the characteristics of:
- The selected application (here SpGEMM)
- The underlying hardware (a heterogeneous cluster)

- The difficulties of static partitioning approaches show that a global load-balancing method is needed.
- Different optimized implementations of the Gustavson algorithm (sketched below) are presented and used depending on the available compute element.
- For the selected case study, optimal load balancing with uniform computation time across all processing elements is achieved.
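As referenced above, Gustavson's algorithm builds C = A·B one row at a time. Below is a minimal sequential sketch over the CsrMatrix layout from the earlier sketch; the dense-accumulator strategy is one common choice, and the talk's optimized CPU and CUDA variants are not reproduced here:

```cpp
#include <cstdint>
#include <vector>

// Sequential Gustavson SpGEMM: row i of C is the sum of the rows B[k,:] scaled
// by A[i,k], accumulated in a dense scratch array of size B.cols. Columns of C
// are emitted in first-touch order (unsorted), which a real kernel may sort.
CsrMatrix spgemm(const CsrMatrix& A, const CsrMatrix& B) {
    CsrMatrix C;
    C.rows = A.rows;
    C.cols = B.cols;
    C.row_ptr.assign(A.rows + 1, 0);
    std::vector<double> acc(B.cols, 0.0);    // dense accumulator for one row of C
    std::vector<int64_t> mark(B.cols, -1);   // mark[j] == i  <=>  column j seen in row i
    std::vector<int64_t> touched;            // columns written while forming row i
    for (int64_t i = 0; i < A.rows; ++i) {
        touched.clear();
        for (int64_t p = A.row_ptr[i]; p < A.row_ptr[i + 1]; ++p) {
            const int64_t k = A.col_idx[p];
            const double a = A.val[p];
            for (int64_t q = B.row_ptr[k]; q < B.row_ptr[k + 1]; ++q) {
                const int64_t j = B.col_idx[q];
                if (mark[j] != i) { mark[j] = i; touched.push_back(j); }
                acc[j] += a * B.val[q];
            }
        }
        for (int64_t j : touched) {
            C.col_idx.push_back(j);
            C.val.push_back(acc[j]);
            acc[j] = 0.0;                    // reset scratch for the next row
        }
        C.row_ptr[i + 1] = static_cast<int64_t>(C.col_idx.size());
    }
    return C;
}
```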
Future Work – General Tasking Framework for Heterogeneous GPU Clusters
- More general task definition
- More flexibility in input and output data definition
- Exploring the limits imposed on tasks by a heterogeneous system
- A feedback loop during execution that allows more efficient assignment of tasks
- Introducing heterogeneous execution on GPU and CPU in one process/core
- Locality-aware task queue(s) and work stealing
- Task reinsertion or generation at the node level