KokkosKernels: Compact Layouts for Batched BLAS and Sparse Matrix-Matrix Multiply
Siva Rajamanickam, Kyungjoo Kim, Andrew Bradley, Mehmet Deveci, Christian Trott, Si Hammond
Batched BLAS Workshop, 2017, Atlanta

Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.
• Dense linear algebra kernels (BLAS)
  • BLAS1, some BLAS2
  • Batched BLAS – Kyungjoo Kim
• Graph kernels
  • Graph coloring
• Other Utilities
  • HashMap
  • Uniform Memory Allocator
Motivation for Batched BLAS with Compact Layouts
KokkosKernels: Micro & Batched BLAS Design
Kyungjoo Kim
1 Problem
Figure 1: Left: A block sparse matrix is formed from the entire mesh. Right: A preconditioner is constructed by extracting 1D line elements, which forms a set of block tridiagonal matrices.
The objective of this work is to develop performance portable numeric kernels for small matrices on a team of threads or across vector units. These small dense kernels are used for block compressed row storage (BCRS) matrices and their line preconditioners. As depicted in Fig. 1, the global problem is given on a mesh, and each node is populated with multiple DOFs, which results in a BCRS matrix. To precondition this matrix, line elements are extracted from the problem. The line elements result in block tridiagonal matrices: each block tridiagonal matrix corresponds to a line, and the tridiagonal structure corresponds to the element connectivity within the line. There are multiple degrees of freedom associated with each node in the line, which results in a "small block" for each scalar entry of the tridiagonal matrix. The blocks are factored using LU factorization.
(Figure 2, left panel: the set of block tridiagonal matrices T_0, T_1, ..., T_{m×n−1}; at step r the 2×2 block window consists of A_r, B_r on the top row and C_r, A_{r+1} on the bottom row.)
Algorithm 1: Reference implementation of TriLU
1  for T in {T_0, T_1, ..., T_{m×n−1}} do in parallel
2    for r ← 0 to k−2 do
3      A_r := LU(A_r);
4      B_r := L^{−1} B_r;
5      C_r := C_r U^{−1};
6      A_{r+1} := A_{r+1} − C_r B_r;
7    end
8    A_{k−1} := LU(A_{k−1});
9  end
Figure 2: Left: Block/block tridiagonal matrix generated from sets of line elements. Right: Reference LU decomposition on a set of block tridiagonal matrices.
In this work, we consider block sparse matrix-vector multiplication using GEMV, together with GEMM, TRSM, and LU, which are required for the block tridiagonal LU decomposition illustrated in Fig. 2.
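To make the data flow of Algorithm 1 concrete, the following is a minimal, self-contained C++ sketch of the reference TriLU loop on a single block tridiagonal matrix. The block size BS, the Block type, and the helpers lu, lower_trsm, upper_trsm, and gemm_minus are illustrative stand-ins for the small dense kernels (LU, TRSM, GEMM) discussed in the text; they are not the KokkosKernels interface.

#include <array>
#include <vector>

constexpr int BS = 5;                          // block size (DOFs per node), illustrative
using Block = std::array<double, BS * BS>;     // row-major BS x BS block

inline double &at(Block &M, int i, int j) { return M[i * BS + j]; }

// A := {L\U} in place (unblocked LU, no pivoting)
void lu(Block &A) {
  for (int p = 0; p < BS; ++p)
    for (int i = p + 1; i < BS; ++i) {
      at(A, i, p) /= at(A, p, p);
      for (int j = p + 1; j < BS; ++j) at(A, i, j) -= at(A, i, p) * at(A, p, j);
    }
}

// B := L^{-1} B, with unit lower triangular L stored below the diagonal of A
void lower_trsm(Block &A, Block &B) {
  for (int j = 0; j < BS; ++j)
    for (int i = 1; i < BS; ++i)
      for (int p = 0; p < i; ++p) at(B, i, j) -= at(A, i, p) * at(B, p, j);
}

// C := C U^{-1}, with upper triangular U stored on and above the diagonal of A
void upper_trsm(Block &A, Block &C) {
  for (int i = 0; i < BS; ++i)
    for (int j = 0; j < BS; ++j) {
      for (int p = 0; p < j; ++p) at(C, i, j) -= at(C, i, p) * at(A, p, j);
      at(C, i, j) /= at(A, j, j);
    }
}

// Anext := Anext - C * B
void gemm_minus(Block &C, Block &B, Block &Anext) {
  for (int i = 0; i < BS; ++i)
    for (int j = 0; j < BS; ++j)
      for (int p = 0; p < BS; ++p) at(Anext, i, j) -= at(C, i, p) * at(B, p, j);
}

// Factor one block tridiagonal matrix T: diagonal blocks A[0..k-1],
// super-diagonal blocks B[0..k-2], sub-diagonal blocks C[0..k-2] (Algorithm 1).
void tri_lu(std::vector<Block> &A, std::vector<Block> &B, std::vector<Block> &C) {
  const int k = static_cast<int>(A.size());
  for (int r = 0; r <= k - 2; ++r) {
    lu(A[r]);
    lower_trsm(A[r], B[r]);
    upper_trsm(A[r], C[r]);
    gemm_minus(C[r], B[r], A[r + 1]);
  }
  lu(A[k - 1]);                                // factor the last diagonal block
}

The reference implementation of Algorithm 1 simply runs this factorization independently over the m×n tridiagonal matrices; this is the baseline that the compact batched variant below improves upon.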
(Figure 3, left panel: block A of T_0 and T_1 is packed so that corresponding entries share a vector lane, interleaved as T0_{00}, T1_{00}, T0_{01}, T1_{01}, ...)
Algorithm 2: Batched implementation of TriLU
1  for each pair T(0,1) in {{T_0, T_1}, {T_2, T_3}, ..., {T_{m×n−2}, T_{m×n−1}}} do in parallel
2    for r ← 0 to k−2 do
3      A_r(0,1) := LU(A_r(0,1));
4      B_r(0,1) := L^{−1} B_r(0,1);
5      C_r(0,1) := C_r(0,1) U^{−1};
6      A_{r+1}(0,1) := A_{r+1}(0,1) − C_r(0,1) B_r(0,1);
7    end
8    A_{k−1}(0,1) := LU(A_{k−1}(0,1));
9  end
Figure 3: Left: Hybrid packing of T_0 and T_1, provided the vector length is two. Right: Vectorized version of the LU decomposition on a set of block tridiagonal matrices.
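The packing in Figure 3 can be illustrated with a two-lane value type. The Vec2 type below is an illustrative stand-in for the SIMD vector type used by the compact layouts, not the KokkosKernels interface; because its arithmetic acts lane-wise, scalar TriLU code written against such a value type (e.g., a templated version of the sketch above) factors T_0 and T_1 simultaneously, which is what Algorithm 2 expresses.

struct Vec2 {                                  // lane 0 holds T0's entry, lane 1 holds T1's entry
  double v[2];
  Vec2 &operator-=(const Vec2 &o) { v[0] -= o.v[0]; v[1] -= o.v[1]; return *this; }
  Vec2 &operator/=(const Vec2 &o) { v[0] /= o.v[0]; v[1] /= o.v[1]; return *this; }
};
inline Vec2 operator*(const Vec2 &a, const Vec2 &b) { return Vec2{{a.v[0] * b.v[0], a.v[1] * b.v[1]}}; }

// Interleaved (hybrid) packing: entry (i, j) of the packed block stores
// {T0(i, j), T1(i, j)} in adjacent vector lanes, as sketched on the left of Figure 3.
inline Vec2 pack(double t0_ij, double t1_ij) { return Vec2{{t0_ij, t1_ij}}; }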
4.2 Team parallelization
Team-level parallelization must be introduced. Here, we briefly discuss a parallelization approach within the small matrix kernels. For the tridiagonal LU factorization, we design level-3 operations, i.e., LU, TRSM, and GEMM. In the TriSolve phase, we design level-2 operations, i.e., TRSV and GEMV.
TRSM There are a few possibilities for parallelizing the dense kernels according to the distribution of the matrices, e.g., elemental 1D and 2D cyclic distributions. The 1D algorithm is illustrated in Fig. 4. We use a columnwise cyclic distribution of the matrix B. If the shared memory space allows, matrix A and a panel of B are loaded into shared memory. The required memory space is the product of vectorlength, panelsize, and blocksize. A 2D version of this algorithm is shown in Fig. ??.
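As a rough illustration of the 1D scheme (not the actual KokkosKernels kernel), the sketch below distributes the columns of B over the threads of a Kokkos team; the exact column-to-thread mapping is left to TeamThreadRange, and the shared-memory staging of A and the panel of B is omitted.

#include <Kokkos_Core.hpp>

// B := L^{-1} B for one small block, with unit lower triangular L stored in A.
// Columns of B are independent, so each team thread solves its own subset of
// columns without synchronization.
template <typename MemberType, typename AView, typename BView>
KOKKOS_INLINE_FUNCTION void team_lower_trsm(const MemberType &member,
                                            const AView &A, const BView &B) {
  const int m = static_cast<int>(A.extent(0));
  const int n = static_cast<int>(B.extent(1));
  Kokkos::parallel_for(Kokkos::TeamThreadRange(member, n), [&](const int j) {
    for (int i = 1; i < m; ++i)
      for (int p = 0; p < i; ++p)
        B(i, j) -= A(i, p) * B(p, j);          // forward substitution down column j
  });
}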
GEMM Parallel GEMM can be implemented in three different ways according to the communication pattern. Since we deal with the case m = n = k, the choice of algorithm does not matter much. Fig. ?? illustrates the algorithm using matrix-panel multiplication. This algorithm uses 1D parallelization, and the size of a team is restricted by blocksize. On the other hand, the 2D distribution depicted in Fig. ?? increases concurrency up to blocksize × blocksize × vectorlength. When we want to use more threads in this operation, multiple panels are loaded together, and the maximum concurrency increases up to blocksize × blocksize × blocksize × vectorlength with atomic updates.
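A comparable sketch of the 2D distribution for the block update C := C − A·B is given below: team threads cover the rows of C and vector lanes cover its columns, giving roughly blocksize × blocksize-way concurrency per block. Again, this is an illustration written against generic Kokkos views, not the library's kernel.

// C := C - A * B on one small block with a 2D work distribution
// (team threads over rows, vector lanes over columns).
template <typename MemberType, typename AView, typename BView, typename CView>
KOKKOS_INLINE_FUNCTION void team_gemm_minus(const MemberType &member, const AView &A,
                                            const BView &B, const CView &C) {
  const int m = static_cast<int>(C.extent(0));
  const int n = static_cast<int>(C.extent(1));
  const int k = static_cast<int>(A.extent(1));
  Kokkos::parallel_for(Kokkos::TeamThreadRange(member, m), [&](const int i) {
    Kokkos::parallel_for(Kokkos::ThreadVectorRange(member, n), [&](const int j) {
      typename CView::non_const_value_type sum(0);
      for (int p = 0; p < k; ++p) sum += A(i, p) * B(p, j);
      C(i, j) -= sum;
    });
  });
}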
LU The right-looking version of the LU algorithm is illustrated in Figs. 8 and 9.
TRSV Fig. 10 explains the 1D TRSV algorithm.
GEMV Figs. 11 and 12 show 1D and 2D versions of the GEMV operation. Note that these two kernels can be fused in the TriSolve phase.
5 Micro BLAS
The concept of micro BLAS is to solve each block by exploiting vector units. Unlike the batched version, this approach does not require repacking user data. However, the disadvantages of this approach are
• the vector length of modern computing architectures is relatively large compared to our "small" problem size;
• numeric algorithms depend on the problem layout (LayoutLeft, LayoutRight) in order to obtain coalesced access.
Add more merits of micro BLAS. At the end, this version should be required. Fill detailed algorithms.
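The interface is illustrated by the following Kokkos::parallel_for snippets: each iteration of the batch loop extracts per-matrix subviews and applies a serial kernel to them. The execution-space template parameter and the per-matrix kernel call are not spelled out here, so the commented call sites below are placeholders rather than the exact KokkosKernels API.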
// One batch entry per iteration: extract the k-th matrices and apply a serial per-matrix kernel.
Kokkos::parallel_for(Kokkos::RangePolicy<>(0, N), KOKKOS_LAMBDA(const int k) {
  auto aa = Kokkos::subview(a, k, Kokkos::ALL(), Kokkos::ALL());
  auto bb = Kokkos::subview(b, k, Kokkos::ALL(), Kokkos::ALL());
  auto cc = Kokkos::subview(c, k, Kokkos::ALL(), Kokkos::ALL());
  // ... serial kernel on (aa, bb, cc), e.g., a per-matrix GEMM
});
// Compact-layout variant: with vector-valued scalars, the range covers N/VectorLength packed
// entries instead of N matrices; the body is otherwise unchanged.
Kokkos::parallel_for(Kokkos::RangePolicy<>(0, N /* or N/VectorLength */), KOKKOS_LAMBDA(const int k) {
  auto aa = Kokkos::subview(a, k, Kokkos::ALL(), Kokkos::ALL());
  auto bb = Kokkos::subview(b, k, Kokkos::ALL(), Kokkos::ALL());
  auto cc = Kokkos::subview(c, k, Kokkos::ALL(), Kokkos::ALL());
  // ... same serial kernel; the vectorization is carried by the value type
});
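In both variants the functor body is identical: switching from the plain batched execution to the compact vectorized one only changes the value type stored in the views and the extent of the batch loop, which is the intent behind the compact layouts.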
Table IV: The matrices and multiplications used throughout this paper. The (#rows, #cols, #nnz) of the input matrices and the #multiplications performed are given in the first four columns. The right side lists the execution time in seconds and GFLOPS of KKMEM on K80, and its speedup w.r.t. other SPGEMM methods. Blank spaces indicate that the method failed. The matrices are sorted based on the success of the algorithms, then by the #multiplications. The matrices 2cubes sphere, cage12, webbase, offshore, filter3D, cant, hood, pwtk, and ldoor are included as they are used repeatedly in the literature [6], [7]. We run these matrices only on GPUs and omit them from the KNL experiments, as their run times are negligible.
Table V: Comparison against the performance numbers of AmgX on K80 GPUs. The last column shows the difference in running time between the two approaches. We use default parameters for KKMEM and compare against the best numbers of AmgX provided to us.
The performance achieved by the portable kernel is better than the best method by 17%, 4%, and 54% on KNL-DDR4, P100, and K80, respectively. The performance is within 1% of the best method on KNL-MCDRAM and within 20% on Haswell. Moreover, KKMEM on P100 performs best among all methods on all architectures. KKMEM obtains the best performance on 13, 7, 5, 15, and 14 multiplications on KNL-DDR, KNL-MCDRAM, Haswell, Pascal, and K80, respectively.
V. CONCLUSION
We described a performance-portable, thread-scalable SPGEMM kernel for highly threaded architectures. We conclude by answering the primary question we started with: "How much performance will be sacrificed for portability?"
Table VI: Execution time in seconds and GFLOPS of KKMEM, and its speedup w.r.t. other SPGEMM methods, on P100 GPUs.
Matrix | KKMEM time (s) | KKMEM GFLOPS | speedups over the other SPGEMM methods (missing entries: method failed)
2cubes sphere  0.02  3.631  4.54  1.20  1.06  3.62
cage12  0.03  2.396  3.13  0.75  1.22  2.74
webbase  0.27  0.521  0.66  0.54  5.18  2.30
offshore  0.03  4.304  5.25  1.33  1.21  7.08
filter3D  0.03  4.918  5.78  0.83  1.47  4.30
hugebubbles20 0  0.10  3.804  4.99  4.81  1.94  12.14
Europe  0.18  2.669  3.41  5.57  2.57  2.50
cant  0.04  12.001  12.83  1.05  1.42  0.77
hood  0.08  13.944  14.22  0.97  1.77  1.72
pwtk  0.07  17.717  17.88  1.13  2.06  1.53
Empire R AP  0.04  4.734  0.89  0.65  0.88
Empire RA P  0.08  2.316  1.03  0.41  0.68
Laplace R A  0.39  2.041  0.68  0.73  2.71
Laplace A P  0.15  5.398  2.57  1.00  11.65
Laplace R AP  0.19  5.466  2.36  1.24  5.24
Laplace RA P  0.47  2.203  1.67  0.65  3.32
Brick R A  0.64  2.381  1.16  1.82  4.91
Empire R A  0.43  5.934  1.09  1.06  1.11
Empire A P  0.30  8.463  3.60  1.05  1.48
Brick RA P  0.61  6.326  1.26  0.43  1.14
ldoor  0.32  14.910  1.09  1.88  1.76
delaunay n24  0.41  3.086  1.74  1.12
Brick R AP  0.24  6.349  0.76  1.91
channel  0.43  7.054  1.51  3.10
Brick AP  0.49  7.954  0.95  4.54
cage15  1.56  2.660  4.86
Bump  0.88  13.126  1.58
audi  1.31  12.345  1.54
dielFilterV3real  1.80  9.679  1.85
Geomean:  5.25  1.36  1.22  2.43
We conclude that we do not sacrifice much in terms of performance on highly threaded architectures. This is demonstrated by the experiments comparing our portable method against 5 native methods on GPUs and 2 native methods on KNLs. Our SPGEMM kernel is also the