KokkosKernels: Compact Layouts for Batched BLAS and Sparse Matrix-Matrix Multiply
Siva Rajamanickam, Kyungjoo Kim, Andrew Bradley, Mehmet Deveci, Christian Trott, Si Hammond
Batched BLAS Workshop, 2017, Atlanta

Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.
• Dense linear algebra kernels (BLAS)
  • BLAS1, some BLAS2
  • Batched BLAS – Kyungjoo Kim
• Graph kernels
  • Graph coloring
• Other Utilities
  • HashMap
  • Uniform Memory Allocator
Motivation for Batched BLAS with Compact Layouts
KokkosKernels: Micro & Batched BLAS Design
Kyungjoo Kim
1 Problem
Figure 1: Left: A block sparse matrix is formed from the entire mesh. Right: A preconditioner is constructed by extracting 1D line elements, which forms a set of block tridiagonal matrices.
The objective of this work is to develop performance portable numeric kernels for small matrices on a team of threads or across vector units. These small dense kernels are used for block compressed row storage (BCRS) matrices and their line preconditioners. As depicted in Fig. 1, the global problem is given on a mesh, and each node is populated with multiple DOFs, which results in a BCRS matrix. To precondition this matrix, line elements are extracted from the problem. The line elements result in block tridiagonal matrices: each block tridiagonal matrix corresponds to a line, and the tridiagonal structure corresponds to the element connectivity within the line. There are multiple degrees of freedom associated with each node in the line, which results in a "small block" for each scalar entry of the tridiagonal matrix. The blocks are factored using LU factorization.
(Figure 2, left panel: the set of block tridiagonal matrices T_0, T_1, ..., T_{m×n−1}; at step r the 2×2 block window consists of A_r, B_r on the top row and C_r, A_{r+1} on the bottom row.)
Algorithm 1: Reference implementation of TriLU
1  for T in {T_0, T_1, ..., T_{m×n−1}} do in parallel
2    for r ← 0 to k−2 do
3      A_r := LU(A_r);
4      B_r := L^{−1} B_r;
5      C_r := C_r U^{−1};
6      A_{r+1} := A_{r+1} − C_r B_r;
7    end
8    A_{k−1} := LU(A_{k−1});
9  end
Figure 2: Left: Block/block tridiagonal matrix generated from sets of line elements. Right: Reference LU decomposition on a set of block tridiagonal matrices.
In this work, we consider block sparse matrix-vector multiplication using GEMV, together with GEMM, TRSM, and LU, which are required for the block tridiagonal LU decomposition illustrated in Fig. 2.
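To make the data flow of Algorithm 1 concrete, the following is a minimal, self-contained C++ sketch of the reference TriLU loop on a single block tridiagonal matrix. The block size BS, the Block type, and the helpers lu, lower_trsm, upper_trsm, and gemm_minus are illustrative stand-ins for the small dense kernels (LU, TRSM, GEMM) discussed in the text; they are not the KokkosKernels interface.

#include <array>
#include <vector>

constexpr int BS = 5;                          // block size (DOFs per node), illustrative
using Block = std::array<double, BS * BS>;     // row-major BS x BS block

inline double &at(Block &M, int i, int j) { return M[i * BS + j]; }

// A := {L\U} in place (unblocked LU, no pivoting)
void lu(Block &A) {
  for (int p = 0; p < BS; ++p)
    for (int i = p + 1; i < BS; ++i) {
      at(A, i, p) /= at(A, p, p);
      for (int j = p + 1; j < BS; ++j) at(A, i, j) -= at(A, i, p) * at(A, p, j);
    }
}

// B := L^{-1} B, with unit lower triangular L stored below the diagonal of A
void lower_trsm(Block &A, Block &B) {
  for (int j = 0; j < BS; ++j)
    for (int i = 1; i < BS; ++i)
      for (int p = 0; p < i; ++p) at(B, i, j) -= at(A, i, p) * at(B, p, j);
}

// C := C U^{-1}, with upper triangular U stored on and above the diagonal of A
void upper_trsm(Block &A, Block &C) {
  for (int i = 0; i < BS; ++i)
    for (int j = 0; j < BS; ++j) {
      for (int p = 0; p < j; ++p) at(C, i, j) -= at(C, i, p) * at(A, p, j);
      at(C, i, j) /= at(A, j, j);
    }
}

// Anext := Anext - C * B
void gemm_minus(Block &C, Block &B, Block &Anext) {
  for (int i = 0; i < BS; ++i)
    for (int j = 0; j < BS; ++j)
      for (int p = 0; p < BS; ++p) at(Anext, i, j) -= at(C, i, p) * at(B, p, j);
}

// Factor one block tridiagonal matrix T: diagonal blocks A[0..k-1],
// super-diagonal blocks B[0..k-2], sub-diagonal blocks C[0..k-2] (Algorithm 1).
void tri_lu(std::vector<Block> &A, std::vector<Block> &B, std::vector<Block> &C) {
  const int k = static_cast<int>(A.size());
  for (int r = 0; r <= k - 2; ++r) {
    lu(A[r]);
    lower_trsm(A[r], B[r]);
    upper_trsm(A[r], C[r]);
    gemm_minus(C[r], B[r], A[r + 1]);
  }
  lu(A[k - 1]);                                // factor the last diagonal block
}

The reference implementation of Algorithm 1 simply runs this factorization independently over the m×n tridiagonal matrices; this is the baseline that the compact batched variant below improves upon.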
(Figure 3, left panel: block A of T_0 and T_1 is packed so that corresponding entries share a vector lane, interleaved as T0_{00}, T1_{00}, T0_{01}, T1_{01}, ...)
Algorithm 2: Batched implementation of TriLU
1  for each pair T(0,1) in {{T_0, T_1}, {T_2, T_3}, ..., {T_{m×n−2}, T_{m×n−1}}} do in parallel
2    for r ← 0 to k−2 do
3      A_r(0,1) := LU(A_r(0,1));
4      B_r(0,1) := L^{−1} B_r(0,1);
5      C_r(0,1) := C_r(0,1) U^{−1};
6      A_{r+1}(0,1) := A_{r+1}(0,1) − C_r(0,1) B_r(0,1);
7    end
8    A_{k−1}(0,1) := LU(A_{k−1}(0,1));
9  end
Figure 3: Left: Hybrid packing of T_0 and T_1, provided the vector length is two. Right: Vectorized version of the LU decomposition on a set of block tridiagonal matrices.
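The packing in Figure 3 can be illustrated with a two-lane value type. The Vec2 type below is an illustrative stand-in for the SIMD vector type used by the compact layouts, not the KokkosKernels interface; because its arithmetic acts lane-wise, scalar TriLU code written against such a value type (e.g., a templated version of the sketch above) factors T_0 and T_1 simultaneously, which is what Algorithm 2 expresses.

struct Vec2 {                                  // lane 0 holds T0's entry, lane 1 holds T1's entry
  double v[2];
  Vec2 &operator-=(const Vec2 &o) { v[0] -= o.v[0]; v[1] -= o.v[1]; return *this; }
  Vec2 &operator/=(const Vec2 &o) { v[0] /= o.v[0]; v[1] /= o.v[1]; return *this; }
};
inline Vec2 operator*(const Vec2 &a, const Vec2 &b) { return Vec2{{a.v[0] * b.v[0], a.v[1] * b.v[1]}}; }

// Interleaved (hybrid) packing: entry (i, j) of the packed block stores
// {T0(i, j), T1(i, j)} in adjacent vector lanes, as sketched on the left of Figure 3.
inline Vec2 pack(double t0_ij, double t1_ij) { return Vec2{{t0_ij, t1_ij}}; }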
4.2 Team parallelization
Team-level parallelization must be introduced. Here, we briefly discuss a parallelization approach within the small matrix kernels. For the tridiagonal LU factorization, we design level-3 operations, i.e., LU, TRSM, and GEMM. In the TriSolve phase, we design level-2 operations, i.e., TRSV and GEMV.
TRSM There are a few possibilities for parallelizing the dense kernels according to the distribution of the matrices, e.g., elemental 1D and 2D cyclic distributions. The 1D algorithm is illustrated in Fig. 4. We use a columnwise cyclic distribution of the matrix B. If the shared memory space allows, matrix A and a panel of B are loaded into shared memory. The required memory space is the product of vectorlength, panelsize, and blocksize. A 2D version of this algorithm is shown in Fig. ??.
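As a rough illustration of the 1D scheme (not the actual KokkosKernels kernel), the sketch below distributes the columns of B over the threads of a Kokkos team; the exact column-to-thread mapping is left to TeamThreadRange, and the shared-memory staging of A and the panel of B is omitted.

#include <Kokkos_Core.hpp>

// B := L^{-1} B for one small block, with unit lower triangular L stored in A.
// Columns of B are independent, so each team thread solves its own subset of
// columns without synchronization.
template <typename MemberType, typename AView, typename BView>
KOKKOS_INLINE_FUNCTION void team_lower_trsm(const MemberType &member,
                                            const AView &A, const BView &B) {
  const int m = static_cast<int>(A.extent(0));
  const int n = static_cast<int>(B.extent(1));
  Kokkos::parallel_for(Kokkos::TeamThreadRange(member, n), [&](const int j) {
    for (int i = 1; i < m; ++i)
      for (int p = 0; p < i; ++p)
        B(i, j) -= A(i, p) * B(p, j);          // forward substitution down column j
  });
}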
GEMM Parallel GEMM can be implemented in three different ways according to the communication pattern. Since we deal with the case m = n = k, the choice of algorithm does not matter much. Fig. ?? illustrates the algorithm using matrix-panel multiplication. This algorithm uses 1D parallelization, and the size of a team is restricted by blocksize. On the other hand, the 2D distribution depicted in Fig. ?? increases concurrency up to blocksize × blocksize × vectorlength. When we want to use more threads in this operation, multiple panels are loaded together, and the maximum concurrency increases up to blocksize × blocksize × blocksize × vectorlength with atomic updates.
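A comparable sketch of the 2D distribution for the block update C := C − A·B is given below: team threads cover the rows of C and vector lanes cover its columns, giving roughly blocksize × blocksize-way concurrency per block. Again, this is an illustration written against generic Kokkos views, not the library's kernel.

// C := C - A * B on one small block with a 2D work distribution
// (team threads over rows, vector lanes over columns).
template <typename MemberType, typename AView, typename BView, typename CView>
KOKKOS_INLINE_FUNCTION void team_gemm_minus(const MemberType &member, const AView &A,
                                            const BView &B, const CView &C) {
  const int m = static_cast<int>(C.extent(0));
  const int n = static_cast<int>(C.extent(1));
  const int k = static_cast<int>(A.extent(1));
  Kokkos::parallel_for(Kokkos::TeamThreadRange(member, m), [&](const int i) {
    Kokkos::parallel_for(Kokkos::ThreadVectorRange(member, n), [&](const int j) {
      typename CView::non_const_value_type sum(0);
      for (int p = 0; p < k; ++p) sum += A(i, p) * B(p, j);
      C(i, j) -= sum;
    });
  });
}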
LU The right-looking version of the LU algorithm is illustrated in Figs. 8 and 9.
TRSV Fig. 10 explains the 1D TRSV algorithm.
GEMV Figs. 11 and 12 show 1D and 2D versions of the GEMV operation. Note that these two kernels can be fused in the TriSolve phase.
5 Micro BLAS
The concept of micro BLAS is to solve each block by exploiting vector units. Unlike the batched version, this approach does not require repacking user data. However, the disadvantages of this approach are
• the vector length of modern computing architectures is relatively large compared to our "small" problem size;
• numeric algorithms depend on the problem layout (LayoutLeft, LayoutRight) in order to obtain coalesced access.
Add more merits of micro BLAS. At the end, this version should be required. Fill detailed algorithms.
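The interface is illustrated by the following Kokkos::parallel_for snippets: each iteration of the batch loop extracts per-matrix subviews and applies a serial kernel to them. The execution-space template parameter and the per-matrix kernel call are not spelled out here, so the commented call sites below are placeholders rather than the exact KokkosKernels API.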
// One batch entry per iteration: extract the k-th matrices and apply a serial per-matrix kernel.
Kokkos::parallel_for(Kokkos::RangePolicy<>(0, N), KOKKOS_LAMBDA(const int k) {
  auto aa = Kokkos::subview(a, k, Kokkos::ALL(), Kokkos::ALL());
  auto bb = Kokkos::subview(b, k, Kokkos::ALL(), Kokkos::ALL());
  auto cc = Kokkos::subview(c, k, Kokkos::ALL(), Kokkos::ALL());
  // ... serial kernel on (aa, bb, cc), e.g., a per-matrix GEMM
});
// Compact-layout variant: with vector-valued scalars, the range covers N/VectorLength packed
// entries instead of N matrices; the body is otherwise unchanged.
Kokkos::parallel_for(Kokkos::RangePolicy<>(0, N /* or N/VectorLength */), KOKKOS_LAMBDA(const int k) {
  auto aa = Kokkos::subview(a, k, Kokkos::ALL(), Kokkos::ALL());
  auto bb = Kokkos::subview(b, k, Kokkos::ALL(), Kokkos::ALL());
  auto cc = Kokkos::subview(c, k, Kokkos::ALL(), Kokkos::ALL());
  // ... same serial kernel; the vectorization is carried by the value type
});
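In both variants the functor body is identical: switching from the plain batched execution to the compact vectorized one only changes the value type stored in the views and the extent of the batch loop, which is the intent behind the compact layouts.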
Table IV: The matrices and multiplications used throughout this paper. The (#rows, #cols, #nnz) of the input matrices and the #multiplications performed are given in the first four columns. The right side lists the execution time in seconds and GFLOPS of KKMEM on K80, and its speedup w.r.t. other SPGEMM methods. Blank spaces indicate that the method failed. The matrices are sorted based on the success of the algorithms, then by the #multiplications. The matrices 2cubes sphere, cage12, webbase, offshore, filter3D, cant, hood, pwtk, and ldoor are included as they are used repeatedly in the literature [6], [7]. We run these matrices only on GPUs and omit them from the KNL experiments, as their run times are negligible.
Table V: Comparison against the performance numbers of AmgX on K80 GPUs. The last column shows the difference in running time between the two approaches. We use default parameters for KKMEM and compare against the best numbers of AmgX provided to us.
The performance achieved by the portable kernel is better than the best method by 17%, 4%, and 54% on KNL-DDR4, P100, and K80, respectively. The performance is within 1% of the best method on KNL-MCDRAM and within 20% on Haswell. Moreover, KKMEM on P100 performs best among all methods on all architectures. KKMEM obtains the best performance on 13, 7, 5, 15, and 14 multiplications on KNL-DDR, KNL-MCDRAM, Haswell, Pascal, and K80, respectively.
V. CONCLUSION
We described a performance-portable, thread-scalable SPGEMM kernel for highly threaded architectures. We conclude by answering the primary question we started with: "How much performance will be sacrificed for portability?"
Table VI: Execution time in seconds and GFLOPS of KKMEM, and its speedup w.r.t. other SPGEMM methods, on P100 GPUs.
Matrix | KKMEM time (s) | KKMEM GFLOPS | speedups over the other SPGEMM methods (missing entries: method failed)
2cubes sphere  0.02  3.631  4.54  1.20  1.06  3.62
cage12  0.03  2.396  3.13  0.75  1.22  2.74
webbase  0.27  0.521  0.66  0.54  5.18  2.30
offshore  0.03  4.304  5.25  1.33  1.21  7.08
filter3D  0.03  4.918  5.78  0.83  1.47  4.30
hugebubbles20 0  0.10  3.804  4.99  4.81  1.94  12.14
Europe  0.18  2.669  3.41  5.57  2.57  2.50
cant  0.04  12.001  12.83  1.05  1.42  0.77
hood  0.08  13.944  14.22  0.97  1.77  1.72
pwtk  0.07  17.717  17.88  1.13  2.06  1.53
Empire R AP  0.04  4.734  0.89  0.65  0.88
Empire RA P  0.08  2.316  1.03  0.41  0.68
Laplace R A  0.39  2.041  0.68  0.73  2.71
Laplace A P  0.15  5.398  2.57  1.00  11.65
Laplace R AP  0.19  5.466  2.36  1.24  5.24
Laplace RA P  0.47  2.203  1.67  0.65  3.32
Brick R A  0.64  2.381  1.16  1.82  4.91
Empire R A  0.43  5.934  1.09  1.06  1.11
Empire A P  0.30  8.463  3.60  1.05  1.48
Brick RA P  0.61  6.326  1.26  0.43  1.14
ldoor  0.32  14.910  1.09  1.88  1.76
delaunay n24  0.41  3.086  1.74  1.12
Brick R AP  0.24  6.349  0.76  1.91
channel  0.43  7.054  1.51  3.10
Brick AP  0.49  7.954  0.95  4.54
cage15  1.56  2.660  4.86
Bump  0.88  13.126  1.58
audi  1.31  12.345  1.54
dielFilterV3real  1.80  9.679  1.85
Geomean:  5.25  1.36  1.22  2.43
We conclude that we do not sacrifice much in terms of performance on highly threaded architectures. This is demonstrated by the experiments comparing our portable method against 5 native methods on GPUs and 2 native methods on KNLs. Our SPGEMM kernel is also the