GSKS / GSKNN: BLIS-Based High Performance Computing Kernels in N-body Problems
Chenhan D. Yu
Copyright @ 2015, The University of Texas at Austin
The 3rd BLIS Retreat, Sep 28, 2015
N-body Problems
(Video: Hellstorm Astronomy and 3D, https://youtu.be/bLLWkx_MRfk)
N-body problems aim to describe the interactions (relations) of N points {x} in a d-dimensional space.
K(xi, xj) = Kij describes the interaction between xi and xj.
Three operations: kernel summation u = Kw, kernel inversion w = (K + λI)⁻¹u, and nearest neighbors.
2D and 3D applications can be found in computational physics, geophysical exploration, and medical imaging.
High-dimensional applications in computational statistics include clustering, classification, and regression.
[Figure: 16 nearest neighbors, 8192 points, 10-core Ivy Bridge. Gflops vs. dimension d = 4 to 1028 for MKL+STL and GSKNN, with reference lines at 50% and 80% of the 8192x20x8192 DGEMM rate.]
Outline
Kernel summation (u = Kw) and nearest neighbors.
How is GEMM applied in the conventional approach?
Why can GEMM become memory bound in these operations?
What insight is required to design an algorithm that avoids redundant memory operations but still preserves efficiency?
How are GSKS and GSKNN inspired by the BLIS framework in their design?
Linear Kernel:
Three points: x1 = (1,1), x2 = (0,0), x3 = (1,0), stored as columns of Q = R = [x1 x2 x3]:

Q = R = | 1 0 1 |
        | 1 0 0 |

With the linear kernel K(xi, xj) = xiᵀxj,

K = QᵀR = | 2 0 1 |
          | 0 0 0 |
          | 1 0 1 |

Kernel summation: with w = (1, 1, 1), Kw = (3, 0, 2).
Nearest neighbors: each point's neighbor list is found from the pairwise distances derived from K.
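The toy example above can be checked numerically. This sketch (with hypothetical helper names) forms K = QᵀR for x1 = (1,1), x2 = (0,0), x3 = (1,0), then computes Kw:

```c
#include <assert.h>
#include <math.h>

/* K(i,j) = xi^T xj for points stored one per column of Q (d x n), Q = R. */
static void linear_kernel(const double *Q, int d, int n, double *K)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double acc = 0.0;
            for (int p = 0; p < d; p++)
                acc += Q[i * d + p] * Q[j * d + p];   /* xi^T xj */
            K[i * n + j] = acc;
        }
}

/* u = K w, a plain dense matrix-vector product. */
static void matvec(const double *K, int n, const double *w, double *u)
{
    for (int i = 0; i < n; i++) {
        u[i] = 0.0;
        for (int j = 0; j < n; j++)
            u[i] += K[i * n + j] * w[j];
    }
}
```

Running it on the slide's three points reproduces K = [2 0 1; 0 0 0; 1 0 1] and Kw = (3, 0, 2).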
Other Kernels: K(xi, xj) = f(xiᵀxj), e.g. the Gaussian kernel.
39:2 C. D. Yu et al.
grows exponentially with k. For high-dimensional data analysis problems such as kernel regression and kernel classification, schemes are usually linear or superlinear in k for scalability. For example, ASKIT [] uses nearest-neighbor pruning and a spatial tree to improve scalability.
Evaluating K at runtime instead of reusing it significantly increases the time complexity, especially when k is large. How to compute a dense K and its summation Kw efficiently is a new bottleneck for all applications that require the operation. For example, the floating-point efficiency of LIBSVM [Chang and Lin 2011] doesn't scale with k, since the kernel value is computed element-wise without taking advantage of the modern computer architecture. Without insight into the computer architecture, most software written in purely interpreted or partially compiled languages runs within 3% of the CPU peak performance; even for a pure C/Fortran program, 10% is usually the average. One solution is to take advantage of high-performance level-3 BLAS (Basic Linear Algebra Subprograms) to compute a submatrix of K. K(x_i, x_j) is usually a function of the square distance \|x_i - x_j\|_2^2, and its expansion (2) is mainly a pairwise inner product x_i^T x_j.

\|x_i - x_j\|_2^2 = \|x_i\|_2^2 + \|x_j\|_2^2 - 2 x_i^T x_j    (2)
With expression (2), computing K can rely on high-performance matrix-matrix multiplication (GEMM), which can usually reach more than 80% of peak performance on modern CPU architectures with a large k. Level-3 BLAS routines such as GEMM are highly optimized by domain experts, and their interfaces are standardized to facilitate users. The kernel values can be accelerated by vectorized math functions (e.g. Intel's Vectorized Math Library), and Kw can be computed by GEMV (general matrix-vector multiplication), a level-2 BLAS subroutine. CUKNN [Liang et al. 2009] computes pairwise distances with (2) to accelerate searching for k-nearest neighbors, and the same approach is taken in ASKIT [March et al. 2014; March et al. 2015], a treecode approach for fast high-dimensional kernel summation in both near-field and far-field evaluation.
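A quick numerical check of expansion (2): the squared distance computed directly agrees with the precomputed-norms form (the helper names here are hypothetical):

```c
#include <assert.h>
#include <math.h>

/* Direct evaluation: d subtractions and d FMAs per pair. */
static double sqdist_direct(const double *xi, const double *xj, int d)
{
    double s = 0.0;
    for (int p = 0; p < d; p++) {
        double t = xi[p] - xj[p];
        s += t * t;
    }
    return s;
}

/* Expansion (2): ||xi - xj||^2 = ||xi||^2 + ||xj||^2 - 2 xi^T xj.
 * In the BLAS approach the norms are precomputed once per point and
 * reused across all pairs, and the inner products become one GEMM. */
static double sqdist_expanded(const double *xi, const double *xj, int d)
{
    double ni = 0.0, nj = 0.0, ip = 0.0;
    for (int p = 0; p < d; p++) {
        ni += xi[p] * xi[p];
        nj += xj[p] * xj[p];
        ip += xi[p] * xj[p];
    }
    return ni + nj - 2.0 * ip;
}
```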
Although the BLAS approach works reasonably well for large k, several drawbacks remain unsolved.
(1) The BLAS approach transforms the low-dimensional case of (1) from a compute-bound problem into a memory-bound one.
(2) GEMM is memory bound for small k; it needs a large enough k to reach high performance, yet most practical problems have k ≤ 32.
(3) To compute a subproblem of (1), coordinates need to be collected to form dense A and B in order to use GEMM, requiring extra memory space.
(4) The intermediate results -2 x_i^T x_j need to be stored as a dense matrix C, which requires extra memory space.
(5) The temporary spaces A, B, and C are also accompanied by extra memory accesses, suffering a serious penalty.
(6) GEMV is a memory-bound operation that can hardly reach peak floating-point performance.
To summarize the BLAS approach: the standardized BLAS interface limits the possibility of combining different operations. The drawbacks listed above inspire a new BLAS-like subroutine for kernel summation, combining GEMM, GEMV, and vectorized math functions to exploit the modern CPU architecture.
ACM Transactions on Embedded Computing Systems, Vol. 9, No. 4, Article 39, Publication date: March 2010.
ALGORITHM 1: Kernel Summation with GEMM, VEXP and GEMV
X_A(:, α) → A_s, ||X_A||²(α) → ||A_s||², u(α) → u_s;
X_B(:, β) → B_s, ||X_B||²(β) → ||B_s||², w(ω) → w_s;
GEMM(A_s, -2.0, B_s, 0.0, C_s);
for j = 0 : n-1 do
  for i = 0 : m-1 do
    C_s(i, j) += ||A_s||²(i) + ||B_s||²(j);
    C_s(i, j) *= -1/(2h²);
  end
end
VEXP(C_s);
GEMV(C_s, 1.0, w_s, 1.0, u_s);
u_s → u(α);
same map as B_s. For example, ASKIT creates skeleton weights w̃ for approximation; thus, here we use w_s = w(ω_s) to take care of this special situation. Given the notation above, (1) can be approximated by (3), and GSKS is designed to solve a dense kernel summation.
u = \sum_s u_s = \sum_s K(\alpha_s, \beta_s) w(\omega_s)    (3)
Each element of K(\alpha_s, \beta_s) is usually a function of the square distance \|x_i - x_j\|_2^2, where x_i \in A_s and x_j \in B_s. To be more concise, we take the Gaussian kernel as an example for the rest of the article, and the kernel function is written as:
K(x_i, x_j) = \exp(-\|x_i - x_j\|_2^2 / (2h^2))    (4)
where h is the width of the kernel. The square distance can be evaluated directly or by (2). Precomputing \|x_i\|_2^2 = \|X_A\|_2^2(i) and reusing the results of \|X_A\|_2^2 and \|X_B\|_2^2 requires many fewer FMA (fused multiply-add) operations than evaluating \|x_i - x_j\|_2^2 directly. Moreover, -2 x_i^T x_j, (4), and (3) can be computed by GEMM, VEXP (vectorized exponential function), and GEMV, which provides an opportunity to achieve excellent performance on modern CPU architectures.
The GEMM, VEXP and GEMV combination is widely used in modern kernel methods to achieve high performance with the BLAS approach. We summarize the combination approach in Algorithm 1 using the notation just defined. The drawback of this approach is that A_s, B_s and C_s need to be formed explicitly in order to use GEMM, VEXP and GEMV, due to the standardized BLAS interface. A_s and B_s are created to collect points from X_A and X_B, since GEMM only takes contiguous or uniform-stride inputs. C_s must be created to hold the output of GEMM, and it is also required by VEXP and GEMV. These temporary spaces are redundant, and extra memory accesses come with them. Motivated by these redundant memory operations, we develop a BLAS-like subroutine that embeds Algorithm 1 into a micro-kernel, avoiding these redundant memory allocations and operations.

2.1. Sequential General Stride Kernel Summation

We first present the pseudo-code of GSKS with the Gaussian kernel in Algorithm 2 for computing a subproblem of (3). Besides GSKS, GEKS is a special case of GSKS that stands for the general storage version, where \alpha(i) = i, \beta(j) = j and \omega(j) = j. The algorithm contains six layers of loops corresponding to different partitionings of m, n and k. The partitioning scheme is similar to the GEMM implementation in BLIS [Van Zee and Van
The expansion exposes GEMM operations:

Qᵀ = | 1 1 |      R = | 1 0 1 |      QᵀR = | 2 0 1 |
     | 0 0 |          | 1 0 0 |            | 0 0 0 |
     | 1 0 |                               | 1 0 1 |

with rows x1ᵀ, x2ᵀ, x3ᵀ and squared norms x1ᵀx1 = 2, x2ᵀx2 = 0, x3ᵀx3 = 1. For example, ‖x3 − x1‖² = 1 + 2 − 2·1 = 1.
The Big Picture
Kw takes O(N²) if K is precomputed, otherwise O(dN²); this cost is prohibitive when N is large.
Exhaustive search requires O(N² log(k)) if K is precomputed, otherwise O(dN² + N² log(k)).
Divide-and-conquer approximations: Barnes-Hut or FMM for kernel summation, and randomized KD-trees or locality sensitive hashing for kNN.
Still, the subproblem in all these algorithms is to solve several smaller dense kernel summations or kNN searches.
Solving the subproblem fast benefits all these methods.
Subproblem
Take two subsets Q and R from X.
Compute K(Q, R) with GEMM, using the expansion of the squared distance.
Compute Kw with GEMV, or select k entries in each row.
Rely on BLAS, VML (Vectorized Math Library), and STL.
What can possibly go wrong?
Copyright @ 2015, The University of Texas at Austin
Gflo
ps
55110165220
4 36 132 516
MKL+STLGSKNN
Visualization
[Figure: from the N×d point set X, gather Qᵀ (m×d) and R (d×n); K = QᵀR is m×n. Kernel summation (KS) yields an m×1 output; k-nearest neighbors (KNN) yields an m×k output.]
Insights
Q, R and K can't be stored.
Collect Q and R from X during packing.
K(xi, xj) = Kij must be computed in registers.
Kw or k-select must be completed in registers.
Only store the output.
That is: we need a special packing routine, and we fuse GEMM with distance calculations, special function evaluations, and Kw or k-select.
Code Fusion in BLIS
[Figure: BLIS-style blocking of K = QᵀR: the 6th loop partitions n into nc blocks, the 4th loop partitions m into mc blocks, and the micro-kernel computes mr × nr tiles; the neighbor lists are updated inside the loops.]
Code fusion is done in the micro-kernel, and the BLIS framework is maintained.
Slice and Dice!
GSKNN and BLIS (K = QᵀR)
[Figure: BLIS-style partitioning: RC is packed (together with the auxiliary buffer RC2 of packed squared norms) and reused, while QC (with QC2) is packed and streamed through the micro-kernel.]
Micro-Kernel
[Figure: rank-1 update of a 4×4 double tile held in AVX registers, pairing lanes of Q (Q0..Q3) with permuted lanes of R (R0..R3).]
LOAD Q; LOAD R
FMA Q, R, C03_0                 → pairs 00 11 22 33
SHUFFLE; FMA Q, R, C03_1        → pairs 01 10 23 32
PERMUTE2F128; FMA Q, R, C03_2   → pairs 03 12 21 30
SHUFFLE; FMA Q, R, C03_3        → pairs 02 13 20 31
Micro-Kernel with p-norm
The same LOAD/SHUFFLE/PERMUTE2F128 pattern supports other norms by replacing the FMA:
1-norm:   SUB, AND (clear the sign bit), ADD
inf-norm: SUB, AND (clear the sign bit), MAX
p-norm:   SUB, POW (SVML), ADD
Vectorized Math Functions
[Plot: p(x) − exp(x) on [0, ln 2].]
With high precision (20 decimal digits), the Remez exchange algorithm can generate an order-11 near-minimax polynomial with 1E-18 relative error.
To derive the backward stability, we move the error term into the square operation: (1+\epsilon)^{k+2} \|x_i - x_j\|_2^2 = \|(1+\epsilon)^{\frac{k+2}{2}} x_i - (1+\epsilon)^{\frac{k+2}{2}} x_j\|_2^2.

\frac{\|(1+\epsilon)^{\frac{k+2}{2}} x_i - x_i\|_2}{\|x_i\|_2} = (1+\epsilon)^{\frac{k+2}{2}} - 1 \le O(k\epsilon)    (13)

The expansion computes the square pairwise distance more efficiently by reusing the results of \|X_A\|_2^2 and \|X_B\|_2^2. Similar to the direct evaluation, both schemes are backward stable, and the round-off error is also the same.

In the polynomial approximation part, the roundoff error mainly comes from the nested polynomial evaluation.

P_{11}(x) = c_{11} + (\cdots + (c_5 + (c_4 + (c_3 + (c_2 + (c_1 + c_0 x) x) x) x) x) x \cdots) x    (14)

The roundoff error of the order-n (n \ge 1) polynomial summation has the following closed form: c_n x^n (1+\epsilon)^{2n} + \sum_{i=n-1}^{0} c_i x^i (1+\epsilon)^{2i+1}. This polynomial approximation is forward stable, since \exp(b) \ge 1 for b \in [0, \ln 2]. The forward stability is derived in (15) and (16).

\frac{|c_n x^n [(1+\epsilon)^{2n} - 1] + \sum_{i=n-1}^{0} c_i x^i [(1+\epsilon)^{2i+1} - 1]|}{|\sum_{i=n}^{0} c_i x^i|}    (15)

\le \frac{|2n\epsilon| |\sum_{i=n}^{0} c_i x^i| + O(\epsilon^2)}{|\sum_{i=n}^{0} c_i x^i|} = O(n\epsilon)    (16)
Fig. 5: Error comparison between the Remez order-eleven polynomial approximation and the Intel vdExp() function: the polynomial is chosen to fit the exponential function on [0, log(2)], and the convergence criterion for the Remez exchange algorithm is double machine epsilon (2.22E-16). [Plot: P(x) − exp(x) in units of 10⁻¹⁶ for x ∈ [0, 0.7], with curves for Remez order 11, Intel vdExp(), and double machine epsilon.]
Evaluating P11(x) costs 1 ADD + 11 FMA.
Vectorized Max Heap
[Figure: a max heap H holding k values per query point q.]
LOAD C; SHUFFLE D, C, 0x5; MAX D, C; PERMUTE2F128 C, D, 0x1; MAX D, C
Finding the max child, e.g. C = [1,3,4,2]: C goes [1,3,4,2] → [4,3,4,3] → [4,4,4,4], with intermediates D = [4,2,1,3] and [3,4,3,4].
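The heap keeps the k best (smallest-distance) candidates per query point. Here is a scalar max-heap sketch of that k-select step (hypothetical names); the slide's SIMD sequence accelerates the max-child search inside heapify:

```c
#include <assert.h>

/* Max-heap over distances: the root holds the current k-th smallest
 * distance, so a new candidate replaces the root only if it is closer. */
typedef struct { double *d; int k; } heap_t;

static void heapify(heap_t *h, int i)
{
    for (;;) {
        int l = 2 * i + 1, r = l + 1, m = i;
        if (l < h->k && h->d[l] > h->d[m]) m = l;  /* max-child search:  */
        if (r < h->k && h->d[r] > h->d[m]) m = r;  /* the SIMD-able step */
        if (m == i) break;
        double t = h->d[i]; h->d[i] = h->d[m]; h->d[m] = t;
        i = m;
    }
}

/* Offer a candidate distance; keep only the k smallest seen so far. */
static void heap_offer(heap_t *h, double dist)
{
    if (dist >= h->d[0]) return;   /* not better than current k-th best */
    h->d[0] = dist;
    heapify(h, 0);
}
```

In GSKNN the candidates come straight out of the micro-kernel's registers, so most offers fail the root test and never touch memory beyond the heap itself.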
GSKS Efficiency Analysis
T_BLAS = T_GSKS + T_R + T_Q + T_K
Efficiency gap: mn(2d+36)/T_GSKS − mn(2d+36)/T_BLAS = ?
GSKNN Efficiency Graphs
Figure 4: Predicted floating-point efficiency (Gflops). Sequential parameters: m = n = 8192, k = 16, 512, 2048, τ_fp = 8 × 3.54, τ_cm = 2.2 × 10⁻⁹, τ_rm = 13.91 × 10⁻⁹, ε = 0.5. For the 10-thread result, τ_fp = 10 × 8 × 3.10, and τ_cm and τ_rm are 1/5 of the original values.
[Six panels: Gflops vs. d (0 to 1000). Sequential (peak 28.32): Variant#1 at k = 16 and k = 512, Variant#6 at k = 2048. 10 threads (nthd = 10, peak 248): the same three configurations.]
the real experimental switching point. The predicted switching point can significantly reduce the time spent on fine-tuning the switching point.
Figure 5: Predicted floating-point efficiency (Gflops) for different k. [Two panels: Gflops vs. k (0 to 2000) at nthd = 10, m = n = 8192, for d = 16 and d = 64; curves for Modeled Variant#1, Modeled Variant#6, Variant#1, Variant#6, and MKL + STL, with the modeled and measured thresholds marked.]
4. EXPERIMENTAL SETUP
In this section we give details on the experimental setup used to test our methods. Our current version of GSKNN contains double-precision x86-64 micro-kernels designed for Intel Sandy Bridge/Ivy Bridge architectures. GSKNN can be and has been integrated with other MPI-based parallel knn implementations, such as randomized KD-trees and locality sensitive hashing, which typically use a BLAS approach for the local search.

Implementation and hardware: Our GSKNN routine is implemented in C with SSE2 and AVX intrinsics or assembly. Other than the micro-kernel, all other parts are written in pure C. The parallel randomized KD-tree knn is written in C++ and SSE4.2. The code is compiled with Intel C compiler version 14.0.1.106 and mvapich2 version 2.0b with the -O3 optimization flag. We carry out runtime experiments on the Maverick system at TACC, which has two ten-core CPUs per node. The dual CPUs in each node are Intel Xeon E5-2680 v2 (Ivy Bridge) processors (2.8GHz/3.6GHz) with 12.8GB/core of memory and a three-level cache: 32KB L1 data cache, 256KB L2 cache, and 25.6MB L3 cache. The stable CPU clock rate is 3.54GHz/3.10GHz for the 1-core/10-core experiments.

GSKNN parameters: We choose parameters as discussed in §3.4, where mr = 8, nr = 4, kc = 256, mc = 104, and nc = 4096, which make the size of Ac 208 KB and the size of Bc 8192 KB. For all experiments with k ≤ 512, Variant#1 is chosen; otherwise Variant#6 is used instead.

5. NUMERICAL RESULTS
We have shown and discussed our sequential design in §3.3, and here we report three sets of results: (1) breakdown analysis, (2) multi-threaded floating-point efficiency, and (3) the integrated runtime of the randomized KD-tree knn solver. All experiments are in double precision, and each result has a reference kernel implementing Algorithm 3.1 using MKL GEMM and STL heap.
Table 6: Runtime breakdown analysis (ms), reported as MKL+STL / GSKNN. For k = 16, 128, 512 Variant#1 GSKNN is used; for k = 2048 Variant#6 is used instead.

m = n = 8192, d = 16
k     | T_coll + T_GEMM + T_sq2d | T_heap  | T_total
16    | 0 + 55 + 24 / 20         | 13 / 1  | 92 / 21
128   | 0 + 55 + 24 / 20         | 16 / 5  | 95 / 25
512   | 0 + 55 + 24 / 20         | 30 / 33 | 109 / 53
2048  | 0 + 55 + 24 / 76         | 52 / 34 | 131 / 110

m = n = 8192, d = 64
16    | 1 + 117 + 24 / 52        | 13 / 1  | 155 / 53
128   | 1 + 122 + 24 / 52        | 15 / 6  | 162 / 58
512   | 1 + 113 + 24 / 52        | 30 / 35 | 168 / 87
2048  | 1 + 126 + 24 / 94        | 52 / 34 | 203 / 128

m = n = 8192, d = 256
16    | 3 + 210 + 24 / 186       | 13 / 2  | 250 / 188
128   | 3 + 209 + 24 / 186       | 15 / 13 | 251 / 199
512   | 3 + 211 + 24 / 186       | 30 / 38 | 268 / 224
2048  | 3 + 213 + 24 / 202       | 52 / 34 | 292 / 236

m = n = 8192, d = 1024
16    | 9 + 702 + 24 / 665       | 13 / 0  | 748 / 665
128   | 9 + 734 + 24 / 665       | 15 / 11 | 782 / 676
512   | 9 + 728 + 24 / 665       | 30 / 40 | 791 / 705
2048  | 9 + 735 + 24 / 673       | 51 / 34 | 819 / 707
We break down the total execution time T_total = T_coll + T_GEMM + T_sq2d + T_heap, which represents the time spent collecting data from the global tables X and X², computing GEMM, evaluating the square distance, and the heap selection. For GSKNN, the time spent on the individual terms is difficult to collect, since a timer inside the 2nd loop would cause serious overhead. Thus, we report the integrated time of GSKNN, and we estimate the time spent on the heap from the difference in total execution time relative to the k = 1 case. Taking the first row (k = 16) of Table 6 as an example, GSKNN spends 21 ms in total; the estimated heap selection time is 21 − 20 = 1, where 20 is the total execution time of the k = 1 case. The breakdown results reflect the difference of (??) in memory complexity, and we are certain that the optimization leads to a smaller coefficient on the heap selection part. The performance degradation of Variant#1 for larger k can be observed in T_heap, and we switch to Variant#6 for k = 2048 to secure the GEMM efficiency.
Conclusion
The GEMM approach in N-body problems is a good example showing that the current BLAS library lacks the flexibility for lower-level integration.
The algorithmic innovation of GSKS and GSKNN is to break through the interface, seeking the lowest memory complexity.
We exploit these observations with the help of the BLIS framework.
Ongoing work includes other operations, e.g. kernel inversion and k-means clustering, and ports to GPUs and other accelerators.
Questions?
GSKNN: github.com/ChenhanYu/rnn
GSKS: github.com/ChenhanYu/ks