ACCELERATING SPARSE CHOLESKY FACTORIZATION ON THE GPU
STEVE RENNICH, SR. ENGINEER, NVIDIA DEVELOPER TECHNOLOGY
DARKO STOSIC, PHD CANDIDATE, UNIV. FEDERAL DE PERNAMBUCO
TIM DAVIS, PROFESSOR, CSE, TEXAS A&M UNIVERSITY
SPARSE MATRIX FACTORIZATION ON GPUS
Objective: find methods for GPU acceleration of sparse Cholesky factorization
Experiments use SuiteSparse 4.4.3 / CHOLMOD
Outline:
• Sparse Cholesky factorization
• Previous work / issues
• 'Branches' approach
• Dense block Cholesky
DIRECT SPARSE FACTORIZATION
Supernodes (stored in compressed-column form)
Block Cholesky of a supernode:

$\begin{bmatrix} A_{11} & A_{21}^T \\ A_{21} & A_{22} \end{bmatrix} = \begin{bmatrix} L_{11} & 0 \\ L_{21} & I \end{bmatrix} \begin{bmatrix} I & 0 \\ 0 & A_{22}^{*} \end{bmatrix} \begin{bmatrix} L_{11}^T & L_{21}^T \\ 0 & I \end{bmatrix}$

$L_{11} L_{11}^T = A_{11}$ : POTRF (dense Cholesky)
$L_{11} L_{21}^T = A_{21}^T$ : TRSM (triangular solve)
$A_{22}^{*} = A_{22} - L_{21} L_{21}^T$ : GEMM (matrix multiplication, Schur complement)
(a minimal GPU sketch of these three steps follows below)
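A minimal sketch of this block-Cholesky step on the GPU, assuming cuSOLVER/cuBLAS (not CHOLMOD's actual code): the Schur-complement update is written with DSYRK, the symmetric special case of the GEMM named above, and the 4x4 test matrix, block size, and variable names are made up for illustration.

```c
/* Hypothetical sketch: one block-Cholesky step on the GPU for a tiny SPD matrix. */
#include <stdio.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <cusolverDn.h>

int main(void)
{
    const int n = 4, nb = 2, lda = 4;            /* toy "supernode" dimensions */
    /* Column-major SPD test matrix (tridiagonal, diagonally dominant).        */
    double h_A[16] = { 4,1,0,0,  1,4,1,0,  0,1,4,1,  0,0,1,4 };
    double *d_A, *d_work, one = 1.0, neg_one = -1.0;
    int *d_info, lwork = 0;

    cusolverDnHandle_t solver;  cublasHandle_t blas;
    cusolverDnCreate(&solver);  cublasCreate(&blas);

    cudaMalloc((void**)&d_A, sizeof(h_A));
    cudaMalloc((void**)&d_info, sizeof(int));
    cudaMemcpy(d_A, h_A, sizeof(h_A), cudaMemcpyHostToDevice);

    /* 1) POTRF: L11 * L11^T = A11 (dense Cholesky of the diagonal block)      */
    cusolverDnDpotrf_bufferSize(solver, CUBLAS_FILL_MODE_LOWER, nb, d_A, lda, &lwork);
    cudaMalloc((void**)&d_work, lwork * sizeof(double));
    cusolverDnDpotrf(solver, CUBLAS_FILL_MODE_LOWER, nb, d_A, lda, d_work, lwork, d_info);

    /* 2) TRSM: L21 = A21 * L11^{-T} (triangular solve for the panel below)    */
    cublasDtrsm(blas, CUBLAS_SIDE_RIGHT, CUBLAS_FILL_MODE_LOWER,
                CUBLAS_OP_T, CUBLAS_DIAG_NON_UNIT,
                n - nb, nb, &one, d_A, lda, d_A + nb, lda);

    /* 3) Schur complement: A22* = A22 - L21 * L21^T (SYRK / GEMM update)      */
    cublasDsyrk(blas, CUBLAS_FILL_MODE_LOWER, CUBLAS_OP_N,
                n - nb, nb, &neg_one, d_A + nb, lda,
                &one, d_A + nb * lda + nb, lda);

    cudaMemcpy(h_A, d_A, sizeof(h_A), cudaMemcpyDeviceToHost);
    printf("L11(0,0) = %f  (expect sqrt(4) = 2)\n", h_A[0]);

    cudaFree(d_A); cudaFree(d_work); cudaFree(d_info);
    cublasDestroy(blas); cusolverDnDestroy(solver);
    return 0;
}
```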
DIRECT SPARSE FACTORIZATION
Elimination tree
Bulk of work is in assembling supernodes (wide range of descendant sizes)
'Left-looking supernodal' (a minimal sketch of the pattern follows below)
Apply block Cholesky to each supernode
[Figure: elimination tree of seven supernodes, factored bottom-up; each supernode is assembled from its descendants (introducing fill) and then factored via the POTRF -> TRSM -> GEMM sequence]
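To make the left-looking order of operations concrete, here is a minimal dense, CPU-only sketch (a hypothetical illustration, not CHOLMOD's sparse supernodal code): each block column plays the role of a supernode, the update gathered from previously factored block columns corresponds to the GEMM/assembly step, and the diagonal block and panel are then handled by POTRF/TRSM-style loops.

```c
/* Dense blocked left-looking Cholesky sketch (toy data, plain loops, no BLAS). */
#include <math.h>
#include <stdio.h>

#define N  6      /* matrix size                       */
#define NB 2      /* block ("pretend supernode") width */

static double A[N][N];    /* lower triangle is overwritten with L */

int main(void)
{
    int i, j, k, jb, kb;

    /* Build an SPD test matrix: symmetric and diagonally dominant. */
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            A[i][j] = (i == j) ? N + 1.0 : 1.0;

    for (jb = 0; jb < N; jb += NB) {              /* block column jb = "supernode" */
        /* Left-looking update: subtract contributions of all previously       */
        /* factored block columns kb < jb (the GEMM / assembly step).          */
        for (kb = 0; kb < jb; kb += NB)
            for (j = jb; j < jb + NB; j++)
                for (i = j; i < N; i++)
                    for (k = kb; k < kb + NB; k++)
                        A[i][j] -= A[i][k] * A[j][k];

        /* Factor the block column by column: the diagonal-block part is the   */
        /* POTRF, the rows below it are the TRSM of the panel.                 */
        for (j = jb; j < jb + NB; j++) {
            for (k = jb; k < j; k++)              /* remaining in-block updates */
                for (i = j; i < N; i++)
                    A[i][j] -= A[i][k] * A[j][k];
            A[j][j] = sqrt(A[j][j]);
            for (i = j + 1; i < N; i++)           /* scale column below pivot   */
                A[i][j] /= A[j][j];
        }
    }

    printf("L(0,0) = %f (expect sqrt(%d))\n", A[0][0], N + 1);
    return 0;
}
```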
DIRECT SPARSE FACTORIZATION
Lots of ‘small’ math
Irregular access patterns
Larger matrices -> more dense math
Greater connectivity -> more dense math
Factors can be large ( > 128 GB )
PREVIOUS WORK
Just send large BLAS-3 calls to the GPU
Works! For large, dense matrices
Not so good for:
small matrices
large matrices with low connectivity (shells / beams in FEA)
Find methods for further GPU acceleration of Sparse Factorization
PREVIOUS WORK
Send appropriately-sized BLAS calls to the GPU
'Hide' PCIe communication
Assemble supernodes on the GPU
Hybrid computing: supernodes are ranked by a supernode score (decreasing cost to assemble); those above the row/column threshold (ndrow >= 256, ndcol >= 32) go to the GPU, the rest stay on the CPU
[Figure: Gflop/s over the Florida Sparse Matrix Collection, CPU vs. CPU + GPU (roughly 1.5x); annotations: "why not higher?", "why so low?"]
SuiteSparse (CHOLMOD) 4.4.3
2 x Xeon E5-2698 v3 + K40 (max boost, ECC=off), http://faculty.cse.tamu.edu/davis/suitesparse.html
ISSUES
PCIe communication: limits which BLAS operations can be accelerated on the GPU
Small BLAS: low occupancy, launch overhead; most BLAS calls don't get sent to the GPU
Seek methods which better accelerate factorization of small / minimally-connected matrices
[Figure: audikw_1.mtx, % on CPU]
PROPOSED SOLUTION
Factor branches on the GPU
• Use previous methods for the root
• No use of the CPU
• Eliminates PCIe communication
• Requires POTRF, TRSM & GEMM on the GPU
Batch and stream BLAS operations
• Within levels (a toy scheduling sketch follows below)
• Amortizes launch overhead
• Streamed to improve occupancy
• No size restriction
Maps well to multi-GPU / hybrid computing
[Figure: elimination tree split into branches 1-4 and levels 0-2, showing which data resides on the device and which on the host]
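A toy, CPU-only sketch of the level grouping (hypothetical data structures, not CHOLMOD's): compute each supernode's level in a small example elimination tree and group supernodes by level, since all supernodes within one level are independent and their POTRF/TRSM/GEMM calls can be batched and streamed together. Independent branches (subtrees below the root) would each be processed this way on their own GPU and/or set of streams.

```c
/* Toy level-scheduling sketch for one branch of an elimination tree. */
#include <stdio.h>

#define NS 7    /* number of supernodes in the made-up example tree */

int main(void)
{
    /* parent[i] = parent supernode (-1 = root); numbering is postordered,  */
    /* i.e. descendants come before ancestors.                              */
    int parent[NS] = { 2, 2, 6, 5, 5, 6, -1 };
    int level[NS] = { 0 }, maxlevel = 0, s, l, p;

    /* Leaves sit at level 0; a parent's level is one more than its deepest */
    /* child, so every supernode depends only on lower levels.              */
    for (s = 0; s < NS; s++) {
        p = parent[s];
        if (p >= 0 && level[s] + 1 > level[p]) level[p] = level[s] + 1;
        if (level[s] > maxlevel) maxlevel = level[s];
    }

    /* Process levels bottom-up; within a level, every supernode's BLAS     */
    /* calls would be enqueued into one batch spread across CUDA streams.   */
    for (l = 0; l <= maxlevel; l++) {
        printf("level %d batch:", l);
        for (s = 0; s < NS; s++)
            if (level[s] == l) printf("  supernode %d", s + 1);
        printf("\n");
    }
    return 0;
}
```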
BATCHED / STREAMED BLAS
Batch all BLAS calls to amortize kernel launch latency
Stream multiple batches to increase occupancy
Simply wrap the cuBLAS subroutine with a batch loop (a minimal sketch follows below)
DGEMM w/ m,n,k=16 -> 40 GF
[Figure: timelines of host <-> device transfers and kernel launches for a DGEMM example (m,n,k = 16): 100 Mflops, 500 Mflops, batched: 1.2 Gflops, streamed: 4.8 Gflops]
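A minimal sketch of the batching/streaming idea (an illustration, not the authors' implementation): wrap the ordinary, non-batched cublasDgemm in a loop over many small matrices and rotate the calls over several CUDA streams via cublasSetStream. The matrix count, sizes, and stream count below are arbitrary.

```c
/* Hypothetical batched/streamed small-DGEMM sketch using plain cuBLAS calls. */
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

#define NMAT     256     /* number of small DGEMMs in the batch */
#define M        16      /* m = n = k = 16, as in the slide     */
#define NSTREAMS 8

int main(void)
{
    const int elems = M * M;
    double one = 1.0, zero = 0.0;
    double *hA = (double*)malloc(NMAT * elems * sizeof(double));
    double *dA, *dB, *dC;
    cudaStream_t streams[NSTREAMS];
    cublasHandle_t handle;
    int i, s;

    for (i = 0; i < NMAT * elems; i++) hA[i] = (double)rand() / RAND_MAX;

    cudaMalloc((void**)&dA, NMAT * elems * sizeof(double));
    cudaMalloc((void**)&dB, NMAT * elems * sizeof(double));
    cudaMalloc((void**)&dC, NMAT * elems * sizeof(double));
    cudaMemcpy(dA, hA, NMAT * elems * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hA, NMAT * elems * sizeof(double), cudaMemcpyHostToDevice);

    cublasCreate(&handle);
    for (s = 0; s < NSTREAMS; s++) cudaStreamCreate(&streams[s]);

    /* The 'batch loop': wrap the ordinary (non-batched) cublasDgemm call,   */
    /* issuing one small GEMM per matrix and rotating over the streams so    */
    /* the small kernels can overlap and raise occupancy.                    */
    for (i = 0; i < NMAT; i++) {
        cublasSetStream(handle, streams[i % NSTREAMS]);
        cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, M, M, M,
                    &one,  dA + i * elems, M,
                           dB + i * elems, M,
                    &zero, dC + i * elems, M);
    }
    cudaDeviceSynchronize();
    printf("issued %d %dx%d DGEMMs over %d streams\n", NMAT, M, M, NSTREAMS);

    for (s = 0; s < NSTREAMS; s++) cudaStreamDestroy(streams[s]);
    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC); free(hA);
    return 0;
}
```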
BATCHED / STREAMED DGEMM
Square DGEMM, 64 streams/threads
Batched / streamed cuBLAS performance matches MKL for small sizes
Created by wrapping existing, non-batched routines and passing lists
[Figure: Gflop/s vs. DGEMM m,n,k (0-500) for GPU batched/streamed, GPU streamed, and CPU]
2 x Xeon E5-2698 v3 + K40 (max boost, ECC=off)
PLENTY OF PARALLELISM
Lower levels: many supernodes, few descendants
Upper levels: few supernodes, many descendants
[Figure: audikw_1.mtx, # of supernodes and of GEMM + SYRK ops per level]
BRANCHES

Matrix       # branches   # levels   # supernodes     # root levels   # root supernodes
Fault_639    2            18-19      14931 - 15794    1               1
nd24k        2            11         302 - 325        1               1
inline_1     4            16-17      3909 - 10633     1               1
Emilia_923   4            17-18      10314 - 11570    3               4
boneS10      4            18-23      7045 - 26182     1               1
ldoor        3            19-20      17413 - 35704    1               1
bone010      6            16-20      1957 - 23610     1               1
Hook_1498    9            1-18       1 - 33608        3               5
Geo_1438     8            17-18      8102 - 9335      5               9
Serena       60           10-17      189 - 4910       10              60
audikw_1     4            17-19      5631 - 22300     1               1
Flan_1564    8            15-17      3937 - 16309     2               2
CHOLMOD RESULTS
1.38x average speedup vs. previous CPU+GPU
2x average speedup vs. CPU
Poorly performing matrices see the greatest speedup
[Figure: Gflop/s over the Florida Sparse Matrix Collection for CPU, CPU + GPU, and GPU Branches]
CHOLMOD 4.4.3
2 x Xeon E5-2698 v3 + K40 (max boost, ECC=off), http://faculty.cse.tamu.edu/davis/suitesparse.html
PCIE DEPENDENCE
PCIe gen3 -> gen1: 12 GB/s -> 3 GB/s (75% bandwidth loss)
CPU+GPU: 23% performance loss
Branches: 17% performance loss
[Figure: Gflop/s over the Florida Sparse Matrix Collection for 4.4.3 CPU+GPU and GPU Branches, each at PCIe gen1 and PCIe gen3]
1 x i7 3930K + K40 (max boost, ECC=on)
SHELL MODEL PERFORMANCE
[Figure: numerical factorization rate (GF/s) vs. millions of degrees of freedom for 4.4.3 CPU, 4.4.3 CPU+GPU, and Branches 1xK40]
• 506,082 supernodes, 640 branches
• 114 - 1,730 supernodes and 8-20 levels per branch
• 49 levels and 637 supernodes in the root branch
PCB model courtesy of Dr. Serban Georgescu, Fujitsu Laboratories of Europe Ltd
2 socket x 16 core HSW E5-2698 v3 @ 2.3 GHz w/ 256 GB + 2xK40 (ECC=ON, full boost)
SHELL MODEL PERFORMANCE
The 'Branches' algorithm is well-suited to multi-GPU
We've ported the previous algorithm to multi-GPU
4 x K40: overall 1.5x speedup; branches 3.1x speedup
[Figure: execution timelines for 1 x K40 vs. 4 x K40, showing host <-> device transfers and compute kernels]
SHELL MODEL PERFORMANCE
[Figure: numerical factorization rate (GF/s) vs. millions of degrees of freedom for 4.4.3 CPU, 4.4.3 CPU+GPU, and Branches with 1x, 2x, and 4x K40; the 2xK40 and 4xK40 projected curves assume 87.5% parallel efficiency]
PCB model courtesy of Dr. Serban Georgescu, Fujitsu Laboratories of Europe Ltd
2 socket x 16 core HSW E5-2698 v3 @ 2.3 GHz w/ 256 GB + 2xK40 (ECC=ON, full boost)
CONCLUSIONS
Factoring 'branches' on the GPU avoids the PCIe bottleneck
Batching and streaming permit higher performance on small matrices
Universally beneficial
Aspects apply to other factorization methods
Future work:
• Improved performance of batched routines
• Support for hybrid computing
• Complete multi-GPU support
RELATED WORK
S5232 - GPU Acceleration of WSMP (Watson Sparse Matrix Package)
Natalia Gimelshein, Anshul Gupta
S5316 - DAG-Scheduled Linear Algebra Using Template-Based Building Blocks
Jonathan Hogg
S5476 - Energy Efficient, High-Performance Solvers through Small Dense Matrix Computations on GPUs
Azzam Haidar, Stanimire Tomov
S5424 - Exploiting Multiple GPUs in Sparse QR: Regular Numerics with Irregular Data Movement
Tim Davis
S5237 - Jacobi-Davidson Eigensolver in Cusolver Library
Lung-Sheng Chien
THANK YOU