ACCELERATING SPARSE CHOLESKY FACTORIZATION ON THE GPU
STEVE RENNICH, SR. ENGINEER, NVIDIA DEVELOPER TECHNOLOGY
DARKO STOSIC, PHD CANDIDATE, UNIV. FEDERAL DE PERNAMBUCO
TIM DAVIS, PROFESSOR, CSE, TEXAS A&M UNIVERSITY
SPARSE MATRIX FACTORIZATION ON GPUS
Objective: find methods for GPU acceleration of sparse Cholesky factorization
Experiments use SuiteSparse 4.4.3 / CHOLMOD
Outline:
• Sparse Cholesky factorization
• Previous work / issues
• 'Branches' approach
• Dense block Cholesky
DIRECT SPARSE FACTORIZATION
Supernodes (stored in compressed-column form)
Block Cholesky of a supernode:

$\begin{bmatrix} A_{11} & A_{21}^T \\ A_{21} & A_{22} \end{bmatrix} = \begin{bmatrix} L_{11} & 0 \\ L_{21} & I \end{bmatrix} \begin{bmatrix} I & 0 \\ 0 & A_{22}^{*} \end{bmatrix} \begin{bmatrix} L_{11}^T & L_{21}^T \\ 0 & I \end{bmatrix}$

$L_{11} L_{11}^T = A_{11}$ : POTRF (dense Cholesky)
$L_{11} L_{21}^T = A_{21}^T$ : TRSM (triangular solve)
$A_{22}^{*} = A_{22} - L_{21} L_{21}^T$ : GEMM (matrix multiplication, Schur complement)
(a minimal GPU sketch of these three steps follows below)
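A minimal sketch of this block-Cholesky step on the GPU, assuming cuSOLVER/cuBLAS (not CHOLMOD's actual code): the Schur-complement update is written with DSYRK, the symmetric special case of the GEMM named above, and the 4x4 test matrix, block size, and variable names are made up for illustration.

```c
/* Hypothetical sketch: one block-Cholesky step on the GPU for a tiny SPD matrix. */
#include <stdio.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <cusolverDn.h>

int main(void)
{
    const int n = 4, nb = 2, lda = 4;            /* toy "supernode" dimensions */
    /* Column-major SPD test matrix (tridiagonal, diagonally dominant).        */
    double h_A[16] = { 4,1,0,0,  1,4,1,0,  0,1,4,1,  0,0,1,4 };
    double *d_A, *d_work, one = 1.0, neg_one = -1.0;
    int *d_info, lwork = 0;

    cusolverDnHandle_t solver;  cublasHandle_t blas;
    cusolverDnCreate(&solver);  cublasCreate(&blas);

    cudaMalloc((void**)&d_A, sizeof(h_A));
    cudaMalloc((void**)&d_info, sizeof(int));
    cudaMemcpy(d_A, h_A, sizeof(h_A), cudaMemcpyHostToDevice);

    /* 1) POTRF: L11 * L11^T = A11 (dense Cholesky of the diagonal block)      */
    cusolverDnDpotrf_bufferSize(solver, CUBLAS_FILL_MODE_LOWER, nb, d_A, lda, &lwork);
    cudaMalloc((void**)&d_work, lwork * sizeof(double));
    cusolverDnDpotrf(solver, CUBLAS_FILL_MODE_LOWER, nb, d_A, lda, d_work, lwork, d_info);

    /* 2) TRSM: L21 = A21 * L11^{-T} (triangular solve for the panel below)    */
    cublasDtrsm(blas, CUBLAS_SIDE_RIGHT, CUBLAS_FILL_MODE_LOWER,
                CUBLAS_OP_T, CUBLAS_DIAG_NON_UNIT,
                n - nb, nb, &one, d_A, lda, d_A + nb, lda);

    /* 3) Schur complement: A22* = A22 - L21 * L21^T (SYRK / GEMM update)      */
    cublasDsyrk(blas, CUBLAS_FILL_MODE_LOWER, CUBLAS_OP_N,
                n - nb, nb, &neg_one, d_A + nb, lda,
                &one, d_A + nb * lda + nb, lda);

    cudaMemcpy(h_A, d_A, sizeof(h_A), cudaMemcpyDeviceToHost);
    printf("L11(0,0) = %f  (expect sqrt(4) = 2)\n", h_A[0]);

    cudaFree(d_A); cudaFree(d_work); cudaFree(d_info);
    cublasDestroy(blas); cusolverDnDestroy(solver);
    return 0;
}
```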
DIRECT SPARSE FACTORIZATION
Elimination tree
Bulk of work is in assembling supernodes (wide range of descendant sizes)
'Left-looking supernodal' (a minimal sketch of the pattern follows below)
Apply block Cholesky to each supernode
[Figure: elimination tree of seven supernodes, factored bottom-up; each supernode is assembled from its descendants (introducing fill) and then factored via the POTRF -> TRSM -> GEMM sequence]
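To make the left-looking order of operations concrete, here is a minimal dense, CPU-only sketch (a hypothetical illustration, not CHOLMOD's sparse supernodal code): each block column plays the role of a supernode, the update gathered from previously factored block columns corresponds to the GEMM/assembly step, and the diagonal block and panel are then handled by POTRF/TRSM-style loops.

```c
/* Dense blocked left-looking Cholesky sketch (toy data, plain loops, no BLAS). */
#include <math.h>
#include <stdio.h>

#define N  6      /* matrix size                       */
#define NB 2      /* block ("pretend supernode") width */

static double A[N][N];    /* lower triangle is overwritten with L */

int main(void)
{
    int i, j, k, jb, kb;

    /* Build an SPD test matrix: symmetric and diagonally dominant. */
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            A[i][j] = (i == j) ? N + 1.0 : 1.0;

    for (jb = 0; jb < N; jb += NB) {              /* block column jb = "supernode" */
        /* Left-looking update: subtract contributions of all previously       */
        /* factored block columns kb < jb (the GEMM / assembly step).          */
        for (kb = 0; kb < jb; kb += NB)
            for (j = jb; j < jb + NB; j++)
                for (i = j; i < N; i++)
                    for (k = kb; k < kb + NB; k++)
                        A[i][j] -= A[i][k] * A[j][k];

        /* Factor the block column by column: the diagonal-block part is the   */
        /* POTRF, the rows below it are the TRSM of the panel.                 */
        for (j = jb; j < jb + NB; j++) {
            for (k = jb; k < j; k++)              /* remaining in-block updates */
                for (i = j; i < N; i++)
                    A[i][j] -= A[i][k] * A[j][k];
            A[j][j] = sqrt(A[j][j]);
            for (i = j + 1; i < N; i++)           /* scale column below pivot   */
                A[i][j] /= A[j][j];
        }
    }

    printf("L(0,0) = %f (expect sqrt(%d))\n", A[0][0], N + 1);
    return 0;
}
```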
DIRECT SPARSE FACTORIZATION
Lots of ‘small’ math
Irregular access patterns
Larger matrices -> more dense math
Greater connectivity -> more dense math
Factors can be large ( > 128 GB )
PREVIOUS WORK
Just send large BLAS-3 calls to the GPU
Works! For large, dense matrices
Not so good for:
small matrices
large matrices with low connectivity (shells / beams in FEA)
Find methods for further GPU acceleration of Sparse Factorization
PREVIOUS WORK
Send appropriately-sized BLAS calls to the GPU
'Hide' PCIe communication
Assemble supernodes on the GPU
Hybrid computing: supernodes are ranked by a supernode score (decreasing cost to assemble); those above the row/column threshold (ndrow >= 256, ndcol >= 32) go to the GPU, the rest stay on the CPU
[Figure: Gflop/s over the Florida Sparse Matrix Collection, CPU vs. CPU + GPU (roughly 1.5x); annotations: "why not higher?", "why so low?"]
SuiteSparse (CHOLMOD) 4.4.3
2 x Xeon E5-2698 v3 + K40 (max boost, ECC=off), http://faculty.cse.tamu.edu/davis/suitesparse.html
ISSUES
PCIe communication: limits which BLAS operations can be accelerated on the GPU
Small BLAS: low occupancy, launch overhead; most BLAS calls don't get sent to the GPU
Seek methods which better accelerate factorization of small / minimally-connected matrices
[Figure: audikw_1.mtx, % on CPU]
PROPOSED SOLUTION
Factor branches on the GPU
• Use previous methods for the root
• No use of the CPU
• Eliminates PCIe communication
• Requires POTRF, TRSM & GEMM on the GPU
Batch and stream BLAS operations
• Within levels (a toy scheduling sketch follows below)
• Amortizes launch overhead
• Streamed to improve occupancy
• No size restriction
Maps well to multi-GPU / hybrid computing
[Figure: elimination tree split into branches 1-4 and levels 0-2, showing which data resides on the device and which on the host]
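A toy, CPU-only sketch of the level grouping (hypothetical data structures, not CHOLMOD's): compute each supernode's level in a small example elimination tree and group supernodes by level, since all supernodes within one level are independent and their POTRF/TRSM/GEMM calls can be batched and streamed together. Independent branches (subtrees below the root) would each be processed this way on their own GPU and/or set of streams.

```c
/* Toy level-scheduling sketch for one branch of an elimination tree. */
#include <stdio.h>

#define NS 7    /* number of supernodes in the made-up example tree */

int main(void)
{
    /* parent[i] = parent supernode (-1 = root); numbering is postordered,  */
    /* i.e. descendants come before ancestors.                              */
    int parent[NS] = { 2, 2, 6, 5, 5, 6, -1 };
    int level[NS] = { 0 }, maxlevel = 0, s, l, p;

    /* Leaves sit at level 0; a parent's level is one more than its deepest */
    /* child, so every supernode depends only on lower levels.              */
    for (s = 0; s < NS; s++) {
        p = parent[s];
        if (p >= 0 && level[s] + 1 > level[p]) level[p] = level[s] + 1;
        if (level[s] > maxlevel) maxlevel = level[s];
    }

    /* Process levels bottom-up; within a level, every supernode's BLAS     */
    /* calls would be enqueued into one batch spread across CUDA streams.   */
    for (l = 0; l <= maxlevel; l++) {
        printf("level %d batch:", l);
        for (s = 0; s < NS; s++)
            if (level[s] == l) printf("  supernode %d", s + 1);
        printf("\n");
    }
    return 0;
}
```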
BATCHED / STREAMED BLAS
Batch all BLAS calls to amortize kernel launch latency
Stream multiple batches to increase occupancy
Simply wrap the cuBLAS subroutine with a batch loop (a minimal sketch follows below)
DGEMM w/ m,n,k=16 -> 40 GF
[Figure: timelines of host <-> device transfers and kernel launches for a DGEMM example (m,n,k = 16): 100 Mflops, 500 Mflops, batched: 1.2 Gflops, streamed: 4.8 Gflops]
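A minimal sketch of the batching/streaming idea (an illustration, not the authors' implementation): wrap the ordinary, non-batched cublasDgemm in a loop over many small matrices and rotate the calls over several CUDA streams via cublasSetStream. The matrix count, sizes, and stream count below are arbitrary.

```c
/* Hypothetical batched/streamed small-DGEMM sketch using plain cuBLAS calls. */
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

#define NMAT     256     /* number of small DGEMMs in the batch */
#define M        16      /* m = n = k = 16, as in the slide     */
#define NSTREAMS 8

int main(void)
{
    const int elems = M * M;
    double one = 1.0, zero = 0.0;
    double *hA = (double*)malloc(NMAT * elems * sizeof(double));
    double *dA, *dB, *dC;
    cudaStream_t streams[NSTREAMS];
    cublasHandle_t handle;
    int i, s;

    for (i = 0; i < NMAT * elems; i++) hA[i] = (double)rand() / RAND_MAX;

    cudaMalloc((void**)&dA, NMAT * elems * sizeof(double));
    cudaMalloc((void**)&dB, NMAT * elems * sizeof(double));
    cudaMalloc((void**)&dC, NMAT * elems * sizeof(double));
    cudaMemcpy(dA, hA, NMAT * elems * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hA, NMAT * elems * sizeof(double), cudaMemcpyHostToDevice);

    cublasCreate(&handle);
    for (s = 0; s < NSTREAMS; s++) cudaStreamCreate(&streams[s]);

    /* The 'batch loop': wrap the ordinary (non-batched) cublasDgemm call,   */
    /* issuing one small GEMM per matrix and rotating over the streams so    */
    /* the small kernels can overlap and raise occupancy.                    */
    for (i = 0; i < NMAT; i++) {
        cublasSetStream(handle, streams[i % NSTREAMS]);
        cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, M, M, M,
                    &one,  dA + i * elems, M,
                           dB + i * elems, M,
                    &zero, dC + i * elems, M);
    }
    cudaDeviceSynchronize();
    printf("issued %d %dx%d DGEMMs over %d streams\n", NMAT, M, M, NSTREAMS);

    for (s = 0; s < NSTREAMS; s++) cudaStreamDestroy(streams[s]);
    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC); free(hA);
    return 0;
}
```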
BATCHED / STREAMED DGEMM
Square DGEMM, 64 streams/threads
Batched / streamed cuBLAS performance matches MKL for small sizes
Created by wrapping existing, non-batched routines and passing lists
[Figure: Gflop/s vs. DGEMM m,n,k (0-500) for GPU batched/streamed, GPU streamed, and CPU]
2 x Xeon E5-2698 v3 + K40 (max boost, ECC=off)
PLENTY OF PARALLELISM
Lower levels: many supernodes, few descendants
Upper levels: few supernodes, many descendants
[Figure: audikw_1.mtx, # of supernodes and of GEMM + SYRK ops per level]
BRANCHES

Matrix       # branches   # levels   # supernodes     # root levels   # root supernodes
Fault_639    2            18-19      14931 - 15794    1               1
nd24k        2            11         302 - 325        1               1
inline_1     4            16-17      3909 - 10633     1               1
Emilia_923   4            17-18      10314 - 11570    3               4
boneS10      4            18-23      7045 - 26182     1               1
ldoor        3            19-20      17413 - 35704    1               1
bone010      6            16-20      1957 - 23610     1               1
Hook_1498    9            1-18       1 - 33608        3               5
Geo_1438     8            17-18      8102 - 9335      5               9
Serena       60           10-17      189 - 4910       10              60
audikw_1     4            17-19      5631 - 22300     1               1
Flan_1564    8            15-17      3937 - 16309     2               2
CHOLMOD RESULTS
1.38x average speedup vs. previous CPU+GPU
2x average speedup vs. CPU
Poorly performing matrices see the greatest speedup
[Figure: Gflop/s over the Florida Sparse Matrix Collection for CPU, CPU + GPU, and GPU Branches]
CHOLMOD 4.4.3
2 x Xeon E5-2698 v3 + K40 (max boost, ECC=off), http://faculty.cse.tamu.edu/davis/suitesparse.html
PCIE DEPENDENCE
PCIe gen3 -> gen1: 12 GB/s -> 3 GB/s (75% bandwidth loss)
CPU+GPU: 23% performance loss
Branches: 17% performance loss
[Figure: Gflop/s over the Florida Sparse Matrix Collection for 4.4.3 CPU+GPU and GPU Branches, each at PCIe gen1 and PCIe gen3]
1 x i7 3930K + K40 (max boost, ECC=on)
SHELL MODEL PERFORMANCE
[Figure: numerical factorization rate (GF/s) vs. millions of degrees of freedom for 4.4.3 CPU, 4.4.3 CPU+GPU, and Branches 1xK40]
• 506,082 supernodes, 640 branches
• 114 - 1,730 supernodes and 8-20 levels per branch
• 49 levels and 637 supernodes in the root branch
PCB model courtesy of Dr. Serban Georgescu, Fujitsu Laboratories of Europe Ltd
2 socket x 16 core HSW E5-2698 v3 @ 2.3 GHz w/ 256 GB + 2xK40 (ECC=ON, full boost)
SHELL MODEL PERFORMANCE
The 'Branches' algorithm is well-suited to multi-GPU
We've ported the previous algorithm to multi-GPU
4 x K40: overall 1.5x speedup; branches 3.1x speedup
[Figure: execution timelines for 1 x K40 vs. 4 x K40, showing host <-> device transfers and compute kernels]
SHELL MODEL PERFORMANCE
[Figure: numerical factorization rate (GF/s) vs. millions of degrees of freedom for 4.4.3 CPU, 4.4.3 CPU+GPU, and Branches with 1x, 2x, and 4x K40; the 2xK40 and 4xK40 projected curves assume 87.5% parallel efficiency]
PCB model courtesy of Dr. Serban Georgescu, Fujitsu Laboratories of Europe Ltd
2 socket x 16 core HSW E5-2698 v3 @ 2.3 GHz w/ 256 GB + 2xK40 (ECC=ON, full boost)
CONCLUSIONS
Factoring 'branches' on the GPU avoids the PCIe bottleneck
Batching and streaming permit higher performance on small matrices
Universally beneficial
Aspects apply to other factorization methods
Future work:
• Improved performance of batched routines
• Support for hybrid computing
• Complete multi-GPU support
RELATED WORK
S5232 - GPU Acceleration of WSMP (Watson Sparse Matrix Package)
Natalia Gimelshein, Anshul Gupta
S5316 - DAG-Scheduled Linear Algebra Using Template-Based Building Blocks
Jonathan Hogg
S5476 - Energy Efficient, High-Performance Solvers through Small Dense Matrix Computations on GPUs
Azzam Haidar, Stanimire Tomov
S5424 - Exploiting Multiple GPUs in Sparse QR: Regular Numerics with Irregular Data Movement
Tim Davis
S5237 - Jacobi-Davidson Eigensolver in Cusolver Library
Lung-Sheng Chien
THANK YOU