Page 1

STEVE RENNICH, SR. ENGINEER, NVIDIA DEVELOPER TECHNOLOGY

DARKO STOSIC, PHD CANDIDATE, UNIV. FEDERAL DE PERNAMBUCO

TIM DAVIS, PROFESSOR, CSE, TEXAS A&M UNIVERSITY

ACCELERATING SPARSE CHOLESKY FACTORIZATION ON THE GPU

Page 2

SPARSE MATRIX FACTORIZATION ON GPUS

  Objective: find methods for GPU acceleration of sparse Cholesky factorization

  Experiment using SuiteSparse 4.4.3 / CHOLMOD

  Outline:

  Sparse Cholesky Factorization

  Previous work / Issues

  'Branches' approach

Page 3

DIRECT SPARSE FACTORIZATION

  Dense block Cholesky

  Supernodes (compressed-column storage)

$$
\begin{pmatrix} A_{11} & A_{21}^T \\ A_{21} & A_{22} \end{pmatrix}
=
\begin{pmatrix} L_{11} & 0 \\ L_{21} & I \end{pmatrix}
\begin{pmatrix} I & 0 \\ 0 & A_{22}^* \end{pmatrix}
\begin{pmatrix} L_{11}^T & L_{21}^T \\ 0 & I \end{pmatrix}
$$

$L_{11} L_{11}^T = A_{11}$ : POTRF (dense Cholesky)

$L_{11} L_{21}^T = A_{21}^T$ : TRSM (triangular solve)

$A_{22}^* = A_{22} - L_{21} L_{21}^T$ : GEMM (matrix multiplication); $A_{22}^*$ is the Schur complement
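To make the three kernels above concrete, here is a minimal sketch of one step of dense block Cholesky using the LAPACKE/CBLAS interfaces. The in-place column-major layout and the helper name are illustrative assumptions; the symmetric update is shown with SYRK (which the slides later mention alongside GEMM), and a supernodal code such as CHOLMOD wraps these same kernels with sparse indexing.

```c
/* One step of dense block Cholesky on a column-major n x n matrix A,
 * partitioned as A11 (nb x nb diagonal block), A21 (panel below it),
 * and A22 (trailing submatrix). Minimal sketch; names are illustrative. */
#include <lapacke.h>
#include <cblas.h>

void block_cholesky_step(double *A, int n, int nb)
{
    int m = n - nb;                  /* rows below the diagonal block */
    double *A11 = A;                 /* diagonal block at (0,0)       */
    double *A21 = A + nb;            /* sub-diagonal panel at (nb,0)  */
    double *A22 = A + nb * n + nb;   /* trailing block at (nb,nb)     */

    /* POTRF: A11 = L11 * L11^T (dense Cholesky of the diagonal block) */
    LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', nb, A11, n);

    /* TRSM: solve L21 * L11^T = A21 for L21 (triangular solve) */
    cblas_dtrsm(CblasColMajor, CblasRight, CblasLower, CblasTrans,
                CblasNonUnit, m, nb, 1.0, A11, n, A21, n);

    /* SYRK: A22* = A22 - L21 * L21^T (Schur-complement update) */
    cblas_dsyrk(CblasColMajor, CblasLower, CblasNoTrans,
                m, nb, -1.0, A21, n, 1.0, A22, n);
}
```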

Page 4

DIRECT SPARSE FACTORIZATION

  Elimination tree

  Bulk of work is in assembling supernodes (wide range of descendant sizes)

  'Left-looking supernodal'

  Apply block Cholesky to supernodes

[Figure: example elimination tree over supernodes 1-7 and the corresponding sparse factor]

Pages 5-11: [Animation of the same slide: supernodes 1-7 are factored in sequence, each step applying POTRF to the supernode's diagonal block, TRSM to the panel below it, and GEMM for the descendant updates; fill-in appears in ancestor supernodes 6 and 7.]

Page 12

DIRECT SPARSE FACTORIZATION

  Lots of ‘small’ math

  Irregular access patterns

  Larger matrices -> more dense math

  Greater connectivity -> more dense math

  Factors can be large ( > 128 GB )

Page 13

PREVIOUS WORK

  Just send large BLAS-3 to GPU

  WORKS! For large, dense matrices

  Not so good for:

  small matrices

  large matrices with low connectivity (shells / beams in FEA)

  Find methods for further GPU acceleration of Sparse Factorization

Page 14

PREVIOUS WORK

  Send appropriately-sized BLAS calls to GPU

  'hide' PCIe communication

  Assemble supernodes on GPU

  Hybrid computing

[Chart: GFlop/s (0-800) across the Florida Sparse Matrix Collection, CPU vs. CPU + GPU (~1.5x); annotations: "why not higher?", "why so low?". Inset: supernodes ranked by score (decreasing cost to assemble); those above the row/column threshold (ndrow >= 256, ndcol >= 32) run on the GPU, the rest on the CPU.]

2 x Xeon E5-2698 v3 + K40 (max boost, ECC=off)   http://faculty.cse.tamu.edu/davis/suitesparse.html

SuiteSparse (CHOLMOD) 4.4.3
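The dispatch rule in the inset can be written down directly. This tiny sketch is illustrative only; the function name is an assumption, not CHOLMOD's actual identifier:

```c
/* Hypothetical illustration of the slide's GPU-dispatch threshold:
 * ndrow/ndcol are the dimensions of a supernode update; only updates
 * clearing both thresholds are sent to the GPU. */
static int send_to_gpu(int ndrow, int ndcol)
{
    return ndrow >= 256 && ndcol >= 32;
}
```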

Page 15

ISSUES

  PCIe communication

  Limits which BLAS operations can be accelerated on GPU

  Small BLAS

  Low occupancy

  Launch overhead

  Most BLAS calls don’t get sent to the GPU

  Seek methods which better accelerate factorization of small / minimally-connected matrices

[Chart: audikw_1.mtx, % of BLAS calls on the CPU]

Page 16

PROPOSED SOLUTION

  Factor branches on GPU

  Use previous methods for root

  No use of CPU

  Eliminates PCIe communication

  Requires POTRF, TRSM & GEMM on GPU

  Batch and stream BLAS operations

  Within levels

  Amortizes launch overhead

  Streamed to improve occupancy

  No size restriction

  Maps well to multi-GPU / hybrid computing

[Figure: elimination tree partitioned into branch 1 ... branch 4 (independent leaf subtrees) beneath root levels 0, 1, 2]
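The level structure in the figure can be made concrete with a small sketch, assuming the supernodal elimination tree is given as a postordered parent[] array (so parent[s] > s, with -1 at a root); the names are illustrative, not CHOLMOD's actual identifiers:

```c
/* Assign each supernode its level in the elimination tree (root = level 0).
 * Postorder guarantees parent[s] > s, so a reverse sweep visits every
 * parent before its children. parent[s] == -1 marks a root. */
void compute_levels(const int *parent, int nsuper, int *level)
{
    for (int s = nsuper - 1; s >= 0; s--)
        level[s] = (parent[s] < 0) ? 0 : level[parent[s]] + 1;
}
```

All supernodes sharing a level within a branch are mutually independent, so their POTRF/TRSM/GEMM calls can be batched together; the factorization processes levels from the deepest (leaves) up toward the root.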

Page 17

BATCHED / STREAMED BLAS

  Batch all BLAS calls to amortize kernel launch latency

  Stream multiple batches to increase occupancy

  Simply wrap cuBLAS subroutine with batch loop

  DGEMM w/ m,n,k=16 -> 40 GF

  DGEMM example, m,n,k=16

[Figure: timeline (time x stream) of host <-> device transfers ("data on host" / "data on device") and kernels; rates: 100 Mflops, 500 Mflops, batched: 1.2 Gflops, streamed: 4.8 Gflops]
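A minimal sketch of the "wrap cuBLAS with a batch loop" idea: many small DGEMMs are issued back-to-back and dealt out round-robin over CUDA streams, so launch latency is amortized and the kernels overlap. The pointer-list arguments and helper name are assumptions; a production version would create the streams once and reuse them across calls.

```c
#include <cublas_v2.h>
#include <cuda_runtime.h>

#define NSTREAMS 64   /* matches the 64 streams cited on the next slide */

/* Issue 'count' small m x n x k DGEMMs (device pointers in As/Bs/Cs),
 * distributing them round-robin across NSTREAMS CUDA streams. */
void streamed_dgemm_batch(cublasHandle_t handle, int m, int n, int k,
                          const double **As, const double **Bs, double **Cs,
                          int count)
{
    cudaStream_t streams[NSTREAMS];
    for (int s = 0; s < NSTREAMS; s++)
        cudaStreamCreate(&streams[s]);

    const double alpha = 1.0, beta = 1.0;   /* accumulate into C */
    for (int i = 0; i < count; i++) {
        /* Round-robin the small GEMMs over streams so they overlap. */
        cublasSetStream(handle, streams[i % NSTREAMS]);
        cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T, m, n, k,
                    &alpha, As[i], m, Bs[i], n, &beta, Cs[i], m);
    }
    cudaDeviceSynchronize();

    for (int s = 0; s < NSTREAMS; s++)
        cudaStreamDestroy(streams[s]);
}
```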

Page 18

BATCHED / STREAMED DGEMM

  Square DGEMM

  64 streams/threads

  Batched / streamed cuBLAS performance matches MKL for small sizes

  Created by wrapping existing, non-batched routines

  passing lists

[Chart: DGEMM Gflop/s (0-1400) vs. m,n,k (0-500): batched/streamed GPU, streamed GPU, streamed CPU]

2 x Xeon E5-2698 v3 + K40 (max boost, ECC=off)

Page 19

PLENTY OF PARALLELISM

  Lower levels: many supernodes, few descendants

  Upper levels: few supernodes, many descendants

[Chart: audikw_1.mtx, # of supernodes or GEMM + SYRK ops per elimination-tree level]

Page 20

BRANCHES

Matrix       #branches  #levels  #supernodes    #root levels  #root supernodes
Fault_639        2       18-19   14931-15794         1              1
nd24k            2       11        302-325           1              1
inline_1         4       16-17    3909-10633         1              1
Emilia_923       4       17-18   10314-11570         3              4
boneS10          4       18-23    7045-26182         1              1
ldoor            3       19-20   17413-35704         1              1
bone010          6       16-20    1957-23610         1              1
Hook_1498        9        1-18       1-33608         3              5
Geo_1438         8       17-18    8102-9335          5              9
Serena          60       10-17     189-4910         10             60
audikw_1         4       17-19    5631-22300         1              1
Flan_1564        8       15-17    3937-16309         2              2

[Figure: elimination tree split into branches 1-4]

Page 21

CHOLMOD RESULTS

  1.38x average speedup vs. previous CPU+GPU

  2x average speedup vs. CPU

  Poorly performing matrices see the greatest speedup

2 x Xeon E5-2698 v3 + K40 (max boost, ECC=off)   http://faculty.cse.tamu.edu/davis/suitesparse.html

[Chart: GFlop/s (0-900) across the Florida Sparse Matrix Collection: CPU vs. CPU + GPU vs. GPU Branches, SuiteSparse (CHOLMOD) 4.4.3]

Page 22

PCIE DEPENDENCE

  PCIe gen3 -> gen1: 12 GB/s -> 3 GB/s (75% bandwidth loss)

  CPU+GPU: 23% performance loss

  Branches: 17% performance loss

[Chart: Gflop/s (0-900) across the Florida Sparse Matrix Collection: 4.4.3 CPU+GPU and GPU Branches, each at PCIe gen1 vs. gen3]

1 x i7 3930K + K40 (max boost, ECC=on)

Page 23

SHELL MODEL PERFORMANCE

[Chart: numerical factorization rate (GF/s, 0-450) vs. millions of degrees of freedom (0-12): 4.4.3 CPU, 4.4.3 CPU+GPU, Branches 1xK40]

PCB model courtesy of Dr. Serban Georgescu, Fujitsu Laboratories of Europe Ltd

•  506,082 supernodes
•  640 branches
•  114–1,730 supernodes per branch
•  8-20 levels per branch
•  49 levels in root branch
•  637 supernodes in root branch

2 socket x 16 core HSW E5-2698 v3 @ 2.3 GHz
w/ 256 GB + 2xK40 (ECC=ON, full boost)

Page 24

SHELL MODEL PERFORMANCE

  ‘Branches’ algorithm well-suited for Multi-GPU

  4 x K40

  Overall 1.5x speedup

  Branches 3.1x speedup

  We’ve ported the previous algorithm to multi-GPU

•  506,082 supernodes
•  640 branches
•  114–1,730 supernodes per branch
•  8-20 levels per branch
•  49 levels in root branch
•  637 supernodes in root branch

[Figure: execution timelines for 1 x K40 vs. 4 x K40, showing host <-> device transfers and compute kernels over time]
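A sketch of why branches map so naturally to multiple GPUs: each branch is an independent subtree, so branches can simply be dealt out across devices. factor_branch() is a hypothetical stand-in for the single-GPU branch factorization described earlier; the root would still be factored afterwards with the hybrid method.

```c
#include <cuda_runtime.h>

void factor_branch(int branch);   /* hypothetical per-branch factorization */

/* Deal independent branches out to GPUs round-robin; one host thread per
 * loop iteration drives its assigned device. Compile with OpenMP
 * (e.g. -fopenmp) for the parallel loop. */
void factor_branches_multi_gpu(int nbranches, int ngpus)
{
    #pragma omp parallel for schedule(dynamic)
    for (int b = 0; b < nbranches; b++) {
        cudaSetDevice(b % ngpus);   /* bind this thread to a device */
        factor_branch(b);
    }
}
```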

Page 25

SHELL MODEL PERFORMANCE

[Chart: numerical factorization rate (GF/s, 0-1400) vs. millions of degrees of freedom (0-12): 4.4.3 CPU; 4.4.3 CPU+GPU; Branches with 1x, 2x, and 4x K40; 2xK40-Proj. and 4xK40-Proj. projections assuming 87.5% parallel efficiency]

PCB model courtesy of Dr. Serban Georgescu, Fujitsu Laboratories of Europe Ltd

2 socket x 16 core HSW E5-2698 v3 @ 2.3 GHz
w/ 256 GB + 2xK40 (ECC=ON, full boost)

Page 26

CONCLUSIONS

  Factoring 'branches' on the GPU avoids the PCIe bottleneck

  Batching and streaming permit higher performance on small matrices

  Universally beneficial

  Aspects apply to other factorization methods

  Future:

  Improved performance of batched routines

  Support hybrid computing

  Complete multi-GPU support

Page 27

RELATED WORK

  S5232 - GPU Acceleration of WSMP (Watson Sparse Matrix Package)
    Natalia Gimelshein, Anshul Gupta

  S5316 - DAG-Scheduled Linear Algebra Using Template-Based Building Blocks
    Jonathan Hogg

  S5476 - Energy Efficient, High-Performance Solvers through Small Dense Matrix Computations on GPUs
    Azzam Haidar, Stanimire Tomov

  S5424 - Exploiting Multiple GPUs in Sparse QR: Regular Numerics with Irregular Data Movement
    Tim Davis

  S5237 - Jacobi-Davidson Eigensolver in Cusolver Library
    Lung-Sheng Chien

Page 28

THANK YOU