Which Graph Representation to Select for Static Graph-Algorithms on a CUDA-capable GPU

Thorsten Blaß and Michael Philippsen

May 02, 2022


Motivation (I)

Your task: write a static graph algorithm for the GPU.

Preparation: check the literature.
□ There are several data structures to represent graphs.
□ All papers pick the CRS data structure and focus on optimizing algorithmic aspects.

You wonder:
□ Is the data structure irrelevant for performance?
□ Is it right to follow the crowd?

This paper: Graph representation matters!

Motivation (II)

§ Adequate data structure speeds up graph algorithms.

[Figure: normalized file-to-file runtime w.r.t. best selection (y-axis from 1.0 to 1.6, higher = slower) for the benchmarks BFS, MST, and SSSP. Two series: "With fastest data structure" (our improvements: fastest runs when choosing the adequate data structure) and "LonestarGPU codes with CRS data structure" (available Lonestar GPU benchmark implementations).]


Study Setup

§ What is the adequate data structure?

§ A total of 754,000 measurements:
□ 10 state-of-the-art static graph algorithms (do not modify the graph).
□ 19 input graphs.
□ 4 architecturally different Nvidia GPUs.
□ 10 graph data structures.
□ 3 widely used graph exchange formats.


Study Setup – Benchmark Graph Algorithms

§ 10 state-of-the-art implementations from recent research and benchmark suites with different characteristics.

Study Setup – Input Graphs

§ 19 input graphs with different characteristics.


Study Setup – GPUs

§ 4 architecturally different Nvidia GPUs
□ Titan XP, 12.2 GB, Pascal
□ Geforce GTX 980, 4.0 GB, Maxwell
□ Geforce GTX 680, 2.0 GB, Kepler
□ Geforce GTX 580, 1.0 GB, Fermi

§ Only Titan XP can run all measurement configurations.

§ All measurements scale w.r.t. nominal GPU performance.
→ Suffices to discuss the Titan XP measurements.

Decision Tree (file-to-file)

Edge-Centric?
  Y: best data structure COO/EL; best exchg. format GR/MTX [symm]
  N: Touch Count < 5?
    Y: best data structure C?S (CRS or CCS); best exchg. format RB [symm]
    N: Excess Rate < 20%?
      Y: best data structure ELL (HYBavg if ELL does not fit into memory); best exchg. format RB [symm]
      N: best data structure ELL (C?S if ELL does not fit into memory); best exchg. format RB [symm]

Decision Tree – First Layer (I)

[Figure: the file-to-file decision tree shown again: Edge-Centric, then Touch Count < 5, then Excess Rate < 20%.]


Decision Tree – First Layer (II)

§ First decision depends on whether the algorithm is edge-centric or node-centric.
□ Splits data structures into two groups.

§ For edge-centric algorithms COO and EL perform best.
□ The other formats are 25% slower in our measurements.

§ COOrdinate format
□ Stores source, target, and weight of an edge in different arrays.

§ Edge List format
□ Stores an edge as a struct in an array.

[Figure: example graph with 7 nodes (0 to 6) and 7 weighted edges.]

src     = 0 1 3 3 3 4 5
trgt    = 1 3 0 2 5 3 6
weights = 5 3 2 6 2 1 4

edges = (0,1,5), (1,3,3), (3,0,2), ⋯
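The difference between the two edge-centric layouts can be sketched in a few lines of host-side Python. This is only an illustration under assumptions: the names Edge, to_coo, and to_el are hypothetical and do not come from the talk, and a real GPU implementation would use device arrays instead of Python lists.

```python
from dataclasses import dataclass

# Example graph from the slide as (source, target, weight) triples.
EDGE_TRIPLES = [(0, 1, 5), (1, 3, 3), (3, 0, 2), (3, 2, 6),
                (3, 5, 2), (4, 3, 1), (5, 6, 4)]


@dataclass(frozen=True)
class Edge:
    """One EL entry: all fields of an edge are stored together."""
    src: int
    trgt: int
    weight: int


def to_coo(triples):
    """COO: one separate array per field (structure of arrays)."""
    src = [s for s, _, _ in triples]
    trgt = [t for _, t, _ in triples]
    weights = [w for _, _, w in triples]
    return src, trgt, weights


def to_el(triples):
    """EL: a single array of edge structs (array of structures)."""
    return [Edge(s, t, w) for s, t, w in triples]


src, trgt, weights = to_coo(EDGE_TRIPLES)
edges = to_el(EDGE_TRIPLES)
print(src, trgt, weights)
print(edges[:3])
```

Both layouts hold the same seven edges; they differ only in whether the fields of one edge are adjacent in memory (EL) or spread over three arrays (COO).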

Decision Tree – First Layer (III)

§ COO and EL show the same runtime behavior.
□ Either COO or EL can be picked for edge-centric algorithms.
□ Runtime can be minimized with MTX/GR exchange format.

[Figure: normalized file-to-file execution time w.r.t. the average of the fastest combination of input graph and exchange format per data structure and benchmark (y-axis from 0.95 to 1.35, higher = slower) for the benchmarks APSP, BFS, MIS, MST, PR-n, PR-e, SpMV, SSSP, WCC-n, and WCC-e. Legend entries: "COO ELL", "RBsymm", "RB", "MTX, MTXsymm, GR, GRsymm".]


Decision Tree – First Layer (IV)

§ The MTX and GR input file formats are dumps of the COO and EL formats.

§ No costly conversion from file to representation in CPU memory.

§ The RB format is 20% slower in our measurements.

<MTX header>
7 7 7          size of the array(s)
0 1 5          encoded edges
1 3 3
3 0 2
3 2 6
3 5 2
4 3 1
5 6 4

<GR header>
p sp 7 7 7     size of the array(s)
a 0 1 5        encoded edges
a 1 3 3
a 3 0 2
a 3 2 6
a 3 5 2
a 4 3 1
a 5 6 4
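Why such a file is cheap to load can be seen from a small reader that fills the COO arrays line by line without any reordering. This is a sketch for the simplified body shown on the slide, not a full Matrix Market reader: real Matrix Market files start with a "%%MatrixMarket" banner line and use 1-based indices, while the slide example is 0-based. The function name parse_edge_dump and the reading of the third number of the size line as the edge count are assumptions.

```python
def parse_edge_dump(text):
    """Parse a simplified MTX-like body into COO arrays.

    Assumed layout (as on the slide): the first non-empty line is the
    size line, every further line is "source target weight".
    """
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    size_line, edge_lines = lines[0], lines[1:]
    num_edges = int(size_line[2])  # assumption: third entry = edge count
    src, trgt, weights = [], [], []
    for s, t, w in edge_lines:
        # Each file line maps directly to one slot in each COO array.
        src.append(int(s))
        trgt.append(int(t))
        weights.append(int(w))
    if len(src) != num_edges:
        raise ValueError("edge count does not match the size line")
    return src, trgt, weights


MTX_BODY = """\
7 7 7
0 1 5
1 3 3
3 0 2
3 2 6
3 5 2
4 3 1
5 6 4
"""

src, trgt, weights = parse_edge_dump(MTX_BODY)
print(src, trgt, weights)
```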


Decision Tree – Second Layer (I)

§ Node-centric algorithms need a deeper analysis.

§ Compressed Row Storage (CRS) format
□ Most memory-efficient format.
□ Indirect addressing step necessary.

src (offsets) = 0 1 2 2 5 6 7 7
succ          = 1 3 0 2 5 3 6
weights       = 5 3 2 6 2 1 4

§ Compressed Column Storage (CCS) format
□ Same as CRS but stores the predecessors of a node.

src (offsets) = 0 1 2 3 5 5 6 7
pred          = 3 0 3 1 4 3 5
weights       = 2 5 6 3 1 2 4

[Figure: the example graph with 7 nodes (0 to 6) and 7 weighted edges.]
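The compressed formats can be derived from the COO arrays with a counting pass and a prefix sum. The following is a minimal host-side sketch, assuming the example graph of the slides; build_compressed and successors are hypothetical helper names, and swapping the roles of source and target yields CCS instead of CRS.

```python
def build_compressed(num_nodes, keys, others, weights):
    """Build offsets, neighbors, and weights arrays from COO arrays.

    keys = src,  others = trgt gives CRS (successors per node).
    keys = trgt, others = src  gives CCS (predecessors per node).
    """
    # Count the edges per key node, then prefix-sum into offsets.
    offsets = [0] * (num_nodes + 1)
    for k in keys:
        offsets[k + 1] += 1
    for i in range(num_nodes):
        offsets[i + 1] += offsets[i]

    # Scatter every edge into the slot range of its key node.
    next_slot = offsets[:-1].copy()
    neighbors = [0] * len(keys)
    out_weights = [0] * len(keys)
    for k, o, w in zip(keys, others, weights):
        slot = next_slot[k]
        neighbors[slot] = o
        out_weights[slot] = w
        next_slot[k] += 1
    return offsets, neighbors, out_weights


def successors(offsets, neighbors, node):
    """The indirect addressing step: read offsets first, then neighbors."""
    return neighbors[offsets[node]:offsets[node + 1]]


src = [0, 1, 3, 3, 3, 4, 5]
trgt = [1, 3, 0, 2, 5, 3, 6]
weights = [5, 3, 2, 6, 2, 1, 4]

crs_offsets, succ, crs_weights = build_compressed(7, src, trgt, weights)
ccs_offsets, pred, ccs_weights = build_compressed(7, trgt, src, weights)
print(crs_offsets, succ, crs_weights)
print(ccs_offsets, pred, ccs_weights)
print(successors(crs_offsets, succ, 3))
```

The offsets array has one entry more than there are nodes, so the neighbors of node v always lie between offsets[v] and offsets[v + 1].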

Decision Tree – Second Layer (II)

§ ELL format
□ Fixed row length reserved for successors.
□ Well suited for vector processors.
□ Memory efficiency depends on the degree distribution of a graph.

succ      weights
1 ∗ ∗     5 ∗ ∗
3 ∗ ∗     3 ∗ ∗
∗ ∗ ∗     ∗ ∗ ∗
0 2 5     2 6 2
3 ∗ ∗     1 ∗ ∗
6 ∗ ∗     4 ∗ ∗
∗ ∗ ∗     ∗ ∗ ∗

§ HYBrid format
□ Better memory efficiency than ELL.
□ Heuristics (avg, dstr) find a smaller row length.
□ Additional successors are stored in EL.

ell (succ)    = 1 3 ∗ 0 3 6 ∗
ell (weights) = 5 3 ∗ 2 1 4 ∗
el            = (3,2,6), (3,5,2)

[Figure: the example graph with 7 nodes (0 to 6) and 7 weighted edges.]
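Both formats can be produced by the same row-filling routine that differs only in the chosen row length. The sketch below is an illustration under assumptions: build_ell and max_degree are hypothetical names, PAD stands for the "∗" entries, and the "avg" heuristic is assumed here to be the rounded-up average out-degree (the talk does not spell out the rounding, and the "dstr" heuristic is not modeled).

```python
import math

PAD = None  # stands for the "∗" padding entries on the slide


def build_ell(num_nodes, src, trgt, weights, row_length):
    """Fill fixed-length rows; edges that do not fit are returned as EL."""
    succ = [[PAD] * row_length for _ in range(num_nodes)]
    wgts = [[PAD] * row_length for _ in range(num_nodes)]
    used = [0] * num_nodes
    overflow = []  # (source, target, weight) triples
    for s, t, w in zip(src, trgt, weights):
        if used[s] < row_length:
            succ[s][used[s]] = t
            wgts[s][used[s]] = w
            used[s] += 1
        else:
            overflow.append((s, t, w))
    return succ, wgts, overflow


def max_degree(num_nodes, src):
    """Largest out-degree, i.e. the row length that plain ELL needs."""
    degree = [0] * num_nodes
    for s in src:
        degree[s] += 1
    return max(degree)


src = [0, 1, 3, 3, 3, 4, 5]
trgt = [1, 3, 0, 2, 5, 3, 6]
weights = [5, 3, 2, 6, 2, 1, 4]

# Plain ELL: rows as long as the largest out-degree, nothing overflows.
ell_succ, ell_wgts, rest = build_ell(7, src, trgt, weights,
                                     max_degree(7, src))

# HYB with the assumed "avg" heuristic: shorter rows plus an EL remainder.
avg_len = math.ceil(len(src) / 7)
hyb_succ, hyb_wgts, hyb_el = build_ell(7, src, trgt, weights, avg_len)
print(ell_succ)
print(hyb_succ, hyb_el)
```

For the example graph, plain ELL reserves 21 successor slots for 7 edges, while the hybrid variant reserves 7 slots and moves two edges of node 3 into the edge list.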

Decision Tree – Second Layer (III)

§ Data structures cause costs for construction, shipping, ...

[Figure: normalized file-to-file execution time w.r.t. best data structure (y-axis from 1.0 to 2.0, higher = slower) over touch count (x-axis from 0 to 30). Series: C?S (RB), ELL (RB), HYB_avg (RB).]

Touch count splits the measurements into two groups.

We will discuss this later.

CRS or CCS perform best, followed by ELL and HYBavg.


Decision Tree – Second Layer (IV)

§ The Rutherford Boeing (RB) format is a dump of the CRS format.

§ Other file formats are 40% slower in our measurements.

8 2 2 2        information about the structure of the
iga 7 7 7      encoded graph and the file
11I4 11I3
0 1 2 2        src (offsets)
5 6 7
1 3 0 2        succ/pred
5 3 6
5 3 2 6        weights
2 1 4


Decision Tree – Third Layer (I)

[Figure: normalized file-to-file execution time w.r.t. best data structure (y-axis from 1.0 to 2.0, higher = slower) over touch count (x-axis from 0 to 30). Series: C?S (RB), ELL (RB), HYB_avg (RB).]

No best single data structure.

Inconclusive! Performance also depends on input graphs.

Decision Tree – Third Layer (II)

[Figure: per input graph, the normalized file-to-file execution time w.r.t. best data structure (axis from 1.0 to 2.6) and the excess, i.e. the percentage of nodes with higher degree than the average (axis from 0 to 60). Series: ELL (RB), avg. ELL (RB), ELL (RB_sym), C?S (RB), avg. C?S (RB), C?S (RB_sym), HYB_avg (RB), avg. HYB_avg (RB), HYB_avg (RB_sym), excess. Input graphs: road_BAY, road_CAL, road_COL, road_CTR, road_FLA, road_USA, msdoor, pwtk, rgg_n_2_24, amazon-08, coAuthors, dblp-10, hollywood-09, indochina-04, it-04, kron_g500, ljournal-08, twitter-retweed, soc-orkut.]

Graphs are sorted in ascending order by their excess rate.

If ELL fits into memory, it is always the fastest choice. (Shaded in the figure = ELL too large.)

Excess Rate < 20%
  Y: HYBavg faster
  N: C?S faster

Decision Tree (file-to-file)

[Figure: the file-to-file decision tree shown again, annotated with the two metrics.]

□ Touch count: how often the algorithm touches every node (on average).
□ Excess rate: percentage of nodes that have an above-average degree.
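The excess rate depends only on the input graph and can be computed before choosing a data structure. The following sketch assumes that "degree" means out-degree, which the slides do not state explicitly; excess_rate is a hypothetical helper name. The touch count, in contrast, depends on the algorithm and is not modeled here.

```python
def excess_rate(num_nodes, src):
    """Percentage of nodes whose out-degree is above the average."""
    degree = [0] * num_nodes
    for s in src:
        degree[s] += 1
    average = len(src) / num_nodes
    above = sum(1 for d in degree if d > average)
    return 100.0 * above / num_nodes


# Source array of the small example graph used on the earlier slides.
src = [0, 1, 3, 3, 3, 4, 5]
rate = excess_rate(7, src)
print(f"excess rate: {rate:.1f}%")  # prints "excess rate: 14.3%"
```

In the example graph only node 3 has more than the average of one outgoing edge, so one of seven nodes counts toward the excess rate.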

Decision Tree – Kernel only

Edge-Centric?
  Y: best data structure COO/EL
  N: best data structure ELL (HYBdstr if ELL does not fit into memory)

The remaining entries of the file-to-file tree are marked n.a.

Kernel-only results

§ Without I/O, data structure construction, and CPU ↔ GPU shipping.

[Figure: per input graph, the normalized kernel runtime w.r.t. best data structure (axis from 1 to 2.8) and the excess, i.e. the percentage of nodes with higher degree than the average (axis from 0 to 60), for the 19 input graphs road_BAY through soc-orkut. Series: ELL, avg. ELL, CRS, avg. CRS, HYB_avg, avg. HYB_avg, HYB_dstr, avg. HYB_dstr, excess. Shaded = ELL too large.]

ELL remains the fastest graph structure if it fits into memory.

Otherwise HYBdstr is the best choice.

Conclusion

§ There is no single best graph data structure for GPUs.

§ Three decision layers:
□ Edge-centric vs. node-centric
□ Touch count threshold around 5
□ Excess rate threshold around 20%

§ Our decision tree can help developers pick an adequate data structure for their use case.

§ Choosing the adequate data structure can speed up graph algorithms by up to 45%.
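The three decision layers of the file-to-file tree can be written down as a small selection function. This is a sketch of the tree as it appears in these slides, not code from the authors: the function name, the parameter names, and the explicit ell_fits_into_memory flag are assumptions, and the thresholds are the approximate values (5 and 20%) named above.

```python
def pick_representation(edge_centric, touch_count, excess_rate_percent,
                        ell_fits_into_memory):
    """Return (data structure, exchange format) per the file-to-file tree."""
    # Layer 1: edge-centric algorithms use the edge-oriented layouts.
    if edge_centric:
        return "COO/EL", "GR/MTX[symm]"
    # Layer 2: with few touches per node, cheap construction wins.
    if touch_count < 5:
        return "C?S", "RB[symm]"
    # Layer 3: ELL if it fits; otherwise the excess rate decides.
    if ell_fits_into_memory:
        return "ELL", "RB[symm]"
    if excess_rate_percent < 20:
        return "HYBavg", "RB[symm]"
    return "C?S", "RB[symm]"


print(pick_representation(True, 1, 50.0, False))
print(pick_representation(False, 3, 10.0, True))
print(pick_representation(False, 12, 10.0, False))
print(pick_representation(False, 12, 35.0, False))
```

Here "C?S" stands for CRS or CCS, as in the slides; which of the two applies depends on whether the algorithm needs successors or predecessors.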

Thank you for your attention! Any questions?