USENIX ATC 2018
Transcript
Page 1:

USENIX ATC 2018

Page 2:

CGraph: A Correlations-aware Approach for Efficient Concurrent Iterative Graph Processing

Page 3:

Part 1

Background and Challenges

Page 4:

What is a CGP (Concurrent Graph Processing) Job?

[Diagram: many jobs (PageRank, k-means, SSSP, …) run on one platform and share the same graph data.]

Many concurrent graph processing (CGP) jobs are executed daily over the same graph (or its different snapshots) to provide various information for different products.

Page 5:

What is a CGP Job?

(Same diagram as Page 4.)

Page 6:

What is a CGP Job?

[Figure: (a) number of CGP jobs over time (hours); (b) ratio of graph data shared by more than 1, 2, 4, 8, and 16 jobs (%) over time (hours). The information was traced over a large social network.]

Page 7:

What is a CGP Job?

More than 20 CGP jobs concurrently analyze the same graph at peak time.

(Same figure as Page 6.)

Page 8:

What is a CGP Job?

This causes serious cache interference and memory-wall pressure.

(Same figure as Page 6.)

Page 9:

Challenges: Data Access Problems in the CGP Jobs

[Figure: (a) average execution time and (b) average data access time of each job (PageRank, SSSP, SCC, BFS) as the number of jobs grows from 1 to 8.]

The average execution time of each job is significantly prolonged as the number of jobs increases, due to the higher data access cost.

Page 10:

Challenges: An Example

Reason: the CGP jobs contend for the data access channel, memory, and cache.

[Timeline: within one iteration of graph processing, J1, J2, and J3 traverse the shared partitions in different orders (J3: P1, P2, P3, P4; J2: P4, P3, P2, P1; J1: P2, P4, P1, P3), each in its own iteration (n1, n2, n3).]

➢ The CGP jobs access the shared graph partitions individually, along different graph paths.

➢ The processing time of each partition varies across jobs.

Page 11:

Motivations

(Figure (b) from Page 6: ratio of shared graph data over time.)

Observations:

-Spatial correlation

-Temporal correlation

Page 12:

Motivations

Observations:

-Spatial correlation: The intersections of the sets of graph partitions to be handled by different CGP jobs in each iteration are large (more than 75% of all active partitions on average).

(Figure (b) from Page 6.)

-Temporal correlation

Page 13:

Motivations

Observations:

-Spatial correlation: The intersections of the sets of graph partitions to be handled by different CGP jobs in each iteration are large (more than 75% of all active partitions on average).

-Temporal correlation: Some graph partitions may be accessed by multiple CGP jobs (possibly more than 16) within a short time window.

(Figure (b) from Page 6.)

Page 14:

Motivations

Goal: develop a solution that uses the cache/memory and the data access channel efficiently, achieving higher throughput by fully exploiting the spatial/temporal correlations.

(Observations as on Page 13; Figure (b) from Page 6.)

Page 15:

Motivations: An Example

• Load the shared partitions for the related jobs in a common order. This provides the opportunity to consolidate accesses to the shared graph structure and to keep a single copy of the shared data in the cache, serving multiple CGP jobs at the same time.

➢ Spatial Correlations
➢ Temporal Correlations

[Timeline: within one iteration, J1-J5 now traverse the shared partitions in the common order P1, P2, P3, P4 (J4 touches only P2 and P4; J5 touches P1, P3, and P4), each in its own iteration (n1-n5).]

Page 16:

Motivations: An Example

• Load the shared partitions for the related jobs in a common order (as on Page 15).

➢ Spatial Correlations
➢ Temporal Correlations

• Take the temporal correlations into account, e.g., the usage frequency of the graph partitions, when loading them into the cache (a minimal sketch follows below).

(Same timeline as on Page 15.)
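A minimal C++ sketch of this consolidation, assuming the common loading order P1..P4 from the timeline above; Partition, Job, and the job lists are illustrative names, not CGraph's code:

    #include <cstdio>
    #include <vector>

    // One loaded copy of each shared partition serves every job that needs it.
    struct Partition { int id; };            // stands in for the shared structure data

    struct Job {
        int id;
        std::vector<int> needed;             // partitions this job must process
        void process(const Partition& p) {   // job-specific vertex computation
            std::printf("job %d processes partition %d\n", id, p.id);
        }
    };

    int main() {
        // J1-J3 need all partitions; J4 and J5 need subsets, as in the timeline.
        std::vector<Job> jobs = {
            {1, {1, 2, 3, 4}}, {2, {1, 2, 3, 4}}, {3, {1, 2, 3, 4}},
            {4, {2, 4}}, {5, {1, 3, 4}}};
        for (int pid = 1; pid <= 4; ++pid) { // the common loading order
            Partition p{pid};                // loaded once, kept in cache
            for (auto& j : jobs)             // consolidated: all related jobs consume
                for (int n : j.needed)       // the same in-cache copy
                    if (n == pid) j.process(p);
        }
    }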

Page 17:

Part 2

Related Work

Page 18:

Existing Graph Processing Systems

GraphChi X-Stream GridGraph NXgraph CLIP …

Single graph processing

Page 19:

Existing Graph Processing Systems

GraphChi X-Stream GridGraph NXgraph CLIP …

Single graph processing

These systems mainly focus on a single graph processing job, optimizing for:
• Higher sequential memory bandwidth
• Better data locality
• Fewer redundant data accesses
• Less memory consumption …

Page 20:

Existing Graph Processing Systems

GraphChi X-Stream GridGraph NXgraph CLIP …

Single graph processing Concurrent graph processing

(Annotations as on Page 19.)

Page 21:

Part 3

Our Approach: A Correlations-aware Data-centric Execution Model

Page 22:

Main Goals

Minimize the redundant cost of accessing and storing the shared graph structure data (which occupies more than 70% of the total memory of each job) by fully exploiting the spatial/temporal correlations between the CGP jobs.

Page 23:

Data-centric LTP Execution Model

➢ Traditional approach: most graph structure data G = (V, E, W) is the same for different CGP jobs.

Page 24:

Data-centric LTP Execution Model

➢ Traditional approach: each CGP job operates on its own data D = (V, S, E, W), where most of the graph structure data G = (V, E, W) is the same for different CGP jobs.

➢ Load-Trigger-Pushing (denoted by LTP) model:

Page 25:

Data-centric LTP Execution Model

➢ Load-Trigger-Pushing (denoted by LTP) model:

• Graph Loading: the shared graph structure data is loaded from memory/disk into the Global Space (which stores the shared graph structure data).

Page 26:

Data-centric LTP Execution Model

➢ Load-Trigger-Pushing (denoted by LTP) model:

• Graph Loading: (as on Page 25).
• Trigger and Parallel Execution: each loaded partition triggers the related CGP jobs, which process it in parallel (parallel trigger).

Page 27:

Data-centric LTP Execution Model

➢ Load-Trigger-Pushing (denoted by LTP) model:

• Graph Loading: (as on Page 25).
• Trigger and Parallel Execution: (as on Page 26).
• State Pushing: each job then pushes out its new vertex states (state push).
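The three phases might look like the following minimal C++ sketch; every type and function here is an assumption made for illustration, not CGraph's API:

    #include <cstdio>
    #include <vector>

    struct GraphPartition { int id; };   // shared structure data (vertices/edges/weights)
    struct JobState { int job_id; };     // job-specific vertex states (private space)

    GraphPartition load_from_storage(int pid) {          // phase 1: Graph Loading
        return GraphPartition{pid};
    }
    void compute(const GraphPartition& p, JobState& s) { // phase 2: parallel execution
        std::printf("job %d triggered on partition %d\n", s.job_id, p.id);
    }
    void push_states(JobState& s) {                      // phase 3: State Pushing
        std::printf("job %d pushes its new vertex states\n", s.job_id);
    }

    int main() {
        std::vector<JobState> jobs = {{1}, {2}, {3}};
        for (int pid : {1, 2}) {                            // common loading order
            GraphPartition shared = load_from_storage(pid); // one copy in global space
            for (auto& j : jobs) compute(shared, j);        // trigger every registered job
            for (auto& j : jobs) push_states(j);            // push new states out
        }
    }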

Page 28:

Illustration of Our LTP Model

[Diagram: Memory/Disk holds graph structure Partitions 1 and 2; a Scheduler arranges the loading order of the graph structure partitions into the Global Space (shown holding Partition 1: v1, v2, v3 with edge weight 2.9); the PageRank job and the SSSP job each have their own job-specific space.]

PageRank job:
    IsNotConvergent(vh):
        return |vh.Δvalue| > ε
    Acc(value1, value2):
        return value1 + value2
    Compute(Gi, vh):  // processing of each vertex
        vh.value ← Acc(vh.value, vh.Δvalue)
        <links> ← look up outlinks of vh from Gi
        for (each link <vh, ve> ∈ <links>) {
            Δvalue ← d × vh.Δvalue / Gi[vh].OutDegree
            ve.Δvalue ← Acc(ve.Δvalue, Δvalue)
        }

SSSP job:
    IsNotConvergent(vh):
        return |vh.Δvalue| > 0
    Acc(value1, value2):
        return min(value1, value2)
    Compute(Gi, vh):  // processing of each vertex
        vh.value ← Acc(vh.value, vh.Δvalue)
        <links> ← look up outlinks of vh from Gi
        for (each link <vh, ve> ∈ <links>) {
            Δvalue ← vh.value + <vh, ve>.distance
            ve.Δvalue ← Acc(ve.Δvalue, Δvalue)
        }

[Cache: Partition 1 (v1, v2, v3; edge weight 2.9) and Partition 2 (v3, v4, v5; edge weight 1.5).]

Job-specific vertex values:
                PageRank job    SSSP job
    Vertex ID   Value           Value
    v1          0.2             1.2
    v2          0.1             0
    v3          0.25            2.9
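A runnable C++ rendering of the PageRank kernel above; the adjacency representation, damping constant, and initialization are assumptions made for the example, and clearing the consumed Δvalue is made explicit (the SSSP variant differs only in Acc and the Δvalue formula):

    #include <cmath>
    #include <cstdio>
    #include <vector>

    struct Vertex { double value = 0.0, delta = 0.0; };

    const double d = 0.85;    // damping factor (assumed)
    const double eps = 1e-6;  // convergence threshold (the slide's ε)

    bool is_not_convergent(const Vertex& v) { return std::fabs(v.delta) > eps; }
    double acc(double a, double b) { return a + b; }

    // Compute(Gi, vh): fold the pending delta into vh.value, then spread a damped
    // share of it along vh's outlinks, mirroring the slide's pseudocode.
    void compute(std::vector<Vertex>& g,
                 const std::vector<std::vector<int>>& out, int vh) {
        g[vh].value = acc(g[vh].value, g[vh].delta);
        if (!out[vh].empty()) {
            double share = d * g[vh].delta / out[vh].size();
            for (int ve : out[vh]) g[ve].delta = acc(g[ve].delta, share);
        }
        g[vh].delta = 0.0;  // the pending delta has been consumed
    }

    int main() {
        std::vector<Vertex> g(3);                               // v1, v2, v3
        std::vector<std::vector<int>> out = {{2}, {0, 2}, {}};  // v1->v3; v2->v1,v3
        for (auto& v : g) v.delta = 1.0 - d;                    // assumed initialization
        for (int step = 0; step < 20; ++step)
            for (int vh = 0; vh < 3; ++vh)
                if (is_not_convergent(g[vh])) compute(g, out, vh);
        for (int i = 0; i < 3; ++i) std::printf("v%d: %.4f\n", i + 1, g[i].value);
    }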

Page 29:

Implementations: Graph Storage for Multiple CGP Jobs

Private Table Partitions (one per job):

PageRank Job:
    Partition 1            Partition 2
    Vertex ID  Value       Vertex ID  Value
    v1         0.2         v3         0.05
    v2         0.1         v4         0.1
    v3         0.25        v5         0.3

SSSP Job:
    Partition 1            Partition 2
    Vertex ID  Value       Vertex ID  Value
    v1         1.2         v3
    v2         0           v4
    v3         2.9         v5

Graph Structure Partitions (shared by all jobs):

    Partition 1:
    Vertex ID  Edge List  Flag    Master Location  Information Associated with Its Edges
    v1         v3         Master  Partition 1      1.1
    v2         v1, v3     Master  Partition 1      1.2, 2.9
    v3         Ø          Master  Partition 1      Ø

    Partition 2:
    Vertex ID  Edge List  Flag    Master Location  Information Associated with Its Edges
    v3         v5         Mirror  Partition 1      1.5
    v4         v3, v5     Master  Partition 2      0.9, 2.5
    v5         Ø          Master  Partition 2      Ø

[Diagram: Partition 1 holds v1, v2, v3 (edge weight 2.9); Partition 2 holds v3, v4, v5 (edge weight 1.5); v3 spans both.]
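One plausible way to lay these tables out in code; a sketch only, where the container choices and any name beyond the slide's columns are assumptions:

    #include <unordered_map>
    #include <vector>

    enum class Flag { Master, Mirror };              // replica role of a vertex

    struct StructureRow {                            // shared across all CGP jobs
        std::vector<int> edge_list;                  // outgoing neighbors
        Flag flag;                                   // Master or Mirror copy
        int master_partition;                        // where the master row lives
        std::vector<double> edge_weights;            // information associated with edges
    };

    struct GraphStructurePartition {                 // e.g. Partition 1, Partition 2
        std::unordered_map<int, StructureRow> rows;  // vertex id -> structure row
    };

    struct PrivateTablePartition {                   // one per job, per partition
        std::unordered_map<int, double> value;       // vertex id -> job's vertex value
    };

    int main() {
        GraphStructurePartition p1;
        p1.rows[2] = {{1, 3}, Flag::Master, 1, {1.2, 2.9}};  // v2 -> v1, v3 (slide row)
        PrivateTablePartition pagerank_p1, sssp_p1;          // same structure, two jobs
        pagerank_p1.value[2] = 0.1;                          // PageRank's value for v2
        sssp_p1.value[2] = 0.0;                              // SSSP's value for v2
    }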

Page 30:

Implementations: Details to Store Evolving Graph Structure

[Diagram: along the time axis, one snapshot stores Partitions 1-4 in full, a second stores only Partitions 2 and 4, and a third stores only Partition 4; Jobs 1, 2, and 3 run against the graph at Timestamps 1, 2, and 3, respectively.]
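A minimal copy-on-write sketch of that idea, assuming consecutive snapshots share the partitions that did not change; shared_ptr stands in for whatever sharing mechanism CGraph actually uses:

    #include <cstdio>
    #include <memory>
    #include <vector>

    struct Partition { int version; };
    using Snapshot = std::vector<std::shared_ptr<Partition>>;

    int main() {
        Snapshot t1;                              // Timestamp 1: Partitions 1..4 stored
        for (int i = 0; i < 4; ++i) t1.push_back(std::make_shared<Partition>(1));

        Snapshot t2 = t1;                         // Timestamp 2: only 2 and 4 changed,
        t2[1] = std::make_shared<Partition>(2);   // so only those are stored anew
        t2[3] = std::make_shared<Partition>(2);

        Snapshot t3 = t2;                         // Timestamp 3: only 4 changed
        t3[3] = std::make_shared<Partition>(3);

        for (int i = 0; i < 4; ++i)               // unchanged partitions stay shared
            std::printf("partition %d: t1=v%d t2=v%d t3=v%d\n",
                        i + 1, t1[i]->version, t2[i]->version, t3[i]->version);
    }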

Page 31:

Implementations: Load of Partitions

[Diagram: (a) with only J1, Partitions 1-4 are each queued for J1; (b) after J2 is submitted, some partitions are queued for both J1 and J2; (c) after J3 is submitted, Partitions 1 and 2 are queued for J1, J2, and J3, while Partitions 3 and 4 remain queued for J1 only.]

Page 32:

Implementations: Load of Partitions

A core-subgraph based scheduling algorithm can be used to maximize the utilization ratio of each partition loaded into the cache (a sketch follows the diagram below).

[Diagram: an example graph whose vertices (1-29) are grouped into subgraphs A-E.]
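A sketch of the bookkeeping Pages 31-32 imply, under the assumption that the scheduler favors the partition wanted by the most jobs to maximize its in-cache utilization; the API is illustrative, not CGraph's:

    #include <cstdio>
    #include <set>
    #include <vector>

    struct PartitionQueues {
        std::vector<std::set<int>> waiting;                 // waiting[p] = jobs needing p
        explicit PartitionQueues(int n) : waiting(n) {}
        void submit(int job, const std::set<int>& parts) {  // a new job joins the queues
            for (int p : parts) waiting[p].insert(job);
        }
        int best_partition() const {                        // most-demanded partition:
            int best = 0;                                   // loading it once serves the
            for (int p = 1; p < (int)waiting.size(); ++p)   // largest number of jobs
                if (waiting[p].size() > waiting[best].size()) best = p;
            return best;
        }
    };

    int main() {
        PartitionQueues q(4);
        q.submit(1, {0, 1, 2, 3});                          // (a) only J1
        q.submit(2, {0, 1});                                // (b) J2 submitted
        q.submit(3, {0, 1});                                // (c) J3 submitted
        std::printf("load Partition %d first\n", q.best_partition() + 1);
    }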

Page 33:

Implementations: Parallel Processing of Graph Partition

[Diagram: the cache holds Graph Structure Partition 1 together with the private Partition 1 of Jobs 1, 2, and 3; Job 1 runs on Core 1, Job 2 on Cores 2 and 3, and Job 3 on Core 4.]

Page 34:

Implementations: Parallel Processing of Graph Partition

[Diagram: same cache layout; now Job 1 runs on Cores 1 and 2, Job 2 on Core 3, and Job 3 on Core 4.]

Page 35:

Implementations: Vertex State Synchronization

[Diagram: v3, v4, and v6 have master and mirror copies spread across Partitions 1-3. For synchronization from Master to Mirrors and from Mirrors to Master, the slide contrasts a non-optimized message list (P1:v3->P2:v3, P1:v4->P2:v4, P1:v6->P3:v6, …; P2:v3->P1:v3, P2:v4->P1:v4, P2:v3->P3:v3, P2:v4->P3:v4, …) with an optimized one.]
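The transcript does not spell out the optimization, so here is a hedged C++ sketch assuming it coalesces updates bound for the same destination partition into one batched message; all types are illustrative:

    #include <cstdio>
    #include <map>
    #include <vector>

    struct Update { int vertex; double state; };

    int main() {
        // Pending master->mirror updates from Partition 1 (vertex, new state, dest).
        struct Pending { int vertex; double state; int dest_partition; };
        std::vector<Pending> pending = {{3, 2.9, 2}, {4, 0.9, 2}, {6, 1.5, 3}};

        // Optimized: group by destination, so P2 gets one batch {v3, v4}, P3 gets {v6},
        // instead of one message per replica pair as in the non-optimized list.
        std::map<int, std::vector<Update>> batches;
        for (const auto& u : pending)
            batches[u.dest_partition].push_back({u.vertex, u.state});

        for (const auto& [dest, batch] : batches) {
            std::printf("P1 -> P%d: one message with %zu updates\n", dest, batch.size());
            for (const auto& u : batch) std::printf("  v%d = %.1f\n", u.vertex, u.state);
        }
    }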

Page 36:

Part 4

Performance Evaluation

Page 37:

Evaluation

➢ Experimental setup

➢ Machine information
  - CPU: 4-way 8-core Intel Xeon E5-2670; each CPU has 20 MB LLC
  - Main memory: 64 GB

➢ Typical graph algorithms
  - PageRank, SSSP, SCC, BFS

➢ Data sets

    Data sets    Vertices  Edges   Sizes
    Twitter      41.7 M    1.4 B   17.5 GB
    Friendster   65 M      1.8 B   22.7 GB
    uk2007       105.9 M   3.7 B   46.2 GB
    uk-union     133.6 M   5.5 B   68.3 GB
    hyperlink14  1.7 B     64.4 B  480.0 GB

Properties of data sets

Page 38:

Evaluation

[Figure: (a) execution time breakdown of different jobs on hyperlink14: the share (%) of vertex processing time vs. time for data accessing for PageRank, SSSP, SCC, and BFS under CGraph, Seraph, NXgraph, and CLIP; (b) total execution time for the four jobs with different solutions: normalized execution time over each data set for CLIP, NXgraph, Seraph, and CGraph.]

Page 39:

Evaluation

[Figure: (a) volume of data swapped into the cache for the four jobs: normalized volume over each data set for CLIP, NXgraph, Seraph, and CGraph; (b) I/O overhead for the four jobs with different solutions: normalized I/O overhead over each data set.]

Page 40:

Evaluation

[Figure: (a) ratio of spared accessed data on hyperlink14 (%) for 1, 2, 4, and 8 jobs under Seraph-VT, Seraph, and CGraph; (b) execution time for the four jobs without/with our scheduler: normalized execution time (%) over each data set for CGraph-without vs. CGraph.]

Page 41:

Part 5

Conclusions

Page 42:

Conclusions

➢ What CGraph brings to graph processing:
  ➢ An analysis of the temporal/spatial correlations in concurrent graph processing
  ➢ A novel data-centric LTP model for concurrent graph processing
  ➢ A core-subgraph based scheduling scheme

➢ Future work:
  ➢ How to further optimize the approach for evolving graph analysis
  ➢ How to ensure QoS for real-time CGP jobs
  ➢ How to extend it to distributed and heterogeneous platforms consisting of GPUs, FPGAs, and even ASICs for higher throughput

Page 43:

Thanks!

Service Computing Technology and System Lab., MoE (SCTS)

Cluster and Grid Computing Lab., Hubei Province (CGCL)