Efficient Graph Processing with Distributed Immutable View

Rong Chen+, Xin Ding+, Peng Wang+, Haibo Chen+, Binyu Zang+ and Haibing Guan*

+ Institute of Parallel and Distributed Systems
* Department of Computer Science
Shanghai Jiao Tong University

HPDC 2014


Big Data Everywhere

□ 100 hours of video every minute
□ 1.11 billion users
□ 6 billion photos
□ 400 million tweets/day

How do we understand and use Big Data?

NLP

Big Data Big Learning

Machine Learning and Data Mining

It’s about the graphs ...


Example: PageRank

A centrality analysis algorithm to measure the relative rank for each element of a linked set

Characteristics
□ Linked set → data dependence
□ Rank of who links it → local accesses
□ Convergence → iterative computation

$R_i = \alpha + (1-\alpha) \sum_{(j,i) \in E} \omega_{ij} R_j$
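As a worked instance of the rank formula, assume α = 0.15 (the constant that appears in the compute function of the next slide) and ω_ij = 1/out-degree(j); both choices are assumptions made for illustration. Suppose vertex i has two in-neighbors, each with rank 1.0: one with two out-edges and one with a single out-edge.

```latex
R_i = \alpha + (1-\alpha) \sum_{(j,i) \in E} \omega_{ij} R_j
    = 0.15 + 0.85 \left( \tfrac{1}{2} \cdot 1.0 + 1 \cdot 1.0 \right)
    = 0.15 + 0.85 \cdot 1.5
    = 1.425
```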

Existing Graph-parallel Systems

“Think as a vertex” philosophy
1. aggregate values of neighbors
2. update its own value
3. activate neighbors

compute(v)    // PageRank
  double sum = 0
  double value, last = v.get()
  foreach (n in v.in_nbrs)
    sum += n.value / n.nedges    // 1. aggregate
  value = 0.15 + 0.85 * sum
  v.set(value)                   // 2. update
  activate(v.out_nbrs)           // 3. activate
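The three steps of the vertex program can be sketched as runnable Python. This is a minimal single-machine illustration, not the code of any of the systems discussed: the function name, the convergence threshold `tol`, and the rule of activating out-neighbors only when a value changes are assumptions added here to show dynamic computation.

```python
from collections import defaultdict

def pagerank_vertex_centric(edges, tol=1e-6, max_iters=100):
    """edges: list of (src, dst) pairs. Returns a dict vertex -> rank."""
    in_nbrs, out_nbrs = defaultdict(list), defaultdict(list)
    vertices = set()
    for s, d in edges:
        out_nbrs[s].append(d)
        in_nbrs[d].append(s)
        vertices.update((s, d))

    value = {v: 1.0 for v in vertices}
    active = set(vertices)                 # every vertex starts active
    for _ in range(max_iters):
        if not active:
            break
        new_value, next_active = dict(value), set()
        for v in active:
            # 1. aggregate the values of in-neighbors
            s = sum(value[n] / len(out_nbrs[n]) for n in in_nbrs[v])
            # 2. update the vertex's own value
            new_value[v] = 0.15 + 0.85 * s
            # 3. activate out-neighbors (assumed rule: only if the value changed)
            if abs(new_value[v] - value[v]) > tol:
                next_active.update(out_nbrs[v])
        # All updates become visible together, as in a synchronous engine.
        value, active = new_value, next_active
    return value
```

On a three-vertex cycle every rank stays at 1.0, so no vertex is re-activated after the first iteration.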

Existing Graph-parallel Systems (cont.)

Execution Engine
□ sync: BSP-like model
□ async: distributed scheduling queues

Communication
□ message passing: push value
□ distributed shared memory: sync & pull value

[Figure: computation and communication phases separated by a barrier, with push, pull, and sync operations between vertices]
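A BSP-like synchronous engine with push-style message passing can be sketched as follows. The example propagates the maximum value through a graph, a common teaching example for this model; the function name and data layout are hypothetical and do not come from any of the systems named here.

```python
def bsp_max_propagation(out_nbrs, init):
    """out_nbrs: dict vertex -> list of neighbors; init: dict vertex -> value.

    Messages pushed in one superstep are only read in the next one,
    which models the barrier between supersteps.
    """
    value = dict(init)
    # Superstep 0: every vertex pushes its value to its out-neighbors.
    inbox = {v: [] for v in value}
    for v, nbrs in out_nbrs.items():
        for n in nbrs:
            inbox[n].append(value[v])
    supersteps = 1

    while any(inbox.values()):
        outbox = {v: [] for v in value}
        for v, msgs in inbox.items():
            if msgs and max(msgs) > value[v]:
                value[v] = max(msgs)                 # update
                for n in out_nbrs.get(v, []):
                    outbox[n].append(value[v])       # push to neighbors
        inbox = outbox                               # barrier: swap queues
        supersteps += 1
    return value, supersteps
```

Vertices that receive no larger value send nothing, so the loop ends once no messages are in flight.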

Issues of Existing Systems

Pregel [SIGMOD’09]
→ Sync engine
→ Edge-cut + Message Passing
□ w/o dynamic comp.
□ high contention

GraphLab [VLDB’12]
→ Async engine
→ Edge-cut + DSM (replicas)
□ high contention
□ hard to program
□ duplicated edges
□ heavy comm. cost

PowerGraph [OSDI’12]
→ (A)Sync engine
→ Vertex-cut + GAS (replicas)
□ high contention
□ heavy comm. cost

[Figure: example graphs for the three systems showing masters, replicas, messages (msg), duplicated edges (dup), and keep-alive messages; message-count annotations x1, x2, and x5]

Contributions

Distributed Immutable View
□ Easy to program/debug
□ Support dynamic computation
□ Minimized communication cost (x1 /replica)
□ Contention (comp. & comm.) immunity

Multicore-based Cluster Support
□ Hierarchical sync. & deterministic execution
□ Improve parallelism and locality

Outline

Distributed Immutable View
→ Graph organization
→ Vertex computation
→ Message passing
→ Change of execution flow

Multicore-based Cluster Support
→ Hierarchical model
→ Parallelism improvement

Evaluation

General Idea

Observation: For most graph algorithms, a vertex only aggregates its neighbors’ data in one direction and activates neighbors in the other direction
□ e.g. PageRank, SSSP, Community Detection, …

Local aggregation/update & distributed activation
□ Partitioning: avoid duplicated edges
□ Computation: one-way local semantics
□ Communication: merge update & activate messages

Graph Organization

Partition the graph and build local sub-graphs
□ Normal edge-cut: randomized (e.g., hash-based) or heuristic (e.g., Metis)
□ Only create edges in one direction (e.g., in-edges)
  → Avoid duplicated edges
□ Create read-only replicas for edges spanning machines

[Figure: example graph partitioned across machines M1, M2, and M3, with masters and replicas]
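The partitioning steps can be sketched as below. This is an illustrative reading of the slide, assuming integer vertex IDs and a modulo hash; the function name and the dictionary layout are hypothetical and are not taken from the Cyclops source.

```python
def partition_in_edges(edges, num_machines):
    """Hash-based edge-cut that stores each edge once, as an in-edge of its target."""
    owner = lambda v: v % num_machines        # assumed hash: integer vertex IDs
    machines = [{"masters": set(), "replicas": set(), "in_edges": {}}
                for _ in range(num_machines)]
    for src, dst in edges:
        for v in (src, dst):
            machines[owner(v)]["masters"].add(v)
        m = machines[owner(dst)]
        # The edge lives only on the machine of the target's master,
        # so no edge is duplicated.
        m["in_edges"].setdefault(dst, []).append(src)
        if owner(src) != owner(dst):
            # The source is remote: keep a read-only replica of it locally.
            m["replicas"].add(src)
    return machines
```

Each edge appears in exactly one `in_edges` list, and a replica is created only when an edge spans two machines.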

Vertex Computation

Local aggregation/update
□ Support dynamic computation
  → one-way local semantics
□ Immutable view: read-only access to neighbors
  → Eliminate contention on vertices

[Figure: example graph partitioned across M1, M2, and M3; replicas are read-only]

Communication

Sync. & Distributed Activation
□ Merge update & activate messages
  1. Update value of replicas
  2. Invite replicas to activate neighbors

[Figure: example graph partitioned across M1, M2, and M3. Vertex records shown: (rlist: W1, l-act: 1, value: 8, msg: 4) and (l-act: 3, value: 6, msg: 3). Message format: v|m|s, e.g. 8|4|0; the receiving vertex is marked active.]
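The merged update-and-activate message can be sketched as follows. The field names (`vertex`, `value`, `activate`) are hypothetical and do not attempt to reproduce the slide's v|m|s layout; the point illustrated is only that one message per replica carries both the new value and the activation request.

```python
from dataclasses import dataclass

@dataclass
class SyncMsg:
    vertex: int      # ID of the master vertex that was updated
    value: float     # new value to install on the replica
    activate: bool   # whether the replica should activate its local out-neighbors

def master_sync(vertex, value, changed, replica_machines):
    """Build exactly one message per replica of `vertex`."""
    return {m: SyncMsg(vertex, value, changed) for m in replica_machines}

def replica_receive(msg, replica_values, local_out_nbrs, active):
    """Apply a merged message on the replica's machine."""
    replica_values[msg.vertex] = msg.value                     # 1. update value
    if msg.activate:
        active.update(local_out_nbrs.get(msg.vertex, ()))      # 2. activate neighbors
```

Because the replica performs the activation locally, no separate activation message crosses the network.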

Communication

Distributed Activation
□ Unidirectional message passing
  → Replica will never be activated
  → Always master → replicas
  → Contention immunity

[Figure: example graph partitioned across M1, M2, and M3, with per-machine in-queues and out-queues]

Change of Execution Flow

Original Execution Flow (e.g. Pregel)
□ Phases: parsing, computation, sending, receiving
□ high overhead
□ high contention

[Figure: threads process vertices and messages on M1, M2, and M3]

Execution Flow on Distributed Immutable View
□ Phases: computation, sending, receiving (lock-free)
□ low overhead
□ no contention

[Figure: threads update masters and replicas on M1, M2, and M3 through out-queues]


Multicore Support

Two Challenges
1. Two-level hierarchical organization
   → Preserve synchronous and deterministic computation nature (easy to program/debug)
2. Original BSP-like model is hard to parallelize
   → High contention to buffer and parse messages
   → Poor locality in message parsing

Hierarchical Model

Design Principle
□ Three levels: iteration → worker → thread
□ Only the last-level participants perform actual tasks
□ Parents (i.e. higher-level participants) just wait until all children finish their tasks

[Figure: three-level hierarchy (Level-0 iteration, Level-1 worker, Level-2 thread) running tasks in a loop, with a global barrier and a local barrier]
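The hierarchical model can be sketched with Python threads. This is an illustrative simulation, assuming that the global barrier synchronizes workers and the local barrier synchronizes the threads of one worker; workers are modeled as threads of a single process, and all names are hypothetical.

```python
import threading

def run_hierarchical(num_workers, threads_per_worker, iterations):
    """Run an iteration -> worker -> thread hierarchy and return the task log."""
    log, log_lock = [], threading.Lock()
    global_barrier = threading.Barrier(num_workers)

    def worker(w):
        # The local barrier covers the worker's threads plus the worker itself.
        local_barrier = threading.Barrier(threads_per_worker + 1)

        def task(t):
            for it in range(iterations):
                with log_lock:
                    log.append((it, w, t))   # only last-level threads do real work
                local_barrier.wait()         # report completion to the parent
                local_barrier.wait()         # wait until the next iteration starts

        threads = [threading.Thread(target=task, args=(t,))
                   for t in range(threads_per_worker)]
        for th in threads:
            th.start()
        for _ in range(iterations):
            local_barrier.wait()             # parent waits for all of its children
            global_barrier.wait()            # then for all other workers
            local_barrier.wait()             # release children into the next iteration
        for th in threads:
            th.join()

    workers = [threading.Thread(target=worker, args=(w,))
               for w in range(num_workers)]
    for wk in workers:
        wk.start()
    for wk in workers:
        wk.join()
    return log
```

No thread starts iteration k+1 before every thread on every worker has finished iteration k, which keeps the execution synchronous.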

Parallelism Improvement

Original BSP-like model is hard to parallelize
□ Phases: parsing, computation, sending, receiving
□ Even with private out-queues, the in-queues remain shared
  → high contention
  → poor locality

Distributed immutable view opens an opportunity
□ Phases: computation, sending, receiving
□ Receiving is lock-free
□ Private out-queues → no interference
□ Sorted → good locality

[Figure: threads on M1, M2, and M3 processing vertices and messages through in-queues and out-queues in the original model, and updating masters and replicas through private out-queues in Cyclops]
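The private out-queue idea can be sketched sequentially. This is only one plausible reading of the slide labels: each computation thread fills its own queues, and a receiver applies updates in vertex-ID order. Python gives no real lock-free guarantees, and all names here are hypothetical.

```python
def send_phase(thread_updates, replica_locations):
    """Build one private set of out-queues per computation thread.

    thread_updates: one list of (vertex_id, value) pairs per thread.
    replica_locations: dict vertex_id -> machines holding a replica.
    """
    private = []
    for updates in thread_updates:
        queues = {}                         # owned by a single thread: no sharing
        for vid, val in updates:
            for dest in replica_locations.get(vid, ()):
                queues.setdefault(dest, []).append((vid, val))
        private.append(queues)
    return private

def receive_phase(private_queues, dest, replica_values):
    """Apply all updates addressed to `dest` directly to its replicas."""
    msgs = [m for q in private_queues for m in q.get(dest, [])]
    # Each replica has exactly one master, hence exactly one writer;
    # sorting by vertex ID is assumed here to stand for the "sorted" label.
    for vid, val in sorted(msgs):
        replica_values[vid] = val
```

Since a replica is only ever written by messages from its single master, updates to different replicas never conflict.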

Outline



Implementation & Experiment

Implementation

Cyclops(MT)
□ Based on Hama (Java & Hadoop)
□ ~2,800 SLOC
□ Provides a mostly compatible user interface
□ Graph ingress and partitioning
  → Compatible I/O interface
  → Adds an additional phase to build replicas
□ Fault tolerance
  → Incremental checkpoint
  → Replication-based FT [DSN’14]

Experiment Settings

Platform
□ 6 × 12-core AMD Opteron (64G RAM, 1GigE NIC)

Graph Algorithms
□ PageRank (PR), Community Detection (CD), Alternating Least Squares (ALS), Single Source Shortest Path (SSSP)

Workload
□ 7 real-world datasets from SNAP¹
□ 1 synthetic dataset from GraphLab²

Dataset     |V|     |E|
Amazon      0.4M    3.4M
GWeb        0.9M    5.1M
LJournal    4.8M    69M
Wiki        5.7M    130M
SYN-GL      0.1M    2.7M
DBLP        0.3M    1.0M
RoadCA      1.9M    5.5M

¹ http://snap.stanford.edu/data/
² http://graphlab.org

Overall Performance Improvement

[Figure: normalized speedup (0 to 10) of Hama, Cyclops, and CyclopsMT on Amazon, GWeb, LJournal, Wiki, SYN-GL, DBLP, and RoadCA, covering PageRank, ALS, CD, and SSSP in push-mode. Annotations: 8.69X and 2.06X. Configurations: 48 workers; 6 workers (8).]

Performance Scalability

[Figure: normalized speedup (0 to 35) of Hama, Cyclops, and CyclopsMT with 6, 12, 24, and 48 workers/threads on Amazon, GWeb, LJournal, Wiki, SYN-GL, DBLP, and RoadCA; one off-scale bar is labeled 50.2]

Performance Breakdown

[Figure: ratio of execution time (0.0 to 1.0) spent in PARSE, SEND, COMP, and SYNC on Amazon, GWeb, LJournal, Wiki, SYN-GL, DBLP, and RoadCA, for Hama, Cyclops, and CyclopsMT]

[Figure: number of messages (K) and number of vertices (K) per iteration (0 to 30), Hama vs. Cyclops]

Comparison with PowerGraph¹

[Figure: execution time (sec) and number of messages (M) of CyclopsMT and PowerGraph on Amazon, GWeb, LJournal, and Wiki]

Dataset     COMP%
Amazon      11%
GWeb        15%
LJournal    25%
Wiki        39%

¹ C++ & Boost RPC lib.

Cyclops-like engine on GraphLab¹ Platform

Preliminary Results

[Figure: execution time (sec) on Regular and Natural graphs²]

¹ http://graphlab.org
² synthetic 10-million vertex regular (even edge) and power-law (α=2.0) graphs




Conclusion

Cyclops: a new synchronous vertex-oriented graph processing system
□ Preserves the synchronous and deterministic nature of computation (easy to program/debug)
□ Provides efficient vertex computation with significantly fewer messages and contention immunity through the distributed immutable view
□ Further supports multicore-based clusters with a hierarchical processing model and high parallelism

Source Code: http://ipads.se.sjtu.edu.cn/projects/cyclops



Questions

Thanks

Cyclops: http://ipads.se.sjtu.edu.cn/projects/cyclops.html

IPADS, Institute of Parallel and Distributed Systems

What’s Next?

PowerLyra: differentiated graph computation and partitioning on skewed natural graphs
□ Hybrid engine and partitioning algorithms
□ Outperforms PowerGraph by up to 3.26X for natural graphs
http://ipads.se.sjtu.edu.cn/projects/powerlyra.html

Power-law: “most vertices have relatively few neighbors while a few have many neighbors”

[Figure: preliminary results, execution time (sec) of PL, PG, and Cyclops on R and N graphs]



Generality

Algorithms: aggregate/activate all neighbors
□ e.g. Community Detection (CD)
□ Transform into an undirected graph by duplicating edges
□ Still aggregate in one direction (e.g. in-edges) and activate in another direction (e.g. out-edges)
□ Preserve all benefits of Cyclops
  → x1 /replica & contention immunity & good locality

Difference between Cyclops and GraphLab
1. How to construct local sub-graph
2. How to aggregate/activate neighbors

[Figure: the example graph before and after edge duplication, partitioned across M1, M2, and M3]
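The edge-duplication step can be sketched as below. It is a minimal illustration with a hypothetical function name: every directed edge gets a reverse copy, so the in-neighbors of a vertex become all of its neighbors and the one-direction aggregation still sees everything.

```python
def to_undirected(edges):
    """Return the edge list with every edge present in both directions."""
    both = set()
    for src, dst in edges:
        both.add((src, dst))
        both.add((dst, src))     # duplicated edge in the reverse direction
    return sorted(both)
```

After the transformation, aggregating over in-edges and activating over out-edges reaches all neighbors of a vertex.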

Improvement of CyclopsMT

[Figure: execution time (sec, 0 to 30) broken down into SEND, COMP, and SYNC for Cyclops and CyclopsMT. Configurations are written MxWxT/R (#Machines x #Workers x #Threads / #Receivers): 6x1x1/1, 6x2x1/1, 6x4x1/1, 6x8x1/1; 6x1x1/1, 6x1x2/2, 6x1x4/4, 6x1x8/8; 6x1x8/1, 6x1x8/2, 6x1x8/4, 6x1x8/8.]

Communication Efficiency

[Figure: execution time (sec, log scale from 0.1 to 1,000) split into SEND and PARSE for Hama and Cyclops when exchanging 5M, 25M, and 50M messages of the form (id, data) among workers W0 to W5. Annotations: 25.6X, 16.2X, 12.6X, 55.6%, 25.0%, 31.5%.]

Hama: Hadoop RPC lib (Java), send + buffer + parse (contention)
PowerGraph: Boost RPC lib (C++), send + update (contention)
Cyclops: Hadoop RPC lib (Java)

Using Heuristic Edge-cut (i.e. Metis)

[Figure: normalized speedup (0 to 25) of Hama, Cyclops, and CyclopsMT on Amazon, GWeb, LJournal, Wiki, SYN-GL, DBLP, and RoadCA. Annotations: 23.04X and 5.95X. Configurations: 48 workers; 6 workers (8).]

Memory Consumption

Memory Behavior¹ per Worker (PageRank with Wiki dataset)

Configuration    Max Cap (GB)    Max Usage (GB)    Young GC² (#)    Full GC² (#)
Hama/48          1.7             1.5               132              69
Cyclops/48       4.0             3.0               45               15
CyclopsMT/6x8    12.6/8          11.0/8            268/8            32/8

¹ jStat
² GC: Concurrent Mark-Sweep

Ingress Time

H = Hama, C = Cyclops

            LD            REP           INIT          TOT
Dataset     H      C      H      C      H      C      H      C
Amazon      6.2    5.9    0.0    2.5    1.7    1.5    7.9    9.9
GWeb        7.1    6.8    0.0    2.8    2.6    1.9    9.7    11.4
LJournal    27.1   31.0   0.0    44.7   17.9   9.2    45.0   84.9
Wiki        46.7   46.7   0.0    62.2   33.4   20.4   80.0   129.3
SYN-GL      4.2    4.0    0.0    2.6    2.4    1.8    6.6    8.4
DBLP        4.1    4.1    0.0    1.5    1.3    0.9    5.4    6.5
RoadCA      6.4    6.2    0.0    3.9    0.9    0.6    7.3    10.7

Selective Activation

Sync. & Distributed Activation
□ Merge update & activate messages
  1. Update value of replicas
  2. Invite replicas to activate neighbors

*Selective Activation (e.g. ALS)
□ Option: Activation_List
□ Message format: v|m|s becomes v|m|s|l

[Figure: example graph partitioned across M1, M2, and M3. Vertex records shown: (rlist: W1, l-act: 1, value: 8, msg: 4) and (l-act: 3, value: 6, msg: 3). Example message: 8|4|0; the receiving vertex is marked active.]
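Selective activation can be sketched as a filter on the replica side. This assumes the extra `l` field carries a list of vertices to activate, which is an interpretation of the "Activation_List" option rather than a documented format; the function name is hypothetical.

```python
def neighbors_to_activate(local_out_nbrs, activation_list=None):
    """Return the local out-neighbors a replica should activate.

    With no activation list, every local out-neighbor is activated;
    otherwise only those that the sender named.
    """
    if activation_list is None:
        return set(local_out_nbrs)
    return set(local_out_nbrs) & set(activation_list)
```

An algorithm such as ALS could then wake only the subset of neighbors it actually needs.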



[Figure: Cyclops execution flow with out-queues on M1 and M3, lock-free receiving, and sorted updates for good locality; computation threads vs. communication threads can be configured separately]

Existing graph-parallel systems (e.g., Pregel, GraphLab, PowerGraph)
□ w/o dynamic comp.
□ high contention
□ hard to program
□ duplicated edges
□ heavy comm. cost

Cyclops(MT) → Distributed Immutable View
□ w/ dynamic comp.
□ no contention
□ easy to program
□ no duplicated edges
□ low comm. cost

[Figure: example graph with masters and replicas; one message (x1) per replica]

What’s Next?

BiGraph: bipartite-oriented distributed graph partitioning for big learning
□ A set of online distributed graph partitioning algorithms designed for bipartite graphs and applications
□ Partitions graphs in a differentiated way and loads data according to data affinity
□ Outperforms PowerGraph with default partitioning by up to 17.75X, and saves up to 96% of network traffic




Multicore Support

Two Challenges
1. Two-level hierarchical organization
   → Preserve synchronous and deterministic computation nature (easy to program/debug)
2. Original BSP-like model is hard to parallelize
   → High contention to buffer and parse messages
   → Poor locality in message parsing
   → Asymmetric degree of parallelism for CPU and NIC
