Processing graph/relational data with Map-Reduce and Bulk Synchronous Parallel (v. 1.1)
Tomasz Chodakowski
1st Bristol Hadoop Workshop, 08-11-2010
Irregular Algorithms
● Map-reduce – a simplified model for “embarrassingly parallel” problems
– Easily separable into independent tasks
– Captured by static dependence graph
● Most graph algorithms are irregular, i.e.:
– Dependencies between tasks arise during execution
– “don't care non-determinism” - tasks can be executed in arbitrary order yet still yield correct results.
Irregular Algorithms
● Often operate on data structures with complex topologies:
– Graphs, trees, grids, ...
– Where “data elements” are connected by “relations”
● Computations on such structures depend strongly on relations between data elements
– primary source of dependencies between tasks
more in [ADP] “Amorphous Data-parallelism in Irregular Algorithms”
Relational Data
● Example relations between elements:
– social interactions (co-authorship, friendship)
– web links, document references
– linked data or semantic network relations
– geo-spatial relations
– ...
● Different from a relational model
– in that relations are arbitrary
Graph Algorithms Rough Classification
● Aggregation, feature extraction
– Not leveraging latent relations
● Network analysis (matrix-based, single relational)
– Geodesic (radius, diameter etc.)
– Spectral (eigenvector-based, centrality)
● Algorithmic/node-based algorithms
– Recommender systems, belief/label propagation
– Traversal, path detection, interaction networks, etc.
Iterative Vertex-based Graph Algorithms
● Iteratively:
– Compute local function of a vertex that depends on the vertex state and local graph structure (neighbourhood)
– and/or Modify local state
– and/or Modify local topology
– pass messages to neighbouring nodes
● -> “vertex-based computation”
● Amorphous Data-Parallelism [ADP] operator formulation:
“repeated application of neighbourhood operators in a specific order”
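To make the model concrete, below is a minimal sketch of such a vertex operator in Java. The interface and names (Vertex, sendMessage, voteToHalt) are illustrative assumptions in the spirit of Pregel-style frameworks, not any specific API:

import java.util.List;

// Hypothetical vertex-centric operator (names are illustrative).
interface Vertex<S, M> {
    // Called once per superstep for every active vertex; may read messages,
    // update local state/topology, and signal neighbours.
    void compute(List<M> incoming);

    S getState();                                      // read local state
    void setState(S state);                            // modify local state
    void sendMessage(long targetVertexId, M message);  // signal a neighbour
    void voteToHalt();                                 // deactivate until a new message arrives
}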
Recent applications/developments
● Google work on graph-based YouTube recommendations:
– Leveraging latent information
– Diffusing interest in sparsely labelled video clips
● User profiling, sentiment analysis
– Facebook likes, Hunch, Gravity, MusicMetric ...
Single Source Shortest Path
[Figure: a directed graph labelled with positive integers, its structure split into two partitions (P1, P2). A time-space view shows workload and communication between partitions over time; turquoise rectangles show the computational work load for a partition (work).]
Single Source Shortest Path
[Figure: the source vertex (distance 0) is active; active vertices are in turquoise. Signals (0+6, 0+1, 0+9) being passed along relations are in light green. Thick green lines show costly inter-partition communications (comm).]
Single Source Shortest Path
[Figure: a vertical grey line is a barrier synchronisation to avoid race conditions.]
Single Source Shortest Path
[Figure: work, comm and barrier together form a BSP superstep. Vertices become active upon receiving a signal in the previous superstep (here the vertices that received 6, 1 and 9).]
Single Source Shortest Path
[Figure: after performing local computation, the active vertices send signals (1+1, 1+3, 6+2) to their neighbouring vertices.]
Single Source Shortest Path
[Figure: further supersteps repeat the work/comm/barrier pattern, progressively relaxing tentative distances, e.g. the signal 4+2 later improves a neighbour's distance from 8 to 6.]
Single Source Shortest Path
[Figure: the final superstep.]
Computation ends when there are no active vertices left.
Bulk Synchronous Parallel
[Figure: a time-space diagram of partitions P1, P2, ..., Pn advancing through supersteps 0, 1, 2, 3, ...; superstep n comprises work w_n, bulk communication h_n and a barrier l_n.]
Cost of superstep n = w_n + h_n + l_n
(time to finish work on the slowest partition + cost of bulk communication + barrier synchronisation time)
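Summing over a whole computation gives the standard BSP cost model. A sketch in Valiant's notation [BSP], where the communication term h_n is usually scaled by a machine-specific throughput parameter g (the slide's h_n folds g in):

% BSP cost over S supersteps on p processors [BSP]:
%   w_n^(i): local work of processor i in superstep n
%   h_n: max messages sent or received by any processor; g: cost per message
%   l_n: barrier synchronisation latency
T = \sum_{n=0}^{S-1} \left( \max_{1 \le i \le p} w_n^{(i)} + g \, h_n + l_n \right)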
Bulk Synchronous Parallel
● Advantages
– Simple and portable execution model
– Clear cost model
– No concurrency control, no data races, deadlocks, etc.
● Disadvantages
– Coarse-grained
● Depends on a large “parallel slack”
– Requires a well-partitioned problem space for efficiency (well-balanced partitions)
more in [BSP] “A bridging model for parallel computation”
Bulk Synchronous Parallel - extensions
● Combiners
– minimizing inter-node communication (h factor)
● Aggregators
– Computing global state (e.g. map/reduce)
And other extensions...
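For the SSSP example, a combiner can collapse all distance messages addressed to the same vertex into a single minimum before they cross the network, cutting the h factor. A minimal sketch, assuming the DistanceMessage type from the sample code on the next slide (hypothetical API):

import java.util.List;

// Sketch of a min-distance combiner: n messages to the same vertex
// become 1, reducing inter-node communication.
class MinDistanceCombiner {
    DistanceMessage combine(List<DistanceMessage> toSameVertex) {
        DistanceMessage best = toSameVertex.get(0);
        for (DistanceMessage m : toSameVertex) {
            if (m.getDistance() < best.getDistance()) {
                best = m;   // keep only the smallest proposed distance
            }
        }
        return best;
    }
}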
Sample code

public void superStep() {
    int minDist = this.isStartingElement() ? 0 : Integer.MAX_VALUE;
    for (DistanceMessage msg : messages()) {      // choose min. proposed distance
        minDist = Math.min(minDist, msg.getDistance());
    }
    if (minDist < this.getCurrentDistance()) {    // if it improves the path, store and propagate
        this.setCurrentDistance(minDist);
        IVertex v = this.getElement();
        for (IEdge r : v.getOutgoingEdges(DemoRelationshipTypes.KNOWS)) {
            IElement recipient = r.getOtherElement(v);
            int rDist = this.getLengthOf(r);
            this.sendMessage(new DistanceMessage(minDist + rDist, recipient.getId()));
        }
    }
}
SSSP - Map-Reduce Naive
● Idea [DPMR]:
– In map phase:
● emit both signals and local vertex structure and state
– In reduce phase:
● gather signals and local vertex structure messages
● reconstruct vertex structure and state
SSSP - Map-Reduce Naive
def map(Id nId, Node N):
    // emit state and structure
    emit(nId, N.graphStateAndStruct)
    if N.isActive:
        for nbr in N.adjacencyList:
            // local computation
            dist := N.currDist + distTo(nbr)
            // emit signals
            emit(nbr.id, dist)

def reduce(Id rId, {m1, m2, ...}):
    M := new Node; M.deactivate
    minDist := MAX_VALUE
    for m in {m1, m2, ...}:
        if m is Node: M := m          // state and structure
        else if m is Distance:        // signal
            minDist := min(minDist, m)
    if M.currDist > minDist:
        M.currDist := minDist
        M.activate
    emit(rId, M)
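As a complement to the pseudocode, a minimal sketch of the same naive pattern on the Hadoop Java API; one MR job corresponds to one superstep and is iterated until no distance improves. The tab-separated text encoding of nodes ("N|active|currDist|nbr:w,...") and signals ("D|dist") is an assumption made up for this illustration; the actual implementation is in RunSSSPNaive.java (referenced at the end of the deck).

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class SSSPNaive {

    public static class SSSPMapper
            extends Mapper<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] kv = line.toString().split("\t");       // "<nodeId> TAB <record>"
            long nId = Long.parseLong(kv[0]);
            String[] n = kv[1].split("\\|");                 // N|active|dist|adj
            ctx.write(new LongWritable(nId), new Text(kv[1]));   // pass state and structure along
            if (n[1].equals("1") && n.length > 3 && !n[3].isEmpty()) {   // active?
                long currDist = Long.parseLong(n[2]);
                for (String edge : n[3].split(",")) {        // "nbrId:weight"
                    String[] e = edge.split(":");
                    long d = currDist + Long.parseLong(e[1]);    // local computation
                    ctx.write(new LongWritable(Long.parseLong(e[0])),
                              new Text("D|" + d));               // emit signal
                }
            }
        }
    }

    public static class SSSPReducer
            extends Reducer<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void reduce(LongWritable nId, Iterable<Text> msgs, Context ctx)
                throws IOException, InterruptedException {
            long currDist = Long.MAX_VALUE, minDist = Long.MAX_VALUE;
            String adj = "";
            for (Text m : msgs) {
                String[] f = m.toString().split("\\|");
                if (f[0].equals("N")) {                      // rebuild state and structure
                    currDist = Long.parseLong(f[2]);
                    adj = f.length > 3 ? f[3] : "";
                } else {                                     // gather signals
                    minDist = Math.min(minDist, Long.parseLong(f[1]));
                }
            }
            boolean improved = minDist < currDist;
            long best = Math.min(minDist, currDist);
            ctx.write(nId, new Text("N|" + (improved ? "1" : "0") + "|" + best + "|" + adj));
        }
    }
}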
SSSP - Map Reduce Naive - issues
● Cost associated with marshaling intermediate <k,v> pairs for combiners (which are optional)
– -> in-line combiner
● Need to pass the whole graph state and structure around
– -> “Shimmy trick” -- pin down the structure
● Partitions vertices without regard to graph topology
– -> cluster highly connected components together
Inline Combiners
● In job configure:
– Initialize a map<NodeId, Distance>;
● In job map operation:
– Do not emit interm. pairs ( emit(nbr.id, dist) );
– Store them in the local map;
– Combine values in the same slots.
● In job close:
– Emit a value from each slot in the map to a corresponding neighbour
● emit(nbr.id, map[nbr.id])
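A sketch of this in-mapper combining on the Hadoop Java API: setup/map/cleanup play the roles of the configure/map/close hooks above. The (nbr.id, dist) framing of the mapper input is simplified for illustration; in the real job those pairs are computed from the node record as in the earlier pseudocode.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Mapper;

// In-mapper combiner: buffer the minimum distance per neighbour locally
// and emit once per map task instead of once per edge.
public class InlineCombinerMapper
        extends Mapper<LongWritable, LongWritable, LongWritable, LongWritable> {

    private Map<Long, Long> minDistPerNode;

    @Override
    protected void setup(Context ctx) {
        minDistPerNode = new HashMap<Long, Long>();   // the "slots"
    }

    @Override
    protected void map(LongWritable nbrId, LongWritable dist, Context ctx) {
        // Do not emit the intermediate pair; combine it into the local slot.
        Long prev = minDistPerNode.get(nbrId.get());
        if (prev == null || dist.get() < prev) {
            minDistPerNode.put(nbrId.get(), dist.get());
        }
    }

    @Override
    protected void cleanup(Context ctx) throws IOException, InterruptedException {
        // Emit one combined value per slot to the corresponding neighbour.
        for (Map.Entry<Long, Long> e : minDistPerNode.entrySet()) {
            ctx.write(new LongWritable(e.getKey()), new LongWritable(e.getValue()));
        }
    }
}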
“Shimmy trick”
● Store graph structure in a file system (no shuffle)
● Inspired by a parallel merge join
[Figure: intermediate <key, value> data sorted by join key is merge-joined against a graph file that is sorted and partitioned by join key (partitions p1, p2, p3).]
“Shimmy trick”
● Assume:
– Graph G representation sorted by node ids;
– G partitioned into n parts: G1, G2, ..., Gn;
– Use the same partitioner as in MR;
– Set number of reducers to n.
● The above gives us:
– Reducer Ri receives the same intermediate keys as those in graph partition Gi (in sorted order).
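In practice this means one deterministic partitioning function shared by the offline graph-splitting step and the MR job. A minimal sketch on the Hadoop Java API (the class name is an illustrative assumption):

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// The same function must split the on-disk graph file and route the
// intermediate keys, so reducer Ri sees exactly the node ids of Gi.
public class NodeIdPartitioner extends Partitioner<LongWritable, Text> {
    @Override
    public int getPartition(LongWritable nodeId, Text value, int numPartitions) {
        return (int) ((nodeId.get() & Long.MAX_VALUE) % numPartitions);
    }
}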
“Shimmy trick”
def configure():
    P.openGraphPartition()

def reduce(Id rId, {m1, m2, ...}):
    // stream the graph partition forward until the node for rId is reached
    repeat:
        (Id nId, Node N) <- P.read()
        if nId != rId: N.deactivate; emit(nId, N)
    until nId == rId
    minDist := MAX_VALUE
    for m in {m1, m2, ...}:
        minDist := min(minDist, m)
    if N.currDist > minDist:
        N.currDist := minDist
        N.activate
    emit(rId, N)

def close():
    // flush the remaining nodes of the partition
    repeat until P.isExhausted():
        (Id nId, Node N) <- P.read()
        N.deactivate
        emit(nId, N)
“Shimmy trick”
● Improvements:
– Files containing the graph structure reside on DFS;
– Reducers are arbitrarily assigned to cluster machines:
● -> remote reads;
● -> change the scheduler to assign key ranges to the same machines consistently.
Topology-aware Partitioner
● Choose a partitioner that:
– minimizes inter-block traffic;
– maximizes intra-block traffic;
– places adjacent nodes in the same block
● Difficult to achieve, particularly with many real-world datasets:
– Power-law distributions
– Reported that state-of-the-art partitioners (e.g. ParMETIS) fail for such cases (???)
MR Graph Processing Design Pattern
● [DPMR] reports 60-70% improvement over the naive implementation
● Solution closely resembles the BSP model
BSP (inspired) implementations
● Google Pregel:
– classic BSP, C++, production
● CMU GraphLab:
– inspired by BSP, Java, multi-core
– consistency models, custom schedulers
● Apache Hama:
– scientific computation package that runs on top of Hadoop, BSP, MS Dryad (?)
● Signal/Collect (Zurich University):
– Scala, not yet distributed
● ...
Open questions
● What problems are particularly suitable for MR and which ones for BSP – where are the boundaries?
– Topology-based centrality algorithms (PageRank):
● Algebraic, matrix-based methods vs. vertex-based ones?
● When considering graph algorithms:
– MR user base vs. BSP ergonomics?
– Performance overheads?
● Relaxing the BSP synchronous schedule --> “Amorphous data parallelism”
POC, Sample Code
● Project Masuria (early stages, 2011-02)
– http://masuria-project.org/
– As much a POC of a BSP framework as it is a (distributed) OSGi playground.
● Sample code:
– https://github.com/tch/Cloud9 *
– [email protected]:tch_sandbox.git
– RunSSSPNaive.java
– RunSSSPShimmy.java *
* - expect (my) bugs
Based on Jimmy Lin and Michael Schatz's Cloud9 library
References
● [ADP] “Amorphous Data-parallelism in Irregular Algorithms”, Keshav Pingali et al.
● [BSP] “A bridging model for parallel computation”, Leslie G. Valiant
● [DPMR] “Design Patterns for Efficient Graph Algorithms in MapReduce”, Jimmy Lin and Michael Schatz