Processing graph/relational data with Map-Reduce and Bulk Synchronous Parallel (v. 1.1)
Tomasz Chodakowski
1st Bristol Hadoop Workshop, 08-11-2010
Irregular Algorithms
● Map-reduce – a simplified model for “embarrassingly parallel” problems
– Easily separable into independent tasks
– Captured by static dependence graph
● Most graph algorithms are irregular, i.e.:
– Dependencies between tasks arise during execution
– “don't care non-determinism” - tasks can be executed in arbitrary order yet still yield correct results.
Irregular Algorithms
● Often operate on data structures with complex topologies:
– Graphs, trees, grids, ...
– Where “data elements” are connected by “relations”
● Computations on such structures depend strongly on relations between data elements
– primary source of dependencies between tasks
more in [ADP] “Amorphous Data-parallelism in Irregular Algorithms”
Relational Data
● Example relations between elements:
– social interactions (co-authorship, friendship)
– web links, document references
– linked data or semantic network relations
– geo-spatial relations
– ...
● Different from a relational model
– in that relations are arbitrary
Graph Algorithms Rough Classification
● Aggregation, feature extraction
– Not leveraging latent relations
● Network analysis (matrix-based, single relational)
– Geodesic (radius, diameter etc.)
– Spectral (eigenvector-based, centrality)
● Algorithmic/node-based algorithms
– Recommender systems, belief/label propagation
– Traversal, path detection, interaction networks, etc.
Iterative Vertex-based Graph Algorithms
● Iteratively:
– Compute local function of a vertex that depends on the vertex state and local graph structure (neighbourhood)
– and/or Modify local state
– and/or Modify local topology
– pass messages to neighbouring nodes
● -> “vertex-based computation”
● Amorphous Data-Parallelism [ADP] operator formulation:
“repeated application of neighbourhood operators in a specific order”
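To make the model concrete, below is a minimal sketch of such a vertex operator in Java. The interface and names (Vertex, sendMessage, voteToHalt) are illustrative assumptions in the spirit of Pregel-style frameworks, not any specific API:

import java.util.List;

// Hypothetical vertex-centric operator (names are illustrative).
interface Vertex<S, M> {
    // Called once per superstep for every active vertex; may read messages,
    // update local state/topology, and signal neighbours.
    void compute(List<M> incoming);

    S getState();                                      // read local state
    void setState(S state);                            // modify local state
    void sendMessage(long targetVertexId, M message);  // signal a neighbour
    void voteToHalt();                                 // deactivate until a new message arrives
}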
Recent applications/developments
● Google work on graph-based YouTube recommendations:
– Leveraging latent information
– Diffusing interest in sparsely labelled video clips
● User profiling, sentiment analysis
– Facebook likes, Hunch, Gravity, MusicMetric ...
Single Source Shortest Path
[Figure: a directed graph labelled with positive integers, its structure split into two partitions (P1, P2). A time-space view shows workload and communication between partitions over time; turquoise rectangles show the computational work load for a partition (work).]
Single Source Shortest Path
[Figure: the source vertex (distance 0) is active; active vertices are in turquoise. Signals (0+6, 0+1, 0+9) being passed along relations are in light green. Thick green lines show costly inter-partition communications (comm).]
Single Source Shortest Path
[Figure: a vertical grey line is a barrier synchronisation to avoid race conditions.]
Single Source Shortest Path
[Figure: work, comm and barrier together form a BSP superstep. Vertices become active upon receiving a signal in the previous superstep (here the vertices that received 6, 1 and 9).]
Single Source Shortest Path
[Figure: after performing local computation, the active vertices send signals (1+1, 1+3, 6+2) to their neighbouring vertices.]
Single Source Shortest Path
[Figure: further supersteps repeat the work/comm/barrier pattern, progressively relaxing tentative distances, e.g. the signal 4+2 later improves a neighbour's distance from 8 to 6.]
Single Source Shortest Path
[Figure: the final superstep.]
Computation ends when there are no active vertices left.
Bulk Synchronous Parallel
[Figure: a time-space diagram of partitions P1, P2, ..., Pn advancing through supersteps 0, 1, 2, 3, ...; superstep n comprises work w_n, bulk communication h_n and a barrier l_n.]
Cost of superstep n = w_n + h_n + l_n
(time to finish work on the slowest partition + cost of bulk communication + barrier synchronisation time)
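Summing over a whole computation gives the standard BSP cost model. A sketch in Valiant's notation [BSP], where the communication term h_n is usually scaled by a machine-specific throughput parameter g (the slide's h_n folds g in):

% BSP cost over S supersteps on p processors [BSP]:
%   w_n^(i): local work of processor i in superstep n
%   h_n: max messages sent or received by any processor; g: cost per message
%   l_n: barrier synchronisation latency
T = \sum_{n=0}^{S-1} \left( \max_{1 \le i \le p} w_n^{(i)} + g \, h_n + l_n \right)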
Bulk Synchronous Parallel
● Advantages
– Simple and portable execution model
– Clear cost model
– No concurrency control, no data races, deadlocks, etc.
● Disadvantages
– Coarse-grained
● Depends on a large “parallel slack”
– Requires a well-partitioned problem space for efficiency (well-balanced partitions)
more in [BSP] “A bridging model for parallel computation”
Bulk Synchronous Parallel - extensions
● Combiners
– minimizing inter-node communication (h factor)
● Aggregators
– Computing global state (e.g. map/reduce)
And other extensions...
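For the SSSP example, a combiner can collapse all distance messages addressed to the same vertex into a single minimum before they cross the network, cutting the h factor. A minimal sketch, assuming the DistanceMessage type from the sample code on the next slide (hypothetical API):

import java.util.List;

// Sketch of a min-distance combiner: n messages to the same vertex
// become 1, reducing inter-node communication.
class MinDistanceCombiner {
    DistanceMessage combine(List<DistanceMessage> toSameVertex) {
        DistanceMessage best = toSameVertex.get(0);
        for (DistanceMessage m : toSameVertex) {
            if (m.getDistance() < best.getDistance()) {
                best = m;   // keep only the smallest proposed distance
            }
        }
        return best;
    }
}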
Sample code

public void superStep() {
    int minDist = this.isStartingElement() ? 0 : Integer.MAX_VALUE;
    for (DistanceMessage msg : messages()) {      // choose min. proposed distance
        minDist = Math.min(minDist, msg.getDistance());
    }
    if (minDist < this.getCurrentDistance()) {    // if it improves the path, store and propagate
        this.setCurrentDistance(minDist);
        IVertex v = this.getElement();
        for (IEdge r : v.getOutgoingEdges(DemoRelationshipTypes.KNOWS)) {
            IElement recipient = r.getOtherElement(v);
            int rDist = this.getLengthOf(r);
            this.sendMessage(new DistanceMessage(minDist + rDist, recipient.getId()));
        }
    }
}
SSSP - Map-Reduce Naive
● Idea [DPMR]:
– In map phase:
● emit both signals and local vertex structure and state
– In reduce phase:
● gather signals and local vertex structure messages
● reconstruct vertex structure and state
SSSP - Map-Reduce Naive
def map(Id nId, Node N):
    // emit state and structure
    emit(nId, N.graphStateAndStruct)
    if N.isActive:
        for nbr in N.adjacencyList:
            // local computation
            dist := N.currDist + distTo(nbr)
            // emit signals
            emit(nbr.id, dist)

def reduce(Id rId, {m1, m2, ...}):
    M := new Node; M.deactivate
    minDist := MAX_VALUE
    for m in {m1, m2, ...}:
        if m is Node: M := m          // state and structure
        else if m is Distance:        // signal
            minDist := min(minDist, m)
    if M.currDist > minDist:
        M.currDist := minDist
        M.activate
    emit(rId, M)
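As a complement to the pseudocode, a minimal sketch of the same naive pattern on the Hadoop Java API; one MR job corresponds to one superstep and is iterated until no distance improves. The tab-separated text encoding of nodes ("N|active|currDist|nbr:w,...") and signals ("D|dist") is an assumption made up for this illustration; the actual implementation is in RunSSSPNaive.java (referenced at the end of the deck).

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class SSSPNaive {

    public static class SSSPMapper
            extends Mapper<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] kv = line.toString().split("\t");       // "<nodeId> TAB <record>"
            long nId = Long.parseLong(kv[0]);
            String[] n = kv[1].split("\\|");                 // N|active|dist|adj
            ctx.write(new LongWritable(nId), new Text(kv[1]));   // pass state and structure along
            if (n[1].equals("1") && n.length > 3 && !n[3].isEmpty()) {   // active?
                long currDist = Long.parseLong(n[2]);
                for (String edge : n[3].split(",")) {        // "nbrId:weight"
                    String[] e = edge.split(":");
                    long d = currDist + Long.parseLong(e[1]);    // local computation
                    ctx.write(new LongWritable(Long.parseLong(e[0])),
                              new Text("D|" + d));               // emit signal
                }
            }
        }
    }

    public static class SSSPReducer
            extends Reducer<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void reduce(LongWritable nId, Iterable<Text> msgs, Context ctx)
                throws IOException, InterruptedException {
            long currDist = Long.MAX_VALUE, minDist = Long.MAX_VALUE;
            String adj = "";
            for (Text m : msgs) {
                String[] f = m.toString().split("\\|");
                if (f[0].equals("N")) {                      // rebuild state and structure
                    currDist = Long.parseLong(f[2]);
                    adj = f.length > 3 ? f[3] : "";
                } else {                                     // gather signals
                    minDist = Math.min(minDist, Long.parseLong(f[1]));
                }
            }
            boolean improved = minDist < currDist;
            long best = Math.min(minDist, currDist);
            ctx.write(nId, new Text("N|" + (improved ? "1" : "0") + "|" + best + "|" + adj));
        }
    }
}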
SSSP - Map Reduce Naive - issues
● Cost associated with marshaling intermediate <k,v> pairs for combiners (which are optional)
– -> in-line combiner
● Need to pass the whole graph state and structure around
– -> “Shimmy trick” -- pin down the structure
● Partitions vertices without regard to graph topology
– -> cluster highly connected components together
Inline Combiners
● In job configure:
– Initialize a map<NodeId, Distance>;
● In job map operation:
– Do not emit interm. pairs ( emit(nbr.id, dist) );
– Store them in the local map;
– Combine values in the same slots.
● In job close:
– Emit a value from each slot in the map to a corresponding neighbour
● emit(nbr.id, map[nbr.id])
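A sketch of this in-mapper combining on the Hadoop Java API: setup/map/cleanup play the roles of the configure/map/close hooks above. The (nbr.id, dist) framing of the mapper input is simplified for illustration; in the real job those pairs are computed from the node record as in the earlier pseudocode.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Mapper;

// In-mapper combiner: buffer the minimum distance per neighbour locally
// and emit once per map task instead of once per edge.
public class InlineCombinerMapper
        extends Mapper<LongWritable, LongWritable, LongWritable, LongWritable> {

    private Map<Long, Long> minDistPerNode;

    @Override
    protected void setup(Context ctx) {
        minDistPerNode = new HashMap<Long, Long>();   // the "slots"
    }

    @Override
    protected void map(LongWritable nbrId, LongWritable dist, Context ctx) {
        // Do not emit the intermediate pair; combine it into the local slot.
        Long prev = minDistPerNode.get(nbrId.get());
        if (prev == null || dist.get() < prev) {
            minDistPerNode.put(nbrId.get(), dist.get());
        }
    }

    @Override
    protected void cleanup(Context ctx) throws IOException, InterruptedException {
        // Emit one combined value per slot to the corresponding neighbour.
        for (Map.Entry<Long, Long> e : minDistPerNode.entrySet()) {
            ctx.write(new LongWritable(e.getKey()), new LongWritable(e.getValue()));
        }
    }
}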
“Shimmy trick”
● Store graph structure in a file system (no shuffle)
● Inspired by a parallel merge join
[Figure: intermediate <key, value> data sorted by join key is merge-joined against a graph file that is sorted and partitioned by join key (partitions p1, p2, p3).]
“Shimmy trick”
● Assume:
– Graph G representation sorted by node ids;
– G partitioned into n parts: G1, G2, ..., Gn;
– Use the same partitioner as in MR;
– Set number of reducers to n.
● The above gives us:
– Reducer Ri receives the same intermediate keys as those in graph partition Gi (in sorted order).
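In practice this means one deterministic partitioning function shared by the offline graph-splitting step and the MR job. A minimal sketch on the Hadoop Java API (the class name is an illustrative assumption):

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// The same function must split the on-disk graph file and route the
// intermediate keys, so reducer Ri sees exactly the node ids of Gi.
public class NodeIdPartitioner extends Partitioner<LongWritable, Text> {
    @Override
    public int getPartition(LongWritable nodeId, Text value, int numPartitions) {
        return (int) ((nodeId.get() & Long.MAX_VALUE) % numPartitions);
    }
}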
“Shimmy trick”
def configure():
    P.openGraphPartition()

def reduce(Id rId, {m1, m2, ...}):
    // stream the graph partition forward until the node for rId is reached
    repeat:
        (Id nId, Node N) <- P.read()
        if nId != rId: N.deactivate; emit(nId, N)
    until nId == rId
    minDist := MAX_VALUE
    for m in {m1, m2, ...}:
        minDist := min(minDist, m)
    if N.currDist > minDist:
        N.currDist := minDist
        N.activate
    emit(rId, N)

def close():
    // flush the remaining nodes of the partition
    repeat until P.isExhausted():
        (Id nId, Node N) <- P.read()
        N.deactivate
        emit(nId, N)
“Shimmy trick”
● Improvements:
– Files containing the graph structure reside on DFS;
– Reducers are arbitrarily assigned to cluster machines:
● -> remote reads;
● -> change the scheduler to assign key ranges to the same machines consistently.
Topology-aware Partitioner
● Choose a partitioner that:
– minimizes inter-block traffic;
– maximizes intra-block traffic;
– places adjacent nodes in the same block
● Difficult to achieve, particularly with many real-world datasets:
– Power-law distributions
– Reported that state-of-the-art partitioners (e.g. ParMETIS) fail for such cases (???)
MR Graph Processing Design Pattern
● [DPMR] reports 60-70% improvement over the naive implementation
● Solution closely resembles the BSP model
BSP (inspired) implementations
● Google Pregel:
– classic BSP, C++, production
● CMU GraphLab:
– inspired by BSP, Java, multi-core
– consistency models, custom schedulers
● Apache Hama:
– scientific computation package that runs on top of Hadoop, BSP, MS Dryad (?)
● Signal/Collect (Zurich University):
– Scala, not yet distributed
● ...
Open questions
● What problems are particularly suitable for MR and which ones for BSP – where are the boundaries?
– Topology-based centrality algorithms (PageRank):
● Algebraic, matrix-based methods vs. vertex-based ones?
● When considering graph algorithms:
– MR user base vs. BSP ergonomics?
– Performance overheads?
● Relaxing the BSP synchronous schedule --> “Amorphous data parallelism”
POC, Sample Code
● Project Masuria (early stages, 2011-02)
– http://masuria-project.org/
– As much a POC of a BSP framework as it is a (distributed) OSGi playground.
● Sample code:
– https://github.com/tch/Cloud9 *
– [email protected]:tch_sandbox.git
– RunSSSPNaive.java
– RunSSSPShimmy.java *
* - expect (my) bugs
Based on Jimmy Lin and Michael Schatz's Cloud9 library
References
● [ADP] “Amorphous Data-parallelism in Irregular Algorithms”, Keshav Pingali et al.
● [BSP] “A bridging model for parallel computation”, Leslie G. Valiant
● [DPMR] “Design Patterns for Efficient Graph Algorithms in MapReduce”, Jimmy Lin and Michael Schatz