GraphX: Graph Analytics on Spark Joseph Gonzalez, Reynold Xin, Ion Stoica, Michael Franklin Developed at the UC Berkeley AMPLab AMPCamp: August 29, 2013
Feb 26, 2016
Graphs are Essential to Data Mining and Machine Learning
• Identify influential people and information
• Find communities
• Understand people’s shared interests
• Model complex data dependencies
Predicting Political Bias
[Figure: graph of posts, some labeled Liberal or Conservative and the rest unlabeled (?)]
Conditional Random Field, Belief Propagation
Triangle Counting
Count the triangles passing through each vertex.
Measures “cohesiveness” of local community:
• More triangles: stronger community
• Fewer triangles: weaker community
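The per-vertex triangle count can be sketched on plain Scala collections. This is a minimal single-machine illustration, not the GraphLab or GraphX implementation; the object name `TriangleCount`, the function `perVertex`, and the small example graph are assumptions made for this sketch.

```scala
object TriangleCount {
  /** Count, for every vertex, the triangles that pass through it.
    * Edges are treated as undirected; self-loops and duplicate edges are ignored. */
  def perVertex(edges: Seq[(Int, Int)]): Map[Int, Int] = {
    // Build an undirected adjacency map: vertex -> set of neighbors.
    val adjacency: Map[Int, Set[Int]] =
      edges
        .filter { case (a, b) => a != b }
        .flatMap { case (a, b) => Seq(a -> b, b -> a) }
        .groupBy(_._1)
        .map { case (v, pairs) => v -> pairs.map(_._2).toSet }

    adjacency.map { case (v, nbrs) =>
      // A triangle through v is an edge between two of v's neighbors.
      // Requiring a < b counts each neighbor pair exactly once.
      val count = (for {
        a <- nbrs.toSeq
        b <- nbrs.toSeq
        if a < b && adjacency(a).contains(b)
      } yield 1).sum
      v -> count
    }
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical graph: triangle 1-2-3 plus a pendant edge 3-4.
    val edges = Seq((1, 2), (2, 3), (1, 3), (3, 4))
    perVertex(edges).toSeq.sortBy(_._1).foreach { case (v, c) =>
      println(s"vertex $v: $c triangle(s)")
    }
  }
}
```

In the hypothetical graph, vertices 1, 2, and 3 each lie on one triangle and vertex 4 lies on none, so vertex 4 sits in the weaker part of the community.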
Collaborative Filtering
[Figure: bipartite graph connecting Users to Items through Ratings]
Many More Graph Algorithms
• Collaborative Filtering
  – Alternating Least Squares
  – Stochastic Gradient Descent
  – Tensor Factorization
  – SVD
• Structured Prediction
  – Loopy Belief Propagation
  – Max-Product Linear Programs
  – Gibbs Sampling
• Semi-supervised ML
  – Graph SSL
  – CoEM
• Graph Analytics
  – PageRank
  – Single Source Shortest Path
  – Triangle-Counting
  – Graph Coloring
  – K-core Decomposition
  – Personalized PageRank
• Classification
  – Neural Networks
  – Lasso
  …
Structure of Computation
[Figure: Data-Parallel computation turns the Rows of a Table into a Result; Graph-Parallel computation (e.g., Pregel) follows a Dependency Graph]
The Graph-Parallel Abstraction
• A user-defined Vertex-Program runs on each vertex
• Graph constrains interaction along edges
  – Using messages (e.g., Pregel [PODC’09, SIGMOD’10])
  – Through shared state (e.g., GraphLab [UAI’10, VLDB’12])
• Parallelism: run multiple vertex programs simultaneously
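A message-passing vertex program can be sketched as a loop of synchronous supersteps. The example below is an illustrative assumption, not Pregel's API: each vertex keeps the smallest vertex id it has heard of, which labels connected components. The names `VertexProgramSketch` and `minLabel` are hypothetical, and everything runs on one machine.

```scala
object VertexProgramSketch {
  /** Synchronous label propagation: every vertex ends up holding the
    * smallest vertex id reachable from it (a connected-component label). */
  def minLabel(vertices: Set[Int], edges: Seq[(Int, Int)]): Map[Int, Int] = {
    // Messages may only travel along edges (here in both directions).
    val neighbors: Map[Int, Seq[Int]] =
      edges
        .flatMap { case (a, b) => Seq(a -> b, b -> a) }
        .groupBy(_._1)
        .map { case (v, ps) => v -> ps.map(_._2) }

    // Initial vertex state: each vertex is labeled with its own id.
    var state: Map[Int, Int] = vertices.map(v => v -> v).toMap
    var changed = true
    while (changed) {
      // Phase 1: every vertex sends its current label to its neighbors.
      val inbox: Map[Int, Seq[Int]] =
        state.toSeq
          .flatMap { case (v, label) =>
            neighbors.getOrElse(v, Seq.empty).map(n => n -> label)
          }
          .groupBy(_._1)
          .map { case (v, msgs) => v -> msgs.map(_._2) }

      // Phase 2: the vertex program runs independently on each vertex,
      // combining its own label with the messages it received.
      val next = state.map { case (v, label) =>
        v -> (label +: inbox.getOrElse(v, Seq.empty)).min
      }
      changed = next != state
      state = next
    }
    state
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical graph with two components: {1, 2, 3} and {4, 5}.
    println(minLabel(Set(1, 2, 3, 4, 5), Seq((1, 2), (2, 3), (4, 5))))
  }
}
```

Because phase 2 reads only a vertex's own state and inbox, the per-vertex updates are independent of each other, which is what allows a real system to run many vertex programs simultaneously.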
By exploiting graph structure, graph-parallel systems can be orders of magnitude faster.
Triangle Counting on Twitter
40M Users, 1.4 Billion Links
Counted: 34.8 Billion Triangles
• Hadoop [WWW’11]: 1536 Machines, 423 Minutes
• GraphLab: 64 Machines, 15 Seconds
1000x Faster
S. Suri and S. Vassilvitskii, “Counting triangles and the curse of the last reducer,” WWW’11
Specialized Graph Systems
1. APIs to capture complex graph dependencies
2. Exploit graph structure to reduce communication and computation
Why GraphX?
The Bigger Picture
Time Spent in Data Pipeline
[Figure: stacked bars from 0% to 100% for GraphLab and Hadoop, divided into Graph Creation, Graph Algorithms, and Post-Processing]
Limitations of Specialized Graph-Parallel Systems
• No support for construction and post-processing
• Not interactive
• Requires maintaining multiple platforms
Spark excels at these!
GraphX Unifies Data-Parallel and Graph-Parallel Systems
• Spark Table API: RDDs, fault tolerance, and task scheduling
• GraphLab Graph API: graph representation and execution
Graph Construction, Computation, Post-Processing: one system for the entire graph pipeline
Enable Joining Tables and Graphs
[Figure: Tables (User Data, Product Ratings) feed through ETL into Graphs (Friend Graph, Product Rec. Graph); graph outputs (Inf., Prod. Rec.) are joined back with the tables]
The GraphX Resilient Distributed Graph

Vertex table:
Id         Attribute (V)
rxin       (Stu., Berk.)
jegonzal   (PstDoc, Berk.)
franklin   (Prof., Berk.)
istoica    (Prof., Berk.)

Edge table:
SrcId      DstId      Attribute (E)
rxin       jegonzal   Friend
franklin   rxin       Advisor
istoica    franklin   Coworker
franklin   jegonzal   PI
GraphX API

class Graph[V, E] {
  // Table Views -----------------
  def vertices: RDD[(Id, V)]
  def edges: RDD[(Id, Id, E)]
  def triplets: RDD[((Id, V), (Id, V), E)]

  // Transformations ------------------------------
  def reverse: Graph[V, E]
  def filterV(p: (Id, V) => Boolean): Graph[V, E]
  def filterE(p: Edge[V, E] => Boolean): Graph[V, E]
  def mapV[T](m: (Id, V) => T): Graph[T, E]
  def mapE[T](m: Edge[V, E] => T): Graph[V, T]

  // Joins ----------------------------------------
  def joinV[T](tbl: RDD[(Id, T)]): Graph[(V, Opt[T]), E]
  def joinE[T](tbl: RDD[(Id, Id, T)]): Graph[V, (E, Opt[T])]

  // Computation ----------------------------------
  def aggregateNeighbors[T](mapF: (Edge[V, E]) => T,
                            reduceF: (T, T) => T,
                            direction: EdgeDir): Graph[T, E]
}
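Aggregate Neighbors
Map-Reduce for each vertex
[Figure: graph on vertices A to F. For vertex A, mapF on edge (A, B) yields a1 and mapF on edge (A, C) yields a2; reduceF(a1, a2) gives the aggregate stored at A]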
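Example: Oldest Follower
What is the age of the oldest follower for each user?

val followerAge = graph.aggregateNeighbors(
  e => e.src.age,      // mapF: the age of the follower at the source of each edge
  math.max(_, _),      // reduceF: keep the larger of two ages
  InEdges              // aggregate over each user's incoming (follower) edges
).vertices

[Figure: follower graph on vertices A to F annotated with ages 23, 42, 30, 19, 75, and 16]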
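The map-then-reduce pattern behind the oldest-follower query can be tried without Spark. The sketch below mimics the shape of `aggregateNeighbors` on in-memory collections; the `Edge` case class, the `EdgeDir` objects, the extra `vertices` and `edges` parameters, and the sample ages are stand-ins assumed for illustration, not the GraphX classes.

```scala
object OldestFollower {
  // Hypothetical stand-ins for the slide's types; not the GraphX classes.
  final case class Edge[V, E](srcId: Int, src: V, dstId: Int, dst: V, attr: E)
  sealed trait EdgeDir
  case object InEdges extends EdgeDir
  case object OutEdges extends EdgeDir

  /** For each vertex, apply mapF to its adjacent edges in the given direction,
    * then combine the mapped values with reduceF. Vertices that have no edge
    * in that direction are absent from the result. */
  def aggregateNeighbors[V, E, T](
      vertices: Map[Int, V],
      edges: Seq[(Int, Int, E)],
      mapF: Edge[V, E] => T,
      reduceF: (T, T) => T,
      direction: EdgeDir): Map[Int, T] = {
    // Join vertex attributes onto each edge (the "triplets" view).
    val triplets = edges.map { case (s, d, e) =>
      Edge(s, vertices(s), d, vertices(d), e)
    }
    triplets
      .groupBy(t => direction match {
        case InEdges  => t.dstId // aggregate at the edge's destination
        case OutEdges => t.srcId // aggregate at the edge's source
      })
      .map { case (v, ts) => v -> ts.map(mapF).reduce(reduceF) }
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical data: vertex attribute = age; edge (a, b) means a follows b.
    val ages = Map(1 -> 23, 2 -> 42, 3 -> 30)
    val follows = Seq((1, 3, ()), (2, 3, ()), (3, 1, ()))
    val oldest = aggregateNeighbors[Int, Unit, Int](
      ages, follows, e => e.src, (a, b) => math.max(a, b), InEdges)
    println(oldest) // user 3 -> 42, user 1 -> 30; user 2 has no followers
  }
}
```

Swapping `mapF` for `_ => 1` and `reduceF` for `_ + _` turns the same call into an in-degree count, which is why a single primitive can cover many neighborhood queries.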
We can express both Pregel and GraphLab using aggregateNeighbors in 40 lines of code!
Performance Optimizations
• Replicate and co-partition vertices with edges
  » GraphLab (PowerGraph) style vertex-cut partitioning
  » Minimize communication by avoiding edge data movement in JOINs
• In-memory hash index for fast joins
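Vertex-cut partitioning can be illustrated with a toy edge partitioner: every edge lives in exactly one partition, and a vertex is replicated to each partition that holds one of its edges. The hash rule in `partitionOf` and all names below are assumptions made for this sketch; GraphX's own partitioner is not shown in the slides.

```scala
object VertexCutSketch {
  /** Assign each edge to exactly one partition, so edge data never moves.
    * The hash rule is an illustrative assumption, not GraphX's partitioner. */
  def partitionOf(src: Int, dst: Int, numParts: Int): Int =
    Math.floorMod(src * 31 + dst, numParts)

  /** For each vertex, the set of partitions that need a replica of it. */
  def replicas(edges: Seq[(Int, Int)], numParts: Int): Map[Int, Set[Int]] =
    edges
      .flatMap { case (s, d) =>
        val p = partitionOf(s, d, numParts)
        Seq(s -> p, d -> p) // both endpoints must be present where the edge lives
      }
      .groupBy(_._1)
      .map { case (v, ps) => v -> ps.map(_._2).toSet }

  /** Average number of replicas per vertex; lower means less communication. */
  def replicationFactor(edges: Seq[(Int, Int)], numParts: Int): Double = {
    val r = replicas(edges, numParts)
    r.values.map(_.size).sum.toDouble / r.size
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical graph: vertex 1 has three edges, plus one edge 2-3.
    val edges = Seq((1, 2), (1, 3), (1, 4), (2, 3))
    println(replicas(edges, numParts = 2))
    println(replicationFactor(edges, numParts = 2))
  }
}
```

Only vertex attributes cross partition boundaries in this layout, so the cost of a join grows with the replication factor rather than with the number of edges.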
Early Performance
Runtime in seconds, PageRank for 10 iterations:

System     Runtime (s)
GraphLab   22
GraphX     165
Hadoop     1340
In-Progress Optimizations
• Byte-code inspection of user functions
  » E.g., if mapF does not need edge data, we can rewrite the query to delay the join
• Execution strategies optimizer
  » Scan edges, randomly accessing vertices
  » Scan vertices, randomly accessing edges
Current Implementation
[Figure: software stack. Spark (relational operators) at the base, GraphX above it, then Pregel (20) and GraphLab (20), with PageRank (5), Connected Comp. (10), Shortest Path (10), and ALS (40) on top]
Demo
Reynold Xin
Summary
1. Graph-parallel primitives on Spark.
2. Currently slower than GraphLab, but
  » No need for specialized systems
  » Easier ETL, and easier consumption of output
  » Interactive graph data mining
3. Future work will bring performance closer to specialized engines.
Status
• Currently finalizing the APIs
  » Feedback wanted: http://bit.ly/graph-api
• Also working on improving system performance
• Will be part of Spark 0.9
Backup slides
Vertex Cut Partitioning
aggregateNeighbors
Example: Vertex Degree
A: 5, B: 0, C: 0, D: 0, E: 0, F: 0
Specialized Graph Systems
• GraphLab: Shared State [UAI’10, VLDB’12]
• Pregel: Messaging [PODC’09, SIGMOD’10]
• Many others: Giraph, Stanford GPS, Signal-Collect, Combinatorial BLAS, BoostPGL, …