@GraphDevroom
Single-pass Graph Stream Analytics with Apache Flink
Rethinking graph processing for dynamic data
Vasiliki Kalavri, Paris Carbone
Real Graphs are dynamic
Graphs created by events happening in real-time
• liking a post
• buying a book
• listening to a song
• rating a movie
• packet switching in computer networks
• bitcoin transactions
Each event adds an edge to the graph
In a batch world
We create and analyze a snapshot of the real graph
• all events / interactions / relationships that happened between t0 and tn
• the Facebook social network on January 30 2016
• user web logs gathered between March 1st 12:00 and 16:00
• retweets and replies for 24h after the announcement of the death of David Bowie
Batch Graph Processing
In a streaming world
• We receive and consume the events as they are happening, in real-time
• We analyze the evolving graph and receive results continuously
Streaming Graph Processing
Sounds expensive?
Challenges
• maintain the graph structure
  • how to apply state updates efficiently?
• update the result
  • re-run the analysis for each event?
  • design an incremental algorithm?
  • run separate instances on multiple snapshots?
• compute only on most recent events
The Apache Flink Stack
• APIs: DataSet, DataStream
• Execution: Distributed Dataflow
• Deployment
DataSet
• Bounded Data Sources
• Structured Iterations
• Blocking Operations
DataStream
• Unbounded Data Sources
• Asynchronous Iterations
• Incremental Operations
Unifying Data Processing
Client
• execution plan building
• optimisation
Job Manager
• scheduling tasks
• monitoring/recovery
Execution (Distributed Dataflow)
• task pipelining
• blocking
DataSet (e.g. reading from HDFS):
DataSet<String> text = env.readTextFile("hdfs://…");
text.map(…).groupBy(…).reduceGroup(…)…
DataStream (e.g. reading from Kafka):
DataStream<String> events = env.addSource(new KafkaConsumer(…));
events.map(…).filter(…).window(…).fold(…)…
Graph Processing on Apache Flink
Gelly (on top of the DataSet API)
• Static Graphs
• Multi-Pass Algorithms
• Full Computations
Data Streams as ADTs
• Direct access to the execution graph / topology
• Suitable for engineers
DataStream (similar to: PCollection, DStream)
• Abstract Data Type transformations hide operator details
• Suitable for data analysts and engineers
Nature of a DataStream Job
• Tasks are long running in a pipelined execution.
• State is kept within tasks.
• Transformations are applied per-record or per-window.
Execution Graph: from unbounded data sources to unbounded data sinks
Execution Properties
• operator parallelism
• stream partitioning
Working with DataStreams
Creation
DataStream<String> myStream =
- for supported data sources:
  env.addSource(new FlinkKafkaConsumer<String>(…));
  env.addSource(new RMQSource<String>(…));
  env.addSource(new TwitterSource(propsFile));
  env.socketTextStream(…);
- for testing:
  env.fromCollection(…);
  env.fromElements(…);
- for adding any custom source:
  env.addSource(new MyCustomSource(…));
Properties
myStream.setParallelism(3)
- partitioning:
  myStream.broadcast(); .rebalance(); .forward();
  .keyBy(key); // partition stream and operator state by key
Transformations
myStream.map(…); myStream.flatMap(…); myStream.filter(…); myStream.union(myOtherStream);
- for aggregations on partitioned-by-key streams:
  myKeyStream.reduce(…); myKeyStream.fold(…); myKeyStream.sum(…);
Example
env.setParallelism(2); //default parallelism
DataStream<Tuple2<String, Integer>> counts = env
    .socketTextStream("localhost", 9999)
    .flatMap(new Splitter()) //transformation
    .keyBy(0) //partitioning
    .sum(1) //rolling aggregation
    .setParallelism(4);
counts.print();
Input: "cool, gelly is cool"
flatMap: <"cool",1> <"gelly",1> <"is",1> <"cool",1>
sum: <"gelly",1> <"is",1> <"cool",1> <"cool",2>
print
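The rolling aggregation in the example above can be imitated without a cluster. The following is a minimal sketch in plain Java, not Flink code: the class `RollingWordCount` and its `split` method are hypothetical stand-ins for the `Splitter` and the keyed `sum`, and the tokenisation rule (split on non-letters) is an assumption.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class RollingWordCount {
    // Hypothetical stand-in for the Splitter: lower-case the line and split on non-letters.
    static List<String> split(String line) {
        List<String> words = new ArrayList<>();
        for (String w : line.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        return words;
    }

    // Emits one updated "word=count" record per incoming word,
    // mimicking a rolling sum over a stream partitioned by word.
    static List<String> rollingCounts(String line) {
        Map<String, Integer> state = new HashMap<>(); // per-key state
        List<String> out = new ArrayList<>();
        for (String w : split(line)) {
            int count = state.merge(w, 1, Integer::sum);
            out.add(w + "=" + count);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(rollingCounts("cool, gelly is cool"));
    }
}
```

Every input record produces an output record, so the second "cool" yields an updated count instead of waiting for the end of the input.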
Working with Windows
Why windows? We are often interested in fresh data!
Highlight: Flink can form and trigger windows consistently under different notions of time and deal with late events!
1) Sliding windows
myKeyStream.timeWindow(Time.of(60, TimeUnit.SECONDS), Time.of(20, TimeUnit.SECONDS));
2) Tumbling windows
myKeyStream.timeWindow(Time.of(60, TimeUnit.SECONDS));
[Figure: events at 15, 38, 65, 88, ... on a time axis (#sec, 0 to 120) grouped into window buckets/panes, with SUM #1, SUM #2 and SUM #3 computed per window.]
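To make the difference between the two window types concrete, here is a small sketch in plain Java that computes which windows an event timestamp falls into. It is not Flink's implementation; the class `WindowAssignment` is hypothetical, and it assumes windows are aligned to multiples of the slide (or size) counted from time 0.

```java
import java.util.ArrayList;
import java.util.List;

class WindowAssignment {
    // A tumbling window of the given size: each timestamp belongs to exactly one window.
    static long tumblingWindowStart(long t, long size) {
        return t - Math.floorMod(t, size);
    }

    // A sliding window of the given size and slide: each timestamp belongs to
    // size / slide overlapping windows. Returns their start times, latest first.
    static List<Long> slidingWindowStarts(long t, long size, long slide) {
        List<Long> starts = new ArrayList<>();
        long lastStart = t - Math.floorMod(t, slide);
        for (long start = lastStart; start > t - size; start -= slide) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        // Event at second 65, windows of 60 seconds sliding every 20 seconds.
        System.out.println(slidingWindowStarts(65, 60, 20));
        // Same event with tumbling windows of 60 seconds.
        System.out.println(tumblingWindowStart(65, 60));
    }
}
```

With a 60-second size and a 20-second slide, the event at second 65 contributes to three overlapping sums, whereas with tumbling windows it contributes to one.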
Example
env.setParallelism(2); //default parallelism
DataStream<Tuple2<String, Integer>> counts = env
    .socketTextStream("localhost", 9999)
    .flatMap(new Splitter()) //transformation
    .keyBy(0) //partitioning
    .timeWindow(Time.of(5, TimeUnit.MINUTES)) //tumbling window
    .sum(1) //window aggregation
    .setParallelism(4);
counts.print();
flatMap -> window sum -> print
10:48 - "cool, gelly is cool" -> <"gelly",1>… <"cool",2>
11:01 - "dataflow is cool too" -> <"dataflow",1>… <"cool",1>
Single-Pass Graph Streaming with Windows
• Each event represents an edge addition
• Each edge is processed once and thrown away, i.e. the graph structure is not explicitly maintained
• The state maintained corresponds to a graph summary, a continuously improving property, an aggregation
• Recent events can be grouped in a graph window and processed independently
What’s the benefit?
• Get results faster
  • No need to wait for the job to finish
  • Sometimes, early approximations are better than late exact answers
• Get results continuously
  • Process unbounded number of events
• Use less memory
  • single-pass algorithms don't store the graph structure
  • run computations on a graph summary
What can you do in this model?
• transformations, e.g. mapping, filtering vertex / edge values, reverse edge direction
• continuous aggregations, e.g. degree distribution
Streaming Degrees Distribution
[Figure: animation over an example graph with vertices 1 to 8. A histogram of #vertices (y-axis, 0 to 6) per degree (x-axis, 1 to 4) is updated as each edge arrives.]
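A continuous degree distribution only needs two small maps as state, never the edges themselves. The sketch below is plain Java, not Gelly-Stream code; the class `StreamingDegreeDistribution` is hypothetical, and the edge list in `main` is an assumption based on the 8-vertex example graph used in this deck.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

class StreamingDegreeDistribution {
    private final Map<Integer, Integer> degree = new HashMap<>();            // vertex -> degree
    private final Map<Integer, Integer> verticesPerDegree = new TreeMap<>(); // degree -> #vertices

    // Processes one edge and returns a snapshot of the updated distribution.
    Map<Integer, Integer> addEdge(int u, int v) {
        bump(u);
        bump(v);
        return new TreeMap<>(verticesPerDegree);
    }

    private void bump(int vertex) {
        int newDegree = degree.merge(vertex, 1, Integer::sum);
        if (newDegree > 1) {
            // The vertex leaves its old histogram bucket; drop the bucket when it becomes empty.
            verticesPerDegree.computeIfPresent(newDegree - 1, (d, c) -> c == 1 ? null : c - 1);
        }
        verticesPerDegree.merge(newDegree, 1, Integer::sum);
    }

    public static void main(String[] args) {
        int[][] edges = {{3, 1}, {5, 2}, {5, 4}, {7, 6}, {8, 6}, {4, 2}, {4, 3}, {8, 7}, {4, 1}};
        StreamingDegreeDistribution dist = new StreamingDegreeDistribution();
        for (int[] e : edges) {
            System.out.println(dist.addEdge(e[0], e[1]));
        }
    }
}
```

Each edge is processed once and then discarded; the state grows with the number of vertices, not the number of edges.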
What can you do in this model?
• spanners for distance estimation
• sparsifiers for cut estimation
• sketches for homomorphic properties
[Figure: running an algorithm on the graph gives R1; running it on the graph summary gives R2, with R1 ~ R2.]
What can you do in this model?
• neighborhood aggregations on windows, e.g. triangle counting, clustering coefficient (no iterations… yet!)
Examples
Batch Connected Components
• State: the graph and a component ID per vertex (initially equal to vertex ID)
• Iterative Computation: For each vertex:
• choose the min of neighbors’ component IDs and own component ID as new ID
• if component ID changed since last iteration, notify neighbors
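The iterative scheme above can be sketched in plain Java as follows. This is an illustration, not Gelly's implementation: the class `BatchConnectedComponents` is hypothetical, and for brevity every vertex re-evaluates its neighbors in every iteration instead of only reacting to notifications.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class BatchConnectedComponents {
    // Returns vertex -> component ID (the smallest vertex ID in its component).
    static Map<Integer, Integer> run(int[][] edges) {
        // State: the whole graph, stored as adjacency sets.
        Map<Integer, Set<Integer>> neighbors = new HashMap<>();
        for (int[] e : edges) {
            neighbors.computeIfAbsent(e[0], k -> new HashSet<>()).add(e[1]);
            neighbors.computeIfAbsent(e[1], k -> new HashSet<>()).add(e[0]);
        }
        // Initially every vertex is its own component.
        Map<Integer, Integer> component = new TreeMap<>();
        for (int v : neighbors.keySet()) {
            component.put(v, v);
        }
        boolean changed = true;
        while (changed) { // one pass over the graph per iteration
            changed = false;
            Map<Integer, Integer> next = new TreeMap<>(component);
            for (int v : neighbors.keySet()) {
                int min = component.get(v);
                for (int n : neighbors.get(v)) {
                    min = Math.min(min, component.get(n));
                }
                if (min < component.get(v)) {
                    next.put(v, min);
                    changed = true;
                }
            }
            component = next;
        }
        return component;
    }

    public static void main(String[] args) {
        int[][] edges = {{3, 1}, {5, 2}, {5, 4}, {7, 6}, {8, 6}, {4, 2}, {4, 3}, {8, 7}, {4, 1}};
        System.out.println(run(edges));
    }
}
```

Note that the full adjacency structure must be held in memory and is traversed once per iteration, which is what the single-pass model avoids.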
[Figure: Batch Connected Components on an example graph with vertices 1 to 8. Iterations i=0 to i=3 show the component IDs and the messages exchanged between neighbors; at i=3 vertices 1 to 5 carry component ID 1 and vertices 6 to 8 carry component ID 6.]
Streaming Connected Components
• State: a disjoint set data structure for the components
• Computation: For each edge
• if both endpoints are seen for the 1st time, create a component with ID the min of the vertex IDs
• if the endpoints are in different components, merge them and update the component ID to the min of the component IDs
• if only one of the endpoints belongs to a component, add the other one to the same component
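The per-edge rules above map naturally onto a union-find structure. The following plain-Java sketch is an assumption about how this could look, not the Gelly-Stream source: the class `StreamingConnectedComponents` is hypothetical, unseen vertices are treated as singleton components so the three cases collapse into one union step, and the edge order in `main` is assumed from the deck's example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class StreamingConnectedComponents {
    // Disjoint set: vertex -> parent; a root is its own parent and is the component ID.
    private final Map<Integer, Integer> parent = new HashMap<>();

    int find(int v) {
        parent.putIfAbsent(v, v); // a vertex seen for the first time is a singleton
        int root = v;
        while (parent.get(root) != root) {
            root = parent.get(root);
        }
        while (parent.get(v) != root) { // path compression
            int next = parent.get(v);
            parent.put(v, root);
            v = next;
        }
        return root;
    }

    // Processes one edge; the edge itself is not stored.
    void addEdge(int u, int v) {
        int ru = find(u);
        int rv = find(v);
        if (ru != rv) {
            // Link the larger root under the smaller one, so the component ID stays the min.
            parent.put(Math.max(ru, rv), Math.min(ru, rv));
        }
    }

    Map<Integer, Set<Integer>> components() {
        Map<Integer, Set<Integer>> out = new TreeMap<>();
        for (int v : new ArrayList<>(parent.keySet())) {
            out.computeIfAbsent(find(v), k -> new TreeSet<>()).add(v);
        }
        return out;
    }

    public static void main(String[] args) {
        int[][] edges = {{3, 1}, {5, 2}, {5, 4}, {7, 6}, {8, 6}, {4, 2}, {4, 3}, {8, 7}, {4, 1}};
        StreamingConnectedComponents cc = new StreamingConnectedComponents();
        for (int[] e : edges) {
            cc.addEdge(e[0], e[1]);
            System.out.println(cc.components());
        }
    }
}
```

The state is one map entry per vertex, and the current components can be read out after any edge.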
[Figure: animation over an example graph with vertices 1 to 8. The edge stream 3-1, 5-2, 5-4, 7-6, 8-6, 4-2, 4-3, 8-7, 4-1 is consumed one edge at a time while a ComponentID / Vertices table is updated, ending with component 1 = {1, 2, 3, 4, 5} and component 6 = {6, 7, 8}.]
Distributed Streaming Connected Components
Streaming Bipartite Detection
Similar to connected components, but
• each vertex is also assigned a sign, (+) or (-)
• edge endpoints must have different signs
• when merging components, if flipping all signs doesn’t work => the graph is not bipartite
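One way to realise the sign bookkeeping above is a union-find that also stores, for every vertex, whether its sign equals or differs from its parent's sign. This is a sketch under that assumption, not the deck's implementation: the class `StreamingBipartiteness` is hypothetical, and it tracks relative signs instead of flipping them explicitly, which has the same effect as flipping all signs of one component on merge.

```java
import java.util.HashMap;
import java.util.Map;

class StreamingBipartiteness {
    private final Map<Integer, Integer> parent = new HashMap<>();
    // 0 if the vertex has the same sign as its parent, 1 if it has the opposite sign.
    private final Map<Integer, Integer> parity = new HashMap<>();
    private boolean bipartite = true;

    // Returns {root of v, sign of v relative to that root}.
    private int[] find(int v) {
        parent.putIfAbsent(v, v);
        parity.putIfAbsent(v, 0);
        int sign = 0;
        while (parent.get(v) != v) {
            sign ^= parity.get(v);
            v = parent.get(v);
        }
        return new int[] {v, sign};
    }

    // Processes one edge and returns whether the graph seen so far is still bipartite.
    boolean addEdge(int u, int v) {
        int[] a = find(u);
        int[] b = find(v);
        if (a[0] == b[0]) {
            if (a[1] == b[1]) {
                bipartite = false; // both endpoints carry the same sign: odd cycle
            }
        } else {
            // Merge: attach b's root under a's root with the relative sign
            // that gives u and v different signs.
            parent.put(b[0], a[0]);
            parity.put(b[0], a[1] ^ b[1] ^ 1);
        }
        return bipartite;
    }

    public static void main(String[] args) {
        StreamingBipartiteness check = new StreamingBipartiteness();
        int[][] edges = {{1, 2}, {2, 3}, {3, 4}, {4, 1}, {1, 3}};
        for (int[] e : edges) {
            System.out.println(e[0] + "-" + e[1] + ": " + check.addEdge(e[0], e[1]));
        }
    }
}
```

In the usage example the 4-cycle 1-2-3-4 keeps the graph bipartite, and the chord 1-3 closes a triangle, after which every call reports `false`.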
[Figure: example graph with vertices 1 to 7, each labelled (+) or (-), initially in two components Cid=1 and Cid=5. Edge 3-5 arrives and the components merge into Cid=1 after the signs of one component are flipped. Edge 3-7 then arrives: can't flip signs and stay consistent => not bipartite!]
The GraphStream
Gelly (on DataSet)
• Static Graphs
• Multi-Pass Algorithms
• Full Computations
Gelly-Stream (on DataStream)
• Dynamic Graphs
• Single-Pass Algorithms
• Incremental Computations
Introducing Gelly-Stream
• Gelly-Stream enriches the DataStream API with two new ADTs:
• GraphStream:
• A representation of a data stream of edges.
• Edges can have state (e.g. weights).
• Supports property streams, transformations and aggregations.
• GraphWindow:
• A “time-slice” of a graph stream.
• It enables neighborhood aggregations (and iterations in the future)
Graph Property Streams
GraphStream -> DataStream
[Figure: a graph over vertices A, B, C, D and the corresponding graph stream of edges.]
.getEdges()
.getVertices()
.numberOfVertices()
.numberOfEdges()
.getDegrees()
.inDegrees()
.outDegrees()
Transform Graph Streams
GraphStream -> GraphStream
[Figure: a graph over vertices A, B, C, D and the corresponding graph stream of edges.]
.mapEdges();
.distinct();
.filterVertices();
.filterEdges();
.reverse();
.undirected();
.union();
Graph Stream Aggregations
[Figure: the edges of the graph stream are folded, optionally per window, into partitioned aggregates (fold / reduce); these are combined and merged into global aggregates (combine / merge, reduce / map), which produce the result aggregate as a property stream.]
global aggregates can be persistent or transient
graphStream.aggregate(new MyGraphAggregation(window, update, fold, combine, merge))
Connected Components
graphStream.aggregate(new ConnectedComponents(window, update, fold, combine, merge))
[Figure: animation over the example graph with vertices 1 to 8. Edges of the graph stream are distributed over parallel partitions, which fold them into local components such as {1,3}, {2,5}, {4,5}, {6,7}, {6,8}; the combine and merge steps join these into {1,3}, {2,4,5}, {6,7,8} and finally {1,2,3,4,5}, {6,7,8}.]
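The fold / combine pattern behind the distributed aggregation can be imitated in plain Java. This is a sketch of the idea only, not the Gelly-Stream `aggregate` API: the classes `DisjointSet` and `FoldCombineComponents` are hypothetical, the round-robin partitioning is an assumption, and everything runs sequentially in one thread.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class DisjointSet {
    final Map<Integer, Integer> parent = new HashMap<>();

    int find(int v) {
        parent.putIfAbsent(v, v);
        while (parent.get(v) != v) {
            v = parent.get(v);
        }
        return v;
    }

    // fold step: add one edge to a partial summary
    void union(int u, int v) {
        int ru = find(u);
        int rv = find(v);
        if (ru != rv) {
            parent.put(Math.max(ru, rv), Math.min(ru, rv));
        }
    }

    // combine step: replay the other summary's (vertex, parent) links as edges
    void merge(DisjointSet other) {
        for (Map.Entry<Integer, Integer> link : other.parent.entrySet()) {
            union(link.getKey(), link.getValue());
        }
    }

    Map<Integer, Set<Integer>> components() {
        Map<Integer, Set<Integer>> out = new TreeMap<>();
        for (int v : new ArrayList<>(parent.keySet())) {
            out.computeIfAbsent(find(v), k -> new TreeSet<>()).add(v);
        }
        return out;
    }
}

class FoldCombineComponents {
    static DisjointSet aggregate(int[][] edges, int partitions) {
        DisjointSet[] partial = new DisjointSet[partitions];
        for (int i = 0; i < partitions; i++) {
            partial[i] = new DisjointSet();
        }
        // Each partition folds only the edges it receives (round-robin here).
        for (int i = 0; i < edges.length; i++) {
            partial[i % partitions].union(edges[i][0], edges[i][1]);
        }
        // The partial summaries are combined into one global summary.
        DisjointSet global = new DisjointSet();
        for (DisjointSet p : partial) {
            global.merge(p);
        }
        return global;
    }

    public static void main(String[] args) {
        int[][] edges = {{3, 1}, {5, 2}, {5, 4}, {7, 6}, {8, 6}, {4, 2}, {4, 3}, {8, 7}, {4, 1}};
        System.out.println(aggregate(edges, 2).components());
    }
}
```

Only the compact summaries travel between the fold and combine steps, never the raw edges, which is why the partial results can be merged in any order.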
Slicing Graph Streams
graphStream.slice(Time.of(1, MINUTE));
[Figure: the graph stream cut into one-minute slices at 11:40, 11:41, 11:42 and 11:43.]
Aggregating Slices
graphStream.slice(Time.of(1, MINUTE), direction)
Aggregations:
.reduceOnEdges();
.foldNeighbors();
.applyOnNeighbors();
• Slicing collocates edges by vertex information (source or target)
• Neighbourhood aggregations are now enabled on sliced graphs
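The effect of slicing can be illustrated in plain Java: edges are grouped first by time slice and then by vertex, after which a per-neighborhood function can run on each group. This is not the Gelly-Stream API; the class `GraphSlices`, its `Edge` type, grouping by source vertex, and the neighbor-count aggregation are all assumptions made for the sketch.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class GraphSlices {
    // Hypothetical timestamped edge; the timestamp is in seconds.
    static final class Edge {
        final int src;
        final int trg;
        final long seconds;

        Edge(int src, int trg, long seconds) {
            this.src = src;
            this.trg = trg;
            this.seconds = seconds;
        }
    }

    // slice start -> (source vertex -> neighbors seen within that slice)
    static Map<Long, Map<Integer, Set<Integer>>> slice(List<Edge> edges, long sliceSeconds) {
        Map<Long, Map<Integer, Set<Integer>>> slices = new TreeMap<>();
        for (Edge e : edges) {
            long start = e.seconds - Math.floorMod(e.seconds, sliceSeconds);
            slices.computeIfAbsent(start, k -> new TreeMap<>())
                  .computeIfAbsent(e.src, k -> new TreeSet<>())
                  .add(e.trg);
        }
        return slices;
    }

    // Example neighborhood aggregation: number of distinct neighbors per vertex and slice.
    static Map<Long, Map<Integer, Integer>> neighborCounts(List<Edge> edges, long sliceSeconds) {
        Map<Long, Map<Integer, Integer>> result = new TreeMap<>();
        slice(edges, sliceSeconds).forEach((start, byVertex) -> {
            Map<Integer, Integer> counts = new TreeMap<>();
            byVertex.forEach((vertex, neighbors) -> counts.put(vertex, neighbors.size()));
            result.put(start, counts);
        });
        return result;
    }

    public static void main(String[] args) {
        List<Edge> edges = List.of(
            new Edge(1, 2, 5), new Edge(1, 3, 50), new Edge(2, 3, 70), new Edge(1, 4, 65));
        System.out.println(neighborCounts(edges, 60));
    }
}
```

Because each slice is aggregated independently, only the edges of the current slice need to be kept in memory.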
Finding matches nearby
graphStream.slice(Time.of(1, MINUTE)).applyOnNeighbors(new FindPairs())
[Figure: slice -> applyOnNeighbors]
Summary
• Many graph analysis problems can be solved in a single pass
• Processing dynamic graphs requires an incremental graph processing model
• We introduce Gelly-Stream, a simple yet powerful library for graph streams