Page 1: Flink Gelly - Karlsruhe - June 2015

Andra Lungu

Flink committer

andra.lungu@campus.tu-berlin.de

Large-Scale Graph Processing with Apache Flink

Page 2: Flink Gelly - Karlsruhe - June 2015

What is Gelly?

Large-scale graph processing API

On top of Flink’s Java API

Official release: Flink 0.9

Off-the-shelf library methods

Supports record and graph analysis applications; iterative algorithms


Page 3: Flink Gelly - Karlsruhe - June 2015

The Growing Flink Stack


Page 4: Flink Gelly - Karlsruhe - June 2015

How to use Gelly?


Page 5: Flink Gelly - Karlsruhe - June 2015

Graph Creation


Page 6: Flink Gelly - Karlsruhe - June 2015

Graph Properties

• getVertices()
• getEdges()
• getVertexIds()
• getEdgeIds()
• inDegrees()
• outDegrees()
• getDegrees()
• numberOfVertices()
• numberOfEdges()
• getTriplets()
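To make the degree properties concrete, here is a minimal plain-Java sketch of what in-degree, out-degree and total degree mean for an edge list. It does not call Flink or Gelly; the class `DegreeSketch`, its method names and the `{sourceId, targetId}` array encoding are assumptions made for illustration only. Vertices without a matching edge are simply absent from the result in this sketch.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: hypothetical helper, not part of Gelly.
class DegreeSketch {
    // Counts outgoing edges per vertex. Each edge is {sourceId, targetId}.
    static Map<Long, Long> outDegrees(long[][] edges) {
        Map<Long, Long> degrees = new TreeMap<>();
        for (long[] e : edges) {
            degrees.merge(e[0], 1L, Long::sum);
        }
        return degrees;
    }

    // Counts incoming edges per vertex.
    static Map<Long, Long> inDegrees(long[][] edges) {
        Map<Long, Long> degrees = new TreeMap<>();
        for (long[] e : edges) {
            degrees.merge(e[1], 1L, Long::sum);
        }
        return degrees;
    }

    // Total degree = in-degree + out-degree.
    static Map<Long, Long> degrees(long[][] edges) {
        Map<Long, Long> degrees = outDegrees(edges);
        inDegrees(edges).forEach((id, d) -> degrees.merge(id, d, Long::sum));
        return degrees;
    }

    public static void main(String[] args) {
        long[][] edges = {{1, 2}, {1, 3}, {2, 3}};
        System.out.println(outDegrees(edges)); // {1=2, 2=1}
        System.out.println(inDegrees(edges));  // {2=1, 3=2}
        System.out.println(degrees(edges));    // {1=2, 2=2, 3=2}
    }
}
```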


Page 7: Flink Gelly - Karlsruhe - June 2015

Graph Transformations

Map

• mapVertices(final MapFunction<Vertex<K, VV>, NV> mapper)

• mapEdges(final MapFunction<Edge<K, EV>, NV> mapper)

Filter

• filterOnVertices(FilterFunction<Vertex<K, VV>> vertexFilter)

• filterOnEdges(FilterFunction<Edge<K, EV>> edgeFilter)

• subgraph(FilterFunction<Vertex<K, VV>> vertexFilter, FilterFunction<Edge<K, EV>> edgeFilter)
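The filter transformations can be pictured with a small plain-Java sketch. It assumes that a subgraph keeps the vertices passing the vertex filter and the edges that pass the edge filter and still have both endpoints. The class `SubgraphSketch`, its nested `Edge` type and the use of `java.util.function.Predicate` are stand-ins for illustration; the sketch does not use Flink's `FilterFunction` or any Gelly class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Predicate;

// Illustrative only: hypothetical helper, not part of Gelly.
class SubgraphSketch {
    // Hypothetical stand-in for an edge carrying a numeric value.
    static final class Edge {
        final long source;
        final long target;
        final double value;

        Edge(long source, long target, double value) {
            this.source = source;
            this.target = target;
            this.value = value;
        }
    }

    // Keeps the vertices (id -> value) whose value passes the vertex filter.
    static Map<Long, Integer> filterVertices(Map<Long, Integer> vertices,
                                             Predicate<Integer> vertexFilter) {
        Map<Long, Integer> kept = new TreeMap<>();
        vertices.forEach((id, value) -> {
            if (vertexFilter.test(value)) {
                kept.put(id, value);
            }
        });
        return kept;
    }

    // Keeps the edges that pass the edge filter and whose endpoints both survived.
    static List<Edge> filterEdges(List<Edge> edges, Map<Long, Integer> keptVertices,
                                  Predicate<Edge> edgeFilter) {
        List<Edge> kept = new ArrayList<>();
        for (Edge e : edges) {
            boolean endpointsKept = keptVertices.containsKey(e.source)
                    && keptVertices.containsKey(e.target);
            if (endpointsKept && edgeFilter.test(e)) {
                kept.add(e);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<Long, Integer> vertices = new TreeMap<>();
        vertices.put(1L, 5);
        vertices.put(2L, -1);
        vertices.put(3L, 7);
        List<Edge> edges = new ArrayList<>();
        edges.add(new Edge(1, 2, 0.5));
        edges.add(new Edge(1, 3, 2.0));
        edges.add(new Edge(3, 1, 0.1));

        Map<Long, Integer> keptVertices = filterVertices(vertices, v -> v > 0);
        List<Edge> keptEdges = filterEdges(edges, keptVertices, e -> e.value > 1.0);
        System.out.println(keptVertices.keySet() + " with " + keptEdges.size() + " edge(s)");
    }
}
```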


Page 8: Flink Gelly - Karlsruhe - June 2015

Filter Functions


Page 9: Flink Gelly - Karlsruhe - June 2015

Graph Transformations

Join

• joinWithVertices(DataSet<Tuple2<K, T>> inputDataSet, final MapFunction<Tuple2<VV, T>, VV> mapper)

• joinWithEdges(DataSet<Tuple3<K, K, T>> inputDataSet, final MapFunction<Tuple2<EV, T>, EV> mapper)

• joinWithEdgesOnSource / joinWithEdgesOnTarget

Reverse

Undirected
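As a rough picture of the last two transformations, the sketch below treats "reverse" as swapping the source and target of every edge, and "undirected" as keeping each edge together with its reversed copy. This is a plain-Java illustration under those assumptions; `DirectionSketch` and the `{sourceId, targetId}` encoding are hypothetical and do not come from Gelly.

```java
import java.util.Arrays;

// Illustrative only: hypothetical helper, not part of Gelly.
class DirectionSketch {
    // Swaps source and target of every edge.
    static long[][] reverse(long[][] edges) {
        long[][] reversed = new long[edges.length][];
        for (int i = 0; i < edges.length; i++) {
            reversed[i] = new long[] {edges[i][1], edges[i][0]};
        }
        return reversed;
    }

    // Returns the original edges followed by their reversed copies.
    static long[][] undirected(long[][] edges) {
        long[][] reversed = reverse(edges);
        long[][] both = new long[edges.length * 2][];
        System.arraycopy(edges, 0, both, 0, edges.length);
        System.arraycopy(reversed, 0, both, edges.length, reversed.length);
        return both;
    }

    public static void main(String[] args) {
        long[][] edges = {{1, 2}, {2, 3}};
        System.out.println(Arrays.deepToString(reverse(edges)));    // [[2, 1], [3, 2]]
        System.out.println(Arrays.deepToString(undirected(edges))); // [[1, 2], [2, 3], [2, 1], [3, 2]]
    }
}
```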


Page 10: Flink Gelly - Karlsruhe - June 2015

Union


Page 11: Flink Gelly - Karlsruhe - June 2015

Graph Mutations

• addVertex(final Vertex<K, VV> vertex)
• addVertices(List<Vertex<K, VV>> verticesToAdd)
• addEdge(Vertex<K, VV> source, Vertex<K, VV> target, EV edgeValue)
• addEdges(List<Edge<K, EV>> newEdges)
• removeVertex(Vertex<K, VV> vertex)
• removeVertices(List<Vertex<K, VV>> verticesToBeRemoved)
• removeEdge(Edge<K, EV> edge)
• removeEdges(List<Edge<K, EV>> edgesToBeRemoved)


Page 12: Flink Gelly - Karlsruhe - June 2015

Neighborhood Methods

reduceOnNeighbors(reduceNeighborsFunction, direction)

reduceOnEdges

groupReduceOnNeighbors; groupReduceOnEdges
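The idea behind a neighborhood reduce can be sketched in plain Java: for each vertex, the values of its neighbors in the chosen direction are combined pairwise with a reduce function. The `Direction` enum, the class `NeighborReduceSketch` and the method signature below are hypothetical and only mirror the concept; they are not Gelly's types or signatures.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BinaryOperator;

// Illustrative only: hypothetical helper, not part of Gelly.
class NeighborReduceSketch {
    // Hypothetical direction selector for this sketch.
    enum Direction { IN, OUT, ALL }

    // Combines, per vertex, the values of its neighbors in the given direction.
    // Each edge is {sourceId, targetId}; values maps vertex id -> vertex value.
    static Map<Long, Long> reduceOnNeighbors(Map<Long, Long> values, long[][] edges,
                                             BinaryOperator<Long> reducer,
                                             Direction direction) {
        Map<Long, Long> result = new TreeMap<>();
        for (long[] e : edges) {
            if (direction != Direction.IN) {
                // Out-neighbor: the source vertex sees the target's value.
                result.merge(e[0], values.get(e[1]), reducer);
            }
            if (direction != Direction.OUT) {
                // In-neighbor: the target vertex sees the source's value.
                result.merge(e[1], values.get(e[0]), reducer);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<Long, Long> values = new TreeMap<>();
        values.put(1L, 10L);
        values.put(2L, 20L);
        values.put(3L, 30L);
        long[][] edges = {{1, 2}, {1, 3}, {2, 3}};
        System.out.println(reduceOnNeighbors(values, edges, Long::sum, Direction.OUT)); // {1=50, 2=30}
        System.out.println(reduceOnNeighbors(values, edges, Long::sum, Direction.IN));  // {2=10, 3=30}
        System.out.println(reduceOnNeighbors(values, edges, Long::sum, Direction.ALL)); // {1=50, 2=40, 3=30}
    }
}
```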


Page 13: Flink Gelly - Karlsruhe - June 2015

Graph Validation

Given criteria:

• Edge IDs correspond to vertex IDs
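The criterion above can be checked with a few lines of plain Java: every source and target ID used by an edge must appear in the vertex set. The class and method names are hypothetical, and the sketch does not use Gelly's validation classes.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Illustrative only: hypothetical helper, not part of Gelly.
class ValidationSketch {
    // Returns true when every edge endpoint is a known vertex ID.
    // Each edge is {sourceId, targetId}.
    static boolean hasValidVertexIds(Set<Long> vertexIds, long[][] edges) {
        for (long[] e : edges) {
            if (!vertexIds.contains(e[0]) || !vertexIds.contains(e[1])) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Set<Long> vertexIds = new HashSet<>(Arrays.asList(1L, 2L, 3L));
        System.out.println(hasValidVertexIds(vertexIds, new long[][] {{1, 2}, {2, 3}})); // true
        System.out.println(hasValidVertexIds(vertexIds, new long[][] {{1, 4}}));         // false
    }
}
```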


Page 14: Flink Gelly - Karlsruhe - June 2015

Vertex-centric Iterations

Pregel [BSP] Execution Model

UDFs:

• Messaging Function

• VertexUpdateFunction

In superstep S, each vertex:

• receives the messages its neighbors sent in superstep S-1

• updates its vertex value

• sends its new value to its neighbors, who receive it in superstep S+1


Page 15: Flink Gelly - Karlsruhe - June 2015

Single Source Shortest Paths


Page 16: Flink Gelly - Karlsruhe - June 2015

SSSP – Second Superstep


Page 17: Flink Gelly - Karlsruhe - June 2015

SSSP - Result


Page 18: Flink Gelly - Karlsruhe - June 2015

SSSP – code snippet
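The code for this slide is not part of the transcript. As a stand-in, the following plain-Java sketch simulates vertex-centric SSSP superstep by superstep: a messaging step sends "distance + edge weight" along outgoing edges, and an update step keeps the minimum if it improves the current distance. It does not use Gelly; `VertexCentricSsspSketch`, its `Edge` type, the `maxSupersteps` parameter and the example graph are all assumptions for illustration. Only vertices updated in the previous superstep send messages here, and messages to the same target are combined with `min` as they arrive, which is a simplification.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Illustrative only: single-machine simulation, not Gelly's implementation.
class VertexCentricSsspSketch {
    // Hypothetical weighted, directed edge.
    static final class Edge {
        final long source;
        final long target;
        final double weight;

        Edge(long source, long target, double weight) {
            this.source = source;
            this.target = target;
            this.weight = weight;
        }
    }

    static Map<Long, Double> run(Set<Long> vertexIds, List<Edge> edges,
                                 long sourceId, int maxSupersteps) {
        Map<Long, Double> distance = new TreeMap<>();
        for (long id : vertexIds) {
            distance.put(id, id == sourceId ? 0.0 : Double.POSITIVE_INFINITY);
        }
        Set<Long> active = new HashSet<>();
        active.add(sourceId);

        for (int superstep = 1; superstep <= maxSupersteps && !active.isEmpty(); superstep++) {
            // Messaging step: vertices updated in the previous superstep send
            // (own distance + edge weight) along each outgoing edge.
            Map<Long, Double> inbox = new HashMap<>();
            for (Edge e : edges) {
                if (active.contains(e.source)) {
                    inbox.merge(e.target, distance.get(e.source) + e.weight, Math::min);
                }
            }
            // Update step: a vertex adopts the smallest received distance
            // only if it improves on its current value.
            Set<Long> updated = new HashSet<>();
            for (Map.Entry<Long, Double> message : inbox.entrySet()) {
                if (message.getValue() < distance.get(message.getKey())) {
                    distance.put(message.getKey(), message.getValue());
                    updated.add(message.getKey());
                }
            }
            active = updated;
        }
        return distance;
    }

    public static void main(String[] args) {
        Set<Long> ids = new HashSet<>();
        for (long id = 1; id <= 5; id++) {
            ids.add(id);
        }
        List<Edge> edges = new ArrayList<>();
        edges.add(new Edge(1, 2, 12));
        edges.add(new Edge(1, 4, 47));
        edges.add(new Edge(2, 3, 23));
        edges.add(new Edge(3, 4, 34));
        edges.add(new Edge(3, 5, 35));
        edges.add(new Edge(4, 5, 45));
        edges.add(new Edge(5, 1, 51));
        System.out.println(run(ids, edges, 1L, 10)); // {1=0.0, 2=12.0, 3=35.0, 4=47.0, 5=70.0}
    }
}
```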


Page 19: Flink Gelly - Karlsruhe - June 2015

Gather-Sum-Apply Iterations

UDFs:

• GatherFunction

• SumFunction

• ApplyFunction

Back to SSSP:

• Gather: neighbor value + edge weight

• Sum/Accumulate: choose min

• Apply: compare computed min and update


Page 20: Flink Gelly - Karlsruhe - June 2015

SSSP – Superstep 1


Page 21: Flink Gelly - Karlsruhe - June 2015

SSSP – Superstep 2


Page 22: Flink Gelly - Karlsruhe - June 2015

SSSP - Result


Page 23: Flink Gelly - Karlsruhe - June 2015

SSSP – code snippet
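The code for this slide is not part of the transcript either. The plain-Java sketch below illustrates the three GSA steps for SSSP as described earlier in the deck: gather computes "neighbor value + edge weight", sum keeps the minimum, and apply updates the vertex if the minimum is smaller than its current value. It is a single-machine simulation under those assumptions; the class `GsaSsspSketch`, the static methods `gather`, `sum` and `apply`, and the example graph are hypothetical and are not Gelly's GatherFunction, SumFunction or ApplyFunction.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: single-machine simulation, not Gelly's implementation.
class GsaSsspSketch {
    // Hypothetical weighted, directed edge.
    static final class Edge {
        final long source;
        final long target;
        final double weight;

        Edge(long source, long target, double weight) {
            this.source = source;
            this.target = target;
            this.weight = weight;
        }
    }

    // Gather: candidate distance through one neighbor.
    static double gather(double neighborDistance, double edgeWeight) {
        return neighborDistance + edgeWeight;
    }

    // Sum: combine two candidates by keeping the smaller one.
    static double sum(double a, double b) {
        return Math.min(a, b);
    }

    // Apply: keep the current distance unless the combined candidate is smaller.
    static double apply(double combined, double current) {
        return combined < current ? combined : current;
    }

    static Map<Long, Double> run(List<Long> vertexIds, List<Edge> edges,
                                 long sourceId, int maxSupersteps) {
        Map<Long, Double> distance = new TreeMap<>();
        for (long id : vertexIds) {
            distance.put(id, id == sourceId ? 0.0 : Double.POSITIVE_INFINITY);
        }
        for (int superstep = 1; superstep <= maxSupersteps; superstep++) {
            // Gather over every edge, then sum per target vertex.
            Map<Long, Double> combined = new HashMap<>();
            for (Edge e : edges) {
                double candidate = gather(distance.get(e.source), e.weight);
                combined.merge(e.target, candidate, GsaSsspSketch::sum);
            }
            // Apply per vertex; stop once no vertex value changes.
            boolean changed = false;
            for (Map.Entry<Long, Double> c : combined.entrySet()) {
                double current = distance.get(c.getKey());
                double next = apply(c.getValue(), current);
                if (next != current) {
                    distance.put(c.getKey(), next);
                    changed = true;
                }
            }
            if (!changed) {
                break;
            }
        }
        return distance;
    }

    public static void main(String[] args) {
        List<Long> ids = new ArrayList<>();
        for (long id = 1; id <= 5; id++) {
            ids.add(id);
        }
        List<Edge> edges = new ArrayList<>();
        edges.add(new Edge(1, 2, 12));
        edges.add(new Edge(1, 4, 47));
        edges.add(new Edge(2, 3, 23));
        edges.add(new Edge(3, 4, 34));
        edges.add(new Edge(3, 5, 35));
        edges.add(new Edge(4, 5, 45));
        edges.add(new Edge(5, 1, 51));
        System.out.println(run(ids, edges, 1L, 10));
    }
}
```

On this example graph the sketch should arrive at the same distances as a message-passing formulation, since both take the minimum over the same candidate paths.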


Page 24: Flink Gelly - Karlsruhe - June 2015

Vertex-centric or GSA?

Messaging = Gather + Sum

Update = Apply

Skewed graphs? – GSA (parallel gather)

coGroup (vertex-centric) vs. reduce (GSA)

GSA gathers from immediate neighbors; vertex-centric can send to any vertex


Page 25: Flink Gelly - Karlsruhe - June 2015

Library of Algorithms

Weakly Connected Components

Community Detection

PageRank

Single Source Shortest Paths

Label Propagation


Page 26: Flink Gelly - Karlsruhe - June 2015

Music Profiles Example


Page 27: Flink Gelly - Karlsruhe - June 2015

Input Data

<user-id, song-id, play-count>

Set of bad records [IDs]


Page 28: Flink Gelly - Karlsruhe - June 2015

Filter out Bad Records
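A minimal plain-Java sketch of this step, assuming the bad-record set contains song IDs and that a play record is dropped when its song ID is in that set. The transcript does not say which ID the set refers to, so this is a guess made for illustration. `PlayRecord` and `BadRecordFilterSketch` are hypothetical names, and the sketch uses Java streams rather than Flink DataSets.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Illustrative only: hypothetical helper, not the talk's code.
class BadRecordFilterSketch {
    // Hypothetical <user-id, song-id, play-count> triple.
    static final class PlayRecord {
        final String userId;
        final String songId;
        final int playCount;

        PlayRecord(String userId, String songId, int playCount) {
            this.userId = userId;
            this.songId = songId;
            this.playCount = playCount;
        }
    }

    // Drops every record whose song ID is listed as bad (assumption: IDs are song IDs).
    static List<PlayRecord> filterBadRecords(List<PlayRecord> records, Set<String> badSongIds) {
        return records.stream()
                .filter(r -> !badSongIds.contains(r.songId))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<PlayRecord> records = new ArrayList<>();
        records.add(new PlayRecord("u1", "s1", 10));
        records.add(new PlayRecord("u1", "s2", 3));
        records.add(new PlayRecord("u2", "s2", 7));
        List<PlayRecord> clean = filterBadRecords(records, Collections.singleton("s2"));
        System.out.println(clean.size() + " record(s) kept");
    }
}
```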


Page 29: Flink Gelly - Karlsruhe - June 2015

Compute Top Songs/User


Page 30: Flink Gelly - Karlsruhe - June 2015

Compute Top Songs/User
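One way to picture this step is a group-by on the user ID that keeps the record with the highest play count. The plain-Java sketch below does exactly that; `PlayRecord`, `TopSongSketch` and the rule that the earlier record wins a tie are assumptions for illustration, not details taken from the talk.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: hypothetical helper, not the talk's code.
class TopSongSketch {
    // Hypothetical <user-id, song-id, play-count> triple.
    static final class PlayRecord {
        final String userId;
        final String songId;
        final int playCount;

        PlayRecord(String userId, String songId, int playCount) {
            this.userId = userId;
            this.songId = songId;
            this.playCount = playCount;
        }
    }

    // Keeps, per user, the record with the highest play count.
    // On a tie the record seen first is kept (an arbitrary choice for this sketch).
    static Map<String, PlayRecord> topSongPerUser(List<PlayRecord> records) {
        Map<String, PlayRecord> top = new TreeMap<>();
        for (PlayRecord r : records) {
            top.merge(r.userId, r, (best, next) -> next.playCount > best.playCount ? next : best);
        }
        return top;
    }

    public static void main(String[] args) {
        List<PlayRecord> records = new ArrayList<>();
        records.add(new PlayRecord("u1", "s1", 10));
        records.add(new PlayRecord("u1", "s2", 30));
        records.add(new PlayRecord("u2", "s2", 7));
        topSongPerUser(records).forEach((user, r) ->
                System.out.println(user + " -> " + r.songId + " (" + r.playCount + ")"));
    }
}
```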


Page 31: Flink Gelly - Karlsruhe - June 2015

Create a user-user Graph


Page 32: Flink Gelly - Karlsruhe - June 2015

Create a user-user Graph
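A user-user graph can be sketched by connecting every pair of users who share a song. The plain-Java example below assumes that rule; whether the talk links users on any shared song or only on a shared top song is not stated in the transcript. `UserGraphSketch`, `PlayRecord` and the representation of an edge as a two-element list are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Illustrative only: hypothetical helper, not the talk's code.
class UserGraphSketch {
    // Hypothetical <user-id, song-id, play-count> triple.
    static final class PlayRecord {
        final String userId;
        final String songId;
        final int playCount;

        PlayRecord(String userId, String songId, int playCount) {
            this.userId = userId;
            this.songId = songId;
            this.playCount = playCount;
        }
    }

    // Emits one edge [userA, userB] (userA < userB) per pair of users sharing a song.
    static Set<List<String>> similarUserEdges(List<PlayRecord> records) {
        // Group the distinct users by song.
        Map<String, TreeSet<String>> usersBySong = new TreeMap<>();
        for (PlayRecord r : records) {
            usersBySong.computeIfAbsent(r.songId, k -> new TreeSet<>()).add(r.userId);
        }
        // Pair up the users of each song; the set removes duplicate pairs.
        Set<List<String>> edges = new LinkedHashSet<>();
        for (TreeSet<String> users : usersBySong.values()) {
            List<String> sorted = new ArrayList<>(users);
            for (int i = 0; i < sorted.size(); i++) {
                for (int j = i + 1; j < sorted.size(); j++) {
                    edges.add(Arrays.asList(sorted.get(i), sorted.get(j)));
                }
            }
        }
        return edges;
    }

    public static void main(String[] args) {
        List<PlayRecord> records = new ArrayList<>();
        records.add(new PlayRecord("u1", "s2", 30));
        records.add(new PlayRecord("u2", "s2", 7));
        records.add(new PlayRecord("u3", "s1", 4));
        System.out.println(similarUserEdges(records)); // [[u1, u2]]
    }
}
```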


Page 33: Flink Gelly - Karlsruhe - June 2015

Cluster Similar Users
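Clustering on a user-user graph can be illustrated with a simple synchronous label propagation, one of the library algorithms listed earlier in the deck. In the plain-Java sketch below every vertex starts with its own ID as label and repeatedly adopts the most frequent label among its neighbors. The tie-break (the higher label wins), the use of numeric IDs instead of user names, the `maxIterations` cap and the class name `LabelPropagationSketch` are all assumptions of this sketch, not details confirmed by the talk or by Gelly.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: single-machine simulation, not Gelly's implementation.
class LabelPropagationSketch {
    // Each edge {a, b} is treated as undirected and listed once.
    static Map<Long, Long> propagate(List<Long> vertexIds, long[][] edges, int maxIterations) {
        Map<Long, List<Long>> neighbors = new TreeMap<>();
        for (long id : vertexIds) {
            neighbors.put(id, new ArrayList<>());
        }
        for (long[] e : edges) {
            neighbors.get(e[0]).add(e[1]);
            neighbors.get(e[1]).add(e[0]);
        }
        // Every vertex starts in its own cluster.
        Map<Long, Long> labels = new TreeMap<>();
        for (long id : vertexIds) {
            labels.put(id, id);
        }
        for (int i = 0; i < maxIterations; i++) {
            Map<Long, Long> next = new TreeMap<>(labels);
            for (long id : vertexIds) {
                // Count how often each label occurs among the neighbors.
                Map<Long, Integer> counts = new TreeMap<>();
                for (long n : neighbors.get(id)) {
                    counts.merge(labels.get(n), 1, Integer::sum);
                }
                long best = labels.get(id);
                int bestCount = 0;
                // TreeMap iterates labels in ascending order, so ">=" lets the
                // higher label win a tie (an assumption of this sketch).
                for (Map.Entry<Long, Integer> c : counts.entrySet()) {
                    if (c.getValue() >= bestCount) {
                        best = c.getKey();
                        bestCount = c.getValue();
                    }
                }
                next.put(id, best);
            }
            if (next.equals(labels)) {
                break; // converged
            }
            labels = next;
        }
        return labels;
    }

    public static void main(String[] args) {
        List<Long> ids = Arrays.asList(1L, 2L, 3L, 4L, 5L, 6L);
        // Two disconnected triangles: {1, 2, 3} and {4, 5, 6}.
        long[][] edges = {{1, 2}, {2, 3}, {1, 3}, {4, 5}, {5, 6}, {4, 6}};
        System.out.println(propagate(ids, edges, 10)); // {1=3, 2=3, 3=3, 4=6, 5=6, 6=6}
    }
}
```

Synchronous label propagation is not guaranteed to converge on every graph, which is why the sketch caps the number of iterations.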


Page 34: Flink Gelly - Karlsruhe - June 2015

Coming up Next

• Gelly Blog Post
• Scala API
• More Library Methods
• Flink Streaming Integration
• Graph Partitioning Techniques
• Specialized Operators for Highly Skewed Graphs
• Bipartite Graph Support

Curious? Gelly Roadmap
