Top Banner
GraphFrames: Graph Queries in Apache Spark SQL Ankur Dave UC Berkeley AMPLab Joint work with Alekh Jindal (Microsoft), Li Erran Li (Uber), Reynold Xin (Databricks), Joseph Gonzalez (UC Berkeley), and Matei Zaharia (MIT and Databricks)
20

GraphFrames: Graph Queries in Apache Spark SQLlierranli/publications/GraphFrames-Spark... · GraphFrames: Graph Queries in Apache Spark SQL Ankur Dave UC Berkeley AMPLab Joint work

May 26, 2018

Download

Documents

trinhque
Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
Page 1: GraphFrames: Graph Queries in Apache Spark SQLlierranli/publications/GraphFrames-Spark... · GraphFrames: Graph Queries in Apache Spark SQL Ankur Dave UC Berkeley AMPLab Joint work

GraphFrames: Graph Queries in Apache Spark SQL

Ankur DaveUC Berkeley AMPLab

Joint work with Alekh Jindal (Microsoft), Li Erran Li (Uber), Reynold Xin (Databricks), Joseph Gonzalez (UC Berkeley), and Matei Zaharia (MIT and Databricks)

Page 2: GraphFrames: Graph Queries in Apache Spark SQLlierranli/publications/GraphFrames-Spark... · GraphFrames: Graph Queries in Apache Spark SQL Ankur Dave UC Berkeley AMPLab Joint work

+ Graph Queries

2016Apache Spark + GraphFrames

GraphFrames (2016)

+ Graph Algorithms

2013Apache Spark + GraphX

Relational Queries

2009Spark

Page 3: GraphFrames: Graph Queries in Apache Spark SQLlierranli/publications/GraphFrames-Spark... · GraphFrames: Graph Queries in Apache Spark SQL Ankur Dave UC Berkeley AMPLab Joint work

Graph Algorithms vs. Graph Queries

≈x

PageRank

Alternating Least Squares

Graph Algorithms Graph Queries

Page 4: GraphFrames: Graph Queries in Apache Spark SQLlierranli/publications/GraphFrames-Spark... · GraphFrames: Graph Queries in Apache Spark SQL Ankur Dave UC Berkeley AMPLab Joint work

Graph Algorithms vs. Graph QueriesGraph Algorithm: PageRank Graph Query: Wikipedia Collaborators

Editor 1 Editor 2 Article 1 Article 2

Article 1

Article 2

Editor 1

Editor 2

same day} same day}

Page 5: GraphFrames: Graph Queries in Apache Spark SQLlierranli/publications/GraphFrames-Spark... · GraphFrames: Graph Queries in Apache Spark SQL Ankur Dave UC Berkeley AMPLab Joint work

Graph Algorithms vs. Graph QueriesGraph Algorithm: PageRank

// Iterate until convergence wikipedia.pregel(sendMsg = { e =>

e.sendToDst(e.srcRank * e.weight)},mergeMsg = _ + _,vprog = { (id, oldRank, msgSum) =>

0.15 + 0.85 * msgSum})

Graph Query: Wikipedia Collaboratorswikipedia.find("(u1)-[e11]->(article1);(u2)-[e21]->(article1);(u1)-[e12]->(article2);(u2)-[e22]->(article2)")

.select("*","e11.date – e21.date".as("d1"),"e12.date – e22.date".as("d2"))

.sort("d1 + d2".desc).take(10)

Page 6: GraphFrames: Graph Queries in Apache Spark SQLlierranli/publications/GraphFrames-Spark... · GraphFrames: Graph Queries in Apache Spark SQL Ankur Dave UC Berkeley AMPLab Joint work

Separate SystemsGraph Algorithms Graph Queries

Page 7: GraphFrames: Graph Queries in Apache Spark SQLlierranli/publications/GraphFrames-Spark... · GraphFrames: Graph Queries in Apache Spark SQL Ankur Dave UC Berkeley AMPLab Joint work

Raw Wikipedia

< / >< / >< / >XML

Text Table

Edit GraphEdit Table

Frequent Collaborators

Problem: Mixed Graph AnalysisHyperlinks PageRank

Article Text

User Article

Vandalism Suspects

User User

User Article

Page 8: GraphFrames: Graph Queries in Apache Spark SQLlierranli/publications/GraphFrames-Spark... · GraphFrames: Graph Queries in Apache Spark SQL Ankur Dave UC Berkeley AMPLab Joint work

Solution: GraphFrames

Graph Algorithms Graph Queries

Spark SQL

GraphFrames API

Pattern Query Optimizer

Page 9: GraphFrames: Graph Queries in Apache Spark SQLlierranli/publications/GraphFrames-Spark... · GraphFrames: Graph Queries in Apache Spark SQL Ankur Dave UC Berkeley AMPLab Joint work

GraphFrames API• Unifies graph algorithms, graph queries, and DataFrames• Available in Scala, Java, and Python

class GraphFrame {def vertices: DataFramedef edges: DataFrame

def find(pattern: String): DataFramedef registerView(pattern: String, df: DataFrame): Unit

def degrees(): DataFramedef pageRank(): GraphFramedef connectedComponents(): GraphFrame...

}

Page 10: GraphFrames: Graph Queries in Apache Spark SQLlierranli/publications/GraphFrames-Spark... · GraphFrames: Graph Queries in Apache Spark SQL Ankur Dave UC Berkeley AMPLab Joint work

Implementation

Parsed Pattern

Logical Plan

Materialized Views

Optimized Logical Plan

DataFrameResult

Query String

Graph–RelationalTranslation Join Elimination

and Reordering

Spark SQL

View SelectionGraph

AlgorithmsGraphX

Page 11: GraphFrames: Graph Queries in Apache Spark SQLlierranli/publications/GraphFrames-Spark... · GraphFrames: Graph Queries in Apache Spark SQL Ankur Dave UC Berkeley AMPLab Joint work

Graph–Relational Translation

B

D

A

C

Existing Logical PlanOutput: A,B,C

Src Dst

⋈C=Src

Edge Table

ID Attr

Vertex Table

⋈D=ID

Page 12: GraphFrames: Graph Queries in Apache Spark SQLlierranli/publications/GraphFrames-Spark... · GraphFrames: Graph Queries in Apache Spark SQL Ankur Dave UC Berkeley AMPLab Joint work

Materialized View Selection

GraphX: Triplet view enabled efficient message-passing algorithms

Vertices

B

A

C

D

Edges

A B

A C

B C

C D

A

B

Triplet View

A C

B C

C D

Graph

+

UpdatedPageRanks

B

A

C

D

A

Page 13: GraphFrames: Graph Queries in Apache Spark SQLlierranli/publications/GraphFrames-Spark... · GraphFrames: Graph Queries in Apache Spark SQL Ankur Dave UC Berkeley AMPLab Joint work

Materialized View Selection

GraphFrames: User-defined views enable efficient graph queries

Vertices

B

A

C

D

Edges

A B

A C

B C

C D

A

B

Triplet View

A CB CC D

Graph

User-Defined Views

PageRank

CommunityDetection

Graph Queries

Page 14: GraphFrames: Graph Queries in Apache Spark SQLlierranli/publications/GraphFrames-Spark... · GraphFrames: Graph Queries in Apache Spark SQL Ankur Dave UC Berkeley AMPLab Joint work

Join Elimination

Src Dst1 21 32 32 5

Edges

ID Attr1 A2 B3 C4 D

VerticesSELECT src, dstFROM edges INNER JOIN vertices ON src = id;

Unnecessary join

can be eliminated if tables satisfy referential integrity, simplifying graph–relational translation:

SELECT src, dst FROM edges;

Page 15: GraphFrames: Graph Queries in Apache Spark SQLlierranli/publications/GraphFrames-Spark... · GraphFrames: Graph Queries in Apache Spark SQL Ankur Dave UC Berkeley AMPLab Joint work

Join Reordering

A → B B → A

⋈A, BB → D

C→ B⋈BB → E⋈B

C→ D⋈BC→ E⋈C, D

⋈C, EExample Query

Left-Deep Plan Bushy Plan

A → B B → A

⋈A, B

B → D C→ B

⋈B

B → E⋈B⋈B⋈B, C

User-Defined View

Page 16: GraphFrames: Graph Queries in Apache Spark SQLlierranli/publications/GraphFrames-Spark... · GraphFrames: Graph Queries in Apache Spark SQL Ankur Dave UC Berkeley AMPLab Joint work

EvaluationFaster than Neo4j for unanchored pattern queries

0

0.5

1

1.5

2

2.5

GraphFrames Neo4j

Que

ry la

tenc

y, s

Anchored Pattern Query

01020304050607080

GraphFrames Neo4j

Que

ry la

tenc

y, s

Unanchored Pattern Query

Triangle query on 1M edge subgraph of web-Google. Each system configured to use a single core.

Page 17: GraphFrames: Graph Queries in Apache Spark SQLlierranli/publications/GraphFrames-Spark... · GraphFrames: Graph Queries in Apache Spark SQL Ankur Dave UC Berkeley AMPLab Joint work

EvaluationApproaches performance of GraphX for graph algorithms using Spark SQL whole-stage code generation

0

1

2

3

4

5

6

7

GraphFrames GraphX Naïve Spark

Per-i

tera

tion

runt

ime,

s

PageRank Performance

Per-iteration performance on web-Google, single 8-core machine. Naïve Spark uses Scala RDD API.

Page 18: GraphFrames: Graph Queries in Apache Spark SQLlierranli/publications/GraphFrames-Spark... · GraphFrames: Graph Queries in Apache Spark SQL Ankur Dave UC Berkeley AMPLab Joint work

EvaluationRegistering the right views can greatly improve performance for some queries

Workload: J. Huang, K. Venkatraman, and D.J. Abadi. Query optimization of distributed pattern matching. In ICDE 2014.

Page 19: GraphFrames: Graph Queries in Apache Spark SQLlierranli/publications/GraphFrames-Spark... · GraphFrames: Graph Queries in Apache Spark SQL Ankur Dave UC Berkeley AMPLab Joint work

Future Work• Suggest views automatically• Exploit attribute-based partitioning in optimizer• Code generation for single node

Page 20: GraphFrames: Graph Queries in Apache Spark SQLlierranli/publications/GraphFrames-Spark... · GraphFrames: Graph Queries in Apache Spark SQL Ankur Dave UC Berkeley AMPLab Joint work

Try It Out!Released as a Spark Package at:

https://github.com/graphframes/graphframesThanks to Joseph Bradley, Xiangrui Meng, and Timothy Hunter.

[email protected]