
Apr 10, 2020


www.scads.de

SCALABLE GRAPH DATA MANAGEMENT AND ANALYTICS WITH GRADOOP

MARTIN JUNGHANNS, ANDRE PETERMANN, KEVIN GOMEZ, ERHARD RAHM

"GRAPHS ARE EVERYWHERE" AND LARGE

• Facebook: ca. 1.3 billion users, ca. 340 friends per user
• Twitter: ca. 300 million users, ca. 500 million tweets per day
• Internet: ca. 2.9 billion users
• Genes (human): 20,000-25,000; ca. 4 million individuals
• Patients: > 18 million (Germany)
• Illnesses: > 30,000
• World Wide Web: ca. 1 billion websites
• LOD-Cloud: ca. 31 billion triples

Domains: social science, engineering, life science, information science

GRAPH DATA MANAGEMENT

Relational database systems, e.g., SAP HANA, Vertexica
• store vertices and edges in tables
• static schemas, expensive joins

Graph database systems, e.g., Neo4j, OrientDB
• use of property graph data model and dedicated graph storage
• focus on online transactions and simple analytical queries

Parallel graph processing systems, e.g., Google Pregel, Apache Giraph
• in-memory processing of generic graphs in a shared-nothing cluster
• recent approaches (Spark, Flink): analysis workflows with graph operators and general-purpose data operators
• little support for semantically expressive graphs
• no end-to-end approach for graph analytics

END-TO-END GRAPH ANALYTICS

Data Integration → Graph Analytics → Representation

• Data integration: integrate data from one or more sources into a dedicated graph storage with a common graph data model
• Graph analytics: definition of analytical workflows from an operator algebra
• Representation: result representation in a meaningful way

An end-to-end framework and research platform for efficient, distributed and domain-independent graph data management and analytics.

GRADOOP CHARACTERISTICS

Hadoop-based framework for graph data management and analysis
• Graph storage in a scalable distributed store, e.g., HBase
• Extended property graph data model: operators on graphs and collections of (sub)graphs; support for semantic graph queries and mining
• Leverages powerful components of the Hadoop ecosystem: MapReduce, Giraph, Spark, Flink, …
• New functionality for graph-based processing workflows and graph mining: Frequent Subgraph Mining, Graph Pattern Matching, …

HIGH LEVEL ARCHITECTURE

[Figure: layered architecture. Components: HDFS cluster; HBase distributed graph store; Extended Property Graph Model; operator implementations; workflow execution; workflow declaration (visual, GrALa DSL). The components span the phases data integration, graph analytics and representation; arrows distinguish data flow and control flow.]

DATA MODEL - REQUIREMENTS

1. Simple but powerful
• intuitive graphs are flat structures of vertices and binary edges

2. Logical graphs
• support of multiple, possibly overlapping graphs in one database is advantageous for analytical applications

3. Attributes and type labels
• type labels and custom properties for vertices, edges and graphs

4. Parallel edges and loops
• allow multiple relations between two vertices and self-connected relations

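EXTENDED PROPERTY GRAPH MODEL

$DB_{EPGM} = \langle \mathcal{V}, \mathcal{E}, \mathcal{G}, T, \tau, K, A, \kappa \rangle$

Vertex space: $\mathcal{V} = \{v_0, \dots, v_n\}$

Edge space: $\mathcal{E} = \{e_0, \dots, e_m\}$, $e_i = \langle v_i, v_j \rangle$, $v_i, v_j \in \mathcal{V}$

Logical graphs: $\mathcal{G} = \{G_{DB}, G_0, \dots, G_p\}$, $G_i = \langle V, E \rangle$, $V \subseteq \mathcal{V} \wedge E \subseteq \mathcal{E}$

Type labels: $\tau : \mathcal{V} \cup \mathcal{E} \cup \mathcal{G} \to T$

Properties: $\kappa : (\mathcal{V} \cup \mathcal{E} \cup \mathcal{G}) \times K \to A$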
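To make the definition more tangible, here is a minimal Python sketch of the model's building blocks. It is an illustration only and not Gradoop's implementation: the class names, the field names (`vertex_ids`, `edge_ids`) and the sample data are assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    """Common part of vertices, edges and logical graphs:
    a type label (tau) and a property map (kappa)."""
    id: int
    label: str
    properties: dict = field(default_factory=dict)


@dataclass
class Vertex(Element):
    pass


@dataclass
class Edge(Element):
    source: int = 0  # id of the source vertex
    target: int = 0  # id of the target vertex


@dataclass
class LogicalGraph(Element):
    vertex_ids: set = field(default_factory=set)  # V, a subset of the vertex space
    edge_ids: set = field(default_factory=set)    # E, a subset of the edge space


def is_valid(graph, vertices, edges):
    """Check V ⊆ vertex space and E ⊆ edge space."""
    return graph.vertex_ids <= set(vertices) and graph.edge_ids <= set(edges)


# Hypothetical sample data: three vertices, two parallel "knows" edges.
vertices = {
    0: Vertex(0, "Person", {"name": "Alice"}),
    1: Vertex(1, "Person", {"name": "Bob"}),
    2: Vertex(2, "Forum", {"title": "Graphs"}),
}
edges = {
    0: Edge(0, "knows", {}, source=0, target=1),
    1: Edge(1, "knows", {}, source=0, target=1),  # parallel edge
    2: Edge(2, "hasMember", {}, source=2, target=0),
}
g0 = LogicalGraph(0, "Community", {"topic": "db"}, {0, 1}, {0, 1})
g1 = LogicalGraph(1, "Community", {}, {0, 2}, {2})
```

In this sketch `g0` and `g1` share vertex 0, which mirrors the requirement that logical graphs may overlap.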

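GRAPH OPERATORS

| Arity | Operator | Definition | GrALa notation |
|---|---|---|---|
| unary | Pattern Matching | $\mu_{G^*,\varphi} : \mathcal{G} \to \mathcal{G}^n$ | graph.match(patternGraph, predicate) : Collection |
| unary | Aggregation | $\gamma_a : \mathcal{G} \to \mathcal{G}$ | graph.aggregate(propertyKey, aggregateFunction) : Graph |
| unary | Projection | $\pi_{\upsilon,\epsilon} : \mathcal{G} \to \mathcal{G}$ | graph.project(vertexFunction, edgeFunction) : Graph |
| unary | Summarization | $\varsigma_{\upsilon,\epsilon} : \mathcal{G} \to \mathcal{G}$ | graph.summarize(vertexGroupKeys, vertexAggregateFunction, edgeGroupKeys, edgeAggregateFunction) : Graph |
| binary | Combination | $\sqcup : \mathcal{G}^2 \to \mathcal{G}$ | graph.combine(otherGraph) : Graph |
| binary | Overlap | $\sqcap : \mathcal{G}^2 \to \mathcal{G}$ | graph.overlap(otherGraph) : Graph |
| binary | Exclusion | $- : \mathcal{G}^2 \to \mathcal{G}$ | graph.exclude(otherGraph) : Graph |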
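The three binary operators can be read as set operations on vertex and edge sets. The following Python sketch shows one such reading under stated assumptions: the `Graph` tuple is a hypothetical stand-in for a logical graph, and exclusion is interpreted here as "remove the other graph's vertices, then keep only edges whose endpoints both remain", which the slides do not spell out.

```python
from typing import NamedTuple


class Graph(NamedTuple):
    vertices: frozenset  # vertex ids
    edges: dict          # edge id -> (source id, target id)


def combine(g, h):
    """Union of the vertex and edge sets of both graphs."""
    return Graph(g.vertices | h.vertices, {**g.edges, **h.edges})


def overlap(g, h):
    """Vertices and edges contained in both graphs."""
    common_edges = {e: st for e, st in g.edges.items() if e in h.edges}
    return Graph(g.vertices & h.vertices, common_edges)


def exclude(g, h):
    """Vertices of g that are not in h, plus the edges of g between them."""
    kept = g.vertices - h.vertices
    kept_edges = {e: (s, t) for e, (s, t) in g.edges.items()
                  if s in kept and t in kept}
    return Graph(kept, kept_edges)


# Hypothetical sample graphs sharing vertices 2 and 3 and edge 11.
g = Graph(frozenset({1, 2, 3}), {10: (1, 2), 11: (2, 3)})
h = Graph(frozenset({2, 3, 4}), {11: (2, 3), 12: (3, 4)})
```

With this data, `exclude(g, h)` keeps only vertex 1 and drops edge 10, because its target vertex 2 is no longer part of the result.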

PATTERN MATCHING

    pattern = new Graph("(a)<-d-(b)-e->(c)")
    predicate = (Graph g =>
        g.V[$a][:type] == "Person" &&
        g.V[$b][:type] == "Forum" &&
        g.V[$c][:type] == "Person" &&
        g.E[$d][:type] == "hasMember" &&
        g.E[$e][:type] == "hasMember")
    result = db.match(pattern, predicate)
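As a rough illustration of what this query computes, the sketch below enumerates matches of the pattern (a)<-d-(b)-e->(c) by brute force in Python. It is not how Gradoop evaluates patterns; the data layout, the function name `match_forum_members` and the sample graph are assumptions for this example, and the pattern is hard-coded rather than parsed.

```python
from itertools import permutations

# Hypothetical sample graph.
vertices = {1: "Person", 2: "Person", 3: "Forum", 4: "Person"}  # id -> type label
edges = {                                                       # id -> (source, target, type label)
    10: (3, 1, "hasMember"),
    11: (3, 2, "hasMember"),
    12: (1, 2, "knows"),
}


def match_forum_members(vertices, edges):
    """Find all bindings for (a)<-d-(b)-e->(c) that satisfy the predicate:
    b is a Forum, a and c are Persons, d and e are hasMember edges."""
    matches = []
    # d and e must be two distinct edges leaving the same vertex b.
    for d, e in permutations(edges, 2):
        b_d, a, type_d = edges[d]
        b_e, c, type_e = edges[e]
        if (b_d == b_e
                and type_d == type_e == "hasMember"
                and vertices[b_d] == "Forum"
                and vertices[a] == "Person"
                and vertices[c] == "Person"):
            matches.append({"a": a, "b": b_d, "c": c, "d": d, "e": e})
    return matches


result = match_forum_members(vertices, edges)
```

Because the pattern is symmetric in a and c, each pair of forum members is reported twice, once per assignment of the two edges to d and e.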

SUMMARIZATION

    personGraph = db.G[0].combine(db.G[1]).combine(db.G[2])
    vertexGroupingKeys = {:type, "city"}
    edgeGroupingKeys = {:type}
    vertexAggFunc = (Vertex vSum, Set vertices => vSum["count"] = |vertices|)
    edgeAggFunc = (Edge eSum, Set edges => eSum["count"] = |edges|)
    sumGraph = personGraph.summarize(vertexGroupingKeys, vertexAggFunc, edgeGroupingKeys, edgeAggFunc)
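The grouping idea behind summarization can be sketched in a few lines of Python: vertices are grouped by the values of their grouping keys, edges are grouped by the groups of their endpoints plus their type label, and both are counted. The function name, the data layout and the sample data are assumptions for illustration, not Gradoop's API.

```python
from collections import Counter

# Hypothetical sample data.
vertices = {  # id -> properties, including the type label
    1: {"type": "Person", "city": "Leipzig"},
    2: {"type": "Person", "city": "Leipzig"},
    3: {"type": "Person", "city": "Dresden"},
}
edges = [  # (source id, target id, type label)
    (1, 2, "knows"),
    (2, 1, "knows"),
    (1, 3, "knows"),
]


def summarize(vertices, edges, vertex_keys):
    """Group vertices by the given keys and edges by (source group,
    target group, type label); count the members of each group."""
    group_of = {vid: tuple(props.get(k) for k in vertex_keys)
                for vid, props in vertices.items()}
    vertex_counts = Counter(group_of.values())
    edge_counts = Counter((group_of[s], group_of[t], label)
                          for s, t, label in edges)
    return vertex_counts, edge_counts


summary_vertices, summary_edges = summarize(vertices, edges, ["type", "city"])
```

Each key of `summary_vertices` corresponds to one summarized vertex and each key of `summary_edges` to one summarized edge; the counts play the role of the "count" property set by the aggregate functions.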

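COLLECTION OPERATORS

| Group | Operator | Definition | GrALa notation |
|---|---|---|---|
| collection | Selection | $\sigma_\varphi : \mathcal{G}^n \to \mathcal{G}^n$ | collection.select(predicate) : Collection |
| collection | Distinct | $\delta : \mathcal{G}^n \to \mathcal{G}^n$ | collection.distinct() : Collection |
| collection | Sort by | $\xi_{k,d} : \mathcal{G}^n \to \mathcal{G}^n$ | collection.sortBy(key, [:asc\|:desc]) : Collection |
| collection | Top | $\beta_n : \mathcal{G}^n \to \mathcal{G}^n$ | collection.top(limit) : Collection |
| collection | Union | $\cup : (\mathcal{G}^n)^2 \to \mathcal{G}^n$ | collection.union(otherCollection) : Collection |
| collection | Intersection | $\cap : (\mathcal{G}^n)^2 \to \mathcal{G}^n$ | collection.intersect(otherCollection) : Collection |
| collection | Difference | $\setminus : (\mathcal{G}^n)^2 \to \mathcal{G}^n$ | collection.difference(otherCollection) : Collection |
| auxiliary | Apply | $\lambda_o : \mathcal{G}^n \to \mathcal{G}^n$ | collection.apply(unaryGraphOperator) : Collection |
| auxiliary | Reduce | $\rho_o : \mathcal{G}^n \to \mathcal{G}$ | collection.reduce(binaryGraphOperator) : Graph |
| auxiliary | Call | $\eta_{a,P} : \mathcal{G}^n \to \mathcal{G}^n$ | [graph\|collection].callFor[Graph\|Collection](algorithm, parameters) : [Graph\|Collection] |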
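Several of these operators map naturally onto list processing. The Python sketch below mimics selection, sort by, top and reduce on a list of toy graphs; the dictionary layout, the function names and the sample values are assumptions for illustration and do not reflect how Gradoop stores graph collections.

```python
from functools import reduce

# Hypothetical collection: each graph has a property map and a set of vertex ids.
collection = [
    {"props": {"revenue": 50}, "V": {1, 2, 3, 4}},
    {"props": {"revenue": 90}, "V": {2, 3}},
    {"props": {"revenue": 70}, "V": {2, 3, 4, 5, 6}},
]


def select(graphs, predicate):
    """Selection: keep the graphs that satisfy the predicate."""
    return [g for g in graphs if predicate(g)]


def sort_by(graphs, key, descending=False):
    """Sort by: order graphs by the value of a graph property."""
    return sorted(graphs, key=lambda g: g["props"][key], reverse=descending)


def top(graphs, limit):
    """Top: keep the first `limit` graphs."""
    return graphs[:limit]


def overlap(g, h):
    """Toy binary graph operator: common vertices of two graphs."""
    return {"props": {}, "V": g["V"] & h["V"]}


def reduce_graphs(graphs, binary_operator):
    """Reduce: fold the collection into one graph with a binary operator."""
    return reduce(binary_operator, graphs)


large = select(collection, lambda g: len(g["V"]) > 3)
best = top(sort_by(collection, "revenue", descending=True), 2)
common = reduce_graphs(best, overlap)
```

Here `large` corresponds to the predicate |g.V| > 3, and `common` is the overlap of the two graphs with the highest revenue.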

SELECTION

    collection = <db.G[0], db.G[1], db.G[2]>
    predicate = (Graph g => |g.V| > 3)
    result = collection.select(predicate)

EXAMPLE GRALA WORKFLOWS

1. Social Network Analysis: "Summarized Communities"
• Find communities by label propagation
• Summarize vertices per community and edges between community members

2. Business Intelligence: "Top Revenue Subgraph"
• Find the common subgraph of the top 100 revenue business transaction graphs

GRALA EXAMPLE: SUMMARIZED COMMUNITIES

    // define pattern to extract persons and their "knows" relations
    pattern = new Graph("(a)-c->(b)")
    predicate = (Graph g =>
        g.V[$a][:type] == "Person" &&
        g.V[$b][:type] == "Person" &&
        g.E[$c][:type] == "knows")
    // find all matches inside the database
    friendships = db.match(pattern, predicate)
    // combine all matches to a single graph
    knowsGraph = friendships.reduce(Graph g, Graph f => g.combine(f))
    // remove properties
    knowsGraph = knowsGraph.project(
        Vertex v => new Vertex(v[:type], {}),
        Edge e => new Edge(e[:type], {}))
    // extract communities, store community at vertex property "community"
    knowsGraph = knowsGraph.callForGraph(
        :CommunityDetectionAlgorithm, {"propertyKey": "community"})
    // summarize vertices based on their community
    // count edges inside and between communities
    summarizedCommunities = knowsGraph.summarize(
        {"community"},
        ((Vertex vSum, Set vertices) => vSum["count"] = |vertices|),
        {},
        ((Edge eSum, Set edges) => eSum["count"] = |edges|))
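The last two steps of this workflow (community detection, then summarization by community) can be sketched in plain Python. This is a toy, single-machine illustration under stated assumptions: the label propagation variant below processes vertices in ascending id order and breaks ties by the largest label, which is one simple deterministic choice and not necessarily what the Gradoop algorithm does; the function name and the sample edge list are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical "knows" edges: two triangles joined by the edge (3, 4).
knows = [(1, 2), (1, 3), (2, 3), (4, 5), (4, 6), (5, 6), (3, 4)]


def label_propagation(edges, max_rounds=10):
    """Each vertex repeatedly adopts the most frequent label among its
    neighbours (ties: largest label) until no label changes."""
    neighbors = defaultdict(set)
    for s, t in edges:
        neighbors[s].add(t)
        neighbors[t].add(s)
    labels = {v: v for v in neighbors}  # start with one community per vertex
    for _ in range(max_rounds):
        changed = False
        for v in sorted(neighbors):
            counts = Counter(labels[n] for n in neighbors[v])
            best = max(counts.values())
            new_label = max(l for l, c in counts.items() if c == best)
            if new_label != labels[v]:
                labels[v] = new_label
                changed = True
        if not changed:
            break
    return labels


community = label_propagation(knows)

# Summarize: one vertex per community, one edge per pair of communities.
vertex_counts = Counter(community.values())
edge_counts = Counter((community[s], community[t]) for s, t in knows)
```

On this sample the two triangles end up as separate communities, and `edge_counts` distinguishes the edges inside each community from the single edge between them.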

GRALA EXAMPLE: TOP REVENUE SUBGRAPH

    // compute logical graphs
    btgs = db.callForCollection(:BusinessTransactionGraphs, {})
    // define predicate function (graph contains invoice)
    predicate = (Graph g =>
        g.V.select(Vertex v => v[:type] == "SalesInvoice").count() > 0)
    // define aggregate function (revenue per graph)
    aggRevenue = (Graph g => g.V.values("revenue").sum())
    // apply predicate and aggregate function
    invBtgs = btgs.select(predicate)
                  .apply(Graph g => g.aggregate("revenue", aggRevenue))
    // sort graphs by revenue and return top 100
    topBtgs = invBtgs.sortBy("revenue", :desc).top(100)
    // compute overlap to find master data objects (e.g., Employees)
    topBtgOverlap = topBtgs.reduce(Graph g, Graph h => g.overlap(h))
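The same pipeline (select, aggregate, sort, top, reduce with overlap) can be traced on toy data in Python. The graphs, vertex labels and revenue values below are invented for illustration, the helper names are hypothetical, and the top 2 graphs are used instead of the top 100 to keep the example small.

```python
from functools import reduce

# Hypothetical business transaction graphs: vertex id -> (type label, revenue or None).
btgs = [
    {"V": {1: ("Employee", None), 2: ("SalesInvoice", 100), 3: ("SalesInvoice", 50)}},
    {"V": {1: ("Employee", None), 4: ("SalesOrder", None)}},
    {"V": {1: ("Employee", None), 5: ("SalesInvoice", 300), 6: ("Customer", None)}},
    {"V": {7: ("Employee", None), 8: ("SalesInvoice", 20)}},
]


def has_invoice(g):
    """Predicate: the graph contains at least one SalesInvoice vertex."""
    return any(label == "SalesInvoice" for label, _ in g["V"].values())


def with_revenue(g):
    """Aggregate: store the sum of all vertex revenues as a graph property."""
    total = sum(rev for _, rev in g["V"].values() if rev is not None)
    return {**g, "revenue": total}


def overlap(g, h):
    """Binary operator: vertices contained in both graphs."""
    return {"V": {v: g["V"][v] for v in g["V"].keys() & h["V"].keys()}}


inv_btgs = [with_revenue(g) for g in btgs if has_invoice(g)]
top_btgs = sorted(inv_btgs, key=lambda g: g["revenue"], reverse=True)[:2]
top_overlap = reduce(overlap, top_btgs)
```

The overlap of the two highest-revenue graphs contains only the shared employee vertex, which is the kind of master data object the workflow is meant to surface.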

SUMMARY

GRADOOP
• end-to-end framework for graph data management and analytics
• leverages the Hadoop ecosystem, including graph processing systems
• extended property graph model (EPGM) with powerful operators
• Gradoop graph store based on HBase
• initial implementation running (using MapReduce and Giraph)

OUTLOOK

• complete processing framework
• implementation for all operators
• implement more mining algorithms on EPGM (FSM, …)
• workflow execution layer (Tez, Spark, Flink, …)
• visualization
• evaluate different storage layouts / solutions (e.g., Cassandra)
• automatic optimization of analysis workflows
• optimized graph partitioning approaches
• graph-based data integration (DeDoop)

Presenter
Presentation notes
Learning-based, scalable strategies for data cleaning and integration
• Learning-based methods that allow matching procedures to be configured from relatively few training examples
• Scalable application of these methods to large data volumes, with highly parallel execution on hundreds of nodes and thousands of processors, e.g., on cloud infrastructures
Holistic integration of numerous data sources
• Holistic integration that targets very many data sources rather than, as before, the pairwise matching of data sources
• Automated creation and evolution of the integrated taxonomy (creating a sophisticated taxonomy for integrating the source data is hard to do manually with very many data sources)
Dynamic information enrichment for real-time analyses
• Access background information on the web on the fly
• Extend mashup techniques and optimize real-time analyses for dynamic querying and integration of information
GRADOOP TEAM

• Graph Store / Workflow Execution / Graph Pattern Matching: Martin Junghanns (research associate)
• BIIIG / Workflow Execution / Frequent Subgraph Mining: Andre Petermann (research associate)
• RDF Graph Analytics: Markus Nentwig (research associate)
• Gradoop + Flink: Niklas Teichmann (student assistant)
• Graph Partitioning: Kevin Gómez (student assistant / bachelor's thesis)
• Visual Workflow Definition: Simon Chill (master's thesis)
• Graph Pattern Matching: Andreas Krause (master's thesis)
• Frequent Subgraph Mining: Thomas Döring (master's thesis)
• Graph Visualization: Ngoc Ha Tran (master's thesis)

REFERENCES

• Junghanns, M., Petermann, A., Gomez, K., Peukert, E., Rahm, E.: GRADOOP - Scalable Graph Data Management and Analytics with Hadoop. Tech. report, Univ. of Leipzig, June 2015
• Kolb, L., Rahm, E.: Parallel Entity Resolution with Dedoop. Datenbank-Spektrum 13(1): 23-32 (2013)
• Kolb, L., Thor, A., Rahm, E.: Dedoop: Efficient Deduplication with Hadoop. PVLDB 5(12), 2012
• Kolb, L., Thor, A., Rahm, E.: Load Balancing for MapReduce-based Entity Resolution. ICDE 2012: 618-629
• Kolb, L., Sehili, Z., Rahm, E.: Iterative Computation of Connected Graph Components with MapReduce. Datenbank-Spektrum 14(2): 107-117 (2014)
• Petermann, A., Junghanns, M., Müller, R., Rahm, E.: BIIIG: Enabling Business Intelligence with Integrated Instance Graphs. Proc. 5th Int. Workshop on Graph Data Management (GDM 2014)
• Petermann, A., Junghanns, M., Müller, R., Rahm, E.: Graph-based Data Integration and Business Intelligence with BIIIG. Proc. VLDB Conf., 2014
• Petermann, A., Junghanns, M., Müller, R., Rahm, E.: FoodBroker - Generating Synthetic Datasets for Graph-Based Business Analytics. Proc. 5th Int. Workshop on Big Data Benchmarking (WBDB), 2014
• Jindal, A. et al.: Vertexica: Your Relational Friend for Graph Analytics! PVLDB 7(13), 2014
• Rudolf, M. et al.: The Graph Story of the SAP HANA Database. BTW, 2013

Thank you!

www.gradoop.com