Graph Data Management Lab, School of Computer Science
GDM@FUDAN
http://gdm.fudan.edu.cn
Scalable SPARQL Querying of Large RDF Graphs
Xu Bo
2012.06.11
In PVLDB, 4(11), 2011
Outline
• About Presenter
• Semantic Web
• Previous Work
• New Problem
• SYSTEM ARCHITECTURE
• EXPERIMENTS
• CONCLUSIONS AND FUTURE WORK
About Presenter
• Daniel Abadi
• Associate Professor of Computer Science at Yale University
• Research
– Column-Oriented Database Systems
– Petascale Parallel Database Systems (HadoopDB)
– Semantic Web Data Management
Semantic Web
• The vision of the Semantic Web is to build a "web of data" that enables machines to understand the semantics of information on the Web
Google Knowledge Graph
Key Technology
• HTML
• XML
The Disadvantage of XML
• Example statement: "David Billington is a lecturer of Discrete Mathematics."
• There is no standard way of assigning meaning to tag nesting
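
The nesting problem can be made concrete with a small sketch. The two XML encodings below are illustrative assumptions (the element and attribute names are not from the slides); both express the same statement, yet code written against one nesting finds nothing in the other.

```python
import xml.etree.ElementTree as ET

# Two hypothetical encodings of the same statement.
doc_a = """
<course name="Discrete Mathematics">
  <lecturer>David Billington</lecturer>
</course>
""".strip()

doc_b = """
<lecturer name="David Billington">
  <teaches>Discrete Mathematics</teaches>
</lecturer>
""".strip()


def lecturer_of_course(xml_text):
    """Return (lecturer, course), assuming the nesting used by doc_a."""
    root = ET.fromstring(xml_text)
    if root.tag != "course":
        return None  # the nesting is different, so this reader gives up
    child = root.find("lecturer")
    if child is None:
        return None
    return (child.text, root.get("name"))


result_a = lecturer_of_course(doc_a)
result_b = lecturer_of_course(doc_b)
```

Here `result_a` carries the statement while `result_b` is `None`: the meaning of "lecturer inside course" versus "teaches inside lecturer" lives only in the reader's code, not in the XML itself.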
The Disadvantage of XPath
• Suppose we want to collect all academic staff members. A path expression in XPath might be //academicStaffMember
• XML is semantically unsatisfactory
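
A minimal sketch of why the path expression falls short, assuming a made-up staff document in which professors and lecturers have their own element names. `xml.etree.ElementTree` supports only a subset of XPath, which is enough here.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the tag names other than academicStaffMember
# are assumptions made for this illustration.
staff_xml = """
<staff>
  <academicStaffMember>Alice Example</academicStaffMember>
  <professor>Bob Example</professor>
  <lecturer>David Billington</lecturer>
</staff>
""".strip()

root = ET.fromstring(staff_xml)

# './/tag' selects every descendant element with that tag name.
found = [e.text for e in root.findall(".//academicStaffMember")]
```

Only the element literally named `academicStaffMember` is returned. Nothing in the XML says that a professor or a lecturer is also an academic staff member, so the query cannot find them.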
RDF
• Resource Description Framework
• Uses Web identifiers (called Uniform Resource Identifiers, or URIs) to identify things, and describes resources with simple properties and property values
RDF as Triples and a Graph
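
As a rough illustration of the two views, the sketch below stores a few made-up triples (the `ex:` names are placeholders, not data from the talk) and derives the graph view from them: subjects and objects become vertices, predicates become edge labels.

```python
from collections import defaultdict

# Hypothetical (subject, predicate, object) triples.
triples = [
    ("ex:Messi", "ex:playsFor", "ex:FCBarcelona"),
    ("ex:Messi", "ex:position", "striker"),
    ("ex:FCBarcelona", "ex:region", "ex:Barcelona"),
]

# Graph view: each triple is a labeled edge from subject to object.
out_edges = defaultdict(list)
for s, p, o in triples:
    out_edges[s].append((p, o))

vertices = {s for s, _, _ in triples} | {o for _, _, o in triples}
```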
SPARQL
• RDF query language
• A basic graph pattern
• Answering SPARQL can be seen as finding subgraphs in the RDF data that match the graph pattern
Example for Star Pattern
• Find the names of the strikers that play for FC Barcelona.
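
The star-pattern query can be sketched as a tiny basic-graph-pattern matcher. This is an illustration only, not how any RDF store evaluates SPARQL; the data, the predicate names, and the `match` helper are all assumptions. Terms starting with `?` are variables, and all three patterns share the subject `?p`, which is what makes the query a star.

```python
def match(patterns, triples, binding=None):
    """Yield variable bindings that map every pattern onto some triple."""
    binding = binding or {}
    if not patterns:
        yield dict(binding)
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        new = dict(binding)
        ok = True
        for term, value in zip(first, triple):
            if term.startswith("?"):
                # Bind the variable, or check it against an earlier binding.
                if new.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:
                ok = False
                break
        if ok:
            yield from match(rest, triples, new)


# Hypothetical data.
triples = [
    ("Messi", "position", "striker"),
    ("Messi", "playsFor", "FC_Barcelona"),
    ("Messi", "name", "Lionel Messi"),
    ("Xavi", "position", "midfielder"),
    ("Xavi", "playsFor", "FC_Barcelona"),
    ("Xavi", "name", "Xavi Hernandez"),
]

star_query = [
    ("?p", "position", "striker"),
    ("?p", "playsFor", "FC_Barcelona"),
    ("?p", "name", "?name"),
]

names = sorted(b["?name"] for b in match(star_query, triples))
```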
Another Example
• Find football players playing for clubs in a populous region where they were born.
Previous Work
• RDF in RDBMSs
• Property Tables
• Vertically Partitioned Approach
RDF in RDBMSs
• Get the title of the book(s) Joe Fox wrote in 2001
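
A minimal sketch of this query over a single three-column triples table, using SQLite from the Python standard library. The table layout, the property names (`author`, `title`, `copyright`), and the rows are assumptions for illustration. Note that each triple pattern costs one self-join on the triples table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subj TEXT, prop TEXT, obj TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        # Hypothetical data.
        ("book1", "author", "Joe Fox"),
        ("book1", "title", "Title A"),
        ("book1", "copyright", "2001"),
        ("book2", "author", "Joe Fox"),
        ("book2", "title", "Title B"),
        ("book2", "copyright", "1999"),
    ],
)

# One alias of the triples table per triple pattern, joined on the subject.
rows = conn.execute(
    """
    SELECT t.obj
    FROM triples AS a
    JOIN triples AS y ON y.subj = a.subj
    JOIN triples AS t ON t.subj = a.subj
    WHERE a.prop = 'author' AND a.obj = 'Joe Fox'
      AND y.prop = 'copyright' AND y.obj = '2001'
      AND t.prop = 'title'
    """
).fetchall()
titles = [r[0] for r in rows]
```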
Property Tables
Vertically Partitioned Approach
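
A rough sketch of the idea, assuming the usual formulation of vertical partitioning: one two-column (subject, object) table per property. SQLite is used only for illustration (the approach is normally paired with a column store), and the property names and rows are made up.

```python
import sqlite3

# Hypothetical triples.
triples = [
    ("book1", "author", "Joe Fox"),
    ("book1", "title", "Title A"),
    ("book1", "copyright", "2001"),
    ("book2", "author", "Joe Fox"),
    ("book2", "title", "Title B"),
    ("book2", "copyright", "1999"),
]

conn = sqlite3.connect(":memory:")

# One (subj, obj) table per distinct property.
for prop in sorted({p for _, p, _ in triples}):
    conn.execute(f'CREATE TABLE "{prop}" (subj TEXT, obj TEXT)')
for s, p, o in triples:
    conn.execute(f'INSERT INTO "{p}" VALUES (?, ?)', (s, o))

# The query now joins only the property tables it actually needs.
rows = conn.execute(
    """
    SELECT title.obj
    FROM author
    JOIN copyright ON copyright.subj = author.subj
    JOIN title ON title.subj = author.subj
    WHERE author.obj = 'Joe Fox' AND copyright.obj = '2001'
    """
).fetchall()
titles = [r[0] for r in rows]
```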
New Problem
• Single-node RDF management systems are abundant
– Sesame
– Jena
– RDF-3X
– 3store
• Clustered RDF management has been explored far less: this is the focus of the talk
SYSTEM ARCHITECTURE
Graph Partitioning
• Hash vs. graph partitioning
– Hash: only efficient for star patterns
– Graph: takes advantage of the graph model
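
A small sketch of hash partitioning on the subject, assuming a CRC32 hash (chosen here only because it is stable across runs) and made-up triples. All triples with the same subject land on the same node, which is why star patterns centered on a subject need no communication, while a triple about the club may land elsewhere.

```python
import zlib


def node_for(subject, num_nodes):
    """Pick a node by hashing the subject (CRC32 is an arbitrary stable choice)."""
    return zlib.crc32(subject.encode("utf-8")) % num_nodes


def hash_partition(triples, num_nodes):
    """Place every triple on the node chosen by its subject."""
    parts = [[] for _ in range(num_nodes)]
    for s, p, o in triples:
        parts[node_for(s, num_nodes)].append((s, p, o))
    return parts


# Hypothetical data.
triples = [
    ("Messi", "position", "striker"),
    ("Messi", "playsFor", "FC_Barcelona"),
    ("Messi", "name", "Lionel Messi"),
    ("FC_Barcelona", "region", "Barcelona"),
]

parts = hash_partition(triples, 4)

# Nodes that hold at least one triple whose subject is "Messi".
messi_nodes = {i for i, part in enumerate(parts) for s, _, _ in part if s == "Messi"}
```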
Graph Partitioning
• Edge vs. vertex partitioning
– Edge: Natural but inefficient for query execution
– Vertex: Superior for common graph patterns
Vertex Partitioning
• Preprocess
– remove triples whose predicate is rdf:type
• METIS partitioner
Triple Placement
• Minimizing data shuffling/exchange
– Allowing data overlap
• N-hop guarantee
– The extent of data overlap
– If a vertex is assigned to a machine, any vertex that is within n hops of this vertex is also stored on this machine
DIRECTED N-HOP GUARANTEE
A potential problem
• triples (s, p, o) and (o, p’, o’)
– stored together under the directed 2-hop guarantee
• triples (s, p, o) and (s’, p’, o)
– not guaranteed to be stored together
• “object-connected” patterns are not unusual
• solution: undirected n-hop guarantee
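
The difference between the two guarantees can be sketched as follows. The function `n_hop_triples` is a hypothetical helper, not the talk's placement algorithm: it collects the triples a machine must hold for a set of core vertices, following edges from subject to object only (directed) or in both directions (undirected).

```python
def n_hop_triples(triples, core, n, directed=True):
    """Return the triples within n hops of the vertices in `core`."""
    frontier = set(core)
    seen = set(core)
    placed = set()
    for _ in range(n):
        next_frontier = set()
        for s, p, o in triples:
            if s in frontier:
                placed.add((s, p, o))
                next_frontier.add(o)
            elif not directed and o in frontier:
                # Undirected: also walk the edge from object back to subject.
                placed.add((s, p, o))
                next_frontier.add(s)
        frontier = next_frontier - seen
        seen |= frontier
    return placed


# The three triple shapes from the slide, with placeholder names.
t1 = ("s", "p", "o")
t2 = ("o", "p2", "o2")
t3 = ("s2", "p3", "o")
triples = [t1, t2, t3]

directed_2 = n_hop_triples(triples, {"s"}, 2, directed=True)
undirected_2 = n_hop_triples(triples, {"s"}, 2, directed=False)
```

With `s` as the core vertex, the directed 2-hop set holds `t1` and `t2` but misses the object-connected triple `t3`; the undirected 2-hop set holds all three, at the cost of more overlap between machines.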
Triple Placement Algorithm
Query Processing
• Queries are executed in RDF-stores and/or Hadoop
• Query execution is more efficient in RDF-stores than in Hadoop
– Pushing as much of the processing as possible into RDF-stores
– Minimizing the number of Hadoop jobs
– The larger the hop guarantee, the more work is done in RDF-stores
To Communicate, or not to Communicate
• Given a query and n-hop guarantee, is communication (Hadoop job) between nodes needed?
– Choose the “center” of the query graph
– Calculate the distance from the “center” to the furthest edge
– If distance > n, communication is needed; not needed otherwise
Determining whether a Query is PWOC
• PWOC query
– parallelizable without communication
• DoFE
– distance of farthest edge
– the vertex with the smallest DoFE is the most central vertex in the graph
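
A minimal sketch of the DoFE computation and the PWOC test. It assumes an undirected n-hop guarantee (edges are traversed in both directions) and defines the distance of an edge as one more than the distance to its nearer endpoint; the function names and the query graphs are illustrative.

```python
from collections import deque


def dofe(edges, v):
    """Distance of the farthest edge from v, treating edges as undirected."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    # Breadth-first search gives hop distances from v to every vertex.
    dist = {v: 0}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        for w in adj.get(u, ()):
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    inf = float("inf")
    return max(min(dist.get(a, inf), dist.get(b, inf)) + 1 for a, b in edges)


def is_pwoc(edges, n):
    """True if the center's DoFE does not exceed the hop guarantee n."""
    vertices = {x for edge in edges for x in edge}
    return min(dofe(edges, v) for v in vertices) <= n


# Hypothetical query graphs: a star around ?p and a three-edge chain.
star = [("?p", "striker"), ("?p", "FC_Barcelona"), ("?p", "?name")]
chain = [("a", "b"), ("b", "c"), ("c", "d")]
```

The star has a center with DoFE 1, so it is PWOC under a 1-hop guarantee. The chain's best center has DoFE 2, so it needs a 2-hop guarantee to avoid communication.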
The algorithm
The issue of duplicate results
• naive approach
– remove duplicates after the query has completed
• owner-computes model
– add a triple (v, ‘<isOwned>’, ‘Yes’) to a partition for each vertex v that the partition owns
– for each query issued to the RDF-stores, add an additional pattern (core, ‘<isOwned>’, ‘Yes’)
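
A small sketch of why the extra pattern removes duplicates. Two hypothetical partitions both store the same `playsFor` triple because of data overlap, but only one of them owns the vertex bound to the query's core. The `run_query` helper hard-codes a single pattern (?x, playsFor, ?club) with core ?x and is an assumption made for this illustration.

```python
def run_query(partition, require_owner):
    """Match (?x, 'playsFor', ?club); optionally also (?x, '<isOwned>', 'Yes')."""
    owned = {s for s, p, o in partition if p == "<isOwned>" and o == "Yes"}
    return [
        (s, o)
        for s, p, o in partition
        if p == "playsFor" and (not require_owner or s in owned)
    ]


# Hypothetical overlap: both partitions hold the shared triple.
shared = ("Messi", "playsFor", "FC_Barcelona")
partition_1 = [shared, ("Messi", "<isOwned>", "Yes")]
partition_2 = [
    shared,
    ("FC_Barcelona", "region", "Barcelona"),
    ("FC_Barcelona", "<isOwned>", "Yes"),
]

naive = run_query(partition_1, False) + run_query(partition_2, False)
owner_computes = run_query(partition_1, True) + run_query(partition_2, True)
```

The naive union contains the result twice and needs a separate deduplication step; with the ownership pattern, only the partition that owns `Messi` reports it.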
A query is not PWOC
• decompose the query into PWOC subqueries
• use Hadoop jobs to join the results of the PWOC subqueries
• The number of Hadoop jobs required to complete the query increases as the number of subqueries increases
Minimal number of subqueries
• reduces to the problem of finding a minimal edge partitioning of a graph into subgraphs of bounded diameter
• brute-force
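
One way to picture the brute-force search, as a sketch rather than the system's actual procedure: enumerate every way of splitting the query's edges into groups and keep the split with the fewest groups in which every group is PWOC. The helpers below are hypothetical, assume an undirected n-hop guarantee, and are only practical for the handful of triple patterns found in a typical query.

```python
from collections import deque


def center_dofe(edges):
    """Smallest DoFE over all vertices, treating edges as undirected."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    inf = float("inf")
    best = inf
    for v in adj:
        dist = {v: 0}
        queue = deque([v])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        far = max(min(dist.get(a, inf), dist.get(b, inf)) + 1 for a, b in edges)
        best = min(best, far)
    return best


def set_partitions(items):
    """Yield every way to split `items` into non-empty groups."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in set_partitions(rest):
        yield [[first]] + part
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]


def decompose(edges, n):
    """Return a split into the fewest groups that are each PWOC for n hops."""
    best = None
    for part in set_partitions(list(edges)):
        if best is not None and len(part) >= len(best):
            continue
        if all(center_dofe(group) <= n for group in part):
            best = part
    return best


# Hypothetical four-edge chain query.
chain = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]
one_hop = decompose(chain, 1)
two_hop = decompose(chain, 2)
```

Under a 1-hop guarantee the chain splits into two star-shaped subqueries, so one join step is needed; under a 2-hop guarantee the whole chain is a single PWOC query.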
Example
• The DoFEs for manager, footballClub, Barcelona and club are 2, 2, 2 and 1
• The DoFEs for footballer, pop, region, player and club are 3, 3, 2, 2 and 2
Decompose Example
EXPERIMENTS
• 20-machine cluster
• Lehigh University Benchmark (LUBM): 270 million triples
• Systems compared:
– Single-node RDF-3X
– SHARD: triple-store system in Hadoop
– Graph partitioning (the proposed system)
– Hash partitioning on subjects
Data Load Time
Performance Comparison
Varying Number of Machines
Summary