Page 1
SPARQL Basic Graph Pattern Processing with Iterative MapReduce
2010-04-26
Presented by Jaeseok Myung
Intelligent Database Systems Lab, School of Computer Science & Engineering, Seoul National University, Seoul, Korea
Page 2
Copyright 2010 by CEBT
MapReduce
MapReduce is easily accessible
The Hadoop project provides an open-source MR implementation
MapReduce gives users a simple abstraction for utilizing parallel and distributed systems
Programming Model
– Map(k,v) -> list(k’, v’)
– Reduce(k’, list(v’)) -> list(v’’)
Useful for Massive Data Processing
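The programming model above can be sketched in a few lines of Python. This is only an in-memory simulation under stated assumptions, not Hadoop code: `run_mapreduce`, `count_map` and `count_reduce` are hypothetical names, and the word-count job is an illustration chosen here, not an example from the slides.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Simulate one MapReduce job in memory: map, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for k2, v2 in map_fn(key, value):       # Map(k, v) -> list(k', v')
            groups[k2].append(v2)               # shuffle: collect values per k'
    output = []
    for k2 in sorted(groups):
        for v3 in reduce_fn(k2, groups[k2]):    # Reduce(k', list(v')) -> list(v'')
            output.append((k2, v3))
    return output

def count_map(_line_number, line):
    # Emit (word, 1) for every word on the line.
    return [(word, 1) for word in line.split()]

def count_reduce(_word, ones):
    # Sum the 1s collected for this word.
    return [sum(ones)]

result = run_mapreduce([(0, "a b a"), (1, "b c")], count_map, count_reduce)
print(result)
```

A real framework runs the map and reduce calls on many machines and handles the grouping step itself; the sketch only shows the shape of the two user-supplied functions.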
Center for E-Business Technology MDAC 2010 – 2/23
Page 3
MR & Cloud Computing
MapReduce is a kind of platform
MapReduce utilizes a number of commodity machines
There can be a number of applications using MapReduce
[Figure: a MapReduce platform with several applications (App.) using it]
Page 4
RDF Data Warehouse using MapReduce
Data Warehouse using MapReduce
Extensive studies have shown that MR is well suited to large-scale, fault-tolerant data analysis
Hive, CloudBase
– Data warehousing solutions built on top of Hadoop
Advantages
– Scalability
– Extensibility
– Fault-tolerance
My Research Interest
RDF Data Warehouse using MapReduce
Page 5
Why RDF Data Warehouse?
Flexible Data Model
The underlying structure of any expression in RDF is a collection of triples (s, p, o)
Data Integration
RDB-to-RDF (intra)
Linked Open Data (inter)
Incremental Integration
Inference
We can derive new knowledge from what we already know
A goal of data analyses
Page 6
Approaches & Advantages
[Figure: approaches to building a data warehouse.
 Approaches: Conventional DW Solutions, RDF Data Warehouse
 Architectures: Centralized, Distributed & Parallel
 Eras: Before the Cloud, Cloud Computing (MR)
 Properties listed in the figure: Flexibility, Integration, Inference; Complexity, Large-scale data analyses; Scalability, Extensibility, Fault-tolerance; Support Tools; Simple, Fast; Performance, Optimization]
Page 7
SPARQL BGP Processing with MapReduce
Both RDF and MapReduce can benefit a data warehouse
RDF is a data model
– Flexibility, Integration, Inference
MapReduce is a programming model
– Scalability, Extensibility, Fault-tolerance
Creating synergy has been difficult because only a few algorithms connect the data model and the framework
We should focus on an MR algorithm that manipulates RDF datasets
A MapReduce Algorithm for SPARQL Basic Graph Pattern Processing
Page 8
SPARQL Basic Graph Pattern
SPARQL is a query language for RDF datasets
A Basic Graph Pattern (BGP) is a set of triple patterns
Triple patterns are similar to RDF triples (s, p, o) except that each of the subject, predicate and object can be a variable
BGP processing is important
– Most SPARQL queries have one or more BGPs
– BGPs require expensive join operations among triple patterns
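Matching a single triple against a triple pattern can be sketched as follows. This is a minimal illustration, assuming triples and patterns are plain 3-tuples of strings and that a term starting with `?` is a variable; the function name `match` is hypothetical and not taken from the slides.

```python
def match(pattern, triple):
    """Return the variable bindings if the triple satisfies the pattern, else None."""
    bindings = {}
    for p_term, t_term in zip(pattern, triple):
        if p_term.startswith("?"):
            # A variable matches anything, but must bind consistently.
            if bindings.setdefault(p_term, t_term) != t_term:
                return None
        elif p_term != t_term:
            # A constant must match exactly.
            return None
    return bindings

tp1 = ("?x", "rdf:type", "ub:Professor")
tp3 = ("?x", "ub:name", "?y1")
triple = ("<Prof0>", "ub:name", '"Professor0"')

print(match(tp1, triple))
print(match(tp3, triple))
```

A BGP is satisfied when all of its triple patterns match with compatible bindings for the shared variables, which is why joins are needed.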
SELECT ?x ?y1 ?y2 ?y3
WHERE {
  ?x rdf:type ub:Professor .        # TP#1
  ?x ub:worksFor <Department0> .    # TP#2
  ?x ub:name ?y1 .                  # TP#3
  ?x ub:emailAddress ?y2 .          # TP#4
  ?x ub:telephone ?y3               # TP#5
}
(TP#1 to TP#5 together form the BGP)
Page 9
SPARQL BGP Processing with MapReduce
Two Operations
MR-Selection
– Extracts RDF triples which satisfy at least one triple pattern
MR-Join
– Merges selected triples
[Figure: MR-Selection followed by MR-Join, for the five-pattern example query]
Input triples:
  <Prof0> rdf:type ub:Professor
  <Prof0> ub:worksFor <Dept0>
  <Prof0> ub:name "Professor0"
  <Prof0> ub:email "[email protected]"
  <Prof0> ub:telephone "000-0000-0000"
  <Dept0> rdf:type ub:Department
  ...
Result after MR-Selection and MR-Join (triples merged on <Prof0>):
  <Prof0> rdf:type ub:Professor
          ub:worksFor <Dept0>
          ub:name "Professor0"
          ub:email "[email protected]"
          ub:telephone "000-0000-0000"
Page 10
MR-Selection
public void map() {
    read a triple (s, p, o)
    // example: s = Prof0, p = rdf:type, o = ub:Professor
    for each (triple pattern in the given query) {
        if (the input triple satisfies the triple pattern) {
            make a key and a value
            // key = [x]Prof0 (variable name, value)
            // value = 1 (number of the satisfied triple pattern)
            output (key, value)
        }
    }
}
public void reduce() {
    read input from the map function
    // input format: (key, list(satisfied tp_numbers))
    for each (value in the list of tp_numbers) {
        make a key and a value
        // key = <1>x, value = [x]Prof0
        output (key, value)
    }
}
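The MR-Selection pseudocode can be approximated in plain Python as below. This is a sketch under assumptions, not the authors' implementation: the key and value string formats follow the comments in the pseudocode, only two of the query's triple patterns are listed, the shuffle is simulated with a dictionary, and the names `PATTERNS`, `match`, `selection_map` and `selection_reduce` are hypothetical.

```python
from collections import defaultdict

# Two of the query's triple patterns, keyed by triple pattern number.
PATTERNS = {
    1: ("?x", "rdf:type", "ub:Professor"),
    3: ("?x", "ub:name", "?y1"),
}

def match(pattern, triple):
    """Return {variable: value} if the triple satisfies the pattern, else None.
    Simplification: a variable repeated inside one pattern is not checked."""
    bindings = {}
    for p_term, t_term in zip(pattern, triple):
        if p_term.startswith("?"):
            bindings[p_term[1:]] = t_term
        elif p_term != t_term:
            return None
    return bindings

def selection_map(triple):
    # Emit (bindings, tp_number) for every pattern the triple satisfies.
    for tp_number, pattern in PATTERNS.items():
        bindings = match(pattern, triple)
        if bindings is not None:
            key = "|".join(f"[{var}]{val}" for var, val in bindings.items())
            yield key, tp_number

def selection_reduce(key, tp_numbers):
    # Re-key each result by pattern number and variable names, e.g. <3>x|y1.
    variables = "|".join(part[1:part.index("]")] for part in key.split("|"))
    for tp_number in tp_numbers:
        yield f"<{tp_number}>{variables}", key

triples = [
    ("Prof0", "rdf:type", "ub:Professor"),
    ("Prof0", "ub:name", "Professor0"),
    ("Dept0", "rdf:type", "ub:Department"),
]
shuffled = defaultdict(list)          # stands in for the framework's shuffle
for triple in triples:
    for key, tp_number in selection_map(triple):
        shuffled[key].append(tp_number)
selected = [pair for key in sorted(shuffled)
            for pair in selection_reduce(key, shuffled[key])]
print(selected)
```

The string encoding would break on values containing `|` or `]`; a real system would need an escaping or binary encoding scheme.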
Page 11
MR-Selection
Conceptually, the MR-Selection algorithm produces one temporary table per triple pattern, holding the triples that satisfy it
Each result table has variable names as its columns, just as a relational table has attribute names
Its rows hold the values bound to those variables, as in a relational table
The result table will be used for the next MR-Join operation if necessary
[Figure: temporary result tables, one per triple pattern of the example query]
  tp1: (x)   tp2: (x)   tp3: (x, y1)   tp4: (x, y2)   tp5: (x, y3)
Page 12
MR-Join: Map
BGP Analyzer
BGP Analyzer examines a given query before execution and provides join-keys to the map function
Join-key (shared variable): ?x
[Figure: the Mapper groups the selected triples by the value of the join-key variable]
Mapper input:
  <Prof0> rdf:type ub:Professor
  <Prof0> ub:worksFor <Dept0>
  <Prof0> ub:name "Professor0"
  <Prof0> ub:email "[email protected]"
  <Prof0> ub:telephone "000-0000-0000"
  <Prof1> ub:email "[email protected]"
  <Prof1> ub:telephone "111-1111-1111"
Mapper output, grouped by value of ?x:
  <Prof0>: the five <Prof0> triples
  <Prof1>: the two <Prof1> triples
Page 13
MR-Join: Map
public void map() {
    read input from MR-Selection
    // example input (<1>x, [x]Prof0)
    // example input (<3>x|y1, [x]Prof0|[y1]Professor0)
    get join-key variables and the corresponding tp_numbers
    to be joined from the BGP Analyzer
    // example join-key: x, tp_numbers = (1, 2, 3, 4, 5)
    for each (join-key determined by the BGP Analyzer) {
        if (the input is related to the join-key) {
            make a key and a value
            // key = [x]Prof0 (variable name, value)
            // value = <tp>1</tp>[x]Prof0 (number of the satisfied triple pattern, variable name, value)
            // value = <tp>3</tp>[x]Prof0|[y1]Professor0
            output (key, value)
        }
    }
}
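A Python approximation of the MR-Join map step is shown below. It is a sketch under assumptions: the input and output strings follow the example comments in the pseudocode, `JOIN_KEYS` is a hard-coded stand-in for what the BGP Analyzer would supply, and the helper names are hypothetical.

```python
# Hypothetical BGP Analyzer output: join-key variable -> related tp_numbers.
JOIN_KEYS = {"x": {1, 2, 3, 4, 5}}

def parse_bindings(text):
    """'[x]Prof0|[y1]Professor0' -> {'x': 'Prof0', 'y1': 'Professor0'}"""
    return dict(part[1:].split("]", 1) for part in text.split("|"))

def join_map(tp_key, bindings):
    # tp_key looks like '<3>x|y1'; extract the triple pattern number.
    tp_number = int(tp_key[1:tp_key.index(">")])
    bound = parse_bindings(bindings)
    for variable, tp_numbers in JOIN_KEYS.items():
        # The input is related to the join-key if its pattern takes part
        # in the join and it carries a value for the join-key variable.
        if tp_number in tp_numbers and variable in bound:
            key = f"[{variable}]{bound[variable]}"
            value = f"<tp>{tp_number}</tp>{bindings}"
            yield key, value

out = list(join_map("<3>x|y1", "[x]Prof0|[y1]Professor0"))
print(out)
```

Because every record for the same value of `?x` gets the same key, the framework delivers them to the same reduce call, which is what makes the join possible.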
Page 14
MR-Join: Reduce
BGP Analyzer
BGP Analyzer can provide triple pattern numbers related to the join-key variable by examining a given query
Triple pattern numbers related to the join-key variable (constraints for join-key variable x): <x> 1, 2, 3, 4, 5
[Figure: the Reducer merges the triples of a join-key value that satisfies all related triple patterns]
Reducer input, grouped by value of ?x:
  <Prof0> rdf:type ub:Professor
  <Prof0> ub:worksFor <Dept0>
  <Prof0> ub:name "Professor0"
  <Prof0> ub:email "[email protected]"
  <Prof0> ub:telephone "000-0000-0000"
  <Prof1> ub:email "[email protected]"
  <Prof1> ub:telephone "111-1111-1111"
Reducer output (only <Prof0> appears; <Prof1> has triples for email and telephone only):
  <Prof0> rdf:type ub:Professor
          ub:worksFor <Dept0>
          ub:name "Professor0"
          ub:email "[email protected]"
          ub:telephone "000-0000-0000"
Page 15
MR-Join: Reduce
public void reduce() {
    read input from the map function
    // example input ([x]Prof0, [<tp>1</tp>[x]Prof0, <tp>3</tp>[x]Prof0|[y1]Professor0])
    get join-key variables and the corresponding tp_numbers to be joined from the BGP Analyzer
    // example join-key: x, tp_numbers = (1, 2, 3, 4, 5)
    create a temporary hashtable H
    for each (value in values) {
        add an element
        // key = <1>x, value = [x]Prof0
        // key = <3>x|y1, value = [x]Prof0|[y1]Professor0
    } // H will be used for checking whether the input satisfies all related tps
    if (keys in H cover all tp_numbers to be joined) {
        make a Cartesian product among values in H
        // (a1, b1), (a1, c1) => (a1, b1, c1)
        make a key and a value
        // key = <1|3>x|y1
        // value = [x]Prof0|[y1]Professor0
        output (key, value)
    }
}
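The reduce step can be sketched in Python as follows. This is an approximation under assumptions: the string formats follow the example comments in the pseudocode, the required `tp_numbers` are passed in as an argument instead of coming from a BGP Analyzer, and `join_reduce` is a hypothetical name. The sketch does not check that non-join variables agree across the merged rows.

```python
from itertools import product

def join_reduce(values, tp_numbers):
    """Join all rows that share one join-key value.
    values: strings such as '<tp>3</tp>[x]Prof0|[y1]Professor0'."""
    table = {}  # the temporary hashtable H: tp_number -> list of binding strings
    for value in values:
        tp_number, bindings = value[len("<tp>"):].split("</tp>", 1)
        table.setdefault(int(tp_number), []).append(bindings)
    if not set(tp_numbers) <= set(table):
        return  # some related triple pattern has no match: emit nothing
    ordered = sorted(table)
    # Cartesian product: one row from every triple pattern's list.
    for combination in product(*(table[n] for n in ordered)):
        merged = {}
        for bindings in combination:
            for part in bindings.split("|"):
                variable, val = part[1:].split("]", 1)
                merged[variable] = val
        key = "<" + "|".join(str(n) for n in ordered) + ">" + "|".join(merged)
        value = "|".join(f"[{variable}]{val}" for variable, val in merged.items())
        yield key, value

values = ["<tp>1</tp>[x]Prof0", "<tp>3</tp>[x]Prof0|[y1]Professor0"]
joined = list(join_reduce(values, (1, 3)))
print(joined)
```

If a third pattern were required, for example `tp_numbers=(1, 2, 3)`, the same input would produce no output, which mirrors the coverage check on H in the pseudocode.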
Page 16
Join-key Selection Strategies
BGP Analyzer provides join-key variables by analyzing a query
How to select join-key variables?
If a BGP has a single shared variable
– We can simply select that variable
If a BGP has two or more shared variables
– We applied two heuristics to select join-key variables
– Greedy Selection: select a join-key according to the number of triple patterns related to it
– Multiple Selection: select join-keys until every triple pattern participates in an MR-Join operation
Utilize the distributed and parallel system architecture
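The two heuristics can be sketched as below. The slides do not spell out the exact rules, so this code rests on assumptions that should be read as one possible interpretation: greedy selection is taken to pick the variable shared by the most triple patterns, and multiple selection is taken to pick variables in that same order, skipping any variable whose patterns overlap with ones already chosen, until every pattern is covered or no candidates remain. All function names are hypothetical.

```python
def shared_variables(patterns):
    """Map each variable occurring in two or more patterns to their numbers."""
    related = {}
    for tp_number, pattern in patterns.items():
        for term in pattern:
            if term.startswith("?"):
                related.setdefault(term, set()).add(tp_number)
    return {var: tps for var, tps in related.items() if len(tps) > 1}

def greedy_selection(patterns):
    # Assumption: prefer the variable related to the most triple patterns;
    # ties are broken alphabetically to keep the result deterministic.
    shared = shared_variables(patterns)
    return min(shared, key=lambda var: (-len(shared[var]), var))

def multiple_selection(patterns):
    # Assumption: join-keys chosen for the same MR-Join must not share patterns.
    shared = shared_variables(patterns)
    covered, keys = set(), []
    for var in sorted(shared, key=lambda v: (-len(shared[v]), v)):
        if covered == set(patterns):
            break
        if shared[var].isdisjoint(covered):
            keys.append(var)
            covered |= shared[var]
    return keys

# Hypothetical BGP with three shared variables.
bgp = {
    1: ("?a", "p", "?b"),
    2: ("?a", "q", "?c"),
    3: ("?d", "r", "?e"),
    4: ("?d", "s", "?c"),
}
print(greedy_selection(bgp))
print(multiple_selection(bgp))
```

With several disjoint join-keys in one job, different reducers can work on different joins at the same time, which is one way to read the remark about utilizing the distributed and parallel architecture.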
Page 17
SPARQL BGP Processing with MR
Advantages
MapReduce can benefit from the multi-way join technique
– If triple patterns share a variable, MR can join them all at once
– A BGP often has several triple patterns sharing the same variable, because RDF has a simple, fixed data model
[Figure, panels (a) and (b): joining the result tables tp1 (x), tp2 (x), tp3 (x, y1), tp4 (x, y2), tp5 (x, y3) of the five-pattern example query.
 One panel shows a cascade of 2-way joins with intermediate results (x), (x, y1), (x, y1, y2), (x, y1, y2, y3);
 the other shows a single multi-way join on x producing (x, y1, y2, y3)]
Page 18
SPARQL BGP Processing with MR
Disadvantages
If we have two or more shared variables, we need expensive MR iterations
The triple patterns in a query cannot all be covered by a single variable
If we have two shared variables, MR iterations cannot be avoided
To reduce unnecessary MR iterations, join-key selection strategies should be applied
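The need for iterations can be illustrated with a small planning sketch. It is not the authors' algorithm: it assumes one join-key per MR-Join iteration, chosen greedily as the variable shared by the most intermediate tables, and it tracks only which variables each table contains. The function name `plan_iterations` is hypothetical.

```python
def plan_iterations(patterns):
    """Return the join-key used in each MR-Join iteration."""
    # Start with one table per triple pattern, described by its variables.
    tables = [frozenset(t for t in p if t.startswith("?")) for p in patterns]
    rounds = []
    while len(tables) > 1:
        counts = {}
        for columns in tables:
            for variable in columns:
                counts[variable] = counts.get(variable, 0) + 1
        shared = {v: c for v, c in counts.items() if c > 1}
        if not shared:
            break  # the remaining tables share no variable
        key = min(shared, key=lambda v: (-shared[v], v))
        # One MR-Join merges every table that contains the join-key.
        joined = frozenset().union(*(c for c in tables if key in c))
        tables = [c for c in tables if key not in c] + [joined]
        rounds.append(key)
    return rounds

five = [
    ("?x", "rdf:type", "ub:Professor"),
    ("?x", "ub:worksFor", "<Department0>"),
    ("?x", "ub:name", "?y1"),
    ("?x", "ub:emailAddress", "?y2"),
    ("?x", "ub:telephone", "?y3"),
]
six = five + [("?y2", "ub:alias", "?y4")]
print(plan_iterations(five))
print(plan_iterations(six))
```

For the five-pattern query a single join on `?x` suffices, while the extra pattern on `?y2` forces a second iteration.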
SELECT ?x ?y1 ?y2 ?y3
WHERE {
  ?x rdf:type ub:Professor .        # 1
  ?x ub:worksFor <Department0> .    # 2
  ?x ub:name ?y1 .                  # 3
  ?x ub:emailAddress ?y2 .          # 4
  ?x ub:telephone ?y3 .             # 5
  ?y2 ub:alias ?y4                  # 6
}
[Figure: tp1 (x), tp2 (x), tp3 (x, y1), tp4 (x, y2), tp5 (x, y3) are joined into (x, y1, y2, y3); that result is then joined with tp6 (y2, y4)]
Page 19
Experiment
Environment
LUBM Dataset
Amazon EC2, Cloudera’s Hadoop Distribution, Amazon EBS
The effect of multi-way join
The multi-way join technique reduces the execution time by joining several triple patterns at once
Some queries do not show a significant difference because they are too simple to take advantage of multi-way join
Execution time per query (Diff. = 2-way minus Multi-way)
Query   2-way     Multi-way   Diff.
Q1      123.391   86.423      36.968
Q2      181.583   104.035     77.548
Q3      69.773    67.214      2.559
Q4      256.591   126.474     130.117
Q5      75.533    74.163      1.37
Q6      44.198    44.526      -0.328
Q7      205.636   135.047     70.589
Q8      232.551   140.414     92.137
Q9      256.031   152.747     103.284
Q10     68.834    73.337      -4.503
Q11     66.834    63.557      3.277
Q12     112.802   86.117      26.685
Q13     73.369    72.825      0.544
Q14     47.092    42.156      4.936
Page 20
Experiment
Scalability
As the number of machines increases, the average execution time decreases
– The MR algorithm creates a sufficient number of reducers, so we can utilize many machines
As the data size increases, the execution time of the algorithm remains scalable
Page 21
Issues & Future Work – Indexing
Execution Time of MR-Selection and each MR-Join Iteration
MR-Selection can be a bottleneck because it takes about 40 seconds
The underlying storage structure is important
N-Triples format -> HBase, Partitioning
Building an index needs a significant amount of loading time
Page 22
Issues & Future Work – Pipelining
Hadoop’s MR implementation materializes intermediate results into the file system
This takes a long time because of disk I/O
Pipelining
Allows data to be sent and received between tasks and between jobs without disk I/O
– Some implementations have become available
Hadoop Online Prototype (http://code.google.com/p/hop/)
CGL-MapReduce (eScience 2008)
Page 23
Conclusion
There still remain many issues
This work is still in progress
Conclusion
RDF Data Warehouse using MapReduce
– RDF: Flexibility, Integration, Inference
– MapReduce: Scalability, Extensibility, Fault-tolerance
SPARQL Processing with MapReduce
– Synergy effects between RDF and MapReduce
– Issues
System Architecture
Loading(Indexing), Pipelining, Encoding, …