SPARQL Basic Graph Pattern Processing with Iterative MapReduce 2010-04-26 Presented by Jaeseok Myung Intelligent Database Systems Lab School of Computer.

SPARQL Basic Graph Pattern Processing with Iterative MapReduce

2010-04-26

Presented by Jaeseok Myung

Intelligent Database Systems Lab

School of Computer Science & Engineering

Seoul National University, Seoul, Korea

Copyright 2010 by CEBT

MapReduce

MapReduce is easily accessible

The Hadoop project provides an open-source MR implementation

MapReduce gives users a simple abstraction for utilizing parallel and distributed systems

Programming Model

– Map(k,v) -> list(k’, v’)

– Reduce(k’, list(v’)) -> list(v’’)
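The programming model above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not Hadoop code: the function `run_mapreduce`, the predicate-counting job, and the sample triples are all assumptions introduced for the example.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Run one MapReduce pass in memory (no Hadoop, no parallelism)."""
    groups = defaultdict(list)
    # Map phase: each (k, v) record yields zero or more (k', v') pairs.
    for k, v in records:
        for k2, v2 in map_fn(k, v):
            groups[k2].append(v2)  # shuffle: group the values by k'
    # Reduce phase: each (k', list(v')) group yields a list of outputs.
    output = []
    for k2 in sorted(groups):
        output.extend(reduce_fn(k2, groups[k2]))
    return output

# Hypothetical job: count how often each predicate occurs in some triples.
triples = [
    (0, ("Prof0", "rdf:type", "ub:Professor")),
    (1, ("Prof0", "ub:name", "Professor0")),
    (2, ("Dept0", "rdf:type", "ub:Department")),
]

def count_map(_, triple):
    yield triple[1], 1  # key = predicate, value = 1

def count_reduce(predicate, ones):
    yield predicate, sum(ones)

predicate_counts = run_mapreduce(triples, count_map, count_reduce)
```

A real MapReduce runtime distributes the map and reduce calls over many machines; the sketch only shows the data flow between the two functions.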

Useful for Massive Data Processing

Center for E-Business Technology MDAC 2010 – 2/23


MR & Cloud Computing

MapReduce is a kind of platform

MapReduce utilizes a number of commodity machines

There can be a number of applications using MapReduce

[Figure: a MapReduce layer shared by several applications (App.)]


RDF Data Warehouse using MapReduce

Data Warehouse using MapReduce

Extensive studies have shown that MR is well suited to large-scale, fault-tolerant data analysis

Hive, CloudBase

– Data warehousing solutions built on top of Hadoop

Advantages

– Scalability

– Extensibility

– Fault-tolerance

My Research Interest

RDF Data Warehouse using MapReduce



Why RDF Data Warehouse?

Flexible Data Model

The underlying structure of any expression in RDF is a collection of triples (s, p, o)

Data Integration

RDB-to-RDF (intra)

Linked Open Data (inter)

Incremental Integration

Inference

We can discover some knowledge from what we already know

A goal of data analyses



Approaches & Advantages


[Figure: approaches to building a data warehouse. Labels: Conventional DW Solutions, RDF Data Warehouse; Centralized, Distributed & Parallel; Before the Cloud, (MR) Cloud Computing. Listed properties: Flexibility, Integration, Inference; Complexity, Large-scale data analyses; Scalability, Extensibility, Fault-tolerance; Support Tools; Simple, Fast; Performance, Optimization]


SPARQL BGP Processing with MapReduce

Both RDF and MapReduce can benefit a data warehouse

RDF is a data model

– Flexibility, Integration, Inference

MapReduce is a programming model

– Scalability, Extensibility, Fault-tolerance

It has been difficult to create synergy because only a few algorithms connect the data model and the framework

We should focus on an MR algorithm that manipulates RDF datasets

A MapReduce Algorithm for SPARQL Basic Graph Pattern Processing



SPARQL Basic Graph Pattern

SPARQL is a query language for RDF datasets

A Basic Graph Pattern (BGP) is a set of triple patterns

Triple patterns are similar to RDF triples (s, p, o) except that each of the subject, predicate and object can be a variable

BGP processing is important

– Most SPARQL queries have one or more BGPs

– BGPs require expensive join operations among triple patterns


SELECT ?x ?y1 ?y2 ?y3
WHERE {
    ?x rdf:type ub:Professor .        # TP#1
    ?x ub:worksFor <Department0> .    # TP#2
    ?x ub:name ?y1 .                  # TP#3
    ?x ub:emailAddress ?y2 .          # TP#4
    ?x ub:telephone ?y3               # TP#5
}

The five triple patterns TP#1 to TP#5 together form the BGP
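To make the notion of a triple pattern concrete, here is a minimal Python sketch of matching one RDF triple against one pattern. The tuple representation, the `?` prefix test, and the function name `match` are assumptions made for illustration; they are not taken from the slides.

```python
def is_variable(term):
    # Assumption: variables are written with a leading "?", as in SPARQL.
    return term.startswith("?")

def match(pattern, triple):
    """Return the variable bindings if `triple` satisfies `pattern`, else None."""
    bindings = {}
    for p_term, t_term in zip(pattern, triple):
        if is_variable(p_term):
            name = p_term[1:]
            if bindings.get(name, t_term) != t_term:
                return None  # same variable would need two different values
            bindings[name] = t_term
        elif p_term != t_term:
            return None  # a constant in the pattern does not match
    return bindings

# The five triple patterns of the example query.
BGP = [
    ("?x", "rdf:type", "ub:Professor"),
    ("?x", "ub:worksFor", "<Department0>"),
    ("?x", "ub:name", "?y1"),
    ("?x", "ub:emailAddress", "?y2"),
    ("?x", "ub:telephone", "?y3"),
]

b1 = match(BGP[0], ("<Prof0>", "rdf:type", "ub:Professor"))
b3 = match(BGP[2], ("<Prof0>", "ub:name", "Professor0"))
miss = match(BGP[0], ("<Dept0>", "rdf:type", "ub:Department"))
```

Answering the whole BGP then amounts to joining the binding sets of the individual patterns on their shared variables, which is the expensive part.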



SPARQL BGP Processing with MapReduce

Two Operations

MR-Selection

– Extracts RDF triples which satisfy at least one triple pattern

MR-Join

– Merges selected triples



Example RDF triples (input):

<Prof0>   rdf:type       ub:Professor
<Prof0>   ub:worksFor    <Dept0>
<Prof0>   ub:name        “Professor0”
<Prof0>   ub:email       “[email protected]”
<Prof0>   ub:telephone   “000-0000-0000”
<Dept0>   rdf:type       ub:Department
…         …              …

[Figure: MR-Selection extracts the triples that match triple patterns 1 to 5; MR-Join then merges the selected triples for <Prof0> into one result carrying ub:worksFor <Dept0>, ub:name “Professor0”, ub:email “[email protected]” and ub:telephone “000-0000-0000”]


MR-Selection

public void map() {
    read a triple (s, p, o)
    // example: s = Prof0, p = rdf:type, o = ub:Professor
    for each (triple pattern in the given query) {
        if (input triple satisfies the triple pattern) {
            make a key and a value
            // key = [x]Prof0 (variable name, value)
            // value = 1 (number of the satisfied triple pattern)
            output (key, value)
        }
    }
}

public void reduce() {
    read input from the map function
    // input format: (key, list(satisfied tp_numbers))
    for each (value in the list of tp_numbers) {
        make a key and a value
        // key = <1>x, value = [x]Prof0
        output (key, value)
    }
}
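The MR-Selection pseudocode can be turned into a small runnable sketch. The following Python is one possible reading, run in memory rather than on Hadoop. The key and value encodings (`[x]Prof0`, `<1>x`) follow the comments in the pseudocode; the helper names `match`, `selection_map`, `selection_reduce` and `mr_selection`, the two-pattern `BGP`, and the assumption that values never contain `|` or `]` are introduced here for illustration.

```python
from collections import defaultdict

BGP = {  # tp_number -> triple pattern; a leading "?" marks a variable
    1: ("?x", "rdf:type", "ub:Professor"),
    3: ("?x", "ub:name", "?y1"),
}

def match(pattern, triple):
    """Return variable bindings if the triple satisfies the pattern, else None."""
    bindings = {}
    for p, t in zip(pattern, triple):
        if p.startswith("?"):
            if bindings.setdefault(p[1:], t) != t:
                return None
        elif p != t:
            return None
    return bindings

def selection_map(triple):
    # Emit ("[x]Prof0", 1) for every triple pattern the triple satisfies.
    for tp_number, pattern in BGP.items():
        bindings = match(pattern, triple)
        if bindings is not None:
            key = "|".join(f"[{var}]{val}" for var, val in bindings.items())
            yield key, tp_number

def selection_reduce(key, tp_numbers):
    # Turn ("[x]Prof0", [1]) into ("<1>x", "[x]Prof0").
    variables = "|".join(part[1:part.index("]")] for part in key.split("|"))
    for tp_number in tp_numbers:
        yield f"<{tp_number}>{variables}", key

def mr_selection(triples):
    groups = defaultdict(list)  # stands in for the shuffle phase
    for triple in triples:
        for key, tp_number in selection_map(triple):
            groups[key].append(tp_number)
    return [kv for key in groups for kv in selection_reduce(key, groups[key])]

selected = mr_selection([
    ("Prof0", "rdf:type", "ub:Professor"),
    ("Prof0", "ub:name", "Professor0"),
    ("Dept0", "rdf:type", "ub:Department"),
])
```

The `Dept0` triple satisfies neither pattern, so it produces no output; the two `Prof0` triples become one row each in the tables for triple patterns 1 and 3.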





MR-Selection

Conceptually, the MR-Selection algorithm produces one temporary table per triple pattern, holding the triples that satisfy it

A result table has variable names, just as a relational table has attribute names

It also has values for the variable names, as does the relational table

The result table will be used for the next MR-Join operation if necessary


[Figure: temporary result tables produced by MR-Selection, one per triple pattern: tp1 (x), tp2 (x), tp3 (x, y1), tp4 (x, y2), tp5 (x, y3)]


MR-Join: Map

[Figure: the Mapper keys its output on the values of the join-key variable, e.g. <Prof0>, <Prof1>]

BGP Analyzer

BGP Analyzer examines a given query before execution and provides join-keys to the map function

Join-key (shared variable): ?x


MR-Join: Map

public void map() {
    read input from MR-Selection
    // example input: (<1>x, [x]Prof0)
    // example input: (<3>x|y1, [x]Prof0|[y1]Professor0)
    get join-key variables and the corresponding tp_numbers to be joined from the BGP Analyzer
    // example join-key: x, tp_numbers = (1, 2, 3, 4, 5)
    for each (join-key determined by the BGP Analyzer) {
        if (input is related to the join-key) {
            make a key and a value
            // key = [x]Prof0 (variable name, value)
            // value = <tp>1</tp>[x]Prof0 (number of the satisfied triple pattern, variable name, value)
            // value = <tp>3</tp>[x]Prof0|[y1]Professor0
            output (key, value)
        }
    }
}
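As a rough illustration of the map step, the Python sketch below re-keys MR-Selection output on the value of the join-key variable. The string formats come from the example comments in the pseudocode; the `JOIN_KEYS` dictionary standing in for the BGP Analyzer output and the function name `join_map` are assumptions, and values are assumed not to contain `|` or `]`.

```python
# Assumed BGP Analyzer output: join-key variable -> related tp_numbers.
JOIN_KEYS = {"x": (1, 2, 3, 4, 5)}

def join_map(key, value):
    """Re-key one MR-Selection record on the value of each related join-key.

    key looks like "<3>x|y1", value like "[x]Prof0|[y1]Professor0".
    """
    tp_number = int(key[1:key.index(">")])
    # Parse "[x]Prof0|[y1]Professor0" into {"x": "Prof0", "y1": "Professor0"}.
    bindings = dict(part[1:].split("]", 1) for part in value.split("|"))
    for join_var, tp_numbers in JOIN_KEYS.items():
        if tp_number in tp_numbers and join_var in bindings:
            out_key = f"[{join_var}]{bindings[join_var]}"
            out_value = f"<tp>{tp_number}</tp>{value}"
            yield out_key, out_value

mapped = (
    list(join_map("<1>x", "[x]Prof0"))
    + list(join_map("<3>x|y1", "[x]Prof0|[y1]Professor0"))
)
```

Both records end up under the same key `[x]Prof0`, so the shuffle phase delivers them to the same reducer.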





MR-Join: Reduce

[Figure: the Reducer checks the constraints for join-key variable x, i.e. that triple patterns 1, 2, 3, 4 and 5 are all present, and outputs the merged result with ub:worksFor <Dept0>, ub:name “Professor0”, ub:email “[email protected]” and ub:telephone “000-0000-0000”]

BGP Analyzer

BGP Analyzer can provide triple pattern numbers related to the join-key variable by examining a given query


MR-Join: Reduce

public void reduce() {
    read input from the map function
    // example input: ([x]Prof0, [<tp>1</tp>[x]Prof0, <tp>3</tp>[x]Prof0|[y1]Professor0])
    get join-key variables and the corresponding tp_numbers to be joined from the BGP Analyzer
    // example join-key: x, tp_numbers = (1, 2, 3, 4, 5)
    create a temporary hashtable H
    for each (value in values) {
        add an element to H
        // key = <1>x, value = [x]Prof0
        // key = <3>x|y1, value = [x]Prof0|[y1]Professor0
    }
    // H is used to check whether the input satisfies all related triple patterns
    if (keys in H cover all tp_numbers to be joined) {
        make a Cartesian product among the values in H
        // (a1, b1), (a1, c1) => (a1, b1, c1)
        make a key and a value
        // key = <1|3>x|y1
        // value = [x]Prof0|[y1]Professor0
        output (key, value)
    }
}
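The reduce step can be sketched in Python as follows. This is an illustrative reading of the pseudocode, not the authors' implementation: the function name `join_reduce`, the `tp_numbers` parameter standing in for the BGP Analyzer, and the simplification that the joined records share only the join-key variable are all assumptions.

```python
from itertools import product

def join_reduce(key, values, tp_numbers):
    """Join all records that share one join-key value (e.g. key "[x]Prof0")."""
    table = {}  # the temporary hashtable H: tp_number -> list of binding dicts
    for value in values:
        # value looks like "<tp>3</tp>[x]Prof0|[y1]Professor0"
        tp_text, binding_text = value[len("<tp>"):].split("</tp>", 1)
        bindings = dict(part[1:].split("]", 1) for part in binding_text.split("|"))
        table.setdefault(int(tp_text), []).append(bindings)
    if not all(tp in table for tp in tp_numbers):
        return  # some triple pattern has no match for this join-key value
    tp_label = "|".join(str(tp) for tp in tp_numbers)
    # Cartesian product: one combination per choice of a row from each pattern.
    for combo in product(*(table[tp] for tp in tp_numbers)):
        merged = {}
        for bindings in combo:
            merged.update(bindings)  # assumes only the join-key is shared
        out_key = f"<{tp_label}>" + "|".join(merged)
        out_value = "|".join(f"[{var}]{val}" for var, val in merged.items())
        yield out_key, out_value

joined = list(join_reduce(
    "[x]Prof0",
    ["<tp>1</tp>[x]Prof0", "<tp>3</tp>[x]Prof0|[y1]Professor0"],
    tp_numbers=(1, 3),
))
incomplete = list(join_reduce("[x]Prof1", ["<tp>1</tp>[x]Prof1"], tp_numbers=(1, 3)))
```

Because every record for a given join-key value arrives at the same reducer, any number of triple patterns can be joined in this single pass, which is what makes the join multi-way.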



Join-key Selection Strategies

BGP Analyzer provides join-key variables by analyzing a query

How to select join-key variables?

If a BGP has one shared variable

– We can simply select that variable

If a BGP has two or more shared variables

– We applied two heuristics to select join-key variables

– Greedy Selection: select a join-key according to the number of related triple patterns

– Multiple Selection: select join-keys until every triple pattern participates in an MR-Join operation

Utilize the distributed and parallel system architecture
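One way to picture the two heuristics is the Python sketch below. It is only an interpretation: the slides do not specify tie-breaking or how the two heuristics combine, so the function names `related_patterns` and `select_join_keys`, the choice of "most uncovered patterns first", and the alphabetical tie-break are assumptions. The example BGP extends the five-pattern query with a sixth pattern, `?y2 ub:alias ?y4`.

```python
def related_patterns(bgp):
    """Map each variable to the set of triple pattern numbers that use it."""
    related = {}
    for tp_number, pattern in bgp.items():
        for term in pattern:
            if term.startswith("?"):
                related.setdefault(term[1:], set()).add(tp_number)
    return related

def select_join_keys(bgp):
    """Pick shared variables until every coverable triple pattern is covered."""
    shared = {v: tps for v, tps in related_patterns(bgp).items() if len(tps) > 1}
    uncovered = set(bgp)
    selected = {}
    while uncovered:
        # Greedy step (assumed): prefer the variable related to the most
        # not-yet-covered triple patterns; ties are broken alphabetically.
        best = max(sorted(shared), key=lambda v: len(shared[v] & uncovered),
                   default=None)
        if best is None or not shared[best] & uncovered:
            break  # remaining patterns share no variable with any other
        selected[best] = sorted(shared[best])
        uncovered -= shared[best]
    return selected

BGP = {
    1: ("?x", "rdf:type", "ub:Professor"),
    2: ("?x", "ub:worksFor", "<Department0>"),
    3: ("?x", "ub:name", "?y1"),
    4: ("?x", "ub:emailAddress", "?y2"),
    5: ("?x", "ub:telephone", "?y3"),
    6: ("?y2", "ub:alias", "?y4"),
}

join_keys = select_join_keys(BGP)
```

For this query the sketch picks `x` first, since it covers five patterns, and then `y2` to bring in the sixth pattern.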



SPARQL BGP Processing with MR

Advantages

MapReduce can benefit from the multi-way join technique

– If triple patterns share a variable, MR can join them all at once

– It is common for a BGP to have several triple patterns sharing the same variable, because RDF has a fixed, simple data model



[Figure: (a) a cascade of 2-way joins over tp1 to tp5, producing (x), (x, y1), (x, y1, y2) and (x, y1, y2, y3) in turn; (b) a single multi-way join over tp1 to tp5, producing (x, y1, y2, y3) at once]


SPARQL BGP Processing with MR

Disadvantages

If we have two or more shared variables, we need expensive MR iterations

The triple patterns in a query cannot all be covered by a single variable

If we have two shared variables, MR iterations cannot be avoided

To reduce unnecessary MR iterations, join-key selection strategies should be applied


SELECT ?x ?y1 ?y2 ?y3
WHERE {
    ?x rdf:type ub:Professor .
    ?x ub:worksFor <Department0> .
    ?x ub:name ?y1 .
    ?x ub:emailAddress ?y2 .
    ?x ub:telephone ?y3 .
    ?y2 ub:alias ?y4
}

[Figure: tp1 to tp5 are joined on x to produce (x, y1, y2, y3); that result is then joined with tp6 (y2, y4) in a further join]


Experiment

Environment

LUBM Dataset

Amazon EC2, Cloudera’s Hadoop Distribution, Amazon EBS

The effect of multi-way join

The multi-way join technique reduces the execution time by joining several triple patterns at once

Some queries do not show a significant difference because they are too simple to take advantage of multi-way join


Execution time per query (Diff. = 2-way minus Multi-way)

Query   2-way     Multi-way   Diff.
Q1      123.391    86.423      36.968
Q2      181.583   104.035      77.548
Q3       69.773    67.214       2.559
Q4      256.591   126.474     130.117
Q5       75.533    74.163       1.37
Q6       44.198    44.526      -0.328
Q7      205.636   135.047      70.589
Q8      232.551   140.414      92.137
Q9      256.031   152.747     103.284
Q10      68.834    73.337      -4.503
Q11      66.834    63.557       3.277
Q12     112.802    86.117      26.685
Q13      73.369    72.825       0.544
Q14      47.092    42.156       4.936


Experiment

Scalability

As the number of machines increases, the average execution time decreases

– The MR algorithm creates a sufficient number of reducers, so we can utilize many machines

As we increase the data size, the algorithm shows scalable execution time



Issues & Future Work – Indexing

Execution Time of MR-Selection and each MR-Join Iteration

MR-Selection can be a bottleneck because it takes about 40 seconds

The underlying storage structure is important

N-triple format -> HBase, Partitioning

Building an index needs a significant amount of loading time



Issues & Future Work – Pipelining

Hadoop’s MR implementation materializes intermediate results into the file system

This takes considerable time because of disk I/O

Pipelining

Allows data to be sent and received between tasks and between jobs without disk I/O

– Some implementations have become available

Hadoop Online Prototype (http://code.google.com/p/hop/)

CGL-MapReduce (eScience 2008)



Conclusion

There still remain many issues

This work is still in progress

Conclusion

RDF Data Warehouse using MapReduce

– RDF: Flexibility, Integration, Inference

– MapReduce: Scalability, Extensibility, Fault-tolerance

SPARQL Processing with MapReduce

– Synergy effects between RDF and MapReduce

– Issues

System Architecture

Loading(Indexing), Pipelining, Encoding, …

