Query Processing Using Structure Index for RDF Data on the Web

KIT, the cooperation of Forschungszentrum Karlsruhe GmbH and Universität Karlsruhe (TH)

Thanh Tran and Günter Ladwig
Institute AIFB, Karlsruhe Institute of Technology
[email protected], [email protected]


Agenda

- Problem Introduction
- Approach
  - Structure Index for RDF Data
  - Structure-based Partitioning
  - Structure-aware Query Processing
- Evaluation
- Conclusion


RDF data

[Figure: example RDF data graph. Vertices 0 to 9, KIT and MIT are connected by edges labelled AuthorOf, Supervises, WorksAt and Name.]

- Consists of triples <s,p,o>
- Triples form a graph, where vertices denote resources and their values, connected by directed labelled edges representing properties (i.e., relations and attributes)
- URIs are used as labels of edges and vertices representing resources
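As a minimal sketch, the example graph can be written as a list of <s,p,o> triples and turned into adjacency lists. The triples are the ones legible in the tables and triple lists of these slides; vertex 5 is left out because its exact edges cannot be recovered from the figure, and the names `TRIPLES` and `build_graph` are illustrative only.

```python
from collections import defaultdict

# Triples <s, p, o> legible in the slides. Assumption: vertex 5 is omitted
# because its edges are not recoverable from the figure.
TRIPLES = [
    ("2", "AuthorOf", "0"), ("4", "AuthorOf", "0"), ("6", "AuthorOf", "1"),
    ("3", "AuthorOf", "0"), ("7", "AuthorOf", "1"),
    ("3", "Supervises", "2"), ("3", "Supervises", "4"), ("7", "Supervises", "6"),
    ("2", "WorksAt", "8"), ("4", "WorksAt", "8"), ("6", "WorksAt", "9"),
    ("3", "WorksAt", "8"), ("7", "WorksAt", "9"),
    ("8", "Name", "KIT"), ("9", "Name", "MIT"),
]


def build_graph(triples):
    """Return outgoing and incoming adjacency lists keyed by vertex."""
    out_edges, in_edges = defaultdict(list), defaultdict(list)
    for s, p, o in triples:
        out_edges[s].append((p, o))  # directed labelled edge s -p-> o
        in_edges[o].append((p, s))   # the same edge seen from its target
    return out_edges, in_edges


out_edges, in_edges = build_graph(TRIPLES)
print(sorted(out_edges["3"]))
print(sorted(in_edges["0"]))
```

Keeping both directions is a design choice made with the later slides in mind: the structure index looks at outgoing as well as incoming edges.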


Conjunctive Queries


- Important fragment of widely used languages (SQL, SPARQL)
- Consisting of triple patterns p(s,o) where p is a predicate and s and o are variables or constants
- Distinguished variables, e.g. x, vs. undistinguished variables
- Triple patterns constitute a query graph

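[Figure: example query graph over the variables x, y, z, u and the constant KIT, with edges labelled AuthorOf, Supervises, WorksAt, WorksAt and Name.]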
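A conjunctive query of this kind can be sketched as a list of triple patterns p(s,o). The encoding is hypothetical (variables start with "?"), and the five patterns are a reconstruction of the query graph from the figure and the later matching slides, so they should be read as an assumption rather than the authors' exact query.

```python
# Hypothetical encoding: (predicate, subject, object); terms starting with "?"
# are variables, everything else is a constant.
# Assumption: patterns reconstructed from the query graph figure.
QUERY = [
    ("Supervises", "?x", "?y"),
    ("AuthorOf", "?y", "?z"),
    ("WorksAt", "?x", "?u"),
    ("WorksAt", "?y", "?u"),
    ("Name", "?u", "KIT"),
]
DISTINGUISHED = {"?x"}  # variables whose bindings are returned as answers


def is_var(term):
    return term.startswith("?")


def variables(query):
    """All variables occurring in subject or object position."""
    return {t for _, s, o in query for t in (s, o) if is_var(t)}


undistinguished = variables(QUERY) - DISTINGUISHED
print(sorted(variables(QUERY)))
print(sorted(undistinguished))
```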


Conjunctive Query Answering

[Figure: the example RDF data graph with vertices 0 to 9, KIT and MIT.]

- Graph pattern matching problem: a match of a query q on a graph G is a mapping h from the variables of q to vertices of G such that the substitution of variables in the graph representation of q would yield a subgraph of G
- A match h is a homomorphism from the “query graph” to the data graph
- Query answering based on two basic operations: data loading and join

[Figure: the example query graph over x, y, z, u and the constant KIT.]
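The two basic operations, data loading and join, can be illustrated with a naive matcher that computes all mappings h. This is a sketch under assumptions: the triples are those legible in the slides (vertex 5 omitted), the query is a reconstruction of the query graph, and `load`, `bind`, `join` and `answer` are hypothetical helper names, not the API of any system mentioned here.

```python
TRIPLES = [
    ("2", "AuthorOf", "0"), ("4", "AuthorOf", "0"), ("6", "AuthorOf", "1"),
    ("3", "AuthorOf", "0"), ("7", "AuthorOf", "1"),
    ("3", "Supervises", "2"), ("3", "Supervises", "4"), ("7", "Supervises", "6"),
    ("2", "WorksAt", "8"), ("4", "WorksAt", "8"), ("6", "WorksAt", "9"),
    ("3", "WorksAt", "8"), ("7", "WorksAt", "9"),
    ("8", "Name", "KIT"), ("9", "Name", "MIT"),
]
# Assumption: query reconstructed from the figure; "?" marks variables.
QUERY = [
    ("Supervises", "?x", "?y"), ("AuthorOf", "?y", "?z"),
    ("WorksAt", "?x", "?u"), ("WorksAt", "?y", "?u"), ("Name", "?u", "KIT"),
]


def load(predicate, triples):
    """Data loading: fetch all (subject, object) pairs of one predicate."""
    return [(s, o) for s, p, o in triples if p == predicate]


def bind(h, term, value):
    """Extend mapping h with term -> value; return False on a conflict."""
    if term.startswith("?"):
        return h.setdefault(term, value) == value
    return term == value  # a constant must equal the vertex label


def join(matches, pattern, rows):
    """Join partial matches with the rows loaded for one triple pattern."""
    _, s_term, o_term = pattern
    joined = []
    for h in matches:
        for s, o in rows:
            new = dict(h)
            if bind(new, s_term, s) and bind(new, o_term, o):
                joined.append(new)
    return joined


def answer(query, triples):
    matches = [{}]  # start from the empty mapping
    for pattern in query:
        matches = join(matches, pattern, load(pattern[0], triples))
    return matches


matches = answer(QUERY, TRIPLES)
for h in matches:
    print(h)
```

Every triple pattern costs one load and one join here, which is the cost that the structure index is meant to reduce.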


State-of-the-art

Data Partitioning
- Vertical partitioning (SW-Store)

Indexing
- Sextuple indexing (Hexastore)
- Materialization and indexing of entire join paths (GRIN)

Index Implementation
- B+ tree
- Inverted index (Semplore)
- Index compression (RDF-3X)

Query processing
- Sorted merge join based on vertical partitioning and indexing (SW-Store)
- Join order optimization based on dynamic programming (RDF-3X)

A combination of different concepts makes up the state-of-the-art!


Large Volume of RDF Data on the Web

- ~10 billion RDF triples (2009)
- Interlinked by ~10 million mappings (2009)
- Besides linked data, there are standalone ontologies, RDFa, etc.


Semi-structured RDF data on the Web

[Figure: the example RDF data graph together with a schema graph over the classes Publication, Institute, Post Doc, PhD Student and String, connected by AuthorOf, Supervises, WorksAt and Name edges.]

- RDF graph often contains both data and schema information
- Resources are linked to a class (rdfs:Class) via rdf:type
- Schema information is often incomplete, especially for Web data and RDFa data

RDF data might be schema-less, semi-structured data


Overview of Our Approach

Problems

• Management of possibly semi-structured RDF data on the Web
• Scalability and efficiency of RDF Web data query processing

Contributions

• Parameterized structure index for RDF data
• Structure-based partitioning (SP)
• Structure-aware query processing

Benefits

• Reduction of unions & joins as well as IO cost


Structure Index for RDF data on the Web

- Structure index is a graph
- It is a structural description that is more fine-grained than a schema
- Consists of classes (extensions) and relations between them
- Resources in an extension exhibit the same structure, i.e., they cannot be distinguished by outgoing (forward bisimilarity) and incoming (backward bisimilarity) “edge trees”
- Bisimulation is parameterized by two sets of edge labels

[Figure: the example data graph and its structure index. Extensions: B1 = {3, 7}, B2 = {0, 1}, B3 = {8, 9}, B4 = {2, 4, 6}, B5 = {KIT, MIT}, B6 = {5}. Index edges are labelled AuthorOf, Supervises, WorksAt and Name.]
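The grouping by forward and backward bisimilarity can be sketched as a naive partition refinement. This is not the authors' algorithm (the slides do not show one); `structure_index` and its two label-set parameters are hypothetical names. Vertex 5 is omitted because its edges are not recoverable from the figure, so the extension B6 does not appear in the result.

```python
TRIPLES = [
    ("2", "AuthorOf", "0"), ("4", "AuthorOf", "0"), ("6", "AuthorOf", "1"),
    ("3", "AuthorOf", "0"), ("7", "AuthorOf", "1"),
    ("3", "Supervises", "2"), ("3", "Supervises", "4"), ("7", "Supervises", "6"),
    ("2", "WorksAt", "8"), ("4", "WorksAt", "8"), ("6", "WorksAt", "9"),
    ("3", "WorksAt", "8"), ("7", "WorksAt", "9"),
    ("8", "Name", "KIT"), ("9", "Name", "MIT"),
]
ALL_LABELS = {"AuthorOf", "Supervises", "WorksAt", "Name"}


def structure_index(triples, fwd_labels, bwd_labels):
    """Split blocks until all members of a block agree on their
    (direction, label, neighbour block) signature."""
    vertices = sorted({v for s, _, o in triples for v in (s, o)})
    block = {v: 0 for v in vertices}  # start with a single block
    while True:
        sig = {v: set() for v in vertices}
        for s, p, o in triples:
            if p in fwd_labels:  # outgoing edges: forward bisimilarity
                sig[s].add(("out", p, block[o]))
            if p in bwd_labels:  # incoming edges: backward bisimilarity
                sig[o].add(("in", p, block[s]))
        # Two vertices stay together only if old block and signature agree.
        ids = {}
        new_block = {}
        for v in vertices:
            key = (block[v], frozenset(sig[v]))
            new_block[v] = ids.setdefault(key, len(ids))
        if len(ids) == len(set(block.values())):
            break  # no block was split: the partition is stable
        block = new_block
    extensions = {}
    for v, b in block.items():
        extensions.setdefault(b, set()).add(v)
    return sorted(extensions.values(), key=sorted)


full = structure_index(TRIPLES, ALL_LABELS, ALL_LABELS)
coarse = structure_index(TRIPLES, {"Name"}, set())
print(full)
print(coarse)
```

With all labels in both sets, the sketch reproduces the extensions B1 to B5 listed on the slide. Restricting the label sets, as in `coarse`, yields a smaller index, which illustrates the effect of the parameterization.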


Structure-based Partitioning

- Whether a graph vertex instantiates a variable of a query depends on its structure, so vertices are physically grouped based on structural similarity
- Apply the grouping captured by the structure index to the physical organization
  - Create a physical group for every vertex of the structure index
  - Triples are in the same group when their subjects belong to the same extension
- Triples of an SP table not only satisfy the property of a triple pattern but also provide some structural guarantee, e.g., they match the entire query structure

[Figure: the structure index with extensions B1 = {3, 7}, B2 = {0, 1}, B3 = {8, 9}, B4 = {2, 4, 6}, B5 = {KIT, MIT} and B6 = {5}.]

SP B4 table

Sub  Property  Obj
2    AuthorOf  0
4    AuthorOf  0
6    AuthorOf  1
2    WorksAt   8
4    WorksAt   8
6    WorksAt   9

VP AuthorOf table

Sub  Obj
2    0
4    0
6    1
3    0
7    1
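The difference between the two layouts can be sketched in a few lines. The triples and extensions are taken from the slides (vertex 5 and B6 are omitted because the edges of 5 are not recoverable); the function names are hypothetical, and in-memory lists stand in for physical tables.

```python
from collections import defaultdict

TRIPLES = [
    ("2", "AuthorOf", "0"), ("4", "AuthorOf", "0"), ("6", "AuthorOf", "1"),
    ("3", "AuthorOf", "0"), ("7", "AuthorOf", "1"),
    ("3", "Supervises", "2"), ("3", "Supervises", "4"), ("7", "Supervises", "6"),
    ("2", "WorksAt", "8"), ("4", "WorksAt", "8"), ("6", "WorksAt", "9"),
    ("3", "WorksAt", "8"), ("7", "WorksAt", "9"),
    ("8", "Name", "KIT"), ("9", "Name", "MIT"),
]
# Extensions as listed on the slide (assumption: B6 = {5} left out).
EXTENSIONS = {
    "B1": {"3", "7"}, "B2": {"0", "1"}, "B3": {"8", "9"},
    "B4": {"2", "4", "6"}, "B5": {"KIT", "MIT"},
}


def vertical_partitioning(triples):
    """VP: one two-column (subject, object) table per property."""
    tables = defaultdict(list)
    for s, p, o in triples:
        tables[p].append((s, o))
    return tables


def structure_based_partitioning(triples, extensions):
    """SP: one table per extension, holding the triples whose subject
    belongs to that extension."""
    block_of = {v: name for name, members in extensions.items() for v in members}
    tables = defaultdict(list)
    for s, p, o in triples:
        tables[block_of[s]].append((s, p, o))
    return tables


vp = vertical_partitioning(TRIPLES)
sp = structure_based_partitioning(TRIPLES, EXTENSIONS)
print(vp["AuthorOf"])
print(sp["B4"])
```

The VP AuthorOf table mixes subjects from B1 and B4, whereas the SP B4 table holds only subjects that are known to share the same structure.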


Structure-aware Query Processing

Proposition 1: A mapping of q into G exists only if a mapping of q into the associated index graph G’ also exists. The resulting extensions that match the nodes in q will contain all data graph matches.

Two-step query processing
- Index graph: find extensions Ei matching q
- Data graph: combine the data elements retrieved for Ei


Index Graph Matching

[Figure: the query graph (x, y, z, u, KIT) matched against the index graph with extensions B1 to B6.]

- Retrieve index graph edges matching query edges (triple patterns)
- Join index graph edges along query edges

h1 = {B1, B2, B3, B4, B5}

h2 = {B2, B3, B4, B5, B6}
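Matching on the index graph works like matching on the data graph, only over extensions instead of vertices. In the sketch below, the extensions are those listed on the slides, while the assignment of the eight index edges to extension pairs and the query itself are reconstructions from the figures (assumptions); `match_index` is a hypothetical helper name.

```python
EXTENSIONS = {
    "B1": {"3", "7"}, "B2": {"0", "1"}, "B3": {"8", "9"},
    "B4": {"2", "4", "6"}, "B5": {"KIT", "MIT"}, "B6": {"5"},
}
# (label, source extension, target extension).
# Assumption: edge endpoints reconstructed from the index graph figure.
INDEX_EDGES = [
    ("AuthorOf", "B1", "B2"), ("AuthorOf", "B4", "B2"),
    ("Supervises", "B1", "B4"), ("Supervises", "B6", "B4"),
    ("WorksAt", "B1", "B3"), ("WorksAt", "B4", "B3"), ("WorksAt", "B6", "B3"),
    ("Name", "B3", "B5"),
]
# Assumption: query reconstructed from the figure; "?" marks variables.
QUERY = [
    ("Supervises", "?x", "?y"), ("AuthorOf", "?y", "?z"),
    ("WorksAt", "?x", "?u"), ("WorksAt", "?y", "?u"), ("Name", "?u", "KIT"),
]


def match_index(query, index_edges, extensions):
    block_of = {v: b for b, members in extensions.items() for v in members}

    def bind(h, term, block):
        if term.startswith("?"):
            return h.setdefault(term, block) == block
        # A constant matches the extension that contains it.
        return block_of.get(term) == block

    matches = [{}]
    for label, s_term, o_term in query:
        # Retrieve the index graph edges matching this query edge.
        edges = [(s, o) for l, s, o in index_edges if l == label]
        joined = []
        for h in matches:  # join along the query edge
            for s, o in edges:
                new = dict(h)
                if bind(new, s_term, s) and bind(new, o_term, o):
                    joined.append(new)
        matches = joined
    return matches


index_matches = match_index(QUERY, INDEX_EDGES, EXTENSIONS)
for h in index_matches:
    print(h)
```

Under these assumptions the two mappings found correspond to h1 and h2 on the slide, with the constant KIT covered by B5 in both.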


Query Pruning

Proposition 2: If a query is tree-shaped and consists only of undistinguished variables (besides the root), matches on the structure index contain all and only data graph matches.

- Data elements contained in the extensions matching the query root node represent all and only final query answers
- Given such queries, no further processing is needed
- Given more general queries, tree-shaped query parts can be pruned away
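One simplified way to picture the pruning is to drop, repeatedly, every triple pattern that hangs off the query as a leaf through an undistinguished variable. This leaf rule is an illustration only, not necessarily the authors' procedure, and it is only safe when the remaining variables are later bound from extensions that matched the full query on the index graph. The query is a reconstruction from the figures, and treating x as the only distinguished variable is an assumption.

```python
# Assumption: query reconstructed from the figure; "?" marks variables.
QUERY = [
    ("Supervises", "?x", "?y"), ("AuthorOf", "?y", "?z"),
    ("WorksAt", "?x", "?u"), ("WorksAt", "?y", "?u"), ("Name", "?u", "KIT"),
]
DISTINGUISHED = {"?x"}  # assumption: only x is returned


def prune(query, distinguished):
    """Return (remaining patterns, removed patterns)."""
    query, removed = list(query), []

    def is_leaf(term):
        # An undistinguished variable that occurs in exactly one pattern.
        occurrences = sum(term in (s, o) for _, s, o in query)
        return (term.startswith("?") and term not in distinguished
                and occurrences == 1)

    while True:
        leaf = next((p for p in query if is_leaf(p[1]) or is_leaf(p[2])), None)
        if leaf is None:
            return query, removed
        query.remove(leaf)    # the structure index already guarantees it
        removed.append(leaf)


kept, removed = prune(QUERY, DISTINGUISHED)
print(kept)
print(removed)
```

For the example query only the AuthorOf pattern is removed, so one load and one join are saved.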


Query Pruning

[Figure: the query graph (x, y, z, u, KIT) matched against the index graph with extensions B1 to B6.]

h1 = {B1, B2, B3, B4, B5}

- Elements in extensions are known to satisfy the query structure
- Elements in B4 are already known to be authors of some z
- No further data processing is needed for this part


Data Graph Matching

[Figure: the index graph with extensions B1 to B6 and the triples retrieved for the matching extensions.]

B1: 3 WorksAt 8, 7 WorksAt 9, 3 Supervises 2, 3 Supervises 4, 7 Supervises 6, ...

B3: 8 Name KIT, 9 Name MIT

B4: 2 WorksAt 8, 4 WorksAt 8, 6 WorksAt 9, ...

- Retrieve triples from matching extensions & join along query edges
- Match class processing: group index graph matches to match classes to avoid processing matches that partially overlap

h’1 = { 3 WorksAt 8, 3 Supervises 2, 2 WorksAt 8, 8 Name KIT }
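The second step can be sketched as a load that is restricted to the extensions found on the index graph, followed by the usual joins. The data are the triples legible in the slides (vertex 5 omitted), the query is the reconstructed query without the AuthorOf pattern, and `INDEX_MATCH` encodes h1. All names are hypothetical, and the subject filter merely stands in for reading one SP table.

```python
TRIPLES = [
    ("2", "AuthorOf", "0"), ("4", "AuthorOf", "0"), ("6", "AuthorOf", "1"),
    ("3", "AuthorOf", "0"), ("7", "AuthorOf", "1"),
    ("3", "Supervises", "2"), ("3", "Supervises", "4"), ("7", "Supervises", "6"),
    ("2", "WorksAt", "8"), ("4", "WorksAt", "8"), ("6", "WorksAt", "9"),
    ("3", "WorksAt", "8"), ("7", "WorksAt", "9"),
    ("8", "Name", "KIT"), ("9", "Name", "MIT"),
]
EXTENSIONS = {
    "B1": {"3", "7"}, "B2": {"0", "1"}, "B3": {"8", "9"},
    "B4": {"2", "4", "6"}, "B5": {"KIT", "MIT"},
}
# Assumption: reconstructed query after pruning the AuthorOf pattern.
PRUNED_QUERY = [
    ("Supervises", "?x", "?y"), ("WorksAt", "?x", "?u"),
    ("WorksAt", "?y", "?u"), ("Name", "?u", "KIT"),
]
INDEX_MATCH = {"?x": "B1", "?y": "B4", "?u": "B3"}  # h1 from the index graph


def bind(h, term, value):
    if term.startswith("?"):
        return h.setdefault(term, value) == value
    return term == value


def data_matches(query, index_match, triples, extensions):
    matches = [{}]
    for label, s_term, o_term in query:
        # Load only triples whose subject lies in the matched extension.
        if s_term in index_match:
            allowed = extensions[index_match[s_term]]
        else:
            allowed = {s_term}  # constant subject
        rows = [(s, o) for s, p, o in triples if p == label and s in allowed]
        joined = []
        for h in matches:  # join along the query edge
            for s, o in rows:
                new = dict(h)
                if bind(new, s_term, s) and bind(new, o_term, o):
                    joined.append(new)
        matches = joined
    return matches


answers = data_matches(PRUNED_QUERY, INDEX_MATCH, TRIPLES, EXTENSIONS)
for h in answers:
    print(h)
```

The binding x = 3, y = 2, u = 8 corresponds to the match h’1 shown on the slide.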


Evaluation

DBLP and several synthetic datasets created using the Lehigh University Benchmark (LUBM)

30 queries categorized into five classes

Path query (QLUBM6)
SELECT ?x ?y
takesCourse (x, y), teacherOf (z, y), type (z, FullProfessor)

Entity query (QLUBM9)
SELECT ?x ?m
emailAddress (x, fp@edu), res.Interest (x, research24), telephone (x, xxx-xxx-xxxx)

Single-atom query (QDBLP1)
SELECT ?x
type (x, Person)

Graph-shaped query (QLUBM15)
SELECT ?x ?a
teacherOf (FullProfessor5, y), takesCourse (x, y), publicationAuthor (b, x), name (b, Publication7), memberOf (x, z), memberOf (a, z), advisor (x, a), telephone (a, xxx-xxx-xxxx)

Star query (QDBLP12)
SELECT ?x, ?n
type (x, Person), name (x, n), editor (y, x), author (z, x), cites (u, z)


Evaluation – Performance

[Figure: two log-scale bar charts over queries q1 to q15. One compares SP and VP per query, including the mean; the other shows idx match, load(VP-SP), join(VP-SP) and the number of removed query nodes.]

- Compare our work (SP) against vertical partitioning (VP) [Abadi et al.]
  - Total query processing times
  - Times of individual steps involved
- Slightly slower w.r.t. simple queries (1-3)
- SP is 8-9 times faster w.r.t. complex queries (4-15)
- With more complex queries, the overhead incurred by answer space matching can be outweighed by the accumulated gain for load and join

Chart captions: total time in ms on DBLP; time of separate steps in ms and number of pruned query nodes.


Conclusions

Structure index that can deal with general graph-structured RDF data on the Web

Structure index can be leveraged for dealing with semi-structured data on the Web

Structure index can be used for RDF data partitioning & query processing, allowing complex queries to be processed many times faster

Future work
- Adopt existing concepts in XML data management for structure index optimization & updates
- Query optimization for structure-aware query processing


Thank you for your attention!

Structure Index for RDF Data on the Web
Duc Thanh Tran, AIFB Institute, KIT
E-Mail: [email protected]
http://sites.google.com/site/kimducthanh


State-of-the-art

Data Partitioning
- Big table (old versions of Oracle, Jena, Sesame)
- Property tables (Jena)
- Vertical partitioning (SW-Store)

Indexing
- Multiple indexing (YARS)
- Sextuple indexing (Hexastore)
- Materialization and indexing of entire join paths (GRIN)

Index Implementation
- B+ tree
- Inverted index (Semplore)
- Index compression (RDF-3X)

Query processing
- Sorted merge join based on vertical partitioning and indexing (SW-Store)
- Join order optimization based on dynamic programming (RDF-3X)

A combination of different concepts makes up the state-of-the-art!


Overview of Our Approach

Problems
• Management of possibly semi-structured RDF data on the Web
• Scalability and efficiency of RDF Web data query processing

Contributions
• Parameterized structure index for RDF data
• Structure-based partitioning (SP): triples with the same structure are grouped
• Structure-aware query processing
  • Use the structure index to focus on data that satisfy the overall query structure
  • Then retrieve data from the corresponding structure-based partitioned tables

Benefits
• Targets data partitioning & query processing, i.e., complementary to other concepts
• Reduction of unions & joins as well as IO cost


Evaluation – Scalability

[Figure: two charts. Processing times in ms for LUBM1, LUBM5, LUBM10, LUBM20 and LUBM50 (legend: VPQP-SQP, SQP, idx match, load (VPQP-SQP), join (VPQP-SQP)); query times in ms for DBLP, LUBM1, LUBM5, LUBM10 and LUBM50 (legend: OSQP, SQP).]

- Measured the average query performance for LUBM with varying size
  - Time increases with the size of the data
  - Gain for load and join increases in larger proportion than the overhead incurred for index match
- Match performance is determined by the size of the index graph
  - Size depends on the structure but not on the size of the data graph
  - Match time does not necessarily increase when the data becomes larger
- Positive effect of data filtering (IO reduction) and query pruning (load and join) correlates with the data size
