Workload Matters: Why RDF Databases Need a New Design Güneş Aluç M. Tamer Özsu Khuzaima Daudjee

Dec 18, 2015

Page 1

Workload Matters: Why RDF Databases Need a New Design

Güneş Aluç M. Tamer Özsu Khuzaima Daudjee

Page 2

Outline

• Why do RDF data management systems need a new design?

• How do we envision RDF data management systems to be re-designed?

Page 3

A Running Example

Consider the following SPARQL query:

Tamer --hasPost--> ?post <--???-- ?person --worksAt--> UWaterloo

where ??? is one of likes, taggedIn, retweets, favorites, etc.
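The running example can be sketched in a few lines of Python. This is only an illustration, assuming the query asks for every ?post of Tamer that some ?person working at UWaterloo is linked to through any predicate. The triples are just the rows visible in the tables of this talk (elided rows are left out, and UW is written as UWaterloo), and the name `run_example_query` is hypothetical.

```python
# Sample (subject, predicate, object) triples copied from the rows that are
# visible in the slides; rows hidden behind "…" are not included (assumption).
TRIPLES = [
    ("Tamer", "hasPost", "Post2"),
    ("Tamer", "hasPost", "Post23"),
    ("Tamer", "hasPost", "Post235"),
    ("Tamer", "hasPost", "Post2357"),
    ("Tamer", "hasPost", "Post23571"),
    ("Gunes", "worksAt", "UWaterloo"),
    ("Ken", "worksAt", "UWaterloo"),
    ("Olaf", "worksAt", "UWaterloo"),
    ("Gunes", "favorites", "Post1"),
    ("Alice", "taggedIn", "Post2"),
    ("Alice", "likes", "Post23"),
    ("Ken", "likes", "Post24"),
    ("Olaf", "favorites", "Post234"),
    ("Bob", "favorites", "Post235"),
    ("Bob", "taggedIn", "Post2357"),
    ("Gunes", "retweets", "Post2358"),
    ("Ken", "retweets", "Post23570"),
    ("Olaf", "taggedIn", "Post23571"),
]


def run_example_query(triples):
    """Return sorted (post, person, predicate) bindings for the example."""
    # Tamer --hasPost--> ?post
    posts = {o for s, p, o in triples if s == "Tamer" and p == "hasPost"}
    # ?person --worksAt--> UWaterloo
    staff = {s for s, p, o in triples if p == "worksAt" and o == "UWaterloo"}
    # ?person --???--> ?post, with the predicate left unbound
    return sorted((o, s, p) for s, p, o in triples if o in posts and s in staff)


print(run_example_query(TRIPLES))
```

On these visible rows only one binding survives, which is why the layout of the triples behind the `???` edge matters so much in the slides that follow.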

Page 4

Single Table Layout

P        S      O
…        …      …
hasPost  …      …
hasPost  Tamer  Post2
hasPost  Tamer  Post23
hasPost  Tamer  Post235
hasPost  Tamer  Post2357
hasPost  Tamer  Post23571
hasPost  …      …
…        …      …

O          S      P
…          …      …
Post1      Gunes  …
Post2      Alice  …
Post23     Alice  …
Post24     Ken    …
Post234    Olaf   …
Post235    Bob    …
Post2357   Bob    …
Post2358   Gunes  …
Post23570  Ken    …
Post23571  Olaf   …
…          …      …

P        O   S
…        …   …
worksAt  …   …
worksAt  UW  Gunes
worksAt  UW  Ken
worksAt  UW  Olaf
worksAt  …   …
…        …   …

Page 5

Single Table Layout

Tamer --hasPost--> ?post

Page 6

Single Table Layout

?person --worksAt--> UWaterloo

Page 7

Single Table Layout

?person --???--> ?post

Page 8

Single Table Layout

?person --???--> ?post

(1) Many irrelevant intermediate result tuples are generated

(2) These tuples are fragmented across the OSP table

(3) Indexes are not very useful for locating the relevant tuples
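The three problems above can be made concrete with a small Python sketch. It assumes the OSP rows in the order shown on the slide, the five posts of Tamer from the PSO table, and the three visible worksAt rows. The single range scan between the first and last matching row is an assumption made for illustration, and `scan_stats` is a hypothetical helper.

```python
# OSP rows (object, subject) in the order shown on the slide; the predicate
# column is elided in the slides and therefore omitted here.
OSP = [
    ("Post1", "Gunes"), ("Post2", "Alice"), ("Post23", "Alice"),
    ("Post24", "Ken"), ("Post234", "Olaf"), ("Post235", "Bob"),
    ("Post2357", "Bob"), ("Post2358", "Gunes"), ("Post23570", "Ken"),
    ("Post23571", "Olaf"),
]
TAMER_POSTS = {"Post2", "Post23", "Post235", "Post2357", "Post23571"}
UW_STAFF = {"Gunes", "Ken", "Olaf"}  # only the visible worksAt rows


def scan_stats(osp, posts, staff):
    """Return (rows scanned, intermediate tuples, final matches)."""
    # Positions of the rows whose object is one of Tamer's posts.
    hits = [i for i, (obj, _) in enumerate(osp) if obj in posts]
    # One range scan from the first to the last hit touches every row between.
    scanned = hits[-1] - hits[0] + 1
    intermediate = [osp[i] for i in hits]
    # Only tuples whose subject works at UWaterloo survive the final join.
    final = [(obj, subj) for obj, subj in intermediate if subj in staff]
    return scanned, len(intermediate), final


scanned, intermediate, final = scan_stats(OSP, TAMER_POSTS, UW_STAFF)
print(scanned, intermediate, final)
```

The matching rows sit at non-adjacent positions, so the scan touches rows that belong to neither Tamer's posts nor the final answer, and most intermediate tuples are discarded by the join with worksAt.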

Page 9

Group-by-Predicates

favorites

S      O
…      …
Bob    Post235
Gunes  Post1
Olaf   Post234
…      …

likes

S      O
…      …
Alice  Post23
Ken    Post24
…      …

retweets

S      O
…      …
Gunes  Post2358
Ken    Post23570
…      …

taggedIn

S      O
…      …
Alice  Post2
Bob    Post2357
Olaf   Post23571
…      …

Page 10

Group-by-Predicates

?person --???--> ?post
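A hedged sketch of why the unbound predicate is awkward for this layout: with one table per predicate, a pattern whose predicate is a variable has to probe every table. The rows are the visible ones from the four tables, and `match_unbound_predicate` is a hypothetical name, not part of any system discussed in the talk.

```python
# One (subject, object) table per predicate, as in the Group-by-Predicates
# layout; only the rows visible on the slide are included.
TABLES = {
    "favorites": [("Bob", "Post235"), ("Gunes", "Post1"), ("Olaf", "Post234")],
    "likes": [("Alice", "Post23"), ("Ken", "Post24")],
    "retweets": [("Gunes", "Post2358"), ("Ken", "Post23570")],
    "taggedIn": [("Alice", "Post2"), ("Bob", "Post2357"), ("Olaf", "Post23571")],
}
TAMER_POSTS = {"Post2", "Post23", "Post235", "Post2357", "Post23571"}


def match_unbound_predicate(tables, posts):
    """Evaluate ?person --???--> ?post; return (tables probed, matches)."""
    probed = 0
    matches = []
    for predicate, rows in sorted(tables.items()):
        probed += 1  # the predicate is a variable, so no table can be skipped
        matches.extend((s, predicate, o) for s, o in rows if o in posts)
    return probed, matches


probed, matches = match_unbound_predicate(TABLES, TAMER_POSTS)
print(probed, matches)
```

Every predicate table is visited, including `retweets`, which contributes nothing to the result for this query.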

Page 11

Group-by-Entities

        FacebookEntities       TwitterEntities
        likes    taggedIn      retweets    favorites
Alice   Post23   Post2         X           X
Bob     X        Post2357      X           Post235
Gunes   X        X             Post2358    Post1
Ken     Post24   X             Post23570   X
Olaf    X        Post23571     X           Post234

Page 12

Group-by-Entities

?person --???--> ?post

Page 13

Group-by-Vertices

Post1 ← favorites Gunes
Post2 ← taggedIn Alice
Post23 ← likes Alice
Post24 ← likes Ken
Post234 ← favorites Olaf
Post235 ← favorites Bob
Post2357 ← taggedIn Bob
Post2358 ← retweets Gunes
Post23570 ← retweets Ken
Post23571 ← taggedIn Olaf

Page 14

Group-by-Vertices

?person --???--> ?post
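As a rough illustration of the Group-by-Vertices layout, the incoming edges of each post can be kept together in an adjacency list. This is a sketch under assumptions: the edges are the ten shown on the slide, the worksAt set contains only the visible rows, and `people_linked_to` is a hypothetical helper rather than anything from the talk.

```python
# Incoming edges per post vertex, stored as (predicate, source vertex).
INCOMING = {
    "Post1": [("favorites", "Gunes")],
    "Post2": [("taggedIn", "Alice")],
    "Post23": [("likes", "Alice")],
    "Post24": [("likes", "Ken")],
    "Post234": [("favorites", "Olaf")],
    "Post235": [("favorites", "Bob")],
    "Post2357": [("taggedIn", "Bob")],
    "Post2358": [("retweets", "Gunes")],
    "Post23570": [("retweets", "Ken")],
    "Post23571": [("taggedIn", "Olaf")],
}
TAMER_POSTS = ["Post2", "Post23", "Post235", "Post2357", "Post23571"]
UW_STAFF = {"Gunes", "Ken", "Olaf"}  # only the visible worksAt rows


def people_linked_to(posts, incoming, staff):
    """Follow ?person --???--> ?post backwards from each post vertex."""
    visited = 0
    results = []
    for post in posts:
        for predicate, person in incoming.get(post, []):
            visited += 1  # each edge is a separate vertex-neighbourhood access
            if person in staff:
                results.append((post, person, predicate))
    return visited, results


print(people_linked_to(TAMER_POSTS, INCOMING, UW_STAFF))
```

Each of Tamer's posts is a separate neighbourhood lookup, and most of the edges that are visited lead to people who do not satisfy the worksAt pattern.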

Page 15

Does The Winner Take It All?

• With a single query, we were able to conceptually show problems with existing solutions

• The SPARQL workloads that RDF data management systems support
  – contain a very diverse selection of queries
  – and this selection of queries changes dynamically

Page 16

Does The Winner Take It All?

[Figure: mean query execution time (milliseconds) versus percentage of test query templates, with two series: RDF-3x and Fastest System]

G. Aluç, O. Hartig, M. T. Özsu and K. Daudjee. Diversified Stress Testing of RDF Data Management Systems. In Proc. International Semantic Web Conference, 2014. Forthcoming.

Page 18

Outline

• Why do RDF data management systems need a new design?

• How do we envision RDF data management systems to be re-designed?

Page 19

Group-by-Query

RDF Physical Design
  Fixed
    Single Table Layout
    Group-by-Predicates
    Group-by-Entities
    Group-by-Vertices
  Workload-Driven
    Group-by-Query

Page 20

Outline

• Why do RDF data management systems need a new design?

• How do we envision RDF data management systems to be re-designed?
  – Group-by-Query Representation
  – Partial Tuning

Page 21

Group-by-Query

Tamer --hasPost--> ?post <--???-- ?person --worksAt--> UWaterloo

Page 22

Group-by-Query

Tamer --hasPost--> Post23571 <--taggedIn-- Olaf --worksAt--> UWaterloo

Tamer --hasPost--> Post2357 <--taggedIn-- Bob --worksAt--> UWaterloo
Tamer --hasPost--> Post235 <--favorites-- Bob

Tamer --hasPost--> Post23 <--likes-- Bob --worksAt--> UWaterloo
Tamer --hasPost--> Post2 <--taggedIn-- Bob

Page 26

Group-by-Query (Advantages)

• Triples relevant to the evaluation of a query are physically clustered

• Indexes are more efficient in localizing query evaluation to only the relevant triples

• Fewer intermediate result tuples are generated
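The advantages listed above can be sketched as follows. This is a minimal illustration, not the authors' storage format: it assumes that each match of the running-example query becomes one cluster holding exactly the three triples that produce it, and `group_by_query` is a hypothetical name. The input is a small subset of the rows shown in the talk.

```python
# A small subset of the visible triples (subject, predicate, object).
TRIPLES = [
    ("Tamer", "hasPost", "Post2"),
    ("Tamer", "hasPost", "Post23571"),
    ("Alice", "taggedIn", "Post2"),
    ("Olaf", "taggedIn", "Post23571"),
    ("Ken", "likes", "Post24"),
    ("Ken", "worksAt", "UWaterloo"),
    ("Olaf", "worksAt", "UWaterloo"),
]


def group_by_query(triples):
    """Build one cluster of co-located triples per match of the query."""
    posts = {o for s, p, o in triples if s == "Tamer" and p == "hasPost"}
    staff = {s for s, p, o in triples if p == "worksAt" and o == "UWaterloo"}
    clusters = []
    for s, p, o in triples:
        if o in posts and s in staff:
            # The three triples that jointly answer the query sit side by side.
            clusters.append([
                ("Tamer", "hasPost", o),
                (s, p, o),
                (s, "worksAt", "UWaterloo"),
            ])
    return clusters


clusters = group_by_query(TRIPLES)
# Flattening the clusters gives an illustrative physical order of the triples.
layout = [triple for cluster in clusters for triple in cluster]
print(layout)
```

Reading one cluster answers one match of the query without touching triples such as `("Ken", "likes", "Post24")`, which is the clustering effect the bullets describe.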

Page 28

Challenge: Dynamism

1) Types of queries
2) Parts of the database that are being queried
3) Hotspots

Page 30

Outline

• Why do RDF data management systems need a new design?

• How do we envision RDF data management systems to be re-designed?
  – Group-by-Query Representation
  – Partial Tuning

Page 31

Proposal #1: Updating Physical Storage Layout

Initially, triples are not clustered in the storage system for any particular workload

Page 32

Proposal #1: Updating Physical Storage Layout

As queries are executed (that is, as triples flow through the cache), there is an opportunity to cluster (hot) triples that are co-accessed within the same query or across multiple queries

Page 33

Proposal #1: Updating Physical Storage Layout

Assume a hash function (oracle) decides on a good placement of triples and that the hash function is capable of adapting to changing workloads

Page 34

Proposal #1: Updating Physical Storage Layout

Then, one of the challenges is to develop this hash function
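The slides leave the placement function as an open challenge, so the following is only a toy stand-in for intuition, not the authors' hash function. It assumes a greedy rule: triples accessed by the same query are moved into the bucket that most of them already share, or into a fresh bucket if none has been placed yet. The class name `CoAccessPlacer` and its methods are hypothetical.

```python
from collections import Counter


class CoAccessPlacer:
    """Toy placement oracle that co-locates triples accessed together."""

    def __init__(self):
        self.bucket_of = {}   # triple -> bucket id (illustrative storage page)
        self.next_bucket = 0

    def observe(self, accessed):
        """Record one query's accessed triples and return their bucket."""
        known = Counter(
            self.bucket_of[t] for t in accessed if t in self.bucket_of
        )
        if known:
            # Reuse the bucket that already holds most of these triples.
            target = known.most_common(1)[0][0]
        else:
            # None of the triples has been placed yet: open a new bucket.
            target = self.next_bucket
            self.next_bucket += 1
        for t in accessed:
            self.bucket_of[t] = target  # placement adapts with every query
        return target


placer = CoAccessPlacer()
a = ("Tamer", "hasPost", "Post23571")
b = ("Olaf", "taggedIn", "Post23571")
c = ("Olaf", "worksAt", "UWaterloo")
d = ("Ken", "likes", "Post24")
print(placer.observe([a, b]), placer.observe([b, c]), placer.observe([d]))
```

Because placement is revised on every observed query, the layout follows the workload as it changes. A real design would also have to bound bucket sizes and the cost of moving triples, which this sketch ignores.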

Page 35

Proposal #2: Partial Indexing

On top of the aforementioned scheme, consider an index that, for some queries in the workload, also returns irrelevant triples (false positives, shown striped)

Page 36

Proposal #2: Partial Indexing

This is no big deal, because these false-positive triples can be eliminated from the query evaluation pipeline with just a little extra computational cost

On the other hand, this index is much easier to update and maintain
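One way to picture an index that tolerates false positives is a coarse index keyed on a prefix of the object, followed by an exact filter in the evaluation pipeline. The prefix key is an assumption chosen for illustration (the slides do not say how the partial index is built), and `PrefixIndex` and `lookup` are hypothetical names.

```python
from collections import defaultdict


class PrefixIndex:
    """Coarse index keyed on an object prefix; lookups may over-return."""

    def __init__(self, prefix_len):
        self.prefix_len = prefix_len
        self.buckets = defaultdict(list)

    def add(self, triple):
        # Maintenance is a single append to the bucket of the object's prefix.
        self.buckets[triple[2][: self.prefix_len]].append(triple)

    def candidates(self, obj):
        # May include triples whose object merely shares the prefix.
        return list(self.buckets.get(obj[: self.prefix_len], []))


def lookup(index, obj):
    """Return (exact matches, number of false positives filtered out)."""
    cands = index.candidates(obj)
    exact = [t for t in cands if t[2] == obj]  # cheap verification step
    return exact, len(cands) - len(exact)


index = PrefixIndex(prefix_len=5)
for triple in [
    ("Alice", "taggedIn", "Post2"),
    ("Alice", "likes", "Post23"),
    ("Ken", "likes", "Post24"),
    ("Gunes", "favorites", "Post1"),
]:
    index.add(triple)

print(lookup(index, "Post23"))
```

The lookup for `Post23` also retrieves the `Post2` and `Post24` triples, and the filter discards them. In exchange, an update only appends to one bucket, which reflects the trade-off the slide describes.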

Page 38

Proposal #n…

• In the paper

Page 39

Conclusions

• Problems with fixed, workload-oblivious approaches

• A purely workload-driven design is compelling but not trivial, especially when it comes to adapting to dynamic workloads