Assessing the performance of RDF Engines: Discussing RDF Benchmarks Irini Fundulaki Institute of Computer Science – FORTH, Greece Anastasios Kementsietsidis Google Research, USA 6/15/16 ESWC 2016: Assessing the performance of RDF Engines - Discussing RDF Benchmarks 1

Assessing the performance of RDF Engines: Discussing RDF Benchmarks

Apr 13, 2017

Transcript
Page 1: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

Assessing the performance of RDF Engines: Discussing RDF Benchmarks

Irini Fundulaki, Institute of Computer Science – FORTH, Greece

Anastasios Kementsietsidis, Google Research, USA

6/15/16 ESWC 2016: Assessing the performance of RDF Engines - Discussing RDF Benchmarks 1

Page 2: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

Traditional Web: Web of Documents
•  single information space: global file system
•  designed for human consumption
•  documents are the primary objects, with a loose structure
•  URLs are the globally unique IDs and part of the retrieval mechanism
•  cannot ask expressive queries

[Figure: HTML documents connected by hyperlinks and accessed through Web browsers. © Hartig, Cyganiak, Bizer, Hausenblas, Heath: How to Publish Linked Data on the Web]

Page 3: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

Going from the Web of Documents to the Web of Data
•  A global database
•  Designed for machines first, humans later
•  Things are primary objects with a well-defined structure
•  Typed links between things
•  Ability to express structured queries

Don't link the documents, link the things

[Figure: Things connected by typed links. © The Web of Linked Data: Tom Heath, An Introduction to Linked Data]

Page 4: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

Linking Open Datasets (LOD)
•  Publish open data as Linked Data on the Web
•  Interlink entities between heterogeneous data sources

Page 5: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

Status of the Linked Open Data Cloud, 2007

Page 6: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

Status of the Linked Open Data Cloud, 2011

Page 7: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

Status of the Linked Open Data Cloud, 2014

[Figure: LOD cloud diagram, with datasets grouped by domain: Media, Government, Geographic, Publications, User-generated, Life sciences, Cross-domain]

•  RDF, a common data model
•  More than 31B triples in LOD
•  Links (external): 500M

Page 8: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

Linked Data in numbers (2014)
•  State of the LOD Cloud 2014, University of Mannheim

                                         Access Methods
Domain                  Datasets  %      Any          SPARQL  Dump
Government              183       18.05  61 (32.80%)  30.11%  30.65%
Publications            96        9.47   10 (10.58%)  9.62%   3.85%
Life Sciences           83        8.19   19 (21.35%)  20.22%  16.85%
User-generated content  48        4.73   3 (5.45%)    5.45%   1.82%
Cross-domain            41        4.04   4 (9.09%)    4.55%   6.82%
Media                   22        2.17   1 (2.70%)    0.00%   2.70%
Geographic              21        2.07   8 (19.51%)   12.20%  12.20%
Social Web              520       51.28  6 (1.16%)    1.16%   0.39%
Total                   1014      -      48 (5.89%)   4.54%   3.80%

Page 9: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

Proliferation of Big Data Stores

Page 10: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

Many (not a lot) RDF Stores

Page 11: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

The Question(s)
•  Which are the problems that I wish to solve?
•  Which are the relevant key performance indicators?
•  Which is the behavior of the existing engines w.r.t. the key performance indicators?

Which are the tool(s) that I should use for my data and for my use case?

Page 12: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

The Answer: Benchmark your engines!

•  A querying benchmark comprises
–  datasets (synthetic or real)
–  a set of software tools
   •  synthetic data generators
   •  query generators
–  performance metrics, and
–  a set of clear execution rules
•  Standardized application scenario(s) that serve as a basis for testing systems
•  Must include a clear set of factors to be measured and the conditions under which the systems should be measured
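The components listed above (workload, metric, execution rules) can be illustrated with a minimal harness. This is only a sketch under stated assumptions: the names `QueryTiming` and `run_benchmark` are hypothetical, the "queries" are plain Python callables standing in for SPARQL requests, and the execution rule (one unmeasured warm-up run followed by timed runs) is an assumption, not a rule taken from any benchmark discussed in this tutorial.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class QueryTiming:
    name: str            # query identifier
    runs: int            # number of measured executions
    mean_seconds: float  # performance metric: mean wall-clock time per run


def run_benchmark(queries: Dict[str, Callable[[], object]],
                  warmup: int = 1, runs: int = 3) -> List[QueryTiming]:
    """Assumed execution rule: warm up first, then time `runs` executions."""
    timings = []
    for name, query in queries.items():
        for _ in range(warmup):
            query()  # warm-up runs are not measured
        start = time.perf_counter()
        for _ in range(runs):
            query()
        elapsed = time.perf_counter() - start
        timings.append(QueryTiming(name, runs, elapsed / runs))
    return timings


# Hypothetical workload: a stand-in callable instead of a real SPARQL query.
timings = run_benchmark({"q1": lambda: sum(range(1000))}, warmup=1, runs=5)
```

In a real setting each callable would send a query to the engine under test, and the rules would also fix data loading, cache state and concurrency.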

Page 13: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

Importance of Benchmarking

•  Benchmarks exist
–  To allow adequate measurements of systems
–  To provide evaluation of engines for real (or close to real) use cases
•  Provide help
–  Designers and developers to assess the performance of their tools
–  Users to compare the different available tools and evaluate suitability for their needs
–  Researchers to compare their work to others
•  Leads to improvements:
–  Vendors can improve their technology
–  Researchers can address new challenges
–  Current benchmark design can be improved to cover new necessities and application domains

Page 14: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

Tutorial Objective & Benefits
•  Objectives:
–  Discuss a set of principles and best practices for benchmark development
–  Present an overview of the current work on benchmarks for RDF query engines
–  Focus on identifying research challenges & unexplored research directions
•  Benefits for the audience
–  Academic: obtain a solid background, discover new research directions
–  Practitioner: find out what the available benchmarks are, and their advantages and limitations

Page 15: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

Purpose of the Tutorial

•  Stimulate discussions on the following topics:
1.  How can one come up with the right benchmark that accurately captures use cases of interest?
2.  How can a benchmark capture the fact that RDF data originate from a multitude of formats
    –  Structured: relational and/or XML data to RDF
    –  Unstructured
3.  How can a benchmark capture the different data and query patterns and provide a consistent picture for system behavior across different application settings?
4.  How can one select the right benchmark for her system, data and workload?

Page 16: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

Overview
•  Introducing Benchmarks
•  A short discussion about Linked Data
–  Resource Description Framework (Data Model)
–  SPARQL (Query Language)
•  Benchmarking Principles & Choke Points
•  Benchmarks
–  Synthetic
–  Real
–  Benchmark Generators
•  Sum up: what did we learn today?

Page 17: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

A short discussion about Linked Data
–  Resource Description Framework (Data Model)
–  SPARQL (Query Language)

6/21/16 ESWC 2016: Assessing the performance of RDF Engines - Discussing RDF Benchmarks 17

Page 18: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

Resource Description Framework (RDF)
•  W3C standard to represent Web data and metadata
•  generic and simple graph-based model
•  information from heterogeneous sources merges naturally:
–  resources with the same URI denote the same non-information resource (leading to the Linked Data Cloud)
•  structure is added using schema languages and is represented as RDF triples
•  Web browsers use URIs to retrieve information

Page 19: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

Resource Description Framework (RDF)
•  An RDF triple is of the form (s, p, o) where
–  s is the subject: the URI identifying the described resource
–  o is the object: can either be a simple literal value or the URI of another resource
–  p is the predicate: the URI indicating the relation between subject and object
•  An RDF graph is a set of triples
–  Can be viewed as a node and edge-labeled directed graph
–  It is published in different formats
   •  RDF/XML, Turtle, N3 triples, …

Example: (dbpedia:Good_Day_Sunshine, dbpedia-owl:artist, dbpedia:The_Beatles)

Close to how people see the world (as a graph)!
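As a small illustration of the definitions above, a triple can be held as a 3-tuple and a graph as a set of such tuples. This is a sketch only: the `Triple` type and the `edges_from` helper are hypothetical names, and the second triple (a label) is invented sample data rather than something taken from the slides.

```python
from typing import List, NamedTuple, Set, Tuple


class Triple(NamedTuple):
    s: str  # subject: URI of the described resource
    p: str  # predicate: URI of the relation
    o: str  # object: URI of another resource or a literal


# An RDF graph is a *set* of triples, so duplicates collapse automatically.
graph: Set[Triple] = {
    Triple("dbpedia:Good_Day_Sunshine", "dbpedia-owl:artist", "dbpedia:The_Beatles"),
    Triple("dbpedia:Good_Day_Sunshine", "rdfs:label", '"Good Day Sunshine"'),  # invented sample triple
}


def edges_from(g: Set[Triple], node: str) -> List[Tuple[str, str]]:
    """Outgoing labeled edges of `node` in the directed, edge-labeled graph view."""
    return sorted((t.p, t.o) for t in g if t.s == node)


# Adding the same triple again leaves the set unchanged.
graph.add(Triple("dbpedia:Good_Day_Sunshine", "dbpedia-owl:artist", "dbpedia:The_Beatles"))
```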

Page 20: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

Adding Semantics to RDF
•  RDF is a generic, abstract data model for describing resources in the form of triples
•  RDF does not provide ways of defining classes, properties, constraints
•  W3C Standard Schema Languages
–  RDF Vocabulary Description Language (RDF Schema - RDFS) to define schema vocabularies
–  Web Ontology Language (OWL) to define ontologies

Page 21: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

Adding Semantics to RDF

•  RDF Vocabularies are sets of terms used to describe notions in a domain of interest
•  An RDF term is either a Class or a Property
–  Object properties denote relationships between objects
–  Data type properties denote attributes of resources
•  RDFS is designed to introduce useful semantics to RDF triples
•  RDFS schemas are represented as RDF triples

"An RDF Vocabulary is a schema comprising of classes, properties and relationships which can be used for describing data and metadata"

Page 22: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

RDF Vocabulary Description Language (RDFS)
•  Typing: defining classes, properties, instances
•  Relationships between classes and properties: subsumption
•  Constraints: domain and range of properties
•  Inference rules to entail new, inferred knowledge

    Subject                Predicate        Object
t1  dbo:MusicalWork        rdfs:subClassOf  dbo:Album
t2  dbo:MusicalWork        rdfs:domain      dbo:artist
t3  dbo:MusicalWork        rdfs:range       dbo:march
t4  dbr:Seven_Seas_Of_Rye  rdf:type         dbo:MusicalWork
t5  dbo:Album              rdf:type         rdfs:Class

Page 23: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

RDFS Inference

•  Used to entail new information from the one that is explicitly stated in the dataset
–  Transitive closure across class and property hierarchies
–  Transitive closure along the type and class/property relations
•  Two ways to implement it: Forward & Backward Reasoning
–  Forward Reasoning: closure is computed at loading time
–  Backward Reasoning: closure is computed on the fly when needed

Inference rules (premises ⇒ conclusion):
R1: (P1, rdfs:subPropertyOf, P2), (P2, rdfs:subPropertyOf, P3) ⇒ (P1, rdfs:subPropertyOf, P3)
R2: (C1, rdfs:subClassOf, C2), (C2, rdfs:subClassOf, C3) ⇒ (C1, rdfs:subClassOf, C3)
R2: (C1, rdfs:subClassOf, C2), (r1, rdf:type, C1) ⇒ (r1, rdf:type, C2)
R3: (P1, rdfs:subPropertyOf, P2), (r1, P1, r2) ⇒ (r1, P2, r2)
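Forward reasoning over the four rules on this slide can be sketched as a naive fixpoint computation: apply every rule to the current set of triples and repeat until nothing new is derived. The function name `rdfs_closure` and the sample data are assumptions for illustration; real engines use far more efficient strategies (and backward reasoning would instead derive these triples at query time).

```python
SUBCLASS = "rdfs:subClassOf"
SUBPROPERTY = "rdfs:subPropertyOf"
TYPE = "rdf:type"


def rdfs_closure(triples):
    """Naive forward chaining: repeat the rules until a fixpoint is reached."""
    closure = set(triples)
    while True:
        derived = set()
        for (s, p, o) in closure:
            for (s2, p2, o2) in closure:
                if p == SUBPROPERTY:
                    if p2 == SUBPROPERTY and s2 == o:
                        derived.add((s, SUBPROPERTY, o2))  # sub-property transitivity
                    if p2 == s:
                        derived.add((s2, o, o2))           # (r1, P1, r2) gives (r1, P2, r2)
                elif p == SUBCLASS:
                    if p2 == SUBCLASS and s2 == o:
                        derived.add((s, SUBCLASS, o2))     # sub-class transitivity
                    if p2 == TYPE and o2 == s:
                        derived.add((s2, TYPE, o))         # type propagation
        if derived <= closure:
            return closure
        closure |= derived


# Invented sample data (not the table from the slides).
data = {
    ("A", SUBCLASS, "B"), ("B", SUBCLASS, "C"), ("x", TYPE, "A"),
    ("p1", SUBPROPERTY, "p2"), ("x", "p1", "y"),
}
inferred = rdfs_closure(data)
```

The loop is quadratic per iteration, which is acceptable for a handful of triples but shows why the cost of computing the closure at loading time matters for large datasets.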

Page 24: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

RDFS Inference

•  Transitive closure along the type and class/property relations

R2: (C1, rdfs:subClassOf, C2), (r1, rdf:type, C1) ⇒ (r1, rdf:type, C2)

    Subject                Predicate        Object
t1  dbo:MusicalWork        rdfs:subClassOf  dbo:Album
t2  dbo:MusicalWork        rdfs:domain      dbo:artist
t3  dbo:MusicalWork        rdfs:range       dbo:march
t4  dbr:Seven_Seas_Of_Rye  rdf:type         dbo:MusicalWork
t5  dbo:Album              rdf:type         rdfs:Class
t6  dbo:MusicalWork        rdf:type         rdfs:Class

Page 25: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

SPARQL: Querying RDF Data
•  SPARQL: W3C Standard Language for Querying Linked Data
•  SPARQL 1.0 (2008) only allows accessing the data (query)
•  SPARQL 1.1 (2013) introduces:
–  Query Extensions: aggregates, sub-queries, negation, expressions in the SELECT clause, property paths, assignment, short form for CONSTRUCT, expanded set of functions and operators
–  Updates:
   •  Data management: Insert, Delete, Delete/Insert
   •  Graph management: Create, Load, Clear, Drop, Copy, Move, Add
–  Federation extension: Service, values, service variables (informative)

Page 26: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

SPARQL Queries (1)
•  Building Block is the Triple Pattern
–  RDF triple with variables
•  Group Graph Patterns
–  Built through inductive construction, combining smaller patterns into more complex ones using SPARQL operators
   •  Join: similar to relational join
   •  Union (UNION): similar to relational union
   •  Optional (OPTIONAL) operators on triple patterns: similar to relational left outer join (introduces negation in the language)
•  Filtering conditions (FILTER)
•  Patterns on Named Graphs
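The analogy between the SPARQL operators and relational join, union and left outer join can be made concrete over solution mappings (dictionaries from variables to values). This is a simplified sketch: the function names are hypothetical, the sample data is invented, and the full SPARQL semantics of OPTIONAL combined with FILTER is not covered.

```python
def compatible(m1, m2):
    """Two mappings are compatible if they agree on every shared variable."""
    return all(m2[v] == val for v, val in m1.items() if v in m2)


def join(left, right):
    """Join: merge every compatible pair of mappings."""
    return [{**a, **b} for a in left for b in right if compatible(a, b)]


def union(left, right):
    """UNION: bag union of both solution sequences."""
    return left + right


def left_join(left, right):
    """OPTIONAL: keep each left mapping even if nothing on the right matches."""
    out = []
    for a in left:
        matches = [{**a, **b} for b in right if compatible(a, b)]
        out.extend(matches if matches else [a])
    return out


# Invented sample solutions for ?x foaf:name ?name and ?x foaf:mbox ?mbox.
names = [{"x": "alice", "name": "Alice"}, {"x": "bob", "name": "Bob"}]
mboxes = [{"x": "alice", "mbox": "mailto:alice@example.org"}]

joined = join(names, mboxes)        # only Alice has both
optional = left_join(names, mboxes) # Bob is kept, with ?mbox unbound
```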

Page 27: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

SPARQL Queries (2)
•  Aggregates
–  specify expressions over groups of solutions
–  As in standard settings, used when the result is computed over a group of solutions rather than a single solution
   •  Example: average value of a set of values, sum of a set
–  Aggregates defined in SPARQL 1.1 are COUNT, SUM, MIN, MAX, AVG, GROUP_CONCAT, and SAMPLE.
–  Solutions are grouped using the GROUP BY clause
–  Pruning at group level is performed with the HAVING clause
•  Additional Features
–  duplicate elimination (DISTINCT)
–  ordering results (ORDER BY) with an optional LIMIT clause
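Grouping, aggregation and group-level pruning, as described on this slide, can be sketched over a list of solution mappings. The helper names `group_by` and `aggregate` and the sample rows are hypothetical; the sketch mirrors the roles of GROUP BY, an aggregate such as SUM or AVG, and HAVING, without claiming to reproduce SPARQL's exact error and unbound-value handling.

```python
from collections import defaultdict


def group_by(solutions, key):
    """GROUP BY: partition the solutions by the value bound to `key`."""
    groups = defaultdict(list)
    for row in solutions:
        groups[row[key]].append(row)
    return groups


def aggregate(solutions, key, var, func, having=None):
    """Apply `func` to the values of `var` per group; `having` prunes groups."""
    result = {}
    for group_key, rows in group_by(solutions, key).items():
        value = func([row[var] for row in rows])
        if having is None or having(value):
            result[group_key] = value
    return result


# Invented sample solutions: track durations (seconds) per artist.
rows = [
    {"artist": "A", "duration": 120},
    {"artist": "A", "duration": 180},
    {"artist": "B", "duration": 200},
]

totals = aggregate(rows, "artist", "duration", sum)
averages = aggregate(rows, "artist", "duration", lambda vs: sum(vs) / len(vs))
long_only = aggregate(rows, "artist", "duration", sum, having=lambda v: v > 250)
```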

Page 28: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

SPARQL Semantics
•  SPARQL semantics is based on Pattern Matching
–  Queries describe subgraphs of the queried graph
–  SPARQL graph patterns describe the subgraphs to match

Intuitively, a triple pattern denotes the triples in an RDF graph that are of a specific form

TP1 = (?album, dbpedia-owl:artist, dbpedia:The_Beatles)
matches all albums of The Beatles

TP2 = (dbpedia:The_Beatles, ?property, ?object)
matches all information about The Beatles
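Matching a triple pattern such as TP1 or TP2 against a graph can be sketched as a scan that binds variables position by position. The `match_pattern` function and the three sample triples are assumptions made for illustration; an actual engine would use indexes rather than scanning every triple.

```python
def is_var(term):
    """Variables are written with a leading question mark, as in SPARQL."""
    return term.startswith("?")


def match_pattern(pattern, graph):
    """Return one variable binding per triple that has the form of `pattern`."""
    solutions = []
    for triple in graph:
        binding = {}
        for pattern_term, term in zip(pattern, triple):
            if is_var(pattern_term):
                # A repeated variable must bind to the same term everywhere.
                if binding.get(pattern_term, term) != term:
                    break
                binding[pattern_term] = term
            elif pattern_term != term:
                break
        else:
            solutions.append(binding)
    return solutions


# Invented sample graph.
graph = [
    ("dbpedia:Revolver", "dbpedia-owl:artist", "dbpedia:The_Beatles"),
    ("dbpedia:Abbey_Road", "dbpedia-owl:artist", "dbpedia:The_Beatles"),
    ("dbpedia:The_Beatles", "rdf:type", "dbpedia-owl:Band"),
]

tp1 = ("?album", "dbpedia-owl:artist", "dbpedia:The_Beatles")
tp2 = ("dbpedia:The_Beatles", "?property", "?object")
albums = match_pattern(tp1, graph)
about_beatles = match_pattern(tp2, graph)
```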

Page 29: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

SPARQL Types of Queries
•  SELECT returns an ordered multi-set of variable bindings
–  Bindings: mappings of variables to RDF terms in the dataset
–  SQL-like syntax: SELECT ?v1 ?v2 … WHERE GraphPattern
•  ASK checks whether a graph pattern has at least one solution and returns a Boolean value (true/false)
•  CONSTRUCT returns a new RDF graph as specified by the graph template of the CONSTRUCT clause, using the computed bindings from the query's WHERE clause
•  DESCRIBE returns the RDF graph containing the RDF data about the requested resource

Page 30: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

Querying RDF Data with SPARQL (1)

Simple SELECT query:
    PREFIX dc: <http://purl.org/dc/elements/1.1/>
    SELECT ?title
    WHERE { <http://example.org/book/book1> dc:title ?title }

JOIN query:
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    SELECT ?name ?mbox
    WHERE { ?x foaf:name ?name . ?x foaf:mbox ?mbox . }

OPTIONAL operator:
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    SELECT ?name ?mbox
    WHERE { ?x foaf:name ?name . OPTIONAL { ?x foaf:mbox ?mbox } }

REGEX in FILTER:
    PREFIX dc: <http://purl.org/dc/elements/1.1/>
    SELECT ?title
    WHERE { ?x dc:title ?title . FILTER regex(?title, "^SPARQL") }

Page 31: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

Querying RDF Data with SPARQL (2)

CONSTRUCT query:
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    PREFIX org: <http://example.com/ns#>
    CONSTRUCT { ?x foaf:name ?name }
    WHERE { ?x org:employeeName ?name }

ASK query:
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    ASK { ?x foaf:name "Alice" }

"Find the people who live in "Palo Alto" and have founded or are board members of companies in the software industry. For each such company, find the products that were developed by it, its revenue, and optionally its number of employees."
    SELECT *
    WHERE {
      ?x home "Palo Alto" .
      { ?x founder ?y } UNION { ?x member ?y }
      {
        ?y industry "Software" .
        ?z developer ?y .
        ?y revenue ?n .
        OPTIONAL { ?y employees ?m } .
      }
    }

SPARQL 1.1: SPARQL plus Aggregates, Sub-queries, Property paths, Negation and more!

Page 32: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

Storing and Querying RDF data
•  Schema agnostic
–  triples are stored in a large triple table where the attributes are (subject, predicate and object): "monolithic" triple-stores
–  But it can get a bit more efficient

Triple table:
    Subject                Predicate   Object
t1  dbr:Seven_Seas_Of_Rye  rdf:type    dbo:MusicalWork
t2  dbr:Starman_(song)     rdf:type    dbo:MusicalWork
t3  dbr:Seven_Seas_Of_Rye  dbo:artist  dbo:Queen

Dictionary:
id  URI/Literal
1   dbr:Seven_Seas_Of_Rye
2   dbr:Starman_(song)
3   dbo:MusicalWork
4   dbo:Queen
5   dbo:artist
6   rdf:type

Encoded triple table:
Subject  Predicate  Object
1        6          3
2        6          3
1        5          4

RDF-3X maintains 6 indexes, namely SPO, SOP, OSP, OPS, PSO, POS. To avoid storage overhead, indexes are compressed! [NW09]
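The dictionary encoding and the six permutation indexes mentioned on this slide can be sketched as follows. This is an illustration only, not RDF-3X's implementation: the function names are hypothetical, identifiers are assigned in first-seen order (so they differ from the ids in the slide's dictionary table), each "index" is just a sorted list, and no compression is applied.

```python
from itertools import permutations


def encode(triples):
    """Replace every URI/literal by an integer id, assigned in first-seen order."""
    ids = {}
    encoded = []
    for triple in triples:
        encoded.append(tuple(ids.setdefault(term, len(ids) + 1) for term in triple))
    return ids, encoded


def build_indexes(encoded):
    """One sorted copy of the triple table per ordering of (S, P, O)."""
    indexes = {}
    for order in permutations(range(3)):
        name = "".join("SPO"[i] for i in order)
        indexes[name] = sorted(tuple(t[i] for i in order) for t in encoded)
    return indexes


triples = [
    ("dbr:Seven_Seas_Of_Rye", "rdf:type", "dbo:MusicalWork"),
    ("dbr:Starman_(song)", "rdf:type", "dbo:MusicalWork"),
    ("dbr:Seven_Seas_Of_Rye", "dbo:artist", "dbo:Queen"),
]
ids, encoded = encode(triples)
indexes = build_indexes(encoded)
```

With the POS ordering, for example, all subjects of a given predicate and object sit next to each other, so a pattern with a bound predicate and object becomes a range scan.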

Page 33: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

Storing and Querying RDF data
•  Schema aware:
–  one table is created per property, with subject and object attributes (Property Tables [Wilkinson06])

Triple table:
Subject  Predicate  Object
ID1      type       BookType
ID1      title      "XYZ"
ID1      author     "Fox, Joe"
ID1      copyright  "2001"
ID2      type       CDType
ID2      title      "ABC"
ID2      artist     "Orr, Tim"
ID2      copyright  "1985"
ID2      language   "French"
ID3      type       BookType
ID3      title      "MNO"
ID3      language   "English"
ID4      type       DVDType
ID4      title      "DEF"
ID5      type       CDType
ID5      title      "GHI"
ID5      copyright  "1995"
ID6      type       BookType
ID6      copyright  "2004"

Clustered Property Table:
Subject  Type      Title  copyright
ID1      BookType  "XYZ"  "2001"
ID2      CDType    "ABC"  "1985"
ID3      BookType  "MNO"  NULL
ID4      DVDType   "DEF"  NULL
ID5      CDType    "GHI"  "1995"
ID6      BookType  NULL   "2004"

Remaining triples:
Subject  Predicate  Object
ID1      author     "Fox, Joe"
ID2      artist     "Orr, Tim"
ID2      language   "French"
ID3      language   "English"

Property-class Table, BookType:
Subject  Title  Author      copyright
ID1      "XYZ"  "Fox, Joe"  "2001"
ID3      "MNO"  NULL        NULL
ID6      NULL   NULL        "2004"

Property-class Table, CDType:
Subject  Title  artist      copyright
ID2      "ABC"  "Orr, Tim"  "1985"
ID5      "GHI"  NULL        "1995"

Remaining triples:
Subject  Predicate  Object
ID2      language   "French"
ID3      language   "English"
ID4      type       DVDType
ID4      title      "DEF"

Multi-Value P
Subject  Object
…        …
…        …

Page 34: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

Storing and Querying RDF data
•  Vertically partitioned RDF [AMM+07]

Triple table:
Subject  Predicate  Object
ID1      type       BookType
ID1      title      "XYZ"
ID1      author     "Fox, Joe"
ID1      copyright  "2001"
ID2      type       CDType
ID2      title      "ABC"
ID2      artist     "Orr, Tim"
ID2      copyright  "1985"
ID2      language   "French"
ID3      type       BookType
ID3      title      "MNO"
ID3      language   "English"
ID4      type       DVDType
ID4      title      "DEF"
ID5      type       CDType
ID5      title      "GHI"
ID5      copyright  "1995"
ID6      type       BookType
ID6      copyright  "2004"

type:
Subject  Object
ID1      BookType
ID2      CDType
ID3      BookType
ID4      DVDType
ID5      CDType
ID6      BookType

title:
Subject  Object
ID1      "XYZ"
ID2      "ABC"
ID3      "MNO"
ID4      "DEF"
ID5      "GHI"

copyright:
Subject  Object
ID1      "2001"
ID2      "1985"
ID5      "1995"
ID6      "2004"

artist:
Subject  Object
ID2      "Orr, Tim"

author:
Subject  Object
ID1      "Fox, Joe"

language:
Subject  Object
ID2      "French"
ID3      "English"

To get the most out of this particular decomposition, a column-oriented DBMS is recommended.
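The vertical partitioning shown above amounts to splitting the triple table into one two-column (subject, object) table per predicate. The sketch below uses hypothetical function names and only a few of the slide's rows; it also shows why an unbound predicate is costly in this layout, since every per-predicate table has to be consulted.

```python
from collections import defaultdict


def vertically_partition(triples):
    """Build one (subject, object) table per predicate."""
    tables = defaultdict(list)
    for s, p, o in triples:
        tables[p].append((s, o))
    return dict(tables)


def lookup(tables, subject, predicate=None):
    """Bound predicate: scan one table. Unbound predicate: scan all of them."""
    predicates = [predicate] if predicate is not None else list(tables)
    return [(p, o) for p in predicates
            for (s, o) in tables.get(p, []) if s == subject]


triples = [
    ("ID1", "type", "BookType"),
    ("ID1", "title", "XYZ"),
    ("ID2", "type", "CDType"),
    ("ID2", "title", "ABC"),
    ("ID2", "artist", "Orr, Tim"),
]
tables = vertically_partition(triples)
artist_of_id2 = lookup(tables, "ID2", "artist")  # touches only the artist table
everything_about_id2 = lookup(tables, "ID2")     # touches every table
```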

Page 35: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

Comparison of Storage Techniques [BDK+13]

[Figure: sample graph with nodes Larry Page, "1973", Google, Internet, Software, Hardware, "MTV", 50,000 and edge labels born, founder, HQ, employees, industry, developer]

Triple store:
subject     predicate  object
Larry Page  born       "1973"
Larry Page  founder    Google
Google      HQ         "MTV"
Google      employees  50,000
Google      industry   Internet
Google      industry   Software
Google      industry   Hardware
Google      developer  Android

Type-oriented store:
person      born    founder
Larry Page  "1973"  Google

company  HQ     employees
Google   "MTV"  50,000

subject  predicate  object
Google   industry   Internet
Google   industry   Software
Google   industry   Hardware

company  released
Google   Android
Apple    iPhone

Predicate-oriented store:
born:       subject Larry Page, object "1973"
founder:    subject Larry Page, object Google
HQ:         subject Google, object "MTV"
employees:  subject Google, object 50,000
industry:   subject Google, objects Internet, Software, Hardware
developer:  subject Google, object Android

Annotations in the figure:
•  Columns are overloaded
•  Traditional relational column treatment
•  Static mix of overloaded and normal columns
•  Schema does not change on updates
•  Schema might change on updates

Page 36: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

Storing Linked Data: Query Processing
•  Schema Agnostic
–  the algebraic plan obtained for a query involves a large number of self-joins
–  this layout is favorable for queries in which the predicate is a variable
•  Hybrid Approach and Schema-aware
–  the algebraic plan contains operations over the appropriate property/class tables (more in the spirit of existing relational schemas)
–  saves many self-joins over triple tables
–  if the predicate is a variable, then one query per property/class must be expressed

Page 37: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

Purpose of an RDF Querying Benchmark

•  Test the performance of RDF stores
–  Independently of the underlying storage engine
–  Independently of the underlying logical and physical schema
–  Independently of the query actually executed in the engine
   •  SPARQL for native stores
   •  SQL (SPARQL translated to SQL) for relational stores

Page 38: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

Overview
•  Introducing Benchmarks
•  A short discussion about Linked Data
–  Resource Description Framework (Data Model)
–  SPARQL (Query Language)
•  Benchmarking Principles & Choke Points
•  Benchmarks
–  Synthetic
–  Real
–  Benchmark Generators
•  Sum up: what did we learn today?

Page 39: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

Benchmarking Principles & Choke Points

Page 40: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

Why Benchmarks?
•  Performance Evaluation
–  There is no single recipe on how to do it right
–  There are many ways to do it wrong
–  There are a number of best practices, but no broadly accepted standard on how to design and develop a benchmark
•  Questions asked:
–  What data/datasets should we use?
–  Which workload/queries should we consider?
–  What to measure and how to measure?

Page 41: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

Benchmark Categories

•  Micro-benchmarks
•  Standard benchmarks
•  Real-life applications

Page 42: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

Micro Benchmarks

•  Specialized, stand-alone piece of software
•  Isolate one particular functionality of a larger system
•  In databases, a micro-benchmark tests a single database operator
–  Selection, Join (and all types thereof), Projection, Aggregates, Sub-Queries, …

Page 43: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

Micro Benchmarks: Advantages
•  Very focused
–  Test a specific operator of the system
•  Controllable data & workload
–  Synthetic and real datasets
   •  Different value ranges, value distributions and correlations (mostly applicable to structured data)
–  Various data sizes to tackle scalability concerns
•  Queries
–  Workloads of different complexity & size
   •  Complexity: as to the types of query operators and patterns
   •  Size: as to the number of query operators involved
–  Allow broad parameter range(s)

In summary:
•  Useful for detailed, in-depth analysis
•  Low setup threshold
•  Easy to run

Page 44: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

Micro Benchmarks: Disadvantages
•  Neglect the larger picture, since they do not test the whole system
•  Do not consider the flow of costs of specific operations to the cost of the system
•  Do not measure the impact of the micro-benchmark on real-life applications
•  Difficult to generalize the results
•  The results of micro-benchmarks cannot be applied in a straightforward manner
•  Micro-benchmarks do not use standardized metrics

Page 45: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

Standard Benchmarks

•  Relational, Object Oriented, Object Relational Database Management Systems
–  Family of TPC Benchmarks for relational databases
•  XML, XPath, XQuery
–  Mbench, XBench, XMach-1, XMark
•  General Computing
–  SPEC

Page 46: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

Standard Benchmarks: Advantages & Disadvantages
•  Advantages
–  Mimic real-life scenarios (respond to real needs)
   •  E.g., TPC is a business-oriented benchmark
–  Publicly available
–  Well defined
–  Provide scalable datasets and workloads
–  Metrics are well defined
•  Disadvantages
–  Outdated (standardization is a lengthy process)
   •  XQuery took around 7 years to become a standard
   •  TPC benchmark definition is still an ongoing process
–  Very large and complicated to run
–  Limited dataset variation (target a specific type of data)
–  Limited workload (focuses on the application in mind)
–  Systems are often optimized for the benchmark(s)

Page 47: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

Benchmark Development Methodology

•  Management and methodological activities performed by a group of people
–  Management: organizational protocols to control the process
–  Methodological: principles, methods and steps for benchmark creation
•  Benchmark Development
–  Roles and bodies: people/groups involved in the development
–  Design principles: fundamental rules that direct the development of a benchmark
–  Development process: series of steps to develop a benchmark based on Choke Points

Choke Points: the set of technical difficulties that force systems to improve their performance

Page 48: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

The Example Standard Benchmark: TPC
•  Transaction Processing Performance Council (TPC)
–  non-profit corporation focused on developing data-centric benchmark standards and disseminating objective, verifiable performance data to the industry
–  goal is to «create, manage and maintain a set of fair and comprehensive benchmarks that enable end-users and vendors to objectively evaluate system performance under well defined consistent and comparable workloads» [NPM+12]

Active TPC Benchmarks (2016):
Benchmark  Explanation
TPC-C      Focuses on transactions
TPC-DI     Focuses on ETL processes
TPC-DS     Decision support solutions for, but not limited to, Big Data
TPC-E      On-Line Transaction Processing (OLTP) workload
TPC-H      Decision support benchmark, ad hoc queries and concurrent data modifications
TPC-VMS    Virtual Measurement Single System Specification for running and reporting performance metrics for virtualized databases
TPCx-HS    Measure of hardware, operating system and commercial Apache Hadoop File System API
TPCx-V     Measures the performance of servers running database workloads in virtual machines

Page 49: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

Benchmark Development Process (1)
•  Design Principles [L97]

Principle       Comment
Relevant        The benchmark is meaningful for the target domain
Understandable  The benchmark is easy to understand and use
Good Metrics    The metrics defined by the benchmark are linear, orthogonal and monotonic
Scalable        The benchmark is applicable to a broad spectrum of hardware and software configurations
Coverage        The benchmark workload does not oversimplify the typical environment
Acceptance      The benchmark is recognized as relevant by the majority of vendors and users

Page 50: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

Benchmark Development Process (2)
•  Benchmarking Metrics
–  Performance
–  Price/Performance
–  Energy/Performance Metrics: energy metric to measure the energy consumption of system components
•  TPC Pricing specification
–  Provides consistent methodologies for computing the price of the benchmarked system, licensing of software, maintenance, …

Benchmark  Metrics
TPC-C      Transaction Rate (tpmC), Price per Transaction ($/tpmC)
TPC-E      Transactions per Second (tpsE)
TPC-H      Composite Query per Hour Performance Metric (QpH@Size), Price per Composite Query per Hour Performance Metric ($/QpH@Size)
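A price/performance metric such as $/tpmC is, at its core, the total price of the benchmarked system divided by the measured performance. The sketch below is deliberately simplified: the function name is hypothetical, the figures are invented, and what counts toward the total price is governed by the TPC Pricing specification, which is not modelled here.

```python
def price_performance(total_system_price: float, performance: float) -> float:
    """Price per unit of performance, e.g. dollars per tpmC or per QpH@Size."""
    if performance <= 0:
        raise ValueError("performance must be positive")
    return total_system_price / performance


# Invented figures: a $500,000 system that sustains 100,000 tpmC.
dollars_per_tpmc = price_performance(500_000.0, 100_000.0)
```

A lower value means the same throughput is obtained at a lower cost, which is why the metric is reported alongside raw performance.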

Page 51: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

Desirable Attributes of a Benchmark:

•  "A good benchmark is written in a high-level language, making it portable across different machines; is representative of some programming style or application; can be measured easily; has wide distribution" [W90]
•  "a domain specific benchmark must meet four important criteria: relevance, portability, simplicity, scalability" [G93]
•  Six desirable attributes for TPC-C [L97]: relevance, understandability, good metrics, scalability, coverage, acceptance
•  Five desirable attributes in Huppler [H09]: relevance, repeatability, fairness, verifiability, economy
•  Big Data Benchmarking [1]: "a successful benchmark should be simple to implement and execute, cost effective, timely and verifiable".

Page 52: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

Desirable Attributes of a Benchmark:

Page 53: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

Design Principles: Desirable Attributes of a Benchmark

•  Relevant/Representative: based on realistic use case scenarios and must reflect the needs of the use case
•  Understandable/Simple: the results and workload are easily understandable by users
•  Portable/Fair/Repeatable: no system benefits from the benchmark. Must be deterministic and provide a «gold standard»
•  Metrics: should be well defined to be able to assess and compare the systems.
•  Scalable: datasets should be in the order of billions of «objects»
•  Verifiable: allow verifiable results in each execution

[Figure: Benchmark Attributes: relevant, representative, understandable, simple, portable, fair, repeatable, metrics, scalable, verifiable]

Page 54: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

Design of Benchmark Workload [Grey93]

•  Design the queries to test specific features of the query language or to test specific data management approaches (micro-benchmarks)
•  Base the query mix on specific requirements of real-world use cases (domain-specific and standard benchmarks)
–  Leads to complex queries that involve many (different) language features

Page 55: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

Development Process: Choke Points
•  A benchmark exposes a system to a workload and should identify the technical difficulties of the system under test
•  Choke Points [BNE14] are those technological challenges whose resolution will significantly improve the performance of a product
•  TPC-H: a 20-year-old benchmark (superseded by TPC-DS) but still influential, using business-oriented queries and concurrent modifications
•  22 queries capturing (most of) the aspects of relational query processing
•  [BNE14] performed an analysis of the TPC-H workload and identified 28 choke points grouped into 6 categories

Page 56: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

Choke Points à la TPC-H

•  CP1: Aggregation Performance
   –  Ordered aggregation, small group-by keys, interesting orders, dependent group-by keys
•  CP2: Join Performance
   –  Large joins, sparse foreign keys, rich join order optimization, late projection
•  CP3: Data Access Locality (materialized views)
   –  Columnar locality, physical locality by key, detecting correlation
•  CP4: Expression Calculation
   –  Raw expression arithmetic, complex Boolean expressions in joins and selections, string matching performance
•  CP5: Correlated Sub-queries
   –  Flattening sub-queries, moving predicates to a sub-query, overlap between outer- and sub-query
•  CP6: Parallelism and Concurrency
   –  Query plan parallelization, workload management, result re-use

Page 57: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

Choke Points à la RDF

Choke Point / Description

CP1: JOIN ORDERING
1.  Tests whether the engine can evaluate the trade-off between the time spent finding the best execution plan and the quality of the output plan
2.  Tests the ability of the engine to consider cardinality constraints expressed by the different kinds of schema constraints (e.g., functional and inverse functional properties)

CP2: AGGREGATION
Aggregations are implemented with the use of sub-selects in the SPARQL query; the optimizer should recognize the operations included in the sub-selects and evaluate them first.

CP3: OPTIONAL & NESTED OPTIONAL CLAUSES
Tests the ability of the optimizer to produce a plan where the optional triple patterns are executed last, since optional clauses do not reduce the size of intermediate results.

Page 58: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

Choke Points in RDF Benchmarks

Choke Point / Description

CP4: REASONING
Tests the ability of the engine to efficiently handle RDFS and OWL constructs expressed in the schema

CP5: PARALLEL EXECUTION OF UNIONS
Tests the ability of the optimizer to produce plans where unions are executed in parallel

CP6: FILTERS
Tests the ability of the engine to execute filter expressions as early as possible, to eliminate a possibly large number of intermediate results

CP7: ORDERING
Tests the ability of the engine to choose query plan(s) that facilitate the ordering of results

CP8: GEO-SPATIAL PREDICATES
Tests the ability of the system to handle queries for geospatial data

Page 59: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

Choke Points in RDF Benchmarks

Choke Point / Description

CP9: FULL TEXT
Queries that involve the evaluation of regular expressions on data value properties of resources

CP10: DUPLICATE ELIMINATION
Tests the ability of the system to identify duplicate entries and eliminate them during the creation of intermediate results

CP11: COMPLEX FILTER CONDITIONS
Tests the ability of the engine to deal with negation, conjunction and disjunction efficiently (i.e., breaking the filters into a conjunction of filters and executing them in parallel).

Page 60: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

Query Characteristics

Characteristics: simple filters, complex filters, >=9 TPs (triple patterns), unbound predicates, negation, OPTIONAL, LIMIT, ORDER BY, DISTINCT, REGEX, UNION, DESCRIBE, CONSTRUCT, ASK

Page 61: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

Overview

•  Introducing Benchmarks
•  A short discussion about Linked Data
   –  Resource Description Framework (Data Model)
   –  SPARQL (Query Language)
•  Benchmarking Principles & Choke Points
•  Benchmarks
   –  Synthetic
   –  Real
   –  Benchmark Generators
•  Sum up: what did we learn today?

Page 62: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

A Survey of RDF Benchmarks: Synthetic Benchmarks, Real Benchmarks, Benchmark Generators

Page 63: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

Benchmark Components

•  Datasets
   –  The raw material of the benchmark against which the workload will be evaluated
   –  Synthetic & real datasets
      •  Synthetic: produced with a data generator (that hopefully produces data with interesting characteristics)
      •  Real: widely used datasets from a domain of interest
•  Query Workload
   –  Sets of queries and/or updates to evaluate the system with
•  Metrics
   –  The performance metric(s) that determine the system's behavior

Page 64: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

Synthetic RDF Benchmarks

Page 65: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

Lehigh University Benchmark (LUBM) [GPH05]

•  Benchmark intended to facilitate the evaluation of Semantic Web repositories
•  Widely adopted by the data engineering and Semantic Web communities
•  Focuses on evaluating the performance of query optimizers and not ontology reasoning as in DL systems
•  Components:
   –  Scalable synthetic data generator
   –  Ontology of moderate size and complexity
   –  Supports extensional queries (i.e., queries that request instances and not only schema information)
   –  Proposes performance metrics

Page 66: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

LUBM Univ-Bench Ontology

•  Describes universities and departments and related activities
•  Expressed in OWL Lite (took into consideration the limitations of reasoning systems regarding completeness)

Statistics:
–  43 classes
–  32 object type properties
–  7 data type properties
–  OWL Lite: inverseOf, TransitiveProperty, someValuesFrom, intersectionOf

Page 67: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

LUBM Data Generation (1)

•  Synthetically produced extensional data that conform to the LUBM ontology
•  Data are generated using the UBA (Univ-Bench Artificial Data Generator)
•  Random and repeatable data generation
•  Minimum unit of data generation: a university that has departments, employees, courses
•  Instances of classes and properties are randomly produced
•  To make the data more realistic, restrictions are applied:
   –  «Minimum 15 and maximum 25 departments per university»
   –  «Undergraduate student/faculty ratio between 8 and 14 inclusive»

Page 68: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

LUBM Data Generation (2)

•  Assignment of identifiers is done using zero-based indexes
   –  University0, Department0, …
•  Data generated by the tool are repeatable for the universities
   –  The user enters a seed for the random number generator employed in the data generation process
•  Data created are represented in OWL Lite
•  Configurable serialization and representation model (RDF/XML in .owl files, DAML)
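The repeatable generation described for LUBM can be illustrated with a minimal sketch. This is not the UBA generator: the function name `generate_university`, the dictionary layout and the way the seed is combined with the university index are assumptions made for illustration. Only the zero-based identifiers, the user-supplied seed and the restriction of 15 to 25 departments per university come from the slides.

```python
import random


def generate_university(index: int, seed: int) -> dict:
    """Generate a toy university record in a repeatable way (hypothetical layout)."""
    # Assumption: derive one generator per university from the user-supplied
    # seed, so the same (seed, index) pair always yields the same data.
    rng = random.Random(seed * 1_000_003 + index)
    # Restriction from the slides: between 15 and 25 departments per university.
    num_departments = rng.randint(15, 25)
    return {
        "id": f"University{index}",  # zero-based identifiers: University0, ...
        "departments": [f"Department{d}" for d in range(num_departments)],
    }


first = generate_university(0, seed=42)
again = generate_university(0, seed=42)
print(first["id"], len(first["departments"]), first == again)
```

Seeding a private `random.Random` instance, instead of the module-level generator, keeps the output independent of any other random calls in the program, which is what makes a benchmark run repeatable.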

Page 69: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

LUBM Queries (1)

•  14 realistic queries
•  Written in SPARQL 1.0
•  Query design criteria
   –  Input size:
      •  proportion of the class instances involved and entailed in the query to the total instances in the dataset
   –  Selectivity:
      •  estimated proportion of the class instances that satisfy the query criteria
      •  depends on the input dataset size

Page 70: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

LUBM Queries (2)

•  Query design criteria (continued)
   –  Complexity:
      •  measured on the basis of the number of classes and properties involved in the query
      •  different complexity for the same query and for different implementations: relational vs RDF
   –  Hierarchy information:
      •  class and property hierarchies are used to obtain all query answers
   –  Logical inference:
      •  inference is required to obtain all query answers

Page 71: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

LUBM Queries (3): Characteristics

[Table: characteristics (simple filters, complex filters, >=9 TPs, unbound predicates, negation, OPTIONAL, LIMIT, ORDER BY, DISTINCT, REGEX, UNION, DESCRIBE, CONSTRUCT, ASK) against queries Q1 to Q14; no cell is marked]

Simple SPARQL SELECT queries

Page 72: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

LUBM Queries (4): Choke Points

# CP1 CP2 CP3 CP4 CP5 CP6 CP7 CP8 CP9 CP10 CP11

Q1

Q2 ✓

Q3 ✓

Q4 ✓ ✓

Q5 ✓

Q6 ✓

Q7 ✓

Q8 ✓

Q9 ✓

Q10 ✓

Q11 ✓

Q12 ✓ ✓

Q13 ✓

Q14

Join Ordering: most complex query contains 5 joins

Reasoning: focus on subClass and subProperty hierarchies

Page 73: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

LUBM Performance Metrics (1)

•  Load Time:
   –  Time needed to parse, load and reason for a dataset
   –  Focuses on persistent stores
•  Repository Size:
   –  For persistent storage only
   –  The size of all files that constitute the repository
•  Query Response Time:
   –  Average time for executing a query 10 times (warm run)

Page 74: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

LUBM Performance Metrics (2)

•  Query Completeness and Soundness:
   –  Measures the degree of completeness of a query answer as the percentage of entailed unique answers
•  Combined Metric:
   –  Combines query response time with answer completeness and answer soundness
   –  Measures the trade-off between query response time and completeness of results
      •  See how reasoning affects query performance
   –  Provides an absolute ranking of systems
   –  But hides details!
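The LUBM metrics above can be made concrete with a small sketch. The slides define completeness as the percentage of entailed unique answers that are returned; the soundness definition used here (the share of returned answers that are actually entailed) and the handling of empty sets are assumptions, and the function names are hypothetical. The combined metric is not reproduced, since the slides do not give its formula.

```python
from statistics import mean


def completeness(returned: set, entailed: set) -> float:
    """Fraction of the entailed unique answers that the system returned."""
    if not entailed:
        return 1.0  # assumption: nothing to find counts as complete
    return len(returned & entailed) / len(entailed)


def soundness(returned: set, entailed: set) -> float:
    """Fraction of the returned answers that are actually entailed (assumed definition)."""
    if not returned:
        return 1.0  # assumption: an empty answer contains no wrong results
    return len(returned & entailed) / len(returned)


def query_response_time(timings_ms: list) -> float:
    """Average over repeated executions of one query (10 warm runs in the slides)."""
    return mean(timings_ms)


entailed_answers = {"a", "b", "c", "d"}
returned_answers = {"a", "b", "x"}
print(completeness(returned_answers, entailed_answers))
print(soundness(returned_answers, entailed_answers))
```

Here the system finds two of the four entailed answers and returns one spurious answer, so completeness and soundness differ, which is the kind of detail a single combined score would hide.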

Page 75: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

SP2Bench [SHM+09]

•  Proposes a language-specific benchmark to test the most common SPARQL constructs, operator constellations and RDF access patterns
•  Components:
   –  Scalable synthetic data generator
      •  Creation of DBLP documents in RDF mimicking key characteristics of the original DBLP dataset
      •  Produced datasets contain blank nodes and RDF containers
   –  Supports extensional queries (i.e., queries that request instances and not schema information)
   –  Proposes performance metrics

Page 76: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

SP2Bench Schema DBLP (1)

•  Study of DBLP real data
   –  Determine the probability distribution for selected attributes per document class, which forms the basis for generating class instances
   –  Reveals that only a few of the attributes are repeated for the same class

Extract of the DBLP DTD (2008):

<!ELEMENT dblp (article|inproceedings|proceedings|book|incollection|phdthesis|mastersthesis|www)*>
<!ENTITY % field "author|editor|title|booktitle|pages|year|address|journal|volume|number|month|url|ee|cdrom|cite|publisher|note|crossref|isbn|series|school|chapter">
<!ELEMENT article (%field;)*>
<!ELEMENT inproceedings (%field;)*>

Page 77: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

SP2Bench Schema DBLP (2)

•  Probability distribution for selected attributes per document class
•  Additional assumption is that attributes are not dependent
   –  Existence of an attribute does not depend on another
•  Use bell-shaped Gaussian curves to approximate input data
   –  Typically used to model normal distributions
•  Studied the number of class instances over time and modeled those with a power law distribution

          Article   Inproc.   Proc.    Book     WWW
author    0.9895    0.9970    0.0001   0.8937   0.9973
cite      0.0048    0.0104    0.0001   0.0079   0.0000
editor    0.0000    0.0000    0.7992   0.1040   0.0004
isbn      0.0000    0.0000    0.8592   0.9294   0.0000
…         …         …         …        …        …
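One way to read the probability table together with the independence assumption is sketched below: each attribute is included in a generated document with its class-specific probability, independently of the others. This is an illustration only, not the SP2Bench generator; the function name `sample_attributes` and the dictionary layout are hypothetical, and only the probabilities for the Article and Book columns are copied from the table.

```python
import random

# Probabilities copied from the Article and Book columns of the table above.
ATTRIBUTE_PROBABILITY = {
    "Article": {"author": 0.9895, "cite": 0.0048, "editor": 0.0000, "isbn": 0.0000},
    "Book": {"author": 0.8937, "cite": 0.0079, "editor": 0.1040, "isbn": 0.9294},
}


def sample_attributes(doc_class: str, rng: random.Random) -> list:
    """Pick the attributes of one document; each attribute is drawn independently."""
    probabilities = ATTRIBUTE_PROBABILITY[doc_class]
    # rng.random() lies in [0, 1), so an attribute with probability 0 is never chosen.
    return [name for name, p in probabilities.items() if rng.random() < p]


rng = random.Random(2016)  # fixed seed, so the sample is reproducible
for _ in range(3):
    print(sample_attributes("Book", rng))
```

Because each attribute is drawn on its own, the presence of `isbn` says nothing about the presence of `editor`, which mirrors the stated assumption that attributes are not dependent.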

Page 78: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

SP2Bench Data Generation

•  Synthetically produced extensional data that conform to the DBLP schema
•  Use of existing external vocabularies to describe resources in a uniform way
   –  FOAF: Friend of a Friend (persons) [FOAF], SWRC: Semantic Web for Research Communities (scientific publications) [SWRC], DC: Dublin Core [DC]
•  Introduce blank nodes and RDF containers (rdf:Bag) to capture all aspects of the RDF data model
•  Data generation takes into account the data approximation as reflected in the Gaussian curves
•  Data generator takes as input either the triple count or the year up to which the data is generated
   –  Always ending up in a consistent state!
•  Random functions are based on a fixed seed, making data generation deterministic

Page 79: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

SP2Bench Queries (1): Characteristics

•  17 queries
   –  12 main queries and modifications thereof
•  Provided in natural language and in SPARQL 1.0; SQL translations are also available
•  Query design criteria
   –  Focus on the SELECT and ASK SPARQL forms
   –  Aim at covering the majority of SPARQL constructs (including DISTINCT, ORDER BY, LIMIT, OFFSET)

Page 80: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

SP2Bench Queries (2): Characteristics

Characteristic Q1 Q2 Q3abc Q4 Q5ab Q6 Q7 Q8 Q9 Q10 Q11 Q12abc
Simple filters ✔ ✔ ✔ ✔
Complex filters ✔ ✔ ✔
>=9 TPs ✔ ✔ ✔ ✔ ✔
Unbound predicates ✔ ✔
Negation ✔ ✔
OPTIONAL ✔ ✔ ✔
LIMIT ✔
ORDER BY ✔ ✔
DISTINCT ✔ ✔ ✔ ✔ ✔ ✔
REGEX
UNION ✔ ✔ ✔
DESCRIBE
CONSTRUCT
ASK ✔

Page 81: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

SP2Bench Queries (3): Choke Points

# CP1 CP2 CP3 CP4 CP5 CP6 CP7 CP8 CP9 CP10 CP11

Q1 ✓

Q2 ✓ ✓

Q3 ✓

Q4 ✓ ✓ ✓

Q5 ✓ ✓ ✓

Q6 ✓ ✓ ✓ ✓

Q7 ✓ ✓ ✓ ✓

Q8 ✓ ✓ ✓ ✓ ✓

Q9 ✓ ✓

Q10

Q11 ✓

Q12 ✓ ✓ ✓

Join Ordering: most complex query contains 8 joins

Filters: most complex query contains 2 filters

Duplicate Elimination

Page 82: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

SP2Bench Performance Metrics

•  Loading Time:
   –  Time needed to parse, load and reason using the tested system for a dataset
   –  Focuses on persistent stores
•  «Per-query» performance:
   –  Performance of each query
•  «Global» performance:
   –  List the arithmetic and geometric mean of the query execution times; the geometric mean is obtained as follows:
      1.  Multiply the execution times of all 17 queries
      2.  Penalize queries that fail with a 3600 s penalty
      3.  Compute the 17th root of the result
•  Memory consumption
   –  High watermark of main memory consumption
   –  Average memory consumption of all queries
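The «global» performance computation can be sketched as follows. The 3600 s penalty comes from the slides; the function name, the use of `None` to mark a failed query and the example timings are assumptions for illustration. Instead of multiplying all times and taking the n-th root directly, the sketch averages logarithms, which is mathematically equivalent for positive times and avoids very large intermediate products.

```python
import math

PENALTY_SECONDS = 3600.0  # penalty from the slides for a query that fails


def global_performance(times: list) -> tuple:
    """Return (arithmetic mean, geometric mean); None marks a failed query."""
    # Step 2 of the slides: replace failed queries by the penalty value.
    penalized = [PENALTY_SECONDS if t is None else t for t in times]
    arithmetic = sum(penalized) / len(penalized)
    # Steps 1 and 3: the n-th root of the product, computed via logarithms.
    # Assumption: all execution times are strictly positive.
    geometric = math.exp(sum(math.log(t) for t in penalized) / len(penalized))
    return arithmetic, geometric


# Hypothetical timings in seconds for three queries, one of which failed.
arithmetic_mean, geometric_mean = global_performance([2.0, 8.0, None])
print(arithmetic_mean, geometric_mean)
```

A single failed query dominates the arithmetic mean, while the geometric mean dampens its effect, which is presumably why both are listed.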

Page 83: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

Berlin SPARQL Benchmark (BSBM) [BS09] [BSBM]

•  Built around an e-commerce use case
•  Query mix emulates the search and navigation patterns of a user looking for a product of interest
•  Goals
   –  Allow the comparison of SPARQL engines across different architectures (relational and/or RDF)
   –  Challenge forward- and backward-chaining reasoning engines
   –  Focuses on an enterprise setting where multiple clients concurrently execute workloads
   –  Measures SPARQL query performance and not (so much) reasoning
•  Components
   –  Data generator: supports the creation of arbitrarily large datasets
   –  Test driver: executes sequences of SPARQL queries

Page 84: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

BSBM Schema (1)

•  E-commerce use case: products are offered by several vendors and consumers post reviews for those products

[Figure: BSBM schema; classes and their properties]
Review: bsbm:reviewFor, rev:reviewer, bsbm:reviewDate, dc:title, rev:text, bsbm:rating1 [0..1], bsbm:rating2 [0..1], bsbm:rating3 [0..1], bsbm:rating4 [0..1]
Producer: rdfs:label, rdfs:comment, rdf:type, foaf:homepage, bsbm:country
ProductType: rdfs:label, rdfs:comment, rdf:type, rdfs:subClassOf [1..0]
ProductFeature: rdfs:label, rdfs:comment, rdf:type
Product: rdfs:label, rdfs:comment, rdf:type, bsbm:producer, bsbm:productFeature [9..22], bsbm:productPropertyTextual1, bsbm:productPropertyTextual2, bsbm:productPropertyTextual3, bsbm:productPropertyTextual4 [0..1], bsbm:productPropertyTextual5 [0..1], bsbm:productPropertyNumeric1, bsbm:productPropertyNumeric2, bsbm:productPropertyNumeric3, bsbm:productPropertyNumeric4 [0..1], bsbm:productPropertyNumeric5 [0..1]
Offer: bsbm:product, bsbm:vendor, bsbm:price, bsbm:validFrom, bsbm:validTo, bsbm:deliveryDays, bsbm:offerWebpage
Person: foaf:name, foaf:mbox_sha1sum, bsbm:country
Vendor: rdfs:label, rdfs:comment, rdf:type, foaf:homepage, bsbm:country
Cardinality labels on the diagram edges: 9..22, 1..89, 1, 1..*, 1..*, 1..*, 1, 2..16, 14..32, 1, 280..3730, 2..37, 1

Page 85: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

BSBM Schema & Data Characteristics (1)

•  Every product has a type from a product hierarchy
•  Product hierarchy is not fixed (depends on the dataset size)
   –  Its depth and width depend on the chosen scale factor $n$
   –  Hierarchy depth: $d = 1 + \mathrm{round}(\log_{10}(n)) / 2$
   –  Branching factor for
      •  root level: $bfr = 1 + \mathrm{round}(\log_{10}(n))$
      •  all other levels is 8
•  Product types are assigned a variable number of product features
   –  computed as lowerBound and upperBound with
      •  $lowerBound = 35 \cdot i / (d \cdot (d+1) / 2 - 1)$, $upperBound = 75 \cdot i / (d \cdot (d+1) / 2 - 1)$
   –  Set of possible features for a given product type is the union of the features of the type and all its "super-types".

Page 86: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

BSBM Schema & Data Characteristics (2)

•  Products, Vendors, Offers
   –  Products that share the same type also have the same set of features
   –  For a given product, its features are chosen from the set of possible features with a hard-coded probability of 25%
   –  Normal distribution with a mean of μ=50 and standard deviation σ=16.6 is employed to associate products with producers
   –  Vendors are associated with countries following hard-coded distributions
   –  Number of offers is n*20; offers are distributed over products following a normal distribution with «fixed parameters» μ=n/2 and σ=n/4
   –  Offers are distributed over vendors following a normal distribution with «fixed parameters» μ=2000 and σ=667

Page 87: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

BSBM Schema & Data Characteristics (3)

•  Reviews
   –  Number of reviews is 10 times the scale factor n
   –  Datatype property values (title and text) between 50 and 300 words
   –  Up to 4 ratings; each rating is a random integer between 1 and 10
   –  Each rating is missing with a hard-coded probability of 10%
   –  Distributed over products with a normal distribution depending on dataset size, with μ=n/2 and σ=n/4
   –  Number of reviews per reviewer follows a normal distribution with μ=20 and σ=6.6
   –  Reviewers are generated until all reviews are assigned a reviewer
   –  Reviewer countries follow the same distribution as vendor countries
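The review rules above translate into a short sketch. It is not the BSBM data generator: the function names are hypothetical, a missing rating is represented as `None`, and clamping the number of reviews per reviewer to at least 1 is an assumption added so the example always returns a usable count. The probabilities and distribution parameters are those listed on the slide.

```python
import random
from typing import List, Optional


def generate_ratings(rng: random.Random) -> List[Optional[int]]:
    """Four rating slots; each is an integer 1..10 or missing with probability 10%."""
    ratings: List[Optional[int]] = []
    for _ in range(4):
        if rng.random() < 0.10:
            ratings.append(None)  # rating is missing
        else:
            ratings.append(rng.randint(1, 10))
    return ratings


def reviews_per_reviewer(rng: random.Random) -> int:
    """Draw from a normal distribution with mu=20 and sigma=6.6."""
    # Assumption: round to an integer and never return fewer than 1 review.
    return max(1, round(rng.gauss(20, 6.6)))


rng = random.Random(7)  # fixed seed for a reproducible example
print(generate_ratings(rng), reviews_per_reviewer(rng))
```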

Page 88: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

BSBM Data Generation (1)

•  Synthetically produces instances of class Product that conform to the BSBM schema

Indicative number of instances for different dataset sizes:

Total # triples      250K      1M        25M         100M
# products           666       2,785     70,812      284,826
# product features   2,860     4,745     23,833      47,884
# product types      55        151       731         2,011
# producers          14        60        1,422       5,618
# vendors            8         34        722         2,854
# offers             13,320    55,700    1,416,240   5,696,520
# reviewers          339       1,432     36,249      146,054
# reviews            6,660     27,850    708,120     2,848,260
Total # instances    23,922    92,757    2,258,129   9,034,027

Page 89: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

BSBM Queries (1)

•  12 queries
•  Query mix emulates the search and navigation patterns of a customer looking for a product
•  BSBM queries are given in natural language, SPARQL and SQL

Query   Description
Q1      Find products for a given set of generic features
Q2      Retrieve basic information about a specific product for display purposes
Q3      Find products having some specific features and not having one feature
Q4      Find products matching two different sets of features
Q5      Find products that are similar to a given product
Q6      Find products having a label that contains a specific string
Q7      Retrieve in-depth information about a product including offers and reviews
Q8      Give me recent language reviews for a specific product
Q9      Get information about a reviewer
Q10     Get cheap offers which fulfill the consumer's delivery requirements
Q11     Get all information about an offer
Q12     Export information about an offer into another schema

Page 90: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

BSBM Queries (2): Characteristics

Characteristic Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Q11 Q12
Simple filters ✔ ✔ ✔ ✔ ✔ ✔ ✔
Complex filters ✔ ✔
>9 TPs ✔ ✔ ✔ ✔ ✔
Unbound predicates
Negation ✔
OPTIONAL ✔ ✔ ✔ ✔
LIMIT ✔ ✔ ✔ ✔ ✔ ✔
ORDER BY ✔ ✔ ✔ ✔ ✔ ✔
DISTINCT ✔ ✔ ✔
REGEX ✔
UNION ✔ ✔
DESCRIBE ✔
CONSTRUCT ✔
ASK

11 JOINs, 3 OPTIONAL clauses, 3 filters, 1 unbound variable

4 OPTIONAL clauses

Page 91: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

BSBM Queries (3): Choke Points

# CP1 CP2 CP3 CP4 CP5 CP6 CP7 CP8 CP9 CP10 CP11
Q1 ✔ ✔ ✔ ✔
Q2 ✔
Q3 ✔ ✔ ✔
Q4 ✔ ✔ ✔ ✔
Q5 ✔ ✔ ✔ ✔
Q6 ✔ ✔
Q7 ✔ ✔ ✔
Q8 ✔ ✔ ✔
Q9 ✔
Q10 ✔ ✔ ✔ ✔
Q11 ✔
Q12 ✔

Join Ordering: most complex query contains 11 joins

Filters: most complex query contains 3 filters and most complex filter contains arithmetic expressions

Result Ordering

Page 92: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

BSBM: Performance Metrics

•  Query Mixes per Hour (QMpH)
   –  Measures the number of complete BSBM query mixes answered per hour by the system under test, for a specific number of clients running concurrently against it
•  Queries per Second (QpS)
   –  Measures the number of queries of a specific type handled by the system under test in a second
   –  Calculated by dividing the number of queries of a specific type within a benchmark run by the total execution time of those queries
•  Load Time:
   –  Time to load the dataset in the RDF or relational repositories
      •  Includes the time to create the appropriate data structures & indices
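The two throughput metrics can be written down in a few lines. The QpS formula follows the slide directly (number of queries of one type divided by their total execution time). The QMpH formula below, which scales the number of completed mixes to one hour of wall-clock time, is an assumption about how the figure is normalized; the function names and sample numbers are hypothetical.

```python
def queries_per_second(num_queries: int, total_execution_seconds: float) -> float:
    """QpS for one query type: executed queries divided by their total execution time."""
    return num_queries / total_execution_seconds


def query_mixes_per_hour(completed_mixes: int, elapsed_seconds: float) -> float:
    """QMpH (assumed normalization): completed query mixes scaled to one hour."""
    return completed_mixes * 3600.0 / elapsed_seconds


# Hypothetical run: 50 executions of one query type took 2.5 s in total,
# and 100 complete query mixes finished within 30 minutes.
print(queries_per_second(50, 2.5))
print(query_mixes_per_hour(100, 1800.0))
```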

Page 93: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

Semantic Publishing Benchmark (SPB)

•  Developed in the context of the FP7 EU project LDBC (2012-2015)
•  LDBC's goals:
   –  Develop querying benchmarks that will spur research & industry progress in large-scale graph and RDF data management
      •  scalability, storage, indexing and query optimization techniques for RDF and graph database solutions
      •  quantitatively and qualitatively assess different solutions for RDF data integration
   –  Establish an industry-neutral entity, the LDBC foundation, à la the Transaction Processing Performance Council (TPC)

Page 94: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

Semantic Publishing Benchmark (SPB)

•  Industry-motivated benchmark
   –  The scenario involves a media/publisher organization that maintains semantic metadata about its journalistic assets
•  Components
   –  Scalable synthetic data generator
      •  Creation of instances of BBC ontologies mimicking characteristics of the original real input datasets
   –  Supports extensional queries (i.e., queries that request instances and not schema information)
   –  Workload simulates consumption of RDF metadata
      •  Concurrent read and update queries
   –  Proposes performance metrics

Page 95: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

SPB Design: Requirements

•  Storing and processing RDF data
   –  Storing and isolating data in separate RDF graphs
   –  Supporting the following SPARQL standards:
      •  SPARQL 1.1 Protocol, Query, Update
•  Support for schema languages
   –  Support for RDFS to obtain the correct answers
   –  Optional support for the RL profile of the Web Ontology Language (OWL 2 RL) in order to pass the conformance test suite
•  Loading data from RDF serialization formats
   –  N-Quads, TriG, Turtle, etc.

Page 96: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

SPB Schema: BBC Ontologies (1)

•  Core Ontologies: 7 ontologies describe basic concepts about entities and relationships in the domain of interest
   –  Basic concepts: Creative Works, Places, Persons, Provenance Information, Company Information, etc.

[Figure: CreativeWork ontology diagram. Class labels: Thing, owl:Thing, CreativeWork, NewsItem, BlogPost, Programme, Theme, Organisation, Event, Place, Person, Audience, International Audience, National Audience, cwork:Format, Textual Format, Video Format, Interactive Format, Image Format, Audio Format, PictureGallery Format, cwork:Thumbnail, Thumbnail, ThumbnailType, StandardThumbnail, CloseUpThumbnail, FixedSize66Thumbnail, FixedSize266Thumbnail, FixedSize466Thumbnail. Property labels: cwork:title, cwork:shortTitle, cwork:description, cwork:category, cwork:audience, cwork:primaryFormat, cwork:dateCreated, cwork:dateModified, cwork:thumbnail, thumbnailType, cwork:altText, cwork:tag, tag, about, mentions, owl:sameAs. Value types: String, xsd:Any, xsd:dateTime. Edge legend: rdfs:subClassOf, rdfs:subPropertyOf, rdf:type]

Page 97: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

SPB Schema: BBC Ontologies (2)

•  Domain Ontologies: 3 ontologies describe concepts and properties related to a specific domain
   –  sports (competitions, events)
   –  politics entities
   –  news (concepts that journalists tag annotations with)
•  Statistics
   –  74 classes
   –  88 datatype properties, 28 object type properties
   –  60 rdfs:subClassOf (maximum depth 3) and 17 rdfs:subPropertyOf (maximum depth 1) hierarchies
   –  105 rdfs:domain and 115 rdfs:range RDFS properties
   –  8 owl:oneOf class axioms, 1 owl:TransitiveProperty property

Page 98: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

SPB: Reference Datasets

•  Collections of entities describing various domains
   –  Snapshots of the real datasets of BBC
      •  Football competitions and teams
      •  Formula One competitions and teams
      •  UK Parliament members
   –  Additional datasets
      •  GeoNames: places, names and coordinates
      •  DBpedia: person data
   –  Reference dataset size: 25M triples

Page 99: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

SPB Data Generation (1): Process

1.  Loader
    –  Ontology & Reference Data
2.  Data Generator
    a.  Retrieves instances from Reference Datasets
    b.  Generates Creative Works according to pre-defined allocations and models
    c.  Writes generated data to disk

[Figure: SPB Data Generator. Components: BBC Ontologies, Reference Datasets, Ontology & Reference Dataset Loader, RDF Repository, SPARQL Endpoint, Creative Works Generator, data generation parameters, generated CWs; arrows labelled (1), (2.a), (2.c), (2.d)]

Page 100: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

SPB Data Generation (2)

•  Produces synthetic data that mimic most of the characteristics of real-world data provided by BBC
•  Input: Core & Domain Ontologies and Reference datasets
•  Output:
   –  Instances that conform to the BBC core ontologies (class CreativeWork)
   –  Instances refer to entities in the reference datasets using the about & mentions schema properties
   –  Follows the (user) pre-defined distributions of SPB's Data Generator

[Figure: tagged entities over time (01/2012 to 12/2012), showing clustering, correlations and random distribution]

Page 101: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

SPB Operational Phases

•  Data Loading
   1.  Initial loading of reference datasets
       •  BBC datasets enriched with DBpedia person and GeoNames place data
   2.  Generation of Creative Works
       •  Parallel generation (multi-threaded and multi-process)
   3.  Loading of Creative Works in the RDF repository
•  Running the Benchmark
   1.  Warm-up phase
   2.  Run the benchmark using the Test Driver
   3.  Run conformance tests (OWL 2 RL) [optional]

Page 102: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

Benchmark Configuration

•  Data Generator
   –  Allocation of tags in Creative Works
      •  Correlations of Creative Works with important entities (persons, places, events)
      •  Clustering of Creative Works around major/minor events
   –  Size of generated data (triples)
   –  Parallel data generation
•  Test Driver
   –  Distribution of queries in the query mix
      •  editorial operations (deletion/addition of RDF triples)
      •  aggregate operations (complex SPARQL queries)
   –  Number of editorial/aggregation agents
   –  Duration of Warm-up and Benchmark phases
   –  Each operational phase can be enabled or disabled

Page 103: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

SPB Base Workload Queries (2)

Characteristic Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Q11 Q12
Simple filters ✔
Complex filters ✔ ✔ ✔
>9 TPs ✔ ✔ ✔ ✔
Unbound predicates
Negation
OPTIONAL ✔ ✔ ✔ ✔
LIMIT ✔ ✔ ✔ ✔ ✔ ✔
ORDER BY ✔ ✔ ✔ ✔ ✔ ✔ ✔
DISTINCT ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔
COUNT ✔
REGEX
UNION ✔ ✔ ✔
GROUP BY ✔
CONSTRUCT ✔ ✔ ✔ ✔ ✔

Evaluate (parts of the) query on graphs

Page 104: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

SPB Queries (1)

•  Base and Advanced Workloads
   –  Base Workload: 12 queries & update operations
   –  Advanced Workload: 24 queries
•  Workloads based on real queries used by BBC journalists during their editorial operations
•  Editorial agents: simulate editorial work performed by journalists
   –  Insert, Update, Delete
•  Aggregation agents: simulate retrieval operations performed by end-users

Page 105: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

SPB Base Workload Queries (3): Choke Points

# CP1 CP2 CP3 CP4 CP5 CP6 CP7 CP8 CP9 CP10 CP11
Q1 ✔ ✔ ✔ ✔ ✔
Q2 ✔ ✔ ✔
Q3 ✔ ✔ ✔ ✔ ✔ ✔
Q4 ✔ ✔ ✔ ✔ ✔
Q5 ✔ ✔ ✔ ✔ ✔
Q6 ✔ ✔ ✔ ✔
Q7 ✔ ✔
Q8 ✔ ✔ ✔
Q9 ✔ ✔ ✔
Q10 ✔ ✔ ✔ ✔
Q11 ✔ ✔ ✔ ✔ ✔
Q12 ✔ ✔

Reasoning regarding class & property hierarchies

Join Ordering

Ordering & Duplicate Elimination

Page 106: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

SPB Performance Metrics

•  SPB Primary Metrics: Query Rate Interactive Mix (queries per second), Query Rate Analytical Mix (queries per second), Update Rate (operations per second)
•  Query Execution Report (1): Duration of Bulk Load (in ms), Duration of Measurement Window (in minutes), # Complete Analytical Mixes (per second), # Complete Interactive Mixes (per second), # Complete Update Operations
•  Query Execution Report (2), per query: Arithmetic Mean Execution Time, Minimum Execution Time, 90th % Average Execution Time, # Executions

Page 107: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

Real RDF Benchmarks

Page 108: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

UniProt [RU09] [UniprotKB]

•  Comprehensive, high-quality and freely accessible resource of protein sequence and functional information
•  UniProt Schema
   –  UniProt Core Vocabulary, BIBO (journals), ECO (evidence codes), Dublin Core (metadata)
   –  UniProt Core Vocabulary: 124 classes, 113 properties
•  Dataset contains approximately
   –  13 billion triples
   –  2.5 billion distinct subjects
   –  2 billion distinct objects
•  Queries
   –  No representative set of queries is offered.
   –  [NW09] offers a set of 8 queries to test the RDF-3X engine

Page 109: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

UniProt Queries (1) [NW09]: Characteristics

Characteristic Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8
Simple filters
Complex filters
>9 TPs ✔ ✔ ✔ ✔ ✔ ✔
Unbound predicates
Negation
OPTIONAL
LIMIT
ORDER BY
DISTINCT
REGEX
UNION
DESCRIBE
CONSTRUCT
ASK

Join Ordering: RDF-3X aims at optimizing join processing for RDF data

Page 110: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

UniProt Queries (2) [NW09]: Choke Points
• Focus on discovering optimal or close to optimal join orders

# CP1 CP2 CP3 CP4 CP5 CP6 CP7 CP8 CP9 CP10 CP11
Q1 ✔
Q2 ✔
Q3 ✔
Q4 ✔
Q5 ✔
Q6 ✔
Q7 ✔
Q8 ✔

Join Ordering: the most complex query contains 12 joins; 7 queries contain more than 7 joins

Page 111: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

YAGO (Yet Another Great Ontology) [SKW07]
• High-quality multilingual knowledge base derived from Wikipedia, WordNet and GeoNames
• Schema
  – Wikipedia entities, WordNet and GeoNames concepts and relationships: associates the WordNet taxonomy with the Wikipedia category system
  – 10 million schema entities
• Dataset
  – 120 million triples about schema entities
  – 2.625 million links to DBPedia
• Queries
  – No representative set of queries is offered by YAGO
  – [NW10] provides a representative set of 8 queries for the RDF-3X evaluation

Page 112: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

YAGO Queries (1) [NW10]: Characteristics
• Simple SELECT queries that focus on join ordering, negation and duplicate elimination

Characteristic A1 A2 A3 B1 B2 B3 C1 C2
Simple filters ✔
Complex filters
>9 TPs ✔
Unbound predicates
Negation ✔ ✔ ✔
OPTIONAL
LIMIT
ORDER BY
DISTINCT ✔ ✔ ✔ ✔ ✔
REGEX
UNION ✔

Page 113: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

YAGO Queries (2) [NW10]: Choke Points
• Queries focus mostly on discovering optimal or close to optimal query evaluation plans, including negation in filters and duplicate elimination.

# CP1 CP2 CP3 CP4 CP5 CP6 CP7 CP8 CP9 CP10 CP11
A1 ✔
A2 ✔
A3 ✔ ✔ ✔ ✔
B1 ✔ ✔ ✔
B2 ✔ ✔
B3 ✔ ✔ ✔
C1 ✔ ✔ ✔ ✔
C2 ✔ ✔

Join Ordering: the most complex query contains 8 joins; all queries contain more than 5 joins

Page 114: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

Barton Library [Barton]
• Data from the MIT Simile Project, which develops tools for library data management
  – contains records that compose an RDF-formatted dump of the MIT Libraries Barton catalog
  – converted from raw data stored in an old library format standard called MARC (Machine Readable Catalog).
• Schema
  – Common types include Record and Item, the latter being associated with instances of type Person and with instances of Description.
  – Primitive types include Title and Date.
• Dataset
  – Approximately 45 million RDF triples
• Queries
  – No representative queries provided with the Barton Library Dataset
  – [Abadi07] provides a workload of 7 queries ([NW10] in SPARQL)

Page 115: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

Barton Queries (1) [NW10]: Characteristics

Characteristic Q1 Q2 Q3 Q4 Q5 Q6 Q7
Simple filters ✔ ✔ ✔ ✔
Complex filters
>9 TPs
Unbound predicates
Negation ✔
OPTIONAL
LIMIT
ORDER BY
DISTINCT ✔ ✔ ✔
REGEX
UNION ✔

Page 116: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

Barton Queries (2) [NW10]: Choke Points
• Queries focus mostly on discovering optimal or close to optimal query evaluation plans, including negation in filters and duplicate elimination.

# CP1 CP2 CP3 CP4 CP5 CP6 CP7 CP8 CP9 CP10 CP11
Q1 ✔
Q2 ✔ ✔
Q3 ✔
Q4 ✔
Q5 ✔ ✔
Q6 ✔ ✔ ✔ ✔
Q7 ✔

Join Ordering: the most complex query contains 3 joins

Page 117: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

Linked Sensor Dataset [PHS10]
• Expressive descriptions of approximately 20,000 weather stations in the US
• Divided up into multiple subsets that reflect weather data for specific hurricanes or blizzards from the past (focus on hurricane Ike)
• Schema
  – Contains information about temperature, precipitation, pressure, wind, speed, humidity
  – Contains links to GeoNames and links to observations provided by MesoWest (meteorological service in the US)
• Dataset
  – More than 1 billion triples
• Queries
  – No representative set of queries is offered.

Page 118: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

WordNet [WordNet]
• Large lexical database of English, developed under the direction of George A. Miller (Emeritus).
• Schema
  – Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept.
  – Synsets are interlinked by means of conceptual-semantic and lexical relations. The resulting network of meaningfully related words and concepts can be navigated with the browser.
• Dataset
  – Approximately 1.9 million triples (300 MB).
• Queries
  – No representative query workload

Page 119: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

Publishing TPC-H as RDF [TPC-H]
• Benchmark can be used by decision support systems that
  – examine large volumes of data, execute queries with a high degree of complexity, and provide answers to critical business questions
• Benchmark provides a suite of business-oriented ad-hoc queries and concurrent data modifications
• Queries and the data populating the database have been chosen to have broad industry-wide relevance
• Use the DBGEN TPC-H generator to generate a TPC-H relational dataset
• Use the D2R tool or another relational-to-RDF tool to convert the relational dataset to the equivalent RDF one.
• TPC SQL queries are translated to equivalent SPARQL queries

Page 120: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

Benchmark Generators

Page 121: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

DBPedia SPARQL Benchmark (DBSB) [MLA+14]
• Generic methodology for SPARQL benchmark creation
• Based on
  – Flexible data generation that mimics an input data source
  – Query-log mining
  – Clustering of queries
  – SPARQL query feature analysis
• Methodology is schema agnostic
  – Demonstrated using the DBPedia KB
• Proposed approach applied on various sizes of the DBPedia Knowledge Base
• Benchmark proposes a query workload based on real queries expressed against DBPedia

Page 122: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

DBSB Data Generation (1)
• Working assumptions
1. Output dataset should have similar characteristics as the input dataset
   • Number of classes, properties, value distributions, taxonomic structures (hierarchies)
2. Varying output dataset sizes
3. Characteristics such as in- and out-degree of nodes in datasets of varying sizes should be similar
4. Easily repeatable data generation process

Page 123: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

DBSB Data Generation (2)
• Idea
1. Large datasets produced by
   • Duplicating all triples and changing their namespace
2. Smaller datasets produced by
   • Removing triples in a way that preserves the properties of the original graph
   • Using a seed-based method, based on the assumption that a representative set of resources is obtained by sampling across classes
     1. For each selected element in the dataset, its concise bound description (CBD) is retrieved and added to the queue
     2. Process is repeated until the target number of triples is reached
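The seed-based reduction above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the DBSB implementation: `concise_description` is a simplified stand-in for a CBD (it returns only the triples whose subject is the resource and ignores blank-node closure), and following object resources to grow the queue is an assumption about how the process is iterated. All function names are hypothetical.

```python
from collections import deque


def concise_description(triples, resource):
    """Simplified CBD (assumption): every triple whose subject is `resource`."""
    return [t for t in triples if t[0] == resource]


def seed_sample(triples, seeds, target_size):
    """Grow a sample from seed resources until `target_size` triples are collected."""
    queue = deque(seeds)
    visited = set()   # resources already expanded
    seen = set()      # triples already in the sample
    sample = []       # keeps insertion order
    while queue and len(sample) < target_size:
        resource = queue.popleft()
        if resource in visited:
            continue
        visited.add(resource)
        for triple in concise_description(triples, resource):
            if len(sample) >= target_size:
                break
            if triple not in seen:
                seen.add(triple)
                sample.append(triple)
                queue.append(triple[2])  # assumption: expand object resources next
    return sample


if __name__ == "__main__":
    data = [("a", "p", "b"), ("a", "q", "c"), ("b", "p", "d"),
            ("c", "p", "e"), ("x", "p", "y")]
    print(seed_sample(data, ["a"], 3))
```

In practice the seeds would be drawn by sampling across classes, as the slide states; here they are passed in explicitly to keep the sketch deterministic.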

Page 124: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

DBSB Query Analysis (1)
• Goal is to detect prototypical queries that were sent to a DBPedia SPARQL endpoint using similarity measures
  – String similarity and graph similarity
• Idea: 4-step query analysis and clustering approach
1. Select queries executed frequently on the input data
2. Strip common syntactic constructs (namespace, prefixes)
3. Compute query similarity using string matching
4. Compute query clusters using a soft graph clustering algorithm
• Clusters used to devise the benchmark query generation patterns

Page 125: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

DBSB Query Analysis (2)
• Query Selection
1. Use the DBPedia SPARQL query log (31.5 million queries in a 3-month period)
2. Reduce the initial set of queries by considering
   • Query variations: use a standard way to name variables to reduce differences among queries (promoting query constructs such as DISTINCT, REGEX)
   • Query frequency: discard queries with low frequency since they do not contribute to the overall query performance
   – Result: 35,965 queries
3. String stripping: remove all SPARQL keywords and common prefixes
4. Similarity computation: compute the similarity of the stripped queries

Page 126: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

DBSB Query Analysis (3)
• Query Selection (cont’d)
4. Similarity computation
   • To reduce the time of benchmark compilation, use the LIMES [NS11] framework
   • Use the Levenshtein string similarity measure with a 0.9 threshold
   • Reduce by 16.6% the number of computations required by computing the Cartesian product of queries
5. Clustering
   • Apply graph clustering to the query similarity graph of (4)
   • Goal is to identify similar groups of queries out of which prototypical queries will be generated
   • Use the BorderFlow [NS09] algorithm, which follows a seed-based approach
   • Obtain 12,272 clusters, 24% of which contain a single query
   • Select the clusters with >5 queries
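To make the similarity step concrete, here is a minimal Python sketch of building the query similarity graph with a 0.9 threshold. It is a brute-force illustration, not LIMES: it compares every pair of queries, which is exactly the Cartesian product that LIMES is used to avoid. Normalizing the edit distance by the length of the longer string is an assumption, since the slide does not say how the distance is turned into a similarity.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Similarity in [0, 1]; normalization by the longer length is an assumption."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def similarity_edges(queries, threshold=0.9):
    """Edges (i, j, score) of the query similarity graph, compared pairwise."""
    edges = []
    for i in range(len(queries)):
        for j in range(i + 1, len(queries)):
            score = similarity(queries[i], queries[j])
            if score >= threshold:
                edges.append((i, j, score))
    return edges


if __name__ == "__main__":
    stripped = ["?s dbo:birthPlace ?o", "?s dbo:birthPlace ?p", "?x rdfs:label ?y"]
    for i, j, score in similarity_edges(stripped):
        print(i, j, round(score, 2))
```

The resulting edge list is the input that a graph clustering algorithm such as BorderFlow would then partition into groups of similar queries.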

Page 127: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

DBSB Query Generation (1)
• Select the most interesting SPARQL queries
  – Which are the most frequently asked SPARQL queries?
  – Which of those queries cover the most SPARQL features?
• SPARQL features
  – Overall number of triple patterns
    • Test the efficiency of join operations (CP1)
  – SPARQL pattern constructors (UNION & OPTIONAL)
    • Handle parallel execution of UNIONs (CP5)
    • Perform OPTIONALs as late as possible in the query plan (CP3)
  – Solution sequences & modifiers (DISTINCT)
    • Efficiency of duplicate elimination (CP10)
  – Filter conditions and operators (FILTER, LANG, REGEX, STR)
    • Efficiency of engines in executing filters as early as possible (CP6)

Page 128: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

DBSB Query Generation (2)
• 25 queries are selected
  – For each of the features, manually select the part of the query to be varied (IRI or filter condition)
  – Variability of the query template(s) for the chosen values is sufficiently high (>= 1000 per query template)

Method ensures that
• Executed queries during the benchmark differ
• Queries always return non-empty results

Page 129: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

Apples and Oranges [DKS+11]
• Propose structuredness to characterize datasets
  – The level of structuredness of a dataset D, with respect to a type (class) T, is determined by how well the instances of T conform to type T
  – If each instance of T has the properties defined in T, then the dataset has high structuredness with respect to T

[Figure: two bar charts showing OC(p, I(T,D)), i.e. OC(p,T) for each property p of T (name, office, ext, major, GPA); y-axis from 0 to 6]
• One dataset: all instances have the name attribute; the ext & GPA properties are encountered in 50% of the instances; the office property is found in 20% of the instances; the major property in 10% of the instances
• Highly structured dataset: all instances have all attributes

Page 130: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

Apples and Oranges [DKS+11]
• Structuredness is one of the key considerations when deciding on:
  – an appropriate data representation format (e.g., relational for structured and XML for semi-structured data)
  – the organization of data (e.g., dependency theory and normal forms for the relational model and XML).
  – data indexes (e.g., B+-tree indexes for relational and numbering scheme-based indexes for XML).
  – data querying (e.g., using SQL for relational and XPath/XQuery for XML).

In other words, structuredness permeates every aspect of data management

Page 131: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

Apples and Oranges [DKS+11]

[Figure: synthetic datasets and real datasets placed on a structuredness scale from 0.0 to 1.0, ranging from less structured datasets to highly structured (relational-like) datasets]

Page 132: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

Apples and Oranges [DKS+11]

Some important observations:
• Since TPC-H is a relational dataset, it should have high structuredness.
• There is a difference between synthetic and real datasets.
• Synthetic datasets are fairly structured and relational-like
• Real datasets cover the whole spectrum of structuredness.

[Figure: structuredness of datasets, on a scale from 0.0 to 1.0]

Existing RDF stores are tested and compared against each other with respect to datasets that are not representative of most real RDF data.

Page 133: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

Apples and Oranges [DKS+11]
• Nothing can better represent data than the data itself!
• Idea: Turn every dataset into a benchmark
1. No need to synthetically generate values
   • Use the actual data values in the dataset
2. No need to synthetically generate queries.
   • The queries that are known to run on your data can be used in the benchmark.
3. But we need to cover the structuredness spectrum
   • to get data as close as possible to real-world data
   • to see how the systems perform when data goes from very structured to less structured

Page 134: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

Counting Coins [DKS+11]

• Start with a dataset with size S and CH = 0.5
• Aim for a dataset with size S’ and CH’, where S > S’ and CH > CH’.

Process:
• Assign a coin to each triple (s,p,o) and compute the impact on CH of its removal
  – The removal reduces the size by 1. Example: Consider (person1, ext, x5304). Removing the triple from D gives a dataset with CH(T,D) = 0.467. Therefore the coin (person1, ext, x5304) = 0.5 - 0.467 = 0.033.
• Formulate (automatically) an integer programming problem whose solutions tell us how many coins to remove to achieve the desired coherence CH’ and size S’.

subject predicate object
person0 name Eric
person0 office BA7430
person0 ext x4401
person1 name Kenny
person1 office BA7349
person1 office BA5439
person1 ext x5304
person2 name Kyle
person2 ext x6281
person3 name Timmy
person3 major C.S.
person3 GPA 3.4
person4 name Stan
person4 GPA 3.8
person5 name Jimmy
person5 GPA 3.7

One of the few occasions in life where having too many coins is undesirable…
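The numbers in the example can be reproduced with a short Python sketch. It assumes that, with a single type T, CH(T,D) reduces to the fraction of (instance, property) slots that are filled, which matches the 0.5 and 0.467 quoted above; the weighting across several types is not modelled. The function names `coverage` and `coin` are hypothetical.

```python
PROPERTIES = ["name", "office", "ext", "major", "GPA"]

D = [
    ("person0", "name", "Eric"), ("person0", "office", "BA7430"),
    ("person0", "ext", "x4401"),
    ("person1", "name", "Kenny"), ("person1", "office", "BA7349"),
    ("person1", "office", "BA5439"), ("person1", "ext", "x5304"),
    ("person2", "name", "Kyle"), ("person2", "ext", "x6281"),
    ("person3", "name", "Timmy"), ("person3", "major", "C.S."),
    ("person3", "GPA", "3.4"),
    ("person4", "name", "Stan"), ("person4", "GPA", "3.8"),
    ("person5", "name", "Jimmy"), ("person5", "GPA", "3.7"),
]
INSTANCES = sorted({s for s, _, _ in D})


def coverage(triples, instances, properties):
    """Fraction of (instance, property) slots that have at least one value."""
    filled = {(s, p) for s, p, _ in triples
              if s in instances and p in properties}
    return len(filled) / (len(instances) * len(properties))


def coin(triples, triple, instances, properties):
    """Drop in coverage caused by removing a single triple."""
    remaining = [t for t in triples if t != triple]
    return (coverage(triples, instances, properties)
            - coverage(remaining, instances, properties))


if __name__ == "__main__":
    print(round(coverage(D, INSTANCES, PROPERTIES), 3))                         # 0.5
    print(round(coin(D, ("person1", "ext", "x5304"), INSTANCES, PROPERTIES), 3))  # 0.033
    print(coin(D, ("person1", "office", "BA5439"), INSTANCES, PROPERTIES))        # 0.0
```

The last line shows why multi-valued properties need special care: person1 has two office values, so removing one of them leaves the slot filled and the coin is zero.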

Page 135: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

Technical challenges in problem formulation
• Compute coins which represent the impact on structuredness of removing all triples with subjects that are instances of a type T with properties equal to p
  – Therefore one coin for each type/property combination.
• Add constraints that set lower and upper bounds on the number of coins that can be removed so as not to completely remove a property from a type.
• Add constraints which guarantee that not all instances of a type are removed.
• To deal with multi-valued properties, we add constraints that introduce a relaxation parameter ρ
  – required because of the approximation by using the average number of triples per coin.

Page 136: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

Waterloo SPARQL Diversity Test Suite [AHO+14]
• Stress existing RDF engines to reveal a wider range of query requirements as established by web applications
• Contributions
  – Definition of 2 classes of query features used to evaluate the variability of workloads and datasets
    • Structural (e.g., number of triple patterns)
    • Data-driven (affect selectivity and result cardinality)
  – In-depth analysis of existing SPARQL benchmarks using the structural and data-driven features
  – WatDiv Test Suite to stress existing RDF engines to reveal a wider range of query requirements

Page 137: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

WatDiv Structural Features (1)
1. Triple Pattern Count
  – Number of triple patterns in SPARQL graph patterns
2. Join Vertex Count
  – Number of RDF terms (IRIs, literals, blank nodes) and variables that are subjects or objects of multiple triple patterns
3. Join Vertex Degree
  – The degree of a join vertex v is the number of triple patterns whose subject or object is v

SP2Bench Q5a
SELECT DISTINCT ?person ?name
WHERE {
  ?article rdf:type bench:Article .
  ?article dc:creator ?person .
  ?inproc rdf:type bench:Inproceedings .
  ?inproc dc:creator ?person2 .
  ?person foaf:name ?name .
  ?person2 foaf:name ?name2
  FILTER (?name = ?name2)
}

Triple Count: 6
Join Vertices: ?article, ?inproc, ?person, ?person2
Join Vertex Count: 4
Join Vertex Degree: ?article: 2, ?inproc: 2, ?person: 2, ?person2: 2
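The three structural features can be computed mechanically from the triple patterns of a query. The Python sketch below applies the definitions literally to the six triple patterns of SP2Bench Q5a (the FILTER is not a triple pattern and is left out). The representation of patterns as string tuples and the function name are assumptions made for illustration.

```python
from collections import Counter

# The six triple patterns of SP2Bench Q5a, as (subject, predicate, object).
Q5A = [
    ("?article", "rdf:type", "bench:Article"),
    ("?article", "dc:creator", "?person"),
    ("?inproc", "rdf:type", "bench:Inproceedings"),
    ("?inproc", "dc:creator", "?person2"),
    ("?person", "foaf:name", "?name"),
    ("?person2", "foaf:name", "?name2"),
]


def join_vertex_degrees(patterns):
    """Map each join vertex to the number of triple patterns incident on it.

    A join vertex is a term or variable that is the subject or object of
    more than one triple pattern.
    """
    degree = Counter()
    for s, _p, o in patterns:
        for term in {s, o}:  # a term in both positions of one pattern counts once
            degree[term] += 1
    return {term: d for term, d in degree.items() if d >= 2}


if __name__ == "__main__":
    degrees = join_vertex_degrees(Q5A)
    print("triple pattern count:", len(Q5A))
    print("join vertex count:", len(degrees))
    print("join vertex degrees:", degrees)
```

Applied to Q5a, this yields six triple patterns and four join vertices (?article, ?inproc, ?person, ?person2), each of degree 2.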

Page 138: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

WatDiv Structural Features (2)
• Join Vertex Degree & Count provide a good characterization of the structural complexity of a query
  – The number of triple patterns does not properly characterize the query: two queries with the same set of triple patterns can have different structures

[Figure: example query graphs of a linear query, a snowflake query and a star query]

Page 139: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

WatDiv Structural Features (3)
• Join Vertex Type
  – Plays an important role in how RDF engines determine efficient query plans
    • E.g., star queries promote efficient merge joins
• 3 (mutually non-exclusive) types of join vertices
  – Vertex x is of type SS+ if for all triple patterns (s,p,o)*, x is the subject
  – Vertex x is of type OO+ if for all triple patterns (s,p,o)*, x is the object
  – Vertex x is of type SO+ if for triple patterns (s,p,o)*, (s’,p’,o’)*, x = s and x = o’

[Figure: example query graphs in which ?m is of type SS+, ?x is of type OO+, and ?x is of type SO+]

* Triple patterns (s,p,o) are incident on x

Page 140: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

WatDiv Data-driven Features (1)
• A system’s choice of the most efficient query plan depends on
  – (a) the characteristics of the dataset and
  – (b) the query
• If the system relies on selectivity estimations and result cardinality, the same query will have a different query plan for dataset(s) of different sizes
• Different cases:
  – Queries have a diverse mix of result cardinalities
  – Some triple patterns are very selective, others are not
  – All triple patterns are equally selective

Page 141: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

WatDiv Data-driven Features (2)
• Result Cardinality CARD(Ā, G)
  – the number of solutions in the result of the evaluation of a graph pattern Ā = <A, F> over graph G
• Filter Triple Pattern Selectivity (f-TP Selectivity) SEL^F_G(tp)
  – the ratio of distinct solution mappings of a triple pattern tp to the set of triples in graph G
• Measures
1. Result cardinality
2. Mean & standard deviation of f-TP selectivities of triple patterns
   • Important for distinguishing queries whose triple patterns are almost equally selective from queries with varying f-TP selectivities
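As a rough illustration of the selectivity measure, the following Python sketch matches a single triple pattern against a small in-memory graph and divides the number of distinct solution mappings by the number of triples. It is a sketch under assumptions: filters are omitted (so this is the unfiltered special case), variables are strings starting with `?`, and the population standard deviation is used because the slide does not say which variant is meant.

```python
from statistics import mean, pstdev


def is_var(term):
    return term.startswith("?")


def solution_mappings(tp, graph):
    """Distinct variable bindings under which the pattern `tp` matches a triple."""
    mappings = set()
    for triple in graph:
        binding = {}
        matched = True
        for pattern_term, value in zip(tp, triple):
            if is_var(pattern_term):
                # A repeated variable must bind to the same value.
                if binding.get(pattern_term, value) != value:
                    matched = False
                    break
                binding[pattern_term] = value
            elif pattern_term != value:
                matched = False
                break
        if matched:
            mappings.add(frozenset(binding.items()))
    return mappings


def selectivity(tp, graph):
    """Distinct solution mappings of `tp` divided by the number of triples."""
    return len(solution_mappings(tp, graph)) / len(graph)


if __name__ == "__main__":
    G = {
        ("a", "type", "Person"), ("b", "type", "Person"),
        ("a", "name", "Ann"), ("c", "type", "City"),
    }
    patterns = [("?x", "type", "Person"), ("?x", "name", "?n")]
    values = [selectivity(tp, G) for tp in patterns]
    print(values, mean(values), pstdev(values))
```

A small standard deviation indicates triple patterns that are almost equally selective, while a large one indicates a mix of selective and unselective patterns.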

Page 142: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

WatDiv Data-driven Features (3)
• Result Cardinality & f-TP selectivity are not sufficient
  – Intermediate solution mappings may not make it to the final result (e.g., due to filters or more restrictive joins)
  – The overall selectivity of a graph pattern can be determined by a single very selective triple pattern
• Run-time optimization techniques (e.g., sideways information passing) prune intermediate results early
• Introduce 2 features to capture the above cases
1. BGP-Restricted f-TP selectivity
2. Join-Restricted f-TP selectivity

Page 143: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

WatDiv Data-Driven Features (4)
• BGP-Restricted f-TP selectivity SEL^F_G(tp | Ā)
  – assesses how much a triple pattern contributes to the overall selectiveness of the query
  – fraction of distinct solution mappings for a triple pattern that are compatible with some solution mapping in the query result.
• Join-Restricted f-TP selectivity SEL^F_G(tp | x)
  – assesses how much a filtered triple pattern contributes to the overall selectiveness of the joins that it participates in
  – for x a join vertex and tp a triple pattern incident on x, the x-restricted f-TP selectivity of tp over graph G is the fraction of distinct solution mappings compatible with a solution mapping in the query result of the sub-query that contains all triple patterns incident on x

Page 144: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

WatDiv Test Suite (1)
• Components: Data Generator and Query Generator
• Data Generator
  – Allows users to define their own dataset, controlling
    • Entities to include
    • Topology of the graphs, allowing one to mimic the real types of data distributions in the Web
      – «well-structuredness» of entities
      – probability of entity associations
      – cardinality of property associations
  – Important: Instances of the same entity do not have the same set of attributes: breaking the «relational nature» of previous RDF benchmarks

Page 145: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

WatDiv Test Suite (2)
• Query Template Generator
  – User-specified number of templates
  – User-specified template characteristics
    • Number of triple patterns
    • Types of joins and filters in the triple patterns
  – Traverses the WatDiv schema using a random walk and generates a set of query templates
• Query Generator
  – Instantiates the query templates with terms (IRIs, literals etc.) from the RDF dataset
  – User-specified number of queries produced

Page 146: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

WatDiv Test Suite (3)
• Query Template Generator
  – Random walk on an internal representation of the schema
    • Entity types in the schema correspond to graph vertices
    • Relationships (i.e., object type properties) are graph edges
    • Vertices are annotated with datatype properties (i.e., attributes)
  – Produces a set of Basic Graph Patterns with a maximum of n triple patterns with unbound objects and subjects
  – k subjects/objects, selected uniformly at random, are replaced with placeholders
  – Placeholders are replaced with actual RDF terms randomly retrieved from the dataset

Page 147: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

Comparison of WatDiv with other RDF Benchmarks

[Figure: Copyright [AHO+14]]
• Query Workload
  – Large range of queries
    • Mean join vertex degree distributed between 2 and 10
  – Join Vertex Types:
    • 18% of queries are star joins, 4.4% in DBSB
    • 61.3% of queries are path queries, 5.4% in DBSB

Page 148: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

Comparison of WatDiv with other RDF Benchmarks

[Figure: Copyright [AHO+14]]
• Data-Driven Features
  – DBSB and BSBM cover the ends of the spectrum of mean Join-Restricted f-TP selectivity values
  – WatDiv covers the full spectrum of Restricted f-TP selectivity values
  – WatDiv covers a lower range of values for mean f-TP selectivity when compared to DBSB

General Remarks
• comparable to DBSB
• more diverse than LUBM, SP2Bench and BSBM

Page 149: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

FEASIBLE [SMN15]

• Proposes a feature-based benchmark generation approach from real queries
  – Structure-based
  – Data-driven
• Approach is similar to the WatDiv Test Suite
• Novel sampling approach for queries based on exemplars and medoids
• Proposes SELECT, ASK, CONSTRUCT and DESCRIBE SPARQL queries

Page 150: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

FEASIBLE Query Features
• Number of Triple Patterns
• Number of Join Vertices
  – Distinguishing between «star», «path», «hybrid» and «sink» vertices
• Join Vertex Degree
  – Sum of incoming and outgoing edges of the vertex
• Triple Pattern Selectivity
  – Ratio of triples that match the triple pattern over all triples in the dataset

[Figure: example query graphs in which x is a star vertex, a path vertex, a hybrid vertex and a sink vertex]

Page 151: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

FEASIBLE Benchmark Generation
• 3-step benchmark generation
• Dataset Cleaning
  – Leads to practically reliable benchmarks
• Normalization of Feature Vectors
  – Query selection process requires distances between queries to be computed
  – Normalize the query representations so that all queries are in a unit hypercube
• Query Selection
  – Based on the idea of exemplars [NS11]

Page 152: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

FEASIBLE Benchmark Generation

• Dataset Cleaning
  – Remove erroneous and zero-result queries from the set of real queries used to generate the benchmark
  – Exclude all syntactically incorrect queries
  – Attach 9 SPARQL operators (UNION, DISTINCT, OPTIONAL, ..) and 7 query features (join vertices, join vertex count etc.) to each of the queries

Page 153: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

FEASIBLE Benchmark Generation

• Normalization of Feature Vectors
  – Queries are mapped to a vector of length 16 which stores the query features
    • For binary SPARQL clauses (e.g., UNION is either used or not used), store the value 1 if the clause is used, else store the value 0
    • All non-binary features are normalized by dividing their value by the overall maximum value in the dataset
    • Query representations are thus associated with values between 0 and 1
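The normalization step can be sketched as follows in Python. This is a minimal illustration, not the FEASIBLE code: it uses three features instead of 16 for brevity, and it reads "overall maximum value in the dataset" as the maximum of each feature taken over all queries, which is an assumption. The function name and the `binary_positions` parameter are hypothetical.

```python
def normalize(vectors, binary_positions):
    """Scale each non-binary feature by its maximum over all queries.

    Binary flags (0/1) are copied unchanged, so every component of the
    result lies between 0 and 1.
    """
    width = len(vectors[0])
    maxima = [max(v[i] for v in vectors) for i in range(width)]
    normalized = []
    for v in vectors:
        row = []
        for i, value in enumerate(v):
            if i in binary_positions or maxima[i] == 0:
                row.append(float(value))
            else:
                row.append(value / maxima[i])
        normalized.append(row)
    return normalized


if __name__ == "__main__":
    # Columns (illustrative): uses UNION (binary), triple patterns, join vertex degree.
    queries = [[1, 4, 10], [0, 2, 5], [1, 8, 0]]
    for row in normalize(queries, binary_positions={0}):
        print(row)
```

After this step every query is a point in the unit hypercube, so Euclidean distances between queries are not dominated by whichever feature happens to have the largest raw values.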

Page 154: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

FEASIBLE Benchmark Generation
• Query Selection
  – Given a number N of queries to select as benchmark queries
  – A set L of cleaned and normalized queries, |L| >> N
  – Compute an N-size partition of the queries such that
    • The average distance between two points in 2 different elements of the partition is high and
    • The average distance between points within an element of the partition is small
    • Select the point that is closest to the average of each partition element and include it in the benchmark
  – Implemented by
    • Selecting exemplars (points that represent a portion of the space) that are as far as possible from each other
    • Partitioning L by mapping every point of L to one of these exemplars to compute a partition of the space
    • Selecting the medoid of each of the space partitions as a query in the benchmark
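The three implementation steps can be illustrated with a compact Python sketch over normalized feature vectors. It is a sketch under assumptions rather than the FEASIBLE implementation: exemplars are chosen with a greedy farthest-point heuristic starting from the first point, which is one plausible way to pick points "as far as possible from each other", and the medoid is taken as the cluster member with the smallest total distance to the other members. `math.dist` requires Python 3.8 or later; `n` must not exceed the number of distinct points.

```python
import math


def select_benchmark(points, n):
    """Pick n representative points: exemplars -> partition -> medoids."""
    # Step 1: greedy farthest-point exemplars (heuristic, assumption).
    exemplars = [points[0]]
    while len(exemplars) < n:
        candidates = [p for p in points if p not in exemplars]
        exemplars.append(max(
            candidates,
            key=lambda p: min(math.dist(p, e) for e in exemplars)))

    # Step 2: partition the space by assigning each point to its nearest exemplar.
    clusters = [[] for _ in exemplars]
    for p in points:
        nearest = min(range(len(exemplars)),
                      key=lambda i: math.dist(p, exemplars[i]))
        clusters[nearest].append(p)

    # Step 3: the medoid of each cluster becomes a benchmark query.
    return [min(cluster, key=lambda c: sum(math.dist(c, q) for q in cluster))
            for cluster in clusters]


if __name__ == "__main__":
    vectors = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0),
               (1.0, 1.0), (0.9, 1.0), (0.8, 1.0)]
    print(select_benchmark(vectors, 2))
```

On the toy input the two groups of points are separated cleanly, and the middle point of each group is returned as its medoid.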

Page 155: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

Overview
• Introducing Benchmarks
• A short discussion about Linked Data
  – Resource Description Framework (Data Model)
  – SPARQL (Query Language)
• Benchmarking Principles & Choke Points
• Benchmarks
  – Synthetic
  – Real
  – Benchmark Generators
• Sum up: what did we learn today?

Page 156: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

Concluding
What have we learned today? What should we do next?

Page 157: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

What did we learn?

• Why we need benchmarking
• What the principles and methods are that underlie (or should be followed during) benchmark design
  – Choke-point-based design is the new and interesting trend
• What the existing categories of benchmarks are
  – Synthetic
  – Real
  – Benchmark generation frameworks

Page 158: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

What should we do next

• No standard querying benchmarks for RDF engines (à la TPC)
• No benchmarks that are 100% RDF-oriented!
  – Benchmarks tend to be more relational-like
• Benchmark frameworks are good at providing more realistic datasets and workloads
  – Being schema and domain agnostic makes it easier for people to use those tools for their use case and for the engine of interest
  – Being highly parameterized allows a high degree of variability in the data and workloads

Page 159: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

What should we do next
• No benchmarks that consider complex and expressive OWL ontology constructs!
  – 36.4% of the LOD Cloud data sources consider OWL ontologies
  – Time to define benchmarks that go beyond testing conformance
  – Time to define RDF benchmarks with a database perspective
    • Have in mind query plans!
• Need to look at benchmarks that scale
  – Need to deal with big, big data (order of trillions of triples)
• Need to look at query workloads for SPARQL 1.1
  – aggregates, nested subqueries, …
  – do not forget updates!

Page 160: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

References
• [NPM+12] R. Othayoth Nambiar, M. Poess, A. Masland, H. R. Taheri, M. Emmerton, F. Carman, and M. Majdalany. TPC Benchmark Roadmap 2012. In TPCTC, 2012.
• [L97] Charles Levine. TPC-C: The OLTP Benchmark. In SIGMOD - Industrial Session, 1997.
• [Castro04] R. Garcia-Castro and A. Gomez-Perez. A Method for Performing an Exhaustive Evaluation of RDF(S) Importers. In WISE 2005 Workshops, 2005.
• [W90] R. P. Weicker. An overview of common benchmarks. Computer, 23(12):65–75, December 1990.
• [G93] J. Gray, editor. The Benchmark Handbook for Database and Transaction Systems (2nd Edition). Morgan Kaufmann, 1993.
• [H09] K. Huppler. The Art of Building a Good Benchmark. In TPCTC, 2009.
• [GPH05] Y. Guo, Z. Pan, and J. Heflin. LUBM: A Benchmark for OWL Knowledge Base Systems. Journal of Web Semantics: Science, Services and Agents on the World Wide Web, Volume 3, Issue 2-3, October 2005, Pages 158-182.

Page 161: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

References

•  [BNE14] P. Boncz, T. Neumann, O. Erling. TPC-H Analyzed: Hidden Messages and Lessons Learned from an Influential Benchmark. Performance Characterization and Benchmarking. In TPCTC 2013, Revised Selected Papers.
•  [SHM+09] M. Schmidt, T. Hornung, M. Meier, C. Pinkel, G. Lausen. SP2Bench: A SPARQL Performance Benchmark. Semantic Web Information Management, 2009.
•  [FOAF] Friend of A Friend. http://www.foaf-project.org/
•  [SWRC] SWRC Ontology. http://ontoware.org/swrc/
•  [DC] Dublin Core Metadata Initiative. http://dublincore.org/
•  [BS09] C. Bizer and A. Schultz. The Berlin SPARQL Benchmark. Int. J. Semantic Web and Inf. Sys., 5(2), 2009.
•  [BSBM] Berlin SPARQL Benchmark (BSBM) Specification - V3.1. http://wifo5-03.informatik.unimannheim.de/bizer/berlinsparqlbenchmark/spec/index.html
•  [RU09] N. Redaschi and UniProt Consortium. UniProt in RDF: Tackling Data Integration and Distributed Annotation with the Semantic Web. In Biocuration Conference, 2009.

Page 162: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

References

•  [UniProtKB] UniProtKB Queries. http://www.uniprot.org/help/query-fields
•  [NW10] T. Neumann and G. Weikum. The RDF-3X engine for scalable management of RDF data. The VLDB Journal, 19(1), 2010.
•  [NW09] T. Neumann and G. Weikum. Scalable join processing on very large RDF graphs. In SIGMOD, 2009.
•  [SKW07] F. M. Suchanek, G. Kasneci and G. Weikum. YAGO: A Core of Semantic Knowledge Unifying WordNet and Wikipedia. In WWW, 2007.
•  [Barton] The MIT Barton Library dataset. http://simile.mit.edu/rdf-test-data/
•  [PHS10] H. Patni, C. Henson, and A. Sheth. Linked sensor data. 2010.
•  [TPC-H] The TPC-H Homepage. http://www.tpc.org/tpch/
•  [WordNet] WordNet: A lexical database for English. http://wordnet.princeton.edu/
•  [XMark] An XML Benchmark Project. http://www.xml-benchmark.org/
•  [MLA+14] M. Morsey, J. Lehmann, S. Auer, A-C. Ngonga Ngomo. DBpedia SPARQL Benchmark - Performance Assessment with Real Queries on Real Data. ISWC, 2011.

Page 163: Assessing the performance of RDF Engines: Discussing RDF Benchmarks

References

•  [NS11] A-C. Ngonga Ngomo, S. Auer. LIMES: a time-efficient approach for large-scale link discovery on the web of data. IJCAI'11.
•  [NS09] A-C. Ngonga Ngomo and D. Schumacher. Borderflow: A local graph clustering algorithm for natural language processing. In CICLing, 2009.
•  [AHO+14] G. Aluc, O. Hartig, T. Ozsu, K. Daudjee. Diversified Stress Testing of RDF Data Management Systems. In ISWC, 2014.
•  [SMN15] M. Saleem, Q. Mehmood, and A-C. Ngonga Ngomo. FEASIBLE: A Feature-Based SPARQL Benchmark Generation Framework. ISWC 2015.
•  [DKS+11] S. Duan, A. Kementsietsidis, Kavitha Srinivas and Octavian Udrea. Apples and oranges: a comparison of RDF benchmarks and real RDF datasets. In SIGMOD, 2011.
•  [Wilkinson06] K. Wilkinson. Jena property table implementation. In SSWS, 2006.
•  [AMM+07] Daniel J. Abadi, Adam Marcus, Samuel Madden, Katherine J. Hollenbach: Scalable Semantic Web Data Management Using Vertical Partitioning. VLDB 2007: 411-42
•  [BDK+13] Mihaela A. Bornea, Julian Dolby, Anastasios Kementsietsidis, Kavitha Srinivas, Patrick Dantressangle, Octavian Udrea, Bishwaranjan Bhattacharjee: Building an efficient RDF store over a relational database. SIGMOD Conference 2013: 121-132

Page 164: Assessing the performance of RDF Engines: Discussing RDF Benchmarks