Page 1: LDBC Semantic Publishing Benchmark v2.0

LDBC Semantic Publishing Benchmark v2.0

April 2015

LDBC Semantic Publishing Benchmark v2.0 #1 Apr 2015

Page 2: LDBC Semantic Publishing Benchmark v2.0

• The LDBC Semantic Publishing Benchmark

• Motivation to Evolve It

• The New Reference Datasets

• Changes in the generated metadata

• Ontologies and Required Rule-sets

• The Concrete Inference patterns

• Changes in the Basic Interactive Query Mix

• GraphDB Configuration Disclosure

• Future plans

Outline


Page 3: LDBC Semantic Publishing Benchmark v2.0

• LDBC is an industry association that aims to create TPC-like benchmarks for RDF triplestores and graph DBs
– Started off as a 2-year EC-funded project
– Check http://ldbcouncil.org/

• Members include: Ontotext, OpenLink, neo4j, CWI, UPM, ORACLE, IBM, SYSTAP, Sparsity, SPARQLCity

• Benchmarks
– Semantic Publishing Benchmark
– Social Network Benchmark

Linked Data Benchmarking Council

Page 4: LDBC Semantic Publishing Benchmark v2.0

• Based on BBC’s Dynamic Semantic Publishing pattern

• Triplestore holds both reference data and content metadata
– Reference data: the master data in Dynamic Semantic Publishing
– Taxonomies and entity datasets used to describe “creative works”

• The main interactions with the repository are:
– updates, adding new metadata or updating the reference knowledge
– aggregation queries that retrieve content according to various criteria

LDBC Semantic Publishing Benchmark

Page 5: LDBC Semantic Publishing Benchmark v2.0

[Slide image] BBC Future Media, MMXII: 10,000+ dynamic aggregations

Page 6: LDBC Semantic Publishing Benchmark v2.0

• Simulate the work performed by journalists or editors with the system
– E.g. enriching journalistic assets with metadata: description, creation date, location, etc.

• Run simultaneously
– Editorial updates should immediately impact the results of the aggregation queries
– A typical setup is 2 editorial + 8 aggregation agents

• Provide a constant stream of update operations

• Editorial operations: INSERT, DELETE, UPDATE (Insert + Update)

Editorial Workloads

Page 7: LDBC Semantic Publishing Benchmark v2.0

• Basic (interactive) query mix: 12 Queries

– Search queries

– Full-text search queries

– Aggregation queries

– Geospatial queries

• Advanced (interactive + analytical) query mixes

– 25 Queries altogether

– Analytical queries

– Faceted search queries

– Drill-down queries

Aggregation Workload


Page 8: LDBC Semantic Publishing Benchmark v2.0

• The dataset can be generated at different scale factors: 64M, 256M, 1B, …

• There are 25 annotations per “creative work”

• 1B triples dataset contains 22M triples of reference data and metadata for 40M documents

• More information at http://ldbcouncil.org/developer/spb

• You can download everything from GitHub
– https://github.com/ldbc/ldbc_spb_bm_2.0/blob/master/doc
– https://github.com/ldbc/ldbc_spb_bm_2.0

LDBC SPB Data

Page 9: LDBC Semantic Publishing Benchmark v2.0

• Much bigger reference dataset: from 170k to 22M statements
– 7M statements of spatial data from GeoNames about Europe
– 14M statements from DBPedia for Companies, Persons, Events

• Interconnected reference data: more than 5M links
– owl:sameAs links between DBPedia and Geonames

• More comprehensive usage of inference
– Still the simplest possible flavor of OWL is used
• rdfs:subClassOf, rdfs:subPropertyOf, owl:TransitiveProperty, owl:sameAs
– Transitive closure over company control and geographic nesting
– Simpler queries through usage of super-properties

• Retrieval of relevant content through links in the reference data

Summary of SPB v.2.0 Changes

Page 10: LDBC Semantic Publishing Benchmark v2.0

• Audited runs using GraphDB-SE 6.2beta

• Hardware:
– 64GB RAM
– Intel Xeon E5-1650 v3 @ 3.5GHz
– 2x SSD Samsung 845 DC

• Benchmark configuration
– 8 reading / 2 writing agents
– 10 min warmup, 30 min benchmark

• SF1, 64M results: Updates/s: 10, Selects/s: 105

• SF2, 256M results: Updates/s: 10, Selects/s: 29
– The slowdown of selects is due to an inefficiency in Q10, to be fixed soon

GraphDB – First Results

Page 11: LDBC Semantic Publishing Benchmark v2.0

• Engine shootout
– Implemented on both GraphDB and Virtuoso
– MarkLogic and BlazeGraph are working to implement it
– No assumptions about inference strategy or special-purpose plug-ins

• Performance engineering and sizing

• Quality assurance
– to avoid performance degradation

• Training
– It has a broad variety of queries
– Realistic & understandable dataset

What SPB is good for


Page 13: LDBC Semantic Publishing Benchmark v2.0

SPB needed to evolve in order to:

• Allow for retrieval of semantically relevant content
– Based on rich metadata descriptions and diverse reference knowledge

• Demonstrate simplified and more efficient querying
– By using named graphs and reasoning

• Tune interactive queries to be … more interactive

And this way to present:

• More advanced and pertinent usage of triplestores

• A bigger challenge to the engines

Change is Needed

Page 14: LDBC Semantic Publishing Benchmark v2.0

• SPB v.1.0 had a very small reference dataset

• Reference data was not interconnected
– Very few relations between entities

• Descriptions of content assets (e.g. articles) refer to the Reference Data, but not to one another
– Star-shaped data graph with a small reference dataset in the middle
– This way SPB was not using the full potential of graph databases

• Very limited usage of very trivial inference
– Just a few sub-classes and sub-properties

Issues with SPB v.1.0

Page 15: LDBC Semantic Publishing Benchmark v2.0

• Bigger and better interconnected Reference Data
– More entity descriptions, which are connected to one another

• A better connected dataset
– Make cross-references between the different pieces of content

• Make use of inference
– Pertinent use of some transitive, symmetric and inverse properties
– Such semantics can foster retrieval of relevant content

• More interesting and challenging query mixes
– But also make queries “nicer”, i.e. simpler and cleaner

• Testing high availability
– FT acceptance tests for an HA cluster are a great starting point
– Too ambitious for SPB v.2.0

Directions for Improvement


Page 17: LDBC Semantic Publishing Benchmark v2.0

• 22M explicit statements in SPB v.2.0
– vs. 127k statements in SPB 1.0: BBC lists + tiny fractions of GeoNames

• Added reference data from DBPedia 2014
– Extracted with queries like: DESCRIBE ?e WHERE { ?e a dbp-ont:Company }
– Companies (85,000 entities)
– Events (50,000 entities)
– Persons (1M entities)

• Geonames data for Europe
– All European countries, w/o RUS, UKR, BLR; 650,000 locations total
– Some properties and location types irrelevant to SPB excluded

• owl:sameAs links between DBpedia and Geonames
– 500k owl:sameAs mappings

Extended Reference Dataset

Page 18: LDBC Semantic Publishing Benchmark v2.0

Interconnected Reference Dataset

• Substantial volume of connections between entities

• Geonames comes with a hierarchical relationship, gn:parentFeature, defining the nesting of locations

• DBPedia inter-entity relationships*:

              To Company    To Person    To Place/gn:Feature    To Event
  Company         40,797       26,675                218,636          18
  Person          89,506    1,324,425              3,380,145     145,892
  Event            5,114      154,207                140,579      35,442

* numbers shown in the table are approximate

Page 19: LDBC Semantic Publishing Benchmark v2.0

• GraphDB’s RDFRank feature is used to calculate a rank for all the entities in the Reference Data
– RDFRank calculates a measure of importance for all URIs in an RDF graph, based on Google’s PageRank algorithm

• These RDFRanks are calculated using GraphDB after the entire Reference Data is loaded
– This way all sorts of relationships in the graph are considered during the calculation, including those that were logically inferred

• RDFRanks are inserted using the system predicate <http://www.ldbcouncil.org/spb#hasRDFRank>

• These ranks are available as a part of the Ref. Data
– No need to compute them again during loading

RDF-Rank for the Reference Data
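As an illustration, ranks stored under that predicate can be queried directly. The sketch below lists the highest-ranked entities; the spb: prefix binding and the LIMIT are assumptions for illustration, not part of the benchmark:

```sparql
# spb: assumed to be <http://www.ldbcouncil.org/spb#>
SELECT ?entity ?rank
WHERE { ?entity spb:hasRDFRank ?rank . }
ORDER BY DESC(?rank)
LIMIT 100
```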


Page 20: LDBC Semantic Publishing Benchmark v2.0

• The Data Generator uses a set of “popular entities”
– Those are referred to in 30% of the content-to-entity relations/tags
– Heuristics are used to produce more realistic data distributions
– This was implemented in SPB 1.0; no change in SPB 2.0

• Popular entities are those with the top 5% RDF Rank
– This is the change in SPB v.2.0
– Before that, the popular entities were selected randomly
– This way, we get a more realistic dataset, where the entities that are more often used to tag content also have better connectivity in the Reference Data

• In the future, RDF-ranks can be used for other purposes too
– E.g. ordering, disambiguation, etc.

RDF-Rank Used for Popular Entities


Page 22: LDBC Semantic Publishing Benchmark v2.0

• Generated 3x more relationships between Creative Works and Entities than in SPB 1.0
– More recent use cases in publishing adopt rich metadata descriptions with more than 10 references to relevant entities and concepts
– In SPB 1.0 there are on average a bit fewer than 3 entity references per creative work, based on distributions from an old BBC archive
– For SPB 2.0 we didn’t have access to real rich semantic metadata that would allow us to derive content-to-concept frequency distributions
– We used the same distributions from BBC, but tripled the concept references for each of them (from 1-2-3-… tags to 3-6-9-… tags/asset)

• On average about 8 content-to-concept references
– Altogether about 25 statements per creative work description
– All other heuristics and specifics of the metadata generator were preserved (popular entities, CW sequences resembling storylines, etc.)

Changes in the Metadata

Page 23: LDBC Semantic Publishing Benchmark v2.0

• Geonames URIs are used for content-to-location tags
– As in SPB v.1.0, each creative work description refers (through the cwork:mentions property) to exactly one location
• These references are exploited in queries with geo-spatial constraints
– Most of the geo-spatial information in the Reference Data comes from Geonames, but 500 thousand locations also have DBPedia URIs
• these come from the DBPedia-to-Geonames owl:sameAs mappings
– Intentionally, DBPedia URIs are not used in content metadata

• DBPedia location URIs are used in queries
– The substitution parameters for locations in the corresponding queries use only DBpedia locations
• This way the corresponding queries would return no results without owl:sameAs reasoning

Changes in the Metadata (2)

Page 24: LDBC Semantic Publishing Benchmark v2.0

• As a result of the changes in the Reference Data size and the metadata size, the number of Creative Works included at the same scale factor differs between SPB v.1.0 and SPB v.2.0

Changes in the Metadata - Stats

                                                   SPB 1.0    SPB 2.0
  Reference Data size (explicit statements)           170k        22M
  Creative Work descr. size (explicit st./CW)           19         25
  Metadata in 50M dataset (explicit statements)        50M        28M
  Creative Works in 50M dataset (count)               2.6M       1.1M
  Creative Work-to-Entity relationships in 50M          7M         9M
  Metadata in 1B dataset (explicit statements)          1B       978M
  Creative Works in 1B dataset (count)                 53M        39M
  Creative Work-to-Entity relationships in 1B         137M       313M


Page 26: LDBC Semantic Publishing Benchmark v2.0

The Rule-sets

• Complete inference in SPB 2.0 requires support for the following primitives:
– RDFS: subPropertyOf, subClassOf
– OWL: TransitiveProperty, SymmetricProperty, sameAs

• Any triplestore with OWL 2 RL reasoning support will be able to process SPB v2.0 correctly
– In fact, a much reduced rule-set is sufficient for complete reasoning, as OWL 2 RL contains a host of primitives and rules that SPB 2.0 does not make use of, e.g. all onProperty restrictions, class and property equivalence, property chains, etc.
– The popular (but not W3C standardized) OWL Horst profile is sufficient for reasoning with SPB v 2.0
• OWL Horst refers to the pD* entailment defined by Herman ter Horst in: Completeness, decidability and complexity of entailment for RDF Schema and a semantic extension involving the OWL vocabulary. J. Web Sem. 3(2-3): 79-115 (2005)
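The required primitives can be sketched as entailment rules. Below is an illustrative rendering as SPARQL CONSTRUCT templates (one query per rule; prefix declarations omitted, as in the deck's other queries). This is not the rule syntax of any particular engine:

```sparql
# rdfs:subPropertyOf — e.g. a cwork:about triple entails a cwork:tag triple
CONSTRUCT { ?s ?super ?o }
WHERE     { ?sub rdfs:subPropertyOf ?super . ?s ?sub ?o . }

# rdfs:subClassOf — instances of sub-classes are also typed with the super-class
CONSTRUCT { ?i a ?super }
WHERE     { ?sub rdfs:subClassOf ?super . ?i a ?sub . }

# owl:TransitiveProperty — e.g. geo-ont:parentFeature chains collapse into direct links
CONSTRUCT { ?x ?p ?z }
WHERE     { ?p a owl:TransitiveProperty . ?x ?p ?y . ?y ?p ?z . }

# owl:sameAs — statements carry over to equivalent URIs
CONSTRUCT { ?s2 ?p ?o }
WHERE     { ?s1 owl:sameAs ?s2 . ?s1 ?p ?o . }
```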

Page 27: LDBC Semantic Publishing Benchmark v2.0

• Modified versions of the BBC Ontologies
– As in SPB v.1.0, the BBC Core ontology defines relationships between Persons – Organizations – Locations – Events
– Those were mapped to corresponding DBPedia and Geonames classes and relationships
• bbccore:Thing is defined as a super-class of dbp-ont:Company, dbp-ont:Event, foaf:Person, geonames-ontology:Feature (geographic feature)
– dbp-ont:parentCompany is defined to be an owl:TransitiveProperty
– These extra definitions are added at the end of the BBC Core ontology

• Added an SPB-modified version of the Geonames ontology
– Stripped out unnecessary pieces, such as owl:onProperty restrictions that impose cardinality constraints

New and Updated Ontologies


Page 29: LDBC Semantic Publishing Benchmark v2.0

• Transitive closure of location-nesting relationships
– The geo-ont:parentFeature property is used in GeoNames to link each location to the larger locations that it is part of
– geo-ont:parentFeature is an owl:TransitiveProperty in the GN ontology
– This way, if Munich has gn:parentFeature Bavaria, which in turn has gn:parentFeature Germany, then reasoning should make sure that Germany also appears as a gn:parentFeature of Munich

• Transitive closure of dbp-ont:parentCompany
– Inference unveils indirect company control relationships

• owl:sameAs relations between Geonames features and DBPedia URIs for the same locations:
– The standard semantics of owl:sameAs requires that each statement asserted using the Geonames URIs should be inferable/retrievable also with the mapped DBPedia URIs

Concrete Inference Patterns
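The Munich example can be made concrete with a small sketch. The gn: entity URIs below are illustrative placeholders, not GeoNames' actual identifiers:

```sparql
# Asserted triples (placeholder URIs):
#   gn:Munich  geo-ont:parentFeature gn:Bavaria .
#   gn:Bavaria geo-ont:parentFeature gn:Germany .
#
# Because geo-ont:parentFeature is an owl:TransitiveProperty,
# reasoning additionally entails:
#   gn:Munich  geo-ont:parentFeature gn:Germany .

SELECT ?ancestor
WHERE { gn:Munich geo-ont:parentFeature ?ancestor }
# returns gn:Bavaria plus the inferred gn:Germany
```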

Page 30: LDBC Semantic Publishing Benchmark v2.0

• The inference patterns from SPB v.1.0 are still there:
– cwork:tag statements are inferred from each of its sub-properties, cwork:about and cwork:mentions
– <?cw rdf:type cwork:CreativeWork> statements are inferred for instances of each of its sub-classes
– These two are the only reasoning patterns that apply to the generated content metadata; all the others apply only to the Reference Data

Concrete Inference Patterns (2)

Page 31: LDBC Semantic Publishing Benchmark v2.0

• If brute-force materialization is used, the Reference Data expands from 22M statements to about 100M
– If owl:sameAs expansion is not considered, materialization on top of the Reference Data adds about 7M statements, most of those coming from the transitive closure of geo-ont:parentFeature
– Several triplestores (e.g. ORACLE and GraphDB) that use forward-chaining have specific mechanisms that allow them to handle owl:sameAs reasoning in a sort of hybrid manner, without expanding their indices. After materialization, such engines will have to deal with 29M statements of Reference Data, instead of 100M

• Brute-force materialization of the generated metadata describing creative works would double it
– In SPB v.1.0 the expansion factor was 1.6, but now there are more entity references and also owl:sameAs equivalents of the location URIs

Inference Statistics and Ratios


Page 33: LDBC Semantic Publishing Benchmark v2.0

Modified Queries in the Basic Interactive Mix

• SPB 2.0 changes only the Basic Interactive query mix
– The other query mixes are mostly unchanged

• In most of the queries, the cwork:about and cwork:mentions properties that link a creative work to an entity have been replaced with their super-property cwork:tag
– This applies also to the Advanced Interactive and the Analytical use cases
– For most of the queries, both types of relations are equally relevant
– Using the super-property makes the query simpler, compared to other approaches to querying both relations (e.g. a UNION)
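To illustrate the simplification, here is the same content-to-entity constraint written both ways (a sketch using Sofia as the tagged entity; prefix declarations omitted, as in the deck's other queries):

```sparql
# Without reasoning: both sub-properties via an explicit UNION
SELECT ?cw WHERE {
  { ?cw cwork:about    <http://dbpedia.org/resource/Sofia> . }
  UNION
  { ?cw cwork:mentions <http://dbpedia.org/resource/Sofia> . }
}

# With rdfs:subPropertyOf reasoning: one pattern over the super-property
SELECT ?cw WHERE {
  ?cw cwork:tag <http://dbpedia.org/resource/Sofia> .
}
```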

Page 34: LDBC Semantic Publishing Benchmark v2.0

New Queries

• The Basic Interactive query mix now contains 3 new queries exploring the interconnectedness of the reference data

• Q10: Retrieve CWs that mention locations in the same province (A.ADM1) as the specified one
– There is an additional constraint on the time interval (5 days)

• Q11: Retrieve the most recent CWs that are tagged with entities related to a specific popular entity
– Relations can be inbound and outbound; explicit or inferred

• Q12: Retrieve all information about CWs
– Using the SPARQL GRAPH clause
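Q12's text is not reproduced in this deck. A GRAPH-based retrieval of a creative work's full description might look roughly as follows; this is a hypothetical sketch, and the graph URI and query shape are assumptions, not the benchmark's actual query:

```sparql
# Fetch every statement from the named graph holding one CW's description.
# <http://example.org/cw/graph-1> is a placeholder; the real benchmark
# substitutes a concrete creative-work graph URI here.
SELECT ?s ?p ?o
WHERE {
  GRAPH <http://example.org/cw/graph-1> {
    ?s ?p ?o .
  }
}
```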

Page 35: LDBC Semantic Publishing Benchmark v2.0

Q10: News from the region

SELECT ?cw ?title ?dateModified {
  <http://dbpedia.org/resource/Sofia> geo-ont:parentFeature ?province .
  ?province geo-ont:featureCode geo-ont:A.ADM1 .
  {
    ?location geo-ont:parentFeature ?province .
  } UNION {
    BIND(?province as ?location) .
  }
  ?cw a cwork:CreativeWork ;
      cwork:tag ?location ;
      cwork:title ?title ;
      cwork:dateModified ?dateModified .
  FILTER(?dateModified >= "2011-05-14T00:00:00.000"^^<http://www.w3.org/2001/XMLSchema#dateTime>
      && ?dateModified < "2011-05-19T23:59:59.999"^^<http://www.w3.org/2001/XMLSchema#dateTime>)
}
LIMIT 100

Page 36: LDBC Semantic Publishing Benchmark v2.0

Q11: News about related entities

SELECT DISTINCT ?cw ?title ?description ?dateModified ?primaryContent
{
  {
    <http://dbpedia.org/resource/Teresa_Fedor> ?p ?e .
  } UNION {
    ?e ?p <http://dbpedia.org/resource/Teresa_Fedor> .
  }
  ?e a core:Thing .
  ?cw cwork:tag ?e ;
      cwork:title ?title ;
      cwork:description ?description ;
      cwork:dateModified ?dateModified ;
      bbc:primaryContentOf ?primaryContent .
}
ORDER BY DESC(?dateModified)
LIMIT 100

Page 37: LDBC Semantic Publishing Benchmark v2.0

Trivial Validation Opportunities

• The validation mechanisms from SPB v.1.0 remain unchanged

• With SPB 2.0, if specific inference patterns are not supported by the engine, some of the queries will return zero results
– If owl:sameAs inference is not supported, Q10 will return 0 results
– If transitive properties are not supported, Q10 will return 0 results
• Q11 will return a smaller number of results (and a higher ratio of zero-result runs)
– If rdfs:subPropertyOf is not supported, Q2-Q10 will return 0 results
– If rdfs:subClassOf is not supported, Q1, Q2, Q6 and Q8 will return no results


Page 39: LDBC Semantic Publishing Benchmark v2.0

1. Reasoning performed through forward-chaining with a custom ruleset
– This is the rdfs-optimized ruleset of GraphDB, with added rules to support transitive, inverse and symmetric properties

2. The owl:sameAs optimization of GraphDB is beneficial
– This is the standard behavior of GraphDB, but it helps a lot
– Query-time tuning is used to disable expansion of results with respect to owl:sameAs-equivalent URIs

3. Geo-spatial index in Q6 of the basic interactive mix

4. Lucene Connector for full-text search in Q8

Note: Points #2-#4 above help query time, but slow down updates in GraphDB, as does materialization

GraphDB Configuration


Page 41: LDBC Semantic Publishing Benchmark v2.0

• Relationships between pieces of content
– At present, Creative Works are related only to Entities, not to other CWs
– These could be StoryLines or Collections of MM assets, e.g. an article with a few images related to it

• Better modelling of content-to-entity cardinalities
– Based on data from FT

• Enriched query sets and realistic query frequencies
– Based on query logs from FT

• Loading/updating big datasets in live instances

• Testing high availability
– FT acceptance tests for a High Availability cluster are a great starting point

Future Plans