Querying and reasoning over large scale building datasets: an outline of a performance benchmark
Pieter Pauwels, Tarcisio Mendes de Farias, Chi Zhang, Ana Roxin, Jakob Beetz, Jos De Roo, Christophe Nicolle
International Workshop on Semantic Big Data (SBD 2016), in conjunction with the 2016 ACM SIGMOD Conference, San Francisco, USA
Ana Roxin – ana-maria.roxin@u-bourgogne.fr
Pieter Pauwels – Pieter.pauwels@ugent.be
Problem identified
◼ Different implementations exist for the components (TBox, ABox, RBox) of such a semantic approach:
   • diverse reasoning engines
   • diverse query processing techniques
   • diverse query handling
   • diverse dataset sizes
   • diverse dataset complexity
◼ An appropriate rule and query execution performance benchmark is missing
◼ Expressiveness vs. performance
TBox – the ifcOWL ontology
◼ All building models are encoded using the ifcOWL ontology, built up under the impulse of numerous initiatives during the last 10 years
◼ The ontology used is the one made publicly available by the buildingSMART Linked Data Working Group (LDWG):
   • http://ifcowl.openbimstandards.org/IFC4#
   • http://ifcowl.openbimstandards.org/IFC4_ADD1#
   • http://ifcowl.openbimstandards.org/IFC2X3_TC1#
   • http://ifcowl.openbimstandards.org/IFC2X3_Final#
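By way of illustration, an ABox fragment typed against this TBox might look like the following Turtle sketch (the inst: namespace and the instance names are hypothetical, not taken from the benchmark data):

```turtle
@prefix ifcowl: <http://ifcowl.openbimstandards.org/IFC4#> .
@prefix inst:   <http://example.org/buildingmodel#> .

# Minimal ABox sketch: a wall and a storey typed against the ifcOWL TBox.
# Attribute values are omitted; in ifcOWL they are wrapped in EXPRESS
# datatype entities rather than attached as plain literals.
inst:wall_001   a ifcowl:IfcWall .
inst:storey_001 a ifcowl:IfcBuildingStorey .
```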
RBox – Data transformation rules
◼ Need for a representative set of rewrite rules
◼ 68 manually built rules
◼ Classified in several rule sets according to their content

Rule Set (RS) | Description
RS1 | Contains 2 rules for rewriting property set references into additional property statements sbd:hasPropertySet and sbd:hasProperty. This is a small, yet often used rule set that can be used in many contexts to simplify querying and data publication of common simple properties attached to IFC entity instances.
RS2 | Includes 31 rules, all involving subtypes of the IfcRelationship class (e.g. ifcowl:IfcRelAssigns, ifcowl:IfcRelDecomposes, ifcowl:IfcRelAssociates, ifcowl:IfcRelDefines, ifcowl:IfcRelConnects).
RS3 | Contains 3 rules related to handling lists in IFC.
RS4 | Contains one rule that allows wrapping simple data types.
RS5 | Consists of 20 rules for inferring single property statements sbd:hasPropertySet and sbd:hasProperty.
RS6 | Extends RS5 and RS1 with 6 additional rules for inferring whether an object is internal or external to a building.
RS7 | Contains 7 rules dealing with the (de)composition of building spaces and spatial elements.
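A rewrite rule in the spirit of RS1 could be sketched in N3 as follows. This is a simplified illustration, not one of the 68 benchmark rules: the sbd: namespace IRI is assumed, and the ifcOWL attribute property names (which follow the attributeName_EntityName convention) as well as the flattening of set-valued attributes are simplifications.

```n3
@prefix ifcowl: <http://ifcowl.openbimstandards.org/IFC4#> .
@prefix sbd:    <http://example.org/sbd#> .

# If a property-definition relationship links an object to a property set,
# assert a direct sbd:hasPropertySet statement on the object.
{
  ?rel a ifcowl:IfcRelDefinesByProperties ;
       ifcowl:relatedObjects_IfcRelDefinesByProperties ?obj ;
       ifcowl:relatingPropertyDefinition_IfcRelDefinesByProperties ?pset .
}
=>
{
  ?obj sbd:hasPropertySet ?pset .
} .
```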
SPIN + Jena TDB
• Implemented based on the open-source APIs of TopBraid SPIN (SPIN API 1.4.0) and Apache Jena (Jena Core 2.11.0, Jena ARQ 2.11.0, Jena TDB 1.0.0).
• Rules are written with TopBraid Composer Free Edition and exported as RDF Turtle files.
• A small Java program reads the RDF models, schema and rules from the TDB store and queries the data.
• All SPARQL queries are configured using the Jena org.apache.jena.sparql.algebra package.
• To avoid unnecessary reasoning processes, only the RDFS vocabulary is supported in this test environment.
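For illustration, a query over the rewritten data might look like the following SPARQL sketch. This is an assumption for readability, not one of the benchmark's actual queries Q1–Q3, and the sbd: namespace IRI is hypothetical:

```sparql
PREFIX ifcowl: <http://ifcowl.openbimstandards.org/IFC4#>
PREFIX sbd:    <http://example.org/sbd#>

# Retrieve every wall together with the property sets inferred by the
# RS1/RS5-style rewrite rules.
SELECT ?wall ?pset
WHERE {
  ?wall a ifcowl:IfcWall ;
        sbd:hasPropertySet ?pset .
}
```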
EYE
• Version ‘EYE-Winter16.0302.1557’ (‘SWI-Prolog 7.2.3 (amd64): Aug 25 2015, 12:24:59’).
• EYE is a semi-backward reasoner enhanced with Euler path detection.
• As our rule set currently contains only rules using =>, forward reasoning takes place.
• Each command is executed 5 times.
• Each command includes the full ontology, the full set of rules and the RDFS vocabulary, as well as one of the 369 building model files and one of the 3 query files.
• No triple store is used: triples are processed directly from the considered files.
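EYE takes its query files as N3 rules whose conclusions form the answer set. A query file in that style might be sketched as follows (the sbd: namespace IRI and the query shape are assumptions, not the benchmark's actual query files):

```n3
@prefix sbd: <http://example.org/sbd#> .

# An EYE-style query file: for every match of the premise, the conclusion
# is emitted as part of the answer.
{ ?obj sbd:hasPropertySet ?pset . }
=>
{ ?obj sbd:hasPropertySet ?pset . } .
```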
◼ Central server supplied by the University of Burgundy, research group CheckSem, with the following specifications: Ubuntu OS, Intel Xeon CPU E5-2430 at 2.2 GHz, 6 cores and 16 GB of DDR3 RAM
◼ 3 Virtual Machines (VMs) were set up on this central server: SPIN VM (Jena TDB), EYE VM (EYE inference engine), Stardog VM (Stardog triple store)
◼ The VMs were managed as separate test environments:
   • each VM had 2 of the 6 cores allocated
   • each contained the above resources (ontologies, data, rules, queries)
Indexing algorithms, query rewriting techniques, and rule handling strategies
• The three considered procedures are quite far apart from each other, which explains the considerable performance differences, not only between the procedures but also between diverse usages within one and the same system.
• The algorithms and optimization techniques differ per approach: there are differences in the indexing algorithms, query rewriting techniques and rule handling strategies used.

Forward- versus backward-chaining
• The disadvantage of a forward-chaining reasoning process is that millions of triples can be materialized (EYE, SPIN for Q1 and Q2).
• Backward-chaining reasoning avoids triple materialization, thus saving query execution time (Stardog, SPIN for Q3).

Type of data in the building model
• Query Q3 triggers a rule that in turn triggers several other rules in the rule set. If the first rule does not fire, however, the process stops early.
• Query Q2, by contrast, fires relatively long rules; making these matches takes more time in all three approaches.

Impact of the triple store
• Loading files into memory at query execution time leads to considerable delays.

Impact of the number of output results
• Linear relation: the more results are available, the more triples need to be matched, leading to more assertions.