23rd International World Wide Web Conference, 10th April 2014, Seoul, Korea. TripleProv: Efficient Processing of Lineage Queries over a Native RDF Store. Marcin Wylot 1, Philippe Cudré-Mauroux 1, and Paul Groth 2. 1) eXascale Infolab, University of Fribourg, Switzerland 2) Web & Media Group, VU University Amsterdam, Netherlands
TripleProv: Efficient Processing of Lineage Queries over a Native RDF Store
Given the heterogeneity of the data one can find on the Linked Data cloud, being able to trace back the provenance of query results is rapidly becoming a must-have feature of RDF systems. While provenance models have been extensively discussed in recent years, little attention has been given to the efficient implementation of provenance-enabled queries inside data stores. This paper introduces TripleProv: a new system extending a native RDF store to efficiently handle such queries. TripleProv implements two different storage models to physically co-locate lineage and instance data, and for each of them implements algorithms for tracing provenance at two granularity levels. In the following, we present the overall architecture of our system, its different lineage storage models, and the various query execution strategies we have implemented to efficiently answer provenance-enabled queries. In addition, we present the results of a comprehensive empirical evaluation of our system over two different datasets and workloads.
Transcript
23rd International World Wide Web Conference, 10th April 2014, Seoul, Korea
TripleProv: Efficient Processing of Lineage Queries over a Native RDF Store
Marcin Wylot 1, Philippe Cudré-Mauroux 1, and Paul Groth 2
1) eXascale Infolab, University of Fribourg, Switzerland
2) Web & Media Group, VU University Amsterdam, Netherlands
Outline
➢ Motivation
➢ Provenance Polynomials
➢ System
➢ Results
Data Provenance
“Provenance is information about entities, activities, and people involved in producing a piece of data or thing, which can be used to form assessments about its quality, reliability or trustworthiness.”
How a query answer was derived: what data was combined to produce the result.
Data Integration
➢ Integrated and summarized data
➢ Trust, transparency, and cost
➢ Capability to pinpoint the exact source from which the result was selected
➢ Capability to trace back the complete list of sources and how they were combined to deliver a result
Querying Distributed Data Sources
How exactly was the answer derived?
Application: Post-query Calculations
➢ Scores or probabilities for query results
➢ Result ranking
➢ Compute trust
➢ Information quality based on used sources
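One way such post-query calculations could look is sketched below. It is only an illustration: the lineage representation (a list of alternative derivations per result, each a set of source ids), the trust values, and the function names are all assumptions and do not come from the slides. A derivation is scored by multiplying the trust of its sources, and a result by its best derivation.

```python
from math import prod

# Hypothetical per-source trust values in [0, 1]; not taken from the slides.
TRUST = {"l1": 0.9, "l2": 0.5, "l3": 0.8}

# Hypothetical lineage: each result maps to its alternative derivations,
# and each derivation is the set of sources that were combined.
RESULTS = {
    "a": [{"l1", "l2"}],
    "b": [{"l1", "l3"}, {"l2"}],
}

def derivation_trust(sources, trust):
    """Trust of one derivation: product of the trust of every source used."""
    return prod(trust[s] for s in sources)

def result_trust(derivations, trust):
    """Trust of a result: the best score among its alternative derivations."""
    return max(derivation_trust(d, trust) for d in derivations)

def rank(results, trust):
    """Result ids ordered from most to least trusted."""
    return sorted(results, key=lambda r: result_trust(results[r], trust),
                  reverse=True)

print(rank(RESULTS, TRUST))
```

Product and maximum are one possible choice of scoring functions; other combinations (for example minimum and maximum) fit the same structure.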
Application: Query Execution
➢ Modify query strategies on the fly
➢ Restrict results to a certain subset of sources
➢ Restrict results w.r.t. queries over provenance
➢ Access control: only certain sources will appear
➢ Detect if a result would still be valid when removing a certain source
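The source restriction and the validity check from the list above can be sketched as follows. This is a minimal illustration under an assumed representation (each result keeps a list of alternative derivations, each a set of source ids); the function names are hypothetical and this is not the system's actual implementation.

```python
def restrict_to_sources(derivations, allowed):
    """Keep only the derivations whose sources all belong to `allowed`."""
    allowed = set(allowed)
    return [d for d in derivations if set(d) <= allowed]

def still_valid_without(derivations, removed):
    """True if at least one derivation does not rely on the removed source."""
    return any(removed not in d for d in derivations)

# Hypothetical lineage of one result: two alternative derivations.
lineage = [{"l1", "l4"}, {"l2", "l4"}]

print(restrict_to_sources(lineage, {"l2", "l4"}))
print(still_valid_without(lineage, "l1"))
print(still_valid_without(lineage, "l4"))
```

A result whose restricted derivation list is empty would be dropped from the answer; the same filter could serve as a simple form of access control.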
Provenance Polynomials
➢ Ability to characterize the ways each source contributed
➢ Pinpoint the exact source of each result
➢ Trace back the list of sources and the way they were combined
lat long l1 l2 l4 l4, lat long l1 l2 l4 l5,lat long l1 l2 l5 l4, lat long l1 l2 l5 l5,lat long l1 l3 l4 l4, lat long l1 l3 l4 l5,lat long l1 l3 l5 l4,lat long l1 l3 l5 l5,
lat long l2 l2 l4 l4, lat long l2 l2 l4 l5,lat long l2 l2 l5 l4, lat long l2 l2 l5 l5,lat long l2 l3 l4 l4, lat long l2 l3 l4 l5,lat long l2 l3 l5 l4, lat long l2 l3 l5 l5,
lat long l3 l2 l4 l4, lat long l3 l2 l4 l5,lat long l3 l2 l5 l4, lat long l3 l2 l5 l5,lat long l3 l3 l4 l4, lat long l3 l3 l4 l5,lat long l3 l3 l5 l4,lat long l3 l3 l5 l5,
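The 24 lineage combinations listed above appear to be the expansion of a much shorter expression. The sketch below assumes the listing is the Cartesian product of four groups of alternative sources, and writes the compact form with ⊕ for alternatives and ⊗ for combination. The grouping, the operator symbols as used here, and the function names are assumptions made for illustration.

```python
from itertools import product

# Assumed reading of the listing: four groups of alternative sources,
# one group per position in each "lat long ..." entry.
GROUPS = [["l1", "l2", "l3"], ["l2", "l3"], ["l4", "l5"], ["l4", "l5"]]

def polynomial(groups):
    """Compact form: alternatives joined by ⊕, groups combined by ⊗."""
    return " ⊗ ".join("(" + " ⊕ ".join(g) + ")" for g in groups)

def expand(groups):
    """Enumerate every combination, picking one source from each group."""
    return [" ".join(combo) for combo in product(*groups)]

print(polynomial(GROUPS))
for combo in expand(GROUPS):
    print("lat long", combo)
```

Under this reading, a single polynomial replaces the 3 × 2 × 2 × 2 = 24 enumerated entries while keeping both the sources and the way they were combined.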