First Keystone Summer School – Malta July 2015 – P. Missier
Provenance and the W3C PROV model (in the Big Data context)
Paolo Missier
School of Computing Science
Newcastle University, UK
Tutorial
First Keystone Summer School, Malta, July 2015
Some of the slides courtesy of Luc Moreau – thanks!
Independent validation of scientific claims is a cornerstone of experimental science
• Scientific claims are supported by experiments
• How do I express my “material and methods” so that you can independently verify my results?
• How do I document my results to promote their understanding / reuse?
Provenance is the equivalent of a logbook
• Capture all steps involved in the derivation of a result
• Replay, validate the execution, compare it with others
To what extent can these be formalised and automated in data-intensive science?
2- Explaining the outcome of a complex decision process
• Which process was used to derive a diagnosis?
• How did the process use the input data?
• How were the steps configured?
• Which decisions were made by human experts (clinicians)?
Clinical diagnosis of genetic diseases
3- Understanding the results of a computation
• Why has my [very complicated algorithm] produced this particular result?
• Why is my predictive analytics model suggesting that it will rain tomorrow?
• Why is this record part of the result of my database query?
  • Database provenance
• Why is this record included in the result of my keyword search?
4- Content reuse on the Social Web
Open Data, Data Journalism
• A consume-select-curate-share workflow, not only professional
• Ethos: to expose the data and methods used to produce news items
• But: Data wrangling can introduce errors
  • Is the data I am using valid? What is its primary source? What are the transformation steps?
NowNews publishes an article based on the latest employment data published by GovStat
PolicyOrg compiles a report including NowNews article
:L-Moreau a Agent .
:original-slide a Entity ; wasAttributedTo L-Moreau .
:this-slide a Entity ; wasDerivedFrom original-slide
What is provenance?
Oxford English Dictionary:
• the fact of coming from some particular source or quarter; origin, derivation
• the history or pedigree of a work of art, manuscript, rare book, etc.
• a record of the passage of an item through its various owners
Magna Carta (‘the Great Charter’) was agreed between King John and his barons on 15 June 1215.
Provenance refers to the sources of information, including entities and processes, involved in producing or delivering an artifact (*)
Provenance is a description of how things came to be, and how they came to be in the state they are in today (*)
Provenance is a record that describes the people, institutions, entities, and activities, involved in producing, influencing, or delivering a piece of data or a thing in the world
Provenance on the Web
Tim Berners-Lee’s “Oh Yeah” button:
• A browser button by which the user can express their uncertainty about a document being displayed “so how do I know I can trust this information?”.
• Upon activation of the button, the software then retrieves metadata about the document, listing assumptions on which trust can be based.
http://users.ugent.be/~tdenies/OhYeah/
Easy Access to Provenance: an Essential Step Towards Trust on the Web, Procs. METHOD 2013: The 2nd IEEE International Workshop on Methods for Establishing Trust with Open Data, held in conjunction with COMPSAC, the IEEE Signature Conference on Computers, Software & Applications, July 22-26, 2013, Kyoto, Japan
Use cases on the Social Web
Open Data, Data Journalism
NowNews publishes an article based on the latest employment data published by GovStat
PolicyOrg compiles a report including NowNews article
Bob: Journalist; Alice: Data Cruncher; Tom: Editor; Nick: Web Master
Derivation - Timeliness
Derivation:
• Charts, graphs and visualizations are all based on multiple data sets
• E.g. Bob’s article on employment that appeared in NowNews
• Which data was a figure based upon?
Is the report based on the most up-to-date data?
Derivation - Trusted sources
Derivation:
• Is this content derived from data coming from a reliable source?
• The chart within Bob’s article is based on GovStat data
• However, that information is hidden:
  • the chart was produced by a complex process performed by Alice
Policy rule:
“data supplied by the government is reliable”
Tracing the source of errors
Derivation, attribution:
• When did this error occur?
• Who was responsible for the chart?
Nick discovers an error in the chart included in Bob’s article
Ensuring policy compliance
Process inspection:
• Which process steps led to publication?
• Was an editorial check part of it?
Policy rule:
“posts are to be checked by an editor prior to publication”
Ensuring credit and acknowledgement
NowNews relies on multiple contributors
Bob: Journalist; Alice: Data Cruncher; Tom: Editor; Nick: Web Master
Attribution and responsibility:
• How do we ensure that all relevant contributors are acknowledged?
Reproducibility
Documenting the data generation process:
• How do we ensure that the figures can be reproduced using the new versions of the data?
NowNews must ensure that the article figures reflect the most recent data
So, why does provenance matter?
• To establish quality, relevance, trust
• To track information attribution through complex transformations
• To enable process analysis for debugging, improvement, evolution
• To enable reproducibility of processes (e.g. in science, data journalism…)
See also:
ACM Journal of Data and Information Quality (JDIQ), Special Issue on Provenance, Data and Information Quality, Paolo Missier, Paolo Papotti, Eds. Volume 5, Issue 3, February 2015
DOI: 10.1145/2692312
http://dl.acm.org/citation.cfm?id=2700413
http://jdiq.acm.org/archive.cfm?id=2698232
Main output: “Provenance XG Final Report”
http://www.w3.org/2005/Incubator/prov/XGR-prov/
- provides an overview of the various existing approaches, vocabularies
- proposes the creation of a dedicated W3C Working Group
April 2011, April 2013
Proposed Recommendations finalised
- prov-dm: Data Model
- prov-o: OWL ontology, RDF encoding
- prov-n: PROV notation
- prov-constraints
Moreau, Luc, and Paul Groth. “Provenance: An Introduction to PROV.” Synthesis Lectures on the Semantic Web: Theory and Technology 3, no. 4 (September 15, 2013): 1–129. doi:10.2200/S00528ED1V01Y201308WBE007.
An entity is a physical, digital, conceptual, or other kind of thing with some fixed aspects; entities may be real or imaginary.
An activity is something that occurs over a period of time and acts upon or with entities; it may include consuming, processing, transforming, ..., using, or generating entities.
An agent is something that bears some form of responsibility for an activity taking place, for the existence of an entity, or for another agent's activity.
Generation, Usage
Generation is the completion of production of a new entity by an activity. This entity did not exist before generation and becomes available for usage after this generation.
Usage is the beginning of utilizing an entity by an activity. Before usage, the activity had not begun to utilize this entity.
PROV is based on a notion of instantaneous events, that mark transitions in the world
- generation, usage (and others)
Ordering constraints amongst events:
“generation of e must precede each of its usages”
“a can only use / generate e after it has started and before it has ended”
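The two ordering constraints quoted above can be sketched as a small check over timestamped events. This is only an illustration: PROV-CONSTRAINTS defines an ordering between events rather than requiring explicit timestamps, and the `Activity` and `Event` classes, their fields, and the numeric times are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Activity:
    name: str
    started: float  # hypothetical numeric timestamps
    ended: float


@dataclass
class Event:
    kind: str  # "generation" or "usage"
    entity: str
    activity: Activity
    time: float


def ordering_violations(events: List[Event]) -> List[str]:
    """Report events that break the two ordering constraints."""
    problems = []
    # Time at which each entity was generated.
    generated_at = {ev.entity: ev.time for ev in events if ev.kind == "generation"}
    for ev in events:
        act = ev.activity
        # An activity can only use / generate e between its start and its end.
        if not (act.started <= ev.time <= act.ended):
            problems.append(f"{ev.kind} of {ev.entity} falls outside {act.name}")
        # Generation of e must precede each of its usages.
        if ev.kind == "usage":
            gen_time = generated_at.get(ev.entity)
            if gen_time is not None and ev.time < gen_time:
                problems.append(f"usage of {ev.entity} precedes its generation")
    return problems


drafting = Activity("drafting", 0, 10)
commenting = Activity("commenting", 8, 20)

consistent = [
    Event("generation", "draft v1", drafting, 9),
    Event("usage", "draft v1", commenting, 12),
]
inconsistent = [
    Event("generation", "draft v1", drafting, 9),
    Event("usage", "draft v1", commenting, 8.5),
]
```

Calling `ordering_violations(consistent)` reports nothing, while the second list is flagged because "draft v1" is used before it is generated.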
Concepts and relations
Generation of “draft v1” expressed as relation:
wasGeneratedBy(“draft v1”, ...)
Usage of “draft v1” by “commenting” expressed as relation:
used(“commenting”, “draft v1”, ...)
PROV notation
document
prefix prov <http://www.w3.org/ns/prov#>
prefix ex <http://www.example.com/>
Communication is the exchange of some unspecified entity by two activities, one activity using some entity generated by the other.
activity(ex:commenting)
activity(ex:drafting)
wasInformedBy(ex:commenting, ex:drafting)
:drafting a prov:Activity .
:commenting a prov:Activity ; prov:wasInformedBy :drafting .
Communication, generation, usage
activity(ex:commenting)
activity(ex:drafting)
entity(e)
wasInformedBy(ex:commenting, ex:drafting)
wasGeneratedBy(e, ex:drafting, -)
used(ex:commenting, e, -)
Q.: what is the relationship between communication, generation, and usage?
These are inference rules 5 and 6 in the PROV-CONSTRAINTS document
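A minimal sketch of how these two inferences relate communication to generation and usage, assuming provenance statements are stored as plain tuples such as `("used", activity, entity)`. The tuple encoding, the function names and the `_:e1` naming of invented entities are assumptions for illustration, not part of PROV.

```python
from itertools import count
from typing import Set, Tuple

Fact = Tuple[str, str, str]


def infer_communication(facts: Set[Fact]) -> Set[Fact]:
    """wasGeneratedBy(e, a1) and used(a2, e) imply wasInformedBy(a2, a1)."""
    generated = [(f[1], f[2]) for f in facts if f[0] == "wasGeneratedBy"]
    used = [(f[1], f[2]) for f in facts if f[0] == "used"]
    return {
        ("wasInformedBy", a2, a1)
        for (e, a1) in generated
        for (a2, e2) in used
        if e == e2
    }


def expand_communication(facts: Set[Fact]) -> Set[Fact]:
    """wasInformedBy(a2, a1) implies some entity e was generated by a1 and used by a2.

    The entity is existential, so this sketch always invents a fresh placeholder name.
    """
    fresh = count(1)
    inferred: Set[Fact] = set()
    for fact in sorted(facts):
        if fact[0] == "wasInformedBy":
            _, a2, a1 = fact
            e = f"_:e{next(fresh)}"
            inferred |= {("wasGeneratedBy", e, a1), ("used", a2, e)}
    return inferred


explicit = {
    ("wasGeneratedBy", "e", "ex:drafting"),
    ("used", "ex:commenting", "e"),
}
communication = infer_communication(explicit)
expanded = expand_communication({("wasInformedBy", "ex:commenting", "ex:drafting")})
```

Here `communication` contains the single statement `wasInformedBy(ex:commenting, ex:drafting)`, matching the example on the slide, and `expanded` goes the other way with a placeholder entity.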
A derivation is a transformation of an entity into another, an update of an entity resulting in a new one, or the construction of a new entity based on a pre-existing entity.
Q.: what is the relationship between derivation, generation, and usage?
:draftComments a prov:Entity ; prov:wasDerivedFrom :draftV1 .
:draftV1 a prov:Entity .
Provenance and Big Data: what’s the connection?
opportunities and challenges
Provenance {as,of} Big Data
1. BigProv: Provenance as big data
• High volume provenance
• What kind of analytics are interesting on big provenance?
2. Provenance of analytics processes
• “Prediction provenance”
• Train a model → provenance of the model as a record of the training process and data involved
• Use the model to make predictions → provenance of the prediction
3. Provenance of a search
• What is the provenance of a keyword search?
• Why would it be interesting? What can we learn from it?
Recent research on Provenance as Big Data
Chen, Peng; Plale, Beth A., "Big Data Provenance Analysis and Visualization," 2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), pp. 797-800, 4-7 May 2015. doi: 10.1109/CCGrid.2015.85
Chen, Peng; Plale, Beth A., "ProvErr: System Level Statistical Fault Diagnosis Using Dependency Model," 2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), pp. 525-534, 4-7 May 2015. doi: 10.1109/CCGrid.2015.86
Provenance Map Orbiter: Interactive Exploration of Large Provenance Graphs, Peter Macko and Margo Seltzer, Harvard University, Procs. TAPP’11, 2011, Crete, Greece
Provenance from Log Files: a BigData Problem, Devarshi Ghoshal and Beth Plale, Procs. BigProv workshop, EDBT, Genova, Italy, 2013
Adam Bates, Kevin Butler and Thomas Moyer. Take Only What You Need: Leveraging Mandatory Access Control Policy to Reduce Provenance Storage Costs. In Procs. TAPP’15 workshop, Edinburgh, 2015 http://workshops.inf.ed.ac.uk/tapp2015/TAPP15_II_3.pdf
• A Provenance Generator tool for experimenting with provenance at scale
• Why generate synthetic provenance?
• Synthetic PROV graphs can be a valuable complement to emerging natural provenance collections
• … provided their structural properties reflect specific provenance patterns
• control over their repetition and variability
• varying scales
• Useful for benchmarking emerging provenance management systems
• Useful to test analytics algorithms that operate on large provenance collections
Firth, Hugo, and Paolo Missier. “ProvGen: Generating Synthetic PROV Graphs with Predictable Structure.” In Procs. IPAW 2014 (Provenance and Annotations). Koln, Germany: Springer, 2014. http://arxiv.org/pdf/1406.2495
an Entity must have relationship "WasDerivedFrom" exactly 2 times unless it has property("version"="original");
the Entity(e1) must not have relationship "WasDerivedFrom" with the Entity(e2) unless e1 has relationship "Used" with the Activity(a) and e2 has the relationship "WasGeneratedBy" with the Activity(a);
an Entity must have relationship "WasGeneratedBy" exactly 1 times;
an Entity must have property("version"="original") with probability 0.05;
an Entity must have out degree at most 2;
an Activity must have relationship "Used" at most 1 times;
an Activity must have property("type"="create") with probability 0.01;
an Activity must have relationship "WasAssociatedWith" exactly 1 times;
an Activity must have relationship "Used" exactly 1 times unless it has property("type"="create");
an Activity must have relationship "WasGeneratedBy" exactly 1 times;
an Agent must have relationship "WasAssociatedWith" with probability 0.1;
an Agent must have relationship "WasAssociatedWith" between 1, 120 times with distribution gamma(1.3, 2.4);
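To make the meaning of such rules concrete, here is a hedged sketch that checks a tiny graph against two of them: "an Entity must have relationship WasGeneratedBy exactly 1 times" and "an Activity must have relationship Used exactly 1 times unless it has property type=create". It is not ProvGen's implementation; the dictionary and edge-tuple representation, the node names, and the `check_rules` function are assumptions for illustration, with each edge counted at its source node.

```python
from collections import Counter
from typing import Dict, List, Tuple

Edge = Tuple[str, str, str]  # (source, relationship, target)


def check_rules(nodes: Dict[str, dict], edges: List[Edge]) -> List[str]:
    """Check two of the rules listed above against a small graph."""
    generated = Counter(src for src, rel, _ in edges if rel == "WasGeneratedBy")
    used = Counter(src for src, rel, _ in edges if rel == "Used")
    violations = []
    for name, props in sorted(nodes.items()):
        # an Entity must have relationship "WasGeneratedBy" exactly 1 times
        if props["label"] == "Entity" and generated[name] != 1:
            violations.append(f"{name}: WasGeneratedBy x{generated[name]}, expected 1")
        # an Activity must have relationship "Used" exactly 1 times
        # unless it has property("type"="create")
        if (props["label"] == "Activity" and props.get("type") != "create"
                and used[name] != 1):
            violations.append(f"{name}: Used x{used[name]}, expected 1")
    return violations


nodes = {
    "e1": {"label": "Entity"},
    "e2": {"label": "Entity"},
    "a1": {"label": "Activity", "type": "create"},
    "a2": {"label": "Activity"},
}
edges = [
    ("e1", "WasGeneratedBy", "a1"),
    ("e2", "WasGeneratedBy", "a2"),
    ("a2", "Used", "e1"),
]

valid = check_rules(nodes, edges)
broken = check_rules(nodes, edges[:-1])  # drop the Used edge of a2
```

The full graph satisfies both rules, while dropping the `Used` edge leaves activity `a2` in violation. A generator has the harder job of constructing graphs that satisfy all rules at once, including the probabilistic ones.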
Some test queries
Generated graph loaded to Neo4J GDBMS
Queries expressed using the Cypher graph query language
Transitive closure over Derivation: return all the derivation chains, along with their length
MATCH (a)-[r:`WASDERIVEDFROM`*]->(b) RETURN a,b, length(r)
Return the 50 longest derivation chains (among those of length > 10)
MATCH (a)-[r:`WASDERIVEDFROM`*]->(b) WHERE length(r) > 10 RETURN a,b, length(r) ORDER BY length(r) desc limit 50
All agents and their associated activities
MATCH (a)-[:`WASASSOCIATEDWITH`]->(b) RETURN a as Agent, b as Activity
All agents who created new documents
MATCH (a{type:'create'})-[:`WASASSOCIATEDWITH`]->(b) RETURN a,b LIMIT 25
All agents who edited a document that was derived from an original
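For readers without a Neo4J instance, the derivation-chain queries above can be approximated in plain Python. This is a sketch under assumptions: the graph is held as a dictionary mapping each entity to the entities it was derived from, the names `v1` to `v3` are made up, and nodes already on the current chain are skipped to guard against cycles (Cypher itself only forbids repeating a relationship within a path).

```python
from typing import Dict, List, Tuple


def derivation_chains(derived_from: Dict[str, List[str]]) -> List[Tuple[str, str, int]]:
    """Return (start, ancestor, chain length) for every derivation chain."""
    chains: List[Tuple[str, str, int]] = []

    def walk(start: str, node: str, length: int, seen: frozenset) -> None:
        for parent in derived_from.get(node, []):
            if parent in seen:
                continue  # do not revisit a node on the current chain
            chains.append((start, parent, length + 1))
            walk(start, parent, length + 1, seen | {parent})

    for start in derived_from:
        walk(start, start, 0, frozenset({start}))
    return chains


# Hypothetical document versions: v3 was derived from v2, v2 from v1.
graph = {"v3": ["v2"], "v2": ["v1"], "v1": []}

chains = derivation_chains(graph)
# Counterpart of ORDER BY length(r) DESC LIMIT 50.
longest = sorted(chains, key=lambda chain: chain[2], reverse=True)[:50]
```

On this three-version graph the longest chain runs from `v3` back to `v1` with length 2. Enumerating every chain grows quickly on large graphs, which is part of why such queries are interesting benchmarks for provenance stores.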
Two alternate entities present aspects of the same thing. These aspects may be the same or different, and the alternate entities may or may not overlap in time.
An entity that is a specialization of another shares all aspects of the latter, and additionally presents more specific aspects of the same thing as the latter.
...But, this is still that car!
Semantic notes:
1. Specialization implies alternate: IF specializationOf(e1,e2) THEN alternateOf(e1,e2).
2. Alternate is symmetric: IF alternateOf(e1,e2) THEN alternateOf(e2,e1).
3. Specialization is transitive: IF specializationOf(e1,e2) and specializationOf(e2,e3) THEN specializationOf(e1,e3).
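The three semantic notes can be applied mechanically. The sketch below covers only those three rules, assuming relations are held as sets of pairs; the entity names (a car seen at different levels of detail) and the `close_relations` function are hypothetical.

```python
from typing import Set, Tuple

Pair = Tuple[str, str]


def close_relations(spec: Set[Pair], alt: Set[Pair]) -> Tuple[Set[Pair], Set[Pair]]:
    """Apply the three rules above to specializationOf / alternateOf pairs."""
    spec = set(spec)
    # Rule 3: specialization is transitive (repeat until nothing new is added).
    changed = True
    while changed:
        changed = False
        for (a, b) in list(spec):
            for (c, d) in list(spec):
                if b == c and (a, d) not in spec:
                    spec.add((a, d))
                    changed = True
    # Rule 1: specialization implies alternate.
    alt = set(alt) | spec
    # Rule 2: alternate is symmetric.
    alt |= {(b, a) for (a, b) in alt}
    return spec, alt


# Hypothetical entities describing the same car with increasing detail.
spec_in = {
    ("car-owned-by-bob-in-garage", "car-owned-by-bob"),
    ("car-owned-by-bob", "car"),
}
spec_out, alt_out = close_relations(spec_in, set())
```

After closure, `spec_out` also contains the pair linking the most specific entity directly to `car`, and `alt_out` holds each of the three pairs in both directions.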
differing in their location
same owner, added location
Reserved attributes and types
A small set of reserved attributes, with some usage restrictions
Bundles, provenance of provenance
A bundle is a named set of provenance descriptions, and is itself an entity, so allowing provenance of provenance to be expressed.
• Motivation for collecting provenance of data and information
  • In Science
  • In the Social Web
• The W3C PROV Recommendation (2013)
  • PROV-DM: The PROV data model
  • PROV-O: the Provenance Ontology
  • (PROV-CONSTRAINTS)
• Provenance as Big Data
  • High volume provenance
  • Storage, analytics, visualisation
• Provenance of analytics
  • How can I explain my predictions?
• The ProvGen tool
Selected bibliography
Moreau, Luc, Paolo Missier, Khalid Belhajjame, Reza B’Far, James Cheney, Sam Coppens, Stephen Cresswell, et al. PROV-DM: The PROV Data Model. Edited by Luc Moreau and Paolo Missier, 2012. http://www.w3.org/TR/prov-dm/
Cheney, James, Paolo Missier, and Luc Moreau. Constraints of the Provenance Data Model, 2012. http://www.w3.org/TR/prov-constraints/
Moreau, Luc, Paul Groth, James Cheney, Timothy Lebo, and Simon Miles. “The Rationale of PROV.” Web Semantics: Science, Services and Agents on the World Wide Web (April 2015). doi:10.1016/j.websem.2015.04.001. http://www.sciencedirect.com/science/article/pii/S1570826815000177
Marinho, Anderson, Leonardo Murta, Cláudia Werner, Vanessa Braganholo, Sérgio Manuel Serra da Cruz, Eduardo Ogasawara, and Marta Mattoso. “ProvManager: a Provenance Management System for Scientific Workflows.” Concurrency and Computation: Practice and Experience 24, no. 13 (2012): 1513–1530. http://dx.doi.org/10.1002/cpe.1870.
Firth, H.; and Missier, P. ProvGen: generating synthetic PROV graphs with predictable structure. In Procs. IPAW 2014 (Provenance and Annotations), Koln, Germany, 2014. Springer. http://arxiv.org/pdf/1406.2495
Missier, P.; Bryans, J.; Gamble, C.; Curcin, V.; and Danger, R. ProvAbs: model, policy, and tooling for abstracting PROV graphs. In Procs. IPAW 2014 (Provenance and Annotations), Koln, Germany, 2014. Springer. http://arxiv.org/pdf/1406.1998
De Oliveira, Daniel, Vítor Silva, and Marta Mattoso. “How Much Domain Data Should Be in Provenance Databases?” In 7th USENIX Workshop on the Theory and Practice of Provenance (TaPP 15). Edinburgh, Scotland: USENIX Association, 2015. https://www.usenix.org/conference/tapp15/workshop-program/presentation/de-oliveira.