Personal Data Management • Why is this such an issue? Data Provenance • Representing links v Representing data • Identifying resources: Life Science Identifiers • Different types of provenance • Provenance generation • Provenance storage • Provenance retrieval

Dec 21, 2015

Page 1: Personal Data Management Why is this such an issue? Data Provenance Representing links v Representing data Identifying resources: Life Science Identifiers.

Personal Data Management

• Why is this such an issue?

Data Provenance

• Representing links v Representing data

• Identifying resources: Life Science Identifiers

• Different types of provenance

• Provenance generation

• Provenance storage

• Provenance retrieval

Page 2

Problem

• Automated workflows produce lots of heterogeneous data

• These are just some of the results from one workflow run for Williams Disease

Page 3

Amplification of results

One input

Many outputs

Page 4

Link v Data Representation

• Data management questions refer to relationships rather than internal content
– What are the origins of this data?
  • Which service produced this data?
  • Which data is this derived from?
  • Who was this data produced for?
  • ?What is this data telling me?

• Data analysis questions delegated to external services.

Page 5

Representing links

• Identify each resource
– Life science identifier: URI with associated data and metadata retrieval protocols.
– Understanding that underlying data will not change

urn:lsid:taverna.sf.net:datathing:45fg6

urn:lsid:taverna.sf.net:datathing:23ty3
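The two identifiers on this slide follow the general LSID layout `urn:lsid:authority:namespace:object[:revision]`. A minimal sketch of splitting one into its parts is given below; the `LSID` tuple and `parse_lsid` helper are illustrative names, not part of Taverna or of any LSID library.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    """Parts of a Life Science Identifier (hypothetical helper type)."""
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(text: str) -> LSID:
    """Split urn:lsid:authority:namespace:object[:revision] into its parts."""
    parts = text.split(":")
    # The "urn:lsid" prefix is matched case-insensitively.
    if len(parts) not in (5, 6) or [p.lower() for p in parts[:2]] != ["urn", "lsid"]:
        raise ValueError(f"not a well-formed LSID: {text!r}")
    if any(p == "" for p in parts[2:]):
        raise ValueError(f"empty component in LSID: {text!r}")
    return LSID(*parts[2:])


lsid = parse_lsid("urn:lsid:taverna.sf.net:datathing:45fg6")
print(lsid.authority, lsid.namespace, lsid.object_id)
# prints: taverna.sf.net datathing 45fg6
```

Here the authority is `taverna.sf.net`, the namespace is `datathing`, and the object id is `45fg6`; neither example carries a revision.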

Page 6

Representing links II

• Identify link type
– Again use URI
– Allows us to use RDF infrastructure
  • Repositories
  • Ontologies

urn:lsid:taverna.sf.net:datathing:45fg6

urn:lsid:taverna.sf.net:datathing:23ty3

http://www.mygrid.org.uk/ontology#derived_from
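Because both resources and the link type are URIs, one link is exactly one RDF triple. The sketch below writes it as an N-Triples statement using plain string formatting. The direction of the link (45fg6 derived from 23ty3) is an assumption, since the transcript does not preserve the arrow, and `triple` is an illustrative helper rather than an existing API.

```python
DERIVED_FROM = "http://www.mygrid.org.uk/ontology#derived_from"


def triple(subject: str, predicate: str, obj: str) -> str:
    """Serialise one link as an N-Triples statement (all three terms are URIs)."""
    return f"<{subject}> <{predicate}> <{obj}> ."


# Assumed direction: 45fg6 was derived from 23ty3.
link = triple(
    "urn:lsid:taverna.sf.net:datathing:45fg6",
    DERIVED_FROM,
    "urn:lsid:taverna.sf.net:datathing:23ty3",
)
print(link)
```

A statement in this form can be loaded by any RDF repository, which is the point of identifying the link type with a URI.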

Page 7

Provenance (1)

[Figure: diagram of provenance links between the entities below]

• Entities: Workflow run, Workflow design, Experiment design, Project, Person, Organisation, Process, Service, Event, Data item

• Link types
– data derivation, e.g. output data derived from input data
– knowledge statements, e.g. similar protein sequence to
– instanceOf
– partOf
– componentProcess, e.g. web service invocation of BLAST @ NCBI
– componentEvent, e.g. completion of a web service invocation at 12.04pm
– runBy, e.g. BLAST @ NCBI
– run for

• Levels: Organisation level provenance, Process level provenance, Data/knowledge level provenance

User can add templates to each workflow process to determine links between data items.

Page 8

Storing management metadata

• Automated generation of this web of links preferable

• Workflow enactor generates
– LSIDs
– Data derivation links
– Knowledge links
– Process links
– Organisation links

As RDF
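As a rough illustration of automated generation, the sketch below mints an LSID for each output of a workflow step and records a data derivation link from every output to every input. It is not the Taverna enactor's code: `mint_lsid`, `record_invocation`, and the uuid-based object ids are assumptions, and only the `taverna.sf.net` authority, the `datathing` namespace, and the `derived_from` URI come from the slides.

```python
import uuid
from typing import List, Tuple

AUTHORITY = "taverna.sf.net"  # authority used in the slides' example LSIDs
ONTOLOGY = "http://www.mygrid.org.uk/ontology#"

Triple = Tuple[str, str, str]


def mint_lsid(namespace: str) -> str:
    """Mint a fresh identifier; the uuid-based object id is an assumption."""
    return f"urn:lsid:{AUTHORITY}:{namespace}:{uuid.uuid4().hex}"


def record_invocation(input_lsids: List[str], n_outputs: int) -> Tuple[List[str], List[Triple]]:
    """Mint LSIDs for a step's outputs and link each output to every input."""
    outputs = [mint_lsid("datathing") for _ in range(n_outputs)]
    links = [
        (out, ONTOLOGY + "derived_from", inp)
        for out in outputs
        for inp in input_lsids
    ]
    return outputs, links


# One input, many outputs: each output gets its own derivation link.
outputs, links = record_invocation(["urn:lsid:taverna.sf.net:datathing:23ty3"], 3)
for subject, predicate, obj in links:
    print(f"<{subject}> <{predicate}> <{obj}> .")
```

Knowledge, process, and organisation links would be recorded the same way with other predicates; their URIs are not given in the slides, so they are left out here.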

Page 9

Provenance generation

• Configuring and generating provenance within Taverna

Page 10

Storage

• LSID has no protocol for storage

• Taverna/Freefluo implements its own data/metadata storage protocol

[Figure: Taverna/Freefluo passes data and metadata through a publish interface to the data store and the metadata store]
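The storage arrangement can be pictured as a single publish call that routes the data to one store and the metadata to another. The sketch below is a hypothetical in-memory stand-in, not the Taverna/Freefluo protocol: the `PublishInterface` class and its `publish` method are assumed names. The overwrite check reflects the earlier point that the data behind an LSID will not change.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]


class PublishInterface:
    """Hypothetical facade: one call stores the data and its metadata separately."""

    def __init__(self) -> None:
        self.data_store: Dict[str, bytes] = {}
        self.metadata_store: List[Triple] = []

    def publish(self, lsid: str, data: bytes, metadata: List[Triple]) -> None:
        # Data behind an LSID must not change, so refuse to rebind it.
        if lsid in self.data_store and self.data_store[lsid] != data:
            raise ValueError(f"{lsid} is already bound to different data")
        self.data_store[lsid] = data
        self.metadata_store.extend(metadata)


store = PublishInterface()
store.publish(
    "urn:lsid:taverna.sf.net:datathing:45fg6",
    b"placeholder bytes",  # stand-in content, not real workflow output
    [(
        "urn:lsid:taverna.sf.net:datathing:45fg6",
        "http://www.mygrid.org.uk/ontology#derived_from",
        "urn:lsid:taverna.sf.net:datathing:23ty3",  # assumed link direction
    )],
)
print(len(store.data_store), len(store.metadata_store))
```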

Page 11

Retrieval

• LSID protocol used to retrieve data and metadata

• Query handled separately

[Figure: an LSID aware client reaches the data store and the metadata store through the LSID interface; an RDF aware client sends queries separately]
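The split between retrieval and query can be sketched as two separate entry points over the same stores. The function names `get_data` and `get_metadata` echo the getData and getMetadata operations of the LSID resolution protocol, but the in-memory stores, the `query` helper, the placeholder bytes, and the direction of the example link are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]

DERIVED_FROM = "http://www.mygrid.org.uk/ontology#derived_from"
A = "urn:lsid:taverna.sf.net:datathing:45fg6"
B = "urn:lsid:taverna.sf.net:datathing:23ty3"

# Stand-in stores; the content is a placeholder, not real workflow output.
DATA_STORE: Dict[str, bytes] = {A: b"placeholder bytes"}
METADATA_STORE: List[Triple] = [(A, DERIVED_FROM, B)]  # assumed direction


def get_data(lsid: str) -> bytes:
    """LSID-style retrieval: the bytes named by one identifier."""
    return DATA_STORE[lsid]


def get_metadata(lsid: str) -> List[Triple]:
    """LSID-style retrieval: the statements made about one identifier."""
    return [t for t in METADATA_STORE if t[0] == lsid]


def query(subject: Optional[str] = None,
          predicate: Optional[str] = None,
          obj: Optional[str] = None) -> List[Triple]:
    """Separate query path: match triples against a pattern (None is a wildcard)."""
    pattern = (subject, predicate, obj)
    return [
        t for t in METADATA_STORE
        if all(want is None or want == got for want, got in zip(pattern, t))
    ]


# An LSID aware client asks about one identifier it already holds.
print(get_metadata(A))
# An RDF aware client asks a question across the whole store:
# which data items were derived from B?
print(query(predicate=DERIVED_FROM, obj=B))
```

Retrieval needs an identifier up front, while a query can start from a relationship, which is why the two are handled by different clients.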

Page 12

LSID launchpad

• Lightweight plug-in for Internet Explorer providing access to LSID data/metadata

• demo

Page 13

Using IBM’s Haystack

[Figure labels: GenBank record; Portion of the Web of provenance; Managing collection of sequences for review]