Describing Datasets with the W3C HCLS standard
Melissa Haendel, Michel Dumontier
Transcript
Page 1: Dataset description using the W3C HCLS standard

Describing Datasets with the W3C HCLS standard

Melissa Haendel, Michel Dumontier

Page 2: Dataset description using the W3C HCLS standard

World Wide Web Consortium (W3C)

The W3C is the main international standards organization for the World Wide Web

The W3C is made up of over 400 member organizations that work together to develop standards for the World Wide Web

W3C has sophisticated procedures for developing standards and validating them with the community

Page 3: Dataset description using the W3C HCLS standard

The Semantic Web is the new global web of knowledge

It involves standards for publishing, sharing, and querying facts, expert knowledge and services

It is a scalable approach to the discovery of independently formulated and distributed knowledge

Cyganiak and Jentzsch. http://lod-cloud.net/

Page 4: Dataset description using the W3C HCLS standard

Resource Description Framework

Language to represent knowledge
– Logic-based formalism -> automated reasoning
– Graph-like properties -> data analysis

Good for:
– Describing in terms of type, attributes, relations
– Integrating data from different sources
– Sharing the data (W3C standard)
– Reusing what is available, developing what you need, and contributing back to the web of data
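
The "type, attributes, relations" idea and the ease of integration can be illustrated without any RDF library. The following is a minimal sketch in plain Python, assuming a graph is just a set of (subject, predicate, object) triples. The example.org identifiers and the helper names `describe` and `subjects_of_type` are hypothetical and not part of the slides.

```python
# Plain-Python sketch of RDF's triple model (no RDF library required).
# All example.org identifiers are hypothetical.
from collections import defaultdict

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
DCT_TITLE = "http://purl.org/dc/terms/title"
DCT_SOURCE = "http://purl.org/dc/terms/source"
DCAT_DATASET = "http://www.w3.org/ns/dcat#Dataset"

# Each statement is a (subject, predicate, object) triple.
source_a = {
    ("http://example.org/dataset/a", RDF_TYPE, DCAT_DATASET),
    ("http://example.org/dataset/a", DCT_TITLE, "Example dataset A"),
}
source_b = {
    ("http://example.org/dataset/b", RDF_TYPE, DCAT_DATASET),
    ("http://example.org/dataset/b", DCT_SOURCE, "http://example.org/dataset/a"),
}

# Integrating data from different sources is a set union of triples.
graph = source_a | source_b


def describe(graph, subject):
    """Collect every predicate/object pair stated about a subject."""
    description = defaultdict(list)
    for s, p, o in graph:
        if s == subject:
            description[p].append(o)
    return dict(description)


def subjects_of_type(graph, rdf_class):
    """Return the sorted subjects declared to be instances of rdf_class."""
    return sorted(s for s, p, o in graph if p == RDF_TYPE and o == rdf_class)
```

Because global identifiers are shared, the statement in `source_b` that points at dataset A links up with the statements in `source_a` as soon as the two sets are merged.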

Page 5: Dataset description using the W3C HCLS standard

Challenge: Working with Web Data

Datasets often have inadequate descriptions, so we don’t know what they are about or how they were constructed

Datasets change over time, but often don’t come with versioning information

May have been constructed using other data, but it’s not clear which version of data was used or whether these were modified

Data may be available in a variety of formats

There may be multiple copies of data from different providers, but it’s unclear if they are exact copies or derivatives

Version of standard or vocabulary used not indicated

Data registries are not synchronized and can contain conflicting information

Page 6: Dataset description using the W3C HCLS standard

Key Use Cases for HCLS Dataset description

1. Dataset Identification, Description, Licensing and Provenance

2. Dataset Discovery (via Catalog)

3. Exchange of Dataset Descriptions

4. Dataset Linking

5. Content Summary

6. Monitoring of Dataset Changes

Page 7: Dataset description using the W3C HCLS standard

Objectives

Develop a guidance note for reusing existing vocabularies to describe datasets with RDF
– Mandatory, recommended, optional descriptors
– Identifiers
– Versioning
– Attribution
– Provenance
– Content summarization

Recommend vocabulary-linked attributes and value sets

Provide reference editor and validation

Page 8: Dataset description using the W3C HCLS standard

We compiled a list of metadata fields used across the community

and then surveyed over 20 vocabularies to see if they provided relevant metadata elements or value sets…

…to produce a big spreadsheet that maps metadata needs with existing vocabularies

Page 9: Dataset description using the W3C HCLS standard

Dublin Core Metadata Initiative

Widely used

Broadly applicable
– Documents
– Datasets

✗Generic terms

✗Not comprehensive

✗No required properties

“Date: A point or period of time associated with an event in the lifecycle of the resource.”

Page 10: Dataset description using the W3C HCLS standard

DCAT: Data Catalog

Separates Dataset and Distribution

✗No versioning

✗No prescribed properties

Page 11: Dataset description using the W3C HCLS standard

No single vocabulary provides all key metadata fields

Page 12: Dataset description using the W3C HCLS standard

http://tiny.cc/hcls-datadesc

Page 13: Dataset description using the W3C HCLS standard
Page 14: Dataset description using the W3C HCLS standard

Included Vocabularies

Page 15: Dataset description using the W3C HCLS standard

Three Component Metadata Model: description – version – distribution
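
One way to picture the three components is as linked records: one description, its versions, and each version's distributions. The following is a rough sketch, not the note's normative model. The class names, the `to_triples` helper, and the example.org IRIs are hypothetical, and the linking properties (`dct:isVersionOf`, `pav:version`, `dcat:distribution`, `dct:format`) are assumed here for illustration.

```python
# Sketch of the three-level model as linked records.
# Linking properties are assumptions drawn from DCT, PAV and DCAT;
# all example.org IRIs are hypothetical.
from dataclasses import dataclass, field


@dataclass
class Distribution:
    iri: str
    file_format: str


@dataclass
class Version:
    iri: str
    version: str
    distributions: list = field(default_factory=list)


@dataclass
class Description:
    iri: str
    title: str
    versions: list = field(default_factory=list)


def to_triples(description):
    """Flatten the nested records into (subject, predicate, object) triples."""
    triples = [(description.iri, "dct:title", description.title)]
    for version in description.versions:
        triples.append((version.iri, "dct:isVersionOf", description.iri))
        triples.append((version.iri, "pav:version", version.version))
        for dist in version.distributions:
            triples.append((version.iri, "dcat:distribution", dist.iri))
            triples.append((dist.iri, "dct:format", dist.file_format))
    return triples


example = Description(
    iri="http://example.org/dataset",
    title="Example dataset",
    versions=[
        Version(
            iri="http://example.org/dataset/v2",
            version="2.0",
            distributions=[
                Distribution("http://example.org/dataset/v2/rdf", "text/turtle"),
                Distribution("http://example.org/dataset/v2/csv", "text/csv"),
            ],
        )
    ],
)

for triple in to_triples(example):
    print(triple)
```

Keeping the levels separate means that unchanging facts (title, homepage) are stated once, while version numbers and file formats attach only to the level they actually describe.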

Page 16: Dataset description using the W3C HCLS standard

Description

Identifiers

Title

Description

Homepage

License

Language

Keywords

Concepts and vocabularies used

Standards

Publication

Page 17: Dataset description using the W3C HCLS standard

Attribution

Simple Model
– Individuals are related to roles using specific properties, e.g. dct:creator, pav:createdBy, pav:curatedBy

Expandable Model
– Individuals are related to roles and dates via an associated object
– PROV, VIVO-ISF
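
The difference between the two attribution models can be sketched as triples. This is an illustration only: the function names and the `ex:` identifiers are hypothetical, and the expandable variant assumes PROV's qualified attribution pattern (`prov:qualifiedAttribution`, `prov:agent`, `prov:hadRole`). Dates are left out because the slides do not say which property carries them; in this pattern they would attach to the same intermediate node.

```python
# Two ways to say "this person played this role for this dataset".
# Identifiers prefixed with ex: are hypothetical.


def simple_attribution(dataset, person, role_property):
    """Simple model: one triple, with the role baked into the property."""
    return [(dataset, role_property, person)]


def qualified_attribution(dataset, person, role, node_id):
    """Expandable model: an intermediate node carries the agent and role.

    Further statements about the attribution can be added to the same node.
    """
    node = f"_:{node_id}"
    return [
        (dataset, "prov:qualifiedAttribution", node),
        (node, "rdf:type", "prov:Attribution"),
        (node, "prov:agent", person),
        (node, "prov:hadRole", role),
    ]


simple = simple_attribution("ex:dataset", "ex:alice", "pav:curatedBy")
expanded = qualified_attribution("ex:dataset", "ex:alice", "ex:curator", "attr1")
```

The simple model costs one triple per role but cannot say anything more about the relationship; the expandable model costs several triples and gains a place to hang extra detail.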

Page 18: Dataset description using the W3C HCLS standard

Provenance and Change

Version number

Source

Provenance: retrieved from, derived from, created with

Frequency of change
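
As an illustration of how these fields might be written down, the sketch below maps each one to a vocabulary term and prints a Turtle fragment. The mapping is an assumption made for this example (terms taken from PAV and Dublin Core Terms), not a statement of what the note prescribes; the helper names and example.org IRIs are hypothetical.

```python
# Assumed mapping from the fields listed above to vocabulary terms.
PROVENANCE_TERMS = {
    "version number": "pav:version",
    "source": "dct:source",
    "retrieved from": "pav:retrievedFrom",
    "derived from": "pav:derivedFrom",
    "created with": "pav:createdWith",
    "frequency of change": "dct:accrualPeriodicity",
}

PREFIXES = (
    "@prefix dct: <http://purl.org/dc/terms/> .\n"
    "@prefix pav: <http://purl.org/pav/> .\n"
)


def term(value):
    """Render a value as an IRI if it looks like a URL, else as a string literal."""
    if value.startswith(("http://", "https://")):
        return f"<{value}>"
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def provenance_turtle(subject, values):
    """Build a Turtle fragment describing one dataset version."""
    if not values:
        raise ValueError("at least one provenance field is required")
    unknown = set(values) - set(PROVENANCE_TERMS)
    if unknown:
        raise KeyError(f"no term mapped for: {sorted(unknown)}")
    lines = [f"    {PROVENANCE_TERMS[k]} {term(v)}" for k, v in values.items()]
    return f"{PREFIXES}\n<{subject}>\n" + " ;\n".join(lines) + " ."


print(provenance_turtle(
    "http://example.org/dataset/v2",
    {
        "version number": "2.0",
        "derived from": "http://example.org/dataset/v1",
    },
))
```

Recording the version number together with the "derived from" link addresses the earlier challenge of not knowing which version of a source was used.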

Page 19: Dataset description using the W3C HCLS standard

Availability

Format

Download URL

Landing page

SPARQL endpoint

Page 20: Dataset description using the W3C HCLS standard

Tools to create the metadata

VoID Editor

Page 21: Dataset description using the W3C HCLS standard

Tools to validate the metadata

New version using ShEx in development
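
To give a feel for what validation against mandatory, recommended and optional descriptors involves, here is a minimal sketch in plain Python. It is not ShEx and not the validator shown in the slides. The requirement levels in `REQUIREMENTS` are illustrative assumptions rather than the note's actual profile, and `validate` is a hypothetical helper.

```python
# Requirement levels are illustrative assumptions, not the note's actual profile.
REQUIREMENTS = {
    "dct:title": "mandatory",
    "dct:description": "mandatory",
    "dct:license": "recommended",
    "dcat:keyword": "optional",
}


def validate(description):
    """Check a dict of property -> list of values.

    Returns (errors, warnings): missing mandatory properties are errors,
    missing recommended properties are warnings, optional ones are ignored.
    """
    errors, warnings = [], []
    for prop, level in REQUIREMENTS.items():
        if description.get(prop):
            continue  # property present with at least one value
        if level == "mandatory":
            errors.append(f"missing mandatory property {prop}")
        elif level == "recommended":
            warnings.append(f"missing recommended property {prop}")
    return errors, warnings


errors, warnings = validate({"dct:title": ["Example dataset"]})
print(errors)
print(warnings)
```

A shape language such as ShEx expresses the same kind of constraint declaratively, so the rules can be published alongside the guidance instead of being buried in code.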

Page 22: Dataset description using the W3C HCLS standard

HCLS: http://www.w3.org/blog/hcls/

Mailing list: http://lists.w3.org/Archives/Public/public-semweb-lifesci/

Editors’ Draft: http://tiny.cc/hcls-datadesc-ed

W3C Interest Group Note: http://tiny.cc/hcls-datadesc

Special thanks to Alasdair Gray, Scott Marshall, Joachim Baran

Thanks to all other contributors to the HCLS note