Data Sets, Vocabularies and Tools
Pablo N. Mendes
Freie Universität Berlin
1st year review
Luxembourg, December 2011
11/02/11
Work Plan View WP4
(Gantt chart; timeline axis from 0 to 48 in steps of 6; partners shown: FUB, KIT, UPM)
Tasks:
• Task 4.1 Assembly and maintenance of the PlanetData data set catalogue
• Task 4.2 Community-driven creation and maintenance of vocabularies
• Task 4.3 Development of best practices for providing self-describing data
• Task 4.4 Assembly and maintenance of a catalogue of data provisioning tools
Deliverables:
• D4.1 Assembly and maintenance of the PlanetData data set catalogue
• D4.2 Best practices on how to provide self-describing data
• D4.3 PlanetData data sets, vocabularies and provisioning tools catalogue and access portal
• D4.4 Data quality benchmark dataset
• D4.5 PlanetData data sets, vocabularies and provisioning tools catalogue and access portal
Work Plan View WP5
(Gantt chart; timeline axis from 0 to 48 in steps of 6; partners shown: EPFL, KIT)
Tasks:
• Task 5.1 Assembly and maintenance of PlanetData technology catalogue
• Task 5.2 Development of best practices of large-scale data management infrastructures
Deliverables:
• D5.1 PlanetData data management tools catalogue and access portal
• D5.2 Best practices on how to deploy tools on large-scale infrastructures
• D5.3 PlanetData data management tools catalogue and access portal
Summary
WP4
Assembly and maintenance of the PlanetData data set, vocabularies and tools catalogue;
Community-driven creation and maintenance of vocabularies;
Development of best practices;
WP5
Assembly and maintenance of the PlanetData technology catalogue;
Best practices for large-scale data management infrastructure;
Deliverables in Year 1
D4.1
• Data Sets Catalog
• Vocabularies Catalog
D5.1
• Data Management Tools Catalog
Data Sets Catalog
• Where to maintain the catalog?
• How to catalog?
• What to catalog?
• How to provide access for humans and machines?
• How to organize a community around the catalog?
Repository: TheDataHub.org
Maintained by Open Knowledge Foundation (OKF) and world-wide open data community
Widely used catalog
• Dec 1st 2011: 2418 datasets, 314 LOD
Features of the portal:
• Tagging, Rating, Feedback, Discussions, Groups
Cataloguing Process
• PlanetData Editor
• Collected a list of new datasets → 49 new entries
• Updated existing entries (537 edits)
• Crowdsourcing: data providers and third parties
• Public call for action to mailing lists, OKFN blog
• Supported the community contributions
• Quality Assurance
• Tools to support cataloguing (validator, auto-complete)
• Joint work with LATC
Catalog Metadata QuickRef
What?
• package name, title, url
• tag:lod
• topic
• shortname
• format-*
Who?
• author || maintainer
• published by producer
• provenance metadata
• license
When?
• version
• last updated
Why?
• package description
Where to find?
• example URI
• downloads/dumps
• SPARQL endpoint
How much?
• triples
• links:* (outlinks)
• namespace (inlinks)
• vocab mappings
http://www.w3.org/wiki/TaskForces/CommunityProjects/LinkingOpenData/DataSets/CKANmetainformation
How are datasets described?
Catalog Metadata
Resources:
• example URIs
• SPARQL endpoint
• RDF Dumps
• Sitemaps, VoID files
Catalog Entry Validator
Checks levels of metadata completeness
Step-by-step annotation instructions
Already checks some quality indicators, e.g. availability, provenance, access methods
http://www4.wiwiss.fu-berlin.de/lodcloud/ckan/validator/validate.php
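The idea of checking "levels of metadata completeness" can be sketched as follows. This is a minimal illustration only, not the validator's actual logic: the level names, the field names in `LEVELS` and the function `completeness_level` are hypothetical, loosely modelled on the QuickRef categories.

```python
# Hypothetical completeness levels; each level adds required fields.
LEVELS = [
    ("basic", ["name", "title", "url", "description"]),
    ("provenance", ["author", "license"]),
    ("access", ["example_uri", "sparql_endpoint"]),
]

def completeness_level(entry):
    """Return (highest level fully satisfied, fields missing for the next level)."""
    reached = "none"
    for level, fields in LEVELS:
        missing = [f for f in fields if not entry.get(f)]
        if missing:
            return reached, missing
        reached = level
    return reached, []

# A made-up catalog entry, for illustration only.
entry = {
    "name": "example-dataset",
    "title": "Example Dataset",
    "url": "http://example.org/",
    "description": "A made-up entry for illustration.",
    "author": "Jane Doe",
}
level, missing = completeness_level(entry)
print(level, missing)
```

Returning the missing fields together with the level is what would let a tool turn the check into step-by-step annotation instructions.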
Auto-completion scripts
For the entries that pass the validator, we can auto-complete metadata with information such as:
• Number of triples
• Links to other sources
• Vocabularies used
• Quality indicators
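How such values might be derived from a dump can be sketched as below. This is an assumption-laden illustration, not the project's scripts: the function `autocomplete`, the `known_ns` mapping and the rule "an outlink is a triple whose subject is in the dataset's namespace and whose object is in another known dataset's namespace" are all hypothetical. Only the `links:` key prefix is taken from the QuickRef.

```python
from collections import Counter

def autocomplete(triples, own_ns, known_ns):
    """Derive a triple count and outlink counts per known target dataset.

    triples  : iterable of (subject, predicate, object) strings
    own_ns   : namespace prefix of the dataset being described
    known_ns : dict of target dataset name -> namespace prefix (hypothetical)
    """
    meta = {"triples": 0}
    links = Counter()
    for s, _p, o in triples:
        meta["triples"] += 1
        # Only count links that leave the dataset's own namespace.
        if not s.startswith(own_ns) or o.startswith(own_ns):
            continue
        for target, ns in known_ns.items():
            if o.startswith(ns):
                links[target] += 1
                break
    for target, n in links.items():
        meta["links:" + target] = n
    return meta

# Made-up sample data.
triples = [
    ("http://example.org/a", "http://www.w3.org/2002/07/owl#sameAs",
     "http://dbpedia.org/resource/Berlin"),
    ("http://example.org/a", "http://www.w3.org/2000/01/rdf-schema#label", "Berlin"),
    ("http://example.org/b", "http://www.w3.org/2002/07/owl#sameAs",
     "http://example.org/a"),
]
meta = autocomplete(triples, "http://example.org/",
                    {"dbpedia": "http://dbpedia.org/resource/"})
print(meta)
```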
Catalog Access Portal
For machines
• CKAN API (continuously improved by OKFN)
• VoID descriptions for LOD group (will be continuously improved in cooperation with LATC)
For humans
• LOD Cloud Diagram
• State of the LOD Report
State of the LOD Cloud
(Charts: Triples by domain; Links by domain)

Domain                  Datasets         Triples  Triples %  (Out-)Links  Links %
Media                         25   1,841,852,061     5.82 %   50,440,705  10.01 %
Geographic                    31   6,145,532,484    19.43 %   35,812,328   7.11 %
Government                    49  13,315,009,400    42.09 %   19,343,519   3.84 %
Publications                  87   2,950,720,693     9.33 %  139,925,218  27.76 %
Cross-domain                  41   4,184,635,715    13.23 %   63,183,065  12.54 %
Life sciences                 41   3,036,336,004     9.60 %  191,844,090  38.06 %
User-generated content        20     134,127,413     0.42 %    3,449,143   0.68 %
Total                        295  31,634,213,770             503,998,829
http://www4.wiwiss.fu-berlin.de/lodcloud/state/
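The percentage column can be recomputed from the triple counts as a sanity check. The sketch below uses only the figures from the table; the names `TRIPLES`, `STATED_TOTAL` and `shares` are illustrative. It assumes the percentages are taken relative to the stated total rather than to the sum of the rows.

```python
# Triple counts per domain, as given in the table.
TRIPLES = {
    "Media": 1_841_852_061,
    "Geographic": 6_145_532_484,
    "Government": 13_315_009_400,
    "Publications": 2_950_720_693,
    "Cross-domain": 4_184_635_715,
    "Life sciences": 3_036_336_004,
    "User-generated content": 134_127_413,
}
STATED_TOTAL = 31_634_213_770  # total row of the table

def shares(counts, total):
    """Return each count as a percentage of the given total."""
    return {name: 100.0 * n / total for name, n in counts.items()}

pct = shares(TRIPLES, STATED_TOTAL)
for domain, value in pct.items():
    print(f"{domain:24s}{value:6.2f} %")

# Report the difference instead of assuming rows and total agree.
gap = STATED_TOTAL - sum(TRIPLES.values())
print("stated total minus row sum:", gap)
```

With the figures as listed, the domain rows add up to slightly less than the stated total, which is why the script reports the difference rather than asserting equality.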
State of the LOD Cloud (2)
• SPARQL Endpoint: 68.14 %
• RDF Dumps: 39.66 %
• Provide provenance: 36.63 %
• Provide licensing: 17.84 %
(Chart: vocabulary use)
Vocabularies Catalog
• Based on BTC Dataset (2.1 billion triples)
• Shows vocabulary usage in practice
• Executed on a 54 node Hadoop cluster
• Access portal:
  • Searchable
  • URI Lookup
  • Top usage statistics
Hosted at http://vocab.cc
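Usage statistics of this kind boil down to counting how often terms of each vocabulary occur in the data. The sketch below is a single-process stand-in for the map step (emit the namespace of each predicate) and the reduce step (sum per namespace). It is not the Hadoop job behind the catalog: the function names and the rule of splitting a URI at its last `#` or `/` are assumptions made for illustration.

```python
from collections import Counter

def namespace(uri):
    """Split a term URI at its last '#' or '/' and return the namespace part."""
    cut = max(uri.rfind("#"), uri.rfind("/"))
    return uri[: cut + 1] if cut >= 0 else uri

def vocabulary_usage(triples):
    """Count how often each vocabulary namespace occurs in predicate position."""
    usage = Counter()
    for _s, p, _o in triples:
        usage[namespace(p)] += 1
    return usage

# Made-up sample triples.
triples = [
    ("http://example.org/a", "http://xmlns.com/foaf/0.1/name", "Alice"),
    ("http://example.org/b", "http://xmlns.com/foaf/0.1/name", "Bob"),
    ("http://example.org/a", "http://www.w3.org/2000/01/rdf-schema#label", "A"),
]
usage = vocabulary_usage(triples)
print(usage.most_common())
```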
Tools Catalog
• Initial focus on tools from the consortium
• Currently 15 tools
Entry for Global Sensor Networks (GSN)
Available from planet-data.eu
Tools Description
• Textual description
  • What is it?
  • Documentation
  • Publications
  • Requirements
  • License
  • Contact person/mailing list
  • Organization
  • Events
• Tags
  • Produce
  • Publish
  • Consume
  • Provisioning
Names of Tools in the Catalog
CumulusRDF
D2R
DBpedia Spotlight
GSN (Global Sensor Networks)
Geometry2RDF
LDIF
LDSpider (Linked Data Spider)
LarKC (Large Knowledge Collider)
MonetDB
NOR2O
R2O & ODEMapster
OKKAM
Pubby
R2R
S2O
Silk
Tools Catalog
Related: LATC Tools Catalog
• 11 tools
• 5 tools in both, 10 new tools in PlanetData
Proposal for next year:
• Join catalogs at linkeddata.org
• Jointly maintain catalog until LATC finishes
• Build a community → people can add their own tools
• Afterwards PlanetData takes over and maintains the catalog for another 2 years