Development of guidelines for publishing statistical data as linked open data MERGING STATISTICS AND GEOSPATIAL INFORMATION IN MEMBER STATES - POLAND Mirosław Migacz INSPIRE Conference 2016 Barcelona, 26 IX 16
Page 1: Development of guidelines for publishing statistical data ...

Development of guidelines for publishing statistical data as linked open data
MERGING STATISTICS AND GEOSPATIAL INFORMATION IN MEMBER STATES - POLAND

Mirosław Migacz
INSPIRE Conference 2016
Barcelona, 26 IX 16

Page 2: Development of guidelines for publishing statistical data ...

Agenda

• project aims,

• introduction to linked open data,

• project timeline,

• project tasks,

• intranet site,

• from ontology to SPARQL endpoint.

Page 3: Development of guidelines for publishing statistical data ...

Overall objective

Support decision-making processes involving provision of standardized, usable and open georeferenced statistical data.

Page 4: Development of guidelines for publishing statistical data ...

What is linked open data?

• Internet – a collection of documents published online, each accessible at a Web location identified by a URL.

• These documents are mainly human-readable and cannot be understood by machines.

• Linked open data – data published in machine-readable formats, described and connected using Uniform Resource Identifiers (URIs), so that people and machines can collect the data and combine it for any purpose the licence permits.

source: https://joinup.ec.europa.eu/community/ods/description (CC 2.0)

Page 5: Development of guidelines for publishing statistical data ...

Linked open data

• URI – for names

• RDF – to describe data

• SPARQL – to query for data

source: https://joinup.ec.europa.eu/community/ods/description (CC 2.0)

Page 6: Development of guidelines for publishing statistical data ...

Uniform Resource Identifier (URI)

To make a long story short:

object described by an internet address

A country, e.g. Belgium

http://publications.europa.eu/resource/authority/country/BEL

A dataset, e.g. Countries Named Authority List

http://publications.europa.eu/resource/authority/country/

In official statistics it can look like this:

http://teryt.stat.gov.pl/32/18/05/3 - gmina Węgorzyno

source: https://joinup.ec.europa.eu/community/ods/description (CC 2.0)

Page 7: Development of guidelines for publishing statistical data ...

RDF and SPARQL

Resource Description Framework (RDF) is a syntax for representing data and resources on the Web.

RDF breaks every piece of information down into triples:

• Subject – a resource, which may be identified with a URI.

• Predicate – a URI-identified reused specification of the relationship.

• Object – a resource or literal to which the subject is related.

source: https://joinup.ec.europa.eu/community/ods/description (CC 2.0)

Example (subject, predicate, object):

http://example.org/place/Brussels (subject) is the capital of (predicate) “Belgium” (object, a literal)

OR

http://example.org/place/Brussels (subject) is the capital of (predicate) http://example.org/place/Belgium (object, a resource)

SPARQL is a standardised language for querying RDF data.
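
The triple model and the idea behind a SPARQL query can be sketched in a few lines of plain Python. This is only an illustration, not an RDF library: the predicate URI and the Warsaw/Poland URIs are made up for the example, and `None` stands in for a SPARQL variable.

```python
# Each RDF statement is a (subject, predicate, object) triple.
# The Brussels/Belgium URIs come from the slide; everything else is hypothetical.
CAPITAL_OF = "http://example.org/property/isCapitalOf"  # made-up predicate URI

triples = [
    ("http://example.org/place/Brussels", CAPITAL_OF, "http://example.org/place/Belgium"),
    ("http://example.org/place/Warsaw", CAPITAL_OF, "http://example.org/place/Poland"),
]


def match(pattern, data):
    """Return the triples that fit a pattern; None behaves like a SPARQL variable."""
    return [
        triple
        for triple in data
        if all(p is None or p == value for p, value in zip(pattern, triple))
    ]


# Roughly: SELECT ?capital WHERE { ?capital <isCapitalOf> <http://example.org/place/Belgium> }
belgium = "http://example.org/place/Belgium"
capitals = [s for s, _, _ in match((None, CAPITAL_OF, belgium), triples)]
print(capitals)
```

A real SPARQL engine does the same kind of pattern matching over a graph, with joins across several patterns.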

Page 8: Development of guidelines for publishing statistical data ...

Five stars of linked open data

source: https://joinup.ec.europa.eu/community/ods/description (CC 2.0)

Make your stuff available on the Web (whatever format) under an open license.

Make it available as structured data (e.g., Excel instead of image scan of a table)

Use non-proprietary formats (e.g., CSV instead of Excel)

Use URIs to denote things, so that people can point at your stuff

Link your data to other data to provide context

Page 9: Development of guidelines for publishing statistical data ...

Now

powiat łobeski (LAU 1)

3218

4.4.32.64.18

lobeski

4326418

Page 10: Development of guidelines for publishing statistical data ...

Aim

powiat łobeski

http://nts.stat.gov.pl/4/4/32/64/18

Page 11: Development of guidelines for publishing statistical data ...

Specific objectives

• identification of statistical units for which data can be published, with harmonization of their geometries for respective years

• building standardized URIs for statistical units

• identification and analysis of potential data sources

• plan for transformation of existing data sources into open formats

• creation of RDF metadata for data sources

• feasibility analysis for publishing linked open data

Page 12: Development of guidelines for publishing statistical data ...

Stage I – until 4/10/2016

• identification of statistical units for which data can be published, with harmonization of their geometries for respective years

• building standardized URIs for statistical units

• identification and analysis of potential data sources: analyzing "openness" and georeference, verifying the need for geocoding

5 GUS-PK

2 GUS-DI

1 GUS-AZ

3 US Poznań

2 US Olsztyn

1 US Wrocław / OBDL J. Góra

Page 13: Development of guidelines for publishing statistical data ...

Stage II – until 7/10/2017

• plan for transformation of existing data sources into open formats

• creation of RDF metadata for data sources

• feasibility analysis for publishing linked open data (building a SPARQL endpoint)

5 GUS-PK

1 GUS-AZ

3 US Poznań

2 US Olsztyn

1 US Wrocław / OBDL J. Góra

Page 14: Development of guidelines for publishing statistical data ...

Identification of data sources

• Three major databases:

• Local Data Bank

• biggest set of statistical information available for a wide range of years

• updated monthly

• Demography Database

• integrated data source for state and structure of population, vital statistics and migrations

• Development monitoring system STRATEG

• a system for facilitating and monitoring the development policy

• key measures to monitor execution of strategies at local, regional, transregional and EU level.

Page 15: Development of guidelines for publishing statistical data ...

Identification of data sources

• Other data sources:

• publications

• tables

• communiques

• announcements

• articles

Page 16: Development of guidelines for publishing statistical data ...

Identification of data sources

• Metadata:

• thematic category,

• format (PDF, DOC, XLS, CSV),

• spatial reference (country, NUTS, LAU, functional areas, urban areas),

• temporal reference (years)

• presence of identifiers (TERYT, NTS, NUTS)

• update cycle
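
A metadata record with the fields listed above could be modelled roughly as follows. This is a sketch only: the class name, field names and example values are placeholders invented for illustration, and treating XLS and CSV as the "structured" formats is an assumption based on the five-star scheme.

```python
from dataclasses import dataclass, asdict
from typing import List


@dataclass
class DataSourceMetadata:
    """One record per identified data source (hypothetical field names)."""
    name: str
    thematic_category: str
    formats: List[str]             # e.g. PDF, DOC, XLS, CSV
    spatial_reference: List[str]   # e.g. country, NUTS, LAU
    temporal_reference: List[int]  # years covered
    identifiers: List[str]         # e.g. TERYT, NTS, NUTS
    update_cycle: str


def has_structured_format(source: DataSourceMetadata) -> bool:
    """Assumption: XLS and CSV count as structured, PDF and DOC do not."""
    return any(fmt in {"XLS", "CSV"} for fmt in source.formats)


# Placeholder record, not a description of any real GUS data source.
example = DataSourceMetadata(
    name="Example publication",
    thematic_category="demography",
    formats=["PDF", "XLS"],
    spatial_reference=["NUTS", "LAU"],
    temporal_reference=[2014, 2015],
    identifiers=["TERYT"],
    update_cycle="yearly",
)

print(asdict(example))
print(has_structured_format(example))
```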

Page 17: Development of guidelines for publishing statistical data ...

Preliminary analysis of data sources

• Key aspects:

• openness

• redundancy of information

• popularity (based on view and download statistics)

• Inclusion / exclusion of the data source

Page 18: Development of guidelines for publishing statistical data ...

Statistical units harmonization

• Basis:

• NTS (Nomenclature of Territorial Units for Statistical Purposes)

Name         NTS level   NUTS/LAU   Identifier
Region       1           NUTS 1     1.6
Voivodship   2           NUTS 2     2.6.22
Subregion    3           NUTS 3     3.6.22.40
Powiat       4           LAU 1      4.6.22.40.11
Gmina        5           LAU 2      5.6.22.40.11.01.1

Page 19: Development of guidelines for publishing statistical data ...

Statistical units harmonization

• Input data:

• administrative boundaries since 2002 for LAU 2 (gmina), excluding 2007

• Harmonization process:

• structure standardization

• standardization of identifiers (creating NTS identifiers)

• aggregation to higher level units (LAU 1 -> NUTS 1)
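
The aggregation step relies on each NTS identifier containing the identifiers of its parent units. A minimal sketch, assuming the pattern visible in the example identifiers of the NTS table holds in general (the leading digit is the NTS level, and a unit's parent keeps the first level-minus-one code components); the function names are hypothetical:

```python
from typing import List, Optional


def parent_nts(identifier: str) -> Optional[str]:
    """Return the identifier of the next higher NTS unit, or None for a region."""
    parts = identifier.split(".")
    level = int(parts[0])
    if level <= 1:
        return None
    # The parent has level - 1 and keeps the first (level - 1) code components.
    return ".".join([str(level - 1)] + parts[1:level])


def ancestors(identifier: str) -> List[str]:
    """List all higher-level units, from the direct parent up to the region."""
    chain = []
    current = parent_nts(identifier)
    while current is not None:
        chain.append(current)
        current = parent_nts(current)
    return chain


print(ancestors("5.6.22.40.11.01.1"))
```

Grouping gmina records by `parent_nts` gives the powiat each one belongs to; the geometries themselves would still have to be merged in a GIS tool.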

Page 20: Development of guidelines for publishing statistical data ...

Statistical units harmonization

• Non-standard statistical units:

• functional areas

• urban areas

• Groups of NTS units

• Derive mostly from strategic documents

• Changes of geometries in time to be determined

Page 21: Development of guidelines for publishing statistical data ...

Statistical units URIs

• NTS as basic classification

Name         NTS level   NUTS/LAU   Identifier          URI (http://nts.stat.gov.pl/...)
Region       1           NUTS 1     1.6                 …1/6
Voivodship   2           NUTS 2     2.6.22              …2/6/22
Subregion    3           NUTS 3     3.6.22.40           …3/6/22/40
Powiat       4           LAU 1      4.6.22.40.11        …4/6/22/40/11
Gmina        5           LAU 2      5.6.22.40.11.01.1   …5/6/22/40/11/01/1

Example of a full URI for a gmina:

http://nts.stat.gov.pl/5/6/22/40/11/01/1
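
The URI pattern on this slide is a mechanical rewrite of the dotted NTS identifier. A short sketch of that conversion in both directions; the function names are hypothetical, and the base address is the one shown on the slide:

```python
NTS_BASE = "http://nts.stat.gov.pl/"  # base address as shown on the slide


def nts_to_uri(identifier: str) -> str:
    """Turn a dotted NTS identifier into its URI, e.g. 2.6.22 -> .../2/6/22."""
    return NTS_BASE + "/".join(identifier.split("."))


def uri_to_nts(uri: str) -> str:
    """Inverse of nts_to_uri; rejects URIs outside the NTS namespace."""
    if not uri.startswith(NTS_BASE):
        raise ValueError("not an NTS URI: " + uri)
    return uri[len(NTS_BASE):].replace("/", ".")


print(nts_to_uri("5.6.22.40.11.01.1"))
print(nts_to_uri("4.4.32.64.18"))  # powiat łobeski
```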

Page 22: Development of guidelines for publishing statistical data ...

Data transformation plan

• From ontology to SPARQL endpoint

• Decide what will be published as Open Data

• three major databases

• other data sources

• Create ontology

• Map to existing databases

• Export to RDF data store

• Publish on linked data server

• Workflow tested on STRATEG database
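
The "map to existing databases" and "export to RDF" steps can be pictured as turning each database row into a few triples. The sketch below is a hand-written illustration in plain Python, not the mapping language of any particular tool: the row layout, the `http://example.org/...` vocabulary and the values are all invented for the example. Output is in N-Triples syntax.

```python
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"
EX = "http://example.org/def/"            # hypothetical vocabulary
OBS = "http://example.org/observation/"   # hypothetical namespace for observations
NTS_BASE = "http://nts.stat.gov.pl/"      # unit URIs as proposed in the slides


def row_to_ntriples(row):
    """Map one (made-up) database row to N-Triples lines."""
    unit_uri = NTS_BASE + row["nts"].replace(".", "/")
    obs_uri = OBS + "/".join(
        [row["indicator"], str(row["year"]), row["nts"].replace(".", "-")]
    )
    year, value = row["year"], row["value"]
    return [
        f"<{obs_uri}> <{EX}unit> <{unit_uri}> .",
        f'<{obs_uri}> <{EX}year> "{year}"^^<{XSD_INTEGER}> .',
        f'<{obs_uri}> <{EX}value> "{value}"^^<{XSD_INTEGER}> .',
    ]


# Illustrative row only; the value is not a real statistic.
rows = [{"nts": "4.4.32.64.18", "indicator": "population", "year": 2015, "value": 100}]

lines = [line for row in rows for line in row_to_ntriples(row)]
print("\n".join(lines))
```

A file of such lines can be loaded into an RDF store; a production workflow would reuse an existing vocabulary rather than a private one.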

Page 23: Development of guidelines for publishing statistical data ...

Ontology - methods and tools

• Ontop - platform to query databases as Virtual RDF Graphs using SPARQL

• SPARQL 1.0 Support

• Supports interface for ontology development

• Intuitive/powerful mapping language

• Support for free and commercial DBMS

• SPARQL end-point

Page 24: Development of guidelines for publishing statistical data ...

Mapping ontology on database

Page 25: Development of guidelines for publishing statistical data ...

SPARQL query on mapped data

Page 26: Development of guidelines for publishing statistical data ...

SPARQL endpoint tools for the web

• Apache Jena Fuseki

• Fuseki is a SPARQL server. It allows REST-style SPARQL Query.

• RDF data generated by Ontop is imported into Apache Jena

• Pubby

• A Linked Data Frontend for SPARQL Endpoints

• Pubby makes it easy to turn a SPARQL endpoint into a Linked Data server. It is implemented as a Java web application.

• Provides data at a given linked data URI
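
A REST-style SPARQL query is an ordinary HTTP request with the query passed in the `query` parameter. The sketch below only builds such a request with the Python standard library and does not send it. The endpoint URL (a local Fuseki server with a dataset named `strateg`) is an assumption for illustration, not the project's actual address.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint: a local Fuseki instance with a dataset called "strateg".
ENDPOINT = "http://localhost:3030/strateg/sparql"

QUERY = """
SELECT ?s ?p ?o
WHERE { ?s ?p ?o }
LIMIT 10
"""


def build_sparql_request(endpoint: str, query: str) -> Request:
    """Build a GET request that carries the query in the 'query' parameter."""
    url = endpoint + "?" + urlencode({"query": query})
    # Ask for the standard JSON serialization of SPARQL results.
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_sparql_request(ENDPOINT, QUERY)
print(request.get_method(), request.full_url)
# With a running server, urllib.request.urlopen(request) would return the results.
```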

Page 27: Development of guidelines for publishing statistical data ...

Fuseki SPARQL endpoint query

Page 28: Development of guidelines for publishing statistical data ...

Query result facilitated by Pubby

Page 29: Development of guidelines for publishing statistical data ...

Further works

• Consultation of the designed workflow during a study visit at the Madrid University of Technology

• Setting up an internal test linked data server to implement web tools

• Creating ontologies and workflows for databases and other data sources

Page 30: Development of guidelines for publishing statistical data ...

Summary – results so far

• Harmonized geometries for statistical units

• Identified data sources with comprehensive metadata

• Preliminary data transformation plan with tools tested

Page 31: Development of guidelines for publishing statistical data ...

Poland’s data opening strategy

• launched this year

• aimed at opening data resources of government institutions in line with the five-star linked open data goals

• the grant results (guidelines) are in line with the strategy

• increased probability of acquiring financing for a fully fledged implementation

Page 32: Development of guidelines for publishing statistical data ...

INSPIRE Thematic Clusters

https://themes.jrc.ec.europa.eu – collaboration platform

Statistical Cluster:

statistical units

population distribution (demography)

human health and safety

Informal meeting of Cluster members during the coffee break (15:30-16:00)

Page 33: Development of guidelines for publishing statistical data ...

Questions?