Data.gov Wiki: A Semantic Web Approach to Government Data Li Ding, Dominic DiFranzo, Sarah Magidson, Jim Hendler Tetherless World Constellation Aug 7, 2009


Page 1:

Data.gov Wiki: A Semantic Web Approach to Government Data

Li Ding, Dominic DiFranzo, Sarah Magidson, Jim Hendler

Tetherless World Constellation
Aug 7, 2009

Page 2:

Government Data on the Web

Page 3:

Objectives

• Investigate the role of the semantic web in producing, processing and utilizing government datasets
  – To enrich the value of data via normalizing, linking and information extraction
  – To realize the value of data via applications, esp. visualization
  – To support web developers via machine-friendly data access and web services

Page 4:

Semantic Web Architecture for Government Data

[Figure: architecture diagram with three stages: Convert Data, Link & Enrich Data, View & Use Data. Labeled components: GOV data (RDF), Linked Data, Sem Wiki, Link Annotator, RDF/XML. Data Processors (Web Services & Analyzers): SPARQL Web Service, SPARQL End Point, XSLT Service, Diff Service, RSS Generator. Outputs and viewers: Google Viz, MIT Exhibit, RSS 1.0, tagCloud, CSV, XSL, …, Tabulator.]

Li Ding, Dominic DiFranzo, Sarah Magidson, and Jim Hendler · Tetherless World Constellation · Rensselaer Polytechnic Institute · Aug 7 2009 · http://data-gov.tw.rpi.edu/

Page 5:

Translate GOV data into RDF

• Principle 1: Keep the translation minimal
  – keep table structure
  – skip parsing values, unique property namespace
• Principle 2: Let the translation meet the Web
  – RDF/XML as output
  – Partition of big datasets, dereferenceable URIs
• Principle 3: Make the translation extensible
  – Property definitions updatable via Semantic MediaWiki
• Principle 4: Preserve knowledge provenance
  – Recording provenance metadata using DC and FOAF

Dominic
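The translation principles above can be illustrated with a minimal Python sketch. It is not the project's converter: it emits N-Triples rather than RDF/XML for brevity, the `ENTRY_BASE` subject URIs are hypothetical, and the use of `dc:source` and `foaf:Document` for provenance is an assumption inferred from the sample queries. The property namespace follows the `http://data-gov.tw.rpi.edu/vocab/p/<dataset>/` pattern seen in those queries. Column names are only lowercased and underscored, so headers with other special characters would need further cleaning.

```python
import csv
import io

# Per-dataset property namespace, following the pattern visible in the
# sample queries (e.g. http://data-gov.tw.rpi.edu/vocab/p/8/site_id).
VOCAB_BASE = "http://data-gov.tw.rpi.edu/vocab/p/"
# Hypothetical base URI for row subjects; the real URI scheme is not shown.
ENTRY_BASE = "http://example.org/dataset/"

DC_SOURCE = "http://purl.org/dc/elements/1.1/source"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FOAF_DOCUMENT = "http://xmlns.com/foaf/0.1/Document"


def escape_literal(value):
    """Escape a string for use inside a double-quoted N-Triples literal."""
    return (value.replace("\\", "\\\\").replace('"', '\\"')
                 .replace("\n", "\\n").replace("\r", "\\r"))


def table_to_ntriples(csv_text, dataset_id):
    """Translate a CSV table minimally: one subject per row, one property
    per column, and every cell kept as an unparsed plain literal."""
    reader = csv.DictReader(io.StringIO(csv_text))
    triples = []
    for row_number, row in enumerate(reader, start=1):
        subject = f"<{ENTRY_BASE}{dataset_id}/entry{row_number}>"
        for column, value in row.items():
            if column is None or value is None:
                continue  # ragged row: skip cells lacking a header or value
            prop = column.strip().lower().replace(" ", "_")
            predicate = f"<{VOCAB_BASE}{dataset_id}/{prop}>"
            triples.append(f'{subject} {predicate} "{escape_literal(value)}" .')
    return triples


def provenance_triples(document_uri, source_url):
    """Record where a translated document came from (assumed DC/FOAF usage)."""
    return [
        f"<{document_uri}> <{RDF_TYPE}> <{FOAF_DOCUMENT}> .",
        f"<{document_uri}> <{DC_SOURCE}> <{source_url}> .",
    ]
```

Calling `table_to_ntriples("site_id,Ozone Level\nSHN418,42\n", 8)` yields two triples for a single row subject, with the value `42` left as a string, in line with Principle 1.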

Page 6:

Translated Dataset Statistics

• data.gov hosts 432 datasets:
  – 390 “Raw Data Catalog” and 41 “Tool Catalog”
  – from 37 US government agencies
• We have 16 translated RDF datasets
  – 13,532,385 table entries
  – 2,927,399,269 triples
  – 2,526 properties
• data.gov mentioned 458 data access points (mainly tables)
  – 3 - RSS, ATOM
  – 248 - csv/txt
  – 46 - xml
  – 66 - xls (MS Excel)
  – 14 - kml or kmz
  – 22 - ESRI shape

Page 7:

Cloud of government data

[Figure: cloud diagram of the translated datasets, with the category labels Budget, Population, Energy and Utilities, and Geography and Environment. Datasets shown:]

• (#10) Residential Energy Consumption Survey
• (#401) Budget Authority and offsetting receipts 1976-2014
• (#403) Governmental Receipts 1962-2014
• (#402) Outlays and offsetting receipts 1962-2014
• (#249) 2006 Toxics Release Inventory
• (#90) 2005-2007 ACS PUMS Housing
• (#191) 2005 Toxics Release Inventory
• (#91) 2005-2007 ACS PUMS Population
• (#34) Worldwide M1+ Earthquakes past 7 days
• (#9) CASTNET Visibility
• (#397) 2007 Toxics Release Inventory
• (#8) CASTNET Ozone
• (@10001) CASTNET sites


Page 8:

Issues in Data.gov

• Duplicated Datasets - Some datasets are part of another dataset.
  – Dataset 140 (2005 Toxics Release Inventory data for the state of California (EPA)) is a subset of Dataset 191.
• Formatting Issues - The format of some datasets is not friendly to machine processing.
  – Dataset 37 (Lower Colorado River Daily Average Water Elevations and Releases (US Bureau of Reclamation)).
  – Dataset 335 (National Longitudinal Surveys (US Bureau of Labor Statistics)) tells you how to order data from the government.
• Access Point Issues - Some access points are interactive web pages, which are not friendly to machine access.
  – Dataset 330 (Local Area Unemployment Statistics (US Bureau of Labor Statistics)).

Sarah

Page 9:

Demos

• Visualization
  – Tabulator
  – Google Visualization (live)
  – Exhibit (live)
• Computation
  – RSS generation
  – TDB query (live)
• Live Demos:
  – http://onto.rpi.edu/joseki/
  – http://data-gov.tw.rpi.edu/wiki/Demos

Dominic, Sarah

Page 10:

TODO List

• More demos
  – US Pollution Map
  – US agency
  – Earthquake in RPI Map
• Getting more data linked
  – Link properties
  – Link instance data
• More web services
  – Gov data auto-completion
• SPARQL integration for 2B triples
  – TDB
  – 4Store

[Figure: dataset labels (#9) CASTNET Visibility, (#8) CASTNET Ozone, (@10001) CASTNET sites]

Page 11:

Sample SPARQL queries

• List datasets:
  – SELECT ?s ?o WHERE { ?s <http://purl.org/dc/elements/1.1/source> ?o }
• List all loaded documents:
  – SELECT ?s ?o WHERE { ?s <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Document> }
• List descriptions of an EPA site (integration)
  – SELECT ?s WHERE { ?s <http://data-gov.tw.rpi.edu/vocab/p/8/site_id> "SHN418" . }
• List contributions of agency (count)
  – PREFIX dgp92: <http://data-gov.tw.rpi.edu/vocab/p/92/> SELECT ?ag count(*) WHERE { ?entry dgp92:agency ?ag . } GROUP BY ?ag ORDER BY ?ag
• List agencies (distinct)
  – PREFIX dgp401: <http://data-gov.tw.rpi.edu/vocab/p/401/> SELECT DISTINCT ?ag ?ag_code ?branch ?branch_code WHERE { ?entry dgp401:bureau_name ?ag; dgp401:bureau_code ?ag_code; dgp401:agency_name ?branch; dgp401:agency_code ?branch_code . } ORDER BY ?ag
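Queries like the ones above are typically sent to a SPARQL endpoint over HTTP. The following Python sketch shows one way to do that with the standard library only. It assumes the endpoint accepts the query as a `query` GET parameter and can return SPARQL JSON results, as the SPARQL protocol describes; the endpoint URL in the usage note is a placeholder, since the exact service path of the live demo is not given in the slides.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(endpoint, query):
    """Encode a SPARQL query as the 'query' parameter of a GET request."""
    return endpoint + "?" + urllib.parse.urlencode({"query": query})


def run_select(endpoint, query, timeout=30):
    """Send a SELECT query and return the list of result bindings.

    Assumes the endpoint honours the Accept header and answers in the
    SPARQL JSON results format.
    """
    request = urllib.request.Request(
        build_query_url(endpoint, query),
        headers={"Accept": "application/sparql-results+json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)["results"]["bindings"]


# The EPA site lookup from the slide, kept as a reusable string.
EPA_SITE_QUERY = (
    'SELECT ?s WHERE { '
    '?s <http://data-gov.tw.rpi.edu/vocab/p/8/site_id> "SHN418" . }'
)
```

With a hypothetical endpoint such as `http://example.org/sparql`, `run_select("http://example.org/sparql", EPA_SITE_QUERY)` would return one dictionary per matching row. `build_query_url` can be used on its own to inspect the request before sending it.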