DISIT Lab, Distributed Data Intelligence and Technologies, Distributed Systems and Internet Technologies, Department of Information Engineering (DINFO), http://www.disit.dinfo.unifi.it
Ontology Building vs Data Harvesting and Cleaning for Smart-city Services
Pierfrancesco Bellini, Monica Benigni, Riccardo Billero, Paolo Nesi, Nadia Rauch
Dipartimento di Ingegneria dell’Informazione, DINFO, Università degli Studi di Firenze, Via S. Marta 3, 50139, Firenze, Italy
Tel: +39-055-4796567, fax: +39-055-4796363
DISIT Lab, http://www.disit.dinfo.unifi.it alias http://www.disit.org, [email protected]
Proc. of the 20th International Conference on Distributed Multimedia Systems, Pittsburgh, USA, August 2014

Ontology Building vs Data Harvesting and Cleaning for Smart-city Services

Jan 15, 2015


Paolo Nesi

Presently, a very large number of public and private data sets are available from local governments. In most cases, they are not semantically interoperable, and a huge human effort is needed to create integrated ontologies and knowledge bases for smart cities. The smart city ontology is not yet standardized, and a lot of research work is needed to identify models that can easily support data reconciliation, the management of complexity, and reasoning. In this paper, a system is proposed for the ingestion and reconciliation of data on smart city aspects such as the road graph, the services available on the roads, traffic sensors, etc. The system allows managing a big volume of data coming from a variety of sources, considering both static and dynamic data. These data are mapped to a smart-city ontology and stored in an RDF store, where they are available to applications via SPARQL queries to provide new services to the users. The paper presents the process adopted to produce the ontology and the knowledge base, and the mechanisms adopted for verification, reconciliation and validation. Some examples of the possible usage of the resulting coherent knowledge base are also offered and are accessible from the RDF store and related services. The article also presents the work performed on reconciliation algorithms and their comparative assessment and selection.
Keywords: Smart city, knowledge base construction, reconciliation, validation and verification of knowledge base, smart city ontology, linked open graph.
Transcript
Page 1: Ontology Building vs Data Harvesting and Cleaning for Smart-city Services

DISIT Lab, Distributed Data Intelligence and Technologies
Distributed Systems and Internet Technologies
Department of Information Engineering (DINFO)
http://www.disit.dinfo.unifi.it

Ontology Building vs Data Harvesting and Cleaning for Smart-city Services

Pierfrancesco Bellini, Monica Benigni, Riccardo Billero, Paolo Nesi, Nadia Rauch
Dipartimento di Ingegneria dell’Informazione, DINFO
Università degli Studi di Firenze
Via S. Marta 3, 50139, Firenze, Italy
Tel: +39-055-4796567, fax: +39-055-4796363
DISIT Lab
http://www.disit.dinfo.unifi.it alias http://www.disit.org, [email protected]

Proc. of the 20th International Conference on Distributed Multimedia Systems, Pittsburgh, USA, August 2014

DISIT Lab (DINFO UNIFI), DMS 2014, USA, August 2014

Page 2: Ontology Building vs Data Harvesting and Cleaning for Smart-city Services

Smart-City axes
• Cities produce a HUGE amount of data every day
  – 'Static' data
    • Road graph
    • Bus/train graph
    • Services
    • ...
  – Dynamic (real time) data
    • Weather conditions
    • Traffic conditions
    • Pollution status
    • Bus/train positions
    • Parking status
    • People flows
    • ...
  – Open/Private Data

• Smart Health
• Smart Education
• Smart Mobility
• Smart Energy
• Smart Governmental
  – Smart economy
  – Smart people
  – Smart environment
  – Smart living
• Smart Telecommunication

Page 3: Ontology Building vs Data Harvesting and Cleaning for Smart-city Services

Smart-City
• Main Aim
  – Provide a platform able to ingest and take advantage of a large volume of the above data (big data):
    • Exploit data integration and reasoning
    • Deliver new services and applications to citizens
    • Leverage the ongoing Semantic Web effort
• Problems & Challenges
  – Data are provided in many different formats and protocols, by many different institutions, with different conventions and at different times, ...
  – Data are typically not aligned (e.g., street names, dates, geolocations, tags, ...); that is, they are not semantically interoperable
  – The result is a big data problem: volume, velocity, variability, variety, ...

Page 4: Ontology Building vs Data Harvesting and Cleaning for Smart-city Services

Smart City Paradigm

[Figure: Smart City paradigm diagram. Labels recovered from the diagram: Smart City Engine; Data Harvesting; Data Ingesting and mining; Data processing; Reasoning and Deduction; Data Acting processors; Real Time Computing; Interoperability; Real Time Data; Social Data trends; Data Sensors; Traffic control; Social Media; Sensors control; Peripheral processors; Energy central; eHealth Agency; eGov Data collection; Telecom. Services; Data/info Rendering; Data/info Exploitation; Suggestions and Alarms; Profiled Services; Citizens Formation; Applications.]

Page 5: Ontology Building vs Data Harvesting and Cleaning for Smart-city Services

Smart-city Ontology
• The data models provided have been mapped into the ontology, which covers different aspects:
  – Administration
  – Street-guide
  – Points of interest
  – Local public transport
  – Sensors
  – Temporal aspects
  – Metadata on the data

[Figure: the ontology macroclasses (Administration, Street-guide, Point of Interest, Local public transport, Sensors, Temporal, MetaData) and some of the relations that connect their classes:]
  PA  hasPublicOffice  OFFICE
  SENSOR  measuredTime  TIME
  SERVICE  isInRoad  ROAD
  CARPARKSENSOR  observeCarPark  CARPARK
  BUS  hasExpectedTime  TIME
  CARPARK  isInRoad  ROAD
  BUSSTOPFORECAST  atBusStop  BUSSTOP
  WEATHERREPORT  refersTo  PA
  BUSSTOP  isInRoad  ROAD
  ADMINISTRATIVEROAD  ownerAuthority  PA
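The relations listed on this slide can be written down as (domain, property, range) triples. The following is a minimal sketch, not the authors' implementation: the helper names `classes_linked_to` and `is_allowed` are hypothetical, and only the relations shown on the slide are included.

```python
# Relations from the slide, as (domain class, property, range class).
SCHEMA = {
    ("PA", "hasPublicOffice", "OFFICE"),
    ("SENSOR", "measuredTime", "TIME"),
    ("SERVICE", "isInRoad", "ROAD"),
    ("CARPARKSENSOR", "observeCarPark", "CARPARK"),
    ("BUS", "hasExpectedTime", "TIME"),
    ("CARPARK", "isInRoad", "ROAD"),
    ("BUSSTOPFORECAST", "atBusStop", "BUSSTOP"),
    ("WEATHERREPORT", "refersTo", "PA"),
    ("BUSSTOP", "isInRoad", "ROAD"),
    ("ADMINISTRATIVEROAD", "ownerAuthority", "PA"),
}


def classes_linked_to(target):
    """Return, sorted, the classes that have a property pointing at `target`."""
    return sorted(domain for domain, _, range_ in SCHEMA if range_ == target)


def is_allowed(domain, prop, range_):
    """True if the (domain, property, range) combination appears on the slide."""
    return (domain, prop, range_) in SCHEMA


if __name__ == "__main__":
    # Which classes are attached to the road graph?
    print(classes_linked_to("ROAD"))
```

Listing the classes that point at ROAD shows why the road graph acts as the hub to which services, car parks and bus stops must be reconciled.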

Page 6: Ontology Building vs Data Harvesting and Cleaning for Smart-city Services

Smart-city Ontology
• Administration: the structure of the general public administrations (Municipality, Province and Region). It also includes Resolutions: ordinances issued by the administrations, which may affect road circulation, infrastructural works, the schedule of the RTZ, etc.
• Street-guide: formed by entities such as Road, Node, RoadElement, AdministrativeRoad, Milestone, StreetNumber, RoadLink, Junction, Entry, EntryRule, Maneuver, ... It represents the entire road system of the region, including the permitted maneuvers and the access rules for the limited traffic zones. Based on the OTN (Ontology of Transportation Networks) vocabulary.
• Points of Interest: includes all the services and activities that may be useful to citizens, who may need to search for them and reach them: commercial, public administration, cultural, ...

Page 7: Ontology Building vs Data Harvesting and Cleaning for Smart-city Services

Smart-city Ontology
• Local public transport: includes the data of the major local public transport companies, such as scheduled times, the rail graph, real-time passages at bus stops, real-time positions, ...
• Sensors: data provided by sensors. Currently, data are collected from various sensors (parking status, weather, pollution) installed along some streets of Florence and surrounding areas, and from sensors installed in the main car parks of the region.
  – Plus: car sharing, bike sharing, AVM, RTZ, etc.
• Temporal: introduces time-related concepts (time intervals and instants) into the ontology, so that a timeline can be associated with the recorded events and forecasts become possible. It uses time ontologies such as OWL-Time.

Page 8: Ontology Building vs Data Harvesting and Cleaning for Smart-city Services

Smart-city Ontology
• Metadata: models the additional information associated with:
  – Descriptors of the data sets that produced the triples: data set ID, title, description, purpose, location, administration, version, responsible, etc.
  – Licensing information
  – Process information: IDs of the processes adopted for ingestion, quality improvement, mapping, indexing, ...; date and time of ingestion, update, review, ... When a problem is detected, this information makes it possible to understand when and how the problem was introduced
• Including basic ontologies such as:
  – DC: Dublin Core, standard metadata
  – OTN: Ontology of Transportation Networks
  – FOAF: for the description of the relations among people or groups
  – vCard: for the description of people and organizations
  – wgs84_pos: for latitude and longitude, GPS info
  – OWL-Time: reasoning on time, time intervals
  – GoodRelations: models of commercial activities
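The process metadata described on this slide (process IDs plus date and time) is what allows a detected problem to be traced back to the run that introduced it. A minimal sketch of that lookup follows; the record layout, the field names and the function `last_run_before` are assumptions made for illustration and are not taken from the ontology.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ProcessRecord:
    dataset_id: str    # data set that produced the triples
    process_id: str    # hypothetical ID of the process that ran
    phase: str         # e.g. "ingestion", "quality improvement", "mapping"
    run_at: datetime   # date and time of the run


def last_run_before(log, dataset_id, detected_at):
    """Return the most recent run on `dataset_id` at or before `detected_at`.

    This is the first candidate to inspect when a problem is detected.
    Returns None when no run is recorded for that data set.
    """
    candidates = [
        record for record in log
        if record.dataset_id == dataset_id and record.run_at <= detected_at
    ]
    return max(candidates, key=lambda record: record.run_at, default=None)
```

For example, with one ingestion run at 08:00 and one mapping run at 09:00 on the same data set, a problem first seen at 08:30 points to the ingestion run.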

Page 9: Ontology Building vs Data Harvesting and Cleaning for Smart-city Services

Smart-city Ontology
• 84 Classes
• 93 ObjectProperties
• 103 DataProperties
http://www.disit.org/5606

Page 10: Ontology Building vs Data Harvesting and Cleaning for Smart-city Services

Data Engineering Architecture

[Figure: data engineering architecture. Labels recovered from the diagram: Data source 1, Data source 2, ..., Data source n and RT Data source n, each with its ETL Transformation; Process Scheduler; HBase; Phase I; Phase II, Quality improvement; Phase III, Mapping Process (R2RML Model), producing RDF triples; Phase IV, indexing (RDF Store + indexes, ontologies); Phase V, Reconciliation Process; Phase VI, Validation Process; Phase VII, SPARQL endpoint; Applications (Service Map, Linked Open Graph).]

Page 11: Ontology Building vs Data Harvesting and Cleaning for Smart-city Services

Phase I - Data Ingestion
• Ingesting a wide range of OD/PD: public and private data; static, quasi-static and/or dynamic real-time data.
• For the case of Florence, we are addressing about 150 different data sources of the 564 available, plus those of the region, the province, other municipalities, ...
• Using Pentaho Kettle for data integration (open source tool)
  – using specific Kettle ETL transformation processes (one or more for each data source)
  – data are stored in HBase (big data NoSQL database)
• Static and semi-static data include: points of interest, geo-referenced services, maps, accident statistics, etc.
  – files in several formats (SHP, KML, CSV, ZIP, XML, etc.)
• Dynamic data are mainly data coming from sensors
  – parking, weather conditions, pollution measures, bus positions, etc.
  – acquired using Web Services.

Page 12: Ontology Building vs Data Harvesting and Cleaning for Smart-city Services

Phase II - Data Quality Improvement
• Kinds of problems:
  – Inconsistencies, incompleteness, ...
• Problems on:
  – CAPs (postal codes) vs locations
  – Street names (e.g., separating names from numbers, normalizing when possible)
  – Dates and times: normalizing
  – Telephone numbers: normalizing
  – Web links and emails: normalizing
• Partial usage of:
  – Certified and accepted tables and additional knowledge
• The enrichment process may need several versions:
  – VIP names, GeoNames, etc.
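The normalizations listed for Phase II can be illustrated with a few small functions. This is only a sketch under stated assumptions: the deck performs these steps in Kettle ETL transformations, whereas the regular expression, the accepted date formats (day first, as is usual in Italian sources) and all function names below are hypothetical.

```python
import re
from datetime import datetime

# Street name followed by a trailing civic number such as "3", "12r" or "34/A".
STREET_RE = re.compile(r"^(.*?)[,\s]+(\d+\s*[A-Za-z]?(?:/[A-Za-z0-9]+)?)$")

# Assumed input formats, tried in order.
DATE_FORMATS = ("%d/%m/%Y", "%d-%m-%Y", "%Y-%m-%d")


def split_street(address):
    """Separate the street name from the street number, if one is present."""
    address = " ".join(address.split())  # collapse repeated whitespace
    match = STREET_RE.match(address)
    if match is None:
        return address, None
    return match.group(1).strip(), match.group(2).strip()


def normalize_phone(raw):
    """Keep only the digits, preserving a leading '+' for the country prefix."""
    raw = raw.strip()
    digits = re.sub(r"\D", "", raw)
    return "+" + digits if raw.startswith("+") else digits


def normalize_date(raw):
    """Return the date in ISO format (YYYY-MM-DD), or None if unrecognized."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None
```

Returning `None` instead of guessing keeps unrecognized values visible, so they can be handled by a later manual or table-based step.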

Page 13: Ontology Building vs Data Harvesting and Cleaning for Smart-city Services

Phase III - Data mapping
• Transforms the data from HBase into RDF triples
• Using the Karma data integration tool, a mapping model from SQL to RDF was created on the basis of the ontology
  – The data to be mapped are first temporarily moved from HBase to MySQL and then mapped using Karma (in batch mode)
• The mapped data, as triples, have to be uploaded (and indexed) to the RDF store (OpenRDF Sesame with OWLIM-SE)
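To make the idea of mapping tabular rows to RDF triples concrete, here is a hand-written sketch that emits N-Triples lines for one row. It does not reproduce Karma or the R2RML model used in the deck: the namespace `http://example.org/smartcity/`, the column names and the property names other than `isInRoad` are placeholders.

```python
from urllib.parse import quote

BASE = "http://example.org/smartcity/"  # placeholder namespace, not the real one
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def iri(local):
    """Build an N-Triples IRI term in the placeholder namespace."""
    return "<" + BASE + quote(local, safe="/") + ">"


def literal(value):
    """Build a plain N-Triples literal, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def row_to_ntriples(row):
    """Map one service row (hypothetical columns id, name, road_id) to triples."""
    subject = iri("service/" + row["id"])
    return [
        f"{subject} <{RDF_TYPE}> {iri('Service')} .",
        f"{subject} {iri('name')} {literal(row['name'])} .",
        f"{subject} {iri('isInRoad')} {iri('road/' + row['road_id'])} .",
    ]


if __name__ == "__main__":
    for line in row_to_ntriples({"id": "S1", "name": "Bar Centrale", "road_id": "R42"}):
        print(line)
```

Note that the `isInRoad` triple can only be emitted when the row already carries a road identifier; producing that link for rows that only have a free-text address is the job of the reconciliation phase.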

Page 14: Ontology Building vs Data Harvesting and Cleaning for Smart-city Services

Phase IV - Indexing
• Periodic task for reindexing: triples, text, space (GPS), dates, etc.
• Indexing triples: ontologies, all RDF files for OD, RT triples (from-to), reconciliation triples for OD, triples for enrichments, etc.
• If you do not index, you cannot identify all missing reconciliations

Phase V - Data Reconciliation/alignment
• After loading and indexing into the RDF store, a dataset may be connected with the others if its entities refer to the same triples
  – Missed connections strongly limit the usage of the knowledge base
  – e.g., the services are not connected with the road graph
• Goal: associate each Service with a Road and an Entity on the basis of the street name, number and locality
• It is not easy: the data come from different sources
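The remark that missing reconciliations can only be found once the triples are indexed can be illustrated on an in-memory set of triples. This is a toy sketch, not the RDF store's indexing: the prefixed names (`rdf:type`, `Service`, `isInRoad`) are written as plain strings and the functions are hypothetical.

```python
from collections import defaultdict


def build_index(triples):
    """Index triples as predicate -> subject -> set of objects."""
    index = defaultdict(lambda: defaultdict(set))
    for subject, predicate, obj in triples:
        index[predicate][subject].add(obj)
    return index


def unreconciled_services(index):
    """Return the services that have no isInRoad link to the road graph."""
    services = {
        subject
        for subject, types in index["rdf:type"].items()
        if "Service" in types
    }
    linked = set(index["isInRoad"])  # subjects that already have a road
    return sorted(services - linked)
```

With the index in place, the missed connections are a simple set difference; without it, every service would require a scan of all triples.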

Page 15: Ontology Building vs Data Harvesting and Cleaning for Smart-city Services

Phase V - Data Reconciliation/alignment
• Examples:
  – Typos;
  – Missing street number, or replaced with "0" or "SNC";
  – Municipalities written with a non-official name (e.g. Vicchio/Vicchio del Mugello);
  – Street names and street numbers with strange characters (-, /, °, ?, Ang., ,);
  – Road names with words in a different order (e.g. Via Petrarca Francesco, exchange of name and surname);
  – Red street numbers (for shops);
  – Presence/absence of proper names in the road name (e.g. via Camillo Benso di Cavour / via Cavour);
  – Numbers wrongly written (e.g. 34/AB, 403D, 36INT.1);
  – Roman numerals in the road name (e.g., via XXVII Aprile).
• Steps:
  1. SPARQL Exact Match: match the strings as they are
  2. SPARQL Enhanced Exact Match: make some substitutions (Via S. Marta -> Via Santa Marta, ...)
  3. Last Word Search: use only the last word of the street name
  4. Use the Google Geocoding API
  5. Remove 'strange chars' (-, /, °, ?, Ang., ,) from the street number
  6. Remove 'strange chars' from the street name
  7. Rewrite wrong municipality names
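The first three steps form a cascade: each one runs only if the previous one found nothing. The sketch below imitates that cascade on an in-memory dictionary of official road names instead of SPARQL queries. The expansion table, the rule of accepting a last-word hit only when it is unambiguous, and the function names are assumptions made for illustration.

```python
from itertools import product

# Hypothetical abbreviation table; "s." is ambiguous, so all expansions are tried.
EXPANSIONS = {"s.": ["san", "santa", "santo"], "p.za": ["piazza"]}


def variants(name):
    """Generate lower-case spellings of `name` with abbreviations expanded."""
    options = [EXPANSIONS.get(word, [word]) for word in name.lower().split()]
    return [" ".join(choice) for choice in product(*options)]


def reconcile(name, roads):
    """Match a street name against `roads` (official name -> road id).

    Returns (road id, step name), or (None, "unmatched").
    """
    # Step 1: exact match, the strings as they are.
    if name in roads:
        return roads[name], "exact"
    # Step 2: enhanced exact match, after case folding and substitutions.
    lowered = {official.lower(): road_id for official, road_id in roads.items()}
    for candidate in variants(name):
        if candidate in lowered:
            return lowered[candidate], "enhanced"
    # Step 3: last word search, accepted only when exactly one road matches.
    words = name.lower().split()
    if words:
        hits = {
            road_id
            for official, road_id in lowered.items()
            if official.split()[-1] == words[-1]
        }
        if len(hits) == 1:
            return hits.pop(), "last-word"
    return None, "unmatched"
```

Names that fall through all three steps, such as "Via Petrarca Francesco" with name and surname exchanged, are the ones left for the later steps (geocoding and character clean-up).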

Page 16: Ontology Building vs Data Harvesting and Cleaning for Smart-city Services

Phase V - Data Reconciliation/alignment

Method                                                                Precision  Recall  F1
SPARQL-based reconciliation                                           1.00       0.69    0.820
SPARQL-based reconciliation + additional manual review                0.985      0.722   0.833
Link discovering - Levenshtein                                        0.927      0.508   0.656
Link discovering - Dice                                               0.968      0.674   0.794
Link discovering - Jaccard                                            1.000      0.472   0.642
Link discovering + heuristics based on data knowledge + Levenshtein   0.925      0.714   0.806
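The F1 column is the harmonic mean of precision and recall, F1 = 2PR / (P + R). A short sketch for recomputing it from the first two columns follows; the function name `f1_score` is just a label chosen here. Because the published precision and recall are already rounded, recomputed values can differ from the slide by about 0.001.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # (precision, recall) pairs taken from the table.
    rows = {
        "SPARQL-based reconciliation": (1.00, 0.69),
        "SPARQL-based + manual review": (0.985, 0.722),
        "Link discovering - Levenshtein": (0.927, 0.508),
        "Link discovering + heuristics + Levenshtein": (0.925, 0.714),
    }
    for method, (precision, recall) in rows.items():
        print(f"{method}: {f1_score(precision, recall):.3f}")
```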

Comparing different reconciliation approaches based on:
• the SILK link discovering language
• the SPARQL-based reconciliation described above

Thus, automating the reconciliation is possible and produces acceptable results.
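The three string measures named in the comparison (Levenshtein, Dice, Jaccard) can be sketched in a few lines. These are textbook definitions, with Dice and Jaccard computed on sets of lower-cased words; how the link discovery tool tokenizes and thresholds them in the authors' setup is not stated in the slides, so treat those choices as assumptions.

```python
def levenshtein(a, b):
    """Edit distance: minimum number of insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + (char_a != char_b)  # substitution or match
            ))
        previous = current
    return previous[-1]


def tokens(text):
    """Lower-cased set of whitespace-separated words."""
    return set(text.lower().split())


def jaccard(a, b):
    """Shared words divided by all distinct words."""
    set_a, set_b = tokens(a), tokens(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0


def dice(a, b):
    """Twice the shared words divided by the total number of words."""
    set_a, set_b = tokens(a), tokens(b)
    total = len(set_a) + len(set_b)
    return 2 * len(set_a & set_b) / total if total else 1.0
```

Word-set measures are insensitive to word order, so "Via Petrarca Francesco" and "Via Francesco Petrarca" score 1.0, while an edit distance penalizes the swap; on "via Cavour" against "via Camillo Benso di Cavour", Dice gives 4/7 and Jaccard 2/5.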

Page 17: Ontology Building vs Data Harvesting and Cleaning for Smart-city Services

Phase VI - Validation
• A set of queries is applied automatically to verify consistency and completeness after each re-indexing and each new data integration
  – i.e., regression testing of the KB

Phase VII - Data access
• Applications can access the data through the SPARQL endpoint. Currently there are two applications:
  – ServiceMap (http://servicemap.disit.org), a map-based application
  – Linked Open Graph (http://log.disit.org), for browsing the data from SPARQL/Linked Data sources
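The Phase VI idea of regression testing the knowledge base can be sketched as a registry of named checks that is rerun after every re-indexing or data integration. The deck runs such checks as queries on the RDF store; the version below works on an in-memory set of triples, and the two checks, the class names as plain strings and the function names are assumptions chosen for illustration.

```python
def subjects_of_type(triples, cls):
    """Subjects declared as instances of `cls`."""
    return {s for s, p, o in triples if p == "rdf:type" and o == cls}


def subjects_with(triples, prop):
    """Subjects that have at least one value for `prop`."""
    return {s for s, p, _ in triples if p == prop}


def services_without_road(triples):
    return subjects_of_type(triples, "Service") - subjects_with(triples, "isInRoad")


def sensors_without_time(triples):
    return subjects_of_type(triples, "Sensor") - subjects_with(triples, "measuredTime")


# Registry of completeness checks; each returns the offending subjects.
CHECKS = {
    "services_without_road": services_without_road,
    "sensors_without_time": sensors_without_time,
}


def run_validation(triples):
    """Return {check name: sorted offenders} for the checks that fail."""
    report = {}
    for name, check in CHECKS.items():
        offenders = sorted(check(triples))
        if offenders:
            report[name] = offenders
    return report
```

An empty report means the knowledge base passed every registered check; comparing reports between runs shows whether a new integration introduced a regression.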

Page 18: Ontology Building vs Data Harvesting and Cleaning for Smart-city Services

ServiceMap: http://servicemap.disit.org

Page 19: Ontology Building vs Data Harvesting and Cleaning for Smart-city Services

Linked Open Graph: http://log.disit.org

Page 20: Ontology Building vs Data Harvesting and Cleaning for Smart-city Services

Conclusions
• Developed
  – a Smart-city Ontology as a conceptual model for reasoning
  – a platform for smart-city data ingestion and semantic interoperability processes, as big data tools
  – the assessment demonstrated that automated reconciliation is possible
• Future/ongoing activities
  – Improvement of data alignment and cleaning
  – Definition of languages and tools for reasoning
• It will be used in the Sii-Mobility project:
  – Adding prediction algorithms
  – Adding user-generated information
  – Adding more applications using the data


Page 22: Ontology Building vs Data Harvesting and Cleaning for Smart-city Services

Thank you!

Paolo Nesi
Dipartimento di Ingegneria dell’Informazione, DINFO
Università degli Studi di Firenze
Via S. Marta 3, 50139, Firenze, Italy
Tel: +39-055-4796567, fax: +39-055-4796363
DISIT Lab
http://www.disit.dinfo.unifi.it alias http://www.disit.org, [email protected]
http://www.disit.org/5606