Sören Auer | Enterprise Knowledge Graphs


The three Big Data "V"s – Variety is often neglected

Source: Gesellschaft für Informatik

Linked Data Principles

Addressing the neglected third V (Variety)

1. Use URIs to identify the “things” in your data

2. Use http:// URIs so people (and machines) can look them up on the web

3. When a URI is looked up, return a description of the thing (in RDF format)

4. Include links to related things
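
Principles 2 and 3 can be illustrated with a minimal Python sketch. It only prepares an HTTP request that asks for an RDF description of a thing; it does not send it. The function name build_lookup_request, the choice of the text/turtle media type and the DBpedia example URI are assumptions made for illustration, and whether a given server answers with that format is not guaranteed.

```python
from urllib.request import Request


def build_lookup_request(uri: str, rdf_format: str = "text/turtle") -> Request:
    """Prepare (but do not send) an HTTP GET asking for an RDF description."""
    # Principle 2: the identifier should be an http(s) URI so it can be looked up.
    if not uri.startswith(("http://", "https://")):
        raise ValueError("expected an http:// or https:// URI")
    # Principle 3: the Accept header asks the server for an RDF serialization.
    return Request(uri, headers={"Accept": rdf_format})


# Hypothetical example thing, identified by a DBpedia resource URI.
request = build_lookup_request("http://dbpedia.org/resource/Bonn")
```

Sending the prepared request (for example with urllib.request.urlopen) and following the links found in the returned description would cover principle 4.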

http://www.w3.org/DesignIssues/LinkedData.html


[1] Auer, Lehmann, Ngomo, Zaveri: Introduction to Linked Data and Its Lifecycle on the Web. Reasoning Web 2013

http://dblp.uni-trier.de/db/conf/rweb/rweb2013.html#AuerLNZ13

Linked (Open) Data: The RDF Data Model


RDF = Resource Description Framework

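[Figure: example graph]
DHL --full name--> DHL International GmbH
DHL --headquarters--> Post Tower
DHL --industry--> Logistics
Post Tower --located in--> Bonn
Post Tower --height--> 162.5 m
Logistics --label--> Logistik
Logistics --label--> 物流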

RDF Data Model (a bit more technical)

– Graph consists of:
  • Resources (identified via URIs)
  • Literals: data values with data type (URI) or language (multilinguality integrated)
  • Attributes of resources are also URI-identified (from vocabularies)
– Various data sources and vocabularies can be arbitrarily mixed and meshed
– URIs can be shortened with namespace prefixes; e.g. dbp: → http://dbpedia.org/resource/

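[Figure: the same graph with URIs and typed literals]
dbp:DHL_International_GmbH foaf:name "DHL International GmbH"^^xsd:string
dbp:DHL_International_GmbH ex:headquarters dbp:Post_Tower
dbp:DHL_International_GmbH dbo:industry dbp:Logistics
dbp:Post_Tower gn:locatedIn dbp:Bonn
dbp:Post_Tower ex:height [ rdf:value "162.5"^^xsd:decimal ; ex:unit unit:Meter ]
dbp:Logistics rdfs:label "Logistik"@de
dbp:Logistics rdfs:label "物流"@zh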
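
The namespace-prefix shortening mentioned on this slide can be sketched in a few lines of Python. The dbp: mapping is taken from the slide; the rdfs: and foaf: namespaces are the standard ones, while the ex: namespace and the helper name expand are assumptions made for illustration.

```python
PREFIXES = {
    "dbp": "http://dbpedia.org/resource/",             # given on the slide
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",   # standard RDFS namespace
    "foaf": "http://xmlns.com/foaf/0.1/",              # standard FOAF namespace
    "ex": "http://example.org/",                       # hypothetical namespace
}


def expand(curie: str) -> str:
    """Expand a prefixed name such as dbp:Bonn into a full URI."""
    prefix, separator, local_name = curie.partition(":")
    if not separator or prefix not in PREFIXES:
        raise KeyError(f"unknown prefix in {curie!r}")
    return PREFIXES[prefix] + local_name


# A resource-to-resource triple from the example graph, in prefixed form.
triple = ("dbp:DHL_International_GmbH", "ex:headquarters", "dbp:Post_Tower")
expanded_triple = tuple(expand(term) for term in triple)
```

Literals such as "Logistik"@de are not prefixed names and would need separate handling; the sketch covers resources only.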


RDF mediates between different Data Models & bridges between Conceptual and Operational Layers

Tabular/Relational Data

Id    Title    Screen
5624  SmartTV  104cm
5627  Tablet   21cm

Prod:5624 rdf:type Electronics
Prod:5624 rdfs:label "SmartTV"
Prod:5624 hasScreenSize "104"^^unit:cm
...

Taxonomic/Tree Data

Electronics
Vehicle: Car, Bus, Truck

Vehicle rdf:type owl:Thing
Car rdfs:subClassOf Vehicle
Bus rdfs:subClassOf Vehicle
...

Logical Axioms / Schema

Male rdfs:subClassOf Human
Female rdfs:subClassOf Human
Male owl:disjointWith Female
...
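
How a relational row turns into triples can be sketched as follows. This is a minimal Python illustration using the two product rows from the slide; the function name row_to_triples is hypothetical, and typing every row as Electronics is an assumption (the slide only shows it for product 5624).

```python
rows = [
    {"Id": "5624", "Title": "SmartTV", "Screen": "104cm"},
    {"Id": "5627", "Title": "Tablet", "Screen": "21cm"},
]


def row_to_triples(row):
    """Turn one table row into (subject, predicate, object) triples."""
    subject = f"Prod:{row['Id']}"            # the primary key becomes the subject
    screen = row["Screen"]
    # Separate the numeric value from the unit so the unit can become a datatype.
    size = screen[:-2] if screen.endswith("cm") else screen
    return [
        (subject, "rdf:type", "Electronics"),            # assumed class for all rows
        (subject, "rdfs:label", f'"{row["Title"]}"'),
        (subject, "hasScreenSize", f'"{size}"^^unit:cm'),
    ]


triples = [triple for row in rows for triple in row_to_triples(row)]
```

Each column becomes a predicate and each cell an object, which is why the same pattern also works for tree-shaped and schema-level data.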

© Fraunhofer

Vocabularies – Breaking the mold!

Semantic data virtualization allows for continuous expansion and enhancement of data and metadata across data sources without losing the overall perspective

Relational data models: 1:1 relation between data model and application

Graph-based data model: 1:n relation between data model and application
[Figure: chain of triples, Subject --Predicate--> Object / Subject --Predicate--> Object / Subject]


Vocabulary Example

Vocabulary Schema

Visual representation:

Class: Company
Property      Expected type
inIndustry    Industry
fullName      String
headquarter   Building

Class: Building
Property      Expected type
locatedIn     Industry
height        unit:meter

RDF representation:

Company rdf:type rdfs:Class
Building rdf:type rdfs:Class

inIndustry rdf:type rdf:Property
inIndustry rdfs:domain Company
inIndustry rdfs:range Industry

headquarter rdf:type rdf:Property
headquarter rdfs:domain Company
headquarter rdfs:range Building

Instantiation

Visual representation:
[Figure: the DHL / Post Tower example graph]

RDF representation:

DHL rdf:type Company
DHL fullName "DHL Int. GmbH"
DHL inIndustry Logistics
DHL headquarter PostTower

PostTower rdf:type Building
PostTower locatedIn dbpedia:Bonn
PostTower height "162.5"^^meter
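
What the rdfs:domain and rdfs:range declarations buy can be shown with a small Python sketch: under RDFS semantics, using a property lets a reasoner infer the types of its subject and object. The dictionary layout and the function name infer_types are assumptions for illustration, not part of any RDF library.

```python
# Schema from the slide: property -> declared domain and range.
schema = {
    "inIndustry": {"domain": "Company", "range": "Industry"},
    "headquarter": {"domain": "Company", "range": "Building"},
}

# Instance data from the slide, without any explicit rdf:type statements.
instance_triples = [
    ("DHL", "inIndustry", "Logistics"),
    ("DHL", "headquarter", "PostTower"),
]


def infer_types(triples, schema):
    """Derive rdf:type statements from rdfs:domain and rdfs:range declarations."""
    inferred = set()
    for subject, predicate, obj in triples:
        declaration = schema.get(predicate)
        if declaration is None:
            continue  # nothing is declared for this property
        inferred.add((subject, "rdf:type", declaration["domain"]))
        inferred.add((obj, "rdf:type", declaration["range"]))
    return inferred


inferred_types = infer_types(instance_triples, schema)
```

Note that domain and range act as inference rules here, not as validation constraints; constraint checking is the job of a shapes language.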

The Semantic Web Layer Cake 2001

http://www.w3.org/2001/10/03-sww-1/slide7-0.html

• Monolithic, based on XML
• Focus on heavyweight semantics (ontologies, logic, reasoning)





The Semantic Web Layer Cake 2015 – Bridging between Big & Smart Data

[Figure: layer cake]
• Unicode, URIs
• XML, JSON, CSV, RDB, HTML
• RDF
• RDF/XML, JSON-LD, CSV2RDF, R2RML, RDFa
• RDF Data Shapes
• RDF-Schema
• Vocabularies
• Ontologies, SKOS Thesauri
• Logic, SWRL Rules
• SPARQL
• Across layers: (Access control), Signature, Encryption (HTTPS/CERT/DANE)

• Lingua Franca of Data integration with many technology interfaces (XML, HTML, JSON, CSV, RDB,…)

• Focus on lightweight vocabularies, rules, thesauri etc.

• Less “invasive”


RDF - the Lingua Franca of Data Integration

• RDF is simple
• We can easily encode and combine all kinds of data models (relational, taxonomic, graphs, object-oriented, …)
• RDF supports distributed data and schema
• We can seamlessly evolve simple semantic representations (vocabularies) to more complex ones (e.g. ontologies)
• Small representational units (URI/IRIs, triples) facilitate mixing and mashing
• RDF can be viewed from many perspectives: facts, graphs, ER, logical axioms, objects
• RDF integrates well with other formalisms: HTML (RDFa), XML (RDF/XML), JSON (JSON-LD), CSV, …
• Linking and referencing between different knowledge bases, systems and platforms facilitates the creation of sustainable data ecosystems


Successful application domains: Linked Data & Semantic Integration

Search Engine Optimization & Web-Commerce
• Schema.org used by >20% of Web sites
• Major search engines exploit semantic descriptions

Pharma, Lifesciences
• Mature, comprehensive vocabularies and ontologies
• Billions of disease, drug, clinical trial descriptions

Digital Libraries
• Many established vocabularies (DublinCore, FRBR, EDM)
• Millions of items aggregated from thousands of memory institutions in Europeana, German Digital Library

© Fraunhofer-Institut für Intelligente Analyse- und Informationssysteme

IAIS

The Web evolves into a Web of Data


Linked Open Data

Facebook Open Graph



Knowledge Graphs – A definition

• Fabric of concept, class, property, relationships, entity descriptions
• Uses a knowledge representation formalism (typically RDF, RDF-Schema, OWL)
• Holistic knowledge (multi-domain, source, granularity):
  – instance data (ground truth),
  – open (e.g. DBpedia, WikiData), private (e.g. supply chain data), closed data (product models),
  – derived, aggregated data,
  – schema data (vocabularies, ontologies)
  – meta-data (e.g. provenance, versioning, documentation, licensing)
  – comprehensive taxonomies to categorize entities
  – links between internal and external data
  – mappings to data stored in other systems and databases



Knowledge Graph Challenges & Opportunities

Knowledge graphs typically cover
• Multiple domains
• Various levels of granularity
• Data from multiple sources
• Various degrees of structure

Challenges
• Quality
• Coherence
• Co-evolution
• Update propagation
• Curation & interaction

Opportunities
• Background knowledge for various applications (e.g. question answering, data integration, machine learning)
• Facilitate intra-organizational data sharing and exchange (data value chains)


Comparison of various enterprise data integration paradigms

Paradigm          Data model  Integr.   Conceptual/  Heterogeneous  Intern./extern.  No. of   Type of   Domain    Semantic
                              strategy  operational  data           data             sources  integr.   coverage  repres.
XML Schema        DOM trees   LaV       operational  ?              ?                medium   both      medium    high
Data Warehouse    relational  GaV       operational  -              partially        medium   physical  small     medium
Data Lake         various     LaV       operational  ?              ?                large    physical  high      medium
MDM               UML         GaV       conceptual   -              -                small    physical  small     medium
PIM / PCS         trees       GaV       operational  partially      partially        -        physical  medium    medium
Enterprise search document    -         operational  ?              partially        large    virtual   high      low
EKG               RDF         LaV       both         ?              ?                medium   both      high      very high

(? marks cells whose content did not survive in the slide text.)

[1] Michael Galkin, Sören Auer, Simon Scerri: Enterprise Knowledge Graphs: A Survey. Submitted to 37th International Conference on Information Systems. 2016.



Knowledge Graph Technology


Adding a Semantic Layer to Data Lakes

[Figure: business units (Management, Accounting, Marketing, Sales, Support, R&D) on top of a layered data architecture]

Semantic Data Lake
• central place for model, schema and data historization
• Combination of Scale Out (cost reduction) and semantics (increased control & flexibility)
• grows incrementally (pay-as-you-go)

Layers shown in the figure:
• Inbound: Data Sources
• Inbound Raw Data Store
• Data Lake (order of magnitude cheaper scalable data store)
• Knowledge Graph for Relationship Definition and Meta Data
• Outbound and Consumption:
  – Frontend to Access Relationship and KPI Definition / Documentation
  – Frontend to Access (ad hoc) Reports
  – Outbound Data Delivery to Target Systems
• Mapping technologies: JSON-LD, CSVW, R2RML, XML2RDF


W3C R2RML – Relational to RDF Mapping

R2RML: RDB to RDF Mapping Language, W3C Recommendation 27 September 2012
Editors: Souripriya Das, Seema Sundara, Richard Cyganiak
http://www.w3.org/TR/r2rml/


Example R2RML Mapping

1. Either the resulting RDF knowledge base is materialized in a triple store and
2. subsequently queried using SPARQL,
3. or the materialization step is avoided by dynamically mapping an input SPARQL query into a corresponding SQL query, which returns exactly the same results as the SPARQL query executed against the materialized RDF dump
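
The materialization route (steps 1 and 2) can be sketched in Python with an in-memory SQLite database. The MAPPING dictionary below is a hypothetical, heavily simplified stand-in for a mapping document; it is not R2RML syntax, and the table, column names and the http://example.org/ namespace are assumptions for illustration.

```python
import sqlite3

# A tiny relational source (hypothetical table and data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany("INSERT INTO product VALUES (?, ?)",
                 [(5624, "SmartTV"), (5627, "Tablet")])

# Simplified mapping: which table to read, how to mint subject URIs,
# and which column feeds which predicate.
MAPPING = {
    "table": "product",
    "subject_template": "http://example.org/product/{id}",
    "predicate_columns": {
        "http://www.w3.org/2000/01/rdf-schema#label": "title",
    },
}


def materialize(conn, mapping):
    """Read every row of the mapped table and emit the corresponding triples."""
    columns = ["id"] + list(mapping["predicate_columns"].values())
    query = f"SELECT {', '.join(columns)} FROM {mapping['table']} ORDER BY id"
    triples = []
    for row in conn.execute(query):
        record = dict(zip(columns, row))
        subject = mapping["subject_template"].format(**record)
        for predicate, column in mapping["predicate_columns"].items():
            triples.append((subject, predicate, record[column]))
    return triples


triples = materialize(conn, MAPPING)
```

The resulting triples would then be loaded into a triple store and queried with SPARQL; the trade-off against the virtual route in step 3 is that the materialized copy has to be refreshed whenever the relational data changes.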

SPARQLMap – Mapping RDB 2 RDF

Example: Sparqlify

• Rationale: Exploit existing formalisms (SQL, SPARQL Construct) as much as possible
• flexible & versatile mapping language
• translating one SPARQL query into exactly one efficiently executable SQL query
• Solid theoretical formalization based on SPARQL-relational algebra transformations
• Extremely scalable through an elaborate view candidate selection mechanism
• Used to publish 20B triples for LinkedGeoData
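
The idea of translating one SPARQL query into exactly one SQL query can be illustrated for the simplest possible case, a single triple pattern. The sketch below is not Sparqlify's algorithm or mapping language; the VIEW dictionary, the table layout and the function names are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany("INSERT INTO product VALUES (?, ?)",
                 [(5624, "SmartTV"), (5627, "Tablet")])

# Hypothetical view definition: triples with this predicate live in this table.
VIEW = {
    "predicate": "rdfs:label",
    "table": "product",
    "subject_column": "id",
    "subject_prefix": "Prod:",
    "object_column": "title",
}


def rewrite(pattern, view):
    """Translate one (subject, predicate, object) pattern into one SQL query."""
    subject, predicate, obj = pattern
    if predicate != view["predicate"]:
        raise ValueError("no view candidate for this predicate")
    sql = (f"SELECT {view['subject_column']}, {view['object_column']} "
           f"FROM {view['table']}")
    conditions, params = [], []
    if not subject.startswith("?"):      # bound subject: strip prefix, compare key
        conditions.append(f"{view['subject_column']} = ?")
        params.append(subject[len(view["subject_prefix"]):])
    if not obj.startswith("?"):          # bound object: compare column value
        conditions.append(f"{view['object_column']} = ?")
        params.append(obj)
    if conditions:
        sql += " WHERE " + " AND ".join(conditions)
    return sql + f" ORDER BY {view['subject_column']}", params


def answer(conn, pattern, view):
    """Run the rewritten SQL and turn rows back into RDF-style bindings."""
    sql, params = rewrite(pattern, view)
    prefix = view["subject_prefix"]
    return [(f"{prefix}{key}", value) for key, value in conn.execute(sql, params)]


sql, params = rewrite(("?product", "rdfs:label", "SmartTV"), VIEW)
matches = answer(conn, ("?product", "rdfs:label", "SmartTV"), VIEW)
```

Real rewriters handle joins across several patterns, filters and multiple candidate views, which is where the view candidate selection mentioned above becomes important.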

[Figure: a Bridge connects a SPARQL Construct to an SQL View]

[1] Stadler, Unbehauen, Auer, Lehmann: Sparqlify – Very Large Scale Linked Data Publication from Relational Databases.
[2] Unbehauen, Stadler, Auer: Optimizing SPARQL-to-SQL Rewriting. iiWAS 2013
[3] Auer, et al.: Triplify: light-weight linked data publication from relational databases. WWW 2009. http://www2009.eprints.org/63/














Semantified Big Data Architecture Blueprint

[1] Mami, Scerri, Auer, Vidal: Towards the Semantification of Big Data Technology. DEXA 2016

[Figure: Data sources → Ingestion → Storage → Queries]
• Semantic Lifting with Mappings
• Storing of semantic and semantified data in Apache Parquet files on HDFS


SEBIDA Implementation Architecture


SEBIDA Evaluation Results

• Loads data faster
• Has quite different query performance characteristics: faster in 5 out of 12 queries, similar performance in 2, slower in 5


VOCOL: COLLABORATIVE VOCABULARY CURATION ENVIRONMENT

Comprehensive Support for Evolving Vocabularies


Industry 4.0: Semantic Models as Bridge between Shop & Office Floor


Semantic Administrative Shell & Reference Architecture for Industry 4.0 (RAMI4.0)

Administrative Shell (Verwaltungsschale) provides a digital identity for arbitrary Industry 4.0 components (e.g. sensors, actuators/robots), exposing data covering the whole life-cycle

Reference Architecture for Industry 4.0 (RAMI4.0) provides a conceptual framework for implementing comprehensive Industry 4.0 scenarios

We have implemented both concepts along with a number of IEC and ISO standards in a comprehensive information model ready to be deployed in production environments


VoCol collaborative Development Environment for Vocabularies

• Versioning: Git/Bitbucket
• Issue tracking: GitLab/GitHub
• Syntax validation
• Documentation generation
• Authoring: Turtle
• Visualization: vOWL
• Publishing: LOD/SPARQL

• Integrates a number of tools & services for different aspects of vocabulary development
• Is centered around Git version control (or Bitbucket), thus supporting the branching and merging of vocabularies
• Supports the roundtrip between
  – Schema/vocabulary development
  – Competency questions (expressed in SPARQL)
  – Example data
• Bridges between conceptual models and executable code

http://eis.iai.uni-bonn.de/Projects/VoCol.html


Development based on Git – Version Control

• Git is by now the most widely used version control system. It is a distributed revision control system with an emphasis on speed, data integrity, and support for distributed, non-linear workflows.
• Git was initially designed and developed in 2005 by Linux kernel developers for Linux kernel development
• Git is the basis for a variety of open-source or commercial services and products such as:
  – GitHub/Bitbucket: Web-based Git repository hosting services with millions of users
  – GitLab/Gitolite: open-source Web-based Git repository management platforms
  – Since the Team Foundation Server 2013 release, Microsoft has added native support for Git
• Git is easily extensible and can be integrated into arbitrary workflows via Git hooks


Information Model – Environment


Environment: Dynamic Documentation




Environment: Dynamic Visualization


Environment: Analytics







Environment: Querying


Environment: Evolution


INDUSTRIAL DATA SPACE


Vocabulary-based Integration facilitates Data-driven Businesses

Vocabulary


Work on the Industrial Data Space is complementary to and interlocked with the Plattform Industrie 4.0

[Figure]
• Application domains: Retail 4.0, Banking 4.0, Insurance 4.0, …, Industrie 4.0
• Industrie 4.0: focus on the manufacturing industry
• Layers: Smart Services; Data; Transmission, Networks; Real-time Systems; …
• Industrial Data Space: focus on data



The Industrial Data Space Initiative
• Community of >30 large German and European companies
• Pre-competitive, publicly funded innovation project involving 11 Fraunhofer institutes for developing the IDS reference architecture

Current members of the Industrial Data Space Association



Semantic Data Linking for Enterprise Data Value Chains

                          Data Lake                  Industrial Data Space                         Pure Internet
                          centralized, monopolistic  federated, secure, "trusted", standard-based  completely decentralized, open, insecure
Data management           Central repository         Decentralized                                 Decentralized
Data ownership            Central                    Decentralized                                 Decentralized
Data linking              Single provider            Federated, on demand                          Missing
Data security             Bilateral                  Certified system                              Bilateral
Market structure          Central provider           Role system                                   Unstructured
Transport infrastructure  Internet                   Internet                                      Internet



Basic principles of the Industrial Data Space

• Linked Light Semantics
• Security with Industrial Data Container
• Certified Roles
• On-Demand Interlinking



Industrial Data Space: On Demand Interlinking

[Figure: Enterprises 1 to 6 interlinked on demand through Services A to G]

All data stays with its owners and remains controlled and secured. Data is shared only on request for a service. There is no central platform.


Industrial Data Space

[Figure: architecture components]
• Industrial Data Space Broker
• Industrial Data Space App Store
• Apps, Vocabulary
• Clearing, Registry, Index
• Company A and Company B, each with an Internal IDS Connector
• External IDS Connectors
• Third Party Cloud Provider
• Data flows over the Internet: Upload / Download / Search


IDS Architecture Overview


Big Data is not just Volume and Velocity
Variety (& Veracity) are key challenges
Linked Data helps in dealing with both
• The Linked Data life-cycle requires integrating and adapting results from a number of disciplines
  – NLP,
  – Machine Learning,
  – Knowledge Representation,
  – Data Management,
  – User Interaction
  – …
• Applications in a number of domains
  – cultural heritage,
  – life sciences,
  – industry 4.0 / cyber-physical systems,
  – smart cities,
  – mobility,
  – …

Linked Data links not only data but also:
• Various disciplines
• Applications and use cases


The Team

Creating Knowledge out of Interlinked Data

Thanks for your attention!

Sören Auer
http://www.iai.uni-bonn.de/~auer | http://aksw.org/ | [email protected]

LINKED-DATA-BASED QUESTION ANSWERING

A Grand Challenge


Question Answering research challenges

Main Goals
• Completeness ⇒ Extension of background knowledge, streams, deduplication
• Flexibility ⇒ Deal with keywords and NL
• Runtime ⇒ New models for query processing, ranking for top-k queries
• Easy use ⇒ Verbalization of queries, entity verbalization, explanation of answers in NL
• Multilinguality ⇒ cover several European languages

Automatic Extension of background knowledge
• 1. Generate query from own data and get answer set A; 2. Add new data set and get answer set A'; 3. If info gain, then iterate; 4. Else terminate

Data Streams
• Continuous queries on data streams (update SPARQL results as new information comes in)
• Send novel answers to end user
• Open Information Extraction

Hybrid Search: extension for queries on unstructured data

Ensure Quasi-Completeness
• Fully automatic entity consolidation
• Find links at runtime, e.g., between DBpedia and LinkedMDB to answer "Which films were directed by and starred Tarantino?"
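
The four-step loop for automatically extending background knowledge can be sketched in Python as follows. This is a toy illustration under stated assumptions: data sets are modelled as sets of triples, a query is a predicate function over triples, and "info gain" is taken to mean that the answer set grew. The function names and the sample facts are hypothetical.

```python
def answers(query, datasets):
    """Evaluate a query (a predicate over facts) on the union of all data sets."""
    return {fact for dataset in datasets for fact in dataset if query(fact)}


def extend_background_knowledge(query, own_data, candidates):
    """Add candidate data sets for as long as they enlarge the answer set."""
    used = [own_data]
    current = answers(query, used)                      # step 1: answer set A
    for candidate in candidates:
        extended = answers(query, used + [candidate])   # step 2: answer set A'
        if len(extended) > len(current):                # step 3: info gain, iterate
            used.append(candidate)
            current = extended
        else:                                           # step 4: terminate
            break
    return current, used


# Toy data sets (illustrative facts only).
own = {("Pulp Fiction", "director", "Tarantino")}
linked = {("Pulp Fiction", "starring", "Tarantino")}
unrelated = {("Bonn", "locatedIn", "Germany")}


def mentions_tarantino(fact):
    return fact[2] == "Tarantino"


result, used = extend_background_knowledge(mentions_tarantino, own,
                                           [linked, unrelated])
```

A real system would use a more careful information-gain measure and would rank candidate data sets before trying them.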


[1] Shekarpour, Marx, Ngomo, Auer: Semantic query interpretation for question answering on linked data. J. Web Semantics 30 (2015)
[2] Marx, Usbeck, Ngomo, Höffner, Lehmann, Auer: Towards an open question answering architecture. SEMANTICS 2014
[3] Shekarpour, Ngomo, Auer: Question answering on interlinked data. WWW 2013

The approach: An Open QA Architecture

Create an open, extensible architecture for Linked-Data-based Question Answering
• Enable the plugin and competition of different modules for various QA aspects:
  – Input: query string / question, voice, brain input; Query Splitting; Disambiguation/Mapping; Query Construction; Query Execution; Result presentation
• Take context, personalization, feedback into account

For Whom? Use Cases:
• In-car interaction / Human Vehicle Interaction
  – Where can I find parking? What are the main sights in Luxembourg?
• Assisting people with disabilities (e.g. visually impaired)
  – Is there any pharmacy still open? What classical concerts are broadcast next week?
• Medical information retrieval
  – Which side effects can be caused by Paracetamol? Do Paracetamol and Tamiflu interfere?
• …
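
The plug-in idea behind such an open architecture can be sketched as a pipeline with named stages into which interchangeable modules are registered. This is a minimal Python illustration, not the architecture of any existing QA system; the class QAPipeline, the stage names and the toy modules are assumptions.

```python
from typing import Callable, Dict

Stage = Callable[[dict], dict]  # a module takes the shared state and returns it


class QAPipeline:
    """Runs the registered modules in a fixed stage order."""

    STAGES = ["splitting", "disambiguation", "construction",
              "execution", "presentation"]

    def __init__(self) -> None:
        self.modules: Dict[str, Stage] = {}

    def plug(self, stage: str, module: Stage) -> None:
        """Register (or replace) the module responsible for one stage."""
        if stage not in self.STAGES:
            raise ValueError(f"unknown stage: {stage}")
        self.modules[stage] = module

    def ask(self, question: str) -> dict:
        state = {"question": question}
        for stage in self.STAGES:
            module = self.modules.get(stage)
            if module is not None:      # stages without a module are skipped
                state = module(state)
        return state


# Two toy modules; real ones would call NLP components or a SPARQL endpoint.
def split_query(state: dict) -> dict:
    return {**state, "tokens": state["question"].rstrip("?").split()}


def present(state: dict) -> dict:
    return {**state, "summary": f"{len(state['tokens'])} tokens"}


pipeline = QAPipeline()
pipeline.plug("splitting", split_query)
pipeline.plug("presentation", present)
outcome = pipeline.ask("Where can I find parking?")
```

Because each stage only depends on the shared state dictionary, competing implementations of one stage can be swapped and compared without touching the others.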


[1] The WDAqua Marie Curie ITN: Answering Questions using Web Data. http://wdaqua.informatik.uni-bonn.de
