Enterprise Knowledge Graphs Sören Auer
The three Big Data „V“ – Variety is often neglected
Source: Gesellschaft für Informatik
Linked Data Principles
Addressing the neglected third V (Variety)
1. Use URIs to identify the “things” in your data
2. Use http:// URIs so people (and machines) can look them up on the web
3. When a URI is looked up, return a description of the thing (in RDF format)
4. Include links to related things
http://www.w3.org/DesignIssues/LinkedData.html
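Principles 2 and 3 can be illustrated with a minimal sketch: a client prepares an HTTP lookup of a URI and asks for an RDF serialization through the Accept header. The helper name build_lookup_request, the choice of Turtle as media type and the DBpedia URI are assumptions made for illustration; the request is only constructed here, not sent.

```python
from urllib.request import Request


def build_lookup_request(uri: str, media_type: str = "text/turtle") -> Request:
    """Prepare (but do not send) an HTTP GET asking for an RDF description of `uri`."""
    if not uri.startswith(("http://", "https://")):
        # Principle 2: the identifier should be an http(s) URI so it can be looked up.
        raise ValueError("expected an http(s) URI")
    # Principle 3: ask the server for RDF via content negotiation.
    return Request(uri, headers={"Accept": media_type})


# Illustrative URI; any dereferenceable Linked Data URI would do.
request = build_lookup_request("http://dbpedia.org/resource/Bonn")
```

Passing the prepared request to urllib.request.urlopen would perform the actual lookup; whether a given server answers with Turtle depends on that server.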
[1] Auer, Lehmann, Ngomo, Zaveri: Introduction to Linked Data and Its Lifecycle on the Web. Reasoning Web 2013
Linked (Open) Data: The RDF Data Model
RDF = Resource Description Framework
[Figure: example graph. DHL has the full name "DHL International GmbH", the industry Logistics (labels "Logistik" and "物流") and the headquarters Post Tower; Post Tower is located in Bonn and has a height of 162.5 m.]
RDF Data Model (a bit more technical)
– Graph consists of:
  • Resources (identified via URIs)
  • Literals: data values with data type (URI) or language (multilinguality integrated)
  • Attributes of resources are also URI-identified (from vocabularies)
– Various data sources and vocabularies can be arbitrarily mixed and meshed
– URIs can be shortened with namespace prefixes; e.g. dbp: → http://dbpedia.org/resource/
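[Figure: the example graph with URIs; edges written as triples]
dbp:DHL_International_GmbH foaf:name "DHL International GmbH"^^xsd:string
dbp:DHL_International_GmbH dbo:industry dbp:Logistics
dbp:DHL_International_GmbH ex:headquarters dbp:Post_Tower
dbp:Logistics rdfs:label "Logistik"@de
dbp:Logistics rdfs:label "物流"@zh
dbp:Post_Tower gn:locatedIn dbp:Bonn
dbp:Post_Tower ex:height _:h
_:h rdf:value "162.5"^^xsd:decimal
_:h ex:unit unit:Meter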
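A minimal sketch of this data model in plain Python, assuming triples are stored as tuples and literals as a small value class. Only the dbp: expansion is given on the slide; the rdfs: namespace is the standard one, and the ex: namespace is a placeholder chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional

PREFIXES = {
    "dbp": "http://dbpedia.org/resource/",             # given on the slide
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",   # standard RDFS namespace
    "ex": "http://example.org/",                        # placeholder (assumption)
}


def expand(name: str) -> str:
    """Expand a prefixed name such as 'dbp:Bonn' into a full URI."""
    prefix, _, local = name.partition(":")
    return PREFIXES[prefix] + local


@dataclass(frozen=True)
class Literal:
    """A data value with an optional datatype URI or language tag."""
    value: str
    datatype: Optional[str] = None
    lang: Optional[str] = None


# The graph is simply a set of (subject, predicate, object) triples.
graph = {
    (expand("dbp:DHL_International_GmbH"), expand("ex:headquarters"), expand("dbp:Post_Tower")),
    (expand("dbp:Logistics"), expand("rdfs:label"), Literal("Logistik", lang="de")),
    (expand("dbp:Logistics"), expand("rdfs:label"), Literal("物流", lang="zh")),
}


def objects(subject: str, predicate: str) -> set:
    """Return all objects of triples matching the given subject and predicate."""
    return {o for s, p, o in graph if s == subject and p == predicate}


labels = objects(expand("dbp:Logistics"), expand("rdfs:label"))
```

Because the language tag is part of the literal, both labels of dbp:Logistics coexist on the same property without extra columns or tables.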
RDF mediates between different Data Models & bridges between Conceptual and Operational Layers
Tabular/Relational Data
Id    Title    Screen
5624  SmartTV  104cm
5627  Tablet   21cm
Prod:5624 rdf:type Electronics
Prod:5624 rdfs:label "SmartTV"
Prod:5624 hasScreenSize "104"^^unit:cm
...
Taxonomic/Tree Data
Vehicle
  Car
  Bus
  Truck
Vehicle rdf:type owl:Thing
Car rdfs:subClassOf Vehicle
Bus rdfs:subClassOf Vehicle
...
Logical Axioms / Schema
Male rdfs:subClassOf Human
Female rdfs:subClassOf Human
Male owl:disjointWith Female
...
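The tabular case can be sketched as a small row-to-triples conversion. The function name row_to_triples is hypothetical, and treating every row of the table as an instance of Electronics is an assumption (the slide shows the typing only for product 5624).

```python
rows = [
    {"Id": "5624", "Title": "SmartTV", "Screen": "104cm"},
    {"Id": "5627", "Title": "Tablet", "Screen": "21cm"},
]


def row_to_triples(row: dict) -> list:
    """Turn one table row into triples: the key becomes the subject, columns become properties."""
    subject = f"Prod:{row['Id']}"
    size = row["Screen"].removesuffix("cm")  # split value and unit (Python 3.9+)
    return [
        (subject, "rdf:type", "Electronics"),  # assumption: whole table describes Electronics
        (subject, "rdfs:label", f'"{row["Title"]}"'),
        (subject, "hasScreenSize", f'"{size}"^^unit:cm'),
    ]


triples = [t for row in rows for t in row_to_triples(row)]
```

Each cell yields one triple, so adding a column to the table only adds triples; no existing triple has to change.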
© Fraunhofer · Seite 7
Vocabularies – Breaking the mold!
Semantic data virtualization allows for continuous expansion and enhancement of data and metadata across data sources without losing the overall perspective
Relational data models: 1:1 relation between data model and application
Graph-based data model (Subject → Predicate → Object/Subject → Predicate → Object/Subject): 1:n relation between data model and application
Vocabulary Example
Vocabulary Schema (visual representation)
Class: Company
  Property     Expected type
  inIndustry   Industry
  fullName     String
  headquarter  Building
Class: Building
  Property     Expected type
  locatedIn    Industry
  height       unit:meter
Vocabulary Schema (RDF representation)
Company rdf:type rdfs:Class
Building rdf:type rdfs:Class
inIndustry rdf:type rdf:Property
inIndustry rdfs:domain Company
inIndustry rdfs:range Industry
headquarter rdf:type rdf:Property
headquarter rdfs:domain Company
headquarter rdfs:range Building
Instantiation (visual representation)
[Figure: example graph of DHL, Post Tower, Bonn and Logistics with their labels, full name and height]
Instantiation (RDF representation)
DHL rdf:type Company
DHL fullName "DHL Int. GmbH"
DHL inIndustry Logistics
DHL headquarter PostTower
PostTower rdf:type Building
PostTower locatedIn dbpedia:Bonn
PostTower height "162.5"^^meter
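How the schema relates to the instance data can be sketched with a small check of each statement against the declared domain and range. Note that RDFS itself uses rdfs:domain and rdfs:range to infer types rather than to reject data; the sketch below deliberately treats them as constraints, which is closer in spirit to RDF Data Shapes. The typing of Logistics as Industry and the function name violations are assumptions for illustration.

```python
schema = {
    "inIndustry": {"domain": "Company", "range": "Industry"},
    "headquarter": {"domain": "Company", "range": "Building"},
}

# rdf:type statements; "Logistics is an Industry" is assumed, not stated on the slide.
types = {"DHL": "Company", "PostTower": "Building", "Logistics": "Industry"}

facts = [
    ("DHL", "inIndustry", "Logistics"),
    ("DHL", "headquarter", "PostTower"),
]


def violations(facts, schema, types):
    """List statements whose subject or object type disagrees with the schema."""
    problems = []
    for s, p, o in facts:
        rule = schema.get(p)
        if rule is None:
            continue  # property not described by the vocabulary: nothing to check
        if types.get(s) != rule["domain"]:
            problems.append((s, p, o, "domain"))
        if types.get(o) != rule["range"]:
            problems.append((s, p, o, "range"))
    return problems


clean = violations(facts, schema, types)
broken = violations([("DHL", "headquarter", "Logistics")], schema, types)
```

Here clean is empty, while broken reports the range mismatch, since Logistics is not a Building.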
The Semantic Web Layer Cake 2001
http://www.w3.org/2001/10/03-sww-1/slide7-0.html
• Monolithic, based on XML
• Focus on heavyweight semantics (ontologies, logic, reasoning)
The Semantic Web Layer Cake 2015 – Bridging between Big & Smart Data
Layers shown in the figure:
Unicode, URIs
XML, JSON, CSV, RDB, HTML
RDF
RDF/XML, JSON-LD, CSV2RDF, R2RML, RDFa
RDF Data Shapes
RDF-Schema
Vocabularies
Ontologies, SKOS thesauri
Logic, SWRL rules
SPARQL
(Access control), signature, encryption (HTTPS/CERT/DANE), …
• Lingua franca of data integration with many technology interfaces (XML, HTML, JSON, CSV, RDB, …)
• Focus on lightweight vocabularies, rules, thesauri etc.
• Less “invasive”
RDF - the Lingua Franca of Data Integration
• RDF is simple
• We can easily encode and combine all kinds of data models (relational, taxonomic, graphs, object-oriented, …)
• RDF supports distributed data and schema
• We can seamlessly evolve simple semantic representations (vocabularies) to more complex ones (e.g. ontologies)
• Small representational units (URI/IRIs, triples) facilitate mixing and mashing
• RDF can be viewed from many perspectives: facts, graphs, ER, logical axioms, objects
• RDF integrates well with other formalisms: HTML (RDFa), XML (RDF/XML), JSON (JSON-LD), CSV, …
• Linking and referencing between different knowledge bases, systems and platforms facilitates the creation of sustainable data ecosystems
Successful application domains: Linked Data & Semantic Integration
Search Engine Optimization & Web-Commerce
• Schema.org used by >20% of Web sites
• Major search engines exploit semantic descriptions
Pharma, Life Sciences
• Mature, comprehensive vocabularies and ontologies
• Billions of disease, drug, clinical trial descriptions
Digital Libraries
• Many established vocabularies (DublinCore, FRBR, EDM)
• Millions of items aggregated from thousands of memory institutions in Europeana, German Digital Library
© Fraunhofer-Institut für Intelligente Analyse- und Informationssysteme
IAIS
The Web evolves into a Web of Data
Linked Open Data
Facebook Open Graph
Knowledge Graphs – A definition
• Fabric of concept, class, property, relationships, entity descriptions
• Uses a knowledge representation formalism (typically RDF, RDF-Schema, OWL)
• Holistic knowledge (multi-domain, source, granularity):
  • instance data (ground truth),
  • open (e.g. DBpedia, WikiData), private (e.g. supply chain data), closed data (product models),
  • derived, aggregated data,
  • schema data (vocabularies, ontologies)
  • meta-data (e.g. provenance, versioning, documentation, licensing)
  • comprehensive taxonomies to categorize entities
  • links between internal and external data
  • mappings to data stored in other systems and databases
Knowledge Graph Challenges & Opportunities
Knowledge graphs typically cover
• Multiple domains
• Various levels of granularity
• Data from multiple sources
• Various degrees of structure
Challenges
• Quality
• Coherence
• Co-evolution
• Update propagation
• Curation & interaction
Opportunities
• Background knowledge for various applications (e.g. question answering, data integration, machine learning)
• Facilitate intra-organizational data sharing and exchange (data value chains)
Comparison of various enterprise data integration paradigms
(? = value not recoverable from the slide text)
Paradigm          | Data model | Integr. strategy | Conceptual/operational | Heterogeneous data | Intern./extern. data | No. of sources | Type of integr. | Domain coverage | Semantic repres.
XML Schema        | DOM trees  | LaV              | operational            | ?                  | ?                    | medium         | both            | medium          | high
Data Warehouse    | relational | GaV              | operational            | -                  | partially            | medium         | physical        | small           | medium
Data Lake         | various    | LaV              | operational            | ?                  | ?                    | large          | physical        | high            | medium
MDM               | UML        | GaV              | conceptual             | -                  | -                    | small          | physical        | small           | medium
PIM / PCS         | trees      | GaV              | operational            | partially          | partially            | -              | physical        | medium          | medium
Enterprise search | document   | -                | operational            | ?                  | partially            | large          | virtual         | high            | low
EKG               | RDF        | LaV              | both                   | ?                  | ?                    | medium         | both            | high            | very high
[1] Michael Galkin, Sören Auer, Simon Scerri: Enterprise Knowledge Graphs: A Survey. Submitted to 37th International Conference on Information Systems. 2016.
Knowledge Graph Technology
Adding a Semantic Layer to Data Lakes
Semantic Data Lake
• central place for model, schema and data historization
• combination of scale-out (cost reduction) and semantics (increased control & flexibility)
• grows incrementally (pay-as-you-go)
[Figure: semantic data lake architecture. Consumers: Management, Accounting, Marketing, Sales, Support, R&D. Inbound: data sources, inbound raw data store, lifting via JSON-LD, CSVW, R2RML, XML2RDF. Core: data lake (order of magnitude cheaper scalable data store) and knowledge graph for relationship definition and metadata. Outbound and consumption: frontend to access relationship and KPI definition / documentation, frontend to access (ad hoc) reports, outbound data delivery to target systems.]
W3C R2RML – Relational to RDF Mapping
R2RML: RDB to RDF Mapping Language, W3C Recommendation 27 September 2012
Editors: Souripriya Das, Seema Sundara, Richard Cyganiak
http://www.w3.org/TR/r2rml/
Example R2RML Mapping
1. Either the resulting RDF knowledge base is materialized in a triple store and
2. subsequently queried using SPARQL,
3. or the materialization step is avoided by dynamically mapping an input SPARQL query into a corresponding SQL query, which renders exactly the same results as the SPARQL query being executed against the materialized RDF dump
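The two options can be contrasted in a small sketch using an in-memory SQLite database. The MAPPING dictionary is a hypothetical stand-in for a mapping definition and is not R2RML syntax; the table layout and the function names materialize and rewrite are likewise assumptions. The rewrite handles only a single triple pattern with a fixed object, which is far simpler than full SPARQL-to-SQL translation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany("INSERT INTO product VALUES (?, ?)", [(5624, "SmartTV"), (5627, "Tablet")])

# Hypothetical mapping: predicate -> (table, key column, value column, subject template)
MAPPING = {"rdfs:label": ("product", "id", "title", "Prod:{}")}


def materialize():
    """Option 1: dump every mapped row as a triple, ready to load into a triple store."""
    triples = []
    for predicate, (table, key, col, template) in MAPPING.items():
        for k, v in conn.execute(f"SELECT {key}, {col} FROM {table} ORDER BY {key}"):
            triples.append((template.format(k), predicate, v))
    return triples


def rewrite(predicate, obj):
    """Option 2: answer the pattern `?s <predicate> obj` with a single SQL query."""
    table, key, col, template = MAPPING[predicate]
    sql = f"SELECT {key} FROM {table} WHERE {col} = ? ORDER BY {key}"
    return [template.format(k) for (k,) in conn.execute(sql, (obj,))]


# Both routes should give the same answer for the same pattern.
via_dump = [s for s, p, o in materialize() if p == "rdfs:label" and o == "Tablet"]
via_sql = rewrite("rdfs:label", "Tablet")
```

Materialization pays the conversion cost once and duplicates the data; rewriting keeps the relational database as the single source and pays a translation cost per query.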
SPARQLMap – Mapping RDB 2 RDF
Example: Sparqlify
• Rationale: Exploit existing formalisms (SQL, SPARQL Construct) as much as possible
• Flexible & versatile mapping language
• Translating one SPARQL query into exactly one efficiently executable SQL query
• Solid theoretical formalization based on SPARQL-relational algebra transformations
• Extremely scalable through elaborate view candidate selection mechanism
• Used to publish 20B triples for LinkedGeoData
[Figure: a SPARQL Construct and an SQL view, connected by a bridge]
[1] Stadler, Unbehauen, Auer, Lehmann: Sparqlify – Very Large Scale Linked Data Publication from Relational Databases.
[2] Unbehauen, Stadler, Auer: Optimizing SPARQL-to-SQL Rewriting. iiWAS 2013
[3] Auer, et al.: Triplify: light-weight linked data publication from relational databases. WWW 2009
Semantified Big Data Architecture Blueprint
[1] Mami, Scerri, Auer, Vidal: Towards the Semantification of Big Data Technology. DEXA 2016
[Figure: stages: data sources, ingestion, storage, queries. Semantic lifting with mappings; storing of semantic and semantified data in Apache Parquet files on HDFS.]
SEBIDA Implementation Architecture
SEBIDA Evaluation Results
• Loads data faster
• Has quite different query performance characteristics: faster in 5 out of 12 queries, similar performance in 2, slower in 5
VOCOL: COLLABORATIVE VOCABULARY CURATION ENVIRONMENT
Comprehensive Support for Evolving Vocabularies
Industry 4.0: Semantic Models as Bridge between Shop & Office Floor
Semantic Administrative Shell & Reference Architecture for Industry 4.0 (RAMI4.0)
• Administrative Shell (Verwaltungsschale) provides a digital identity for arbitrary Industry 4.0 components (e.g. sensors, actuators/robots), exposing data covering the whole life-cycle
• Reference Architecture for Industry 4.0 (RAMI4.0) provides a conceptual framework for implementing comprehensive Industry 4.0 scenarios
• We have implemented both concepts along with a number of IEC and ISO standards in a comprehensive information model ready to be implemented in production environments
VoCol collaborative Development Environment for Vocabularies
VoCol components:
• Versioning: Git/Bitbucket
• Issue tracking: GitLab/GitHub
• Syntax validation
• Documentation generation
• Authoring: Turtle
• Visualization: VOWL
• Publishing: LOD/SPARQL
Integrates a number of tools & services for different aspects of vocabulary development
Is centered around Git version control (or Bitbucket), thus supporting the branching and merging of vocabularies
Supports the roundtrip between
• Schema/vocabulary development
• Competency questions (expressed in SPARQL)
• Example data
Bridges between conceptual models and executable code
http://eis.iai.uni-bonn.de/Projects/VoCol.html
Development based on Git – Version Control
• Git has become the most widely used version control system. It is a distributed revision control system with an emphasis on speed, data integrity, and support for distributed, non-linear workflows.
• Git was initially designed and developed in 2005 by Linux kernel developers for Linux kernel development
• Git is the basis for a variety of open-source or commercial services and products such as:
  • GitHub/Bitbucket: Web-based Git repository hosting services with millions of users
  • GitLab/Gitolite: open-source Web-based Git repository management platforms
  • Since the Team Foundation Server 2013 release, Microsoft has added native support for Git
• Git is easily extensible and can be integrated into arbitrary workflows via Git hooks
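As an illustration of the hook mechanism, the sketch below shows the logic a pre-commit hook for vocabulary files might use: collect the staged Turtle files and refuse the commit if a validator rejects any of them. This is not VoCol's implementation; the function names and the is_valid callable (a stand-in for a real Turtle syntax checker) are assumptions.

```python
import subprocess
from typing import Callable, Iterable, List


def staged_files() -> List[str]:
    """Ask Git for the paths that are added, copied or modified in the next commit."""
    out = subprocess.run(
        ["git", "diff", "--cached", "--name-only", "--diff-filter=ACM"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [line for line in out.splitlines() if line]


def failing_vocabularies(paths: Iterable[str], is_valid: Callable[[str], bool]) -> List[str]:
    """Return the Turtle files among `paths` that the validator rejects."""
    return [p for p in paths if p.endswith(".ttl") and not is_valid(p)]


def hook_exit_code(paths: Iterable[str], is_valid: Callable[[str], bool]) -> int:
    """A non-zero exit status makes Git abort the commit when run as a pre-commit hook."""
    return 1 if failing_vocabularies(paths, is_valid) else 0
```

Installed as .git/hooks/pre-commit, a script would end with sys.exit(hook_exit_code(staged_files(), is_valid)), where is_valid wraps whichever syntax checker the project uses.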
Information Model – Environment
Environment: Dynamic Documentation
Environment: Dynamic Visualization
Environment: Analytics
Environment: Querying
Environment: Evolution
INDUSTRIAL DATA SPACE
Vocabulary-based Integration facilitates Data-driven Businesses
Vocabulary
The work on the Industrial Data Space is complementary to and interlocked with the Plattform Industrie 4.0
[Figure: Retail 4.0, Banking 4.0, Insurance 4.0, …, Industry 4.0 (focus on the manufacturing industry); Smart Services; transmission, networks; real-time systems; Industrial Data Space (focus on data); data; …]
The Industrial Data Space Initiative
• Community of >30 large German and European companies
• Pre-competitive, publicly funded innovation project involving 11 Fraunhofer institutes for developing the IDS reference architecture
Current members of the Industrial Data Space Association
Semantic Data Linking for Enterprise Data Value Chains
                         | Data Lake                 | Industrial Data Space                           | Pure Internet
Characterization         | centralized, monopolistic | federated, secure, “trusted”, standard-based    | completely decentralized, open, insecure
Data management          | Central repository        | Decentralized                                   | Decentralized
Data ownership           | Central                   | Decentralized                                   | Decentralized
Data linking             | Single provider           | Federated, on demand                            | Missing
Data security            | Bilateral                 | Certified system                                | Bilateral
Market structure         | Central provider          | Role system                                     | Unstructured
Transport infrastructure | Internet                  | Internet                                        | Internet
Basic principles of the Industrial Data Space
• On-demand interlinking
• Linked light semantics
• Security with Industrial Data Container
• Certified roles
Industrial Data Space: On Demand Interlinking
[Figure: Enterprises 1 to 6 interlinked on demand via Services A to G]
All data stays with its owner and remains controlled and secured. Data is shared only on request for a service. No central platform.
Industrial Data Space: IDS Architecture Overview
[Figure: components shown: Industrial Data Space Broker, Industrial Data Space App Store, Apps, Vocabulary, Clearing, Registry, Index, Internal IDS Connector (Company A), Internal IDS Connector (Company B), External IDS Connectors, Third Party Cloud Provider; data flows over the Internet labeled Upload / Download / Search]
Big Data is not just Volume and Velocity
Variety (& Veracity) are key challenges
Linked Data helps in dealing with both
• The Linked Data life-cycle requires integrating and adapting results from a number of disciplines
  – NLP,
  – Machine Learning,
  – Knowledge Representation,
  – Data Management,
  – User Interaction
  – …
• Applications in a number of domains
  – cultural heritage,
  – life sciences,
  – industry 4.0 / cyber-physical systems,
  – smart cities,
  – mobility,
  – …
Linked Data links not only data but also:
• Various disciplines
• Applications and use cases
The Team
Creating Knowledge out of Interlinked Data
Thanks for your attention!
Sören Auer
http://www.iai.uni-bonn.de/~auer | http://[email protected]
LINKED-DATA-BASED QUESTION ANSWERING
A Grand Challenge
Question Answering research challenges
Main Goals
• Completeness ⇒ Extension of background knowledge, streams, deduplication
• Flexibility ⇒ Deal with keywords and NL
• Runtime ⇒ New models for query processing, ranking for top-k queries
• Easy use ⇒ Verbalization of queries, entity verbalization, explanation of answers in NL
• Multilinguality ⇒ Cover several European languages
Automatic Extension of background knowledge
• 1. Generate query from own data and get answer set A; 2. Add new data set and get answer set A’; 3. If info gain, then iterate; 4. Else terminate
Data Streams
• Continuous queries on data streams (update SPARQL results as new information comes in)
• Send novel answers to end user
• Open Information Extraction
Hybrid Search: extension for queries on unstructured data
Ensure Quasi-Completeness
• Fully automatic entity consolidation
• Find links at runtime, e.g., between DBpedia and LinkedMDB to answer “Which films were directed by and starred Tarantino?”
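The four-step loop for automatically extending background knowledge can be sketched as follows. Measuring information gain as the number of new answers, the min_gain threshold, the function name and the toy triples are all assumptions made for illustration; the slide does not define how gain is computed.

```python
def extend_background_knowledge(query, own_data, candidate_sets, min_gain=1):
    """Add data sets one by one while each of them yields new answers.

    query: callable taking a set of triples and returning a set of answers.
    candidate_sets: list of (name, triples) pairs, tried in the given order.
    """
    data = set(own_data)
    answers = query(data)                        # step 1: answer set A from own data
    used = []
    for name, extra in candidate_sets:
        new_answers = query(data | set(extra))   # step 2: answer set A' with the new data set
        gain = len(new_answers - answers)        # assumed gain measure: number of new answers
        if gain < min_gain:
            break                                # step 4: no information gain, terminate
        data |= set(extra)                       # step 3: keep the data set and iterate
        answers = new_answers
        used.append(name)
    return answers, used


# Toy data, for illustration only.
own = {("Pulp_Fiction", "director", "Tarantino")}
candidates = [
    ("film_source", {("Kill_Bill", "director", "Tarantino")}),
    ("geo_source", {("Post_Tower", "locatedIn", "Bonn")}),
]


def directed_by_tarantino(triples):
    return {s for s, p, o in triples if p == "director" and o == "Tarantino"}


answers, used = extend_background_knowledge(directed_by_tarantino, own, candidates)
```

In this run the film source adds one answer and is kept, while the geographic source adds none, so the loop stops there.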
[1] Shekarpour, Marx, Ngomo, Auer: Semantic query interpretation for question answering on linked data. J. Web Semantics 30 (2015)
[2] Marx, Usbeck, Ngomo, Höffner, Lehmann, Auer: Towards an open question answering architecture. SEMANTICS 2014
[3] Shekarpour, Ngomo, Auer: Question answering on interlinked data. WWW 2013
The approach: An Open QA Architecture
Create an open, extensible architecture for Linked-Data-based Question Answering
• Enable the plugin and competition of different modules for various QA aspects:
  • Input: query string / question, voice, brain input; Query Splitting; Disambiguation/Mapping; Query Construction; Query Execution; Result presentation
• Take context, personalization, feedback into account
For Whom? Use Cases:
• In-car interaction / Human Vehicle Interaction
  Where can I find parking? What are the main sights in Luxembourg?
• Assisting people with disabilities (e.g. vision impaired)
  Is there any pharmacy still open? What classical concerts are broadcast next week?
• Medical information retrieval
  Which side effects can be caused by Paracetamol? Do Paracetamol and Tamiflu interfere?
• …
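The idea of pluggable modules can be sketched as a pipeline of interchangeable stage functions that pass a shared state along. The stage implementations, the tiny lexicons and the one-triple knowledge base below are toy assumptions (reusing the Post Tower example from earlier slides), not the architecture from the cited papers; the query splitting stage is omitted for brevity.

```python
# Toy knowledge base and lexicons, for illustration only.
KB = {("Post_Tower", "locatedIn", "Bonn")}
ENTITIES = {"post tower": "Post_Tower"}
RELATIONS = {"located": "locatedIn"}


def disambiguate(state: dict) -> dict:
    """Map phrases in the question to an entity and a relation identifier."""
    text = state["question"].lower()
    state["entity"] = next(uri for phrase, uri in ENTITIES.items() if phrase in text)
    state["relation"] = next(uri for phrase, uri in RELATIONS.items() if phrase in text)
    return state


def construct(state: dict) -> dict:
    """Build a triple pattern; None marks the unknown to be answered."""
    state["pattern"] = (state["entity"], state["relation"], None)
    return state


def execute(state: dict) -> dict:
    """Match the pattern against the knowledge base."""
    s, p, _ = state["pattern"]
    state["answers"] = sorted(o for (s2, p2, o) in KB if (s2, p2) == (s, p))
    return state


def present(state: dict) -> dict:
    """Render the answers as text."""
    state["text"] = ", ".join(state["answers"]) or "no answer found"
    return state


def run(question: str, stages: list) -> dict:
    """Run the question through the given stages in order."""
    state = {"question": question}
    for stage in stages:
        state = stage(state)
    return state


result = run("Where is the Post Tower located?", [disambiguate, construct, execute, present])
```

Because each stage only reads and writes the shared state, competing implementations of one stage (for example two disambiguation modules) can be swapped by changing the list passed to run.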
[1] The WDAqua Marie Curie ITN: Answering Questions using Web Data. http://wdaqua.informatik.uni-bonn.de