A database approach to monitoring the quality of information in RDF stores

Aug 29, 2014

Page 1: A database approach to monitoring the quality of information in RDF stores

A DATABASE APPROACH TO MONITORING THE QUALITY OF INFORMATION IN RDF STORES

Alexandre Rademaker and Edward Hermann

Wednesday, November 30, 11

Page 2: A database approach to monitoring the quality of information in RDF stores

NOTES

This is not a research report; it is a research proposal!

Let us start by looking at results from database researchers.

Page 3: A database approach to monitoring the quality of information in RDF stores

WHAT IS (ENSURE) DATA QUALITY?

Semantic properties of databases can be represented by integrity constraints!

Integrity enforcement means maintaining the correctness of the database. Truth maintenance!

Hendrik, 2011

Page 4: A database approach to monitoring the quality of information in RDF stores

HENDRIK DECKER

http://web.iti.upv.es/~hendrik/

Universidad Politécnica de Valencia

Page 5: A database approach to monitoring the quality of information in RDF stores

EXAMPLE

A marriage is between one man and one woman only. How can we model such a constraint in a relational DB?

We are talking about more than CHECK constraints, foreign keys and primary keys.
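
One possible way to approach the question above, given here only as a sketch: state the constraint as "denial" queries that return the violating rows, so that an empty result means the constraint holds. The schema (`person`, `marriage`), the column names and the sample rows are hypothetical, and SQLite is used only because it ships with Python.

```python
import sqlite3

# Hypothetical schema, invented for illustration only.
SCHEMA = """
CREATE TABLE person (id INTEGER PRIMARY KEY, sex TEXT CHECK (sex IN ('M', 'F')));
CREATE TABLE marriage (
    husband INTEGER REFERENCES person(id),
    wife    INTEGER REFERENCES person(id)
);
"""

# Denial queries: each one returns the rows that VIOLATE the constraint.
WRONG_SEX = """
SELECT m.husband, m.wife FROM marriage m
JOIN person h ON h.id = m.husband
JOIN person w ON w.id = m.wife
WHERE h.sex <> 'M' OR w.sex <> 'F'
"""
BIGAMY = """
SELECT p FROM (
    SELECT husband AS p FROM marriage
    UNION ALL
    SELECT wife AS p FROM marriage
) GROUP BY p HAVING COUNT(*) > 1
"""


def violations(conn):
    """Return (marriages with the wrong sexes, persons in more than one marriage)."""
    return conn.execute(WRONG_SEX).fetchall(), conn.execute(BIGAMY).fetchall()


def demo():
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO person VALUES (?, ?)",
                     [(1, "M"), (2, "F"), (3, "F")])
    # Person 1 appears in two marriages, which no key or CHECK clause above forbids.
    conn.executemany("INSERT INTO marriage VALUES (?, ?)", [(1, 2), (1, 3)])
    return violations(conn)
```

Neither query can be written as a column-level CHECK, a foreign key or a primary key on this schema, which is the point of the slide.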


Page 6: A database approach to monitoring the quality of information in RDF stores

DB THEORY USES DATALOG

Datalog is more expressive than SQL without recursion (e.g. transitive closure).

SQL is FOL (decidable over finite models).

SELECT X WHERE Y (give me the bindings that satisfy the clauses)
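
To make the transitive-closure remark concrete, here is a minimal sketch of naive bottom-up Datalog evaluation. The predicate names `edge` and `path` are assumptions chosen for the illustration; they do not come from the slides.

```python
def transitive_closure(edges):
    """Naive bottom-up evaluation of the Datalog program

         path(X, Y) :- edge(X, Y).
         path(X, Z) :- path(X, Y), edge(Y, Z).

    The rules are re-applied until no new fact is derived (least fixpoint).
    """
    path = set(edges)
    while True:
        derived = {(x, z) for (x, y) in path for (y2, z) in edges if y == y2}
        if derived <= path:          # nothing new: fixpoint reached
            return path
        path |= derived
```

The loop has no fixed bound on the number of joins, which is exactly what a single non-recursive SELECT cannot express.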


Page 7: A database approach to monitoring the quality of information in RDF stores

TWO WAYS TO ENFORCE INTEGRITY

On each update, check whether any integrity constraint is violated. (Not always checked rigorously, due to the performance penalty.)

Repair extant violations of constraints. (Accumulation of inconsistency is inevitable.)

Hendrik, 2011

Page 8: A database approach to monitoring the quality of information in RDF stores

INCONSISTENCY-TOLERANT METHODS

The rigorous way is to eliminate all inconsistency: repair the whole database.

Relaxation... partial (flexible) repairs!

Hendrik, 2011

Absolute consistency is out of the question due to its intractability!

Page 9: A database approach to monitoring the quality of information in RDF stores

FLEXIBILITY OF PARTIAL INCONSISTENCY

Flexibility is served in two ways:

Integrity enforcement is more flexible. It does not have to be done all at once. (Constraint violations can be tolerated and resolved at an appropriate moment.)

Some inconsistency may be unknown at update time. A total approach would fail in such a situation.

But...

Hendrik, 2011

Page 10: A database approach to monitoring the quality of information in RDF stores

PARTIAL REPAIRS

Absolute consistency is out of the question due to its intractability.

But naive inconsistency-tolerant repairs can be data-destructive.

For a rational, flexible repair strategy, one needs criteria (expressed in terms of metrics).

Only admit repairs that are integrity-preserving! That is, the total amount of integrity violation must not increase after the repair.

Hendrik, 2011

Page 11: A database approach to monitoring the quality of information in RDF stores

FORMAL DEFINITIONS

Hendrik, 2011

$D$ = database

$IC$ = integrity theory

$I$ = constraint

$U$ = update

$D(F) = \mathit{true}$ if $F$ evaluates to true in $D$

$D(I) = \mathit{true}$ if $I$ is satisfied in $D$

$D(IC) = \mathit{true}$ if every $I$ in $IC$ is satisfied in $D$

For an update $U$ (inserts, deletes) of a database $D$, we denote by $D^U$ the updated database.
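
The definitions on this slide can be mirrored in a few lines of code. This is a sketch under strong simplifying assumptions: a database is a set of ground facts, a constraint is an arbitrary Python predicate over that set, and the names `Update`, `apply_update`, `satisfies`, `satisfies_all` and `no_bigamy` are all invented for the illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Update:
    """An update U: a set of facts to insert and a set of facts to delete."""
    inserts: frozenset = frozenset()
    deletes: frozenset = frozenset()


def apply_update(db, update):
    """Return the updated database (deletes are applied before inserts)."""
    return (frozenset(db) - update.deletes) | update.inserts


def satisfies(db, constraint):
    """D(I): the constraint, modelled as a predicate, evaluated in D."""
    return bool(constraint(db))


def satisfies_all(db, ic):
    """D(IC): true iff every constraint I in IC is satisfied in D."""
    return all(satisfies(db, i) for i in ic)


def no_bigamy(db):
    """Sample constraint over facts of the form ("married", a, b)."""
    spouses = [p for (rel, a, b) in db if rel == "married" for p in (a, b)]
    return len(spouses) == len(set(spouses))
```

For instance, with `D = {("married", "ann", "bob")}` and an update that inserts `("married", "ann", "carl")`, `satisfies_all` holds for `D` but not for the updated database.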


Page 12: A database approach to monitoring the quality of information in RDF stores

FORMAL DEFINITIONS

Hendrik, 2011

Let $\preceq$ be an ordering that is antisymmetric, reflexive and transitive.

For two elements $A$ and $B$ of a lattice, $A \oplus B$ is their least upper bound.

Page 13: A database approach to monitoring the quality of information in RDF stores

FORMAL DEFINITIONS

Hendrik, 2011

We say that $(\mu, \preceq)$ is an inconsistency metric if $\mu$ maps tuples $(D, IC)$ to some lattice that is partially ordered by $\preceq$.

A simple example of a metric $\beta$ is given by $\beta(D, IC) = D(IC)$, with the natural order $\mathit{true} \prec \mathit{false}$ on the range of $\beta$.

That is, integrity satisfaction, $D(IC) = \mathit{true}$, means lower inconsistency than integrity violation, $D(IC) = \mathit{false}$.

Non-trivial examples are given by comparing or counting violated constraints.
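
The three kinds of metric mentioned on this slide (binary, comparing violated constraints, counting them) can be sketched as follows. Constraints are again assumed to be Python predicates over the database, and every function name here is hypothetical.

```python
def beta(db, ic):
    """Binary metric: the value is simply D(IC)."""
    return all(i(db) for i in ic)


def beta_leq(a, b):
    """Order on the range of beta, with True (satisfied) below False (violated)."""
    return a or not b


def violated(db, ic):
    """Set-valued metric: the constraints violated in D, ordered by subset."""
    return frozenset(i for i in ic if not i(db))


def count_violated(db, ic):
    """Counting metric: the number of violated constraints, ordered by <=."""
    return len(violated(db, ic))
```

The set-valued metric is only partially ordered (two different sets of violated constraints may be incomparable), whereas the counting metric is totally ordered.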


Page 14: A database approach to monitoring the quality of information in RDF stores

INCONSISTENCY METRICS

Inconsistency metrics are used to decide whether an update preserves integrity, that is, whether it avoids creating an integrity violation that did not exist before the update.

Intuitively, an update preserves integrity if it does not increase the measured inconsistency.

Hendrik, 2011

For a metric $(\mu, \preceq)$, an update $U$ of a database $D$ with integrity theory $IC$ is integrity-preserving with regard to $(\mu, \preceq)$ if $\mu(D^U, IC) \preceq \mu(D, IC)$.
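
The definition above can be checked directly once a metric and its order are fixed. The sketch below uses the set of violated constraints ordered by subset as the default metric; this choice, and the names `violated` and `preserves_integrity`, are assumptions made for the illustration. It evaluates every constraint on both states, which real checking methods would try to avoid.

```python
def violated(db, ic):
    """Metric used here: the set of constraints violated in db."""
    return frozenset(i for i in ic if not i(db))


def preserves_integrity(db, updated_db, ic,
                        metric=violated, leq=frozenset.issubset):
    """True iff metric(updated_db, IC) is below or equal to metric(db, IC)."""
    return leq(metric(updated_db, ic), metric(db, ic))
```

Note the inconsistency tolerance: an update can be accepted even when the database already violates some constraint, as long as it does not add a new violation.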


Page 15: A database approach to monitoring the quality of information in RDF stores

AND MORE...

Inconsistency-tolerant integrity checking

Repairs

Computing and checking partial repairs

Computing integrity-preserving repairs

Hendrik, 2011


Page 17: A database approach to monitoring the quality of information in RDF stores

WHY ARE WE TALKING ABOUT IT?

Lattes@FGV Project (a unified KB of FGV research publications, researchers, skills, etc.), http://dck092.fgv.br/

The Semantic Web brings RDF, description logics, linked data, etc.

Our research topics include logics and knowledge representation.

RDF is the key concept of the Semantic Web.

A relational database has a fixed model (the TBox of an ontology).

Page 18: A database approach to monitoring the quality of information in RDF stores

TOPOS: THEORETICAL PART

A topos (plural topoi or toposes) is a category with a quite expressive internal logic.

The category of graphs and graph-homomorphisms can be viewed as a topos.

This topos already has a Heyting algebra that is used as the truth-basis of its internal logic.

A Heyting algebra is a lattice with additional properties. This topos-theoretic view of RDF stores can be investigated as a natural foundation for partial repairs in RDF stores.

Besides that, if we view traditional DBs as finite first-order logical structures, the category of (finite) first-order structures and homomorphisms between them has its own internal logic. This internal logic can also be investigated with regard to partial repairs.

Scratching the surface!

Page 19: A database approach to monitoring the quality of information in RDF stores

LATTES@FGV

Page 22: A database approach to monitoring the quality of information in RDF stores

LATTES@FGV: THE RDF KB

http://dck092.fgv.br:10035/repositories/fgv (800k triples)


Page 23: A database approach to monitoring the quality of information in RDF stores

LATTES@FGV

480 Lattes CVs and data collected from other sources (Qualis, Digital Library, etc.) in one triple store.

Lots of errors (inconsistencies), for different reasons: a poor user interface for data input, misinterpretation, etc.

How can the errors be identified? (In a non-ad-hoc manner.)

How can we fix what can be fixed automatically?

Page 24: A database approach to monitoring the quality of information in RDF stores

INTEGRITY CONSTRAINTS IN RDF

We can consider extending what was discussed so far to non-SQL stores.

A KR/DB can be viewed as a graph.

The query language of RDF-based stores, SPARQL, can be used to provide semantics to the store.

Page 25: A database approach to monitoring the quality of information in RDF stores

EXAMPLES

An article referenced by a CV must have the author of this CV as one of its authors!
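
As a rough illustration of how this constraint could be checked over a set of triples, the sketch below lists the (CV, article) pairs that violate it. The predicates `ex:owner` and `ex:references` are hypothetical stand-ins, since the slides do not show the vocabulary actually used in the Lattes@FGV store; `dc:creator` is borrowed from the query shown later in the deck.

```python
def cv_author_violations(triples):
    """Return sorted (cv, article) pairs where the CV's owner is not an author.

    A CV with no recorded owner is reported as violating for every article
    it references (an assumption of this sketch).
    """
    owner = {s: o for (s, p, o) in triples if p == "ex:owner"}
    authors = {}
    for s, p, o in triples:
        if p == "dc:creator":
            authors.setdefault(s, set()).add(o)
    return sorted(
        (cv, art)
        for (cv, p, art) in triples
        if p == "ex:references" and owner.get(cv) not in authors.get(art, set())
    )
```

An empty result means the constraint is satisfied; the size of the result could serve as a counting-style inconsistency measure.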


Page 26: A database approach to monitoring the quality of information in RDF stores

EXAMPLES

If two resources were identified by reference to the same article, every author of the first one should also be related to the second one!


Page 27: A database approach to monitoring the quality of information in RDF stores

IN THE LAST EXAMPLE

Of course, two publications cannot be considered the same by comparing only their titles!

We need entity alignment, a similarity checker...

Suppose we have identified all resources that represent the same real “entity” using owl:sameAs; then ...

ask {
  ?p1 owl:sameAs ?p2 ;
      dc:creator ?c .
  OPTIONAL {
    ?p2 ?rel ?c .
  }
  FILTER( !bound(?rel) )
}
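
To spell out what the ASK query on this slide tests, here is a small in-memory imitation over plain tuples. It is a sketch of the query's logic, not of how a triple store evaluates it, and the function names are invented.

```python
def same_as_violations(triples):
    """Return sorted (p1, p2, c) such that p1 owl:sameAs p2, p1 dc:creator c,
    and no triple at all links p2 to c (the OPTIONAL part found nothing)."""
    same = [(s, o) for (s, p, o) in triples if p == "owl:sameAs"]
    creators = [(s, o) for (s, p, o) in triples if p == "dc:creator"]
    linked = {(s, o) for (s, _p, o) in triples}
    return sorted(
        (p1, p2, c)
        for (p1, p2) in same
        for (pub, c) in creators
        if pub == p1 and (p2, c) not in linked
    )


def ask(triples):
    """Like the ASK query: true iff at least one violation exists."""
    return bool(same_as_violations(triples))
```

As in the query, only the stated direction of `owl:sameAs` is examined; no symmetric or transitive inference is applied.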


Page 28: A database approach to monitoring the quality of information in RDF stores

A LITTLE BIT ABOUT THE IDENTIFICATION OF SIMILARITY

;; Assert an owl:sameAs triple for each pair in LIST.
;; Pairs whose first element is not a blank node are reversed first.
(defun assert-same-list (list)
  (let ((new nil))
    (dolist (pair list)
      (if (not (blank-node-p (first pair)))
          (push (reverse pair) new)
          (push pair new)))
    (dolist (pair new)
      (add-triple (first pair) !owl:sameAs (second pair)))))

;; For every two foaf:Agent resources that share a foaf:name, call the
;; callback; upi< keeps only one ordering of each pair.
(select0/callback (?x ?y) #'insert-same-as
  (q- ?x !rdf:type !foaf:Agent)
  (q- ?y !rdf:type !foaf:Agent)
  (q- ?x !foaf:name ?n)
  (q- ?y !foaf:name ?n)
  (lispp (upi< ?x ?y)))

Naive approach: Shaking hands!
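
The "shaking hands" pattern can be seen in isolation in the following sketch: every two agents with the same name form one pair, so n agents sharing a name yield n(n-1)/2 pairs. It assumes one name per agent and exact string equality, and `same_name_pairs` is a hypothetical helper, not part of the Lisp code on this slide.

```python
from itertools import combinations


def same_name_pairs(names):
    """names maps an agent id to its name; return sorted pairs sharing a name."""
    by_name = {}
    for agent, name in names.items():
        by_name.setdefault(name, []).append(agent)
    pairs = []
    for agents in by_name.values():
        # sorted() plays the role of the upi< test: one ordering per pair.
        pairs.extend(combinations(sorted(agents), 2))
    return sorted(pairs)
```

The quadratic growth in the number of pairs is one reason the approach is called naive.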


Page 29: A database approach to monitoring the quality of information in RDF stores

A LITTLE BIT ABOUT THE IDENTIFICATION OF SIMILARITY

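;; Partition VERTICES into groups: repeatedly take the ego group of the
;; first remaining vertex and remove its members from the remaining set.
(defun components (vertices n generator)
  (do ((res nil)
       (vtx vertices (set-difference vtx (car res) :test #'upi=)))
      ((null vtx) res)
    (push (ego-group (car vtx) n generator) res)))

;; Generator: journals that share a valid ISSN with NODE and whose titles
;; have a Jaro-Winkler score above 0.7.
(defsna-generator same-journal (node)
  (select0 (?j)
    (q- (?? node) !bibo:issn ?i)
    (q- ?j !bibo:issn ?i)
    (lispp (utils::check-issn (part->value ?i)))
    (lispp (upi< node ?j))
    (q- ?j !dc:title ?t2)
    (q- (?? node) !dc:title ?t1)
    (lispp (> (utils::jaro-winkler-distance (part->value ?t1)
                                            (part->value ?t2))
              0.7))))

;; Merge every group found among the subjects that have an ISSN.
;; (MERGE-NODES must be inside the DOLIST body, where G is bound.)
(let ((nodes (mapcar #'subject (get-triples-list :p !bibo:issn :limit nil))))
  (dolist (g (components nodes 2 'same-journal))
    (merge-nodes g)))

An ad-hoc solution: breadth-first search for connected components!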
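
For readers unfamiliar with the idea, here is a plain breadth-first search for connected components. It is not a translation of the Lisp on this slide (which uses a depth-limited ego group and a store-specific generator); the `neighbours` function is assumed to be symmetric and is supplied by the caller.

```python
from collections import deque


def components(vertices, neighbours):
    """Group vertices into connected components using breadth-first search.

    neighbours(v) must return an iterable of the vertices adjacent to v.
    """
    seen, result = set(), []
    for start in vertices:
        if start in seen:
            continue
        seen.add(start)
        group, queue = [], deque([start])
        while queue:
            v = queue.popleft()
            group.append(v)
            for w in neighbours(v):
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        result.append(group)
    return result
```

Each resulting group would then be a candidate set of resources to merge into one node.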
