In collaboration with Werner Nutt Free University of Bozen-Bolzano Data Quality Simon Razniewski

Dec 22, 2015

Transcript
Page 1: In collaboration with Werner Nutt Free University of Bozen-Bolzano Data Quality Simon Razniewski.

In collaboration with Werner Nutt

Free University of Bozen-Bolzano

Data Quality

Simon Razniewski

Page 2

10.8.2011 - EURAC, Data Quality

Introduction

• Simon Razniewski, PhD student at the FUB
  – Data quality
  – Data completeness

• Werner Nutt, Professor in Computer Science at the FUB
  Focus in research and teaching:
  – Data management, data modelling
  – Data integration
  – Incomplete information

Page 3

Why data quality?

• Data are the basis for (scientific) conclusions about the world

• Conclusions are only as good as the data they are based on

• Low-quality data => low-quality conclusions

Page 4

Some effects of erroneous data are funny

Man invited for pre-natal check

Page 5

Some data errors are long-lived

“Spinach contains a lot of iron”

Claim: 100 g of spinach contains 35 mg of iron

Gustav v. Bunge, 1890

In fact: 100 g of spinach contains only 3.5 mg of iron

Page 6

Some data errors are mysterious

Student records in Georgia (USA), 2009

19,000 students leave their school to transfer to another one

… but arrive nowhere

Page 7

Overview

• What are data used for?
  – Data model the real world

• What can go wrong?
  – Wrong, outdated, missing data

• What can one do for
  – Correctness
  – Currency
  – Completeness
  of data?

Page 8

Data model the real world

We analyze the data (instead of the real world) and draw (scientific) conclusions

The data determine our conclusions

Real world: students, teachers, classes

Database: tables

Example: HOB Bozen, class 2A: Paul, Anna, Maria, Diego

Page 9

Questions about students

• “How many students are there in class 2A of the HOB Bozen?”

• “What is the average age of the students of this class?”

• “How many students play an instrument?”

Page 10

Table “Students”

Name    Date of birth   School       Class
Paul    7.4.1995        HOB Bolzen   2A
Anna    3.8.1959        HOB Bozen    1A
Diego   ?               HOB Meran    2A

What is the average age of the students of class 2A of the HOB Bozen?

Page 11

Many things can go wrong

Name    Date of birth   School       Class
Paul    7.4.1995        HOB Bolzen   2A
Anna    3.8.1959        HOB Bozen    1A
Diego   ?               HOB Meran    2A

What is the average age of the students of class 2A of the HOB Bozen?
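A minimal sketch of how this question plays out against the table exactly as it is stored, using Python's built-in sqlite3 module. The table and column names, the ISO date format, and the reference date 10.8.2011 (the date of the talk) are assumptions made for illustration; they do not appear on the slides.

```python
import sqlite3

# Hypothetical schema; dates are stored as ISO strings so SQLite can parse them.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE students (name TEXT, date_of_birth TEXT, school TEXT, class TEXT)"
)
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?, ?)",
    [
        ("Paul", "1995-04-07", "HOB Bolzen", "2A"),  # values exactly as in the table
        ("Anna", "1959-08-03", "HOB Bozen", "1A"),
        ("Diego", None, "HOB Meran", "2A"),
    ],
)

# Number of matching records and their average age in years on the
# assumed reference date.
QUERY = """
    SELECT count(*),
           avg((julianday('2011-08-10') - julianday(date_of_birth)) / 365.25)
    FROM students
    WHERE school = 'HOB Bozen' AND class = '2A'
"""
matching_rows, average_age = conn.execute(QUERY).fetchone()
print(matching_rows, average_age)
```

With the rows as stored, no record satisfies both conditions, so the query has nothing to average. The query is syntactically fine; the flaws are in the data.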

Page 12

Typos

Name    Date of birth   School       Class
Paul    7.4.1995        HOB Bolzen   2A
Anna    3.8.1959        HOB Bozen    1A
Diego   ?               HOB Meran    2A

• Date of birth of Anna

• School of Paul

Page 13

Factual errors

Name    Date of birth   School       Class
Paul    7.4.1995        HOB Bolzen   2A
Anna    3.8.1959        HOB Bozen    1A
Diego   ?               HOB Meran    2A

• School of Diego (“HOB Meran” instead of “HOB Bozen”)

Page 14

Outdated entries

Name    Date of birth   School       Class
Paul    7.4.1995        HOB Bolzen   2A
Anna    3.8.1959        HOB Bozen    1A
Diego   ?               HOB Meran    2A

• Class of Anna (“1A” instead of “2A”)

Page 15

Missing values

Name    Date of birth   School       Class
Paul    7.4.1995        HOB Bolzen   2A
Anna    3.8.1959        HOB Bozen    1A
Diego   ?               HOB Meran    2A

• Date of birth of Diego (“null value”)

Page 16

Missing records

Name    Date of birth   School       Class
Paul    7.4.1995        HOB Bolzen   2A
Anna    3.8.1959        HOB Bozen    1A
Diego   ?               HOB Meran    2A

Missing record: Maria, 12.10.1995, HOB Bozen, 2A

• The record about Maria is missing

Page 17

Missing concepts

Name    Date of birth   School       Class
Paul    7.4.1995        HOB Bolzen   2A
Anna    3.8.1959        HOB Bozen    1A
Diego   ?               HOB Meran    2A

• No possibility to store information about musical instruments

Missing column: Instrument (Cello, ?, ?)

Page 18

What can be done?

Data quality has several distinct dimensions. The most important ones are:

• Correctness: Does the data match the real world?

• Timeliness: Is the data up to date?

• Completeness: Are all aspects of the domain of interest captured?

Further: comprehensibility, accessibility, …

Page 19

Dimension 1: Correctness

IT techniques:

1. Detecting typos or statistical outliers
   Example: students born in 1959

2. Recognizing duplicates
   Example: Mohammad Al Zaïn = Muhamad Alzain

3. Rules for logical consistency
   Example: no student can attend two schools at the same time

Organisation:

Special treatment of core data: master data management
For example: students, teachers, schools
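The three IT techniques can each be sketched in a few lines of Python. This is an illustration under assumptions, not the method used by the authors: the median-distance rule, the similarity threshold of 0.85, and all function names are hypothetical choices, and real duplicate detection usually needs more than string similarity.

```python
import unicodedata
from collections import defaultdict
from difflib import SequenceMatcher
from statistics import median


def birth_year_outliers(birth_years, max_distance=10):
    """Return the years that lie more than max_distance from the median."""
    mid = median(birth_years)
    return [year for year in birth_years if abs(year - mid) > max_distance]


def normalize(name):
    """Strip accents, case and whitespace so spelling variants become comparable."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return "".join(without_accents.lower().split())


def likely_duplicates(name_a, name_b, threshold=0.85):
    """Flag two names as probable duplicates if their normalized forms are similar."""
    ratio = SequenceMatcher(None, normalize(name_a), normalize(name_b)).ratio()
    return ratio >= threshold


def students_in_several_schools(enrolments):
    """Consistency rule: return students enrolled in more than one school."""
    schools_by_student = defaultdict(set)
    for student, school in enrolments:
        schools_by_student[student].add(school)
    return sorted(s for s, schools in schools_by_student.items() if len(schools) > 1)


print(birth_year_outliers([1995, 1959, 1995, 1996]))
print(likely_duplicates("Mohammad Al Zaïn", "Muhamad Alzain"))
print(students_in_several_schools(
    [("Paul", "HOB Bozen"), ("Paul", "HOB Meran"), ("Anna", "HOB Bozen")]
))
```

All three checks only flag candidates. Whether a flagged value is really wrong still has to be decided against the real world.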

Page 20

Dimension 2: Timeliness

• Through workflow organisation: bind workflows to the IT system
  => Timeliness is guaranteed

  Example: an enrolment is only valid if it is recorded in the database

• Through data about the currency of the data (metadata)
  => Timeliness can be estimated

  Example: “All dropouts until the 31st of March are recorded”

Page 21

Dimension 3/1: Completeness of values

• Can be enforced by the IT system

  Risk: nonsensical entries

• Alternative solution: enforce input of fewer values and record reasons for missing values

  E.g. “Not applicable” or “Unknown”
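One possible way to record the reason for a missing value is sketched below. The class and field names are hypothetical; the only elements taken from the slide are the two reasons “Not applicable” and “Unknown”.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class MissingReason(Enum):
    NOT_APPLICABLE = "Not applicable"
    UNKNOWN = "Unknown"


@dataclass
class DateField:
    """A date value that is either present or missing for a stated reason."""
    value: Optional[date] = None
    missing_reason: Optional[MissingReason] = None

    def __post_init__(self):
        # Exactly one of the two must be given: a value or a reason for its absence.
        if (self.value is None) == (self.missing_reason is None):
            raise ValueError("provide either a value or a reason why it is missing")


paul_birth = DateField(value=date(1995, 4, 7))
diego_birth = DateField(missing_reason=MissingReason.UNKNOWN)
print(diego_birth.missing_reason.value)
```

The field can be left empty, but not silently, which removes the incentive to type a nonsensical value just to get past a mandatory input.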

Page 22

Dimension 3/3: Conceptual completeness

• Solid design is important, but not everything can be foreseen

• Flexible IT:
  – Schema changes if necessary
  – Space for comments, additional information

• Otherwise: other fields will be misused

  Example: gas utility in the USA
  Warnings about dogs for meter readers are stored in the address field
  … later, bills are sent to that address:
  Address: Mountain Road 102 (Beware of dog)

Page 23

Dimension 3/2: Table completeness

• Events are completely recorded if they are bound to the IT system

  Example: sales in a supermarket

• In general, this binding is not possible
  => Only parts of the database tables are complete

• But: completeness is only necessary for specific uses

  Examples: school statistics from ASTAT, research

Page 24

Partial table completeness

• Common scenario: We have
  – Some, but not all data complete
  – Questions (“queries”) over the data

• Problems:
  – Do we have the data that is needed to answer the queries?
  – If not: What more data do we need?

Page 25

An (intuitive) example

• Suppose we have data about all students from
  – Italian schools
  – German schools, except for the primary school “Andreas Hofer”
  – Ladin schools, except for the high school “Gherdëna”

• Can we correctly answer questions about the Italian students in South Tyrol?

  Yes, because we have all data about students from Italian schools

Page 26

An (intuitive) example (2)

• Suppose we have data about all students from
  – Italian schools
  – German schools, except for the primary school “Andreas Hofer”
  – Ladin schools, except for the high school “Gherdëna”

• Can we answer questions about the high-school students in South Tyrol?

  No, because data from the “Gherdëna” high school is missing

  We could bug them to submit their data (but maybe the secretary is on holiday)

  We could ask someone else for the data, e.g., the local district administration
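The intuition behind both examples can be sketched as a small check: a question is answerable if none of the schools it concerns is among those with missing data. The school list below is hypothetical except for the two named schools, and the sketch assumes that a full list of schools is available.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class School:
    name: str
    language: str  # "italian", "german" or "ladin"
    level: str     # "primary" or "high"


# Schools whose student data is missing, as stated in the example.
MISSING = {"Andreas Hofer", "Gherdëna"}

# Hypothetical school list; only the two named schools come from the example.
SCHOOLS = [
    School("Italian primary school A", "italian", "primary"),
    School("Italian high school B", "italian", "high"),
    School("Andreas Hofer", "german", "primary"),
    School("German high school C", "german", "high"),
    School("Gherdëna", "ladin", "high"),
]


def schools_blocking(question):
    """Return the schools relevant to the question whose student data is missing."""
    return [s.name for s in SCHOOLS if question(s) and s.name in MISSING]


print(schools_blocking(lambda s: s.language == "italian"))
print(schools_blocking(lambda s: s.level == "high"))
```

An empty result means the question can be answered from the data at hand; otherwise the result names the schools whose data would have to be obtained first.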

Page 27

Our research

• How can one describe that data is complete to a certain extent?

• How can one find out whether the data one has is sufficient for a certain use?

• How can one find out which data is necessary to serve a certain use?

Page 28

Formal example

How many students attend an Italian school?

SELECT count(*)
FROM student, school
WHERE student.school = school.name
  AND school.language = 'italian';

Suppose we have all Italian students.

Can we answer this query completely?

Page 29

How can we formalize table completeness?

“We have all students from Italian schools”

• We imagine an ideal database that contains complete information about the world

• Completeness statements refer to this ideal database:

  => All ideal students from Italian schools occur among our real students

Page 30

How can we assert (partial) completeness of tables?

“We have all students from Italian schools”

Table completeness assertion:

real.student CONTAINS (
    SELECT ideal.student.*
    FROM ideal.student, ideal.school
    WHERE ideal.student.school = ideal.school.name
      AND ideal.school.language = 'italian')

Table completeness assertions constitute a logical theory about the real and the ideal database
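To make the meaning of such an assertion concrete, the sketch below evaluates it on a toy instance with Python's sqlite3 module. The ideal database is a thought construct that is not available in practice, so this only illustrates the semantics. CONTAINS is not standard SQL and is emulated here with EXCEPT; the table prefixes ideal_ and real_ (instead of ideal. and real.) and all sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ideal_school  (name TEXT, language TEXT);
    CREATE TABLE ideal_student (name TEXT, school TEXT);
    CREATE TABLE real_student  (name TEXT, school TEXT);

    INSERT INTO ideal_school VALUES ('School A', 'italian'), ('School B', 'german');
    INSERT INTO ideal_student VALUES
        ('Paul', 'School A'), ('Anna', 'School A'), ('Maria', 'School B');
    -- The real table lacks Maria, who attends the German school.
    INSERT INTO real_student VALUES ('Paul', 'School A'), ('Anna', 'School A');
""")

# Rows that the assertion requires but real_student lacks.
# An empty result means the assertion holds on this instance.
VIOLATIONS = """
    SELECT ideal_student.*
    FROM ideal_student, ideal_school
    WHERE ideal_student.school = ideal_school.name
      AND ideal_school.language = ?
    EXCEPT
    SELECT * FROM real_student
"""


def assertion_holds(language):
    """True if real_student contains all ideal students of schools with this language."""
    return conn.execute(VIOLATIONS, (language,)).fetchall() == []


print(assertion_holds("italian"), assertion_holds("german"))
```

On this instance the assertion for Italian schools holds, while the analogous assertion for German schools is violated by the missing record.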

Page 31

What does it mean that our query is complete?

Consider two versions:

“Real query”:

SELECT count(*)
FROM real.student, real.school
WHERE real.student.school = real.school.name
  AND real.school.language = 'italian';

“Ideal query”:

SELECT count(*)
FROM ideal.student, ideal.school
WHERE ideal.student.school = ideal.school.name
  AND ideal.school.language = 'italian';

Our query is complete if the real and the ideal query return the same result

(Can be expressed in logic, too)

=> Reasoning
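The definition can be tried out on a toy instance, again with sqlite3. The sketch compares the two query results on one hypothetical pair of databases, which is weaker than the reasoning the slide refers to: reasoning has to establish equality for every pair of databases that satisfies the assertions. The table prefixes and sample rows are assumptions, and the real school table is assumed to be complete, since the query joins both tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ideal_school  (name TEXT, language TEXT);
    CREATE TABLE ideal_student (name TEXT, school TEXT);
    CREATE TABLE real_school   (name TEXT, language TEXT);
    CREATE TABLE real_student  (name TEXT, school TEXT);

    INSERT INTO ideal_school VALUES ('School A', 'italian'), ('School B', 'german');
    INSERT INTO real_school  VALUES ('School A', 'italian'), ('School B', 'german');
    INSERT INTO ideal_student VALUES
        ('Paul', 'School A'), ('Anna', 'School A'), ('Maria', 'School B');
    -- Maria, a student of the German school, is missing from the real table.
    INSERT INTO real_student VALUES ('Paul', 'School A'), ('Anna', 'School A');
""")

# The same query, instantiated over the real or the ideal tables.
QUERY = """
    SELECT count(*)
    FROM {prefix}_student AS student, {prefix}_school AS school
    WHERE student.school = school.name
      AND school.language = ?
"""


def answer(prefix, language):
    """Run the count query over the tables with the given prefix."""
    return conn.execute(QUERY.format(prefix=prefix), (language,)).fetchone()[0]


def query_is_complete(language):
    """Complete on this instance if real and ideal query return the same result."""
    return answer("real", language) == answer("ideal", language)


print(query_is_complete("italian"), query_is_complete("german"))
```

Here the count for Italian schools agrees in both databases even though the real student table is incomplete overall, while the count for German schools does not.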

Page 32

Our results so far

• Formalization
• General reasoning procedures for
  – Single-block SQL queries
  – With comparisons
  – Group By
  – Aggregate functions min, max, count, sum
• Complexity analysis (sometimes high!)
• Architecture for a reasoning system
• “Inverse reasoning” (see later slide)

This is a start; many things are still missing

Page 33

Reasoning with schema information

• To draw interesting inferences, we need to take into account
  – Keys
  – Foreign keys
  – Finite domains

=> Reasoning becomes more complicated

(Current research)

Page 34

Inverse reasoning

• So far:

  Given: assertions about table completeness
  Question: Can query Q be answered completely?

• Also interesting:

  Given: query Q
  Question: Which minimal completeness assertions ensure that Q is complete?

• Can be answered by applying our inference methods backwards

Page 35

Perspective: Probabilistic completeness management

• Our theory so far:

Boolean statements: complete/not complete

• In practice, it is often sufficient to know

  “With probability ≥ p, we make an error < ε”

• Probabilistic assertions:

  “With 90% probability, we are not missing more than 5 students”

=> Probabilistic inferences

Page 36

Conclusion

• Data quality has several dimensions
  – Correctness, timeliness, completeness

• Our current interest
  – How can one describe which data are complete?
  – How can one find out which queries can be answered completely?
  – If not, which additional data is needed?

• Perspective: Probabilistic completeness management
