Saskia Ossen, and Piet Daas Introduction in the Data hyperdimension

Dec 24, 2015

Page 1: Saskia Ossen, and Piet Daas Introduction in the Data hyperdimension.

Saskia Ossen, and Piet Daas

Introduction in the Data hyperdimension

Page 2

Purpose of the module

- Introduction to the Data hyperdimension

- Introduction of indicators for data evaluation (implemented in R software package)
  • Developed within the European BLUE ETS project
  • Theory and practical examples
  • Group exercise in which groups determine whether a source should be used, based on the results for the Data hyperdimension

- Introduction of the Quality Report Card

Page 3

Data: quality of the input

– Input quality of administrative data
  • After evaluation of Source and Metadata hyperdimension

– Data hyperdimension studies
  • Quality of the facts (values) in the source
  • Data are part of every delivery!

  • Time needed for evaluation is a serious issue
  • Evaluate every delivery thoroughly?
  • Evaluation may differ depending on the use intended (output)
  • Relation with process (availability and quality of other data sources)

Page 4

Essential pre-requisites and considerations

– Evaluation of the data quality of input sources needs to be efficient

– Focus on essential quality components
  • What are the essential dimensions of input data quality?
  • What are the essential indicators for those dimensions?
  • For objects (units/events) and variables

– Purely input or also with output in mind?
  • Data Source Quality (admin. data quality per se)
  • Input-oriented Output Quality (guesstimate of expected effect on output)

Page 5

Essential dimensions of input data quality

– Five essential quality dimensions identified for input data of administrative sources:

1. Technical checks
   • Technical accessibility, IT-part

2. Accuracy
   • Correctness, validity, error-freeness

3. Completeness
   • Coverage of units, missing variable data

4. Time-related dimension
   • Timeliness, punctuality, period covered

5. Integrability
   • Ease of integration and consistency of data between sources

Page 6

Technical checks: Theory

Indicators | Description
1. Technical checks | Technical usability of the file and data in the file
1.1 Readability | Accessibility of the file and data in the file
1.2 File declaration compliance | Compliance of the data in the file with the metadata agreements
1.3 Convertibility | Conversion of the file to the NSI-standard format

Technical checks dimension

Page 7

Technical checks: Examples

– Very important for new sources, becomes somewhat less essential later on
  ‐ Corrupt files
  ‐ Encoded files for which the decoding password is missing
  ‐ Files in which the data do not comply with the metadata description
  ‐ Files with errors during/after conversion
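Several of these problems can be caught before any content is inspected. The Python sketch below is one possible readability check, not the BLUE ETS implementation: it assumes a semicolon-delimited UTF-8 text file and a hypothetical list of declared column names (`expected_columns`), and reports decoding failures, empty files and rows whose field count differs from the declaration.

```python
import csv
import io


def check_readability(raw_bytes, expected_columns, encoding="utf-8"):
    """Return a list of problems found while reading a delimited file."""
    try:
        text = raw_bytes.decode(encoding)
    except UnicodeDecodeError as exc:
        # Corrupt or wrongly encoded file: nothing else can be checked.
        return [f"file cannot be decoded as {encoding}: {exc.reason}"]

    reader = csv.reader(io.StringIO(text), delimiter=";")
    header = next(reader, None)
    if header is None:
        return ["file is empty"]

    problems = []
    if header != expected_columns:
        problems.append(f"header {header} differs from declaration {expected_columns}")
    for line_no, row in enumerate(reader, start=2):
        if len(row) != len(expected_columns):
            problems.append(
                f"line {line_no}: {len(row)} fields, expected {len(expected_columns)}"
            )
    return problems


# Hypothetical deliveries: one clean, one with an extra field, one corrupt.
columns = ["id", "sex"]
print(check_readability(b"id;sex\n1;M\n2;F\n", columns))
print(check_readability(b"id;sex\n1;M;extra\n", columns))
print(check_readability(b"\xff\xfe\x00", columns))
```

An empty result list means the file passed these particular checks; it says nothing yet about the correctness of the values.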

Page 8

Technical checks: File declaration compliance

– Simple frequency distributions are very helpful
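As a minimal sketch of how a frequency distribution can expose declaration problems, the Python fragment below counts the values of one variable and flags those that do not occur in the metadata. The variable, its declared code list (`DECLARED_CODES`) and the sample delivery are hypothetical.

```python
from collections import Counter

# Hypothetical code list from the metadata agreement for a variable "sex".
DECLARED_CODES = {"M", "F", "U"}


def frequency_check(values, declared_codes):
    """Return the frequency distribution and the counts of undeclared values."""
    freq = Counter(values)
    undeclared = {v: n for v, n in freq.items() if v not in declared_codes}
    return freq, undeclared


# Illustrative delivery: "X" and the empty string are not in the declaration.
delivery = ["M", "F", "F", "X", "M", "", "F"]
freq, undeclared = frequency_check(delivery, DECLARED_CODES)

for value, count in freq.most_common():
    flag = "" if value in DECLARED_CODES else "  <- not declared"
    print(f"{value!r:>4} {count:>3}{flag}")
```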

Page 9

Technical checks: File declaration compliance

Page 10

Accuracy: Theory

Indicators | Description
2. Accuracy | The extent to which data are correct, reliable, and certified
Objects
2.1 Authenticity | Legitimacy of objects
2.2 Inconsistent objects | Extent of erroneous objects in source
2.3 Dubious objects | Presence of untrustworthy objects
Variables
2.4 Measurement errors | Deviation of actual data value from ideal error-free measurements
2.5 Inconsistent values | Extent of inconsistent combinations of variable values
2.6 Dubious values | Presence of implausible values or combinations of values for variables

Accuracy dimension

Page 11

Accuracy example: Authenticity (1)
% of objects with a syntactically incorrect identification key

– Objects with incorrect identification numbers (IDs)

– In the Netherlands all people have a Citizen’s Service Number
  ‐ 9-digit number (e.g. 123456782)
  ‐ Number has a feasibility check, last digit is a checking digit
  ‐ Rule used: sum(9*n1 + 8*n2 + 7*n3 + 6*n4 + 5*n5 + 4*n6 + 3*n7 + 2*n8 – 1*n9); remainder of sum/11 should be 0

– In the Social Statistical Database* it was found (in 2000) that:
  ‐ 0.3% of all persons in the admin. data sources used had an invalid Citizen’s Service Number

*Set of integrated admin. data sources and surveys (then ~100 million admin records). Arts et al. (2000) Netherlands Official Statistics 15, pp. 16-22.
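The checking rule on this slide is easy to express in code. The Python sketch below implements exactly the weighted sum stated in the rule (weights 9 down to 2 for the first eight digits, -1 for the last) and computes the share of syntactically invalid numbers in a list. The function names and the sample list are illustrative assumptions; any further validity rules that may apply in practice are not covered.

```python
WEIGHTS = (9, 8, 7, 6, 5, 4, 3, 2, -1)  # the last digit is the checking digit


def is_valid_csn(number: str) -> bool:
    """Apply the rule from the slide: the weighted digit sum must be divisible by 11."""
    if len(number) != 9 or not (number.isascii() and number.isdigit()):
        return False
    total = sum(w * int(ch) for w, ch in zip(WEIGHTS, number))
    return total % 11 == 0


def invalid_share(numbers) -> float:
    """Percentage of numbers that fail the check."""
    numbers = list(numbers)
    invalid = sum(1 for n in numbers if not is_valid_csn(n))
    return 100.0 * invalid / len(numbers)


# Made-up sample: the slide's example number plus three invalid ones.
sample = ["123456782", "123456789", "12345678", "12345678X"]
print(is_valid_csn("123456782"))
print(invalid_share(sample))
```

For the example number on the slide the weighted sum is 154, which is 14 times 11, so the check passes.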

Page 12

Accuracy example: Authenticity (2)
% of objects for which the source contains information that contradicts information in a reference list for those objects

– Studies reveal significant differences between findings for ‘educational attainment’ obtained from a survey and from linked administrative data sources.

More in: Bakker (2011) Estimating the Validity of Administrative Variables. ISI-paper session IPS030, Dublin, Ireland.

Page 13

Accuracy example: Authenticity (3)
% of objects for which the source contains information that contradicts information in a reference list for those objects

Page 14

Accuracy example: Inconsistent objects

Rule: a person is part of exactly one household
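A rule like this can be checked mechanically. The Python sketch below lists every person who is linked to no household or to more than one; the record layout (pairs of person id and household id) and all identifiers are assumptions made for the illustration.

```python
from collections import defaultdict


def household_violations(persons, memberships):
    """Return persons linked to zero or to more than one household.

    persons: iterable of person ids in the source.
    memberships: iterable of (person_id, household_id) pairs.
    """
    households = defaultdict(set)
    for person_id, household_id in memberships:
        households[person_id].add(household_id)
    violations = {}
    for person_id in persons:
        linked = households.get(person_id, set())
        if len(linked) != 1:
            violations[person_id] = sorted(linked)
    return violations


# Invented data: p2 is in two households, p3 in none.
persons = ["p1", "p2", "p3"]
memberships = [("p1", "h1"), ("p2", "h1"), ("p2", "h2")]
violations = household_violations(persons, memberships)
print(violations)
print(f"{100 * len(violations) / len(persons):.1f}% inconsistent objects")
```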

Page 15

Accuracy example: Dubious values
Cross tabulation of the variable “Current activity status” versus age group
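A cross tabulation of this kind can be sketched in a few lines of Python. The records, the category labels and the set of implausible combinations (`IMPLAUSIBLE`) below are invented for the illustration; which combinations count as dubious is a subject-matter decision.

```python
from collections import Counter

# Hypothetical records: (age group, current activity status).
records = [
    ("0-14", "pupil"), ("0-14", "pupil"), ("0-14", "retired"),
    ("15-64", "employed"), ("15-64", "employed"), ("15-64", "pupil"),
    ("65+", "retired"), ("65+", "employed"),
]

# Assumed plausibility rule for this illustration only.
IMPLAUSIBLE = {("0-14", "retired"), ("0-14", "employed")}

table = Counter(records)
age_groups = sorted({age for age, _ in records})
statuses = sorted({status for _, status in records})

# Print the table with statuses as rows and age groups as columns.
print("status".ljust(10) + "".join(g.rjust(7) for g in age_groups))
for status in statuses:
    cells = "".join(str(table[(g, status)]).rjust(7) for g in age_groups)
    print(status.ljust(10) + cells)

# Non-empty cells that the assumed rule marks as implausible.
dubious = {cell: n for cell, n in table.items() if cell in IMPLAUSIBLE}
print("dubious cells:", dubious)
```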

Page 16

Completeness: Theory

Indicators | Description
3. Completeness | Degree to which a data source includes data describing the corresponding set of real-world objects and variables
Objects
3.1 Undercoverage | Absence of target objects (missing objects) in the source
3.2 Overcoverage | Presence of non-target objects in the source
3.3 Selectivity | Statistical coverage and representativity of objects
3.4 Redundancy | Presence of multiple registrations of objects
Variables
3.5 Missing values | Absent values for (key) variables
3.6 Imputed values | Presence of values resulting from imputation actions by data source holder

Completeness dimension
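Some of these indicators reduce to simple set arithmetic once a reference population is available. The Python sketch below is one way to compute undercoverage, overcoverage, redundancy and missing values; the choice of denominators and the function names are assumptions, not definitions taken from the slides.

```python
def coverage_indicators(source_ids, target_ids):
    """Compare object ids in the source with a reference (target) population."""
    source_ids = list(source_ids)
    source = set(source_ids)
    target = set(target_ids)
    return {
        # Target objects missing from the source, relative to the target population.
        "undercoverage_pct": 100.0 * len(target - source) / len(target),
        # Non-target objects in the source, relative to distinct source objects.
        "overcoverage_pct": 100.0 * len(source - target) / len(source),
        # Extra registrations of objects already present, relative to all records.
        "redundancy_pct": 100.0 * (len(source_ids) - len(source)) / len(source_ids),
    }


def missing_pct(values):
    """Share of absent values (None or empty string) for one variable."""
    values = list(values)
    return 100.0 * sum(v is None or v == "" for v in values) / len(values)


# Invented ids: "b" is registered twice, "x" is not in the target population.
result = coverage_indicators(["a", "b", "b", "x"], ["a", "b", "c", "d"])
print(result)
print(missing_pct([1, None, "", 4]))
```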

Page 17

Completeness example: Selectivity (1)

Page 18

Completeness example: Selectivity (2)

The education register has age-related undercoverage of educational attainment (56.3% is missing)

Explanation:
1) Children under 15 years of age have a known level of education
2) Level of education of young adults is usually stored in recently created admin. data sources
3) Information on ‘middle-aged’ people is obtained from the LFS survey (small compared to admin. data info)
4) Information on ‘elderly’ people (≥65 years) is almost completely missing (not surveyed and hardly registered)
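Selectivity of this kind shows up when the share of missing values is computed per subgroup rather than for the source as a whole. The Python sketch below does this for age groups; the records are invented and do not reproduce the figures from the education register.

```python
from collections import defaultdict


def missing_by_group(records):
    """records: iterable of (group, value); a value of None means missing.

    Returns {group: percentage of records with a missing value}.
    """
    totals = defaultdict(int)
    missing = defaultdict(int)
    for group, value in records:
        totals[group] += 1
        if value is None:
            missing[group] += 1
    return {g: 100.0 * missing[g] / totals[g] for g in totals}


# Invented records: (age group, educational attainment).
records = [
    ("<15", "primary"), ("<15", "primary"),
    ("15-64", "secondary"), ("15-64", None),
    ("65+", None), ("65+", None),
]
rates = missing_by_group(records)
for group, pct in rates.items():
    print(f"{group:>6}: {pct:5.1f}% missing")
```

Large differences between groups indicate that the source covers the population selectively, even when the overall missing rate looks moderate.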

Page 19

Pre-evaluation and input quality of administrative data sources (Part 2)

Completeness example: Selectivity (3)

Page 20

Time related: Theory

Indicators | Description
4. Time-related dimension | Indicators that are time and/or stability related
4.1 Timeliness | Lapse of time between the end of the reference period and the moment of receipt of the data source
4.2 Punctuality | Possible time lag between the actual delivery date of the source and the date it should have been delivered
4.3 Overall time lag | Overall time difference between the end of the reference period and the moment it is concluded that it can definitely be used
4.4 Delay | Extent of delays in registration
Objects
4.5 Dynamics of objects | Changes in the population of objects (new and dead objects) over time
Variables
4.6 Stability of variables | Changes of variables or values over time

Time-related dimension
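The first three indicators are differences between dates, so they can be computed directly once the dates are recorded. The Python sketch below follows the descriptions in the table; the parameter names and the example dates are assumptions, and measuring in days is a choice made for the illustration.

```python
from datetime import date


def time_indicators(reference_period_end, agreed_delivery, actual_delivery, usable_from):
    """Return timeliness, punctuality and overall time lag in days."""
    return {
        # 4.1: end of reference period until receipt of the source.
        "timeliness_days": (actual_delivery - reference_period_end).days,
        # 4.2: actual delivery compared with the agreed delivery date.
        "punctuality_days": (actual_delivery - agreed_delivery).days,
        # 4.3: end of reference period until the source is declared usable.
        "overall_time_lag_days": (usable_from - reference_period_end).days,
    }


# Invented dates for one delivery.
lags = time_indicators(
    reference_period_end=date(2011, 12, 31),
    agreed_delivery=date(2012, 2, 1),
    actual_delivery=date(2012, 2, 15),
    usable_from=date(2012, 3, 1),
)
print(lags)
```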

Page 21

Time-related example: Delay

– Events recorded some time after they have occurred
  • Events are missing (or erroneously recorded)
  • Particularly important for sources used immediately

– Examples:
  • Marriages contracted in immigrants’ country of origin are sometimes recorded two or three years after the event (Bakker et al. AIOS-paper 2008)
  • Part of VAT-data is reported later than is needed for monthly estimates (Vlag, ISI-paper 2011)
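When a source records both the date of an event and the date of its registration, the extent of delay can be summarised as sketched below in Python. The events, the 30-day threshold and the function names are invented for the illustration.

```python
from datetime import date
from statistics import median


def registration_delays(events):
    """events: iterable of (event_date, registration_date); returns delays in days."""
    return [(registered - occurred).days for occurred, registered in events]


def late_share(delays, threshold_days):
    """Percentage of events registered more than threshold_days after they occurred."""
    return 100.0 * sum(d > threshold_days for d in delays) / len(delays)


# Invented events; the third one is registered two years after it occurred.
events = [
    (date(2010, 1, 1), date(2010, 1, 11)),
    (date(2010, 3, 1), date(2010, 3, 5)),
    (date(2008, 6, 1), date(2010, 6, 1)),
]
delays = registration_delays(events)
print("median delay (days):", median(delays))
print("registered more than 30 days late (%):", round(late_share(delays, 30), 1))
```

The median is used here because a few very late registrations, as in the marriage example, would dominate a mean.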

Page 22

Time-related example: Stability of variables (1)

Type of comparison used in the Dutch Short term Statistics

Page 23

Time-related example: Stability of variables (2)

Time-series for a single company

Page 24

Integrability: Theory

Indicators | Description
5. Integrability | Extent to which the data source is capable of undergoing integration or of being integrated
Objects
5.1 Comparability of objects | Similarity of objects in source (at the proper level of detail) with the objects used by NSI
5.2 Alignment of objects | Linking-ability (align-ability) of objects in source with those of NSI
Variables
5.3 Linking variable | Usefulness of linking variables (keys) in source
5.4 Comparability of variables | Proximity (closeness) of variables

Integrability dimension
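A first impression of alignment and of the usefulness of a linking variable can be obtained by matching the keys in the source against those in an NSI register. The Python sketch below is a simplified illustration based on exact key matching; the key values, the function name and the chosen percentages are assumptions.

```python
def linkage_summary(source_keys, nsi_keys):
    """Summarise how well source objects can be linked to NSI objects by key."""
    source_keys = list(source_keys)
    usable = [k for k in source_keys if k]  # drop empty or missing keys
    nsi = set(nsi_keys)
    linked = [k for k in usable if k in nsi]
    return {
        # Share of source records that carry a key at all.
        "usable_key_pct": 100.0 * len(usable) / len(source_keys),
        # Share of source records whose key is found in the NSI register.
        "linked_pct": 100.0 * len(linked) / len(source_keys),
        # Keys present in the source but unknown to the NSI.
        "unlinked_keys": sorted(set(usable) - nsi),
    }


# Invented keys: two records have no key, one key is unknown to the NSI.
summary = linkage_summary(["A1", "A2", "", "B9", None], ["A1", "A2", "A3"])
print(summary)
```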

Page 25

Integrability example: Alignment of objects

[Figure: two panels (export, import); axes: VAT-turnover (€) and ICP-turnover (€)]

Finding:
- Differences between two admin. data sources (ICP and VAT), both used for International trade statistics
- Export aligns well but import is much more problematic!

Explanation:
- ICP import units are difficult to identify and can therefore not always be linked correctly
- ICP export data can be integrated well.

Page 26

Quality Report Card: Step 1 Indicator level

– Step 1: Determine one score per indicator

Page 27

Quality Report Card: Step 2 Dimensional level

– Step 2: Determine one score per dimension

Page 28

Quality Report Card: Step 3 General level

– Step 3: Determine a general score
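The three steps amount to a two-level aggregation. The slides do not state how scores are combined, so the Python sketch below rests entirely on assumptions: a three-level scale ("-", "o", "+"), invented indicator scores, and the rule that the worst score decides at each level. Other rules, such as a majority vote or expert judgement, would fit the same structure.

```python
# Assumed ordinal scale for this sketch: "-" (poor) < "o" (reasonable) < "+" (good).
RANK = {"-": 0, "o": 1, "+": 2}


def worst(scores):
    """Return the lowest score on the assumed scale."""
    return min(scores, key=RANK.__getitem__)


# Step 1: one (invented) score per indicator, grouped by dimension.
indicator_scores = {
    "Technical checks": {"1.1": "+", "1.2": "+", "1.3": "+"},
    "Accuracy": {"2.1": "+", "2.2": "o", "2.5": "+"},
    "Completeness": {"3.1": "-", "3.5": "o"},
}

# Step 2: one score per dimension (assumed rule: the worst indicator decides).
dimension_scores = {dim: worst(s.values()) for dim, s in indicator_scores.items()}

# Step 3: one general score (same assumed rule over the dimensions).
general_score = worst(dimension_scores.values())

print(dimension_scores)
print("general score:", general_score)
```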

Page 29

Questions?

Any questions or comments?

Page 30

Exercise

– Let’s try to interpret some data quality findings!
– To ease the exercise, every indicator has a single score

Page 31

Group exercise

– Participants will be split into groups and each group is provided with:
  ‐ The Source, Metadata and Data results for the administrative data source discussed in the previous exercise
  ‐ An intended use

– Each group will be asked to discuss:
  ‐ whether the data in the source could be used for the purpose intended
    • If yes, why is everything OK?
    • If not, what is the problem that prevents its use and how can it be solved?