Saskia Ossen and Piet Daas
Introduction to the Data hyperdimension
Dec 24, 2015
Purpose of the module
- Introduction to the Data hyperdimension
- Introduction of indicators for data evaluation (implemented in R software package)
  • Developed within European BLUE ETS project
  • Theory and practical examples
  • Group exercise in which groups determine whether a source should be used based on the results for the data hyperdimension
- Introduction of the Quality Report Card
Data: quality of the input
– Input quality of administrative data
  • After evaluation of Source and Metadata hyperdimension
– Data hyperdimension studies
  • Quality of the facts (values) in the source
  • Data are part of every delivery!
  • Time needed for evaluation is a serious issue
  • Evaluate every delivery thoroughly?
  • Evaluation may differ depending on the use intended (output)
  • Relation with process (availability and quality of other data sources)
Essential pre-requisites and considerations
– Evaluation of the data quality of input sources needs to be efficient
– Focus on essential quality components
  • What are the essential dimensions of input data quality?
  • What are the essential indicators for those dimensions?
  • For objects (units/events) and variables
– Purely input or also with output in mind?
  • Data Source Quality (admin. data quality per se)
  • Input oriented Output Quality (guesstimate of expected effect on output)
Essential dimensions of input data quality
– Five essential quality dimensions identified for input data of administrative sources:
1. Technical checks
  • Technical accessibility, IT-part
2. Accuracy
  • Correctness, validity, error-freeness
3. Completeness
  • Coverage of units, missing variable data
4. Time-related dimension
  • Timeliness, punctuality, period covered
5. Integrability
  • Ease of integration and consistency of data between sources
Technical checks: Theory
Indicators                        Description
1. Technical checks               Technical usability of the file and data in the file
1.1 Readability                   Accessibility of the file and data in the file
1.2 File declaration compliance   Compliance of the data in the file to the metadata agreements
1.3 Convertability                Conversion of the file to the NSI-standard format
Technical checks dimension
Technical checks: Examples
– Very important for new sources, becomes somewhat less essential later on
  ‐ Corrupt files
  ‐ Encoded files of which decoding password is missing
  ‐ Files of which the data is not compliant to the metadata description
  ‐ Files with errors during/after conversion
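The first two technical checks can be sketched in a few lines of Python. This is only an illustration, not the R implementation developed within BLUE ETS: it assumes the delivery is a comma-separated file with a header row, and the function name `technical_checks` and the parameter `expected_columns` (the column names agreed in the metadata) are hypothetical.

```python
import csv
import io
from typing import Dict, List


def technical_checks(raw: bytes, expected_columns: List[str],
                     encoding: str = "utf-8") -> Dict[str, bool]:
    """Rough readability and file-declaration checks for a CSV delivery."""
    result = {"readable": False, "declaration_compliant": False}
    # Readability: the bytes must decode and parse into at least one row.
    try:
        text = raw.decode(encoding)
        rows = list(csv.reader(io.StringIO(text, newline="")))
    except (UnicodeDecodeError, csv.Error):
        return result
    if not rows:
        return result
    result["readable"] = True
    # File declaration (assumed rule): the header equals the agreed column
    # names and every record has the same number of fields as the header.
    header, records = rows[0], rows[1:]
    result["declaration_compliant"] = (
        header == expected_columns
        and all(len(record) == len(header) for record in records)
    )
    return result
```

Convertability is left out because it depends on the NSI-standard format, which is not specified here.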
Accuracy: Theory
Indicators                   Description
2. Accuracy                  The extent to which data are correct, reliable, and certified
Objects
2.1 Authenticity             Legitimacy of objects
2.2 Inconsistent objects     Extent of erroneous objects in source
2.3 Dubious objects          Presence of untrustworthy objects
Variables
2.4 Measurement errors       Deviation of actual data value from ideal error-free measurements
2.5 Inconsistent values      Extent of inconsistent combinations of variable values
2.6 Dubious values           Presence of implausible values or combinations of values for variables
Accuracy dimension
– Objects with incorrect identification numbers (IDs)
– In the Netherlands all people have a Citizen Service Number
  ‐ 9-digit number (e.g. 123456782)
  ‐ Number has a feasibility check; the last digit is a check digit
  ‐ Rule used: sum = 9*n1 + 8*n2 + 7*n3 + 6*n4 + 5*n5 + 4*n6 + 3*n7 + 2*n8 - 1*n9
    Remainder of sum/11 should be 0
– In the Social Statistical Database* it was found (in 2000) that:
  ‐ 0.3% of all persons in the admin. data sources used had an invalid Citizen Service Number
*Set of integrated admin. data sources and surveys (then ~100 million admin. records). Arts et al. (2000) Netherlands Official Statistics 15, pp. 16-22.
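The check-digit rule above can be written out as a short Python sketch. It implements only the rule as stated on this slide (weights 9 down to 2 for the first eight digits, -1 for the last, remainder after division by 11 equal to 0); the function names `is_valid_csn` and `share_invalid` are hypothetical, and any further rules that may apply to real numbers are not covered.

```python
from typing import Sequence

# Weights for digits n1..n9 as given in the rule on the slide.
WEIGHTS = [9, 8, 7, 6, 5, 4, 3, 2, -1]


def is_valid_csn(number: str) -> bool:
    """Return True if a 9-digit string passes the 11-check from the slide."""
    if len(number) != 9 or not (number.isascii() and number.isdigit()):
        return False
    total = sum(w * int(ch) for w, ch in zip(WEIGHTS, number))
    return total % 11 == 0


def share_invalid(numbers: Sequence[str]) -> float:
    """Percentage of identifiers that are syntactically incorrect."""
    if not numbers:
        return 0.0
    invalid = sum(1 for n in numbers if not is_valid_csn(n))
    return 100.0 * invalid / len(numbers)
```

For the example number 123456782 the weighted sum is 154, which is 14 times 11, so the number passes the check.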
Accuracy example: Authenticity (1)
% of objects with a syntactically incorrect identification key
Accuracy example: Authenticity (2)
% of objects for which the source contains information contradicting information in a reference list for those objects
– Studies reveal significant differences between findings for ‘educational attainment’ obtained from a survey and from linked administrative data sources.
More in: Bakker (2011) Estimating the Validity of Administrative Variables. ISI-paper session IPS030, Dublin, Ireland.
Accuracy example: Authenticity (3)
% of objects for which the source contains information contradicting information in a reference list for those objects
Accuracy example: Dubious values
Cross tabulation of the variable “Current activity status” versus age group
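A cross tabulation of this kind could be sketched as follows. The age groups, status labels and the set of combinations treated as implausible are illustrative assumptions, not values from the source, and the function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, Set, Tuple

Pair = Tuple[str, str]  # (age group, current activity status)


def cross_tabulate(records: Iterable[Pair]) -> Counter:
    """Count how often each (age group, status) combination occurs."""
    return Counter(records)


def dubious_share(table: Counter, implausible: Set[Pair]) -> float:
    """Percentage of records that fall in a combination flagged as implausible."""
    total = sum(table.values())
    if total == 0:
        return 0.0
    flagged = sum(count for pair, count in table.items() if pair in implausible)
    return 100.0 * flagged / total


# Hypothetical records and a hypothetical plausibility rule.
records = [("0-14", "student"), ("0-14", "retired"),
           ("15-64", "employed"), ("65+", "retired")]
implausible = {("0-14", "retired"), ("0-14", "employed")}
print(dubious_share(cross_tabulate(records), implausible))
```

Which combinations count as dubious is a subject-matter decision; the code only does the counting.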
Completeness: Theory
Indicators             Description
3. Completeness        Degree to which a data source includes data describing the corresponding set of real-world objects and variables
Objects
3.1 Undercoverage      Absence of target objects (missing objects) in the source
3.2 Overcoverage       Presence of non-target objects in the source
3.3 Selectivity        Statistical coverage and representativity of objects
3.4 Redundancy         Presence of multiple registrations of objects
Variables
3.5 Missing values     Absent values for (key) variables
3.6 Imputed values     Presence of values resulting from imputation actions by data source holder
Completeness dimension
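Several of the completeness indicators reduce to simple set and count operations. The sketch below is one possible reading, not the definitions used in the BLUE ETS R package: the choice of denominators, the function name `completeness_indicators` and its parameters are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List, Optional, Set


def _pct(part: int, whole: int) -> float:
    """Percentage that avoids division by zero."""
    return 100.0 * part / whole if whole else 0.0


def completeness_indicators(source_ids: List[str], target_ids: Set[str],
                            values: List[Optional[str]]) -> Dict[str, float]:
    """Undercoverage, overcoverage, redundancy and missing values as percentages."""
    distinct = set(source_ids)
    counts = Counter(source_ids)
    return {
        # Target objects absent from the source, relative to the target population.
        "undercoverage": _pct(len(target_ids - distinct), len(target_ids)),
        # Non-target objects in the source, relative to distinct source objects.
        "overcoverage": _pct(len(distinct - target_ids), len(distinct)),
        # Distinct objects that are registered more than once.
        "redundancy": _pct(sum(1 for c in counts.values() if c > 1), len(distinct)),
        # Records with no value (None) for the variable under study.
        "missing_values": _pct(sum(1 for v in values if v is None), len(values)),
    }
```

Selectivity and imputed values are not covered: the first needs a comparison of distributions, the second needs imputation flags from the data source holder.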
Completeness example: Selectivity (2)
The education register has age-related undercoverage of educational attainment (56.3% is missing)
Explanation:
1) Children (<15 years) have a known level of education
2) Level of education of young adults is usually stored in recently created admin. data sources
3) Information on ‘middle-aged’ people is obtained from the LFS-survey (small compared to admin. data info)
4) Information on ‘elderly’ people (≥65 years) is almost completely missing (not surveyed and hardly registered)
Pre-evaluation and input quality of administrative data sources (Part 2)
Completeness example: Selectivity (3)
Time related: Theory
Indicators                   Description
4. Time-related dimension    Indicators that are time and/or stability related
4.1 Timeliness               Lapse of time between the end of the reference period and the moment of receipt of the data source
4.2 Punctuality              Possible time lag between the actual delivery date of the source and the date it should have been delivered
4.3 Overall time lag         Overall time difference between the end of the reference period and the moment it is concluded that the source can definitely be used
4.4 Delay                    Extent of delays in registration
Objects
4.5 Dynamics of objects      Changes in the population of objects (new and dead objects) over time
Variables
4.6 Stability of variables   Changes of variables or values over time
Time-related dimension
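The first three time-related indicators are differences between dates, which can be sketched as below. Expressing them in days is an assumption, as are the function and parameter names; the dates in the usage example are made up.

```python
from datetime import date


def timeliness(reference_period_end: date, received: date) -> int:
    """Days between the end of the reference period and receipt of the source."""
    return (received - reference_period_end).days


def punctuality(agreed_delivery: date, received: date) -> int:
    """Days between the agreed and the actual delivery date (negative if early)."""
    return (received - agreed_delivery).days


def overall_time_lag(reference_period_end: date, usable: date) -> int:
    """Days between the end of the reference period and the usability decision."""
    return (usable - reference_period_end).days


# Hypothetical delivery: reference period ends 31 Dec 2015.
print(timeliness(date(2015, 12, 31), date(2016, 2, 15)))
print(punctuality(date(2016, 2, 1), date(2016, 2, 15)))
print(overall_time_lag(date(2015, 12, 31), date(2016, 3, 1)))
```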
Time-related example: Delay
– Events recorded some time after they have occurred
  • Events are missing (or erroneously recorded)
  • Particularly important for sources used immediately
– Examples:
  • Marriages contracted in immigrants’ country of origin are sometimes recorded two or three years after the event (Bakker et al. AIOS-paper 2008)
  • Part of VAT-data is reported later than is needed for monthly estimates (Vlag, ISI-paper 2011)
Time-related example: Stability of variables (1)
Type of comparison used in the Dutch Short term Statistics
Integrability: Theory
Indicators                         Description
5. Integrability                   Extent to which the data source is capable of undergoing integration or of being integrated
Objects
5.1 Comparability of objects       Similarity of objects in source, at the proper level of detail, with the objects used by NSI
5.2 Alignment of objects           Linkability (alignability) of objects in source with those of NSI
Variables
5.3 Linking variable               Usefulness of linking variables (keys) in source
5.4 Comparability of variables     Proximity (closeness) of variables
Integrability dimension
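One simple way to put a number on the alignment of objects is the share of source records whose key is found among the NSI's own units. The sketch below assumes exact matching on a single key; the function name `linkage_rate` and the example keys are hypothetical.

```python
from typing import Iterable, Set


def linkage_rate(source_keys: Iterable[str], nsi_keys: Set[str]) -> float:
    """Percentage of source records whose key matches a unit known to the NSI."""
    keys = list(source_keys)
    if not keys:
        return 0.0
    linked = sum(1 for key in keys if key in nsi_keys)
    return 100.0 * linked / len(keys)


# Hypothetical keys: three of the four source records can be linked.
print(linkage_rate(["u1", "u2", "u3", "zz"], {"u1", "u2", "u3", "u4"}))
```

A low rate can point to poor linking variables (5.3) as well as to objects that do not align (5.2), so the figure needs interpretation alongside the other indicators.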
Integrability example: Alignment of objects
[Figure: ICP turnover (€) versus VAT turnover (€), one panel for export and one for import]
Finding:
- Differences between two admin. data sources (ICP and VAT), both used for International trade statistics
- Export aligns well but import is much more problematic!
Explanation:
- ICP import units are difficult to identify and can therefore not always be linked correctly
- ICP export data can be integrated well.
Exercise
– Let’s try to interpret some data quality findings!
– To ease the exercise, every indicator has a single score
Group exercise
– Participants will be split into groups and each group is provided with:
  ‐ The Source, Metadata and Data results for the administrative data source discussed in the previous exercise
  ‐ An intended use
– Each group will be asked to discuss:
  ‐ Whether the data in the source could be used for the purpose intended
    • If yes, why is everything OK?
    • If not, what is the problem that prevents its use and how can it be solved?