
ESSnet Statistical Methodology Project on Integration of Survey and Administrative Data

Report of WP2. Recommendations on the use of methodologies for the integration of surveys and administrative data

LIST OF CONTENTS

Preface 3

Basic principles for matching procedures (Gervasio Fernández - INE) 3

1. Recommendations on harmonization of definitions, variables, units, populations of different sources 1

1.1. Some preliminary common aspects for record linkage and matching: Reconciliation of concepts and definitions of two sources (Mauro Scanu - ISTAT). 1

1.2. Bibliography 6

2. Recommendations on probabilistic record linkage 7

2.1. The practical aspects to be considered for record linkage (Nicoletta Cibella, Mauro Scanu and Tiziana Tuoto - ISTAT) 7

2.2. How to choose the matching variables (Nicoletta Cibella - ISTAT) 9

2.3. How to choose the blocking variables (Nicoletta Cibella - ISTAT) 11

2.4. Selection of the matching type (Nicoletta Cibella and Tiziana Tuoto - ISTAT) 13

2.5. Some practical issues for deterministic record linkage (Nicoletta Cibella and Tiziana Tuoto - ISTAT) 14

2.6. How to choose the comparison function (Nicoletta Cibella - ISTAT) 15

2.7. Modelling issues (Marco Fortini and Mauro Scanu - ISTAT) 17

2.8. How to assign pairs (a,b) to the sets M* and U*. Setting the cut-off threshold (Miguel Guigó - INE) 19

2.9. Constraints on the results: how many times a record can be used in a record linkage procedure (Gervasio Fernández - INE, Monica Scannapieco - ISTAT) 21


2.10. How to build the linked data set (Miguel Guigó - INE) 23

2.11. The use of human judgment during the record linkage process (clerical review of record linkage) (Marco Fortini - ISTAT) 25

2.12. Estimation of linkage errors (Tiziana Tuoto - ISTAT) 27

2.13. A particular record linkage problem: Duplicate Record Detection (Gervasio Fernández - INE) 30

2.14. Bibliography 31

3. Recommendations on statistical matching 34

3.1. The practical aspects to be considered for statistical matching (Mauro Scanu - ISTAT) 34

3.2. Identification of the common variables and harmonization issues (Mauro Scanu - ISTAT) 36

3.3. How to choose the matching variables (Marcello D’Orazio - ISTAT) 38

3.4. Definition of a model (Marco Di Zio - ISTAT) 40

3.5. Application of the statistical matching algorithm (Mauro Scanu - ISTAT) 43

3.6. Bibliography 45

4. Recommendations on micro integration processing methodologies 47

4.1. Micro-integration of different sources: Introduction and preliminary issues (Miguel Guigó - INE) 47

4.2. Imputation and construction of consistent estimates (Miguel Guigó - INE) 50

4.3. The Principle of Redundancy (Manuela Lenk - STAT) 54

4.4. Bibliography 55

5. Experiences and cases studies 57

5.1. Practical aspect on the harmonization of the definitions, of the variables and units for the EU-SILC project in Italy (Paolo Consolini - ISTAT) 57

5.2. A report about an employment data base used for the register census: EVA – the Austrian Activity Register (E. M. Reiner / P. Schodl - STAT) 67

5.3. The Principle of Redundancy – Austrian Register Based Census (Manuela Lenk - STAT) 74

5.4. A Case Study: the Italian Post Enumeration Census 2001 (Marco Fortini - ISTAT) 78

5.5. A Case Study: Challenges of the Austrian Register based Census (Manuela Lenk - STAT) 84

5.6. Bibliography 93


Preface

Basic principles for matching procedures (Gervasio Fernández - INE)

When putting into practice the various techniques for record matching, a set of steps must be considered. As when solving any kind of statistical or mathematical problem, it is always possible to follow a basic outline:

1. Analysis of start-up conditions
2. Selection of the technique to be applied
3. Resolution of the concrete problem
4. Application of the results so obtained

The problems related to the integration of different datasets are characterized by the availability of two or more sources at the micro level and the need to obtain a single synthetic source, also at the micro level.

The most appropriate technique to obtain the desired synthetic source is determined by means of a detailed analysis of the original data sources and their conditions.

Roughly speaking, two main kinds of record matching techniques have to be distinguished:

1. Probabilistic record linkage

2. Statistical matching

The first occurs when the synthetic source consists, at the individual level, of the same real units that are present in the different original sources and, therefore, in the target population under study. The solution of this type of problem basically consists of identifying the units of one of the original sources among the other sources, and then merging the corresponding records in order to eventually obtain a synthetic record for the appropriate unit. Unit identification is achieved by means of an analysis of the values held by a set of common variables -in the sense that they exist in every original source- which permits deciding whether or not the corresponding records match.

The second occurs when it is not possible to identify the same actual units in the different original sources and it is necessary to create a synthetic source whose records consist of fictitious -but statistically representative- units of the target population. In this case, matching of records must be set up by means of the analysis of variables that are common to all the original data sources, so that the resulting synthetic record is statistically representative of the target population, according to its supposed statistical distribution.

As can be seen, a set of characteristics is shared by both problems:

a) Working with records related to individuals/units,

b) Working with variables that are common to all the original sources,

and other characteristics, specific to each method:


c.1) Identifying the same unit in each original source

c.2) Identifying records from the original sources that are also representative of the target population.

The first of the common characteristics (a) leads us to analyse which populations are covered by the different sources and then to harmonise them, since the aim is always to work with the same target population.

Likewise, the second of the common characteristics (b) shows that an effort to harmonise the common variables is needed in order to get truly homogeneous variables in the different sources; in particular, definitions, classifications, reference periods, etc. should be harmonised.

For the specific case of probabilistic record linkage (c.1), it is necessary to distinguish which of the common variables are the most appropriate for the identification of the records, as well as to elucidate which class of record matching should be performed. In this case, the comparison function to be used for assigning the pairs of matching records must be defined, and the matching error produced in the process must be estimated.

When the number of records to compare is high, it is advisable to split them into several blocks, according to a set of variables, so that the total number of comparisons is significantly lowered, but without loss of possible matched records.

In the case of statistical matching (c.2), it is necessary to define the statistical model which fits the target population and also to choose the most suitable among the common variables, in order to find the representative records of the target population.

Once the synthetic source has been obtained, we would be able to use it for the planned purposes, though it will probably be necessary to carry out some micro integration tasks beforehand, in order to definitively adjust the results obtained by the matching procedures. Among these possible tasks are those meant for editing and imputation of the data of the synthetic source, before using it for estimation purposes.

For those other cases where the objective of the synthetic source is to achieve a complete and up-to-date database of the target population, some tasks of redundancy analysis are probably advisable.

All of the tasks mentioned above are schematically set out in Table 0.1, where the sections and pages that develop the basic guidelines further are also indicated. Likewise, practical experiences and case studies are shown at the end of the document, in order to facilitate a better understanding of the steps to follow when putting these different matching procedures into practice.


Matching Process

Common harmonization steps (for both approaches):
- Harmonization issues for the unit (sec: 1.1.a, pg: 2)
- Harmonization issues for the variables (sec: 1.1.b, pg: 3)
- Harmonization issues on other operational aspects (sec: 1.1.c, pg: 4)

Probabilistic Record Linkage:
- Practical aspects to be considered (sec: 2.1, pg: 7)
- Matching variables (sec: 2.2, pg: 9)
- Blocking variables (sec: 2.3, pg: 11)
- Matching type (sec: 2.4, pg: 13)
- Deterministic record linkage (sec: 2.5, pg: 14)
- Comparison function (sec: 2.6, pg: 15)
- Modelling issues (sec: 2.7, pg: 17)
- How to assign pairs to the sets M* and U* (sec: 2.8, pg: 19)
- How many times a record can be used (sec: 2.9, pg: 21)
- Linked data set (sec: 2.10, pg: 23)
- Clerical review (sec: 2.11, pg: 25)
- Estimating linkage errors (sec: 2.12, pg: 27)
- Duplicate Record Detection (sec: 2.13, pg: 30)

Statistical Matching:
- Practical aspects to be considered (sec: 3.1, pg: 34)
- Common variables and harmonization issues (sec: 3.2, pg: 36)
- Matching variables (sec: 3.3, pg: 38)
- Definition of a model (sec: 3.4, pg: 40)
- Application of the statistical matching algorithm (sec: 3.5, pg: 43)

Micro-Integration:
- Introduction and preliminary issues (sec: 4.1, pg: 47)
- Imputation and construction of consistent estimates (sec: 4.2, pg: 50)
- The principle of redundancy (sec: 4.3, pg: 54)

Table 0.1: Basic principles in matching procedures.


1. Recommendations on harmonization of definitions, variables, units, populations of different sources

1.1. Some preliminary common aspects for record linkage and matching (Mauro Scanu - ISTAT)

Summary and hints

Harmonization issues for the units: it is necessary to make assumptions on the characteristics of the files to integrate

Harmonization issues for variables: among the common variables of the two files, detect those that cannot be harmonized, those that can be harmonized by means of appropriate categorization, those that need the definition of new variables from existing ones

Other harmonization issues: use of raw or clean data? Both approaches have pros and cons

Although statistical matching and record linkage are two completely different problems, there is one thing that both problems share: when two data files have to be integrated, the data files must be harmonized. Harmonization is a very time consuming task that can be (at least partially) avoided by a good statistical organization: a centralized system of definitions for units, populations, variables and variable classifications. Usually, this is not the case.

When dealing with the harmonization task, these are the issues to consider (van der Laan, 2000):

1) harmonization of the definition of units;

2) harmonization of reference periods;

3) completion of populations;

4) harmonization of variables;

5) harmonization of classifications;

6) adjustment for measurement errors (accuracy);

7) adjustment for missing data;

8) derivation of variables.

As a matter of fact, it is necessary to analyze in detail all the metadata of the data files to integrate. For the sake of simplicity, let us cluster the previous points as follows: harmonization issues on the unit (issues 1, 2 and 3); harmonization issues on the variables (issues 4, 5 and 8); and harmonization issues on other operational aspects (issues 6 and 7). The previous points will be described with reference to both the record linkage and statistical matching practices, underlining the possible differences. In fact, the two problems are quite different:

1) In record linkage, the aim is the identification of those records in two files that refer to the same unit.


2) In statistical matching, the objective is to analyze (at least) two variables that are never jointly observed in a unique data file; this is done by the analysis of two distinct data files, each one containing one of the variables of interest, and some other common variables.

The result is that harmonization is extremely critical for both statistical matching and record linkage as far as the common variables to be used as matching variables are concerned. By contrast, the need to harmonize the population and unit definitions is somewhat less stringent for record linkage (where the objective can simply be the selection of the common units). Harmonization practices can also be quite different with respect to the operational aspects.

1.1.1. Harmonization issues for the unit

Integration of two data files makes sense in the following two situations.

The reference populations for the two data files are the same. In this case, it is possible to apply either record linkage or statistical matching, according to the characteristics of the files to integrate.

The data files refer to two different but partially overlapping populations. This is an extremely important case: in fact, it contains also the frequent case of two temporally lagged data files (this will be treated afterwards). In this case, it is worthwhile to distinguish between statistical matching and record linkage.

1) Record linkage can seek the subset of units in common. It must be analyzed whether this subset corresponds to the target population.

2) Statistical matching cannot be performed automatically. A first approach consists in reducing the two samples A and B to the subsets of units A1 and B1 that refer to the overlapping part of the two populations. In this case, it must also be questioned whether the two subsamples are representative of this subpopulation or, better, of the population of interest. After these checks, statistical matching procedures can be applied on A1 and B1.

An alternative consists in considering the statistical matching problem as a statistical inference problem. In this case, it is not important to reduce the sets A and B to the overlapping parts, but to assume that the two samples are random samples from the same data generating process. In other words, the sets of units that do not link do not modify the statistical distributions of the variables of interest. In this case, the two samples should not be reduced to the subsets of units in common.

The two data files refer to two different (separate) populations. Evidently, record linkage is impossible. Although it is possible to use some additional hypotheses on the homogeneity of the distributions of the variables of interest in the two populations, we do not suggest performing statistical matching in this case.

Temporally lagged data sources and possible hypotheses

A typical case of heterogeneity between two data sources is the time lag between the data files A and B to integrate, say t1 for A and t2 for B, with t1 < t2. The two reference populations are only partially overlapping (the units present in both populations at the two instants). The joint use of the two sources can have the following objectives:


1) to detect the set of common units in the two lagged files, in order to create a panel (i.e. to study the evolution of some variables over time). Record linkage can be applied, without any use of particular assumptions. Statistical matching cannot be applied.

2) to study jointly two variables Y and Z observed distinctly in the two files. In this case, it is necessary to make assumptions for both record linkage and statistical matching. The assumptions depend on the method to be used:

Assumption 1: For record linkage, the subset of units common to A and B (A∩B) should be representative of the overall population of interest (usually the most recent population, at time t2). In other words, all the marginal distributions, as well as the joint distribution of Y and Z, should be very similar.

Assumption 2: For statistical matching (joint analysis of both A and B) the variables of interest (X, Y, Z) should be considered as distributed identically in the two populations A and B.

Assumption 3: For statistical matching, in case A is considered a donor and B a recipient file, it is possible to assume a less stringent hypothesis: the distribution of Y given the common variable X does not change for the two populations at time t1 and t2.
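In compact form, the three assumptions above can be restated as follows (the density notation f, with superscripts for the reference time or file, is ours and not part of the original text):

```latex
% Assumption 1 (record linkage): the linked subset A \cap B is representative of the
% target population at time t_2
f^{\,A \cap B}(X, Y, Z) \;=\; f^{\,t_2}(X, Y, Z)

% Assumption 2 (statistical matching, joint analysis of A and B)
f^{\,t_1}(X, Y, Z) \;=\; f^{\,t_2}(X, Y, Z)

% Assumption 3 (statistical matching, A donor and B recipient): only the conditional
% distribution of Y given the common variables X is assumed stable over time
f^{\,t_1}(Y \mid X) \;=\; f^{\,t_2}(Y \mid X)
```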

The comparison between the two population aggregates for A and B should accurately take into account the definition of the units in the two occasions. The following example shows some of the problems that can occur.

1.1.2. Harmonization issues for variables

This paragraph deals only with the matching variables X. It is necessary that the matching variables in A and B are homogeneous. In general, the set of common variables (from which the matching variables are taken) can be clustered into:

a) common variables that cannot be harmonized (different definitions or incompatible classifications)

b) common variables that can be harmonized by modifying the corresponding categorizations

c) new common variables, obtained by appropriate transformation of the available information

Common variables that cannot be harmonized. As a matter of fact, these variables are not common, given that they cannot be reconciled. Unfortunately, variables that seem to be the same but have irreducible differences are rather frequent. This happens especially when the two files A and B are managed by different institutes.

Common variables that can be harmonized by modifying the corresponding categorizations. Categorical variables admit many different (plausible) classifications. Harmonization of the classifications of a common variable X in A and B (say XA and XB) means the definition of a set of new categories for X. Each new category includes one or more categories of the classifications in A and in B. The number of new categories is necessarily not greater than the number of categories available in A and B. Typical examples of variables that frequently need to be recategorized are: age class, geographical area, educational status, and any continuous variable reported in classes (such as income, expenditures, etc.).

New common variables, obtained by appropriate transformation of the available information. Sometimes it is necessary to create new variables as the common source of information to use for matching A and B. These new variables will play the role of matching variables during the matching process.

1.1.3. Harmonization issues on other operational aspects

Raw or clean data? One of the questions that may arise during a matching process is whether to use clean data (i.e. data sets on which editing and imputation procedures have been applied) or raw data. There are pros and cons in both cases.

Clean data – In this case, some of the items in the matching variables are not genuinely observed values, but imputed or reconstructed ones. Differences among the key variables can be due to the imputation process.

Raw data – These data can be affected by inconsistencies (e.g. a married person with an age of less than 10). Evidently, some of the items are not correct (although it is not clear which value), and this can be the cause of differences between the matching variables in the files to match.

There is not a unique approach in this case. Most of the time, integration is performed once the data sets have been released (i.e. after the data production process is completed). Hence, integration is performed on clean data sets. In this case, it is worthwhile to take a look at the indicators that monitor the data production process. In general, it is advisable to avoid the use of highly imputed variables as matching variables, for both record linkage and statistical matching.

1.1.4. Example

The following example refers to a statistical matching application. In Italy, the joint analysis of household income and expenditures can be obtained by the joint use of two samples, the Household Budget Survey (HBS, managed by Istat) and the Survey on Household Income and Wealth (SHIW, managed by the Bank of Italy).

Harmonization of the units: The two target populations only partially overlap. One of the causes is the different definition of household. According to HBS practices, a household consists of a set of cohabitants linked by marriage, kinship, affinity, adoption, guardianship or affection. According to SHIW practices, a household consists of a set of people who, regardless of kinship ties, completely or partially pool their income. As a result, a household consisting of grandparents living together with a family will be considered a single household according to the HBS definition, while it will consist of two distinct households according to the SHIW definition if the grandparents' budget is managed separately from the family's one.

In order to manage this problem, we adopted an explicit assumption (which is appropriate for household characteristics in Italy). The assumption was that the number of households treated distinctly by the two definitions (i.e. those households that do not belong to the intersection of the two populations) is negligible. Consequently, the probability of inclusion in the samples of these households is so small that it is reasonable to assume that they are not included in the two sample surveys. Hence statistical matching of the two files can be performed: the target population consists of the Italian households except those that would be treated differently by the two definitions.

Common variables that cannot be harmonized: Both surveys included the concept of “head of the household”, and this concept was connected with many variables: “Head of household gender”, “Head of household age”, “Head of household educational status”, “Head of household occupational activity” and so on. These variables are considered important when the objective is to study household income and expenditures jointly (available in SHIW and HBS respectively). The two surveys adopt two different definitions of head of the household. In HBS, the head of the household is defined as reported in the municipal registers (anagrafi comunali). SHIW considers as head of the household the person mainly responsible for the household economy. The two files do not contain information able to reconcile the two definitions. All the variables related to the concept of “Head of the household” had to be disregarded.

Common variables that can be harmonized by modifying the corresponding categorizations: One of the most important variables to consider is the “Main occupational activity”. Sometimes HBS has a more complex classification than SHIW (as for industry and commerce). In other cases, the categories can be harmonized only by fusing two or more categories in each classification, as for the case of “other public and private services”. The overall compatibility relationship between the classifications of the variable “Main occupational activity” in the two surveys is represented in the following table, as well as the final harmonised classification.

Harmonized Category | SHIW | HBS
Agriculture | Agriculture | Agriculture
Industry | Industry | Energy; Industry
Constructions | Constructions | Constructions
Commerce | Commerce | Workshops; Commerce
Transportation | Transportation | Transportation
Banks and Insurance | Banks and Insurance | Banks and Insurance
Public administration and public services | Public administration; Other public or private services | Public administration; Private services

New common variables, obtained by appropriate transformation of the available information: As explained before, when the SHIW and HBS were matched it was not possible to use the head of household characteristics. Anyway, it was important to use household characteristics during the matching process, given that these characteristics usually have a strong statistical relationship with both household income and expenditures. Household profiles were created by means of the following transformed variables: number of household members aged over 64 (categories: 0, 1, 2, 2+), number of employed members (0, 1, 2, 3, 3+), number of members with a university degree (0, 1, 2, 2+) and so on.
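As an illustration of how such household profile variables can be derived from a person-level file, a minimal sketch follows; the column names and the top-coding helper are hypothetical and only reproduce the recoding logic described above.

```python
import pandas as pd

# Hypothetical person-level records: one row per household member.
persons = pd.DataFrame({
    "household_id":      [1, 1, 1, 2, 2, 3],
    "age":               [70, 68, 35, 40, 38, 66],
    "employed":          [0, 0, 1, 1, 1, 0],
    "university_degree": [0, 0, 1, 1, 0, 0],
})

def top_code(counts, top):
    """Collapse counts above `top` into a single open-ended '<top>+' category."""
    return counts.apply(lambda n: str(int(n)) if n <= top else f"{top}+")

# Aggregate to household level, then recode the counts into the categories used for matching.
households = persons.groupby("household_id").agg(
    n_over_64=("age", lambda a: int((a > 64).sum())),
    n_employed=("employed", "sum"),
    n_degree=("university_degree", "sum"),
)
households["n_over_64"]  = top_code(households["n_over_64"], 2)   # 0, 1, 2, 2+
households["n_employed"] = top_code(households["n_employed"], 3)  # 0, 1, 2, 3, 3+
households["n_degree"]   = top_code(households["n_degree"], 2)    # 0, 1, 2, 2+

print(households)
```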


1.2. Bibliography

Coli, A., Tartamella, F., Sacco, G., Faiella, I., Scanu, M., D'Orazio, M., Di Zio, M., Siciliani, I., Colombini, S., Masi, A. (2005). La costruzione di un archivio di microdati sulle famiglie italiane ottenuto integrando l'indagine ISTAT sui consumi delle famiglie italiane e l'indagine Banca d'Italia sui bilanci delle famiglie italiane. Technical report Documenti 12/2006, Istituto Nazionale di Statistica, Roma (in Italian).

Di Zio, M., Scanu, M. (2006). Statistical Matching: Theory and Practice, Wiley, Chapter 7.

Van der Laan, P. (2000). Integrating administrative registers and household surveys. Netherlands Official Statistics, Volume 15, pp. 7-15.


2. Recommendations on probabilistic record linkage

2.1. The practical aspects to be considered for record linkage (Nicoletta Cibella, Mauro Scanu, Tiziana Tuoto - ISTAT)

Record linkage is a complex procedure that can be decomposed into many different phases. Each phase implies a decision by the practitioner, which cannot always be justified by theoretical methods. In the following figure, a workflow of the decisions that a practitioner should take is given. The figure is adapted from a workflow in Gill (2001), p. 33.

The workflow describing the practical actions of a practitioner for applying record linkage procedures shows that the actual record linkage problem (as described in WP1 in Section 1) is tackled only in a few steps (the two nodes following the “probabilistic method” node). As a matter of fact, the steps to be performed are summarized in the following list.

1) At first, the practitioner should decide which variables of interest are available distinctly in the two files. For the purpose of linking the files, the practitioner should understand which of the common variables are able to identify the correct matched pairs. These variables will be used as either matching (Section 2.2) or blocking (Section 2.3) variables.

2) The blocking and matching variables should be appropriately harmonised before applying any record linkage procedure. Harmonization is in terms of variables definition, categorization and so on, as already described in Chapter 1.

3) When the files A and B are too large (as usually happens) it is appropriate to reduce the search space from the Cartesian product of the files A and B to a smaller set of pairs, as described in WP1 (Section 1.6).

4) Usually different techniques are applied sequentially in order to catch all the possible links. Two major record linkage techniques can be used: the deterministic and the probabilistic ones (Section 2.4).

a. For deterministic record linkage, it is necessary to define rules that discriminate pairs between links and non-links (Section 2.5).

b. For probabilistic record linkage, after the selection of a comparison function (Section 2.6) a suitable model should be chosen (Section 2.7). This should be complemented by the selection of an estimation method, and possibly an evaluation of the obtained results. After this step, the application of a decision procedure needs the definition of cut-off thresholds (Section 2.8).

5) There is the possibility of different outputs, logically dependent on the aims of the match. The output can take the form of one-to-one, one-to-many or many-to-many links (Section 2.9). It is appropriate that the final linked data set fulfils some rules (Section 2.10).

6) The output of a record linkage procedure is composed of three sets of pairs: the links, the non links, and the possible links. This last set of pairs should be analyzed by trained clerks (Section 2.11).

7) The final decision that a practitioner should consider consists in deciding how to estimate the linkage errors and how to include this evaluation in the analyses of linkage files (Section 2.12).

Finally, a special record linkage case is also briefly outlined: how to deal with the problem of deduplication of records in a data set (Section 2.13).


2.2. How to choose the matching variables (Nicoletta Cibella, ISTAT)

Summary and hints

It is recommended to choose as matching variables those variables with a high level of identification power and quality (no errors, no missing data)

Generally, these variables are selected by a domain expert

The number of matching variables and some of their characteristics (as the number of categories and their rarity) influence the identification of links

Hint: for choosing matching variables, perform some exploratory analyses on the distribution of the variable, e.g. the number of categories, missing data, etc.

In a record linkage procedure the choice of matching variables is a crucial task and should be suited to the collected data. The common variables recorded in the two files can be used as matching variables so as to determine whether a pair is a link or not. Different aspects should be considered for the selection of the matching variables from the common variables in the two files. They generally refer to the metadata of a variable, from its definition (including the number of categories) to the quality aspects of the variable and its statistical characteristics. The main task is to include those variables with high discriminative power (i.e. capability to detect and distinguish each unit).

For deterministic record linkage, unique identifiers are created combining some of the common variables (linkage keys): pairs that agree on all the keys correspond to matches. Obviously, a very strict check (missing cases, incorrect values, use of initials, miscoded values, variations in spelling, etc.) of the above mentioned variables must be performed so as to prevent errors in the linkage procedure.

Probabilistic record linkage is used because perfect unique identifiers are missing. Comparison of the records can be performed on the basis of the common variables recorded in the files. Usually, the matching variables are chosen by a domain expert, and this phase is not automatic. Anyway, some important suggestions can help the choice of the matching variables.

The first suggestion concerns the number of matching variables to use. The number of matching variables has an effect on the identification of the links; generally, the larger the number of variables, the higher the capability to identify matches correctly. Anyway, a good rule of thumb is to limit the matching variables to those necessary to detect a unit, i.e. enough information to identify a link. Gill (2001) suggests clustering the common variables observed in socio-demographic studies into six groups:

a) the person's proper name, in almost all cases unchangeable during the lifespan (e.g. surname, initials);

b) non name personal characteristics at birth which can rarely change during the lifetime (e.g. sex, date of birth);

c) geographical and socio-demographic items (e.g. address, postcode);


d) information occasionally collected for a specific register (e.g. diagnosis, occupation);

e) variables which can be used for family record linkage (e.g. date of the marriage, birth order);

f) arbitrarily allocated numbers which identify records.

Variables from different groups could be combined or joined into a single one. Gill suggests using a combination of one variable from each of the groups a, b, c and f. Prescriptive rules for the record linkage of data sets on enterprises have not yet been defined.

A second suggestion is to analyze the statistical relationships between the variables to be used as matching variables, as well as the errors reported in the variables. Statistics New Zealand (2006) suggests avoiding the use of highly correlated matching variables, which unnecessarily increase the composite weights without better discriminating between possible links and possible non-links. Furthermore, the joint use of such variables would be useless in the case of items with correlated or functionally dependent errors.

A third suggestion is to evaluate the quality of each variable, and to discard those whose quality is poor. Particular attention should be paid to missingness. Common variables highly affected by missing data should not be used as matching variables.

As already stated, it is preferable to have matching variables that are able to distinguish the units as much as possible (high discriminative power). This can be assessed by analyzing the number of categories of each matching variable. The larger the number of categories of a variable, the higher its discriminative power. Lack of discrimination does not affect only variables with few categories (such as the variable gender). The same happens when a variable has a large number of categories but a few of these categories are much more frequent than others; it would be of little use to select such variables as matching variables. This is the case of the variables name and surname, for which some categories are much more frequent than others (e.g. the surname Smith can be more frequent than some ethnic surnames). When names and surnames (affected by the previous problems) are selected as matching variables, Winkler (1989) and references therein suggest introducing penalties in the comparison vector score in order to reduce the high values of the comparison function for an agreement on a very frequent category.
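The exploratory analyses suggested in the hint above can be sketched as follows; the data and the candidate variables are hypothetical, and the three indicators shown (missing rate, number of categories, share of the most frequent category) simply operationalise the criteria discussed in this section.

```python
import numpy as np
import pandas as pd

# Hypothetical extract of file A with some candidate common variables.
df = pd.DataFrame({
    "surname":       ["Smith", "Smith", "Rossi", None, "Verdi", "Smith"],
    "sex":           ["M", "F", "F", "M", "M", "F"],
    "year_of_birth": [1970, 1981, None, 1992, 1970, 1964],
    "postcode":      ["00100", "20121", "00100", "10121", "30100", "50100"],
})

summary = []
for var in df.columns:
    col = df[var]
    freq = col.value_counts(dropna=True)
    summary.append({
        "variable": var,
        "missing_rate": col.isna().mean(),          # highly missing variables should be discarded
        "n_categories": col.nunique(dropna=True),   # more categories -> higher discriminative power
        "top_category_share": freq.iloc[0] / freq.sum() if len(freq) else np.nan,
    })

# Candidates with low missingness, many categories and no dominant category
# are the better choices as matching variables.
print(pd.DataFrame(summary).sort_values("n_categories", ascending=False))
```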


2.3. How to choose the blocking variables (Nicoletta Cibella, ISTAT)

Summary and hints

Which technique for data reduction is preferable? The case of blocking and sorted neighbourhood

Block size must be small enough to solve computational problems and large enough to avoid excluding true matches from it

Hint: It is suggested to use as blocking variables those variables without missing items

Hint: It is suggested to use as blocking variables those variables with a large number of categories, whose distribution is as much as possible uniform (avoid rare categories)

An automatic selection of blocking variables able to satisfy some optimality criteria is not yet available

In order to reduce the number of pairs when dealing with large datasets, and to improve the performance of the whole record linkage procedure, it is necessary to filter the records, considering only those pairs that agree on a blocking variable. Techniques like blocking, sorting, filtering and clustering can be implemented so as to reduce the search space (AxB or AxA, in the case of de-duplication), removing pairs that are undoubtedly non-matches. There is no general rule for privileging one method over the others. Generally speaking, blocking is considered a good dimension reduction method. It is less preferable when the search space is very large (for computational reasons, the sorted neighbourhood method is preferred) and when some of the categories of the blocking variable have a small frequency in one of the data sets to link (also in this case, the sorted neighbourhood method is preferred, because it allows comparisons over larger sets of records, decreasing the possibility of false unmatches).

As already described in Report WP1 (Section 1.6), blocking consists of partitioning the set of candidate pairs into blocks: only pairs inside each block are analyzed. On these sets of records accurate distance metrics are considered for assigning records to the set of matches (M) or unmatches (U). The nature of the blocking procedure is analogous to indexing; the partition is made through blocking keys: if the keys are equal on two records or if a hash function applied to the keys of the two records gives the same result, the two units belong to the same block.

Choosing day of birth and gender as blocking variables decreases the number of candidate pairs, and hence the number of comparisons, by a factor of about 60 (roughly 31 days times 2 genders gives some 62 blocks); that is, the number of analyzed pairs is about 1/60th of the total number of pairs nAnB. On the contrary, when using gender on its own, the records are split into just two very large subsets.

Blocking variables should be error free, and the number of categories of the blocking variables should be large enough. Preferably, blocking variables should be almost uniformly distributed, so as to obtain exhaustive and mutually exclusive small subsets with approximately the same number of records. Even if there is only a (negligible) risk of errors or missing values in a blocking variable, although for computational purposes the size of each block should be small enough, the block size should at the same time be large enough in order to decrease the risk of excluding true matches from the blocks. Better results in a blocking procedure can be obtained by means of the multi-pass approach: the candidate matches are found via independent runs on different attributes, so as to overcome errors in the data. Variables like the phonetic codes of name or surname, or each field of the date of birth, namely day, month and year, can be considered a good choice for blocking purposes. For a review of blocking schemes see Baxter et al. (2003).
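As a small illustration of the blocking idea described above, the sketch below generates candidate pairs only within blocks defined by a blocking key instead of over the whole Cartesian product AxB; the data frames and blocking variables are invented for the example.

```python
import pandas as pd
from itertools import product

def blocked_pairs(dfA, dfB, blocking_vars):
    """Yield candidate pairs (index in A, index in B) only for records in the same block."""
    groups_b = dfB.groupby(blocking_vars).groups          # blocking key -> row indices in B
    for key, idx_a in dfA.groupby(blocking_vars).groups.items():
        idx_b = groups_b.get(key)
        if idx_b is None:
            continue                                       # no record of B falls in this block
        yield from product(idx_a, idx_b)

# Invented input files, assumed already harmonised as described in Chapter 1.
dfA = pd.DataFrame({"sex": ["M", "F"], "birth_day": [12, 3], "surname": ["Rossi", "Bianchi"]})
dfB = pd.DataFrame({"sex": ["M", "M"], "birth_day": [12, 7], "surname": ["Rosi", "Verdi"]})

pairs = list(blocked_pairs(dfA, dfB, ["sex", "birth_day"]))
print(len(pairs), "candidate pairs instead of", len(dfA) * len(dfB))
```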


2.4. Selection of the matching type (deterministic vs probabilistic) (Nicoletta Cibella and Tiziana Tuoto – Istat)

Summary

Deterministic approach: appropriate in the presence of high quality unique identifiers. It is fast but an ad hoc solution, with no linkage error evaluation; the linkage rate can be very low.

Probabilistic approach: complex but formal; no unique identifiers are needed and variables affected by errors and missing values can also be used. Useful for large datasets. Linkage error evaluation is performed.

The definition of deterministic record linkage is not firmly established. According to Statistics Canada, the deterministic record linkage method identifies links if and only if there is full agreement of unique identifiers or of a set of common identifiers, the key variables. Other authors maintain that in deterministic record linkage a pair is a link also if it satisfies some specific criteria defined a priori; as a matter of fact, not only must the matching variables be chosen and combined, but the quality of the “match” also has to be fixed in order to establish whether a pair should be considered a link or not, so that this kind of linkage is almost-exact but not exact in the strict sense (Gill, 2001). In the deterministic approach, both exact and almost-exact, the uncertainty in the match between two different databases is minimized, but the linkage rate could be very low.

Deterministic record linkage can be adopted, instead of a probabilistic method, in the presence of error-free unique identifiers (such as a social security number or fiscal code) or when key variables with high quality and discriminating power are available and can be combined so as to establish the pairs' link status; in this case the deterministic approach is very fast and effective and, in the presence of variables with rare errors and high discriminating power, its adoption is appropriate. Anyway, the deterministic procedure can reject some true links due to the presence of errors or missing values in the key variables; so the choice between the deterministic and probabilistic methods must take into account “the availability, the stability and the uniqueness of the variables in the files” (Gill, 2001). It is also important to underline that, in a deterministic context, the linkage quality can be assessed only by means of re-linkage procedures or accurate and expensive clerical reviews (Section 2.11). On the contrary, the probabilistic approach is more complex but formal, and can solve problems caused by bad quality data. In particular, it can be helpful when differently spelled, swapped or misreported variables are stored in the two data files. In addition, the probabilistic procedure allows the linkage errors to be evaluated, calculating the likelihood of the correct matches.

Generally, the deterministic and the probabilistic approaches can be combined in a two-step process:

1. firstly, the deterministic method is performed on the high quality variables;

2. then, the probabilistic approach is adopted on the residuals, i.e. the units not linked in the first step.

Anyway, the joint use of the two techniques depends on the aims of the whole linkage project.


2.5. Some practical issues for deterministic record linkage (Nicoletta Cibella and Tiziana Tuoto – Istat)

Summary and hints

Deterministic record linkage should be adopted in presence of high quality variables

The ideal matching variables are: unique, universal, stable, error-free and verifiable

Suggestion: develop a stepwise deterministic strategy

Hint: Assign different importance to variables on the basis of statistical analysis

Differently from the probabilistic approach, in which there is a set of possible links, in the deterministic one a pair of units is always assigned to the matched or unmatched set, on the basis of the chosen rules; i.e. a positive decision is always taken on the linkage status of the candidate pairs.

According to Gill (2001), the ideal variables for the deterministic method are those that are:

1. unique for each unit;

2. available for all the records;

3. stable, everlasting;

4. recorded easily and without errors;

5. simply verifiable.

Although deterministic procedures are undoubtedly implemented frequently by many statistical offices, not much literature is available on the adopted strategies. The procedures are developed with ad hoc strategies and hand-coded rules, strictly dependent on the data and on the knowledge of the practitioners.

Some authors (Gomatam et al, 2002) suggest developing a stepwise deterministic strategy. In order to better evaluate whether a pair is a link or not in a deterministic context, the matching variables can be classified on the basis of their quality, and the records are matched on the basis of the hierarchical subsets of variables so obtained. In particular,

1. firstly, records are compared and matched on the basis of the most reliable variable subsets (e.g. last name, first name, day-month-year of birth)

2. then the remaining pairs are matched on the basis of a less reliable subset (in this case, a subset of the previous set can also be used, such as the combination of last name and day-month of birth).

In addition, the key variables can be ordered so as to weigh differently the agreement (full or partial) on, for example, the variable sex or the date of birth. A good solution for determining the discriminating power of the variables is to perform statistical analyses on external data.
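A minimal sketch of such a stepwise strategy follows, under the assumption that the two files have already been harmonised and share the columns used as keys; the key hierarchy and column names are illustrative only, not taken from the report.

```python
import pandas as pd

def stepwise_deterministic(dfA, dfB, key_sets):
    """Link records by exact agreement on successively less reliable sets of key variables."""
    links = []
    residual_a, residual_b = dfA.copy(), dfB.copy()
    for step, keys in enumerate(key_sets, start=1):
        merged = residual_a.reset_index().merge(
            residual_b.reset_index(), on=keys, suffixes=("_a", "_b"))
        merged["step"] = step
        links.append(merged[["index_a", "index_b", "step"]])
        # Remove the units linked in this step before trying the next (less reliable) key set.
        residual_a = residual_a.drop(merged["index_a"].unique(), errors="ignore")
        residual_b = residual_b.drop(merged["index_b"].unique(), errors="ignore")
    return pd.concat(links, ignore_index=True), residual_a, residual_b

# Illustrative key hierarchy: the most reliable subset of variables first.
key_sets = [
    ["last_name", "first_name", "birth_year"],
    ["last_name", "birth_year"],
]
dfA = pd.DataFrame({"last_name": ["Rossi", "Bianchi"], "first_name": ["Anna", "Luca"],
                    "birth_year": [1970, 1985]})
dfB = pd.DataFrame({"last_name": ["Rossi", "Bianchi"], "first_name": ["Anna", "Luka"],
                    "birth_year": [1970, 1985]})
links, resA, resB = stepwise_deterministic(dfA, dfB, key_sets)
print(links)   # Rossi linked at step 1, Bianchi (misspelled first name) at step 2
```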


2.6. How to choose the comparison function (Nicoletta Cibella, ISTAT)

Summary and hints

Definition of the comparison vector γ on the basis of the comparison function adopted

The function must be appropriate for reporting the characteristics of the selected matching variables.

Hint: The equality function is the most widespread

In order to assign pairs to the sets M and U, a comparison function between records is required. It computes distances between the records of each pair on the matching variables. The result is a comparison vector, γ, for each record pair.

The easiest comparison vector consists of binary elements, each one recording equality on a matching variable (1) or any other kind of difference (0). More generally, the comparison vector can be also composed of categorical or continuous values.

Some of the most common comparison functions are, in hierarchical order (Gu et al. 2003):

1. the equality comparison function, which for every comparison variable returns 1 if there is an agreement and 0 otherwise;

2. the function that evaluates the absolute difference between values in a numeric field; agreement can be decided when the difference is below a fixed threshold;

3. the function that compares each item character-by-character;

4. the Jaro (1989) comparison function and the modification proposed by Porter and Winkler (1997), which enumerate the number of common characters and the number of transpositions of characters (the same character in a different position in the string) between two strings;

5. the edit-distance, that returns the minimum cost in terms of insertions, deletions and substitutions needed to transform a string of one record into the corresponding string of the compared record;

6. the Smith-Waterman comparison function which uses dynamic programming to find the minimum cost to convert one string into the corresponding string of the compared record;

7. the Hamming distance that counts the number of different digits between two strings;

8. the adaptive comparison function in which the parameters of the function are obtained via training samples;

9. the TF-IDF (Token Frequency- Inverse Document Frequency) distance function, which is used to match strings in a document; it assigns high weights to frequent tokens in the document and low weights to tokens that are also frequent in other documents; Cohen points out the fine computational performances of this metric compared with the edit-distance one.


Character-based comparison functions address typographical errors, while token-based ones deal with rearrangements of words. Generally, string comparators, such as numbers 3, 4, 5, 6 and 7 described above, are commonly used because text and string attributes are frequently chosen as matching variables.
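For illustration, the sketch below implements some of the comparison functions listed above (equality, absolute difference with a threshold, Hamming distance and edit distance) and combines them into a binary comparison vector γ for a record pair; the record fields and agreement thresholds are invented for the example.

```python
def equality(a, b):
    """Comparison function 1: 1 if the values agree exactly, 0 otherwise."""
    return int(a == b)

def abs_difference(a, b, threshold):
    """Comparison function 2: agreement if the numeric difference is below a fixed threshold."""
    return int(abs(a - b) <= threshold)

def hamming(a, b):
    """Comparison function 7: number of differing character positions (padded for unequal lengths)."""
    return sum(c1 != c2 for c1, c2 in zip(a, b)) + abs(len(a) - len(b))

def edit_distance(a, b):
    """Comparison function 5: minimum number of insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                 # deletion
                               current[j - 1] + 1,              # insertion
                               previous[j - 1] + (ca != cb)))   # substitution
        previous = current
    return previous[-1]

def comparison_vector(rec_a, rec_b):
    """Binary comparison vector gamma: 1 = agreement, 0 = any other kind of difference."""
    return [
        equality(rec_a["sex"], rec_b["sex"]),
        abs_difference(rec_a["year_of_birth"], rec_b["year_of_birth"], threshold=1),
        int(edit_distance(rec_a["surname"], rec_b["surname"]) <= 1),
    ]

print(comparison_vector(
    {"sex": "F", "year_of_birth": 1971, "surname": "Rossi"},
    {"sex": "F", "year_of_birth": 1972, "surname": "Rosi"}))   # -> [1, 1, 1]
```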


2.7. Modelling issues (Marco Fortini, Mauro Scanu – ISTAT)

Summary and hints

The identifiability problem of the models for record linkage

Some simplified models: the conditional independence assumption

Appropriateness of dependence models for the comparison variables described by loglinear models. Problems in testing the adequacy of these models

Hint: a sufficiently general model for record linkage – the three-way loglinear model with the use of constraints (on the probabilities m and u, on the frequency of the matched pairs in the whole set of record pairs)

During the decision process of a record linkage procedure, the agreement of the key variables is detected on the different record pairs. The number of key variables necessary for detecting the status (match or unmatch) of the pairs is a fundamental issue. Given that the agreements on the key variables (γab_j, j = 1, …, J) are seen as the outcome of a random variable generated by two different random processes (the comparison vectors for the matches, c_ab = 1, and the corresponding comparison vectors for the unmatches, c_ab = 0), the set of observed comparison vectors is defined as a mixture of these two different distributions. The status of the pairs is unknown (an additional latent variable) and it is actually the focus of record linkage.

As a result, it is important to define a model on all these variables: Yj, j = 1,…,J, and C. At first, it is important to note that there is a constraint on the models that one is allowed to assume. The presence of the latent variable C makes some of the models unidentifiable, i.e. for some models there are parameters that cannot be estimated from the data set at hand (the overall set of comparison vectors on the nAnB pairs). For instance, if the variables Yj are dichotomous, the number of key variables should necessarily be at least 3. When J = 3, it is possible to assume only the conditional independence of the Yj, j = 1, 2, 3, given the status of the pair C (the so-called conditional independence assumption). This assumption was studied by Fellegi and Sunter (1969, with an estimator based on the method of moments) and by Jaro (1989, with a maximum likelihood estimator obtained by means of the EM algorithm). When there are more than 3 variables, it is possible to use more complex models. The interdependence relationships among the comparison variables can be modelled in the framework of loglinear models with latent variables, as in Thibaudeau (1993) and Armstrong and Mayda (1993).
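Under the conditional independence assumption, and writing p for the proportion of matched pairs, m_j = P(Yj = 1 | C = 1) and u_j = P(Yj = 1 | C = 0) for the agreement probabilities, the mixture distribution of a comparison vector can be written as follows (a standard restatement of the model just described, not a formula taken from the original report):

```latex
P(Y_1 = y_1, \dots, Y_J = y_J) \;=\;
    p \prod_{j=1}^{J} m_j^{\,y_j} (1 - m_j)^{1 - y_j}
    \;+\; (1 - p) \prod_{j=1}^{J} u_j^{\,y_j} (1 - u_j)^{1 - y_j},
\qquad y_j \in \{0, 1\}.
```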

The model can be even more complicated. This is the case when the latent variable C can assume three states, as in the situation of pairs of records matched hierarchically. Herzog et al (2007) report the case in which some of the matching variables refer to individuals while the others are related to the households. In this framework the best model is represented by a latent variable C assuming the following states: matched pairs, unmatched pairs within the same household, unmatched pairs outside the household.

There is no unique solution for choosing an appropriate model for the estimation of the parameters u, m, p. Winkler (1993) suggests that the usual tests (such as chi-square or likelihood ratio tests) are inefficient, due to the presence of a latent variable. This claim is based on empirical results, and more effort should be devoted to this topic. When the matching variables are many (about 10), Winkler suggests using a “general” model of interdependence between the key variables, such as a loglinear model of order 3. When appropriate constraints on the parameters are available, it is possible to use them together with maximum likelihood approaches (such as the EM algorithm), and the estimators of the parameters u, m, p become more efficient.

In general, the conditional independence model seems to be quite robust against model misspecification for the purpose of identifying the matches. Anyway, the use of more appropriate models can help in estimating more efficiently the frequencies of error induced by the record linkage procedure (such as the false match error).
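As an illustration of the estimation step mentioned above, the following is a minimal sketch of an EM algorithm for the parameters m, u and p under the conditional independence model (maximum likelihood, in the spirit of Jaro, 1989). The input gamma is assumed to be a 0/1 matrix with one row per compared pair and one column per key variable; the starting values and the synthetic data are invented for the example.

```python
import numpy as np

def em_conditional_independence(gamma, n_iter=200):
    """EM for the latent-class mixture model under conditional independence.

    gamma: (n_pairs, J) array of binary comparisons.
    Returns m, u (agreement probabilities for matches/unmatches) and p (match proportion).
    """
    n, J = gamma.shape
    m = np.full(J, 0.9)          # arbitrary starting values
    u = np.full(J, 0.1)
    p = 0.01
    for _ in range(n_iter):
        # E-step: posterior probability that each pair is a match (C = 1).
        lm = p * np.prod(m**gamma * (1 - m)**(1 - gamma), axis=1)
        lu = (1 - p) * np.prod(u**gamma * (1 - u)**(1 - gamma), axis=1)
        g = lm / (lm + lu)
        # M-step: update the parameters with the posterior weights.
        p = g.mean()
        m = (g[:, None] * gamma).sum(axis=0) / g.sum()
        u = ((1 - g)[:, None] * gamma).sum(axis=0) / (1 - g).sum()
    return m, u, p

# Tiny synthetic example with J = 3 dichotomous comparison variables.
rng = np.random.default_rng(0)
true_match = rng.random(5000) < 0.05
gamma = np.where(true_match[:, None],
                 rng.random((5000, 3)) < 0.9,    # matches agree often
                 rng.random((5000, 3)) < 0.1).astype(int)
m, u, p = em_conditional_independence(gamma)
print(np.round(m, 2), np.round(u, 2), round(float(p), 3))
```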

Apart from the previous general indications on how to choose a model for record linkage, there are not many more insights. A different but related topic is the effect of the relationship between a key variable and the status of a pair C (match or unmatch) on the definition of a comparison variable. This approach was adopted in a record linkage study between a traffic accidents archive and an archive on hospitalizations (Tuoto et al, 2006). A key variable is represented by the day of the accident in one data set and the hospitalization day in the other. The comparison variable is defined by the difference between the two dates. The nearer this difference is to zero, the stronger the evidence that the pair is a match. This difference is transformed into a categorical variable by selecting a threshold, with categories given by differences over and under the threshold. The parameter estimation procedure can help in choosing the threshold: a suggestion is to take the most discriminating threshold, i.e. the dichotomous variable for which the probability of an agreement, given that the pair status is matched (the distribution m), is highest.


2.8. How to assign pairs (a,b) to the sets M* and U*. Setting the cut-off threshold (Miguel Guigó - INE)

Summary and hints

Assign links or non-links according to the ratio R(γ)

Relation between cut-off thresholds and linkage errors: what are the acceptable error levels?

Hint: be aware of the risk and importance of a wrong match

Hint: visual analysis of the R(γ) histogram in order to select the thresholds

The setting of a comparison function between records as a measure of agreement or similarity between pairs (a,b) leads to the second step of the process, that is to say, the statement of a decision rule for deciding whether (a,b) belongs to M or U, or whether it is a possible match, so that a later clerical review will be needed. Following Fellegi and Sunter (1969), let R(γ) be the ratio between the score obtained for each pair of records when (a,b) ∈ M and the score when (a,b) ∈ U. This ratio is expected to be large for true matches and small for non-matches, since the agreement pattern built to weigh up the closeness of the two records will reach high values with higher probability for real matches, and such high values will rarely be found among non-matches, given the probabilities

P{γ | (a,b) ∈ M} and P{γ | (a,b) ∈ U},

where γ belongs to Γ, the set of all possible patterns of agreement (Herzog, Scheuren, Winkler, 2007). As previously seen in the WP1 report, in the paragraphs on decision rules and procedures (Section 1.4), depending on the value of R(γ), the decision procedure consists of identifying (a,b) as a link (A1), as a non-link (A3), or as a possible link (A2). That is to say, it consists of locating (a,b) in the set M*, in the set U*, or in an intermediate zone until a non-probabilistic clerical review or an iterative process determines the set to which the pair will be definitively assigned. That decision rule amounts to establishing a lower (L) and an upper (U) bound for the outcome of the ratio R(γ).

R(γ) is often calculated, as for example in the Fellegi-Sunter model, as the log2 of the likelihood ratio of the probabilities shown above. Then, two cut-off points have to be established: values of R(γ) lower than L lead to considering the pair as a non-link, and values higher than U lead to considering the pair as a link. Values between L and U are classified as possible links. It is important to take into account that the selected values of L and U will determine the size of the linkage errors: the higher the value of L, the larger the probability of getting a false non-match; and the lower the value of U, the larger the probability of getting a false match. So, it is obviously not desirable to set both values too far from each other, because of the resulting loss of discriminating power and effectiveness of the test; but, on the other hand, extremely close values of the two thresholds will result in an overall increase of the error rates, with the subsequent risk of wrong linkages.
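
As a minimal illustration of this decision rule (the code and the numerical values are illustrative, not taken from the report), the composite weight and the three-way classification can be sketched as follows:

# m.gamma, u.gamma: estimated probabilities of each observed comparison pattern
# under M and U; L, U: chosen lower and upper cut-off values (L < U)
classify_pairs <- function(m.gamma, u.gamma, L, U) {
  R <- log2(m.gamma / u.gamma)               # composite matching weight
  cut(R, breaks = c(-Inf, L, U, Inf),
      labels = c("non-link", "possible link", "link"))
}
classify_pairs(m.gamma = c(0.70, 0.05, 0.40),
               u.gamma = c(0.001, 0.20, 0.05), L = 0, U = 6)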

More exactly, error rates and threshold values are related as follows: once a pair of values (μ, λ), corresponding to the sizes of the Type I and Type II errors, has been chosen, the bounds U = Tμ and L = Tλ are obtained from the distributions u(γ) and m(γ), which are supposed to be known. This last assumption does not hold in practice, and those distributions have to be estimated. This means dealing with a sample that must be comprehensively reviewed in order to estimate the behaviour of u(γ) and m(γ).


Of course, the type I error (wrong matches) and the type II error (wrong non-matches) cannot be treated as equivalent. It depends on the concrete aim of the linkage procedure but, roughly speaking, a false negative can be considered less hazardous, and therefore more bearable, than a false positive. To meet this criterion, the type I error should be watched more carefully, which means choosing values of U as high as possible, while the value of L merely needs to be reasonably low, since the error associated with this bound is more common and less critical than a wrong match. In other words, it is usually preferable to choose a much smaller target value for the type I error than for the type II error. Gill (2001) deals extensively with the relative costs associated with resolving the two kinds of errors, and the next section gives a more detailed guide to the risks associated with wrong linkages and how to reduce those risks. Chesher and Nesheim (2004) give a complete set of recommended practices to prevent and deal with the impact of both excluding unmatched units and including erroneously matched units in linked datasets. In any case, since these failures of the linking process are unavoidable and not necessarily negligible, their effect should at least be determined and estimated.

Although in general it is possible to conclude that the choice of a given pair of bounds cannot be made on a completely arbitrary basis, it is far from being an automatic task. Statistics New Zealand (2006) proposes an iterative procedure for calculating the appropriate threshold, where the cut-off is initially set at zero and is iteratively changed for a given pass. The weights histogram can be examined to decide the new cut-off value; the ideal situation should produce a bimodal distribution as a result of the mixture of the estimated u(γ) and m(γ). The farther apart from each other the modes are, the better the discrimination between the matched and unmatched records. An additional proposal is to examine a file of linked records, which can be sorted by weight and where the record pairs with high composite weights represent good links; this can be useful for assessing the patterns of field disagreements as the weights decrease and for establishing the most suitable cut-off value for each pass. The estimation of error rates, however, must also include a manual check of a sample of linked records, in order to count the number of false negatives and, especially, false positives.
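
The visual inspection suggested above can be sketched in R as follows (object names are hypothetical: weights holds the composite weights of all candidate pairs, L and U are candidate cut-offs):

hist(weights, breaks = 100,
     main = "Histogram of composite matching weights", xlab = "weight")
abline(v = c(L, U), lty = 2)   # candidate lower and upper cut-off values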


2.9. Constraints on the results: how many times a record can be used in a record linkage procedure (Gervasio Fernández - INE, Monica Scannapieco - ISTAT)

Summary and hints

Definition of the constraint on the number of times a record can be used

In the 1:1 constraint, the problem can be seen as a linear programming problem, to be solved by means of the simplex algorithm

Hint: The software language R contains a package (lpSolve) able to tackle the problem

Other possible solutions for the 1:1 constraint are outlined

2.9.1. Introduction

One of the tasks to be undertaken is to determine which sort of matching will be performed, according to the type of records desired as a result of the process and to the types of matches available for that purpose.

A first type occurs when we are trying to enlarge the number of variables available in a file for a certain set of individuals, using the data provided by an administrative register on those individuals. The type of matching suited to this purpose is a one-to-one matching.

For each record in the first file, corresponding to an individual, the corresponding record in the directory frame should be found.

A second type occurs when we need to match the records of a file associated with some kind of recorded administrative acts with the corresponding records of a file associated with a directory of individuals. We can suppose that several recorded administrative acts may be related to the same individual in the directory.

In this case, the suitable type of matching is a many-to-one matching.

A concrete example could be a study of social benefits over a time period, where socio-demographic characteristics of the direct beneficiaries of the social welfare programme are analysed.

In other situations, the aim will be to match records from a directory of individuals with their corresponding records from a file associated with recorded administrative acts.

In this case, we would be analysing, for each direct beneficiary, some or all of the characteristics related to the social benefits received by that individual.

The suitable type of matching then is a one-to-many matching.
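
In R, these situations correspond to ordinary merges driven by the appropriate file; the sketch below uses hypothetical data frames and a hypothetical key person.id (in practice the key is the output of the record linkage step):

# one-to-one: enlarge a file of individuals with variables from a register
persons.enriched <- merge(persons, register, by = "person.id", all.x = TRUE)
# many-to-one: attach directory information to each recorded administrative act
acts.enriched <- merge(acts, directory, by = "person.id", all.x = TRUE)
# one-to-many: attach all recorded acts to each individual of the directory
directory.acts <- merge(directory, acts, by = "person.id", all.x = TRUE)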

2.9.2. How to fulfil the 1:1 constraint

The execution of a record linkage procedure on two data sets, say A and B, produces an output consisting of a set of matches between records of A and of B. In the general case, each record of A can be matched to nB (with nB ≥ 0) records of B and each record of B can be matched to nA (with nA ≥ 0) records of A.


However, several applications require that the records of the two data sets are matched “uniquely”, that is, each record of A is matched with at most one record of B and vice versa. In other words, there may be the problem of reducing an nA:nB matching to a 1:1 matching, called in the following the “cardinality reduction problem”.

In the case of a probabilistic record linkage procedure, in which for each pair of records belonging to A and B the ratio r is computed, the cardinality reduction problem can be formulated as an optimization problem.

The most direct formulation is as a linear programming (LP) problem in which:

- the function to maximize is a linear combination (e.g. a sum) of the r values associated to the matches

- the constraints are given by the fact that each record of A must be matched with at most one record of B and vice versa

In this case, the simplex algorithm can be used for the solution. The worst case complexity of the simplex method is exponential, though in practice it exhibits polynomial time complexity, hence it is quite efficient.

In order to solve the cardinality reduction problem formulated as an LP problem we have used the R package “lpSolve”. This formulation has proven not very efficient in practice, mainly because of the high dimensionality of the data structures involved. Indeed, even small instances of the problem (e.g. a few hundred records per dataset) caused a memory overflow in a PC environment.
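
The LP formulation itself can be sketched with lpSolve as follows (the weight matrix r and its dimensions are invented for illustration); the dense constraint matrix built here is precisely what makes this formulation memory-hungry for realistic file sizes:

library(lpSolve)

nA <- 4; nB <- 5
set.seed(1)
r <- matrix(runif(nA * nB), nrow = nA, ncol = nB)  # illustrative pairwise r values

obj <- as.vector(r)                                # variables x[a,b], column-major order
# each record of A linked at most once (rows), each record of B at most once (columns)
row.con <- t(sapply(1:nA, function(a) as.numeric(as.vector(row(r) == a))))
col.con <- t(sapply(1:nB, function(b) as.numeric(as.vector(col(r) == b))))
const.mat <- rbind(row.con, col.con)
const.dir <- rep("<=", nA + nB)
const.rhs <- rep(1, nA + nB)

sol <- lp("max", obj, const.mat, const.dir, const.rhs, all.bin = TRUE)
links <- matrix(sol$solution, nrow = nA, ncol = nB)
which(links == 1, arr.ind = TRUE)                  # the retained 1:1 pairs (a, b)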

A further formulation of the cardinality reduction problem relies on combinatorial optimization; specifically, the problem can be formulated as a “maximum bipartite matching”. A “matching” on a graph G is a set of edges of G such that no two of them share a vertex. The result of bipartite matching is a bipartite graph (see footnote 1). By constraining the bipartite matching to be a 1:1 matching, with weights given by the r coefficients, we obtain exactly our cardinality reduction problem.

Bipartite matching can be solved by several algorithms, including the Hungarian algorithm, which exhibits a complexity of O(n³).
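
As an alternative sketch (the clue package is not cited in this report and its use here is only a suggestion), the Hungarian-type algorithm implemented in clue::solve_LSAP solves the linear sum assignment problem directly; note that it assigns every record of the smaller file, so weak assignments may need to be dropped afterwards with a cut-off on r:

library(clue)
# r: nA x nB matrix of non-negative weights, with nA <= nB
assignment <- solve_LSAP(r, maximum = TRUE)
links <- cbind(a = seq_len(nrow(r)), b = as.vector(assignment))
links <- links[r[links] >= cutoff, , drop = FALSE]  # 'cutoff' is a hypothetical threshold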

A possible solution to the cardinality reduction problem, formulated as a maximum bipartite matching, is the R package OPTMATCH.

The OPTMATCH package provides functions for solving the “full match” problem, which permits bipartite matching with no constraints, thus generalizing pair matching (bipartite matching with 1:1 constraints). The implemented algorithm is the relax-iv algorithm of Bertsekas and Tseng (1994), which actually solves a minimum cost network flow problem to which full matching is reduced. The complexity in the size of the input is O(n³). This formulation has also proven very efficient in practice, especially with respect to the size of the data structures needed, when compared to the LP formulation of the cardinality reduction problem.

1 A bipartite graph is a graph whose vertices can be divided into two disjoint sets V1 and V2 such that every edge connects a vertex in V1 and one in V2; that is, there is no edge between two vertices in the same set.


2.10. How to build the linked data set (Miguel Guigó - INE)

Summary and hints

Two main approaches for building the linked data set: use a single entity number for the matched pairs of records; build an index with the access (i.e. item) number

Hint: Include detailed information on the record linkage process in the final linked data set: matching weights, clerical decisions (if any), variables used for linking the pairs of records

Hint: Rearrange all the information obtained by linking the records. This activity allows quality and consistency checks and the selection of updated information

What to do with those records that have not been linked?

The final product of the record linkage procedures should be a single file containing all the available data collected for the same unit, say the same person, usually in a unique record, or at least in different records completely identified and known to belong to the same entity. This can be obtained either from two different files, which may be referred to as the master file and the data file, or from the same master file in the case of a de-duplication task.

At a very first stage, the output of the matching process is a file containing details about each pair of records. From this output, two main ways of building the single linked file are possible, depending on whether a single entity number is assigned to all the matched records, or links are set using an access (i.e. item) number in an index.

Following Gill (2001), the output file generated by the matching system typically contains details of the record from the data file, details of the record from the master file, and details about the match run. Since entity (person) numbers have not been used to compare records and obtain the agreement score, the person and access numbers, from both the data and master files, can be included. The details on the record linkage process should include the sum of weights, the number or list of variables used for linking the pair of records and the clerical matching decision.

Once all of the records belonging to the same entity have been identified, a new and unique entity (person) number is assigned to them, usually the lowest among the old entity numbers. The resulting linked file can then be sorted first on person number, and then on some other included details such as record type or date of the event, in order to rearrange the records belonging to the same entity. Among the advantages of this rearrangement, the most notable is the ability to make additional checks to ensure the consistency of the linked and sorted data. For example, if the records belong to a sequence of events related to the same person, such as health or medical care records, they will start with the birth and end with the death, while the various types of life events must appear in a logical temporal sequence.
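
A minimal sketch of this renumbering (column names are hypothetical: old.person.id, a cluster identifier for records found to belong to the same entity, and event.date):

linked$person.id <- ave(linked$old.person.id, linked$cluster, FUN = min)  # lowest old number
linked <- linked[order(linked$person.id, linked$event.date), ]            # rearrange the records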

Regardless of the specific process used for linking the additional information from the "donor" file to the "recipient" file, and whatever the steps may be, some issues related to the nature of the linked dataset must be pointed out. It cannot be forgotten that the resulting file contains some records affected by measurement errors, since the possibility of making incorrect links is recognised, and a formal procedure to determine its probability is discussed throughout this paper and in the WP1 report, in the sections on record linkage. Following Chesher and Nesheim (2004), a unit with values {x, y, z} may be recorded in the linked dataset with values {x, y*, z} or {x, y, z*}, where y* and z* are values recorded for another unit in the donor dataset. The differences y − y* and z − z* can then be regarded as measurement errors which contaminate the observed values of y and z.

This affects all the purposes for which the combination of information from two samples has been made, be it simply to produce descriptive statistics or for analytical purposes. The error problem related to datasets obtained through record linkage does not fall within the classical literature, which can therefore provide only limited help in analysing the impact of measurement error in non-standard cases, and methods to exploit this information are generally not available. However, it can at least be concluded that, since measurement errors are practically inevitable (they can be avoided only in the case of a non-probabilistic or exact linkage), they have effects on inference and, more concretely, lead to the undesired consequence of biased estimates even in the simplest settings.

In other cases, data in records that are finally not linked are discarded and not used. As in the case above, this also leads to loss of information and consequently to loss of accuracy of the estimates. Furthermore, when the master and data files do not contain the same units, some of them can be erroneously classified as non-responding. General procedures are not available either, and the analysis of the impact must be done on a case-by-case basis. Sometimes the use of data previously regarded as ancillary, or of information from alternative registers concerning those records, can be helpful; see Statistics Finland (2004), pp. 16 ff.


2.11. The use of human judgment during the record linkage process (clerical review of record linkage) (Marco Fortini - ISTAT)

Summary and hints

If no human resources are available for a clerical review of the record linkage output, fix only one threshold. The value of the threshold should take into account the objective of the record linkage procedure.

If a clerical review of the record linkage procedure is available, fix two thresholds. The record pairs whose record linkage weight is between the two thresholds are declared as possible matches, and should be processed by the clerks

Clerks should aim at detecting false matches among the possible matches

The activity of the clerks should not be restricted to the comparison of the key variables adopted during the record linkage process, but should look at any source of information able to solve any uncertainty on the pair.

If the status of a pair is still doubtful, declare it as a non-match.

Clerks should have expertise on the treatment of errors in the data. Training on the job with subject matter experts and record linkage experts is advisable

In case of 1:1 constraints, it is advisable to perform the clerical review in advance, unless budget constraints restrict the clerical review activity

The record linkage methodology explained in Section 1 of WP1 gives as an output record pairs that are declared as 1) matches, 2) non-matches, 3) possible matches. These sets are found by means of two thresholds properly defined by the researchers in order to manage the amount of errors in the record linkage process, given the expected quality and under constrained resources.

The resources available during the record linkage process play a fundamental role. If no human resources are available, it is possible to perform only an unsupervised record linkage procedure. This corresponds to avoiding the creation of the set of possible matches by defining only one threshold (i.e. the two thresholds coincide).

Decisions on how to assess a value for the threshold depend on the objective, as in the following two examples.

1. The objective is the analysis of variables distinctly observed in the two data files to be matched. In this case, it is better to have a sufficiently large threshold, so that false matches are not included in the linked data set and there is no risk of underestimating the statistical relationship between the variables (under the hypothesis of no selection bias due to the quality of the record linkage process).

2. The objective is the creation of a survey frame. In this case, it could be desirable to include as many correct matches as possible in order to avoid undercoverage. Hence, it is better to set the threshold lower, although a larger false match rate can cause the inclusion of more non-eligible units into the frame. This error can easily be corrected during the survey field activities, at some cost (e.g. the enlargement of the sample size in order to deal with non-eligible sampled units).

If human resources are available, it is possible to perform the so-called clerical review of the linked records, by also defining the set of possible matches (two distinct thresholds). This set will be checked by the clerks. This activity consists in checking whether a possible match is a false match: in fact, it is easier for the human eye to detect false matches than to search for true matches. The declared matches will be included in the set of matches.

Clerks should not restrict their activity to checking the key variables for the possible match pairs, but should also scrutinise all the other available information on the pair, such as the values of the key variables before any parsing procedure and the relationships between records in the case of clustered data (e.g. people in a household, or local units in a firm). Sometimes it is necessary to contact the unit directly in order to solve any uncertainty. Pairs which are still in doubt after this scrutiny should not be declared as matches.

In order to select clerks, it is advisable to include people with some expertise in the subject matter area, with a particular focus on the treatment of errors. The detection of errors in the matching variables is in fact the main task during the clerical review: for example, clerks should be able to detect errors such as swaps between name and surname, between month, day and year in a birth date, or among letters/numbers in strings. It is also appropriate not to restrict the training to theoretical aspects or to the use of the available software tools, but to train the clerks on the job with subject matter experts and record linkage experts.

In case of a 1:1 constraint (see Section 2.9), it might be questioned whether the clerical review should be performed before or after any procedure able to enforce the constraint (such as the application of the transportation algorithm). If the clerical review is done before the fulfilment of the 1:1 constraint, the procedure is more conservative (i.e. fewer true matches are likely to be lost) but it is more expensive (more time is needed because more candidate pairs are available). Hence, if enough human resources are available, it is advisable to perform the clerical review first and the fulfilment of the 1:1 match constraint afterwards.

Clerical review can also be used as a tool for assessing the quality of the overall record linkage procedure. In this case, clerks take into consideration a subset of the pairs declared as matches. The frequency of false matches in this set will be an estimate of the false match rate in the linked data set (see also Section 2.12).


2.12. Estimation of linkage errors (Tiziana Tuoto - ISTAT)

Summary and hints

Definition of two types of error: False match rate and False non-match rate

According to theoretical arguments, error levels are fixed once the thresholds for the selection of linked and non-linked sets are chosen

What is the most relevant error type in your application? Different record linkage aims imply different importance for the matching types (matches, unmatches, probable matches): some examples.

Hints: activities to carry out during the matching phases in order to avoid or reduce errors

Difficulties in applying automatic procedures to evaluate the matching errors

Re-link of particular sub-sets to evaluate the matching errors

2.12.1. Types of Error

As stated in the previous report on WP1 (Section 1), the accuracy of the whole record linkage process is generally measured by the false match and false non-match rates. The false match rate is defined as the number of incorrectly linked record pairs divided by the total number of linked record pairs; it plays the role of the Type I error in a one-tail hypothesis test. On the other side, the false non-match rate is defined as the number of incorrectly unlinked record pairs divided by the total number of true match record pairs; it plays the role of the Type II error in a one-tail hypothesis test.
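
In terms of counts, the two rates can be written as in the sketch below (the logical vectors are hypothetical and refer to a set of candidate pairs whose true status is known, e.g. from a clerical review):

# declared: TRUE if the pair was linked by the procedure; true.match: TRUE if it is a true match
false.match.rate     <- sum(declared & !true.match) / sum(declared)    # wrong links / all links
false.non.match.rate <- sum(!declared & true.match) / sum(true.match)  # missed matches / true matches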

According to the Fellegi-Sunter theory, these error rates are fixed when choosing the thresholds for the selection of the linked and non-linked sets (M and U); generally there is a trade-off between the two types of error, since reducing the rate of false positives may increase the rate of false negatives. In fact, false positives and false negatives are very sensitive to the thresholds: thresholds that are too low give a high false positive rate and a low false negative rate, while thresholds that are too high give a low false positive rate but a high, possibly unacceptable, false negative rate. Thus it is important to consider the consequences of each type of error and to determine whether one is more critical than the other, according to the scope of the linkage problem.

Generally speaking, false negatives, or missed matches, are the more common error, and are possibly due to erroneous or missing variables on one or both records. Matches may also be missed if the linkage procedure includes a blocking strategy and two records referring to the same entity fall into different blocks. This may happen, for example, when a surname is misspelled and the phonetic compression algorithm assigns the records to two different blocks, or when a person has a number of forenames and chooses to use different forenames on different occasions; consequently, the records are not matched together.

False positives, or wrongly matched records, are a less common source of error; however, they are potentially more serious. This type of error arises when two records belonging to two different people have identifying variables that are almost identical, for example when the two people have very common surnames and forenames, or are same-sex twins. Generally, in most linkage studies, matching is performed in a conservative manner with regard to the links that should be accepted, so unless other information for matching a pair is available, the records are left as non-matches.

2.12.2. The effect of errors and practical hints to avoid or reduce errors

Many matching applications are concerned with improving coverage in censuses and surveys. In these cases a false negative is an extremely serious error, because each non-match is added to a list as a new entity. For instance, the estimation of the census coverage rate using a capture-recapture model requires matching the Census list and the coverage survey records under the assumption of no errors in the matching operations, so the accuracy of the matching process is of crucial importance, because even very small matching errors could compromise the reliability of the coverage rate estimates.

In other contexts, different from coverage surveys, the major risk in using badly matched data is that a data record will be linked with records belonging to different entities, with the consequent risk of seriously biasing the results by retaining a large number of potentially erroneous links.

Matching errors mainly occur when information reported in the frames is incorrect, or when correct information is reported, but it is not correctly used in matching. The adoption of some precautions during the whole linkage procedure can be useful in order to avoid or reduce errors in linkage results.

When data are reported with error, it is a common practice to combine the two record matching approaches: a general strategy consists in treating the easiest to identify matches by means of the more straightforward computational procedures, i.e. exact methods, and in introducing probabilistic record linkage techniques when dealing with the hardest matches during subsequent phases. In fact, errors in the data will cause exact matching to fail so it would be more appropriate to use probabilistic matching to resolve these cases.

As far as false negatives coming from the chosen blocking variable are concerned, these errors can be avoided by scheduling the linkage procedure with blocking strategies that use a number of different keys and by re-matching at each stage the record pairs that failed to be linked at the previous stage. For example, when a person changes surname, for instance due to a change in marital status, and is not linked using these keys (surname and marital status), the link could be made at some later step using date of birth, place of birth or address. As each matching pass proceeds, the number of non-linked pairs should diminish.

In order to avoid false matches due to the presence of very common surnames or addresses, the frequency of those occurrences could be taken into account when calculating the matching weights, introducing some penalties.

Some practitioners also suggest carrying out several test runs with different values of the matching threshold criteria, using a random sample of the data records. This will involve clerical checking of the matches made, but the time spent in this activity will be offset by the better matching and the lower future clerical requirement.

2.12.3. Estimating errors

Whatever error rate levels are fixed at the decision model stage, at the end of the linkage procedure it is good practice to undertake an assessment of the size of each of those errors and to make the results available, as the analyses of the integrated dataset should take into account the possible impacts of the linkage errors. This operation is generally quite difficult and requires the estimation of unknown quantities, that is, the number of pairs linked correctly (true positives), the number of pairs linked incorrectly (false positives), the number of pairs unlinked incorrectly (false negatives) and the total number of true match pairs.

Generally it is not easy to find automatic procedures for estimating these two types of error so as to evaluate the quality of the record linkage procedure. A widespread approach to evaluating false positive rates is to manually check samples of linked records, or to carry out re-linkage procedures that are assumed error-free because they are performed with more accurate techniques; the bias is then evaluated by means of the differences between the match and the “perfect match” results.
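
For instance, the sample-based evaluation of the false match rate can be sketched as follows (the counts are invented):

n.reviewed <- 500                          # linked pairs drawn at random and clerically checked
n.false    <- 12                           # pairs found to be false matches in the review
n.false / n.reviewed                       # estimated false match rate
binom.test(n.false, n.reviewed)$conf.int   # a simple binomial confidence interval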

Often the clerical review of the sample is done by visually comparing the records, and while this method is able to draw upon subject-matter knowledge and other information, it still involves the subjective view of the reviewer. If it is understood where errors are most likely to occur in the datasets, it may be necessary to target the sample to these areas with a view to improving the quality of the match. Several iterations of clerical review and adjustment of match criteria may be necessary before a linked dataset is confirmed and final false positive error rates calculated.

In large datasets, this method of analysing the false positive rates can be very time-consuming, and it is often useful to group the linked data prior to selecting the samples.

To overcome time and budget constraints related to the manual check of the records in the sample, sometimes it is possible to refer to automatic methods for evaluating the false match rate. In this context, the most popular is the method proposed by Belin and Rubin (1995). They suggest a mixture model for estimating false match rates for given threshold values that works well when there is good separation between the matching weights associated with matches and non-matches. The critical point for the application of this model is that it also requires the existence of previously collected and accurate training data.


2.13. A particular record linkage problem: Duplicate Record Detection (Gervasio Fernández - INE)

Summary and hints

Harmonisation is not needed for this problem

Problem: to solve incompatibilities between two records which are found to refer to the same unit

A particular case of record linkage can be found when we try to find different entries that refer to the same entity in the same file.

We will exclude the obvious case of an entire replica of the register.

Harmonisation of variables is not needed to solve a Duplicate Record Detection problem, since "both" sources are the same source; the need of normalization, however, is not excluded.

Once duplicate records have been detected, the problem of generating the "correct" record arises. In case of discrepancies in rough classification/identification variables, it will be necessary to establish whether the discrepancy is due to variable codification or to a misprint.

The setting of the "true" value among different possible values when handling analysis variables is a more problematic issue.


2.14. Bibliography

Armstrong, J. and Mayda, J.E., 1993. Model-based estimation of record linkage error rates. Survey Methodology, Volume 19, pp. 137-147.

Baxter, R., Christen, P. and Churches, T., 2003. A comparison of fast blocking methods for record linkage. Proceedings of the 9th ACM SIGKDD Workshop on Data Cleaning, Record Linkage and Object Consolidation.

Belin, T.R. and Rubin, D.B., 1995. A method for calibrating false-match rates in record linkage. Journal of the American Statistical Association, 90, 694-707.

Bertsekas, D. P. and Tseng, P., 1994. Relax-iv: a Faster Version of the relax 23 Code for Solving Minimum Cost Flow Problems, Tech. rep., M.I.T., report P-2276, mit.edu/dimitrib/www/noc.htm.

Chesher, A. and Nesheim, L., 2004. Review of the Literature on the Statistical Properties of Linked Datasets. Report to the Department of Trade and Industry, United Kingdom.

Cohen, W.W., Ravikumar, P. and Fienberg, S.E., 2003. A comparison of string distance metrics for name-matching tasks. American Association for Artificial Intelligence (AAAI).

Ding Y. and Fienberg S.E., 1994. Dual system estimation of Census undercount in the presence of matching error, Survey Methodology, 20, 149-158.

Fellegi, I. P., and A. B. Sunter, 1969. A theory for record linkage. Journal of the American Statistical Association, Volume 64, pp. 1183-1210.

Gill L., 2001. Methods for automatic record matching and linkage and their use in national statistics, National Statistics Methodological Series No. 25, London (HMSO)

Gomatam, S., Carter, R., Ariet, M. and Mitchell, G., 2002. An empirical comparison of record linkage procedures. Statistics in Medicine, Volume 21, Wiley.

Gu L., Baxter R., Vickers D., and Rainsford C., 2003. Record linkage: Current practice and future directions. Technical Report 03/83, CSIRO Mathematical and Information Sciences, Canberra, Australia, April 2003.

Herzog, T.N., Scheuren, F.J. and Winkler, W.E., 2007. Data Quality and Record Linkage Techniques. Springer Science+Business Media, New York.

Hogan H. and Wolter K., 1998. Measuring accuracy in a post-enumeration survey. Survey Methodology, 14, 99-116.

Jaro, M.A., 1989. Advances in record linkage methodology as applied to matching the 1985 Census of Tampa, Florida. Journal of the American Statistical Association, 84(406), 414-420.


Lahiri P. and Larsen M.D., 2000. Model based analysis of records linked using mixture models, Proceedings of the Section on Survey Research Methods Section, American Statistical Association, pp. 11-19.

National Statistics, 2004. National Statistics code of practice – protocol on Data Matching. Office for National Statistics, London.

National Statistics, 2004. National Statistics code of practice – protocol on Statistical Integration. Office for National Statistics, London.

Neter J., Maynes S. and Ramathan R., 1965. The effect of mismatching on the measurement of response error. Journal of the American Statistical Association, 60, 1005-1027.

Porter E.H. and Winkler W.E., 1997. Approximate string comparison and its effect on an advanced record linkage system. Proc. of an International Workshop and Exposition - Record Linkage Techniques, Arlington, VA, USA.

Ruddock V., 1999. Measuring and Improving Data Quality. Government Statistical Service Methodology Series No. 14. Office for National Statistics, London.

Scheuren F. and Winkler W.E., 1993. Regression analysis of data files that are computer matched, Survey Methodology, 19, 39-58.

Scheuren F. and Winkler W.E., 1997. Regression analysis of data files that are computer matched- part II, Survey Methodology, 23, 157-165.

Statistics Finland, 2004. Use of Registers and Administrative Data Sources for Statistical Purposes. Best Practices of Statistics Finland. Tilastokeskus, Helsinki.

Statistics New Zealand, 2006. Data integration manual; Statistics New Zealand publication, Wellington, August 2006.

Tuoto, T., Farchi, S., Fortini, M., Greco, V. and Chini, F., 2006. Probabilistic Record Linkage for the Integrated Surveillance of the Road Traffic Accident. Proceedings of the XLIII Italian Statistical Society Conference, Torino, 14-16 June 2006.

Thibaudeau, Y., 1993. The discrimination power of dependency structures in record linkage. Survey Methodology, Volume 19, pp. 31-38.

Winkler W.E., 1989. Frequency-based matching in the Fellegi-Sunter model of record linkage, Proceedings of the Section on Survey Research Methods, American Statistical Association, 778-783.

Winkler, W.E., 1993. Improved decision rules in the Fellegi-Sunter model of record linkage. Proceedings of the Survey Research Methods Section, American Statistical Association, 274-279.

Winkler, W.E., 1995. Matching and record linkage. Business Survey Methods, Cox, Binder, Chinappa, Christianson, Colledge, Kott (eds.). John Wiley & Sons, New York.


Winkler W. E., 2000. Frequency based matching in the Fellegi-Sunter model of record linkage (long version). Statistical Research Report Series no. RR 2000/06, U.S. Bureau of the Census. At http://www.census.gov/srd/papers/pdf/rr2000-06.pdf (last view, 21/04/2008).


3. Recommendations on statistical matching

3.1. The practical aspects to be considered for statistical matching

(Mauro Scanu - ISTAT)

Statistical matching is essentially a procedure whose objective is the representation of joint information on variables that are never jointly observed. To reach this goal, the core of the statistical matching procedure is the estimation of the joint distribution of the variables that are not jointly observed. This estimation can be either explicit (in most cases this corresponds to estimating contingency tables, or some key parameters of the statistical relationship between the variables, such as correlation coefficients) or implicit (as in the case of obtaining a complete synthetic file by means of imputation procedures). Nevertheless, this objective is quite specific, and the steps to be performed for reaching the goal are quite simplified compared to record linkage.


The previous figure represents the steps that need to be performed for solving a statistical matching problem.

1) A key role is played by the choice of the target variables, i.e. the variables observed distinctly in the two sample surveys. The objective of the study will be to obtain joint information on these variables. This task is important because it influences all the subsequent steps. In particular, the matching variables (i.e. the variables used for linking the two sample surveys) will be chosen according to their capacity to preserve the direct relationship between the target variables.

2) The second step is the identification of all the common variables in the two sources (potentially, all these variables can be used as matching variables). Not all of them can actually be used, for different reasons, such as a lack of harmonization between the variables. To this purpose, some steps need to be performed, such as the harmonization of their definitions and classifications (already described in Section 1) and the selection of only accurate variables whose statistical content is homogeneous (described in Section 3.2).

3) Once the common variables have been cleaned of those that cannot be harmonized, it is necessary to choose only those that are able to predict the target variables. To this purpose, it is possible to apply statistical methods whose aim is to discover the relationships between variables, such as statistical tests or appropriate models (Section 3.3).

4) As already introduced at the beginning, the statistical matching problem can be solved in different ways (illustrated in Section 3.4):

a. By a micro objective or a macro objective

b. By the use of specific models (as the conditional independence assumption), the use of auxiliary information, or the study of uncertainty

c. By parametric, nonparametric or mixed procedures.

5) Once a decision has been taken, the procedure is applied to the available data sets. Section 3.5 gives some practical hints on the procedures to apply before using the methodologies described in Section 2.3 of WP1.

6) As already described in Section 2.5 of WP1, quality evaluations of the results are the final step to perform.

The following paragraphs detail the practical aspects of the different steps of a statistical matching application.


3.2. Identification of the common variables and harmonization issues (Mauro Scanu - ISTAT)

Summary and hints

For the problem of missingness of the common variables, choose one of these two strategies

1. use only completely observed common variables

2. use also partially observed common variables; in this case, split the data sets in as many sub-data sets as the missing data patterns of the common variables

For the evaluation of the statistical content of the common variables, check if the estimated distribution of each common variable is almost the same in the two sample surveys. Formal statistical tests can be too restrictive. Empirical approaches are usually used

As usual in the joint analysis of two different files, a key role is played by the variables observed in both data sources. The matching variables will be selected from among these variables. Before this selection, it is necessary to modify or, possibly, discard those variables that present problems. The common problems are:

1. problems in the definitions and the corresponding classifications
2. problems in the accuracy
3. problems in the statistical content

The first problem has already been described in Section 1. This paragraph tackles how to discard some of the common variables from the statistical matching procedure through the assessment of their accuracy and the comparison of their statistical content.

3.2.1. Accuracy evaluations

The variables in common in the two sample surveys A and B should be of good quality. This aspect is crucial, given that the matching variables will be selected from this set. The statistical relationship between the matching variables and the distinct variables (Y in A and Z in B) is the main source of information for achieving the objective of statistical matching, i.e. the estimation of the joint distribution of Y and Z or of some of its parameters. For this reason, it is crucial to restrict the set of common variables to those that satisfy some important criteria. The first aspect to consider is missingness, which can be distinguished in its two forms: total non-response and partial non-response.

Total non response is an aspect that affects the number of units to take into consideration in the data sets. This aspect will be treated in Section 3.5. For the moment, drop those units in the two surveys that are affected by total non response.

Partial non-response, on the contrary, heavily affects the accuracy of the variable under study. Generally speaking, it must be expected that the common variables of the two surveys are “structural” variables of the units under study (e.g. gender or age for persons, number of employees or NACE classification for enterprises). Missingness is usually a rare event for these variables. Anyway, how should a practitioner treat the problem of partial non-response?


1. Use only completely observed variables. This criterion allows statistical matching to be performed in just one step. Anyway, it can happen that only a few common variables satisfy this criterion, and that their ability to predict the distinct variables (Y in A and Z in B) is low

2. Use also variables affected by missing items. In this case, partition A and B into as many data sets as the missing data patterns (see the sketch below). In fact, in each subsample there is a list of completely observed common variables from which the matching variables can be selected. This situation is much simpler when the objective is micro (each subsample can be reconstructed separately by using different matching strategies, i.e. different sets of matching variables). On the contrary, it is more difficult if the objective is macro.
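
The partition by missing-data pattern mentioned in option 2 can be sketched as follows (file A and the variable names are hypothetical):

common.vars <- c("sex", "age.class", "region", "education")
pattern <- apply(is.na(A[, common.vars]), 1,
                 function(x) paste(as.integer(x), collapse = ""))  # e.g. "0010"
A.split <- split(A, pattern)   # one sub-data set per missing-data pattern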

3.2.2. A comparison of the statistical content of the common variables

The common variables should be homogeneous also in their statistical content. In other words, the two samples A and B should estimate (almost) the same distribution for each common variable. The reason is simple: the two sample surveys should represent the same population.

If all the common variables are characterized by estimated distributions with many differences (e.g. different shapes, different modes, large differences in some frequencies), there is the possibility that the target populations of the two surveys are different, or that the time lag between the two surveys is too large. In this case, it is better to avoid any statistical matching between the two sample surveys.

In some cases, some common variables may be characterized by differences in the estimated distributions which are due only to sample variability. In this case, statistical matching is still possible.

In order to understand which common variables are better discarded, a formal approach would consist in using appropriate statistical tests. Given that the common variables are usually categorical, chi-square tests can be used; for continuous common variables, the Kolmogorov–Smirnov test can be applied. Although these approaches are formal, we suggest avoiding statistical tests in the official statistics context for two reasons. The first is that, in official statistics, sample surveys are drawn according to complex survey designs; although adaptations of statistical tests exist for such designs, they are not always easy to apply. The second reason is that samples drawn by national statistical institutes are usually large. This characteristic allows very precise estimates to be obtained; however, statistical tests then tend to flag even small differences in the distributions as significant, which would prevent the use of almost any common variable as an appropriate matching variable.

For this reason, we usually follow an empirical approach.

1. Look at the estimated frequency distribution of each variable in the two surveys (possibly drawing a histogram)

2. For the categories characterized by large frequencies, check whether these frequencies differ by less than 5% in the two surveys; for small frequencies, check whether the differences are smaller than 2%. In addition to these bounds, check that the shape of the histogram is almost the same in the two samples (same modes); a sketch of this check is given after the list

3. If the previous constraints are not satisfied, discard the common variable from the analysis.
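
A sketch of this empirical check for one common variable follows (data frames A and B and the variable name are hypothetical, the variable is assumed to be coded with the same categories in both files, and survey weights are ignored for simplicity):

fA <- prop.table(table(A$age.class))              # estimated distribution in survey A
fB <- prop.table(table(B$age.class))              # estimated distribution in survey B
round(100 * (as.vector(fA) - as.vector(fB)), 1)   # differences per category, in percentage points
barplot(rbind(fA, fB), beside = TRUE)             # visual comparison of the two shapes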


3.3. How to Choose the Matching Variables (Marcello D’Orazio - ISTAT)

Summary and hints

Avoid the use of many matching variables: the statistical procedures would become unnecessarily complex. Use only those matching variables with a high association with the target variables.

In order to select the matching variables, an easy (but not optimal) procedure is to compute all the bivariate measures of association between the target variables (Y and Z) and each common variable, respectively on A and B. Select those common variables with the highest association

Alternative solutions are:

1. the construction of a dendrogram from hierarchical cluster analysis

2. testing the significant parameters of a model where a target variable is function of the common variables (as the regression model), respectively in A (target variable Y) and B (target variable Z)

3. in a similar way, the previous procedure can be considered in a nonparametric setting by means of classification trees (CART)

4. the best approach could be the estimation of a probabilistic expert system (Bayesian network)

In statistical matching (SM) applications the two independent sample surveys, A and B, may share many common variables. The question is whether all the common variables should be used in the matching process, or just some of them. Obviously, the fewer the common variables used (matching variables), the easier the matching procedures. For instance, in a parametric approach, reducing the common variables may correspond to a simplification of the model and a reduction of the number of parameters to estimate. Moreover, in a multivariate framework, inferential procedures are both more efficient and more interpretable when the number of variables decreases. At the same time, fewer matching variables may reduce the computational effort.

A first, rather natural, approach to identify the set of matching variables, X, consists in disregarding all those variables which are not statistically connected with Y or Z (Singh et al., 1988; Cohen, 1991).

In order to find the variables associated/correlated with Y or Z, different procedures can be applied. The simplest one consists in computing all the bivariate measures of association between each common variable and each Y variable in A; the same is done for the common variables and each Z variable in B. D’Orazio et al. (2006) provide a summary table with some association measures that can be computed according to the measurement scale of the variables.
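
A possible sketch of this simplest procedure (object and variable names are hypothetical; Cramér's V is used here for categorical variables, computed in file A for a target variable Y):

cramers.v <- function(x, y) {
  tab  <- table(x, y)
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  as.numeric(sqrt(chi2 / (sum(tab) * (min(dim(tab)) - 1))))
}
common.vars <- c("sex", "age.class", "region", "education")   # candidate matching variables
sort(sapply(common.vars, function(v) cramers.v(A[[v]], A$Y)), decreasing = TRUE)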

This simple procedure is straightforward but may be inefficient, because it may lead to choosing two common variables that are highly correlated with each other and therefore show the same level of association with a variable of interest Y or Z. For this reason it would be preferable to analyse also the association/correlation among all the potential common variables. A useful tool may be that of examining the dendrogram resulting from a hierarchical cluster analysis carried out using a matrix of similarities of each variable against all the other ones (see Coli et al., 2005; Harrel, 2001).

A step further is to consider dependence models of Y (and, respectively, Z) on the common variables. In linear regression models, Y (and then Z) can be regressed on the common variables, and an automatic variable selection procedure (forward selection, stepwise selection, least angle regression, …) can be used to determine the set of matching variables X. In the presence of a continuous dependent variable and mixed-type predictors, generalized linear models (McCullagh and Nelder, 1989) can be applied.

If a non-linear relationship is supposed to exist between a univariate Y and X, techniques based on Classification Trees or Regression Trees (CART; Breiman et al., 1984) may represent a good choice. These methods are based on binary recursive partitioning of the units into groups that are homogeneous with respect to the values of the response variable. The procedure ends with the growing of a tree; predictors that appear in the higher part of the tree are usually those with the higher explanatory power. An advantage of CART techniques is that they can deal with both categorical and continuous response variables and, at the same time, mixed-mode predictors are allowed. Moreover, no assumptions are required on the underlying distribution of the predictors. Unfortunately, in the presence of mixed-type predictors, CART tends to favour the continuous predictors over the categorical ones for the starting nodes. Similarly, when only categorical predictors are considered, CART tends to privilege those with a high number of categories. Moreover, CART results may suffer from collinearity among the predictors. For this reason, it may be better to shift to the Random Forest technique (Breiman, 2001), which consists in growing many regression/classification trees. The Random Forest output provides a score to determine the importance of each predictor in classification/regression. Moreover, it can handle a large number of predictors even in the presence of few observations.

In the presence of a set of continuous response variables Y and a set of continuous common variables, a canonical correlation analysis can be performed. In this case, it is possible to restrict X to those variables with the highest weights in the canonical correlations.

Bayesian networks (Cowell et al., 1999) are a class of probability models defined in graphical terms: the probabilistic relationships between the variables (independence, conditional independence) can be read directly from the graphical structure. This structure can be useful for determining the most important common variables for Y and Z respectively. First, consider the set of common variables and Y in file A, and estimate the Bayesian network structure on these variables. Split the set of common variables into two sets: X1A, the variables having a direct relationship with Y, and X2A, the variables that are independent or conditionally independent of Y given some of the variables in X1A.

The same approach is taken for the set B, obtaining as a result the set of variables X1B directly dependent on Z, and X2B, the variables independent or conditionally independent of Z given X1B. The matching variables can be given by the union of X1A and X1B. If these variables are too many, this step can be considered as a first preliminary selection among the matching variables.
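A sketch of the splitting step only, assuming the network structure has already been estimated with some structure-learning tool; the edge list below is purely hypothetical.

# Hypothetical sketch: split the common variables into X1A (adjacent to Y) and X2A,
# given an already-estimated graphical structure on file A.
import networkx as nx

edges_a = [("education", "income"), ("age", "income"), ("region", "household_size")]  # made-up learned edges
g_a = nx.Graph(edges_a)                                  # undirected view of the learned structure

common = {"age", "gender", "education", "region", "household_size"}
x1a = common & set(g_a.neighbors("income")) if "income" in g_a else set()
x2a = common - x1a                                       # independent or conditionally independent of Y
print("X1A (directly related to Y):", x1a)
print("X2A:", x2a)
# Repeating the same on file B for Z gives X1B; the matching variables are X1A united with X1B.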


3.4. Definition of a model (Marco Di Zio - ISTAT)

Summary and hints

A model describing the relationship between the variables of interest cannot be estimated and tested in the statistical matching problem. The solutions are:

1. a priori validation: rely on an easily estimable model, as for instance the conditional independence assumption

2. a posteriori validation: identify features of the phenomenon whose expected behaviour is approximately known, and compare the results obtained by the matching process with them

3. sensitivity analysis: assess the validity of the conditional independence assumption by means of the results of the uncertainty analysis

The practitioner should also shape the statistical matching problem with the following decisions:

1. choose between a micro and macro objective; this choice will influence the method to use

2. choose between a parametric and a nonparametric approach. For continuous variables the parametric approach has been defined only for normal variables. If the variables are not normal, a first possibility is to adopt some transformation (such as the Box-Cox transformation); otherwise, categorize the continuous variables so that statistical matching procedures for multinomial variables can be used.

3.4.1 Under what model can I work?

There is a general question underlying statistical data analysis: which model (either explicit or implicit) can be used to describe and analyze the data. Any model is based (again, explicitly or implicitly) on assumptions. In statistical matching, the model should describe the joint relationships among the variables (X,Y,Z). An important part of statistics is devoted to verifying the fit of the assumed model to the data at hand. In the particular area of statistical matching, the researcher is in an unfavourable situation, since the variables (X,Y,Z) are not jointly observed in the data, and thus it is not possible to test the fit of the model to the data in order to validate the hypotheses on the relationship between X, Y and Z. In practice, the researcher has different possibilities:

1) (a priori validation) the researcher must validate (or discard) the assumptions by means of his/her a priori knowledge on the phenomenon;

2) (a posteriori validation) discard some assumptions by evaluating the acceptability of the results (on some specific indicators) that would be obtained according to the hypotheses;

3) (sensitivity analysis) the researcher does not choose a single hypothesis, but chooses to analyze the outcomes related to different hypotheses.

For instance, in the first case, if only the two data sets A and B are used (where the variables (X,Y) and (X,Z) are observed respectively), it is unavoidable to resort to the conditional independence assumption. Hence the researcher must believe that Y and Z are independent given X; roughly speaking, this means that X explains the relationship between the variables Y and Z. Let us suppose that Y represents household income, Z represents household expenditures, and X represents gender. The association of gender separately with income and expenditure can be high, but it is generally difficult to assume the conditional independence of expenditure and income given gender,

P( expenditure | income, gender) = P( expenditure | gender)

As a matter of fact, it is hard to assume that gender explains the association between income and expenditure. This kind of "model validation" is based only on a speculative approach to the problem. There can be situations where proxy variables for (X,Y,Z) are jointly observed in other surveys. In this case, it is useful to test the associations of the proxy variables in order to get a hint of the existing relationships among the variables (X,Y,Z). This external information should be used by means of the models exploiting auxiliary information described in the WP1 document (Section 2.3).

In the a posteriori validation, the statistical matching procedure is performed according to a model assumption, and then some specific indicators of the variables of interest are analyzed in order to understand whether they violate some expected behaviour. For instance, an important indicator related to the analysis of income and expenditure is the propensity to consume, that is, the share of income used for expenditure. It is generally expected that the propensity to consume is a decreasing function of income. If, after matching, this indicator does not show this behaviour, the hypotheses assumed for the model are probably not appropriate to describe the phenomenon (D'Orazio et al., 2006, page 180).
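A small sketch of such an a posteriori check on a hypothetical matched file: the average propensity to consume is computed by income decile and its monotonicity is verified.

# Hypothetical a posteriori check: propensity to consume should decrease across income deciles.
import pandas as pd

matched = pd.read_csv("matched_file.csv")                # hypothetical output of the matching step
matched["propensity"] = matched["expenditure"] / matched["income"]
matched["income_decile"] = pd.qcut(matched["income"], 10, labels=False)

profile = matched.groupby("income_decile")["propensity"].mean()
print(profile)
if not profile.is_monotonic_decreasing:
    print("Warning: propensity to consume is not decreasing in income;"
          " the model assumptions may be inappropriate.")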

The sensitivity analysis approach is interesting in that it allows one to draw different scenarios. There are some aspects that must be taken into account when different models are proposed, since not all models can be assumed: some of them are incompatible with the data at hand. For instance, let us suppose that we have observed (X,Y) in data set A and (X,Z) in data set B. Let us assume that the data are normally distributed (at least after some transformations), and that the estimated correlation coefficients from data sets A and B are ρ_YX and ρ_ZX. The only parameter that cannot be estimated from the data is ρ_YZ, and thus the sensitivity analysis consists of drawing the different scenarios that arise from the different values assigned to ρ_YZ (different assumptions). Nevertheless, not all the values in the interval [-1,1] can be assigned; otherwise the correlation matrix may fail to be positive semidefinite, which is a necessary condition for a correlation matrix. In order to avoid this problem, ρ_YZ can assume values only in the interval bounded by

ρ_YX ρ_ZX ± √[(1 − ρ_YX²)(1 − ρ_ZX²)].
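The admissible interval, and the corresponding positive semidefiniteness of the correlation matrix, can be checked directly; the correlation values below are made up for illustration.

# Illustrative computation of the admissible range for rho_YZ, with made-up rho_YX and rho_ZX.
import numpy as np

rho_yx, rho_zx = 0.6, 0.4
half_width = np.sqrt((1 - rho_yx**2) * (1 - rho_zx**2))
low, high = rho_yx * rho_zx - half_width, rho_yx * rho_zx + half_width
print(f"rho_YZ must lie in [{low:.3f}, {high:.3f}]")

# Any value inside the interval keeps the 3x3 correlation matrix positive semidefinite;
# the midpoint rho_yx * rho_zx is the value implied by the conditional independence assumption.
rho_yz = rho_yx * rho_zx
R = np.array([[1, rho_yx, rho_yz],
              [rho_yx, 1, rho_zx],
              [rho_yz, rho_zx, 1]])
print("eigenvalues:", np.linalg.eigvalsh(R))             # all non-negative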

Since the process of choosing a model for statistical matching cannot rely directly on the available data to be matched, it is extremely important to describe and state clearly the assumptions at the basis of the procedure used to match the data. Note that the sensitivity analysis approach is strictly connected with the analysis of uncertainty described in WP1 (Section 2.4). The difference is that the uncertainty analysis is focused on the estimation of the set of parameters of interest compatible with the available marginal information on (X,Y) and (X,Z); this set of parameters is the objective of the estimation procedure. On the contrary, the sensitivity analysis approach has the objective of estimating the parameter of interest pointwise. The reliability of this estimate is assessed against the set of parameters given by the uncertainty analysis, without assuming any particular model. The intuitive idea is that when the uncertainty is low, which means that not too "many" models are compatible with the observed data (and among them there is the conditional independence model), the results obtained via the CIA are not too misleading. More details are in D'Orazio et al. (2006) and Rässler (2002).

3.4.2 Which approach should I use? Parametric or nonparametric? Micro or macro?

At the beginning of the statistical matching procedure, the practitioner should be able to shape the framework of the statistical matching problem by means of some assumptions. These assumptions will influence the results that can be obtained, and concern the following two aspects:

1. whether the objective of the matching is micro or macro;

2. the choice of a parametric or a nonparametric model.

The objective of the procedure (micro/macro) is generally clear, and it has an important role in the choice of the techniques to be used. Up to now, the macro objective has generally been dealt with by means of parametric models, while nonparametric approaches have generally been used for micro objectives. As a matter of fact, the reverse can also be adopted. A rich discussion of this topic is in D'Orazio et al. (2006).

A final decision concerns the nature of the technique to be used, that is, the choice of either a parametric or a nonparametric model. As previously stated, a first suggestion for the choice of the model is motivated by the micro/macro objective of the study. Other practical considerations may help the researcher. One is the nature of the variables. For instance, as far as continuous variables are concerned, although it is not really a constraint, up to now only Gaussian models have been introduced, and this limits the use of purely parametric models to the case of linear relationships among variables. If the variables at hand cannot be considered Gaussian, possible approaches are:

1) normalize the variables by means of appropriate functions (such as the Box-Cox transformations; Box and Cox, 1964);

2) categorize the continuous variables and use the statistical matching methods for multinomial distributions (a short illustrative sketch of both options follows).
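A brief illustrative sketch of the two options, with hypothetical data (the Box-Cox transformation requires strictly positive values):

# Hypothetical sketch: normalize a skewed positive variable or categorize it into classes.
import pandas as pd
from scipy import stats

file_a = pd.read_csv("file_A.csv")                       # hypothetical file
income = file_a["income"]

# Option 1: Box-Cox normalization (strictly positive values only)
transformed, lam = stats.boxcox(income[income > 0])
print(f"Estimated Box-Cox lambda: {lam:.2f}")

# Option 2: categorize into quintiles (income classes) for multinomial matching methods
file_a["income_class"] = pd.qcut(income, 5, labels=False)
print(file_a["income_class"].value_counts().sort_index())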

As a matter of fact, approaches dealing with other parametric families of distributions which are frequently used in social or economic analyses (such as the exponential or the F distributions) still need to be defined (in particular as far as the assessment of uncertainty is concerned).

All these considerations in fact concern the a priori analysis; the previously described a posteriori validation and sensitivity analysis steps should also be performed in order to validate the assumptions.


3.5. Application of the statistical matching algorithm (Mauro Scanu - ISTAT)

Summary and hints

Description of the effect of total non response on the samples to match.

Check the presence of units which are collected in both the samples and consider them as an additional source of information

How to deal with survey weights. There are two strategies: 1) harmonize the survey weights by means of calibration procedures; 2) create a unique sample by means of the union of the two samples, and modify the survey weights by means of the concatenated file procedure.

WP1 (Section 2.3) has already described the different algorithms to apply for statistical matching, according to the available information. In this section we give some practical hints concerning the data sets and the possible sample surveys (if the data sets are drawn according to complex survey designs).

3.5.1 The effects of total non response on the data sets

The two sample surveys to be matched can be in two different situations. The first (and most common) case happens when the two sample surveys are disjoint and do not share any common unit, as in the next figure (in the "Sample" columns, 1 corresponds to selected and 0 to unselected; in the "Respondent" columns, 1 corresponds to respondent and 0 to missing, while NA applies to unselected units; rows with a Respondent value of 0 correspond to total non-response in A or B respectively):

[Schematic of the first situation: units selected in A have Sample A = 1, Sample B = 0 and Respondent B = NA, with Respondent A equal to 1 (respondent) or 0 (total non-response); units selected in B have Sample A = 0, Respondent A = NA and Sample B = 1, with Respondent B equal to 1 or 0. The variables observed are (YA, XA) for the respondents in A and (XB, ZB) for the respondents in B; the column Respondent A ∪ B equals 1 for the respondents of either sample and 0 otherwise.]


In this case:

1. drop the units that are affected by total non-response;

2. adjust the original survey weights of the respondents in order to take into account the non-respondents.

The next figure shows the most complex situation for the statistical matching of two sample surveys A and B.

[Schematic of the second situation: in addition to units selected only in A or only in B (as in the previous figure), a subset of units has Sample A = 1 and Sample B = 1; for these units both (YA, XA) and (XB, ZB) can be observed, provided the unit responds in the corresponding survey, and total non-response can occur in either survey.]

This situation applies when the two sample surveys are partially overlapping. This is a very useful situation, because the intersection of the two files can give joint information on the variables of interest (this is termed "auxiliary information" in WP1, Section 2.3.2). The final data sets A, B and C will be given by the following sets:

1. drop the non-respondents, so that:

A consists of all the respondents in A;
B consists of all the respondents in B;
C consists of the common respondents in A and B (in the figure, this set is given by those records whose Respondent columns are all equal to 1);

2. adjust the original survey weights in A and B in order to take into account the non-respondents.

3.5.2 Further survey weights adjustments

In order to allow appropriate estimates when the samples are drawn according to complex survey designs, it is necessary to standardize the survey weights of the sample surveys. This standardization can be performed in two different ways. The first one is always feasible and is particularly appropriate for the macro objective, although it can also be used for the micro objective of statistical matching. The second method is particularly appealing for the micro objective, although it needs information that is seldom available.


Standardization by calibration

Renssen (1998) suggests harmonizing the survey weights of the two sample surveys A and B, and of the possible additional complete file C, by means of repeated calibrations of the survey weights. This approach can be summarized in the following way:

a) calibrate the original survey weights in A and B to known population totals of the common variables. Estimate the totals of the remaining common variables respectively in A and B and obtain a unique estimate combining the two estimates

b) calibrate the weights computed in step a) in A and B to the known and estimated totals of all the common variables.

If the two sample surveys A and B do not intersect, and a third sample C is not available, use the weights obtained in step b) in order to estimate the parameters of interest under the conditional independence assumption. If there is a third sample C

c) Renssen suggests estimating some tables of interactions between the common variables and the distinct variables, respectively Y in A and Z in B, and using these estimates as constraints for the calibration of the weights in C (a schematic numerical sketch of a single calibration step is given below).
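As a purely illustrative sketch (not Renssen's full procedure), a single linear calibration step on hypothetical data shows how design weights can be adjusted to reproduce known population totals; the repeated calibration above would iterate steps of this kind on A, B and C.

# Hypothetical sketch of one linear (GREG-type) calibration step: adjust design weights d
# so that the weighted totals of the auxiliary variables X reproduce known population totals.
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.integers(18, 80, n)])   # intercept + age (made-up data)
d = np.full(n, 50.0)                                          # starting design weights
totals = np.array([10_000.0, 480_000.0])                      # known population size and total age

# Linear calibration: w_i = d_i * (1 + x_i' lambda), with lambda solving the calibration equations
lam = np.linalg.solve((X * d[:, None]).T @ X, totals - d @ X)
w = d * (1 + X @ lam)

print("calibrated totals:", w @ X)                            # reproduces the known totals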

Standardization by file concatenation

File concatenation aims at obtaining a unique sample survey from the two distinct samples A and B. In order to do this, Rubin (1986) suggests modifying the survey weights according to the following steps:

a) for each sampled unit a ∈ A, compute the first-order inclusion probability that the unit would have had if it had been sampled under the sample design for B (π_a^B, a = 1, ..., nA);

b) for each sampled unit b ∈ B, compute the first-order inclusion probability that the unit would have had if it had been sampled under the sample design for A (π_b^A, b = 1, ..., nB);

c) assuming that the probability of selecting a unit in both samples is negligible, the probability of inclusion of a unit in the union of the two samples becomes

π_i = π_i^A + π_i^B, i = 1, ..., nA for the units in A and i = 1, ..., nB for the units in B.

The probability of inclusion π_i is the one to consider in the concatenated sample given by the union of A and B. If the inclusion of a unit in both samples is not negligible, the new probability of inclusion should be π_i = π_i^A + π_i^B − π_i^AB, where π_i^AB is the probability that the unit is included in both samples (a small numerical sketch follows).
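A minimal numerical sketch of this weight construction, with made-up inclusion probabilities:

# Hypothetical sketch of file concatenation weights: each unit gets the probability of
# entering the union of the two samples, and the concatenated weight is its inverse.
import numpy as np

pi_a_under_a = np.array([0.01, 0.02, 0.05])    # A-units: first-order inclusion prob. under design A
pi_a_under_b = np.array([0.002, 0.004, 0.01])  # same units, computed under design B
pi_b_under_b = np.array([0.03, 0.01])          # B-units under design B
pi_b_under_a = np.array([0.006, 0.002])        # same units, computed under design A

# Negligible probability of selection in both samples: pi_i = pi_i^A + pi_i^B
pi_union = np.concatenate([pi_a_under_a + pi_a_under_b,
                           pi_b_under_b + pi_b_under_a])
weights = 1.0 / pi_union                       # weights for the concatenated sample A union B
print(weights)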

Although this procedure is extremely appealing, and allows the computation of very efficient estimators for the common variables (the estimator is computed on a very large sample), it is difficult to apply in practice because

1) usually the design variables of the survey design A are not available in B and vice versa

2) if there is total non-response, it is difficult to modify the probabilities of inclusion in order to take this kind of missingness into account;

3) usually the probability of inclusion in both the samples is extremely difficult to compute.

3.6. Bibliography


Box, George E. P.; Cox, D. R. (1964). "An analysis of transformations". Journal of the Royal Statistical Society, Series B 26: 211-246.

Breiman, L., 2001, "Random Forests", Machine Learning, 45(1), 5-32.

Breiman L., Friedman J. H., Olshen R. A., and Stone, C. J., 1984, Classification and Regression Trees. Wadsworth.

Cohen, M.L., 1991, "Statistical matching and microsimulation models", in Improving Information for Social Policy Decisions: The Uses of Microsimulation Modeling, vol. II. National Academy Press.

Coli A., Tartamella F., Sacco G., Faiella I., Scanu M., D’Orazio M., Di Zio M., Siciliani I., Colombini S., and Masi A., 2005, “La costruzione di un archivio di microdati sulle famiglie italiane ottenuto integrando l’indagine ISTAT sui consumi delle famiglie italiane e l’indagine Banca d’Italia sui bilanci delle famiglie italiane” [in Italian]. Technical Report, Istituto Nazionale di Statistica, Roma.

Cowell, R.G., Dawid, A.P., Lauritzen, S., Spiegelhalter, D.J., 1999, Probabilistic Networks and Expert Systems. Springer-Verlag, New York.

D’Orazio M., Di Zio M., Scanu M., 2006, Statistical Matching, Theory and Practice. Wiley, Chichester.

Harrell, F.E., 2001, Regression Modeling Strategies, with Applications to Linear Models, Logistic Regression and Survival Analysis. Springer-Verlag, New York.

McCullagh P. and Nelder, J. A., 1989, Generalized Linear Models. London, Chapman and Hall.

Rässler, S., 2002. Statistical matching: a frequentist theory, practical applications, and alternative Bayesian approaches. Springer-Verlag, New York.

Renssen, R.H., 1998. Use of statistical matching techniques in calibration estimation. Survey Methodology, Volume 24(2), pp. 171-183.

Rubin, D.B., 1986. Statistical matching using file concatenation with adjusted weights and multiple imputations. Journal of Business and Economic Statistics, Volume 4, pp. 87-94.


4. Recommendations on micro integration processing methodologies

4.1. Micro-integration of different sources: Introduction and preliminary issues (Miguel Guigó - INE)

4.1.1. Introduction

Nowadays, NSIs have to face growing demands from policy makers and other users of official statistics, not only for a set of isolated indicators and time series, but for a comprehensive frame of reference for those variables, that is to say, a coherent set of relations among them. Although this frame exists and is widely accepted for economic statistics, above all due to the existence of national accounts and all of their theoretical background, it rarely exists for social statistics. A method to ensure consistent data, regardless of the existence of such a background, is to collect the data on the different variables from the same statistical unit, so as to obtain coherent data at the micro level.

Of course, data belonging to the same unit can be gathered through large sample surveys, but this usually causes well-known problems, such as: an increase in the response burden, at a time when people are less and less inclined to participate in surveys; higher costs for statistical offices already affected by budgetary restrictions; and even biased estimates as a consequence of non-response. For example, in the case of the Netherlands (Al and Bakker, 2000) the response rate to the Labour Force Survey (LFS) dropped from 90% in 1977 to 60% in 1995 (see footnote 2), and most of the non-response consists of refusals and is concentrated among selective population groups, a fact that raises awareness of a potential bias. Different approaches have been proposed in order to solve this problem, mainly through calibration estimators (see for example Lundström and Särndal, 1999) or generalized regression methods (see for example Särndal, 1980), but in any case all of these techniques demand the use of some kind of auxiliary information, be it more complete data from the same source or data collected somewhere else.

The use of multiple sources to gather different information for the same individuals can then be seen as an alternative to the enlargement of a sample survey, and more specifically the use of variables from multiple registers to be linked or matched to records with no existing information. In a broader sense, the micro-integration process has to do with all kinds of methods for compiling statistics on units (such as persons, households or businesses) by matching, editing, imputing and weighting data from the combined set of administrative registers and sample surveys. In the context of record linkage, statistical matching and all the other techniques for record matching discussed in this report, micro-integration tasks are intended to improve the outcome of those techniques, via the use of variables that have not played the role of unit identifiers during the matching process but can affect its results. As a consequence, several outputs of the process could be modified, from the publishable aggregates to the actual need for clerical review of the linked datasets.

However, this does not seem to be a method without difficulties in practice. First of all, the availability of these sources varies significantly from one country to another, depending on the particular legislation on data privacy, so it is hard to give common criteria for all the

2 Statistics Netherlands has faced the problem by means of better training of interviewers and more re-visits in case of non-response, as well as by means of weighting methods as described in the next section.


statistical offices in the ESS. As the previous report on WP1 underlines (Section 5.3), most of the NSIs are affected by the existence of a legal foundation that regulates the supply and usage of administrative data, although none of the problems identified is common to a majority of the data integration projects carried out by the statistical offices. These problems usually consist of no permission to use the unit identifiers for integration purposes, unavailability of data sets at the micro level, or simply a prohibition on linking some groups of administrative data.

Even leaving this aside, and supposing that any kind of source is completely available, be it a household sample survey or an administrative register, a second question is to identify and choose the best source for each part of the information required. Some variables are redundant because they are included in more than one source, and some information may be contradictory: it is often possible to compare the same characteristics of an achieved sample with the corresponding administrative data or with another survey, given that the target information and population overlap among many of them. Ruddock (1999) gives the example of the Family Expenditure Survey (FES) and the Labour Force Survey (LFS) in the UK, since both can provide data on household income, although the FES is designed to collect information on this subject matter whereas the main purpose of the LFS is to collect information on income from employment. As Ruddock suggests, the discovery of differences is not only bad news, because it is a starting point to investigate sources of non-sampling error or improve the quality of statistics; but from the point of view of integration it leads to deciding which source is "better", or to using some of the techniques described above in Chapter 1 of this report.

Once the most appropriate data have been selected, it must be taken into account that they can still contain inconsistencies at a macro level. Al and Bakker (2000) give as an example the Dutch Censuses of 1981 and 1991, when some register data were only available in an aggregate form (as was the case for social security benefits) and the information could not always be correctly integrated with that from household sample surveys. For details on the 2001 Dutch Census see Schulte Nordholt et al. (2004).

4.1.2. Preliminary issues

The preliminary step when integrating information at the micro level is to match or link records from different sources, which also includes a check to find records with no existing information (incompleteness) or records that are duplicated (deduplication). This can also be considered a form of micro-integration (Al and Bakker, 2000). Then, the statistical variables have to be obtained from the characteristics in the merged records, although this does not necessarily mean "directly" obtained: the data from the linked sets are sometimes used to replace data that could be collected by a survey, but sometimes they are used as ancillary variables, in order to get improved estimators, classification variables, etc. Thus, record linkage or statistical matching methods produce a new "recipient" dataset where some of the characteristics from the "donor" file are not added, and even fewer are used in the production of statistics.

Furthermore, the micro-data bases so obtained cannot be directly used as output databases, since the output information is obviously not disseminated at that level of disaggregation (of course, the first but not the only reason is that it would not meet any privacy requirement); also, following Van der Laan (2000), some imputations might become nonsense at the micro level; and, finally, different sources might investigate different units, or the reporting units in the source data may simply differ from the units of interest in the target population (Statistics New Zealand, 2006, p. 23). Therefore the data ought to be aggregated to a suitable higher level, and an additional database containing all publishable aggregates is needed.

Herrador and Porras (2007) provide an example of the integration of administrative sources where matched records are used not to obtain statistical variables directly, but as auxiliary information to improve sample designs. The Spanish Statistical Office (INE) receives from the Tax Agency (AEAT) aggregated tax information for Census sections, which are the primary sampling units in household surveys, obtained by matching the information held in the Population Register (PR) with the data from the AEAT files through the ID Card number of the persons registered in the PR. This provides a stratification variable with which to classify Census sections according to the level and structure of income. Van der Laan (2000) provides an example of the second issue, namely StatBase, which contains all statistical data whose publication Statistics Netherlands considers meaningful and reliable.

The points of view described above mainly focus on data integration, in the sense that they pay special attention to the variables contained in different sources and their availability. Other approaches, such as Denk and Hackl (2003), place less emphasis on the properties of the variables than on the discrepancies and similarities of the data sources as a whole. From this other point of view, data source integration is the precondition for applying an integration procedure, which is then called dataset integration. According to this, several types of heterogeneity can be distinguished:

- technological differences, related to IT: hardware, operating system, database management.

- structural diversity, related to the way in which entities are represented: format, measurement units, density of possible values or attributes, codes, classifications and ranges.

- differences in the semantics of the data, related to meanings, concepts and definitions.

Whatever the classification criteria used, a preliminary analysis of these discrepancies prevents a variety of problems that may arise when the matching operations, and then the integration procedure, are actually performed. This issue links with some of the aspects of harmonization already considered in the paragraphs on record linkage and statistical matching. In this sense, together with an optimized internal organization of the statistical office, the observance of external standards is also desirable. Properly introducing and maintaining them in the registers will save time. In the case of metadata, this allows comparisons with similar fields in other datasets, easier detection of inconsistencies, and identification of the fields that could be useful for linking the files. The integration of metadata into information systems (templates, etc.) affects the internal organization, while the adoption of interchange formats that are widely accepted in the ESS (such as SDMX-SODI, Euro-SDMX, etc.) leads to meeting international standards.


4.2. Imputation and construction of consistent estimates (Miguel Guigó - INE)

Both the need for conceptual and numerical consistency among different statistics and the growing availability of data from diverse sources make it advisable to undertake some practical steps from the collection of the data to the final estimates. As stated in the previous section, one of the new tasks for the statistical offices is to obtain variables that are embedded in a well-constructed frame of statistics, no matter whether they are economic or social variables. If statistics can be seen as a product, this leads to modifying the way they are obtained, moving from a model with different and isolated production processes to a single and integrated production process with a range of outputs.

This aim immediately involves the various types of datasets, files and databases that are to be used: those serving as "recipient" or "donor" files, and the databases containing aggregates and publishable data.

Kroese, Renssen and Trijssenaar (2000) propose an example of an integrated dataset model, based on the experience of Statistics Netherlands. The procedure relies on the construction of a first input-database which includes the variables that are matched from all of the different sources, be they administrative registers, sample surveys, or electronic data interchange (EDI).

However, the information provided by the different sources refers to observation units which are not necessarily equal to statistical units. All input data must be adapted and turned into data on those statistical or target units. Once this input-database has been obtained, a second step is to construct several micro-databases, one for each kind of statistical unit, be it persons, households, businesses or any other. A single record in each micro-database then represents one statistical unit, with different fields corresponding to different variables. Imputation and editing should be done at this step within each micro-database. It must be taken into account, though, that only some of the variables are known for all the units in the population, while for others only a subset of units holds actually observed values in the corresponding records.

Imputing and editing the missing values allows the empty cells to be filled. The next step is then estimating totals for the whole population. The estimated totals will depend on all of the values of the variable, whether they are observed or imputed. The resulting figures should be placed in an aggregate-database. Finally, in order to deal with other issues related to dissemination, such as disclosure control, multidimensional tables, or arrangement according to a variety of themes, a user-oriented database can also be built (see footnote 3).

4.2.1. Estimation

The core of the estimation problem lies in the conditions to be satisfied by the aggregates that will be placed in the aggregate-database, and in the methods that must be used to fill the empty cells of the micro-database, that is, the missing data. According to Kroese, Renssen and Trijssenaar (2000), these conditions should be as follows:

3 Statistics Netherlands names the input-database, the aggregate-database and the user-oriented-database, respectively, as 'BaseLine', 'StatBase' and 'StatLine'. StatLine can nowadays be freely accessed from the Statistics Netherlands’ website; BaseLine and StatBase have remained conceptual terms.


- Reliability, in the sense that the estimates must be obtained by means of an approximately design-unbiased method or by a plausible model-based method (be it generalized regression, calibration, etc).

- Consistency, in the sense that no contradictions may result when comparing estimates obtained in the aggregate-database.

In all cases, variances are supposed to be reasonably small.

Several alternative estimation strategies can be used to obtain the frequencies and cross tabulations to be published. Once the administrative registers and sample surveys have been linked, a micro-database as described above is organised as in the following example.

In this example the statistical units are persons, and each unit owns a record which holds the variables in columns, so that the data in the same row belong to the same unit. A cell therefore holds the value of a specific variable for a specific person. Usually, data on name, sex or age, an ID number, and possibly an address, are available for the whole population, since they have been obtained from a population register or integrated from Censuses. Other information concerning salary or earnings can be integrated from a Tax Register or from Social Security registers, but may not be available for all of the records. Finally, several sample surveys can add information on other variables such as unemployment, health or education, where the different surveys can in some cases give information on different variables for the same individuals.

4.2.2. Mass imputation

The first estimation alternative consists of filling the whole micro-database by imputing all of the missing values that appear as empty cells. Observed values and imputed values are then used together to construct the corresponding estimates. The population tables and results to be held in the aggregate-database are calculated by simply adding the observed and the imputed values. As an example, a table with estimated average earnings for each population group can be obtained by adding the values in the earnings column for each group and then dividing the totals by the number of persons.
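A minimal sketch of this strategy on a hypothetical micro-database: missing earnings are imputed by a random hot-deck within cells, and the aggregate table is obtained from observed and imputed values together. This is only one of many possible imputation methods.

# Hypothetical sketch: within-cell random hot-deck imputation followed by simple aggregation.
import numpy as np
import pandas as pd

micro = pd.read_csv("micro_database.csv")      # hypothetical columns: id, sex, age_group, earnings (with NaN)
rng = np.random.default_rng(0)

def hot_deck(s):
    donors = s.dropna().to_numpy()             # observed values within the imputation cell
    out = s.copy()
    n_missing = out.isna().sum()
    if n_missing and len(donors):
        out[out.isna()] = rng.choice(donors, size=n_missing)
    return out

imputed = micro.copy()
imputed["earnings"] = micro.groupby(["sex", "age_group"])["earnings"].transform(hot_deck)
print(imputed.groupby(["sex", "age_group"])["earnings"].mean())   # table for the aggregate-database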


The resulting set of estimates is numerically consistent, which is the most important advantage of this method. The condition of reliability, though, is not fully met. Only a few estimates obtained by the mass imputation method (be it hot-deck or another technique) correspond to a design-unbiased estimator. This problem is especially clear for small areas. For example, suppose that some score variable like 'higher education' (yes, no) has to be imputed for all the individuals not included in a sample survey; then, if 20% of the observed records of the sample hold 'higher education', the imputation should be such that 20% of the imputed records are also 'higher education' in order to obtain a design-based estimate. But also, for each combination of classes, the total number or percentage of persons that have the observed or imputed score 'higher education' should be equal to the corresponding design-based estimate (based on the sample). Without any other assumptions, it is not possible to satisfy all the underlying restrictions, and then in most cases the corresponding aggregates are not reliable.

These aggregates consist of totals of the sample survey variable for subsets of the population, and two types should be considered:

- Totals defined by variables in the imputation model. These can be estimated by means of adding observed and imputed values in the imputed dataset, and the estimates are reliable.

- Totals not defined by variables in the imputation model. Then the imputation only generates reliable estimates if the conditional independence assumption (Winkler, 2000b) is satisfied.

An important advantage of mass imputation is that, once the records are imputed, any user will be able to reproduce results when using the same imputed file. However, in the Dutch Virtual Census of 2001 mass imputation was not a viable strategy for raising survey outcomes to population totals: there were simply not enough degrees of freedom to sustain a sufficiently rich imputation model accounting for all significant data patterns between sample and register variables (Schulte Nordholt, 2004).

4.2.3. Weighting

The second estimation strategy consists of constructing different rectangular micro-datasets, which are taken from the micro-database according to the subsequent dissemination needs. Any traditional weighting scheme can then provide the weights for each rectangular micro-dataset. Estimates of the aggregates are calculated for as many mutually consistent population tables of interest as possible. Moreover, the method of repeated weighting previously analyzed in the WP1 report (Section 3.3) can be used in order to improve the overall results, by applying that scheme to other population tables of interest.

When constructing the rectangular datasets, they must include the classification variables together with the variable of study and a set of auxiliary variables with their population totals. So, if an aggregate such as 'health by age' is to be estimated, the corresponding micro-dataset has to include the variables from the health interview together with the variable 'age' from the population register.

Then, general regression or calibration estimators can be derived for each micro-dataset, trying to meet at least some consistency requirements. A set of starting weights and the specification of the weighting scheme are needed. The starting weights can be obtained from


the inclusion probabilities of the original dataset (in case the rectangular dataset corresponds to an original dataset the starting weights should be the inverse of the inclusion probabilities). The starting weights are finally calibrated in order to obtain as much non-response correction and variance reduction as possible.

Finally, if estimates for other population tables are required, the relevant rectangular micro-datasets are selected and reweighted.

In this case, the resulting set of estimates is design-unbiased, at least approximately, and therefore the estimates are reliable; they are also numerically consistent. This leads to recommending a strategy based on weighting (and reweighting) methods rather than on mass imputation. However, it must be taken into account that the number of estimates that can be obtained by this approach has its limits, so one has to decide whether the estimates cover the information that needs to be published.


4.3. The Principle of Redundancy (Manuela Lenk - STAT)

As pointed out above, the process of data integration is not always complete just with the use of a matching procedure through a set of key identifiers, and sometimes a set of additional variables can be used in order to improve the results achieved in a first stage. Then, since these variables are held in the resulting data archive and are collected from different sources, not necessarily connected with each other, it is possible to have different values available for the same unit and the same variable.

The principle of redundancy (Lenk, 2007) takes this fact into account and states that the information from several registers on the same entity can be used for data integration purposes even when the values do not coincide. The integration of these data needs a strategy based on a set of logical rules in order to decide the most reliable value for each variable at the micro level, an iterative process to solve the consistency problems that might arise (even inferring that a unit does not belong to the population) and, eventually, feedback to the data owners.

The set of logical rules, including the iterative process to correct the lack of consistency, once arranged in an automated sequence, can significantly reduce the number of "clearing cases" designated for clerical review. Further details on some proposed rules can be found in Section 5.5 of this document.
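As an illustration only (the actual rules used in practice are more elaborate and are not reproduced here), a simple source-priority rule of this kind could be sketched as follows, with hypothetical sources and values:

# Hypothetical sketch: for each unit and variable, keep the value from the most reliable
# source according to a fixed priority ordering; residual inconsistencies would go to the
# iterative checks or to clerical review.
import pandas as pd

priority = {"tax_register": 1, "social_security": 2, "survey": 3}   # 1 = most reliable (assumed ordering)

values = pd.DataFrame({                        # long format: one row per unit/variable/source
    "unit_id":  [1, 1, 2, 2],
    "variable": ["income", "income", "income", "income"],
    "source":   ["tax_register", "survey", "social_security", "survey"],
    "value":    [21000, 19500, 15000, 15200],
})
values["rank"] = values["source"].map(priority)
resolved = (values.sort_values("rank")
                  .groupby(["unit_id", "variable"], as_index=False)
                  .first()[["unit_id", "variable", "source", "value"]])
print(resolved)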

When putting into practice this principle, it is advised to maintain separate archives keeping the distinction between raw data collected in the original databases and the so-called prepared (imputed and edited) data, which are subjected to the consecutive series of cross checks.


4.4. Bibliography

Al P.G. and Bakker B.F.M., 2000. Re-engineering social statistics by micro-integration of different sources; an introduction. In Al P.G. and Bakker B.F.M. (eds.), Integrating administrative registers and household surveys. Netherlands Official Statistics, Volume 15, Special issue, Voorburg.

Denk M. and Hackl P., 2003. Data Integration and Record Matching: An Austrian Contribution to Research in Official Statistics. Austrian Journal of Statistics, Vol. 32 (2003), Number 4, 305-321, Vienna.

Eurostat, 1999. Use of administrative sources for business statistics purposes: Handbook on good practices. Office for Official Publications of the European Communities, Luxembourg.

Findl, P. and Lenk, M., 2007. Register-based census 2010 and census test 2006 in Austria. Registers in Statistics - methodology and quality. Statistics Finland, Helsinki.

Geuzinge L., Rooijen van J. and Bakker B.F.M., 2000. The use of administrative registers to reduce non-response bias in household surveys. In Integrating administrative registers and household surveys. Netherlands Official Statistics, Volume 15, Special issue, Voorburg.

Herrador M. and Porras J., 2007. The use of administrative register to improve sampling frame. Experiences and possibilities in Spain. Seminar on Registers in Statistics - methodology and quality, Helsinki.

Kroese B., Renssen R.H. and Trijssenaar M., 2000. Weighting or imputation: constructing a consistent set of estimates based on data from different sources. In Integrating administrative registers and household surveys. Netherlands Official Statistics, Volume 15, Special issue, Voorburg.

Lundström S. and Särndal C-E.,1999. Calibration as a standard method for Treatment of Nonresponse. Journal of Official Statistics, Vol.15, Nº.2, 305-327.

Laan P. van der, 2000. Integrating administrative registers and household surveys. In Al P.G. and Bakker B.F.M. eds., Integrating administrative registers and household surveys. Netherlands Official Statistics, Volume 15, Special issue, Voorburg

Ruddock V., 1999. Measuring and Improving Data Quality. Government Statistical Service Methodology Series No. 14. Office for National Statistics, London.

Särndal C-E.,1980. On p-inverse Weighting Versus Best Linear Unbiased Weighting in Probability Sampling. Biometrika, 67, 47-60.

Särndal, C.-E., Swensson, B. and Wretman, J. (1992). Model Assisted Survey Sampling, Springer-Verlag, New York.

Schulte Nordholt, E., 2004. Introduction to the Dutch Virtual Census of 2001. In: Schulte Nordholt, E., M. Hartgers and R. Gircour (eds.), The Dutch Virtual Census of 2001. Analyses and Methodology, Statistics Netherlands, Voorburg / Heerlen, July, 2004, 9-22.

Statistics New Zealand, 2006. Data Integration Manual. Wellington.


Winkler, W.E., 2000b. Using the EM-algorithm for weight computation in the Fellegi-Sunter model of record linkage. U.S. Bureau of the Census, Statistical Research Report Series, No. RR2000/05. U.S. Bureau of the Census, Washington, D.C.


5. Experiences and case studies

5.1. Experiences on the harmonization of the definitions, the variables and the units for the EU-SILC project in Italy (Paolo Consolini - ISTAT)

5.1.1. Introduction

The main objective of the EU-SILC (European Union Statistics on Income and Living Conditions) project is to disseminate statistics on the income level (see footnote 4) and its distribution or, more broadly, on the living conditions across the EU countries. The Italian EU-SILC survey is based on the "face to face interview" method of collecting data and uses administrative micro-data in order to reduce measurement errors. Many disciplines, including statistics, psychology, sociology and economics, share a common concern with the weakness of the measurement process in the survey method. Errors can arise from any of the factors involved in the measurement process, such as the questionnaire, the respondent and the interviewer, as well as from the methods of collecting data. The questionnaire includes several sources of error that affect the interpretation of the question by the respondent, like the question wording and the structure of the questionnaire. Even if the respondent fully understands the question, there can be memory problems; omissions and telescoping errors are just two typical examples. Interviewers play an important role in most surveys: they make contact with the people to be interviewed and ask for their collaboration, orient respondents to their tasks during the interview, manage the question and answer processes, and so on. Within each of these duties the interviewers may unknowingly introduce measurement errors. Another source of error is related to the method of collecting information: answers to the same income question may vary significantly between target and proxy respondents. In addition to the above errors there is a specific error, known as non-observation error, concerning the lack of measurement for some proportion of sample units as a consequence of coverage, non-response (either item or unit) or sampling (Kish, 1965). A typical non-observation error in income surveys is represented by the "selective non-response" bias. Van der Laan et al. (1996) assert, for instance, that the Netherlands income survey suffers from selective non-response and that even the most well-defined "income question" underestimates the (true) income level from the tax register by as much as 10%. In order to limit the impact of measurement errors on the income reported in the questionnaire by the interviewees, and generally to improve the data quality of the survey, a project of multi-source data collection has been started in the Italian National Institute of Statistics (Istat). This project uses administrative data to support the editing and imputation processes in the survey. In order to carry out this project, some basic requirements have to be satisfied by all the data sources involved: the statistical units are to be defined uniformly in all sources (harmonisation of units), all sources should cover the same target population (completion of populations), all variables have to be defined and classified in the same way among the data sources considered (harmonisation of the variables and classifications), and all data should refer to the same period or the same point in time (van der Laan, 2000). In other terms, administrative data need to be comparable with the EU-SILC survey data. The technique used to link the administrative units to those in the survey sample is exact record linkage. This technique allows one to combine

4 Income is also analysed according to its principal components: 1) cash or near-cash employment income; 2) cash benefits/losses from self-employment; 3) pensions for the old-age, survivors and disability functions; 4) non-pension cash benefits paid by the Social Security system; 5) interest, dividends and profits from capital investments; 6) income from the rental of a property or land.


information related to the same statistical units by means of a collection of identifiers called "match keys", provided that each unit is associated with a unique identifier not affected by errors. Different typologies of exact record linkage exist: in this case we refer to the simplest "one-to-one" relationship, where every statistical unit of a data source is associated with at most one record from the other data source. Records in different data sources are matched using the Personal Tax Number. Once the integration task is completed, the identification numbers are dropped and replaced with an internal system code, according to the Italian rules and regulations that protect confidentiality.
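A minimal sketch of such an exact one-to-one linkage, with hypothetical file and column names (the actual Istat implementation has its own environment and naming):

# Hypothetical sketch: deduplicate each source on the tax number, merge on it, and
# replace the identifier with an internal code.
import pandas as pd

survey = pd.read_csv("eusilc_sample.csv")      # hypothetical file with a "tax_number" column
admin = pd.read_csv("tax_register.csv")        # hypothetical file with a "tax_number" column

survey = survey.drop_duplicates("tax_number")  # one record per tax number in each source
admin = admin.drop_duplicates("tax_number")

linked = survey.merge(admin, on="tax_number", how="left", validate="one_to_one")
linked["internal_id"] = pd.factorize(linked["tax_number"])[0]      # internal system code
linked = linked.drop(columns="tax_number")                          # drop the identifier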

The integration process between survey and administrative data at the micro level can be summarized in nine points, representing the main phases in chronological order:

1. Selection of the personal characteristics which are used to generate the Italian Personal Tax Number (i.e., name, surname, sex, year and place of birth), and retrieval of the Personal Tax Number from the Municipality Register Office, recovered through the statistical register SIGIF (Istat Statistical Register for the EU-SILC and Labour Force frames);

2. Detection of the potential errors on the personal tax number reported in SIGIF and their replacement with the ones generated by the automatic procedure (based on the personal data);

3. Arrangement of the standard file used for the meta-information exchange with the Tax Agency, and submission of an application form to the Istat Office (DCAR) responsible for the exchange of information with the Tax Agency. This form includes all the specifications for the selection of the administrative register (tax statement register) and of the income variables;

4. Comparison between the standard file released by the Tax Agency (containing all the personal data of the frame), which edits and imputes the incorrect or missing tax numbers, and the SIGIF register containing information about the units and individuals included in the frame;

5. Implementation of the procedures for reading and migrating the tax files, stored in compressed form, into a readable SAS format;

6. Detection and correction of duplicate records in the administrative file and rearrangement of information on the same person reported in multiple records;

7. Exact matching of the administrative data sources and setting up of the integrated administrative database on incomes;

8. Analysis of the coherence of the income values on the same units reported in different administrative sources and formulation of hypotheses to solve incoherent income values;

9. Analysis of the coherence between administrative and survey data and formulation of hypotheses for the reconciliation of incoherent income values.

5.1.2. The case of social benefits

This work presents, as a case study, the integration of the administrative data sources on pensions and pensioners, defining the solutions to the problems of the harmonization of units, definitions and variables, and the reconciliation of the incoherences in income values between the sources involved. Before starting with the harmonization of the concepts on pensions, it is necessary to define what pension means. To this purpose, we can refer to the Istat definition: "pension is meant as a periodic and continuous flow of money transfers from Social Security Bodies to the beneficiaries to relieve them of the burden of certain risks or needs (social protection functions)". Among the risks or needs covered by the social security programmes we can distinguish those which are related to working activity (labour accidents, occupational diseases) from other risks independent of this activity (old age, disability, death, etc.). The EU-SILC target variables associated with pensions comprise three kinds of social protection functions: 1. Old age (PY100N), 2. Survivors (PY110N) and 3. Disability (PY130N). The pensions paid by the Italian social security system do not cover exactly the three functions mentioned above, but they represent a predominant share; more precisely, within these three social functions non-pension cash transfers, such as lump-sum payments, should also be considered. The administrative sources used to collect micro-data on pensions and pensioners (and their characteristics) are: 1) the "Pensions Register (PR)" of INPS (Italian National Social Security Board), 2) the "CUD/770 Tax Statements Register" of the National Tax Agency, 3) the "730 Tax Returns Register" of the National Tax Agency, 4) the "UPF Tax Returns Register" of the National Tax Agency, which includes self-employment income records.

Table 1 reports the most relevant meta-information for each data source. It is noticeable that only by bringing together two or more separate pieces of information recorded in different sources is it possible to estimate the EU-SILC target variables. As a result, in order to reckon the net pension income distinctly for each function (target variables), the "yearly net tax income of the pensions" (National Tax Register) and the "monthly gross payments on pensions" broken down into functions and types (Italian Social Security System) have to be combined.
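One simple, purely illustrative way to perform such a combination (not necessarily the procedure adopted by Istat) is to allocate each person's yearly net income across functions proportionally to the gross amounts recorded by function in the Pension Register:

# Hypothetical sketch: split yearly net pension income by function, proportionally to the
# gross monthly amounts recorded by function; all data below are made up.
import pandas as pd

pr = pd.DataFrame({                            # Pension Register: gross monthly amounts by function
    "tax_number": ["X1", "X1", "X2"],
    "function":   ["PY100N", "PY130N", "PY100N"],
    "gross_monthly": [900.0, 300.0, 1200.0],
})
tax = pd.DataFrame({                           # Tax Register: yearly net pension income per person
    "tax_number": ["X1", "X2"],
    "net_yearly": [13200.0, 16800.0],
})

pr["share"] = pr["gross_monthly"] / pr.groupby("tax_number")["gross_monthly"].transform("sum")
out = pr.merge(tax, on="tax_number")
out["net_yearly_by_function"] = out["share"] * out["net_yearly"]
print(out[["tax_number", "function", "net_yearly_by_function"]])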

Table 1 – Meta information on pensions/pensioners by administrative sources

Data source | Gross income for pension (Monthly / Yearly) | Net income for pension (Monthly / Yearly) | Number of payments | Pension type (function) | Domains | Units

Pension Register (PR) | (a) / (c) | - / - | (c) | (a) | Census of pensioners of the Italian Social Security System | Pensioner and/or pension

CUD/770 Tax Register | (b) / (a) | (b) / (b) | - | - | All beneficiaries of taxable pensions | Pensioner

730 Tax Register | (b) / (a) | (b) / (b) | - | - | All beneficiaries of taxable pensions (only 730 Tax Register) | Pensioner

Unico (UPF) Tax Register | (b) / (a) | (b) / (b) | - | - | All beneficiaries of taxable pensions (only Unico Tax Register) | Pensioner

(a): recorded data. (b): variables derived from the integration of data from different sources. (c): partially estimated (new pensioners from the Pension Registers 2003-2004); in the Pension Register 2005 the data are recorded.

The Pension Register is the only administrative source in Italy able to enumerate all beneficiaries of pensions, broken down into functions and types. However, this source reports the monthly gross pensions paid at 31st December of each year, which do not correspond to the yearly net income required for the three Eu-Silc target variables (PY100N, PY110N and PY130N). Nevertheless, it allows the correct identification of the pensioners and of the
typologies of pensions related to the Italian Social Security system. For each type of pension the Pension Register collects a set of information at individual level on the beneficiaries, the monthly amount before tax and the classification according to the Eu-Silc target variables. On the other hand, the Tax Registers record the yearly gross/net incomes received by each pensioner without any distinction between the functions or the target variables. In order to join the information of the Tax Registers with the Pension Register we need a “harmonized definition of pension income” that is comparable between these data sources. The common base of the comparison is represented by the “taxable income connected to the pensions”.

5. 1.3. Recommendations

3.1 Identifying receivers of pension income per administrative source

In order to establish an exact match between the different archives, it is essential to have a common identification key univocally associated with each record, which, in our case, is the “identification tax number (CF)”. As a result of merging the various administrative sources that cover the pensions across the three functions (CUD/770, 730, UPF, Pension Register) we obtain an integrated archive with 11 distinct reference segments or subsets for the analysis. Thus, it is possible to relate each pensioner to the data source that records the payments transferred to him/her. The graphical representation in Figure 1 shows a first group of analysis, subset D, that includes pensioners exclusively covered by the Pension Register (7.47% of the total pensioners in 2005 found in at least one source).
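Purely as an illustration of this merging step, the following sketch (Python with pandas; the data, the source names and the segment labelling are invented) shows how records carrying the fiscal code CF could be joined across the four sources and assigned to an analysis subset according to the combination of sources in which each unit appears.

import pandas as pd

# Toy extracts of the four sources; in reality each register is read from its
# own archive, with the fiscal code "CF" as identification key (values invented)
pr   = pd.DataFrame({"CF": ["A1", "B2", "C3"]})   # Pension Register
cud  = pd.DataFrame({"CF": ["A1", "C3", "D4"]})   # CUD/770 Tax Register
r730 = pd.DataFrame({"CF": ["C3"]})               # 730 Tax Returns Register
upf  = pd.DataFrame({"CF": ["D4", "E5"]})         # UPF Tax Returns Register

sources = {"PR": pr, "CUD": cud, "730": r730, "UPF": upf}

# One presence flag per source, combined with a full outer join on CF
integrated = None
for name, df in sources.items():
    flags = df[["CF"]].drop_duplicates().assign(**{f"in_{name}": 1})
    integrated = flags if integrated is None else integrated.merge(flags, on="CF", how="outer")
integrated = integrated.fillna(0).astype({f"in_{n}": int for n in sources})

# The combination of presence flags identifies the analysis subset
# (for instance, presence in PR only corresponds to subset D of figure 1)
integrated["segment"] = integrated.apply(
    lambda row: "+".join(n for n in sources if row[f"in_{n}"] == 1), axis=1
)
print(integrated[["CF", "segment"]])
print(integrated["segment"].value_counts(normalize=True))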

Figure 1 - The structure of the integrated administrative database on pensions

[Venn diagram of the Pension Register (PR), the 730 Tax Returns Register, the CUD (770) Tax Statement Register and the UPF Self-employment Tax Returns Register, defining the analysis subsets A to M.
Legend: A: units only in 730; B: units only in CUD; C: units only in UPF; D: units only in PR; E: units in CUD & UPF; F: units in CUD & PR; G: units in CUD & UPF; H: units in CUD & 730 & PR; I: units in 730 & PR; L: units in CUD & UPF & PR; M: units in UPF & PR.]

Though this segment includes almost exclusively units whose pensions are non-taxable, it may residually also contain units with taxable pensions and individuals whose match key
could not be linked with the Tax Registers because of under-coverage or errors in the key (0.74%). The second group mainly involves people whose income is recorded exclusively by the CUD/770 Tax Register (subset B in figure 1); within this group there is a small fraction of pensioners, 0.62% of the total pensioners associated with at least one source. In this case only the level of the income transfer (net and gross) can be determined, but not the type of pension, information that is instead contained in the Pension Register (see fig. 2). While the two above-mentioned groups represent the two extremes, many other, equally interesting, combinations of sources occur. One of the most relevant concerns the records whose information on pensioners is present in both the Pension Register and the CUD/770 Tax Register: this is the case of subsets F, L and H in figure 1 (90.19%).

When we compare at the individual level the gross taxable pensions of the Pension Register with the gross income pensions of the CUD/770 tax source, we find out that 84.14% of the matched cases shows relative differences in absolute values under the 5% threshold. Such a result demonstrates the high coherence of information on pension levels between administrative and tax sources.
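As a minimal sketch of this comparison (the figures and the choice of the tax-source value as denominator are assumptions, since the report does not fix the convention):

def relative_difference(pr_value, tax_value):
    # Absolute relative difference; the tax-source value is taken as reference
    # here, which is only one possible convention
    return abs(pr_value - tax_value) / abs(tax_value)

# A pair of gross pension amounts for the same pensioner in the two sources
print(relative_difference(12_450.0, 12_300.0) <= 0.05)   # True: coherent within 5%
print(relative_difference(9_000.0, 12_300.0) <= 0.05)    # False: to be reconciled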

From an operational viewpoint, the statistical units of the EU-SILC survey that can be linked to the Pension Register but not to the Tax Archives (subset D in fig. 1) must be analysed according to the type of pension they receive (see cases A and B in fig. 2). The tax-exempt pension components do not cause any problem when calculating net income, since the gross component (the only one surveyed by the Pension Register) coincides with the net one (case A). Conversely, in the case of pension treatments subject to taxation, some questions arise when calculating net income from the gross component recorded in the Pension Register (case B), because a micro-simulation model has to be employed.

Figure 2 - The integration process of administrative micro-data on pension/pensioners

[Flow chart. Units of the Eu-Silc frame matched with the Pension Register but not linked with the Tax Archives are split into units with non-taxable pensions (A: net income = gross income) and units with taxable pensions (B: net income computed by a micro-simulation model). Units matched with both the Pension Register and the Tax Archives, or with the Tax Archives only, are split by reporting source: pensions reported in CUD (C), in 730 (D) and in UPF (E); for units matched with the Tax Archives but not linked with the Pension Register the type of pension is attributed (F: survivor's or old age pensioner).]


For cases A and B, pensions/pensioners retrieved only from the Pension Register (see footnote 5) (fig. 2), the after-tax income for the three target variables (PY100-PY110-PY130) is obtained by means of the “type of pension” associated with each tax scheme. This task is quite simple when the beneficiary receives one or more pensions all classified as tax exempt: in this case gross and net income are equivalent. However, the task becomes much more difficult when the kind of pension is not univocally associated with a specific tax scheme, as for the invalidity pension. In this case, it is assumed that low amounts of benefits are treated as tax-exempt income. For the sake of simplicity, pensions are considered as taxable when their amount exceeds 7,000 euro per year and as non-taxable when their value falls under this threshold. The value of 7,000 euro corresponds to the Italian tax threshold (the so-called “no-tax area”) on personal incomes in the year 2005. The net income of invalidity pensions is obtained in two steps: the tax burden is first computed and then subtracted from the gross income. The following figure reports the Italian tax schedule on personal incomes applied in that year.

Figure 3 - Income Brackets applied to total taxable income of pensioners – year 2005.

Income brackets (total taxable income related to the pensioners, euro) and marginal tax rates, year 2005:

From:       Up to:      Marginal tax rate
0           7,000        0%
7,000       26,000      23%
26,000      33,500      33%
33,500      100,000     39%
100,000     and over    43%

In the case of Old age and/or Survivor's pensions (for which only the amounts exceeding 7,000 euro are subject to taxation), the same “no-tax area” threshold used for the invalidity pensions was applied to separate the net incomes equal to the gross amounts from the incomes requiring the computation of the corresponding tax. In this case, the procedure for calculating the tax and the net income is similar to the afore-mentioned one for the invalidity pensions. When both Survivor's and Old age pensions are recorded, the tax is applied per share to the respective gross amounts.
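A minimal sketch of this gross-to-net computation, using the 2005 schedule of figure 3; it assumes a single total taxable pension amount and ignores deductions and local surtaxes, which the actual micro-simulation model would take into account.

# 2005 income brackets (upper bound, marginal rate) for pensioners' taxable income
BRACKETS_2005 = [
    (7_000, 0.00),           # "no-tax area"
    (26_000, 0.23),
    (33_500, 0.33),
    (100_000, 0.39),
    (float("inf"), 0.43),
]

def net_pension(gross):
    # Gross-to-net conversion by progressive marginal rates (simplified)
    tax, lower = 0.0, 0.0
    for upper, rate in BRACKETS_2005:
        if gross > lower:
            tax += (min(gross, upper) - lower) * rate
        lower = upper
    return gross - tax

# A pension below the no-tax area is left untouched, one above it is taxed
print(net_pension(6_500.0))    # 6500.0
print(net_pension(15_000.0))   # 15000 - 0.23*(15000-7000) = 13160.0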

The relative differences between the income values observed on the same statistical units across the different data sources (administrative and tax) represent the core of the decision structure used when defining the pension levels and, more generally, when attributing the income components in the presence of information from both the Pension Register and the Tax sources (see cases C, D and E in figure 2, and footnote 6). Strictly speaking, to properly compare at micro level the before-tax benefits of the Pension Register with those from the Tax Archives, it is first necessary to establish which pension types are subject to taxation in the Pension Register (harmonisation of the definitions and variables). To compare the value of payments of several pension schemes subject to different tax treatments (taxable/non-taxable) and inherent to the same function (such as invalidity) between the Pension Register and the Tax Archives, it would be necessary to harmonise the definition of the variables involved, excluding from the analysis the non-taxable pension components that are by definition not surveyed by the tax sources but are present in the Pension Register.

5 As regards the data bases of the year 2005, this case includes 7.46% of the statistical units for which at least one pension is included in one or more administrative and tax sources.
6 As regards the data bases for the reference year 2005, the percentage of cases in which the pension income record is collected by both the Pension Register and the CUD Tax Register equals 90.19% of all persons entitled to a pension included in one of the afore-mentioned sources (case C).

In this context, the Old-age pensions and the Survivor's pensions (except for the INAIL pensions to survivors) represent two types of benefits included in the Pension Register that are theoretically (see footnote 7) burdened with withholding taxes. Vice versa, social pensions and assistance benefits for the disabled represent tax-exempt categories. The pension schemes included in the target variable PY130N (money transfers for the Disability function) are partially taxed and partially exempt from taxation. In particular, civil invalidity allowances, allowances for deaf-mute and blind persons, (direct) pensions for accidents at work or professional illnesses, (direct) war pensions and allowances paid as rewards to victims of political persecution are non-taxable. On the other hand, the disability pensions paid by INPS to employees and self-employed workers are subject to taxes.

The integration procedure for the pension values from the various data sources can be represented as a step-by-step process that starts from the comparison, in terms of relative differences in absolute value, between the annual taxable pension income according to the Pension Register (including Old age and Survivor's pensions) and the gross pension income according to the CUD (units of the subsets F, L and H in figure 1). Through the comparison of different annual pension aggregates estimated respectively from the Tax Registers (see footnote 8) and the Pension Register, the process reaches a satisfactory coverage of 85.67% of cases (pension receivers) for which the income values across the two data sources are to a large extent comparable (see figure 4).
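The step-by-step logic can be sketched as an ordered list of candidate comparisons scanned until one falls within the tolerance; the rule names and the amounts below are only indicative of the DREL checks shown in figure 4, not the actual implementation.

def first_rule_within_threshold(candidates, threshold=0.05):
    # candidates: ordered list of (rule_name, pr_amount, tax_amount) tuples.
    # Returns the first rule whose absolute relative difference is within the
    # tolerance, or None if all fail (the unit then goes to the residual rules).
    for name, pr_amount, tax_amount in candidates:
        if tax_amount and abs(pr_amount - tax_amount) / abs(tax_amount) <= threshold:
            return name
    return None

# Hypothetical unit: monthly PR amount of 1,000 euro paid 13 times,
# compared with two different annual aggregates of the CUD source
unit_checks = [
    ("DREL1_PEN", 1_000.0 * 13, 12_950.0),
    ("DREL3_PEN", 1_000.0 * 13, 13_060.0),
]
print(first_rule_within_threshold(unit_checks))  # "DREL1_PEN"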

After defining the list of main relative differences, the next step is to establish a tolerance limit for the relative variations between the data from the administrative source (Pension Register) and those from the tax sources; income values whose relative difference falls within this tolerance are considered acceptable, whichever of the two source types reports them. A threshold equal to 5% was adopted for integrating the pension data from the different sources. Hence, a final value is assigned to each target variable (annual income by function) for all records whose relative differences, in absolute terms, between the “pension sums” of the two types of sources fall within the interval 0-5%; in particular, the pension gross/net income recorded in the tax source is taken as the final value. Of all the statistical units that received pensions in 2005 and are recorded in both the Pension Register and the CUD/770 Tax Register, 84.14% present a relative difference under the 5% threshold in at least one of the first nine series of relative differences considered (see footnote 9). When considering the whole list of 11 relative differences, the 5% threshold is met by 77.29% of the cases with at least one tax and/or administrative source pension. Figure 4 shows that for 14.33% of the records additional rules need to be applied, in a residual way, to reconcile the values of the Pension Register with those from the remaining Tax Registers.

7 More precisely, these two types of pensions are actually burdened with taxes when their gross amount exceeds the threshold of the so-called “no-tax area”, which the lawmaker set at 7,000 euro for pension incomes referred to 2005.
8 In the Tax Registers it is not always possible to distinguish pensions from employment incomes for some taxpayers, as the difference is sometimes not relevant for tax purposes. In these cases, the tax form does not support the researcher in calculating the true annual amount of pensions, nor the true annual employment incomes.
9 If we compare the same number of subjects with the total statistical units that present at least a pension income in one of the mentioned sources, this percentage rises to 76%. In other words, more than three quarters of the EU-SILC'05 sample units to which a pension of administrative and/or tax source is associated present annual gross pension income values coinciding between the Pension Register and the CUD/770 Tax Register.


Figure 4 – Decision tree in use for integrating the pension data from the Pension Register and from the Tax Archives: relative differences between the annual sums of these two types of sources – EU-SILC 2006

The remaining working hypotheses applied to reconcile the pension data of the two data sources consider other aspects, mainly regarding the nature of the pensions and the tax system. Among these, the hypothesis of tax exemption of the invalidity pensions and the one regarding the application of the no-tax area also to the accumulation of different pensions are particularly relevant. More precisely, a majority of the remaining cases (see footnote 10) show pension sums (invalidity, survivor's and old-age/retirement pensions, distinct or accumulated) under the no-tax area threshold (7,000 euro). In these cases, it is reasonable to assume that there are also significant gaps in the gross pension between the Tax Archives and the Pension Register, and that the values of the latter source (systematically higher than the former) are closer to the true gross income. This case includes 5.71% of the EU-SILC'05 sample units to which a pension from an administrative and/or tax source is associated.

10 These cases refer to the situation in which all eleven types of relative differences fail to meet the 5% threshold.

[Figure 4 content. EU-SILC statistical units matched with the Pension Register only (7.47%), with both the Pension Register and the Tax Registers (91.62%) or with the Tax Registers only (0.91%) are processed through a sequence of relative-difference checks (DREL1_PEN to DREL6B_PEN), each evaluated against the 5% threshold; the checks compare the annual pension amounts reconstructed from the Pension Register (monthly amount times number of payments) with the gross pension and/or employment incomes reported in the CUD, 730 and UPF Tax Registers. Units satisfying none of the checks (14.33%) are treated with other decision rules.]

On the other hand, part of the differences between the annual gross pension incomes of the two data sources can be traced back to the accumulation of pensions all for invalidity, which are in part non-taxable (3.07%). Here too, the gross pension amounts from the Pension Register can reasonably be assumed to be close to the true gross value. Likewise, it is possible to assume that the no-tax area also applies to pension incomes (Old age and/or Survivor's pensions) slightly over the threshold of 7,000 euro though under 8,000 euro (3.26%). Finally, to reconcile part of the observations containing pension data from the two sources, it was decided to relax the condition on the 5% threshold for all 11 categories of relative differences so as to include also the cases between 5% and 10% (0.94%). The remaining cases in which the sources diverge imply errors in attributing the incomes in the tax sources and signs of duplication errors in the CUD that cannot be identified through other tax sources (1.35%). As far as the coverage of the frame given by the two data sources is concerned, we assume that pensioners need to belong to at least one of the Registers in order to be part of this study. In particular, we find that only 5.3% of the total number of pensioners in the Eu-Silc project 2005 is not covered by administrative data, while this fraction is still included in the survey data (i.e. interviewees declaring a pension who satisfy the main eligibility requirements). The reconciliation of incoherent income values has been solved by assuming administrative data to be more reliable than the corresponding survey data. The following table shows the improvement in data quality when integrating administrative and survey data on pensioners against a benchmark source (the Pension Register).

Table 2 – Comparison between the estimated number of pensioners in Italy from the Eu-Silc'04 survey, from the integrated database (administrative and survey data) and the number of pensioners in the Pension Register (Italian census), by function – Year 2004

Per 100 pensioners of the Pension Register (benchmark), by function:

Data sources                                             Old age  Survivors  Disability  TOTAL
Eu-Silc'04 survey                                         102,0     73,3       48,9       93,3
Integrated database (Eu-Silc'04 survey joined
with administrative data)                                 101,0     91,1       91,7      100,9
Pension Register (a) (benchmark source)                   100,0    100,0      100,0      100,0

(a): For comparability reasons the beneficiaries aged up to 14 years are excluded.

5.1.4. Conclusion

The experience gained by integrating administrative and survey income data represents a useful example for establishing guidelines that may be generalized to the collection of other variables. The integration of different data sources starts by studying the phenomena, that is, from the definition of the variables and of the units of analysis. The next step consists in a detailed comparative examination of the sources and the setting up of a
suitable matching strategy. This allows the various pieces of information to be consistently joined into a richer and more effective database.

The most crucial aspect for a successful integration is the harmonisation of the two original data sources, through a thorough semantic analysis of the meaning of the definitions and a sound statistical study of their quality. This is clearly a necessity, especially when a third source to be used as a benchmark is not available (indeed, the most frequent case is that the two data sets to be merged are the best available). In our example, we could use an external benchmark only for the number of pensioners, with satisfactory results.


5.2. A report about an employment data base used for the register census: EVA – the Austrian Activity Register (E. M. Reiner, P. Schodl - Statistics Austria)

For many statistical applications inside the Austrian statistical office, it is essential to have access to data about employment or activity. With the continuing improvement of data processing and the wish to unburden respondents and cut down costs on the one hand, and to improve data quality on the other hand, more focus is put on the use of administrative data sources.

For that reason, the Austrian Activity Register (EVA) is being created. When completed, it will contain linked and anonymised data from different sources on the activity status of every Austrian person, for every desired reference time and aggregation level.

The acronym EVA stands for: Erwerbstätige (employed persons), Versicherte (insured persons), and Arbeitslose (unemployed persons), which are the main groups the register contains.

5.2.1. Data bases

There are a number of different data sources containing essential information on the activity status of a person. In the EVA register the register of the Austrian social insurance (HVSV) is used as the main source. These data are a good basis for picturing the actual labour market, as paying social security contributions is compulsory for almost every working or retired person in Austria.

5.2.1.1. Social Insurance Register

This register is owned by the Social Security authorities; it contains one observation for every insurance case.

One observation is structured in the following way:

• an identification number (social insurance number) for the person
• an identification number of the employer (in the self-employed case this identification number is an artificial number)
• beginning date of the insurance case
• ending date of the insurance case
• information about the type of insurance, with more than 200 different values (the so-called “qualification”)

For one person at one reference time, there may be several observations. Social insurance does not require employment, so the insurance register also contains data about retirees, unemployed persons and the like. Also, this register covers employees as well as self-employed persons.

The “qualification” gives some indication about the status of a person (e.g. whether he or she is employed, self-employed, retired, receiving a widow’s pension …).

This table is called the “qualification fact table”.

Additionally, the register contains three other tables:


• the so-called “Basis Fact Table”, which contains information on the salary for insurance cases (employee-employer combinations)
• the insured persons' table, containing information on the person (e.g. date of birth, sex)
• the employers' table, containing information on the employer (e.g. name, NACE, address)

Note that the data completing the “Basis Fact Table” for a year is delivered to Statistics Austria in October of the following year, while the rest of the data is delivered each month for the previous month.

Historic data is included in the register from 2002 until one month before the current date.

5.2.1.2. Tax register

The Austrian tax register, owned by the tax authorities, consists at the moment of two data sources. One contains all annual pay slips issued in the past year, the other the basic personal data of all tax-paying Austrians, including the self-employed. Due to legal and technical reasons the information in the second file is up to three years delayed, so it is not used for the Activity Register at the moment. The data containing pay slips is delivered to Statistics Austria in July of the following year. These data offer information on employment relationship level, very similar to the insurance data described in 5.2.1.1. For most persons only one annual pay slip is issued, but there can be more than one.

This pay slip contains personal data (e.g. Personal Identification Number, date of birth, sex, address) and information about the employment relationship (e.g. full- or part-time work, amount of salary, and status of work (e.g. apprenticeship, employed, retired)). A starting and ending date is also included in the pay slip data, but in our experience these values are imprecise. This is due to the fact that they are not important for tax purposes and therefore not maintained properly.

5.2.1.3. Unemployment register

This register is owned by the “Public Employment Service Austria” (AMS) and contains all persons seeking work or apprenticeship positions, or undergoing a training course in order to find a new job. It is delivered to Statistics Austria once a month for the previous month. The data contains information on the status (e.g. receiving unemployment benefit, looking for an apprenticeship position), the starting and ending date of this status, as well as the reason for ending.

5.2.1.4. Enterprise register

The enterprise register is an in-house register which can be linked with the insurance cases through the identification number of the employer in the insurance register. It usually contains better information about the enterprises than the insurance register (e.g. on NACE) and allows employer identification numbers of the insurance register to be pooled into legal and/or statistical units. This is essential since not every identification number of an employer in the social security register represents an actual enterprise in a statistical or legal sense.

5.2.2. Data delivery

Raw administrative data is delivered to Statistics Austria via FTP connection, depending on the data source, once a month or once a year.

Using SAS Data Integration Studio, those delivered data are preprocessed and archived to be able to use historic data for different reference days or periods.


[Figure: from raw data (Austrian Social Insurance) to the “Qualification Fact Table” and the “Basis Fact Table”, using SAS Data Integration Studio technology.]

5.2.3. Data linkage

5.2.3.1. Personal Identification Number (PIN)

The identifier in each source is the social security number. This number is assigned to persons living or working in Austria by the national social insurance. It is a 10-digit number which contains the date of birth, three serial numbers and one check digit. Since the social security number is used in many other registers and surveys, data linkage to other sources can be achieved without difficulties.

5.2.3.2. Business Identification Number (BIN)

At the moment different business identification numbers are delivered with each raw data source. Using the information about the linkage of these identification numbers already established for the enterprise register, it is possible to enrich the different employment data (tax, insurance) with a unique business identification number (BIN). This BIN originates from the enterprise register and can be used for linkage with all kinds of business data.


5.2.3.3. Record Linkage

All data contain a personal identification number (PIN), and employment data additionally contain a business identification number (BIN). The BIN is used, in connection with the PIN, to link data on employment relationship level. Due to wrong data or missing links, not all data from the tax register can be connected to data from the insurance register. Some of these missing connections are a result of conceptual differences between the administrative data sources. Others originate from missing or corrupt data; for those, different algorithms are implemented to improve the linking to the correct employment relationship by using all available data (e.g. about the employer or the status of work).

5.2.4. Data refinement and clearing

5.2.4.1. Editing and clearing of data based on redundant information

Some users in our National Statistics Institute want to access the data basis of EVA “as raw as possible”. For them a raw database is provided on insurance/employment/unemployment relationship level, using only technically edited data.

For some purposes, data representing reality as closely as possible are needed. Here, using more than one database makes it possible to improve the quality of the whole activity register. Each administrative source shows better quality in the variables that are important for the administrative purpose it is maintained for, so using these variables from the corresponding registers leads to a better register quality as a whole.

For example, the exact starting and ending dates are important for insurance purposes, whereas for tax purposes only the year of the wage payment matters. These dates are therefore taken from the social insurance register in the activity register.

5.2.4.2. Main activity

The choice of the main activity from all activities of one person is a non-trivial task, which can be solved in many different ways. The more information is available about a person, the closer to reality the main activity can be determined. On the other hand, the rules for implementing this choice become more complex and arguable.

The simplest way to choose the main activity is to use only the type of insurance (“qualification”), as provided by the social insurance register. For this purpose a hierarchic list of all qualifications is compiled. If two insurance cases of one person have the same type, the one which began earlier is chosen. If no other data than the insurance register (without salary) are available, this may be the only possibility to choose a main activity. This way of determining the main activity forms a “concept”, which is used in Statistics Austria for different statistics. It is implemented in EVA as this concept can be used for the latest data and does not depend on information on salary, which is delivered with a time lag.
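A minimal sketch of this first concept; the qualification hierarchy and the field names are invented for illustration, the real hierarchic list is the one compiled at Statistics Austria.

from datetime import date

# Invented example hierarchy: lower rank means higher priority
QUALIFICATION_RANK = {"employed": 0, "self-employed": 1, "unemployed": 2, "retired": 3}

def main_activity(insurance_cases):
    # insurance_cases: list of dicts with "qualification" and "start" keys.
    # Highest-ranking qualification wins; the earlier starting date breaks ties.
    return min(
        insurance_cases,
        key=lambda c: (QUALIFICATION_RANK[c["qualification"]], c["start"]),
    )

cases = [
    {"qualification": "retired",  "start": date(1999, 3, 1)},
    {"qualification": "employed", "start": date(2006, 5, 1)},
    {"qualification": "employed", "start": date(2004, 9, 1)},
]
print(main_activity(cases))   # the employment case that began in 2004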

If, furthermore, the salary is used (which the insurance register contains as the basis for the social security contribution, as does the tax register), this provides a more reliable criterion for the main activity. Here the length of employment should also be considered in the calculation, since more important than the amount of the annual salary is the amount in the reference time, which may be just one day. The choice on the basis of salary forms another “concept” implemented in EVA.

Of course the most accurate choice is obtained by using information from all the registers. For example, the tax register contains information about full time/part time work and pre-tax salary. The unemployment register contains information about job seeking people. A bundle
of rules, which makes use of all available information, is another “concept” of main activity which is also implemented in EVA.

5.2.4.3. Economically active or not active

Like the choice of the main activity, the determination of whether a person is economically active or not is not trivial for certain groups of people. Again, the easiest way is to judge by the type of insurance (“qualification”) only: if a person is insured only because of an economically inactive status (like drawing a pension) at the reference time, this person is classified as “not economically active”. By including other registers, and historic data of a person, this procedure can be refined further. In particular, people who enjoy job protection during a leave of absence are impossible to detect by mere consideration of one report day. However, even by using all available information, a job protection can only be assumed, although with different levels of reliability.

5.2.5. Presentation: OLAP cubes

The EVA Activity Register will be used for queries on aggregated level, as well as on the lowest level, where one observation represents one activity. Hence we use OLAP technology (OLAP = Online Analytical Processing) with the possibility to access the lowest level and obtain flat data files on activity level, e.g. to link register data with other data.

The concept of an OLAP tool is to arrange the data multi-dimensionally, with each dimension having several levels and hierarchies. A query on the OLAP cube is in fact a slice through the cube in selected dimensions and on selected levels of selected hierarchies. The result is always a cross-classified table with entries as measures. For the analysis, the user is able to “move” up and down the levels of one hierarchy (drill up / drill down), to change hierarchies and to include or exclude dimensions from the query. Also, the analyst is able to extract the observations which are part of the aggregated query as a flat table (drill-through).

The most important dimensions of the EVA Register are:

About the employed/insured person:
• Identification number of the person
• Demographic attributes such as age, sex, nationality…

About the employer:
• Identification number of the employer (also on legal and/or statistical level)
• NACE
• Geographic attributes (i.e. address)

About the insurance case:
• Type of insurance (“qualification”)
• Reference time
• Salary
• Economically active / not active: different concepts are represented in different hierarchies of this dimension
• Main activity: different concepts are represented in different hierarchies of this dimension


[Figure: example of an analysis by age and sex using OLAP cubes.]

We are facing difficulties regarding the performance of low-level queries. This is a result of the huge database: for one single reference day there are about 11 million observations. The problem arises from two requirements. Firstly, great flexibility on the reference time (different reference days, reference weeks and reference months) is needed by the teams of analysts, which enlarges the underlying database enormously. Secondly, counting on the personal level is often needed, while data in general is provided on employment or insurance case level, so there is more than one observation per person. It is not a problem to count observations (insurance or employment cases) in the fact table.

In most cases there is no possibility to count only one specific observation per person, since queries are made on the main activities as well as on minor activities. Hence one has to perform a distinct count on the persons, which means that for every observation in the query it has to be checked whether the person of this observation has already been counted or not. In large queries, this results in very long processing times.
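The difference between counting observations and counting persons can be illustrated with a small sketch (Python with pandas, invented data): the number of rows counts insurance or employment cases, while a distinct count on the person key is needed to count persons.

import pandas as pd

# Toy fact table: one row per activity (insurance/employment case)
facts = pd.DataFrame({
    "PIN": [1, 1, 2, 3, 3, 3],
    "sex": ["m", "m", "f", "f", "f", "f"],
    "activity": ["employed", "self-employed", "employed", "employed", "retired", "employed"],
})

# Counting observations is cheap ...
cases_by_sex = facts.groupby("sex").size()

# ... counting persons requires a distinct count on the person key
persons_by_sex = facts.groupby("sex")["PIN"].nunique()

print(cases_by_sex)    # f: 4, m: 2
print(persons_by_sex)  # f: 2, m: 1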

A possible solution for the problems arising from the flexibility in the reference time is to separate the data according to predefined reference dates into different cubes. Another way is to eliminate the possibility to access the cube on the lowest level; instead a drill-through, which results in a flat data table, can be performed. At least the second possibility showed satisfactory results regarding the performance time on aggregate levels, using measures without distinct count, in our test cubes.

The problem arising from the distinct count is more difficult to handle: performing a distinct count unfortunately requires a dimension on the person key. This means one dimension with about eight million members, which again brings performance problems. A possible solution might be to create two cubes: a big one with the possibility of the distinct count, used to build aggregations in advance, and a small and fast one without the
direct possibility of a distinct count, but which can access the aggregation table of the big cube, and hence can display the measure “number of different persons” using an add-on. This small cube can be used for analysis.

5.2.6. Future work

There are two other registers to be included in the activity register, namely the population register and the education register. Both are owned by the Austrian statistics office and contain information about the dwelling place (e.g. for analysis of commuting) and the educational background of a person.

The social security number will be replaced by an anonymous personal identification number, called the “area-specific person code” (bPK). The advantage of the bPK is that two different data owners do not have the same bPK for the same person. Hence for linkage, the Registration Authority, which is part of the Austrian Data Protection Commission (DSK), has to be involved.

The activity register will be used for the register-based census; hence the concepts for activity according to census recommendations have to be incorporated.

Until now, the insured employment was only related to an enterprise, but starting from 2008 it will be related more precisely to the local unit of work, so more regional analyses will be possible for future data. Probably more information about self-employed persons will also become available from the social insurance register in the coming years.


5.3. The Principle of Redundancy – Austrian Register Based Census (Manuela Lenk – Statistik Austria)

The main objective of a register based census is to reflect the actual situation of living, residence and working of the population, as well as the actual situation of dwellings and buildings, by using existing register and administrative data. Since the administrative data sources used for the register based census are not linked and are collected independently, it is possible that contradicting information appears on the same variable for one person. To ensure the best possible quality, the principle of redundancy is used: for example, information on sex, date of birth and nationality is collected from many registers. A positive side effect of these register checks is the possibility of feedback to the data owners, leading to quality improvements of the underlying data sources.

5.3.1. Base and comparison registers

How is the principle of redundancy implemented in the Austrian test register based census (reference date 31.10.2006)? Statistics Austria receives data for the reference date from eight base registers and from seven comparison registers containing all required attributes. Attributes that are included in more than one base register are used for cross checks.

5.3.2. Data preparation

The following scheme shows the way the data is prepared:

[Figure 1 – Data preparation scheme: raw data from STA registers and other registers is stored in the staging area (1), prepared in a translation process (2) and analysed in cubes (SAS, MS Access, MS Excel etc.); data to correct is fed back, corrected and refilled in a loop (3).]

First the raw data is stored in the so-called staging area (1); in a next step the data is prepared (2), which means linked on personal level and checked for plausibility during a translation process. Then OLAP cubes are built for analysis. With OLAP technology data can be displayed in a multidimensional way by using measures and dimensions. A dimension, for example, is the geography dimension, which has Austria as top level, then nine Länder (federal provinces), further down the municipalities and, at the base level, each dwelling. Using OLAP cubes one can view the data on the required level of aggregation for each dimension. On micro data level further plausibility checks are carried out and errors are corrected in a loop-back
process, rebuilding the cubes with corrected data. This ensures that every analyst uses the latest version of data.

5.3.3. Analysis-Cubes

There are cubes for all areas of analysis. The following figure shows all basic registers as well as the cubes for analysis for register based census test 2006:

Figure 2

5.3.4. Multiple attributes

Attributes originating from more than one register are called multiple attributes. For each of these multiple attributes there is one responsible analyst, who defines the rules for plausibility and valid values per person.

At the moment the following multiple attributes are used in the test register based census:

• Marital status
• Date of birth
• Country of birth
• Sex
• Nationality
• Main residence


5.3.5. Example Sex

To illustrate the process of analysis, the example of sex is described step by step. Information on sex is supplied by seven different registers: the base registers shown in figure 2 as well as three comparison registers.

• Central population register (CPR)
• Central social security register (CSR)
• Tax register (TR)
• Unemployment register (UR)
• Register of car owners (COR)
• Register of social welfare recipients (SWR)
• Registers of public servants of the federal state and the Länder (RSP)

The area of analysis “demography” is responsible for defining the valid value. During the translation process the rules for deciding the valid value are determined.

Example for Rules:

R1: Same content in all sources
R2: If there are only data from CPR, this is the valid value
R3: If CPR_SEX is unknown, CSR_SEX is the valid value
R4: If CPR_SEX and CSR_SEX are equal, this is valid (even if it is not consistent with other sources)

For further analysis information on sex is shown in detail on a personal level in the analysis cube „Demography“.

Cube demography:

PID     sex_CPR  sex_CSR  sex_TR  sex_UR  sex_COR  sex_SWR  sex_RSP  sex_valid  sex_rule  sex_change
124586     1        1       1       0       0        0        1         1         1
124587     1        0       0       0       0        0        0         1         2
124588     9        2       2       0       0        2        0         2         3
124589     2        2       1       1       0        1        0         2         4

PID = personal identification number; sex_change = change field
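The following sketch shows how rules of this kind could be applied to one record of the demography cube; the coding (1 = male, 2 = female, 0/9 = not present or unknown) follows the example table, but the function is only an illustrative reading of R1-R4, not the implemented translation process.

UNKNOWN = {0, 9}

def valid_sex(rec):
    # rec: dict of the sex value reported by each register for one person.
    # Returns (valid_value, rule_number) following an illustrative reading of R1-R4.
    values = [v for v in rec.values() if v not in UNKNOWN]
    if values and len(set(values)) == 1 and len(values) == len(rec):
        return values[0], 1                     # R1: same content in all sources
    others = {k: v for k, v in rec.items() if k != "CPR"}
    if rec["CPR"] not in UNKNOWN and all(v in UNKNOWN for v in others.values()):
        return rec["CPR"], 2                     # R2: only CPR reports a value
    if rec["CPR"] in UNKNOWN and rec["CSR"] not in UNKNOWN:
        return rec["CSR"], 3                     # R3: CPR unknown, take CSR
    if rec["CPR"] not in UNKNOWN and rec["CPR"] == rec["CSR"]:
        return rec["CPR"], 4                     # R4: CPR and CSR agree
    return None, None                            # left to the analyst (change field)

record = {"CPR": 2, "CSR": 2, "TR": 1, "UR": 1, "COR": 0, "SWR": 1, "RSP": 0}
print(valid_sex(record))                         # (2, 4), as for PID 124589 above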

During the process of analysis there can be cases where these rules cannot or should not be applied. Therefore the analyst has the possibility to change the multiple attribute via the loop-back process (field sex_change).

5.3.6. Central attributes

Apart from multiple attributes there is the concept of central attributes, meaning variables that are used in more than one cube of analysis. As for the multiple attributes, there is always one area of analysis responsible for changes; the central attribute can be changed only in the cube belonging to this area. A multiple attribute can be a central attribute as well, e.g.
sex, which is used in all cubes containing persons. The value fixed in the demography cube is then used in the other personal cubes. If at least one value of a central attribute is changed, all analysis cubes containing this attribute have to be rebuilt to guarantee consistency.

Examples of central attributes:
• quality of residence
• geographical dimension
• sex

The prepared data contains information of the building and dwelling register (Figure 2) for the geographical dimension. All cubes (for example the cube for demography, the cube for local units of employment, the cube for families and the cube for housing census) use the same geographical dimension.

5.3.7. Residence analysis

A main result of the register based census is the number of Austrian citizens and the living population of Austria; the analysis of residence deals with this question. To be part of the Austrian population a person has to appear both in the central population register and in one or more comparison registers; otherwise it is a clearing case requiring further examination. For this analysis the main information lies in whether the person appears in another register at all, so mainly flags are used in this cube.

Cube residence analysis:

PID     flag_CPR  flag_CSR  flag_TR  flag_UR  flag_COR  flag_SWR  flag_RSP  resid.valid
789101     1         1        1        0        0         0         1          1
789102     1         0        0        0        1         0         0          1
789103     1         1        1        0        0         1         0          1
789104     0         1        1        1        0         1         0          1
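A sketch of the residence rule as stated above (presence in the CPR plus presence in at least one comparison register); the flag names follow the example table and the function is illustrative only.

COMPARISON_FLAGS = ["flag_CSR", "flag_TR", "flag_UR", "flag_COR", "flag_SWR", "flag_RSP"]

def residence_status(row):
    # 'population' if the person appears in the CPR and in at least one
    # comparison register, otherwise a clearing case for further examination
    in_cpr = row["flag_CPR"] == 1
    in_comparison = any(row[f] == 1 for f in COMPARISON_FLAGS)
    return "population" if in_cpr and in_comparison else "clearing case"

row = {"flag_CPR": 1, "flag_CSR": 1, "flag_TR": 1, "flag_UR": 0,
       "flag_COR": 0, "flag_SWR": 0, "flag_RSP": 1}
print(residence_status(row))   # 'population' (cf. PID 789101 above)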

Within the residence analysis cube all persons - also those not (uniquely) linked during the translation process – are represented. With a record linkage procedure these persons are linked later in the process.

5.3.8. Register checks

Using the procedure stated above, single registers as well as the consistency between registers can be analysed and evaluated. This should lead to a relatively small number of clearing cases. Those cases will be cleared, leading to the deletion or approval of a main residence, by obtaining further information from registers (“sign of life”) or by questioning the people concerned.


5.4. A Case Study: the Italian Post Enumeration Census 2001 (Marco Fortini - ISTAT)

5.4.1. Introduction

In this example, the main features of the 2001 Italian census post enumeration survey (PES) are described, with particular emphasis on the record linkage procedure. Some figures from the PES are also given.

Finally, it is briefly shown why this record linkage is applied, illustrating the theory underlying the capture-recapture model (used to estimate the Italian 2001 census coverage rate) and the totals to be estimated in order to apply the capture-recapture model.

5.4.2. The Italian 2001 Census Post Enumeration Survey (PES)

The main Italian Census objective is twofold:

• to count the resident population at the census date;

• to characterize Italian families; in doing so, the relationship of each enumerated person with the reference person of the household was also collected.

Consequently the PES objective consists in estimating the census coverage rate.

The main features of the PES sampling design are:

• two stage sampling design of municipalities (98) and enumeration areas (EA) (about 1100);

• the final size of the PES's sample was about 70.000 households and 180.000 people;

• a comparable amount of households and people were selected from the census database with respect to the same EAs;

• the census enumeration process is replicated for each of the sampled EAs.

Other important features are:

• a capture-recapture model for estimating the hidden amount of the population was used;

• a record linkage between the lists of people built up by the Census and the PES was carried out in order to apply the capture-recapture model;

• a rate of coverage, consisting of the ratio of the people enumerated at the census day over the hidden amount of the population, was computed with respect to some variables: age, gender, citizenship, marital status, level of education.

In this context some more details of the record linkage workflow are given, taking into account that capture-recapture models require the absence of errors during the record linkage procedure. This is a strong assumption since the linkage accuracy is crucial and even a few linkage errors can affect the accuracy of the coverage rate estimates.

For this reason a structured record linkage workflow, consisting of different phases and iterations, has been adopted in order to ensure the maximum correctness of the matches between the PES and the Census. Both empirical and probabilistic record linkage techniques were used, under different comparison rules, in the various phases of the workflow.

The phases of the PES record linkage can be summed up by a hierarchical approach:


• The first phases aim to identify the easiest matches, by means of straightforward computational procedures and leaving the more difficult ones to the subsequent phases;

• Subsequent steps of the record linkage workflow were driven by the hierarchical structure of the data in order to take advantage of the relationships among individuals belonging to the same household;

• Indeed, people can be grouped by household membership. This suggests starting by first linking households and then individuals;

• The next figure shows a graphical description of the workflow.

The record linkage steps are the following.

Step 1: A deterministic linkage is performed on the households after a preliminary standardization and parsing of names, surnames and addresses.

Key variables are: name, surname and date of birth of the household reference person; address of the household; number of male and female components.

Step 2: A probabilistic record linkage, such as the one based on the Fellegi-Sunter model, is carried out on the set of households not matched at the previous step. The key variables are those used in the previous step.
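Purely as an illustration of Step 1, the following sketch compares standardized household keys by exact equality; the standardization and the field names are assumptions, not taken from the actual PES software.

import re

def standardize(text):
    # Rough standardization: upper case, keep only letters, digits and spaces
    text = re.sub(r"[^A-Z0-9 ]", " ", text.upper())
    return re.sub(r"\s+", " ", text).strip()

def household_key(h):
    # Step 1 key: reference person, address and household composition
    return (
        standardize(h["ref_name"]), standardize(h["ref_surname"]), h["ref_birth_date"],
        standardize(h["address"]), h["n_males"], h["n_females"],
    )

def deterministic_household_links(census_households, pes_households):
    # Exact agreement on the whole key links a PES household to a Census one
    census_index = {household_key(h): h["id"] for h in census_households}
    return [
        (census_index[household_key(p)], p["id"])
        for p in pes_households
        if household_key(p) in census_index
    ]

census = [{"id": "C1", "ref_name": "Maria", "ref_surname": "Rossi",
           "ref_birth_date": "1950-04-12", "address": "Via Roma, 1",
           "n_males": 1, "n_females": 2}]
pes = [{"id": "P7", "ref_name": "MARIA", "ref_surname": "rossi",
        "ref_birth_date": "1950-04-12", "address": "via roma 1",
        "n_males": 1, "n_females": 2}]
print(deterministic_household_links(census, pes))   # [('C1', 'P7')]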

[Workflow figure: Census and PES household lists enter Step 1 (deterministic linkage) and Step 2 (probabilistic linkage of the residual households); people belonging to matched households go through Steps 3.a and 4.a; the people still unmatched, together with those belonging to unmatched households, go through Steps 3.b and 4.b; the residual unmatched people are submitted to the final clerical review of Step 5.]

Step 3.a: A deterministic linkage is performed on people belonging to matched households. Only people inside the same household can be matched. Two records are matched if (standardized name, standardized surname, gender, year of birth) or (standardized name, day, month, year of birth) agree.
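The matching rule of Step 3.a can be sketched as follows (field names are assumptions; in the real procedure the comparison is restricted to people of the same matched household).

def step3a_match(census_person, pes_person):
    # Two records match if (name, surname, gender, year of birth) agree
    # or if (name, day, month, year of birth) agree, on standardized values
    a, b = census_person, pes_person
    rule1 = (a["name"], a["surname"], a["gender"], a["birth_year"]) == \
            (b["name"], b["surname"], b["gender"], b["birth_year"])
    rule2 = (a["name"], a["birth_day"], a["birth_month"], a["birth_year"]) == \
            (b["name"], b["birth_day"], b["birth_month"], b["birth_year"])
    return rule1 or rule2

alice_census = {"name": "ANNA", "surname": "BIANCHI", "gender": "F",
                "birth_day": 3, "birth_month": 7, "birth_year": 1972}
alice_pes = {"name": "ANNA", "surname": "BIANCHI ROSSI", "gender": "F",
             "birth_day": 3, "birth_month": 7, "birth_year": 1972}
print(step3a_match(alice_census, alice_pes))   # True, via the second combination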

Step 4.a: Residual people belonging to matched households are clerically checked in order to detect as many matches as possible. Complete names and surnames together with gender, date of birth and place of birth were used.

Step 3.b: The unmatched people coming from step 4.a are considered as input together with those belonging to non-linked households coming from step 2. A deterministic linkage is carried out in this context. Key variables: standardized address, standardized name, standardized surname, gender and date of birth.

Step 4.b: The people unlinked in step 3.b are tentatively matched by means of a probabilistic record linkage. The key variables are those used at step 3.b plus the address number, place of birth and marital status.

Step 5: The residual units, not linked at the previous steps, are finally submitted to a clerical review in order to recover further matches. All available information is considered.

Although the quality of the key variables was fairly good, the project had to face the following problems:

– quite large hierarchical data sets had to be managed;

– negligible matching error rate was required.

The hierarchical structure suggests distinguishing record linkage workflow iterations at two levels: we first match records at a higher level (households), and then at a lower level (people). In this way we take advantage of the hierarchical structure, reducing the search space and, moreover, increasing the number of real matches. The dimension of the data sets implies a high complexity of the linkage algorithm; this suggests applying blocking techniques to reduce the complexity of the linkage.

Due to the size of the data sets, a direct use of the probabilistic model could have been time consuming. Therefore, the deterministic model is applied first, to be refined by the subsequent use of the probabilistic model. The high quality of the data implies the choice of equality as comparison function in most of the phases. The requirement of non-significant errors in the matching process suggests the adoption of a probabilistic model in the final iterations, in order to have a quantitative estimation of the errors that can be regarded as acceptable or not. Moreover, this requirement also suggests the appropriateness of a clerical review and of an exact comparison function in order to achieve the desired error bounds.

The following two tables show the generally good quality of the results, given that less than 1% of the matches is obtained by means of the final clerical review and a very low number of matches is linked with fewer than four key variables in agreement.


Table 1 - Distribution of matched pairs for each linkage step

Step                                                                                  %
3.a) Deterministic linkage of people belonging to households previously
     matched at steps 1 and 2                                                      89,01
3.b) People not matched at the previous step 3.a and subsequently matched
     by a manual review                                                             6,97
4.a) Deterministic linkage of people without reference to their household
     membership                                                                     1,31
4.b) Probabilistic record linkage of people not linked at previous step             1,85
5)   People matched by the concluding manual review                                 0,86
Sum                                                                               100,00

Number of key variables agreeing for matched pairs       %      Cum. %
10                                                      59,90    59,90
9                                                       29,43    89,33
8                                                        6,67    96,00
7                                                        1,16    97,16
6                                                        1,84    99,00
5                                                        0,81    99,81
4                                                        0,14    99,95
Less than 4                                              0,04   100,00
Sum                                                    100,00      --

Table 2 - Percentage distribution of matches by number of agreeing key variables

Among the main results, a satisfactory national coverage is achieved, better in the north-east (NE) of the country and in the smaller municipalities, whereas problems remain for the largest municipalities, in the centre of the country and in the small municipalities of the north-west (NW).

Percentage of Census coverage by territorial area and class of municipality population

Class of municipality population     Italy    NW      NE      Centre   South   Islands
Overall                              98.55    98.44   99.35   97.47    98.81   98.76
0-10,000                             99.30    98.69   99.60   99.45    99.52   99.98
10,001-100,000                       99.00    99.32   99.51   98.20    99.04   98.94
100,000+                             98.21    98.23   98.97   97.46    97.36   97.76
Cities                               95.89    96.24   97.89   94.68    96.15   96.80


Percentage of Census coverage by gender, citizenship and age class, by territorial area and by class of municipality population (thousands of inhabitants)

               Italy    NW      NE      Centre  South   Islands   0-10    10-100   100+    Cities
Overall        98.55    98.44   99.35   97.47   98.81   98.76     99.30   99.00    98.21   95.89
Male           98.48    98.27   99.35   97.50   98.71   98.65     99.16   98.92    98.06   95.91
Female         98.62    98.61   99.35   97.45   98.90   98.86     99.43   99.08    98.34   95.88
Foreigners     89.66    88.76   97.29   81.15   94.70   91.16     93.83   94.75    94.18   73.92
Age 0-5        97.92    98.41   99.48   95.86   97.21   99.35     98.68   98.43    98.60   94.29
Age 6-13       98.34    97.78   99.36   97.07   98.69   98.92     98.87   99.07    98.24   94.82
Age 14-19      98.62    98.51   98.86   98.18   98.86   98.52     99.50   99.13    97.29   95.49
Age 20-29      97.82    97.28   98.47   96.41   98.73   98.06     99.21   98.70    97.16   92.66
Age 30-44      98.42    98.36   99.32   97.06   98.82   98.55     99.21   99.05    97.94   95.25
Age 45-54      98.83    98.93   99.41   97.66   99.26   98.82     99.51   99.13    98.13   96.95
Age 55-64      98.92    98.76   99.72   98.43   98.81   98.99     99.28   99.19    98.87   97.54
Age 65-74      99.02    99.02   99.68   98.09   99.17   99.38     99.61   99.07    98.78   97.87
Age 75-84      99.23    99.04   99.97   98.78   99.19   99.33     99.63   99.05    99.35   98.80
Age 84+        98.80    98.54   99.82   97.77   99.39   98.47     99.79   98.92    98.86   96.34

Among the other results, the coverage is rather low for:

– People legally separated (94.65%);

– Divorced people (96.69%);

– Graduates (97.37%);

– Students (97.37%);

– Businessmen (96.46%);

– People living in rural areas (95.74%).

Moreover, people living in the largest cities have a uniformly greater risk of being missed by the census than comparable people residing in smaller municipalities, and people living alone also have a greater risk of being missed (96.21% coverage).

Finally, it should be noted that a coverage rate that varies across the classes of a given variable also introduces a bias in the estimated distribution of that variable.

5.4.3. An overview of the capture-recapture model in theory

The main goal of a capture-recapture application is to estimate the size of a hidden population through multiple enumerations of its eligible members and the subsequent matching of the records across the different sources. Such studies are traditionally used for the estimation of animal abundance and are well covered in the statistical literature (Seber, 1982). The method is also known as dual system estimation when it is based on only two sources; it was applied to human populations as early as 1786, when Laplace used it to estimate the French population.

The capture-recapture method can also be viewed within the theoretical framework of log-linear models with structural zeros (Bishop et al., 1975), an approach that is particularly useful when modelling multiple recapture experiments.

The capture-recapture model relies on several strong assumptions, namely:

the population, of size N, is closed, i.e. it does not change during the experiment;


the frequency distribution function for the number of individuals at each capture is multinomial;

the captures are statistically independent;

the number of captures is detected without error for every unit;

the captures are not affected by over-counts;

for each single occasion all the units have the same probability of being captured.

The situation with two enumeration occurrences can be summarised in the following 2x2 table, where x11 denotes the people enumerated at both occurrences, x10 and x01 those enumerated only at the first or only at the second occurrence respectively, x00 the (unobserved) people missed by both occurrences, and a dot denotes a marginal total (e.g. x1. = x11 + x10).

                          II occurrence
                          +        -        tot
   I occurrence    +      x11      x10      x1.
                   -      x01      x00      x0.
                   tot    x.1      x.0      N

A particular capture-recapture estimate is given by the Petersen estimator with two complete surveys. The complete likelihood function for a Petersen model is

$$L(N,p_{11},p_{10},p_{01}) = \frac{N!}{x_{11}!\,x_{10}!\,x_{01}!\,(N-x_{11}-x_{10}-x_{01})!}\; p_{11}^{x_{11}}\, p_{10}^{x_{10}}\, p_{01}^{x_{01}}\, p_{00}^{\,N-x_{11}-x_{10}-x_{01}},\qquad p_{00}=1-p_{11}-p_{10}-p_{01},$$

and it is not identifiable without further assumptions, since four parameters have to be estimated from only three independent observations ($x_{11}$, $x_{10}$, $x_{01}$). When instead the independence hypothesis holds, so that $p_{11}=p_{1\cdot}\,p_{\cdot 1}$, the model becomes identifiable and the likelihood simplifies to

$$L(N,p_{1\cdot},p_{\cdot 1}) = \frac{N!}{x_{11}!\,x_{10}!\,x_{01}!\,(N-x_{11}-x_{10}-x_{01})!}\; p_{1\cdot}^{\,x_{1\cdot}}(1-p_{1\cdot})^{\,N-x_{1\cdot}}\; p_{\cdot 1}^{\,x_{\cdot 1}}(1-p_{\cdot 1})^{\,N-x_{\cdot 1}}.$$

In this case, sufficient statistics for $N$, $p_{1\cdot}$ and $p_{\cdot 1}$ are the marginal counts $x_{1\cdot}=x_{11}+x_{10}$ and $x_{\cdot 1}=x_{11}+x_{01}$ together with $x_{11}$.
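As a standard derivation sketch (not reproduced from the original report), maximizing this likelihood in $p_{1\cdot}$ and $p_{\cdot 1}$ for fixed $N$, and then in $N$, yields the familiar Petersen (dual system) estimates:

$$\hat p_{1\cdot}=\frac{x_{1\cdot}}{N},\qquad \hat p_{\cdot 1}=\frac{x_{\cdot 1}}{N},\qquad \hat N \approx \frac{x_{1\cdot}\,x_{\cdot 1}}{x_{11}}\ \text{(up to integer rounding)},$$

so that, after substituting $\hat N$ back, $\hat p_{1\cdot}=x_{11}/x_{\cdot 1}$ and $\hat p_{\cdot 1}=x_{11}/x_{1\cdot}$.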

The Petersen model has also been extended to the case in which one of the two occurrences is obtained through sampling (Wolter, 1986). In this case the territory is partitioned into enumeration areas (EAs) and a sample of EAs is drawn, where

M: number of EAs in the whole territory;

m: number of sampled EAs.

Denoting by $x_{i\cdot 1}$ the dichotomous variable that equals 1 for every person enumerated by the PES in the sampled EAs, and by $x_{i11}$ the corresponding variable that equals 1 if and only if the $i$-th PES person was also enumerated at the census, sample (expansion) estimates of the total number of people enumerated by the PES and of the number of people enumerated by both the Census and the PES are obtained respectively from

$$\hat x_{\cdot 1}=\frac{M}{m}\sum_{i} x_{i\cdot 1}\,;\qquad \hat x_{11}=\frac{M}{m}\sum_{i} x_{i11}\,.$$


If the sampling estimates are then plugged into the maximum likelihood estimates of the model,

$$\hat N=\frac{x_{1\cdot}\,\hat x_{\cdot 1}}{\hat x_{11}}\,;\qquad \hat p_{1\cdot}=\frac{\hat x_{11}}{\hat x_{\cdot 1}}\,;\qquad \hat p_{\cdot 1}=\frac{\hat x_{11}}{x_{1\cdot}}\,,$$

estimates of the coverage rates of both sources are obtained together with the estimate of the population size.
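As a purely illustrative numerical sketch of this plug-in computation, the following Python fragment uses invented counts (they are not taken from the Italian application):

# Illustrative Petersen computation with a sampled second occurrence (PES).
# x1_dot: people enumerated by the census (complete count).
# The PES totals are estimated by expanding the counts observed in the m sampled
# enumeration areas to the M areas of the whole territory. All numbers are invented.

M, m = 5000, 250                 # enumeration areas in the territory / in the sample
x1_dot = 980_000                 # census count
pes_sample_total = 49_100        # PES persons counted in the sampled EAs
both_sample_total = 48_020       # of which also found in the census via record linkage

x_dot1_hat = (M / m) * pes_sample_total    # estimated PES total
x11_hat = (M / m) * both_sample_total      # estimated number enumerated by both sources

N_hat = x1_dot * x_dot1_hat / x11_hat      # Petersen (dual system) estimate of N
p1_dot_hat = x11_hat / x_dot1_hat          # estimated census coverage rate
p_dot1_hat = x11_hat / x1_dot              # estimated PES coverage rate

print(round(N_hat), round(p1_dot_hat, 4), round(p_dot1_hat, 4))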

Since record linkage is the method by which the number of people counted in both occurrences of a capture-recapture study is determined, matching errors can severely affect the capture-recapture estimates. In particular, two kinds of error can arise because of errors or missing information in the matching variables:

• missed matches, i.e. pairs of records that refer to the same unit but are not linked because of such errors;

• false matches, i.e. pairs of records that belong to different units but are linked because of such errors.

A simulation exercise can illustrate to what extent this problem can affect the estimates.

Let us suppose that the population size to be estimated is N = 1000 and that the coverage rate of each occurrence is 0.98. Moreover, a fixed probability of linking a pair by mistake (false match) and a fixed probability of failing to link a true match (missed match) are assumed.

In the absence of matching errors the expected counts are

$$E(x_{1\cdot})=N\,p_{1\cdot}\,;\qquad E(x_{\cdot 1})=N\,p_{\cdot 1}\,;\qquad E(x_{11})=N\,p_{1\cdot}\,p_{\cdot 1}\,,$$

whereas $x_{11}^{*}$ denotes the corresponding expected number of people enumerated by both occurrences when it is computed in the presence of matching errors.

The table below shows the estimates of N and $p_{1\cdot}$, computed both with matching errors (reported with an asterisk) and without them. Under the examined conditions, the estimates affected by matching errors show a severe bias and a much lower efficiency than those not affected by matching errors.

Simulated data: 1,000 samples of 1,000 units each

Estimate                             Mean      s²        p5%       p95%
N (no matching errors)               1000.02   0.37      999.18    1000.60
N* (with matching errors)            1020.05   22.62     1012.81   1028.20
p1. (no matching errors)             0.9800    1.9E-5    0.9704    0.9868
p1.* (with matching errors)          0.9608    3.6E-5    0.9473    0.9704
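The following short Python sketch, offered only as an illustration of the mechanism and not as the code behind the table above, simulates the same kind of experiment; the missed-match probability is an assumption chosen for illustration, and false matches are ignored for brevity.

import random

# Illustrative simulation of the effect of missed matches on the Petersen estimate.
N, p1, p2 = 1000, 0.98, 0.98
miss_prob = 0.02            # probability that a true match is not linked (assumed)
random.seed(1)

est_no_err, est_err = [], []
for _ in range(1000):
    x1 = sum(random.random() < p1 for _ in range(N))                 # caught at occasion I
    x11 = sum(random.random() < p2 for _ in range(x1))               # also caught at occasion II
    x_1 = x11 + sum(random.random() < p2 for _ in range(N - x1))     # occasion II total
    x11_star = sum(random.random() >= miss_prob for _ in range(x11)) # matches surviving linkage errors
    est_no_err.append(x1 * x_1 / x11)
    est_err.append(x1 * x_1 / x11_star)

print(sum(est_no_err) / 1000, sum(est_err) / 1000)   # the second mean is biased upwards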

5.5. A Case Study: Challenges of the Austrian Register-based Census (Manuela Lenk - Statistics Austria)


In 2010 Austria is replacing the traditional census with a register-based census. The legal basis is the register-based census law of 16 March 2006. This law also stipulated that a census test be carried out in October 2006. The census test, with the reference date of 31 October 2006, covers the complete federal territory, as the register-based census of 2010 will.

5.5.1. The principles of the register based census

The following bullet points outline the principles of the concept:

Eight "base registers” are used (mentioned below). A decision was made about which register to use as the base register for each variable (if more than one register contains this variable). Many variables are also collected from seven "comparison registers” (mentioned below), which are used as proof for the confirmation of the values in the base registers (principle of redundancy)11.

The data will be collected without the registered names and without the social security PIN, both of which have to be replaced by the branch-specific personal identification number (bPIN OS, see below) when the data are delivered to Statistics Austria, for data protection reasons.

The data from the registers will be linked using the bPIN OS and afterwards checked for consistency and adjusted according to plausibility rules.

Remaining gaps in the registers have to be closed by statistically sound estimation procedures.

Not all of the data available in the base and comparison registers is collected for the census; only those variables are gathered which have already been used in previous censuses.

As some variables are not available in any register, it is not possible to collect them in a register-based census, e.g. colloquial language or some characteristics of commuting such as the duration of the journey or the means of transport. The biggest disadvantage of the register-based census is the fact that the variable "occupation" is not included in large registers and is therefore not collected in the census, although it is a core variable in the UN census recommendations. This variable will therefore not be available at local level in Austria, but only at the regional level of NUTS II ("Länder"), using a different source such as the labour force survey.

A lot of work has been done and is ongoing to improve the quality of many registers to fulfil the requirements of the census. In some cases even further variables have been added to a register. The most important example is the address of the place of work which has been added to the social security records to provide annual commuting data.

In order to check inconsistencies in the data Statistics Austria is entitled to get in contact with the respective register authority.

11 See Section 4.3 of this report


If further investigation is necessary where there are severe inconsistencies or there is the suspicion that a person has already left Austria in spite of being included in the central population register (CPR), the names and addresses of people have to be delivered from the register to Statistics Austria. According to the response or non-response to a letter addressed to the people concerned, the data can be clarified or a decision has to be taken on whether to include the people in the census or not.

In cases where Statistics Austria intends to exclude a person from being counted in the municipality in which they are registered in the central population register as having their main residence12, this municipality has to be informed of the intended non-inclusion to enable it to apply for re-inclusion with proof of the existence of the main residence.

5.5.2. The registers

For each variable it is regulated by law which register is the base one and which registers are the comparison ones.

5.5.2.1. The base registers

For the register-based census Statistics Austria receives data for the reference date from eight base registers containing the variables required by law. Statistics Austria is the owner of four of these base registers.

Statistics Austria's registers:

Business register of enterprises and their local units (BR)

Housing register of buildings and dwellings (HR)

Register of educational attainment (EAR)

Register of enrolled pupils & students (PSR)

Other registers:

Central population register (CPR)

Central social security register (CSSR)

Tax register (TR)

Unemployment register (UR)

The data from these registers contains all the variables required for the register-based census.

12 The definition of "main residence" is similar to the UN definition of the "usual residence” in the UN Recommendations on Statistics of International Migration.


5.5.2.2. The comparison registers

To keep the quality of the census results as high as possible, the base registers will be compared with the comparison registers (principle of redundancy).

Seven registers used only for comparison:

Child allowance register (CAR)

Central foreigner register (CFR) - available as of 2008

Registers of public servants of the federal state and the Länder (RSP)

Register of car owners (COR)

Register of social welfare recipients (SWR)

Conscription Register (CR)

Register of alternative civilian service (ACSR)

In addition, the base registers serve as comparison registers for one another, since for many variables a base register is also compared with other base registers.

The comparison-only registers contain primarily basic demographic data, e.g. main residence, nationality and sex, and information about employment, e.g. full-time or part-time. On the other hand, they provide very specific information, such as the place (local unit) and branch (NACE) of work, or about people doing military or alternative civilian service.


[Diagram (not reproduced): linkage keys between the registers. Persons are linked via the bPIN; buildings and dwellings via the address code; local units of employment via the address code and the employer-PIN. The registers connected in this way include the HR, CSSR, CPR, EAR, UR, BR and TR.]


5.5.3. Linking and adjusting basic and comparative data

5.5.3.1. Record-linking of persons

The linking is conducted by means of a "branch-specific personal identification number for official statistics" (bPIN OS). The complete governmental administration is divided by law into several branches such as "social security", "taxes", "health", "official statistics" and so on. Each branch has its own branch-specific PIN. These PINs, which serve for data-protected communication between public authorities within e-government, are derived from the central population register by the Austrian Data Protection Commission (DPC) through the following procedure.

The owner of a register has to ask the DPC for this PIN for each person by sending the name, date of birth, place of birth and address. The bPIN OS, like the other bPINs, is derived from the PIN in the central population register (CPR) using a special and very complex algorithm known only to the DPC. The bPIN OS is given to register owners other than Statistics Austria only in an encrypted (ciphered) form. The register owner has to send the data, together with the encrypted bPIN OS and an accompanying record number of its own, to Statistics Austria. The latter number serves to identify the respective record in case of further inquiries by Statistics Austria.

Only Statistics Austria is able to decrypt (decipher) its "own" PINs, i.e. the bPIN OS dedicated to statistical purposes, and to use these PINs as a common linking variable.

In practice, Statistics Austria faces the problem that the names and dates of birth in the registers are sometimes inaccurate. "Data twins" and other people who cannot be found by the procedure described above, and therefore cannot be linked, pose problems.
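Purely as an illustration of the mechanics of PIN-based linkage (the file layouts and column names below are invented and are not the actual Statistics Austria formats), linking two register extracts on the decrypted bPIN OS could look as follows:

import pandas as pd

# Hypothetical register extracts: each record carries the (decrypted) bPIN OS used as
# the common linking key plus the register's own accompanying record number.
cpr = pd.DataFrame({
    "bpin_os": ["A1", "A2", "A3"],
    "cpr_rec_no": [101, 102, 103],
    "main_residence": ["Wien", "Graz", "Linz"],
})
cssr = pd.DataFrame({
    "bpin_os": ["A1", "A3", "A4"],
    "cssr_rec_no": [9001, 9003, 9004],
    "employment_status": ["employed", "pensioner", "employed"],
})

# A left join keeps every CPR person; those without a CSSR entry get missing values and
# would be candidates for further checks against other registers (e.g. UR, SWR).
linked = cpr.merge(cssr, on="bpin_os", how="left")
print(linked)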

5.5.3.2. Linking of addresses

As the diagram shows, the housing register of buildings and dwellings is very important as it is principally able to connect all addresses in the registers by a numerical address code. The housing register is linked with the central population register and with the business register. These registers contain the same addresses (numerical codes) for buildings but not always for door numbers. The register of buildings and dwellings is highly reliable concerning the buildings. As far as dwellings are concerned, the linking of dwellings with people registered in the central population register is less successful due to many missing door numbers. In general, the linking of the central population register, the housing register and the business register is much more reliable for buildings than for dwellings. This may pose problems for the identification of households and families.

5.5.3.3. Linking of persons to the business register

Each record of an employee in the central social security register includes a PIN for the employer. This employer-PIN can be used to link the records of employees with the business register, where it is used as an identification variable.


5.5.4. The population census

5.5.4.1. The Population

The base information for the population comes from the central population register, e.g. main residence, date of birth etc.

There are some special rules for people who change their main residence in the central population register around the reference date of the census. First, immigrants are counted as being mainly resident in Austria only if they stay for at least 90 days; otherwise they are considered mere visitors who are not part of the population of Austria. The same applies to emigrants the other way round: if people deregister to leave the country and register again in Austria within 90 days, they are considered to have been absent only temporarily and therefore to have lived in Austria without a break (in the same or the previous municipality). This rule is consistent with the definitions given in the UN Recommendations on Statistics of International Migration.

Secondly, another rule should reduce so-called "census tourism": In the past, due to pressure by mayors, a personal attachment to the municipality of their secondary residence or to the residence where they grew up, a lot of people changed their main residence shortly before the reference date of the census and returned to their previous residence shortly afterwards (not in reality, but in their register record). For this reason, there is a special procedure if they migrate back to the former municipality after the reference date, within 180 days of the first migration. They have to be counted in the former municipality in the census (but not in the central population register), although they were registered in the other municipality on the reference date.

In order to be able to count people who are recorded in the central population register also in the census population it is necessary to check if there is information about them in the other registers. For that purpose the central social security register is the most important comparison register. An employment or pension entry can be found in this register for the vast majority of the population. If people become unemployed or are receiving social welfare benefits, an entry is found in the unemployment register and the register of social welfare recipients.

If an entry can only be found in the central population register, without any entries in other registers, a special procedure called "residency analysis" is used, which is a signs-of-life analysis. Statistics Austria is authorised to ask the registrar for the names and addresses of the people concerned, to whom a letter is addressed that they are obliged to answer. If there is no reaction to the letter and there are no other signs of existence, the person is to be deleted from the census records. As already mentioned in the introductory section on principles, in this case the municipality in which the people are registered as having their main residence13 has to be informed of the intended non-inclusion to enable it to apply for re-inclusion with proof of the existence of the main residence. Otherwise, or if there is a negative decision by Statistics Austria, the records of the people concerned have to be deleted.

13 The definition of "main residence" is similar to the UN definition of the "usual residence” in the UN Recommendations on Statistics of International Migration.
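A minimal sketch of the signs-of-life check underlying the residency analysis, with entirely hypothetical register structures and identifiers, might read:

# Hypothetical "signs of life" check: a CPR person with no entry in any comparison
# register becomes a candidate for the residency analysis (postal inquiry).

def residency_analysis_candidates(cpr_pins, comparison_registers):
    # cpr_pins: set of bPIN OS values in the central population register;
    # comparison_registers: dict mapping a register name to the set of bPIN OS values found there.
    signs_of_life = set().union(*comparison_registers.values()) if comparison_registers else set()
    return cpr_pins - signs_of_life

cpr_pins = {"A1", "A2", "A3", "A4"}
registers = {
    "CSSR": {"A1", "A2"},
    "UR": {"A3"},
    "SWR": set(),
}
print(residency_analysis_candidates(cpr_pins, registers))   # {"A4"} -> a letter is sent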


5.5.4.2. Households and families

The lack of many flat door numbers in the housing register poses some problems for the household and family statistics. It is planned to use not only the information on door numbers from the two registers mainly containing that information, i.e. the central population register and the housing register, but also information from other registers with addresses. Additional sources could be used, e.g. information about living in the same household from the child allowance register.

It is partially possible to deduce the relationships between the members of a family from the register data, to construct the family statistically and to build the variable "family status", i.e. the position of a person within the family (parent, child). Basic data are available from the social security register for people who were at times out of the labour force and therefore not self-insured but covered by a family member's national health insurance; for these people their relationship to that family member is registered. One of the comparison registers is the child allowance register, with information about the parent-child relationship for children up to 18 or, if studying, up to 27 years of age. The remaining data gap has to be closed by estimates.

The necessary estimation procedures may be based not only on the register data but also on data from the labour force survey which is a very large sample survey.

5.5.4.3. Hints on some variables of the population census

Information about marital status should come from the central population register as the corresponding base register. However, this register has collected the marital status only for new registrations or registrations changed since July 2006. Only partial or even rudimentary information exists in the various comparison registers.

Statistics Austria obtains the base information for the employed and pensioners from the social security register and the tax register. To analyse other kinds of livelihood Statistics Austria uses the registers of military and alternative civilian service, the unemployment register, the register of social welfare recipients and information from the central social security register on people who are out of the labour force but covered by a family member's national health insurance. Information on pupils and students concerning their enrolment, the type of school or university they attend, the field of studies and the location of the school or university is provided by the register of enrolled pupils and students.

The register of educational attainment is the base register for the highest completed educational level.

There is no data available in the registers for the following variables although these variables have always been collected in Austrian censuses:

Occupation,

Mode of transport for commuters,

Duration of the daily commute,

Language(s) most commonly spoken,

Religion.


Most of them are only optional according to the UN Recommendations on Population and Housing Censuses, with occupation as the only exception, being a core census variable. The easiest way to collect these variables is to use a sample survey whose data are then matched with the census data. However, such data will not be available at local level, because the largest sample survey in Austria is the labour force survey, which is representative only for the NUTS II regions ("Länder").

Another possible way of collecting these variables is offered by the register-based census law: It authorises the government to enact a regulation which stipulates the need for a special survey on the variables "religion" and "language(s) most commonly spoken". However, such a regulation has not passed the Council of Ministers yet, presumably due to the costs of such a survey.

5.5.5. The housing census

The base register for the housing variables is the housing register of buildings and dwellings. No comparison register is available. There are almost no problems analysing the different variables for buildings. However, in urban areas with apartment buildings, there are the previously mentioned problems of the partial lack of door numbers. It is necessary to use estimation methods for this group of variables too. Again, an important source for sound estimations is the labour force survey.

5.5.6. The business census

The base register for the variables of enterprises and their local units of employment is the business register. It contains aggregated numbers of employed people for the whole enterprise or institution as well as for the local units of employment.

For comparisons in the public sector Statistics Austria is able to use the registers of public servants of the federal state and the "Länder".

Statistics Austria has more or less complete data on enterprises (concerning the number and structure of employees), but less information on local units of employment. Therefore, enterprises with only one local unit of employment are well registered, but those with more than one local unit are partly incomplete. The main data source for the latter is the structural business survey which covers NACE sections C-K for enterprises with 20 or more employed people (NACE C-F) or above a specified turnover (NACE G-K). Only limited information is available for local units of enterprises below these thresholds and for those which belong to other economic branches (NACE sections L to O).

Statistics Austria will receive better information for those local units of employment from 2008 onwards. Each employer is obliged to submit the address of the place of work of each employee together with their social security record and their tax record to Statistics Austria once a year. This information will be used to provide individual commuting data and will also be incorporated into the business register in terms of the number of employed people in the local units.


5.5.7. Special features of the 2006 census test

The register-based census test of 2006 is a country-wide register-based census. However, it has no financial consequences for the municipalities and no consequences for the number of votes necessary to win a seat in parliamentary elections. Statistics Austria is checking the results of this test by comparing them with a sample survey with the same reference date. The questionnaire contains only items designed to check the quality of the registers.

Modalities of the sample survey:

An area sample of 25,000 people (about three per thousand of the total population) in 100 sample areas;

The people concerned were obliged to answer the questionnaire.

5.5.8. Future work

Evaluation report on the census test to the federal government by April 2008;

Amendment of the concept by summer 2008;

Required modifications of regulations from autumn 2008 onwards;

Preparation of the 2010 register-based census;

Implementation of the register-based census with the reference date of 31 October 2010.


5.6. Bibliography

Arbeitsmarktservice Österreich, Geschäftsbericht 2006, http://www.ams.or.at/neu/001_GB-Druckversion-neu_290607.pdf

Bishop, Y.M.M., Fienberg, S.E. and Holland, P.W., 1975. Discrete Multivariate Analysis: Theory and Practice. MIT Press, Cambridge, MA.

Hauptverband der österreichischen Sozialversicherungen, Organisationsbeschreibung, continuously adjusted, since 2001.

Informationen des Stammzahlenregisters, http://www.stammzahlenregister.gv.at/allgemeines.htm

Kish, L., 1965. Survey sampling. John Wiley & Sons, New York.

SAS Institute Inc. 2004. SAS® 9.1 OLAP Server: MDX Guide. Cary, NC: SAS Institute Inc.

SAS Data Integration Studio 3.4 User’s Guide, http://support.sas.com/documentation/onlinedoc/etls/usage34.pdf

Seber, G.A.F., 1982. The Estimation of Animal Abundance and Related Parameters, 2nd edition. Charles Griffin & Company Ltd, London and High Wycombe.

van der Laan, P. and van Tuinen, H.T., 1996. Increasing the Relevance of Income Statistics. Paper prepared for the first meeting of the Expert Group on Household Income Statistics, Canberra, Australia, 2-4 December 1996.

van der Laan, P., 2000. Integrating administrative registers and household surveys. Netherlands Official Statistics, Volume 15, Summer 2000.

Wallgren, A. and Wallgren, B., 2007. Register-based Statistics: Administrative Data for Statistical Purposes. John Wiley & Sons.

Wolter, K.M., 1986. Some Coverage Error Models for Census Data. Journal of the American Statistical Association, Vol. 81, No. 394, pp. 338 ff.
