
Eurostat

The statistical matching problem: definition

Training Course «Statistical Matching»

Rome, 6-8 November 2013

Mauro Scanu
Dept. Integration, Quality, Research and Production Networks Development, Istat
scanu [at] istat.it


Outline

• Motivation
    • Example 1: Time use and Labour force surveys
    • Example 2: The Social Accounting Matrix (SAM)
    • Example 3: Farms accounting surveys and Accountancy data
    • Example 4: Microsimulation
• Methodological issues
• Contents of the course
• References


Statistical matching

Assume that data are collected in two sample surveys, A and B, of sizes $n_A$ and $n_B$, drawn from the same population.

• Some variables X are observed in both samples
• Variables Y are observed only in survey A
• Variables Z are observed only in survey B

The goal is inference on (X,Y,Z), or at least on the bivariate (Y,Z).
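The data structure just described can be mimicked with toy records. The sketch below is illustrative only: the variable values, the sample sizes and the helper `draw_unit` are assumptions made for the example, not part of the course material.

```python
import random

random.seed(0)

def draw_unit():
    """Draw one hypothetical population unit (x, y, z)."""
    x = random.choice(["young", "adult", "senior"])  # common variable X
    y = random.gauss(8.0, 1.0)                       # variable Y (kept by survey A only)
    z = random.choice(["more", "same", "fewer"])     # variable Z (kept by survey B only)
    return x, y, z

n_a, n_b = 5, 7  # assumed sample sizes

# Survey A records (X, Y); Z is never recorded.
sample_a = [{"X": x, "Y": y, "Z": None}
            for x, y, _ in (draw_unit() for _ in range(n_a))]
# Survey B records (X, Z); Y is never recorded.
sample_b = [{"X": x, "Y": None, "Z": z}
            for x, _, z in (draw_unit() for _ in range(n_b))]

# Stacking the two files gives n_a + n_b records, none with both Y and Z.
combined = sample_a + sample_b
jointly_observed = [r for r in combined
                    if r["Y"] is not None and r["Z"] is not None]
print(len(combined), len(jointly_observed))  # prints: 12 0
```

The empty `jointly_observed` list is the defining feature of the problem: any statement about (Y,Z) has to come from X together with some assumption or additional information.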


Goal: estimation of parameters describing (Y,Z) or (X,Y,Z)


Motivating example 1: TUS--LFS

The objective of integrating the Time Use Survey (TUS) and the Labour Force Survey (LFS) is to create, at the micro level, a synthetic file of both surveys that allows the study of the relationships between the variables measured in each survey.

By using the specific variables of both surveys together, one can analyse the characteristics of employment and the time balances at the same time.

• Information on labour force units and on the organisation of their life times will help enhance the analyses of the labour market
• The analyses of working conditions that result from the LFS will complement the more general TUS analysis of the quality of life

The possibilities for reciprocal enrichment have been widely recognised (see the 17th International Conference of Labour Statisticians in 2003 and the 2003 and 2004 works of the Paris Group). The emphasis was on how the integration of the two surveys could contribute to analysing the different modes of participation in the labour market determined by flexibility in working hours and contracts.

Among the issues raised by researchers on time use, we list the following two:

• The usefulness and limitations of combining various sources, such as labour force and time-use surveys, for improving data quality
• Time-use surveys are useful especially for measuring the hours worked by workers in the informal economy, in home-based work, and in the hidden or undeclared workforce, as well as for measuring absence from work

What is the input of the analysis?

Specific variables in the TUS (Y): they make it possible to estimate the time dedicated to daily work and to study its level of "fragmentation" (number of intervals/interruptions), its flexibility (exact start and end of working hours) and its interrelations with the other life times.

Specific variables in the LFS (Z): the breadth of the information gathered allows us to examine the distinctive aspects of Italian participation in the labour market: professional condition, sector of economic activity, type of working hours, job duration, occupation, etc. It is also possible to investigate dimensions related to the quality of the job.

The statistical matching problem corresponds to the joint analysis of the specific variables in TUS and LFS.

• Y: hours worked on an average weekday (average generic length)
• Z: willingness to work a different number of hours from those actually worked in the reference week (records from LFS, first quarter of 2003)

To obtain the previous result, we followed these steps:

1) impute one file (TUS) with records from the other file (LFS) by means of the common variables

2) estimate the quantities of interest from the completed file

NOTE -- in this case, it is possible to prove that a particular model for the relationship among the variables is being assumed: Y and Z are assumed to be independent given X (the conditional independence assumption, CIA).
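The two steps can be sketched in code. This is a minimal illustration only, assuming numeric common variables and a nearest-neighbour (distance hot deck) donor rule, which is one of several possible imputation rules. The records, the variable names (`age`, `sex`) and the helper functions are hypothetical and are not taken from the TUS--LFS application.

```python
def nearest_donor(recipient, donors, match_vars):
    """Return the donor record closest to the recipient on the common variables."""
    def distance(donor):
        # Manhattan distance on the (assumed numeric) matching variables.
        return sum(abs(recipient[v] - donor[v]) for v in match_vars)
    return min(donors, key=distance)

def impute_z(recipients, donors, match_vars, z_var):
    """Step 1: fill z_var in each recipient with the value of its nearest donor."""
    completed = []
    for rec in recipients:
        donor = nearest_donor(rec, donors, match_vars)
        completed.append({**rec, z_var: donor[z_var]})
    return completed

# Hypothetical toy records: X = (age, sex), Y = hours worked, Z = wish to change hours.
tus = [{"age": 25, "sex": 0, "Y": 7.5},
       {"age": 52, "sex": 1, "Y": 4.0}]
lfs = [{"age": 24, "sex": 0, "Z": "more"},
       {"age": 50, "sex": 1, "Z": "fewer"},
       {"age": 40, "sex": 1, "Z": "same"}]

synthetic = impute_z(tus, lfs, ["age", "sex"], "Z")

# Step 2: estimate what is needed from the completed file, e.g. the (Y, Z) pairs.
yz_pairs = [(rec["Y"], rec["Z"]) for rec in synthetic]
print(yz_pairs)
```

In this sketch the imputed Z is chosen using X alone, never Y, so any association between Y and Z in the synthetic file passes through X. That is how the conditional independence assumption enters the procedure.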



Summary

• objective: micro data for analyses
• ``traditional'' statistical matching framework
• use of imputation techniques
• use of a particular model: the CIA

Motivating example 2: SAM

The new system of national accounts (also known as the European System of Accounts, or ESA95) is a source of very detailed information on the economic behaviour of all economic agents, such as households and enterprises. A very important role in ESA95 is played by the social accounting matrix (SAM).

A SAM has two main objectives: first, to organise information about the economic and social structure of a country over a period of time (usually a year); second, to provide a statistical basis for building a plausible economic model that can present a static image of the economy and simulate the effects of policy interventions.

The SAM module on households is a matrix containing, for each household typology:

• the amount of expenditures (distinguished according to a very detailed list of expenditure categories)
• income (employee income, self-employment income, interest, dividends, rents, social security transfers)

Household typologies may be defined in different ways, e.g. by area (region) of residence, primary income source, or characteristics of the head of household.

An example of a social accounting matrix is the following (1995 data)

In general, we expect to have a table like this:


• $C=(C_1,\dots,C_u,\dots,C_U)$ represents the different expenditure typologies, e.g. food expenditures, durable goods expenditures, and so on
• $M=(M_1,\dots,M_v,\dots,M_V)$ denotes the different income typologies, e.g. salaries, dividends and interest, and so on
• $T_w$, $w=1,\dots,W$, represent the different household typologies of interest

What do we have in practice?


HBS (Istat Household Budget Survey) collects:

• some socio-demographic variables X (used for the construction of the household typologies T)
• a very detailed vector of consumption variables Z (e.g. if $C_1$ represents ``Food Consumption'', it can be obtained by combining HBS Z variables such as consumption of meat, eggs, fish, vegetables, and so on)
• a variable on income, TM•: the total monthly household income (categorical variable with 14 classes, not reliable)

The first two items allow the computation of the terms $c_{wu}$.
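How the terms $c_{wu}$ could be assembled from household records can be sketched as follows. This is a minimal illustration under stated assumptions: the item-to-category mapping, the typology rule, the survey weights and the records are all hypothetical, and the cells are taken to be weighted totals, which the slides do not specify.

```python
from collections import defaultdict

# Hypothetical mapping from detailed HBS items (Z) to SAM expenditure categories (C_u).
ITEM_TO_CATEGORY = {"meat": "food", "eggs": "food", "fish": "food",
                    "furniture": "durables"}

def household_typology(record):
    """Hypothetical rule building the typology T_w from the common variables X."""
    return (record["region"], record["income_source"])

def expenditure_by_typology(households):
    """Accumulate weighted item expenditures into cells c[w][u]."""
    cells = defaultdict(lambda: defaultdict(float))
    for hh in households:
        w = household_typology(hh)
        for item, amount in hh["items"].items():
            u = ITEM_TO_CATEGORY[item]
            cells[w][u] += amount * hh.get("weight", 1.0)  # weight is an assumption
    return cells

hbs = [
    {"region": "north", "income_source": "employee", "weight": 2.0,
     "items": {"meat": 100.0, "eggs": 10.0}},
    {"region": "north", "income_source": "employee", "weight": 1.0,
     "items": {"fish": 30.0, "furniture": 200.0}},
]

c = expenditure_by_typology(hbs)
print(dict(c[("north", "employee")]))
```

The income terms could be built from SHIW in the same way, replacing the consumption items with the detailed income variables.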

SHIW (Bank of Italy Survey on Household Income and Wealth) collects:

• some socio-demographic variables X (used for the construction of the household typologies T)
• a very detailed list of income variables Y from which the variables $M_v$, $v=1,\dots,V$, can be reconstructed
• a few generic questions on consumption Z (e.g. the amount of expenditures for durable goods, food, ...; not reliable)

The first two items allow the computation of the terms $m_{wv}$.


At first sight, the SAM can be estimated directly, at least for those rows where $T_w$ is available in HBS. However, even for these rows there is a problem: the two independent surveys sometimes produce inconsistent results. In other words, reconciling the definitions and concepts of the two surveys is not enough to allow the joint use of their estimates.

In this case, sampling variability produces estimates of the table entries that are incompatible with current economic theory. The incompatibility concerns the propensity to consume.

Approach 1: complete A (i.e. SHIW, the Bank of Italy survey and the smaller sample) by imputing records from B using the common information X. Again, this approach assumes the conditional independence of income and consumption given the common variables.

Approach 2: formalize the problem in its probabilistic components.


Simplified model: joint distribution of X, TM (total monetary income) and TC (total consumption) for a household typology $T_w$:

$P(X, TM, TC \mid T_w) = P(TC \mid X, TM, T_w)\, P(X, TM \mid T_w), \quad w=1,\dots,W$

• The joint distribution of X and TM can be easily estimated from SHIW.
• $P(TC \mid X, TM, T_w)$ cannot be estimated from SHIW: because of recall problems, its consumption variables are not as reliable and detailed as the HBS ones (which are observed through the compilation of a diary).
• $P(TC \mid X, TM, T_w)$ cannot be estimated from HBS either: (i) this survey observes TM•, which is not reliable, since asking directly for the total amount of income, as in HBS, usually leads to under-reporting; (ii) TM• is categorical, while TM in SHIW is continuous.

The proxy information from the variable TM• will be used as additional information.
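For comparison, the conditional independence assumption behind Approach 1 can be written in the same notation. The following is a sketch added for illustration, not a formula from the slides: under the CIA, total consumption TC does not depend on TM once X and the typology are given, so the problematic factor reduces to a term that involves only X and consumption.

```latex
\begin{align*}
P(X, TM, TC \mid T_w)
  &= P(TC \mid X, TM, T_w)\, P(X, TM \mid T_w) \\
  &\overset{\text{CIA}}{=} P(TC \mid X, T_w)\, P(X, TM \mid T_w),
  \qquad w = 1, \dots, W .
\end{align*}
```

Approach 2 avoids the second equality: it keeps TM in the conditioning set and relies on the proxy TM• to carry information on the dependence between income and consumption.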

Summary

• objective: table
• ``traditional'' statistical matching framework
• attempt to avoid the CIA
• use of a formal dependence model and of a proxy variable

Motivating example 3: FSS--FADN

The two surveys, the Farm Structure Survey (FSS) and the Farm Accountancy Data Network (FADN) survey, collect data on agricultural enterprises and are designed to investigate separate phenomena.

• FSS focuses on the structure of the farms (labour force, characteristics of the holder's family, crops, machinery and equipment, etc.)
• FADN focuses on the economic structure of the farms (costs, added value, household income, etc.)
• Common variables include utilised agricultural area (UAA), economic size units (ESU), livestock size units (LSU) and geographical characteristics


• Both surveys were carried out in 2003
• Total sample size is about 55,000 for the FSS and about 20,000 for the FADN survey
• The two surveys have similar stratified sample designs (both use region and farm size in terms of economic size units)
• Both surveys include take-all strata containing the largest farms
• To reduce respondent burden, the selection of units in the two surveys is negatively correlated (units selected in one sample are less likely to be selected for the other survey)


Structure of the common subset $C$

• The common subset is made of 883 units
• Many of them belong to the take-all stratum of the FSS
• It contains mainly large farms

Figure: frequencies of the variable ``Intermediate costs'' (thousands of euro) for FADN and for $C$


Structure of the RS and the FADN survey data. Common subset: $C = A \cap B$
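Whether the common subset resembles the FADN sample can be checked by comparing class frequencies of a variable observed in FADN, such as ``Intermediate costs''. The sketch below is a minimal illustration, assuming hypothetical farm identifiers and cost classes; no real survey data are reproduced.

```python
from collections import Counter

def relative_frequencies(values):
    """Return the share of each class among the given values."""
    counts = Counter(values)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}

# Hypothetical data: FSS unit identifiers, and FADN units with a cost class
# (intermediate costs in thousands of euro).
fss_ids = {"f1", "f2", "f3", "f7"}
fadn = {"f1": "0-50", "f4": "0-50", "f5": "0-50",
        "f6": "50-200", "f7": "200+", "f2": "200+"}

# Common subset C: units that appear in both samples.
common_ids = fss_ids & set(fadn)

freq_fadn = relative_frequencies(fadn.values())
freq_common = relative_frequencies(fadn[i] for i in common_ids)

for cls in sorted(freq_fadn):
    print(cls, round(freq_fadn[cls], 2), round(freq_common.get(cls, 0.0), 2))
```

In this toy example the largest cost class is over-represented in the common subset, which mirrors the situation described for the FSS--FADN overlap: a subset dominated by large farms cannot be treated as a random sample without further thought.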

Objectives

• To have a complete data set to disseminate for research purposes
• To estimate directly joint information on pairs of variables never jointly observed, such as contingency tables or correlation coefficients, combining:
    • from FADN: economic information on farms and farm performance indicators (added value, production, sales, ...)
    • from FSS: structural information on farms that is not observed in FADN, such as: type of commerce (commercial farm, with or without contractual constraints, sales to associative organisms); work dedicated to connected activities (in days); characteristics of the head of the farm (age, sex, educational level, quantity of work dedicated to the farm activities); cultivation practices (green manure, mulching, controlled green cover; use of mineral/organic fertilizer); organic food production (surface dedicated to the production of organic food); irrigation systems (surface, underground, ...); use of total/partial agricultural service supply agencies for ploughing, fertilizing, sowing, etc.

This example shows that additional information can take a form other than proxy variables.

It is given by an additional data source, possibly complete on all the variables of interest (X,Y,Z) or on the target variables (Y,Z).

Such additional data sources are most of the time outdated (results from previous experiences) or difficult to use in a statistical context (is C representative of the population of interest?).

Summary

• objective: complete synthetic data set and table estimation
• non ``traditional'' statistical matching framework (further information available)
• can $C$ be used, and how?

Motivating example 4: Microsimulation models

The Social Policy Simulation Database and Model (SPSD/M) is a micro computer-based product designed to assist those interested in analyzing the financial interactions of governments and individuals in Canada (see http://www.statcan.ca/english/spsd/spsdm.htm).

It can help one to assess the cost implications or income redistributive effects of changes in the personal taxation and cash transfer system.

The SPSD is a non-confidential, statistically representative database of individuals in their family context, with enough information on each individual to compute taxes paid to and cash transfers received from government.



The SPSM is a static accounting model which processes each individual and family on the SPSD, calculates taxes and transfers using legislated or proposed programs and algorithms, and reports on the results.

It gives the user a high degree of control over the inputs and outputs to the model and can allow the user to modify existing tax/transfer programs or test proposals for entirely new programs. The model can be run using a visual interface and it comes with full documentation.



To apply the algorithms for the microsimulation of tax--transfer policies, it is necessary to have a data set representative of the Canadian population. This data set should contain, among others, structural (age, sex, ...), economic (income, house ownership, car ownership, ...), health-related (permanent illnesses, child care, ...) and social (elder assistance, cultural--educational benefits, ...) variables.

No single data set contains all the variables that can influence the fiscal policy of a state.

In Canada, 4 samples are integrated (Survey of Consumer Finances, tax return data, unemployment insurance claim histories, Family Expenditure Survey).

• Common variables: some socio-demographic variables
• Interest is in the relations between the distinct variables in the different samples


Summary

• objective: complete synthetic data set for general purpose analyses
• ``traditional'' statistical matching framework with more than two data sources

Summary of the examples

Objective of the matching

Approach | Objectives | Example
Macro | Some parameters of $(Y,Z)$, such as contingency tables or correlation coefficients | SAM, LFS--TUS, FSS--FADN
Micro | Synthetic and complete data sets | Data sets for microsimulation, data sets for the joint analysis of income and expenditures


Information to use in the matching process and approaches

Available information | Example
Further information is not available | TUS--LFS, SPSD
Proxy variable | SAM
Auxiliary information on parameters | ---
Auxiliary information on a complete data set | FADN--FSS


Information to use in the matching process and approaches

Available information | Approach
Further information is not available | CIA (unreliable results)
Proxy variable | CIA (results are reliable)
Auxiliary information on parameters | Constrained inference
Auxiliary information on a complete data set | Due to the structure of the data sets, typical approaches can be more complicated

Methodological problems

1) Statistical matching can be seen as a problem of treatment of missing data

2) Imputation algorithms have been used extensively, especially those of the hot-deck family

3) This problem is special: Y and Z are never jointly observed. In this case, when the two samples are drawn according to the same model, it can be proved that the missing data mechanism is missing completely at random (MCAR)

4) The big problem is that not all the models for (X,Y,Z) are identifiable, i.e. the data do not allow the estimation of all the parameters that characterize a model
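The lack of identifiability can be illustrated in the simplest parametric case. Assuming, for illustration only, that X, Y and Z are univariate and jointly normal, the two samples identify the correlations of (X,Y) and of (X,Z) but not that of (Y,Z): requiring the correlation matrix to be positive semidefinite only restricts it to an interval, whose midpoint is the value implied by conditional independence. The function name below is hypothetical.

```python
import math

def yz_correlation_bounds(rho_xy, rho_xz):
    """Range of corr(Y, Z) compatible with given corr(X, Y) and corr(X, Z).

    Follows from det(R) >= 0 for the 3x3 correlation matrix R of (X, Y, Z).
    Returns (lower bound, value under conditional independence, upper bound).
    """
    cia_value = rho_xy * rho_xz
    half_width = math.sqrt((1 - rho_xy ** 2) * (1 - rho_xz ** 2))
    return cia_value - half_width, cia_value, cia_value + half_width

low, cia, high = yz_correlation_bounds(0.5, 0.5)
print(low, cia, high)
```

With both observed correlations equal to 0.5 the interval runs from -0.5 to 1, so the data alone are compatible with very different (Y,Z) relationships. The interval shrinks as X becomes more strongly correlated with Y and Z.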


Methodological problems

If statistical matching is an inferential problem, what methods can be used?

Approach | Micro | Macro
Parametric | Methods available for normal and multinomial variables | Methods available for normal and multinomial variables
Nonparametric | Usually performed by means of hot deck methods | New results in recent years

Contents of the course

The following issues will be analyzed in the rest of the course:

• Today
    • Statistical matching under the conditional independence assumption
• Tomorrow
    • Statistical matching with auxiliary information
    • Selection of the matching variables
    • Accuracy issues in statistical matching
• The day after tomorrow
    • Uncertainty in statistical matching
    • Matching and complex sample designs

Selected references

D'Orazio M., Di Zio M., Scanu M. (2006). ``Statistical Matching for Categorical Data: displaying uncertainty and using logical constraints'', Journal of Official Statistics, vol. 22, 137--157.

D'Orazio M., Di Zio M., Scanu M. (2002). ``Statistical Matching and Official Statistics'', Rivista di Statistica Ufficiale, 1/2002, 5--24.

Gazzelloni S., Romano M.C., Corsetti G., Di Zio M., D'Orazio M., Pintaldi F., Scanu M., Torelli N. (2008). ``Time Use and Labour Force: a proposal to integrate the data through statistical matching'', in Romano C., Time Use in Daily Life: A Multidisciplinary Approach to the Time Use's Analysis, Argomenti series n. 35, Istat (available at http://www3.istat.it/dati/catalogo/20080612_01/).

Torelli N., Ballin M., D'Orazio M., Di Zio M., Scanu M., Corsetti G. (2008). ``Statistical matching of two surveys with a nonrandomly selected common subset'', CENEX-ISAD workshop (Vienna, 29--30 May 2008) (available at http://cenex-isad.istat.it).

Wolfson M., Gribble S., Bordt M., Murphy B., Rowe G., Scheuren F. (1989). ``The social policy simulation database and model: an example of survey and administrative data integration'', Survey of Current Business, May 1989.
