Eurostat The statistical matching problem: definition Training Course «Statistical Matching» Rome, 6-8 November 2013 Mauro Scanu Dept. Integration, Quality, Research and Production Networks Development, Istat scanu [at] istat.it
Jan 04, 2016
Outline
• Motivation
  - Example 1: Time use and Labour force surveys
  - Example 2: The Social Accounting Matrix (SAM)
  - Example 3: Farms accounting surveys and Accountancy data
  - Example 4: Microsimulation
• Methodological issues
• Contents of the course
• References
Statistical matching
Let us assume that data are collected in two sample surveys, A and B, of sizes nA and nB, drawn from the same population.
• Some variables X are observed in both samples
• Variables Y are observed only in survey A
• Variables Z are observed only in survey B
The goal is inference on the joint distribution of (X,Y,Z), or at least on the bivariate distribution of (Y,Z)
Goal: estimation of parameters describing (Y,Z) or (X,Y,Z)
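The data structure described above can be sketched in a few lines of Python. The variable names and toy values are purely illustrative assumptions, not data from the surveys discussed in this course. Stacking the two samples makes the missing-data pattern explicit: no record carries both Y and Z.

```python
# Toy illustration of the statistical matching data structure.
# All names and values below are hypothetical.

# Sample A observes (X, Y); sample B observes (X, Z).
sample_a = [
    {"x": 25, "y": 7.5},
    {"x": 40, "y": 8.0},
    {"x": 58, "y": 6.0},
]
sample_b = [
    {"x": 31, "z": "more hours"},
    {"x": 47, "z": "same hours"},
]


def stack(sample_a, sample_b):
    """Concatenate A and B into one file, marking unobserved values as None."""
    stacked = []
    for rec in sample_a:
        stacked.append({"source": "A", "x": rec["x"], "y": rec["y"], "z": None})
    for rec in sample_b:
        stacked.append({"source": "B", "x": rec["x"], "y": None, "z": rec["z"]})
    return stacked


stacked = stack(sample_a, sample_b)

# Y and Z are never jointly observed, so this count is zero by construction.
jointly_observed = sum(
    1 for rec in stacked if rec["y"] is not None and rec["z"] is not None
)
```

Only X is complete in the stacked file, which is why X is the only bridge between the two sources.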
Motivating example 1: TUS--LFS
The objective of integrating the Time Use Survey (TUS) and the Labour Force Survey (LFS) is to create, at the micro level, a synthetic file of both surveys that allows the study of the relationships between variables measured in each survey.
By using the specific variables of both surveys together, one can analyse the characteristics of employment and the time balances at the same time.
• Information on labour force units and the organisation of their life times will help enhance the analyses of the labour market
• The analyses of the working condition characteristics that result from the labour force survey will complement the more general TUS analysis of the quality of life.
The possibilities for reciprocal enrichment have been widely recognised (see the 17th International Conference of Labour Statisticians in 2003 and the 2003 and 2004 works of the Paris Group). Emphasis was put on how the integration of the two surveys could contribute to analysing the different modes of participation in the labour market determined by hour and contract flexibility.
Among the issues raised by researchers on time use, we list the following two:
• the usefulness and limitations involved in using and combining various sources, such as labour force and time-use surveys, for improving data quality
• time-use surveys are useful, especially for measuring hours worked by workers in the informal economy, in home-based work, and by the hidden or undeclared workforce, as well as for measuring absence from work
What is the input of the analysis
Specific variables in the TUS (Y): they make it possible to estimate the time dedicated to daily work and to study its level of "fragmentation" (number of intervals/interruptions), its flexibility (exact start and end of working hours) and its relations with the other life times
Specific variables in the LFS (Z): the breadth of the information gathered allows us to examine the specific aspects of Italian participation in the labour market: professional condition, economic activity sector, type of working hours, job duration, profession carried out, etc. It is also possible to investigate dimensions related to the quality of the job
The statistical matching problem corresponds to the joint analysis of the specific variables in TUS and LFS.
Y: Hours worked in an average weekday (average generic length)
Z: Willingness to work a different number of hours from those actually worked in the reference week (records from LFS, first quarter of 2003)
In order to obtain the previous result, we followed these steps:
1) impute a file (TUS) with records from the other file (LFS) by means of the common variables
2) estimate what you need from this file
NOTE -- in this case, it is possible to prove that we are assuming a particular model on the variable relationship: Y and Z are assumed independent given X
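The two steps above can be sketched with a nearest-neighbour (distance) hot deck, one member of the hot-deck family. The slides do not say which imputation method was actually used for TUS and LFS, so the choice of method is an assumption; the function names, the single numeric matching variable and the toy values are hypothetical as well.

```python
def nearest_donor(x_value, donors):
    """Return the donor record whose common variable x is closest to x_value."""
    return min(donors, key=lambda d: abs(d["x"] - x_value))


def impute_z(recipients, donors):
    """Step 1: complete each recipient (X, Y) record with Z from its nearest donor."""
    completed = []
    for rec in recipients:
        donor = nearest_donor(rec["x"], donors)
        completed.append({"x": rec["x"], "y": rec["y"], "z": donor["z"]})
    return completed


def mean_y_by_z(completed):
    """Step 2: estimate a (Y, Z) quantity, here the mean of Y within each Z class."""
    sums, counts = {}, {}
    for rec in completed:
        sums[rec["z"]] = sums.get(rec["z"], 0.0) + rec["y"]
        counts[rec["z"]] = counts.get(rec["z"], 0) + 1
    return {z: sums[z] / counts[z] for z in sums}


# Hypothetical toy data: x = age, y = hours worked, z = wish to change hours.
tus = [{"x": 25, "y": 7.0}, {"x": 42, "y": 8.0}, {"x": 60, "y": 5.0}]
lfs = [
    {"x": 27, "z": "more"},
    {"x": 45, "z": "same"},
    {"x": 58, "z": "fewer"},
    {"x": 40, "z": "same"},
]

synthetic = impute_z(tus, lfs)    # synthetic file with (X, Y, Z) on every record
estimates = mean_y_by_z(synthetic)
```

Because Z is attached to each record through X alone, any association between Y and Z in the synthetic file is mediated entirely by X, which is how the conditional independence of Y and Z given X enters the procedure.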
Summary
• objective: micro data for analyses
• "traditional" statistical matching framework
• use of imputation techniques
• use of a particular model: the CIA
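For reference, the CIA (conditional independence assumption) can be written as follows. The density notation f is introduced here for illustration and is not taken from the slides.

```latex
f(x, y, z) = f(y \mid x)\, f(z \mid x)\, f(x)
\quad\Longleftrightarrow\quad
f(y, z \mid x) = f(y \mid x)\, f(z \mid x)
```

Each factor on the right-hand side involves only variables that are jointly observed: f(y | x) can be estimated from the file observing Y, f(z | x) from the file observing Z, and f(x) from both.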
Motivating example 2: SAM
The new system of national accounts (also known as the European System of Accounts, or ESA95) is a source of very detailed information on the economic behaviour of all economic agents, such as households and enterprises. A very important role in ESA95 is played by the social accounting matrix (SAM).
A SAM has two main objectives: first, organising information about the economic and social structure of a country over a period of time (usually a year) and, second, providing a statistical basis for the creation of a plausible economic model capable of presenting a static image of the economy as well as simulating the effects of policy interventions in the economy.
The SAM module on households is a matrix containing, for each household typology:
• the amount of expenditures (distinguished according to a very detailed list of expenditure categories)
• income (employee income, self-employment income, interest, dividends, rents, social security transfers).
Household typologies may be defined in different ways, e.g. by area (region) of residence, primary income source, or characteristics of the head of household.
An example of social accounting matrix is the following one (data 1995)
In general, we expect to have a table like this:
C = (C1, …, Cu, …, CU) represents different expenditure typologies, e.g. food expenditures, durable goods expenditures, and so on
M = (M1, …, Mv, …, MV) denotes different income typologies, e.g. salaries, dividends and interests, and so on
Tw, w = 1, …, W, represent the different household typologies of interest
What do we have in practice
HBS (Istat Household Budget Survey) collects:
• some socio-demographic variables X (used for the construction of the household typologies T)
• a very detailed vector of consumption variables Z (e.g. if C1 represents "Food Consumption", it can be obtained by combining HBS Z variables such as consumption of meat, eggs, fish, vegetables, and so on)
• a variable on income, TM•: the total monthly household income (categorical variable with 14 classes, not reliable).
The first two items allow the computation of the terms cwu.
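As an illustration of how the first two items could be combined, the sketch below aggregates household-level consumption into a typology-by-category table. It assumes, hypothetically, that cwu is the weighted total expenditure of typology Tw on category Cu, that each household record carries a sampling weight, and that a mapping from detailed HBS items to categories is given. None of these details are specified in the slides.

```python
from collections import defaultdict

# Hypothetical mapping from detailed HBS consumption items (Z) to categories C_u.
ITEM_TO_CATEGORY = {"meat": "food", "fish": "food", "car": "durables"}


def consumption_table(households):
    """Return {(typology, category): weighted total expenditure}."""
    table = defaultdict(float)
    for hh in households:
        for item, amount in hh["items"].items():
            category = ITEM_TO_CATEGORY[item]
            table[(hh["typology"], category)] += hh["weight"] * amount
    return dict(table)


# Toy records: typology built from X, item-level expenditures from Z.
households = [
    {"typology": "T1", "weight": 100.0, "items": {"meat": 50.0, "fish": 20.0}},
    {"typology": "T1", "weight": 150.0, "items": {"meat": 40.0, "car": 300.0}},
    {"typology": "T2", "weight": 200.0, "items": {"fish": 10.0}},
]

c = consumption_table(households)
```

The same aggregation pattern would apply to the income terms, with income variables in place of consumption items.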
SHIW (Bank of Italy Survey on Household Income and Wealth) collects:
• some socio-demographic variables X (used for the construction of the household typologies T)
• a very detailed list of income variables Y from which the variables Mv, v = 1, …, V, can be reconstructed
• a few generic questions on consumption Z (e.g. the amount of expenditures for durable goods, food, ...; not reliable)
The first two items allow the computation of the terms mwv
At first sight, the SAM can be estimated directly, at least for those rows where Tw is available in HBS. However, even for these rows there is a problem: the two independent surveys sometimes produce inconsistent results. In other words, reconciling the definitions and concepts of the two surveys is not enough for the joint use of their estimates.
In this case, sample variability produces estimates of the table entries which are incompatible with current economic theory. The incompatibility concerns the propensity to consume.
Approach 1: complete A (i.e. the Bank of Italy survey SHIW, the smaller sample) by imputing records from B (HBS) using the common information X. Again, this approach assumes the conditional independence of income and consumption given the common variables.
Approach 2: let us formalize the problem in its probabilistic components
Simplified model: joint distribution of X, TM (total monetary income) and TC (total consumption) for a household typology Tw:
P(X, TM, TC | Tw) = P(TC | X, TM, Tw) P(X, TM | Tw), w = 1, …, W
• The joint distribution of X and TM can be easily estimated from SHIW.
• P(TC | X, TM, Tw) cannot be estimated from SHIW. In fact, because of memory problems, its consumption variables are not as reliable and detailed as the HBS ones (which are observed through the compilation of a diary)
• P(TC | X, TM, Tw) cannot be estimated from HBS. In fact, (i) this survey observes TM•, which is not reliable: asking directly for the total amount of income, as in HBS, usually leads to under-reporting; (ii) TM• is categorical, while TM in SHIW is continuous
The proxy information from the variable TM• will be used as additional information
Summary
• objective: table
• "traditional" statistical matching framework
• attempt to avoid the CIA
• use of a formal dependence model and use of a proxy variable
Motivating example 3: FSS--FADN
The two surveys collect data on agricultural enterprises and are designed to investigate separate phenomena
• FSS (Farm Structure Survey) focuses on the structure of the farms (labour force, holder family characteristics, crops, machinery and equipment, etc.)
• FADN (Farm Accountancy Data Network) is a survey on the economic structure of the farms (costs, added value, household income, etc.)
• Common variables include utilised agricultural area (UAA), economic size unit (ESU), livestock size unit (LSU) and geographical characteristics
• Both surveys were carried out in 2003
• Total sample size is about 55,000 for FSS and about 20,000 for the FADN survey
• The two surveys have similar stratified sample designs (both use region and farm size in terms of economic size units)
• Both surveys include take-all strata containing the largest farms
• In order to reduce respondent burden, the selection of units in the two surveys is negatively correlated (units selected in one sample are less likely to be selected for the other survey)
Structure of the common subset $C$
• The common subset is made of 883 units
• Many of them belong to the take-all stratum of the FSS
• It contains mainly large farms
Frequencies of the variable ``Intermediate costs'' (thousands of Euro) for FADN and for $C$
Structure of the RS and the FADN survey data
Common subset C = A ∩ B
Objectives
• To have a complete data set to disseminate for research purposes
• To estimate directly joint information on a pair of variables never jointly observed, such as contingency tables or correlation coefficients, combining
  - from FADN: economic information on farms / performance indicators on farms (added value, production, sales, ...)
  - from FSS: structural information on farms which is not observed in FADN, such as: typology of commerce (commercial farm, with or without contractual constraints, sales to associative organisms); work dedicated to connected activities (in days); head of the farm characteristics (age, sex, educational level, quantity of work dedicated to the farm activities); cultivation practices (green manure, mulching, controlled green cover; use of mineral/organic fertilizer); organic food production (surface dedicated to the production of organic food); irrigation systems (surface, underground, ...); use of total/partial agricultural service supply agency (for ploughing, fertilizing, sowing, etc.)
This example shows that additional information can take a form other than proxy variables.
This is given by the presence of an additional data source, possibly complete on all the variables of interest (X,Y,Z) or on the target variables (Y,Z).
These additional data sources are most often outdated (results from previous experiences) or difficult to use in a statistical context (is C representative of the population of interest?)
Summary
• objective: complete synthetic data set and table estimation
• non-"traditional" statistical matching framework (further information available)
• can $C$ be used, and how?
Motivating example 4: Microsimulation models
The Social Policy Simulation Database and Model (SPSD/M) is a micro computer-based product designed to assist those interested in analyzing the financial interactions of governments and individuals in Canada (see http://www.statcan.ca/english/spsd/spsdm.htm).
It can help one to assess the cost implications or income redistributive effects of changes in the personal taxation and cash transfer system.
The SPSD is a non-confidential, statistically representative database of individuals in their family context, with enough information on each individual to compute taxes paid to and cash transfers received from government.
The SPSM is a static accounting model which processes each individual and family on the SPSD, calculates taxes and transfers using legislated or proposed programs and algorithms, and reports on the results.
It gives the user a high degree of control over the inputs and outputs to the model and can allow the user to modify existing tax/transfer programs or test proposals for entirely new programs. The model can be run using a visual interface and it comes with full documentation.
In order to apply the algorithms for the microsimulation of tax-transfer benefit policies, it is necessary to have a data set representative of the Canadian population. This data set should contain, among others, structural (age, sex, ...), economic (income, house ownership, car ownership, ...), health-related (permanent illnesses, child care, ...) and social (elder assistance, cultural-educational benefits, ...) variables.
• No single data set contains all the variables that can influence the fiscal policy of a state
• In Canada, 4 samples are integrated (Survey of Consumer Finances, tax return data, unemployment insurance claim histories, Family Expenditure Survey)
• Common variables: some socio-demographic variables
• Interest is in the relation between the distinct variables in the different samples
Summary
• objective: complete synthetic data set for general purpose analyses
• "traditional" statistical matching framework with more than two data sources
Summary of the examples
Objective of the matching
Approach | Objectives | Example
Macro | Some parameters of $(Y,Z)$, such as contingency tables and correlation coefficients | SAM, LFS--TUS, FSS--FADN
Micro | Synthetic and complete data sets | Data sets for microsimulation, data sets for the joint analysis of income and expenditures
Information to use in the matching process and approaches
Available information | Example
Further information is not available | TUS--LFS, SPSD
Proxy variable | SAM
Auxiliary information on parameters | ---
Auxiliary information on a complete data set | FADN--FSS
Information to use in the matching process and approaches
Available information | Approach
Further information is not available | CIA (unreliable results)
Proxy variable | CIA (results are reliable)
Auxiliary information on parameters | Constrained inference
Auxiliary information on a complete data set | Due to the structure of the data sets, typical approaches can be more complicated
Methodological problems
1) Statistical matching can be seen as a problem of treatment of missing data
2) Imputation algorithms have been used extensively, especially those of the hot-deck family
3) This problem is special: Y and Z are never jointly observed. In this case, when the two samples are drawn according to the same model, it can be proved that the missing data mechanism is missing completely at random (MCAR)
4) The big problem is that not all the models for (X,Y,Z) are identifiable, i.e. the data do not allow the estimation of all the parameters that characterize a model
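The identifiability issue in point 4 can be made explicit by writing the observed-data likelihood. The parameter symbols and the indexing of the two samples are notation assumed here for illustration, not taken from the slides.

```latex
L(\theta \mid A \cup B)
  = \prod_{a \in A} f_{XY}(x_a, y_a; \theta_{XY})
    \prod_{b \in B} f_{XZ}(x_b, z_b; \theta_{XZ})
```

The likelihood depends only on the parameters of the (X,Y) and (X,Z) marginal distributions. Any parameter describing the association between Y and Z given X leaves it unchanged, so such a parameter cannot be estimated from A and B alone.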
If the statistical matching problem is an inferential problem, what methods can be used?
Approach | Micro | Macro
Parametric | Methods available for normal and multinomial variables | Methods available for normal and multinomial variables
Nonparametric | Usually performed by means of hot deck methods | New results in recent years
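For the normal case, a macro-level parametric estimate can be sketched as follows. With a single common variable X, the CIA implies that the correlation between Y and Z equals the product of the two estimable correlations. Without the CIA, the requirement that the correlation matrix be positive semidefinite only confines that correlation to an interval. The function names and sample values are hypothetical, and restricting to one X variable is a simplifying assumption.

```python
import math


def pearson(u, v):
    """Sample Pearson correlation between two equally long sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    var_u = sum((a - mu) ** 2 for a in u)
    var_v = sum((b - mv) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)


def rho_yz_cia(rho_xy, rho_xz):
    """Correlation of (Y, Z) implied by conditional independence given X."""
    return rho_xy * rho_xz


def rho_yz_bounds(rho_xy, rho_xz):
    """Range of rho_YZ compatible with the two estimable correlations."""
    half_width = math.sqrt((1 - rho_xy ** 2) * (1 - rho_xz ** 2))
    centre = rho_xy * rho_xz
    return centre - half_width, centre + half_width


# Toy samples: A observes (x, y), B observes (x, z).
xa, ya = [1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8]
xb, zb = [1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 1.0, 4.0, 3.0, 5.0]

r_xy = pearson(xa, ya)                  # estimable from A only
r_xz = pearson(xb, zb)                  # estimable from B only
point = rho_yz_cia(r_xy, r_xz)          # CIA point estimate
low, high = rho_yz_bounds(r_xy, r_xz)   # values admissible without the CIA
```

The CIA value is the midpoint of the interval. The interval narrows as X becomes more strongly correlated with Y or with Z.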
Contents of the course
The following issues will be analyzed in the rest of the course
• Today
  - Statistical matching under the conditional independence assumption
• Tomorrow
  - Statistical matching with auxiliary information
  - Selection of the matching variables
  - Accuracy issues in statistical matching
• The day after tomorrow
  - Uncertainty in statistical matching
  - Matching and complex sample designs
Selected references
D'Orazio M., Di Zio M., Scanu M. (2006). "Statistical Matching for Categorical Data: displaying uncertainty and using logical constraints", Journal of Official Statistics, vol. 22, 137-157.
D'Orazio M., Di Zio M., Scanu M. (2002). "Statistical Matching and Official Statistics", Rivista di Statistica Ufficiale, 1/2002, 5-24.
Gazzelloni S., Romano M.C., Corsetti G., Di Zio M., D'Orazio M., Pintaldi F., Scanu M., Torelli N. (2008). "Time Use and Labour Force: a proposal to integrate the data through statistical matching", in Romano C., Time Use in Daily Life: A Multidisciplinary Approach to the Time Use's Analysis, collana Argomenti n. 35, Istat (available at http://www3.istat.it/dati/catalogo/20080612_01/)
Torelli N., Ballin M., D'Orazio M., Di Zio M., Scanu M., Corsetti G. (2008). "Statistical matching of two surveys with a nonrandomly selected common subset", CENEX-ISAD workshop (Vienna, 29-30 May 2008) (available at http://cenex-isad.istat.it)
Wolfson M., Gribble S., Bordt M., Murphy B., Rowe G., Scheuren F. (1989). "The social policy simulation database and model: an example of survey and administrative data integration", Survey of Current Business, May 1989.