ICES III, Montreal, June 18-21, 2007
A New Approach for Disclosure Control in the IAB Establishment Panel
Multiple Imputation for Better Data Access
Jörg Drechsler
Institute for Employment Research (IAB)
Mar 27, 2015
Overview
Background
Statistical disclosure control with fully synthetic data sets
Application to the IAB-Establishment Panel
First results
Next steps/open questions
The IAB Establishment Panel
Annual establishment survey
Since 1993 in Western Germany, since 1996 in Eastern Germany
Population: All establishments with at least one employee covered by social security
Source: Official Employment Statistics
Response rate among repeatedly interviewed establishments: more than 80%
Sample of more than 16,000 establishments in the last wave
Contents: employment structure, changes in employment, business policies, investment, training, remuneration, working hours, collective wage agreements, works councils
Generating Synthetic Data Sets (Rubin 1993)
[Figure: schematic of the approach, with blocks labeled X, Y observed, Y not observed, and several copies of Y synthetic]
Advantages:
- Data are fully synthetic
- No re-identification of single units possible
- All variables are still fully available
Generating synthetic data sets for the IAB Establishment Panel
Create a synthetic data set for selected variables from the 1997 wave of the Establishment Panel
Imputation for the whole population is not feasible
Draw a new sample from the Official Employment Statistics using the same sampling design as for the Establishment Panel (stratification by economic branch, size, and region)
Each stratum cell contains the same number of observations as in the 1997 wave of the Establishment Panel
Additional information from the German Social Security Data (GSSD) is used for the imputation
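The sampling step on this slide can be sketched roughly as follows. This is a minimal illustration, not the procedure actually used: the record layout (a `stratum` key holding a branch/size/region tuple), the function name `draw_stratified_sample`, and the toy data are all assumptions made for the example.

```python
import random
from collections import Counter, defaultdict

def draw_stratified_sample(frame, panel, seed=0):
    """Draw, for every stratum cell, as many frame units as the panel holds in that cell."""
    rng = random.Random(seed)
    # Number of panel observations per (branch, size, region) cell.
    cell_sizes = Counter(unit["stratum"] for unit in panel)
    # Group the sampling frame by the same cells.
    by_cell = defaultdict(list)
    for unit in frame:
        by_cell[unit["stratum"]].append(unit)
    sample = []
    for cell, n in cell_sizes.items():
        # Simple random sampling without replacement within each cell.
        sample.extend(rng.sample(by_cell[cell], n))
    return sample

# Toy frame with two hypothetical stratum cells.
cell_a = ("manufacturing", "small", "west")
cell_b = ("services", "large", "east")
frame = [{"id": i, "stratum": cell_a if i < 6 else cell_b} for i in range(10)]
panel = [frame[0], frame[1], frame[7]]

new_sample = draw_stratified_sample(frame, panel)
```

The new sample has the same cell counts as the panel, but its units may or may not coincide with the panel's units.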
The German Social Security Data (GSSD)
Contains information on all employees covered by social security
Since 1973 all employers are required to notify the social security agencies about all employees covered by social security.
The GSSD represents about 80% of the German workforce
Information from the GSSD is aggregated at the establishment level and is matched to the IAB Establishment Panel via the establishment identification number
Information on: number of employees by gender, schooling, mean age of the employees, mean wages of the employees…
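The aggregation and matching described above might look like the following sketch. The field names (`est_id`, `gender`, `age`, `wage`) and both helper functions are hypothetical; the real GSSD layout is not given on the slides.

```python
from collections import defaultdict
from statistics import mean

def aggregate_by_establishment(employees):
    """Collapse employee-level records into one summary record per establishment."""
    groups = defaultdict(list)
    for emp in employees:
        groups[emp["est_id"]].append(emp)
    return {
        est_id: {
            "n_employees": len(rows),
            "n_women": sum(1 for r in rows if r["gender"] == "f"),
            "mean_age": mean(r["age"] for r in rows),
            "mean_wage": mean(r["wage"] for r in rows),
        }
        for est_id, rows in groups.items()
    }

def match_to_panel(panel, aggregates):
    """Left join: keep every panel record and attach aggregates where the id matches."""
    return [{**rec, **aggregates.get(rec["est_id"], {})} for rec in panel]

# Toy employee-level data (assumed layout).
employees = [
    {"est_id": 1, "gender": "f", "age": 30, "wage": 100.0},
    {"est_id": 1, "gender": "m", "age": 40, "wage": 120.0},
    {"est_id": 2, "gender": "f", "age": 50, "wage": 90.0},
]
panel = [{"est_id": 1, "training": 1}]

merged = match_to_panel(panel, aggregate_by_establishment(employees))
```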
Synthetic Establishment Panels
[Figure: schematic with blocks labeled IAB Establishment Panel, GSSD, and several copies of EPsynthetic]
Imputation Procedure
For simplicity, newly founded establishments are excluded from the sampling frame and from the panel
10 new samples are drawn
The number of observations in each sample equals the number of observations in the panel: n_s = n_p = 7,332
Every sample is imputed ten times using chained equations
Number of variables from the GSSD: 24
Number of variables from the establishment panel: 48
Imputations are generated using IVEware by Raghunathan, Solenberger and Van Hoewyk (2001)
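The nesting of samples and imputations (10 samples, each imputed 10 times, giving 100 synthetic data sets) can be illustrated with a toy sketch. It does not reproduce IVEware or chained equations: a single one-predictor linear regression with normal noise stands in for the full sequence of models, parameter uncertainty is ignored, and every name and number below is an assumption.

```python
import random
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, residual_sd)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    resid = [y - (a + b * x) for x, y in zip(xs, ys)]
    sd = (sum(r * r for r in resid) / (len(xs) - 2)) ** 0.5
    return a, b, sd

def impute_sample(sample_x, model, rng):
    """Draw one synthetic y per unit: regression prediction plus random noise."""
    a, b, sd = model
    return [a + b * x + rng.gauss(0, sd) for x in sample_x]

def synthesize(panel_x, panel_y, samples_x, n_imputations=10, seed=0):
    """For each newly drawn sample, generate n_imputations synthetic versions of y."""
    rng = random.Random(seed)
    model = fit_line(panel_x, panel_y)  # model estimated on the observed panel
    return [
        [impute_sample(xs, model, rng) for _ in range(n_imputations)]
        for xs in samples_x
    ]

# Toy panel (x known for everyone, y observed only in the panel) and 10 new samples.
panel_x = [1, 2, 3, 4, 5]
panel_y = [2.1, 3.9, 6.2, 7.8, 10.1]
samples_x = [[1.5, 2.5, 3.5] for _ in range(10)]

synthetic = synthesize(panel_x, panel_y, samples_x)
n_datasets = sum(len(per_sample) for per_sample in synthetic)
```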
First Results
Compare regression results from the original data with results from the synthetic data
Zwick (2005) analyses the productivity effects of different forms of continuing vocational training in Germany
Result: vocational training is one of the most important measures to gain and keep productivity
Probit regression to explain why firms offer vocational training
13 explanatory variables, including: share of qualified employees, establishment size, region, collective wage agreement, high qualification needs expected…
2 variables, based on the 1998 wave of the panel, are dropped for the evaluation
Descriptive comparison of the original and the synthetic data set
Variable                                              survey mean  synthetic data mean  deviation
Training Yes/No                                            0.7069               0.7109      0.55%
Redundancies expected                                      0.2239               0.2223     -0.75%
Many employees are expected to be on maternity leave       0.0644               0.0737     14.34%
High qualification needs expected                          0.1551               0.1551      0.02%
Establishment size 20-199                                  0.3973               0.4043      1.77%
Establishment size 200-499                                 0.1348               0.1439      6.78%
Establishment size 500-999                                 0.0745               0.0769      3.30%
Establishment size 1000+                                   0.0942               0.0977      3.71%
Collective wage agreement                                  0.7643               0.7535     -1.41%
Apprenticeship training reaction on skill shortages        0.3632               0.3655      1.00%
Training reaction on skill shortages                       0.4490               0.4678      4.10%
State-of-the-art technical equipment                       0.6513               0.6861      5.35%
Apprenticeship training                                    0.6141               0.6247      1.73%
Share of qualified employees                               0.6740               0.6271     -6.96%
Number of employees                                      365.6238             350.5626     -4.12%
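The deviation column appears to be the relative difference between the synthetic and the survey mean. That reading is an inference from the numbers, not something the slides state, and because the printed means are rounded, some rows reproduce only approximately. A short check on two rows, with a hypothetical helper name:

```python
def relative_deviation(survey_mean, synthetic_mean):
    """Relative difference of the synthetic mean from the survey mean, in percent."""
    return (synthetic_mean - survey_mean) / survey_mean * 100

# Two rows taken from the table: (survey mean, synthetic data mean).
rows = {
    "Share of qualified employees": (0.6740, 0.6271),
    "Number of employees": (365.6238, 350.5626),
}

deviations = {
    name: round(relative_deviation(survey, synthetic), 2)
    for name, (survey, synthetic) in rows.items()
}
```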
Results from the regression
Exogenous variables                  original data set   synthetic data set
                                     coeff.     p-value  coeff.     p-value
Redundancies expected                0.2503***  0.0000   0.2513***  0.0001
Emp. exp. on maternity leave         0.2657**   0.0050   0.2445*    0.0172
High qual. needs expected            0.6480***  0.0000   0.6245***  0.0000
Appr. tr. react. on skill shortages  0.1130*    0.0390   0.1471*    0.0130
Tr. reaction on skill shortages      0.5273***  0.0000   0.5233***  0.0000
Establishment size 20-199            0.6859***  0.0000   0.6450***  0.0000
Establishment size 200-499           1.3554***  0.0000   1.2031***  0.0000
Establishment size 500-999           1.3473***  0.0000   1.3402***  0.0000
Establishment size 1000+             1.9641***  0.0000   1.7775***  0.0000
Share of qualified employees         0.7776***  0.0000   0.8204***  0.0000
State-of-the-art tech. equipment     0.1690***  0.0000   0.1678***  0.0001
Collective wage agreement            0.2541***  0.0000   0.3131***  0.0000
Apprenticeship training              0.4838***  0.0000   0.4058***  0.0000
*** significant at the 0.1% level, ** significant at the 1% level, * significant at the 5% level
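One simple way to read the table is to ask whether an analyst would reach the same conclusion from either data set. The sketch below applies the star thresholds from the footnote to a few rows and lists the variables whose significance level changes or whose sign flips. The helper names are made up for the illustration.

```python
def stars(p_value):
    """Map a p-value to the significance marker defined in the table footnote."""
    if p_value < 0.001:
        return "***"
    if p_value < 0.01:
        return "**"
    if p_value < 0.05:
        return "*"
    return ""

# Selected rows: (coeff. original, p original, coeff. synthetic, p synthetic).
results = {
    "Redundancies expected": (0.2503, 0.0000, 0.2513, 0.0001),
    "Emp. exp. on maternity leave": (0.2657, 0.0050, 0.2445, 0.0172),
    "Appr. tr. react. on skill shortages": (0.1130, 0.0390, 0.1471, 0.0130),
    "Establishment size 1000+": (1.9641, 0.0000, 1.7775, 0.0000),
}

# Variables whose significance level differs between the two data sets.
level_changed = [
    name for name, (_, p_orig, _, p_syn) in results.items()
    if stars(p_orig) != stars(p_syn)
]
# Variables whose coefficient changes sign.
sign_flipped = [
    name for name, (c_orig, _, c_syn, _) in results.items()
    if (c_orig > 0) != (c_syn > 0)
]
```

For the rows shown, only the maternity leave variable moves from the 1% to the 5% level, and no coefficient changes sign.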
Next steps/open questions
More detailed evaluation
Replace only selected variables
Generate weights for the synthetic sample
Imputation of more than one wave maintaining the panel structure
References
Drechsler, J., Dundler, A., Bender, S., Rässler, S., Zwick, T. (2007). A New Approach for Disclosure Control in the IAB Establishment Panel - Multiple Imputation for a Better Data Access, IAB Discussion Paper No. 11/2007
Reiter, J. and Drechsler, J. (2007). Releasing Multiply-Imputed, Synthetic Data Generated in Two Stages To Protect Confidentiality, submitted
Thank you for your attention
Information from the two data sets
Information contained in the German Social Security Data (from 1997)
Available for all German establishments with at least one employee covered by social security
- number of full-time and part-time employees
- short-time employment
- mean of the employees' age
- average wages for full-time employees
- average wages for all employees
- occupation
- schooling and training
- number of women and men
- number of German employees
Information contained in the IAB Establishment Panel (wave 1997)
Available for establishments in the survey
- number of employees in June 1996
- qualification of the employees
- number of temporary employees
- number of agency workers
- working week (full-time and overtime)
- the firm's commitment to collective agreements
- existence of a works council
- turnover, advance performance and export share
- investment total
- overall wage bill in June 1997
- technological status
- age of the establishment
- legal form and corporate position
- overall company-economic situation
- reorganisation measures
- company further training activities
- additional information on new foundations
Covered in both datasets
- establishment number, branch and size
- location of the establishment
- number of employees in June 1997
Disclosure is possible if…
An establishment is included in the original data set and in at least one of the newly drawn samples
The original values and the imputed values for this establishment are nearly the same
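The two conditions can be turned into a rough screening rule, sketched below for a single numerical variable. The 5% tolerance, the dictionary layout (establishment id mapped to a value), and the function names are assumptions for illustration; the slides do not define "nearly the same".

```python
def nearly_equal(original, imputed, tolerance):
    """True if the imputed value lies within a relative tolerance of the original."""
    if original == 0:
        return imputed == 0
    return abs(imputed - original) / abs(original) <= tolerance

def establishments_at_risk(original, synthetic_samples, tolerance=0.05):
    """Ids that appear in the original data and in at least one new sample with a close value."""
    risky = set()
    for sample in synthetic_samples:
        for est_id, imputed_value in sample.items():
            # Condition 1: the establishment is in both data sets.
            if est_id not in original:
                continue
            # Condition 2: original and imputed values are nearly the same.
            if nearly_equal(original[est_id], imputed_value, tolerance):
                risky.add(est_id)
    return risky

# Toy data: establishment id -> value of one numerical variable.
original = {1: 100.0, 2: 50.0, 3: 20.0}
synthetic_samples = [
    {1: 103.0, 4: 70.0},   # id 1 reappears with a close value
    {2: 80.0, 5: 10.0},    # id 2 reappears, but the value is far off
]

risky = establishments_at_risk(original, synthetic_samples)
```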
Percentage of establishments selected in the IAB Establishment Panel and in the newly drawn samples
[Bar chart, vertical axis from 0% to 50%]
sample1    11.4%
sample2    11.7%
sample3    11.4%
sample4    11.2%
sample5    11.7%
sample6    10.9%
sample7    11.4%
sample8    11.6%
sample9    11.5%
sample10   11.4%
How often are establishments included in the IAB-Establishment Panel drawn in the new samples?
Occurrence in … sample(s)   Number   Percentage
0                            4,469        61.0%
1                            1,091        14.9%
2                              535         7.3%
3                              362         4.9%
4                              275         3.8%
5                              199         2.7%
6                              144         2.0%
7                               89         1.2%
8                               53         0.7%
9                               32         0.4%
10                              83         1.1%
Total                        7,332         100%
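A quick recomputation from the counts in the table gives the share of panel establishments that show up in at least one of the ten new samples. The variable names are illustrative.

```python
# Number of panel establishments by how many of the 10 new samples contain them.
counts = {0: 4469, 1: 1091, 2: 535, 3: 362, 4: 275, 5: 199,
          6: 144, 7: 89, 8: 53, 9: 32, 10: 83}

total = sum(counts.values())
at_least_once = total - counts[0]
share_at_least_once = round(100 * at_least_once / total, 1)

# Percentage column recomputed from the counts.
percentages = {k: round(100 * n / total, 1) for k, n in counts.items()}
```

So roughly 39% of the panel establishments satisfy the first disclosure condition, that is, they are included in at least one newly drawn sample.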
Identical Establishments by Establishment Size
[Bar chart: share of panel establishments by number of new samples in which they reappear]
Establishment size        in none of the 10 samples   in 1 to 5 samples   in 6 to 10 samples
1 to 199 employees                            96.6%                3.3%                 0.1%
200 to 999 employees                          75.0%               16.2%                 8.8%
1000 and more employees                       50.8%               20.8%                28.4%
Comparing original and imputed values
Binary variables: probability of identical values: 60-90%
Multiple response questions:
- with four categories: 57%
- with 13 categories: 6%
Numerical variables:
- average relative difference: 21%
- outliers
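The two kinds of comparison on this slide could be computed along the following lines. This is a sketch under assumptions: the slides do not give the exact definitions, so "probability of identical values" is taken as the share of matching pairs and "average relative difference" as the mean of |imputed - original| / |original| over non-zero originals. The data are made up.

```python
def share_identical(original, imputed):
    """Share of units whose imputed value equals the original value."""
    matches = sum(1 for o, i in zip(original, imputed) if o == i)
    return matches / len(original)

def mean_relative_difference(original, imputed):
    """Average of |imputed - original| / |original|, skipping zero originals."""
    pairs = [(o, i) for o, i in zip(original, imputed) if o != 0]
    return sum(abs(i - o) / abs(o) for o, i in pairs) / len(pairs)

# Toy binary variable (e.g. a yes/no question) for four establishments.
binary_original = [1, 0, 1, 1]
binary_imputed = [1, 0, 0, 1]

# Toy numerical variable for two establishments.
numeric_original = [100, 200]
numeric_imputed = [120, 160]

identical = share_identical(binary_original, binary_imputed)
rel_diff = mean_relative_difference(numeric_original, numeric_imputed)
```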