ICES III Montreal, June 18-21, 2007 A new Approach for Disclosure Control in the IAB Establishment Panel Multiple Imputation for Better Data Access Jörg.

ICES III

Montreal, June 18-21, 2007

A new Approach for Disclosure Control in the IAB Establishment Panel

Multiple Imputation for Better Data Access

Jörg Drechsler

Institute for Employment Research (IAB)


Overview

Background

Statistical disclosure control with fully synthetic data sets

Application to the IAB-Establishment Panel

First results

Proceedings/open questions


The IAB Establishment Panel

Annually conducted Establishment Survey

Since 1993 in Western Germany, since 1996 in Eastern Germany

Population: All establishments with at least one employee covered by social security

Source: Official Employment Statistics

Response rate among repeatedly interviewed establishments: more than 80%

Sample of more than 16,000 establishments in the last wave

Contents: employment structure, changes in employment, business policies, investment, training, remuneration, working hours, collective wage agreements, works councils

Generating Synthetic Data Sets (Rubin 1993)

Advantages:

- Data are fully synthetic

- No re-identification of single units possible

- All variables are still fully available

[Figure: schematic with blocks labelled X, Y observed, Y not observed, and several copies of Y synthetic]


Generating synthetic data sets for the IAB Establishment Panel

Create a synthetic data set for selected variables from the 1997 wave of the Establishment Panel

Imputation for the whole population is not feasible

Draw a new sample from the Official Employment Statistics using the same sampling design as for the Establishment Panel (Stratification by economic branch, size, and region)

Each stratum cell contains the same number of observations as in the 1997 wave of the Establishment Panel

Use additional information from the German Social Security Data (GSSD) for the imputation
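
The resampling step described on this slide can be sketched as follows. This is a minimal illustration, not the procedure actually used: the toy frame, the field names (`branch`, `size`, `region`) and the helper `draw_matching_sample` are hypothetical, and simple random sampling without replacement inside each stratum cell is an assumption.

```python
import random
from collections import Counter, defaultdict

def stratum_of(unit):
    # Stratum cell = (economic branch, size class, region); field names are hypothetical
    return (unit["branch"], unit["size"], unit["region"])

def draw_matching_sample(frame, panel, rng):
    """Draw, for every stratum cell, as many frame units as the panel has in that cell."""
    cell_sizes = Counter(stratum_of(u) for u in panel)
    frame_by_cell = defaultdict(list)
    for unit in frame:
        frame_by_cell[stratum_of(unit)].append(unit)
    sample = []
    for cell, n in cell_sizes.items():
        # Assumption: simple random sampling without replacement within the cell
        sample.extend(rng.sample(frame_by_cell[cell], n))
    return sample

# Toy sampling frame standing in for the Official Employment Statistics
rng = random.Random(0)
frame = [
    {
        "id": i,
        "branch": ["manufacturing", "services"][i % 2],
        "size": ["small", "large"][(i // 2) % 2],
        "region": ["west", "east"][(i // 4) % 2],
    }
    for i in range(400)
]
panel = rng.sample(frame, 40)  # stands in for the surveyed establishments
new_sample = draw_matching_sample(frame, panel, rng)
```

Because the cell counts are copied from the panel, the new sample has the same size and the same distribution over stratum cells as the original survey.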


The German Social Security Data (GSSD)

Contains information on all employees covered by social security

Since 1973 all employers are required to notify the social security agencies about all employees covered by social security.

The GSSD represents about 80% of the German workforce

Information from the GSSD is aggregated at the establishment level and matched to the IAB Establishment Panel via the establishment identification number

Information on: number of employees by gender, schooling, mean age of the employees, mean wages of the employees…
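
The aggregation and matching step might look roughly like the sketch below. The record layout and the field names (`establishment_id`, `gender`, `age`, `wage`) are hypothetical; the real GSSD has its own structure and many more variables.

```python
from collections import defaultdict
from statistics import mean

def aggregate_to_establishments(employee_records):
    """Collapse employee-level records into one summary per establishment."""
    grouped = defaultdict(list)
    for rec in employee_records:
        grouped[rec["establishment_id"]].append(rec)
    return {
        est_id: {
            "n_employees": len(recs),
            "n_women": sum(1 for r in recs if r["gender"] == "f"),
            "n_men": sum(1 for r in recs if r["gender"] == "m"),
            "mean_age": mean(r["age"] for r in recs),
            "mean_wage": mean(r["wage"] for r in recs),
        }
        for est_id, recs in grouped.items()
    }

def match_to_panel(panel_rows, aggregates):
    """Attach the aggregates to panel rows via the establishment identifier."""
    return [
        {**row, **aggregates[row["establishment_id"]]}
        for row in panel_rows
        if row["establishment_id"] in aggregates
    ]

# Toy data (invented values)
employees = [
    {"establishment_id": "A", "gender": "f", "age": 30, "wage": 2000.0},
    {"establishment_id": "A", "gender": "m", "age": 40, "wage": 3000.0},
    {"establishment_id": "B", "gender": "m", "age": 50, "wage": 2500.0},
]
panel_rows = [{"establishment_id": "A", "training": 1}]
matched = match_to_panel(panel_rows, aggregate_to_establishments(employees))
```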

Synthetic Establishment Panels

[Figure: diagram with the labels IAB Establishment Panel, GSSD, EPsynthetic, and several copies of Y synthetic]


Imputation Procedure

For simplicity, newly founded establishments are excluded from the sampling frame and from the panel

10 new samples are drawn

The number of observations in each sample equals the number of observations in the panel: n_s = n_p = 7,332

Every sample is imputed ten times using chained equations

Number of variables from the GSSD: 24

Number of variables from the establishment panel: 48

Imputations are generated using IVEware by Raghunathan, Solenberger and Van Hoewyk (2001)
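
To give a feel for what imputation by chained equations does, here is a deliberately small sketch with two numeric variables, each imputed from a linear regression on the other. It is not IVEware's implementation: the variable names `a` and `b`, the helper functions and the toy data are hypothetical, and the sketch only adds residual noise, whereas a proper multiple imputation would also draw the regression parameters.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x; also returns the residual SD."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    return intercept, slope, (rss / (n - 2)) ** 0.5

def chained_imputation(rows, n_cycles, rng):
    """Return one completed copy of rows; None marks a missing value."""
    variables = ("a", "b")
    missing = {v: [i for i, r in enumerate(rows) if r[v] is None] for v in variables}
    data = [dict(r) for r in rows]
    # Starting values: fill each gap with the observed mean of that variable
    for v in variables:
        observed = [r[v] for r in rows if r[v] is not None]
        for i in missing[v]:
            data[i][v] = sum(observed) / len(observed)
    # Cycle through the variables, re-imputing each one given the current other one
    for _ in range(n_cycles):
        for target, predictor in (("a", "b"), ("b", "a")):
            skip = set(missing[target])
            fit_idx = [i for i in range(len(data)) if i not in skip]
            intercept, slope, sd = fit_line(
                [data[i][predictor] for i in fit_idx],
                [data[i][target] for i in fit_idx],
            )
            for i in missing[target]:
                data[i][target] = intercept + slope * data[i][predictor] + rng.gauss(0.0, sd)
    return data

# Toy data with roughly 20% missing values per variable
rng = random.Random(1)
rows = []
for _ in range(200):
    a = rng.gauss(10.0, 2.0)
    b = 2.0 * a + rng.gauss(0.0, 1.0)
    rows.append({
        "a": None if rng.random() < 0.2 else a,
        "b": None if rng.random() < 0.2 else b,
    })

# Ten imputations of the same sample, mirroring "every sample is imputed ten times"
imputations = [chained_imputation(rows, n_cycles=5, rng=random.Random(seed)) for seed in range(10)]
```

With 10 newly drawn samples and ten imputations each, the procedure on this slide yields 100 synthetic data sets in total.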


First Results

Compare regression results from the original data with results from the synthetic data

Zwick (2005) analyses the productivity effects of different continuing vocational training forms in Germany

Result: vocational training is one of the most important measures for gaining and maintaining productivity

Probit regression to explain why firms offer vocational training

13 explanatory variables, including: share of qualified employees, establishment size, region, collective wage agreement, high qualification needs expected…

2 variables, based on the 1998 wave of the panel, are dropped for the evaluation

Descriptive comparison of the original and the synthetic data set

Variable                                               survey mean   synthetic data mean   Deviation
Training Yes/No                                             0.7069                0.7109       0.55%
Redundancies expected                                       0.2239                0.2223      -0.75%
Many employees are expected to be on maternity leave        0.0644                0.0737      14.34%
High qualification needs expected                           0.1551                0.1551       0.02%
Establishment size 20-199                                   0.3973                0.4043       1.77%
Establishment size 200-499                                  0.1348                0.1439       6.78%
Establishment size 500-999                                  0.0745                0.0769       3.30%
Establishment size 1000+                                    0.0942                0.0977       3.71%
Collective wage agreement                                   0.7643                0.7535      -1.41%
Apprenticeship training reaction on skill shortages         0.3632                0.3655       1.00%
Training reaction on skill shortages                        0.4490                0.4678       4.10%
State-of-the-art technical equipment                        0.6513                0.6861       5.35%
Apprenticeship training                                     0.6141                0.6247       1.73%
Share of qualified employees                                0.6740                0.6271      -6.96%
Number of employees                                       365.6238              350.5626      -4.12%
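
The deviation column appears to be the relative difference between the synthetic mean and the survey mean, expressed in percent. That reading is an assumption inferred from the reported figures, and the function name below is hypothetical. Some rows differ slightly from this formula when it is applied to the four-digit means shown, possibly because the slide's deviations were computed from unrounded means.

```python
def relative_deviation(survey_mean, synthetic_mean):
    """Relative deviation of the synthetic mean from the survey mean, in percent."""
    return (synthetic_mean - survey_mean) / survey_mean * 100.0

# Three rows taken from the comparison table: (survey mean, synthetic data mean)
rows = {
    "Number of employees": (365.6238, 350.5626),
    "Share of qualified employees": (0.6740, 0.6271),
    "Collective wage agreement": (0.7643, 0.7535),
}
deviations = {name: round(relative_deviation(*means), 2) for name, means in rows.items()}
```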

Results from the regression

                                       original data set        synthetic data set
Exogenous variables                    coeff.       p-value     coeff.       p-value
Redundancies expected                  0.2503***    0.0000      0.2513***    0.0001
Emp. exp. on maternity leave           0.2657**     0.0050      0.2445*      0.0172
High qual. needs expected              0.6480***    0.0000      0.6245***    0.0000
Appr. tr. react. on skill shortages    0.1130*      0.0390      0.1471*      0.0130
Tr. reaction on skill shortages        0.5273***    0.0000      0.5233***    0.0000
Establishment size 20-199              0.6859***    0.0000      0.6450***    0.0000
Establishment size 1000+               1.9641***    0.0000      1.7775***    0.0000
Share of qualified employees           0.7776***    0.0000      0.8204***    0.0000
State-of-the-art tech. equipment       0.1690***    0.0000      0.1678***    0.0001
Collective wage agreement              0.2541***    0.0000      0.3131***    0.0000
Apprenticeship training                0.4838***    0.0000      0.4058***    0.0000

*** significant at the 0.1% level, ** significant at the 1% level, * significant at the 5% level
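
One simple way to read the comparison is to check whether each variable keeps its significance level when moving from the original to the synthetic data. The sketch below applies the star thresholds from the table footnote to the reported p-values; the helper name `stars` and the dictionary layout are hypothetical.

```python
def stars(p_value):
    """Map a p-value to the significance stars defined in the table footnote."""
    if p_value < 0.001:
        return "***"
    if p_value < 0.01:
        return "**"
    if p_value < 0.05:
        return "*"
    return ""

# (p-value original data, p-value synthetic data), copied from the regression table
p_values = {
    "Redundancies expected": (0.0000, 0.0001),
    "Emp. exp. on maternity leave": (0.0050, 0.0172),
    "High qual. needs expected": (0.0000, 0.0000),
    "Appr. tr. react. on skill shortages": (0.0390, 0.0130),
    "Tr. reaction on skill shortages": (0.0000, 0.0000),
    "Establishment size 20-199": (0.0000, 0.0000),
    "Establishment size 1000+": (0.0000, 0.0000),
    "Share of qualified employees": (0.0000, 0.0000),
    "State-of-the-art tech. equipment": (0.0000, 0.0001),
    "Collective wage agreement": (0.0000, 0.0000),
    "Apprenticeship training": (0.0000, 0.0000),
}
changed = [name for name, (p_orig, p_syn) in p_values.items() if stars(p_orig) != stars(p_syn)]
```

Among the rows shown, only the maternity leave variable changes its level (from 1% to 5%); all coefficients keep their sign.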

Proceedings/open questions

More detailed evaluation

Replace only selected variables

Generate weights for the synthetic sample

Imputation of more than one wave maintaining the panel structure

References

Drechsler, J., Dundler, A., Bender, S., Rässler, S., Zwick, T. (2007). A New Approach for Disclosure Control in the IAB Establishment Panel - Multiple Imputation for a Better Data Access, IAB Discussion Paper No. 11/2007.

Reiter, J. and Drechsler, J. (2007). Releasing Multiply-Imputed, Synthetic Data Generated in Two Stages To Protect Confidentiality, submitted.


Thank you for your attention

Information from the two data sets

Information contained in the German Social Security Data (from 1997)

Available for all German establishments with at least one employee covered by social security

- number of full-time and part-time employees
- short-time employment
- mean age of the employees
- average wages for full-time employees
- average wages for all employees
- occupation
- schooling and training
- number of women and men
- number of German employees

Covered in both datasets

- establishment number, branch and size
- location of the establishment
- number of employees in June 1997

Information contained in the IAB Establishment Panel (wave 1997)

Available for establishments in the survey

- number of employees in June 1996
- qualification of the employees
- number of temporary employees
- number of agency workers
- working week (full-time and overtime)
- the firm's commitment to collective agreements
- existence of a works council
- turnover, advance performance and export share
- investment total
- overall wage bill in June 1997
- technological status
- age of the establishment
- legal form and corporate position
- overall company-economic situation
- reorganisation measures
- company further training activities
- additional information on new foundations

Disclosure is possible if…

- An establishment is included in the original data set and in at least one of the newly drawn samples

- The original values and the imputed values for this establishment are nearly the same
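
The two conditions can be combined into a simple screening rule, sketched below for a single numeric variable. The slide does not define "nearly the same"; the relative tolerance used here, the function names and the toy values are all assumptions for illustration.

```python
def relative_difference(original, imputed):
    """Absolute relative difference; zero originals are handled separately."""
    if original == 0:
        return 0.0 if imputed == 0 else float("inf")
    return abs(imputed - original) / abs(original)

def risky_records(original, synthetic_samples, tolerance):
    """Flag (sample index, establishment id) pairs that meet both disclosure conditions.

    original: {establishment_id: original value}
    synthetic_samples: list of {establishment_id: imputed value}, one dict per new sample
    tolerance: hypothetical threshold for "nearly the same"
    """
    flagged = []
    for k, sample in enumerate(synthetic_samples):
        for est_id, imputed in sample.items():
            in_both = est_id in original  # condition 1: unit is in the panel and in the new sample
            if in_both and relative_difference(original[est_id], imputed) <= tolerance:
                flagged.append((k, est_id))  # condition 2: values nearly identical
    return flagged

# Toy example (invented values)
original = {"A": 100.0, "B": 50.0}
synthetic_samples = [{"A": 103.0, "C": 10.0}, {"B": 80.0}]
flagged = risky_records(original, synthetic_samples, tolerance=0.05)
```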

Percentage of establishments selected in the IAB Establishment Panel and in the newly drawn samples

Sample      Percentage
sample1     11.4%
sample2     11.7%
sample3     11.4%
sample4     11.2%
sample5     11.7%
sample6     10.9%
sample7     11.4%
sample8     11.6%
sample9     11.5%
sample10    11.4%

How often are establishments included in the IAB Establishment Panel drawn in the new samples?

Occurrence in … sample(s)    Number    Percentage
0                             4,469         61.0%
1                             1,091         14.9%
2                               535          7.3%
3                               362          4.9%
4                               275          3.8%
5                               199          2.7%
6                               144          2.0%
7                                89          1.2%
8                                53          0.7%
9                                32          0.4%
10                               83          1.1%
Total                         7,332          100%
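
A table like this can be produced by counting, for every panel establishment, the number of new samples that contain it. The sketch below shows the counting on toy identifiers and then recomputes the percentages from the counts reported above; the function name and the toy data are hypothetical.

```python
from collections import Counter

def occurrence_table(panel_ids, samples):
    """Count how many panel establishments appear in exactly k of the new samples."""
    table = Counter()
    for est_id in panel_ids:
        k = sum(1 for sample in samples if est_id in sample)
        table[k] += 1
    return table

# Toy illustration: three panel establishments, two new samples
toy_table = occurrence_table(["A", "B", "C"], [{"A", "B"}, {"A"}])

# Counts as reported in the table above
reported = {0: 4469, 1: 1091, 2: 535, 3: 362, 4: 275, 5: 199,
            6: 144, 7: 89, 8: 53, 9: 32, 10: 83}
total = sum(reported.values())
percentages = {k: round(100.0 * n / total, 1) for k, n in reported.items()}
share_in_any_sample = round(100.0 * (total - reported[0]) / total, 1)
```

The reported counts add up to the panel size of 7,332, and about 39% of the panel establishments appear in at least one of the ten new samples.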

Identical Establishments by Establishment Size

Establishment size         in none of the 10 samples    in 1 to 5 samples    in 6 to 10 samples
1 to 199 employees                             96.6%                 3.3%                  0.1%
200 to 999 employees                           75.0%                16.2%                  8.8%
1000 and more employees                        50.8%                20.8%                 28.4%

Comparing original and imputed values

Binary variables:

- probability of identical values: 60-90%

Multiple response questions:

- with four categories: 57%

- with 13 categories: 6%

Numerical variables:

- average relative difference: 21%

- outliers
