Www.ihsn.org Geoffrey Greenwell, IHSN/PARIS21 IASSIST Conference Tampere, Finland, May 2009 Development of Microdata Anonymization Tools by the Olivier.

www.ihsn.org

• International Household Survey Network
• A network of international agencies
• Based in Paris at the OECD at PARIS21
• A coordinating mechanism to:
– Improve quality and use of household survey data in developing countries
– Harmonize international recommendations for survey design, data analysis, etc.
– Produce and disseminate international good practices

About IHSN

Accelerated Data Program

• Implementing the IHSN tools in the countries
• Technical and financial support to establish national data archives (in > 50 countries)
• Many datasets documented (DDI)
• Improved access to data by researchers, but not yet satisfactory. We can measure demand through the NADA
• Need to anonymize data remains the most frequently expressed concern and obstacle to data access
• The ADP has provided some guidance, but there is a lack of simple and intuitive tools and guidelines available to ADP countries

ADP/IHSN in the world

Map legend: ADP country; Expected ADP in 2009; By partners

Setting up Catalogs

Focus Nigeria

Effects of data availability on MDG 7: halving the proportion of the population without sustainable access to safe drinking water.

Providing robust estimates to inform policy makers and sector monitoring.

Water and Sanitation Sector. Workshop with WHO/UNICEF.

Effects of Data Availability

• Nigeria and the MDG: Rural access to improved water source

Resistance in the countries

• Nigeria Statistics Law: Statistical Act of 2007 obliges microdata release after due anonymization. The legal framework exists.

• Willing institution (the NBS in Nigeria)
• Current anonymization strategies, however, are limited to removal of direct identifiers
• Other countries are unable to articulate a proper policy for dissemination and tend to use confidentiality as a barrier to mask political resistance or inertia
• IHSN anonymization tools will be a way to deal with both real ethical concerns and political resistance

Better use of survey data

• Lots of survey data remain under-exploited because not accessible by researchers/users

• Obstacles:
– Technical
– Psychological
– Financial Support by many sponsors
– Legal
– Ethical
– Political … ? …

IHSN data documentation and cataloguing tools and guidelines

• Direct identifiers, which are variables such as names, addresses, or identity card numbers. They permit direct identification of a respondent but are not needed for statistical or research purposes, and should thus be removed from the published dataset.

• Indirect identifiers, which are characteristics that may be shared by several respondents, and whose combination could lead to the re-identification of one of them. For example, the combination of variables such as district of residence, age, sex, and profession would be identifying if only one individual of that particular sex, age and profession lived in that particular district. Such variables are needed for statistical purposes, and should thus not be removed from the published data files.
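The district/age/sex/profession example can be made concrete with a small Python sketch. It counts how many records in a file share each combination of indirect identifiers and flags combinations that occur only once. The records, field order, and function names are invented for illustration and are not taken from the IHSN tools. Note that a combination that is unique in a sample is not necessarily unique in the population, so this is only a first screening step.

```python
from collections import Counter

# Hypothetical records: (district, age, sex, profession). All values are invented.
records = [
    ("North", 34, "F", "teacher"),
    ("North", 34, "F", "teacher"),
    ("North", 51, "M", "farmer"),
    ("South", 34, "F", "nurse"),
    ("South", 34, "F", "nurse"),
    ("South", 34, "F", "nurse"),
]


def key_frequencies(rows):
    """Count how many records share each combination of indirect identifiers."""
    return Counter(rows)


def sample_uniques(rows):
    """Return the combinations that occur exactly once in the file."""
    return [key for key, count in key_frequencies(rows).items() if count == 1]


def satisfies_k_anonymity(rows, k):
    """True if every combination is shared by at least k records."""
    return all(count >= k for count in key_frequencies(rows).values())


uniques = sample_uniques(records)
print(uniques)
print(satisfies_k_anonymity(records, 2))
```

Here the single 51-year-old male farmer in the "North" district is the only record flagged, which mirrors the situation described in the bullet above.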

Anonymize: Process

Once all direct identifiers have been removed, a disclosure problem can still remain: the indirect identifiers must be dealt with.

The IHSN anonymization tools will approach these problems by building on a great deal of technical work undertaken by experts in the field.

The IHSN hosted an expert meeting in October 2008 to present its tools and acknowledges the work done by:

• University of Manchester
• ISTAT (Italian Statistics)
• Cornell University
• ICPSR

Defining the problem

Developing SDC tools

• Building on existing work
• Not an integrated software
• A collection of specialized tools for:
– Measuring the risk
– Reducing the risk
– Assessing the information loss
• 12 plug-ins developed in C++ that interface with SPSS, STATA or direct Server (Windows/Linux). Need to be thoroughly tested.

12 Plug-ins

• 12 plug-ins:
1. The μ-argus risk for weighted sample
2. Re-identification rate to individual risk threshold
3. Individual risk to household risk
4. L-diversity for unweighted data
5. SUDA2: DIS-sample data
6. Kanon: Micro-aggregation
7. Local recoding
8. Fixed length micro aggregation
9. Noise Addition
10. Pram: Post Randomization
11. Rank Swapping
12. Sampling
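To give a feel for what one of these risk-reduction techniques does, here is a minimal Python sketch of univariate micro-aggregation with a fixed group size: values are sorted, cut into groups of neighbors, and each value is replaced by its group mean. This illustrates the general technique only. It is not the algorithm of the IHSN C++ plug-in, and the function name, the tail-merging rule, and the income values are assumptions made for the example.

```python
def microaggregate(values, group_size):
    """Replace each value with the mean of its group of neighbors in sorted order.

    Groups have `group_size` members. A short leftover tail is merged into the
    last full group so that no group is smaller than `group_size` (assumed rule).
    """
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    # Positions of the values, ordered from smallest to largest value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    # Split the sorted positions into consecutive groups.
    groups = [order[i:i + group_size] for i in range(0, len(order), group_size)]
    if len(groups) > 1 and len(groups[-1]) < group_size:
        tail = groups.pop()
        groups[-1].extend(tail)
    masked = [0.0] * len(values)
    for group in groups:
        mean = sum(values[i] for i in group) / len(group)
        for i in group:
            masked[i] = mean
    return masked


incomes = [120, 95, 400, 110, 390, 105, 2000]  # invented values
masked = microaggregate(incomes, 3)
print(masked)
```

Every published value is now shared by at least three records, and the overall mean of the variable is unchanged. The price is a loss of within-group variation, which is the kind of effect the information-loss tools are meant to assess.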

Risk Measures & Intruder Scenarios

What does the intruder know?

Risk Reduction

What does the intruder want?

Based on CENEX Handbook on Statistical Disclosure Control Version 1.01

Individual risk methodology

Poisson model

Individual

Hierarchical

K-anonymity

l-diversity

t-completeness

SUDA

Record linkage

Distance-based

Probabilistic

Others

Measuring Disclosure Risk

Based on CENEX Handbook on Statistical Disclosure Control Version 1.01

Masking data

Synthetic data file

Perturbative

Sampling

Global recoding

Top/bottom coding

Local suppression

Non perturbative

MASCC

Fixed/variable group

Uni-/Multivariate

Uncorrelated

Correlated

Non-linear

Noise addition

Multiplicative noise

Micro-aggregation

Data swapping

Rank swapping

Rounding

Resampling

PRAM

Local recoding

Reducing disclosure risk

Categorical data:
– Direct comparison
– Comparison of contingency tables
– Entropy-based measures

Continuous data:
– Mean square error
– Mean absolute error
– Mean variation

Based on CENEX Handbook on Statistical Disclosure Control Version 1.01

Measuring Information Loss
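For continuous variables, two of the measures named on this slide can be sketched in a few lines of Python. The sketch compares an original variable with its masked version record by record. The function names and the sample values are assumptions for illustration, and the slide does not say how the IHSN tools compute these quantities.

```python
def mean_square_error(original, masked):
    """Average squared difference between original and masked values."""
    if len(original) != len(masked) or not original:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((o - m) ** 2 for o, m in zip(original, masked)) / len(original)


def mean_absolute_error(original, masked):
    """Average absolute difference between original and masked values."""
    if len(original) != len(masked) or not original:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(o - m) for o, m in zip(original, masked)) / len(original)


# Invented values: the masked file replaces each pair of neighbors by its mean.
original = [10.0, 20.0, 30.0, 40.0]
masked = [15.0, 15.0, 35.0, 35.0]

mse = mean_square_error(original, masked)
mae = mean_absolute_error(original, masked)
print(mse, mae)
```

Both measures are zero when the masked file equals the original and grow as the masking becomes more aggressive, so they can be tracked alongside a risk measure when choosing how strongly to protect a file.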

• In Stata (SPSS, SAS) using C++ plugins
– Stata version 9 or later
– Log file for easy replication of procedure
– Informative output
• Or command-line (plugins with “data server”)
• Why Stata (SPSS/SAS)?
– Because most countries use/know these software packages
– Can use all tabulation and analysis functions

Developing SDC tools: Proposal

Beta Interface

• Large, imperfect datasets in under-resourced countries
• For use by official data producers in developing countries (IHSN objective)
• Relevant for other users as well
• Free to all; public source code

Target use

• Testing, “calibrating” and documenting
– Cornell + IHSN + selected countries
• Development/implementation of training and TA program
– Detailed documentation and guidelines
– Reference manual and training materials

• Possibly launched before end of the year (IHSN website)

• Participation of others welcome

Work Program for 2009

• Adding to the tools to facilitate data access in developing countries:
– Tools
    • Metadata Editor
    • CDROM/HTML developer
    • Web Based National Data Archives
    • Question Bank
– Guidelines
    • Data Dissemination
    • Documentation Guide
    • Survey Quality Assessment Framework

Thank you.

The End
