www.ihsn.org
Geoffrey Greenwell, IHSN/PARIS21
IASSIST Conference, Tampere, Finland, May 2009
Development of Microdata Anonymization Tools by the Olivier.
www.ihsn.org
About IHSN
• International Household Survey Network
• A network of international agencies
• Based in Paris at the OECD at PARIS21
• A coordinating mechanism to:
– Improve the quality and use of household survey data in developing countries
– Harmonize international recommendations for survey design, data analysis, etc.
– Produce and disseminate international good practices
• Direct identifiers are variables such as names, addresses, or identity card numbers. They permit direct identification of a respondent but are not needed for statistical or research purposes, and should thus be removed from the published dataset.
• Indirect identifiers are characteristics that may be shared by several respondents, and whose combination could lead to the re-identification of one of them. For example, the combination of district of residence, age, sex, and profession would be identifying if only one individual of that particular sex, age, and profession lived in that particular district. Such variables are needed for statistical purposes, and should thus not be removed from the published data files.
Once all direct identifiers have been removed, a disclosure problem can remain: the indirect identifiers still have to be dealt with.
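A minimal sketch of the indirect-identifier problem, assuming a small hypothetical data file held as a list of Python dictionaries. The function names and the toy records are illustrative and are not part of the IHSN tools. The sketch counts how many records share each combination of indirect identifiers and lists the combinations that occur only once in the file. A record that is unique in the sample is not necessarily unique in the population, so this is a first screening step rather than a risk measure.

```python
from collections import Counter

def key_frequencies(records, keys):
    """Count how many records share each combination of indirect identifiers."""
    return Counter(tuple(r[k] for k in keys) for r in records)

def sample_uniques(records, keys):
    """Return the records whose key combination occurs exactly once in the file."""
    freq = key_frequencies(records, keys)
    return [r for r in records if freq[tuple(r[k] for k in keys)] == 1]

# Hypothetical toy data, not taken from any survey.
records = [
    {"district": "A", "age": 34, "sex": "F", "profession": "teacher"},
    {"district": "A", "age": 34, "sex": "F", "profession": "teacher"},
    {"district": "A", "age": 61, "sex": "M", "profession": "doctor"},
    {"district": "B", "age": 34, "sex": "F", "profession": "teacher"},
]
keys = ["district", "age", "sex", "profession"]

unique = sample_uniques(records, keys)
for r in unique:
    print("sample unique:", r)
```

Dropping a variable from `keys`, or recoding age into bands before counting, shows how coarser indirect identifiers reduce the number of sample uniques.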
The IHSN anonymization tools will approach these problems by building on a great deal of technical work undertaken by experts in the field.
The IHSN hosted an expert meeting in October 2008 to present its tools and acknowledges the work done by:
• University of Manchester
• ISTAT (Italian Statistics)
• Cornell University
• ICPSR
Defining the problem
Developing SDC tools
• Building on existing work
• Not an integrated software package
• A collection of specialized tools for:
– Measuring the risk
– Reducing the risk
– Assessing the information loss
12 plug-ins developed in C++ that interface with SPSS, Stata, or a direct server (Windows/Linux). They need to be thoroughly tested.
12 Plug-ins
1. The μ-argus risk for weighted sample
2. Re-identification rate to individual risk threshold
3. Individual risk to household risk
4. L-diversity for unweighted data
5. SUDA2: DIS-sample data
6. Kanon: Micro-aggregation
7. Local recoding
8. Fixed length micro aggregation
9. Noise Addition
10. Pram: Post Randomization
11. Rank Swapping
12. Sampling
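To make one of the listed methods concrete, here is a minimal sketch of univariate fixed-length micro-aggregation. It is written in Python for illustration only and is not the code of the C++ plug-in. The function name, the group size, and the income values are assumptions. Records are sorted on one continuous variable, cut into groups of k consecutive records, and each value is replaced by its group mean.

```python
def microaggregate(values, k):
    """Replace each value by the mean of its group of k neighbours in sorted order."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if not values:
        return []
    # Indices of the records, ordered by the value being masked.
    order = sorted(range(len(values)), key=lambda i: values[i])
    n_groups = max(len(values) // k, 1)
    result = [0.0] * len(values)
    for g in range(n_groups):
        start = g * k
        # The last group absorbs leftover records, so groups never fall
        # below k unless the whole file has fewer than k records.
        end = len(values) if g == n_groups - 1 else start + k
        group = order[start:end]
        mean = sum(values[i] for i in group) / len(group)
        for i in group:
            result[i] = mean
    return result

# Hypothetical income values.
incomes = [1200, 300, 450, 5000, 320, 1100, 980]
masked = microaggregate(incomes, 3)
print(masked)
```

Because every value is replaced by a group mean, the overall total and mean of the variable are preserved, while each published value is shared by at least k records.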
Risk Measures & Intruder Scenarios
What does the intruder know?
Risk Reduction
What does the intruder want?
Based on CENEX Handbook on Statistical Disclosure Control Version 1.01
Individual risk methodology
Poisson model
Individual
Hierarchical
K-anonymity
l-diversity
t-closeness
SUDA
Record linkage
Distance-based
Probabilistic
Others
Measuring Disclosure Risk
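Two of the measures named on this slide, k-anonymity and l-diversity, can be sketched in a few lines. This is an illustration under stated assumptions and not the plug-in implementation: the function name, variable names, and toy records are hypothetical, the data are unweighted, and l-diversity is taken in its simplest "distinct values" form.

```python
from collections import defaultdict

def k_and_l(records, keys, sensitive):
    """Return (k, l) for a non-empty file: the smallest group size and the
    smallest number of distinct sensitive values over all key combinations."""
    groups = defaultdict(list)
    for r in records:
        groups[tuple(r[q] for q in keys)].append(r[sensitive])
    k = min(len(v) for v in groups.values())
    l_div = min(len(set(v)) for v in groups.values())
    return k, l_div

# Hypothetical toy data.
patients = [
    {"district": "A", "age_group": "30-39", "illness": "flu"},
    {"district": "A", "age_group": "30-39", "illness": "flu"},
    {"district": "B", "age_group": "40-49", "illness": "flu"},
    {"district": "B", "age_group": "40-49", "illness": "asthma"},
]
k, l_div = k_and_l(patients, ["district", "age_group"], "illness")
print(k, l_div)
```

In this toy file every key combination is shared by two records, yet both records in district A report the same illness, so an intruder who can place a person in that group learns the sensitive value without re-identifying a specific record.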
Based on CENEX Handbook on Statistical Disclosure Control Version 1.01
Masking data
Synthetic data file
Perturbative
Sampling
Global recoding
Top/bottom coding
Local suppression
Non-perturbative
MASCC
Fixed/variable group
Uni-/Multivariate
Uncorrelated
Correlated
Non-linear
Noise addition
Multiplicative noise
Micro-aggregation
Data swapping
Rank swapping
Rounding
Resampling
PRAM
Local recoding
Reducing disclosure risk
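As an example of a perturbative method for categorical data, the following is a minimal sketch of PRAM (post-randomization). Each category is replaced by a random draw from a row of a transition matrix chosen by the data producer. The matrix values, the variable, and the function name are assumptions made for illustration; the sketch does not reproduce the plug-in.

```python
import random

def pram(values, transition, seed=None):
    """Replace each category by a random draw from its row of the transition matrix."""
    rng = random.Random(seed)
    out = []
    for v in values:
        row = transition[v]   # probabilities of moving from v to each category
        cats = list(row)
        out.append(rng.choices(cats, weights=[row[c] for c in cats], k=1)[0])
    return out

# Hypothetical transition matrix: each row sums to 1.
transition = {
    "urban": {"urban": 0.9, "rural": 0.1},
    "rural": {"urban": 0.2, "rural": 0.8},
}
area = ["urban", "rural", "urban", "urban", "rural"]
masked_area = pram(area, transition, seed=1)
print(masked_area)
```

Large diagonal probabilities keep most records unchanged and limit information loss, while the off-diagonal probabilities mean an intruder cannot be sure that a published category is the true one.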
Categorical data: entropy-based measures, direct comparison, comparison of contingency tables
Continuous data: mean variation, mean square error, mean absolute error
Based on CENEX Handbook on Statistical Disclosure Control Version 1.01
Measuring Information Loss
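A minimal sketch of three of these measures, comparing an original variable with its masked version. The function names and data are hypothetical. The categorical measure shown is only the simplest form of direct comparison (the share of values that changed) and is an assumption about how that measure might be computed, not the handbook's definition.

```python
def _check(original, masked):
    """Both files must describe the same, non-empty set of records."""
    if len(original) != len(masked) or not original:
        raise ValueError("inputs must be non-empty and of equal length")

def mean_square_error(original, masked):
    _check(original, masked)
    return sum((x - y) ** 2 for x, y in zip(original, masked)) / len(original)

def mean_absolute_error(original, masked):
    _check(original, masked)
    return sum(abs(x - y) for x, y in zip(original, masked)) / len(original)

def share_changed(original, masked):
    """Simple direct comparison for categorical data: fraction of changed values."""
    _check(original, masked)
    return sum(x != y for x, y in zip(original, masked)) / len(original)

# Hypothetical continuous and categorical variables, before and after masking.
income = [100.0, 200.0, 300.0, 400.0]
income_masked = [110.0, 190.0, 300.0, 420.0]
job = ["a", "b", "a", "c"]
job_masked = ["a", "a", "a", "c"]

print(mean_square_error(income, income_masked))
print(mean_absolute_error(income, income_masked))
print(share_changed(job, job_masked))
```

Computing such measures after each risk-reduction step lets the data producer weigh the drop in disclosure risk against the loss of analytical value.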
• In Stata (SPSS, SAS) using C++ plug-ins
– Stata version 9 or later
– Log file for easy replication of the procedure
– Informative output
• Or command line (plug-ins with a “data server”)
• Why Stata (SPSS/SAS)?
– Because most countries use and know this software
– Can use all tabulation and analysis functions
Developing SDC tools: Proposal
Beta Interface
• Large, imperfect datasets in under-resourced countries
• For use by official data producers in developing countries (IHSN objective)
• Relevant for other users as well
• Free to all; public source code
Target use
• Testing, “calibrating” and documenting
– Cornell + IHSN + selected countries
• Development/implementation of training and TA program
– Detailed documentation and guidelines
– Reference manual and training materials
• Possibly launched before end of the year (IHSN website)
• Participation of others welcome
Work Program for 2009
• Adding to the tools to facilitate data access in developing countries:
– Tools
  • Metadata Editor
  • CDROM/HTML developer
  • Web Based National Data Archives
  • Question Bank
– Guidelines
  • Data Dissemination
  • Documentation Guide
  • Survey Quality Assessment Framework