Top Banner
Disclosure Limitation in Large Statistical Databases Stephen F. Roehrig CMU/Pitt Applied Decision Modeling Seminar
73

Disclosure Limitation in Large Statistical Databases Stephen F. Roehrig CMU/Pitt Applied Decision Modeling Seminar.

Dec 20, 2015

Download

Documents

Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
Page 1: Disclosure Limitation in Large Statistical Databases Stephen F. Roehrig CMU/Pitt Applied Decision Modeling Seminar.

Disclosure Limitation in Large Statistical Databases

Stephen F. Roehrig

CMU/Pitt Applied Decision Modeling Seminar

Page 2: Disclosure Limitation in Large Statistical Databases Stephen F. Roehrig CMU/Pitt Applied Decision Modeling Seminar.

Collaborators George Duncan, Heinz/Statistics, CMU Stephen Fienberg, Statistics, CMU Adrian Dobra, Statistics, Duke Larry Cox, Center for Health Statistics,

H&HS Jesus De Loera, Math, UC Davis Bernd Sturmfels, Math, UC Berkeley JoAnne O’Roarke, Institute for Social

Research, U. Michigan

Page 3: Disclosure Limitation in Large Statistical Databases Stephen F. Roehrig CMU/Pitt Applied Decision Modeling Seminar.

Funding NSF NCES NIA NCHS NISS Census BLS

Page 4: Disclosure Limitation in Large Statistical Databases Stephen F. Roehrig CMU/Pitt Applied Decision Modeling Seminar.

How Should Government Distribute What It Knows?

Individuals and businesses contribute data about themselves, as mandated by government

Government summarizes and returns these data to policy makers, policy researchers, individuals and businesses

Everyone is supposed to see the value, but with no privacy downside Obligations: return as much information as

possible Restrictions: don’t disclose sensitive

information

Page 5: Disclosure Limitation in Large Statistical Databases Stephen F. Roehrig CMU/Pitt Applied Decision Modeling Seminar.

Who Collects Data? Federal, state and local government:

Census Bureau of Labor Statistics National Center for Health Statistics, etc.

Federally funded surveys: Health and Retirement Study (NIA) Treatment Episode Data Set (NIH)

Industry Health care Insurance “Dataminers”

Page 6: Disclosure Limitation in Large Statistical Databases Stephen F. Roehrig CMU/Pitt Applied Decision Modeling Seminar.

Real-World Example I The U.S. Census Bureau wants to allow

online queries against its census data. It builds a large system, with many

safeguards, then calls in “distinguished university statisticians” to test it.

Acting as “data attackers”, they attempt to discover sensitive information.

They do, and so the Census Bureau’s plans must be scaled back.

(Source: Duncan, Roehrig and Kannan)

Page 7: Disclosure Limitation in Large Statistical Databases Stephen F. Roehrig CMU/Pitt Applied Decision Modeling Seminar.

Real-World Example II To better understand health care usage

patterns, a researcher wants to measure the distance from patients’ residences to the hospitals where their care was given.

But because of confidentiality concerns, it’s only possible to get the patients’ ZIP codes, not addresses.

The researcher uses ZIP code centroids instead of addresses, causing large errors in the analysis.

(Source: Marty Gaynor)

Page 8: Disclosure Limitation in Large Statistical Databases Stephen F. Roehrig CMU/Pitt Applied Decision Modeling Seminar.

Real-World Example III To understand price structures in the

health care industry, a researcher wants to compare negotiated prices to list prices, for health services.

One data set has hospital IDs and list prices for various services. Another data set gives negotiated prices for actual services provided, but for for confidentiality reasons, no hospital IDs can be provided.

Thus matching is impossible.(Source: Marty Gaynor)

Page 9: Disclosure Limitation in Large Statistical Databases Stephen F. Roehrig CMU/Pitt Applied Decision Modeling Seminar.

Data Utility vs. Disclosure Risk Data utility: a measure of the usefulness

of a data set to its intended users Disclosure risk: the degree to which a

data set reveals sensitive information The tradeoff is a hard one, currently

judged mostly heuristically Disclosure limitation, in theory and

practice, examines this tradeoff.

Page 10: Disclosure Limitation in Large Statistical Databases Stephen F. Roehrig CMU/Pitt Applied Decision Modeling Seminar.

The R-U Confidentiality Map

Data Utility

DisclosureRisk

DisclosureThreshold

OriginalData

NoData

Two possible releasesof the same data

Page 11: Disclosure Limitation in Large Statistical Databases Stephen F. Roehrig CMU/Pitt Applied Decision Modeling Seminar.

The R-U Confidentiality Map

Data Utility

DisclosureRisk

DisclosureThreshold

DL Method 1

DL Method 2

OriginalData

NoData

For many disclosure limitation methods, we canchoose one or more parameters.

Page 12: Disclosure Limitation in Large Statistical Databases Stephen F. Roehrig CMU/Pitt Applied Decision Modeling Seminar.

Traditional Methods: Microdata Microdata: a set of records containing

information on individual respondents Suppose you are supposed to release

microdata, for the public good, about individuals. The data include: Name Address City of residence Occupation Criminal record

Page 13: Disclosure Limitation in Large Statistical Databases Stephen F. Roehrig CMU/Pitt Applied Decision Modeling Seminar.

Microdata Typical safeguard: delete

“identifiers” So release a dataset containing only city

of residence, occupation and criminal record…

Page 14: Disclosure Limitation in Large Statistical Databases Stephen F. Roehrig CMU/Pitt Applied Decision Modeling Seminar.

Microdata Typical safeguard: delete

“identifiers” So release a dataset containing only city

of residence, occupation and criminal record… Residence = Amsterdam Occupation = Mayor Criminal history = has criminal record

Page 15: Disclosure Limitation in Large Statistical Databases Stephen F. Roehrig CMU/Pitt Applied Decision Modeling Seminar.

Microdata Typical safeguard: delete “identifiers” So release a dataset containing only city of

residence, occupation and criminal record… Residence = Amsterdam Occupation = Mayor Criminal history = has criminal record

HIPPA says this is OK, so long as a researcher “promises not to attempt re-identification”

Is this far-fetched?

Page 16: Disclosure Limitation in Large Statistical Databases Stephen F. Roehrig CMU/Pitt Applied Decision Modeling Seminar.

Microdata Unique (or almost unique)

identifiers are common The 1997 voting list for Cambridge,

MA has 54,805 voters:

birth date alone 12%birth date and gender 29%birth date and 5-digit ZIP 69%birth date and full postal code 97%

Uniqueness of demographic fields

(Source: Latanya Sweeney)

Page 17: Disclosure Limitation in Large Statistical Databases Stephen F. Roehrig CMU/Pitt Applied Decision Modeling Seminar.

Traditional Solutions for Microdata Sampling Adding noise Global recoding (coarsening) Local suppression Data swapping Micro-aggregation

Page 18: Disclosure Limitation in Large Statistical Databases Stephen F. Roehrig CMU/Pitt Applied Decision Modeling Seminar.

Example: Data SwappingLocation Age Sex Candidate

Pittsburgh Young M Bush

Pittsburgh Young M Gore

Pittsburgh Young F Gore

Pittsburgh Old M Gore

Pittsburgh Old M Bush

Cleveland Young F Bush

Cleveland Young F Gore

Cleveland Old M Gore

Cleveland Old F Bush

Cleveland Old F Gore

Unique on location,age and sex

Find a match in another location

Flip a coin to seeif swapping is done

Page 19: Disclosure Limitation in Large Statistical Databases Stephen F. Roehrig CMU/Pitt Applied Decision Modeling Seminar.

Data Swapping Terms Uniques Key: Variables that are used to

identify records that pose a confidentiality risk.  

Swapping Key: Variables that are used to identify which records will be swapped.

Swapping Attribute: Variables over which swapping will occur.

Protected Variables: Other variables, which may or may not be sensitive.

Page 20: Disclosure Limitation in Large Statistical Databases Stephen F. Roehrig CMU/Pitt Applied Decision Modeling Seminar.

Location Age Sex Candidate

Pittsburgh Young M Bush

Pittsburgh Young M Gore

Pittsburgh Young F Gore

Pittsburgh Old M Gore

Pittsburgh Old M Bush

Cleveland Young F Bush

Cleveland Young F Gore

Cleveland Old M Gore

Cleveland Old F Bush

Cleveland Old F Gore

Uniques Key &Swapping Key

SwappingAttribute

ProtectedVariable

Page 21: Disclosure Limitation in Large Statistical Databases Stephen F. Roehrig CMU/Pitt Applied Decision Modeling Seminar.

Data Swapping In Use The Treatment Episode Data Set

(TEDS) National admissions to drug treatment

facilities Administered by SAMHSA (part of HHS) Data released through ICPSR 1,477,884 records in 1997, of an

estimated 2,207,375 total admissions nationwide

Page 22: Disclosure Limitation in Large Statistical Databases Stephen F. Roehrig CMU/Pitt Applied Decision Modeling Seminar.

TEDS (cont.) Each record contains

Age Sex Race Ethnicity Education level Marital status Source of income State & PMSA Primary substance abused and much more…

Page 23: Disclosure Limitation in Large Statistical Databases Stephen F. Roehrig CMU/Pitt Applied Decision Modeling Seminar.

First Step: Recode Example recode: Education level

Continuous 0-25

1: 8 years or less2: 9-113: 124: 13-155: 16 or more

becomes

Page 24: Disclosure Limitation in Large Statistical Databases Stephen F. Roehrig CMU/Pitt Applied Decision Modeling Seminar.

TEDS Uniques Key This was determined empirically,after

recoding. We ended up choosing: State and PMSA Pregnancy Veteran status Methadone planned as part of treatment Race Ethnicity Sex Age

Page 25: Disclosure Limitation in Large Statistical Databases Stephen F. Roehrig CMU/Pitt Applied Decision Modeling Seminar.

Other Choices Swapping key: the uniques key,

plus primary substance of abuse Swapping attribute: state, PMSA,

Census Region and Census Division Protected variables: all other TEDS

variables

Page 26: Disclosure Limitation in Large Statistical Databases Stephen F. Roehrig CMU/Pitt Applied Decision Modeling Seminar.

TEDS Results After recoding, only 0.3% of the

records needed to be swapped Swapping was done between

nearby locations, to preserve statistics over natural geographic aggregations

Page 27: Disclosure Limitation in Large Statistical Databases Stephen F. Roehrig CMU/Pitt Applied Decision Modeling Seminar.

Tables: Magnitude Data

Language Region A Region B Region C TotalC++ 11 47 58 116Java 1 15 33 49Smalltalk 2 31 20 53Total 14 93 111 218

Profit of Software Firms $10 million(Source: Java Random.nextInt(75))

Page 28: Disclosure Limitation in Large Statistical Databases Stephen F. Roehrig CMU/Pitt Applied Decision Modeling Seminar.

Tables: Frequency Data

Race $10,000 >$10,000 and $25,000 > $25,000 TotalWhite 96 72 161 329Black 10 7 6 23Chinese 1 1 2 4Total 107 80 169 356

Income Level, Gender = Male

Race $10,000 >$10,000 and $25,000 > $25,000 TotalWhite 186 127 51 364Black 11 7 3 21Chinese 0 1 0 1Total 197 135 51 386

Income Level, Gender = Female

(Source: 1990 Census)

Page 29: Disclosure Limitation in Large Statistical Databases Stephen F. Roehrig CMU/Pitt Applied Decision Modeling Seminar.

Traditional Solutions for Tables Suppress some cells

Publish only the marginal totals Suppress the sensitive cells, plus others as

necessary Perturb some cells

Controlled rounding Lots of research here, and good results

for 2-way tables For 3-way and higher, this is surprisingly

hard!

Page 30: Disclosure Limitation in Large Statistical Databases Stephen F. Roehrig CMU/Pitt Applied Decision Modeling Seminar.

Disclosure Risk, Data Utility Risk

the degree to which confidentiality might be compromised

perhaps examine cell feasibility intervals, or better, distributions of possible cell values

Utility a measure of the value to a legitimate user higher, the more accurately a user can

estimate magnitude of errors in analysis based on the released table

Page 31: Disclosure Limitation in Large Statistical Databases Stephen F. Roehrig CMU/Pitt Applied Decision Modeling Seminar.

Example: Delinquent Children

County Low Medium High Very High

Total

Alpha 15 1 3 1 20

Beta 20 10 10 15 55

Gamma 3 10 10 2 25

Delta 12 14 7 2 35

Total 50 35 30 20 135

Education Level of Head of Household

Number of Delinquent Children by County and Education Level(Source: OMB Statistical Policy Working Paper 22)

Page 32: Disclosure Limitation in Large Statistical Databases Stephen F. Roehrig CMU/Pitt Applied Decision Modeling Seminar.

Controlled Rounding (Base 3)

Uniform (and known) feasibility interval. Easy for 2-D tables, sometimes impossible for 3-D 1,025,908,683 possible original tables.

County Low Medium High Very High

Total

Alpha 15 0 3 0 18

Beta 21 9 12 15 57

Gamma 3 9 9 3 24

Delta 12 15 6 3 36

Total 51 33 30 21 135

Page 33: Disclosure Limitation in Large Statistical Databases Stephen F. Roehrig CMU/Pitt Applied Decision Modeling Seminar.

Suppress Sensitive Cells & Others

Hard to do optimally (NP complete). Feasibility intervals easily found with LP. Users have no way of finding cell value

probabilities.

County Low Medium High Very High

Total

Alpha 15 p s p 20

Beta 20 10 10 15 55

Gamma 3 10 s p 25

Delta 12 s 7 p 35

Total 50 35 30 20 135

Page 34: Disclosure Limitation in Large Statistical Databases Stephen F. Roehrig CMU/Pitt Applied Decision Modeling Seminar.

Release Only the Margins

County Low Medium High Very High

Total

Alpha 20

Beta 55

Gamma 25

Delta 35

Total 50 35 30 20 135

Page 35: Disclosure Limitation in Large Statistical Databases Stephen F. Roehrig CMU/Pitt Applied Decision Modeling Seminar.

Release Only the Margins 18,272,363,056 tables have our margins

(thanks to De Loera & Sturmfels). Low risk, low utility. Easy! Very commonly done. Statistical users might estimate internal

cells with e.g., iterative proportional fitting.

Or with more powerful methods…

Page 36: Disclosure Limitation in Large Statistical Databases Stephen F. Roehrig CMU/Pitt Applied Decision Modeling Seminar.

Some New Methods for Disclosure Detection Use methods from commutative algebra

to explore the set of tables having known margins

Originated with the Diaconis-Sturmfels paper (Annals of Statistics, 1998)

Extensions and complements by Dobra and Fienberg

Related to existing ideas in combinatorics

Page 37: Disclosure Limitation in Large Statistical Databases Stephen F. Roehrig CMU/Pitt Applied Decision Modeling Seminar.

Background: Algebraic Ideals

Let A be a ring (e.g., ℝ or ℤ), and I A. Then I is called an ideal if 0 I If f, g I, then f + g I If f I and h A, then f h I

Ex. 1: The ring ℤ of integers, and the set

I = {…-4, -2, 0, 2, 4,…}.

AIfg h

Page 38: Disclosure Limitation in Large Statistical Databases Stephen F. Roehrig CMU/Pitt Applied Decision Modeling Seminar.

Generators Let f1,…,fs A. Then

is the ideal generated by f1,…,fs

Ex. 1: The ring ℤ of integers, and the set I = {…-4, -2, 0, 2, 4,…}. SAT question: What is a generator of I ? I = 2, since I = {2} {…-1, 0, 1,…}

Ahhfhff si

s

iis ,,:,, 1

11

Page 39: Disclosure Limitation in Large Statistical Databases Stephen F. Roehrig CMU/Pitt Applied Decision Modeling Seminar.

Ideal Example 2 The ring k [x] of polynomials of one

variable, and the ideal I = x 4-1, x 6-1.

GRE question: What is a minimal generator of I (i.e., no subset also a generator)?

I = x 2-1 since x 2-1 is the greatest common divisor of {x 4-1, x 6-1}.

Page 40: Disclosure Limitation in Large Statistical Databases Stephen F. Roehrig CMU/Pitt Applied Decision Modeling Seminar.

Why Are Minimal Generators Useful? Compact representation of the

ideal---the initial description may be “verbose”.

Allow easy generation of elements, often in a disciplined order.

Guaranteed to explore the full ideal.

Page 41: Disclosure Limitation in Large Statistical Databases Stephen F. Roehrig CMU/Pitt Applied Decision Modeling Seminar.

Disclosure Analysis: Marginals Suppose a “data snooper” knows

only this:

County Low Medium High Very High

Total

Alpha 20

Beta 55

Gamma 25

Delta 35

Total 50 35 30 20 135

Page 42: Disclosure Limitation in Large Statistical Databases Stephen F. Roehrig CMU/Pitt Applied Decision Modeling Seminar.

Disclosure Analysis Of course the data collector knows

the true counts:

County Low Medium High Very High

Total

Alpha 15 1 3 1 20

Beta 20 10 10 15 55

Gamma 3 10 10 2 25

Delta 12 14 7 2 35

Total 50 35 30 20 135

Page 43: Disclosure Limitation in Large Statistical Databases Stephen F. Roehrig CMU/Pitt Applied Decision Modeling Seminar.

Disclosure Analysis What are the feasible tables, given

the margins? Here is one:

County Low Medium High Very High

Total

Alpha 15+1 1-1 3 1 20

Beta 20-1 10+1 10 15 55

Gamma 3 10 10 2 25

Delta 12 14 7 2 35

Total 50 35 30 20 135

Page 44: Disclosure Limitation in Large Statistical Databases Stephen F. Roehrig CMU/Pitt Applied Decision Modeling Seminar.

Disclosure Analysis Problems Both the data collector and the snooper

are interested in the largest and smallest feasible values for each cell. Narrow bounds might constitute disclosure.

Both might also be interested in the distribution of possible cell values. A tight distribution might constitute

disclosure.

Page 45: Disclosure Limitation in Large Statistical Databases Stephen F. Roehrig CMU/Pitt Applied Decision Modeling Seminar.

The Bounds Problem

0 1 2 3 4 5

County Low Medium High Very High

Total

Alpha 20

Beta 55

Gamma 25

Delta 35

Total 50 35 30 20 135

This is usually framedas a continuous problem.Is this the right question?

Page 46: Disclosure Limitation in Large Statistical Databases Stephen F. Roehrig CMU/Pitt Applied Decision Modeling Seminar.

The Distribution Problem

0 1 2 3 4 5

County Low Medium High Very High

Total

Alpha 20

Beta 55

Gamma 25

Delta 35

Total 50 35 30 20 135

Given the margins, anda set of priors over cellvalues, what distributionresults?

Page 47: Disclosure Limitation in Large Statistical Databases Stephen F. Roehrig CMU/Pitt Applied Decision Modeling Seminar.

Transform to an Algebraic Problem Define some indeterminates:

Then write the move

as the polynomial x11x22 - x12x21

x1

1

x1

2

x1

3

x1

4

x2

1

x2

2

x2

3

x2

4

x3

1

x3

2

x3

3

x3

4

x4

1

x4

2

x4

3

x4

4

1 -1

-1 1

Page 48: Disclosure Limitation in Large Statistical Databases Stephen F. Roehrig CMU/Pitt Applied Decision Modeling Seminar.

Ideal Example 3 Ideal I is the set of monomial differences

that take a non-negative table of dimension

J K with fixed margins M to another non-negative table with the same margins.

Putnam Competition question: What is a minimal generator of I ?

JKJK mJK

mnJK

n xxxx 11

11

Page 49: Disclosure Limitation in Large Statistical Databases Stephen F. Roehrig CMU/Pitt Applied Decision Modeling Seminar.

Solutions: Bounds, Distributions Upper and lower bounds on cells:

Integer linear programming (max/min, subject to constraints implied by the marginals).

Find a generator of the ideal: Use Buchberger’s algorithm to find a Gröbner basis. Apply long division repeatedly; the remainder yields

an optimum (Conti & Traverso).

Distribution of cell values: Use the Gröbner basis to enumerate or sample

(Diaconis & Sturmfels, Dobra, Dobra & Fienberg).

Page 50: Disclosure Limitation in Large Statistical Databases Stephen F. Roehrig CMU/Pitt Applied Decision Modeling Seminar.

Trouble in Paradise For larger disclosure problems

(dimension 3), the linear programming bounds may be fractional.

For larger disclosure problems (dimension 3), the Gröbner basis is very hard to compute.

Page 51: Disclosure Limitation in Large Statistical Databases Stephen F. Roehrig CMU/Pitt Applied Decision Modeling Seminar.

Why Fractional Bounds? Sometimes the marginal sum

values cause the constraints to intersect at non-integer points.

x2

x1

LP maximum for x2

Integer maximum

Page 52: Disclosure Limitation in Large Statistical Databases Stephen F. Roehrig CMU/Pitt Applied Decision Modeling Seminar.

But Could This Happen? Standard example from ManSci

101:

x2

x1

LP maximum for x2

Integer maximum

Page 53: Disclosure Limitation in Large Statistical Databases Stephen F. Roehrig CMU/Pitt Applied Decision Modeling Seminar.

Computing Gröbner Bases Both the Conti & Traverso and Sturmfels

constructions include extra variables. The ring k [x1,…,xIJK] is embedded in

k [x1,…,xIJK, y1,…,yN], where N is the number of constraints.

Buchberger’s algorithm applied. Then throw away basis elements not in k[x1,

…,xIJK] Computing a basis for the 333

problem: about 7 hours with Macaulay.

Page 54: Disclosure Limitation in Large Statistical Databases Stephen F. Roehrig CMU/Pitt Applied Decision Modeling Seminar.

An Improved Algorithm Designed to compute implicitly in k [x1,…,xIJK, y1,…,yN] but only

process elements in k [x1,…,xIJK].

Page 55: Disclosure Limitation in Large Statistical Databases Stephen F. Roehrig CMU/Pitt Applied Decision Modeling Seminar.

An Improved Algorithm Designed to compute implicitly in k [x1,…,xIJK, y1,…,yN] but only

process elements in k [x1,…,xIJK]. 333 problem runs in 25 mS.

Page 56: Disclosure Limitation in Large Statistical Databases Stephen F. Roehrig CMU/Pitt Applied Decision Modeling Seminar.

An Improved Algorithm Designed to compute implicitly in k [x1,…,xIJK, y1,…,yN] but only

process elements in k [x1,…,xIJK]. 333 problem runs in 25 mS. 433 problem runs in 20 min.

Page 57: Disclosure Limitation in Large Statistical Databases Stephen F. Roehrig CMU/Pitt Applied Decision Modeling Seminar.

An Improved Algorithm Designed to compute implicitly in k [x1,…,xIJK, y1,…,yN] but only

process elements in k [x1,…,xIJK]. 333 problem runs in 25 mS. 433 problem runs in 20 min. 533 problem runs in 3 months…

Page 58: Disclosure Limitation in Large Statistical Databases Stephen F. Roehrig CMU/Pitt Applied Decision Modeling Seminar.

Ideals

“If they’re so freaking ideal, how come you can’t compute them?” Overheard at Grostat 5, the

Conference on Commutative Algebra and Statistics.

Page 59: Disclosure Limitation in Large Statistical Databases Stephen F. Roehrig CMU/Pitt Applied Decision Modeling Seminar.

Other Approaches Connections with decomposable and

graphical log-linear models (Dobra & Fienberg, Dobra).

Example: 3-D sensitive table, two 2-D margins given. Use conditional independence:

A B C

Known margins

A

B

C

Page 60: Disclosure Limitation in Large Statistical Databases Stephen F. Roehrig CMU/Pitt Applied Decision Modeling Seminar.

Decomposable Models Gröbner bases are “simple,” and

can be written down explicitly. For some other problems (e.g.,

graphical models), the Gröbner basis can be “assembled” from bases for smaller pieces.

Nice!

Page 61: Disclosure Limitation in Large Statistical Databases Stephen F. Roehrig CMU/Pitt Applied Decision Modeling Seminar.

A New Method for Disclosure Limitation: Cyclic Perturbation Choose cycles that leave the

margins fixed. One example:

The set of cycles determines the published table’s feasibility interval

15 1 3 120 10 10 15 3 10 10 212 14 7 2

+ 1 0 -1 0 -1 0 1 0 0 0 0 0 0 0 0 0

16 1 2 119 10 11 15 3 10 10 212 14 7 2

=

Original Cycle Perturbed table

Page 62: Disclosure Limitation in Large Statistical Databases Stephen F. Roehrig CMU/Pitt Applied Decision Modeling Seminar.

Cyclic Perturbation: Details Choose a set of cycles that covers

all table cells “equally”. Example:

+ - 0 0 0 + - 0 0 0 + - - 0 0 +

0 + - 0 0 0 + - - 0 0 + + - 0 +

0 0 + - - 0 0 + + - 0 0 0 + - 0

- 0 0 + + - 0 0 0 + - 0 0 0 + -

Each cell has exactlytwo “chances” to move.

Page 63: Disclosure Limitation in Large Statistical Databases Stephen F. Roehrig CMU/Pitt Applied Decision Modeling Seminar.

Cyclic Perturbation: Details

Flip a three-sided coin with outcomes A (probability = ) B (probability = ) C (probability = )

If A, add the first cycle (unless there is a zero in the cycle)

If B, subtract the first cycle (unless there is a zero in the cycle)

If C, do nothing Repeat with the remaining cycles

Page 64: Disclosure Limitation in Large Statistical Databases Stephen F. Roehrig CMU/Pitt Applied Decision Modeling Seminar.

Cyclic Perturbation: Details For the chosen set of cycles, there

are 34=81 possible perturbed tables.

The feasibility interval is original value 2.TO

TPPerturbed

Table

Original

Table

cycle 1applied

cycle 2applied

cycle 3applied

cycle 4applied

Page 65: Disclosure Limitation in Large Statistical Databases Stephen F. Roehrig CMU/Pitt Applied Decision Modeling Seminar.

Cyclic Perturbation: Details Choose , . Perturb with each cycle. Publish the resulting table. Publish the cycles and , .

15 1 3 120 10 10 15 3 10 10 212 14 7 2

16 0 2 221 11 9 14 2 11 11 111 13 8 3

Original Perturbed table

Page 66: Disclosure Limitation in Large Statistical Databases Stephen F. Roehrig CMU/Pitt Applied Decision Modeling Seminar.

Analysis of Cell Probabilities

TO

TP

TO

T4

Tk

OriginalTable

PerturbedTable

PossibleTables

Page 67: Disclosure Limitation in Large Statistical Databases Stephen F. Roehrig CMU/Pitt Applied Decision Modeling Seminar.

Distributions of Cell Values Since the mechanism is public, a user can

calculate the distribution of true cell values: Compute every table Tk that could have been

the original, along with the probability Pr(TP | Tk).

Specify a prior distribution over all the possible original tables Tk.

Apply Bayes’ theorem to get the posterior probability Pr(Tk | TP) for each Tk.

The distribution for each cell is

qjitk

Pk

k

TTqjit),(:

)|Pr()),(Pr(

Page 68: Disclosure Limitation in Large Statistical Databases Stephen F. Roehrig CMU/Pitt Applied Decision Modeling Seminar.

Results for the Example

Pr( t(1,2) = q | TP ) 0.71 0.25 0.04 0.00 0.00 0.00Pr( t(1,4) = q | TP ) 0.06 0.25 0.38 0.25 0.06 0.00Pr( t(3,4) = q | TP ) 0.00 0.71 0.25 0.04 0.00 0.00Pr( t(4,4) = q | TP ) 0.00 0.05 0.29 0.44 0.21 0.01

q = 0 1 2 3 4 5

15 1 3 120 10 10 15 3 10 10 212 14 7 2

Original Perturbed table

16 0 2 221 11 9 14 2 11 11 111 13 8 3

Page 69: Disclosure Limitation in Large Statistical Databases Stephen F. Roehrig CMU/Pitt Applied Decision Modeling Seminar.

Properties It’s not difficult to quantify data

utility and disclosure risk (cf. cell suppression and controlled rounding).

Priors of data users and data intruders can be different.

Theorem: For a uniform prior, the mode of each posterior cell distribution is it’s published value.

Page 70: Disclosure Limitation in Large Statistical Databases Stephen F. Roehrig CMU/Pitt Applied Decision Modeling Seminar.

Scaling Sets of cycles w/ desirable properties

are easy to find for larger 2-D tables. Extensions to 3 and higher dimensions

also straightforward. Computing the perturbation for any size

table is easy & fast. The complete Bayesian analysis is

feasible to at least 2020 (with no special TLC)

Page 71: Disclosure Limitation in Large Statistical Databases Stephen F. Roehrig CMU/Pitt Applied Decision Modeling Seminar.

What Might Priors Be? They could reflect historical data. If I’m in the survey, I know my cell

is at least 1. Public information. Insider information.

Page 72: Disclosure Limitation in Large Statistical Databases Stephen F. Roehrig CMU/Pitt Applied Decision Modeling Seminar.

Rounding Redux A similar Bayesian analysis can be done,

provided the exact algorithm is available. It’s generally much harder to do. Using a deterministic version of Cox’s `87

rounding procedure, we must consider “only” 17,132,236 tables.

For uniform priors, the posterior cell distributions were nearly uniform.

Three days of computing time for a 44 table…

Page 73: Disclosure Limitation in Large Statistical Databases Stephen F. Roehrig CMU/Pitt Applied Decision Modeling Seminar.

Future Work (i.e. What’s Hot?) Much more to be learned about the

connections between log-linear models, algebraic structures and networks.

The Grostat conference, and the “Tables” conference at Davis have resulted in wonderful collaborations between statisticians, mathematicians and operations researchers.

Stay tuned!