February 2020

Protecting Privacy in Data Releases: A Primer on Disclosure Limitation

Christopher Sadler

Last edited on February 24, 2020 at 11:09 a.m. EST

Acknowledgments

We would like to thank the Bill & Melinda Gates Foundation for its generous support of our work. The views expressed in this report are those of its author and do not necessarily represent the views of the foundation, its officers, or its employees.

newamerica.org/oti/reports/primer-disclosure-limitation/

About the Author(s)

Christopher Sadler is the Education Data and Privacy Fellow at New America’s Open Technology Institute.

About New America

We are dedicated to renewing America by continuing the quest to realize our nation’s highest ideals, honestly confronting the challenges caused by rapid technological and social change, and seizing the opportunities those changes create.

About Open Technology Institute

OTI works at the intersection of technology and policy to ensure that every community has equitable access to digital technology and its benefits. We promote universal access to communications technologies that are both open and secure, using a multidisciplinary approach that brings together advocates, researchers, organizers, and innovators.


Contents

Introduction

Laws Governing Disclosure

Disclosure Limitation Techniques

Information Limiting Techniques

Data Perturbation Methods

The Census Bureau

Conclusion


Introduction

The falling cost of data storage and the spread of the internet have led to an

acceleration in the collection of data about individuals. Many organizations, both

private and public, gather and store information from a myriad of sources,

resulting in an accumulation of data exceeding 40 zettabytes (40 trillion

gigabytes) globally. Data holds the potential for substantial gains to society in

building knowledge, research, informing policy, and providing information to the

public. However, publishing or sharing data creates privacy risks in exposing

individuals to potential financial, reputational, and other harms and liabilities.

Organizations have ethical and often legal requirements to protect the

confidentiality of data, but this involves a tradeoff with the usefulness of the data.

Protecting confidentiality necessitates excluding, aggregating, or obscuring the

data in some way that reduces its detail and exactness. Thus, a balance must be

struck between confidentiality and the informational value of data.

“Disclosure” refers to the release of data by some means, including making it

publicly available or available to another entity or individual (such as the sharing

of records with a researcher). The primary privacy concern created by

disclosure arises when the released data contains direct personally

identifiable information (PII), or when other fields or aspects of the released data

can be used in some way (often in conjunction with other available datasets) to

identify a person. Disclosures may include sensitive information about

individuals, but the risk is in being able to link data in the disclosure to a specific

person. “Disclosure limitation” (also known as “disclosure avoidance” and

“disclosure control”) refers to the safeguards and statistical methods used to

reduce the risk of disclosure of identifiable information in a data release.

Privacy concerns about disclosure have often focused on the release of public-

use government data. The Census Bureau, with a primary mission of disclosing

data to the public, has been at the forefront of cutting-edge research on and

empirical use of methods for disclosure limitation. Other statistical agencies at

both the federal and state level also regularly release data, and non-statistical

agencies are about to start doing the same, as the 2019 OPEN Government Data

Act will require federal agencies to publish much of their information online as

open data.

There is also growing concern about how corporations use the data they hold. As

corporate data warehouses build in volume and detail over time, they become

valuable for discovering information relationships about customers through

analytic techniques (a process known as data mining). The potential for

derivation of highly sensitive information through data mining carries serious

ethical implications. There are calls for comprehensive laws that would control

collection and use of personal data by companies, but even without those laws,

we are seeing groundbreaking research into disclosure limitation from many

sources in the private sector.

Traditionally, government and private entities seeking to disclose information

without creating privacy harms have attempted to provide data in aggregate or

anonymized form, so that sensitive information cannot be related back to any

particular individual. In recent years, however, it has become clear that

traditional techniques of anonymization and aggregation of data are not as

privacy protecting as had been thought. The challenges of balancing the quality

and usefulness of disclosures with the fundamental rights of confidentiality and

privacy have become much more complex as both technological advances and

public perceptions of privacy have changed. There is a wide range of methods for

suppressing, aggregating, and obscuring data, all with a goal of creating a release

of information that reduces individually identifiable information. However,

increases in computing power, the advancement of analytical techniques and

sophistication of attacks, the growth of available data sources on individuals, and

other factors have weakened the protections of many traditional disclosure

techniques. While these older methods are still useful in reducing disclosure risks

and continue to be refined, there has been an accelerating shift to the modern,

formal disclosure limitation techniques of differential privacy.

This paper provides an overview of some of the privacy issues involved with data

disclosures, and how disclosure limitation techniques can be used to protect the

confidentiality of individuals whose data is included in disclosures. It provides an

overview of some of the primary methods that have traditionally been used, as

well as those that have emerged more recently. It does not aim to be an

exhaustive list of disclosure limitation methods, but will hopefully provide

pointers to further reading. The Census Bureau, which has been a primary center

of developing disclosure limitation techniques, is used as an example of how

disclosure limitation is practiced and how it has evolved.


Laws Governing Disclosure

Data privacy is not covered by a comprehensive law in the United States, though

this is an idea that is under much current discussion. Instead, there are a variety

of federal and state laws that form a patchwork of privacy protections for the

disclosure of data in the United States. These include certain sector-specific laws,

such as the Health Insurance Portability and Accountability Act (HIPAA) for

personal medical data, the Family Educational Rights and Privacy Act (FERPA)

for educational records, and the Gramm-Leach-Bliley Act (GLBA) for financial

information. Some industry best practice standards, such as the Health

Information Trust Alliance framework and the Payment Card Industry Data

Security Standard, also address disclosure, but focus more on data security

controls.

The Privacy Act of 1974 covers the collection and release of information

contained in U.S. federal government agency systems of records. It restricts

disclosure of personally identifiable records, prohibits disclosure of an

individual’s record without written consent—with certain exceptions, such as the

release of certain information under a Freedom of Information Act (FOIA)

request—and requires recordkeeping of all disclosures and releases of data.

Activities of statistical agencies and units of the government are also governed by

the Confidential Information Protection and Statistical Efficiency Act of 2002

(CIPSEA), which limits and protects the use of statistical data and is discussed

below further in the context of the Census Bureau.

Signed into law in 2019, the Open, Public, Electronic and Necessary (OPEN)

Government Data Act provides a mandate for all federal agencies to publish all

nonsensitive information assets in “modern, open, and electronic format.” In

2009, the White House issued the Open Government Directive to improve data

transparency in the federal government, which included an increase in the

release of data online through the Data.gov site. The OPEN Government Data Act

makes the Open Government Directive a requirement in statute, rather than a

policy. In implementation of the act, the Office of Management and Budget

(OMB) is set to issue guidance to agencies on “risks and restrictions related to the

disclosure of personally identifiable information.” This includes the risk that

although an individual data asset in isolation does not pose a privacy or

confidentiality risk, this data “when combined with other available information

may pose such a risk.”

The European Union has had a comprehensive privacy law for years, which was

overhauled by the passage of the General Data Protection Regulation (GDPR)

that went into effect in May 2018. The GDPR restricts disclosure of personally

identifiable information under Recital 26 to data that is “anonymized.” The

complicated issues of anonymization and potential re-identification are further


discussed below in the section on de-identification of data, using the GDPR as an

example.

Other country-specific laws, such as Canada’s Personal Information Protection

and Electronic Documents Act (PIPEDA) and the Australian Privacy Principles

(APPs), govern aspects of privacy and disclosure practices in varying ways.

Internationally, privacy principles that define practices to follow in handling data,

such as those developed by the Organization for Economic Co-operation and

Development and the Asia-Pacific Economic Cooperation (APEC) Cross-Border Privacy Rules, have

also been adopted by some countries.


Disclosure Limitation Techniques

Techniques for disclosure limitation can be classified in a number of ways. For

the purposes of this paper, techniques are grouped by information limiting

methods and data perturbation methods. Information limiting methods are

those that delete, mask, suppress, or obscure data fields or values in order to

prevent re-identification. Data perturbation methods are those that use statistical

means to alter either the underlying data itself or query results drawn from the

data.

Information Limiting Techniques

PII, Anonymization and the Re-identification Problem

The simplest method of disclosure limitation is to strip PII from a dataset,

removing all fields (or suppressing or masking these fields in some way) that

could directly and uniquely identify an individual, such as name, social security

number, and phone number. Through the 1990s and into the 2000s, PII was

often used as something of a bright-line approach to data anonymization. It

defined what data needed to be protected, with the remaining data fields

considered harmless for disclosure from a privacy perspective. However, in the

mid-2000s it became clear that a wide range of other data categories can be used

to identify individuals.

The GDPR expands the scope of protected information beyond PII, instead using

the term “personal data”—a broader range of potentially identifying information

as defined in Article 4(1):

‘personal data’ means any information relating to an identified or

identifiable natural person (‘data subject’); an identifiable natural

person is one who can be identified, directly or indirectly, in particular

by reference to an identifier such as a name, an identification number,

location data, an online identifier or to one or more factors specific to

the physical, physiological, genetic, mental, economic, cultural or social

identity of that natural person.

Most, if not all, privacy laws rely on some concept of data being either personally

identifiable or not to determine whether the law applies. As the GDPR continues,

“The principles of data protection should therefore not apply to anonymous

information, namely information that does not relate to an identified or

identifiable natural person or to personal data rendered anonymous in such a

manner that the data subject is not or no longer identifiable.” Defining

“identifiable” data, however, becomes complicated by the possibility of re-

identification attempts.

Re-identification is the matching of anonymized data back to an individual. In

recent years, faith in anonymization has been greatly shaken by studies

demonstrating re-identification of released data. Two high-profile examples are

the 2006 Netflix Prize study by Arvind Narayanan and Vitaly Shmatikov, which

re-identified individuals from Netflix’s release of over 100 million user ratings of

movies, and a 2009 Social Security Number study by Alessandro Acquisti and

Ralph Gross showing that data about an individual's place and date of birth can

be used to predict their Social Security number.

Most commonly, re-identification is performed using external databases to infer

information about the anonymized data (known as “linkage attacks”). The

Narayanan and Shmatikov Netflix study identified a subset of individuals by

cross-referencing the Netflix data with non-anonymized movie ratings from the

Internet Movie Database (IMDb). The Acquisti and Gross study used the Social

Security Administration’s Death Master File to detect statistical patterns in SSN

assignment, and used multiple sources for inferring birthdate, such as voter

registration lists, online white pages, and social media.
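A linkage attack of this kind can be sketched in a few lines. The records, field names, and the link_records helper below are hypothetical and invented purely for illustration; they are not drawn from either study, and real attacks usually rely on approximate rather than exact matching.

```python
# Hypothetical "anonymized" release: names removed, quasi-identifiers kept.
released = [
    {"zip": "20001", "birth_year": 1975, "sex": "F", "diagnosis": "asthma"},
    {"zip": "20001", "birth_year": 1982, "sex": "M", "diagnosis": "diabetes"},
    {"zip": "10027", "birth_year": 1975, "sex": "F", "diagnosis": "heart condition"},
]

# Hypothetical external dataset (for example, a voter list) that includes names.
public = [
    {"name": "Alice", "zip": "20001", "birth_year": 1975, "sex": "F"},
    {"name": "Bob", "zip": "20001", "birth_year": 1982, "sex": "M"},
    {"name": "Carol", "zip": "90210", "birth_year": 1990, "sex": "F"},
]

QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")


def link_records(released, public, keys=QUASI_IDENTIFIERS):
    """Pair each named person with a released record when the match is unique."""
    matches = []
    for person in public:
        key = tuple(person[k] for k in keys)
        candidates = [r for r in released if tuple(r[k] for k in keys) == key]
        if len(candidates) == 1:  # a unique match re-identifies the record
            matches.append((person["name"], candidates[0]))
    return matches


reidentified = link_records(released, public)
for name, record in reidentified:
    print(name, "->", record["diagnosis"])
```

In this toy case, two of the three released records are tied back to a named person even though the release itself contains no names.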

Researchers have developed new anonymization algorithms such as k-

anonymity, t-closeness, and l-diversity to more formally protect against re-

identification. These frameworks each rely on their own assumptions and

limitations, and therefore each protects against only certain types of attacks. For

example, k-anonymity requires that each combination of quasi-identifier values

(attributes that are not direct identifiers, but might potentially contribute to

identification, such as age and sex) be shared by at least k different records. In a

3-anonymous table containing medical condition by zip code and

age, every combination of zip code and age values needs to appear at least three

times. This can prevent the identification of a specific individual’s record in a

table, but is still susceptible to what is known as a homogeneity attack. If the

quasi-identifier values of an individual are known, then even though their actual

record may not be identified (as there are at least k records that all have the

same quasi-identifier values), their presence alone in a certain dataset could be

confirmed and could reveal sensitive information about that person. Consider

again a table of medical conditions 3-anonymized by zip code and age. If it

happened that the only three records for a certain zip code and age combination

all had a heart condition diagnosis, it could be possible for someone to deduce

information about someone matching that zip code and age. If we know Bob’s

age, where he lives (and thus his zip code), and knew he had been in the hospital,

we would then be able to tell that it was because of a heart condition.
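The k-anonymity condition and the homogeneity problem can both be checked mechanically. The following is a minimal sketch, assuming a hypothetical table in which zip code and age are the quasi-identifiers and the medical condition is the sensitive value; the function names are illustrative only.

```python
from collections import defaultdict

# Hypothetical table: (zip_code, age) are quasi-identifiers, condition is sensitive.
records = [
    ("20001", 45, "heart condition"),
    ("20001", 45, "heart condition"),
    ("20001", 45, "heart condition"),
    ("20002", 30, "asthma"),
    ("20002", 30, "diabetes"),
    ("20002", 30, "flu"),
]


def group_by_quasi_identifiers(rows):
    """Map each (zip_code, age) combination to its list of sensitive values."""
    groups = defaultdict(list)
    for zip_code, age, condition in rows:
        groups[(zip_code, age)].append(condition)
    return groups


def is_k_anonymous(rows, k):
    """True if every quasi-identifier combination appears at least k times."""
    return all(len(v) >= k for v in group_by_quasi_identifiers(rows).values())


def homogeneous_groups(rows):
    """Return the groups whose records all share a single sensitive value."""
    return {
        key: values[0]
        for key, values in group_by_quasi_identifiers(rows).items()
        if len(set(values)) == 1
    }


print(is_k_anonymous(records, 3))
print(homogeneous_groups(records))
```

The table passes the 3-anonymity check, yet the first group is flagged as homogeneous: anyone known to be in that group is revealed to have a heart condition.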

While anonymization has come under scrutiny as a method of privacy protection,

the concepts of anonymity and identification are still key to the applicability and

interpretation of privacy laws. Difficult questions concerning de-identification

will likely be around for the foreseeable future. As techniques for de-

identification improve, so will methods of attack. Ongoing research continues to

provide techniques for evaluating and measuring re-identification risks, but it is

unclear whether such methods can adequately account for external data sources

that may become available in the future. Datasets cannot be taken back once

released, so even if data is anonymized effectively based on current standards,

future techniques could remove protections. Additionally, datasets are becoming

more detailed and increasing longitudinally (covering a greater period of time),

and these higher-dimension, long-term datasets will be more challenging to

effectively anonymize.

Anonymization of data is perhaps best considered through a risk management

perspective. Removing or obscuring potentially identifying data can be of

benefit, not in providing certainty of de-identification, but in lowering risks

when combined with other privacy and security controls. Anonymization

techniques should not be a stand-alone approach to privacy, but rather one tool in

the disclosure limitation toolkit. Other methods, as discussed in this paper, can

be combined with anonymization to lower risks to acceptable levels.

Aggregating, Coarsening, and Suppressing Data

Beyond removing or masking direct PII identifiers, a number of disclosure

limitation techniques have been developed for obscuring data in some manner,

by coarsening, rounding, aggregating, or suppressing data.

Data coarsening (or generalization) techniques reduce the detail of the data so

that individuals whose information is reflected in categories with low n-count

values cannot be uniquely identified. One such technique involves top- and

bottom-coding methods to place bounds on the reporting of data in order to

prevent identification of outliers. Essentially, this means broadening data

categories to include more unusual or extreme values, so they are not listed on

their own, and thus potentially identifiable. For example, if only one individual in

a dataset were age 99, their age could be recoded to a broader 90+ category.
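Top- and bottom-coding can be illustrated with a short sketch. The thresholds of 18 and 90 and the function name are assumptions chosen for the example, not values prescribed by any agency.

```python
def top_bottom_code(age, lower=18, upper=90):
    """Recode an age into a bounded label; the thresholds are illustrative."""
    if age >= upper:
        return f"{upper}+"
    if age < lower:
        return f"under {lower}"
    return str(age)


ages = [16, 34, 57, 99]
coded = [top_bottom_code(a) for a in ages]
print(coded)
```

Only the extreme values change: the 99-year-old is folded into the 90+ category, while ages inside the bounds are reported as they are.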

Cell suppression is the withholding of information in the cell of a table output,

based on some threshold rule for what counts or aggregates would implicitly or

explicitly reveal confidential information. For table cells in which the data would

allow estimating a single individual’s value too closely, missing (or imputed)

values are displayed. Suppression is often used when there are very few values

contributing to a cell, or when one or two large values are the dominant

contributors to the aggregate statistics. The table below shows a very simple

suppression of the cell showing that two women in the dataset live in New York

City.

          NYC    DC    LA    Total
Male       10    14     6       30
Female      2    12     9       23

          NYC    DC    LA    Total
Male       10    14     6       30
Female      *    12     9       23

Across either rows or columns of a table, or across multiple tables, this

“primary” cell suppression alone does not always protect the data. Confidential

values of primary cells can be determined through comparing and subtracting

values. For these cases, secondary (or “complementary”) cell suppression is

necessary, in which additional values are also displayed as missing. In the

example above, the suppressed value of 2 could be determined by simply

subtracting 12 and 9 from 23. Thus, as shown in the following table, a secondary

suppression would be made.

          NYC    DC    LA    Total
Male       10    14     6       30
Female      *    12     *       23
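The subtraction attack and the effect of secondary suppression can be sketched for a single table row. The threshold of 3 and the rule of also withholding the smallest remaining cell are simplifying assumptions for illustration; production systems choose complementary cells with more sophisticated methods and must also account for column totals.

```python
SUPPRESSED = None  # marker for a withheld cell (displayed as "*")


def primary_suppress(row, threshold=3):
    """Withhold any cell whose count is below the threshold."""
    return [SUPPRESSED if count < threshold else count for count in row]


def recover_single_suppressed(cells, total):
    """If exactly one cell is withheld, recover it by subtracting from the total."""
    if cells.count(SUPPRESSED) != 1:
        return None
    return total - sum(c for c in cells if c is not SUPPRESSED)


def secondary_suppress(cells):
    """If exactly one cell is withheld, also withhold the smallest remaining cell."""
    if cells.count(SUPPRESSED) != 1:
        return cells
    smallest = min(c for c in cells if c is not SUPPRESSED)
    out = list(cells)
    out[out.index(smallest)] = SUPPRESSED
    return out


female = [2, 12, 9]  # NYC, DC, LA counts from the example table
total = sum(female)

after_primary = primary_suppress(female)
leaked = recover_single_suppressed(after_primary, total)
after_secondary = secondary_suppress(after_primary)
print(after_primary, leaked, after_secondary)
```

After primary suppression alone the withheld count is fully recoverable from the row total; once a second cell is withheld, only the sum of the two hidden cells can be derived.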

Data Perturbation Methods

Beginning in the 1970s, as computers moved beyond the days of punch cards and

mainframes, increasing numbers of researchers and others have been able to

easily access and query databases. As a result, computer scientists began

thinking about the associated risks to privacy and how to prevent queries from

revealing information about particular individuals. An initial approach was to


restrict or audit database queries, preventing queries that would pull back data

that could identify individuals. The simplest such limit is to only allow users to

run queries that would return aggregate, or statistical results, and prevent queries

that would pull back individual, unaggregated records. For example, developers

could design a system so that queries would only return record results of a certain

size in the result set. Using a certain threshold number, n, would ensure that only

aggregate queries based on at least a set of n records can be run. For example,

with an n of two, this would involve preventing a query from being run that would

return data of a single, individual record. The problem is that multiple queries

can use differentials and overlapping sets (calculating values across query results,

by say, subtracting the results of one query from another) to obtain confidential

information. A simple example is running a query seeking the total income for all

individuals in the data, and then running a second query seeking the total income

for all individuals in the data excluding one particular individual. Subtracting the

second result from the first would provide the income of that individual.

Researchers can employ techniques that prevent releasing statistics if the

number of common records returned in a set of queries exceeds a given threshold

to guard against these re-identification attempts. Further research has created

increasingly sophisticated methods of query restriction. However, query

restrictions ultimately provide no real guarantee of privacy against a

sophisticated user who could potentially generate sets of queries that eventually

reveal information about an individual.
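The differencing problem is easy to reproduce. The sketch below assumes a hypothetical query interface that only returns totals and refuses query sets smaller than a threshold n; the names and income values are invented.

```python
# Hypothetical income records; names and values are invented.
incomes = {"Ann": 52000, "Bob": 61000, "Cat": 48000, "Dan": 75000}

MIN_QUERY_SIZE = 2  # the threshold n: refuse queries over fewer records


def total_income(names):
    """Aggregate-only query: returns a sum, refusing sets smaller than n."""
    if len(names) < MIN_QUERY_SIZE:
        raise ValueError("query set too small")
    return sum(incomes[name] for name in names)


everyone = list(incomes)
everyone_but_bob = [n for n in everyone if n != "Bob"]

# Both queries pass the size check, yet their difference isolates one person.
bob_income = total_income(everyone) - total_income(everyone_but_bob)
print(bob_income)
```

A direct query for Bob alone is rejected, but two permitted aggregate queries reveal exactly the same value.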

The other approach researchers developed to protect databases is data

perturbation—either altering the actual, underlying data or altering query

outputs to protect confidentiality. This usually involves adding statistical noise—

altering values of the data while still maintaining the statistical relationships

between data fields. There is a long history of use of perturbation in disclosure

limitation, and there are a number of perturbation approaches and techniques.

Data swapping, first proposed by Tore Dalenius and Steven Reiss in the late

1970s, is a technique through which researchers use statistical models to find

pairs of records with similar attributes and switch personally identifying or

sensitive data values between the records. After this manipulation, outside

researchers or attackers will not be able to determine which values correspond to

which individuals. However, the aggregate values of the data are preserved

sufficiently to enable researchers to make statistical inferences.
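A heavily simplified sketch of the idea follows. It assumes that records sharing an age group count as "similar" and shuffles the sensitive value (income) among them; actual data swapping relies on statistical models to select pairs and typically swaps only a fraction of records. All names and values are hypothetical.

```python
import random

# Hypothetical records: (age_group, zip_code, income); income is the sensitive value.
records = [
    ("30-39", "20001", 52000),
    ("30-39", "20002", 61000),
    ("40-49", "20001", 75000),
    ("40-49", "20003", 48000),
]


def swap_within_groups(rows, rng):
    """Shuffle the sensitive value among records that share an age group."""
    by_group = {}
    for index, (age_group, _, _) in enumerate(rows):
        by_group.setdefault(age_group, []).append(index)
    swapped = list(rows)
    for indices in by_group.values():
        values = [rows[i][2] for i in indices]
        rng.shuffle(values)  # a shuffle may occasionally leave values in place
        for i, value in zip(indices, values):
            swapped[i] = (rows[i][0], rows[i][1], value)
    return swapped


swapped = swap_within_groups(records, random.Random(0))
print(swapped)
```

Because values move only within a group, totals for each age group and for the whole table are unchanged, while the link between a specific record and its income is no longer reliable.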

Other perturbation techniques involve changing the values of the data by some

random amount, while still preserving the underlying statistical properties of the

data within a certain range. For example, the perturbation technique of additive

noise works by replacing true values x with the values of x+r, drawing value r

from some distribution. The r amounts are such that the replacement x+r values

preserve the statistical relations between data records.
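Additive noise can be sketched as follows, assuming that r is drawn from a zero-mean normal distribution. The distribution, the noise scale of 5,000, and the simulated income values are all illustrative choices.

```python
import random
import statistics


def add_noise(values, scale, rng):
    """Replace each true value x with x + r, where r ~ Normal(0, scale)."""
    return [x + rng.gauss(0.0, scale) for x in values]


rng = random.Random(42)
true_values = [rng.uniform(30000, 90000) for _ in range(10000)]
noisy_values = add_noise(true_values, scale=5000, rng=rng)

print(round(statistics.mean(true_values)), round(statistics.mean(noisy_values)))
```

Individual values can shift by thousands, but because the noise averages out to zero, the mean of the perturbed data stays close to the mean of the original data.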

While perturbation has been used extensively and effectively by the Census

Bureau and other organizations, perturbation methods do have weaknesses. If

strongly correlated attributes can be found in real-world data, this correlation can

potentially be used to filter out the additive randomizations. The same can

occur if someone has certain background knowledge about the data.

Additionally, a number of mathematical and statistical techniques, such as

spectral filtering, can be used to filter off the random noise from perturbed data,

retrieving the original values.

Synthetic Data

Synthetic data are datasets that seek to replicate the statistical properties of real-

world datasets, serving as analytical replacements. Synthetic datasets can be

either fully synthesized, with all of the original dataset values generated

synthetically, or partially synthesized, with only certain fields or a portion of

records synthesized. Often, the data must be presented in the same form and

structure as the original data in order to be compatible with existing systems,

algorithms, and software. Created through various modeling techniques,

synthetic data differs from perturbation techniques that alter the original,

underlying data as discussed above. Instead, synthetic data creates completely

new data by using models that fit the original data (or by using defined

parameters and constraints) to generate statistically comparable data

independent from the underlying, real-world data.
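The model-then-generate approach can be sketched with a deliberately simple model. The example assumes a two-field dataset (age and income), fits a normal distribution to age and a linear relationship with normal residuals to income, and then draws entirely new records. The "real" data here is itself simulated, and every name and parameter is hypothetical; practical synthesizers use far richer models.

```python
import random
import statistics

rng = random.Random(7)

# Stand-in for a confidential dataset (itself simulated here for illustration).
real_age = [rng.uniform(25, 65) for _ in range(5000)]
real_income = [20000 + 900 * a + rng.gauss(0, 8000) for a in real_age]


def fit_model(x, y):
    """Fit x ~ Normal(mean, sd) and y = intercept + slope * x + Normal(0, resid_sd)."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    return {
        "mean_x": mean_x,
        "sd_x": statistics.stdev(x),
        "slope": slope,
        "intercept": intercept,
        "resid_sd": statistics.stdev(residuals),
    }


def generate_synthetic(model, n, rng):
    """Draw entirely new (x, y) records from the fitted model."""
    rows = []
    for _ in range(n):
        x = rng.gauss(model["mean_x"], model["sd_x"])
        y = model["intercept"] + model["slope"] * x + rng.gauss(0, model["resid_sd"])
        rows.append((x, y))
    return rows


model = fit_model(real_age, real_income)
synthetic = generate_synthetic(model, 5000, rng)
print(round(model["slope"]), round(statistics.mean(y for _, y in synthetic)))
```

No synthetic row corresponds to a real record, yet the age-income relationship captured by the model carries over. Anything the model does not capture is absent from the synthetic data.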

Synthetic data can greatly reduce the risks of re-identification through ancillary

datasets, as matching synthetic records against external databases is difficult.

The ability to adjust the model also provides an additional confidentiality

advantage. Researchers modeling the data can make decisions about which

relationships of the real-world data will be preserved. Relationships omitted from

the model will not be discoverable by analysts, since they will not be present in

the synthetic data, providing the ability to keep certain data correlations and the

sensitive information they might reveal confidential. For example, if the

correlation of interest in a research dataset is between gender, age, and health

condition, that could be the only correlation preserved in the synthetic dataset.

Other data fields might be generated and included in this dataset, modeled to

provide broad, aggregate statistics, but not with statistically significant

correlation to combinations of certain other variables—correlations that could

predictively reveal information about individuals.

There are still confidentiality concerns that need to be taken into account with

synthetic data, however. Synthetic datasets can potentially leak the underlying

data if the model fits too closely. For example, if the synthetic data has enough

different fields per individual, and the model is closely fitted to the original data,

outliers can potentially be identifiable. However, there is ongoing research aimed

at developing the means to generate synthetic data in a way that would provide

formal privacy guarantees (discussed in the section on differential privacy

below).

Adversarial machine learning, which is the use of AI techniques to detect

vulnerabilities, can potentially be used to determine information about a record

used in the modeling of the synthetic dataset (i.e. “real-world” data). However,

those seeking to uncover the information would also need key information about

the model used to create the data.

Apart from its value in disclosure limitation, another potential benefit of

synthetic data is the ability to generate large volumes of research data at low cost.

Machine learning requires running large volumes of training data through

algorithms, and synthetic data may have great value as a way to rapidly provide

these large volumes of data. These datasets contain no real-world data, but are

statistically similar enough to real-world data to be of training value. While

companies such as Google and Facebook generate large datasets as part of their

business, smaller companies may be able to use this synthetic data to jump-start

a machine learning program without collecting data about real people. From a

privacy standpoint, using synthetic data to train artificial intelligence is attractive

in that it avoids the need for collecting, storing, and using real-world data in the

large amounts needed for machine learning.

Differential Privacy

A group of prominent computer scientists first introduced the concept of

differential privacy (also known as “formal privacy”) in their 2006 paper,

Calibrating Noise to Sensitivity in Private Data Analysis, although precursors to

the technique go back decades. It is not a single tool or method, but rather a

privacy standard that provides formal mathematical guarantees of privacy that

can be implemented in various ways. Differential privacy’s guarantee is that an

adversary can learn virtually nothing more about an individual based upon

disclosures from a dataset than they would learn if that person’s record were not

included in the dataset. In other words, whether or not your personal data is

included, resulting outputs from a dataset would be approximately the same.

Strong privacy protection is provided, while still allowing an analyst to derive

useful statistical results. Differential privacy provides a promising solution to

database reconstruction and re-identification attacks, as it would be highly

difficult to link the noisy, approximate results to external data sources.

In practice, differential privacy works by adding a precisely calibrated amount of

statistical noise to query results (so it can essentially be thought of as a

perturbation method for query outputs). What differential privacy provides is an

approximation of the true value, so the exact same query could produce two

different answers. The expected gap between a differentially private output and the

real-world value can be tuned to be larger or smaller through a setting known as the

privacy loss parameter, which trades accuracy against privacy. Differential

privacy defines privacy risk as an allowable leakage of data on an individual in

comparison to a hypothetical database without a certain individual. The allowed

deviation between data that includes an individual and one that does not is

usually represented as ε (epsilon).
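One common way to implement this is the Laplace mechanism, sketched below for a simple counting query. The example assumes a sensitivity of 1 (adding or removing one person changes a count by at most 1), so the noise scale is 1/ε; the function names and the true count of 120 are hypothetical. This is an illustrative sketch, not a hardened implementation.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_count(true_count, epsilon, rng):
    """Laplace mechanism for a counting query (sensitivity 1): scale is 1/epsilon."""
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random(1)
true_count = 120  # hypothetical true answer to "how many records match?"

# The same query asked twice returns two different approximate answers.
answer_a = private_count(true_count, epsilon=0.5, rng=rng)
answer_b = private_count(true_count, epsilon=0.5, rng=rng)
print(answer_a, answer_b)
```

A smaller ε means a larger noise scale and therefore stronger privacy but less accurate answers; a larger ε does the reverse.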

While an individual’s personal information is almost irrelevant to the outputs

produced via differential privacy, some insignificantly small change in belief

about an individual can potentially be made based on the information released.

The probability that some inference can be made about an individual is at most

roughly (1+ε) times the probability that the inference could be made without the

individual’s data (the exact factor is e^ε, which is close to 1+ε when ε is small). For

example, if the baseline probability of an individual developing a certain disease is

3 percent (say for a female in the United States), with an ε of 0.01, the inferred

probability under differential privacy would rise from 3 percent to only about 3.03

percent (3 × (1+ε)) at most. Put another way, the probability difference in the

outputs between a dataset with the individual and one without the individual

included is at most about 0.03 percentage points.
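The arithmetic behind this example can be written out explicitly. In the notation assumed here, M is the differentially private mechanism, D and D' are datasets that differ only in one individual's record, and S is any set of possible outputs; these symbols are introduced for illustration and do not appear elsewhere in this paper.

```latex
\Pr[M(D) \in S] \;\le\; e^{\varepsilon}\,\Pr[M(D') \in S],
\qquad
e^{\varepsilon} \approx 1 + \varepsilon \quad \text{for small } \varepsilon .

\text{With } \varepsilon = 0.01: \quad
e^{0.01} \approx 1.01005,
\qquad
3\% \times 1.01005 \approx 3.03\% .
```

For small values of ε the exact factor and the 1+ε approximation agree to within a fraction of a percentage point, which is why the simpler form is often quoted.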

Differential privacy also measures and bounds the total privacy loss over multiple

analyses. Part of the application of differential privacy involves establishing a

“privacy budget,” which limits the overall amount of data that can be disclosed.

Setting this budget requires determining the cumulative risk of data disclosures

over the lifespan of the data. With all disclosure limitation techniques, there is no

avoiding the fundamental fact that when multiple analyses are performed using

an individual’s data, disclosure risk increases by some amount. Thus with each

statistical release, or query, under differential privacy, some small amount of

potentially private information is leaked. Therefore, while risk does increase with

each release, the privacy budget ensures that risk accumulates in a bounded way.

Each query is analyzed to determine its privacy cost (ε) and whether the

remaining balance of the privacy budget (the budget minus a running tally of ε over

all queries) is sufficiently high to run it. Setting a privacy budget thus returns us to the always

present trade-off of informational value of the data and confidentiality;

potentially releasing identifiable information (if the privacy budget is set too

high) versus data releases not being informationally useful (if the budget is set

too low). Methods for optimally calculating the privacy budget are an area of

current research.
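The bookkeeping described above can be sketched as a small budget tracker. This is a minimal illustration under stated assumptions, not a production design. It assumes basic sequential composition (the ε values of successive queries simply add up) and answers counting queries with Laplace noise. The class and function names, the toy data, and the budget of 1.0 are all hypothetical.

```python
import random

class PrivacyBudget:
    """Tracks cumulative epsilon spent, assuming costs add up across queries."""

    def __init__(self, total_epsilon: float) -> None:
        self.total_epsilon = total_epsilon
        self.spent = 0.0

    @property
    def remaining(self) -> float:
        return self.total_epsilon - self.spent

    def charge(self, epsilon: float) -> None:
        """Deduct a query's cost, refusing the query if the balance is too low."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if epsilon > self.remaining + 1e-12:  # small tolerance for float rounding
            raise RuntimeError("privacy budget exhausted; query refused")
        self.spent += epsilon


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(records, predicate, epsilon, budget, rng):
    """Answer a counting query (sensitivity 1) with Laplace noise."""
    budget.charge(epsilon)  # pay for the query before answering it
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random(0)
ages = [23, 35, 41, 58, 62, 67, 71, 19, 44, 80]  # toy data, not real records
budget = PrivacyBudget(total_epsilon=1.0)

first = noisy_count(ages, lambda a: a >= 65, 0.5, budget, rng)
second = noisy_count(ages, lambda a: a < 30, 0.5, budget, rng)

try:
    noisy_count(ages, lambda a: a >= 40, 0.1, budget, rng)
    third_refused = False
except RuntimeError:
    third_refused = True  # the running tally already equals the budget
```

A larger total budget would let more queries through, or allow each query a larger ε and therefore less noise, at the cost of weaker protection. That is the trade-off the text describes.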

Differential privacy has recently seen a number of real-world uses by companies.

Uber uses differential privacy to protect internal analyses, such as those done on

driver revenue. Apple is using differential privacy to protect user privacy while

improving the usability of features such as lookup hints. Federal agencies such

as the Census Bureau are also beginning to adopt differential privacy.

Additional tools for making differential privacy more accessible are under

development, and some are provided as open source, such as Google’s

differential privacy library, which is available via GitHub and allows users to calculate

simple differentially private statistics from a dataset.

Differential privacy stands as one of the most promising disclosure limitation

techniques, one that can provide formal, mathematical assurances of privacy

while unlocking valuable research data. However, like any other disclosure

limitation method, it is not an absolute assurance. In addition to concerns about

correctly calculating the privacy budget, covert-channel attacks can potentially

be used against differentially private query systems, which must be protected

against them. Information other than the query values, such as the time taken to

complete a query, could be used to reveal facts such as the

presence of an individual in a database. For example, a query looking for an

individual in a dataset of cancer patients may take one second to run if the

individual is not present, versus a half hour to run if the individual is in the

dataset.
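The timing channel in this example, and one possible countermeasure, can be illustrated with a small simulation. The runtimes, the 60-second attacker threshold, the 5-second fixed response time, and all function names are hypothetical. The defense shown always responds after a fixed time and substitutes a default answer if the real work would overrun. It is a simplified sketch of the general idea of making running time independent of the data, not a description of any particular system.

```python
from typing import Tuple

# Hypothetical runtimes from the example: 1 second if the person is absent,
# half an hour if the person is present.
RUNTIME_ABSENT = 1.0
RUNTIME_PRESENT = 30 * 60.0

def naive_query(person_present: bool) -> float:
    """Return the time an observer sees when the query runs unprotected."""
    return RUNTIME_PRESENT if person_present else RUNTIME_ABSENT

def padded_query(person_present: bool, fixed_seconds: float) -> Tuple[float, bool]:
    """Simulate a query that always responds after exactly `fixed_seconds`.

    Returns (observed_time, completed). If the real work would take longer
    than the fixed time, it is cut off and a default answer would be used.
    """
    actual = naive_query(person_present)
    completed = actual <= fixed_seconds
    return fixed_seconds, completed

def attacker_guess(observed_seconds: float) -> bool:
    """Guess membership from timing alone: slow queries suggest presence."""
    return observed_seconds > 60.0

# Unprotected: the attacker's guess differs between the two cases,
# so timing reveals membership.
leaks = attacker_guess(naive_query(True)) != attacker_guess(naive_query(False))

# Protected: both cases look identical to the observer.
t_present, _ = padded_query(True, fixed_seconds=5.0)
t_absent, _ = padded_query(False, fixed_seconds=5.0)
protected = attacker_guess(t_present) == attacker_guess(t_absent)
```

For the defense to hold, the `completed` flag must stay hidden from the person issuing the query. Any default answer that is substituted also adds error that has to be accounted for.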



The Census Bureau

While new legal rules mandating government transparency, such as the Open

Data Act, will require agencies to release more data, the Census Bureau has long

differed from other major government agencies in that the public release of data

is one of its primary functions. The Census Bureau publishes a large amount of

information on the demographics and economy of the United States, while

endeavoring to protect the privacy of individuals. For the 2010 Census, the

Bureau published 5.6 billion independent tabular summaries, based on over 300

million person records. The Census Bureau faces a number of disclosure

limitation challenges, including the high-dimensionality of its data, and the need

to preserve associations among variables. In addition to releasing tabular

summaries, the Census Bureau also publicly releases microdata (record-level

data) from the decennial census and from many of its demographic and

economic surveys.

Under Title 13 of the U.S. Code, the Bureau is prohibited from releasing data that

allows “any particular establishment or individual” to be identified. In addition,

the Census Bureau is bound by the Confidential Information Protection and

Statistical Efficiency Act of 2002 (CIPSEA). Applied primarily, but not

exclusively, to the statistical agencies of the federal government, CIPSEA was

created to provide federal agencies the ability to make a statutory commitment to

confidentiality and to restrict data use to statistical purposes only. CIPSEA sets

high penalties for disclosures, including fines and jail time. Other statistical

agencies have often used data licensing agreements to provide data to specific

users with confidentiality requirements. However, the Census Bureau cannot rely

on such agreements because any Census data released to external parties is

automatically considered publicly available. The Census Bureau thus has a strong

focus on preserving confidentiality in the data it releases, and has been on the

cutting edge of disclosure limitation methods.

The history of how the Census Bureau has protected public releases of

information provides useful examples of disclosure limitation in practice. Early

censuses in the 1800s used few privacy measures, often only removing names. As

concerns about confidentiality grew, the 1929 census law established a

requirement that “no publication shall be made by the Census Office whereby the

data furnished by any particular establishment or individual can be identified.”

Protections were further codified and strengthened by the 1954 census law (Title

13 of the U.S. Code). Until the early 1960s, Census data was only released in

printed volumes, greatly limiting the detail and amount of data that could be

disclosed (and thus reducing privacy risks). With the more recent move to publishing

extensive electronic data, and the attendant greater need to

protect privacy, the Census Bureau has used a number of information reduction

and data perturbation methods to limit disclosure. Information reduction

techniques traditionally used by the Bureau include geographic thresholds,

coding, and sampling. For example, to protect against identification, any

geographic area identified on public-use files must have a population over

100,000, and every category of a categorical variable (a variable used for grouping, such as

gender) must contain at least 10,000 people nationwide; otherwise the category is

recoded into a broader one. Cell suppression methodology, as described above,

has also been a primary information restriction method used by the Census Bureau,

one it has continually worked to improve. In 1996, the Bureau began using

data swapping and noise infusion techniques to perturb data by a confidential

amount. The Census Bureau has also protected data from the decennial census

and the American Community Survey (an ongoing population and housing

survey) by creating partially and fully synthetic data.
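The population thresholds described above can be sketched in a few lines. The two threshold values come from the text; everything else is an assumption made for illustration. The function names and toy data are hypothetical, and folding every small category into a single "Other" label simplifies real recoding schemes, which merge categories into meaningful broader groups.

```python
from collections import Counter
from typing import Dict, Iterable, List

GEOGRAPHY_MIN_POPULATION = 100_000   # areas must exceed this to be identified
CATEGORY_MIN_COUNT = 10_000          # categories need at least this many people

def identifiable_areas(populations: Dict[str, int]) -> List[str]:
    """Keep only areas whose population is over the geographic threshold."""
    return [area for area, pop in populations.items()
            if pop > GEOGRAPHY_MIN_POPULATION]

def recode_small_categories(values: Iterable[str],
                            min_count: int = CATEGORY_MIN_COUNT,
                            broader_label: str = "Other") -> List[str]:
    """Fold every category with fewer than `min_count` members into one
    broader category (a simplification of real recoding schemes)."""
    values = list(values)
    counts = Counter(values)
    return [v if counts[v] >= min_count else broader_label for v in values]

# Toy inputs, with a scaled-down category threshold so the example runs instantly.
areas = {"County A": 250_000, "County B": 80_000, "County C": 100_000}
occupations = ["teacher"] * 5 + ["nurse"] * 4 + ["falconer"] * 1

released_areas = identifiable_areas(areas)
recoded = recode_small_categories(occupations, min_count=3)
```

In practice the merged category would itself need to be checked against the threshold, and merged again if it is still too small.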

To further protect against disclosure risks, the Census Bureau also uses

procedural and administrative methods. Before dissemination, all data products

released by the Census Bureau must be reviewed by its Disclosure Review

Board (DRB). The DRB examines whether appropriate disclosure limitation

techniques have been applied to a Census product and also determines whether

the product presents additional disclosure risks that need to be addressed.

After an error was made in the release of a product in 2010, the Bureau also

created the position of disclosure limitation officer. Each division at the Census

Bureau that produces data releases must designate an officer to oversee all

disclosure limitation activities and final submission to the DRB.

Centralized disclosure review boards (used by the Census Bureau, and other

government agencies such as the Department of Education) offer added benefits

beyond what a more limited, specific review of a disclosure would provide. Each

data release can be considered in the context of all planned data releases by an

agency. Additionally, centralized disclosure review boards bring together experts

from across the agency, including staff with technical skills and those with

specialized knowledge about particular data types and datasets.

A different approach to disclosure limitation is to restrict data access by

legal and/or operational means. Due to confidentiality concerns, some Census

data cannot be released publicly. To provide secure, authorized data access to

researchers (rather than licensed release), the Census Bureau maintains 29

Federal Statistical Research Data Centers (RDCs) hosted at government

agencies, universities, and nonprofit institutions. Using their own review and

approval processes, statistical agencies (both the Census Bureau and others,

including the Bureau of Labor Statistics and the Bureau of Economic Analysis)

provide microlevel data to the secure RDC environments. Researchers must

obtain Census Bureau Special Sworn Status by passing a background check and

swearing a lifetime confidentiality oath. Under Title 13 and Title 26 of the U.S.

Code, penalties of a federal prison sentence of up to five years, a fine of up to

$250,000, or both apply to any violations of the confidentiality requirements.

Researchers work under the supervision of RDC employees on non-networked

machines, and a researcher’s output, code, and notes all undergo

disclosure review by an RDC analyst. Additionally, certain commands in the statistical software used at

the RDCs, such as those for copying or printing datasets, are

restricted. Certain projects may allow remote access through a secure

communication network, with the code submitted by the researcher,

executed on a computer in the RDC, and subject to the same code and output

review provisions.

Recognizing that increases in computing power and the availability of external

databases were raising the risk of re-identification, the Census Bureau has

begun to move from legacy disclosure limitation methods to techniques based on

formal privacy. In 2019, internal researchers at the Census Bureau discovered that

confidential data could be reconstructed from the publicly released tabulations of

the 2010 Census by using commercial data, potentially revealing the race and

ethnicity of individuals. For the 2020 Census, differential privacy will be used to

protect data through a new processing system developed in-house. The

adoption of differential privacy will require the Census Bureau to closely evaluate how the

quantity and nature of the statistics it releases affect its privacy budget, as each

release of data will use a fraction of it. Tables for which high accuracy is critical

will require a larger share of the privacy budget.
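The link between a table's share of the privacy budget and its accuracy can be made concrete with a sketch. It assumes each table is a set of counts protected with Laplace noise, so the expected absolute error of a count equals the noise scale (sensitivity divided by ε). The text does not describe the noise mechanism in the Bureau's system, so the table names, the weights, and the choice of Laplace noise are all hypothetical.

```python
from typing import Dict

def allocate_budget(total_epsilon: float, weights: Dict[str, float]) -> Dict[str, float]:
    """Split a total privacy budget across tables in proportion to weights."""
    total_weight = sum(weights.values())
    return {name: total_epsilon * w / total_weight for name, w in weights.items()}

def expected_abs_error(epsilon: float, sensitivity: float = 1.0) -> float:
    """Mean absolute error of a Laplace-noised count, which equals the
    noise scale sensitivity / epsilon."""
    return sensitivity / epsilon

# Hypothetical tables and priorities; not the Bureau's actual allocation.
weights = {"state_population": 6.0, "county_age_by_sex": 3.0, "block_detail": 1.0}
shares = allocate_budget(total_epsilon=1.0, weights=weights)
errors = {name: expected_abs_error(eps) for name, eps in shares.items()}

for name in weights:
    print(f"{name}: epsilon={shares[name]:.2f}, expected error={errors[name]:.1f}")
```

Under these assumptions, the table given six times the budget share of the lowest-priority table has one-sixth of its expected error. This is why tables that need high accuracy consume more of the budget.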



Conclusion

Collection of data continues to expand rapidly, growing datasets into longer-term

repositories with increasing value. However, this higher-dimensional, longitudinal

data creates a greater risk of privacy harms and a corresponding need to

develop more privacy-protective techniques and technologies. The tension between

providing data detailed enough to be useful and maintaining the confidentiality of

the underlying information will always remain. When datasets include

potentially identifiable personal information, steps to prevent disclosure of this

information can limit the extent to which researchers can analyze the data with

sufficiently granular and accurate calculations.

Both private and public organizations have long relied on notice and consent and

de-identification for protecting privacy—methods that have been shown to be no

longer reliable. There are no silver bullets in disclosure limitation, and no single

privacy-enhancing technique or technology will completely remove privacy risks.

However, recent advances in disclosure limitation hold great promise for

protecting confidentiality while allowing data to be used to provide valuable

information. The emerging techniques of differential privacy and synthetic data

can help move us forward from debates about anonymization and re-

identification, towards a better balancing of data disclosure and confidentiality

based on formalized and measurable metrics. Traditional disclosure limitation

techniques are still of value as well and can be used in conjunction with modern

methods to greatly reduce privacy risks. However, the focus on personally

identifiable information in current privacy regulations presents complications

when considering disclosures protected by modern means such as differential

privacy. Existing and future laws and policies will need to take account of the

more quantifiable, comprehensive concepts of privacy that formal privacy

methods provide. Researchers and policymakers will need to ask tough questions

about how much statistical noise is enough to adequately protect privacy while

still providing useful data, and how to capture and define these considerations in

regulations and policies.



Notes

1 Jeff Desjardins, “How much data is generated each day?” World Economic Forum, April 17, 2019, https://www.weforum.org/agenda/2019/04/how-much-data-is-generated-each-day-cf4bddf29f/

2 A disclosure is distinguished from a breach, which is an unintentional release of information.

3 “Confidential” as used in this paper means any information that was not intended to be released as part of data made available, which includes both personal and non-personal information.

4 In some instances, there may be other disclosure concerns apart from personal identification, such as release of classified information or certain sensitive or proprietary information about an organization or company.

5 Such as the noted case of Target determining and revealing a teen pregnancy to her father. See Kashmir Hill, "How Target Figured Out A Teen Girl Was Pregnant Before Her Father Did", Forbes.com (accessed Dec 9, 2020), https://www.forbes.com/sites/kashmirhill/2012/02/16/how-target-figured-out-a-teen-girl-was-pregnant-before-her-father-did/#1bc2f7ea6668

6 The field of research in protecting privacy and confidentiality in data mining is known as Privacy Preserving Data Mining (PPDM), and utilizes many of the disclosure limitation techniques discussed in this paper; see, for example, Google’s development of tools for secure multiparty computation and differential privacy.

7 Paul Ohm, “Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization”, UCLA Law Review 57 (August 2010): 1701-1777, https://www.uclalawreview.org/pdf/57-6-3.pdf

8 Organisation for Economic Co-operation and Development, OECD Guidelines on the Protection of Privacy and Transborder Flows of Personal Data, 2013, http://www.oecd.org/sti/ieconomy/oecdguidelinesontheprotectionofprivacyandtransborderflowsofpersonaldata.htm

9 Asia-Pacific Economic Cooperation, What is the Cross-Border Privacy Rules System?, April 15, 2019, https://www.apec.org/About-Us/About-APEC/Fact-Sheets/What-is-the-Cross-Border-Privacy-Rules-System

10 There is some blurring of lines between these two categories.

11 Starting in 2006, a number of studies demonstrated the ability to re-identify individuals in publicly released, anonymized data. This includes re-identifications from the 2006 AOL release of search queries of users (see Michael Barbaro and Tom Zeller Jr., "A Face Is Exposed for AOL Searcher No. 4417749", New York Times, Aug. 9, 2006, https://www.nytimes.com/2006/08/09/technology/09aol.html), the Massachusetts government’s release of state employee hospital visit data (see Daniel Barth-Jones, “The 'Re-Identification' of Governor William Weld's Medical Information: A Critical Re-Examination of Health Data Identification Risks and Privacy Protections, Then and Now”, SSRN (July 2012), https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2076397), and the two other examples discussed below in this paper.

12 A 2016 comprehensive review of re-identification attacks found that 72.7% of all successful attacks have taken place since 2009. Jane Henriksen-Bulmer and Sheridan Jeary, "Re-identification Attacks—A Systematic Literature Review", International Journal of Information Management 36 (December 2016): 1184-1192, https://www.sciencedirect.com/science/article/pii/S0268401215301262

13 Ohm, 1716-1720

14 Arvind Narayanan and Vitaly Shmatikov, "Robust De-anonymization of Large Sparse Datasets", SP '08: Proceedings of the 2008 IEEE Symposium on Security and Privacy (May 2008): 111–125, https://www.cs.cornell.edu/~shmat/shmat_oak08netflix.pdf

15 Alessandro Acquisti and Ralph Gross, "Predicting Social Security numbers from public data", Proceedings of the National Academy of Sciences 27 (July 2009): 10975-10980, https://www.ncbi.nlm.nih.gov/pubmed/19581585

16 Methods to protect against such attacks continue to be developed, however. See for instance Qian Wang, Zhiwei Xu and Shengzhi Qu, "An Enhanced K-Anonymity Model against Homogeneity Attack", Journal of Software 6 (October 2011): 1945-1952, https://pdfs.semanticscholar.org/2a2a/497d05311ecfd9164ce5eb30b0a9b55f991d.pdf and Ashwin Machanavajjhala, Daniel Kifer, Johannes Gehrke, and Muthuramakrishnan Venkitasubramaniam, "l-Diversity: Privacy Beyond k-Anonymity", ACM Transactions on Knowledge Discovery from Data 1 (March 2007): 1-52, https://ptolemy.berkeley.edu/projects/truststc/pubs/465/L%20Diversity%20Privacy.pdf

17 See for instance on qualitative and quantitative risk measurement for randomized control trial data, Parveen Kumar and Rajan Sareen, “Evaluation of Re-identification Risk for Anonymized Clinical Documents”, Canadian Journal of Hospital Pharmacy 62 (July–August 2009): 307-319, https://www.lexjansen.com/phuse/2017/rg/RG02.pdf

18 In revisiting Latanya Sweeney’s well-known re-identification study showing that 87% of the US population could be identified by gender, date of birth, and ZIP code, the author found that re-identification would drop to 0.02% by replacing date of birth with month and year only, and ZIP code with county. Philippe Golle, “Revisiting the Uniqueness of Simple Demographics in the US Population”, Proceedings of the 5th ACM Workshop on Privacy in Electronic Society (October 2006): 77–80, http://crypto.stanford.edu/~pgolle/papers/census.pdf

19 When data is broken down by categories in tables, only a few individuals (“low n” for number of individuals) may fall into some of the categories, such as an example of only one or two students of a certain race and gender being in a particular college program.

20 See Appendix B in Irit Dinur and Kobbi Nissim, “Revealing Information while Preserving Privacy”, Proceedings of the Twenty-Second ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (June 2003): 202–210, http://www.cse.psu.edu/~ads22/privacy598/papers/dn03.pdf

21 Tore Dalenius and Steven P. Reiss, "Data-swapping: A Technique for Disclosure Control", Journal of Statistical Planning and Inference 6, no. 1 (1982): 73-85, http://www.asasrms.org/Proceedings/papers/1978_038.pdf

22 See Kun Liu, Chris Giannella, and Hillol Kargupta, “A Survey of Attack Techniques on Privacy-Preserving Data Perturbation Methods”, in Privacy-Preserving Data Mining: Models and Algorithms, ed. Charu C. Aggarwal and Philip S. Yu (New York, NY: Springer, 2008), 359-381, https://www.researchgate.net/publication/226016259_A_Survey_of_Attack_Techniques_on_Privacy-Preserving_Data_Perturbation_Methods

23 See Songtao Guo and Xintao Wu, "On The Use Of Spectral Filtering For Privacy Preserving Data Mining", Proceedings of the ACM Symposium on Applied Computing, Dijon, France, April 23-27, 2006

24 See Surendra H and Mohan HS, “A Review Of Synthetic Data Generation Methods For Privacy Preserving Data Publishing”, International Journal Of Scientific & Technology Research 6 (March 2017): 95-101, https://www.ijstr.org/final-print/mar2017/A-Review-Of-Synthetic-Data-Generation-Methods-For-Privacy-Preserving-Data-Publishing.pdf

25 See Haoran Li, Li Xiong, and Xiaoqian Jiang, “Differentially Private Synthesization of Multi-Dimensional Data using Copula Functions”, Advanced Database Technology 2014 (2014): 475–486, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4232968/ and National Institute of Standards and Technology, “2018 Differential Privacy Synthetic Data Challenge” (accessed Jan 10, 2020), https://www.nist.gov/ctl/pscr/funding-opportunities/open-innovation-prize-challenges/2018-differential-privacy-synthetic

26 See for instance, Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov, “Membership Inference Attacks against Machine Learning Models”, Proceedings of the IEEE Symposium on Security and Privacy (2017), https://www.cs.cornell.edu/~shmat/shmat_oak17.pdf

27 Methods for generating synthetic data itself via machine learning are currently being developed using techniques such as generative adversarial networks (GANs).

28 Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith, “Calibrating Noise to Sensitivity in Private Data Analysis”, in Theory of Cryptography, Lecture Notes in Computer Science, ed. Shai Halevi and Tal Rabin (Berlin: Springer, 2006), 265-284, https://link.springer.com/chapter/10.1007/11681878_14

29 This provides individuals in the dataset with plausible deniability. As an example, a population count may be returned as 1,022 one time and 1,016 another, but this difference would be immaterial in statistical analyses.

30 For the mathematical proof of differential privacy, see Cynthia Dwork, “A Firm Foundation for Private Data Analysis”, Communications of the ACM 54 (January 2011): 86-95, https://www.microsoft.com/en-us/research/wp-content/uploads/2011/01/dwork_cacm.pdf

31 See for instance, Anis Bkakria and Aimilia Tasidou, “Optimal Distribution of Privacy Budget in Differential Privacy”, Risks and Security of Internet and Systems, CRiSIS 2018, Lecture Notes in Computer Science 11391 (2019), https://link.springer.com/chapter/10.1007/978-3-030-12143-3_18

32 https://iapp.org/news/a/uber-becomes-the-latest-company-to-embrace-differential-privacy/

33 Apple Inc., “Differential Privacy Overview”,https://www.apple.com/privacy/docs/Differential_Privacy_Overview.pdf

34 The Census Bureau’s move to differential privacy is discussed below.

35 https://github.com/google/differential-privacy

36 Andreas Haeberlen, Benjamin C. Pierce, and Arjun Narayan, “Differential Privacy Under Fire”, Proceedings of the 20th USENIX Conference on Security (August 2011), https://www.cis.upenn.edu/~ahae/papers/fuzz-sec2011.pdf

37 United States Census Bureau, American FactFinder (accessed January 7, 2020), https://factfinder.census.gov/bkmk/table/1.0/en/DEC/10_SF1/QTP10/0100000US

38 Included as part of the Reapportionment Act of 1929. Reapportionment Act of 1929, 71st Cong., 1st sess., June 18, 1929, 21-27, https://www.census.gov/history/pdf/1929_census_act.pdf

39 Amy Lauger, Billy Wisniewski, and Laura McKenna, “Disclosure Avoidance Techniques at the U.S. Census Bureau: Current Practices and Research”, Research Report Series, Center for Disclosure Avoidance Research #2014-2 (2014), https://www.census.gov/library/working-papers/2014/adrm/cdar2014-02.html

40 Phyllis Singer and Nelson Chung, “Predicting Complementary Cell Suppressions Given Primary Cell Suppression”, Research Report Series, Center for Disclosure Avoidance Research #2016-5 (2016), Conditions https://www.census.gov/srd/CDAR/cdar2016-05_Predicting_Complementary_Cell_Suppressions.pdf

41 John M. Abowd, “Staring Down the Database Reconstruction Theorem” (presentation at the American Association for the Advancement of Science Annual Meeting, Washington, DC, February 16, 2019), https://www2.census.gov/programs-surveys/decennial/2020/resources/presentations-publications/2019-02-16-abowd-db-reconstruction.pdf?#

42 United States Census Bureau, “Disclosure Avoidance and the 2020 Census”, https://www.census.gov/about/policies/privacy/statistical_safeguards/disclosure-avoidance-2020-census.html

43 Administrative and regulatory approaches such as formal application and review, data use agreements, and secure data enclaves can also be used to further minimize disclosure risks from non-publicly released data.


This report carries a Creative Commons Attribution 4.0 International license, which permits re-use of New America content when proper attribution is provided. This means you are free to share and adapt New America’s work, or include our content in derivative works, under the following conditions:

• Attribution. You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.

For the full legal code of this Creative Commons license, please visit creativecommons.org.

If you have any questions about citing or reusing New America content, please visit www.newamerica.org.

All photos in this report are supplied by, and licensed to, shutterstock.com unless otherwise stated. Photos from federal government sources are used under section 105 of the Copyright Act.