February 2020 Protecting Privacy in Data Releases A Primer on Disclosure Limitation Christopher Sadler Last edited on February 24, 2020 at 11:09 a.m. EST
Acknowledgments
We would like to thank the Bill & Melinda Gates Foundation for its generous support of our work. The views expressed in this report are those of its author and do not necessarily represent the views of the foundation, their officers, or their employees.
newamerica.org/oti/reports/primer-disclosure-limitation/ 2
About the Author(s)
Christopher Sadler is the Education Data and Privacy Fellow at New America’s Open Technology Institute.
About New America
We are dedicated to renewing America by continuing the quest to realize our nation’s highest ideals, honestly confronting the challenges caused by rapid technological and social change, and seizing the opportunities those changes create.
About Open Technology Institute
OTI works at the intersection of technology and policy to ensure that every community has equitable access to digital technology and its benefits. We promote universal access to communications technologies that are both open and secure, using a multidisciplinary approach that brings together advocates, researchers, organizers, and innovators.
Contents
Introduction
Laws Governing Disclosure
Disclosure Limitation Techniques
Information Limiting Techniques
Data Perturbation Methods
The Census Bureau
Conclusion
Introduction
The falling cost of data storage and the spread of the internet have led to an
acceleration in the collection of data about individuals. Many organizations, both
private and public, gather and store information from a myriad of sources,
resulting in an accumulation of data exceeding 40 zettabytes (40 trillion
gigabytes) globally. Data holds the potential for substantial gains to society in
building knowledge, supporting research, informing policy, and providing information to the
public. However, publishing or sharing data creates privacy risks in exposing
individuals to potential financial, reputational, and other harms and liabilities.
Organizations have ethical and often legal requirements to protect the
confidentiality of data, but this involves a tradeoff with the usefulness of the data.
Protecting confidentiality necessitates excluding, aggregating, or obscuring the
data in some way that reduces its detail and exactness. Thus, a balance must be
struck between confidentiality and the informational value of data.
“Disclosure” refers to the release of data by some means, including making it
publicly available or available to another entity or individual (such as the sharing
of records with a researcher). The primary privacy concern created by
disclosure arises either when the released data contains direct personally
identifiable information (PII) or when other fields or aspects of the released data
can be used in some way (often in conjunction with other available datasets) to
identify a person. Disclosures may include sensitive information about
individuals, but the risk is in being able to link data in the disclosure to a specific
person. “Disclosure limitation” (also known as “disclosure avoidance” and
“disclosure control”) refers to the safeguards and statistical methods used to
reduce the risk of disclosure of identifiable information in a data release.
Privacy concerns about disclosure have often focused on the release of public-use
government data. The Census Bureau, with a primary mission of disclosing
data to the public, has been at the forefront of cutting-edge research on and
empirical use of methods for disclosure limitation. Other statistical agencies at
both the federal and state level also regularly release data, and non-statistical
agencies are about to start doing the same, as the 2019 OPEN Government Data
Act will require federal agencies to publish much of their information online as
open data.
There is also growing concern about how corporations use the data they hold. As
corporate data warehouses build in volume and detail over time, they become
valuable for discovering information relationships about customers through
analytic techniques (a process known as data mining). The potential for
derivation of highly sensitive information through data mining carries serious
ethical implications. There are calls for comprehensive laws that would control
collection and use of personal data by companies, but even without those laws,
we are seeing groundbreaking research into disclosure limitation from many
sources in the private sector.
Traditionally, government and private entities seeking to disclose information
without creating privacy harms have attempted to provide data in aggregate or
anonymized form, so that sensitive information cannot be related back to any
particular individual. In recent years, however, it has become clear that
traditional techniques of anonymization and aggregation of data are not as
privacy protecting as had been thought. The challenges of balancing the quality
and usefulness of disclosures with the fundamental rights of confidentiality and
privacy have become much more complex as both technological advances and
public perceptions of privacy have changed. There is a wide range of methods for
suppressing, aggregating, and obscuring data, all with a goal of creating a release
of information that reduces individually identifiable information. However,
increases in computing power, the advancement of analytical techniques and
sophistication of attacks, the growth of available data sources on individuals, and
other factors have weakened the protections of many traditional disclosure
techniques. While these older methods are still useful in reducing disclosure risks
and continue to be refined, there has been an accelerating shift to the modern,
formal disclosure limitation techniques of differential privacy.
This paper provides an overview of some of the privacy issues involved with data
disclosures, and how disclosure limitation techniques can be used to protect the
confidentiality of individuals whose data is included in disclosures. It provides an
overview of some of the primary methods that have traditionally been used, as
well as those that have emerged more recently. It does not aim to be an
exhaustive list of disclosure limitation methods, but will hopefully provide
pointers to further reading. The Census Bureau, which has been a primary center
of developing disclosure limitation techniques, is used as an example of how
disclosure limitation is practiced and how it has evolved.
Laws Governing Disclosure
Data privacy is not covered by a comprehensive law in the United States, though
this is an idea that is under much current discussion. Instead, there are a variety
of federal and state laws that form a patchwork of privacy protections for the
disclosure of data in the United States. These include certain sector-specific laws,
such as the Health Insurance Portability and Accountability Act (HIPAA) for
personal medical data, the Family Educational Rights and Privacy Act (FERPA)
for educational records, and the Gramm-Leach-Bliley Act (GLBA) for financial
information. Some industry best practice standards, such as the Health
Information Trust Alliance framework and the Payment Card Industry Data
Security Standard, also address disclosure, but focus more on data security
controls.
The Privacy Act of 1974 covers the collection and release of information
contained in U.S. federal government agency systems of records. It restricts
disclosure of personally identifiable records, prohibits disclosure of an
individual’s record without written consent—with certain exceptions, such as the
release of certain information under a Freedom of Information Act (FOIA)
request—and requires recordkeeping of all disclosures and releases of data.
Activities of statistical agencies and units of the government are also governed by
the Confidential Information Protection and Statistical Efficiency Act of 2002
(CIPSEA), which limits and protects the use of statistical data and is discussed
below further in the context of the Census Bureau.
Signed into law in 2019, the Open, Public, Electronic and Necessary (OPEN)
Government Data Act provides a mandate for all federal agencies to publish all
nonsensitive information assets in “modern, open, and electronic format.” In
2009, the White House issued the Open Government Directive to improve data
transparency in the federal government, which included an increase in the
release of data online through the Data.gov site. The OPEN Government Data Act
makes the Open Government Directive a requirement in statute, rather than a
policy. In implementation of the act, the Office of Management and Budget
(OMB) is set to issue guidance to agencies on “risks and restrictions related to the
disclosure of personally identifiable information.” This includes the risk that
although an individual data asset in isolation does not pose a privacy or
confidentiality risk, this data “when combined with other available information
may pose such a risk.”
The European Union has had a comprehensive privacy law for years, which was
overhauled by the passage of the General Data Protection Regulation (GDPR)
that went into effect in May 2018. The GDPR restricts disclosure of personally
identifiable information under Recital 26 to data that is “anonymized.” The
complicated issues of anonymization and potential re-identification are further
discussed below in the section on de-identification of data, using the GDPR as an
example.
Other country-specific laws, such as Canada’s Personal Information Protection
and Electronic Documents Act (PIPEDA) and the Australian Privacy Principles
(APPs), govern aspects of privacy and disclosure practices in varying ways.
Internationally, privacy principles that define practices to follow in handling data,
such as those developed by the Organization for Economic Co-operation and
Development and the Asia-Pacific Economic Cooperation (APEC) Cross-Border
Privacy Rules, have also been adopted by some countries.
Disclosure Limitation Techniques
Techniques for disclosure limitation can be classified in a number of ways. For
the purposes of this paper, techniques are grouped by information limiting
methods and data perturbation methods. Information limiting methods are
those that delete, mask, suppress, or obscure data fields or values in order to
prevent re-identification. Data perturbation methods are those that use statistical
means to alter either the underlying data itself or query results drawn from the
data.
Information Limiting Techniques
PII, Anonymization and the Re-identification Problem
The simplest method of disclosure limitation is to strip PII from a dataset,
removing all fields (or suppressing or masking these fields in some way) that
could directly and uniquely identify an individual, such as name, social security
number, and phone number. Through the 1990s and into the 2000s, PII was
often used as something of a bright-line approach to data anonymization. It
defined what data needed to be protected, with the remaining data fields
considered harmless for disclosure from a privacy perspective. However, in the
mid-2000s it became clear that a wide range of other data categories can be used
to identify individuals.
The GDPR expands the scope of protected information beyond PII, instead using
the term “personal data”—a broader range of potentially identifying information
as defined in Article 4(1):
‘personal data’ means any information relating to an identified or
identifiable natural person (‘data subject’); an identifiable natural
person is one who can be identified, directly or indirectly, in particular
by reference to an identifier such as a name, an identification number,
location data, an online identifier or to one or more factors specific to
the physical, physiological, genetic, mental, economic, cultural or social
identity of that natural person.
Most, if not all, privacy laws rely on some concept of data being either personally
identifiable or not to determine whether the law applies. As the GDPR continues,
“The principles of data protection should therefore not apply to anonymous
information, namely information that does not relate to an identified or
identifiable natural person or to personal data rendered anonymous in such a
manner that the data subject is not or no longer identifiable.” Defining
“identifiable” data, however, becomes complicated by the possibility of
re-identification attempts.
Re-identification is the matching of anonymized data back to an individual. In
recent years, faith in anonymization has been greatly shaken by studies
demonstrating re-identification of released data. Two high profile examples are
the 2006 Netflix Prize study by Arvind Narayanan and Vitaly Shmatikov, which
re-identified individuals from Netflix’s release of over 100 million user ratings of
movies, and a 2009 Social Security Number study by Alessandro Acquisti and
Ralph Gross showing that data about an individual's place and date of birth can
be used to predict their Social Security number.
Most commonly, re-identification is performed using external databases to infer
information about the anonymized data (known as “linkage attacks”). The
Narayanan and Shmatikov Netflix study identified a subset of individuals by
cross-referencing the Netflix data with non-anonymized movie ratings from the
Internet Movie Database (IMDb). The Acquisti and Gross study used the Social
Security Administration’s Death Master File to detect statistical patterns in SSN
assignment, and used multiple sources for inferring birthdate, such as voter
registration lists, online white pages, and social media.
Researchers have developed new anonymization frameworks such as
k-anonymity, t-closeness, and l-diversity to more formally protect against
re-identification. These frameworks each rely on their own assumptions and
limitations, and therefore each protects against only certain types of attacks. For
example, k-anonymity requires that every combination of quasi-identifiers
(attributes that are not direct identifiers, but might potentially contribute to
identification, such as age and sex) be shared by at least k different records. In a
3-anonymous table containing medical condition by zip code and age, every
combination of zip code and age values that appears must appear at least three
times. This can prevent the identification of a specific individual’s record in a
table, but is still susceptible to what is known as a homogeneity attack. If the
quasi-identifier values of an individual are known, then even though their actual
record may not be identified (as there are at least k records that all have the
same quasi-identifier values), their presence alone in a certain dataset could be
confirmed and could reveal sensitive information about that person. Consider
again a table of medical conditions 3-anonymized by zip code and age. If it
happened that the only three records for a certain zip code and age combination
all had a heart condition diagnosis, it could be possible for someone to deduce
information about someone matching that zip code and age. If we know Bob’s
age, where he lives (and thus his zip code), and knew he had been in the hospital,
we would then be able to tell that it was because of a heart condition.
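
The k-anonymity requirement and the homogeneity weakness described above can be illustrated with a minimal Python sketch. The records, field layout, and function names here are hypothetical and invented for illustration; this is not a production anonymization tool.

```python
from collections import defaultdict

# Hypothetical toy records: (zip_code, age, diagnosis). Not real data.
RECORDS = [
    ("10001", 34, "heart condition"),
    ("10001", 34, "heart condition"),
    ("10001", 34, "heart condition"),
    ("20002", 51, "asthma"),
    ("20002", 51, "diabetes"),
    ("20002", 51, "heart condition"),
]


def group_by_quasi_identifiers(records):
    """Group the sensitive values by their (zip_code, age) combination."""
    groups = defaultdict(list)
    for zip_code, age, diagnosis in records:
        groups[(zip_code, age)].append(diagnosis)
    return dict(groups)


def is_k_anonymous(records, k):
    """True if every quasi-identifier combination appears at least k times."""
    groups = group_by_quasi_identifiers(records)
    return all(len(values) >= k for values in groups.values())


def homogeneous_groups(records):
    """Return the groups in which every record shares one sensitive value."""
    return {
        key: values[0]
        for key, values in group_by_quasi_identifiers(records).items()
        if len(set(values)) == 1
    }


print(is_k_anonymous(RECORDS, 3))   # the table satisfies 3-anonymity
print(homogeneous_groups(RECORDS))  # yet one group still reveals a diagnosis
```

In this toy table, anyone known to be 34 and living in zip code 10001 can be linked to a heart condition even though no single record is identified, which is the situation described for Bob above.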
While anonymization has come under scrutiny as a method of privacy protection,
the concepts of anonymity and identification are still key to the applicability and
interpretation of privacy laws. Difficult questions concerning de-identification
will likely be around for the foreseeable future. As techniques for
de-identification improve, so will methods of attack. Ongoing research continues to
provide techniques for evaluating and measuring re-identification risks, but it is
unclear whether such methods can adequately account for external data sources
that may become available in the future. Datasets cannot be taken back once
released, so even if data is anonymized effectively based on current standards,
future techniques could remove protections. Additionally, datasets are becoming
more detailed and increasing longitudinally (covering a greater period of time),
and these higher-dimension, long-term datasets will be more challenging to
effectively anonymize.
Anonymization of data is perhaps best considered through a risk management
perspective. Removing or obscuring potentially identifying data can be of
benefit, not in providing certainty of de-identification, but in lowering risks
when combined with other privacy and security controls. Anonymization
techniques should not be a stand-alone approach to privacy, but rather one tool in
the disclosure limitation toolkit. Other methods, as discussed in this paper, can
be combined with anonymization to lower risks to acceptable levels.
Aggregating, Coarsening, and Suppressing Data
Beyond removing or masking direct PII identifiers, a number of disclosure
limitation techniques have been developed for obscuring data in some manner,
by coarsening, rounding, aggregating, or suppressing data.
Data coarsening (or generalization) techniques reduce the detail of the data so
that individuals whose information is reflected in categories with low n-count
values cannot be uniquely identified. One such technique involves top- and
bottom-coding methods to place bounds on the reporting of data in order to
prevent identification of outliers. Essentially, this means broadening data
categories to include more unusual or extreme values, so they are not listed on
their own, and thus potentially identifiable. For example, if only one individual in
a dataset were age 99, their age could be recoded to a broader 90+ category.
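
As a minimal sketch of top-coding, the following Python snippet recodes ages into an open-ended category. The threshold of 90 follows the example above; the function name and sample ages are assumptions made for illustration.

```python
def top_code_age(age, threshold=90):
    """Recode ages at or above the threshold into one open-ended category."""
    return f"{threshold}+" if age >= threshold else str(age)


ages = [23, 47, 90, 99]  # hypothetical values
coded = [top_code_age(a) for a in ages]
print(coded)
```

Bottom-coding works the same way in the other direction, collapsing unusually low values into a single category.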
Cell suppression is the withholding of information in the cell of a table output,
based on some threshold rule for what counts or aggregates would implicitly or
explicitly reveal confidential information. For table cells in which the data would
allow estimating a single individual’s value too closely, missing (or imputed)
values are displayed. Suppression is often used when there are very few values
contributing to a cell, or when one or two large values are the dominant
contributors to the aggregate statistics. The table below shows a very simple
suppression of the cell showing that two women in the dataset live in New York
City.
          NYC   DC   LA   Total
Male       10   14    6      30
Female      2   12    9      23

          NYC   DC   LA   Total
Male       10   14    6      30
Female      *   12    9      23
Across either rows and columns of tables, or across multiple tables, this
“primary” cell suppression alone does not always protect the data. Confidential
values of primary cells can be determined through comparing and subtracting
values. For these cases, secondary (or “complementary”) cell suppression is
necessary, in which additional values are also displayed as missing. In the
example above, the suppressed value of 2 could be determined by simply
subtracting 12 and 9 from 23. Thus, as shown in the following table, a secondary
suppression would be made.
          NYC   DC   LA   Total
Male       10   14    6      30
Female      *   12    *      23
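
Primary and complementary suppression can be sketched in a few lines of Python. The threshold of 3 and the rule of hiding the smallest remaining cell in the row are assumptions chosen to reproduce the example tables; real suppression rules also have to consider column totals and linked tables, which this sketch ignores.

```python
def suppress(table, threshold=3):
    """Return a copy of `table` with primary and complementary suppression.

    `table` maps row label -> {column label -> count}. Row totals are assumed
    to be published alongside the cells. Suppressed cells are set to None.
    """
    out = {}
    for row, cells in table.items():
        # Primary suppression: hide any count below the threshold.
        shown = {c: (v if v >= threshold else None) for c, v in cells.items()}
        hidden = [c for c, v in shown.items() if v is None]
        # Complementary suppression: a single hidden cell could be recovered
        # by subtracting the visible cells from the row total, so hide the
        # smallest visible cell in that row as well.
        if len(hidden) == 1:
            visible = [c for c, v in shown.items() if v is not None]
            if visible:
                shown[min(visible, key=lambda c: shown[c])] = None
        out[row] = shown
    return out


counts = {
    "Male": {"NYC": 10, "DC": 14, "LA": 6},
    "Female": {"NYC": 2, "DC": 12, "LA": 9},
}
protected = suppress(counts)
print(protected["Female"])  # NYC and LA are both withheld
```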
Data Perturbation Methods
Beginning in the 1970s, as computers moved beyond the days of punch cards and
mainframes, increasing numbers of researchers and others have been able to
easily access and query databases. As a result, computer scientists began
thinking about the associated risks to privacy and how to prevent queries from
revealing information about particular individuals. An initial approach was to
restrict or audit database queries, preventing queries that would pull back data
that could identify individuals. The simplest such limit is to only allow users to
run queries that would return aggregate, or statistical results, and prevent queries
that would pull back individual, unaggregated records. For example, developers
could design a system so that queries would only return record results of a certain
size in the result set. Using a certain threshold number, n, would ensure that only
aggregate queries based on at least a set of n records can be run. For example,
with an n of two, this would involve preventing a query from being run that would
return data of a single, individual record. The problem is that multiple queries
can use differentials and overlapping sets (calculating values across query results,
by say, subtracting the results of one query from another) to obtain confidential
information. A simple example is running a query seeking the total income for all
individuals in the data, and then running a second query seeking the total income
for all individuals in the data excluding one certain individual. Subtracting the
second result from the first would provide the income of that individual.
To guard against these re-identification attempts, researchers can employ
techniques that prevent the release of statistics if the number of common records
returned in a set of queries exceeds a given threshold. Further research has created
increasingly sophisticated methods of query restriction. However, query
restrictions ultimately provide no real guarantee of privacy against a
sophisticated user who could potentially generate sets of queries that eventually
reveal information about an individual.
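
The differencing problem described above can be demonstrated with a short Python sketch. The names, incomes, and the `total_income` query function are hypothetical; the minimum query size of two mirrors the example in the text.

```python
# Hypothetical incomes; names and values are invented for illustration.
INCOMES = {"alice": 52000, "bob": 61000, "carol": 48000, "dave": 75000}

MIN_QUERY_SIZE = 2  # the threshold n: refuse queries over fewer than n records


def total_income(names):
    """Answer an aggregate query only if it covers enough records."""
    names = set(names)
    if len(names) < MIN_QUERY_SIZE:
        raise ValueError("query set too small")
    return sum(INCOMES[name] for name in names)


# Both queries pass the size restriction on their own ...
everyone = total_income(INCOMES)
all_but_bob = total_income(name for name in INCOMES if name != "bob")

# ... but their difference is exactly one individual's income.
bob_income = everyone - all_but_bob
print(bob_income)
```

A direct query for a single record is refused, yet two permitted aggregate queries recover the same value, which is why size thresholds alone offer no real guarantee.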
The other approach researchers developed to protect databases is data
perturbation—either altering the actual, underlying data or altering query
outputs to protect confidentiality. This usually involves adding statistical noise—
altering values of the data while still maintaining the statistical relationships
between data fields. There is a long history of use of perturbation in disclosure
limitation, and there are a number of perturbation approaches and techniques.
Data swapping, first proposed by Tore Dalenius and Steven Reiss in the late
1970s, is a technique through which researchers use statistical models to find
pairs of records with similar attributes and switch personally identifying or
sensitive data values between the records. After this manipulation, outside
researchers or attackers will not be able to determine which values correspond to
which individuals. However, the aggregate values of the data are preserved
sufficiently to enable researchers to make statistical inferences.
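
A heavily simplified sketch of data swapping in Python follows. It assumes that "similar" records are simply neighbors after sorting by age, which stands in for the statistical matching models used in practice; the records and field names are hypothetical.

```python
def swap_sensitive(records):
    """Swap `income` between neighboring records after sorting by age.

    Pairing adjacent ages is an illustrative stand-in for the statistical
    models that real data-swapping procedures use to find similar records.
    """
    swapped = [dict(r) for r in sorted(records, key=lambda r: r["age"])]
    for i in range(0, len(swapped) - 1, 2):
        swapped[i]["income"], swapped[i + 1]["income"] = (
            swapped[i + 1]["income"],
            swapped[i]["income"],
        )
    return swapped


records = [
    {"age": 30, "income": 40000},
    {"age": 31, "income": 45000},
    {"age": 52, "income": 80000},
    {"age": 54, "income": 70000},
]
released = swap_sensitive(records)
print(released)
```

The set of income values, and therefore totals and averages, is unchanged, but an individual row no longer carries that person's own income.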
Other perturbation techniques involve changing the values of the data by some
random amount, while still preserving the underlying statistical properties of the
data within a certain range. For example, the perturbation technique of additive
noise works by replacing true values x with the values of x+r, drawing value r
from some distribution. The r amounts are such that the replacement x+r values
preserve the statistical relations between data records.
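
Additive noise can be sketched as follows. The text leaves the noise distribution open; this example assumes zero-mean Gaussian noise with an arbitrary scale of 1,000, and the income values are randomly generated stand-ins rather than real data.

```python
import random
import statistics


def add_noise(values, scale, rng):
    """Replace each true value x with x + r, where r ~ Normal(0, scale)."""
    return [x + rng.gauss(0.0, scale) for x in values]


rng = random.Random(0)
# Stand-in for real incomes: 10,000 randomly generated values.
true_incomes = [rng.uniform(20000, 120000) for _ in range(10000)]
noisy_incomes = add_noise(true_incomes, scale=1000.0, rng=rng)

# Individual values change, but the mean of the release stays close to the
# mean of the original data because the noise averages out.
mean_shift = abs(statistics.mean(noisy_incomes) - statistics.mean(true_incomes))
print(mean_shift)
```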
While perturbation has been used extensively and effectively by the Census
Bureau and other organizations, perturbation methods do have weaknesses. If
strongly correlated attributes can be found in real-world data, this correlation can
potentially be used to filter out the additive randomizations. The same can
occur if someone has certain background knowledge about the data.
Additionally, a number of mathematical and statistical techniques, such as
spectral filtering, can be used to filter off the random noise from perturbed data,
retrieving the original values.
Synthetic Data
Synthetic data are datasets that seek to replicate the statistical properties of
real-world datasets, serving as analytical replacements. Synthetic datasets can be
either fully synthesized, with all of the original dataset values generated
synthetically, or partially synthesized, with only certain fields or a portion of
records synthesized. Often, the data must be presented in the same form and
structure as the original data in order to be compatible with existing systems,
algorithms, and software. Created through various modeling techniques,
synthetic data differs from perturbation techniques that alter the original,
underlying data as discussed above. Instead, synthetic data creates completely
new data by using models that fit the original data (or by using defined
parameters and constraints) to generate statistically comparable data
independent from the underlying, real-world data.
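
The basic idea of fitting a model and then sampling new records from it can be shown with a deliberately simple Python sketch. It assumes a single numeric field modeled as a normal distribution; real synthesizers model many fields and their relationships, and the "original" ages here are themselves randomly generated stand-ins.

```python
import random
import statistics


def fit_model(values):
    """Summarize the original data by its mean and standard deviation."""
    return statistics.mean(values), statistics.stdev(values)


def synthesize(model, n, rng):
    """Draw n new values from the fitted normal model.

    The output is generated from the model parameters only; no original
    value is copied into the synthetic dataset.
    """
    mu, sigma = model
    return [rng.gauss(mu, sigma) for _ in range(n)]


rng = random.Random(1)
original_ages = [rng.gauss(45, 12) for _ in range(5000)]  # stand-in for real data
model = fit_model(original_ages)
synthetic_ages = synthesize(model, 5000, rng)
print(statistics.mean(original_ages), statistics.mean(synthetic_ages))
```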
Synthetic data can greatly reduce the risks of re-identification through ancillary
datasets, as matching synthetic records against external databases is difficult.
The ability to adjust the model also provides an additional confidentiality
advantage. Researchers modeling the data can make decisions about which
relationships of the real-world data will be preserved. Relationships omitted from
the model will not be discoverable by analysts, since they will not be present in
the synthetic data, providing the ability to keep certain data correlations and the
sensitive information they might reveal confidential. For example, if the
correlation of interest in a research dataset is between gender, age, and health
condition, that could be the only correlation preserved in the synthetic dataset.
Other data fields might be generated and included in this dataset, modeled to
provide broad, aggregate statistics, but not with statistically significant
correlation to combinations of certain other variables—correlations that could
predictively reveal information about individuals.
There are still confidentiality concerns that need to be taken into account with
synthetic data, however. Synthetic datasets can potentially leak the underlying
data if the model fits too closely. For example, if the synthetic data has enough
different fields per individual, and the model is closely fitted to the original data,
outliers can potentially be identifiable. However, there is ongoing research aimed
at developing the means to generate synthetic data in a way that would provide
formal privacy guarantees (discussed in the section on differential privacy
below).
Adversarial machine learning, which is the use of AI techniques to detect
vulnerabilities, can potentially be used to determine information about a record
used in the modeling of the synthetic dataset (i.e. “real-world” data). However,
those seeking to uncover the information would also need key information about
the model used to create the data.
Apart from its value in disclosure limitation, another potential benefit of
synthetic data is the ability to generate large volumes of research data at low cost.
Machine learning requires running large volumes of training data through
algorithms, and synthetic data may have great value as a way to rapidly provide
these large volumes of data. These datasets contain no real-world data, but are
statistically similar enough to real-world data to be of training value. While
companies such as Google and Facebook generate large datasets as part of their
business, smaller companies may be able to use this synthetic data to jump-start
a machine learning program without collecting data about real people. From a
privacy standpoint, using synthetic data to train artificial intelligence is attractive
in that it avoids the need for collecting, storing, and using real-world data in the
large amounts needed for machine learning.
Differential Privacy
A group of prominent computer scientists first introduced the concept of
differential privacy (also known as “formal privacy”) in their 2006 paper,
Calibrating Noise to Sensitivity in Private Data Analysis, although precursors to
the technique go back decades. It is not a single tool or method, but rather a
privacy standard that provides formal mathematical guarantees of privacy that
can be implemented in various ways. Differential privacy’s guarantee is that an
adversary can learn virtually nothing more about an individual based upon
disclosures from a dataset than they would learn if that person’s record were not
included in the dataset. In other words, whether or not your personal data is
included, resulting outputs from a dataset would be approximately the same.
Strong privacy protection is provided, while still allowing an analyst to derive
useful statistical results. Differential privacy provides a promising solution to
database reconstruction and re-identification attacks, as it would be highly
difficult to link the noisy, approximate results to external data sources.
In practice, differential privacy works by injecting a precisely calculated amount of
statistical noise into the data contained in query results
(so it can be essentially thought of as a perturbation method for query outputs).
What is provided by differential privacy is an approximation of the true value—
the exact same query could produce two different answers. The difference
between the data value provided in differentially private outputs and the
real-world value can be tuned to be a larger or smaller value (known as the privacy
loss parameter), but at a trade-off between accuracy and privacy. Differential
privacy defines privacy risk as an allowable leakage of data on an individual in
comparison to a hypothetical database without a certain individual. The allowed
deviation between data that includes an individual and one that does not is
usually represented as ε (epsilon).
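
One common way to add calibrated noise to a query result is the Laplace mechanism, sketched below in Python for a simple counting query. The count of 120 and the ε of 0.5 are arbitrary illustrative values, and the function names are assumptions; this is a teaching sketch, not a hardened implementation.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(true_count, epsilon, rng):
    """Release a count with Laplace noise of scale 1/epsilon.

    Adding or removing one person changes a count by at most 1, so the
    noise scale is 1/epsilon: a smaller epsilon means more noise.
    """
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random(42)
# The exact same query produces two different approximate answers.
answers = [private_count(120, epsilon=0.5, rng=rng) for _ in range(2)]
print(answers)
```

Lowering ε widens the noise and strengthens privacy at the cost of accuracy, which is the trade-off described above.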
While an individual’s personal information is almost irrelevant to the outputs
produced via differential privacy, some insignificantly small change in belief
about an individual can potentially be made based on the information released.
The probability that some inference can be made about an individual is at most
approximately 1+ε times (more precisely, e^ε times) the probability that an
inference could be made without the
individual’s data. For example, if the baseline probability of an individual
developing a certain disease is 3 percent (say for a female in the United States),
with an ε of 0.01, the known probability under differential privacy would rise
from 3 percent to only about 3.03 percent (3 × (1+ε)) at most. Put another way, the
probability difference in the outputs between a dataset with the individual and
one without the individual included is at most about 0.03 percentage points.
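
The arithmetic in this example can be checked with a few lines of Python. The sketch compares the 1+ε approximation used above with the e^ε factor from the formal definition of differential privacy; the variable names are illustrative.

```python
import math

baseline = 0.03  # 3 percent baseline probability from the example
epsilon = 0.01

approx_bound = baseline * (1 + epsilon)     # 1+epsilon approximation
exact_bound = baseline * math.exp(epsilon)  # e^epsilon factor

print(f"{approx_bound:.6f} {exact_bound:.6f}")
```

For small ε the two bounds are nearly identical, which is why the simpler 1+ε form is commonly used for intuition.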
Differential privacy also measures and bounds the total privacy loss over multiple
analyses. Part of the application of differential privacy involves establishing a
“privacy budget,” which limits the overall amount of data that can be disclosed.
Setting this budget requires determining the cumulative risk of data disclosures
over the lifespan of the data. With all disclosure limitation techniques, there is no
avoiding the fundamental fact that when multiple analyses are performed using
an individual’s data, disclosure risk increases by some amount. Thus with each
statistical release, or query, under differential privacy, some small amount of
potentially private information is leaked. Therefore, while risk does increase with
each release, the privacy budget ensures that risk accumulates in a bounded way.
Each query is analyzed to determine its privacy cost (ε) and whether the
remaining balance of the privacy budget (tracked as a running tally of ε over all
queries) is sufficiently high to run it. Setting a privacy budget thus returns us to the always
present trade-off of informational value of the data and confidentiality;
potentially releasing identifiable information (if the privacy budget is set too
high) versus data releases not being informationally useful (if the budget is set
too low). Methods for optimally calculating the privacy budget are an area of
current research.
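The bookkeeping described above can be sketched as a small accountant object. The class name PrivacyBudget and its methods are hypothetical, and the sketch assumes basic sequential composition, under which the ε costs of successive queries simply add up. Real systems may use tighter accounting rules.

```python
class PrivacyBudget:
    """Hypothetical running tally of epsilon spent against a fixed total."""

    def __init__(self, total_epsilon):
        if total_epsilon <= 0:
            raise ValueError("total_epsilon must be positive")
        self.total = total_epsilon
        self.spent = 0.0

    @property
    def remaining(self):
        return self.total - self.spent

    def charge(self, epsilon):
        """Deduct a query's cost if it fits; return whether the query may run."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if self.spent + epsilon > self.total:
            return False  # refusing the query keeps cumulative loss bounded
        self.spent += epsilon
        return True

budget = PrivacyBudget(total_epsilon=1.0)
# Four queries with illustrative costs; the third would overdraw the budget.
results = [budget.charge(cost) for cost in (0.5, 0.25, 0.5, 0.25)]
print(results)           # [True, True, False, True]
print(budget.remaining)  # 0.0
```

Once the remaining balance reaches zero, every further query is refused, which is how the budget turns an open-ended accumulation of risk into a fixed ceiling.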
Differential privacy has recently seen a number of real-world uses by companies.
Uber uses differential privacy to protect internal analyses, such as those done on
driver revenue. Apple is using differential privacy to protect user privacy while
improving the usability of features such as lookup hints. Federal agencies such
as the Census Bureau are also beginning to adopt differential privacy.
Additional tools for making differential privacy use more accessible are under
development, with some being provided open-source, such as Google’s
differential privacy kit, which is available via GitHub and allows users to calculate
differentially private simple statistics from a dataset.
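To give a sense of what a differentially private simple statistic looks like, the sketch below releases a noisy count using the Laplace mechanism, a standard technique for this purpose. It does not use or reproduce the API of Google's library; the function names dp_count and laplace_noise and the sample data are hypothetical, and the sketch ignores floating-point subtleties that production libraries are designed to handle.

```python
import random

def laplace_noise(scale, rng):
    # The difference of two independent exponential draws with mean `scale`
    # follows a Laplace distribution centered at 0 with that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def dp_count(records, predicate, epsilon, rng):
    """Noisy count of the records matching `predicate`.

    Adding or removing one person changes a count by at most 1
    (sensitivity 1), so the Laplace scale is 1 / epsilon.
    """
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)

rng = random.Random(0)                    # seeded only to make the demo repeatable
ages = [23, 35, 41, 67, 70, 18, 52, 66]   # hypothetical records
noisy = dp_count(ages, lambda age: age >= 65, epsilon=0.5, rng=rng)
print(round(noisy, 2))                    # the true count is 3; output is 3 plus noise
```

A smaller ε produces a larger noise scale and therefore stronger protection at the cost of accuracy, which is the trade-off the privacy budget is meant to manage.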
Differential privacy stands as one of the most promising disclosure limitation
techniques, one that can provide formal, mathematical assurances of privacy
while unlocking valuable research data. However, like any other disclosure
limitation method, it is not an absolute assurance. In addition to concerns about
correctly calculating the privacy budget, covert-channel attacks can potentially
be used against differentially private query systems and must be guarded
against. Information other than the query results, such as the time a query
takes to complete, could reveal facts such as the presence of an individual in a
database. For example, a query looking for an individual in a dataset of cancer
patients may take one second to run if the individual is not present, versus a
half hour to run if the individual is in the dataset.
The Census Bureau
While new legal rules mandating government transparency, such as the Open
Data Act, will require agencies to release more data, the Census Bureau has long
differed from other major government agencies in that the public release of data
is one of its primary functions. The Census Bureau publishes a large amount of
information on the demographics and economy of the United States, while
endeavoring to protect the privacy of individuals. For the 2010 Census, the
Bureau published 5.6 billion independent tabular summaries, based on over 300
million person records. The Census Bureau faces a number of disclosure
limitation challenges, including the high dimensionality of its data and the need
to preserve associations among variables. In addition to releasing tabular
summaries, the Census Bureau also publicly releases microdata (record-level
data) from the decennial census and from many of its demographic and
economic surveys.
Under Title 13 of the U.S. Code, the Bureau is prohibited from releasing data that
allows “any particular establishment or individual” to be identified. In addition,
the Census Bureau is bound by the Confidential Information Protection and
Statistical Efficiency Act of 2002 (CIPSEA). Applied primarily, but not
exclusively, to the statistical agencies of the federal government, CIPSEA was
created to provide federal agencies the ability to make a statutory commitment to
confidentiality and to restrict data use to statistical purposes only. CIPSEA sets
high penalties for disclosures, including fines and jail time. Other statistical
agencies have often used data licensing agreements to provide data to specific
users with confidentiality requirements. However, the Census Bureau cannot rely
on such agreements because any Census data released to external parties is
automatically considered publicly available. The Census Bureau thus has a strong
focus on preserving confidentiality in the data it releases, and has been on the
cutting edge of disclosure limitation methods.
The history of how the Census Bureau has protected public releases of
information provides useful examples of disclosure limitation in practice. Early
censuses in the 1800s used few privacy measures, often only removing names. As
concerns about confidentiality grew, the 1929 census law established a
requirement that “no publication shall be made by the Census Office whereby the
data furnished by any particular establishment or individual can be identified.”
Protections were further codified and strengthened by the 1954 census law (Title
13 of the U.S. Code). Until the early 1960s, Census data was only released in
printed volumes, greatly limiting the detail and amount of data that could be
disclosed (and thus reducing privacy risks). With the move to publishing
extensive electronic data in more recent times, and the greater attendant need to
protect privacy, the Census Bureau has used a number of information reduction
and data perturbation methods to limit disclosure. Information reduction
techniques traditionally used by the Bureau include geographic thresholds,
coding, and sampling. For example, to protect against identification, any
geographic area identified on public-use files must have a population over
100,000, and every category of a categorical variable (a variable used for
grouping, such as gender) must contain at least 10,000 people nationwide;
otherwise the category is recoded into a broader one. Cell suppression, as
described above, has also been a primary information restriction method used by
the Census Bureau, one it has continually worked to improve. In 1996, the Bureau
began using data swapping and noise infusion techniques to perturb data by a
confidential amount. The Census Bureau has also protected data from the
decennial census and the American Community Survey (an ongoing population and
housing survey) by creating partially and fully synthetic data.
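The category threshold rule described above can be sketched in a few lines. This is an illustration under stated assumptions, not the Bureau's procedure: the function recode_small_categories, the "Other" label, and the toy data are all hypothetical, and a small threshold stands in for the 10,000-person minimum.

```python
from collections import Counter

def recode_small_categories(values, min_count, broader_label="Other"):
    """Replace every category with fewer than `min_count` members by a broader label.

    Caveat: the merged category can itself fall below the threshold, in which
    case a real procedure would need to coarsen further.
    """
    counts = Counter(values)
    too_small = {category for category, n in counts.items() if n < min_count}
    return [broader_label if value in too_small else value for value in values]

# Hypothetical microdata: categories C and D are too rare to publish as-is.
values = ["A"] * 5 + ["B"] * 4 + ["C"] * 2 + ["D"] * 2
recoded = recode_small_categories(values, min_count=3)
print(Counter(recoded))  # Counter({'A': 5, 'B': 4, 'Other': 4})
```

Recoding trades detail for safety: the rare categories disappear from the release, but every published category now describes a group large enough to make singling out any one member harder.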
To further protect against disclosure risks, the Census Bureau also uses
procedural and administrative methods. Before dissemination, all data products
released by the Census Bureau must be reviewed by its Disclosure Review
Board (DRB). The DRB examines whether appropriate disclosure limitation
techniques have been applied to a Census product, and also determines whether
a given product presents additional disclosure risks that need to be addressed.
After an error was made in the release of a product in 2010, the Bureau also
created the position of disclosure limitation officer. Each division at the Census
Bureau that produces data releases must designate an officer to oversee all
disclosure limitation activities and final submission to the DRB.
Centralized disclosure review boards (used by the Census Bureau, and other
government agencies such as the Department of Education) offer added benefits
beyond what a more limited, specific review of a disclosure would provide. Each
data release can be considered in the context of all planned data releases by an
agency. Additionally, centralized disclosure review boards bring together experts
from across the agency, including staff with technical skills and those with
specialized knowledge about particular data types and datasets.
Another approach to disclosure limitation is to restrict data access by
legal and/or operational means. Due to confidentiality concerns, some Census
data cannot be released publicly. To provide secure, authorized data access to
researchers (rather than licensed release), the Census Bureau maintains 29
Federal Statistical Research Data Centers (RDCs) hosted at government
agencies, universities, and nonprofit institutions. Using their own review and
approval processes, statistical agencies (both the Census Bureau and others,
including the Bureau of Labor Statistics and the Bureau of Economic Analysis)
provide microlevel data to the secure RDC environments. Researchers must
obtain Census Bureau Special Sworn Status by passing a background check and
swearing a lifetime confidentiality oath. Under Title 13 and Title 26 of the U.S.
Code, penalties of a federal prison sentence of up to five years, a fine of up to
$250,000, or both apply to any violations of the confidentiality requirements.
Researchers work under the supervision of RDC employees on non-networked
machines, and a researcher’s output, code, and notes all undergo
disclosure review by an RDC analyst. Additionally, statistical software used at
the RDCs has certain commands, such as those for copying or printing datasets,
restricted. Certain projects may allow remote access through a secure
communication network, with the code submitted by the researcher and
executed on a computer in the RDC, and subject to the same code and output
review provisions.
Recognizing that increases in computing power and the availability of external
databases were raising the risk of re-identification, the Census Bureau has
begun to move from legacy disclosure limitation methods to techniques based on
formal privacy. In 2019, internal researchers at the Census Bureau discovered that
confidential data could be reconstructed from the publicly released tabulations of
the 2010 Census by using commercial data, potentially revealing the race and
ethnicity of individuals. For the 2020 Census, differential privacy will be used to
protect data through a new processing system developed in-house. The
adoption of differential privacy will require the Census Bureau to closely
evaluate how the quantity and nature of the statistics it releases affect its
privacy budget, as each release of data will use a fraction of it. Tables for
which high accuracy is critical will require a larger share of the privacy budget.
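The link between a table's share of the budget and its accuracy can be illustrated with a simple calculation. The sketch below is not a description of the Bureau's processing system: it assumes counts protected with Laplace noise (where the expected absolute error equals sensitivity divided by ε), and the table names, weights, and helper functions are hypothetical.

```python
def allocate_budget(total_epsilon, weights):
    """Split a total epsilon across tables in proportion to their weights."""
    total_weight = sum(weights.values())
    return {name: total_epsilon * w / total_weight for name, w in weights.items()}

def expected_abs_error(epsilon, sensitivity=1.0):
    # For Laplace noise with scale b = sensitivity / epsilon, E|noise| = b.
    return sensitivity / epsilon

# Hypothetical priorities: a higher weight means accuracy matters more.
weights = {"population_by_state": 6, "age_by_county": 3, "detailed_cross_tab": 1}
shares = allocate_budget(1.0, weights)
errors = {name: expected_abs_error(eps) for name, eps in shares.items()}

for name in weights:
    print(f"{name}: epsilon={shares[name]:.2f}, expected error={errors[name]:.2f}")
```

Under these assumptions, the table given six-tenths of the budget has an expected error of under two persons per count, while the table given one-tenth has an expected error of about ten. Improving one table's accuracy necessarily takes budget away from the others.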
Conclusion
Collection of data continues to expand rapidly, growing datasets into longer-term
repositories with increasing value. However, this higher-dimensional, longitudinal
data creates a greater risk of privacy harms and a corresponding need to
develop more privacy-protective techniques and technologies. The tension between
providing data detailed enough to be useful and maintaining the confidentiality of
the underlying information will always remain. When datasets include
potentially identifiable personal information, steps to prevent disclosure of this
information can limit how granular and accurate researchers’ analyses of the
data can be.
Both private and public organizations have long relied on notice and consent and
de-identification for protecting privacy—methods that have been shown to be no
longer reliable. There are no silver bullets in disclosure limitation, and no single
privacy-enhancing technique or technology will completely remove privacy risks.
However, recent advances in disclosure limitation hold great promise for
protecting confidentiality while allowing data to be used to provide valuable
information. The emerging techniques of differential privacy and synthetic data
can help move us forward from debates about anonymization and re-
identification, towards a better balancing of data disclosure and confidentiality
based on formalized and measurable metrics. Traditional disclosure limitation
techniques are still of value as well and can be used in conjunction with modern
methods to greatly reduce privacy risks. However, the focus on personally
identifiable information in current privacy regulations presents complications
when considering disclosures protected by modern means such as differential
privacy. Existing and future laws and policies will need to take account of the
more quantifiable, comprehensive concepts of privacy that formal privacy
methods provide. Researchers and policymakers will need to ask tough questions
about how much statistical noise is enough to adequately protect privacy while
still providing useful data, and how to capture and define these considerations in
regulations and policies.
Notes
1 Jeff Desjardins, “How much data is generated each day?” World Economic Forum, April 17, 2019, https://www.weforum.org/agenda/2019/04/how-much-data-is-generated-each-day-cf4bddf29f/
2 A disclosure is distinguished from a breach, which is an unintentional release of information.
3 “Confidential” as used in this paper means any information that was not intended to be released as part of data made available, which includes both personal and non-personal information.
4 In some instances, there may be other disclosure concerns apart from personal identification, such as release of classified information or certain sensitive or proprietary information about an organization or company.
5 Such as the noted case of Target determining and revealing a teen’s pregnancy to her father. See Kashmir Hill, "How Target Figured Out A Teen Girl Was Pregnant Before Her Father Did", Forbes.com (accessed Dec 9, 2020), https://www.forbes.com/sites/kashmirhill/2012/02/16/how-target-figured-out-a-teen-girl-was-pregnant-before-her-father-did/#1bc2f7ea6668
6 The field of research in protecting privacy and confidentiality in data mining is known as Privacy Preserving Data Mining (PPDM), and utilizes many of the disclosure limitation techniques discussed in this paper. One example is Google’s development of tools for secure multiparty computation and differential privacy.
7 Paul Ohm, “Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization”, UCLA Law Review 57 (August 2010): 1701-1777, https://www.uclalawreview.org/pdf/57-6-3.pdf
8 Organisation for Economic Co-operation and Development, OECD Guidelines on the Protection of Privacy and Transborder Flows of Personal Data, 2013, http://www.oecd.org/sti/ieconomy/oecdguidelinesontheprotectionofprivacyandtransborderflowsofpersonaldata.htm
9 Asia-Pacific Economic Cooperation, What is the Cross-Border Privacy Rules System?, April 15, 2019, https://www.apec.org/About-Us/About-APEC/Fact-Sheets/What-is-the-Cross-Border-Privacy-Rules-System
10 There is some blurring of lines between these two categories.
11 Starting in 2006, a number of studies demonstrated the ability to re-identify individuals in publicly released, anonymized data. This includes re-identifications from the 2006 AOL release of search queries of users (see Michael Barbaro and Tom Zeller Jr., "A Face Is Exposed for AOL Searcher No. 4417749", New York Times, Aug. 9, 2006, https://www.nytimes.com/2006/08/09/technology/09aol.html), the Massachusetts government’s release of state employee hospital visit data (see Daniel Barth-Jones, “The 'Re-Identification' of Governor William Weld's Medical Information: A Critical Re-Examination of Health Data Identification Risks and Privacy Protections, Then and Now”, SSRN (July 2012), https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2076397), and the two other examples discussed below in this paper.
12 A 2016 comprehensive review of re-identification attacks found that 72.7% of all successful attacks have taken place since 2009. Jane Henriksen-Bulmer and Sheridan Jeary, "Re-identification Attacks—A Systematic Literature Review", International Journal of Information Management 36 (December 2016): 1184-1192, https://www.sciencedirect.com/science/article/pii/S0268401215301262
13 Ohm, 1716-1720
14 Arvind Narayanan and Vitaly Shmatikov, "Robust De-anonymization of Large Sparse Datasets", SP '08: Proceedings of the 2008 IEEE Symposium on Security and Privacy (May 2008): 111–125, https://www.cs.cornell.edu/~shmat/shmat_oak08netflix.pdf
15 Alessandro Acquisti and Ralph Gross, "Predicting Social Security numbers from public data", Proceedings of the National Academy of Sciences 27 (July 2009): 10975-10980, https://www.ncbi.nlm.nih.gov/pubmed/19581585
16 Methods to protect against such attacks continue to be developed, however. See for instance Qian Wang, Zhiwei Xu and Shengzhi Qu, "An Enhanced K-Anonymity Model against Homogeneity Attack", Journal Of Software 6 (October 2011): 1945-1952, https://pdfs.semanticscholar.org/2a2a/497d05311ecfd9164ce5eb30b0a9b55f991d.pdf and Ashwin Machanavajjhala, Daniel Kifer, Johannes Gehrke, and Muthuramakrishnan Venkitasubramaniam, "l-Diversity: Privacy Beyond k-Anonymity", ACM Transactions on Knowledge Discovery from Data 1 (March 2007): 1-52, https://ptolemy.berkeley.edu/projects/truststc/pubs/465/L%20Diversity%20Privacy.pdf
17 See for instance on qualitative and quantitative risk measurement for randomized control trial data, Parveen Kumar and Rajan Sareen, “Evaluation of Re-identification Risk for Anonymized Clinical Documents”, Canadian Journal of Hospital Pharmacy 62 (July–August 2009): 307-319, https://www.lexjansen.com/phuse/2017/rg/RG02.pdf
18 In revisiting Latanya Sweeney’s well known re-identification study showing that 87% of the US population could be identified by gender, date of birth, and ZIP code, the author found that re-identification would drop to 0.02% by replacing date of birth with month and year only, and ZIP code with county. Philippe Golle, “Revisiting the Uniqueness of Simple Demographics in the US Population”, Proceedings of the 5th ACM workshop on Privacy in Electronic Society (October 2006): 77–80, http://crypto.stanford.edu/~pgolle/papers/census.pdf
19 When data is broken down by categories in tables, only a few individuals (“low n” for number of individuals) may fall into some of the categories, such as an example of only one or two students of a certain race and gender being in a particular college program.
20 See Appendix B in Irit Dinur and Kobbi Nissim, “Revealing Information while Preserving Privacy”, Proceedings of the Twenty-Second ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (June 2003): 202–210, http://www.cse.psu.edu/~ads22/privacy598/papers/dn03.pdf
21 Tore Dalenius and Steven P. Reiss, "Data-swapping: A Technique for Disclosure Control", Journal of Statistical Planning and Inference 6, no. 1 (1982): 73-85, http://www.asasrms.org/Proceedings/papers/1978_038.pdf
22 See Kun Liu, Chris Giannella, and Hillol Kargupta, “A Survey of Attack Techniques on Privacy-Preserving Data Perturbation Methods”, in Privacy-Preserving Data Mining: Models and Algorithms, ed. Charu C. Aggarwal and Philip S. Yu (New York, NY: Springer, 2008), 359-381, https://www.researchgate.net/publication/226016259_A_Survey_of_Attack_Techniques_on_Privacy-Preserving_Data_Perturbation_Methods
23 See Songtao Guo and Xintao Wu, "On The Use Of Spectral Filtering For Privacy Preserving Data Mining", Proceedings of the ACM Symposium on Applied Computing, Dijon, France, April 23-27, 2006
24 See Surendra H and Mohan HS, “A Review Of Synthetic Data Generation Methods For Privacy Preserving Data Publishing”, International Journal Of Scientific & Technology Research 6 (March 2017): 95-101, https://www.ijstr.org/final-print/mar2017/A-Review-Of-Synthetic-Data-Generation-Methods-For-Privacy-Preserving-Data-Publishing.pdf
25 See Haoran Li, Li Xiong, and Xiaoqian Jiang, “Differentially Private Synthesization of Multi-Dimensional Data using Copula Functions”, Advanced Database Technology 2014 (2014): 475–486, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4232968/, and National Institute of Standards and Technology, “2018 Differential Privacy Synthetic Data Challenge” (accessed Jan 10, 2020), https://www.nist.gov/ctl/pscr/funding-opportunities/open-innovation-prize-challenges/2018-differential-privacy-synthetic
26 See for instance, Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov, “Membership Inference Attacks against Machine Learning Models”, Proceedings of the IEEE Symposium on Security and Privacy (2017), https://www.cs.cornell.edu/~shmat/shmat_oak17.pdf
27 Methods for generating synthetic data itself via machine learning are currently being developed using techniques such as generative adversarial networks (GAN).
28 Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith, “Calibrating Noise to Sensitivity in Private Data Analysis”, in Theory of Cryptography, Lecture Notes in Computer Science, ed. Shai Halevi and Tal Rabin (Berlin: Springer, 2006), 265-284, https://link.springer.com/chapter/10.1007/11681878_14
29 Providing individuals in the dataset plausible deniability. As an example, for a population, 1,022 may be returned one time and 1,016 another, but this would be irrelevant as part of statistical analyses.
30 For the mathematical proof of differential privacy, see Cynthia Dwork, “A Firm Foundation for Private Data Analysis”, Communications of the ACM 54 (January 2011): 86-95, https://www.microsoft.com/en-us/research/wp-content/uploads/2011/01/dwork_cacm.pdf
31 See for instance, Anis Bkakria and Aimilia Tasidou, “Optimal Distribution of Privacy Budget in Differential Privacy”, Risks and Security of Internet and Systems. CRiSIS 2018. Lecture Notes in Computer Science 11391 (2019), https://link.springer.com/chapter/10.1007/978-3-030-12143-3_18
32 https://iapp.org/news/a/uber-becomes-the-latest-company-to-embrace-differential-privacy/
33 Apple Inc., “Differential Privacy Overview”, https://www.apple.com/privacy/docs/Differential_Privacy_Overview.pdf
34 The Census Bureau’s move to differential privacy is discussed below.
35 https://github.com/google/differential-privacy
36 Andreas Haeberlen, Benjamin C. Pierce, and Arjun Narayan, “Differential Privacy Under Fire”, Proceedings of the 20th USENIX conference on Security (August 2011), https://www.cis.upenn.edu/~ahae/papers/fuzz-sec2011.pdf
37 United States Census Bureau, American FactFinder (accessed January 7, 2020), https://factfinder.census.gov/bkmk/table/1.0/en/DEC/10_SF1/QTP10/0100000US
38 Included as part of the Reapportionment Act of 1929. Reapportionment Act of 1929, 71st Cong., 1st sess., June 18, 1929, 21-27, https://www.census.gov/history/pdf/1929_census_act.pdf
39 Amy Lauger, Billy Wisniewski, and Laura McKenna, “Disclosure Avoidance Techniques at the U.S. Census Bureau: Current Practices and Research”, Research Report Series, Center for Disclosure Avoidance Research #2014-2 (2014), https://www.census.gov/library/working-papers/2014/adrm/cdar2014-02.html
40 Phyllis Singer and Nelson Chung, “Predicting Complementary Cell Suppressions Given Primary Cell Suppression”, Research Report Series, Center for Disclosure Avoidance Research #2016-5 (2016), Conditions https://www.census.gov/srd/CDAR/cdar2016-05_Predicting_Complementary_Cell_Suppressions.pdf
41 John M. Abowd, “Staring Down the Database Reconstruction Theorem” (presentation at the American Association for the Advancement of Science Annual Meeting, Washington, DC, February 16, 2019), https://www2.census.gov/programs-surveys/decennial/2020/resources/presentations-publications/2019-02-16-abowd-db-reconstruction.pdf?#
42 United States Census Bureau, “Disclosure Avoidance and the 2020 Census”, https://www.census.gov/about/policies/privacy/statistical_safeguards/disclosure-avoidance-2020-census.html
43 Administrative and regulatory approaches, such as formal application and review, data use agreements, and secure data enclaves, can also be used to further minimize disclosure risks from non-publicly released data.
This report carries a Creative Commons Attribution 4.0 International license, which permits re-use of New America content when proper attribution is provided. This means you are free to share and adapt New America’s work, or include our content in derivative works, under the following conditions:
• Attribution. You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
For the full legal code of this Creative Commons license, please visit creativecommons.org.
If you have any questions about citing or reusing New America content, please visit www.newamerica.org.
All photos in this report are supplied by, and licensed to, shutterstock.com unless otherwise stated. Photos from federal government sources are used under section 105 of the Copyright Act.