DPCube: Differentially Private Histogram Release through Multidimensional Partitioning

Yonghui Xiao, Li Xiong, Liyue Fan and Slawomir Goryczka

Abstract—Differential privacy is a strong notion for protecting individual privacy in privacy preserving data analysis or publishing. In this paper, we study the problem of differentially private histogram release for random workloads. We study two multidimensional partitioning strategies including: 1) a baseline cell-based partitioning strategy for releasing an equi-width cell histogram, and 2) an innovative 2-phase kd-tree based partitioning strategy for releasing a v-optimal histogram. We formally analyze the utility of the released histograms and quantify the errors for answering linear queries such as counting queries. We formally characterize the property of the input data that will guarantee the optimality of the algorithm. Finally, we implement and experimentally evaluate several applications using the released histograms, including counting queries, classification, and blocking for record linkage, and show the benefit of our approach.

Index Terms—Differential privacy, non-interactive data release, histogram, classification, record linkage.


1 INTRODUCTION

As information technology enables the collection, storage, and usage of massive amounts of information about individuals and organizations, privacy becomes an increasingly important issue. Governments and organizations recognize the critical value in sharing such information while preserving the privacy of individuals. Privacy preserving data analysis and data publishing [3], [4], [5] have received considerable attention in recent years. There are two models for privacy protection [3]: the interactive model and the non-interactive model. In the interactive model, a trusted curator (e.g. hospital) collects data from record owners (e.g. patients) and provides an access mechanism for data users (e.g. public health researchers) for querying or analysis purposes. The result returned from the access mechanism is perturbed by the mechanism to protect privacy. In the non-interactive model, the curator publishes a “sanitized” version of the data, simultaneously providing utility for data users and privacy protection for the individuals represented in the data.

Differential privacy [6], [7], [3], [5], [8] is widely accepted as one of the strongest known privacy guarantees, with the advantage that it makes few assumptions on the attacker’s background knowledge. It requires the outcome of computations to be formally indistinguishable when run with or without

• A preliminary version of the manuscript appeared in [1] and a demonstration description will appear in [2].

• Y. Xiao, L. Xiong, L. Fan and S. Goryczka are with the Department of Mathematics & Computer Science, Emory University, Atlanta, GA, 30322. E-mail: {yonghui.xiao, lxiong, lfan3, sgorycz}@emory.edu

• This research was supported in part by NSF grant CNS-1117763, a Cisco Research Award, and an Emory URC grant.

any particular record in the dataset, as if it makes little difference whether an individual opts in or out of the database. Many meaningful results have been obtained for the interactive model with differential privacy [6], [7], [3], [5]. Non-interactive data release with differential privacy has been studied recently with hardness results obtained, and it remains an open problem to find efficient algorithms for many domains [9], [10].

Fig. 1. Differentially private histogram release

In this paper, we study the problem of differentially private histogram release based on a differential privacy interface, as shown in Figure 1. A histogram is a disjoint partitioning of the database points along with the number of points that fall into each partition. A differential privacy interface, such as the Privacy INtegrated Queries platform (PINQ) [11], provides differentially private access to the raw database. An algorithm implementing the partitioning strategy submits a sequence of queries to the interface and generates a differentially private histogram of the raw database. The histogram can then serve as a sanitized synopsis of the raw database and, together with an optional synthesized dataset based on the histogram, can be used to support count queries and other types of OLAP queries and learning tasks.

An immediate question one might wonder is what is the advantage of the non-interactive release compared to using the interactive mechanism to answer the queries directly. A common mechanism providing differentially private answers is to add carefully calibrated noise to each query, determined by the privacy parameter and the sensitivity of the query. The composability of differential privacy [11] ensures privacy guarantees for a sequence of differentially-private computations with additive privacy depletions in the worst case. Given an overall privacy requirement or budget, expressed as a privacy parameter, it can be allocated to subroutines or to each query in the query sequence to ensure the overall privacy. When the number of queries grows, each query gets a lower privacy budget, which requires a larger noise to be added. When there are multiple users, they have to share a common privacy budget, which degrades the utility rapidly. The non-interactive approach essentially exploits the data distribution and the query workload and uses a carefully designed algorithm or query strategy such that the overall noise is minimized for a particular class of queries. As a result, the partitioning strategy and the algorithm implementing the strategy for generating the query sequence to the interface are crucial to the utility of the resulting histogram or synthetic dataset.

Contributions. We study differentially private histogram release for random query workloads in this paper and propose partitioning and estimation algorithms with formal utility analysis and experimental evaluations. We summarize our contributions below.

• We study two multidimensional partitioning strategies for differentially private histogram release: 1) a baseline cell-based partitioning strategy for releasing an equi-width cell histogram, and 2) an innovative 2-phase kd-tree (k-dimensional tree) based space partitioning strategy for releasing a v-optimal histogram. There are several innovative features in our 2-phase strategy. First, we incorporate a uniformity measure in the partitioning process, which seeks to produce partitions that are close to uniform so that approximation errors within partitions are minimized, essentially resulting in a differentially private v-optimal histogram. Second, we implement the strategy using a two-phase algorithm that generates the kd-tree partitions based on the cell histogram so that the access to the differentially private interface is minimized.

• We formally analyze the utility of the released histograms and quantify the errors for answering linear distributive queries such as counting queries. We show that the cell histogram provides bounded query error for any input data. We also show that the v-optimal histogram combined with a simple query estimation scheme achieves bounded query error and superior utility over existing approaches for “smoothly” distributed data. We formally characterize the “smoothness” property of the input data that guarantees the optimality of the algorithm.

• We implement and experimentally evaluate several applications using the released histograms, including counting queries, classification, and blocking for record linkage. We compare our approach with other existing privacy-preserving algorithms and show the benefit of our approach.

2 RELATED WORKS

Privacy preserving data analysis and publishing has received considerable attention in recent years. We refer readers to [3], [4], [5] for several up-to-date surveys. We briefly review here the work most relevant to our paper and discuss how our work differs from existing work.

There has been a series of studies on interactive privacy preserving data analysis based on the notion of differential privacy [6], [7], [3], [5]. A primary approach proposed for achieving differential privacy is to add Laplace noise [6], [3], [7] to the original results. McSherry and Talwar [12] give an alternative method to implement differential privacy based on the probability of a returned result, called the exponential mechanism. Roth and Roughgarden [13] propose a median mechanism which improves upon the Laplace mechanism. McSherry implemented the interactive data access mechanism in PINQ [11], a platform providing a programming interface through a SQL-like language.

A few works have started addressing non-interactive data release that achieves differential privacy. Blum et al. [9] proved the possibility of non-interactive data release satisfying differential privacy for queries with polynomial VC-dimension, such as predicate queries. They also proposed an inefficient algorithm based on the exponential mechanism. The result largely remains theoretical and the general algorithm is inefficient in its complexity and required data size. [10] further proposed more efficient algorithms with hardness results obtained, and it remains a key open problem to find efficient algorithms for non-interactive data release with differential privacy for many domains. [14] pointed out that a natural approach to side-stepping the hardness is to relax the utility requirement, not requiring accuracy for every input database.

Several recent works studied differentially private mechanisms for particular kinds of data such as search logs [15], [16] or set-valued data [17]. Others proposed algorithms for specific applications or optimization goals such as recommender systems [18], record linkage [19], data mining [20], or differentially private data cubes with minimized overall cuboid error [21]. It is important to note that [19] uses several tree strategies including the kd-tree in its partitioning step, and our results show that our 2-phase uniformity-driven kd-tree strategy achieves better utility for random count queries.

A few works considered releasing data for predictive count queries and are closely related to ours. X. Xiao et al. [22] developed an algorithm using wavelet transforms. [23] generates differentially private histograms for single-dimensional range queries through a hierarchical partitioning approach and a consistency check technique. [24] proposes a query matrix mechanism that generates an optimal query strategy based on the query workload of linear count queries and further maps the work in [22] and [23] to special query strategies that can be represented by a query matrix. It is worth noting that the cell-based partitioning in our approach is essentially the identity query matrix referred to in [24]. While we will leverage the query matrix framework to formally analyze our approach, it is important to note that the above mentioned query strategies are data-oblivious in that they are determined by the query workload, a static wavelet matrix, or a hierarchical matrix without taking the underlying data into consideration. On the other hand, our 2-phase kd-tree based partitioning is designed to explicitly exploit the smoothness of the underlying data indirectly observed by the differentially private interface, and the final query matrix corresponding to the released histogram depends on the approximate data distribution. We are also aware of a forthcoming work [25] and will compare with it as its proceedings become available.

In summary, our work complements and advances the above works in that we focus on differentially private histogram release for random query workloads using a multidimensional partitioning approach that is “data-aware”. Sharing the insights from [14], [17], our primary viewpoint is that it is possible and desirable in practice to design adaptive or data-dependent heuristic mechanisms for differentially private data release for useful families or subclasses of databases and applications. Our approach provides formal utility guarantees for a class of queries and also supports a variety of applications including general OLAP, classification, and record linkage.

3 PRELIMINARIES AND DEFINITIONS

In this section, we formally introduce the definitions of differential privacy, the data model and the queries we consider, as well as a formal utility notion called (ε, δ)-usefulness. Matrices and vectors are indicated with bold letters (e.g. H, x) and their elements are indicated as Hij or xi. While we will introduce mathematical notations in this section and subsequent sections, Table 1 lists the frequently used symbols for reference.

TABLE 1
Frequently used symbols

Symbol       Description
n            number of records in the dataset
m            number of cells in the data cube
xi           original count of cell i (1 ≤ i ≤ m)
yi           released count of cell i in the cell histogram
yp           released count of partition p in the subcube histogram
np           size of partition p
s            size of query range
α, α1, α2    differential privacy parameters
γ            smoothness parameter

3.1 Differential Privacy

Definition 3.1 (α-Differential privacy [7]): In the interactive model, an access mechanism A satisfies α-differential privacy if for any neighboring databases¹ D1 and D2, for any query function Q and any r ⊆ Range(Q), where AQ(D) is the mechanism that returns an answer to query Q(D),

Pr[AQ(D1) = r] ≤ e^α · Pr[AQ(D2) = r]

In the non-interactive model, a data release mechanism A satisfies α-differential privacy if for all neighboring databases D1 and D2 and any released output D̂,

Pr[A(D1) = D̂] ≤ e^α · Pr[A(D2) = D̂]

Laplace Mechanism. To achieve differential privacy, we use the Laplace mechanism [6] that adds random noise drawn from a Laplace distribution to the true answer of a query Q, AQ(D) = Q(D) + N, where N is the Laplace noise. The magnitude of the noise depends on the privacy level and the query’s sensitivity.

Definition 3.2 (Sensitivity): For arbitrary neighboring databases D1 and D2, the sensitivity of a query Q, denoted by SQ, is the maximum difference between the query results of D1 and D2,

SQ = max |Q(D1) − Q(D2)|    (1)

To achieve α-differential privacy for a given query Q on dataset D, it is sufficient to return Q(D) + N in place of the original result Q(D), where N is drawn from Lap(SQ/α) [6].
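As an illustration of the Laplace mechanism described above, the following minimal Python sketch (our own illustration, not code from the paper) perturbs the answer of a query with noise drawn from Lap(SQ/α):

import numpy as np

def laplace_mechanism(true_answer, sensitivity, alpha):
    # Add Laplace noise with scale S_Q / alpha to the true answer,
    # which suffices for alpha-differential privacy [6].
    scale = sensitivity / alpha
    return true_answer + np.random.laplace(loc=0.0, scale=scale)

# Example: a counting query has sensitivity S_Q = 1.
noisy_count = laplace_mechanism(true_answer=42, sensitivity=1, alpha=0.1)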

Composition. The composability of differential privacy [11] ensures privacy guarantees for a sequence of differentially-private computations. For a general series of analyses, the privacy parameter values add up, i.e. the privacy guarantees degrade as we expose more information. In the special case that the analyses operate on disjoint subsets of the data, the ultimate privacy guarantee depends only on the worst of the guarantees of each analysis, not the sum.

1. We use the definition of unbounded neighboring databases [8], consistent with [11], which treats the databases as multisets of records and requires their symmetric difference to be 1.


Theorem 3.1 (Sequential Composition [11]): Let Mi each provide αi-differential privacy. The sequence of Mi provides (∑i αi)-differential privacy.

Theorem 3.2 (Parallel Composition [11]): If Di are disjoint subsets of the original database and Mi provides α-differential privacy for each Di, then the sequence of Mi provides α-differential privacy.

Differential Privacy Interface. A privacy interface such as PINQ [11] can be used to provide a differentially private interface to a database. It provides operators for database aggregate queries such as count (NoisyCount) and sum (NoisySum), which use Laplace noise and the exponential mechanism to enforce differential privacy. It also provides a Partition operator that can partition the dataset based on a provided set of candidate keys. The Partition operator takes advantage of parallel composition and thus the privacy costs do not add up.
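The composition properties can be illustrated with a small budget-accounting sketch in Python (hypothetical helpers, not the PINQ API): repeated queries over the same data consume the sum of their budgets, while one noisy count per disjoint partition consumes only a single budget.

import numpy as np

def noisy_count(records, alpha):
    # One differentially private count; the sensitivity of a count is 1.
    return len(records) + np.random.laplace(scale=1.0 / alpha)

def release_sequence(records, alphas):
    # Sequential composition (Theorem 3.1): queries on the same data
    # cost sum(alphas) in total.
    return [noisy_count(records, a) for a in alphas]

def release_disjoint_partitions(partitions, alpha):
    # Parallel composition (Theorem 3.2): the partitions are disjoint,
    # so releasing one noisy count per partition costs alpha in total.
    return [noisy_count(p, alpha) for p in partitions]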

3.2 Data and Query Model

Data Model. Consider a dataset with N nominal or discretized attributes. We use an N-dimensional data cube, also called a base cuboid in the data warehousing literature [26], [21], to represent the aggregate information of the data set. The records are the points in the N-dimensional data space. Each cell of a data cube represents an aggregated measure, in our case the count of the data points corresponding to the multidimensional coordinates of the cell. We denote the number of cells by m, where m = |dom(A1)| ∗ · · · ∗ |dom(AN)| and |dom(Ai)| is the domain size of attribute Ai. We use the term “partition” to refer to any sub-cube in the data cube.

Fig. 2. Running example: original data represented in a relational table (left) and a 2-dimensional count cube (right)

Figure 2 shows an example relational dataset with attributes age and income (left) and a two-dimensional count data cube or histogram (right). The domain values of age are 20∼30, 30∼40 and 40∼50; the domain values of income are 0∼10K, 10∼20K and >20K. Each cell in the data cube represents the population count corresponding to the age and income values.
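A count cube like the one in Figure 2 can be materialized from discretized records with a standard multidimensional histogram computation. The sketch below is purely illustrative (the records and bin edges are made up to mirror the running example):

import numpy as np

# Hypothetical discretized records as (age, income) pairs.
records = np.array([[25, 15000], [35, 30000], [35, 60000], [28, 8000]])

# Bin edges mirroring the running example: age 20~30, 30~40, 40~50;
# income 0~10K, 10~20K, >20K (the open-ended bin is capped arbitrarily).
age_edges = [20, 30, 40, 50]
income_edges = [0, 10000, 20000, 10**9]

# counts[i, j] is the number of records in age bin i and income bin j;
# flattening it gives the data vector x with m = 9 cells.
counts, _ = np.histogramdd(records, bins=[age_edges, income_edges])
x = counts.flatten()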

Query Model. We consider linear counting queries that compute a linear combination of the count values in the data cube based on a query predicate. We can represent the original data cube, e.g. the counts of all cells, by an m-dimensional column vector x shown below:

x = [10 21 37 20 0 0 53 0 0]^T

Definition 3.3 (Linear query [24]): A linear query Q can be represented as an m-dimensional vector Q = [q1 . . . qm] with each qi ∈ R. The answer to a linear query Q on data vector x is the vector product Qx = q1x1 + · · · + qmxm.

In this paper, we consider counting queries with boolean predicates so that each qi is a boolean variable with value 0 or 1. The sensitivity of the counting queries, based on equation (1), is SQ = 1. We denote by s the query range size or query size, which is the number of cells contained in the query predicate, and we have s = |Q|. For example, a query Q1 asking for the population count with age = [20, 30] and income > 20K, corresponding to x1, is shown as a query vector in Figure 3. The size of this query is 1. Figure 3 also shows the original answer of Q1 and a perturbed answer with Laplace noise that achieves α-differential privacy. We note that the techniques and proofs are generalizable to real-valued query vectors.
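In this vector notation, answering a linear counting query is simply an inner product; a minimal sketch with the values of the running example (illustrative only):

import numpy as np

# Data vector x from Figure 2, cells x1..x9.
x = np.array([10, 21, 37, 20, 0, 0, 53, 0, 0])

# Query Q1: age in [20, 30] and income > 20K, i.e. only cell x1.
q1 = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0])

true_answer = q1 @ x  # Qx = 10
# An alpha-differentially private answer adds Lap(1/alpha) noise,
# since the sensitivity of a counting query is S_Q = 1.
alpha = 0.5
noisy_answer = true_answer + np.random.laplace(scale=1.0 / alpha)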

Fig. 3. Example: a linear counting query

Definition 3.4 (Query matrix [24]): A query matrix is a collection of linear queries, arranged by rows to form a p × m matrix.

Given a p × m query matrix H, the query answer for H is a length-p column vector of query results, which can be computed as the matrix product Hx. For example, an m × m identity query matrix Im will result in a length-m column vector consisting of all the cell counts in the original data vector x.

A data release algorithm, consisting of a sequence of designed queries using the differential privacy interface, can be represented as a query matrix. We will use this query matrix representation in the analysis of our algorithms.

3.3 Utility Metrics

We formally analyze the utility of the released data by the notion of (ε, δ)-usefulness [9].

Definition 3.5 ((ε, δ)-usefulness [9]): A database mechanism A is (ε, δ)-useful for queries in class C if with probability 1 − δ, for every Q ∈ C and every database D, for D̂ = A(D), |Q(D) − Q(D̂)| ≤ ε.


In this paper, we mainly focus on linear counting queries to formally analyze the released histograms. We will discuss and experimentally show how the released histogram can be used to support other types of OLAP queries such as sum and average, as well as other applications such as classification.

3.4 Laplace Distribution Properties

We include a general lemma on probability distributions and a theorem on the statistical distribution of the summation of multiple Laplace noises, which we will use when analyzing the utility of our algorithms.

Lemma 3.1: If Z ∼ f(z), then aZ ∼ (1/a) f(z/a), where a is any constant.

Theorem 3.3 ([27]): Let fn(z, α) be the PDF of ∑_{i=1}^{n} Ni(α), where Ni(α) are i.i.d. Laplace noises Lap(1/α). Then

fn(z, α) = (α^n / (2^n Γ^2(n))) exp(−α|z|) ∫_0^∞ v^(n−1) (|z| + v/(2α))^(n−1) e^(−v) dv    (2)

4 MULTIDIMENSIONAL PARTITIONING

4.1 Motivation and Overview

For differentially private histogram release, a multidimensional histogram on a set of attributes is constructed by partitioning the data points into mutually disjoint subsets called buckets or partitions. The counts or frequencies in each bucket are then released. Any access to the original database is conducted through the differential privacy interface to guarantee differential privacy. The histogram can then be used to answer random counting queries and other types of queries.

The partitioning strategy will largely determine the utility of the released histogram for arbitrary counting queries. Each partition introduces a bounded Laplace noise or perturbation error by the differential privacy interface. If a query predicate covers multiple partitions, the perturbation error is aggregated. If a query predicate falls within a partition, the result has to be estimated assuming a certain distribution of the data points in the partition. The dominant approach in the histogram literature is to make the uniform distribution assumption, where the frequencies of records in the bucket are assumed to be the same and equal to the average of the actual frequencies [28]. This introduces an approximation error.

Example. We illustrate the errors and the impact of different partitioning strategies through an example shown in Figure 4. Consider the data in Figure 2. As a baseline strategy, we could release a noisy count for each of the cells. In a data-aware strategy, as if we knew the original data, the 4 cells x5, x6, x8, x9 can be grouped into one partition and we release a single noisy count for the partition. Note that the noise is independently generated for each cell or partition. Because the sensitivity of the counting query is 1 and the partitioning only requires parallel composition of differential privacy, the magnitude of the noise in the two approaches is the same. Consider a query, count(x5), asking the count of data points in the region x5. For the baseline strategy, the query error is N, which consists only of the perturbation error. For the data-aware strategy, the best estimate for the answer based on the uniform distribution assumption is 0 + N/4, so the query error is N/4. In this case, the approximation error is 0 because the cells in the partition are indeed uniform; if not, approximation error would be introduced. In addition, the perturbation error is also amortized among the cells. Clearly, the data-aware strategy is desired in this case.

Fig. 4. Baseline strategy vs. data-aware strategy

In general, a finer-grained partitioning will introduce smaller approximation errors but larger aggregated perturbation errors. Finding the right balance to minimize the overall error for a random query workload is a key question. Not surprisingly, finding the optimal multi-dimensional histogram, even without the privacy constraints, is a challenging problem, and optimal partitioning even in two dimensions is NP-hard [29]. Motivated by the above example and guided by the composition theorems, we summarize our two design goals: 1) generate uniform or close to uniform partitions so that the approximation error within the partitions is minimized, essentially generating a v-optimal histogram [30]; 2) carefully and efficiently use the privacy budget to minimize the perturbation error. In this paper, we first study the most fine-grained cell-based partitioning as a baseline strategy, which results in an equi-width histogram and introduces no approximation error but only perturbation error. We then propose a 2-phase kd-tree (k-dimensional tree) based partitioning strategy that results in a v-optimal histogram and seeks to minimize both the perturbation and approximation errors.

4.2 A Baseline Cell Partitioning Strategy

A simple strategy is to partition the data based on the domain and then release a noisy count for each cell, which results in an equi-width cell histogram. Figure 5 illustrates this baseline strategy. The implementation is quite simple, taking advantage of the Partition operator followed by NoisyCount on each partition, as shown in Algorithm 1.

Fig. 5. Baseline cell partitioning

Algorithm 1 Baseline cell partitioning algorithm
Require: α: differential privacy budget
1. Partition the data based on all domains.
2. Release NoisyCount of each partition using privacy parameter α.
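A minimal sketch of Algorithm 1 in Python, assuming the true cell counts have already been materialized as a vector x (the noise call stands in for the NoisyCount primitive of a privacy interface such as PINQ):

import numpy as np

def baseline_cell_histogram(x, alpha):
    # Algorithm 1: release one noisy count per cell. The cells are
    # disjoint, so by parallel composition (Theorem 3.2) the whole
    # release is alpha-differentially private.
    x = np.asarray(x, dtype=float)
    return x + np.random.laplace(scale=1.0 / alpha, size=x.shape)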

Privacy Guarantee. We present the theorem below for the cell partitioning algorithm, which can be derived directly from the composability theorems.

Theorem 4.1: Algorithm 1 achieves α-differential privacy.

Proof: Because every cell is a disjoint subset of the original database, according to Theorem 3.2, the release is α-differentially private.

Error Quantification. We present a lemma followed by a theorem that states a formal utility guarantee of cell-based partitioning for linear distributive queries.

Lemma 4.1: If Ni (i = 1 . . . m) is a set of random variables i.i.d. from Lap(b) with mean 0, given 0 < ε < 1, the following holds:

Pr[∑_{i=1}^{m} |Ni| ≤ ε] ≥ 1 − m · exp(−ε/(mb))    (3)

Proof: Let ε1 = ε/m. Given the Laplace distribution, we have

Pr[|Ni| > ε1] = 2 ∫_{ε1}^{∞} (1/(2b)) exp(−x/b) dx = e^(−ε1/b)

and then

Pr[|Ni| ≤ ε1] = 1 − Pr[|Ni| > ε1] = 1 − e^(−ε1/b)

If each |Ni| ≤ ε1, we have ∑_{i=1}^{m} |Ni| ≤ m · ε1 = ε, so

Pr[∑_{i=1}^{m} |Ni| ≤ ε] ≥ Pr[|Ni| ≤ ε1]^m = (1 − e^(−ε1/b))^m

Let F(x) = (1 − x)^m + mx − 1. The derivative of F(x) is F′(x) = −m(1 − x)^(m−1) + m = m(1 − (1 − x)^(m−1)) ≥ 0 when 0 < x < 1. Note that 0 < e^(−ε1/b) < 1, so F(e^(−ε1/b)) ≥ F(0) = 0. We get

(1 − e^(−ε1/b))^m ≥ 1 − m · e^(−ε1/b)

Recalling that ε1 = ε/m, we derive equation (3).

Theorem 4.2: The released D̂ of Algorithm 1 maintains (ε, δ)-usefulness for linear counting queries if α ≥ m · ln(m/δ) / ε, where m is the number of cells in the data cube.

Proof: Given the original data D represented as a count vector x, using the cell partitioning with the Laplace mechanism, the released data D̂ can be represented as y = x + N, where N is a length-m column vector of Laplace noises drawn from Lap(b) with b = 1/α.

Given a linear counting query Q represented as q with query size s (s ≤ m), we have Q(D) = qx, and

Q(D̂) = qy = qx + qN = qx + ∑_{i=1}^{s} Ni

so |Q(D̂) − Q(D)| ≤ ∑_{i=1}^{s} |Ni|. With Lemma 4.1, we have

Pr[|Q(D̂) − Q(D)| ≤ ε] ≥ Pr[∑_{i=1}^{s} |Ni| ≤ ε] ≥ 1 − m · exp(−ε/(mb))

If m · exp(−ε/(mb)) ≤ δ, then

Pr[|Q(D̂) − Q(D)| ≤ ε] ≥ 1 − δ

In order for m · exp(−ε/(mb)) ≤ δ, given b = 1/α, we derive the condition α ≥ m · ln(m/δ) / ε.

4.3 DPCube: Two-Phase Partitioning

Fig. 6. DPCube: 2-phase partitioning

We now present our DPCube algorithm. DPCube uses an innovative two-phase partitioning strategy, as shown in Figure 6. First, a cell-based partitioning based on the domains (not the data) is used to generate a fine-grained equi-width cell histogram (as in the baseline strategy), which gives an approximation of the original data distribution. We generate a synthetic database Dc based on the cell histogram. Second, a multi-dimensional kd-tree based partitioning is performed on Dc to obtain uniform or close to uniform partitions. The resulting partitioning keys are used to Partition the original database and obtain a NoisyCount for each of the partitions, which results in a v-optimal histogram [30]. Finally, given a user-issued query, an estimation component uses either the v-optimal histogram or both histograms to compute an answer. The key innovation of our algorithm is that it is data-aware or adaptive, because the multi-dimensional kd-tree based partitioning is based on the cell histogram from the first phase and hence exploits the underlying data distribution indirectly observed through the perturbed cell histogram. Essentially, the kd-tree is based on an approximate distribution of the original data. The original database is not queried during the kd-tree construction, which saves the privacy budget. The overall privacy budget is efficiently used and divided between the two phases only for querying the NoisyCount of cells and partitions. Algorithm 2 presents a sketch of the algorithm.

The key step in Algorithm 2 is the multi-dimensional partitioning step. It uses a kd-tree based space partitioning strategy that seeks to produce close to uniform partitions in order to minimize the estimation error within a partition. It starts from the root node, which covers the entire space. At each step, a splitting dimension and a split value from the range of the current partition on that dimension are chosen heuristically to divide the space into subspaces. The algorithm repeats until a pre-defined requirement is met. In contrast to traditional kd-tree index construction, which desires a balanced tree, our main design goal is to generate uniform or close to uniform partitions so that the approximation error within the partitions is minimized. Thus we propose a uniformity-based heuristic to decide whether to split the current partition and to select the best splitting point. There are several metrics that can be used to measure the uniformity of a partition, such as information entropy and variance. In our experiments, we use variance as the metric. Concretely, we do not split a partition if its variance is below a threshold, i.e. it is close to uniform, and split it otherwise. In order to select the best splitting point, we choose the dimension with the largest range and the splitting point that minimizes the cumulative weighted variance of the resulting partitions. This heuristic is consistent with the goal of a v-optimal histogram, which places the histogram bucket boundaries to minimize the cumulative weighted variance of the buckets. Algorithm 3 describes the multi-dimensional partitioning step of Algorithm 2.
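The following sketch is our own simplification of this heuristic: it operates directly on a (noisy) count array rather than on database records, keeps a partition whose variance is at most the threshold ξ0, and otherwise splits the dimension with the largest range at the point minimizing the cumulative weighted variance of the two resulting partitions.

import numpy as np

def weighted_var(block):
    # Variance of a block weighted by its total count, as in the
    # v-optimal histogram objective.
    return block.sum() * block.var()

def kd_partition(cube, bounds, xi0):
    # cube: d-dimensional array of (noisy) cell counts;
    # bounds: list of (lo, hi) index ranges, one per dimension.
    # Returns the bounds of the leaf partitions.
    block = cube[tuple(slice(lo, hi) for lo, hi in bounds)]
    splittable = [d for d, (lo, hi) in enumerate(bounds) if hi - lo > 1]
    if not splittable or block.var() <= xi0:
        return [bounds]                       # close to uniform: keep
    dim = max(splittable, key=lambda d: bounds[d][1] - bounds[d][0])
    lo, hi = bounds[dim]
    best = None
    for cut in range(lo + 1, hi):             # candidate splitting points
        left, right = list(bounds), list(bounds)
        left[dim], right[dim] = (lo, cut), (cut, hi)
        cost = (weighted_var(cube[tuple(slice(a, b) for a, b in left)]) +
                weighted_var(cube[tuple(slice(a, b) for a, b in right)]))
        if best is None or cost < best[0]:
            best = (cost, left, right)
    _, left, right = best
    return kd_partition(cube, left, xi0) + kd_partition(cube, right, xi0)

For example, kd_partition(y_cell.reshape(shape), [(0, n) for n in shape], xi0) returns index ranges that can then be mapped to partition keys for the second-phase NoisyCount queries (y_cell and shape being the noisy cell histogram and cube shape from phase I).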

Privacy Guarantee. We present the theorem below for the 2-phase partitioning algorithm, which can be derived directly from the composability theorems.

Theorem 4.3: Algorithm 2 is α-differentially private.

Proof: Steps 2 and 5 are α1- and α2-differentially private respectively. So the sequence is α-differentially private by Theorem 3.1 with α = α1 + α2.

Algorithm 2 2-phase partitioning algorithm
Require: β: number of cells; α: the overall privacy budget
Phase I:
1. Partition the original database based on all domains.
2. Get NoisyCount of each partition using privacy parameter α1 and generate a synthetic dataset Dc.
Phase II:
3. Partition Dc by Algorithm 3.
4. Partition the original database based on the partition keys returned from step 3.
5. Release NoisyCount of each partition using privacy parameter α2 = α − α1.

Algorithm 3 Kd-tree based v-optimal partitioning
Require: Dt: input database; ξ0: variance threshold
if variance of Dt > ξ0 then
  Find a dimension and splitting point m which minimizes the cumulative weighted variance of the two resulting partitions;
  Split Dt into Dt1 and Dt2 by m;
  Partition Dt1 and Dt2 by Algorithm 3.
end if

Query Matrix Representation. We now illustrate how the proposed algorithm can be represented as a query matrix. We denote by H the query matrix generating the released data in our algorithm, and we have H = [HII; HI], where HI and HII correspond to the query matrices in the cell partitioning and kd-tree partitioning phases respectively. HI is an identity matrix with m rows, each row querying the count of one cell. HII contains all the partitions generated by the second phase. We use N(α) to denote the column noise vector, where each noise Ni is determined by a differential privacy parameter (α1 in the first phase and α2 in the second phase respectively). The released data is y = Hx + N. It consists of the cell histogram yI in phase I and the v-optimal histogram yII in phase II: yI = HIx + N(α1), yII = HIIx + N(α2).

Using our example data from Figure 2, the query matrix H consisting of HI and HII and the released data consisting of the cell histogram yI and subcube histogram yII are shown in equations (4) and (5). The histograms are also illustrated in Figure 7.

HI = I9 (the 9 × 9 identity matrix), and

yI = [10 + N1(α1), 21 + N2(α1), 37 + N3(α1), 20 + N4(α1), 0 + N5(α1), 0 + N6(α1), 53 + N7(α1), 0 + N8(α1), 0 + N9(α1)]^T    (4)

HII =
[ 1 1 0 0 0 0 0 0 0 ]
[ 0 0 1 0 0 0 0 0 0 ]
[ 0 0 0 1 0 0 1 0 0 ]
[ 0 0 0 0 1 1 0 1 1 ]

yII = [31 + N10(α2), 37 + N11(α2), 73 + N12(α2), 0 + N13(α2)]^T    (5)

Fig. 7. Example: released cell histogram (left) and subcube histogram (right)

4.4 Query Estimation and Error Quantification

Once the histograms are released, given a random user query, the estimation component will compute an answer using the released histograms. We study two techniques and formally quantify and compare the errors of the query answers they generate. The first one estimates the query answer using only the subcube histogram, assuming a uniform distribution within each partition. As an alternative approach, we adapt the least squares (LS) method, also used in [24], [23], to our two-phase strategy. The basic idea is to use both the cell histogram and the subcube histogram and find an approximate solution of the cell counts that resolves the inconsistency between the two histograms.

Uniform Estimation using Subcube Histogram. Based on our design goal of obtaining uniform partitions for the subcube histogram, we make the uniform distribution assumption in the subcube histogram, where the counts of the cells within a partition are assumed to be the same. Given a partition p, we denote by np the size of the partition in number of cells. Hence the count of each cell within this partition is yp/np, where yp is the noisy count of the partition. We denote by x̂H the estimated cell counts of the original data. If a query predicate falls within one single partition, the estimated answer for that query is (s/np) · yp, where s is the query range size. Given a random linear query that spans multiple partitions, we can add up the estimated counts of all the partitions within the query predicate. In the rest of the error analysis, we only consider queries within one partition, as the result can be easily extended to the general case.

Figure 8 shows an example. Given a query Q2 on the population count with age = [30, 40], the original answer is x2 + x5 + x8. Using the subcube histogram, the query overlaps with partition y10 and partition y13. Using the uniform estimation, the estimated answer is y10/2 + y13/2.
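A minimal sketch of the uniform estimation, assuming each released partition is represented by its set of cell indices and its noisy count (hypothetical data structures and noisy values, not the paper's implementation):

def uniform_estimate(query_cells, partitions):
    # query_cells: set of cell indices covered by the query predicate;
    # partitions: list of (cell_index_set, noisy_count) pairs.
    # Each partition contributes noisy_count * |overlap| / |partition|.
    answer = 0.0
    for cells, noisy_count in partitions:
        overlap = len(query_cells & cells)
        if overlap:
            answer += noisy_count * overlap / len(cells)
    return answer

# Example mirroring Figure 8: the query covers cells {2, 5, 8};
# partition y10 = {1, 2}, y11 = {3}, y12 = {4, 7}, y13 = {5, 6, 8, 9}.
partitions = [({1, 2}, 31.4), ({3}, 36.1), ({4, 7}, 72.9), ({5, 6, 8, 9}, 0.8)]
estimate = uniform_estimate({2, 5, 8}, partitions)   # = y10/2 + y13/2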

Fig. 8. Example: query estimation with uniform estimation using the subcube histogram

Error Quantification for Uniform Estimation. We derive the (ε, δ)-usefulness and the expected error for the uniform estimation method. To formally understand the distribution properties of the input database that guarantee the optimality of the method, we first define the smoothness of the distribution within a partition or database, similar to [31]. The difference is that [31] defines only an upper bound, while our definition below bounds the differences of all cell counts. Intuitively, the smaller the differences, the smoother the distribution.

Definition 4.1 (γ-smoothness): Denote by x the original counts of the cells in one partition. If ∀xi, xj ∈ x, |xj − xi| ≤ γ, then the distribution in the partition satisfies γ-smoothness.
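For intuition, γ-smoothness of a partition can be checked directly from its cell counts; a trivial sketch (our own illustration):

import numpy as np

def is_gamma_smooth(cell_counts, gamma):
    # All pairwise differences are at most gamma iff max - min <= gamma.
    cell_counts = np.asarray(cell_counts)
    return cell_counts.max() - cell_counts.min() <= gamma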

We now present a theorem that formally analyzes the utility of the released data when the input database satisfies γ-smoothness, followed by a theorem for the general case.

Theorem 4.4: For γ-smooth data x, given a query q with size s and a partition of size np, the uniform estimation method is (ε, δ)-useful if equation (6) holds; the upper bound on the expected absolute error E(εH) is given in equation (7):

γ ≤ (ε + s · log(δ)/(α2 np)) / min(s, np − s)    (6)

E(εH) ≤ γ · min(s, np − s) + s/(α2 np)    (7)

Proof: Given a query vector Q, the answer using the original data is Qx, and the estimated answer using the released partition is (s/np) · yp, where yp is the released partition count, np is the partition size, and s is the query size. The released count is yp = ∑_{i=1}^{np} xi + N(α2). So we have the absolute error

εH = |(s/np) · yp − Qx| = |((s/np) ∑_{i=1}^{np} xi − Qx) + (s/np) N(α2)|

According to the definition of γ-smoothness, ∀i, j, |xj − xi| ≤ γ, and therefore |(s/np) ∑_{i=1}^{np} xi − Qx| ≤ min(s, np − s) · γ. Because of the symmetry of the PDF of N(α2), we have

εH ≤ |min(s, np − s) · γ + (s/np) N(α2)| ≤ min(s, np − s) · γ + |(s/np) N(α2)|

By Lemma 3.1, we know the PDF of (s/np) N(α2). To satisfy (ε, δ)-usefulness, we require Pr(εH ≤ ε) ≥ 1 − δ, from which we derive the condition in equation (6).

The expected absolute error is

E(εH) ≤ |(s/np) ∑_{i=1}^{np} xi − Qx| + (s/np) E|N(α2)| ≤ min(s, np − s) · γ + (s/np) E|N(α2)|

Based on the PDF of (s/np) N(α2), we have E|N(α2)| = 1/α2, and hence we derive equation (7).

From Theorem 4.4, we can conclude that if the input data is smoothly distributed or very sparse, γ will be small and the error will be small. In this case, our algorithm achieves the best result. In the general case, when we do not know the distribution properties of the input data, we present Theorem 4.5 to quantify the error.

Theorem 4.5: Given a linear counting query Q, the expected absolute error of the uniform estimation method, E(εH), is a function of (α1, s, η) shown in equation (8):

E(εH) = ∫ fs(z, α1) |η + z| dz    (8)

where η = (s/np) · yp − QyI, yp is the released count of the partition in the subcube histogram, np is the size of the partition, s is the size of the query, and yI is the vector of released counts of the cell histogram.

Proof: Given a query Q, the answer using the original data is Qx, and the estimated answer using the released partition is (s/np) · yp. The released partition count is yp = ∑_{i=1}^{np} xi + N(α2). The released cell counts of the cell histogram in the first phase are yI = x + N(α1). So we have

εH = |(s/np) · yp − Qx| = |(s/np) · yp − Q(yI − N(α1))| = |((s/np) · yp − QyI) + ∑_{i=1}^{s} Ni(α1)|

Denote η = (s/np) · yp − QyI, which is the difference (inconsistency) between the estimated answers using the cell histogram and the subcube histogram; then εH = |η + ∑_{i=1}^{s} Ni(α1)|. By equation (2), we obtain E(εH) as in equation (8).

Least Square Estimation using Cell Histogram and Subcube Histogram. The least square (LS) method, used in [24], [23], finds an approximate (least square) solution of the cell counts that aims to resolve the inconsistency between multiple differentially private views. In our case, the cell histogram and the subcube histogram provide two views. We derive the theoretical error of the least square method and compare it with the uniform estimation method. Note that the error quantification in [24] is only applicable to the case where α1 and α2 are equal. We derive a new result for the general case in which α1 and α2 may have different values.
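A sketch of the least square estimation for a single partition, built from the two-phase query matrix H = [ones(1, np); Inp] used in the proof of Theorem 4.6 (our own illustration, not the authors' code):

import numpy as np

def ls_estimate(y_partition, y_cells):
    # y_partition: noisy partition total from the subcube histogram;
    # y_cells: noisy per-cell counts (length np) from the cell histogram.
    n_p = len(y_cells)
    H = np.vstack([np.ones((1, n_p)), np.eye(n_p)])   # (np + 1) x np
    y = np.concatenate([[y_partition], y_cells])
    # x_LS = H+ y with H+ = (H^T H)^(-1) H^T; lstsq computes this directly.
    x_ls, *_ = np.linalg.lstsq(H, y, rcond=None)
    return x_ls

# A query answer is then q @ ls_estimate(...) for the query sub-vector q
# restricted to this partition.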

Theorem 4.6: Given a query Q and a least square estimation x̂LS based on the cell histogram and subcube histogram, the expected absolute error of the query answer, E(εLS), is a function of (α1, α2, np, s) given in equation (9), where s is the size of Q and np is the size of the partition:

E(εLS) = (np + 1)^3 / (s^2 (np + 1 − s)) · ∫ |ε| ∫ ∫ f_{np−s}(−(ε − z)(np + 1)/s, α1) · f_s((z − y)(np + 1)/(np + 1 − s), α1) · f_1(y(np + 1)/s, α2) dy dz dε    (9)

Proof: Given our two-phase query strategy, the query matrix for the partition is H = [ones(1, np); Inp], where np is the partition size. Using the least square method in [24], we compute H+ = (H^T H)^(−1) H^T and obtain the np × (np + 1) matrix

H+ = (1/(np + 1)) ·
[ 1   np   −1   −1   · · ·   −1 ]
[ 1   −1   np   −1   · · ·   −1 ]
[ 1   −1   −1   np   · · ·   −1 ]
[ ⋮     ⋮     ⋮     ⋮     ⋱     ⋮  ]
[ 1   −1   −1   −1   · · ·   np ]

We compute the least square estimation based on the released data y as x̂LS = H+ y. So the query answer using the estimation is

Q x̂LS = Q H+ y = Q H+ (Hx + N(α)) = Qx + Q H+ N(α)

The error is

Q x̂LS − Qx = (s/(np + 1)) N(α2) + ((np + 1 − s)/(np + 1)) ∑_{i=1}^{s} Ni(α1) − (s/(np + 1)) ∑_{i=1}^{np−s} Ni(α1)

By Lemma 3.1 and equation (2), we know the PDFs of (s/(np + 1)) N(α2), ((np + 1 − s)/(np + 1)) ∑_{i=1}^{s} Ni(α1), and (s/(np + 1)) ∑_{i=1}^{np−s} Ni(α1); then by the convolution formula, we obtain equation (9).

We will plot the above theoretical results for both uniform estimation and least square estimation with varying parameters in Section 6 and demonstrate the benefit of the uniform estimation method, especially when the data is smoothly distributed.

5 APPLICATIONS

Having presented the multidimensional partitioning approach for differentially private histogram release, we now briefly discuss the applications that the released histogram can support.

OLAP. On-line analytical processing (OLAP) is a key technology for business-intelligence applications. The computation of multidimensional aggregates, such as count, sum, max, and average, is the essence of on-line analytical processing. The released histograms with the estimation methods can be used to answer most common OLAP aggregate queries.


Classification. Classification is a common data analysis task. Several recent works studied classification with differential privacy by designing classifier-dependent algorithms (such as decision trees) [32], [33]. The released histograms proposed in this paper can also be used as training data for training a classifier, offering a classifier-independent approach to classification with differential privacy. To compare the approach with existing solutions [32], [33], we have chosen an ID3 tree classifier algorithm. However, the histograms can be used as training data for any other classifier.

Blocking for Record Linkage. Private record linkage between datasets owned by distinct parties is another common and challenging problem. In many situations, uniquely identifying information may not be available and linkage is performed based on matching of other information, such as age, occupation, etc. Privacy preserving record linkage allows two parties to identify the records that represent the same real-world entities without disclosing additional information other than the matching result. While Secure Multi-party Computation (SMC) protocols can be used to provide strong security and perfect accuracy, they incur prohibitive computational and communication costs in practice. Typical SMC protocols require O(n ∗ m) cryptographic operations, where n and m correspond to the numbers of records in the two datasets. In order to improve the efficiency of record linkage, blocking is generally used [34]. The purpose of blocking is to divide a dataset into mutually exclusive blocks, assuming no matches occur across different blocks. It reduces the number of record pairs to compare for the SMC protocols, but meanwhile we need to devise a blocking scheme that provides a strong privacy guarantee. [19] has proposed a hybrid approach with a differentially private blocking step followed by an SMC step. The differentially private blocking step adopts tree-structured space partitioning techniques and uses Laplace noise at each partitioning step to preserve differential privacy. The matching blocks are then sent to SMC protocols for further matching of the record pairs within the matching blocks, which significantly reduces the total number of pairs that need to be matched by SMC.

The released histograms in this paper can also be used to perform blocking that enjoys differential privacy. Our experiments show that we can achieve an approximately optimal reduction ratio of the pairs that need to be matched by SMC with an appropriate setting of the threshold ξ0. We will examine some empirical results in Section 6.

6 EXPERIMENT

We first simulate and compare the theoretical query error results from Section 4.4 for counting queries that fall within one partition to show the properties and benefits of our approach (Section 6.1). We then present a set of experimental evaluations of the quality of the released histogram in terms of weighted variance (Section 6.2), followed by evaluations of query error against random linear counting queries and a comparison with existing solutions (Section 6.3). Finally, we also implement and experimentally evaluate two additional applications using the released histograms, classification and blocking for record linkage, and show the benefit of our approach (Section 6.4).

6.1 Simulation Plots of Theoretical Results

As some of the theoretical results in Section 4.4 consist of equations that are difficult to compute for given parameters, we use a simulation approach to show the results and the impact of different parameters in this section. Detailed and sophisticated techniques for the simulation approach can be found in [35] and [36]. Table 2 shows the parameters used in the simulation experiments and their default values.

TABLE 2
Simulation parameters

Parameter   Description                                         Default value
np          partition size                                      np = 11
s           query size                                          s ≤ np
α1, α2      diff. privacy parameters                            α1 = 0.05, α2 = 0.15
γ           smoothness parameter                                γ = 5
η           inconsistency between cell and subcube histogram    η = 5

Metric. We evaluate the absolute error of count queries. Recall that E(εLS) is the expected error of the least square estimation method; max(E(εH)) is the upper bound on the expected error of the uniform estimation method when the data is γ-smooth; E(εH) is the expected error of the uniform estimation method in the general case. Note that E(εLS), max(E(εH)), and E(εH) are derived from equations (9), (7), and (8), respectively.

Impact of Query Size. We first study the impact of the query size. Figure 9 shows the error of the uniform and least square solutions with respect to varying query size for γ-smooth data and the general case (any distribution), respectively. We can see that the highest error appears when the query size s is half of the partition size np. When γ is small, i.e. the data is smoothly distributed, the uniform estimation method outperforms the least square method. In the general case, when we do not have any domain knowledge about the data distribution, it is beneficial to use both the cell histogram and the subcube histogram to resolve their inconsistencies.

Fig. 9. Query error vs. query size s: (a) γ-smooth distribution; (b) any distribution

Impact of Privacy Budget Allocation. We now study the impact of the allocation of the overall differential privacy budget α between the two phases. The overall budget is fixed at α = 0.2. Figure 10 shows the error of the uniform and least square solutions with respect to varying budget allocation for γ-smooth data and the general case (any distribution), respectively. For the LS method, equally dividing α between the two phases, or allocating slightly more to cell-based partitioning, works better than other allocations. The results for the uniform estimation method show interesting trends. For smoothly distributed data, a smaller privacy budget for the partitioning phase yields a better result. Intuitively, since the data is smoothly distributed, it is beneficial to save the privacy budget for the second phase to get a more accurate overall partition count. On the contrary, for the random case, it is beneficial to spend the privacy budget in the first phase to get more accurate cell counts, and hence a more accurate partitioning.

Fig. 10. Query error vs. privacy budget allocation α1: (a) γ-smooth distribution; (b) any distribution

Impact of Partition Size. For γ-smooth data, the expected error depends on the partition size np and the smoothness level γ. Figure 11(a) shows the error of the uniform and least square solutions with respect to varying partition size np for γ-smooth data. We observe that the error increases as the partition size increases because of the increasing approximation error within the partition. Therefore, a good partitioning algorithm should avoid large partitions.

Impact of Data Smoothness. For γ-smooth data, we study the impact of the smoothness level γ on the error bound for the uniform estimation method. Figure 11(b) shows the maximum error bound with varying γ. We can see that the smoother the data, the smaller the error of the released data. Note that the query size s is set to nearly half of the partition size np by default, which magnifies the impact of the smoothness. We observed in other parameter settings that the error increases only slowly for queries with small or large query sizes.

Fig. 11. Query error for γ-smooth distribution: (a) vs. partition size np; (b) vs. smoothness γ

Impact of Inconsistency. For data with any (unknown) distribution, E(εH) is a function of η, the level of inconsistency between the cell histogram and the subcube histogram. In contrast to the γ-smooth case, in which we have prior knowledge about the smoothness of the data, here we only have this observed level of inconsistency from the released data, which reflects the smoothness of the original data. Figure 12 shows the error of the uniform estimation method with varying η. Note that the error increases with increasing η, and even when η = 10, the error is still larger than the error of the least square method in Figure 11.

Fig. 12. Query error E(εH) vs. level of inconsistency η for any distribution.

6.2 Histogram Variance

We use the Adult dataset from the Census [37]. All experiments were run on a computer with an Intel P8600 (2 × 2.4 GHz) CPU and 2 GB of memory. For computational simplicity and smoothness of the data distribution, we only use the first 10^4 data records.
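
A minimal sketch of this setup is shown below. The local file name, column names, bucket counts, and the use of a numeric income-related column (capital gain) as a stand-in for the Income attribute are assumptions for illustration, not the authors' exact preprocessing.

```python
import numpy as np
import pandas as pd

# Assumed local path and the standard UCI Adult column order.
adult = pd.read_csv("adult.data", header=None, skipinitialspace=True,
                    names=["age", "workclass", "fnlwgt", "education", "education_num",
                           "marital_status", "occupation", "relationship", "race", "sex",
                           "capital_gain", "capital_loss", "hours_per_week",
                           "native_country", "income"])
adult = adult.head(10_000)                      # first 10^4 records, as in the paper

# Bucket Age and a numeric income proxy into a 10x10 cell grid.
age_bins = np.linspace(adult["age"].min(), adult["age"].max(), 11)
inc_bins = np.linspace(adult["capital_gain"].min(), adult["capital_gain"].max(), 11)
cell_hist, _, _ = np.histogram2d(adult["age"], adult["capital_gain"],
                                 bins=[age_bins, inc_bins])
```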

Original and Released Histograms. We first present some example histograms generated by our algorithm for the Census dataset. Figure 13(a) shows the original 2-dimensional histogram on Age and Income. Figure 13(b) shows a cell histogram generated in the first phase with α1 = 0.05.


Figure 13(c) shows a subcube histogram generated in the second phase with cell counts estimated by uniform estimation, in which each horizontal plane represents one partition. Figure 13(d) shows an estimated cell histogram obtained by LS estimation using both the cell histogram and the subcube histogram. We systematically evaluate the utility of the released subcube histogram with uniform estimation below.
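
The following sketch illustrates the two estimation ideas on a flattened cell index space: uniform estimation spreads each noisy partition total evenly over its cells, while a simple (unweighted) least-squares fit combines the noisy cell counts with the noisy partition totals. The exact weighting used by the paper's LS estimator may differ; the function names and data layout are assumptions.

```python
import numpy as np

def uniform_estimate(partitions, partition_totals, n_cells):
    """Uniform estimation: spread each noisy partition total evenly over its cells."""
    est = np.zeros(n_cells)
    for idx, total in zip(partitions, partition_totals):
        est[idx] = total / len(idx)
    return est

def ls_estimate(noisy_cells, partitions, partition_totals):
    """One simple least-squares combination of noisy cell counts and noisy
    partition totals (an unweighted variant; the paper's estimator may weight
    the two sets of observations differently)."""
    n = len(noisy_cells)
    A = [np.eye(n)]                          # each noisy cell count observes one cell
    b = [np.asarray(noisy_cells, dtype=float)]
    for idx, total in zip(partitions, partition_totals):
        row = np.zeros(n)
        row[idx] = 1.0                       # each partition total observes a sum of cells
        A.append(row[None, :])
        b.append([total])
    A = np.vstack(A)
    b = np.concatenate([np.ravel(x) for x in b])
    x, *_ = np.linalg.lstsq(A, b, rcond=None)
    return x
```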

Fig. 13. Original and released histograms: (a) original histogram; (b) cell histogram; (c) subcube histogram; (d) estimated cell histogram.

Metric. We now evaluate the quality of the released histogram using an application-independent metric, the weighted variance of the released subcube histogram. Ideally, our goal is to obtain a v-optimal histogram that minimizes the weighted variance so as to minimize the estimation error within partitions. Formally, the weighted variance of a histogram is defined as V = \sum_{i=1}^{p} x_i V_i, where p is the number of partitions, x_i is the number of data points in the i-th partition, and V_i is the variance of the i-th partition [30].
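
A direct implementation of this metric is straightforward; the sketch below computes V from values (e.g., cell counts) grouped by partition labels. The label-based grouping representation is an assumption for illustration.

```python
import numpy as np

def weighted_variance(values, partition_labels):
    """Weighted variance V = sum_i x_i * V_i, where x_i is the number of points
    in partition i and V_i is the variance of the values within it."""
    values = np.asarray(values, dtype=float)
    partition_labels = np.asarray(partition_labels)
    V = 0.0
    for label in np.unique(partition_labels):
        points = values[partition_labels == label]
        V += len(points) * np.var(points)    # x_i * V_i
    return V
```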

Fig. 14. Histogram variance and impact of parameters: (a) vs. threshold ξ0; (b) vs. privacy budget α1.

Impact of Variance Threshold. Figure 14(a) shows the weighted variance of the released subcube histogram with respect to varying variance threshold ξ0 used in our algorithm. As expected, when ξ0 becomes large, less partitioning is performed, i.e., more data points are grouped into one bucket, which increases the variance.

Impact of Privacy Budget Allocation. Figure 14(b) shows the weighted variance with respect to varying privacy budget in the first phase, α1 (with ξ0 fixed). Note that the overall privacy budget is fixed. We see that the correlation is not very clear due to the randomness of the noise. Moreover, the fixed ξ0 does not reflect the different magnitudes of noise introduced by different α1. Generally speaking, the variance should decrease gradually as α1 increases, because a larger α1 introduces less noise in the first phase, which leads to better partitioning quality.

6.3 Query Error

We evaluate the quality of the released histogram using random linear counting queries, measure the average absolute query error, and compare the results with other algorithms. We use the Age and Income attributes and generate 10^5 random counting queries to calculate the average query error. Each random query consists of two range predicates, one on each of the two attributes. We also implemented an alternative kd-tree strategy similar to that used in [19], referred to as hierarchical kd-tree, and another data release algorithm [23], referred to as consistency check, for comparison purposes.
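
The evaluation procedure can be sketched as follows: generate random two-dimensional range queries over the histogram grid and average the absolute difference between the answers computed on the true and released histograms. The query representation and helper names are illustrative assumptions.

```python
import numpy as np

def random_range_queries(shape, n_queries, rng=np.random.default_rng(0)):
    """Generate random 2-D range queries as ((r0, r1), (c0, c1)) index ranges."""
    queries = []
    for _ in range(n_queries):
        r = np.sort(rng.integers(0, shape[0], size=2))
        c = np.sort(rng.integers(0, shape[1], size=2))
        queries.append(((r[0], r[1] + 1), (c[0], c[1] + 1)))
    return queries

def avg_abs_error(true_hist, released_hist, queries):
    """Average absolute error of range-count answers on the released histogram."""
    errs = [abs(true_hist[r0:r1, c0:c1].sum() - released_hist[r0:r1, c0:c1].sum())
            for (r0, r1), (c0, c1) in queries]
    return float(np.mean(errs))
```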

Fig. 15. Query error and impact of parameters: (a) vs. threshold ξ0; (b) vs. privacy budget α1.

Impact of Variance Threshold. We first evaluate the impact of the threshold value ξ0 on query error. Figure 15(a) shows the average absolute query error with respect to varying threshold values. Consistent with Figure 14, we observe that the query error increases with an increasing threshold value, due to the increased variance within partitions.

Impact of Privacy Budget Allocation. We next evaluate the impact of how to allocate the overall privacy budget α into α1 and α2 for the two phases of the algorithm. Figure 15(b) shows the average absolute query error vs. varying α1. We observe that small α1 values yield better results, which agrees with our theoretical result for γ-smooth data in Figure 10. This verifies that the real-life Adult dataset has a somewhat smooth distribution.


On the other hand, for data with an unknown distribution, we expect that α1 cannot be too small, since the first phase is more important for generating an accurate partitioning.

Comparison with Other Works. We compare our approach with two representative approaches from existing works: the hierarchical kd-tree strategy used in [19] and the hierarchical partitioning with consistency check in [23]. Figure 16 shows the average absolute query error of the different approaches with respect to varying privacy budget α. We can see that the DPCube algorithm achieves the best utility against the random query workload because of its efficient 2-phase use of the privacy budget and its v-optimal histogram.

Fig. 16. Query error for different approaches: (a) vs. privacy budget α (interactive model, hierarchical kd-tree, consistency check, DPCube); (b) vs. query size (DPCube, consistency check).

We also experimented with query workloads of different query sizes for the DPCube approach and the consistency check approach to examine their differences more closely. Figure 16(b) shows the average absolute query error with respect to varying query size for the two approaches. Note that the query size is in log scale. As suggested by our theoretical result in Figure 11, the query error of our algorithm increases with increasing query size. We can see that for queries of reasonable size, our 2-phase algorithm achieves better results than the consistency check algorithm. On the other hand, the consistency check algorithm is beneficial for large queries, because our algorithm favors smaller partitions with small variances, which results in large aggregated perturbation errors for large queries that span multiple partitions.

6.4 Additional Applications

Classification. We evaluate the utility of the released histogram for classification and compare it with other differentially private classification algorithms. In this experiment, the dataset is divided into training and test subsets with 30162 and 15060 records, respectively. We use the work class, marital status, race, and sex attributes as features. The class label is represented by the salary attribute, with two possible values indicating whether the salary is greater than $50K or not.

For this experiment, we compare several classifiers. As a baseline, we trained an ID3 classifier on the original data using Weka [38]. We also adapted the Weka ID3 implementation so that it can use a histogram as its input. To test the utility of the differentially private histogram generated by our algorithm, we used it to train an ID3 classifier, referred to as DPCube ID3. As a comparison, we implemented an interactive differentially private ID3 classifier, private interactive ID3, introduced by Friedman et al. [33].
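
As a rough analogue of this adaptation, the sketch below trains a decision tree directly from a released histogram by treating each cell as one weighted training instance, using scikit-learn's entropy-based CART tree as a stand-in for the adapted Weka ID3. The data layout and the clipping of negative noisy counts are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def tree_from_histogram(hist_counts, cell_features, cell_labels):
    """Illustrative analogue of training a classifier from a released histogram:
    each histogram cell contributes its (noisy) count as the sample weight of the
    feature/label combination it represents. Uses scikit-learn's CART tree with
    entropy splits as a stand-in for the adapted Weka ID3 in the paper."""
    weights = np.clip(np.asarray(hist_counts, dtype=float), 0.0, None)  # drop negative noisy counts
    clf = DecisionTreeClassifier(criterion="entropy")
    clf.fit(cell_features, cell_labels, sample_weight=weights)
    return clf
```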

Fig. 17. Classification accuracy vs. privacy budget α (ID3, private interactive ID3, DPCube ID3).

Figure 17 shows the classification accuracy of the different classifiers with respect to varying privacy budget α. The original ID3 classifier provides a baseline accuracy of 76.9%. The DPCube ID3 achieves slightly lower but comparable accuracy relative to the baseline due to the noise. While both the DPCube ID3 and the private interactive ID3 achieve better accuracy with increasing privacy budget, as expected, our DPCube ID3 outperforms the private interactive ID3 due to its more efficient use of the privacy budget.

Blocking for Record Linkage. We also evaluated the utility of the released histogram for record linkage and compared our method against the hierarchical kd-tree scheme from [19]. The attributes considered in this experiment are age, education, wage, marital status, race, and sex. Since the histogram is used for the blocking step and all pairs of records in matching blocks are further linked using an SMC protocol, our main goal is to reduce the total number of pairs of records in matching blocks in order to reduce the SMC cost. We use the reduction ratio from [19] as our evaluation metric. It is defined as follows:

reduction ratio = 1 - \frac{\sum_{i=1}^{k} n_i m_i}{n \cdot m}    (10)

where n_i (m_i) is the number of records in dataset 1 (resp. dataset 2) that fall into the i-th block, n (m) is the total number of records in dataset 1 (resp. dataset 2), and k is the total number of blocks.
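
Equation (10) translates directly into code; the sketch below computes the reduction ratio from per-block counts of the two datasets (the example numbers are made up for illustration).

```python
def reduction_ratio(block_counts_1, block_counts_2, n, m):
    """Reduction ratio from Eq. (10): the fraction of the n*m candidate pairs
    eliminated by blocking, where block_counts_1[i] (block_counts_2[i]) is the
    number of records of dataset 1 (2) falling into the i-th block."""
    candidate_pairs = sum(ni * mi for ni, mi in zip(block_counts_1, block_counts_2))
    return 1.0 - candidate_pairs / (n * m)

# Example: two datasets of 1000 records each, split into 4 blocks.
print(reduction_ratio([300, 250, 250, 200], [280, 270, 240, 210], 1000, 1000))
```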

We compared both methods by running experiments with varying privacy budget (α) values (using the first two attributes of each record) and with varying numbers of attributes (with α fixed at 0.1). Figure 18(a) shows the reduction ratio with varying privacy budget. Both methods exhibit an increasing trend in reduction ratio as the privacy budget grows, but our 2-phase v-optimal histogram consistently outperforms the hierarchical kd-tree approach and maintains a steady reduction ratio of around 85%.


Figure 18(b) shows the reduction ratio with varying number of attributes (dimensions). As the number of attributes increases, both methods show a drop in the reduction ratio due to the increasing sparsity of the data points, which increases the relative error of each cell/partition. However, our DPCube approach exhibits desirable robustness as the dimensionality increases compared to the hierarchical kd-tree approach.

Fig. 18. Reduction ratio of blocking for record linkage: (a) vs. privacy budget α; (b) vs. number of attributes (DPCube, hierarchical kd-tree).

7 CONCLUSIONS AND FUTURE WORK

We have presented a two-phase multidimensional partitioning algorithm with estimation algorithms for differentially private histogram release. We formally analyzed the utility of the released histograms and quantified the errors for answering linear counting queries. We showed that the released v-optimal histogram, combined with a simple query estimation scheme, achieves bounded query error and superior utility compared to existing approaches for “smoothly” distributed data. The experimental results on using the released histogram for random linear counting queries and for additional applications, including classification and blocking for record linkage, showed the benefit of our approach. As future work, we plan to develop algorithms that are both data- and workload-aware to boost the accuracy for specific workloads, and to investigate the problem of releasing histograms for temporally changing data.

REFERENCES

[1] Y. Xiao, L. Xiong, and C. Yuan, “Differentially private data release through multidimensional partitioning,” in 7th VLDB Workshop on Secure Data Management, Lecture Notes in Computer Science, vol. 6358, 2010.

[2] Y. Xiao, J. Gardner, and L. Xiong, “DPCube: Releasing differentially private data cubes for health information (demo paper),” in ICDE, 2012.

[3] C. Dwork, “Differential privacy: a survey of results,” in 5th International Conference on Theory and Applications of Models of Computation, TAMC, 2008.

[4] B. C. M. Fung, K. Wang, R. Chen, and P. S. Yu, “Privacy-preserving data publishing: A survey on recent developments,” ACM Computing Surveys, vol. 42, no. 4, 2010.

[5] C. Dwork, “A firm foundation for private data analysis,” Commun. ACM, vol. 54, January 2011.

[6] C. Dwork, F. McSherry, K. Nissim, and A. Smith, “Calibrating noise to sensitivity in private data analysis,” in 3rd Theory of Cryptography Conference, 2006.

[7] C. Dwork, “Differential privacy,” Automata, Languages and Programming, Pt 2, vol. 4052, 2006.

[8] D. Kifer and A. Machanavajjhala, “No free lunch in data privacy,” in SIGMOD, 2011.

[9] A. Blum, K. Ligett, and A. Roth, “A learning theory approach to non-interactive database privacy,” in STOC, 2008.

[10] C. Dwork, M. Naor, O. Reingold, G. N. Rothblum, and S. Vadhan, “On the complexity of differentially private data release: efficient algorithms and hardness results,” in STOC, 2009.

[11] F. McSherry, “Privacy integrated queries: an extensible platform for privacy-preserving data analysis,” in SIGMOD, 2009.

[12] F. McSherry and K. Talwar, “Mechanism design via differential privacy,” in FOCS, 2007.

[13] A. Roth and T. Roughgarden, “Interactive privacy via the median mechanism,” in STOC, 2010.

[14] M. Hardt and G. Rothblum, “A multiplicative weights mechanism for interactive privacy-preserving data analysis,” in FOCS, 2010.

[15] A. Korolova, K. Kenthapadi, N. Mishra, and A. Ntoulas, “Releasing search queries and clicks privately,” in WWW, 2009.

[16] M. Gotz, A. Machanavajjhala, G. Wang, X. Xiao, and J. Gehrke, “Privacy in search logs,” CoRR, vol. abs/0904.0682, 2009.

[17] R. Chen, N. Mohammed, B. C. M. Fung, B. C. Desai, and L. Xiong, “Publishing set-valued data via differential privacy,” in VLDB, 2011.

[18] F. McSherry and I. Mironov, “Differentially private recommender systems: building privacy into the net,” in KDD, 2009.

[19] A. Inan, M. Kantarcioglu, G. Ghinita, and E. Bertino, “Private record matching using differential privacy,” in EDBT, 2010.

[20] N. Mohammed, R. Chen, B. C. Fung, and P. S. Yu, “Differentially private data release for data mining,” in KDD, 2011.

[21] B. Ding, M. Winslett, J. Han, and Z. Li, “Differentially private data cubes: optimizing noise sources and consistency,” in SIGMOD, 2011.

[22] X. Xiao, G. Wang, and J. Gehrke, “Differential privacy via wavelet transforms,” in ICDE, 2010.

[23] M. Hay, V. Rastogi, G. Miklau, and D. Suciu, “Boosting the accuracy of differentially-private histograms through consistency,” VLDB, 2010.

[24] C. Li, M. Hay, V. Rastogi, G. Miklau, and A. McGregor, “Optimizing linear counting queries under differential privacy,” in PODS, 2010.

[25] J. Xu, Z. Zhang, X. Xiao, Y. Yang, and G. Yu, “Differentially private histogram publication,” in ICDE, 2012.

[26] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Second Edition. Morgan Kaufmann, 2006.

[27] U. Küchler and S. Tappe, “On the shapes of bilateral gamma densities,” Statistics & Probability Letters, vol. 78, no. 15, 2008.

[28] Y. Ioannidis, “The history of histograms (abridged),” in Proc. of VLDB Conference, 2003.

[29] S. Muthukrishnan, V. Poosala, and T. Suel, “On rectangular partitionings in two dimensions: Algorithms, complexity, and applications,” in ICDT, 1999, pp. 236–256.

[30] V. Poosala, P. J. Haas, Y. E. Ioannidis, and E. J. Shekita, “Improved histograms for selectivity estimation of range predicates,” SIGMOD Rec., vol. 25, June 1996.

[31] M. Hardt and G. N. Rothblum, “A multiplicative weights mechanism for privacy-preserving data analysis,” in FOCS, 2010.

[32] G. Jagannathan, K. Pillaipakkamnatt, and R. N. Wright, “A practical differentially private random decision tree classifier,” in ICDM Workshops, 2009.

[33] A. Friedman and A. Schuster, “Data mining with differential privacy,” in SIGKDD, 2010.

[34] A. Elmagarmid, P. Ipeirotis, and V. Verykios, “Duplicate record detection: A survey,” IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 1, pp. 1–16, Jan. 2007.

[35] G. Casella and R. L. Berger, Statistical Inference, 2001.

[36] H. Anton and C. Rorres, Introduction to Probability Models, Tenth Edition. John Wiley and Sons, Inc., 2005.

[37] A. Frank and A. Asuncion, “UCI machine learning repository,” 2010. [Online]. Available: http://archive.ics.uci.edu/ml

[38] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, “The WEKA data mining software: an update,” SIGKDD Explor. Newsl., vol. 11, pp. 10–18, November 2009.