Part 3: Query Processing -- Data-Independent Methods

Marianne Winslett (1,3), Xiaokui Xiao (2), Yin Yang (3),
Zhenjie Zhang (3), Gerome Miklau (4)

1 University of Illinois at Urbana-Champaign, USA
2 Nanyang Technological University, Singapore
3 Advanced Digital Sciences Center, Singapore
4 University of Massachusetts, Amherst, USA

Batch query answering

Given: goal query set W (workload), with queries w1, w2, w3.

[Figure: two pipelines over the data. Direct: the workload W is
submitted to the Laplace mechanism, which returns w1(D) + noise,
w2(D) + noise, w3(D) + noise. Alternative: the alternative queries A
(a1, a2, a3) are submitted to the Laplace mechanism, which returns
a1(D) + noise, a2(D) + noise, a3(D) + noise; from these, noisy
estimates of w1(D), w2(D), w3(D) are derived.]

Frequency representation of the database

Relational database:

name     gender   grade
Alice    Female   91
Bob      Male     84
Carl     Male     82
Dave     Male     97
Edwina   Female   88
Faith    Female   78
Ghita    Female   85
...      ...      ...

Frequency vector over {gender, grade}, domain
{Male, Female} x {A,B,C,D,F}:

entry   gender   grade   count
x1      Male     100     10
x2      Male     99      13
x3      Male     98      5
x4      Male     97      7
...     ...      ...     ...
        Female   100     15
        Female   99      21
        Female   98      4
        Female   97      14
xn      Female   96      9

x = [x1, x2, ..., xn]

Frequency vector over {grade}:

entry   grade    count
x1      90-100   10
x2      80-90    23
x3      70-80    16
x4      60-70    3

Answering all range queries

Goal: answer all range-count queries over x.

AllRange = { w | w = xi + ... + xj for 1 ≤ i ≤ j ≤ n }

Workload W for n = 4, with answers for x = [10, 23, 16, 3]:

query   range          sum           answer
w1      range(x1,x4)   x1+x2+x3+x4   52
w2      range(x1,x3)   x1+x2+x3      49
w3      range(x2,x4)   x2+x3+x4      42
w4      range(x1,x2)   x1+x2         33
w5      range(x2,x3)   x2+x3         39
w6      range(x3,x4)   x3+x4         19
w7      range(x1,x1)   x1            10
w8      range(x2,x2)   x2            23
w9      range(x3,x3)   x3            16
w10     range(x4,x4)   x4            3

For a domain of size n, there are n(n+1)/2 = O(n^2) range queries.
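
The AllRange workload can be written out as a list of 0/1 rows, one per range. The following is a minimal sketch using only the Python standard library; the function names `all_range_workload` and `answer` are illustrative and do not come from the slides. Rows are generated from the longest range to the shortest so that they line up with w1, ..., w10 in the n = 4 example.

```python
def all_range_workload(n):
    """Rows of the AllRange workload over a domain of size n, as 0/1 lists.

    Rows are ordered from the longest range to the shortest, matching
    the order w1..w10 used in the n = 4 example.
    """
    rows = []
    for length in range(n, 0, -1):
        for start in range(n - length + 1):
            rows.append([1 if start <= k < start + length else 0
                         for k in range(n)])
    return rows


def answer(workload, x):
    """True (noise-free) answers W x for a frequency vector x."""
    return [sum(w_k * x_k for w_k, x_k in zip(row, x)) for row in workload]


x = [10, 23, 16, 3]               # grade histogram from the example
W = all_range_workload(len(x))
print(len(W))                     # number of range queries, n(n+1)/2
print(answer(W, x))               # true answers to w1..w10
```

For x = [10, 23, 16, 3] the true answers agree with the table of w1, ..., w10.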
The sensitivity of the workload of all range queries is
(n/2)*(n/2+1) = O(n^2) (for even n).

Approach 1: basic Laplace mechanism

Workload queries: W = {w1, ..., w10}, with ||W||_1 = 6.
Private output: W x + (6/ε) b, where b = (b1, ..., b10) is Laplace
noise.

Example, with x = [10, 23, 16, 3]:

query   true answer   Laplace noise   private output
w1      52             8.2            60.2
w2      49            -5.4            44.6
w3      42            -3.1            38.9
w4      33             6.6            39.6
w5      39            -7.9            31.1
w6      19             2.4            21.4
w7      10            -3.0             7.0
w8      23            -4.9            18.1
w9      16             6.7            22.7
w10     3              4.6             7.6

Two problems:
- high error
- inconsistency (the noisy answers to w7, ..., w10 sum to 55.4, but
  the noisy answer to w1 is 60.2)

                        n = 4                      general n
Sensitivity ||W||_1     6                          O(n^2)
Error per query         2(||W||_1/ε)^2 = 72/ε^2    2(||W||_1/ε)^2 = O(n^4)/ε^2

Error is measured as variance.
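
Approach 1 can be sketched in a few lines of standard-library Python. The helper names (`l1_sensitivity`, `laplace_mechanism`, `error_per_query`) are illustrative, not from the slides. The sketch assumes that the sensitivity of a workload of counting queries is the largest column L1 norm of its matrix (one record changes one entry of x by 1), and it draws Laplace(0, 1) noise as the difference of two Exp(1) draws.

```python
import random


def all_range_workload(n):
    """0/1 rows of the AllRange workload, longest ranges first."""
    return [[1 if s <= k < s + length else 0 for k in range(n)]
            for length in range(n, 0, -1)
            for s in range(n - length + 1)]


def l1_sensitivity(matrix):
    """Largest column L1 norm of a query matrix."""
    return max(sum(abs(row[k]) for row in matrix)
               for k in range(len(matrix[0])))


def laplace_mechanism(workload, x, epsilon, rng):
    """Answer every workload query with independent Laplace noise."""
    scale = l1_sensitivity(workload) / epsilon
    noisy = []
    for row in workload:
        true_answer = sum(w * v for w, v in zip(row, x))
        # Difference of two Exp(1) draws is a Laplace(0, 1) draw.
        noise = rng.expovariate(1.0) - rng.expovariate(1.0)
        noisy.append(true_answer + scale * noise)
    return noisy


def error_per_query(workload, epsilon):
    """Variance of each noisy answer: 2 * (sensitivity / epsilon)^2."""
    return 2 * (l1_sensitivity(workload) / epsilon) ** 2


W = all_range_workload(4)
noisy = laplace_mechanism(W, [10, 23, 16, 3], epsilon=1.0,
                          rng=random.Random(0))
print(l1_sensitivity(W))          # sensitivity of AllRange for n = 4
print(error_per_query(W, 1.0))    # variance per query at epsilon = 1
# Inconsistency: the noisy singletons need not add up to the noisy total.
print(noisy[0], sum(noisy[6:]))
```

Because each query receives its own independent noise, nothing forces the noisy answers to agree with one another, which is the inconsistency problem listed for this approach.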

Approach 2: noisy frequency counts

Use the Laplace mechanism to get a noisy estimate of each xi.

Queries submitted: the identity strategy I = {x1, x2, x3, x4}, with
||I||_1 = 1.
Private output: z = x + (1/ε) b, where b = (b1, ..., b4) is Laplace
noise.

Derived workload answers:

w1  = z1+z2+z3+z4
w2  = z1+z2+z3
w3  = z2+z3+z4
w4  = z1+z2
w5  = z2+z3
w6  = z3+z4
w7  = z1
w8  = z2
w9  = z3
w10 = z4

For w = range(xi,xj): Error(w) = 2(j-i+1)/ε^2. For example, the error
is 8/ε^2 for w1 = range(x1,x4) and 2/ε^2 for each single count
w7, ..., w10.
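
A minimal sketch of Approach 2 in standard-library Python follows. The function names are illustrative and not taken from the slides. Each count gets Laplace noise of scale 1/ε, every range is answered by summing the noisy counts, and the error of a range grows linearly with its length.

```python
import random


def noisy_counts(x, epsilon, rng):
    """z_k = x_k + Laplace noise of scale 1/epsilon (sensitivity is 1)."""
    return [v + (rng.expovariate(1.0) - rng.expovariate(1.0)) / epsilon
            for v in x]


def range_estimate(z, i, j):
    """Estimate of range(x_i, x_j), 1-indexed and inclusive."""
    return sum(z[i - 1:j])


def range_error(i, j, epsilon):
    """Variance of the estimate: (j - i + 1) noises of variance 2/eps^2."""
    return 2 * (j - i + 1) / epsilon ** 2


z = noisy_counts([10, 23, 16, 3], epsilon=1.0, rng=random.Random(0))
print(range_estimate(z, 2, 3))    # noisy estimate of x2 + x3
print(range_error(1, 4, 1.0))     # error of range(x1, x4) at epsilon = 1
print(range_error(2, 2, 1.0))     # error of a single count at epsilon = 1
```

Since every workload answer is derived from the same four noisy counts, the answers are consistent with one another, unlike in Approach 1.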

Approach 3: hierarchical queries [Hay, 2010]

Hierarchical queries: recursively partition the domain, computing
sums of each interval.

Queries submitted: strategy H, with ||H||_1 = 3 = log2(n) + 1.
Private output: z = H x + (3/ε) b, where b = (b1, ..., b7) is Laplace
noise.

z1 = (x1+x2+x3+x4) + (3/ε) b1
z2 = (x1+x2) + (3/ε) b2
z3 = (x3+x4) + (3/ε) b3
z4 = x1 + (3/ε) b4
z5 = x2 + (3/ε) b5
z6 = x3 + (3/ε) b6
z7 = x4 + (3/ε) b7

More than one possible estimate for a range query can be derived
from z.

Possible estimates for query range(x2,x3) = x2 + x3:
- z5 + z6
- z1 - z4 - z7
- z2 - z4 + z6

Least-squares estimate:
(6z1 + 3z2 + 3z3 - 9z4 + 12z5 + 12z6 - 9z7)/21

Idea: only a small number of noisy outputs is needed to
estimate any range query.

Approach 4: wavelet queries [Xiao, 2010]

Wavelet queries: use the Haar wavelet to get a noisy summary of the
data.

Queries submitted: strategy Y, with ||Y||_1 = 3 = log2(n) + 1.
Private output: z = Y x + (3/ε) b, where b = (b1, ..., b4) is Laplace
noise.

z1 = (x1+x2+x3+x4) + (3/ε) b1
z2 = (x1+x2-x3-x4) + (3/ε) b2
z3 = (x1-x2) + (3/ε) b3
z4 = (x3-x4) + (3/ε) b4

Estimate for query range(x2,x3) = x2 + x3:
0.5 z1 + 0 z2 - 0.5 z3 + 0.5 z4

Approaches for workload AllRange

Strategy             Max/Avg error
I   Noisy counts     O(n/ε^2)
H   Hierarchical     O(log^3 n/ε^2)
Y   Wavelet          O(log^3 n/ε^2)

- Noisy counts (I): very low sensitivity, but large ranges are
  estimated badly.
- Hierarchical (H) and wavelet (Y): low sensitivity, and all range
  queries can be estimated using O(log n) output entries.

Error: workload of all range queries

[Figure: error on the workload of all range queries under
ε-differential privacy, with ε = 0.1 and n = 1024; plot labels: small
ranges, big ranges.]
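
The coefficients quoted for range(x2,x3) under the hierarchical and wavelet strategies can be reproduced with exact rational arithmetic. The sketch below uses only the Python standard library and assumes the strategy matrix A has full column rank, so that the least-squares coefficients are c = A (A^T A)^{-1} w. The helper names `solve` and `least_squares_coefficients` are illustrative, not from the slides.

```python
from fractions import Fraction

# Strategy matrices for n = 4, rows in the order z1, z2, ...
H = [[1, 1, 1, 1], [1, 1, 0, 0], [0, 0, 1, 1],
     [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
Y = [[1, 1, 1, 1], [1, 1, -1, -1], [1, -1, 0, 0], [0, 0, 1, -1]]


def l1_sensitivity(matrix):
    """Largest column L1 norm of a query matrix."""
    return max(sum(abs(row[k]) for row in matrix)
               for k in range(len(matrix[0])))


def solve(matrix, rhs):
    """Solve matrix * v = rhs exactly by Gauss-Jordan elimination."""
    n = len(rhs)
    aug = [[Fraction(a) for a in row] + [Fraction(r)]
           for row, r in zip(matrix, rhs)]
    for col in range(n):
        # Move a row with a nonzero entry in this column into place.
        pivot_row = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot_row] = aug[pivot_row], aug[col]
        pivot = aug[col][col]
        aug[col] = [a / pivot for a in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n] for row in aug]


def least_squares_coefficients(strategy, query):
    """Coefficients c such that the estimate of `query` is sum c[k] * z[k]."""
    n = len(strategy[0])
    gram = [[sum(row[i] * row[j] for row in strategy) for j in range(n)]
            for i in range(n)]
    v = solve(gram, query)                      # (A^T A)^{-1} w
    return [sum(a * b for a, b in zip(row, v)) for row in strategy]


query = [0, 1, 1, 0]                            # range(x2, x3) = x2 + x3
print(l1_sensitivity(H), l1_sensitivity(Y))
print(least_squares_coefficients(H, query))
print(least_squares_coefficients(Y, query))
```

Using `Fraction` keeps the result exact, so the hierarchical coefficients can be compared directly with (6, 3, 3, -9, 12, 12, -9)/21 and the wavelet coefficients with (0.5, 0, -0.5, 0.5).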
Visualizing error: identity strategy

[Figure: error of the identity strategy for n = 128, over queries
from range(x1,x1) to range(x1,x128).]

Visualizing error: hierarchical v. wavelet

[Figure: error of the hierarchical strategy and of the wavelet
strategy (branching = 2).]

Data-independent methods

Two key ideas in choosing the alternative query set A:
- low sensitivity (typically much lower than that of the workload
  itself).
- W can be estimated efficiently from A.

Can we do better?
- Are these approaches optimal for all range queries?
- What about other workloads? Arbitrary sets of range queries, data
  cubes, sets of marginals, CDFs, arbitrary sets of predicate counting
  queries, etc.

Batch query answering

- (Design) Choose an alternative query set A.
- (Apply Laplace) Use the Laplace mechanism to answer A.
- (Derivation) Compute each query in W using the answers to A.
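
The three steps can be sketched end to end in standard-library Python. This is an illustration under stated assumptions, not the method of any one cited paper: the strategy A is supplied by the caller (the Design step), it is assumed to have full column rank, and the Derivation step uses ordinary least squares. The names `batch_answer`, `solve`, and `dot` are hypothetical.

```python
import random
from fractions import Fraction


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def solve(matrix, rhs):
    """Solve matrix * v = rhs by Gauss-Jordan elimination on Fractions."""
    n = len(rhs)
    aug = [[Fraction(a) for a in row] + [Fraction(r)]
           for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot_row = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot_row] = aug[pivot_row], aug[col]
        pivot = aug[col][col]
        aug[col] = [a / pivot for a in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n] for row in aug]


def batch_answer(workload, strategy, x, epsilon, rng):
    n = len(x)
    # (Apply Laplace) noise is scaled to the strategy's sensitivity,
    # taken here as the largest column L1 norm.
    sensitivity = max(sum(abs(row[k]) for row in strategy) for k in range(n))
    z = [dot(row, x) + (sensitivity / epsilon)
         * (rng.expovariate(1.0) - rng.expovariate(1.0))
         for row in strategy]
    # (Derivation) least-squares estimate of x from z, then answer W.
    gram = [[sum(row[i] * row[j] for row in strategy) for j in range(n)]
            for i in range(n)]
    rhs = [dot([row[i] for row in strategy], z) for i in range(n)]  # A^T z
    x_hat = [float(v) for v in solve(gram, rhs)]
    return [dot(row, x_hat) for row in workload]


# AllRange workload and hierarchical strategy for n = 4.
W = [[1, 1, 1, 1], [1, 1, 1, 0], [0, 1, 1, 1], [1, 1, 0, 0], [0, 1, 1, 0],
     [0, 0, 1, 1], [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
H = [[1, 1, 1, 1], [1, 1, 0, 0], [0, 0, 1, 1],
     [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
x = [10, 23, 16, 3]
print(batch_answer(W, H, x, epsilon=1.0, rng=random.Random(0)))
```

Because all ten answers are computed from a single estimate of x, they are mutually consistent: the estimate of range(x1,x4) equals the sum of the four single-count estimates, up to floating-point rounding.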

The matrix mechanism [Li, 2010]

Given a workload W and a strategy matrix A, the following randomized
algorithm is ε-differentially private:

    Matrix_A(W, x) = W x + (||A||_1/ε) W A^+ b,    b = Lap(1)

Here W x is the true answer, ||A||_1/ε scales the noise, and W A^+
transforms it (A^+ is the pseudo-inverse of A).

Compare with the Laplace mechanism:

    Laplace(W, x) = W x + (||W||_1/ε) b

Derivation: the Laplace mechanism instantiated with strategy A returns
A x + (||A||_1/ε) b. Assuming A has full column rank, so that
A^+ A = I, the derived noisy answers to the workload W are:

    x^   = A^+ (A x + (||A||_1/ε) b)
    W x^ = W A^+ (A x + (||A||_1/ε) b)
         = W x + (||A||_1/ε) W A^+ b

This is never worse than the Laplace mechanism: even if we have no
idea how to choose A, we could set A = W.
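
The comparison with the Laplace mechanism can be checked numerically on the AllRange workload for n = 4. The sketch below assumes error is measured as total variance over the workload, which for the matrix mechanism is 2 (||A||_1/ε)^2 ||W A^+||_F^2, and that the strategy has full column rank. The function names are illustrative, not from [Li, 2010].

```python
from fractions import Fraction


def solve(matrix, rhs):
    """Solve matrix * v = rhs exactly by Gauss-Jordan elimination."""
    n = len(rhs)
    aug = [[Fraction(a) for a in row] + [Fraction(r)]
           for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot_row = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot_row] = aug[pivot_row], aug[col]
        pivot = aug[col][col]
        aug[col] = [a / pivot for a in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n] for row in aug]


def l1_sensitivity(matrix):
    """Largest column L1 norm of a query matrix."""
    return max(sum(abs(row[k]) for row in matrix)
               for k in range(len(matrix[0])))


def laplace_total_error(workload, epsilon=1):
    """Total variance when each query gets independent Laplace noise."""
    scale = Fraction(l1_sensitivity(workload)) / Fraction(epsilon)
    return 2 * scale ** 2 * len(workload)


def matrix_total_error(workload, strategy, epsilon=1):
    """Total variance of the matrix mechanism with the given strategy."""
    n = len(strategy[0])
    gram = [[sum(r[i] * r[j] for r in strategy) for j in range(n)]
            for i in range(n)]
    scale = Fraction(l1_sensitivity(strategy)) / Fraction(epsilon)
    # ||w A^+||^2 = w (A^T A)^{-1} w^T for each workload row w.
    frob = sum(sum(a * b for a, b in zip(w, solve(gram, w)))
               for w in workload)
    return 2 * scale ** 2 * frob


W = [[1, 1, 1, 1], [1, 1, 1, 0], [0, 1, 1, 1], [1, 1, 0, 0], [0, 1, 1, 0],
     [0, 0, 1, 1], [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
I = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]

print(laplace_total_error(W))       # 720: Laplace mechanism on W directly
print(matrix_total_error(W, W))     # 288: matrix mechanism with A = W
print(matrix_total_error(W, I))     # 40: matrix mechanism with A = I
```

At ε = 1 the non-square workload W already benefits from A = W (288 instead of 720), and the identity strategy does better still on this small domain.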
If W is square and has full rank, the matrix mechanism with A = W is
equivalent to the Laplace mechanism. Otherwise, it is better.

Strategies equivalent to wavelet

Wavelet strategy Y, with ||Y||_1 = 3:

    1   1   1   1
    1   1  -1  -1
    1  -1   0   0
    0   0   1  -1

[Matrix: a second strategy, with sensitivity 2.414.]

Neither the hierarchical nor the wavelet strategy is optimal, i.e.
there exist uniformly better strategies with matching error profiles.

Example: a strategy with sensitivity 3 (left) compared with a second
strategy (right):

    1 0 0 0 ... see rows below

    1 1 0 0            1 1 0 0
    0 0 1 1            0 0 1 1
    1 0 0 0            2 0 0 0
    0 1 0 0            0 2 0 0
    0 0 1 0     >      0 0 2 0
    0 0 0 1            0 0 0 2
    1 0 0 0
    0 1 0 0
    0 0 1 0
    0 0 0 1

Overview of data-independent methods

Method                            Goal workload               Strategy                Adaptive
Fourier [Barak, 2007]             sets of marginals           Fourier basis vectors   YES
Wavelet [Xiao, 2010]              All Range (multi-dim)       Haar wavelet            NO
Hierarchical [Hay, 2010]          All Range (one-dim)         k-order tree            NO
Matrix Mechanism [Li, 2010]       sets of linear queries      set of linear queries   YES
Cuboid [Ding, 2011]               sets of data cubes          set of cuboids          YES
Quad-tree [Cormode, 2012]         Range queries (multi-dim)   quad-tree               NO

Adaptive: the alternative query set is customized to the workload W.

Adaptive: Fourier basis method

- Goal workload: a set of
  low-order marginals over multi-dimensional data.
- Alternative query set: a subset of the Fourier basis vectors.
- Adaptivity: any workload of marginals can be expressed using a
  small number of Fourier basis vectors, reducing sensitivity.
  Naturally adaptive, without an explicit optimization step.

[Barak, 2007]

Adaptive: Cuboid method

- Goal workload: a set of cuboids W selected by the user.
- Alternative query set: a subset A of cuboids.
- Adaptivity: select A to minimize the max error over all cuboids in
  W. NP-hard, so an approximation algorithm based on set cover is
  used to achieve a log(|W|+2)^2 approximation to optimal.

[Ding, 2011]

[Figure: lattice of cuboids over the dimensions Sex, Age, Salary.]

Adaptive: the matrix mechanism

- Goal workload: any set of linear queries W selected by the user.
- Alternative query set: any set of linear queries.
- Adaptivity: select A to minimize the average per-query error for W.
  Solved exactly using semi-definite programming (but not feasible in
  practice). Solved approximately by designing A from scaled
  eigenvectors of W.

[Li, 2012] [Li, 2010]

Summary: data-independent methods

For batch query answering, it is possible to exploit properties of
the workload to significantly improve accuracy over typical
applications of the Laplace mechanism.

By submitting an alternative query set to the Laplace mechanism and
inferring answers:
- Sensitivity is reduced.
- The noise ultimately added to the workload queries is correlated
  (not independent), which can fit the correlation among workload
  queries.

Next: exploiting properties of the input
database: data-dependent methods.

References

[Barak, 2007] B. Barak, K. Chaudhuri, C. Dwork, S. Kale, F. McSherry,
and K. Talwar. Privacy, accuracy, and consistency too: a holistic
solution to contingency table release. In Principles of Database
Systems (PODS), 2007.

[Hay, 2010] M. Hay, V. Rastogi, G. Miklau, and D. Suciu. Boosting the
accuracy of differentially-private queries through consistency.
Proceedings of the VLDB Endowment (PVLDB), 2010.

[Xiao, 2010] X. Xiao, G. Wang, and J. Gehrke. Differential privacy
via wavelet transforms. In International Conference on Data
Engineering (ICDE), 2010.

[Li, 2010] C. Li, M. Hay, V. Rastogi, G. Miklau, and A. McGregor.
Optimizing linear counting queries under differential privacy. In
Principles of Database Systems (PODS), 2010.

[Ding, 2011] B. Ding, M. Winslett, J. Han, and Z. Li. Differentially
private data cubes: optimizing noise sources and consistency. In
SIGMOD, pages 217-228. ACM, 2011.

[Cormode, 2012] G. Cormode, M. Procopiuc, D. Srivastava, E. Shen, and
T. Yu. Differentially private spatial decompositions. In ICDE, 2012.

[Li, 2012] C. Li and G. Miklau. An adaptive mechanism for accurate
query answering under differential privacy. PVLDB, 2012.