Accuracy-Constrained Privacy-Preserving Access Control Mechanism for Relational Data
Zahid Pervaiz, Walid G. Aref, Senior Member, IEEE, Arif Ghafoor, Fellow, IEEE, and Nagabhushana Prabhu
Abstract: Access control mechanisms protect sensitive information from unauthorized users. However, when sensitive information is shared and a Privacy Protection Mechanism (PPM) is not in place, an authorized user can still compromise the privacy of a person, leading to identity disclosure. A PPM can use suppression and generalization of relational data to anonymize and satisfy privacy requirements, e.g., k-anonymity and l-diversity, against identity and attribute disclosure. However, privacy is achieved at the cost of precision of authorized information. In this paper, we propose an accuracy-constrained privacy-preserving access control framework. The access control policies define selection predicates available to roles while the privacy requirement is to satisfy k-anonymity or l-diversity. An additional constraint that needs to be satisfied by the PPM is the imprecision bound for each selection predicate. The techniques for workload-aware anonymization for selection predicates have been discussed in the literature. However, to the best of our knowledge, the problem of satisfying the accuracy constraints for multiple roles has not been studied before. In our formulation of the aforementioned problem, we propose heuristics for anonymization algorithms and show empirically that the proposed approach satisfies imprecision bounds for more permissions and has lower total imprecision than the current state of the art.
Index Terms: Access control, privacy, k-anonymity, query evaluation
1 INTRODUCTION
ORGANIZATIONS collect and analyze consumer data to improve their services. Access Control Mechanisms (ACM) are used to ensure that only authorized information is available to users. However, sensitive information can still be misused by authorized users to compromise the privacy of consumers. The concept of privacy-preservation for sensitive data can require the enforcement of privacy policies or the protection against identity disclosure by satisfying some privacy requirements [1]. In this paper, we investigate privacy-preservation from the anonymity aspect. The sensitive information, even after the removal of identifying attributes, is still susceptible to linking attacks by the authorized users [2]. This problem has been studied extensively in the area of micro data publishing [3] and privacy definitions, e.g., k-anonymity [2], l-diversity [4], and variance diversity [5]. Anonymization algorithms use suppression and generalization of records to satisfy privacy requirements with minimal distortion of micro data. The anonymity techniques can be used with an access control mechanism to ensure both security and privacy of the sensitive information. The privacy is achieved at the cost of accuracy, and imprecision is introduced in the authorized information under an access control policy.
We use the concept of an imprecision bound for each permission to define a threshold on the amount of imprecision that can be tolerated. Existing workload-aware anonymization techniques [5], [6] minimize the aggregate imprecision for all queries, so the imprecision added to each individual permission/query in the anonymized micro data is not known. Making the privacy requirement more stringent (e.g., increasing the value of k or l) results in additional imprecision for queries. However, the problem of satisfying accuracy constraints for individual permissions in a policy/workload has not been studied before. The heuristics proposed in this paper for accuracy-constrained privacy-preserving access control are also relevant in the context of workload-aware anonymization. The anonymization for continuous data publishing has been studied in the literature [3]. In this paper, the focus is on a static relational table that is anonymized only once. To exemplify our approach, role-based access control is assumed. However, the concept of accuracy constraints for permissions can be applied to any privacy-preserving security policy, e.g., discretionary access control.
Example 1 (Motivating Scenario). Syndromic surveillance systems are used at the state and federal levels to detect and monitor threats to public health [7]. The department of health in a state collects the emergency department data (age, gender, location, time of arrival, symptoms, etc.) from county hospitals daily. Generally, each daily update consists of a static instance that is classified into syndrome categories by the department of health. Then, the surveillance data is anonymized and shared with departments of health at each county. An access control policy is given in Fig. 1 that allows the roles to access the tuples under the authorized predicate, e.g., Role CE1 can access tuples under Permission P1. The epidemiologists at the state and county level suggest community containment measures, e.g., isolation or quarantine, according to the number of persons infected in case of a flu outbreak. According to the population density in a county, an epidemiologist can advise isolation if the number of persons reported with influenza is greater than 1,000 and quarantine if that number is greater than 3,000 in a single day. The anonymization adds imprecision to the query results, and the imprecision bound for each query ensures that the results are within the tolerance required. If the imprecision bounds are not satisfied, then unnecessary false alarms are generated due to the high rate of false positives.

Z. Pervaiz and A. Ghafoor are with the School of Electrical and Computer Engineering and Purdue's Center for Education and Research in Information Assurance and Security (CERIAS). E-mail: [email protected].
W.G. Aref is with the Department of Computer Science and Purdue's Center for Education and Research in Information Assurance and Security (CERIAS).
N. Prabhu is with the School of Industrial Engineering, Purdue University, IN 47907.
Manuscript received 1 Oct. 2012; revised 17 Feb. 2013; accepted 11 Apr. 2013; date of publication 1 May 2013; date of current version 18 Mar. 2014.
Recommended for acceptance by E. Ferrari.
For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference the Digital Object Identifier below.
Digital Object Identifier no. 10.1109/TKDE.2013.71
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 26, NO. 4, APRIL 2014
1041-4347 © 2013 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
The contributions of the paper are as follows. First, we formulate the accuracy and privacy constraints as the problem of k-anonymous Partitioning with Imprecision Bounds (k-PIB) and give hardness results. Second, we introduce the concept of accuracy-constrained privacy-preserving access control for relational data. Third, we propose heuristics to approximate the solution of the k-PIB problem and conduct empirical evaluation.
The remainder of this paper proceeds as follows. In Section 2, relevant background is discussed. The problem formulation and access control framework are presented in Section 3. Section 4 covers the proposed top-down heuristics for multi-dimensional partitioning to satisfy imprecision bounds. Experimental results are in Section 5, and in Section 6, an additional step to reduce the number of permissions violating imprecision bounds is proposed. The related work is presented in Section 7 and Section 8 concludes the paper.
2 BACKGROUND
In this section, role-based access control and privacy definitions based on anonymity are reviewed. Query evaluation semantics, imprecision, and the Selection Mondrian algorithm [5] are briefly explained.
Given a relation $T = \{A_1, A_2, \ldots, A_n\}$, where $A_i$ is an attribute, $T^*$ is the anonymized version of the relation $T$. We assume that $T$ is a static relational table. The attributes can be of the following types:
Identifier. Attributes, e.g., name and social security number, that can uniquely identify an individual. These attributes are completely removed from the anonymized relation.
Quasi-identifier (QI). Attributes, e.g., gender, zip code, birth date, that can potentially identify an individual based on other information available to an adversary. QI attributes are generalized to satisfy the anonymity requirements.
Sensitive attribute. Attributes, e.g., disease or salary, that if associated with a unique individual will cause a privacy breach.
2.1 Access Control for Relational Data
Fine-grained access control for relational data allows tuple-level permissions to be defined, e.g., Oracle VPD [8] and SQL [9]. For evaluating user queries, most approaches assume a Truman model [10]. In this model, a user query is modified by the access control mechanism and only the authorized tuples are returned. Column-level access control allows queries to execute on the authorized columns of the relational data only [8], [11]. Cell-level access control for relational data is implemented by replacing the unauthorized cell values by NULL values [12].
Role-based Access Control (RBAC) allows defining permissions on objects based on roles in an organization. An RBAC policy configuration is composed of a set of Users (U), a set of Roles (R), and a set of Permissions (P). For the relational RBAC model, we assume that the selection predicates on the QI attributes define a permission [11]. UA is a user-to-role ($U \times R$) assignment relation and PA is a role-to-permission ($R \times P$) assignment relation. A role hierarchy (RH) defines an inheritance relationship among roles and is a partial order on roles ($R \times R$) [13]. Each permission defines a hyper-rectangle in the tuple space, and all the tuples enclosed by this hyper-rectangle are authorized to the role assigned to the permission. In practice, when a user assigned to a role executes a query, the tuples satisfying the conjunction of the query predicate and the permission are returned [1], [10].
2.2 Anonymity Definitions
In this section, privacy definitions related to anonymity are introduced.
Definition 1 (Equivalence Class (EC)). An equivalence class is a set of tuples having the same QI attribute values.
Definition 2 (k-anonymity Property). A table $T^*$ satisfies the k-anonymity property if each equivalence class has k or more tuples [2].
k-anonymity is prone to homogeneity attacks when the sensitive value for all the tuples in an equivalence class is the same. To counter this shortcoming, l-diversity has been proposed [4]; it requires that each equivalence class of $T^*$ contain at least l distinct values of the sensitive attribute. For sensitive numeric attributes, an l-diverse equivalence class can still leak information if the numeric values are close to each other. For such cases, variance diversity [5] has been proposed, which requires the variance of each equivalence class to be greater than a given variance diversity parameter.
Fig. 1. Access control policy.
The table in Fig. 2a does not satisfy k-anonymity because knowing the age and zip code of a person allows associating a disease to that person. The table in Fig. 2b is a 2-anonymous and 2-diverse version of the table in Fig. 2a. The ID attribute is removed in the anonymized table and is shown only for identification of tuples. Here, for any combination of selection predicates on the zip code and age attributes, there are at least two tuples in each equivalence class. In Section 4, algorithms are presented for k-anonymity only. However, the experiments are performed for both l-diversity and variance diversity using the proposed heuristics for partitioning.
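Definitions 1 and 2 and the l-diversity requirement translate directly into checks over a generalized table. The following is a minimal sketch; the column names and the sample rows are invented for illustration and are not the data of Fig. 2.

```python
from collections import defaultdict


def equivalence_classes(rows, qi_columns):
    """Group rows that share the same (generalized) QI attribute values."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[c] for c in qi_columns)].append(row)
    return list(groups.values())


def is_k_anonymous(rows, qi_columns, k):
    """Every equivalence class must hold k or more tuples."""
    return all(len(ec) >= k for ec in equivalence_classes(rows, qi_columns))


def is_l_diverse(rows, qi_columns, sensitive, l):
    """Every equivalence class must hold at least l distinct sensitive values."""
    return all(len({r[sensitive] for r in ec}) >= l
               for ec in equivalence_classes(rows, qi_columns))


# Made-up anonymized rows: age and zip are already generalized to ranges.
anonymized = [
    {"age": "0-20", "zip": "10-15", "disease": "Flu"},
    {"age": "0-20", "zip": "10-15", "disease": "Fever"},
    {"age": "21-40", "zip": "10-30", "disease": "Flu"},
    {"age": "21-40", "zip": "10-30", "disease": "Flu"},
]
print(is_k_anonymous(anonymized, ["age", "zip"], 2))           # 2-anonymous
print(is_l_diverse(anonymized, ["age", "zip"], "disease", 2))  # second class is homogeneous
```

The second equivalence class illustrates the homogeneity attack: it is 2-anonymous, yet every tuple in it carries the same sensitive value.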
2.3 Predicate Evaluation and Imprecision
In this section, the query predicate evaluation semantics are discussed. For query predicate evaluation over a table, say $T$, a tuple is included in the result if all the attribute values satisfy the query predicate. Here, we only consider conjunctive queries (disjunctive queries can be expressed as a union of conjunctive queries), where each query can be expressed as a d-dimensional hyper-rectangle. The semantics for query evaluation on an anonymized table $T^*$ need to be defined. When the equivalence class partition (each equivalence class can be represented as a d-dimensional hyper-rectangle) is fully enclosed inside the query region, all tuples in the equivalence class are part of the query result. Uncertainty in query evaluation arises when a partition overlaps the query region but is not fully enclosed. In this case, there can be many possible semantics. We discuss the following three choices:
1. Uniform. Assuming the uniform distribution of tuples in the overlapping partitions, include tuples from all partitions according to the ratio of overlap between the query and the partition. Query evaluation under this option might under-count or over-count the query result depending upon the original distribution of tuples in the partition region. Most of the literature uses this uniform distribution semantics to compare anonymity techniques over selection tasks [6], [14]. However, the choice of the sensitive attribute value for the selected tuples from an overlapping partition is not defined under uniform semantics. For access control, a tuple's QI attribute values along with the sensitive attribute value need to be returned.
2. Overlap. Include all tuples in all partitions that overlap the query region. This option will add false positives to the original query result.
3. Enclosed. Discard all tuples in all partitions that partially overlap the query region. This option will have false negatives with respect to the original query result.
The imprecision under any query evaluation scheme is reduced if the number of tuples in the partitions that overlap the query region can be minimized. For the remainder of this paper, we assume Overlap semantics. The imprecision quality metric definition using Overlap semantics is as follows [5]:
Definition 3 (Query Imprecision). Query Imprecision is defined as the difference between the number of tuples returned by a query evaluated on an anonymized relation $T^*$ and the number of tuples for the same query on the original relation $T$. The imprecision for query $Q_i$ is denoted by $imp_{Q_i}$,

$$imp_{Q_i} = |Q_i(T^*)| - |Q_i(T)|, \quad \text{where} \quad |Q_i(T^*)| = \sum_{EC \text{ overlaps } Q_i} |EC|. \tag{1}$$

The query $Q_i$ is evaluated over $T^*$ by including all the tuples in the equivalence classes that overlap the query region.
Example 2. Consider a range Query Q1(0-25, 5-20) for the table given in Fig. 2. $|Q_1(T)| = 2$ as tuples 1 and 4 in Fig. 2a satisfy the query. $|Q_1(T^*)| = 5$ as the first two equivalence classes given in Fig. 2b overlap the query range. Then, the query imprecision for Q1 is 3 according to Equation (1).
2.4 Top Down Selection Mondrian
The Top Down Selection Mondrian (TDSM) algorithm is proposed by LeFevre et al. [5], [14] for a given query workload. This is the current state of the art for query-workload-based anonymization. The objective of TDSM is to minimize the total imprecision for all queries, while the imprecision bounds for queries are not considered. To the best of our knowledge, anonymization for a given query workload with imprecision bounds has not been investigated before. We compare our results with TDSM in the experiments section. The algorithm presented in [14] is similar to the kd-tree construction [15]. TDSM starts with the whole tuple space as one partition, and then partitions are recursively divided as long as the new partitions meet the privacy requirement. To divide a partition, two decisions need to be made: i) choosing a split value along each dimension, and ii) choosing a dimension along which to split. In the TDSM algorithm [5], the split value is chosen along the median and then the dimension is selected along which the sum of imprecision for all queries is minimum. The time complexity of TDSM has not been reported in [5] and is $O(d|Q|n \lg n)$, where $d$ is the number of dimensions of a tuple, $Q$ is the set of queries, and $n$ is the total number of tuples. The expression is derived by multiplying the height of the kd-tree with the work done at each level. The median cut generates a balanced tree with height $\lg n$, and the work done at each level is $d|Q|n$. The partitions created by TDSM have dimensions along the median of the parent partition. A compaction procedure has been proposed in [6] where the created partitions are replaced by minimum bounding boxes. This step improves the precision of the anonymized table for any given query workload by reducing the overlapping partitions. In Section 5, compaction is carried out for all the algorithms and then the results are compared.
Fig. 2. Generalization for k-anonymity and l-diversity.
3 ANONYMIZATION WITH IMPRECISION BOUNDS
In this section, we formulate the problem of k-anonymous Partitioning with Imprecision Bounds and present an accuracy-constrained privacy-preserving access control framework.
3.1 Definitions
Let $t_i$ be a tuple in Table $T$ with $d$ QI attributes. Tuple $t_i$ can be expressed as a d-dimensional vector $\{v_1^{t_i}, \ldots, v_d^{t_i}\}$, where $v_i$ is the value of the $i$th attribute. Let $D_{QI_i}$ be the domain of quasi-identifier attribute $QI_i$; then $t_i \in D_{QI_1} \times \cdots \times D_{QI_d}$. Any d-dimensional Partition $P_i$ of the QI attribute domain space can be defined as a d-dimensional vector of closed intervals $\{I_1^{P_i}, \ldots, I_d^{P_i}\}$. The closed Interval $I_j^{P_i}$ is further defined as $[a_j^{P_i}, b_j^{P_i}]$, where $a_j^{P_i}$ is the start of the interval and $b_j^{P_i}$ is the end of the interval, and the length of the interval $l_j^{P_i}$ is $b_j^{P_i} - a_j^{P_i}$. A multidimensional global recoding function, e.g., Mondrian [14], first divides the d-dimensional QI attribute domain space into non-overlapping partitions $P_i \in P$, where each $P_i$ is a d-dimensional rectangle. In the second step, the d-dimensional vector $\{v_1, \ldots, v_d\}$ for each tuple is replaced by the intervals $\{I_1^{P_i}, \ldots, I_d^{P_i}\}$ of the partition to which the tuple belongs. A Tuple, say $t_j$, belongs to a Partition, say $P_l$, if $\forall v_i^{t_j},\ v_i^{t_j} \in I_i^{P_l} : a_i^{P_l} \le v_i^{t_j} \le b_i^{P_l}$.
Consider a set of queries $Q$, where $Q_i \in Q$ is defined by a Boolean function of predicates on quasi-identifier attributes $\{QI_1, \ldots, QI_d\}$. A query defines a space in the domain of quasi-identifier attributes $D_{QI_1} \times \cdots \times D_{QI_d}$ and can be represented by a d-dimensional rectangle or a set of non-overlapping d-dimensional rectangles. To simplify the notation, we assume that Query $Q_i$ is a single d-dimensional rectangle represented by $\{I_1^{Q_i}, \ldots, I_d^{Q_i}\}$. A Tuple $t_j$ belongs to Query $Q_i$ if $\forall v_i^{t_j},\ v_i^{t_j} \in I_i^{Q_i} : a_i^{Q_i} \le v_i^{t_j} \le b_i^{Q_i}$. Query $Q_j$ and Partition $P_l$ overlap if $\forall I_i^{Q_j}\ \forall I_i^{P_l},\ a_i^{Q_j} \in I_i^{P_l}$ or $a_i^{P_l} \in I_i^{Q_j}$.
Definition 4 (Query Imprecision Bound). The query imprecision bound, denoted by $B_{Q_i}$, is the total imprecision acceptable for a query predicate $Q_i$ and is preset by the access control administrator.
Example 3. Assume two range queries as given in Fig. 3. The queries are the shaded rectangles with solid lines, while the partitions are the regions enclosed by rectangles with dashed lines. The imprecision bounds for Queries Q1 and Q2 are preset to 2 and 0. The partitioning given in Fig. 2b does not satisfy the imprecision bounds. However, the partitioning given in Fig. 3 satisfies the bounds for Queries Q1 and Q2, as the imprecision for Q1 and Q2 is 2 and 0, respectively.
Definition 5 (Query Imprecision Slack). The query imprecision slack, denoted by $s_{Q_i}$ for a Query, say $Q_i$, is defined as the difference between the query imprecision bound and the actual query imprecision.

$$s_{Q_i} = \begin{cases} B_{Q_i} - imp_{Q_i}, & \text{if } imp_{Q_i} \le B_{Q_i}; \\ 0, & \text{otherwise.} \end{cases} \tag{2}$$
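Definition 5 amounts to a one-line function. The sketch below uses a hypothetical function name and checks it against the bounds and imprecision values quoted in Example 3.

```python
def imprecision_slack(bound, imprecision):
    """Unused part of the bound; zero once the bound is violated."""
    return bound - imprecision if imprecision <= bound else 0


# Example 3: bounds 2 and 0 are met exactly, so no slack is left.
print(imprecision_slack(2, 2), imprecision_slack(0, 0))
# Made-up values: a bound of 10 with imprecision 4 leaves slack 6;
# a violated bound (imprecision 7 against bound 3) has slack 0.
print(imprecision_slack(10, 4), imprecision_slack(3, 7))
```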
Definition 6 (Partition Imprecision Cost (PIC)). The partition imprecision cost is a vector $\{ic_{P_i}^{Q_1}, \ldots, ic_{P_i}^{Q_n}\}$, where $ic_{P_i}^{Q_j}$ is the imprecision cost of a Partition $P_i \in P$ with respect to a Query $Q_j$. This cost is the number of tuples that are present in the partition but not in the query, i.e.,

$$ic_{P_i}^{Q_j} = |P_i - Q_j|, \tag{3}$$

where the minus sign denotes the set difference. The imprecision for a query $imp_{Q_j}$, defined in Equation (1), can also be expressed in terms of $ic_{P_i}^{Q_j}$ as

$$imp_{Q_j} = \sum_{P_i \in P} ic_{P_i}^{Q_j}.$$
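A small sketch of Definition 6 follows. One assumption is made explicit here: for the sum to agree with Equation (1), only partitions that overlap the query contribute, so `query_imprecision` skips non-overlapping partitions even though `imprecision_cost` itself is the plain set difference. The data and function names are hypothetical.

```python
def contains(rect, point):
    return all(lo <= v <= hi for (lo, hi), v in zip(rect, point))


def bounding_box(tuples):
    dims = range(len(tuples[0]))
    return tuple((min(t[i] for t in tuples), max(t[i] for t in tuples)) for i in dims)


def overlaps(a, b):
    return all(alo <= bhi and blo <= ahi for (alo, ahi), (blo, bhi) in zip(a, b))


def imprecision_cost(partition, query):
    """|P_i - Q_j|: tuples present in the partition but not in the query."""
    return sum(1 for t in partition if not contains(query, t))


def pic_vector(partition, queries):
    """Partition imprecision cost with respect to every query."""
    return [imprecision_cost(partition, q) for q in queries]


def query_imprecision(partitions, query):
    """Sum of costs over the partitions that overlap the query (assumption, see above)."""
    return sum(imprecision_cost(p, query) for p in partitions
               if overlaps(bounding_box(p), query))


# Made-up partitions (lists of original tuples) and two queries.
PARTITIONS = [[(5, 10), (15, 25)], [(20, 15), (28, 12), (35, 18)], [(40, 30), (50, 40)]]
Q1 = ((0, 25), (5, 20))
Q2 = ((36, 60), (25, 45))
print(pic_vector(PARTITIONS[1], [Q1, Q2]))
print(query_imprecision(PARTITIONS, Q1), query_imprecision(PARTITIONS, Q2))
```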
The TDSM algorithm uses the median value along a dimension to split a partition. In the heuristics proposed in Section 4, query intervals are used to split the partitions; these splits are defined as query cuts.
Definition 7 (Query Cut). A query cut is defined as the splitting of a partition along the query interval values. For a query cut using Query $Q_i$, both the start of the query interval ($a_j^{Q_i}$) and the end of the query interval ($b_j^{Q_i}$) are considered to split a partition along the $j$th dimension.
Example 4. A comparison of median cut and query cut is given in Fig. 4 for 3-anonymity. The rectangle with solid lines represents Query Q1, while the rectangles with dotted lines represent partitions. In Fig. 4a, the tuples are partitioned according to the median cut, and even after dividing the tuple space into four partitions there is no reduction in imprecision for the Query Q1. However, for query cuts in Fig. 4b, the imprecision is reduced to zero as partitions are either non-overlapping or fully enclosed inside the query region.
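The contrast in Example 4 can be reproduced on a toy instance. The sketch below is not the data of Fig. 4: nine made-up tuples lie on a line, the query covers the middle three, and k = 3. Partitions are compacted to bounding boxes before the overlap test, and the helper names are hypothetical.

```python
import statistics


def contains(rect, point):
    return all(lo <= v <= hi for (lo, hi), v in zip(rect, point))


def mbr(tuples):
    dims = range(len(tuples[0]))
    return tuple((min(t[i] for t in tuples), max(t[i] for t in tuples)) for i in dims)


def overlaps(a, b):
    return all(alo <= bhi and blo <= ahi for (alo, ahi), (blo, bhi) in zip(a, b))


def split_at(tuples, dim, value, keep_value_left):
    """Split along `dim`; tuples equal to `value` go left only if keep_value_left."""
    def goes_left(v):
        return v <= value if keep_value_left else v < value
    left = [t for t in tuples if goes_left(t[dim])]
    right = [t for t in tuples if not goes_left(t[dim])]
    return left, right


def imprecision(partitions, query):
    """Tuples outside the query that sit in partitions overlapping it."""
    return sum(sum(1 for t in p if not contains(query, t))
               for p in partitions if overlaps(mbr(p), query))


TUPLES = [(x, 5) for x in range(1, 10)]
QUERY = ((4, 6), (0, 10))

# Median cut along dimension 0: both halves still overlap the query.
median = statistics.median(t[0] for t in TUPLES)
median_parts = list(split_at(TUPLES, 0, median, keep_value_left=False))

# Query cuts: first at the interval start (4), then at the interval end (6).
left, rest = split_at(TUPLES, 0, QUERY[0][0], keep_value_left=False)
middle, right = split_at(rest, 0, QUERY[0][1], keep_value_left=True)
query_parts = [left, middle, right]

print(imprecision([TUPLES], QUERY), imprecision(median_parts, QUERY),
      imprecision(query_parts, QUERY))
```

The median cut leaves the imprecision unchanged, whereas the two query cuts bring it to zero while every resulting partition still holds three tuples.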
Fig. 3. Anonymization satisfying imprecision bounds.
3.2 The k-PIB Problem
The optimal k-anonymity problem has been shown to be NP-complete for suppression [16] and generalization [17]. The hardness result for k-PIB follows the construction of LeFevre et al. [14] that shows the hardness of k-anonymous multi-dimensional partitioning with the smallest average equivalence class size. We show that finding a k-anonymous partitioning that violates imprecision bounds for the minimum number of queries is also NP-hard. A multiset of tuples is transformed into an equivalent set of distinct (tuple, count) pairs. The cardinality of Query $Q_i$ is the sum of count values of tuples falling inside the query hyper-rectangle. The constant $q_v$ defines an upper bound for the number of queries that can violate the bounds. The decision version of the k-PIB problem is as follows:
Definition 8 (Decisional k-anonymity with Imprecision Bounds). Given a set $t \in T$ of unique (tuple, count) pairs with tuples in the d-dimensional space and a set of queries $Q_i \in Q$ with imprecision bounds $B_{Q_i}$, does there exist a multidimensional partitioning for $T$ such that the size of every multidimensional partition $R_i$ is greater than or equal to $k$ and the number of queries violating imprecision bounds is less than the positive constant $q_v$?
Theorem 3.1. Decisional k-anonymity with Imprecision Bounds is NP-complete.
Proof. Refer to Appendix, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TKDE.2013.71. □
3.3 Accuracy-Constrained Privacy-Preserving Access Control
An accuracy-constrained privacy-preserving access control mechanism, illustrated in Fig. 5 (arrows represent the direction of information flow), is proposed. The privacy protection mechanism ensures that the privacy and accuracy goals are met before the sensitive data is available to the access control mechanism. The permissions in the access control policy are based on selection predicates on the QI attributes. The policy administrator defines the permissions along with the imprecision bound for each permission/query, user-to-role assignments, and role-to-permission assignments [18]. The specification of the imprecision bound ensures that the authorized data has the desired level of accuracy. The imprecision bound information is not shared with the users because knowing the imprecision bound can result in violating the privacy requirement. The privacy protection mechanism is required to meet the privacy requirement along with the imprecision bound for each permission.
3.3.1 Access Control Enforcement
The exact tuple values in a relation are replaced by the generalized values after the anonymization. In this case, access control enforcement over the generalized data needs to be defined. In this section, we discuss the Relaxed and Strict access control enforcement mechanisms over anonymized data. The access control enforcement by the reference monitor can be of the following two types:
1. Relaxed. Use overlap semantics to allow access to all partitions that overlap the permission.
2. Strict. Use enclosed semantics to allow access to only those partitions that are fully enclosed by the permission.
Both schemes have their own pros and cons. Relaxed enforcement violates the authorization predicate by giving access to extra tuples, but it is beneficial for applications where the low cost of a false alarm is tolerable compared to the risk associated with a missed event. Examples include epidemic surveillance and airport security. On the other hand, strict enforcement is suitable for applications where a high risk is associated with a false alarm compared to the cost of a missed event. An example is a false arrest in a case of shoplifting. In this paper, the focus is on relaxed enforcement. However, the proposed methods for anonymization are also valid for strict enforcement because the proposed heuristics reduce the overlap between partitions and queries. We further assume that under relaxed enforcement, if the imprecision bound is violated for a permission, then that permission is not assigned to any role.
3.3.2 Probabilistic Analysis for Access Control Enforcement
In this section, the relaxed enforcement of access control is analyzed probabilistically. The access control policy administrator sets the imprecision bound $B_{Q_i}$ for each query and requires that the PPM violate the imprecision bound for the least number of queries. The policy administrator might revise the imprecision bounds for queries and further relax the access control policy if it is known with a high probability that a large number of queries will violate the bounds and access requests for roles will be denied. From this perspective, we are interested in answering the following two questions:
1. What is the average imprecision for a given query?
2. Given a set of queries with imprecision bounds, how many queries are expected to violate the bounds?
Given $n$ tuples, it is assumed that the tuples are uniformly distributed in the domain space of the QI attributes. In order to estimate the expected imprecision for a randomly selected query, first the expected number of partitions overlapping the query needs to be found. We use the approach by Otoo et al. [19], where they find overlapping intervals in each dimension and then compute the product to get the expected number of overlapping partitions. However, we still need to find the expected partition size $|P_e|$ and the expected length of intervals $l_i^{P_e}$. We start with the domain length of each attribute in the domain space and divide the length of the first QI attribute by 2. The length of interval $l_1^{P_e}$ is updated and the new partition now contains $n/2$ tuples. For the next division, another QI attribute is selected, and the process is repeated until the expected partition size satisfies $k \le |P_e| < 2k$.
Lemma 3.2. Let $I_{Q_j}$ be a non-negative random variable that denotes the query imprecision. Then, the expected imprecision for a query $Q_j$ is

$$E(I_{Q_j}) = \prod_{i=1}^{d} \left\lfloor \frac{l_i^{Q_j} + l_i^{P_e}}{l_i^{P_e}} \right\rfloor \times |P_e| - |Q_j|. \tag{4}$$

In this equation, we round up the fraction ($l_i^{Q_j}$ divided by $l_i^{P_e}$) and then take the floor in each dimension. Multiplying the number of partitions by the expected size of each partition gives the expected number of tuples in the query $|Q_j(T^*)|$. Subtracting the original size $|Q_j|$ of the query gives the expected imprecision.
Fig. 4. Comparison of median and query cut.
Fig. 5. Accuracy-constrained privacy-preserving access control mechanism.
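The estimate of Lemma 3.2 can be sketched in two steps: derive the expected partition by repeated halving, then apply Equation (4). The function names are hypothetical, and cycling through the QI attributes in round-robin order is an assumption, since the text only states that another attribute is selected for each division. The second call uses the figures of Example 5 (query lengths 11 and 5, partition lengths 3 and 2, partition size 6, query size 50).

```python
import math


def expected_partition(n, k, domain_lengths):
    """Halve the tuple count and one interval length per step until k <= size < 2k."""
    size = n
    lengths = list(domain_lengths)
    dim = 0
    while size / 2 >= k:
        size /= 2
        lengths[dim] /= 2
        dim = (dim + 1) % len(lengths)  # assumption: round-robin attribute choice
    return size, lengths


def expected_imprecision(query_lengths, partition_lengths, partition_size, query_size):
    """Equation (4): expected overlapping partitions times |P_e|, minus |Q_j|."""
    n_partitions = 1
    for lq, lp in zip(query_lengths, partition_lengths):
        n_partitions *= math.floor((lq + lp) / lp)
    return n_partitions * partition_size - query_size


print(expected_partition(1000, 10, [100, 100]))
print(expected_imprecision([11, 5], [3, 2], 6, 50))
```

With the Example 5 figures, 4 × 3 = 12 partitions are expected to overlap the query, which yields an expected imprecision of 22 tuples.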
Example 5. Consider a query with range 10-21 and 5-10 for two attributes and a query size of 50. If the expected partition length for the two attributes is 3 and 2 and the expected partition size is 6, then 12 partitions are expected to overlap the query. The expected query imprecision will be 22 ($12 \times 6 - 50$) tuples.
Given an imprecision bound $B_{Q_i}$ for a Query $Q_i$, for the second question, we are interested in finding the expected number of queries that will violate the bounds. Let $X_1, \ldots, X_n$ be a set of independent random variables such that $\Pr(X_i = 1) = p_i$ and $\Pr(X_i = 0) = 1 - p_i$, where $0 \le p_i \le 1$. $X_i$ is a random variable that is equal to 1 if the Query $Q_i$ violates the imprecision bound $B_{Q_i}$ and is equal to 0 otherwise. The total number of queries violating their imprecision bounds is $X = \sum_{i=1}^{n} X_i$. $X_1, \ldots, X_n$ are called Poisson trials, and $X$ follows a Poisson binomial distribution. The expected number of queries violating their imprecision bounds is $E(X) = \mu = \sum_{i=1}^{n} p_i$ [20]. Dependencies exist among the queries, but for our analysis we assume that queries are independent.
Theorem 3.3. Let $I_{Q_i}$ be a non-negative random variable that denotes the query imprecision. Let $X_1, \ldots, X_n$ be independent Poisson trials, where $X_i$ is a random variable that is equal to 1 if a query, say $Q_i$, violates the imprecision bound $B_{Q_i}$ and is equal to 0 otherwise. For $X = \sum_{i=1}^{n} X_i$ and $B_{Q_i} > 0$, we have

$$E(X) = \sum_{i=1}^{n} p_i \le \sum_{i=1}^{n} \frac{E(I_{Q_i})}{B_{Q_i} + 1}. \tag{5}$$

Proof. Refer to Appendix, available in the online supplemental material. □
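The bound of Theorem 3.3 is straightforward to evaluate once the expected imprecision of each query is known. The sketch below assumes the inequality reads $E(X) \le \sum_i E(I_{Q_i}) / (B_{Q_i} + 1)$; the function names and input values are made up for illustration.

```python
def expected_violations(probabilities):
    """Mean of a Poisson binomial variable: the sum of the per-query probabilities."""
    return sum(probabilities)


def violation_upper_bound(expected_imprecisions, bounds):
    """Right-hand side of Equation (5), one term E(I_Q) / (B_Q + 1) per query."""
    return sum(e / (b + 1) for e, b in zip(expected_imprecisions, bounds))


# Two hypothetical queries: expected imprecision 22 and 5, bounds 43 and 9.
print(violation_upper_bound([22, 5], [43, 9]))
# If the true violation probabilities were 0.25 and 0.5, the mean stays below the bound.
print(expected_violations([0.25, 0.5]))
```

A policy administrator could use such an estimate to decide whether the chosen bounds are likely to leave too many permissions unassigned before running the anonymization.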
4 HEURISTICS FOR PARTITIONING
In this section, three algorithms based on greedy heuristics are proposed. All three algorithms are based on kd-tree construction [15]. Starting with the whole tuple space, the nodes in the kd-tree are recursively divided till the partition size is between $k$ and $2k$. The leaf nodes of the kd-tree are the output partitions that are mapped to equivalence classes in the given table. Heuristics 1 and 2 have a time complexity of $O(d|Q|^2 n^2)$. Heuristic 3 is a modification of Heuristic 2 that has $O(d|Q|n \lg n)$ complexity, which is the same as that of TDSM. The proposed query cut can also be used to split partitions using bottom-up (R-tree) techniques [6].
4.1 Top-Down Heuristic 1 (TDH1)
In TDSM, the partitions are split along the median. Consider a partition that overlaps a query. If the median also falls inside the query, then even after splitting the partition, the imprecision for that query will not change, as both the new partitions still overlap the query, as illustrated in Fig. 4. In this heuristic, we propose to split the partition along the query cut and then choose the dimension along which the imprecision is minimum for all queries. If multiple queries overlap a partition, then the query to be used for the cut needs to be selected. The queries having imprecision greater than zero for the partition are sorted based on the imprecision bound, and the query with the minimum imprecision bound is selected. The intuition behind this decision is that the queries with smaller bounds have lower tolerance for error, and such a partition split ensures the decrease in imprecision for the query with the smallest imprecision bound. If no feasible cut satisfying the privacy requirement is found, then the next query in the sorted list is used to check for a partition split. If none of the queries allow a partition split, then that partition is split along the median and the resulting partitions are added to the output after compaction.
The TDH1 algorithm is listed in Algorithm 1. In the first line, the whole tuple space is added to the set of candidate partitions. In Lines 3-4, the query overlapping the candidate partition with the least imprecision bound and imprecision greater than zero is selected. The while loop in Lines 5-8 checks for a feasible split of the partition along query intervals. If a feasible cut is found, then the resulting partitions are added to $CP$. Otherwise, the candidate partition is checked for a median cut in Line 12. A feasible cut means that each partition resulting from the split should satisfy the privacy requirement. The traversal of the kd-tree for partitions to consider in Set $CP$ can be depth-first or breadth-first. However, the order of traversal for TDH1 does not matter.
This heuristic of selecting cuts along minimum bound queries
favors queries with smaller bounds. This behavior is also evident
in the experiments in Section 5 for the randomly selected query
workload. However, this approach creates imprecision slack in the
queries with smaller bounds that could have been used to satisfy
bounds of other queries.
Lemma 4.1. The time complexity of TDH1 is $O(d|Q|^2 n^2)$.
Proof. The time complexity is derived by multiplying the
height of the kd-tree with the work performed at each level. The
height of the kd-tree for TDH1 in the worst case can be $n/k$, which
occurs when each successive cut creates one partition of exactly
size $k$. In the worst case, at each level we might have to check all
queries for a feasible cut, which leads to $d|Q|^2 n$. The total time
complexity is then $O(d|Q|^2 n^2)$. □
4.2 Top-Down Heuristic 2 (TDH2)
In the Top-Down Heuristic 2 algorithm (TDH2, for short), the
query bounds are updated as the partitions are added to the output.
This update is carried out by subtracting the $ic^{Q_j}_{P_i}$
value from the imprecision bound $B_{Q_j}$ of each query, for a
Partition, say $P_i$, that is being added to the output. For example,
if a partition of size k has imprecision 5 and 10 for Queries Q1 and
Q2 with imprecision bound 100 and 200, then the bounds are changed
to 95 and 190, respectively. The best results are achieved if the
kd-tree traversal is depth-first (preorder). Preorder traversal for
the kd-tree ensures that a given partition is recursively split till
the leaf node is reached. Then, the query bounds are updated.
Initially, this approach favors queries with smaller bounds. As
more partitions are added to the output, all the queries are
treated fairly. During the query bound update, if the
imprecision bound for any query gets violated, then that query is
put on low priority by replacing the query bound by the query
size. The intuition behind this decision is that whatever future
partition splits TDH2 makes, the query bound for this query cannot
be satisfied. Hence, the focus should be on the remaining
queries.
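The bound update can be illustrated with a short Python sketch that reproduces the example above. The function name `update_query_bounds` and the `query_sizes` argument are assumptions made for illustration, and a bound is treated as violated when the remaining value drops below zero, which is one possible reading of the text.

```python
from typing import Dict


def update_query_bounds(
    bounds: Dict[str, int],         # query id -> remaining imprecision bound
    contributions: Dict[str, int],  # query id -> imprecision added by the new leaf
    query_sizes: Dict[str, int],    # query id -> query size (assumed available)
) -> Dict[str, int]:
    """Return the bounds after one partition is added to the output."""
    updated = dict(bounds)
    for q, ic in contributions.items():
        remaining = updated[q] - ic
        # Assumption: a negative remainder means the bound is violated, so the
        # query is put on low priority by replacing its bound with its size.
        updated[q] = query_sizes[q] if remaining < 0 else remaining
    return updated


new_bounds = update_query_bounds(
    {"Q1": 100, "Q2": 200},
    {"Q1": 5, "Q2": 10},
    {"Q1": 1000, "Q2": 2000},
)
print(new_bounds)  # {'Q1': 95, 'Q2': 190}
```

Because violated queries receive a large bound, they sort last when the next partition picks the query with the smallest bound, which matches the low-priority behavior described above.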
The algorithm for TDH2 is listed in Algorithm 2. There are two
differences compared to TDH1. First, the kd-tree traversal for the
for loop in Lines 2-14 is preorder. Second, in Line 14, the query
bounds are updated as the partitions are being added to the output
(P). The time complexity of TDH2 is $O(d|Q|^2 n^2)$, which is the same
as that of TDH1. In Section 4.3, we propose changes to TDH2 that
reduce the time complexity at the cost of increased query
imprecision.
4.3 Top-Down Heuristic 3 (TDH3)
The time complexity of the TDH2 algorithm is $O(d|Q|^2 n^2)$, which
does not scale to large data sets (greater than 10 million tuples).
In the Top-Down Heuristic 3 algorithm (TDH3, for short), we modify
TDH2 so that a time complexity of $O(d|Q| n \lg n)$ can be achieved at the
cost of reduced precision in the query results. Given a partition,
TDH3 checks the query cuts only for the query having the lowest
imprecision bound. A second constraint is that a query cut is
feasible only when the size ratio of the resulting
partitions is not highly skewed. We use a skew ratio of 1:99
for TDH3 as a threshold. If a query cut results in one partition
having a size greater than a hundred times the other, then that cut is
ignored. The TDH3 algorithm is listed in Algorithm 3. In Line 4 of
Algorithm 3, we use only one query for the candidate cut. In Line 6,
the partition size ratio condition needs to be satisfied for a
feasible cut. If a feasible
query cut is not found, then the partition is split along the
median as in Line 11.
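The added feasibility condition of TDH3 can be sketched as a small predicate. This is an illustration under stated assumptions: the privacy requirement is taken to be k-anonymity (each side holds at least k tuples), the function name is hypothetical, and `max_ratio` defaults to 100 following the "hundred times" wording, so it should be adjusted if the 1:99 ratio is read strictly.

```python
def is_feasible_query_cut(
    left_size: int,
    right_size: int,
    k: int,
    max_ratio: int = 100,  # assumption: threshold taken from the "hundred times" wording
) -> bool:
    """Check the privacy and skew conditions for a candidate query cut."""
    smaller, larger = sorted((left_size, right_size))
    # Privacy requirement (k-anonymity assumed): both sides need at least k tuples.
    if smaller < k:
        return False
    # Skew condition: reject cuts where one side dwarfs the other.
    return larger <= max_ratio * smaller
```

For instance, with k = 5 a cut into 50 and 4,000 tuples passes, while a cut into 10 and 2,000 tuples is rejected and the partition would fall back to the median cut.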
Lemma 4.2. The time complexity of TDH3 is $O(d|Q| n \lg n)$.
Proof. The height of the kd-tree for TDH3 will be $\log_{100/99} n$. The
work performed at each level of the kd-tree is $|Q|n$ as only one
query is considered for a feasible cut. This gives a total
time complexity of $O(d|Q| n \lg n)$. □
The time complexity of TDH3 is $O(d|Q| n \lg n)$ with a constant
factor of $\log \frac{100}{99}$ in comparison to TDSM.
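To make the constant factor concrete, the worst-case height can be rewritten in base 2, assuming every cut is as skewed as the 1:99 threshold allows. The numerical value below is an approximation computed from $\lg(100/99) \approx 0.0145$:

```latex
\[
\log_{100/99} n \;=\; \frac{\lg n}{\lg(100/99)} \;\approx\; 69\,\lg n
\]
```

Under this assumption the tree is at most about 69 times taller than a perfectly balanced one, which is a constant independent of $n$.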
5 EXPERIMENTS
The experiments have been carried out on two data sets for the
empirical evaluation of the proposed heuristics. The first data set
is the Adult data set from the UC Irvine Machine Learning Repository
[21] having 45,222 tuples and is the de facto benchmark for
k-anonymity research. The attributes in the Adult data set are: Age,
Work class, Education, Marital status, Occupation, Race, and
Gender. The second data set is the Census data set [22] from IPUMS.
This data set is extracted for Year 2001 using attributes: Age,
Gender, Marital status, Race, Birth place, Language, Occupation,
and Income. The size of the data set is about 1.2 million tuples. For
the k-anonymity experiments, we use the first eight attributes as
the QI attributes. For the l-diversity experiments, we use
Attribute occupation as the sensitive attribute and the first seven
attributes as the QI attributes. For the l-diversity experiments,
all the tuples having the occupation value as Not Applicable (0 in
the data set) are removed, which leaves about 700k tuples. In the
case of the variance diversity experiments, Attribute income is used
as the sensitive attribute and all the tuples having the income value
as Not Applicable (9,999,999 in the data set) are removed, which
leaves about 950k tuples.
We use 200 and 500 queries generated randomly as the
workload/permissions for the Adult data set and Census data set,
respectively. The experiments have been conducted for two types of
query workloads. To avoid yielding too many empty queries, the
queries are generated randomly using the approach by Iwuchukwu and
Naughton [6]. In this approach, two tuples are selected randomly
from the tuple space and a query is formed by making a bounding box
of these two tuples. To simulate the permissions for an access
control policy, the query selectivity for both the data sets is set
to range from 0.5 to 5 percent. For the first workload, if the query
output is between 500 and 5,500 tuples for the Adult data set and
between 1,000 and 50,000 for the Census data set, the query is added
to the workload. For the second workload (we will refer to this
workload as the uniform query workload), this range (1,000 to 50,000
for the Census data set) is divided into ten equal intervals and we
add only 50 queries from each interval to the workload. Similarly, for
the Adult data set, 20 queries are added from each size
interval. The first workload is used for the l-diversity and
variance diversity experiments. The average query size for
the Adult data set is 3,000 and for the Census data set is
25,000 for the uniform query workload. The imprecision
bounds for all queries are set based on the query size for the
current experiment. Otherwise, bounds for queries can be
set according to the precision required by the access control
administrator. The intuition behind setting bounds as a factor
of the query size is that imprecision added to the query is
proportional to the query size. Further, as no real relational
policy data is available, we believe this approach can allow
researchers to reproduce our workload and compare
their results with the approaches presented in this paper.
For the k-anonymity experiments, we fix the value of k and change
the query imprecision bounds from 5 to 30 percent with increments of
5. Then, we find the number of queries whose bounds have not been
satisfied by each algorithm for the uniform query workload. The
results for k-anonymity are given in Fig. 6 for the Adult data set
for k values of 3, 5, 7 and 9. Heuristic TDH2 has the least
number of query bound violations and is better than
TDH1 because of TDH2's query-bound update step. TDH3, with its
added constraints and reduced complexity, also performs better than
TDSM. The number of queries violating imprecision bounds increases
as the value of k increases. The focus is to maximize the number of
queries satisfying imprecision bounds even if the total imprecision as
compared to TDSM is increased. However, as in Fig. 7, even the
total imprecision for all the proposed heuristics is considerably
less than TDSM for all values of k. Due to limited space, only the
above results are discussed for the Adult data set.
Fig. 6. Number of queries violating bounds for k-anonymity for the
Adult data set.
For k-anonymity, the number of queries for which the imprecision
bound is violated is given in Fig. 8 for the Census data set using
the uniform query workload of 500 queries. The results have the same
behavior as that for the Adult data set. In both cases, TDH2 has the
lowest number of queries violating the imprecision bounds. The sum
of imprecision for all queries is given in Fig. 9, where TDH2 also
has the lowest total imprecision for all values of k. In Fig. 8, the
total number of violated queries is given. So, in Fig. 10, we plot
the number of queries against the margin by which they violate
the query bound (the imprecision bound is set as 25 percent of the
query size). Six ranges are considered for this margin:
less than 10, 10-25, 25-50, 50-75, and 75-100 percent, and
greater than 100 percent of the bound. In Section 6, an algorithm is
proposed to realign the output partitions to satisfy the imprecision
bounds of queries that violate the bound by a margin of less than
10 percent. The reason for using the uniform query workload (50
randomly selected queries from each size range having cardinality
between 0.5 and 5 percent of the data set) is that it helps
observe the behavior of the queries violating the bounds for
each algorithm. Intuitively, there is more chance of violating
the imprecision bounds for a query having a smaller imprecision
bound. In Fig. 11, the number of queries violated for
each size range (10 size intervals in 1k-50k) is plotted. The
behavior of TDSM follows the intuition as more queries in
the smaller size range are violated. For TDH1, the heuristic
always favors the queries with smaller bounds when being
considered for a partition split. Thus, for TDH1, fewer queries
with smaller bounds are violated than queries with larger ones. TDH2
and TDH3 favor queries with smaller bounds initially. However, as
partitions are added to the output, all queries are
treated fairly. Hence, the number of queries violated is
almost uniform in this case.
We use the same heuristics for the privacy requirements of
l-diversity and variance diversity. The experiments are conducted
for l values of 7 and 9. For each value of l, we change the query
imprecision bounds from 5 to 30 percent with increments of 5 and
find the number of queries whose bounds are not satisfied by each
algorithm. The results for l values of 7 and 9 are given in Fig. 12.
The results show that TDH2 violates the bound for fewer
queries for l-diversity.
Fig. 7. Total imprecision for all queries for the Adult data set.
Fig. 8. Number of queries violating bounds for k-anonymity for the
Census data set.
Fig. 9. Total imprecision for all queries for the Census data set.
Fig. 10. Distribution of queries (w.r.t. bound) violating bound at
25 percent for k-anonymity for the Census data set.
In the case of variance diversity, the experiments are conducted
for the variance values $V/200$ and $V/100$, where $V$ is the
variance of the sensitive attribute in the data set. For a variance
diversity value, we change the query imprecision bounds from
5 to 30 percent and find the number of queries whose bounds are
violated by each algorithm. The results for variance diversity are
given in Fig. 13. For variance diversity, TDH2 gives the best
results.
In the next experiment, all the algorithms are compared with
respect to the size of the given query set. The size of the query
set is changed from 32 to 1,024 for a k value of 5 and a query
imprecision bound of 30 percent. Observe in Fig. 14 that as the
size of the query workload is increased, bounds for more queries are
violated. However, the proposed heuristics still violate the bounds
of fewer queries than TDSM.
While the intention is to satisfy the imprecision bounds for as
many queries as possible from the given set of queries, it is as
important to maintain the utility of all other queries. In this
experiment, after partitioning for a given set of queries, we
generate 1,000 new random queries and compare the number of queries
satisfied at 30 percent imprecision bound by each algorithm. The
results are given in Fig. 15. Observe that the performance of all
the algorithms is similar. The slightly better results
in case of TDH1, TDH2, and TDH3 are due to the fact that more
queries are picked from high density tuple regions for which
partitioning is already optimized for the proposed heuristics.
The proposed techniques do not provide any performance
guarantees. However, we compare the performance of the proposed
heuristics with the optimal solution using a smaller subset of the
Adult data set. We use three attributes (Work Class, Marital Status,
and Race) and pick 1,000 tuples randomly from the Adult data set.
The heuristic algorithms are executed using a workload of 1,000
randomly selected queries with an imprecision bound of 20 percent of
the size of the query. For the optimal partitioning, all possible
partitions are created based on the selected three attributes. In the
next step, the partitions having fewer than $k$ tuples or more than
$2d(k-1) + f_{max}$ [14] are rejected, where $f_{max}$ is the
maximum frequency of any tuple in the partition. For
the remaining partitions, an integer programming model in the
General Algebraic Modeling System (GAMS) is executed to
select a set of partitions containing all the tuples while
violating the imprecision bound for the minimum number of
queries. The comparison of the optimal partitioning for the least
number of query imprecision bound violations against
TDSM and TDH2 is given in Fig. 16. Observe that as the
value of k is increased, the gap between TDH2 and the optimal
solution increases, suggesting that the quality factor is
dependent on k.
The visual representation of the partitions resulting from the
proposed heuristic TDH2 and TDSM is given in Fig. 17. Here, 1,000
tuples with two attributes are randomly selected (Normal
distribution with $\mu = 50$, $\sigma = 10$, and cardinality 100).
Ten random queries are also selected (query selectivity is from 10 to
50 percent) and the query imprecision bound is set to 10 percent of
the query size. The rectangles with the blue (darker) lines are
the queries while the rectangles with red (lighter) lines
are partitions generated by the heuristics at $k = 5$.
Observe that in Fig. 17, fewer partitions overlap
the query region for TDH2 as compared to TDSM, e.g.,
Query Q2 (range: 32-54, 30-43) has zero imprecision under
TDH2 and all the partitions are fully enclosed by the query
region.
Fig. 11. Distribution of queries (w.r.t. size) violating bound at
15 percent for k-anonymity for the Census data set.
Fig. 12. Number of queries violating bound for l-diversity for
the Census data set.
Fig. 13. Number of queries violating bound for
variance-diversity for the Census data set.
Fig. 14. Varying the size of given query workload for the
Census data set.
6 IMPROVING THE NUMBER OF QUERIES SATISFYING THE IMPRECISION
BOUNDS
In Section 3, the query imprecision slack is defined as the
difference between the query bound and query imprecision. This
query imprecision slack can help satisfy queries
that violate the bounds by only a small margin by increasing
the imprecision of the queries having more slack. The
margin by which queries violate the bounds is given in
Fig. 10. In this repartitioning step, we consider only the
first two groups of queries, which exceed the bound by less than
10 percent and by 10-25 percent. These queries are
added to the Candidate Query set (CQ), while all queries
satisfying the bounds are added to the query set SQ. The
output partitions are all the leaf nodes in the kd-tree. For
repartitioning, we only consider those pairs of partitions
from the output that are siblings in the kd-tree and have
imprecision greater than zero for the queries in the candidate
query set. These pairs of partitions are then added to
the candidate partition set for repartitioning. Merging
such a pair of sibling leaf nodes ensures that we still get a
hyper-rectangle and the merged partition is non-overlapping with
any other output partition. The repartitioning is
first performed for the set of queries within 10 percent of
the bound. The partitions that are modified are removed
from the candidate set and then the second group of
queries is checked. The algorithm for repartitioning is
listed as Algorithm 4. In Lines 6-9, we check if a query cut
along any dimension exists that reduces the total imprecision
for the queries in Set CQ while still satisfying the
bounds of the queries in SQ. If such a cut exists, then the
old partitions are removed and the new ones are added to
Output P in Lines 11-12. After every iteration, the imprecision
of the queries in Set CQ is checked. If the imprecision
is less than the bound for any query, then as in Line 15,
that query is moved from Set CQ to SQ. The proposed
algorithm in the experiments satisfies most of the queries from
the first group and only a few queries from the second group. This
repartitioning step is equivalent to partitioning all the leaf
nodes, which in the worst case can take
$O(|Q|n)$ time for each candidate query set.
In the experiments, we set the value of k to 5 and 7 with a query
imprecision bound of 30 percent of the query size. The results for
repartitioning are given in Fig. 18. TDH2p and TDH3p are the results
after the repartitioning step. Observe that most of the queries in
the 10 percent group have been satisfied, while for the
10-25 percent group, some of these have been satisfied
while the others have moved into the first group. Repartitioning
of the other groups of queries reduces the total
imprecision but the gains in terms of having more
queries satisfying bounds are not worthwhile.
Fig. 15. Performance for a different query workload for the
Census data set.
Fig. 16. Comparison with optimal solution.
7 RELATED WORK
Access control mechanisms for databases allow queries only on the
authorized part of the database [8], [10]. Predicate-based
fine-grained access control has further been proposed, where user
authorization is limited to pre-defined predicates [11]. Enforcement
of access control and privacy policies has been studied in [23].
However, studying the interaction between the access control
mechanisms and the privacy protection mechanisms has been missing.
Recently, Chaudhuri et al. have studied access control with
privacy mechanisms [24]. They use the definition of differential
privacy [25] whereby random noise is added to original query
results to satisfy privacy constraints. However, they have not
considered the accuracy constraints for permissions. We
define the privacy requirement in terms of k-anonymity. It
has been shown by Li et al. [26] that after sampling,
k-anonymity offers similar privacy guarantees as those of
differential privacy. The proposed accuracy-constrained
privacy-preserving access control framework allows the access
control administrator to specify imprecision constraints that the
privacy protection mechanism is required to meet along with the
privacy requirements.
The challenges of privacy-aware access control are similar to
the problem of workload-aware anonymization. In our analysis of the
related work, we focus on query-aware anonymization. For the state
of the art in k-anonymity techniques and algorithms, we refer the
reader to a recent survey paper [3]. Workload-aware anonymization
was first studied by LeFevre et al. [5]. They have proposed the
Selection Mondrian algorithm, which is a modification to the greedy
multidimensional partitioning algorithm Mondrian
[14]. In their algorithm, based on the given query workload,
the greedy splitting heuristic minimizes the sum of imprecision
for all queries. Iwuchukwu and Naughton have proposed
an R-tree based anonymization algorithm [6]. The
authors illustrate by experiments that data anonymized
using a biased R-tree based on the given query workload is
more accurate for those queries than data anonymized by an unbiased
algorithm. Ghinita et al. have proposed algorithms based on
space filling curves for k-anonymity and l-diversity [27].
They also introduce the problem of accuracy-constrained
anonymization for a given bound of acceptable information
loss for each equivalence class [28]. Similarly, Xiao et al. [29]
propose to add noise to queries according to the size of the
queries in a given workload to satisfy differential privacy.
However, bounds for query imprecision have not been
considered. The existing literature on workload-aware
anonymization focuses on minimizing the overall imprecision for a
given set of queries. However, anonymization with
imprecision constraints for individual queries has not been
studied before. We follow the imprecision definition of
LeFevre et al. [5] and introduce the constraint of imprecision
bound for each query in a given query workload.
8 CONCLUSIONS
An accuracy-constrained privacy-preserving access control
framework for relational data has been proposed. The framework
is a combination of access control and privacy protection
mechanisms. The access control mechanism allows only
authorized query predicates on sensitive data. The
privacy-preserving module anonymizes the data to meet privacy
requirements and imprecision constraints on predicates set by
the access control mechanism. We formulate this interaction as the
problem of k-anonymous Partitioning with Imprecision Bounds
(k-PIB). We give hardness results for the k-PIB
problem and present heuristics for partitioning the data to
satisfy the privacy constraints and the imprecision
bounds. In the current work, static access control and a
relational data model have been assumed. For future work, we plan
to extend the proposed privacy-preserving access control
to incremental data and cell level access control.
Fig. 17. Anonymization for two attributes with discrete normal
distribution ($\mu = 50$, $\sigma = 10$).
Fig. 18. Improvements after repartitioning for k-anonymity for
the Census data set.
ACKNOWLEDGMENTS
This research was partially supported by the US National Science
Foundation (NSF) Grants IIS-1117766, IIS-0964639, and
IIS-0811954.
REFERENCES
[1] E. Bertino and R. Sandhu, "Database Security-Concepts,
Approaches, and Challenges," IEEE Trans. Dependable and
Secure Computing, vol. 2, no. 1, pp. 2-19, Jan.-Mar. 2005.
[2] P. Samarati, "Protecting Respondents' Identities in
Microdata Release," IEEE Trans. Knowledge and Data Eng., vol. 13,
no. 6, pp. 1010-1027, Nov. 2001.
[3] B. Fung, K. Wang, R. Chen, and P. Yu, "Privacy-Preserving
Data Publishing: A Survey of Recent Developments," ACM
Computing Surveys, vol. 42, no. 4, article 14, 2010.
[4] A. Machanavajjhala, D. Kifer, J. Gehrke, and M.
Venkitasubramaniam, "L-Diversity: Privacy Beyond k-anonymity," ACM
Trans. Knowledge Discovery from Data, vol. 1, no. 1, article 3,
2007.
[5] K. LeFevre, D. DeWitt, and R. Ramakrishnan,
"Workload-Aware Anonymization Techniques for Large-Scale Datasets,"
ACM Trans. Database Systems, vol. 33, no. 3, pp. 1-47, 2008.
[6] T. Iwuchukwu and J. Naughton, "K-Anonymization as
Spatial Indexing: Toward Scalable and Incremental
Anonymization," Proc. 33rd Int'l Conf. Very Large Data Bases, pp.
746-757, 2007.
[7] J. Buehler, A. Sonricker, M. Paladini, P. Soper, and F.
Mostashari, "Syndromic Surveillance Practice in the United States:
Findings from a Survey of State, Territorial, and Selected Local
Health Departments," Advances in Disease Surveillance, vol. 6, no. 3,
pp. 1-20, 2008.
[8] K. Browder and M. Davidson, "The Virtual Private Database
in oracle9ir2," Oracle Technical White Paper, vol. 500, 2002.
[9] A. Rask, D. Rubin, and B. Neumann, "Implementing
Row- and Cell-Level Security in Classified Databases Using SQL
Server 2005," MS SQL Server Technical Center, 2005.
[10] S. Rizvi, A. Mendelzon, S. Sudarshan, and P. Roy,
"Extending Query Rewriting Techniques for Fine-Grained Access
Control," Proc. ACM SIGMOD Int'l Conf. Management of Data, pp.
551-562, 2004.
[11] S. Chaudhuri, T. Dutta, and S. Sudarshan, "Fine Grained
Authorization through Predicated Grants," Proc. IEEE 23rd Int'l
Conf. Data Eng., pp. 1174-1183, 2007.
[12] K. LeFevre, R. Agrawal, V. Ercegovac, R. Ramakrishnan, Y.
Xu, and D. DeWitt, "Limiting Disclosure in Hippocratic
Databases," Proc. 30th Int'l Conf. Very Large Data Bases, pp. 108-119,
2004.
[13] D. Ferraiolo, R. Sandhu, S. Gavrila, D. Kuhn, and R.
Chandramouli, "Proposed NIST Standard for Role-Based Access
Control," ACM Trans. Information and System Security, vol. 4, no. 3,
pp. 224-274, 2001.
[14] K. LeFevre, D. DeWitt, and R. Ramakrishnan, "Mondrian
Multidimensional K-Anonymity," Proc. 22nd Int'l Conf. Data Eng., pp.
25-25, 2006.
[15] J. Friedman, J. Bentley, and R. Finkel, "An Algorithm for
Finding Best Matches in Logarithmic Expected Time," ACM Trans.
Mathematical Software, vol. 3, no. 3, pp. 209-226, 1977.
[16] A. Meyerson and R. Williams, "On The Complexity of
Optimal k-Anonymity," Proc. 23rd ACM SIGMOD-SIGACT-SIGART
Symp. Principles of Database Systems, pp. 223-228, 2004.
[17] G. Aggarwal, T. Feder, K. Kenthapadi, R. Motwani, R.
Panigrahy, D. Thomas, and A. Zhu, "Approximation Algorithms
for k-Anonymity," J. Privacy Technology, vol. 2005112001, pp. 1-18,
2005.
[18] R. Sandhu and Q. Munawer, "The Arbac99 Model for
Administration of Roles," Proc. 15th Ann. Computer Security
Applications Conf., pp. 229-238, 1999.
[19] E. Otoo, D. Rotem, and S. Seshadri, "Optimal Chunking of
Large Multidimensional Arrays for Data Warehousing," Proc. ACM
10th Int'l Workshop on Data Warehousing and OLAP, pp. 25-32,
2007.
[20] W. Hoeffding, "On the Distribution of the Number of
Successes in Independent Trials," The Annals of Math. Statistics,
vol. 27, no. 3, pp. 713-721, 1956.
[21] A. Frank and A. Asuncion, "UCI Machine Learning
Repository," 2010.
[22] B. Steven, A. Trent, G. Katie, G. Ronald, B.S. Matthew, and
M. Sobek, "Integrated Public Use Microdata Series: Version 5.0
[Machine-Readable Database]," https://usa.ipums.org/usa/, 2010.
[23] R. Agrawal, P. Bird, T. Grandison, J. Kiernan, S. Logan,
and W. Rjaibi, "Extending Relational Database Systems to
Automatically Enforce Privacy Policies," Proc. 21st Int'l Conf. Data
Eng., pp. 1013-1022, 2005.
[24] S. Chaudhuri, R. Kaushik, and R. Ramamurthy, "Database
Access Control & Privacy: Is There a Common Ground?" Proc. Fifth
Biennial Conf. Innovative Data Systems Research (CIDR), pp.
96-103, 2011.
[25] C. Dwork, "Differential Privacy," Proc. 33rd Int'l
Colloquium Automata, Languages and Programming, pp. 1-12, 2006.
[26] N. Li, W. Qardaji, and D. Su, "Provably Private Data
Anonymization: Or, k-Anonymity Meets Differential Privacy," Arxiv
preprint arXiv:1101.2604, 2011.
[27] G. Ghinita, P. Karras, P. Kalnis, and N. Mamoulis, "Fast
Data Anonymization with Low Information Loss," Proc. 33rd Int'l
Conf. Very Large Data Bases, pp. 758-769, 2007.
[28] G. Ghinita, P. Karras, P. Kalnis, and N. Mamoulis, "A
Framework for Efficient Data Anonymization Under Privacy and
Accuracy Constraints," ACM Trans. Database Systems, vol. 34, no. 2,
article 9, 2009.
[29] X. Xiao, G. Bender, M. Hay, and J. Gehrke, "Ireduct:
Differential Privacy with Reduced Relative Errors," Proc. ACM SIGMOD
Int'l Conf. Management of Data, 2011.
Zahid Pervaiz is working toward the PhD degree in the School of
Electrical and Computer Engineering, Purdue University. His
research interests include data privacy, distributed system
security, and access control.
Walid G. Aref is currently professor of computer science at
Purdue University. His research interests include developing
database technologies for emerging applications, for example,
spatial, multimedia, genomics, and sensor-based databases. He is a
senior member of the IEEE.
Arif Ghafoor is currently a professor in the School of Electrical
and Computer Engineering at Purdue University. He has been
actively engaged in research on multimedia information systems,
database security, and parallel and distributed computing. He is a
fellow of the IEEE.
Nagabhushana Prabhu received the PhD degree in computer science
from NYU and the PhD degree in theoretical physics from the MIT.
He is a professor in the School of Industrial Engineering at Purdue
University. His research interests include optimization,
high-energy physics, and computational oncology.