Top Banner
Set 6, Statistical Database Security Sylvia Osborn CS9616 1 Set 6, Statistical Database Security
43

Set 6, Statistical Database Security Sylvia Osborn CS96161Set 6, Statistical Database Security.

Mar 31, 2015

Download

Documents

Avery Colten
Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
Page 1: Set 6, Statistical Database Security Sylvia Osborn CS96161Set 6, Statistical Database Security.

Set 6, Statistical Database Security

Sylvia Osborn

CS9616 1Set 6, Statistical Database

Security

Page 2: Set 6, Statistical Database Security Sylvia Osborn CS96161Set 6, Statistical Database Security.

What is a Statistical Database?

can be general purpose or special purpose often they are special purpose census databases statistical queries can also be asked of normal (business)

databases usually relational

used for statistical queries: sums, averages, counts, relative frequency, medians on some subset of the entities in a relation

the queries can be on line or off line users may have supplementary knowledge of some

of the individuals in the statistical database (SDB)

CS9616Set 6, Statistical Database

Security 2

Page 3: Set 6, Statistical Database Security Sylvia Osborn CS96161Set 6, Statistical Database Security.

The Objective attributes can be divided into confidential and non-confidential

data for example, a person’s address in a census database would be

non-confidential as it is readily available, whereas their annual income would be confidential

the goal is to prevent confidential information about individual persons or entities from being obtained, by single queries or sequences of queries.

the database is positively compromised when a user finds out that an individual does have a certain (set of) attribute value(s).

the database is negatively compromised when a user finds out that an individual does not have a certain (set of) attribute value(s).

to protect from statistical inference requires mechanisms completely different from those we have discussed so far.

CS9616Set 6, Statistical Database

Security 3

Page 4: Set 6, Statistical Database Security Sylvia Osborn CS96161Set 6, Statistical Database Security.

Keys and quasi-identifiers

in a relational database, a key is an (a set of) attribute(s) which uniquely identify all the other values in the tuple, according to some functional dependencies

a quasi-identifier is a set of attributes which usually identifies an entity, but not always e.g. in a small town, with one employer, job title and years of service

might predict salary (but some people do not work for the main employer)

quasi-identifiers depend more on the data and not on the user’s knowledge of functional dependencies.

often census type data is released without identifying attributes, as microdata (i.e. all data except the key attributes)

CS9616Set 6, Statistical Database

Security 4

Page 5: Set 6, Statistical Database Security Sylvia Osborn CS96161Set 6, Statistical Database Security.

Techniques

fall into two main categories: restriction or suppression

certain data is not released is recoded to avoid releasing individual

sensitive values queries are restricted

noise addition or perturbation

CS9616Set 6, Statistical Database

Security 5

Page 6: Set 6, Statistical Database Security Sylvia Osborn CS96161Set 6, Statistical Database Security.

Some data for our department many years ago (salary is actually years of service)

CS9616Set 6, Statistical Database

Security 6

Page 7: Set 6, Statistical Database Security Sylvia Osborn CS96161Set 6, Statistical Database Security.

Sorted by rank

CS9616Set 6, Statistical Database

Security 7

Page 8: Set 6, Statistical Database Security Sylvia Osborn CS96161Set 6, Statistical Database Security.

Basic Assumptions relational database made up of numerical (like salary or

date of birth) and non-numerical or categorical data types (like rank, job title, diagnosis).

The categorical types often have only a few discrete values (like Sex or Rank)

CS9616Set 6, Statistical Database

Security 8

Page 9: Set 6, Statistical Database Security Sylvia Osborn CS96161Set 6, Statistical Database Security.

Assumptions, cont’d

CS9616Set 6, Statistical Database

Security 9

often statistical databases release counts based on a summary of several attributes, called macrostatistics or summary tables

Page 10: Set 6, Statistical Database Security Sylvia Osborn CS96161Set 6, Statistical Database Security.

Assumptions, cont’d statistics are usually based on a characteristic formula, which is a Boolean

expression over attribute values connected by AND, OR and NOT

(basically the WHERE clause in the SQL query)

(do a query first and then summarize)

e.g. C = (Sex = F) AND ((Rank = Asst) OR (Rank = Assoc)) AND (Salary ≤ 9)

The query set is the set of records satisfying the characteristic,

denoted X(C) for characteristic C.

The query set size is the number of records in X(C),

denoted by |X(C)|.

the relative frequency of a query set, X(C) is given by |X(C)| / |R|

CS9616Set 6, Statistical Database

Security 10

Page 11: Set 6, Statistical Database Security Sylvia Osborn CS96161Set 6, Statistical Database Security.

More Basics some of the elementary sets may be empty. the statistical queries are sum, count, average, median, max, min and

relative frequency if a relation R has m attributes, then an elementary set corresponds to

the characteristic formula:

C = (A1 = a1) AND ... AND (Am = am)

where Ai is one of the attributes of R, and ai is one of the values of

attribute Ai. If Ai has |Ai| possible values, then the number of elementary

sets is given by

πi=1,m|Ai|

CS9616Set 6, Statistical Database

Security 11

Page 12: Set 6, Statistical Database Security Sylvia Osborn CS96161Set 6, Statistical Database Security.

Query Set Size Control a very simple technique is to control the size of the answer to a

query. This is done by picking some number k such that

k ≤ |X(C)| ≤ N-k

where N is the number of tuples in the whole relation. for example, if k=1 for our example, then queries giving 1 or

17 tuples in the result would not be allowed.

An example would be a query which gives the average salary of female Assoc Profs (there is one), or one which gives the average salary of all profs who are male (not female) or are not Assoc. Profs (which has 17 members)

CS9616Set 6, Statistical Database

Security 12

Page 13: Set 6, Statistical Database Security Sylvia Osborn CS96161Set 6, Statistical Database Security.

More on Query Set Size Control

the N-k sized query is disallowed because the user could ask for the average and count of salaries for all Profs, and then do the arithmetic to isolate the k.

k is chosen by the DBadmin. in general, then, if Q(C) gives a number

of tuples ≤ k, then query Q(NOT C) gives an answer of size ≥ N - k.

CS9616Set 6, Statistical Database

Security 13

Page 14: Set 6, Statistical Database Security Sylvia Osborn CS96161Set 6, Statistical Database Security.

A Tracker a tracker is a set of characteristic formulas which allows one to ask

queries which fall within the query set size, but still allows an individual tuple to be isolated.

an example: C = (Rank = Assoc) AND (Sex = F) uniquely identifies Osborn.

one tracker can be built as follows: if

C is of the form: A AND B,

then the tracker T is: A AND NOT B

and the query is answered by asking A, T and computing

Q(C) = Q(A) - Q(T).

Presumably both Q(A) and Q(T) fall within the allowed query set size.

CS9616Set 6, Statistical Database

Security 14

Page 15: Set 6, Statistical Database Security Sylvia Osborn CS96161Set 6, Statistical Database Security.

The query is

Say for Q(C), the query is:

select sum(salary)from Rwhere C

You could use another statistic like avg(salary) or count(salary)

so you compute Q(A) and Q(T), which are both allowed, and do the arithmetic

CS9616Set 6, Statistical Database

Security 15

Page 16: Set 6, Statistical Database Security Sylvia Osborn CS96161Set 6, Statistical Database Security.

Example in our example, C = (Rank = Assoc) AND (Sex = F),

pick A = (Sex = F),

which makes T = (Sex = F) AND NOT (Rank = Assoc)

suppose the query wants the count of C. to compute Q(C) = Q(A) - Q(T):

Count(A)= 5,

Count(T)= 4

Count(C) = Count(A) - Count(T) = 1

this is called an individual tracker because it isolates an individual tuple. for an individual tracker, a new formula must be found for each individual.

CS9616Set 6, Statistical Database

Security 16

Page 17: Set 6, Statistical Database Security Sylvia Osborn CS96161Set 6, Statistical Database Security.

General Tracker given that all query answers must lie in the

range [2k, n-2k], a general tracker is a characteristic formula T such that its query set size lies in this range:

k ≤ n/4 suppose in our example that k = 2, so the

tracker T must have query set size in [4, 14]. again assume the query C we want to

answer is the total salary for all female, associate profs.or C = (Sex = F) AND (Rank = Assoc)

CS9616Set 6, Statistical Database

Security 17

Page 18: Set 6, Statistical Database Security Sylvia Osborn CS96161Set 6, Statistical Database Security.

General Tracker, cont’d to use the general tracker, the following

formulas are used:

1. Q = q(T) OR q(NOT T)

2. if Count(C) < k, then use:

q(C) = q(C OR T) + q(C OR NOT T) – Q

3. if Count(C) > n-k, then use:

q(C) = 2Q - q(NOT C OR T) - q(NOT C OR NOT T)

CS9616Set 6, Statistical Database

Security 18

Page 19: Set 6, Statistical Database Security Sylvia Osborn CS96161Set 6, Statistical Database Security.

In our example T can be (Rank = Full)

q(T) = total salary for (Rank = Full) = 22 + 31 + 14 + 1 + 9 = 77

q(NOT T) = total salary for (Rank NOT = Full) = 118

Q = q(T) + q(NOT T) = 77 + 118 = 195.

using equation 2, total salary of C = (Sex = F) AND (Rank = Assoc)= q(C OR T) + q(C OR NOT T) - Q

= q( ((Sex = F) AND (Rank = Assoc)) OR (Rank = Full))

+ q( ((Sex = F) AND (Rank = Assoc)) OR (Rank NOT = Full)) – Q

= (21 + 77) + (118) - 195

= 21

CS9616Set 6, Statistical Database

Security 19

Page 20: Set 6, Statistical Database Security Sylvia Osborn CS96161Set 6, Statistical Database Security.

Sufficient Condition for a General Tracker

assume that n ≥ 4k

if there are 2k + 1 formulas C1, ... C2k+1 whose query sets are

mutually disjoint and collectively exhaust the database pick exactly 4k records such that there is exactly one record from

each class. there is a theorem that says there is a subset I of {1, ... 2k+1} such

that exactly 2k of the 4k records satisfy the disjunctive formula

T = Σi ∊ I Ci

if T is applied to the entire relation, there are an additional n - 4k records which can satisfy T, so

2k ≤ Count(T) ≤ 2k + (n -4k) = n - 2k

CS9616Set 6, Statistical Database

Security 20

Page 21: Set 6, Statistical Database Security Sylvia Osborn CS96161Set 6, Statistical Database Security.

one way this might happen is if one of the attributes has ≥ 2k + 1 distinct values,

pick classes so that their total cardinality = 2k. The corresponding characteristic formula is the tracker.

If there are fewer than 2k + 1 disjoint classes in the database, then a general tracker may or may not exist.

in general, with a relation of size n, in O(n2) time, one can sort it on all fields and count the size of each distinguishable group, and come up with a Tracker

A good reference on trackers is Denning, Denning and Schwartz, The Tracker: A Threat to Statistical Database Security, ACM TODS, vol. 4, no. 1, 1979}

CS9616Set 6, Statistical Database

Security 21

Page 22: Set 6, Statistical Database Security Sylvia Osborn CS96161Set 6, Statistical Database Security.

Query Set Overlap Control idea here is to make sure that any new query does not

overlap with a query previously asked by this user, with an overlap whose size is less than k

must keep a history of all previous queries asked by this user, and compare the new query to see if the overlap with each previous query is allowed.

techniques can be devised to do this with bit matrices. If queries are off-line, they can be batched up and those which give out the most information, and satisfy all the criteria for overlaps, can be selected from the batch

for on-line queries, the result is a yes or no to this query systems which store information about past query activities

are called audit based

CS9616Set 6, Statistical Database

Security 22

Page 23: Set 6, Statistical Database Security Sylvia Osborn CS96161Set 6, Statistical Database Security.

Conceptual Techniques -- the Lattice Model

this summary table is called a 3-dimensional table because the count summaries are based on three attribute values.

such tables are called m-dimensional tables, or just m-tables.

for a relation with m attributes, there are 2m ways to take subsets of the attributes and form an m-table, for some summary statistic.

CS9616Set 6, Statistical Database

Security 23

Page 24: Set 6, Statistical Database Security Sylvia Osborn CS96161Set 6, Statistical Database Security.

Conceptual Techniques -- the Lattice Model

the set of all subsets of the set of attributes forms a lattice:

can calculate one m-table from its neighbour in the lattice

CS9616Set 6, Statistical Database

Security 24

Page 25: Set 6, Statistical Database Security Sylvia Osborn CS96161Set 6, Statistical Database Security.

CS9616Set 6, Statistical Database

Security 25

This is the 3-table for our example:

These are the 2-tables:

Page 26: Set 6, Statistical Database Security Sylvia Osborn CS96161Set 6, Statistical Database Security.

Cell Suppression in general, any cell in an m-table whose count is 1 should

be suppressed (any statistics calculated based on this cell should not be released.)

Higher values of k for the count can also be used.

for sum statistics, it is usually considered that a sum is sensitive if n or fewer values contribute to form more than k% of the total.

this is called the n-respondent, k%-dominance rule. n and k are stated by the DBA and kept secret. sometimes cells of size 1 are merged with their neighbours.

CS9616Set 6, Statistical Database

Security 26

Page 27: Set 6, Statistical Database Security Sylvia Osborn CS96161Set 6, Statistical Database Security.

Complementary Suppression

CS9616Set 6, Statistical Database

Security 27

These are the counts:

The following gives the sums of salaries in each cell:

Page 28: Set 6, Statistical Database Security Sylvia Osborn CS96161Set 6, Statistical Database Security.

Complementary Suppression, cont’d if we suppress cells whose count is 1, we suppress females in 10-19 range, and males in 20-40 range.

but, from the row totals, the other column entry can be deduced.

therefore, those values must also be suppressed. the following are the unsuppressed cells:

CS9616Set 6, Statistical Database

Security 28

Page 29: Set 6, Statistical Database Security Sylvia Osborn CS96161Set 6, Statistical Database Security.

Altering the ranges

Sometimes releasing summary statistics with different ranges of values will avoid the problems with small cells

are there any better ranges in our data?

CS9616Set 6, Statistical Database

Security 29

Page 30: Set 6, Statistical Database Security Sylvia Osborn CS96161Set 6, Statistical Database Security.

Another technique: Partitioning the database records are grouped into atomic populations, and are physically stored this way.

each partition must be of a size 0, 2, 4, 6, etc. - i.e. an even number

Any answer to a query must be the union of these disjoint partitions.

a new record must wait until there is a pair to be inserted into the atomic population with it

may not be able to answer all queries because of the structure of the partitions.

but any answer given cannot contribute to a compromise on single records

CS9616Set 6, Statistical Database

Security 30

Page 31: Set 6, Statistical Database Security Sylvia Osborn CS96161Set 6, Statistical Database Security.

Perturbation-based Techniques idea is to modify the data before releasing statistics

can modify the database itself -- called record-based perturbation

can modify the answers or results before giving them back to the user – called result-based perturbation

desirable to avoid bias, which is the difference between a true value and a released statistic. The bias should be as close to zero as possible.

desirable to have consistency, which is the absence of contradictions and paradoxes. Repeated statistical results should be equal, for example. Counts should be greater than zero.

CS9616Set 6, Statistical Database

Security 31

Page 32: Set 6, Statistical Database Security Sylvia Osborn CS96161Set 6, Statistical Database Security.

Data Swapping values are exchanged within a column, in such a way that any statistical

computation over the column does not change. a t-order statistic is one which is computed using t attributes in the relation. data swapping for t columns such that any t-order statistic remains

unchanged, and higher order ones are not necessarily correct, is called a multidimensional transformation.

idea is to find a matrix transform which transforms the t columns of the matrix, and leaves no rows in common between the original matrix and the permuted matrix.

if such a transformation is carried out, then any statistics of order t or less can be released and will be accurate.

if the relation is only used for statistical queries, and is static, then it can be stored in its permuted form. Otherwise the knowledge that the matrix part of the relation is transformable is sufficient to allow all the statistics up to order t to be released.

CS9616Set 6, Statistical Database

Security 32

Page 33: Set 6, Statistical Database Security Sylvia Osborn CS96161Set 6, Statistical Database Security.

Random Sample Queries if the statistical database is extremely large, then a random

sample can be taken and statistics can be calculated on the random sample.

this is often done with census databases. If the database is dynamic, then a query set can be found, a

random sample taken and the query answered the samples are taken according to some probability p. bias on reported averages tends to be smaller than that on

relative frequency, and sums and counts with large query sets, using random sample queries guarantees

protection against attacks using trackers. with smaller query sets, it may be necessary for the system to

always use the same sample, or to audit previous queries.

CS9616Set 6, Statistical Database

Security 33

Page 34: Set 6, Statistical Database Security Sylvia Osborn CS96161Set 6, Statistical Database Security.

Fixed Perturbation a random value v is added to every

value in a column in the SDB this altered value is stored, so repeated

queries always get the same answer. the reported statistics may vary from the

true value too much (according to some threshold). Attempts to correct for this seem to make the system too easy to compromise.

CS9616Set 6, Statistical Database

Security 34

Page 35: Set 6, Statistical Database Security Sylvia Osborn CS96161Set 6, Statistical Database Security.

Query-Based Perturbation

the perturbation is on the attribute values used for computing the statistic, and varies from one query to the next.

there is a tradeoff between the amount of perturbation, and the number of queries a user needs to ask before being able to estimate the exact statistic.

CS9616Set 6, Statistical Database

Security 35

Page 36: Set 6, Statistical Database Security Sylvia Osborn CS96161Set 6, Statistical Database Security.

Rounding the response to the query is perturbed

before being given back to the user. there is systematic rounding, which can be

compromised by posing a lot of Sum queries there is random rounding which can be

compromised by posing a lot of average queries

there is controlled rounding, in which, whenever systematic or random rounding is performed, applies the same rounding to three other table entries.

CS9616Set 6, Statistical Database

Security 36

Page 37: Set 6, Statistical Database Security Sylvia Osborn CS96161Set 6, Statistical Database Security.

Combined Technique the SDB records are partitioned into cells

according to attribute values to be used in the queries. These are usually attributes for which an index has been defined.

then query set size control is applied to the cells.

those cells which are large enough, called sufficient query sets, have random sampling applied before queries are posed

those cells which are insufficient according to query set size, have perturbation applied.

CS9616Set 6, Statistical Database

Security 37

Page 38: Set 6, Statistical Database Security Sylvia Osborn CS96161Set 6, Statistical Database Security.

Criteria for comparing techniques security -- the level of protection against various types of

attack (e.g. based on trackers) exact compromise: the possibility of a user exactly compromising the

SDB partial compromise: the possibility of a user partially compromising

the SDB robustness: the capability of the technique to take into account the

user's supplementary knowledge

quality: the quality of the information released information loss: the number of non-sensitive statistics that are

unnecessarily restricted.

for perturbation techniques, related to the variance of the error in the statistics provided to the user.

bias: the expected value of the stats given out is different from the true value.

CS9616Set 6, Statistical Database

Security 38

Page 39: Set 6, Statistical Database Security Sylvia Osborn CS96161Set 6, Statistical Database Security.

Criteria, cont’d precision: for perturbation-based techniques, the variance of the

estimator returned to the user. The higher the variance, the better the protection of the method, but the lower the precision of the answers.

consistency: for perturbation-based techniques, lack of contradictions and paradoxes

cost of implementation and use of the technique implementation cost processing overhead per query user education

suitability of the technique for the type and number of SDB attributes numerical or non-numerical attributes number of attributes on-line dynamic SDBs

CS9616Set 6, Statistical Database

Security 39

Page 40: Set 6, Statistical Database Security Sylvia Osborn CS96161Set 6, Statistical Database Security.

pages 337 and 338 from the database security book by Castano, Fugini, Martella and Samarati

CS9616Set 6, Statistical Database

Security 40

Page 41: Set 6, Statistical Database Security Sylvia Osborn CS96161Set 6, Statistical Database Security.

Conclusions from these tables no single protection technique alone provides high security

and low information loss at low cost no technique exists that can prevent both exact and partial

compromise choice needs to be based on the environment cell suppression works well for static off-line published

macrostatistics data swapping works for static online SDBs random sample queries, fixed perturbation and query-based

perturbation work for dynamic on-line SDBs, providing that bias is made acceptable

combined techniques are worth looking at

CS9616Set 6, Statistical Database

Security 41

Page 42: Set 6, Statistical Database Security Sylvia Osborn CS96161Set 6, Statistical Database Security.

Other ideas to deal with statistical data k-anonymity

personally identifying information may not be released, but quasi-identifiers are released

if each sequence of (quasi-identifier) values released appears with at least k occurrences, the data is said to have k-anonymity

to achieve this, data is sometimes generalized, e.g. to “middle aged” instead of 60

l-diversity given that a block of data is to be used to produce statistical

results, this block should contain l “well-represented” values for sensitive attributes

differential privacy a condition on the release of information from a statistical database the presence or absence of a single element will not affect the

output of a query in a significant way.

CS9616Set 6, Statistical Database

Security 42

Page 43: Set 6, Statistical Database Security Sylvia Osborn CS96161Set 6, Statistical Database Security.

An “architectural” point

for DAC, MAC and RBAC a reference monitor architecture decides if an access is allowed

for statistical databases, can speak of a curator which decides what queries to answer what data to use to calculate the answer

restricting the size of query sets used to compute statistical quantities

random sampling possibly altering the data before answering

CS9616Set 6, Statistical Database

Security 43