Auditing Compliance with a Hippocratic Database
Rakesh Agrawal, Roberto Bayardo, Christos Faloutsos, Jerry Kiernan, Ralf Rantzau, Ramakrishnan Srikant
Intelligent Information Systems Research, IBM Almaden Research Center
Outline
Introduction and motivation
Problem statement
Foundations
System organization and
Hippocratic databases advocate policy-directed data management for privacy-sensitive data
– Need reinforced by legislation and regulations:
  Health Insurance Portability & Accountability Act
  Gramm-Leach-Bliley Act – Consumer Privacy Rule
Goal
– Build a system to assist with auditing compliance with the stated policy
  Event driven - privacy complaint
  Periodic - monitor exposure to privacy violation
Audit Scenario
Jane has not been feeling well and decides to consult her doctor.
The doctor uncovers that Jane’s blood sugar level is high and suspects diabetes.
Sometime later, Jane receives promotional literature from a pharmaceutical company, proposing over-the-counter diabetes tests.
Jane complains to the Department of Health and Human Services, saying that she had opted out of the doctor sharing her medical information with pharmaceutical companies for marketing purposes.
The doctor must now review disclosures of Jane’s information in order to understand the circumstances of the disclosure, and take appropriate action.
Audit Expression
audit T.disease
from Customer C, Treatment T
where C.cid=T.pcid and C.name = 'Jane'
Who has accessed Jane’s disease information?
query)
Fast and precise on audits
Non disruptive
– Minimal performance impact on normal database operation
Fine grained
Assumptions
Disclosures stemming from multiple query executions are not considered
No use of outside knowledge to deduce information without detection
Queries considered include
– Joins and aggregation, but not nested subqueries
  Note that existential subqueries can be converted into joins [SIGMOD92]
“Candidate” query
– Logged query that accesses all columns specified by the audit expression
“Indispensable” tuple (for a query)
– A tuple whose omission makes a difference to the result of a query
“Suspicious” query
– A candidate query that shares an indispensable tuple with the audit expression
Indispensable Tuple
The SPJ query Q and the audit expression A are of the form:
$$Q = \pi_{C_{OQ}}(\sigma_{P_Q}(T \times R)), \qquad A = \pi_{C_{OA}}(\sigma_{P_A}(T \times S))$$
Definition 1 - A virtual tuple v ∈ T is indispensable for an SPJ query Q if the result of Q changes when we delete v:
$$ind(v, Q) \Leftrightarrow \pi_{C_Q}(\sigma_{P_Q}(T \times R)) \neq \pi_{C_Q}(\sigma_{P_Q}((T - \{v\}) \times R))$$
P_Q: predicates in Q
C_Q: columns appearing anywhere in Q
π: duplicate preserving projection operator
T: tables common to Q and A
C_OQ: output columns in Q
“Candidate” Query
Definition 6 - Q is a candidate query with respect to A if:
$$C_Q \supseteq C_{OA}$$
Only candidate queries can be suspicious queries
“Suspicious” Query
Definition 5 - Maximal virtual tuple (MVT): A tuple v is an MVT for queries Q1 and Q2 if it belongs to the cross product of common tables in their from clauses
Definition 7 - Q is suspicious with respect to A if they share an indispensable MVT v:
$$susp(Q, A) \Leftrightarrow \exists\, v \in T \ \text{s.t.}\ ind(v, Q) \wedge ind(v, A)$$
For example,
Query Q: Addresses of people with diabetes
Audit A: Jane’s diagnosis
Jane’s tuple is indispensable for both; hence query Q is “suspicious” with respect to A
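The definitions of candidate query, indispensable tuple and suspicious query can be tried out by brute force on a toy instance. The following is a minimal sketch, not the system's algorithm: the sample rows, the dictionary layout of `Q` and `A`, and the helper names `evaluate`, `is_candidate`, `indispensable` and `suspicious` are all assumptions made for illustration, and both queries are assumed to range over the same two tables, so T is simply Customer × Treatment.

```python
from itertools import product

# Hypothetical toy data; table and column names follow the slides' example.
customer = [
    {"cid": 1, "name": "Jane", "address": "12 Oak St", "zip": "95112"},
    {"cid": 2, "name": "Bob", "address": "9 Elm St", "zip": "95120"},
]
treatment = [
    {"pcid": 1, "disease": "diabetes"},
    {"pcid": 2, "disease": "flu"},
]

# T: cross product of the tables common to Q and A (both tables here),
# so each element of T is one maximal virtual tuple.
T = [{**c, **t} for c, t in product(customer, treatment)]


def evaluate(pred, cols, rows):
    """Selection followed by a duplicate-preserving projection onto cols."""
    return sorted(tuple(r[c] for c in cols) for r in rows if pred(r))


def is_candidate(cols_q, out_cols_a):
    """Definition 6: Q touches every column the audit expression outputs."""
    return set(out_cols_a) <= set(cols_q)


def indispensable(v, pred, cols, rows):
    """Definition 1: the result changes when v is removed from rows."""
    rest = [r for r in rows if r is not v]
    return evaluate(pred, cols, rows) != evaluate(pred, cols, rest)


def suspicious(q, a, rows):
    """Definition 7: Q is a candidate and shares an indispensable tuple with A."""
    if not is_candidate(q["cols"], a["out_cols"]):
        return False
    return any(
        indispensable(v, q["pred"], q["cols"], rows)
        and indispensable(v, a["pred"], a["cols"], rows)
        for v in rows
    )


# Query Q: addresses of people with diabetes.
Q = {
    "cols": ["address", "disease", "cid", "pcid"],
    "pred": lambda r: r["disease"] == "diabetes" and r["cid"] == r["pcid"],
}
# Audit A: Jane's diagnosis.
A = {
    "cols": ["disease", "cid", "pcid", "name"],
    "out_cols": ["disease"],
    "pred": lambda r: r["cid"] == r["pcid"] and r["name"] == "Jane",
}

print(suspicious(Q, A, T))
print(is_candidate(["name", "address", "zip"], A["out_cols"]))
```

On this instance Q is flagged because Jane's tuple is indispensable for both Q and A, while a query that reads only name, address and zip is not even a candidate, since it never touches the disease column.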
– Build a query which, when run, returns the ids of suspicious queries with respect to an audit expression A
Generating the Audit Query
[Figure: query graph with boxes Candidate Query 1, Candidate Query 2, Audit Expression, and Union over tables T1 and T2]
Combine individual candidate queries and the audit expression into a single query graph
Combine the audit expression with individual candidate queries to identify suspicious queries
Replace each table with its backlog to restore the version of the table to the time of each query
QGM is a graphical representation of a query
Boxes represent operators, such as select
Lines represent input/output relationships between operators
Boxes with no inputs are tables
Suspicious SPJ Query
The candidate SPJ query Q and the audit expression A are of the form:
$$Q = \pi_{C_{OQ}}(\sigma_{P_Q}(T \times R)), \qquad A = \pi_{C_{OA}}(\sigma_{P_A}(T \times S))$$
Theorem 2 - A candidate SPJ query Q is suspicious with respect to an audit expression A if and only if:
$$\sigma_{P_A}(\sigma_{P_Q}(T \times R \times S)) \neq \emptyset$$
QGM rewrites, shown in previous slide, transform Q and A into:
$$\pi_{"Q_i"}(\sigma_{P_A}(\sigma_{P_Q}(T \times R \times S)))$$
Proof of correctness is based upon Definition 7 (suspicious query) and given in the paper
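For SPJ queries, Theorem 2 is read here as saying that a candidate query is suspicious exactly when applying both its own predicate and the audit predicate over the joined tables still returns a row. The sketch below illustrates that reading with Python's `sqlite3` module. The helper `audit_spj`, the sample rows and the logged query `Q3` are hypothetical, the candidate check is assumed to have been done already, and the current tables stand in for the backlog versions that the system would actually use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customer (cid INTEGER, name TEXT, address TEXT, zip TEXT);
    CREATE TABLE Treatment (pcid INTEGER, disease TEXT);
    INSERT INTO Customer VALUES (1, 'Jane', '12 Oak St', '95112'),
                                (2, 'Bob', '9 Elm St', '95120');
    INSERT INTO Treatment VALUES (1, 'diabetes'), (2, 'flu');
""")


def audit_spj(conn, query_id, tables, p_q, p_a):
    """Return [query_id] if some row satisfies both predicates, else [].

    The predicates are trusted SQL fragments taken from the query log and
    the audit expression; this sketch does no parsing or sanitizing.
    """
    sql = f"SELECT ? FROM {tables} WHERE ({p_q}) AND ({p_a}) LIMIT 1"
    return [row[0] for row in conn.execute(sql, (query_id,))]


TABLES = "Customer C, Treatment T"
P_A = "C.cid = T.pcid AND C.name = 'Jane'"           # audit expression predicate
P_Q1 = "T.disease = 'diabetes' AND C.cid = T.pcid"   # logged query 1
P_Q3 = "T.disease = 'flu' AND C.cid = T.pcid"        # hypothetical logged query

print(audit_spj(conn, "Q1", TABLES, P_Q1, P_A))
print(audit_spj(conn, "Q3", TABLES, P_Q3, P_A))
```

Projecting the literal query id mirrors the `"Q_i"` projection in the rewritten audit query, so running one such statement per candidate and taking the union yields the ids of the suspicious queries.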
Suspicious Aggregate Query (Including Having)
Solution in the paper
1 select name, address, zip from Customer, Treatment where disease = 'diabetes' and cid=pcid
T3 james marketing others
2 select name, address from Customer where zip='95112'
Merge logged queries and audit expression into a single query graph
Transform Query Graph into an Audit Query
[Figure: audit query graph with tables Customer (c, n, …, t) and Treatment (p, r, …, t); box Select := T.s='diabetes' and C.c=T.p with output C.n; box audit expression := X.n='Jane' with output 'Q1'; quantifiers C, T and X]
View of Customer (Treatment) is a temporal view at the time the query was executed
The audit expression now ranges over the logged query. If the logged query is suspicious, the audit query will output the id of the logged query
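The temporal view of a table at the time a logged query ran can be derived from that table's backlog. The sketch below is one way to do this, under stated assumptions: the backlog table name `b_Customer`, its `ts` (timestamp) and `op` (operation) columns, the integer timestamps and the helper `customer_as_of` are all hypothetical, and each key is assumed to have at most one backlog entry per timestamp.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE b_Customer (cid INTEGER, name TEXT, zip TEXT, ts INTEGER, op TEXT);
    INSERT INTO b_Customer VALUES
        (1, 'Jane', '95112', 1, 'insert'),
        (2, 'Bob',  '95120', 2, 'insert'),
        (1, 'Jane', '95014', 5, 'update'),
        (2, 'Bob',  '95120', 7, 'delete');
""")

# For every key, keep the latest backlog entry written at or before time :t,
# and drop the key if that entry records a delete.
SNAPSHOT = """
    SELECT b.cid, b.name, b.zip
    FROM b_Customer b
    WHERE b.ts = (SELECT MAX(b2.ts) FROM b_Customer b2
                  WHERE b2.cid = b.cid AND b2.ts <= :t)
      AND b.op <> 'delete'
    ORDER BY b.cid
"""


def customer_as_of(conn, t):
    """Rows of Customer as they stood at logical time t."""
    return conn.execute(SNAPSHOT, {"t": t}).fetchall()


print(customer_as_of(conn, 3))
print(customer_as_of(conn, 8))
```

An audit query would range over such a snapshot, taken at the logged query's timestamp, instead of over the current table, so that later updates cannot hide or fabricate an access.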
Scenario Outcome
The audit uncovers that Query 1 in the query log accessed Jane’s information
Empirical Evaluation: Goals
Cost of maintaining backlog tables
– Understand the impact of maintaining backlog tables on ongoing database operations
Cost of running audits
– Understand whether audits can run in reasonable time
Experimental Setup
IBM M Pro 6868 Intellistation
– 800 MHz Pentium III processor
– 512 MB of memory
– 16.9 GB disk drive
Windows 2000 Version 5, SP 4
DB2 v7 with default settings
TPC-H database
– Eager indexing
  Maintain an index over the backlog table
  Maintained during ongoing database operations
– Lazy indexing
  No index over the backlog table
  Create indices at the time of audit
Choice of index
– Simple index
  Primary key of source table
– Composite index
  Primary key of source table
  Time stamp
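The two index choices and the two indexing strategies can be sketched in a few lines. This is an illustration using SQLite through Python's `sqlite3` module rather than the DB2 setup of the experiments; the backlog table `b_Supplier`, its columns, the index name and the helper functions are all hypothetical.

```python
import sqlite3

# Simple index: primary key of the source table only.
# Composite index: primary key of the source table plus the time stamp.
INDEX_DDL = {
    "simple": "CREATE INDEX b_supplier_idx ON b_Supplier (skey)",
    "composite": "CREATE INDEX b_supplier_idx ON b_Supplier (skey, ts)",
}


def new_backlog():
    """Create an empty, unindexed backlog table in a fresh database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE b_Supplier (skey INTEGER, name TEXT, ts INTEGER, op TEXT)"
    )
    return conn


def load_versions(conn, n_keys, n_versions):
    """Append n_versions backlog entries for each of n_keys suppliers."""
    rows = [
        (k, f"supplier {k}", v, "update")
        for k in range(n_keys)
        for v in range(n_versions)
    ]
    conn.executemany("INSERT INTO b_Supplier VALUES (?, ?, ?, ?)", rows)


def indexed_columns(conn):
    """Column names covered by the backlog index, in index order."""
    return [row[2] for row in conn.execute("PRAGMA index_info(b_supplier_idx)")]


# Eager indexing: the index exists while backlog entries keep arriving.
eager = new_backlog()
eager.execute(INDEX_DDL["composite"])
load_versions(eager, 10, 5)

# Lazy indexing: entries arrive unindexed; the index is built at audit time.
lazy = new_backlog()
load_versions(lazy, 10, 5)
lazy.execute(INDEX_DDL["simple"])

print(indexed_columns(eager), indexed_columns(lazy))
```

The trade-off is where the maintenance cost lands: eager indexing pays on every backlog insert, while lazy indexing defers the whole cost to the start of an audit.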
Impact on Ongoing Operations
Queries
– Additionally log the query string
  Already performed in many application environments
Updates
– For each updated tuple,
  Insert a tuple to the backlog table
– Inserts and deletes are handled similarly
In a majority of environments, queries are much more frequent than updates
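Backlog maintenance of this kind is commonly implemented with row-level triggers. The sketch below shows the idea in SQLite through Python's `sqlite3` module; it is not the DB2 configuration used in the experiments, and the backlog table `b_Supplier`, its `ts` and `op` columns and the trigger names are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Supplier (skey INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE b_Supplier (skey INTEGER, name TEXT,
                             ts TEXT DEFAULT CURRENT_TIMESTAMP, op TEXT);

    -- One backlog row per inserted, updated or deleted source row.
    CREATE TRIGGER supplier_ins AFTER INSERT ON Supplier BEGIN
        INSERT INTO b_Supplier (skey, name, op) VALUES (NEW.skey, NEW.name, 'insert');
    END;
    CREATE TRIGGER supplier_upd AFTER UPDATE ON Supplier BEGIN
        INSERT INTO b_Supplier (skey, name, op) VALUES (NEW.skey, NEW.name, 'update');
    END;
    CREATE TRIGGER supplier_del AFTER DELETE ON Supplier BEGIN
        INSERT INTO b_Supplier (skey, name, op) VALUES (OLD.skey, OLD.name, 'delete');
    END;
""")

conn.executemany("INSERT INTO Supplier VALUES (?, ?)", [(1, "Acme"), (2, "Globex")])
conn.execute("UPDATE Supplier SET name = name || ' Inc'")  # touches every tuple
conn.execute("DELETE FROM Supplier WHERE skey = 2")

ops = [row[0] for row in conn.execute("SELECT op FROM b_Supplier ORDER BY rowid")]
print(ops)
```

An update statement that touches every tuple adds one backlog row per tuple, which is why update-heavy workloads bear most of the overhead while plain queries only pay for logging the query string.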
Update Performance
100,000 tuples in Supplier table
Update statement updates all tuples
Each update statement fires triggers which insert an additional 100,000 tuples in backlog
Evaluate impact of multiple versions on performance
Overhead on Updates
[Figure: time (minutes) versus # of versions per tuple (5, 20, 35, 50) for Composite, Simple, No Index, and No Triggers; the x-axis is the number of versions of each tuple in the Supplier backlog table]
Simple wins over Composite
7x if all tuples are updated
3x if a single tuple is updated
Eager indexing doesn’t add much cost
Audit Query Performance
Audit query:
select 'Q' from Supplier where skey = k
Experiment:
Evaluate the impact of the number of versions of tuples in the backlog table on performance
Composite wins over simple if initial version is selected
Simple wins over composite if the current version is selected
Takeaways
The composite index
– Enhances the performance of audits, but
– Additionally burdens updates when using eager indexing
The system supports
– Efficient auditing
– Without substantially burdening normal query processing
Related Work
Oracle Privacy Security Auditing
– Facility for logging queries with timestamp
– Flash-back queries
  Restores the version of the data at the time of the query
– No support for automated auditing
  User manually selects queries from the log and runs them
  The user has to decide if the query is suspicious
G. Miklau, D. Suciu [SIGMOD 2004]
– Formal analysis of information disclosure in data exchange
  Is information about a secret query S revealed by views V1,…,Vn
  Considers all possible instances of a database schema
  Assumes tuple independence
– We’re interested in given instances (temporal versions)
– Nonetheless, it will be interesting to explore the connection between the two works
Active enforcement of policies by limiting disclosure [VLDB’04]
Literature on multi-query optimization
Summary
In light of new privacy legislation
– The problem of auditing usage of information represents an important opportunity for database research
Formalized the problem through the fundamental concepts of indispensable tuple and suspicious queries