Distributing Data for Secure Data Services
Vignesh Ganapathy, Dilys Thomas, Tomas Feder, Hector Garcia Molina, Rajeev Motwani
April 8th, 2011
Stanford, TRDDC, TRUST
Jan 13, 2016
RoadMap
• Motivation for Secure Databases
• Column level distribution
  – Encryption, Distribution
  – Privacy constraints
  – Set cover initialization
• Query Mediation
  – Cost estimation
  – Where and Select clause processing
  – Query decomposition
• Experiments
• Related Work
Motivation 1: Data Privacy in Enterprises
• Health
  – Personal medical details
  – Disease history
  – Clinical research data
• Banking
  – Bank statement
  – Loan Details
  – Transaction history
• Finance
  – Portfolio information
  – Credit history
  – Transaction records
  – Investment details
• Insurance
  – Claims records
  – Accident history
  – Policy details
• Outsourcing
  – Customer data for testing
  – Remote DB Administration
  – BPO & KPO
• Retail Business
  – Inventory records
  – Individual credit card details
  – Audits
• Manufacturing
  – Process details
  – Blueprints
  – Production data
• Govt. Agencies
  – Census records
  – Economic surveys
  – Hospital Records
Motivation 2: Government Regulations
Country          Privacy Legislation
Australia        Privacy Amendment Act of 2000
European Union   Personal Data Protection Directive 1998
Hong Kong        Personal Data (Privacy) Ordinance of 1995
United Kingdom   Data Protection Act of 1998
United States    Security Breach Information Act (S.B. 1386) of 2002
                 Gramm-Leach-Bliley Act of 1999
                 Health Insurance Portability and Accountability Act of 1996
Motivation 3: Personal Information
• Emails
• Searches on Google/Yahoo
• Profiles on Social Networking sites
• Passwords / Credit Card / Personal information at multiple E-commerce sites / Organizations
• Documents on the Computer / Network
Losses due to Lack of Privacy: ID-Theft
• 3% of households in the US affected by ID-Theft
• US $5-50B losses/year
• UK £1.7B losses/year
• AUS $1-4B losses/year
Data Privacy
• Value disclosure: What is the value of attribute salary of person X
  – Perturbation
  – Privacy Preserving OLAP
• Identity disclosure: Whether an individual is present in the database table
  – Randomization, K-Anonymity etc.
  – Data for Outsourcing / Research
• Linkage disclosure: Linking columns from multiple sites
Masketeer: A tool for data privacy
Lodha, Patwardhan, Roy, Sundaram et al.
Two Can Keep a Secret: A Distributed Architecture for Secure Database Services
Aggarwal, Bawa, Ganesan, Garcia-Molina, Kenthapadi,
Motwani, Srivastava, Thomas, Xu
CIDR 2005
How to distribute data across multiple sites for (1) redundancy and (2) privacy, so that a single site being compromised does not lead to data loss
Motivation
• Data outsourcing growing in popularity
  – Cheap, reliable data storage and management
    • 1TB $399 < $0.5 per GB
    • $5000 – Oracle 10g / SQL Server
    • $68k/year DBAdmin
• Privacy concerns looming ever larger
  – High-profile thefts (often insiders)
    • UCLA lost 900k records
    • Berkeley lost laptop with sensitive information
    • Acxiom, JP Morgan, Choicepoint
    • www.privacyrights.org
Present solutions
Application level: Salesforce.com
On-Demand Customer Relationship Management
$65/User/Month ---- $995 / 5 Users / 1 Year
Amazon Elastic Compute Cloud
1 instance = 1.7 GHz x86 processor, 1.75 GB RAM, 160 GB local disk, 250 Mb/s network bandwidth
Elastic, Completely controlled, Reliable, Secure
$0.10 per instance hour
$0.20 per GB of data in/out of Amazon
$0.15 per GB-Month of Amazon S3 storage used
Google Apps for your domain
Small businesses, Enterprise, School, Family or Group
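Encryption Based Solution
[Diagram: the Client encrypts its data and stores it at the DSP. A client-side processor rewrites Query Q into Q’ for the DSP, receives the “Relevant Data”, and computes the Answer.]
Problem: Q’ ≈ “SELECT *”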
The Power of Two
[Diagram: the Client stores its data across two providers, DSP1 and DSP2. A client-side processor splits Query Q into Q1 for DSP1 and Q2 for DSP2.]
Key: Ensure Cost(Q1) + Cost(Q2) ≈ Cost(Q)
SB1386 Privacy
{Name, SSN},
{Name, LicenceNo},
{Name, CaliforniaID},
{Name, AccountNumber},
{Name, CreditCardNo, SecurityCode}
are all to be kept private.
A set is private if at least one of its elements is “hidden”.
An element stored in encrypted form counts as hidden.
Techniques
Vertical Fragmentation
• Partition attributes across R1 and R2
• E.g., to obey constraint {Name, SSN}, R1 ← Name, R2 ← SSN
• Use tuple IDs for reassembly. R = R1 JOIN R2
Encoding
• One-time Pad
  – For each value v, construct random bit seq. r
  – R1 ← v XOR r, R2 ← r
• Deterministic Encryption
  – R1 ← EK(v), R2 ← K
  – Can detect equality and push selections with equality predicate
• Random addition
  – R1 ← v+r, R2 ← r
  – Can push aggregate SUM
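The two share-splitting encodings listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the sample salaries, and the bound on the random offset are hypothetical and do not come from the slides.

```python
import secrets

def otp_split(value: bytes) -> tuple[bytes, bytes]:
    """One-time pad: R1 stores v XOR r, R2 stores the random pad r."""
    pad = secrets.token_bytes(len(value))
    masked = bytes(v ^ r for v, r in zip(value, pad))
    return masked, pad

def otp_join(masked: bytes, pad: bytes) -> bytes:
    """The client recovers v by XOR-ing the two shares."""
    return bytes(m ^ r for m, r in zip(masked, pad))

def additive_split(value: int, bound: int = 10**9) -> tuple[int, int]:
    """Random addition: R1 stores v + r, R2 stores r.
    The bound on r is an arbitrary choice for this sketch."""
    r = secrets.randbelow(bound)
    return value + r, r

# SUM can be pushed to both sites: each site adds up its own column,
# and the client subtracts the two partial sums.
salaries = [52000, 61000, 75000]          # made-up sample values
shares = [additive_split(s) for s in salaries]
r1_column = [a for a, _ in shares]
r2_column = [b for _, b in shares]
total = sum(r1_column) - sum(r2_column)   # equals sum(salaries)
```

Neither site can recover a value from its own share, while the client needs only one subtraction (or one XOR) per answer.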
Example
An Employee relation: {Name, DoB, Position, Salary, Gender, Email, Telephone, ZipCode}
Privacy Constraints
{Telephone}, {Email}
{Name, Salary}, {Name, Position}, {Name, DoB}
{DoB, Gender, ZipCode}
{Position, Salary}, {Salary, DoB}
Will use just Vertical Fragmentation and Encoding.
Decomposed Schema
R1: {TID, Name, Email, Telephone, Gender, Salary}
R2: {TID, Name, Email, Telephone, DoB, Position, ZipCode}
Encrypted Attributes E: {Telephone, Email, Name}
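As a sanity check, the decomposition above can be tested against the privacy constraints of the Employee example. The sketch below assumes the reading given on the SB1386 slide: a constraint is violated only if some single fragment holds all of its attributes in unencrypted form. The helper names are hypothetical.

```python
# Fragments, encrypted set, and constraints copied from the example slides.
R1 = {"TID", "Name", "Email", "Telephone", "Gender", "Salary"}
R2 = {"TID", "Name", "Email", "Telephone", "DoB", "Position", "ZipCode"}
ENCRYPTED = {"Telephone", "Email", "Name"}
CONSTRAINTS = [
    {"Telephone"}, {"Email"},
    {"Name", "Salary"}, {"Name", "Position"}, {"Name", "DoB"},
    {"DoB", "Gender", "ZipCode"},
    {"Position", "Salary"}, {"Salary", "DoB"},
]

def visible(fragment, encrypted):
    """Attributes a single site can read in the clear."""
    return fragment - encrypted

def violated(constraints, fragments, encrypted):
    """Constraints whose attributes are all readable at one site."""
    return [c for c in constraints
            if any(c <= visible(f, encrypted) for f in fragments)]

print(violated(CONSTRAINTS, [R1, R2], ENCRYPTED))  # prints []
```

Dropping Name from the encrypted set makes the check fail for the three constraints that pair Name with Salary, Position, and DoB, which is consistent with Name appearing in E.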
Partitioning, Execution
• Partitioning Problem
  – Partition to minimize communication cost for given workload
  – Even simplified version hard to approximate
  – Hill Climbing algorithm after starting with weighted set cover
• Query Reformulation and Execution
  – Consider only centralized plans
  – Algorithm to partition select and where clause predicates between the two partitions
Set Cover + Greedy for partitioning
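The slides do not say how the partitioning problem is mapped onto weighted set cover, so the following is only the generic greedy heuristic on a hypothetical instance. It assumes each candidate is "encrypt attribute A", that a candidate covers every constraint containing A, and that all costs are 1. It ignores fragmentation entirely, so its output is not the decomposed schema shown earlier.

```python
def greedy_weighted_set_cover(universe, candidates):
    """Generic greedy heuristic: repeatedly pick the candidate with the
    lowest cost per newly covered element.
    candidates maps a name to (cost, set_of_covered_elements)."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best, best_ratio = None, None
        for name in sorted(candidates):       # sorted for stable tie-breaks
            cost, covers = candidates[name]
            gain = len(covers & uncovered)
            if gain == 0:
                continue
            ratio = cost / gain
            if best_ratio is None or ratio < best_ratio:
                best, best_ratio = name, ratio
        if best is None:
            raise ValueError("universe cannot be covered")
        chosen.append(best)
        uncovered -= candidates[best][1]
    return chosen

# Hypothetical instance built from the Employee constraints.
CONSTRAINTS = [
    {"Telephone"}, {"Email"},
    {"Name", "Salary"}, {"Name", "Position"}, {"Name", "DoB"},
    {"DoB", "Gender", "ZipCode"},
    {"Position", "Salary"}, {"Salary", "DoB"},
]
attributes = set().union(*CONSTRAINTS)
candidates = {
    a: (1.0, {i for i, c in enumerate(CONSTRAINTS) if a in c})  # unit cost assumed
    for a in attributes
}
chosen = greedy_weighted_set_cover(range(len(CONSTRAINTS)), candidates)
```

A starting point of this kind could then be handed to a hill-climbing step that moves or un-encrypts attributes while the constraints still hold; the cost function used for that step is not given in the slides.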
Cost Estimation
State Definitions
• 0: condition clause cannot be pushed to either server
• 1: condition clause can be pushed to Server 1
• 2: condition clause can be pushed to Server 2
• 3: condition clause can be pushed to both servers
• 4: condition clause can be pushed to either server
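One plausible reading of these states for a single predicate (a leaf of the WHERE clause) is sketched below. It assumes Server 1 holds R1 and Server 2 holds R2 from the Employee example, and that encrypted attributes use deterministic encryption, so only equality predicates on them can be pushed. State 3 and the AND/OR combination tables are not reconstructed here, since they are not recoverable from the text.

```python
# Fragments and encrypted set from the Employee example.
R1 = {"TID", "Name", "Email", "Telephone", "Gender", "Salary"}
R2 = {"TID", "Name", "Email", "Telephone", "DoB", "Position", "ZipCode"}
ENCRYPTED = {"Telephone", "Email", "Name"}

def leaf_state(attribute: str, operator: str) -> int:
    """Classify one predicate `attribute operator constant`.
    Returns 0, 1, 2, or 4 using the numbering of the state list."""
    # Assumption: deterministic encryption supports equality tests only.
    if attribute in ENCRYPTED and operator != "=":
        return 0
    on_server_1 = attribute in R1
    on_server_2 = attribute in R2
    if on_server_1 and on_server_2:
        return 4          # either server can evaluate it
    if on_server_1:
        return 1
    if on_server_2:
        return 2
    return 0              # attribute is stored at neither server

print(leaf_state("Name", "="))      # prints 4
print(leaf_state("Position", "="))  # prints 2
print(leaf_state("Salary", ">"))    # prints 1
```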
OR State Evaluation
AND State Evaluation
Query Partitioning
• Query 1:
SELECT TID, Name, Salary
FROM R1
WHERE Name = 'Tom'
• Query 2:
SELECT TID, DoB, ZipCode
FROM R2
WHERE Position = 'Staff'
Original Query
SELECT Name, DoB, Salary
FROM R
WHERE (Name = 'Tom' AND Position = 'Staff') AND
(ZipCode = '94305' OR Salary > 60000)
R1: {TID, Name, Email, Telephone, Gender, Salary}
R2: {TID, Name, Email, Telephone, DoB, Position, ZipCode}
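The decomposition in this example can be reproduced with a small sketch. It assumes the WHERE clause has already been flattened by hand into top-level conjuncts, pushes each conjunct to the first fragment that holds all of its attributes, and leaves the rest for the client. The function names and the "first fragment wins" tie-break are hypothetical choices, not the authors' algorithm.

```python
FRAGMENTS = {
    "R1": {"TID", "Name", "Email", "Telephone", "Gender", "Salary"},
    "R2": {"TID", "Name", "Email", "Telephone", "DoB", "Position", "ZipCode"},
}

def build_sql(fragment, columns, predicates):
    sql = f"SELECT {', '.join(columns)} FROM {fragment}"
    if predicates:
        sql += " WHERE " + " AND ".join(predicates)
    return sql

def decompose(select_attrs, conjuncts):
    """conjuncts: list of (sql_text, attributes_used) joined by AND."""
    pushed = {name: [] for name in FRAGMENTS}
    residual = []                       # evaluated at the client after the join
    for text, attrs in conjuncts:
        home = next((n for n, cols in FRAGMENTS.items() if attrs <= cols), None)
        if home is None:
            residual.append(text)
        else:
            pushed[home].append(text)
    # The client needs the selected columns plus those of residual predicates.
    needed = set(select_attrs)
    for text, attrs in conjuncts:
        if text in residual:
            needed |= attrs
    columns = {name: ["TID"] for name in FRAGMENTS}   # TID for reassembly
    for attr in sorted(needed):
        home = next(n for n, cols in FRAGMENTS.items() if attr in cols)
        columns[home].append(attr)
    queries = {n: build_sql(n, columns[n], pushed[n]) for n in FRAGMENTS}
    return queries, residual

queries, residual = decompose(
    ["Name", "DoB", "Salary"],
    [("Name = 'Tom'", {"Name"}),
     ("Position = 'Staff'", {"Position"}),
     ("(ZipCode = '94305' OR Salary > 60000)", {"ZipCode", "Salary"})],
)
print(queries["R1"])  # SELECT TID, Name, Salary FROM R1 WHERE Name = 'Tom'
print(queries["R2"])  # SELECT TID, DoB, ZipCode FROM R2 WHERE Position = 'Staff'
```

The OR predicate spans both fragments, so it stays in `residual` and would be applied by the client after joining the two results on TID. This is why ZipCode is fetched from R2 even though it is not in the original select list.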
Distributed Query Plan
Performance Gain Experiment
Iterations vs. Privacy Constraints
Acknowledgements: Collaborators
Stanford Privacy Group
TRDDC Privacy Group
PORTIA, TRUST, Google
March 18, 2011
Backup slides