Distributing Data for Secure Data Services Vignesh Ganapathy, Dilys Thomas, Tomas Feder, Hector Garcia Molina, Rajeev Motwani April 8th, 2011 Stanford,

Distributing Data for Secure Data Services

Vignesh Ganapathy, Dilys Thomas, Tomas Feder, Hector Garcia Molina, Rajeev Motwani

April 8th, 2011

Stanford, TRDDC, TRUST

RoadMap

• Motivation for Secure Databases
• Column level distribution
– Encryption, Distribution
– Privacy constraints
– Set cover initialization
• Query Mediation
– Cost estimation
– Where and Select clause processing
– Query decomposition
• Experiments
• Related Work

Motivation 1: Data Privacy in Enterprises

• Health: Personal medical details, Disease history, Clinical research data
• Banking: Bank statement, Loan details, Transaction history
• Finance: Portfolio information, Credit history, Transaction records, Investment details
• Insurance: Claims records, Accident history, Policy details
• Outsourcing: Customer data for testing, Remote DB Administration, BPO & KPO
• Retail Business: Inventory records, Individual credit card details, Audits
• Manufacturing: Process details, Blueprints, Production data
• Govt. Agencies: Census records, Economic surveys, Hospital Records

Motivation 2: Government Regulations

Country: Privacy Legislation

• Australia: Privacy Amendment Act of 2000
• European Union: Personal Data Protection Directive 1998
• Hong Kong: Personal Data (Privacy) Ordinance of 1995
• United Kingdom: Data Protection Act of 1998
• United States: Security Breach Information Act (S.B. 1386) of 2002; Gramm-Leach-Bliley Act of 1999; Health Insurance Portability and Accountability Act of 1996

Motivation 3: Personal Information

• Emails
• Searches on Google/Yahoo
• Profiles on Social Networking sites
• Passwords / Credit Card / Personal information at multiple E-commerce sites / Organizations
• Documents on the Computer / Network

Losses due to Lack of Privacy: ID-Theft

• 3% of households in the US affected by ID-Theft

• US $5-50B losses/year

• UK £1.7B losses/year

• AUS $1-4B losses/year

Data Privacy

• Value disclosure: What is the value of attribute salary of person X?
– Perturbation
– Privacy Preserving OLAP
• Identity disclosure: Whether an individual is present in the database table
– Randomization, K-Anonymity etc.
– Data for Outsourcing / Research
• Linkage disclosure: Linking columns from multiple sites

Masketeer: A tool for data privacy

Lodha, Patwardhan, Roy, Sundaram et al.

Two Can Keep a Secret: A Distributed Architecture for Secure Database Services

Aggarwal, Bawa, Ganesan, Garcia-Molina, Kenthapadi, Motwani, Srivastava, Thomas, Xu

CIDR 2005

How to distribute data across multiple sites for (1) redundancy and (2) privacy, so that a single site being compromised does not lead to data loss

Motivation

• Data outsourcing growing in popularity
– Cheap, reliable data storage and management
  • 1 TB for $399: less than $0.5 per GB
  • $5000: Oracle 10g / SQL Server
  • $68k/year: DB admin
• Privacy concerns looming ever larger
– High-profile thefts (often insiders)
  • UCLA lost 900k records
  • Berkeley lost laptop with sensitive information
  • Acxiom, JP Morgan, Choicepoint
  • www.privacyrights.org

Present solutions

Application level: Salesforce.com

On-Demand Customer Relationship Management

$65/User/Month ---- $995 / 5 Users / 1 Year

Amazon Elastic Compute Cloud

1 instance = 1.7 GHz x86 processor, 1.75 GB RAM, 160 GB local disk, 250 Mb/s network bandwidth

Elastic, Completely controlled, Reliable, Secure

$0.10 per instance hour

$0.20 per GB of data in/out of Amazon

$0.15 per GB-Month of Amazon S3 storage used

Google Apps for your domain

Small businesses, Enterprise, School, Family or Group

Encryption Based Solution

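[Diagram: the client encrypts its data and stores it at the DSP. A client-side processor rewrites query Q into Q’, sends Q’ to the DSP, receives the “relevant data”, and computes the answer.]

Problem: Q’ ≈ “SELECT *”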

The Power of Two

[Diagram: the client stores its data at two providers, DSP1 and DSP2.]

[Diagram: a client-side processor splits query Q into sub-queries Q1 and Q2 for DSP1 and DSP2.]

Key: Ensure Cost(Q1) + Cost(Q2) ≈ Cost(Q)

SB1386 Privacy

{ Name, SSN},

{ Name, LicenceNo}

{ Name, CaliforniaID}

{ Name, AccountNumber}

{ Name, CreditCardNo, SecurityCode}

are all to be kept private.

A set is private if at least one of its elements is “hidden”.

Element in encrypted form ok

Techniques

• Vertical Fragmentation
– Partition attributes across R1 and R2
– E.g., to obey constraint {Name, SSN}: R1 holds Name, R2 holds SSN
– Use tuple IDs for reassembly. R = R1 JOIN R2
• Encoding
– One-time Pad: for each value v, construct random bit seq. r. R1 holds v XOR r, R2 holds r
– Deterministic Encryption: R1 holds EK(v), R2 holds K. Can detect equality and push selections with equality predicate
– Random addition: R1 holds v+r, R2 holds r. Can push aggregate SUM
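The one-time pad and random-addition encodings can be illustrated with a minimal sketch. The function names (`otp_split`, `otp_join`, `additive_split`), the sample salaries, and the bound on r are assumptions made for illustration; the slides do not prescribe an implementation.

```python
import secrets

def otp_split(value: bytes) -> tuple[bytes, bytes]:
    """One-time pad: R1 stores v XOR r, R2 stores the random pad r."""
    pad = secrets.token_bytes(len(value))
    masked = bytes(v ^ r for v, r in zip(value, pad))
    return masked, pad

def otp_join(masked: bytes, pad: bytes) -> bytes:
    """The client recovers v by XOR-ing the two shares."""
    return bytes(m ^ r for m, r in zip(masked, pad))

def additive_split(value: int, bound: int = 2**32) -> tuple[int, int]:
    """Random addition: R1 stores v + r, R2 stores r (bound is an assumed range for r)."""
    r = secrets.randbelow(bound)
    return value + r, r

# Hypothetical salary column, split across the two servers.
salaries = [52000, 61000, 75000]
shares = [additive_split(s) for s in salaries]

# Each server computes SUM over its own column only.
sum_r1 = sum(v_plus_r for v_plus_r, _ in shares)  # evaluated at server 1
sum_r2 = sum(r for _, r in shares)                # evaluated at server 2

# The client subtracts the two partial sums to get SUM(salary).
total = sum_r1 - sum_r2
```

Neither server sees a plaintext salary, yet the aggregate is computed server-side and only two numbers travel to the client.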

Example

An Employee relation: {Name, DoB, Position, Salary, Gender, Email, Telephone, ZipCode}

Privacy Constraints

{Telephone}, {Email}

{Name, Salary}, {Name, Position}, {Name, DoB}

{DoB, Gender, ZipCode}

{Position, Salary}, {Salary, DoB}

Will use just Vertical Fragmentation and Encoding.

Decomposed Schema

R1:{TID, Name, Email, Telephone, Gender, Salary}

R2:{TID, Name, Email, Telephone, DoB, Position,ZipCode}

Encrypted Attributes E: {Telephone, Email, Name}
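As a sanity check, the decomposition above can be tested against the privacy constraints using the rule that a set is private if at least one of its elements is hidden (absent from the site, or present only in encrypted form). The helper names below are hypothetical; the schema, constraints, and encrypted attributes are taken from the example.

```python
R1 = {"TID", "Name", "Email", "Telephone", "Gender", "Salary"}
R2 = {"TID", "Name", "Email", "Telephone", "DoB", "Position", "ZipCode"}
ENCRYPTED = {"Telephone", "Email", "Name"}

CONSTRAINTS = [
    {"Telephone"}, {"Email"},
    {"Name", "Salary"}, {"Name", "Position"}, {"Name", "DoB"},
    {"DoB", "Gender", "ZipCode"},
    {"Position", "Salary"}, {"Salary", "DoB"},
]

def is_private(constraint, fragment, encrypted):
    """True if at least one attribute of the constraint is hidden at this site."""
    visible = fragment - encrypted  # attributes stored in the clear at the site
    return not constraint <= visible

def violations(r1, r2, encrypted, constraints):
    """Constraints fully exposed in the clear at either site."""
    return [c for c in constraints
            if not (is_private(c, r1, encrypted) and is_private(c, r2, encrypted))]

print(violations(R1, R2, ENCRYPTED, CONSTRAINTS))            # no violations expected
print(violations(R1 | {"DoB"}, R2, ENCRYPTED, CONSTRAINTS))  # DoB next to Salary breaks a constraint
```

The second call shows why DoB lives only in R2: storing it in the clear alongside Salary would expose {Salary, DoB} at a single site.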

Partitioning, Execution

• Partitioning Problem
– Partition to minimize communication cost for given workload
– Even simplified version hard to approximate
– Hill Climbing algorithm after starting with weighted set cover
• Query Reformulation and Execution
– Consider only centralized plans
– Algorithm to partition select and where clause predicates between the two partitions

Set Cover+ Greedy for partitioning
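The slides do not spell out how the set cover instance is built, so the following is only a sketch of the generic greedy heuristic for weighted set cover, under one assumed mapping: the universe is the list of privacy constraints, each attribute "covers" the constraints that contain it (encrypting it hides one element of each), and the per-attribute weights are hypothetical encryption costs. The hill-climbing refinement is not shown.

```python
def greedy_weighted_set_cover(universe, sets, weight):
    """Repeatedly pick the set with the lowest weight per newly covered element."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best = min(
            (name for name in sets if sets[name] & uncovered),
            key=lambda name: weight[name] / len(sets[name] & uncovered),
            default=None,
        )
        if best is None:
            raise ValueError("universe cannot be covered")
        chosen.append(best)
        uncovered -= sets[best]
    return chosen

# Constraints from the Employee example, indexed 0..7.
constraints = [
    {"Telephone"}, {"Email"},
    {"Name", "Salary"}, {"Name", "Position"}, {"Name", "DoB"},
    {"DoB", "Gender", "ZipCode"},
    {"Position", "Salary"}, {"Salary", "DoB"},
]

# Assumed mapping: attribute -> indices of the constraints it appears in.
covers = {}
for i, c in enumerate(constraints):
    for attr in c:
        covers.setdefault(attr, set()).add(i)

# Hypothetical costs of encrypting each attribute (not from the slides).
cost = {"Telephone": 1.0, "Email": 1.0, "Name": 1.0, "Salary": 3.0,
        "Position": 2.0, "DoB": 2.0, "Gender": 1.0, "ZipCode": 1.0}

to_encrypt = greedy_weighted_set_cover(range(len(constraints)), covers, cost)
```

A starting point of this kind satisfies every constraint by encryption alone; a local search could then trade encryption for fragmentation where the workload makes that cheaper.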

Cost Estimation

State Definitions

• 0: condition clause cannot be pushed to either server
• 1: condition clause can be pushed to Server 1
• 2: condition clause can be pushed to Server 2
• 3: condition clause can be pushed to both servers
• 4: condition clause can be pushed to either server

OR State Evaluation

AND State Evaluation
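The OR and AND evaluation tables did not survive in this transcript. The sketch below is one plausible reading of the five states, not the authors' tables: a leaf predicate gets its state from where its attributes are stored, a conjunction can be split between servers, and a disjunction can be pushed only if both sides can go to the same server. The function names and combination rules are assumptions, and encryption is ignored (equality on a deterministically encrypted attribute is treated as pushable).

```python
NONE, S1, S2, BOTH, EITHER = 0, 1, 2, 3, 4

R1 = {"TID", "Name", "Email", "Telephone", "Gender", "Salary"}
R2 = {"TID", "Name", "Email", "Telephone", "DoB", "Position", "ZipCode"}

def leaf_state(attrs):
    """State of an atomic predicate, from the placement of its attributes."""
    on1, on2 = attrs <= R1, attrs <= R2
    if on1 and on2:
        return EITHER
    if on1:
        return S1
    if on2:
        return S2
    return NONE

def and_state(a, b):
    """Assumed rule: conjuncts are pushed independently; the rest runs at the client."""
    if a == NONE:
        return b
    if b == NONE:
        return a
    if a == EITHER:
        return b
    if b == EITHER:
        return a
    return a if a == b else BOTH

def or_state(a, b):
    """Assumed rule: a disjunction is pushable only if one server can evaluate both sides."""
    if NONE in (a, b) or BOTH in (a, b):
        return NONE
    if a == EITHER:
        return b
    if b == EITHER:
        return a
    return a if a == b else NONE

name = leaf_state({"Name"})          # stored at both sites
position = leaf_state({"Position"})  # stored only in R2
zip_or_salary = or_state(leaf_state({"ZipCode"}), leaf_state({"Salary"}))
```

Under these assumed rules the disjunction over ZipCode and Salary ends in state 0, so it has to be evaluated at the client after the fragments are joined.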

Query Partitioning

• Query 1:
SELECT TID, Name, Salary
FROM R1
WHERE Name = 'Tom'

• Query 2:
SELECT TID, DoB, ZipCode
FROM R2
WHERE Position = 'Staff'

Original Query
SELECT Name, DoB, Salary
FROM R
WHERE (Name = 'Tom' AND Position = 'Staff') AND
(ZipCode = '94305' OR Salary > 60000)

R1: {TID, Name, Email, Telephone, Gender, Salary}
R2: {TID, Name, Email, Telephone, DoB, Position, ZipCode}

Distributed Query Plan
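The plan diagram is not preserved here, but the partitioned queries can be exercised end to end with a small simulation. This is a sketch under several assumptions: two in-memory SQLite databases stand in for the two servers, the rows are invented, only some columns of R1 and R2 are created, and values are stored in plaintext (encryption of Name is omitted for brevity).

```python
import sqlite3

# Server 1 holds fragment R1 (subset of columns, hypothetical rows).
server1 = sqlite3.connect(":memory:")
server1.execute("CREATE TABLE R1 (TID INTEGER, Name TEXT, Gender TEXT, Salary INTEGER)")
server1.executemany("INSERT INTO R1 VALUES (?, ?, ?, ?)", [
    (1, "Tom", "M", 70000),
    (2, "Tom", "M", 50000),
    (3, "Ann", "F", 90000),
    (4, "Tom", "M", 55000),
    (5, "Tom", "M", 40000),
])

# Server 2 holds fragment R2 (subset of columns, hypothetical rows).
server2 = sqlite3.connect(":memory:")
server2.execute("CREATE TABLE R2 (TID INTEGER, Name TEXT, DoB TEXT, Position TEXT, ZipCode TEXT)")
server2.executemany("INSERT INTO R2 VALUES (?, ?, ?, ?, ?)", [
    (1, "Tom", "1980-01-01", "Staff", "10001"),
    (2, "Tom", "1975-05-05", "Staff", "94305"),
    (3, "Ann", "1990-09-09", "Staff", "94305"),
    (4, "Tom", "1985-03-03", "Manager", "94305"),
    (5, "Tom", "1988-08-08", "Staff", "10001"),
])

# Query 1 and Query 2, each pushed to one server.
q1 = server1.execute("SELECT TID, Name, Salary FROM R1 WHERE Name = 'Tom'").fetchall()
q2 = server2.execute("SELECT TID, DoB, ZipCode FROM R2 WHERE Position = 'Staff'").fetchall()

# Client side: join on TID, then apply the predicate that could not be pushed.
by_tid = {tid: (dob, zipcode) for tid, dob, zipcode in q2}
result = []
for tid, name, salary in q1:
    if tid in by_tid:
        dob, zipcode = by_tid[tid]
        if zipcode == "94305" or salary > 60000:
            result.append((name, dob, salary))
```

The conjunctive predicates shrink what each server returns, while the disjunction across ZipCode and Salary is applied only after the client reassembles tuples by TID.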

Performance Gain Experiment

Iterations Vs Privacy Constraints

Acknowledgements: Collaborators

Stanford Privacy Group

TRDDC Privacy Group

PORTIA, TRUST, Google

March 18, 2011

Back Up slides