Page 1: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

Venue and Date:

Center for Business and Graduate Studies

Dean’s Conference Room 1303

Open to the Public

Thursday, April 17, 2014 at 1 pm

Dissertation Committee:

Claude Turner, Ph.D. Chair

Soo-Yeon Ji, Ph.D. Member

Hoda El-Sayed, D.Sc. Member

Darsana Josyula, Ph.D. Member

Anthony Joseph, Ph.D. External Examiner

Department of Computer Science

Dissertation Defense

AN INVESTIGATION OF DATA PRIVACY AND

UTILITY USING MACHINE LEARNING AS A GAUGE

Kato Mivule For the Degree of

D.Sc. in Computer Science

Cosmas U. Nwokeafor, PhD

Dean, The Graduate School Lethia Jackson, D.Sc.

Chair, Computer Science Department

Page 2: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

DISSERTATION DEFENSE PRESENTATION BY KATO MIVULE

OUTLINE

• Introduction

o The Problem

o Contributions

• Literature Review

• Methodology

• Results and Discussion

o Results

o Discussion

• Conclusion and Future work

o Conclusion

o Future work

Kato Mivule – Bowie State University Department of Computer Science

Page 3: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge


CONTRIBUTIONS

1. A proposed data privacy engineering framework, SIED.

2. A proposed Comparative x-CEG data utility analysis heuristic.

3. A proposed Initial and Subsequent basic (IBP and SBP) privacy

indexes.

4. A proposed data swapping and noise addition hybrid model for

privacy.

5. A proposed privatized synthetic data generation model using

image and signal processing techniques (DT, DCT, and DWT).

6. An implementation of k-anonymity by minimizing information

loss via the frequency count analysis and synthetic data

replacement model.


Page 4: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge


THE PROBLEM

Finding a user-defined balance between data privacy and utility needs, with trade-offs.

• The challenge of ambiguous definitions of privacy and utility.

“Perfect privacy can be achieved by publishing nothing at all, but this has no utility; perfect utility can be obtained by publishing the data exactly as received, but this offers no privacy.” – Cynthia Dwork (2006)

Data Privacy

~Differential Privacy

~Noise addition

~K-anonymity, etc...

Data Utility

~Completeness

~Currency

~Accuracy


Page 5: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge


MOTIVATION

• Generate privatized synthetic data sets that meet acceptable

privacy and utility requirements.

• Data Privacy Engineering - Adapt engineering principles in the

data privacy and utility process.

HYPOTHESIS

• Fine-tuning parameters in the data privacy procedure,

specifically using perturbation methods such as noise addition

and differential privacy, lowers the classification error and thus

generates better data utility.


Page 6: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge


LITERATURE REVIEW

The data privacy and utility problem

• Wong et al. (2007); Meyerson & Williams (2004); Park & Shim (2007): Data privatization diminishes data utility – an NP-hard problem.

• Krause & Horvitz (2010); Wang & Wu (2005): Optimal data utility with privacy is a well-documented NP-hard problem.

• Ghosh et al. (2008); Brenner & Nissim (2010): Trade-offs are needed in the privacy versus utility process – also NP-hard.

• Li & Li, (2009): It is not possible to equate privacy and utility.

• Fienberg, Rinaldo, & Yang (2010): Even with differential privacy, privacy is granted but at a loss of data utility.

Page 7: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge


LITERATURE REVIEW

Techniques and Algorithms used in this study

• Data Privacy

• Noise Addition

• Logarithmic Noise

• Multiplicative Noise

• Differential Privacy

• K-anonymity

• Image and Signal Processing

• Distance Transform

• Discrete Cosine Transform

• Discrete Wavelet Transform

• Gaussian Filtering

• Machine Learning

• KNN

• Neural Networks

• Naïve Bayes

• Decision Trees

• AdaBoost M1


Page 8: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge


METHODOLOGY

Contribution 1 – SIED, a data privacy engineering framework

• SIED phases – Specifications, Implementation, Evaluation, and Dissemination

• Motivation: Given any original dataset 𝑋, a set of data privacy engineering phases should be

followed from start to completion in the generation of a privatized dataset 𝑋′.


Page 9: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge


METHODOLOGY

Contribution 1 – SIED, a data privacy engineering framework -

The SIED Specification Phase:


Page 10: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge


METHODOLOGY

Contribution 1 – SIED, a data privacy engineering framework -

The SIED Implementation Phase:


Page 11: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge


METHODOLOGY

Contribution 1 – SIED, a data privacy engineering framework -

The SIED Evaluation Phase:


Page 12: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge


METHODOLOGY

Contribution 1 – SIED, a data privacy engineering framework -

The SIED Dissemination Phase:


Page 13: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge


METHODOLOGY

Contribution 2 – A Data Privacy Parameter Mapping Heuristic

•Categorize parameters for effective fine-tuning – better privacy

and utility. What parameters need adjustment in the data privacy

process?

• Category 1 – Data Utility Goal Parameters: for example Accuracy, Currency, and Completeness.

• Category 2 – Data Privacy Algorithm Parameters: for example the value k in k-anonymity and ε in Noise Addition and Differential Privacy.

• Category 3 – Application Parameters (e.g., a Machine Learning Classifier): for example weak learners in AdaBoost.

Parameter Adjustment and Fine-tuning → Trade-offs → Data Privacy and Utility Preservation


Page 14: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge


METHODOLOGY

Contribution 3 – The x-CEG and Comparative x-CEG Heuristics

The Classification Error Gauge (x-CEG) replicates x times until threshold t is reached:

1. Get the original dataset.

2. Apply data privacy.

3. Classify the privatized dataset.

4. If error <= t: better utility might be achieved – publish.

5. If error > t: adjust the data privacy parameters and the classifier parameters, then repeat.

The Comparative x-CEG heuristic employs multiple data privacy and classifier algorithms in each run.
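The loop above can be sketched in Python. This is a minimal sketch only: it assumes Gaussian noise addition as the privacy method and a toy 1-NN error gauge (the dissertation used MATLAB and RapidMiner; all function names here are illustrative, not from the dissertation).

```python
import numpy as np

rng = np.random.default_rng(0)

def privatize(X, mu, sigma):
    # Gaussian noise addition: X' = X + N(mu, sigma)
    return X + rng.normal(mu, sigma, X.shape)

def knn_error(X, y, Xp):
    # Toy 1-NN gauge: label each privatized record by its nearest
    # original record, then measure the misclassification rate.
    d = ((Xp[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return float((y[d.argmin(axis=1)] != y).mean())

def x_ceg(X, y, t, sigmas, mu=0.0):
    # Replicate: privatize, classify, compare the error to threshold t;
    # adjust the privacy parameter (here sigma) until error <= t.
    err = None
    for sigma in sigmas:
        Xp = privatize(X, mu, sigma)
        err = knn_error(X, y, Xp)
        if err <= t:
            return Xp, sigma, err   # acceptable utility: publish
    return None, None, err          # no parameter setting met the threshold
```

The Comparative variant would wrap this loop over several privacy algorithms and classifiers per run, keeping the best-scoring combination.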


Page 15: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge


METHODOLOGY

Contribution 4 – The x-CEG Threshold determination heuristic

• Average value of a function = integral / interval: $AVF = \frac{1}{b-a}\int_{a}^{b} f(x)\,dx$

• Midpoint-rule approximation: $AVF \approx \frac{1}{b-a}\sum_{i=1}^{n} f(\bar{x}_i)\,\Delta x$, where $\Delta x = \frac{b-a}{n}$ and $\bar{x}_i = \frac{1}{2}(x_{i-1} + x_i)$

• The mean: $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$

• $t = \mathrm{Max}[\max(\text{mean}), \max(\text{midpoint})]$

• The threshold 𝑡 is chosen as the highest point between the max mean and max mid-point values.

• The classification error of the original data set is used as a benchmark in measuring privatized synthetic data sets.
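The threshold rule above can be sketched as follows; reading "mid-point" as the midpoint-rule average of consecutive accuracy values is my assumption from the formulas on this slide.

```python
import numpy as np

def threshold_t(accuracy_matrix):
    # accuracy_matrix: one row per experiment run, one column per classifier.
    # For each classifier take the plain mean and the midpoint-rule average
    # (mean of consecutive-pair midpoints); t is the larger of the two maxima.
    A = np.asarray(accuracy_matrix, dtype=float)
    means = A.mean(axis=0)
    midpoints = ((A[:-1] + A[1:]) / 2.0).mean(axis=0)
    return max(means.max(), midpoints.max())
```

For the accuracy table on the next slides, this kind of computation is what yields a per-classifier mean row, a mid-point row, and a max column from which t is selected.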


Page 16: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge


METHODOLOGY

Contribution 5 – The Initial and Subsequent Privacy Indices

• Let $X$ be the set of all values in the database such that $X = \{X_1, \ldots, X_n\}$.

• Let $X'$ be the set of items to be privatized such that $X' = \{X'_1, \ldots, X'_n\}$.

• Let $Y$ be the set of items that get revealed after our initial privacy measurement.

• Where $|X'| \le |X|$ and $|Y| \le |X|$.

• $X$, $X'$, and $Y$ are countable, such that there is a one-to-one (injective) function $f\colon X \to N$, $X' \to N$, $Y \to N$ from $X$, $X'$, and $Y$ to the natural numbers $N = \{0, 1, 2, 3, \ldots, n\}$, respectively.

• Initial Basic Privacy: $IBP = \frac{|X'|}{|X|} \times 100$

• Subsequent Basic Privacy: $SBP = \frac{|X' - Y|}{|X|} \times 100$

• where $|\cdot|$ denotes cardinality, the total count of elements in $X'$, $Y$, and $X$.

• IBP and SBP could be taken as percentages or normalized between 0 and 1.
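The two indices can be sketched directly from their definitions; the function names below are mine, and the inputs are assumed to be set-like collections of record identifiers.

```python
def ibp(X_priv, X):
    # Initial Basic Privacy: |X'| / |X| * 100
    return 100.0 * len(X_priv) / len(X)

def sbp(X_priv, Y, X):
    # Subsequent Basic Privacy: |X' - Y| / |X| * 100, discounting the
    # items Y revealed after the initial privacy measurement.
    return 100.0 * len(set(X_priv) - set(Y)) / len(X)
```

Dividing the results by 100 gives the normalized 0-to-1 form mentioned above.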


Page 17: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge


METHODOLOGY

Contribution 6 – The Filtered Comparative x-CEG Heuristic – using image and signal processing techniques to generate privatized synthetic data.


Page 18: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge


METHODOLOGY

Contribution 7– Data swapping and noise addition data privacy

hybrid - Generating privatized synthetic data using data swapping and

noise perturbation.


Page 19: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge


METHODOLOGY

Contribution 8 – Minimizing information loss with K-anonymity

• Implementation of k-anonymity by minimizing information loss via

the frequency count analysis and synthetic data replacement model.


Page 20: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

RESULTS AND DISCUSSION

Comparative x-CEG Results

• The Fisher Iris multivariate dataset from the UCI repository was used.

• 165 experiment runs – generating 165 privatized synthetic data sets.

• Classifiers: KNN, Neural Nets, Decision Trees, AdaBoost, and Naïve Bayes.

• MATLAB for data privacy and RapidMiner for machine learning.

NOISE LEVEL KNN NEURAL NETS NAÏVE BAYES DECISION TREES ADABOOST M1

Original 96.00 96.67 96.00 94.67 97.33

Noise1(μ=5.8, σ=0.8) 66.67 74.00 64.00 66.67 64.00

Noise2(μ=0, σ=0.8) 61.33 72.00 66.67 63.33 54.67

Noise3(μ=1, σ=0.8) 68.67 74.00 69.33 66.67 60.00

Noise4(μ=2, σ=0.8) 68.67 62.67 62.00 59.33 54.67

Noise5(μ=3, σ=0.8) 72.67 66.67 67.33 61.33 50.67

Noise6(μ=4, σ=0.8) 75.33 82.67 70.00 72.00 63.33

Noise1a(μ=5, σ=0.1) 94.00 93.33 92.67 91.33 92.67

Noise1b(μ=5, σ=0.2) 92.00 94.67 91.33 90.00 90.67

Noise1c(μ=5, σ=0.3) 93.33 94.00 90.67 92.00 94.00

Noise1d(μ=5, σ=0.4) 90.00 93.33 87.33 86.67 86.67

Noise2b(μ=0, σ=0.1) 96.67 96.67 94.00 96.67 92.00

Noise2c(μ=0, σ=0.2) 89.33 92.00 86.67 87.33 90.00

Noise2d(μ=0, σ=0.3) 87.33 90.00 86.67 84.67 85.33

Noise2e(μ=0, σ=0.4) 87.33 90.00 86.67 84.67 85.33

Noise3a(μ=1, σ=0.4) 87.33 87.33 85.33 84.00 83.33

Noise3b(μ=1, σ=0.1) 97.33 94.00 96.00 96.00 94.67

Noise3c(μ=1, σ=0.2) 92.67 95.33 91.33 90.67 93.33

Noise3d(μ=1, σ=0.3) 94.67 95.33 91.33 94.00 90.00

Noise4a(μ=2, σ=0.1) 94.67 98.00 98.00 96.67 98.00

Noise4b(μ=2, σ=0.2) 93.33 96.00 92.67 91.33 90.67

Noise4c(μ=2, σ=0.3) 88.00 91.33 89.33 90.00 86.67

Noise4d(μ=2, σ=0.4) 87.33 87.33 85.33 84.00 83.33

Noise5a(μ=3, σ=0.1) 97.33 94.00 96.00 96.00 94.67

Noise5b(μ=3, σ=0.2) 92.67 95.33 91.33 90.67 93.33

Noise5c(μ=3, σ=0.3) 94.67 95.33 91.33 94.00 90.00

Noise5d(μ=3, σ=0.4) 93.33 94.00 93.33 92.00 87.33

Noise6a(μ=4, σ=0.1) 78.00 87.33 87.33 82.67 84.67

Noise6b(μ=4, σ=0.2) 93.33 95.33 94.00 93.33 92.67

Noise6c(μ=4, σ=0.3) 91.33 92.00 92.00 90.00 92.00

Noise6d(μ=4, σ=0.4) 78.00 87.33 88.67 82.67 84.67

Multiplicative 56.67 68.67 59.33 64.67 58.00

Logarithmic 50.67 58.00 56.00 53.33 57.33


Page 21: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge


RESULTS AND DISCUSSION

Comparative x-CEG Results

• A bar chart depiction of the Comparative x-CEG classification accuracy results


Page 22: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge


RESULTS AND DISCUSSION - Comparative x-CEG Results

Comparative x-CEG classifier performance results – Neural Nets were the most resilient.


Page 23: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

RESULTS AND DISCUSSION

• x-CEG Threshold Determination Results

• Threshold $t = \mathrm{Max}[\max(\text{mean}), \max(\text{midpoint})]$

• The threshold value is chosen heuristically using the mid-point value classification accuracy of

87.33% for the Neural Nets.

Statistic KNN NEURAL NETS NAÏVE BAYES DECISION TREES ADABOOST M1 MAX

Mean 84.87 87.41 84.54 83.74 82.30 87.41

Mid-Point 80.18 82.48 79.81 79.05 77.51 82.48

Max 84.87 87.41 84.54 83.74 82.30 87.41


Page 24: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

RESULTS AND DISCUSSION

• x-CEG Threshold Determination Results

• Threshold $t = \mathrm{Max}[\max(\text{mean}), \max(\text{midpoint})]$

• The threshold value is chosen heuristically using the mid-point value classification accuracy of 87.33% for the

Neural Nets.


Page 25: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

RESULTS AND DISCUSSION

• x-CEG Threshold Determination Results

• Threshold $t = \mathrm{Max}[\max(\text{mean}), \max(\text{midpoint})]$

• The threshold value is chosen heuristically using the mid-point value classification accuracy of 87.33% for the

Neural Nets.


Page 26: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

RESULTS AND DISCUSSION

• How much privacy? – statistical traits of the original and privatized data.


Page 27: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

RESULTS AND DISCUSSION

• How much privacy? – statistical traits of the original and privatized data.


Page 28: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

RESULTS AND DISCUSSION

• How much privacy? – statistical traits of the original and privatized data.


Page 29: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

RESULTS AND DISCUSSION

• How much privacy? – statistical traits of the original and privatized data.

Statistic Value

Original Data MSE 15.8937

Privatized Data MSE 24.0875

Original Data Entropy -3.05E+04

Privatized Data Entropy -5.05E+04

Correlation 0.9808

MSE Difference 8.1938

Entropy Difference -2.00E+04


Page 30: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

RESULTS AND DISCUSSION – Data Swapping and Noise Addition Hybrid

• 330 data sets generated from the data swapping and noise addition hybrid experiment.

• Optimal data swap for acceptable privacy and utility levels is between 5% and 10% data swap.

• The two data sets satisfied the threshold criteria after the Comparative x-CEG:

• noise ~ (μ = 1, σ = 0.1) at 5% swap.

• noise ~ (μ = 5, σ = 0.1) at 5% swap.
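A minimal sketch of the hybrid, assuming "swapping" means randomly permuting a chosen fraction of records among themselves before Gaussian noise is added (the exact swap procedure in the dissertation may differ; the function name is illustrative):

```python
import numpy as np

def swap_noise_hybrid(X, swap_frac=0.05, mu=1.0, sigma=0.1, seed=0):
    # Hybrid privatization: randomly swap a small fraction of records,
    # then add Gaussian noise N(mu, sigma) to every value.
    rng = np.random.default_rng(seed)
    Xp = np.asarray(X, dtype=float).copy()
    n = len(Xp)
    k = max(2, int(round(swap_frac * n)))       # at least one swap pair
    idx = rng.choice(n, size=k, replace=False)  # records chosen for swapping
    Xp[idx] = Xp[rng.permutation(idx)]          # permute the chosen rows
    return Xp + rng.normal(mu, sigma, Xp.shape)
```

Note that swapping alone preserves each attribute's multiset of values; only the subsequent noise step changes them, which is why low swap rates (5-10%) can keep utility acceptable.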


Page 31: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

RESULTS AND DISCUSSION – Data Swapping and Noise Addition Hybrid


Page 32: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

RESULTS AND DISCUSSION – Data Swapping and Noise Addition Hybrid

• Best classification accuracy obtained between 5% and 10% data swap.


Page 33: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

RESULTS AND DISCUSSION – Signal Processing and Data Privacy Hybrid

Privatized synthetic data sets using Discrete Cosine Transforms (DCT)


Synthetic DCT-based Sepal Length data results
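A DCT-based synthesis step of this flavor can be sketched with a hand-rolled orthonormal DCT-II basis (so only NumPy is needed). Coefficient truncation and the noise scale below are illustrative choices, not the dissertation's exact parameters.

```python
import numpy as np

def dct_matrix(N):
    # Orthonormal DCT-II basis: C @ C.T == I, so C.T inverts the transform.
    n = np.arange(N)
    C = np.sqrt(2.0 / N) * np.cos(np.pi * np.outer(n, n + 0.5) / N)
    C[0] /= np.sqrt(2.0)
    return C

def dct_synthetic(x, keep=0.5, sigma=0.05, seed=0):
    # DCT-based synthesis: transform the attribute, drop high-frequency
    # coefficients, perturb the result with Gaussian noise, and invert.
    rng = np.random.default_rng(seed)
    C = dct_matrix(len(x))
    c = C @ x
    c[int(len(c) * keep):] = 0.0                         # truncate detail
    c += rng.normal(0.0, sigma * np.abs(c).max(), len(c))  # perturb
    return C.T @ c
```

Applied to an attribute such as Sepal Length, this yields a privatized synthetic column whose broad shape follows the original while individual values are perturbed.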


Page 34: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

RESULTS AND DISCUSSION – Signal Processing and Data Privacy Hybrid

Privatized synthetic data sets using Discrete Cosine Transforms (DCT)


Synthetic Filtered DCT-based Sepal Length data results


Page 35: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

RESULTS AND DISCUSSION – Signal Processing and Data Privacy Hybrid

Privatized synthetic data sets using Discrete Cosine Transforms (DCT)


Filtered DCT-based data descriptive statistics – skeletal structure not kept as in DT-based data


Page 36: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

RESULTS AND DISCUSSION – Signal Processing and Data Privacy Hybrid

Privatized synthetic data sets using Discrete Cosine Transforms (DCT)


Filtered DCT-based data inference statistics – low correlation


Page 37: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

RESULTS AND DISCUSSION – Image Processing and Data Privacy Hybrid

Privatized synthetic data sets using Distance Transforms (DT) – Skeletal Structure

kept.


DT-based Sepal Length data results


Page 38: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

RESULTS AND DISCUSSION – Image Processing and Data Privacy Hybrid

Privatized synthetic data sets using Distance Transforms (DT) – Skeletal Structure

kept.


Filtered DT-based Sepal Length data results


Page 39: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

RESULTS AND DISCUSSION – Image Processing and Data Privacy Hybrid

Privatized synthetic data sets using Distance Transforms (DT) – Skeletal Structure

kept.


Filtered DT-based data descriptive statistics – skeletal structure kept


Page 40: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

RESULTS AND DISCUSSION – Image Processing and Data Privacy Hybrid

Privatized synthetic data sets using Distance Transforms (DT) – Skeletal Structure

kept.


Filtered DT-based data inference statistics – high correlation


Page 41: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

RESULTS AND DISCUSSION – Distance Transforms Based Data and the Clustering Test

DT-based synthetic data produced the best Davies-Bouldin index at 0.419 after filtering, outperforming the original data.

Page 42: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

RESULTS AND DISCUSSION – Distance Transforms Based Data and the Clustering Test


Page 43: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

RESULTS AND DISCUSSION – Distance Transforms Based Data and the Clustering Test

Clustering results of the Original Fisher Iris Data


Page 44: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

RESULTS AND DISCUSSION – Distance Transforms Based Data and the Clustering Test

Clustering results of the DT-based synthetic Fisher Iris Data


Page 45: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

RESULTS AND DISCUSSION – Distance Transforms Based Data and the Clustering Test

Clustering Results of the Filtered DT-based Fisher Iris Data.

Clustering greatly improved after filtering.


Page 46: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

RESULTS AND DISCUSSION

DT, DCT, and DWT improved classification accuracy after filtering.


Page 47: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

Results – Signal Processing – The Machine Learning Classification Error Test


Priv Synth Data NN KNN NB DT AdaBoost Max

Mean 91.00 87.95 86.07 86.74 84.33 91.00

MID-POINT 75.83 72.78 71.65 72.31 70.39 75.83

Max 91.00 87.95 86.07 86.74 84.33 91.00


Page 48: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

RESULTS AND DISCUSSION - Non-Interactive Differential Privacy (DP)

• Results of the Fisher Iris data after DP – too much noise is an issue with DP.
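Non-interactive DP of this kind is commonly realized with the Laplace mechanism; the sketch below assumes per-attribute range as the sensitivity, which may differ from the dissertation's exact setup, and the function name is illustrative.

```python
import numpy as np

def dp_release(X, epsilon=1.0, seed=0):
    # Non-interactive differential privacy sketch: perturb every value
    # with Laplace noise scaled to (per-attribute sensitivity / epsilon).
    # Smaller epsilon => more noise => stronger privacy, lower utility.
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    sensitivity = X.max(axis=0) - X.min(axis=0)  # attribute range as a proxy
    return X + rng.laplace(0.0, sensitivity / epsilon, X.shape)
```

The heavy-tailed Laplace noise is exactly what produces the outliers that the later slides remove with Gaussian filtering.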


Page 49: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

RESULTS AND DISCUSSION - Non-Interactive Differential Privacy (DP)

• Classification accuracy of DP data (before filtering) reduces with increased

DP levels.


Page 50: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

RESULTS AND DISCUSSION - Non-Interactive Differential Privacy (DP)

• Improved Classification accuracy of DP data sets after filtering.


Page 51: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

RESULTS AND DISCUSSION - Non-Interactive Differential Privacy (DP)

• Comparative descriptive statistics of the original, DP, and filtered DP-based data.

• Skeletal structure is not kept as in DT-based data, but outlier noise is removed in DP-based data.


Page 52: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

Results – Non-Interactive Differential Privacy – Inference Statistics


Page 53: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

Results – Non-Interactive Differential Privacy – How much DP?


Page 54: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

Results – Non-Interactive Differential Privacy – How much DP?


Page 55: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

RESULTS AND DISCUSSION– Data Privacy using K-Anonymity

• Suppress all items where k = 1.


Page 56: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

RESULTS AND DISCUSSION– Data Privacy using K-Anonymity

• Replace suppressed items with new synthetic values (most frequent values) such

that k > 1 for all items.
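The frequency-count-and-replacement idea can be sketched for a single quasi-identifier column; using the most frequent value as the synthetic replacement is an assumption for illustration, and the function name is mine.

```python
from collections import Counter

def enforce_k2(values):
    # Frequency-count analysis: a value occurring only once has k = 1
    # and would be suppressed; replace it with a synthetic stand-in
    # (here, the column's most frequent value) so every released item
    # has k > 1 -- minimizing information loss versus full suppression.
    # Assumes at least one value occurs more than once.
    counts = Counter(values)
    mode = counts.most_common(1)[0][0]
    return [v if counts[v] > 1 else mode for v in values]
```

Compared with suppressing the singleton records outright, this keeps a plausible value in the published column, which is the information-loss reduction the slide describes.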


Page 57: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

RESULTS AND DISCUSSION – Data Privacy using K-Anonymity

• Only sensitive attributes removed – info loss minimized in published

attributes.


Page 58: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

RESULTS AND DISCUSSION – Data Privacy using K-Anonymity

• Only sensitive attributes removed – info loss minimized in published

attributes.


Page 59: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

CONCLUSION • The Comparative x-CEG: Empirical results from this study show that fine-tuning parameters in the data privacy

procedure, specifically, Noise Addition and Differential Privacy, and with adjustments to the machine learning

classifiers, lowers the classification error and thus generates better and desirable data utility. The hypothesis holds. The

x-CEG model could help in presenting acceptable trade-off points between privacy and utility.

• The SIED model: It is vital for the appropriate solicitation of data privacy requirements, which vary on a case-by-case basis; therefore SIED could serve as a suitable framework in such a data privacy engineering process.

• Privatized Synthetic Data Generation: Data swapping, Distance Transforms, Discrete Cosine Transforms, and

Discrete Wavelet Transforms, in combination with data privacy procedures allow for the generation of privatized

synthetic data sets. However, more research on optimal parameterization needs to be done; as well as using other signal

processing techniques.

• Distance Transforms and Filtering: Empirical results from this study show that a hybrid of Distance Transforms (DT)

and data privacy, in combination with filtering, maintains the skeletal structure of the original data, generates privatized

synthetic data with better classification accuracy results, thus better utility. However, more study needs to be done on

securing DT-based privatized data, to prevent attackers from reconstructing private data.

• Differential Privacy and Filtering: On the other hand, Differential Privacy (DP) offers strong privacy guarantees but at

the loss of data utility. However, empirical results from this study have shown that Gaussian filtering does reduce outlier

noise in DP-based data and with improved classification accuracy results.

• K-anonymity: Information loss could be minimized using frequency count analysis for privatized data models requiring

k-anonymity for confidentiality. Only remove sensitive attributes and use synthetics for suppressed values.

• Privacy versus Utility: Achieving optimal utility while granting privacy is still sought. Accurate classification could also mean loss of privacy; trade-offs must be made between privacy and utility.


Page 60: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

FUTURE WORK

• Future work includes:

• Furthering the state-of-the-art in Data Privacy Engineering by developing data-privacy-compliant software, data privacy modeling, and autonomous intelligent data privacy agent systems following the SIED framework.

• Applying data privacy and utility principles to digital forensics data, network traffic data, bioinformatics data, and big data.

• Studying the efficient generation of privatized synthetic data sets.

• Applying data privacy principles to real-time data, including realistic scenarios where users of data provide feedback on how useful the data was to them.

• Showing, analytically, differences in performance between the various methods introduced in this work, as well as other state-of-the-art methods.


Page 61: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

PUBLICATIONS

1. Kato Mivule, “Towards Agent-based Data Privacy Engineering”, Proceedings of the Sixth International Conference on Advanced Cognitive Technologies and

Applications – COGNITIVE 2014, May 25 – May 30, 2014 (In Print), Venice, Italy.

2. Kato Mivule and Claude Turner, “SIED, A Data Privacy Engineering Framework”, Abstracts, Emerging Researchers National Conference in STEM (ERN 2014),

Page A239, ISBN 978-0-87168-757-9, Feb 20-22, 2014, Washington DC, USA. [Best Oral Presentation Award]

3. Kato Mivule and Claude Turner, International Journal of Computer Science and Mobile Computing, ICMIC13, December 2013, Pages 36-43, Dec 17-18, 2013, Trivandrum, Kerala, India.

4. Kato Mivule and Claude Turner, A Comparative Analysis of Data Privacy and Utility Parameter Adjustment, Using Machine Learning Classification as a Gauge,

Procedia Computer Science, Volume 20, 2013, Pages 414-419, ISSN 1877-0509, Nov 13-15, Baltimore, MD, USA.

5. Kato Mivule and Claude Turner, “An Investigation of Data Privacy and Utility Preservation Using KNN Classification as a Gauge”, International Conference on

Information and Knowledge Engineering (IKE 2013), July 22-25, Pages 203-204, Las Vegas, NV, USA.

6. Kato Mivule, Darsana Josyula, and Claude Turner, “Data Privacy Preservation in Multi-Agent Learning Systems”, Proceedings of the Fifth International Conference

on Advanced Cognitive Technologies and Applications – COGNITIVE 2013, May 27 - June 1, 2013, Pages 14-20, Valencia, Spain.

7. Kato Mivule, Claude Turner, Soo-Yeon Ji, "Towards A Differential Privacy and Utility Preserving Machine Learning Classifier", Procedia Computer Science, 2012,

Pages 176-181, Washington DC, USA.

8. Kato Mivule, Stephen Otunba, Tattwamasi Tripathy, and Sharad Sharma, "Implementation of Data Privacy and Security in an Online Student Health Records System", Proceedings of the ISCA 21st International Conference on Software Engineering and Data Engineering (SEDE-2012), Pages 143-148, Los Angeles, CA, USA.

9. Kato Mivule, Claude Turner, "Applying Data Privacy Techniques on Published Data in Uganda", Proceedings of the 2012 International Conference on e-Learning, e-

Business, Enterprise Information Systems, and e-Government (EEE 2012), Pages 110-115, Las Vegas, NV, USA.

10. Kato Mivule, "Utilizing Noise Addition for Data Privacy, an Overview", Proceedings of the International Conference on Information and Knowledge Engineering

(IKE 2012), Pages 65-71, Las Vegas, NV, USA.


Page 62: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

THANK YOU!

QUESTIONS?

[email protected]
