Page 1: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

Venue and Date:

Center for Business and Graduate Studies

Dean’s Conference Room 1303

Open to the Public

Thursday, April 17, 2014 at 1 pm

Dissertation Committee:

Claude Turner, Ph.D. Chair

Soo-Yeon Ji, Ph.D. Member

Hoda El-Sayed, D.Sc. Member

Darsana Josyula, Ph.D. Member

Anthony Joseph, Ph.D. External Examiner

Department of Computer Science

Dissertation Defense

AN INVESTIGATION OF DATA PRIVACY AND

UTILITY USING MACHINE LEARNING AS A GAUGE

Kato Mivule For the Degree of

D.Sc. in Computer Science

Cosmas U. Nwokeafor, PhD

Dean, The Graduate School Lethia Jackson, D.Sc.

Chair, Computer Science Department

Page 2: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

DISSERTATION DEFENSE PRESENTATION BY KATO MIVULE

OUTLINE

• Introduction

o The Problem

o Contributions

• Literature Review

• Methodology

• Results and Discussion

o Results

o Discussion

• Conclusion and Future work

o Conclusion

o Future work

Kato Mivule – Bowie State University Department of Computer Science

Page 3: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge


CONTRIBUTIONS

1. A proposed data privacy engineering framework, SIED.

2. A proposed Comparative x-CEG data utility analysis heuristic.

3. A proposed Initial and Subsequent basic (IBP and SBP) privacy

indexes.

4. A proposed data swapping and noise addition hybrid model for

privacy.

5. A proposed privatized synthetic data generation model using

image and signal processing techniques (DT, DCT, and DWT).

6. An implementation of k-anonymity by minimizing information

loss via the frequency count analysis and synthetic data

replacement model.


Page 4: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge


THE PROBLEM

Finding a user-defined balance between data privacy and utility needs, with trade-offs.

• The challenge of ambiguous definitions of privacy and utility.

“Perfect privacy can be achieved by publishing nothing at all, but this has no utility; perfect utility can be obtained by publishing the data exactly as received, but this offers no privacy.” – Cynthia Dwork (2006)

Data Privacy

~Differential Privacy

~Noise addition

~K-anonymity, etc...

Data Utility

~Completeness

~Currency

~Accuracy


Page 5: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge


MOTIVATION

• Generate privatized synthetic data sets that meet acceptable

privacy and utility requirements.

• Data Privacy Engineering - Adapt engineering principles in the

data privacy and utility process.

HYPOTHESIS

• Fine-tuning parameters in the data privacy procedure,

specifically using perturbation methods such as noise addition

and differential privacy, lowers the classification error and thus

generates better data utility.


Page 6: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge


LITERATURE REVIEW

The data privacy and utility problem

• Wong et al. (2007); Meyerson & Williams (2004); Park & Shim (2007): Data privatization diminishes data utility – an NP-hard problem.

• Krause & Horvitz (2010); Wang & Wu (2005): Optimal data utility with privacy is a well-documented NP-hard problem.

• Ghosh et al. (2008); Brenner & Nissim (2010): Trade-offs are needed in the privacy versus utility process – also NP-hard.

• Li & Li, (2009): It is not possible to equate privacy and utility.

• Fienberg, Rinaldo, & Yang (2010): Even with differential privacy, privacy is granted but at a loss of data utility.

Page 7: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge


LITERATURE REVIEW

Techniques and Algorithms used in this study

• Data Privacy

• Noise Addition

• Logarithmic Noise

• Multiplicative Noise

• Differential Privacy

• K-anonymity

• Image and Signal Processing

• Distance Transform

• Discrete Cosine Transform

• Discrete Wavelet Transform

• Gaussian Filtering

• Machine Learning

• KNN

• Neural Networks

• Naïve Bayes

• Decision Trees

• AdaBoost M1


Page 8: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge


METHODOLOGY

Contribution 1 – SIED, a data privacy engineering framework

• SIED phases – Specifications, Implementation, Evaluation, and Dissemination

• Motivation: Given any original dataset 𝑋, a set of data privacy engineering phases should be

followed from start to completion in the generation of a privatized dataset 𝑋′.


Page 9: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge


METHODOLOGY

Contribution 1 – SIED, a data privacy engineering framework -

The SIED Specification Phase:


Page 10: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge


METHODOLOGY

Contribution 1 – SIED, a data privacy engineering framework -

The SIED Implementation Phase:


Page 11: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge


METHODOLOGY

Contribution 1 – SIED, a data privacy engineering framework -

The SIED Evaluation Phase:


Page 12: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge


METHODOLOGY

Contribution 1 – SIED, a data privacy engineering framework -

The SIED Dissemination Phase:


Page 13: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge


METHODOLOGY

Contribution 2 – A Data Privacy Parameter Mapping Heuristic

•Categorize parameters for effective fine-tuning – better privacy

and utility. What parameters need adjustment in the data privacy

process?

• Category 1 – Data Utility Goal Parameters: for example Accuracy, Currency, and Completeness.

• Category 2 – Data Privacy Algorithm Parameters: for example the value k in k-anonymity and ε in Noise Addition and Differential Privacy.

• Category 3 – Application Parameters (e.g., a Machine Learning Classifier): for example weak learners in AdaBoost.

Parameter Adjustment and Fine-tuning → Trade-offs → Data Privacy and Utility Preservation


Page 14: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge


METHODOLOGY

Contribution 3 – The x-CEG and Comparative x-CEG Heuristics

The Classification Error Gauge (x-CEG) replicates x times until threshold t is reached:

1. Get the original dataset.

2. Apply data privacy.

3. Classify the privatized dataset.

4. If error <= t: better utility might be achieved – publish.

5. If error > t: adjust the data privacy parameters and the classifier parameters, then repeat.

The Comparative x-CEG heuristic employs multiple data privacy and classifier algorithms in each run.
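The loop above can be sketched in Python. This is a minimal sketch only: it assumes Gaussian noise addition as the privacy method and a toy 1-NN error gauge (the dissertation used MATLAB and RapidMiner; all function names here are illustrative, not from the dissertation).

```python
import numpy as np

rng = np.random.default_rng(0)

def privatize(X, mu, sigma):
    # Gaussian noise addition: X' = X + N(mu, sigma)
    return X + rng.normal(mu, sigma, X.shape)

def knn_error(X, y, Xp):
    # Toy 1-NN gauge: label each privatized record by its nearest
    # original record, then measure the misclassification rate.
    d = ((Xp[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return float((y[d.argmin(axis=1)] != y).mean())

def x_ceg(X, y, t, sigmas, mu=0.0):
    # Replicate: privatize, classify, compare the error to threshold t;
    # adjust the privacy parameter (here sigma) until error <= t.
    err = None
    for sigma in sigmas:
        Xp = privatize(X, mu, sigma)
        err = knn_error(X, y, Xp)
        if err <= t:
            return Xp, sigma, err   # acceptable utility: publish
    return None, None, err          # no parameter setting met the threshold
```

The Comparative variant would wrap this loop over several privacy algorithms and classifiers per run, keeping the best-scoring combination.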


Page 15: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge


METHODOLOGY

Contribution 4 – The x-CEG Threshold determination heuristic

• Average value of a function = integral / interval: $AVF = \frac{1}{b-a}\int_{a}^{b} f(x)\,dx$

• Midpoint-rule approximation: $AVF \approx \frac{1}{b-a}\sum_{i=1}^{n} f(\bar{x}_i)\,\Delta x$, where $\Delta x = \frac{b-a}{n}$ and $\bar{x}_i = \frac{1}{2}(x_{i-1} + x_i)$

• The mean: $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$

• $t = \mathrm{Max}[\max(\text{mean}), \max(\text{midpoint})]$

• The threshold 𝑡 is chosen as the highest point between the max mean and max mid-point values.

• The classification error of the original data set is used as a benchmark in measuring privatized synthetic data sets.
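The threshold rule above can be sketched as follows; reading "mid-point" as the midpoint-rule average of consecutive accuracy values is my assumption from the formulas on this slide.

```python
import numpy as np

def threshold_t(accuracy_matrix):
    # accuracy_matrix: one row per experiment run, one column per classifier.
    # For each classifier take the plain mean and the midpoint-rule average
    # (mean of consecutive-pair midpoints); t is the larger of the two maxima.
    A = np.asarray(accuracy_matrix, dtype=float)
    means = A.mean(axis=0)
    midpoints = ((A[:-1] + A[1:]) / 2.0).mean(axis=0)
    return max(means.max(), midpoints.max())
```

For the accuracy table on the next slides, this kind of computation is what yields a per-classifier mean row, a mid-point row, and a max column from which t is selected.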


Page 16: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge


METHODOLOGY

Contribution 5 – The Initial and Subsequent Privacy Indices

• Let $X$ be the set of all values in the database such that $X = \{X_1, \ldots, X_n\}$.

• Let $X'$ be the set of items to be privatized such that $X' = \{X'_1, \ldots, X'_n\}$.

• Let $Y$ be the set of items that get revealed after our initial privacy measurement.

• Where $|X'| \le |X|$ and $|Y| \le |X|$.

• $X$, $X'$, and $Y$ are countable, such that there is a one-to-one (injective) function $f\colon X \to N$, $X' \to N$, $Y \to N$ from $X$, $X'$, and $Y$ to the natural numbers $N = \{0, 1, 2, 3, \ldots, n\}$, respectively.

• Initial Basic Privacy: $IBP = \frac{|X'|}{|X|} \times 100$

• Subsequent Basic Privacy: $SBP = \frac{|X' - Y|}{|X|} \times 100$

• where $|\cdot|$ denotes cardinality, the total count of elements in $X'$, $Y$, and $X$.

• IBP and SBP could be taken as percentages or normalized between 0 and 1.
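The two indices can be sketched directly from their definitions; the function names below are mine, and the inputs are assumed to be set-like collections of record identifiers.

```python
def ibp(X_priv, X):
    # Initial Basic Privacy: |X'| / |X| * 100
    return 100.0 * len(X_priv) / len(X)

def sbp(X_priv, Y, X):
    # Subsequent Basic Privacy: |X' - Y| / |X| * 100, discounting the
    # items Y revealed after the initial privacy measurement.
    return 100.0 * len(set(X_priv) - set(Y)) / len(X)
```

Dividing the results by 100 gives the normalized 0-to-1 form mentioned above.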


Page 17: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge


METHODOLOGY

Contribution 6 – The Filtered Comparative x-CEG Heuristic – using image and signal processing techniques to generate privatized synthetic data.


Page 18: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge


METHODOLOGY

Contribution 7– Data swapping and noise addition data privacy

hybrid - Generating privatized synthetic data using data swapping and

noise perturbation.


Page 19: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge


METHODOLOGY

Contribution 8 – Minimizing information loss with K-anonymity

• Implementation of k-anonymity by minimizing information loss via

the frequency count analysis and synthetic data replacement model.


Page 20: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

RESULTS AND DISCUSSION

Comparative x-CEG Results

• The Fisher Iris multivariate dataset from the UCI repository was used.

• 165 experiment runs – generating 165 privatized synthetic data sets.

• Classifiers: KNN, Neural Nets, Decision Trees, AdaBoost, and Naïve Bayes.

• MATLAB for data privacy and RapidMiner for machine learning.

NOISE LEVEL KNN NEURAL NETS NAÏVE BAYES DECISION TREES ADABOOST M1

Original 96.00 96.67 96.00 94.67 97.33

Noise1(μ=5.8, σ=0.8) 66.67 74.00 64.00 66.67 64.00

Noise2(μ=0, σ=0.8) 61.33 72.00 66.67 63.33 54.67

Noise3(μ=1, σ=0.8) 68.67 74.00 69.33 66.67 60.00

Noise4(μ=2, σ=0.8) 68.67 62.67 62.00 59.33 54.67

Noise5(μ=3, σ=0.8) 72.67 66.67 67.33 61.33 50.67

Noise6(μ=4, σ=0.8) 75.33 82.67 70.00 72.00 63.33

Noise1a(μ=5, σ=0.1) 94.00 93.33 92.67 91.33 92.67

Noise1b(μ=5, σ=0.2) 92.00 94.67 91.33 90.00 90.67

Noise1c(μ=5, σ=0.3) 93.33 94.00 90.67 92.00 94.00

Noise1d(μ=5, σ=0.4) 90.00 93.33 87.33 86.67 86.67

Noise2b(μ=0, σ=0.1) 96.67 96.67 94.00 96.67 92.00

Noise2c(μ=0, σ=0.2) 89.33 92.00 86.67 87.33 90.00

Noise2d(μ=0, σ=0.3) 87.33 90.00 86.67 84.67 85.33

Noise2e(μ=0, σ=0.4) 87.33 90.00 86.67 84.67 85.33

Noise3a(μ=1, σ=0.4) 87.33 87.33 85.33 84.00 83.33

Noise3b(μ=1, σ=0.1) 97.33 94.00 96.00 96.00 94.67

Noise3c(μ=1, σ=0.2) 92.67 95.33 91.33 90.67 93.33

Noise3d(μ=1, σ=0.3) 94.67 95.33 91.33 94.00 90.00

Noise4a(μ=2, σ=0.1) 94.67 98.00 98.00 96.67 98.00

Noise4b(μ=2, σ=0.2) 93.33 96.00 92.67 91.33 90.67

Noise4c(μ=2, σ=0.3) 88.00 91.33 89.33 90.00 86.67

Noise4d(μ=2, σ=0.4) 87.33 87.33 85.33 84.00 83.33

Noise5a(μ=3, σ=0.1) 97.33 94.00 96.00 96.00 94.67

Noise5b(μ=3, σ=0.2) 92.67 95.33 91.33 90.67 93.33

Noise5c(μ=3, σ=0.3) 94.67 95.33 91.33 94.00 90.00

Noise5d(μ=3, σ=0.4) 93.33 94.00 93.33 92.00 87.33

Noise6a(μ=4, σ=0.1) 78.00 87.33 87.33 82.67 84.67

Noise6b(μ=4, σ=0.2) 93.33 95.33 94.00 93.33 92.67

Noise6c(μ=4, σ=0.3) 91.33 92.00 92.00 90.00 92.00

Noise6d(μ=4, σ=0.4) 78.00 87.33 88.67 82.67 84.67

Multiplicative 56.67 68.67 59.33 64.67 58.00

Logarithmic 50.67 58.00 56.00 53.33 57.33


Page 21: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge


RESULTS AND DISCUSSION

Comparative x-CEG Results

• A bar chart depiction of the Comparative x-CEG classification accuracy results


Page 22: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge


RESULTS AND DISCUSSION - Comparative x-CEG Results

Comparative x-CEG classifier performance results – Neural Nets were the most resilient.


Page 23: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

RESULTS AND DISCUSSION

• x-CEG Threshold Determination Results

• Threshold $t = \mathrm{Max}[\max(\text{mean}), \max(\text{midpoint})]$

• The threshold value is chosen heuristically using the mid-point value classification accuracy of

87.33% for the Neural Nets.

Statistic KNN NEURAL NETS NAÏVE BAYES DECISION TREES ADABOOST M1 MAX

Mean 84.87 87.41 84.54 83.74 82.30 87.41

Mid-Point 80.18 82.48 79.81 79.05 77.51 82.48

Max 84.87 87.41 84.54 83.74 82.30 87.41


Page 24: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

RESULTS AND DISCUSSION

• x-CEG Threshold Determination Results

• Threshold $t = \mathrm{Max}[\max(\text{mean}), \max(\text{midpoint})]$

• The threshold value is chosen heuristically using the mid-point value classification accuracy of 87.33% for the

Neural Nets.


Page 25: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

RESULTS AND DISCUSSION

• x-CEG Threshold Determination Results

• Threshold $t = \mathrm{Max}[\max(\text{mean}), \max(\text{midpoint})]$

• The threshold value is chosen heuristically using the mid-point value classification accuracy of 87.33% for the

Neural Nets.


Page 26: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

RESULTS AND DISCUSSION

• How much privacy? – statistical traits of the original and privatized data.


Page 27: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

RESULTS AND DISCUSSION

• How much privacy? – statistical traits of the original and privatized data.


Page 28: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

RESULTS AND DISCUSSION

• How much privacy? – statistical traits of the original and privatized data.


Page 29: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

RESULTS AND DISCUSSION

• How much privacy? – statistical traits of the original and privatized data.

Statistic Value

Original Data MSE 15.8937

Privatized Data MSE 24.0875

Original Data Entropy -3.05E+04

Privatized Data Entropy -5.05E+04

Correlation 0.9808

MSE Difference 8.1938

Entropy Difference -2.00E+04


Page 30: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

RESULTS AND DISCUSSION – Data Swapping and Noise Addition Hybrid

• 330 data sets generated from the data swapping and noise addition hybrid experiment.

• Optimal data swap for acceptable privacy and utility levels is between 5% and 10% data swap.

• The two data sets satisfied the threshold criteria after the Comparative x-CEG:

• noise ~ (μ = 1, σ = 0.1) at 5% swap.

• noise ~ (μ = 5, σ = 0.1) at 5% swap.
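A minimal sketch of the hybrid, assuming "swapping" means randomly permuting a chosen fraction of records among themselves before Gaussian noise is added (the exact swap procedure in the dissertation may differ; the function name is illustrative):

```python
import numpy as np

def swap_noise_hybrid(X, swap_frac=0.05, mu=1.0, sigma=0.1, seed=0):
    # Hybrid privatization: randomly swap a small fraction of records,
    # then add Gaussian noise N(mu, sigma) to every value.
    rng = np.random.default_rng(seed)
    Xp = np.asarray(X, dtype=float).copy()
    n = len(Xp)
    k = max(2, int(round(swap_frac * n)))       # at least one swap pair
    idx = rng.choice(n, size=k, replace=False)  # records chosen for swapping
    Xp[idx] = Xp[rng.permutation(idx)]          # permute the chosen rows
    return Xp + rng.normal(mu, sigma, Xp.shape)
```

Note that swapping alone preserves each attribute's multiset of values; only the subsequent noise step changes them, which is why low swap rates (5-10%) can keep utility acceptable.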


Page 31: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

RESULTS AND DISCUSSION – Data Swapping and Noise Addition Hybrid


Page 32: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

RESULTS AND DISCUSSION – Data Swapping and Noise Addition Hybrid

• Best classification accuracy obtained between 5% and 10% data swap.


Page 33: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

RESULTS AND DISCUSSION – Signal Processing and Data Privacy Hybrid

Privatized synthetic data sets using Discrete Cosine Transforms (DCT)


Synthetic DCT-based Sepal Length data results
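A DCT-based synthesis step of this flavor can be sketched with a hand-rolled orthonormal DCT-II basis (so only NumPy is needed). Coefficient truncation and the noise scale below are illustrative choices, not the dissertation's exact parameters.

```python
import numpy as np

def dct_matrix(N):
    # Orthonormal DCT-II basis: C @ C.T == I, so C.T inverts the transform.
    n = np.arange(N)
    C = np.sqrt(2.0 / N) * np.cos(np.pi * np.outer(n, n + 0.5) / N)
    C[0] /= np.sqrt(2.0)
    return C

def dct_synthetic(x, keep=0.5, sigma=0.05, seed=0):
    # DCT-based synthesis: transform the attribute, drop high-frequency
    # coefficients, perturb the result with Gaussian noise, and invert.
    rng = np.random.default_rng(seed)
    C = dct_matrix(len(x))
    c = C @ x
    c[int(len(c) * keep):] = 0.0                         # truncate detail
    c += rng.normal(0.0, sigma * np.abs(c).max(), len(c))  # perturb
    return C.T @ c
```

Applied to an attribute such as Sepal Length, this yields a privatized synthetic column whose broad shape follows the original while individual values are perturbed.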


Page 34: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

RESULTS AND DISCUSSION – Signal Processing and Data Privacy Hybrid

Privatized synthetic data sets using Discrete Cosine Transforms (DCT)


Synthetic Filtered DCT-based Sepal Length data results


Page 35: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

RESULTS AND DISCUSSION – Signal Processing and Data Privacy Hybrid

Privatized synthetic data sets using Discrete Cosine Transforms (DCT)


Filtered DCT-based data descriptive statistics – skeletal structure not kept as in DT-based data


Page 36: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

RESULTS AND DISCUSSION – Signal Processing and Data Privacy Hybrid

Privatized synthetic data sets using Discrete Cosine Transforms (DCT)


Filtered DCT-based data inference statistics – low correlation


Page 37: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

RESULTS AND DISCUSSION – Image Processing and Data Privacy Hybrid

Privatized synthetic data sets using Distance Transforms (DT) – Skeletal Structure

kept.


DT-based Sepal Length data results


Page 38: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

RESULTS AND DISCUSSION – Image Processing and Data Privacy Hybrid

Privatized synthetic data sets using Distance Transforms (DT) – Skeletal Structure

kept.


Filtered DT-based Sepal Length data results


Page 39: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

RESULTS AND DISCUSSION – Image Processing and Data Privacy Hybrid

Privatized synthetic data sets using Distance Transforms (DT) – Skeletal Structure

kept.


Filtered DT-based data descriptive statistics – skeletal structure kept


Page 40: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

RESULTS AND DISCUSSION – Image Processing and Data Privacy Hybrid

Privatized synthetic data sets using Distance Transforms (DT) – Skeletal Structure

kept.


Filtered DT-based data inference statistics – high correlation


Page 41: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

RESULTS AND DISCUSSION – Distance Transforms Based Data and the Clustering Test

DT-based synthetic data produced the best Davies-Bouldin index at 0.419 after filtering, outperforming the original data.

Page 42: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

RESULTS AND DISCUSSION – Distance Transforms Based Data and the Clustering Test


Page 43: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

RESULTS AND DISCUSSION – Distance Transforms Based Data and the Clustering Test

Clustering results of the Original Fisher Iris Data


Page 44: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

RESULTS AND DISCUSSION – Distance Transforms Based Data and the Clustering Test

Clustering results of the DT-based synthetic Fisher Iris Data


Page 45: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

RESULTS AND DISCUSSION – Distance Transforms Based Data and the Clustering Test

Clustering Results of the Filtered DT-based Fisher Iris Data.

Clustering greatly improved after filtering.


Page 46: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

RESULTS AND DISCUSSION

DT, DCT, and DWT improved classification accuracy after filtering.


Page 47: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

Results – Signal Processing – The Machine Learning Classification Error Test


Priv Synth Data NN KNN NB DT AdaBoost Max

Mean 91.00 87.95 86.07 86.74 84.33 91.00

MID-POINT 75.83 72.78 71.65 72.31 70.39 75.83

Max 91.00 87.95 86.07 86.74 84.33 91.00


Page 48: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

RESULTS AND DISCUSSION - Non-Interactive Differential Privacy (DP)

• Results of the Fisher Iris data after DP – too much noise is an issue with DP.
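Non-interactive DP of this kind is commonly realized with the Laplace mechanism; the sketch below assumes per-attribute range as the sensitivity, which may differ from the dissertation's exact setup, and the function name is illustrative.

```python
import numpy as np

def dp_release(X, epsilon=1.0, seed=0):
    # Non-interactive differential privacy sketch: perturb every value
    # with Laplace noise scaled to (per-attribute sensitivity / epsilon).
    # Smaller epsilon => more noise => stronger privacy, lower utility.
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    sensitivity = X.max(axis=0) - X.min(axis=0)  # attribute range as a proxy
    return X + rng.laplace(0.0, sensitivity / epsilon, X.shape)
```

The heavy-tailed Laplace noise is exactly what produces the outliers that the later slides remove with Gaussian filtering.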


Page 49: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

RESULTS AND DISCUSSION - Non-Interactive Differential Privacy (DP)

• Classification accuracy of DP data (before filtering) reduces with increased

DP levels.


Page 50: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

RESULTS AND DISCUSSION - Non-Interactive Differential Privacy (DP)

• Improved Classification accuracy of DP data sets after filtering.


Page 51: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

RESULTS AND DISCUSSION - Non-Interactive Differential Privacy (DP)

• Comparative descriptive statistics of the original, DP, and filtered DP-based data.

• Skeletal structure is not kept as in DT-based data, but outlier noise is removed in DP-based data.


Page 52: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

Results – Non-Interactive Differential Privacy – Inference Statistics


Page 53: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

Results – Non-Interactive Differential Privacy – How much DP?


Page 54: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

Results – Non-Interactive Differential Privacy – How much DP?


Page 55: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

RESULTS AND DISCUSSION– Data Privacy using K-Anonymity

• Suppress all items where k = 1.


Page 56: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

RESULTS AND DISCUSSION– Data Privacy using K-Anonymity

• Replace suppressed items with new synthetic values (most frequent values) such

that k > 1 for all items.
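The frequency-count-and-replacement idea can be sketched for a single quasi-identifier column; using the most frequent value as the synthetic replacement is an assumption for illustration, and the function name is mine.

```python
from collections import Counter

def enforce_k2(values):
    # Frequency-count analysis: a value occurring only once has k = 1
    # and would be suppressed; replace it with a synthetic stand-in
    # (here, the column's most frequent value) so every released item
    # has k > 1 -- minimizing information loss versus full suppression.
    # Assumes at least one value occurs more than once.
    counts = Counter(values)
    mode = counts.most_common(1)[0][0]
    return [v if counts[v] > 1 else mode for v in values]
```

Compared with suppressing the singleton records outright, this keeps a plausible value in the published column, which is the information-loss reduction the slide describes.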


Page 57: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

RESULTS AND DISCUSSION – Data Privacy using K-Anonymity

• Only sensitive attributes removed – info loss minimized in published

attributes.


Page 58: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

RESULTS AND DISCUSSION – Data Privacy using K-Anonymity

• Only sensitive attributes removed – info loss minimized in published

attributes.


Page 59: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

CONCLUSION • The Comparative x-CEG: Empirical results from this study show that fine-tuning parameters in the data privacy

procedure, specifically, Noise Addition and Differential Privacy, and with adjustments to the machine learning

classifiers, lowers the classification error and thus generates better and desirable data utility. The hypothesis holds. The

x-CEG model could help in presenting acceptable trade-off points between privacy and utility.

• The SIED model: It is vital for the appropriate solicitation of data privacy requirements, which vary on a case-by-case basis; therefore SIED could serve as a suitable framework in such a data privacy engineering process.

• Privatized Synthetic Data Generation: Data swapping, Distance Transforms, Discrete Cosine Transforms, and

Discrete Wavelet Transforms, in combination with data privacy procedures allow for the generation of privatized

synthetic data sets. However, more research on optimal parameterization needs to be done; as well as using other signal

processing techniques.

• Distance Transforms and Filtering: Empirical results from this study show that a hybrid of Distance Transforms (DT)

and data privacy, in combination with filtering, maintains the skeletal structure of the original data, generates privatized

synthetic data with better classification accuracy results, thus better utility. However, more study needs to be done on

securing DT-based privatized data, to prevent attackers from reconstructing private data.

• Differential Privacy and Filtering: On the other hand, Differential Privacy (DP) offers strong privacy guarantees but at

the loss of data utility. However, empirical results from this study have shown that Gaussian filtering does reduce outlier

noise in DP-based data and with improved classification accuracy results.

• K-anonymity: Information loss could be minimized using frequency count analysis for privatized data models requiring

k-anonymity for confidentiality. Only remove sensitive attributes and use synthetics for suppressed values.

• Privacy versus Utility: Achieving optimal utility while granting privacy is still sought. Accurate classification could also mean loss of privacy; trade-offs must be made between privacy and utility.


Page 60: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

FUTURE WORK

• Future work includes:

• Furthering the state-of-the-art in Data Privacy Engineering by developing data-privacy-compliant software, data privacy modeling, and autonomous intelligent data privacy agent systems following the SIED framework.

• Applying data privacy and utility principles to digital forensics data, network traffic data, bioinformatics data, and big data.

• Studying the efficient generation of privatized synthetic data sets.

• Applying data privacy principles to real-time data, including realistic scenarios where users of data provide feedback on how useful the data was to them.

• Showing, analytically, differences in performance between the various methods introduced in this work, as well as other state-of-the-art methods.


Page 61: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

PUBLICATIONS

1. Kato Mivule, “Towards Agent-based Data Privacy Engineering”, Proceedings of the Sixth International Conference on Advanced Cognitive Technologies and

Applications – COGNITIVE 2014, May 25 – May 30, 2014 (In Print), Venice, Italy.

2. Kato Mivule and Claude Turner, “SIED, A Data Privacy Engineering Framework”, Abstracts, Emerging Researchers National Conference in STEM (ERN 2014),

Page A239, ISBN 978-0-87168-757-9, Feb 20-22, 2014, Washington DC, USA. [Best Oral Presentation Award]

3. Kato Mivule and Claude Turner, International Journal of Computer Science and Mobile Computing, ICMIC13, December 2013, Pages 36-43, Dec 17-18, 2013, Trivandrum, Kerala, India.

4. Kato Mivule and Claude Turner, A Comparative Analysis of Data Privacy and Utility Parameter Adjustment, Using Machine Learning Classification as a Gauge,

Procedia Computer Science, Volume 20, 2013, Pages 414-419, ISSN 1877-0509, Nov 13-15, Baltimore, MD, USA.

5. Kato Mivule and Claude Turner, “An Investigation of Data Privacy and Utility Preservation Using KNN Classification as a Gauge”, International Conference on

Information and Knowledge Engineering (IKE 2013), July 22-25, Pages 203-204, Las Vegas, NV, USA.

6. Kato Mivule, Darsana Josyula, and Claude Turner, “Data Privacy Preservation in Multi-Agent Learning Systems”, Proceedings of the Fifth International Conference

on Advanced Cognitive Technologies and Applications – COGNITIVE 2013, May 27 - June 1, 2013, Pages 14-20, Valencia, Spain.

7. Kato Mivule, Claude Turner, Soo-Yeon Ji, "Towards A Differential Privacy and Utility Preserving Machine Learning Classifier", Procedia Computer Science, 2012,

Pages 176-181, Washington DC, USA.

8. Kato Mivule, Stephen Otunba, Tattwamasi Tripathy, and Sharad Sharma, "Implementation of Data Privacy and Security in an Online Student Health Records System", Proceedings of the ISCA 21st International Conference on Software Engineering and Data Engineering (SEDE-2012), Pages 143-148, Los Angeles, CA, USA.

9. Kato Mivule, Claude Turner, "Applying Data Privacy Techniques on Published Data in Uganda", Proceedings of the 2012 International Conference on e-Learning, e-

Business, Enterprise Information Systems, and e-Government (EEE 2012), Pages 110-115, Las Vegas, NV, USA.

10. Kato Mivule, "Utilizing Noise Addition for Data Privacy, an Overview", Proceedings of the International Conference on Information and Knowledge Engineering

(IKE 2012), Pages 65-71, Las Vegas, NV, USA.


Page 62: An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge

THANK YOU!

QUESTIONS?

[email protected]
