Venue and Date: Center for Business and Graduate Studies Dean’s Conference Room 1303 Open to the Public Thursday, April 17, 2014 at 1 pm Dissertation Committee: Claude Turner, Ph.D. Chair Soo-Yeon Ji, Ph.D. Member Hoda El-Sayed, D.Sc. Member Darsana Josyula, Ph.D. Member Anthony Joseph, Ph.D. External Examiner Department of Computer Science Dissertation Defense AN INVESTIGATION OF DATA PRIVACY AND UTILITY USING MACHINE LEARNING AS A GAUGE Kato Mivule For the Degree of D.Sc. in Computer Science Cosmas U. Nwokeafor, PhD Dean, The Graduate School Lethia Jackson, D.Sc. Chair, Computer Science Department
An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge
An Investigation of Data Privacy and Utility Using Machine Learning as a Gauge By Kato Mivule for the Degree of D.Sc. in Computer Science - Bowie State University
Transcript
Venue and Date:
Center for Business and Graduate Studies
Dean’s Conference Room 1303
Open to the Public
Thursday, April 17, 2014 at 1 pm
Dissertation Committee:
Claude Turner, Ph.D. Chair
Soo-Yeon Ji, Ph.D. Member
Hoda El-Sayed, D.Sc. Member
Darsana Josyula, Ph.D. Member
Anthony Joseph, Ph.D. External Examiner
Department of Computer Science
Dissertation Defense
AN INVESTIGATION OF DATA PRIVACY AND
UTILITY USING MACHINE LEARNING AS A GAUGE
Kato Mivule For the Degree of
D.Sc. in Computer Science
Cosmas U. Nwokeafor, PhD
Dean, The Graduate School
Lethia Jackson, D.Sc.
Chair, Computer Science Department
DISSERTATION DEFENSE PRESENTATION BY KATO MIVULE
OUTLINE
• Introduction
o The Problem
o Contributions
• Literature Review
• Methodology
• Results and Discussion
o Results
o Discussion
• Conclusion and Future work
o Conclusion
o Future work
Kato Mivule – Bowie State University Department of Computer Science
CONTRIBUTIONS
1. A proposed data privacy engineering framework, SIED.
2. A proposed Comparative x-CEG data utility analysis heuristic.
3. Proposed Initial and Subsequent Basic Privacy (IBP and SBP)
indexes.
4. A proposed data swapping and noise addition hybrid model for
privacy.
5. A proposed privatized synthetic data generation model using
image and signal processing techniques (DT, DCT, and DWT).
6. An implementation of k-anonymity by minimizing information
loss via the frequency count analysis and synthetic data
replacement model.
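As a rough illustration of contribution 4, the sketch below combines data swapping with additive noise on a single numeric column. The slides do not give the details of the hybrid model, so the function name `swap_and_add_noise`, the pairwise swapping scheme, and the parameters `swap_fraction` and `sigma` are all assumptions made for this example.

```python
import random


def swap_and_add_noise(values, swap_fraction=0.5, sigma=1.0, seed=0):
    """Hypothetical hybrid: swap values between records, then add Gaussian noise.

    This is an illustrative sketch, not the model from the dissertation.
    """
    rng = random.Random(seed)
    out = list(values)
    n = len(out)
    # Choose an even number of record positions and swap them pairwise,
    # so each value stays in the column but moves to another record.
    k = int(n * swap_fraction) // 2 * 2
    idx = rng.sample(range(n), k)
    for a, b in zip(idx[0::2], idx[1::2]):
        out[a], out[b] = out[b], out[a]
    # Perturb every value with zero-mean Gaussian noise of std-dev sigma.
    return [v + rng.gauss(0.0, sigma) for v in out]


salaries = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
privatized = swap_and_add_noise(salaries, swap_fraction=1.0, sigma=2.0, seed=1)
```

Swapping alone preserves the column's distribution exactly while breaking the link between a value and its record; the noise step then perturbs the values themselves, and `sigma` is the kind of parameter that would be tuned against utility.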
THE PROBLEM
Finding a user-defined balance between data privacy and utility
needs with trade-offs.
• The challenge of ambiguous definitions of privacy and utility.
“Perfect privacy can be achieved by publishing nothing at all, but this has no
utility; perfect utility can be obtained by publishing the data exactly as received, but
this offers no privacy” – Cynthia Dwork (2006)
Data Privacy
~Differential Privacy
~Noise addition
~K-anonymity, etc...
Data Utility
~Completeness
~Currency
~Accuracy
MOTIVATION
• Generate privatized synthetic data sets that meet acceptable
privacy and utility requirements.
• Data Privacy Engineering - Adapt engineering principles in the
data privacy and utility process.
HYPOTHESIS
• Fine-tuning parameters in the data privacy procedure,
specifically using perturbation methods such as noise addition
and differential privacy, lowers the classification error and thus
generates better data utility.
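The hypothesis can be illustrated with a small parameter sweep: privatize a training set with different noise levels, train a classifier on the privatized data, and record the classification error as the gauge of utility. The sketch below is a toy setup under stated assumptions (one-dimensional synthetic two-class data, additive Gaussian noise, a hand-written k-nearest-neighbour classifier, evaluation on held-out original data); none of these specifics come from the slides, and all function names are hypothetical.

```python
import random


def make_data(n, rng):
    """Toy data: two Gaussian classes in one dimension (means 0 and 4)."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        data.append((rng.gauss(0.0 if label == 0 else 4.0, 1.0), label))
    return data


def add_noise(data, sigma, rng):
    """Noise addition: perturb each feature value, keep the label."""
    return [(x + rng.gauss(0.0, sigma), y) for x, y in data]


def knn_error(train, test, k=5):
    """Fraction of test points misclassified by a k-nearest-neighbour vote."""
    errors = 0
    for x, y in test:
        nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
        votes = sum(label for _, label in nearest)
        predicted = 1 if votes * 2 > k else 0
        errors += predicted != y
    return errors / len(test)


def sweep(sigmas, seed=0):
    """Classification error for each noise level (the tunable privacy parameter)."""
    rng = random.Random(seed)
    train, test = make_data(200, rng), make_data(200, rng)
    return {s: knn_error(add_noise(train, s, rng), test) for s in sigmas}


for sigma, error in sweep([0.0, 0.5, 1.0, 2.0, 4.0, 8.0]).items():
    print(f"sigma={sigma:<4} classification error={error:.3f}")
```

Reading the error across noise levels is what lets a data curator pick a trade-off point: more noise gives more distortion (privacy) and, in this toy setting, a higher error (less utility).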
LITERATURE REVIEW
The data privacy and utility problem
• Wong et al. (2007); Meyerson & Williams (2004); Park &
Shim (2007): Data privatization diminishes data utility – an
NP-hard problem.
• Krause & Horvitz (2010); Wang & Wu (2005): Optimal data
utility with privacy is a well-documented NP-hard problem.
Results – Non-Interactive Differential Privacy – How much DP?
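The question "how much DP?" comes down to the choice of the privacy parameter. As a minimal sketch, assuming noise in the style of the Laplace mechanism with scale b = sensitivity / epsilon added to each published value, the code below shows how distortion grows as epsilon shrinks. The slides do not state the exact mechanism or sensitivity used, so `privatize`, `mean_abs_distortion`, and the chosen sensitivity are assumptions for illustration only.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def privatize(values, epsilon, sensitivity, seed=0):
    """Add Laplace noise with scale sensitivity / epsilon to every value (assumed setup)."""
    rng = random.Random(seed)
    scale = sensitivity / epsilon
    return [v + laplace_noise(scale, rng) for v in values]


def mean_abs_distortion(values, epsilon, sensitivity, seed=0):
    """Average absolute change per value; a crude proxy for lost utility."""
    noisy = privatize(values, epsilon, sensitivity, seed)
    return sum(abs(a - b) for a, b in zip(values, noisy)) / len(values)


ages = [float(20 + i % 50) for i in range(1000)]
for eps in (0.01, 0.1, 1.0, 10.0):
    print(f"epsilon={eps:<5} mean distortion={mean_abs_distortion(ages, eps, 1.0):.3f}")
```

A smaller epsilon means a larger noise scale and therefore stronger privacy but more distortion, which is the trade-off the title of this slide asks about.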
RESULTS AND DISCUSSION – Data Privacy using K-Anonymity
• Suppress all items where k = 1.
• Replace suppressed items with new synthetic values (most frequent values) such
that k > 1 for all items.
• Only sensitive attributes removed – info loss minimized in published
attributes.
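The frequency count and synthetic replacement steps described on these slides can be sketched for a single attribute as follows. This is a minimal illustration, assuming each column is processed independently and that "most frequent value" means the column's mode; the function names `enforce_k_by_frequency` and `min_group_size` are hypothetical.

```python
from collections import Counter


def enforce_k_by_frequency(column, k=2):
    """Replace values occurring fewer than k times with the most frequent value.

    Sketch only: values with count < k are treated as suppressed and then
    filled with a synthetic replacement (the column's most frequent value).
    """
    counts = Counter(column)
    most_frequent, _ = counts.most_common(1)[0]
    return [v if counts[v] >= k else most_frequent for v in column]


def min_group_size(column):
    """Smallest number of records sharing any one value in the column."""
    return min(Counter(column).values())


zip_codes = ["20715", "20715", "20720", "20716", "20715", "20720"]
published = enforce_k_by_frequency(zip_codes, k=2)
print(published)                  # "20716" occurs once, so it becomes "20715"
print(min_group_size(published))  # every remaining value is shared by >= 2 records
```

Because rare values are replaced rather than blanked out, the published column keeps its length and its dominant values, which is how this approach limits information loss.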
CONCLUSION
• The Comparative x-CEG: Empirical results from this study show that fine-tuning parameters in the data privacy
procedure, specifically Noise Addition and Differential Privacy, together with adjustments to the machine learning
classifiers, lowers the classification error and thus generates better data utility. The hypothesis holds. The
x-CEG model could help in presenting acceptable trade-off points between privacy and utility.
• The SIED model: Data privacy requirements vary on a case-by-case basis and must be solicited appropriately;
SIED could therefore serve as a suitable framework for such a data privacy engineering process.
• Privatized Synthetic Data Generation: Data swapping, Distance Transforms, Discrete Cosine Transforms, and
Discrete Wavelet Transforms, in combination with data privacy procedures, allow for the generation of privatized
synthetic data sets. However, more research is needed on optimal parameterization and on other signal
processing techniques.
• Distance Transforms and Filtering: Empirical results from this study show that a hybrid of Distance Transforms (DT)
and data privacy, in combination with filtering, maintains the skeletal structure of the original data and generates
privatized synthetic data with better classification accuracy, and thus better utility. However, more study is needed on
securing DT-based privatized data to prevent attackers from reconstructing private data.
• Differential Privacy and Filtering: Differential Privacy (DP) offers strong privacy guarantees but at
the cost of data utility. However, empirical results from this study show that Gaussian filtering reduces outlier
noise in DP-based data and improves classification accuracy.
• K-anonymity: Information loss could be minimized using frequency count analysis for privatized data models requiring
k-anonymity for confidentiality: remove only sensitive attributes and use synthetic values for suppressed values.
• Privacy versus Utility: Achieving optimal utility while granting privacy remains an open goal; accurate classification
could also mean loss of privacy, so trade-offs must be made between privacy and utility.
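The filtering result above can be illustrated with a one-dimensional Gaussian filter applied to a column perturbed with Laplace noise. The sketch below is a toy demonstration under assumptions not taken from the slides: the column is treated as a smooth signal (a sine wave), the noise scale is fixed at 3.0, and utility is measured by mean squared error against the original rather than by classification accuracy. All names are hypothetical.

```python
import math
import random


def gaussian_kernel(sigma, radius):
    """Normalized Gaussian weights for offsets -radius..radius."""
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def gaussian_filter(signal, sigma=2.0):
    """Smooth a 1-D sequence; edges are handled by repeating the end values."""
    radius = max(1, int(3 * sigma))
    kernel = gaussian_kernel(sigma, radius)
    n = len(signal)
    smoothed = []
    for i in range(n):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = min(max(i + j - radius, 0), n - 1)  # clamp index at the edges
            acc += w * signal[idx]
        smoothed.append(acc)
    return smoothed


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def compare(seed=0, noise_scale=3.0):
    """Return (MSE of noisy data, MSE of filtered data) against the original."""
    rng = random.Random(seed)
    original = [10.0 * math.sin(i / 20.0) for i in range(400)]
    noisy = [v + rng.expovariate(1.0 / noise_scale) - rng.expovariate(1.0 / noise_scale)
             for v in original]
    return mse(noisy, original), mse(gaussian_filter(noisy, sigma=2.0), original)


before, after = compare()
print(f"MSE before filtering: {before:.2f}, after filtering: {after:.2f}")
```

Filtering only helps to the extent that neighbouring values are related; on data without such structure, smoothing would distort rather than recover the original, so the filter width is another parameter to tune.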
FUTURE WORK
• Future work includes:
• Advance the state of the art in Data Privacy Engineering by developing data privacy
compliant software, data privacy modeling, and autonomous intelligent data privacy
agent systems following the SIED framework.
• Apply data privacy and utility principles to digital forensics data, network traffic
data, bioinformatics data, and big data.
• Study efficient generation of privatized synthetic data sets.
• Apply data privacy principles to real-time data, including realistic scenarios where
users of data provide feedback on how useful the data was to them.
• Show, analytically, differences in performance between the various methods
introduced in this work, as well as other state-of-the-art methods.
PUBLICATIONS
1. Kato Mivule, “Towards Agent-based Data Privacy Engineering”, Proceedings of the Sixth International Conference on Advanced Cognitive Technologies and
Applications – COGNITIVE 2014, May 25 – May 30, 2014 (In Print), Venice, Italy.
2. Kato Mivule and Claude Turner, “SIED, A Data Privacy Engineering Framework”, Abstracts, Emerging Researchers National Conference in STEM (ERN 2014),
Page A239, ISBN 978-0-87168-757-9, Feb 20-22, 2014, Washington DC, USA. [Best Oral Presentation Award]
3. Kato Mivule and Claude Turner, International Journal of Computer Science and Mobile Computing, ICMIC13, December 2013, Pages 36-43,
Dec 17-18, 2013, Trivandrum, Kerala, India.
4. Kato Mivule and Claude Turner, “A Comparative Analysis of Data Privacy and Utility Parameter Adjustment, Using Machine Learning Classification as a Gauge”,
Procedia Computer Science, Volume 20, 2013, Pages 414-419, ISSN 1877-0509, Nov 13-15, Baltimore, MD, USA.
5. Kato Mivule and Claude Turner, “An Investigation of Data Privacy and Utility Preservation Using KNN Classification as a Gauge”, International Conference on
Information and Knowledge Engineering (IKE 2013), July 22-25, Pages 203-204, Las Vegas, NV, USA.
6. Kato Mivule, Darsana Josyula, and Claude Turner, “Data Privacy Preservation in Multi-Agent Learning Systems”, Proceedings of the Fifth International Conference
on Advanced Cognitive Technologies and Applications – COGNITIVE 2013, May 27 - June 1, 2013, Pages 14-20, Valencia, Spain.
7. Kato Mivule, Claude Turner, Soo-Yeon Ji, "Towards A Differential Privacy and Utility Preserving Machine Learning Classifier", Procedia Computer Science, 2012,
Pages 176-181, Washington DC, USA.
8. Kato Mivule, Stephen Otunba, Tattwamasi Tripathy, and Sharad Sharma, "Implementation of Data Privacy and Security in an Online Student Health Records
System", Proceedings of the ISCA 21st International Conference on Software Engineering and Data Engineering (SEDE-2012), Pages 143-148, Los Angeles, CA,
USA.
9. Kato Mivule, Claude Turner, "Applying Data Privacy Techniques on Published Data in Uganda", Proceedings of the 2012 International Conference on e-Learning,
e-Business, Enterprise Information Systems, and e-Government (EEE 2012), Pages 110-115, Las Vegas, NV, USA.
10. Kato Mivule, "Utilizing Noise Addition for Data Privacy, an Overview", Proceedings of the International Conference on Information and Knowledge Engineering
(IKE 2012), Pages 65-71, Las Vegas, NV, USA.