Generating Synthetic Healthcare Records Using Generative Adversarial Networks
Amirsina Torfi and Mohammadreza Beyki
Virginia Tech, Department of Computer Science, Blacksburg, VA, 24061
Final Project Presentation (CS 6604 - Digital Libraries)
Course Instructor: Dr. Edward A. Fox
5 December 2019
Amirsina Torfi and Mohammadreza Beyki Final Presentation 1 / 24
Table of Contents
Introduction
Specific Aims
Background
Accomplishments
References
Introduction
Motivation
Electronic Health Records (EHRs) & Big Data in healthcare → Calls for employing data-driven methods with Artificial Intelligence (AI)
De-identification of EHR data employed for mitigation of privacy risks → NOT SECURE! [1, 2, 3]
Need for synthetic healthcare records for Machine Learning
Specific Aims
Aim 1: Develop the Generative Model
Capturing spatial-temporal information
Handling discrete data
Evaluation of synthetic data quality using statistical analysis
Figure 1: Generative/Discriminative Models [Link]
Hypothesis 1: Generative Adversarial Networks (GANs) perform better than other generative models.
Hypothesis 2: Convolutional Neural Networks (CNNs) outperform Multilayer Perceptrons → Capturing and integrating more temporal and spatial information from healthcare records
Aim 2: Measuring Realistic Characteristics
Propose a discriminative model to measure the realistic characteristics of the data (unique contribution)
Use machine learning instead of statistics
Can we replace real data with synthetic data?
Aim 3: Privacy
Assess privacy by Membership Inference Attack
Hypothesis: Machine learning models respond differently to data they saw in training than to data they never saw
Background
Generative Adversarial Networks
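The figure for this slide did not survive extraction. For reference, the standard GAN minimax objective is written out here, with generator G, discriminator D, data distribution p_data, and noise prior p_z; this is the textbook formulation and not necessarily the exact variant trained in this project.

```latex
\min_{G} \max_{D} V(D, G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}(x)} \left[ \log D(x) \right]
  + \mathbb{E}_{z \sim p_{z}(z)} \left[ \log \left( 1 - D(G(z)) \right) \right]
```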
Power of GANs
Figure 2: Example of Fake images [4]
Accomplishments
Accomplished Goals
Proposed an efficient architecture to generate synthetic healthcare records using Convolutional GANs and Convolutional Autoencoders → “COR-GAN”
The effectiveness of utilizing Convolutional Neural Networks (CNNs) is demonstrated empirically → Capturing inter-correlation between features
Privacy is assessed → Membership Inference Attack
EHR data
There are |M| discrete variables (e.g., diagnosis, medication, or procedure codes)
EHR data of a particular patient: A fixed-size vector $X \in \mathbb{Z}_+^{|M|}$, where $\mathbb{Z}_+ = \{0, 1, 2, \ldots\}$
The i-th dimension → Number of occurrences (i.e., counts) of the i-th variable in the patient record
Binary representation $X \in \{0, 1\}^{|M|}$ → The i-th dimension indicates absence or occurrence of the i-th variable
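A minimal sketch of the two representations, assuming a toy vocabulary of four codes. The code names (`dx_A`, `med_C`, ...) and the helper functions are hypothetical placeholders for illustration, not identifiers from the project.

```python
import numpy as np

# Hypothetical vocabulary with |M| = 4 codes (placeholder names).
vocab = ["dx_A", "dx_B", "med_C", "proc_D"]
index = {code: i for i, code in enumerate(vocab)}

def count_vector(record, index):
    """Count representation: entry i is the number of occurrences of code i."""
    x = np.zeros(len(index), dtype=np.int64)
    for code in record:
        x[index[code]] += 1
    return x

def binary_vector(record, index):
    """Binary representation: entry i is 1 if code i occurs at all, else 0."""
    return (count_vector(record, index) > 0).astype(np.int64)

# One patient record in which "dx_A" appears twice.
record = ["dx_A", "med_C", "dx_A"]
counts = count_vector(record, index)
binary = binary_vector(record, index)
print(counts, binary)
```

Here `counts` is `[2, 0, 1, 0]` and `binary` is `[1, 0, 1, 0]`: the binary form keeps only presence or absence.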
Train/Test Data
Autoencoder Training
Autoencoder: $\mathrm{BCE}_{\mathrm{loss}} = -\frac{1}{N} \sum_{i=1}^{N} \left[ x_i \log(y_i) + (1 - x_i) \log(1 - y_i) \right]$
$y_i = \mathrm{Dec}(\mathrm{Enc}(x_i))$
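The loss above can be sketched in NumPy as follows. The reconstruction `y` is passed in directly, since the encoder and decoder networks are not specified on the slide. Averaging over every entry of the array and clipping with `eps` to avoid `log(0)` are assumptions of this sketch.

```python
import numpy as np

def bce_loss(x, y, eps=1e-7):
    """Binary cross-entropy between binary targets x and reconstructions y.

    Assumption: the mean runs over all entries; y is clipped away from
    0 and 1 so the logarithms stay finite.
    """
    x = np.asarray(x, dtype=np.float64)
    y = np.clip(np.asarray(y, dtype=np.float64), eps, 1.0 - eps)
    return float(-np.mean(x * np.log(y) + (1.0 - x) * np.log(1.0 - y)))

x = np.array([1.0, 0.0, 1.0, 0.0])       # binary input record
y_good = np.array([0.9, 0.1, 0.8, 0.2])  # reconstruction close to x
y_bad = np.array([0.1, 0.9, 0.2, 0.8])   # reconstruction far from x

loss_good = bce_loss(x, y_good)
loss_bad = bce_loss(x, y_bad)
print(loss_good, loss_bad)
```

A reconstruction closer to the input yields the smaller loss, which is the quantity the autoencoder training drives down.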
Proposed Architecture
Dataset
The MIMIC-III dataset [5]
Medical records of almost 46K patients
Extracted ICD-9 codes only
Represent patient records as a fixed-size vector
1071 entries for each patient record
Dataset is used for experiments associated with binary discrete variables
Baseline Models
Table 1: Comparison of different baseline architectures.
Dimension-Wise Probability
Figure 3: x- and y-axes represent Bernoulli success probability for real and synthetic datasets. Diagonal line shows ideal case.
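The quantity plotted in the figure can be sketched as the per-column mean of a binary record matrix. The random matrices below are stand-ins drawn from one shared set of made-up probabilities; they are not MIMIC-III data or model output.

```python
import numpy as np

def dimension_wise_probability(data):
    """Bernoulli success probability of each column (rows are patient records)."""
    return np.asarray(data, dtype=np.float64).mean(axis=0)

rng = np.random.default_rng(0)
true_p = np.array([0.05, 0.2, 0.5, 0.8])  # made-up per-dimension probabilities

# Stand-ins for the real and synthetic datasets (assumption: both are
# sampled from the same probabilities, i.e. the ideal case).
real = (rng.random((5000, 4)) < true_p).astype(int)
synthetic = (rng.random((5000, 4)) < true_p).astype(int)

p_real = dimension_wise_probability(real)      # x-axis values
p_syn = dimension_wise_probability(synthetic)  # y-axis values
max_gap = float(np.max(np.abs(p_real - p_syn)))
print(p_real, p_syn, max_gap)
```

Plotting `p_real` against `p_syn` gives one point per dimension; points near the diagonal mean the synthetic data reproduces that dimension's marginal frequency.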
Discrete Synthetic Data Quality Evaluation
Maximum Mean Discrepancy (MMD)
Represents similarity between two distributions → Distance between mean feature embeddings
Distributions P_R and P_G are defined over set X
Used kernel MMD with an isotropic Gaussian kernel
For 100 runs
Table 2: Distinguishing between real and synthesized samples by employing Maximum Mean Discrepancy metric.
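A minimal sketch of kernel MMD with an isotropic Gaussian kernel. It uses the biased estimator of the squared MMD, and the bandwidth `sigma` and the toy binary samples are assumptions; the slide does not state which estimator or bandwidth the experiments used.

```python
import numpy as np

def gaussian_kernel(a, b, sigma):
    """Isotropic Gaussian kernel matrix between the rows of a and the rows of b."""
    sq_dist = (np.sum(a ** 2, axis=1)[:, None]
               + np.sum(b ** 2, axis=1)[None, :]
               - 2.0 * a @ b.T)
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * sigma ** 2))

def mmd2(x, y, sigma=1.0):
    """Biased estimate of squared MMD between samples x ~ P_R and y ~ P_G."""
    return float(gaussian_kernel(x, x, sigma).mean()
                 + gaussian_kernel(y, y, sigma).mean()
                 - 2.0 * gaussian_kernel(x, y, sigma).mean())

rng = np.random.default_rng(0)
p = np.array([0.1, 0.3, 0.6])  # made-up per-dimension probabilities
real = (rng.random((300, 3)) < p).astype(float)
close = (rng.random((300, 3)) < p).astype(float)      # same distribution
far = (rng.random((300, 3)) < 1.0 - p).astype(float)  # shifted distribution

mmd_close = mmd2(real, close)
mmd_far = mmd2(real, far)
print(mmd_close, mmd_far)
```

Samples drawn from the same distribution give a value near zero, while the shifted sample gives a clearly larger one. Repeating the estimate over many random draws is one way to obtain per-run statistics.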
References
[1] V. Janmey and P. L. Elkin, “Re-identification risk in HIPAA de-identified datasets: The MVA attack,” in AMIA Annual Symposium Proceedings, vol. 2018, p. 1329, American Medical Informatics Association, 2018.
[2] M. Scaiano, S. Korte, A. Baker, G. Green, K. El Emam, and L. Arbuckle, “Re-identification risk measurement estimation of a dataset,” Apr. 26 2018. US Patent App. 15/320,240.
[3] A. Baker, L. Arbuckle, K. El Emam, B. Eze, S. Korte, S. Rose, and C. Ilie, “Method of re-identification risk measurement and suppression on a longitudinal dataset,” June 5 2018. US Patent 9,990,515.
[4] T. Karras, S. Laine, and T. Aila, “A style-based generator architecture for generative adversarial networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4401–4410, 2019.
[5] A. E. Johnson, T. J. Pollard, L. Shen, L.-W. H. Lehman, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L. A. Celi, and R. G. Mark, “MIMIC-III, a freely accessible critical care database,” Scientific Data, vol. 3, p. 160035, 2016.
[6] E. Choi, S. Biswal, B. Malin, J. Duke, W. F. Stewart, and J. Sun, “Generating multi-label discrete patient records using generative adversarial networks,” arXiv preprint arXiv:1703.06490, 2017.
Privacy
Privacy Assessment
Pick (Z) and (X) from the real training data and a random data source. Pick (R) from the synthetic set.
Compare each sample in the set X + Z with each sample in the set R
Calculate cosine similarity
If similarity is higher than the threshold: Match
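The matching step can be sketched as below. Two choices here are assumptions not stated on the slide: a known record counts as a match when its best cosine similarity against any synthetic record exceeds the threshold, and the threshold value 0.9 is arbitrary. The tiny arrays are illustrative only.

```python
import numpy as np

def cosine_similarity_matrix(a, b, eps=1e-12):
    """Cosine similarity between every row of a and every row of b."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    a_unit = a / np.maximum(np.linalg.norm(a, axis=1, keepdims=True), eps)
    b_unit = b / np.maximum(np.linalg.norm(b, axis=1, keepdims=True), eps)
    return a_unit @ b_unit.T

def claimed_matches(known, synthetic, threshold):
    """Flag each known record whose best similarity to the synthetic set exceeds threshold."""
    sims = cosine_similarity_matrix(known, synthetic)
    return sims.max(axis=1) > threshold

# R: synthetic records (toy binary vectors).
synthetic = np.array([[1, 0, 1, 0],
                      [0, 1, 1, 0]])
# X + Z: records known to the attacker.
known = np.array([[1, 0, 1, 0],   # identical to a synthetic record
                  [0, 0, 0, 1],   # shares no codes with any synthetic record
                  [0, 1, 1, 1]])  # partial overlap, similarity about 0.82

matches = claimed_matches(known, synthetic, threshold=0.9)
print(matches)
```

Only the first known record is flagged; lowering the threshold to 0.8 would also flag the third.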
Measure Privacy
Assessing the effect of the number of records known by the attacker → Assumption: |R| = |X| = |Z|
Precision: Of the matches identified by the adversary, the portion that were actually used in training
Recall: Of the known records that were used in training, the portion the adversary successfully identified
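As a worked example of the two measures, the sketch below takes the adversary's match decisions and the ground-truth training membership for the same known records. Both input lists are made up for illustration, and the function name is hypothetical.

```python
def precision_recall(claimed, in_training):
    """Precision and recall of the adversary's membership claims.

    claimed[i]     -- adversary declared record i a match
    in_training[i] -- record i was actually used to train the model
    """
    true_positives = sum(c and t for c, t in zip(claimed, in_training))
    n_claimed = sum(claimed)
    n_training = sum(in_training)
    precision = true_positives / n_claimed if n_claimed else 0.0
    recall = true_positives / n_training if n_training else 0.0
    return precision, recall

# Six known records: the first three were in the training set.
claimed = [True, True, False, True, True, False]
in_training = [True, True, True, False, False, False]

precision, recall = precision_recall(claimed, in_training)
print(precision, recall)
```

Here the adversary claims four matches, two of which are correct (precision 0.5), and recovers two of the three training records (recall 2/3).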