Methodological Review

Publishing data from electronic health records while preserving privacy: A survey of algorithms


Aris Gkoulalas-Divanis a,*, Grigorios Loukides b, Jimeng Sun c

a IBM Research-Ireland, Damastown Industrial Estate, Mulhuddart, Dublin 15, Ireland
b School of Computer Science & Informatics, Cardiff University, 5 The Parade, Roath, Cardiff CF24 3AA, UK
c IBM Thomas J. Watson Research Center, 17 Skyline Drive, Hawthorne, NY 10532, USA

* Corresponding author. E-mail addresses: [email protected], [email protected] (A. Gkoulalas-Divanis).


Article info

Article history: Received 1 October 2013; Accepted 5 June 2014; Available online xxxx

Keywords: Privacy; Electronic health records; Anonymization; Algorithms; Survey

Abstract

The dissemination of Electronic Health Records (EHRs) can be highly beneficial for a range of medical studies, spanning from clinical trials to epidemic control studies, but it must be performed in a way that preserves patients' privacy. This is not straightforward, because the disseminated data need to be protected against several privacy threats, while remaining useful for subsequent analysis tasks. In this work, we present a survey of algorithms that have been proposed for publishing structured patient data, in a privacy-preserving way. We review more than 45 algorithms, derive insights on their operation, and highlight their advantages and disadvantages. We also provide a discussion of some promising directions for future research in this area.

© 2014 Published by Elsevier Inc.


1. Introduction

Electronic Medical Record/Electronic Health Record (EMR/EHR) systems are increasingly adopted to collect and store various types of patient data, which contain information about patients' demographics, diagnosis codes, medication, allergies, and laboratory test results [22,90,63]. For instance, the use of EMR/EHR systems, among office-based physicians, increased from 18% in 2001 to 72% in 2012 and is estimated to exceed 90% by the end of the decade [56].

Data from EMR/EHR systems are increasingly disseminated, for purposes beyond primary care, and this has been shown to be a promising avenue for improving research [63]. This is because it allows data recipients to perform large-scale, low-cost analytic tasks, which require applying statistical tests (e.g., to study correlations between BMI and diabetes), data mining tasks, such as classification (e.g., to predict domestic violence [107]) and clustering (e.g., to control epidemics [117]), or query answering. To facilitate the dissemination and reuse of patient-specific data and help the advancement of research, a number of repositories have been established, such as the Database of Genotype and Phenotype (dbGaP) [89], in the U.S., and the U.K. Biobank [104], in the United Kingdom.


1.1. Motivation

While the dissemination of patient data is greatly beneficial, it must be performed in a way that preserves patients' privacy. Many approaches have been proposed to achieve this, by employing various techniques [43,5], such as cryptography (e.g., [73,55,121,11]) and access control (e.g., [110,71]). However, these approaches are not able to offer patient anonymity (i.e., that patients' private and confidential information will not be disclosed) when data about patients are disseminated [39]. This is because the data need to be disseminated to a wide (and potentially unknown) set of recipients.

Towards preserving anonymity, policies that restrict the sharing of patient-specific medical data are emerging worldwide [91]. For example, in the U.S., the Privacy Rule of the Health Insurance Portability and Accountability Act (HIPAA) [120] outlines two policies for protecting anonymity, namely Safe Harbor and Expert Determination. The first of these policies enumerates eighteen direct identifiers that must be removed from data, prior to their dissemination, while, according to the Expert Determination policy, an expert needs to certify that the data to be disseminated pose a low privacy risk before the data can be shared with external parties. Similar policies are in place in countries such as the U.K. [2] and Canada [3], as well as in the European Union [1]. These policies focus on preventing the privacy threat of identity disclosure (also referred to as re-identification), which involves the association of an identified individual with their record in the disseminated data.


However, it is important to note that they do not provide any computational guarantees for thwarting identity disclosure, nor do they aim at preserving the usefulness of disseminated data in analytic tasks.

To address re-identification, as well as other privacy threats, the computer science and health informatics communities have developed various techniques. Most of these techniques aim at publishing a dataset of patient records, while satisfying certain privacy and data usefulness objectives. Typically, privacy objectives are formulated using privacy models, and enforced by algorithms that transform a given dataset (to facilitate privacy protection) to the minimum necessary extent. The majority of the proposed algorithms are applicable to data containing demographics or diagnosis codes,1 focus on preventing the threats of identity, attribute, and/or membership disclosure (to be defined in subsequent sections), and operate by transforming the data using generalization and/or suppression techniques.

1.2. Contributions

In this work, we present a survey of algorithms for publishing patient-specific data in a privacy-preserving way. We begin by discussing the main privacy threats that publishing such data entails, and present the privacy models that have been designed to prevent these threats. Subsequently, for each privacy threat, we provide a survey of algorithms that have been proposed to block it. When selecting the privacy algorithms to be surveyed in the article, we gave preference to methods that have appeared in major conferences and journals in the area and that are effective in terms of preserving privacy and maintaining good utility. We opted for discussing algorithms that significantly differ from one another, by excluding articles that propose minor algorithmic variations. For the surveyed privacy algorithms, we explain the strategies that they employ for: (i) transforming data, (ii) preserving data usefulness, and (iii) searching the space of potential solutions. Based on these strategies, we classify over 45 privacy algorithms. This allows deriving interesting insights on the operation of these algorithms, as well as on their advantages and limitations. In addition, we provide an overview of techniques for preserving privacy that are designed for different settings and types of data, and identify a number of important research directions for future work.

To the best of our knowledge, this is the first survey on algorithms for facilitating the privacy-preserving sharing of structured medical data. However, there are surveys in the computer science literature that do not focus on methods applicable to such data [39], as well as surveys that focus on privacy preservation methods for text data [94], privacy policies [91,93], or system security [36] issues. In addition, we would like to note that the aim of this paper is to provide insights on the tasks and objectives of a wide range of algorithms. Thus, we have omitted the technical details and analysis of specific algorithms and refer the reader to the publications describing each algorithm.

1.2.1. Organization

The remainder of this work is organized as follows. Section 2 presents the privacy threats and the models that have been proposed for preventing them. Section 3 discusses the two scenarios for privacy-preserving data sharing. Section 4 surveys algorithms for publishing data, in the non-interactive scenario. Section 5 discusses other classes of related techniques. Section 6 presents possible directions for future research, and Section 7 concludes the paper.

1 These algorithms deal with either relational or transaction (set-valued) attributes. However, following [34,75,76,87], we discuss them in the context of demographic and diagnosis information, which is modeled using relational and transaction attributes, respectively.


2. Privacy threats and models

In this section, we first discuss the major privacy threats that are related to the disclosure of individuals' private and/or sensitive information. Then, we present privacy models that can be used to guard against each of these threats. The importance of discussing privacy models is twofold. First, privacy models can be used to evaluate how safe data are prior to their release. Second, privacy models can be incorporated into algorithms to ensure that the data can be transformed in a way that preserves privacy.


2.1. Privacy threats

Privacy threats relate to three different types of attributes: direct identifiers, quasi-identifiers, and sensitive attributes. Direct identifiers are attributes that can explicitly re-identify individuals, such as name, mailing address, phone number, social security number, other national IDs, and email address. On the other hand, quasi-identifiers are attributes which in combination can lead to identity disclosure, such as demographics (e.g., gender, date of birth, and zip code) [109,128] and diagnosis codes [75]. Last, sensitive attributes are those that patients are not willing to be associated with. Examples of these attributes are specific diagnosis codes (e.g., psychiatric diseases, HIV, cancer, etc.) and genomic information. In Table 1, we present an example dataset, in which Name and Phone Number are direct identifiers, Date of birth, Zip code, and Gender are quasi-identifiers, and DNA is a sensitive attribute.

Based on the above-mentioned types of attributes, we can consider the following classes of privacy threats:

• Identity disclosure (or re-identification) [112,128]: This is arguably the most notorious threat in publishing medical data. It occurs when an attacker can associate a patient with their record in a published dataset. For example, an attacker may re-identify Maria in Table 1, even if the table is published deprived of the direct identifiers (i.e., Name and Phone Number). This is because Maria is the only person in the table who was born on 17.01.1982 and also lives in zip code 55332.

• Membership disclosure [100]: This threat occurs when an attacker can infer with high probability that an individual's record is contained in the published data. For example, consider a dataset which contains information on only HIV-positive patients. The fact that a patient's record is contained in the dataset allows inferring that the patient is HIV-positive, and thus poses a threat to privacy. Note that membership disclosure may occur even when the data are protected from identity disclosure, and that there are several real-world scenarios where protection against membership disclosure is required. Such interesting scenarios were discussed in detail in [100,101].

• Attribute disclosure (or sensitive information disclosure) [88]: This threat occurs when an individual is associated with information about their sensitive attributes. This information can be, for example, the individual's value for the sensitive attribute (e.g., the value in DNA in Table 1), or a range of values which contains an individual's sensitive value (e.g., if the sensitive attribute is Hospitalization Cost, then knowledge that a patient's value in this attribute lies in a narrow range, say [5400, 5500], may be considered as sensitive, as it provides a near accurate estimate of the actual cost incurred, which may be considered to be high, rare, etc.).

Table 1. An example of different types of attributes in a relational table. (Direct identifiers: Name, Phone number. Quasi-identifiers: Date of birth, Zip code, Gender. Sensitive attribute: DNA.)

    Name            Phone number   Date of birth   Zip code   Gender   DNA
    Tom Green       6152541261     11.02.1980      55432      Male     AT...G
    Johanna Marer   6152532126     17.01.1982      55454      Female   CG...A
    Maria Durhame   6151531562     17.01.1982      55332      Female   TG...C
    Helen Tulid     6153553230     10.07.1977      55454      Female   AA...G
    Tim Lee         6155837612     15.04.1984      55332      Male     GC...T

Table 2. Privacy models to guard against different attacks.

Identity disclosure
  Demographics: k-Anonymity [112]; k-Map [34]; (1,k)-Anonymity [45]; (k,1)-Anonymity [45]; (k,k)-Anonymity [45]
  Diagnosis codes: Complete k-anonymity [52]; k^m-Anonymity [115]; Privacy-constrained anonymity [76]

Membership disclosure: δ-Presence [100]; c-Confident δ-presence [103]

Attribute disclosure
  Demographics: l-Diversity [88,69]; (α,k)-Anonymity [126]; p-Sensitive-k-anonymity [118]; t-Closeness [69]; Range-based [81,60]; Variance-based [64]; Worst Group Protection [84]
  Diagnosis codes: ρ-Uncertainty [16]; (h,k,p)-Coherence [130]; PS-rule based anonymity [80]

There have been several incidents of patient data publishing where identity disclosure has transpired. For instance, Sweeney [112] first demonstrated the problem in 2002, by linking a claims database, which contains information of about 135 K patients and was disseminated by the Group Insurance Commission, to the voter list of Cambridge, Massachusetts. The linkage was performed based on patient demographics (e.g., Date of birth, Zip code, and Gender) and led to the re-identification of William Weld, then governor of Massachusetts. It was also suggested that more than 87% of U.S. citizens could be re-identified, based on such attacks. Many other identity disclosure incidents have been reported since [33]. These include attacks in which (i) students re-identified individuals in the Chicago homicide database by linking it with the social security death index, (ii) an expert witness re-identified most of the individuals represented in a neuroblastoma registry, and (iii) a national broadcaster re-identified a patient, who died while taking a drug, by combining the adverse drug event database with public obituaries.

Membership and attribute disclosure have not yet led to documented privacy breaches in the healthcare domain. However, they have raised serious privacy concerns and were shown to be feasible in various domains. For example, individuals who were opposed to their potential association with sensitive movies (e.g., movies related to their sexual orientation) took legal action when it was shown that data published by Netflix may be susceptible to attribute disclosure attacks [99].


2.2. Privacy models

In this section, we present some well-established privacy models that guard against the aforementioned threats. These privacy models: (i) model what leads to one or more privacy threats and (ii) describe a computational strategy to enforce protection against the threat. Privacy models are subsequently categorized according to the privacy threats they protect from, as also presented in Table 2.


2.2.1. Models against identity disclosure

A plethora of privacy models have been proposed to prevent identity disclosure in medical data publishing. These models can be grouped, based on the type of data to which they are applied, into two major categories: (i) models for demographics and (ii) models for diagnosis codes.


2.2.1.1. Models for demographics. The most popular privacy model for protecting demographics is k-anonymity [109,112]. k-Anonymity requires each record in a dataset D to contain the same values in the set of Quasi-IDentifier attributes (QIDs) as at least k - 1 other tuples in D. Recall that quasi-identifiers are typically innocuous attributes that can be used in combination to link external data sources with the published dataset. Satisfying k-anonymity offers protection against identity disclosure, because it limits the probability of linking an individual to their record, based on QIDs, to 1/k. The parameter k controls the level of offered privacy and is set by data publishers, usually to 5 in the context of patient demographics [92].
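To make the definition concrete, the following minimal sketch (ours, not code from any surveyed algorithm; the toy records, column names, and the helper name is_k_anonymous are hypothetical) checks k-anonymity by counting how many records share each combination of QID values.

    from collections import Counter

    def is_k_anonymous(records, qids, k):
        """True if every combination of QID values occurs in at least k records."""
        counts = Counter(tuple(r[a] for a in qids) for r in records)
        return all(c >= k for c in counts.values())

    # Toy records resembling Table 1 after removing the direct identifiers.
    records = [
        {"Date of birth": "1980-1984", "Zip code": "554**", "Gender": "Male"},
        {"Date of birth": "1980-1984", "Zip code": "554**", "Gender": "Male"},
        {"Date of birth": "1975-1984", "Zip code": "553**", "Gender": "Female"},
        {"Date of birth": "1975-1984", "Zip code": "553**", "Gender": "Female"},
    ]
    print(is_k_anonymous(records, ["Date of birth", "Zip code", "Gender"], k=2))  # True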


Another privacy model that has been proposed for demographics is k-map [113]. This model is similar to k-anonymity but considers that the linking is performed based on larger datasets (called population tables), from which the published dataset has been derived. Thus, k-map is less restrictive than k-anonymity, typically allowing the publishing of more detailed patient information, which helps data utility preservation. On the negative side, however, the k-map privacy model is weaker (in terms of offered privacy protection) than k-anonymity because it assumes that: (i) attackers do not know whether a record is included in the published dataset and (ii) data publishers have access to the population table.

El Emam et al. [34] provide a discussion of the k-anonymity and k-map models and propose risk-based measures, which approximate k-map and are more applicable in certain re-identification scenarios. Three privacy models, called (1,k)-anonymity, (k,1)-anonymity and (k,k)-anonymity, which follow a similar concept to k-map and are relaxations of k-anonymity, have been proposed by Gionis et al. [45]. These models differ in their assumptions about the capabilities of attackers and can offer higher data utility but weaker privacy than k-anonymity.

2.2.1.2. Models for diagnosis codes. Several privacy models have been proposed to protect against identity disclosure attacks when sharing diagnosis codes. The work of He and Naughton [52] proposed complete k-anonymity, a model which assumes that any combination of diagnosis codes can lead to identity disclosure and requires at least k records, in the published dataset, to have the same diagnosis codes. Complete k-anonymity, however, may harm data utility unnecessarily because it is extremely difficult for attackers to know all the diagnoses in a patient record [75].


A more flexible privacy model, called k^m-anonymity, was proposed by Terrovitis et al. in [115]. k^m-Anonymity uses a parameter m to control the maximum number of diagnosis codes that may be known to an attacker, and it requires each combination of m diagnosis codes to appear in at least k records of the released dataset. This privacy model is useful in scenarios in which data publishers are unable (or unwilling) to specify certain sets of diagnosis codes that may lead to identity disclosure attacks.
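The condition itself can be verified directly, as in the brute-force sketch below (ours, not the algorithm of [115]; the diagnosis codes shown are illustrative). The check is exponential in m and is meant only to convey the definition.

    from itertools import combinations

    def is_km_anonymous(records, k, m):
        """records: list of sets of diagnosis codes. True if every combination of
        up to m codes occurring in some record is supported by at least k records."""
        for r in records:
            for size in range(1, min(m, len(r)) + 1):
                for combo in combinations(sorted(r), size):
                    support = sum(1 for other in records if set(combo) <= other)
                    if support < k:
                        return False
        return True

    records = [{"250.01", "401.9"}, {"250.01", "401.9"}, {"250.01", "272.4"}]
    # False: the single code 272.4 (and the pair {250.01, 272.4}) appears in only one record.
    print(is_km_anonymous(records, k=2, m=2))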

Recently, a privacy model, called privacy-constrained anonymity, was introduced by Loukides et al. in [76]. Privacy-constrained anonymity is based on the notion of privacy constraints. These are sets of diagnosis codes that may be known to an attacker and, collectively, they form the privacy policy. Given an owner-specified privacy policy, the privacy-constrained anonymity model limits the probability of performing identity disclosure to at most 1/k, by requiring the set of diagnoses in each privacy constraint to appear at least k times in the dataset (or not appear at all).

By definition, privacy-constrained anonymity assumes that attackers know whether a patient's record is contained in the released dataset. This assumption is made by most research in the field (e.g., [115,130,52,78]), because such knowledge can be obtained by applying the procedure used to create the released data from a larger patient population, which is often described in the literature [75]. Relaxing this assumption, however, is straightforward by following an approach similar to that of the k-map model, and can potentially offer more utility at the expense of privacy. Privacy-constrained anonymity allows protecting only sets of diagnosis codes that may be used in identity disclosure attacks, as specified by the privacy policy. Thus, it addresses a significant limitation of both complete k-anonymity and k^m-anonymity, which tend to overly protect the data (i.e., by protecting all combinations of diagnosis codes, or all combinations of m diagnosis codes, respectively), and it preserves data utility significantly better.

2.2.2. Models against membership disclosure

The privacy models that have been discussed so far are not adequate for preventing membership disclosure, as explained in [100]. To address this shortcoming, two privacy models have been proposed by Nergiz et al. in [100,101]. The first of these models, called δ-presence [100], aims at limiting the attacker's ability to infer that an individual's record is contained in a relational dataset D, given a version D̃ of dataset D that is to be published and a public population table P. The latter table is assumed to contain "all publicly known data" (i.e., the direct identifiers and quasi-identifiers of all individuals in the population, including those in D). Satisfying δ-presence offers protection against membership disclosure, because the probability of inferring that an individual's record is contained in table D, using D̃ and P, will be within a range (δ_min, δ_max) of acceptable probabilities. A record that is inferred with a probability within this range is called δ-present, and the parameters δ_min and δ_max are set by data publishers, who also need to possess the population table P.
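A rough way to see the quantity that δ-presence bounds is sketched below. This is our own crude simplification, not the estimation procedure of [100]: for a generalized QID combination, the probability that a matching population individual is in D is approximated by the number of published records with that combination divided by the number of matching population records. The counts are illustrative.

    def presence_probability(generalized_qid, published_counts, population_counts):
        """Crude estimate of P(record in D | published table, population table):
        matching published records / matching population records."""
        pub = published_counts.get(generalized_qid, 0)
        pop = population_counts.get(generalized_qid, 0)
        return pub / pop if pop else 0.0

    # Toy counts per generalized QID combination (illustrative only).
    published_counts = {("1980-1984", "554**"): 2}
    population_counts = {("1980-1984", "554**"): 10}
    p = presence_probability(("1980-1984", "554**"), published_counts, population_counts)
    print(0.1 <= p <= 0.5)  # True when p falls inside an acceptable (delta_min, delta_max) range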

The fact that δ-presence requires data owners to have access to complete information about the population, in the form of table P, limits its applicability. To address this issue, Nergiz et al. [103] proposed the c-confident δ-presence privacy model. This model assumes a set of distribution functions for the population (i.e., attackers know the probability that an individual is associated with one or more values, over one or more attributes) instead of table P, and ensures that a record is δ-present with respect to the population with an owner-specified probability c.

2.2.3. Models against attribute disclosure

Privacy models against sensitive attribute disclosure can be classified into two groups, according to the type of attributes they are applied to: (i) models for patient demographics and (ii) models for diagnosis codes. In what follows, we describe some representative privacy models from each group.

2.2.3.1. Models for demographics. The most popular privacy model that thwarts attribute disclosure attacks in patient demographics is l-diversity [88]. It requires each anonymized group in a dataset D to contain at least l "well represented" sensitive attribute (SA) values [88]. In most cases, an anonymized group is k-anonymous (i.e., it contains at least k records with the same values over the set of quasi-identifiers), although this is not a requirement of the definition of l-diversity. The simplest interpretation of "well represented" is distinct and leads to distinct l-diversity [69], which requires each anonymized group to contain at least l distinct SA values. Another interpretation leads to recursive (c,l)-diversity [88], which requires each group in D to contain a large number of distinct SA values, none of which appears "too" often. Other principles that guard against value disclosure by limiting the number of distinct SA values in an anonymized group are (α,k)-anonymity [126] and p-sensitive-k-anonymity [118]. However, these privacy principles still allow attackers to infer that an individual is likely to have a certain SA value when that value appears much more frequently than other values in the group.
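Distinct l-diversity, for instance, can be checked with a few lines (a sketch of ours, not code from [88,69]; records and attribute names are illustrative): group records by their QID values and require at least l distinct SA values per group.

    from collections import defaultdict

    def is_distinct_l_diverse(records, qids, sa, l):
        """True if every group of records sharing the same QID values
        contains at least l distinct sensitive values."""
        groups = defaultdict(set)
        for r in records:
            groups[tuple(r[a] for a in qids)].add(r[sa])
        return all(len(values) >= l for values in groups.values())

    records = [
        {"Age": "20-30", "Gender": "F", "Disease": "Flu"},
        {"Age": "20-30", "Gender": "F", "Disease": "HIV"},
        {"Age": "20-30", "Gender": "F", "Disease": "Asthma"},
    ]
    print(is_distinct_l_diverse(records, ["Age", "Gender"], "Disease", l=3))  # True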

t-Closeness [69] is another privacy model for protecting demographics from attribute disclosure attacks. This model aims at limiting the distance between the probability distribution of the SA values in an anonymized group and that of SA values in the entire dataset. This prevents an attacker from learning information about an individual's SA value that is not available from the dataset. Consider, for example, a dataset in which 60% of tuples have the value Flu in a SA Disease, and we form an anonymous group which also has 60% of its disease values as Flu. Then, although an attacker can infer that an individual in the group suffers from Flu with relatively high probability (i.e., 60%), the group is protected according to t-closeness, since this fact can be inferred from the dataset itself.
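t-Closeness is commonly instantiated with the Earth Mover's Distance; the sketch below (ours) uses the simpler total variation distance between the group and overall SA distributions, merely to illustrate the kind of comparison the model performs on the example above.

    from collections import Counter

    def distribution(values):
        """Empirical value -> probability dictionary."""
        counts = Counter(values)
        n = len(values)
        return {v: c / n for v, c in counts.items()}

    def total_variation(dist_a, dist_b):
        """Total variation distance between two value -> probability dictionaries."""
        keys = set(dist_a) | set(dist_b)
        return 0.5 * sum(abs(dist_a.get(v, 0.0) - dist_b.get(v, 0.0)) for v in keys)

    dataset_sa = ["Flu"] * 6 + ["HIV"] * 2 + ["Asthma"] * 2   # 60% Flu overall
    group_sa = ["Flu", "Flu", "Flu", "HIV", "Asthma"]          # 60% Flu in the group
    t = 0.2
    print(total_variation(distribution(group_sa), distribution(dataset_sa)) <= t)  # True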

Privacy models to guard against the disclosure of sensitive ranges of values in numerical attributes have also been proposed. Models that work by limiting the maximum range of SA values in a group of tuples have been proposed by Loukides et al. [81] and Koudas et al. [60], while LeFevre et al. [64] proposed limiting the variance of SA values instead. A privacy model, called Worst Group Protection (WGP), which prevents range disclosure and can be enforced without generalization of SA values, was introduced in [84]. WGP measures the probability of disclosing any range in the least protected group of a table, and captures the way SA values form ranges in a group, based on their frequency and similarity.

2.2.3.2. Models for diagnosis codes. Several privacy models have been proposed to protect against attribute disclosure attacks when sharing diagnosis codes (e.g., the association of patients with sensitive diagnosis codes, such as those representing sexually transmitted diseases). One such model, proposed by Cao et al. [16], is ρ-uncertainty, which limits the probability of associating an individual with any (single) diagnosis code to less than ρ. This model makes the (stringent) assumption that each diagnosis code in a patient record can be sensitive, and all the remaining codes in the record may be used for its inference.

Another privacy model, called (h,k,p)-coherence, was proposed in [130] and guards against both identity and sensitive information disclosure. This model treats non-sensitive diagnosis codes similarly to k^m-anonymity and limits the probability of inferring sensitive diagnosis codes. In fact, parameters k and p have a similar role to k and m in k^m-anonymity, and h limits the probability of attribute disclosure.

The PS-rule based anonymity model (PS-rule stands for Privacy-Sensitive rule), proposed by Loukides et al. in [80], also thwarts both identity and sensitive information disclosure. Similarly to association rules [7], PS-rules consist of two sets of diagnosis codes, the antecedent and the consequent, which contain diagnosis codes that may be used in identity and sensitive information disclosure attacks, respectively. Given a PS-rule A → B, where A and B are the antecedent and consequent of the rule, respectively, PS-rule based anonymity requires that the set of diagnosis codes in A appears in at least k records of the published dataset, while at most c × 100% of the records that contain the diagnosis codes in A also contain the diagnosis codes in B. Thus, it protects against attackers who know whether a patient's record is contained in the published dataset. The parameter c is specified by data publishers, takes values between 0 and 1, and is analogous to the confidence threshold in association rule mining [7]. The PS-rule based anonymity model offers three significant benefits compared to the previously discussed models for diagnosis codes: (i) it protects against both identity and sensitive information disclosure, (ii) it allows data publishers to specify detailed privacy requirements, and (iii) it is more general than these models (i.e., the models in [115,130,52] are special cases of PS-rule based anonymity).
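As an illustration of the condition itself (not of the anonymization algorithms in [80]), the sketch below checks a single PS-rule A → B over a set of transactions: the support of A must be at least k, and the confidence of A → B must not exceed c. The example diagnosis codes are illustrative.

    def satisfies_ps_rule(records, antecedent, consequent, k, c):
        """records: list of sets of diagnosis codes.
        Require support(A) >= k and confidence(A -> B) <= c."""
        support_a = [r for r in records if antecedent <= r]
        if len(support_a) < k:
            return False
        support_ab = sum(1 for r in support_a if consequent <= r)
        return support_ab / len(support_a) <= c

    records = [{"401.9", "042"}, {"401.9"}, {"401.9", "250.01"}, {"401.9"}]
    # A = {401.9} (potentially identifying), B = {042} (a sensitive code in this example)
    print(satisfies_ps_rule(records, {"401.9"}, {"042"}, k=2, c=0.5))  # True (confidence = 1/4)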


3. Privacy scenarios

There are two scenarios for privacy-preserving data sharing, as illustrated in Fig. 1. In this paper, we survey privacy models and algorithms that belong to the non-interactive data sharing scenario. This scenario has certain benefits: (i) it offers constant data availability (since the original dataset is published after being anonymized), (ii) it does not require any infrastructure costs, and (iii) it is good for hypothesis generation and testing (since patient records are published in a utility-aware, anonymized form). However, the non-interactive scenario suffers from two important shortcomings. First, data owners need to specify privacy and utility requirements prior to sharing their data, in order to ensure that the released dataset is adequately protected and highly useful. Second, data owners have no control over the released dataset. Thus, the released dataset may be susceptible to attacks that had not been discovered at the time of data release.

Privacy-preserving data sharing can also be facilitated in the interactive scenario. This scenario assumes that the data are deposited into a (secure) repository and can be queried by external data users. Thus, the users receive protected answers to their queries, and not the entire dataset, as in the non-interactive scenario. The interactive scenario offers three main benefits, which stem from the fact that data are kept in-house to the hosting organization.

First, data owners can audit the use of their data and apply access control policies. This ensures that attackers can be identified and held accountable, a capability that is not offered by techniques that are designed for the non-interactive scenario. Furthermore, the enforced protection mechanism for the repository can be improved at any time, based on new privacy threats that are identified; thus, data owners can provide state-of-the-art protection of the sensitive data in the repository. Second, the interactive scenario allows the enforcement of strong, semantic privacy models that will be discussed later. Third, the fact that the types of posed queries are known a priori to data owners helps in deciding on an appropriate level of privacy that should be offered when answering the queries.

On the other hand, complex queries are difficult to support in the interactive setting, while there are often restrictions on the number of queries that can be answered. Additionally, several analytic tasks (e.g., visualization) require individual records, as opposed to aggregate results or models. These tasks are difficult to support in the interactive scenario. In general, it is interesting to observe that the advantages of the interactive scenario are disadvantages of the non-interactive scenario, and vice versa. Consequently, data publishers need to carefully select the appropriate privacy-preserving data sharing scenario based on their needs.

A popular class of algorithms that are designed for the interactive scenario enforce privacy by adding noise to each query answer, thereby offering output privacy. The goal of these algorithms is to tune the magnitude of the added noise so that privacy is preserved, while accurate, high-level statistics can still be computed using queries. For instance, [30] surveys several algorithms that enforce differential privacy [29], a strong privacy model to be discussed later, in the interactive setting. In addition to constructing protected query answers, these algorithms monitor the number of queries posed to the system and stop answering queries when the maximum number of queries that can be answered, while satisfying differential privacy, is reached. The release of statistics in a privacy-preserving way has also been thoroughly investigated by the statistical disclosure control community (see [4] for a survey). However, the techniques in [4] do not guarantee privacy preservation using a rigorous privacy model [29].
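As a minimal illustration of output perturbation in the interactive setting (our sketch, not a specific algorithm from [30]), the Laplace mechanism below adds noise calibrated to the sensitivity of a count query (which is 1) divided by a privacy budget epsilon; a real deployment must also track the budget consumed across queries, as noted above. The records and predicate are illustrative.

    import random

    def laplace_noise(scale):
        """Laplace(0, scale) sample, drawn as the difference of two exponential draws."""
        return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

    def noisy_count(records, predicate, epsilon):
        """Answer a count query with noise of scale (sensitivity 1) / epsilon."""
        true_count = sum(1 for r in records if predicate(r))
        return true_count + laplace_noise(1.0 / epsilon)

    records = [{"age": 34, "hiv": True}, {"age": 51, "hiv": False}, {"age": 29, "hiv": True}]
    print(noisy_count(records, lambda r: r["hiv"], epsilon=0.5))  # a noisy value near the true count of 2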

4. Privacy techniques

In this section, we provide a classification of algorithms that employ the privacy models in Section 2 and have been designed for the non-interactive scenario. These algorithms are summarized in Table 3. For each class of algorithms, we also discuss techniques that are employed in their operation.

4.1. Algorithms against identity disclosure

The prevention of identity disclosure requires transforming quasi-identifiers to enforce a privacy model in a way that preserves data utility. Since transforming the data to achieve privacy and optimal utility is computationally infeasible (see, for example, [109]), most algorithms adopt heuristic strategies to explore the space of possible solutions. That is, they consider different ways of transforming quasi-identifiers in order to find a "good" solution that satisfies privacy and the utility objective. After discussing approaches to transform quasi-identifiers, we survey utility objectives and heuristic strategies. Based on this, we subsequently present a detailed classification of algorithms.

[Fig. 1. Privacy-preserving data sharing scenarios: (a) interactive vs. (b) non-interactive. In both, data owners pass their original data to a trusted data publisher, while data recipients (e.g., researchers) are untrusted. (a) The interactive scenario (akin to statistical databases): the publisher keeps a protected data repository and performs privacy-aware query answering, returning privacy-aware results to data requests. Pros: data are kept in-house; strong privacy; attack identification and recovery from breaches are possible. Cons: difficult to answer complex queries; data availability reduces with time; infrastructure costs; bad for hypothesis generation. (b) The non-interactive scenario (also known as data publishing): the publisher releases an anonymized version of the original data. Pros: constant data availability; no infrastructure costs; good for hypothesis generation and testing. Cons: privacy and utility requirements need to be specified; the publisher has no control after data publishing; no auditing can be performed.]

Table 3. Algorithms to protect against different attacks.

Identity disclosure
  Demographics: k-Minimal generalization [109]; OLA [32]; Incognito [65]; Genetic [58]; Mondrian [66,67]; TDS [40]; NNG [28]; Greedy [129]; k-Member [15]; KACA [68]; Agglomerative [45]; (k,k)-Anonymizer [45]; Hilb [44]; iDist [44]; MDAV [25]; CBFS [62]
  Diagnosis codes: UGACLIP [76]; CBA [87]; UAR [86]; Apriori [115]; LRA [116]; VPA [116]; mHgHs [74]; Recursive partition [52]

Membership disclosure: SPALM [100]; MPALM [100]; SFALM [101]

Attribute disclosure
  Demographics: Incognito with l-diversity [88]; Incognito with t-closeness [69]; Incognito with (α,k)-anonymity [126]; p-Sensitive k-anonymity [118]; Mondrian with l-diversity [127]; Mondrian with t-closeness [70]; Top down [126]; Greedy algorithm [81]; Hilb with l-diversity [44]; iDist with l-diversity [44]; Anatomize [127]
  Diagnosis codes: Greedy [130]; SuppressControl [16]; TDControl [16]; RBAT [79]; Tree-based [80]; Sample-based [80]

4.1.1. Transforming quasi-identifiers

There are three main techniques to transform quasi-identifiers in order to prevent identity disclosure: (i) microaggregation [24], (ii) generalization [109], and (iii) suppression [109]. Microaggregation involves replacing a group of values in a QID using a summary statistic (e.g., centroid or median for numerical and categorical QIDs, respectively). This technique has been applied to demographics but not to diagnosis codes. Generalization, on the other hand, suggests replacing QID values by more general, but semantically consistent, values. Two generalization models, called global and local recoding, have been proposed in the literature (see [102] for an excellent survey of generalization models). Global recoding involves mapping the domain of QIDs into generalized values. These values correspond to aggregate concepts (e.g., British instead of English, for Ethnicity) or collections of values (e.g., English or Welsh, for Ethnicity, or 18-30, for Age). Thus, all occurrences of a certain value (e.g., English) in a dataset will be generalized to the same value (e.g., European). On the other hand, local recoding involves mapping QID values of individual records into generalized ones on a group-by-group basis. Therefore, the value English in two different records may be replaced by British in one record, and by European, in another. Similarly, diagnosis codes can be replaced either by aggregate concepts (e.g., Diseases of Other Endocrine Glands instead of Diabetes mellitus type I) or by sets of diagnosis codes (e.g., {Diabetes mellitus type I, Diabetes mellitus type II}), which are interpreted as any (non-empty) subset of diagnosis codes contained in the set. Last, suppression involves the deletion of specific QID values from the data.

Although each technique has its benefits, generalization is typically preferred over microaggregation and suppression. This is because microaggregation may harm data truthfulness (i.e., the centroid may not appear in the data), while suppression incurs high information loss. Interestingly, there are techniques that employ more than one of these operations. For example, the work of [76] employs suppression when it is not possible to apply generalization while satisfying some utility requirements.
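The difference between the two recoding models discussed above can be illustrated as follows (a sketch of ours, with a toy hierarchy; not code from the surveyed methods): global recoding applies one value-to-generalization mapping to the whole column, whereas local recoding may generalize the same value differently in different groups.

    # Toy generalization hierarchy for Ethnicity (illustrative): level 0 = British, level 1 = European.
    HIERARCHY = {"English": ["British", "European"], "Welsh": ["British", "European"]}

    def global_recode(values, level):
        """Map every occurrence of a value to the same generalized value."""
        return [HIERARCHY[v][level] if v in HIERARCHY else v for v in values]

    def local_recode(values, levels):
        """Generalize each occurrence independently (one level choice per record)."""
        return [HIERARCHY[v][lvl] if v in HIERARCHY else v
                for v, lvl in zip(values, levels)]

    column = ["English", "Welsh", "English"]
    print(global_recode(column, level=0))          # ['British', 'British', 'British']
    print(local_recode(column, levels=[0, 0, 1]))  # ['British', 'British', 'European']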

4.1.2. Utility objectives

Preventing identity disclosure may lower the utility of data, as it involves data transformation. Thus, existing methods aim at preserving data utility by following one of the following general strategies: (i) they quantify information loss using an optimization measure, which they attempt to minimize, (ii) they assume that data will be used in a specific data analysis task and attempt to preserve the accuracy of performing this task using the published data, and (iii) they take into account utility requirements, specified by data owners, and aim at generating data that satisfy these requirements. In what follows, we discuss each of these strategies.

One way to capture data utility is by measuring the level of information loss incurred by data transformation. The measures that have been proposed are based on (i) the size of anonymization groups or (ii) the characteristics of generalized values. Measures of the first category are based on the intuition that all records in an anonymization group are indistinguishable from one another, as they have the same value over QIDs. Thus, larger groups incur more information loss. Examples of these measures are the Discernability Metric (DM) [9] and the Normalized Average Equivalence Class Size [66], which differ from one another in the way they penalize groups. The main drawback of these measures is that they neglect the way values are transformed within an anonymized group. These measures, for example, would assign the same penalty to a group of records with values {14, 15, 16} in a QID Age that are generalized to 14-16 or Underage. However, using the generalized value 14-16 incurs lower information loss, as this is more specific than Underage.

The above-mentioned limitation is addressed by the second category of measures, which take into account the way values are generalized. Examples of these measures are the Generalization Cost (GC) [6], the Normalized Certainty Penalty (NCP) [129], and the Loss Metric (LM) [58]. All of these measures are applicable to demographics and penalize less specific generalized values (i.e., they favor British over European), but the latter two (i.e., NCP and LM) are more flexible, as they can be applied to both numerical and categorical attributes. A recently proposed information-loss measure for diagnosis codes is the Information Loss Metric (ILM) [87]. ILM quantifies the information loss of a generalized diagnosis code by imposing a large penalty on generalized terms that contain many diagnosis codes and appear in many records of the dataset.
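For a numerical QID, NCP is commonly computed as the width of the generalized interval over the width of the attribute's domain; the sketch below (ours, simplified, with an assumed Age domain of 0-100) contrasts it with the group-size intuition behind DM, using the Age example from the text.

    def ncp_numeric(low, high, domain_low, domain_high):
        """Normalized Certainty Penalty of generalizing a value to [low, high]."""
        return (high - low) / (domain_high - domain_low)

    def discernability(group_sizes):
        """Discernability Metric: each record is penalized by the size of its group."""
        return sum(size * size for size in group_sizes)

    # Ages 14, 15, 16 generalized to the range 14-16, with Age domain assumed to be 0-100.
    print(ncp_numeric(14, 16, 0, 100))   # 0.02
    # The same three records generalized to "Underage" (taken here as 0-17) lose more information.
    print(ncp_numeric(0, 17, 0, 100))    # 0.17
    # DM sees only the group size, so both generalizations score the same.
    print(discernability([3]))           # 9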

Another way to capture data utility is based on measuring the accuracy of a specific task performed on anonymized data. Iyengar [58], for example, observed that generalization can make it difficult to build an accurate classification model. This is because records with different class labels become indistinguishable from one another when they fall into the same anonymization group. For example, assume that all records whose value in Ethnicity is Welsh have a classification label Yes, whereas all records with English have a label No. Generalizing the values Welsh and English to British does not allow distinguishing between records that have different classification labels. To capture data utility, Iyengar introduced the Classification Metric (CM), which is expressed as the number of records whose class labels are different from that of the majority of records in their anonymized group, normalized by the dataset size.
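CM, as described above, can be sketched as follows (our code; the records are assumed to already carry their anonymized group value, and the data are illustrative).

    from collections import Counter, defaultdict

    def classification_metric(records, group_key, label_key):
        """Fraction of records whose label differs from their group's majority label."""
        groups = defaultdict(list)
        for r in records:
            groups[r[group_key]].append(r[label_key])
        penalty = 0
        for labels in groups.values():
            majority = Counter(labels).most_common(1)[0][1]
            penalty += len(labels) - majority
        return penalty / len(records)

    records = [
        {"group": "British", "label": "Yes"},  # was Welsh
        {"group": "British", "label": "Yes"},  # was Welsh
        {"group": "British", "label": "No"},   # was English
    ]
    print(classification_metric(records, "group", "label"))  # 0.333...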

LeFevre et al. [66] considered measuring the utility of anonymized data when used for aggregate query answering purposes and proposed a measure, called Average Relative Error (ARE). ARE quantifies data utility by measuring the difference between the answers to a query using the anonymized and using the original data. This measure has been widely employed, as it is applicable to different types of data (e.g., both demographics and diagnosis codes) and is independent of the way data are anonymized. Fung et al. [41], on the other hand, considered clustering and proposed comparing the cluster structures of the original and anonymized data, using the F-measure [122] and Match point. Although these measures are also general, currently they have only been applied to demographics.
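ARE can be estimated by posing a workload of COUNT queries to both versions of the data and averaging the relative error, as in the sketch below (our code; the query answers are illustrative, and how answers are estimated over generalized records is left out).

    def average_relative_error(original_answers, anonymized_answers):
        """Mean |orig - anon| / orig over a query workload (original answers assumed > 0)."""
        errors = [abs(o - a) / o for o, a in zip(original_answers, anonymized_answers)]
        return sum(errors) / len(errors)

    # Answers to three COUNT queries on the original vs. the anonymized data.
    original = [120, 45, 60]
    anonymized = [110, 50, 60]
    print(round(average_relative_error(original, anonymized), 3))  # 0.065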

Several publishing scenarios involve the release of an anonymized dataset to support a specific medical study, or to data recipients having certain data analysis requirements. In such scenarios, knowledge of how the dataset will be analyzed can be exploited during anonymization to better preserve data utility. For example, consider a dataset which contains Age, Gender, Ethnicity, and Marital Status as quasi-identifiers, and needs to be released for performing a study on the age of female patients. Intuitively, distorting the values of the first two attributes should be avoided, as the result of the study depends on their values. Samarati proposed modeling data analysis requirements based on the minimum number of suppressed tuples, or on the height of hierarchies for categorical QID values [109]. However, such requirements are difficult for data publishers to specify, as they require knowledge of how the dataset will be anonymized.

Xu et al. [129] prioritized the anonymization of certain quasi-identifier attributes by using data-owner specified weights. The proposed approach, however, cannot guarantee that some attributes will not be overdistorted (e.g., gender information can be lost, even when the generalization of Ethnicity is preferred to that of Gender). To guarantee that the anonymized data will remain useful for the specified analysis requirements, Loukides et al. [85] proposed a model for expressing data utility requirements and an algorithm for anonymizing data, based on this model. Utility requirements can be expressed at an attribute level (e.g., imposing the length of range, or the size of set, that anonymized groups may have in a given quasi-identifier attribute), or at a value level (e.g., imposing ranges or sets allowed for specified values). The approach of [85] can be applied to patient demographics but not to diagnosis codes.

Anonymizing diagnosis codes in a way that satisfies data utility requirements has been considered in [76]. The proposed approach models data utility requirements using sets of diagnosis codes, referred to as utility constraints. A utility constraint represents the ways the codes contained in it can be generalized in order to preserve data utility. Thus, utility constraints specify the information that the anonymized data should retain in order to be useful in intended medical analysis tasks. For example, assume that the disseminated data must support a study that requires counting the number of patients with Diabetes. To achieve this, a utility constraint for Diabetes, which is comprised of all different types of diabetes, must be specified. By anonymizing data according to this utility constraint, we can ensure that the number of patients with Diabetes in the anonymized data will be the same as in the original data. Thus, the anonymized dataset will be as useful as the original one, for the medical study on diabetes.

4.1.3. Heuristic strategies

Optimally anonymizing data with respect to the aforementioned utility criteria is computationally infeasible (see, for example, [66,129,78]). Consequently, many anonymization methods employ heuristic search strategies to form anonymous groups. In what follows, we discuss search strategies that have been applied to demographics and diagnosis codes.

4.1.3.1. Algorithms for demographics. Algorithms for demographics typically employ: (i) binary search on the lattice of possible generalizations [109], (ii) a lattice search strategy similar in principle to the Apriori [7] used in association rule mining, (iii) genetic search on the lattice of possible generalizations [58], (iv) data partitioning [66,57], (v) data clustering [102,129,81,68], or (vi) space mapping [44].

The main idea behind strategies (i)-(iii) is to represent the possible ways to generalize a value in a quasi-identifier attribute using a taxonomy, and then combine the taxonomies for all quasi-identifier attributes to obtain a lattice. For instance, English and Welsh are the leaf-level nodes of a taxonomy for Ethnicity, and their immediate ascendant is the generalized value British. Similarly, Male and Female are the leaf-level nodes of a taxonomy for Gender, whose root value and immediate ascendant of the leaves is Any. Thus, we can combine these two taxonomies to get a lattice for Ethnicity and Gender. Each node in this lattice represents a different set of generalized values for Ethnicity and Gender, such as {English, Male}, {English, Female}, {Welsh, Male}, and {British, Any}. Thus, finding a way to generalize values can be performed by exploring the lattice using heuristics that avoid considering certain lattice nodes for efficiency reasons. Strategy (i) prunes the ascendants of lattice nodes that are sufficient to satisfy a privacy model, while strategies (ii) and (iii) prune lattice nodes that are likely to incur high utility loss. The latter nodes are identified while considering nodes that represent incrementally larger sets of generalized values, in the case of strategy (ii), or while selecting nodes by combining their descendants, as specified by a genetic algorithm, in the case of strategy (iii).
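The lattice described above can be materialized as the cross product of per-attribute generalization levels; the sketch below (ours, with the toy taxonomies mentioned in the text) simply enumerates its nodes, whereas the surveyed algorithms explore and prune it heuristically.

    from itertools import product

    # Generalization levels per attribute, from most specific to most general (toy taxonomies).
    ETHNICITY_LEVELS = [["English", "Welsh"], ["British"], ["European"]]
    GENDER_LEVELS = [["Male", "Female"], ["Any"]]

    def lattice_nodes(*attribute_levels):
        """Each node picks one generalization level per attribute, e.g. (0, 1)."""
        return list(product(*[range(len(levels)) for levels in attribute_levels]))

    for node in lattice_nodes(ETHNICITY_LEVELS, GENDER_LEVELS):
        domains = [ETHNICITY_LEVELS[node[0]], GENDER_LEVELS[node[1]]]
        print(node, domains)
    # (0, 0) corresponds to the original values; (2, 1) to {European} x {Any}.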

Binary and Apriori-like lattice search strategies explore a small space of potential solutions and thus may fail to preserve data utility to the extent that genetic search strategies can. However, genetic search is computationally intensive (e.g., the algorithm in [58] is orders of magnitude slower than the partitioning-based method of [66]) and may converge slowly. Consequently, more recent research has focused on developing methods that use strategies (iv) and (v), which are applied to the records of a dataset and not to attribute values, as strategies (i)–(iii) are. The objective of the former strategies is to organize records into carefully selected groups that help the preservation of privacy and the satisfaction of a utility objective. Both data partitioning and clustering-based strategies create groups iteratively, but they differ in the task they perform in an iteration. Specifically, partition-based strategies split records into groups based on the value that these records have in a single quasi-identifier attribute (i.e., an iteration creates two, typically large, groups of records that are similar with respect to a quasi-identifier), while clustering-based strategies merge two groups of records based on the values of the records in all quasi-identifier attributes together. Therefore, partitioning-based methods tend to incur higher utility loss when compared to clustering-based methods [129,81], and they are sensitive to the choice of the splitting attribute, performing poorly particularly when the dataset is skewed [102]. However, partitioning is faster than clustering by orders of magnitude, requiring O(n log n) time instead of O(n^2), where n is the cardinality of the dataset.
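The following minimal sketch illustrates the top-down, median-based splitting used by partition-based methods, in the spirit of Mondrian [66] but not the published implementation; the records, attributes, and value of k are illustrative assumptions.

```python
# Minimal sketch of a Mondrian-style top-down split for k-anonymity over
# numerical quasi-identifiers; illustrative only, not the code of [66].

def partition(records, quasi_ids, k):
    """Recursively split records on the median of the widest attribute,
    stopping when a further split would violate the size-k requirement."""
    for attr in sorted(quasi_ids,
                       key=lambda a: max(r[a] for r in records) -
                                     min(r[a] for r in records),
                       reverse=True):
        values = sorted(r[attr] for r in records)
        median = values[len(values) // 2]
        left = [r for r in records if r[attr] < median]
        right = [r for r in records if r[attr] >= median]
        if len(left) >= k and len(right) >= k:
            return partition(left, quasi_ids, k) + partition(right, quasi_ids, k)
    return [records]  # no allowable split: this becomes one anonymous group

rows = [{"age": a, "zip": z} for a, z in
        [(23, 55413), (25, 55414), (31, 55401), (33, 55402),
         (45, 55433), (47, 55435), (52, 55446), (58, 55447)]]
groups = partition(rows, ["age", "zip"], k=2)
print([len(g) for g in groups])   # e.g., [2, 2, 2, 2]
```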

A different heuristic search strategy relies on space mapping techniques [44]. These techniques create a ranking of records, such that records with similar values in quasi-identifiers receive similar ranks. Based on this ranking, groups of records are subsequently formed by considering a number of records (e.g., at least k, for k-anonymity) that have consecutive ranks. Space mapping techniques achieve good efficiency, as the ranking can be calculated in linear time, and they are also effective at preserving data utility.
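As a minimal sketch of this rank-and-group idea, the snippet below ranks records with a locality-preserving key and then groups every k consecutive records. The algorithms in [44] use Hilbert-curve or iDistance mappings; here a simple Z-order (Morton) key stands in as an illustrative substitute, and the records are assumed data.

```python
# Minimal sketch of space mapping: rank records with a locality-preserving
# key and group every k consecutive records. Z-order is used here only as a
# simple stand-in for the Hilbert/iDistance mappings of the real algorithms.

def z_order_key(x, y, bits=16):
    """Interleave the bits of two coordinates (Morton / Z-order code)."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)
        key |= ((y >> i) & 1) << (2 * i + 1)
    return key

records = [(23, 5541), (58, 5547), (25, 5541), (45, 5543),
           (31, 5540), (52, 5544), (33, 5540), (47, 5543)]  # (age, zip prefix)

k = 2
ranked = sorted(records, key=lambda r: z_order_key(*r))
groups = [ranked[i:i + k] for i in range(0, len(ranked), k)]
print(groups)
```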

4.1.3.2. Algorithms for diagnosis codes. Algorithms for diagnosis codes employ: (i) space partitioning in a bottom-up [76] or top-down [16] fashion, (ii) space clustering [46], or (iii) data partitioning in a top-down [52], vertical or horizontal [116] way. Clearly, lattice search cannot be used in the context of diagnosis codes, because there is a single, set-valued attribute to consider. Thus, one taxonomy which organizes diagnosis codes, and not a lattice of taxonomies, is used to model the ways these codes can be generalized. In addition, the space mapping techniques considered by Ghinita et al. in [44] are not applicable to diagnosis codes, because there is a single, set-valued quasi-identifier attribute (i.e., a patient can be re-identified using a set of their diagnosis codes) and not many quasi-identifier attributes, as in the case of patient demographics.

Table 4
Algorithms for preventing identity disclosure based on demographics.

Algorithm | Privacy model | Transformation | Utility objective | Heuristic strategy
k-Minimal generalization [109] | k-Anonymity | Generalization and suppression | Min. inf. loss | Binary lattice search
OLA [32] | k-Anonymity | Generalization | Min. inf. loss | Binary lattice search
Incognito [65] | k-Anonymity | Generalization and suppression | Min. inf. loss | Apriori-like lattice search
Genetic [58] | k-Anonymity | Generalization | Classification accuracy | Genetic search
Mondrian [66] | k-Anonymity | Generalization | Min. inf. loss | Data partitioning
LSD Mondrian [67] | k-Anonymity | Generalization | Regression accuracy | Data partitioning
Infogain Mondrian [67] | k-Anonymity | Generalization | Classification accuracy | Data partitioning
TDS [40] | k-Anonymity | Generalization | Classification accuracy | Data partitioning
NNG [28] | k-Anonymity | Generalization | Min. inf. loss | Data partitioning
Greedy [129] | k-Anonymity | Generalization | Min. inf. loss | Data clustering
k-Member [15] | k-Anonymity | Generalization | Min. inf. loss | Data clustering
KACA [68] | k-Anonymity | Generalization | Min. inf. loss | Data clustering
Agglomerative [45] | k-Anonymity | Generalization | Min. inf. loss | Data clustering
(k,k)-Anonymizer [45] | (k,k)-Anonymity | Generalization | Min. inf. loss | Data clustering
Hilb [44] | k-Anonymity | Generalization | Min. inf. loss | Space mapping
iDist [44] | k-Anonymity | Generalization | Min. inf. loss | Space mapping
MDAV [25] | k-Anonymity | Microaggregation | Min. inf. loss | Data clustering
CBFS [62] | k-Anonymity | Microaggregation | Min. inf. loss | Data clustering

Both strategies (i) and (ii) attempt to find a set of generalized diagnosis codes that can be used to replace diagnosis codes in the original dataset (e.g., "diabetes" replaces "diabetes mellitus type I" and "diabetes mellitus type II"). However, they differ in the way they operate. Specifically, space partitioning strategies require a taxonomy for diagnosis codes, which is provided by data owners (e.g., a healthcare institution), and dictate that the generalized diagnosis codes are part of the taxonomy. Space clustering strategies lift this requirement and are more effective in terms of preserving data utility. On the other hand, data partitioning strategies are applied to transactions (records) instead of diagnosis codes, and they aim to create groups of transactions that can subsequently be anonymized with low data utility loss. For example, assume that privacy is preserved by applying k^m-anonymity with k = 2 and m = 2. Two transactions with exactly the same diagnosis codes are already 2^2-anonymous, and thus they do not incur data utility loss, as they can be released intact.
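The k^m guarantee mentioned above can be checked directly: every combination of at most m diagnosis codes that appears in the data must be supported by at least k transactions. The sketch below is a naive, illustrative check (exponential in m), not a published algorithm.

```python
# Minimal sketch: checking k^m-anonymity for transaction data, i.e., every
# combination of up to m diagnosis codes that occurs in the data must appear
# in at least k records.
from itertools import combinations

def is_km_anonymous(transactions, k, m):
    for size in range(1, m + 1):
        counts = {}
        for t in transactions:
            for combo in combinations(sorted(t), size):
                counts[combo] = counts.get(combo, 0) + 1
        if any(c < k for c in counts.values()):
            return False
    return True

# Two identical transactions are 2^2-anonymous and can be released intact.
print(is_km_anonymous([{"250.00", "401.9"}, {"250.00", "401.9"}], k=2, m=2))  # True
print(is_km_anonymous([{"250.00", "401.9"}, {"250.00", "530.81"}], k=2, m=2))  # False
```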

Space partitioning searches only a smaller space of possible solutions and typically incurs higher information loss than space clustering strategies. On the other hand, space clustering strategies are computationally intensive. It is important to note that the worst-case complexity of all strategies is exponential in the number of distinct diagnosis codes in a dataset, which can be on the order of several hundred. This explains the need for developing more effective and efficient strategies.

4.1.4. Classification of algorithms

We now present a classification of algorithms for preventing identity disclosure, based on the strategies they adopt for (i) transforming quasi-identifiers, (ii) preserving utility, and (iii) heuristically searching for a "good" solution.

4.1.4.1. Algorithms for demographics. Table 4 presents a classification of algorithms for demographics. As can be seen, these algorithms employ various data transformation and heuristic strategies, and aim at satisfying different utility objectives. All algorithms adopt k-anonymity, with the exception of the (k,k)-anonymizer [45], which adopts the (k,k)-anonymity model discussed in Section 2.2. The fact that (k,k)-anonymity is a relaxation of k-anonymity allows the algorithm in [45] to preserve more data utility than the Agglomerative algorithm, which is also proposed in [45]. Furthermore, most algorithms use generalization to anonymize data, except (i) the algorithms in [109,65], which use suppression in addition to generalization in order to deal with a typically small number of values that would incur excessive information loss if generalized, and (ii) the algorithms in [25,62], which use microaggregation.

In addition, it can be observed that the majority of algorithms aim at minimizing information loss and that no algorithm takes into account specific utility requirements, such as limiting the set of allowable ways for generalizing a value in a quasi-identifier. At the same time, the Genetic [58], Infogain Mondrian [67], and TDS [40] algorithms aim at releasing data in a way that allows for building accurate classifiers. These algorithms were compared in terms of how well they can support classification tasks, using publicly available demographic datasets [53,119]. The results, reported in [67,40], demonstrate that Infogain Mondrian outperforms TDS which, in turn, outperforms the Genetic algorithm. The LSD Mondrian [67] algorithm is similar to Infogain Mondrian but uses a different utility objective measure, as its goal is to preserve the usefulness of the released data for linear regression.

It is also interesting to observe that several algorithms implement data partitioning heuristic strategies. Specifically, the algorithms proposed in [66,67] follow a top-down partitioning strategy inspired by kd-trees [38], while the TDS algorithm [40] employs a different strategy that takes into account the partition size and data utility in terms of classification accuracy. Interestingly, the partitioning strategy of NNG [28] is based on the distance of values and allows creating k-anonymous datasets, whose utility is no more than 6·q times worse than that of the optimal solution, where q is the number of quasi-identifiers in the dataset. On the other hand, the algorithms that employ clustering [129,15,98,45] follow a similar greedy, bottom-up procedure, which aims at building clusters of at least k records by iteratively merging together smaller clusters (of one or more records), in a way that helps data utility preservation. A detailed discussion and evaluation of clustering-based algorithms that employ generalization has been reported in [82], while the authors of [25,26] provide a rigorous analysis of clustering-based algorithms for microaggregation.
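For illustration, the following minimal sketch shows the greedy, bottom-up flavor of such clustering (in the spirit of k-member style algorithms, not a published implementation), using a simple L1 distance over an assumed numerical quasi-identifier.

```python
# Minimal sketch of greedy bottom-up clustering for k-anonymity; illustrative
# only, using an L1 distance over numerical quasi-identifiers.

def l1(r, s):
    return sum(abs(r[a] - s[a]) for a in r)

def greedy_cluster(records, k):
    remaining = list(records)
    clusters = []
    while len(remaining) >= k:
        seed = remaining.pop(0)
        # grow the cluster with the k-1 records closest to the seed
        remaining.sort(key=lambda r: l1(seed, r))
        cluster, remaining = [seed] + remaining[:k - 1], remaining[k - 1:]
        clusters.append(cluster)
    if remaining and clusters:          # fewer than k records left over
        clusters[-1].extend(remaining)  # absorb them into the last cluster
    return clusters

rows = [{"age": a} for a in [21, 22, 24, 40, 41, 43, 60]]
print([[r["age"] for r in c] for c in greedy_cluster(rows, k=3)])
# [[21, 22, 24], [40, 41, 43, 60]]
```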

The use of space mapping techniques in the Hilb and iDist algorithms, both of which were proposed in [44], enables them to preserve data utility equally well or even better than the Mondrian algorithm [66] and to anonymize data more efficiently. To map the space of quasi-identifiers, Hilb uses the Hilbert curve, which can preserve the locality of points (i.e., values in quasi-identifiers) fairly well [97]. The intuition behind using this curve is that, with high probability, two records with similar values in quasi-identifiers will also be similar with respect to the rank that is produced based on the curve. The iDist algorithm employs iDistance [131], a technique that measures similarity based on sampling and clustering of points, and is shown to be slightly inferior to Hilb in terms of data utility. Last, the algorithms in [109,65,32] use lattice-search strategies. An experimental evaluation using a publicly available dataset containing demographics [53], as well as 5 hospital discharge summaries, shows that the OLA algorithm [32] performs similarly to Incognito [65] and better than k-Minimal Generalization [109] in terms of preserving data utility. The authors of [32] also suggest that the way OLA generalizes data might help medical data analysts. Nevertheless, algorithms that use lattice-based search strategies typically explore a smaller number of generalizations than algorithms that employ data partitioning or clustering, and are generally less effective at preserving data utility.

Table 5
Algorithms for preventing identity disclosure based on diagnosis codes.

Algorithm | Privacy model | Transformation | Utility objective | Heuristic strategy
UGACLIP [76] | Privacy-constrained anonymity | Generalization and suppression | Utility requirements | Bottom-up space partitioning
CBA [87] | Privacy-constrained anonymity | Generalization and suppression | Utility requirements | Space clustering
UAR [86] | Privacy-constrained anonymity | Generalization and suppression | Utility requirements | Space clustering
Apriori [115] | k^m-Anonymity | Generalization | Min. inf. loss | Top-down space partitioning
LRA [116] | k^m-Anonymity | Generalization | Min. inf. loss | Horizontal data partitioning
VPA [116] | k^m-Anonymity | Generalization | Min. inf. loss | Vertical data partitioning
mHgHs [74] | k^m-Anonymity | Generalization and suppression | Min. inf. loss | Top-down space partitioning
Recursive partition [52] | Complete k-anonymity | Generalization | Min. inf. loss | Data partitioning

4.1.4.2. Algorithms for diagnosis codes. Algorithms for anonymizing diagnosis codes are summarized in Table 5. Observe that these algorithms adopt different privacy models, but they all use either a combination of generalization and suppression, or generalization alone, in order to anonymize datasets. More specifically, the algorithms in [76,87,86] use suppression as a secondary operation, and only when generalization alone cannot satisfy the specified utility constraints. However, they differ in that CBA and UAR consider suppressing individual diagnosis codes, whereas UGACLIP suppresses sets of typically more than one diagnosis code. Experiments using patient records derived from the Electronic Medical Record (EMR) system of Vanderbilt University Medical Center, which are reported in [87,86], showed that the suppression strategy employed by CBA and UAR is more effective than that of UGACLIP.

Furthermore, the algorithms in Table 5 aim either at satisfying utility requirements or at minimizing information loss. The UGACLIP, CBA, and UAR algorithms adopt utility constraints to formulate utility requirements and attempt to satisfy them. However, these algorithms still favor solutions with low information loss, among those that satisfy the specified utility constraints. All other algorithms attempt to minimize information loss, which they quantify using two different measures: a variation of the Normalized Certainty Penalty (NCP) measure [129] for the algorithms in [115,116,52], or the Loss Metric (LM) [58] for the mHgHs algorithm [74]. However, to our knowledge, there are no algorithms for diagnosis codes that aim at preserving data utility for intended mining tasks, such as classification. Given the extensive use of diagnosis codes in these tasks, we believe that the development of such algorithms merits further investigation.


Moreover, it can be seen that all algorithms in Table 5 operate on either the space of diagnosis codes, or on that of the records of the dataset to be published. Specifically, UGACLIP [76] partitions the space of diagnosis codes in a bottom-up manner, whereas Apriori [115] and mHgHs [74] employ top-down partitioning strategies. The strategy of UGACLIP considers a significantly larger number of ways to generalize diagnosis codes than that of Apriori, which allows for better data utility preservation. In addition, the space clustering strategies of CBA and UAR are far more effective than the bottom-up space partitioning strategy of UGACLIP, but they are also more computationally demanding.

Data partitioning strategies are employed by the Recursive partition [52], LRA [116], and VPA [116] algorithms. The first of these algorithms employs a top-down partitioning strategy, which is applied recursively. That is, it starts with a dataset which contains (i) all transactions of the dataset to be published and (ii) a single generalized diagnosis code, Any, which replaces all diagnosis codes. This dataset is split into subpartitions of at least k transactions, which contain progressively less general diagnosis codes (e.g., Any is replaced by Diabetes and then by Diabetes mellitus type I). The strategy employed by the Recursive partition algorithm enforces complete k-anonymity with lower utility loss than that of Apriori [52]. On the other hand, LRA and VPA use horizontal and vertical data partitioning strategies, respectively. Specifically, LRA attempts to create subpartitions of transactions with "similar" items that can be generalized with low information loss. To achieve this, it sorts the transactions in the dataset to be published based on Gray ordering [106] and then groups these transactions into subpartitions of approximately equal size. VPA partitions data records vertically in order to create sub-records (i.e., parts of transactions) with "similar" items. The Apriori algorithm, discussed above, is then used by the LRA and VPA algorithms to anonymize each of the created subpartitions separately.
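A minimal sketch of the LRA-style horizontal step is shown below: transactions are sorted by the Gray-code rank of their item bit vectors and then cut into subpartitions of roughly equal size (each of which would then be anonymized separately). The code universe, transactions, and subpartition size are illustrative assumptions, and this is not the published implementation.

```python
# Minimal sketch of LRA-style horizontal partitioning: sort transactions by
# the Gray-code rank of their item bit vectors, then cut the sorted list into
# subpartitions of (approximately) equal size. Illustrative only.

def gray_rank(g):
    """Rank of a bit pattern g in the reflected Gray code sequence."""
    b = 0
    while g:
        b ^= g
        g >>= 1
    return b

def bit_vector(transaction, universe):
    return sum(1 << i for i, code in enumerate(universe) if code in transaction)

universe = ["250.00", "401.9", "530.81", "715.90"]   # illustrative code universe
transactions = [{"250.00"}, {"250.00", "401.9"}, {"401.9", "530.81"},
                {"530.81"}, {"250.00", "715.90"}, {"715.90"}]

ordered = sorted(transactions, key=lambda t: gray_rank(bit_vector(t, universe)))
size = 3                                              # records per subpartition
subpartitions = [ordered[i:i + size] for i in range(0, len(ordered), size)]
print(subpartitions)  # each subpartition would then be anonymized separately
```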

4.2. Techniques against membership disclosure

The fact that membership disclosure cannot be forestalled by simply preventing identity disclosure calls for specialized algorithms. However, as can be seen in Table 6, the proposed algorithms for membership disclosure share the same main components (i.e., quasi-identifier transformation strategy, utility objective, and heuristic strategy) with the algorithms that protect from identity disclosure. Furthermore, these algorithms are all applied to demographics.

All existing algorithms against membership disclosure have been proposed by Nergiz et al. [100,101], to the best of our knowledge. In [100], they proposed two algorithms, called SPALM and MPALM, which transform quasi-identifiers using generalization and aim at finding a solution that satisfies δ-presence with low information loss. Both algorithms adopt a top-down search on the lattice of all possible generalizations, but they differ in their generalization model. Specifically, the SPALM algorithm generalizes the values of each quasi-identifier separately, requiring all values in a quasi-identifier to be generalized in the same way (e.g., all values English, in Ethnicity, are generalized to British). On the contrary, the MPALM algorithm drops this requirement, allowing two records with the same value in a quasi-identifier to be generalized differently (e.g., one value English to be generalized to British and another to European). In a subsequent work [101], Nergiz et al. proposed an algorithm called SFALM, which is similar to SPALM but employs c-confident δ-presence. The fact that the latter privacy model does not require complete information about the population, as discussed above, greatly improves the applicability of SFALM in practice.

The aforementioned algorithms against membership disclosure are limited in their choice of data transformation strategies and utility objectives, since they all employ generalization and aim at minimizing information loss. We believe that developing algorithms that adopt different data transformation strategies (e.g., microaggregation) and utility objectives (e.g., utility requirements) is worthwhile. At the same time, the algorithms in [100,101] are not applicable to diagnosis codes, because diagnosis codes have different semantics than demographics. However, it is easy to see that membership disclosure attacks based on diagnosis codes are possible, because diagnosis codes can be used to reveal the fact that a patient's record is contained in the published dataset. This calls for developing algorithms for sharing diagnosis codes in a way that forestalls membership disclosure.

4.3. Techniques against attribute disclosure

In what follows, we discuss privacy considerations that are specific to algorithms that aim at thwarting attribute disclosure. Subsequently, we present a classification of these algorithms.

Algorithms for preventing attribute disclosure enforce privacy principles that govern the associations between quasi-identifier and sensitive values (e.g., Income in a demographics dataset, or Schizophrenia in a dataset containing diagnosis codes). To enforce these principles, they create anonymous groups and then merge them iteratively, until the associations between these attributes and sensitive values become protected, according to a certain privacy model (e.g., l-diversity) [88,118,126,69,67]. While this can be achieved using generalization and/or suppression, a technique called bucketization has been proposed in [127] as a viable alternative. Bucketization works by releasing: (i) a projection Dq of the dataset D on the set of quasi-identifiers, and another projection, Ds, on the sensitive attribute, and (ii) a group membership attribute that specifies the associations between records in Dq and Ds. By carefully constructing Dq and Ds, it is possible to enforce l-diversity with low information loss [127], as values in quasi-identifiers are released intact. However, the algorithm in [127] does not guarantee that identity disclosure will be prevented.
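The sketch below illustrates what bucketization releases, under assumed toy records and pre-assigned group ids: a quasi-identifier table with intact values plus a group id, and a per-group summary of sensitive values. It does not show how the groups are chosen to satisfy l-diversity.

```python
# Minimal sketch of bucketization: the quasi-identifier projection and the
# sensitive projection are released as separate tables, linked only by a
# group id. Group formation itself (e.g., for l-diversity) is not shown.

records = [
    {"age": 34, "zip": "55413", "diagnosis": "Schizophrenia", "group": 1},
    {"age": 36, "zip": "55414", "diagnosis": "Diabetes",      "group": 1},
    {"age": 51, "zip": "55401", "diagnosis": "Asthma",        "group": 2},
    {"age": 53, "zip": "55402", "diagnosis": "Diabetes",      "group": 2},
]

# Quasi-identifier table: values released intact, plus the group id.
qi_table = [{"age": r["age"], "zip": r["zip"], "group": r["group"]}
            for r in records]

# Sensitive table: per group, the sensitive values and their counts.
sensitive_table = {}
for r in records:
    key = (r["group"], r["diagnosis"])
    sensitive_table[key] = sensitive_table.get(key, 0) + 1

print(qi_table)
print(sensitive_table)  # {(1, 'Schizophrenia'): 1, (1, 'Diabetes'): 1, ...}
```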

Many of the algorithms considered in this section follow the same data transformation strategies and utility objectives as the algorithms examined in Section 4, but they additionally ensure that sensitive values are protected within each anonymized group. This approach helps data publishers construct data that are no more distorted than necessary to thwart attribute disclosure, and the algorithms following this approach are termed protection constrained. Alternatively, data publishers may want to produce data with a desired trade-off between data utility and privacy protection against attribute disclosure. This is possible using trade-off constrained algorithms [81,83,84]. These algorithms quantify, and aim at optimizing, the trade-off between the distortion caused by generalization and the level of data protection against attribute disclosure.

Table 6
Algorithms for preventing membership disclosure.

Algorithm | Data type | Privacy model | Transformation | Utility objective | Heuristic strategy
SPALM [100] | Demographics | δ-Presence | Generalization | Min. inf. loss | Top-down lattice search
MPALM [100] | Demographics | δ-Presence | Generalization | Min. inf. loss | Top-down lattice search
SFALM [101] | Demographics | c-Confident δ-presence | Generalization | Min. inf. loss | Top-down lattice search

Table 7
Algorithms for preventing attribute disclosure based on demographics.

Algorithm | Privacy model | Transformation | Approach | Heuristic strategy
Incognito with l-diversity [88] | l-Diversity | Generalization and suppression | Protection constrained | Apriori-like lattice search
Incognito with t-closeness [69] | t-Closeness | Generalization and suppression | Protection constrained | Apriori-like lattice search
Incognito with (α,k)-anonymity [126] | (α,k)-Anonymity | Generalization and suppression | Protection constrained | Apriori-like lattice search
p-Sens k-anon [118] | p-Sensitive k-anonymity | Generalization | Protection constrained | Apriori-like lattice search
Mondrian with l-diversity [127] | l-Diversity | Generalization | Protection constrained | Data partitioning
Mondrian with t-closeness [70] | t-Closeness | Generalization | Protection constrained | Data partitioning
Top down [126] | (α,k)-Anonymity | Generalization | Protection constrained | Data partitioning
Greedy algorithm [81] | Tuple diversity | Generalization and suppression | Trade-off constrained | Data clustering
Hilb with l-diversity [44] | l-Diversity | Generalization | Protection constrained | Space mapping
iDist with l-diversity [44] | l-Diversity | Generalization | Protection constrained | Space mapping
Anatomize [127] | l-Diversity | Bucketization | Protection constrained | Quasi-identifiers are released intact

Table 8
Algorithms for preventing attribute disclosure based on diagnosis codes.

Algorithm | Privacy model | Transformation | Approach | Heuristic strategy
Greedy [130] | (h,k,p)-Coherence | Suppression | Protection constrained | Greedy search
SuppressControl [16] | ρ-Uncertainty | Suppression | Protection constrained | Greedy search
TDControl [16] | ρ-Uncertainty | Generalization and suppression | Protection constrained | Top-down space partitioning
RBAT [79] | PS-rule based anonymity | Generalization | Protection constrained | Top-down space partitioning
Tree-based [80] | PS-rule based anonymity | Generalization | Protection constrained | Top-down space partitioning
Sample-based [80] | PS-rule based anonymity | Generalization | Protection constrained | Top-down and bottom-up space partitioning

4.3.1. Algorithms for demographics

A classification of algorithms for demographics is presented in Table 7. As can be seen, the majority of these algorithms follow the protection-constrained approach and are based on algorithms for identity disclosure, such as Incognito [65], Mondrian [66], Hilb [44], or iDist [44]. Furthermore, most of these algorithms employ generalization, or a combination of generalization and suppression, and they enforce l-diversity, t-closeness, p-sensitive k-anonymity, (α,k)-anonymity, or tuple diversity. An exception is the Anatomize algorithm [127], which was specifically developed for enforcing l-diversity using bucketization. This algorithm works by creating buckets with the records that have the same value in the sensitive attribute, and then constructing groups with at least l different values in the sensitive attribute. The construction of groups is performed by selecting records from appropriate buckets, in a round-robin fashion. Interestingly, the Anatomize algorithm requires an amount of memory that is linear in the number of distinct values of the sensitive attribute, and it creates anonymized data with bounded reconstruction error, which quantifies how well correlations among values in quasi-identifiers and the sensitive attribute are preserved. In fact, the authors of [127] demonstrated experimentally that the Anatomize algorithm outperforms, in terms of preserving data utility, an adaptation of the Mondrian algorithm that enforces l-diversity. Moreover, it is worth noting that the algorithms in [118,126,81], which employ p-sensitive k-anonymity, (α,k)-anonymity, or tuple diversity, are applied to both quasi-identifiers and sensitive attributes and provide protection from identity and attribute disclosure together. On the other hand, the Anatomize algorithm does not provide protection guarantees against identity disclosure, as all values in quasi-identifiers are released intact.
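The round-robin grouping just described can be sketched as follows; this is a simplified illustration of the Anatomize-style bucket-then-group idea (it omits the eligibility checks of the full algorithm in [127]), and the records are assumed toy data.

```python
# Minimal sketch of Anatomize-style grouping: bucket records by sensitive
# value, then build groups of l records by drawing one record from each of
# the l currently largest buckets, so every group has l distinct sensitive
# values. Simplified; the published algorithm adds eligibility checks.
from collections import defaultdict

def anatomize(records, sensitive, l):
    buckets = defaultdict(list)
    for r in records:
        buckets[r[sensitive]].append(r)
    groups = []
    while sum(len(b) for b in buckets.values()) >= l:
        largest = sorted(buckets, key=lambda v: len(buckets[v]), reverse=True)[:l]
        groups.append([buckets[v].pop() for v in largest])
        for v in largest:
            if not buckets[v]:
                del buckets[v]
    residue = [r for b in buckets.values() for r in b]
    return groups, residue  # leftover records need special handling

records = [{"zip": z, "diag": d} for z, d in
           [("55413", "Flu"), ("55414", "Flu"), ("55401", "Diabetes"),
            ("55402", "Asthma"), ("55433", "Asthma"), ("55435", "Diabetes")]]
groups, residue = anatomize(records, "diag", l=2)
print([[r["diag"] for r in g] for g in groups], residue)
```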

4.3.2. Algorithms for diagnosis codes

Algorithms for anonymizing diagnosis codes against attribute disclosure are summarized in Table 8. As can be seen, the algorithms adopt different privacy models, namely (h,k,p)-coherence, ρ-uncertainty, or PS-rule based anonymity, and they use suppression, generalization, or a combination of suppression and generalization. Specifically, the authors of [16] propose an algorithm, called TDControl, which applies suppression when generalization alone cannot enforce ρ-uncertainty, and a second algorithm, called SuppressControl, which only employs suppression. Through experiments, they demonstrate that combining suppression with generalization is beneficial for both data utility preservation and efficiency.

Another algorithm that uses suppression only is the Greedy algorithm, which was proposed by Xu et al. [130] to enforce (h,k,p)-coherence. This algorithm discovers all unprotected combinations of diagnosis codes of minimal size and protects each identified combination by iteratively suppressing the diagnosis code contained in the greatest number of these combinations. On the other hand, the RBAT [79], Tree-based [80], and Sample-based [80] algorithms employ generalization alone. All algorithms follow the protection-constrained approach, as they do not distort the data more than is necessary to prevent attribute disclosure.
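The greedy suppression idea can be sketched as follows. This simplified illustration only enforces the k and p components (it ignores the h component of (h,k,p)-coherence), is not the published algorithm, and uses assumed transactions and parameters.

```python
# Minimal sketch of greedy suppression in the spirit of [130]: find code
# combinations of size at most p that are supported by fewer than k records,
# and repeatedly suppress the code occurring in most such combinations.
# The h component of (h,k,p)-coherence is deliberately ignored here.
from itertools import combinations

def unprotected(transactions, k, p):
    bad = set()
    for size in range(1, p + 1):
        support = {}
        for t in transactions:
            for combo in combinations(sorted(t), size):
                support[combo] = support.get(combo, 0) + 1
        bad |= {c for c, s in support.items() if s < k}
    return bad

def greedy_suppress(transactions, k, p):
    transactions = [set(t) for t in transactions]
    while True:
        bad = unprotected(transactions, k, p)
        if not bad:
            return transactions
        # suppress the code contained in the greatest number of bad combos
        counts = {}
        for combo in bad:
            for code in combo:
                counts[code] = counts.get(code, 0) + 1
        victim = max(counts, key=counts.get)
        for t in transactions:
            t.discard(victim)

data = [{"250.00", "401.9"}, {"250.00", "401.9"}, {"250.00", "530.81"}]
print(greedy_suppress(data, k=2, p=2))  # "530.81" gets suppressed
```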

Moreover, all algorithms in Table 8 operate on the space of diagnosis codes and either perform greedy search, to discover diagnosis codes that can be suppressed with low data utility loss, or employ space partitioning strategies. Specifically, TDControl, RBAT, and the Tree-based algorithm all employ top-down partitioning, while Sample-based uses both top-down and bottom-up partitioning strategies. The main difference between the strategy of TDControl and that of RBAT is that the former is based on a taxonomy, which is used to organize diagnosis codes. This restricts the possible ways of partitioning diagnosis codes to those that can be expressed as cuts in the taxonomy (i.e., sets of generalized codes, corresponding to taxonomy nodes, such that each diagnosis code in the original dataset is mapped to exactly one of them), whereas the strategy of RBAT partitions the space in a more flexible way, as it does not employ this restriction. This helps the preservation of data utility, as it allows exploring more ways to generalize data. The process of partitioning employed by RBAT can be thought of as "growing" a tree, whose nodes correspond to increasingly less generalized diagnosis codes. However, it was shown in [80] that the strategy employed in RBAT might fail to preserve data utility well, as the "growing" of the tree may stop "early". That is, a replacement of diagnosis codes with less general ones, which is beneficial for data utility, may be possible but not considered by the strategy employed by RBAT.
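The notion of a cut used above can be checked mechanically: a set of generalized codes is a valid cut if every original code maps to exactly one of them. The sketch below does this for an assumed toy taxonomy.

```python
# Minimal sketch: check that a set of generalized codes is a valid cut of a
# diagnosis-code taxonomy, i.e., every original code maps to exactly one
# generalized code. The toy taxonomy is an illustrative assumption.

taxonomy = {                      # child -> parent
    "250.01": "Diabetes", "250.02": "Diabetes",
    "493.00": "Asthma",   "493.01": "Asthma",
    "Diabetes": "Any",    "Asthma": "Any",
}
leaves = ["250.01", "250.02", "493.00", "493.01"]

def ancestors(code):
    chain = [code]
    while code in taxonomy:
        code = taxonomy[code]
        chain.append(code)
    return chain

def is_cut(candidate):
    return all(sum(1 for a in ancestors(leaf) if a in candidate) == 1
               for leaf in leaves)

print(is_cut({"Diabetes", "493.00", "493.01"}))  # True: every leaf covered once
print(is_cut({"Diabetes", "Any"}))               # False: diabetes leaves covered twice
```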

To address this issue, a different strategy that examines such replacements, when partitioning the space of diagnosis codes, was proposed in [80]. This strategy examines certain branches of the tree that are not examined by the strategy of RBAT, and its use allows the Tree-based algorithm to preserve data utility better than RBAT. Moreover, to further increase the number of ways to generalize data, the authors of [80] proposed a way to combine top-down with bottom-up partitioning strategies, by first growing the tree as long as identity disclosure is prevented, and then backtracking (i.e., traversing the tree in a bottom-up way) to ensure that attribute disclosure is guarded against.

5. Relevant techniques

This section provides a discussion of privacy-preserving techniques that are relevant, but not directly related, to those surveyed in this paper. These techniques are applied to different types of medical data, or aim at privately releasing aggregate information about the data.

5.1. Privacy-preserving sharing of genomic and text data

While many works investigate threats related to the publishing of demographics and diagnosis codes, there have been considerable efforts by the computer science and health informatics communities to preserve the privacy of other types of data, such as genomic and textual data. In the following, we briefly discuss techniques that have been proposed for the protection of each of these types of data.

5.1.1. Genomic privacy

It is worth noting that a patient's record may be distinguishable with respect to genomic data. Lin et al. [88], for example, estimated that an individual is unique with respect to a small number (approximately 100) of Single Nucleotide Polymorphisms (SNPs), i.e., DNA sequence variations occurring when a single nucleotide in the genome differs between paired chromosomes in an individual. In addition, the release of aggregate genomic information may threaten privacy, as genomic sequences contain sensitive information, such as the ancestral origin of an individual [108] and genetic information about the individual's family members [17]. For instance, Homer et al. [54] showed that such information may allow an attacker to infer whether an individual belongs to the case or control group of GWAS data (i.e., whether or not the individual is diagnosed with a GWAS-related disease), while Wang et al. [123] presented two attacks: one that can statistically determine the presence of an individual in the case group, based upon a measure of the correlation between alleles, and another that allows the inference of the SNP sequences of many individuals that are present in the GWAS data, based on correlations between SNPs.

To protect the privacy of genomic information, there are several techniques that are based on cryptography (e.g., see [8] and the references therein) or on perturbation (e.g., see [37]). For instance, Wang et al. [124] proposed cryptographic techniques for the computation of edit distance on genomic data, while Baldi et al. [8] considered different operations, including paternity and genetic compatibility tests. On the other hand, Fienberg et al. [37] examined how to release aggregate statistics for GWAS while satisfying differential privacy through perturbation. In particular, the authors of [37] proposed two methods: one that focuses on the publication of the χ² statistic and p-values and works by adding Laplace noise to the original statistics, and a second method that allows releasing noisy versions of these statistics for the most relevant SNPs.

5.1.2. Text de-identification

A considerable amount of information about patients is contained in textual data, such as clinical notes, SOAP (Subjective, Objective, Assessment, Patient care plan) notes, radiology and pathology reports, and discharge summaries. Text data contain much confidential information about a patient, including their name, medical record identifier, and social security number, which must be protected before data release. This involves two steps: (i) detecting direct identifiers and (ii) transforming the detected identifiers in a way that preserves the integrity of medical information. The latter step is called de-identification. In the following, we briefly discuss some techniques that have been proposed for both detecting and transforming direct identifiers. We refer the reader to the survey by Meystre et al. [94] for an extensive review.

Techniques for discovering direct identifiers are based on: (i) Named Entity Recognition (NER), (ii) grammars (or rules), and (iii) statistical learning. NER-based techniques work by locating direct identifiers in text and then classifying them into pre-defined categories. For instance, the atomic elements Tom Green and 6152541261 in a clinical note would be classified into the categories Name and Phone Number, respectively.

The second type of techniques uses hand-coded rules and dictionaries to identify direct identifiers, or regular expressions for identifiers that follow certain syntactic patterns (e.g., a phone number must start with a valid area code), while the last type of techniques is typically based on classification. That is, such techniques aim at classifying the terms of previously unseen elements, contained in test data, as direct identifiers or as non-identifiers, based on knowledge of training data.
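As a minimal sketch of the rule-based approach, the snippet below combines a few regular expressions with a small name dictionary; the patterns and the sample note are illustrative assumptions and are far simpler than those used in real de-identification systems.

```python
# Minimal sketch of a rule-based detector: regular expressions for a few
# identifier patterns plus a small name dictionary. Patterns are illustrative.
import re

patterns = {
    "PHONE": re.compile(r"\b\d{3}[- ]?\d{3}[- ]?\d{4}\b"),
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "MRN":   re.compile(r"\bMRN[: ]?\d{6,10}\b", re.IGNORECASE),
}
name_dictionary = {"Tom Green"}  # hypothetical dictionary lookup

def detect_identifiers(text):
    hits = [(label, m.group()) for label, rx in patterns.items()
            for m in rx.finditer(text)]
    hits += [("NAME", name) for name in name_dictionary if name in text]
    return hits

note = "Tom Green (MRN 00123456) was contacted at 615 254 1261."
print(detect_identifiers(note))
```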

The main advantage of NER and grammar-based approaches is that they need little or no training data, and can be easily modified (e.g., by adding a new regular expression). However, their configuration typically requires significant domain expertise (e.g., to specify rules) and, in many cases, knowledge of the specific dataset (e.g., naming conventions). On the other hand, techniques that are based on statistical learning can "learn" the characteristics of data, using different methods, such as Support Vector Machines (SVM) [14] or Conditional Random Fields (CRF) [61]. However, they are limited in that they typically require large training datasets, such as manually annotated text data with pre-labeled identifiers, whose construction is challenging [13].

After the discovery of direct identifiers, there are several transformation strategies that can be applied to them. These include the replacement of direct identifiers with fake, but realistic-looking, elements [111,12,50], suppression [18], and generalization [59].


Most of the works on protecting text data aim at transforming direct identifiers without offering specific privacy guarantees. On the contrary, the works of Chakaravarthy et al. [18] and Jiang et al. [59] offer such guarantees by employing privacy models. The first of these works proposes the K-safety model, which prevents the matching of documents to entities based on terms that co-occur in a document. This is achieved by lower-bounding the number of entities that these terms correspond to. The work of Jiang et al. [59] proposes a different privacy model, called t-plausibility, which, given word ontologies and a threshold t, requires the sanitized text to be associated with at least t plausible texts, any of which could be the original text.

5.2. Aggregate information release

There are certain applications in which data recipients are interested in learning aggregate information from the data, instead of detailed information about individual records. Such aggregate information can range from simple statistics that are directly computed from the data, to complex patterns that are discovered through the application of data mining techniques. The interest in supporting these applications has been fueled by recent advances in the development of semantic privacy models. These models dictate that the mechanism chosen for releasing the aggregate information (e.g., in the form of a noisy summary of the data) must satisfy certain properties.

One of the most established semantic models is differential privacy [29], which requires the outcome of a calculation to be insensitive to the presence or absence of any particular record in the dataset. More formally, given an arbitrary randomized function K and a subset S of its possible outputs, K satisfies ε-differential privacy if

P(K(D) ∈ S) ≤ e^ε · P(K(D′) ∈ S)    (1)

where D and D′ are any two datasets that differ in only one record, ε is a parameter that controls the strength of the guarantee, and P(K(D) ∈ S) (respectively, P(K(D′) ∈ S)) is the probability that the result of applying K to D (respectively, D′) is contained in the subset S. For instance, the result of statistical analysis carried out on a differentially private data summary must be insensitive to the insertion (or deletion) of a record in (from) the original dataset from which the summary is produced. This offers privacy, because the inferences an attacker can make about an individual will be approximately independent of whether that individual's record is included in the original dataset or not. On the negative side, the enforcement of differential privacy only allows the release of noisy summary statistics (which is akin to releasing, as a summary, the noisy answers to the queries that would be posed to a technique enforcing differential privacy in the interactive setting), and it does not guarantee the prevention of all attacks. Cormode, for example, showed that an attacker can infer the sensitive value of an individual fairly accurately by applying a classification algorithm on differentially private data [21].
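For concreteness, the following minimal sketch shows the Laplace mechanism for a single count query, which is the textbook way to satisfy Eq. (1) for counts (a count has sensitivity 1, so noise drawn from Laplace(1/ε) suffices); the cohort and query are illustrative assumptions.

```python
# Minimal sketch of the Laplace mechanism for a count query: a count has
# sensitivity 1 (adding or removing one record changes it by at most 1),
# so adding Laplace(1/epsilon) noise yields an epsilon-differentially
# private answer. Illustrative only.
import random

def laplace_noise(scale):
    # The difference of two independent exponential draws is Laplace-distributed.
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def private_count(records, predicate, epsilon):
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

cohort = [{"age": a, "diag": d} for a, d in
          [(45, "250.00"), (52, "401.9"), (61, "250.00"), (38, "530.81")]]
print(private_count(cohort, lambda r: r["diag"] == "250.00", epsilon=0.5))
```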

Differential privacy has led to the development of several other semantic models, which are surveyed in [23]. These models relax the (strong) privacy requirements posed by differential privacy by: (i) introducing an additive factor δ to the right part of Eq. (1) [31], or (ii) considering attackers with limited computational resources (i.e., attackers with polynomial-time computation bounds) [95]. This offers the advantage of limiting noise addition, at the expense of weaker privacy guarantees than those offered by differential privacy.

At the same time, there are algorithms for enforcing differential privacy which are applicable to demographics or diagnosis codes. For example, Mohammed et al. [96] proposed a method to release a noisy summary of a dataset containing demographics that aims at preserving classification accuracy, while Chen et al. [20] showed how to release noisy answers to certain count queries involving sets of diagnosis codes. Interestingly, both techniques apply partitioning strategies similar in principle to those used by the TDS algorithm [40] and the Recursive partition algorithm [52], respectively. In addition, systems that allow the differentially private release of aggregate information from electronic health records are emerging. For instance, SHARE [42] is a recently proposed system for releasing multidimensional histograms and longitudinal patterns.

6. Future research directions

Disseminating person-specific data from electronic health records offers the potential for allowing large-scale, low-cost medical studies, in areas including epidemic detection and post-marketing safety evaluation. At the same time, preserving patient privacy is necessary and, in many cases, this can be achieved based on the techniques presented in this survey. However, there are several directions that warrant further research.

First, it is important to study privacy threats posed when releasing patient data, from both a theoretical and a practical perspective. This requires the identification and modeling of privacy attacks, beyond those discussed in this paper, and an evaluation of their feasibility on large cohorts of patient data. In fact, it is currently difficult to automatically detect threats for many types of medical data (e.g., for data containing diagnosis codes, or for genomic data), despite some interesting work towards this direction on demographics [10,35]. Furthermore, knowledge of: (i) dependencies between quasi-identifiers and sensitive values (e.g., the fact that male patients are less likely to be diagnosed with breast cancer than female ones) [72,27], (ii) quasi-identifier values of particular individuals [114] and/or their family members [19], and (iii) the operations of anonymization algorithms [125], may pose privacy risks. However, none of these threats has been investigated in the context of medical data, and it is not clear whether or not the solutions proposed by the computer science community to tackle them are suitable for use in healthcare settings.

Second, the mining of published data may reveal privacy-intrusive inferences about individuals [47,48,51], which cannot be eliminated by applying the privacy models discussed so far. Intuitively, this is because mining reveals knowledge patterns that apply to a large number of individuals, and these patterns are not considered as sensitive by the aforementioned privacy models. Consider, for example, that an insurance company applies classification to the data obtained from a healthcare institution to discover that patients over 40 years old, who live in an area with Zip Code 55413, are likely to be diagnosed with diseases that have a very high hospitalization cost. Based on this (sensitive) knowledge, the insurance company may decide to offer more expensive insurance coverage to these patients. To avoid such inferences, sensitive knowledge patterns need to be identified prior to data publishing and be concealed, so that they cannot be discovered when the data are shared.

Third, the large growth in the complexity and size of medical datasets that are being disseminated poses significant challenges to existing privacy-preserving algorithms. As an example of a complex data type, consider a set of records that contain both demographics and diagnosis codes. Despite the need for analyzing demographics and diagnosis codes together in the context of medical tasks (e.g., for predictive modeling), preserving the privacy of such datasets is very challenging. This is because it is not safe to protect demographics and diagnosis codes independently, using existing techniques (e.g., the Mondrian [66] algorithm for demographics and the UGACLIP [76] algorithm for diagnosis codes), while guarding against this threat (i.e., attacks that combine demographics and diagnosis codes) and minimizing information loss is computationally infeasible [105]. In addition, the vast majority of existing techniques assume that the dataset to be protected is relatively small, so that it fits into main memory. However, datasets with sizes of several GBs or even TBs may need to be disseminated in practice. Thus, it would be worthwhile to develop scalable techniques that potentially take advantage of parallel architectures to solve this problem.

Fourth, privacy approaches that apply to complex data sharing scenarios need to be proposed. As an example, consider the case of multiple healthcare providers and data recipients who wish to build a common data repository. Healthcare providers may, for example, contribute different parts of patients' EHR data to the repository, whereas data recipients may query these data to obtain anonymized views (i.e., anonymized parts of one or more datasets in the repository) for different purposes [77]. This scenario presents several interesting challenges. First, data contributed by different providers need to be integrated in an efficient and privacy-preserving way. Second, user queries posed to the repository need to be audited, and the anonymized views need to be produced so as to adhere to the imposed privacy requirements. Achieving privacy in this scenario is non-trivial, because malicious users may combine their obtained views to breach privacy, even when each query answer is safe when examined independently of the others.

Last but not least, it is important to note that the overall assurance of health data privacy requires appropriate policy, in addition to technical means, both of which are exceedingly important.

7. Conclusions

In this work, we presented a survey of privacy algorithms that have been proposed for publishing structured patient data. We reviewed more than 45 privacy algorithms, derived insights on their operation, and highlighted their advantages and disadvantages. Subsequently, we provided a discussion of some promising directions for future research in this area.

Acknowledgements

Grigorios Loukides is partly supported by a Research Fellowship from the Royal Academy of Engineering. The authors would like to thank the anonymous reviewers for their insightful comments, which significantly helped to improve the quality of this manuscript.

References

[1] EU Data Protection Directive 95/46/ECK; 1995.[2] UK Data Protection Act; 1998.[3] Personal Information Protection and Electronic Documents Act; 2000.[4] Adam NR, Worthmann JC. Security-control methods for statistical databases:

a comparative study. ACM Comput Surv 1989;21(4):515–56.[5] Aggarwal CC, Yu PS. Privacy-preserving data mining: models and

algorithms. Springer; 2008.[6] Aggarwal G, Kenthapadi F, Motwani K, Panigrahy R, Thomasand D, Zhu A.

Approximation algorithms for k-anonymity. Journal of Privacy Technology2005.

[7] Agrawal R, Srikant R. Fast algorithms for mining association rules in largedatabases. In: VLDB; 1994. p. 487–99.

[8] Baldi P, Baronio R, De Cristofaro E, Gasti P, Tsudik G. Countering GATTACA:efficient and secure testing of fully-sequenced human genomes. In:Proceedings of the 18th ACM conference on computer and communicationssecurity, CCS ’11; 2011. p. 691–702.

[9] Bayardo RJ, Agrawal R. Data privacy through optimal k-anonymization. In:21st ICDE; 2005. p. 217–28.

[10] Benitez K, Loukides G, Malin B. Beyond safe harbor: automatic discovery ofhealth information de-identification policy alternatives. In: ACMinternational health informatics symposium; 2010. p. 163–72.

[11] Berchtold S, Keim DA, Kriegel H. The x-tree: an index structure for high-dimensional data. In: VLDB; 1996. p. 28–39.

[12] Berman JJ. Concept-match medical data scrubbing. Arch Pathol Lab Med2003;127(6):680–6.

Please cite this article in press as: Gkoulalas-Divanis A et al. Publishing data frithms. J Biomed Inform (2014), http://dx.doi.org/10.1016/j.jbi.2014.06.002

[13] Bhagwan V, Grandison T, Maltzahn C. Recommendation-based de-identification: a practical systems approach towards de-identification of unstructured text in healthcare. In: Proceedings of the 2012 IEEE eighth world congress on services, SERVICES ’12; 2012. p. 155–62.
[14] Burges CJC. A tutorial on support vector machines for pattern recognition. Data Min Knowl Discov 1998;2(2):121–67.
[15] Byun J, Kamra A, Bertino E, Li N. Efficient k-anonymization using clustering techniques. In: DASFAA; 2007. p. 188–200.
[16] Cao J, Karras P, Raïssi C, Tan K. rho-uncertainty: inference-proof transaction anonymization. PVLDB 2010;3(1):1033–44.
[17] Cassa CA, Schmidt B, Kohane IS, Mandl KD. My sister’s keeper? Genomic research and the identifiability of siblings. BMC Med Genom 2008;1:32.
[18] Chakaravarthy VT, Gupta H, Roy P, Mohania MK. Efficient techniques for document sanitization. In: Proceedings of the 17th ACM conference on information and knowledge management; 2008. p. 843–52.
[19] Chen B, Ramakrishnan R, LeFevre K. Privacy skyline: privacy with multidimensional adversarial knowledge. In: VLDB; 2007. p. 770–81.
[20] Chen R, Mohammed N, Fung BCM, Desai BC, Xiong L. Publishing set-valued data via differential privacy. PVLDB 2011;4(11):1087–98.
[21] Cormode G. Personal privacy vs population privacy: learning to attack anonymization. In: KDD; 2011. p. 1253–61.
[22] Dean BB, Lam J, Natoli JL, Butler Q, Aguilar D, Nordyke RJ. Use of electronic medical records for health outcomes research: a literature review. Med Care Res Rev 2010;66(6):611–38.
[23] De Capitani di Vimercati S, Foresti S, Livraga G, Samarati P. Data privacy: definitions and techniques. Int J Uncertainty Fuzz Knowl-Based Syst 2012;20(6):793–818.
[24] Domingo-Ferrer J, Mateo-Sanz JM. Practical data-oriented microaggregation for statistical disclosure control. IEEE Trans Knowl Data Eng 2002;14(1):189–201.
[25] Domingo-Ferrer J, Torra V. Ordinal, continuous and heterogeneous k-anonymity through microaggregation. Data Min Knowl Discov 2005;11(2):195–212.
[26] Domingo-Ferrer J, Martínez-Ballesté A, Mateo-Sanz J, Sebé F. Efficient multivariate data-oriented microaggregation. VLDB J 2006;15(4):355–69.
[27] Du W, Teng Z, Zhu Z. Privacy-MaxEnt: integrating background knowledge in privacy quantification. In: SIGMOD; 2008. p. 459–72.
[28] Du Y, Xia T, Tao Y, Zhang D, Zhu F. On multidimensional k-anonymity with local recoding generalization. In: ICDE ’07; 2007. p. 1422–4.
[29] Dwork C. Differential privacy. In: ICALP; 2006. p. 1–12.
[30] Dwork C. Differential privacy: a survey of results. In: TAMC; 2008. p. 1–19.
[31] Dwork C, Kenthapadi K, McSherry F, Mironov I, Naor M. Our data, ourselves: privacy via distributed noise generation. In: Proceedings of the 24th annual international conference on the theory and applications of cryptographic techniques, EUROCRYPT ’06; 2006. p. 486–503.
[32] El Emam K, Dankar F, Issa R, Jonker E, et al. A globally optimal k-anonymity method for the de-identification of health data. J Am Med Informat Assoc 2009;16(5):670–82.
[33] El Emam K, Jonker E, Arbuckle L, Malin B. A systematic review of re-identification attacks on health data. PLoS ONE 2011;6(12):e28071.
[34] El Emam K, Dankar FK. Protecting privacy using k-anonymity. J Am Med Informat Assoc 2008;15(5):627–37.
[35] El Emam K, Buckeridge D, Tamblyn R, Neisa A, Jonker E, Verma A. The re-identification risk of Canadians from longitudinal demographics. BMC Med Informat Dec Mak 2011;11(46).
[36] Fernandez-Aleman J, Senor I, Oliver Lozoya PA, Toval A. Security and privacy in electronic health records: a systematic literature review. J Biomed Informat 2013;46(3):541–62.
[37] Fienberg SE, Slavkovic A, Uhler C. Privacy preserving GWAS data sharing. In: IEEE ICDM workshops; 2011. p. 628–35.
[38] Filho YV. Optimal choice of discriminators in a balanced k-d binary search tree. Inf Process Lett 1981;13(2):67–70.
[39] Fung BCM, Wang K, Chen R, Yu PS. Privacy-preserving data publishing: a survey on recent developments. ACM Comput Surv 2010;42.
[40] Fung BCM, Wang K, Yu PS. Top-down specialization for information and privacy preservation. In: ICDE; 2005. p. 205–16.
[41] Fung BCM, Wang K, Wang L, Hung PCK. Privacy-preserving data publishing for cluster analysis. Data Knowl Eng 2009;68(6):552–75.
[42] Gardner JJ, Xiong L, Xiao Y, Gao J, Post AR, Jiang X, et al. SHARE: system design and case studies for statistical health information release. JAMIA 2013;20(1):109–16.
[43] Gertz M, Jajodia S, editors. Handbook of database security – applications and trends. Springer; 2008.
[44] Ghinita G, Karras P, Kalnis P, Mamoulis N. Fast data anonymization with low information loss. In: Proceedings of the 33rd international conference on very large data bases, VLDB ’07; 2007. p. 758–69.
[45] Gionis A, Mazza A, Tassa T. k-Anonymization revisited. In: ICDE; 2008. p. 744–53.
[46] Gkoulalas-Divanis A, Loukides G. PCTA: privacy-constrained clustering-based transaction data anonymization. In: EDBT PAIS; 2011. p. 5.
[47] Gkoulalas-Divanis A, Loukides G. Revisiting sequential pattern hiding to enhance utility. In: KDD; 2011. p. 1316–24.
[48] Gkoulalas-Divanis A, Verykios VS. Hiding sensitive knowledge without side effects. Knowl Inf Syst 2009;20(3):263–99.
[49] Guha S, Rastogi R, Shim K. CURE: an efficient clustering algorithm for large databases. In: SIGMOD; 1998. p. 73–84.
[50] Gupta D, Saul M, Gilbertson J. Evaluation of a deidentification (De-Id) software engine to share pathology reports and clinical documents for research. Am J Clin Pathol 2004;121(2):176–86.
[51] Gwadera R, Gkoulalas-Divanis A, Loukides G. Permutation-based sequential pattern hiding. In: IEEE international conference on data mining (ICDM); 2013. p. 241–50.
[52] He Y, Naughton JF. Anonymization of set-valued data via top-down, local generalization. PVLDB 2009;2(1):934–45.
[53] Hettich S, Merz CJ. UCI repository of machine learning databases; 1998.
[54] Homer N, Szelinger S, Redman M, et al. Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays. PLoS Genet 2008;4(8):e1000167.
[55] Hore B, Jammalamadaka RC, Mehrotra S. Flexible anonymization for privacy preserving data publishing: a systematic search based approach. In: SDM; 2007.
[56] Hsiao CJ, Hing E. Use and characteristics of electronic health record systems among office-based physician practices: United States, 2001–2012. NCHS data brief; 2012. p. 1–8.
[57] Iwuchukwu T, Naughton JF. k-Anonymization as spatial indexing: toward scalable and incremental anonymization. In: VLDB; 2007. p. 746–57.
[58] Iyengar VS. Transforming data to satisfy privacy constraints. In: KDD; 2002. p. 279–88.
[59] Jiang W, Murugesan M, Clifton C, Si L. t-Plausibility: semantic preserving text sanitization. In: CSE ’09. International conference on computational science and engineering, vol. 3; 2009. p. 68–75.
[60] Koudas N, Zhang Q, Srivastava D, Yu T. Aggregate query answering on anonymized tables. In: ICDE ’07; 2007. p. 116–25.
[61] Lafferty JD, McCallum A, Pereira FCN. Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Proceedings of the eighteenth international conference on machine learning, ICML ’01; 2001. p. 282–9.
[62] Laszlo M, Mukherjee S. Minimum spanning tree partitioning algorithm for microaggregation. IEEE Trans Knowl Data Eng 2005;17(7):902–11.
[63] Lau EC, Mowat FS, Kelsh MA, Legg JC, Engel-Nitz NM, Watson HN, et al. Use of electronic medical records (EMR) for oncology outcomes research: assessing the comparability of EMR information to patient registry and health claims data. Clin Epidemiol 2011;3(1):259–72.
[64] LeFevre K, DeWitt DJ, Ramakrishnan R. Workload-aware anonymization techniques for large-scale datasets. ACM Trans Database Syst 2008;33(3).
[65] LeFevre K, DeWitt DJ, Ramakrishnan R. Incognito: efficient full-domain k-anonymity. In: SIGMOD; 2005. p. 49–60.
[66] LeFevre K, DeWitt DJ, Ramakrishnan R. Mondrian multidimensional k-anonymity. In: ICDE; 2006. p. 25.
[67] LeFevre K, DeWitt DJ, Ramakrishnan R. Workload-aware anonymization. In: KDD; 2006. p. 277–86.
[68] Li J, Wong R, Fu A, Pei J. Achieving k-anonymity by clustering in attribute hierarchical structures. In: DaWaK; 2006. p. 405–16.
[69] Li N, Li T, Venkatasubramanian S. t-Closeness: privacy beyond k-anonymity and l-diversity. In: ICDE; 2007. p. 106–15.
[70] Li N, Li T, Venkatasubramanian S. Closeness: a new privacy measure for data publishing. IEEE Trans Knowl Data Eng 2010;22(7):943–56.
[71] Li N, Tripunitara MV. Security analysis in role-based access control. ACM Trans Inf Syst Secur 2006;9(4):391–420.
[72] Li T, Li N. Injector: mining background knowledge for data anonymization. In: ICDE; 2008. p. 446–55.
[73] Lindell Y, Pinkas B. Privacy preserving data mining. 2000. p. 36–54.
[74] Liu J, Wang K. Anonymizing transaction data by integrating suppression and generalization. In: Proceedings of the 14th Pacific-Asia conference on advances in knowledge discovery and data mining, PAKDD ’10; 2010. p. 171–80.
[75] Loukides G, Denny JC, Malin B. The disclosure of diagnosis codes can breach research participants’ privacy. J Am Med Informat Assoc 2010;17:322–7.
[76] Loukides G, Gkoulalas-Divanis A, Malin B. Anonymization of electronic medical records for validating genome-wide association studies. Proc Nat Acad Sci 2010;107(17):7898–903.
[77] Loukides G, Gkoulalas-Divanis A, Malin B. An integrative framework for anonymizing clinical and genomic data. Database technology for life sciences and medicine. World Scientific; 2010. p. 65–89 [chapter 8].
[78] Loukides G, Gkoulalas-Divanis A, Malin B. COAT: constraint-based anonymization of transactions. Knowl Inf Syst 2011;28(2):251–82.
[79] Loukides G, Gkoulalas-Divanis A, Shao J. Anonymizing transaction data to eliminate sensitive inferences. In: DEXA; 2010. p. 400–15.
[80] Loukides G, Gkoulalas-Divanis A, Shao J. Efficient and flexible anonymization of transaction data. Knowl Inf Syst 2013;36(1):153–210.
[81] Loukides G, Shao J. Capturing data usefulness and privacy protection in k-anonymisation. In: SAC; 2007. p. 370–4.
[82] Loukides G, Shao J. Clustering-based k-anonymisation algorithms. In: DEXA; 2007. p. 761–71.
[83] Loukides G, Shao J. An efficient clustering algorithm for k-anonymisation. J Comput Sci Technol 2008;23(2):188–202.
[84] Loukides G, Shao J. Preventing range disclosure in k-anonymised data. Expert Syst Appl 2011;38(4):4559–74.
[85] Loukides G, Tziatzios A, Shao J. Towards preference-constrained k-anonymisation. In: DASFAA international workshop on privacy-preserving data analysis (PPDA); 2009. p. 231–45.
[86] Loukides G, Gkoulalas-Divanis A. Utility-preserving transaction data anonymization with low information loss. Expert Syst Appl 2012;39(10):9764–77.
[87] Loukides G, Gkoulalas-Divanis A. Utility-aware anonymization of diagnosis codes. IEEE J Biomed Health Informat 2013;17(1):60–70.
[88] Machanavajjhala A, Gehrke J, Kifer D, Venkitasubramaniam M. l-Diversity: privacy beyond k-anonymity. In: ICDE; 2006. p. 24.
[89] Mailman MD, Feolo M, Jin Y, Kimura M, Tryka K, Bagoutdinov R, et al. The NCBI dbGaP database of genotypes and phenotypes. Nat Genet 2007;39:1181–6.
[90] Makoul G, Curry RH, Tang PC. The use of electronic medical records: communication patterns in outpatient encounters. J Am Med Informat Assoc 2001;8(6):610–5.
[91] Malin B, Karp D, Scheuermann RH. Technical and policy approaches to balancing patient privacy and data sharing in clinical and translational research. J Invest Med: Off Pub Am Fed Clin Res 2010;58(1):11–8.
[92] Malin B, Loukides G, Benitez K, Clayton EW. Identifiability in biobanks: models, measures, and mitigation strategies. Hum Genet 2011;130(3):383–92.
[93] Malin B, Loukides G, Benitez K, Clayton EW. Identifiability in biobanks: models, measures, and mitigation strategies. Hum Genet 2011;130(3):383–92.
[94] Meystre SM, Friedlin FJ, South BR, Shen S, Samore MH. Automatic de-identification of textual documents in the electronic health record: a review of recent research. BMC Med Res Methodol 2010;10(70).
[95] Mironov I, Pandey O, Reingold O, Vadhan S. Computational differential privacy. In: Proceedings of the 29th annual international cryptology conference on advances in cryptology, CRYPTO ’09; 2009. p. 126–42.
[96] Mohammed N, Chen R, Fung BCM, Yu PS. Differentially private data release for data mining. In: KDD; 2011. p. 493–501.
[97] Moon B, Jagadish HV, Faloutsos C, Saltz JH. Analysis of the clustering properties of the Hilbert space-filling curve. IEEE Trans Knowl Data Eng 2001;13(1):124–41.
[98] Ruffolo M, Angiulli F, Pizzuti C. DESCRY: a density based clustering algorithm for very large data sets. In: 5th International conference on intelligent data engineering and automated learning (IDEAL ’04); 2004. p. 25–7.
[99] Narayanan A, Shmatikov V. Robust de-anonymization of large sparse datasets. In: IEEE S&P; 2008. p. 111–25.
[100] Nergiz ME, Atzori M, Clifton C. Hiding the presence of individuals from shared databases. In: SIGMOD ’07; 2007. p. 665–76.
[101] Nergiz ME, Clifton C. d-presence without complete world knowledge. IEEE Trans Knowl Data Eng 2010;22(6):868–83.
[102] Nergiz ME, Clifton C. Thoughts on k-anonymization. Data Knowl Eng 2007;63(3):622–45.
[103] Nergiz ME, Clifton C. d-presence without complete world knowledge. IEEE Trans Knowl Data Eng 2010;22(6):868–83.
[104] Ollier WER, Sprosen T, Peakman T. UK Biobank: from concept to reality. Pharmacogenomics 2005;6(6):639–46.
[105] Poulis G, Loukides G, Gkoulalas-Divanis A, Skiadopoulos S. Anonymizing data with relational and transaction attributes. In: Machine learning and knowledge discovery in databases – European conference (ECML/PKDD) (3); 2013. p. 353–69.
[106] Press WH, Teukolsky SA, Vetterling WT, Flannery BP. Numerical recipes in C: the art of scientific computing. 2nd ed. Cambridge University Press; 1992.
[107] Reis BY, Kohane IS, Mandl KD. Longitudinal histories as predictors of future diagnoses of domestic abuse: modelling study. BMJ 2009;339(9).
[108] Rothstein M, Epps P. Ethical and legal implications of pharmacogenomics. Nat Rev Genet 2001;2:228–31.
[109] Samarati P. Protecting respondents’ identities in microdata release. IEEE Trans Knowl Data Eng 2001;13(6):1010–27.
[110] Sandhu RS, Coyne EJ, Feinstein HL, Youman CE. Role-based access control models. IEEE Comput 1996;29(2):38–47.
[111] Sweeney L. Replacing personally-identifying information in medical records, the Scrub system. In: Proceedings of the AMIA annual fall symposium; 1996. p. 333–7.
[112] Sweeney L. k-Anonymity: a model for protecting privacy. Int J Uncertainty Fuzz Knowl-Based Syst 2002;10:557–70.
[113] Sweeney L. Computational disclosure control: a primer on data privacy protection. PhD thesis, AAI0803469; 2001.
[114] Tao Y, Xiao X, Li J, Zhang D. On anti-corruption privacy preserving publication. In: Proceedings of the 2008 IEEE 24th international conference on data engineering, ICDE ’08; 2008. p. 725–34.
[115] Terrovitis M, Mamoulis N, Kalnis P. Privacy-preserving anonymization of set-valued data. PVLDB 2008;1(1):115–25.
[116] Terrovitis M, Mamoulis N, Kalnis P. Local and global recoding methods for anonymizing set-valued data. VLDB J 2011;20(1):83–106.
[117] Tildesley MJ, House TA, Bruhn MC, Curry RJ, O’Neil M, Allpress JLE, et al. Impact of spatial clustering on disease transmission and optimal control. Proc Nat Acad Sci 2010;107(3):1041–6.
[118] Truta TM, Vinay B. Privacy protection: p-sensitive k-anonymity property. In: ICDE workshops; 2006. p. 94.
[119] United States Census American Community Survey. Public Use Microdata; 2003.
[120] U.S. Department of Health and Human Services Office for Civil Rights. HIPAA administrative simplification regulation text; 2006.
[121] Vaidya J, Clifton C. Privacy-preserving k-means clustering over vertically partitioned data. In: KDD; 2003. p. 206–15.
[122] Van Rijsbergen CJ. Information retrieval. 2nd ed. Newton, MA, USA: Butterworth-Heinemann; 1979.
[123] Wang R, Li YF, Wang X, Tang H, Zhou X. Learning your identity and disease from research papers: information leaks in genome wide association study. In: CCS; 2009. p. 534–44.
[124] Wang R, Wang X, Li Z, Tang H, Reiter MK, Dong Z. Privacy-preserving genomic computation through program specialization. In: Proceedings of the 16th ACM conference on computer and communications security, CCS ’09; 2009. p. 338–47.
[125] Wong RC, Fu A, Wang K, Pei J. Minimality attack in privacy preserving data publishing. In: VLDB; 2007. p. 543–54.
[126] Wong RC, Li J, Fu A, Wang K. alpha-k-Anonymity: an enhanced k-anonymity model for privacy-preserving data publishing. In: KDD; 2006. p. 754–9.
[127] Xiao X, Tao Y. Anatomy: simple and effective privacy preservation. In: VLDB; 2006. p. 139–50.
[128] Xiao X, Tao Y. Personalized privacy preservation. In: SIGMOD; 2006. p. 229–40.
[129] Xu J, Wang W, Pei J, Wang X, Shi B, Fu AW-C. Utility-based anonymization using local recoding. In: KDD; 2006. p. 785–90.
[130] Xu Y, Wang K, Fu AW-C, Yu PS. Anonymizing transaction databases for publication. In: KDD; 2008. p. 767–75.
[131] Zhang R, Kalnis P, Ooi B, Tan K. Generalized multidimensional data mapping and query processing. ACM Trans Database Syst 2005;30(3):661–97.