De-identifying Clinical Data Khaled El Emam, CHEO RI & uOttawa
May 20, 2015
Electronic Health Information Laboratory, CHEO Research Institute, 401 Smyth Road, Ottawa K1H 8L1, Ontario; www.ehealthinformation.ca
Secondary Use/Disclosure
[Figure: data flow diagram linking individuals, custodian, agent, and recipient through collection, use, and disclosure]
Data Flows
• Mandatory disclosures
• Uses by an agent for secondary purposes
• Permitted discretionary disclosures for secondary purposes
• Other disclosures for secondary purposes
Obtaining Consent - I
• Sometimes it is not possible or practical to obtain consent:
– Making contact to obtain consent may reveal the individual’s condition to others against their wishes
– The size of the population may be too large to obtain consent from everyone
– Many patients may have relocated or died
Obtaining Consent - II
– There may be a lack of existing or continuing relationship with the patients
– There is a risk of inflicting psychological, social or other harm by contacting individuals or their families in delicate circumstances
– It would be difficult to contact individuals through advertisements and other public notices
Impact of Obtaining Consent
• In the case where explicit consent is used, consenters and non-consenters differ on:
– age, sex, race, marital status, educational level, socioeconomic status, health status, mortality, lifestyle factors, functioning
• The consent rate for express consent varied from 16% to 93%
Limiting Principles
• Do not collect, use, or disclose PHI if other information will serve the purpose
• For example, even if it is easier to disclose a whole record, that should not be done if lesser information will reasonably satisfy the purpose
• De-identification would be one element in limiting the amount of PHI that is collected/used/disclosed
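The limiting principle can be illustrated as a field allow-list applied before disclosure. This is a minimal sketch, not a complete de-identification step; the field names and the `NEEDED_FOR_STUDY` set are hypothetical.

```python
# Hypothetical set of fields that a particular secondary purpose actually needs.
NEEDED_FOR_STUDY = {"age_group", "diagnosis", "admission_year"}


def minimal_extract(record, needed_fields):
    """Keep only the fields required for the stated purpose; drop everything else."""
    return {field: value for field, value in record.items() if field in needed_fields}


full_record = {
    "name": "Example Patient",   # made-up value
    "age_group": "40-49",
    "diagnosis": "asthma",
    "admission_year": 2014,
    "telephone": "555-0100",     # made-up value
}
disclosed = minimal_extract(full_record, NEEDED_FOR_STUDY)
```

An allow-list (naming what may leave) is used here rather than a block-list, so that a newly added field is withheld by default.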
Breaches
• In many large research hospitals and hospital networks it is simply not possible to control and manage all of the databases and data sets that are created, used, and disclosed for research
• Breach frequency and severity are growing
• De-identification provides one way to manage the risks, however
Trust
• Patients change their behavior if they perceive a threat to privacy
• This can have a negative impact on the quality of the data that is used for research
Deloitte Survey (2007)
• N=827 respondents in North America
• 43% reported more than 10 privacy breaches within the last 12 months in their organizations
• Over 85% reported at least one privacy breach
• Over 63% reported multiple privacy breaches requiring notification
• Breaches involving 1000+ records were reported by 34% of respondents
Verizon Study
• Based on forensic engagements conducted by Verizon
• Breaches resulting from external sources: 73%
• Caused by insiders: 18%
• Implicated business partners: 39%
• The median number of records involved in an insider breach was 10 times that of an external breach
• Biggest causes are errors and hackers
HIMSS Leadership Survey
• Survey of healthcare IT executives, n=307
• Conducted in the 2007-2008 timeframe
• 24% of respondents reported that they have had a security breach in their organization in the last 12 months
• 16% of respondents reported that they have had a security breach in their organization in the last 6 months
• Half indicated that an internal security breach is a concern to their organizations
HIMSS Analytics Report
• IT executives and security officers at healthcare institutions; n=263
• Half of respondents are concerned with internal inadvertent access to patient data
• 13% indicated that their organization has had a security breach in the last 12 months
• 80% of these were internal breaches
Medical Record Breaches 2008
• For all of 2008 (datalossdb.org)
• 83 breaches involving medical records (14% of total)
• Approx. 7.2 million records involved in these breaches (21.5% of all records)
Does this Happen Here?
• Do you know of any cases where computer equipment was stolen from a hospital? Did this equipment contain personal health information?
• Do you know of any cases where memory sticks with data on them were lost?
• Does anyone email data to their Hotmail or Gmail accounts so that they can access them from home or while travelling?
• Do people still share passwords?
Known Data Leaks
• PHI on second hand computers
• Leaks through peer-to-peer file sharing networks
• PowerPoint files on the Internet
• Password protected files sent by email
Identity Theft
• William Ernst Black (Edmonton 1999)
• The creation of identity packages using information about dead children who were living in one jurisdiction but died in another ($37k for each identity package)
• Example: drug smuggler was caught with these identity packages
• Example: American getting free medical care in Canada
Patient Concerns
• There is evidence (from surveys) that the general public has changed their behavior to adjust for perceived privacy risks with respect to their PHI:
– 15% to 17% of US adults
– 11% to 13% of Canadian adults
• There is also evidence that vulnerable populations exhibit similar behaviors (e.g., adolescents, people with HIV or at high risk for HIV, those undergoing genetic testing, mental health patients and battered women)
Behavior Change - I
• Going to another doctor
• Paying out of pocket when insured to avoid disclosure
• Not seeking care to avoid disclosure to an employer or to not be seen entering a clinic by other members of the community
• Giving inaccurate or incomplete information on medical history
• Asking a doctor not to record a health problem or record a less serious or embarrassing one
Behavior Change - II
• 87% of US physicians reported that a patient had asked them not to include certain information in their record
• 78% of US physicians reported that they have withheld information due to privacy concerns
Asymmetry Principle - I
• Trust is hard to gain but easy to lose:
– Negative events/news carry more weight than positive ones (negativity bias); it is more diagnostic
– Avoiding loss: people weight negative information more greatly in an effort to avoid loss
– Sources of negative information appear more credible (positive information seems self-serving)
Asymmetry Principle - II
– People interpret information according to their prior beliefs: if they have negative prior beliefs then negative events will reinforce that and positive events will have little impact
– Undecided individuals tend to be affected more by negative information
– People with positive prior beliefs may feel betrayed by negative information/events
Canadian Public - 2007
[Figure: bar chart with a 0 to 100 axis showing responses of 5-7 on a 7-point scale, by region: Total, BC, Alberta, Prairies, Ont, Que, Atlantic, Territories]
Survey question: In your opinion, how safe and secure is the health information which EXISTS about you? (5-7 on a 7 pt scale)
Canadian Public - 2003
[Figure: horizontal bar chart with a 0 to 100 axis and response categories Agree (5-7), Neither (4), Disagree (1-3), and DK/NR]
Survey statement: I really worry that my personal health information might be used for other purposes in the future which have little to do with my health
How not to De-identify
• Just removing the name and address information is not enough
• It is quite easy to re-identify individuals from the other data that is left
• There are a number of public real life examples of re-identification actually happening
Example Data With PHI
Types of Variables
• Identifying variables: variables that can directly identify a patient
• Quasi-identifiers: variables that can indirectly identify a patient
• Sensitive variables: sensitive clinical information that the patient would not want to be known beyond the circle of care
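One way to make the three variable types concrete is to tag each column of a data set with its role before any risk assessment. The sketch below is illustrative only: the column names and the role assignments are hypothetical and would have to be decided for each real data set.

```python
from enum import Enum


class Role(Enum):
    IDENTIFYING = "identifying"   # can directly identify a patient
    QUASI_IDENTIFIER = "quasi"    # can indirectly identify a patient
    SENSITIVE = "sensitive"       # clinical detail the patient wants kept private


# Hypothetical column-to-role mapping for an example clinical extract.
COLUMN_ROLES = {
    "name": Role.IDENTIFYING,
    "health_card_number": Role.IDENTIFYING,
    "date_of_birth": Role.QUASI_IDENTIFIER,
    "gender": Role.QUASI_IDENTIFIER,
    "postal_code": Role.QUASI_IDENTIFIER,
    "diagnosis": Role.SENSITIVE,
}


def columns_with_role(roles, wanted):
    """Return the column names tagged with the wanted role, in sorted order."""
    return sorted(col for col, role in roles.items() if role is wanted)


def drop_identifying(record, roles):
    """Return a copy of a record without its directly identifying columns."""
    return {k: v for k, v in record.items() if roles.get(k) is not Role.IDENTIFYING}
```

Dropping the identifying columns is only a first step; the quasi-identifiers that remain still need a risk assessment.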
De-identified Data?
Examples of Re-identification
User #4417749
• “tea for good health”
• “numb fingers”, “hand tremors”
• “dry mouth”
• “60 single men”
• “dog that urinates on everything”
• “landscapers in Lilburn, Ga”
• “homes sold in shadow lake subdivision gwinnett county georgia”
Thelma Arnold
• 62 year old widow living in Lilburn, Ga, re-identified by the New York Times
• She has three dogs
What Happened Next?
• Maureen Govern, CTO of AOL “resigns”
• Abdur Chowdhury, AOL researcher who released the data was fired
• Abdur’s boss in the research department was fired
• Big embarrassment for AOL
Examples of Re-identification
Uniqueness in the US Population
• Studies show that between 63% and 87% of the US population is unique on their date of birth + ZIP code + gender
• Uniqueness makes it quite easy to re-identify individuals using a variety of techniques
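Uniqueness on a set of quasi-identifiers can be measured directly by counting how many records have a combination of values that no other record shares. The following is a minimal sketch on made-up records; the field names (`dob`, `zip`, `gender`) and the data are hypothetical, and the result says nothing about the population figures quoted above.

```python
from collections import Counter


def percent_unique(records, quasi_identifiers):
    """Percentage of records whose quasi-identifier combination occurs exactly once."""
    if not records:
        return 0.0
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    unique = sum(1 for k in keys if counts[k] == 1)
    return 100.0 * unique / len(records)


# Made-up records for illustration only.
RECORDS = [
    {"dob": "1957-03-12", "zip": "30047", "gender": "M"},
    {"dob": "1957-03-12", "zip": "30047", "gender": "M"},
    {"dob": "1978-07-01", "zip": "30047", "gender": "M"},
    {"dob": "1968-12-09", "zip": "30047", "gender": "F"},
]
```

On these four records, adding date of birth to ZIP code and gender raises the share of unique records, which is the effect the slide describes at population scale.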
Uniqueness in Canadian Population
[Figure: chart of Percent Uniques (0% to 100%) against Number of Characters in Postal Code (1 to 6), with series PC, PC + Gender, PC + DoB, and PC + DoB + Gender]
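The relationship between postal code length and uniqueness can be sketched by truncating the postal code to its first n characters and recounting unique records. This is an illustration on a handful of made-up rows for the "PC + Gender" case only; the function name and data are hypothetical and do not reproduce the chart's values.

```python
from collections import Counter


def percent_unique_by_prefix(records, max_chars=6):
    """Percent of records unique on (postal code prefix, gender) for each prefix length."""
    result = {}
    for n in range(1, max_chars + 1):
        keys = [(r["postal_code"].replace(" ", "")[:n], r["gender"]) for r in records]
        counts = Counter(keys)
        unique = sum(1 for k in keys if counts[k] == 1)
        result[n] = 100.0 * unique / len(keys)
    return result


# Made-up rows for illustration only.
RECORDS = [
    {"postal_code": "K0J 1P0", "gender": "M"},
    {"postal_code": "K0J 1P0", "gender": "M"},
    {"postal_code": "K0J 1P0", "gender": "F"},
    {"postal_code": "K0J 1P0", "gender": "F"},
    {"postal_code": "K0J 1T0", "gender": "F"},
    {"postal_code": "K0J 1T0", "gender": "M"},
    {"postal_code": "K0J 2A0", "gender": "F"},
]
by_length = percent_unique_by_prefix(RECORDS)
```

Shorter prefixes group more people together, so the share of unique records can only stay the same or fall as characters are removed.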
Example
• This example shows the risk of re-identification using just demographics
Types of Disclosure
• Identity Disclosure: being able to determine the identity associated with a record
• Attribute Disclosure: discovering something new about an individual known to be in the database
Disclosure and Invasion-of-Privacy
• An important first criterion is deciding on the sensitivity of the data and the potential for harm to the patients from a secondary use/disclosure
• If the invasion-of-privacy is deemed low then there may not be a need to de-identify the data
Invasion-of-Privacy - I
• The personal information in the Data is highly detailed
• The information in the Data is of a highly sensitive and personal nature
• The information in the Data comes from a highly sensitive context
• Many people would be affected if there was a Data breach or the Data was processed inappropriately by the recipient/agent
Invasion-of-Privacy - II
• If there was a Data breach or the Data was processed inappropriately by the recipient/agent that may cause direct and quantifiable damages and measurable injury to the patients
• If the recipient/agent is located in a different jurisdiction, there is a possibility, for practical purposes, that the data sharing agreement will be difficult to enforce
Invasion-of-Privacy – Consent - I
• There is a provision in the relevant legislation permitting the disclosure/use of the Data without the consent of the patients
• The Data was unsolicited or given freely or voluntarily by the patients with little expectation of it being maintained in total confidence
Invasion-of-Privacy – Consent - II
• The patients have provided express consent that their Data can be disclosed for this secondary Purpose when it was originally collected or at some point since then
• The custodian has consulted well-defined groups or communities regarding the disclosure of the Data and had a positive response
Invasion-of-Privacy – Consent - III
• A strategy for informing/notifying the public about potential disclosures for the recipient’s secondary Purpose was in place when the data was collected or since then
• Obtaining consent from the individuals at this point is inappropriate or impractical
Identity Disclosure
• Three common types:
– Prosecutor risk
– Journalist risk
– Rareness
• All three are concerned with the risk of re-identifying a single individual
Prosecutor vs. Journalist
• If all of the following is true then prosecutor risk is relevant:
– The data represents the whole population such that everyone is known to be in it or the sampling fraction is very high
– If not the whole population, it is possible for an intruder to know that a particular person has a record in the data
  • Patient may self-reveal
  • Data collection method is revealing
• Otherwise journalist risk is relevant
Prosecutor Risk - I
• The intruder has background information about a specific individual known to be in the database
• The amount of background information will depend on the intruder
• The intruder is attempting to find the record belonging to that individual in the database
Prosecutor Risk - II
• Examples of intruders:
– Neighbor
– Ex-spouse
– Employer
– Relative
Example
Date of Birth   Gender   Postal Code   Diagnosis
12/03/1957      M        K0J 1P0       …
01/7/1978       M        K0J 1P0       …
09/12/1968      F        K0J 1P0       …
17/08/1987      F        K0J 1P0       …
25/02/1974      F        K0J 1T0       …
23/05/1985      M        K0J 1T0       …
14/03/1965      F        K0J 2A0       …
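The example table can be used to sketch how prosecutor risk might be quantified. The slides do not give a formula; the sketch below assumes a common convention in which a record's risk is 1 divided by the number of records sharing its quasi-identifier values (its equivalence class), and reports the largest such value. The function names are hypothetical.

```python
from collections import Counter
from fractions import Fraction

# Quasi-identifier columns of the example table: (date of birth, gender, postal code).
ROWS = [
    ("12/03/1957", "M", "K0J 1P0"),
    ("01/7/1978", "M", "K0J 1P0"),
    ("09/12/1968", "F", "K0J 1P0"),
    ("17/08/1987", "F", "K0J 1P0"),
    ("25/02/1974", "F", "K0J 1T0"),
    ("23/05/1985", "M", "K0J 1T0"),
    ("14/03/1965", "F", "K0J 2A0"),
]


def max_prosecutor_risk(rows, key):
    """Largest 1/(class size) over all records, where a class shares the same key.

    Assumption: risk = 1 / equivalence class size; this is not stated on the slide.
    """
    sizes = Counter(key(row) for row in rows)
    return max(Fraction(1, sizes[key(row)]) for row in rows)


def gender_and_postal(row):
    return (row[1], row[2])


def gender_and_region(row):
    # Generalize the postal code to its first three characters.
    return (row[1], row[2][:3])
```

Under this assumption, gender plus full postal code leaves some records alone in their class, while gender plus the three-character prefix puts every record in a class of at least three.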
Selecting Variables – Prosecutor - I
• In the best case assumption, a neighbor would know:
– Address and telephone information about the VIP
– Household and dwelling information (number of children, value of property, type of property)
– Key dates (births, deaths, weddings)
– Visible characteristics: gender, race, ethnicity, language spoken at home, weight, height, physical disabilities
– Profession
Selecting Variables – Prosecutor - II
• What would an ex-spouse know:
– The same things that a neighbor would know
– Basic medical history (allergies, chronic diseases)
– Income, years of schooling
• All of these variables would be considered quasi-identifiers if they appear in the database
Journalist Risk
• The journalist is not looking for a specific person; re-identifying any person will do
• The journalist has access to a database that s/he can use for matching
• This is called an identification database
Journalist Matching Example
[Figure: a Medical Database and an Identification DB linked through Quasi-Identifiers; field labels shown: clinical and lab data, DoB, initials, gender, postal code, name, address, telephone no.]
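The matching in this example can be sketched as an exact join between the two databases on their shared quasi-identifiers. This is an illustration under assumptions: the choice of DoB, gender, and postal code as the join keys, the field names, and all records are hypothetical, and a real attack might use fuzzy rather than exact matching.

```python
from collections import defaultdict

QUASI_IDENTIFIERS = ("dob", "gender", "postal_code")


def unique_matches(medical, identification, keys=QUASI_IDENTIFIERS):
    """Return (medical record id, name) for records with exactly one candidate identity."""
    candidates = defaultdict(list)
    for person in identification:
        candidates[tuple(person[k] for k in keys)].append(person["name"])
    matches = []
    for record in medical:
        names = candidates.get(tuple(record[k] for k in keys), [])
        if len(names) == 1:
            matches.append((record["id"], names[0]))
    return matches


# Made-up records for illustration only.
MEDICAL = [
    {"id": 1, "dob": "1957-03-12", "gender": "M", "postal_code": "K0J 1P0"},
    {"id": 2, "dob": "1974-02-25", "gender": "F", "postal_code": "K0J 1T0"},
]
IDENTIFICATION = [
    {"name": "Person A", "dob": "1957-03-12", "gender": "M", "postal_code": "K0J 1P0"},
    {"name": "Person B", "dob": "1974-02-25", "gender": "F", "postal_code": "K0J 1T0"},
    {"name": "Person C", "dob": "1974-02-25", "gender": "F", "postal_code": "K0J 1T0"},
]
```

Record 2 is not returned because two people share its quasi-identifier values, which is why rare combinations carry the higher risk.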
Assessing Journalist Risk
• In general, we want to know how rare the quasi-identifier values would be in the population (e.g., homeowners/professionals/civil servants in the geographic area of interest)
• If the combination is not rare then there is small journalist risk
Selecting Variables – Journalist - I
• Depends on what information can be obtained in an identification database
• For an external intruder, likely variables are those available in public registries:
– Key dates (birth, death, marriage)
– Profession
– Home address and telephone number
– Type of dwelling
– Gender, ethnicity, race
– Income if a highly paid public servant
Selecting Variables – Journalist - II
• Assume that an internal intruder would be able to get all relevant administrative data:
– Key dates (birth, death, admission, discharge, visit)
– Gender, address, telephone number
Inference of Variables - I
• Even though a particular quasi-identifier may not be known to the intruder (prosecutor risk), available in an identification database (journalist), or available in the disclosed database (all three risks), it may be possible to infer it from other variables
• Variables that can be inferred should be treated as quasi-identifiers
Inference of Variables - II
• Inferred variables should be added to the disclosed database if they are not there because they may be used in a re-identification attack, and you want to take them into account during risk assessment
Inference Examples
• Gender, ethnicity, religious origin from name
• Age from graduation date
• Profession from payer of insurance claim (e.g., civil servants have a single health insurer)
• Age and gender from a diagnostic or lab code (e.g., mammogram or PSA test)
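Inference of this kind can be sketched as a lookup that adds a derived quasi-identifier to a record when the explicit one is missing. The mapping below is hypothetical and deliberately small; it treats the inferred value as likely rather than certain, and the field names are assumptions.

```python
# Hypothetical mapping from a lab or diagnostic test to the gender it suggests.
LIKELY_GENDER_BY_TEST = {"mammogram": "F", "PSA test": "M"}


def add_inferred_gender(record):
    """Return a copy of the record with a likely gender added when none is recorded."""
    out = dict(record)
    if "gender" not in out:
        inferred = LIKELY_GENDER_BY_TEST.get(out.get("lab_test"))
        if inferred is not None:
            out["gender_inferred"] = inferred
    return out
```

Keeping the inferred value in a separate field (`gender_inferred`) makes it possible to include it in a risk assessment without confusing it with recorded data.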
Rareness
• If individuals are rare on the quasi-identifiers, then they are at higher prosecutor and journalist re-identification risk
• If an individual has a rare and visible characteristic/feature, then that also makes them easier to re-identify (e.g., put an ad on the radio)
Attribute Disclosure
• If there is very little variation on sensitive variables
• The data set can represent a whole population or some subset
• Learn something new about a person without actually finding which record belongs to them
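A simple check for attribute disclosure is to look for groups of records that share the same quasi-identifier values and also share a single sensitive value. The sketch below is one possible check, not a method given in the slides; the field names and records are hypothetical.

```python
from collections import defaultdict


def homogeneous_groups(records, quasi_identifiers, sensitive):
    """Map each quasi-identifier group of 2+ records with one shared sensitive value to it."""
    values = defaultdict(list)
    for record in records:
        values[tuple(record[q] for q in quasi_identifiers)].append(record[sensitive])
    return {
        group: vals[0]
        for group, vals in values.items()
        if len(vals) >= 2 and len(set(vals)) == 1
    }


# Made-up records for illustration only.
RECORDS = [
    {"age_group": "40-49", "gender": "M", "diagnosis": "diabetes"},
    {"age_group": "40-49", "gender": "M", "diagnosis": "diabetes"},
    {"age_group": "40-49", "gender": "F", "diagnosis": "asthma"},
    {"age_group": "40-49", "gender": "F", "diagnosis": "diabetes"},
]
```

Here anyone known to be a man aged 40-49 in the data is revealed to have the same diagnosis, even though no individual record has been singled out.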
A Pragmatic Approach
• It is important to ensure that the quasi-identifiers are plausible for the data and the recipients of the data
• If you select many quasi-identifiers then that will by definition increase the re-identification risk
• Ideally, each selected quasi-identifier should be associated with a realistic re-identification scenario
Constructing an Identification DB
• This may be a single physical database or a join of multiple sources together to construct a virtual database
• It will have the quasi-identifiers as well as identity information, but will not have the sensitive information (e.g., clinical or financial details)
• The sources may be public and free, public and for a fee, or fully commercial
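The "join of multiple sources" can be sketched as merging two lists on a shared field. This is a simplified illustration: the two sources, their field names, and the use of the name as the join key are all assumptions, and real sources would need far more careful matching.

```python
def build_identification_db(membership, directory):
    """Join two sources on name, keeping only people with exactly one directory entry."""
    by_name = {}
    for entry in directory:
        by_name.setdefault(entry["name"], []).append(entry)
    joined = []
    for member in membership:
        matches = by_name.get(member["name"], [])
        if len(matches) == 1:  # ambiguous or missing names are skipped
            joined.append({**member, **matches[0]})
    return joined


# Made-up sources for illustration only.
MEMBERSHIP = [
    {"name": "Person A", "profession": "engineer"},
    {"name": "Person B", "profession": "lawyer"},
]
DIRECTORY = [
    {"name": "Person A", "address": "1 Example St", "telephone": "555-0101"},
    {"name": "Person B", "address": "2 Example St", "telephone": "555-0102"},
    {"name": "Person B", "address": "9 Sample Ave", "telephone": "555-0109"},
]
IDENTIFICATION_DB = build_identification_db(MEMBERSHIP, DIRECTORY)
```

The result holds identity information and quasi-identifiers only; no clinical or financial details are involved at this stage.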
Examples of Identification DBs - I
• These are databases or sources (Canada):
– Obituaries: available from newspapers and funeral homes; there are obituary aggregator sites that make this simple
– PPSR: Personal Property Security Registration; contains information on loans secured by property (e.g., cars)
– Land Registry: information on house ownership
Examples of Identification DBs - II
– Membership Lists: provide comprehensive listings of professionals (e.g., doctors, lawyers, civil servants)
– Salary Disclosure Reports: provided by governments for those earning higher than a certain threshold
– White Pages: public telephone directory
– Job Sites: CVs posted in public and closed job web sites
– Donations: Disclosures of donations to political parties (include address)
Voter Lists - I
• Cannot legally be used for purposes outside of an election (in Canada)
• But, a charity allegedly supporting a terrorist group (Tamil Tigers) was found by the RCMP to have Canadian voter lists
• Volunteers do not necessarily destroy or dispose of the lists after an election (and in many cases do not sign anything before they get them)
Voter Lists - II
• It is not expensive (or difficult) to become a candidate in an election and get the voter list:
– Alberta: $500
– BC: $100
– NB: $100 (+nominated by 25 electors)
– Ontario: $100
– Quebec: $0 (+nominated by 100 electors)
• Canadian voter lists do not contain the DoB (yet)
Economics of Identification DBs
• Some data sources have a fee for each individual record/search
• This makes the cost of creating an identification database quite high
• This may impose a large economic burden on an intruder and act as a deterrent from creating identification databases
Internal Identification Databases
• An internal intruder may have access to administrative databases that can act as Identification DB
• For example, in a hospital an internal intruder may have access to all admissions; this is not sensitive data so is less protected but has enough demographics that it can be good as an identification database
• This puts internal intruders at a huge advantage
Internal Access
• An internal intruder can get access to such an administrative database:
– had access in a previous position but that access was not revoked
– people in the organization share access credentials, so the intruder can use someone else’s credentials to get the administrative database
– has access as part of his/her job and there are no audit trails
– internal systems are not well protected because internal people are trusted and the intruder knows how to break into the system to get the data
Public Registries
• In the following slides I will explain how to create identification databases from public registries in Canada
Professional Groups - I
We can construct identification databases for specific professional groups
[Figure: sources shown are Membership Lists, PPSR Lists, White Pages]
Professional Groups - II
• College of Physicians and Surgeons of Ontario
• Law Society of Upper Canada
• Professional Engineers Ontario
• College of Occupational Therapists
• College of Physical Therapists
• Public servants (eg, GEDS)
• …….
What is the success rate?
                                                                            CPSO   LSUC
• Ability to get home postal codes (source: PPSR and telephone directory)   60%    45%
• Ability to get practice/firm postal codes (source: CPSO/LSUC)             100%   100%
• Ability to get date of birth (source: PPSR)                               40%    45%
• Ability to get gender (source: CPSO/genderizing LSUC)                     100%   100%
• Ability to get initials (source: CPSO/LSUC)                               100%   100%
What is the success rate by gender?
                                                                            CPSO   LSUC
MALE
• Ability to get home postal codes (source: PPSR and telephone directory)   63%    48%
• Ability to get date of birth (source: PPSR)                               45%    48%
FEMALE
• Ability to get home postal codes (source: PPSR and telephone directory)   49%    40%
• Ability to get date of birth (source: PPSR)                               29%    40%
Homeowners
We can construct identification databases for specific postal codes
[Figure: sources shown are Land Registry, PPSR, Canada Post, White Pages]
What is the success rate?
                                    Ott    To
• Ability to get initials           93%    100%
• Ability to get DoB                33%    40%
• Ability to get telephone number   80%    50%
• Ability to get gender             87%    95%
Re-id Risk for Homeowners
• The number of households per postal code is quite small (Ott: 15; To: 20)
• The individuals (homeowners) were unique on common combinations of quasi-identifiers (eg, gender and DoB)
• For these individuals re-identification risk is very high
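Uniqueness on a combination of quasi-identifiers, as described above, can be checked by counting how many records share each combination. The following is a minimal sketch, not the method used in the study; the function name and the sample records are hypothetical.

```python
from collections import Counter


def unique_fraction(records, quasi_identifiers):
    """Fraction of records whose quasi-identifier combination occurs only once."""
    if not records:
        return 0.0
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    unique = sum(1 for k in keys if counts[k] == 1)
    return unique / len(records)


# Hypothetical homeowners within a single postal code.
homeowners = [
    {"gender": "F", "dob": "1961-03-02"},
    {"gender": "M", "dob": "1961-03-02"},
    {"gender": "F", "dob": "1975-11-20"},
    {"gender": "M", "dob": "1980-07-14"},
    {"gender": "M", "dob": "1980-07-14"},
]

print(unique_fraction(homeowners, ["gender"]))
print(unique_fraction(homeowners, ["gender", "dob"]))
```

With so few households per postal code, even a two-variable combination such as gender and DoB tends to leave most records alone in their group, which is why the risk for these individuals is high.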
Civil Servants - I
• GEDS is on the Internet: Government Electronic Directory Services
• There are 386,630 individuals in the federal government (159,652 in Ontario and 28,046 in Alberta)
• GEDS has approx. 170,000 entries
• Incomplete because: organizations can opt-out, some individuals need to opt-in, and some employees and orgs are exempted (eg, CSIS, DND)
Civil Servants - II
• We selected a sample of 40 individuals in health care related federal departments in Ontario
• Able to get home address for 50%, home telephone number for 40%, gender for 100%, DoB for 22.5%
• Provincial governments have similar sources
Re-identification Threshold
• There is a spectrum of re-identification risk
• When does the probability of re-identification become so high that the information is deemed identifiable?
• Canadian privacy law tends not to be precise about this
• Gordon case: serious possibility test
Canadian Definitions - I
Privacy Law: Definition
• Ontario PHIPA: “Identifying information” means information that identifies an individual or for which it is reasonably foreseeable in the circumstances that it could be utilized, either alone or with other information, to identify an individual.
• Nfld PPHI: “Identifying information” means information that identifies an individual or for which it is reasonably foreseeable in the circumstances that it could be utilized either alone or together with other information to identify an individual.
• Sask THIPA: “De-identified personal health information” means personal health information from which any information that may reasonably be expected to identify an individual has been removed.
Canadian Definitions - II
Privacy Law: Definition
• Alberta HIA: “Individually identifying” means that the identity of the individual who is the subject of the information can be readily ascertained from the information; “nonidentifying” means that the identity of the individual who is the subject of the information cannot be readily ascertained from the information.
• NB PPIA: “Identifiable individual” means an individual can be identified by the contents of the information because the information includes the individual’s name, makes the individual’s identity obvious, or is likely in the circumstances to be combined with other information that includes the individual’s name or makes the individual’s identity obvious.
Re-identification Risk Spectrum
Re-identification Threshold
• Privacy legislation treats the threshold in two ways:
– Discretionary/permitted disclosures and uses = threshold can be anywhere along the spectrum
– Only de-identified information without consent = information is identifiable or not; there is no spectrum
• Any systematic approach to dealing with thresholds must cover both
Threshold Precedents - I
• We will use healthcare precedents as an indication of the risk that society has agreed to take:
– The largest probability of re-identification that is used in any policy or guideline document in Canada or the US is 0.33
– If the probability is > 0.33 then the information would certainly be considered identifiable
Threshold Precedents - II
– The most common probability of re-identification used in disclosure control of health data is 0.2 (cell size of 5)
– It makes sense that a value of 0.2 would be used as a “default” risk
• Below 0.33 there are many degrees of de-identification
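The slides relate a probability of 0.2 to a cell size of 5, which suggests measuring risk as one over the size of the smallest group of records sharing the same quasi-identifier values. The sketch below works under that assumption; the function names, constant names and sample records are hypothetical, and only the two threshold values come from the slides.

```python
from collections import Counter

MAX_PRECEDENT = 0.33  # largest probability cited in the slides
DEFAULT_RISK = 0.2    # most common value in the slides (cell size of 5)


def max_reid_probability(records, quasi_identifiers):
    """Return 1 / (smallest cell size) over the quasi-identifier combinations."""
    cells = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    if not cells:
        raise ValueError("records must not be empty")
    return 1.0 / min(cells.values())


def meets_threshold(records, quasi_identifiers, threshold=DEFAULT_RISK):
    """True when the worst-case probability does not exceed the threshold."""
    return max_reid_probability(records, quasi_identifiers) <= threshold


# Hypothetical data: one cell of five records, plus a second cell of two.
cell_of_five = [{"gender": "F", "age_group": "40-49"} for _ in range(5)]
cell_of_two = [{"gender": "M", "age_group": "40-49"} for _ in range(2)]
qi = ["gender", "age_group"]

print(max_reid_probability(cell_of_five, qi))
print(meets_threshold(cell_of_five + cell_of_two, qi, MAX_PRECEDENT))
```

A single small cell is enough to push the worst-case probability past either threshold, so the check is driven by the rarest combination rather than by the average record.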
Example
• The choice of threshold has a significant impact on risk assessment results
De-identification Techniques
[Figure labels: D1, D2, D3; identifying variables, quasi-identifying variables; Randomization, Coding, Suppression, Heuristics, Analytics]
Examples of Analytics
• Table aggregation – disclose only summary tables
• Generalization
• Record or variable suppression
• Geographic aggregation
• Sub-sampling
• Adding noise
Common De-identification Heuristic
• If geographic area has a small population, then:
– Suppress all data from that area
– Aggregate the geographic area
• Applied for a variety of data sets, including public health data sets
• For many applications this heuristic results in significant loss of data or imperils analysis
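The heuristic above can be sketched as a filter over records keyed by geographic area: areas at or above the population cutoff are kept, smaller areas are either merged into a larger parent area or dropped. This is an illustrative sketch only; the function name, the area codes, the populations and the parent mapping are hypothetical.

```python
def apply_population_cutoff(records, area_population, cutoff, parent_area=None):
    """Keep, aggregate, or suppress records based on the area's population.

    records: list of dicts with an "area" key
    area_population: dict mapping area -> population
    parent_area: optional dict mapping a small area -> its larger parent area
    """
    parent_area = parent_area or {}
    # Population of each parent is the sum of the areas mapped to it.
    parent_population = {}
    for area, parent in parent_area.items():
        parent_population[parent] = parent_population.get(parent, 0) + area_population[area]

    result = []
    for rec in records:
        area = rec["area"]
        if area_population[area] >= cutoff:
            result.append(dict(rec))                    # large enough: keep as is
            continue
        parent = parent_area.get(area)
        if parent is not None and parent_population[parent] >= cutoff:
            result.append({**rec, "area": parent})      # aggregate the area
        # otherwise the record is suppressed
    return result


# Hypothetical areas and populations.
population = {"AAA": 12_000, "AAB": 15_000, "BBB": 30_000, "CCC": 3_000}
parents = {"AAA": "AA", "AAB": "AA"}
data = [{"id": 1, "area": "AAA"}, {"id": 2, "area": "BBB"}, {"id": 3, "area": "CCC"}]

print(apply_population_cutoff(data, population, 20_000, parents))
print(apply_population_cutoff(data, population, 20_000))
```

Without a parent mapping, two of the three hypothetical records are lost, which illustrates the data-loss concern raised in the last bullet.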
Examples
• HIPAA: 20k rule
• Census Bureau: 100k rule
• Statistics Canada: 70k rule
• British Census: 120k rule
The Problem
• Such generic rules ignore the specific variables that are included in a data set
• A smaller cutoff should be used if few variables are in a data set
• A larger cutoff should be used if many variables are in a data set
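The reason a fixed cutoff is a poor fit can be seen by counting unique records in one area as quasi-identifiers are added: every extra variable splits the population into finer groups, so more individuals end up alone in their group. The sketch below uses synthetic data and hypothetical variable names; it illustrates the effect and is not the GAPS model.

```python
import random
from collections import Counter


def unique_count(records, variables):
    """Number of records that are alone in their combination of variables."""
    cells = Counter(tuple(r[v] for v in variables) for r in records)
    return sum(1 for size in cells.values() if size == 1)


# Synthetic population of one area (hypothetical variables and sizes).
rng = random.Random(0)
population = [
    {
        "gender": rng.choice("MF"),
        "age_group": rng.randrange(10),
        "birth_month": rng.randrange(1, 13),
    }
    for _ in range(500)
]

variable_sets = [
    ["gender"],
    ["gender", "age_group"],
    ["gender", "age_group", "birth_month"],
]
uniques = [unique_count(population, vs) for vs in variable_sets]
for vs, n in zip(variable_sets, uniques):
    print(len(vs), "variable(s):", n, "unique records")
```

Because adding a variable can only refine the groups, the number of unique records never decreases, so the same population needs a larger cutoff as more variables are disclosed.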
Automation - I
Automation - II
Our GAPS Models
Province           GAPS Models     20,000 Cutoff   70,000 Cutoff   100,000 Cutoff
                   FSA     Pop     FSA     Pop     FSA     Pop     FSA     Pop
Alberta            55%     84%     38%     71%     1.4%    5%      0       0
British Columbia   68%     87%     46%     70%     1.1%    4%      0       0
Manitoba           59%     88%     39%     68%     0       0       0       0
New Brunswick      20%     51%     4.5%    19%     0       0       0       0
Newfoundland       55%     83%     30%     62%     0       0       0       0
Nova Scotia        47%     82%     16%     43%     0       0       0       0
Ontario            69%     91%     49%     76%     1.4%    5%      0.2%    1%
PEI                57%     90%     43%     79%     0       0       0       0
Quebec             59%     84%     36%     63%     1%      5%      0.25%   0
Saskatchewan       60%     93%     49%     84%     2%      7%      0       2%
Risk Methodology
• De-identification by itself is not sufficient:
– Using low thresholds results in rapid data quality deterioration
– Using high thresholds is perceived as too risky
– We want to create incentives for the data recipients to improve their security and privacy practices
• Methodology allows you to select and justify a threshold
Managing Re-identification Risk
[Figure labels: Amount of De-identification, Risk Exposure, Mitigating Controls, Motives & Capacity, Invasion-of-Privacy]
The Tradeoffs
                             Ability to Re-identify the Data
                             Low             High
Mitigating Controls   Low    balanced        dangerous
                      High   conservative    balanced
[Figure annotations: higher cost burden on data recipient; lower data quality]
Steps in Risk Methodology
• The methodology has two steps to evaluate the overall risks
• First we determine the probability of a re-identification attempt
• Then we determine the re-identification risk to use
Determining Pr Re-identification Attempts
Determining Risk Threshold to Use
Implementation of Methodology
• An important component of this methodology is the ability to audit the data recipient/agent receiving the data
• Update audits are performed regularly
• Data sharing agreements are put in place for external recipients and external agents (internal ones usually covered by employment agreements)
• The elements in the security maturity profile are part of the data sharing agreement
Compliance Audits
• The audits use a publicly available checklist
• Audit results would be generally accepted so that recipients do not need to get audited repeatedly for different disclosures
• Intended to be rapid (one or two day on-site) and cheap ($1k to $2k)
Example - Pharmacy Data
• Request to CHEO for prescription data from a commercial data broker
• Concern that this data could potentially identify patients
• We performed a study to evaluate re-identification risk and come up with an anonymous version of the data
Prescription Records Example
• Patient age in days
• Patient gender
• Forward Sortation Area
• Admission date
• Discharge date
• Diagnosis
• Dispensed drug

• Gender
• Length of stay in days
• Quarter and year of admission
• Patient’s region (first character of the postal code)
• Patient’s age in weeks
• Diagnosis
• Dispensed drug

• Regular third party privacy/security audits
• Breach notification protocols must be in place
• Restrictions on further distribution of raw data
• Data destruction provisions
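The two variable lists on this slide appear to describe a detailed record and a generalized version of it (age in days to weeks, dates to length of stay plus quarter, Forward Sortation Area to its first character). Under that reading, the generalization step could be sketched as follows; the function name, field names and sample record are hypothetical and this is not the tool used in the study.

```python
from datetime import date


def generalize_record(rec):
    """Map a detailed prescription record to a coarser, generalized one."""
    admission = rec["admission_date"]
    discharge = rec["discharge_date"]
    quarter = (admission.month - 1) // 3 + 1
    return {
        "gender": rec["gender"],
        "length_of_stay_days": (discharge - admission).days,
        "admission_quarter": f"{admission.year}-Q{quarter}",
        "region": rec["fsa"][0],            # first character of the postal code
        "age_weeks": rec["age_days"] // 7,  # age in whole weeks
        "diagnosis": rec["diagnosis"],
        "dispensed_drug": rec["dispensed_drug"],
    }


# Hypothetical record for illustration only.
raw = {
    "age_days": 400,
    "gender": "F",
    "fsa": "K1H",
    "admission_date": date(2008, 5, 12),
    "discharge_date": date(2008, 5, 15),
    "diagnosis": "example diagnosis",
    "dispensed_drug": "example drug",
}

generalized = generalize_record(raw)
print(generalized)
```

Exact admission and discharge dates are replaced by a length of stay and a quarter, which keeps the analytically useful duration while removing the precise dates that make a record easy to match.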
An Example Deployment