De-identification: A Critical Success Factor in Clinical and Population Research


Steven Merahn MD

Dee Lang, RHIT

Prepared for 2007 APIII

Pittsburgh, PA

September 10, 2007

Major gaps exist today between patient care, clinical research, and evidence-based medicine.

Sharing Data is the Key

“Amassing large quantities of anonymized clinical and non-clinical information from medical records and reports and analyzing that data for patterns and other observations (is the best way to) support continuous quality improvement, shape best practices and inform clinical and population-based decision making”

A Rapid Learning Health System, Health Affairs 26(2), January 2007

Processing Predicated on Protecting Patient Privacy

Clinical records can be an important source of information…most of the information in these records is in the form of free text and extracting useful information from them requires automatic processing (e.g., index, semantically interpret, and search). A prerequisite to the distribution of clinical records outside of hospitals, be it for Natural Language Processing (NLP) or medical research, is de-identification.

J Am Med Inform Assoc. 2007;14:550-563. DOI 10.1197/jamia.M2444.

Problems to Solve

Sources of data
Protecting patient privacy
Creating and maintaining a corpus of HIPAA-compliant and searchable data
Building collaborations; creating networks of institutions sharing data
Emerging patient “data rights” issues

Sources of Data

EMR/CIS systems
• Large amounts of free text; not all data is parsed or field-limited
Transcribed Records and Reports
• Even in systems without CIS, most transcriptions are delivered as electronic files
• Pathology Reports (cf. caTIES)
• Surgical Notes
• Radiology Reports
• Discharge Summaries
No need to wait for an EMR to create an RLHS

Protecting Patient Privacy

De-identification is a well-defined, but limited, step in a broader research workflow or protocol
The defined nature of the step includes managing individually identifiable information in records and reports
Such schemas include redaction, elimination, categorical replacement (e.g., place, age range), replacement with proxies (Dr X), and offsets (day 1)
It is a process that must be constantly “tuned” in response to dynamic input variables and patterns of documentation
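The replacement schemas listed above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the regular expressions, the bracketed placeholder style, the `Dr X1` proxy naming, and the function names are hypothetical and are not the presenters' methodology. A production system would need far broader rules and the ongoing tuning noted above.

```python
import re
from datetime import date

# Hypothetical patterns for illustration only; real notes need far broader rules.
DOCTOR = re.compile(r"\bDr\.? [A-Z][a-z]+\b")
AGE = re.compile(r"\b(\d{1,3})-year-old\b")
MRN = re.compile(r"\bMRN[:\s]*\d+\b")
ISO_DATE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")


def age_range(age):
    """Categorical replacement: map an exact age to a ten-year band."""
    if age >= 90:
        return "90+"
    low = (age // 10) * 10
    return f"{low}-{low + 9}"


def deidentify(text, reference):
    """Apply proxies, categorical replacement, redaction, and date offsets."""
    proxies = {}

    def doctor_proxy(match):
        # The same clinician always receives the same proxy within one record.
        name = match.group(0)
        if name not in proxies:
            proxies[name] = f"Dr X{len(proxies) + 1}"
        return proxies[name]

    def age_band(match):
        return f"[AGE {age_range(int(match.group(1)))}]"

    def day_offset(match):
        year, month, day = (int(part) for part in match.groups())
        try:
            found = date(year, month, day)
        except ValueError:
            return "[DATE]"  # not a valid calendar date: redact outright
        # The reference date itself becomes "day 1".
        return f"day {(found - reference).days + 1}"

    text = DOCTOR.sub(doctor_proxy, text)
    text = AGE.sub(age_band, text)
    text = MRN.sub("[MRN]", text)
    return ISO_DATE.sub(day_offset, text)


note = ("Dr. Smith saw the 47-year-old patient (MRN: 123456) on 2007-09-10; "
        "Dr. Smith ordered a biopsy on 2007-09-12.")
print(deidentify(note, reference=date(2007, 9, 10)))
```

Using proxies and offsets rather than blanket redaction keeps the record readable: a reader can still tell that the same clinician appears twice and that the two events are two days apart.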

[Diagram. Labels: CIS; Transcribed Reports; De-identification Methodology; De-identified Data; De-identified Database; NLP / Other processes; Query Interface; Firewall; Trusted Proxy; Re-ID Method]

Considerations

When choosing a de-identification methodology, four things need consideration:
What is the reliability and validity of the methodology?
Can the method maintain its specificity and sensitivity in local use?
What are the limitations of the methodology?
Can files be re-identified?
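One way to think about the re-identification question is a trusted intermediary that holds the only link between pseudonyms and real identifiers. The sketch below is an assumption about how such an arrangement could work, not a description of the presenters' system; the class name `TrustedProxy`, its methods, and the `P-` pseudonym format are hypothetical.

```python
import secrets


class TrustedProxy:
    """Holds the only link between pseudonyms and real identifiers."""

    def __init__(self):
        self._to_pseudonym = {}
        self._to_identifier = {}

    def pseudonymize(self, identifier):
        """Return a stable random pseudonym for a real identifier."""
        if identifier not in self._to_pseudonym:
            # Random tokens carry no information about the original value.
            pseudonym = "P-" + secrets.token_hex(8)
            self._to_pseudonym[identifier] = pseudonym
            self._to_identifier[pseudonym] = identifier
        return self._to_pseudonym[identifier]

    def reidentify(self, pseudonym):
        """Recover the real identifier; only the proxy can do this."""
        return self._to_identifier[pseudonym]


proxy = TrustedProxy()
token = proxy.pseudonymize("MRN-123456")
print(token, "->", proxy.reidentify(token))
```

Because the pseudonym is random rather than derived from the identifier, the de-identified side cannot reverse it; re-identification requires a request to whoever operates the proxy.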

Consistency, Reliability and Validity

Fundamental problems are inter-record reliability, manpower resources, and time constraints
The issue then becomes the quality of the quality: over-marking (which lowers specificity) and under-marking (which lowers sensitivity)
What are acceptable levels of sensitivity and specificity?
• 100% sensitivity for names
• What is the benchmark?
• What is the value of consistency?
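For reference, sensitivity and specificity can be computed by comparing a method's token-level marks against a manually reviewed gold standard. The sketch below assumes each token carries a boolean "is identifying information" label; the function name and the sample data are hypothetical.

```python
def sensitivity_specificity(gold, predicted):
    """Compare predicted marks against gold-standard marks, token by token.

    gold and predicted are equal-length sequences of booleans, where True
    means the token is identifying information that should be marked.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    tp = sum(1 for g, p in zip(gold, predicted) if g and p)
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)  # under-marking
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)  # over-marking
    tn = sum(1 for g, p in zip(gold, predicted) if not g and not p)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative token")
    sensitivity = tp / (tp + fn)  # share of identifiers that were caught
    specificity = tn / (tn + fp)  # share of ordinary tokens left alone
    return sensitivity, specificity


# Hypothetical sample: 4 identifying tokens followed by 6 ordinary tokens.
gold = [True] * 4 + [False] * 6
predicted = [True, True, True, False] + [True] + [False] * 5
print(sensitivity_specificity(gold, predicted))
```

In this sample one identifier is missed (under-marking) and one ordinary token is flagged (over-marking), so sensitivity is 3/4 and specificity is 5/6.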

Automated Methodologies: As Good As?/Better?

Classification of tokens
Sequence tracking problem (using Hidden Markov Models or Conditional Random Fields)
Rule-based system utilizing global features (sentence position), local features (lexical cues, special characters, and format patterns), and syntactic features
Hybrid systems of rules, pattern matching algorithms, heuristics and dictionaries
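To make the hybrid idea concrete, here is a minimal sketch of token classification that combines a dictionary, a lexical cue, and a format pattern. It is an illustration under stated assumptions, not any published system: the dictionary entries, the cue list, the phone pattern, and the label names are all hypothetical, and statistical sequence models such as HMMs or CRFs are not shown.

```python
import re

NAME_DICTIONARY = {"smith", "jones"}  # hypothetical dictionary entries
TITLE_CUES = {"dr", "dr.", "mr", "mr.", "mrs", "mrs.", "ms", "ms."}  # lexical cues
PHONE_PATTERN = re.compile(r"^\d{3}-\d{3}-\d{4}$")  # format pattern


def classify_tokens(tokens):
    """Label each token as PHONE, NAME, or O (other)."""
    labels = []
    for i, token in enumerate(tokens):
        previous = tokens[i - 1].lower() if i > 0 else ""
        stripped = token.strip(".,;:()")
        if PHONE_PATTERN.match(stripped):
            label = "PHONE"  # rule: format pattern
        elif previous in TITLE_CUES and stripped[:1].isupper():
            label = "NAME"   # heuristic: capitalized word after a title
        elif stripped.lower() in NAME_DICTIONARY:
            label = "NAME"   # dictionary lookup
        else:
            label = "O"
        labels.append(label)
    return labels


tokens = "Dr. Patel called 412-555-0100 about Smith.".split()
print(list(zip(tokens, classify_tokens(tokens))))
```

The title heuristic catches "Patel" even though it is absent from the dictionary, while the dictionary catches "Smith" with no title in front of it; each component covers gaps left by the others.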

Local Use

Can your methodology be customized to meet local needs?
• While some methods may have good ‘numbers’, will they hold up in local use?
• Every community has its own acronyms, place names and other local vocabulary
What is the protocol to manage local quality?
• Regular checks against manual review
• Formal evaluation research
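A regular check against manual review could look like the sketch below, which reports the terms a reviewer marked but the vocabulary failed to find. The function names, the facility name "Mercy West", and the acronym "MWH" are hypothetical examples of local vocabulary, not real data.

```python
import re


def find_terms(text, vocabulary):
    """Return the vocabulary terms (lowercased) that occur in the text."""
    found = set()
    for term in vocabulary:
        if re.search(rf"\b{re.escape(term)}\b", text, flags=re.IGNORECASE):
            found.add(term.lower())
    return found


def missed_terms(text, vocabulary, manually_marked):
    """Terms a manual reviewer marked that the vocabulary did not catch."""
    return {term.lower() for term in manually_marked} - find_terms(text, vocabulary)


text = "Seen at MWH, then transferred to Mercy West in Pittsburgh."
base_vocabulary = {"Pittsburgh"}
local_vocabulary = {"MWH", "Mercy West"}  # hypothetical local additions
reviewer_marks = {"MWH", "Mercy West", "Pittsburgh"}

print(missed_terms(text, base_vocabulary, reviewer_marks))
print(missed_terms(text, base_vocabulary | local_vocabulary, reviewer_marks))
```

With only the base vocabulary, the local acronym and facility name are missed; adding the local list closes the gap. Running such a comparison on a recurring sample of records gives an early signal when local documentation patterns drift.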

“Data Rights” Issues

Legal models exist
Make “de-identified” data sharing part of informed consent
Offer different tiers of consent
• Publicly-funded research
• Academic research
• Commercial research
Make the general public aware of the level of existing data sharing
• Claims data already widely shared and sold

Building Collaboration

[Diagram. Labels: De-identified Database; Query Interface; Firewall]

Call to Action: Pathology Informatics Community

caBIG and caTIES are models for cross-institutional data sharing
Major institutions are establishing data repositories of pathology reports
Help facilitate data aggregation among other departments
• Radiology (Radiology Reports)
• Surgery (Surgical Notes)
• Medicine (Discharge Summaries)
Establish cross-department “Rapid Learning” teams
