De-identification: A Critical Success Factor in Clinical and Population Research
Post on 30-Dec-2015
Steven Merahn MD
Dee Lang, RHIT
Prepared for 2007 APIII
Pittsburgh, PA
September 10, 2007
Sharing Data is the Key
“Amassing large quantities of anonymized clinical and non-clinical information from medical records and reports and analyzing that data for patterns and other observations (is the best way) to support continuous quality improvement, shape best practices and inform clinical and population-based decision making”
A Rapid Learning Health System, Health Affairs 26(2), January 2007
Processing Predicated on Protecting Patient Privacy
Clinical records can be an important source of information… most of the information in these records is in the form of free text and extracting useful information from them requires automatic processing (e.g., index, semantically interpret, and search). A prerequisite to the distribution of clinical records outside of hospitals, be it for Natural Language Processing (NLP) or medical research, is de-identification
J Am Med Inform Assoc. 2007;14:550-563. DOI 10.1197/jamia.M2444.
Problems to Solve
• Sources of data
• Protecting patient privacy
• Creating and maintaining a corpus of HIPAA compliant and searchable data
• Building collaborations; creating networks of institutions sharing data
• Emerging patient “data rights” issues
Sources of Data
• EMR/CIS systems
  • Large amounts of free text; not all data is parsed or field-limited
• Transcribed Records and Reports
  • Even in systems without CIS, most transcriptions are delivered as electronic files
    • Pathology Reports (cf. caTIES)
    • Surgical Notes
    • Radiology Reports
    • Discharge Summaries
• No need to wait for an EMR to create an RLHS
Protecting Patient Privacy
• De-identification is a well-defined, but limited, step in a broader research workflow or protocol
• The defined nature of the step includes managing individually identifiable information in records and reports
• Such schemes include redaction, elimination, categorical replacement (e.g., place, age range), replacement with proxies (Dr X), and offsets (day 1)
• The process must be constantly “tuned” in response to dynamic input variables and patterns of documentation
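
The replacement schemes named on this slide (redaction, categorical replacement, proxies, offsets) can be illustrated with a minimal Python sketch. It assumes the identifying spans have already been located by some earlier step; the function names, the 10-year age bands, and the sample values are hypothetical and are not taken from the presentation.

```python
from datetime import date


def age_range(age, width=10):
    """Categorical replacement: map an exact age to a band such as '40-49'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def day_offset(event, anchor):
    """Offset: express a date as a day count relative to an anchor date (day 1)."""
    return f"day {(event - anchor).days + 1}"


class ProxyNamer:
    """Proxy replacement: give each distinct name a stable stand-in (Dr X1, Dr X2, ...)."""

    def __init__(self, prefix="Dr X"):
        self.prefix = prefix
        self.seen = {}  # original name -> proxy, kept only for consistency within a run

    def proxy(self, name):
        if name not in self.seen:
            self.seen[name] = f"{self.prefix}{len(self.seen) + 1}"
        return self.seen[name]


def redact(text, span):
    """Redaction: replace every occurrence of an identifying span with a fixed marker."""
    return text.replace(span, "[REDACTED]")


if __name__ == "__main__":
    admit = date(2007, 3, 14)  # hypothetical anchor date
    namer = ProxyNamer()
    print(age_range(47))
    print(day_offset(date(2007, 3, 16), admit))
    print(namer.proxy("Dr Smith"), namer.proxy("Dr Jones"), namer.proxy("Dr Smith"))
    print(redact("MRN 123456 seen in clinic", "123456"))
```

Keeping the proxy mapping stable within a record set preserves the fact that two mentions refer to the same person, which plain redaction would lose.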
[Figure: workflow diagram. Labels: CIS; Transcribed Reports; De-identification Methodology; QA; De-identified Database; De-identified Data; Query Interface; NLP; Other processes; Firewall; Trusted Proxy; RE-ID Method; Admin]
Considerations
When choosing a de-identification methodology, four things need consideration:
• What is the reliability and validity of the methodology?
• Can the method maintain its specificity and sensitivity in local use?
• What are the limitations of the methodology?
• Can files be re-identified?
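
One way to keep re-identification possible is sketched below, under assumptions that are not in the presentation: a trusted proxy holds a secret key and a lookup table, and only pseudonyms leave its control. The class name, the choice of HMAC-SHA256, the 16-character token length, and the sample record number are all hypothetical.

```python
import hashlib
import hmac


class TrustedProxy:
    """Hypothetical trusted proxy: issues pseudonyms and can reverse them on request."""

    def __init__(self, secret_key: bytes):
        self._key = secret_key
        self._lookup = {}  # pseudonym -> original identifier; stays with the proxy

    def pseudonym(self, patient_id: str) -> str:
        # A keyed hash gives the same token for the same patient every time,
        # but cannot be recomputed by anyone who lacks the key.
        digest = hmac.new(self._key, patient_id.encode("utf-8"), hashlib.sha256).hexdigest()
        token = digest[:16]
        self._lookup[token] = patient_id
        return token

    def reidentify(self, token: str) -> str:
        # Raises KeyError for tokens this proxy never issued.
        return self._lookup[token]


if __name__ == "__main__":
    proxy = TrustedProxy(b"demo-key-not-for-production")
    token = proxy.pseudonym("MRN-0001")
    print(token != "MRN-0001", proxy.reidentify(token))
```

The de-identified database would store only the token, so researchers can link records for one patient without learning who the patient is.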
Consistency, Reliability and Validity
• The fundamental problems are inter-record reliability, manpower resources and time constraints
• The issue then becomes the quality of the quality: over-marking (which lowers specificity) and under-marking (which lowers sensitivity)
• What are acceptable levels of sensitivity and specificity?
  • 100% sensitivity for names
• What is the benchmark?
• What is the value of consistency?
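
To make the two measures concrete, here is a minimal sketch that scores a de-identification run against a manual review, token by token. The function name and the sample labels are hypothetical; `True` means "this token is an identifier".

```python
def sensitivity_specificity(gold, predicted):
    """Compare system marks with manual-review marks, one boolean per token.

    Sensitivity = identifiers caught / all identifiers (hurt by under-marking).
    Specificity = non-identifiers left alone / all non-identifiers (hurt by over-marking).
    Returns None for a measure whose denominator is zero.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)          # identifier, marked
    fn = sum(1 for g, p in pairs if g and not p)      # identifier, missed
    tn = sum(1 for g, p in pairs if not g and not p)  # not an identifier, left alone
    fp = sum(1 for g, p in pairs if not g and p)      # not an identifier, marked anyway
    sensitivity = tp / (tp + fn) if (tp + fn) else None
    specificity = tn / (tn + fp) if (tn + fp) else None
    return sensitivity, specificity


if __name__ == "__main__":
    # Hypothetical eight-token record: three true identifiers, one missed, one over-marked.
    gold = [True, True, False, False, False, False, True, False]
    system = [True, False, False, True, False, False, True, False]
    sens, spec = sensitivity_specificity(gold, system)
    print(f"sensitivity={sens:.3f} specificity={spec:.3f}")
```

In this made-up example the system catches 2 of 3 identifiers (sensitivity about 0.667) and leaves 4 of 5 ordinary tokens untouched (specificity 0.8); a 100% sensitivity target for names would require zero missed name tokens.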
Automated Methodologies: As Good As?/Better?
• Classification of tokens
• Sequence tracking problem (using Hidden Markov Models or Conditional Random Fields)
• Rule-based system utilizing global features (sentence position), local features (lexical cues, special characters, and format patterns), and syntactic features
• Hybrid systems of rules, pattern matching algorithms, heuristics and dictionaries
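
As an illustration of the hybrid approach only (not of the HMM or CRF approaches), the sketch below labels each token using format patterns, a small dictionary, and one lexical cue. The patterns, the dictionary entries, the title list, and the sample sentence are hypothetical and far too small for real use.

```python
import re

# Format patterns (local features): hypothetical, US-style examples only.
FORMAT_PATTERNS = {
    "DATE": re.compile(r"^\d{1,2}/\d{1,2}/\d{2,4}$"),
    "PHONE": re.compile(r"^\d{3}-\d{3}-\d{4}$"),
}
# Dictionary lookup: a stand-in for a real, locally maintained name list.
NAME_DICTIONARY = {"smith", "jones"}
# Lexical cue: a capitalized token right after a title is treated as a name.
TITLE_CUES = {"dr", "dr.", "mr", "mr.", "mrs", "mrs.", "ms", "ms."}


def classify_tokens(text):
    """Return (token, label) pairs; 'O' means the token is not an identifier."""
    tokens = text.split()
    labeled = []
    for i, token in enumerate(tokens):
        core = token.strip(".,;:()")
        label = "O"
        for name, pattern in FORMAT_PATTERNS.items():
            if pattern.match(core):
                label = name
                break
        if label == "O" and core.lower() in NAME_DICTIONARY:
            label = "NAME"
        if (label == "O" and i > 0
                and tokens[i - 1].lower() in TITLE_CUES
                and core[:1].isupper()):
            label = "NAME"
        labeled.append((token, label))
    return labeled


def deidentify(text):
    """Replace each flagged token (including attached punctuation) with its category."""
    return " ".join(tok if lab == "O" else f"[{lab}]" for tok, lab in classify_tokens(text))


if __name__ == "__main__":
    print(deidentify("Seen by Dr. Patel on 3/14/2007, call 412-555-0100."))
```

The title cue catches names missing from the dictionary, which is one reason hybrid systems combine several weak signals; each signal on its own would either miss tokens or over-mark.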
Local Use
• Can your methodology be customized to meet local needs?
• While some methods may have good ‘numbers’, will they hold up in local use?
• Every community has its own acronyms, place names and other local vocabulary
• What is the protocol to manage local quality?
  • Regular checks against manual review
  • Formal evaluation research
“Data Rights” Issues
• Legal models exist
• Make “de-identified” data sharing part of informed consent
• Offer different tiers of consent
  • Publicly-funded research
  • Academic research
  • Commercial research
• Make the general public aware of the level of existing data sharing
  • Claims data already widely shared and sold
Call to Action: Pathology Informatics Community
• caBIG and caTIES are models for cross-institutional data sharing
• Major institutions are establishing data repositories of pathology reports
• Help facilitate data aggregation among other departments
  • Radiology (Radiology Reports)
  • Surgery (Surgical Notes)
  • Medicine (Discharge Summaries)
• Establish cross-department “Rapid Learning” teams