De-identifying Pathology Reports for Pathology Informatics
James Gardner, Li Xiong
Department of Math and Computer Science
Fusheng Wang, Andrew Post, Joel Saltz
Center for Comprehensive Informatics
Jan 02, 2016
Introduction
• The HIPAA Privacy Rule regulates the use and disclosure of Protected Health Information (PHI)
• De-identification of pathology reports is critical to enabling secondary use of medical records for research
• HIDE (Health Information DE-identification) is an open-source de-identification tool built on statistical de-identification techniques
HIPAA Identifiers
1. Names;
2. All geographical subdivisions smaller than a state;
3. All elements of dates (except year);
4. Phone numbers;
5. Fax numbers;
6. Electronic mail addresses;
7. Social Security numbers;
8. Medical record numbers;
9. Health plan beneficiary numbers;
10. Account numbers;
11. Certificate/license numbers;
12. Vehicle identifiers and serial numbers;
13. Device identifiers and serial numbers;
14. Web Universal Resource Locators (URLs);
15. Internet Protocol (IP) address numbers;
16. Biometric identifiers, including finger and voice prints;
17. Full face photographic images or comparable images; and
18. Any other unique identifying number, characteristic, or code
• These identifiers have to be removed, or
• Based on the opinion of a qualified statistical expert, the risk of identifying an individual is very small
HIDE Overview
• Utilizes the state-of-the-art named entity recognition technique, Conditional Random Fields, for extracting PHI
− Previous tools such as DE-ID and HMS scrubber use rule-based approaches which are labor intensive and not portable
• Provides flexible de-identification options including full de-identification and state-of-the-art statistical de-identification
− Previous tools allow simple removal or substitution of the PHI
• Provides an easy-to-use web-based interface built on current web technologies
• Integrated with caTIES; integration with caTissue is in progress
PHI Extraction
• Utilizes state-of-the-art NLP technique, Conditional Random Fields
− High accuracy, easy to train, portable
• Combines different feature sets and sampling techniques
− Feature sets: dictionary, affix, regular expression and context
• Can use default models or custom trained models
− Web interface for annotating and training custom models
− A set of reports are loaded and manually labeled
− The labeled documents will generate a trained model for automatically de-identifying new reports
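The four feature families named above can be illustrated with a small token-level feature extractor of the kind typically fed to a CRF. This is only a sketch under stated assumptions: the dictionary contents, the regular expressions, and the function names `token_features` and `sentence_features` are hypothetical and are not taken from HIDE itself.

```python
import re

# Hypothetical dictionary of first names; a real system would load a large gazetteer.
NAME_DICTIONARY = {"james", "mary", "john", "linda"}

# Hypothetical regular-expression features for common PHI shapes.
REGEX_FEATURES = {
    "looks_like_date": re.compile(r"^\d{1,2}/\d{1,2}/\d{2,4}$"),
    "looks_like_phone": re.compile(r"^\d{3}-\d{3}-\d{4}$"),
    "all_digits": re.compile(r"^\d+$"),
}


def token_features(tokens, i):
    """Build a feature dict for tokens[i] from the four feature families."""
    token = tokens[i]
    lower = token.lower()
    features = {
        # Dictionary feature: is the token in a known-name list?
        "in_name_dictionary": lower in NAME_DICTIONARY,
        # Affix features: leading and trailing character n-grams.
        "prefix3": lower[:3],
        "suffix3": lower[-3:],
        # Context features: the neighboring tokens.
        "prev_token": tokens[i - 1].lower() if i > 0 else "<START>",
        "next_token": tokens[i + 1].lower() if i < len(tokens) - 1 else "<END>",
    }
    # Regular-expression features: does the token match a PHI-like pattern?
    for name, pattern in REGEX_FEATURES.items():
        features[name] = bool(pattern.match(token))
    return features


def sentence_features(tokens):
    """Feature dicts for every token in a tokenized sentence."""
    return [token_features(tokens, i) for i in range(len(tokens))]
```

A list of such feature dicts per sentence, paired with manually assigned labels, is the usual training input for a CRF sequence labeler.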
HIDE: De-identification Options
• Full de-identification
− Safe harbor: all 18 HIPAA identifiers removed or substituted
• Partial de-identification
− Limited data set: all direct HIPAA identifiers removed or substituted (except dates and address elements other than street/P.O. Box)
• Configurable de-identification
− A configurable set of identifiers removed or substituted
• Statistical de-identification
− Advanced anonymization that guarantees statistically rigorous privacy while preserving the utility of the data
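The difference between the full, partial, and configurable options comes down to which identifier types are removed or substituted once PHI spans have been extracted. The sketch below assumes a hypothetical, simplified label set and a hypothetical `deidentify` helper; it is not HIDE's actual API, and the limited data set rule is reduced here to "keep dates".

```python
# Hypothetical, simplified PHI labels (not the full list of 18 HIPAA identifiers).
SAFE_HARBOR_TYPES = {"NAME", "DATE", "PHONE", "MRN", "ADDRESS"}
# Simplified limited data set: dates are retained.
LIMITED_DATASET_TYPES = SAFE_HARBOR_TYPES - {"DATE"}


def deidentify(text, spans, types_to_remove):
    """Replace each selected PHI span with a [TYPE] placeholder.

    spans is a list of non-overlapping (start, end, phi_type) tuples,
    as might be produced by a PHI extraction step.
    """
    out = []
    cursor = 0
    for start, end, phi_type in sorted(spans):
        out.append(text[cursor:start])
        if phi_type in types_to_remove:
            out.append(f"[{phi_type}]")
        else:
            # Identifier type not selected: keep the original text.
            out.append(text[start:end])
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


report = "John Smith seen 01/02/2010"
phi_spans = [(0, 10, "NAME"), (16, 26, "DATE")]
full = deidentify(report, phi_spans, SAFE_HARBOR_TYPES)
partial = deidentify(report, phi_spans, LIMITED_DATASET_TYPES)
```

Passing any other set of labels as `types_to_remove` corresponds to the configurable option.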
Statistical De-identification Example
De-identification satisfying k-anonymity (k=2): every record is indistinguishable from the other records in its group, and every group contains at least k records
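The k-anonymity condition can be checked by grouping records on their quasi-identifier values and confirming that no group is smaller than k. The following is a minimal sketch; the field names and the generalized values are made-up illustrations, not data from the study.

```python
from collections import Counter


def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every quasi-identifier combination occurs at least k times."""
    groups = Counter(
        tuple(record[q] for q in quasi_identifiers) for record in records
    )
    return all(count >= k for count in groups.values())


# Illustrative records before generalization: every combination is unique.
raw_records = [
    {"age": 34, "zip": "30322"},
    {"age": 36, "zip": "30329"},
    {"age": 41, "zip": "30306"},
    {"age": 47, "zip": "30307"},
]

# The same records after generalizing age to a range and masking the zip code.
generalized_records = [
    {"age": "30-39", "zip": "303**"},
    {"age": "30-39", "zip": "303**"},
    {"age": "40-49", "zip": "303**"},
    {"age": "40-49", "zip": "303**"},
]
```

With k=2, the raw records fail the check because each one is unique, while the generalized records pass because each group contains two records.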
Study 1: PHI Extraction on Emory Pathology Reports
(100 reports, 10-fold cross validation)
Precision: true positives over the sum of true positives and false positives
Recall (sensitivity): true positives over total actual positives
F1: combination: 2*precision*recall/(precision + recall)
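The three metrics defined above can be computed directly from counts of true positives, false positives, and false negatives. This is a small sketch; the function name and the zero-division convention (returning 0.0) are assumptions made for illustration.

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    """Compute precision, recall, and F1 from raw counts.

    Returns 0.0 for any metric whose denominator would be zero.
    """
    predicted = true_pos + false_pos   # everything the system labeled as PHI
    actual = true_pos + false_neg      # everything that really is PHI
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

For example, 8 true positives with 2 false positives and 2 false negatives gives a precision, recall, and F1 of 0.8 each (up to floating-point rounding).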
Study 2: PHI Extraction on i2b2 Reports
• Based on 669 discharge summaries, 10-fold cross validation
• Good precision and recall for most individual PHI identifiers
• Good overall precision and recall for PHI extraction
Study 3: Impact of Different Feature Sets
Dictionary (d), affix (a), regular expression (r), and context (c) features are listed in order of increasing importance for CRF-based PHI extraction
Integrating HIDE with caTIES
• caTIES (cancer Text Information Extraction System) provides tools for de-identification and automated coding of free-text pathology reports
• caTIES supports de-identification extensibility: a new de-identifier plugs in by implementing its CaTIES_DeIdentifier interface
• HIDE provides HIDEDeIdentifier, which calls the HIDE client API
• Added a HIDE de-id option to the caTIES installer
• HIDE has been bundled with caTIES since release v3.7 (May 2010)
Integrating HIDE with caTissue (in Progress)
• caTissue uses caTIES v2.x, refactored into caTissue's workflow
• HIDE integration with caTissue is similar to the caTIES integration
• Implementation and evaluation are ongoing
• Goal: Integration of pathology reports into caTissue installation at Winship Cancer Institute at Emory University
Ongoing Development
• Continue development on HIDE/caTissue integration
• Usability improvement: simplified installation process
• System improvements
− Efficiency and scalability of the system
− Multiple file formats support
− Additional statistical de-identification options