IEBI Workshop, 10/23/07

Challenges in Evaluating Natural Language Processing Systems for Military Health Records

Carol Friedman, PhD
Columbia University/MedLEE Applications Technologies

Lawrence Fagan, MD, PhD
Stanford University/MedLEE Applications Technologies


Outline

• NLP evaluation issues

• Ideal evaluation of NLP output requires consideration of the context of the applications

• Catalog of common NLP applications in biomedicine and the implication for evaluation







Different Evaluation Objectives

• Different NLP communities have different objectives and traditions

Improvement of:
– Science of NLP
– Science of biomedical NLP
– Biological research
– Clinical research
– Clinical care


Evaluation Objectives Determine

• Evaluation design

• NLP requirements
– Type of information needed
  • Medical terms with/without modifiers
  • Clinical & other external knowledge
– End product
  • Codes, facts, yes/no categories


Evaluation to Improve Clinical Research and Care

Issues to Consider

• Need to start with a concrete clinical goal
– Detect potential case of tuberculosis in chest x-ray report for isolation
– Detect positive mammography reports for follow up
– Find new adverse events to find ways to avoid them


Type of Task: Broad vs. Narrow

• Very specific application
– Identify reports of patients who smoke
– Identify x-ray reports positive for pneumonia

• General application
– Data mining & knowledge discovery
– Generate patient problem list


Application Requires NLP + External Knowledge

• Structural knowledge
– Extract diagnoses from Diagnosis Section of Discharge Summaries

• Coding knowledge
– ICD-9 coding of x-ray reports for billing
– Example: 486 (pneumonia) for infiltrate in CXR

• Clinical knowledge
– Identifying x-ray reports indicating pneumonia
– ~38 different combinations of findings & modifiers


NLP Components

• Different steps of process impact results

Preprocess → Extraction Engine → Post-process
– Preprocess: clean-up, recognize text portions and boundaries, …
– Extraction Engine: recognize entities, relations, generate codes, …
– Post-process: clinical logic for application

Example 1
– CXR findings: opacity (mod: patchy; loc: left lung) ........
– Pneumonia: possible

Example 2
– CXR findings: opacity (mod: 5x5cm; loc: left lung) ........
– Pneumonia: unlikely
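The three stages above can be sketched as a toy pipeline. This is only an illustration under stated assumptions: the function names (`preprocess`, `extract`, `postprocess`, `classify`), the keyword matching, and the rule table are hypothetical, not MedLEE's actual design. The two rules simply restate the slide examples (patchy opacity: possible; 5x5cm opacity: unlikely).

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Finding:
    """One extracted finding with its modifiers (hypothetical structure)."""
    name: str
    modifiers: Dict[str, str] = field(default_factory=dict)


def preprocess(report: str) -> str:
    """Clean-up step: collapse whitespace and lowercase the text."""
    return " ".join(report.split()).lower()


def extract(text: str) -> List[Finding]:
    """Toy stand-in for an extraction engine: plain keyword lookup."""
    findings: List[Finding] = []
    if "opacity" in text:
        mods: Dict[str, str] = {}
        if "patchy" in text:
            mods["mod"] = "patchy"
        elif "5x5cm" in text:
            mods["mod"] = "5x5cm"
        if "left lung" in text:
            mods["loc"] = "left lung"
        findings.append(Finding("opacity", mods))
    return findings


def postprocess(findings: List[Finding]) -> str:
    """Clinical logic: only the two slide examples are encoded here."""
    for f in findings:
        if f.name != "opacity":
            continue
        if f.modifiers.get("mod") == "patchy":
            return "possible"
        if f.modifiers.get("mod") == "5x5cm":
            return "unlikely"
    return "undetermined"


def classify(report: str) -> str:
    """Run all three stages and return the pneumonia label."""
    return postprocess(extract(preprocess(report)))


print(classify("Patchy opacity in the left lung."))
print(classify("Opacity 5x5cm in the left lung."))
```

Because the final label depends on all three stages, an error observed at the output could come from clean-up, extraction, or the clinical logic, which is why the evaluation needs to separate them.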


Use of Experts

• Need guidelines and examples

• How much to train

• Inter-annotator agreement & resolution

• Borderline cases confound results

• Granularity issues
– Comparability
– Example: Fever (mod: persistent) vs. Persistent fever
– SNOMED codes: persistent fever, chronic persistent fever, prolonged fever, fever (mod: persistent)
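One common way to quantify inter-annotator agreement between two experts is Cohen's kappa, which corrects raw agreement for agreement expected by chance. The slides do not name a specific statistic, so the choice of kappa and the function name `cohen_kappa` are assumptions made for illustration.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labeling the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("annotations must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each annotator's label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    labels = set(count_a) | set(count_b)
    expected = sum(count_a[l] * count_b[l] for l in labels) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label for every item.
        return 1.0
    return (observed - expected) / (1.0 - expected)


expert_1 = ["pos", "pos", "neg", "neg"]
expert_2 = ["pos", "neg", "neg", "neg"]
print(cohen_kappa(expert_1, expert_2))
```

Granularity matters here: if one expert records "persistent fever" and another "fever (mod: persistent)", the labels must be normalized to a common form before agreement is computed, or equivalent answers will be counted as disagreements.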


Document Heterogeneity & Complexity of Text

• Chief complaints (‘well baby 3 mo’, ‘c/f/h’)

• Discharge summaries, radiology reports

• Reports with structured & unstructured information

• Telegraphic notes

• Special templates


“Well-Structured” Reports: Chest Radiology Report

CLINICAL INFORMATION: F/U.

IMPRESSION: MODERATE PULMONARY VASCULAR CONGESTION AND INTERSTITIAL EDEMA SHOWS NO SIGNIFICANT CHANGE FROM 3/25 THROUGH 3/27/95. SIDE HOLE OF THE NG TUBE IS NEAR THE EG JUNCTION. DEVELOPMENT OF RIGHT BASILAR ATELECTASIS ON 3/27/95.

DESCRIPTION: A series of portable chest x-rays demonstrate worsening but stable vascular congestion and interstitial edema from 3/25 through 3/27/95. The NG tube side hole is seen near the EG junction. A duo-tube is seen extending into the stomach, but its distal tip is not seen. A tracheostomy is seen in good position.

……………………………………………………..


Mixed Structure: Catheterization Report


Poorly Structured Report: Telegraphic Note

Admit 10/23
71 yo woman h/o DM, HTN, Dilated CM/CHF, Afib s/p embolic event, chronic diarrhea, admitted with SOB. CXR pulm edema. Rx’d Lasix.
All: none
Meds Lasix 40mg IVP bid, ASA, Coumadin 5, Prinivil 10, glucophage 850 bid, glipizide 10 bid, immodium prn
Hospitalist=Smith PMD=Jones
Full Code, Cx>101


Reducing Potential Bias

NLP developers should avoid

– Designing study
– Being involved in choice or determination of reference standard
– Correcting bugs
– Changing system
– Performing actual evaluation


Analyzing Results & Errors

• Determine effect of components on performance
– NLP vs. domain knowledge
– Document characteristics/quirks
– Frequency of adding/updating clinical terms
– Type of NLP task: classification/information extraction/specialized
– Borderline situations

• Report degree of complexity needed to correct errors

• Determine if performance is adequate for task

• Report on confidence intervals
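As a sketch of reporting confidence intervals, the Wilson score interval is one standard choice for a proportion such as sensitivity or precision. The slides do not prescribe a method, so the use of the Wilson interval, the function name `wilson_interval`, and the counts in the example are assumptions for illustration.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives about 95%)."""
    if total <= 0 or not 0 <= successes <= total:
        raise ValueError("need total > 0 and 0 <= successes <= total")
    p = successes / total
    denom = 1.0 + z * z / total
    center = (p + z * z / (2.0 * total)) / denom
    half = z * math.sqrt(p * (1.0 - p) / total + z * z / (4.0 * total * total)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical counts: the system found 45 of 50 reference-standard positives.
low, high = wilson_interval(45, 50)
print(f"sensitivity = {45 / 50:.2f}, 95% CI = ({low:.3f}, {high:.3f})")
```

With small test sets the interval is wide, which bears directly on whether performance can be judged adequate for the task.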


Other Issues: Clinical Environment

• Heterogeneity
– Systems
– Document formats
– Document types
– Clinical domain

• Working with physicians

• Clinical evaluation tradition

• Workflow issues


Patient Documents

• Lack of access to patient records
– Significant bottleneck for NLP progress

• Difficult to get permission to share from health care institutions

• Large scale effort needed to establish scrubbed document sets for development and evaluation

• Individual efforts beneficial but limited and scattered


Outline

• NLP Evaluation Issues




Context-based Evaluation: Example Record

• Chief Complaint: Asthma re-evaluation.

• Subjective: 8 year-old girl with past history of moderate persistent asthma while living in Alaska until 2 years ago

• The primary triggers for her asthma have been viral colds and irritant exposure, and she had particular difficulty with the forest fire smoke in central Alaska.

• She also has a history of a low serum IgA. Her last IgA determination was August 2004, which showed an IgA level of 29 mg/dl, with the lower limit of normal for a child her age being 33.


Context-based Evaluation

• Chief Complaint: Asthma re-evaluation.

• Subjective: 8 year-old girl with past history of moderate persistent asthma while living in Alaska until 2 years ago

• Tasks: Disease Maintenance Summarization vs. Infectious Disease Reporting






• Chief Complaint: Asthma re-evaluation.

• …

• The primary triggers for her asthma have been viral colds and irritant exposure, and she had particular difficulty with the forest fire smoke in central Alaska.

• …

• Tasks: Disease Maintenance Summarization vs. Infectious Disease Reporting






• Chief Complaint: Asthma re-evaluation.

• …

• She also has a history of a low serum IgA. Her last IgA determination was August 2004, which showed an IgA level of 29 mg/dl, with the lower limit of normal for a child her age being 33.

• Tasks: Disease Maintenance Summarization vs. Infectious Disease Reporting










Potential NLP Applications

• Health reporting requirements

• Known disease surveillance

• Unknown disease surveillance

• Recognizing adverse drug reactions

• Quality assurance/avoiding clinical errors

• Charge capture

• Recognizing scientific relations in text databases


Health Reporting Requirements

• Example: Reporting new TB cases

• Task description: Governmental requirements that certain disease states must be identified within a period after the original information (typically diagnosis) is identified.

• Task requirements: Text may be confined to one or more sections of record. May require inference to identify disease state. May be easier to get the “right” answer than other apps.


Known Disease Surveillance

• Example: Locating Hospital Acquired (nosocomial) infections

• Task description: Looking at a set of fixed reports for specific findings or combination of findings that suggest disease state

• Task requirements: Need to combine free text with structured text such as lab reports, and existing codes (e.g., ICD-9 coding on discharge)


“Unknown” Disease Surveillance

• Example: Looking for the next “Gulf War syndrome.”

• Task description: By far the most difficult task, because it is not clear what is being searched for. Looking for a pattern of signs, symptoms, lab tests, time course, etc., not explained by known patterns

• Task requirements: Every concept is potentially relevant, plus significant inference is needed to determine novelty of problem.


Recognizing Adverse Drug Reactions

• Example: Searching for known (and possibly unknown) side effects of treatments

• Task description: Side effect profiles are known for many drugs/regimens. Early recognition of onset of those side effects important to decreasing morbidity

• Task requirements: Temporal relationship between treatment and possible side effects important to glean from narrative.
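The temporal requirement above can be illustrated with a minimal check on two dates gleaned from the narrative. The function name `is_candidate_reaction` and the 30-day default window are hypothetical choices for this sketch, not values from the talk; a real application would take the window from the drug's known side effect profile.

```python
from datetime import date, timedelta


def is_candidate_reaction(treatment_start: date,
                          symptom_onset: date,
                          window_days: int = 30) -> bool:
    """True if the symptom began on or after treatment start, within the window.

    window_days is an illustrative assumption, not a clinical standard.
    """
    delta = symptom_onset - treatment_start
    return timedelta(0) <= delta <= timedelta(days=window_days)


# Symptom 4 days after treatment start: worth flagging for review.
print(is_candidate_reaction(date(2007, 10, 1), date(2007, 10, 5)))
# Symptom predates the treatment: cannot be a reaction to it.
print(is_candidate_reaction(date(2007, 10, 1), date(2007, 9, 20)))
```

The hard part for NLP is upstream of this check: recovering reliable dates and ordering from narrative text, which is where evaluation for this application should concentrate.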


Quality Assurance/Avoiding Clinical Errors

• Example: Flagging contra-indicated treatments due to a drug allergy

• Task description: Extract from narrative signs/symptoms/lab tests that suggest unanticipated response to prior treatment.

• Task requirements: Combining concepts from narrative with structured parts of records and comparing to guidelines/protocols


Charge Capture

• Example: Locating clinic/hospital charges that have not been otherwise captured

• Task description: Scan narrative for suggestion of procedures performed or supplies used that have not been billed

• Task requirements: Inferring actions from narrative and comparing with billing codes. Concepts are well defined and can be enumerated.
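Because the concepts are well defined and can be enumerated, the comparison step reduces to a lookup and a set difference. The sketch below assumes the NLP system has already produced a list of procedure names from the narrative; the function name `find_unbilled` and the placeholder codes (`CODE-A`, `CODE-B`) are hypothetical and do not correspond to any real billing code set.

```python
from typing import Dict, Iterable, List, Set


def find_unbilled(narrative_procedures: Iterable[str],
                  billed_codes: Set[str],
                  procedure_to_code: Dict[str, str]) -> List[str]:
    """Return procedures mentioned in the narrative whose code was not billed."""
    unbilled = set()
    for procedure in narrative_procedures:
        code = procedure_to_code.get(procedure)
        if code is None:
            # Not in the enumerated list of billable concepts; skip it.
            continue
        if code not in billed_codes:
            unbilled.add(procedure)
    return sorted(unbilled)


# Placeholder mapping from enumerated concepts to made-up billing codes.
mapping = {"chest x-ray": "CODE-A", "ng tube placement": "CODE-B"}
mentioned = ["chest x-ray", "ng tube placement", "family meeting"]
print(find_unbilled(mentioned, {"CODE-A"}, mapping))
```

An evaluation for this application can therefore score against a closed list of concepts, which is a narrower target than open-ended extraction.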


Recognizing Scientific Relations in Text Databases

• Example: Finding protein-protein interactions in PubMed database

• Task description: Scan abstracts to identify protein names and description of relationships

• Task requirements: Requires understanding of naming schemes in biology and ability to handle naming issues. Inference to identify correctly the relationship described in the text


Summary

• Overview of evaluation issues

• Key point: evaluation requires consideration of the context of the applications

