Use of SAS-Based Natural Language Processing to Identify Incident and Recurrent Malignancies

Justin A. Strauss, MA

Research Associate III

Kaiser Permanente Southern California

May 1, 2012 • 2012 HMORN Conference • Seattle, Washington

Co-Authors & Funding

• Chun R. Chao, PhD

• Marilyn L. Kwan, PhD

• Syed A. Ahmed, MD

• Joanne E. Schottinger, MD

• Virginia P. Quinn, PhD

Acknowledgements & Funding

• Mayra Martinez, Michelle McGuire, Melissa Preciado, Nirupa Ghai, and Jeff Slezak (KPSC); Lawrence Kushi (KPNC); Debra Ritzwoller (KPCO); Joan Warren (NCI); Jianyu Rao and Jiaoti Huang (UCLA)

• Funding was provided by KPSC Community Benefit and the Cancer Research Network

Malignancy Identification

• Malignancy identification is important for clinical and epidemiologic cancer research.

• Limited quality and availability of incident and recurrent malignancy data within health plans.

• Delayed availability of incident malignancy data from cancer registries.

• Few registries track cancer recurrences.

• Manual chart abstraction is slow and expensive.

• Previous research has shown electronic diagnosis codes (e.g., ICD-9) to be unreliable.

Natural Language Processing

• Natural language processing (NLP) can be used to identify and extract information from electronic clinical text, including incident and recurrent malignancy data.

• Increasing opportunity for NLP with adoption of electronic clinical systems in patient care delivery.

• Despite its potential value in clinical and research settings, NLP usage has been relatively sparse. Contributing factors may include:

• Technical complexity

• Systems integration requirements

• Habitual use of existing methods

SCENT Overview

• A SAS-based coding, extraction, and nomenclature tool (SCENT) was developed to identify incident and recurrent malignancies using text from pathology reports.

• SCENT is currently being implemented in two research studies at Kaiser Permanente Southern California (KPSC):

• Intervention to improve medication adherence among breast cancer patients.

• Differences in the prognosis of prostate cancer patients according to their genetic factors.

• Use of SAS programming minimizes implementation barriers and increases availability for multisite research.

Description of Methods

• SCENT identifies non-negated clinical concepts within pathology report text.

• Built using Base SAS (does not require the Text Miner add-on).

• Makes extensive use of SAS hash objects and regular expressions.

• Includes components for preprocessing, matching, negation and uncertainty detection, extracting diagnostic information (e.g., staging and Gleason score), and classifying report malignancy status.

• Flexibility to assign codes using a variety of coding systems.

• Validation used a subset of SNOMED 3.x (~1000 concepts).
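
The negation check can be illustrated with a minimal Python sketch (SCENT itself is written in SAS). The three cue patterns are the ones shown on the SCENT process diagram; the function name `negation_cues`, the added `\b` word boundaries, and the choice to report cues per text segment are assumptions made for illustration, not SCENT's actual logic.

```python
import re
from typing import List

# Cue patterns as shown on the SCENT process diagram. The leading \b word
# boundaries are an assumption added here to avoid matching inside words.
NEGATION_PATTERNS = [
    r"\bfree (of|from)",
    r"\bnot? (support[a-z]*|identified)",
    r"\bnon(?!small|hodgkins)",  # lookahead skips "nonsmall" and "nonhodgkins"
]
NEGATION_RE = [re.compile(p) for p in NEGATION_PATTERNS]


def negation_cues(segment: str) -> List[str]:
    """Return the negation cue strings found in a text segment (hypothetical helper)."""
    text = segment.lower()
    cues = []
    for pattern in NEGATION_RE:
        match = pattern.search(text)
        if match:
            cues.append(match.group(0))
    return cues
```

The negative lookahead in the third pattern is what keeps disease names such as "nonsmall cell carcinoma" from being read as negated findings.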

SCENT Process Diagram

[Figure: SCENT process flow from the concept dictionary and pathology text to coded text. Labels and examples recovered from the diagram are listed below.]

Inputs

• Clinical Concepts (Excel): Type (morphology, topology, or procedural); Code (SNOMED 3.X); Class (malignant, basaloid, benign, or N/A); Description (concept description)

• Concept Dictionary (SAS)

• Pathology Text (Research Database): Text (raw text segment from report); Line (sequential text segment identifier)

Concept dictionary steps: Tokenize Words, Regular Expressions

• Code: M-85033. Description: intraductal papillary adenocarcinoma with invasion

• Tokenized description: [intraductal][papillary][adenocarcinoma][with][invasion]

• Regular expressions: [((intra)?duct(al)?)][papillar(y|ies)][adenocarcinoma[ls]?]

Pathology text steps: Clean, Enhance, Tokenize Words

• Segments before: [moderately-differentiated ductal adenocarcinoma with papillary][features.][the tumor involves 0.6 cm of one core.]

• Segments after: [moderately-differentiated ductal adenocarcinoma with papillary features.][the tumor involves 0.6 cm of one core.]

• Tokenized preprocessed text: [moderately] [differentiated] [ductal] [adenocarcinoma] with [papillary] [features]

Matching steps: Loop Concepts, Examine Segments, Match Tokens, Check Negation, Code Matches

• Negation patterns: free (of|from); not? (support[a-z]*|identified); non(?!small|hodgkins)

• Coded text: moderately differentiated <nlp snm=m85033 type=m class=3>ductal adenocarcinoma with papillary</nlp snm=m85033> features

Extract Data: Disease Extent, Diagnostic Certainty, Tumor Staging, Gleason Score
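
The matching example in the diagram can be approximated with a short Python sketch. It assumes that a concept matches when every one of its token patterns is found somewhere in the segment, in any order, and that the tag spans from the first to the last matched token. Both rules are inferred from the tagged output in the diagram and may differ from SCENT's SAS implementation. The `Concept` class and `tag_segment` function are hypothetical names.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Concept:
    code: str                  # SNOMED 3.x code as written in the tag, e.g. "m85033"
    type: str                  # m = morphology, t = topology, p = procedural
    cls: Optional[str]         # class attribute as it appears in the tag, if any
    token_patterns: List[str]  # one regular expression per required token


# The only dictionary entry here is the one shown in the diagram.
CONCEPTS = [
    Concept("m85033", "m", "3",
            [r"(intra)?duct(al)?", r"papillar(y|ies)", r"adenocarcinoma[ls]?"]),
]


def tag_segment(segment: str, concepts: List[Concept] = CONCEPTS) -> str:
    """Wrap the first concept whose tokens all occur in the segment in an <nlp> tag."""
    for concept in concepts:
        spans = []
        for pattern in concept.token_patterns:
            match = re.search(r"\b" + pattern + r"\b", segment)
            if match is None:
                break  # a required token is missing; try the next concept
            spans.append(match.span())
        else:
            # Every token matched: tag from the earliest start to the latest end.
            start = min(s for s, _ in spans)
            end = max(e for _, e in spans)
            attrs = f"snm={concept.code} type={concept.type}"
            if concept.cls is not None:
                attrs += f" class={concept.cls}"
            return (segment[:start] + f"<nlp {attrs}>" + segment[start:end]
                    + f"</nlp snm={concept.code}>" + segment[end:])
    return segment
```

In SAS the concept list would more naturally live in a hash object keyed by code; a plain Python list stands in for it here.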

Sample Report Coding

Preprocessed Text

LEFT BREAST CORE BIOPSY TWO O CLOCK.<BR>

INVASIVE DUCTAL CARCINOMA NOTTINGHAM GRADE 2.<BR>

NO CALCIFICATION IS IDENTIFIED.<BR>

NO VASCULAR INVASION IS IDENTIFIED.<BR>

HORMONE RECEPTOR AND HER 2 NEU STATUS PENDING AN ADDENDUM WILL FOLLOW.

Coded Text

<NLP SNM=T04030 TYPE=T>LEFT BREAST</NLP SNM=T04030> CORE <NLP SNM=P1140 TYPE=P>BIOPSY</NLP SNM=P1140> TWO O CLOCK.<BR>

INVASIVE <NLP SNM=M85003 TYPE=M CLASS=3>DUCTAL CARCINOMA</NLP SNM=M85003> NOTTINGHAM GRADE 2.<BR>

NO CALCIFICATION IS IDENTIFIED.<BR>

NO VASCULAR INVASION IS IDENTIFIED.<BR>

HORMONE RECEPTOR AND HER 2 NEU STATUS PENDING AN ADDENDUM WILL FOLLOW.
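
Any later step that works from coded text has to read the concepts back out of the tags. A minimal Python sketch of that parsing follows, assuming the tag layout seen in the coded sample (`SNM`, `TYPE`, and an optional `CLASS` attribute). The function name `extract_concepts` is hypothetical, and SCENT's own SAS code may handle this differently.

```python
import re
from typing import Dict, List, Optional

# Matches one <NLP ...>text</NLP SNM=code> pair; the closing tag must repeat the code.
TAG_RE = re.compile(
    r"<NLP SNM=(?P<code>\w+) TYPE=(?P<type>\w)(?: CLASS=(?P<cls>\w+))?>"
    r"(?P<text>.*?)"
    r"</NLP SNM=(?P=code)>",
    re.IGNORECASE,
)


def extract_concepts(coded_text: str) -> List[Dict[str, Optional[str]]]:
    """Return one dict per <NLP> tag with its code, type, class, and tagged text."""
    return [m.groupdict() for m in TAG_RE.finditer(coded_text)]
```

For the first line of the coded sample this yields two concepts, `T04030` ("LEFT BREAST") and `P1140` ("BIOPSY"), each with `cls` set to `None` because those tags carry no `CLASS` attribute.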

Validation Study

• To validate SCENT, trained chart abstractors reviewed electronic pathology reports.

• Random samples of breast (n=400) and prostate (n=400) cancer patients.

• Patients diagnosed at KPSC between 2000 and 2007.

• Reports included from six months post-diagnosis through end of 2008.

• In total, 206 breast and 186 prostate cancer patients contributed 490 and 425 eligible reports, respectively.

• SCENT classifications were compared with those of abstractors.

Classification Concordance

Abstractor classifications are in columns and SCENT classifications in rows. Cells show the percentage of the abstractor column total, with N in parentheses.

SCENT Classification     Benign        Cancer Recurrence   Other Primary Cancer   Suspicious   Kappa
Breast Cancer (Total)    (436)         (32)                (18)                   (4)          0.96
Benign                   99.8 (435)    -                   -                      25.0 (1)
Cancer Recurrence        -             100.0 (32)          -                      -
Other Primary Cancer     0.2 (1)       -                   100.0 (18)             50.0 (2)
Suspicious               -             -                   -                      25.0 (1)
Prostate Cancer (Total)  (356)         (29)                (36)                   (4)          0.95
Benign                   99.4 (354)    -                   5.6 (2)                -
Cancer Recurrence        -             96.6 (28)           2.8 (1)                -
Other Primary Cancer     0.6 (2)       3.4 (1)             91.7 (33)              -
Suspicious               -             -                   -                      100.0 (4)

Note: incident contralateral breast malignancies were considered to be recurrences.
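
The kappa values can be checked directly from the concordance counts. The Python sketch below computes unweighted Cohen's kappa. The slides do not say which kappa variant was used, so the unweighted form is an assumption, although it is consistent with the reported 0.96 and 0.95.

```python
from typing import List

# Rows: SCENT classification; columns: abstractor classification.
# Order for both: benign, cancer recurrence, other primary cancer, suspicious.
BREAST = [
    [435, 0, 0, 1],
    [0, 32, 0, 0],
    [1, 0, 18, 2],
    [0, 0, 0, 1],
]
PROSTATE = [
    [354, 0, 2, 0],
    [0, 28, 1, 0],
    [2, 1, 33, 0],
    [0, 0, 0, 4],
]


def cohen_kappa(matrix: List[List[int]]) -> float:
    """Unweighted Cohen's kappa for a square agreement matrix of counts."""
    n = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(len(matrix))) / n
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    # Chance agreement: product of marginal proportions, summed over categories.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2
    return (observed - expected) / (1 - expected)
```

Because most reports are benign, chance agreement is high (about 0.80 for breast), which is why kappa sits below the raw agreement of roughly 99%.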

SCENT Performance Metrics

Cancer Type       Sensitivity*       Specificity*       PPV*               NPV*
Breast Cancer     1.00 (0.93-1.00)   0.99 (0.98-1.00)   0.94 (0.85-0.98)   1.00 (0.99-1.00)
Prostate Cancer   0.97 (0.89-0.99)   0.99 (0.98-1.00)   0.97 (0.89-0.99)   0.99 (0.98-1.00)

* Shown with Wilson's 95% confidence interval.
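
These metrics can be recomputed from the concordance counts with a short Python sketch. It assumes a report counts as malignant when classified as a cancer recurrence or other primary cancer, and as non-malignant when classified as benign or suspicious. The slides do not state this rule, but it reproduces the reported estimates and intervals. The 2x2 counts below are derived under that assumption, and `wilson_interval` and `metrics` are hypothetical helper names.

```python
import math
from typing import Dict, Tuple


def wilson_interval(successes: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for about 95%)."""
    p = successes / total
    denom = 1 + z ** 2 / total
    center = (p + z ** 2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return center - half, center + half


def metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, Tuple[float, float, float]]:
    """Point estimate and Wilson bounds for each metric, rounded to two places."""
    pairs = {
        "sensitivity": (tp, tp + fn),
        "specificity": (tn, tn + fp),
        "ppv": (tp, tp + fp),
        "npv": (tn, tn + fn),
    }
    out = {}
    for name, (k, n) in pairs.items():
        low, high = wilson_interval(k, n)
        out[name] = (round(k / n, 2), round(low, 2), round(high, 2))
    return out


# Counts collapsed from the concordance table under the assumption stated above.
breast = metrics(tp=50, fp=3, fn=0, tn=437)
prostate = metrics(tp=63, fp=2, fn=2, tn=358)
```

The Wilson interval stays inside [0, 1] and remains informative when the observed proportion is exactly 1.00, as with breast cancer sensitivity (50 of 50), where a simple normal-approximation interval would collapse to zero width.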

Conclusions

• Favorable results suggest SCENT can identify and extract information about primary and recurrent malignancies from pathology reports.

• Rapid cancer case identification.

• Improved measurement accuracy of common study endpoint.

• SCENT has the potential to expedite chart reviews by narrowing the search and highlighting relevant concepts.

• Generalized utility for extracting standardized disease scores and other clinical information.

• SCENT is proof of concept for SAS-based NLP that can be easily shared between institutions to support research.

Limitations & Next Steps

• SCENT has a number of limitations, including:

• Unable to disambiguate and contextualize identified clinical concepts without part-of-speech (POS) tagging.

• More susceptible to changes in text structure and increased linguistic variability than statistical NLP approaches.

• General purpose NLP (e.g., cTAKES) likely to perform better outside of pathology.

• Next steps include:

• Release SCENT source code and requisite support files.

• Optimize current functionality and assess feasibility of adding methods (e.g., POS tagging, n-grams, statistical classifiers).

• Attempt to identify non-pathologically diagnosed malignancies using radiology reports and clinical progress notes.

• Quantify cost savings associated with SCENT-assisted chart reviews.

Questions?
