Title
A systematic review of Natural Language Processing for
classification tasks in the field of incident reporting and adverse event
analysis.
Authors: Ian James Bruce Young a, Saturnino Luz b, Nazir Lone c
a Department of Anaesthesia, Critical Care and Pain Medicine, Edinburgh Royal Infirmary, 51 Little France Crescent, Edinburgh, Scotland, EH16 4SA. [email protected].
b Usher Institute of Population Health Sciences & Informatics, The University of Edinburgh, 9 Little France Rd, Edinburgh, Scotland EH16 4UX. [email protected]
c Usher Institute of Population Health Sciences and Informatics, The University of Edinburgh, Teviot Place, Edinburgh, EH8 9AG. [email protected]
Corresponding Author
Ian James Bruce Young
ABSTRACT
Context: Adverse events in healthcare are often collated in incident reports which contain
unstructured free text. Learning from these events may improve patient safety. Natural
language processing (NLP) uses computational techniques to interrogate free text, reducing
the human workload associated with its analysis. There is growing interest in applying NLP
to patient safety, but the evidence in the field has not been summarised and evaluated to date.
Objective: To perform a systematic literature review and narrative synthesis to describe and
evaluate NLP methods for classification of incident reports and adverse events in healthcare.
Methods: Data sources included Medline, Embase, The Cochrane Library, CINAHL,
MIDIRS, ISI Web of Science, SciELO, Google Scholar, PROSPERO, hand searching of key
articles, and OpenGrey. Data items were manually abstracted to a standardised extraction
form.
Results: From 428 articles screened for eligibility, 35 met the inclusion criteria of using NLP
to perform a classification task on incident reports, or with the aim of detecting adverse
events. The majority of studies used free text from incident reporting systems or electronic
health records. Models were typically designed to classify by type of incident, type of
medication error, or harm severity. A broad range of NLP techniques are demonstrated to
perform these classification tasks with favourable performance outcomes. There are
methodological challenges in how these results can be interpreted in a broader context.
Conclusion: NLP can generate meaningful information from unstructured data in the specific
domain of the classification of incident reports and adverse events. Understanding what or
why incidents are occurring is important in adverse event analysis. If NLP enables these
insights to be drawn from larger datasets, it may improve the learning from adverse events in
healthcare.

K-Nearest Neighbours (5 studies). Decision Trees and Random Forests were both used in 4 studies.
A number of other machine learning techniques appeared in 2 or fewer studies:
Topic Modelling, Decision Rules, Neural Networks, Boosted Trees, and Active Learning.
While many studies developed their own NLP models, several used proprietary NLP software
such as MedLEE or SAS Text Miner[63][64]. Five- or ten-fold cross-validation was used, either
to split data into training, validation, and testing sets or to optimise model parameters, in 10 of
the 35 studies[7,11,33,39,40,44,47,48,54,59].
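The split-and-average procedure behind k-fold cross-validation can be sketched in a few lines of Python. This is an illustrative sketch only; the function names and the scoring callback are ours, not taken from any of the reviewed studies:

```python
import random

def k_fold_indices(n_records, k=5, seed=0):
    """Shuffle record indices and split them into k roughly equal folds."""
    idx = list(range(n_records))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(n_records, train_and_score, k=5):
    """Train on k-1 folds, score on the held-out fold, and average over folds.

    `train_and_score` is any callable taking (train_indices, test_indices)
    and returning a performance metric such as AUC or F1.
    """
    folds = k_fold_indices(n_records, k)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        scores.append(train_and_score(train_idx, test_idx))
    return sum(scores) / k
```

In the reviewed studies the same machinery serves two purposes: estimating out-of-sample performance, and selecting model hyperparameters on the validation folds.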
The vast majority of studies used a manually annotated corpus of free text documents as the
comparator to their NLP model. Broadly, studies managed to develop an NLP classifier
whose performance approached that of the comparator. Fong et al. demonstrated an AUC of
0.96 using an SVM classifier to identify adverse drug events in incident reports[47]. Ong et
al. demonstrated an AUC of 0.97 using a Naïve Bayes classifier to identify patient
identification and handover events in incident reports[29].
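To make the kind of classifier these studies describe concrete, the following is a minimal multinomial Naïve Bayes sketch over bag-of-words features with add-one smoothing. It is a toy reconstruction, not the model of Ong et al.; the class name, tokenisation, and example labels are ours:

```python
import math
from collections import Counter, defaultdict

class NaiveBayesTextClassifier:
    """Minimal multinomial Naive Bayes over bag-of-words features."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.label_counts = Counter(labels)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        n_docs = sum(self.label_counts.values())
        scores = {}
        for label, n_label in self.label_counts.items():
            total = sum(self.word_counts[label].values())
            score = math.log(n_label / n_docs)  # log prior
            for word in text.lower().split():
                # add-one (Laplace) smoothed log likelihood
                score += math.log((self.word_counts[label][word] + 1)
                                  / (total + len(self.vocab)))
            scores[label] = score
        return max(scores, key=scores.get)
```

In practice the reviewed studies report richer feature engineering (n-grams, concept mapping, structured fields) and evaluate with AUC rather than a single predicted label.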
Figure 2: Summary of included studies
3.3.4 Quality Assessment
A critical evaluation of study quality was conducted using the TRIPOD reporting guidelines
as a framework[65]. Broadly, studies clearly identified the data source and the nature of the
classification task. Datasets consisted of a mixture of free text and structured data entry
fields. These were extracted, in all cases, from internal databases affiliated either to a hospital
or public institution. The number of records extracted for use ranged from five to over 20
million, with a median of 2974[44,45]. Studies were clear that NLP was being used to
develop rather than validate predictive models.

Figure 2 Legend:
Incident Reporting System (IRS), Electronic Health Record (EHR), Morbidity & Mortality Record (M&M), Adverse Drug Event Reporting System (ADE), Discharge Summary (DIS)
Incident Type (IT), Medication Error Type (MET), Harm Severity (HS), Contributory Factors (CF), Postoperative Complication (POC)
Topic Modelling (TM), Proprietary Software (PS), Neural Networks (NN), Decision Trees (DT), Logistic Regression (LR), K-Nearest Neighbours (K-NN), Naïve Bayes (NB), Support Vector Machines (SVM), Random Forests (RF)
Manually Annotated Corpus (MAC), Initial Reporter Classification (IRC)

Studies were clear about the method for
internal validation, which was typically a manually annotated corpus. Studies often described
multiple NLP models and performance metrics. It was typically not clear which of these had
been decided a priori, and whether actions were taken to blind the assessment of individual
models. The specifics of NLP model development were not made clear in all cases.
Classification performance was reported heterogeneously amongst the included studies.
Studies discussed limitations and provided an overall interpretation of their results and
potential clinical applications of their models. Studies provided conflict of interest and
funding statements.
4.0 DISCUSSION
4.1 SUMMARY OF EVIDENCE
There are now a number of studies demonstrating that NLP models can be developed to
classify the unstructured free text contained in incident reports and EHRs according to
incident type and the severity of harm associated. Published work has explored binary
classification techniques more widely than multi-labelled classification problems. The type of
NLP that has been found to perform best has varied between datasets and classification tasks.
A wide variety of model performance metrics are reported, reflecting different priorities in
the use of the model. In general, studies have developed NLP models which can perform
classification tasks in this domain with performance outcomes that approach those of manual
human classifiers.
4.2 LIMITATIONS
In conducting this systematic review, resource limitations did not allow for the search to be
performed in duplicate with two independent reviewers. We limited the search to English
language articles due to lack of funding for translation facilities. We also limited the search to
articles published since 2005 to ensure relevance to current practice as the fields of adverse
event analysis and NLP have evolved significantly over the past decade. As figure 3 shows,
frequency of publications in this field appears to be increasing, with the majority of studies
published in the last decade. As such, this review will have captured most relevant
publications. Syntactic and ontological differences between languages may limit the
applicability of the NLP models used in this review to other languages, particularly in the text
pre-processing techniques described[37][49].
Figure 3.
Some aspects of the internal validity of these studies have been explored, such as the
difficulty in assessing the quality of the comparator classification technique, and the effects of
multiple testing due to the publication of outcomes for multiple NLP models and
classification tasks. Both of these factors could bias in favour of the NLP model performance.
Across the range of studies, 16 of the 35 were published in just two journals: Studies in
Health Technology and Informatics, and the Journal of the American Medical Informatics
Association. Our search strategy included a grey literature search to minimise the possibility
of publication bias.
At outcome level, a challenge in this review was how best to summarise the performance of
NLP classifiers. In this review a narrative synthesis rather than a quantitative approach was
chosen due to studies reporting outcomes for multiple binary classification and multiple NLP
models, a lack of assurance of data homogeneity, and a lack of a uniform outcome
performance metric.
4.3 THE LACK OF DATA HOMOGENEITY
Although the majority of studies used free text from proprietary incident reporting systems,
this does not mean we can assume data homogeneity between these studies. It is recognised
that the performance of NLP classifiers is very data dependent[61]. This is one explanation
for why a range of NLP models were found to perform best across studies in this review.
Further it makes it difficult to infer which NLP model would demonstrate best classifier
performance on a future data set.
4.4 THE USE OF MULTIPLE PERFORMANCE METRICS
When reporting model performance, a metric should be chosen that best represents the
association between model classification and "true" classification[66]. There is however
acceptance that this relationship is complex and multifactorial. As such, there is an argument
for reporting all performance metrics such that the most information possible is available for
those who might wish to develop the model further or for a different use case. Fong et al. and
Ong et al. are good examples, presenting 6 and 5 performance metrics respectively[47][29].
It is known that efforts to improve one NLP model performance measure can detrimentally
affect another[66]. For example, increasing model sensitivity can decrease model precision. If
the model is adjusted to make it more likely to predict a positive occurrence, there will be an
increase in the number of positive occurrences that are recorded as positive, but also an
increase in the number of negative occurrences incorrectly recorded as positive.
Because of this, in model development one performance metric may have to be prioritised
above another. Appropriate prioritisation depends on the intended use of the model[66].
In the domain of using NLP for classification of incidents and adverse events, it is important
the model does not miss an important event, thus high sensitivity is important[66]. This could
result in a decreased specificity due to an increased false positive rate. In this case, it is likely
an important event would require some supplemental human confirmation, which could
manage the additional false positives.
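The sensitivity/precision trade-off described above can be made concrete by scoring the same predictions at two decision thresholds. This is a hedged sketch with invented scores and our own function names, not data from any reviewed study:

```python
def confusion_counts(y_true, scores, threshold):
    """Count true positives, false positives, and false negatives at a threshold."""
    tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(y_true, scores) if s < threshold and y == 1)
    return tp, fp, fn

def sensitivity_and_precision(y_true, scores, threshold):
    """Sensitivity = TP/(TP+FN); precision = TP/(TP+FP)."""
    tp, fp, fn = confusion_counts(y_true, scores, threshold)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return sensitivity, precision
```

Lowering the threshold makes the model more likely to predict a positive: sensitivity rises because more true events clear the bar, while precision can fall because more negatives are incorrectly flagged.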
4.4.1 The impact of multiple testing
As most studies report performance outcomes for multiple NLP models, they are framed as
methodological exploratory studies as much as experimental studies. The problem with
interpreting the use of multiple models might be considered similarly to interpreting results
from multiple testing[67].
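One standard way to control for multiple testing of this kind is the Benjamini-Hochberg false discovery rate procedure cited above[67]. A minimal sketch of the step-up procedure (our own implementation; none of the reviewed studies is known to have applied it):

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up FDR procedure.

    Returns the indices of the hypotheses that are rejected: find the largest
    rank k with p_(k) <= (k/m) * alpha, then reject all hypotheses of rank <= k.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            k_max = rank
    return sorted(order[:k_max])
```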
4.5 COMPARING PERFORMANCE AGAINST MANUAL ANNOTATION
Model performance is typically reported as compared to a manually annotated corpus. This
presumes manual annotation to be a gold standard. Thus, the validity of the accuracy
measurements depends on the validity of the manual annotation. The use of multiple
annotators and calculation of inter-rater agreement can improve the validity of manual
annotation, but it has limitations. For example, inter-rater agreement would be unaffected if
both raters missed a classification. Similarly, in most cases there is no way to be certain what
proportion of outcomes are unclassified by both automated and manual systems[66].
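Inter-rater agreement of the kind discussed here is commonly quantified with Cohen's kappa, which corrects raw agreement for the agreement expected by chance. A minimal sketch (our own implementation; the degenerate case where both raters use a single label throughout is not handled):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: (observed - expected) / (1 - expected) agreement."""
    n = len(rater_a)
    observed = sum(1 for a, b in zip(rater_a, rater_b) if a == b) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    # chance agreement from each rater's marginal label frequencies
    expected = sum(counts_a[l] * counts_b[l] for l in labels) / (n * n)
    return (observed - expected) / (1 - expected)
```

Note that kappa is blind to the failure mode described above: if both raters miss the same classification, observed agreement, and therefore kappa, is unaffected.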
5.0 FUTURE RESEARCH
The majority of studies focus on binary classification, e.g. “drug error or no drug error”, “fall
or no fall”[54][38]. It is recognised that incidents which lead to harm are often the result of
multiple interacting factors[2]. Moving forwards, NLP interrogation of incident reports
should look to achieve high performing multi-class models[7].
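One common route from the binary classifiers that dominate the literature to the multi-label setting is a one-vs-rest scheme: train one binary scorer per incident class and attach every label whose score clears a threshold. A toy sketch (the scorers and labels are invented for illustration):

```python
def one_vs_rest(text, scorers, threshold=0.5):
    """Multi-label prediction via one-vs-rest.

    `scorers` maps each incident class to a callable returning a score in
    [0, 1]; a report receives every label whose score clears the threshold.
    """
    return sorted(label for label, score_fn in scorers.items()
                  if score_fn(text) >= threshold)

# Hypothetical per-class scorers standing in for trained binary classifiers.
example_scorers = {
    "fall": lambda t: 1.0 if "fell" in t else 0.0,
    "medication": lambda t: 1.0 if "drug" in t else 0.0,
}
```

Unlike a single binary model, this allows one report to carry several labels, which better reflects incidents arising from multiple interacting factors.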
Understanding why incidents occur may be more important for effecting change than
understanding what incidents have occurred. Further studies exploring the ability of NLP to
classify incident reports by contributory factors could offer more learning opportunities from
adverse events.
Clinical free text represents a massive data set which has been largely underutilised because
of its size, unstructured nature, and until recently, inability to be electronically
searched[68,69]. A wealth of new knowledge may be generated if computational techniques
such as NLP can make these data suitable for analysis.
6.0 CONCLUSIONS
This systematic review presents evidence that NLP can generate meaningful information
from unstructured data in the specific domain of the classification of incident reports and
adverse events. Understanding what incidents are occurring or why they are occurring is
important in adverse event analysis. NLP has the potential to allow such classification tasks
to be performed at scale, for example between hospitals within a geographic region, between
regions, or across an entire healthcare system. This has the potential to improve learning from
adverse events in healthcare, which may ultimately reduce the risk to future patients.
One of the roles of data science in healthcare is to reduce the human burden of data
acquisition and analysis. The hope, in doing so, is to give healthcare professionals the time to
think creatively and effect change[70]. In a broader context, understanding how to interrogate
this unstructured data offers opportunities in a range of healthcare settings.
CONTRIBUTORSHIP STATEMENT
Ian Young, Saturnino Luz, and Nazir Lone all qualify for authorship according to the
International Committee of Medical Journal Editors (ICMJE). Each shares responsibility for
the conception and design of this study, interpretation of this review, and drafting and critical
revision of the manuscript. Ian Young is the corresponding author and is further responsible
for the acquisition of the data used in this review.
STATEMENT ON CONFLICT OF INTEREST
The authors have no conflicts of interest to declare.
FUNDING
This work was supported by the department of Anaesthesia, Critical Care, and Pain Medicine
at the Royal Infirmary of Edinburgh, via monies from the Trustees of the Edinburgh
Anaesthesia Festival.
SUMMARY TABLE
What was already known
Analysis of incident reports and adverse events in healthcare is considered an important part of quality improvement and patient safety.
NLP can provide structured information from unstructured free text, including performing classification tasks.
What this study adds
Within the domain of incident reporting and adverse event analysis, NLP has been shown to perform favourably compared to manual annotation in a wide range of classification tasks.
Studies in this domain have focussed on binary classification of incident types. Exploring multi-class problems and contributory factor analysis of incident reports could have clinical utility.
No single NLP technique shows superiority in this domain and training multiple models may be required to optimise classifier performance.
Word Count: 3492
REFERENCES
[1] R. Lawton, R.R.C. McEachan, S.J. Giles et al. Development of an evidence-based
framework of factors contributing to patient safety incidents in hospital settings: a
[65] EQUATOR Network, TRIPOD Checklist: Prediction Model Development and Validation,
(2016).
[66] J. Chubak, G. Pocobelli, N.S. Weiss, Tradeoffs between accuracy measures for
electronic health care data algorithms, J. Clin. Epidemiol. 65 (2012) 343–349.
[67] Y. Benjamini, Y. Hochberg, Controlling the False Discovery Rate: A Practical and
Powerful Approach to Multiple Testing, J. R. Stat. Soc. Ser. B. 57 (1995) 289–300.
[68] T. Murdoch, A. Detsky, The Inevitable Application of Big Data to Health Care, JAMA
309 (2013) 1351–1352.
[69] N.R. Adam, R. Wieder, D. Ghosh, Data science, learning, and applications to
biomedical and health sciences, Ann. N. Y. Acad. Sci. 1387 (2017) 5–11.
[70] B. Young, Getting the measure of diabetes: the evolution of the National Diabetes
Audit, Practical Diabetes 35 (2018) 1–7.
APPENDIX 1
Full electronic search strategy
Limits placed on all searches were English language articles, in humans only, published between 2005 and 2018. When this review was revised, any article with a publication date after May 8th, 2018 was excluded, as this was the last date on which the search was run for the original review.
Ovid MEDLINE® ALL
1 (natural language processing or NLP or text mining or machine learning or artificial intelligence or information technology or classifier or document classification or semantic similarity or ontology).mp ⇒ 57960
2 (Natural Language Processing/ or Data Mining/ or Artificial Intelligence/ or Machine Learning/) ⇒ 36778
3 (event report or adverse event or medication event or incident report or medication error or medical error or error report or patient safety event).mp ⇒ 23367
4 (Adverse Drug Reaction Reporting System/ or Medical Errors/ or Patient Safety/) ⇒ 38438
5 (*Algorithms/ and 4) ⇒ 144
6 (1 or 2) ⇒ 63120
7 (3 or 4) ⇒ 58720
8 (6 and 7) or 5 ⇒ 924
MIDIRS: Maternity and Infant Care / Embase
1 (natural language processing or NLP or text mining or text classification or machine learning).af. ⇒ 29540 (MIDIRS 29; Embase 29511)
2 (event reports or adverse events or medication events or incident report* or patient safety).af. ⇒ 152722 (MIDIRS 150897; Embase 1825)
1 and 2 ⇒ 301 (MIDIRS 0; Embase 301)
1 and 2, deduplicated ⇒ 290 (MIDIRS 0; Embase 290)
Title and Abstract screen: 277 removed due to "wrong topic" or "irrelevant".
CINAHL
Initial search identical to Medline - 0 results. Then using CINAHL suggested search terms:
1 Natural Language Processing and Adverse Health Care Event - 1
Title and Abstract screen: 1 removed for “wrong topic”
Cochrane Library
MeSH "Natural Language Processing" Subheading "Classification" - 11
Title and Abstract screen: 11 excluded for "wrong topic" or "irrelevant".
SciELO
1 Event report* or adverse events or incident reporting AND Natural language processing or NLP - 3
Title and Abstract screen: 3 excluded for “not English language” or “irrelevant”.
ISI Web of Science
1 Event report* or adverse events or incident reporting AND Natural language processing or NLP
Filtered by Research Domain of "Science Technology" – 209
Title and Abstract screen: 198 excluded for "duplicates", "wrong topic" or "irrelevant", leaving 11.
Google Scholar (scholar.google.com)
1 "natural language processing" and "medical" and "incident reports" and "classification" - 228
Title and Abstract screen: 226 excluded for “duplicates”, "wrong topic" or "irrelevant”.