Title
A systematic review of Natural Language Processing for
classification tasks in the field of incident reporting and adverse event
analysis.
Authors: Ian James Bruce Young a, Saturnino Luz b, Nazir Lone c
a Department of Anaesthesia, Critical Care and Pain Medicine, Edinburgh Royal Infirmary, 51 Little France Crescent, Edinburgh, Scotland, EH16 4SA. [email protected].
b Usher Institute of Population Health Sciences & Informatics, The University of Edinburgh, 9 Little France Rd, Edinburgh, Scotland EH16 4UX. [email protected]
c Usher Institute of Population Health Sciences and Informatics, The University of Edinburgh, Teviot Place, Edinburgh, EH8 9AG. [email protected]
Corresponding Author
Ian James Bruce Young
ABSTRACT
Context: Adverse events in healthcare are often collated in incident reports which contain
unstructured free text. Learning from these events may improve patient safety. Natural
language processing (NLP) uses computational techniques to interrogate free text, reducing
the human workload associated with its analysis. There is growing interest in applying NLP
to patient safety, but the evidence in the field has not been summarised and evaluated to date.
Objective: To perform a systematic literature review and narrative synthesis to describe and
evaluate NLP methods for classification of incident reports and adverse events in healthcare.
Methods: Data sources included Medline, Embase, The Cochrane Library, CINAHL,
MIDIRS, ISI Web of Science, SciELO, Google Scholar, PROSPERO, hand searching of key
articles, and OpenGrey. Data items were manually abstracted to a standardised extraction
form.
Results: From 428 articles screened for eligibility, 35 met the inclusion criteria of using NLP
to perform a classification task on incident reports, or with the aim of detecting adverse
events. The majority of studies used free text from incident reporting systems or electronic
health records. Models were typically designed to classify by type of incident, type of
medication error, or harm severity. A broad range of NLP techniques are demonstrated to
perform these classification tasks with favourable performance outcomes. There are
methodological challenges in how these results can be interpreted in a broader context.
Conclusion: NLP can generate meaningful information from unstructured data in the specific
domain of the classification of incident reports and adverse events. Understanding what or
why incidents are occurring is important in adverse event analysis. If NLP enables these
insights to be drawn from larger datasets, it may improve the learning from adverse events in
healthcare.

K-Nearest Neighbours (5 studies). Decision Trees and Random Forests were both used in 4 studies.
A number of other machine learning techniques appeared in 2 or fewer studies:
Topic Modelling, Decision Rules, Neural Networks, Boosted Trees, and Active Learning.
While many studies developed their own NLP models, several used proprietary NLP software
such as MedLEE or SAS Text Miner[63][64]. Five- or ten-fold cross-validation was used, either
to split data into training, validation, and testing sets or to optimise model parameters, in 10 of
the 35 studies[7,11,33,39,40,44,47,48,54,59].
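The split-and-average procedure behind k-fold cross-validation can be sketched in a few lines of Python. This is an illustrative sketch only; the function names and the scoring callback are ours, not taken from any of the reviewed studies:

```python
import random

def k_fold_indices(n_records, k=5, seed=0):
    """Shuffle record indices and split them into k roughly equal folds."""
    idx = list(range(n_records))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(n_records, train_and_score, k=5):
    """Train on k-1 folds, score on the held-out fold, and average over folds.

    `train_and_score` is any callable taking (train_indices, test_indices)
    and returning a performance metric such as AUC or F1.
    """
    folds = k_fold_indices(n_records, k)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        scores.append(train_and_score(train_idx, test_idx))
    return sum(scores) / k
```

In the reviewed studies the same machinery serves two purposes: estimating out-of-sample performance, and selecting model hyperparameters on the validation folds.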
The vast majority of studies used a manually annotated corpus of free text documents as the
comparator to their NLP model. Broadly, studies managed to develop an NLP classifier
whose performance approached that of the comparator. Fong et al. demonstrated an AUC of
0.96 using an SVM classifier to identify adverse drug events in incident reports[47]. Ong et
al. demonstrated an AUC of 0.97 using a Naïve Bayes classifier to identify patient
identification and handover events in incident reports[29].
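To make the kind of classifier these studies describe concrete, the following is a minimal multinomial Naïve Bayes sketch over bag-of-words features with add-one smoothing. It is a toy reconstruction, not the model of Ong et al.; the class name, tokenisation, and example labels are ours:

```python
import math
from collections import Counter, defaultdict

class NaiveBayesTextClassifier:
    """Minimal multinomial Naive Bayes over bag-of-words features."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.label_counts = Counter(labels)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        n_docs = sum(self.label_counts.values())
        scores = {}
        for label, n_label in self.label_counts.items():
            total = sum(self.word_counts[label].values())
            score = math.log(n_label / n_docs)  # log prior
            for word in text.lower().split():
                # add-one (Laplace) smoothed log likelihood
                score += math.log((self.word_counts[label][word] + 1)
                                  / (total + len(self.vocab)))
            scores[label] = score
        return max(scores, key=scores.get)
```

In practice the reviewed studies report richer feature engineering (n-grams, concept mapping, structured fields) and evaluate with AUC rather than a single predicted label.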
Figure 2: Summary of included studies
3.3.4 Quality Assessment
A critical evaluation of study quality was conducted using the TRIPOD reporting guidelines
as a framework[65]. Broadly, studies clearly identified the data source and the nature of the
classification task. Datasets consisted of a mixture of free text and structured data entry
fields. These were extracted, in all cases, from internal databases affiliated either to a hospital
or public institution. The number of records extracted for use ranged from five to over 20
million, with a median of 2974[44,45]. Studies were clear that NLP was being used to
develop rather than validate predictive models.

Figure 2 Legend:
Incident Reporting System (IRS), Electronic Health Record (EHR), Morbidity & Mortality Record (M&M), Adverse Drug Event Reporting System (ADE), Discharge Summary (DIS)
Incident Type (IT), Medication Error Type (MET), Harm Severity (HS), Contributory Factors (CF), Postoperative Complication (POC)
Topic Modelling (TM), Proprietary Software (PS), Neural Networks (NN), Decision Trees (DT), Logistic Regression (LR), K-Nearest Neighbours (K-NN), Naïve Bayes (NB), Support Vector Machines (SVM), Random Forests (RF)
Manually Annotated Corpus (MAC), Initial Reporter Classification (IRC)

Studies were clear about the method for
internal validation, which was typically a manually annotated corpus. Studies often described
multiple NLP models and performance metrics. It was typically not clear which of these had
been decided a priori, and whether actions were taken to blind the assessment of individual
models. The specifics of NLP model development were not made clear in all cases.
Classification performance was reported heterogeneously amongst the included studies.
Studies discussed limitations and provided an overall interpretation of their results and
potential clinical applications of their models. Studies provided conflict of interest and
funding statements.
4.0 DISCUSSION
4.1 SUMMARY OF EVIDENCE
There are now a number of studies demonstrating that NLP models can be developed to
classify the unstructured free text contained in incident reports and EHRs according to
incident type and the severity of harm associated. Published work has explored binary
classification techniques more widely than multi-labelled classification problems. The type of
NLP that has been found to perform best has varied between datasets and classification tasks.
A wide variety of model performance metrics are reported, reflecting different priorities in
the use of the model. In general, studies have developed NLP models which can perform
classification tasks in this domain with performance outcomes that approach those of manual
human classifiers.
4.2 LIMITATIONS
In conducting this systematic review, resource limitations did not allow for the search to be
performed in duplicate with two independent reviewers. We limited the search to English
language articles due to lack of funding for translation facilities. We also limited the search to
articles published since 2005 to ensure relevance to current practice as the fields of adverse
event analysis and NLP have evolved significantly over the past decade. As figure 3 shows,
frequency of publications in this field appears to be increasing, with the majority of studies
published in the last decade. As such, this review will have captured most relevant
publications. Syntactic and ontological differences between languages may limit the
applicability of the NLP models used in this review to other languages, particularly in the text
pre-processing techniques described[37][49].
Figure 3.
Some aspects of the internal validity of these studies have been explored, such as the
difficulty in assessing the quality of the comparator classification technique, and the effects of
multiple testing due to the publication of outcomes for multiple NLP models and
classification tasks. Both of these factors could bias in favour of the NLP model performance.
Across the range of studies, 16 of the 35 were published in just two journals: Studies in
Health Technology and Informatics, and the Journal of the American Medical Informatics
Association. Our search strategy included a grey literature search to minimise the possibility
of publication bias.
At outcome level, a challenge in this review was how best to summarise the performance of
NLP classifiers. In this review a narrative synthesis rather than a quantitative approach was
chosen due to studies reporting outcomes for multiple binary classification and multiple NLP
models, a lack of assurance of data homogeneity, and a lack of a uniform outcome
performance metric.
4.3 THE LACK OF DATA HOMOGENEITY
Although the majority of studies used free text from proprietary incident reporting systems,
this does not mean we can assume data homogeneity between these studies. It is recognised
that the performance of NLP classifiers is very data dependent[61]. This is one explanation
for why a range of NLP models were found to perform best across studies in this review.
Further it makes it difficult to infer which NLP model would demonstrate best classifier
performance on a future data set.
4.4 THE USE OF MULTIPLE PERFORMANCE METRICS
When reporting model performance, a metric should be chosen that best represents the
association between model classification and "true" classification[66]. There is however
acceptance that this relationship is complex and multifactorial. As such, there is an argument
for reporting all performance metrics such that the most information possible is available for
those who might wish to develop the model further or for a different use case. Fong et al. and
Ong et al. are good examples, presenting 6 and 5 performance metrics respectively[47][29].
It is known that efforts to improve one NLP model performance measure can detrimentally
affect another[66]. For example, increasing model sensitivity can decrease model precision. If
the model is adjusted to make it more likely to predict a positive occurrence, there will be an
increase in the number of positive occurrences that are recorded as positive, but also an
increase in the number of negative occurrences incorrectly recorded as positive.
Because of this, in model development one performance metric may have to be prioritised
above another. Appropriate prioritisation depends on the intended use of the model[66].
In the domain of using NLP for classification of incidents and adverse events, it is important
the model does not miss an important event, thus high sensitivity is important[66]. This could
result in a decreased specificity due to an increased false positive rate. In this case, it is likely
an important event would require some supplemental human confirmation, which could
manage the additional false positives.
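The sensitivity/precision trade-off described above can be made concrete by scoring the same predictions at two decision thresholds. This is a hedged sketch with invented scores and our own function names, not data from any reviewed study:

```python
def confusion_counts(y_true, scores, threshold):
    """Count true positives, false positives, and false negatives at a threshold."""
    tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(y_true, scores) if s < threshold and y == 1)
    return tp, fp, fn

def sensitivity_and_precision(y_true, scores, threshold):
    """Sensitivity = TP/(TP+FN); precision = TP/(TP+FP)."""
    tp, fp, fn = confusion_counts(y_true, scores, threshold)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return sensitivity, precision
```

Lowering the threshold makes the model more likely to predict a positive: sensitivity rises because more true events clear the bar, while precision can fall because more negatives are incorrectly flagged.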
4.4.1 The impact of multiple testing
As most studies report performance outcomes for multiple NLP models, they are framed as
methodological exploratory studies as much as experimental studies. The problem with
interpreting the use of multiple models might be considered similarly to interpreting results
from multiple testing[67].
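One standard way to control for multiple testing of this kind is the Benjamini-Hochberg false discovery rate procedure cited above[67]. A minimal sketch of the step-up procedure (our own implementation; none of the reviewed studies is known to have applied it):

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up FDR procedure.

    Returns the indices of the hypotheses that are rejected: find the largest
    rank k with p_(k) <= (k/m) * alpha, then reject all hypotheses of rank <= k.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            k_max = rank
    return sorted(order[:k_max])
```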
4.5 COMPARING PERFORMANCE AGAINST MANUAL ANNOTATION
Model performance is typically reported as compared to a manually annotated corpus. This
presumes manual annotation to be a gold standard. Thus, the validity of the accuracy
measurements depends on the validity of the manual annotation. The use of multiple
annotators and calculation of inter-rater agreement can improve the validity of manual
annotation, but it has limitations. For example, inter-rater agreement would be unaffected if
both raters missed a classification. Similarly, in most cases there is no way to be certain what
proportion of outcomes are unclassified by both automated and manual systems[66].
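Inter-rater agreement of the kind discussed here is commonly quantified with Cohen's kappa, which corrects raw agreement for the agreement expected by chance. A minimal sketch (our own implementation; the degenerate case where both raters use a single label throughout is not handled):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: (observed - expected) / (1 - expected) agreement."""
    n = len(rater_a)
    observed = sum(1 for a, b in zip(rater_a, rater_b) if a == b) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    # chance agreement from each rater's marginal label frequencies
    expected = sum(counts_a[l] * counts_b[l] for l in labels) / (n * n)
    return (observed - expected) / (1 - expected)
```

Note that kappa is blind to the failure mode described above: if both raters miss the same classification, observed agreement, and therefore kappa, is unaffected.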
5.0 FUTURE RESEARCH
The majority of studies focus on binary classification, e.g. “drug error or no drug error”, “fall
or no fall”[54][38]. It is recognised that incidents which lead to harm are often the result of
multiple interacting factors[2]. Moving forwards, NLP interrogation of incident reports
should look to achieve high performing multi-class models[7].
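One common route from the binary classifiers that dominate the literature to the multi-label setting is a one-vs-rest scheme: train one binary scorer per incident class and attach every label whose score clears a threshold. A toy sketch (the scorers and labels are invented for illustration):

```python
def one_vs_rest(text, scorers, threshold=0.5):
    """Multi-label prediction via one-vs-rest.

    `scorers` maps each incident class to a callable returning a score in
    [0, 1]; a report receives every label whose score clears the threshold.
    """
    return sorted(label for label, score_fn in scorers.items()
                  if score_fn(text) >= threshold)

# Hypothetical per-class scorers standing in for trained binary classifiers.
example_scorers = {
    "fall": lambda t: 1.0 if "fell" in t else 0.0,
    "medication": lambda t: 1.0 if "drug" in t else 0.0,
}
```

Unlike a single binary model, this allows one report to carry several labels, which better reflects incidents arising from multiple interacting factors.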
Understanding why incidents occur may be more important for effecting change than
understanding what incidents have occurred. Further studies exploring the ability of NLP to
classify incident reports by contributory factors could offer more learning opportunities from
adverse events.
Clinical free text represents a massive data set which has been largely underutilised because
of its size, unstructured nature, and until recently, inability to be electronically
searched[68,69]. A wealth of new knowledge may be generated if computational techniques
such as NLP can make these data suitable for analysis.
6.0 CONCLUSIONS
This systematic review presents evidence that NLP can generate meaningful information
from unstructured data in the specific domain of the classification of incident reports and
adverse events. Understanding what incidents are occurring or why they are occurring is
important in adverse event analysis. NLP has the potential to allow such classification tasks
to be performed at scale, for example between hospitals within a geographic region, between
regions, or across an entire healthcare system. This has the potential to improve learning from
adverse events in healthcare, which may ultimately reduce the risk to future patients.
One of the roles of data science in healthcare is to reduce the human burden of data
acquisition and analysis. The hope, in doing so, is to give healthcare professionals the time to
think creatively and effect change[70]. In a broader context, understanding how to interrogate
this unstructured data offers opportunities in a range of healthcare settings.
CONTRIBUTORSHIP STATEMENT
Ian Young, Saturnino Luz, and Nazir Lone all qualify for authorship according to the
International Committee of Medical Journal Editors (ICMJE). Each shares responsibility for
the conception and design of this study, interpretation of this review, and drafting and critical
revision of the manuscript. Ian Young is the corresponding author and is further responsible
for the acquisition of the data used in this review.
STATEMENT ON CONFLICT OF INTEREST
The authors have no conflicts of interest to declare.
FUNDING
This work was supported by the department of Anaesthesia, Critical Care, and Pain Medicine
at the Royal Infirmary of Edinburgh, via monies from the Trustees of the Edinburgh
Anaesthesia Festival.
SUMMARY TABLE
What was already known
Analysis of incident reports and adverse events in healthcare is considered an important part of quality improvement and patient safety.
NLP can provide structured information from unstructured free text, including performing classification tasks.
What this study adds
Within the domain of incident reporting and adverse event analysis, NLP has been shown to perform favourably compared to manual annotation in a wide range of classification tasks.
Studies in this domain have focussed on binary classification of incident types. Exploring multi-class problems and contributory factor analysis of incident reports could have clinical utility.
No single NLP technique shows superiority in this domain and training multiple models may be required to optimise classifier performance.
Word Count: 3492
REFERENCES
[1] R. Lawton, R.R.C. McEachan, S.J. Giles et al. Development of an evidence-based
framework of factors contributing to patient safety incidents in hospital settings: a
[65] EQUATOR Network, TRIPOD Checklist: Prediction Model Development and Validation,
(2016).
[66] J. Chubak, G. Pocobelli, N.S. Weiss, Tradeoffs between accuracy measures for
electronic health care data algorithms, J. Clin. Epidemiol. 65 (2012) 343–349.
[67] Y. Benjamini, Y. Hochberg, Controlling the False Discovery Rate: A Practical and
Powerful Approach to Multiple Testing, J. R. Stat. Soc. Ser. B. 57 (1995) 289–300.
[68] T. Murdoch, A. Detsky, The Inevitable Application of Big Data to Health Care, JAMA
309 (2013) 1351–1352.
[69] N.R. Adam, R. Wieder, D. Ghosh, Data science, learning, and applications to
biomedical and health sciences, Ann. N. Y. Acad. Sci. 1387 (2017) 5–11.
[70] B. Young, Getting the measure of diabetes: the evolution of the National Diabetes
Audit, Practical Diabetes 35 (2018) 1–7.
APPENDIX 1
Full electronic search strategy
Limits placed on all searches were English language articles, in humans only, published between 2005 and 2018. When this review was revised, any article with a publication date after May 8th, 2018 was excluded, as this was the last date on which the search was run for the original review.
Ovid MEDLINE® ALL
1 (natural language processing or NLP or text mining or machine learning or artificial intelligence or information technology or classifier or document classification or semantic similarity or ontology).mp ⇒ 57960
2 (Natural Language Processing/ or Data Mining/ or Artificial Intelligence/ or Machine Learning/) ⇒ 36778
3 (event report or adverse event or medication event or incident report or medication error or medical error or error report or patient safety event).mp ⇒ 23367
4 (Adverse Drug Reaction Reporting System/ or Medical Errors/ or Patient Safety/) ⇒ 38438
5 (*Algorithms/ and 4) ⇒ 144
6 (1 or 2) ⇒ 63120
7 (3 or 4) ⇒ 58720
8 (6 and 7) or 5 ⇒ 924
MIDIRS: Maternity and Infant Care / Embase
1 (natural language processing or NLP or text mining or text classification or machine learning).af. ⇒ 29540 (MIDIRS 29; Embase 29511)
2 (event reports or adverse events or medication events or incident report* or patient safety).af. ⇒ 152722 (MIDIRS 150897; Embase 1825)
1 and 2 ⇒ 301 (MIDIRS 0; Embase 301)
1 and 2, deduplicated ⇒ 290 (MIDIRS 0; Embase 290)
Title and Abstract screen: 277 removed due to "wrong topic" or "irrelevant".
CINAHL
Initial search identical to Medline - 0 results. Then using CINAHL suggested search terms:
1 Natural Language Processing and Adverse Health Care Event - 1
Title and Abstract screen: 1 removed for “wrong topic”
Cochrane Library
MeSH "Natural Language Processing" Subheading "Classification" - 11
Title and Abstract screen: 11 excluded for "wrong topic" or "irrelevant".
SciELO
1 Event report* or adverse events or incident reporting AND Natural language processing or NLP - 3
Title and Abstract screen: 3 excluded for “not English language” or “irrelevant”.
ISI Web of Science
1 Event report* or adverse events or incident reporting AND Natural language processing or NLP
Filtered by Research Domain of "Science Technology" – 209
Title and Abstract screen: 198 excluded for "duplicates", "wrong topic" or "irrelevant", leaving 11.
Google Scholar (scholar.google.com)
1 "natural language processing" and "medical" and "incident reports" and "classification" - 228
Title and Abstract screen: 226 excluded for “duplicates”, "wrong topic" or "irrelevant”.