Original Paper

Identifying Acute Low Back Pain Episodes in Primary Care Practice from Clinical Notes

Riccardo Miotto 1,2,3, PhD; Bethany L. Percha 2,3, PhD; Benjamin S. Glicksberg 1,2,3, PhD; Hao-Chih Lee 2,3, PhD; Lisanne Cruz 4, MD, MSc, FAAPMR; Joel T. Dudley 1,2,3, PhD; and Ismail Nabeel 5, MD, MPH

(1) Hasso Plattner Institute for Digital Health at Mount Sinai, Icahn School of Medicine at Mount Sinai, New York, USA
(2) Institute for Next Generation Healthcare, Icahn School of Medicine at Mount Sinai, New York, USA
(3) Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, USA
(4) Department of Physical Medicine and Rehabilitation, Icahn School of Medicine at Mount Sinai, New York, USA
(5) Department of Environmental Medicine and Public Health, Icahn School of Medicine at Mount Sinai, New York, USA

Abstract

Background: Acute and chronic low back pain (LBP) are different conditions with different treatments. However, they are coded in electronic health records with the same ICD-10 code (M54.5) and can be differentiated only by retrospective chart reviews. This prevents efficient definition of data-driven guidelines for billing and therapy recommendations, such as return-to-work options.

Objective: To solve this issue, we evaluate the feasibility of automatically distinguishing acute LBP episodes by analyzing free-text clinical notes.

Methods: We used a dataset of 17,409 clinical notes from different primary care practices; of these, 891 documents were manually annotated as "acute LBP" and 2,973 were generally associated with LBP via the recorded ICD-10 code. We compared different supervised and unsupervised strategies for automated identification: keyword search; topic modeling; logistic regression with bag-of-n-grams and manual features; and deep learning (ConvNet). We trained the supervised models using either manual annotations or ICD-10 codes as positive labels.
Results: ConvNet trained using manual annotations obtained the best results with an AUC-ROC of 0.97 and F-score of 0.69. ConvNet's results were also robust to reduction of the number of manually annotated documents. In the absence of manual annotations, topic models performed better than methods trained using ICD-10 codes, which were unsatisfactory for identifying LBP acuity.

Conclusions: This study uses clinical notes to delineate a potential path toward systematic learning of therapeutic strategies, billing guidelines, and management options for acute LBP at the point of care.

medRxiv preprint, doi: https://doi.org/10.1101/19010462; this version posted November 11, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. All rights reserved. No reuse allowed without permission.
Keywords: electronic health records; clinical notes; low back pain; natural language processing; machine learning.
Introduction

Low back pain (LBP) is one of the most common causes of disability in US adults under the age of 45 [1], with 10-20% of American workers reporting persistent back pain [2]. LBP impacts one's ability to work and affects the quality of life. For example, in 2015 Luckhaupt et al. showed that, from a pool of 19,441 people, 16.9% of workers with any LBP and 19.0% of those with frequent and severe LBP missed at least one full day of work over a period of three months [3]. LBP events also lead to significant financial burden for both individuals and clinical facilities, with combined direct and indirect costs of treatment for musculoskeletal injuries and associated pain estimated to be approximately $213 billion annually [4].

LBP events fall into two major categories: acute and chronic [5]. Acute LBP occurs suddenly, usually associated with trauma or injury with subsequent pain, whereas chronic LBP is often reported by patients in regular checkups and has led to a significant increase in the use of healthcare services over the past two decades. It is very important to differentiate between acute and chronic LBP in the clinical setting, as these conditions, as well as their management and billing, are substantively different. Chronic back pain is generally treated with spinal injections [6,7], surgery [8,9], and/or pain medications [10,11], while anti-inflammatories and a rapid return to normal activities of daily living are generally the best recommendations for acute LBP [12].

However, acute and chronic LBP are usually not explicitly separated in electronic health records (EHRs) due to a lack of distinguishing codes. The ICD-10-CM (International Classification of Diseases, Tenth Revision, Clinical Modification) standard only includes the code M54.5 to characterize a "Low back pain" diagnosis and does not provide modifiers to distinguish different LBP acuities [13]. Acuity is usually reported in clinical notes, requiring retrospective chart review of the free text to characterize LBP events, which is time-consuming and not scalable [14]. Moreover, acuity can be expressed in different ways. For example, the text could mention "acute low back pain" or "acute lbp", but could also simply report "shooting pain down into the lower extremities", "limited spine range of motion", "vertebral tenderness", "diffuse pain in lumbar muscles", and so on [15]. This variability makes it difficult for clinical facilities and researchers to group LBP episodes by acuity to perform key tasks, such as defining appropriate diagnostic and billing codes; evaluating the effectiveness of prescribed treatments; and deriving therapeutic guidelines and improved diagnostic methods that could reduce time, disability, and cost.

This paper is the first to explore the use of automated approaches based on machine learning and information retrieval to analyze free-text clinical notes and identify the acuity of LBP episodes. Specifically, we use a set of manually annotated notes to train and evaluate various
machine learning architectures based on logistic regression, n-grams, topic models, word embeddings, and convolutional neural networks, and to demonstrate that some of these models are able to identify acute LBP episodes with promising precision. In addition, we demonstrate the ineffectiveness of using ICD-10 codes alone to train the models, reinforcing the idea that they are not sufficient to differentiate the acuity of LBP. Our overall objective is to develop an automated framework that can help frontline primary care providers in the development of targeted strategies and return-to-work (RTW) options for acute LBP episodes in clinical practice.
Background and Significance

Primary care providers (PCPs) are commonly the first medical practitioners to assess patients' musculoskeletal injuries and the pain associated with these injuries, and are therefore in a unique position to offer reassurance, treatment options, and RTW recommendations catered to the acuity of the injury and its associated pain. Several studies have documented increases in medication prescriptions and visits to physicians, physical therapists, and chiropractors for LBP episodes [16-18]. Since individuals with chronic LBP seek care and use healthcare services more frequently than those with acute LBP, increases in healthcare use and costs for back pain are driven more by chronic than acute cases [19]. A rapid return to normal activities of daily living, including work, is generally the best activity recommendation for acute LBP management [12]. The number of workdays that are lost due to acute LBP can be reduced by implementing clinical practice guidelines in the primary care setting [20].

In previous work, Cruz et al. built a RTW protocol tool for PCPs based on guidelines from the LBP literature [21]. Based on the type of work (e.g., clerical, manual, or heavy) and the severity of the condition, the doctor would recommend RTW options (in partial or full duty capacity) within a certain number of days. The study found that physicians were likely to use this protocol, especially when it was integrated into the EHRs. The protocol was not always used for patients suffering from acute LBP, however, as the research team was unable to quickly identify the acuity using only the structured EHR data (e.g., ICD-10 codes). Acuity information was only available in the progress notes and was thus not incorporated into the automated recommendations. This prevented the research team from providing accurate feedback to PCPs based on a full picture of the patient's condition. A similar tool that could incorporate acuity information from notes could provide much more specific recommendations to PCPs that incorporate best-practice guidelines for each acuity level. Besides leading to more precise care, this would streamline billing for LBP [22]. Similar needs arise for other musculoskeletal conditions, such as knee, elbow, and shoulder pain, where ICD-10 codes do not differentiate by pain level and acuity [23,24].

Machine learning methods for EHR data processing are enabling improved understanding of patient clinical trajectories, creating opportunities to derive new clinical insights [25,26]. In recent years, the application of deep learning, a hierarchical computational design based on
layers of neural networks [27], to structured EHRs has led to promising results on clinical tasks like disease phenotyping and prediction [28-33]. However, a wealth of relevant clinical information remains locked behind clinical narratives in the free text of notes. Natural Language Processing (NLP), a branch of computer science that enables machines to understand and process human language [34] for applications like machine translation [35], text generation [36], and image captioning [37], has been used to parse clinical notes to extract relevant insights that can guide clinical decisions [38]. Recent applications of deep learning to clinical NLP have classified clinical notes according to diagnosis or disease codes [39-41], predicted disease onset [32,42], and extracted primary cancer sites and their laterality in pathology reports [43,44]. However, while deep learning has successfully been applied to analyze clinical notes, traditional methods are still preferable when training data are limited [45,46].

Regardless of the specific methodology, tools based on NLP applied to clinical narratives have not been widely used in clinical settings [31,38], despite the fact that physicians are likely to follow computer-assisted guidelines if recommendations are tied to their own observations [47]. In this paper, we present an NLP-based framework that can help physicians adhere to best practices and RTW recommendations for LBP. To the best of our knowledge, there are no studies to date that have applied machine learning to clinical notes to distinguish the acuity of a musculoskeletal condition in cases where it is not explicitly coded.
Methods

The conceptual steps of this study are summarized in Figure 1: dataset composition; text processing; clinical note modeling; and experimental evaluation. The overall goal was to evaluate the feasibility of automatically identifying clinical notes reporting "acute LBP" episodes.
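As a concrete (and deliberately simplified) illustration of these steps, the sketch below wires together one of the feature-plus-classifier routes compared later in this section: tf-idf n-gram features feeding an L1-regularized logistic regression, using scikit-learn. The data, names, and hyperparameters are toy stand-ins of our own invention, not the study's actual code or dataset.

```python
# Illustrative sketch (not the authors' implementation): a minimal version of
# the conceptual pipeline -- featurize notes, train a classifier, score new
# notes -- using scikit-learn. All data below are invented toy examples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy stand-ins for annotated clinical notes (1 = "acute LBP", 0 = other).
notes = [
    "patient reports acute low back pain after lifting",
    "acute lbp with limited spine range of motion",
    "acute back pain, shooting pain down into the lower extremities",
    "acute low back pain, prescribed muscle relaxant, rtw full duty",
    "follow-up visit for chronic knee osteoarthritis",
    "routine checkup, no complaints, vitals stable",
    "chronic low back pain managed with spinal injections",
    "shoulder pain after fall, imaging ordered",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Bag of n-grams (n = 1..5) with tf-idf weights, classified with an
# L1-penalized ("Lasso") logistic regression.
model = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 5), stop_words="english")),
    ("lr", LogisticRegression(penalty="l1", solver="liblinear", C=10.0)),
])
model.fit(notes, labels)

# Score two unseen notes: one acute-LBP-like, one unrelated.
p_acute = model.predict_proba(["acute low back pain, muscle relaxant given"])[0, 1]
p_control = model.predict_proba(["routine visit, no complaints reported"])[0, 1]
```

On real data each step would be far richer (the paper's text processing, annotation handling, and cross-validation are described below), but the skeleton is the same: notes in, per-note probability of "acute LBP" out.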
Figure 1: Conceptual framework used to evaluate the use of automated approaches based on machine learning and information retrieval to analyze free-text clinical notes and identify "acute low back pain" episodes.

Dataset

We used a set of free-text clinical notes extracted from the Mount Sinai data warehouse, made available for use under IRB approval following HIPAA guidelines. The Mount Sinai Health System is an urban tertiary care hospital located on the Upper East Side of Manhattan in New York City. It generates a high volume of structured, semi-structured, and unstructured data as part of its routine healthcare and clinical operations, which include inpatient, outpatient, and emergency room visits. These clinical notes were collected during a previous pilot study evaluating a RTW tool based on EHR data that included nearly 40,000 encounters for 15,715 patients spanning the years 2016-2018 and clinical notes written by
81 different providers (Cruz et al. [21]). In that study, we used the published literature to develop a list of guidelines to determine the assessment and management of acute LBP episodes in clinical practice. In particular, we used ICD-10 codes as well as other parameters, such as "presenting complaint", "pre-existing conditions", "management factors", "imaging/radiology/test ordered", and so on, to define and label the acuity of LBP in a clinical encounter. Following these guidelines, 14 individuals (physical medicine and rehabilitation fellows, residents, and medical students) manually reviewed a random set of 4,291 clinical notes associated with these encounters and labeled all "acute low back pain" events. Each note was reviewed by at least two individuals and was further checked by a lead physician researcher if it was marked as ambiguous and/or there was discordance between reviewers.

This project leveraged the entire set of clinical notes that were collected in the previous study. In particular, we joined all the progress notes of these encounters under the same initial visit, and we eliminated duplicate, short (less than 3 words), and non-meaningful reports. The final dataset was composed of 17,409 distinct clinical notes, with length ranging from seven to 6,638 words. Of this set, 3,092 notes were manually reviewed in the previous study and 891 of them were annotated as "acute LBP". The remaining 14,317 notes were not manually evaluated and were related to different clinical domains, including various musculoskeletal disorders and, potentially, LBP events. In this final dataset, 1,973 notes were also associated with an encounter billed with an ICD-10 M54.5 "Low back pain" code.

Text Processing

Every note in the dataset was tokenized, divided into sentences, and checked to remove punctuation, numbers, and non-relevant concepts such as URLs, emails, dates, etc. Each note was then represented as a list of sentences, with every sentence being a list of lemmatized words represented as one-hot encodings. The vocabulary was composed of all the words appearing at least five times in the training set. The discarded words were corrected to the terms in the vocabulary having the minimum edit distance, i.e., the minimum number of operations required to transform one string into the other [48]. This step reduced the number of misspelled words and prevented the accidental discarding of relevant information; at the same time, it also limited the size of the vocabulary to improve scalability [39]. Overall, the vocabulary covering the whole dataset comprised 56,142 unique words.

Clinical Note Modeling

We evaluated different approaches for identifying clinical notes that refer to acute LBP episodes. These included both supervised and unsupervised methods. While we benefited from the use of high-quality manual annotations to train the supervised models, we also investigated alternatives that did not require manual annotation of notes. All of these
methods provided straightforward explanations of their predictions, enabling us to validate each model and to identify parts of text and patterns that are relevant to the "acute LBP" predictions.

Keyword Search

We searched for a set of relevant keywords in the text. In particular, we looked for "acute low back pain", "acute lbp", "acute low bp", and "acute back pain", and we counted their occurrences in the text. We used NegEx [49] to remove negated occurrences of the keywords. In the evaluation, we refer to this model as "WordSearch".

Topic Modeling

We used topic modeling on the full set of words contained in the notes to capture abstract topics referred to in the dataset [50]. Topic modeling is an unsupervised inference process, in this case implemented using latent Dirichlet allocation [51], that captures patterns of word co-occurrence within documents to define interpretable topics (i.e., multinomial distributions over words) and represent a document as a multinomial over these topics. Every document can then be classified as talking about one or (usually) more topics. Topic modeling is often used in healthcare to generalize clinical notes, improve the automatic processing of patient data, and explore clinical datasets [52-55]. In this study, we assumed that one or more of these topics might refer to acute LBP. In order to discover them, we identified the most likely topics for a set of keywords (i.e., "acute", "low", "back", "pain", "lbp", "bp") and we manually reviewed them to retain only those that seemed more likely to characterize acute LBP episodes (i.e., that included most of the keywords with high probability). We then considered the maximum likelihood among these topics as the probability that a report referred to acute LBP (i.e., "TopicModel" in the experiments).

Bag of N-grams

Each clinical note was represented as a bag of n-grams (with n = 1, ..., 5), with Term Frequency-Inverse Document Frequency (tf-idf) weights (determined from the corpus of documents). Each n-gram is a contiguous sequence of n words from the text. We considered all the words in the vocabulary and filtered the common stop words based on the English dictionary before building all the n-grams. The classification was implemented using Logistic Regression with Lasso (i.e., "BoN-LR").

Feature Engineering

We used the protocol built by Cruz et al. [21] to define acute LBP episodes in the clinical notes. In particular, we used all the concepts described in that guideline, pre-processed them with the same algorithm used for the clinical notes, and built a set of 5,154 distinct n-grams (with n = 1, ..., 5), that we refer to as "FeatEng". We then represented each clinical note as a
bag of FeatEng (i.e., we counted the occurrences of only these n-grams in the text), normalized with tf-idf weights, and classified them using Logistic Regression with Lasso (i.e., "FeatEng-LR").

Deep Learning

We implemented an end-to-end deep neural network architecture (i.e., "ConvNet") that takes as input the full note and outputs its probability of being related to "acute LBP". The first layer of the architecture maps the words to dense vector representations (i.e., "embeddings"), which attempt to contextualize the semantic meaning of each word by creating a metric space where the vectors of semantically similar words are close to each other. We applied word2vec with the skip-gram algorithm to the parsed notes [56] to initialize the embedding of each word in the vocabulary. Word2vec is commonly used with EHRs to learn embeddings of medical concepts from structured data as well as from clinical notes [46,57-59].

The embeddings were then fed to a Convolutional Neural Network (CNN) inspired by the models described by Kim [60] and by Liu et al. [42]. This architecture concatenates representations of the text at different levels of abstraction, essentially choosing the most relevant n-grams at each level. Here, we first applied a set of parallel 1D convolutions on the input sequence with kernel sizes ranging from 1 to 5, thus simulating n-grams with n = 1, ..., 5. The outputs of each of these convolutions were then max-pooled over the whole sequence and concatenated into a 5 x d dimensional vector, where d is the number of 1D convolutional filters. This representation was then fed to a sequence of fully connected layers, which learn the interactions between the text features, and finally to a sigmoid layer that outputs the prediction probability.

The n-grams that are most relevant to the prediction, in this architecture, are those that activate the neurons in the max-pooling layer. Therefore, we used the log-odds that an n-gram contributes to the sigmoid decision function [42] as an indication of how much each n-gram influences the decision.

Evaluation Design

We evaluated all the architectures using a 10-fold cross-validation experiment, with every note appearing in the test set only once. In each training set, we used a random 90/10 split to train and validate all the model configurations. As a baseline, we also report the results obtained by considering as "acute LBP" all the notes associated with the "Low back pain" M54.5 ICD-10 code (i.e., "ICD-10" in the results).
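The shape of the ConvNet forward pass described in the Deep Learning subsection (parallel 1D convolutions with kernel sizes 1-5, max-over-time pooling, concatenation, fully connected layers, sigmoid output) can be sketched with random weights in NumPy. This is a dimensional illustration of our own, not the authors' implementation; all sizes except the embedding dimension (300), the filter count (200), and the layer width (600) reported below are arbitrary.

```python
# Illustrative NumPy sketch of the ConvNet forward pass (random weights,
# one note); shows the tensor shapes, not the trained model.
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_emb, n_filters = 50, 300, 200   # note length, embedding size, filters per conv

# Stand-in for the word2vec-initialized embeddings of one note's words.
x = rng.normal(size=(seq_len, d_emb))

def conv_maxpool(x, kernel_size, n_filters, rng):
    """Valid 1D convolution over the word sequence, ReLU, then max-over-time pooling."""
    w = 0.01 * rng.normal(size=(kernel_size * x.shape[1], n_filters))
    windows = np.stack([x[i:i + kernel_size].ravel()
                        for i in range(x.shape[0] - kernel_size + 1)])
    return np.maximum(windows @ w, 0.0).max(axis=0)   # one value per filter

# Parallel convolutions with kernel sizes 1..5 simulate n-grams with n = 1..5;
# their pooled outputs are concatenated into a 5 x d vector (d = n_filters).
features = np.concatenate([conv_maxpool(x, k, n_filters, rng) for k in range(1, 6)])

# Two fully connected ReLU layers of size 600, then a single sigmoid output
# neuron giving the predicted probability of "acute LBP".
h = np.maximum(features @ (0.01 * rng.normal(size=(5 * n_filters, 600))), 0.0)
h = np.maximum(h @ (0.01 * rng.normal(size=(600, 600))), 0.0)
p = 1.0 / (1.0 + np.exp(-h @ (0.01 * rng.normal(size=600))))
```

Because the pooling is over the whole sequence, notes of any length collapse to the same 5 x d feature vector, which is what lets the network handle full-length notes ranging from seven to thousands of words.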
Training Annotations

We considered two different sets of annotations as gold standards to train the supervised models. In the first experiment, we used the manually curated annotations provided with the dataset from previous work [21], whereas in the second experiment we trained the models using the ICD-10 codes associated with each note's encounter. Both experiments were evaluated using the manual annotations. The rationale was to compare the feasibility of identifying acute LBP events when manual annotations are and are not available. We trained the classifiers to output "acute LBP" vs. "other" because the goal of the project was to identify clinical notes with acute LBP events, rather than to discriminate different facets of LBP events (e.g., "chronic LBP" vs. "acute LBP").

Metrics

For all experiments, we report the area under the receiver operating characteristic curve (AUC-ROC), micro-precision, recall, F-score, and the area under the precision-recall curve (AUC-PRC) [61]. The ROC curve is a plot of the true positive rate versus the false positive rate over the set of predictions. F-score is the harmonic mean of classification precision and recall per annotation, where precision is the number of correct positive results divided by the number of all positive results, and recall is the number of correct positive results divided by the number of positive results that should have been returned. The PRC is a plot of precision against recall at different thresholds. The areas under the ROC and PRC curves are computed by integrating the corresponding curves.

Model Hyperparameters

The model hyperparameters were empirically tuned using the validation sets to optimize the results with both sets of training annotations. In the topic modeling method, we inferred topics using the whole training set of documents and 200 topics (derived using perplexity analysis). While seemingly more intuitive, using only the notes associated with the M54.5 "Low back pain" ICD-10 code actually produced worse results. For each fold, the most relevant topics associated with acute LBP were manually reviewed and used to annotate the notes.

In the deep learning architecture, we used embeddings of size 300 and full-length notes. We trained word2vec on the clinical note dataset alone to initialize the embeddings; pre-initializing the embeddings with a general-purpose corpus did not lead to any improvement. Each CNN had 200 filters and used a ReLU activation function. We added two fully connected layers of size 600 following the CNNs, with ReLU activations and batch normalization. Dropout values across the layers were all set to 0.5. The architecture was trained using cross-entropy loss with the Adam optimizer for five epochs and batch size 32 (learning rate = 0.001). The classification thresholds for precision, recall, and F-score were found by ranging
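A common way to select such operating thresholds, shown here as an illustrative sketch of our own (not the authors' code), is to sweep candidate thresholds over the score range on validation data and keep the one that maximizes F-score:

```python
# Illustrative threshold selection by sweeping (not the authors' code).
def precision_recall_f1(y_true, scores, threshold):
    """Precision, recall, and F-score when predicting positive for scores >= threshold."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def best_threshold(y_true, scores, steps=100):
    """Sweep thresholds over [0, 1] and keep the one maximizing F-score."""
    candidates = [i / steps for i in range(steps + 1)]
    return max(candidates, key=lambda t: precision_recall_f1(y_true, scores, t)[2])

# Toy validation scores: positives tend to score higher than negatives.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
t = best_threshold(y_true, scores)
p, r, f1 = precision_recall_f1(y_true, scores, t)
```

AUC-ROC and AUC-PRC, by contrast, are threshold-free: they integrate over all such operating points.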
Results

Table 1 and Figure 2 show the average results of the 10-fold cross-validation experiment for all the models considered. The best results were obtained by ConvNet when trained with the manual annotations. While this is not entirely surprising given the success of deep learning for NLP when high-quality annotations and a large amount of data (i.e., on the order of millions of training examples) are available, it was not certain in this domain, where the training dataset was much smaller. As expected, the results obtained by the baseline and by training the models using the ICD-10 codes were not as good, confirming that the M54.5 ICD-10 code is not a sufficient indicator of acute LBP. TopicModel leads to similar performance but provides a more intuitive and potentially effective way of exploring the collection, extracting meaningful patterns that are related to acute LBP episodes (see Figure 3). While this approach might not be robust enough for clinical application, a refined and manually curated version of TopicModel promises to allow efficient pre-filtering of clinical reports that can speed up the manual work required to annotate them. On the contrary, but as expected, WordSearch performed poorly, as the condition is mentioned in too many different ways across the text and simple keywords were not sufficient.
Table 1: Classification results in identifying clinical notes with "acute LBP" episodes in terms of Precision, Recall, F-score, and Area Under the ROC (AUC-ROC) and Precision-Recall (AUC-PRC) curves. Results are averaged over the 10-fold cross-validation experiment. We compared different supervised and unsupervised strategies: keyword search ("WordSearch"); topic modeling ("TopicModel"); logistic regression with bag-of-n-grams ("BoN-LR") and manual features ("FeatEng-LR"); and deep learning ("ConvNet"). The supervised models (i.e., BoN-LR, FeatEng-LR, and ConvNet) were trained using manual annotations or M54.5 ICD-10 codes. The "ICD-10" baseline simply considered as "acute LBP" all the notes associated with the generic M54.5 "Low back pain" ICD-10 code.
                                      Precision  Recall  F-score  AUC-ROC  AUC-PRC
  Baseline
    ICD-10                               0.32     0.68    0.41     0.81     0.42
  Unsupervised Methods
    WordSearch                           0.71     0.03    0.06     0.52     0.40
    TopicModel                           0.44     0.58    0.50     0.92     0.46
  Trained with the M54.5 ICD-10 Code
    BoN-LR                               0.50     0.70    0.59     0.83     0.42
    FeatEng-LR                           0.47     0.59    0.52     0.88     0.41
    ConvNet                              0.55     0.68    0.61     0.89     0.46
  Trained with Manual Annotations
    BoN-LR                               0.53     0.64    0.58     0.93     0.56
    FeatEng-LR                           0.58     0.66    0.62     0.93     0.58
    ConvNet                              0.65     0.73    0.70     0.98     0.72
Figure 2: ROC and Precision-Recall curves obtained when training BoN-LR, FeatEng-LR, and ConvNet with the manual annotations (a) and with the M54.5 ICD-10 codes (b). ConvNet trained using the manual annotations obtained the best results. In the absence of manual annotations to use for training, TopicModel worked better than the methods trained using ICD-10 codes, which proved not to be a good indicator for identifying acuity in LBP episodes.
Figure 3: Representative "acute LBP"-related topic, derived by averaging the word likelihoods over all the relevant topics (i.e., those that were manually verified) inferred across the 10-fold cross-validation experiment. We report the top 30 words, with the largest words being the most relevant. As can be seen, most of the words are indeed related to acute LBP, including several medications that are usually prescribed to treat inflammation and pain (e.g., Cyclobenzaprine, Flexeril, Advil). A manually refined version of TopicModel can help pre-filter the notes in an intuitive, semi-automatic way, promising to speed up the manual annotation process.

Figure 4 reports the classification results in terms of AUC-ROC and AUC-PRC when randomly subsampling the "acute LBP" manual annotations in the training set. We found that ConvNet always outperforms the other methods based on LR as well as TopicModel. In addition, we notice that using just 30% of the manual annotations (i.e., 270 clinical notes) already leads to better results than using ICD-10 codes as training data. This is a particularly interesting insight, as it shows that only minimal manual work is required in order to achieve good classifications; these can then be further improved by adding automatically annotated notes to the model (after manual verification) and retraining.
Figure 4: Area under the ROC (AUC-ROC) and Precision-Recall (AUC-PRC) curves obtained when training the supervised models using random sub-samples of the manual annotations. TopicModel is reported as a reference baseline. ConvNet obtained satisfactory results even when trained using fewer manually annotated documents, showing robustness and scalability with respect to the gold standard.
Figure 5 highlights the distributions of the classification scores (predicted probability of the label "acute LBP") derived by several supervised models (trained with manual annotations) and TopicModel. ConvNet shows clear separation between acute LBP notes and the rest of the dataset. In particular, all acute LBP notes had scores greater than 0.2, with 82% of them (i.e., 727 notes) having scores greater than 0.5. On the contrary, only 347 controls had scores greater than 0.5, meaning that only a few notes were highly likely to be misclassified. Similarly, TopicModel had no controls with scores greater than 0.7, and all "acute LBP" notes had scores greater than 0.2.
Figure 5: Representation of the probability distribution of the scores obtained by BoN-LR, FeatEng-LR, ConvNet (trained with manual annotations), and Topic Model. ConvNet led to good separation between "acute LBP" clinical notes and all the other documents. In the other cases, such separation is not as clear, explaining the worse classification results obtained by those models.

Finally, Table 2 summarizes some of the n-grams driving the "acute LBP" predictions obtained by ConvNet (trained with manual annotations) across the experiments. While some of these are obvious and refer to the disease itself (e.g., "acute lbp"), others refer to medications (e.g., "prescribed muscle relaxant", "flexeril") and recommendations (e.g., "rtw full duty quick"). Given their clinical meaning and relevance, all these patterns can be further analyzed and reviewed to potentially drive the development of guidelines for, e.g., treatment and RTW options.
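Once such predictive patterns have been clinically reviewed, they could also seed a transparent rule-based screen alongside the learned model. An illustrative sketch; the pattern list below contains only the examples quoted in the text, not the full learned set:

```python
# Pattern list limited to the examples quoted in the text; a deployed screen
# would use the full clinically reviewed set of predictive n-grams.
REVIEWED_PATTERNS = ["acute lbp", "prescribed muscle relaxant", "flexeril"]

def pattern_screen(note, patterns=REVIEWED_PATTERNS):
    """Return the reviewed patterns found in a note (case-insensitive)."""
    text = note.lower()
    return [p for p in patterns if p in text]
```

Such a screen trades recall for full interpretability, so it is best viewed as a complement to, not a replacement for, the trained classifier.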
Table 2: Examples of n-grams that were relevant in identifying "acute LBP" notes when using ConvNet trained with manual annotations. The relevance of each n-gram was determined by analyzing the neurons of the CNNs that activated the max-pooling layers and their log-odds of contributing to the final output. Log-odds were filtered per note and then averaged over all the notes and evaluation folds.
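The aggregation described in the caption, per-note contributions first, then an average over notes and folds, can be sketched as follows. Here `per_note_logodds` is assumed to already hold, for each note, the log-odds contribution of the n-grams whose CNN filters won the max-pooling step; extracting those values from the network is omitted:

```python
from collections import defaultdict

def rank_ngrams(per_note_logodds, top_k=3):
    """Average each n-gram's log-odds contribution over the notes in which it
    fired, then rank. per_note_logodds: one dict per note mapping
    n-gram -> log-odds contribution (already filtered per note)."""
    totals, counts = defaultdict(float), defaultdict(int)
    for note in per_note_logodds:
        for ngram, logodds in note.items():
            totals[ngram] += logodds
            counts[ngram] += 1
    averaged = {ng: totals[ng] / counts[ng] for ng in totals}
    return sorted(averaged, key=averaged.get, reverse=True)[:top_k]
```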
Discussion

In this work we evaluated the use of several machine learning approaches to identify acute LBP episodes in free-text clinical notes, in order to better personalize the treatment and management of this condition in primary care. The experimental results showed that it is possible to extract acute LBP episodes with promising precision, especially when at least some manually curated annotations are available. In this scenario, ConvNet, a deep learning architecture based on CNNs, significantly outperformed other shallow techniques based on
bag-of-n-grams and logistic regression, opening the possibility of boosting performance using more complex architectures from current research in the NLP community. The implemented deep architecture also provides an easy mechanism to explain the predictions, leading to informed decision support based on model transparency [62,63] and the identification of meaningful patterns that can drive clinical decision making. If no annotations are available, experiments showed that the use of topic modeling is preferable to training a classifier using only the M54.5 ICD-10 code (i.e., "Low back pain") associated with the clinical note encounter, which proved to be a poor indicator for discriminating acute LBP episodes. In addition, the topics identified can serve as an intuitive tool to inform guidelines and recommendations, as well as to pre-filter the documents and reduce the manual work required to annotate the notes. The proposed framework is inherently domain agnostic and does not require any manual supervision to identify relevant features from the free text. Therefore, it can be leveraged in other musculoskeletal condition domains where acuity is not expressed in the ICD-10/diagnostic codes, such as knee, elbow, and shoulder pain.

Potential Applications

Medical care decisions are often based on heuristics and manually derived rule-based models constructed on prior knowledge and expertise [64]. Cognitive biases and personality traits, such as aversion to risk or ambiguity, overconfidence, and the anchoring effect, may lead to diagnostic inaccuracies and medical errors, resulting in mismanagement or inadequate utilization of resources [65]. In the LBP domain, this may lead to delays in finding the right therapy and assisting the return of patients to normal activities; an increased risk of the condition transitioning from acute to chronic; discomfort for patients; and increased economic burdens on clinical facilities to adequately treat and manage this patient population. Deriving data-driven guidelines for treatment recommendations can help mitigate these cognitive biases, leading to more consistent and accurate decisions.
In this scenario, the proposed framework integrates seamlessly with the RTW tool proposed by Cruz et al. [21] by including acuity-relevant information from the clinical notes and addressing one of the limitations of that study (i.e., recommending the RTW tool at the point of care by accurately identifying the condition as acute LBP). Similarly, an understanding of the patterns driving the predictions can lead to the development of new and improved treatment strategies for various types of injuries, which can be presented to clinicians at the time of the patient encounter to help them better manage the condition. While physicians will continue to have autonomy in determining optimal care pathways for their patients, the recommendations provided by the supporting framework will be useful to systematize and support their activities within the realm of a busy clinical practice. Posterior analysis of the clinical notes to infer acute LBP episodes can also help in assigning the proper diagnostic and billing codes for the encounter. In a foreseeable future scenario where clinical observations are automatically transcribed via voice and EHRs are processed in real time, an automated tool that identifies
acuity information could also improve the accuracy of diagnosis and billing in real time, with no need to wait for posterior evaluations.

Limitations

This work evaluated the feasibility of using machine learning to identify acute LBP episodes in clinical notes. Therefore, we compared different types of models (shallow vs. deep) and learning frameworks (unsupervised vs. supervised) to identify the best directions for implementation and deployment in real clinical settings. While several of the architectures evaluated in this work obtained promising results, more sophisticated models are likely to improve these performances, especially in the deep learning domain. For example, algorithms based on attention models [66], BERT [67], or XLNet [68] have shown encouraging results on similar NLP tasks and are likely to obtain better results in this domain as well. In this work we focused only on processing clinical notes; however, embedding structured EHR data, especially medications, imaging studies, and/or lab tests, into the method should improve the results. The dataset of clinical notes used in this study originated from a geographically diverse set of primary care clinics serving the New York population across the NY metro area over a limited period (i.e., 2016-2018). Providers were enrolled and randomized into the study on a rolling basis, with the number of encounters for LBP varying for each individual provider based on his/her own practice. The majority of the primary care providers were assistant professors serving on the front lines. No specialists were included in the initial study, as the pilot project was geared only toward primary care providers. Consequently, the results of this study might not be applicable to specialty care practice.

Future Work

The classification of LBP episodes as acute or chronic at the point of care within primary care practice is imperative for a RTW tool to be effectively used to render evidence-based guidelines. At this time, we plan to classify a
large set of notes, derive patterns related to acute LBP, and extend the tool proposed by Cruz et al. [21] accordingly. We also plan to identify cases where the RTW tool can be easily deployed based on EHR integration in the clinical domain. Second, we will begin to address some of the methodological limitations of this study to optimize performance and evaluate its generalizability outside primary care. Finally, we aim to evaluate the feasibility of this type of approach for other musculoskeletal conditions, in particular shoulder and knee pain.
manually annotating a set of notes to use as a gold standard can lead to effective results, especially when using deep learning. Topic modeling can help speed up the annotation process, initiating an iterative process where initial predictions are validated and then used to refine and optimize the model. This approach provides a generalizable framework for learning to differentiate disease acuity in primary care, which can more accurately and specifically guide the diagnosis and treatment of LBP. It also provides a clear path toward improving the accuracy of coding and billing of clinical encounters for LBP.
Acknowledgements

I.N. and L.C. would like to thank the Pilot Projects Research Training Program of the NY and NJ Education and Research Center (ERC), National Institute for Occupational Safety and Health, for their funding (grant #T42OH008422). R.M. would like to thank the Hasso Plattner Foundation for its support and NVIDIA for a courtesy GPU donation.

Competing Interests

None.

Contributions

R.M. and I.N. initiated the idea and wrote the article; I.N. collected the data and provided clinical support; R.M. conducted the research and the experimental evaluation; B.L.P. advised on evaluation strategies and refined the article; B.S.G., H.L., and L.C. refined the article; J.T.D. supported the research. All the authors edited and reviewed the manuscript.

Abbreviations

AUC-PRC: Area Under the Precision-Recall Curve
AUC-ROC: Area Under the Receiver Operating Characteristic Curve
BoN: Bag of N-grams
CNN: Convolutional Neural Network
EHR: Electronic Health Record
HIPAA: Health Insurance Portability and Accountability Act
ICD-CM: International Classification of Diseases, Clinical Modification
IRB: Institutional Review Board
LBP: Low Back Pain
LR: Logistic Regression
NLP: Natural Language Processing
NY: New York
PCP: Primary Care Provider
RTW: Return To Work
TF-IDF: Term Frequency-Inverse Document Frequency
on Work of Low Back Pain Among US Workers. Ann Intern Med. Published Online First: 2019. https://annals.org/aim/article-abstract/2733500/prevalence-recognition-work-relatedness-effect-work-low-back-pain-among?searchresult=1
41 Shi H, Xie P, Hu Z, et al. Towards Automated ICD Coding Using Deep Learning. arXiv [cs.CL]. 2017. http://arxiv.org/abs/1711.04075
42 Liu J, Zhang Z, Razavian N. Deep EHR: Chronic Disease Prediction Using Medical Notes. arXiv [cs.LG]. 2018. http://arxiv.org/abs/1808.04928
43 Yoon H-J, Ramanathan A, Tourassi G. Multi-task Deep Neural Networks for Automated Extraction of Primary Site and Laterality Information from Cancer Pathology Reports. In: Advances in Big Data. Springer International Publishing 2017. 195–204.
45 Turner CA, Jacobs AD, Marques CK, et al. Word2Vec inversion and traditional text classifiers for
diseases in discharge summaries. J Biomed Inform 2001;34:301–10.
50 Blei DM. Probabilistic topic models. Communications of the ACM 2012;55:77. doi:10.1145/2133806.2133826
51 Blei DM, Ng AY, Jordan MI. Latent Dirichlet Allocation. J Mach Learn Res 2003;3:993–1022.
52 Miotto R, Weng C. Case-based reasoning using electronic health records efficiently identifies eligible
56 Mikolov T, Sutskever I, Chen K, et al. Distributed Representations of Words and Phrases and their Compositionality. In: Burges CJC, Bottou L, Welling M, et al., eds. Advances in Neural Information Processing Systems 26. Curran Associates, Inc. 2013. 3111–9.