Machine Learning for Integrating Social Determinants in Cardiovascular Disease Prediction Models: A Systematic Review

Yuan Zhao¹, Erica P. Wood², Nicholas Mirin², Rajesh Vedanthan³⁴, Stephanie H. Cook²⁵, Rumi Chunara⁵⁶

¹ New York University, School of Global Public Health, Department of Epidemiology
² New York University, School of Global Public Health, Department of Social and Behavioral Sciences
³ New York University Grossman School of Medicine, Department of Population Health
⁴ New York University Grossman School of Medicine, Department of Medicine
⁵ New York University, School of Global Public Health, Department of Biostatistics
⁶ New York University Tandon School of Engineering, Department of Computer Science & Engineering

medRxiv preprint doi: https://doi.org/10.1101/2020.09.11.20192989; this version posted September 13, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC 4.0 International license.

NOTE: This preprint reports new research that has not been certified by peer review and should not be used to guide clinical practice.

Summary

Background
Cardiovascular disease (CVD) is the number one cause of death worldwide, and CVD burden is increasing in low-resource settings and for lower socioeconomic groups worldwide. Machine learning (ML) algorithms are rapidly being developed and incorporated into clinical practice for CVD prediction and treatment decisions. Significant opportunities for reducing death and disability from cardiovascular disease worldwide lie with addressing the social determinants of cardiovascular outcomes. We sought to review how social determinants of health (SDoH) and variables along their causal pathway are being included in ML algorithms in order to develop best practices for development of future machine learning algorithms that include social determinants.

Methods
We conducted a systematic review using five databases (PubMed, Embase, Web of Science, IEEE Xplore and ACM Digital Library). We identified English language articles published from inception to April 10, 2020, which reported on the use of machine learning for cardiovascular disease prediction and incorporated SDoH and related variables. We included studies that used data from any source or study type. Studies were excluded if they did not include the use of any machine learning algorithm, were developed for non-humans, the outcomes were bio-markers, mediators, surgery or medication of CVD, rehabilitation or mental health outcomes after CVD or cost-effectiveness analysis of CVD, the manuscript was non-English, or was a review or meta-analysis. We also excluded articles presented at conferences as abstracts for which the full texts were not obtainable. The study was registered with PROSPERO (CRD42020175466).

Findings
Of 2870 articles identified, 96 were eligible for inclusion. Most studies that compared ML and regression showed increased performance of ML, and most studies that compared performance with or without SDoH/related variables showed increased performance with them. The most frequently included SDoH variables were race/ethnicity, income, education and marital status. Studies were largely from North America, Europe and China, limiting the diversity of included populations and variance in social determinants.

Interpretation
Findings show that machine learning models, as well as SDoH and related variables, improve CVD prediction model performance. The limited variety of sources and data in studies emphasizes that there is opportunity to include more SDoH variables, especially environmental ones, that are known CVD risk factors in machine learning CVD prediction models. Given its flexibility, ML may provide opportunity to incorporate and model the complex nature of social determinants. Such data should be recorded in electronic databases to enable their use.

Funding
We acknowledge funding from Blue Cross Blue Shield of Louisiana. The funder had no role in the decision to publish.

Introduction

An estimated 17.9 million people die each year from cardiovascular diseases (CVD), which represent 31% of all deaths worldwide and the number one cause of death.[1] Low-income and middle-income countries carry 75% of the burden of CVD deaths worldwide and in high-income countries, lower socioeconomic groups have a higher
incidence of CVD and higher mortality due to CVD.[1,2] In high-income countries such as the United States, the prevalence of CVD is expected to rise 10% between 2010 and 2030,[3] not only in the aging population but also notably via stark disparities among socioeconomic and racial groups.[4,5] Direct causes for these shifts in CVD burden have been well-studied, attributed to changes in diet (increased consumption of processed foods)[6] and physical activity (more sedentary lifestyles),[7] resulting in a dramatic rise in conditions such as obesity, hypertension, and diabetes mellitus. These changes are shaped by the “conditions in which people are born, grow, live, work and age”, referred to by the World Health Organization as social determinants of health (SDoH).[8] Multinational, prospective cohort studies as well as ecologic analyses have shown that SDoH contribute to over 35% of the population attributable risk of various cardiovascular diseases,[9,10] among which education, income and occupation are particularly influential.[11] Research has also illuminated mechanisms of action; social factors usually interact with each other through the mediation of or effect modification by psychological and biological pathways, exerting a long-term effect on cardiovascular outcomes.[5,12] Social determinants also result in unequal sharing of the benefit of advances in CVD prevention and treatment.[13] Given the critical importance of social determinants with respect to disease risk, it is clear that better capturing the interaction and relative influence of such factors in relation to traditional CVD risk factors of hypertension, diabetes and hyperlipidemia provides the most significant opportunity to reduce CVD burden.[5,11,12,14]

Meanwhile, artificial intelligence (AI) and machine learning (an application of AI for detecting patterns from data)[15] tools have started to be adopted in clinical research, prompted by recent progress in advanced computing strategies as well as the proliferation of electronic medical record databases.[16] Machine learning methods have demonstrated improvement across multiple metrics for prediction of CVD risk, incidence and outcomes[17-19] over traditional risk scores such as those from the American College of Cardiology or American Heart Association.[20] As a data-driven approach, machine learning provides more flexibility in modeling complex relationships between predictors, which can be particularly advantageous in addressing the multi-level interactions between different social determinants and CVD outcomes, as well as uncovering novel risk factors. Though the increased flexibility of machine learning models is appealing, given the rapid rise of machine learning approaches including studies which incorporate social determinants, we need to better understand best practices for such modelling approaches for CVD risk prediction, particularly in the context of those including SDoH.

Thus, we performed a systematic review to understand the current landscape of how social determinants are being used in machine learning models for CVD prediction. Specifically, we sought to examine which types of machine learning algorithms and types of social determinant variables are being used, and for which populations. Indeed, understanding the manner in which SDoH are incorporated into such models is critical in order to tease apart the distinct biological and social influences, along with their interactions, that make populations different and in need of a different standard of care. Findings from this review serve to inform the design of future machine learning approaches and identify areas for methodological innovation in order to improve early prediction of CVD and reduce its significant disease burden.[21,22]

Method

Search strategy and selection criteria

First, YZ, with the help of an expert librarian, did a comprehensive search of five databases (PubMed, Embase, Web of Science, IEEE Xplore and ACM Digital Library) on April 10th, 2020, to identify all relevant articles on machine learning integrating social determinants in cardiovascular disease prediction models published in English. IEEE Xplore and ACM Digital Library were included specifically to comprehensively capture computer science articles related to our review. Papers from inception until the search date were included. To ensure the quality of included papers, we only included peer-reviewed articles published in journals or accepted at conferences and excluded non-peer-reviewed grey literature or arXiv/medRxiv papers.

We identified the terms of social determinants of health (SDoH) using the broader definition from the World Health Organization and the Centers for Disease Control and Prevention Healthy People 2020 initiative, which delineates SDoH in five key areas: economic stability, education, social and community context (e.g. “race/ethnicity”, “income” and “education”), health and health care, neighborhood and built environment (e.g. “living environment”, “pollution” and “residence characteristics”).[23] Figure 1 is a socio-ecological conceptual model adapted from Healthy People 2020, the United States federal government’s national health agenda,[24] which illustrates the multifactorial nature of social-ecological influences on health. The framework emphasizes the existence of proximate, or “downstream,” health influences (e.g., smoking) that are shaped by distal, or “upstream,” factors (e.g., social norms regarding smoking, tobacco regulations). Therefore, for a robust review, we also included prominent factors including health-related behaviors along the causal pathway (e.g. “diet”, “smoking” and “physical activity”), as although these are enacted at the individual level, they are shaped at social and economic levels.[25] This enables us to understand comprehensively how social determinants and factors they directly shape are assessed in relation to CVD. Age, gender and race are also in the causal pathway.[26] BMI was also included as it is influenced by social factors and causes diabetes, which directly affects CVD.[27]

For search terms related to machine learning, we included all commonly used supervised machine learning methods. Supervised machine learning algorithms are those that perform reasoning (i.e. prediction) from observations of the features (e.g. clinical data, social determinants) based on externally supplied examples which include the features linked to outcome “labels” (e.g. CVD outcomes). Thus supervised machine learning was a focus as the types of tasks considered in the literature usually utilized labeled outcomes of CVD.[28] Commonly used unsupervised machine learning algorithms captured by the search were also included in the abstract and full text screening to ensure all types of possible studies were considered. We also added search terms to capture deep learning and ensemble methods as they are widely used in current clinical research.[29]
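As a minimal illustration of the supervised setting described here (features linked to outcome labels), the sketch below uses a one-nearest-neighbour rule in plain Python. The records, the feature choice (age, current smoking, years of education) and the function name `predict_1nn` are hypothetical and are not taken from any reviewed study; the reviewed studies used a range of other algorithms.

```python
from math import dist

# Hypothetical toy records: (age, current_smoker, years_of_education).
# Values are invented for illustration only.
train_features = [
    (45, 0, 16),
    (50, 0, 14),
    (62, 1, 10),
    (70, 1, 8),
]
# Outcome "labels" supplied with the examples: 1 = CVD event, 0 = no event.
train_labels = [0, 0, 1, 1]


def predict_1nn(features, labels, query):
    """Predict the label of `query` from its closest labeled example."""
    nearest = min(range(len(features)), key=lambda i: dist(features[i], query))
    return labels[nearest]


# Predict for two previously unseen (hypothetical) individuals.
prediction_a = predict_1nn(train_features, train_labels, (68, 1, 9))
prediction_b = predict_1nn(train_features, train_labels, (48, 0, 15))
```

In practice the features would be scaled before computing distances, since age dominates the Euclidean distance here.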

The search terms for CVD outcomes included cardiovascular ischemic outcomes, coronary heart disease and cerebrovascular disease, which are caused by atherosclerotic cardiovascular disease (ASCVD). These cardiovascular diseases cause the highest mortality, and estimated years of life lost attributed to these conditions have increased in recent years.[1,2,30] For each of the key areas of social determinants and included variables, machine learning and CVD, we identified keywords by referencing previous review papers on social determinants and cardiovascular diseases,[11,31] related studies of different social determinants[31-35] or consulting experts to include relevant concepts. Full search strategies are provided in the appendix.

Once papers were identified via the search terms, articles of any study design and any population were deemed eligible if they utilized any SDoH or health behaviors as features in the machine learning models (in addition to age and gender, as we found that these were commonly included as standard practice and not specifically to represent their contribution as social determinants). Eligibility was also conditional on the outcomes being CVD-related, including incidence, survival, mortality, hospital admission and readmission. We did not restrict time of publication, to enable capturing the trend of these types of papers over time. Studies were excluded if they did not include the use of any machine learning algorithm, were developed for non-humans, the outcomes were bio-markers, mediators, surgery or medication for CVD, rehabilitation or mental health outcomes after CVD diagnosis or cost-effectiveness analysis of CVD treatment, the manuscript was non-English, or was a review or meta-analysis. We also excluded articles presented at conferences as abstracts for which the full texts were not obtainable. This review was registered with PROSPERO (CRD42020175466) and conducted in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) method. To supplement the bibliographic database searches, we also used Google Scholar to scrutinize all keywords regarding their relevance in articles as well as examine potential articles to identify if they were eligible. Duplicates were removed in the process.

Three investigators (YZ, EPW, and NM) screened the title and abstract: each article retrieved was independently assessed by two reviewers to determine its eligibility for full-text review. Conflicts were resolved by discussion and validation from a third reviewer. After initial appraisal, we retrieved full texts of eligible articles.

Data analysis

Data were extracted from individual articles independently by two reviewers (of YZ, EPW, and NM) and checked by the third reviewer according to criteria in a standardized extraction form. All data extraction was cross-checked, and disagreements were resolved by discussion or referral to the third reviewer. Information extracted included year of publication, country, population, social determinants included in the machine learning algorithm, machine learning algorithms, cardiovascular disease outcomes, data source and performance of the algorithms. For each article, we defined several criteria to assess the quality of the study based on best practices in machine learning[36] including (1) whether machine learning model performance was evaluated; (2) whether a hyperparameter (a parameter whose value is used to control the learning process) tuning process was described; (3) whether data-driven variable selection was performed; (4) whether methods were used to specifically interpret the contribution of included variables in the prediction. Each item was scored as no (not present), unclear, or yes (present), and then summarized alongside all items to get a study quality score.
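The four-item quality assessment can be sketched as a small scoring function. This is only one plausible reading of the procedure: the text does not state how "unclear" ratings are weighted, so the sketch assumes that only "yes" earns a point, which gives a maximum score of 4. The item names and the function `quality_score` are illustrative, not the authors' extraction form.

```python
# Illustrative names for the four quality items listed above.
QUALITY_ITEMS = (
    "performance_evaluated",
    "hyperparameter_tuning_described",
    "data_driven_variable_selection",
    "variable_contribution_interpreted",
)
ALLOWED_RATINGS = {"yes", "no", "unclear"}


def quality_score(ratings):
    """Sum the items rated 'yes'.

    Assumption: 'no' and 'unclear' both contribute 0 points; the source
    does not specify the weighting of 'unclear'.
    """
    for item in QUALITY_ITEMS:
        if ratings[item] not in ALLOWED_RATINGS:
            raise ValueError(f"unexpected rating for {item}: {ratings[item]!r}")
    return sum(1 for item in QUALITY_ITEMS if ratings[item] == "yes")


# Hypothetical study that met every item except a described tuning process.
example_ratings = {
    "performance_evaluated": "yes",
    "hyperparameter_tuning_described": "unclear",
    "data_driven_variable_selection": "yes",
    "variable_contribution_interpreted": "yes",
}
example_score = quality_score(example_ratings)
```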

Role of the Funding Source

The funder of the study had no role in study design, data collection, data analysis, data interpretation, or writing of the report. The corresponding author had full access to all the data in the study and had final responsibility for the decision to submit for publication.

Results

Our database search identified 2728 distinct articles; after a full-text review of 298 papers, 96 were included in the systematic review (Figure 2). Among the included studies, one used data from a clinical trial, while the others utilized observational data. Of the observational studies, data from cohort studies was the most frequent (34 studies), followed by data from electronic medical records (32 studies), surveys (14 studies) and data from open-access repositories of registry or national survey data (7 studies) (e.g. Scientific Registry of Transplant Recipients Registry[37]). Most of the observational data were structured data (clearly defined data features), while 9 studies included unstructured data (e.g. electrocardiogram, image and heart sound). The earliest year of publication was 1992 (artificial neural network algorithm),[38] and publications fulfilling our inclusion criteria have been increasing over time (Figure 3). Figure 4 summarizes variables (4A), outcomes (4A), author locations (4B) and types of venues where studies were published (4C). More details on the data sources and populations included, along with all study details, are in Table S1.


Social determinant and variables in the causal pathway

Included studies reported diverse variables across social determinants and variables considered, including race/ethnicity, education, marital status, occupation/employment, individual or household income, medical insurance, area of residence (e.g. urban versus rural or eastern vs. western USA) and other community-level factors of deprivation, income and education and environmental pollutants, as well as smoking, alcohol consumption, physical activities, substance abuse and diet. In most studies, gender and age were included as standard variables collected in the survey or EHR. A few studies assessed physical activities and diet as modifiable risk factors for early prevention of CVD.[14,39] Half of the studies reported feature importance of variables, in which age, gender, smoking (e.g. current smoking/past smoking/non-smoking) and BMI were most frequently reported to contribute significantly to the CVD outcome prediction. Other frequently reported determinants included race/ethnicity, alcohol consumption (e.g. daily intake or alcoholism), and physical activity/exercise (e.g. weekly exercise time). Besides age, gender, BMI and smoking, which were frequently reported in all CVD outcomes, alcohol consumption and physical activities were frequently associated with stroke while BMI was frequently associated with coronary artery disease. The top ten variables considered in extracted papers, and their frequency, are illustrated in Figure 4A, which include marital status, education, income and race/ethnicity as the most common social determinants. Four of the studies compared model performance with social determinants and without social determinants; three showed social determinants significantly improved prediction, others showed improved prediction by addition of age, gender and race.[40,41] The study that showed decreased performance aimed to forecast the pattern of the demand for hemorrhagic stroke healthcare services based on air quality; it is possible that the specific variable tested has little direct relationship with the outcome.[42]

Algorithms and model development

The most common machine learning methods were neural network (NN, 36 studies), random forest (RF, 28 studies), and decision trees (DT, 21 studies). Three studies used unsupervised machine learning algorithms, such as clustering to group CVD risk levels or principal component analysis (PCA) to extract features prior to supervised machine learning classification.[14,43,44] The most frequently used algorithms are described in Table 1. Of the 35 studies using neural networks, 12 used one hidden layer and 23 used multiple hidden layers, including most commonly three-layer perceptron, convolutional neural network and recurrent neural network. Here we refer to these studies collectively as “neural networks” (NN) as deep learning typically refers to a neural network with multiple layers.[45] Of the 42 studies including multiple machine learning algorithms, random forest (9 studies) and neural network (9 studies) were most frequently reported as the best performing machine learning algorithms. For most commonly studied CVD outcomes, random forest was frequently reported to have the best prediction for stroke while support vector machine (SVM) performed best for coronary artery disease.

There were 24 studies that compared machine learning algorithms with standard linear regression, logistic regression or survival analysis; among those, 21 showed improved performance with machine learning. One study of risk prediction for in-hospital mortality in women with ST-elevation myocardial infarction, using data from the National Inpatient Sample in the United States, found comparable performance using random forest and logistic regression.[46] In another study, neural network models using clinical data showed similar performance to logistic regression in predicting acute coronary syndrome; however, only 13 variables were considered.[47] A third study on predicting adverse cardiovascular events by models integrating stress-related ventricular functional and angiographic data showed that while a logistic model demonstrated better performance in this task and implementation, a Bayesian network model showed good performance and also was highlighted as being better at defining causal relationships, and thus useful for designing future models in which new variables can be incorporated in the prediction task.[48]

Model validation, performance and study quality

Most studies evaluated the performance of the machine learning algorithm(s) developed. Area under the receiver operating characteristic curve (AUC) was the most common evaluation metric used (45 studies), followed by sensitivity (43 studies), specificity (32 studies) and accuracy (32 studies). At least three of the four metrics were used in 31 studies. Other evaluation metrics used included positive predictive value, negative predictive value and F1-score, which is the harmonic mean of the precision and recall, commonly used to evaluate machine learning methods via their balance of these metrics. External evaluation was performed in 11 studies, wherein the authors tested the machine learning models developed in one hospital on another hospital or population. For example, one study specifically tested the generalizability of a recurrent neural network model for predicting heart failure risk in a large dataset from 10 hospitals, evaluating the performance of a model trained on each hospital’s training (and validation) sets over the 10 hospitals’ test sets. They also evaluated the model that was trained on all hospitals’ training sets over the 10 hospitals’ test sets.[41] Another study used data from one hospital to train neural network models for diagnosis of acute coronary syndrome and tested the model on data from two other hospitals.[47]
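The evaluation metrics named above can be computed from a confusion matrix and from ranked risk scores. The sketch below is a plain-Python illustration on invented labels and scores; the function names, the 0.5 decision threshold and the data are assumptions for this example and do not come from any reviewed study. AUC is computed by its pairwise definition: the fraction of (positive, negative) pairs in which the positive case receives the higher score, counting ties as one half.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels coded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def classification_metrics(y_true, y_pred):
    """Threshold-based metrics; assumes every denominator is non-zero."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn)  # also called recall
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp)          # positive predictive value, i.e. precision
    npv = tn / (tn + fn)          # negative predictive value
    accuracy = (tp + tn) / len(y_true)
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "accuracy": accuracy,
        "f1": f1,
    }


def auc(y_true, scores):
    """Pairwise AUC; assumes at least one positive and one negative case."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Invented outcome labels and model risk scores for eight individuals.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
risk_scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in risk_scores]  # assumed 0.5 threshold

metrics = classification_metrics(y_true, y_pred)
model_auc = auc(y_true, risk_scores)
```

Unlike the other metrics, AUC does not depend on the chosen threshold, which is one reason it is convenient for comparing models across studies.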

Among those reported, most AUC values were higher than 0.70 (Figure 5). As most studies were published in biomedical and clinical journals, most studies explicitly interpreted the findings and their relevance to clinical applications. Almost half (40/96) of the studies compared more than one machine learning algorithm, of which Random Forest was most commonly the best performing model. The mean score of included studies in the 4-item quality assessment scale (based on evaluation of ML, data-driven selection of features, hyperparameter tuning description, interpretation of the model) was 3.34. Half of the studies (49) had full scores and 30 studies missed one of the four items. Commonly missed items were data-driven feature selection and details of hyperparameter tuning (cross-validation or grid search strategies were utilized in 68 studies to tune hyperparameters; other studies did not give details about the hyperparameter tuning process). Half of all the studies utilized a data-driven selection method to identify features before fitting machine learning models, which is defined as extracting a subset of useful variables among the original variables and transforming data from a high- to a low-dimensional space.[49] As deep learning models extract features while training, those studies did not always include a feature selection process.
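To make the hyperparameter tuning step concrete, the following sketch combines a grid search with k-fold cross-validation in plain Python. The model (k-nearest neighbours, with the number of neighbours as the hyperparameter), the toy one-dimensional data and all function names are assumptions chosen for illustration; the reviewed studies tuned a variety of models and rarely reported this level of detail.

```python
from collections import Counter
from math import dist


def knn_predict(train_x, train_y, query, k):
    """Majority vote among the k training points closest to `query`."""
    order = sorted(range(len(train_x)), key=lambda i: dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def cv_accuracy(xs, ys, k, n_folds):
    """Cross-validated accuracy; fold membership is index modulo n_folds."""
    correct = 0
    for fold in range(n_folds):
        train_idx = [i for i in range(len(xs)) if i % n_folds != fold]
        test_idx = [i for i in range(len(xs)) if i % n_folds == fold]
        train_x = [xs[i] for i in train_idx]
        train_y = [ys[i] for i in train_idx]
        correct += sum(
            1 for i in test_idx if knn_predict(train_x, train_y, xs[i], k) == ys[i]
        )
    return correct / len(xs)


def grid_search(xs, ys, grid, n_folds):
    """Score every candidate in `grid` and return the best one with all scores."""
    scores = {k: cv_accuracy(xs, ys, k, n_folds) for k in grid}
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Invented, well-separated one-dimensional data for illustration.
xs = [(0,), (1,), (2,), (3,), (10,), (11,), (12,), (13,)]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
grid = (1, 3, 5)  # candidate numbers of neighbours
best_k, scores = grid_search(xs, ys, grid, n_folds=4)
```

Reporting the grid, the number of folds and the selection criterion, as this sketch makes explicit, is the kind of detail the hyperparameter tuning quality item looks for.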

Discussion

To our knowledge, this is the first systematic review to illustrate how machine learning is being used to integrate SDoH in cardiovascular disease prediction models. This review distills which types of algorithms and SDoH and related variables have been considered and resulting performance. We found that the flexibility of machine learning models has proved useful in CVD prediction models, with them commonly performing better than regression approaches. We find that models that consider SDoH and related variables also benefit from flexible modeling approaches, with neural networks consistently outperforming regression across all CVD outcomes.


Broadly, we found several limitations in the content covered by included papers. First, the studies were highly skewed to originate from USA, Europe and China, with lower-income locations not being well represented. Moreover, we found that the race and ethnicity distribution in some studies was also not very representative of underlying populations. This is particularly striking given the high and increasing CVD burden and changing socio-environmental circumstances in lower-income countries and regions and disparities in CVD burden. The variance of social determinants incorporated into models, and thus the performance and applicability of those models across contexts, will be decreased with less diversity in the study sample.[50] SDoH variables themselves were also not very frequently included, with only marital status, education, income and race/ethnicity in the top ten. Environmental attributes that have been shown as important modifiable components of CVD risk, such as green spaces and stress,[32,51,52] were very few, and even then were very broad (e.g. region of country).[42] If such area-based variables are included, machine learning may prove useful not only to unweave the strands of environmental influences but also to integrate the effects of the various components of the environment into a comprehensive model.[32] We also find that models did not take into account social processes associated with socioeconomic conditions across the life course. Socioeconomic position, psychosocial factors and behaviors during adolescence and youth are likely to be important in the development of CVD and precursors (dyslipidemia, hypertension, and smoking).[53] Finally, studies generally included gender interchangeably with sex, which precludes consideration of the socially-determined aspect of gender.[54]

Despite these limitations, our results largely found that the social determinants and related variables considered improved model performance. In terms of algorithms, several types of machine learning algorithms were evaluated, with results showing that, when compared within studies, the most flexible models such as neural networks and random forest models performed best. Neural networks also most commonly outperformed regression models. This is understood to be the case because neural networks include hidden layers which can take into account more complex relations in the data, and therefore this may be another possible explanation for the improved performance.[55,56] Moreover, recent studies uncovering network and spillover effects (social environment) and shared decision-making[57] involved in physical activity,[58,59] diet[60] and smoking[61] indicate that the pathways that inform these behaviors are intricate. However, this may illuminate an opportunity for machine learning, which, owing to its flexibility, can help capture such complex interactions.
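As a toy illustration of why flexibility can matter (it is not drawn from any reviewed study), the sketch below builds a synthetic dataset in which the outcome depends only on the interaction of two binary risk factors. The exclusive-or pattern is an assumption chosen purely to make the contrast extreme. An additive score with no interaction term ranks cases no better than chance, while a score that includes the interaction term separates them completely. The names `auc`, `records`, `additive` and `with_interaction` are hypothetical.

```python
def auc(scores, labels):
    """Area under the ROC curve by pairwise comparison; ties count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Synthetic cohort: every combination of two binary risk factors, repeated 25 times.
records = [(x1, x2) for x1 in (0, 1) for x2 in (0, 1)] * 25

# Assumed outcome: an event occurs when exactly one risk factor is present (XOR).
labels = [x1 ^ x2 for x1, x2 in records]

# Additive score: each factor contributes independently, with no interaction term.
additive = [x1 + x2 for x1, x2 in records]

# Same score plus an interaction term, which a flexible model could learn from data.
with_interaction = [x1 + x2 - 2 * x1 * x2 for x1, x2 in records]

auc_additive = auc(additive, labels)
auc_interaction = auc(with_interaction, labels)
print(auc_additive, auc_interaction)  # 0.5 1.0
```

Real risk factors are rarely this extreme, so the gap between regression and flexible models in practice would be expected to be far smaller than in this constructed case.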

The constraints on included data are likely due to difficulties in capturing certain SDoH variables and linking them with individual records in databases used in many of the included studies. Studies have largely used social variables from available data sources, commonly those in the electronic health record. The use of flexible machine learning models also brings concerns regarding interpretability and potential over-fitting to data,[62] though this was not a common discussion topic across all papers. This is likely because most models selected variables based on prior clinical significance; prediction performance would thus be based on factors which are known to be relevant to CVD, even if the specific importance of each variable was not measured. Furthermore, most papers (66) used methods such as automatic relevance determination[55] or feature selection[63] to examine and/or rank the importance of variables in machine learning models. This was the case even as articles were published in a variety of venues (Figure 4D).
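To make the idea of ranking variable importance concrete, the following minimal sketch uses permutation importance. This is a stand-in technique chosen for brevity, not automatic relevance determination or the feature selection procedures used in the reviewed papers. It shuffles one variable at a time and records how much the AUC drops. The data, the `risk_score` function and the feature names are all hypothetical.

```python
import random


def auc(scores, labels):
    """Area under the ROC curve by pairwise comparison; ties count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def risk_score(row):
    """Hypothetical fitted model: uses only 'risk_factor' and ignores 'noise'."""
    return row["risk_factor"]


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Synthetic records: the outcome is driven entirely by 'risk_factor'.
rows = [{"risk_factor": rng.random(), "noise": rng.random()} for _ in range(200)]
labels = [1 if r["risk_factor"] > 0.5 else 0 for r in rows]

baseline = auc([risk_score(r) for r in rows], labels)


def permutation_importance(feature):
    """Drop in AUC after shuffling one feature across records."""
    shuffled = [r[feature] for r in rows]
    rng.shuffle(shuffled)
    permuted = [dict(r, **{feature: v}) for r, v in zip(rows, shuffled)]
    return baseline - auc([risk_score(r) for r in permuted], labels)


importances = {f: permutation_importance(f) for f in ("risk_factor", "noise")}
print(importances)
```

A variable the model does not use shows an importance of zero, while shuffling an informative variable pushes the AUC toward chance.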

While this is the first review that gives findings related to the use of machine learning and social determinants for CVD prediction, there are individual studies that support components of the findings of this study. First, machine learning in general has shown promise with respect to cardiovascular disease prediction.[64-66] Compared to the established American College of Cardiology/American Heart Association risk calculator to predict incidence and prognosis of ASCVD,[20] previous work has shown that machine-learning algorithms (especially random forest, gradient boosting machines and neural networks) were better at identifying individuals who will develop CVD and those who will not.[17-19] These studies have attributed this to the fact that standard CVD risk assessment models make an implicit assumption that each risk factor is related in a linear fashion to CVD outcomes, and such models may thus oversimplify complex relationships which include large numbers of risk factors with non-linear interactions. Second, the role of social determinants in cardiovascular disease (not specifically machine learning-related) has been studied through several papers and systematic reviews. While full summaries of this work have been performed elsewhere,[11] we note that there have been several studies of various proximal and distal social determinants and cardiovascular disease. In general, studies indicate that the changing burden of disease due to societal and environmental conditions, as well as increasing advances in treatment and prevention, have not been shared equally across economic, racial and ethnic groups, compelling the need for broad-range consideration of social determinants in CVD prediction.[11,31] Finally, the models that have incorporated social determinants and machine learning for CVD prediction also reflect limitations of many machine learning algorithms that have been highlighted recently, which are based on homogeneous populations, particularly with respect to race (captured through the limited geographic diversity in Figure 4C).[17]

Our review was limited in several aspects. First, the included studies evaluated different types of cardiovascular outcomes, and heterogeneity of outcome metrics makes it difficult to compare machine learning performance across studies. Second, the populations considered include samples from different data sources, hospitals and countries, which taken together means that comparison across studies is not standardized. Third, most studies did not evaluate external validity, leaving the applicability of the algorithms to other populations or healthcare settings inconclusive. Fourth, the review was limited to studies published in English, which might have created some bias in the articles that were ultimately retained for the analysis.

Findings emphasize the need to comprehensively capture both proximal and distal social determinant variables in models. Where mechanisms are not well understood, machine learning can also be used to understand relationships between social and biological variables comprehensively. For example, race is often conceptualized as a proxy for socioeconomic position or cultural factors, and better ways to capture, as well as understand, relationships between these factors and their impact on CVD risk should be investigated. Indeed, identification of potential mediating and moderating factors in these pathways of social determinants will inform public health interventions. Improved constructs will also help in incorporation of environmental and behavioral variables such as diet and physical activity, which were not well represented in current studies. Our findings support bodies of work that promote inclusion of such information in the electronic health record,[67,68] and we add reasoning that this would also enable study of social determinants in machine learning in large enough sample sizes to reduce overfitting of models. Finally, results emphasize the need for studies that include more diverse populations with varied environmental and social influences, which would represent and ensure validity of prediction models across these diverse interactions[50] to improve cardiovascular disease prediction in diverse settings, in particular those where disease risk is increasing.

Research in context

Evidence before this study

While there are no reviews that specifically address social determinants, machine learning and cardiovascular disease (CVD), the latest research on cardiovascular disease indicates the imperative relevance of social determinants. Societal and environmental conditions distributed unequally among groups are driving a significant and increasing global burden of cardiovascular diseases, particularly in low- and middle-income countries as well as lower-socioeconomic groups in high-income countries. At the same time, research shows that machine learning offers the potential for capturing flexible relationships compared to the linear relationships assumed in typical CVD risk scores, which is of particular relevance for the consideration of social variables. Select models have examined the performance of certain social variables in CVD prediction, including those that use machine learning, but given these rapidly advancing research areas, a systematic examination into which social determinants have been modelled and how different methods have performed is needed.

Added value of this study

Through a rigorous and comprehensive systematic review, we assessed state-of-the-art methods for prediction of the two types of CVD with the highest recent mortality (ischemic heart disease and stroke) that use machine learning and incorporate social determinants and related variables in their causal pathway. We show that environmental and area-based determinants are lacking from most models. Machine learning, especially flexible models such as neural networks, shows good performance in relation to regression models. We accounted for model and information gain differences across studies by examining within-study performance of the best algorithm compared to regression. We assessed the quality of their implementations via best practices from the machine learning literature, finding that quality was generally rigorous. Finally, we show that the origin of studies is highly skewed to the USA and middle/high-income countries in Europe and Asia, which indicates that knowledge regarding the diversity of social determinants and their impact is limited.

Implications of all the available evidence

With the significant burden of CVD and large burden in low- and middle-income countries, this work directly informs how we can augment prediction models, using state-of-the-art machine learning methods, while also taking into account growing social and environmental risk factors that shape CVD risk. According to the findings of this review, strategies to capture social variables, especially environmental determinants, are needed in the electronic health record databases from which machine learning methods are commonly developed. Finally, studies to date represent a narrow set of locations; we need to support studies in low- and middle-income countries to identify and tailor our understanding to the specific social determinants in these populations.

Acknowledgements

We thank Dorice Vieira for valuable help with the search process.

Figures and Tables

Figure 1: Socio-ecological framework of health; conceptual model used in the study, adapted from Healthy People 2020, the United States federal government’s national health agenda

Figure 2: PRISMA flowchart of study review process and exclusion of papers


Figure 3: Number of ML algorithms used in publications by year and type (NN: neural network, RF: random forest, EN: ensemble methods (e.g. Adaboost, gradient boosting, bagged decision tree), SVM: support vector machine, DT: decision tree, NB: naïve Bayes, BN: Bayesian network, Reg: regularization methods (ridge/lasso regression), Other: multilayer perceptron, maximum entropy, adversarial network, linear discriminant analysis, k-nearest neighbors, recursive partitioning, clustering, quadratic discriminant, radial basis function kernel)

Figure 4: (A) Top ten social determinant and related variables included based on study inclusion criteria (social determinants in blue, other variables in red), (B) most frequently reported CVD outcomes (AAA: abdominal aortic aneurysm, LOS: length of stay, ACS: acute coronary syndrome, HF: heart failure, MI: myocardial infarction, CAD: coronary artery disease), (C) countries of corresponding authors and (D) journal types of publication reported in systematic review papers (EVS: environmental sciences), all with respect to the percentage of included papers they appear in


Figure 5: Difference of area under the ROC curve between ML and LR by ML algorithm type

Table 1: Summary of machine learning algorithms, best performing and sample sizes used in the studies

Algorithm                 Number (%)    Num as best algorithm      Sample size
                          of papers**   when multiple algorithms   <100   100-1000   1000-10,000   >10,000
Neural Net                35 (36.5%)    9                          5      8          13            9
Random Forest             32 (33.3%)    9                          0      3          14            15
Decision Tree             21 (21.9%)    2                          2      6          8             5
Support Vector Machine    20 (20.8%)    7                          1      3          11            5
Ensemble                  17 (17.7%)    5                          0      2          7             8
Bayesian Network          13 (13.5%)    1                          1      4          2             6
Naïve Bayes               12 (12.5%)    1                          2      2          5             3
Regularization*           11 (11.5%)    0                          0      2          6             3
Other                     28 (29.2%)    1                          4      6          12            6

*Regularization included Lasso, Ridge and Elastic net
**Note: each paper could include multiple versions or multiple algorithms
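The arithmetic behind Table 1 can be checked with a short sketch. It assumes the denominator for each percentage is the 96 included studies reported in the Findings, and that the four sample-size columns partition the papers using each algorithm. The dictionary layout and variable names are illustrative only.

```python
TOTAL_PAPERS = 96  # included studies, as reported in the review's Findings

# algorithm: (papers, reported percentage, [<100, 100-1000, 1000-10,000, >10,000])
table1 = {
    "Neural Net": (35, "36.5", [5, 8, 13, 9]),
    "Random Forest": (32, "33.3", [0, 3, 14, 15]),
    "Decision Tree": (21, "21.9", [2, 6, 8, 5]),
    "Support Vector Machine": (20, "20.8", [1, 3, 11, 5]),
    "Ensemble": (17, "17.7", [0, 2, 7, 8]),
    "Bayesian Network": (13, "13.5", [1, 4, 2, 6]),
    "Naive Bayes": (12, "12.5", [2, 2, 5, 3]),
    "Regularization": (11, "11.5", [0, 2, 6, 3]),
    "Other": (28, "29.2", [4, 6, 12, 6]),
}

# For each row: do the sample-size bins sum to the paper count,
# and does papers / 96 reproduce the reported percentage?
checks = {
    name: (sum(bins) == papers, f"{100 * papers / TOTAL_PAPERS:.1f}" == pct)
    for name, (papers, pct, bins) in table1.items()
}

for name, (bins_ok, pct_ok) in checks.items():
    print(f"{name}: bins sum to count = {bins_ok}, percentage matches = {pct_ok}")
```

The percentages sum to more than 100% because, as the table note states, a single paper could use several algorithms.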


Appendix

Table S1: Summary of all included papers (attached at end).

Search terms and search strategies

PubMed:
(Social determinants of health[Mesh] OR demography[Mesh] OR demographic*[tw] OR race[tw] OR racism[Mesh] OR “ethnicity”[tw] OR gender identity[Mesh] OR gender[tw] OR social[tw] OR social support[Mesh] OR income[Mesh] OR education[Mesh] OR employment[Mesh] OR marital status[Mesh] OR occupation[tw] OR “health insurance”[tw] OR health literacy[Mesh] OR marriage[tw] OR insurance[tw] OR housing[tw] OR home[tw] OR religion[tw] OR socioeconomic factors[Mesh] OR social class[Mesh] OR “social status”[tw] OR “access healthcare”[tw] OR healthcare disparities[Mesh] OR “financial difficulties”[tw] OR poverty[Mesh] OR “social disparity”[tw] OR unemployment[Mesh] OR social condition[Mesh] OR “social inequality”[tw] OR vulnerable population[Mesh] OR “social environment”[tw] OR sociodemographic*[tw] OR sociological factors[Mesh] OR body mass index[Mesh] OR physical activity[Mesh] OR diet[Mesh] OR smoking[Mesh] OR “alcohol consumption”[tw] OR tobacco[Mesh] OR “substance use”[tw] OR “physical inactivity”[tw] OR “substance abuse”[tw] OR health*behavi*r*[tw] OR health*service[tw] OR environment[Mesh] OR “living environment” OR “birthplace”[tw] OR “pollution”[tw] OR residence characteristics[Mesh] OR “geographic locations”[tw] OR “rural”[tw] OR “urban health”[tw] OR neighborhood[tw] OR cultur*[tw]) AND (machine learning[Mesh] OR supervised machine learning[Mesh] OR decision trees[Mesh] OR neural networks[Mesh] OR “Naive Bayes”[tw] OR “kNN”[tw] OR support vector machine[Mesh] OR perceptron[tw] OR “radial basis function”[tw] OR “Bayesian Network”[tw] OR “random forest”[tw] OR “classification tree”[tw] OR “elastic net”[tw] OR “multilayer perceptron”[tw] OR lasso[tw] OR ridge[tw] OR “nearest neighbor”[tw] OR deep learning[Mesh] OR boosting[tw] OR bagging[tw] OR ensemble[tw]) AND (“atherosclerotic cardiovascular disease”[tw] OR cardiovascular abnormalities[Mesh] OR heart disease*[tw] OR heart arrest[Mesh] OR myocardial ischemia[Mesh] OR arterial occlusive diseases[Mesh] OR cerebrovascular disorders[Mesh] OR peripheral vascular diseases[Mesh])

Embase:
(exp "social determinants of health"/ or exp "demography"/ or demographic* or *race"/ or "racism" or *ethnicity"/ or exp "gender identity"/ or "gender" or "social" or exp "social support"/ or exp "education"/ or exp "employment"/ or "income" or "marital status" or exp "occupation"/ or exp "health insurance"/ or "health literacy"/ or exp "marriage"/ or "insurance" or exp "housing"/ or "home" or "religion" or "socioeconomic factors" or exp "socioeconomics"/ or "social class" or exp "healthcare access"/ or exp "healthcare disparities"/ or "financial difficulties" or exp "poverty"/ or "social disparity" or exp "unemployment"/ or exp "social status"/ or "social inequality" or exp "vulnerable population"/ or *social environment/ or sociodemographic* or *body mass/ or *physical activity/ or *diet/ or exp "smoking"/ or exp "alcohol consumption"/ or "tobacco use" or exp "substance use"/ or exp "physical inactivity"/ or exp "substance abuse"/ or *environment/ or *birthplace/ or exp "pollution"/ or "residence characteristics" or *geography/ or "neighborhood" or cultur* or exp "rural health"/ or exp "urban health"/) and (exp "machine learning"/ or "supervised machine learning" or exp "decision trees"/ or "neural networks" or exp "artificial neural network"/ or "Naive Bayes" or exp "Bayesian learning"/ or exp "k nearest neighbor"/ or "knn" or exp "support vector machine"/ or "SVM" or exp "perceptron"/ or exp "radial based function"/ or "Bayesian Network" or exp "random forest"/ or "classification tree" or "elastic net" or "multilayer perceptron" or "lasso" or "ridge" or exp "deep learning"/ or "boosting" or "ensemble") and ("atherosclerotic cardiovascular disease" or *coronary artery atherosclerosis/ or "cardiovascular abnormalities" or exp "cardiovascular malformation"/ or *heart disease/ or exp "heart arrest"/ or exp "myocardial ischemia"/ or "arterial occlusive diseases" or exp "cerebrovascular disorders"/ or exp "peripheral vascular diseases"/)

Web of Science:
TS=((“Social determinants of health” OR demography OR demographic* OR race OR ethnicity OR “gender identity” OR gender OR social OR “social support” OR income OR education OR employment OR “marital status” OR occupation OR “health insurance” OR marriage OR insurance OR housing OR religion OR “socioeconomic factors” OR “social class” OR “access healthcare” OR “healthcare disparities” OR “financial difficult” OR poverty OR “social disparity” OR unemployment OR “social condition” OR “social inequality” OR “vulnerable population” OR “social environment” OR sociodemographic* OR “body mass index” OR “physical activity” OR diet OR smoking OR “alcohol consumption” OR tobacco OR “substance use” OR “physical inactivity” OR “substance abuse” OR environment OR birthplace OR pollution OR “residence characteristics” OR “geographic locations” OR “rural” OR “urban health”) AND (“machine learning” OR “supervised machine learning” OR “decision trees” OR “neural networks” OR “Naive Bayes” OR kNN OR “support vector machine” OR “perceptron” OR “radial basis function” OR “Bayesian Network” OR “random forest” OR “classification tree” OR “elastic net” OR “multilayer perceptron” OR “lasso” OR “ridge” OR “nearest neighbor” OR “deep learning” OR “boosting” OR “ensemble”) AND (“atherosclerotic cardiovascular disease” OR “cardiovascular abnormalities” OR heart disease* OR “heart arrest” OR “myocardial ischemia” OR “arterial occlusive diseases” OR “cerebrovascular disorders” OR “peripheral vascular diseases”))

IEEE:
(“Social determinants of health” OR demography OR demographic* OR race OR ethnicity OR “gender identity” OR gender OR social OR “social support” OR income OR education OR employment OR “marital status” OR occupation OR “health insurance” OR marriage OR insurance OR housing OR religion OR “socioeconomic factors” OR “social class” OR “access healthcare” OR “healthcare disparities” OR “financial difficult” OR poverty OR “social disparity” OR unemployment OR “social condition” OR “social inequality” OR “vulnerable population” OR “social environment” OR sociodemographic* OR “body mass index” OR “physical activity” OR diet OR smoking OR “alcohol consumption” OR tobacco OR “substance use” OR “physical inactivity” OR “substance abuse” OR environment OR birthplace OR pollution OR “residence characteristics” OR “geographic locations” OR “rural” OR “urban health”) AND (“machine learning” OR “supervised machine learning” OR “decision trees” OR “neural networks” OR “Naive Bayes” OR kNN OR “support vector machine” OR “perceptron” OR “radial basis function” OR “Bayesian Network” OR “random forest” OR “classification tree” OR “elastic net” OR “multilayer perceptron” OR “lasso” OR “ridge” OR “nearest neighbor” OR “deep learning” OR “boosting” OR “ensemble”) AND (“atherosclerotic cardiovascular disease” OR “cardiovascular abnormalities” OR heart disease* OR “heart arrest” OR “myocardial ischemia” OR “arterial occlusive diseases” OR “cerebrovascular disorders” OR “peripheral vascular diseases”)
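The search strings above share one structure: three concept blocks (social determinants, machine learning methods, cardiovascular outcomes), with terms joined by OR inside a block and the blocks joined by AND. A minimal sketch of assembling such a string is shown below. It uses a handful of terms taken from the strategies above, omits database-specific field tags such as [Mesh] or TS=, and the helper names `or_block` and `build_query` are hypothetical rather than part of any database's tooling.

```python
def or_block(terms):
    """Join terms with OR, quoting multi-word phrases, and wrap in parentheses."""
    quoted = [f'"{t}"' if " " in t else t for t in terms]
    return "(" + " OR ".join(quoted) + ")"


def build_query(*blocks):
    """Combine concept blocks so that a record must match every block."""
    return " AND ".join(or_block(block) for block in blocks)


# Small illustrative subsets of the three concept blocks.
sdoh_terms = ["social determinants of health", "income", "education"]
ml_terms = ["machine learning", "random forest", "lasso"]
cvd_terms = ["myocardial ischemia", "cerebrovascular disorders"]

query = build_query(sdoh_terms, ml_terms, cvd_terms)
print(query)
```

Keeping the term lists separate from the joining logic makes it easier to reuse the same concepts across databases whose syntax differs.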

References

1 World Health Organization. Cardiovascular diseases (CVDs) fact sheet. World Health Organization (2017).

2 Deaton, C. et al. The global burden of cardiovascular disease. European Journal of Cardiovascular Nursing 10, S5-S13 (2011).

3 Heidenreich, P. A. et al. Forecasting the impact of heart failure in the United States: a policy statement from the American Heart Association. Circulation: Heart Failure 6, 606-619 (2013).

4 Kuzawa, C. W. & Sweet, E. Epigenetics and the embodiment of race: developmental origins of US racial disparities in cardiovascular health. American Journal of Human Biology: The Official Journal of the Human Biology Association 21, 2-15 (2009).

5 Carnethon, M. R. et al. Cardiovascular health in African Americans: a scientific statement from the American Heart Association. Circulation 136, e393-e423 (2017).

6 Stuckler, D., McKee, M., Ebrahim, S. & Basu, S. Manufacturing epidemics: the role of global producers in increased consumption of unhealthy commodities including processed foods, alcohol, and tobacco. PLoS Medicine 9, e1001235 (2012).

7 Lakka, T. A. et al. Sedentary lifestyle, poor cardiorespiratory fitness, and the metabolic syndrome. Medicine & Science in Sports & Exercise (2003).

8 WHO Commission on Social Determinants of Health & World Health Organization. Closing the gap in a generation: health equity through action on the social determinants of health: Commission on Social Determinants of Health final report. (World Health Organization, 2008).

9 Joseph, P. et al. Reducing the global burden of cardiovascular disease, part 1: the epidemiology and risk factors. Circulation Research 121, 677-694 (2017).

10 Tillmann, T. et al. Psychosocial and socioeconomic determinants of cardiovascular mortality in Eastern Europe: A multicentre prospective cohort study. PLoS Medicine 14, e1002459 (2017).


11 Havranek, E. P. et al. Social determinants of risk and outcomes for cardiovascular disease: a scientific statement from the American Heart Association. Circulation 132, 873-898 (2015).

12 Theodore, R. F. et al. Childhood to early-midlife systolic blood pressure trajectories: early-life predictors, effect modifiers, and adult cardiovascular outcomes. Hypertension 66, 1108-1115 (2015).

13 Cooper, R. et al. Trends and disparities in coronary heart disease, stroke, and other cardiovascular diseases in the United States: findings of the national conference on cardiovascular disease prevention. Circulation 102, 3137-3147 (2000).

14 He, X., Matam, B. R., Bellary, S., Ghosh, G. & Chattopadhyay, A. K. CHD Risk Minimization through Lifestyle Control: Machine Learning Gateway. Scientific Reports 10, 1-10 (2020).

15 Watson, D. S. et al. Clinical applications of machine learning algorithms: beyond the black box. BMJ 364 (2019).

16 Rajkomar, A., Dean, J. & Kohane, I. Machine learning in medicine. New England Journal of Medicine 380, 1347-1358 (2019).

17 Alaa, A. M., Bolton, T., Di Angelantonio, E., Rudd, J. H. & van der Schaar, M. Cardiovascular disease risk prediction using automated machine learning: A prospective study of 423,604 UK Biobank participants. PLoS ONE 14, e0213653 (2019).

18 Dimopoulos, A. C. et al. Machine learning methodologies versus cardiovascular risk scores, in predicting disease risk. BMC Medical Research Methodology 18, 179 (2018).

19 Kakadiaris, I. A. et al. Machine learning outperforms ACC/AHA CVD risk calculator in MESA. Journal of the American Heart Association 7, e009476 (2018).

20 Cook, N. R. & Ridker, P. M. Further insight into the cardiovascular risk calculator: the roles of statins, revascularizations, and underascertainment in the Women’s Health Study. JAMA Internal Medicine 174, 1964-1971 (2014).

21 Caballero, F. F. et al. Advanced analytical methodologies for measuring healthy ageing and its determinants, using factor analysis and machine learning techniques: the ATHLOS project. Scientific Reports 7, 43955 (2017).

22 Seligman, B., Tuljapurkar, S. & Rehkopf, D. Machine learning approaches to the social determinants of health in the health and retirement study. SSM - Population Health 4, 95-99 (2018).

23 Committee on Leading Health Indicators for Healthy People 2020, Board on Population Health and Public Health Practice & Institute of Medicine. Leading health indicators for healthy people 2020: letter report. (National Academies Press, 2011).

24 National Research Council & Committee on Population. US health in international perspective: Shorter lives, poorer health. (National Academies Press, 2013).

25 Short, S. E. & Mollborn, S. Social determinants and health behaviors: conceptual frames and empirical advances. Current Opinion in Psychology 5, 78-84 (2015).

26 Shiroma, E. J. & Lee, I.-M. Physical activity and cardiovascular health: lessons learned from epidemiological studies across age, gender, and race/ethnicity. Circulation 122, 743-752 (2010).

27 Nuttall, F. Q. Body mass index: obesity, BMI, and health: a critical review. Nutrition Today 50, 117 (2015).

28 Kotsiantis, S. B., Zaharakis, I. & Pintelas, P. Supervised machine learning: A review of classification techniques. Emerging Artificial Intelligence Applications in Computer Engineering 160, 3-24 (2007).

29 LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436-444 (2015).

30 Roth, G. A. et al. Global, regional, and national age-sex-specific mortality for 282 causes of death in 195 countries and territories, 1980–2017: a systematic analysis for the Global Burden of Disease Study 2017. The Lancet 392, 1736-1788 (2018).

31 Kreatsoulas, C. & Anand, S. S. The impact of social determinants on cardiovascular disease. Canadian Journal of Cardiology 26, 8C-13C (2010).

32 Bhatnagar, A. Environmental determinants of cardiovascular disease. Circulation Research 121, 162-180 (2017).

33 Cheng, I., Ho, W. E., Woo, B. K. & Tsiang, J. T. Correlations between health insurance status and risk factors for cardiovascular disease in the elderly Asian American population. Cureus 10 (2018).

34 Fang, J. et al. Association of birthplace and coronary heart disease and stroke among US adults: National Health Interview Survey, 2006 to 2014. Journal of the American Heart Association 7, e008153 (2018).

35 Lapane, K. L., Lasater, T. M., Allan, C. & Carleton, R. A. Religion and cardiovascular disease risk. Journal of Religion and Health 36, 155-164 (1997).

36 Chen, P.-H. C., Liu, Y. & Peng, L. How to develop machine learning models for healthcare. Nature Materials 18, 410 (2019).

37 Hsich, E. M. et al. Variables of importance in the Scientific Registry of Transplant Recipients database predictive of heart transplant waitlist mortality. American Journal of Transplantation 19, 2067-2076 (2019).

38 Akay, M. Noninvasive diagnosis of coronary artery disease using a neural network algorithm. Biological Cybernetics 67, 361-367 (1992).

39 Shao, Z., Chen, C., Li, W., Ren, H. & Chen, W. Assessment of the risk factors in the daily life of stroke patients based on an optimized decision tree. Technology and Health Care 27, 317-329 (2019).

40 McGeachie, M. et al. An integrative predictive model of coronary artery calcification in arteriosclerosis. Circulation 120, 2448 (2009).


41 Rasmy, L. et al. A study of generalizability of recurrent neural network-based predictive models for heart failure onset risk using a large and heterogeneous EHR data set. Journal of Biomedical Informatics 84, 11-16 (2018).

42 Chen, J. et al. Machine Learning-Based Forecast of Hemorrhagic Stroke Healthcare Service Demand considering Air Pollution. Journal of Healthcare Engineering 2019 (2019).

43 Cheon, S., Kim, J. & Lim, J. The use of deep learning to predict stroke patient mortality. International Journal of Environmental Research and Public Health 16, 1876 (2019).

44 Jabbar, M., Deekshatulu, B. & Chndra, P. in International Conference on Circuits, Communication, Control and Computing. 322-328 (IEEE).

45 Illing, B., Gerstner, W. & Brea, J. Biologically plausible deep learning—But how far can we go with shallow networks? Neural Networks 118, 90-101 (2019).

46 Mansoor, H., Elgendy, I. Y., Segal, R., Bavry, A. A. & Bian, J. Risk prediction model for in-hospital mortality in women with ST-elevation myocardial infarction: A machine learning approach. Heart & Lung 46, 405-411 (2017).

47 Harrison, R. F. & Kennedy, R. L. Artificial neural network models for prediction of acute coronary syndromes using clinical data from the time of presentation. Annals of Emergency Medicine 46, 431-439 (2005).

48 Berchialla, P., Foltran, F., Bigi, R. & Gregori, D. Integrating stress-related ventricular functional and angiographic data in preventive cardiology: a unified approach implementing a Bayesian network. Journal of Evaluation in Clinical Practice 18, 637-643 (2012).

49 Chu, C. et al. Does feature selection improve classification accuracy? Impact of sample size and feature selection on classification using anatomical magnetic resonance images. Neuroimage 60, 59-70 (2012).

50 Harper, S., Lynch, J. & Smith, G. D. Social determinants and the decline of cardiovascular diseases: understanding the links. Annual Review of Public Health 32, 39-69 (2011).

51 Richardson, E. A. & Mitchell, R. Gender differences in relationships between urban green space and health in the United Kingdom. Social Science & Medicine 71, 568-575 (2010).

52 Steptoe, A. & Kivimäki, M. Stress and cardiovascular disease. Nature Reviews Cardiology 9, 360-370 (2012).

53 Pollitt, R. A., Rose, K. M. & Kaufman, J. S. Evaluating the evidence for models of life course socioeconomic factors and cardiovascular outcomes: a systematic review. BMC Public Health 5, 7 (2005).

54 Phillips, S. P. Defining and measuring gender: a social determinant of health whose time has come. International Journal for Equity in Health 4, 1-4 (2005).

55 Bishop, C. M. Bayesian methods for neural networks. (1995).

56 Dreiseitl, S. & Ohno-Machado, L. Logistic regression and artificial neural network classification models: a methodology review. Journal of Biomedical Informatics 35, 352-359 (2002).

57 Smith, K. P. & Christakis, N. A. Social networks and health. Annu. Rev. Sociol 34, 405-429 (2008).

58 Suglia, S. F. et al. Why the neighborhood social environment is critical in obesity prevention. Journal of Urban Health 93, 206-212 (2016).

59 Bahr, D. B., Browning, R. C., Wyatt, H. R. & Hill, J. O. Exploiting social networks to mitigate the obesity epidemic. Obesity 17, 723-728 (2009).

60 Pachucki, M. A., Jacques, P. F. & Christakis, N. A. Social network concordance in food choice among spouses, friends, and siblings. American Journal of Public Health 101, 2170-2177 (2011).

61 Årnes, A. P. & Krokstrand, T. T. The incidence and prevalence of Chronic Fatigue Syndrome, Back Pain of unknown origin, Fibromyalgia, and Myalgia in Norwegian women, and their association to physical activity. A prospective cohort study of material from the Norwegian Women and Cancer (NOWAC) study, UiT Norges arktiske universitet, (2014).

62 Ahmad, M. A., Eckert, C. & Teredesai, A. in Proceedings of the 2018 ACM international conference on bioinformatics, computational biology, and health informatics. 559-560.

63 Ni, Y. et al. Towards phenotyping stroke: Leveraging data from a large-scale epidemiological study to detect stroke diagnosis. PLoS ONE 13, e0192586 (2018).

64 Ambale-Venkatesh, B. et al. Cardiovascular event prediction by machine learning: the multi-ethnic study of atherosclerosis. Circulation Research 121, 1092-1101 (2017).

65 Sitar-tăut, A., Zdrenghea, D., Pop, D. & Sitar-tăut, D. Using machine learning algorithms in cardiovascular disease risk evaluation. Age 1, 4 (2009).

66 Weng, S. F., Reps, J., Kai, J., Garibaldi, J. M. & Qureshi, N. Can machine-learning improve cardiovascular risk prediction using routine clinical data? PLoS ONE 12, e0174944 (2017).

67 Bazemore, A. W. et al. “Community vital signs”: incorporating geocoded social determinants into electronic records to promote patient and population health. Journal of the American Medical Informatics Association 23, 407-412 (2016).

68 Cantor, M. N. & Thorpe, L. Integrating data on social determinants of health into electronic health records. Health Affairs 37, 585-590 (2018).


| Study | Country | Study Design | SDoH and Related Variables Included | CVD Outcomes | Algorithms |
|---|---|---|---|---|---|
| Alizadehsani, Roohallah, et al. (2018) | Australia | Unclear | Age, BMI, obesity, sex, smoking | Coronary artery disease | DT, SVM, NB |
| Tamal, Maruf Ahmed, et al. (2019) | Bangladesh | Observational (survey) | Age, physical exercise, smoking | Heart disease | DT, SVM, NB, QDA, RF, LR |
| Al’Aref, Subhi J., et al. (2020) | Canada, Germany, Italy, Korea, Switzerland, USA | Prospective observational | Age, BMI, ethnicity, sex, smoking | Coronary artery disease | Gradient boosting |
| Chan, Ka Lung, et al. (2018) | China | Cohort | Age, sex, smoking | Stroke | NN, SVM, NB |
| Chen, Jian, et al. (2019) | China | Observational (EHR) | Environmental pollutants | Stroke | RF, DT, XGB, SVM, LR, KNN |
| Gan, Xiu-min, et al. (2011) | China | Case control | Age, alcohol intake, BMI, education, other body metrics, physical activities, diet, smoking | Stroke | DT |
| Hsiao, Han CW, et al. (2016) | China | Observational (EHR) | Age, gender, area-level social determinants | All types of CVD | DL |
| Hu, Danqing, et al. (2016) | China | Observational (EHR) | Age, smoking | Other CVD | RF, SVM, NB, lasso, other* |
| Huang, Zhengxing, et al. (2015) | China | Observational (EHR) | Age, gender, smoking | Coronary artery disease | SVM, other |
| Huang, Zhengxing, et al. (2019) | China | Observational (EHR) | Age, gender, smoking | Other CVD | NN, DL, other |
| Shao, Zeguo, et al. (2019) | China | Observational (survey) | Age, alcohol intake, BMI, diet, physical activities, smoking | Stroke | RF, DT |
| Wan, Eric Yuk Fai, et al. (2017) | China | Retrospective cohort | Age, BMI, gender, smoking | All types of CVD | DT |
| Xu, Yuan, et al. (2019) | China | Retrospective cohort | Age, alcohol intake, gender, smoking | Rehospitalization | Gradient boosting |
| Karaolis, Minas A., et al. (2010) | Cyprus | Observational (EHR) | Age, gender, smoking | Myocardial infarction | DT |
| Anselmino, Matteo, et al. (2009) | Europe | Cross-sectional | Age, BMI, gender, smoking, other body metrics | Stroke, myocardial infarction | NN |
| Deguen, Séverine, et al. (2010) | France | Cross-sectional | Age, education, gender, income, occupation, residence, area-level social determinants | Coronary artery disease | Other |
| Baars, Theodor, et al. (2020) | Germany | Cohort | Age, BMI, gender, smoking | Mortality | Other ensemble** |
| Exarchos, Konstantinos P., et al. (2015) | Greece | Observational (EHR) | Age, gender, smoking | Other | NB |
| Tsipouras, Markos G., et al. (2008) | Greece | Observational (EHR) | Age, BMI, gender, smoking, other body metrics | Coronary artery disease | NN, DT |
| Jabbar, M. A., et al. (2014) | India | Observational (EHR) | Age, gender, residence | All types of CVD | DT, PCA |
| Naushad, Shaik Mohammad, et al. (2018) | India | Case control | Age, alcohol intake, BMI, diet, gender, smoking | Coronary artery disease | Other ensemble, other |
| Amin, Syed Umar, et al. (2013) | India | Observational (survey) | Age, alcohol intake, diet, gender, physical activities, smoking | All types of CVD | NN, other |
| Afarideh, Mohsen, et al. (2016) | Iran | Open cohort | Age, BMI, gender, smoking, other body metrics | All types of CVD | NN |
| Amini, Leila, et al. (2013) | Iran | Observational (survey) | Age, alcohol intake, BMI, gender, physical activities, smoking | Stroke | DT, other |
| Ayatollahi, Haleh, et al. (2019) | Iran | Observational (EHR) | Age, gender, marital status, occupation, residence, smoking | Coronary artery disease | NN, SVM |
| Parizadeh, Donna, et al. (2017) | Iran | Cohort | Age, gender, smoking, other body metrics | Stroke | DT |
| Shakerkhatibi, M., et al. (2015) | Iran | Case–crossover design | Age, gender, area-level social determinants | Hospital admission | NN |
| Berchialla, Paola, et al. (2012) | Italy | Cohort | Age, smoking | Myocardial infarction, mortality | RF, NN, SVM, BN |
| Bigi, Riccardo, et al. (2005) | Italy | Cohort | Age, gender, smoking | Myocardial infarction, mortality | NN, other |
| Foltran, Francesca, et al. (2011) | Italy | Observational (EHR) | Age, BMI, gender, smoking, other body metrics | Coronary artery disease | BN |
| Pasanisi, Stefania, et al. (2018) | Italy | Cohort | Age, gender, smoking | All types of CVD | NN |
| Cho, In Jeong, et al. (2020) | Korea | Cohort | BMI, physical activities, smoking | All types of CVD | DL |
| Hae, Hyeonyong, et al. (2018) | Korea | Retrospective cohort | Age, BMI, gender, smoking | Coronary artery disease | RF, DT, GB, SVM, NB, ridge, other |
| Kwon, Joon-myoung, et al. (2019) | Korea | Retrospective cohort | Age, BMI, gender, smoking | Mortality | RF, DL |
| Juarez-Orozco, Luis Eduardo, et al. (2020) | Netherlands | Observational (EHR) | Gender, BMI, smoking | Coronary artery disease, other | Ensemble |
| Tay, Darwin, et al. (2015) | Singapore | Cohort | Age, diet, gender, physical activities, built environment | All types of CVD | NN, SVM |
| Fuster-Parra, Pilar, et al. (2016) | Spain | Observational (survey) | BMI, gender, physical activities, smoking, other body metrics | All types of CVD | DT, BN, NB, other |
| Green, Michael, et al. (2006) | Sweden | Observational (EHR) | Age, gender, smoking | Acute coronary syndrome | NN |
| Marshall, Adele H., et al. (2010) | UK | Cohort | BMI, smoking | Coronary artery disease, mortality | BN |
| Alaa, Ahmed M., et al. (2019) | UK | Cohort | BMI, diet, physical activities, residence, smoking | All types of CVD | RF, NN, ensemble, GB, AdaBoost |
| Harrison, Robert F., et al. (2005) | UK | Observational (EHR) | Gender, smoking | Acute coronary syndrome | NN |
| He, Xi, et al. (2020) | UK | Observational (survey) | Gender, smoking | Coronary artery disease | PCA, other |
| Ayala Solares, Jose Roberto, et al. (2019) | UK | Observational (EHR) | Gender, income, smoking, area-level social determinants | All types of CVD | BN |
| Yang, Hui, et al. (2015) | UK | Observational (EHR) | BMI, smoking | Coronary artery disease | NB, other |
| Ahmad, Tariq, et al. (2018) | USA | Registry | Age, alcohol intake, BMI, education, gender, income, marital status | Heart failure | RF, other |
| Akay, Metin, et al. (1992) | USA | Cross-sectional | Age, BMI, gender, smoking | Coronary artery disease | NN, other |
| Ambale-Venkatesh, Bharath, et al. (2017) | USA | Cohort | Age, alcohol intake, BMI, education, gender, income, race, smoking, other body metrics | Stroke, all types of CVD, heart failure, coronary artery disease, mortality | RF, lasso |
| Basu, Sanjay, et al. (2017) | USA | Cohort | Age, gender, race, smoking | Stroke, heart failure, myocardial infarction, mortality | Lasso |
| Dinh, An, et al. (2019) | USA | Cross-sectional | Age, alcohol intake, BMI, gender, income, physical activities, race | All types of CVD | RF, GB, ensemble, SVM |
| Dogan, Meeshanthini V., et al. (2018) | USA | Cohort | Age, alcohol intake, BMI, gender, physical activities, smoking | Stroke | RF, ensemble |
| Edwards, Dorothy F., et al. (1999) | USA | Observational (EHR) | Age, gender, race | Mortality | NN |
| Golas, Sara Bersche, et al. (2018) | USA | Observational (EHR) | Education, gender, marital status, occupation, race | Rehospitalization | Deep unified networks, GB |
| Gonzales, Tina K., et al. (2017) | USA | Cohort | Alcohol intake, BMI, gender, income, physical activities, smoking | Myocardial infarction | RF, other |
| Hsich, Eileen M., et al. (2019) | USA | Observational (survey) | Age, BMI, medical insurance, race, smoking | Mortality | RF |
| Hu, Danqing, et al. (2016) | USA | Clinical trial | Age, BMI, gender, income, race, residence | Carotid atherosclerosis | RF, NB, other |
| Imran, Tasnim F., et al. (2018) | USA | Observational (EHR) | Age, BMI, gender, race, smoking | Stroke | Lasso |
| Kerut, Edmund Kenneth, et al. (2019) | USA | Observational (survey) | Gender, race, smoking | Abdominal aortic aneurysm | NN |
| Kogan, Emily, et al. (2020) | USA | Observational (EHR) | Gender, residence, area-level social determinants | Stroke | RF, NN, GB |
| Leach, Heather J., et al. (2016) | USA | Cohort | BMI, diet, income, physical activities, smoking, built environment | All types of CVD | DT |
| Akay, Metin, et al. (1994) | USA | Observational (EHR) | Gender, smoking | Coronary artery disease | NN |
| Mansoor, Hend, et al. (2017) | USA | Cohort | Alcohol intake, income, medical insurance, race, smoking, substance abuse, area-level social determinants | Mortality | RF |
| McGeachie, Michael, et al. (2009) | USA | Cohort | Age, BMI, education, gender, smoking | Coronary artery disease | BN |
| Mobley, Bert A., et al. (1995) | USA | Observational (EHR) | Gender, medical insurance, race | Length of stay in hospital | NN |
| Motwani, Manish, et al. (2017) | USA | Cohort | Age, BMI, race, smoking | Mortality | Other ensemble |
| Ni, Yizhao, et al. (2018) | USA | Cohort | Alcohol intake, gender, marital status, occupation, race, smoking, substance abuse | Stroke | RF, NN, SVM |
| Ottenbacher, Kenneth J., et al. (2001) | USA | Retrospective cohort | Age, gender, marital status, medical insurance, occupation, residence | Rehospitalization | NN |
| Rasmy, Laila, et al. (2018) | USA | Observational (EHR) | Age, gender, race | Heart failure | DL, ridge, lasso |
| Baldassarre, Damiano, et al. (2004) | Italy | Cross-sectional | Age, BMI, gender, smoking | All types of CVD | NN, other |
| Bandyopadhyay, Sunayan, et al. (2015) | USA | Observational (EHR) | Age, BMI, gender, smoking | All types of CVD | BN |
| Beunza, Juan-Jose, et al. (2019) | Spain | Cohort | Age, BMI, education, gender, smoking | Coronary artery disease | RF, NN, DT, AdaBoost, SVM |
| Biesbroek, Sander, et al. (2015) | Netherlands | Cohort | Age, alcohol intake, diet, education, gender, physical activities, smoking | Stroke, coronary artery disease | RF, DT, PCA, other |
| Brisimi, Theodora S., et al. (2018) | USA | Observational (EHR) | Age, gender, race, smoking, area-level social determinants | Hospitalization | RF, SVM |
| Çelik, Güner, et al. (2014) | Turkey | Observational (EHR) | Age, gender, smoking | Stroke | NN, other |
| Cheon, Songhee, et al. (2019) | Korea | Observational (survey) | Age, gender, medical insurance, area-level social determinants | Stroke | RF, AdaBoost, SVM, DL, PCA, NB, other |
| Corsetti, James P., et al. (2016) | USA | Observational (EHR) | BMI, race | All types of CVD | BN |
| Cox Jr, Louis Anthony Tony (2017) | USA | Observational (survey) | Age, education, gender, income, marital status, smoking, area-level social determinants | Stroke | RF, DT, BN |
| Cox Jr, Louis Anthony Tony (2018) | USA | Observational (survey) | Age, education, income, gender, marital status, smoking, area-level social determinants | All types of CVD | BN |
| Daghistani, Tahani A., et al. (2019) | Saudi Arabia | Observational (EHR) | Age, gender, medical insurance, smoking | Length of stay | RF, NN, SVM, BN |
| Dai, Wuyang, et al. (2015) | USA | Observational (EHR) | Age, gender, race, smoking, area-level social determinants | Hospitalization | AdaBoost, SVM, NB, other |
| Dogan, Meeshanthini V., et al. (2018) | USA | Cohort | Age, gender, smoking | Coronary artery disease | RF |
| Li, Yan, et al. (2019) | USA | Observational (survey) | Age, alcohol intake, education, gender, income, medical insurance, physical activities, race, smoking, area-level social determinants | Stroke, coronary artery disease | RF |
| Li, Xuemeng, et al. (2019) | China | Observational (survey) | Age, alcohol intake, BMI, gender, physical activities, smoking | Stroke | RF, NN, DT, other ensemble, BN, NB |
| Karaolis, M., et al. (2008) | Cyprus | Observational (EHR) | Age, gender, smoking | Coronary artery disease | DT |
| Raihan, M., et al. (2019) | Bangladesh | Observational (EHR) | Age, gender, physical activities, smoking, substance abuse | Coronary artery disease | NN |
| Martínez-García, M., et al. (2018) | Mexico | Retrospective cohort | Age, alcohol intake, BMI, education, gender, income, marital status, medical insurance, smoking | Myocardial infarction, rehospitalization | SVM |
| Miller, C. S., et al. (2014) | USA | Cross-sectional case-controlled | Age, BMI, gender, race, smoking | Myocardial infarction | DT |
| Eswaran, Chikkannan, et al. (2012) | Malaysia | Cross-sectional | Age, BMI, gender, smoking | Myocardial infarction | NN, BN, other |
| Ross, Elsie Gyang, et al. (2016) | USA | Prospective observational | Age, alcohol intake, BMI, education, income, marital status, occupation, physical activities, race, residence, smoking | Mortality, other | RF, ridge |
| Saito, Hiroshi, et al. (2019) | Japan | Observational (survey) | Age, gender, occupation, residence, smoking | Rehospitalization | Lasso |
| Van Loo, Hanna M., et al. (2014) | Netherlands | Observational study | Age, BMI, gender, smoking | Mortality | Lasso |
| Vedomske, Michael A., et al. (2013) | USA | Observational (EHR) | Age, gender, medical insurance, race | Rehospitalization | RF |
| Ayyagari, Rajeev, et al. (2014) | USA | Retrospective cohort | Age, gender, race, smoking | Stroke | Lasso |
| Vistisen, Dorte, et al. (2016) | Denmark | Cohort | Age, alcohol intake, BMI, gender, physical activities, smoking | Stroke | RF, DT |
| Wu, Yafei, et al. (2020) | China | Prospective observational | Age, alcohol intake, gender, smoking | Stroke | RF, SVM, lasso |
| Yu, Shipeng, et al. (2015) | USA | Retrospective cohort | Age, gender, marital status, race | Rehospitalization | SVM, lasso |
| Zhuang, Xiaodong, et al. (2018) | China | Observational (survey) | Age, BMI, gender, income, race, smoking, area-level social determinants | All types of CVD | RF |

Abbreviations used: NN: neural network, RF: random forest, SVM: support vector machine, DT: decision tree, NB: naïve Bayes, BN: Bayesian network, Lasso: lasso regression, Ridge: ridge regression, QDA: quadratic discriminant analysis

*“Other” algorithms used include: multilayer perceptron, maximum entropy, adversarial network, k-nearest neighbors, recursive partitioning, clustering, quadratic discriminant, RBF

**Other ensemble: ensemble methods other than AdaBoost and gradient boosting
