HST 190: Introduction to Biostatistics, Lecture 7: Logistic regression

May 30, 2020


dariahiddleston

Logistic regression

• We’ve previously discussed linear regression methods for predicting continuous outcomes
  § Functionally, predicting the mean at particular covariate levels

• What if we want to predict values for a dichotomous categorical variable, instead of a continuous one?
  § This corresponds to predicting the probability of the outcome variable being a “1” versus “0”

• Can we just use linear regression for a 0-1 outcome variable?


• Consider modeling the probability that a person receives a physical exam in a given year as a function of income.
  § A sample of individuals is collected. Each individual reports income and whether he/she went to the doctor last year.

patient #   Y = checkup   X = income
1           1             32,000
2           0             28,000
3           0             41,000
4           1             38,000
etc.


• Plotting this data and fitting a linear regression line, we see that the linear model is not tailored to this type of outcome
  § For example, an income of $500,000 yields a predicted probability of visiting the doctor greater than 1!
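The problem can be reproduced with a minimal sketch. It assumes only the four rows printed in the table above (the lecture's full sample is not available, so the fitted numbers will differ from the slide's plot), and the helper name `fit_line` is made up for illustration.

```python
# Least-squares line through a 0-1 outcome, using only the four table rows
# shown in the slides (an assumption: the real sample was larger).
incomes = [32_000, 28_000, 41_000, 38_000]
checkups = [1, 0, 0, 1]


def fit_line(x, y):
    """Return (intercept, slope) of the ordinary least-squares line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


intercept, slope = fit_line(incomes, checkups)

# The fitted line is unbounded, so a large enough income pushes the
# "probability" past 1.
predicted_at_500k = intercept + slope * 500_000
print(round(predicted_at_500k, 2))
```

With these four rows the prediction at an income of 500,000 comes out well above 1, which is not a valid probability.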


Logit transformation

• To overcome the problem, we define the logit transformation: if $0 < p < 1$, then

$$\operatorname{logit}(p) = \ln\left(\frac{p}{1-p}\right)$$

  § Notice that as $p \uparrow 1$, $\operatorname{logit}(p) \uparrow \infty$, and as $p \downarrow 0$, $\operatorname{logit}(p) \downarrow -\infty$

• Thus, $\operatorname{logit}(p)$ can take any continuous value, so we will fit a linear model on this transformed outcome instead

• Write this type of model generally as

$$g(E[y]) = \alpha + \beta_1 x_1 + \cdots + \beta_K x_K$$

  § Where $E[y] = 1 \cdot P(y=1) + 0 \cdot P(y=0) = P(y=1) = p$ and $g(E[y]) = \operatorname{logit}(E[y])$
  § This model is called a logistic regression model or a logit model
  § By comparison, the linear regression model takes $g(E[y]) = E[y]$
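As a small illustration of the transformation, the sketch below implements the logit and its inverse. The function names `logit` and `inverse_logit` are chosen here for illustration and are not from the lecture.

```python
import math


def logit(p: float) -> float:
    """Log odds ln(p / (1 - p)); only defined for 0 < p < 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("logit is defined only for 0 < p < 1")
    return math.log(p / (1.0 - p))


def inverse_logit(value: float) -> float:
    """Map any real number back to a probability in (0, 1)."""
    if value >= 0:
        return 1.0 / (1.0 + math.exp(-value))
    # For negative inputs use the equivalent form e^v / (1 + e^v),
    # which avoids exponentiating a large positive number.
    e = math.exp(value)
    return e / (1.0 + e)


# Probabilities near 0 or 1 map to large negative or positive values.
for p in (0.001, 0.1, 0.5, 0.9, 0.999):
    print(p, round(logit(p), 3))
```

Because the inverse always lands strictly between 0 and 1, a linear model fitted on the logit scale cannot produce an impossible probability.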


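• A key benefit of fitting a logit model rather than contingency table methods is the ability to adjust for multiple covariates (including continuous covariates) simultaneously

• To interpret parameters, compare the fit for a man and a woman
  § $\operatorname{logit}(p_{\text{woman}}) = \alpha + \beta_{\text{age}} X_{\text{age}} + \beta_{\text{income}} X_{\text{income}} + \beta_{\text{woman}}$
  § $\operatorname{logit}(p_{\text{man}}) = \alpha + \beta_{\text{age}} X_{\text{age}} + \beta_{\text{income}} X_{\text{income}}$

patient #   y = checkup   income   age   gender
1           1             32,000   60    F
2           0             28,000   53    M
3           0             41,000   45    M
4           1             38,000   40    F
etc.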


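$$\begin{aligned}
&\Rightarrow \operatorname{logit}(p_{\text{man}}) = \operatorname{logit}(p_{\text{woman}}) - \beta_{\text{woman}} \\
&\iff \ln\frac{p_{\text{woman}}}{1 - p_{\text{woman}}} - \ln\frac{p_{\text{man}}}{1 - p_{\text{man}}} = \beta_{\text{woman}} \\
&\iff \ln\frac{p_{\text{woman}}\,(1 - p_{\text{man}})}{p_{\text{man}}\,(1 - p_{\text{woman}})} = \beta_{\text{woman}} \\
&\iff \frac{p_{\text{woman}}\,(1 - p_{\text{man}})}{p_{\text{man}}\,(1 - p_{\text{woman}})} = \frac{\text{odds}_{\text{woman}}}{\text{odds}_{\text{man}}} = e^{\beta_{\text{woman}}}
\end{aligned}$$

• So, $\beta_{\text{woman}}$ is the log of the odds ratio for getting a checkup, comparing women to men, adjusting for age and income

• This result holds for any dichotomous variable in the model
  § This allows us to estimate the odds ratio for a given exposure with disease in a regression, accounting for the effects of other variables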
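The derivation can be checked numerically. The sketch below uses made-up coefficient values (not estimates from the lecture) and confirms that the odds ratio between a woman and a man with the same age and income equals $e^{\beta_{\text{woman}}}$, whatever those shared covariate values are.

```python
import math

# Hypothetical coefficients, chosen only for illustration.
ALPHA = -3.0
BETA_AGE = 0.04
BETA_INCOME = 0.00002
BETA_WOMAN = 0.5


def checkup_probability(age, income, woman):
    """P(checkup) under the logit model; woman is 1 for a woman, 0 for a man."""
    linear = ALPHA + BETA_AGE * age + BETA_INCOME * income + BETA_WOMAN * woman
    return 1.0 / (1.0 + math.exp(-linear))


def odds(p):
    return p / (1.0 - p)


def gender_odds_ratio(age, income):
    """Odds of a checkup for a woman divided by the odds for a man."""
    p_woman = checkup_probability(age, income, 1)
    p_man = checkup_probability(age, income, 0)
    return odds(p_woman) / odds(p_man)


# The ratio does not depend on the shared covariate values.
print(gender_odds_ratio(40, 38_000), gender_odds_ratio(60, 32_000), math.exp(BETA_WOMAN))
```

The predicted probabilities themselves do change with age and income; only the ratio of odds stays fixed.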


• In a logistic regression $\operatorname{logit}(p) = \alpha + \beta_1 x_1 + \cdots + \beta_K x_K$, denote the fitted parameter estimates as $\hat{\alpha}, \hat{\beta}_1, \ldots, \hat{\beta}_K$

• If $X_j$ is a dichotomous exposure, then the estimated odds ratio relating this exposure to the outcome is

$$\widehat{OR} = e^{\hat{\beta}_j}$$

• If instead $X_j$ is a continuous exposure, then the above odds ratio (and its CI) describes the outcome’s association with a one-unit increase in the exposure, adjusting for other covariates
  § e.g., “a one unit increase in age is associated on average with an $e^{\hat{\beta}_{\text{age}}}$-fold change in the odds of getting a checkup, holding gender constant.”


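Hypothesis testing and confidence intervals

• For an estimated coefficient $\hat{\beta}_j$ in a logistic model, the corresponding $100(1-\alpha)\%$ CI for the odds ratio is given by

$$\left( e^{\hat{\beta}_j - z_{1-\alpha/2}\,\widehat{se}(\hat{\beta}_j)},\; e^{\hat{\beta}_j + z_{1-\alpha/2}\,\widehat{se}(\hat{\beta}_j)} \right)$$

  § Here $\alpha$ is the significance level, not the model intercept
  § Matlab or other software will provide both $\hat{\beta}_j$ and $\widehat{se}(\hat{\beta}_j)$
  § Take note of whether you are given $\hat{\beta}_j$ or $\widehat{OR} = e^{\hat{\beta}_j}$ in software output! This differs between programs

• Testing the hypothesis $H_0: \beta_j = 0$ versus $H_1: \beta_j \neq 0$ is a z-test that is typically provided as part of software output
  § If the null is true, $Z = \hat{\beta}_j / \widehat{se}(\hat{\beta}_j)$ is approximately $N(0,1)$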
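A minimal sketch of these calculations follows, assuming the software has reported a coefficient and its standard error. The function name `wald_summary` and the example values 0.5 and 0.2 are hypothetical.

```python
import math
from statistics import NormalDist


def wald_summary(beta_hat, se, level=0.95):
    """Odds ratio, its confidence interval, and the z-test for one coefficient."""
    standard_normal = NormalDist()
    z_crit = standard_normal.inv_cdf(1 - (1 - level) / 2)  # z_{1 - alpha/2}
    z = beta_hat / se
    p_value = 2 * (1 - standard_normal.cdf(abs(z)))  # two-sided
    return {
        "odds_ratio": math.exp(beta_hat),
        # The interval is built on the log-odds scale, then exponentiated.
        "ci_lower": math.exp(beta_hat - z_crit * se),
        "ci_upper": math.exp(beta_hat + z_crit * se),
        "z": z,
        "p_value": p_value,
    }


# Hypothetical software output: beta_hat = 0.5 with standard error 0.2.
summary = wald_summary(0.5, 0.2)
print(summary)
```

Because the interval is exponentiated from the log-odds scale, it is not symmetric around the estimated odds ratio.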


Interaction terms

• Like in linear regression, we can also incorporate interaction terms in a logistic regression model

$$\operatorname{logit}(p) = \alpha + \beta_{\text{age}} X_{\text{age}} + \beta_{\text{income}} X_{\text{income}} + \beta_{\text{woman}} X_{\text{woman}} + \beta_{\text{age:woman}} X_{\text{age}} X_{\text{woman}}$$

• $\beta_{\text{age:woman}}$ captures the presence of an interaction effect, or effect modification, of gender by age
  § e.g., gender effect on probability of getting annual checkup is greater among younger people
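With the interaction term in the model, subtracting the logit for a man from the logit for a woman of the same age and income leaves $\beta_{\text{woman}} + \beta_{\text{age:woman}} X_{\text{age}}$, so the gender odds ratio now depends on age. The sketch below shows this with made-up coefficients; a negative interaction coefficient is assumed so that the gender effect is larger among younger people, as in the example above.

```python
import math

# Hypothetical coefficients for illustration only.
BETA_WOMAN = 0.9
BETA_AGE_WOMAN = -0.01  # negative: the gender effect shrinks with age


def woman_vs_man_odds_ratio(age):
    """Odds ratio (woman vs. man) at a given age, other covariates held equal."""
    return math.exp(BETA_WOMAN + BETA_AGE_WOMAN * age)


for age in (30, 45, 60):
    print(age, round(woman_vs_man_odds_ratio(age), 3))
```

Without the interaction term (coefficient 0), the function would return the same odds ratio at every age.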


Model building for inference

• The techniques for variable selection in logistic regression are similar to those for linear regression
  § Biggest challenge is the lack of comparable visual fit diagnostics like residual plots

• When model building for studies of association between exposure and outcome, the focus is on including sources of confounding (i.e., external variables associated with both exposure and outcome)

• One strategy is to fit and report the following three models:
  1) an unadjusted or minimally adjusted model
  2) a model that includes ‘core’ confounders (‘primary’ model)
     o clear indication from scientific knowledge and/or the literature
     o consensus among investigators
  3) a model that includes ‘core’ confounders plus any ‘potential’ confounders
     o indication is less certain


Logistic regression in retrospective setting

• How do we interpret the intercept in $\operatorname{logit}(p) = \alpha + \beta_1 x_1 + \cdots + \beta_K x_K$?
  § $\alpha = \log\left(\frac{p}{1-p}\right)$ is the log odds of experiencing the outcome in the population among subjects with $x_1 = \cdots = x_K = 0$
  § Links model to absolute prevalence of the outcome in the population

• What happens to the logit model if our sampling is case-control (or retrospective)?
  § That is, what if we sample based on outcome status?
  § e.g., sample 100 patients with a disease and 100 patients without

• Typically this setting artificially selects more cases than would arise naturally under cross-sectional or prospective sampling
  § so we cannot readily use the sample to describe the true probability of disease in the population


• Thus, we see that the intercept $\alpha$ is no longer meaningful in a logistic regression using case-control sampled data
  § What about the other estimates?

• Recall we showed that using contingency tables to compute odds ratios was valid both in prospective and retrospective sampling designs

• It turns out that the same is true for the estimated coefficients in logistic regression! Just as before,

$$\widehat{OR} = e^{\hat{\beta}_j}$$

  § all other inference (tests, CIs) is also the same

• The estimated odds ratio of the outcome between exposed and unexposed groups is the same even if the ‘absolute’ proportion of cases sampled is higher
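A small numeric sketch makes the point concrete for a single dichotomous exposure. The counts below are invented, and the "sample" uses expected counts (every case kept, 10% of non-cases kept) rather than a random draw, so sampling variability is ignored.

```python
import math


def odds_ratio(exposed_cases, exposed_noncases, unexposed_cases, unexposed_noncases):
    """Cross-product odds ratio from a 2x2 table."""
    return (exposed_cases * unexposed_noncases) / (exposed_noncases * unexposed_cases)


def log_odds_unexposed(unexposed_cases, unexposed_noncases):
    """Plays the role of the intercept when exposure is the only covariate."""
    return math.log(unexposed_cases / unexposed_noncases)


# Hypothetical population counts.
population = {"exposed_cases": 30, "exposed_noncases": 970,
              "unexposed_cases": 10, "unexposed_noncases": 990}

# Case-control sampling: all cases, an expected 10% of non-cases.
control_fraction = 0.1
sample = {"exposed_cases": 30, "exposed_noncases": 970 * control_fraction,
          "unexposed_cases": 10, "unexposed_noncases": 990 * control_fraction}

or_population = odds_ratio(**population)
or_sample = odds_ratio(**sample)
intercept_shift = (log_odds_unexposed(10, 990 * control_fraction)
                   - log_odds_unexposed(10, 990))

print(or_population, or_sample, intercept_shift)
```

The odds ratio is unchanged because the sampling fraction cancels in the cross-product, while the log odds among the unexposed shifts by the log of the ratio of sampling fractions.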


Matched case-control designs

• To further increase the statistical efficiency of a study, researchers may create a matched case-control design
  § For every case sampled, one or more controls are selected based on similarity to the case
     o Matching each case with $q$ controls is called 1:q matching
  § Goal is to correct for potential confounding in the study design
  § e.g., match each case with a noncase of the same age and gender, resulting in two groups having the same distributions of age and gender

• As with standard case-control, analysis then measures association between an exposure of interest and the outcome
  § Exposure of interest is not a factor used for matching

• Matched designs balance the increased cost of matching each subject against higher power and potential for causal inference


Analyzing matched case-control designs

• Suppose the sample includes $m$ matched sets. How should we approach analysis?

• Naïve approach: choose one matched set to be ‘baseline,’ and include $m-1$ indicator variables, one for each other set
  § Essentially, treat each matched set as a level of a categorical variable

• Such a model forces us to estimate the effect of exposure within groups that may only have a few people in them
  § Unstable estimation
  § Cannot generalize estimated comparisons of specific pairs of people

• Instead, we want an analysis that estimates the exposure effect by aggregating across matched sets


Conditional logistic regression

• Instead, researchers use conditional logistic regression to estimate the effect of an exposure of interest, conditioning out the factors used to create the matched sets

• To illustrate, assume a matched pairs design. Let
  § $y_{i1} = 1$, $y_{i2} = 0$ be the disease indicators of the $i$th case-control pair
  § $x_{i11}, \ldots, x_{i1K}$, $x_{i21}, \ldots, x_{i2K}$ be the covariates of the $i$th pair
     o Does not include ‘matched on’ factors, which are accounted for in design

• Then for each pair, define the conditional likelihood contribution

$$L_i(\beta_1, \ldots, \beta_K) = \frac{P(y_{i1} = 1 \cap y_{i2} = 0)}{P(y_{i1} = 1 \cap y_{i2} = 0) + P(y_{i1} = 0 \cap y_{i2} = 1)} = \frac{e^{\sum_{j=1}^{K} \beta_j x_{i1j}}}{e^{\sum_{j=1}^{K} \beta_j x_{i1j}} + e^{\sum_{j=1}^{K} \beta_j x_{i2j}}}$$
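The contribution of one pair can be sketched directly from the formula above. The function name `pair_contribution` and the coefficient values are hypothetical; covariate vectors are plain Python lists.

```python
import math


def pair_contribution(beta, case_x, control_x):
    """Conditional likelihood contribution L_i for one matched case-control pair."""
    case_score = sum(b * x for b, x in zip(beta, case_x))
    control_score = sum(b * x for b, x in zip(beta, control_x))
    # Algebraically equal to e^case / (e^case + e^control),
    # written in terms of the difference between the two scores.
    return 1.0 / (1.0 + math.exp(control_score - case_score))


# One binary exposure: case exposed, control unexposed.
print(pair_contribution([0.7], [1], [0]))

# A covariate that takes the same value in both members of the pair
# has no effect on the contribution.
print(pair_contribution([0.7, 0.3], [1, 5], [0, 5]))
print(pair_contribution([0.7, 0.3], [1, 9], [0, 9]))
```

Only the difference between the case's and the control's covariates enters the contribution, which is why anything shared within a pair, including a pair-specific intercept, drops out.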


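• Thus, we can compute estimates that maximize the conditional likelihood, i.e., the product of the contributions over the $m$ matched pairs (equivalently, the sum of their logs)

$$\hat{\boldsymbol{\beta}} = \arg\max_{\boldsymbol{\beta}} \prod_{i=1}^{m} L_i(\beta_1, \ldots, \beta_K)$$

• If $\beta_j$ is the coefficient of the exposure of interest, then as before $\widehat{OR} = e^{\hat{\beta}_j}$
  § Standard methods for testing and CIs are all the same as before

• Note that because we already adjusted for factors used for matching, we do not get estimated effects for these factors
  § It would be inappropriate to include matching factors as covariates

• We also do not get an estimated intercept, which makes sense because the intercept is not interpretable in the case-control setting anyway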
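For a single dichotomous exposure in a matched pairs design, the maximization can be sketched with a few Newton steps on the log conditional likelihood. The pair counts below are invented, and `fit_conditional_beta` is a hypothetical helper, not a library routine. Each pair is summarized by the difference $d_i$ = (case exposure) - (control exposure).

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def fit_conditional_beta(differences, iterations=25):
    """Maximize sum_i log L_i(beta) for one covariate by Newton's method."""
    beta = 0.0
    for _ in range(iterations):
        # log L_i = -log(1 + exp(-beta * d_i)); derivatives with respect to beta:
        gradient = sum(d * (1.0 - sigmoid(beta * d)) for d in differences)
        curvature = -sum(d * d * sigmoid(beta * d) * (1.0 - sigmoid(beta * d))
                         for d in differences)
        if curvature == 0.0:  # no discordant pairs: beta is not identified
            break
        beta -= gradient / curvature
    return beta


# Hypothetical data: 30 pairs with only the case exposed, 12 pairs with only
# the control exposed, 40 pairs with matching exposure status.
differences = [1] * 30 + [-1] * 12 + [0] * 40
beta_hat = fit_conditional_beta(differences)
print(beta_hat, math.exp(beta_hat))
```

In this special case the estimate has a closed form: the estimated odds ratio is the ratio of the two kinds of discordant pairs (30/12 here), and pairs with matching exposure status contribute nothing.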


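Logistic regression modeling for prediction

• Using a fitted logistic regression model, we so far have focused on estimation and inference of associations between the covariates and the outcome in the form of odds ratios

• We can also predict individual probabilities of experiencing the outcome using the fitted model:

• From our model $\operatorname{logit}(p) = \alpha + \beta_1 x_1 + \cdots + \beta_K x_K$, we can rearrange to get a predicted probability $\hat{p}$. Writing $\lambda$ for the right-hand side of the model:

$$\ln\frac{p}{1-p} = \lambda \iff \frac{p}{1-p} = e^{\lambda} \iff p = e^{\lambda} - p\,e^{\lambda} \iff p\,(1 + e^{\lambda}) = e^{\lambda} \iff p = \frac{e^{\lambda}}{1 + e^{\lambda}}$$

• Therefore, we see that our regression model leads to predicted probabilities

$$\hat{p} = \frac{e^{\hat{\alpha} + \hat{\beta}_1 x_1 + \cdots + \hat{\beta}_K x_K}}{1 + e^{\hat{\alpha} + \hat{\beta}_1 x_1 + \cdots + \hat{\beta}_K x_K}}$$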
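A minimal sketch of this prediction step follows. The coefficient values are made up for illustration, and the covariate order (age, income, woman) is an assumption.

```python
import math


def predicted_probability(alpha_hat, beta_hat, x):
    """p_hat = e^lambda / (1 + e^lambda) for one individual's covariates x."""
    linear_predictor = alpha_hat + sum(b * xi for b, xi in zip(beta_hat, x))
    # 1 / (1 + e^(-lambda)) is the same quantity as e^lambda / (1 + e^lambda).
    return 1.0 / (1.0 + math.exp(-linear_predictor))


# Hypothetical fitted values: intercept, then age, income, woman.
alpha_hat = -3.0
beta_hat = [0.04, 0.00002, 0.5]

# Predicted probability of a checkup for a 60-year-old woman earning 32,000.
p_hat = predicted_probability(alpha_hat, beta_hat, [60, 32_000, 1])
print(round(p_hat, 3))
```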


• We may even want to predict individuals’ outcome status, using the model to predict whether or not they will experience the outcome
  § e.g., build a risk prediction model to predict who might develop a disease

• Just as in the linear regression case, prediction introduces important considerations of model selection and prediction validation


Variable selection for prediction

• Variable selection for prediction is also similar to the linear regression setting, and can use similar techniques:
  1) Fixed set by design (treatment indicator + background variables)
  2) Fit all possible subsets of models and find the one that fits the best according to some criterion:
     § AIC or BIC
     § Predictive performance by (cross-validated) AUC
  3) Sequential: forward/backward/stepwise selection
  4) Regularized/penalized regression method


Model selection criteria

• Logistic regression models are fit by maximum likelihood, so if two models have the same number of parameters, choose the one with the higher final likelihood value
  § Similar to linear regression, aim to balance final likelihood ($\hat{L}$) and number of parameters ($k$).
  § General form of any criterion: $f(\hat{L}) + g(k)$

• Same criteria used for linear regression are available here:
  § Akaike’s Information Criterion: $\mathrm{AIC} = -2\ln\hat{L} + 2k$
  § Bayesian Information Criterion: $\mathrm{BIC} = -2\ln\hat{L} + k\ln n$, where $n$ is the sample size
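Both criteria are simple functions of the maximized log-likelihood. The sketch below assumes the software reports $\ln\hat{L}$ directly; the log-likelihood values and parameter counts are invented. For both criteria, lower values are preferred.

```python
import math


def aic(log_likelihood, k):
    """Akaike's Information Criterion: -2 ln(L) + 2k."""
    return -2.0 * log_likelihood + 2.0 * k


def bic(log_likelihood, k, n):
    """Bayesian Information Criterion: -2 ln(L) + k ln(n)."""
    return -2.0 * log_likelihood + k * math.log(n)


# Hypothetical comparison on n = 100 subjects: the larger model improves
# the log-likelihood only slightly.
small = {"log_likelihood": -100.0, "k": 3}
large = {"log_likelihood": -99.5, "k": 4}

for name, model in (("small", small), ("large", large)):
    print(name, aic(**model), round(bic(n=100, **model), 2))
```

In this invented comparison the extra parameter raises the log-likelihood by only 0.5, which is not enough to offset either penalty, so both criteria favor the smaller model.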


Binary prediction validation

• To measure predictive performance for binary outcomes, one approach returns to our discussion of diagnostic testing

• Recall, in diagnostic testing it is important to balance
  § Correctly testing positive for true disease cases (‘sensitivity’)
  § Correctly testing negative for true non-cases (‘specificity’)

                     Test positive (T+)               Test negative (T-)
Disease (D+)         True Positive (TP): D+ ∩ T+      False Negative (FN): D+ ∩ T-
No Disease (D-)      False Positive (FP): D- ∩ T+     True Negative (TN): D- ∩ T-


• Binary prediction is nearly identical, where instead of ‘testing’ we are ‘predicting’ disease status
  § Want to correctly predict disease in true cases, and correctly predict no disease in true non-cases

• How do we convert predicted individual probabilities $\hat{p}$ into discrete ‘case’ or ‘non-case’ predictions?
  § we must choose an arbitrary cutoff value, e.g., “if $\hat{p}_i > 0.5$ then the $i$th individual is predicted to be a case”


• How to choose the cutoff value for the Vit. E level in serum?

• For a group with KNOWN disease status, let’s list some possible cutoff values. First, we’ll see how many disease vs. no disease patients fall on either side of each cutoff…

Proportion of patients with value ≥ cutoff, by predicted probability cutoff for predicted ‘case’:

Cutoff        0.01   0.20   0.40   0.60   0.80   0.99
Disease       0.95   0.87   0.73   0.54   0.34   0.17
No Disease    0.91   0.68   0.38   0.12   0.02   0.002


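• Several important relationships here (note that these are conditional probabilities, computed within the case and noncase groups):
  § $P(\hat{p}_i \geq \text{cutoff} \mid \text{patient } i \text{ is a case}) = \text{sensitivity}$
  § $P(\hat{p}_i \geq \text{cutoff} \mid \text{patient } i \text{ is a noncase}) = 1 - \text{specificity}$

• We can summarize the testing possibilities by plotting sensitivity vs. (1 - specificity)…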


Predicted probability cutoff for ‘case’   0.01   0.20   0.40   0.60   0.80   0.99
Sensitivity                               0.95   0.87   0.73   0.54   0.34   0.17
1 - Specificity                           0.91   0.68   0.38   0.12   0.02   0.002


• A receiver operating characteristic (ROC) curve for a test is a plot of sensitivity vs. (1 - specificity)
  § A test’s ROC curve helps us choose an optimal cutoff point. It also shows us how useful a test is overall.

• The Area Under the Curve (AUC) is a single number summarizing a test’s ability to discriminate between true positives and true negatives.
  § AUC is the probability, for a randomly chosen case and noncase, that the case will have the higher predicted probability; 0.5 is a ‘coin toss’


Cross-validation

• Cross-validation extends to the binary classification setting using prediction metrics like AUC

• The available data set is divided into two (or 3) random parts.
  § Training set is used to fit the model.
  § Test set is used to check the predictive capability (e.g., AUC) and refine the model. Go back to training if needed.
  § Optional: Validation set used once to estimate model’s true AUC.

• If data set is smaller or you do not want to set aside data, can still estimate AUC using k-fold cross validation


• In k-fold cross validation, the data is split into $k$ groups, and the model is repeatedly fit on all but one group, then its ability to predict the left-out group is recorded
  § average AUC over all $k$ groups estimates predictive performance on ‘new’ data set
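The procedure can be sketched generically. In the code below, `fit` and `score` are hypothetical callables supplied by the user (for example, a logistic regression fitter and an AUC function); they are placeholders, not part of any particular library.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the row indices and deal them into k roughly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[fold::k] for fold in range(k)]


def cross_validated_score(rows, k, fit, score, seed=0):
    """Average of score(model, held_out_rows) over the k folds."""
    scores = []
    for held_out in k_fold_indices(len(rows), k, seed):
        held_out_set = set(held_out)
        training_rows = [row for i, row in enumerate(rows) if i not in held_out_set]
        test_rows = [rows[i] for i in held_out]
        model = fit(training_rows)              # fit on all but one fold
        scores.append(score(model, test_rows))  # evaluate on the left-out fold
    return sum(scores) / len(scores)


# Every row lands in exactly one held-out fold.
folds = k_fold_indices(10, 3)
print([len(fold) for fold in folds])
```

Each observation is used for evaluation exactly once and for fitting in the other k - 1 rounds, so no data needs to be permanently set aside.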