Stata Lab 2 – Basics and Logistic Regression 2016 SOLUTIONS

….1. Teaching\stata\stata version 14\stata version 14 – SPRING 2016\Stata Lab 2 – Basics and Logistic Regression 2016 SOLUTIONS.docx Page 1 of 23

BIOSTATS 640 Spring 2016

At Your Request! Stata Lab #2

Basics & Logistic Regression

1. Start a log
2. Read in a dataset
3. Familiarize yourself with the data
4. Create “1/2” Variables when you want to use command tab2
5. Create “0/1” Variables when you want to use commands cc, cs
6. Fit a Logistic Regression Model
7. Compare Models Side-by-Side
8. Perform a Likelihood Ratio Test
9. Regression Diagnostics for Logistic Regression: Numerical
   a. Numerical Measures of Fit Using fitstat
   b. Test of Model Adequacy Using linktest
   c. Test of Overall Goodness-of-Fit Using lfit
10. Regression Diagnostics for Logistic Regression: Graphical
   a. Plot of ROC Curve Using lroc
   b. Plot of Standardized Residuals versus Observation Number
   c. Plot of Influential Observations Using Cook’s Distances

___1. Start a log

Tip - Always keep a log of your Stata session.

Start a log of your session, taking care to save it as a “.log” file and not as a “.smcl” file.

Solution
Launch Stata. From the main menu at upper left:
FILE > LOG > BEGIN
   > From the file format drop-down: choose log
   > At Where: choose a folder you will remember!
   > At Save As: name your log
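The menu steps for starting a log also have a command equivalent. The lines below are a minimal sketch, assuming a hypothetical log name (stata_lab2.log) written to the current working directory. The text option requests a plain-text “.log” file rather than “.smcl”, and replace overwrites any existing file of the same name.

```stata
* Hypothetical file name; substitute your own folder and name
log using "stata_lab2.log", text replace

* ... commands for the session go here ...
display "This line is recorded in the log"

* Close the log when the session is finished
log close
```

Typing the command (or saving it in a do-file) makes the session easier to repeat than clicking through the menus.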

___2. Read in a data set

From the public course website page ASSIGNMENTS, download the data set illeetvilaine.dta. In Stata, open this data set.

Solution
From the main menu at upper left: FILE > OPEN
Browse the folders on your computer: choose illeetvilaine.dta

___3. Familiarize yourself with the dataset

Stata offers several commands for describing a dataset, including describe and codebook. From the command window, use the help command to learn about describe and codebook and their various options. Next, play around with various choices to see which descriptions you like best!

Solution
. describe

Contains data from /Users/cbigelow/Desktop/1. Teaching/web640/datasets/illeetvilaine.dta
  obs:           975
 vars:             7                          21 Mar 2014 15:25
 size:        27,300
---------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
---------------------------------------------------------------------------------
case            float   %9.0g                 Case status (1=case, 0=control)
age             float   %9.0g
agegp           float   %9.0g      agegp      Age group
tob             float   %9.0g                 Tobacco consumption gm/day
tobgp           float   %9.0g      tobgp      Grouped tobacco consum.
alc             float   %9.0g                 Alcohol consumption gm/day
alcgp           float   %9.0g      alcgp      Grouped alcohol consum.

. codebook, compact

Variable   Obs Unique      Mean  Min  Max  Label
---------------------------------------------------------------------------------
case       975      2  .2051282    0    1  Case status (1=case, 0=control)
age        975     61  52.22667   25   91
agegp      975      6  3.271795    1    6  Age group
tob        975      9  11.74872    0   60  Tobacco consumption gm/day
tobgp      975      4  1.763077    1    4  Grouped tobacco consum.
alc        975    162  52.76923    0  268  Alcohol consumption gm/day
alcgp      975      4  1.855385    1    4  Grouped alcohol consum.
---------------------------------------------------------------------------------

. label list
agegp:
           1 25-34
           2 35-44
           3 45-54
           4 55-64
           5 65-74
           6 75+
tobgp:
           1 0-9 gm/day
           2 10-19
           3 20-29
           4 30+
alcgp:
           1 0-39 gm/day
           2 40-79
           3 80-119
           4 120+

Ille-et-Vilaine Data: Illustration
Suppose we are interested in the 2x2 cross-classification of heavy smoking (30+ gm/day versus other) and case status (esophageal cancer case versus control):

                              Disease (Esophageal Cancer)
Exposure (Heavy Smoking)        Yes        No     Total
Yes (30+ gm/day)                 31        51        82
No                              169       724       893
Total                           200       775       975

Stata has commands cc and cs for epidemiological analyses of 2x2 tables. The layout of the output is nice. Cases are in row 1 (controls in row 2) and exposed are in column 1 (non-exposed in column 2).
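When only the four cell counts are at hand, the immediate form cci gives the same case-control analysis without a dataset in memory. The sketch below uses the counts from the 2x2 table above; the hand calculation of the odds ratio as the cross-product ratio is added here purely as a check.

```stata
* cci takes the cell counts in this order:
*   exposed cases, unexposed cases, exposed controls, unexposed controls
cci 31 169 51 724

* Cross-product odds ratio by hand: (31 x 724) / (169 x 51)
display "Odds ratio by hand = " (31*724)/(169*51)
```

The odds ratio stored by cci in r(or) should agree with the hand calculation.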

___4. Create “1/2” Variables when you want to use command tab2

Why the fuss? Answer – Sometimes the arrangement of rows and columns in a 2x2 table is not what you expected or wanted!

tab2
Stata will order the rows and columns according to the numeric values of the row and column variables.
For a 0/1 variable, row 1 will be the value “0” row. Row 2 will be the value “1” row.
For a 1/2 variable, row 1 will be the value “1” row. Row 2 will be the value “2” row.
Columns are ordered similarly.

cc, cs
Stata assumes that you are using 0/1 variables here, with 1=event and 0=non-event.
Stata will order the rows and columns according to event, with event being the first row (or column).
Thus, row 1 will be the value “1=event” row. Row 2 will be the value “0=non-event” row.
Columns are ordered similarly.

The variable tobgp has four values (1, 2, 3, and 4) representing increasing levels of smoking. Create a variable that you name exposure12 defined as follows:
exposure12 = 1 if tobgp = 4
             2 if tobgp = 1, 2, or 3

The variable case is a 0/1 variable denoting case status. Create a variable that you name case12 defined as follows:
case12 = 1 if case = 1
         2 if case = 0

Tip – Always check your variable creation work. E.g., issue the command tab2 tobgp exposure12.

Solution
. * Create "1/2" variables when you want to use command tab2
. * “1/2” measure of heavy smoking (1=30+ gm/day versus 2=other)
. * Exposure will be heavy smoking defined as tobgp=4 (30+ gm/day)
. generate exposure12=tobgp
. recode exposure12 (1=2) (2=2) (3=2) (4=1)
(exposure12: 739 changes made)
. label define exposure12f 2 "other" 1 "heavy"
. label values exposure12 exposure12f

. * "1/2" variable for case status (1=case versus 2=other)
. generate case12=case
. recode case12 (0=2)
(case12: 775 changes made)
. label define case12f 2 "control" 1 "case"
. label values case12 case12f
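The generate / recode / label sequence can also be collapsed into a single recode call. The following is a sketch only, not the lab's solution: it rebuilds tobgp from the group frequencies reported in this lab (526, 236, 131, 82) so that it runs on its own, and the name exposure12_alt is hypothetical. With the generate() option, recode leaves tobgp untouched and attaches the value labels given in the rules.

```stata
* Rebuild tobgp from its reported frequencies so the sketch is self-contained
clear
input tobgp freq
1 526
2 236
3 131
4 82
end
expand freq

* One-step version: 4 -> 1 "heavy", 1 through 3 -> 2 "other"
recode tobgp (4=1 "heavy") (1/3=2 "other"), generate(exposure12_alt)

* Always check the variable creation
tabulate tobgp exposure12_alt
```

Because the original variable is never overwritten, a mistake in the recode rules is easy to spot in the check table and fix.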

. * Check variable creations
. tab2 tobgp exposure12

-> tabulation of tobgp by exposure12

   Grouped |
   tobacco |      exposure12
   consum. |     heavy      other |     Total
-----------+----------------------+----------
0-9 gm/day |         0        526 |       526
     10-19 |         0        236 |       236
     20-29 |         0        131 |       131
       30+ |        82          0 |        82
-----------+----------------------+----------
     Total |        82        893 |       975

. * tab2 with 1/2 variables - more to your liking?
. tab2 exposure12 case12

-> tabulation of exposure12 by case12

           |        case12
exposure12 |      case    control |     Total
-----------+----------------------+----------
     heavy |        31         51 |        82
     other |       169        724 |       893
-----------+----------------------+----------
     Total |       200        775 |       975

Nice. Heavy exposure is now row 1 and cases are now in column 1.

___5. Create “0/1” Variables when you want to use commands cc, cs

What is this about? The command cc is for case-control designs and cs is for cohort designs!

Solution
. * Create "0/1" variables when you want to use commands cc, cs
. * “0/1” measure of heavy smoking (1=30+ gm/day versus 0=other)
. * Exposure will be heavy smoking defined as tobgp=4 (30+ gm/day)
. generate exposure01=tobgp
. recode exposure01 (1=0) (2=0) (3=0) (4=1)
(exposure01: 975 changes made)
. label define exposure01f 0 "other" 1 "heavy"
. label values exposure01 exposure01f

. * "0/1" variable for case status (1=case versus 0=other)
. * This already exists as the variable case
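A 0/1 indicator can also be built from a logical expression, which Stata evaluates to 1 when true and 0 when false. The sketch below is an alternative to the recode approach, not the lab's solution. It rebuilds tobgp from the reported frequencies so that it runs on its own, and the name exposure01_alt is hypothetical. The if !missing() qualifier keeps missing tobacco values missing instead of silently coding them as 0.

```stata
* Rebuild tobgp from its reported frequencies so the sketch is self-contained
clear
input tobgp freq
1 526
2 236
3 131
4 82
end
expand freq

* (tobgp == 4) is 1 for heavy smokers and 0 otherwise
generate byte exposure01_alt = (tobgp == 4) if !missing(tobgp)
label define exposure01_altf 0 "other" 1 "heavy"
label values exposure01_alt exposure01_altf

* Always check the variable creation
tabulate tobgp exposure01_alt
```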

. * Check variable creations
. tab2 tobgp exposure01

-> tabulation of tobgp by exposure01

   Grouped |
   tobacco |      exposure01
   consum. |     other      heavy |     Total
-----------+----------------------+----------
0-9 gm/day |       526          0 |       526
     10-19 |       236          0 |       236
     20-29 |       131          0 |       131
       30+ |         0         82 |        82
-----------+----------------------+----------
     Total |       893         82 |       975

. * Use the command cc for case-control. Use the command cs for cohort.
. cc case exposure01
                                                         Proportion
                 |   Exposed   Unexposed  |      Total     Exposed
-----------------+------------------------+------------------------
           Cases |        31         169  |        200       0.1550
        Controls |        51         724  |        775       0.0658
-----------------+------------------------+------------------------
           Total |        82         893  |        975       0.0841
                 |                        |
                 |      Point estimate    |    [95% Conf. Interval]
                 |------------------------+------------------------
      Odds ratio |         2.604014       |    1.557944      4.2894 (exact)
 Attr. frac. ex. |         .6159775       |    .3581283    .7668672 (exact)
 Attr. frac. pop |         .0954765       |
                 +-------------------------------------------------
                               chi2(1) =    16.42  Pr>chi2 = 0.0001

eststo, estout, esttab
Stata has a set of commands for saving the results of fitting models (eststo) and then using the saved results to produce a side-by-side comparison of models (estout and esttab). To save (for later use) the results of fitting the current model, the command is:
eststo yourchoiceofname
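The commands eststo, estout, and esttab are not part of official Stata; they come from the user-written estout package on SSC, so a one-time installation is needed before they will run. The sketch below shows the store-then-compare workflow on Stata's built-in auto data rather than the lab data, and the names demo1 and demo2 are hypothetical.

```stata
* One-time installation from SSC (requires an internet connection)
ssc install estout, replace

* Illustration on the built-in auto data, not the Ille-et-Vilaine data
sysuse auto, clear

quietly logistic foreign mpg
eststo demo1

quietly logistic foreign mpg weight
eststo demo2

* Side-by-side table of odds ratios for the two stored models
esttab demo1 demo2, eform
```

Each eststo call stores whichever model was fit most recently, so it should follow the fit immediately.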

___6. Fit a Logistic Regression Model

Ille-et-Vilaine Data: Illustration
After creating another 3 new variables for illustration purposes, we will fit 4 logistic regression models. In #7, we’ll then compare them.

Model 1: Predictors = heavy drinking, age
Model 2: Predictors = heavy smoking, age
Model 3: Predictors = heavy drinking, heavy smoking, age
Model 4: Predictors = heavy drinking, heavy smoking, drinking x smoking interaction, age

__6a) Create 3 new variables:
   i) alcohol_80plus = 0/1 measure of alcohol use >= 80 gm/day,
   ii) smoking_30plus = 0/1 measure of tobacco use >= 30 gm/day,
   iii) drinker_smoker = interaction of heavy drinking and heavy smoking

Solution
. * HEAVY DRINKER: Create alcohol_80plus = 0/1 measure of alcohol use >=80 gm/day.
. generate alcohol_80plus=alcgp
. recode alcohol_80plus (1=0) (2=0) (3=1) (4=1)
(alcohol_80plus: 975 changes made)
. label define alcoholf 0 "< 80 gm/day" 1 "80+ gm/day"
. label values alcohol_80plus alcoholf
. label variable alcohol_80plus "Heavy Drinker"

. * Check variable creation
. tab2 alcgp alcohol_80plus

-> tabulation of alcgp by alcohol_80plus

    Grouped |
    alcohol |     Heavy Drinker
    consum. | < 80 gm/d  80+ gm/da |     Total
------------+----------------------+----------
0-39 gm/day |       414          0 |       414
      40-79 |       355          0 |       355
     80-119 |         0        139 |       139
       120+ |         0         67 |        67
------------+----------------------+----------
      Total |       769        206 |       975

. * HEAVY SMOKER: Create smoking_30plus = 0/1 measure of tobacco use >=30 gm/day.
. generate smoking_30plus=tobgp
. recode smoking_30plus (1=0) (2=0) (3=0) (4=1)
(smoking_30plus: 975 changes made)
. label define smokingf 0 "< 30 gm/day" 1 "30+ gm/day"
. label values smoking_30plus smokingf

. * Check variable creation
. numlabel, add
. tab2 tobgp smoking_30plus

-> tabulation of tobgp by smoking_30plus

      Grouped |
      tobacco |    smoking_30plus
      consum. | 0. < 30 g  1. 30+ gm |     Total
--------------+----------------------+----------
1. 0-9 gm/day |       526          0 |       526
     2. 10-19 |       236          0 |       236
     3. 20-29 |       131          0 |       131
       4. 30+ |         0         82 |        82
--------------+----------------------+----------
        Total |       893         82 |       975

. * INTERACTION: Create drinker_smoker = interaction of heavy drinking and heavy smoking
. generate drinker_smoker=alcohol_80plus*smoking_30plus
. label variable drinker_smoker "Interaction alcohol*smoking"

__6b) Model 1: Predictors = heavy drinking, age. After fit, issue the command: eststo model1

Solution
. * MODEL 1 –
. * Logistic Regression Heavy Drinking Alone - adjusted for age
. logistic case alcohol_80plus i.agegp

Logistic regression                             Number of obs     =        975
                                                LR chi2(6)        =     199.30
                                                Prob > chi2       =     0.0000
Log likelihood = -395.09465                     Pseudo R2         =     0.2014

--------------------------------------------------------------------------------
          case | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
---------------+----------------------------------------------------------------
alcohol_80plus |   5.228385   .9892462     8.74   0.000     3.608397    7.575664
               |
         agegp |
     2. 35-44  |   4.683066   4.991105     1.45   0.147     .5798813       37.82
     3. 45-54  |   24.53994   25.10521     3.13   0.002     3.304189    182.2562
     4. 55-64  |    40.6956   41.44652     3.64   0.000     5.528909    299.5404
     5. 65-74  |   52.77508   53.98653     3.88   0.000     7.107029     391.895
       6. 75+  |   52.41941   55.81879     3.72   0.000     6.502652    422.5653
               |
         _cons |   .0064139   .0064735    -5.00   0.000     .0008872    .0463704
--------------------------------------------------------------------------------

. * ESTSTO to save results for later comparison
. eststo model1

__6c) Model 2: Predictors = heavy smoking, age. After fit, issue the command: eststo model2

Solution
. * MODEL 2 –
. * Logistic Regression Heavy Smoking Alone - adjusted for age
. logistic case smoking_30plus i.agegp

Logistic regression                             Number of obs     =        975
                                                LR chi2(6)        =     145.72
                                                Prob > chi2       =     0.0000
Log likelihood = -421.88661                     Pseudo R2         =     0.1473

--------------------------------------------------------------------------------
          case | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
---------------+----------------------------------------------------------------
smoking_30plus |   4.211492   1.205916     5.02   0.000     2.402722    7.381906
               |
         agegp |
     2. 35-44  |   6.267996   6.675411     1.72   0.085     .7773197    50.54262
     3. 45-54  |   38.39114   39.30348     3.56   0.000     5.161798     285.536
     4. 55-64  |   65.17199   66.48418     4.09   0.000      8.82513    481.2834
     5. 65-74  |   82.44814   84.59853     4.30   0.000     11.03516    616.0035
       6. 75+  |    59.4483   63.32511     3.84   0.000     7.369337    479.5683
               |
         _cons |   .0060567   .0061361    -5.04   0.000     .0008315    .0441165
--------------------------------------------------------------------------------

. * ESTSTO to save results for later comparison
. eststo model2

__6d) Model 3: Predictors = heavy drinking, heavy smoking, age. After fit, issue: eststo model3

Solution
. * MODEL 3 –
. * Logistic Regression Heavy Drinking and Heavy Smoking - adjusted for age
. logistic case alcohol_80plus smoking_30plus i.agegp

Logistic regression                             Number of obs     =        975
                                                LR chi2(7)        =     219.23
                                                Prob > chi2       =     0.0000
Log likelihood = -385.12755                     Pseudo R2         =     0.2216

--------------------------------------------------------------------------------
          case | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
---------------+----------------------------------------------------------------
alcohol_80plus |   5.122038   .9859232     8.49   0.000     3.512352    7.469432
smoking_30plus |   3.989895   1.226166     4.50   0.000     2.184601    7.287035
               |
         agegp |
     2. 35-44  |   6.300504   6.788536     1.71   0.088     .7625007    52.06074
     3. 45-54  |   33.11226   34.34676     3.37   0.001     4.335635    252.8861
     4. 55-64  |   56.18964   58.08311     3.90   0.000     7.409194    426.1294
     5. 65-74  |   80.93014   84.28445     4.22   0.000     10.51063    623.1486
       6. 75+  |   71.41624   77.18044     3.95   0.000     8.588013    593.8835
               |
         _cons |   .0040597   .0041822    -5.35   0.000      .000539    .0305757
--------------------------------------------------------------------------------

. * ESTSTO to save results for later comparison
. eststo model3

__6e) Model 4: Predictors = heavy drinking, heavy smoking, drinking x smoking interaction, age. Issue: eststo model4

Solution
. * MODEL 4 –
. * Logistic Regression Heavy Drinking and Heavy Smoking PLUS INTERACTION - adjusted
. logistic case alcohol_80plus smoking_30plus i.agegp drinker_smoker

Logistic regression                             Number of obs     =        975
                                                LR chi2(8)        =     219.35
                                                Prob > chi2       =     0.0000
Log likelihood = -385.07068                     Pseudo R2         =     0.2217

--------------------------------------------------------------------------------
          case | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
---------------+----------------------------------------------------------------
alcohol_80plus |   5.018372   1.012542     7.99   0.000     3.379235    7.452591
smoking_30plus |   3.726905   1.379709     3.55   0.000     1.803979    7.699546
               |
         agegp |
     2. 35-44  |   6.563396   7.142474     1.73   0.084     .7777266    55.38986
     3. 45-54  |   34.21248   35.78455     3.38   0.001     4.404231    265.7657
     4. 55-64  |   58.12253   60.60994     3.90   0.000     7.528611    448.7187
     5. 65-74  |   83.55392   87.73638     4.21   0.000     10.66981    654.3001
       6. 75+  |   74.15214   80.85386     3.95   0.000     8.749685    628.4273
               |
drinker_smoker |   1.251609    .839525     0.33   0.738     .3361396    4.660342
         _cons |   .0039536   .0041028    -5.33   0.000     .0005172    .0302218
--------------------------------------------------------------------------------

. * ESTSTO to save results for later comparison
. eststo model4
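One way to read the interaction term in Model 4: because the model is additive on the log odds scale, the odds ratio comparing a heavy drinker who is also a heavy smoker with someone who is neither (same age group) is the product of the three odds ratios for alcohol_80plus, smoking_30plus, and drinker_smoker. The arithmetic below is an illustrative check using the estimates printed above; it is not part of the lab's solution, and the scalar names are hypothetical.

```stata
* Model 4: joint odds ratio = OR(alcohol) x OR(smoking) x OR(interaction)
scalar or_joint_m4 = 5.018372 * 3.726905 * 1.251609
display "Model 4 joint OR (both vs neither) = " %6.2f scalar(or_joint_m4)

* Model 3 has no interaction term, so the joint OR is the product of two ORs
scalar or_joint_m3 = 5.122038 * 3.989895
display "Model 3 joint OR (both vs neither) = " %6.2f scalar(or_joint_m3)
```

The two joint odds ratios are of similar size (roughly 23 versus 20), which is consistent with the interaction estimate having a wide confidence interval that includes 1.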

___7. Compare Models Side-by-side

estout, esttab
Stata has 2 commands for producing a side-by-side comparison of models (estout and esttab). Use estout for simple reporting. Use esttab for tests of significance. Because we have fit 4 models at this point, the basic commands are:

estout model1 model2 model3 model4, option option option
esttab model1 model2 model3 model4, option option option

By default, Stata will give you a tabular summary of the betas. Of course, there are options you can give depending on what comparison of the 4 models you want to make. Here are some examples:

estout
Use this option                               If you want
, prehead(“titleyouprovide”)                  A title on your table
, eform                                       Odds ratios

esttab
Use this option                               If you want
, stats(n chi2 bic, star(chi2))               Chi square tests of Null: beta = 0
, eform stats(n chi2 bic, star(chi2))         Odds ratios & chi square tests of Null: OR=1

__7a) Simple: estout to display the regression coefficients (betas). Option prehead(“ “) to obtain a title

Solution
. estout model1 model2 model3 model4, prehead("Logistic Regression of Esophageal Cancer - BETA's")

Logistic Regression of Esophageal Cancer - BETA's
----------------------------------------------------------------
                   model1       model2       model3       model4
                        b            b            b            b
----------------------------------------------------------------
case
alcohol_80~s     1.654102                  1.633552     1.613106
1b.agegp                0            0            0            0
2.agegp          1.543953     1.835457      1.84063     1.881508
3.agegp          3.200302     3.647827     3.499904      3.53259
4.agegp           3.70612      4.17703     4.028732     4.062553
5.agegp          3.966039     4.412169     4.393586     4.425492
6.agegp          3.959277     4.085107     4.268525     4.306119
smoking_30~s                  1.437817     1.383765     1.315578
drinker_sm~r                                              .22443
_cons           -5.049291    -5.106586    -5.506652    -5.533117
----------------------------------------------------------------

Nice! We see the betas.

__7b) estout to display estimated odds ratios [ exp(beta) ]: use option eform

Solution
. estout model1 model2 model3 model4, eform prehead("Logistic Regression of Esophageal Cancer - ODDS RATIO's")

Logistic Regression of Esophageal Cancer - ODDS RATIO's
----------------------------------------------------------------
                   model1       model2       model3       model4
                        b            b            b            b
----------------------------------------------------------------
case
alcohol_80~s     5.228385                  5.122038     5.018372   This is 0/1 heavy drinker
1b.agegp                1            1            1            1   This is the reference agegroup
2.agegp          4.683066     6.267996     6.300504     6.563396
3.agegp          24.53994     38.39114     33.11226     34.21248
4.agegp           40.6956     65.17199     56.18964     58.12253
5.agegp          52.77508     82.44814     80.93014     83.55392
6.agegp          52.41941      59.4483     71.41624     74.15214
smoking_30~s                  4.211492     3.989895     3.726905   This is 0/1 heavy smoker
drinker_sm~r                                            1.251609   Interaction: heavy drinking x smoking
_cons            .0064139     .0060567     .0040597     .0039536   Intercept
----------------------------------------------------------------

The option “eform” stands for “exponentiated coefficients.” Thus, these are the odds ratios.
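The link between the beta table in 7a and the odds ratio table in 7b can be checked by hand with display. The sketch below uses the Model 1 estimate for alcohol_80plus (beta = 1.654102, odds ratio = 5.228385) as printed in the two tables; small differences in the last digits are rounding.

```stata
* exp(beta) recovers the odds ratio
display "exp(1.654102) = " %9.6f exp(1.654102)

* ln(odds ratio) recovers the beta
display "ln(5.228385)  = " %9.6f ln(5.228385)
```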

__7c) esttab to display chi square tests of Null: beta=0. Use option stats( n chi2 bic, star(chi2) )

Solution
. * BETAs with chi square statistics
. esttab model1 model2 model3 model4, stats(n chi2 bic, star(chi2)) prehead("Logistic Regression of Esophageal Cancer - BETA's")

Logistic Regression of Esophageal Cancer - BETA's
                      (1)             (2)             (3)             (4)
                     case            case            case            case
----------------------------------------------------------------------------
case
alcohol_80~s        1.654***                        1.634***        1.613***
                   (8.74)                          (8.49)          (7.99)

1b.agegp                0               0               0               0
                      (.)             (.)             (.)             (.)

2.agegp             1.544           1.835           1.841           1.882
                   (1.45)          (1.72)          (1.71)          (1.73)

3.agegp             3.200**         3.648***        3.500***        3.533***
                   (3.13)          (3.56)          (3.37)          (3.38)

4.agegp             3.706***        4.177***        4.029***        4.063***
                   (3.64)          (4.09)          (3.90)          (3.90)

5.agegp             3.966***        4.412***        4.394***        4.425***
                   (3.88)          (4.30)          (4.22)          (4.21)

6.agegp             3.959***        4.085***        4.269***        4.306***
                   (3.72)          (3.84)          (3.95)          (3.95)

smoking_30~s                        1.438***        1.384***        1.316***
                                   (5.02)          (4.50)          (3.55)

drinker_sm~r                                                        0.224
                                                                   (0.33)

_cons              -5.049***       -5.107***       -5.507***       -5.533***
                  (-5.00)         (-5.04)         (-5.35)         (-5.33)
----------------------------------------------------------------------------
n
chi2                199.3***        145.7***        219.2***        219.3***
bic                 838.4           892.0           825.3           832.1
----------------------------------------------------------------------------
t statistics in parentheses
* p<0.05, ** p<0.01, *** p<0.001

This tabulation shows the betas. Underneath each beta, in parentheses, is the value of the t-statistic = (beta/standard error). For a logistic model these are the same values that logistic reports in its z column; esttab simply labels them t.
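Note that the n row in the table above printed blank. A likely explanation, offered here as an assumption rather than something the lab states, is that Stata names are case sensitive and logistic stores the number of observations as e(N), so stats(n ...) finds nothing to print. The sketch below illustrates the capital-N spelling on the built-in auto data (not the lab data) and assumes the estout package can be installed from SSC if it is missing.

```stata
* Install estout from SSC only if esttab is not already available
capture which esttab
if _rc ssc install estout

* Illustration on the built-in auto data, not the Ille-et-Vilaine data
sysuse auto, clear
quietly logistic foreign mpg weight

* The sample size is stored as e(N), with a capital N
display "Number of observations e(N) = " e(N)

* Requesting N (not n) fills in the sample-size row
esttab, stats(N chi2 bic, star(chi2))
```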

__7d) esttab to display OR, 95% CI, and chi square tests of Null: OR=1. Use option eform and stats( n chi2 bic, star(chi2) ). Do this for models 1-3 ONLY!

Solution
. esttab model1 model2 model3, stats(n chi2 bic, star(chi2)) eform ci prehead("Logistic Regression of Esophageal Cancer - ODDS RATIO's")

Logistic Regression of Esophageal Cancer - ODDS RATIO's
                              (1)                     (2)                     (3)
                             case                    case                    case
------------------------------------------------------------------------------------------
case
alcohol_80~s                5.228***                                        5.122***
                    [3.608,7.576]                                   [3.512,7.469]

1b.agegp                        1                       1                       1
                            [1,1]                   [1,1]                   [1,1]

2.agegp                     4.683                   6.268                   6.301
                    [0.580,37.82]           [0.777,50.54]           [0.763,52.06]

3.agegp                     24.54**                 38.39***                33.11***
                    [3.304,182.3]           [5.162,285.5]           [4.336,252.9]

4.agegp                     40.70***                65.17***                56.19***
                    [5.529,299.5]           [8.825,481.3]           [7.409,426.1]

5.agegp                     52.78***                82.45***                80.93***
                    [7.107,391.9]           [11.04,616.0]           [10.51,623.1]

6.agegp                     52.42***                59.45***                71.42***
                    [6.503,422.6]           [7.369,479.6]           [8.588,593.9]

smoking_30~s                                        4.211***                3.990***
                                            [2.403,7.382]           [2.185,7.287]
------------------------------------------------------------------------------------------
n
chi2                        199.3***                145.7***                219.2***
bic                         838.4                   892.0                   825.3
------------------------------------------------------------------------------------------
Exponentiated coefficients; 95% confidence intervals in brackets
* p<0.05, ** p<0.01, *** p<0.001

___8. Perform a Likelihood Ratio Test

A likelihood ratio test is performed to assess the statistical significance of the interaction of heavy drinking and heavy smoking in the model, controlling for age and the main effects of each of heavy drinking and heavy smoking. Thus,

Model “reduced”: Predictors = age, heavy drinking, heavy smoking
Model “full”: Predictors = age, heavy drinking, heavy smoking + (drinking x smoking)

Doing this requires 5 commands:
   i) fit the reduced model
   ii) save the results of the reduced model fit (e.g. eststo reduced)
   iii) fit the full model
   iv) save the results of the full model fit (e.g. eststo full)
   v) to perform the likelihood ratio test, issue the command: lrtest reducedname fullname

Solution
. * Reduced model
. logistic case i.agegp smoking_30plus alcohol_80plus

Logistic regression                             Number of obs     =        975
                                                LR chi2(7)        =     219.23
                                                Prob > chi2       =     0.0000
Log likelihood = -385.12755                     Pseudo R2         =     0.2216

(-2) Log likelihood Reduced Model = 770.2551

--------------------------------------------------------------------------------
          case | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
---------------+----------------------------------------------------------------
         agegp |
     2. 35-44  |   6.300504   6.788536     1.71   0.088     .7625007    52.06074
     3. 45-54  |   33.11226   34.34676     3.37   0.001     4.335635    252.8861
     4. 55-64  |   56.18964   58.08311     3.90   0.000     7.409194    426.1294
     5. 65-74  |   80.93014   84.28445     4.22   0.000     10.51063    623.1486
       6. 75+  |   71.41624   77.18044     3.95   0.000     8.588013    593.8835
               |
smoking_30plus |   3.989895   1.226166     4.50   0.000     2.184601    7.287035
alcohol_80plus |   5.122038   .9859232     8.49   0.000     3.512352    7.469432
         _cons |   .0040597   .0041822    -5.35   0.000      .000539    .0305757
--------------------------------------------------------------------------------

. estimates store reduced

. * Full model
. logistic case i.agegp smoking_30plus alcohol_80plus drinker_smoker

Logistic regression                             Number of obs     =        975
                                                LR chi2(8)        =     219.35
                                                Prob > chi2       =     0.0000
Log likelihood = -385.07068                     Pseudo R2         =     0.2217

(-2) Log likelihood Full Model = 770.14136

--------------------------------------------------------------------------------
          case | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
---------------+----------------------------------------------------------------
         agegp |
     2. 35-44  |   6.563396   7.142474     1.73   0.084     .7777266    55.38986
     3. 45-54  |   34.21248   35.78455     3.38   0.001     4.404231    265.7657
     4. 55-64  |   58.12253   60.60994     3.90   0.000     7.528611    448.7187
     5. 65-74  |   83.55392   87.73638     4.21   0.000     10.66981    654.3001
       6. 75+  |   74.15214   80.85386     3.95   0.000     8.749685    628.4273
               |
smoking_30plus |   3.726905   1.379709     3.55   0.000     1.803979    7.699546
alcohol_80plus |   5.018372   1.012542     7.99   0.000     3.379235    7.452591
drinker_smoker |   1.251609    .839525     0.33   0.738     .3361396    4.660342
         _cons |   .0039536   .0041028    -5.33   0.000     .0005172    .0302218
--------------------------------------------------------------------------------

. estimates store full
. lrtest reduced full

Likelihood-ratio test                                 LR chi2(1)  =      0.11
(Assumption: reduced nested in full)                  Prob > chi2 =    0.7359

CHECK: [(-2) ln L reduced] – [(-2) ln L full] = 770.2551 – 770.14136 = 0.11374 match!
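The CHECK line above can be reproduced in Stata itself, including the p-value. This is a sketch using the two log likelihoods printed in the model output; the scalar names are hypothetical. The function chi2tail(df, x) returns the upper-tail probability of a chi square distribution, and df = 1 because the full model has one more parameter (the interaction) than the reduced model.

```stata
* Log likelihoods copied from the reduced and full model output
scalar ll_reduced = -385.12755
scalar ll_full    = -385.07068

* LR statistic = [(-2) ln L reduced] - [(-2) ln L full]
scalar lr_stat = (-2)*scalar(ll_reduced) - (-2)*scalar(ll_full)
display "LR chi2(1)  = " %7.5f scalar(lr_stat)

* Upper-tail p-value on 1 degree of freedom
display "Prob > chi2 = " %6.4f chi2tail(1, scalar(lr_stat))
```

Both values should agree with the lrtest output up to rounding.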


___9. Regression Diagnostics for Logistic Regression: Numerical

Preliminary: Install the suite of commands in the package SPost

. * Step 1: Install SPost Using net install
. net install spost9_ado
checking spost9_ado consistency and verifying not already installed...
installing into /Users/cbigelow/Library/Application Support/Stata/ado/plus/...
installation complete.

. * Step 2: Now obtain all the ancillary files
. net get spost9_do
checking spost9_do consistency and verifying not already installed...
copying into current directory...
      copying  st9all.do
      copying  st9ch2tutorial.do
      copying  st9ch3estimate.do
      copying  st9ch4binary.do
      copying  st9ch5ordinal.do
      copying  st9ch6nomcase.do
      copying  st9ch7nomalt.do
      copying  st9ch8count.do
      copying  st9ch9other.do
      copying  binlfp2.dta
      copying  couart2.dta
      copying  gsskidvalue2.dta
      copying  nomocc2.dta
      copying  ordwarm2.dta
      copying  science2.dta
      copying  sciwork.dta
      copying  travel2.dta
      copying  travel2case.dta
      copying  wlsrnk.dta
ancillary files successfully copied.
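Before moving on, it can help to confirm that the installation worked. A minimal sketch, assuming only Stata's built-in which command; the wording of the messages is illustrative.

```stata
* Confirm that fitstat (part of SPost) can be found on the ado-path.
* capture suppresses the error and leaves the return code in _rc.
capture which fitstat
if _rc != 0 {
    display as error "fitstat not found: install SPost first (net install spost9_ado)"
}
else {
    display as text "fitstat is installed"
}
```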


Summary
Now you have a model that is your "candidate" final model. There are many further explorations you can do to assess whether this really is a "good" final model.

Ille-et-Vilaine Data: Illustration
Having retained the null hypothesis of no interaction between heavy smoking and heavy drinking in our likelihood ratio test (LR chi2(1) = 0.11, p = 0.7359), our "candidate" final model contains: heavy drinking, heavy smoking, and age.

. * Before requesting any diagnostics of a model, you must have fit it.
. logistic case i.agegp smoking_30plus alcohol_80plus

Logistic regression                             Number of obs     =        975
                                                LR chi2(7)        =     219.23
                                                Prob > chi2       =     0.0000
Log likelihood = -385.12755                     Pseudo R2         =     0.2216

--------------------------------------------------------------------------------
          case | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
---------------+----------------------------------------------------------------
         agegp |
         35-44 |   6.300504   6.788536     1.71   0.088     .7625007    52.06074
         45-54 |   33.11226   34.34676     3.37   0.001     4.335635    252.8861
         55-64 |   56.18964   58.08311     3.90   0.000     7.409194    426.1294
         65-74 |   80.93014   84.28445     4.22   0.000     10.51063    623.1486
           75+ |   71.41624   77.18044     3.95   0.000     8.588013    593.8835
               |
smoking_30plus |   3.989895   1.226166     4.50   0.000     2.184601    7.287035
alcohol_80plus |   5.122038   .9859232     8.49   0.000     3.512352    7.469432
         _cons |   .0040597   .0041822    -5.35   0.000      .000539    .0305757
--------------------------------------------------------------------------------


. *
. ***** 9a) Numerical measures of fit using command FITSTAT
. fitstat

Measures of Fit for logistic of case

Log-Lik Intercept Only:       -494.744   Log-Lik Full Model:           -385.128
D(967):                        770.255   LR(7):                         219.233
                                         Prob > LR:                       0.000
McFadden's R2:                   0.222   McFadden's Adj R2:               0.205
ML (Cox-Snell) R2:               0.201   Cragg-Uhler(Nagelkerke) R2:      0.316
McKelvey & Zavoina's R2:         0.466   Efron's R2:                      0.224
Variance of y*:                  6.157   Variance of error:               3.290
Count R2:                        0.817   Adj Count R2:                    0.110
AIC:                             0.806   AIC*n:                         786.255
BIC:                         -5885.062   BIC':                         -171.056
BIC used by Stata:             825.315   AIC used by Stata:             786.255

PARTIAL KEY:
Log-Lik Intercept Only = -494.744: This is the log likelihood for the intercept-only model.
Log-Lik Full Model = -385.128: This is the log likelihood for the current model.
LR(7) = 219.233: This is the likelihood ratio chi-square statistic, which tests whether the current model predicts better than the intercept-only model.
Prob > LR = 0.000: This is the p-value for the LR(7) test. It is displayed to three decimals, so read it as p < .0005.
Then there is a series of pseudo-R2 measures.
Finally, there is a series of information criteria (AIC, BIC) that are used to compare different models.

. *
. ***** 9b) Test of Model Adequacy Using command LINKTEST
. linktest
-- iteration output omitted --

Logistic regression                             Number of obs     =        975
                                                LR chi2(2)        =     219.24
                                                Prob > chi2       =     0.0000
Log likelihood = -385.12412                     Pseudo R2         =     0.2216

------------------------------------------------------------------------------
        case |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        _hat |   1.009135   .1404655     7.18   0.000     .7338274    1.284442
      _hatsq |   .0039801   .0479037     0.08   0.934    -.0899094    .0978696
       _cons |   .0008299   .1243723     0.01   0.995    -.2429353    .2445952
------------------------------------------------------------------------------

WHAT TO LOOK FOR:
linktest refits the model using the linear predictor (_hat) and its square (_hatsq) as the only predictors.
We expect _HAT to be highly significant (a very small p-value).
Evidence of a GOOD FIT is reflected in a NON-SIGNIFICANT _HATSQ: the squared term should add nothing once _HAT is in the model.
Here the p-value for _HATSQ is .934.
This suggests good model adequacy.
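Returning to the fitstat output in 9a: several of its measures follow directly from the two log likelihoods, the number of observations, and the number of estimated parameters. The sketch below reproduces them by hand. The scalar names are illustrative, and the formulas are the standard textbook definitions rather than anything taken from fitstat's source code.

```stata
* Reproduce selected fitstat measures by hand.
* ll0 = intercept-only log likelihood (as printed, 3 decimals)
* ll1 = current-model log likelihood
* n   = number of observations
* k   = number of estimated parameters (7 slopes + 1 intercept)
scalar ll0 = -494.744
scalar ll1 = -385.12755
scalar n   = 975
scalar k   = 8

* McFadden's R2 and its adjusted version
scalar r2_mcf    = 1 - ll1/ll0
scalar r2_mcfadj = 1 - (ll1 - k)/ll0

* ML (Cox-Snell) R2 and Cragg-Uhler (Nagelkerke) R2
scalar r2_cs  = 1 - exp(2*(ll0 - ll1)/n)
scalar r2_nag = r2_cs/(1 - exp(2*ll0/n))

* Information criteria in the form labelled "used by Stata"
scalar aic = -2*ll1 + 2*k
scalar bic = -2*ll1 + k*ln(n)

display "McFadden R2      = " %5.3f r2_mcf
display "McFadden Adj R2  = " %5.3f r2_mcfadj
display "Cox-Snell R2     = " %5.3f r2_cs
display "Nagelkerke R2    = " %5.3f r2_nag
display "AIC              = " %8.3f aic
display "BIC              = " %8.3f bic
```

Each displayed value should agree with the corresponding fitstat entry to the precision shown there.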


. *
. ***** 9c) Test of Overall Goodness of Fit Using command LFIT
. lfit, group(10) table

Logistic model for case, goodness-of-fit test

  (Table collapsed on quantiles of estimated probabilities)
  (There are only 9 distinct quantiles because of ties)
  +--------------------------------------------------------+
  | Group |   Prob | Obs_1 | Exp_1 | Obs_0 | Exp_0 | Total |
  |-------+--------+-------+-------+-------+-------+-------|
  |     1 | 0.0159 |     0 |   0.6 |   106 | 105.4 |   106 |
  |     2 | 0.0249 |     6 |   3.9 |   153 | 155.1 |   159 |
  |     3 | 0.1158 |     4 |   5.1 |    45 |  43.9 |    49 |
  |     4 | 0.1185 |    16 |  17.5 |   132 | 130.5 |   148 |
  |     6 | 0.1857 |    27 |  29.7 |   133 | 130.3 |   160 |
  |-------+--------+-------+-------+-------+-------+-------|
  |     7 | 0.2473 |    42 |  38.0 |   115 | 119.0 |   157 |
  |     8 | 0.3462 |     0 |   0.3 |     1 |   0.7 |     1 |
  |     9 | 0.5388 |    66 |  62.8 |    67 |  70.2 |   133 |
  |    10 | 0.8704 |    39 |  41.9 |    23 |  20.1 |    62 |
  +--------------------------------------------------------+

       number of observations =       975
             number of groups =         9
      Hosmer-Lemeshow chi2(7) =      4.43
                  Prob > chi2 =    0.7291

WHAT TO LOOK FOR:
Evidence of OVERALL GOODNESS OF FIT is reflected in a NON-SIGNIFICANT p-value.
Here the Hosmer-Lemeshow test p-value is .7291.
This suggests good overall fit.
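The degrees of freedom of the Hosmer-Lemeshow test are the number of groups minus 2, which is why 9 groups give chi2(7) here. A minimal sketch of that arithmetic, using only the values printed in the output; the scalar names are illustrative.

```stata
* Hosmer-Lemeshow degrees of freedom and p-value from the reported values.
* Scalar names are illustrative only.
scalar hl_groups = 9
scalar hl_chi2   = 4.43

* df = number of groups - 2
scalar hl_df = hl_groups - 2

* Upper-tail chi-square probability
scalar hl_p = chi2tail(hl_df, hl_chi2)

display "Hosmer-Lemeshow chi2(" hl_df ") = " %4.2f hl_chi2 "   Prob > chi2 = " %6.4f hl_p
```

Because the statistic is entered as the rounded value 4.43, the p-value may differ from Stata's in the last displayed digit.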


___10. Regression Diagnostics for Logistic Regression: Graphical

. ***** 10a) Plot of ROC Curve using LROC
. predict xb, xb
. lroc

Logistic model for case

number of observations =      975
area under ROC curve   =   0.8119

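WHAT TO LOOK FOR:
Classification that is no better than a coin toss is represented by the 45 degree reference line.
Evidence of GOOD FIT is reflected in an ROC curve that lies above the 45 degree reference line.
Area under the ROC curve = .8119 says that, for a randomly chosen case and a randomly chosen control, the model gives the case the higher predicted probability about 81% of the time.
Note that this is not the percent of observations correctly classified. That quantity is the Count R2 reported by fitstat in 9a (0.817, using a cutoff of 0.5), which happens to be close to the area under the curve in this example.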
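To see the area under the ROC curve and the percent correctly classified side by side, Stata's estat classification can be run after lroc. The sketch below uses auto.dta, which ships with Stata, purely as a stand-in so that it runs on its own; it is not the lab's dataset, and the model and scalar names are illustrative.

```stata
* Stand-in data: auto.dta ships with Stata. It is NOT the lab's dataset.
sysuse auto, clear
quietly logistic foreign mpg weight

* Area under the ROC curve (a ranking measure); nograph suppresses the plot
lroc, nograph
scalar auc = r(area)

* Percent correctly classified at the default cutoff of 0.5
estat classification
scalar pct_correct = r(P_corr)

display "AUC = " %6.4f auc "   Percent correctly classified = " %5.2f pct_correct
```

With the lab's model, the same two commands (lroc and estat classification) would be issued right after the logistic fit.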


. *
. ***** 10b) Plot of Y=Standardized Residual versus X=Observation Number
. predict std_residual, rs
. label variable std_residual "Standardized Residual"
. generate index=_n
. label variable index "Observation Number"
. graph twoway (scatter std_residual index, msymbol(d)), xlabel(0(100)1000) ylabel(-4(2)4)
>     title("Plot of Standardized Residuals versus Observation Number")
>     xtitle("Observation Number") ytitle("Standardized Residual") yline(0)
>     caption("stdresidual.png", size(vsmall))

WHAT TO LOOK FOR:
Think of standardized residuals as Z-scores, approximately.
We'd like the majority to be within 1.96 of the expected value of 0.
Values outside ±1.96 are potentially extreme.
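Besides reading the plot, the potentially extreme observations can be counted and listed directly. The sketch below uses auto.dta, which ships with Stata, only as a stand-in so that it runs on its own; it is not the lab's dataset, and the variable and scalar names are illustrative.

```stata
* Stand-in data: auto.dta ships with Stata. It is NOT the lab's dataset.
sysuse auto, clear
quietly logistic foreign mpg weight

* Standardized Pearson residuals (rs in the lab abbreviates rstandard)
predict double std_res, rstandard

* Count the observations whose standardized residual lies outside +/-1.96
count if abs(std_res) > 1.96 & !missing(std_res)
scalar n_extreme = r(N)

* List them for closer inspection
list make foreign mpg weight std_res if abs(std_res) > 1.96 & !missing(std_res)
```

With the lab's variables, the analogous condition would be abs(std_residual) > 1.96 & !missing(std_residual).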


. *
. ***** 10c) Plot of Influential Observations: Y=Cook versus X=Observation Number
. predict cook, dbeta
. label variable cook "Cook Distance"
. graph twoway (scatter cook index, msymbol(d)), xlabel(0(100)1000)
>     title("Plot of Cook Distance versus Observation Number")
>     xtitle("Observation Number") ytitle("Cook Distance")
>     caption("cook.png", size(vsmall))

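WHAT TO LOOK FOR:
Look for an even ribbon of Cook distance values with no spikes. A spike flags observations whose removal would noticeably change the fitted coefficients.
Note: predict with the dbeta option returns Pregibon's delta-beta influence statistic, which is the logistic regression analogue of Cook's distance.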
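A quick numerical companion to the plot is to sort by the influence statistic and list the largest values. The sketch below uses auto.dta, which ships with Stata, only as a stand-in so that it runs on its own; it is not the lab's dataset, the variable name db is illustrative, and showing five rows is an arbitrary choice.

```stata
* Stand-in data: auto.dta ships with Stata. It is NOT the lab's dataset.
sysuse auto, clear
quietly logistic foreign mpg weight

* Pregibon's delta-beta influence statistic
predict double db, dbeta

* Sort from largest to smallest and show the five most influential rows
gsort -db
list make foreign mpg weight db in 1/5
```

With the lab's variables, the same idea would be gsort -cook followed by a list of the first few rows.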