06/14/22 H.S. 1 Stata 3, Regression Hein Stigum Presentation, data and programs at: http://folk.uio.no/heins/
Dec 21, 2015
04/19/23 H.S. 1
Stata 3, Regression
Hein Stigum
Presentation, data and programs at:
http://folk.uio.no/heins/
Agenda
• Linear regression
• GLM
• Logistic regression
• Binary regression
• (Conditional logistic)
Linear regression
Birth weight by gestational age
Regression idea
Model: $y = b_0 + b_1 x + e$
– y = outcome
– x = covariate
– $b_1$ = coefficient, effect of x
– e = residual error
[Figure: scatterplot of birth weight (gram) against gestational age (days)]
Model with many cofactors: $y = b_0 + b_1 x_1 + b_2 x_2 + e$
– $x_1, x_2$ = covariates
Model and assumptions
• Model: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2)$
• Assumptions
– Independent errors
– Linear effects
– Constant error variance
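Association measure: RD
Model: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2$
Start with: $RD_{x_1} = y_{x_1+1} - y_{x_1}$
$= (\beta_0 + \beta_1 (x_1+1) + \beta_2 x_2) - (\beta_0 + \beta_1 x_1 + \beta_2 x_2) = \beta_1$
Hence: $RD_{x_1} = \beta_1$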
Purpose of regression
• Estimation
– Estimate association between outcome and exposure adjusted for other covariates
• Prediction
– Use an estimated model to predict the outcome given covariates in a new dataset
Adjusting for confounders
• Do not adjust
– Cofactor is a collider
– Cofactor is in the causal path
• May or may not adjust
– Cofactor has missing values
– Cofactor has measurement error
[Figure: unadjusted and adjusted estimates compared with the true value]
Workflow
• Scatterplots
• Bivariate analysis
• Regression
– Model fitting
  • Cofactors in/out
  • Interactions
– Test of assumptions
  • Independent errors
  • Linear effects
  • Constant error variance
– Influence (robustness)
Scatterplot
[Figure: scatterplot of weight against gest]
Syntax
• Estimation
– regress y x1 x2            linear regression
– xi: regress y x1 i.c1      categorical c1
• Post estimation
– predict yf, xb             predict
• Manage models
– estimates store m1         save model
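The commands on this slide can be tried in sequence as a do-file. This is a minimal sketch, assuming Stata's bundled auto data as a stand-in for the course data; the variables price, weight and foreign, and the names yf and m1, are illustrative choices and not taken from the presentation.

```stata
/* Illustration only: auto.dta stands in for the birth-weight data */
sysuse auto, clear
regress price weight foreign     /* linear regression: outcome first, then covariates */
predict yf, xb                   /* linear prediction from the fitted model */
estimates store m1               /* save the model under the name m1 */
estimates table m1, b se         /* list stored coefficients and standard errors */
```

Storing each model under its own name makes it easy to put several fits side by side later with estimates table m1 m2.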
Model 1: outcome+exposure
Model 2: Add confounders
Estimate association: m1=m2
Prediction: m2 is best
Dummies
Assume educ is coded 1, 2, 3 for low, medium and high education
Choose low educ as reference
Make dummies for the two other categories:
generate medium=(educ==2) if educ<.
generate high  =(educ==3) if educ<.
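The "if educ<." qualifier matters because Stata evaluates (educ==2) to 0 when educ is missing. A small sketch on made-up data (the four educ values and the variable medium0 are invented for illustration) shows the difference:

```stata
/* Illustration only: a tiny invented dataset with one missing educ */
clear
input educ
1
2
3
.
end
generate medium  = (educ==2) if educ<.   /* missing educ stays missing */
generate medium0 = (educ==2)             /* no qualifier: missing educ is coded 0 */
list
```

Without the qualifier, observations with unknown education would be silently counted in the reference group.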
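Interaction
Model: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 \cdot sex$
Start with: $RD_{x_1} = y_{x_1+1} - y_{x_1}$
$= (\beta_0 + \beta_1 (x_1+1) + \beta_2 x_2 + \beta_3 (x_1+1) \cdot sex) - (\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 \cdot sex) = \beta_1 + \beta_3 \cdot sex$
Hence: $RD_{x_1} = \beta_1 + \beta_3 \cdot sex$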
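With an interaction the effect of the exposure is no longer a single coefficient, so it has to be computed per group. A minimal sketch, assuming Stata's bundled auto data, where weight plays the role of the exposure and foreign plays the role of sex (both are stand-ins, and the ## operator also adds the main effect of foreign):

```stata
/* Illustration only: auto.dta stands in for the course data */
sysuse auto, clear
regress price c.weight##i.foreign        /* main effects plus weight x foreign interaction */
lincom weight                            /* effect of weight when foreign==0: beta1 */
lincom weight + 1.foreign#c.weight       /* effect of weight when foreign==1: beta1 + beta3 */
```

lincom reports the combined estimate with its standard error and confidence interval, which cannot be read directly from the regression table.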
Model 3: with interaction
Test of assumptions
• Predict y and residuals
– predict y, xb
– predict res, resid
• Plot resid vs y
– independent?
– linear?
– const. var?
twoway (scatter res y) (qfitci res y)
[Figure: residuals against linear prediction, with quadratic fit and confidence band]
Violations of assumptions
• Dependent residuals
Mixed models: xtmixed
• Non-linear effects
gen gest2=gest^2
regress weight gest gest2 sex
• Non-constant variance
regress weight gest sex, robust
[Figure: residual plots against gest and against the prediction p]
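The remedies for non-linear effects and non-constant variance can be sketched end to end. This example assumes Stata's bundled auto data in place of the birth-weight data; price, weight, foreign, yhat and weight2 are illustrative names. vce(robust) is the current spelling of the robust option.

```stata
/* Illustration only: auto.dta stands in for the course data */
sysuse auto, clear
regress price weight foreign
predict yhat, xb                              /* predicted outcome */
predict res, resid                            /* residuals */
twoway (scatter res yhat) (qfitci res yhat)   /* look for curvature or fanning */
generate weight2 = weight^2                   /* quadratic term for a non-linear effect */
regress price weight weight2 foreign
regress price weight foreign, vce(robust)     /* robust SEs for non-constant variance */
```

Robust standard errors leave the coefficients unchanged and only alter the standard errors, p-values and confidence intervals.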
Measures of influence
• Measure change in:
– Outcome (y)
– Deviance
– Coefficients (beta)
  • Delta beta, Cook's distance
Remove obs 1, see change
Remove obs 2, see change
[Figure: influence plotted against observation id]
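After a linear regression these measures are available without refitting the model by hand. A minimal sketch, assuming Stata's bundled auto data as a stand-in (the variable name cook is an arbitrary choice):

```stata
/* Illustration only: auto.dta stands in for the course data */
sysuse auto, clear
regress price weight foreign
predict cook, cooksd          /* Cook's distance, one value per observation */
dfbeta                        /* delta-beta, one new variable per covariate */
gsort -cook                   /* most influential observations first */
list make cook in 1/5
```

dfbeta creates one variable per covariate (named _dfbeta_1, _dfbeta_2, ...), so each coefficient can be checked separately.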
Points with high influence
lvr2plot, mlabel(id)
[Figure: leverage against normalized residual squared, points labelled by id]
Added variable plot: gestational age
[Figure: added-variable plot of e(weight | X) against e(gest | X), points labelled by id]
coef = 5.6762662, se = 1.1303444, t = 5.02
avplot gest, mlabel(id)
Removing outlier
Influence
[Figure: birth weight against gestational age, showing the outlier and the regression lines with and without the outlier]
Final model
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2)$
Give meaning to constant term:
sum gest                     /* find smallest value */
generate gest2=gest-204      /* smallest gest=204 */
generate sex2=sex-1          /* boys=0, girls=1 */
regress weight gest2 sex2    /* final model */
estimates store m4
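The effect of the recoding can be written out as a short derivation, assuming the coding stated in the comments (smallest gest is 204; sex is 1 for boys and 2 for girls):

```latex
E(\text{weight}) = \beta_0 + \beta_1(\text{gest} - 204) + \beta_2(\text{sex} - 1)
\quad\Rightarrow\quad
E(\text{weight} \mid \text{gest} = 204,\ \text{sex} = 1) = \beta_0
```

So the constant is the expected birth weight of a boy at the shortest observed gestational age, rather than an extrapolation to gest = 0.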
Logistic regression
Being bullied
Model and assumptions
• Model: $\ln\left(\frac{\mu}{1-\mu}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2, \quad \mu = E(y \mid x), \quad y \mid x \sim \text{Binomial}$
• Assumptions
– Independent residuals
– Linear effects
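Association measure, Odds ratio
Model: $\ln\left(\frac{\mu}{1-\mu}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$
Start with: $\ln(OR_{x_1}) = \ln\left(\frac{\text{Odds}_{x_1+1}}{\text{Odds}_{x_1}}\right) = \ln(\text{Odds}_{x_1+1}) - \ln(\text{Odds}_{x_1})$
$= (\beta_0 + \beta_1 (x_1+1) + \beta_2 x_2) - (\beta_0 + \beta_1 x_1 + \beta_2 x_2) = \beta_1$
Hence: $OR_{x_1} = \exp(\beta_1)$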
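The relation between the coefficient and the odds ratio can be checked numerically. A minimal sketch, assuming Stata's bundled auto data, with foreign as a stand-in binary outcome and mpg as a stand-in exposure:

```stata
/* Illustration only: auto.dta stands in for the bullying data */
sysuse auto, clear
logit foreign mpg                /* coefficients on the log-odds scale */
display exp(_b[mpg])             /* OR for a one-unit increase in mpg */
logistic foreign mpg             /* same model, reported as odds ratios */
```

The value displayed after logit should match the odds ratio that logistic reports for mpg, since both commands fit the same model.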
Syntax
• Estimation
– logistic y x1 x2            logistic regression
– xi: logistic y x1 i.c1      categorical c1
• Post estimation
– predict yf, pr              predict probability
• Manage models
– estimates store m1          save model
– est table m1, eform         show OR
Workflow
• Bivariate analysis
• Regression
– Model fitting
  • Cofactors in/out
  • Interactions
– Test of assumptions
  • Independent errors
  • Linear effects
– Influence (robustness)
Bivariate
Generate dummies
gen Island = (country==2) if country<.
gen Norway = (country==3) if country<.
gen Finland= (country==4) if country<.
gen Denmark= (country==5) if country<.
Model 1: outcome and exposure
Alternative commands:
xi: logistic bullied i.country                        use xi: i.var for categorical variables
xi: logistic bullied i.country, coef                  coefs instead of ORs
xi: logistic bullied i.country if sex!=. & age!=.     do if sex and age not missing
Model 2: Add confounders
Estimate associations: m1=m2
Predict: m2 best
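Interaction
Model: $\ln\left(\frac{\mu}{1-\mu}\right) = \beta_0 + \beta_1 x_1 + \beta_2 sex + \beta_3 x_1 \cdot sex$
Start with: $\ln(OR_{x_1}) = \ln\left(\frac{\text{Odds}_{x_1+1}}{\text{Odds}_{x_1}}\right) = \ln(\text{Odds}_{x_1+1}) - \ln(\text{Odds}_{x_1})$
$= (\beta_0 + \beta_1 (x_1+1) + \beta_2 sex + \beta_3 (x_1+1) \cdot sex) - (\beta_0 + \beta_1 x_1 + \beta_2 sex + \beta_3 x_1 \cdot sex) = \beta_1 + \beta_3 \cdot sex$
Hence: $OR_{x_1} = \exp(\beta_1 + \beta_3 \cdot sex)$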
Model 3: interaction
Test of assumptions
• Linear effects (of age)
– findit lincheck                                    search and install
– lincheck xi: logistic bullied age i.country sex
[Figure: coefficient against age]
Points with high influence
[Figure: Pregibon's dbeta against Pr(bullied)]
estimates restore m2     restore best model
predict p, p             probability (mu in our notation)
predict db, db           delta-beta (one value, not one per estimate)
scatter db p             delta-beta plot
Removing 2 observations
Conclusion: Robust results
Generalized Linear Models
Being bullied
Designs and measures
Design                   Frequency measures   Natural association measures
Cross-sectional          Prevalence           Risk Ratio, Risk Difference
Cohort                   Incidence risk       Risk Ratio, Risk Difference
                         Incidence rate       Rate Ratio
Case Control
  Traditional CC                              Odds Ratio
  Case Cohort                                 Rate Ratio
  Nested Case Control                         Rate Ratio

Models                   Measures
GLM                      RR, RD, OR
Survival                 Rate Ratio
Generalized Linear Models, GLM
[Figure: three panels. Linear regression: birth weight (gram) against gestational age (days). Logistic regression: risk against age. Poisson regression.]
GLM: Distribution and link
• Distribution family        Normal     Binomial   Poisson
– Given by data
– Influence p-value, CI
• Link function              Identity   Logit      Log
– May choose
– Shape (=link^-1)
– Scale                      Additive   Multi.     Multi.
– Association measure        RD         OR         RR
Distribution and link examples
Data         Distribution   Link       Measure   Name
Standard models
Continuous   Normal         Identity   -         Linear
Count        Poisson        Log        RR        Poisson
0/1          Binomial       Logit      OR        Logistic
Other models
0/1          Binomial       Log        RR        Log binomial
0/1          Binomial       Identity   RD        Linear binomial
Link: Identity, linear model, additive scale
Note: not for traditional case control data
Being bullied, 3 models
glm bullied Island Norway Finland Denmark sex age, family(binomial) link(logit)
glm bullied Island Norway Finland Denmark sex age, family(binomial) link(log)
glm bullied Island Norway Finland Denmark sex age, family(binomial) link(identity)
Convergence problems
• If glm does not converge, use:
– poisson y x1 x2, irr robust     RR
– regress y x1 x2, robust         RD
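The fallback commands can be compared with the binomial GLMs on a case where everything converges. A minimal sketch, assuming the lbw example data from Stata's website (webuse needs an internet connection); low and smoke are variables in that dataset and stand in for the bullying data. With a single binary covariate the log-binomial and Poisson fits should give the same RR, and the linear binomial and linear regression fits the same RD.

```stata
/* Illustration only: lbw.dta stands in for the bullying data */
webuse lbw, clear
glm low smoke, family(binomial) link(log) eform    /* log-binomial: exp(b) = RR */
poisson low smoke, irr vce(robust)                 /* fallback for RR */
glm low smoke, family(binomial) link(identity)     /* linear binomial: b = RD */
regress low smoke, vce(robust)                     /* fallback for RD */
```

The robust option is needed in the fallbacks because a 0/1 outcome is neither Poisson nor Normal, so the model-based standard errors would be wrong.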
Stop
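Association measure, RR
Model: $\ln(\mu) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots$
Start with: $\ln(RR_{x_1}) = \ln\left(\frac{\mu_{x_1+1}}{\mu_{x_1}}\right) = \ln(\mu_{x_1+1}) - \ln(\mu_{x_1})$
$= (\beta_0 + \beta_1 (x_1+1) + \beta_2 x_2 + \dots) - (\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots) = \beta_1$
Hence: $RR_{x_1} = \exp(\beta_1)$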
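Association measure: RD
Model: $\mu = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots$
Start with: $RD_{x_1} = \mu_{x_1+1} - \mu_{x_1}$
$= (\beta_0 + \beta_1 (x_1+1) + \beta_2 x_2 + \dots) - (\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots) = \beta_1$
Hence: $RD_{x_1} = \beta_1$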
The importance of scale
[Figure: outcome for males and females at T1 and T2]
Additive scale: absolute increase
Females: 30-20=10
Males: 20-10=10
Conclusion: Same increase for males and females
Measure: RD
Multiplicative scale: relative increase
Females: 30/20=1.5
Males: 20/10=2.0
Conclusion: More increase for males
Measure: RR
Stata regression commands
• Regression with simple error structure
– regress    linear regression (also heteroskedastic errors)
– nl         non-linear least squares
• GLM
– logistic   logistic regression
– poisson    Poisson regression
– binreg     binary outcome, OR, RR, or RD effect measures
• Conditional logistic
– clogit     for matched case-control data
• Multiple outcome
– mlogit     multinomial logit (not ordered)
– ologit     ordered logit
• Regression with complex error structure
– xtmixed    linear mixed models
– xtlogit    random effect logistic
Syntax
• Estimation
– regress y x1 x2               linear regression
– logistic y x1 x2              logistic regression
– xi: regress y x1 i.x2         categorical x2
• Manage results
– estimates store m1            store results
– estimates table m1 m2         table of results
– estimates stats m1 m2         statistics of results
• Post estimation
– predict y, xb                 linear prediction
– predict res, resid            residuals
– lincom b0+2*b3                linear combination
• Help
– help logistic postestimation