BIOST 536 Lecture 4

Lecture 4 – Logistic regression: estimation and confounding

Linear model

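$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \varepsilon$, where $\varepsilon \sim N(0, \sigma^2)$

$\beta_0, \beta_1, \beta_2, \dots, \beta_p$ fixed parameters to be estimated

Then

$E(Y) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p$

$\hat{Y} = \hat\beta_0 + \hat\beta_1 X_1 + \hat\beta_2 X_2 + \dots + \hat\beta_p X_p$

Find $\hat\beta_0, \hat\beta_1, \dots, \hat\beta_p$ such that $\sum_i (Y_i - \hat{Y}_i)^2$ is minimized, "least squares criterion"

There is a closed-form solution for the estimates of beta, i.e. only matrix operations are required and there is no iteration

Also need an estimate of $\sigma^2$ but this is easily obtained:

$\hat\sigma^2 = \dfrac{\sum_i (Y_i - \hat{Y}_i)^2}{n - p - 1}$

The least squares estimates of beta are the same as the maximum likelihood estimates of beta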


Logistic regression estimation

Modeling a binary outcome

Modeling P(Y=1|X=x), not a continuous Y

A linear model

$P(Y=1 \mid X) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p$

has problems since the left-hand side satisfies 0 ≤ Pr ≤ 1, but the right-hand side can take any value in (-∞, ∞), so it does not do well

Instead assume that P(Y=1|X=x) depends on X only through the linear combination

$Z = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p$ and $Z \in (-\infty, \infty)$

Need to link Z to P(Y=1|Z=z) so we use a logistic function

$P(Y=1 \mid Z=z) = \dfrac{e^{z}}{1 + e^{z}}$

[Figure: logistic function, probability (0 to 1) plotted against z (-5 to 5)]
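The shape of the logistic curve can be checked numerically. The following is a minimal sketch; the function names `logistic` and `logit` are chosen here for illustration and are not from the lecture.

```python
import math

def logistic(z):
    """Map a linear predictor z in (-inf, inf) to a probability in (0, 1).

    Algebraically identical to e^z / (1 + e^z).
    """
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Inverse of the logistic function: the log odds of p."""
    return math.log(p / (1.0 - p))

# Tabulate the curve over the range shown on the slide (-5 to 5).
for z in (-5, -2, 0, 2, 5):
    print(f"z = {z:+d}   P(Y=1|Z=z) = {logistic(z):.4f}")
```

However large or small z becomes, the probability stays strictly between 0 and 1, which is the property the linear model lacks.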


Logistic function

Symmetric: could model either Pr(Y=1) or Pr(Y=0) and the effect of covariates would be the same, apart from a change of sign

Epidemiologists prefer the logistic model since the OR is easily derived

Mathematically convenient form

$\hat{p}(x) = \dfrac{e^{\hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2 + \dots + \hat\beta_p x_p}}{1 + e^{\hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2 + \dots + \hat\beta_p x_p}}$

Maximum likelihood equations have a simple form

$\sum_{i=1}^{n} x_{ij}\,\bigl(y_i - \hat{p}(x_i)\bigr) = 0$ for each covariate $j$

Need to solve these equations by iteration

Iteration rarely goes awry if the data are not too sparse
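The simple form of the maximum likelihood equations follows in two lines from the Bernoulli log likelihood. The derivation below is a sketch in notation assumed here: $z_i$ is the linear predictor for subject $i$, and $x_{i0} = 1$ carries the intercept.

```latex
\log L(\beta)
  = \sum_{i=1}^{n} \Bigl[ y_i \log p(x_i) + (1 - y_i) \log\bigl(1 - p(x_i)\bigr) \Bigr]
  = \sum_{i=1}^{n} \Bigl[ y_i z_i - \log\bigl(1 + e^{z_i}\bigr) \Bigr],
\qquad z_i = \sum_{j=0}^{p} \beta_j x_{ij}

\frac{\partial \log L}{\partial \beta_j}
  = \sum_{i=1}^{n} x_{ij} \Bigl( y_i - \frac{e^{z_i}}{1 + e^{z_i}} \Bigr)
  = \sum_{i=1}^{n} x_{ij} \bigl( y_i - p(x_i) \bigr)
```

Setting each derivative to zero gives one equation per coefficient. Because $p(x_i)$ is nonlinear in the betas, the equations generally have no closed-form solution, which is why iteration is needed.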


Link functions

Logistic function very close to the probit model (symmetric)

$P(Y=1 \mid Z=z) = \Phi(z)$, where $\Phi$ is the $N(0,1)$ distribution function and

$Z = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p$

Probit model used more in classical diagnostic testing and ROC analysis

Call "logistic" and "probit" link functions since they link P(Y|X) to the covariates through the linear combination

Other link functions include the "complementary log-log" (asymmetrical)

Some general regression programs in Stata expect you to specify the link function

We will assume a logistic link function here but can test in Stata using the linktest command
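How close the logistic and probit curves are can be seen numerically. This sketch compares them over a grid; the rescaling constant 1.702 is a conventional approximation introduced here, not something stated in the lecture, and the function names are illustrative.

```python
import math

def logistic_cdf(z):
    return 1.0 / (1.0 + math.exp(-z))

def probit_cdf(z):
    # Standard normal distribution function, via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def cloglog_inv(z):
    # Inverse complementary log-log link: p = 1 - exp(-exp(z)).
    return 1.0 - math.exp(-math.exp(z))

SCALE = 1.702  # assumed rescaling so the logistic curve tracks the normal one

grid = [i / 10 for i in range(-50, 51)]
max_gap = max(abs(probit_cdf(z) - logistic_cdf(SCALE * z)) for z in grid)
print(f"largest logistic-probit gap on the grid: {max_gap:.4f}")

# Symmetry: p(z) + p(-z) equals 1 for logistic and probit, not for cloglog.
for name, f in (("logistic", logistic_cdf), ("probit", probit_cdf),
                ("cloglog", cloglog_inv)):
    print(f"{name:8s} p(1) + p(-1) = {f(1.0) + f(-1.0):.4f}")
```

After rescaling, the two symmetric links differ by less than one percentage point everywhere on the grid, so they usually lead to the same conclusions.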


Simple estimation example

                 Exposed (X=1)   Unexposed (X=0)   Total
Case (Y=1)            34               18            52
Control (Y=0)         66               82           148
Total                100              100           200

Adopt a logistic model so

$E(Y \mid X) = \Pr(Y=1 \mid X) = \dfrac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}$

For the unexposed, the probability of a case is

$\Pr(Y=1 \mid X=0) = \dfrac{e^{\beta_0}}{1 + e^{\beta_0}}$

For the unexposed, the probability of a control is

$\Pr(Y=0 \mid X=0) = \dfrac{1}{1 + e^{\beta_0}}$

For the exposed, the probability of a case is

$\Pr(Y=1 \mid X=1) = \dfrac{e^{\beta_0 + \beta_1}}{1 + e^{\beta_0 + \beta_1}}$

For the exposed, the probability of being a control is

$\Pr(Y=0 \mid X=1) = \dfrac{1}{1 + e^{\beta_0 + \beta_1}}$


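The likelihood for our data is

$\binom{100}{18}\, p_0^{18} (1 - p_0)^{82} \, \binom{100}{34}\, p_1^{34} (1 - p_1)^{66}$

where $p_0 = \Pr(Y=1 \mid X=0)$ and $p_1 = \Pr(Y=1 \mid X=1)$

The binomial coefficient $\binom{100}{18}$ is the number of ways of choosing 18 cases out of the 100 unexposed, roughly $3.066 \times 10^{19}$

Fortunately, we can drop the binomial coefficients since they do not affect the beta estimates

Maximize $p_0^{18} (1 - p_0)^{82}\, p_1^{34} (1 - p_1)^{66}$ or equivalently

$L = \left(\dfrac{e^{\beta_0}}{1 + e^{\beta_0}}\right)^{18} \left(\dfrac{1}{1 + e^{\beta_0}}\right)^{82} \left(\dfrac{e^{\beta_0 + \beta_1}}{1 + e^{\beta_0 + \beta_1}}\right)^{34} \left(\dfrac{1}{1 + e^{\beta_0 + \beta_1}}\right)^{66}$

It turns out to be more convenient to maximize the log likelihood, log L

In many cases we need to iterate to find the beta estimates that make this a maximum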
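The iteration can be illustrated on the 2x2 table above. This is a minimal Newton-Raphson sketch and not necessarily the algorithm any particular package uses; the function names, starting values, and tolerance are choices made here.

```python
import math

# Grouped data from the 2x2 table: (x, number of cases, group size).
groups = [(0, 18, 100), (1, 34, 100)]

def expit(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_lik(b0, b1):
    """Log likelihood with the binomial coefficients dropped."""
    total = 0.0
    for x, y, n in groups:
        p = expit(b0 + b1 * x)
        total += y * math.log(p) + (n - y) * math.log(1.0 - p)
    return total

def newton_raphson(b0=0.0, b1=0.0, tol=1e-10, max_iter=25):
    for _ in range(max_iter):
        u0 = u1 = i00 = i01 = i11 = 0.0
        for x, y, n in groups:
            p = expit(b0 + b1 * x)
            resid = y - n * p             # observed minus expected cases
            w = n * p * (1.0 - p)         # binomial variance
            u0 += resid                   # score for beta0
            u1 += x * resid               # score for beta1
            i00 += w                      # information matrix entries
            i01 += x * w
            i11 += x * x * w
        det = i00 * i11 - i01 * i01
        d0 = (i11 * u0 - i01 * u1) / det  # step = information^{-1} * score
        d1 = (i00 * u1 - i01 * u0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1

b0_hat, b1_hat = newton_raphson()
print(f"beta0 = {b0_hat:.4f}, beta1 = {b1_hat:.4f}, "
      f"log L = {log_lik(b0_hat, b1_hat):.4f}")
```

With a single binary covariate the model is saturated, so the iteration converges to the closed-form values log(18/82) and log(34/66) - log(18/82).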


The log likelihood is a well-behaved surface that is a function of the betas

The maximum log likelihood is -111.2429 achieved at $\hat\beta_0 = -1.516$, $\hat\beta_1 = 0.853$

For the unexposed, the estimate is

$\Pr(Y=1 \mid X=0) = \dfrac{e^{\hat\beta_0}}{1 + e^{\hat\beta_0}} = \dfrac{e^{-1.516}}{1 + e^{-1.516}} = 0.18$

which is just the proportion of cases in the unexposed group

For the exposed, the estimate is

$\Pr(Y=1 \mid X=1) = \dfrac{e^{\hat\beta_0 + \hat\beta_1}}{1 + e^{\hat\beta_0 + \hat\beta_1}} = \dfrac{e^{-1.516 + 0.853}}{1 + e^{-1.516 + 0.853}} = 0.34$

which is just the proportion of cases in the exposed group


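Now do this under the null hypothesis that exposure does not make a difference, i.e. $H_0: \beta_1 = 0$

Then the likelihood depends on only $\beta_0$

$L = \left(\dfrac{e^{\beta_0}}{1 + e^{\beta_0}}\right)^{18} \left(\dfrac{1}{1 + e^{\beta_0}}\right)^{82} \left(\dfrac{e^{\beta_0}}{1 + e^{\beta_0}}\right)^{34} \left(\dfrac{1}{1 + e^{\beta_0}}\right)^{66}$

Then the log likelihood is a function of $\beta_0$

[Figure: log likelihood (-140 to -115) plotted against $\beta_0$ (-2.2 to -0.2)]

The maximum log likelihood is -114.6114 achieved at $\hat\beta_0 = -1.046$

Ignoring exposure, the estimated probability of being a case is

$\Pr(Y=1) = \dfrac{e^{\hat\beta_0}}{1 + e^{\hat\beta_0}} = \dfrac{e^{-1.046}}{1 + e^{-1.046}} = 0.26$

which is the overall proportion of cases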


Tests comparing nested models

Want to decide if the more complex model is significantly better than the simpler model

Possible tests comparing nested models (complex model includes all covariates of the simpler model)

1. Likelihood ratio test: direct comparison of the difference in log-likelihoods
   Preferred test
   Does not change with reparametrization

2. Score test: test computed at the null hypothesis values
   Very similar to LR test
   Sometimes can be computed when the LR test cannot
   Many of our common tests are score tests

3. Wald test: depends on the normality of the distribution of the estimates
   Can change with reparametrization
   P-value given in Stata output for individual variables
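For the 2x2 exposure example (34 and 18 cases among 100 exposed and 100 unexposed), all three statistics can be computed by hand. The sketch below relies on two standard facts for a single binary covariate: the score test equals the Pearson chi-square, and the standard error of the log odds ratio is the square root of the summed reciprocal cell counts. Variable names are illustrative.

```python
import math

# Cell counts: exposed cases, unexposed cases, exposed controls, unexposed controls.
a, b, c, d = 34, 18, 66, 82
n = a + b + c + d

def binom_loglik(y, m):
    """Maximized binomial log likelihood (coefficient dropped) for y cases in m."""
    p = y / m
    return y * math.log(p) + (m - y) * math.log(1.0 - p)

def chi2_1df_pvalue(x):
    """Upper tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

# Likelihood ratio test: twice the difference in maximized log likelihoods.
ll_null = binom_loglik(a + b, n)
ll_alt = binom_loglik(a, a + c) + binom_loglik(b, b + d)
lr = 2.0 * (ll_alt - ll_null)

# Score test: evaluated at the null fit; here the Pearson chi-square.
score = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Wald test: estimate divided by its standard error.
log_or = math.log(a * d / (b * c))
se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
wald_z = log_or / se

for name, stat in (("LR", lr), ("Score", score), ("Wald", wald_z ** 2)):
    print(f"{name:5s} chi2(1) = {stat:.2f}   p = {chi2_1df_pvalue(stat):.4f}")
```

The three statistics are close but not identical, which is typical when the sample is reasonably large.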


. cci 34 18 66 82

                                                         Proportion
                 |   Exposed   Unexposed  |      Total     Exposed
-----------------+------------------------+------------------------
           Cases |        34          18  |         52       0.6538
        Controls |        66          82  |        148       0.4459
-----------------+------------------------+------------------------
           Total |       100         100  |        200       0.5000
                 |                        |
                 |      Point estimate    |    [95% Conf. Interval]
                 |------------------------+------------------------
      Odds ratio |         2.346801       |    1.162749    4.819547 (exact)
 Attr. frac. ex. |         .5738881       |    .1399689    .7925116 (exact)
 Attr. frac. pop |         .3752345       |
                 +-------------------------------------------------
                               chi2(1) =     6.65  Pr>chi2 = 0.0099

. list

     | case   exp   cnt |
     |------------------|
  1. |    1     1    34 |
  2. |    1     0    18 |
  3. |    0     1    66 |
  4. |    0     0    82 |

. logistic case [fw=cnt], coef

Logistic regression                               Number of obs   =        200
                                                  LR chi2(0)      =       0.00
                                                  Prob > chi2     =          .
Log likelihood = -114.61138                       Pseudo R2       =     0.0000

------------------------------------------------------------------------------
        case |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _cons |  -1.045969   .1612065    -6.49   0.000    -1.361927   -.7300097
------------------------------------------------------------------------------

. predict probnull, pr

. est store A


. logistic case exp [fw=cnt], coef

Logistic regression                               Number of obs   =        200
                                                  LR chi2(1)      =       6.74
                                                  Prob > chi2     =     0.0094
Log likelihood = -111.2429                        Pseudo R2       =     0.0294

------------------------------------------------------------------------------
        case |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         exp |   .8530533   .3351327     2.55   0.011     .1962052    1.509901
       _cons |  -1.516347   .2602896    -5.83   0.000    -2.026506   -1.006189
------------------------------------------------------------------------------

. predict probalt, pr

. est store B

. lrtest A B

Likelihood-ratio test                                 LR chi2(1)  =      6.74
(Assumption: A nested in B)                           Prob > chi2 =    0.0094

. logistic

Logistic regression                               Number of obs   =        200
                                                  LR chi2(1)      =       6.74
                                                  Prob > chi2     =     0.0094
Log likelihood = -111.2429                        Pseudo R2       =     0.0294

------------------------------------------------------------------------------
        case | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         exp |   2.346801   .7864899     2.55   0.011     1.216777    4.526284
------------------------------------------------------------------------------

. list case exp cnt probnull probalt

     | case   exp   cnt   probnull   probalt |
     |---------------------------------------|
  1. |    1     1    34        .26       .34 |
  2. |    1     0    18        .26       .18 |
  3. |    0     1    66        .26       .34 |
  4. |    0     0    82        .26       .18 |


Summary about estimation

Maximum likelihood estimation is preferred for binary outcome data

The log likelihood depends on the betas that in turn depend on what covariates are in the model

In some cases the beta estimates are related to familiar values, but usually have to iterate to get the estimates

Difference in log likelihoods can sometimes test one model against another

Odds ratios turn out to be related to the logistic regression coefficients


Example

Study of identification of domestic violence (DV) in a medical setting (PI: Robert Thompson, MD; Co-PI: Fred Rivara, MD)

Clinics were randomized to be either intervention (2 clinics) or control clinics (3 clinics)

Intervention clinics received training in DV detection and some support services; clinic enrollees received materials

Questions:

1. Did the intervention improve detection?

2. Did it improve the rate of physicians asking about DV?

Sample of patients chosen based on sentinel conditions for DV

                      Intervention   Control   Totals
Asked about DV             278          125       403
Not asked about DV        1094         1895      2989
Totals                    1372         2020      3392


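Intervention group has a 20.3% rate versus 6.2% in the control clinics, which yields a rate ratio of 3.27 and an observed odds ratio

$\hat\psi = \dfrac{278 \times 1895}{125 \times 1094} = 3.85$

Compare the two binomial proportions

Suppose the dataset looks like the following for variables Y, N, TRT

278 1372 1

125 2020 0

Consider a logistic model for TRT

$p = \Pr(Y=1 \mid X_1) = \dfrac{e^{\beta_0 + \beta_1 X_1}}{1 + e^{\beta_0 + \beta_1 X_1}}$

$\text{logit}(p) = \log\dfrac{p}{1 - p} = \log\left[\dfrac{e^{\beta_0 + \beta_1 X_1}}{1 + e^{\beta_0 + \beta_1 X_1}} \Big/ \dfrac{1}{1 + e^{\beta_0 + \beta_1 X_1}}\right] = \log e^{\beta_0 + \beta_1 X_1}$

$\text{logit}(p) = \beta_0 + \beta_1 X_1$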


. blogit y n trt

Logit Estimates                                   Number of obs   =       3392
                                                  chi2(1)         =     152.31
                                                  Prob > chi2     =     0.0000
Log Likelihood = -1160.3808                       Pseudo R2       =     0.0616

------------------------------------------------------------------------------
_outcome |      Coef.   Std. Err.       z     P>|z|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
     trt |   1.348686   .1141858     11.811   0.000       1.124886    1.572485
   _cons |   -2.71866    .092343    -29.441   0.000      -2.899649   -2.537671
------------------------------------------------------------------------------

. blogit y n trt, or

Logit Estimates                                   Number of obs   =       3392
                                                  chi2(1)         =     152.31
                                                  Prob > chi2     =     0.0000
Log Likelihood = -1160.3808                       Pseudo R2       =     0.0616

------------------------------------------------------------------------------
_outcome | Odds Ratio   Std. Err.       z     P>|z|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
     trt |   3.852358   .4398845     11.811   0.000       3.079864     4.81861
------------------------------------------------------------------------------

CI is a Wald confidence interval based on $\hat\beta_1 \pm 1.96 \times se(\hat\beta_1)$

Top right-hand $\chi^2$ is a likelihood ratio test of $H_0: \beta_1 = \beta_2 = \dots = \beta_p = 0$
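The odds ratio and its Wald interval for the DV example can be reproduced directly from the cell counts (278 of 1372 asked in intervention clinics, 125 of 2020 in control clinics). This is a sketch that assumes the usual formula for the standard error of a log odds ratio in a 2x2 table; variable names are illustrative.

```python
import math

asked_int, n_int = 278, 1372   # intervention clinics: asked about DV, total
asked_ctl, n_ctl = 125, 2020   # control clinics: asked about DV, total
not_int = n_int - asked_int    # 1094 not asked
not_ctl = n_ctl - asked_ctl    # 1895 not asked

rate_ratio = (asked_int / n_int) / (asked_ctl / n_ctl)
odds_ratio = (asked_int * not_ctl) / (asked_ctl * not_int)

beta1 = math.log(odds_ratio)   # logistic regression coefficient for trt
se = math.sqrt(1 / asked_int + 1 / asked_ctl + 1 / not_int + 1 / not_ctl)

# Wald interval on the log scale, then exponentiated to the odds ratio scale.
lower = math.exp(beta1 - 1.96 * se)
upper = math.exp(beta1 + 1.96 * se)

print(f"rate ratio = {rate_ratio:.2f}")
print(f"odds ratio = {odds_ratio:.3f}, beta1 = {beta1:.4f}, se = {se:.4f}")
print(f"95% Wald CI for OR: ({lower:.3f}, {upper:.3f})")
```

The interval is symmetric on the log odds scale but not on the odds ratio scale, which is why the reported limits are not equidistant from the point estimate.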

Consider fitting in Stata

. infile y n trt using d:\biost536\data\dv1.dat
(2 observations read)

. list y n trt

  1.  278  1372  1
  2.  125  2020  0

Blocked logit model

blogit { # cases } { denom } { covariates } , [or]


Stata includes standard epidemiologic comparisons (EPITAB)

Incidence rate ratio (using n as the person-time variable)

. ir y trt n

                 | trt                    |
                 |   Exposed   Unexposed  |     Total
-----------------+------------------------+----------
               y |       278         125  |       403
               n |      1372        2020  |      3392
-----------------+------------------------+----------
  Incidence Rate |  .2026239    .0618812  |   .118809
                 |                        |
                 |      Pt. Est.          |  [95% Conf. Interval]
                 |------------------------+----------------------
 Inc. rate diff. |        .1407427        |   .1145701   .1669153
 Inc. rate ratio |        3.274402        |    2.64199    4.07705 (exact)
 Attr. frac. ex. |        .6946008        |   .6214974   .7547246 (exact)
 Attr. frac. pop |        .4791539        |
                 +-----------------------------------------------
                    (midp)   Pr(k>=278) =  0.0000 (exact)
                    (midp) 2*Pr(k>=278) =  0.0000 (exact)


Cumulative incidence risk ratio (using the 2x2 cell entries)

. csi 278 125 1094 1895, exact or

                 |   Exposed   Unexposed  |     Total
-----------------+------------------------+----------
           Cases |       278         125  |       403
        Noncases |      1094        1895  |      2989
-----------------+------------------------+----------
           Total |      1372        2020  |      3392
                 |                        |
            Risk |  .2026239    .0618812  |   .118809
                 |                        |
                 |      Pt. Est.          |  [95% Conf. Interval]
                 |------------------------+----------------------
 Risk difference |        .1407427        |   .1170199   .1644655
      Risk ratio |        3.274402        |   2.681872   3.997846
 Attr. frac. ex. |        .6946008        |   .6271261   .7498653
 Attr. frac. pop |        .4791539        |
      Odds ratio |        3.852358        |   3.080882   4.816955 (Cornfield)
                 +-----------------------------------------------
                        1-sided Fisher's exact P = 0.0000
                        2-sided Fisher's exact P = 0.0000

Note that we get the same risk difference and the same ratio as with the incidence rate analysis, but different confidence intervals

The odds ratio we obtain is the same as derived from the logistic regression


For the second question concerning case finding we have

                      Intervention   Control   Totals
DV identified               36           35        71
DV not identified         1336         1985      3321
Totals                    1372         2020      3392

Intervention group has a 2.6% rate versus 1.7% in the control clinics, which yields a rate ratio of 1.51 and an observed odds ratio $\hat\psi = 1.53$

. list y n trt

  1.   36  1372  1
  2.   35  2020  0

. blogit y n trt, or

Logit Estimates                                   Number of obs   =       3392
                                                  chi2(1)         =       3.11
                                                  Prob > chi2     =     0.0780
Log Likelihood = -343.2194                        Pseudo R2       =     0.0045

------------------------------------------------------------------------------
_outcome | Odds Ratio   Std. Err.       z     P>|z|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
     trt |   1.528229   .3667798      1.767   0.077       .9547672     2.44613
------------------------------------------------------------------------------

We do not see a significant difference, but this could be due to the small sample, so try the incidence ratio with an exact test


. csi 36 35 1336 1985, exact or

                 |   Exposed   Unexposed  |     Total
-----------------+------------------------+----------
           Cases |        36          35  |        71
        Noncases |      1336        1985  |      3321
-----------------+------------------------+----------
           Total |      1372        2020  |      3392
                 |                        |
            Risk |  .0262391    .0173267  |  .0209316
                 |                        |
                 |      Pt. Est.          |  [95% Conf. Interval]
                 |------------------------+----------------------
 Risk difference |        .0089123        |  -.0012817   .0191064
      Risk ratio |        1.514369        |   .9558286   2.399294
 Attr. frac. ex. |         .339659        |  -.0462127   .5832107
 Attr. frac. pop |        .1722214        |
      Odds ratio |        1.528229        |   .9585712   2.436415 (Cornfield)
                 +-----------------------------------------------
                        1-sided Fisher's exact P = 0.0497
                        2-sided Fisher's exact P = 0.0868

No evidence of a significant improvement in case-finding
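Fisher's exact test conditions on the margins of the 2x2 table, so the number of identified cases in the intervention group follows a hypergeometric distribution under the null. The sketch below computes both tail probabilities with exact integer arithmetic. The two-sided definition used here (summing all outcomes no more likely than the observed one) is an assumption about the convention, and the variable names are illustrative.

```python
from math import comb

cases, noncases = 71, 3321   # DV identified / not identified (row totals)
exposed = 1372               # patients in intervention clinics
observed = 36                # identified cases in intervention clinics
total = cases + noncases

denom = comb(total, exposed)

def pmf(k):
    """Hypergeometric probability of k identified cases among the exposed."""
    return comb(cases, k) * comb(noncases, exposed - k) / denom

probs = [pmf(k) for k in range(cases + 1)]

# One-sided: as many or more identified cases than observed.
p_one_sided = sum(probs[observed:])

# Two-sided: all tables no more probable than the observed one
# (small relative tolerance guards against floating-point ties).
cutoff = probs[observed] * (1 + 1e-7)
p_two_sided = sum(p for p in probs if p <= cutoff)

print(f"1-sided exact P = {p_one_sided:.4f}")
print(f"2-sided exact P = {p_two_sided:.4f}")
```

Python integers have arbitrary precision, so the large binomial coefficients are exact and only the final division is rounded.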

Problems in this analysis:

1. Three control clinics may not be similar

2. Two intervention clinics may not have responded in similar ways

3. Clinics may not have been comparable at baseline

4. Have not adjusted for potential confounding variables

5. Did not account for the oversampling of those with sentinel conditions


Confounding

A confounding variable is related to both disease and exposure occurrence in a way that distorts the apparent relationship between exposure and disease

We can adjust for the confounder explicitly by modeling or implicitly through stratification

Two necessary relationships:

1. Confounder must be related to exposure in the data

2. Confounder must be independently related to disease in the population

Example 1: Suppose (1) Age → exposure to menopausal estrogens and (2) Age → endometrial cancer

Should we control for age?

Example 2: (1) menopausal estrogens → endometrial hyperplasia and (2) endometrial hyperplasia → endometrial cancer

Do we want to control for endometrial hyperplasia in studying the association between exposure to menopausal estrogens and endometrial cancer?


Failure to account for confounding can increase or decrease the odds ratio

Hypothetical data from Breslow & Day, Volume 1


Observational cohort study

Total sample size ($N_1 + N_2$) fixed by design

Individual confounder totals $N_1$, $N_2$ may be fixed by design if the confounder is known

Individual exposure totals ($m_{11} + m_{12}$) and ($m_{01} + m_{02}$) may be fixed by design if exposure is known at that time

$m_{11}$, $m_{12}$, $m_{01}$, $m_{02}$ may all be fixed by design

Random variable is disease incidence given exposure and confounder status, i.e. $X_D \sim \text{Binomial}(p)$ where $p$ is the probability of disease given exposure and confounder values

Disease (2 levels), Exposure (2 levels), Confounder (2 levels)

                 Confounder -                     Confounder +
            E        not E    Total          E        not E    Total
D           a_1      b_1      n_11           a_2      b_2      n_12
not D       c_1      d_1      n_01           c_2      d_2      n_02
Total       m_11     m_01     N_1            m_12     m_02     N_2

$X_D$ = 1 disease, 0 no disease

$X_E$ = 1 exposed, 0 unexposed

$X_C$ = 1 confounder +, 0 confounder -


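Model in terms of probabilities

1. $\Pr(X_D=1 \mid X_E=1, X_C=0)$ estimated by $\hat{p}_{10} = a_1 / m_{11}$

2. $\Pr(X_D=1 \mid X_E=0, X_C=0)$ estimated by $\hat{p}_{00} = b_1 / m_{01}$

3. $\Pr(X_D=1 \mid X_E=1, X_C=1)$ estimated by $\hat{p}_{11} = a_2 / m_{12}$

4. $\Pr(X_D=1 \mid X_E=0, X_C=1)$ estimated by $\hat{p}_{01} = b_2 / m_{02}$

Odds ratio for confounder $X_C = 0$

$\psi_1 = \dfrac{\Pr(X_D=1 \mid X_E=1, X_C=0) \,/\, \Pr(X_D=0 \mid X_E=1, X_C=0)}{\Pr(X_D=1 \mid X_E=0, X_C=0) \,/\, \Pr(X_D=0 \mid X_E=0, X_C=0)}$

$\hat\psi_1 = \dfrac{(a_1/m_{11}) \,/\, (c_1/m_{11})}{(b_1/m_{01}) \,/\, (d_1/m_{01})} = \dfrac{a_1 d_1}{b_1 c_1}$

Similarly for confounder $X_C = 1$ then $\hat\psi_2 = \dfrac{a_2 d_2}{b_2 c_2}$

Knowing the two estimates $\hat\psi_1$ and $\hat\psi_2$, we need only two other probabilities to describe the full model

Reparametrization $(\hat{p}_{10}, \hat{p}_{00}, \hat{p}_{11}, \hat{p}_{01}) \rightarrow (\hat\psi_1, \hat\psi_2, \hat{p}_{00}, \hat{p}_{01})$

Can be written in terms of odds ratios specific to each level of the confounder and also the probability of disease given non-exposure at each level of the confounder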

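Now rewrite in terms of the logistic model

$p = \Pr(X_D=1 \mid X_E, X_C) = \dfrac{e^{\alpha_1^* + \alpha_2^* X_C + \beta X_E + \gamma X_C * X_E}}{1 + e^{\alpha_1^* + \alpha_2^* X_C + \beta X_E + \gamma X_C * X_E}}$

$\text{logit}(p) = \log\dfrac{p}{1 - p} = \alpha_1^* + \alpha_2^* X_C + \beta X_E + \gamma X_C * X_E$

Intercept ($\alpha_1^*$)

Main effects for exposure ($\beta$) and the confounder ($\alpha_2^*$)

Interaction of exposure and confounder ($\gamma$)

Log odds ratio is the difference between two logits (exposed versus unexposed)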


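Confounder negative

$\log\psi_1 = \text{logit}\Pr(X_D=1 \mid X_E=1, X_C=0) - \text{logit}\Pr(X_D=1 \mid X_E=0, X_C=0)$

$\log\psi_1 = (\alpha_1^* + \beta) - \alpha_1^* = \beta$, so $\psi_1 = e^{\beta}$

Confounder positive

$\log\psi_2 = \text{logit}\Pr(X_D=1 \mid X_E=1, X_C=1) - \text{logit}\Pr(X_D=1 \mid X_E=0, X_C=1)$

$\log\psi_2 = (\alpha_1^* + \alpha_2^* + \beta + \gamma) - (\alpha_1^* + \alpha_2^*) = \beta + \gamma$, so $\psi_2 = e^{\beta + \gamma}$

If this model is correct, would the Mantel-Haenszel approach give an alternative estimate of the odds ratio?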


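Estimates:

$\hat\alpha_1^*$ estimates $\text{logit}\Pr(X_D=1 \mid X_E=0, X_C=0)$

$\hat\alpha_2^*$ estimates $\text{logit}\Pr(X_D=1 \mid X_E=0, X_C=1) - \text{logit}\Pr(X_D=1 \mid X_E=0, X_C=0)$

$\hat\beta$ estimates $\log\hat\psi_1$

$\hat\gamma$ estimates $\log\hat\psi_2 - \log\hat\psi_1$

Reparametrization $(\hat{p}_{10}, \hat{p}_{00}, \hat{p}_{11}, \hat{p}_{01}) \rightarrow (\hat\alpha_1^*, \hat\alpha_2^*, \hat\beta, \hat\gamma)$

So this is another way to obtain estimates of the probabilities, e.g.

$\hat{p}_{11} = \dfrac{e^{\hat\alpha_1^* + \hat\alpha_2^* + \hat\beta + \hat\gamma}}{1 + e^{\hat\alpha_1^* + \hat\alpha_2^* + \hat\beta + \hat\gamma}}$

Prefer the initial parametrization because we may want to test whether $\gamma = 0$ (no interaction) or $\beta = \gamma = 0$ (no association of exposure with outcome for either confounder level)

The α's give us the baseline levels for the two confounder levels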
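The mapping between the four cell probabilities and the four logistic parameters can be checked with a small numerical example. The counts below are hypothetical, invented purely for illustration. The names `alpha1`, `alpha2`, `beta`, and `gamma` are assumed labels for the intercept, confounder main effect, exposure main effect, and interaction.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def expit(z):
    return 1.0 / (1.0 + math.exp(-z))

# HYPOTHETICAL cohort counts (not from the lecture), laid out as
# a = exposed diseased, b = unexposed diseased,
# c = exposed non-diseased, d = unexposed non-diseased.
a1, b1, c1, d1 = 20, 10, 80, 90   # confounder negative (X_C = 0)
a2, b2, c2, d2 = 45, 30, 55, 70   # confounder positive (X_C = 1)

# Cell probabilities p_{EC}: first index exposure, second index confounder.
p10, p00 = a1 / (a1 + c1), b1 / (b1 + d1)
p11, p01 = a2 / (a2 + c2), b2 / (b2 + d2)

# Stratum-specific odds ratios.
psi1 = a1 * d1 / (b1 * c1)
psi2 = a2 * d2 / (b2 * c2)

# Logistic parametrization of the same four probabilities.
alpha1 = logit(p00)                        # baseline, confounder negative
alpha2 = logit(p01) - logit(p00)           # confounder main effect
beta = logit(p10) - logit(p00)             # log odds ratio when X_C = 0
gamma = (logit(p11) - logit(p01)) - beta   # interaction

print(f"beta = {beta:.4f}          log(psi1) = {math.log(psi1):.4f}")
print(f"beta + gamma = {beta + gamma:.4f}  log(psi2) = {math.log(psi2):.4f}")
print(f"p11 recovered = {expit(alpha1 + alpha2 + beta + gamma):.4f}  p11 = {p11:.4f}")
```

Four parameters describe four probabilities, so the model is saturated and the two descriptions carry exactly the same information.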


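Case-control study

Disease (2 levels), Exposure (2 levels), Confounder (2 levels)

                 Confounder -                     Confounder +
            E        not E    Total          E        not E    Total
D           a_1      b_1      n_11           a_2      b_2      n_12
not D       c_1      d_1      n_01           c_2      d_2      n_02
Total       m_11     m_01     N_1            m_12     m_02     N_2

$X_D$ = 1 disease, 0 no disease

$X_E$ = 1 exposed, 0 unexposed

$X_C$ = 1 confounder +, 0 confounder -

Also need to include sampling fractions

$Z$ = 1 sampled, 0 not sampled

$\Pr(X_D=1 \mid X_E, X_C, Z)$ depends also on the sampling scheme

$\pi_{1 X_C} = \Pr(Z=1 \mid X_D=1, X_C)$: sampling fraction for cases, the probability of being sampled given case status and confounding variables

$\pi_{0 X_C} = \Pr(Z=1 \mid X_D=0, X_C)$: sampling fraction for controls, the probability of being sampled given control status and confounding variables

Typically, $\pi_{1 X_C} > \pi_{0 X_C}$ since cases are scarce, often $\pi_{1 X_C} = 1$

With case-control data we only know exposure status if they get sampled

$\text{logit}(p) = \log\dfrac{p}{1 - p} = \log\dfrac{\Pr(X_D=1 \mid X_E, X_C, Z=1)}{\Pr(X_D=0 \mid X_E, X_C, Z=1)} = \log\dfrac{\Pr(X_D=1, X_E, X_C, Z=1)}{\Pr(X_D=0, X_E, X_C, Z=1)}$

$= \log\left[\dfrac{\Pr(Z=1 \mid X_D=1, X_E, X_C)}{\Pr(Z=1 \mid X_D=0, X_E, X_C)} \cdot \dfrac{\Pr(X_D=1, X_E, X_C)}{\Pr(X_D=0, X_E, X_C)}\right]$

$= \log\dfrac{\pi_{1 X_C}}{\pi_{0 X_C}} + \text{logit}\Pr(X_D=1 \mid X_E, X_C)$

$= \log\dfrac{\pi_{1 X_C}}{\pi_{0 X_C}} + \alpha_1^* + \alpha_2^* X_C + \beta X_E + \gamma X_C * X_E$

The third step assumes that sampling depends only on disease status and the confounder, not on exposure

1st part depends on the sampling and 2nd on exposure and confounders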


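Case-control log odds ratio is the difference between two logits (exposed versus unexposed)

Confounder negative

$\log\psi_1 = \text{logit}\Pr(X_D=1 \mid X_E=1, X_C=0, Z=1) - \text{logit}\Pr(X_D=1 \mid X_E=0, X_C=0, Z=1)$

$\log\psi_1 = \left[\log\dfrac{\pi_{10}}{\pi_{00}} + \alpha_1^* + \beta\right] - \left[\log\dfrac{\pi_{10}}{\pi_{00}} + \alpha_1^*\right] = \beta$

Confounder positive

$\log\psi_2 = \text{logit}\Pr(X_D=1 \mid X_E=1, X_C=1, Z=1) - \text{logit}\Pr(X_D=1 \mid X_E=0, X_C=1, Z=1)$

$\log\psi_2 = \left[\log\dfrac{\pi_{11}}{\pi_{01}} + \alpha_1^* + \alpha_2^* + \beta + \gamma\right] - \left[\log\dfrac{\pi_{11}}{\pi_{01}} + \alpha_1^* + \alpha_2^*\right] = \beta + \gamma$

Therefore, the same odds ratios are obtained from a case-control study as a cohort study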

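The probability model can be estimated from a cohort study where

$\text{logit}(p) = \log\dfrac{p}{1 - p} = \alpha_1^* + \alpha_2^* X_C + \beta X_E + \gamma X_C * X_E$

The case-control probability model is

$\text{logit}(p) = \log\dfrac{p}{1 - p} = \left[\log\dfrac{\pi_{10}}{\pi_{00}} + \alpha_1^*\right] + \left[\log\dfrac{\pi_{11}}{\pi_{01}} - \log\dfrac{\pi_{10}}{\pi_{00}} + \alpha_2^*\right] X_C + \beta X_E + \gamma X_C * X_E$

$= \alpha_1 + \alpha_2 X_C + \beta X_E + \gamma X_C * X_E$

where $\alpha_1 = \log\dfrac{\pi_{10}}{\pi_{00}} + \alpha_1^*$ and $\alpha_2 = \log\dfrac{\pi_{11}}{\pi_{01}} - \log\dfrac{\pi_{10}}{\pi_{00}} + \alpha_2^*$

In the case-control model we estimate $\alpha_1$ and $\alpha_2$, but the sampling fractions are usually not known so we cannot estimate $\alpha_1^*$ and $\alpha_2^*$

Cannot estimate the absolute probability of disease given exposure from a case-control study

1. If $\dfrac{\pi_{11}}{\pi_{01}} = \dfrac{\pi_{10}}{\pi_{00}}$ then $\alpha_2 = \alpha_2^*$ and the actual effect of the confounder can be estimated

2. If $\pi_{11} = \pi_{01}$ and $\pi_{10} = \pi_{00}$ then $\alpha_1 = \alpha_1^*$ and $\alpha_2 = \alpha_2^*$ and the actual absolute estimate of disease probability can be made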
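The effect of case-control sampling on the model can be checked numerically. In the sketch below every number is hypothetical, chosen only for illustration. The dictionaries `pi_case` and `pi_ctrl` hold assumed sampling fractions indexed by confounder level. Sampling is assumed to depend on disease status and the confounder but not on exposure.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def expit(z):
    return 1.0 / (1.0 + math.exp(-z))

# HYPOTHETICAL cohort (population) parameters.
alpha1_star, alpha2_star, beta, gamma = -3.0, 0.5, 0.7, 0.2

# HYPOTHETICAL sampling fractions by confounder level X_C.
pi_case = {0: 1.0, 1: 1.0}     # all cases sampled
pi_ctrl = {0: 0.05, 1: 0.10}   # controls sampled at different rates

def cohort_logit(xe, xc):
    return alpha1_star + alpha2_star * xc + beta * xe + gamma * xc * xe

def case_control_logit(xe, xc):
    """Logit of Pr(disease | exposure, confounder, sampled), via Bayes' rule."""
    p = expit(cohort_logit(xe, xc))
    sampled_cases = pi_case[xc] * p
    sampled_controls = pi_ctrl[xc] * (1.0 - p)
    return logit(sampled_cases / (sampled_cases + sampled_controls))

# Odds ratios are unchanged by the sampling.
log_psi1 = case_control_logit(1, 0) - case_control_logit(0, 0)
log_psi2 = case_control_logit(1, 1) - case_control_logit(0, 1)

# Intercept and confounder effect absorb the sampling fractions.
alpha1 = case_control_logit(0, 0)
alpha2 = case_control_logit(0, 1) - case_control_logit(0, 0)

print(f"log psi1 = {log_psi1:.4f} (beta = {beta})")
print(f"log psi2 = {log_psi2:.4f} (beta + gamma = {beta + gamma})")
print(f"alpha1 = {alpha1:.4f} versus alpha1* = {alpha1_star}")
print(f"alpha2 = {alpha2:.4f} versus alpha2* = {alpha2_star}")
```

Without the sampling fractions there is no way to separate the starred parameters from the offsets, which is why absolute risks are not estimable from the case-control data alone.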


Example of sampling proportions

Age is the confounder equally divided into young and old

Population looks like the following:

              Young    Old    Total
Cases          100     200     300
Not cases      900     800    1700

If we take all cases, i.e. $\pi_{10} = \pi_{11} = 1$, then we have twice as many old cases as young cases

Want to choose 300 controls from those without disease

Choice 1: Frequency matching - 200 old and 100 young controls

$\pi_{00} = \dfrac{100}{900} = 0.11$, $\pi_{01} = \dfrac{200}{800} = 0.25$

Lose the ability to evaluate age effect in a case-control study

Choice 2: Sampling controls without regard to age

$\pi_{00} = \pi_{01} = \dfrac{300}{1700} = 0.176$

Expected number of elderly controls 800*.176 = 141

Expected number of young controls 900*.176 = 159

Can still evaluate the age effect in a case-control study, but may lose efficiency for assessing exposure since we still need to adjust for age
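The sampling fractions and expected control counts in this example are simple ratios, and computing them also shows why frequency matching removes the age effect. This is a sketch using the population counts given in the example; the dictionary and variable names are illustrative.

```python
# Population counts from the example.
cases = {"young": 100, "old": 200}
noncases = {"young": 900, "old": 800}
n_controls = 300

# Choice 1: frequency matching, controls mirror the age split of the cases.
matched_controls = {"young": 100, "old": 200}
pi_matched = {age: matched_controls[age] / noncases[age] for age in noncases}

# Choice 2: controls sampled without regard to age.
pi_random = n_controls / sum(noncases.values())
expected_controls = {age: noncases[age] * pi_random for age in noncases}

# Crude age odds ratio (old versus young), population and matched sample.
or_population = (cases["old"] * noncases["young"]) / (cases["young"] * noncases["old"])
or_matched = (cases["old"] * matched_controls["young"]) / (
    cases["young"] * matched_controls["old"])

print("Choice 1 sampling fractions:",
      {age: round(f, 2) for age, f in pi_matched.items()})
print(f"Choice 2 sampling fraction: {pi_random:.3f}")
print("Choice 2 expected controls:",
      {age: round(n) for age, n in expected_controls.items()})
print(f"Age odds ratio: population {or_population:.2f}, matched sample {or_matched:.2f}")
```

Under frequency matching the age odds ratio in the sample is forced to 1 by design, so the sample carries no information about the age effect even though age is related to disease in the population.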


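Summary about confounding

We can control for confounding by stratification or modeling

For a cohort sample with a binary exposure, binary confounder, and a binary outcome the probability model is

$\text{logit}(p) = \log\dfrac{p}{1 - p} = \alpha_1^* + \alpha_2^* X_C + \beta X_E + \gamma X_C * X_E$, so $\beta = \log\psi_1$ and $\beta + \gamma = \log\psi_2$

For a case-control sample with a binary exposure, binary confounder, and a binary outcome the probability model is

$\text{logit}(p) = \log\dfrac{p}{1 - p} = \alpha_1 + \alpha_2 X_C + \beta X_E + \gamma X_C * X_E$, so $\beta = \log\psi_1$ and $\beta + \gamma = \log\psi_2$

where $\alpha_1 = \log\dfrac{\pi_{10}}{\pi_{00}} + \alpha_1^*$ and $\alpha_2 = \log\dfrac{\pi_{11}}{\pi_{01}} - \log\dfrac{\pi_{10}}{\pi_{00}} + \alpha_2^*$, but the sampling fractions may be unknown

The odds ratios can be estimated from either cohort or case-control studies, but absolute risk estimates can be made only from a cohort study unless the sampling probabilities are known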
