Experimental design and analyses of experimental data
Lesson 6
Logistic regression
Generalized Linear Models (GENMOD)
Logistic regression
• Used when data are dichotomous.
• Used when data are fractions between 0 and 1
Example:
Does predation of eggs in nests of Oyster catcher depend on
• the distance from the nest to the nearest nest of Herring gull?
• the vegetation surrounding the nest?
• the number of eggs in the nest?
Data:
OBS   DIST   EGGS   VEG   KILLED
1     0.5    3      B     3
2     1.0    7      C     5
3     5.7    5      B     1
4     3.8    9      A     6
5     3.0    7      C     5
6     6.1    8      A     3
........
57    3.3    3      A     3
Analysis of dichotomous data:
• Nests are categorized according to whether predation has occurred or not.
• No predation is scored as 0
• Predation is scored as 1
[Figure: Plus/minus predator visit to Oyster catcher nest. x-axis: Distance (m) from nearest Herring gull nest (0 to 10); y-axis: Visit to nest (0 or 1).]
The purpose is to fit a model to the data: a model that predicts the probability of a nest being predated.
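The logistic regression model:

$$\pi_i = \frac{e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p}}{1 + e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p}} = \frac{e^{y}}{1 + e^{y}}$$

where $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p$

and $\varepsilon \sim \mathrm{BIN}(0, \pi(1-\pi))$

The logit-transformation:

$$\ln\left(\frac{\pi}{1-\pi}\right) = y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p$$

The odds (the ratio between the probability of a positive and a negative event): $\pi / (1-\pi)$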
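As a small numerical illustration of the odds and the logit, consider a probability of 0.8. This value is arbitrary and chosen only for the example; it is not taken from the data.

```latex
\pi = 0.8
\;\Rightarrow\;
\frac{\pi}{1-\pi} = \frac{0.8}{0.2} = 4
\;\Rightarrow\;
y = \ln 4 \approx 1.386
\qquad\text{and back:}\qquad
\pi = \frac{e^{1.386}}{1 + e^{1.386}} \approx \frac{4}{1 + 4} = 0.8
```

An odds of 4 means that a positive event is four times as likely as a negative one.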
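$$y = 0: \quad \pi = \frac{e^{0}}{1 + e^{0}} = \frac{1}{1 + 1} = \frac{1}{2}$$

$$y \to -\infty: \quad \pi = \frac{e^{y}}{1 + e^{y}} \to \frac{0}{1 + 0} = 0$$

$$y \to \infty: \quad \pi = \frac{e^{y}}{1 + e^{y}} = \frac{1}{1 + e^{-y}} \to 1$$

So that

$$-\infty < y < \infty \;\Rightarrow\; 0 < \pi < 1$$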
How to do it in SAS
OPTIONS LINESIZE = 90;
DATA logist;
/* Example of logistic regression */
/* The example is inspired by Dorthe Lahrmann's investigations of Oyster catchers (strandskader) on Langli in Ho Bugt */
INFILE 'h:\lin-mod\logist.prn' FIRSTOBS=2;
INPUT dist eggs veg $ killed;
/* dist = distance to the nearest nest of Herring gull (sølvmåge) */
/* eggs = number of Oyster catcher eggs in a nest */
/* veg = vegetation type surrounding an Oyster catcher nest */
/* killed = number of eggs in the nest that were killed */
/* If killed > 0 then the nest has been visited by a predator at least once */
IF killed > 0 THEN visit = 1;
IF killed = 0 THEN visit = 0;
RUN;
/* Example A: Analysis of whether a nest has been visited or not visited by predators, i.e. visit = 1 or 0 */
PROC GENMOD; /* The procedure is Generalized Linear Models */
TITLE 'Eksempel A';
CLASS veg; /* veg is a class variable */
MODEL visit = dist veg / DIST=binomial LINK=logit TYPE3 DSCALE OBSTATS;
/* DIST = distribution function (here chosen as binomial) */
/* LINK = the model uses a logit-transformation of data */
/* TYPE3 = Type 3 analysis is used to evaluate the relative contribution of the different factors to the dependent variable */
/* DSCALE = an option which tells SAS to scale the error in order to meet the demands of the model. If the scale parameter, the square root of Deviance/DF, is approximately 1, scaling is not needed. */
/* OBSTATS = gives the predicted values as well as their confidence limits */
RUN;
Eksempel A    10:19 Thursday, November 22, 2001    87
The GENMOD Procedure

Model Information
Description          Value
Data Set             WORK.LOGIST
Distribution         BINOMIAL
Link Function        LOGIT
Dependent Variable   VISIT
Observations Used    57
Number Of Events     52
Number Of Trials     57

Class Level Information
Class   Levels   Values
VEG     3        A B C
Criteria For Assessing Goodness Of Fit
Criterion            DF   Value      Value/DF
Deviance             53   20.2819    0.3827
Scaled Deviance      53   53.0000    1.0000
Pearson Chi-Square   53   22.2740    0.4203
Scaled Pearson X2    53   58.2057    1.0982
Log Likelihood       .    -26.5000   .

• Value: these values indicate the fit of the model. Low values (for a given DF) indicate a good fit.
• Value/DF: these values should be close to unity if the model’s assumptions are met. Values less than unity indicate underdispersion (variance less than expected). Values greater than unity indicate overdispersion (variance greater than expected).
• Scaled Deviance and Scaled Pearson X2: values after scaling with DSCALE.
Analysis Of Parameter Estimates
Parameter   DF   Estimate   Std Err   ChiSquare   Pr>Chi
INTERCEPT   1    8.5639     2.1271    16.2093     0.0001
DIST        1    -1.0032    0.2651    14.3173     0.0002
VEG A       1    0.2489     0.9555    0.0678      0.7945
VEG B       1    0.4370     0.9250    0.2232      0.6366
VEG C       0    0.0000     0.0000    .           .
SCALE       0    0.6186     0.0000    .           .
NOTE: The scale parameter was estimated by the square root of DEVIANCE/DOF.

LR Statistics For Type 3 Analysis
Source   NDF   DDF   F         Pr>F     ChiSquare   Pr>Chi
DIST     1     53    34.8596   0.0001   34.8596     0.0001
VEG      2     53    0.1118    0.8944   0.2237      0.8942
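The NOTE on the scale parameter can be checked by hand against the goodness-of-fit table for the same model (Deviance 20.2819, Pearson Chi-Square 22.2740, DF 53). The arithmetic below is only a verification sketch using those printed, rounded values.

```latex
\text{SCALE} = \sqrt{\frac{\text{Deviance}}{\text{DF}}}
             = \sqrt{\frac{20.2819}{53}}
             = \sqrt{0.3827} \approx 0.6186

\text{Scaled Deviance} = \frac{20.2819}{0.3827} \approx 53.00
\qquad
\text{Scaled Pearson } X^2 = \frac{22.2740}{0.3827} \approx 58.2
```

The same relation shows up in the Type 3 table, where F equals ChiSquare divided by NDF: for VEG, 0.2237 / 2 ≈ 0.1118.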
Criteria For Assessing Goodness Of Fit
Criterion            DF   Value      Value/DF
Deviance             55   20.3675    0.3703
Scaled Deviance      55   55.0000    1.0000
Pearson Chi-Square   55   21.6364    0.3934
Scaled Pearson X2    55   58.4265    1.0623
Log Likelihood       .    -27.5000   .

Analysis Of Parameter Estimates
Parameter   DF   Estimate   Std Err   ChiSquare   Pr>Chi
INTERCEPT   1    8.8288     2.0182    19.1363     0.0001
DIST        1    -1.0012    0.2587    14.9777     0.0001
SCALE       0    0.6085     0.0000    .           .
NOTE: The scale parameter was estimated by the square root of DEVIANCE/DOF.

LR Statistics For Type 3 Analysis
Source   NDF   DDF   F         Pr>F     ChiSquare   Pr>Chi
DIST     1     55    36.4999   0.0001   36.4999     0.0001
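This output contains only INTERCEPT and DIST, which is consistent with Example A refitted without veg. The slides do not show that call, so the following is a sketch of what it might look like, assuming the same options as Example A. The title text and the DATA= and DESCENDING options are assumptions added here. In recent SAS releases a single 0/1 response is, by default, modelled on its lower level, so DESCENDING may be needed to model visit = 1 as in the output above.

```sas
/* Sketch (assumed call, not shown in the slides):                  */
/* Example A refitted with dist as the only explanatory variable.   */
PROC GENMOD DATA=logist DESCENDING; /* DESCENDING: model P(visit = 1); assumption for recent SAS releases */
TITLE 'Eksempel A, dist only';      /* hypothetical title */
MODEL visit = dist / DIST=binomial LINK=logit TYPE3 DSCALE OBSTATS;
RUN;
```

Dropping veg is in line with its Type 3 test in the full model (Pr>Chi = 0.8942).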
Observation Statistics
VISIT   Pred     Xbeta     Std      HessWgt    Lower    Upper    Resraw
1       0.9998   8.3283    1.8909   0.000652   0.9903   1.0000   0.000242
1       0.9996   7.8277    1.7639   0.001075   0.9875   1.0000   0.000398
1       0.9578   3.1222    0.6185   0.1091     0.8710   0.9871   0.0422
1       0.9935   5.0244    1.0628   0.0175     0.9498   0.9992   0.006533
1       0.9971   5.8253    1.2605   0.007924   0.9663   0.9998   0.002943
1       0.9383   2.7217    0.5356   0.1563     0.8418   0.9775   0.0617
1       0.9971   5.8253    1.2605   0.007924   0.9663   0.9998   0.002943
1       0.9973   5.9255    1.2854   0.007173   0.9679   0.9998   0.002663
0       0.3358   -0.6822   0.5813   0.6023     0.1392   0.6123   -0.3358
1       0.9764   3.7229    0.7525   0.0622     0.9045   0.9945   0.0236
0       0.7150   ..........................................
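The Xbeta and Pred columns can be reproduced by hand from the parameter estimates. The sketch below assumes the DIST-only estimates for Example A (INTERCEPT 8.8288, DIST -1.0012) and the distances of the first three nests in the data listing (0.5, 1.0 and 5.7). The data set name pred_check and the variable names b0 and b1 are made up for the illustration.

```sas
/* Sketch: recompute the linear predictor and the predicted probability */
DATA pred_check;                 /* hypothetical data set name */
b0 = 8.8288;                     /* INTERCEPT estimate, DIST-only model */
b1 = -1.0012;                    /* DIST estimate, DIST-only model */
DO dist = 0.5, 1.0, 5.7;         /* distances of the first three nests */
  xbeta = b0 + b1*dist;                  /* linear predictor y */
  pred = EXP(xbeta) / (1 + EXP(xbeta));  /* inverse logit */
  OUTPUT;
END;
RUN;
PROC PRINT DATA=pred_check;
RUN;
```

The printed xbeta and pred values should come out close to the first three rows of the Observation Statistics; small differences in the last decimal are expected because the estimates used here are rounded.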
[Figure: Predicted values and 95% confidence limits. x-axis: Distance (m) from nearest Herring gull nest (0 to 10); y-axis: Visit to nest (0.00 to 1.00).]
/* Example B: Analysis of the fraction of eggs in a nest that are lost */
PROC GENMOD; /* The procedure is Generalized Linear Models */
TITLE 'Eksempel B';
CLASS veg; /* veg is a class variable */
MODEL killed/eggs = dist veg eggs / DIST=binomial LINK=logit TYPE3 DSCALE OBSTATS;
/* DIST = distribution function (here chosen as binomial) */
/* LINK = the model uses a logit-transformation of data */
/* TYPE3 = Type 3 analysis is used to determine the contribution of the individual factors to the dependent variable */
/* DSCALE = option that can be used if Deviance/DF is different from 1.
It reduces the risk of Type I errors if the scale parameter is > 1
and the risk of Type II errors if the scale parameter is < 1 */
/* OBSTATS = gives the predicted values and their confidence limits */
RUN;
Note that this procedure takes the absolute number of eggs killed out of the total number of eggs into consideration, and not merely the proportion of killed eggs.
Eksempel B    12:26 Thursday, November 22, 2001    7
The GENMOD Procedure

Model Information
Description          Value
Data Set             WORK.LOGIST
Distribution         BINOMIAL
Link Function        LOGIT
Dependent Variable   KILLED
Dependent Variable   EGGS
Observations Used    57
Number Of Events     183
Number Of Trials     336

Class Level Information
Class   Levels   Values
VEG     3        A B C
Criteria For Assessing Goodness Of Fit
Criterion            DF   Value       Value/DF
Deviance             52   53.9491     1.0375
Scaled Deviance      52   52.0000     1.0000
Pearson Chi-Square   52   44.1413     0.8489
Scaled Pearson X2    52   42.5465     0.8182
Log Likelihood       .    -171.3777   .
Analysis Of Parameter Estimates
Parameter   DF   Estimate   Std Err   ChiSquare   Pr>Chi
INTERCEPT   1    2.6437     0.5644    21.9369     0.0001
DIST        1    -0.5284    0.0623    71.9060     0.0001
VEG A       1    0.1425     0.3629    0.1541      0.6946
VEG B       1    0.1623     0.3602    0.2029      0.6524
VEG C       0    0.0000     0.0000    .           .
EGGS        1    -0.0314    0.0637    0.2433      0.6219
SCALE       0    1.0186     0.0000    .           .
NOTE: The scale parameter was estimated by the square root of DEVIANCE/DOF.

LR Statistics For Type 3 Analysis
Source   NDF   DDF   F         Pr>F     ChiSquare   Pr>Chi
DIST     1     52    97.2164   0.0001   97.2164     0.0001
VEG      2     52    0.1135    0.8929   0.2271      0.8927
EGGS     1     52    0.2443    0.6232   0.2443      0.6211
Criteria For Assessing Goodness Of Fit
Criterion            DF   Value       Value/DF
Deviance             55   54.5182     0.9912
Scaled Deviance      55   55.0000     1.0000
Pearson Chi-Square   55   45.0882     0.8198
Scaled Pearson X2    55   45.4867     0.8270
Log Likelihood       .    -179.6600   .

Analysis Of Parameter Estimates
Parameter   DF   Estimate   Std Err   ChiSquare   Pr>Chi
INTERCEPT   1    2.5156     0.2950    72.7128     0.0001
DIST        1    -0.5212    0.0589    78.3656     0.0001
SCALE       0    0.9956     0.0000    .           .
NOTE: The scale parameter was estimated by the square root of DEVIANCE/DOF.

LR Statistics For Type 3 Analysis
Source   NDF   DDF   F          Pr>F     ChiSquare   Pr>Chi
DIST     1     55    107.8859   0.0001   107.8859    0.0001
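One way to read the DIST-only estimates for Example B (INTERCEPT 2.5156, DIST -0.5212) is through the odds. The two quantities below are an illustrative sketch computed from the printed, rounded estimates; they are not part of the SAS output.

```latex
% Change in the odds of an egg being killed per extra metre of distance
e^{\beta_1} = e^{-0.5212} \approx 0.594

% Distance at which the predicted fraction of killed eggs is 0.5 (y = 0)
\beta_0 + \beta_1 \, \text{dist} = 0
\;\Rightarrow\;
\text{dist} = -\frac{\beta_0}{\beta_1} = \frac{2.5156}{0.5212} \approx 4.83 \text{ m}
```

Under this model, each additional metre from the nearest Herring gull nest multiplies the odds that an egg is killed by about 0.59.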
[Figure: Predicted values and 95% confidence limits. x-axis: Distance (m) from nearest Herring gull nest (0 to 10); y-axis: Fraction of eggs removed (0.0 to 1.0).]
Criteria For Assessing Goodness Of Fit
Criterion            DF   Value       Value/DF
Deviance             52   53.9491     1.0375
Scaled Deviance      52   52.0000     1.0000
Pearson Chi-Square   52   44.1413     0.8489
Scaled Pearson X2    52   42.5465     0.8182
Log Likelihood       .    -171.3777   .
What is this?
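The likelihood function

A nest contains n eggs of which r are eaten by predators. The probability that a given egg is eaten is denoted π. The probability that exactly r of the eggs are killed is given by the binomial distribution:

$$P(r) = \binom{n}{r} \pi^{r} (1-\pi)^{n-r}$$

where

$$\pi = \frac{e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p}}{1 + e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p}}$$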
where
r1 = number of killed eggs out of n1 eggs in the first nest
r2 = number of killed eggs out of n2 eggs in the second nest
ri = number of killed eggs out of ni eggs in the ith nest

The probability of observing exactly r1, r2, ..., ri, ..., rk events is

$$P(r_1) = \binom{n_1}{r_1} \pi_1^{r_1} (1-\pi_1)^{n_1-r_1}$$

times

$$P(r_2) = \binom{n_2}{r_2} \pi_2^{r_2} (1-\pi_2)^{n_2-r_2}$$

times

$$P(r_3) = \binom{n_3}{r_3} \pi_3^{r_3} (1-\pi_3)^{n_3-r_3}$$

and so on, with the general term

$$P(r_i) = \binom{n_i}{r_i} \pi_i^{r_i} (1-\pi_i)^{n_i-r_i}$$

$$L = P(r_1) P(r_2) P(r_3) \cdots P(r_i) \cdots P(r_k) = \prod_{i=1}^{k} P(r_i)$$

Log-likelihood function:

$$\ln L = \ln P(r_1) + \ln P(r_2) + \ln P(r_3) + \dots + \ln P(r_i) + \dots + \ln P(r_k) = \sum_{i=1}^{k} \ln P(r_i)$$
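Writing out ln P(ri) with the binomial formula for each nest gives the log-likelihood in a form that shows where the parameters enter. This is only an expansion of the definitions on this slide; the first term does not depend on the β's.

```latex
\ln L = \sum_{i=1}^{k} \left[ \ln \binom{n_i}{r_i}
        + r_i \ln \pi_i
        + (n_i - r_i) \ln (1 - \pi_i) \right]
```

Because the binomial coefficients are constants with respect to the parameters, only the last two terms matter when searching for the maximum.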
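Maximum likelihood

The parameters of

$$\pi_i = \frac{e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p}}{1 + e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p}}$$

are found as the values that maximize the likelihood of observing exactly r1, r2, ..., ri, ... positive events out of n1, n2, ..., ni, ... events.

The maximum value of L can be found by differentiating L with respect to β0, β1, ..., βp and setting the derivatives equal to 0.

This is the same as differentiating ln L with respect to the same parameters:

$$\frac{\partial \ln L}{\partial \beta_0} = 0, \quad \frac{\partial \ln L}{\partial \beta_1} = 0, \quad \frac{\partial \ln L}{\partial \beta_2} = 0, \quad \dots, \quad \frac{\partial \ln L}{\partial \beta_p} = 0$$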
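For the binomial model with the logit link, these derivatives take a simple form. The sketch below uses the notation x_{ij} for the value of the jth explanatory variable in nest i, with x_{i0} = 1 for the intercept. That notation is an assumption introduced here and is not used elsewhere in the slides.

```latex
\frac{\partial \ln L}{\partial \beta_j}
  = \sum_{i=1}^{k} \left[ r_i (1 - \pi_i) - (n_i - r_i)\, \pi_i \right] x_{ij}
  = \sum_{i=1}^{k} \left( r_i - n_i \pi_i \right) x_{ij} = 0,
  \qquad j = 0, 1, \dots, p
```

Each equation balances the observed number of killed eggs against the expected number, weighted by the explanatory variable. The equations have no closed-form solution in general, so they are solved iteratively.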