Modeling a Multinomial Response
Bruce A Craig
Department of Statistics, Purdue University
Reading: Faraway Ch. 7, Agresti Ch. 7, KNNL Ch. 14
STAT 526 Topic 11 1
Multinomial Distribution
Model for discrete variable with two or more categories
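Probability distribution:
\[ Y = (Y_1, \ldots, Y_c) \sim \text{Multinomial}(n, p_1, \ldots, p_{c-1}) \]
n is considered known (number of trials)
\[ p_c = 1 - \sum_{j=1}^{c-1} p_j \]
\[ p(y_1, y_2, \ldots, y_c) = \left( \frac{n!}{y_1!\, y_2! \cdots y_c!} \right) p_1^{y_1} p_2^{y_2} \cdots p_c^{y_c} \]
\[ E(Y_j) = n p_j, \quad \mathrm{Var}(Y_j) = n p_j (1 - p_j), \quad \mathrm{Cov}(Y_j, Y_k) = -n p_j p_k \ (j \neq k) \]
Marginal dist for each $Y_j$ is $B(n, p_j)$
Log-likelihood:
\[ l(p) = \sum_{j=1}^{c} y_j \log p_j \]
Maximum Likelihood Estimator: $\hat{p}_j = y_j / n$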
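The moment and MLE formulas above can be checked numerically. The following is a minimal sketch, assuming hypothetical values n = 50 and p = (0.2, 0.3, 0.5) that are not from the slides; it uses only base R's `rmultinom`.

```r
set.seed(526)
n <- 50                       # number of trials (hypothetical value)
p <- c(0.2, 0.3, 0.5)         # hypothetical cell probabilities, c = 3

# One multinomial observation and its MLE p-hat_j = y_j / n
y     <- drop(rmultinom(1, size = n, prob = p))
p_hat <- y / n

# Log-likelihood kernel l(p) = sum_j y_j log p_j (cells with y_j = 0 contribute 0)
loglik <- function(q) sum(y[y > 0] * log(q[y > 0]))
c(at_truth = loglik(p), at_mle = loglik(p_hat))

# Monte Carlo check of Cov(Y_1, Y_2) = -n p_1 p_2
Y <- rmultinom(20000, size = n, prob = p)   # one draw per column
c(simulated = cov(Y[1, ], Y[2, ]), theory = -n * p[1] * p[2])
```

The log-likelihood at the MLE is never smaller than at the true p, and the simulated covariance should be close to the theoretical value.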
STAT 526 Topic 11 2
Multinomial GLM Models
Now consider set of I Multinomial(ni ,pi) observations
Goal now to link predictors Xi to pi
As with binomial setting, can encounter data that are
  Grouped: $n_i > 1$
  Ungrouped: $n_i = 1$
Predictors $X_i$ may be continuous or discrete
Unlike binomial setting, need to distinguish between
  Ordered categories for Y → cumulative logit model
  Nominal categories for Y → multinomial logit model
STAT 526 Topic 11 3
Example 1: Math Aptitude
Predicting a college freshman’s math aptitude given their mathematics PSAT score in 10th grade.
Response: Aptitude Grade: 4 ordered levels
Predictor: PSAT score: continuous (10-pt increments)
STAT 526 Topic 11 4
Example 1: Math Aptitude
Aptitude grade (Y) positively related to math score (X)
Overlap in math scores across grades means there is some uncertainty in predicting Y
In this example, $n_i = 1$ for $i = 1, 2, \ldots, I = 500$ students
Interested in conditional probs $P(Y = j \mid x_i)$
With ordered response, often easier to work with the cumulative probabilities
\[ P(Y \le j \mid x_i) = \sum_{k \le j} P(Y = k \mid x_i) \]
STAT 526 Topic 11 5
Proportional Odds Model
Also called the cumulative logit model
\[ \log\left( \frac{P(Y_i \le j \mid X_i)}{1 - P(Y_i \le j \mid X_i)} \right) = \theta_j - X_i\beta \]
Only parameters $\theta_j$ depend on level j
  They are monotonically increasing with j
Parameters β describe how the cumulative log-odds change with X
  Odds($X_i$)/Odds($X_{i'}$) does not depend on level j
  A β > 0 means a larger x increases probability of larger response j (positive association)
Like fitting logistic model for each j but β same
As with binary setting, can consider other link functions
STAT 526 Topic 11 6
Latent Variable Motivation
Similar to binary setting, can consider a latent variable to motivate the GLM model
For the math aptitude example, we could consider there to be a latent continuous variable Z associated with the aptitude grade that is linearly related to their math score
\[ Z_i = \beta_0 + \beta_1 x_i + \varepsilon_i \]
Instead of observing $Z_i$, we observe
\[ Y_i = \begin{cases} A & Z_i > c_3 \\ B & c_2 < Z_i < c_3 \\ C & c_1 < Z_i < c_2 \\ D & Z_i < c_1 \end{cases} \]
Can compute the $P(Y_i = j \mid x_i)$ using specified dist of ε
STAT 526 Topic 11 7
Motivation Continued
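Using $c_J = \infty$ and writing $\theta_j = c_j - \beta_0$,
\begin{align*}
P(Y_i \le j \mid x_i) &= P(Z_i < c_j) \\
 &= P(\beta_0 + \beta_1 x_i + \varepsilon_i \le c_j) \\
 &= P(\varepsilon_i \le c_j - \beta_0 - \beta_1 x_i) \\
 &= F_\varepsilon(\theta_j - \beta_1 x_i)
\end{align*}
If F is the CDF of the logistic distribution,
\[ P(Y_i \le j \mid x_i) = \frac{\exp\{\theta_j - \beta_1 x_i\}}{1 + \exp\{\theta_j - \beta_1 x_i\}} \]
Can use Normal or Gumbel distributions to motivate probit or complementary log-log link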
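The latent-variable argument can be illustrated by simulation. This is a minimal sketch, assuming hypothetical values for β0, β1, the cutpoints, and x (none are from the slides): it draws logistic errors, cuts Z at the cutpoints, and compares the empirical cumulative proportions with the logistic CDF evaluated at θ_j − β1 x.

```r
set.seed(11)
n     <- 100000
beta0 <- -1                    # hypothetical latent intercept
beta1 <- 0.5                   # hypothetical latent slope
cuts  <- c(-1, 0.5, 2)         # hypothetical cutpoints c_1 < c_2 < c_3
x     <- 1.2                   # a single hypothetical covariate value

# Latent variable with logistic errors; category j is observed when c_{j-1} <= Z < c_j
z <- beta0 + beta1 * x + rlogis(n)
y <- findInterval(z, cuts) + 1                    # categories 1..4

# Empirical P(Y <= j | x) versus the model F(theta_j - beta1 * x), theta_j = c_j - beta0
emp   <- as.numeric(cumsum(table(factor(y, levels = 1:4)))[1:3]) / n
theta <- cuts - beta0
model <- plogis(theta - beta1 * x)
round(rbind(empirical = emp, model = model), 4)
```

Replacing `rlogis` and `plogis` with `rnorm` and `pnorm` would give the corresponding check for the probit link.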
STAT 526 Topic 11 8
Interpreting the Model Parameters
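Using the logit link, the cumulative log-odds are
\[ \log\left( \frac{P(Y \le j \mid X)}{P(Y > j \mid X)} \right) = \theta_j - X\beta \]
Interpretation of a β (holding all other x constant, so that $\theta_j^*$ absorbs $\theta_j$ and the fixed terms)
\begin{align*}
\log\left( \frac{P(Y \le j \mid x + \delta)}{P(Y > j \mid x + \delta)} \div \frac{P(Y \le j \mid x)}{P(Y > j \mid x)} \right)
 &= \log\left( \frac{P(Y \le j \mid x + \delta)}{P(Y > j \mid x + \delta)} \right) - \log\left( \frac{P(Y \le j \mid x)}{P(Y > j \mid x)} \right) \\
 &= \theta_j^* - \beta(x + \delta) - (\theta_j^* - \beta x) = -\beta\delta
\end{align*}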
Change is proportional to the change in x for all j
STAT 526 Topic 11 9
Inference
Similar inference as in logistic regression when focusing on cumulative probs
  Wald / LR / Score tests for model parameters
  Pearson χ2, Deviance for goodness of fit
β invariant to the number of response categories
Predicting Y now involves a vector of probabilities
  Easiest to first compute cumulative probabilities and then use subtraction to get the probability vector
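The subtraction step can be sketched in a few lines of R. The intercepts and slope below are the estimates reported in the math aptitude `polr` output later in these slides; the PSAT score of 600 is a hypothetical value chosen only for illustration.

```r
# Estimates from the math aptitude fit: logit P(Y <= j | x) = zeta_j - beta * x
zeta <- c("A|B" = -11.5391, "B|C" = -8.8652, "C|D" = -6.3311)
beta <- -0.01792
psat <- 600                                  # hypothetical PSAT score

# Cumulative probabilities P(Y <= j | x), with P(Y <= J | x) = 1 appended
cum <- c(plogis(zeta - beta * psat), 1)

# Category probabilities by differencing adjacent cumulative probabilities
prob <- diff(c(0, cum))
names(prob) <- c("A", "B", "C", "D")
round(prob, 4)
```

The predicted grade is the category with the largest entry of `prob`; the entries sum to one by construction.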
STAT 526 Topic 11 10
Maximum Likelihood Estimation
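Let $p_j(X) = P(Y \le j \mid X) - P(Y \le j-1 \mid X)$, with $P(Y \le 0 \mid X) = 0$ and $P(Y \le J \mid X) = 1$
$Y_i$ is vector of length J with one 1 and remainder 0’s
Log-likelihood for the ith observation is:
\begin{align*}
l_i &= \log \prod_{j=1}^{J} p_j(x_i)^{y_{ij}} \\
 &= \sum_{j=1}^{J} y_{ij} \log\left[ P(Y \le j \mid x_i) - P(Y \le j-1 \mid x_i) \right] \\
 &= \sum_{j=1}^{J} y_{ij} \log\left[ \frac{\exp\{\theta_j - x_i\beta\}}{1 + \exp\{\theta_j - x_i\beta\}} - \frac{\exp\{\theta_{j-1} - x_i\beta\}}{1 + \exp\{\theta_{j-1} - x_i\beta\}} \right]
\end{align*}
Maximize $\sum_{i=1}^{I} l_i$ with respect to $\theta_j$ and β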
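As a check on the log-likelihood above, the sketch below evaluates it by hand at the `polr` estimates and compares −2 times the result with the deviance that `polr` reports. It assumes the `housing` data shipped with MASS (used later in these slides) with `Freq` as case weights; the choice of this data set for the check is an assumption for illustration.

```r
library(MASS)   # polr() and the housing data

fit <- polr(Sat ~ Infl + Type + Cont, weights = Freq, data = housing)

# Linear predictor x_i * beta (polr has no intercept column; cutpoints play that role)
X   <- model.matrix(~ Infl + Type + Cont, data = housing)[, names(coef(fit))]
eta <- drop(X %*% coef(fit))

# Cutpoints padded with theta_0 = -Inf and theta_J = +Inf
theta <- c(-Inf, fit$zeta, Inf)
y     <- as.integer(housing$Sat)             # observed category index 1..J

# p_{y_i}(x_i) = F(theta_y - eta) - F(theta_{y-1} - eta), F = logistic CDF
p      <- plogis(theta[y + 1] - eta) - plogis(theta[y] - eta)
loglik <- sum(housing$Freq * log(p))

c(minus2loglik = -2 * loglik, polr_deviance = fit$deviance)
```

The two numbers should agree, since `polr` reports −2 times the maximized log-likelihood as its residual deviance.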
STAT 526 Topic 11 11
Calculating the Residual Deviance
Residual deviance
\[ G^2 = 2 \sum_{i=1}^{I} \sum_{j=1}^{J} y_{ij} \log\left( \frac{1}{\hat{p}_j(x_i)} \right) = -2 \sum_{i=1}^{I} \sum_{j=1}^{J} y_{ij} \log \hat{p}_j(x_i) \]
Degrees of freedom are the difference between model dfs
  # params in saturated model (= # observations)
  # params in reduced model (= # of intercepts + # predictors)
Residual deviance degrees of freedom for math aptitude study are 500 − 4 = 496
STAT 526 Topic 11 12
Example: Math Aptitude
In R: Use polr in library MASS
> library(MASS)
> fit = polr(grade1 ~ psat,mathapt)
> summary(fit)
Coefficients:
        Value Std. Error t value
psat -0.01792    0.00124  -14.46
Intercepts:
       Value Std. Error  t value
A|B -11.5391     0.6912 -16.6944
B|C  -8.8652     0.5930 -14.9492
C|D  -6.3311     0.5196 -12.1834
Residual Deviance: 981.8668
AIC: 989.8668
***polr fits logit P(Y ≤ j) = int_j − XB, so be wary of sign when comparing with software that uses int_j + XB
Grade A associated with j = 1 and Grade D with j = 4. That is why these are negative coefficients (see note above)
STAT 526 Topic 11 13
Results: Math Aptitude
Deviance is 981.9 on 496 df (from fit$df.residual)
Similar to Bernoulli distribution (ungrouped), this deviance should not be used to assess goodness of fit
  Better to use extension of H-L test or Lipsitz test
Assessment of β: For each 10-pt increase in score, the odds of being > j versus ≤ j decrease 16.4% ($1 - e^{-0.1792} \approx 0.164$)
> exp(10*confint(fit))
Waiting for profiling to be done...
Re-fitting to get Hessian
2.5 % 97.5 %
0.8154889 0.8558026
STAT 526 Topic 11 14
Extension of Hosmer-Lemeshow
Score each observation and then group on these scores
\[ s_i = \hat{p}_{i1} + 2\hat{p}_{i2} + \cdots + J\hat{p}_{iJ} \]
\[ \sum_{k=1}^{C} \sum_{j=1}^{J} (O_{kj} - E_{kj})^2 / E_{kj} \sim \chi^2_{df} \]
C represents the number of (equal-sized) groups
For this model, df = (C − 2)(J − 1) + (J − 2)
The additional J − 2 df are due to the reduced number of parameters relative to a multinomial model
Note that our Bernoulli version considers J = 2
STAT 526 Topic 11 15
Using R
Can still use logitgof function
> library(generalhoslem)
> logitgof(mathapt$grade1,ord=TRUE,fitted(fit))
Hosmer and Lemeshow test (ordinal model)
data: grade1, fitted(fit)
X-squared = 29.356, df = 26, p-value = 0.2952
Warning message:
In logitgof(grade1, ord = TRUE, fitted(fit)) :
At least one cell in the expected frequencies table is < 1.
Chi-square approximation may be incorrect.
Small expected counts will be a problem in this example because there is little overlap in math scores between A and D aptitude students
STAT 526 Topic 11 16
Grouped Table - Using Deciles
          Y = 1         Y = 2         Y = 3         Y = 4
Group    O      E      O      E      O      E      O      E    Total
1       25  21.34     21  23.81      3   4.42      1   0.43      50
2       11   9.75     28  32.38     18  13.28      0   1.59      57
3        3   3.85     23  21.93     16  16.69      3   2.53      45
4        2   2.72     21  19.78     20  22.30      6   4.20      49
5        0   2.46     26  21.26     33  34.73      8   8.56      67
6        0   0.75      8   7.55     21  18.38      4   6.33      33
7        1   0.88      6   9.66     31  32.08     20  15.38      58
8        0   0.39      5   4.68     17  21.79     21  16.14      43
9        0   0.25      4   3.19     23  21.63     30  31.93      57
10       0   0.06      0   0.78     13   7.35     28  32.81      41
STAT 526 Topic 11 17
Lipsitz Test
As with H-L test, sort data into C groups
  Define C − 1 group indicator variables
  Fit new ordinal logistic regression
\[ \log\left( \frac{P(Y \le j \mid X)}{P(Y > j \mid X)} \right) = \theta_j - X\beta + \gamma_1 I_1 + \cdots + \gamma_{C-1} I_{C-1} \]
Use the likelihood ratio test to test $H_0: \gamma_1 = \cdots = \gamma_{C-1} = 0$
Recommend C be such that $6 \le C < N/(5J)$
STAT 526 Topic 11 18
Using R
Can use lipsitz.test function in generalhoslem
> library(generalhoslem)
> lipsitz.test(fit)
Lipsitz goodness of fit test for ordinal response models
data: formula: grade1 ~ psat
LR statistic = 11.226, df = 9, p-value = 0.2605
Tends to have rejection rates > α in small samples
Works best when covariates are continuous
STAT 526 Topic 11 19
Example 2: Dose Response
Effect of intravenous medication doses on patients with subarachnoid hemorrhage trauma (p. 207, OrdCDA)

                       Glasgow Outcome Scale (Y)
Treatment              Veget.   Major    Minor    Good
Group (X)      Death   State    Disab.   Disab.   Recov.
Placebo           59      25       46       48       32
Low dose          48      21       44       47       30
Medium dose       44      14       54       64       31
High dose         43       4       49       58       41

Response: Glasgow Outcome Scale - Ordered
Predictor: Dose level - Ordered
Similar to setting for linear-by-linear association model
Focus, however, is on predicting Y
So how do we treat the levels of X? Score them?
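For readers who want to follow the analyses in R, the table can be entered as a long-format data frame. This is a minimal sketch: the column names `dose`, `outcome`, and `count` match the ones used in the R calls on later slides, but the 0 to 3 equally spaced dose scores and the 1 to 5 outcome labels are assumptions about how the original `prob2` data set was coded.

```r
# Counts from the dose-response table: rows = dose group, columns = outcome
counts <- matrix(c(59, 25, 46, 48, 32,
                   48, 21, 44, 47, 30,
                   44, 14, 54, 64, 31,
                   43,  4, 49, 58, 41),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(dose = 0:3, outcome = 1:5))

# Long format: one row per (dose, outcome) cell with its count
prob2 <- as.data.frame(as.table(counts), responseName = "count")
prob2$dose    <- as.numeric(as.character(prob2$dose))        # assumed scores 0, 1, 2, 3
prob2$outcome <- factor(prob2$outcome, levels = 1:5, ordered = TRUE)

rowSums(counts)   # group sizes y_i.
head(prob2)
```

Using `as.factor(dose)` in a model formula instead of the numeric score treats the dose levels as nominal.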
STAT 526 Topic 11 20
Notation for Ordinal Predictor
Back to contingency table summary (grouped data)
                      Y
X         1      2     ···     J      Total
1        y11    y12    ···    y1J     y1.
2        y21    y22    ···    y2J     y2.
...      ...    ...    ···    ...     ...
I        yI1    yI2    ···    yIJ     yI.
Total    y.1    y.2    ···    y.J     n
Interested in cond probs $P(Y = j \mid X = i) = p_{j|i}$
Proportional-odds model focuses on cumulative probs
\[ P(Y \le j \mid X = i) = \sum_{k \le j} p_{k|i} \]
STAT 526 Topic 11 21
Ordinal Odds Ratios
Local odds ratios
\[ \theta^L_{ij} = \frac{P(X = i, Y = j) \,/\, P(X = i, Y = j+1)}{P(X = i+1, Y = j) \,/\, P(X = i+1, Y = j+1)} \]
Global odds ratios
\[ \theta^G_{ij} = \frac{P(X \le i, Y \le j) \,/\, P(X \le i, Y > j)}{P(X > i, Y \le j) \,/\, P(X > i, Y > j)} \]
Cumulative odds ratios (conditional on X)
\[ \theta^C_{ij} = \frac{P(Y \le j \mid X = i) \,/\, P(Y > j \mid X = i)}{P(Y \le j \mid X = i+1) \,/\, P(Y > j \mid X = i+1)} \]
Analogues to correlations, but for categorical variables
STAT 526 Topic 11 22
Ordinal Odds Ratio Estimates
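Local odds ratios
\[ \hat\theta^L_{ij} = \frac{y_{ij} \,/\, y_{i,j+1}}{y_{i+1,j} \,/\, y_{i+1,j+1}} \]
Global odds ratios
\[ \hat\theta^G_{ij} = \frac{\sum_{a \le i} \sum_{b \le j} y_{ab} \,/\, \sum_{a \le i} \sum_{b > j} y_{ab}}{\sum_{a > i} \sum_{b \le j} y_{ab} \,/\, \sum_{a > i} \sum_{b > j} y_{ab}} \]
Cumulative odds ratios (conditional on X)
\[ \hat\theta^C_{ij} = \frac{\sum_{b \le j} y_{ib} \,/\, \sum_{b > j} y_{ib}}{\sum_{b \le j} y_{i+1,b} \,/\, \sum_{b > j} y_{i+1,b}} \]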
Alternative: testing for association with Pearson X 2
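The three estimators can be computed directly from the dose-response counts of Example 2. This is a minimal sketch; the helper names `local_or`, `global_or`, and `cum_or` are hypothetical and introduced only for illustration.

```r
# Dose-response counts: rows = dose group (i), columns = outcome (j)
y <- matrix(c(59, 25, 46, 48, 32,
              48, 21, 44, 47, 30,
              44, 14, 54, 64, 31,
              43,  4, 49, 58, 41), nrow = 4, byrow = TRUE)
I <- nrow(y); J <- ncol(y)

# Local: adjacent rows and adjacent columns
local_or <- function(i, j) (y[i, j] / y[i, j + 1]) / (y[i + 1, j] / y[i + 1, j + 1])

# Global: collapse the table to 2 x 2 at row cut i and column cut j
global_or <- function(i, j) {
  a <- sum(y[1:i, 1:j]);       b <- sum(y[1:i, (j + 1):J])
  c <- sum(y[(i + 1):I, 1:j]); d <- sum(y[(i + 1):I, (j + 1):J])
  (a / b) / (c / d)
}

# Cumulative: adjacent rows, columns collapsed at cut j
cum_or <- function(i, j) {
  (sum(y[i, 1:j]) / sum(y[i, (j + 1):J])) /
    (sum(y[i + 1, 1:j]) / sum(y[i + 1, (j + 1):J]))
}

# (I - 1) x (J - 1) table of estimates for a given estimator
or_table <- function(f) sapply(1:(J - 1), function(j) sapply(1:(I - 1), function(i) f(i, j)))
round(or_table(cum_or), 3)
```

Under proportional odds with an equally spaced dose score, the cumulative odds ratios in this table would all estimate the same quantity.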
STAT 526 Topic 11 23
Example 2 Analysis : Dose Scored
> library(MASS)
> fit1 = polr(outcome~dose,weights=count,data=prob2)
> summary(fit1)
Coefficients:
Value Std. Error t value
dose 0.1755 0.05671 3.094
Intercepts:
Value Std. Error t value
1|2 -0.8946 0.1144 -7.8233
2|3 -0.4941 0.1107 -4.4638
3|4 0.5162 0.1118 4.6150
4|5 1.8815 0.1311 14.3565
Residual Deviance: 2461.349
AIC: 2471.349
Degrees of freedom are 797; cannot use the deviance to assess fit (ungrouped)
> exp(confint(fit1))
   2.5 %   97.5 %
1.066619 1.332269
Increasing dose by 1 level increases the odds of a higher outcome (Y > j versus Y ≤ j) by between 6.7% and 33.2%
STAT 526 Topic 11 24
Plot of Predicted Probabilities
> matplot(predProb, type="l", xlab="Dose+1", ylab="Predicted Probability", cex=3.5)
> legend(x=3,y=0.15, lty=c(1:5), col=c(1:5), paste("Outcome =", c(1:5)))
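The slides do not show how `predProb` was built. One plausible way, sketched below, is to call `predict` with `type = "probs"` on the fitted `polr` object at each dose level. The data frame is re-entered from the dose-response table, and its coding (dose scores 0 to 3, outcomes 1 to 5) is an assumption about the original `prob2`.

```r
library(MASS)

# Re-enter the dose-response table in long format (assumed coding of prob2)
counts <- matrix(c(59, 25, 46, 48, 32,
                   48, 21, 44, 47, 30,
                   44, 14, 54, 64, 31,
                   43,  4, 49, 58, 41),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(dose = 0:3, outcome = 1:5))
prob2 <- as.data.frame(as.table(counts), responseName = "count")
prob2$dose    <- as.numeric(as.character(prob2$dose))
prob2$outcome <- factor(prob2$outcome, levels = 1:5, ordered = TRUE)

# Proportional-odds fit with dose scored, then fitted probabilities per dose level
fit1     <- polr(outcome ~ dose, weights = count, data = prob2)
predProb <- predict(fit1, newdata = data.frame(dose = 0:3), type = "probs")
round(predProb, 3)   # 4 x 5 matrix: one row per dose, one column per outcome
```

Passing this matrix to `matplot` draws one line per outcome category, with the row index (dose + 1) on the horizontal axis.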
STAT 526 Topic 11 25
Example 2 Analysis : Dose Categorical
> fit2 = polr(outcome~as.factor(dose),weights=count,data=prob2)
> summary(fit2)
Coefficients:
                  Value Std. Error t value
as.factor(dose)1 0.1176     0.1791  0.6564
as.factor(dose)2 0.3174     0.1740  1.8240
as.factor(dose)3 0.5208     0.1794  2.9029
Intercepts:
      Value Std. Error t value
1|2 -0.9188     0.1322 -6.9488
2|3 -0.5183     0.1291 -4.0154
3|4  0.4922     0.1298  3.7925
4|5  1.8579     0.1462 12.7072
Residual Deviance: 2461.216
AIC: 2475.216
***Two additional parameters
***Test below does not suggest they add much to the fit
> anova(fit1,fit2)
Model R. df Resid. Dev Test Df LR stat. Pr(Chi)
dose 797 2461.349
as.factor(dose) 795 2461.216 1 vs 2 2 0.1328 0.9357261
STAT 526 Topic 11 26
Plot of Predicted Probabilities
> matplot(predProb1, type="l", xlab="Dose+1", ylab="Predicted Probability", cex=3.5)
> legend(x=3,y=0.15, lty=c(1:5), col=c(1:5), paste("Outcome =", c(1:5)))
STAT 526 Topic 11 27
Summary
Moving from scoring the ordinal variable to treating it as a nominal factor allows a test of the linearity assumption.
Result can depend on how one scores the different levels of the dose variable
  Equally spaced
  Unequally spaced
Visual comparison can be made via plots of the predicted probabilities like the ones on Slides #25 and #27.
Need to look at grouped goodness of fit statistics
Multiple reasons for a poor fit
  violation of proportional odds; wrong link; wrong functional form or missing predictors; overdispersion
STAT 526 Topic 11 28
Goodness of Fit: Grouped Data
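Let the rows represent each of the groups
Expected cell frequency $\hat\mu_{ij}$ in row i and col j:
\begin{align*}
\hat\mu_{ij} &= y_{i.}\, \hat{P}(Y = j \mid X = i) \\
 &= y_{i.} \left[ \hat{P}(Y \le j \mid X = i) - \hat{P}(Y \le j-1 \mid X = i) \right]
\end{align*}
Pearson χ2
\[ X^2 = \sum_{i=1}^{I} \sum_{j=1}^{J} \frac{(y_{ij} - \hat\mu_{ij})^2}{\hat\mu_{ij}} \overset{H_0}{\sim} \chi^2_{df} \]
Deviance
\[ G^2 = 2 \sum_{i=1}^{I} \sum_{j=1}^{J} y_{ij} \log\left( \frac{y_{ij}}{\hat\mu_{ij}} \right) \overset{H_0}{\sim} \chi^2_{df} \]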
Dose scored: df = [I (J − 1)]− [(J − 1) + 1] = 11
Dose categorical: df = [I (J − 1)]− [(J − 1) + (I − 1)] = 9
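The grouped statistics for the dose-scored model can be computed as follows. This is a minimal sketch that re-enters the dose-response table; the coding of `prob2` (dose scores 0 to 3, outcomes 1 to 5) is an assumption about the original data set.

```r
library(MASS)

counts <- matrix(c(59, 25, 46, 48, 32,
                   48, 21, 44, 47, 30,
                   44, 14, 54, 64, 31,
                   43,  4, 49, 58, 41),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(dose = 0:3, outcome = 1:5))
prob2 <- as.data.frame(as.table(counts), responseName = "count")
prob2$dose    <- as.numeric(as.character(prob2$dose))
prob2$outcome <- factor(prob2$outcome, levels = 1:5, ordered = TRUE)

fit <- polr(outcome ~ dose, weights = count, data = prob2)

# Expected counts mu_ij = y_i. * P-hat(Y = j | X = i)
p_hat  <- predict(fit, newdata = data.frame(dose = 0:3), type = "probs")
mu_hat <- rowSums(counts) * p_hat

X2 <- sum((counts - mu_hat)^2 / mu_hat)            # Pearson chi-square
G2 <- 2 * sum(counts * log(counts / mu_hat))       # grouped deviance

# df = I(J - 1) - [(J - 1) + 1] for the dose-scored model
I <- nrow(counts); J <- ncol(counts)
df <- I * (J - 1) - ((J - 1) + 1)

data.frame(stat = c(X2 = X2, G2 = G2), df = df,
           p = pchisq(c(X2, G2), df, lower.tail = FALSE))
```

The grouped deviance equals the difference between the ungrouped deviance of this model and that of the saturated model, which for the values reported in these slides is about 2461.349 − 2443.166 ≈ 18.18.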
STAT 526 Topic 11 29
Visual Assessment of Proportional Odds: Grouped Data
Focus on each predictor (holding other predictors fixed)
According to the model, for all j and δ:
\[ \log\left( \frac{P(Y \le j \mid X + \delta)}{P(Y > j \mid X + \delta)} \right) - \log\left( \frac{P(Y \le j \mid X)}{P(Y > j \mid X)} \right) = -\beta\delta \]
Can plot these differences in cumulative log-odds using estimates from the saturated model
When proportional odds are appropriate, the differences should be roughly the same for all values of X and levels j
STAT 526 Topic 11 30
Using R
> mat = xtabs(count~dose+outcome,prob2)
> cumProb <- apply( mat/apply(mat, 1, sum), 1, cumsum)
> cumProb
0 1 2 3
0 0.2809524 0.2526316 0.2125604 0.2205128
1 0.4000000 0.3631579 0.2801932 0.2410256
2 0.6190476 0.5947368 0.5410628 0.4923077
3 0.8476190 0.8421053 0.8502415 0.7897436
4 1.0000000 1.0000000 1.0000000 1.0000000
> logit <- function(x) {log(x/(1-x))}
> # Columns of cumProb are doses and rows are cutoffs j, so each line below is the
> # change in empirical cumulative logit between adjacent doses, across cutoffs 1..4
> plot(1:4, logit(cumProb[-5,2])-logit(cumProb[-5,1]), type="l",
+   ylim=c(-1, 1), xlab="Cum prob cutoff j", ylab="Empirical log(OR)", cex=3.5)
> for (i in 3:4) {lines(1:4, logit(cumProb[-5,i])-
+   logit(cumProb[-5,i-1]),col=i, lty=i)
+ }
> abline(h=-coef(fit1), col="red", lwd=2)
> legend("topleft", lty=c(1,3,4), col=c(1,3,4),
+   paste("Dose", 1:3, "vs", 0:2), cex=1)
> legend("topright", lty=c(1), col=c("red"), "Model-based")
STAT 526 Topic 11 31
Are They Relatively Constant?
STAT 526 Topic 11 32
Formal Test for Proportional Odds
Testing
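\[ H_0: \log\left( \frac{P_j}{1 - P_j} \right) = \theta_j - X\beta \]
\[ H_a: \log\left( \frac{P_j}{1 - P_j} \right) = \theta_j - X\beta_j \]
where $P_j = P(Y \le j \mid X)$
Model under $H_a$ specifies cumulative logit, but not proportional odds, since log(OR) depends on j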
The model under H0 is nested within the model under Ha
Thus can compare residual deviances
STAT 526 Topic 11 33
Formal Test in R
Must use vglm function in VGAM package
First fit proportional-odds model
> library(VGAM)
> fit.vgam <- vglm(as.numeric(outcome) ~ dose,
+ cumulative(parallel=TRUE, reverse=FALSE),
+ weights=count,prob2)
> summary(fit.vgam)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept):1 -0.89466 0.11456 -7.809 5.74e-15 ***
(Intercept):2 -0.49410 0.11059 -4.468 7.91e-06 ***
(Intercept):3 0.51615 0.11067 4.664 3.10e-06 ***
(Intercept):4 1.88151 0.13020 14.451 < 2e-16 ***
dose -0.17548 0.05632 -3.116 0.00183 **
Residual deviance: 2461.349 on 75 degrees of freedom
**df based on ungrouped multinomial logit model**
STAT 526 Topic 11 34
Formal Test in R
Now fit relaxed model
> fit.vgam3 <- vglm(as.numeric(outcome) ~ dose,
+ cumulative(parallel=FALSE, reverse=FALSE),
+ weights=count,prob2)
> summary(fit.vgam3)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept):1 -0.97749 0.13194 -7.408 1.28e-13 ***
(Intercept):2 -0.36265 0.12034 -3.014 0.00258 **
(Intercept):3 0.52391 0.12011 4.362 1.29e-05 ***
(Intercept):4 1.78941 0.16415 10.901 < 2e-16 ***
dose:1 -0.11292 0.07288 -1.549 0.12130
dose:2 -0.26889 0.06832 -3.936 8.29e-05 ***
dose:3 -0.18234 0.06385 -2.856 0.00430 **
dose:4 -0.11925 0.08470 -1.408 0.15916
Residual deviance: 2447.018 on 72 degrees of freedom
STAT 526 Topic 11 35
Results
Full residual deviance = 2447.018 on 72 df
Reduced residual deviance = 2461.349 on 75 df
Difference is 2461.349− 2447.018 = 14.331 on 3 df
> pchisq(14.331,3,lower=F)
[1] 0.002487536
Cannot accept that reduced model gives adequate fit
  Proportional odds not reasonable
However, full cumulative odds model has issues too
  Non-parallel lines mean the cumulative logits eventually cross, which implies negative fitted probabilities for some X
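The crossing problem can be made concrete with the relaxed-model estimates reported above. The sketch below finds where adjacent cumulative-logit lines intersect and shows the implied category probabilities on either side; the dose value of 5 lies outside the observed range 0 to 3 and is used only for illustration.

```r
# Estimates from the non-parallel vglm fit: logit P(Y <= j | dose) = alpha_j + beta_j * dose
alpha <- c(-0.97749, -0.36265, 0.52391, 1.78941)
beta  <- c(-0.11292, -0.26889, -0.18234, -0.11925)

# Adjacent lines j and j+1 cross where alpha_j + beta_j x = alpha_{j+1} + beta_{j+1} x
x_cross <- -diff(alpha) / diff(beta)
round(x_cross, 2)

# Implied P(Y = j) for j = 2, 3, 4 (differences of adjacent cumulative probabilities)
implied <- function(dose) diff(plogis(alpha + beta * dose))
round(implied(3), 4)   # within the observed dose range
round(implied(5), 4)   # beyond the first crossing point
```

The first two lines cross just beyond the largest observed dose, so the fitted model is coherent on the data but cannot be extrapolated safely.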
STAT 526 Topic 11 36
Multinomial Logit Model
We now shift to case where categories are unordered
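Therefore, cannot work with cumulative probabilities
Instead declare one category as a reference and link the covariates to probs through J − 1 relative prob ratios
\[ \eta_{ij} = \log\left( \frac{p_{ij}}{p_{i1}} \right) = x_i\beta_j, \quad j = 2, 3, \ldots, J \]
This model implies
\[ p_{ij} = \exp\{x_i\beta_j\}\, p_{i1}, \quad j = 2, 3, \ldots, J \]
and because $\sum_{j=1}^{J} p_{ij} = 1$, this means
\[ p_{i1} = \frac{1}{1 + \sum_{k=2}^{J} \exp\{x_i\beta_k\}} \quad \text{and} \quad p_{ij} = \frac{\exp\{x_i\beta_j\}}{1 + \sum_{k=2}^{J} \exp\{x_i\beta_k\}} \]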
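The mapping from linear predictors to probabilities can be sketched as a small function. The function name `baseline_probs` is hypothetical; the numeric inputs are the two intercepts reported later in these slides for the null multinomial fit to the housing data, where Low satisfaction is the reference category.

```r
# Probabilities from baseline-category logits eta_j = x_i * beta_j, j = 2..J
baseline_probs <- function(eta) {
  expo <- exp(c(0, eta))        # the reference category has linear predictor 0
  expo / sum(expo)
}

# Intercept-only housing fit: logits for Medium and High relative to Low
p <- baseline_probs(c(Medium = -0.2400404, High = 0.1639289))
names(p)[1] <- "Low"
round(p, 7)
```

These values match the hand calculation annotated on the null-model slide (0.3372992, 0.2653183, 0.3973825).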
STAT 526 Topic 11 37
Multinomial Logit Model
The baseline, or reference, category is arbitrary
Common choices by software are j = 1 or j = J
Separate set of parameters βj for each ratio
Values of βj depend on the choice of baseline
Because all sets of βj relative to common category, jointly define probs
More flexible model than proportional odds but more difficult to interpret (?)
Can be used as classification model using category with highest predicted probability
STAT 526 Topic 11 38
Parameter Interpretation
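In logistic regression and proportional-odds model, a $\beta_j$ represents a log odds ratio
In this model, a $\beta_j$ describes the log change in relative prob ratio
\begin{align*}
\log\left( \frac{p_j(x+1)/p_1(x+1)}{p_j(x)/p_1(x)} \right)
 &= \log\frac{p_j(x+1)}{p_1(x+1)} - \log\frac{p_j(x)}{p_1(x)} \\
 &= \beta_{0j}^* + \beta_{1j}(x+1) - (\beta_{0j}^* + \beta_{1j}x) \\
 &= \beta_{1j}
\end{align*}
\begin{align*}
\log\left( \frac{p_j(x+1)/p_k(x+1)}{p_j(x)/p_k(x)} \right)
 &= \log\frac{p_j(x+1)}{p_k(x+1)} - \log\frac{p_j(x)}{p_k(x)} \\
 &= \log\frac{p_j(x+1)}{p_1(x+1)} - \log\frac{p_k(x+1)}{p_1(x+1)} - \log\frac{p_j(x)}{p_1(x)} + \log\frac{p_k(x)}{p_1(x)} \\
 &= \beta_{1j} - \beta_{1k}
\end{align*}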
STAT 526 Topic 11 39
Maximum Likelihood Estimation
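The log-likelihood for observation i is:
\begin{align*}
l_i &= \log \prod_{j=1}^{J} p_j(x_i)^{y_{ij}} \\
 &= \sum_{j=2}^{J} y_{ij} \log p_j(x_i) + \left( 1 - \sum_{j=2}^{J} y_{ij} \right) \log p_1(x_i) \\
 &= \sum_{j=2}^{J} y_{ij} \log\frac{p_j(x_i)}{1 - \sum_{k=2}^{J} p_k(x_i)} + \log p_1(x_i) \\
 &= \sum_{j=2}^{J} y_{ij}(x_i\beta_j) - \log\left( 1 + \sum_{j=2}^{J} \exp\{x_i\beta_j\} \right)
\end{align*}
Maximize $\sum_{i=1}^{I} l_i$ with respect to $\beta_j$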
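The final line of the derivation can be checked against `nnet::multinom`. The sketch below evaluates the log-likelihood by hand at the fitted coefficients and compares −2 times it with the reported deviance. It assumes the `housing` data from MASS with `Freq` as case weights, the main-effects model that appears later in these slides.

```r
library(MASS)   # housing data
library(nnet)   # multinom()

fit <- multinom(Sat ~ Infl + Type + Cont, weights = Freq,
                data = housing, trace = FALSE)

# Linear predictors x_i * beta_j; the reference category (column 1) gets 0
X   <- model.matrix(~ Infl + Type + Cont, data = housing)[, colnames(coef(fit))]
eta <- cbind(0, X %*% t(coef(fit)))

# l_i = eta_{i, y_i} - log(sum_j exp(eta_ij)), i.e. the last line of the derivation
y      <- as.integer(housing$Sat)
li     <- eta[cbind(seq_along(y), y)] - log(rowSums(exp(eta)))
loglik <- sum(housing$Freq * li)

c(minus2loglik = -2 * loglik, multinom_deviance = fit$deviance)
```

The two values should coincide, because `multinom` reports −2 times the weighted log-likelihood as its residual deviance.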
STAT 526 Topic 11 40
Example 2: Dose Response
Let’s revisit our dose response study but use multinomial logit model
Let’s consider dose
As a categorical predictor
  There are 3 indicator variables + intercept for each of the 4 logits
  Total of 4(4) = 16 parameters
As a continuous predictor
  Will assign scores to the categories
  Total of 4(2) = 8 parameters
Previous proportional-odds models had 7 and 5 parameters, respectively
STAT 526 Topic 11 41
Example 2: Dose Categorical
> library(nnet)
> fit1 <- multinom(outcome ~ as.factor(dose), weights=count, prob2)
> summary(fit1)
Coefficients:
(Intercept) as.factor(dose)1 as.factor(dose)2 as.factor(dose)3
2 -0.8586335 0.03194971 -0.2864809 -1.5161958
3 -0.2488754 0.16185828 0.4536705 0.3794879
4 -0.2063195 0.18526707 0.5810140 0.5055581
5 -0.6117850 0.14178037 0.2615807 0.5641507
Std. Errors:
(Intercept) as.factor(dose)1 as.factor(dose)2 as.factor(dose)3
2 0.2386396 0.3541205 0.3887204 0.5746170
3 0.1966936 0.2867909 0.2827264 0.2869711
4 0.1943777 0.2826526 0.2759257 0.2797853
5 0.2195434 0.3199468 0.3212239 0.3095891
Residual Deviance: 2443.166
AIC: 2475.166
STAT 526 Topic 11 42
Dose Categorical - Predicted Probs
> predProb <- unique(fit1$fitted.values)
> matplot(predProb,las=1,type="l")
> legend("bottomleft", lty=c(1:5), col=c(1:5),
+   paste("Response =", c(0:4)),cex=0.75)
STAT 526 Topic 11 43
Calculation of Residual Deviance
Saturated model when the data are treated as grouped:
Model-based predicted probs = sample proportions
> m / apply(m, 1, sum)
0 1 2 3 4
0 0.2809524 0.11904762 0.2190476 0.2285714 0.1523810
1 0.2526316 0.11052632 0.2315789 0.2473684 0.1578947
2 0.2125604 0.06763285 0.2608696 0.3091787 0.1497585
3 0.2205128 0.02051282 0.2512821 0.2974359 0.2102564
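Deviance for grouped data
\[ G^2 = 2 \sum_{i=1}^{I} \sum_{j=1}^{J} y_{ij} \log\left( \frac{y_{ij}}{\hat\mu_{ij}} \right) = 2 \sum_{i=1}^{I} \sum_{j=1}^{J} y_{ij} \log\left( \frac{y_{ij}}{y_{ij}} \right) = 0 \]
Deviance for ungrouped data
\[ G^2 = 2 \sum_{i=1}^{I} \sum_{j=1}^{J} y_{ij} \log\left( \frac{1}{\hat{p}_j(x_i)} \right) = -2 \sum_{i=1}^{I} \sum_{j=1}^{J} y_{ij} \log \hat{p}_j(x_i) = 2443.166 \]
with I × J × (J − 1) − I · (J − 1) = 4 · 5 · 4 − 4 · 4 = 64 df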
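Because the categorical-dose model reproduces the sample proportions, its ungrouped deviance can be computed from the table alone, without fitting anything. A minimal sketch, re-entering the dose-response counts:

```r
# Dose-response counts: rows = dose group, columns = outcome
m <- matrix(c(59, 25, 46, 48, 32,
              48, 21, 44, 47, 30,
              44, 14, 54, 64, 31,
              43,  4, 49, 58, 41), nrow = 4, byrow = TRUE)

# Fitted probabilities of the saturated (grouped) model = row-wise sample proportions
p_hat <- m / rowSums(m)

# Ungrouped deviance: -2 * sum_ij y_ij log p-hat_ij
G2_ungrouped <- -2 * sum(m * log(p_hat))

# Degrees of freedom as computed on the slide
I <- nrow(m); J <- ncol(m)
df <- I * J * (J - 1) - I * (J - 1)

c(G2 = G2_ungrouped, df = df)
```

The result should agree with the residual deviance that `multinom` reports for the categorical-dose fit, up to the optimizer's convergence tolerance.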
STAT 526 Topic 11 44
Example 2: Dose Scored
> library(nnet)
> fit2 <- multinom(outcome ~ dose, weights=count, prob2)
> summary(fit2)
Coefficients:
(Intercept) dose
2 -0.6999134 -0.3544346
3 -0.2194566 0.1470232
4 -0.1772963 0.1945578
5 -0.6544057 0.1914772
Std. Errors:
(Intercept) dose
2 0.2051749 0.13796048
3 0.1676773 0.09130087
4 0.1649761 0.08894654
5 0.1896008 0.10105460
Residual Deviance: 2449.145
AIC: 2465.145
STAT 526 Topic 11 45
Dose Scored - Predicted Probs
> predProb <- unique(fit2$fitted.values)
> matplot(predProb,las=1,type="l")
> legend("bottomleft", lty=c(1:5), col=c(1:5),
+   paste("Response =", c(0:4)),cex=0.75)
STAT 526 Topic 11 46
Conclusions
Can compare the two models to test for linearity
> anova(fit1,fit2)
Model Res. df Resid. Dev Df LR stat. Pr(Chi)
1 dose 72 2449.145
2 as.factor(dose) 64 2443.166 8 5.97846 0.6496448
Conclude that it is sufficient to consider linearity
Can do grouped goodness of fit test to assess fit
  $G^2 = 5.98$ on 8 df (same because grouped Model 2 is saturated)
This model does not fit as well as the relaxed cumulative-odds model
  $G^2 = 2447.018$ on 72 df versus $G^2 = 2449.145$ on 72 df
STAT 526 Topic 11 47
Test for Equality of βj
Can test if different slope needed for each class j
\[ H_0: \log\left( \frac{p_j(X)}{p_1(X)} \right) = \beta_{0j} + \beta X, \quad j = 2, \ldots, J \]
\[ H_a: \log\left( \frac{p_j(X)}{p_1(X)} \right) = \beta_{0j} + \beta_j X, \quad j = 2, \ldots, J \]
# -----separate beta_j for each response category-----
# ------the last category is the baseline in VGAM------
> fit3 <- vglm(outcome~dose, multinomial(parallel=FALSE),
+ weights=count,prob2)
> summary(fit3)
# -------same beta_j for each response category-------
> fit3.parallel <- vglm(outcome~dose,multinomial(parallel=TRUE),
+ weights=count,prob2)
> summary(fit3.parallel)
> 1 - pchisq(2*(logLik(fit3)-logLik(fit3.parallel)),
df=length(coef(fit3))-length(coef(fit3.parallel)))
[1] 0.0001767769
STAT 526 Topic 11 48
Example 4: Housing Satisfaction
1681 Copenhagen residents in study (housing in MASS)
Three categorical predictors (1 nominal, 2 ordered)
                              Contact:        Low                     High
                              Satisfaction:   Low  Medium  High       Low  Medium  High
Housing            Influence
Tower blocks       Low                         21     21     28        14     19     37
                   Medium                      34     22     36        17     23     40
                   High                        10     11     36         3      5     23
Apartments         Low                         61     23     17        78     46     43
                   Medium                      43     35     40        48     45     86
                   High                        26     18     54        15     25     62
Atrium houses      Low                         13      9     10        20     23     20
                   Medium                       8      8     12        10     22     24
                   High                         6      7      9         7     10     21
Terraced houses    Low                         18      6      7        57     23     13
                   Medium                      15     13     13        31     21     13
                   High                         7      5     11         5      6     13
STAT 526 Topic 11 49
Mosaic Plot
STAT 526 Topic 11 50
Multinomial Logit Null Model Fit
Distribution of satisfaction same for all residents
> fit.null <- multinom(Sat~1,weights=Freq,housing)
> summary(fit.null)
Call:
multinom(formula = Sat ~ 1, data = housing, weights = Freq)
Coefficients:
       (Intercept)
Medium  -0.2400404
High     0.1639289
Std. Errors:
       (Intercept)
Medium  0.06329155
High    0.05710232
Residual Deviance: 3648.878
AIC: 3652.878
# Low:    1/(1+exp(-0.2400404)+exp(0.1639289)) = 0.3372992
# Medium: exp(-0.2400404)/(1+exp(-0.2400404)+exp(0.1639289)) = 0.2653183
# High:   exp(0.1639289)/(1+exp(-0.2400404)+exp(0.1639289)) = 0.3973825
DF = 4*3*2*3*(3-1) - 2 = 142 (ungrouped)
DF = 4*3*2*(3-1) - 2 = 46 (grouped)
STAT 526 Topic 11 51
Multinomial Logit Fit
Consider influence as nominal variable
> fit.multinom <- multinom(Sat~Infl+Type+Cont,weights=Freq,housing)
> summary(fit.multinom)
Coefficients:
       (Intercept) InflMedium  InflHigh TypeApartment
Medium  -0.4192316  0.4464003 0.6649367    -0.4356851
High    -0.1387453  0.7348626 1.6126294    -0.7356261
       TypeAtrium TypeTerrace  ContHigh
Medium  0.1313663  -0.6665728 0.3608513
High   -0.4079808  -1.4123333 0.4818236
Std. Errors:
       (Intercept) InflMedium  InflHigh TypeApartment
Medium   0.1729344  0.1415572 0.1863374     0.1725327
High     0.1592295  0.1369380 0.1671316     0.1552714
       TypeAtrium TypeTerrace  ContHigh
Medium  0.2231065   0.2062532 0.1323975
High    0.2114965   0.2001496 0.1241371
Residual Deviance: 3470.084
AIC: 3498.084
Should we consider interactions among predictors?
STAT 526 Topic 11 52
Surrogate Log-Linear Models
Again focusing on satisfaction as multinomial response with other three variables as predictors
Will use associations between variables to develop predictive model
Model #1: Satisfaction is indep of the three predictors
  If true, conditional distribution of satisfaction is the same for all predictor combinations
  In other words, conditional probs do not vary with predictors
  This is the same as the multinomial null model
  Can express as log-linear model using
> fit <- glm(Freq~Infl*Type*Cont+Sat,family=poisson,housing)
STAT 526 Topic 11 53
Model #1 Results
> summary(fit)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.162e+00 1.243e-01 25.433 < 2e-16 ***
InflMedium 2.733e-01 1.586e-01 1.723 0.084868 .
InflHigh -2.054e-01 1.784e-01 -1.152 0.249511
TypeApartment 3.666e-01 1.555e-01 2.357 0.018403 *
TypeAtrium -7.828e-01 2.134e-01 -3.668 0.000244 ***
TypeTerrace -8.145e-01 2.157e-01 -3.775 0.000160 ***
ContHigh -1.190e-15 1.690e-01 0.000 1.000000
Sat1Medium -2.400e-01 6.329e-02 -3.793 0.000149 ***
Sat1High 1.639e-01 5.710e-02 2.871 0.004094 **
InflMedium:TypeApartment -1.177e-01 2.086e-01 -0.564 0.572571
InflHigh:TypeApartment 1.753e-01 2.279e-01 0.769 0.441783
InflMedium:TypeAtrium -4.068e-01 3.035e-01 -1.340 0.180118
InflHigh:TypeAtrium -1.692e-01 3.294e-01 -0.514 0.607433
InflMedium:TypeTerrace 6.292e-03 2.860e-01 0.022 0.982450
InflHigh:TypeTerrace -9.305e-02 3.280e-01 -0.284 0.776633
InflMedium:ContHigh -1.398e-01 2.279e-01 -0.613 0.539715
InflHigh:ContHigh -6.091e-01 2.800e-01 -2.176 0.029585 *
STAT 526 Topic 11 54
Model #1 Results
TypeApartment:ContHigh 5.029e-01 2.109e-01 2.385 0.017083 *
TypeAtrium:ContHigh 6.774e-01 2.751e-01 2.462 0.013811 *
TypeTerrace:ContHigh 1.099e+00 2.675e-01 4.106 4.02e-05 ***
InflMedium:TypeApartment:ContHigh 5.359e-02 2.862e-01 0.187 0.851450
InflHigh:TypeApartment:ContHigh 1.462e-01 3.380e-01 0.432 0.665390
InflMedium:TypeAtrium:ContHigh 1.555e-01 3.907e-01 0.398 0.690597
InflHigh:TypeAtrium:ContHigh 4.782e-01 4.441e-01 1.077 0.281619
InflMedium:TypeTerrace:ContHigh -4.980e-01 3.671e-01 -1.357 0.174827
InflHigh:TypeTerrace:ContHigh -4.470e-01 4.545e-01 -0.984 0.325326
Null deviance: 833.66 on 71 degrees of freedom
Residual deviance: 217.46 on 46 degrees of freedom
AIC: 610.43
Large deviance suggests probs vary with predictors
Residual deviance based on Poisson dist here
Coefs for Sat1 are the same as null multinomial intercepts
STAT 526 Topic 11 55
Additive Contributions of Predictors
Can assess whether Sat1 depends on each of the 3 predictors individually by adding interactions with it
> addterm(fit, ~. + Sat1:(Infl+Type+Cont), test="Chisq")
Single term additions
Model:
Freq ~ Infl * Type * Cont + Sat
Df Deviance AIC LRT Pr(Chi)
<none> 217.46 610.43
Infl:Sat1 4 111.08 512.05 106.371 < 2.2e-16 ***
Type:Sat1 6 156.79 561.76 60.669 3.292e-11 ***
Cont:Sat1 2 212.33 609.30 5.126 0.07708 .
Infl: max reduction in resid. deviance & AIC
Even though Cont:Sat1 not significant, let’s look at model with all three interactions
STAT 526 Topic 11 56
Model #2: Interactions with Sat
> fit2 <- glm(Freq~Infl*Type*Cont+Sat1:Infl+Sat1*Type+Sat1*Cont,
+ family=poisson,housing)
> summary(fit2)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.32106 0.14761 22.498 < 2e-16 ***
InflMedium -0.14543 0.17855 -0.814 0.415369
InflHigh -1.17183 0.21803 -5.375 7.68e-08 ***
TypeApartment 0.68296 0.17522 3.898 9.71e-05 ***
TypeAtrium -0.70064 0.24137 -2.903 0.003698 **
TypeTerrace -0.32511 0.23230 -1.400 0.161652
ContHigh -0.28230 0.18441 -1.531 0.125814
Sat1Medium -0.41923 0.17293 -2.424 0.015342 *
Sat1High -0.13874 0.15923 -0.871 0.383570
InflMedium:TypeApartment -0.01788 0.21050 -0.085 0.932302
InflHigh:TypeApartment 0.38687 0.23330 1.658 0.097263 .
InflMedium:TypeAtrium -0.36031 0.30498 -1.181 0.237432
InflHigh:TypeAtrium -0.03679 0.33479 -0.110 0.912503
InflMedium:TypeTerrace 0.18515 0.28889 0.641 0.521580
InflHigh:TypeTerrace 0.31075 0.33482 0.928 0.353345
InflMedium:ContHigh -0.20006 0.22875 -0.875 0.381799
InflHigh:ContHigh -0.72579 0.28235 -2.571 0.010155 *
TypeApartment:ContHigh 0.56969 0.21215 2.685 0.007247 **
STAT 526 Topic 11 57
Model #2: Interactions with Sat
TypeAtrium:ContHigh 0.70211 0.27606 2.543 0.010979 *
TypeTerrace:ContHigh 1.21593 0.26997 4.504 6.67e-06 ***
InflMedium:Sat1Medium 0.44640 0.14156 3.153 0.001613 **
InflHigh:Sat1Medium 0.66494 0.18634 3.568 0.000359 ***
InflMedium:Sat1High 0.73486 0.13694 5.366 8.03e-08 ***
InflHigh:Sat1High 1.61263 0.16713 9.649 < 2e-16 ***
TypeApartment:Sat1Medium -0.43569 0.17253 -2.525 0.011562 *
TypeAtrium:Sat1Medium 0.13137 0.22311 0.589 0.555980
TypeTerrace:Sat1Medium -0.66657 0.20625 -3.232 0.001230 **
TypeApartment:Sat1High -0.73563 0.15527 -4.738 2.16e-06 ***
TypeAtrium:Sat1High -0.40798 0.21150 -1.929 0.053730 .
TypeTerrace:Sat1High -1.41233 0.20015 -7.056 1.71e-12 ***
ContHigh:Sat1Medium 0.36085 0.13240 2.726 0.006420 **
ContHigh:Sat1High 0.48183 0.12414 3.881 0.000104 ***
InflMedium:TypeApartment:ContHigh 0.04690 0.28621 0.164 0.869837
InflHigh:TypeApartment:ContHigh 0.12623 0.33821 0.373 0.708979
InflMedium:TypeAtrium:ContHigh 0.15724 0.39072 0.402 0.687364
InflHigh:TypeAtrium:ContHigh 0.47861 0.44424 1.077 0.281320
InflMedium:TypeTerrace:ContHigh -0.50016 0.36713 -1.362 0.173091
InflHigh:TypeTerrace:ContHigh -0.46310 0.45471 -1.018 0.308467
Null deviance: 833.657 on 71 degrees of freedom
Residual deviance: 38.662 on 34 degrees of freedom
AIC: 455.63
STAT 526 Topic 11 58
Model #2: Interactions with Sat
Same model as our main-effects multinomial model
Different deviances due to different saturated models.
  In multinom the saturated model is for subjects
  In surrogate log-linear model, it is for cells (grouped)
Comparison with null model
  multinom: 3648.9 − 3470.1 = 178.8 and 142 − 130 = 12 df
  log-linear: 217.5 − 38.7 = 178.8 and 46 − 34 = 12 df
Could also consider higher-order interactions
  Represent non-additive effects of predictors on Sat
addterm(fit1, .~.+Sat:(Infl+Type+Cont)^2, test="Chisq")
None are found significant
STAT 526 Topic 11 59
Summary
Models using the Poisson distribution
Consider E (count response) as a function of predictors
  Poisson regression
  Quasipoisson or negative binomial regression
  Surrogate log-linear model
Multivariate associations of categorical variables
  Nominal random variables: Log-linear models
  Ordinal random variables: Linear-by-linear model, column-effect models
Models using the multinomial distribution
Consider E (count response) as a function of predictors
  Ordinal response: cumulative logit model
  Nominal response: multinomial logit model
STAT 526 Topic 11 60