Modeling a Multinomial Response - Purdue University (bacraig/notes526/topic11a.pdf · 2020. 10. 6.)

Modeling a Multinomial Response

Bruce A Craig

Department of Statistics, Purdue University

Reading: Faraway Ch. 7, Agresti Ch. 7, KNNL Ch. 14

STAT 526 Topic 11 1

Multinomial Distribution

Model for discrete variable with two or more categories

Probability distribution:

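$$Y = (Y_1, \ldots, Y_c) \sim \text{Multinomial}(n, p_1, \ldots, p_{c-1})$$

n is considered known (number of trials)

$$p_c = 1 - \sum_{j=1}^{c-1} p_j$$

$$p(y_1, y_2, \ldots, y_c) = \left( \frac{n!}{y_1!\, y_2! \cdots y_c!} \right) p_1^{y_1} p_2^{y_2} \cdots p_c^{y_c}$$

$$E(Y_j) = np_j, \quad \mathrm{Var}(Y_j) = np_j(1 - p_j), \quad \mathrm{Cov}(Y_j, Y_k) = -np_jp_k \;\; (j \ne k)$$

Marginal dist for each $Y_j$ is $B(n, p_j)$

Log-likelihood:

$$l(p) = \sum_{j=1}^{c} y_j \log p_j$$

Maximum Likelihood Estimator: $\hat{p}_j = y_j/n$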
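As a quick numerical check of the formulas on this slide, the base-R sketch below evaluates the multinomial pmf by hand, compares it with `dmultinom`, and computes the MLE. The counts `y` and probabilities `p` are made-up illustrative values, not data from these notes.

```r
# Hypothetical counts and probabilities for c = 3 categories (illustration only)
y <- c(3, 5, 2)
p <- c(0.2, 0.5, 0.3)
n <- sum(y)

# pmf from the formula: n! / (y1! ... yc!) * p1^y1 ... pc^yc
pmf_manual  <- factorial(n) / prod(factorial(y)) * prod(p^y)
pmf_builtin <- dmultinom(y, size = n, prob = p)

# Log-likelihood kernel l(p) = sum_j y_j log p_j and the MLE p_hat_j = y_j / n
loglik <- function(prob) sum(y * log(prob))
p_hat  <- y / n

# Moments of Y under p
mean_Y <- n * p
var_Y  <- n * p * (1 - p)
cov_12 <- -n * p[1] * p[2]
```

The MLE maximizes `loglik`, so `loglik(p_hat)` is at least as large as `loglik(p)` for any other probability vector `p`.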


Multinomial GLM Models

Now consider set of I Multinomial(ni ,pi) observations

Goal now to link predictors Xi to pi

As with binomial setting, can encounter data that are

Grouped: ni > 1

Ungrouped: ni = 1

Predictors Xi may be continuous or discrete

Unlike binomial setting, need to distinguish between

Ordered categories for Y → cumulative logit model

Nominal categories for Y → multinomial logit model


Example 1: Math Aptitude

Predicting a college freshman’s math aptitude given their mathematics PSAT score in 10th grade.

Response: Aptitude Grade: 4 ordered levels

Predictor: PSAT score: continuous (10-pt increments)


Example 1: Math Aptitude

Aptitude grade (Y) positively related to math score (X)

Overlap in math scores across grades means there is some uncertainty in predicting Y

In this example, ni = 1 for i = 1, 2, . . . , I = 500 students

Interested in conditional probs $P(Y = j \mid x_i)$

With ordered response, often easier to work with the cumulative probabilities

$$P(Y \le j \mid x_i) = \sum_{k \le j} P(Y = k \mid x_i)$$


Proportional Odds Model

Also called the cumulative logit model

$$\log\left( \frac{P(Y_i \le j \mid X_i)}{1 - P(Y_i \le j \mid X_i)} \right) = \theta_j - X_i\beta$$

Only parameters $\theta_j$ depend on level j

They are monotonically increasing with j

Parameters β describe cumulative log-odds

Odds(Xi)/Odds(Xi′) does not depend on level j

A β > 0 means a larger x increases probability of larger response j (positive association)

Like fitting logistic model for each j but β same

As with binary setting, can consider other link functions


Latent Variable Motivation

Similar to binary setting, can consider a latent variable to motivate the GLM model

For the math aptitude example, we could consider there to be a latent continuous variable Z associated with the aptitude grade that is linearly related to their math score

$$Z_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$

Instead of observing $Z_i$, we observe

$$Y_i = \begin{cases} A & Z_i > c_3 \\ B & c_2 < Z_i < c_3 \\ C & c_1 < Z_i < c_2 \\ D & Z_i < c_1 \end{cases}$$

Can compute the $P(Y_i = j \mid x_i)$ using specified dist of ε


Motivation Continued

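Using $c_J = \infty$,

$$\begin{aligned}
P(Y_i \le j \mid x_i) &= P(Z_i \le c_j) \\
&= P(\beta_0 + \beta_1 x_i + \varepsilon_i \le c_j) \\
&= P(\varepsilon_i \le c_j - \beta_0 - \beta_1 x_i) \\
&= F_\varepsilon(\theta_j - \beta_1 x_i), \qquad \theta_j = c_j - \beta_0
\end{aligned}$$

If F is the CDF of the logistic distribution,

$$P(Y_i \le j \mid x_i) = \frac{\exp\{\theta_j - \beta_1 x_i\}}{1 + \exp\{\theta_j - \beta_1 x_i\}}$$

Can use Normal or Gumbel distributions to motivate probit or complementary log-log link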
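The latent-variable argument can be checked by simulation. The sketch below draws a latent Z with standard logistic errors, cuts it at fixed thresholds, and compares the empirical cumulative proportions with the logistic CDF evaluated at $\theta_j - \beta_1 x$. The coefficients, cutpoints, seed, and the numeric coding 1 to 4 (lowest to highest Z) are assumptions chosen for illustration, not values from the math aptitude data.

```r
set.seed(526)                        # arbitrary seed for reproducibility
n     <- 20000
beta0 <- 0.5; beta1 <- 1.2           # hypothetical latent-regression coefficients
cuts  <- c(-1, 0.5, 2)               # hypothetical cutpoints c_1 < c_2 < c_3 (J = 4)

x <- 1                               # evaluate everything at one covariate value
z <- beta0 + beta1 * x + rlogis(n)   # latent Z_i with standard logistic errors

# Observed category: 1 if Z < c_1, 2 if c_1 <= Z < c_2, ..., 4 if Z >= c_3
y <- findInterval(z, cuts) + 1

# Empirical cumulative probabilities P(Y <= j | x) for j = 1, 2, 3
emp_cum <- sapply(1:3, function(j) mean(y <= j))

# Model-based values F(theta_j - beta1 * x) with theta_j = c_j - beta0
theta     <- cuts - beta0
model_cum <- plogis(theta - beta1 * x)

round(rbind(empirical = emp_cum, model = model_cum), 3)
```

Replacing `rlogis`/`plogis` with `rnorm`/`pnorm` gives the corresponding check for the probit link.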


Interpreting the Model Parameters

Using the logit link, the cumulative odds

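$$\log\left( \frac{P(Y \le j \mid X)}{P(Y > j \mid X)} \right) = \theta_j - X\beta$$

Interpretation of a β (holding all other x constant)

$$\begin{aligned}
\log\left( \frac{P(Y \le j \mid x + \delta)}{P(Y > j \mid x + \delta)} \div \frac{P(Y \le j \mid x)}{P(Y > j \mid x)} \right)
&= \log\left( \frac{P(Y \le j \mid x + \delta)}{P(Y > j \mid x + \delta)} \right) - \log\left( \frac{P(Y \le j \mid x)}{P(Y > j \mid x)} \right) \\
&= \theta_j^* - \beta(x + \delta) - (\theta_j^* - \beta x) = -\beta\delta
\end{aligned}$$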

Change is proportional to the change in x for all j


Inference

Similar inference as in logistic regression when focusing on cumulative probs

Wald / LR / Score tests for model parameters

Pearson χ2, Deviance for goodness of fit

β invariant to the number of response categories

Predicting Y now involves a vector of probabilities

Easiest to first compute cumulative probabilities and then use subtraction to get the probability vector
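The "cumulative probabilities, then subtraction" recipe might look like the following in base R. The helper name `category_probs` and the values of `theta` and `beta` are hypothetical, chosen only to illustrate the calculation under the $\theta_j - x\beta$ parameterization with a logit link.

```r
# Category probabilities from a proportional odds model (logit link):
#   P(Y <= j | x) = plogis(theta_j - x * beta), then difference adjacent values.
theta <- c(-1.0, 0.5, 2.0)   # hypothetical increasing cutpoints (J = 4 categories)
beta  <- 0.8                 # hypothetical slope

category_probs <- function(x, theta, beta) {
  cum <- c(plogis(theta - x * beta), 1)  # P(Y <= 1), ..., P(Y <= J) = 1
  diff(c(0, cum))                        # subtract adjacent cumulative probs
}

p0 <- category_probs(0, theta, beta)
p2 <- category_probs(2, theta, beta)
rbind(x0 = p0, x2 = p2)
```

With a positive `beta`, moving from x = 0 to x = 2 shifts probability toward the higher categories, which matches the sign interpretation given for the proportional odds model.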


Maximum Likelihood Estimation

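Let $p_j(X) = P(Y \le j \mid X) - P(Y \le j - 1 \mid X)$

$Y_i$ is vector of length J with one 1 and remainder 0’s

Log-likelihood for the ith observation is:

$$\begin{aligned}
l_i &= \log \prod_{j=1}^{J} p_j(x_i)^{y_{ij}} \\
&= \log \prod_{j=1}^{J} \left[ P(Y \le j \mid x_i) - P(Y \le j - 1 \mid x_i) \right]^{y_{ij}} \\
&= \log \prod_{j=1}^{J} \left[ \frac{\exp\{\theta_j - x_i\beta\}}{1 + \exp\{\theta_j - x_i\beta\}} - \frac{\exp\{\theta_{j-1} - x_i\beta\}}{1 + \exp\{\theta_{j-1} - x_i\beta\}} \right]^{y_{ij}}
\end{aligned}$$

with the conventions $P(Y \le 0 \mid x_i) = 0$ and $P(Y \le J \mid x_i) = 1$

Maximize $\sum_{i=1}^{I} l_i$ with respect to $\theta_j$ and β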
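To make the maximization concrete, here is a minimal base-R sketch that codes the (average) negative log-likelihood and hands it to `optim`. Everything in it is an assumption for illustration: the data are simulated, the true values of `theta` and `beta` are made up, and the log-gap reparameterization used to keep the cutpoints increasing is one convenient choice, not necessarily what any particular package does.

```r
set.seed(11)                               # arbitrary seed
n     <- 2000
x     <- rnorm(n)
theta <- c(-1, 0, 1.5)                     # hypothetical true cutpoints (J = 4)
beta  <- 1                                 # hypothetical true slope

# Simulate Y so that P(Y <= j | x) = plogis(theta_j - beta * x)
y <- findInterval(beta * x + rlogis(n), theta) + 1

# Average negative log-likelihood (same maximizer as the summed version).
# Cutpoints stay increasing via theta = cumsum(c(a, exp(d_2), ..., exp(d_{J-1}))).
negloglik <- function(par, x, y, J) {
  th  <- cumsum(c(par[1], exp(par[2:(J - 1)])))
  b   <- par[J]
  # n x (J + 1) matrix of P(Y <= 0), P(Y <= 1), ..., P(Y <= J)
  cum <- cbind(0, plogis(outer(-b * x, th, "+")), 1)
  rows <- seq_along(y)
  pj  <- cum[cbind(rows, y + 1)] - cum[cbind(rows, y)]   # P(Y = y_i | x_i)
  -mean(log(pmax(pj, .Machine$double.xmin)))
}

fit <- optim(rep(0, 4), negloglik, x = x, y = y, J = 4, method = "BFGS")
theta_hat <- cumsum(c(fit$par[1], exp(fit$par[2:3])))
beta_hat  <- fit$par[4]
round(c(theta = theta_hat, beta = beta_hat), 3)
```

With this sample size the estimates should land close to the simulated truth, though the exact digits depend on the seed.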


Calculating the Residual Deviance

Residual deviance

$$G^2 = 2\sum_{i=1}^{I}\sum_{j=1}^{J} y_{ij} \log\left( \frac{1}{\hat{p}_j(x_i)} \right) = -2\sum_{i=1}^{I}\sum_{j=1}^{J} y_{ij} \log \hat{p}_j(x_i)$$

Degrees of freedom are the difference between model dfs

# params in saturated model (= # observations)

# params in reduced model (= # of intercepts + # predictors)

Residual deviance degrees of freedom for math aptitude study are 500 − 4 = 496


Example: Math Aptitude

In R: Use polr in library MASS

> library(MASS)

> fit = polr(grade1 ~ psat,mathapt)

> summary(fit)

Coefficients:

Value Std. Error t value

psat -0.01792 0.00124 -14.46

Intercepts:

A|B -11.5391 0.6912 -16.6944

B|C -8.8652 0.5930 -14.9492

C|D -6.3311 0.5196 -12.1834

Residual Deviance: 981.8668

AIC: 989.8668

*** polr parameterizes the model as logit P(Y ≤ j) = int_j − XB, so be wary of sign

Grade A associated with j = 1 and Grade D with j = 4. That is why these are negative coefficients (see note above)


Results: Math Aptitude

Deviance is 981.9 on 496 df (from fit$df.residual)

Similar to Bernoulli distribution (ungrouped), this deviance should not be used to assess goodness of fit

Better to use extension of H-L test or Lipsitz test

Assessment of β: For each 10-pt increase in score, the odds of being > j versus ≤ j decrease 16.4% ($1 - e^{-0.1792}$)

> exp(10*confint(fit))

Waiting for profiling to be done...

Re-fitting to get Hessian

2.5 % 97.5 %

0.8154889 0.8558026
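The 16.4% figure and its interval follow directly from the reported coefficient and profile confidence limits. The short sketch below redoes that arithmetic; the numbers are copied from the output above, and the variable names are just for illustration.

```r
beta_hat <- -0.01792                  # psat coefficient reported by summary(fit)
or_10    <- exp(10 * beta_hat)        # odds multiplier for Y > j per 10-point increase
pct_drop <- 100 * (1 - or_10)         # percent decrease in those odds

ci_10  <- c(0.8154889, 0.8558026)     # exp(10 * confint(fit)) as reported
pct_ci <- 100 * (1 - rev(ci_10))      # same interval on the percent-decrease scale

round(c(estimate = pct_drop, lower = pct_ci[1], upper = pct_ci[2]), 1)
```

On the percent scale the interval runs from roughly a 14% to an 18% decrease in the odds per 10 points.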


Extension of Hosmer-Lemeshow

Score each observation and then group on these scores

$$s_i = \hat{p}_{i1} + 2\hat{p}_{i2} + \cdots + J\hat{p}_{iJ}$$

$$\sum_{k=1}^{C}\sum_{j=1}^{J} \frac{(O_{kj} - E_{kj})^2}{E_{kj}} \sim \chi^2_{df}$$

C represents the number of (equal-sized) groups

For this model, df = (C − 2)(J − 1) + (J − 2)

The additional J − 2 df are due to the reduced number of parameters relative to a multinomial model

Note that our Bernoulli version considers J = 2


Using R

Can still use logitgof function

> library(generalhoslem)

> logitgof(mathapt$grade1,ord=TRUE,fitted(fit))

Hosmer and Lemeshow test (ordinal model)

data: grade1, fitted(fit)

X-squared = 29.356, df = 26, p-value = 0.2952

Warning message:

In logitgof(grade1, ord = TRUE, fitted(fit)) :

At least one cell in the expected frequencies table is < 1.

Chi-square approximation may be incorrect.

This will be a problem in this example because there is little overlap in math scores between A and D aptitude students
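The degrees of freedom and p-value in the logitgof output can be reproduced from the df formula for the ordinal Hosmer-Lemeshow test. The sketch below assumes the default of C = 10 groups, which is consistent with the reported df of 26; the statistic itself is copied from the output rather than recomputed.

```r
C <- 10                                # number of groups (deciles, assumed default)
J <- 4                                 # aptitude grades A-D
df <- (C - 2) * (J - 1) + (J - 2)      # (C - 2)(J - 1) + (J - 2)

x2      <- 29.356                      # X-squared reported by logitgof
p_value <- pchisq(x2, df = df, lower.tail = FALSE)
c(df = df, p_value = round(p_value, 4))
```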


Grouped Table - Using Deciles

          Y = 1         Y = 2         Y = 3         Y = 4
Group    O      E      O      E      O      E      O      E     Total
1       25  21.34     21  23.81      3   4.42      1   0.43       50
2       11   9.75     28  32.38     18  13.28      0   1.59       57
3        3   3.85     23  21.93     16  16.69      3   2.53       45
4        2   2.72     21  19.78     20  22.30      6   4.20       49
5        0   2.46     26  21.26     33  34.73      8   8.56       67
6        0   0.75      8   7.55     21  18.38      4   6.33       33
7        1   0.88      6   9.66     31  32.08     20  15.38       58
8        0   0.39      5   4.68     17  21.79     21  16.14       43
9        0   0.25      4   3.19     23  21.63     30  31.93       57
10       0   0.06      0   0.78     13   7.35     28  32.81       41


Lipsitz Test

As with H-L test, sort data into C groups

Define C − 1 group indicator variables

Fit new ordinal logistic regression

$$\log\left( \frac{P(Y \le j \mid X)}{P(Y > j \mid X)} \right) = \theta_j - X\beta + \gamma_1 I_1 + \cdots + \gamma_{C-1} I_{C-1}$$

Use the likelihood ratio test to test $H_0 : \gamma_1 = \cdots = \gamma_{C-1} = 0$

Recommend C be such that 6 ≤ C < N/(5J)


Using R

Can use lipsitz.test function in generalhoslem

> library(generalhoslem)

> lipsitz.test(fit)

Lipsitz goodness of fit test for ordinal response models

data: formula: grade1 ~ psat

LR statistic = 11.226, df = 9, p-value = 0.2605

Tends to have rejection rates > α in small samples

Works best when covariates are continuous


Example 2: Dose Response

Effect of intravenous medication doses on patients with subarachnoid hemorrhage trauma (p. 207, OrdCDA)

                         Glasgow Outcome Scale (Y)
Treatment                Veget.    Major     Minor     Good
Group (X)      Death     State     Disab.    Disab.    Recov.
Placebo          59        25        46        48        32
Low dose         48        21        44        47        30
Medium dose      44        14        54        64        31
High dose        43         4        49        58        41

Response: Glasgow Outcome scale - Ordered

Predictor: Dose level - Ordered

Similar to setting for linear-by-linear association model

Focus, however, is on predicting Y

So how do we treat the levels of X? Score them?


Notation for Ordinal Predictor

Back to contingency table summary (grouped data)

                     Y
X        1      2     ···     J      Total
1       y11    y12    ···    y1J     y1.
2       y21    y22    ···    y2J     y2.
⋮        ⋮      ⋮             ⋮       ⋮
I       yI1    yI2    ···    yIJ     yI.
Total   y.1    y.2    ···    y.J     n

Interested in cond probs $P(Y = j \mid X = i) = p_{j|i}$

Proportional-odds model focuses on cumulative probs

$$P(Y \le j \mid X = i) = \sum_{k \le j} p_{k|i}$$


Ordinal Odds Ratios

Local odds ratios

$$\theta^L_{ij} = \frac{P(X = i, Y = j) \,/\, P(X = i, Y = j + 1)}{P(X = i + 1, Y = j) \,/\, P(X = i + 1, Y = j + 1)}$$

Global odds ratios

$$\theta^G_{ij} = \frac{P(X \le i, Y \le j) \,/\, P(X \le i, Y > j)}{P(X > i, Y \le j) \,/\, P(X > i, Y > j)}$$

Cumulative odds ratios (conditional on X)

$$\theta^C_{ij} = \frac{P(Y \le j \mid X = i) \,/\, P(Y > j \mid X = i)}{P(Y \le j \mid X = i + 1) \,/\, P(Y > j \mid X = i + 1)}$$

Analogues to correlations, but for categorical variables


Ordinal Odds Ratio Estimates

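Local odds ratios

$$\hat\theta^L_{ij} = \frac{y_{ij} \,/\, y_{i,j+1}}{y_{i+1,j} \,/\, y_{i+1,j+1}}$$

Global odds ratios

$$\hat\theta^G_{ij} = \frac{\sum_{a \le i}\sum_{b \le j} y_{ab} \,\big/\, \sum_{a \le i}\sum_{b > j} y_{ab}}{\sum_{a > i}\sum_{b \le j} y_{ab} \,\big/\, \sum_{a > i}\sum_{b > j} y_{ab}}$$

Cumulative odds ratios (conditional on X)

$$\hat\theta^C_{ij} = \frac{\sum_{b \le j} y_{ib} \,\big/\, \sum_{b > j} y_{ib}}{\sum_{b \le j} y_{i+1,b} \,\big/\, \sum_{b > j} y_{i+1,b}}$$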

Alternative: testing for association with Pearson X 2
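The three estimators can be applied to the dose-response counts from Example 2. The base-R sketch below is one way to do it; the function names `local_or`, `global_or`, and `cumulative_or` and the helper `tab` are hypothetical, and the counts are typed in from the table shown earlier.

```r
# Counts from the dose-response table (rows: dose group, columns: outcome)
y <- matrix(c(59, 25, 46, 48, 32,
              48, 21, 44, 47, 30,
              44, 14, 54, 64, 31,
              43,  4, 49, 58, 41),
            nrow = 4, byrow = TRUE,
            dimnames = list(dose = c("Placebo", "Low", "Medium", "High"),
                            outcome = c("Death", "Veget", "Major", "Minor", "Good")))
I <- nrow(y); J <- ncol(y)

# Local: adjacent rows i, i+1 and adjacent columns j, j+1
local_or <- function(i, j) (y[i, j] / y[i, j + 1]) / (y[i + 1, j] / y[i + 1, j + 1])

# Global: collapse the table into a 2 x 2 at the cut (i, j)
global_or <- function(i, j) {
  n11 <- sum(y[1:i, 1:j]);       n12 <- sum(y[1:i, (j + 1):J])
  n21 <- sum(y[(i + 1):I, 1:j]); n22 <- sum(y[(i + 1):I, (j + 1):J])
  (n11 / n12) / (n21 / n22)
}

# Cumulative: adjacent rows, columns collapsed at cut j
cumulative_or <- function(i, j) {
  (sum(y[i, 1:j]) / sum(y[i, (j + 1):J])) /
    (sum(y[i + 1, 1:j]) / sum(y[i + 1, (j + 1):J]))
}

# (I - 1) x (J - 1) tables of estimates: rows index i, columns index j
idx <- expand.grid(i = 1:(I - 1), j = 1:(J - 1))
tab <- function(f) matrix(mapply(f, idx$i, idx$j), nrow = I - 1)
round(tab(cumulative_or), 3)
```

Under proportional odds with equally spaced dose scores, the cumulative odds ratios in `tab(cumulative_or)` would be roughly constant across both rows and columns.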


Example 2 Analysis : Dose Scored

> library(MASS)

> fit1 = polr(outcome~dose,weights=count,data=prob2)

> summary(fit1)

Coefficients:

dose 0.1755 0.05671 3.094

Intercepts:

1|2 -0.8946 0.1144 -7.8233

2|3 -0.4941 0.1107 -4.4638

3|4 0.5162 0.1118 4.6150

4|5 1.8815 0.1311 14.3565

Residual Deviance: 2461.349

AIC: 2471.349

> exp(confint(fit1))

2.5 % 97.5 %

1.066619 1.332269

Degrees of freedom are 797

Cannot use the residual deviance to assess fit (ungrouped)

Increasing dose by 1 level increases the odds of a higher outcome (Y > j versus Y ≤ j) by between 6.7% and 33.2%


Plot of Predicted Probabilities

> matplot(predProb, type="l", xlab="Dose+1", ylab="Predicted Probability", cex=3.5)

> legend(x=3,y=0.15, lty=c(1:5), col=c(1:5), paste("Outcome =", c(1:5)))


Example 2 Analysis : Dose Categorical

> fit2 = polr(outcome~as.factor(dose),weights=count,data=prob2)

> summary(fit2)

Coefficients:

as.factor(dose)1 0.1176 0.1791 0.6564

as.factor(dose)2 0.3174 0.1740 1.8240

as.factor(dose)3 0.5208 0.1794 2.9029

Intercepts:

1|2 -0.9188 0.1322 -6.9488

2|3 -0.5183 0.1291 -4.0154

3|4 0.4922 0.1298 3.7925

4|5 1.8579 0.1462 12.7072

AIC: 2475.216

> anova(fit1,fit2)

Model R. df Resid. Dev Test Df LR stat. Pr(Chi)

dose 797 2461.349

as.factor(dose) 795 2461.216 1 vs 2 2 0.1328 0.9357261

*** Two additional parameters

*** Test above does not suggest they add much to the fit
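The likelihood ratio test in the anova table can be reproduced by hand from the two residual deviances. The sketch below uses the rounded deviances printed above, so the statistic agrees with the reported 0.1328 only to about three decimals.

```r
dev_scored <- 2461.349; df_scored <- 797   # polr fit with dose as a numeric score
dev_factor <- 2461.216; df_factor <- 795   # polr fit with as.factor(dose)

lr_stat <- dev_scored - dev_factor         # drop in deviance from the 2 extra parameters
lr_df   <- df_scored - df_factor
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

round(c(LR = lr_stat, df = lr_df, p = p_value), 4)
```

A p-value this large gives no evidence against treating dose as an equally spaced score in the proportional odds model.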


Plot of Predicted Probabilities

> matplot(predProb1, type="l", xlab="Dose+1", ylab="Predicted Probability", cex=3.5)

> legend(x=3,y=0.15, lty=c(1:5), col=c(1:5), paste("Outcome =", c(1:5)))


Summary

Moving from scoring the ordinal variable to treating it as a nominal factor allows a test of the linearity assumption.

Result can depend on how one scores the different levels of the dose variable

Equally spaced

Unequally spaced

Visual comparison can be made via plots of the predicted probabilities like the ones on Slides #25 and #27.

Need to look at grouped goodness of fit statistics

Multiple reasons for a poor fit

violation of proportional odds; wrong link; wrong func. form or missing predictors; overdispersion


Goodness of Fit: Grouped Data

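Let the rows represent each of the groups

Expected cell frequency $\hat\mu_{ij}$ in row i and col j:

$$\hat\mu_{ij} = y_{i\cdot}\,\hat{P}(Y = j \mid X = i) = y_{i\cdot}\left[ \hat{P}(Y \le j \mid X = i) - \hat{P}(Y \le j - 1 \mid X = i) \right]$$

Pearson χ2

$$X^2 = \sum_{i=1}^{I}\sum_{j=1}^{J} \frac{(y_{ij} - \hat\mu_{ij})^2}{\hat\mu_{ij}} \overset{H_0}{\sim} \chi^2_{df}$$

Deviance

$$G^2 = 2\sum_{i=1}^{I}\sum_{j=1}^{J} y_{ij} \log\left( \frac{y_{ij}}{\hat\mu_{ij}} \right) \overset{H_0}{\sim} \chi^2_{df}$$

Dose scored: df = [I(J − 1)] − [(J − 1) + 1] = 11

Dose categorical: df = [I(J − 1)] − [(J − 1) + (I − 1)] = 9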
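For the dose-scored model, the grouped statistics can be computed in base R from the reported polr estimates. This is a sketch under stated assumptions: dose is coded 0 to 3, the estimates are the rounded values printed in the summary output, and the fitted cumulative probabilities use the $\theta_j - \beta \cdot \text{dose}$ form. The counts are typed in from the Example 2 table.

```r
# Observed counts (rows: dose 0-3, columns: outcome 1-5)
y <- matrix(c(59, 25, 46, 48, 32,
              48, 21, 44, 47, 30,
              44, 14, 54, 64, 31,
              43,  4, 49, 58, 41), nrow = 4, byrow = TRUE)
I <- nrow(y); J <- ncol(y)

dose  <- 0:3                                    # assumed coding of the dose score
theta <- c(-0.8946, -0.4941, 0.5162, 1.8815)    # reported intercepts (rounded)
beta  <- 0.1755                                 # reported dose coefficient (rounded)

# Fitted cumulative probs P(Y <= j | dose); last column is P(Y <= J) = 1
cum_hat <- cbind(plogis(outer(-beta * dose, theta, "+")), 1)
# Category probs by differencing adjacent cumulative probs within each row
p_hat   <- t(apply(cbind(0, cum_hat), 1, diff))
# Expected counts mu_hat_ij = y_i. * p_hat_ij
mu_hat  <- rowSums(y) * p_hat

X2 <- sum((y - mu_hat)^2 / mu_hat)              # Pearson statistic
G2 <- 2 * sum(y * log(y / mu_hat))              # deviance statistic
df <- I * (J - 1) - ((J - 1) + 1)               # 16 - 5 = 11
p_values <- pchisq(c(X2 = X2, G2 = G2), df = df, lower.tail = FALSE)
```

Because the expected counts here are all well above 5, the chi-square approximation is more trustworthy than in the ungrouped math aptitude example.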


Visual Assessment of Proportional Odds : Grouped Data

Focus on each predictor (holding other predictors fixed)

According to the model, for all j and δ:

$$\log\left( \frac{P(Y \le j \mid X + \delta)}{P(Y > j \mid X + \delta)} \right) - \log\left( \frac{P(Y \le j \mid X)}{P(Y > j \mid X)} \right) = -\beta\delta$$

Can plot these differences in cumulative log-odds using estimates from the saturated model

When proportional odds are appropriate, the differences should be roughly the same for all values of X and levels j


Using R

> mat = xtabs(count~dose+outcome,prob2)

> cumProb <- apply( mat/apply(mat, 1, sum), 1, cumsum)

> cumProb

0 1 2 3

0 0.2809524 0.2526316 0.2125604 0.2205128

1 0.4000000 0.3631579 0.2801932 0.2410256

2 0.6190476 0.5947368 0.5410628 0.4923077

3 0.8476190 0.8421053 0.8502415 0.7897436

4 1.0000000 1.0000000 1.0000000 1.0000000

> logit <- function(x) {log(x/(1-x))}

> # Rows of cumProb are the outcome cutoffs, columns are the doses.

> # Each line is the log(OR) between adjacent doses, plotted across the cutoffs.

> plot(0:3, logit(cumProb[-5,2])-logit(cumProb[-5,1]), type="l",

+ ylim=c(-1, 1), xlab="Cutoff j", ylab="Empirical log(OR)", cex=3.5)

> for (i in 3:4) {lines(0:3, logit(cumProb[-5,i])-

+ logit(cumProb[-5,i-1]),col=i, lty=i)

+ }

> abline(h=-coef(fit1), col="red", lwd=2)

> legend("topleft", lty=c(1,3,4), col=c(1,3,4),

+ paste("Dose", 1:3, "vs", 0:2), cex=1)

> legend("topright", lty=c(1), col=c("red"), "Model-based")


Are They Relatively Constant?


Formal Test for Proportional Odds

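Testing

$$H_0 : \log\left( \frac{P_j}{1 - P_j} \right) = \theta_j - X\beta$$

$$H_a : \log\left( \frac{P_j}{1 - P_j} \right) = \theta_j - X\beta_j$$

where $P_j = P(Y \le j \mid X)$

Model under Ha specifies cumulative logit, but not proportional odds, since log(OR) depends on j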

The model under H0 is nested within the model under Ha

Thus can compare residual deviances


Formal Test in R

Must use vglm function in VGAM package

First fit proportional-odds model

> library(VGAM)

> fit.vgam <- vglm(as.numeric(outcome) ~ dose,

+ cumulative(parallel=TRUE, reverse=FALSE),

+ weights=count,prob2)

> summary(fit.vgam)

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept):1 -0.89466 0.11456 -7.809 5.74e-15 ***

(Intercept):2 -0.49410 0.11059 -4.468 7.91e-06 ***

(Intercept):3 0.51615 0.11067 4.664 3.10e-06 ***

(Intercept):4 1.88151 0.13020 14.451 < 2e-16 ***

dose -0.17548 0.05632 -3.116 0.00183 **

Residual deviance: 2461.349 on 75 degrees of freedom

**df based on ungrouped multinomial logit model**


Formal Test in R

Now fit relaxed model

> fit.vgam3 <- vglm(as.numeric(outcome) ~ dose,

+ cumulative(parallel=FALSE, reverse=FALSE),

+ weights=count,prob2)

> summary(fit.vgam3)

Coefficients:


(Intercept):1 -0.97749 0.13194 -7.408 1.28e-13 ***

(Intercept):2 -0.36265 0.12034 -3.014 0.00258 **

(Intercept):3 0.52391 0.12011 4.362 1.29e-05 ***

(Intercept):4 1.78941 0.16415 10.901 < 2e-16 ***

dose:1 -0.11292 0.07288 -1.549 0.12130

dose:2 -0.26889 0.06832 -3.936 8.29e-05 ***

dose:3 -0.18234 0.06385 -2.856 0.00430 **

dose:4 -0.11925 0.08470 -1.408 0.15916



Results

Full residual deviance = 2447.018 on 72 df

Reduced residual deviance = 2461.349 on 75 df

Difference is 2461.349− 2447.018 = 14.331 on 3 df

> pchisq(14.331, 3, lower.tail=FALSE)

[1] 0.002487536

Cannot accept that reduced model gives adequate fit

Proportional odds not reasonable

However, full cumulative odds model has issues too

Non-parallel lines mean there is eventual crossing, so fitted cumulative probabilities can be out of order for some x


Multinomial Logit Model

We now shift to case where categories are unordered

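Therefore, cannot work with cumulative probabilities

Instead declare one category as a reference and link the covariates to probs through J − 1 relative prob ratios

$$\eta_{ij} = \log\left( \frac{p_{ij}}{p_{i1}} \right) = x_i\beta_j, \qquad j = 2, 3, \ldots, J$$

This model implies

$$p_{ij} = \exp\{x_i\beta_j\}\, p_{i1}, \qquad j = 2, 3, \ldots, J$$

and because $\sum_{j=1}^{J} p_{ij} = 1$, this means

$$p_{i1} = \frac{1}{1 + \sum_{k=2}^{J} \exp\{x_i\beta_k\}} \quad \text{and} \quad p_{ij} = \frac{\exp\{x_i\beta_j\}}{1 + \sum_{k=2}^{J} \exp\{x_i\beta_k\}}$$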
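The map from the J − 1 baseline-category logits to the J probabilities is short enough to write out. In the sketch below the function name `baseline_probs` and the intercepts and slopes are hypothetical; category 1 is taken as the reference, as in the formulas above.

```r
# Probabilities from baseline-category logits eta_j = x * beta_j, j = 2..J.
# Category 1 is the reference, so its linear predictor is fixed at 0.
baseline_probs <- function(eta) {
  e <- exp(c(0, eta))
  e / sum(e)
}

# Hypothetical intercepts and slopes for J = 3 categories
b0 <- c(-0.5, 0.2)
b1 <- c( 0.8, -0.3)
x  <- 1.5

eta <- b0 + b1 * x
p   <- baseline_probs(eta)

# The log relative prob ratios recover the linear predictors
log(p[-1] / p[1])
```

Choosing a different reference category changes the individual coefficients but not the fitted probabilities.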


Multinomial Logit Model

The baseline, or reference, category is arbitrary

Common choices by software are j = 1 or j = J

Separate set of parameters βj for each ratio

Values of βj depend on the choice of baseline

Because all sets of βj relative to common category, jointly define probs

More flexible model than proportional odds but more difficult to interpret (?)

Can be used as classification model using category with highest predicted probability


Parameter Interpretation

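In logistic regression and proportional-odds model, a $\beta_j$ represents a log odds ratio

In this model, a $\beta_j$ describes the log change in relative prob ratio

$$\begin{aligned}
\log\left( \frac{p_j(x + 1)/p_1(x + 1)}{p_j(x)/p_1(x)} \right)
&= \log\frac{p_j(x + 1)}{p_1(x + 1)} - \log\frac{p_j(x)}{p_1(x)} \\
&= \beta^*_{0j} + \beta_{1j}(x + 1) - (\beta^*_{0j} + \beta_{1j}x) \\
&= \beta_{1j}
\end{aligned}$$

$$\begin{aligned}
\log\left( \frac{p_j(x + 1)/p_k(x + 1)}{p_j(x)/p_k(x)} \right)
&= \log\frac{p_j(x + 1)}{p_k(x + 1)} - \log\frac{p_j(x)}{p_k(x)} \\
&= \log\frac{p_j(x + 1)}{p_1(x + 1)} - \log\frac{p_k(x + 1)}{p_1(x + 1)} - \log\frac{p_j(x)}{p_1(x)} + \log\frac{p_k(x)}{p_1(x)} \\
&= \beta_{1j} - \beta_{1k}
\end{aligned}$$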


Maximum Likelihood Estimation

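The log-likelihood for observation i is:

$$\begin{aligned}
l_i &= \log \prod_{j=1}^{J} p_j(x_i)^{y_{ij}} \\
&= \sum_{j=2}^{J} y_{ij} \log p_j(x_i) + \left( 1 - \sum_{j=2}^{J} y_{ij} \right) \log p_1(x_i) \\
&= \sum_{j=2}^{J} y_{ij} \log \frac{p_j(x_i)}{1 - \sum_{k=2}^{J} p_k(x_i)} + \log p_1(x_i) \\
&= \sum_{j=2}^{J} y_{ij}(x_i\beta_j) - \log\left( 1 + \sum_{j=2}^{J} \exp\{x_i\beta_j\} \right)
\end{aligned}$$

Maximize $\sum_{i=1}^{I} l_i$ with respect to $\beta_j$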


Example 2: Dose Response

Let’s revisit our dose response study but use multinomial logit model

Let’s consider dose

As a categorical predictor

There are 3 indicator variables + intercept for each of the 4 non-baseline response levels

Total of 4(4) = 16 parameters

As a continuous predictor

Will assign scores to the categories

Total of 4(2) = 8 parameters

Previous proportional-odds models had 7 and 5 parameters, respectively


Example 2: Dose Categorical

> library(nnet)

> fit1 <- multinom(outcome ~ as.factor(dose), weights=count, prob2)

> summary(fit1)

Coefficients:

(Intercept) as.factor(dose)1 as.factor(dose)2 as.factor(dose)3

2 -0.8586335 0.03194971 -0.2864809 -1.5161958

3 -0.2488754 0.16185828 0.4536705 0.3794879

4 -0.2063195 0.18526707 0.5810140 0.5055581

5 -0.6117850 0.14178037 0.2615807 0.5641507

Std. Errors:

(Intercept) as.factor(dose)1 as.factor(dose)2 as.factor(dose)3

2 0.2386396 0.3541205 0.3887204 0.5746170

3 0.1966936 0.2867909 0.2827264 0.2869711

4 0.1943777 0.2826526 0.2759257 0.2797853

5 0.2195434 0.3199468 0.3212239 0.3095891


AIC: 2475.166


Dose Categorical - Predicted Probs

> predProb <- unique(fit1$fitted.values)

> matplot(predProb,las=1,type="l")

> legend("bottomleft", lty=c(1:5), col=c(1:5),

+ paste("Response =", c(0:4)),cex=0.75)


Calculation of Residual Deviance

Saturated model when the data are treated as grouped:

Model-based predicted probs = sample proportions

> m / apply(m, 1, sum)

0 1 2 3 4

0 0.2809524 0.11904762 0.2190476 0.2285714 0.1523810

1 0.2526316 0.11052632 0.2315789 0.2473684 0.1578947

2 0.2125604 0.06763285 0.2608696 0.3091787 0.1497585

3 0.2205128 0.02051282 0.2512821 0.2974359 0.2102564

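Deviance for grouped data

$$G^2 = 2\sum_{i=1}^{I}\sum_{j=1}^{J} y_{ij} \log\left( \frac{y_{ij}}{\hat\mu_{ij}} \right) = 2\sum_{i=1}^{I}\sum_{j=1}^{J} y_{ij} \log\left( \frac{y_{ij}}{y_{ij}} \right) = 0$$

Deviance for ungrouped data

$$G^2 = 2\sum_{i=1}^{I}\sum_{j=1}^{J} y_{ij} \log\left( \frac{1}{\hat{p}_j(x_i)} \right) = -2\sum_{i=1}^{I}\sum_{j=1}^{J} y_{ij} \log \hat{p}_j(x_i) = 2443.166$$

with I × J × (J − 1) − I · (J − 1) = 4 · 5 · 4 − 4 · 4 = 64 df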
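Since the categorical-dose model reproduces the sample proportions, both deviances can be computed straight from the contingency table without refitting. The base-R sketch below does this with the Example 2 counts typed in by hand; the object names are illustrative.

```r
# Dose-response counts (rows: dose 0-3, columns: outcome 0-4)
y <- matrix(c(59, 25, 46, 48, 32,
              48, 21, 44, 47, 30,
              44, 14, 54, 64, 31,
              43,  4, 49, 58, 41), nrow = 4, byrow = TRUE)

# Fitted probs under the categorical-dose model = row-wise sample proportions
p_hat  <- y / rowSums(y)
mu_hat <- rowSums(y) * p_hat                    # expected counts, equal to y here

# Grouped deviance: the model is saturated for the 4 x 5 table
dev_grouped <- 2 * sum(y * log(y / mu_hat))

# Ungrouped deviance: each subject is its own multinomial trial
dev_ungrouped <- -2 * sum(y * log(p_hat))

round(c(grouped = dev_grouped, ungrouped = dev_ungrouped), 3)
```

The ungrouped value matches the residual deviance implied by the multinom AIC (2475.166 minus 2 × 16 parameters).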


Example 2: Dose Scored

> library(nnet)

> fit2 <- multinom(outcome ~ dose, weights=count, prob2)

> summary(fit2)

Coefficients:

(Intercept) dose

2 -0.6999134 -0.3544346

3 -0.2194566 0.1470232

4 -0.1772963 0.1945578

5 -0.6544057 0.1914772

Std. Errors:

(Intercept) dose

2 0.2051749 0.13796048

3 0.1676773 0.09130087

4 0.1649761 0.08894654

5 0.1896008 0.10105460


AIC: 2465.145


Dose Scored - Predicted Probs

> predProb <- unique(fit2$fitted.values)

> matplot(predProb,las=1,type="l")

> legend("bottomleft", lty=c(1:5), col=c(1:5),

+ paste("Response =", c(0:4)),cex=0.75)


Conclusions

Can compare the two models to test for linearity

> anova(fit1,fit2)

Model Res. df Resid. Dev Df LR stat. Pr(Chi)

1 dose 72 2449.145

2 as.factor(dose) 64 2443.166 8 5.97846 0.6496448

Conclude that it is sufficient to consider linearity

Can do grouped goodness of fit test to assess fit

$G^2 = 5.98$ on 8 df (same because grouped Model 2 saturated)

This model does not fit as well as the relaxed cumulative-odds model

$G^2 = 2447.018$ on 72 df versus $G^2 = 2449.145$ on 72 df


Test for Equality of βj

Can test if different slope needed for each class j

$$H_0 : \log\left( \frac{p_j(X)}{p_1(X)} \right) = \beta_{0j} + \beta X, \qquad j = 2, \ldots, J$$

$$H_a : \log\left( \frac{p_j(X)}{p_1(X)} \right) = \beta_{0j} + \beta_j X, \qquad j = 2, \ldots, J$$

# -----separate beta_j for each response category-----

# ------the last category is the baseline in VGAM------

> fit3 <- vglm(outcome~dose, multinomial(parallel=FALSE),

+ weights=count, data=prob2)

> summary(fit3)

# -------same beta_j for each response category-------

> fit3.parallel <- vglm(outcome~dose,multinomial(parallel=TRUE),

+ weights=count, data=prob2)

> summary(fit3.parallel)

> 1 - pchisq(2*(logLik(fit3)-logLik(fit3.parallel)),

+ df=length(coef(fit3))-length(coef(fit3.parallel)))

[1] 0.0001767769


Example 4: Housing Satisfaction

1681 Copenhagen residents in study (housing in MASS)

Three categorical predictors (1 nominal, 2 ordered)

                               Contact: Low              Contact: High
Satisfaction:                  Low   Medium   High       Low   Medium   High
Housing           Influence
Tower blocks      Low           21     21      28         14     19      37
                  Medium        34     22      36         17     23      40
                  High          10     11      36          3      5      23
Apartments        Low           61     23      17         78     46      43
                  Medium        43     35      40         48     45      86
                  High          26     18      54         15     25      62
Atrium houses     Low           13      9      10         20     23      20
                  Medium         8      8      12         10     22      24
                  High           6      7       9          7     10      21
Terraced houses   Low           18      6       7         57     23      13
                  Medium        15     13      13         31     21      13
                  High           7      5      11          5      6      13


Mosaic Plot


Multinomial Logit Null Model Fit

Distribution of satisfaction same for all residents

> fit.null <- multinom(Sat~1,weights=Freq,housing)

> summary(fit.null)

Call:

multinom(formula = Sat ~ 1, data = housing, weights = Freq)

Coefficients:

(Intercept)

Medium -0.2400404

High 0.1639289

Std. Errors:

(Intercept)

Medium 0.06329155

High 0.05710232

Fitted probabilities from the intercepts:

Low: 1/(1+exp(-0.2400404)+exp(0.1639289)) = 0.3372992

Medium: exp(-0.2400404)/(1+exp(-0.2400404)+exp(0.1639289)) = 0.2653183

High: exp(0.1639289)/(1+exp(-0.2400404)+exp(0.1639289)) = 0.3973825

DF = 4*3*2*3*(3-1) - 2 = 142 (ungrouped)

DF = 4*3*2*(3-1) - 2 = 46 (grouped)


AIC: 3652.878


Multinomial Logit Fit

Consider influence as nominal variable

> fit.multinom <- multinom(Sat~Infl+Type+Cont,weights=Freq,housing)

> summary(fit.multinom)

Coefficients:

(Intercept) InflMedium InflHigh TypeApartment

Medium -0.4192316 0.4464003 0.6649367 -0.4356851

High -0.1387453 0.7348626 1.6126294 -0.7356261

TypeAtrium TypeTerrace ContHigh

Medium 0.1313663 -0.6665728 0.3608513

High -0.4079808 -1.4123333 0.4818236

Std. Errors:

(Intercept) InflMedium InflHigh TypeApartment

Medium 0.1729344 0.1415572 0.1863374 0.1725327

High 0.1592295 0.1369380 0.1671316 0.1552714

TypeAtrium TypeTerrace ContHigh

Medium 0.2231065 0.2062532 0.1323975

High 0.2114965 0.2001496 0.1241371

Residual Deviance: 3470.084

AIC: 3498.084

Should we consider interactions among predictors?


Surrogate Log-Linear Models

Again focusing on satisfaction as multinomial response with other three variables as predictors

Will use associations between variables to develop predictive model

Model #1: Satisfaction is indep of the three predictors

If true, conditional distribution of satisfaction is the same for all predictor combinations

In other words, conditional probs do not vary with predictors

This is the same as the multinomial null model

Can express as log-linear model using

> fit <- glm(Freq~Infl*Type*Cont+Sat,family=poisson,housing)


Model #1 Results

> summary(fit)

Coefficients:


(Intercept) 3.162e+00 1.243e-01 25.433 < 2e-16 ***

InflMedium 2.733e-01 1.586e-01 1.723 0.084868 .

InflHigh -2.054e-01 1.784e-01 -1.152 0.249511

TypeApartment 3.666e-01 1.555e-01 2.357 0.018403 *

TypeAtrium -7.828e-01 2.134e-01 -3.668 0.000244 ***

TypeTerrace -8.145e-01 2.157e-01 -3.775 0.000160 ***

ContHigh -1.190e-15 1.690e-01 0.000 1.000000

Sat1Medium -2.400e-01 6.329e-02 -3.793 0.000149 ***

Sat1High 1.639e-01 5.710e-02 2.871 0.004094 **

InflMedium:TypeApartment -1.177e-01 2.086e-01 -0.564 0.572571

InflHigh:TypeApartment 1.753e-01 2.279e-01 0.769 0.441783

InflMedium:TypeAtrium -4.068e-01 3.035e-01 -1.340 0.180118

InflHigh:TypeAtrium -1.692e-01 3.294e-01 -0.514 0.607433

InflMedium:TypeTerrace 6.292e-03 2.860e-01 0.022 0.982450

InflHigh:TypeTerrace -9.305e-02 3.280e-01 -0.284 0.776633

InflMedium:ContHigh -1.398e-01 2.279e-01 -0.613 0.539715

InflHigh:ContHigh -6.091e-01 2.800e-01 -2.176 0.029585 *


Model #1 Results

TypeApartment:ContHigh 5.029e-01 2.109e-01 2.385 0.017083 *

TypeAtrium:ContHigh 6.774e-01 2.751e-01 2.462 0.013811 *

TypeTerrace:ContHigh 1.099e+00 2.675e-01 4.106 4.02e-05 ***

InflMedium:TypeApartment:ContHigh 5.359e-02 2.862e-01 0.187 0.851450

InflHigh:TypeApartment:ContHigh 1.462e-01 3.380e-01 0.432 0.665390

InflMedium:TypeAtrium:ContHigh 1.555e-01 3.907e-01 0.398 0.690597

InflHigh:TypeAtrium:ContHigh 4.782e-01 4.441e-01 1.077 0.281619

InflMedium:TypeTerrace:ContHigh -4.980e-01 3.671e-01 -1.357 0.174827

InflHigh:TypeTerrace:ContHigh -4.470e-01 4.545e-01 -0.984 0.325326

Null deviance: 833.66 on 71 degrees of freedom


AIC: 610.43

Large deviance suggests probs vary with predictors

Residual deviance based on Poisson dist here

Coefs for Sat1 are the same as null multinomial intercepts


Additive Contributions of Predictors

Can assess whether Sat1 depends on each of the 3 predictors individually by adding interactions with it

> addterm(fit, ~. + Sat1:(Infl+Type+Cont), test="Chisq")

Single term additions

Model:

Freq ~ Infl * Type * Cont + Sat

Df Deviance AIC LRT Pr(Chi)

<none> 217.46 610.43

Infl:Sat1 4 111.08 512.05 106.371 < 2.2e-16 ***

Type:Sat1 6 156.79 561.76 60.669 3.292e-11 ***

Cont:Sat1 2 212.33 609.30 5.126 0.07708 .

Infl: max reduction in resid. deviance & AIC

Even though Cont:Sat1 not significant, let’s look at model with all three interactions


Model #2: Interactions with Sat

> fit2 <- glm(Freq~Infl*Type*Cont+Sat1:Infl+Sat1*Type+Sat1*Cont,

+ family=poisson,housing)

> summary(fit2)

Coefficients:


(Intercept) 3.32106 0.14761 22.498 < 2e-16 ***

InflMedium -0.14543 0.17855 -0.814 0.415369

InflHigh -1.17183 0.21803 -5.375 7.68e-08 ***

TypeApartment 0.68296 0.17522 3.898 9.71e-05 ***

TypeAtrium -0.70064 0.24137 -2.903 0.003698 **

TypeTerrace -0.32511 0.23230 -1.400 0.161652

ContHigh -0.28230 0.18441 -1.531 0.125814

Sat1Medium -0.41923 0.17293 -2.424 0.015342 *

Sat1High -0.13874 0.15923 -0.871 0.383570

InflMedium:TypeApartment -0.01788 0.21050 -0.085 0.932302

InflHigh:TypeApartment 0.38687 0.23330 1.658 0.097263 .

InflMedium:TypeAtrium -0.36031 0.30498 -1.181 0.237432

InflHigh:TypeAtrium -0.03679 0.33479 -0.110 0.912503

InflMedium:TypeTerrace 0.18515 0.28889 0.641 0.521580

InflHigh:TypeTerrace 0.31075 0.33482 0.928 0.353345

InflMedium:ContHigh -0.20006 0.22875 -0.875 0.381799

InflHigh:ContHigh -0.72579 0.28235 -2.571 0.010155 *

TypeApartment:ContHigh 0.56969 0.21215 2.685 0.007247 **


Model #2: Interactions with Sat

TypeAtrium:ContHigh 0.70211 0.27606 2.543 0.010979 *

TypeTerrace:ContHigh 1.21593 0.26997 4.504 6.67e-06 ***

InflMedium:Sat1Medium 0.44640 0.14156 3.153 0.001613 **

InflHigh:Sat1Medium 0.66494 0.18634 3.568 0.000359 ***

InflMedium:Sat1High 0.73486 0.13694 5.366 8.03e-08 ***

InflHigh:Sat1High 1.61263 0.16713 9.649 < 2e-16 ***

TypeApartment:Sat1Medium -0.43569 0.17253 -2.525 0.011562 *

TypeAtrium:Sat1Medium 0.13137 0.22311 0.589 0.555980

TypeTerrace:Sat1Medium -0.66657 0.20625 -3.232 0.001230 **

TypeApartment:Sat1High -0.73563 0.15527 -4.738 2.16e-06 ***

TypeAtrium:Sat1High -0.40798 0.21150 -1.929 0.053730 .

TypeTerrace:Sat1High -1.41233 0.20015 -7.056 1.71e-12 ***

ContHigh:Sat1Medium 0.36085 0.13240 2.726 0.006420 **

ContHigh:Sat1High 0.48183 0.12414 3.881 0.000104 ***

InflMedium:TypeApartment:ContHigh 0.04690 0.28621 0.164 0.869837

InflHigh:TypeApartment:ContHigh 0.12623 0.33821 0.373 0.708979

InflMedium:TypeAtrium:ContHigh 0.15724 0.39072 0.402 0.687364

InflHigh:TypeAtrium:ContHigh 0.47861 0.44424 1.077 0.281320

InflMedium:TypeTerrace:ContHigh -0.50016 0.36713 -1.362 0.173091

InflHigh:TypeTerrace:ContHigh -0.46310 0.45471 -1.018 0.308467

Null deviance: 833.657 on 71 degrees of freedom


AIC: 455.63

Model #2: Interactions with Sat

Same model as our main-effects multinomial model

Different deviances due to different saturated models.

In multinom the saturated model is for subjects

In surrogate log-linear model, it is for cells (grouped)

Comparison with null model

multinom: 3648.9 − 3470.1 = 178.8 and 142 − 130 = 12 df

log-linear: 217.5 − 38.7 = 178.8 and 46 − 34 = 12 df

Could also consider higher-order interactions

Represent non-additive effects of predictors on Sat

addterm(fit1, .~.+Sat:(Infl+Type+Cont)^2, test="Chisq")

None are found significant


Summary

Models using the Poisson distribution

Consider E (count response) as a function of predictors

Poisson regression

Quasipoisson or negative binomial regression

Surrogate log-linear model

Multivariate associations of categorical variables

Nominal random variables: Log-linear models

Ordinal random variables: Linear-by-linear model, column-effect models

Models using the multinomial distribution

Consider E (count response) as a function of predictors

Ordinal response: cumulative logit model

Nominal response: multinomial logit model

