Introduction to Generalized Linear Models
STAC51: Categorical Data Analysis
Mahinda Samarakoon
March 23, 2016
Mahinda Samarakoon STAC51: Categorical data Analysis 1 / 67
Table of contents
1 Introduction to Generalized linear Models
Introduction to Generalized linear Models
In ordinary regression models, we model the means of Normal random variables as functions of some predictors (independent variables).
Recall that the ordinary regression model is given by
Yi = β0 + β1xi1 + · · ·+ βpxip + εi
where εi are independent N(0, σ2).
This implies
E (Yi ) = β0 + β1xi1 + · · ·+ βpxip.
This model assumes that Yi has a Normal distribution.
What if this is not true? For example
Y can be a nominal categorical variable.
Y can be a Poisson random variable.
There are many other possibilities.
Introduction to Generalized linear Models
Generalized linear models (GLMs) extend ordinary regression models to encompass non-normal response variables and to model functions of the mean.
We can use these models to investigate the relationships (associations) among categorical and continuous variables.
They have three components:
Random component: identifies the response variable Y and its probability distribution.
Systematic component: identifies the explanatory variables used in a linear predictor function.
Link function: specifies the function of E(Y) that the model equates to the linear predictor.
Introduction to Generalized Linear Models: Random component
The random component of a GLM consists of a response variable Y with independent observations (y1, . . . , yN) from a distribution in the natural exponential family.
This family has a probability density function or mass function of the form
f(yi; θi) = a(θi) b(yi) exp(yi Q(θi))   (1)
Some important distributions, including the Poisson and binomial, are in this family.
The value of the parameter θi varies with i .
The parameter Q(θ) is called the natural parameter.
Note that there is a more general formula defining the exponential family, but this one is sufficient for the discrete data we discuss here.
Introduction to Generalized Linear Models: Systematic component
The systematic component of a GLM relates a vector (η1, . . . , ηN) to the explanatory variables through a linear model. Let xij denote the value of predictor j (j = 1, 2, . . . , p) for subject i. Then
ηi = Σj βj xij ,   i = 1, . . . , N.
This linear combination of explanatory variables is called the linear predictor.
Introduction to Generalized Linear Models: Link Function
The link function connects the random and systematic components.
The model links µi = E(Yi) to ηi by ηi = g(µi), where the link function g is a monotonic, differentiable function. Thus, g links E(Yi) to the explanatory variables through the formula
g(µi ) = β0 + β1xi1 + · · ·+ βpxip, i = 1, . . . ,N.
Introduction to Generalized Linear Models: Link Function
The link function g(µ) = µ, called the identity link, has ηi = µi. It is used in ordinary regression with normally distributed Y.
The link function that transforms the mean to the natural parameter is called the canonical link.
That is, g(µi) = Q(θi), and
Q(θi ) = β0 + β1xi1 + · · ·+ βpxip, i = 1, . . . ,N.
Use of the canonical link has advantages, but it is not mandatory.
Example: Binomial Logit Models for Binary Data
For binary data P(Y = 1) = π and P(Y = 0) = 1− π.
Y has a Bernoulli distribution.
µ = E (Y ) = π.
We can express the probability mass function as
f(y; π) = π^y (1 − π)^(1−y) = (1 − π)[π/(1 − π)]^y = (1 − π) exp[y log(π/(1 − π))]   (2)
for y = 0 and 1.
This is a natural exponential family, identifying θ with π: a(π) = 1 − π, b(y) = 1, and Q(π) = log[π/(1 − π)].
The natural parameter Q(π) = log[π/(1 − π)] is the log odds of response 1 (i.e., the log odds of Y = 1), the logit of π.
GLMs with this canonical link function are called logistic regression models, or sometimes simply logit models.
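As a quick numerical check — a small Python sketch rather than the course's R, purely for illustration — we can verify that the factorization a(π) b(y) exp(y Q(π)) reproduces the Bernoulli pmf:

```python
import math

def bernoulli_pmf(y, pi):
    # Direct form: pi^y (1 - pi)^(1 - y)
    return pi**y * (1 - pi)**(1 - y)

def exp_family_form(y, pi):
    # a(pi) * b(y) * exp(y * Q(pi)) with a(pi) = 1 - pi, b(y) = 1,
    # and natural parameter Q(pi) = log(pi / (1 - pi)), the logit
    return (1 - pi) * 1 * math.exp(y * math.log(pi / (1 - pi)))

for pi in (0.2, 0.5, 0.9):
    for y in (0, 1):
        assert math.isclose(bernoulli_pmf(y, pi), exp_family_form(y, pi))
print("factorization matches the Bernoulli pmf")
```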
Example: Binomial Logit Models for Binary Data
Question: Can we use the ordinary regression model for binary data? That model would be
E(Yi) = πi = β0 + β1xi1 + · · · + βpxip, i = 1, . . . , N.
The problem with this model is that πi is a probability (i.e., it takes values between 0 and 1), but linear functions take values over the entire real line.
This model also fails the usual assumptions of ordinary regression: Y does not have a Normal distribution, and Var(Yi) = πi(1 − πi) depends on i, so Var(Y) is not constant.
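The boundedness problem can be illustrated with a small Python sketch; the coefficients here are made up for illustration, not estimated from any data:

```python
import math

alpha, beta = 0.1, 0.05   # hypothetical coefficients, for illustration only

def linear_prob(x):
    return alpha + beta * x          # unbounded: can leave [0, 1]

def logistic_prob(x):
    eta = alpha + beta * x
    return math.exp(eta) / (1 + math.exp(eta))  # always in (0, 1)

print(linear_prob(30))    # 1.6 -- not a valid probability
print(logistic_prob(30))  # about 0.83 -- a valid probability
```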
Example: Loglinear Models for Count Data
The simplest distribution for count data is the Poisson distribution.
The probability mass function of the Poisson distribution is
f(y; µ) = e^(−µ) µ^y / y! = e^(−µ) (1/y!) exp[y log(µ)],   y = 0, 1, . . . .
This is a natural exponential family with θ = µ, a(µ) = e^(−µ), b(y) = 1/y!, and Q(µ) = log(µ).
The natural parameter is log(µ), and so the canonical link is the log link.
The model using this link function is
log(µi) = β0 + β1xi1 + · · · + βpxip, i = 1, . . . , N.   (4)
This is called a Poisson loglinear model.
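As with the Bernoulli case, the Poisson factorization can be verified numerically (a Python sketch for illustration):

```python
import math

def poisson_pmf(y, mu):
    # Direct form: e^(-mu) mu^y / y!
    return math.exp(-mu) * mu**y / math.factorial(y)

def exp_family_form(y, mu):
    # a(mu) * b(y) * exp(y * Q(mu)) with a(mu) = e^(-mu),
    # b(y) = 1/y!, and natural parameter Q(mu) = log(mu)
    return math.exp(-mu) * (1 / math.factorial(y)) * math.exp(y * math.log(mu))

for mu in (0.5, 3.0):
    for y in range(6):
        assert math.isclose(poisson_pmf(y, mu), exp_family_form(y, mu))
print("factorization matches the Poisson pmf")
```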
Logistic regression model
To simplify the discussion, let’s use only one explanatory variable, x, for predicting the probability of success, π(x).
The logistic regression model for this case is
log[π(x)/(1 − π(x))] = α + βx   (5)
or
π(x) = exp(α + βx)/(1 + exp(α + βx))   (6)
Note: F(x) = e^x/(1 + e^x) is the c.d.f. of the standard logistic distribution, so the logistic regression model can be written as π(x) = F(α + βx).
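A quick Python sketch (using the slide's α = 1, β = 0.5) confirms that expression (6) is exactly the standard logistic c.d.f. evaluated at α + βx:

```python
import math

def pi_model(x, alpha=1.0, beta=0.5):
    # exp(alpha + beta*x) / (1 + exp(alpha + beta*x)), expression (6)
    eta = alpha + beta * x
    return math.exp(eta) / (1 + math.exp(eta))

def logistic_cdf(z):
    # F(z) = e^z / (1 + e^z), the standard logistic c.d.f.
    return math.exp(z) / (1 + math.exp(z))

for x in (-10, -1, 0, 2.5, 10):
    assert math.isclose(pi_model(x), logistic_cdf(1.0 + 0.5 * x))
print("pi(x) = F(alpha + beta*x) checks out")
```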
Graph of π(x) vs x for α = 1 and β = 0.5
# R code for plotting the graph of pi(x) vs x
alpha <- 1
beta1 <- 0.5
curve(expr = exp(alpha + beta1*x) / (1 + exp(alpha + beta1*x)),
      from = -15, to = 15, col = "red",
      main = expression(pi(x) == frac(e^{alpha + beta*x}, 1 + e^{alpha + beta*x})),
      xlab = "x", ylab = expression(pi(x)),
      panel.first = grid(nx = NULL, ny = NULL, col = "gray", lty = "dotted"))
Graph of π(x) vs x for α = 1 and β = −0.5
[Figure: the corresponding decreasing logistic curve]
Logistic regression model with more than one independent variable
This model can be generalized to more than one independent variable.
log[π(x)/(1 − π(x))] = α + β1x1 + · · · + βpxp   (7)
or
π(x) = exp(α + β1x1 + · · · + βpxp)/(1 + exp(α + β1x1 + · · · + βpxp))   (8)
Interpretation of β’s
In the model with one independent variable, α represents the log-odds of Y = 1 when x = 0, and β represents the increase in the log-odds of Y = 1 when x increases by one unit.
In the model with more than one independent variable, α represents the log-odds of Y = 1 when x1 = · · · = xp = 0, and βk represents the increase in the log-odds of Y = 1 when xk increases by one unit, holding the other x variables fixed.
Interpretation of β’s
For example, in the aspirin study we found that the odds of a heart attack in the placebo group are 1.83 times those in the aspirin group. In this example we can take x = 1 as the placebo group and x = 0 as the aspirin group; y = 1 means had a heart attack and y = 0 means did not. Substituting x = 0 in (5), we get the log odds for the aspirin group
log(odds(0)) = log[π(0)/(1 − π(0))] = α
Substituting x = 1, we get the log odds for the placebo group
log(odds(1)) = log[π(1)/(1 − π(1))] = α + β
and so
β = log(odds(1)/odds(0)), i.e., e^β represents the odds ratio.
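A numerical check in Python (for illustration; the counts are the aspirin-study data used later in these slides):

```python
import math

# Aspirin study counts: (heart attack, no heart attack)
placebo = (189, 10845)
aspirin = (104, 10933)

odds_placebo = placebo[0] / placebo[1]
odds_aspirin = aspirin[0] / aspirin[1]

odds_ratio = odds_placebo / odds_aspirin
beta = math.log(odds_ratio)   # logistic slope for the placebo indicator x

print(round(odds_ratio, 2))   # 1.83, the odds ratio quoted above
print(round(beta, 4))         # 0.6054, matching the glm() slope fitted later
```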
Parameter estimation
We use maximum likelihood methods to estimate the parameters (i.e., α and the β’s). This requires numerical methods.
We can use the R function glm() to estimate the parameters.
Logistic regression model: Example
The R code below fits a logistic regression model for data from an example in Kutner et al. (2004). The example was based on a study of the effect of computer programming experience on the ability to complete a complex programming task, including debugging, within a specified time. Twenty-five persons with varying amounts of programming experience (measured in months) were selected for the study.
Logistic regression model: Example
> # Example p565 Kutner et al
> data=read.table("C:/Users/Mahinda/Desktop/CH14TA01.txt", header=F)
> experience <- data[,1]
> task <- data[,2]
> cbind(experience, task)
experience task
[1,] 14 0
[2,] 29 0
[3,] 6 0
[4,] 25 1
[5,] 18 1
[6,] 4 0
[7,] 18 0
[8,] 12 0
[9,] 22 1
[10,] 6 0
[11,] 30 1
[12,] 11 0
[13,] 30 1
[14,] 5 0
[15,] 20 1
[16,] 13 0
[17,] 9 0
[18,] 32 1
[19,] 24 0
[20,] 13 1
[21,] 19 0
[22,] 4 0
[23,] 28 1
[24,] 22 1
[25,] 8 1
Logistic regression model: Example
> model1 = glm(task ~ experience, family=binomial)
> summary(model1)
Call:
glm(formula = task ~ experience, family = binomial)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.8992 -0.7509 -0.4140 0.7992 1.9624
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.05970 1.25935 -2.430 0.0151 *
experience 0.16149 0.06498 2.485 0.0129 *
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 34.296 on 24 degrees of freedom
Residual deviance: 25.425 on 23 degrees of freedom
AIC: 29.425
Number of Fisher Scoring iterations: 4
> # For every one-month increase in experience, the estimated odds of
> # being able to perform the task are multiplied by exp(coef(model1)[2])
Logistic regression model: Example
The estimated probability that a person will be able to perform the task is
π̂(xi) = exp(α̂ + β̂xi)/(1 + exp(α̂ + β̂xi)) = exp(−3.05970 + 0.16149xi)/(1 + exp(−3.05970 + 0.16149xi))
For example, the estimated probability that a person with 24 months of experience will be able to perform the task is
π̂(24) = exp(−3.05970 + 0.16149 × 24)/(1 + exp(−3.05970 + 0.16149 × 24)) = 0.6934.
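A quick Python check of this calculation, using the rounded coefficients from the summary output:

```python
import math

alphahat, betahat = -3.05970, 0.16149   # estimates from the glm() summary

def pihat(x):
    # Estimated probability of completing the task at x months of experience
    eta = alphahat + betahat * x
    return math.exp(eta) / (1 + math.exp(eta))

print(round(pihat(24), 4))   # 0.6934
```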
Logistic regression model: Example
> pihat = model1$fitted.values # estimated probabilities
> cbind(experience, task , pihat)
experience task pihat
1 14 0 0.31026237
2 29 0 0.83526292
3 6 0 0.10999616
4 25 1 0.72660237
5 18 1 0.46183704
6 4 0 0.08213002
7 18 0 0.46183704
8 12 0 0.24566554
9 22 1 0.62081158
10 6 0 0.10999616
11 30 1 0.85629862
12 11 0 0.21698039
13 30 1 0.85629862
14 5 0 0.09515416
15 20 1 0.54240353
16 13 0 0.27680234
17 9 0 0.16709980
18 32 1 0.89166416
19 24 0 0.69337941
20 13 1 0.27680234
21 19 0 0.50213414
22 4 0 0.08213002
23 28 1 0.81182461
24 22 1 0.62081158
25 8 1 0.14581508
Logistic regression model: Example
> # Plot of estimated probability vs experience
> Estimated_prob <- function(experience) { exp(model1$coefficients[1] +
model1$coefficients[2]*experience) /
(1+exp(model1$coefficients[1]+model1$coefficients[2]*experience)) }
> curve(Estimated_prob, from=0, to=40, xlab="experience",
+ ylab="Estimated Probability")
> abline(h=(seq(0,1,by=0.02)), col="blue", lty="dotted")
> abline(v=(seq(0,40,1)), col="blue", lty="dotted")
Logistic regression model: Example 2
The R code below fits a logistic regression model for the data (in the contingency table) from the aspirin example above:
> x = c(rep(1, 189+10845), rep(0, 104+10933))
> y = c(rep(1, 189), rep(0, 10845), rep(1, 104), rep(0, 10933))
> length(x)
[1] 22071
> length(y)
[1] 22071
> model1 = glm(y ~ x, family=binomial)
> summary(model1)
Logistic regression model: Example 2
Call:
glm(formula = y ~ x, family = binomial)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.1859 -0.1859 -0.1376 -0.1376 3.0544
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.65515 0.09852 -47.250 < 2e-16 ***
x 0.60544 0.12284 4.929 8.28e-07 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 3114.7 on 22070 degrees of freedom
Residual deviance: 3089.3 on 22069 degrees of freedom
AIC: 3093.3
Probit regression model
Another model for a Bernoulli random component Y is the probit regression model. This model uses the inverse standard normal c.d.f., Φ⁻¹, as the link function. That is, the model is
π(x) = Φ(α + βx)   (9)
or
Φ⁻¹(π(x)) = α + βx.   (10)
The curve has a similar appearance to the logistic regression curve.
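A Python sketch (standard normal c.d.f. via math.erf; same α = 1, β = 0.5 as the earlier logistic plot) shows that both links produce valid probabilities:

```python
import math

alpha, beta = 1.0, 0.5   # same values as in the earlier logistic plot

def probit_prob(x):
    z = alpha + beta * x
    # Phi(z) via the error function: Phi(z) = (1 + erf(z / sqrt(2))) / 2
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def logit_prob(x):
    eta = alpha + beta * x
    return math.exp(eta) / (1 + math.exp(eta))

for x in (-6, -2, 0, 2, 6):
    p, l = probit_prob(x), logit_prob(x)
    assert 0 < p < 1 and 0 < l < 1   # both are valid probabilities
    print(round(p, 3), round(l, 3))
```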
Probit regression model
The curves for β = 0.5 and β = −0.5, both with α = 1, are shown below:
[Figures: the two probit curves]
Probit regression model
Which model to use?
This is not an easy question.
One way to decide is to try many models and see which onefits the data best.
The logit is easier to interpret, through the use of odds and odds ratios, and so is used more often.
Probit regression model: Example
> # Example p565 Kutner et al
> data=read.table("C:/Users/Mahinda/Desktop/CH14TA01.txt", header=F)
> experience <- data[,1]
> task <- data[,2]
> # (experience, task) data as listed earlier
Probit Regression model: Example
> model2 = glm(task ~ experience, family=binomial (link = probit))
> summary(model2)
Call:
glm(formula = task ~ experience, family = binomial(link = probit))
Deviance Residuals:
Min 1Q Median 3Q Max
-1.8959 -0.7579 -0.3907 0.8101 1.9691
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.83787 0.69012 -2.663 0.00774 **
experience 0.09686 0.03565 2.717 0.00659 **
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 34.296 on 24 degrees of freedom
Residual deviance: 25.380 on 23 degrees of freedom
Logistic regression model: Example
> pihatlogit = model1$fitted.values # estimated probabilities (logit)
> pihatprobit = model2$fitted.values # estimated probabilities (probit)
> cbind(experience, task , pihatlogit, pihatprobit)
experience task pihatlogit pihatprobit
1 14 0 0.31026237 0.31495382
2 29 0 0.83526292 0.83422869
3 6 0 0.10999616 0.10442754
4 25 1 0.72660237 0.72024848
5 18 1 0.46183704 0.46238565
6 4 0 0.08213002 0.07346854
7 18 0 0.46183704 0.46238565
8 12 0 0.24566554 0.24965602
9 22 1 0.62081158 0.61524129
10 6 0 0.10999616 0.10442754
11 30 1 0.85629862 0.85721025
12 11 0 0.21698039 0.21992975
13 30 1 0.85629862 0.85721025
14 5 0 0.09515416 0.08793556
15 20 1 0.54240353 0.53954616
16 13 0 0.27680234 0.28139084
17 9 0 0.16709980 0.16698550
18 32 1 0.89166416 0.89645092
19 24 0 0.69337941 0.68677231
20 13 1 0.27680234 0.28139084
21 19 0 0.50213414 0.50097045
22 4 0 0.08213002 0.07346854
23 28 1 0.81182461 0.80898266
24 22 1 0.62081158 0.61524129
25 8 1 0.14581508 0.14389004
Probit Regression model: Example
The two estimated regression curves (logistic and probit) are shown below.
> #Plotting estimated regression curves
> # Plot of estimated probability vs experience
>
> Estimated_prob <- function(experience) { exp(model1$coefficients[1] +
+ model1$coefficients[2]*experience) / (1+exp(model1$coefficients[1]+
model1$coefficients[2]*experience)) }
> # Or we can use
> #Estimated_prob <- function(experience) { plogis(model1$coefficients[1] +
> # model1$coefficients[2]*experience)}
> curve(Estimated_prob, from=0, to=40, col = "green", xlab="experience",
+ ylab="Estimated Probability")
> abline(h=(seq(0,1,by=0.02)), col="blue", lty="dotted")
> abline(v=(seq(0,40,1)), col="blue", lty="dotted")
>
> #---------------------------------------------------------------------------
> par(new = TRUE)
> #Plotting the estimated probability from the probit model
> Estimated_prob2 <- function(experience) { pnorm(model2$coefficients[1] +
+ model2$coefficients[2]*experience)}
> curve(Estimated_prob2, from=0, to=40, col = "red", axes = FALSE,
xlab = "", ylab = "")
legend(locator(1), legend = c("Logit", "Probit"), lty = c(1,1),
col = c( "green", "red"))
#locator(1) places the legend at the place you click on the graph
Probit regression model: Example
[Figure: fitted logistic (green) and probit (red) curves]
Generalized linear models for count data
Counts of possible outcomes are non-negative integers.
These are often modeled as Poisson random variables.
A Poisson loglinear GLM assumes a Poisson distribution for the response and the log function as the link. So the linear predictor is related to the mean by
log(µ(x)) = α + βx   (11)
or
µ(x) = exp(α + βx) = e^α (e^β)^x   (12)
Interpretation of β: a unit increase in x has a multiplicative impact of e^β; i.e., the mean of Y at x + 1 equals the mean at x times e^β.
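A quick Python check of this multiplicative interpretation, using the estimates α̂ = −3.30476, β̂ = 0.16405 from the crab-data fit shown below:

```python
import math

alpha, beta = -3.30476, 0.16405   # crab-data estimates from the fit below

def mu(x):
    # mu(x) = exp(alpha + beta * x), the loglinear mean function
    return math.exp(alpha + beta * x)

# A one-unit increase in width multiplies the mean count by e^beta
ratio = mu(26.0) / mu(25.0)
multiplier = math.exp(beta)
assert abs(ratio - multiplier) < 1e-12
print(round(multiplier, 4))   # 1.1783
```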
Horseshoe Crab Mating Example p 123
For each female i, assume the number of satellites, Yi, has a Poisson distribution with mean µi depending on female shell width (xi). We will model the expected number of satellites with the following model:
log(µi ) = α + βxi .
The R code below fits the model for the crab data:
> # Log-linear model example
> # Example p 123
> crab=read.table("C:/Users/Mihinda/Desktop/crab.txt", header=T) #the data file
> model3 <- glm(formula = satellite ~ width, data = crab, family = poisson(link = log))
> summary(model3)
Call:
glm(formula = satellite ~ width, family = poisson(link = log),
data = crab)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.8526 -1.9884 -0.4933 1.0970 4.9221
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.30476 0.54224 -6.095 1.1e-09 ***
width 0.16405 0.01997 8.216 < 2e-16 ***
Horseshoe Crab Mating Example p 123
> # Predicting the mean response at given value(s) of x
> #Predict for 25 and 30 widths
> predict.data<-data.frame(width = c(25, 30))
> # Predicted values for mu
> pred1 <- predict(model3, newdata = predict.data, type = "response", se = TRUE)
> pred1
$fit
1 2
2.217477 5.035916
$se.fit
1 2
0.1345945 0.3703386
> pred2 <- predict(model3, newdata = predict.data, se = TRUE)
> #This gives predicted values for log(mu)
> pred2
$fit
1 2
0.7963699 1.6165954
$se.fit
1 2
0.06069713 0.07353947
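A Python sketch reproducing these fitted means from the printed coefficients (small differences from the printed values are due to coefficient rounding):

```python
import math

alphahat, betahat = -3.30476, 0.16405   # from the Poisson fit above

def muhat(width):
    # Fitted mean satellite count: exp(alphahat + betahat * width)
    return math.exp(alphahat + betahat * width)

print(round(muhat(25), 2))   # 2.22, close to the printed 2.217477
print(round(muhat(30), 2))   # 5.04, close to the printed 5.035916
```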
Horseshoe Crab Mating Example p 123
> alpha<-0.05
> lower<-pred1$fit-qnorm(1-alpha/2)*pred1$se
> upper<-pred1$fit+qnorm(1-alpha/2)*pred1$se
> data.frame(predict.data, mu.hat = round(pred1$fit,3), lower = round(lower,3), upper = round(upper,3))
width mu.hat lower upper
1 25 2.217 1.954 2.481
2 30 5.036 4.310 5.762
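The same Wald interval, µ̂ ± z × SE, computed in Python from the predict() output above:

```python
z = 1.959964   # qnorm(0.975), the 0.975 standard normal quantile

# (mu.hat, se.fit) from the predict() call above, keyed by width
fits = {25: (2.217477, 0.1345945), 30: (5.035916, 0.3703386)}

for width, (mu_hat, se) in fits.items():
    lower, upper = mu_hat - z * se, mu_hat + z * se
    print(width, round(mu_hat, 3), round(lower, 3), round(upper, 3))
```

This reproduces the table: (1.954, 2.481) at width 25 and (4.310, 5.762) at width 30.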
Horseshoe Crab Mating Example p 123
> # Plot of estimated mean count vs width
> # (note: use model3, the Poisson fit, not model1)
> Estimated_count <- function(width) { exp(model3$coefficients[1] +
+ model3$coefficients[2]*width) }
> curve(Estimated_count, from=20, to=35, xlab="Width",
+ ylab="Estimated Mean Count")
> abline(h=(seq(0,15,by=1)), col="blue", lty="dotted")
> abline(v=(seq(20,35,1)), col="blue", lty="dotted")
Overdispersion for Poisson GLMs
Count data often vary more than we would expect if the response distribution truly were Poisson.
The phenomenon of the data having greater variability than expected under a GLM is called overdispersion.
Overdispersion for Poisson GLMs
[Figure]
Overdispersion for Poisson GLMs
This might happen because the true distribution is a mixture of different Poisson distributions.
One remedy is to find more explanatory variables and add them to the model.
The negative binomial is a related distribution for count data that permits the variance to exceed the mean.
The probability mass function of the negative binomial distribution is given by
f(y; k, π) = C(y + k − 1, y) (1 − π)^y π^k,   y = 0, 1, 2, . . .   (13)
where C(·, ·) is the binomial coefficient and k > 0 and 0 < π < 1 are parameters.
A negative binomial random variable can be interpreted as the number of failures before the kth success.
Mahinda Samarakoon STAC51: Categorical data Analysis 43 / 67
Introduction to Generalized linear Models
Overdispersion for Poisson GLMs
The mean and the variance of this distribution are given by
E(Y) = µ = k(1 − π)/π   (14)
and
Var(Y) = k(1 − π)/π².
Note that k/(µ + k) = π and
µ + µ²/k = k(1 − π)/π² = Var(Y).
Note that E(Y) < Var(Y).
k is the (positive) dispersion parameter.
The smaller the dispersion parameter, the larger the variance compared to the mean. In R this parameter is denoted by θ.
Note: Agresti uses γ = 1/k as the dispersion parameter.
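A Python sketch verifying these identities for hypothetical parameter values:

```python
import math

k, pi = 0.9, 0.3   # hypothetical parameter values, for illustration only

mu = k * (1 - pi) / pi        # E(Y), equation (14)
var = k * (1 - pi) / pi**2    # Var(Y)

assert math.isclose(k / (mu + k), pi)      # k/(mu + k) = pi
assert math.isclose(mu + mu**2 / k, var)   # Var(Y) = mu + mu^2/k
assert var > mu                            # overdispersion: Var(Y) > E(Y)
print(round(mu, 3), round(var, 3))         # 2.1 7.0
```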
Overdispersion for Poisson GLMs
This probability mass function can also be written as
f(y; k, µ) = C(y + k − 1, y) [1 − k/(µ + k)]^y [k/(µ + k)]^k,   y = 0, 1, 2, . . .   (15)
Negative Binomial GLMs: Horseshoe Crab Mating Example
The glm function in R cannot fit negative binomial regression models. We can instead use the glm.nb function in the MASS package. The R code below uses glm.nb to fit a negative binomial GLM to the crab data.
> # R code: negative binomial regression
> library(MASS)
> crab=read.table("C:/Users/Mihinda/Desktop/crab.txt", header=T) #the data file
> model3.nb<-glm.nb(formula = satellite ~ width, data = crab,
link = log)
> summary(model3.nb)
Call:
glm.nb(formula = satellite ~ width, data = crab, link = log,
init.theta = 0.90456808)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.05251 1.17143 -3.459 0.000541 ***
width 0.19207 0.04406 4.360 1.3e-05 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
(Dispersion parameter for Negative Binomial(0.9046) family taken to be 1)
Negative Binomial GLMs: Horseshoe Crab Mating Example
Null deviance: 213.05 on 172 degrees of freedom
Residual deviance: 195.81 on 171 degrees of freedom
AIC: 757.29
Number of Fisher Scoring iterations: 1
Theta: 0.905
Std. Err.: 0.161
Statistical Inference and Model Checking for GLMs: Wald Test
One test we are usually interested in is H0 : β = 0 against Ha : β ≠ 0. For large n, MLEs are approximately Normal. In particular, β̂ is approximately N(β, AsVar(β̂)), and so
Z = (β̂ − β)/SE is approximately N(0, 1),
and this result can be used to calculate an approximate p-value (the Wald test).
Example: Consider the crab data. Test whether the number of satellites is independent of width.
Solution: z = 8.216, the p-value is 2 × 10⁻¹⁶ < 0.05, so we reject the null hypothesis. An approximate 95 percent confidence interval for β is 0.16405 ± 1.96 × 0.01997.
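The Wald computation in Python, from the printed estimate and standard error (the tiny difference from the printed z = 8.216 is due to rounding of the printed values):

```python
betahat, se = 0.16405, 0.01997   # crab-data slope and its standard error

z = betahat / se                 # Wald statistic
lower = betahat - 1.96 * se      # approximate 95% CI
upper = betahat + 1.96 * se

print(round(z, 2))                       # about 8.21
print(round(lower, 3), round(upper, 3))  # 0.125 0.203
assert z > 1.96   # reject H0: beta = 0 at the 5% level
```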
Statistical Inference and Model Checking for GLMs: Likelihood Ratio Test
We have discussed the LRT before. It can also be used here. Recall
Λ = (maximum likelihood under the null hypothesis) / (unrestricted maximum likelihood).
For testing H0 : β = 0, the numerator is calculated assuming β = 0; thus the model fit to the data is only g(µ) = α, where g is the link function. The denominator is calculated without assuming β = 0, so the model fit is g(µ) = α + βx. We know that for large n, G² = −2 log(Λ) has an approximate chi-squared distribution. The degrees of freedom is the number of parameters in the unrestricted model minus the number of parameters in the model under the null hypothesis.
Statistical Inference and Model Checking for GLMs: Likelihood Ratio Test
For example, for testing the null hypothesis H0 : β = 0, the degrees of freedom is 1.
The value of G² is not always given in software output.
Software often gives the “null deviance” and the “residual deviance”.
These are G² values for testing certain hypotheses, but we can often use them to compute G² for other tests.
For example, the value of G² for testing H0 : β = 0 against Ha : β ≠ 0 is simply (null deviance) − (residual deviance).
Statistical Inference and Model Checking for GLMs: Likelihood Ratio Test
The null deviance (G1²) tests H0: the model with only α, against H1: the saturated model.
Example (Poisson GLM): the saturated model assumes Yi ∼ Poisson(µi), i = 1, . . . , n, and the MLE of µi is yi.
G1² = 2 Σi yi log(yi/µ0,i)   (16)
where µ0,i = e^(α0) and α0 is the MLE of α in the model log µi = α, i = 1, . . . , n.
The residual deviance (G2²) tests H0: the model with α and β, against H1: the saturated model.
Statistical Inference and Model Checking for GLMs: Likelihood Ratio Test Example
> # Log- linear model example
> # Example p 123
> crab=read.table("C:/Users/Mihinda/Desktop/crab.txt", header=T) #the data file
> model3 <- glm(formula = satellite ~ width, data = crab, family = poisson(link = log))
> summary(model3)
Call:
glm(formula = satellite ~ width, family = poisson(link = log),
data = crab)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.8526 -1.9884 -0.4933 1.0970 4.9221
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.30476 0.54224 -6.095 1.1e-09 ***
width 0.16405 0.01997 8.216 < 2e-16 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 632.79 on 172 degrees of freedom
Residual deviance: 567.88 on 171 degrees of freedom
AIC: 927.18
Statistical Inference and Model Checking for GLMs: Likelihood Ratio Test Example
Crab data
G² = 632.79 − 567.88 = 64.91
Degrees of freedom = 172 − 171 = 1
Chi-square critical value at α = 0.05: 3.84
We reject the null hypothesis H0 : β = 0.
The data show evidence that width has a significant effect on the number of satellites.
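The same likelihood ratio computation in Python:

```python
null_deviance, residual_deviance = 632.79, 567.88   # from the Poisson fit above

G2 = null_deviance - residual_deviance   # LRT statistic, 1 df
chisq_crit = 3.84                        # chi-square 0.95 quantile, 1 df

print(round(G2, 2))      # 64.91
assert G2 > chisq_crit   # reject H0: beta = 0
```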
Residuals for GLMs (p. 140)
The Pearson residual for observation i is defined by
ei = (yi − µi)/√µi   (18)
and the standardized residuals are defined by
ri = (yi − µi)/√(µi(1 − hi))   (19)
where hi is the ith diagonal element of the hat matrix
H = W^(1/2) X (X^T W X)^(−1) X^T W^(1/2).   (20)
hi ’s are known as leverages.
The standardized residual has a distribution that is closer to astandard normal distribution than the Pearson residual.
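For a Poisson GLM with only an intercept, X is a single column of ones, so X^T W X = Σ µ̂i and the leverages reduce to hi = 1/n; this makes the residual formulas easy to check by hand (a Python sketch with toy counts):

```python
import math

y = [2, 0, 5, 3, 1, 4]      # toy Poisson counts, for illustration only
n = len(y)
mu_hat = sum(y) / n         # intercept-only MLE: mu_hat = ybar for every i

# Pearson residuals e_i = (y_i - mu_hat) / sqrt(mu_hat), equation (18)
e = [(yi - mu_hat) / math.sqrt(mu_hat) for yi in y]

# With X a column of ones, H = W^(1/2) X (X'WX)^(-1) X' W^(1/2) has h_i = 1/n
h = [1 / n] * n

# Standardized residuals r_i = e_i / sqrt(1 - h_i), equation (19)
r = [ei / math.sqrt(1 - hi) for ei, hi in zip(e, h)]

print([round(ri, 3) for ri in r])
```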
Residuals for GLMs: Example
The R code below calculates the standardized residuals and createsresidual plots
> # R code: residuals for Poisson regression
> crab=read.table("C:/Users/Mihinda/Desktop/crab.txt", header=T) #the data file
> poissonReg <- glm(formula = satellite ~ width, data = crab,
family = poisson(link = log))
> e <-residuals(poissonReg, type="pearson")
> X<-model.matrix(poissonReg)
> muhat<-predict(poissonReg, type = "response")
> W <- diag(muhat)
> H<-(W^(1/2))%*%X%*%solve(t(X)%*%W%*%X)%*% t(X)%*%(W^(1/2))
> h <- diag(H)
> head(h)
[1] 0.009852370 0.006360719 0.006945761 0.019161622 0.014825698 0.008169498
> r <- e/sqrt(1-h)
> head(e)
1 2 3 4 5 6
2.1463312 0.8582102 -1.5642375 -1.0726099 -1.5836582 0.5254940
> head(r)
1 2 3 4 5 6
2.1569832 0.8609527 -1.5696984 -1.0830364 -1.5955298 0.5276537
Residuals for GLMs: Example
Alternatively, the leverages can be obtained directly with lm.influence:
>
>
> h<-lm.influence(poissonReg)$h
> head(h)
1 2 3 4 5 6
0.009852370 0.006360719 0.006945761 0.019161622 0.014825698 0.008169498
> r <- e/sqrt(1-h)
> head(r)
1 2 3 4 5 6
2.1569832 0.8609527 -1.5696984 -1.0830364 -1.5955298 0.5276537
Residuals for GLMs: Example
The R code below creates the residual plots:
> #----------------------------------------------------------------
> #Standardized residual vs observation number
> plot(x = 1:length(r), y = r, xlab="Observation number",
+ ylab="Standardized residuals", main = "Standardized
+ residuals vs. observation number")
> abline(h = c(-3, 3), lty=3, col="red")
> #-----------------------------------------------------------------
> par(mfrow = c(1,1))
> # Plot of Residual vs width
> plot(x = crab$width, y = r, xlab="Width",
+ ylab="Standardized Pearson residuals", main =
+ "Standardized Pearson residuals vs. width")
> abline(h = c(-3, 3), lty=3, col="red")
> #-------------------------------------------------------------
> plot(x = crab$width, y = r, xlab="Width",
+ ylab="Standardized Pearson residuals", main =
+ "Standardized Pearson residuals vs. width", type = "n")
> text(x = crab$width, y = r,
+ labels = crab$satellite, cex=0.75)
> abline(h = c(-3,3), lty=3, col="red")
[Figures: Poisson-model standardized residuals plotted against observation number and against width (points labelled by satellite count)]
Residuals for GLMs: Example
The R code below fits a negative binomial regression model to the crab data and creates the corresponding residual plots.
> #R code Negative binomial regression
> library(MASS)
> crab=read.table("C:/Users/Mihinda/Desktop/crab.txt", header=T) #the data file
> model4.nb<-glm.nb(formula = satellite ~ width, data = crab, link = log)
> e.nb<-residuals(model4.nb, type="pearson")
> h.nb<-lm.influence(model4.nb)$h
> r.nb<-e.nb/sqrt(1-h.nb)
> par(mfrow = c(1,2))
> plot(x = 1:length(r.nb), y = r.nb, xlab="Obs. number",
+ ylab="Standardized residuals",
+ main = "Stand. residuals (Neg Bin model) vs. obs. number")
> abline(h = c(-3, 3), lty=3, col="red")
>
> plot(x = crab$width, y = r.nb,
+ xlab="Width", ylab="Standardized residuals",
+ main = "Stand. residuals (Neg Bin model) vs. width", type = "n")
> text(x = crab$width, y = r.nb, labels =
+ crab$satellite, cex=0.75)
> abline(h = c(-3, 3), lty=3, col="red")
>
[Figures: negative-binomial-model standardized residuals plotted against observation number and against width]
Goodness of Fit: Pearson Chi-square
For Poisson regression, the Pearson statistic is
χ² = ∑_{i=1}^{n} (y_i − µ̂_i)² / µ̂_i.
The statistic has an approximate χ² distribution with n − (number of model parameters) = n − 2 degrees of freedom for large n.
In order for the χ² approximation to work well, the fitted means µ̂_i should not be small.
Rule of thumb: µ̂_i ≥ 5.
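The test takes only a few lines of base R. A self-contained sketch on simulated data (the seed, sample size, and coefficients are illustrative, not the crab data):

```r
# Simulated Poisson data (illustrative, not the crab data)
set.seed(51)
x   <- runif(100, 20, 34)
y   <- rpois(100, exp(-3 + 0.15 * x))
fit <- glm(y ~ x, family = poisson(link = log))

X2   <- sum(residuals(fit, type = "pearson")^2)  # Pearson chi-square statistic
df   <- fit$df.residual                          # n - 2 = 98 (intercept and slope)
pval <- 1 - pchisq(X2, df)                       # upper-tail chi-square p-value
```

A small p-value indicates lack of fit relative to the saturated model.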
Goodness of Fit: LRT
For Poisson regression, the LRT statistic comparing the fitted model with the saturated model is
G² = −2 log(Λ) = 2 ∑_{i=1}^{n} y_i log(y_i / µ̂_i),
where µ̂_i = e^{α̂ + β̂ x_i}.
The statistic has an approximate χ² distribution with n − 2 degrees of freedom for large n. In R this is called the residual deviance.
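For a Poisson log-linear model with an intercept, ∑(y_i − µ̂_i) = 0 at the MLE, so this G² coincides with the residual deviance reported by glm(). A self-contained sketch verifying this on simulated data (seed and coefficients are arbitrary assumptions, not the crab data):

```r
# Simulated Poisson data (illustrative, not the crab data)
set.seed(51)
x   <- runif(100, 20, 34)
y   <- rpois(100, exp(-3 + 0.15 * x))
fit <- glm(y ~ x, family = poisson(link = log))

muhat <- fitted(fit)
# A term y_i * log(y_i / muhat_i) is taken as 0 when y_i = 0
G2 <- 2 * sum(ifelse(y == 0, 0, y * log(y / muhat)))

all.equal(G2, deviance(fit))  # equals the residual deviance from glm()
```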
Goodness of fit: Example
The R code and output for the crab data are given below. Use the Pearson chi-square test and the LRT to test the goodness of fit of the model.
> #R code: Poisson regression for the crab data
> crab=read.table("C:/Users/Mihinda/Desktop/crab.txt",
header=T) #the data file
> poissonReg <- glm(formula = satellite ~ width, data = crab,
family = poisson(link = log))
> summary(poissonReg)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.30476 0.54224 -6.095 1.1e-09 ***
width 0.16405 0.01997 8.216 < 2e-16 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 632.79 on 172 degrees of freedom
Residual deviance: 567.88 on 171 degrees of freedom
> pear.res<-resid(poissonReg, type="pearson")
> pearsonChisq <- sum(pear.res^2)
> pearsonChisq
[1] 544.157
> p_pearson <- 1-pchisq(pearsonChisq, df = poissonReg$df.residual)
> p_pearson
[1] 0
Goodness of Fit: Example
Both tests indicate lack of fit of the model.
Some of the fitted means µ̂_i (not printed) are less than 5, so the χ² approximation is not very reliable.