Stat 504, Lecture 16 (personal.psu.edu/abs12//stat504/Lecture/lec16.pdf)

Introduction to log-linear models

Key Concepts:

• Benefits of models

• Two-way Log-linear models

• Parameter Constraints, Estimation, and Interpretation

• Inference for log-linear models

Objectives:

• Understand the structure of log-linear models in two-way tables

• Understand the concepts of independence and association described via log-linear models in two-way tables



Useful Links:

• The CATMOD procedure in SAS: http://support.sas.com/onlinedoc/913/getDoc/en/statug.hlp/catmod_index.htm

• The GENMOD procedure in SAS: http://support.sas.com/onlinedoc/913/getDoc/en/statug.hlp/genmod_index.htm

• The SAS source on log-linear model analysis: http://support.sas.com/onlinedoc/913/getDoc/en/statug.hlp/catmod_sect30.htm#stat_catmod_catmodllma

• Fitting log-linear models in R: http://stat.ethz.ch/R-manual/R-patched/library/stats/html/loglin.html

• Fitting log-linear models in R via generalized linear models (glm()): http://spider.stat.umn.edu/R/library/stats/html/glm.html

Readings:

• Agresti (2002) Ch. 8, 9

• Agresti (1996) Ch. 6, 7



Benefits of models over significance tests

Thus far our focus has been on describing interactions or associations between two or three categorical variables, mostly via single summary statistics and significance tests.

Models can handle more complicated situations and analyze the simultaneous effects of multiple variables, including mixtures of categorical and continuous variables.

For example, the Breslow-Day statistic works only for 2 × 2 × K tables, while log-linear models allow us to test for homogeneous association in I × J × K and higher-dimensional tables.

The structural form of the model describes the patterns of interactions and associations. The model parameters provide measures of the strength of associations.



In models, the focus is on estimating the model parameters. The basic inference tools (e.g., point estimation, hypothesis testing, and confidence intervals) are applied to these parameters.

When discussing models, we will keep in mind

• Objective

• Model structure (e.g., variables, formula, equation)

• Model assumptions

• Parameter estimates and interpretation

• Model fit (e.g., goodness-of-fit tests and statistics)

• Model selection



For example, recall the simple linear regression model:

• Objective: model the expected value of a continuous variable, Y, as a linear function of a continuous predictor, X: $E(Y_i) = \beta_0 + \beta_1 x_i$

• Model structure: $Y_i = \beta_0 + \beta_1 x_i + e_i$

• Model assumptions: the errors are independent and normally distributed with constant variance, $e_i \sim N(0, \sigma^2)$, so Y is normally distributed; X is treated as fixed.

• Parameter estimates and interpretation: $\hat{\beta}_0$ is the estimate of the intercept $\beta_0$, and $\hat{\beta}_1$ is the estimate of the slope $\beta_1$, etc. What is the interpretation of the slope?

• Model fit: $R^2$, residual analysis, F-statistic

• Model selection

See the handout Lec16LinRegExample.doc on modeling average water usage given the amount of bread production:

Water = 2273 + 0.0799 Production



Two-way ANOVA

Do the amounts of sunlight and watering affect the growth of geraniums?

Objective: model a continuous response as a function of two factors.

Model structure: $Y_{ijk} = \mu + \alpha_i + \beta_j + \gamma_{ij} + e_{ijk}$ with $e_{ijk} \sim N(0, \sigma^2)$, $i = 1, \ldots, I$, $j = 1, \ldots, J$, and $k = 1, \ldots, n_{ij}$

Model assumptions: at each combination of levels the outcome is normally distributed with the same variance: $y_{ijk} \sim N(\mu_{ij}, \sigma^2)$, where $\mu_{ij} = E(y_{ijk}) = \mu + \alpha_i + \beta_j + \gamma_{ij}$



This model is over-parametrized because the term $\gamma_{ij}$ already has I × J parameters corresponding to the cell means $\mu_{ij}$. The constant, $\mu$, and the main effects, $\alpha_i$ and $\beta_j$, give us an additional 1 + I + J parameters.

We use constraints such as $\sum_i \alpha_i = \sum_j \beta_j = \sum_i \sum_j \gamma_{ij} = 0$ to deal with this overparametrization.

Does level of watering affect the growth of potted

geraniums? (Is there a significant main effect for

factor A?, e.g. H0 : αi = 0 for all i)

Does level of sunlight affect the growth of potted

geraniums? (Is there a significant main effect for

factor B?)

Does the effect of level of sunlight depend on level of

watering? (Is there a significant interaction between

factors A and B?)



Analysis of Variance for YIELD

Source        DF      SS      MS      F      P
WATER          1   342.3   342.3  24.02  0.000
SUNLIGHT       1    20.3    20.3   1.42  0.256
Interaction    1   132.3   132.3   9.28  0.010
Error         12   171.0    14.3
Total         15   665.8

             Individual 95% CI
WATER  Mean  ------+---------+---------+---------+-----
HIGH   22.0                          (------*------)
LOW    12.8   (------*------)
             ------+---------+---------+---------+-----
                 12.0      16.0      20.0      24.0

                Individual 95% CI
SUNLIGHT  Mean  ----+---------+---------+---------+-------
HIGH      18.5              (--------------*-------------)
LOW       16.3   (-------------*--------------)
                ----+---------+---------+---------+-------
                  14.0      16.0      18.0      20.0



Two-way Log-Linear Model

Now let $\mu_{ij}$ be the expected counts, $E(n_{ij})$, in an I × J table. An analogous model to two-way ANOVA is

$\log(\mu_{ij}) = \mu + \alpha_i + \beta_j + \gamma_{ij}$

or, in the notation used by Agresti,

$\log(\mu_{ij}) = \lambda + \lambda_i^A + \lambda_j^B + \lambda_{ij}^{AB}$

with constraints $\sum_i \lambda_i^A = \sum_j \lambda_j^B = \sum_i \sum_j \lambda_{ij}^{AB} = 0$ to deal with overparametrization.

Log-linear models specify how the cell counts depend on the levels of the categorical variables. They model the association and interaction patterns among categorical variables.

Log-linear modeling is natural for Poisson, multinomial, and product-multinomial sampling. These models are appropriate when there is no clear distinction between response and explanatory variables, or when there are more than two responses.



Example: General Social Survey

Cross-classification of respondents according to their choice for president in the 1992 presidential election (Bush, Clinton, Perot) and their political view on a 7-point scale (extremely liberal, liberal, slightly liberal, moderate, slightly conservative, conservative, extremely conservative)

http://sda.berkeley.edu:7502/D3/GSS96/Doc/gss90017.htmpres92

Let’s consider a 3 × 3 table:

               Bush  Clinton  Perot  Total
Liberal          70      324     56    450
Moderate        195      332    101    628
Conservative    382      199    117    698
Total           647      855    274   1776

Are political view and choice independent?

You already know how to answer this via the chi-square test of independence, but now we want to model the cell counts with the log-linear model of independence and ask if this model fits well.



Two-way Log-linear models

Given two categorical random variables, A and B,

there are two main models we will consider:

• Independence model, (A, B)

• Saturated model, (AB)

Objective: Model the cell counts: µij = nπij

Main assumption: The N = IJ counts in the cells are

assumed to be independent observations of a Poisson

random variable.



Log-linear model of independence for 2-way tables

Recall independence in terms of cell probabilities as a product of marginal probabilities:

$\pi_{ij} = \pi_{i+} \pi_{+j}$, $i = 1, \ldots, I$, $j = 1, \ldots, J$

or, in terms of cell frequencies:

$\mu_{ij} = n \pi_{ij} = n \pi_{i+} \pi_{+j}$, $i = 1, \ldots, I$, $j = 1, \ldots, J$

By taking logarithms of the expected counts we obtain the loglinear model of independence:

$\log \mu_{ij} = \log n + \log \pi_{i+} + \log \pi_{+j}$

$\log \mu_{ij} = \lambda + \lambda_i^A + \lambda_j^B$

where A and B stand for the two categorical variables.




$\log \mu_{ij} = \lambda + \lambda_i^A + \lambda_j^B$

This is an ANOVA-type representation where

$\lambda$ represents an “overall” effect, or a grand mean of the logarithms of the expected counts, and it ensures that $\sum_i \sum_j \mu_{ij} = n$;

$\lambda_i^A$ represents a “main” effect of variable A, or a deviation from the grand mean, and it ensures that $\sum_j \mu_{ij} = n_{i+}$. It represents the effect of classification in row i;

$\lambda_j^B$ represents a “main” effect of variable B, or a deviation from the grand mean, and it ensures that $\sum_i \mu_{ij} = n_{+j}$. This is the effect of classification in ???

and $\lambda_I^A = \lambda_J^B = 0$



The ML fitted values are the same as the expected values under the test of independence:

$\hat{\mu}_{ij} = n_{i+} n_{+j} / n$

Thus, the $X^2$ and $G^2$ statistics for the test of independence are goodness-of-fit statistics for the loglinear model of independence, testing that the independence model holds vs. that it does not.

The model also implies that ALL odds ratios are equal to 1.

For our example, see vote.sas and compare the results of the PROC FREQ and PROC GENMOD procedures.

Statistics for Table of pview by choice

Statistic                     DF       Value      Prob
------------------------------------------------------
Chi-Square                     4    238.5354    <.0001
Likelihood Ratio Chi-Square    4    247.6951    <.0001
...

Criteria For Assessing Goodness Of Fit

Criterion    DF       Value    Value/DF
Deviance      4    247.6951     61.9238
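The fit statistics above can be reproduced by hand from the 3 × 3 vote table. The following is a minimal Python sketch, not part of vote.sas; the function names are illustrative assumptions. It computes the fitted values $n_{i+} n_{+j} / n$ and then the Pearson and likelihood-ratio statistics.

```python
import math

# Observed counts from the political view by vote table in these notes
# (rows: Liberal, Moderate, Conservative; columns: Bush, Clinton, Perot).
observed = [
    [70, 324, 56],
    [195, 332, 101],
    [382, 199, 117],
]


def independence_fit(table):
    """Fitted values n_i+ * n_+j / n under the independence model."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def pearson_x2(table, fitted):
    """Pearson statistic: sum of (observed - fitted)^2 / fitted."""
    return sum((o - e) ** 2 / e
               for obs_row, fit_row in zip(table, fitted)
               for o, e in zip(obs_row, fit_row))


def deviance_g2(table, fitted):
    """Likelihood-ratio statistic: 2 * sum of observed * log(observed / fitted)."""
    return 2 * sum(o * math.log(o / e)
                   for obs_row, fit_row in zip(table, fitted)
                   for o, e in zip(obs_row, fit_row) if o > 0)


fitted = independence_fit(observed)
x2 = pearson_x2(observed, fitted)
g2 = deviance_g2(observed, fitted)
df = (len(observed) - 1) * (len(observed[0]) - 1)
print(f"X2 = {x2:.4f}, G2 = {g2:.4f}, df = {df}")
```

The printed values should agree with the Chi-Square and Deviance lines of the SAS output up to rounding.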



Parameter Constraints & Uniqueness

There are I − 1 unknown parameters in the set $\{\lambda_i^A\}$ and J − 1 in the set $\{\lambda_j^B\}$.

There can be many different parametrizations. We need to set constraints to account for redundant parameters.

One way is to fix one value in the set to be equal to a constant, typically 0. This corresponds to using dummy coding for the categorical variables (e.g. A = 1, 0). In SAS/GENMOD, the last level is set to 0.

$\mu_{11} = ?$

Another way is to fix the sum of the terms equal to a constant, typically 0. That’s the ANOVA-type constraint. This corresponds to using “effect” coding for categorical variables (e.g. A = 1, 0, −1). SAS/CATMOD uses the zero-sum constraint.

$\mu_{11} = ?$
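To see that the two constraint systems describe the same fitted model, here is a small Python sketch using the margins of the vote table. It builds the independence-model parameters under both codings directly from $\log \hat{\mu}_{ij} = \log n_{i+} + \log n_{+j} - \log n$. The function names are hypothetical, and this is not how GENMOD or CATMOD compute their estimates internally.

```python
import math

# Fitted counts under independence depend only on the table margins.
row_totals = [450, 628, 698]   # Liberal, Moderate, Conservative
col_totals = [647, 855, 274]   # Bush, Clinton, Perot
n = sum(row_totals)

log_row = [math.log(r) for r in row_totals]
log_col = [math.log(c) for c in col_totals]


def reference_coding():
    """Last level of each factor fixed at 0 (dummy coding)."""
    lam_a = [x - log_row[-1] for x in log_row]
    lam_b = [x - log_col[-1] for x in log_col]
    lam = log_row[-1] + log_col[-1] - math.log(n)
    return lam, lam_a, lam_b


def zero_sum_coding():
    """Effects sum to zero within each factor (effect coding)."""
    mean_row = sum(log_row) / len(log_row)
    mean_col = sum(log_col) / len(log_col)
    lam_a = [x - mean_row for x in log_row]
    lam_b = [x - mean_col for x in log_col]
    lam = mean_row + mean_col - math.log(n)
    return lam, lam_a, lam_b


def mu(params, i, j):
    """Fitted count exp(lambda + lambda_i^A + lambda_j^B)."""
    lam, lam_a, lam_b = params
    return math.exp(lam + lam_a[i] + lam_b[j])


ref = reference_coding()
zs = zero_sum_coding()
print("mu_11, reference coding:", round(mu(ref, 0, 0), 3))
print("mu_11, zero-sum coding: ", round(mu(zs, 0, 0), 3))
print("lambda_1^B - lambda_2^B:", ref[2][0] - ref[2][1], zs[2][0] - zs[2][1])
```

The individual parameter values differ between the two codings, but the fitted counts and the differences between parameters within a factor do not.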



Link to Odds ratio

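We can have different parameter estimates depending on the type of constraints we set. So, what is unique about the parameters?

The differences are unique:

$\lambda_i^A - \lambda_{i'}^A$

$\lambda_j^B - \lambda_{j'}^B$

The odds are also unique!

$\log(\text{odds}) = \log\left(\frac{\mu_{i1}}{\mu_{i2}}\right) = \log(\mu_{i1}) - \log(\mu_{i2})$

$= (\lambda + \lambda_i^A + \lambda_1^B) - (\lambda + \lambda_i^A + \lambda_2^B)$

$= \lambda_1^B - \lambda_2^B$

How about the odds ratio?

$\log(\text{odds ratio}) = \log\left(\frac{\mu_{11}\mu_{22}}{\mu_{12}\mu_{21}}\right)$

$= \log(\mu_{11}) + \log(\mu_{22}) - \log(\mu_{12}) - \log(\mu_{21})$ ...

=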


The odds ratio measures the strength of the association and depends only on the interaction terms $\{\lambda_{ij}^{AB}\}$.

How many numbers do we need to completely

characterize associations in I × J tables?
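One way to explore this question numerically is with local odds ratios, formed from adjacent rows and adjacent columns. The sketch below is an illustration on the observed vote table, with hypothetical function names. It computes the (I − 1)(J − 1) local odds ratios and checks that an odds ratio for non-adjacent cells is a product of local ones.

```python
import math

# Observed vote table (rows: Liberal, Moderate, Conservative;
# columns: Bush, Clinton, Perot).
observed = [
    [70, 324, 56],
    [195, 332, 101],
    [382, 199, 117],
]


def odds_ratio(table, i, i2, j, j2):
    """Odds ratio for the 2 x 2 subtable formed by rows i, i2 and columns j, j2."""
    return (table[i][j] * table[i2][j2]) / (table[i][j2] * table[i2][j])


def local_odds_ratios(table):
    """Odds ratios for adjacent rows and columns: an (I-1) x (J-1) array."""
    return [[odds_ratio(table, i, i + 1, j, j + 1)
             for j in range(len(table[0]) - 1)]
            for i in range(len(table) - 1)]


local = local_odds_ratios(observed)
for row in local:
    print([round(value, 4) for value in row])

# The odds ratio comparing rows 1 and 3 on columns 1 and 3 is the product
# of the four local odds ratios lying between those cells.
corner = odds_ratio(observed, 0, 2, 0, 2)
product = math.prod(value for row in local for value in row)
print(round(corner, 4), round(product, 4))
```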



Saturated Loglinear Model for two-way tables:

$\log \mu_{ij} = \lambda + \lambda_i^A + \lambda_j^B + \lambda_{ij}^{AB}$

$\lambda_{ij}^{AB}$ represents an interaction/association between the two variables, reflects the departure from independence, and ensures that $\mu_{ij} = n_{ij}$.

What constraint must hold?

The saturated model

1. has fitted values exactly equal to ...

2. has df = 0

3. is the most complex model

4. has the independence model as a special case

5. has a direct functional relationship with the odds ratios (and the unique number of those)

See vote.sas example.

We typically want a simpler model that smooths the data more and is more parsimonious.
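As an illustration of how the interaction terms relate to odds ratios, the sketch below computes saturated-model parameters for the vote table under the last-level-zero coding described earlier. It assumes that coding and is not taken from vote.sas; the function name is hypothetical. Under this coding each $\lambda_{ij}^{AB}$ is the log odds ratio of cell (i, j) against the last row and last column.

```python
import math

# Observed vote table (rows: Liberal, Moderate, Conservative;
# columns: Bush, Clinton, Perot).
observed = [
    [70, 324, 56],
    [195, 332, 101],
    [382, 199, 117],
]


def saturated_reference_params(table):
    """Parameters with the last row and column as reference levels (set to 0)."""
    n_rows, n_cols = len(table), len(table[0])
    log_n = [[math.log(x) for x in row] for row in table]
    lam = log_n[n_rows - 1][n_cols - 1]
    lam_a = [log_n[i][n_cols - 1] - lam for i in range(n_rows)]
    lam_b = [log_n[n_rows - 1][j] - lam for j in range(n_cols)]
    lam_ab = [[log_n[i][j] - lam - lam_a[i] - lam_b[j] for j in range(n_cols)]
              for i in range(n_rows)]
    return lam, lam_a, lam_b, lam_ab


lam, lam_a, lam_b, lam_ab = saturated_reference_params(observed)

# The saturated model reproduces every observed count.
rebuilt = [[math.exp(lam + lam_a[i] + lam_b[j] + lam_ab[i][j])
            for j in range(3)] for i in range(3)]

# Interaction term for (Liberal, Bush) versus (Conservative, Perot).
print(round(lam_ab[0][0], 4), round(math.log(70 * 117 / (56 * 382)), 4))
```

If every interaction term were zero, the same parameters would describe the independence model.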



Hierarchical Models

These models include all lower order terms that

comprise higher-order terms in the model.

(A,B) is a simpler model than (AB)

Interpretation does not depend on how the variables

are coded.

Is this a hierarchical model?

$\log \mu_{ij} = \lambda + \lambda_i^A + \lambda_{ij}^{AB}$



Loglinear Models for three-way tables

Extending the log-linear model notation to 3-way tables:

$\log \mu_{ijk} = \lambda + \lambda_i^A + \lambda_j^B + \lambda_k^C + \lambda_{ij}^{AB} + \lambda_{ik}^{AC} + \lambda_{jk}^{BC} + \lambda_{ijk}^{ABC}$

The main questions for the next lecture are:

What do the λ terms mean in this model? What hypotheses about them correspond to the models of independence we already know?

What are some efficient ways to specify and interpret

these models and tables?

What are some efficient ways to fit and select among

many possible models in three and higher dimensions?



Example. Let’s go back to our familiar dataset on graduate admissions at Berkeley:

          Men       Men       Women     Women
Dept.   rejected  accepted  rejected  accepted
A          313       512        19        89
B          207       353         8        17
C          205       120       391       202
D          278       139       244       131
E          138        53       299        94
F          351        22       317        24

Let D = department, S = sex, and A = admission status (rejected or accepted). We analyzed this as a three-way table in Assignment 5; more specifically, we looked at partial and marginal tables. Now we’ll look at it from a loglinear point of view. Let $y_i$ be the frequency or count in a particular cell of the three-way table.



Saturated loglinear model:

Using PROC GENMOD, let’s fit the saturated

loglinear model.

options nocenter nodate nonumber linesize=72;

/* One record per cell of the D x S x A table; y is the cell count. */
data berkeley;
input D $ S $ A $ y;
cards;
DeptA Male Reject 313
DeptA Male Accept 512
DeptA Female Reject 19
DeptA Female Accept 89
DeptB Male Reject 207
DeptB Male Accept 353
DeptB Female Reject 8
DeptB Female Accept 17
DeptC Male Reject 205
DeptC Male Accept 120
DeptC Female Reject 391
DeptC Female Accept 202
DeptD Male Reject 278
DeptD Male Accept 139
DeptD Female Reject 244
DeptD Female Accept 131
DeptE Male Reject 138
DeptE Male Accept 53
DeptE Female Reject 299
DeptE Female Accept 94
DeptF Male Reject 351
DeptF Male Accept 22
DeptF Female Reject 317
DeptF Female Accept 24
;

/* Saturated loglinear model: Poisson counts with a log link.        */
/* order=data keeps the class levels in the order they appear above, */
/* so DeptF, Female and Accept (the last levels) are the references. */
proc genmod data=berkeley order=data;
class D S A;
model y = D S A D*S D*A S*A D*S*A / dist=poisson link=log;
run;



When you use the order=data option, GENMOD orders the levels of class variables in the same order as they appear in the dataset. For each class variable, GENMOD creates a set of dummy variables using the last category as the reference group. Therefore, we can interpret a two-way association as a log-odds ratio for the two variables in question, with the other variable held constant at its last category.

Here’s a portion of the SAS output. I edited the table of ML estimates to remove the omitted zero terms.



Analysis Of Parameter Estimates

                                          Standard
Parameter                   DF  Estimate     Error
Intercept                    1    3.1781    0.2041
D      DeptA                 1    1.3106    0.2300
D      DeptB                 1   -0.3448    0.3170
D      DeptC                 1    2.1302    0.2159
D      DeptD                 1    1.6971    0.2220
D      DeptE                 1    1.3652    0.2287
S      Male                  1   -0.0870    0.2952
A      Reject                1    2.5808    0.2117
D*S    DeptA Male            1    1.8367    0.3167
D*S    DeptB Male            1    3.1203    0.3857
D*S    DeptC Male            1   -0.4338    0.3169
D*S    DeptD Male            1    0.1463    0.3193
D*S    DeptE Male            1   -0.4860    0.3415
D*A    DeptA Reject          1   -4.1250    0.3297
D*A    DeptB Reject          1   -3.3346    0.4782
D*A    DeptC Reject          1   -1.9204    0.2288
D*A    DeptD Reject          1   -1.9589    0.2378
D*A    DeptE Reject          1   -1.4237    0.2425
S*A    Male Reject           1    0.1889    0.3052
D*S*A  DeptA Male Reject     1    0.8632    0.4027
D*S*A  DeptB Male Reject     1    0.0311    0.5335
D*S*A  DeptC Male Reject     1   -0.3138    0.3374
D*S*A  DeptD Male Reject     1   -0.1177    0.3401
D*S*A  DeptE Male Reject     1   -0.3891    0.3650
Scale                        0    1.0000    0.0000

The intercept is a normalizing constant and should be ignored. The main effects for D, S and A are all difficult to interpret and not very meaningful. But the two- and three-way associations are highly meaningful. For example, the estimated coefficient for the SA association is 0.1889.



Exponentiating this coefficient gives

$\exp(0.1889) = 1.208$,

which is the estimated SA odds ratio for Department F. The reference group for S is “women,” and the reference group for A is “accept.” If we write the 2 × 2 table for S × A in Department F, with the reference groups in the last row and column, we get

Dept F    Reject  Accept
Men          351      22
Women        317      24

for which the estimated odds ratio is

$\frac{351 \times 24}{317 \times 22} = 1.208$.

The Wald z-statistic for this coefficient,

$z = \frac{0.1889}{0.3052} = 0.62$,

indicates that the SA odds ratio for Department F is not significantly different from 1.00.
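The Department F numbers can be checked directly from the four cell counts. This is a minimal Python sketch, assuming the usual large-sample standard error of a log odds ratio in a 2 × 2 table (the square root of the sum of reciprocal cell counts). Because the saturated model fits the counts exactly, this should agree with the GENMOD standard error.

```python
import math

# Department F counts from the 2 x 2 table above.
men_reject, men_accept = 351, 22
women_reject, women_accept = 317, 24

# Sample odds ratio and its logarithm (the S*A coefficient).
odds_ratio = (men_reject * women_accept) / (women_reject * men_accept)
log_or = math.log(odds_ratio)

# Large-sample standard error of the log odds ratio.
se = math.sqrt(1 / men_reject + 1 / men_accept
               + 1 / women_reject + 1 / women_accept)

# Wald z-statistic for testing that the odds ratio equals 1.
z = log_or / se

print(f"odds ratio = {odds_ratio:.3f}")
print(f"log odds ratio = {log_or:.4f}, SE = {se:.4f}, z = {z:.2f}")
```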



To get the SA odds ratio for any other department, we have to combine the SA coefficient with one of the DSA coefficients. For example, the SA odds ratio for Department A is

$\exp(0.1889 + 0.8632) = 2.864$.

The Wald z-statistic for the first DSA coefficient,

$z = \frac{0.8632}{0.4027} = 2.14$,

indicates that the SA odds ratio for Department A is significantly different from the SA odds ratio in Department F. To see if the SA odds ratio in Department A is significantly different from 1.00, we would have to compute the standard error of the sum of the two coefficients using the estimated covariance matrix.
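For the saturated model there is a shortcut that avoids the covariance matrix. The fitted counts equal the observed counts, so the sum of the two coefficients is just the sample log odds ratio in Department A, and its large-sample standard error can be computed from that department's four cells. The sketch below relies on that assumption; the helper names are hypothetical, and it is not a substitute for the covariance-matrix calculation in non-saturated models.

```python
import math

# (reject, accept) counts for men and women in Departments A and F.
dept_a = {"men": (313, 512), "women": (19, 89)}
dept_f = {"men": (351, 22), "women": (317, 24)}


def log_odds_ratio(cells):
    """Log of (men reject * women accept) / (women reject * men accept)."""
    (mr, ma), (wr, wa) = cells["men"], cells["women"]
    return math.log((mr * wa) / (wr * ma))


def var_log_odds_ratio(cells):
    """Large-sample variance: sum of reciprocal cell counts."""
    return sum(1 / count for pair in cells.values() for count in pair)


# SA association in Department A (S*A plus the DeptA D*S*A coefficient).
log_or_a = log_odds_ratio(dept_a)
se_a = math.sqrt(var_log_odds_ratio(dept_a))
z_a = log_or_a / se_a

# D*S*A coefficient: difference between the Department A and F log odds ratios.
# The two departments are independent, so the variances add.
diff = log_or_a - log_odds_ratio(dept_f)
se_diff = math.sqrt(var_log_odds_ratio(dept_a) + var_log_odds_ratio(dept_f))

print(f"Dept A odds ratio = {math.exp(log_or_a):.3f}, SE(log) = {se_a:.4f}, z = {z_a:.2f}")
print(f"DSA coefficient = {diff:.4f}, SE = {se_diff:.4f}, z = {diff / se_diff:.2f}")
```

The second line should match the first D*S*A row of the GENMOD table up to rounding, which is a useful check on the shortcut.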
