BIBS: Bioinformatics & Biostatistics Lab., Seoul National University
Jan 02, 2016
Categorical Data Analysis &
Logistic Regression
Department of Statistics and Information, University of Suwon
Jinheum Kim
Marketing Lab Partners Co., Ltd.
Senior Researcher
Eunkyung Lee
SNU BIBS
BIOSTATISTICS FOR BIOINFORMATICS
Outline
Two-way contingency tables: RR, Odds ratio, Chi-square tests
Three-way contingency tables: Conditional independence, Homogeneous association, Common odds ratio
Logistic regression: Dichotomous response
Logistic regression: Polytomous response
First example: Aspirin & heart attacks
Clinical trial: 2×2 table of aspirin use and myocardial infarction (MI)
Test whether regular intake of aspirin reduces mortality from cardiovascular disease
Prospective sampling design: Cohort studies, Clinical trials
Data set
                Myocardial Infarction
Group           Yes       No        Total
Placebo         189       10,845    11,034
Aspirin         104       10,933    11,037
Second example: Smoking & heart attacks
Case-control study: 2×2 table of smoking status and MI
Compare ever-smokers with nonsmokers in terms of the proportion who suffered MI
Retrospective sampling design: Case-control study, Cross-sectional design
Remark: Observational studies vs. experimental study
Data set
Ever-Smoker     Myocardial Infarction    Controls
Yes             172                      173
No               90                      346
Total           262                      519
Comparing proportions in 2×2 table
Difference: $\pi_1 - \pi_2$
Relative risk: $\pi_1 / \pi_2$
Useful when both proportions are close to 0 or 1
$p_1 = 0.10,\ p_2 = 0.01$: $p_1 - p_2 = 0.09,\ p_1/p_2 = 10$
: RR is more informative
$p_1 = 0.410,\ p_2 = 0.401$: $p_1 - p_2 = 0.009,\ p_1/p_2 = 1.02$
$\pi_1/\pi_2 = 1$: Response is independent of group
Example (revisited)
1st example: $p_1 - p_2 = 0.0171 - 0.0094 = 0.0077$, 95% CI=(0.005, 0.011)
Taking aspirin appears to reduce the risk of heart attack
$p_1/p_2 = 1.82$, 95% CI=(1.43, 2.3)
Risk of MI is at least 43% higher for the placebo group
2nd example: $p_1 - p_2$, $p_1/p_2$: Not estimable. They can be computed from the table but are meaningless, because the numbers of cases and controls were fixed by the sampling design
Estimate proportions in the reverse direction. Proportion of smoking given MI status:
$p_1 = 172/262 = 0.656$ (suffered MI), $p_2 = 173/519 = 0.333$ (did not suffer MI)
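The interval estimates for the 1st example can be checked with a few lines of Python. This is a minimal sketch, assuming the usual large-sample (Wald) standard errors for the difference of proportions and for the log relative risk; the slides do not say which interval method was used, and the function name `diff_and_rr` is only illustrative.

```python
import math

def diff_and_rr(x1, n1, x2, n2, z=1.96):
    """Wald-type 95% intervals for p1 - p2 and p1 / p2 (two independent binomials)."""
    p1, p2 = x1 / n1, x2 / n2
    diff = p1 - p2
    se_diff = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    rr = p1 / p2
    # Standard error of log(RR): sqrt((1 - p1)/x1 + (1 - p2)/x2)
    se_log_rr = math.sqrt((1 - p1) / x1 + (1 - p2) / x2)
    return {
        "diff": diff,
        "diff_ci": (diff - z * se_diff, diff + z * se_diff),
        "rr": rr,
        "rr_ci": (math.exp(math.log(rr) - z * se_log_rr),
                  math.exp(math.log(rr) + z * se_log_rr)),
    }

# Aspirin data: placebo 189 MI out of 11,034; aspirin 104 MI out of 11,037
result = diff_and_rr(189, 11034, 104, 11037)
print(result)
```

Under these assumptions the output should agree with the slide values to the reported precision (difference about 0.0077, relative risk about 1.82).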
Association measure: odds ratio
Def'n: $\theta = \dfrac{\pi_1/(1-\pi_1)}{\pi_2/(1-\pi_2)}$
Meaning
$\theta = 1$: When two variables are independent, i.e., $\pi_1 = \pi_2$
$\theta > 1$: When odds of success (in row 1) > (in row 2)
$0 < \theta < 1$: When odds of success (in row 1) < (in row 2)
Remark: When both variables are response, $\theta = \dfrac{\pi_{11}\pi_{22}}{\pi_{12}\pi_{21}}$ (called cross-product ratio) using joint probabilities
Properties of odds ratio
Values of $\theta$ farther from 1 in a given direction represent stronger association
When one value is the inverse of the other, the two values of $\theta$ represent the same strength of association, but in opposite directions: e.g., $\theta_1 = 4$, $\theta_2 = 0.25 = 1/4$
Not changed when the table orientation reverses (rows become columns)
Unnecessary to identify one classification as a response variable
Example (revisited)
1st example: $\hat\theta = \dfrac{p_1/(1-p_1)}{p_2/(1-p_2)} = \dfrac{n_{11}n_{22}}{n_{12}n_{21}} = 1.832$, 95% CI=(1.44, 2.33)
Estimated odds of MI are 83% higher for the placebo group
2nd example: $\hat\theta = 3.8$
Rough estimate of RR=3.8, since $\hat\theta \approx \mathrm{RR}$ when the outcome is rare
Women who had ever smoked were about four times as likely to suffer MI as women who had never smoked
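Both sample odds ratios can be reproduced from the cell counts. The sketch below assumes the standard large-sample interval on the log scale, with $\mathrm{SE}(\log\hat\theta) = \sqrt{1/n_{11} + 1/n_{12} + 1/n_{21} + 1/n_{22}}$; the slides do not state how the interval was obtained, and `odds_ratio` is an illustrative name.

```python
import math

def odds_ratio(n11, n12, n21, n22, z=1.96):
    """Sample odds ratio (cross-product ratio) with a Wald interval on the log scale."""
    theta = (n11 * n22) / (n12 * n21)
    se_log = math.sqrt(1 / n11 + 1 / n12 + 1 / n21 + 1 / n22)
    lower = math.exp(math.log(theta) - z * se_log)
    upper = math.exp(math.log(theta) + z * se_log)
    return theta, (lower, upper)

# 1st example: rows = placebo / aspirin, columns = MI yes / no
aspirin_or, aspirin_ci = odds_ratio(189, 10845, 104, 10933)

# 2nd example: rows = ever-smoker yes / no, columns = MI cases / controls
smoking_or, smoking_ci = odds_ratio(172, 173, 90, 346)

# Transposing the table (swapping rows and columns) leaves the odds ratio unchanged
transposed_or, _ = odds_ratio(189, 104, 10845, 10933)

print(aspirin_or, aspirin_ci)
print(smoking_or, smoking_ci)
```

The last call illustrates why the odds ratio is usable in the case-control example: it does not depend on which classification is treated as the response.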
Independence tests
Hypothesis: $H_0: \pi_{ij} = \pi_{i+}\pi_{+j}$ for all $i, j$
Two chi-square tests
Under $H_0$, estimated expected frequency $\hat\mu_{ij} = \dfrac{n_{i+}n_{+j}}{n}$
Pearson's $X^2 = \sum_{i,j} \dfrac{(n_{ij} - \hat\mu_{ij})^2}{\hat\mu_{ij}}$
Likelihood ratio (LR) statistic $G^2 = 2\sum_{i,j} n_{ij}\log\left(\dfrac{n_{ij}}{\hat\mu_{ij}}\right)$
For a large sample, $X^2, G^2$ follow a chi-squared null distribution with $df = (I-1)(J-1)$
Remark: When $\hat\mu_{ij} \ge 5$, the chi-squared approximation is good. If not, apply Fisher's exact test
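As an illustration, the two statistics can be computed directly from their definitions. The sketch below applies them to the aspirin table from the first example; the choice of table and the function name `independence_tests` are assumptions made for illustration, and the p-value line uses a closed form that is valid only for $df = 1$.

```python
import math

def independence_tests(table):
    """Pearson X^2 and likelihood-ratio G^2 for an I x J table of counts."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    x2 = g2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            mu = row_tot[i] * col_tot[j] / n      # estimated expected frequency under H0
            x2 += (obs - mu) ** 2 / mu
            if obs > 0:                           # a zero cell contributes 0 to G^2
                g2 += 2 * obs * math.log(obs / mu)
    df = (len(table) - 1) * (len(table[0]) - 1)
    return x2, g2, df

x2, g2, df = independence_tests([[189, 10845], [104, 10933]])

# For df = 1 only: P(chi-square_1 > x) = erfc(sqrt(x / 2))
p_value = math.erfc(math.sqrt(x2 / 2))
print(x2, g2, df, p_value)
```

For tables with more than one degree of freedom, a chi-squared tail function from a statistics library would be needed in place of the `erfc` shortcut.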
Example: AZT use & AIDS
Development of AIDS symptoms by AZT use and race
Study on the effects of AZT in slowing the development of AIDS symptoms
Data set
                      Symptoms
Race     AZT Use      Yes     No      Total
White    Yes          14      93      107
         No           32      81      113
Black    Yes          11      52       63
         No           12      43       55
Three interests in 2×2×K table
Conditional independence? When controlling for race, are AZT treatment and development of AIDS symptoms independent?
Use Cochran-Mantel-Haenszel (CMH) test
Summarize the information from the K partial tables
Homogeneous association? Are the odds ratios between AZT treatment and development of AIDS symptoms the same for each race?
Use Breslow-Day test
Common odds ratio? Use Mantel-Haenszel estimate
Example (AZT use & AIDS revisited)
CMH=6.8 ($df$=1) with $p$-value=0.0091: Not independent!
Breslow-Day=1.39 ($df$=1) with $p$-value=0.2384: Homogeneous association!
Common odds ratio=0.49: For each race, estimated odds of developing symptoms are about half as high for those who took AZT
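The CMH statistic and the Mantel-Haenszel estimate have simple closed forms, so they can be sketched without statistical software. The code below assumes the uncorrected CMH statistic $\left[\sum_k (n_{11k} - \mu_{11k})\right]^2 / \sum_k \mathrm{Var}(n_{11k})$ and the estimate $\sum_k (n_{11k}n_{22k}/n_k) / \sum_k (n_{12k}n_{21k}/n_k)$. The Breslow-Day test is not included, and the function name `mantel_haenszel` is illustrative.

```python
import math

def mantel_haenszel(strata):
    """strata: list of partial 2x2 tables given as (n11, n12, n21, n22)."""
    num = den = 0.0      # pieces of the Mantel-Haenszel common odds ratio
    dev = var = 0.0      # pieces of the CMH statistic
    for n11, n12, n21, n22 in strata:
        n = n11 + n12 + n21 + n22
        num += n11 * n22 / n
        den += n12 * n21 / n
        r1, r2 = n11 + n12, n21 + n22            # row totals
        c1, c2 = n11 + n21, n12 + n22            # column totals
        dev += n11 - r1 * c1 / n                 # observed minus expected under H0
        var += r1 * r2 * c1 * c2 / (n ** 2 * (n - 1))
    return num / den, dev ** 2 / var

# Rows = AZT yes / no, columns = symptoms yes / no; strata = White, Black
azt_strata = [(14, 93, 32, 81), (11, 52, 12, 43)]
common_or, cmh = mantel_haenszel(azt_strata)

# df = 1, so the chi-squared tail probability is erfc(sqrt(x / 2))
p_value = math.erfc(math.sqrt(cmh / 2))
print(common_or, cmh, p_value)
```

The results should land close to the reported values (CMH about 6.8, common odds ratio about 0.49); small differences in later digits relative to software output are possible.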
Overview of types of generalized linear models(GLMs)
Three components: Random component (response variable), Linear predictor (linear combination of covariates), Link function
Types of GLMs
Random Component   Link                Systematic Component   Model
Normal             Identity            Continuous             Regression
Normal             Identity            Categorical            Analysis of variance
Normal             Identity            Mixed                  Analysis of covariance
Binomial           Logit               Mixed                  Logistic regression
Poisson            Log                 Mixed                  Loglinear
Multinomial        Generalized logit   Mixed                  Multinomial response
Logistic regression with a quantitative covariate
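Model: $\mathrm{logit}[\pi(x)] = \log\dfrac{\pi(x)}{1-\pi(x)} = \alpha + \beta x$
Other representations
Odds: $\dfrac{\pi(x)}{1-\pi(x)} = e^{\alpha + \beta x}$
Odds at level $x+1$ equals the odds at $x$ multiplied by $e^{\beta}$
Probability: $\pi(x) = \dfrac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}}$
Curve ascends ($\beta > 0$) or descends ($\beta < 0$)
The rate of change increases as $|\beta|$ increases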
Example: Horseshoe crabs
Binary response: $Y = 1$ if a female crab has at least one satellite; $Y = 0$ otherwise
Covariate: female crab's width
Data set
Width (cm)       Number Cases   Number Having Satellites
< 23.25          14              5
23.25 - 24.25    14              4
24.25 - 25.25    28             17
25.25 - 26.25    39             21
26.25 - 27.25    22             15
27.25 - 28.25    24             20
28.25 - 29.25    18             15
> 29.25          14             14
Example: Horseshoe crabs [figure not recovered]
Goodness-of-fit tests
Working model: $M$, number of settings: $s$, number of parameters in $M$: $p$
Hypothesis: $H_0$: $M$ fits the data
Pearson's statistic: $X^2(M) = \sum (\text{observed} - \text{fitted})^2 / \text{fitted}$
Deviance statistic: $G^2(M) = 2\sum \text{observed}\,\log(\text{observed}/\text{fitted})$
$X^2, G^2$ approximately follow a chi-square null distribution with $df = s - p$
Inference for parameters
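Interval estimation: $\hat\beta \pm 1.96\,\mathrm{SE}(\hat\beta)$
Two significance tests of $H_0: \beta = 0$
Wald test: Use $z = \hat\beta / \mathrm{SE}(\hat\beta)$ (or equivalently $z^2$)
Likelihood ratio test: Use $-2(L_0 - L_1)$, $L$: log-likelihood function
The two test statistics ($z^2$ and $-2(L_0 - L_1)$) have a large-sample chi-squared null distribution with $df = 1$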
Example (Horseshoe crabs revisited)
Fitted model: $\mathrm{logit}[\hat\pi(x)] = -12.3508 + 0.4972x$
$\hat\pi(x)$: larger at larger width ($\hat\beta > 0$)
There is a 64% increase in estimated odds of a satellite for each centimeter increase in width ($e^{\hat\beta} = 1.64$)
Goodness-of-fit: $X^2 = 5.3$ ($df = 6$) with $p$-value=0.506; $G^2 = 6.2$ ($df = 6$) with $p$-value=0.4012
95% CI for $\beta$=(0.298, 0.697)
Significance test: Wald=23.9 ($df$=1) with $p$-value < 0.0001; LRT=31.3 ($df$=1) with $p$-value < 0.0001
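The fitted equation can be turned into estimated probabilities and odds multipliers with a short script. This sketch only evaluates the reported estimates; it does not refit the model. The names `pi_hat`, `odds_multiplier`, and `median_width` are illustrative, and the width at which $\hat\pi = 0.5$ is a derived quantity that the slides do not report.

```python
import math

ALPHA_HAT, BETA_HAT = -12.3508, 0.4972     # reported fitted coefficients
BETA_CI = (0.298, 0.697)                   # reported 95% CI for beta

def pi_hat(width_cm):
    """Estimated probability that a female crab of this width has a satellite."""
    eta = ALPHA_HAT + BETA_HAT * width_cm
    return 1.0 / (1.0 + math.exp(-eta))

odds_multiplier = math.exp(BETA_HAT)                  # effect of +1 cm on the odds
odds_multiplier_ci = tuple(math.exp(b) for b in BETA_CI)
median_width = -ALPHA_HAT / BETA_HAT                  # width where pi_hat = 0.5

for w in (22.0, 25.0, 28.0, 31.0):
    print(w, round(pi_hat(w), 3))
print(odds_multiplier, odds_multiplier_ci, median_width)
```

Exponentiating the endpoints of the interval for $\beta$ gives an interval for the per-centimeter odds multiplier, which is often easier to interpret than the interval on the logit scale.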
Logistic regression with qualitative predictors: AIDS symptoms data
Use indicator variables for representing categories of predictors
$\mathrm{logit}[\pi(x, z)] = \alpha + \beta_1 x + \beta_2 z$
Logits implied by indicator variables
x    z    Logit
0    0    $\alpha$
1    0    $\alpha + \beta_1$
0    1    $\alpha + \beta_2$
1    1    $\alpha + \beta_1 + \beta_2$
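Logistic regression with qualitative predictors: AIDS symptoms data
$\beta_1$ = difference between two logits (i.e., log of odds ratio) at a fixed category of $z$
$e^{\beta_1} = \dfrac{\text{odds of success at } x = 1}{\text{odds of success at } x = 0}$
Homogeneous association model: with no interaction term, this odds ratio is the same at every category of $z$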
Equivalence of 2×2×K contingency table & logistic regression
Conditional independence: CMH test vs. test of $H_0: \beta_1 = 0$
Homogeneous association: Breslow-Day test vs. Goodness-of-fit test
Common odds ratio estimate: Mantel-Haenszel estimate vs. $e^{\hat\beta_1}$
Computer Output for Model with AIDS Symptoms Data
Log Likelihood   -167.5756
Analysis of Maximum Likelihood Estimates
Parameter    Estimate    Std Error   Wald Chi-Square   Pr > ChiSq
Intercept    -1.0736     0.2629      16.6705           <.0001
azt          -0.7195     0.2790       6.6507           0.0099
race          0.0555     0.2886       0.0370           0.8476
LR Statistics
Source   DF   Chi-Square   Pr > ChiSq
azt      1    6.87         0.0088
race     1    0.04         0.8473
Obs   race   azt   y    n     pi_hat    lower     upper
1     1      1     14   107   0.14962   0.09897   0.21987
2     1      0     32   113   0.26540   0.19668   0.34774
3     0      1     11    63   0.14270   0.08704   0.22519
4     0      0     12    55   0.25472   0.16953   0.36396
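The goodness-of-fit statistics for this model can be recomputed from the `y`, `n`, and `pi_hat` columns of the output. The sketch below applies the definitions of $X^2$ and $G^2$ to both the "symptoms" and "no symptoms" counts at each of the four settings; the function name `binomial_gof` is illustrative, and $df = s - p = 4 - 3 = 1$ follows from the three fitted parameters.

```python
import math

# (y, n, pi_hat) for the four race-by-AZT settings, copied from the output table
SETTINGS = [
    (14, 107, 0.14962),
    (32, 113, 0.26540),
    (11, 63, 0.14270),
    (12, 55, 0.25472),
]

def binomial_gof(settings):
    """Pearson X^2 and deviance G^2 comparing observed with fitted counts."""
    x2 = g2 = 0.0
    for y, n, p in settings:
        # Each setting contributes a success cell and a failure cell
        for observed, fitted in ((y, n * p), (n - y, n * (1 - p))):
            x2 += (observed - fitted) ** 2 / fitted
            if observed > 0:
                g2 += 2 * observed * math.log(observed / fitted)
    return x2, g2

x2, g2 = binomial_gof(SETTINGS)
df = len(SETTINGS) - 3                       # 4 settings minus 3 parameters
p_value = math.erfc(math.sqrt(g2 / 2))       # chi-squared tail, valid for df = 1 only
print(x2, g2, df, p_value)
```

Both statistics should come out near 1.4, in line with the Breslow-Day value of 1.39 reported for the same data, which is consistent with the stated correspondence between the Breslow-Day test and the goodness-of-fit test.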
Logistic regression with mixed predictors: Horseshoe crabs data
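$\mathrm{logit}[\pi(x)] = \alpha + \beta_1 c_1 + \beta_2 c_2 + \beta_3 c_3 + \beta_4 x$
For color=medium light, $(c_1, c_2, c_3) = (1, 0, 0)$
For color=medium, $(c_1, c_2, c_3) = (0, 1, 0)$
For color=medium dark, $(c_1, c_2, c_3) = (0, 0, 1)$
(Dark is the reference color, $(c_1, c_2, c_3) = (0, 0, 0)$)
For controlling $x$,
$e^{\beta_1} = \dfrac{\text{odds of success of a medium-light crab}}{\text{odds of success of a dark crab}}$,
$e^{\beta_2} = \dfrac{\text{odds of success of a medium crab}}{\text{odds of success of a dark crab}}$,
$e^{\beta_3} = \dfrac{\text{odds of success of a medium-dark crab}}{\text{odds of success of a dark crab}}$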
Computer Output for Model for Horseshoe Crabs Data
                                      Likelihood Ratio 95%
Parameter   Estimate    Std. Error    Confidence Limits        Chi Square   Pr > Chi Sq
intercept   -12.7151    2.7618        -18.4564    -7.5788      21.20        < .0001
c1            1.3299    0.8525         -0.2738     3.1354       2.43        0.1188
c2            1.4023    0.5484          0.3527     2.5260       6.54        0.0106
c3            1.1061    0.5921         -0.0279     2.3138       3.49        0.0617
width         0.4680    0.1055          0.2713     0.6870      19.66        < .0001
LR Statistics
Source   DF   Chi-Square   Pr > Chi Sq
width    1    26.40        < .0001
color    3     7.00        0.0720
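To see what these estimates imply, the fitted logit can be evaluated for any color and width. The sketch below uses the estimates from the output table with the indicator coding from the model slide, taking dark as the reference color $(0, 0, 0)$; the width of 26.3 cm in the example calls is an arbitrary illustrative value, and the names `pi_hat` and `odds_vs_dark` are not from the source.

```python
import math

COEF = {"intercept": -12.7151, "c1": 1.3299, "c2": 1.4023, "c3": 1.1061, "width": 0.4680}

# Indicator coding (c1, c2, c3); dark is the reference category
COLOR_CODES = {
    "medium light": (1, 0, 0),
    "medium": (0, 1, 0),
    "medium dark": (0, 0, 1),
    "dark": (0, 0, 0),
}

def pi_hat(color, width_cm):
    """Estimated probability of at least one satellite for a given color and width."""
    c1, c2, c3 = COLOR_CODES[color]
    eta = (COEF["intercept"] + COEF["c1"] * c1 + COEF["c2"] * c2
           + COEF["c3"] * c3 + COEF["width"] * width_cm)
    return 1.0 / (1.0 + math.exp(-eta))

# Estimated odds ratios relative to dark crabs, at any fixed width
odds_vs_dark = {
    "medium light": math.exp(COEF["c1"]),
    "medium": math.exp(COEF["c2"]),
    "medium dark": math.exp(COEF["c3"]),
}

print(odds_vs_dark)
print(pi_hat("dark", 26.3), pi_hat("medium light", 26.3))
```

Because the model has no color-by-width interaction, the odds ratios in `odds_vs_dark` apply at every width, while the difference in probabilities between colors changes with width.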
[Figure: Estimated probabilities for primary food choice]
Logistic regression: polytomous response
Model categorical responses with more than two categories
Two ways
Use generalized logits function for nominal response
Use cumulative logits function for ordinal response
Notation
$J$: number of categories
$\pi_1, \ldots, \pi_J$: response probabilities with $\sum_{j=1}^{J} \pi_j = 1$
Generalized logit model: nominal response
Baseline-category logit: Pair each category with a baseline category
$\log\dfrac{\pi_j}{\pi_J},\ j = 1, \ldots, J-1$, when $J$ is the baseline
Model with a predictor $x$: $\log\dfrac{\pi_j}{\pi_J} = \alpha_j + \beta_j x,\ j = 1, \ldots, J-1$
The effects vary according to the category paired with the baseline
These $J-1$ equations determine equations for all other pairs of categories
E.g., for a pair of categories $a, b$: $\log\dfrac{\pi_a}{\pi_b} = \log\dfrac{\pi_a/\pi_J}{\pi_b/\pi_J} = (\alpha_a - \alpha_b) + (\beta_a - \beta_b)x$
Remark: The estimates for a given pair of categories are the same no matter which category is the baseline
Example: Alligator food choice
59 alligators sampled in Lake George, Florida
Response: Primary food type found in alligator's stomach
Fish (1), Invertebrate (2), Other (3, baseline category)
Predictor: alligator length $x$, which varies 1.24~3.89 (m)
ML prediction equations
$\log(\hat\pi_1/\hat\pi_3) = 1.618 - 0.110x$; $\log(\hat\pi_2/\hat\pi_3) = 5.697 - 2.465x$
$\log(\hat\pi_1/\hat\pi_2) = -4.08 + 2.355x$
Larger alligators seem to select fish rather than invertebrates
Independence test: Food choice & length, $H_0: \beta_1 = \beta_2 = 0$
LRT=16.8006 ($df$=2) with $p$-value=0.0002
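The two prediction equations determine all three response probabilities, since $\hat\pi_j = e^{\hat\alpha_j + \hat\beta_j x} / \left(1 + \sum_{h=1}^{2} e^{\hat\alpha_h + \hat\beta_h x}\right)$ for $j = 1, 2$ and $\hat\pi_3$ is the remainder. A minimal sketch using the reported coefficients follows; the function name `food_probs` and the chosen lengths are illustrative.

```python
import math

def food_probs(length_m):
    """Estimated (fish, invertebrate, other) probabilities at a given length in meters."""
    eta_fish = 1.618 - 0.110 * length_m      # log(pi_1 / pi_3)
    eta_invert = 5.697 - 2.465 * length_m    # log(pi_2 / pi_3)
    denom = 1.0 + math.exp(eta_fish) + math.exp(eta_invert)
    return (math.exp(eta_fish) / denom,
            math.exp(eta_invert) / denom,
            1.0 / denom)                     # baseline category: other

# Smallest and largest observed lengths, plus one value in between
for length in (1.24, 2.0, 3.89):
    fish, invert, other = food_probs(length)
    print(length, round(fish, 3), round(invert, 3), round(other, 3))
```

At the short end of the length range the invertebrate probability dominates, and at the long end fish dominates, which matches the interpretation given on the slide.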
Cumulative logit model: ordinal response
Logit of a cumulative probability
$\mathrm{logit}[P(Y \le j)] = \log\dfrac{P(Y \le j)}{1 - P(Y \le j)},\ j = 1, \ldots, J-1$
Categories 1 to $j$: combined, Categories $j+1$ to $J$: combined
Cumulative proportional odds model with a predictor
$\mathrm{logit}[P(Y \le j)] = \alpha_j + \beta x,\ j = 1, \ldots, J-1$
The effect of $x$ is identical for all $J-1$ cumulative logits
Any one curve for $P(Y \le j)$ is identical to any of the others shifted to the right or to the left
For $x_1, x_2$: $\mathrm{logit}[P(Y \le j \mid x = x_2)] - \mathrm{logit}[P(Y \le j \mid x = x_1)] = \beta(x_2 - x_1)$ = log of odds ratio, which is
Proportional to the difference between $x$ values
Same for each cumulative probability
Example: Political ideology & party affiliation
Response: Political ideology with five-point ordinal scale
Predictors: Political party (Democratic, Republican)
                   Political Ideology
Political Party    Very Liberal   Slightly Liberal   Moderate   Slightly Conservative   Very Conservative
Democratic         80             81                 171        41                      55
Republican         30             46                 148        84                      99
Example: Political ideology & party affiliation
Parameter inference: $\hat\beta = 0.975$, $e^{0.975} = 2.65$
Democrats tend to be more liberal than Republicans
$H_0: \beta = 0$: Wald=57.1 ($df$=1) with $p$-value < 0.0001
Strong evidence of an association
95% CI for $\beta$=(0.72, 1.23) or for $e^{\beta}$=(2.1, 3.4)
Odds of being at the liberal end of the scale are at least twice as high for Democrats as for Republicans
Goodness-of-fit: $G^2 = 3.7$ ($df$=3) with $p$-value=0.2957: Good adequacy!
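One informal way to see what the proportional odds assumption claims is to compute the sample cumulative odds ratio at each of the four cutpoints of the ideology scale. The sketch below works from the raw counts only and does not fit the model; under proportional odds, the four ratios should all be in the neighborhood of the single model-based estimate $e^{\hat\beta} = 2.65$. The function name `cumulative_odds` is illustrative.

```python
# Counts from very liberal (1) to very conservative (5)
democratic = [80, 81, 171, 41, 55]
republican = [30, 46, 148, 84, 99]

def cumulative_odds(counts):
    """Sample odds of Y <= j versus Y > j for j = 1, ..., J-1."""
    total = sum(counts)
    running = 0
    odds = []
    for count in counts[:-1]:
        running += count
        odds.append(running / (total - running))
    return odds

# Ratio of cumulative odds, Democrats relative to Republicans, at each cutpoint
cumulative_odds_ratios = [
    d / r for d, r in zip(cumulative_odds(democratic), cumulative_odds(republican))
]
print([round(r, 2) for r in cumulative_odds_ratios])
```

The sample ratios need not be equal; the goodness-of-fit statistic above is what formally assesses whether a single common value is adequate.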
Other logit forms for ordinal response categories
Adjacent-categories logit: $\log\dfrac{\pi_{j+1}}{\pi_j},\ j = 1, \ldots, J-1$
Adjacent-categories logits determine the logits for all pairs of response categories
Continuation-ratio logit
Form 1: $\log\dfrac{\pi_1}{\pi_2},\ \log\dfrac{\pi_1 + \pi_2}{\pi_3},\ \ldots,\ \log\dfrac{\pi_1 + \cdots + \pi_{J-1}}{\pi_J}$
Contrast each category with a grouping of categories from lower levels of response scale
Form 2: $\log\dfrac{\pi_1}{\pi_2 + \cdots + \pi_J},\ \log\dfrac{\pi_2}{\pi_3 + \cdots + \pi_J},\ \ldots,\ \log\dfrac{\pi_{J-1}}{\pi_J}$
Contrast each category with a grouping of categories from higher levels of response scale
Summary
Two-way contingency tables: RR, Odds ratio, Chi-square tests
Three-way contingency tables: Conditional independence, Homogeneous association, Common odds ratio
Logistic regression: Dichotomous response
Logistic regression: Polytomous response
References
Agresti, A. (1996). An Introduction to Categorical Data Analysis, Wiley: New York (Also the 2nd edition is available)
Stokes, M.E., Davis, C.S., and Koch, G.G. (2000). Categorical Data Analysis Using The SAS System, Second Ed., SAS Inc.: Cary