Econ 616, Spring 2006: Qualitative Response Regression Models. Presented by Yan Hu, 04/19/2006.

04/19/2006 Econ 616 1

Econ 616 – Spring 2006

Qualitative Response Regression Models

Presented by Yan Hu


Outline

Qualitative Response Regression Model

Binary Response Regression Models
1. The Linear Probability Model (LPM)
2. The Logit Model
3. The Probit Model


What is a Qualitative Response Regression Model?

The dependent variable is qualitative (or dummy) in nature.
--- The dependent variable is a binary, or dichotomous, variable: Y=1 if the person is in the labor force and Y=0 if he or she is not.
--- Trichotomous response variable.
--- Polychotomous (or multiple-category) response variable.


Binary Response Regression Models

E(Y) is related to the X's through a link function g: g(E(Y)) = Xβ.

In binary regression, a link function specifies a relationship between E(Y) (the probability of Y=1, which is also the expected value of Y) and a linear composite score of X's.


Three Binary Response Regression Models

1. The Linear Probability Model (LPM)
2. The Logit Model
3. The Probit Model


What is the Linear Probability Model?

Y follows the Bernoulli probability distribution:

Yi      Probability
0       1-P
1       P
Total   1

Link function: E(Y) = 0(1-P) + 1(P) = P, so the expected value of Y is the probability itself (identity link).

Expression for LPM: P = Xβ


Problems of LPM (1)

1. Non-normality of the disturbances:

In the model $Y_i = \beta_1 + \beta_2 X_i + u_i$, the disturbance $u_i$ can take only two values, so it follows the Bernoulli distribution rather than the normal distribution:

          u_i                             Probability
Yi=1      $1 - \beta_1 - \beta_2 X_i$     Pi
Yi=0      $-\beta_1 - \beta_2 X_i$        (1-Pi)

This problem may not be so critical. If the objective is point estimation, the normality assumption for the disturbances is not necessary and the OLS estimators remain unbiased. As the sample size increases indefinitely, the OLS estimators tend to be normally distributed.


Problems of LPM (2)

2. Heteroscedastic variances of the disturbances:

Var(ui) = Pi(1-Pi), so the variance is a function of the mean (Pi).

One way to solve the heteroscedasticity is to transform the model by dividing it through by the weights $w_i = \sqrt{P_i(1-P_i)}$:

$\frac{Y_i}{w_i} = \frac{\beta_1}{w_i} + \beta_2 \frac{X_i}{w_i} + \frac{u_i}{w_i}$

Then estimate the transformed equation by OLS.


Problems of LPM (3)

3. Nonfulfillment of $0 \le E(Y_i \mid X_i) \le 1$

Two ways of making sure that the estimated $\hat{Y}_i$ lie between 0 and 1:

1. Estimate the LPM by the usual OLS method. If some $\hat{Y}_i$ are less than zero, $\hat{Y}_i$ is assumed to be zero for those cases; if they are greater than 1, they are assumed to be 1.

2. Devise an estimating technique that will guarantee that the estimated conditional probabilities will lie between 0 and 1, such as logit and probit models.
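The first approach can be illustrated with a minimal sketch, assuming the nine-family home-ownership data from the "Data at the Individual Level" slide. The helper name `ols_simple` is hypothetical, and the closed-form OLS below is only one way to fit the LPM; it is not taken from the slides.

```python
# Home-ownership data from the "Data at the Individual Level" slide.
income = [8, 16, 18, 11, 12, 19, 20, 13, 9]   # X, family income
owns = [0, 1, 1, 0, 0, 1, 1, 0, 0]            # Y, 1 = family owns a house


def ols_simple(x, y):
    """Closed-form OLS for y = b1 + b2*x (hypothetical helper); returns (b1, b2)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b2 = sxy / sxx
    b1 = y_bar - b2 * x_bar
    return b1, b2


b1, b2 = ols_simple(income, owns)

# LPM fitted values are estimated probabilities, but nothing bounds them.
fitted = [b1 + b2 * xi for xi in income]

# Approach 1 from the slide: set values below 0 to 0 and values above 1 to 1.
clipped = [min(1.0, max(0.0, p)) for p in fitted]

for xi, p, c in zip(income, fitted, clipped):
    print(f"X={xi:2d}  fitted={p:7.4f}  clipped={c:6.4f}")
```

With these data the lowest and highest incomes produce fitted values below 0 and above 1, which is exactly the situation the clipping rule is meant to patch.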


Problems of LPM (4)

4. Questionable value of R² as a measure of goodness of fit.

For a given X, the Y values will be either 0 or 1. Therefore, all the Y values will lie either along the X-axis or along the line corresponding to 1. Generally, no LPM is expected to fit such a scatter well. As a result, the conventionally computed R² is likely to be much lower than 1 for such models.

Aldrich and Nelson contend that “use of the coefficient of determination as a summary statistic should be avoided in models with qualitative dependent variable.”


What is the Logit Model?

The cumulative logistic distribution: P = E(Y=1|X) = 1/(1+e^(-βX))

[Figure: logistic curve of P plotted against X, with P ranging from 0 to 1]


What is the Logit Model?

From the logistic distribution,

1-P = e^(-βX) / (1+e^(-βX))
P/(1-P) = e^(βX), the odds ratio
log[P/(1-P)] = βX

Link function: g = log[p/(1-p)], where p is the probability of either Y=1 or Y=0, depending on the software.

Generally, log[p/(1-p)] = Xβ.
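The relationships on this slide can be checked numerically. The following is a small sketch; the function names `logistic_cdf` and `logit` and the index values are illustrative choices, not part of the slides.

```python
import math


def logistic_cdf(z):
    """P = 1 / (1 + e^(-z)), where z stands for the linear index βX."""
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    """Log of the odds, log[p / (1 - p)]."""
    return math.log(p / (1.0 - p))


for z in (-2.0, 0.0, 1.5):          # arbitrary illustrative index values
    p = logistic_cdf(z)
    odds = p / (1.0 - p)
    # The odds equal e^z, and the logit recovers z itself.
    print(f"z={z:5.2f}  P={p:.4f}  odds={odds:.4f}  e^z={math.exp(z):.4f}  logit={logit(p):.4f}")
```

Whatever value the index takes, P stays strictly between 0 and 1, which is the property the LPM lacks.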


Two Types of Data

To estimate the logit model log[p/(1-p)] = Xβ, we have to distinguish two types of data:

--- Data at the individual, or micro, level
--- Grouped or replicated data


Data at the Individual Level

X: family income, Y=1 if the family owns a house and 0 if it does not own a house. The following table gives data on individual families.

FAMILY   Y   X
1        0   8
2        1   16
3        1   18
4        0   11
5        0   12
6        1   19
7        1   20
8        0   13
9        0   9


Grouped or Replicated Data

The following table shows data on several families grouped according to income level and the number of families owning a house at each income level. Corresponding to each income level Xi, there are Ni families, ni among whom are home owners.

Income (X)   N     n
6            40    8
8            50    12
10           60    18
13           80    28
15           100   45
20           70    36
25           65    39
30           50    33
35           40    30
40           25    20


Steps in Estimating the Logit Regression (Grouped Data)

1. For each income level Xi, compute the probability of owning a house as $\hat{P}_i = n_i / N_i$.
2. For each Xi, obtain the logit as $\hat{L}_i = \log[\hat{P}_i / (1 - \hat{P}_i)]$.
3. To resolve the problem of heteroscedasticity, transform the model using the weights $W_i = N_i \hat{P}_i (1 - \hat{P}_i)$:
   $\sqrt{W_i}\,\hat{L}_i = \beta_1 \sqrt{W_i} + \beta_2 \sqrt{W_i}\,X_i + \sqrt{W_i}\,u_i$
   or
   $L_i^* = \beta_1 \sqrt{W_i} + \beta_2 X_i^* + v_i$
4. Estimate the above function by OLS on the transformed data (regression through the origin).
5. Establish confidence intervals and/or test hypotheses in the usual OLS framework.
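The steps above can be sketched in a few lines of Python, assuming the grouped home-ownership table from the "Grouped or Replicated Data" slide. This is an illustrative translation that solves the two normal equations directly; the variable names are chosen here and are not part of the slides.

```python
import math

# Grouped data from the "Grouped or Replicated Data" slide.
income = [6, 8, 10, 13, 15, 20, 25, 30, 35, 40]      # X_i
N = [40, 50, 60, 80, 100, 70, 65, 50, 40, 25]        # families at each income level
n = [8, 12, 18, 28, 45, 36, 39, 33, 30, 20]          # home owners among them

# Steps 1 and 2: estimated probabilities and logits.
p_hat = [ni / Ni for ni, Ni in zip(n, N)]
l_hat = [math.log(p / (1 - p)) for p in p_hat]

# Step 3: weights and transformed variables.
w = [Ni * p * (1 - p) for Ni, p in zip(N, p_hat)]
sqrt_w = [math.sqrt(wi) for wi in w]
l_star = [s * l for s, l in zip(sqrt_w, l_hat)]
x_star = [s * x for s, x in zip(sqrt_w, income)]

# Step 4: OLS of l_star on sqrt_w and x_star with no intercept,
# solved from the 2x2 normal equations.
a11 = sum(s * s for s in sqrt_w)
a12 = sum(s * x for s, x in zip(sqrt_w, x_star))
a22 = sum(x * x for x in x_star)
c1 = sum(s * l for s, l in zip(sqrt_w, l_star))
c2 = sum(x * l for x, l in zip(x_star, l_star))
det = a11 * a22 - a12 ** 2
beta1 = (a22 * c1 - a12 * c2) / det
beta2 = (a11 * c2 - a12 * c1) / det

print(f"beta1 = {beta1:.5f}, beta2 = {beta2:.5f}")
```

The estimates should land close to the SAS results reported in the slides; small differences can come from the rounding applied in the SAS data step.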


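SAS Program

Proc Import Out= Work.incomes
    Datafile= "c:\yan\econ616\DG-15.4.xls";
Run;

data incomes1;
    set incomes;
    phat=n1/n;                              /* estimated probability of owning a house */
    lhat=log(phat/(1-phat));                /* estimated logit */
    w=n*phat*(1-phat);                      /* weight */
    wsquar=sqrt(w);
    lstar=round(lhat*wsquar, 0.0001);       /* transformed logit */
    xstar=round(income*wsquar, 0.0001);     /* transformed income */
run;

proc reg data=incomes1;
    model lstar = wsquar xstar / NOINT;     /* regression through the origin */
run;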


SAS Output

Variable   DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
wsquar     1    -1.59324             0.11150          -14.29    <.0001
xstar      1    0.07867              0.00545          14.44     <.0001

The estimated slope coefficient suggests that for a unit ($1000) increase in weighted income, the weighted log of odds in favor of owning a house goes up by 0.08 units.


Odds Interpretation

The odds ratio:

$\frac{P_i}{1 - P_i} = \frac{1 + e^{Z_i}}{1 + e^{-Z_i}} = e^{Z_i} = e^{-1.59324\sqrt{W_i} + 0.07867 X_i^*}$

For a unit increase in weighted income, the (weighted) odds in favor of owning a house are multiplied by 1.082 (e^0.07867), an increase of about 8.2%.


An Example of Individual Data

In the following table, Y=1 if a student’s final grade in an intermediate microeconomics course was A and Y=0 if the final grade was B or C. GPA, TUCE, and Personalized System of Instruction (PSI) are grade predictors.

OBS   GPA    TUCE   PSI   GRADE   LETTER
1     2.66   20     0     0       C
2     2.89   22     0     0       B
3     3.28   24     0     0       B
4     2.92   12     0     0       B
5     4.00   21     0     1       A
6     2.86   17     0     0       B
7     2.76   17     0     0       B
8     2.87   21     0     0       B


SAS Program

Proc Import Out= Work.gpagrade
    Datafile= "c:\yan\econ616\DG-15.7.xls";
Run;

proc print data=gpagrade;
run;

Proc Logistic data=gpagrade;
    Model grade (event='1') = gpa tuce psi;
run;

/* or */

proc probit data=gpagrade;
    class grade;
    model grade = gpa tuce psi / d=logistic itprint;
run;


Output

                              Standard   Wald
Parameter   DF   Estimate     Error      Chi-Square   Pr > ChiSq
Intercept   1    -13.0204     4.9310     6.9723       0.0083
GPA         1    2.8259       1.2629     5.0072       0.0252
TUCE        1    0.0951       0.1415     0.4518       0.5015
PSI         1    2.3785       1.0645     4.9925       0.0255

Testing Global Null Hypothesis: BETA=0

Test               Chi-Square   DF   Pr > ChiSq
Likelihood Ratio   15.4042      3    0.0015
Score              13.3088      3    0.0040
Wald               8.3762       3    0.0388


Interpretation

Each slope coefficient is a partial slope and measures the change in the estimated logit for a unit change in the value of the given regressor (holding other regressors constant).

Odds interpretation. For example, the odds of getting an A for students who are exposed to the new method of teaching are about 10.79 (e^2.3785) times the odds for students who are not exposed to it, other things remaining the same.
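As a quick check of the odds interpretation, the sketch below exponentiates the slope estimates reported in the logistic output. The dictionary name `coefs` is illustrative; the numbers are copied from the output table.

```python
import math

# Slope estimates from the PROC LOGISTIC output above.
coefs = {"GPA": 2.8259, "TUCE": 0.0951, "PSI": 2.3785}

# exp(coefficient) is the factor by which the odds of an A are multiplied
# for a one-unit increase in the regressor, other regressors held constant.
odds_ratios = {name: math.exp(b) for name, b in coefs.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds multiplied by {ratio:.4f}")
```

The PSI entry reproduces the factor quoted in the interpretation; the same calculation applies to GPA and TUCE.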


What is the Probit Model?

Probit link: p = Φ(h), where Φ is the cumulative distribution function of a standard normal variate.

Pi = P(Y=1|X) = P(Ii* ≤ Ii) = P(Zi ≤ β1+β2Xi) = Φ(β1+β2Xi), where P(Y=1|X) means the probability that an event occurs given the values of X, Ii = β1+β2Xi is the index, Ii* is its threshold level, and Zi ~ N(0,1) is the standard normal variable.

β1+β2Xi = Φ^(-1)(Pi), where Φ^(-1) is the inverse of the normal CDF.


Use of the Probit Model

The probit model is used when Y is considered the “manifestation” of some unobservable Gaussian-distributed latent variable in the data.

For example, the decision of the family to own a house or not depends on an unobservable index I (latent variable), that is determined by one or more explanatory variables, say income X, in such a way that the larger the value of the index I, the greater the probability of a family owning a house.


Probit Estimation with Grouped Data

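Method 1:
1. Calculate $\hat{P}_i = N_{1i} / N_i$, the share of home owners at each income level.
2. Estimate $I_i = \Phi^{-1}(\hat{P}_i)$, where Φ is the standard normal CDF.
3. Estimate β1 and β2 from Ii, i.e., by regressing Ii on Xi: β1+β2Xi = Ii.

Method 2:
Use a SAS or R program directly.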
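Method 1 can be sketched as follows, assuming the same grouped home-ownership table used for the logit example. The slides do not say how the final regression is weighted, so the unweighted OLS in step 3 is an assumption made for this sketch, and its estimates will differ somewhat from maximum likelihood estimates.

```python
from statistics import NormalDist

# Grouped data from the "Grouped or Replicated Data" slide.
income = [6, 8, 10, 13, 15, 20, 25, 30, 35, 40]
N = [40, 50, 60, 80, 100, 70, 65, 50, 40, 25]
N1 = [8, 12, 18, 28, 45, 36, 39, 33, 30, 20]

std_normal = NormalDist()   # mean 0, standard deviation 1

# Step 1: estimated probabilities.
p_hat = [n1 / n for n1, n in zip(N1, N)]

# Step 2: the index I_i is the inverse standard normal CDF of p_hat.
index = [std_normal.inv_cdf(p) for p in p_hat]

# Step 3 (assumption: plain unweighted OLS of I_i on X_i).
k = len(income)
x_bar = sum(income) / k
i_bar = sum(index) / k
sxx = sum((x - x_bar) ** 2 for x in income)
sxy = sum((x - x_bar) * (i - i_bar) for x, i in zip(income, index))
beta2 = sxy / sxx
beta1 = i_bar - beta2 * x_bar

print(f"beta1 = {beta1:.4f}, beta2 = {beta2:.4f}")
```

Method 2 replaces this two-step shortcut with a direct binomial fit using a probit link.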


Program

SAS:

Proc Import Out= Work.incomes
    Datafile= "c:\yan\econ616\DG-15.4.xls";
Run;

proc genmod data=incomes;
    model n1/n = income / dist = bin
                          Link = probit lrci;
run;

R:

# scan() reads the numbers typed below; a blank line ends the input
incomes <- as.data.frame(matrix(scan(), ncol=3, byrow=TRUE))
6 40 8
8 50 12
10 60 18
13 80 28
15 100 45
20 70 36
25 65 39
30 50 33
35 40 30
40 25 20

names(incomes) <- c("income", "N", "N1")
N0 <- incomes$N - incomes$N1   # number of non-owners at each income level
glmA <- glm(cbind(N1, N0) ~ income, incomes, family=binomial(link="probit"))
summary(glmA)   # prints the coefficient table


Output

Coefficients:
             Estimate   Std. Error  z value  Pr(>|z|)
(Intercept)  -0.988138  0.122144    -8.090   5.97e-16 ***
income        0.048587  0.005995     8.105   5.28e-16 ***
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 72.7581 on 9 degrees of freedom
Residual deviance: 2.3456 on 8 degrees of freedom
AIC: 49.002

Number of Fisher Scoring iterations: 3


Interpretation

We want to find out the effect of a unit change in X (income) on the probability that Y=1, that is, a family purchases a house.

1. The rate of change of the probability with respect to income: $\frac{dP_i}{dX_i} = f(\beta_1 + \beta_2 X_i)\,\beta_2$, where f is the standard normal density function.

2. If X=6 (thousand dollars), the value of the normal density function is f[-0.988138 + 0.048587(6)] = f(-0.6966) = 0.313.

3. 0.313*0.048587 = 0.0152. Starting with an income level of $6000, if the income goes up by $1000, the probability of a family purchasing a house goes up by about 0.0152, that is, 1.52 percentage points.
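The arithmetic in these three steps can be reproduced with a short sketch that uses the estimates from the R output. The variable names and the added discrete-change comparison are illustrative and not part of the slides.

```python
from statistics import NormalDist

# Probit estimates from the R output above.
beta1 = -0.988138
beta2 = 0.048587

std_normal = NormalDist()   # standard normal: pdf() is f, cdf() is the CDF

x = 6                                   # income in thousands of dollars
z = beta1 + beta2 * x                   # index at X = 6
density = std_normal.pdf(z)             # f(beta1 + beta2 * X)
marginal_effect = density * beta2       # dP/dX evaluated at X = 6

# For comparison: the exact change in probability when X goes from 6 to 7.
discrete_change = std_normal.cdf(beta1 + beta2 * (x + 1)) - std_normal.cdf(z)

print(f"z = {z:.4f}, f(z) = {density:.4f}")
print(f"marginal effect = {marginal_effect:.4f}")
print(f"discrete change = {discrete_change:.4f}")
```

Because the marginal effect depends on f evaluated at the index, the same calculation at a different income level gives a different answer.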


Probit Model for Individual Data

SAS program:

Proc Import Out= Work.gpagrade
    Datafile= "c:\yan\econ616\DG-15.7.xls";
Run;

proc probit data=gpagrade;
    class grade;
    model grade = gpa tuce psi;
run;


Output

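Analysis of Parameter Estimates

                             Standard   95% Confidence        Chi-
Parameter   DF   Estimate    Error      Limits                Square   Pr > ChiSq
Intercept   1    7.4523      2.5425     2.4692     12.4355    8.59     0.0034
GPA         1    -1.6258     0.6939     -2.9858    -0.2658    5.49     0.0191
TUCE        1    -0.0517     0.0839     -0.2162    0.1127     0.38     0.5375
PSI         1    -1.4263     0.5950     -2.5926    -0.2601    5.75     0.0165

Note: the signs are opposite to those in the logit output. This appears to be because PROC PROBIT, by default, models the probability of the lower ordered response level (here GRADE=0) rather than GRADE=1.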


Marginal Effect of Change in Regressor

Holding the effect of all other variables constant:

1. LPM: the slope coefficient measures directly the change in the probability of an event occurring as a result of a unit change in the value of a regressor.

2. Logit model: the slope coefficient of a variable gives the change in the log of the odds associated with a unit change in that variable. The rate of change in the probability of an event happening is given by βjPi(1-Pi).

3. Probit model: the rate of change in the probability is given by βj f(Xβ), where f is the density function of the standard normal variable.
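The logit and probit expressions above follow from differentiating each model's probability with respect to one regressor. A short derivation, using the same symbols as the slides (with Φ for the standard normal CDF and f for its density):

```latex
% Logit: P_i = 1 / (1 + e^{-X_i\beta})
\frac{\partial P_i}{\partial X_{ij}}
  = \beta_j \, \frac{e^{-X_i\beta}}{\left(1 + e^{-X_i\beta}\right)^2}
  = \beta_j \cdot \frac{1}{1 + e^{-X_i\beta}} \cdot \frac{e^{-X_i\beta}}{1 + e^{-X_i\beta}}
  = \beta_j \, P_i \, (1 - P_i)

% Probit: P_i = \Phi(X_i\beta), and \Phi' = f
\frac{\partial P_i}{\partial X_{ij}} = \beta_j \, f(X_i\beta)
```

In both models the effect depends on where the index Xβ is evaluated, unlike the LPM, where it is the constant βj.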


Logit or Probit?

In most applications, the models are quite similar, the main difference being that the logistic distribution has slightly fatter tails.

There is no compelling reason to choose one over the other.

In practice, many researchers choose the logit model because of its comparative mathematical simplicity.

[Figure: logit and probit cumulative distribution curves, with P ranging from 0 to 1]


Reading

Damodar N. Gujarati, Basic Econometrics, pp. 580-615
