Econ 616 – Spring 2006
Qualitative Response Regression Models
Presented by Yan Hu
04/19/2006
Jan 19, 2018

Outline
Qualitative Response Regression Model
Binary Response Regression Models
1. The Linear Probability Model (LPM)
2. The Logit Model
3. The Probit Model

What is a Qualitative Response Regression Model?
The dependent variable is qualitative (or dummy) in nature.
--- Binary, or dichotomous, response variable: Y=1 if the person is in the labor force and Y=0 if he or she is not.
--- Trichotomous response variable.
--- Polychotomous (or multiple-category) response variable.

Binary Response Regression Models
E(Y) is related to the X's through a link function: $g(E(Y)) = \beta X$.
In binary regression, the link function specifies the relationship between E(Y) (the probability that Y=1, which is also the expected value of Y) and a linear composite score of the X's.

Three Binary Response Regression Models
1. The Linear Probability Model (LPM)
2. The Logit Model
3. The Probit Model

What is the Linear Probability Model?
Y follows the Bernoulli probability distribution:

Yi       Probability
0        1-P
1        P
Total    1

Link function (identity): $E(Y) = 0(1-P) + 1(P) = P$
Expression for the LPM: $P = \beta X$

Problems of LPM (1)
1. Non-normality of the disturbances. In the model
   $Y_i = \beta_1 + \beta_2 X_i + u_i$
   the disturbance $u_i$ takes only two values, so it follows the Bernoulli distribution:

            u_i                            Probability
   Yi=1     $1 - \beta_1 - \beta_2 X_i$    $P_i$
   Yi=0     $-\beta_1 - \beta_2 X_i$       $1 - P_i$

   The problem may not be critical. If the objective is point estimation, the normality assumption for the disturbances is not necessary and the OLS estimators remain unbiased. As the sample size increases indefinitely, the OLS estimators tend to be normally distributed.

Problems of LPM (2)
2. Heteroscedastic variances of the disturbances: $\mathrm{Var}(u_i) = P_i(1-P_i)$, so the variance is a function of the mean ($P_i$).
   One way to resolve the heteroscedasticity is to transform the model by dividing it through by $\sqrt{w_i}$, where $w_i = P_i(1-P_i)$:
   $\frac{Y_i}{\sqrt{w_i}} = \frac{\beta_1}{\sqrt{w_i}} + \beta_2\frac{X_i}{\sqrt{w_i}} + \frac{u_i}{\sqrt{w_i}}$
   Then estimate the transformed equation by OLS.
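The variance formula and the transformation can be illustrated with a minimal Python sketch. The function names and the probability values are hypothetical and chosen only for illustration; in practice the true $P_i$ is unknown and would have to be replaced by an estimate.

```python
import math

def lpm_weight(p):
    """Return sqrt(w) with w = p * (1 - p), the standard deviation of u_i."""
    if not 0.0 < p < 1.0:
        raise ValueError("the weight is defined only for 0 < p < 1")
    return math.sqrt(p * (1.0 - p))

def transform_row(y, x, p):
    """Divide Y_i, the intercept column (1) and X_i by sqrt(w_i)."""
    s = lpm_weight(p)
    return y / s, 1.0 / s, x / s

# Hypothetical probabilities: the disturbance variance changes with P
for p in (0.1, 0.2, 0.5, 0.8):
    print(f"P = {p:.1f}   Var(u) = {p * (1 - p):.2f}")

# One hypothetical observation (Y = 1, X = 16) with an assumed P of 0.5
y_star, const_star, x_star = transform_row(y=1, x=16, p=0.5)
print(y_star, const_star, x_star)
```

The variance is largest at P = 0.5 and shrinks toward 0 as P approaches 0 or 1, which is why observations are reweighted before OLS is applied.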

Problems of LPM (3)
3. Nonfulfillment of $0 \le E(Y_i \mid X_i) \le 1$.
   Two ways of finding out whether the estimated $\hat{Y}_i$ lie between 0 and 1:
   1. Estimate the LPM by the usual OLS method. If some $\hat{Y}_i$ are less than 0, $\hat{Y}_i$ is assumed to be 0 for those cases; if they are greater than 1, they are assumed to be 1.
   2. Devise an estimating technique that guarantees that the estimated conditional probabilities lie between 0 and 1, such as the logit and probit models.
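As a rough illustration of the first approach, the sketch below fits an LPM by OLS to the nine families listed in the individual-level home-ownership table of these slides and then truncates the fitted values to the 0-1 range. The variable names are assumptions, and a two-variable OLS formula is used in place of a statistics package.

```python
# Nine illustrative families: X = income, Y = 1 if the family owns a house
X = [8, 16, 18, 11, 12, 19, 20, 13, 9]
Y = [0, 1, 1, 0, 0, 1, 1, 0, 0]

n = len(X)
x_bar = sum(X) / n
y_bar = sum(Y) / n

# Two-variable OLS: slope = S_xy / S_xx, intercept = y_bar - slope * x_bar
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))
s_xx = sum((x - x_bar) ** 2 for x in X)
b2 = s_xy / s_xx
b1 = y_bar - b2 * x_bar

fitted = [b1 + b2 * x for x in X]
below = sum(f < 0 for f in fitted)   # fitted "probabilities" below 0
above = sum(f > 1 for f in fitted)   # fitted "probabilities" above 1

# Approach 1: set values below 0 to 0 and values above 1 to 1
clipped = [min(1.0, max(0.0, f)) for f in fitted]

for x, f, c in zip(X, fitted, clipped):
    print(f"X = {x:2d}   fitted = {f:7.4f}   truncated = {c:6.4f}")
print("below 0:", below, "  above 1:", above)
```

Even in this tiny sample some fitted values fall outside the unit interval, which is the motivation for the logit and probit models.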

Problems of LPM (4)
4. Questionable value of $R^2$ as a measure of goodness of fit.
   For a given X, the Y values will be either 0 or 1. Therefore, all the Y values lie either along the X-axis or along the line corresponding to Y=1, and in general no LPM can be expected to fit such a scatter well. As a result, the conventionally computed $R^2$ is likely to be much lower than 1 for such models.
   Aldrich and Nelson contend that "use of the coefficient of determination as a summary statistic should be avoided in models with qualitative dependent variable."

What is the Logit Model?
The cumulative logistic distribution: $P = E(Y=1 \mid X) = \frac{1}{1+e^{-\beta X}}$
[Figure: logistic curve of P (vertical axis, from 0 to 1) against X (horizontal axis)]

What is the Logit Model?
From the logistic distribution:
$1-P = \frac{e^{-\beta X}}{1+e^{-\beta X}}$
$\frac{P}{1-P} = e^{\beta X}$ (the odds in favor of the event, called the odds ratio here)
$\log\left[\frac{P}{1-P}\right] = \beta X$
Link function: $g = \log[p/(1-p)]$, where p is the probability of either Y=1 or Y=0, depending on the software.
Generally, $\log[p/(1-p)] = \beta X$.
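The relationship between the logistic CDF, the odds and the logit can be checked numerically. This is a minimal sketch; the function names and the index value 0.8 are hypothetical.

```python
import math

def logistic_cdf(z):
    """P = 1 / (1 + exp(-z)), where z stands for the linear index beta*X."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Link function: the log of the odds p / (1 - p)."""
    return math.log(p / (1.0 - p))

z = 0.8                      # hypothetical value of beta*X
p = logistic_cdf(z)
odds = p / (1.0 - p)

print("P          =", round(p, 4))
print("odds       =", round(odds, 4), " exp(z) =", round(math.exp(z), 4))
print("log(odds)  =", round(logit(p), 4))
```

The odds equal exp(βX) and the logit recovers βX, so the logit is the inverse of the logistic CDF.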

Two Types of Data
To estimate the logit $\log[p/(1-p)] = \beta X$, we have to distinguish two types of data:
--- Data at the individual, or micro, level
--- Grouped or replicated data

Data at the Individual Level
X: family income. Y=1 if the family owns a house and 0 if it does not. The following table gives data on individual families.

FAMILY    Y     X
1         0     8
2         1    16
3         1    18
4         0    11
5         0    12
6         1    19
7         1    20
8         0    13
9         0     9

Grouped or Replicated Data
The following table shows data on several families grouped according to income level, together with the number of families owning a house at each income level. Corresponding to each income level Xi there are Ni families, ni of whom are home owners.

Income (X)     N      n
 6            40      8
 8            50     12
10            60     18
13            80     28
15           100     45
20            70     36
25            65     39
30            50     33
35            40     30
40            25     20

Steps in Estimating the Logit Regression (Grouped Data)
1. For each income level $X_i$, compute the probability of owning a house as $\hat{P}_i = n_i/N_i$.
2. For each $X_i$, obtain the logit as $\hat{L}_i = \log[\hat{P}_i/(1-\hat{P}_i)]$.
3. To resolve the problem of heteroscedasticity, transform the model using the weights $W_i = N_i\hat{P}_i(1-\hat{P}_i)$:
   $\sqrt{W_i}\,\hat{L}_i = \beta_1\sqrt{W_i} + \beta_2\sqrt{W_i}\,X_i + \sqrt{W_i}\,u_i$
   or $L_i^* = \beta_1\sqrt{W_i} + \beta_2 X_i^* + v_i$
4. Estimate the above equation by OLS on the transformed data (regression through the origin).
5. Establish confidence intervals and/or test hypotheses in the usual OLS framework.
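The steps above can be sketched in plain Python using the grouped income data from these slides. The list names are assumptions, and the 2x2 normal equations of the regression through the origin are solved by hand rather than with a regression routine.

```python
import math

# Grouped data from the slides: income level, families (N), home owners (n)
income   = [6, 8, 10, 13, 15, 20, 25, 30, 35, 40]
families = [40, 50, 60, 80, 100, 70, 65, 50, 40, 25]
owners   = [8, 12, 18, 28, 45, 36, 39, 33, 30, 20]

# Steps 1-3: estimated probability, logit and weight for each income level
p_hat = [n1 / n for n1, n in zip(owners, families)]
l_hat = [math.log(p / (1 - p)) for p in p_hat]
w     = [n * p * (1 - p) for n, p in zip(families, p_hat)]

# Step 4: OLS through the origin of sqrt(W)*L on sqrt(W) and sqrt(W)*X.
# The cross-products of the transformed variables reduce to weighted sums.
s_w   = sum(w)
s_wx  = sum(wi * x for wi, x in zip(w, income))
s_wxx = sum(wi * x * x for wi, x in zip(w, income))
s_wl  = sum(wi * l for wi, l in zip(w, l_hat))
s_wxl = sum(wi * x * l for wi, x, l in zip(w, income, l_hat))

det = s_w * s_wxx - s_wx ** 2
b1 = (s_wxx * s_wl - s_wx * s_wxl) / det   # coefficient on sqrt(W)
b2 = (s_w * s_wxl - s_wx * s_wl) / det     # coefficient on X* = sqrt(W)*X

print("beta1 =", round(b1, 4), "  beta2 =", round(b2, 4))
```

The estimates should be close to the coefficients -1.59324 and 0.07867 reported in the SAS output in these slides; small differences can arise because the SAS data step rounds the transformed variables to four decimals.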
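
SAS Program
Proc Import Out= Work.incomes
    Datafile= "c:\yan\econ616\DG-15.4.xls";
Run;

data incomes1;
    set incomes;
    phat   = n1/n;                           /* estimated probability of owning a house */
    lhat   = log(phat/(1-phat));             /* estimated logit */
    w      = n*phat*(1-phat);                /* weight */
    wsquar = sqrt(w);                        /* square root of the weight */
    lstar  = round(lhat*wsquar, 0.0001);     /* transformed logit */
    xstar  = round(income*wsquar, 0.0001);   /* transformed income */
run;

proc reg data=incomes1;
    model lstar = wsquar xstar / NOINT;      /* regression through the origin */
run;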

SAS Output

Variable   DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
wsquar      1        -1.59324            0.11150       -14.29     <.0001
xstar       1         0.07867            0.00545        14.44     <.0001

The estimated slope coefficient suggests that for a unit ($1000) increase in weighted income, the weighted log of the odds in favor of owning a house goes up by about 0.08 units.

Odds Interpretation
The odds ratio:
$\frac{P_i}{1-P_i} = \frac{1+e^{Z_i}}{1+e^{-Z_i}} = e^{Z_i} = e^{-1.59324\sqrt{W_i} + 0.07867 X_i^*}$
For a unit increase in weighted income, the (weighted) odds in favor of owning a house are multiplied by 1.082 ($e^{0.07867}$), that is, they increase by about 8.2%.

An Example of Individual Data
In the following table, Y (GRADE) = 1 if a student's final grade in an intermediate microeconomics course was A and Y = 0 if the final grade was B or C. GPA, TUCE, and Personalized System of Instruction (PSI) are grade predictors.

OBS    GPA    TUCE   PSI   GRADE   LETTER
1      2.66    20     0      0       C
2      2.89    22     0      0       B
3      3.28    24     0      0       B
4      2.92    12     0      0       B
5      4.00    21     0      1       A
6      2.86    17     0      0       B
7      2.76    17     0      0       B
8      2.87    21     0      0       B

SAS Program
Proc Import Out= Work.gpagrade
    Datafile= "c:\yan\econ616\DG-15.7.xls";
Run;

proc print data=gpagrade;
run;

Proc Logistic data=gpagrade;
    Model grade (event='1') = gpa tuce psi;   /* model the probability that grade = 1 */
run;

/* or */
proc probit data=gpagrade;
    class grade;
    model grade = gpa tuce psi / d=logistic itprint;
run;

Output

Parameter   DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
Intercept    1   -13.0204       4.9310            6.9723         0.0083
GPA          1     2.8259       1.2629            5.0072         0.0252
TUCE         1     0.0951       0.1415            0.4518         0.5015
PSI          1     2.3785       1.0645            4.9925         0.0255

Testing Global Null Hypothesis: BETA=0
Test               Chi-Square   DF   Pr > ChiSq
Likelihood Ratio     15.4042     3     0.0015
Score                13.3088     3     0.0040
Wald                  8.3762     3     0.0388

Interpretation
Each slope coefficient is a partial slope: it measures the change in the estimated logit for a unit change in the value of the given regressor, holding the other regressors constant.
Odds interpretation: for example, the odds of getting an A for students who are exposed to the new method of teaching are about 10.79 ($e^{2.3785}$) times the odds for students who are not exposed to it, other things remaining the same.
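The odds interpretation amounts to exponentiating each slope coefficient. The short sketch below does this for the three slope estimates in the logistic output of these slides; the dictionary names are illustrative.

```python
import math

# Slope estimates from the logistic regression output in the slides
coefficients = {"GPA": 2.8259, "TUCE": 0.0951, "PSI": 2.3785}

# exp(coefficient) is the factor by which the odds of an A are multiplied
# for a one-unit increase in the regressor, other regressors held constant
odds_factors = {name: math.exp(b) for name, b in coefficients.items()}

for name, factor in odds_factors.items():
    print(f"{name:5s} odds multiplied by {factor:.4f}")
```

Note that these factors apply to the odds P/(1-P), not to the probability P itself.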
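
What is the Probit Model?
Probit link: $p = \Phi(h)$, where $\Phi$ is the cumulative distribution function of a standard normal variate.
$P_i = P(Y=1 \mid X) = P(I_i^* \le I_i) = P(Z_i \le \beta_1+\beta_2 X_i) = \Phi(\beta_1+\beta_2 X_i)$, where $P(Y=1 \mid X)$ is the probability that the event occurs given the values of X, $I_i = \beta_1+\beta_2 X_i$ is the index, $I_i^*$ is its threshold level, and $Z_i \sim N(0,1)$ is the standard normal variable.
$\beta_1+\beta_2 X_i = \Phi^{-1}(P_i)$, where $\Phi^{-1}$ is the inverse of the normal CDF.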

Use of the Probit Model
The probit model is used when Y is regarded as the "manifestation" of some unobservable, Gaussian-distributed latent variable.
For example, the decision of a family to own a house or not depends on an unobservable index I (the latent variable) that is determined by one or more explanatory variables, say income X, in such a way that the larger the value of the index I, the greater the probability that the family owns a house.

Probit Estimation with Grouped Data
Method 1:
1. Calculate $\hat{P}_i = n_i/N_i$.
2. Obtain $I_i = \Phi^{-1}(\hat{P}_i)$, where $\Phi$ is the standard normal CDF.
3. Estimate β1 and β2 by regressing $I_i$ on $X_i$, i.e., $I_i = \beta_1+\beta_2 X_i$.
Method 2:
Use a SAS or R program directly.
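Method 1 can be sketched with the Python standard library, again using the grouped income data from these slides. Two assumptions are made here: the regression in step 3 is run as plain, unweighted OLS (the slide does not say whether weights are used), and `statistics.NormalDist` supplies the inverse normal CDF.

```python
from statistics import NormalDist

# Grouped data from the slides: income level, families (N), home owners (n)
income   = [6, 8, 10, 13, 15, 20, 25, 30, 35, 40]
families = [40, 50, 60, 80, 100, 70, 65, 50, 40, 25]
owners   = [8, 12, 18, 28, 45, 36, 39, 33, 30, 20]

# Steps 1-2: estimated probabilities and their inverse normal CDF values
std_normal = NormalDist()   # mean 0, standard deviation 1
index = [std_normal.inv_cdf(n1 / n) for n1, n in zip(owners, families)]

# Step 3: unweighted two-variable OLS of the index on income (an assumption)
k = len(income)
x_bar = sum(income) / k
i_bar = sum(index) / k
s_xi = sum((x - x_bar) * (i - i_bar) for x, i in zip(income, index))
s_xx = sum((x - x_bar) ** 2 for x in income)
b2 = s_xi / s_xx
b1 = i_bar - b2 * x_bar

print("beta1 =", round(b1, 4), "  beta2 =", round(b2, 4))
```

The result is close to, but not identical with, the maximum likelihood estimates in the R output shown in these slides (-0.988138 and 0.048587), because the two methods use different estimators.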

Program
SAS:
Proc Import Out= Work.incomes
    Datafile= "c:\yan\econ616\DG-15.4.xls";
Run;

proc genmod data=incomes;
    model n1/n = income / dist = bin
                          Link = probit lrci;
run;

R:
# Columns: income level, number of families (N), number of home owners (N1)
incomes <- as.data.frame(matrix(c(
     6,  40,  8,
     8,  50, 12,
    10,  60, 18,
    13,  80, 28,
    15, 100, 45,
    20,  70, 36,
    25,  65, 39,
    30,  50, 33,
    35,  40, 30,
    40,  25, 20), ncol = 3, byrow = TRUE))
names(incomes) <- c("income", "N", "N1")
incomes$N0 <- incomes$N - incomes$N1   # families that do not own a house
glmA <- glm(cbind(N1, N0) ~ income, data = incomes,
            family = binomial(link = "probit"))
summary(glmA)

Output
Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.988138   0.122144  -8.090 5.97e-16 ***
income       0.048587   0.005995   8.105 5.28e-16 ***
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 72.7581  on 9  degrees of freedom
Residual deviance:  2.3456  on 8  degrees of freedom
AIC: 49.002

Number of Fisher Scoring iterations: 3

Interpretation
We want to find the effect of a unit change in X (income) on the probability that Y=1, that is, that a family purchases a house.
1. The rate of change of the probability with respect to income: $\frac{dP_i}{dX_i} = f(\beta_1+\beta_2 X_i)\,\beta_2$, where f is the standard normal density function.
2. If X=6 (thousand dollars), the normal density is f[-0.988138 + 0.048587(6)] = f(-0.6966) = 0.313.
3. 0.313*0.048587 = 0.0152. Starting from an income level of $6000, if income goes up by $1000, the probability of a family purchasing a house goes up by about 0.0152, that is, about 1.52 percentage points.
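The arithmetic of this calculation can be reproduced directly. The sketch below uses the probit estimates from the R output in these slides and writes out the standard normal density explicitly; the variable names are illustrative.

```python
import math

# Probit estimates from the R output in the slides
b1, b2 = -0.988138, 0.048587
x = 6   # income in thousands of dollars

index = b1 + b2 * x                                             # beta1 + beta2 * X
density = math.exp(-0.5 * index ** 2) / math.sqrt(2 * math.pi)  # f(index)
marginal_effect = density * b2                                  # dP/dX at X = 6

print("index           =", round(index, 4))
print("density         =", round(density, 3))
print("marginal effect =", round(marginal_effect, 4))
```

Because the density depends on X, the marginal effect differs at every income level; the value computed here holds only at X = 6.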

Probit Model for Individual Data
SAS program:
Proc Import Out= Work.gpagrade
    Datafile= "c:\yan\econ616\DG-15.7.xls";
Run;

proc probit data=gpagrade;
    class grade;
    model grade = gpa tuce psi;
run;
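
Output
Analysis of Parameter Estimates

Parameter   DF   Estimate   Standard Error   95% Confidence Limits   Chi-Square   Pr > ChiSq
Intercept    1     7.4523       2.5425         2.4692    12.4355        8.59        0.0034
GPA          1    -1.6258       0.6939        -2.9858    -0.2658        5.49        0.0191
TUCE         1    -0.0517       0.0839        -0.2162     0.1127        0.38        0.5375
PSI          1    -1.4263       0.5950        -2.5926    -0.2601        5.75        0.0165

Note: the signs are the reverse of those of the logit estimates for the same data. This is consistent with PROC PROBIT modeling, by default, the probability of the lower-ordered response level (GRADE=0) rather than GRADE=1.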

Marginal Effect of a Change in a Regressor
Holding the effect of all other variables constant:
1. LPM: the slope coefficient directly measures the change in the probability of an event occurring as a result of a unit change in the value of a regressor.
2. Logit model: the slope coefficient of a variable gives the change in the log of the odds associated with a unit change in that variable. The rate of change in the probability of the event happening is given by $\beta_j P_i(1-P_i)$.
3. Probit model: the rate of change in the probability is given by $\beta_j f(X\beta)$, where f is the density function of the standard normal variable.
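The three formulas can be placed side by side in a small sketch. The function names, the coefficient value and the index values are hypothetical and serve only to show how the logit and probit effects vary with the index while the LPM effect stays constant.

```python
import math
from statistics import NormalDist

def lpm_effect(beta_j):
    """LPM: the marginal effect is the slope coefficient itself."""
    return beta_j

def logit_effect(beta_j, xb):
    """Logit: beta_j * P * (1 - P), with P the logistic CDF at the index xb."""
    p = 1.0 / (1.0 + math.exp(-xb))
    return beta_j * p * (1.0 - p)

def probit_effect(beta_j, xb):
    """Probit: beta_j * f(xb), with f the standard normal density."""
    return beta_j * NormalDist().pdf(xb)

beta_j = 1.0   # hypothetical slope coefficient
for xb in (-2.0, 0.0, 2.0):   # hypothetical values of the index X*beta
    print(f"index = {xb:4.1f}   LPM = {lpm_effect(beta_j):.4f}   "
          f"logit = {logit_effect(beta_j, xb):.4f}   "
          f"probit = {probit_effect(beta_j, xb):.4f}")
```

In both nonlinear models the effect is largest where the index is 0 (P = 0.5) and shrinks in the tails, so a marginal effect must always be reported at a stated value of the regressors.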

Logit or Probit?
In most applications the two models give quite similar results; the main difference is that the logistic distribution has slightly fatter tails.
There is no compelling reason to choose one over the other.
In practice, many researchers choose the logit model because of its comparative mathematical simplicity.
[Figure: logit and probit cumulative distribution curves, P rising from 0 to 1]
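The difference in the tails can be made concrete with a small numerical comparison. As an assumption made only for this sketch, the logistic distribution is rescaled to unit variance (its standard deviation is π/√3) so that the two CDFs are compared on the same scale.

```python
import math
from statistics import NormalDist

# Standard deviation of the standard logistic distribution
LOGISTIC_SD = math.pi / math.sqrt(3)

def logistic_cdf_unit_variance(z):
    """CDF of a logistic distribution rescaled to have variance 1."""
    return 1.0 / (1.0 + math.exp(-z * LOGISTIC_SD))

normal = NormalDist()   # standard normal

for z in (-3.0, -2.0, -1.0, 0.0):
    print(f"z = {z:4.1f}   logit CDF = {logistic_cdf_unit_variance(z):.5f}   "
          f"probit CDF = {normal.cdf(z):.5f}")
```

Far from the center the logistic CDF keeps more probability in the tail than the normal CDF, while near the center the two curves are close, which is why the two models usually lead to similar conclusions.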

Reading
Damodar N. Gujarati, Basic Econometrics, pp. 580-615