Lecture 18, Page 1 of 9
Simple Regression Model (Assumptions)
Reading: Sections 18.1, 18.2, “Logarithms in Regression Analysis with Asiaphoria,” 19.6 – 19.8 (Optional: “Normal probability plot” pp. 607-8)
Remember Regression?
[Figure: scatterplot of Height son (inches) against Height father (inches) with OLS line]
son_hat = 33.887 + 0.514*father; n = 1078, R2 = 0.251, s_e = 2.437
OLS intercept 33.887: No interpretation b/c father cannot be 0 inches tall
OLS slope 0.514: For every extra 1 inch of father’s height, son is on average about ½ inch taller
y-hat: Predicted y, given x; e.g. son of a 72 inch tall father predicted to be 70.895 inches (= 33.887 + 0.514*72)
e (residual): e = y - y-hat; e.g. if y-hat is 70.895 but y is 68.531, then residual is -2.364 inches
s_e (s.d. of residuals) = 2.437 inches: measures scatter about OLS line
R2 = 0.251: 25.1% of variation in sons’ heights explained by variation in their fathers’ heights
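The prediction and residual arithmetic above can be checked with a short sketch. The coefficients and the 72-inch / 68.531-inch example come from the slide; the function and variable names are illustrative only.

```python
# Fitted OLS line from the slide: son_hat = 33.887 + 0.514 * father
INTERCEPT = 33.887
SLOPE = 0.514

def predict_son_height(father_inches):
    """Predicted son's height (y-hat), in inches, for a given father's height."""
    return INTERCEPT + SLOPE * father_inches

y_hat = predict_son_height(72)  # prediction for a 72-inch father
y_obs = 68.531                  # observed son's height in the slide's example
residual = y_obs - y_hat        # e = y - y-hat

print(f"y-hat    = {y_hat:.3f} inches")
print(f"residual = {residual:.3f} inches")
```

A negative residual means the observed son is shorter than the line predicts for his father's height.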
Descriptive & Inferential Statistics
• Chap. 6: Scatterplots, Association, and Correlation
• Chap. 7: Introduction to Linear Regression
• Simple Reg.: Chaps. 18 & 19 (Inference for Regression & Understanding Regression Residuals)
• Multiple Reg.: Chaps. 20 & 21 (Multiple Regression & Building Multiple Regression Models)
BUT, multiple regression is also a new way to describe data: descriptive statistics
Questions and Data: Still Important
• Which kind of question?
– Research question: What is causal effect of a change in X (e.g. match) on Y (e.g. amount given)
– Descriptive question: what are patterns in data (e.g. how does household spending on food vary with income?)
• Which kind of data?
– Observational or experimental
  • Correlation ≠ causation is a cliché
  • Instead, apply understanding of data and specific context to fully interpret quantitative results
– Cross-sectional, time series, or panel
Which kind of data are these?
Source: “The Economic Impact of Universities: Evidence from Across the Globe” a 2016 NBER Working Paper http://www.nber.org/papers/w22501
On this slide, what is review of Fall-term material and what is new material?
“Unsurprisingly, we find that higher university density is associated with higher GDP per capita levels.” p. 9 Why associated instead of correlated? Also, does quote imply causality?
“It is interesting that countries with more universities in 1960 generally had higher growth rates over 1960-2000.” p. 9 But is it a strong association?
X-variable is defined as Log(1 + universities per million people)
• Logs can straighten curved scatter plot
– Plus one addresses countries with 0 universities
– Example 1: x-value of 1 is a country with 1.72 universities per million: ln(1 + 1.72) ≈ 1
  • E.g. 10 universities w/ pop. 5.82 million: 1.72 ≈ 10/5.82
– Example 2: x-value of 3 is a country with 19.09 universities per million: ln(1 + 19.09) ≈ 3
  • E.g. 25 universities w/ pop. 1.31 million: 19.09 ≈ 25/1.31
– University density is over 11 times bigger in Example 2, but x-value only 3 times as big (diminishing returns)
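The two examples can be reproduced with a small sketch of the ln(1 + density) transformation. The university counts and populations are the ones on the slide; the function name and the zero-university country are hypothetical, added only to show why the "plus one" matters.

```python
import math

def x_value(universities, population_millions):
    """ln(1 + universities per million people), the x-variable on the slide."""
    density = universities / population_millions
    return math.log(1 + density)

x1 = x_value(10, 5.82)  # Example 1: about 1.72 universities per million
x2 = x_value(25, 1.31)  # Example 2: about 19.09 universities per million
x0 = x_value(0, 3.0)    # hypothetical country with no universities

# Compare how much bigger Example 2 is in raw density versus in x-value
density_ratio = (25 / 1.31) / (10 / 5.82)
x_ratio = x2 / x1

print(f"x1 = {x1:.3f}, x2 = {x2:.3f}, x0 = {x0:.3f}")
print(f"density ratio = {density_ratio:.1f}, x-value ratio = {x_ratio:.1f}")
```

Without the plus one, the zero-university country would require ln(0), which is undefined; with it, that country simply gets an x-value of 0.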
How to interpret b=2.64?
On average, countries with a university density (universities per million people) that is 10% higher have about 0.264 more average years of schooling (≈ 2.64 × 0.10).
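A rough sketch of where the 0.264 comes from, assuming y is in levels and x is in natural logs. The slope b = 2.64 is from the slide; the comparison with the exact log difference is an added illustration. Strictly, the 10% here applies to the logged quantity, which on the previous slide is 1 + density rather than density itself.

```python
import math

b = 2.64           # slope from the slide (y in levels, x in natural logs)
pct_change = 0.10  # the logged quantity is 10% higher

# Rule of thumb: change in y is about b times the proportional change in x
approx_change = b * pct_change

# Exact change implied by the fitted line: b * [ln(1.10 * x) - ln(x)]
exact_change = b * math.log(1 + pct_change)

print(f"rule of thumb: {approx_change:.3f} years")
print(f"exact:         {exact_change:.3f} years")
```

The rule of thumb is closest for small percentage changes and drifts further from the exact figure as the percentage grows.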
Frozen Pizza (p. 627)
• How does the volume of sales depend on the price of frozen pizza?
– What is the economic name of this relationship?
• Weekly data on price and quantity for each of four cities (1994 – 1996); 156 weeks
– Raw data: ch18_MCSP_Frozen_Pizza.csv
– Are these data observational?
– Cross-sectional, time series, or panel?
Demand Shifters
[Figure: demand diagram; axes: Quantity Demanded, Price]
Observational Data
What are some demand shifters for frozen pizza?
Why do these demand shifters affect the price the firm chooses to set?
[Figure: scatterplot of Quantity (10,000s) against Price ($1s); Denver, 1994-96, n = 156 weeks]
Frozen Pizza: OLS
• r = -0.7697
• R2 = 0.5924
• Q-hat = 18.12 - 5.28*P
• Interpret the line?
– Is the OLS line an estimate of the demand equation?
Simple Linear Regression: One x-variable
• Model: y_i = α + β*x_i + ε_i
– y_i: dependent var., regressand, y-var., LHS-var.
– x_i: independent var., regressor, explanatory var., x-var., RHS-var. (i.e. right-hand side variable)
– i: observation index (often i for cross-sectional data; t for time series data; or it for panel data)
– α: intercept (constant) parameter
– β: slope parameter
– ε_i: error term, residual, disturbance
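A minimal simulation sketch of the simple linear regression model: draw x and an error term, build y from an intercept and slope, then recover estimates with the usual OLS formulas. The parameter values, sample size, and seed are hypothetical choices for illustration, not values from the lecture.

```python
import random

def ols(x, y):
    """Return (intercept, slope) from the least-squares formulas."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

random.seed(0)
TRUE_INTERCEPT, TRUE_SLOPE, SIGMA = 2.0, 0.5, 1.0  # hypothetical parameters
n = 5000

x = [random.uniform(0, 10) for _ in range(n)]
errors = [random.gauss(0, SIGMA) for _ in range(n)]  # unobserved in real data
y = [TRUE_INTERCEPT + TRUE_SLOPE * xi + ei for xi, ei in zip(x, errors)]

a, b = ols(x, y)
print(f"estimated intercept a = {a:.3f} (true {TRUE_INTERCEPT})")
print(f"estimated slope     b = {b:.3f} (true {TRUE_SLOPE})")
```

In a simulation the errors are known because we generated them; with real data only the residuals from the fitted line are available.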
Error term in y_i = α + β*x_i + ε_i
• ε_i includes all other factors that affect y_i aside from x_i
– Impossible to collect data on everything: some variables unobserved to the researcher
– It reflects reality: model cannot control for everything
[Figure: scatter about the population line, annotated as follows]
– Line is expected value: E[y_i | x_i] = α + β*x_i
– Error explains deviations from expectations
In the above graph, is ε_i positive or negative?
Assumptions Tame Elusive Epsilon
• We cannot observe ε_i but we can observe e_i
– Notice how many of the six assumptions are about the unobservable ε_i
  • Some assumptions can be checked by analyzing e_i (the statistic tied to the parameter ε_i), but some cannot
• In general, models make assumptions about unknowns
– For example, a model could assume the outcome of the roll of a die follows a discrete Uniform distribution: i.e. it’s fair with a 1/6 probability of each outcome {1, 2, 3, 4, 5, 6}
Six Assumptions
• Six assumptions of the linear regression model
– Your book gives only four assumptions:
  • I suspect it leaves one out because it is fairly obvious
  • I am sure the other one is left out because it is only required if you wish to make a causal interpretation
  • To minimize confusion, number the extra two as 5 & 6
– Econometrics addresses violations in assumptions
  • ECO374H Applied Econometrics (Commerce)
  • ECO375H & ECO475H Applied Econometrics I & II
Assumption #1
• Regression equation is linear in the error and parameters; the variables (in boxes) are linearly related to each other
– Not assuming that what is in boxes is linear (so long as no nonlinear functions of parameters or nonlinear functions of the error)
• Example of a linear regression: y_i = α + β*x_i + ε_i
• Example of a linear regression: ln(y_i) = α + β*ln(x_i) + ε_i
Frozen Pizza (Chapter 18)
[Figure: scatterplot of Weekly Q (10,000s) against Weekly price ($1s) with OLS line; Frozen Pizza, Denver, 1994-96]
Q-hat = 18.122 - 5.280*P; n = 156, R2 = 0.592, s.e.(b) = 0.353
[Figure: residuals (e) against Q_hat]
Which violations can we see?
Natural Log Transformations
[Figure: residuals (e) against ln(Q)_hat]
[Figure: scatterplot of ln(Weekly Q, 10,000s) against ln(Weekly price, $1s) with OLS line; Frozen Pizza, Denver, 1994-96]
ln(Q)-hat = 4.095 - 2.773*ln(P); n = 156, R2 = 0.631, s.e.(b) = 0.171
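One way to read the log-log line, sketched under the assumption that it is used only descriptively: the slope is approximately the percentage difference in quantity associated with a 1% difference in price. The coefficients are from the slide; the starting price of $2.50 and all names are hypothetical.

```python
import math

# Fitted log-log line from the slide: ln(Q)-hat = 4.095 - 2.773 * ln(P)
INTERCEPT = 4.095
SLOPE = -2.773

def predicted_ln_q(price):
    """Predicted ln(quantity) at a given price."""
    return INTERCEPT + SLOPE * math.log(price)

# Rule of thumb: a 1% higher price goes with about SLOPE percent change in Q
approx_pct = SLOPE * 1.0

# Exact percentage change implied by the fitted line for a 1% higher price
price = 2.50  # hypothetical starting price; the result does not depend on it
change_in_ln_q = predicted_ln_q(price * 1.01) - predicted_ln_q(price)
exact_pct = (math.exp(change_in_ln_q) - 1) * 100

print(f"rule of thumb: {approx_pct:.2f}%")
print(f"exact:         {exact_pct:.2f}%")
```

Whether this slope can be read as a demand elasticity depends on the remaining assumptions, in particular whether price is related to the error term.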
Assumption #2
• No autocorrelation / no serial correlation: Cov(ε_i, ε_j) = 0 if i ≠ j
– Common problem in time-series data
  • E.g. higher than expected inflation today, likely high tomorrow
– Errors assumed not systematically related across observations
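A simulation sketch of what this assumption rules out. It generates one series of independent errors and one series where each error carries over part of the previous one, then compares their lag-1 sample autocorrelations. The carry-over coefficient of 0.8, the sample size, and the seed are hypothetical.

```python
import random

def lag1_autocorr(series):
    """Sample correlation between each value and the one before it."""
    n = len(series)
    mean = sum(series) / n
    num = sum((series[t] - mean) * (series[t - 1] - mean) for t in range(1, n))
    den = sum((v - mean) ** 2 for v in series)
    return num / den

random.seed(1)
n = 2000
RHO = 0.8  # hypothetical carry-over from one period's error to the next

# Assumption #2 holds: errors drawn independently
independent = [random.gauss(0, 1) for _ in range(n)]

# Assumption #2 violated: today's error depends on yesterday's
persistent = [random.gauss(0, 1)]
for t in range(1, n):
    persistent.append(RHO * persistent[-1] + random.gauss(0, 1))

r_independent = lag1_autocorr(independent)
r_persistent = lag1_autocorr(persistent)
print(f"lag-1 autocorrelation, independent errors: {r_independent:.3f}")
print(f"lag-1 autocorrelation, persistent errors:  {r_persistent:.3f}")
```

With real data the same statistic would be computed on the residuals, ordered by time, since the errors themselves are unobserved.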
[Figure: residual (e) against t; panel title: Assumption #2 Holds, No Autocorrelation]
[Figure: residual (e) against t; panel title: Assumption #2 Violated, Positive Autocorrelation]
Assumption #3
• Homoscedasticity: Var(ε_i) = σ², i = 1, …, n
– “Equal variance assumption”
– Error is just as “noisy” for all values of x
– Violation is called heteroscedasticity
– Common problem in cross-sectional data
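A simulation sketch contrasting equal and unequal error variance. One set of errors has the same spread at every x; the other has a spread that grows with x. Comparing the standard deviation of the errors for high-x and low-x observations gives a crude numerical version of what a residual plot shows. All numbers here are hypothetical.

```python
import random
import statistics

random.seed(2)
n = 4000
x = [random.uniform(0, 10) for _ in range(n)]

# Homoscedastic: same noise level at every x
homo_errors = [random.gauss(0, 1.0) for _ in x]

# Heteroscedastic: noise level grows in proportion to x
hetero_errors = [random.gauss(0, 0.3 * xi) for xi in x]

def spread_ratio(x_values, errors, cutoff=5.0):
    """S.d. of errors where x >= cutoff divided by s.d. where x < cutoff."""
    low = [e for xi, e in zip(x_values, errors) if xi < cutoff]
    high = [e for xi, e in zip(x_values, errors) if xi >= cutoff]
    return statistics.pstdev(high) / statistics.pstdev(low)

homo_ratio = spread_ratio(x, homo_errors)
hetero_ratio = spread_ratio(x, hetero_errors)
print(f"spread ratio, homoscedastic errors:   {homo_ratio:.2f}")
print(f"spread ratio, heteroscedastic errors: {hetero_ratio:.2f}")
```

A ratio near 1 is consistent with equal variance; a ratio well above 1 corresponds to the fan shape in a heteroscedastic scatterplot.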
[Figure: scatterplot of Y2 against X2; panel title: Heteroscedasticity]
[Figure: scatterplot of Y1 against X1; panel title: Homoscedasticity]
Fix Assumption #1 issues before checking Assumption #3
[Figure: residuals (e) against ln(Q)_hat]
[Figure: scatterplot of ln(Weekly Q, 10,000s) against ln(Weekly price, $1s) with OLS line; Frozen Pizza, Denver, 1994-96]
ln(Q)-hat = 4.095 - 2.773*ln(P); n = 156, R2 = 0.631, s.e.(b) = 0.171
Heteroscedasticity – unequal variance of the residuals – is often a byproduct of a violation of the linearity assumption
Remember that Chapter 18 advises you to check the assumptions in order: start with the linearity assumption
Is Denver pizza regression an example?
Assumptions #4 & #5
• Galton’s data (Lec. 5)
– Assumptions 1-3 hold?
• Normality: ε_i is Normal
– ε_i is unobserved so check e_i
• Error has mean zero: E[ε_i] = 0, i = 1, …, n
– Constant term (i.e. α or β0) picks up any constant effects, not the error
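A sketch of why a mean-zero error is a mild assumption once a constant is included. The simulated "other factors" below have a mean of 3 rather than 0; OLS with an intercept absorbs that 3 into the estimated constant, and the residuals still average zero. All parameter values are hypothetical.

```python
import random

def ols(x, y):
    """Return (intercept, slope) from the least-squares formulas."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

random.seed(4)
n = 2000
x = [random.uniform(0, 10) for _ in range(n)]

# Other factors with a NONZERO mean of 3 (hypothetical)
other_factors = [random.gauss(3, 1) for _ in range(n)]
y = [1.0 + 2.0 * xi + u for xi, u in zip(x, other_factors)]

a, b = ols(x, y)
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
mean_residual = sum(residuals) / n

print(f"estimated constant a = {a:.2f}  (1 + 3 = 4: constant absorbs the mean)")
print(f"estimated slope    b = {b:.2f}")
print(f"mean residual        = {mean_residual:.2e}")
```

The slope estimate is unaffected by the nonzero mean; only the constant shifts.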
[Figure: scatterplot of Son (inches) against Father (inches); Galton's Heights, n = 1078]
[Figure: histogram (density) of residuals (e); n = 1078]
Graphical Summary
[Figure: graphical summary of the model assumptions]
Assumptions #3, #4, and #5 combined: ε_i ~ N(0, σ²)
Would the elements in the population (not shown) lie on the line?
Is ε_i a reflection of sampling error?
2017 ON Public Sector Disclosure of 2016 salaries for University of Waterloo employees
Sex   n     Mean       S.d.
F     416   $139.74K   $33.74K
M     941   $155.36K   $36.96K
OLS Results: Salary-hat = 139.74 + 15.62*Male; R2 = 0.0385, n = 1,357, s_e = 36.006
Do Assumptions #1 - #5 hold?
[Figure: Salary and Salary-hat against Male (0 to 1); Waterloo, 2016 Salaries]
[Figure: Salary (1,000's CAN$) for Male=0 and Male=1; Waterloo, 2016 Salaries]
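In the Waterloo results, the intercept equals the mean salary for women and the slope equals the difference in means (155.36 - 139.74 = 15.62). The sketch below shows that this holds in general when the only x-variable is a 0/1 dummy, using a tiny made-up data set rather than the actual salary data.

```python
def ols(x, y):
    """Return (intercept, slope) from the least-squares formulas."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Hypothetical salaries in $1,000s (NOT the Waterloo data)
salary_f = [130, 140, 150]
salary_m = [150, 160, 170, 180]

male = [0] * len(salary_f) + [1] * len(salary_m)  # dummy: 1 if male, else 0
salary = salary_f + salary_m

a, b = ols(male, salary)
mean_f = sum(salary_f) / len(salary_f)
mean_m = sum(salary_m) / len(salary_m)

print(f"intercept a = {a:.2f}, mean for Male=0 = {mean_f:.2f}")
print(f"slope     b = {b:.2f}, difference in means = {mean_m - mean_f:.2f}")

# Same check using the group means reported on the slide
slide_slope = 155.36 - 139.74
print(f"slide: 155.36 - 139.74 = {slide_slope:.2f}")
```

Because x takes only two values, the fitted line just connects the two group means.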
Assumption #6
• x uncorrelated w/ error: Cov(x_i, ε_i) = 0
– Exogeneity: x variable(s) unrelated with error
  • Dosage is exogenous:
  • Experimental data can est. causal effect:
– Endogeneity: x variable(s) related with error
  • With observational data, lurking/unobserved/omitted/confounding variables mean x and error are related
  • Price of pizza is endogenous:
  • Endogeneity bias means:
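A simulation sketch of exogeneity versus endogeneity in a pizza-style setting. In the "observational" data the seller sets a higher price when demand is unexpectedly strong, so price is related to the error; in the "experimental" data price is set at random. The true slope of -5, the pricing rule, and every other number are hypothetical, chosen only to make the contrast visible.

```python
import random

def ols_slope(x, y):
    """Least-squares slope of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx

random.seed(3)
n = 5000
TRUE_SLOPE = -5.0  # hypothetical causal effect of price on quantity

# Demand shifters the researcher does not observe (the error term)
demand_shock = [random.gauss(0, 1) for _ in range(n)]

# Observational data: price responds to the demand shock (endogenous)
price_obs = [2.5 + 0.1 * e + random.gauss(0, 0.2) for e in demand_shock]
q_obs = [20 + TRUE_SLOPE * p + e for p, e in zip(price_obs, demand_shock)]

# Experimental data: price set at random, unrelated to the shock (exogenous)
price_exp = [2.5 + random.gauss(0, 0.2) for _ in range(n)]
q_exp = [20 + TRUE_SLOPE * p + e for p, e in zip(price_exp, demand_shock)]

b_obs = ols_slope(price_obs, q_obs)
b_exp = ols_slope(price_exp, q_exp)
print(f"true slope:               {TRUE_SLOPE:.2f}")
print(f"OLS, observational price: {b_obs:.2f}")
print(f"OLS, randomized price:    {b_exp:.2f}")
```

Under this pricing rule the observational slope is pulled toward zero, so quantity looks less sensitive to price than it truly is; the randomized-price slope lands close to the true value.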
“Short-Hand” Assumptions
1) Linear relationship between variables (possibly non-linearly transformed)
2) No correlation amongst errors (no autocorrelation for time-series data)
3) Homoscedasticity (single variance) of errors
4) Normally distributed errors
5) Constant included (error has mean 0)
6) No relationship between x and error