Part 16: Linear Regression
Statistics and Data Analysis
Professor William Greene
Stern School of Business
IOMS Department
Department of Economics
Apr 02, 2015
Sales Population - semilogIncome Demographics
- Box Jenkins
A Regression Analysis that People Really Cared About
The Year 2000 World Health Report by WHO
http://www.who.int/whr/2000/en
Health Care System Performance
New York Times, Page 1, June 21, 2000
That Number 37 Ranking
What is the source? What is it? Ranking of what? And why are we looking at it in our class on Statistics and Data Analysis?
Two reasons: it is interesting, and it is an application of regression analysis.
The Source Behind the News
http://www.who.int/entity/healthinfo/paper30.pdf
What Did They Study?
The standard measure of health care success is Disability Adjusted Life Expectancy (DALE).
The WHO Researchers Were Interested in a Broader Measure
These are the items listed in the NYT editorial.
They Created a Measure COMP = Composite Index
“In order to assess overall efficiency, the first step was to combine the individual attainments on all five goals of the health system into a single number, which we call the composite index. The composite index is a weighted average of the five component goals specified above. First, country attainment on all five indicators (i.e., health, health inequality, responsiveness-level, responsiveness-distribution, and fair-financing) were rescaled restricting them to the [0,1] interval. Then the following weights were used to construct the overall composite measure: 25% for health (DALE), 25% for health inequality, 12.5% for the level of responsiveness, 12.5% for the distribution of responsiveness, and 25% for fairness in financing. These weights are based on a survey carried out by WHO to elicit stated preferences of individuals in their relative valuations of the goals of the health system.”
(From the WHO Technical Report)
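The weighted average described in the quotation can be sketched in a few lines of Python. The weights are the ones quoted from the technical report; the function name, the dictionary keys, and the five example scores are hypothetical and serve only to show the arithmetic, not to reproduce WHO's calculation for any country.

```python
# Weights quoted in the WHO technical report (they sum to 1.0).
WEIGHTS = {
    "health": 0.25,                        # DALE
    "health_inequality": 0.25,
    "responsiveness_level": 0.125,
    "responsiveness_distribution": 0.125,
    "fair_financing": 0.25,
}


def composite_index(scores):
    """Weighted average of five indicators already rescaled to [0, 1]."""
    for name, value in scores.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be rescaled to [0, 1], got {value}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical country: these five values are made up for illustration.
example_scores = {
    "health": 0.8,
    "health_inequality": 0.6,
    "responsiveness_level": 0.4,
    "responsiveness_distribution": 0.8,
    "fair_financing": 0.5,
}
comp = composite_index(example_scores)
print(f"COMP = {comp:.3f}")
```

Because the weights sum to one and each input lies in [0, 1], the composite index also lies in [0, 1].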
Did They Rank Countries by COMP?
Yes, but that was not what produced the number 37 ranking!
So, What is Going On?
A Model: Health Care Output = a function of Health Care Inputs
OUTPUT = COMP
INPUTS = Health Care Spending and Education of the Population
The WHO COMP Equation
\[
\begin{aligned}
\log \mathrm{COMP}_i &= \text{Maximum Attainable}_i - \text{Inefficiency}_i \\
&= \alpha + \beta_1 \log \mathrm{HealthExp}_i + \beta_2 \log \mathrm{Educ}_i + \beta_3 (\log \mathrm{Educ}_i)^2 + e_i, \\
i &= 1, \dots, 191 \text{ countries}
\end{aligned}
\]
Estimated Model
[Figure: estimation output with the estimates of α, β1, β2 and β3 marked]
The Best a Country Could Do vs. What They Actually Do
The US Ranked 37th!
Countries were ranked by overall efficiency
Linear Regression
- Correlation (and vs. causality)
- Examining correlation
  - Descriptive: Relationship between variables
  - Predictive: Use values of one variable to predict another.
  - Control: Should a firm increase R&D?
  - Understanding: What is the elasticity of demand for our product? (Should we raise our price?)
- The regression relationship
Positive Correlation and Regression
[Figure: Expected Number of Real Estate Cases Given Number of Financial Cases. Horizontal axis: Financial Cases (0, 1, 2); vertical axis ticks from 1.9 to 2.4.]
The “regression of R on F”
Correlation of Home Prices with Other Factors
What explains the pattern? Is the distribution of average listing prices random?
[Figure: Scatterplot of Listing vs IncomePC]
Regression
- Modeling and understanding correlation
- “Change in y” is associated with “change in x”
  - How do we know this?
  - What can we infer from the observation?
- Causality and correlation: see http://en.wikipedia.org/wiki/Causality, especially “Probabilistic Causation,” about halfway down the article.
Correlation – Education and Life Expectancy
[Figure: Scatterplot of DALE vs EDUC, with points grouped by OECD (0 or 1)]
Causality? Correlation? Does more education make people live longer? A hidden driver of both? (GDPC)
Menu path: Graph > Scatterplots > With Groups. The categorical variable is OECD.
Useful Description(?)
Scatter plot of box office revenues vs. number of “Can’t Wait To See It” votes on Fandango for 62 movies. What do we learn from the figure? Is the “relationship” convincing? Valid? (Real?)
More Movie Madness
[Figure, two panels, each a scatterplot of Overseas vs Domestic box office. One panel shows the 500 biggest movies up to 2003 (Domestic axis 0 to 600, Overseas axis 0 to 1400); the other shows the 499 biggest movies up to 2003 (Domestic axis 0 to 500, Overseas axis 0 to 700).]
Did domestic box office success help to predict foreign box office success?
Note the influence of an outlier.
Data file: Movies.mtp
Average Box Office by Internet Buzz Index
[Figure: each plotted point is the average Box Office for the movies whose Buzz falls in an interval]
Correlation
Is there a conditional expectation?
The data suggest that the average of Box Office increases as Buzz increases.
Average Box Office = f(Buzz) is the “Regression of Box Office on Buzz”
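The idea of a conditional expectation can be made concrete by averaging Box Office within intervals of Buzz, as the preceding figure does. The sketch below assumes nothing about the course data: the `binned_means` helper and the eight (Buzz, Box Office) pairs are made up for illustration.

```python
from collections import defaultdict


def binned_means(x, y, width):
    """Average y within each interval [k*width, (k+1)*width) of x."""
    groups = defaultdict(list)
    for xi, yi in zip(x, y):
        groups[int(xi // width)].append(yi)
    return {
        (k * width, (k + 1) * width): sum(values) / len(values)
        for k, values in sorted(groups.items())
    }


# Made-up (Buzz, Box Office) pairs, for illustration only.
buzz = [0.5, 0.8, 1.2, 1.9, 2.1, 2.7, 3.3, 3.6]
box_office = [10.0, 14.0, 20.0, 24.0, 30.0, 38.0, 41.0, 47.0]

means = binned_means(buzz, box_office, width=1.0)
for (low, high), average in means.items():
    print(f"Buzz in [{low}, {high}): average Box Office = {average}")
```

If the interval averages rise as Buzz rises, the data suggest an increasing regression function f(Buzz); a fitted line is one simple summary of that pattern.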
Is There Really a Relationship?
BoxOffice is obviously not equal to f(Buzz) for some function. But, they do appear to be “related,” perhaps statistically – that is, stochastically. There is a correlation. The linear regression summarizes it.
A predictor would be Box Office = a + b Buzz. Is b really > 0? What would be implied by b > 0?
Using Regression to Predict
[Figure: Fitted line plot, regression of foreign box office on domestic, showing the regression line and the 95% prediction interval (PI). Overseas = 6.693 + 1.051 Domestic; S = 73.0041, R-Sq = 52.2%, R-Sq(adj) = 52.1%.]
Predictor: Overseas = a + b Domestic. The prediction will not be perfect. We construct a range of “uncertainty.”
Menu path: Stat > Regression > Fitted Line Plot
Options: Display Prediction Interval
The equation would not predict Titanic.
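A short sketch of how the fitted equation is used as a predictor. The intercept, slope, and S are taken from the fitted line plot; the domestic value of 200 is a hypothetical movie, and the "prediction plus or minus 1.96 S" band is a rough approximation that ignores the sampling error in a and b, so it is not the exact prediction interval the software draws.

```python
A, B = 6.693, 1.051   # intercept and slope reported on the fitted line plot
S = 73.0041           # standard error of the regression from the same plot


def predict_overseas(domestic):
    """Point prediction of overseas box office from domestic box office."""
    return A + B * domestic


def rough_interval(domestic, z=1.96):
    """Approximate 95% range: prediction +/- z * S.

    This ignores the uncertainty in a and b, so it is slightly narrower
    than an exact prediction interval.
    """
    center = predict_overseas(domestic)
    return center - z * S, center + z * S


domestic = 200.0      # hypothetical movie, same units as the plot
low, high = rough_interval(domestic)
print(f"prediction = {predict_overseas(domestic):.1f}")
print(f"rough 95% range = ({low:.1f}, {high:.1f})")
```

The width of the range is driven by S: with S near 73, even a good prediction carries an uncertainty of well over 100 in either direction.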
Effect of an Outlier is to Twist the Regression Line
[Figure, panel 1: Regression of foreign box office on domestic. ForeignBox = 20.78 + 0.9202 DomesticBox; S = 66.9303, R-Sq = 47.4%, R-Sq(adj) = 47.3%.]
[Figure, panel 2: Regression of foreign box office on domestic. Overseas = 6.693 + 1.051 Domestic; S = 73.0041, R-Sq = 52.2%, R-Sq(adj) = 52.1%.]
Without Titanic, slope = 0.9202
With Titanic, slope = 1.051
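The twisting effect can be reproduced with a small experiment. This sketch does not use the movie data: the five points and the single outlier are made up, and `fit_line` is a hypothetical helper that applies the standard least squares formulas.

```python
def fit_line(x, y):
    """Least squares intercept a and slope b for y = a + b*x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sxy / sxx
    return ybar - b * xbar, b


# Made-up data lying exactly on y = 10 + 0.9 x ...
x = [50.0, 100.0, 150.0, 200.0, 250.0]
y = [10.0 + 0.9 * xi for xi in x]
a_without, b_without = fit_line(x, y)

# ... plus one made-up point far above that line at a large x.
a_with, b_with = fit_line(x + [600.0], y + [1200.0])

print(f"without outlier: slope = {b_without:.4f}")
print(f"with outlier:    slope = {b_with:.4f}")
```

A single point that is extreme in x and far from the line has a large influence, because its squared deviation dominates the sums that define the slope.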
Least Squares Regression
How to compute the y intercept, a, and the slope, b, in y = a + bx.
Fitting a Line to a Set of Points
[Figure: Scatterplot of PerCapitaG vs Income. Labels on the plot mark an observation (x_i, y_i) and its prediction a + b x_i.]
Choose a and b to minimize the sum of squared residuals: Gauss’s method of least squares.
\[
SS = \sum_{i=1}^{N} [y_i - a - b x_i]^2 = \sum_{i=1}^{N} [y_i - (a + b x_i)]^2 = \sum_{i=1}^{N} e_i^2
\]
Residuals:
\[
e_i = y_i - (a + b x_i) = y_i - \hat{y}_i
\]
Predictions:
\[
\hat{y}_i = a + b x_i
\]
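The least squares criterion can be written as a function of a and b and evaluated directly. In the sketch below the five data points are made up, and `sum_squared_residuals` is a hypothetical helper; the loop checks that moving away from the least squares values in a few directions increases the sum of squares.

```python
def sum_squared_residuals(a, b, x, y):
    """SS(a, b) = sum over i of (y_i - (a + b*x_i))**2."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))


# Made-up points, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

# Least squares values from the usual formulas.
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
     / sum((xi - xbar) ** 2 for xi in x))
a = ybar - b * xbar

best = sum_squared_residuals(a, b, x, y)

# Perturbing the intercept or slope gives a larger sum of squares.
for da, db in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.05), (0.0, -0.05), (0.1, -0.05)]:
    assert sum_squared_residuals(a + da, b + db, x, y) > best

print(f"a = {a:.3f}, b = {b:.3f}, SS = {best:.4f}")
```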
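Computing the Least Squares Parameters a and b
4 numbers are needed:
\[
\bar{y} = \frac{1}{N} \sum_{i=1}^{N} y_i = 20.721, \qquad \bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i = 0.48242
\]
\[
\mathrm{Var}(x) = s_x^2 = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})^2 = 0.02453
\]
\[
\mathrm{Cov}(x,y) = s_{xy} = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y}) = 1.784
\]
\[
b = \frac{s_{xy}}{s_x^2} = \frac{1.784}{0.02453} = 72.7181
\]
\[
a = \bar{y} - b\bar{x} = 20.721 - (72.7181)(0.48242) = -14.36
\]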
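The same arithmetic in Python, using only the four summary numbers printed on the slide. Note that dividing the rounded values 1.784 and 0.02453 gives a slope of about 72.73 rather than the 72.7181 shown; the slide's figure presumably comes from unrounded inputs, which is an assumption on my part. The intercept agrees to two decimals either way.

```python
ybar, xbar = 20.721, 0.48242   # sample means from the slide
var_x = 0.02453                # s_x^2, sample variance of x
cov_xy = 1.784                 # s_xy, sample covariance of x and y

b = cov_xy / var_x             # slope
a = ybar - b * xbar            # intercept

print(f"b = {b:.4f}")
print(f"a = {a:.2f}")
```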
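Least Squares Uses Calculus
\[
SS = \frac{1}{N-1} \sum_{i=1}^{N} (y_i - a - b x_i)^2
\]
\[
\frac{\partial SS}{\partial a} = \frac{1}{N-1} \sum_{i=1}^{N} \frac{\partial (y_i - a - b x_i)^2}{\partial a} = \frac{1}{N-1} \sum_{i=1}^{N} 2 (y_i - a - b x_i)(-1) = 0
\]
\[
\frac{\partial SS}{\partial b} = \frac{1}{N-1} \sum_{i=1}^{N} \frac{\partial (y_i - a - b x_i)^2}{\partial b} = \frac{1}{N-1} \sum_{i=1}^{N} 2 (y_i - a - b x_i)(-x_i) = 0
\]
The solution is
\[
a = \bar{y} - b\bar{x}, \quad \text{where} \quad b = \frac{\frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})^2}
\]
Here SS is scaled by 1/(N-1); the scaling does not change the values of a and b that minimize it.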
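The two first-order conditions say that, at the least squares solution, the residuals sum to zero and are uncorrelated with x. A numerical check on made-up data (the six points and the `least_squares` helper are hypothetical) illustrates this:

```python
def least_squares(x, y):
    """Return (a, b) from a = ybar - b*xbar and b = s_xy / s_x^2."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    s_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (n - 1)
    s_xx = sum((xi - xbar) ** 2 for xi in x) / (n - 1)
    b = s_xy / s_xx
    return ybar - b * xbar, b


# Made-up data, for illustration only.
x = [2.0, 4.0, 5.0, 7.0, 8.0, 10.0]
y = [3.0, 6.5, 7.0, 11.0, 11.5, 16.0]

a, b = least_squares(x, y)
residuals = [yi - a - b * xi for xi, yi in zip(x, y)]

# First-order conditions (up to floating point rounding):
foc_a = sum(residuals)                                  # dSS/da = 0
foc_b = sum(e * xi for e, xi in zip(residuals, x))      # dSS/db = 0
print(f"sum of residuals       = {foc_a:.2e}")
print(f"sum of residuals * x_i = {foc_b:.2e}")
```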
b Measures Covariation
b is related to the correlation of x and y.
Predictor: Box Office = a + b Buzz.
\[
b = \frac{\mathrm{Cov}(x,y)}{\mathrm{Var}(x)} = \frac{s_{xy}}{s_x^2}
\]
Note the numerator of b is the covariance of x and y. If Cov(x,y) = 0, then b = 0.
Also, since the correlation is
\[
r_{xy} = \frac{\mathrm{Cov}(x,y)}{\sqrt{\mathrm{Var}(x)}\sqrt{\mathrm{Var}(y)}} = \frac{s_{xy}}{s_x s_y},
\]
\[
b = \frac{s_y}{s_x} \times \text{Correlation of x and y}.
\]
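The link between the slope and the correlation, b = r (s_y / s_x), can be checked numerically. The data below are made up and `sample_cov` is a hypothetical helper; the point is only that the two ways of computing b agree.

```python
from math import sqrt


def sample_cov(u, v):
    """Sample covariance with the N-1 divisor; sample_cov(u, u) is the variance."""
    n = len(u)
    ubar, vbar = sum(u) / n, sum(v) / n
    return sum((ui - ubar) * (vi - vbar) for ui, vi in zip(u, v)) / (n - 1)


# Made-up data, for illustration only.
x = [1.0, 3.0, 4.0, 6.0, 8.0, 9.0]
y = [2.0, 3.5, 6.0, 6.5, 9.0, 12.0]

s_xy = sample_cov(x, y)
s_x = sqrt(sample_cov(x, x))
s_y = sqrt(sample_cov(y, y))

b = s_xy / s_x ** 2          # slope as covariance over variance
r = s_xy / (s_x * s_y)       # correlation coefficient

print(f"b           = {b:.6f}")
print(f"r * s_y/s_x = {r * s_y / s_x:.6f}")
```

Since s_y / s_x is positive, b and r always have the same sign, and b = 0 exactly when r = 0.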
Is There Really a Statistically Valid Relationship?
We reframe the question.
If b = 0, then there is no (linear) relationship. How can we find out if the regression relationship is just a fluke due to a particular observed set of points? To be studied later in the course.
BoxOffice = a + b Cntwait3. Is b really > 0?
Interpreting the Function
[Figure: Fitted line plot of DALE vs EDUC, with the intercept a and slope b marked. DALE = 35.16 + 3.611 EDUC; S = 7.87034, R-Sq = 59.2%, R-Sq(adj) = 59.0%.]
a = the life expectancy associated with 0 years of education. No country has 0 average years of education. The regression only applies in the range of experience.
b = the increase in life expectancy associated with each additional year of average education.
The range of experience (education)
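A small sketch of how a and b are read off the fitted equation DALE = 35.16 + 3.611 EDUC. The function name and the education values plugged in are illustrative choices, picked to fall inside the 0 to 12 span of the plot's horizontal axis.

```python
A, B = 35.16, 3.611   # intercept and slope from the fitted line plot


def fitted_dale(educ):
    """Fitted DALE (years) for a given average years of education."""
    return A + B * educ


# Fitted values at a few education levels inside the plotted range.
for educ in (4, 8, 12):
    print(f"EDUC = {educ:2d}  ->  fitted DALE = {fitted_dale(educ):.2f}")

# b is the change in fitted DALE per additional year of education.
print(f"difference for one more year: {fitted_dale(9) - fitted_dale(8):.3f}")

# a is the fitted value at EDUC = 0, which lies outside the observed data.
print(f"fitted value at EDUC = 0: {fitted_dale(0):.2f}")
```

The last line is arithmetic, not a prediction anyone should rely on: as the slide notes, the regression only applies in the range of experience.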
Correlation and Causality
[Figure: Fitted line plot of DALE vs EDUC. DALE = 35.16 + 3.611 EDUC; S = 7.87034, R-Sq = 59.2%, R-Sq(adj) = 59.0%.]
Does more education make you live longer (on average)?
Causality?
Height (inches) and Income ($/mo.) in first post-MBA Job (men). WSJ, 12/30/86.
Ht.  Inc.    Ht.  Inc.    Ht.  Inc.
70   2990    68   2910    75   3150
67   2870    66   2840    68   2860
69   2950    71   3180    69   2930
70   3140    68   3020    76   3210
65   2790    73   3220    71   3180
73   3230    73   3370    66   2670
64   2880    70   3180    69   3050
70   3140    71   3340    65   2750
69   3000    69   2970    67   2960
73   3170    73   3240    70   3050
Estimated Income = -451 + 50.2 Height
Correlation = 0.84 (!)
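The reported equation and correlation can be reproduced from the 30 height and income pairs listed above. The sketch below applies the covariance and variance formulas directly; the variable names are my own, and the order of the pairs does not matter.

```python
from math import sqrt

# (height in inches, income in $/month), the 30 pairs from the table.
data = [
    (70, 2990), (68, 2910), (75, 3150), (67, 2870), (66, 2840), (68, 2860),
    (69, 2950), (71, 3180), (69, 2930), (70, 3140), (68, 3020), (76, 3210),
    (65, 2790), (73, 3220), (71, 3180), (73, 3230), (73, 3370), (66, 2670),
    (64, 2880), (70, 3180), (69, 3050), (70, 3140), (71, 3340), (65, 2750),
    (69, 3000), (69, 2970), (67, 2960), (73, 3170), (73, 3240), (70, 3050),
]

n = len(data)
heights = [h for h, _ in data]
incomes = [inc for _, inc in data]
hbar, ibar = sum(heights) / n, sum(incomes) / n

# Sums of squares and cross products of deviations from the means.
sxx = sum((h - hbar) ** 2 for h in heights)
syy = sum((inc - ibar) ** 2 for inc in incomes)
sxy = sum((h - hbar) * (inc - ibar) for h, inc in data)

b = sxy / sxx                 # slope: dollars per month per inch
a = ibar - b * hbar           # intercept
r = sxy / sqrt(sxx * syy)     # correlation

print(f"Estimated Income = {a:.0f} + {b:.1f} Height")
print(f"Correlation = {r:.2f}")
```

The strong correlation is a property of these 30 observations; the calculation by itself says nothing about whether height causes higher income.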
Using Regression to Predict
[Figure: Fitted line plot, regression of foreign box office on domestic, showing the regression line and the 95% prediction interval (PI). Overseas = 6.693 + 1.051 Domestic; S = 73.0041, R-Sq = 52.2%, R-Sq(adj) = 52.1%.]
Summary
- Using scatter plots to examine data
- The linear regression
  - Description
  - Predict
  - Control
  - Understand
- Linear regression computation
  - Computation of slope and constant term
  - Prediction
- Covariation vs. Causality
- Interpretation of the regression line as a conditional expectation