Alternatively, dependent variable and independent variable.
Alternatively, endogenous variable and exogenous variable.
Association versus causation
Scatterplots
[Figure: scatterplot. x-axis: Weeks since beginning of semester. y-axis: Percentage of computers used in computer labs free]
Stata Exercise 1
Stata Exercise 2
Suppose we were considering the effect of hiring more people into the firm. On average, what total billings can we expect from a staff of 50? 150?
Stata Exercise 3
Stata Exercise 4
Stata Exercise 5
Adding Categorical Values to a Scatterplot
Often it is useful to distinguish groups of data in a scatterplot.
Stata Exercise 6
Transforming Data
Data analysts often look for a transformation of the data that simplifies the overall pattern.
The transformation typically involves turning a non-Normally distributed variable into a more-or-less Normally distributed variable.
Stata Exercise 7
Categorical Explanatory Variable
What if the explanation for the numbers is not another number but a category?
For example, investing in a particular sector of the economy might be great in some years or terrible in others.
Stata Exercise 8
More scatterplots
Relations between competitors
Stata Exercise 9
Correlation
Which one has the stronger correlation?
r = covariance(x,y) / [stdev(x)*stdev(y)]
r = (1/(n-1)) * sum of [(standardized value of x) * (standardized value of y)]
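Worksheet columns: week w | w - mean of w | z-score of w | prop of comps p | p - mean of p | z-score of p | z-score * z-score
(Only the w and p columns are filled in; the remaining columns are left blank.)

  w      p
  1   73.1
  2   89.7
  3   71.3
  4   65.3
  5   54.6
  6   57.9
  7   51.6
  8   41.2
  9   59.1
 10   48.5
 11   24
 12   43
 13   29.1
 14   19.7
 15   12.1
 16   10.1

sum 0.00
count 16 | mean of w 8.5 | stdev of w 4.8 | mean of p 46.9 | stdev of p 23.1 | corr (blank)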
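The worksheet above can also be filled in programmatically. The Python sketch below is not part of the original exercises (which use Stata); it takes the week and percentage columns from the table and applies the z-score formula for r. The function and variable names are illustrative choices.

```python
from math import sqrt

# Week number (w) and percentage of computers (p), copied from the worksheet.
weeks = list(range(1, 17))
pct = [73.1, 89.7, 71.3, 65.3, 54.6, 57.9, 51.6, 41.2,
       59.1, 48.5, 24.0, 43.0, 29.1, 19.7, 12.1, 10.1]

def mean(values):
    return sum(values) / len(values)

def stdev(values):
    """Sample standard deviation (divides by n - 1)."""
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def z_scores(values):
    """Standardized values: (value - mean) / stdev."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def correlation(x, y):
    """r = (1/(n-1)) * sum of (z-score of x) * (z-score of y)."""
    zx, zy = z_scores(x), z_scores(y)
    return sum(a * b for a, b in zip(zx, zy)) / (len(x) - 1)

r = correlation(weeks, pct)
print(f"mean of w = {mean(weeks):.1f}, stdev of w = {stdev(weeks):.1f}")
print(f"mean of p = {mean(pct):.1f}, stdev of p = {stdev(pct):.1f}")
print(f"corr = {r:.2f}")
```

The printed means and standard deviations can be checked against the summary row of the worksheet; the sign of corr reflects the downward trend over the semester.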
Correlation
The r coefficient between measures of height and weight is positive because people who are of above-average height tend to be of above-average weight … so if the z-score for height is large, the z-score for weight tends to be large.
r = (1/(n-1)) * sum of [(standardized value of x) * (standardized value of y)]
Correlation applet at www.whfreeman.com/pbs
Stata Exercise 11
Correlation
Correlation coefficients, as well as scatterplots, can be used for comparisons.
For example, how well did Vanguard International Growth Fund (an investment vehicle) do compared to an average of the stocks in Europe, Australasia and the Far East?
Stata Exercise 12
Correlation
Doesn’t tell you anything about causality
Variables must be numerical
It is indifferent to units of measurement
r > 0 means positive association; r < 0, negative
-1 ≤ r ≤ 1. r = -1 means a perfectly straight downward-sloping line; r = 0 means no linear relation
r only measures linear relations
r is not resistant to outliers
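Several of these properties can be checked numerically. The Python sketch below uses made-up height and weight figures (they are not from the slides) to illustrate that r ignores units, that a perfect but curved relation can give r = 0, and that a single outlier can move r a long way.

```python
from math import sqrt

def pearson_r(x, y):
    """covariance(x, y) / (stdev(x) * stdev(y)); the n - 1 factors cancel."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical heights (inches) and weights (pounds), for illustration only.
height_in = [60, 62, 64, 66, 68, 70, 72, 74]
weight_lb = [115, 120, 135, 140, 155, 160, 180, 185]
r_original = pearson_r(height_in, weight_lb)

# Changing units (inches to cm, pounds to kg) leaves r unchanged.
height_cm = [h * 2.54 for h in height_in]
weight_kg = [w * 0.4536 for w in weight_lb]
r_metric = pearson_r(height_cm, weight_kg)

# A perfect but curved relation (y = x squared on symmetric x) has r = 0.
x_sym = [-2, -1, 0, 1, 2]
y_sq = [v ** 2 for v in x_sym]
r_curved = pearson_r(x_sym, y_sq)

# One unusual point (tall but light) pulls r down sharply.
r_outlier = pearson_r(height_in + [76], weight_lb + [100])

print(r_original, r_metric, r_curved, r_outlier)
```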
Stata Exercise 13
Regression
The Linear Regression Model
y_i = a + b*x_i + error_i
Errors have mean 0 and a constant standard deviation σ, and are independent of x.
[Figure: scatterplot of Price of Homes against Square Footage of Homes, with linear prediction]
[Figure: histograms of Price of Homes (frequency), one panel per square-footage group: 1000<sqft<=1500, 1500<sqft<=2000, 2000<sqft<=2500, 2500<sqft<=3000, 3000<sqft<=3500, 3500<sqft<=4000]
[Figure: scatterplot of Price of Homes against Square Footage of Homes, with linear prediction]
[Figure: scatterplot of annual earnings (dollars) against height (inches), with fitted values]
(66.5’’, $20,000)
(76.5’’, $35,600)
(61.5’’, $12,200)
y – 20,000 = 1560 (x - 66.5)
y = – 83,740 + 1560 x, or roughly y = – 84,000 + 1560 x
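The arithmetic behind the two forms of the line can be checked with a short Python sketch. It uses only the points listed above; the function names are illustrative. Expanding the point-slope form gives an intercept of -83,740, which is about -84,000 when rounded.

```python
def line_through(point_a, point_b):
    """Return (intercept, slope) of the line through two (x, y) points."""
    (x1, y1), (x2, y2) = point_a, point_b
    slope = (y2 - y1) / (x2 - x1)
    intercept = y1 - slope * x1  # expand y - y1 = slope * (x - x1)
    return intercept, slope

# Points on the fitted line: (height in inches, annual earnings in dollars).
intercept, slope = line_through((66.5, 20_000), (76.5, 35_600))

def predict(height):
    """Earnings predicted by the line at a given height."""
    return intercept + slope * height

print(slope)          # 1560.0
print(intercept)      # -83740.0
print(predict(61.5))  # 12200.0
```

The last value matches the third point on the slide, (61.5’’, $12,200), so all three points lie on the same line.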
Sketch a scatterplot of the data consistent with this line
$37,694
95% of values
[Figure: scatterplot of annual earnings (dollars) against height (inches), with fitted values]
[Figure: two panels of circles; the first has x and y from 0 to 3, the second has x from 0 to 6 and y from 0 to 4]
Draw the best-fitting line through the circles
[Figure: two panels of circles with the same axes as before]
Mark with an “X” the average “y” value for each “x” value. Then draw the best-fitting line through the Xs
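The two-step recipe (average y at each x, then a line through the averages) can be sketched in Python. The circle coordinates below are invented for illustration, since the slide's actual points are not recoverable, and the least-squares line is used as one reasonable reading of "best-fitting".

```python
from collections import defaultdict

def least_squares(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx, slope

# Hypothetical circles: three y values at each of x = 1, 2, 3.
xs = [1, 1, 1, 2, 2, 2, 3, 3, 3]
ys = [0.5, 1.0, 1.5, 1.0, 1.5, 2.0, 2.0, 2.5, 3.0]

# Step 1: the "X" marks are the average y at each x.
groups = defaultdict(list)
for x, y in zip(xs, ys):
    groups[x].append(y)
x_marks = sorted(groups)
y_marks = [sum(groups[x]) / len(groups[x]) for x in x_marks]

# Step 2: fit a line through the X marks and, for comparison, through all circles.
line_marks = least_squares(x_marks, y_marks)
line_all = least_squares(xs, ys)

print(line_marks)
print(line_all)
```

Because every x value has the same number of circles here, the two lines coincide. With unequal group sizes, the line through the Xs would need to weight each X by its count to reproduce the line through all circles.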
Regression (unlike correlation) is sensitive to your determination of which variable is explanatory and which response.
Sales = a + b(item)
Item = a + b(sales)
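A small Python sketch can show why the choice matters. The item and sales figures are invented for illustration; the point is only that regressing sales on item and regressing item on sales give two different lines unless the correlation is perfect.

```python
def fit(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx, slope

# Hypothetical data, for illustration only.
item = [1, 2, 3, 4, 5, 6]
sales = [12, 15, 21, 22, 30, 29]

a1, b1 = fit(item, sales)   # Sales = a1 + b1 * item
a2, b2 = fit(sales, item)   # Item  = a2 + b2 * sales

# Solving the second equation for sales gives a slope of 1 / b2.
implied_slope = 1 / b2

print(b1, implied_slope)  # the two slopes differ
print(b1 * b2)            # the product of the two slopes equals r squared
```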
Fact 1
Stata Exercise 14
Facts 2 and 3
If x changes by one standard deviation of x, y changes by r standard deviations of y. That is, the slope of the regression line is b = r * sy / sx.
– E.g., sx = 1, sy = 2, and r = 0.61. If x changes by 1, y will change by 2*0.61 = 1.22
The regression line goes through the point (x̄, ȳ)
– The point-slope form of the line requires only the information on this slide to draw a line.
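Facts 2 and 3 together are enough to write down the regression line from summary statistics alone. The Python sketch below uses sx = 1, sy = 2 and r = 0.61 from the slide; the two means are hypothetical values chosen only so that the line can be completed.

```python
def regression_line_from_summary(x_bar, y_bar, s_x, s_y, r):
    """Return (intercept, slope) using slope = r * s_y / s_x and the point (x_bar, y_bar)."""
    slope = r * s_y / s_x
    intercept = y_bar - slope * x_bar  # the line passes through the point of means
    return intercept, slope

# sx, sy and r are from the slide; x_bar and y_bar are assumed for illustration.
intercept, slope = regression_line_from_summary(
    x_bar=5.0, y_bar=10.0, s_x=1.0, s_y=2.0, r=0.61
)

predicted_at_mean = intercept + slope * 5.0

print(slope)              # change in y when x changes by 1
print(predicted_at_mean)  # prediction at x_bar, which recovers y_bar
```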
Fact 4
Correlation r is related to the slope of the regression line and therefore to the relation between x and y.
Actually, the square of r, that is, R², is the fraction of the variation in y that is explained by the variation in x.
R² = (variation in ŷ as x pulls it along the line) / (total variation in observed values of y)
Because most of the variation in gas consumption is explained by temperature, the R² of this regression is very high.
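The "fraction of variation explained" reading of R² can be verified numerically. The Python sketch below uses invented temperature and gas-consumption figures (the slide's data are not available) and compares the variation in the fitted values with the total variation in y; the result should agree with the square of r.

```python
from math import sqrt

def variation(values):
    """Sum of squared deviations from the mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def fit_line(x, y):
    """Least-squares (intercept, slope) of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / variation(x)
    return my - slope * mx, slope

# Hypothetical data, for illustration only.
temperature = [20, 25, 30, 35, 40, 45, 50, 55]
gas_used = [9.1, 8.2, 7.4, 6.3, 5.6, 4.4, 3.7, 2.5]

intercept, slope = fit_line(temperature, gas_used)
fitted = [intercept + slope * t for t in temperature]

# R squared: variation in the fitted values over total variation in y.
r_squared = variation(fitted) / variation(gas_used)

# Cross-check against the squared correlation coefficient.
mt = sum(temperature) / len(temperature)
mg = sum(gas_used) / len(gas_used)
r = (sum((t - mt) * (g - mg) for t, g in zip(temperature, gas_used))
     / sqrt(variation(temperature) * variation(gas_used)))

print(round(r_squared, 4), round(r ** 2, 4))
```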
tbill98 tbill98_hat residuals
11.5 10.84649
12.6 12.19961
13.8 14.81564
6.4 5.975251
5.3 6.336083
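The empty residuals column can be filled in as observed minus fitted. The Python sketch below uses only the five rows shown above; because these are just the listed rows and not necessarily the full data set, the residuals here need not sum to zero.

```python
# Observed values and fitted values, copied from the table.
tbill98 = [11.5, 12.6, 13.8, 6.4, 5.3]
tbill98_hat = [10.84649, 12.19961, 14.81564, 5.975251, 6.336083]

# Residual = observed value - fitted value.
residuals = [obs - fit for obs, fit in zip(tbill98, tbill98_hat)]

for obs, fit, res in zip(tbill98, tbill98_hat, residuals):
    print(f"{obs:6.1f} {fit:10.6f} {res:10.6f}")
```

A positive residual means the observation lies above the regression line; a negative one means it lies below.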
Excel Exercise 1
Stata Exercises 15 and 16
With influential observations
Without influential observation 21
Stata Exercise 17
Cautions about Correlation and Regression
Don’t extrapolate too far
Correlations are stronger for averages than for individuals
Beware of lurking (latent, hidden, excluded, neglected) variables
Association is not causation
– Establishing causation takes a lot of work (see p. 139).