Page 1: Lecture 2: Simple Linear Regression

DMBA: Statistics

Lecture 2: Simple Linear Regression

Least Squares, SLR properties, Inference, and Forecasting

Carlos Carvalho
The University of Texas McCombs School of Business

mccombs.utexas.edu/faculty/carlos.carvalho/teaching

1

Page 2: Lecture 2: Simple Linear Regression

Today’s Plan

1. The Least Squares Criteria

2. The Simple Linear Regression Model

3. Estimation for the SLR Model

- sampling distributions

- confidence intervals

- hypothesis testing

2

Page 3: Lecture 2: Simple Linear Regression

Linear Prediction

Ŷi = b0 + b1Xi

- b0 is the intercept and b1 is the slope

- We find b0 and b1 using Least Squares

3

Page 4: Lecture 2: Simple Linear Regression

The Least Squares Criterion

The formulas for b0 and b1 that minimize the least squares criterion are:

b1 = corr(X, Y) × sY / sX        b0 = Ȳ − b1X̄

where

sY = √[ ∑ (Yi − Ȳ)² / (n − 1) ]   and   sX = √[ ∑ (Xi − X̄)² / (n − 1) ]

(sums over i = 1, ..., n)
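A minimal R sketch of these formulas (not from the slides; the simulated x and y below stand in for the housing data used later):

set.seed(1)
x <- runif(30, 1, 3.5)               # e.g., house size in 1000 sq. ft. (made up)
y <- 40 + 35 * x + rnorm(30, 0, 10)  # e.g., price in $1000s (made up)

b1 <- cor(x, y) * sd(y) / sd(x)      # slope: correlation times a scaling factor
b0 <- mean(y) - b1 * mean(x)         # intercept: line passes through (x̄, ȳ)

c(b0, b1)
coef(lm(y ~ x))                      # lm() gives the same least squares answer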

4

Page 5: Lecture 2: Simple Linear Regression

Correlation and Covariance

Covariance and correlation measure the direction and strength of the linear relationship between two variables Y and X.


The direction is given by the sign of the covariance

Cov(Y, X) = ∑ (Yi − Ȳ)(Xi − X̄) / (n − 1)

5

Page 6: Lecture 2: Simple Linear Regression

Correlation and Covariance

Correlation is the standardized covariance:

corr(X, Y) = cov(X, Y) / √(var(X) var(Y)) = cov(X, Y) / (sd(X) sd(Y))

The correlation is scale invariant and the units of measurement

don’t matter: It is always true that −1 ≤ corr(X ,Y ) ≤ 1.

This gives the direction (- or +) and strength (0→ 1)

of the linear relationship between X and Y .

6

Page 7: Lecture 2: Simple Linear Regression

Correlation

[Four scatterplots illustrating correlation: corr = 1, corr = .5, corr = .8, corr = -.8]

7

Page 8: Lecture 2: Simple Linear Regression

Correlation

Only measures linear relationships:

corr(X ,Y ) = 0 does not mean the variables are not related!

[Two scatterplots: one with corr = 0.01, one with corr = 0.72]

Also be careful with influential observations.

8

Page 9: Lecture 2: Simple Linear Regression

Back to Least Squares

1. Intercept:

b0 = Ȳ − b1X̄  ⇒  Ȳ = b0 + b1X̄

- The point (X̄, Ȳ) is on the regression line!

- Least squares finds the point of means and rotates the line through that point until it gets the "right" slope

2. Slope:

b1 = corr(X, Y) × sY / sX

- So, the right slope is the correlation coefficient times a scaling factor that ensures the proper units for b1

9

Page 10: Lecture 2: Simple Linear Regression

More on Least Squares

From now on, the terms "fitted values" (Ŷi) and "residuals" (ei) refer to those obtained from the least squares line.

The fitted values and residuals have some special properties. Let's look at the housing data analysis to figure out what these properties are...

10

Page 11: Lecture 2: Simple Linear Regression

The Fitted Values and X

[Scatterplot of the fitted values against X: corr(y.hat, x) = 1]

11

Page 12: Lecture 2: Simple Linear Regression

The Residuals and X

[Scatterplot of the residuals against X: corr(e, x) = 0, mean(e) = 0]

12

Page 13: Lecture 2: Simple Linear Regression

Why?

What is the intuition for the relationship between Ŷ, e, and X?

Let's consider some "crazy" alternative line:

[Scatterplot of Y against X with two candidate lines. LS line: 38.9 + 35.4 X; Crazy line: 10 + 50 X]

13

Page 14: Lecture 2: Simple Linear Regression

Fitted Values and Residuals

This is a bad fit! We are underestimating the value of small houses

and overestimating the value of big houses.

[Scatterplot of the crazy line's residuals against X: corr(e, x) = -0.7, mean(e) = 1.8]

Clearly, we have left some predictive ability on the table!

14

Page 15: Lecture 2: Simple Linear Regression

Fitted Values and Residuals

As long as the correlation between e and X is non-zero, we could

always adjust our prediction rule to do better.

We need to exploit all of the predictive power in the X values and put this into Ŷ, leaving no "Xness" in the residuals.

In Summary: Y = Ŷ + e, where:

- Ŷ is "made from X"; corr(X, Ŷ) = 1.

- e is unrelated to X; corr(X, e) = 0.
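A quick R check of these properties (illustrative only; x and y are simulated here, not the housing data from the slides):

set.seed(2)
x <- runif(50, 1, 3.5)
y <- 40 + 35 * x + rnorm(50, 0, 10)
fit  <- lm(y ~ x)
yhat <- fitted(fit)
e    <- resid(fit)

mean(e)                     # essentially 0
cor(e, x)                   # essentially 0
cor(yhat, x)                # 1 (up to sign; here the slope is positive)
max(abs(y - (yhat + e)))    # Y = Yhat + e, up to rounding error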

15

Page 16: Lecture 2: Simple Linear Regression

Another way to derive things

The intercept:

(1/n) ∑ ei = 0  ⇒  (1/n) ∑ (Yi − b0 − b1Xi) = 0

⇒ Ȳ − b0 − b1X̄ = 0

⇒ b0 = Ȳ − b1X̄

16

Page 17: Lecture 2: Simple Linear Regression

Another way to derive things

The slope:

corr(e, X) = 0  ⇒  ∑ ei (Xi − X̄) = 0

0 = ∑ (Yi − b0 − b1Xi)(Xi − X̄)

  = ∑ (Yi − Ȳ − b1(Xi − X̄))(Xi − X̄)

⇒ b1 = ∑ (Xi − X̄)(Yi − Ȳ) / ∑ (Xi − X̄)²  =  rxy · sy/sx

17

Page 18: Lecture 2: Simple Linear Regression

Decomposing the Variance

How well does the least squares line explain variation in Y ?

Since Ŷ and e are uncorrelated (i.e., cov(Ŷ, e) = 0),

var(Y) = var(Ŷ + e) = var(Ŷ) + var(e)

This leads to

∑ (Yi − Ȳ)² = ∑ (Ŷi − Ȳ)² + ∑ ei²

where SST = ∑ (Yi − Ȳ)², SSR = ∑ (Ŷi − Ȳ)², and SSE = ∑ ei².

18

Page 19: Lecture 2: Simple Linear Regression

Decomposing the Variance – ANOVA Tables

SSR: Variation in Y explained by the regression line.

SSE: Variation in Y that is left unexplained.

SSR = SST ⇒ perfect fit.

Be careful of similar acronyms; e.g. SSR for “residual” SS.

19

Page 20: Lecture 2: Simple Linear Regression

Decomposing the Variance – ANOVA Tables


Decomposing the Variance – The ANOVA Table

20

Page 21: Lecture 2: Simple Linear Regression

A Goodness of Fit Measure: R²

The coefficient of determination, denoted by R², measures goodness of fit:

R² = SSR / SST = 1 − SSE / SST

- 0 < R² < 1.

- The closer R² is to 1, the better the fit.

21

Page 22: Lecture 2: Simple Linear Regression

A Goodness of Fit Measure: R²

An interesting fact: R² = r²xy (i.e., R² is the squared correlation).

R² = ∑ (Ŷi − Ȳ)² / ∑ (Yi − Ȳ)²

   = ∑ (b0 + b1Xi − b0 − b1X̄)² / ∑ (Yi − Ȳ)²

   = b1² ∑ (Xi − X̄)² / ∑ (Yi − Ȳ)²

   = b1² sx² / sy²  =  r²xy

No surprise: the higher the sample correlation between

X and Y , the better you are doing in your regression.
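Here is an illustrative R sketch of the decomposition and of R² as a squared correlation (simulated data standing in for the house data):

set.seed(3)
x <- runif(40, 1, 3.5)
y <- 40 + 35 * x + rnorm(40, 0, 10)
fit <- lm(y ~ x)

SST <- sum((y - mean(y))^2)
SSR <- sum((fitted(fit) - mean(y))^2)
SSE <- sum(resid(fit)^2)

c(SST = SST, SSR_plus_SSE = SSR + SSE)                        # equal
c(R2 = SSR / SST, r2 = cor(x, y)^2, summary(fit)$r.squared)   # all equal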

22

Page 23: Lecture 2: Simple Linear Regression

Back to the House Data

[Regression output for the house data]

23

Page 24: Lecture 2: Simple Linear Regression

Prediction and the Modelling Goal

A prediction rule is any function where you input X and it outputs Ŷ as a predicted response at X.

The least squares line is a prediction rule:

Ŷ = f(X) = b0 + b1X

24

Page 25: Lecture 2: Simple Linear Regression

Prediction and the Modelling Goal

Ŷ is not going to be a perfect prediction.

We need to devise a notion of forecast accuracy.

25

Page 26: Lecture 2: Simple Linear Regression

Prediction and the Modelling Goal

There are two things that we want to know:

- What value of Y can we expect for a given X?

- How sure are we about this forecast? Or how different could Y be from what we expect?

Our goal is to measure the accuracy of our forecasts or how much

uncertainty there is in the forecast. One method is to specify a

range of Y values that are likely, given an X value.

Prediction Interval: probable range for Y-values given X

26

Page 27: Lecture 2: Simple Linear Regression

Prediction and the Modelling Goal

Key Insight: To construct a prediction interval, we will have to

assess the likely range of residual values corresponding to a Y value

that has not yet been observed!

We will build a probability model (e.g., normal distribution).

Then we can say something like "with 95% probability the residuals will be no smaller than -$28,000 and no larger than $28,000".

We must also acknowledge that the “fitted” line may be fooled by

particular realizations of the residuals.

27

Page 28: Lecture 2: Simple Linear Regression

The Simple Linear Regression Model

The power of statistical inference comes from the ability to make

precise statements about the accuracy of the forecasts.

In order to do this we must invest in a probability model.

Simple Linear Regression Model: Y = β0 + β1X + ε

ε ∼ N(0, σ²)

The error term ε is independent "idiosyncratic noise".

28

Page 29: Lecture 2: Simple Linear Regression

Independent Normal Additive Error

Why do we have ε ∼ N(0, σ2)?

- E[ε] = 0 ⇔ E[Y | X] = β0 + β1X
  (E[Y | X] is the "conditional expectation of Y given X").

- Many things are close to Normal (central limit theorem).

- MLE estimates for the β's are the same as the LS b's.

- It works! This is a very robust model for the world.

We can think of β0 + β1X as the “true” regression line.

29

Page 30: Lecture 2: Simple Linear Regression

The Regression Model and our House Data

Think of E [Y |X ] as the average price of houses with size X:

Some houses could have a higher than expected value, some lower,

and the true line tells us what to expect on average.

The error term represents the influence of factors other than X.

30

Page 31: Lecture 2: Simple Linear Regression

Conditional Distributions

The conditional distribution for Y given X is Normal:

Y | X ∼ N(β0 + β1X, σ²).

σ controls dispersion:

31

Page 32: Lecture 2: Simple Linear Regression

Conditional vs Marginal Distributions

More on the conditional distribution:

Y |X ∼ N(E [Y |X ], var(Y |X )).

- Mean is E[Y | X] = E[β0 + β1X + ε] = β0 + β1X.

- Variance is var(β0 + β1X + ε) = var(ε) = σ².

Remember our sliced boxplots:

- σ² < var(Y) if X and Y are related.

32

Page 33: Lecture 2: Simple Linear Regression

Prediction Intervals with the True Model

You are told (without looking at the data) that

β0 = 40; β1 = 45; σ = 10

and you are asked to predict the price of a 1500 square foot house.

What do you know about Y from the model?

Y = 40 + 45(1.5) + ε = 107.5 + ε

Thus our prediction for price is

Y ∼ N(107.5, 10²)

33

Page 34: Lecture 2: Simple Linear Regression

Prediction Intervals with the True Model

The model says that the mean value of a 1500 sq. ft. house is $107,500 and that deviations from that mean are within ≈ $20,000.

We are 95% sure that

- −20 < ε < 20

- $87,500 < Y < $127,500

In general, the 95% Prediction Interval is PI = β0 + β1X ± 2σ.
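A quick R sketch of this calculation, using the parameter values given on the slide:

b0 <- 40; b1 <- 45; sigma <- 10
Xf <- 1.5                                  # 1500 sq. ft., in 1000s
mu <- b0 + b1 * Xf                         # expected price: 107.5 (in $1000s)
c(lower = mu - 2 * sigma, upper = mu + 2 * sigma)   # approx 95% PI: 87.5 to 127.5
mu + qnorm(c(0.025, 0.975)) * sigma        # same interval with exact normal quantiles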

34

Page 35: Lecture 2: Simple Linear Regression

Summary of Simple Linear Regression

Assume that all observations are drawn from our regression model

and that errors on those observations are independent.

The model is

Yi = β0 + β1Xi + εi

where the εi are independent and identically distributed N(0, σ²).

The SLR model has 3 basic parameters:

- β0, β1 (the linear pattern)

- σ (the variation around the line).

35

Page 36: Lecture 2: Simple Linear Regression

Key Characteristics of Linear Regression Model

- The mean of Y is linear in X.

- The error terms (deviations from the line) are normally distributed (very few deviations are more than 2 sd away from the regression mean).

- The error terms have constant variance.

36

Page 37: Lecture 2: Simple Linear Regression

Break

Back in 15 minutes...

37

Page 38: Lecture 2: Simple Linear Regression

Recall: Estimation for the SLR Model

SLR assumes every observation in the dataset was generated by

the model:

Yi = β0 + β1Xi + εi

This is a model for the conditional distribution of Y given X.

We use Least Squares to estimate β0 and β1:

β̂1 = b1 = ∑ (Xi − X̄)(Yi − Ȳ) / ∑ (Xi − X̄)²

β̂0 = b0 = Ȳ − b1X̄

38

Page 39: Lecture 2: Simple Linear Regression

Estimation for the SLR Model

39

Page 40: Lecture 2: Simple Linear Regression

Estimation of Error Variance

Recall that εi ∼ N(0, σ²), i.i.d., and that σ drives the width of the prediction intervals:

σ² = var(εi) = E[(εi − E[εi])²] = E[εi²]

A sensible strategy would be to estimate the average squared error by the sample average of the squared residuals:

σ̂² = (1/n) ∑ ei²

40

Page 41: Lecture 2: Simple Linear Regression

Estimation of Error Variance

However, this is not an unbiased estimator of σ2. We have to alter

the denominator slightly:

s² = (1/(n − 2)) ∑ ei² = SSE / (n − 2)

(2 is the number of regression coefficients, i.e., β0 and β1).

We have n − 2 degrees of freedom because 2 have been "used up" in the estimation of b0 and b1.

We usually work with s = √(SSE/(n − 2)), which is in the same units as Y.
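An illustrative R check of this quantity (simulated data, not the slides' dataset):

set.seed(4)
x <- runif(30, 1, 3.5)
y <- 40 + 35 * x + rnorm(30, 0, 10)
fit <- lm(y ~ x)

n   <- length(y)
SSE <- sum(resid(fit)^2)
s   <- sqrt(SSE / (n - 2))

c(by_hand = s, from_lm = summary(fit)$sigma)   # identical: lm's "residual standard error"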

41

Page 42: Lecture 2: Simple Linear Regression

Degrees of Freedom

Degrees of Freedom is the number of times you get to observe

useful information about the variance you’re trying to estimate.

For example, consider SST = ∑ (Yi − Ȳ)²:

- If n = 1, Ȳ = Y1 and SST = 0: since Y1 is "used up" estimating the mean, we haven't observed any variability!

- For n > 1, we've only had n − 1 chances for deviation from the mean, and we estimate s²y = SST/(n − 1).

In regression with p coefficients (e.g., p = 2 in SLR), you only get n − p real observations of variability ⇒ DoF = n − p.

42

Page 43: Lecture 2: Simple Linear Regression

Estimation of Error Variance

Where is s in the Excel output?


Remember that whenever you see "standard error", read it as estimated standard deviation: s estimates the standard deviation σ.

43

Page 44: Lecture 2: Simple Linear Regression

Sampling Distribution of Least Squares Estimates

How much do our estimates depend on the particular random

sample that we happen to observe? Imagine:

- Randomly draw different samples of the same size.

- For each sample, compute the estimates b0, b1, and s.

If the estimates don’t vary much from sample to sample, then it

doesn’t matter which sample you happen to observe.

If the estimates do vary a lot, then it matters which sample you

happen to observe.

44

Page 45: Lecture 2: Simple Linear Regression

Sampling Distribution of Least Squares Estimates

45

Page 46: Lecture 2: Simple Linear Regression

Sampling Distribution of Least Squares Estimates

46

Page 47: Lecture 2: Simple Linear Regression

Sampling Distribution of Least Squares Estimates

LS lines are much closer to the true line when n = 50.

For n = 5, some lines are close, others aren’t:

we need to get “lucky”
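A small R simulation in the spirit of these slides (the true parameter values below are assumptions for illustration, not the ones behind the figures): draw many samples, fit least squares each time, and compare the spread of the slope estimates for n = 5 versus n = 50.

true_b0 <- 40; true_b1 <- 35; sigma <- 10
sim_slopes <- function(n, reps = 1000) {
  replicate(reps, {
    x <- runif(n, 1, 3.5)
    y <- true_b0 + true_b1 * x + rnorm(n, 0, sigma)
    coef(lm(y ~ x))[2]          # the slope estimate b1 for this sample
  })
}
set.seed(5)
sd(sim_slopes(5))    # large spread: which sample you observe matters a lot
sd(sim_slopes(50))   # much smaller spread: the estimates barely depend on the sample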

47

Page 48: Lecture 2: Simple Linear Regression

Review: Sampling Distribution of Sample Mean

Step back for a moment and consider the sample mean of an iid sample of n observations {X1, ..., Xn} of a random variable X.

Suppose that E(Xi) = µ and var(Xi) = σ².

- E(X̄) = (1/n) ∑ E(Xi) = µ

- var(X̄) = var((1/n) ∑ Xi) = (1/n²) ∑ var(Xi) = σ²/n

If X is normal, then X̄ ∼ N(µ, σ²/n).

If X is not normal, we have the central limit theorem (more in a

minute)!

48

Page 49: Lecture 2: Simple Linear Regression

Oracle vs SAP Example (understanding variation)

49

Page 50: Lecture 2: Simple Linear Regression

Oracle vs SAP

50

Page 51: Lecture 2: Simple Linear Regression

Oracle vs SAP

Do you really believe that SAP affects ROE?

How else could we look at this question?

51

Page 52: Lecture 2: Simple Linear Regression

Central Limit Theorem

The simple CLT states that for iid random variables X with mean µ and variance σ², the distribution of the sample mean becomes normal as the number of observations, n, gets large.

That is, X̄ → N(µ, σ²/n) as n grows: sample averages tend to be normally distributed in large samples.
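The histograms on the next slides were produced with one-liners like the apply(...) expressions shown there; here is a slightly expanded R sketch (the hist() call and its label are assumptions about how the figures were made):

n <- 10
xbar <- apply(matrix(rexp(n * 1000), ncol = 1000), 2, mean)   # 1000 sample means
hist(xbar, main = paste("1000 means from n =", n, "samples"))
# As n grows, the histogram of xbar looks more and more like N(1, 1/n).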

52

Page 53: Lecture 2: Simple Linear Regression

Central Limit Theorem

Exponential random variables don’t look very normal:

[Plot of the exponential density f(X), which is strongly right-skewed]

E [X ] = 1 and var(X ) = 1.

53

Page 54: Lecture 2: Simple Linear Regression

Central Limit Theorem

1000 means from n=2 samples

apply(matrix(rexp(2 * 1000), ncol = 1000), 2, mean)

[Histogram of the 1000 sample means]

54

Page 55: Lecture 2: Simple Linear Regression

Central Limit Theorem

1000 means from n=5 samples

apply(matrix(rexp(5 * 1000), ncol = 1000), 2, mean)

[Histogram of the 1000 sample means]

55

Page 56: Lecture 2: Simple Linear Regression

Central Limit Theorem

1000 means from n=10 samples

apply(matrix(rexp(10 * 1000), ncol = 1000), 2, mean)

[Histogram of the 1000 sample means]

56

Page 57: Lecture 2: Simple Linear Regression

Central Limit Theorem

1000 means from n=100 samples

apply(matrix(rexp(100 * 1000), ncol = 1000), 2, mean)

[Histogram of the 1000 sample means]

57

Page 58: Lecture 2: Simple Linear Regression

Central Limit Theorem

1000 means from n=1k samples

apply(matrix(rexp(1000 * 1000), ncol = 1000), 2, mean)

[Histogram of the 1000 sample means]

58

Page 59: Lecture 2: Simple Linear Regression

Sampling Distribution of b1

The sampling distribution of b1 describes how the estimator b1 = β̂1 varies over different samples with the X values fixed.

It turns out that b1 is normally distributed: b1 ∼ N(β1, σ²b1).

- b1 is unbiased: E[b1] = β1.

- The sampling sd σb1 determines the precision of b1.

59

Page 60: Lecture 2: Simple Linear Regression

Sampling Distribution of b1

Can we intuit what should be in the formula for σb1?

- How should σ figure in the formula?

- What about n?

- Anything else?

var(b1) = σ² / ∑ (Xi − X̄)² = σ² / ((n − 1) sx²)

Three factors: sample size (n), error variance (σ² = σ²ε), and X-spread (sx).
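A quick R check of this formula (fixed design, simulated errors; the parameter values are assumptions for illustration):

set.seed(6)
n <- 20; sigma <- 10
x <- runif(n, 1, 3.5)                      # X values held fixed across samples
b1_draws <- replicate(2000, {
  y <- 40 + 35 * x + rnorm(n, 0, sigma)
  coef(lm(y ~ x))[2]
})
c(empirical_sd = sd(b1_draws),
  formula_sd   = sigma / sqrt((n - 1) * var(x)))   # the two should be close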

60

Page 61: Lecture 2: Simple Linear Regression

Sampling Distribution of b0

The intercept is also normal and unbiased: b0 ∼ N(β0, σ²b0), where

σ²b0 = var(b0) = σ² [ 1/n + X̄² / ((n − 1) sx²) ]

What is the intuition here?

61

Page 62: Lecture 2: Simple Linear Regression

The Importance of Understanding Variation

When estimating a quantity, it is vital to develop a notion of the

precision of the estimation; for example:

- estimate the slope of the regression line

- estimate the value of a house given its size

- estimate the expected return on a portfolio

- estimate the value of a brand name

- estimate the damages from patent infringement

Why is this important?

We are making decisions based on estimates, and these may be

very sensitive to the accuracy of the estimates!

62

Page 63: Lecture 2: Simple Linear Regression

The Importance of Understanding Variation

Example from “everyday” life:

- When building a house, we can estimate a required piece of wood to within 1/4".

- When building a fine cabinet, the estimates may have to be accurate to 1/16" or even 1/32".

The standard deviations of the least squares estimators of the

slope and intercept give a precise measurement of the accuracy of

the estimator.

However, these formulas aren’t especially practical

since they involve the unknown quantity: σ

63

Page 64: Lecture 2: Simple Linear Regression

Estimated Variance

We estimate variation with “sample standard deviations”:

sb1 = √[ s² / ((n − 1) sx²) ]        sb0 = √[ s² ( 1/n + X̄² / ((n − 1) sx²) ) ]

Recall that s = √(∑ ei² / (n − 2)) is the estimator of σ = σε.

Hence, sb1 = σ̂b1 and sb0 = σ̂b0 are the estimated coefficient sd's.

A high level of info/precision/accuracy means small sb values.
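An illustrative R check of these standard errors (simulated data standing in for the examples on the slides):

set.seed(7)
x <- runif(30, 1, 3.5)
y <- 40 + 35 * x + rnorm(30, 0, 10)
fit <- lm(y ~ x)

n   <- length(y)
s   <- sqrt(sum(resid(fit)^2) / (n - 2))
sx2 <- var(x)                                # sample variance of x
sb1 <- sqrt(s^2 / ((n - 1) * sx2))
sb0 <- sqrt(s^2 * (1/n + mean(x)^2 / ((n - 1) * sx2)))

rbind(by_hand = c(sb0, sb1),
      from_lm = summary(fit)$coefficients[, "Std. Error"])   # identical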

64

Page 65: Lecture 2: Simple Linear Regression

Normal and Student’s t

Recall what Student discovered:

If θ ∼ N(µ, σ²), but you estimate σ² ≈ s² based on n − p degrees of freedom, then θ ∼ tn−p(µ, s²).

For example:

- X̄ ∼ tn−1(µ, s²y/n).

- b0 ∼ tn−2(β0, s²b0) and b1 ∼ tn−2(β1, s²b1).

The t distribution is just a fat-tailed version of the normal. As n − p → ∞, our tails get skinny and the t becomes normal.

65

Page 66: Lecture 2: Simple Linear Regression

Standardized Normal and Student’s t

We’ll also usually standardize things:

(bj − βj) / σbj ∼ N(0, 1)  ⇒  (bj − βj) / sbj ∼ tn−2(0, 1)

We use Z ∼ N(0, 1) and Zn−p ∼ tn−p(0, 1) to

represent standard random variables.

Notice that the t and normal distributions depend upon assumed

values for βj : this forms the basis for confidence intervals,

hypothesis testing, and p-values.

66

Page 67: Lecture 2: Simple Linear Regression

Testing and Confidence Intervals (in 3 slides)

Suppose Zn−p is distributed tn−p(0, 1). A centered interval is

P(−tn−p,α/2 < Zn−p < tn−p,α/2) = 1− α

67

Page 68: Lecture 2: Simple Linear Regression

Confidence Intervals

Since bj ∼ tn−p(βj, s²bj),

1 − α = P(−tn−p,α/2 < (bj − βj)/sbj < tn−p,α/2)

      = P(bj − tn−p,α/2 · sbj < βj < bj + tn−p,α/2 · sbj)

Thus (1 − α)·100% of the time, βj is within the Confidence Interval: bj ± tn−p,α/2 · sbj
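An illustrative R sketch (simulated data): a 95% confidence interval for the slope, by hand and via confint().

set.seed(8)
x <- runif(30, 1, 3.5)
y <- 40 + 35 * x + rnorm(30, 0, 10)
fit <- lm(y ~ x)

b1    <- coef(fit)[2]
sb1   <- summary(fit)$coefficients[2, "Std. Error"]
tcrit <- qt(0.975, df = 30 - 2)            # t_{n-p, alpha/2} with alpha = 0.05

c(b1 - tcrit * sb1, b1 + tcrit * sb1)      # by hand
confint(fit, "x", level = 0.95)            # same interval from R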

68

Page 69: Lecture 2: Simple Linear Regression

Testing

Similarly, suppose that assuming bj ∼ tn−p(βj, s²bj) for our sample bj leads to (recall Zn−p ∼ tn−p(0, 1))

P(Zn−p < −|bj − βj|/sbj) + P(Zn−p > |bj − βj|/sbj) = ϕ.

Then the "p-value" is ϕ = 2 P(Zn−p > |bj − βj|/sbj).

You do this calculation for βj = β⁰j, an assumed null/safe value, and only reject β⁰j if ϕ is too small (e.g., ϕ < 1/20).

In regression, β⁰j = 0 almost always.

69

Page 70: Lecture 2: Simple Linear Regression

More Detail... Confidence Intervals

Why should we care about Confidence Intervals?

- The confidence interval captures the amount of information in the data about the parameter.

- The center of the interval tells you what your estimate is.

- The length of the interval tells you how sure you are about your estimate.

70

Page 71: Lecture 2: Simple Linear Regression

More Detail... Testing

Suppose that we are interested in the slope parameter, β1.

For example, is there any evidence in the data to support the

existence of a relationship between X and Y?

We can rephrase this in terms of competing hypotheses.

H0: β1 = 0. Null/safe; implies "no effect" and we ignore X.

H1: β1 ≠ 0. Alternative; leads us to our best guess, β1 = b1.

71

Page 72: Lecture 2: Simple Linear Regression

Hypothesis Testing

If we want statistical support for a certain claim about the data,

we want that claim to be the alternative hypothesis.

Our hypothesis test will either reject or not reject the null

hypothesis (the default if our claim is not true).

If the hypothesis test rejects the null hypothesis, we have

statistical support for our claim!

72

Page 73: Lecture 2: Simple Linear Regression

Hypothesis Testing

We use bj for our test about βj .

- Reject H0 when bj is far from β⁰j (usually 0).

- Assume H0 when bj is close to β⁰j.

An obvious tactic is to look at the difference bj − β⁰j. But this measure doesn't take into account the uncertainty in estimating bj: what we really care about is how many standard deviations bj is away from β⁰j.

73

Page 74: Lecture 2: Simple Linear Regression

Hypothesis Testing

The t-statistic for this test is

zbj = (bj − β⁰j) / sbj = bj / sbj     for β⁰j = 0.

If H0 is true, this should be distributed zbj ∼ tn−p(0, 1).

- Small |zbj| leaves us happy with the null β⁰j.

- Large |zbj| (i.e., > about 2) should get us worried!

74

Page 75: Lecture 2: Simple Linear Regression

Hypothesis Testing

We assess the size of zbj with the p-value:

ϕ = P(|Zn−p| > |zbj|) = 2 P(Zn−p > |zbj|)

(once again, Zn−p ∼ tn−p(0, 1)).
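An illustrative R sketch (simulated data): the t-statistic and two-sided p-value for the slope.

set.seed(9)
x <- runif(30, 1, 3.5)
y <- 40 + 35 * x + rnorm(30, 0, 10)
fit <- lm(y ~ x)

b1  <- coef(fit)[2]
sb1 <- summary(fit)$coefficients[2, "Std. Error"]
z   <- b1 / sb1                             # testing the null beta1 = 0
phi <- 2 * pt(-abs(z), df = 30 - 2)         # two-sided p-value

c(z, phi)
summary(fit)$coefficients[2, c("t value", "Pr(>|t|)")]   # the same numbers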

[Plot of the t density p(Z) with 8 df: the two tail areas beyond −z and z sum to the p-value of 0.05]

75

Page 76: Lecture 2: Simple Linear Regression

Hypothesis Testing

The p-value is the probability, assuming that the null hypothesis is true, of seeing something more extreme (further from the null) than what we have observed.

You can think of 1− ϕ (inverse p-value) as a measure of distance

between the data and the null hypothesis. In other words, 1− ϕ is

the strength of evidence against the null.

76

Page 77: Lecture 2: Simple Linear Regression

Hypothesis Testing

The formal 2-step approach to hypothesis testing

- Pick the significance level α (often 1/20 = 0.05), our acceptable risk (probability) of rejecting a true null hypothesis (we call this a type 1 error). This α plays the same role as α in CIs.

- Calculate the p-value, and reject H0 if ϕ < α (in favor of our best alternative guess; e.g., βj = bj). If ϕ > α, continue working under the null assumptions.

This is equivalent to having the rejection region |zbj| > tn−p,α/2.

77

Page 78: Lecture 2: Simple Linear Regression

Example: Hypothesis Testing

Consider again a CAPM regression for the Windsor fund.

Does Windsor have a non-zero intercept?

(i.e., does it make/lose money independent of the market?).

H0: β0 = 0 and there is no free money.

H1: β0 ≠ 0 and Windsor makes money regardless of the market.

78

Page 79: Lecture 2: Simple Linear Regression

Example: Hypothesis Testing

[Excel regression output for the Windsor fund example]

It turns out that we reject the null at α = .05 (ϕ = .0105). Thus

Windsor does have an “alpha” over the market.

79

Page 80: Lecture 2: Simple Linear Regression

Example: Hypothesis Testing

Looking at the slope, this is a very rare case where the null

hypothesis is not zero:

H0: β1 = 1 Windsor is just the market (+ alpha).

H1: β1 ≠ 1 and Windsor softens or exaggerates market moves.

We are asking whether or not Windsor moves in a different way

than the market (e.g., is it more conservative?).

Now,

t = (b1 − 1) / sb1 = −0.0643 / 0.0291 = −2.205

tn−2,α/2 = t178,0.025 = 1.96

Since |t| > 1.96, we reject H0 at the 5% level.

80

Page 81: Lecture 2: Simple Linear Regression

Forecasting

The conditional forecasting problem: given a covariate value Xf and sample data {Xi, Yi}, i = 1, ..., n, predict the "future" observation Yf.

The solution is to use our LS fitted value: Ŷf = b0 + b1Xf.

This is the easy bit. The hard (and very important!) part of

forecasting is assessing uncertainty about our predictions.

81

Page 82: Lecture 2: Simple Linear Regression

Forecasting

If we use Ŷf, our prediction error is

ef = Yf − Ŷf = Yf − b0 − b1Xf

82

Page 83: Lecture 2: Simple Linear Regression

Forecasting

This can get quite complicated! A simple strategy is to build the following (1 − α)·100% prediction interval:

b0 + b1Xf ± tn−2,α/2 · s

A large predictive error variance (high uncertainty) comes from

- Large s (i.e., large ε's).

- Small n (not enough data).

- Small sx (not enough observed spread in covariates).

- A large difference between Xf and X̄.

Just remember that you are uncertain about b0 and b1! Reasonably inflating the uncertainty in the interval above is always a good idea... as always, this is problem dependent.

83
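An illustrative R sketch (simulated data): the simple interval above versus the full prediction interval from predict(), which also accounts for the uncertainty in b0 and b1.

set.seed(10)
x <- runif(30, 1, 3.5)
y <- 40 + 35 * x + rnorm(30, 0, 10)
fit <- lm(y ~ x)

Xf     <- 3.0
s      <- summary(fit)$sigma
tcrit  <- qt(0.975, df = 30 - 2)
yf_hat <- predict(fit, newdata = data.frame(x = Xf))

c(yf_hat - tcrit * s, yf_hat + tcrit * s)                            # simple interval
predict(fit, newdata = data.frame(x = Xf), interval = "prediction")  # slightly wider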

Page 84: Lecture 2: Simple Linear Regression

Forecasting

For Xf far from our X̄, the space between the lines is magnified...

84

Page 85: Lecture 2: Simple Linear Regression

Glossary and Equations

- Ŷi = b0 + b1Xi is the ith fitted value.

- ei = Yi − Ŷi is the ith residual.

- s: the standard error of the regression residuals (≈ σ = σε).

  s² = (1/(n − 2)) ∑ ei²

- sbj: the standard error of a regression coefficient.

  sb1 = √[ s² / ((n − 1) sx²) ]        sb0 = s √[ 1/n + X̄² / ((n − 1) sx²) ]

85

Page 86: Lecture 2: Simple Linear Regression

Glossary and Equations

- α is the significance level (probability of a type 1 error).

- tn−p,α/2 is the value such that, for Zn−p ∼ tn−p(0, 1), P(Zn−p > tn−p,α/2) = P(Zn−p < −tn−p,α/2) = α/2.

- zbj ∼ tn−p(0, 1) is the standardized coefficient t-value:

  zbj = (bj − β⁰j) / sbj   (= bj/sbj most often)

- The (1 − α)·100% confidence interval for βj is bj ± tn−p,α/2 · sbj.

86