REGRESSION MODEL ASSUMPTIONS
Post on 21-Dec-2015
The Regression Model
• We have hypothesized that:
$$y = \underbrace{\beta_0 + \beta_1 x}_{\text{Regression}} + \underbrace{\varepsilon}_{\text{Error}}$$
• So far we focused on the regression part: getting the best estimates for the $\beta$'s
• Here we focus on the error term, $\varepsilon$
THE RANDOM VARIABLE, $\varepsilon$
• The error term, $\varepsilon$, is a random variable that describes how the observed values, $y_i$, vary around the regression line.
• For any value of x, $\varepsilon$ has a distribution with a mean $\mu_\varepsilon$ and a standard deviation $\sigma$
• At any x value $x_i$, the observed value of the error term is called its residual, given by:
$$e_i = y_i - \hat{y}_i$$
STEP 3: 4 ASSUMPTIONS ABOUT $\varepsilon$
The remainder of our discussion about linear regression assumes the following about $\varepsilon$:
• (1) DISTRIBUTION: $\varepsilon$ is distributed normally
• (2) MEAN: The errors average out to 0, i.e. $E(\varepsilon) = \mu_\varepsilon = 0$
• (3) STANDARD DEVIATION: $\sigma$ is the same at all values of x
• (4) INDEPENDENCE: The errors are independent of each other
What Do These Assumptions Imply About y?
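• $y = \beta_0 + \beta_1 x + \varepsilon$
– $\beta_0 + \beta_1 x$ is a constant for a given value of x
– $\varepsilon$ is normally distributed with mean 0 and standard deviation $\sigma$
• Thus y is normally distributed with standard deviation $\sigma$ and mean E(y):
$$E(y) = E(\beta_0 + \beta_1 x + \varepsilon) = E(\beta_0 + \beta_1 x) + E(\varepsilon) = \beta_0 + \beta_1 x + 0 = \beta_0 + \beta_1 x$$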
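The claim that y, at a fixed x, is normal with mean $\beta_0 + \beta_1 x$ and standard deviation $\sigma$ can be illustrated with a small simulation. This is only a sketch: the "true" parameter values below are assumptions chosen for illustration (they borrow the estimates from the example later in these slides), and `simulate_y` is a hypothetical helper name.

```python
import random
import statistics

# Assumed "true" parameters, for illustration only (not known in practice).
BETA0, BETA1, SIGMA = 46486.49, 52.56757, 6837.15


def simulate_y(x, n_draws, seed=0):
    """Draw y = beta0 + beta1*x + eps, with eps ~ Normal(0, sigma)."""
    rng = random.Random(seed)
    return [BETA0 + BETA1 * x + rng.gauss(0.0, SIGMA) for _ in range(n_draws)]


ys = simulate_y(x=1000, n_draws=20000)
expected_mean = BETA0 + BETA1 * 1000   # E(y) = beta0 + beta1*x
sample_mean = statistics.fmean(ys)     # should land near expected_mean
sample_sd = statistics.stdev(ys)       # should land near SIGMA
print(expected_mean, sample_mean, sample_sd)
```

With many draws, the sample mean and standard deviation of the simulated y values should sit close to $\beta_0 + \beta_1 x$ and $\sigma$ respectively.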
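BEST ESTIMATE FOR $\sigma$
• The true value of $\sigma$ is unknown.
• It can be estimated by s as follows:
$$s^2 = \frac{SSE}{\text{Degrees of Freedom}} = \frac{\sum (y_i - \hat{y}_i)^2}{\text{Degrees of Freedom}}$$
• Degrees of freedom = n - (# quantities being estimated).
• Here we are estimating two quantities: $\beta_0$ and $\beta_1$.
• Thus degrees of freedom = n - 2, and
$$s^2 = \frac{SSE}{n-2} = \frac{\sum (y_i - \hat{y}_i)^2}{n-2} \quad \text{and} \quad s = \sqrt{s^2}$$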
Hand Calculation of SSE
Recall that $\hat{y}_i = 46486.49 + 52.56757\,x_i$

  i     x_i      y_i         ŷ_i     (y_i - ŷ_i)^2
  1    1200   101000   109567.57     73403214.02
  2     800    92000    88540.54     11967859.75
  3    1000   110000    99054.05    119813732.7
  4    1300   120000   114824.32     26787618.7
  5     700    90000    83283.78     45107560.26
  6     800    82000    88540.54     42778670.56
  7    1000    93000    99054.05     36651570.49
  8     600    75000    78027.03      9162892.622
  9     900    91000    93797.30      7824872.169
 10    1100   105000   104310.81       474981.7385
                       SUM (SSE)    373972972.97
$$s^2 = \frac{SSE}{n-2} = \frac{373972972.97}{8} = 46746621.62$$
$$s = \sqrt{46746621.62} = 6837.15$$
[Figure: Excel regression output, annotated to show s, the Residual (Error) row, SSE, and SSE/(n-2) = s^2]
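The hand calculation can be reproduced in a few lines. The sketch below assumes the line was fitted by ordinary least squares (the slides only say "best estimates") and uses the ten (x, y) pairs from the table; `fit_line` is a hypothetical helper name.

```python
import math

# The ten observations from the hand-calculation table.
x = [1200, 800, 1000, 1300, 700, 800, 1000, 600, 900, 1100]
y = [101000, 92000, 110000, 120000, 90000, 82000, 93000, 75000, 91000, 105000]


def fit_line(x, y):
    """Least-squares intercept b0 and slope b1 (assumed fitting method)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


b0, b1 = fit_line(x, y)

# SSE is the sum of squared residuals; s^2 divides it by n - 2.
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s_squared = sse / (len(x) - 2)
s = math.sqrt(s_squared)

print(round(b0, 2), round(b1, 5), round(sse, 2), round(s, 2))
```

The divisor is n - 2 rather than n because two quantities, the intercept and the slope, were estimated from the same data.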
Checking the Assumptions
• Many times it is just assumed that the assumptions hold.
• We now show how to check the assumptions.
Residuals
• The assumptions for $\varepsilon$ can be checked using RESIDUAL ANALYSIS.
• A residual, $e_i$, is the observation of $\varepsilon$ at an observed value of x, $x_i$.
• For example, in the Dollar Only example, $y_1 = 101{,}000$ when $x_1 = 1200$:
$$\hat{y}_1 = 46486.49 + 52.56757(1200) = 109{,}567.57$$
$$e_1 = 101{,}000 - 109{,}567.57 = -8{,}567.57$$
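As a minimal sketch, the same residual can be computed directly from the rounded coefficients quoted in the slides; `residual` is a hypothetical helper name.

```python
# Rounded estimates quoted in the slides for the Dollar Only example.
B0, B1 = 46486.49, 52.56757


def residual(x_i, y_i, b0=B0, b1=B1):
    """Observed value minus the value predicted by the fitted line."""
    return y_i - (b0 + b1 * x_i)


e1 = residual(1200, 101000)
print(round(e1, 2))
```

A negative residual means the observed y lies below the fitted line at that x.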
Standardized Residuals
• Is a residual of -8,567.57 large?
– It depends on the size of a standard error, s.
• Standardized residual = $e_i$/(standard error of $e_i$ for $x_i$).
• Standardized residuals are easier to use to test the assumptions.
• Two typical ways of calculating the standardized residual for a particular $x_i$ value are:
$$\frac{e_i}{s} \qquad \text{or} \qquad \frac{e_i}{s\sqrt{1-h_i}}, \quad \text{where } h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_j (x_j - \bar{x})^2}$$
• Both approaches yield substantially the same results.
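The two standardizations can be compared side by side on the Dollar Only data. This is a sketch assuming a least-squares fit; the function name and return layout are hypothetical.

```python
import math

x = [1200, 800, 1000, 1300, 700, 800, 1000, 600, 900, 1100]
y = [101000, 92000, 110000, 120000, 90000, 82000, 93000, 75000, 91000, 105000]


def standardized_residuals(x, y):
    """Return (e_i/s, e_i/(s*sqrt(1-h_i)), h_i) for a least-squares line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    s = math.sqrt(sum(ei ** 2 for ei in e) / (n - 2))
    # h_i grows as x_i moves away from the mean of x.
    h = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    simple = [ei / s for ei in e]
    adjusted = [ei / (s * math.sqrt(1 - hi)) for ei, hi in zip(e, h)]
    return simple, adjusted, h


simple, adjusted, h = standardized_residuals(x, y)
for zs, za in zip(simple, adjusted):
    print(round(zs, 3), round(za, 3))
```

Because $1 - h_i$ is below 1, the second form is always slightly larger in magnitude, and the gap is widest for x values far from $\bar{x}$.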
Standardized Residuals in Excel
• Excel uses the following formula:
$$\frac{e_i}{s\sqrt{\dfrac{n-2}{n-1}}}$$
This still gives approximately the same values as the other methods. We will use the ones generated by Excel to check the assumptions.
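A minimal sketch of the formula the slides attribute to Excel, applied to the ten residuals from the hand-calculation table (rounded to two decimals). The function name is hypothetical; the last lines check that dividing by $s\sqrt{(n-2)/(n-1)}$ is the same as dividing by $\sqrt{SSE/(n-1)}$.

```python
import math

# Residuals y_i - y_hat_i from the hand-calculation table, rounded.
residuals = [-8567.57, 3459.46, 10945.95, 5175.68, 6716.22,
             -6540.54, -6054.05, -3027.03, -2797.30, 689.19]


def excel_style_standardized(residuals):
    """e_i / (s * sqrt((n-2)/(n-1))), with s^2 = SSE/(n-2)."""
    n = len(residuals)
    sse = sum(e * e for e in residuals)
    s = math.sqrt(sse / (n - 2))
    return [e / (s * math.sqrt((n - 2) / (n - 1))) for e in residuals]


z_excel = excel_style_standardized(residuals)

# Equivalent form: divide by sqrt(SSE / (n - 1)).
n = len(residuals)
sse = sum(e * e for e in residuals)
z_alt = [e / math.sqrt(sse / (n - 1)) for e in residuals]

print([round(z, 3) for z in z_excel])
```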
Checking to See if Errors (Residuals) Appear to Come From a Normal Distribution
TWO WAYS TO CHECK
• Construct a plot of standardized residuals and see if they look normal
– Could use Histogram from Data Analysis
– A "quick check": standardized residuals are like z-values. Check to see if about 68% are between ±1, 95% between ±2, and virtually all between ±3.
• Look at a normal probability plot. These are statistical plots to check for "normality". A "perfect" normal distribution would be a straight line on such a plot.
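The "quick check" amounts to counting how many standardized residuals fall inside each band. The sketch below uses made-up z-values purely for illustration (they are not from the slides), and `fraction_within` is a hypothetical helper name.

```python
def fraction_within(z_values, k):
    """Share of standardized residuals with |z| <= k."""
    return sum(1 for z in z_values if abs(z) <= k) / len(z_values)


# Made-up standardized residuals, for illustration only.
z_demo = [-1.2, -0.4, 0.1, 0.3, 0.8, 1.5, -0.7, 0.2, -2.1, 0.6]

within_1 = fraction_within(z_demo, 1)  # compare with about 0.68
within_2 = fraction_within(z_demo, 2)  # compare with about 0.95
within_3 = fraction_within(z_demo, 3)  # compare with nearly 1.0
print(within_1, within_2, within_3)
```

With only a handful of observations the fractions move in large steps (each point is 10% of a sample of ten), so this is a rough screen rather than a formal test.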
Checking to See if $\sigma$ Is Constant
• Look at the residual plot to see if the points seem more spread out at some x's than at others. In the Dollar Only example, it did not appear so on the Excel residual plot.
• Constant $\sigma$ is called homoscedasticity!
• If the points had looked like the plot below, with less variation at lower values of x than at higher values, the constant variation assumption would have been violated. This is called heteroscedasticity!
[Figure: residual plot of e against x. Heteroscedasticity: Nonconstant Variance]
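The slides check constant variance by eye. As a rough numeric companion (an assumption of this sketch, not a method from the slides and not a formal test), one can compare the spread of the residuals in the lower-x half of the data with the spread in the upper-x half. The data and the name `spread_by_half` are made up for illustration.

```python
import math


def spread_by_half(x, residuals):
    """Root-mean-square residual in the lower-x half and the upper-x half."""
    pairs = sorted(zip(x, residuals))  # order the points by x
    mid = len(pairs) // 2

    def rms(values):
        return math.sqrt(sum(v * v for v in values) / len(values))

    low = rms([e for _, e in pairs[:mid]])
    high = rms([e for _, e in pairs[mid:]])
    return low, high


# Made-up residuals whose spread grows with x (heteroscedastic pattern).
x_demo = [1, 2, 3, 4, 5, 6, 7, 8]
e_demo = [1, -1, 1, -1, 4, -4, 4, -4]

low, high = spread_by_half(x_demo, e_demo)
print(low, high)
```

Similar values in the two halves are consistent with homoscedasticity; a large gap, as in this made-up data, points to the fan-shaped pattern described above.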
Checking Independence
• This is mainly for time series data (i.e. the x-axis is time) used in forecasting
• But basically, if the data look like the plot below, the errors are not independent
– In this case whether you have a positive or negative error (residual) depends on the x-value.
– This is called autocorrelation.
[Figure: Y plotted against X = time. Example of Autocorrelation (Errors are Dependent on x)]
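One simple way to put a number on this pattern is the lag-1 sample autocorrelation of the residuals taken in time order. This diagnostic is not described in the slides; it is offered here as a sketch, assuming the residuals average out to roughly 0. The residual series below are made up.

```python
def lag1_autocorrelation(residuals):
    """Sum of e_t * e_(t-1) divided by sum of e_t^2 (residuals in time order)."""
    numerator = sum(a * b for a, b in zip(residuals[1:], residuals[:-1]))
    denominator = sum(e * e for e in residuals)
    return numerator / denominator


# Made-up series: long runs of one sign, then the other.
trending = [3, 2, 1, -1, -2, -3]
# Made-up series: the sign flips at every step.
alternating = [1, -1, 1, -1]

r_trending = lag1_autocorrelation(trending)
r_alternating = lag1_autocorrelation(alternating)
print(r_trending, r_alternating)
```

Values near 0 are what independent errors would tend to give; clearly positive values indicate that neighbouring residuals share a sign, and clearly negative values indicate that they alternate.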
Residual Analysis in Excel
CHECK:
• Residuals
• Standardized Residuals
• Residual Plots
• Normal Probability Plots
Standardized Residuals
• 70% are between ±1
• 100% are between ±2
• "Close" to expected normal values
Residual plot
• Residual values appear to average out to 0 everywhere.
• There is no discernible pattern for the errors.
Normal Probability Plot
• The following is the normal probability plot generated by Excel. Again Excel does it “slightly wrong”, but it should give us a good idea.
• Looks close to a straight line – normality assumption appears valid.
[Figure: Normal Probability Plot. Horizontal axis: Sample Percentile (0 to 100). Vertical axis: Sales (0 to 150000)]
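For comparison with the Excel plot, the points of a conventional normal probability plot can be built by pairing the sorted residuals with standard normal quantiles. This is a sketch of one common construction, not the method Excel uses; the plotting positions (i + 0.5)/n and the helper names are assumptions.

```python
import math
from statistics import NormalDist

# Residuals from the hand-calculation table, rounded to two decimals.
residuals = [-8567.57, 3459.46, 10945.95, 5175.68, 6716.22,
             -6540.54, -6054.05, -3027.03, -2797.30, 689.19]


def normal_plot_points(values):
    """Pair each sorted value with a standard normal quantile."""
    n = len(values)
    ordered = sorted(values)
    scores = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(scores, ordered))


def pearson_r(pairs):
    """Correlation between the two coordinates of the plotted points."""
    n = len(pairs)
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    sab = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    saa = sum((a - mean_a) ** 2 for a, _ in pairs)
    sbb = sum((b - mean_b) ** 2 for _, b in pairs)
    return sab / math.sqrt(saa * sbb)


points = normal_plot_points(residuals)
r = pearson_r(points)
print(round(r, 3))
```

A correlation close to 1 means the points lie near a straight line, which is what the normality assumption predicts.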
Review
• 4 assumptions about $\varepsilon$
1. $\varepsilon$ is normal.
2. $\mu_\varepsilon = E(\varepsilon) = 0$.
3. $\sigma$ is the same for all values of x.
4. Errors are independent.
• Checking The Assumptions
– Check residual plot to see if variation changes for different values of x.
– Check normality assumption by a normal probability plot or by creating a histogram of standardized residuals.
• Does it appear normal and centered around 0?
• Are about 68% between ±1, 95% between ±2, almost all between ±3?