BUS B272F Unit 1: ANOVA and Linear Regression
Analysis of Variance (ANOVA)
Analysis of Variance
The Analysis of Variance (ANOVA) is a procedure that tests to determine whether differences exist between two or more populations.
The technique analyzes the variance of the data to determine whether we can infer that the populations differ.
Topics
• One-way (single-factor) analysis of variance
• ANOVA assumptions
• F test for differences among k means
General Experimental Setting
• Investigator controls one or more independent variables
  – Called treatments or factors
  – Each treatment contains two or more levels (or categories/classifications)
• Observe effects on the dependent variable
  – Response to different levels of the independent variable
• Experimental design: the plan used to test the hypothesis
Completely Randomized Design
• Experimental units (subjects) are assigned randomly to treatments
  – Subjects are assumed homogeneous
• Only one factor or independent variable
  – With two or more treatment levels
• Analyzed by one-way analysis of variance (one-way ANOVA)
Randomized Design Example
Factor: Training Method, with three factor levels (treatments). Units are randomly assigned to the levels, and the dependent variable (response) is recorded in hours.
Level 1 | Level 2 | Level 3
21 hrs | 17 hrs | 31 hrs
27 hrs | 25 hrs | 28 hrs
29 hrs | 20 hrs | 22 hrs
One-way Analysis of Variance F Test
Evaluate the difference among the mean responses of 2 or more (k) populations.
e.g. several types of tires, oven temperature settings, different types of marketing strategies
Assumptions of ANOVA
• Samples are randomly and independently drawn
  – This condition must be met
• Populations are normally distributed
  – F test is robust to moderate departure from normality
• Populations have equal variances
Hypotheses of One-Way ANOVA
$H_0: \mu_1 = \mu_2 = \cdots = \mu_k$
• All population means are equal
• No treatment effect (no variation in means among groups)
$H_1$: Not all $\mu_i$ are the same
• At least one population mean is different (others may be the same!)
• There is a treatment effect
• Does not mean that all population means are different
One-way ANOVA (No Treatment Effect)
$H_0: \mu_1 = \mu_2 = \cdots = \mu_k$
$H_1$: Not all $\mu_i$ are the same
The null hypothesis is true: $\mu_1 = \mu_2 = \mu_3$
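One-way ANOVA (Treatment Effect Present)
$H_0: \mu_1 = \mu_2 = \cdots = \mu_k$
$H_1$: Not all $\mu_i$ are the same
The null hypothesis is NOT true
[Figure: two example configurations of the population means $\mu_1$, $\mu_2$, $\mu_3$ in which at least one mean differs]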
One-way ANOVA (Partition of Total Variation)
Total Variation SS(Total) = Variation Due to Treatment SST + Variation Due to Random Sampling SSE
ANOVA set-up
Total Variation
$SS(Total) = \sum_{j=1}^{k} \sum_{i=1}^{n_j} (X_{ij} - \bar{\bar{X}})^2$
$X_{ij}$: the i-th observation in group j
$n_j$: the number of observations in group j
n: the total number of observations in all groups
k: the number of groups
$\bar{\bar{X}} = \frac{1}{n} \sum_{j=1}^{k} \sum_{i=1}^{n_j} X_{ij}$: the overall or grand mean
Total Variation (continued)
$SS(Total) = (X_{11} - \bar{\bar{X}})^2 + (X_{21} - \bar{\bar{X}})^2 + \cdots + (X_{n_k k} - \bar{\bar{X}})^2$
[Figure: responses X for Group 1, Group 2 and Group 3 plotted around the grand mean $\bar{\bar{X}}$]
Among-Treatments Variation
Variation due to differences among groups
$SST = \sum_{j=1}^{k} n_j (\bar{X}_j - \bar{\bar{X}})^2$
$MST = \frac{SST}{k - 1}$
$\bar{X}_j$: the sample mean of group j
$\bar{\bar{X}}$: the overall or grand mean
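Among-Treatments Variation (continued)
$SST = n_1(\bar{X}_1 - \bar{\bar{X}})^2 + n_2(\bar{X}_2 - \bar{\bar{X}})^2 + \cdots + n_k(\bar{X}_k - \bar{\bar{X}})^2$
[Figure: responses X for Group 1, Group 2 and Group 3, with the group means $\bar{X}_1$, $\bar{X}_2$, $\bar{X}_3$ and the grand mean $\bar{\bar{X}}$]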
Within-Treatment Variation
Summing the variation within each treatment and then adding over all treatments.
$SSE = \sum_{j=1}^{k} \sum_{i=1}^{n_j} (X_{ij} - \bar{X}_j)^2$
$MSE = \frac{SSE}{n - k}$
$X_{ij}$: the i-th observation in group j
$\bar{X}_j$: the sample mean of group j
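Within-Treatment Variation (continued)
$SSE = (X_{11} - \bar{X}_1)^2 + (X_{21} - \bar{X}_1)^2 + \cdots + (X_{n_1 1} - \bar{X}_1)^2$
$\quad + (X_{12} - \bar{X}_2)^2 + (X_{22} - \bar{X}_2)^2 + \cdots + (X_{n_2 2} - \bar{X}_2)^2$
$\quad + \cdots + (X_{1k} - \bar{X}_k)^2 + (X_{2k} - \bar{X}_k)^2 + \cdots + (X_{n_k k} - \bar{X}_k)^2$
[Figure: responses X for Group 1, Group 2 and Group 3 plotted around their group means $\bar{X}_1$, $\bar{X}_2$, $\bar{X}_3$]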
Within-Treatment Variation (continued)
$MSE = \frac{SSE}{n - k} = \frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2 + \cdots + (n_k - 1)S_k^2}{(n_1 - 1) + (n_2 - 1) + \cdots + (n_k - 1)}$
For k = 2, this is the pooled variance in the t-test.
• If more than 2 groups, use F test.
• For 2 groups, use t-test. F test is more limited.
One-way ANOVA F Test Statistic
Test statistic: $F = \frac{MST}{MSE}$
• MST is the mean square among (between) treatments
• MSE is the mean square within treatments (error)
Degrees of freedom: $df_1 = k - 1$, $df_2 = n - k$
One-way ANOVA Summary Table
Source of Variation | Degrees of Freedom | Sum of Squares | Mean Squares (Variance) | F Statistic
Among (Treatment) | k – 1 | SST | MST = SST/(k – 1) | MST/MSE
Within (Error) | n – k | SSE | MSE = SSE/(n – k) |
Total | n – 1 | SS(Total) = SST + SSE | |
Features of One-way ANOVA F Statistic
• The F statistic is the ratio of the among estimate of variance to the within estimate of variance.
  – The ratio can never be negative
  – df1 = k – 1 will typically be small
  – df2 = n – k will typically be large
• The ratio should be close to 1 if the null hypothesis is true.
One-way ANOVA F Test Example
As production manager, you want to see if three filling machines have different mean filling times. You assign 15 similarly trained and experienced workers, five per machine, to the machines. At the 0.05 significance level, is there a difference in mean filling times?
Machine 1 | Machine 2 | Machine 3
25.40 | 23.40 | 20.00
26.31 | 21.80 | 22.20
24.10 | 23.50 | 19.75
23.74 | 22.75 | 20.60
25.10 | 21.60 | 20.40
One-way ANOVA Example: Scatter Diagram
[Figure: scatter diagram of filling time in seconds (vertical axis, 19 to 27) for Machine 1, Machine 2 and Machine 3, marking the group means and the grand mean]
$\bar{X}_1 = 24.93$, $\bar{X}_2 = 22.61$, $\bar{X}_3 = 20.59$, $\bar{\bar{X}} = 22.71$
One-way ANOVA Example Computations
$\bar{X}_1 = 24.93$, $\bar{X}_2 = 22.61$, $\bar{X}_3 = 20.59$, $\bar{\bar{X}} = 22.71$
$n_j = 5$, $k = 3$, $n = 15$
$SST = 5[(24.93 - 22.71)^2 + (22.61 - 22.71)^2 + (20.59 - 22.71)^2] = 47.164$
$SSE = 4.2592 + 3.112 + 3.682 = 11.0532$
$MST = \frac{SST}{k - 1} = \frac{47.164}{2} = 23.5820$
$MSE = \frac{SSE}{n - k} = \frac{11.0532}{12} = 0.9211$
Summary Table
Source of Variation | Degrees of Freedom | Sum of Squares | Mean Squares (Variance) | F
Among (Treatment) | 3 – 1 = 2 | 47.1640 | 23.5820 | MST/MSE = 25.602
Within (Error) | 15 – 3 = 12 | 11.0532 | 0.9211 |
Total | 15 – 1 = 14 | 58.2172 | |
One-way ANOVA Example Solution
$H_0: \mu_1 = \mu_2 = \mu_3$
$H_1$: Not all the means are equal
$\alpha = 0.05$
$df_1 = 2$, $df_2 = 12$
Critical value: $F_{0.05, 2, 12} = 3.89$
Test statistic: $F = \frac{MST}{MSE} = \frac{23.5820}{0.9211} = 25.602$
Decision: Reject $H_0$ at $\alpha = 0.05$, since 25.602 > 3.89.
Conclusion: There is evidence to believe that at least one $\mu_i$ differs from the rest.
Computer Application
To obtain the Microsoft Excel output for this example, first enter the data into k columns (one per group) of an Excel worksheet, then choose Tools / Data Analysis / Anova: Single Factor.
Computer Output using Data Analysis of Excel
SUMMARY
Groups | Count | Sum | Average | Variance
Machine 1 | 5 | 124.65 | 24.93 | 1.0648
Machine 2 | 5 | 113.05 | 22.61 | 0.778
Machine 3 | 5 | 102.95 | 20.59 | 0.9205
ANOVA
Source of Variation | SS | df | MS | F | P-value | F crit
Between Groups | 47.164 | 2 | 23.582 | 25.602 | 4.684E-05 | 3.88529
Within Groups | 11.0532 | 12 | 0.9211 | | |
Total | 58.2172 | 14 | | | |
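The filling-machine figures can be cross-checked without Excel. The following is a minimal sketch in plain Python; the function name `one_way_anova` and the variable names are illustrative choices, not part of the course material. It applies the SST, SSE, MST and MSE definitions from the earlier slides directly to the raw data.

```python
from statistics import mean


def one_way_anova(groups):
    """Return (SST, SSE, MST, MSE, F) for a list of samples, one list per treatment."""
    n = sum(len(g) for g in groups)          # total number of observations
    k = len(groups)                          # number of groups
    grand_mean = sum(sum(g) for g in groups) / n
    # Among-treatments variation: weighted squared distance of group means from the grand mean
    sst = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-treatment variation: squared distance of each observation from its own group mean
    sse = sum((x - mean(g)) ** 2 for g in groups for x in g)
    mst = sst / (k - 1)
    mse = sse / (n - k)
    return sst, sse, mst, mse, mst / mse


machines = [
    [25.40, 26.31, 24.10, 23.74, 25.10],  # Machine 1
    [23.40, 21.80, 23.50, 22.75, 21.60],  # Machine 2
    [20.00, 22.20, 19.75, 20.60, 20.40],  # Machine 3
]
sst, sse, mst, mse, f_stat = one_way_anova(machines)
print(f"SST={sst:.4f} SSE={sse:.4f} MST={mst:.4f} MSE={mse:.4f} F={f_stat:.3f}")
```

The critical value (3.89 for 2 and 12 degrees of freedom) still has to come from an F table or from software, since the Python standard library has no F distribution.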
Exercise 1
The manager of a large department store wants to test if the average size of customer transactions differs with four types of payment: Visa card, company card, cash or cheque. If there are differences in the average customer transaction size among the four types of payment, the manager will further investigate which types of payment give rise to higher transaction volumes and will then design an appropriate promotional programme. A random sample of 54 customer transactions using various types of payment was drawn during the past two months. The sample statistics are as follows:
Statistics | Visa | Company Card | Cash | Cheque
n | 10 | 12 | 18 | 14
$\bar{x}$ | 312 | 547 | 276 | 450
s | 64 | 112 | 41 | 73
Test if differences of average customer transaction size exist among the four types of payment at a 0.05 level of significance.
Exercise 1
One factor is involved, i.e. the type of payment. Under this factor, there are k = 4 treatments (or factor levels) which represent the four types of payment: Visa card, company card, cash and cheque. The experimental units are customer transactions.
$\bar{\bar{x}} = \frac{1}{n} \sum_{j=1}^{k} n_j \bar{x}_j = \frac{1}{54}[10(312) + 12(547) + 18(276) + 14(450)] = 388$
$SST = \sum_{j=1}^{k} n_j (\bar{x}_j - \bar{\bar{x}})^2 = 10(312 - 388)^2 + 12(547 - 388)^2 + 18(276 - 388)^2 + 14(450 - 388)^2 = 640,740$
$SSE = \sum_{j=1}^{k} (n_j - 1) s_j^2 = (10 - 1)64^2 + (12 - 1)112^2 + (18 - 1)41^2 + (14 - 1)73^2 = 272,702$
Exercise 1
$H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4$
$H_1$: At least two of the means differ
Source | Sum of squares | Degrees of freedom | Mean squares | F
Among treatments | 640,740 | k – 1 = 3 | 213,580 | 39.16
Within treatments | 272,702 | n – k = 50 | 5,454.04 |
Total | 913,442 | n – 1 = 53 | |
Test statistic: $F = \frac{MST}{MSE} = 39.16$
Rejection region: $F > F_{\alpha, k-1, n-k} = F_{0.05, 3, 50} = 2.80$
Since the test statistic of 39.16 is greater than the critical value of 2.80, reject $H_0$. At the 0.05 level of significance, there is evidence that the average customer transaction sizes differ among the four types of payment.
Excel printout
ANOVA
Source of variation | SS | df | MS | F | P-value
Between Groups | 640740 | 3 | 213580 | 39.16 | 0.0000
Within Groups | 272702 | 50 | 5454.04 | |
Total | 913442 | 53 | | |
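Exercise 1 gives only sample sizes, means and standard deviations, not the raw transactions. As a rough sketch (the helper name `anova_from_summary` is hypothetical), the same F statistic can be rebuilt from those summary statistics, using SSE = Σ(n_j – 1)s_j² as in the solution:

```python
def anova_from_summary(sizes, means, sds):
    """One-way ANOVA from per-group sample sizes, means and standard deviations."""
    n = sum(sizes)
    k = len(sizes)
    grand_mean = sum(nj * m for nj, m in zip(sizes, means)) / n
    sst = sum(nj * (m - grand_mean) ** 2 for nj, m in zip(sizes, means))
    sse = sum((nj - 1) * s ** 2 for nj, s in zip(sizes, sds))
    f_stat = (sst / (k - 1)) / (sse / (n - k))
    return grand_mean, sst, sse, f_stat


# Visa, company card, cash, cheque
grand_mean, sst, sse, f_stat = anova_from_summary(
    sizes=[10, 12, 18, 14],
    means=[312, 547, 276, 450],
    sds=[64, 112, 41, 73],
)
print(grand_mean, sst, sse, round(f_stat, 2))
```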
Can ANOVA be replaced by t-Test?
t-test: tests for any difference between two population means μ1 and μ2
Multiple t-tests are required for more than two population means
Conducting multiple tests increases the probability of making a Type I error. E.g. to compare 6 population means using ANOVA at the 5% significance level, there is a 5% chance that we reject the null hypothesis when it is true. If we use t-tests instead, we need to perform 15 pairwise tests, and if each uses the same 5% significance level, the chance of at least one Type I error (treating the tests as independent) will be
1 – (1 – 0.05)^15 = 0.54
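The count of 15 tests and the 0.54 figure can be reproduced in a few lines. This is a small sketch that assumes the pairwise tests are independent, which is the simplification behind the formula above; the function name `familywise_error` is an illustrative choice.

```python
from math import comb


def familywise_error(num_means, alpha):
    """Number of pairwise tests and chance of at least one Type I error among them."""
    tests = comb(num_means, 2)               # every pair of means gets its own t-test
    return tests, 1 - (1 - alpha) ** tests   # assumes the tests are independent


tests, prob = familywise_error(num_means=6, alpha=0.05)
print(tests, round(prob, 2))
```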
Linear Regression
Linear Regression
• Origin of regression
• Determining the simple linear regression equation
• Assessing the fitness of the model
• Correlation analysis
• Estimation and prediction
• Assumptions of regression and correlation
Origin of Regression
"Regression," from a Latin root meaning "going back," is a series of statistical methods used in studying the relationship between two variables; they were first employed by Francis Galton in 1877.
Galton was interested in studying the relationship between a father's height and his son's height. Using this method, he found that sons' heights regress toward the overall mean, and the method was therefore called "regression".
Linear Regression Analysis
Linear regression analysis is used primarily to model and describe linear relationships among variables and to provide predictions
• Predicts the value of a dependent (response) variable based on the value of at least one independent (explanatory) variable
• Expresses statistically the effect of the independent variables on the dependent variable
Types of Regression Models
Positive Linear Relationship
Negative Linear Relationship
Relationship NOT Linear
No Relationship
Simple Linear Regression Model
The relationship between two variables, say X and Y, is described by a linear function.
The change in the variable Y (called the dependent or response variable) is associated with the change in the other variable X (called the independent or explanatory variable).
Explore the dependency of Y on X.
Why Regression?
X | 1 | 2 | 3 | 4
Y | 2 | 2.5 | 2.5 | 5
[Figure: scatter plot of the points (1, 2), (2, 2.5), (3, 2.5), (4, 5) with the horizontal line $\bar{Y} = 3$]
Sum of squares about $\bar{Y}$: $(2 - 3)^2 + (2.5 - 3)^2 + (2.5 - 3)^2 + (5 - 3)^2 = 5.5$
Sum of squares about $\hat{Y}$: $(2 - 1.65)^2 + (2.5 - 2.55)^2 + (2.5 - 3.45)^2 + (5 - 4.35)^2 = 1.45$
The larger the sum of squares, the poorer the estimate.
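The two sums of squares can be checked numerically. In this sketch the line 0.75 + 0.9X is an inference: it is the straight line that passes through the fitted values 1.65, 2.55, 3.45 and 4.35 used in the second sum, and it is not stated explicitly on the slide.

```python
xs = [1, 2, 3, 4]
ys = [2, 2.5, 2.5, 5]


def sum_of_squares(actual, predicted):
    """Sum of squared vertical distances between observed and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))


mean_y = sum(ys) / len(ys)                                 # 3.0, the "no regression" estimate
ss_mean = sum_of_squares(ys, [mean_y] * len(ys))
ss_line = sum_of_squares(ys, [0.75 + 0.9 * x for x in xs])  # inferred fitted line
print(round(ss_mean, 2), round(ss_line, 2))
```

Using the line instead of the mean cuts the sum of squares from 5.5 to 1.45, which is the motivation for fitting a regression line at all.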
Linear Relationship
We wish to study whether there is any association between two quantitative variables, say X and Y
If 'Y tends to increase as X increases' – a positive relationship
If 'Y tends to decrease as X increases' – a negative relationship
If the corresponding magnitude of increase or decrease follows a specific proportion, the relationship identified is said to be a linear one.
Scatter Diagram
A scatter diagram is a graph plotted for all X-Y pairs of the sample data.
By viewing a scatter diagram, one can determine whether a relationship exists between the two variables. It can also suggest the likely mathematical form of that relationship, which allows one to judge initially and intuitively whether or not a linear relationship exists between the two variables.
Example
The level of air pollution at Kwun Tong and the total number of consultations relating to respiratory diseases in a public clinic in the area were recorded during a specific time period on 14 randomly selected days.
[Figure: scatter diagram of number of consultations (vertical axis, 0 to 40) against level of pollution (horizontal axis, 0 to 200)]
Population Linear Regression
Population regression line is a straight line that describes the dependence of the average value (conditional mean) of one variable on the other
$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$
• $Y_i$: dependent (response) variable
• $X_i$: independent (explanatory) variable
• $\beta_0$: population Y intercept
• $\beta_1$: population slope coefficient
• $\varepsilon_i$: random error
• $\mu_{Y|X} = \beta_0 + \beta_1 X_i$: population regression line (conditional mean)
Population Linear Regression (continued)
$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$
[Figure: the population regression line $\mu_{Y|X} = \beta_0 + \beta_1 X_i$ (conditional mean) with the observed value of Y at $X_i$; $\varepsilon_i$ is the random error (vertical discrepancy or residual for point i)]
Least Squares Method
The line fitted by least squares is the one that makes the sum of squares of all those vertical discrepancies (residuals) as small as possible, i.e. the minimum of $\sum \varepsilon_i^2$, which is the sum of squared residuals.
Sample Linear Regression
Sample regression line is formed by the point estimates of $\beta_0$ and $\beta_1$, i.e., $b_0$ and $b_1$. It provides an estimate of the population regression line as well as a predicted value of Y
$Y_i = b_0 + b_1 X_i + e_i$
$\hat{Y} = b_0 + b_1 X$: sample regression line (fitted regression line or predicted value)
• $b_0$: sample Y intercept
• $b_1$: sample coefficient of slope
• $e_i$: residual
Sample Linear Regression (continued)
$b_0$ and $b_1$ are obtained by finding the specific values of $b_0$ and $b_1$ that minimize the sum of the squared residuals
$\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} e_i^2$
Coefficients of Sample Linear Regression
For $\hat{Y}_i = b_0 + b_1 X_i$:
$b_1 = \frac{\sum X_i Y_i - \frac{(\sum X_i)(\sum Y_i)}{n}}{\sum X_i^2 - \frac{(\sum X_i)^2}{n}}$
$b_0 = \bar{Y} - b_1 \bar{X}$
Interpretation of the Slope and the Intercept
$\beta_0$ is the average value of Y when the value of X is zero.
$\beta_1$ measures the change in the average value of Y as a result of a one-unit change in X.
Interpretation of the Slope and the Intercept (continued)
$b_0$ is the estimated average value of Y when the value of X is zero.
$b_1$ is the estimated change in the average value of Y as a result of a one-unit change in X.
Example 1: Simple Linear Regression
Suppose that you want to examine the linear dependency of the annual sales of seven stores on their size in square feet. Sample data for the seven stores were obtained. Find the equation of the straight line that fits the data best.
Store | Square Feet | Annual Sales ($1000)
1 | 1,726 | 3,681
2 | 1,542 | 3,395
3 | 2,816 | 6,653
4 | 5,555 | 9,543
5 | 1,292 | 3,318
6 | 2,208 | 5,563
7 | 1,313 | 3,760
Example 1: Scatter Diagram
[Figure: Excel scatter diagram of annual sales ($000, 0 to 12,000) against square feet (0 to 6,000)]
Computation of Regression Coefficient
Store | Square Feet X | Annual Sales ($1000) Y | $X^2$ | $Y^2$ | XY
1 | 1,726 | 3,681 | 2,979,076 | 13,549,761 | 6,353,406
2 | 1,542 | 3,395 | 2,377,764 | 11,526,025 | 5,235,090
3 | 2,816 | 6,653 | 7,929,856 | 44,262,409 | 18,734,848
4 | 5,555 | 9,543 | 30,858,025 | 91,068,849 | 53,011,365
5 | 1,292 | 3,318 | 1,669,264 | 11,009,124 | 4,286,856
6 | 2,208 | 5,563 | 4,875,264 | 30,946,969 | 12,283,104
7 | 1,313 | 3,760 | 1,723,969 | 14,137,600 | 4,936,880
Total | 16,452 | 35,913 | 52,413,218 | 216,500,737 | 104,841,549
$b_1 = \frac{\sum X_i Y_i - \frac{(\sum X_i)(\sum Y_i)}{n}}{\sum X_i^2 - \frac{(\sum X_i)^2}{n}}$
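Computation of Regression Coefficient
$\sum X = 16,452$, $\sum Y = 35,913$, $\sum XY = 104,841,549$, $\sum X^2 = 52,413,218$
$b_1 = \frac{\sum X_i Y_i - \frac{(\sum X_i)(\sum Y_i)}{n}}{\sum X_i^2 - \frac{(\sum X_i)^2}{n}} = \frac{104,841,549 - \frac{(16,452)(35,913)}{7}}{52,413,218 - \frac{(16,452)^2}{7}} = 1.486633657$
$b_0 = \bar{Y} - b_1 \bar{X} = \frac{35,913}{7} - 1.486633657 \left( \frac{16,452}{7} \right) = 1,636.41472608$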
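The coefficient computation above follows directly from the column totals. A minimal sketch in plain Python, with the illustrative function name `least_squares`, applies the same two formulas to the seven stores:

```python
def least_squares(xs, ys):
    """Return (b0, b1) of the least squares line, using the sums-based formulas."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b1 = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
    b0 = sum_y / n - b1 * sum_x / n          # b0 = Y-bar minus b1 times X-bar
    return b0, b1


square_feet = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
annual_sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]   # in $1000

b0, b1 = least_squares(square_feet, annual_sales)
predicted_4000 = b0 + b1 * 4000              # fitted value for a 4,000 square-foot store
print(round(b0, 6), round(b1, 9), round(predicted_4000, 3))
```

The last line also evaluates the fitted line at 4,000 square feet, which lies inside the observed range of store sizes.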
Example 1: Equation for the Sample Regression Line
$\hat{Y}_i = b_0 + b_1 X_i = 1636.415 + 1.487 X_i$
[Figure: scatter diagram of annual sales ($000) against square feet with the fitted line $\hat{Y}_i = 1636.415 + 1.487 X_i$]
Example 1: Interpretation of Results
$\hat{Y}_i = 1636.415 + 1.487 X_i$
The slope of 1.487 means that for each increase of one unit in X, we predict the average of Y to increase by an estimated 1.487 units.
Since sales are recorded in thousands of dollars, the model estimates that for each increase of one square foot in the size of the store, the expected annual sales increase by $1,487.
Predicting Annual Sales Based on Square Footage
Suppose that we would like to use the fitted model to predict the average annual sales for a store with 4,000 square feet.
$\hat{Y}_i = 1636.414726 + 1.486633657 X_i = 1636.414726 + 1.486633657(4,000) = 7,582.94935225$
That is, predicted average annual sales of $7,582,949.35.
Interpolation versus Extrapolation
When using a regression line for prediction, it is not appropriate to make predictions beyond the relevant range of the independent variable (in the previous example, 1,292 to 5,555 square feet).
That is, we may interpolate within the relevant range of X values, but we SHOULD NOT extrapolate beyond the range of X values. For example, it is not appropriate to predict the average annual sales for a store with 7,000 square feet, since this is beyond the range of X values, i.e., (1,292, 5,555).
Causal Relationship?
In general, when there is a relationship identified between X and Y using regression analysis, we usually would say that ‘X is associated with Y’ instead of saying ‘X causes Y’.
We cannot claim that two variables are related by cause and effect just because there is a statistical relationship between the two. In fact, you cannot infer a causal relationship from statistics alone.
For example, the price of dog food and the price of houses may well be positively correlated over time.
When you collect data concerning the price of dog food and the price of houses over time, you might end up with an inference that they have a positive relationship, but can you conclude that an increase in the price of dog food would directly cause the price of houses to increase too?
It might be that an inflationary force is influencing both and hence they can be seen to move in the same general direction over time.
Computer Application
Import the data into two adjacent columns in an Excel file and then click Tools / Data Analysis / Regression (see pages 624-5 for a detailed description).
Example 1: Computer Output
SUMMARY OUTPUT
Regression Statistics
Multiple R | 0.970557204
R Square | 0.941981286
Adjusted R Square | 0.930377543
Standard Error | 611.7515173
Observations | 7
ANOVA
 | df | SS | MS | F | Significance F
Regression | 1 | 30380456.12 | 30380456 | 81.17909 | 0.000281201
Residual | 5 | 1871199.595 | 374239.9 | |
Total | 6 | 32251655.71 | | |
 | Coefficients | Standard Error | t Stat | P-value
Intercept | 1636.414726 | 451.4953308 | 3.624433 | 0.015149
Square feet | 1.486633657 | 0.164999212 | 9.009944 | 0.000281
Exercise 2
Consider the example about the level of air pollution at Kwun Tong and the total number of consultations that relate to respiratory diseases in a public clinic in the area. The corresponding data are as follows:
Day | 1 | 2 | 3 | 4 | 5 | 6 | 7
Level of pollution | 115 | 134 | 126 | 158 | 99 | 86 | 129
Consultations | 20 | 86 | 28 | 38 | 18 | 12 | 29
Day | 8 | 9 | 10 | 11 | 12 | 13 | 14
Level of pollution | 135 | 147 | 107 | 118 | 126 | 143 | 104
Consultations | 32 | 35 | 23 | 28 | 26 | 32 | 22
Exercise 2
(a) Determine the sample regression line to predict the number of consultations by the level of pollution.
(b) Interpret the coefficients.
Solution:
$\sum_{i=1}^{n} x_i = 1,727$, $\sum_{i=1}^{n} y_i = 429$, $\sum_{i=1}^{n} x_i^2 = 218,207$, $\sum_{i=1}^{n} x_i y_i = 55,281$, $\sum_{i=1}^{n} y_i^2 = 17,079$
Exercise 2
$b_1 = \frac{\sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n}}{\sum x_i^2 - \frac{(\sum x_i)^2}{n}} = \frac{55,281 - \frac{(1,727)(429)}{14}}{218,207 - \frac{(1,727)^2}{14}} = 0.456701074$
$b_0 = \bar{y} - b_1 \bar{x} = 30.642857143 - 0.456701074(123.357142857) = -25.694482444$
For $b_1$: for each additional unit increase in pollution level, the number of consultations increases, on average, by 0.456701074.
No meaningful interpretation for $b_0$ can be made, as the range of x does not include zero.
Assessing the simple linear regression model
After we have set up a linear regression model, we wish to assess the fitness of the model. That is, we wish to find out how well the model fits the given data. For a good fit, the data as a whole should be quite close to the regression line, so that the independent variable can be used to predict the value of the dependent variable with high accuracy.
To examine how well the independent variable predicts the dependent variable, we need to develop several measures of variation.
Measure of Variation: The Sum of Squares
SS(Total) = SSR + SSE
Total Sample Variability = Explained Variability + Unexplained Variability
Measure of Variation: The Sum of Squares (continued)
• SS(Total) = total sum of squares
  – Measures the variation of the $Y_i$ values around their mean $\bar{Y}$
• SSR = regression sum of squares
  – Explained variation attributable to the relationship between X and Y
• SSE = error sum of squares
  – Variation attributable to factors other than the relationship between X and Y (unexplained variation)
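Measure of Variation: The Sum of Squares (continued)
$SS(Total) = \sum (Y_i - \bar{Y})^2$
$SSR = \sum (\hat{Y}_i - \bar{Y})^2$
$SSE = \sum (Y_i - \hat{Y}_i)^2$
[Figure: a point $Y_i$ at $X_i$, the fitted value $\hat{Y}_i$ on the regression line and the mean $\bar{Y}$, showing the three deviations above]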
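$\sum X = 16,452$, $\sum Y = 35,913$, $\sum XY = 104,841,549$, $\sum X^2 = 52,413,218$, $\sum Y^2 = 216,500,737$
$SS(Total) = \sum (Y_i - \bar{Y})^2 = \sum Y_i^2 - \frac{(\sum Y_i)^2}{n} = 32,251,655.714$
$SSE = \sum (Y_i - \hat{Y}_i)^2 = \sum Y_i^2 - b_0 \sum Y_i - b_1 \sum X_i Y_i = 1,871,199.595$
$SSR = \sum (\hat{Y}_i - \bar{Y})^2 = b_0 \sum Y_i + b_1 \sum X_i Y_i - \frac{(\sum Y_i)^2}{n} = 30,380,456.1190$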
Standard Error of Estimate
The standard deviation of the variation of observations around the regression line.
$S = \sqrt{\frac{SSE}{n - 2}} = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{n - 2}} = \sqrt{\frac{1,871,199.595}{5}} = 611.751517366$
Standard Error of Estimate
The smallest value that S can assume is 0, which occurs when SSE = 0, that is, when all the points fall on the regression line. Thus, when S is small, the fit is excellent, and the linear regression model is likely to be an effective analytical and forecasting tool.
When S is large, the regression model is a poor one and is of little value.
The Coefficient of Determination ($r^2$ or $R^2$)
Measures the proportion of variation in Y that is explained by the independent variable X in the regression model
By themselves, SSR, SSE and SS(Total) provide little that can be directly interpreted. A simple ratio of SSR and SS(Total) provides a measure of the usefulness of the regression equation.
$r^2 = \frac{SSR}{SS(Total)} = \frac{\text{Regression Sum of Squares}}{\text{Total Sum of Squares}} = \frac{30,380,456.119}{32,251,655.714} = 0.941981286$
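The three sums of squares, the standard error of estimate and $r^2$ for the store example can all be recomputed from the raw data. The sketch below uses the definitional forms (deviations from the mean and from the fitted values) rather than the shortcut formulas; the function name `variation_measures` is an illustrative choice.

```python
from math import sqrt

square_feet = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
annual_sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]   # in $1000


def variation_measures(xs, ys):
    """Return (SS(Total), SSR, SSE, S, r^2) for a simple linear regression of ys on xs."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
          / sum((x - x_bar) ** 2 for x in xs))
    b0 = y_bar - b1 * x_bar
    fitted = [b0 + b1 * x for x in xs]
    ss_total = sum((y - y_bar) ** 2 for y in ys)              # total variation
    ssr = sum((f - y_bar) ** 2 for f in fitted)               # explained variation
    sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))       # unexplained variation
    s = sqrt(sse / (n - 2))                                   # standard error of estimate
    return ss_total, ssr, sse, s, ssr / ss_total


ss_total, ssr, sse, s, r2 = variation_measures(square_feet, annual_sales)
print(round(ss_total, 3), round(ssr, 3), round(sse, 3), round(s, 4), round(r2, 6))
```

Because SS(Total) = SSR + SSE, the same $r^2$ is obtained as 1 – SSE/SS(Total).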
Coefficients of Determination ($r^2$)
[Figure: four scatter plots with fitted lines $\hat{Y}_i = b_0 + b_1 X_i$, illustrating $r^2 = 1$ (two panels), $r^2 = 0.8$ and $r^2 = 0$]
Coefficient of Correlation
Coefficient of correlation is used to measure the strength of association (linear relationship) between two numerical variables
• Only concerned with the strength of the relationship
• No causal effect is implied
Coefficient of Correlation (continued)
Population correlation coefficient is denoted by $\rho$ (rho).
Sample correlation coefficient is denoted by r. It is an estimate of $\rho$ and is used to measure the strength of the linear relationship in the sample observations.
$r = \pm\sqrt{r^2}$, taking the sign of the slope $b_1$
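Coefficient of Correlation
$r = \frac{\sum_{i=1}^{n} x_i y_i - n \bar{x} \bar{y}}{\sqrt{\left( \sum_{i=1}^{n} x_i^2 - n \bar{x}^2 \right) \left( \sum_{i=1}^{n} y_i^2 - n \bar{y}^2 \right)}}$
$= \frac{104,841,549 - 7 \left( \frac{16,452}{7} \right) \left( \frac{35,913}{7} \right)}{\sqrt{\left[ 52,413,218 - 7 \left( \frac{16,452}{7} \right)^2 \right] \left[ 216,500,737 - 7 \left( \frac{35,913}{7} \right)^2 \right]}} = 0.970557204$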
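As a quick check of the correlation formula, here is a short sketch (the function name `correlation` is an illustrative choice) that evaluates r for the store data and confirms that its square matches the coefficient of determination reported in the Excel output.

```python
from math import sqrt

square_feet = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
annual_sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]   # in $1000


def correlation(xs, ys):
    """Sample correlation coefficient r, computed from sums and means."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    numerator = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar
    denominator = sqrt((sum(x * x for x in xs) - n * x_bar ** 2)
                       * (sum(y * y for y in ys) - n * y_bar ** 2))
    return numerator / denominator


r = correlation(square_feet, annual_sales)
print(round(r, 9), round(r ** 2, 9))
```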
Sample of Observations from Various r Values
[Figure: five scatter plots of Y against X for r = –1, r = –0.6, r = 0, r = 0.6 and r = 1]
Features of $\rho$ and r
• Unit free
• Range between –1 and 1
• The closer to –1, the stronger the negative linear relationship
• The closer to 1, the stronger the positive linear relationship
• The closer to 0, the weaker the linear relationship
Inference about the Slope
There is also a more systematic way to assess model fitness, i.e., to perform a hypothesis test on the slope of the regression line.
If the two variables involved are not at all linearly related, the scatter diagram shows no linear pattern and the slope of the regression line will be zero.
Inference about the Slope
Hence, we can determine whether a significant relationship between the variables X and Y exists by testing whether $\beta_1$ (the true slope) is equal to zero.
$H_0: \beta_1 = 0$ (There is no linear relationship)
$H_1: \beta_1 \neq 0$ (There is a linear relationship)
If $H_0$ is rejected, there is evidence to believe that a linear relationship exists between X and Y.
The standard error of the slope
The estimated standard error of $b_1$:
$S_{b_1} = \frac{S}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2}} = \frac{S}{\sqrt{\sum_{i=1}^{n} X_i^2 - n \bar{X}^2}}$
Inference about the Slope: t Test
• t test for a population slope
  – Is there a linear dependency of Y on X?
• Null and alternative hypotheses
  – $H_0: \beta_1 = 0$ (no linear dependency)
  – $H_1: \beta_1 \neq 0$ (linear dependency)
• Test statistic: $t = \frac{b_1 - \beta_1}{S_{b_1}}$, with d.f. = n – 2
Example: Store Sales
Data for seven stores:
Store | Square Feet | Annual Sales ($000)
1 | 1,726 | 3,681
2 | 1,542 | 3,395
3 | 2,816 | 6,653
4 | 5,555 | 9,543
5 | 1,292 | 3,318
6 | 2,208 | 5,563
7 | 1,313 | 3,760
Estimated regression equation: $\hat{Y}_i = 1636.415 + 1.487 X_i$
The slope of this model is 1.487.
Is square footage of the store affecting its annual sales?
$H_0: \beta_1 = 0$
$H_1: \beta_1 \neq 0$
$\alpha = 0.05$, df = 7 – 2 = 5
Test statistic:
$t = \frac{b_1 - \beta_1}{S_{b_1}} = \frac{1.486633657 - 0}{611.751517366 / \sqrt{52,413,218 - 7(2,350.28571429)^2}} = \frac{1.486633657}{0.164999212} = 9.009943959$
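The slope test for the store data can be reproduced step by step. This sketch hard-codes the two-tailed critical value 2.5706 for 5 degrees of freedom, as read from a t table, because the Python standard library has no t distribution; the names `slope_t_test` and `T_CRITICAL` are illustrative.

```python
from math import sqrt

square_feet = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
annual_sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]   # in $1000
T_CRITICAL = 2.5706   # t table value for alpha/2 = 0.025 and df = 5 (taken from a table)


def slope_t_test(xs, ys, beta1_null=0.0):
    """Return (b1, standard error of b1, t statistic) for H0: beta1 = beta1_null."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum(x * x for x in xs) - n * x_bar ** 2
    b1 = (sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar) / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    s = sqrt(sse / (n - 2))                  # standard error of estimate
    s_b1 = s / sqrt(sxx)                     # standard error of the slope
    return b1, s_b1, (b1 - beta1_null) / s_b1


b1, s_b1, t_stat = slope_t_test(square_feet, annual_sales)
reject_h0 = abs(t_stat) > T_CRITICAL         # two-tailed decision rule
print(round(s_b1, 9), round(t_stat, 6), reject_h0)
```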
Inferences about the Slope: t Test Example
Critical values: $\pm t_{0.025, 5} = \pm 2.5706$ (two-tailed test, with 0.025 in each rejection region)
Decision: Reject $H_0$, since t = 9.0099 > 2.5706.
Conclusion: At the 5% level of significance, there is evidence that square footage is associated with annual sales.
Inferences about the Slope
Two-tailed test:
$H_0: \beta_1 = 0$ (No linear relationship)
$H_1: \beta_1 \neq 0$ (A linear relationship)
Right-tailed test:
$H_0: \beta_1 = 0$ (No positive linear relationship)
$H_1: \beta_1 > 0$ (A positive linear relationship)
Left-tailed test:
$H_0: \beta_1 = 0$ (No negative linear relationship)
$H_1: \beta_1 < 0$ (A negative linear relationship)
Exercise 3
Consider the data of Exercise 2 about the level of air pollution at Kwun Tong and the total number of consultations that relate to respiratory diseases in a public clinic in the area.
Test at the 5% level of significance to determine whether level of air pollution and the total number of consultations are positively linearly related.
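Solution:
$H_0: \beta_1 = 0$
$H_1: \beta_1 > 0$
$\alpha = 0.05$; df = 14 – 2 = 12
$\sum_{i=1}^{n} x_i = 1,727$, $\sum_{i=1}^{n} y_i = 429$, $\sum_{i=1}^{n} x_i^2 = 218,207$, $\sum_{i=1}^{n} x_i y_i = 55,281$, $\sum_{i=1}^{n} y_i^2 = 17,079$
$SSE = \sum (Y_i - \hat{Y}_i)^2 = \sum Y_i^2 - b_0 \sum Y_i - b_1 \sum X_i Y_i = 2,855.04091530$
$S = \sqrt{\frac{SSE}{n - 2}} = \sqrt{\frac{2,855.04091530}{14 - 2}} = 15.424658060$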
Exercise 3
$S_{b_1} = \frac{S}{\sqrt{\sum X_i^2 - n \bar{X}^2}} = \frac{15.424658060}{\sqrt{218,207 - 14(123.357142857)^2}} = 0.214537530$
$t = \frac{b_1 - \beta_1}{S_{b_1}} = \frac{0.456701074 - 0}{0.214537530} = 2.128770074$
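Computer Output
Regression Statistics
Multiple R | 0.52356487
R Square | 0.27412017
Adjusted R Square | 0.21363018
Standard Error | 15.4246581
Observations | 14
ANOVA
 | df | SS | MS | F | Significance F
Regression | 1 | 1078.17337 | 1078.173 | 4.531662 | 0.054675
Residual | 12 | 2855.040915 | 237.9201 | |
Total | 13 | 3933.214286 | | |
 | Coefficients | Standard Error | t Stat | P-value
Intercept | -25.6944824 | 26.78388667 | -0.95933 | 0.356325
Level of pollution | 0.45670107 | 0.214537531 | 2.12877 | 0.054675
Note: the P-values shown are for a two-tailed test. For the one-tailed test in this exercise, the P-value of the slope is halved: 0.054675 / 2 ≈ 0.0273.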
Exercise 3
Critical value: $t_{0.05, 12} = 1.7823$ (one-tailed test; reject $H_0$ if t > 1.7823)
Decision: Reject $H_0$, since t = 2.1288 > 1.7823.
Conclusion: At the 5% level of significance, there is evidence to believe that level of air pollution and total number of consultations are positively linearly related.
Estimation of Mean Values
You have seen how we can assess the fitness of the model. If the model fits satisfactorily, we can use it to forecast and estimate values of the dependent variable.
We can obtain a point prediction of Y for a given value of X using the linear regression line.
An interval estimate for a particular value of Y (prediction interval) or for the average of Y (confidence interval) at a given value of X can also be computed if desired.
Estimation of Mean Values
Confidence interval estimate for $\mu_{Y|X = X_g}$, the mean of Y given a particular $X_g$:
$\hat{Y}_i \pm t_{\alpha/2, n-2} \, S \sqrt{\frac{1}{n} + \frac{(X_g - \bar{X})^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2}}$
• $t_{\alpha/2, n-2}$: t value from table with df = n – 2
• S: standard error of the estimate
• Size of interval varies according to the distance of $X_g$ away from the mean, $\bar{X}$
Prediction of Individual Values
Prediction interval for an individual response $Y_i$ at a particular $X_g$:
$\hat{Y}_i \pm t_{\alpha/2, n-2} \, S \sqrt{1 + \frac{1}{n} + \frac{(X_g - \bar{X})^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2}}$
The addition of 1 under the square root increases the width of the interval from that for the mean of Y
Interval Estimates for Different Values of X
[Figure: the fitted line $\hat{Y}_i = b_0 + b_1 X_i$ with the confidence interval band for the mean of Y given X and the wider prediction interval band for an individual $Y_i$; $\bar{X}$ is marked on the X axis]
Example: Stores Sales
Data for seven stores:
Store | Square Feet | Annual Sales ($000)
1 | 1,726 | 3,681
2 | 1,542 | 3,395
3 | 2,816 | 6,653
4 | 5,555 | 9,543
5 | 1,292 | 3,318
6 | 2,208 | 5,563
7 | 1,313 | 3,760
Regression model obtained: $\hat{Y}_i = 1636.415 + 1.487 X_i$
Predict the annual sales for a store with 2,000 square feet.
Estimation of Mean Values: Example
Find the 95% confidence interval for the average annual sales for a 2,000 square-foot store.
Predicted sales: $\hat{Y}_i = 1636.415 + 1.487 X_i = 4609.68$ ($000)
$\bar{X} = 2350.29$, $S = 611.751517366$, $t_{n-2} = t_5 = 2.571$
$\sum X_i^2 - n \bar{X}^2 = 52,413,218 - 7(2350.28571429)^2 = 13,746,317.4284$
Confidence interval estimate for $\mu_{Y|X = X_g}$:
$\hat{Y}_i \pm t_{\alpha/2, n-2} \, S \sqrt{\frac{1}{n} + \frac{(X_g - \bar{X})^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2}} = 4609.68 \pm 612.7579$, i.e. (3996.92, 5222.44)
ANOVA and Linear Regression
103
BUS B272 Unit 1
Prediction Interval for Y : Example
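Find the 95% prediction interval for the annual sales of a 2,000 square-foot store.
Predicted Sales: Ŷi = 1636.415 + 1.487Xi = 4609.68 ($000)
X̄ = 2350.29, S_YX = 611.7515, t(n-2) = t5 = 2.571

$$\sum_{i=1}^{n} (X_i - \bar{X})^2 = \sum_{i=1}^{n} X_i^2 - n\bar{X}^2 = 52413218 - 7(2350.28571429)^2 = 13746317.4284$$

Prediction interval for individual Y:

$$\hat{Y} \pm t_{\alpha/2,\,n-2}\, S_{YX} \sqrt{1 + \frac{1}{n} + \frac{(X_g - \bar{X})^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2}} = 4609.68 \pm 1687.9613 = (2921.72,\ 6297.64)$$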
Computer Application
Commands: Tools/ Data Analysis Plus/ Prediction Interval.
Computer Output

Prediction Interval
Annual Sales ($1000)

Predicted value                        4609.682

Prediction Interval
  Lower limit                          2921.998
  Upper limit                          6297.366

Interval Estimate of Expected Value
  Lower limit                          3997.025
  Upper limit                          5222.339
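The interval estimates for the 2,000 square-foot store can be reproduced from the raw data. The following is a minimal Python sketch, not the Data Analysis Plus routine itself; the function name `interval_estimates` and its return keys are illustrative assumptions, and the critical value 2.571 is the table value quoted on the slides rather than something computed here.

```python
import math

square_feet = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
annual_sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]  # in $000
T_CRIT = 2.571  # t(0.025, df = n - 2 = 5), table value quoted on the slides


def interval_estimates(x, y, x_g, t_crit):
    """Confidence interval for the mean of Y and prediction interval
    for an individual Y at X = x_g, in simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = ss_xy / ss_xx
    b0 = y_bar - b1 * x_bar
    y_hat = b0 + b1 * x_g
    # Standard error of the estimate: sqrt(SSE / (n - 2))
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s_yx = math.sqrt(sse / (n - 2))
    # Distance term shared by both intervals
    h = 1 / n + (x_g - x_bar) ** 2 / ss_xx
    ci_half = t_crit * s_yx * math.sqrt(h)      # mean of Y
    pi_half = t_crit * s_yx * math.sqrt(1 + h)  # individual Y
    return {
        "y_hat": y_hat,
        "s_yx": s_yx,
        "confidence": (y_hat - ci_half, y_hat + ci_half),
        "prediction": (y_hat - pi_half, y_hat + pi_half),
    }


result = interval_estimates(square_feet, annual_sales, 2000, T_CRIT)
for key, value in result.items():
    print(key, value)
```

With the table value t = 2.571 the limits should match the hand calculation. The Excel limits differ slightly in the first decimal place, presumably because Excel uses an unrounded t value.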
Linear Regression Assumptions
1. Normality
   • Y values are normally distributed for each X
   • Probability distribution of error is normal
2. Homoscedasticity (Constant Variance)
3. Independence of Errors
• Y values are normally distributed around the regression line.
• For each X value, the “spread” or variance around the regression line is the same.
Variation of Errors around the Regression Line
[Figure: error distributions f(e) with equal spread drawn at X1 and X2 along the sample regression line; axes X and Y.]
Multiple Regression
Introduction
Extension of the simple linear regression model to allow for any fixed number of independent variables. That is, the number of independent variables could be more than one.
Multiple Linear Regression
To make use of computer printout to
• Assess the model
   • How well does it fit the data?
   • Is it useful?
   • Are any required conditions violated?
• Employ the model
   • Interpreting the coefficients
   • Predictions using the prediction equation
   • Estimating the expected value of the dependent variable
Model and Required Conditions
Allow for k independent variables to potentially be related to the dependent variable:
y = β0 + β1x1 + β2x2 + … + βkxk + ε
where y is the dependent variable, x1, …, xk are the independent variables, β0, …, βk are the regression coefficients, and ε is the random error variable.
Multiple Regression for k = 2, Graphical Demonstration
The simple linear regression model allows for one independent variable, x: y = β0 + β1x + ε
The multiple linear regression model allows for more than one independent variable: y = β0 + β1x1 + β2x2 + ε
[Figure: axes y, X1 and X2; the straight line of the simple model becomes the plane y = β0 + β1x1 + β2x2 when k = 2.]
Required conditions for the error variable
• The error ε is normally distributed.
• The mean is equal to zero and the standard deviation is constant (σε) for all values of y.
• The errors are independent.
Estimating the Coefficients and Assessing the Model
The procedure used to perform multiple regression analysis:
1. Obtain the model coefficients and statistics using statistical software.
2. Assess the fit of the model using statistics obtained from the sample.
3. If the model assessment indicates a good fit to the data, use it to interpret the coefficients and generate predictions.
Estimating the Coefficients and Assessing the Model, Example
Example 18.1 Keller: Where to locate a new motor inn?
La Quinta Motor Inns is planning to build new inns. Management wishes to predict which sites are likely to be profitable.
La Quinta defines profitable inns as those with an operating margin in excess of 50% and unprofitable ones as those with margins of less than 30%.
Several areas where predictors of profitability (operating margin) can be identified are:
• Competition
• Market awareness
• Demand generators
• Demographics
• Physical quality
Estimating the Coefficients and Assessing the Model, Example
Profitability: Margin (%)
Predictor areas: Competition, Market awareness, Customers, Community, Physical
Variables:
Number         Number of hotel/motel rooms within 3 miles from the site
Nearest        Number of miles to closest competition
Office space   Office space in nearby community
Enrollment     Enrollment in nearby university or college (in thousands)
Income         Median household income of nearby area (in $thousands)
Distance       Distance to the downtown core (in miles)
Estimating the Coefficients and Assessing the Model, Example
Data were collected from 100 randomly selected inns that belong to La Quinta, and the following suggested model was fitted:
Margin = β0 + β1Rooms + β2Nearest + β3Office + β4College + β5Income + β6Disttwn

Data file Xm18-01 (first three observations):
Margin   Number   Nearest   Office Space   Enrollment   Income   Distance
55.5     3203     4.2       549            8            37       2.7
33.8     2810     2.8       496            17.5         35       14.4
49       2890     2.4       254            20           35       2.6
Regression Analysis, Excel Output
SUMMARY OUTPUT

Regression Statistics
Multiple R           0.7246
R Square             0.5251
Adjusted R Square    0.4944
Standard Error       5.51
Observations         100

ANOVA
             df    SS       MS      F       Significance F
Regression   6     3123.8   520.6   17.14   0.0000
Residual     93    2825.6   30.4
Total        99    5949.5

               Coefficients   Standard Error   t Stat   P-value
Intercept      38.14          6.99             5.45     0.0000
Number         -0.0076        0.0013           -6.07    0.0000
Nearest        1.65           0.63             2.60     0.0108
Office Space   0.020          0.0034           5.80     0.0000
Enrollment     0.21           0.13             1.59     0.1159
Income         0.41           0.14             2.96     0.0039
Distance       -0.23          0.18             -1.26    0.2107

This is the sample regression equation (sometimes called the prediction equation):
Margin = 38.14 - 0.0076 Number + 1.65 Nearest + 0.020 Office Space + 0.21 Enrollment + 0.41 Income - 0.23 Distance
Model Assessment
The model is assessed using two tools:
• The coefficient of determination
• The F-test of the analysis of variance
The standard error of estimate is used in constructing both of these tools.
Standard Error of Estimate
The standard deviation of the error is estimated by the standard error of estimate:

$$s_\varepsilon = \sqrt{\frac{SSE}{n - k - 1}}$$

The magnitude of $s_\varepsilon$ is judged by comparing it to $\bar{y}$.
Standard Error of Estimate
From the printout, sε = 5.51.
Calculating the mean value of y, we have ȳ = 45.739.
It seems sε is not particularly small.
Question: Can we conclude the model does not fit the data well?
Coefficient of Determination
The definition is:

$$r^2 = 1 - \frac{SSE}{\sum (y_i - \bar{y})^2} = 1 - \frac{SSE}{SS(Total)} = \frac{SSR}{SS(Total)}$$

From the printout, r² = 0.5251.
52.51% of the variation in operating margin is explained by the six independent variables. 47.49% remains unexplained.
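The standard error of estimate and the coefficient of determination can both be recomputed from the sums of squares in the ANOVA part of the printout. The following is a minimal Python sketch under that assumption; the variable names are illustrative, the inputs are the rounded values shown in the Excel output, and the adjusted R² formula is the standard one, which the slides report but do not define.

```python
import math

# Values copied from the Excel printout for the La Quinta model
n, k = 100, 6            # observations, independent variables
sse = 2825.6             # residual sum of squares
ss_total = 5949.5        # total sum of squares
y_bar = 45.739           # mean operating margin

# Standard error of estimate: sqrt(SSE / (n - k - 1))
s_e = math.sqrt(sse / (n - k - 1))

# Coefficient of determination: 1 - SSE / SS(Total)
r2 = 1 - sse / ss_total

# Adjusted R-square (standard definition, assumed here)
adj_r2 = 1 - (sse / (n - k - 1)) / (ss_total / (n - 1))

print(f"s_e = {s_e:.2f} ({s_e / y_bar:.1%} of the mean margin)")
print(f"r2 = {r2:.4f}, adjusted r2 = {adj_r2:.4f}")
```

Expressing the standard error as a fraction of the mean is one way to judge its magnitude, as the slides suggest; it does not by itself settle whether the fit is good.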
Testing the Validity of the Model
For testing the validity of the model, the following question is asked: Is there at least one independent variable linearly related to the dependent variable?
To answer the question we test the hypotheses
H0: β1 = β2 = … = βk = 0
H1: At least one βi is not equal to zero.
If at least one βi is not equal to zero, the model has some validity or usefulness.
Testing the Validity of the La Quinta Inns Regression Model
The hypotheses are tested by an ANOVA procedure (the Excel output):

ANOVA
             df               SS             MS                            F                 Significance F
Regression   k = 6            SSR = 3123.8   MSR = SSR / k = 520.6         MSR / MSE = 17.14   0.0000
Residual     n-k-1 = 93       SSE = 2825.6   MSE = SSE / (n-k-1) = 30.4
Total        n-1 = 99         5949.5
Testing the Validity of the La Quinta Inns Regression Model
[Total variation in y] SS(Total) = SSR + SSE. A large F results from a large SSR. That implies much of the variation in y can be explained by the regression model; the model is useful, and thus the null hypothesis should be rejected. The test statistic is:

$$F = \frac{SSR / k}{SSE / (n - k - 1)}$$

and the rejection region is:

$$F > F_{\alpha,\,k,\,n-k-1}$$
Testing the Validity of the La Quinta Inns Regression Model

ANOVA
             df    SS       MS      F       Significance F
Regression   6     3123.8   520.6   17.14   0.0000
Residual     93    2825.6   30.4
Total        99    5949.5

Fα, k, n-k-1 = F0.05, 6, 100-6-1 = 2.17
F = 17.14 > 2.17
Also, the p-value (Significance F) = 0.0000; reject the null hypothesis.
Conclusion: There is sufficient evidence to reject the null hypothesis in favor of the alternative hypothesis. At least one of the βi is not equal to zero. Thus, at least one independent variable is linearly related to y. This linear regression model is valid.
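The F-test above can be traced step by step from the sums of squares. The following is a minimal Python sketch; the function name `anova_f_test` is an illustrative assumption, and the critical value 2.17 is the one quoted on the slide rather than computed from the F distribution.

```python
def anova_f_test(ssr, sse, n, k, f_crit):
    """F-test for overall model validity in multiple regression.

    Returns (MSR, MSE, F, reject_h0), where H0 states that all
    slope coefficients are zero.
    """
    msr = ssr / k              # mean square for regression
    mse = sse / (n - k - 1)    # mean square for error
    f_stat = msr / mse
    return msr, mse, f_stat, f_stat > f_crit


# Inputs copied from the La Quinta ANOVA table
F_CRIT = 2.17  # F(0.05; 6, 93) as quoted on the slide
msr, mse, f_stat, reject_h0 = anova_f_test(
    ssr=3123.8, sse=2825.6, n=100, k=6, f_crit=F_CRIT
)
print(f"MSR = {msr:.1f}, MSE = {mse:.1f}, F = {f_stat:.2f}, reject H0: {reject_h0}")
```

Because the inputs are the rounded printout values, the computed F should agree with the printed 17.14 only up to rounding.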
Interpreting the Coefficients
b0 = 38.14. This is the intercept, the value of y when all the variables take the value zero. Since the data range of the independent variables does not cover the value zero, do not interpret the intercept.
b1 = -0.0076. In this model, for each additional room within 3 miles of the La Quinta inn, the operating margin decreases on average by 0.0076% (assuming the other variables are held constant).
Interpreting the Coefficients
b2 = 1.65. In this model, for each additional mile between the nearest competitor and a La Quinta inn, the operating margin increases on average by 1.65% when the other variables are held constant.
b3 = 0.020. For each additional 1000 sq-ft of office space, the operating margin increases on average by 0.02% when the other variables are held constant.
b4 = 0.21. For each additional thousand students, the operating margin increases on average by 0.21% when the other variables are held constant.
Interpreting the Coefficients
b5 = 0.41. For each additional $1000 in median household income, the operating margin increases on average by 0.41% when the other variables are held constant.
b6 = -0.23. For each additional mile to the downtown center, the operating margin decreases on average by 0.23% when the other variables are held constant.
Testing the Coefficients
The hypotheses for each βi are
H0: βi = 0
H1: βi ≠ 0
Test statistic:

$$t = \frac{b_i - \beta_i}{s_{b_i}}, \qquad \text{d.f.} = n - k - 1$$

Excel printout:
               Coefficients   Standard Error   t Stat   P-value
Intercept      38.14          6.99             5.45     0.0000
Number         -0.0076        0.0013           -6.07    0.0000
Nearest        1.65           0.63             2.60     0.0108
Office Space   0.020          0.0034           5.80     0.0000
Enrollment     0.21           0.13             1.59     0.1159
Income         0.41           0.14             2.96     0.0039
Distance       -0.23          0.18             -1.26    0.2107
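The coefficient tests can be read off the printout by comparing each P-value with the significance level. The following is a minimal Python sketch; the 5% level, the dictionary layout and the helper `t_statistic` are illustrative assumptions. Recomputing t from the rounded coefficients and standard errors only approximates the printed t Stat, since Excel works with unrounded values.

```python
ALPHA = 0.05  # assumed significance level

# name: (coefficient b_i, standard error s_bi, printed t Stat, printed P-value)
printout = {
    "Number": (-0.0076, 0.0013, -6.07, 0.0000),
    "Nearest": (1.65, 0.63, 2.60, 0.0108),
    "Office Space": (0.020, 0.0034, 5.80, 0.0000),
    "Enrollment": (0.21, 0.13, 1.59, 0.1159),
    "Income": (0.41, 0.14, 2.96, 0.0039),
    "Distance": (-0.23, 0.18, -1.26, 0.2107),
}


def t_statistic(b, s_b, beta_null=0.0):
    """t = (b_i - beta_i) / s_bi, with beta_i = 0 under H0."""
    return (b - beta_null) / s_b


# Approximate t statistics from the rounded printout values
t_from_rounded = {
    name: t_statistic(b, s_b) for name, (b, s_b, _, _) in printout.items()
}

# Reject H0: beta_i = 0 when the P-value is below ALPHA
significant = [name for name, (_, _, _, p) in printout.items() if p < ALPHA]
not_significant = [name for name in printout if name not in significant]

print("Linearly related to margin at the 5% level:", significant)
print("No evidence of a linear relationship:", not_significant)
```

Under these assumptions, Enrollment and Distance are the two variables whose P-values exceed 0.05.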
Using the Linear Regression Equation
• The model can be used for making predictions by
   • Producing a prediction interval estimate for the particular value of y, for a given set of values of xi.
   • Producing a confidence interval estimate for the expected value of y, for a given set of values of xi.
• The model can be used to learn about relationships between the independent variables xi and the dependent variable y, by interpreting the coefficients βi.
Xm18-01 La Quinta Inns, Predictions
Predict the average operating margin of an inn at a site with the following characteristics:
• 3815 rooms within 3 miles
• Closest competitor 0.9 miles away
• 476,000 sq-ft of office space
• 24,500 college students
• $35,000 median household income
• 11.2 miles away from downtown center
MARGIN = 38.14 - 0.0076(3815) + 1.65(0.9) + 0.020(476) + 0.21(24.5) + 0.41(35) - 0.23(11.2) = 37.1%
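The point prediction for this site is simply the prediction equation evaluated at the site's values. The following is a minimal Python sketch; the dictionary keys and the function name `predict_margin` are illustrative assumptions, and the coefficients are the rounded values from the printout.

```python
# Rounded coefficients from the Excel printout
coefficients = {
    "Intercept": 38.14,
    "Number": -0.0076,
    "Nearest": 1.65,
    "Office Space": 0.020,
    "Enrollment": 0.21,
    "Income": 0.41,
    "Distance": -0.23,
}

# Site characteristics, in the units used by the model
site = {
    "Number": 3815,       # rooms within 3 miles
    "Nearest": 0.9,       # miles to closest competitor
    "Office Space": 476,  # thousands of sq-ft
    "Enrollment": 24.5,   # thousands of students
    "Income": 35,         # $ thousands
    "Distance": 11.2,     # miles to downtown
}


def predict_margin(coefs, x):
    """Evaluate the prediction equation b0 + sum(b_i * x_i)."""
    return coefs["Intercept"] + sum(coefs[name] * value for name, value in x.items())


margin = predict_margin(coefficients, site)
print(f"Predicted operating margin: {margin:.1f}%")
```

Note that the inputs must be entered in the model's units (thousands of sq-ft, thousands of students, thousands of dollars), as in the hand calculation.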
La Quinta Inns, Predictions
Interval estimates by Excel (Data Analysis Plus)

Prediction Interval
Margin

Predicted value                        37.1

Prediction Interval
  Lower limit                          25.4
  Upper limit                          48.8

Interval Estimate of Expected Value
  Lower limit                          33.0
  Upper limit                          41.2

It is predicted, with 95% confidence, that the operating margin will lie between 25.4% and 48.8%.
It is estimated that the average operating margin of all sites that fit this category falls between 33.0% and 41.2%.
Both intervals lie entirely below 50%, which suggests that the given site would not be profitable.