Slide 1
Statistics, Spring 2004
Instructor: 余清祥 (Department of Statistics)   Date: May 11, 2004   Week 12: Building Regression Models
Slide 2
Chapter 16
Regression Analysis: Model Building
• General Linear Model
• Determining When to Add or Delete Variables
• Analysis of a Larger Problem
• Variable-Selection Procedures
• Residual Analysis
• Multiple Regression Approach to Analysis of Variance and Experimental Design
Slide 3
General Linear Model
Models in which the parameters ($\beta_0, \beta_1, \ldots, \beta_p$) all have exponents of one are called linear models.
First-Order Model with One Predictor Variable
$y = \beta_0 + \beta_1 x_1 + \varepsilon$
Second-Order Model with One Predictor Variable
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2 + \varepsilon$
Second-Order Model with Two Predictor Variables with Interaction
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1^2 + \beta_4 x_2^2 + \beta_5 x_1 x_2 + \varepsilon$
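A second-order model is still linear in its parameters, so ordinary least squares applies once the squared and interaction terms are added as extra columns. The sketch below illustrates this with NumPy; the data, the coefficient values, and the helper name `second_order_design` are made up for illustration and do not come from the slides.

```python
import numpy as np

def second_order_design(x1, x2):
    """Build columns 1, x1, x2, x1^2, x2^2, x1*x2 for the second-order model."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    return np.column_stack([np.ones_like(x1), x1, x2, x1**2, x2**2, x1 * x2])

# Hypothetical noise-free data generated from known coefficients
rng = np.random.default_rng(0)
x1 = rng.uniform(0, 5, size=30)
x2 = rng.uniform(0, 5, size=30)
true_beta = np.array([2.0, 1.5, -0.5, 0.3, 0.2, -0.1])

X = second_order_design(x1, x2)
y = X @ true_beta

# Least squares treats the six columns as six ordinary predictors
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(beta_hat, 4))
```

Because the response here has no error term, the fitted coefficients should agree with `true_beta` up to rounding; with real data they would only be estimates.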
Slide 4
General Linear Model
Often the problem of nonconstant variance can be corrected by transforming the dependent variable to a different scale.
Logarithmic Transformations
Most statistical packages provide the ability to apply logarithmic transformations using either the base-10 (common log) or the base e = 2.71828... (natural log).
Reciprocal Transformation
Use 1/y as the dependent variable instead of y.
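As a small illustration of the transformations named above, the following sketch applies the common log, the natural log, and the reciprocal to a few made-up positive response values (the numbers are hypothetical). It also shows that predictions made on the log scale have to be transformed back to the original units.

```python
import math

# Hypothetical positive responses; log and reciprocal need y > 0
y = [1.0, 10.0, 100.0, 1000.0]

log10_y = [math.log10(v) for v in y]   # base-10 (common log)
ln_y = [math.log(v) for v in y]        # base e (natural log)
recip_y = [1.0 / v for v in y]         # reciprocal transformation

# Undo the common-log transformation to return to the original scale
y_back = [10.0 ** v for v in log10_y]

print(log10_y)
print(recip_y)
```

The two logarithms differ only by the constant factor ln(10), so the choice of base changes the coefficient scale but not the fit.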
Slide 5
General Linear Model
Models in which the parameters ($\beta_0, \beta_1, \ldots, \beta_p$) have exponents other than one are called nonlinear models.
In some cases we can perform a transformation of variables that will enable us to use regression analysis with the general linear model.
Exponential Model
The exponential model involves the regression equation:
$E(y) = \beta_0 \beta_1^{x}$
We can transform this nonlinear model to a linear model by taking the logarithm of both sides.
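Taking logarithms of the exponential model gives $\ln E(y) = \ln \beta_0 + x \ln \beta_1$, which is a first-order model in $x$. The sketch below fits that straight line to made-up, noise-free data and transforms the estimates back; the values 3.0 and 1.2 are hypothetical, chosen only so the recovery can be checked.

```python
import math

# Hypothetical parameters and noise-free data from E(y) = beta0 * beta1**x
beta0, beta1 = 3.0, 1.2
xs = [float(x) for x in range(1, 11)]
ys = [beta0 * beta1 ** x for x in xs]

# Simple linear regression of ln(y) on x
log_ys = [math.log(v) for v in ys]
n = len(xs)
x_bar = sum(xs) / n
ly_bar = sum(log_ys) / n
sxy = sum((x - x_bar) * (ly - ly_bar) for x, ly in zip(xs, log_ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
slope = sxy / sxx                    # estimates ln(beta1)
intercept = ly_bar - slope * x_bar   # estimates ln(beta0)

# Back-transform to the original parameters
beta0_hat = math.exp(intercept)
beta1_hat = math.exp(slope)
print(beta0_hat, beta1_hat)
```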
Slide 6
Determining When to Add or Delete Variables
F Test
To test whether the addition of $x_2$ to a model involving $x_1$ (or the deletion of $x_2$ from a model involving $x_1$ and $x_2$) is statistically significant:
$F = \dfrac{(\mathrm{SSE}(x_1) - \mathrm{SSE}(x_1, x_2))/1}{\mathrm{SSE}(x_1, x_2)/(n - p - 1)}$
In general:
$F = \dfrac{(\mathrm{SSE(reduced)} - \mathrm{SSE(full)})/\text{number of extra terms}}{\mathrm{MSE(full)}}$
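The general form of the F statistic can be computed directly from two least-squares fits. This is a minimal sketch using NumPy; the function names `sse` and `partial_f` and the simulated data are illustrative assumptions, not part of the course material. Both design matrices are expected to include the intercept column, so the error degrees of freedom n - p - 1 equal n minus the number of columns in the full model.

```python
import numpy as np

def sse(X, y):
    """Error sum of squares from a least-squares fit of y on the columns of X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

def partial_f(X_reduced, X_full, y):
    """F = ((SSE(reduced) - SSE(full)) / extra terms) / MSE(full)."""
    n, k_full = X_full.shape              # k_full = p + 1 (intercept included)
    extra = k_full - X_reduced.shape[1]   # number of extra terms
    sse_full = sse(X_full, y)
    mse_full = sse_full / (n - k_full)    # SSE(full) / (n - p - 1)
    return ((sse(X_reduced, y) - sse_full) / extra) / mse_full

# Hypothetical data in which x2 truly matters
rng = np.random.default_rng(1)
n = 30
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 10, n)
y = 1 + 2 * x1 + 3 * x2 + rng.normal(0, 0.5, n)

ones = np.ones(n)
X_reduced = np.column_stack([ones, x1])
X_full = np.column_stack([ones, x1, x2])
f_stat = partial_f(X_reduced, X_full, y)
print(f_stat)
```

The resulting value would be compared with an F critical value with (number of extra terms, n - p - 1) degrees of freedom.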
Slide 7
Variable-Selection Procedures
Stepwise Regression
• At each iteration, the first consideration is to see whether the least significant variable currently in the model can be removed because its F value, FMIN, is less than the user-specified or default F value, FREMOVE.
• If no variable can be removed, the procedure checks to see whether the most significant variable not in the model can be added because its F value, FMAX, is greater than the user-specified or default F value, FENTER.
• If no variable can be removed and no variable can be added, the procedure stops.
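The three steps above can be sketched as a loop that first tries a removal, then an entry, and stops when neither is possible. This is an illustrative implementation under stated assumptions, not the algorithm of any particular package: the thresholds `f_enter` and `f_remove` (4.0 here) stand in for FENTER and FREMOVE, the data are simulated, and the helper names are hypothetical.

```python
import numpy as np

def sse(A, y):
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ beta
    return float(r @ r)

def design(X, cols):
    """Intercept column plus the selected predictor columns."""
    return np.column_stack([np.ones(len(X))] + [X[:, j] for j in cols])

def stepwise(X, y, f_enter=4.0, f_remove=4.0, max_iter=100):
    n, k = X.shape
    selected = []
    for _ in range(max_iter):
        # Removal step: F for the least significant variable in the model (FMIN)
        if selected:
            sse_full = sse(design(X, selected), y)
            mse_full = sse_full / (n - len(selected) - 1)
            f_out = {j: (sse(design(X, [c for c in selected if c != j]), y)
                         - sse_full) / mse_full for j in selected}
            j_min = min(f_out, key=f_out.get)
            if f_out[j_min] < f_remove:
                selected.remove(j_min)
                continue
        # Entry step: F for the most significant variable not in the model (FMAX)
        candidates = [j for j in range(k) if j not in selected]
        if candidates:
            sse_cur = sse(design(X, selected), y)
            f_in = {}
            for j in candidates:
                sse_new = sse(design(X, selected + [j]), y)
                f_in[j] = (sse_cur - sse_new) / (sse_new / (n - len(selected) - 2))
            j_max = max(f_in, key=f_in.get)
            if f_in[j_max] > f_enter:
                selected.append(j_max)
                continue
        break  # nothing removed and nothing added
    return sorted(selected)

# Hypothetical data: only columns 0 and 2 affect y
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 2] + rng.normal(size=60)
chosen = stepwise(X, y)
print(chosen)
```

Keeping `f_remove` no larger than `f_enter` prevents a variable from being removed immediately after it enters; `max_iter` is only a safety bound.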
Slide 8
Variable-Selection Procedures
Forward Selection
• This procedure is similar to stepwise regression, but does not permit a variable to be deleted.
• This forward-selection procedure starts with no independent variables.
• It adds variables one at a time as long as a significant reduction in the error sum of squares (SSE) can be achieved.
Slide 9
Variable-Selection Procedures
Backward Elimination
• This procedure begins with a model that includes all the independent variables the modeler wants considered.
• It then attempts to delete one variable at a time by determining whether the least significant variable currently in the model can be removed because its F value, FMIN, is less than the user-specified or default F value, FREMOVE.
• Once a variable has been removed from the model it cannot reenter at a subsequent step.
Slide 10
Variable-Selection Procedures
Best-Subsets Regression
• The three preceding procedures are one-variable-at-a-time methods offering no guarantee that the best model for a given number of variables will be found.
• Some software packages include best-subsets regression, which enables the user to find, given a specified number of independent variables, the best regression model.
• Minitab output identifies the two best one-variable estimated regression equations, the two best two-variable equations, and so on.
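Best-subsets regression can be sketched as an exhaustive search: fit every subset of each size and keep the top few by R-sq. The code below is a minimal illustration with NumPy and simulated data; the function names, the `top=2` setting (mirroring the "two best" equations mentioned above), and the data are assumptions, and real packages use more efficient search strategies.

```python
import itertools
import numpy as np

def r_squared(X, y, cols):
    """R-sq of the least-squares fit of y on an intercept plus the given columns."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    sst = float(((y - y.mean()) ** 2).sum())
    return 1.0 - float(resid @ resid) / sst

def best_subsets(X, y, top=2):
    """For each subset size, return the `top` subsets ranked by R-sq."""
    k = X.shape[1]
    result = {}
    for size in range(1, k + 1):
        scored = [(r_squared(X, y, cols), cols)
                  for cols in itertools.combinations(range(k), size)]
        scored.sort(reverse=True)
        result[size] = scored[:top]
    return result

# Hypothetical data: only columns 0 and 2 affect y
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 2] + rng.normal(size=60)

result = best_subsets(X, y)
for size, rows in result.items():
    for r2, cols in rows:
        print(size, round(100 * r2, 1), cols)
```

With k candidate variables the search fits 2^k - 1 models, which is why this approach is practical only for a modest number of predictors.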
Slide 11
Example: PGA Tour Data
The Professional Golfers Association keeps a variety of statistics regarding performance measures. Data include the average driving distance, percentage of drives that land in the fairway, percentage of greens hit in regulation, average number of putts, percentage of sand saves, and average score.
The variable names and definitions are shown on the next slide.
Slide 12
Example: PGA Tour Data
Variable Names and Definitions
Drive: average length of a drive in yards
Fair: percentage of drives that land in the fairway
Green: percentage of greens hit in regulation (a par-3 green is “hit in regulation” if the player’s first shot lands on the green)
Putt: average number of putts for greens that have been hit in regulation
Sand: percentage of sand saves (landing in a sand trap and still scoring par or better)
Score: average score for an 18-hole round
Slide 13
Example: PGA Tour Data
Sample Data
Drive   Fair   Green   Putt    Sand   Score
277.6   .681   .667    1.768   .550   69.10
259.6   .691   .665    1.810   .536   71.09
269.1   .657   .649    1.747   .472   70.12
267.0   .689   .673    1.763   .672   69.88
267.3   .581   .637    1.781   .521   70.71
255.6   .778   .674    1.791   .455   69.76
272.9   .615   .667    1.780   .476   70.19
265.4   .718   .699    1.790   .551   69.73
Slide 14
Example: PGA Tour Data
Sample Data (continued)
Drive   Fair   Green   Putt    Sand   Score
272.6   .660   .672    1.803   .431   69.97
263.9   .668   .669    1.774   .493   70.33
267.0   .686   .687    1.809   .492   70.32
266.0   .681   .670    1.765   .599   70.09
258.1   .695   .641    1.784   .500   70.46
255.6   .792   .672    1.752   .603   69.49
261.3   .740   .702    1.813   .529   69.88
262.2   .721   .662    1.754   .576   70.27
Slide 15
Example: PGA Tour Data
Sample Data (continued)
Drive   Fair   Green   Putt    Sand   Score
260.5   .703   .623    1.782   .567   70.72
271.3   .671   .666    1.783   .492   70.30
263.3   .714   .687    1.796   .468   69.91
276.6   .634   .643    1.776   .541   70.69
252.1   .726   .639    1.788   .493   70.59
263.0   .687   .675    1.786   .486   70.20
263.0   .639   .647    1.760   .374   70.81
253.5   .732   .693    1.797   .518   70.26
266.2   .681   .657    1.812   .472   70.96
Slide 16
Example: PGA Tour Data
Sample Correlation Coefficients
        Score   Drive   Fair   Green   Putt
Drive   -.154
Fair    -.427   -.679
Green   -.556   -.045   .421
Putt     .258   -.139   .101   .354
Sand    -.278   -.024   .265   .083    -.296
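In a regression with a single predictor, R-sq equals the squared sample correlation between the predictor and the response. The short check below squares the Score column of the correlation table; the two largest results, 30.9 and 18.2, coincide with the R-sq values reported for the two one-variable equations in the best-subsets output of this example. The dictionary name is illustrative.

```python
# Sample correlations with Score, copied from the correlation table
corr_with_score = {
    "Drive": -0.154,
    "Fair": -0.427,
    "Green": -0.556,
    "Putt": 0.258,
    "Sand": -0.278,
}

# One-predictor R-sq (in percent) is 100 times the squared correlation
r_sq_pct = {name: round(100 * r * r, 1) for name, r in corr_with_score.items()}

# Rank the single predictors from most to least explanatory
ranked = sorted(r_sq_pct, key=r_sq_pct.get, reverse=True)
for name in ranked:
    print(name, r_sq_pct[name])
```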
Slide 17
Example: PGA Tour Data
Best Subsets Regression of SCORE
Vars   R-sq   R-sq(adj)   C-p    s        D F G P S
1      30.9   27.9        26.9   .39685   X
1      18.2   14.6        35.7   .43183   X
2      54.7   50.5        12.4   .32872   X X
2      54.6   50.5        12.5   .32891   X X
3      60.7   55.1        10.2   .31318   X X X
3      59.1   53.3        11.4   .31957   X X X
4      72.2   66.8         4.2   .26913   X X X X
4      60.9   53.1        12.1   .32011   X X X X
5      72.6   65.4         6.0   .27499   X X X X X
Slide 18
Example: PGA Tour Data
Minitab Output
The regression equation
Score = 74.678 - .0398(Drive) - 6.686(Fair) - 10.342(Green) + 9.858(Putt)
Predictor   Coef      Stdev    t-ratio   p
Constant    74.678    6.952    10.74     .000
Drive       -.0398    .01235   -3.22     .004
Fair        -6.686    1.939    -3.45     .003
Green       -10.342   3.561    -2.90     .009
Putt        9.858     3.180    3.10      .006
s = .2691   R-sq = 72.4%   R-sq(adj) = 66.8%
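As a worked example, the estimated regression equation can be applied to the first golfer in the sample data (Drive 277.6, Fair .681, Green .667, Putt 1.768, actual Score 69.10). The sketch also recomputes each t-ratio as Coef / Stdev. The coefficients are the rounded values printed in the output, so the prediction is approximate; the variable names are illustrative.

```python
# Coefficients and standard deviations as printed in the Minitab output
coef = {"Constant": 74.678, "Drive": -0.0398, "Fair": -6.686,
        "Green": -10.342, "Putt": 9.858}
stdev = {"Constant": 6.952, "Drive": 0.01235, "Fair": 1.939,
         "Green": 3.561, "Putt": 3.180}

# First row of the sample data
first_row = {"Drive": 277.6, "Fair": 0.681, "Green": 0.667, "Putt": 1.768}
actual_score = 69.10

# Predicted score and residual (actual minus predicted)
predicted = coef["Constant"] + sum(coef[k] * first_row[k] for k in first_row)
residual = actual_score - predicted

# t-ratio = coefficient / its standard deviation
t_ratios = {k: coef[k] / stdev[k] for k in coef}

print(round(predicted, 3), round(residual, 3))
print({k: round(t, 2) for k, t in t_ratios.items()})
```

The prediction is about 69.61, giving a residual of about -0.51, a little under two times s = .2691.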
Slide 19
Example: PGA Tour Data
Minitab Output
Analysis of Variance
SOURCE       DF   SS        MS       F       P
Regression    4   3.79469   .94867   13.10   .000
Error        20   1.44865   .07243
Total        24   5.24334
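The summary statistics in the Minitab output follow from the sums of squares and degrees of freedom in the ANOVA table. The arithmetic below uses only the printed values; the variable names are illustrative.

```python
import math

# Sums of squares and degrees of freedom from the ANOVA table
ssr, df_reg = 3.79469, 4
sse, df_err = 1.44865, 20
sst, df_tot = 5.24334, 24

msr = ssr / df_reg                 # mean square regression
mse = sse / df_err                 # mean square error
f_stat = msr / mse                 # overall F statistic
s = math.sqrt(mse)                 # standard error of the estimate
r_sq = ssr / sst                   # coefficient of determination
r_sq_adj = 1 - (sse / df_err) / (sst / df_tot)   # adjusted R-sq

print(round(msr, 5), round(mse, 5), round(f_stat, 2))
print(round(s, 4), round(100 * r_sq, 1), round(100 * r_sq_adj, 1))
```

These reproduce F = 13.10, s = .2691, R-sq = 72.4%, and R-sq(adj) = 66.8%.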
Slide 20
Residual Analysis: Autocorrelation
Durbin-Watson Test for Autocorrelation
• Statistic
$d = \dfrac{\sum_{t=2}^{n} (e_t - e_{t-1})^2}{\sum_{t=1}^{n} e_t^2}$
• The statistic ranges in value from zero to four.
• If successive values of the residuals are close together (positive autocorrelation), the statistic will be small.
• If successive values are far apart (negative autocorrelation), the statistic will be large.
• A value of two indicates no autocorrelation.
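The Durbin-Watson statistic is simple to compute from a sequence of residuals ordered in time. The sketch below implements the formula and evaluates it on two made-up residual sequences (both hypothetical): one that drifts smoothly, mimicking positive autocorrelation, and one that alternates in sign, mimicking negative autocorrelation.

```python
def durbin_watson(residuals):
    """d = sum_{t=2..n} (e_t - e_{t-1})^2 / sum_{t=1..n} e_t^2."""
    numerator = sum((residuals[t] - residuals[t - 1]) ** 2
                    for t in range(1, len(residuals)))
    denominator = sum(e * e for e in residuals)
    return numerator / denominator

# Hypothetical residuals: successive values close together
smooth = [-3.0, -2.0, -1.0, 1.0, 2.0, 3.0]
# Hypothetical residuals: successive values far apart
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]

d_smooth = durbin_watson(smooth)
d_alternating = durbin_watson(alternating)
print(round(d_smooth, 3), round(d_alternating, 3))
```

The smooth sequence gives d = 8/28, well below two, and the alternating sequence gives d = 20/6, well above two, matching the interpretation in the bullets.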
Slide 21
End of Chapter 16