Slide 1

Statistics, Spring 2004

Instructor: 余清祥, Department of Statistics. Date: May 11, 2004. Week 12: Building Regression Models

Slide 2

Chapter 16
Regression Analysis: Model Building

• General Linear Model
• Determining When to Add or Delete Variables
• Analysis of a Larger Problem
• Variable-Selection Procedures
• Residual Analysis
• Multiple Regression Approach to Analysis of Variance and Experimental Design

Slide 3

General Linear Model

Models in which the parameters ($\beta_0, \beta_1, \ldots, \beta_p$) all have exponents of one are called linear models.

First-Order Model with One Predictor Variable

$y = \beta_0 + \beta_1 x_1 + \varepsilon$

Second-Order Model with One Predictor Variable

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2 + \varepsilon$

Second-Order Model with Two Predictor Variables with Interaction

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1^2 + \beta_4 x_2^2 + \beta_5 x_1 x_2 + \varepsilon$
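The second-order model is still linear in its parameters: the squared and interaction terms are simply extra columns. A minimal Python sketch of that idea follows. The function names and the coefficient values are hypothetical, chosen only for illustration.

```python
def second_order_terms(x1, x2):
    """Return the six terms multiplied by beta_0 ... beta_5 in the model."""
    return [1.0, x1, x2, x1 ** 2, x2 ** 2, x1 * x2]


def mean_response(beta, x1, x2):
    """Evaluate beta_0 + beta_1*x1 + ... + beta_5*x1*x2 as a plain dot product."""
    terms = second_order_terms(x1, x2)
    return sum(b * t for b, t in zip(beta, terms))


# Hypothetical coefficients, not estimates from any data set.
beta = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
value = mean_response(beta, x1=1.0, x2=2.0)
print(value)
```

Because the response is a dot product of the parameters with known terms, ordinary least squares can estimate the parameters once the extra columns are built.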

Slide 4

Often the problem of nonconstant variance can be corrected by transforming the dependent variable to a different scale.

Logarithmic Transformations

Most statistical packages provide the ability to apply logarithmic transformations using either the base-10 (common log) or the base e = 2.71828... (natural log).

Reciprocal Transformation

Use 1/y as the dependent variable instead of y.

Slide 5

Models in which the parameters ($\beta_0, \beta_1, \ldots, \beta_p$) have exponents other than one are called nonlinear models.

In some cases we can perform a transformation of variables that will enable us to use regression analysis with the general linear model.

Exponential Model

The exponential model involves the regression equation:

$E(y) = \beta_0 \beta_1^{x}$

We can transform this nonlinear model to a linear model by taking the logarithm of both sides:

$\log E(y) = \log \beta_0 + x \log \beta_1$
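The log transformation can be sketched in Python as follows: regress log(y) on x with simple least squares, then exponentiate the intercept and slope to recover the two parameters. The function name and the noise-free data are assumptions for illustration. Note that least squares on the log scale is not the same criterion as least squares on the original scale.

```python
import math


def fit_exponential(xs, ys):
    """Estimate (b0, b1) in E(y) = b0 * b1**x by regressing log(y) on x."""
    logs = [math.log(y) for y in ys]  # requires every y > 0
    n = len(xs)
    x_bar = sum(xs) / n
    log_bar = sum(logs) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxl = sum((x - x_bar) * (l - log_bar) for x, l in zip(xs, logs))
    slope = sxl / sxx                      # estimate of log(b1)
    intercept = log_bar - slope * x_bar    # estimate of log(b0)
    return math.exp(intercept), math.exp(slope)


# Hypothetical noise-free data generated from y = 2 * 1.5**x.
xs = [0, 1, 2, 3, 4]
ys = [2.0 * 1.5 ** x for x in xs]
b0, b1 = fit_exponential(xs, ys)
print(b0, b1)
```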


Slide 6

Determining When to Add or Delete Variables

F Test

To test whether the addition of $x_2$ to a model involving $x_1$ (or the deletion of $x_2$ from a model involving $x_1$ and $x_2$) is statistically significant:

$F = \dfrac{(\mathrm{SSE}(x_1) - \mathrm{SSE}(x_1, x_2))/1}{\mathrm{SSE}(x_1, x_2)/(n - p - 1)}$

In general:

$F = \dfrac{(\mathrm{SSE}(\text{reduced}) - \mathrm{SSE}(\text{full}))/\text{number of extra terms}}{\mathrm{MSE}(\text{full})}$
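The general form of the statistic is short enough to write out directly. The Python sketch below assumes that MSE(full) is SSE(full) divided by n - p - 1, with p the number of independent variables in the full model. The function name and the SSE values are hypothetical and do not come from the slides.

```python
def partial_f(sse_reduced, sse_full, extra_terms, n, p_full):
    """F statistic for comparing a reduced model with a full model.

    n is the number of observations and p_full the number of independent
    variables in the full model, so MSE(full) has n - p_full - 1 degrees
    of freedom.
    """
    mse_full = sse_full / (n - p_full - 1)
    return ((sse_reduced - sse_full) / extra_terms) / mse_full


# Hypothetical values: adding x2 to a model with x1, using n = 10 observations.
f_stat = partial_f(sse_reduced=520.0, sse_full=100.0, extra_terms=1, n=10, p_full=2)
print(round(f_stat, 2))
```

The result is compared with an F critical value that has (number of extra terms, n - p - 1) degrees of freedom.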

Slide 7

Variable-Selection Procedures

Stepwise Regression

• At each iteration, the first consideration is to see whether the least significant variable currently in the model can be removed because its F value, FMIN, is less than the user-specified or default F value, FREMOVE.
• If no variable can be removed, the procedure checks to see whether the most significant variable not in the model can be added because its F value, FMAX, is greater than the user-specified or default F value, FENTER.
• If no variable can be removed and no variable can be added, the procedure stops.
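The three rules above can be sketched as a loop in Python with NumPy. This is an illustration under stated assumptions, not the routine of any particular package. The function names, the thresholds of 4.0, the iteration cap, and the synthetic data are all hypothetical.

```python
import numpy as np


def sse(X, y, cols):
    """Error sum of squares for a model with an intercept plus the given columns."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(resid @ resid)


def stepwise(X, y, f_enter=4.0, f_remove=4.0, max_iter=100):
    """Return the sorted column indices chosen by stepwise regression."""
    n, k = X.shape
    model = []
    for _ in range(max_iter):  # cap guards against cycling if f_enter < f_remove
        # Rule 1: try to remove the least significant variable (FMIN < FREMOVE).
        if model:
            full = sse(X, y, model)
            mse_full = full / (n - len(model) - 1)
            f_out = {j: (sse(X, y, [c for c in model if c != j]) - full) / mse_full
                     for j in model}
            j_min = min(f_out, key=f_out.get)
            if f_out[j_min] < f_remove:
                model.remove(j_min)
                continue
        # Rule 2: try to add the most significant variable (FMAX > FENTER).
        candidates = [j for j in range(k) if j not in model]
        if candidates:
            reduced = sse(X, y, model)
            f_in = {}
            for j in candidates:
                full = sse(X, y, model + [j])
                f_in[j] = (reduced - full) / (full / (n - len(model) - 2))
            j_max = max(f_in, key=f_in.get)
            if f_in[j_max] > f_enter:
                model.append(j_max)
                continue
        # Rule 3: nothing removed and nothing added, so stop.
        break
    return sorted(model)


# Synthetic data: y depends strongly on columns 0 and 2, not on column 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.1, size=50)
selected = stepwise(X, y)
print(selected)
```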

Slide 8

Forward Selection

• This procedure is similar to stepwise regression, but does not permit a variable to be deleted.
• This forward-selection procedure starts with no independent variables.
• It adds variables one at a time as long as a significant reduction in the error sum of squares (SSE) can be achieved.


Slide 9

Backward Elimination

• This procedure begins with a model that includes all the independent variables the modeler wants considered.
• It then attempts to delete one variable at a time by determining whether the least significant variable currently in the model can be removed because its F value, FMIN, is less than the user-specified or default F value, FREMOVE.
• Once a variable has been removed from the model it cannot reenter at a subsequent step.


Slide 10

Best-Subsets Regression

• The three preceding procedures are one-variable-at-a-time methods offering no guarantee that the best model for a given number of variables will be found.
• Some software packages include best-subsets regression, which enables the user to find, given a specified number of independent variables, the best regression model.
• Minitab output identifies the two best one-variable estimated regression equations, the two best two-variable equations, and so on.
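A brute-force version of the idea can be sketched in Python with NumPy: fit every subset of each size and keep the two with the highest R-sq. This is only an illustration. The function names, the variable labels, and the synthetic data are assumptions, and ranking by R-sq is one possible criterion among several.

```python
from itertools import combinations

import numpy as np


def r_squared(X, y, cols):
    """R-sq for a least squares fit with an intercept plus the given columns."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    sst = float(((y - y.mean()) ** 2).sum())
    return 1.0 - float(resid @ resid) / sst


def best_subsets(X, y, names, keep=2):
    """Map each subset size to its `keep` best (R-sq, variable names) pairs."""
    k = X.shape[1]
    results = {}
    for size in range(1, k + 1):
        scored = [(r_squared(X, y, cols), tuple(names[j] for j in cols))
                  for cols in combinations(range(k), size)]
        scored.sort(reverse=True)  # highest R-sq first
        results[size] = scored[:keep]
    return results


# Synthetic data with three hypothetical predictors named a, b, c.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
y = 1.0 + 2.0 * X[:, 0] - X[:, 2] + rng.normal(scale=0.5, size=30)
results = best_subsets(X, y, ["a", "b", "c"])
for size, rows in results.items():
    for r_sq, variables in rows:
        print(size, round(100 * r_sq, 1), variables)
```

The number of subsets doubles with each added variable, so exhaustive search is practical only for a modest number of candidates.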


Slide 11

Example: PGA Tour Data

The Professional Golfers Association keeps a variety of statistics regarding performance measures. Data include the average driving distance, percentage of drives that land in the fairway, percentage of greens hit in regulation, average number of putts, percentage of sand saves, and average score.

The variable names and definitions are shown on the next slide.

Slide 12

Variable Names and Definitions

Drive: average length of a drive in yards
Fair: percentage of drives that land in the fairway
Green: percentage of greens hit in regulation (a par-3 green is “hit in regulation” if the player’s first shot lands on the green)
Putt: average number of putts for greens that have been hit in regulation
Sand: percentage of sand saves (landing in a sand trap and still scoring par or better)
Score: average score for an 18-hole round


Slide 13

Sample Data

Drive   Fair   Green   Putt    Sand   Score
277.6   .681   .667    1.768   .550   69.10
259.6   .691   .665    1.810   .536   71.09
269.1   .657   .649    1.747   .472   70.12
267.0   .689   .673    1.763   .672   69.88
267.3   .581   .637    1.781   .521   70.71
255.6   .778   .674    1.791   .455   69.76
272.9   .615   .667    1.780   .476   70.19
265.4   .718   .699    1.790   .551   69.73


Slide 14

Sample Data (continued)

Drive   Fair   Green   Putt    Sand   Score
272.6   .660   .672    1.803   .431   69.97
263.9   .668   .669    1.774   .493   70.33
267.0   .686   .687    1.809   .492   70.32
266.0   .681   .670    1.765   .599   70.09
258.1   .695   .641    1.784   .500   70.46
255.6   .792   .672    1.752   .603   69.49
261.3   .740   .702    1.813   .529   69.88
262.2   .721   .662    1.754   .576   70.27


Slide 15

Sample Data (continued)

Drive   Fair   Green   Putt    Sand   Score
260.5   .703   .623    1.782   .567   70.72
271.3   .671   .666    1.783   .492   70.30
263.3   .714   .687    1.796   .468   69.91
276.6   .634   .643    1.776   .541   70.69
252.1   .726   .639    1.788   .493   70.59
263.0   .687   .675    1.786   .486   70.20
263.0   .639   .647    1.760   .374   70.81
253.5   .732   .693    1.797   .518   70.26
266.2   .681   .657    1.812   .472   70.96


Slide 16

Sample Correlation Coefficients

        Score   Drive   Fair    Green   Putt
Drive   -.154
Fair    -.427   -.679
Green   -.556   -.045   .421
Putt     .258   -.139   .101    .354
Sand    -.278   -.024   .265    .083    -.296


Slide 17

Best Subsets Regression of SCORE

Vars   R-sq   R-sq(adj)   C-p    s        X marks (columns D F G P S)
1      30.9   27.9        26.9   .39685   X
1      18.2   14.6        35.7   .43183   X
2      54.7   50.5        12.4   .32872   X X
2      54.6   50.5        12.5   .32891   X X
3      60.7   55.1        10.2   .31318   X X X
3      59.1   53.3        11.4   .31957   X X X
4      72.2   66.8        4.2    .26913   X X X X
4      60.9   53.1        12.1   .32011   X X X X
5      72.6   65.4        6.0    .27499   X X X X X
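The R-sq(adj) column can be checked from the R-sq column with the usual adjustment for the number of variables. The Python sketch below applies it to the first and last rows, assuming n = 25 golfers as in the sample data. The function name is hypothetical, and small differences can arise because the table reports R-sq to one decimal place.

```python
def adjusted_r_squared(r_sq, n, p):
    """Adjusted R-sq for n observations and p independent variables."""
    return 1.0 - (1.0 - r_sq) * (n - 1) / (n - p - 1)


# R-sq values taken from the first and last rows of the table, as proportions.
one_var = 100 * adjusted_r_squared(0.309, n=25, p=1)    # table lists 27.9
five_var = 100 * adjusted_r_squared(0.726, n=25, p=5)   # table lists 65.4
print(round(one_var, 1), round(five_var, 1))
```

Unlike R-sq, the adjusted value can fall when a variable is added, which is what happens between the best four-variable and the five-variable model in the table.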


Slide 18

Minitab Output

The regression equation

Score = 74.678 - .0398(Drive) - 6.686(Fair) - 10.342(Green) + 9.858(Putt)

Predictor   Coef      Stdev    t-ratio   p
Constant    74.678    6.952    10.74     .000
Drive       -.0398    .01235   -3.22     .004
Fair        -6.686    1.939    -3.45     .003
Green       -10.342   3.561    -2.90     .009
Putt        9.858     3.180    3.10      .006

s = .2691   R-sq = 72.4%   R-sq(adj) = 66.8%
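As a small illustration, the estimated equation can be applied to the first golfer in the sample data (Drive 277.6, Fair .681, Green .667, Putt 1.768, Score 69.10). The function name is hypothetical, and the coefficients are the rounded values printed in the output, so the fitted value is approximate.

```python
def predicted_score(drive, fair, green, putt):
    """Estimated regression equation with the rounded Minitab coefficients."""
    return 74.678 - 0.0398 * drive - 6.686 * fair - 10.342 * green + 9.858 * putt


# First row of the sample data; the observed average score is 69.10.
fitted = predicted_score(drive=277.6, fair=0.681, green=0.667, putt=1.768)
residual = 69.10 - fitted
print(round(fitted, 3), round(residual, 3))
```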


Slide 19

Minitab Output

Analysis of Variance

SOURCE       DF   SS        MS       F       P
Regression    4   3.79469   .94867   13.10   .000
Error        20   1.44865   .07243
Total        24   5.24334
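The entries of the table are tied together by simple arithmetic, which the short Python check below reproduces from the sums of squares and degrees of freedom shown. The variable names are chosen here for illustration only.

```python
# Sums of squares and degrees of freedom copied from the table.
ssr, sse, sst = 3.79469, 1.44865, 5.24334
df_regression, df_error = 4, 20

msr = ssr / df_regression   # mean square for regression
mse = sse / df_error        # mean square error
f_stat = msr / mse          # overall F statistic
r_sq = ssr / sst            # coefficient of determination
s = mse ** 0.5              # standard error of the estimate

print(round(msr, 5), round(mse, 5), round(f_stat, 2), round(100 * r_sq, 1), round(s, 4))
```

The computed values agree with MS, F, R-sq = 72.4% and s = .2691 in the Minitab output.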


Slide 20

Residual Analysis: Autocorrelation

Durbin-Watson Test for Autocorrelation

• Statistic

$d = \dfrac{\sum_{t=2}^{n} (e_t - e_{t-1})^2}{\sum_{t=1}^{n} e_t^2}$

• The statistic ranges in value from zero to four.
• If successive values of the residuals are close together (positive autocorrelation), the statistic will be small.
• If successive values are far apart (negative autocorrelation), the statistic will be large.
• A value of two indicates no autocorrelation.
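The Durbin-Watson statistic is easy to compute from a list of residuals in time order. The Python sketch below uses two hypothetical residual series, one drifting slowly and one alternating in sign, to show the small and large ends of the range. The function name and the data are assumptions for illustration.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squared residuals."""
    numerator = sum((residuals[t] - residuals[t - 1]) ** 2
                    for t in range(1, len(residuals)))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Hypothetical residuals: successive values close together, then far apart.
drifting = durbin_watson([-2.0, -1.0, 0.0, 1.0, 2.0])
alternating = durbin_watson([1.0, -1.0, 1.0, -1.0])
print(drifting, alternating)
```

The drifting series gives a value well below two and the alternating series a value well above two, matching the interpretation above.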

Slide 21

End of Chapter 16
