Page 1: chap 13

Simple Linear Regression

USING STATISTICS @ Sunflowers Apparel

13.1 TYPES OF REGRESSION MODELS

13.2 DETERMINING THE SIMPLE LINEAR REGRESSION EQUATION
  The Least-Squares Method
  Visual Explorations: Exploring Simple Linear Regression Coefficients
  Predictions in Regression Analysis: Interpolation Versus Extrapolation
  Computing the Y Intercept, b0, and the Slope, b1

13.3 MEASURES OF VARIATION
  Computing the Sum of Squares
  The Coefficient of Determination
  Standard Error of the Estimate

13.4 ASSUMPTIONS

13.5 RESIDUAL ANALYSIS
  Evaluating the Assumptions

13.6 MEASURING AUTOCORRELATION: THE DURBIN-WATSON STATISTIC
  Residual Plots to Detect Autocorrelation
  The Durbin-Watson Statistic

13.7 INFERENCES ABOUT THE SLOPE AND CORRELATION COEFFICIENT
  t Test for the Slope
  F Test for the Slope
  Confidence Interval Estimate of the Slope (β1)
  t Test for the Correlation Coefficient

13.8 ESTIMATION OF MEAN VALUES AND PREDICTION OF INDIVIDUAL VALUES
  The Confidence Interval Estimate
  The Prediction Interval

13.9 PITFALLS IN REGRESSION AND ETHICAL ISSUES

EXCEL COMPANION TO CHAPTER 13
  E13.1 Performing Simple Linear Regression Analyses
  E13.2 Creating Scatter Plots and Adding a Prediction Line
  E13.3 Performing Residual Analyses
  E13.4 Computing the Durbin-Watson Statistic
  E13.5 Estimating the Mean of Y and Predicting Y Values
  E13.6 Example: Sunflowers Apparel Data

In this chapter, you learn:
- To use regression analysis to predict the value of a dependent variable based on an independent variable
- The meaning of the regression coefficients b0 and b1
- To evaluate the assumptions of regression analysis and know what to do if the assumptions are violated
- To make inferences about the slope and correlation coefficient
- To estimate mean values and predict individual values

Page 2: chap 13

512 CHAPTER THIRTEEN Simple Linear Regression

Using Statistics @ Sunflowers Apparel

The sales for Sunflowers Apparel, a chain of upscale clothing stores for women, have increased during the past 12 years as the chain has expanded the number of stores open. Until now, Sunflowers managers selected sites based on subjective factors, such as the availability of a good lease or the perception that a location seemed ideal for an apparel store. As the new director of planning, you need to develop a systematic approach that will lead to making better decisions during the site selection process. As a starting point, you believe that the size of the store significantly contributes to store sales, and you want to use this relationship in the decision-making process. How can you use statistics so that you can forecast the annual sales of a proposed store based on the size of that store?

In this chapter and the next two chapters, you learn how regression analysis enables you to develop a model to predict the values of a numerical variable, based on the value of other variables.

In regression analysis, the variable you wish to predict is called the dependent variable. The variables used to make the prediction are called independent variables. In addition to predicting values of the dependent variable, regression analysis also allows you to identify the type of mathematical relationship that exists between a dependent and an independent variable, to quantify the effect that changes in the independent variable have on the dependent variable, and to identify unusual observations. For example, as the director of planning, you may wish to predict sales for a Sunflowers store, based on the size of the store. Other examples include predicting the monthly rent of an apartment, based on its size, and predicting the monthly sales of a product in a supermarket, based on the amount of shelf space devoted to the product.

This chapter discusses simple linear regression, in which a single numerical independent variable, X, is used to predict the numerical dependent variable Y, such as using the size of a store to predict the annual sales of the store. Chapters 14 and 15 discuss multiple regression models, which use several independent variables to predict a numerical dependent variable, Y. For example, you could use the amount of advertising expenditures, price, and the amount of shelf space devoted to a product to predict its monthly sales.

13.1 TYPES OF REGRESSION MODELS

In Section 2.5, you used a scatter plot (also known as a scatter diagram) to examine the relationship between an X variable on the horizontal axis and a Y variable on the vertical axis. The nature of the relationship between two variables can take many forms, ranging from simple to extremely complicated mathematical functions. The simplest relationship consists of a straight-line, or linear, relationship. An example of this relationship is shown in Figure 13.1.

Page 3: chap 13

FIGURE 13.1 A positive straight-line relationship

Equation (13.1) represents the straight-line (linear) model.

SIMPLE LINEAR REGRESSION MODEL

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i \qquad (13.1)$$

where

$\beta_0$ = Y intercept for the population
$\beta_1$ = slope for the population
$\varepsilon_i$ = random error in Y for observation i
$Y_i$ = dependent variable (sometimes referred to as the response variable) for observation i
$X_i$ = independent variable (sometimes referred to as the explanatory variable) for observation i

The portion $Y_i = \beta_0 + \beta_1 X_i$ of the simple linear regression model expressed in Equation (13.1) is a straight line. The slope of the line, $\beta_1$, represents the expected change in Y per unit change in X. It represents the mean amount that Y changes (either positively or negatively) for a one-unit change in X. The Y intercept, $\beta_0$, represents the mean value of Y when X equals 0. The last component of the model, $\varepsilon_i$, represents the random error in Y for each observation, i. In other words, $\varepsilon_i$ is the vertical distance of the actual value of $Y_i$ above or below the predicted value of $Y_i$ on the line.
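To make the roles of the three components of Equation (13.1) concrete, the following minimal sketch generates observations from a straight line plus random error. The parameter values (intercept 1.0, slope 2.0, error standard deviation 0.5), the function name, and the choice of normally distributed errors are illustrative assumptions, not values or requirements taken from the chapter.

```python
import random


def simulate_observations(beta0, beta1, xs, error_sd, seed=0):
    """Generate (X, Y, error) triples from Y = beta0 + beta1*X + error.

    Assumption for illustration only: errors are drawn from a normal
    distribution with mean 0 and standard deviation error_sd.
    """
    rng = random.Random(seed)
    observations = []
    for x in xs:
        epsilon = rng.gauss(0.0, error_sd)   # random error for this observation
        y = beta0 + beta1 * x + epsilon      # Equation (13.1)
        observations.append((x, y, epsilon))
    return observations


# Hypothetical population parameters chosen only for this sketch.
BETA0, BETA1 = 1.0, 2.0
observations = simulate_observations(BETA0, BETA1, xs=[1, 2, 3, 4, 5], error_sd=0.5)

for x, y, epsilon in observations:
    line_value = BETA0 + BETA1 * x  # point on the straight line at this X
    # The error is the vertical distance between the actual Y and the line.
    print(f"X={x}  line={line_value:.2f}  Y={y:.2f}  vertical distance={y - line_value:+.2f}")
```

Each printed vertical distance equals the simulated error for that observation, which is the interpretation of $\varepsilon_i$ given above.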

The selection of the proper mathematical model depends on the distribution of the X and Y values on the scatter plot. In Panel A of Figure 13.2 on page 514, the values of Y are generally increasing linearly as X increases. This panel is similar to Figure 13.3 on page 515, which illustrates the positive relationship between the square footage of the store and the annual sales at branches of the Sunflowers Apparel women's clothing store chain.

Panel B is an example of a negative linear relationship. As X increases, the values of Y are generally decreasing. An example of this type of relationship might be the price of a particular product and the amount of sales.

The data in Panel C show a positive curvilinear relationship between X and Y. The values of Y increase as X increases, but this increase tapers off beyond certain values of X. An example of a positive curvilinear relationship might be the age and maintenance cost of a machine. As a machine gets older, the maintenance cost may rise rapidly at first, but then level off beyond a certain number of years.

Panel D shows a U-shaped relationship between X and Y. As X increases, at first Y generally decreases; but as X continues to increase, Y not only stops decreasing but actually increases above its minimum value. An example of this type of relationship might be the number of errors per hour at a task and the number of hours worked. The number of errors per hour

[Figure 13.1 labels: ΔY = "change in Y"; ΔX = "change in X"]

Page 4: chap 13

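FIGURE 13.2 Examples of types of relationships found in scatter plots
[Panel A: positive linear relationship; Panel B: negative linear relationship; Panel C: positive curvilinear relationship; Panel D: U-shaped curvilinear relationship; Panel E: exponential relationship; Panel F: no relationship between X and Y]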

decreases as the individual becomes more proficient at the task, but then it increases beyond a certain point because of factors such as fatigue and boredom.

Panel E indicates an exponential relationship between X and Y. In this case, Y decreases very rapidly as X first increases, but then it decreases much less rapidly as X increases further. An example of an exponential relationship could be the resale value of an automobile and its age. In the first year, the resale value drops drastically from its original price; however, the resale value then decreases much less rapidly in subsequent years.

Finally, Panel F shows a set of data in which there is very little or no relationship between X and Y. High and low values of Y appear at each value of X.

In this section, a variety of different models that represent the relationship between two variables were briefly examined. Although scatter plots are useful in visually displaying the mathematical form of a relationship, more sophisticated statistical procedures are available to determine the most appropriate model for a set of variables. The rest of this chapter discusses the model used when there is a linear relationship between variables.

13.2 DETERMINING THE SIMPLE LINEAR REGRESSION EQUATION

In the Using Statistics scenario on page 512, the stated goal is to forecast annual sales for all new stores, based on store size. To examine the relationship between the store size in square feet and its annual sales, a sample of 14 stores was selected. Table 13.1 summarizes the results for these 14 stores, which are stored in an accompanying data file.


Page 5: chap 13

TABLE 13.1 Square Footage (in Thousands of Square Feet) and Annual Sales (in Millions of Dollars) for a Sample of 14 Branches of Sunflowers Apparel

Store   Square Feet (Thousands)   Annual Sales (in Millions of Dollars)
  1            1.7                        3.7
  2            1.6                        3.9
  3            2.8                        6.7
  4            5.6                        9.5
  5            1.3                        3.4
  6            2.2                        5.6
  7            1.3                        3.7
  8            1.1                        2.7
  9            3.2                        5.5
 10            1.5                        2.9
 11            5.2                       10.7
 12            4.6                        7.6
 13            5.8                       11.8
 14            3.0                        4.1

FIGURE 13.3 Microsoft Excel scatter plot for the Sunflowers Apparel data. See Section E2.12 to create this.

Figure 13.3 displays the scatter plot for the data in Table 13.1. Observe the increasing relationship between square feet (X) and annual sales (Y). As the size of the store increases, annual sales increase approximately as a straight line. Thus, you can assume that a straight line provides a useful mathematical model of this relationship. Now you need to determine the specific straight line that is the best fit to these data.

[Scatter plot; horizontal axis: Square Feet (000)]

The Least-Squares Method

In the preceding section, a statistical model is hypothesized to represent the relationship between two variables, square footage and sales, in the entire population of Sunflowers Apparel stores. However, as shown in Table 13.1, the data are from only a random sample of stores. If certain assumptions are valid (see Section 13.4), you can use the sample Y intercept, $b_0$, and the sample slope, $b_1$, as estimates of the respective population parameters, $\beta_0$ and $\beta_1$. Equation (13.2) uses these estimates to form the simple linear regression equation. This straight line is often referred to as the prediction line.

SIMPLE LINEAR REGRESSION EQUATION: THE PREDICTION LINE

The predicted value of Y equals the Y intercept plus the slope times the value of X.

$$\hat{Y}_i = b_0 + b_1 X_i \qquad (13.2)$$

Page 6: chap 13

where

$\hat{Y}_i$ = predicted value of Y for observation i
$X_i$ = value of X for observation i
$b_0$ = sample Y intercept
$b_1$ = sample slope

Equation (13.2) requires the determination of two regression coefficients: $b_0$ (the sample Y intercept) and $b_1$ (the sample slope). The most common approach to finding $b_0$ and $b_1$ is the method of least squares. This method minimizes the sum of the squared differences between the actual values ($Y_i$) and the predicted values ($\hat{Y}_i$) using the simple linear regression equation [that is, the prediction line; see Equation (13.2)]. This sum of squared differences is equal to

$$\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$$

Because $\hat{Y}_i = b_0 + b_1 X_i$,

$$\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} [Y_i - (b_0 + b_1 X_i)]^2$$

Because this equation has two unknowns, $b_0$ and $b_1$, the sum of squared differences depends on the sample Y intercept, $b_0$, and the sample slope, $b_1$. The least-squares method determines the values of $b_0$ and $b_1$ that minimize the sum of squared differences. Any values for $b_0$ and $b_1$ other than those determined by the least-squares method result in a greater sum of squared differences between the actual values ($Y_i$) and the predicted values ($\hat{Y}_i$). In this book, Microsoft Excel is used to perform the computations involved in the least-squares method. For the data in Table 13.1, Figure 13.4 presents results from Microsoft Excel.

FIGURE 13.4 Microsoft Excel results for the Sunflowers Apparel data. See Section E13.1 to create this.
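The claim that any other pair of coefficients gives a greater sum of squared differences can be checked numerically. The sketch below is an illustration in Python rather than the Excel procedure the book uses; it takes the 14 observations from Table 13.1 and the coefficients the chapter reports for these data (0.9645 and 1.6699), and compares them with two arbitrarily chosen alternatives. The function name and the alternative values are assumptions made for the example.

```python
# Sunflowers Apparel sample (Table 13.1): store size in thousands of square
# feet and annual sales in millions of dollars.
square_feet = [1.7, 1.6, 2.8, 5.6, 1.3, 2.2, 1.3, 1.1, 3.2, 1.5, 5.2, 4.6, 5.8, 3.0]
annual_sales = [3.7, 3.9, 6.7, 9.5, 3.4, 5.6, 3.7, 2.7, 5.5, 2.9, 10.7, 7.6, 11.8, 4.1]


def sum_squared_differences(b0, b1, xs, ys):
    """Sum over all observations of [Y - (b0 + b1*X)] squared."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


# Coefficients reported in the chapter for these data (rounded to 4 decimals).
sse_least_squares = sum_squared_differences(0.9645, 1.6699, square_feet, annual_sales)

# Two arbitrary alternatives, chosen only to illustrate the comparison.
sse_steeper_slope = sum_squared_differences(0.9645, 1.8, square_feet, annual_sales)
sse_higher_intercept = sum_squared_differences(1.5, 1.6699, square_feet, annual_sales)

print(f"least-squares coefficients: {sse_least_squares:.4f}")
print(f"slope changed to 1.8:       {sse_steeper_slope:.4f}")
print(f"intercept changed to 1.5:   {sse_higher_intercept:.4f}")
```

Both alternatives produce a larger sum of squared differences than the least-squares coefficients, as the text states.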


Page 7: chap 13

To understand how the results are computed, many of the computations involved are illustrated in Examples 13.3 and 13.4 on pages 520-521 and 526-527. In Figure 13.4, observe that $b_0 = 0.9645$ and $b_1 = 1.6699$. Thus, the prediction line [see Equation (13.2) on page 515] for these data is

$$\hat{Y}_i = 0.9645 + 1.6699 X_i$$

The slope, $b_1$, is +1.6699. This means that for each increase of 1 unit in X, the mean value of Y is estimated to increase by 1.6699 units. In other words, for each increase of 1.0 thousand square feet in the size of the store, the mean annual sales are estimated to increase by 1.6699 millions of dollars. Thus, the slope represents the portion of the annual sales that are estimated to vary according to the size of the store.

The Y intercept, $b_0$, is +0.9645. The Y intercept represents the mean value of Y when X equals 0. Because the square footage of the store cannot be 0, this Y intercept has no practical interpretation. Also, the Y intercept for this example is outside the range of the observed values of the X variable, and therefore interpretations of the value of $b_0$ should be made cautiously. Figure 13.5 displays the actual observations and the prediction line. To illustrate a situation in which there is a direct interpretation for the Y intercept, $b_0$, see Example 13.1.

EXAMPLE 13.1 INTERPRETING THE Y INTERCEPT, b0, AND THE SLOPE, b1

A statistics professor wants to use the number of hours a student studies for a statistics final exam (X) to predict the final exam score (Y). A regression model was fit based on data collected for a class during the previous semester, with the following results:

$$\hat{Y}_i = 35.0 + 3 X_i$$

What is the interpretation of the Y intercept, $b_0$, and the slope, $b_1$?

SOLUTION The Y intercept $b_0 = 35.0$ indicates that when the student does not study for the final exam, the mean final exam score is 35.0. The slope $b_1 = 3$ indicates that for each increase of one hour in studying time, the mean change in the final exam score is predicted to be +3.0. In other words, the final exam score is predicted to increase by 3 points for each one-hour increase in studying time.

FIGURE 13.5 Microsoft Excel scatter plot and prediction line for Sunflowers Apparel data. See Section E13.2 to create this.
[Chart title: Scatter Diagram for Site Selection; fitted line shown as y = 1.6699x + 0.9645]

Page 8: chap 13

VISUAL EXPLORATIONS Exploring Simple Linear Regression Coefficients

Use the Visual Explorations Simple Linear Regression procedure to produce a prediction line that is as close as possible to the prediction line defined by the least-squares solution. Open the add-in workbook and select Visual Explorations → Simple Linear Regression (Excel 97-2003) or Add-ins → Visual Explorations → Simple Linear Regression (Excel 2007). (See Section E1.6 to learn about using add-ins.)

When a scatter plot of the Sunflowers Apparel data of Table 13.1 on page 515 with an initial prediction line appears, click the spinner buttons to change the values for $b_1$, the slope of the prediction line, and $b_0$, the Y intercept of the prediction line.

Try to produce a prediction line that is as close as possible to the prediction line defined by the least-squares estimates, using the chart display and the Difference from Target SSE value as feedback (see page 525 for an explanation of SSE). Click Finish when you are done with this exploration.

At any time, click Reset to reset the $b_1$ and $b_0$ values, Help for more information, or Solution to reveal the prediction line defined by the least-squares method.

Using Your Own Regression Data

To use Visual Explorations to find a prediction line for your own data, open the add-in workbook and select Visual Explorations → Simple Linear Regression with your worksheet data (97-2003) or Add-ins → Visual Explorations → Simple Linear Regression with your worksheet data (2007). In the procedure's dialog box, enter your Y variable cell range as the Y Variable Cell Range and your X variable cell range as the X Variable Cell Range. Click First cells in both ranges contain a label, enter a title as the Title, and click OK. When the scatter plot with an initial prediction line appears, use the instructions in the first part of this section to try to produce the prediction line defined by the least-squares method.

[Dialog box fields: Y Variable Cell Range; X Variable Cell Range; First cells in both ranges contain a label; Output options]

Page 9: chap 13

Return to the Using Statistics scenario concerning the Sunflowers Apparel stores. Example 13.2 illustrates how you use the prediction equation to predict the mean annual sales.

EXAMPLE 13.2 PREDICTING MEAN ANNUAL SALES, BASED ON SQUARE FOOTAGE

Use the prediction line to predict the mean annual sales for a store with 4,000 square feet.

SOLUTION You can determine the predicted value by substituting X = 4 (thousands of square feet) into the simple linear regression equation:

$$\hat{Y}_i = 0.9645 + 1.6699 X_i$$

$$\hat{Y}_i = 0.9645 + 1.6699(4) = 7.644 \text{ or } \$7{,}644{,}000$$

Thus, the predicted mean annual sales of a store with 4,000 square feet is $7,644,000.

Predictions in Regression Analysis: Interpolation Versus Extrapolation

When using a regression model for prediction purposes, you need to consider only the relevant range of the independent variable in making predictions. This relevant range includes all values from the smallest to the largest X used in developing the regression model. Hence, when predicting Y for a given value of X, you can interpolate within this relevant range of the X values, but you should not extrapolate beyond the range of X values. When you use the square footage to predict annual sales, the square footage (in thousands of square feet) varies from 1.1 to 5.8 (see Table 13.1 on page 515). Therefore, you should predict annual sales only for stores whose size is between 1.1 and 5.8 thousands of square feet. Any prediction of annual sales for stores outside this range assumes that the observed relationship between sales and store size for store sizes from 1.1 to 5.8 thousand square feet is the same as for stores outside this range. For example, you cannot extrapolate the linear relationship beyond 5,800 square feet in Example 13.2. It would be improper to use the prediction line to forecast the sales for a new store containing 8,000 square feet. It is quite possible that store size has a point of diminishing returns. If that is true, as square footage increases beyond 5,800 square feet, the effect on sales might become smaller and smaller.
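One way to make the interpolation rule operational is to have a prediction routine refuse values of X outside the relevant range. The following is a minimal sketch of that idea, using the prediction line and the 1.1 to 5.8 range stated above; the function name and the decision to raise an error (rather than, say, return a warning) are assumptions made for illustration.

```python
def predict_within_relevant_range(x, b0=0.9645, b1=1.6699, x_min=1.1, x_max=5.8):
    """Predict mean annual sales (millions of dollars) for a store of size x
    (thousands of square feet), but only inside the relevant range of X.

    The defaults are the Sunflowers Apparel prediction line and the smallest
    and largest store sizes in the sample.
    """
    if not (x_min <= x <= x_max):
        # Extrapolation: the fitted line was never observed in this region.
        raise ValueError(
            f"X = {x} is outside the relevant range [{x_min}, {x_max}]; "
            "the prediction line should not be extrapolated."
        )
    return b0 + b1 * x


# Interpolation: 4,000 square feet lies inside the relevant range.
inside = predict_within_relevant_range(4.0)
print(f"Predicted mean annual sales at X = 4.0: {inside:.3f} million dollars")

# Extrapolation: 8,000 square feet lies outside the range and is rejected.
try:
    predict_within_relevant_range(8.0)
except ValueError as error:
    print("Refused:", error)
```

The first call reproduces the prediction from Example 13.2; the second mirrors the 8,000-square-foot store that the text says should not be forecast with this line.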

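Computing the Y Intercept, b0, and the Slope, b1

For small data sets, you can use a hand calculator to compute the least-squares regression coefficients. Equations (13.3) and (13.4) give the values of $b_0$ and $b_1$, which minimize

$$\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} [Y_i - (b_0 + b_1 X_i)]^2$$

COMPUTATIONAL FORMULA FOR THE SLOPE, b1

$$b_1 = \frac{SSXY}{SSX} \qquad (13.3)$$

where

$$SSXY = \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y}) = \sum_{i=1}^{n} X_i Y_i - \frac{\left(\sum_{i=1}^{n} X_i\right)\left(\sum_{i=1}^{n} Y_i\right)}{n}$$

$$SSX = \sum_{i=1}^{n} (X_i - \bar{X})^2 = \sum_{i=1}^{n} X_i^2 - \frac{\left(\sum_{i=1}^{n} X_i\right)^2}{n}$$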

Page 10: chap 13

COMPUTATIONAL FORMULA FOR THE Y INTERCEPT, b0

$$b_0 = \bar{Y} - b_1 \bar{X} \qquad (13.4)$$

where

$$\bar{Y} = \frac{\sum_{i=1}^{n} Y_i}{n} \qquad \bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$$

EXAMPLE 13.3 COMPUTING THE Y INTERCEPT, b0, AND THE SLOPE, b1

Compute the Y intercept, $b_0$, and the slope, $b_1$, for the Sunflowers Apparel data.

SOLUTION Examining Equations (13.3) and (13.4), you see that five quantities must be calculated to determine $b_1$ and $b_0$. These are n, the sample size; $\sum X_i$, the sum of the X values; $\sum Y_i$, the sum of the Y values; $\sum X_i^2$, the sum of the squared X values; and $\sum X_i Y_i$, the sum of the product of X and Y. For the Sunflowers Apparel data, the number of square feet is used to predict the annual sales in a store. Table 13.2 presents the computations of the various sums needed for the site selection problem, plus $\sum Y_i^2$, the sum of the squared Y values that will be used to compute SST in Section 13.3.

TABLE 13.2 Computations for the Sunflowers Apparel Data

Store    Square Feet (X)   Annual Sales (Y)     X²        Y²        XY
  1           1.7               3.7            2.89     13.69      6.29
  2           1.6               3.9            2.56     15.21      6.24
  3           2.8               6.7            7.84     44.89     18.76
  4           5.6               9.5           31.36     90.25     53.20
  5           1.3               3.4            1.69     11.56      4.42
  6           2.2               5.6            4.84     31.36     12.32
  7           1.3               3.7            1.69     13.69      4.81
  8           1.1               2.7            1.21      7.29      2.97
  9           3.2               5.5           10.24     30.25     17.60
 10           1.5               2.9            2.25      8.41      4.35
 11           5.2              10.7           27.04    114.49     55.64
 12           4.6               7.6           21.16     57.76     34.96
 13           5.8              11.8           33.64    139.24     68.44
 14           3.0               4.1            9.00     16.81     12.30
Totals       40.9              81.8          157.41    594.90    302.30

Page 11: chap 13

Using Equations (13.3) and (13.4), you can compute the values of $b_0$ and $b_1$:

$$b_1 = \frac{SSXY}{SSX}$$

$$SSXY = \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y}) = \sum_{i=1}^{n} X_i Y_i - \frac{\left(\sum_{i=1}^{n} X_i\right)\left(\sum_{i=1}^{n} Y_i\right)}{n}$$

$$SSXY = 302.3 - \frac{(40.9)(81.8)}{14} = 302.3 - 238.97285 = 63.32715$$

$$SSX = \sum_{i=1}^{n} (X_i - \bar{X})^2 = \sum_{i=1}^{n} X_i^2 - \frac{\left(\sum_{i=1}^{n} X_i\right)^2}{n}$$

$$SSX = 157.41 - \frac{(40.9)^2}{14} = 157.41 - 119.48642 = 37.92358$$

so that

$$b_1 = \frac{63.32715}{37.92358} = 1.6699$$

and

$$b_0 = \bar{Y} - b_1 \bar{X}$$

$$\bar{Y} = \frac{\sum_{i=1}^{n} Y_i}{n} = \frac{81.8}{14} = 5.842857 \qquad \bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{40.9}{14} = 2.92143$$

$$b_0 = 5.842857 - (1.6699)(2.92143) = 0.9645$$
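The hand computation above can be reproduced with a short script. This is a sketch in Python, not the Excel procedure used in the book; it applies the computational formulas of Equations (13.3) and (13.4) to the Sunflowers Apparel sample. The function and variable names are chosen for this illustration.

```python
# Sunflowers Apparel sample: X in thousands of square feet, Y in millions of dollars.
square_feet = [1.7, 1.6, 2.8, 5.6, 1.3, 2.2, 1.3, 1.1, 3.2, 1.5, 5.2, 4.6, 5.8, 3.0]
annual_sales = [3.7, 3.9, 6.7, 9.5, 3.4, 5.6, 3.7, 2.7, 5.5, 2.9, 10.7, 7.6, 11.8, 4.1]


def least_squares_coefficients(xs, ys):
    """Return (b0, b1) using the computational formulas (13.3) and (13.4)."""
    n = len(xs)
    sum_x = sum(xs)                               # sum of the X values
    sum_y = sum(ys)                               # sum of the Y values
    sum_x2 = sum(x * x for x in xs)               # sum of the squared X values
    sum_xy = sum(x * y for x, y in zip(xs, ys))   # sum of the products of X and Y

    ssxy = sum_xy - (sum_x * sum_y) / n
    ssx = sum_x2 - (sum_x ** 2) / n

    b1 = ssxy / ssx                       # Equation (13.3)
    b0 = sum_y / n - b1 * (sum_x / n)     # Equation (13.4)
    return b0, b1


b0, b1 = least_squares_coefficients(square_feet, annual_sales)
print(f"b0 = {b0:.4f}, b1 = {b1:.4f}")
```

Because the script carries full precision through every step, its results can differ from the hand computation in the last displayed decimal place.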

Page 12: chap 13

Learning the Basics

13.1 Fitting a straight line to a set of data yields the following prediction line:

$$\hat{Y}_i = 2 + 5 X_i$$

a. Interpret the meaning of the Y intercept, $b_0$.
b. Interpret the meaning of the slope, $b_1$.
c. Predict the mean value of Y for X = 3.

13.2 If the values of X in Problem 13.1 range from 2 to 25, should you use this model to predict the mean value of Y when X equals
a. 3?
b. -3?
c. 0?
d. 24?

13.3 Fitting a straight line to a set of data yields the following prediction line:

$$\hat{Y}_i = 16 - 0.5 X_i$$

a. Interpret the meaning of the Y intercept, $b_0$.
b. Interpret the meaning of the slope, $b_1$.
c. Predict the mean value of Y for X = 6.

Applying the Concepts

13.4 The marketing manager of a large supermarket chain would like to use shelf space to predict the sales of pet food. A random sample of 12 equal-sized stores is selected, with the following results (stored in an accompanying data file):

Store   Shelf Space (X) (Feet)   Weekly Sales (Y) ($)
  1              5                      160
  2              5                      220
  3              5                      140
  4             10                      190
  5             10                      240
  6             10                      260
  7             15                      230
  8             15                      270
  9             15                      280
 10             20                      260
 11             20                      290
 12             20                      310

a. Construct a scatter plot.
For these data, $b_0 = 145$ and $b_1 = 7.4$.
b. Interpret the meaning of the slope, $b_1$, in this problem.
c. Predict the mean weekly sales (in hundreds of dollars) of pet food for stores with 8 feet of shelf space for pet food.

13.5 Circulation is the lifeblood of the publishing business. The larger the sales of a magazine, the more it can charge advertisers. Recently, a circulation gap has appeared between the publishers' reports of magazines' newsstand sales and subsequent audits by the Audit Bureau of Circulations. The data in an accompanying file represent the reported and audited newsstand sales (in thousands) in 2001 for the following 10 magazines:

Magazine       Reported (X)   Audited (Y)
YM                621.0          299.6
CosmoGirl         359.7          207.7
Rosie             530.0          325.0
Playboy           492.1          336.3
Esquire            70.5           48.6
TeenPeople        567.0          400.3
More              125.5           91.2
Spin               50.6           39.1
Vogue             353.3          268.6
Elle              263.6          214.3

Source: Extracted from M. Rose, "In Fight for Ads, Publishers Overstate Their Sales," The Wall Street Journal, August 6, 2003, pp. A1, A10.

a. Construct a scatter plot.
For these data, $b_0 = 26.724$ and $b_1 = 0.5719$.
b. Interpret the meaning of the slope, $b_1$, in this problem.
c. Predict the mean audited newsstand sales for a magazine that reports newsstand sales of 400,000.

13.6 The owner of a moving company typically has his most experienced manager predict the total number of labor hours that will be required to complete an upcoming move. This approach has proved useful in the past, but he would like to be able to develop a more accurate method of predicting labor hours by using the number of cubic feet moved. In a preliminary effort to provide a more accurate method, he has collected data for 36 moves in which the origin and destination were within the borough of Manhattan in New York City and in which the travel time was an insignificant portion of the hours worked. The data are stored in an accompanying file.

Page 13: chap 13

a. Construct a scatter plot.
b. Assuming a linear relationship, use the least-squares method to find the regression coefficients $b_0$ and $b_1$.
c. Interpret the meaning of the slope, $b_1$, in this problem.
d. Predict the mean labor hours for moving 500 cubic feet.

13.7 A large mail-order house believes that there is a linear relationship between the weight of the mail it receives and the number of orders to be filled. It would like to investigate the relationship in order to predict the number of orders, based on the weight of the mail. From an operational perspective, knowledge of the number of orders will help in the planning of the order-fulfillment process. A sample of 25 mail shipments is selected that range from 200 to 700 pounds. The results (stored in an accompanying data file) are as follows:

Weight of Mail (Pounds)   Orders (Thousands)
        216                     6.1
        283                     9.1
        237                     7.2
        203                     7.5
        259                     6.9
        374                    11.5
        342                    10.3
        301                     9.5
        365                     9.2
        384                    10.6
        404                    12.5
        426                    12.9
        482                    14.5
        432                    13.6
        409                    12.8
        553                    16.5
        572                    17.1
        506                    15.0
        528                    16.2
        501                    15.8
        628                    19.0
        677                    19.4
        602                    19.1
        630                    18.0
        652                    20.2

a. Construct a scatter plot.
b. Assuming a linear relationship, use the least-squares method to find the regression coefficients $b_0$ and $b_1$.
c. Interpret the meaning of the slope, $b_1$, in this problem.
d. Predict the mean number of orders when the weight of the mail is 500 pounds.

13.8 The value of a sports franchise is directly related to the amount of revenue that a franchise can generate. The data in an accompanying file represent the value in 2005 (in millions of dollars) and the annual revenue (in millions of dollars) for 30 baseball franchises. Suppose you want to develop a simple linear regression model to predict franchise value based on annual revenue generated.
a. Construct a scatter plot.
b. Use the least-squares method to find the regression coefficients $b_0$ and $b_1$.
c. Interpret the meaning of $b_0$ and $b_1$ in this problem.
d. Predict the mean value of a baseball franchise that generates $150 million of annual revenue.

13.9 An agent for a residential real estate company in a large city would like to be able to predict the monthly rental cost for apartments, based on the size of the apartment, as defined by square footage. A sample of 25 apartments (stored in an accompanying data file) in a particular residential neighborhood was selected, and the information gathered revealed the following:

Apartment   Monthly Rent ($)   Size (Square Feet)
    1            950                 850
    2          1,600               1,450
    3          1,200               1,085
    4          1,500               1,232
    5            950                 718
    6          1,700               1,485
    7          1,650               1,136
    8            935                 726
    9            875                 700
   10          1,150                 956
   11          1,400               1,100
   12          1,650               1,285
   13          2,300               1,985
   14          1,800               1,369
   15          1,400               1,175
   16          1,450               1,225
   17          1,100               1,245
   18          1,700               1,259
   19          1,200               1,150
   20          1,150                 896
   21          1,600               1,361
   22          1,650               1,040
   23          1,200                 755
   24            800               1,000
   25          1,750               1,200

a. Construct a scatter plot.
b. Use the least-squares method to find the regression coefficients $b_0$ and $b_1$.
c. Interpret the meaning of $b_0$ and $b_1$ in this problem.
d. Predict the mean monthly rent for an apartment that has 1,000 square feet.
e. Why would it not be appropriate to use the model to predict the monthly rent for apartments that have 500 square feet?
f. Your friends Jim and Jennifer are considering signing a lease for an apartment in this residential neighborhood. They are trying to decide between two apartments, one with 1,000 square feet for a monthly rent of $1,275 and the other with 1,200 square feet for a monthly rent of $1,425. What would you recommend to them based on (a) through (d)?

13.10 The data in an accompanying file provide measurements on the hardness and tensile strength for 35 specimens of die-cast aluminum. It is believed that hardness (measured in Rockwell E units) can be used to predict tensile strength (measured in thousands of pounds per square inch).
a. Construct a scatter plot.
b. Assuming a linear relationship, use the least-squares method to find the regression coefficients $b_0$ and $b_1$.
c. Interpret the meaning of the slope, $b_1$, in this problem.
d. Predict the mean tensile strength for die-cast aluminum that has a hardness of 30 Rockwell E units.

Page 14: chap 13

524 CHAPTER THIRTEEN Simple Linear Regression

13.3 MEASURES OF VARIATIONWhen using the least-squares method to determine the regression coefficients for a set of dalyou need to compute three important measures of variation. The first measure, the total sum rsquares (,S,SZ), is a measure of variation of the { values around their mean, l.In a regressiranalysis, the total variation or total sum ofsquares is subdivided into explained variation arunexplained variation. The explained variation or regression sum of squares (SSR) is duethe relationship between X and Y, and the unexplained variation, or error sum of squan(^SSf) is due to factors other than the relationship between X and Y. Figure 13.6 shows therdifferent measures of variation.

FIGURE 13.6Measures of var iat ion Error sum

of squares

Computing the Sum of Squares

The regression sum of squares (SSR) is based on the difference between )2, (the predicted valuof )'from the prediction line ) and F (the mean value of If . The error sum of squares (SSXrepresents the part ofthe variation in Ithat is not explained by the regression. It is based onthdifference between Y,and, i,. Equations (13.5), (13.6), (13.7),and (13.8) define these measursof variation.

MEASURES OF VARIATION IN REGRESSION

The total sum of squares is equal to the regression sum of squares plus the error sum of squares.

$$SST = SSR + SSE \qquad (13.5)$$

TOTAL SUM OF SQUARES (SST)

The total sum of squares (SST) is equal to the sum of the squared differences between each observed Y value and $\bar{Y}$, the mean value of Y.

$$SST = \text{Total sum of squares} = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 \qquad (13.6)$$


REGRESSION SUM OF SQUARES (SSR)

The regression sum of squares (SSR) is equal to the sum of the squared differences between the predicted value of Y and $\bar{Y}$, the mean value of Y.

$$SSR = \text{Explained variation or regression sum of squares} = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 \qquad (13.7)$$

ERROR SUM OF SQUARES (SSE)

The error sum of squares (SSE) is equal to the sum of the squared differences between the observed value of Y and the predicted value of Y.

$$SSE = \text{Unexplained variation or error sum of squares} = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 \qquad (13.8)$$

Figure 13.7 shows the sum of squares area of the worksheet containing the Microsoft Excel results for the Sunflowers Apparel data. The total variation, SST, is equal to 116.9543. This amount is subdivided into the sum of squares explained by the regression (SSR), equal to 105.7476, and the sum of squares unexplained by the regression (SSE), equal to 11.2067. From Equation (13.5) on page 524:

$$SST = SSR + SSE$$

$$116.9543 = 105.7476 + 11.2067$$
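The decomposition in Equation (13.5) can be checked numerically. The following is a minimal Python sketch, assuming a small set of made-up (X, Y) values rather than the Sunflowers Apparel data; the function name `sums_of_squares` is hypothetical and is not part of the text or of Excel.

```python
def sums_of_squares(x, y):
    """Fit a least-squares line and return (SST, SSR, SSE)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Least-squares slope and intercept
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    y_hat = [b0 + b1 * xi for xi in x]
    # Equations (13.6), (13.7), and (13.8)
    sst = sum((yi - y_bar) ** 2 for yi in y)
    ssr = sum((yh - y_bar) ** 2 for yh in y_hat)
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
    return sst, ssr, sse


# Hypothetical illustrative data (NOT the Sunflowers Apparel data)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

sst, ssr, sse = sums_of_squares(x, y)
print("SST:", sst)
print("SSR + SSE:", ssr + sse)
```

Up to floating-point rounding, the two printed values agree, which is the identity SST = SSR + SSE for a least-squares fit that includes an intercept.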

FIGURE 13.7 Microsoft Excel sum of squares for the Sunflowers Apparel data. See Section E13.1 to create the worksheet that contains this area.

ANOVA
              df    SS         MS         F          Significance F
Regression     1    105.7476   105.7476   113.2335   0.0000
Residual      12     11.2067     0.9339
Total         13    116.9543

              Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept     0.9645         0.5262            1.8329   0.0917    -0.1820     2.1110
Square Feet   1.6699         0.1569           10.6411   0.0000     1.3280     2.0118

In a data set that has a large number of significant digits, the results of a regression analysis are sometimes displayed using a numerical format known as scientific notation. This type of format is used to display very small or very large values. The number after the letter E represents the number of digits that the decimal point needs to be moved to the left (for a negative number) or to the right (for a positive number). For example, the number 3.7431E+02 means that the decimal point should be moved two places to the right, producing the number 374.31. The number 3.7431E-02 means that the decimal point should be moved two places to the left, producing the number 0.037431. When scientific notation is used, fewer significant digits are usually displayed and the numbers may appear to be rounded.
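As a small illustration of the E notation described above, the following Python sketch converts the two example values in both directions. It uses only Python's built-in `float` parsing and format specifiers; nothing here is specific to Excel.

```python
# Parse the two scientific-notation examples from the text
large = float("3.7431E+02")   # decimal point moves two places to the right
small = float("3.7431E-02")   # decimal point moves two places to the left
print(large)
print(small)

# Format an ordinary number in scientific notation with 4 decimal digits
formatted = f"{374.31:.4E}"
print(formatted)
```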


The Coefficient of Determination

By themselves, SSR, SSE, and SST provide little information. However, the ratio of the regression sum of squares (SSR) to the total sum of squares (SST) measures the proportion of variation in Y that is explained by the independent variable X in the regression model. This ratio is called the coefficient of determination, $r^2$, and is defined in Equation (13.9).

COEFFICIENT OF DETERMINATION

The coefficient of determination is equal to the regression sum of squares (that is, explained variation) divided by the total sum of squares (that is, total variation).

$$r^2 = \frac{\text{Regression sum of squares}}{\text{Total sum of squares}} = \frac{SSR}{SST} \qquad (13.9)$$

The coefficient of determination measures the proportion of variation in Y that is explained by the independent variable X in the regression model. For the Sunflowers Apparel data, with SSR = 105.7476, SSE = 11.2067, and SST = 116.9543,

$$r^2 = \frac{105.7476}{116.9543} = 0.9042$$

Therefore, 90.42% of the variation in annual sales is explained by the variability in the size of the store, as measured by the square footage. This large $r^2$ indicates a strong positive linear relationship between the two variables because the use of a regression model has reduced the variability in predicting annual sales by 90.42%. Only 9.58% of the sample variability in annual sales is due to factors other than what is accounted for by the linear regression model that uses square footage.
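The arithmetic behind Equation (13.9) can be sketched in a few lines of Python, using the three sums of squares quoted above for the Sunflowers Apparel data. The variable names are illustrative only.

```python
# Sums of squares reported in the text for the Sunflowers Apparel data
ssr = 105.7476   # regression (explained) sum of squares
sse = 11.2067    # error (unexplained) sum of squares
sst = 116.9543   # total sum of squares

r2 = ssr / sst                 # Equation (13.9)
r2_from_sse = 1 - sse / sst    # equivalent form, because SST = SSR + SSE

print(round(r2, 4))            # proportion of variation explained
print(round(1 - r2, 4))        # proportion left unexplained
```

The second form is convenient when only SSE and SST are available; it gives the same value because of Equation (13.5).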

Figure 13.8 presents the coefficient of determination portion of the Microsoft Excel results for the Sunflowers Apparel data.

FIGURE 13.8 Partial Microsoft Excel regression results for the Sunflowers Apparel data. See Section E13.1 to create the worksheet that contains this area.

EXAMPLE 13.4 COMPUTING THE COEFFICIENT OF DETERMINATION

Compute the coefficient of determination, $r^2$, for the Sunflowers Apparel data.

SOLUTION You can compute SST, SSR, and SSE, which are defined in Equations (13.6), (13.7), and (13.8) on pages 524-525, by using Equations (13.10), (13.11), and (13.12).

COMPUTATIONAL FORMULA FOR SST

$$SST = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum_{i=1}^{n} Y_i^2 - \frac{\left(\sum_{i=1}^{n} Y_i\right)^2}{n} \qquad (13.10)$$

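COMPUTATIONAL FORMULA FOR SSR

$$SSR = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 = b_0 \sum_{i=1}^{n} Y_i + b_1 \sum_{i=1}^{n} X_i Y_i - \frac{\left(\sum_{i=1}^{n} Y_i\right)^2}{n} \qquad (13.11)$$

COMPUTATIONAL FORMULA FOR SSE

$$SSE = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} Y_i^2 - b_0 \sum_{i=1}^{n} Y_i - b_1 \sum_{i=1}^{n} X_i Y_i \qquad (13.12)$$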

Using the summary results from Table 13.2 on page 520,

$$SST = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum_{i=1}^{n} Y_i^2 - \frac{\left(\sum_{i=1}^{n} Y_i\right)^2}{n} = 594.9 - \frac{(81.8)^2}{14} = 594.9 - 477.94571 = 116.95429$$

$$SSR = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 = b_0 \sum_{i=1}^{n} Y_i + b_1 \sum_{i=1}^{n} X_i Y_i - \frac{\left(\sum_{i=1}^{n} Y_i\right)^2}{n} = (0.964478)(81.8) + (1.66986)(302.3) - \frac{(81.8)^2}{14} = 105.74726$$

$$SSE = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} Y_i^2 - b_0 \sum_{i=1}^{n} Y_i - b_1 \sum_{i=1}^{n} X_i Y_i = 594.9 - (0.964478)(81.8) - (1.66986)(302.3) = 11.2067$$

Therefore,

$$r^2 = \frac{105.74726}{116.95429} = 0.9042$$
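The computational formulas (13.10) through (13.12) need only a handful of summary totals rather than the raw data. The following Python sketch repeats the Example 13.4 arithmetic using the totals and coefficients quoted in the solution; the variable names are illustrative assumptions.

```python
# Summary results quoted in the solution (from Table 13.2)
n = 14
sum_y = 81.8       # sum of Y_i
sum_y2 = 594.9     # sum of Y_i squared
sum_xy = 302.3     # sum of X_i * Y_i
b0 = 0.964478      # Y intercept
b1 = 1.66986       # slope

correction = sum_y ** 2 / n                       # (sum of Y)^2 / n
sst = sum_y2 - correction                         # Equation (13.10)
ssr = b0 * sum_y + b1 * sum_xy - correction       # Equation (13.11)
sse = sum_y2 - b0 * sum_y - b1 * sum_xy           # Equation (13.12)
r2 = ssr / sst                                    # Equation (13.9)

print(f"SST = {sst:.5f}")
print(f"SSR = {ssr:.5f}")
print(f"SSE = {sse:.4f}")
print(f"r2  = {r2:.4f}")
```

Because $b_0$ and $b_1$ are entered here with only six significant digits, SSE can differ from the reported 11.2067 in the fourth decimal place; SST, SSR, and $r^2$ match the values in the solution.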


Standard Error of the Estimate

Although the least-squares method results in the line that fits the data with the minimum amount of error, unless all the observed data points fall on a straight line, the prediction line is not a perfect predictor. Just as all data values cannot be expected to be exactly equal to their mean, neither can they be expected to fall exactly on the prediction line. An important statistic, called the standard error of the estimate, measures the variability of the actual Y values from the predicted Y values in the same way that the standard deviation in Chapter 3 measures the variability of each value around the sample mean. In other words, the standard error of the estimate is the standard deviation around the prediction line, whereas the standard deviation in Chapter 3 is the standard deviation around the sample mean.

Figure 13.5 on page 517 illustrates the variability around the prediction line for the Sunflowers Apparel data. Observe that although many of the actual values of Y fall near the prediction line, none of the values are exactly on the line.

The standard error of the estimate, represented by the symbol $S_{YX}$, is defined in Equation (13.13).

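STANDARD ERROR OF THE ESTIMATE

$$S_{YX} = \sqrt{\frac{SSE}{n - 2}} = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{n - 2}} \qquad (13.13)$$

where

$Y_i$ = actual value of Y for a given $X_i$

$\hat{Y}_i$ = predicted value of Y for a given $X_i$

SSE = error sum of squares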

From Equation (13.8) and Figure 13.4 on page 516, SSE = 11.2067. Thus,

$$S_{YX} = \sqrt{\frac{11.2067}{14 - 2}} = 0.9664$$

This standard error of the estimate, equal to 0.9664 millions of dollars (that is, $966,400), is labeled Standard Error in the Microsoft Excel results shown in Figure 13.8 on page 526. The standard error of the estimate represents a measure of the variation around the prediction line. It is measured in the same units as the dependent variable Y. The interpretation of the standard error of the estimate is similar to that of the standard deviation. Just as the standard deviation measures variability around the mean, the standard error of the estimate measures variability around the prediction line. For Sunflowers Apparel, the typical difference between actual annual sales at a store and the predicted annual sales using the regression equation is approximately $966,400.
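Equation (13.13) is a one-line calculation once SSE is known. The following Python sketch reproduces the Sunflowers Apparel value from the SSE and sample size quoted above; the variable names are illustrative.

```python
import math

sse = 11.2067   # error sum of squares from the text
n = 14          # number of stores in the sample

# Equation (13.13): divide by n - 2 because two coefficients (b0, b1) were estimated
s_yx = math.sqrt(sse / (n - 2))

print(round(s_yx, 4))                 # in millions of dollars
print(round(s_yx * 1_000_000, -2))    # in dollars, rounded to the nearest hundred
```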

Learning the Basics

13.11 How do you interpret a coefficient of determination, $r^2$, equal to 0.80?

13.12 If SSR = 36 and SSE = 4, determine SST and then compute the coefficient of determination, $r^2$, and interpret its meaning.

13.13 If SSR = 66 and SST = 88, compute the coefficient of determination, $r^2$, and interpret its meaning.

13.14 If SSE = 10 and SSR = 30, compute the coefficient of determination, $r^2$, and interpret its meaning.

13.15 If SSR = 120, why is it impossible for SST to equal 110?

Applying the Concepts

13.16 In Problem 13.4 on page 522, the marketing manager used shelf space for pet food to predict weekly sales. For that data, SSR = 20,535 and SST = 30,025.
a. Determine the coefficient of determination, $r^2$, and interpret its meaning.
b. Determine the standard error of the estimate.
c. How useful do you think this regression model is for predicting sales?

13.17 In Problem 13.5 on page 522, you used reported magazine newsstand sales to predict audited sales. For that data, SSR = 130,301.41 and SST = 144,538.64.
a. Determine the coefficient of determination, $r^2$, and interpret its meaning.
b. Determine the standard error of the estimate.
c. How useful do you think this regression model is for predicting audited sales?

13.18 In Problem 13.6 on page 522, an owner of a moving company wanted to predict labor hours, based on the cubic feet moved. Using the results of that problem,
a. determine the coefficient of determination, $r^2$, and interpret its meaning.
b. determine the standard error of the estimate.
c. How useful do you think this regression model is for predicting labor hours?

13.19 In Problem 13.7 on page 523, you used the weight of mail to predict the number of orders received. Using the results of that problem,
a. determine the coefficient of determination, $r^2$, and interpret its meaning.
b. find the standard error of the estimate.
c. How useful do you think this regression model is for predicting the number of orders?

13.20 In Problem 13.8 on page 523, you used annual revenues to predict the value of a baseball franchise. Using the results of that problem,
a. determine the coefficient of determination, $r^2$, and interpret its meaning.
b. determine the standard error of the estimate.
c. How useful do you think this regression model is for predicting the value of a baseball franchise?

13.21 In Problem 13.9 on page 523, an agent for a real estate company wanted to predict the monthly rent for apartments, based on the size of the apartment. Using the results of that problem,
a. determine the coefficient of determination, $r^2$, and interpret its meaning.
b. determine the standard error of the estimate.
c. How useful do you think this regression model is for predicting the monthly rent?

13.22 In Problem 13.10 on page 523, you used hardness to predict the tensile strength of die-cast aluminum. Using the results of that problem,
a. determine the coefficient of determination, $r^2$, and interpret its meaning.
b. find the standard error of the estimate.
c. How useful do you think this regression model is for predicting the tensile strength of die-cast aluminum?

13.4 ASSUMPTIONS

The discussion of hypothesis testing and the analysis of variance emphasized the importance of the assumptions to the validity of any conclusions reached. The assumptions necessary for regression are similar to those of the analysis of variance because both topics fall in the general category of linear models (reference 4).

The four assumptions of regression (known by the acronym LINE) are as follows:

• Linearity
• Independence of errors
• Normality of error
• Equal variance

The first assumption, linearity, states that the relationship between variables is linear. Relationships between variables that are not linear are discussed in Chapter 15.

The second assumption, independence of errors, requires that the errors ($\varepsilon_i$) are independent of one another. This assumption is particularly important when data are collected over a period of time. In such situations, the errors for a specific time period are sometimes correlated with those of the previous time period.


The third assumption, normality, requires that the errors ($\varepsilon_i$) are normally distributed at each value of X. Like the t test and the ANOVA F test, regression analysis is fairly robust against departures from the normality assumption. As long as the distribution of the errors at each level of X is not extremely different from a normal distribution, inferences about $\beta_0$ and $\beta_1$ are not seriously affected.

The fourth assumption, equal variance or homoscedasticity, requires that the variance of the errors ($\varepsilon_i$) is constant for all values of X. In other words, the variability of Y values is the same when X is a low value as when X is a high value. The equal variance assumption is important when making inferences about $\beta_0$ and $\beta_1$. If there are serious departures from this assumption, you can use either data transformations or weighted least-squares methods (see reference 4).

13.5 RESIDUAL ANALYSIS

In Section 13.1, regression analysis was introduced. In Sections 13.2 and 13.3, a regression model was developed using the least-squares approach for the Sunflowers Apparel data. Is this the correct model for these data? Are the assumptions introduced in Section 13.4 valid? In this section, a graphical approach called residual analysis is used to evaluate the assumptions and determine whether the regression model selected is an appropriate model.

The residual or estimated error value, $e_i$, is the difference between the observed ($Y_i$) and predicted ($\hat{Y}_i$) values of the dependent variable for a given value of $X_i$. Graphically, a residual appears on a scatter plot as the vertical distance between an observed value of Y and the prediction line. Equation (13.14) defines the residual.

RESIDUAL

The residual is equal to the difference between the observed value of Y and the predicted value of Y.

$$e_i = Y_i - \hat{Y}_i \qquad (13.14)$$
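Equation (13.14) translates directly into code. The following Python sketch assumes made-up data and made-up coefficients ($b_0 = 1.0$, $b_1 = 2.0$) chosen only so the arithmetic is easy to follow; the helper name `residuals` is hypothetical.

```python
def residuals(x, y, b0, b1):
    """Return e_i = Y_i - (b0 + b1 * X_i) for each observation."""
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]


# Hypothetical data and coefficients, for illustration only
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.5, 6.5, 9.0]
e = residuals(x, y, b0=1.0, b1=2.0)

for xi, ei in zip(x, e):
    position = "above" if ei > 0 else "below" if ei < 0 else "on"
    print(f"X = {xi}: residual = {ei:+.2f} ({position} the prediction line)")
```

A positive residual means the observed value lies above the prediction line, and a negative residual means it lies below.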

Evaluating the Assumptions

Recall from Section 13.4 that the four assumptions of regression (known by the acronym LINE) are linearity, independence, normality, and equal variance.

Linearity To evaluate linearity, you plot the residuals on the vertical axis against the corresponding $X_i$ values of the independent variable on the horizontal axis. If the linear model is appropriate for the data, there is no apparent pattern in this plot. However, if the linear model is not appropriate, there is a relationship between the $X_i$ values and the residuals, $e_i$. You can see such a pattern in Figure 13.9. Panel A shows a situation in which, although there is an increasing trend in Y as X increases, the relationship seems curvilinear because the upward trend decreases for increasing values of X. This quadratic effect is highlighted in Panel B, where there is a clear relationship between $X_i$ and $e_i$. By plotting the residuals, the linear trend of X with Y has been removed, thereby exposing the lack of fit in the simple linear model. Thus, a quadratic model is a better fit and should be used in place of the simple linear model. (See Section 15.1 for further discussion of fitting quadratic models.)

To determine whether the simple linear regression model is appropriate, return to the evaluation of the Sunflowers Apparel data. Figure 13.10 provides the predicted and residual values of the response variable (annual sales) computed by Microsoft Excel.

FIGURE 13.9 Studying the appropriateness of the simple linear regression model (Panels A and B)

FIGURE 13.10 Microsoft Excel residual statistics for the Sunflowers Apparel data. See Section E13.3 to create the worksheet that contains this area.

Observation   Predicted Annual Sales   Residuals
 1             3.803239598             -0.103239598
 2             3.636253367              0.263746633
 3             5.640088147              1.059911853
 4            10.31570263              -0.815702635
 5             3.135294672              0.264705328
 6             4.638170757              0.961829243
 7             3.135294672              0.564705328
 8             2.801322208             -0.101322208
 9             6.308033074             -0.808033074
10             3.469267135             -0.569267135
11             9.647757708              1.052242292
12             8.645840318             -1.045840318
13            10.64967515               1.150324852
14             5.974060611             -1.874060611

FIGURE 13.11 Microsoft Excel plot of residuals against the square footage of a store for the Sunflowers Apparel data. See Section E2.12 to create this.

To assess linearity, the residuals are plotted against the independent variable (store size, in thousands of square feet) in Figure 13.11. Although there is widespread scatter in the residual plot, there is no apparent pattern or relationship between the residuals and $X_i$. The residuals appear to be evenly spread above and below 0 for the differing values of X. You can conclude that the linear model is appropriate for the Sunflowers Apparel data.


Independence You can evaluate the assumption of independence of the errors by plotting the residuals in the order or sequence in which the data were collected. Data collected over periods of time sometimes exhibit an autocorrelation effect among successive observations. In these instances, there is a relationship between consecutive residuals. If this relationship exists (which violates the assumption of independence), it is apparent in the plot of the residuals versus the time in which the data were collected. You can also test for autocorrelation by using the Durbin-Watson statistic, which is the subject of Section 13.6. Because the Sunflowers Apparel data were collected during the same time period, you do not need to evaluate the independence assumption.

Normality You can evaluate the assumption of normality in the errors by tallying the residuals into a frequency distribution and displaying the results in a histogram. For the Sunflowers Apparel data, the residuals have been tallied into a frequency distribution in Table 13.3. (There are an insufficient number of values, however, to construct a histogram.) You can also evaluate the normality assumption by comparing the actual versus theoretical values of the residuals or by constructing a normal probability plot of the residuals (see Section 6.3). Figure 13.12 is a normal probability plot of the residuals for the Sunflowers Apparel data.

TABLE 13.3 Frequency Distribution of 14 Residual Values for the Sunflowers Apparel Data

Residuals                        Frequency
-2.25 but less than -1.75            1
-1.75 but less than -1.25            0
-1.25 but less than -0.75            3
-0.75 but less than -0.25            1
-0.25 but less than +0.25            2
+0.25 but less than +0.75            3
+0.75 but less than +1.25            4
Total                               14

FIGURE 13.12 Microsoft Excel normal probability plot of the residuals for the Sunflowers Apparel data (residuals plotted against Z values). See Section E6.2 to create this.

It is difficult to evaluate the normality assumption for a sample of only 14 values, regardless of whether you use a histogram, stem-and-leaf display, box-and-whisker plot, or normal probability plot. You can see from Figure 13.12 that the data do not appear to depart substantially from a normal distribution. The robustness of regression analysis with modest departures from normality enables you to conclude that you should not be overly concerned about departures from this normality assumption in the Sunflowers Apparel data.
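The tally behind Table 13.3 can be sketched in Python. The residuals below are the Figure 13.10 values rounded to four decimal places; the class boundaries follow Table 13.3 (seven classes of width 0.5 starting at -2.25). The helper name `tally` and its parameters are hypothetical.

```python
def tally(values, start=-2.25, width=0.5, n_classes=7):
    """Count values into classes [start + k*width, start + (k+1)*width)."""
    counts = [0] * n_classes
    for v in values:
        k = int((v - start) // width)   # index of the class containing v
        if 0 <= k < n_classes:
            counts[k] += 1
    return counts


# Residuals from Figure 13.10, rounded to four decimals
sunflowers_residuals = [
    -0.1032, 0.2637, 1.0599, -0.8157, 0.2647, 0.9618, 0.5647,
    -0.1013, -0.8080, -0.5693, 1.0522, -1.0458, 1.1503, -1.8741,
]

frequencies = tally(sunflowers_residuals)
for k, count in enumerate(frequencies):
    lower = -2.25 + 0.5 * k
    print(f"{lower:+.2f} but less than {lower + 0.5:+.2f}: {count}")
```

The resulting counts match the Frequency column of Table 13.3. With only 14 values, the tally is a rough guide rather than a formal check of normality.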


Equal Variance You can evaluate the assumption of equal variance from a plot of the residuals with $X_i$. For the Sunflowers Apparel data of Figure 13.11 on page 531, there do not appear to be major differences in the variability of the residuals for different $X_i$ values. Thus, you can conclude that there is no apparent violation in the assumption of equal variance at each level of X.

To examine a case in which the equal variance assumption is violated, observe Figure 13.13, which is a plot of the residuals with $X_i$ for a hypothetical set of data. In this plot, the variability of the residuals increases dramatically as X increases, demonstrating the lack of homogeneity in the variances of $Y_i$ at each level of X. For these data, the equal variance assumption is invalid.
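The text evaluates equal variance graphically. As a rough numerical companion to the residual plot, the following Python sketch compares the spread of the residuals for the lower half of the X values with the spread for the upper half. This is an informal heuristic assumed here for illustration, not a method from the text and not a formal test; the data and the helper name `spread_ratio` are hypothetical.

```python
from statistics import variance


def spread_ratio(x, residuals):
    """Ratio of residual variance in the upper half of X to the lower half."""
    pairs = sorted(zip(x, residuals))        # order observations by X
    half = len(pairs) // 2
    low = [e for _, e in pairs[:half]]       # residuals at the smaller X values
    high = [e for _, e in pairs[-half:]]     # residuals at the larger X values
    return variance(high) / variance(low)


# Hypothetical residuals whose spread grows as X increases
x = [1, 2, 3, 4, 5, 6, 7, 8]
e = [0.1, -0.1, 0.1, -0.1, 2.0, -2.0, 2.0, -2.0]

ratio = spread_ratio(x, e)
print(f"variance ratio (high X / low X): {ratio:.1f}")
```

A ratio far from 1 suggests the kind of fan-shaped pattern described for Figure 13.13; a ratio near 1 is consistent with roughly constant spread. The residual plot remains the primary tool.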

FIGURE 13.13 Residual plot for a hypothetical set of data illustrating a violation of the equal variance assumption

Learning the Basics

13.23 The results below provide the X values, residuals, and a residual plot from a regression analysis:

[X values, residuals, and residual plot not recoverable from the source.]

Is there any evidence of a pattern in the residuals? Explain.

13.24 The results below show the X values, residuals, and a residual plot from a regression analysis:

[X values, residuals, and residual plot not recoverable from the source.]

Is there any evidence of a pattern in the residuals? Explain.


Applying the Concepts

13.25 In Problem 13.5 on page 522, you used reported magazine newsstand sales to predict audited sales. Perform a residual analysis for these data.
a. Determine the adequacy of the fit of the model.
b. Evaluate whether the assumptions of regression have been seriously violated.

13.26 In Problem 13.4 on page 522, the marketing manager used shelf space for pet food to predict weekly sales. Perform a residual analysis for these data.
a. Determine the adequacy of the fit of the model.
b. Evaluate whether the assumptions of regression have been seriously violated.

13.27 In Problem 13.7 on page 523, you used the weight of mail to predict the number of orders received. Perform a residual analysis for these data. Based on these results,
a. determine the adequacy of the fit of the model.
b. evaluate whether the assumptions of regression have been seriously violated.

13.28 In Problem 13.6 on page 522, the owner of a moving company wanted to predict labor hours based on the cubic feet moved. Perform a residual analysis for these data. Based on these results,
a. determine the adequacy of the fit of the model.
b. evaluate whether the assumptions of regression have been seriously violated.

13.29 In Problem 13.9 on page 523, an agent for a real estate company wanted to predict the monthly rent for apartments, based on the size of the apartments. Perform a residual analysis for these data. Based on these results,
a. determine the adequacy of the fit of the model.
b. evaluate whether the assumptions of regression have been seriously violated.

13.30 In Problem 13.8 on page 523, you used annual revenues to predict the value of a baseball franchise. Perform a residual analysis for these data. Based on these results,
a. determine the adequacy of the fit of the model.
b. evaluate whether the assumptions of regression have been seriously violated.

13.31 In Problem 13.10 on page 523, you used hardness to predict the tensile strength of die-cast aluminum. Perform a residual analysis for these data. Based on these results,
a. determine the adequacy of the fit of the model.
b. evaluate whether the assumptions of regression have been seriously violated.

13.6 MEASURING AUTOCORRELATION: THE DURBIN-WATSON STATISTIC

One of the basic assumptions of the regression model is the independence of the errors. This assumption is sometimes violated when data are collected over sequential time periods because a residual at any one time period may tend to be similar to residuals at adjacent time periods. This pattern in the residuals is called autocorrelation. When a set of data has substantial autocorrelation, the validity of a regression model can be in serious doubt.

Residual Plots to Detect Autocorrelation

As mentioned in Section 13.5, one way to detect autocorrelation is to plot the residuals in time order. If a positive autocorrelation effect is present, there will be clusters of residuals with the same sign, and you will readily detect an apparent pattern. If negative autocorrelation exists, residuals will tend to jump back and forth from positive to negative to positive, and so on. This type of pattern is very rarely seen in regression analysis. Thus, the focus of this section is on positive autocorrelation. To illustrate positive autocorrelation, consider the following example.

The manager of a package delivery store wants to predict weekly sales, based on the number of customers making purchases for a period of 15 weeks. In this situation, because data are collected over a period of 15 consecutive weeks at the same store, you need to determine whether autocorrelation is present. Table 13.4 presents the data. Figure 13.14 illustrates Microsoft Excel results for these data.

TABLE 13.4 Customers and Sales for a Period of 15 Consecutive Weeks

Week   Customers   Sales (Thousands of Dollars)
 1       794          9.33
 2       799          8.26
 3       837          7.48
 4       855          9.08
 5       845          9.83
 6       844         10.09
 7       863         11.01
 8       875         11.49
 9       880         12.07
10       905         12.55
11       886         11.92
12       843         10.27
13       904         11.80
14       950         12.15
15       841          9.64

FIGURE 13.14 Microsoft Excel results for the package delivery store data of Table 13.4. See Section E13.1 to create this.

FIGURE 13.15 Microsoft Excel residual plot for the package delivery store data of Table 13.4. See Section E13.3 to create this.

From Figure 13.14, observe that $r^2$ is 0.6574, indicating that 65.74% of the variation in sales is explained by variation in the number of customers. In addition, the Y intercept, $b_0$, is -16.0322, and the slope, $b_1$, is 0.0308. However, before using this model for prediction, you must undertake proper analyses of the residuals. Because the data have been collected over a consecutive period of 15 weeks, in addition to checking the linearity, normality, and equal-variance assumptions, you must investigate the independence-of-errors assumption. You can plot the residuals versus time to help you see whether a pattern exists. In Figure 13.15, you can see that the residuals tend to fluctuate up and down in a cyclical pattern. This cyclical pattern provides strong cause for concern about the autocorrelation of the residuals and, hence, a violation of the independence-of-errors assumption.


The Durbin-Watson Statistic

The Durbin-Watson statistic is used to measure autocorrelation. This statistic measures the correlation between each residual and the residual for the time period immediately preceding the one of interest. Equation (13.15) defines the Durbin-Watson statistic.

DURBIN-WATSON STATISTIC

$$D = \frac{\sum_{i=2}^{n} (e_i - e_{i-1})^2}{\sum_{i=1}^{n} e_i^2} \qquad (13.15)$$

where

$e_i$ = residual at the time period i

To better understand the Durbin-Watson statistic, D, you can examine Equation (13.15). The numerator, $\sum_{i=2}^{n} (e_i - e_{i-1})^2$, represents the squared difference between two successive residuals, summed from the second value to the nth value. The denominator, $\sum_{i=1}^{n} e_i^2$, represents the sum of the squared residuals. When successive residuals are positively autocorrelated, the value of D approaches 0. If the residuals are not correlated, the value of D will be close to 2. (If there is negative autocorrelation, D will be greater than 2 and could even approach its maximum value of 4.) For the package delivery store data, as shown in the Microsoft Excel results of Figure 13.16, the Durbin-Watson statistic, D, is 0.8830.

FIGURE 13.16 Microsoft Excel results of the Durbin-Watson statistic for the package delivery store data. See Section E13.4 to create this.
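Equation (13.15) can be sketched as a short Python function. The residual sequences below are made up to show the two extremes described above (clustered signs versus alternating signs); they are not the package delivery store residuals, and the function name `durbin_watson` is a hypothetical helper rather than a call into any library.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic D from residuals listed in time order."""
    # Numerator: squared successive differences, from the 2nd to the nth residual
    numerator = sum(
        (residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals))
    )
    # Denominator: sum of all squared residuals
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Hypothetical residuals: same-sign clusters (positive autocorrelation)
clustered = [2, 2, 1, 1, -1, -1, -2, -2]
# Hypothetical residuals: sign flips every period (negative autocorrelation)
alternating = [2, -2, 2, -2, 2, -2, 2, -2]

d_clustered = durbin_watson(clustered)
d_alternating = durbin_watson(alternating)
print(f"clustered:   D = {d_clustered:.2f}")
print(f"alternating: D = {d_alternating:.2f}")
```

The clustered sequence gives a value well below 2, and the alternating sequence gives a value well above 2, in line with the interpretation of D given in the text.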

You need to determine when the autocorrelation is large enough to make the Durbin-Watson statistic, D, fall sufficiently below 2 to conclude that there is significant positive autocorrelation. After computing D, you compare it to the critical values of the Durbin-Watson statistic found in Table E.10, a portion of which is presented in Table 13.5. The critical values depend on $\alpha$, the significance level chosen, n, the sample size, and k, the number of independent variables in the model (in simple linear regression, k = 1).

TABLE 13.5 Finding Critical Values of the Durbin-Watson Statistic

α = .05

         k = 1         k = 2         k = 3         k = 4         k = 5
 n     dL    dU      dL    dU      dL    dU      dL    dU      dL    dU
15    1.08  1.36     .95  1.54     .82  1.75     .69  1.97     .56  2.21
16    1.10  1.37     .98  1.54     .86  1.73     .74  1.93     .62  2.15
17    1.13  1.38    1.02  1.54     .90  1.71     .78  1.90     .67  2.10
18    1.16  1.39    1.05  1.53     .93  1.69     .82  1.87     .71  2.06


In Table 13.5, two values are shown for each combination of $\alpha$ (level of significance), n (sample size), and k (number of independent variables in the model). The first value, $d_L$, represents the lower critical value. If D is below $d_L$, you conclude that there is evidence of positive autocorrelation among the residuals. If this occurs, the least-squares method used in this chapter is inappropriate, and you should use alternative methods (see reference 4). The second value, $d_U$, represents the upper critical value of D, above which you would conclude that there is no evidence of positive autocorrelation among the residuals. If D is between $d_L$ and $d_U$, you are unable to arrive at a definite conclusion.

For the package delivery store data, with one independent variable (k = 1) and 15 values (n = 15), $d_L$ = 1.08 and $d_U$ = 1.36. Because D = 0.8830 < 1.08, you conclude that there is positive autocorrelation among the residuals. The least-squares regression analysis of the data is inappropriate because of the presence of significant positive autocorrelation among the residuals. In other words, the independence-of-errors assumption is invalid. You need to use alternative approaches discussed in reference 4.
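The three-way decision rule described above can be written as a small Python function. This is a sketch of the rule as stated in the text for testing positive autocorrelation; the function name `dw_conclusion` and the wording of the returned strings are assumptions made for illustration.

```python
def dw_conclusion(d, d_lower, d_upper):
    """Apply the Durbin-Watson decision rule for positive autocorrelation."""
    if d < d_lower:
        return "evidence of positive autocorrelation"
    if d > d_upper:
        return "no evidence of positive autocorrelation"
    return "inconclusive"


# Package delivery store data: k = 1, n = 15, alpha = .05
result = dw_conclusion(d=0.8830, d_lower=1.08, d_upper=1.36)
print(result)
```

For the package delivery store data the function returns the first outcome, matching the conclusion reached in the text.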

Learning the Basics

13.32 The residuals for 10 consecutive time periods are as follows:

Time Period   Residual      Time Period   Residual
     1        (illegible)        6           +1
     2        (illegible)        7           +2
     3        (illegible)        8           +3
     4        (illegible)        9           +4
     5        (illegible)       10           +5

a. Plot the residuals over time. What conclusion can you reach about the pattern of the residuals over time?
b. Based on (a), what conclusion can you reach about the autocorrelation of the residuals?

13.33 The residuals for 15 consecutive time periods are as follows:

Time Period   Residual      Time Period   Residual
     1           +4              9           +6
     2           -6             10           -3
     3           -1             11           +1
     4           -5             12           +3
     5           +2             13            0
     6           +5             14           -4
     7           -2             15           -7
     8           +7

a. Plot the residuals over time. What conclusion can you reach about the pattern of the residuals over time?
b. Compute the Durbin-Watson statistic. At the 0.05 level of significance, is there evidence of positive autocorrelation among the residuals?
c. Based on (a) and (b), what conclusion can you reach about the autocorrelation of the residuals?

Applying the Concepts

13.34 In Problem 13.4 on page 522 concerning pet food sales, the marketing manager used shelf space for pet food to predict weekly sales.
a. Is it necessary to compute the Durbin-Watson statistic in this case? Explain.
b. Under what circumstances is it necessary to compute the Durbin-Watson statistic before proceeding with the least-squares method of regression analysis?

13.35 The owner of a single-family home in a suburban county in the northeastern United States would like to develop a model to predict electricity consumption in his all-electric house (lights, fans, heat, appliances, and so on), based on average atmospheric temperature (in degrees Fahrenheit). Monthly kilowatt usage and temperature data are available for a period of 24 consecutive months in the accompanying data file.
a. Assuming a linear relationship, use the least-squares method to find the regression coefficients $b_0$ and $b_1$.
b. Predict the mean kilowatt usage when the average atmospheric temperature is 50° Fahrenheit.
c. Plot the residuals versus the time period.
d. Compute the Durbin-Watson statistic. At the 0.05 level of significance, is there evidence of positive autocorrelation among the residuals?
e. Based on the results of (c) and (d), is there reason to question the validity of the model?

Page 28: chap 13

538 CHAPTERTHIRTEEN SimpleLinearRegression

13.36 A mail-order catalog business that sells personal computer supplies, software, and hardware maintains a centralized warehouse for the distribution of products ordered. Management is currently examining the process of distribution from the warehouse and is interested in studying the factors that affect warehouse distribution costs. Currently, a small handling fee is added to the order, regardless of the amount of the order. Data have been collected over the past 24 months, indicating the warehouse distribution costs and the number of orders received. They are stored in the data file. The results are as follows:

To use the espresso shot in making a latte, cappuccino, or other drinks, the shot must be poured into the beverage during the separation of the heart, body, and crema. If the shot is used after the separation occurs, the drink becomes excessively bitter and acidic, ruining the final drink. Thus, a longer separation time allows the drink-maker more time to pour the shot and ensure that the beverage will meet expectations. An employee at a coffee shop hypothesized that the harder the espresso grounds were tamped down into the portafilter before brewing, the longer the separation time would be. An experiment using 24 observations was conducted to test this relationship. The independent variable Tamp measures the distance, in inches, between the espresso grounds and the top of the portafilter (that is, the harder the tamp, the larger the distance). The dependent variable Time is the number of seconds the heart, body, and crema are separated (that is, the amount of time after the shot is poured before it must be used for the customer's beverage). The data are stored in the data file:

Shot Tamp Time Shot Tamp

Month   Distribution Cost (Thousands of Dollars)   Number of Orders
1       52.95                                      4,015
2       71.66                                      3,806
3       85.58                                      5,309
4       63.69                                      4,262
5       72.81                                      4,296
6       68.44                                      4,097
7       52.46                                      3,213
8       70.77                                      4,809
9       82.03                                      5,237
10      74.39                                      4,732
11      70.84                                      4,413
12      54.08                                      2,921
13      62.98                                      3,977
14      72.30                                      4,428
15      58.99                                      3,964
16      79.38                                      4,592
17      94.44                                      5,582
18      59.74                                      3,450
19      90.50                                      5,079
20      93.24                                      5,735
21      69.33                                      4,269
22      53.71                                      3,708
23      89.18                                      5,387
24      66.80                                      4,161

Shot   Tamp   Time
1      0.20   14
2      0.50   14
3      0.50   18
4      0.20   16
5      0.20   16
6      0.50   13
7      0.20   12
8      0.35   15
9      0.50   9
10     0.35   15
11     0.50   11
12     0.50   16

Shot   Tamp
13     0.50
14     0.50
15     0.35
16     0.35
17     0.20
18     0.20
19     0.20
20     0.20
21     0.35
22     0.35
23     0.35
24     0.35

Time (remaining shots, in the order given): 13 19 19 17 18 15 16 18 16 14 16

a. Assuming a linear relationship, use the least-squares method to find the regression coefficients $b_0$ and $b_1$.

b. Predict the monthly warehouse distribution costs when the number of orders is 4,500.

c. Plot the residuals versus the time period.

d. Compute the Durbin-Watson statistic. At the 0.05 level of significance, is there evidence of positive autocorrelation among the residuals?

e. Based on the results of (c) and (d), is there reason to question the validity of the model?

13.37 A freshly brewed shot of espresso has three distinct components: the heart, body, and crema. The separation of these three components typically lasts only 10 to 20 seconds.

a. Determine the prediction line, using Time as the dependent variable and Tamp as the independent variable.

b. Predict the mean separation time for a Tamp distance of 0.50 inch.

c. Plot the residuals versus the time order of experimentation. Are there any noticeable patterns?

d. Compute the Durbin-Watson statistic. At the 0.05 level of significance, is there evidence of positive autocorrelation among the residuals?

e. Based on the results of (c) and (d), is there reason to question the validity of the model?

13.38 The owner of a chain of ice cream stores would like to study the effect of atmospheric temperature on sales during the summer season. A sample of 21 consecutive days is selected, with the results stored in the data file. (Hint: Determine which are the independent and dependent variables.)


a. Assuming a linear relationship, use the least-squares method to find the regression coefficients $b_0$ and $b_1$.

b. Predict the sales per store for a day in which the temperature is 83°F.

c. Plot the residuals versus the time period.

d. Compute the Durbin-Watson statistic. At the 0.05 level of significance, is there evidence of positive autocorrelation among the residuals?

e. Based on the results of (c) and (d), is there reason to question the validity of the model?

13.7 INFERENCES ABOUT THE SLOPE AND CORRELATION COEFFICIENT

In Sections 13.1 through 13.3, regression was used solely for descriptive purposes. You learned how the least-squares method determines the regression coefficients and how to predict Y for a given value of X. In addition, you learned how to compute and interpret the standard error of the estimate and the coefficient of determination.

When residual analysis, as discussed in Section 13.5, indicates that the assumptions of a least-squares regression model are not seriously violated and that the straight-line model is appropriate, you can make inferences about the linear relationship between the variables in the population.

t Test for the Slope

To determine the existence of a significant linear relationship between the X and Y variables, you test whether $\beta_1$ (the population slope) is equal to 0. The null and alternative hypotheses are as follows:

$H_0$: $\beta_1 = 0$ (There is no linear relationship.)
$H_1$: $\beta_1 \ne 0$ (There is a linear relationship.)

If you reject the null hypothesis, you conclude that there is evidence of a linear relationship. Equation (13.16) defines the test statistic.

TESTING A HYPOTHESIS FOR A POPULATION SLOPE, $\beta_1$, USING THE t TEST

The t statistic equals the difference between the sample slope and hypothesized value of the population slope divided by the standard error of the slope.

$$t = \frac{b_1 - \beta_1}{S_{b_1}} \qquad (13.16)$$

where

$$S_{b_1} = \frac{S_{YX}}{\sqrt{SSX}}$$

$$SSX = \sum_{i=1}^{n} (X_i - \bar{X})^2$$

The test statistic t follows a t distribution with n - 2 degrees of freedom.

Return to the Using Statistics scenario concerning Sunflowers Apparel. To test whether there is a significant linear relationship between the size of the store and the annual sales at the 0.05 level of significance, refer to the Microsoft Excel worksheet for the t test presented in Figure 13.17.

FIGURE 13.17 Microsoft Excel t test for the slope for the Sunflowers Apparel data

              Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept     0.9645         0.5262           1.8329    0.0917    -0.1820     2.1110
Square Feet   1.6699         0.1569           10.6411   0.0000    1.3280      2.0118

See Section E13.1 to create the worksheet that contains this area.

FIGURE 13.18 Testing a hypothesis about the population slope at the 0.05 level of significance, with 12 degrees of freedom

From Figure 13.17,

$$b_1 = +1.6699 \qquad n = 14 \qquad S_{b_1} = 0.1569$$

and

$$t = \frac{b_1 - \beta_1}{S_{b_1}} = \frac{1.6699 - 0}{0.1569} = 10.6411$$

Microsoft Excel labels this t statistic t Stat (see Figure 13.17). Using the 0.05 level of significance, the critical value of t with n - 2 = 12 degrees of freedom is 2.1788. Because t = 10.6411 > 2.1788, you reject $H_0$ (see Figure 13.18). Using the p-value, you reject $H_0$ because the p-value is approximately 0, which is less than $\alpha = 0.05$. Hence, you can conclude that there is a significant linear relationship between mean annual sales and the size of the store.
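The arithmetic behind this decision can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `slope_t_statistic` is hypothetical, the inputs are the rounded values shown in Figure 13.17, and the critical value is copied from Table E.3 rather than computed. Because the inputs are rounded, the result agrees with Excel's 10.6411 only to about two decimal places.

```python
# Minimal sketch of the t test for the slope, Equation (13.16).
# All names are illustrative; the inputs are the rounded values from Figure 13.17.

def slope_t_statistic(b1, s_b1, beta1_hypothesized=0.0):
    """Return t = (b1 - beta1) / S_b1."""
    return (b1 - beta1_hypothesized) / s_b1

b1 = 1.6699          # sample slope
s_b1 = 0.1569        # standard error of the slope
t_critical = 2.1788  # two-tail critical value from Table E.3 (alpha = 0.05, 12 df)

t_stat = slope_t_statistic(b1, s_b1)
reject_h0 = abs(t_stat) > t_critical  # two-tail decision rule

print(f"t = {t_stat:.2f}, critical value = {t_critical}, reject H0: {reject_h0}")
```

Passing a nonzero `beta1_hypothesized` would test the slope against some other hypothesized value, using the same formula.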

[Plot: t distribution centered at 0 with critical values at -2.1788 and +2.1788; regions of rejection in both tails, region of nonrejection between them]

F Test for the Slope

As an alternative to the t test, you can use an F test to determine whether the slope in simple linear regression is statistically significant. In Section 10.4, you used the F distribution to test the ratio of two variances. Equation (13.17) defines the F test for the slope as the ratio of the variance that is due to the regression (MSR) divided by the error variance ($MSE = S_{YX}^2$).

TESTING A HYPOTHESIS FOR A POPULATION SLOPE, $\beta_1$, USING THE F TEST

The F statistic is equal to the regression mean square (MSR) divided by the error mean square (MSE).

$$F = \frac{MSR}{MSE} \qquad (13.17)$$

where

$$MSR = \frac{SSR}{k}$$

$$MSE = \frac{SSE}{n - k - 1}$$

k = number of independent variables in the regression model

The test statistic F follows an F distribution with k and n - k - 1 degrees of freedom.

Using a level of significance $\alpha$, the decision rule is

Reject $H_0$ if $F > F_U$;

otherwise, do not reject $H_0$.

Table 13.6 organizes the complete set of results into an ANOVA table.

TABLE 13.6 ANOVA Table for Testing the Significance of a Regression Coefficient

Source       df          Sum of Squares   Mean Square (Variance)    F
Regression   k           SSR              MSR = SSR/k               F = MSR/MSE
Error        n - k - 1   SSE              MSE = SSE/(n - k - 1)
Total        n - 1       SST

The completed ANOVA table is also part of the Microsoft Excel results shown in Figure 13.19. Figure 13.19 shows that the computed F statistic is 113.2335 and the p-value is approximately 0.

FIGURE 13.19 Microsoft Excel F test for the Sunflowers Apparel data

ANOVA
             df   SS         MS         F          Significance F
Regression   1    105.7476   105.7476   113.2335   0.0000
Residual     12   11.2067    0.9339
Total        13   116.9543

See Section E13.1 to create the worksheet that contains this area.

Using a level of significance of 0.05, from Table E.5, the critical value of the F distribution, with 1 and 12 degrees of freedom, is 4.75 (see Figure 13.20). Because F = 113.2335 > 4.75 or because the p-value = 0.0000 < 0.05, you reject $H_0$ and conclude that the size of the store is significantly related to annual sales. Because the F test in Equation (13.17) on page 540 is equivalent to the t test on page 539, you reach the same conclusion.
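As a rough check on Equation (13.17), the sketch below recomputes F from the sums of squares reported in Figure 13.19 and compares it with the square of the t statistic from Figure 13.17. The function name `f_statistic` is an assumption made for this illustration, and the critical value 4.75 is taken from Table E.5 rather than computed.

```python
# Minimal sketch of the F test for the slope, Equation (13.17).
# Names are illustrative; SSR and SSE are the rounded values from Figure 13.19.

def f_statistic(ssr, sse, n, k=1):
    """Return F = MSR / MSE with k and n - k - 1 degrees of freedom."""
    msr = ssr / k              # regression mean square
    mse = sse / (n - k - 1)    # error mean square
    return msr / mse

ssr, sse, n = 105.7476, 11.2067, 14
f_critical = 4.75              # from Table E.5 (alpha = 0.05, 1 and 12 df)

f_stat = f_statistic(ssr, sse, n)
reject_h0 = f_stat > f_critical

# In simple linear regression (k = 1), F equals the square of the t statistic.
t_stat_from_excel = 10.6411
print(f"F = {f_stat:.2f}, t squared = {t_stat_from_excel ** 2:.2f}, reject H0: {reject_h0}")
```

The agreement between F and the squared t statistic is why the two tests lead to the same conclusion when there is a single independent variable.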


FIGURE 13.20 Regions of rejection and nonrejection when testing for significance of slope at the 0.05 level of significance, with 1 and 12 degrees of freedom

[Plot: F distribution with the critical value at 4.75; region of nonrejection to the left, region of rejection to the right]

Confidence Interval Estimate of the Slope ($\beta_1$)

As an alternative to testing for the existence of a linear relationship between the variables, you can construct a confidence interval estimate of $\beta_1$ and determine whether the hypothesized value ($\beta_1 = 0$) is included in the interval. Equation (13.18) defines the confidence interval estimate of $\beta_1$.

CONFIDENCE INTERVAL ESTIMATE OF THE SLOPE, $\beta_1$

The confidence interval estimate for the slope can be constructed by taking the sample slope, $b_1$, and adding and subtracting the critical t value multiplied by the standard error of the slope.

$$b_1 \pm t_{n-2} S_{b_1} \qquad (13.18)$$

From the Microsoft Excel results of Figure 13.17 on page 540,

$$b_1 = 1.6699 \qquad n = 14 \qquad S_{b_1} = 0.1569$$

To construct a 95% confidence interval estimate, $\alpha/2 = 0.025$, and from Table E.3, $t_{12} = 2.1788$. Thus,

$$b_1 \pm t_{n-2} S_{b_1} = 1.6699 \pm (2.1788)(0.1569) = 1.6699 \pm 0.3419$$

$$1.3280 \le \beta_1 \le 2.0118$$

Therefore, you estimate with 95% confidence that the population slope is between 1.3280 and 2.0118. Because these values are above 0, you conclude that there is a significant linear relationship between annual sales and the size of the store. Had the interval included 0, you would have concluded that no significant relationship exists between the variables. The confidence interval indicates that for each increase of 1,000 square feet, mean annual sales are estimated to increase by at least $1,328,000 but no more than $2,011,800.
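The interval in Equation (13.18) can be reproduced with a short Python sketch. The function name `slope_confidence_interval` is hypothetical, and the critical value is simply the Table E.3 figure quoted above, not something the code derives.

```python
# Minimal sketch of the confidence interval estimate of the slope, Equation (13.18).
# Names are illustrative; inputs are the rounded values from Figure 13.17.

def slope_confidence_interval(b1, s_b1, t_critical):
    """Return (lower, upper) = b1 -/+ t_critical * S_b1."""
    half_width = t_critical * s_b1
    return b1 - half_width, b1 + half_width

lower, upper = slope_confidence_interval(b1=1.6699, s_b1=0.1569, t_critical=2.1788)

# If 0 lies outside the interval, the slope is significant at the matching alpha.
zero_inside = lower <= 0.0 <= upper

print(f"{lower:.4f} <= beta_1 <= {upper:.4f}; interval contains 0: {zero_inside}")
```

Checking whether 0 falls inside the interval gives the same decision as the two-tail t test at the corresponding level of significance.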

t Test for the Correlation Coefficient

In Section 3.5 on page 130, the strength of the relationship between two numerical variables was measured, using the correlation coefficient, r. You can use the correlation coefficient to determine whether there is a statistically significant linear relationship between X and Y. To do so, you hypothesize that the population correlation coefficient, $\rho$, is 0. Thus, the null and alternative hypotheses are

$H_0$: $\rho = 0$ (no correlation)
$H_1$: $\rho \ne 0$ (correlation)

Equation (13.19) defines the test statistic for determining the existence of a significant correlation.

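TESTING FOR THE EXISTENCE OF CORRELATION

$$t = \frac{r - \rho}{\sqrt{\dfrac{1 - r^2}{n - 2}}} \qquad (13.19)$$

where

$$r = +\sqrt{r^2} \quad \text{if } b_1 > 0$$

$$r = -\sqrt{r^2} \quad \text{if } b_1 < 0$$

The test statistic t follows a t distribution with n - 2 degrees of freedom.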

In the Sunflowers Apparel problem, $r^2 = 0.9042$ and $b_1 = +1.6699$ (see Figure 13.4 on page 516). Because $b_1 > 0$, the correlation coefficient for annual sales and store size is the positive square root of $r^2$, that is, $r = +\sqrt{0.9042} = +0.9509$. Testing the null hypothesis that there is no correlation between these two variables results in the following observed t statistic:

$$t = \frac{r - 0}{\sqrt{\dfrac{1 - r^2}{n - 2}}} = \frac{0.9509 - 0}{\sqrt{\dfrac{1 - (0.9509)^2}{14 - 2}}} = 10.6411$$

Using the 0.05 level of significance, because t = 10.6411 > 2.1788, you reject the null hypothesis. You conclude that there is evidence of an association between annual sales and store size. This t statistic is equivalent to the t statistic found when testing whether the population slope, $\beta_1$, is equal to zero (see Figure 13.17 on page 540).

When inferences concerning the population slope were discussed, confidence intervals and tests of hypothesis were used interchangeably. However, developing a confidence interval for the correlation coefficient is more complicated because the shape of the sampling distribution of the statistic r varies for different values of the population correlation coefficient. Methods for developing a confidence interval estimate for the correlation coefficient are presented in reference 4.
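The t test for the correlation coefficient in Equation (13.19) can be sketched as follows. The function names are assumptions made for this example, and the inputs are the rounded values quoted for the Sunflowers Apparel data, so the result matches 10.6411 only to about two decimal places.

```python
import math

# Minimal sketch of the t test for correlation, Equation (13.19).
# Names are illustrative; r_squared, b1 and n are the rounded values quoted in the text.

def correlation_from_r_squared(r_squared, b1):
    """Return r as the square root of r^2, carrying the sign of the slope b1."""
    return math.copysign(math.sqrt(r_squared), b1)

def correlation_t_statistic(r, n, rho_hypothesized=0.0):
    """Return t = (r - rho) / sqrt((1 - r^2) / (n - 2))."""
    return (r - rho_hypothesized) / math.sqrt((1.0 - r ** 2) / (n - 2))

r = correlation_from_r_squared(r_squared=0.9042, b1=1.6699)
t_stat = correlation_t_statistic(r, n=14)
reject_h0 = abs(t_stat) > 2.1788  # critical value from Table E.3 (12 df)

print(f"r = {r:.4f}, t = {t_stat:.2f}, reject H0: {reject_h0}")
```

A negative slope would flip the sign of r, and therefore of t, without changing the two-tail decision.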

Learning the Basics

13.39 You are testing the null hypothesis that there is no relationship between two variables, X and Y. From your sample of n = 10, you determine that r = 0.80.

a. What is the value of the t test statistic?

b. At the $\alpha = 0.05$ level of significance, what are the critical values?

c. Based on your answers to (a) and (b), what statistical decision should you make?

13.40 You are testing the null hypothesis that there is no relationship between two variables, X and Y. From your sample of n = 18, you determine that $b_1 = +4.5$ and $S_{b_1} = 1.5$.

a. What is the value of the t test statistic?

b. At the $\alpha = 0.05$ level of significance, what are the critical values?

c. Based on your answers to (a) and (b), what statistical decision should you make?

d. Construct a 95% confidence interval estimate of the population slope, $\beta_1$.

13.41 You are testing the null hypothesis that there is no relationship between two variables, X and Y. From your sample of n = 20, you determine that SSR = 60 and SSE = 40.

a. What is the value of the F test statistic?

b. At the $\alpha = 0.05$ level of significance, what is the critical value?

c. Based on your answers to (a) and (b), what statistical decision should you make?

d. Compute the correlation coefficient by first computing $r^2$ and assuming that $b_1$ is negative.

e. At the 0.05 level of significance, is there a significant correlation between X and Y?

Applying the Concepts

13.42 In Problem 13.4 on page 522, the marketing manager used shelf space for pet food to predict weekly sales. The data are stored in the data file. From the results of that problem, $b_1 = 7.4$ and $S_{b_1} = 1.59$.

a. At the 0.05 level of significance, is there evidence of a linear relationship between shelf space and sales?

b. Construct a 95% confidence interval estimate of the population slope, $\beta_1$.

13.43 In Problem 13.5 on page 522, you used reported magazine newsstand sales to predict audited sales. The data are stored in the data file. Using the results of that problem, $b_1 = 0.5719$ and $S_{b_1} = 0.0668$.

a. At the 0.05 level of significance, is there evidence of a linear relationship between reported sales and audited sales?

b. Construct a 95% confidence interval estimate of the population slope, $\beta_1$.

13.44 In Problem 13.6 on pages 522-523, the owner of a moving company wanted to predict labor hours, based on the number of cubic feet moved. The data are stored in the data file. Using the results of that problem,

a. at the 0.05 level of significance, is there evidence of a linear relationship between the number of cubic feet moved and labor hours?

b. construct a 95% confidence interval estimate of the population slope, $\beta_1$.

13.45 In Problem 13.7 on page 523, you used the weight of mail to predict the number of orders received. The data are stored in the data file. Using the results of that problem,

a. at the 0.05 level of significance, is there evidence of a linear relationship between the weight of mail and the number of orders received?

b. construct a 95% confidence interval estimate of the population slope, $\beta_1$.

13.46 In Problem 13.8 on page 523, you used annual revenues to predict the value of a baseball franchise. The data are stored in the data file. Using the results of that problem,

a. at the 0.05 level of significance, is there evidence of a linear relationship between annual revenue and franchise value?

b. construct a 95% confidence interval estimate of the population slope, $\beta_1$.

13.47 In Problem 13.9 on page 523, an agent for a real estate company wanted to predict the monthly rent for apartments, based on the size of the apartment. The data are stored in the data file. Using the results of that problem,

a. at the 0.05 level of significance, is there evidence of a linear relationship between the size of the apartment and the monthly rent?

b. construct a 95% confidence interval estimate of the population slope, $\beta_1$.

13.48 In Problem 13.10 on page 523, you used hardness to predict the tensile strength of die-cast aluminum. The data are stored in the data file. Using the results of that problem,

a. at the 0.05 level of significance, is there evidence of a linear relationship between hardness and tensile strength?

b. construct a 95% confidence interval estimate of the population slope, $\beta_1$.

13.49 The volatility of a stock is often measured by its beta value. You can estimate the beta value of a stock by developing a simple linear regression model, using the percentage weekly change in the stock as the dependent variable and the percentage weekly change in a market index as the independent variable. The S&P 500 Index is a common index to use. For example, if you wanted to estimate the beta for IBM, you could use the following model, which is sometimes referred to as a market model:

(% weekly change in IBM) = $\beta_0 + \beta_1$ (% weekly change in S&P 500 index) + $\varepsilon$

The least-squares regression estimate of the slope $b_1$ is the estimate of the beta value for IBM. A stock with a beta value of 1.0 tends to move the same as the overall market. A stock with a beta value of 1.5 tends to move 50% more than the overall market, and a stock with a beta value of 0.6 tends to move only 60% as much as the overall market. Stocks with negative beta values tend to move in a direction opposite that of the overall market. The following table gives some beta values for some widely held stocks:

Ticker Symbol Beta


mately 12.5%. On the downside, if the same index loses 20%, POSCX loses approximately 25%.

a. Consider the leveraged mutual fund ProFund UltraOTC "Inv" (UOPIX), whose description is 200% of the performance of the S&P 500 Index. What is its approximate market model?

b. If the NASDAQ gains 30% in a year, what return do you expect UOPIX to have?

c. If the NASDAQ loses 35% in a year, what return do you expect UOPIX to have?

d. What type of investors should be attracted to leveraged funds? What type of investors should stay away from these funds?

13.51 The data in the data file represent the calories and fat (in grams) of 16-ounce iced coffee drinks at Dunkin' Donuts and Starbucks:

Product Calories Fat


Ticker Symbol   Beta
T               0.80
IBM             1.20
DIS             1.40
AA              2.26
LSI             3.61

Source: Extracted from finance.yahoo.com, May 31, 2006.

a. For each of the five companies, interpret the beta value.

b. How can investors use the beta value as a guide for investing?

13.50 Index funds are mutual funds that try to mimic the movement of leading indexes, such as the S&P 500 Index, the NASDAQ 100 Index, or the Russell 2000 Index. The beta values for these funds (as described in Problem 13.49) are therefore approximately 1.0. The estimated market models for these funds are approximately

(% weekly change in index fund) = 0.0 + 1.0 (% weekly change in the index)

Leveraged index funds are designed to magnify the movement of major indexes. An article in Mutual Funds (O'Shaughnessy, "Reach for Higher Returns," Mutual Funds, July 1999, pp. 44-49) described some of the risks and rewards associated with these funds and gave details on some of the most popular leveraged funds, including those in the following table:

Name (Ticker Symbol)   Fund Description
Small Cap (POSCX)      125% of Russell 2000 Index
"Inv" Nova (RYNVX)     150% of the S&P 500 Index
UltraOTC (UOPIX)       Double (200%) the NASDAQ 100 Index

The estimated market models for these funds are

(% weekly change in POSCX) = 0.0 + 1.25 (% weekly change in the Russell 2000 Index)

(% weekly change in RYNVX) = 0.0 + 1.50 (% weekly change in the S&P 500 Index)

(% weekly change in UOPIX fund) = 0.0 + 2.0 (% weekly change in the NASDAQ 100 Index)

Thus, if the Russell 2000 Index gains 10% over a period of time, the leveraged mutual fund POSCX gains approxi-

Product                                                                    Calories   Fat
Dunkin' Donuts Iced Mocha Swirl latte (whole milk)                         240        8.0
Starbucks Coffee Frappuccino blended coffee                                260        3.5
Dunkin' Donuts Coffee Coolatta (cream)                                     350        22.0
Starbucks Iced Coffee Mocha Espresso (whole milk and whipped cream)        350        20.0
Starbucks Mocha Frappuccino blended coffee (whipped cream)                 420        16.0
Starbucks Chocolate Brownie Frappuccino blended coffee (whipped cream)     510        22.0
Starbucks Chocolate Frappuccino Blended Crème (whipped cream)              530        19.0

Source: Extracted from "Coffee as Candy at Dunkin' Donuts and Starbucks," Consumer Reports, June 2004, p. 9.

a. Compute and interpret the coefficient of correlation, r.

b. At the 0.05 level of significance, is there a significant linear relationship between the calories and fat?

13.52 There are several methods for calculating fuel economy. The following table (contained in the data file) indicates the mileage as calculated by owners and by current government standards:

Vehicle   Owner   Government Standards

2005 Ford F-1502005 Chevrolet Silverado2002 HondaAccord LX2002 Honda Civic2004 Honda Civic Hybrid2002 Ford Explorer2005 Toyota Camry2003 Toyota Corolla2005 Toyota Prius

14.31 5 . 027.827.948.81 6 . 823.732.8J I . J

16.817.826.234.247.61 8 . 328.53 3 . Is6.0

a. Compute and interpret the coefficient of correlation, r.

b. At the 0.05 level of significance, is there a significant linear relationship between the mileage as calculated by owners and by current government standards?

13.53 College basketball is big business, with coaches' salaries, revenues, and expenses in millions of dollars. The data in the data file represent the coaches' salaries and revenues for college basketball at selected schools in a recent year (extracted from R. Adams, "Pay for Playoffs," The Wall Street Journal, March 11-12, 2006, pp. P1, P8).

a. Compute and interpret the coefficient of correlation, r.

b. At the 0.05 level of significance, is there a significant linear relationship between a coach's salary and revenue?

13.54 College football players trying out for the NFL are given the Wonderlic standardized intelligence test. The data in the data file represent the average Wonderlic scores of football players trying out for the NFL and the graduation rates for football players at selected schools (extracted from S. Walker, "The NFL's Smartest Team," The Wall Street Journal, September 30, 2005, pp. W1, W10).

a. Compute and interpret the coefficient of correlation, r.

b. At the 0.05 level of significance, is there a significant linear relationship between the average Wonderlic score of football players trying out for the NFL and the graduation rates for football players at selected schools?

c. What conclusions can you reach about the relationship between the average Wonderlic score of football players trying out for the NFL and the graduation rates for football players at selected schools?

13.8 ESTIMATION OF MEAN VALUES AND PREDICTION OF INDIVIDUAL VALUES

This section presents methods of making inferences about the mean of Y and predicting individual values of Y.

The Confidence Interval Estimate

In Example 13.2 on page 519, you used the prediction line to predict the value of Y for a given X. The mean annual sales for stores with 4,000 square feet was predicted to be 7.644 millions of dollars ($7,644,000). This estimate, however, is a point estimate of the population mean. In Chapter 8, you studied the concept of the confidence interval as an estimate of the population mean. In a similar fashion, Equation (13.20) defines the confidence interval estimate for the mean response for a given X.

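CONFIDENCE INTERVAL ESTIMATE FOR THE MEAN OF Y

$$\hat{Y}_i \pm t_{n-2} S_{YX} \sqrt{h_i}$$

$$\hat{Y}_i - t_{n-2} S_{YX} \sqrt{h_i} \le \mu_{Y|X=X_i} \le \hat{Y}_i + t_{n-2} S_{YX} \sqrt{h_i} \qquad (13.20)$$

where

$$h_i = \frac{1}{n} + \frac{(X_i - \bar{X})^2}{SSX}$$

$\hat{Y}_i$ = predicted value of Y; $\hat{Y}_i = b_0 + b_1 X_i$

$S_{YX}$ = standard error of the estimate

n = sample size

$X_i$ = given value of X

$\mu_{Y|X=X_i}$ = mean value of Y when $X = X_i$

$$SSX = \sum_{i=1}^{n} (X_i - \bar{X})^2$$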


The width of the confidence interval in Equation (13.20) depends on several factors. For a given level of confidence, increased variation around the prediction line, as measured by the standard error of the estimate, results in a wider interval. However, as you would expect, increased sample size reduces the width of the interval. In addition, the width of the interval also varies at different values of X. When you predict Y for values of X close to $\bar{X}$, the interval is narrower than for predictions for X values more distant from $\bar{X}$.

In the Sunflowers Apparel example, suppose you want to construct a 95% confidence interval estimate of the mean annual sales for the entire population of stores that contain 4,000 square feet (X = 4). Using the simple linear regression equation,

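$$\hat{Y}_i = 0.9645 + 1.6699 X_i = 0.9645 + 1.6699(4) = 7.6439 \text{ (millions of dollars)}$$

Also, given the following:

$$\bar{X} = 2.9214 \qquad S_{YX} = 0.9664 \qquad SSX = \sum_{i=1}^{n} (X_i - \bar{X})^2 = 37.9236$$

From Table E.3, $t_{12} = 2.1788$. Thus,

$$\hat{Y}_i \pm t_{n-2} S_{YX} \sqrt{h_i}$$

where

$$h_i = \frac{1}{n} + \frac{(X_i - \bar{X})^2}{SSX}$$

so that

$$\hat{Y}_i \pm t_{n-2} S_{YX} \sqrt{\frac{1}{n} + \frac{(X_i - \bar{X})^2}{SSX}} = 7.6439 \pm (2.1788)(0.9664) \sqrt{\frac{1}{14} + \frac{(4 - 2.9214)^2}{37.9236}} = 7.6439 \pm 0.6728$$

so

$$6.9711 \le \mu_{Y|X=4} \le 8.3167$$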

Therefore, the 95% confidence interval estimate is that the mean annual sales are between $6,971,100 and $8,316,700 for the population of stores with 4,000 square feet.
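The calculation in Equation (13.20) can be sketched in Python as shown below. The function names `leverage` and `mean_confidence_interval` are assumptions introduced for this illustration. The predicted value 7.6439 is taken directly from the text, because recomputing it from the rounded coefficients would give a slightly different last digit.

```python
import math

# Minimal sketch of the confidence interval for the mean of Y, Equation (13.20).
# Function names are illustrative; inputs are the Sunflowers Apparel values in the text.

def leverage(x_i, x_bar, ssx, n):
    """Return h_i = 1/n + (X_i - X_bar)^2 / SSX."""
    return 1.0 / n + (x_i - x_bar) ** 2 / ssx

def mean_confidence_interval(y_hat, t_critical, s_yx, h_i):
    """Return (lower, upper) = y_hat -/+ t * S_YX * sqrt(h_i)."""
    half_width = t_critical * s_yx * math.sqrt(h_i)
    return y_hat - half_width, y_hat + half_width

h_i = leverage(x_i=4.0, x_bar=2.9214, ssx=37.9236, n=14)
lower, upper = mean_confidence_interval(
    y_hat=7.6439,       # predicted sales at X = 4, in millions of dollars
    t_critical=2.1788,  # from Table E.3 (alpha = 0.05, 12 df)
    s_yx=0.9664,        # standard error of the estimate
    h_i=h_i,
)

print(f"h_i = {h_i:.4f}; {lower:.4f} <= mean of Y at X = 4 <= {upper:.4f}")
```

Calling `leverage` with an X value farther from 2.9214 gives a larger h_i, which is why the interval widens away from the mean of X.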

The Prediction Interval

In addition to the need for a confidence interval estimate for the mean value, you often want to predict the response for an individual value. Although the form of the prediction interval is similar to that of the confidence interval estimate of Equation (13.20), the prediction interval is predicting an individual value, not estimating a parameter. Equation (13.21) defines the prediction interval for an individual response, Y, at a particular value, $X_i$, denoted by $Y_{X=X_i}$.


PREDICTION INTERVAL FOR AN INDIVIDUAL RESPONSE, Y

$$\hat{Y}_i \pm t_{n-2} S_{YX} \sqrt{1 + h_i}$$

$$\hat{Y}_i - t_{n-2} S_{YX} \sqrt{1 + h_i} \le Y_{X=X_i} \le \hat{Y}_i + t_{n-2} S_{YX} \sqrt{1 + h_i} \qquad (13.21)$$

where $h_i$, $\hat{Y}_i$, $S_{YX}$, n, and $X_i$ are defined as in Equation (13.20) on page 546 and $Y_{X=X_i}$ is a future value of Y when $X = X_i$.

To construct a 95% prediction interval of the annual sales for an individual store that contains 4,000 square feet (X = 4), you first compute $\hat{Y}_i$. Using the prediction line:

$$\hat{Y}_i = 0.9645 + 1.6699 X_i = 0.9645 + 1.6699(4) = 7.6439 \text{ (millions of dollars)}$$

Also, given the following:

$$\bar{X} = 2.9214 \qquad S_{YX} = 0.9664 \qquad SSX = \sum_{i=1}^{n} (X_i - \bar{X})^2 = 37.9236$$

From Table E.3, $t_{12} = 2.1788$. Thus,

$$\hat{Y}_i \pm t_{n-2} S_{YX} \sqrt{1 + h_i}$$

where

$$h_i = \frac{1}{n} + \frac{(X_i - \bar{X})^2}{SSX}$$

so that

$$\hat{Y}_i \pm t_{n-2} S_{YX} \sqrt{1 + \frac{1}{n} + \frac{(X_i - \bar{X})^2}{SSX}} = 7.6439 \pm (2.1788)(0.9664) \sqrt{1 + \frac{1}{14} + \frac{(4 - 2.9214)^2}{37.9236}} = 7.6439 \pm 2.2104$$

so

$$5.4335 \le Y_{X=4} \le 9.8543$$

Therefore, with 95% confidence, you predict that the annual sales for an individual store with 4,000 square feet is between $5,433,500 and $9,854,300.
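To see how the extra 1 under the square root in Equation (13.21) widens the interval, the following sketch computes the prediction interval and compares its half-width with that of the confidence interval for the mean. The function names are hypothetical, and the inputs are the rounded values from the text, so the limits can differ from the printed ones in the fourth decimal place.

```python
import math

# Minimal sketch of the prediction interval for an individual Y, Equation (13.21).
# Function names are illustrative; inputs are the Sunflowers Apparel values in the text.

def leverage(x_i, x_bar, ssx, n):
    """Return h_i = 1/n + (X_i - X_bar)^2 / SSX."""
    return 1.0 / n + (x_i - x_bar) ** 2 / ssx

def prediction_interval(y_hat, t_critical, s_yx, h_i):
    """Return (lower, upper) = y_hat -/+ t * S_YX * sqrt(1 + h_i)."""
    half_width = t_critical * s_yx * math.sqrt(1.0 + h_i)
    return y_hat - half_width, y_hat + half_width

y_hat, t_critical, s_yx = 7.6439, 2.1788, 0.9664
h_i = leverage(x_i=4.0, x_bar=2.9214, ssx=37.9236, n=14)

lower, upper = prediction_interval(y_hat, t_critical, s_yx, h_i)
pi_half_width = (upper - lower) / 2
ci_half_width = t_critical * s_yx * math.sqrt(h_i)  # half-width from Equation (13.20)

print(f"{lower:.3f} <= Y at X = 4 <= {upper:.3f}")
print(f"prediction half-width {pi_half_width:.3f} vs. mean half-width {ci_half_width:.3f}")
```

The prediction half-width is always the larger of the two, since it adds the variation of an individual value around the mean to the uncertainty in estimating that mean.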

FIGURE 13.21 Microsoft Excel confidence interval estimate and prediction interval for the Sunflowers Apparel data

See Section E13.5 to create the worksheet.

Learning the Basics

13.55 Based on a sample of n = 20, the least-squares method was used to develop the following prediction line: $\hat{Y}_i = 5 + 3X_i$. In addition,

$$S_{YX} = 1.0 \qquad \bar{X} = 2 \qquad \sum_{i=1}^{n} (X_i - \bar{X})^2 = 20$$

a. Construct a 95% confidence interval estimate of the mean response for X = 2.

b. Construct a 95% prediction interval of an individual response for X = 2.

13.56 Based on a sample of n = 20, the least-squares method was used to develop the following prediction line: $\hat{Y}_i = 5 + 3X_i$. In addition,

$$S_{YX} = 1.0 \qquad \bar{X} = 2 \qquad \sum_{i=1}^{n} (X_i - \bar{X})^2 = 20$$

a. Construct a 95% confidence interval estimate of the mean response for X = 4.

b. Construct a 95% prediction interval of an individual response for X = 4.

c. Compare the results of (a) and (b) with those of Problem 13.55 (a) and (b). Which interval is wider? Why?

Figure 13.21 is a Microsoft Excel worksheet that illustrates the confidence interval estimate and the prediction interval for the Sunflowers Apparel problem. If you compare the results of the confidence interval estimate and the prediction interval, you see that the width of the prediction interval for an individual store is much wider than the confidence interval estimate for the mean. Remember that there is much more variation in predicting an individual value than in estimating a mean value.

Applying the Concepts

13.57 In Problem 13.5 on page 522, you used reported sales to predict audited sales of magazines. The data are stored in the data file. For these data $S_{YX} = 42.186$ and $h_i = 0.108$ when X = 400.

a. Construct a 95% confidence interval estimate of the mean audited sales for magazines that report newsstand sales of 400,000.

b. Construct a 95% prediction interval of the audited sales for an individual magazine that reports newsstand sales of 400,000.

c. Explain the difference in the results in (a) and (b).

13.58 In Problem 13.4 on page 522, the marketing manager used shelf space for pet food to predict weekly sales. The data are stored in the data file. For these data $S_{YX} = 30.81$ and $h_i = 0.1373$ when X = 8.

a. Construct a 95% confidence interval estimate of the mean weekly sales for all stores that have 8 feet of shelf space for pet food.

b. Construct a 95% prediction interval of the weekly sales of an individual store that has 8 feet of shelf space for pet food.

c. Explain the difference in the results in (a) and (b).(a) and (b). Which interval is wider? Why?

13.59 In Problem 13.7 on page 523, you used the weight of mail to predict the number of orders received. The data are stored in the data file.

a. Construct a 95% confidence interval estimate of the mean number of orders received for all packages with a weight of 500 pounds.

b. Construct a 95% prediction interval of the number of orders received for an individual package with a weight of 500 pounds.

c. Explain the difference in the results in (a) and (b).

13.60 In Problem 13.6 on page 522, the owner of a moving company wanted to predict labor hours based on the number of cubic feet moved. The data are stored in the data file.

a. Construct a 95% confidence interval estimate of the mean labor hours for all moves of 500 cubic feet.

b. Construct a 95% prediction interval of the labor hours of an individual move that has 500 cubic feet.

c. Explain the difference in the results in (a) and (b).

13.61 In Problem 13.9 on page 523, an agent for a real estate company wanted to predict the monthly rent for apartments, based on the size of the apartment. The data are stored in the data file.

a. Construct a 95% confidence interval estimate of the mean monthly rental for all apartments that are 1,000 square feet in size.

b. Construct a 95% prediction interval of the monthly rental of an individual apartment that is 1,000 square feet in size.

c. Explain the difference in the results in (a) and (b).

13.62 In Problem 13.8 on page 523, you predicted the value of a baseball franchise, based on current revenue. The data are stored in the data file.

a. Construct a 95% confidence interval estimate of the mean value of all baseball franchises that generate $150 million of annual revenue.

b. Construct a 95% prediction interval of the value of an individual baseball franchise that generates $150 million of annual revenue.

c. Explain the difference in the results in (a) and (b).

13.63 In Problem 13.10 on page 523, you used hardness to predict the tensile strength of die-cast aluminum. The data are stored in the accompanying data file.

a. Construct a 95% confidence interval estimate of the mean tensile strength for all specimens with a hardness of 30 Rockwell E units.

b. Construct a 95% prediction interval of the tensile strength for an individual specimen that has a hardness of 30 Rockwell E units.

c. Explain the difference in the results in (a) and (b).

13.9 PITFALLS IN REGRESSION AND ETHICAL ISSUES

Some of the pitfalls involved in using regression analysis are as follows:

• Lacking an awareness of the assumptions of least-squares regression
• Not knowing how to evaluate the assumptions of least-squares regression
• Not knowing what the alternatives to least-squares regression are if a particular assumption is violated
• Using a regression model without knowledge of the subject matter
• Extrapolating outside the relevant range
• Concluding that a significant relationship identified in an observational study is due to a cause-and-effect relationship

The widespread availability of spreadsheet and statistical software has made regression analysis much more feasible. However, for many users, this enhanced availability of software has not been accompanied by an understanding of how to use regression analysis properly. Someone who is not familiar with either the assumptions of regression or how to evaluate the assumptions cannot be expected to know what the alternatives to least-squares regression are if a particular assumption is violated.

The data in Table 13.7 (stored in the accompanying data file) illustrate the importance of using scatter plots and residual analysis to go beyond the basic number crunching of computing the Y intercept, the slope, and $r^2$.


TABLE 13.7 Four Sets of Artificial Data

  Data Set A        Data Set B        Data Set C        Data Set D
  X_i    Y_i        X_i    Y_i        X_i    Y_i        X_i    Y_i
  10     8.04       10     9.14       10     7.46        8     6.58
  14     9.96       14     8.10       14     8.84        8     5.76
   5     5.68        5     4.74        5     5.73        8     7.71
   8     6.95        8     8.14        8     6.77        8     8.84
   9     8.81        9     8.77        9     7.11        8     8.47
  12    10.84       12     9.13       12     8.15        8     7.04
   4     4.26        4     3.10        4     5.39        8     5.25
   7     4.82        7     7.26        7     6.42       19    12.50
  11     8.33       11     9.26       11     7.81        8     5.56
  13     7.58       13     8.74       13    12.74        8     7.91
   6     7.24        6     6.13        6     6.08        8     6.89

Source: Extracted from F. J. Anscombe, "Graphs in Statistical Analysis," American Statistician, Vol. 27 (1973), pp. 17-21.

Anscombe (reference 1) showed that all four data sets given in Table 13.7 have the following identical results:

$$\hat{Y}_i = 3.0 + 0.5X_i$$

$$S_{YX} = 1.237$$

$$S_{b_1} = 0.118$$

$$r^2 = 0.667$$

$$SSR = \text{Explained variation} = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 = 27.51$$

$$SSE = \text{Unexplained variation} = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = 13.76$$

$$SST = \text{Total variation} = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 = 41.27$$

Thus, with respect to these statistics associated with a simple linear regression analysis, the four data sets are identical. Were you to stop the analysis at this point, you would fail to observe the important differences among the four data sets. By examining the scatter plots for the four data sets in Figure 13.22 on page 552, and their residual plots in Figure 13.23 on page 552, you can clearly see that each of the four data sets has a different relationship between X and Y.

From the scatter plots of Figure 13.22 and the residual plots of Figure 13.23, you see how different the data sets are. The only data set that seems to follow an approximate straight line is data set A. The residual plot for data set A does not show any obvious patterns or outlying residuals. This is certainly not true for data sets B, C, and D. The scatter plot for data set B shows that a quadratic regression model (see Section 15.1) is more appropriate. This conclusion is reinforced by the residual plot for data set B. The scatter plot and the residual plot for data set C clearly show an outlying observation. If this is the case, you may want to remove the outlier and reestimate the regression model (see reference 4). Similarly, the scatter plot for data set D represents the situation in which the model is heavily dependent on the outcome of a single response ($X_8 = 19$ and $Y_8 = 12.50$). You would have to cautiously evaluate any regression model because its regression coefficients are heavily dependent on a single observation.
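The identical summary results for the four data sets can be checked directly. The following is a minimal sketch in plain Python, not part of the original text: the function name `summarize` and the dictionary keys are illustrative choices, the data are the Table 13.7 values, and the standard error of the slope is computed as $S_{b_1} = S_{YX}/\sqrt{SSX}$, the form used for the t test of the slope.

```python
from math import sqrt

# Table 13.7 (Anscombe's data); data sets A, B, and C share the same X values.
X_ABC = [10, 14, 5, 8, 9, 12, 4, 7, 11, 13, 6]
DATA_SETS = {
    "A": (X_ABC, [8.04, 9.96, 5.68, 6.95, 8.81, 10.84, 4.26, 4.82, 8.33, 7.58, 7.24]),
    "B": (X_ABC, [9.14, 8.10, 4.74, 8.14, 8.77, 9.13, 3.10, 7.26, 9.26, 8.74, 6.13]),
    "C": (X_ABC, [7.46, 8.84, 5.73, 6.77, 7.11, 8.15, 5.39, 6.42, 7.81, 12.74, 6.08]),
    "D": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
          [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Return simple linear regression summary statistics for one data set."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ssx = sum((xi - x_bar) ** 2 for xi in x)
    ssxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = ssxy / ssx                      # slope
    b0 = y_bar - b1 * x_bar              # Y intercept
    y_hat = [b0 + b1 * xi for xi in x]   # predicted values
    sst = sum((yi - y_bar) ** 2 for yi in y)
    ssr = sum((yh - y_bar) ** 2 for yh in y_hat)
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
    s_yx = sqrt(sse / (n - 2))           # standard error of the estimate
    s_b1 = s_yx / sqrt(ssx)              # standard error of the slope
    return {"b0": b0, "b1": b1, "r2": ssr / sst, "s_yx": s_yx,
            "s_b1": s_b1, "ssr": ssr, "sse": sse, "sst": sst}


if __name__ == "__main__":
    for name, (x, y) in DATA_SETS.items():
        stats = summarize(x, y)
        print(name, {key: round(value, 3) for key, value in stats.items()})
```

The four printed rows agree to about two decimal places, which is exactly why the numbers alone cannot distinguish a straight-line pattern from a curve, an outlier, or a single influential point. Only the plots do that.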

FIGURE 13.22 Scatter plots for four data sets

FIGURE 13.23 Residual plots for four data sets


In summary, scatter plots and residual plots are of vital importance to a complete regression analysis. The information they provide is so basic to a credible analysis that you should always include these graphical methods as part of a regression analysis. Thus, a strategy that you can use to help avoid the pitfalls of regression is as follows:

1. Start with a scatter plot to observe the possible relationship between X and Y.
2. Check the assumptions of regression before moving on to using the results of the model.
3. Plot the residuals versus the independent variable to determine whether the linear model is appropriate and to check the equal-variance assumption.
4. Use a histogram, stem-and-leaf display, box-and-whisker plot, or normal probability plot of the residuals to check the normality assumption.
5. If you collected the data over time, plot the residuals versus time and use the Durbin-Watson test to check the independence assumption.
6. If there are violations of the assumptions, use alternative methods to least-squares regression or alternative least-squares models.
7. If there are no violations of the assumptions, carry out tests for the significance of the regression coefficients and develop confidence and prediction intervals.
8. Avoid making predictions and forecasts outside the relevant range of the independent variable.
9. Keep in mind that the relationships identified in observational studies may or may not be due to cause-and-effect relationships. Remember that while causation implies correlation, correlation does not imply causation.
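The numerical parts of this strategy (the residuals used in steps 3 through 5, the Durbin-Watson statistic in step 5, and the relevant-range check in step 8) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper names and the small demonstration data set are hypothetical and do not come from the chapter, and the plots themselves are left to whatever software you use.

```python
def fit_line(x, y):
    """Least-squares Y intercept b0 and slope b1."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ssx = sum((xi - x_bar) ** 2 for xi in x)
    ssxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = ssxy / ssx
    return y_bar - b1 * x_bar, b1


def residuals(x, y, b0, b1):
    """Residuals e_i = Y_i - (b0 + b1 * X_i), in the order the data were given."""
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]


def durbin_watson(e):
    """Durbin-Watson statistic D for residuals listed in time order."""
    numerator = sum((e[i] - e[i - 1]) ** 2 for i in range(1, len(e)))
    denominator = sum(ei ** 2 for ei in e)
    return numerator / denominator


def in_relevant_range(x_new, x):
    """True when x_new lies between the smallest and largest observed X."""
    return min(x) <= x_new <= max(x)


# Hypothetical data, assumed to be recorded in time order (illustration only).
X_DEMO = [1, 2, 3, 4, 5, 6]
Y_DEMO = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]

if __name__ == "__main__":
    b0, b1 = fit_line(X_DEMO, Y_DEMO)
    e = residuals(X_DEMO, Y_DEMO, b0, b1)
    print("residuals:", [round(ei, 3) for ei in e])
    print("Durbin-Watson D:", round(durbin_watson(e), 3))
    print("X = 10 inside relevant range?", in_relevant_range(10, X_DEMO))
```

The residual list is what you would plot against X (step 3), summarize with a histogram or normal probability plot (step 4), and plot against time (step 5). The computed D still has to be compared with the tabled critical values before drawing a conclusion about autocorrelation.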


Perhaps you are familiar with the TV competition organized by model Tyra Banks to find "America's top model." You may be less familiar with another set of top models that are emerging from the business world.

In a Business Week article from its January 23, 2006, edition (S. Baker, "Why Math Will Rock Your World: More Math Geeks Are Calling the Shots in Business. Is Your Industry Next?" Business Week, pp. 54-62), Stephen Baker talks about how "quants" turned finance upside down and are moving on to other business fields. The name quants derives from the fact that "math geeks" develop models and forecasts by using "quantitative methods." These methods are built on the principles of regression analysis discussed in this chapter, although the actual models are much more complicated than the simple linear models discussed in this chapter.

Regression-based models have become the top models for many types of business analyses. Some examples include

• Advertising and marketing: Managers use econometric models (in other words, regression models) to determine the effect of an advertisement on sales, based on a set of factors. Also, managers use data mining to predict patterns of behavior of what customers will buy in the future, based on historic information about the consumer.
• Finance: Any time you read about a financial "model," you should understand that some type of regression model is being used. For example, a New York Times article on June 18, 2006, titled "An Old Formula That Points to New Worry" by Mark Hulbert (p. BU8) discusses a market timing model that predicts the return of stocks in the next three to five years, based on the dividend yield of the stock market and the interest rate of 90-day Treasury bills.
• Food and beverage: Believe it or not, Enologix, a California consulting company, has developed a "formula" (a regression model) that predicts a wine's quality index, based on a set of chemical compounds found in the wine (see D. Darlington, "The Chemistry of a 90+ Wine," The New York Times Magazine, August 7, 2005, pp. 36-39).
• Publishing: A study of the effect of price changes at Amazon.com and BN.com on sales (again, regression analysis) found that a 1% price change at BN.com pushed sales down 4%, but it pushed sales down only 0.5% at Amazon.com. (You can download the paper at http://gsbadg.uchicago.edu/vitae.htm.)
• Transportation: Farecast.com uses data mining and predictive technologies to objectively predict airfare pricing (see D. Darlin, "Airfares Made Easy (Or Easier)," The New York Times, July 1, 2006, pp. C1, C6).
• Real estate: Zillow.com uses information about the features contained in a home and its location to develop estimates about the market value of the home, using a "formula" built with a proprietary algorithm.

In the article, Baker stated that statistics and probability will become core skills for businesspeople and consumers. Those who are successful will know how to use statistics, whether they are building financial models or making marketing plans. He also strongly endorsed the need for everyone in business to have knowledge of Microsoft Excel to be able to produce statistical analysis and reports.


As you can see from the chapter roadmap in Figure 13.24, this chapter develops the simple linear regression model and discusses the assumptions and how to evaluate them. Once you are assured that the model is appropriate, you can predict values by using the prediction line and test for the significance of the slope.

FIGURE 13.24 Roadmap for simple linear regression (flowchart: scatter plot, least-squares regression analysis and prediction line, residual analysis, Durbin-Watson statistic when the data are collected in sequential order, test of $H_0\colon \beta_1 = 0$, then use of the model for estimation and prediction)

You have learned how the director of planning for a chain of stores can use regression analysis to investigate the relationship between the size of a store and its annual sales. You have used this analysis to make better decisions when selecting new sites for stores as well as to forecast sales for existing stores. In Chapter 14, regression analysis is extended to situations in which more than one independent variable is used to predict the value of a dependent variable.

Key Equations

Simple Linear Regression Model

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i \qquad (13.1)$$

Simple Linear Regression Equation: The Prediction Line

$$\hat{Y}_i = b_0 + b_1 X_i \qquad (13.2)$$

Computational Formula for the Slope, $b_1$

$$b_1 = \frac{SSXY}{SSX} \qquad (13.3)$$

Computational Formula for the Y Intercept, $b_0$

$$b_0 = \bar{Y} - b_1 \bar{X} \qquad (13.4)$$

Measures of Variation in Regression

$$SST = SSR + SSE \qquad (13.5)$$

Total Sum of Squares (SST)

$$SST = \text{Total sum of squares} = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 \qquad (13.6)$$

Regression Sum of Squares (SSR)

$$SSR = \text{Explained variation or regression sum of squares} = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 \qquad (13.7)$$

Error Sum of Squares (SSE)

$$SSE = \text{Unexplained variation or error sum of squares} = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 \qquad (13.8)$$

Coefficient of Determination

$$r^2 = \frac{\text{Regression sum of squares}}{\text{Total sum of squares}} = \frac{SSR}{SST} \qquad (13.9)$$

Computational Formula for SST

$$SST = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum_{i=1}^{n} Y_i^2 - \frac{\left(\sum_{i=1}^{n} Y_i\right)^2}{n} \qquad (13.10)$$

Computational Formula for SSR

$$SSR = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 = b_0 \sum_{i=1}^{n} Y_i + b_1 \sum_{i=1}^{n} X_i Y_i - \frac{\left(\sum_{i=1}^{n} Y_i\right)^2}{n} \qquad (13.11)$$

Computational Formula for SSE

$$SSE = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} Y_i^2 - b_0 \sum_{i=1}^{n} Y_i - b_1 \sum_{i=1}^{n} X_i Y_i \qquad (13.12)$$

Standard Error of the Estimate

$$S_{YX} = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{n-2}} \qquad (13.13)$$

Residual

$$e_i = Y_i - \hat{Y}_i \qquad (13.14)$$

Durbin-Watson Statistic

$$D = \frac{\sum_{i=2}^{n} (e_i - e_{i-1})^2}{\sum_{i=1}^{n} e_i^2} \qquad (13.15)$$

Testing a Hypothesis for a Population Slope, $\beta_1$, Using the t Test

$$t = \frac{b_1 - \beta_1}{S_{b_1}} \qquad (13.16)$$

Testing a Hypothesis for a Population Slope, $\beta_1$, Using the F Test

$$F = \frac{MSR}{MSE} \qquad (13.17)$$

Confidence Interval Estimate of the Slope, $\beta_1$

$$b_1 \pm t_{n-2} S_{b_1}$$

$$b_1 - t_{n-2} S_{b_1} \le \beta_1 \le b_1 + t_{n-2} S_{b_1} \qquad (13.18)$$

Testing for the Existence of Correlation

$$t = \frac{r - \rho}{\sqrt{\dfrac{1 - r^2}{n - 2}}} \qquad (13.19)$$

Confidence Interval Estimate for the Mean of Y

$$\hat{Y}_i \pm t_{n-2} S_{YX} \sqrt{h_i}$$

$$\hat{Y}_i - t_{n-2} S_{YX} \sqrt{h_i} \le \mu_{Y|X=X_i} \le \hat{Y}_i + t_{n-2} S_{YX} \sqrt{h_i} \qquad (13.20)$$

Prediction Interval for an Individual Response, Y

$$\hat{Y}_i \pm t_{n-2} S_{YX} \sqrt{1 + h_i}$$

$$\hat{Y}_i - t_{n-2} S_{YX} \sqrt{1 + h_i} \le Y_{X=X_i} \le \hat{Y}_i + t_{n-2} S_{YX} \sqrt{1 + h_i} \qquad (13.21)$$

Key Terms

assumptions of regression
autocorrelation
coefficient of determination
confidence interval estimate for the mean response
correlation coefficient
dependent variable
Durbin-Watson statistic
error sum of squares (SSE)
equal variance
explained variation
explanatory variable
homoscedasticity
independence of errors
independent variable
least-squares method
linear relationship
normality
prediction interval for an individual response, Y
prediction line
regression analysis
regression coefficient
regression sum of squares (SSR)
relevant range
residual
residual analysis
response variable
scatter diagram
scatter plot
simple linear regression
simple linear regression equation
slope
standard error of the estimate
total sum of squares (SST)
total variation
unexplained variation
Y intercept
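The confidence interval for the mean response and the prediction interval for an individual response differ only in the term under the square root, $h_i$ versus $1 + h_i$. The sketch below illustrates this in plain Python. It assumes the definition $h_i = 1/n + (X_i - \bar{X})^2 / SSX$ from Section 13.8, and it takes the critical value $t_{n-2}$ as an argument that you look up in a t table. The function name, the returned keys, and the small data set are hypothetical and used only for illustration.

```python
from math import sqrt


def interval_estimates(x, y, x_new, t_crit):
    """Confidence interval for the mean of Y and prediction interval for an
    individual Y at X = x_new.

    t_crit is the critical value of the t distribution with n - 2 degrees of
    freedom for the desired confidence level (taken from a t table).
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ssx = sum((xi - x_bar) ** 2 for xi in x)
    ssxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = ssxy / ssx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s_yx = sqrt(sse / (n - 2))                  # standard error of the estimate
    h = 1 / n + (x_new - x_bar) ** 2 / ssx      # h_i (assumed Section 13.8 form)
    y_hat = b0 + b1 * x_new
    ci_half = t_crit * s_yx * sqrt(h)           # half-width for the mean response
    pi_half = t_crit * s_yx * sqrt(1 + h)       # half-width for an individual Y
    return {
        "y_hat": y_hat,
        "h": h,
        "confidence_interval": (y_hat - ci_half, y_hat + ci_half),
        "prediction_interval": (y_hat - pi_half, y_hat + pi_half),
    }


if __name__ == "__main__":
    # Hypothetical data: n = 6, so n - 2 = 4 degrees of freedom.
    # For 95% confidence, the t table gives t = 2.7764 with 4 degrees of freedom.
    x_demo = [1, 2, 3, 4, 5, 6]
    y_demo = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]
    result = interval_estimates(x_demo, y_demo, x_new=3.5, t_crit=2.7764)
    for key, value in result.items():
        print(key, value)
```

Because $1 + h_i > h_i$, the prediction interval is always wider than the confidence interval at the same X. Both intervals widen as `x_new` moves away from $\bar{X}$, since $h_i$ grows with $(X_i - \bar{X})^2$.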

Checking Your Understanding

13.64 What is the interpretation of the Y intercept and the slope in the simple linear regression equation?

13.65 What is the interpretation of the coefficient of determination?

13.66 When is the unexplained variation (that is, error sum of squares) equal to 0?

13.67 When is the explained variation (that is, regression sum of squares) equal to 0?

13.68 Why should you always carry out a residual analysis as part of a regression model?

13.69 What are the assumptions of regression analysis?

13.70 How do you evaluate the assumptions of regression analysis?

13.71 When and how do you use the Durbin-Watson statistic?

13.72 What is the difference between a confidence interval estimate of the mean response, $\mu_{Y|X=X_i}$, and a prediction interval of $Y_{X=X_i}$?

Applying the Concepts

13.73 Researchers from the Lubin School of Business at Pace University in New York City conducted a study on Internet-supported courses. In one part of the study, four numerical variables were collected on 108 students in an introductory management course that met once a week for an entire semester. One variable collected was hit consistency. To measure hit consistency, the researchers did the following: If a student did not visit the Internet site between classes, the student was given a 0 for that time period. If a student visited the Internet site one or more times between classes, the student was given a 1 for that time period. Because there were 13 time periods, a student's score on hit consistency could range from 0 to 13.

The other three variables included the student's course average, the student's cumulative grade point average (GPA), and the total number of hits the student had on the Internet site supporting the course. The following table gives the correlation coefficient for all pairs of variables. Note that correlations marked with an * are statistically significant, using $\alpha = 0.001$:

  Variable                                Correlation
  Course Average, Cumulative GPA          0.72*
  Course Average, Total Hits              0.08
  Course Average, Hit Consistency         0.37*
  Cumulative GPA, Total Hits              0.12
  Cumulative GPA, Hit Consistency         0.32*
  Total Hits, Hit Consistency             0.64*

Source: Extracted from D. Baugher, A. Varanelli, and E. Weisbord, "Student Hits in an Internet-Supported Course: How Can Instructors Use Them and What Do They Mean?" Decision Sciences Journal of Innovative Education, Fall 2003, 1(2), pp. 159-179.

a. What conclusions can you reach from this correlation matrix?

b. Are you surprised by the results, or are they consistent with your own observations and experiences?

13.74 Management of a soft-drink bottling company wants to develop a method for allocating delivery costs to customers. Although one cost clearly relates to travel time within a particular route, another variable cost reflects the time required to unload the cases of soft drink at the delivery point. A sample of 20 deliveries within a territory was selected. The delivery times and the numbers of cases delivered were recorded in the accompanying data file:

  Customer   Number of Cases   Delivery Time (Minutes)
   1          52               32.1
   2          [illegible]      34.8
   3          73               36.2
   4          85               37.8
   5          95               37.8
   6         103               39.7
   7         116               38.5
   8         121               41.9
   9         143               44.2
  10         157               47.1
  11         161               43.0
  12         184               49.4
  13         202               57.2
  14         218               56.8
  15         243               60.6
  16         254               61.2
  17         267               58.2
  18         275               63.1
  19         287               65.6
  20         298               67.3

Develop a regression model to predict delivery time, based on the number of cases delivered.

a. Use the least-squares method to compute the regression coefficients $b_0$ and $b_1$.

b. Interpret the meaning of $b_0$ and $b_1$ in this problem.

c. Predict the delivery time for 150 cases of soft drink.

d. Should you use the model to predict the delivery time for a customer who is receiving 500 cases of soft drink? Why or why not?

e. Determine the coefficient of determination, $r^2$, and explain its meaning in this problem.

f. Perform a residual analysis. Is there any evidence of a pattern in the residuals? Explain.

g. At the 0.05 level of significance, is there evidence of a linear relationship between delivery time and the number of cases delivered?

h. Construct a 95% confidence interval estimate of the mean delivery time for 150 cases of soft drink.

i. Construct a 95% prediction interval of the delivery time for a single delivery of 150 cases of soft drink.

j. Construct a 95% confidence interval estimate of the population slope.

k. Explain how the results in (a) through (j) can help allocate delivery costs to customers.

13.75 A brokerage house wants to predict the number of trade executions per day, using the number of incoming phone calls as a predictor variable. Data were collected over a period of 35 days and are stored in the accompanying data file.

a. Use the least-squares method to compute the regression coefficients $b_0$ and $b_1$.

b. Interpret the meaning of $b_0$ and $b_1$ in this problem.

c. Predict the number of trades executed for a day in which the number of incoming calls is 2,000.

d. Should you use the model to predict the number of trades executed for a day in which the number of incoming calls is 5,000? Why or why not?

e. Determine the coefficient of determination, $r^2$, and explain its meaning in this problem.

f. Plot the residuals against the number of incoming calls and also against the days. Is there any evidence of a pattern in the residuals with either of these variables? Explain.

g. Determine the Durbin-Watson statistic for these data.

h. Based on the results of (f) and (g), is there reason to question the validity of the model? Explain.

i. At the 0.05 level of significance, is there evidence of a linear relationship between the volume of trade executions and the number of incoming calls?

j. Construct a 95% confidence interval estimate of the mean number of trades executed for days in which the number of incoming calls is 2,000.

k. Construct a 95% prediction interval of the number of trades executed for a particular day in which the number of incoming calls is 2,000.

l. Construct a 95% confidence interval estimate of the population slope.

m. Based on the results of (a) through (l), do you think the brokerage house should focus on a strategy of increasing the total number of incoming calls or on a strategy that relies on trading by a small number of heavy traders? Explain.

13.76 You want to develop a model to predict the selling price of homes based on assessed value. A sample of 30 recently sold single-family houses in a small city is selected to study the relationship between selling price (in thousands of dollars) and assessed value (in thousands of dollars). The houses in the city had been reassessed at full value one year prior to the study. The results are in the accompanying data file.

(Hint: First, determine which are the independent and dependent variables.)

a. Construct a scatter plot and, assuming a linear relationship, use the least-squares method to compute the regression coefficients $b_0$ and $b_1$.

b. Interpret the meaning of the Y intercept, $b_0$, and the slope, $b_1$, in this problem.

c. Use the prediction line developed in (a) to predict the selling price for a house whose assessed value is $170,000.

d. Determine the coefficient of determination, $r^2$, and interpret its meaning in this problem.

e. Perform a residual analysis on your results and determine the adequacy of the fit of the model.

f. At the 0.05 level of significance, is there evidence of a linear relationship between selling price and assessed value?

g. Construct a 95% confidence interval estimate of the mean selling price for houses with an assessed value of $170,000.

h. Construct a 95% prediction interval of the selling price of an individual house with an assessed value of $170,000.

i. Construct a 95% confidence interval estimate of the population slope.

13.77 You want to develop a model to predict the assessed value of houses, based on heating area. A sample of 15 single-family houses is selected in a city. The assessed value (in thousands of dollars) and the heating area of the houses (in thousands of square feet) are recorded, with the following results, stored in the accompanying data file:

  House   Assessed Value ($000)   Heating Area of Dwelling (Thousands of Square Feet)
   1      184.4                   2.00
   2      177.4                   1.71
   3      175.7                   1.45
   4      185.9                   1.76
   5      179.1                   1.93
   6      170.4                   1.20
   7      175.8                   1.55
   8      185.9                   1.93
   9      178.5                   1.59
  10      179.2                   1.50
  11      186.7                   1.90
  12      179.3                   1.39
  13      174.5                   1.54
  14      183.8                   1.89
  15      176.8                   1.59

(Hint: First, determine which are the independent and dependent variables.)

a. Construct a scatter plot and, assuming a linear relationship, use the least-squares method to compute the regression coefficients $b_0$ and $b_1$.

b. Interpret the meaning of the Y intercept, $b_0$, and the slope, $b_1$, in this problem.

c. Use the prediction line developed in (a) to predict the assessed value for a house whose heating area is 1,750 square feet.

d. Determine the coefficient of determination, $r^2$, and interpret its meaning in this problem.

e. Perform a residual analysis on your results and determine the adequacy of the fit of the model.

f. At the 0.05 level of significance, is there evidence of a linear relationship between assessed value and heating area?

g. Construct a 95% confidence interval estimate of the mean assessed value for houses with a heating area of 1,750 square feet.

h. Construct a 95% prediction interval of the assessed value of an individual house with a heating area of 1,750 square feet.

i. Construct a 95% confidence interval estimate of the population slope.

13.78 The director of graduate studies at a large college of business would like to predict the grade point average (GPA) of students in an MBA program based on the Graduate Management Admission Test (GMAT) score. A sample of 20 students who had completed 2 years in the program is selected. The results are stored in the accompanying data file:

  Observation   GMAT Score   GPA
   1            688          3.72
   2            647          3.44
   3            652          3.21
   4            608          3.29
   5            680          3.91
   6            617          3.28
   7            557          3.02
   8            599          3.13
   9            616          3.45
  10            594          3.33
  11            567          3.07
  12            542          2.86
  13            551          2.91
  14            573          2.79
  15            536          3.00
  16            639          3.55
  17            619          3.47
  18            694          3.60
  19            718          3.88
  20            759          3.76

(Hint: First, determine which are the independent and dependent variables.)

a. Construct a scatter plot and, assuming a linear relationship, use the least-squares method to compute the regression coefficients $b_0$ and $b_1$.

b. Interpret the meaning of the Y intercept, $b_0$, and the slope, $b_1$, in this problem.

c. Use the prediction line developed in (a) to predict the GPA for a student with a GMAT score of 600.

d. Determine the coefficient of determination, $r^2$, and interpret its meaning in this problem.

e. Perform a residual analysis on your results and determine the adequacy of the fit of the model.

f. At the 0.05 level of significance, is there evidence of a linear relationship between GMAT score and GPA?

g. Construct a 95% confidence interval estimate of the mean GPA of students with a GMAT score of 600.

h. Construct a 95% prediction interval of the GPA for a particular student with a GMAT score of 600.

i. Construct a 95% confidence interval estimate of the population slope.

13.79 The manager of the purchasing department of a banking organization would like to develop a model to predict the amount of time it takes to process invoices. Data are collected from a sample of 30 days, and the number of invoices processed and completion time, in hours, is stored in the accompanying data file.

(Hint: First, determine which are the independent and dependent variables.)

a. Assuming a linear relationship, use the least-squares method to compute the regression coefficients $b_0$ and $b_1$.

b. Interpret the meaning of the Y intercept, $b_0$, and the slope, $b_1$, in this problem.

c. Use the prediction line developed in (a) to predict the amount of time it would take to process 150 invoices.

d. Determine the coefficient of determination, $r^2$, and interpret its meaning.

e. Plot the residuals against the number of invoices processed and also against time.

f. Based on the plots in (e), does the model seem appropriate?

g. Compute the Durbin-Watson statistic and, at the 0.05 level of significance, determine whether there is any autocorrelation in the residuals.

h. Based on the results of (e) through (g), what conclusions can you reach concerning the validity of the model?

i. At the 0.05 level of significance, is there evidence of a linear relationship between the amount of time and the number of invoices processed?

j. Construct a 95% confidence interval estimate of the mean amount of time it would take to process 150 invoices.

k. Construct a 95% prediction interval of the amount of time it would take to process 150 invoices on a particular day.

13.80 On January 28, 1986, the space shuttle Challenger exploded, and seven astronauts were killed. Prior to the launch, the predicted atmospheric temperature was for freezing weather at the launch site. Engineers for Morton Thiokol (the manufacturer of the rocket motor) prepared charts to make the case that the launch should not take place due to the cold weather. These arguments were rejected, and the launch tragically took place. Upon investigation after the tragedy, experts agreed that the disaster occurred because of leaky rubber O-rings that did not seal properly due to the cold temperature. Data indicating the atmospheric temperature at the time of 23 previous launches and the O-ring damage index are stored in the accompanying data file:

  Flight Number   Temperature (°F)   O-Ring Damage Index
  1               66                  0
  2               70                  4
  3               69                  0
  5               68                  0
  6               67                  0
  7               72                  0
  8               73                  0
  9               70                  0
  41-B            57                  4
  41-C            63                  2
  41-D            70                  4
  41-G            78                  0
  51-A            67                  0
  51-B            75                  0
  51-C            53                 11
  51-D            67                  0
  51-F            81                  0
  51-G            70                  0
  51-I            67                  0
  51-J            79                  0
  61-A            75                  4
  61-B            76                  0
  61-C            58                  4

Note: Data from flight 4 is omitted due to unknown O-ring condition.

Source: Extracted from Report of the Presidential Commission on the Space Shuttle Challenger Accident, Washington, DC, 1986, Vol. II (H1-H3) and Vol. IV (664), and Post Challenger Evaluation of Space Shuttle Risk Assessment and Management, Washington, DC, 1988, pp. 135-136.

a. Construct a scatter plot for the seven flights in which there was O-ring damage (O-ring damage index ≠ 0). What conclusions, if any, can you draw about the relationship between atmospheric temperature and O-ring damage?

b. Construct a scatter plot for all 23 flights.

c. Explain any differences in the interpretation of the relationship between atmospheric temperature and O-ring damage in (a) and (b).

d. Based on the scatter plot in (b), provide reasons why a prediction should not be made for an atmospheric temperature of 31°F, the temperature on the morning of the launch of the Challenger.

e. Although the assumption of a linear relationship may not be valid, fit a simple linear regression model to predict O-ring damage, based on atmospheric temperature.

f. Include the prediction line found in (e) on the scatter plot developed in (b).

g. Based on the results of (f), do you think a linear model is appropriate for these data? Explain.

h. Perform a residual analysis. What conclusions do you reach?


13.81 Crazy Dave, a well-known baseball analyst, would like to study various team statistics for the 2005 baseball season to determine which variables might be useful in predicting the number of wins achieved by teams during the season. He has decided to begin by using a team's earned run average (ERA), a measure of pitching performance, to predict the number of wins. The data for the 30 Major League Baseball teams are in the accompanying data file.

(Hint: First, determine which are the independent and dependent variables.)

a. Assuming a linear relationship, use the least-squares method to compute the regression coefficients $b_0$ and $b_1$.

b. Interpret the meaning of the Y intercept, $b_0$, and the slope, $b_1$, in this problem.

c. Use the prediction line developed in (a) to predict the number of wins for a team with an ERA of 4.50.

d. Compute the coefficient of determination, $r^2$, and interpret its meaning.

e. Perform a residual analysis on your results and determine the adequacy of the fit of the model.

f. At the 0.05 level of significance, is there evidence of a linear relationship between the number of wins and the ERA?

g. Construct a 95% confidence interval estimate of the mean number of wins expected for teams with an ERA of 4.50.

h. Construct a 95% prediction interval of the number of wins for an individual team that has an ERA of 4.50.

i. Construct a 95% confidence interval estimate of the slope.

j. The 30 teams constitute a population. In order to use statistical inference, as in (f) through (i), the data must be assumed to represent a random sample. What "population" would this sample be drawing conclusions about?

k. What other independent variables might you consider for inclusion in the model?

13.82 College football players trying out for the NFL are given the Wonderlic standardized intelligence test. The data in the accompanying data file contain the average Wonderlic scores of football players trying out for the NFL and the graduation rates for football players at selected schools (extracted from S. Walker, "The NFL's Smartest Team," The Wall Street Journal, September 30, 2005, pp. W1, W10). You plan to develop a regression model to predict the Wonderlic scores for football players trying out for the NFL, based on the graduation rate of the school they attended.

a. Assuming a linear relationship, use the least-squares method to compute the regression coefficients $b_0$ and $b_1$.

b. Interpret the meaning of the Y intercept, $b_0$, and the slope, $b_1$, in this problem.

c. Use the prediction line developed in (a) to predict the Wonderlic score for football players trying out for the NFL from a school that has a graduation rate of 50%.

d. Compute the coefficient of determination, $r^2$, and interpret its meaning.

e. Perform a residual analysis on your results and determine the adequacy of the fit of the model.

f. At the 0.05 level of significance, is there evidence of a linear relationship between the Wonderlic score for a football player trying out for the NFL from a school and the school's graduation rate?

g. Construct a 95% confidence interval estimate of the mean Wonderlic score for football players trying out for the NFL from a school that has a graduation rate of 50%.

h. Construct a 95% prediction interval of the Wonderlic score for a football player trying out for the NFL from a school that has a graduation rate of 50%.

i. Construct a 95% confidence interval estimate of the slope.

13.83 College basketball is big business, with coaches' salaries, revenues, and expenses in millions of dollars. The data in the accompanying data file contain the coaches' salaries and revenues for college basketball at selected schools in a recent year (extracted from R. Adams, "Pay for Playoffs," The Wall Street Journal, March 11-12, 2006, pp. P1, P8). You plan to develop a regression model to predict a coach's salary based on revenue.

a. Assuming a linear relationship, use the least-squares method to compute the regression coefficients $b_0$ and $b_1$.

b. Interpret the meaning of the Y intercept, $b_0$, and the slope, $b_1$, in this problem.

c. Use the prediction line developed in (a) to predict the coach's salary for a school that has revenue of $7 million.

d. Compute the coefficient of determination, $r^2$, and interpret its meaning.

e. Perform a residual analysis on your results and determine the adequacy of the fit of the model.

f. At the 0.05 level of significance, is there evidence of a linear relationship between the coach's salary for a school and revenue?

g. Construct a 95% confidence interval estimate of the mean salary of coaches at schools that have revenue of $7 million.

h. Construct a 95% prediction interval of the coach's salary for a school that has revenue of $7 million.

i. Construct a 95% confidence interval estimate of the slope.

13.84 During the fall harvest season in the United States, pumpkins are sold in large quantities at farm stands. Often, instead of weighing the pumpkins prior to sale, the farm stand operator will just place the pumpkin in the appropriate circular cutout on the counter. When asked why this was done, one farmer replied, "I can tell the weight of the pumpkin from its circumference." To determine whether this was really true, a sample of 23 pumpkins were measured for circumference and weighed, with the following results, stored in a data file:

Circumference (cm)    Weight (grams)
50                    1,200
[illegible]           2,000
54                    1,500
52                    1,700
37                    500
52                    1,000
53                    1,500
47                    1,400
51                    1,500
63                    2,500
33                    500
43                    1,000
57                    2,000
66                    2,500
82                    4,600
83                    4,600
70                    3,100
34                    600
51                    1,500
50                    1,500
49                    1,600
60                    2,300
59                    2,100

a. Assuming a linear relationship, use the least-squares method to compute the regression coefficients b0 and b1.

b. Interpret the meaning of the slope, b1, in this problem.

c. Predict the mean weight for a pumpkin that is 60 centimeters in circumference.

d. Do you think it is a good idea for the farmer to sell pumpkins by circumference instead of weight? Explain.

e. Determine the coefficient of determination, r², and interpret its meaning.

f. Perform a residual analysis for these data and determine the adequacy of the fit of the model.

g. At the 0.05 level of significance, is there evidence of a linear relationship between the circumference and the weight of a pumpkin?

h. Construct a 95% confidence interval estimate of the population slope, β1.

i. Construct a 95% confidence interval estimate of the population mean weight for pumpkins that have a circumference of 60 centimeters.

j. Construct a 95% prediction interval of the weight for an individual pumpkin that has a circumference of 60 centimeters.

13.85 Can demographic information be helpful in predicting sales of sporting goods stores? The data stored in the data file are the monthly sales totals from a random sample of 38 stores in a large chain of nationwide sporting goods stores. All stores in the franchise, and thus within the sample, are approximately the same size and carry the same merchandise. The county or, in some cases, counties in which the store draws the majority of its customers is referred to here as the customer base. For each of the 38 stores, demographic information about the customer base is provided. The data are real, but the name of the franchise is not used at the request of the company. The variables in the data set are

Sales - Latest one-month sales total (dollars)
Age - Median age of customer base (years)
HS - Percentage of customer base with a high school diploma
College - Percentage of customer base with a college diploma
Growth - Annual population growth rate of customer base over the past 10 years
Income - Median family income of customer base (dollars)

a. Construct a scatter plot, using sales as the dependent variable and median family income as the independent variable. Discuss the scatter diagram.

b. Assuming a linear relationship, use the least-squares method to compute the regression coefficients b0 and b1.

c. Interpret the meaning of the Y intercept, b0, and the slope, b1, in this problem.

d. Compute the coefficient of determination, r², and interpret its meaning.

e. Perform a residual analysis on your results and determine the adequacy of the fit of the model.

f. At the 0.05 level of significance, is there evidence of a linear relationship between the independent variable and the dependent variable?

g. Construct a 95% confidence interval estimate of the slope and interpret its meaning.

13.86 For the data of Problem 13.85, repeat (a) through (g), using median age as the independent variable.

13.87 For the data of Problem 13.85, repeat (a) through (g), using high school graduation rate as the independent variable.

13.88 For the data of Problem 13.85, repeat (a) through (g), using college graduation rate as the independent variable.

13.89 For the data of Problem 13.85, repeat (a) through (g), using population growth as the independent variable.

13.90 Zagat's publishes restaurant ratings for various locations in the United States. The data file contains the Zagat rating for food, decor, service, and the price per person for a sample of 50 restaurants located in an urban area (New York City) and 50 restaurants located in a suburb of New York City. Develop a regression model to predict the price per person, based on a variable that represents the sum of the ratings for food, decor, and service.

Source: Extracted from Zagat Survey 2002 New York City Restaurants and Zagat Survey 2001-2002, Long Island Restaurants.

a. Assuming a linear relationship, use the least-squares method to compute the regression coefficients b0 and b1.

b. Interpret the meaning of the Y intercept, b0, and the slope, b1, in this problem.

c. Use the prediction line developed in (a) to predict the price per person for a restaurant with a summated rating of 50.

d. Compute the coefficient of determination, r², and interpret its meaning.

e. Perform a residual analysis on your results and determine the adequacy of the fit of the model.

f. At the 0.05 level of significance, is there evidence of a linear relationship between the price per person and the summated rating?

g. Construct a 95% confidence interval estimate of the mean price per person for all restaurants with a summated rating of 50.

h. Construct a 95% prediction interval of the price per person for a restaurant with a summated rating of 50.

i. Construct a 95% confidence interval estimate of the slope.

j. How useful do you think the summated rating is as a predictor of price? Explain.

13.91 Refer to the discussion of beta values and market models in Problem 13.49 on pages 544-545. One hundred weeks of data, ending the week of May 22, 2006, for the S&P 500 and three individual stocks are included in the data file. Note that the weekly percentage change for both the S&P 500 and the individual stocks is measured as the percentage change from the previous week's closing value to the current week's closing value. The variables included are

Week - Current week
SP500 - Weekly percentage change in the S&P 500 Index
WALMART - Weekly percentage change in stock price of Wal-Mart Stores, Inc.
TARGET - Weekly percentage change in stock price of the Target Corporation
SARALEE - Weekly percentage change in stock price of the Sara Lee Corporation

Source: Extracted from finance.yahoo.com, May 31, 2006.

a. Estimate the market model for Wal-Mart Stores, Inc. (Hint: Use the percentage change in the S&P 500 Index as the independent variable and the percentage change in Wal-Mart Stores, Inc.'s stock price as the dependent variable.)

b. Interpret the beta value for Wal-Mart Stores, Inc.

c. Repeat (a) and (b) for Target Corporation.

d. Repeat (a) and (b) for Sara Lee Corporation.

e. Write a brief summary of your findings.

13.92 The data file contains the stock prices of four companies, collected weekly for 53 consecutive weeks, ending May 22, 2006. The variables are

Week - Closing date for stock prices
MSFT - Stock price of Microsoft, Inc.
Ford - Stock price of Ford Motor Company
GM - Stock price of General Motors, Inc.
IAL - Stock price of International Aluminum, Inc.

Source: Extracted from finance.yahoo.com, May 31, 2006.

a. Calculate the correlation coefficient, r, for each pair of stocks. (There are six of them.)

b. Interpret the meaning of r for each pair.

c. Is it a good idea to have all the stocks in an individual's portfolio be strongly positively correlated among each other? Explain.

13.93 Is the daily performance of stocks and bonds correlated? The data file contains information concerning the closing value of the Dow Jones Industrial Average and the Vanguard Long-Term Bond Index Fund for 60 consecutive business days, ending May 30, 2006. The variables included are

Date - Current day
Bonds - Closing price of Vanguard Long-Term Bond Index Fund
Stocks - Closing price of the Dow Jones Industrial Average

Source: Extracted from finance.yahoo.com, May 31, 2006.

a. Compute and interpret the correlation coefficient, r, for the variables Stocks and Bonds.

b. At the 0.05 level of significance, is there a relationship between these two variables? Explain.

Report Writing Exercises

13.94 In Problems 13.85-13.89 on page 561, you developed regression models to predict monthly sales at a sporting goods store. Now, write a report based on the models you developed. Append to your report all appropriate charts and statistical information.

Managing the Springville Herald

To ensure that as many trial subscriptions as possible are converted to regular subscriptions, the Herald marketing department works closely with the distribution department to accomplish a smooth initial delivery process for the trial subscription customers. To assist in this effort, the marketing department needs to accurately forecast the number of new regular subscriptions for the coming months.

A team consisting of managers from the marketing and distribution departments was convened to develop a better method of forecasting new subscriptions. Previously, after examining new subscription data for the prior three months, a group of three managers would develop a subjective forecast of the number of new subscriptions. Lauren Hall, who was recently hired by the company to provide special skills in quantitative forecasting methods, suggested that the department look for factors that might help in predicting new subscriptions.

Members of the team found that the forecasts in the past year had been particularly inaccurate because in some months, much more time was spent on telemarketing than in other months. In particular, in the past month, only 1,055 hours were completed because callers were busy during the first week of the month attending training sessions on the personal but formal greeting style and a new standard presentation guide (see "Managing the Springville Herald" in Chapter 11). Lauren collected data (stored in a data file) for the number of new subscriptions and hours spent on telemarketing for each month for the past two years.

EXERCISES

SH13.1 What criticism can you make concerning the method of forecasting that involved taking the new subscriptions data for the prior three months as the basis for future projections?

SH13.2 What factors other than number of telemarketing hours spent might be useful in predicting the number of new subscriptions? Explain.

SH13.3 a. Analyze the data and develop a regression model to predict the mean number of new subscriptions for a month, based on the number of hours spent on telemarketing for new subscriptions.

b. If you expect to spend 1,200 hours on telemarketing per month, estimate the mean number of new subscriptions for the month. Indicate the assumptions on which this prediction is based. Do you think these assumptions are valid? Explain.

c. What would be the danger of predicting the number of new subscriptions for a month in which 2,000 hours were spent on telemarketing?

Apply your knowledge of simple linear regression in this Web Case, which extends the Sunflowers Apparel Using Statistics scenario from this chapter.

Leasing agents from the Triangle Mall Management Corporation have suggested that Sunflowers consider several locations in some of Triangle's newly renovated lifestyle malls that cater to shoppers with higher-than-mean disposable income. Although the locations are smaller than the typical Sunflowers location, the leasing agents argue that higher-than-mean disposable income in the surrounding community is a better predictor of higher sales than store size. The leasing agents maintain that sample data from 14 Sunflowers stores prove that this is true.

Review the leasing agents' proposal and supporting documents that describe the data at the company's Web site, www.prenhall.com/Springville/Triangle_Sunflower.htm (or open this Web Case file from the Student CD-ROM's Web Case folder), and then answer the following:

1. Should mean disposable income be used to predict sales based on the sample of 14 Sunflowers stores?

2. Should the management of Sunflowers accept the claims of Triangle's leasing agents? Why or why not?

3. Is it possible that the mean disposable income of the surrounding area is not an important factor in leasing new locations? Explain.

4. Are there any other factors not mentioned by the leasing agents that might be relevant to the store leasing decision?

References

1. Anscombe, F. J., "Graphs in Statistical Analysis," The American Statistician 27 (1973): 17-21.

2. Hoaglin, D. C., and R. Welsch, "The Hat Matrix in Regression and ANOVA," The American Statistician 32 (1978): 17-22.

3. Hocking, R. R., "Developments in Linear Regression Methodology: 1959-1982," Technometrics 25 (1983): 219-250.

4. Kutner, M. H., C. J. Nachtsheim, J. Neter, and W. Li, Applied Linear Statistical Models, 5th ed. (New York: McGraw-Hill/Irwin, 2005).

5. Microsoft Excel 2007 (Redmond, WA: Microsoft Corp., 2007).

EXCEL COMPANION to Chapter 13

E13.1 PERFORMING SIMPLE LINEAR REGRESSION ANALYSES

You perform a simple linear regression analysis by using either the PHStat2 Simple Linear Regression procedure or the ToolPak Regression procedure.

Using PHStat2 Simple Linear Regression

Open to the worksheet that contains the data for the regression analysis. Select PHStat → Regression → Simple Linear Regression. In the procedure's dialog box (shown below), enter the cell range of the Y variable as the Y Variable Cell Range and the cell range of the X variable as the X Variable Cell Range. Click First cells in both ranges contain label and enter a value for the Confidence level for regression coefficients. Click the Regression Statistics Table and the ANOVA and Coefficients Table Regression Tool Output Options, enter a title as the Title, and click OK.

[Figure: PHStat2 Simple Linear Regression dialog box]

PHStat2 performs the regression analysis, using the ToolPak Regression procedure. Therefore, the worksheet produced does not dynamically change if you change your data. (Rerun the procedure to create revised results.) The three Output Options available in the PHStat2 dialog box enhance the ToolPak procedure and are explained in Sections E13.2, E13.4, and E13.5.
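If you want to check the coefficients outside Excel, the least-squares computation itself is short. The following is a minimal Python sketch, not part of PHStat2 or the ToolPak; the function name `least_squares` and the demonstration data are hypothetical and chosen only so the result is easy to verify by hand.

```python
def least_squares(x, y):
    """Return (b0, b1) for the prediction line y_hat = b0 + b1 * x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # SSX: sum of squared differences between each X and the mean of X
    ssx = sum((xi - x_bar) ** 2 for xi in x)
    # SSXY: sum of cross products of the X and Y deviations
    ssxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = ssxy / ssx            # slope
    b0 = y_bar - b1 * x_bar    # Y intercept
    return b0, b1


# Hypothetical illustration data (not from any chapter data file);
# the points lie exactly on the line y = 1 + 2x.
x_demo = [1.0, 2.0, 3.0, 4.0, 5.0]
y_demo = [3.0, 5.0, 7.0, 9.0, 11.0]
b0_demo, b1_demo = least_squares(x_demo, y_demo)
print(f"b0 = {b0_demo:.4f}, b1 = {b1_demo:.4f}")
```

Because the sketch recomputes everything each time it runs, it does not share the static-worksheet limitation noted above, but it also produces none of the ANOVA or regression statistics tables.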

Using ToolPak Regression

Open to the worksheet that contains the data for the regression analysis. Select Tools → Data Analysis, select Regression from the Data Analysis list, and click OK. In the procedure's dialog box (shown below), enter the cell range of the Y variable data as the Input Y Range and enter the cell range of the X variable data as the Input X Range. Click Labels, click Confidence Level and enter a value in its box, and then click OK. Results appear on a new worksheet.

[Figure: ToolPak Regression dialog box]

E13.2 CREATING SCATTER PLOTS AND ADDING A PREDICTION LINE

You use Excel charting features to create a scatter plot and add a prediction line to that plot. If you select the Scatter Diagram output option of the PHStat2 Simple Linear Regression procedure (see Section E13.1), you can skip to the "Adding a Prediction Line" section that applies to the Excel version you use.

Creating a Scatter Plot

Use either the Section E2.12 instructions to create a scatter plot (see page 93) or the Section E13.1 instructions in "Using PHStat2 Simple Linear Regression," clicking Scatter Diagram before you click OK.

Adding a Prediction Line (97-2003)

Open to the chart sheet that contains your scatter plot and select Chart → Add Trendline. In the Add Trendline dialog box (see Figure E13.1), click the Type tab and then click Linear. Click the Options tab and select the Automatic option. Click Display equation on chart and Display R-squared value on chart and then click OK. If you have included a label as part of your data range, you will see that label displayed in place of Series1 in this dialog box.

FIGURE E13.1 Add Trendline dialog box (97-2003)

Adding a Prediction Line (2007)

Open to the chart sheet that contains your scatter plot and select Layout → Trendline and, in the Trendline gallery, select More Trendline Options. In the Trendline Options panel of the Format Trendline dialog box (see Figure E13.2), select the Linear option, click Display equation on chart and Display R-squared value on chart, and click Close.

FIGURE E13.2 Format Trendline dialog box (2007)

Relocating an X Axis

If there are Y values on a residual plot or scatter plot that are less than zero, Microsoft Excel places the X axis at the point Y = 0, possibly obscuring some of the data points. To relocate the X axis to the bottom of the chart, open to the chart, right-click the Y axis, and select Format Axis from the shortcut menu.

If you use Excel 97-2003, select the Scale tab in the Format Axis dialog box (see Figure E13.3), enter the value found in the Minimum box (-6 in Figure E13.3) as the Value (X) axis Crosses at value, and click OK. (As you enter this value, the check box for this entry is cleared automatically.)

FIGURE E13.3 Format Axis dialog box (97-2003)

If you use Excel 2007, in the Axis Options panel of the Format Axis dialog box (see Figure E13.4), select the Axis value option, change its default value of 0.0 (shown in Figure E13.4) to a value less than the minimum Y value, and click Close.

FIGURE E13.4 Format Axis dialog box (2007)

E13.3 PERFORMING RESIDUAL ANALYSES

You modify the procedures of Section E13.1 to perform a residual analysis. If you use the PHStat2 Simple Linear Regression procedure, click all the Regression Tool output options (Regression Statistics Table, ANOVA and Coefficients Table, Residuals Table, and Residual Plot). If you use the ToolPak Regression procedure, click Residuals and Residual Plots before clicking OK. If you need to relocate an X axis to the bottom of a residual plot, review the "Relocating an X Axis" part of Section E13.2.
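The residuals that these options tabulate are simply the observed Y values minus the predicted Y values. As a rough cross-check outside Excel, the sketch below computes them in Python. The function name `residuals`, the data, and the coefficients passed in are hypothetical; the coefficients are assumed to have been obtained already by least squares for the same data.

```python
def residuals(x, y, b0, b1):
    """Return the residuals e_i = Y_i - (b0 + b1 * X_i)."""
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]


# Hypothetical data; b0 = 0.5 and b1 = 1.6 are the least-squares
# coefficients for these four points.
x_demo = [1.0, 2.0, 3.0, 4.0]
y_demo = [2.0, 4.0, 5.0, 7.0]
e_demo = residuals(x_demo, y_demo, b0=0.5, b1=1.6)

# Pairs to plot: X on the horizontal axis, residual on the vertical axis.
for xi, ei in zip(x_demo, e_demo):
    print(f"X = {xi:.1f}  residual = {ei:+.2f}")
```

When the coefficients come from least squares, the residuals sum to zero apart from rounding, which is a quick sanity check before plotting them against X.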

E13.4 COMPUTING THE DURBIN-WATSON STATISTIC

You compute the Durbin-Watson statistic by using either the PHStat2 Simple Linear Regression procedure or a several-step process that uses the Durbin-Watson.xls workbook.

Using PHStat2 Simple Linear Regression

Use the Section E13.1 instructions in "Using PHStat2 Simple Linear Regression," clicking Durbin-Watson Statistic before you click OK. Choosing the Durbin-Watson Statistic causes PHStat2 to create a residuals table, even if you did not check the Residuals Table Regression Tool output option.

The Durbin-Watson Statistic output option creates a new DurbinWatson worksheet similar to the one shown in Figure 13.16 on page 536. This worksheet references cells in the regression results worksheet that is also created by the procedure. If you delete the regression results worksheet, the DurbinWatson worksheet displays an error message.

Using Durbin-Watson.xls

Open to the DurbinWatson worksheet of the Durbin-Watson.xls workbook. This worksheet (see Figure 13.16 on page 536) uses the SUMXMY2(cell range 1, cell range 2) function in cell B3 to compute the sum of squared differences of the residuals, and the SUMSQ(residuals cell range) function in cell B4 to compute the sum of squared residuals for the Section 13.6 package delivery store example.

By setting cell range 1 to the cell range of the first residual through the second-to-last residual and cell range 2 to the cell range of the second residual through the last residual, you can get SUMXMY2 to compute the sum of the squared differences between successive residuals, which is the numerator term of Equation (13.15). Because residuals appear in a regression results worksheet, cell references used in the SUMXMY2 function must refer to the regression results worksheet by name.

In the Durbin-Watson workbook, the SLR worksheet contains the simple linear regression analysis for the Section 13.6 package delivery example. The residuals appear in the cell range C25:C39. Therefore, cell range 1 is set to SLR!C25:C38, and cell range 2 is set to SLR!C26:C39. This makes the cell B3 formula =SUMXMY2(SLR!C26:C39, SLR!C25:C38). The cell B4 formula, which also must refer to the SLR worksheet, is =SUMSQ(SLR!C25:C39).

To adapt the Durbin-Watson workbook to other problems, first create a simple linear regression results worksheet that contains residual output and copy that worksheet to the Durbin-Watson workbook. Then open to the DurbinWatson worksheet and edit the formulas in cells B3 and B4 so that they refer to the correct cell ranges on your regression worksheet. Finally, delete the no-longer-needed SLR worksheet.
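The two worksheet functions map directly onto two sums, so the statistic is easy to reproduce outside Excel. The Python sketch below mirrors the SUMXMY2 and SUMSQ steps; the function name `durbin_watson` and the demonstration residuals are hypothetical and are not the package delivery residuals.

```python
def durbin_watson(residuals):
    """Return D = sum((e_i - e_{i-1})^2) / sum(e_i^2) for time-ordered residuals."""
    if len(residuals) < 2:
        raise ValueError("at least two residuals are required")
    # Same pairing as SUMXMY2: residuals 2..n against residuals 1..n-1
    numerator = sum(
        (later - earlier) ** 2
        for earlier, later in zip(residuals[:-1], residuals[1:])
    )
    # Same role as SUMSQ over the full residual range
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Hypothetical residuals that alternate in sign.
e_demo = [1.0, -1.0, 1.0, -1.0]
d_demo = durbin_watson(e_demo)
print(f"D = {d_demo:.2f}")
```

The residuals must be supplied in the order in which the observations were collected; sorting them first would destroy the time pattern that the statistic is meant to detect.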

E13.5 ESTIMATING THE MEAN OF Y AND PREDICTING Y VALUES

You compute a confidence interval estimate for the mean response and the prediction interval for an individual response either by selecting the PHStat2 Simple Linear Regression procedure or by making entries in the CIEandPIforSLR.xls workbook.

Using PHStat2 Simple Linear Regression

Use the Section E13.1 instructions in "Using PHStat2 Simple Linear Regression," but before you click OK, click Confidence and Prediction Interval for X= and enter an X value in its box (see below). Then enter a value for the Confidence level for interval estimates and click OK.

[Figure: PHStat2 Simple Linear Regression dialog box with the Confidence and Prediction Interval for X= option]

PHStat2 places the confidence interval estimate and prediction interval on a new worksheet similar to the one shown in Figure 13.21 on page 549. (PHStat2 also creates a DataCopy worksheet that is discussed in the next part of this section.)

Using CIEandPIforSLR.xls

Open to the CIEandPI worksheet of the CIEandPIforSLR.xls workbook. This worksheet (shown in Figure 13.21 on page 549) uses the function TINV(1 - confidence level, degrees of freedom) to determine the t value and compute the confidence interval estimate and prediction interval for the Section 13.8 Sunflowers Apparel example.

Cells B8, B11, B12, and B15 contain formulas that reference individual cells on a DataCopy worksheet. This worksheet, the first six rows of which are shown in Figure E13.5, contains a copy of the regression data in columns A and B and a formula in column C that squares the difference between each X and the mean of X. The worksheet also computes the sample size, the sample mean, the sum of the squared differences [SSX in Equation (13.20) on page 546], and the predicted Y value in cells F2, F3, F4, and F5.

FIGURE E13.5 DataCopy worksheet (first six rows). Formulas shown for cells F2 through F5: =COUNT(B:B), =AVERAGE(A:A), =SUM(C:C), and =TREND(B2:B15, A2:A15, CIEandPI!B4).

The cell F5 formula uses the function TREND(Y variable cell range, X variable cell range, X value) to calculate the predicted Y value. Because the formula uses the X value that has been entered on the CIEandPI worksheet, the X value in the cell F5 formula is set to CIEandPI!B4. Because the DataCopy and CIEandPI worksheets reference each other, you should consider these worksheets a matched pair that should not be broken up.

To adapt these worksheets to other problems, first create a simple linear regression results worksheet. Then, transfer the standard error value, always found in the regression results worksheet cell B7, to cell B13 of the CIEandPI worksheet. Change, as is necessary, the X Value and the confidence level in cells B4 and B5 of the CIEandPI worksheet. Next, open to the DataCopy worksheet, and if your sample size is not 14, follow the instructions found in the worksheet. Enter the problem's X values in column A and Y values in column B. Finally, return to the CIEandPI worksheet to examine its updated results.
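The quantities that the CIEandPI and DataCopy worksheets pass back and forth (sample size, mean of X, SSX, predicted Y, standard error, and the t value) can be combined in a few lines. The Python sketch below is one way to do so under stated assumptions: the function name `interval_estimates` and the data are hypothetical, and the t value is supplied by the caller (for example, from a t table or from TINV) rather than computed, because the Python standard library has no inverse t function.

```python
import math


def interval_estimates(x, y, x_value, t_crit):
    """Confidence interval for the mean response and prediction interval
    for an individual response at x_value, given a critical t value
    for n - 2 degrees of freedom."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need at least three paired observations")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ssx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ssx
    b0 = y_bar - b1 * x_bar
    y_hat = b0 + b1 * x_value                       # role of TREND
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s_yx = math.sqrt(sse / (n - 2))                 # standard error of the estimate
    h = 1 / n + (x_value - x_bar) ** 2 / ssx        # the h statistic
    ci_half = t_crit * s_yx * math.sqrt(h)
    pi_half = t_crit * s_yx * math.sqrt(1 + h)
    return {
        "y_hat": y_hat,
        "ci": (y_hat - ci_half, y_hat + ci_half),
        "pi": (y_hat - pi_half, y_hat + pi_half),
    }


# Hypothetical data; 4.3027 is the two-tail 95% t value for 2 degrees
# of freedom, taken from a t table.
est_demo = interval_estimates([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 5.0, 7.0],
                              x_value=2.5, t_crit=4.3027)
print(est_demo)
```

The prediction interval is always wider than the confidence interval at the same X value because of the extra 1 under the square root, which reflects the variation of an individual response around its mean.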

E13.6 EXAMPLE: SUNFLOWERS APPAREL DATA

This section shows you how to use PHStat2 or Basic Excel to perform a regression analysis for Sunflowers Apparel using the square footage and annual sales data stored in the workbook for this example.

Using PHStat2

Open to the Data worksheet of the workbook. Select PHStat → Regression → Simple Linear Regression. In the procedure's dialog box (see Figure E13.6), enter C1:C15 as the Y Variable Cell Range and B1:B15 as the X Variable Cell Range. Click First cells in both ranges contain label and enter a value for the Confidence level for regression coefficients. Click the Regression Statistics Table, ANOVA and Coefficients Table, Residuals Table, and Residual Plot Regression Tool Output Options. Enter Site Selection Analysis as the Title and click Scatter Diagram. Click Confidence and Prediction Interval for X= and enter 4 in its box. Enter 95 in the Confidence level for interval estimates box. Click OK to execute the procedure.

FIGURE E13.6 Completed Simple Linear Regression dialog box

To evaluate the assumption of linearity, you review the Residual Plot for X1 chart sheet. Note that there is no apparent pattern or relationship between the residuals and the X variable.

To evaluate the normality assumption, create a normal probability plot. With your workbook open to the SLR worksheet, select PHStat → Probability & Prob. Distributions → Normal Probability Plot. In the procedure's dialog box (see Figure E13.7), enter C24:C38 as the Variable Cell Range and click First cell contains label. Enter Normal Probability Plot as the Title and click OK. In the NormalPlot chart sheet, observe that the data do not appear to depart substantially from a normal distribution.

FIGURE E13.7 Completed Normal Probability Plot dialog box

To evaluate the assumption of equal variances, review the Residual Plot for X1 chart sheet. Note that there do not appear to be major differences in the variability of the residuals.

You conclude that all assumptions are valid and that you can use this simple linear regression model for the Sunflowers Apparel data. You can now open to the SLR worksheet to view the details of the analysis or open to the Estimate worksheet to make inferences about the mean of Y and the prediction of individual values of Y.

Using Basic Excel

Open to the Data worksheet of the workbook. Select Tools → Data Analysis (97-2003) or Data → Data Analysis (2007). Select Regression from the Data Analysis list, and click OK. In the procedure's dialog box (see Figure E13.8), enter C1:C15 as the Input Y Range and enter B1:B15 as the Input X Range. Click Labels, click Confidence Level and enter 95 in its box, and click Residuals. Click OK to execute the procedure.

FIGURE E13.8 Completed Regression dialog box

To evaluate the assumption of linearity, you plot the residuals against the square feet (independent) variable. To simplify creating this plot, open to the Data worksheet and copy the square feet cell range B1:B15 to cell E1. Then copy the cell range of the residuals, C24:C38 on the SLR worksheet, to cell F1 of the Data worksheet. With your workbook open to the Data worksheet, use the Section E13.2 instructions on pages 564-566 to create a scatter plot. (Use E1:F15 as the Data range (Excel 97-2003) or as the cell range of the X and Y variables (Excel 2007) when creating the scatter plot.) Review the scatter plot. Observe that there is no apparent pattern or relationship between the residuals and the X variable. You conclude that the linearity assumption holds.

You now evaluate the normality assumption by creating a normal probability plot. Create a Plot worksheet, using the model worksheet in the supplied workbook as your guide. In a new worksheet, enter Rank in cell A1 and then enter the series 1 through 14 in cells A2:A15. Enter Proportion in cell B1 and enter the formula =A2/15 in cell B2. Next, enter Z Value in cell C1 and the formula =NORMSINV(B2) in cell C2. Copy the residuals (including their column heading) to the cell range D1:D15. Select the formulas in cell range B2:C2 and copy them down through row 15. Open to the probability plot and observe that the data do not appear to depart substantially from a normal distribution.
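The Rank, Proportion, and Z Value columns described above can be reproduced with the Python standard library. In the sketch below, the function name `normal_plot_points` and the residuals are hypothetical. The proportion rank / (n + 1) corresponds to the =A2/15 formula when n = 14, and `NormalDist().inv_cdf` plays the role of NORMSINV. The sketch sorts the residuals so that the smallest residual is paired with rank 1; the worksheet steps do not state this sort explicitly, so treat it as an assumption about how the residuals are arranged in column D.

```python
from statistics import NormalDist


def normal_plot_points(residuals):
    """Return (z value, residual) pairs for a normal probability plot."""
    n = len(residuals)
    std_normal = NormalDist()          # mean 0, standard deviation 1
    points = []
    for rank, e in enumerate(sorted(residuals), start=1):
        proportion = rank / (n + 1)    # for example, rank / 15 when n = 14
        z = std_normal.inv_cdf(proportion)
        points.append((z, e))
    return points


# Hypothetical residuals (not the Sunflowers Apparel residuals).
pts_demo = normal_plot_points([0.3, -0.1, 0.1, -0.3, 0.0])
for z, e in pts_demo:
    print(f"Z = {z:+.4f}  residual = {e:+.2f}")
```

If the plotted points fall close to a straight line, the residuals do not appear to depart substantially from a normal distribution.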

To evaluate the assumption of equal variance, return to the scatter plot of the residuals and the X variable that you already developed. Observe that there do not appear to be major differences in the variability of the residuals.

You conclude that all assumptions are valid and that you can use this simple linear regression model for the Sunflowers Apparel data. You can now evaluate the details of the regression results worksheet. If you are interested in making inferences about the mean of Y and the prediction of individual values of Y, open the CIEandPIforSLR.xls workbook. (Usually, you would have to first make adjustments to the DataCopy worksheet, as discussed in Section E13.5, but this workbook already contains the entries for the Sunflowers Apparel analysis.) Open to the CIEandPI worksheet to make inferences about the mean of Y and the prediction of individual values of Y.
