8/8/2019 L4&5 Multiple Regression 2010B
Chi-square goodness-of-fit test
The chi-square test can be used to determine whether sample data conform to an expected distribution when the data are categorical (nominal or ordinal). The test determines whether the data fit a given distribution, such as uniform or normal.
χ² = Σ (f_o - f_e)² / f_e,   df = k - 1 - m

Where:
f_o = frequency of observed (or actual) values
f_e = frequency of expected (or theoretical) values
k = number of categories
m = number of parameters being estimated from the sample data
Chi-square test for independence
The chi-square test for independence is based on the counts in a contingency (or cross-tabs) table. It tests whether the counts for the row categories are probabilistically independent of the counts for the column categories.
χ² = Σ_ij (O_ij - E_ij)² / E_ij,   df = (rows - 1)(cols - 1)

Where:
O_ij = observed number of observations in cell (i, j)
E_ij = expected number of observations in cell (i, j)
Chi-square test - Local survey
• In a national survey, consumers were asked: "In general, how would you rate the level of service that businesses in this country provide?"
• The distribution of responses is in the National column.
• Suppose a manager wants to find out whether this result applies to the customers of her store in the city.
• She put a similar survey to 207 randomly selected customers in her store and observed the results in the Local column.
• She can use the chi-square test to see if her observed frequencies of responses are the same as the frequencies that would be expected from the national survey.

Response      National   Local (of 207 asked)
Excellent     8%         21
Pretty good   47%        109
Only fair     34%        62
Poor          11%        15
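The local-survey calculation can be reproduced as a goodness-of-fit test in Python; a sketch, assuming scipy is available, using the proportions and counts from the table above:

```python
from scipy import stats

# National distribution (expected proportions) and local observed counts
national = {"Excellent": 0.08, "Pretty good": 0.47, "Only fair": 0.34, "Poor": 0.11}
observed = [21, 109, 62, 15]              # local responses, n = 207

n = sum(observed)
expected = [p * n for p in national.values()]

# Goodness-of-fit: does the local store match the national distribution?
chi2, p_value = stats.chisquare(observed, f_exp=expected)
```

With df = k - 1 = 3 and a chi-square statistic of about 6.25, p is roughly 0.10, so at the 5% level the manager would not reject the hypothesis that her customers match the national pattern.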
Clive Morley 4
Hypothesis Testing Local Survey Example
• Example using Excel
Steps in Hypothesis Testing
1. State the null and alternative hypotheses.
2. Make a judgment about the population distribution and the level of measurement, then select the appropriate statistical test.
3. Decide upon the desired level of significance.
4. Collect data from a sample and compute the test statistic to see if the level of significance is met.
5. Accept or reject the null hypothesis.
Contingency Tables
Two-way table
Test whether rows and columns are associated (or independent)
Can calculate expected numbers in each cell if rows and columns are independent; compare with actual (observed)

χ² = Σ (O_i - E_i)² / E_i
Contingency Tables - Example
Two-way table, e.g. responses to question 6 (a, b, or c) by two groups:

Q6    Group 1   Group 2
a     10        18
b     12        22
c     15        26

The numbers in the table are counts (frequencies) of the number falling into each category
Contingency Tables - Example
Q6    Group 1   Group 2
a     27%       27%
b     32%       33%
c     41%       39%
Contingency Tables - Example
Q6    Group 1   Group 2
a     10        18
b     12        22
c     15        26

Chi-square test statistic = 0.0142
p-value = 0.9929
Not significant
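The same numbers come out of scipy's contingency-table test; a sketch using the Q6 counts above:

```python
import numpy as np
from scipy import stats

# Q6 responses (rows a, b, c) by group (columns Group 1, Group 2)
table = np.array([[10, 18],
                  [12, 22],
                  [15, 26]])

# df = (rows - 1)(cols - 1) = 2; chi2 ~ 0.0142, p ~ 0.9929: not significant
chi2, p_value, dof, expected = stats.chi2_contingency(table)
```

`expected` holds the cell counts implied by independence (row total × column total / grand total), which is exactly the E_ij in the formula above.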
Statistical Decision
For the t-test (for a mean or proportion):
Null hypothesis: no-change situation
For the chi-square test:
Null hypothesis: the two variable sets are independent

Test value t > t critical value (usually about 2): reject the null hypothesis
p-value < alpha (usually 0.05): reject the null hypothesis
Test value chi-square > chi-square critical value: reject the null hypothesis
Type I and Type II errors
Two ways a hypothesis test result can be wrong:
I - find hypothesis is wrong, when it is correct
II - find hypothesis is correct, when it is wrong
Type I and Type II errors
                            REALITY
TEST FINDS                  Hypothesis correct          Hypothesis wrong
Hypothesis correct          (correct decision)          type II error
Hypothesis wrong            type I error                (correct decision)
                            (test significance level)
Type I and Type II errors
Prob value = observed probability of a type I error

In control charts, control limits are often set at 3 standard deviations, equivalent to setting the probability of a type I error at 0.003
- minimises reacting when we don't need to

Using t = 2 is equivalent to setting the probability of a type I error at 0.05
BUSM 4074 Management Decision Making

Prof. Clive Morley
Graduate School of Business
BUSM 4074 Management Decision Making
4. Multiple regression
5. Multiple regression (cont.)
Unit 4&5 - Learning Objectives
• To understand the use of the multiple regression technique, including linear, log-log, logit, autoregressive and time series models
• To be able to carry out straightforward multiple regression model estimation
• To be able to interpret standard computer output from a multiple regression exercise, including to assess variables for significance, estimate the size of an explanatory variable's impact on the dependent variable, assess model fit, and use the model to estimate values of the dependent variable
"As my salary increases, computers are getting cheaper; therefore, to get cheaper computers, pay me more."
What is wrong with this (very attractive) argument?
Multiple regression
A very powerful, widely used statistical technique with many applications in all sorts of areas
Used to estimate the relationship between variables
For example, Y might be the sales of a certain item and X the price of it. The linear relationship is estimated:
Y = a + bX
Multiple regression
The parameters a and b are estimated from data on the variables X and Y
Correlation establishes whether a linear relationship exists and how strong it is
regression estimates what the relationship is
Multiple regression
The model is readily extended to include other explanatory variables: for example, sales (Y) might depend on price (X1), buyers' incomes (X2) and advertising expenditure (X3), giving the equation to be estimated

Y = a + b1·X1 + b2·X2 + b3·X3

Data on a number of cases (e.g. various sales areas or different times) for all the variables is needed
Multiple regression
The explanatory variables do not exactly predict the value of Y
- due to random effects
- due to the impacts of other (hopefully minor) variables, etc.
So the equation does not exactly fit: residuals
Purposes of multiple regression
• to estimate the equation, so we can predict Y for given values of the explanatory variables, or
• to estimate the effects of variables on Y (through the b parameters of the variables of interest, and also through the variables' correlations with Y), or
• to determine which potential explanatory variables have a significant impact on Y (through testing the significance of the relevant b values).
Theory - least squares

The computer finds the values for the parameters that give the line of best fit
Best fit is defined as minimising the sum of squared errors (SSE)
Theory - model specification

Y is some function of a lot of explanatory variables
Narrow the "lot of" explanatory variables down to those expected to be important (ignore others)
Then specify the functional form of the relationship; linear is the usual starting point for regression
(but see the discussion of log-log models below)
Theory - model specification

Model specification (which variables, linear or other functional form, etc.) is based on relevant theory
The estimated relationship is then based on data
Multiple regression
The overall fit of the estimated equation is measured by R-squared (R²), the proportion of the variation in Y explained by the equation
It is also the square of the correlation between the fitted and actual Y values
Each parameter estimated (and hence each variable) can be tested for individual significance
Linear Regression Example
Data:
House Price (y)   Sq Feet (x)
245               1400
312               1600
279               1700
308               1875
199               1100
219               1550
405               2350
324               2450
319               1425
255               1700
Linear Regression Example

[Plot of the data with fitted line: y = 75.814 + 0.123 × Sq. Feet]
Simple Linear Regression Model

y_i = β0 + β1·x_i + ε_i

where β0 = intercept, β1 = slope, and ε_i = the random error for this x_i value.

[Figure: observed Y values scattered around the fitted line Ŷ = β0 + β1·X]
X (SqFt)   Y ($000)   Predicted Ŷ   Residual (Y - Ŷ)
1400 245 251.92 -6.92316
1600 312 273.88 38.12329
1700 279 284.85 -5.85348
1875 308 304.06 3.93716
1100 199 218.99 -19.99284
1550 219 268.39 -49.38832
2350 405 356.20 48.79749
2450 324 367.18 -43.17929
1425 319 254.67 64.33264
1700 255 284.85 -29.85348
Excel Residual Output for the House Price model
It shows how well the regression line fits the data points. The best and worst predictions had residuals of 3.94 and 64.33, respectively.
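The fitted line and the residual column can be reproduced with numpy; a sketch using the ten data rows listed earlier (the estimates it returns, intercept ≈ 98.25 and slope ≈ 0.1098, are the ones the residual table is based on):

```python
import numpy as np

sqft = np.array([1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700])
price = np.array([245, 312, 279, 308, 199, 219, 405, 324, 319, 255])

# Least-squares fit: price = b0 + b1 * sqft
b1, b0 = np.polyfit(sqft, price, 1)

predicted = b0 + b1 * sqft
residuals = price - predicted     # first residual ~ -6.923, as in the table
```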
Measures of variation

SS_yy = Σ(Y_i - Ȳ)²   (total variation of the Y_i around their mean)
SSR  = Σ(Ŷ_i - Ȳ)²    (Sum of Squares of Regression)
SSE  = Σ(Y_i - Ŷ_i)²  (Sum of Squares of Error)

[Figure: for a point (X_i, Y_i), the deviation from the mean Ȳ splits into the part explained by the fitted line and the residual]
Measures of variation

• Total variation is made up of two parts: SS_yy = SSR + SSE

SS_yy = Σ(Y_i - Ȳ)²   (Total Sum of Squares)
SSR  = Σ(Ŷ_i - Ȳ)²    (Regression Sum of Squares)
SSE  = Σ(Y_i - Ŷ_i)²  (Error Sum of Squares)

Where: Ȳ = average value of the dependent variable
Y_i = observed values of the dependent variable
Ŷ_i = predicted value of Y for the given X_i value

SS_yy measures the variation of the Y_i values around their mean Ȳ
SSR is the explained variation, attributable to the relationship between X and Y
SSE is the variation attributable to factors other than the relationship between X and Y
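The identity SS_yy = SSR + SSE can be checked numerically; a sketch using the house-price data again:

```python
import numpy as np

sqft = np.array([1400, 1600, 1700, 1875, 1100, 1550, 2350, 2450, 1425, 1700])
price = np.array([245, 312, 279, 308, 199, 219, 405, 324, 319, 255])

b1, b0 = np.polyfit(sqft, price, 1)
predicted = b0 + b1 * sqft

ss_yy = np.sum((price - price.mean()) ** 2)      # total variation
ssr = np.sum((predicted - price.mean()) ** 2)    # explained variation
sse = np.sum((price - predicted) ** 2)           # unexplained variation

r_squared = ssr / ss_yy   # proportion of variation explained
```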
Standard Error of the Estimate

The standard error of the estimate is the standard deviation of the errors of a regression model; it tells us how spread out the errors are.

s_e = √( SSE / (n - 2) )

where SSE = Σy² - b0·Σy - b1·Σxy
Computer output:
Correlation R = 0.837, R-squared = 0.700

           Coefficient   t       sig
Constant   75.813        2.508   0.0204
Sq. feet   0.123         7.009   0.0000
Linear Regression Example
Linear regression - Example
The model estimated is: House Price = 75.813 + 0.123 Sq. Feet
• The correlation between House Price and Sq. Feet is high, at 0.837, and the fit of the regression model is quite strong: R² = 0.700, i.e. 70%
• The Sq. Feet variable is highly significant: t = 7.009, p = 0.0000
• The implicit hypothesis is that the coefficient is zero, i.e. the variable has no impact
Linear regression - Example
Add another variable to the data - Location

Price   Sq. Feet   Location
245     1400       2
312     1600       3
279     1700       4
308     1875       3
199     1100       5
219     1550       1
405     2350       1
324     2450       5
319     1425       4
etc.
Linear regression - Example
Computer output:
Correlation R = 0.839, R-squared = 0.705
(Without Location: Correlation R = 0.837, R-squared = 0.700)

           Coefficient   t       sig
Constant   73.510        2.366   0.0282
Sq. Feet   0.120         6.475   0.0000
Location   2.283         0.525   0.6050
Linear Regression Example

Slight improvement in R²
Location not significant (sig or p-value high)
- consider dropping it from the model

           Coefficient   t       sig
Constant   73.510        2.366   0.0282
Sq. Feet   0.120         6.475   0.0000
Location   2.283         0.525   0.6050
Linear regression - example
Model Market to Book Value (MBV) as a function of Revenue

Data:
Company   MBV     Revenue
1         2.011   39.505
2         1.814   4.165
3         1.522   10.406
4         1.826   7.602
5         1.824   2.942
6         1.337   5.228
7         1.650   1.697
etc.
Linear regression - example
Output: Dep Var: MBV, N: 71
Multiple R: 0.318
Squared multiple R: 0.101

Variable   Coefficient   t value   sig
Constant   2.010         11.465    0.000
Revenue    0.046         2.789     0.007
Linear regression - example
The model is: MBV = 2.010 + 0.046 Revenue
The fit is not great (R² = 0.101, i.e. about 10%) but significant (F = 7.778, p = 0.007)
The Revenue variable is significant (t = 2.789, p = 0.007)
Linear regression - example
More factors (variables) impact on MBV and need to be considered
*** WARNING ***
Case 1 has large leverage (Leverage = 0.243)
Case 8 has large leverage (Leverage = 0.163)
Case 56 is an outlier (Standardized Residual = 5.167)
Durbin-Watson D Statistic: 1.682
First Order Autocorrelation: 0.140
Multiple regression
Avoid step-wise regression
Look for non-linear patterns in the scatter plot
Diagnostic checks:
• Multicollinearity (different x's move together in a systematic way)
• Autocorrelation (successive error terms are correlated with each other)
• Outliers (data points that are not together with the rest)
• Heteroscedasticity (non-constant variance)
• Leverage (observations with large effects on outcomes)
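As one illustration of these checks, the Durbin-Watson statistic can be computed directly from the residuals; a sketch (the residual series here is made up, used only to show the mechanics; values near 2 suggest no autocorrelation):

```python
import numpy as np

def durbin_watson(residuals):
    """d = sum of squared successive differences / sum of squared residuals."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

# Hypothetical residual series with alternating signs
e = np.array([1.2, -0.8, 0.5, -1.1, 0.9, -0.3, 0.7, -0.6])
d = durbin_watson(e)   # alternating signs push d above 2 (negative autocorrelation)
```

d ranges from 0 (strong positive autocorrelation) to 4 (strong negative autocorrelation).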
Multiple regression - Example
Hospital and Nursing Salary example
(9.10 of textbook)
Multiple regression - Example
Dialogue box
Dependent: Annual Nursing Salary
Independents: Number of beds in home; Annual medical in-patient days; Annual total patient days; Rural (1) and non-rural (0) homes
Multiple regression - Example
Model Summary
R = 0.8803, R Square = 0.775, Adjusted R Square = 0.7557
Std. Error of the Estimate = $82,024.63
ANOVA: F = 40.4375, Sig = 0.000
Multiple regression - Example
R = 0.88: coefficient of correlation, the strength of the relationship between the two variables. R = -1: strong negative relationship; R = +1: strong positive relationship; R = 0: no relationship.
R Square = 0.775: coefficient of determination. 77.5% of the variation in Y can be explained by changes in the X's; the other 22.5% is due to other factors. This fit is quite strong.
Adjusted R Square = 0.7557: adjusted for multiple variables. A decrease in Adjusted R Square means a newly added variable is not significant.
Multiple regression - Example
Std. Error of the Estimate = $82,024.63
Sig = 0.000: significant fit
p = 0.1799 (beds): too high (compared to α), so consider dropping beds as a variable
Multiple regression - Example
                                        Coefficient   Standard Error   t value   p value (sig)
Constant (Intercept)                    113.5003      495.4654         0.2291    0.8198
Number of beds in home                  9.6399        7.0804           1.3615    0.1799
Annual medical in-patient days (100s)   -7.4072       2.4012           -3.0848   0.0034
Annual total patient days (100s)        15.7674       2.7550           5.7232    0.0000
Rural (1) and non-rural (0) homes       -79.5796      288.1857         -0.2761   0.7837
Multiple regression - Example
The interpretation of the coefficients is that if in-patient days, total patient days and the rural factor are held constant, then annual nursing salary is expected to increase by $9.64 for each extra bed in the home. Similarly, annual nursing salary is expected to change by -$740.72, +$1,576.74 and -$79.58 for each extra (100) in-patient days, (100) total patient days and the rural factor, respectively, other variables held constant. The $11,300 can be interpreted as the annual base salary.

                                        Coefficient   Standard Error   t value   p value (sig)
Constant (Intercept)                    113.5003      495.4654         0.2291    0.8198
Number of beds in home                  9.6399        7.0804           1.3615    0.1799
Annual medical in-patient days (100s)   -7.4072       2.4012           -3.0848   0.0034
Annual total patient days (100s)        15.7674       2.7550           5.7232    0.0000
Rural (1) and non-rural (0) homes       -79.5796      288.1857         -0.2761   0.7837
Multiple regression - Example
• Compare the intercept and slopes of the multiple regression with those of the simple linear regression: changes have occurred (difficult to analyse in detail)
• s_e is still the standard error of the estimate. Note that the multiple regression yields a better s_e than the simple linear regression
• R² similarly (but it would increase with extra x's)
• Adjusted R²: a decrease indicates an added x that does not belong in the equation
Multiple regression - Example
• Tolerance stats OK (> 0.1), so no multicollinearity issue. If an individual R² is too high (almost equal to the R² of the multiple regression): suspect multicollinearity!
• Durbin-Watson stat d = 2.4789: somewhat of a negative autocorrelation issue. A d close to 2 would indicate no autocorrelation concern.
• Outliers: see the graphs of residuals. Look for a normal shape on the histogram and randomness (no pattern) on the scatter plots.
Multiple regression - Example
[Histogram of regression standardized residuals - Dependent Variable: Current Salary; Mean = 0.00, Std. Dev = 1.00, N = 474]

The standardized residual distribution is relatively normal: a relatively good fit.
Multiple regression - Example
Scatter plot: randomly distributed, on both sides of 0.00

[Scatterplot of regression standardized residuals - Dependent Variable: Current Salary]

The red plot is an example of heteroscedasticity (an unequal variance distribution).
Multiple regression - Example
Dummy variables: categorical data related to the dependent variable
Other names: indicators, 0-1 variables
If the dummy variable = 1: in the category; dummy variable = 0: not in that category
The coefficient of this variable indicates the difference in the dependent variable due to this (dummy) variable
Multiple regression - Example

Salary = 113.50 + 9.64 Bed - 7.41 InPtDay + 15.77 TotPtDay - 79.58 Rural

Rural = 0 vs Rural = 1: salary difference = -$7,958 (rural is lower)

Two or more categorical variables can be involved. The coefficient indicates the difference in y when everything else is the same.

                                        Coefficient   Standard Error   t value   p value (sig)
Constant (Intercept)                    113.5003      495.4654         0.2291    0.8198
Number of beds in home                  9.6399        7.0804           1.3615    0.1799
Annual medical in-patient days (100s)   -7.4072       2.4012           -3.0848   0.0034
Annual total patient days (100s)        15.7674       2.7550           5.7232    0.0000
Rural (1) and non-rural (0) homes       -79.5796      288.1857         -0.2761   0.7837
Analysing a Regression
• p-value of the regression
• p-value of each x, to consider dropping it or not
• Adjusted R-square value
• Standard error of the regression estimate
• Scatter plot of residuals: randomness, outliers, heteroscedasticity (non-equal variance)
• Histogram of the residuals
• Durbin-Watson statistic (d)
Linear, Quadratic and Log regression - example
The Public Service Electric Company produces different quantities of electricity each month, depending on demand. The file Poly and Log examples - Power.xls lists the number of units of electricity produced (Units) and the total cost of producing them (Cost) for a 36-month period. How can regression be used to analyse the relationship between Cost and Units?
Multiple regression - Example
R Square = 0.7359, Standard Error = 2733.7424
R Square = 0.8216, Standard Error = 2280.7998
Log model
Very often we use multiple regression to fit a multiplicative model:

Y = a · X1^b1 · X2^b2 · X3^b3

If any explanatory variable changes by 1%, the dependent variable changes by a constant percentage (the corresponding b)

This can be estimated by making a logarithmic transformation of the equation, which gives:

ln(Y) = ln(a) + b1·ln(X1) + b2·ln(X2) + b3·ln(X3)
Log model
Thus we can calculate ln(Y), ln(X1), ln(X2), ln(X3) and regress these variables in the usual way, to estimate the parameters of the original equation.
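A sketch of the transformation with one explanatory variable; the data are synthetic (a_true and b_true are made-up values, used only to check that the log-log regression recovers them):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical multiplicative relationship Y = a * X^b with small noise
a_true, b_true = 2.0, 1.5
X = rng.uniform(1.0, 10.0, 200)
Y = a_true * X ** b_true * np.exp(rng.normal(0.0, 0.05, 200))

# Regress ln(Y) on ln(X): the slope estimates b, the intercept estimates ln(a)
b_est, ln_a_est = np.polyfit(np.log(X), np.log(Y), 1)
a_est = np.exp(ln_a_est)
```

b_est is the elasticity: a 1% change in X changes Y by about b_est percent.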
Log model example
The file CarSales.xls contains annual data (1970-1999) on domestic auto sales in the United States. The variables are defined as:
Sales: annual domestic auto sales (in number of units)
PriceIndex: consumer price index of transportation
Income: real disposable income
Interest: prime interest rate
Multiple regression - Example

Regression and Correlation
Observations 30, Multiple R 0.7358, R Square 0.5414, Adjusted R Square 0.4680, Standard Error 758049.7773

LogRegres         Coefficients     t value    p value
Intercept         -110360558.48    -45.9500   0.0000
Log(Sales)        7522741.47       54.4195    0.0000
Log(PriceIndex)   35983.70         0.2297     0.8202
Log(Income)       -162258.29       -0.6222    0.5395
Log(Interest)     -13588.13        -0.2133    0.8328

Regression and Correlation
Observations 30, Multiple R 0.9978, R Square 0.9956, Adjusted R Square 0.9949, Standard Error 74199.1103

MultiRegres   Coefficients    t value   p value
Intercept     513941538.55    0.7356    0.4688
Year          -258651.57      -0.7234   0.4761
PriceIndex    -18121.97       -0.4786   0.6364
Income        2175.75         1.1204    0.2732
Interest      -8895378.05     -1.4810   0.1511
Multiple regression - Example
Log model: probably the slightly better model
R-square = 0.99: good
Fewer outliers: slightly better
Residual plots: not necessarily better
Multiple Regression Goal
Remove any unimportant variables (or ones with multicollinearity or autocorrelation problems, etc.) from the equation and decide which variable(s) are important for the regression model.
Use that model for your prediction.
Multiple regression time series example
Plot CarSales.xls data, Year vs. Sales
Multiple regression time series example
Period             Sales (000)
2003 Quarter I     25.4
2003 Quarter II    23.8
2003 Quarter III   22.0
2003 Quarter IV    28.6
2004 Quarter I     28.5
2004 Quarter II    27.0
etc.
Multiple regression time series example
[Time series plot: SALES (vertical axis, 20-45) against TIME (horizontal axis, 0-20)]
Multiple regression time series example
Create dummy variables for the Quarters and time period
Period       Sales   Time   QII   QIII   QIV
2003 Q I     25.4    1      0     0      0
2003 Q II    23.8    2      1     0      0
2003 Q III   22.0    3      0     1      0
2003 Q IV    28.6    4      0     0      1
2004 Q I     28.5    5      0     0      0
2004 Q II    27.0    6      1     0      0
etc.
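The dummy columns can be built from the period labels; a sketch with pandas, using only the six rows shown (QI is dropped as the base category):

```python
import pandas as pd

df = pd.DataFrame({
    "period": ["2003 Q I", "2003 Q II", "2003 Q III",
               "2003 Q IV", "2004 Q I", "2004 Q II"],
    "sales":  [25.4, 23.8, 22.0, 28.6, 28.5, 27.0],
})
df["time"] = range(1, len(df) + 1)

# 0-1 indicator columns, one per quarter; keep QII, QIII, QIV only
quarter = df["period"].str.split().str[-1]          # "I", "II", "III", "IV"
dummies = pd.get_dummies(quarter, prefix="Q").astype(int)
df[["QII", "QIII", "QIV"]] = dummies[["Q_II", "Q_III", "Q_IV"]].to_numpy()
```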
Multiple regression time series example
Squared multiple R: 0.987

Effect     Coefficient   t      P
CONSTANT   23.679        50.5   0.000
TIME       1.005         28.5   0.000
QII        -2.525        -5.2   0.000
QIII       -5.070        -9.8   0.000
QIV        0.450         0.9    0.401

Could drop QIV and re-estimate
Multiple regression time series example
The model as estimated is:

Sales = 23.679 + 1.005 Time - 2.525 QII - 5.070 QIII + 0.450 QIV

Say the data ended at Time = 24, i.e. 2008 QIV. Use the model to forecast, e.g. forecast sales in 2009 in quarters I and II.
Multiple regression time series example
2009 quarter I is Time = 25, QI = 1, QII = 0, QIII = 0, QIV = 0
Sales = 23.679 + 1.005 × 25 - 0 - 0 + 0
      = 48.804, i.e. $48,800

2009 quarter II is Time = 26, QI = 0, QII = 1, QIII = 0, QIV = 0
Sales = 23.679 + 1.005 × 26 - 2.525 - 0 + 0
      = 47.284, i.e. $47,300
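The two forecasts can be checked by coding the estimated equation directly (the function name is just for illustration):

```python
def forecast_sales(time, qii=0, qiii=0, qiv=0):
    """Estimated model: quarterly sales in $000."""
    return 23.679 + 1.005 * time - 2.525 * qii - 5.070 * qiii + 0.450 * qiv

sales_2009_q1 = forecast_sales(25)          # 2009 QI: about 48.8, i.e. $48,800
sales_2009_q2 = forecast_sales(26, qii=1)   # 2009 QII: about 47.3, i.e. $47,300
```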
Autoregression
Another way of dealing with time series is autoregression
Often used when Durbin-Watson indicates autocorrelation (a common issue with time series data)
Or because it makes theoretical sense that one period's value depends (partly) on the previous value of the series
Use previous (lagged) values as an explanatory variable
Autoregression

In the example, add another variable, which is the lagged sales:

Period       Sales   Time   QII   QIII   QIV   lagSales
2003 Q I     25.4    1      0     0      0     -
2003 Q II    23.8    2      1     0      0     25.4
2003 Q III   22.0    3      0     1      0     23.8
2003 Q IV    28.6    4      0     0      1     22.0
2004 Q I     28.5    5      0     0      0     28.6
2004 Q II    27.0    6      1     0      0     28.5
etc.
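The lagged column is a single `shift` in pandas; a sketch on the six sales values shown:

```python
import pandas as pd

sales = pd.Series([25.4, 23.8, 22.0, 28.6, 28.5, 27.0])
lag_sales = sales.shift(1)   # first entry is NaN: that data point is lost

# Rows usable for the autoregression (drops the first observation)
usable = pd.DataFrame({"sales": sales, "lag_sales": lag_sales}).dropna()
```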
Autoregression

The lagged variable would replace the Time (trend) variable
The first data point is lost, as we don't have a lagged value for it
Seasonality can be handled by having another variable: Sales lagged by the seasonality period (e.g. 4 terms)
Logit regression

If the dependent variable is categorical, not metric
e.g. for accounting graduates, membership of CPA Aust (or not) is the dependent variable; the X variables might be gender, age, importance of joining cost, importance of brand status, etc.
Regression is possible, with special technical issues
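A minimal sketch of a logit model with scikit-learn; the data here are synthetic (membership probability rising with age is an invented pattern, used only to show the mechanics):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Hypothetical: one X variable (age), binary dependent variable (member or not)
age = rng.uniform(22.0, 60.0, 300)
p_member = 1.0 / (1.0 + np.exp(-(age - 40.0) / 5.0))
member = rng.binomial(1, p_member)

model = LogisticRegression().fit(age.reshape(-1, 1), member)
prob_at_50 = model.predict_proba([[50.0]])[0, 1]   # fitted membership probability
```

Unlike linear regression, the fitted values are probabilities between 0 and 1, so the model never predicts an impossible value for a 0-1 dependent variable.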
Reference
Ragsdale (2008), chapter 9, plus pp. 522-28.