Introduction to Regression Lecture 5.1

Diploma in Statistics: Introduction to Regression


1. Review

2. Transforming data, the log transform

i. liver fluke egg hatching rate

ii. explaining CEO remuneration

iii. brain weights and body weights

3. SLR with transformed data

4. Transforming X, quadratic fit

5. Other options



Using t values

Convention: n > 30 is big, n < 30 is small.

$z_{0.05} = 1.96 \approx 2$

$t_{30,\,0.05} = 2.04 \approx 2$


Selected critical values for the t-distribution

   ν     .25     .10     .05     .02     .01    .002    .001
   1    2.41    6.31   12.71   31.82   63.66  318.32  636.61
   2    1.60    2.92    4.30    6.96    9.92   22.33   31.60
   3    1.42    2.35    3.18    4.54    5.84   10.22   12.92
   4    1.34    2.13    2.78    3.75    4.60    7.17    8.61
   5    1.30    2.02    2.57    3.36    4.03    5.89    6.87
   6    1.27    1.94    2.45    3.14    3.71    5.21    5.96
   7    1.25    1.89    2.36    3.00    3.50    4.79    5.41
   8    1.24    1.86    2.31    2.90    3.36    4.50    5.04
   9    1.23    1.83    2.26    2.82    3.25    4.30    4.78
  10    1.22    1.81    2.23    2.76    3.17    4.14    4.59
  12    1.21    1.78    2.18    2.68    3.05    3.93    4.32
  15    1.20    1.75    2.13    2.60    2.95    3.73    4.07
  20    1.18    1.72    2.09    2.53    2.85    3.55    3.85
  24    1.18    1.71    2.06    2.49    2.80    3.47    3.75
  30    1.17    1.70    2.04    2.46    2.75    3.39    3.65
  40    1.17    1.68    2.02    2.42    2.70    3.31    3.55
  60    1.16    1.67    2.00    2.39    2.66    3.23    3.46
 120    1.16    1.66    1.98    2.36    2.62    3.16    3.37
   ∞    1.15    1.64    1.96    2.33    2.58    3.09    3.29
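The table entries, and the "approximately 2" convention, can be checked numerically. The following is a minimal sketch using only the Python standard library. It assumes the column headings are two-tailed probabilities (which agrees with z = 1.96 at .05). The helper names t_pdf, t_cdf, t_crit and z_crit are illustrative, and the integration step and number of bisection passes are arbitrary choices.

```python
import math
from statistics import NormalDist


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def t_cdf(x, df):
    """P(T <= x) for x >= 0, using Simpson's rule on [0, x] (step about 0.01 or less)."""
    steps = 2 * max(100, int(50 * x))  # Simpson's rule needs an even count
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3


def t_crit(df, alpha):
    """Two-tailed critical value: P(|T| > t) = alpha."""
    target = 1 - alpha / 2
    lo, hi = 0.0, 1.0
    while t_cdf(hi, df) < target:  # expand until the bracket contains the answer
        lo, hi = hi, 2 * hi
    for _ in range(40):  # bisection
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def z_crit(alpha):
    """Two-tailed critical value of the standard Normal distribution."""
    return NormalDist().inv_cdf(1 - alpha / 2)


for df in (10, 30, 120):
    print(df, round(t_crit(df, 0.05), 2))
print("Normal", round(z_crit(0.05), 2))
```

The printed values can be compared with the .05 column of the table; for 30 or more degrees of freedom they all lie close to 2.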


Homework 4.2.1

Quantify the extent of the recovery in Year 6, Q3.

P̂ = 1030 Q1 + 1292 Q2 + 1210 Q3 + 1279 Q4 + 33.7 Time

Year 6 Q2: P = 1657

P̂ = 1292 + 33.7 × 22 = 2033

P – P̂ = 1657 – 2033 = –376

Year 6 Q3: P = 2185

P̂ = 1210 + 33.7 × 23 = 1985

P – P̂ = 2185 – 1985 = 200
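The arithmetic for Homework 4.2.1 can be reproduced in a few lines. This is a sketch that uses the rounded coefficients quoted in the solution. The function names are illustrative, and the rule Time = 4 × (Year – 1) + Quarter is an assumption inferred from the values 22 and 23 used for Year 6, Q2 and Q3.

```python
# Rounded coefficients from the fitted equation quoted in the solution
quarter_coef = {1: 1030, 2: 1292, 3: 1210, 4: 1279}
time_coef = 33.7


def time_index(year, quarter):
    """Assumed numbering: Year 1 Q1 is Time 1, so Year 6 Q2 is Time 22."""
    return 4 * (year - 1) + quarter


def fitted(year, quarter):
    """Fitted value: quarter indicator coefficient plus trend."""
    return quarter_coef[quarter] + time_coef * time_index(year, quarter)


def residual(actual, year, quarter):
    """Actual minus fitted."""
    return actual - fitted(year, quarter)


for quarter, actual in ((2, 1657), (3, 2185)):
    print(f"Year 6 Q{quarter}: fitted = {fitted(6, quarter):.0f}, "
          f"residual = {residual(actual, 6, quarter):.0f}")
```

The change from a large negative residual in Q2 to a positive residual in Q3 is what quantifies the recovery.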



Homework 4.2.2

List correspondences between the output from the original regression and the output from the alternative regression.

Confirm that the coefficients of Q1, Q2 and Q3 in the original are the corresponding coefficients in the alternative with the Q4 coefficient added.


Predictor     Coef  SE Coef      T      P
Noconstant
Q1         1029.87    23.41  43.99  0.000
Q2         1292.35    24.45  52.85  0.000
Q3         1210.42    25.55  47.37  0.000
Q4         1278.70    26.71  47.88  0.000
Time        33.725    1.619  20.83  0.000

S = 40.9654

Predictor     Coef  SE Coef      T      P
Constant   1278.70    26.71  47.88  0.000
Q1         -248.82    26.36  -9.44  0.000
Q2           13.65    26.11   0.52  0.609
Q3          -68.27    25.96  -2.63  0.019
Time        33.725    1.619  20.83  0.000

S = 40.9654
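The correspondence asked for in Homework 4.2.2 can be checked directly from the two sets of output. The sketch below copies the coefficients from the output above; the variable names are illustrative. Small differences in the last decimal place are expected because the printed coefficients are rounded.

```python
# Original regression (no constant, four quarter indicators)
original = {"Q1": 1029.87, "Q2": 1292.35, "Q3": 1210.42, "Q4": 1278.70}

# Alternative regression (constant plus Q1, Q2, Q3 indicators)
alt_constant = 1278.70
alternative = {"Q1": -248.82, "Q2": 13.65, "Q3": -68.27}

# The Q4 coefficient in the original is the constant in the alternative
print("Q4 vs Constant:", original["Q4"], alt_constant)

# Each remaining original coefficient is the alternative coefficient plus the constant
for name, coef in alternative.items():
    rebuilt = alt_constant + coef
    print(name, round(rebuilt, 2), "vs", original[name])
```

The Time coefficient, its standard error and S are identical in the two fits, and the Q4 row of the original matches the Constant row of the alternative.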



Homework 4.2.3

1. Calculate the simple linear regressions of Jobtime on each of T_Ops and Units. Confirm the corresponding t-values.

2. Calculate the simple linear regression of Jobtime on Ops per Unit. Comment on the negative correlation of Jobtime with Ops per Unit in the light of the corresponding t-value.

3. Confirm the calculation of the R² values.



Solution 4.2.3

2. Calculate the simple linear regression of Jobtime on Ops per Unit. Comment on the negative correlation of Jobtime with Ops per Unit in the light of the corresponding t-value.

Comment: The t-value is insignificant; the negative correlation is just chance variation, with no substantive meaning.


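Variance Inflation Factors

$$SE(\hat{\beta}_k) = \frac{\sigma}{\sqrt{n}\, s_k} \times \frac{1}{\sqrt{1 - R_k^2}}$$

$$R_k^2 = 0 \;\Rightarrow\; SE(\hat{\beta}_k) = \frac{\sigma}{\sqrt{n}\, s_k}$$

$$\frac{1}{\sqrt{1 - R_k^2}} = \text{standard error inflation factor}$$

$$\frac{1}{1 - R_k^2} = \text{variance inflation factor}$$

Convention: problem if $R_k^2 > 90\%$ or $\mathrm{VIF}_k > 10$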
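A small numerical illustration of the two inflation factors may help. The sketch assumes the usual definition, not stated on the slide, that R_k² is the R² obtained when X_k is regressed on the other explanatory variables. The function names vif and se_inflation are illustrative.

```python
import math


def vif(r2_k):
    """Variance inflation factor 1 / (1 - R_k^2), for 0 <= R_k^2 < 1."""
    if not 0 <= r2_k < 1:
        raise ValueError("R_k^2 must lie in [0, 1)")
    return 1 / (1 - r2_k)


def se_inflation(r2_k):
    """Standard error inflation factor, the square root of the VIF."""
    return math.sqrt(vif(r2_k))


for r2 in (0.0, 0.5, 0.75, 0.9, 0.99):
    print(f"R_k^2 = {r2:4.2f}  VIF = {vif(r2):7.2f}  SE inflation = {se_inflation(r2):5.2f}")
```

The two forms of the convention agree: R_k² = 90% corresponds to a VIF of 10, that is, a standard error a little over three times what it would be with uncorrelated X variables.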



What to do?

• Get new X values, to break correlation pattern

– impractical in observational studies

• Choose a subset of the X variables

– manually

– automatically

• stepwise regression

• other methods



Residential load survey data.

Data collected by a US electricity supplier during an investigation of the factors that influence peak demand for electricity by residential customers.

Load is demand at the system peak demand hour, in kW,

Size (X1) is house size, in SqFt/1000,

Income (X2) is annual family income, in $/1000,

AirCon (X3) is air conditioning capacity, in tons,

Index (X4) is the house appliance index, in kW,

Residents (X5) is the number in the house on a typical day.



Matrix plot


Results

All variables in:

Predictor        Coef     SE Coef      T      P
Constant       0.1263      0.2289   0.55  0.585
Size          -2.6689      0.9059  -2.95  0.006
Income     0.00027912  0.00007892   3.54  0.001
AirCon        0.42462     0.03472  12.23  0.000
Index      0.00038137  0.00007884   4.84  0.000
Residents     0.00197     0.02218   0.09  0.930

Income deleted:

Predictor     Coef  SE Coef      T      P
Constant    -397.0    492.7  -0.81  0.426
Size       10943.3    594.2  18.42  0.000
AirCon       -1.86    75.45  -0.02  0.980
Index       0.0721   0.1709   0.42  0.676
Residents    38.65    47.75   0.81  0.424
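Each T value in the output is the coefficient divided by its standard error, so a change in SE Coef between the two fits feeds directly into the t-values. As a sketch, the ratios for the "All variables in" output can be recomputed as follows. The dictionary layout and the name t_ratio are illustrative, and small discrepancies reflect rounding of the printed figures.

```python
# (Coef, SE Coef, reported T) copied from the "All variables in" output
all_in = {
    "Constant": (0.1263, 0.2289, 0.55),
    "Size": (-2.6689, 0.9059, -2.95),
    "Income": (0.00027912, 0.00007892, 3.54),
    "AirCon": (0.42462, 0.03472, 12.23),
    "Index": (0.00038137, 0.00007884, 4.84),
    "Residents": (0.00197, 0.02218, 0.09),
}


def t_ratio(coef, se):
    """t-value for testing that the coefficient is zero."""
    return coef / se


for name, (coef, se, reported) in all_in.items():
    t = t_ratio(coef, se)
    flag = "significant" if abs(t) > 2 else "not significant"  # rule of thumb |t| > 2
    print(f"{name:10s} t = {t:6.2f} (reported {reported:6.2f})  {flag}")
```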



Exercise

Calculate the VIF for Size. Comment.

Homework

Calculate variance inflation factors for all explanatory variables. Discuss.


Multicollinearity

Arises when there is perfect correlation within the X variables.

Example: Indicators

Illustration: Minitab
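The indicator example can be illustrated without Minitab. The sketch below assumes the example refers to seasonal indicators such as Q1 to Q4 from the earlier homework; that link is an assumption, and the data layout and function name are illustrative. It shows that a constant column is an exact linear combination of the four indicator columns, which is why the constant and all four indicators cannot be fitted together.

```python
# Three years of quarterly observations (quarter labels only)
quarters = [1, 2, 3, 4] * 3

# Design rows: [constant, Q1, Q2, Q3, Q4]
rows = [[1] + [1 if q == k else 0 for k in (1, 2, 3, 4)] for q in quarters]


def dependence(rows):
    """constant - (Q1 + Q2 + Q3 + Q4) for every row; all zeros means exact collinearity."""
    return [row[0] - sum(row[1:]) for row in rows]


print(dependence(rows))
```

Every entry is zero, so one of the five columns is redundant. Dropping the constant gives the original regression of the homework; dropping Q4 gives the alternative.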



(i) Hatching of liver fluke eggs

The life cycle of the liver fluke

1. Adults in liver lay eggs

2. Animals excrete eggs

3. Eggs hatch on ground

4. Larvae seek snail

5. Development within snail

6. Emergence from snail

7. Consumption by animal

8. Penetration to liver


Hatching of liver fluke eggs: Duration and Success rate

Duration and success rate of hatching of 600 liver fluke eggs at a series of fixed temperatures

Temperature (°C)  Number hatched  Duration (mean days)    SD  Hatch %
              10             546                115.75  2.14     91.0
              13             543                 56.50  2.33     90.5
              16             534                 32.39  1.98     89.0
              18             501                 24.49  1.41     83.5
              20             499                 18.92  1.39     83.1
              22             497                 15.58  1.23     82.8
              24             465                 13.39  1.03     77.5
              26             448                 11.98  1.28     74.0
              28             438                 10.16  0.94     73.0
              30             432                  9.45  0.96     72.0
              32             256                 10.37  0.94     42.5
              34              42                 11.52  0.85      7.0
              35               0


[Figure: Scatterplot of Duration vs Temperature. x-axis: Temperature (10 to 35); y-axis: Duration (0 to 120).]


[Figure: Scatterplot of Log(Duration) vs Temperature. x-axis: Temperature (10 to 35); y-axis: Log(Duration) (1.0 to 2.2).]
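The transformed values behind the second plot can be computed from the hatching table. This sketch assumes base 10 logarithms, which is consistent with the plotted range of roughly 1.0 to 2.2 but is not stated on the slide. The list names are illustrative, and the 35 °C row is omitted because no eggs hatched.

```python
import math

# Temperature (°C) and mean duration (days) from the hatching table
temperature = [10, 13, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34]
duration = [115.75, 56.50, 32.39, 24.49, 18.92, 15.58,
            13.39, 11.98, 10.16, 9.45, 10.37, 11.52]

# Assumed transform: log to base 10
log_duration = [math.log10(d) for d in duration]

for t, d, ld in zip(temperature, duration, log_duration):
    print(f"{t:3d}  {d:7.2f}  {ld:5.3f}")
```

On the raw scale the durations span more than 100 days, with most of the change at the lowest temperatures; on the log scale the values span only about one unit.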


(ii) Explaining CEO Compensation and Company Sales (Forbes magazine, May 1994)

[Figure: Scatterplot of Total comp vs Sales. x-axis: Sales (0 to 140000); y-axis: Total comp (0 to 200000000).]


Explaining CEO Remuneration, bivariate log transformation

[Figure: Scatterplot of LogComp vs LogSales. x-axis: LogSales (2.0 to 5.5); y-axis: LogComp (4 to 8).]


(iii) Mammals' Brainweight vs Bodyweight

Species                    Bodyweight  Brainweight
African elephant                 6654         5712
African giant pouched rat           1          6.6
Arctic fox                      3.385         44.5
Arctic ground squirrel           0.92          5.7
Asian elephant                   2547         4603
Brachiosaurus                   87000        154.5
Baboon                          10.55        179.5
Big brown bat                   0.023          0.3
Brazilian tapir                   160          169
Cat                               3.3         25.6
Chimpanzee                      52.16          440
...                               ...          ...


Scatterplot view

[Figure: Scatterplot of Brainweight vs Bodyweight. x-axis: Bodyweight (0 to 90000); y-axis: Brainweight (0 to 6000).]


Scatterplot view, log transform

[Figure: Scatterplot of LBrainW vs LBodyW. x-axis: LBodyW (-3 to 5); y-axis: LBrainW (-1 to 4).]


Scatterplot view, Dinosaurs deleted

[Figure: Scatterplot of LBrainW vs LBodyW. x-axis: LBodyW (-3 to 4); y-axis: LBrainW (-1 to 4).]


Histogram view

[Figure: Histogram of Brainweight (x-axis: Brainweight, 0 to 6000; y-axis: Frequency) and Histogram of Bodyweight (x-axis: Bodyweight, 0 to 6000; y-axis: Frequency).]


Histogram view, log transform

[Figure: Histogram of LBrainW (x-axis: LBrainW, -1 to 4; y-axis: Frequency) and Histogram of LBodyW (x-axis: LBodyW, -2 to 4; y-axis: Frequency).]



Changing spread with log



Why the log transform works

High spread at high X is transformed to low spread at high Y.

Low spread at low X is transformed to high spread at low Y.


Why the log transform works

10 to 100 is transformed to $\log_{10}(10)$ to $\log_{10}(10^2)$, i.e., 1 to 2.

1/10 = 0.1 to 1/100 = 0.01 is transformed to $\log_{10}(10^{-1})$ to $\log_{10}(10^{-2})$, i.e., –1 to –2.
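The two intervals on this slide can be compared on both scales. The sketch below simply evaluates the base 10 logarithm at the end points; the function names are illustrative.

```python
import math


def raw_width(a, b):
    """Length of the interval on the original scale."""
    return abs(b - a)


def log_width(a, b):
    """Length of the interval after a log10 transform."""
    return abs(math.log10(b) - math.log10(a))


for a, b in ((10, 100), (0.1, 0.01)):
    print(f"{a} to {b}: raw width = {raw_width(a, b):.2f}, log width = {log_width(a, b):.2f}")
```

The raw widths differ by a factor of 1000, yet both intervals have width 1 on the log scale: the transform squeezes large values together and stretches small values apart.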



SLR with transformed data: LBrainW versus LBodyW

The regression equation is

LBrainW = 0.932 + 0.753 LBodyW

Predictor     Coef  SE Coef      T      P
Constant   0.93237  0.04170  22.36  0.000
LBodyW     0.75309  0.02858  26.35  0.000

S = 0.302949


Application: Do humans conform?

[Figure: Scatterplot of LBrainW vs LBodyW (x-axis: LBodyW, -3 to 4; y-axis: LBrainW, -1 to 4), with the Human point labelled.]




• Delete the Human data,

• calculate regression,

• predict human LBrainW and

• compare to actual, relative to s


Regression Analysis: LBrainW versus LBodyW

The regression equation is

LBrainW = 0.924 + 0.744 LBodyW

Predictor     Coef  SE Coef      T      P
Constant   0.92410  0.03933  23.50  0.000
LBodyW     0.74383  0.02706  27.48  0.000

S = 0.285036


LBodyW(Human) = 1.79239

LBrainW(Human) = 3.12057

Predicted LBrainW = 0.924 + 0.744 × 1.79239 = 2.25754

Residual = 3.12057 – 2.25754 = 0.86303

Residual / s = 0.86303 / 0.285036 = 3.03
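The calculation above can be repeated in code. The sketch uses the rounded coefficients from the fitted equation, as the slide does. The back-transformation at the end assumes the logs are to base 10, which is not stated in the output. Variable names are illustrative.

```python
# Fit with the Human case deleted (rounded coefficients, and S, from the output)
b0, b1 = 0.924, 0.744
s = 0.285036

# Human values on the log scale
lbodyw_human = 1.79239
lbrainw_human = 3.12057

predicted = b0 + b1 * lbodyw_human
residual = lbrainw_human - predicted
standardised = residual / s

# Assumption: base 10 logs, so 10**residual is the ratio actual / predicted brain weight
ratio = 10 ** residual

print(f"predicted LBrainW = {predicted:.5f}")
print(f"residual          = {residual:.5f}")
print(f"residual / s      = {standardised:.2f}")
print(f"actual / predicted brain weight = {ratio:.1f}")
```

Under the base 10 assumption, a residual of about 0.86 on the log scale corresponds to a brain roughly seven times heavier than the line predicts for that body weight.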



Deleted residuals

For each potentially exceptional case:

– delete the case

– calculate the regression from the rest

– use the fitted equation to calculate a deleted fitted value

– calculate deleted residual = observed value – deleted fitted value

Minitab does this automatically for all cases!
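The four steps listed above can be sketched for a simple linear regression as follows. This computes raw deleted residuals exactly as defined on the slide, without any standardisation. The data are made up for illustration (four points on the line y = 1 + 2x plus one discrepant case), and the function names are illustrative.

```python
def fit_line(xs, ys):
    """Least squares intercept and slope for a simple linear regression."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def residuals(xs, ys):
    """Ordinary residuals: every case is used in the fit."""
    b0, b1 = fit_line(xs, ys)
    return [y - (b0 + b1 * x) for x, y in zip(xs, ys)]


def deleted_residuals(xs, ys):
    """For each case: delete it, refit, and compare it with its deleted fitted value."""
    out = []
    for i in range(len(xs)):
        b0, b1 = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        out.append(ys[i] - (b0 + b1 * xs[i]))
    return out


# Hypothetical data: the last case lies well above the line through the others
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 12.0]

print([round(r, 2) for r in residuals(xs, ys)])
print([round(r, 2) for r in deleted_residuals(xs, ys)])
```

For the last case the ordinary residual is 1.2 but the deleted residual is 3.0: when the exceptional case is included in the fit it pulls the line towards itself and partly hides its own discrepancy.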


With 63 cases, we do not expect to see any cases with residuals exceeding 3 standard deviations.

On the other hand, recalling the scatter plot, the humans do not appear particularly exceptional. The dotplot view of deleted residuals emphasises this:

[Figure: dotplot of deleted residuals, with Human and Water Opossum labelled.]

Water opossums appear more exceptional.
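The statement about 63 cases can be given a rough number. The sketch below treats the standardised residuals as independent standard Normal values, which is only an approximation (deleted residuals strictly follow a t-distribution), and counts how many are expected to fall more than 3 standard deviations from zero. The function name is illustrative.

```python
from statistics import NormalDist


def expected_beyond(n_cases, k):
    """Expected number of n_cases standard Normal values with |z| > k."""
    tail = 2 * (1 - NormalDist().cdf(k))
    return n_cases * tail


print(round(expected_beyond(63, 3), 2))
print(round(expected_beyond(63, 2), 2))
```

Under this approximation the expected number of cases beyond 3 standard deviations is well below one, whereas a few cases beyond 2 standard deviations would be unremarkable.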


[Figure: Probability Plot of Deleted Residuals. x-axis: Score; y-axis: Deleted Residuals. AD = 0.385, P-Value = 0.383.]



Optimising a nicotine extraction process

In determining the quantity of nicotine in different samples of tobacco, temperature is a key variable in optimising the extraction process. A study of this phenomenon involving analysis of 18 samples produced these data.


Regression Analysis: Nicotine versus Temperature

The regression equation is

Nicotine = 2.61 + 0.0247 Temperature

Predictor        Coef   SE Coef      T      P
Constant       2.6086    0.2121  12.30  0.000
Temperature  0.024656  0.003579   6.89  0.000

S = 0.217412   R-Sq = 74.8%



Optimising a nicotine extraction process, quadratic fit

[Figure: Scatterplot of Nicotine vs Temperature. x-axis: Temperature (30 to 90); y-axis: Nicotine (3.0 to 4.6).]


The regression equation is

Nicotine = 1.20 + 0.0767 Temperature - 0.000453 Temp-sqr

Predictor         Coef    SE Coef      T      P
Constant        1.2041     0.6312   1.91  0.076
Temperature    0.07674    0.02257   3.40  0.004
Temp-sqr    -0.0004529  0.0001943  -2.33  0.034

S = 0.192398   R-Sq = 81.5%
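Since the aim is to optimise the extraction, one natural use of the quadratic fit is to locate the temperature at which the fitted curve peaks. The sketch below takes the fitted coefficients at face value and uses the vertex formula for a quadratic. This calculation is not on the slides, and the names are illustrative.

```python
# Coefficients from the quadratic fit above
b0, b1, b2 = 1.2041, 0.07674, -0.0004529


def fitted(temperature):
    """Fitted nicotine value from the quadratic equation."""
    return b0 + b1 * temperature + b2 * temperature ** 2


# A quadratic with negative curvature peaks where its derivative b1 + 2*b2*T is zero
peak_temperature = -b1 / (2 * b2)

print(f"peak at T = {peak_temperature:.1f}, fitted nicotine = {fitted(peak_temperature):.2f}")
for t in (30, 50, 70, 90):
    print(t, round(fitted(t), 2))
```

The fitted peak lies near the upper end of the temperature range shown in the scatterplot (30 to 90), so its location should be read with caution; both the linear and squared terms carry sizeable standard errors.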



Optimising a nicotine extraction process, quadratic fit, case 5 excluded

The regression equation is

Nicotine = 1.21 + 0.0750 Temperature - 0.000419 Temp-sqr

Predictor         Coef    SE Coef      T      P
Constant        1.2096     0.5129   2.36  0.033
Temperature    0.07504    0.01835   4.09  0.001
Temp-sqr    -0.0004189  0.0001583  -2.65  0.019

S = 0.156321   R-Sq = 88.6%



5. Other options

• Other functions,

– e.g., $1/Y$, $\sqrt{Y}$, $Y^2$, etc., same for X

• Generalised linear models,

– choose a function of Y, a model for

• etc.



Reading

EM Section 6.7.1

Hamilton, Ch. 5

Extra Notes: More on log
