Diploma in Statistics
Introduction to Regression, Lecture 5.1
1. Review
2. Transforming data, the log transform
i. liver fluke egg hatching rate
ii. explaining CEO remuneration
iii. brain weights and body weights
3. SLR with transformed data
4. Transforming X, quadratic fit
5. Other options
Using t values
Convention: n > 30 is big, n < 30 is small.
$z_{0.05} = 1.96 \approx 2$
$t_{30,\,0.05} = 2.04 \approx 2$
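The two critical values quoted here can be reproduced numerically. The sketch below is one way to do it with the Python standard library only; it treats 0.05 as a two-sided tail probability (which is what the quoted 1.96 implies), and the helper names `t_pdf`, `t_cdf` and `t_critical` are illustrative, not part of the lecture.

```python
import math
from statistics import NormalDist


def t_pdf(x: float, df: int) -> float:
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def t_cdf(x: float, df: int, steps: int = 2000) -> float:
    """P(T <= x) for x >= 0, via Simpson's rule on [0, x] plus symmetry."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3


def t_critical(alpha: float, df: int) -> float:
    """Two-sided critical value: P(|T| > t) = alpha, found by bisection."""
    lo, hi = 0.0, 50.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < 1 - alpha / 2:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


z_crit = NormalDist().inv_cdf(1 - 0.05 / 2)   # two-sided 5% point of N(0, 1)
t_crit = t_critical(0.05, df=30)
print(f"z = {z_crit:.2f}, t(30) = {t_crit:.2f}")
```

Both values round to 2, which is the point of the convention: for moderate n, "about 2 standard errors" is a workable rule of thumb.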
Selected critical values for the t-distribution .25 .10 .05 .02 .01 .002 .001
1. Calculate the simple linear regressions of Jobtime on each of T_Ops and Units. Confirm the corresponding t-values.
2. Calculate the simple linear regression of Jobtime on Ops per Unit. Comment on the negative correlation of Jobtime with Ops per Unit in the light of the corresponding t-value.
3. Confirm the calculation of the R² values.
Solution 4.2.3
2. Calculate the simple linear regression of Jobtime on Ops per Unit. Comment on the negative correlation of Jobtime with Ops per Unit in the light of the corresponding t-value.
Comment: The t-value is not statistically significant, so the negative correlation is consistent with chance variation and has no substantive meaning.
Variance Inflation Factors
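$$\mathrm{SE}(\hat{\beta}_k) = \frac{\sigma}{\sqrt{n}\, s_k} \times \frac{1}{\sqrt{1 - R_k^2}}$$
When $R_k^2 = 0$: $\mathrm{SE}(\hat{\beta}_k) = \frac{\sigma}{\sqrt{n}\, s_k}$
Standard error inflation factor: $\frac{1}{\sqrt{1 - R_k^2}}$
Variance inflation factor: $\mathrm{VIF}_k = \frac{1}{1 - R_k^2}$
Convention: problem if $R_k^2 > 90\%$ or $\mathrm{VIF}_k > 10$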
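The two thresholds in the convention are the same statement: an R² of 90% for the k-th explanatory variable corresponds to a variance inflation factor of 1 / (1 − 0.9) = 10. A minimal sketch of that arithmetic follows; the function name `vif` and the sample R² values are illustrative only.

```python
import math


def vif(r_squared_k: float) -> float:
    """Variance inflation factor 1 / (1 - R_k^2)."""
    if not 0.0 <= r_squared_k < 1.0:
        raise ValueError("R_k^2 must lie in [0, 1)")
    return 1.0 / (1.0 - r_squared_k)


# Illustrative R_k^2 values; 0.9 is the conventional threshold.
for r2 in (0.0, 0.5, 0.9, 0.99):
    v = vif(r2)
    print(f"R_k^2 = {r2:.2f}  VIF = {v:6.1f}  SE inflation = {math.sqrt(v):5.2f}")
```

At the threshold the variance is inflated tenfold, so the standard error is inflated by a factor of about 3.2.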
What to do?
• Get new X values, to break correlation pattern
– impractical in observational studies
• Choose a subset of the X variables
– manually
– automatically
• stepwise regression
• other methods
Residential load survey data.
Data collected by a US electricity supplier during an investigation of the factors that influence peak demand for electricity by residential customers.
Load is demand at system peak demand hour, (kW)
Size (X1) is house size, in SqFt/1000,
Income (X2) is annual family income, in $/1000,
AirCon (X3) is air conditioning capacity, in tons,
Index (X4) is the house appliance index, in kW,
Residents (X5) is number in house on a typical day
(ii) Explaining CEO Compensation and Company Sales
(Forbes magazine, May 1994)
Explaining CEO Remuneration, bivariate log transformation
[Figure: Scatterplot of LogComp vs LogSales]
(iii) Mammals' Brainweight vs Bodyweight
Species                     Bodyweight  Brainweight
African elephant            6654        5712
African giant pouched rat   1           6.6
Arctic fox                  3.385       44.5
Arctic ground squirrel      0.92        5.7
Asian elephant              2547        4603
Brachiosaurus               87000       154.5
Baboon                      10.55       179.5
Big brown bat               0.023       0.3
Brazilian tapir             160         169
Cat                         3.3         25.6
Chimpanzee                  52.16       440
...                         ...         ...
Scatterplot view
[Figure: Scatterplot of Brainweight vs Bodyweight]
Scatterplot view, log transform
[Figure: Scatterplot of LBrainW vs LBodyW]
Scatterplot view, dinosaurs deleted
[Figure: Scatterplot of LBrainW vs LBodyW]
Histogram view
[Figures: Histogram of Brainweight; Histogram of Bodyweight]
Histogram view, log transform
[Figures: Histogram of LBrainW; Histogram of LBodyW]
Changing spread with log
Why the log transform works
High spread at high X is transformed to low spread at high Y.
Low spread at low X is transformed to high spread at low Y.
Why the log transform works
10 to 100 is transformed to $\log_{10}(10)$ to $\log_{10}(10^2)$, i.e., 1 to 2.
1/10 = 0.1 to 1/100 = 0.01 is transformed to $\log_{10}(10^{-1})$ to $\log_{10}(10^{-2})$, i.e., −1 to −2.
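The two numeric examples can be checked directly. The sketch below compares the width of each range on the raw scale with its width on the log10 scale; the helper name `log10_width` is illustrative only.

```python
import math


def log10_width(a: float, b: float) -> float:
    """Width of the interval between a and b after a log10 transform."""
    return abs(math.log10(b) - math.log10(a))


# Each range spans one factor of 10 on the raw scale.
ranges = {"10 to 100": (10, 100), "0.1 to 0.01": (0.1, 0.01)}

for label, (a, b) in ranges.items():
    raw_width = abs(b - a)
    print(f"{label}: raw width = {raw_width:g}, "
          f"log10 width = {log10_width(a, b):g}")
```

The raw widths differ by a factor of 1000 (90 against 0.09), yet both ranges have width 1 after the transform: the log compresses spread among large values and stretches it among small ones.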
SLR with transformed data: LBrainW versus LBodyW
The regression equation is
LBrainW = 0.932 + 0.753 LBodyW
Predictor  Coef     SE Coef  T      P
Constant   0.93237  0.04170  22.36  0.000
LBodyW     0.75309  0.02858  26.35  0.000
S = 0.302949
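Two things in this output can be checked by hand: each T value is the coefficient divided by its standard error, and the fitted line on the log scale corresponds to a power law on the original scale. The sketch below assumes the logs are base 10, as in the earlier log10 examples; the function name `predicted_brainweight` is illustrative, not from the lecture.

```python
# Values copied from the regression output above.
coef = {"Constant": 0.93237, "LBodyW": 0.75309}
se = {"Constant": 0.04170, "LBodyW": 0.02858}

# Each T value is the coefficient divided by its standard error.
t_values = {name: coef[name] / se[name] for name in coef}
for name, t in t_values.items():
    print(f"{name}: t = {t:.2f}")

# Assumption: logs are base 10, so on the raw scale the fit reads
# Brainweight = 10**b0 * Bodyweight**b1.
multiplier = 10 ** coef["Constant"]
exponent = coef["LBodyW"]


def predicted_brainweight(bodyweight: float) -> float:
    """Back-transformed prediction on the original scale."""
    return multiplier * bodyweight ** exponent


print(f"Brainweight = {multiplier:.2f} x Bodyweight^{exponent:.3f}")
```

Read this way, the slope of about 0.75 says that brain weight grows more slowly than body weight: a tenfold increase in body weight goes with roughly a 10^0.75 (about 5.7-fold) increase in predicted brain weight.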
Application: Do humans conform?
[Figure: Scatterplot of LBrainW vs LBodyW, with the Human case labelled]
Application: Do humans conform?
• Delete the Human data,
• calculate regression,
• predict human LBrainW and
• compare to actual, relative to s
Application: Do humans conform?
Regression Analysis: LBrainW versus LBodyW
The regression equation is
LBrainW = 0.924 + 0.744 LBodyW
Predictor  Coef     SE Coef  T      P
Constant   0.92410  0.03933  23.50  0.000
LBodyW     0.74383  0.02706  27.48  0.000
S = 0.285036
Application: Do humans conform?
LBodyW(Human) = 1.79239
LBrainW(Human) = 3.12057
Predicted LBrainW = 0.924 + 0.744 × 1.79239 = 2.25754
Residual = 3.12057 − 2.25754 = 0.86303
Residual / s = 0.86303 / 0.285036 = 3.03
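The arithmetic on this slide can be replayed in a few lines. The sketch uses the rounded coefficients exactly as the slide does; the final line, which converts the residual back to a ratio on the original scale, assumes base-10 logs and is an added illustration rather than part of the lecture.

```python
# Values copied from the slides (fit with the Human case deleted).
b0, b1 = 0.924, 0.744          # rounded coefficients, as used on the slide
s = 0.285036                   # residual standard deviation
lbodyw_human = 1.79239
lbrainw_human = 3.12057

predicted = b0 + b1 * lbodyw_human
residual = lbrainw_human - predicted
standardised = residual / s

print(f"predicted    = {predicted:.5f}")
print(f"residual     = {residual:.5f}")
print(f"residual / s = {standardised:.2f}")

# Assumption: base-10 logs, so the residual is a log10 ratio of
# actual to predicted brain weight.
ratio = 10 ** residual
print(f"actual / predicted brain weight = {ratio:.1f}")
```

A residual of about 3 standard deviations on the log scale corresponds, under that assumption, to a human brain roughly seven times heavier than the fitted line predicts for a mammal of human body weight.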
Deleted residuals
For each potentially exceptional case:
– delete the case
– calculate the regression from the rest
– use the fitted equation to calculate a deleted fitted value
– calculate the deleted residual = observed value – deleted fitted value
Minitab does this automatically for all cases!
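The steps in the list can be sketched directly for simple linear regression. The code below is a plain leave-one-out loop, not Minitab's implementation; the data are invented for illustration, the function names are hypothetical, and the residuals are the raw deleted residuals defined above, without any standardisation that software may apply.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def deleted_residuals(xs, ys):
    """For each case: refit without it, then compare it to the refitted line."""
    out = []
    for i in range(len(xs)):
        rest_x = xs[:i] + xs[i + 1:]
        rest_y = ys[:i] + ys[i + 1:]
        b0, b1 = fit_line(rest_x, rest_y)
        deleted_fit = b0 + b1 * xs[i]
        out.append(ys[i] - deleted_fit)
    return out


# Hypothetical data, invented for illustration; the last case sits off the line.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 9.0]

for xi, d in zip(x, deleted_residuals(x, y)):
    print(f"x = {xi:.0f}: deleted residual = {d:+.2f}")
```

Because the exceptional case is excluded from the fit that judges it, it cannot pull the line towards itself and so hide its own unusualness.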
Application: Do humans conform?
With 63 cases, we do not expect to see any cases with residuals exceeding 3 standard deviations.
On the other hand, recalling the scatterplot, the humans do not appear particularly exceptional. The dotplot view of deleted residuals emphasises this:
[Figure: Dotplot of deleted residuals, with the Human and Water Opossum cases labelled]
Water opossums appear more exceptional.
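The statement that no residual beyond 3 standard deviations is expected among 63 cases can be quantified. The sketch below assumes the standardised residuals behave like independent standard Normal values, which is only an approximation.

```python
from statistics import NormalDist

n_cases = 63
# Two-sided tail probability of a standard Normal beyond +/- 3.
p_beyond_3 = 2 * (1 - NormalDist().cdf(3))
expected_count = n_cases * p_beyond_3
# Chance that at least one of n independent Normal values exceeds 3 in size.
p_at_least_one = 1 - (1 - p_beyond_3) ** n_cases

print(f"P(|Z| > 3) = {p_beyond_3:.4f}")
print(f"expected number among {n_cases} cases = {expected_count:.2f}")
print(f"P(at least one) = {p_at_least_one:.2f}")
```

The expected count is well below 1, so a residual of this size would be unusual but, under these assumptions, not impossible.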
Application: Do humans conform?
[Figure: Probability Plot of Deleted Residuals (Deleted Residuals vs Score); AD = 0.385, P-Value = 0.383]
Optimising a nicotine extraction process
In determining the quantity of nicotine in different samples of tobacco, temperature is a key variable in optimising the extraction process. A study of this phenomenon involving analysis of 18 samples produced these data.
Optimising a nicotine extraction process
Regression Analysis: Nicotine versus Temperature
The regression equation is
Nicotine = 2.61 + 0.0247 Temperature
Predictor    Coef      SE Coef   T      P
Constant     2.6086    0.2121    12.30  0.000
Temperature  0.024656  0.003579  6.89   0.000
S = 0.217412   R-Sq = 74.8%
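In simple linear regression the slope's t value and R² carry the same information, linked by R² = t² / (t² + n − 2). The sketch below uses that identity to cross-check the output, taking n = 18 from the study description.

```python
n = 18                            # number of samples, from the study description
coef, se = 0.024656, 0.003579     # Temperature row of the output

t = coef / se
# In simple linear regression, R^2 = t^2 / (t^2 + n - 2) for the slope's t value.
r_squared = t ** 2 / (t ** 2 + (n - 2))

print(f"t = {t:.2f}")
print(f"R-Sq = {100 * r_squared:.1f}%")
```

The identity holds only with a single explanatory variable; it does not carry over to the quadratic fit, which has two.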
Optimising a nicotine extraction process, quadratic fit
[Figure: Scatterplot of Nicotine vs Temperature]
Optimising a nicotine extraction process, quadratic fit
The regression equation is
Nicotine = 1.20 + 0.0767 Temperature - 0.000453 Temp-sqr
Predictor    Coef        SE Coef    T      P
Constant     1.2041      0.6312     1.91   0.076
Temperature  0.07674     0.02257    3.40   0.004
Temp-sqr     -0.0004529  0.0001943  -2.33  0.034
S = 0.192398   R-Sq = 81.5%
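Since the aim is to optimise the extraction, one natural use of the quadratic fit is to locate its turning point. The sketch below does this from the printed coefficients; it is an added illustration rather than a step shown in the lecture, and it takes the fitted curve at face value.

```python
# Coefficients copied from the quadratic fit above.
b0, b1, b2 = 1.2041, 0.07674, -0.0004529


def fitted_nicotine(temperature: float) -> float:
    """Fitted value from the quadratic regression equation."""
    return b0 + b1 * temperature + b2 * temperature ** 2


# Setting the derivative b1 + 2*b2*T to zero gives the turning point.
t_opt = -b1 / (2 * b2)

print(f"turning point at Temperature = {t_opt:.1f}")
print(f"fitted Nicotine there = {fitted_nicotine(t_opt):.2f}")
```

The turning point comes out at a temperature of roughly 85, close to the upper end of the temperatures plotted, so it rests on few nearby observations and should be read as a rough indication only.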
Optimising a nicotine extraction process, quadratic fit, case 5 excluded
The regression equation is
Nicotine = 1.21 + 0.0750 Temperature - 0.000419 Temp-sqr
Predictor    Coef        SE Coef    T      P
Constant     1.2096      0.5129     2.36   0.033
Temperature  0.07504     0.01835    4.09   0.001
Temp-sqr     -0.0004189  0.0001583  -2.65  0.019
S = 0.156321   R-Sq = 88.6%