Bio 4118 Applied Biostatistics, Université d'Ottawa / University of Ottawa (2001)
Lecture 13: Multiple linear regression

- When and why we use it
- The general multiple regression model
- Hypothesis testing in multiple regression
- The problem of multicollinearity
- Multiple regression procedures
- Polynomial regression
- Power analysis in multiple regression
Some GLM procedures

Procedure                       Dependent variable   Independent variable(s)
Simple regression               1 continuous         1 continuous
Single-classification ANOVA     1 continuous         1 categorical*
Multiple-classification ANOVA   1 continuous         2 or more categorical*
ANCOVA                          1 continuous         At least 1 categorical*, at least 1 continuous
Multiple regression             1 continuous         2 or more continuous

*either categorical or treated as a categorical variable
When do we use multiple regression?

To examine the relationship between a continuous dependent (Y) variable and several continuous independent (X1, X2, ...) variables, e.g. the relationship between lake primary production, phosphorus concentration and zooplankton abundance.

[Figure: log primary production plotted against log phosphorus concentration (Log [P]) and log zooplankton abundance (Log [Zoo]).]
The multiple regression model: general form

The general model is:

$$Y_i = \alpha + \sum_{j=1}^{k} \beta_j X_{ij} + \varepsilon_i$$

which defines a k-dimensional plane, where α is the intercept, β_j is the partial regression coefficient of Y on X_j, X_ij is the value of the ith observation of the independent variable X_j, and ε_i is the residual of the ith observation.

[Figure: observations Y and the fitted plane Ŷ as a function of X1 and X2.]
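As a minimal illustration of fitting this model (not part of the original lecture), the sketch below estimates α and the β_j by ordinary least squares on simulated data; the variable names and coefficient values are made up.

```python
import numpy as np

# Simulated data for a hypothetical two-predictor model (illustrative only)
rng = np.random.default_rng(1)
n = 50
x1 = rng.normal(size=n)                      # e.g. log phosphorus concentration
x2 = rng.normal(size=n)                      # e.g. log zooplankton abundance
y = 1.0 + 0.8 * x1 + 0.3 * x2 + rng.normal(scale=0.5, size=n)

# Design matrix with a leading column of 1s for the intercept (alpha)
X = np.column_stack([np.ones(n), x1, x2])

# Least-squares estimates of alpha, beta_1, beta_2 and the residuals epsilon_i
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ coef
print("alpha, beta1, beta2 =", coef)
```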
What is the partial regression coefficient anyway?

β_j is the rate of change in Y per unit change in X_j with all other variables held constant; this is not the slope of the regression of Y on X_j pooled over all other variables!

[Figure: Y versus X1 at X2 = 3, 1, -1 and -3. The parallel lines fitted at each fixed value of X2 show the partial regression of Y on X1; the single line fitted to all points, ignoring X2, shows the simple (pooled) regression, which has a quite different slope.]
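A small simulation (not from the lecture) makes the distinction concrete: when X1 and X2 are correlated and both affect Y, the pooled slope of Y on X1 can be very different from the partial regression coefficient. The data and coefficients below are invented for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x2 = rng.normal(size=n)
x1 = 0.9 * x2 + rng.normal(scale=0.3, size=n)        # X1 correlated with X2
y = 2.0 * x1 - 3.0 * x2 + rng.normal(scale=0.5, size=n)

# Pooled (simple) slope of Y on X1, ignoring X2
pooled = np.polyfit(x1, y, 1)[0]

# Partial coefficient of X1, holding X2 constant (multiple regression)
X = np.column_stack([np.ones(n), x1, x2])
partial = np.linalg.lstsq(X, y, rcond=None)[0][1]

print(f"pooled slope  ~ {pooled:.2f}")   # far from the true 2.0, because X2 is ignored
print(f"partial slope ~ {partial:.2f}")  # close to the true 2.0
```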
The effect of scale

Two independent variables on different scales will have different slopes, even if the proportional change in Y is the same. So, if we want to measure the relative strength of the influence of each variable on Y, we must eliminate the effect of different scales.

[Figure: Y versus X_j on two scales. When X_j runs from 0 to 2 the slope is β_j = 2; when the same relationship is plotted with X_j running from 0 to 200, the slope is β_j = 0.02.]
The multiple regression model: standardized form

Since β_j depends on the scale of X_j, to examine the relative effect of each independent variable we must standardize the regression coefficients: first transform all variables to z-scores (subtract the mean, divide by the standard deviation), then fit the regression model to the transformed variables. The standardized coefficients β*_j estimate the relative strength of the influence of variable X_j on Y.

Partial regression coefficient (β_j): the slope of the regression of Y on X_j when all other independent variables are held constant.

Standardized partial regression coefficient (β*_j): the rate of change of Y in standard deviation units per one standard deviation of X_j, with all other independent variables held constant.
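As an illustration (not from the lecture), standardized coefficients can be obtained by z-transforming every variable before fitting; the data and scales below are made up to show that the standardized coefficients are comparable even when the raw predictors are on very different scales.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
x1 = rng.normal(10, 2, n)          # small-scale predictor
x2 = rng.normal(1000, 200, n)      # large-scale predictor
y = 5 + 3.0 * x1 + 0.01 * x2 + rng.normal(0, 2, n)

def zscore(v):
    return (v - v.mean()) / v.std(ddof=1)

# After z-transforming, no intercept is needed and the coefficients are the beta*_j
Z = np.column_stack([zscore(x1), zscore(x2)])
b_std, *_ = np.linalg.lstsq(Z, zscore(y), rcond=None)
print("standardized coefficients:", b_std)
```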
Assumptions

- independence of residuals
- homoscedasticity of residuals
- linearity (Y on all X)
- no error on the independent variables
- normality of residuals
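A minimal sketch (not part of the original slides) of how two of these assumptions might be checked once a model has been fitted, assuming the residuals and fitted values are available as NumPy arrays:

```python
import numpy as np
from scipy import stats

def check_residuals(resid, fitted):
    # Normality of residuals: Shapiro-Wilk test
    w, p_norm = stats.shapiro(resid)
    # Crude homoscedasticity check: correlation between |residuals| and fitted values
    r, p_hom = stats.pearsonr(np.abs(resid), fitted)
    print(f"Shapiro-Wilk p = {p_norm:.3f} (small p suggests non-normal residuals)")
    print(f"|resid| vs fitted: r = {r:.2f}, p = {p_hom:.3f} (large |r| suggests heteroscedasticity)")
```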
Hypothesis testing in simple linear regression: partitioning the total sums of squares

Total SS = Model (explained) SS + Unexplained (error) SS:

$$\sum_{i=1}^{N}(Y_i - \bar{Y})^2 = \sum_{i=1}^{N}(\hat{Y}_i - \bar{Y})^2 + \sum_{i=1}^{N}(Y_i - \hat{Y}_i)^2$$
Hypothesis testing in multiple regression I: partitioning the total sums of squares

Partition the total sums of squares into model and residual SS:

$$SS_{\text{total}} = \sum_{i=1}^{N}(Y_i - \bar{Y})^2,\qquad SS_{\text{model}} = \sum_{i=1}^{N}(\hat{Y}_i - \bar{Y})^2,\qquad SS_{\text{error}} = \sum_{i=1}^{N}(Y_i - \hat{Y}_i)^2$$

[Figure: observations scattered about the fitted plane in (Y, X1, X2) space, illustrating the total, model and residual SS.]
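The partition can be computed directly from a fitted model. The sketch below (illustrative, not the lecture's software) builds the ANOVA quantities and the overall F-test for a regression with k predictors; X is assumed to contain a leading column of 1s.

```python
import numpy as np
from scipy import stats

def regression_anova(X, y):
    n, p = X.shape                                   # p = k + 1 (intercept included)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    yhat = X @ coef
    ss_model = np.sum((yhat - y.mean()) ** 2)
    ss_error = np.sum((y - yhat) ** 2)
    df_model, df_error = p - 1, n - p
    F = (ss_model / df_model) / (ss_error / df_error)
    p_value = stats.f.sf(F, df_model, df_error)
    return ss_model, ss_error, F, p_value
```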
Hypothesis testing I: partitioning the total sums of squares

If observed equals expected for all i (a perfect fit), then MS_error = 0 and the model accounts for all of the variance s²_Y in Y. Calculate F = MS_model / MS_error and compare it with the F distribution with 1 and N − 2 df; under H0 (no relationship), the expected F is 1.

If two independent variables X1 and X2 are uncorrelated, then the model sums of squares for a linear model with both included equals the sum of the SS_model for each considered separately. But if they are correlated, the former will be less than the latter. So the real question is: given a model with X1 included, how much does SS_model increase when X2 is also included (or vice versa)?
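The extra-sums-of-squares idea can be shown with a short simulation (illustrative only; the data are invented): fit the model with X1 alone, then with X1 and X2, and look at how much SS_model increases.

```python
import numpy as np

def ss_model(X, y):
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    yhat = X @ coef
    return np.sum((yhat - y.mean()) ** 2)

rng = np.random.default_rng(4)
n = 500
x1 = rng.normal(size=n)
x2 = 0.7 * x1 + rng.normal(scale=0.7, size=n)      # X2 correlated with X1
y = 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)
ones = np.ones(n)

ss_x1 = ss_model(np.column_stack([ones, x1]), y)
ss_both = ss_model(np.column_stack([ones, x1, x2]), y)
print("SS_model with X1 only:  ", round(ss_x1, 1))
print("extra SS from adding X2:", round(ss_both - ss_x1, 1))
```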
Detecting multicollinearity: warning signs include

- a high R² but few or no significant t-tests for the individual independent variables
- high pairwise correlations between the X's
- high partial correlations among regressors (some independent variables are nearly a linear combination of others)
- diagnostic statistics: eigenvalues, the condition index, tolerance and variance inflation factors
Quantifying the effect of multicollinearity

Eigenvectors: a set of "lines" 1, 2, ..., k in a k-dimensional space which are orthogonal to each other.

Eigenvalue (λ_j): the magnitude (length) of the corresponding eigenvector.

[Figure: scatter of X2 versus X1 with the two orthogonal eigenvectors of the data cloud, of lengths λ1 and λ2, drawn in.]
Quantifying the effect of multicollinearity

Eigenvalues: if all k eigenvalues are approximately equal, multicollinearity is low.

Condition index: √(λ_largest / λ_smallest); values near 1 indicate low multicollinearity.

Tolerance: 1 minus the proportion of variance in each independent variable accounted for by all the other independent variables; values near 1 indicate low multicollinearity.

[Figure: X1 versus X2 scatters for low correlation (λ1 = λ2) and high correlation (λ1 >> λ2).]
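These diagnostics are easy to compute. The sketch below (illustrative, not the lecture's software) takes a matrix of predictors and returns the eigenvalues of their correlation matrix, the condition index, and the tolerance and variance inflation factor of each predictor.

```python
import numpy as np

def collinearity_diagnostics(X):
    Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)     # standardize columns
    R = np.corrcoef(Xs, rowvar=False)                      # correlation matrix of the X's
    eig = np.linalg.eigvalsh(R)
    condition_index = np.sqrt(eig.max() / eig.min())
    tol = []
    for j in range(X.shape[1]):
        # tolerance of X_j = 1 - R^2 of X_j regressed on all the other X's
        others = np.column_stack([np.ones(len(X)), np.delete(Xs, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, Xs[:, j], rcond=None)
        resid = Xs[:, j] - others @ coef
        tol.append(resid.var() / Xs[:, j].var())
    tol = np.array(tol)
    vif = 1.0 / tol
    return eig, condition_index, tol, vif
```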
Remedial measures

- Get more data to reduce correlations.
- Drop some variables.
- Use principal component or ridge regression, which yield biased estimates but with smaller standard errors.
Multiple regression: the general idea

Evaluate the significance of a variable by fitting two models: one with the term in, the other with it removed. Test for the change in model fit (ΔMF) associated with removal of the term in question. Unfortunately, ΔMF may depend on which other variables are in the model if there is multicollinearity!
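A hedged sketch of this comparison using the statsmodels library (the data and variable names are simulated, not from the lecture): fit the full and reduced models and test the change in fit with an F-test.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(5)
n = 60
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
df["y"] = 1 + 0.5 * df.x1 + 0.4 * df.x2 + rng.normal(scale=0.8, size=n)

full = smf.ols("y ~ x1 + x2", data=df).fit()
reduced = smf.ols("y ~ x1", data=df).fit()      # the term x2 removed

# F-test for the change in model fit when x2 is dropped
print(anova_lm(reduced, full))
```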
Strategy II: forward selection

Start with the variable that has the highest (significant) R², i.e. the highest partial correlation coefficient r. Add the others one at a time, with the β_j's recomputed at each step, until there is no further significant increase in R². Problem: once X_j is included, it stays in even if it contributes little to SS_model after other variables are included. (A code sketch of the procedure follows the diagram.)

[Diagram: with r2 > r1 > r3, forward selection from the candidate set {X1, X2, X3} starts with {X2}; X1 is added if R²_{12} − R²_2 is significant, then X3 is added if R²_{123} − R²_{12} is significant, giving the final model.]
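A simplified forward-selection sketch (illustrative only; it is not the stepwise routine used in the lecture's software): at each step the candidate with the smallest p-value is added, stopping when no candidate falls below a chosen alpha-to-enter. The function assumes a pandas DataFrame with hypothetical column names.

```python
import statsmodels.formula.api as smf

def forward_select(data, response, candidates, alpha_to_enter=0.10):
    """Greedy forward selection based on the p-value of each candidate term."""
    candidates = list(candidates)
    selected = []
    while candidates:
        best_p, best_var = 1.0, None
        for var in candidates:
            formula = f"{response} ~ " + " + ".join(selected + [var])
            fit = smf.ols(formula, data=data).fit()
            p = fit.pvalues[var]
            if p < best_p:
                best_p, best_var = p, var
        if best_var is None or best_p >= alpha_to_enter:
            break                       # no remaining candidate is significant
        selected.append(best_var)
        candidates.remove(best_var)
    return selected

# e.g. forward_select(df, "y", ["x1", "x2", "x3"]) for a DataFrame df
```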
Forward selection: order of entry

Begin with the variable with the highest partial correlation coefficient. The next entry is the variable that gives the largest increase in overall R², judged by an F-test of the significance of the increase, above some specified F-to-enter (below a specified p-to-enter) value.
Backward selection (elimination)

Start with all variables. Drop, one at a time, variables whose removal does not significantly reduce R², starting with the one with the lowest partial correlation coefficient. But once X_j is dropped, it stays out, even if it explains a significant amount of the remaining variability once other variables are excluded.

[Diagram: with r2 < r1 < r3, backward selection starts from {X1, X2, X3}; X2 is dropped if R²_{123} − R²_{13} is not significant, then X1 is dropped if R²_{13} − R²_3 is not significant, giving the final model.]
Backward selection: order of removal

Begin with the variable with the smallest partial correlation coefficient. The next removal is the variable whose removal gives the smallest decrease in overall R², judged by an F-test of the significance of that decrease, below some specified F-to-remove (above a specified p-to-remove) value.

Stepwise selection: once a variable is included (removed), the set of remaining variables is scanned for other variables that should now be deleted (included), including those added (removed) at earlier stages. To avoid infinite loops, the entry criterion is usually set at least as strict as the removal criterion (p-to-enter ≤ p-to-remove).
Example: log of herptile species richness (logherp) as a function of log wetland area (logarea), the percentage of land within 1 km covered in forest (cpfor2), and the density of hard-surface roads within 1 km (thtden).
Example (all variables)

DEP VAR: LOGHERP   N: 28   MULTIPLE R: 0.740   SQUARED MULTIPLE R: 0.547
ADJUSTED SQUARED MULTIPLE R: 0.490   STANDARD ERROR OF ESTIMATE: 0.162

Example: forward stepwise

DEPENDENT VARIABLE: LOGHERP
MINIMUM TOLERANCE FOR ENTRY INTO MODEL = 0.010000
FORWARD STEPWISE WITH ALPHA-TO-ENTER = 0.10 AND ALPHA-TO-REMOVE = 0.05
Polynomial regression

The biological significance of the higher-order terms in a polynomial regression (if any) is generally not known. By construction, polynomial terms are strongly correlated; hence standard errors will be large (precision is low) and will increase with the order of the term. Extrapolation of polynomial models is always nonsense.

[Figure: a quadratic relationship, Y = X1 − X1², which rises and then falls over the range of X1.]
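The collinearity of polynomial terms, and the usual remedy of centering the predictor, can be seen in a few lines (illustrative only; the data are simulated):

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(0, 10, 200)

# x and x^2 are almost perfectly correlated over a positive range...
print("corr(x, x^2)               =", round(np.corrcoef(x, x**2)[0, 1], 3))
# ...but centering x first removes most of that correlation
xc = x - x.mean()
print("corr(x - mean, (x-mean)^2) =", round(np.corrcoef(xc, xc**2)[0, 1], 3))

# Fitting the quadratic itself (np.polyfit returns the highest-order term first)
y = x - 0.08 * x**2 + rng.normal(scale=0.3, size=x.size)
coeffs = np.polyfit(x, y, deg=2)
print("quadratic fit coefficients:", coeffs)
```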
Power analysis in GLM (including multiple regression)

In any GLM, hypotheses are tested by means of an F-test. Remember: the appropriate SS_error and df_error depend on the type of analysis and the hypothesis under investigation. Knowing F, we can compute R², the proportion of the total variance in Y explained by the factor (source) under consideration:

$$F = \frac{MS_{\text{factor}}}{MS_{\text{error}}} = \frac{SS_{\text{factor}}/df_{\text{factor}}}{SS_{\text{error}}/df_{\text{error}}}, \qquad R^2 = \frac{df_{\text{factor}}\,F}{df_{\text{factor}}\,F + df_{\text{error}}}$$
Partial and total R²

The total R² (R²_{Y·B}) is the proportion of variance in Y accounted for (explained) by a set of independent variables B. The partial R² (R²_{Y·A,B} − R²_{Y·A}) is the proportion of variance in Y accounted for by B when the variance accounted for by another set A is removed.

[Figure: the variance of Y partitioned into the proportion accounted for by both A and B (R²_{Y·A,B}), the proportion accounted for by A only (R²_{Y·A}, the total R² for A), and the proportion accounted for by B independent of A (R²_{Y·A,B} − R²_{Y·A}, the partial R²).]
Partial and total R²

The total R² for set B (R²_{Y·B}) equals the partial R² with respect to set B (R²_{Y·A,B} − R²_{Y·A}) if either (1) the total R² for A (R²_{Y·A}) is zero, or (2) A and B are independent, in which case R²_{Y·A,B} = R²_{Y·A} + R²_{Y·B}.

[Figure: Venn diagram of Y, A and B. The proportion of variance accounted for by B (R²_{Y·B}, the total R²) equals the proportion independent of A (R²_{Y·A,B} − R²_{Y·A}, the partial R²) only when A and B do not overlap.]
Partial and total R² in multiple regression

Suppose we have three independent variables X1, X2 and X3, and let A = {X1} and B = {X2, X3}. Then

$$R^2_{Y \cdot A,B} = R^2_{Y \cdot X_1 X_2 X_3}, \qquad R^2_{Y \cdot A} = R^2_{Y \cdot X_1}, \qquad R^2_{Y \cdot B} = R^2_{Y \cdot X_2 X_3},$$

and the partial R² for B is

$$R^2_{Y \cdot A,B} - R^2_{Y \cdot A} = R^2_{Y \cdot X_1 X_2 X_3} - R^2_{Y \cdot X_1}.$$

[Figure: log primary production as a function of log phosphorus concentration (Log [P]) and log zooplankton abundance (Log [Zoo]).]
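A short sketch (not from the lecture) of how total and partial R² can be computed by fitting nested models; the data are simulated and the split into A = {X1} and B = {X2, X3} mirrors the example above.

```python
import numpy as np

def r_squared(X, y):
    X = np.column_stack([np.ones(len(y)), X])       # add the intercept column
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return 1 - resid.var() / y.var()

rng = np.random.default_rng(7)
n = 100
x1, x2, x3 = rng.normal(size=(3, n))
y = 0.6 * x1 + 0.4 * x2 + 0.2 * x3 + rng.normal(size=n)

r2_full = r_squared(np.column_stack([x1, x2, x3]), y)   # R^2_{Y.A,B}
r2_A = r_squared(x1.reshape(-1, 1), y)                  # R^2_{Y.A}, A = {X1}
partial_B = r2_full - r2_A                              # partial R^2 for B = {X2, X3}
print(round(r2_full, 3), round(r2_A, 3), round(partial_B, 3))
```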
Defining effect size in multiple regression

The effect size, denoted f², is the ratio of the factor (source) R²_factor to the appropriate error R²_error:

$$f^2 = \frac{R^2_{\text{factor}}}{R^2_{\text{error}}}$$

Note: both R²_factor and R²_error depend on the null hypothesis under investigation.
Defining effect size in multiple regression: case 1

Case 1: a set B of variables {X1, X2, ...} is related to Y, and the total R² (R²_{Y·B}) is determined. The error variance proportion is then 1 − R²_{Y·B}, so

$$f^2 = \frac{R^2_{Y \cdot B}}{1 - R^2_{Y \cdot B}}, \qquad H_0: R^2_{Y \cdot B} = 0.$$

Example: the effect of wetland area, surrounding forest cover and surrounding road density on herptile species richness in southeastern Ontario wetlands, with B = {LOGAREA, CPFOR2, THTDEN}.
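As a worked example (using the squared multiple R of 0.547 reported for the full model above), the case-1 effect size is

$$f^2 = \frac{0.547}{1 - 0.547} = \frac{0.547}{0.453} \approx 1.21.$$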
Defining effect size in multiple regression: case 2

The proportion of variance of LOGHERP due to THTDEN (B) over and above that due to LOGAREA and CPFOR2 (A) is R²_{Y·A,B} − R²_{Y·A} = 0.098. The error variance proportion is then 1 − R²_{Y·A,B} = 1 − 0.547, so the effect size for THTDEN is

$$f^2 = \frac{R^2_{\{LOGAREA,\,CPFOR2,\,THTDEN\}} - R^2_{\{LOGAREA,\,CPFOR2\}}}{1 - R^2_{\{LOGAREA,\,CPFOR2,\,THTDEN\}}} = \frac{0.547 - 0.449}{1 - 0.547} = 0.216.$$
Determining power

Once f² has been determined, either a priori (as an alternate hypothesis) or a posteriori (the observed effect size), calculate the non-centrality parameter of the F distribution:

$$\lambda = f^2(\nu_1 + \nu_2 + 1)$$

Knowing λ, the factor (source) degrees of freedom ν1 and the error degrees of freedom ν2, we can determine power from the appropriate tables for a given α.

[Figure: power (1 − β) curves as a function of λ for ν1 = 2, at α = .05 and α = .01; power falls as ν2 decreases.]
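Rather than reading tables, the power can be computed from the noncentral F distribution. A minimal sketch (not the lecture's method, but equivalent in intent), assuming the usual α = 0.05:

```python
from scipy import stats

def power_from_f2(f2, df1, df2, alpha=0.05):
    """Power of the F-test for effect size f^2 with df1 (factor) and df2 (error) df."""
    lam = f2 * (df1 + df2 + 1)                  # noncentrality parameter, lambda
    f_crit = stats.f.ppf(1 - alpha, df1, df2)   # critical value of the central F
    return stats.ncf.sf(f_crit, df1, df2, lam)  # P(F' > f_crit) under the alternative

# e.g. power_from_f2(0.216, 1, 24) for the THTDEN effect in case 2,
# assuming 1 factor df and 28 - 3 - 1 = 24 error df
```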
Example: herptile richness in southeastern Ontario wetlands

What is the probability of detecting a true effect size for CPFOR2 equal to the estimated effect size, once the effects of LOGAREA and THTDEN have been controlled for, given α = 0.05? Sample of 28 wetlands; three variables (LOGAREA, CPFOR2, THTDEN); the dependent variable is log10 of the number of herptile species.

Variable       t       p
LOGAREA (1)    3.96    0.001
THTDEN (2)    -2.28    0.032
CPFOR2 (3)     0.774   0.447

R²{1,2,3} = 0.547
R²{1,2}   = 0.536
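Working through the tabled values (the degrees of freedom are assumed from N = 28 and three predictors): the partial R² for CPFOR2 is 0.547 − 0.536 = 0.011, so

$$f^2 = \frac{0.547 - 0.536}{1 - 0.547} = \frac{0.011}{0.453} \approx 0.024,$$

and with ν1 = 1 and ν2 = 28 − 3 − 1 = 24, λ = f²(ν1 + ν2 + 1) ≈ 0.024 × 26 ≈ 0.63. The corresponding power, read from tables or computed as in the sketch above, is very low for an effect this small.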