Page 1: Lecture 13: Multiple linear regression

Bio 4118 Applied Biostatistics, Université d'Ottawa / University of Ottawa, 2001

- When and why we use it
- The general multiple regression model
- Hypothesis testing in multiple regression
- The problem of multicollinearity
- Multiple regression procedures
- Polynomial regression
- Power analysis in multiple regression

Page 2: Some GLM procedures

Procedure                       Dependent variable   Independent variable(s)
Simple regression               1 continuous         1 continuous
Single-classification ANOVA     1 continuous         1 categorical*
Multiple-classification ANOVA   1 continuous         2 or more categorical*
ANCOVA                          1 continuous         At least 1 categorical*, at least 1 continuous
Multiple regression             1 continuous         2 or more continuous

*either categorical or treated as a categorical variable

Page 3: When do we use multiple regression?

To examine the relationship between a continuous dependent (Y) variable and several continuous independent (X1, X2, ...) variables, e.g. the relationship between lake primary production, phosphorus concentration, and zooplankton abundance.

[Figure: 3-D scatterplot of log primary production against log phosphorus concentration (Log [P]) and log zooplankton abundance (Log [Zoo])]

Page 4: The multiple regression model: general form

The general model is:

$Y_i = \beta_0 + \sum_{j=1}^{k} \beta_j X_{ij} + \varepsilon_i$

which defines a k-dimensional plane, where $\beta_0$ is the intercept, $\beta_j$ is the partial regression coefficient of Y on Xj, $X_{ij}$ is the value of the ith observation of the independent variable Xj, and $\varepsilon_i$ is the residual of the ith observation.

[Figure: the fitted plane $\hat{Y} = f(X_1, X_2)$ through the observed points (Y, X1, X2)]
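As a concrete illustration, here is a minimal sketch of fitting this general model by ordinary least squares with numpy; the data, sample size, and coefficient values are invented for the example.

```python
import numpy as np

# Invented data: N observations of k = 2 independent variables
rng = np.random.default_rng(0)
N, k = 28, 2
X = rng.normal(size=(N, k))
y = 0.3 + 0.25 * X[:, 0] - 0.04 * X[:, 1] + rng.normal(scale=0.16, size=N)

# Design matrix: a leading column of ones carries the intercept beta_0
D = np.column_stack([np.ones(N), X])

# Least-squares estimates of (beta_0, beta_1, ..., beta_k)
beta, *_ = np.linalg.lstsq(D, y, rcond=None)
eps = y - D @ beta          # residuals epsilon_i
print("estimated coefficients:", np.round(beta, 3))
```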

Page 5: What is the partial regression coefficient anyway?

$\beta_j$ is the rate of change in Y per unit change in Xj with all other variables held constant; this is not the slope of the regression of Y on Xj pooled over all other variables! (A numerical sketch follows the figure.)

[Figure: Y against X1 at four levels of X2 (X2 = -3, -1, 1, 3); the partial regression lines within each level of X2 have a different slope than the simple (pooled) regression line]
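The distinction can be seen numerically. Below is a sketch with invented data in which X1 and X2 are correlated: the pooled slope of Y on X1 absorbs part of X2's effect, while the partial coefficient recovers the true value.

```python
import numpy as np

rng = np.random.default_rng(1)
x2 = rng.choice([-3.0, -1.0, 1.0, 3.0], size=400)   # four levels of X2
x1 = 0.8 * x2 + rng.normal(size=400)                # X1 correlated with X2
y = 1.0 * x1 + 2.0 * x2 + rng.normal(scale=0.5, size=400)

# Simple (pooled) regression of Y on X1, ignoring X2
pooled_slope = np.polyfit(x1, y, 1)[0]

# Partial regression coefficient of X1, with X2 in the model
D = np.column_stack([np.ones_like(x1), x1, x2])
beta, *_ = np.linalg.lstsq(D, y, rcond=None)
print(f"pooled slope: {pooled_slope:.2f}, partial coefficient: {beta[1]:.2f}")
# The pooled slope is biased upward (about 2.9 here); the partial
# coefficient stays near the true value of 1.0
```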

Page 6: The effect of scale

Two independent variables on different scales will have different slopes, even if the proportional change in Y is the same. So, if we want to measure the relative strength of the influence of each variable on Y, we must eliminate the effect of the different scales.

[Figure: two plots of Y against Xj with the same proportional effect: over a 0-2 range of Xj the slope is $\beta_j = 2$; over a 0-200 range it is $\beta_j = 0.02$]

Page 7: The multiple regression model: standardized form

Since $\beta_j$ depends on the size of Xj, to examine the relative effect of each independent variable we must standardize the regression coefficients: first transform all variables,

$Y_i^* = \frac{Y_i - \bar{Y}}{s_Y}, \qquad X_{ij}^* = \frac{X_{ij} - \bar{X}_j}{s_{X_j}}$

then fit the regression model based on the transformed variables:

$Y_i^* = \sum_{j=1}^{k} \beta_j^* X_{ij}^* + \varepsilon_i^*, \qquad \beta_j^* = \beta_j \frac{s_{X_j}}{s_Y}$

The standardized coefficients $\beta_j^*$ estimate the relative strength of the influence of variable Xj on Y.
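Rather than refitting on transformed data, the standardized coefficients can be computed directly from the unstandardized fit via $\beta_j^* = \beta_j s_{X_j}/s_Y$; a minimal sketch (the function name is ours):

```python
import numpy as np

def standardized_coefficients(X, y):
    """Return (b, b_star) for the OLS fit of y on the columns of X.

    b_star[j] = b[j] * s_Xj / s_Y: the standardized partial regression
    coefficients, comparable across predictors on different scales.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    D = np.column_stack([np.ones(len(y)), X])
    b, *_ = np.linalg.lstsq(D, y, rcond=None)
    b_star = b[1:] * X.std(axis=0, ddof=1) / y.std(ddof=1)
    return b, b_star
```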

Page 8: Regression coefficients: summary

- Partial regression coefficient: equals the slope of the regression of Y on Xj when all other independent variables are held constant.
- Standardized partial regression coefficient: the rate of change of Y in standard deviation units per standard deviation of Xj, with all other independent variables held constant.

Page 9: Assumptions

- independence of residuals
- homoscedasticity of residuals
- linearity (Y on all X)
- no error on independent variables
- normality of residuals

Page 10: Hypothesis testing in simple linear regression: partitioning the total sums of squares

Total SS = Model (Explained) SS + Unexplained (Error) SS:

$\sum_{i=1}^{N}(Y_i - \bar{Y})^2 = \sum_{i=1}^{N}(\hat{Y}_i - \bar{Y})^2 + \sum_{i=1}^{N}(Y_i - \hat{Y}_i)^2$

Page 11: Hypothesis testing in multiple regression I: partitioning the total sums of squares

Partition the total sums of squares into model and residual SS:

$SS_{\text{Total}} = \sum_{i=1}^{N}(Y_i - \bar{Y})^2, \qquad SS_{\text{model}} = \sum_{i=1}^{N}(\hat{Y}_i - \bar{Y})^2, \qquad SS_{\text{error}} = \sum_{i=1}^{N}(Y_i - \hat{Y}_i)^2$

[Figure: the fitted plane of Y on X1 and X2, illustrating the model, total, and residual SS]

Page 12: Hypothesis testing I: partitioning the total sums of squares (cont'd)

$MS_{\text{model}} = \frac{\sum_{i=1}^{N}(\hat{Y}_i - \bar{Y})^2}{1}, \qquad MS_{\text{error}} = \frac{\sum_{i=1}^{N}(Y_i - \hat{Y}_i)^2}{N - 2}$

So MSmodel = $s_Y^2$ and MSerror = 0 if observed = expected for all i. Calculate F = MSmodel/MSerror and compare with the F distribution with 1 and N - 2 df. H0: F = 1.
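A sketch of this SS partition and the overall F-test for the general case with k predictors (the simple-regression case above is k = 1, giving 1 and N - 2 df):

```python
import numpy as np
from scipy import stats

def overall_f_test(X, y):
    """Partition total SS and test H0 of no linear relationship."""
    X = np.asarray(X, float).reshape(len(y), -1)
    N, k = X.shape
    D = np.column_stack([np.ones(N), X])
    beta, *_ = np.linalg.lstsq(D, y, rcond=None)
    yhat = D @ beta
    ss_model = np.sum((yhat - y.mean()) ** 2)   # explained SS
    ss_error = np.sum((y - yhat) ** 2)          # residual SS
    F = (ss_model / k) / (ss_error / (N - k - 1))
    p = stats.f.sf(F, k, N - k - 1)
    return ss_model, ss_error, F, p
```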

Page 13: Hypothesis testing II: testing individual partial regression coefficients

Test each hypothesis H0: $\beta_j = 0$ by a t-test:

$t = \frac{b_j}{s_{b_j}}$

Note: these are 2-tailed hypotheses!

[Figure: two panels. Left: Y against X2 with X1 fixed (X1 = 2, X1 = 3); H02: $\beta_2 = 0$ accepted. Right: Y against X1 with X2 fixed (X2 = 1, X2 = 2); H01: $\beta_1 = 0$ rejected]
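A sketch of these t-tests, with standard errors taken from the usual OLS covariance matrix (the helper name is ours):

```python
import numpy as np
from scipy import stats

def coefficient_t_tests(X, y):
    """Two-tailed t-tests of H0: beta_j = 0 for intercept and each X_j."""
    X = np.asarray(X, float).reshape(len(y), -1)
    N, k = X.shape
    D = np.column_stack([np.ones(N), X])
    b, *_ = np.linalg.lstsq(D, y, rcond=None)
    resid = y - D @ b
    mse = resid @ resid / (N - k - 1)            # error mean square
    se = np.sqrt(np.diag(mse * np.linalg.inv(D.T @ D)))
    t = b / se                                   # t = b_j / s_bj
    p = 2 * stats.t.sf(np.abs(t), N - k - 1)     # 2-tailed p-values
    return b, se, t, p
```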

Page 14: Multicollinearity

Independent variables are correlated, and therefore not independent: evaluate by looking at the covariance or correlation matrix, with the variances on the diagonal and the covariances off the diagonal:

Variable   X1    X2    X3
X1         σ1²   σ12   σ13
X2         σ21   σ2²   σ23
X3         σ31   σ32   σ3²

[Figure: scatterplots of X1 against X2 (collinear) and of X3 against X2 (independent)]

Page 15: Multicollinearity: problems

If two independent variables X1 and X2 are uncorrelated, then the model sums of squares for a linear model with both included equals the sum of the SSmodel for each considered separately; but if they are correlated, the former will be less than the latter:

$SS_{\text{model}(X_1,X_2)} = SS_{\text{model}(X_1)} + SS_{\text{model}(X_2)} \quad \text{if } r^2_{X_1 X_2} = 0$

$SS_{\text{model}(X_1,X_2)} < SS_{\text{model}(X_1)} + SS_{\text{model}(X_2)} \quad \text{if } r^2_{X_1 X_2} \neq 0$

So, the real question is: given a model with X1 included, how much does SSmodel increase when X2 is also included (or vice versa)?

Page 16: Multicollinearity: consequences

- inflated standard errors for regression coefficients
- sensitivity of parameter estimates to small changes in data
- But estimates of partial regression coefficients remain unbiased.
- One or more independent variables may not appear in the final regression model, not because they do not covary with Y, but because they covary with another X.

Page 17: Detecting multicollinearity

- high R² but few or no significant t-tests for individual independent variables
- high pairwise correlations between X's
- high partial correlations among regressors (independent variables are a linear combination of others)
- eigenvalues, condition index, tolerance, and variance inflation factors (a computational sketch follows)
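A sketch of these diagnostics computed from scratch with numpy (the function name and details are ours; tolerance_j is 1 minus the R² of X_j regressed on the other predictors, and VIF = 1/tolerance):

```python
import numpy as np

def collinearity_diagnostics(X):
    """Eigenvalues, condition index, tolerance, and VIF for columns of X."""
    X = np.asarray(X, float)
    N, k = X.shape
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardize columns
    eigvals = np.linalg.eigvalsh(np.corrcoef(Z, rowvar=False))
    condition_index = np.sqrt(eigvals.max() / eigvals.min())
    tolerance = np.empty(k)
    for j in range(k):
        others = np.delete(Z, j, axis=1)
        D = np.column_stack([np.ones(N), others])
        beta, *_ = np.linalg.lstsq(D, Z[:, j], rcond=None)
        resid = Z[:, j] - D @ beta
        r2 = 1 - (resid @ resid) / (N - 1)   # total SS of Z[:, j] is N - 1
        tolerance[j] = 1 - r2
    return eigvals, condition_index, tolerance, 1.0 / tolerance
```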

Page 18: Quantifying the effect of multicollinearity

- Eigenvectors: a set of "lines" e1, e2, ..., ek in a k-dimensional space which are orthogonal to each other.
- Eigenvalue: the magnitude (length) $\lambda_j$ of the corresponding eigenvector.

[Figure: scatter of X2 against X1 with the two orthogonal eigenvectors and their eigenvalues $\lambda_1$ and $\lambda_2$]

Page 19: Quantifying the effect of multicollinearity (cont'd)

- Eigenvalues: if all k eigenvalues are approximately equal, multicollinearity is low.
- Condition index: $\sqrt{\lambda_l/\lambda_s}$, the square root of the ratio of the largest to the smallest eigenvalue; values near 1 indicate low multicollinearity.
- Tolerance: 1 minus the proportion of variance in each independent variable accounted for by all other independent variables; values near 1 indicate low multicollinearity.

[Figure: low correlation between X1 and X2 gives $\lambda_1 = \lambda_2$; high correlation gives $\lambda_1 \gg \lambda_2$]

Page 20: Remedial measures

- Get more data to reduce correlations.
- Drop some variables.
- Use principal component or ridge regression, which yield biased estimates but with smaller standard errors.

Page 21: Multiple regression: the general idea

- Evaluate the significance of a variable by fitting two models: one with the term in, the other with it removed.
- Test for the change in model fit (ΔMF) associated with removal of the term in question.
- Unfortunately, ΔMF may depend on what other variables are in the model if there is multicollinearity!

[Diagram: Model A (term in) versus Model B (term out); if ΔMF (e.g. ΔR²) is small, delete the term; if large, retain it]

Page 22: Fitting multiple regression models

Goal: find the "best" model, given the available data.

Problem 1: what is "best"?
- highest R²?
- lowest RMS?
- highest R² but containing only individually significant independent variables?
- maximal R² with the minimum number of independent variables?

Page 23: Selection of independent variables (cont'd)

Problem 2: even if "best" is defined, by what method do we find it?

Possibilities:
- compute all possible models (2^k - 1 of them) and choose the best one.
- use some procedure for winnowing down the set of possible models.

Page 24: Strategy I: computing all possible models

Compute all possible models and choose the "best" one (an enumeration sketch follows).

Cons:
- time-consuming
- leaves the definition of "best" to the researcher

Pros:
- if the "best" model is defined, you will find it!

[Diagram: the 2^k - 1 = 7 candidate models for k = 3 variables: {X1}, {X2}, {X3}, {X1, X2}, {X1, X3}, {X2, X3}, {X1, X2, X3}]
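A brute-force sketch of Strategy I with itertools: fit every one of the 2^k - 1 candidate models, record its R², and leave "best" to the researcher (the function name is ours):

```python
from itertools import combinations
import numpy as np

def all_possible_models(X, y):
    """R^2 of every nonempty subset of predictors (2^k - 1 models)."""
    X = np.asarray(X, float)
    N, k = X.shape
    ss_total = np.sum((y - y.mean()) ** 2)
    r2 = {}
    for size in range(1, k + 1):
        for subset in combinations(range(k), size):
            D = np.column_stack([np.ones(N), X[:, subset]])
            beta, *_ = np.linalg.lstsq(D, y, rcond=None)
            resid = y - D @ beta
            r2[subset] = 1 - (resid @ resid) / ss_total
    return r2
```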

Page 25: Strategy II: forward selection

- Start with the variable that has the highest (significant) R², i.e. the highest partial correlation coefficient r.
- Add the others one at a time until there is no further significant increase in R², with the $\beta_j$'s recomputed at each step.
- Problem: once Xj is included, it stays in, even if it contributes little to SSmodel once other variables are included.

[Diagram: with r2 > r1 > r3, start with {X2}; compare the R² of {X1, X2} against that of {X2}, and so on, until the final model is reached]

Page 26: Forward selection: order of entry

- Begin with the variable with the highest partial correlation coefficient.
- The next entry is the variable that gives the largest increase in the overall R², by an F-test of the significance of the increase, above some specified F-to-enter (below a specified p-to-enter) value. (A selection sketch follows.)

[Diagram: with r2 > r1 > r3 > r4 and p to enter = .05: X2 enters first (p[F(X2)] = .001); {X2, X4} fails (p = .55, X4 eliminated); {X2, X1} (p = .002) beats {X2, X3} (p = .04), so X1 enters next; and so on]
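A sketch of forward selection driven by the partial F-test, with a p-to-enter threshold as on the slide (function and variable names are ours):

```python
import numpy as np
from scipy import stats

def forward_selection(X, y, p_to_enter=0.05):
    """Add the predictor with the most significant R^2 increase each step."""
    X = np.asarray(X, float)
    N, k = X.shape

    def sse(cols):
        D = np.column_stack([np.ones(N)] + [X[:, c] for c in cols])
        beta, *_ = np.linalg.lstsq(D, y, rcond=None)
        r = y - D @ beta
        return r @ r

    included = []
    while len(included) < k:
        sse_old, best = sse(included), None
        df_error = N - len(included) - 2        # larger model's error df
        for j in (j for j in range(k) if j not in included):
            sse_new = sse(included + [j])
            F = (sse_old - sse_new) / (sse_new / df_error)
            p = stats.f.sf(F, 1, df_error)
            if best is None or p < best[1]:
                best = (j, p)
        if best[1] >= p_to_enter:               # no significant increase left
            break
        included.append(best[0])
    return included
```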

Page 27: Strategy III: backward selection

- Start with all variables.
- Drop variables whose removal does not significantly reduce R², one at a time, starting with the one with the lowest partial correlation coefficient.
- But once Xj is dropped, it stays out, even if it explains a significant amount of the remaining variability once other variables are excluded.

[Diagram: with r2 < r1 < r3, start with {X1, X2, X3}; compare the R² of {X1, X3} against that of {X1, X2, X3}, then {X3} against {X1, X3}, until the final model is reached]

Page 28: Backward selection: order of removal

- Begin with the variable with the smallest partial correlation coefficient.
- The next removal is the variable whose deletion gives the smallest (non-significant) decrease in the overall R², by an F-test of the significance of the decrease, below some specified F-to-remove (above a specified p-to-remove) value.

[Diagram: with r2 > r1 > r3 > r4 and p to remove = .10: X4 is removed first (p[F(X2, X1, X3)] = .44), then X3 (p[F(X2, X1)] = .25); X1 and X2 remain in the final model (p[F(X2, X3)] = .001, p[F(X1, X3)] = .009)]

Page 29: Strategy IV: stepwise selection

- Once a variable is included (removed), the set of remaining variables is scanned for other variables that should now be deleted (included), including those added (removed) at earlier stages.
- To avoid infinite loops, we usually set p-to-enter > p-to-remove.

[Diagram: with r2 > r1 > r4 > r3, p to enter = .10 and p to remove = .05: X2 enters first (p[F(X2)] = .001), then X1 (p = .002, versus .09 for X3 and .03 for X4); at each subsequent step, variables already in the model are re-scanned for removal]

Page 30: Example

Log of herptile species richness (logherp) as a function of log wetland area (logarea), percentage of land within 1 km covered in forest (cpfor2), and density of hard-surface roads within 1 km (thtden).

Page 31: Example (all variables)

DEP VAR: LOGHERP   N: 28   MULTIPLE R: 0.740   SQUARED MULTIPLE R: 0.547
ADJUSTED SQUARED MULTIPLE R: 0.490   STANDARD ERROR OF ESTIMATE: 0.162

VARIABLE   COEFF.   SE      STD COEF.   TOL.    T        P
CONSTANT    0.285   0.191    0.000      .        1.488   0.150
LOGAREA     0.228   0.058    0.551      0.978    3.964   0.001
CPFOR2      0.001   0.001    0.123      0.744    0.774   0.447
THTDEN     -0.036   0.016   -0.365      0.732   -2.276   0.032

Page 32: Example (cont'd)

ANALYSIS OF VARIANCE

SOURCE       SS      DF   MS      F-RATIO   P
REGRESSION   0.760   3    0.253   9.662     0.000
RESIDUAL     0.629   24   0.026

Page 33: Example: forward stepwise

DEPENDENT VARIABLE: LOGHERP
MINIMUM TOLERANCE FOR ENTRY INTO MODEL = .010000
FORWARD STEPWISE WITH ALPHA-TO-ENTER = .10 AND ALPHA-TO-REMOVE = .05

STEP # 0   R = .000   RSQUARE = .000

IN:
 1 CONSTANT

OUT          PART. CORR   TOL.     F        'P'
 2 LOGAREA    0.596       .1E+01   14.321   0.001
 3 CPFOR2     0.305       .1E+01    2.662   0.115
 4 THTDEN    -0.496       .1E+01    8.502   0.007

Page 34: Forward stepwise (cont'd)

STEP # 1   R = .596   RSQUARE = .355   TERM ENTERED: LOGAREA

IN           COEFF.   SE      STD COEF.   TOL.     F        'P'
 1 CONSTANT
 2 LOGAREA    0.247   0.065    0.596      .1E+01   14.321   0.001

OUT          PART. CORR   TOL.   F       'P'
 3 CPFOR2     0.382       0.99   4.273   0.049
 4 THTDEN    -0.529       0.98   9.725   0.005

Page 35: Forward stepwise (cont'd)

STEP # 2   R = .732   RSQUARE = .536   TERM ENTERED: THTDEN

IN           COEFF.   SE      STD COEF.   TOL.   F        'P'
 1 CONSTANT
 2 LOGAREA    0.225   0.057    0.542      0.98   15.581   0.001
 4 THTDEN    -0.042   0.013   -0.428      0.98    9.725   0.005

OUT          PART. CORR   TOL.      F       'P'
 3 CPFOR2     0.156       0.74380   0.599   0.447

Page 36: Forward stepwise: final model

FORWARD STEPWISE: P TO INCLUDE = .15
DEP VAR: LOGHERP   N: 28   MULTIPLE R: 0.732   SQUARED MULTIPLE R: 0.536
ADJUSTED SQUARED MULTIPLE R: 0.490   STANDARD ERROR OF ESTIMATE: 0.161

VARIABLE   COEFF.   SE      STD COEF.   TOL.    T        P
CONSTANT    0.376   0.149    0.000      .        2.521   0.018
LOGAREA     0.225   0.057    0.542      0.984    3.947   0.001
THTDEN     -0.042   0.013   -0.428      0.984   -3.118   0.005

Page 37: Example: backward stepwise (final model)

BACKWARD STEPWISE: P TO REMOVE = .15
DEP VAR: LOGHERP   N: 28   MULTIPLE R: 0.732   SQUARED MULTIPLE R: 0.536
ADJUSTED SQUARED MULTIPLE R: 0.499   STANDARD ERROR OF ESTIMATE: 0.161

VARIABLE   COEFF.   SE      STD COEF.   TOL.    T        P
CONSTANT    0.376   0.149    0.000      .        2.521   0.018
LOGAREA     0.225   0.057    0.542      0.984    3.947   0.001
THTDEN     -0.042   0.013   -0.428      0.984   -3.118   0.005

Page 38: Example: subset model

DEP VAR: LOGHERP   N: 28   MULTIPLE R: 0.670   SQUARED MULTIPLE R: 0.449
ADJUSTED SQUARED MULTIPLE R: 0.405   STANDARD ERROR OF ESTIMATE: 0.175

VARIABLE   COEFF.   SE      STD COEF.   TOL.    T       P
CONSTANT    0.027   0.167    0.000      .       0.162   0.872
LOGAREA     0.248   0.062    0.597      1.000   4.022   0.000
CPFOR2      0.003   0.001    0.307      1.000   2.067   0.049

Page 39: What if the relationship between Y and one or more X's is nonlinear?

- Option 1: transform the data.
- Option 2: use nonlinear regression.
- Option 3: use polynomial regression.

Page 40: The polynomial regression model

In polynomial regression, the regression model includes terms of increasingly higher powers of the independent variable:

$Y_i = \beta_0 + \sum_{j=1}^{k} \beta_j X_i^{\,j} + \varepsilon_i$

[Figure: black fly biomass (mg DM/m²) against current velocity (cm/s), with a linear model and a 2nd-order polynomial model fitted]

Page 41: The polynomial regression model: procedure

- Fit a simple linear model.
- Fit a model with a quadratic term, and test for the increase in SSmodel.
- Continue with higher-order terms (cubic, quartic, etc.) until there is no further significant increase in SSmodel. (A fitting sketch follows the figure.)
- Include terms of order up to the power given by the number of points of inflexion plus 1.

[Figure: black fly biomass (mg DM/m²) against current velocity (cm/s), with linear and 2nd-order polynomial fits]
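A sketch of this procedure: keep raising the order while the added term yields a significant increase in SSmodel by a partial F-test (names and stopping-rule details are ours):

```python
import numpy as np
from scipy import stats

def grow_polynomial(x, y, max_order=5, alpha=0.05):
    """Return the lowest order beyond which SS_model stops increasing
    significantly (linear -> quadratic -> cubic -> ...)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    N = len(y)

    def sse(order):
        D = np.vander(x, order + 1, increasing=True)   # 1, x, x^2, ...
        beta, *_ = np.linalg.lstsq(D, y, rcond=None)
        r = y - D @ beta
        return r @ r

    order = 1
    while order < max_order:
        sse_lo, sse_hi = sse(order), sse(order + 1)
        df_error = N - (order + 2)         # parameters in the larger model
        F = (sse_lo - sse_hi) / (sse_hi / df_error)
        if stats.f.sf(F, 1, df_error) >= alpha:
            break                          # no further significant increase
        order += 1
    return order
```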

Page 42: Polynomial regression: caveats

- The biological significance of the higher-order terms in a polynomial regression (if any) is generally not known.
- By definition, polynomial terms are strongly correlated; hence standard errors will be large (precision is low), and they increase with the order of the term.
- Extrapolation of polynomial models is always nonsense.

[Figure: the parabola Y = X1 - X1², illustrating how a fitted polynomial diverges outside the range of the data]

Page 43: Power analysis in GLM (including multiple regression)

- In any GLM, hypotheses are tested by means of an F-test.
- Remember: the appropriate SSerror and dferror depend on the type of analysis and the hypothesis under investigation.
- Knowing F, we can compute R², the proportion of the total variance in Y explained by the factor (source) under consideration:

$F = \frac{MS_{\text{factor}}}{MS_{\text{error}}} = \frac{SS_{\text{factor}}/df_{\text{factor}}}{SS_{\text{error}}/df_{\text{error}}}, \qquad R^2 = \frac{F \cdot df_{\text{factor}}}{F \cdot df_{\text{factor}} + df_{\text{error}}}$

Page 44: Partial and total R²

- The total R² ($R^2_{Y \cdot B}$) is the proportion of variance in Y accounted for (explained) by a set of independent variables B.
- The partial R² ($R^2_{Y \cdot A,B} - R^2_{Y \cdot A}$) is the proportion of variance in Y accounted for by B when the variance accounted for by another set A is removed.

[Venn diagram: the proportion of variance accounted for by both A and B ($R^2_{Y \cdot A,B}$), by A only ($R^2_{Y \cdot A}$, total R²), and by B independent of A ($R^2_{Y \cdot A,B} - R^2_{Y \cdot A}$, partial R²)]

Page 45: Partial and total R² (cont'd)

The total R² for set B ($R^2_{Y \cdot B}$) equals the partial R² with respect to set B ($R^2_{Y \cdot A,B} - R^2_{Y \cdot A}$) if either (1) the total R² for A ($R^2_{Y \cdot A}$) is zero, or (2) A and B are independent (in which case $R^2_{Y \cdot A,B} = R^2_{Y \cdot A} + R^2_{Y \cdot B}$).

[Venn diagram: the proportion of variance accounted for by B ($R^2_{Y \cdot B}$, total R²) and the proportion independent of A ($R^2_{Y \cdot A,B} - R^2_{Y \cdot A}$, partial R²) are equal iff A and B do not overlap]

Page 46: Partial and total R² in multiple regression

Suppose we have three independent variables X1, X2, and X3. Taking, say, B = {X1} and A = {X2, X3}:

$R^2_{Y \cdot B} = R^2_{Y \cdot X_1}$ (total R²)

$R^2_{Y \cdot A,B} - R^2_{Y \cdot A} = R^2_{Y \cdot X_1,X_2,X_3} - R^2_{Y \cdot X_2,X_3}$ (partial R²)

(A computational sketch follows the figure.)

[Figure: log primary production against log phosphorus concentration (Log [P]) and log zooplankton abundance (Log [Zoo])]
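A sketch of computing total and partial R² directly from two fitted models (the helper names are ours):

```python
import numpy as np

def total_r2(X, y):
    """Total R^2 of the OLS regression of y on the columns of X."""
    X = np.asarray(X, float).reshape(len(y), -1)
    D = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(D, y, rcond=None)
    resid = y - D @ beta
    return 1 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

def partial_r2(X_A, X_B, y):
    """R^2 accounted for by set B over and above set A: R2(A,B) - R2(A)."""
    X_AB = np.column_stack([X_A, X_B])
    return total_r2(X_AB, y) - total_r2(X_A, y)
```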

Page 47: Defining effect size in multiple regression

- The effect size, denoted f², is given by the ratio of the factor (source) $R^2_{\text{factor}}$ and the appropriate error $R^2_{\text{error}}$:

$f^2 = \frac{R^2_{\text{factor}}}{R^2_{\text{error}}}$

- Note: both $R^2_{\text{factor}}$ and $R^2_{\text{error}}$ depend on the null hypothesis under investigation.

Page 48: Defining effect size in multiple regression: case 1

- Case 1: a set B of variables {X1, X2, ...} is related to Y, and the total R² ($R^2_{Y \cdot B}$) is determined. The error variance proportion is then $1 - R^2_{Y \cdot B}$.
- H0: $R^2_{Y \cdot B} = 0$
- Example: effect of wetland area, surrounding forest cover, and surrounding road density on herptile species richness in southeastern Ontario wetlands; B = {LOGAREA, CPFOR2, THTDEN}.

$f^2 = \frac{R^2_{Y \cdot B}}{1 - R^2_{Y \cdot B}}$

Page 49: Defining effect size: case 1 (cont'd)

DEP VAR: LOGHERP   N: 28   MULTIPLE R: 0.740   SQUARED MULTIPLE R: 0.547
ADJUSTED SQUARED MULTIPLE R: 0.490   STANDARD ERROR OF ESTIMATE: 0.162

VARIABLE   COEFF.   SE      STD COEF.   TOL.    T        P
CONSTANT    0.285   0.191    0.000      .        1.488   0.150
LOGAREA     0.228   0.058    0.551      0.978    3.964   0.001
CPFOR2      0.001   0.001    0.123      0.744    0.774   0.447
THTDEN     -0.036   0.016   -0.365      0.732   -2.276   0.032

$f^2 = \frac{R^2_{\text{factor}}}{R^2_{\text{error}}} = \frac{.547}{1 - .547} = 1.21$

Page 50: Defining effect size in multiple regression: case 2

- Case 2: the proportion of variance of Y due to B over and above that due to A is determined ($R^2_{Y \cdot A,B} - R^2_{Y \cdot A}$). The error variance proportion is then $1 - R^2_{Y \cdot A,B}$.
- H0: $R^2_{Y \cdot A,B} - R^2_{Y \cdot A} = 0$
- Example: herptile richness in southeastern Ontario wetlands; B = {THTDEN}, A = {LOGAREA, CPFOR2}, A,B = {LOGAREA, CPFOR2, THTDEN}.

$f^2 = \frac{R^2_{Y \cdot A,B} - R^2_{Y \cdot A}}{1 - R^2_{Y \cdot A,B}}$

Page 51: Case 2 (cont'd): the two fitted models

Model with A = {LOGAREA, CPFOR2} only:

DEP VAR: LOGHERP   N: 28   MULTIPLE R: 0.670   SQUARED MULTIPLE R: 0.449
ADJUSTED SQUARED MULTIPLE R: 0.405   STANDARD ERROR OF ESTIMATE: 0.175

VARIABLE   COEFF.   SE      STD COEF.   TOL.    T       P
CONSTANT    0.027   0.167    0.000      .       0.162   0.872
LOGAREA     0.248   0.062    0.597      1.000   4.022   0.000
CPFOR2      0.003   0.001    0.307      1.000   2.067   0.049

Model with A and B = {LOGAREA, CPFOR2, THTDEN}:

DEP VAR: LOGHERP   N: 28   MULTIPLE R: 0.740   SQUARED MULTIPLE R: 0.547
ADJUSTED SQUARED MULTIPLE R: 0.490   STANDARD ERROR OF ESTIMATE: 0.162

VARIABLE   COEFF.   SE      STD COEF.   TOL.    T        P
CONSTANT    0.285   0.191    0.000      .        1.488   0.150
LOGAREA     0.228   0.058    0.551      0.978    3.964   0.001
CPFOR2      0.001   0.001    0.123      0.744    0.774   0.447
THTDEN     -0.036   0.016   -0.365      0.732   -2.276   0.032

Page 52: Defining effect size in multiple regression: case 2 (cont'd)

- The proportion of variance of LOGHERP due to THTDEN (B) over and above that due to LOGAREA and CPFOR2 (A) is $R^2_{Y \cdot A,B} - R^2_{Y \cdot A}$ = .547 - .449 = .098.
- The error variance proportion is then $1 - R^2_{Y \cdot A,B}$ = 1 - .547 = .453.
- So the effect size for variable THTDEN is:

$f^2 = \frac{R^2_{Y \cdot \{LOGAREA,CPFOR2,THTDEN\}} - R^2_{Y \cdot \{LOGAREA,CPFOR2\}}}{1 - R^2_{Y \cdot \{LOGAREA,CPFOR2,THTDEN\}}} = \frac{.547 - .449}{1 - .547} = 0.216$

Page 53: Determining power

- Once f² has been determined, either a priori (as an alternate hypothesis) or a posteriori (the observed effect size), calculate the non-central F parameter λ:

$\lambda = f^2(\nu_1 + \nu_2 + 1)$

- Knowing λ and the factor (source) ($\nu_1$) and error ($\nu_2$) degrees of freedom, we can determine power from the appropriate tables for a given α.

[Figure: power (1 - β) as a function of λ for α = .05 and α = .01 at $\nu_1$ = 2; power increases with λ and with decreasing $\nu_2$]

Page 54: Example: herptile richness in southeastern Ontario wetlands

- Sample of 28 wetlands; 3 variables (LOGAREA, CPFOR2, THTDEN); the dependent variable is log10 of the number of herptile species.
- What is the probability of detecting a true effect size for CPFOR2 equal to the estimated effect size, once the effects of LOGAREA and THTDEN have been controlled for, given α = 0.05?

Variable       t       p
LOGAREA (1)    3.96    0.001
THTDEN (2)    -2.28    .032
CPFOR2 (3)     .774    .447
R²{1,2,3}      0.547
R²{1,2}        0.536

Page 55: Example (cont'd)

- Sample effect size f² for CPFOR2, once the effects of LOGAREA and THTDEN have been controlled for:

$f^2 = \frac{R^2_{Y \cdot \{1,2,3\}} - R^2_{Y \cdot \{1,2\}}}{1 - R^2_{Y \cdot \{1,2,3\}}} = \frac{.547 - .536}{1 - .547} = .024$

- Source (CPFOR2) df: $\nu_1$ = 1
- Error df: $\nu_2$ = 28 - 1 - 1 - 1 = 25
- Non-central F parameter: $\lambda = f^2(\nu_1 + \nu_2 + 1) = .024(1 + 25 + 1) = 0.648$
- Power is then obtained from the appropriate tables, given $\nu_1$, $\nu_2$, λ, and α.
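Today the table lookup can be replaced by the noncentral F distribution; here is a sketch with scipy, reproducing the setup above (f² = .024, ν1 = 1, ν2 = 25):

```python
from scipy import stats

def glm_power(f2, df1, df2, alpha=0.05):
    """Power of the F-test for effect size f^2, via Cohen's lambda."""
    lam = f2 * (df1 + df2 + 1)                  # non-central F parameter
    f_crit = stats.f.isf(alpha, df1, df2)       # critical value under H0
    return stats.ncf.sf(f_crit, df1, df2, lam)  # P(F exceeds f_crit | lambda)

# Worked example from the lecture: CPFOR2, controlling for the others
print(round(glm_power(0.024, 1, 25), 2))        # power is low (~0.1)
```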