Linear Regression. Didier Concordet, d.concordet@envt.fr. ECVPT Workshop, April 2011. Ecole Nationale Vétérinaire de Toulouse. Can be downloaded at http://www.biostat.envt.fr/

Mar 27, 2015


Linear Regression

Didier Concordet, d.concordet@envt.fr

ECVPT Workshop April 2011

Ecole Nationale Vétérinaire de Toulouse

Can be downloaded at http://www.biostat.envt.fr/


An example

[Figure: scatterplot of HPLC (Y) against known concentrations (x)]

x     Y
10    38.8
10    60.0
10    49.0
20    85.5
20    82.2
20    72.9
30    96.7
30    102.9
30    114.0
50    156.9
50    176.7
50    171.9
70    212.0
70    223.4
70    228.0
90    283.4
90    274.4
90    294.0


About the straight line

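$Y = a + b x$

a = intercept, b = slope

[Figure: three panels of Y against x. Lines with intercept a and slope b > 0 or b < 0; a horizontal line (b = 0); a line through the origin (a = 0, b > 0).]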


Questions

• How to obtain the best straight line ?

• Is this straight line the best curve to use ?

• How to use this straight line ?


How to obtain the best straight line ?

Proceed in three main steps

• write a (statistical) model

• estimate the parameters

• graphical inspection of data


Write a model

A statistical model

Mean model: functional relationship

Variance model: assumptions on the residuals


Write a model

$Y_i = a + b x_i + \varepsilon_i$

Mean model: $a + b x_i$

$\varepsilon_i$ = residual (error term)


Assumptions on the residuals

• the $x_i$'s are not random variables: they are known with a high precision

• the $\varepsilon_i$'s have a constant variance: homoscedasticity

• the $\varepsilon_i$'s are independent

• the $\varepsilon_i$'s are normally distributed: normality


Homoscedasticity

[Figure: two scatterplots of Y against x. First panel: homoscedasticity. Second panel: heteroscedasticity.]


Normality

[Figure: Y against x]


Estimate the parameters

A criterion is needed to estimate parameters

A statistical model A criterion


How to estimate the "best" a and b ?

Intuitive criterion: $\sum_i \varepsilon_i$ minimum (compensation: positive and negative residuals cancel out)

Reasonable criterion: $\sum_i \varepsilon_i^2$ minimum

Least squares criterion (L.S.)

Linear model

Homoscedasticity

Normality


The least squares criterion

$Y_i = a + b x_i + \varepsilon_i$

$\sum_i \varepsilon_i^2$ minimum

$\sum_i \left( Y_i - (a + b x_i) \right)^2$ minimum


Result of optimisation

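$\hat b = \frac{\sum_i (Y_i - \bar Y)(x_i - \bar x)}{\sum_i (x_i - \bar x)^2}$, $\hat a = \bar Y - \hat b \bar x$

$\hat a$ and $\hat b$ change with samples

$\hat a$ and $\hat b$ are random variables

$se^2(\hat a) = \sigma^2 \left( \frac{1}{n} + \frac{\bar x^2}{\sum_i (x_i - \bar x)^2} \right)$

$se^2(\hat b) = \frac{\sigma^2}{\sum_i (x_i - \bar x)^2}$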
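The closed-form estimates above can be checked numerically. The following is a minimal sketch in plain Python, assuming the 18 HPLC calibration points listed in the example slides; the function name `fit_line` is illustrative, and the unknown $\sigma^2$ is replaced by the residual variance with n - 2 degrees of freedom.

```python
import math

# HPLC calibration data from the example slides (18 points).
x = [10, 10, 10, 20, 20, 20, 30, 30, 30, 50, 50, 50, 70, 70, 70, 90, 90, 90]
y = [38.8, 60.0, 49.0, 85.5, 82.2, 72.9, 96.7, 102.9, 114.0,
     156.9, 176.7, 171.9, 212.0, 223.4, 228.0, 283.4, 274.4, 294.0]


def fit_line(x, y):
    """Least squares fit of Y = a + b*x. Returns (a, b, se_a, se_b)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx                 # slope
    a = y_bar - b * x_bar         # intercept
    # Assumption: sigma^2 is estimated by the residual variance (n - 2 df).
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    sigma2 = ss_res / (n - 2)
    se_a = math.sqrt(sigma2 * (1 / n + x_bar ** 2 / sxx))
    se_b = math.sqrt(sigma2 / sxx)
    return a, b, se_a, se_b


a_hat, b_hat, se_a, se_b = fit_line(x, y)
print(f"a = {a_hat:.3f} (se {se_a:.3f}), b = {b_hat:.3f} (se {se_b:.3f})")
```

With these data the sketch should agree with the software output quoted in the example (intercept about 20.046, slope about 2.916).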


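Summary

Model: $Y_i = a + b x_i + \varepsilon_i$

True mean straight line: $Y = a + b x$

Estimated straight line: $\hat Y = \hat a + \hat b x$ or $\hat Y = \bar Y + \hat b (x - \bar x)$

Mean predicted value for the ith observation: $\hat Y_i = \hat a + \hat b x_i$

ith residual: $\hat\varepsilon_i = Y_i - \hat Y_i = Y_i - (\hat a + \hat b x_i)$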


Example

Dep Var: HPLC   N: 18

Effect      Coefficient   Std Error   t        P(2 Tail)
CONSTANT    20.046        3.682       5.444    0.000
CONCENT     2.916         0.069       42.030   0.000

CONSTANT = intercept $\hat a$, CONCENT = slope $\hat b$

Estimated straight line: $\hat Y = 20.046 + 2.916\,x$

$CV(\hat a) = \frac{se(\hat a)}{\hat a} = \frac{3.682}{20.046} = 0.1837 = 18.37\%$


Example

[Figure: HPLC (Y) against known concentrations (x)]


Example

$x_i$   $Y_i$    $\hat Y_i$   $\hat\varepsilon_i$
10    38.8    49.2    -10.4
10    60.0    49.2    10.8
10    49.0    49.2    -0.2
20    85.5    78.4    7.2
20    82.2    78.4    3.8
20    72.9    78.4    -5.5
30    96.7    107.5   -10.9
30    102.9   107.5   -4.6
30    114.0   107.5   6.5
50    156.9   165.8   -9.0
50    176.7   165.8   10.8
50    171.9   165.8   6.1
70    212.0   224.2   -12.2
70    223.4   224.2   -0.8
70    228.0   224.2   3.8
90    283.4   282.5   0.9
90    274.4   282.5   -8.1
90    294.0   282.5   11.5


Residual variance

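By construction $\sum_i \hat\varepsilon_i = \sum_i (Y_i - \hat Y_i) = 0$

but $\sum_i \hat\varepsilon_i^2 = \sum_i (Y_i - \hat Y_i)^2 \neq 0$

The residual variance is defined by

$\hat\sigma^2 = \frac{\sum_i \hat\varepsilon_i^2}{n - 2} = \frac{\sum_i (Y_i - \hat Y_i)^2}{n - 2}$

$\sqrt{\hat\sigma^2}$ = standard error of estimate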
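As a quick numerical illustration of these definitions, the sketch below recomputes the fitted values, the residuals and the standard error of estimate for the HPLC example. It is written in plain Python and assumes the 18 data points from the example slides; variable names are illustrative.

```python
import math

# HPLC calibration data from the example slides (18 points).
x = [10, 10, 10, 20, 20, 20, 30, 30, 30, 50, 50, 50, 70, 70, 70, 90, 90, 90]
y = [38.8, 60.0, 49.0, 85.5, 82.2, 72.9, 96.7, 102.9, 114.0,
     156.9, 176.7, 171.9, 212.0, 223.4, 228.0, 283.4, 274.4, 294.0]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
b_hat = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
a_hat = y_bar - b_hat * x_bar

fitted = [a_hat + b_hat * xi for xi in x]            # predicted means
residuals = [yi - fi for yi, fi in zip(y, fitted)]   # estimated residuals

sum_res = sum(residuals)                   # zero up to rounding error
ss_res = sum(e ** 2 for e in residuals)    # residual sum of squares
sigma2_hat = ss_res / (n - 2)              # residual variance
see = math.sqrt(sigma2_hat)                # standard error of estimate

print(f"sum of residuals = {sum_res:.2e}")
print(f"standard error of estimate = {see:.3f}")
```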


Example

Dep Var: HPLC   N: 18

Multiple R: 0.996   Squared multiple R: 0.991
Adjusted squared multiple R: 0.991

Standard error of estimate: 8.282

Effect      Coefficient   Std Error   t        P(2 Tail)
CONSTANT    20.046        3.682       5.444    0.000
CONCENT     2.916         0.069       42.030   0.000


Questions

• How to obtain the best straight line ?

• Is this straight line the best curve to use ?

• How to use this straight line ?


Is this model the best one to use ?

Tools to check the mean model:

• scatterplot residuals vs fitted values

• test(s)

Tools to check the variance model:

• scatterplot residuals vs fitted values

• probability plot (P-plot)


Checking the mean model

scatterplot residuals vs fitted values

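[Figure: two scatterplots of the residuals $\hat\varepsilon_i$ against the fitted values $\hat Y_i$, centred on 0]

No structure in the residuals: OK

Structure in the residuals: change the mean model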


Checking the mean model : tests

Two cases

Replications: test of lack of fit

No replication: try a polynomial model (quadratic first)

Page 25: Linear Regression Didier Concordet d.concordet@envt.fr ECVPT Workshop April 2011 Ecole Nationale Vétérinaire de Toulouse Can be downloaded at

25

Without replication

Try another mean model and test the improvement

Example: $Y_i = a + b x_i + c x_i^2 + \varepsilon_i$

If the test on c is significant ($c \neq 0$) then keep this model

Dep Var: HPLC   N: 18
Multiple R: 0.996   Squared multiple R: 0.991
Adjusted squared multiple R: 0.991
Standard error of estimate: 8.539

Effect             Coefficient   Std Error   t       P(2 Tail)
CONSTANT           21.284        6.649       3.201   0.006
CONCENT            2.842         0.335       8.486   0.000
CONCENT*CONCENT    0.001         0.003       0.227   0.824
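The test on c can be reproduced without a matrix library. The sketch below is one possible way to do it, not necessarily how the software behind the slide works: it removes the straight-line part from both Y and x² and regresses one set of residuals on the other, which gives the same c as the full quadratic fit. It assumes the 18 HPLC points from the example slides; `residualize` is an illustrative helper name.

```python
import math

# HPLC calibration data from the example slides (18 points).
x = [10, 10, 10, 20, 20, 20, 30, 30, 30, 50, 50, 50, 70, 70, 70, 90, 90, 90]
y = [38.8, 60.0, 49.0, 85.5, 82.2, 72.9, 96.7, 102.9, 114.0,
     156.9, 176.7, 171.9, 212.0, 223.4, 228.0, 283.4, 274.4, 294.0]


def residualize(v, x):
    """Residuals of v after a least squares straight-line fit on x."""
    n = len(x)
    x_bar = sum(x) / n
    v_bar = sum(v) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (vi - v_bar) for xi, vi in zip(x, v)) / sxx
    return [vi - v_bar - slope * (xi - x_bar) for xi, vi in zip(x, v)]


n = len(x)
e_y = residualize(y, x)                        # Y with the straight line removed
z = residualize([xi ** 2 for xi in x], x)      # x^2 with the straight line removed

szz = sum(zi ** 2 for zi in z)
c_hat = sum(zi * ei for zi, ei in zip(z, e_y)) / szz   # quadratic coefficient

# Residual sum of squares of the quadratic model, with n - 3 df.
ss_res_quad = sum(ei ** 2 for ei in e_y) - c_hat ** 2 * szz
sigma_quad = math.sqrt(ss_res_quad / (n - 3))
se_c = sigma_quad / math.sqrt(szz)
t_c = c_hat / se_c

print(f"c = {c_hat:.4f}, se = {se_c:.4f}, t = {t_c:.3f}")
```

A small |t| for c, as in the output quoted above (P = 0.824), gives no reason to prefer the quadratic model over the straight line.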


With replications

Perform a test of lack of fit

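Principle: compare the departure from linearity to the pure error

if departure from linearity > pure error then change the model

[Figure: Y against x with the fitted line $\hat a + \hat b x$, showing the departure from linearity and the pure error]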


Test of lack of fit : how to do it ?

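Three steps

1) Linear regression: $SS_{RES} = (n - 2)\,\hat\sigma^2$, $df_{RES} = n - 2$

2) One way ANOVA: $SS_{error}$, $df_{error}$, $\hat\sigma^2_{anova} = SS_{error} / df_{error}$

3) $SS_{LOF} = SS_{RES} - SS_{error}$, $df_{LOF} = df_{RES} - df_{error}$

if $F = \frac{SS_{LOF} / df_{LOF}}{SS_{error} / df_{error}} > f^{\alpha}_{df_{LOF},\, df_{error}}$

then change the model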


Test of lack of fit : example

Three steps

1) Linear regression: $\hat\sigma^2 = 8.282^2$, $df_{RES} = 18 - 2 = 16$, $SS_{RES} = (18 - 2) \times 8.282^2 = 1097.5$

2) One way ANOVA

Dep Var: HPLC   N: 18
Analysis of Variance

Source     Sum-of-Squares   df   Mean-Square   F-ratio   P
CONCENT    121251.776       5    24250.355     289.434   0.000
Error      1005.427         12   83.786

3) $SS_{LOF} = 1097.5 - 1005.427$, $df_{LOF} = 16 - 12 = 4$

$F = 0.2747 < 3.26 = f^{0.05}_{4,12}$

We keep the straight line
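The three steps can be sketched in plain Python as follows. This assumes the 18 HPLC points from the example slides and takes the critical value 3.26 directly from the slide rather than computing it; variable names are illustrative.

```python
from collections import defaultdict

# HPLC calibration data from the example slides (18 points).
x = [10, 10, 10, 20, 20, 20, 30, 30, 30, 50, 50, 50, 70, 70, 70, 90, 90, 90]
y = [38.8, 60.0, 49.0, 85.5, 82.2, 72.9, 96.7, 102.9, 114.0,
     156.9, 176.7, 171.9, 212.0, 223.4, 228.0, 283.4, 274.4, 294.0]
n = len(x)

# Step 1: linear regression, residual sum of squares with n - 2 df.
x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
b_hat = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
a_hat = y_bar - b_hat * x_bar
ss_res = sum((yi - a_hat - b_hat * xi) ** 2 for xi, yi in zip(x, y))
df_res = n - 2

# Step 2: one way ANOVA, pure error around the mean of each concentration.
groups = defaultdict(list)
for xi, yi in zip(x, y):
    groups[xi].append(yi)
ss_error = sum(
    sum((yi - sum(g) / len(g)) ** 2 for yi in g) for g in groups.values()
)
df_error = n - len(groups)

# Step 3: lack of fit by difference, then the F ratio.
ss_lof = ss_res - ss_error
df_lof = df_res - df_error
f_ratio = (ss_lof / df_lof) / (ss_error / df_error)

F_CRIT = 3.26  # value quoted on the slide for f(4, 12) at the 5% level
change_model = f_ratio > F_CRIT
print(f"F = {f_ratio:.4f}, change the model: {change_model}")
```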


Checking the variance model : homoscedasticity

scatterplot residuals vs fitted values

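[Figure: two scatterplots of the residuals $\hat\varepsilon_i$ against the fitted values $\hat Y_i$, centred on 0]

Homoscedasticity: OK

No structure in the residuals but heteroscedasticity: change the model (criterion)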


What to do with heteroscedasticity ?

scatterplot residuals vs fitted values: model the dispersion.

[Figure: residuals $\hat\varepsilon_i$ against fitted values $\hat Y_i$, centred on 0]

The standard deviation of the residuals increases with $\hat Y$: it increases with x


What to do with heteroscedasticity ?

Estimate again the slope and the intercept but with weights inversely proportional to the variance:

$\sum_i w_i \varepsilon_i^2 = \sum_i w_i (Y_i - a - b x_i)^2$ minimum, with $w_i = \frac{1}{x_i^2}$

and check that the weighted residuals (as defined above) are homoscedastic
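A minimal sketch of this weighted criterion in plain Python is given below. The data are hypothetical (they are not from the slides) and were chosen so that the scatter grows with x; `weighted_fit` is an illustrative name, and the weighted residual is taken here as $\sqrt{w_i}\,\hat\varepsilon_i$, which is an assumption about the definition the slide refers to.

```python
import math


def weighted_fit(x, y, w):
    """Minimise sum_i w_i * (y_i - a - b*x_i)^2. Returns (a, b)."""
    sw = sum(w)
    x_w = sum(wi * xi for wi, xi in zip(w, x)) / sw   # weighted mean of x
    y_w = sum(wi * yi for wi, yi in zip(w, y)) / sw   # weighted mean of y
    sxx_w = sum(wi * (xi - x_w) ** 2 for wi, xi in zip(w, x))
    sxy_w = sum(wi * (xi - x_w) * (yi - y_w) for wi, xi, yi in zip(w, x, y))
    b = sxy_w / sxx_w
    a = y_w - b * x_w
    return a, b


# Hypothetical data whose dispersion increases with x (not from the slides).
x = [10, 10, 20, 20, 50, 50, 90, 90]
y = [48.0, 51.0, 76.0, 82.0, 158.0, 174.0, 265.0, 300.0]

w = [1 / xi ** 2 for xi in x]          # weights w_i = 1 / x_i^2
a_w, b_w = weighted_fit(x, y, w)

# Assumed definition of the weighted residuals: sqrt(w_i) * (Y_i - a - b*x_i).
weighted_residuals = [
    math.sqrt(wi) * (yi - a_w - b_w * xi) for wi, xi, yi in zip(w, x, y)
]
print(f"a = {a_w:.3f}, b = {b_w:.3f}")
```

Plotting `weighted_residuals` against the fitted values is then the same homoscedasticity check as before, applied to the weighted fit.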


Checking the variance model : normality

[Figure: two probability plots of the residuals $\hat\varepsilon_i$; vertical axis: expected value for normal distribution]

No curvature: normality

Curvature: non-normality. Is it so important?


What to do with non normality ?

Try to model the distribution of the residuals

In general, it is difficult with few observations

If enough observations are available, non-normality does not affect the result too much.


An interesting index: R²

R² = squared correlation coefficient

= % of dispersion of the $Y_i$'s explained by the straight line (the model)

0 ≤ R² ≤ 1

If R² = 1, all the $\hat\varepsilon_i$ = 0, the straight line explains all the variation of the $Y_i$'s

If R² = 0, the slope is 0, the straight line does not explain any variation of the $Y_i$'s


An interesting index: R²

[Figure: scatterplot of Y against x]

R² and R (correlation coefficient) are not designed to measure linearity!

Example:
Multiple R: 0.990
Squared multiple R: 0.980
Adjusted squared multiple R: 0.980
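To see why a high R² says little about linearity, the sketch below fits a straight line to hypothetical, exactly quadratic data. These are not the data behind the slide's figure; they only illustrate that R² can be high while the residuals show an obvious structure.

```python
def fit_and_r2(x, y):
    """Straight-line least squares fit. Returns (a, b, r_squared)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b, sxy ** 2 / (sxx * syy)


# Hypothetical curved data: Y = 0.2 * x^2, with no noise at all.
x = list(range(1, 31))
y = [0.2 * xi ** 2 for xi in x]

a_hat, b_hat, r2 = fit_and_r2(x, y)
residuals = [yi - (a_hat + b_hat * xi) for xi, yi in zip(x, y)]

print(f"R^2 = {r2:.3f}")
# The residuals are positive at both ends and negative in the middle:
print(residuals[0] > 0, residuals[len(x) // 2] < 0, residuals[-1] > 0)
```

The residual plot reveals the curvature that R² alone hides, which is the point made by the slide.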


Questions

• How to obtain the best straight line ?

• Is this straight line the best curve to use ?

• How to use this straight line ?


How to use this straight line ?

• Direct use: for a given x
– predict the mean Y
– construct a confidence interval of the mean Y
– construct a prediction interval of Y

• Reverse use, calibration (approximate results): for a given Y
– predict the mean x
– construct a confidence interval of the mean x
– construct a prediction interval of x


For a given x predict the mean Y

For a given x, the predicted mean Y is $\hat a + \hat b x$

Example: $\hat a = 20.046$, $\hat b = 2.916$, $x = 30$

$\hat a + \hat b x = 107.53$


Confidence interval of the mean Y

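$\hat a + \hat b x - t^{1-\alpha/2}_{n-2} \sqrt{\hat\sigma^2 \left( \frac{1}{n} + \frac{(x - \bar x)^2}{\sum_i (x_i - \bar x)^2} \right)} \;\leq\; a + b x \;\leq\; \hat a + \hat b x + t^{1-\alpha/2}_{n-2} \sqrt{\hat\sigma^2 \left( \frac{1}{n} + \frac{(x - \bar x)^2}{\sum_i (x_i - \bar x)^2} \right)}$

There is a probability $1 - \alpha$ that $a + b x$ belongs to this interval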


Confidence interval of the mean Y

[Figure: fitted line with its confidence band; at x = 30 the interval runs from L to U]


Example

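$\hat a = 20.046$, $\hat b = 2.916$

$x = 30$, $\hat a + \hat b x = 107.53$

$n = 18$, $\alpha = 0.05$, $t^{1-\alpha/2}_{n-2} = 2.12$

$\hat\sigma^2 = 8.282^2$

$\bar x = 45$, $\sum_i (x_i - \bar x)^2 = 14250$

$L = 102.8$, $U = 112.2$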


Prediction interval of Y

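$\hat a + \hat b x - t^{1-\alpha/2}_{n-2} \sqrt{\hat\sigma^2 \left( 1 + \frac{1}{n} + \frac{(x - \bar x)^2}{\sum_i (x_i - \bar x)^2} \right)} \;\leq\; Y \;\leq\; \hat a + \hat b x + t^{1-\alpha/2}_{n-2} \sqrt{\hat\sigma^2 \left( 1 + \frac{1}{n} + \frac{(x - \bar x)^2}{\sum_i (x_i - \bar x)^2} \right)}$

$100(1 - \alpha)\%$ of the measurements carried out for this x belong to this interval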


Prediction interval of Y

[Figure: fitted line with its prediction band; at x = 30 the interval runs from L to U]


Example

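$\hat a = 20.046$, $\hat b = 2.916$

$x = 30$, $\hat a + \hat b x = 107.53$

$n = 18$, $\alpha = 0.05$, $t^{1-\alpha/2}_{n-2} = 2.12$

$\hat\sigma^2 = 8.282^2$

$\bar x = 45$, $\sum_i (x_i - \bar x)^2 = 14250$

$L = 89.4$, $U = 125.7$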
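Both direct-use intervals at x = 30 can be recomputed with a few lines of plain Python. The sketch assumes the 18 HPLC points from the example slides and takes the t quantile 2.12 from the slides instead of computing it; the function name `intervals` is illustrative.

```python
import math

# HPLC calibration data from the example slides (18 points).
x = [10, 10, 10, 20, 20, 20, 30, 30, 30, 50, 50, 50, 70, 70, 70, 90, 90, 90]
y = [38.8, 60.0, 49.0, 85.5, 82.2, 72.9, 96.7, 102.9, 114.0,
     156.9, 176.7, 171.9, 212.0, 223.4, 228.0, 283.4, 274.4, 294.0]

T_QUANTILE = 2.12  # t quantile for 16 df and alpha = 0.05, as quoted on the slides


def intervals(x, y, x0, t):
    """Predicted mean at x0 with its confidence and prediction intervals."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    a = y_bar - b * x_bar
    sigma2 = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y)) / (n - 2)
    fitted = a + b * x0
    leverage = 1 / n + (x0 - x_bar) ** 2 / sxx
    half_ci = t * math.sqrt(sigma2 * leverage)        # interval for the mean Y
    half_pi = t * math.sqrt(sigma2 * (1 + leverage))  # interval for a new Y
    return (fitted,
            (fitted - half_ci, fitted + half_ci),
            (fitted - half_pi, fitted + half_pi))


fitted, ci, pi = intervals(x, y, 30, T_QUANTILE)
print(f"mean Y at x = 30: {fitted:.2f}")
print(f"confidence interval: [{ci[0]:.1f}, {ci[1]:.1f}]")
print(f"prediction interval: [{pi[0]:.1f}, {pi[1]:.1f}]")
```

The prediction interval is wider than the confidence interval because it adds the variance of a single new measurement (the extra 1 under the square root).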


Reverse use : for a given Y=y0 predict the mean X

$\hat X = \frac{y_0 - \hat a}{\hat b}$

Example: $\hat a = 20.046$, $\hat b = 2.916$, $y_0 = 107.53$

$\hat X = 30$


For a given Y=y0 a confidence interval of the mean X

[Figure: fitted line with its confidence band; for $Y = y_0$ the interval [L, U] is read on the x axis around $\hat X$]


Confidence interval of the mean X

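L and U are such that

$y_0 = \hat a + \hat b L + t^{1-\alpha/2}_{n-2} \sqrt{\hat\sigma^2 \left( \frac{1}{n} + \frac{(L - \bar x)^2}{\sum_i (x_i - \bar x)^2} \right)}$

$y_0 = \hat a + \hat b U - t^{1-\alpha/2}_{n-2} \sqrt{\hat\sigma^2 \left( \frac{1}{n} + \frac{(U - \bar x)^2}{\sum_i (x_i - \bar x)^2} \right)}$

There is a probability $1 - \alpha$ that the mean X belongs to [L, U]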


Example

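$\hat a = 20.046$, $\hat b = 2.916$

$y_0 = 107.53$

$n = 18$, $\alpha = 0.05$, $t^{1-\alpha/2}_{n-2} = 2.12$

$\hat\sigma^2 = 8.282^2$

$\bar x = 45$, $\sum_i (x_i - \bar x)^2 = 14250$

$L = 28.59$, $U = 31.33$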


What you should no longer believe

One can fit the straight line by inverting x and Y

If the correlation coefficient is high, the straight line is the best model

Normality of the $\varepsilon_i$'s is essential to perform a good regression

Normality of the xi's is required to perform a regression