Variance and covariance

Variance and covariance

n

ii

T

nT

n

a

aaa

a

aa

1

2

212

1

......

UU

UU

n

ii

n

ii

T

an

an

Variance1

22

1

2

11

11

......

MM

n

ii

T

nT

n

Variancena

aaa

a

aa

1

2

21

2

1

)1(

...

...

VV

V

V

TT

nnVariance ))((

11

11 2 MUMUVV

M contains the mean

Covariance)1(

...;

...

1

2

1

2

1

nba

b

bb

a

aa

n

iBiAi

T

Bn

B

B

An

A

A

AB

BA

iancevarCon B

TA

)()(

11 XBXA

Sums of squares

General additive models

iancevarCon B

TA

)()(

11 XBXA

The coefficient of correlation

yx

xy

yx

xyr

)cov(

)()'(1

1)var(

)()'(1

1)var(

)()'(1

1)cov(

YY

X

YX

ΜXΜX

ΜXΜX

ΜYΜX

ny

nx

nxy

Y

XX

)()')(()'()()'(

YY

YX

ΜYΜYΜXΜXΜYΜXR

XX

For a matrix X that contains several variables

holds

The matrix R is a symmetric distance matrix that contains all correlations between the variables

11

11 )()'(1

1

XX

XX

DΣΣR

ΣΜXΜXΣRn

The diagonal matrix SX contains the standard deviations as entries.

X-M is called the central matrix.

We deal with samples

)()'(1

1 ΜXΜXD

n

MatrixCov

Xn

X

X

X

0000...00000000

2

1

Σ

11

11 )()'(1

1

XX

XX

DΣΣR

ΣΜXΜXΣRn

XΣX XTr

Xn

X

/1......

/1 1

X XnXT /1....../1 1X

Pre-and postmultiplication

nnnn

n

n

X

................

...

...

21

22221

11211

Σ

nnnnn

n

n

nX

/1....../1

................

...

...

/1....../1

1

21

22221

11211

1ΣR

Premultiplication Postmultiplication

n

i

n

jij

n

i

n

j ji

ijnnnn scalarr

1 11 11;;;111

XΣXR

n

/1...00............0.../100...0/1

2

1

X

nnnn

n

n

X

................

...

...

21

22221

11211

ΣΣXXXXΣXXR S

For diagonal matrices X holds

Linear regression

European bat species and environmental correlates

ln(Area)ln(Number

of species)

10.26632 3.2580976.148468 011.33704 3.2188767.696213 0.6931478.519989 2.7080512.24361 2.89037210.3264 2.995732

10.84344 3.17805412.40519 2.89037211.61702 3.4965088.891512 2.1972255.703782 1.6094389.068777 3.0445229.019059 2.83321310.94366 3.5263617.824046 1.0986129.132379 2.89037211.27551 3.17805410.67112 2.6390577.887209 2.63905710.71945 2.3978957.243513 012.73123 2.39789513.20664 3.46573612.78555 3.2188761.871802 1.60943811.7905 3.496508

11.44094 3.33220511.54248 011.16014 2.39789512.6162 3.433987

9.615805 2.56494911.07637 2.7725895.075174 1.79175911.08702 2.6390577.858641 2.89037210.1401 3.178054

6.670766 1.6094385.755742 2.07944210.42552 3.0445220.667829 08.265136 2.1972259.557046 2.07944212.68838 2.39789512.65321 3.17805411.42796 3.17805412.37772 3.40119715.25979 3.4011974.110874 010.07799 2.99573211.53468 3.36729610.14353 3.09104210.80058 3.3322059.917045 3.29583713.13427 3.46573611.03568 013.01692 2.89037210.62825 3.36729610.63432 2.83321310.07593 3.25809713.31114 3.258097-0.82098 0

N=62

)ln()ln( 10 AaaS

1

02

1

2

1

102

1

1......

11

...1...11

... aa

x

xx

x

xx

aa

y

yy

nnn

Y

XAY Matrix approach to linear regression

YXXXA

AIAXAXXXYXXX

XAXYX

''

''''

''

1

11

X is not a square matrix, hence X-1 doesn’t exist.

The species – area relationship of European bats

ln(Number of

species)Constant ln(Area) X'

3.258097 1 10.26632 1 1 1 1 1 1 1 1 1 1 1 1 10 1 6.148468 10.26632 6.148468 11.33704 7.696213 8.519989 12.24361 10.3264 10.84344 12.40519 11.61702 8.891512 5.703782 9.068777

3.218876 1 11.337040.693147 1 7.696213 X'X2.70805 1 8.519989 62 607.1316

2.890372 1 12.24361 607.1316 6518.1612.995732 1 10.3264 (X'X)-1

3.178054 1 10.84344 0.183521 -0.017092.890372 1 12.40519 -0.01709 0.0017463.496508 1 11.617022.197225 1 8.8915121.609438 1 5.703782 X'Y3.044522 1 9.068777 154.29372.833213 1 9.019059 1647.9083.526361 1 10.943661.098612 1 7.8240462.890372 1 9.132379 (X'X)-1(X'Y)3.178054 1 11.27551 a0 0.1468082.639057 1 10.67112 a1 0.2391442.639057 1 7.8872092.397895 1 10.71945

0 1 7.2435132.397895 1 12.731233.465736 1 13.206643.218876 1 12.785551.609438 1 1.8718023.496508 1 11.7905

y = 0.2391x + 0.1468R² = 0.4614

-0.5

0

0.5

1

1.5

2

2.5

3

3.5

4

-5 0 5 10 15 20

ln(#

spec

ies)

ln (Area)

What about the part of variance explained by our model?

11 )()'(1

1

XX ΣΜXΜXΣRn

24.024.015.0 16.1

15.0ln24.0ln

AAeS

AS

1.16: Average number of species per unit area (species density)

0.24: spatial species turnover

11 )()'(1

1

XX ΣΜXΜXΣRn

X-M (X-M)'

0.769488 0.473878 0.769488 -2.48861 0.730267 -1.79546 0.219442 0.401763-2.48861 -3.64398 0.473878 -3.64398 1.54459 -2.09623 -1.27246 2.4511640.730267 1.54459-1.79546 -2.096230.219442 -1.27246 (X-M)'(X-M) (X-M)'(X-M) / (n-1)0.401763 2.451164 71.0087 136.9954 1.164077 2.2458260.507124 0.533954 136.9954 572.8582 2.245826 9.3911190.689445 1.0509910.401763 2.612741 Sx1.007899 1.824579 1.078924 0-0.29138 -0.90093 0 3.064493-0.87917 -4.088660.555914 -0.72367 Sx-1

0.344605 -0.77339 0.926849 01.037752 1.151213 0 0.326318

-1.39 -1.96840.401763 -0.66007 Sx-1 (X-M)'(X-M) / (n-1)0.689445 1.48306 1.078924 2.0815420.150449 0.878671 0.732854 3.0644930.150449 -1.90524-0.09071 0.927004-2.48861 -2.54893 Sx-1 (X-M)'(X-M) / (n-1) Sx-1-0.09071 2.938785 1 0.6792450.977127 3.414195 0.679245 10.730267 2.993105-0.87917 -7.920641.007899 1.998051 Sx-1 (X-M)'(X-M) / (n-1) Sx-1)2

0.843596 1.64849 1 0.461374-2.48861 1.750039 0.461374 1-0.09071 1.3676980.945379 2.8237520.076341 -0.17664

y = 0.2391x + 0.1468R² = 0.4614

-0.5

0

0.5

1

1.5

2

2.5

3

3.5

4

-5 0 5 10 15 20

ln(#

spec

ies)

ln (Area)

n

iiMY YY

n 1

22; )(

11y = 0.2391x + 0.1468

R² = 0.4614

-0.5

0

0.5

1

1.5

2

2.5

3

3.5

4

-5 0 5 10 15 20

ln(#

spec

ies)

ln (Area)

How to interpret the coefficient of determination

n

ii

n

ii

n

ii

n

iii

YY

YXY

YYn

XYYn

variance Totalvariance ResidualR

1

2

1

2

1

2

1

2

2

)(

))((

)(1

1

))((1

1

11

dfR

RF 2

2

1Statistical testing

is done by an F or a t-test.

2);(

2)(;

2; MXYXYYMY

n

iiiXYY XYY

n 1

22)(; ))((

11

n

iiMXY YXY

n 1

22);( ))((

11

dfR

Rt

Ft

21

Total variance

Rest (unexplained) variance

Residual (explained) variance

LaNaTaAaaS T 403210 )ln()ln(

The general linear model

n

iiinn XaaXaXaXaXaaY

103322110 ...

A model that assumes that a dependent variable Y can be expressed by a linear combination of predictor variables X is called a linear model.

XAY

nnmm

n

n

m y

ya

xx

xxxx

y

yy

......1

.........1...1...1

...1

0

,1,

,21,2

,11,1

2

1

YXXXA

AIAXAXXXYXXX

XAXYX

''

''''

''

1

11

ΕXAY

nnnmm

n

n

m y

ya

xx

xxxx

y

yy

.........1

.........1...1...1

...1

0

1

0

,1,

,21,2

,11,1

2

1

The vector E contains the error terms of each regression. Aim is to minimize E.

The general linear model

n

iiinn XaaXaXaXaXaaY

103322110 ...

n

iiinn XaaXaXaXaXaaY

103322110 ...

If the errors of the preictor variables are Gaussian the error term e should also be Gaussian and means and variances are additive

)()()(1

0

n

iii XaaY )()( 2

10

22

n

iii XaaY

Total variance

Explained variance

Unexplained (rest)

variance

)()()(

)( 2

22

21

02

2

YY

Y

XaaR

n

iii

LaNaAaaS T 40310 )ln()ln(

1. Model formulation2. Estimation of model parameters

3. Estimation of statistical significance

YXXXA

XAY

'' 1

Y

Country/Islandln(Number

of species)

Constant ln(Area)Days below zero

Latitude of capitals (decimal degrees)

Albania 3.258097 1 10.26632 34 41.33Andorra 0 1 6.148468 60 42.5Austria 3.218876 1 11.33704 92 48.12Azores 0.693147 1 7.696213 1 37.73Baleary Islands 2.70805 1 8.519989 18 39.55Belarus 2.890372 1 12.24361 144 53.87Belgium 2.995732 1 10.3264 50 50.9Bosnia and Herzegovina 3.178054 1 10.84344 114 43.82British islands 2.890372 1 12.40519 64 51.15Bulgaria 3.496508 1 11.61702 102 42.65Canary Islands 2.197225 1 8.891512 1 27.93Channel Is. 1.609438 1 5.703782 12 49.22Corsica 3.044522 1 9.068777 11 41.92Crete 2.833213 1 9.019059 1 35.33Croatia 3.526361 1 10.94366 114 45.82Cyclades Is. 1.098612 1 7.824046 1 37.1Cyprus 2.890372 1 9.132379 2 35.15Czech Republic 3.178054 1 11.27551 119 50.1Denmark 2.639057 1 10.67112 85 55.63Dodecanese Is. 2.639057 1 7.887209 2 36.4Estonia 2.397895 1 10.71945 143 59.35Faroe Is. 0 1 7.243513 35 62Finland 2.397895 1 12.73123 169 60.32France 3.465736 1 13.20664 50 48.73Germany 3.218876 1 12.78555 97 52.38Gibraltar 1.609438 1 1.871802 0 36.1Greece 3.496508 1 11.7905 2 37.9Hungary 3.332205 1 11.44094 100 47.43Iceland 0 1 11.54248 133 64.13

X

Multiple regression

X'

1 1 1 1 1 1 1 110.26632 6.148468 11.33704 7.696213 8.519989 12.24361 10.3264 10.84344

34 60 92 1 18 144 50 11441.33 42.5 48.12 37.73 39.55 53.87 50.9 43.82

X'X62 607.1316 4328 2906.4

607.1316 6518.161 48545.59 29086.574328 48545.59 534136 228951.7

2906.4 29086.57 228951.7 141148.1

(X'X)-1

1.019166 -0.02275 0.00261 -0.02053-0.02275 0.002458 -7.5E-05 8.3E-050.00261 -7.5E-05 1.3E-05 -5.9E-05

-0.02053 8.3E-05 -5.9E-05 0.000509

(X'X)-1X'0.025783 0.163309 0.013407 0.07203 0.060295 0.010457 -0.13031 0.1703470.003376 -0.00859 0.002243 -0.00078 0.00013 0.001069 0.003124 -0.00097-0.00017 0.000405 9.87E-05 -0.00019 -0.00014 0.000364 -0.00054 0.000676-0.00066 -0.00195 -0.00056 -0.00074 -0.00076 -0.00064 0.003269 -0.00409

(X'X)-1X'Y X'Y (X'X)-1(X'Y)a0 2.679757 154.2937 2.679757a1 0.290121 1647.908 0.290121a2 0.002155 11289.32 0.002155a3 -0.06789 7137.716 -0.06789

Multiple R and R2

n

ii

n

ii

n

ii

n

iii

YY

YXY

YYn

XYYn


1

2

1

2

1

2

1

2

2

)(

))((

)(1

1

))((1

1

11

The coefficient of determination

1..............................

...1

...1

12

212

1211

21

ny

y

my

myyy

rr

rrrrrrrr

R

y x1 x2 xm

XXXY

YX

RRR

R1

The correlation matrix can be devided into four compartments.

11 )()'(1

1

XX ΣΜXΜXΣRn

TTR XYXXXYYXXXXY RRRRRR 112

)det()det(1

)det()det()det(2

XXXX

XXRR

RR

RR

ln(Number of species)

ln(Area)Days below zero


X-M X-M X-M X-M (X-M)'

3.2580965 10.26632 34 41.33 0.769488 0.473878 -35.8065 -5.54742 0.769488 -2.48861 0.730267 -1.79546 0.219442 0.401763 0.507124 0.689445 0.401763 1.007899 -0.291380 6.148468 60 42.5 -2.48861 -3.64398 -9.80645 -4.37742 0.473878 -3.64398 1.54459 -2.09623 -1.27246 2.451164 0.533954 1.050991 2.612741 1.824579 -0.90093

3.2188758 11.33704 92 48.12 0.730267 1.54459 22.19355 1.242581 -35.8065 -9.80645 22.19355 -68.8065 -51.8065 74.19355 -19.8065 44.19355 -5.80645 32.19355 -68.80650.6931472 7.696213 1 37.73 -1.79546 -2.09623 -68.8065 -9.14742 -5.54742 -4.37742 1.242581 -9.14742 -7.32742 6.992581 4.022581 -3.05742 4.272581 -4.22742 -18.94742.7080502 8.519989 18 39.55 0.219442 -1.27246 -51.8065 -7.327422.8903718 12.24361 144 53.87 0.401763 2.451164 74.19355 6.992581 (X-M)'(X-M) (X-M)'(X-M)/(n-1)2.9957323 10.3264 50 50.9 0.507124 0.533954 -19.8065 4.022581 71.0087 136.9954 518.6241 -95.1758 1.164077 2.245826 8.502034 -1.560263.1780538 10.84344 114 43.82 0.689445 1.050991 44.19355 -3.05742 136.9954 572.8582 6163.884 625.8081 2.245826 9.391119 101.0473 10.259152.8903718 12.40519 64 51.15 0.401763 2.612741 -5.80645 4.272581 518.6241 6163.884 232013.7 26066.26 8.502034 101.0473 3803.503 427.31573.4965076 11.61702 102 42.65 1.007899 1.824579 32.19355 -4.22742 -95.1758 625.8081 26066.26 4903.6 -1.56026 10.25915 427.3157 80.386892.1972246 8.891512 1 27.93 -0.29138 -0.90093 -68.8065 -18.94741.6094379 5.703782 12 49.22 -0.87917 -4.08866 -57.8065 2.342581 S3.0445224 9.068777 11 41.92 0.555914 -0.72367 -58.8065 -4.95742 1.078924 0 0 02.8332133 9.019059 1 35.33 0.344605 -0.77339 -68.8065 -11.5474 0 3.064493 0 03.5263605 10.94366 114 45.82 1.037752 1.151213 44.19355 -1.05742 0 0 61.67255 01.0986123 7.824046 1 37.1 -1.39 -1.9684 -68.8065 -9.77742 0 0 0 8.9658742.8903718 9.132379 2 35.15 0.401763 -0.66007 -67.8065 -11.72743.1780538 11.27551 119 50.1 0.689445 1.48306 49.19355 3.222581 S 1

2.6390573 10.67112 85 55.63 0.150449 0.878671 15.19355 8.7525812.6390573 7.887209 2 36.4 0.150449 -1.90524 -67.8065 -10.4774 0.926849 0 0 02.3978953 10.71945 143 59.35 -0.09071 0.927004 73.19355 12.47258 0 0.326318 0 0

0 7.243513 35 62 -2.48861 -2.54893 -34.8065 15.12258 0 0 0.016215 02.3978953 12.73123 169 60.32 -0.09071 2.938785 99.19355 13.44258 0 0 0 0.1115343.4657359 13.20664 50 48.73 0.977127 3.414195 -19.8065 1.8525813.2188758 12.78555 97 52.38 0.730267 2.993105 27.19355 5.502581 S 1D1.6094379 1.871802 0 36.1 -0.87917 -7.92064 -69.8065 -10.7774 1.078924 2.081542 7.880104 -1.446133.4965076 11.7905 2 37.9 1.007899 1.998051 -67.8065 -8.97742 0.732854 3.064493 32.97357 3.3477473.3322045 11.44094 100 47.43 0.843596 1.64849 30.19355 0.552581 0.137858 1.638448 61.67255 6.928784

0 11.54248 133 64.13 -2.48861 1.750039 63.19355 17.25258 -0.17402 1.144244 47.66024 8.9658742.3978953 11.16014 23 53.43 -0.09071 1.367698 -46.8065 6.5525813.4339872 12.6162 18 41.8 0.945379 2.823752 -51.8065 -5.07742 S 1DS-1

2.5649494 9.615805 110 52.7 0.076341 -0.17664 40.19355 5.822581 1 0.679245 0.127773 -0.16129 Det RXX R2

2.7725887 11.07637 124 56.96 0.28398 1.283927 54.19355 10.08258 0.679245 1 0.534656 0.373388 0.286065 0.6664621.7917595 5.075174 90 47.67 -0.69685 -4.71727 20.19355 0.792581 0.127773 0.534656 1 0.772795 Det R2.6390573 11.08702 130 54.62 0.150449 1.294578 60.19355 7.742581 -0.16129 0.373388 0.772795 1 0.0954132.8903718 7.858641 93 49.62 0.401763 -1.9338 23.19355 2.742581

S 1DS-1

1 0.679245 0.127773 -0.16129 Det RXX R2

0.679245 1 0.534656 0.373388 0.286065 0.6664620.127773 0.534656 1 0.772795 Det R-0.16129 0.373388 0.772795 1 0.095413

S 1DS-1)-1

1.408029 -0.86031 0.139099-0.86031 3.008345 -2.003610.139099 -2.00361 2.496439

S 1DS-1)-1RXY S 1DS-1)-1RXYRYX

0.824037 0.6664620.123194-0.56418

TTR XYXXXYYXXXXY RRRRRR 112

)det()det(1

)det()det()det(2

XXXX

XXRRR

RRR

Adjusted R2

6307.383

136233354.066646.01

121

2

2

22

21

kkn

RR

dfdfF

R: correlation matrixn: number of cases

k: number of independent variables in the model

)( parameterSEparametert

11)1(1 22

kn

nRRadj

D<0 is statistically not significant and should

be eliminated from the model.

1)1)(( 21

knRRtraceSE

Y

Country/Islandln(Number

of species)

Constant ln(Area)Days below zero


Latitude2 X'

Albania 3.258097 1 10.26632 34 41.33 1708.169 1 1 1 1 1 1 1 1Andorra 0 1 6.148468 60 42.5 1806.25 10.26632 6.148468 11.33704 7.696213 8.519989 12.24361 10.3264 10.84344Austria 3.218876 1 11.33704 92 48.12 2315.534 34 60 92 1 18 144 50 114Azores 0.693147 1 7.696213 1 37.73 1423.553 41.33 42.5 48.12 37.73 39.55 53.87 50.9 43.82Baleary Islands 2.70805 1 8.519989 18 39.55 1564.203 1708.169 1806.25 2315.534 1423.553 1564.203 2901.977 2590.81 1920.192Belarus 2.890372 1 12.24361 144 53.87 2901.977Belgium 2.995732 1 10.3264 50 50.9 2590.81 X'XBosnia and Herzegovina 3.178054 1 10.84344 114 43.82 1920.192 62 607.1316 4328 2906.4 141148.1British islands 2.890372 1 12.40519 64 51.15 2616.323 607.1316 6518.161 48545.59 29086.57 1441737Bulgaria 3.496508 1 11.61702 102 42.65 1819.023 4328 48545.59 534136 228951.7 12488619Canary Islands 2.197225 1 8.891512 1 27.93 780.0849 2906.4 29086.57 228951.7 141148.1 7106497Channel Is. 1.609438 1 5.703782 12 49.22 2422.608 141148.1 1441737 12488619 7106497 3.71E+08Corsica 3.044522 1 9.068777 11 41.92 1757.286Crete 2.833213 1 9.019059 1 35.33 1248.209 (X'X)-1

Croatia 3.526361 1 10.94366 114 45.82 2099.472 6.45421 0.000497 0.001087 -0.25606 0.002409Cyclades Is. 1.098612 1 7.824046 1 37.1 1376.41 0.000497 0.002557 -8.1E-05 -0.00092 1.03E-05Cyprus 2.890372 1 9.132379 2 35.15 1235.523 0.001087 -8.1E-05 1.34E-05 6.63E-06 -6.8E-07Czech Republic 3.178054 1 11.27551 119 50.1 2510.01 -0.25606 -0.00092 6.63E-06 0.010716 -0.0001Denmark 2.639057 1 10.67112 85 55.63 3094.697 0.002409 1.03E-05 -6.8E-07 -0.0001 1.07E-06Dodecanese Is. 2.639057 1 7.887209 2 36.4 1324.96Estonia 2.397895 1 10.71945 143 59.35 3522.423 (X'X)-1X'Faroe Is. 0 1 7.243513 35 62 3844 0.028519 -0.00857 -0.18332 0.227512 0.119213 -0.18587 -0.27812 -0.01106Finland 2.397895 1 12.73123 169 60.32 3638.502 0.003388 -0.00932 0.001402 -0.00011 0.000382 0.000229 0.002492 -0.00174France 3.465736 1 13.20664 50 48.73 2374.613 -0.00017 0.000453 0.000154 -0.00024 -0.00016 0.000419 -0.00049 0.000727Germany 3.218876 1 12.78555 97 52.38 2743.664 -0.00078 0.0055 0.007968 -0.00748 -0.00331 0.007864 0.009674 0.003767Gibraltar 1.609438 1 1.871802 0 36.1 1303.21 1.21E-06 -7.6E-05 -8.7E-05 6.89E-05 2.61E-05 -8.7E-05 -6.6E-05 -8E-05Greece 3.496508 1 11.7905 2 37.9 1436.41Hungary 3.332205 1 11.44094 100 47.43 2249.605 (X'X)-1X'YIceland 0 1 11.54248 133 64.13 4112.657 a0 -3.40816Ireland 2.397895 1 11.16014 23 53.43 2854.765 a1 0.264082Italy 3.433987 1 12.6162 18 41.8 1747.24 a2 0.003862Kaliningrad Region 2.564949 1 9.615805 110 52.7 2777.29 a3 0.195932Latvia 2.772589 1 11.07637 124 56.96 3244.442 a4 -0.0027

X

A mixed model2

430210 lnln LaLaDaAaaS T

20 0027.0196.0004.0ln26.041.3ln LLDAS T

The final model

Is this model realistic?

Negative species density

Realistic increase of species richness with

area

Increase of species richness with winter

length

Increase of species richness at higher

latitudes

A peak of species richness at

intermediate latitudes

The model makes a series of unrealistic predictions.Our initial assumptions are wrong despite of the high degree of variance explanation

Our problem arises in part from the intercorrelation between the

predictor variables (multicollinearity).

We solve the problem by a step-wise approach eliminating the variables that are either not

significant or give unreasonable parameter values

The variance explanation of this final model is higher than that of the previous one.

y = 0.6966x + 0.7481R² = 0.6973

-1-0.5

00.5

11.5

22.5

33.5

44.5

0 1 2 3 4

ln(#

spec

ies

pred

icte

d)

ln (# species observed)

......... 33221

3223

2222221

3113

211211110 XaXaXaXaXbXaXaXaXaaY nnn

Multiple regression solves systems of intrinsically linear algebraic equations

YXXXA '' 1

• The matrix X’X must not be singular. It est, the variables have to be independent. Otherwise we speak of multicollinearity. Collinearity of r<0.7 are in most cases tolerable.

• Multiple regression to be safely applied needs at least 10 times the number of cases than variables in the model.

• Statistical inference assumes that errors have a normal distribution around the mean.• The model assumes linear (or algebraic) dependencies. Check first for non-linearities. • Check the distribution of residuals Yexp-Yobs. This distribution should be random.• Check the parameters whether they have realistic values.

y = 0.6966x + 0.7481R² = 0.6973

-1-0.5

00.5

11.5

22.5

33.5

44.5

0 1 2 3 4

ln(#

spec

ies

pred

icte

d)

ln (# species observed)

Multiple regression is a hypothesis testing and not a hypothesis generating

technique!!

Polynomial regression General additive model

Standardized coefficients of correlation

xZZ-tranformed distributions have a mean

of 0 an a standard deviation of 1.

YXXX ZZZZB '' 1n

i i n ni 1 i i

X Yi 1 i 1X Y X Y

(X X)(Y Y)(X X) (Y Y)1 1 1r Z Z

n 1 s s n 1 s s n 1

nnn

n

iiiiini

nniii

rr

rr

nR

ZxZxZxZx

ZxZxZxZx

..............................

.......

'1

1

..............................

......

'

1

111

1

111

ZZZZ

XYxx RRB 1

In the case of bivariate regression Y = aX+b, Rxx = 1.Hence B=RXY.

Hence the use of Z-transformed values results in standardized correlations coefficients, termed b-values

ZZRΣΜXΜXΣR XX '1

1)()'(1

1 11

nnBRR XXXY

How to interpret beta-values

If then Beta values are generalisations of simple coefficients of correlation. However, there is an important difference. The higher the correlation between two or more predicator variables (multicollinearity) is, the less will r depend on the correlation between X and Y. Hence other variables might have more and more influence on r and b. For high levels of multicollinearity it might therefore become more and more difficult to interpret beta-values in terms of correlations. Because beta-values are standardized b-values they should allow comparisons to be make about the relative influence of predicator variables. High levels of multicollinearity might let to misinterpretations. Beta values above one are always a sign of too high multicollinearity

Hence high levels of multicollinearity might· reduce the exactness of beta-weight estimates· change the probabilities of making type I and type II errors· make it more difficult to interpret beta-values.

We might apply an additional parameter, the so-called coefficient of structure. The coefficient of structure ci is defined as

where riY denotes the simple correlation between predicator variable i and the dependent variable Y and R2 the coefficient of determination of the multiple regression. Coefficients of structure measure therefore the fraction of total variability a given predictor variable explains. Again, the interpretation of ci is not always unequivocal at high levels of multicollinearity.

BXY Y

XiXiXi B

b

2Rrc iY

i

Partial correlations

X

Y

Z

rxy rzy

rzx

X X(Y) X(Z)Y Y(X) Y(Z)

/ 2 21 1

XY XZ YZ

XY Z

XZ YZ

r r rrr r

Semipartial correlation

XY XZ YZ(X|Y)Z 2

YZ

r r rr1 r

A semipartial correlation correlates a variable with one residual only.

y = 1.02Z + 0.41

0

0.5

1

1.5

2

0 0.5 1Z

X X

y = 1.70Z + 0.600

0.5

1

1.5

2

2.5

0 0.5 1Z

Y

Y

The partial correlation rxy/z is the correlation of the residuals X and Y

Path analysis and linear structure models

Y

X3X2 X4X1

e

Multiple regression

YX3

X2

X4X1

ee

e

e

e

Path analysis tries to do something that is logically impossible, to derive causal relationships from sets of observations.

Path analysis defines a whole model and tries to separate correlations into direct and indirect effects

eXaXaXaXaaY 443322110

The error term e contain the part of the variance in Y that is not explained by the model. These errors are called residuals

Regression analysis does not study the relationships between the predictor

variables

X Z

Y

WpXW pZX

pZY

pXY

e e

e

e

Path analysis is largely based on the computation of partial coefficients of correlation.

Path coefficients

Path analysis is a model confirmatory tool. It should not be used to generate models or even to seek for models that fit the data set.

xw

xy

zx zy

W p X eX p Y e

Z p X p Y e

xw

xy

zx zy

p X W e 0X p Y e 0

p X p Y Z e 0

We start from regression functions

From Z-transformed values we get

X

W

Z

Y

pXW

pYX

pXZ pYZ

W xw X

X xy Y

Z zx X zy Y

W Y xw X Y Y

X W xy Y W W

Z W zx X W zy Y W W

X Z xy Y X X

X Y xy Y Y Y

Z Y zx X Y zy Y Y Y

WY xw XY

XW xy YW

ZW zx X W zy YW

XZ xy YX

XY

Z p Z eZ p Z e

Z p Z p Z e

Z Z p Z Z eZZ Z p Z Z eZ

Z Z p Z Z p Z Z eZ

Z Z p Z Z eZ

Z Z p Z Z eZ

Z Z p Z Z p Z Z eZ

r p rr p r

r p r p r

r p r

r

xy

ZY zx XY zy

p

r p r p

eZY = 0

ZYZY = 1

ZXZY = rXY

xw

xy

zx zy

p X W e 0X p Y e 0

p X p Y Z e 0

Path analysis is a nice tool to generate hypotheses.It fails at low coefficients of correlation and circular

model structures.

Target symptom

X A B C D E Expected values X'1 0 1 1 0 1 0.848615 A 0 0 1 0 11 0 1 1 0 1 0.848615 B 1 1 0 1 00 1 0 0 0 0 -0.2092 C 1 1 0 1 01 0 1 1 1 1 1.108631 D 0 0 0 1 00 1 0 0 0 1 0.106749 E 1 1 0 1 11 0 1 1 1 1 1.1086310 1 0 0 0 0 -0.2092 X'X1 0 1 1 1 1 1.108631 A B C D E1 1 1 1 1 1 0.899435 A 8 5 1 2 40 1 1 0 0 1 0.19602 B 5 11 6 6 91 0 0 1 1 1 1.01936 C 1 6 10 8 101 0 0 1 1 1 1.01936 D 2 6 8 11 111 0 0 0 1 1 0.575961 E 4 9 10 11 151 0 1 0 1 1 0.6652330 1 1 0 0 0 -0.11992 (X'X)-1

0 1 1 0 0 0 -0.11992 0.205969 -0.09304 0.098145 0.0242 -0.082281 0 0 1 1 1 1.01936 -0.09304 0.224792 -0.05216 0.028233 -0.095991 0 0 1 1 1 1.01936 0.098145 -0.05216 0.361387 -0.06158 -0.190640 1 1 0 1 1 0.456037 0.0242 0.028233 -0.06158 0.368379 -0.25249

-0.08228 -0.09599 -0.19064 -0.25249 0.458457Sum 8 11 10 11 15

+L$25*B23+L$26*C23+L$27*D23+L$28*E23+L$29*F23

X'Y (X'X)-1X'Y1 -0.20927 0.089271

10 0.44339910 0.26001612 0.315945

Symptoms

-0.5

0

0.5

1

1.5

0 1

Pred

icte

d va

lue

Observed occurrences

Non-metric multiple regression

R2 (X'X)-1

0.828365 0.205969 -0.09304 0.098145 0.0242 -0.082281-R2 -0.09304 0.224792 -0.05216 0.028233 -0.09599

0.171635 0.098145 -0.05216 0.361387 -0.06158 -0.19064N df 0.0242 0.028233 -0.06158 0.368379 -0.25249

19 13 -0.08228 -0.09599 -0.19064 -0.25249 0.458457

B SE(B) t PA -0.2092 0.035351 0.052147 -4.01163 0.001479B 0.089271 0.038582 0.054478 1.638668 0.125246C 0.443399 0.062027 0.069074 6.419143 2.27E-05D 0.260016 0.063227 0.069739 3.728397 0.002529E 0.315945 0.078687 0.0778 4.060988 0.001348

Statistical inference

1)1)(( 21

knRRtraceSE

n

ii

n

iii

YYn

XYYn


1

2

1

2

2

)(1

1

))((1

1

1

-0.5

0

0.5

1

1.5

0 1

Pred

icte

d va

lue

Observed occurrences

Target symptom

Predicted values

Predicted values

Total variance

Explained variance

Unexplained

varianceX A B C D E1 0 1 1 0 1 0.848615 1 0.135734 0.047105 0.0229171 0 1 1 0 1 0.848615 1 0.135734 0.047105 0.0229170 1 0 0 0 0 -0.2092 0 0.398892 0.706903 0.0437631 0 1 1 1 1 1.108631 1 0.135734 0.227579 0.0118010 1 0 0 0 1 0.106749 0 0.398892 0.275446 0.0113951 0 1 1 1 1 1.108631 1 0.135734 0.227579 0.0118010 1 0 0 0 0 -0.2092 0 0.398892 0.706903 0.0437631 0 1 1 1 1 1.108631 1 0.135734 0.227579 0.0118011 1 1 1 1 1 0.899435 1 0.135734 0.071747 0.0101130 1 1 0 0 1 0.19602 0 0.398892 0.189711 0.0384241 0 0 1 1 1 1.01936 1 0.135734 0.150374 0.0003751 0 0 1 1 1 1.01936 1 0.135734 0.150374 0.0003751 0 0 0 1 1 0.575961 1 0.135734 0.003093 0.1798091 0 1 0 1 1 0.665233 1 0.135734 0.001133 0.1120690 1 1 0 0 0 -0.11992 0 0.398892 0.564758 0.0143820 1 1 0 0 0 -0.11992 0 0.398892 0.564758 0.0143821 0 0 1 1 1 1.01936 1 0.135734 0.150374 0.0003751 0 0 1 1 1 1.01936 1 0.135734 0.150374 0.0003750 1 1 0 1 1 0.456037 0 0.398892 0.030815 0.207969

Mean0.631579 0.421053 0.578947 0.526316 0.578947 0.789474 0.245614 0.249651 0.042156

True R2 1

Approximated R2 0.828365

Symptoms

Rounding errors due to different precisions cause the residual variance to be larger

than the total variance.

Logistic and other regression techniques

A B CMale 5.998 0.838 2.253Male 3.916 0.992 1.964Male 4.511 0.904 1.930Male 5.940 0.795 1.171Male 6.532 0.574 1.390Male 6.513 1.036 0.571Male 3.052 0.584 2.179Male 3.512 1.126 1.843Male 6.676 0.992 2.288Male 6.976 0.502 1.062

Female 5.649 0.913 2.231Female 5.712 0.474 2.237Female 5.112 0.277 1.009Female 3.681 0.329 2.420Female 5.239 0.922 1.592Female 5.180 0.546 2.418Female 2.133 0.300 3.087Female 5.361 0.472 2.175Female 6.460 0.321 1.007Female 6.839 0.426 3.179

11 1

y

y y

eZe e

01

n

i ii

Y a a x

nn 0 i i

i 10 i ii 1

n

0 i ii 1

a a xn a a x

0 i ia a xi 1

p p eln a a x e p1 p 1 p

1 e

01

011

n

i ii

n

i ii

a a x

a a x

eZ

e

0

0.2

0.4

0.6

0.8

1

1.2

0 5 10 15 20

Y

Z

Threshold

Indecisive region

Surely males Surely females

We use odds

The logistic regression model

0

0.2

0.4

0.6

0.8

1

Mal

e

Mal

e

Mal

e

Mal

e

Mal

e

Fem

ale

Fem

ale

Fem

ale

Fem

ale

Fem

ale

Sex

Z

0.19 0.2 6.36 1.77

0.19 0.2 6.36 1.771

A B C

A B C

eZe

0 i i

0a a x

1

bY

1 b e

Generalized non-linear regression models

0

1

0 5 10x

Y b1= 3b2= 4

0

1

0 5 10x

Y

b1= 1b2=0.5

A special regression model that is used in pharmacology

2

00 b

1

bY bX1b

b0 is the maximum response at dose saturation. b1 is the concentration that produces a half maximum response.b2 determines the slope of the function, that means it is a measure how fast the response increases with increasing drug dose.

Variance and covariance

Documents

matrix x

matrix r

general linear model

linear regression x

matrix approach

square matrix

central matrix

model d