General additive models. Variance and covariance. Sums of squares. The coefficient of correlation. - PowerPoint PPT Presentation
Transcript
Variance and covariance
For a sample vector U = (a1, a2, ..., an)', the product U'U gives the sum of squares:

U'U = a1² + a2² + ... + an² = Σ ai²

The mean is (1/n) Σ ai. Let M be the vector that contains this mean in every entry, and let V = U − M be the vector of deviations. Then

Variance = (1/(n−1)) Σ (ai − ā)² = (1/(n−1)) V'V = (1/(n−1)) (U − M)'(U − M)

M contains the mean.
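The matrix form of the variance is easy to check numerically. The NumPy sketch below uses a made-up sample vector (the values are illustrative only):

```python
import numpy as np

# Hypothetical sample; the names U, M, V follow the slide's notation
U = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = len(U)

M = np.full(n, U.mean())       # M contains the mean in every entry
V = U - M                      # vector of deviations from the mean
variance = (V @ V) / (n - 1)   # Variance = V'V / (n - 1)

# Agrees with the usual sample variance
assert np.isclose(variance, np.var(U, ddof=1))
```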
For two samples A = (aA1, ..., aAn)' and B = (aB1, ..., aBn)' with mean vectors MA and MB:

Covariance = (1/(n−1)) Σ (aAi − āA)(aBi − āB) = (1/(n−1)) (A − MA)'(B − MB)
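The matrix form of the covariance can be verified the same way; the NumPy sketch below uses two made-up samples:

```python
import numpy as np

# Two hypothetical samples A and B (names follow the slide)
A = np.array([1.0, 2.0, 3.0, 4.0])
B = np.array([2.0, 4.0, 5.0, 9.0])
n = len(A)

# Covariance = (A - MA)'(B - MB) / (n - 1)
cov_AB = (A - A.mean()) @ (B - B.mean()) / (n - 1)

# Agrees with NumPy's sample covariance
assert np.isclose(cov_AB, np.cov(A, B)[0, 1])
```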
Sums of squares
General additive models
The coefficient of correlation
rxy = cov(x, y) / (σx σy)

We deal with samples:

var(X) = (1/(n−1)) (X − MX)'(X − MX)
var(Y) = (1/(n−1)) (Y − MY)'(Y − MY)
cov(X, Y) = (1/(n−1)) (X − MX)'(Y − MY)

R = (X − MX)'(Y − MY) / √[(X − MX)'(X − MX) (Y − MY)'(Y − MY)]
For a matrix X that contains several variables, the following holds:
The matrix R is a symmetric matrix that contains all pairwise correlations between the variables.
R = (1/(n−1)) ΣX⁻¹ (X − M)'(X − M) ΣX⁻¹ = ΣX⁻¹ D ΣX⁻¹

where D is the dispersion (covariance) matrix.
The diagonal matrix ΣX contains the standard deviations as entries.
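This matrix identity can be checked numerically. The sketch below (NumPy, with randomly generated made-up data) builds R from the deviation matrix and the diagonal matrix of standard deviations and compares it with np.corrcoef:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))          # 50 cases, 3 hypothetical variables
n = X.shape[0]

D = X - X.mean(axis=0)                # deviation matrix (X - M)
S = np.diag(X.std(axis=0, ddof=1))    # diagonal matrix of standard deviations
S_inv = np.linalg.inv(S)

# R = S^-1 (X - M)'(X - M) S^-1 / (n - 1)
R = S_inv @ (D.T @ D) @ S_inv / (n - 1)

# Agrees with NumPy's correlation matrix; diagonal entries are 1
assert np.allclose(R, np.corrcoef(X, rowvar=False))
assert np.allclose(np.diag(R), 1.0)
```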
The model makes a series of unrealistic predictions. Our initial assumptions are wrong despite the high degree of variance explanation.
Our problem arises in part from the intercorrelation between the predictor variables (multicollinearity).
We solve the problem by a step-wise approach, eliminating the variables that are either not significant or give unreasonable parameter values.
The variance explanation of this final model is higher than that of the previous one.
[Figure: ln(# species predicted) plotted against ln(# species observed); fitted line y = 0.6966x + 0.7481, R² = 0.6973]
Y = a0 + a1X1 + a2X2 + a3X3 + ... + anXn + a11X1² + a12X1X2 + a13X1X3 + a22X2² + a23X2X3 + ...
Multiple regression solves systems of intrinsically linear algebraic equations
A = (X'X)⁻¹ X'Y
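A minimal numerical sketch of the normal-equation solution, using made-up data with known true parameters:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
X1 = rng.normal(size=n)
X2 = rng.normal(size=n)
# Hypothetical true model: Y = 2.0 + 1.5*X1 - 0.5*X2 + noise
Y = 2.0 + 1.5 * X1 - 0.5 * X2 + rng.normal(scale=0.1, size=n)

# Design matrix with an intercept column
X = np.column_stack([np.ones(n), X1, X2])

# A = (X'X)^-1 X'Y  -- this fails if X'X is singular (multicollinearity)
A = np.linalg.inv(X.T @ X) @ (X.T @ Y)
print(A)   # should be close to the true parameters [2.0, 1.5, -0.5]
```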
• The matrix X'X must not be singular. That is, the variables have to be independent; otherwise we speak of multicollinearity. Collinearity of r < 0.7 is in most cases tolerable.
• To be safely applied, multiple regression needs at least 10 times as many cases as variables in the model.
• Statistical inference assumes that errors have a normal distribution around the mean.
• The model assumes linear (or algebraic) dependencies. Check first for non-linearities.
• Check the distribution of the residuals Yexp − Yobs. This distribution should be random.
• Check whether the parameters have realistic values.
Multiple regression is a hypothesis-testing and not a hypothesis-generating technique!
Polynomial regression General additive model
Standardized coefficients of correlation
Z-transformed distributions have a mean of 0 and a standard deviation of 1.
B = (ZX'ZX)⁻¹ ZX'ZY
r = (1/(n−1)) Σ (Xi − X̄)(Yi − Ȳ) / (sX sY) = (1/(n−1)) Σ ZX,i ZY,i

Applied to all pairs of variables this gives the matrix of correlations:

R = (1/(n−1)) Z'Z =
| r11 ... r1n |
| ... ... ... |
| rn1 ... rnn |
B = RXX⁻¹ RXY

In the case of bivariate regression Y = aX + b, RXX = 1; hence B = RXY. The use of Z-transformed values therefore results in standardized coefficients of correlation, termed beta values (b-values).

RXX = (1/(n−1)) ΣX⁻¹ (X − M)'(X − M) ΣX⁻¹ = (1/(n−1)) Z'Z

RXY = RXX B
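A short NumPy sketch of the two equivalent routes to the beta values, Z-transformed least squares versus B = RXX⁻¹RXY, using simulated made-up data:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
X = rng.normal(size=(n, 2))
# Hypothetical dependent variable
Y = 3.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(size=n)

def ztransform(a):
    """Center to mean 0 and scale to standard deviation 1."""
    return (a - a.mean(axis=0)) / a.std(axis=0, ddof=1)

ZX, ZY = ztransform(X), ztransform(Y)

# Beta = (Zx'Zx)^-1 Zx'Zy
beta = np.linalg.inv(ZX.T @ ZX) @ (ZX.T @ ZY)

# Equivalent form from correlations: Beta = Rxx^-1 Rxy
Rxx = np.corrcoef(X, rowvar=False)
Rxy = np.array([np.corrcoef(X[:, i], Y)[0, 1] for i in range(X.shape[1])])
assert np.allclose(beta, np.linalg.inv(Rxx) @ Rxy)
```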
How to interpret beta-values
Beta values are generalisations of simple coefficients of correlation. However, there is an important difference. The higher the correlation between two or more predictor variables (multicollinearity), the less r will depend on the correlation between X and Y. Hence other variables might have more and more influence on r and b. At high levels of multicollinearity it might therefore become more and more difficult to interpret beta values in terms of correlations. Because beta values are standardized b-values, they should allow comparisons to be made about the relative influence of predictor variables. High levels of multicollinearity might lead to misinterpretations. Beta values above one are always a sign of too high multicollinearity.
Hence high levels of multicollinearity might
• reduce the exactness of beta-weight estimates,
• change the probabilities of making type I and type II errors,
• make it more difficult to interpret beta values.
We might apply an additional parameter, the so-called coefficient of structure. The coefficient of structure ci is defined as

ci = riY² / R²

where riY denotes the simple correlation between predictor variable i and the dependent variable Y, and R² denotes the coefficient of determination of the multiple regression. Coefficients of structure therefore measure the fraction of total variability a given predictor variable explains. Again, the interpretation of ci is not always unequivocal at high levels of multicollinearity.

The beta values are related to the raw regression coefficients by βXi = bXi σXi / σY.
Partial correlations
[Diagram: variables X, Y, and Z with pairwise correlations rXY, rZY, and rZX; X and Y are decomposed into the residuals X(Y), X(Z), Y(X), and Y(Z)]
rXY/Z = (rXY − rXZ rYZ) / √[(1 − rXZ²)(1 − rYZ²)]
Semipartial correlation
r(X|Y)Z = (rXY − rXZ rYZ) / √(1 − rYZ²)
A semipartial correlation correlates a variable with one residual only.
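Both formulas can be checked against their residual definitions. The sketch below uses simulated data in which X and Y share a made-up confounder Z (an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
Z = rng.normal(size=n)
X = 0.7 * Z + rng.normal(size=n)   # X and Y share the confounder Z
Y = 0.7 * Z + rng.normal(size=n)

def r(a, b):
    return np.corrcoef(a, b)[0, 1]

rxy, rxz, ryz = r(X, Y), r(X, Z), r(Y, Z)

# Partial correlation: Z removed from both X and Y
r_xy_z = (rxy - rxz * ryz) / np.sqrt((1 - rxz**2) * (1 - ryz**2))

# Check against the residual definition: correlate the residuals of
# X regressed on Z with the residuals of Y regressed on Z
res_x = X - np.polyval(np.polyfit(Z, X, 1), Z)
res_y = Y - np.polyval(np.polyfit(Z, Y, 1), Z)
assert np.isclose(r_xy_z, r(res_x, res_y))

# Semipartial correlation: Z removed from Y only
r_semi = (rxy - rxz * ryz) / np.sqrt(1 - ryz**2)
assert np.isclose(r_semi, r(X, res_y))
```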
[Figure: residual plots; regression of X on Z (y = 1.02Z + 0.41) and regression of Y on Z (y = 1.70Z + 0.60)]
The partial correlation rXY/Z is the correlation of the residuals of X and Y.
Path analysis and linear structure models
[Diagrams: multiple regression, with Y predicted from X1 to X4 and a single error term e; path analysis, with an error term e attached to each of X1 to X4 and Y]
Path analysis tries to do something that is logically impossible: to derive causal relationships from sets of observations.
Path analysis defines a whole model and tries to separate correlations into direct and indirect effects
Y = a0 + a1X1 + a2X2 + a3X3 + a4X4 + e
The error term e contains the part of the variance in Y that is not explained by the model. These errors are called residuals.
Regression analysis does not study the relationships between the predictor variables.
[Diagram: path model with variables W, X, Y, and Z, path coefficients pXW, pZX, pZY, and pXY, and an error term e on each variable]
Path analysis is largely based on the computation of partial coefficients of correlation.
Path coefficients
Path analysis is a model confirmatory tool. It should not be used to generate models or even to search for models that fit the data set.
We start from the regression functions:

W = pXW X + e
X = pXY Y + e
Z = pZX X + pZY Y + e

or, equivalently, pXW X − W + e = 0, pXY Y − X + e = 0, and pZX X + pZY Y − Z + e = 0.

From Z-transformed values we get:
ZW = pXW ZX + e
ZX = pXY ZY + e
ZZ = pZX ZX + pZY ZY + e

We multiply each equation by the Z-values of the remaining variables and sum over all cases, using Σ e ZY = 0, (1/(n−1)) Σ ZY ZY = 1, and (1/(n−1)) Σ ZX ZY = rXY. Dividing by n − 1 turns the sums of cross products into correlations:

rWY = pXW rXY
rXW = pXY rYW
rZW = pZX rXW + pZY rYW
rXY = pXY
rZY = pZX rXY + pZY
Path analysis is a nice tool to generate hypotheses. It fails at low coefficients of correlation and circular model structures.
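The derived correlation equation rZY = pZX rXY + pZY can be verified numerically. The sketch below simulates a hypothetical path model (the coefficients 0.6, 0.5, and 0.3 are assumptions for illustration) and estimates the path coefficients as standardized partial regression coefficients:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 10_000

# Hypothetical path model in standardized form:
# X depends on Y; Z depends on X and Y
Y = rng.normal(size=n)
X = 0.6 * Y + rng.normal(scale=np.sqrt(1 - 0.6**2), size=n)
Z = 0.5 * X + 0.3 * Y + rng.normal(size=n)

def r(a, b):
    return np.corrcoef(a, b)[0, 1]

def zscore(a):
    return (a - a.mean()) / a.std(ddof=1)

# Path coefficients = standardized partial regression coefficients
D = np.column_stack([zscore(X), zscore(Y)])
p_zx, p_zy = np.linalg.lstsq(D, zscore(Z), rcond=None)[0]
p_xy = r(X, Y)   # with a single predictor, p_XY = r_XY

# The derived correlation equation holds exactly in the sample
assert np.isclose(r(Z, Y), p_zx * r(X, Y) + p_zy)
```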
Target symptom
[Worked example: a binary design matrix X with columns A to E and a column of expected values; from it the product X'X and its inverse (X'X)⁻¹ are computed.]

X'X    A   B   C   D   E
A      8   5   1   2   4
B      5  11   6   6   9
C      1   6  10   8  10
D      2   6   8  11  11
E      4   9  10  11  15
A special regression model that is used in pharmacology
Y = b0 / (1 + (b1 / X)^b2)
b0 is the maximum response at dose saturation. b1 is the concentration that produces a half-maximum response. b2 determines the slope of the function; that is, it is a measure of how fast the response increases with increasing drug dose.
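A minimal sketch of this dose-response function, assuming made-up parameter values, showing that the response is half-maximal at X = b1 and approaches b0 at saturation:

```python
import numpy as np

def response(X, b0, b1, b2):
    """Dose-response curve: Y = b0 / (1 + (b1/X)**b2)."""
    return b0 / (1.0 + (b1 / X) ** b2)

b0, b1, b2 = 10.0, 2.0, 1.5   # hypothetical parameter values

# At X = b1 the response is exactly half-maximal
assert np.isclose(response(b1, b0, b1, b2), b0 / 2)

# At high doses the response approaches the saturation level b0
assert response(100.0, b0, b1, b2) > 0.99 * b0
```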