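Lecture 12: Correlation and linear regression
$y = ax + b$
$D = \sum_{i=1}^{n} (\Delta y_i)^2 = \sum_{i=1}^{n} [y_i - (a x_i + b)]^2$
$\frac{\partial D}{\partial a} = -2 \sum_{i=1}^{n} x_i (y_i - a x_i - b) = 0$
$\frac{\partial D}{\partial b} = -2 \sum_{i=1}^{n} (y_i - a x_i - b) = 0$
$a = \frac{\sum_{i=1}^{n} x_i y_i - n \bar{x} \bar{y}}{\sum_{i=1}^{n} x_i^2 - n \bar{x}^2}$
$b = \bar{y} - a \bar{x}$
$y = ax + (\bar{y} - a \bar{x}) \;\Rightarrow\; y - \bar{y} = a (x - \bar{x})$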
The least squares method of Carl Friedrich Gauß.
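As a minimal sketch of the least squares solution, the following Python function computes the slope a and the intercept b directly from the sums. The function name `ols_fit` and the five data points are made up for illustration; they are not from the lecture.

```python
def ols_fit(xs, ys):
    """Return slope a and intercept b of the least-squares line y = a*x + b."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # a = (sum x_i*y_i - n*x_mean*y_mean) / (sum x_i^2 - n*x_mean^2)
    numerator = sum(x * y for x, y in zip(xs, ys)) - n * x_mean * y_mean
    denominator = sum(x * x for x in xs) - n * x_mean ** 2
    a = numerator / denominator
    # b = y_mean - a*x_mean, so the line passes through (x_mean, y_mean)
    b = y_mean - a * x_mean
    return a, b


# Hypothetical illustration data, not from the lecture.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]
a, b = ols_fit(xs, ys)
print(f"a = {a:.3f}, b = {b:.3f}")
```

Because b is defined as the mean of y minus a times the mean of x, the fitted line always passes through the point of means.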
[Figure: scatter plot of Y against X (both axes 0 to 20) with the OLRy regression line; the residual Δy and its square Δy² are marked.]
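$a = \frac{\sum_{i=1}^{n} x_i y_i - n \bar{x} \bar{y}}{\sum_{i=1}^{n} x_i^2 - n \bar{x}^2} = \frac{\frac{1}{n} \sum_{i=1}^{n} x_i y_i - \bar{x} \bar{y}}{\frac{1}{n} \sum_{i=1}^{n} x_i^2 - \bar{x}^2} = \frac{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{s_{xy}}{s_x^2}$
Covariance: $s_{xy}$
Variance: $s_x^2$
Correlation coefficient
$r = \frac{s_{xy}}{s_x s_y}$
$r^2 = \frac{s_{xy}^2}{s_x^2 s_y^2}$
Coefficient of determination
$R^2 = \frac{\text{Explained variance}}{\text{Total variance}}$
$a s_x^2 = s_{xy} = r s_x s_y \;\Rightarrow\; r = a \frac{s_x}{s_y}$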
Slope a and coefficient of correlation r are zero if the covariance is zero.
$-1 \le r \le 1$
$0 \le r^2 \le 1$
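The relations between covariance, variance, slope, and correlation coefficient can be checked numerically. The sketch below uses made-up data and hypothetical helper names (`mean`, `covariance`, `variance`); it computes the slope as covariance over variance, the correlation coefficient, and R² as explained variance over total variance.

```python
import math


def mean(values):
    return sum(values) / len(values)


def covariance(xs, ys):
    """Sample covariance s_xy with the n - 1 denominator."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def variance(xs):
    """Sample variance s_x^2, i.e. the covariance of x with itself."""
    return covariance(xs, xs)


# Hypothetical illustration data, not from the lecture.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]

s_xy = covariance(xs, ys)
s_x, s_y = math.sqrt(variance(xs)), math.sqrt(variance(ys))

a = s_xy / variance(xs)        # slope: a = s_xy / s_x^2
b = mean(ys) - a * mean(xs)    # intercept: b = y_mean - a * x_mean
r = s_xy / (s_x * s_y)         # correlation coefficient

# R^2 as explained variance (of the fitted values) over total variance
predicted = [a * x + b for x in xs]
r_squared = variance(predicted) / variance(ys)

print(f"a = {a:.4f}, r = {r:.4f}, a*s_x/s_y = {a * s_x / s_y:.4f}")
print(f"R^2 = {r_squared:.4f}, r^2 = {r * r:.4f}")
```

For a simple linear regression the two printed pairs agree: r equals a·s_x/s_y, and R² equals r².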
[Figure, two panels. Brachypterous species against macropterous species: y = 0.192x + 0.4671, R² = 0.1723. Dimorphic species against macropterous species: y = 0.3875x + 3.7188, R² = 0.4455.]
Relationships between macropterous, dimorphic and brachypterous ground beetles on 17 Mazurian lake islands
Brachypterous against macropterous species: positive correlation; $r = \sqrt{r^2} = 0.41$. The regression is weak. Macropterous species richness explains only 17% of the variance in brachypterous species richness. Some islands have no brachypterous species. We do not really know which variable is the independent one. There is no clear-cut logical connection.
Dimorphic against macropterous species: positive correlation; $r = \sqrt{r^2} = 0.67$. The regression is moderate. Macropterous species richness explains only 45% of the variance in dimorphic species richness. The relationship appears to be non-linear. A log-transformation is indicated (there are no zero counts). We do not really know which variable is the independent one. There is no clear-cut logical connection.
[Figure, two panels. Brachypterous species against isolation: y = -36.203x + 5.5585, R² = 0.2311. Brachypterous species against ln area: y = 0.4894x + 22.094, R² = 0.0037.]
Brachypterous species against isolation: negative correlation; $r = -\sqrt{r^2} = -0.48$. The regression is weak. Island isolation explains only 23% of the variance in brachypterous species richness. There are two apparent outliers. Without them the whole relationship would vanish, i.e. $R^2 \approx 0$. Outliers have to be eliminated from regression analysis. We have a clear hypothesis about the logical relationship: isolation should be the predictor of species richness.
Brachypterous species against ln area: no correlation; $r = \sqrt{r^2} = 0.06$. The regression slope is nearly zero. Area explains less than 1% of the variance in brachypterous species richness. We have a clear hypothesis about the logical relationship: area should be the predictor of species richness.
The species-individuals relationship is obviously non-linear.
Ground beetles on Mazurian lake islands
[Figure: species against individuals. Panel titles: linear function, logarithmic function, power function. Fits shown: y = 6.0987 ln(x) - 8.3513, R² = 0.6003 (logarithmic); y = 6.7337 x^0.2306, R² = 0.67 (power, drawn on linear axes and on log-log axes).]
$S = 6.733 \, I^{0.2308}$
$\ln S = \ln(6.733) + 0.2308 \ln I = 1.907 + 0.2308 \ln I$
Intercept: 1.907. Slope: 0.2308.
The power function has the highest R² and therefore explains most of the variance in species richness. The coefficient of determination is a measure of goodness of fit.
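A power function can be fitted with ordinary linear regression after taking logarithms of both variables: the slope of the ln-ln line is the exponent and the intercept is the logarithm of the factor. The sketch below is only an illustration. It generates noise-free points from the coefficients quoted on the slide (6.733 and 0.2308) rather than using the beetle data, and the function names `ols_fit` and `fit_power` are hypothetical.

```python
import math


def ols_fit(xs, ys):
    """Least-squares slope and intercept of y = slope*x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx


def fit_power(xs, ys):
    """Fit y = c * x**z by linear regression of ln y on ln x."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    z, ln_c = ols_fit(log_x, log_y)   # slope = z, intercept = ln c
    return math.exp(ln_c), z


# Synthetic points generated from the slide's power function (no noise).
individuals = [10, 50, 200, 1000, 4000]
species = [6.733 * i ** 0.2308 for i in individuals]

c, z = fit_power(individuals, species)
print(f"intercept ln c = {math.log(c):.3f}, slope z = {z:.4f}, c = {c:.3f}")
```

With real counts the points scatter around the line, and the method requires that no count is zero, since ln 0 is undefined.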
Having more than one predictor
[Block diagram with the boxes Individuals, Isolation, Area, and Species.]
Describe species richness as a function of the number of individuals, area, and isolation of islands.
We need a clear hypothesis about dependent and independent predictors. Use a block diagram.
Predictors are not independent. The number of individuals depends on area and degree of isolation.
We need linear relationships
We use ln-transformed values of species, area, and individuals. Check for multicollinearity using a correlation matrix. We check for non-linearities using plots.
Of the predictors, area and individuals are highly correlated.
The correlation between area and individuals is highly significant. The probability of H0 is p = 0.004.
In linear regression analysis correlations of predictors below 0.7 are acceptable.
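A correlation matrix check can be sketched as a loop over all pairs of predictors, flagging every pair whose absolute correlation reaches the 0.7 limit. The predictor values and the names `pearson_r` and `collinear_pairs` are assumptions made for this illustration; they are not the lecture's data.

```python
import math
from itertools import combinations


def pearson_r(xs, ys):
    """Pearson coefficient of correlation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def collinear_pairs(predictors, threshold=0.7):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(predictors.items(), 2):
        r = pearson_r(a, b)
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, r))
    return flagged


# Hypothetical predictor values for five islands, not the lecture's data.
predictors = {
    "ln_area": [-2.0, -1.0, 0.0, 1.0, 2.0],
    "ln_individuals": [3.1, 4.0, 5.2, 5.9, 7.1],
    "isolation": [0.05, 0.12, 0.03, 0.12, 0.05],
}

flagged = collinear_pairs(predictors)
for name_a, name_b, r in flagged:
    print(f"{name_a} and {name_b}: r = {r:.3f}")
```

In this made-up example only the pair of ln area and ln individuals is flagged, which mirrors the situation described on the slide.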
Collinearity
The final data for our analysis
The model
$\ln S = a_0 + a_1 \ln \mathrm{Ind} + a_2 \ln \mathrm{Area} + a_3 \, \mathrm{Isolation}$
$Y = Xa \;\Rightarrow\; a = (X^T X)^{-1} X^T Y$
Multiple linear regression
The vector Y contains the response variable.
The matrix X contains the effect (predictor) variables.
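The matrix solution of the multiple regression can be sketched in plain Python by building XᵀX and XᵀY and solving the resulting linear system instead of forming the inverse explicitly. Everything below is an illustration under stated assumptions: the six rows of predictor values are invented, the response is generated without noise from known coefficients so that the result can be checked, and `solve` and `multiple_regression` are hypothetical helper names.

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [v - factor * p for v, p in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def multiple_regression(rows, y):
    """Return [a0, a1, ...] from the normal equations (X^T X) a = X^T Y."""
    X = [[1.0] + list(row) for row in rows]   # leading 1 gives the intercept a0
    k = len(X[0])
    XtX = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    XtY = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(k)]
    return solve(XtX, XtY)


# Hypothetical rows: (ln individuals, ln area, isolation) for six islands.
rows = [
    (3.0, -2.0, 0.05), (4.5, -1.0, 0.12), (5.0, 0.0, 0.03),
    (6.5, 1.0, 0.10), (7.0, 2.0, 0.06), (5.5, 0.5, 0.15),
]
true_coeffs = [1.0, 0.5, 0.2, -3.0]   # assumed a0..a3, for checking only
ln_species = [
    true_coeffs[0] + sum(c * v for c, v in zip(true_coeffs[1:], row))
    for row in rows
]

coeffs = multiple_regression(rows, ln_species)
print([round(c, 6) for c in coeffs])
```

Solving the system rather than inverting XᵀX gives the same coefficients and is the more common numerical choice; with strongly collinear predictors XᵀX becomes nearly singular and the estimates become unstable.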
Observed r = 0.41508801
Mean r = 0.061
Lower CL = -0.538
Upper CL = 0.768
Permutation test for statistical significance
Randomize x or y 1000 times. Calculate r each time. Plot the statistical distribution and calculate the lower and upper confidence limits.
[Figure: frequency distribution (N against r, from -0.8 to 1) of the randomized r values, with lower CL and upper CL marked; γ > 0.]
Calculating confidence limits
Rank all 1000 coefficients of correlation and take the values at rank positions 25 and 975.
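The randomization procedure can be sketched as follows: shuffle one variable 1000 times, recompute r each time, sort the results, and read off the values at rank positions 25 and 975. The data, the fixed random seed, and the function names are assumptions for illustration only.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson coefficient of correlation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def permutation_limits(xs, ys, n_perm=1000, seed=0):
    """Return (lower CL, upper CL, mean r) of r under random shuffling of y."""
    rng = random.Random(seed)          # fixed seed so the run is repeatable
    shuffled = list(ys)
    rs = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        rs.append(pearson_r(xs, shuffled))
    rs.sort()
    lower = rs[n_perm * 25 // 1000 - 1]    # rank position 25 of 1000
    upper = rs[n_perm * 975 // 1000 - 1]   # rank position 975 of 1000
    return lower, upper, sum(rs) / n_perm


# Hypothetical illustration data, not from the lecture.
xs = [2, 5, 7, 9, 12, 14, 15, 18, 21, 25]
ys = [1, 0, 3, 2, 4, 2, 5, 3, 6, 4]

observed = pearson_r(xs, ys)
lower, upper, mean_r = permutation_limits(xs, ys)
print(f"observed r = {observed:.3f}, lower CL = {lower:.3f}, upper CL = {upper:.3f}")
```

One common reading is that an observed r outside the two limits is significant at the 5% level (two-sided).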
[Figure annotations: Σ N_2.5 = 25 in each tail; μ > 0; the observed r is marked.]
The RMA regression has a much steeper slope. This slope is often intuitively better.
The coefficient of correlation is independent of the regression method
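The slides do not spell out the RMA formula. The sketch below assumes the usual definition of the reduced major axis slope, a_RMA = sign(r) · s_y / s_x, which equals the OLRy slope divided by r and is therefore never flatter than the OLRy slope. The data and the function name `regression_slopes` are hypothetical.

```python
import math


def regression_slopes(xs, ys):
    """Return (OLRy slope, RMA slope, r) for the same data."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    r = sxy / math.sqrt(sxx * syy)
    a_olry = sxy / sxx                              # a = s_xy / s_x^2
    a_rma = math.copysign(math.sqrt(syy / sxx), r)  # sign(r) * s_y / s_x
    return a_olry, a_rma, r


# Hypothetical illustration data, not from the lecture.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]

a_olry, a_rma, r = regression_slopes(xs, ys)
print(f"OLRy slope = {a_olry:.3f}, RMA slope = {a_rma:.3f}, r = {r:.3f}")
```

The weaker the correlation, the larger the gap between the two slopes, while r itself is the same whichever line is drawn.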
The 95% confidence limits of the regression slope mark the range within which the regression slope lies with 95% probability. The lower CL is negative, hence a zero slope is within the 95% CL.
[Figure labels: upper CL, lower CL.]
In OLRy regression, insignificance of the slope also means insignificance of r and R².
[Figure: scatter plot of Y against X (both axes 0 to 20) with the OLRy regression line; the residual Δy and its square Δy² are marked.]
Outliers have a disproportionate influence on correlation and regression.
Outliers should be eliminated from regression analysis.
Instead of the Pearson coefficient of correlation, use Spearman's rank order correlation.
[Figure: scatter plot of Y against X, both axes 0 to 7.]
Normal correlation on ranked data
$r_{Pearson} = 0.79$
$r_{Spearman} = 0.77$
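Spearman's coefficient can be computed by replacing each value with its rank and then applying the ordinary Pearson formula to the ranks. The sketch below assumes that tied values share their average rank; the data, including the single outlier, are invented for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson coefficient of correlation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman_r(xs, ys):
    """Spearman's rank order correlation: Pearson's r on the ranks."""
    return pearson_r(ranks(xs), ranks(ys))


# Hypothetical data with one outlier in y, not from the lecture.
xs = [1, 2, 3, 4, 5, 6, 7]
ys = [1.2, 2.1, 2.9, 4.2, 4.8, 6.1, 20.0]

pearson = pearson_r(xs, ys)
spearman = spearman_r(xs, ys)
print(f"Pearson r = {pearson:.3f}, Spearman r = {spearman:.3f}")
```

Because ranks ignore how far the outlier lies from the other points, the rank correlation depends only on the ordering of the values.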
Homework and literature
Refresh:
• Coefficient of correlation
• Pearson correlation
• Spearman correlation
• Linear regression
• Non-linear regression
• Model I and model II regression
• RMA regression
Prepare for the next lecture:
• F-test
• F-distribution
• Variance
Literature:
Łomnicki: Statystyka dla biologów
http://statsoft.com/textbook/