Regression and Correlation Analysis
Violeta Bartolome
Senior Associate Scientist
PBGB-CRIL
[email protected]
Correlation Analysis
• A measure of association between two numerical variables.
• Example (positive correlation)
o As soil fertility increases, rice grain yield also increases.
IRRI-PBGB-CRIL 2
Example
For seven randomly selected plots, nitrogen content in the soil and the grain yield were recorded.
Nitrogen Content (%)   Grain Yield (kg/ha)
0.12                   1652
0.14                   2056
0.15                   2598
0.16                   2734
0.19                   3238
0.22                   4824
0.23                   4858
How would you describe the graph?
[Figure: Grain Yield of Rice at Different Levels of Soil Nitrogen Content. Plot of Grain Yield (kg/ha) against Nitrogen Content (%).]
How “strong” is the linear relationship?
Measuring the Relationship
Pearson’s Sample Correlation Coefficient, r
measures the direction and the strength of the linear association between two numerical paired variables.
Direction of Association
[Figure panels: Positive Correlation; Negative Correlation]
Strength of Linear Association
r value   Interpretation
 1        perfect positive linear relationship
 0        no linear relationship
-1        perfect negative linear relationship
Strength of Linear Association
[Figure panels: Perfect Linear Positive Correlation; No Linear Correlation]
Other Strengths of Association
r value   Interpretation
0.9       strong association
0.5       moderate association
0.25      weak association
Other Strengths of Association
[Figure panels: Strong Positive Linear Correlation; Moderate Negative Linear Correlation]
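Formula
$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)$$
where
$\sum$ = the sum
$n$ = number of paired items
$x_i$ = input variable
$y_i$ = output variable
$\bar{x}$ = x-bar = mean of the x's
$\bar{y}$ = y-bar = mean of the y's
$s_x$ = standard deviation of the x's
$s_y$ = standard deviation of the y's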
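As a rough illustration, Pearson's r can be computed directly from its definition for the seven-plot nitrogen example. The sketch below uses only the Python standard library; the function names `sample_sd` and `pearson_r` are illustrative, not part of any package used in the slides. It should give a value close to the 0.99 reported later in the Pearson correlation output.

```python
from math import sqrt

# Data from the seven-plot example (nitrogen content in %, grain yield in kg/ha).
nitrogen = [0.12, 0.14, 0.15, 0.16, 0.19, 0.22, 0.23]
grain_yield = [1652, 2056, 2598, 2734, 3238, 4824, 4858]


def sample_sd(values):
    """Sample standard deviation (divisor n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))


def pearson_r(x, y):
    """Pearson's r: sum of products of standardized deviations, divided by n - 1."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_x, s_y = sample_sd(x), sample_sd(y)
    total = sum(((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
                for xi, yi in zip(x, y))
    return total / (n - 1)


r = pearson_r(nitrogen, grain_yield)
print(f"r = {r:.3f}")
```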
Correlation Coefficient (r)
r = 0 does not necessarily mean there is no relationship. The relationship may be nonlinear.
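A small made-up example may help here: if y depends exactly on x through y = x², and the x values are symmetric around zero, the linear correlation is zero even though the relationship is perfect. The data and the helper name `pearson_r` below are hypothetical and chosen only for illustration.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's r from sums of squares and cross-products."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return s_xy / sqrt(s_xx * s_yy)


# Hypothetical data: y is an exact (but nonlinear) function of x.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [xi ** 2 for xi in x]

r_quadratic = pearson_r(x, y)
print(f"r = {r_quadratic:.3f}")
```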
Correlation Coefficient (r)
A significant r does not necessarily mean a strong linear relationship
Correlation Coefficient
[Figure: plot of Yield/plot against Tiller/plant; r = .25**, n = 234]
When the no. of observations is large, a low r-value may still be significant.
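To see why sample size matters, the sketch below approximates the two-sided p-value for r using Fisher's z transformation. This is a large-sample approximation swapped in for convenience; it is not the tabular test used in these slides. The function name `approx_p_value` and the comparison case n = 20 are assumptions made for illustration.

```python
from math import atanh, sqrt
from statistics import NormalDist


def approx_p_value(r, n):
    """Approximate two-sided p-value for H0: no linear correlation.

    Uses Fisher's z transformation: atanh(r) * sqrt(n - 3) is treated
    as approximately standard normal under H0.
    """
    z = atanh(r) * sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Same r, very different sample sizes.
p_large_n = approx_p_value(0.25, 234)  # the case shown in the figure
p_small_n = approx_p_value(0.25, 20)   # hypothetical small sample

print(f"n = 234: p ~ {p_large_n:.4f}")
print(f"n = 20:  p ~ {p_small_n:.4f}")
```

With the same r = 0.25, the large sample gives a very small p-value while the small sample does not reach the 5% level, which is the point of the slide.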
Correlation Coefficient (r)
To conclude that two variables have a strong linear relationship, r should be both high and significant.
Correlation Coefficient
[Figure: plot of Yield (t/ha) against No. of spikelet/panicle; r = .90**, n = 60]
Test of significance for r
r is significant if its absolute value is greater than the tabular value.
Degrees of Freedom   Probability, p
                     0.05    0.01    0.001
 1                   0.997   1.000   1.000
 2                   0.950   0.990   0.999
 3                   0.878   0.959   0.991
 4                   0.811   0.917   0.974
 5                   0.755   0.875   0.951
 6                   0.707   0.834   0.925
 7                   0.666   0.798   0.898
 8                   0.632   0.765   0.872
 9                   0.602   0.735   0.847
10                   0.576   0.708   0.823
11                   0.553   0.684   0.801
12                   0.532   0.661   0.780
13                   0.514   0.641   0.760
14                   0.497   0.623   0.742
15                   0.482   0.606   0.725
16                   0.468   0.590   0.708
17                   0.456   0.575   0.693
18                   0.444   0.561   0.679
19                   0.433   0.549   0.665
20                   0.423   0.537   0.652
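The tabular values can be related to Student's t distribution. The sketch below assumes the usual convention that the degrees of freedom are n - 2 for n paired observations, and that the critical r equals t / sqrt(t² + df), where t is the two-sided critical t value. The three t values are hardcoded from a standard t table, and the names `T_CRIT_05`, `critical_r` and `is_significant` are illustrative.

```python
from math import sqrt

# Two-sided 5% critical values of Student's t for a few degrees of freedom
# (standard table values, hardcoded here as an assumption).
T_CRIT_05 = {5: 2.571, 10: 2.228, 20: 2.086}


def critical_r(t_crit, df):
    """Critical value of r corresponding to a critical t with df degrees of freedom."""
    return t_crit / sqrt(t_crit ** 2 + df)


def is_significant(r, n, t_table=T_CRIT_05):
    """Compare |r| with the 5% critical value, assuming df = n - 2."""
    df = n - 2
    return abs(r) > critical_r(t_table[df], df)


for df, t_crit in T_CRIT_05.items():
    print(f"df = {df:2d}: critical r ~ {critical_r(t_crit, df):.3f}")

# Nitrogen example: n = 7 plots, r of about 0.99.
print(is_significant(0.99, 7))
```

For the nitrogen example (n = 7, so 5 degrees of freedom), r of about 0.99 exceeds the tabular values at all three probability levels.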
CORRELATION ANALYSIS
PEARSON CORRELATION ANALYSIS
                            Nitrogen.Content   Grain.Yield
Nitrogen.Content   Coef     1                  0.99
                   P-value  1                  1e-04
Grain.Yield        Coef     0.99               1
                   P-value  1e-04              1
Regression Analysis
Scientific Question
What is the growth rate of a rice plant?
Growth rate can be defined as the change in height per unit of time.
Data Collection
DAS   Height (cm)
 0      0
10     12
30     55
60     80
90    110
Statistical Questions
• What is the relationship between age and height? Linear
• How do I describe or quantify the relationship? Regression
• Is the association significant? Statistical Test
[Figure: plot of Plant Height (cm) against Days after Seeding]
Linear Regression
• A general method for estimating or describing the association between a continuous outcome variable (dependent) and one or multiple predictors in one equation.
o One predictor: Simple linear regression
o Multiple predictors: Multiple linear regression
Statistical Model
Data = Model Fit + Residual
$Y_i = \hat{Y}_i + \varepsilon_i$
$\hat{Y}_i = \beta_0 + \beta_1 X_i$
where $\beta_0$ is the intercept and $\beta_1$ is the slope.
$Y_i = \mu + \alpha_i + \varepsilon_i$
[Figure: plot of Y against X]
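Least Squares Estimates
$Y_i = \hat{Y}_i + \varepsilon_i$, with $\hat{Y}_i = \beta_0 + \beta_1 X_i$
To estimate the intercept and slope, minimize the residual sum of squares (RSS):
$$RSS = \sum \varepsilon_i^2 = \sum (Y_i - \hat{Y}_i)^2 = \sum (Y_i - \beta_0 - \beta_1 X_i)^2$$
$$\frac{\partial RSS}{\partial \beta_0} = \frac{\partial \sum (Y_i - \beta_0 - \beta_1 X_i)^2}{\partial \beta_0} = -2 \sum (Y_i - \beta_0 - \beta_1 X_i) = 0 \implies \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$$
$$\frac{\partial RSS}{\partial \beta_1} = \frac{\partial \sum (Y_i - \bar{Y} + \beta_1 \bar{X} - \beta_1 X_i)^2}{\partial \beta_1} = -2 \sum (X_i - \bar{X})(Y_i - \bar{Y} + \beta_1 \bar{X} - \beta_1 X_i) = 0 \implies \hat{\beta}_1 = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}$$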
We don’t have to do the estimation by hand. R/CropStat or other statistical packages can do the work for us.
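For readers who want to see the arithmetic behind the software output, here is a minimal sketch of the closed-form least squares estimates applied to the growth rate data (DAS and height). It is written in plain Python rather than R or CropStat, and the function name `least_squares` is illustrative.

```python
# Growth rate data: days after seeding (DAS) and plant height (cm).
das = [0, 10, 30, 60, 90]
height = [0, 12, 55, 80, 110]


def least_squares(x, y):
    """Return (intercept, slope) minimizing the residual sum of squares."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx                 # beta1-hat
    intercept = y_bar - slope * x_bar   # beta0-hat
    return intercept, slope


intercept, slope = least_squares(das, height)
print(f"Height = {intercept:.6f} + {slope:.6f} * DAS")
```

The estimates should agree with the intercept and DAS coefficient in the Parameter Estimates output.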
LINEAR REGRESSION ANALYSIS
Dependent Variable: Height

Analysis of Variance
SV          Df   Sum Square    Mean Square   F value     Pr (>F)
DAS         1    8201.389781   8201.389781   95.435198   0.002279
Residuals   3    257.810219    85.93674

Model Summary
R Squared        0.969523
Adj. R Squared   0.959364

Parameter Estimates
Parameter     Estimate   Std. Error   t value    Pr (> |t|)
(Intercept)   4.912409   6.311259     0.778356   0.493109
DAS           1.223358   0.125227     9.769094   0.002279
Example: Growth Rate Data
Parameter Estimates
Parameter     Estimate   Std. Error   t value    Pr (> |t|)
(Intercept)   4.912409   6.311259     0.778356   0.493109
DAS           1.223358   0.125227     9.769094   0.002279
Intercept: The estimated height at age 0 is 4.9 cm.
Slope: The estimated height increase per day after seeding is 1.223 cm.
Height = 4.9 + 1.223 DAS, r = 0.98
[Figure: plot of Plant Height (cm) against Days after Seeding]
Prediction
Given the regression line, it can be predicted that the height at 40 days after seeding will be 53.8 cm.
Height = 4.9 + 1.223 DAS, r = 0.98
[Figure: plot of Plant Height (cm) against Days after Seeding]
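The prediction can be reproduced by plugging DAS = 40 into the fitted line. The sketch below takes the coefficients from the Parameter Estimates output; the function names and the range check (0 to 90 DAS, the span of the observed data) are illustrative additions.

```python
# Coefficients taken from the Parameter Estimates output.
INTERCEPT = 4.912409
SLOPE = 1.223358

# Range of DAS values observed in the growth rate data.
DAS_MIN, DAS_MAX = 0, 90


def predict_height(das):
    """Predicted plant height (cm) at a given number of days after seeding."""
    return INTERCEPT + SLOPE * das


def within_observed_range(das):
    """True if the prediction is an interpolation rather than an extrapolation."""
    return DAS_MIN <= das <= DAS_MAX


height_at_40 = predict_height(40)
print(f"Predicted height at 40 DAS: {height_at_40:.1f} cm")
print(f"Within observed range: {within_observed_range(40)}")
```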
Example: Growth Rate Data
Analysis of Variance
SV          Df   Sum Square    Mean Square   F value     Pr (>F)
DAS         1    8201.389781   8201.389781   95.435198   0.002279
Residuals   3    257.810219    85.93674

Model Summary
R Squared        0.969523
Adj. R Squared   0.959364

Sums of Squares
$$\sum (Y_i - \bar{Y})^2 = \sum (\hat{Y}_i - \bar{Y} + Y_i - \hat{Y}_i)^2 = \sum (\hat{Y}_i - \bar{Y})^2 + \sum (Y_i - \hat{Y}_i)^2$$
SST = SSM + SSE
Degrees of freedom: n-1 = 1 + (n-2)
$$R^2 = \frac{SSM}{SST} = \frac{\sum (\hat{Y}_i - \bar{Y})^2}{\sum (Y_i - \bar{Y})^2}$$
R2 is the fraction of variation in Y explained by X.
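The decomposition can be checked numerically on the growth rate data. This is a sketch in plain Python with illustrative variable names; it recomputes the fitted line, the three sums of squares, R2, and the F value with 1 and n-2 degrees of freedom.

```python
# Growth rate data: days after seeding (DAS) and plant height (cm).
das = [0, 10, 30, 60, 90]
height = [0, 12, 55, 80, 110]
n = len(das)

# Least squares fit.
x_bar, y_bar = sum(das) / n, sum(height) / n
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(das, height))
s_xx = sum((x - x_bar) ** 2 for x in das)
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar
fitted = [intercept + slope * x for x in das]

# Sums of squares.
sst = sum((y - y_bar) ** 2 for y in height)               # total
ssm = sum((f - y_bar) ** 2 for f in fitted)               # model
sse = sum((y - f) ** 2 for y, f in zip(height, fitted))   # error (residual)

r_squared = ssm / sst
f_value = (ssm / 1) / (sse / (n - 2))

print(f"SST = {sst:.4f}, SSM = {ssm:.4f}, SSE = {sse:.4f}")
print(f"R2 = {r_squared:.6f}, F = {f_value:.4f}")
```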
Linear Regression vs. ANOVA
ANOVA
Dependent: Continuous
Independent: Categorical
Linear regression
Dependent: Continuous
Independent: Continuous
Linear models
ANOVA and regression are both linear models; in that sense they are the same thing!
Misuse of Regression and Correlation Analysis
• Performing regression and correlation on spurious data could give significant results. But this is not a valid indication of a linear relationship.
Misuse of Regression and Correlation Analysis
• Extrapolation of results
o The scope of the data is extended. Examples:
§ If the relationship between the yield of IR8 and stemborer incidence is extended to cover all rice varieties
§ If the relationship between grain yield and protein content from varietal trials is assumed to be applicable to other types of experiments such as fertilizer trials
o The functional relationship is assumed to hold beyond the range of X values tested
Misuse of Regression and Correlation Analysis
[Figure: plot of Grain Yield (kg/ha) against N-rate (kg/ha); y = 23.751x + 4307.2, r = 0.987**]
There is no evidence that a linear relationship still holds above N = 180 kg/ha.
Coefficient of Determination (R2)
• Percentage of the total variation that is explained by the linear function.
For example, an R2 value of 0.64 implies that 64% [(0.64)(100) = 64] of the variation in the variable Y can be explained by the linear function of the variable X.
Problems with R2
• R2 tends to increase as additional variables are added to a regression equation, regardless of their true importance in determining the values of the dependent variable
• Gives no information on the appropriateness of the model
The adjusted R2 (Ra2) compensates for the tendency of R2 to increase as variables are added:
$$R_a^2 = 1 - \frac{n-1}{n-(p+1)} (1 - R^2)$$
where n = no. of observations and p = no. of independent variables.
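As a quick check of the adjusted R2 formula, the sketch below applies it to the growth rate example, where R2 = 0.969523, n = 5 observations and p = 1 independent variable (DAS). The function name `adjusted_r_squared` is illustrative.

```python
def adjusted_r_squared(r_squared, n, p):
    """Adjusted R2 for n observations and p independent variables."""
    return 1 - (n - 1) / (n - (p + 1)) * (1 - r_squared)


# Growth rate example: R2 from the Model Summary, 5 observations, 1 predictor.
adj_r2 = adjusted_r_squared(0.969523, n=5, p=1)
print(f"Adjusted R2 = {adj_r2:.6f}")
```

The result should agree with the Adj. R Squared value in the Model Summary output.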
Problems with R2
Curvilinear data fitted by a straight line with high R2
Segregated data fitted by a straight line with high R2
For detecting these kinds of departures from the regression model, there is no substitute for plotting the data.
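The curvilinear case can be imitated with made-up data. In the sketch below, y = x² for x = 1 to 10 (hypothetical values, not from the slides) is fitted by a straight line. R2 comes out high, yet the residuals are positive at both ends and negative in the middle, a systematic pattern that a plot of the data or of the residuals would reveal immediately.

```python
# Hypothetical curvilinear data: y is exactly x squared.
x = list(range(1, 11))
y = [xi ** 2 for xi in x]
n = len(x)

# Straight-line least squares fit.
x_bar, y_bar = sum(x) / n, sum(y) / n
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

fitted = [intercept + slope * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

sst = sum((yi - y_bar) ** 2 for yi in y)
sse = sum(res ** 2 for res in residuals)
r_squared = 1 - sse / sst

print(f"R2 = {r_squared:.3f}")
print("Residuals:", [round(res, 1) for res in residuals])
```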
Thank you!