Regression and Correlation Analysis Violeta Bartolome Senior Associate Scientist PBGB-CRIL [email protected]

Powerpoint - Regression and Correlation Analysis

Nov 22, 2014


Vivay Salazar
Transcript
Page 1: Powerpoint - Regression and Correlation Analysis

Regression and Correlation Analysis

Violeta Bartolome
Senior Associate Scientist

[email protected]

Page 2: Powerpoint - Regression and Correlation Analysis

Correlation Analysis

• A measure of association between two numerical variables.

• Example (positive correlation)
  o As soil fertility increases, rice grain yield also increases

IRRI-PBGB-CRIL 2

Typically, in the summer as the temperature increases people are thirstier. Consider the two numerical variables, temperature and water consumption. We would expect the higher the temperature, the more water a given person would consume. Thus we would say that in the summer time, temperature and water consumption are positively correlated.
Page 3: Powerpoint - Regression and Correlation Analysis

Example

For seven randomly selected plots, nitrogen content in the soil and the grain yield were recorded.

Nitrogen Content (%)   Grain Yield (kg/ha)
0.12                   1652
0.14                   2056
0.15                   2598
0.16                   2734
0.19                   3238
0.22                   4824
0.23                   4858

(The data are shown in the table with nitrogen content in increasing order.)
Page 4: Powerpoint - Regression and Correlation Analysis

How would you describe the graph?

[Figure: scatter plot, "Grain Yield of Rice at Different Levels of Soil Nitrogen Content"; x-axis: Nitrogen Content (%), y-axis: Grain Yield (kg/ha)]

How “strong” is the linear relationship?

This graph helps us visualize what appears to be a roughly linear relationship between soil nitrogen content and grain yield.
Page 5: Powerpoint - Regression and Correlation Analysis

Measuring the Relationship

Pearson’s Sample Correlation Coefficient, r, measures the direction and the strength of the linear association between two paired numerical variables.

Page 6: Powerpoint - Regression and Correlation Analysis

Direction of Association

[Figure panels: Positive Correlation; Negative Correlation]

Direction of the Association: The association can be either positive or negative.
Positive Correlation: as the x variable increases, so does the y variable. Example: In the summer, as the temperature increases, so does thirst.
Negative Correlation: as the x variable increases, the y variable decreases. Example: As the price of an item increases, the number of items sold decreases.
Page 7: Powerpoint - Regression and Correlation Analysis

Strength of Linear Association

r value   Interpretation
 1        perfect positive linear relationship
 0        no linear relationship
-1        perfect negative linear relationship

Strength of the Association: The strength of the linear association is measured by the sample correlation coefficient, r. r can be any value from –1 to +1. The closer r is to one in magnitude, the stronger the linear association. If r equals zero, then there is no linear association between the two variables.
Page 8: Powerpoint - Regression and Correlation Analysis

Strength of Linear Association

[Figure panels: Perfect Linear Positive Correlation; No Linear Correlation]

Page 9: Powerpoint - Regression and Correlation Analysis

Other Strengths of Association

r value   Interpretation
0.9       strong association
0.5       moderate association
0.25      weak association

* No other values of r have precise definitions of strength. See the chart below.
Note: All of the values in the second table are positive. Thus the associations are positive. The same strength interpretations hold for negative values of r; only the direction of the association would change.
Page 10: Powerpoint - Regression and Correlation Analysis

Other Strengths of Association

[Figure panels: Strong Positive Linear Correlation; Moderate Negative Linear Correlation]

Page 11: Powerpoint - Regression and Correlation Analysis

Formula

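$$r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right)$$

where
Σ = the sum
n = number of paired items
x_i = input variable
y_i = output variable
x̄ = x-bar = mean of the x’s
ȳ = y-bar = mean of the y’s
s_x = standard deviation of the x’s
s_y = standard deviation of the y’s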
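As a minimal sketch, the coefficient can be computed directly from the quantities defined on this slide (means, sample standard deviations, and n). The data are the seven plots from the nitrogen example; the function names `mean`, `sample_sd`, and `pearson_r` are illustrative, not part of the original slides.

```python
import math

# Data from the seven plots in the nitrogen example.
nitrogen = [0.12, 0.14, 0.15, 0.16, 0.19, 0.22, 0.23]      # nitrogen content (%)
grain_yield = [1652, 2056, 2598, 2734, 3238, 4824, 4858]   # grain yield (kg/ha)


def mean(values):
    """Arithmetic mean."""
    return sum(values) / len(values)


def sample_sd(values):
    """Sample standard deviation (divides by n - 1)."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def pearson_r(x, y):
    """Sum of products of standardized deviations, divided by n - 1."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    s_x, s_y = sample_sd(x), sample_sd(y)
    total = sum(((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
                for xi, yi in zip(x, y))
    return total / (n - 1)


r = pearson_r(nitrogen, grain_yield)
print(f"r = {r:.3f}")
```

The result is close to 1, which agrees with the strong positive pattern seen in the scatter plot of these data.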

Page 12: Powerpoint - Regression and Correlation Analysis

Correlation Coefficient (r)

r = 0 does not necessarily mean no relationship. The relationship may be nonlinear.
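A small illustration of this point, using made-up data that are not from the slides: y is determined exactly by x (y = x²), yet the linear correlation is zero because the increasing and decreasing halves cancel.

```python
import math


def pearson_r(x, y):
    """Pearson correlation from sums of squares and cross-products."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: a perfect but nonlinear (quadratic) relationship.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [xi ** 2 for xi in x]

r = pearson_r(x, y)
print(f"r = {r:.3f}")
```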

Page 13: Powerpoint - Regression and Correlation Analysis

Correlation Coefficient


Page 14: Powerpoint - Regression and Correlation Analysis

Correlation Coefficient (r)

A significant r does not necessarily mean a strong linear relationship


Page 15: Powerpoint - Regression and Correlation Analysis

Correlation Coefficient

r = 0.25**, n = 234

[Figure: scatter plot; x-axis: Tiller/plant, y-axis: Yield/plot]

When the number of observations is large, a low r-value may still be significant.
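One way to see why sample size matters is the usual t statistic for a correlation coefficient, t = r·sqrt(n − 2)/sqrt(1 − r²) with n − 2 degrees of freedom. This test is not shown on the slide, so treat the sketch below as an assumption about how the significance could be checked. The second call uses a hypothetical small sample (n = 12) with the same r for contrast.

```python
import math


def t_statistic(r, n):
    """t statistic for testing that the population correlation is zero."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


# Values from the slide: r = 0.25 with n = 234 observations.
t_large = t_statistic(0.25, 234)

# Hypothetical contrast: the same r from only 12 observations.
t_small = t_statistic(0.25, 12)

print(f"n = 234: t = {t_large:.2f}")
print(f"n = 12:  t = {t_small:.2f}")
```

For large samples the two-sided cutoffs are roughly 1.96 (5%) and 2.58 (1%) under the normal approximation. The n = 234 case exceeds both, while the n = 12 case falls well short even of the 5% cutoff.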

Page 16: Powerpoint - Regression and Correlation Analysis

Correlation Coefficient (r)

To conclude that two variables have a strong linear relationship, r should be both high and significant.

Page 17: Powerpoint - Regression and Correlation Analysis

Correlation Coefficient

r = 0.90**, n = 60

[Figure: scatter plot; x-axis: No. of spikelet/panicle, y-axis: Yield (t/ha)]

Page 18: Powerpoint - Regression and Correlation Analysis

Test of significance for r

Degrees of     Probability, p
Freedom        0.05     0.01     0.001
 1             0.997    1.000    1.000
 2             0.950    0.990    0.999
 3             0.878    0.959    0.991
 4             0.811    0.917    0.974
 5             0.755    0.875    0.951
 6             0.707    0.834    0.925
 7             0.666    0.798    0.898
 8             0.632    0.765    0.872
 9             0.602    0.735    0.847
10             0.576    0.708    0.823
11             0.553    0.684    0.801
12             0.532    0.661    0.780
13             0.514    0.641    0.760
14             0.497    0.623    0.742
15             0.482    0.606    0.725
16             0.468    0.590    0.708
17             0.456    0.575    0.693
18             0.444    0.561    0.679
19             0.433    0.549    0.665
20             0.423    0.537    0.652

r is significant if its absolute value is greater than the tabular value. The degrees of freedom are n − 2, where n is the number of paired observations.
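The table lookup can be sketched as follows, assuming the degrees of freedom are n − 2. Only three rows of the table are copied here, and the helper name `smallest_significant_level` is illustrative. The first call uses the nitrogen example (r = 0.99 as reported in the correlation output, n = 7); the other two use hypothetical values.

```python
# A few rows of the critical-value table: df -> values for p = 0.05, 0.01, 0.001.
CRITICAL_R = {
    5: (0.755, 0.875, 0.951),
    10: (0.576, 0.708, 0.823),
    15: (0.482, 0.606, 0.725),
}
LEVELS = (0.05, 0.01, 0.001)


def smallest_significant_level(r, n):
    """Return the smallest tabulated p at which |r| exceeds the critical value.

    Returns None when r is not significant even at p = 0.05.
    """
    df = n - 2
    best = None
    for level, critical in zip(LEVELS, CRITICAL_R[df]):
        if abs(r) > critical:
            best = level
    return best


nitrogen_level = smallest_significant_level(0.99, 7)    # nitrogen example
moderate_level = smallest_significant_level(0.60, 12)   # hypothetical
weak_level = smallest_significant_level(0.40, 17)       # hypothetical

print(nitrogen_level, moderate_level, weak_level)
```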

Page 19: Powerpoint - Regression and Correlation Analysis

CORRELATION ANALYSIS

PEARSON CORRELATION ANALYSIS

                            Nitrogen.Content   Grain.Yield
Nitrogen.Content  Coef      1                  0.99
                  P-value   1                  1e-04
Grain.Yield       Coef      0.99               1
                  P-value   1e-04              1

Page 20: Powerpoint - Regression and Correlation Analysis

Regression Analysis


Page 21: Powerpoint - Regression and Correlation Analysis

Scientific Question

What is the growth rate of a rice plant?

Growth rate can be defined as the change in height per unit of time.

Page 22: Powerpoint - Regression and Correlation Analysis

Data Collection

DAS (days after seeding)   Height (cm)
 0                           0
10                          12
30                          55
60                          80
90                         110

Page 23: Powerpoint - Regression and Correlation Analysis

Statistical Questions

• What is the relationship between age and height? Linear
• How do I describe or quantify the relationship? Regression
• Is the association significant? Statistical Test

[Figure: scatter plot; x-axis: Days after Seeding, y-axis: Plant Height (cm)]

Page 24: Powerpoint - Regression and Correlation Analysis

Linear Regression

• A general method for estimating or describing the association between a continuous outcome (dependent) variable and one or multiple predictors in one equation.
  o One predictor: Simple linear regression
  o Multiple predictors: Multiple linear regression

Page 25: Powerpoint - Regression and Correlation Analysis

Statistical Model

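Data = Model Fit + Residual

$$Y = \hat{Y} + \varepsilon$$

$$Y_i = \hat{Y}_i + \varepsilon_i$$

$$\hat{Y}_i = \beta_0 + \beta_1 X_i$$

where β0 is the intercept and β1 is the slope.

[Figure: plot of Y against X]

$$Y_i = \mu + \alpha_i + \varepsilon_i$$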

Page 26: Powerpoint - Regression and Correlation Analysis

Least Squares Estimates

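$$Y_i = \hat{Y}_i + \varepsilon_i, \qquad \hat{Y}_i = \beta_0 + \beta_1 X_i$$

To estimate the intercept and slope, minimize the residual sum of squares (RSS):

$$RSS = \sum \varepsilon_i^2 = \sum (Y_i - \hat{Y}_i)^2 = \sum (Y_i - \beta_0 - \beta_1 X_i)^2$$

$$\frac{\partial RSS}{\partial \beta_0} = \frac{\partial \sum (Y_i - \beta_0 - \beta_1 X_i)^2}{\partial \beta_0} = -2 \sum (Y_i - \beta_0 - \beta_1 X_i) = 0 \;\Longrightarrow\; \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$$

$$\frac{\partial RSS}{\partial \beta_1} = \frac{\partial \sum (Y_i - \bar{Y} + \beta_1 \bar{X} - \beta_1 X_i)^2}{\partial \beta_1} = -2 \sum (X_i - \bar{X})(Y_i - \bar{Y} + \beta_1 \bar{X} - \beta_1 X_i) = 0 \;\Longrightarrow\; \hat{\beta}_1 = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2}$$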

We don’t have to do the estimation by hand. R/CropStat or other statistical packages can do the work for us.
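For readers who want to see the closed-form estimates in action, here is a minimal sketch that applies them to the growth data from the Data Collection slide. It uses plain Python rather than R or CropStat, and the function name `least_squares` is illustrative.

```python
# Growth data from the Data Collection slide.
das = [0, 10, 30, 60, 90]          # days after seeding
height = [0, 12, 55, 80, 110]      # plant height (cm)


def least_squares(x, y):
    """Closed-form least squares estimates for a straight line.

    slope     = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
    intercept = y_bar - slope * x_bar
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


intercept, slope = least_squares(das, height)
print(f"intercept = {intercept:.6f}, slope = {slope:.6f}")
```

The estimates should agree with the parameter estimates reported by the statistical package for the same data.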

Page 27: Powerpoint - Regression and Correlation Analysis

LINEAR REGRESSION ANALYSIS
Dependent Variable: Height

Analysis of Variance
SV          Df   Sum Square     Mean Square    F value     Pr (>F)
DAS         1    8201.389781    8201.389781    95.435198   0.002279
Residuals   3    257.810219     85.93674

Model Summary
R Squared        0.969523
Adj. R Squared   0.959364

Parameter Estimates
Parameter     Estimate   Std. Error   t value    Pr (> |t|)
(Intercept)   4.912409   6.311259     0.778356   0.493109
DAS           1.223358   0.125227     9.769094   0.002279

Page 28: Powerpoint - Regression and Correlation Analysis

Example: Growth Rate Data

Parameter Estimates
Parameter     Estimate   Std. Error   t value    Pr (> |t|)
(Intercept)   4.912409   6.311259     0.778356   0.493109
DAS           1.223358   0.125227     9.769094   0.002279

Intercept: The predicted height at 0 days after seeding is 4.9 cm.
Slope: The height increases by 1.223 cm per day after seeding.

Height = 4.9 + 1.223 DAS, r = 0.98

[Figure: scatter plot with fitted regression line; x-axis: Days after Seeding, y-axis: Plant Height (cm)]

Page 29: Powerpoint - Regression and Correlation Analysis

Prediction

Given the regression line, it can be predicted that the height at 40 days after seeding will be 53.8 cm.

Height = 4.9 + 1.223 DAS, r = 0.98

[Figure: scatter plot with fitted regression line; x-axis: Days after Seeding, y-axis: Plant Height (cm)]
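The arithmetic behind this prediction can be written out using the unrounded estimates from the Parameter Estimates output (intercept 4.912409, slope 1.223358):

```latex
\widehat{\text{Height}}(40)
  = 4.912409 + 1.223358 \times 40
  = 4.912409 + 48.934320
  = 53.846729 \approx 53.8 \text{ cm}
```

Using the rounded equation, 4.9 + 1.223 × 40 = 53.82, gives the same value to one decimal place.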

Page 30: Powerpoint - Regression and Correlation Analysis

Example: Growth Rate Data

Analysis of Variance
SV          Df   Sum Square     Mean Square    F value     Pr (>F)
DAS         1    8201.389781    8201.389781    95.435198   0.002279
Residuals   3    257.810219     85.93674

Model Summary
R Squared        0.969523
Adj. R Squared   0.959364

Sums of Squares

$$\sum (Y_i - \bar{Y})^2 = \sum (\hat{Y}_i - \bar{Y} + Y_i - \hat{Y}_i)^2 = \sum (\hat{Y}_i - \bar{Y})^2 + \sum (Y_i - \hat{Y}_i)^2$$

                     SST    =   SSM   +   SSE
Degrees of freedom   n-1        1         n-2

$$R^2 = \frac{SSM}{SST} = \frac{\sum (\hat{Y}_i - \bar{Y})^2}{\sum (Y_i - \bar{Y})^2}$$

R2 is the fraction of variation in Y explained by X.
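The decomposition can be checked numerically on the growth data. The sketch below refits the line by least squares, then computes SST, SSM, and SSE and the ratio SSM/SST. Variable names are illustrative.

```python
# Growth data from the Data Collection slide.
das = [0, 10, 30, 60, 90]          # days after seeding
height = [0, 12, 55, 80, 110]      # plant height (cm)

n = len(das)
x_bar = sum(das) / n
y_bar = sum(height) / n

# Least squares fit of height on DAS.
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(das, height))
sxx = sum((x - x_bar) ** 2 for x in das)
slope = sxy / sxx
intercept = y_bar - slope * x_bar
fitted = [intercept + slope * x for x in das]

# Sums of squares: total, model, and error (residual).
sst = sum((y - y_bar) ** 2 for y in height)
ssm = sum((f - y_bar) ** 2 for f in fitted)
sse = sum((y - f) ** 2 for y, f in zip(height, fitted))
r_squared = ssm / sst

print(f"SST = {sst:.2f}, SSM = {ssm:.2f}, SSE = {sse:.2f}")
print(f"R^2 = {r_squared:.6f}")
```

SSM and SSE correspond to the DAS and Residuals rows of the Analysis of Variance table, and SSM + SSE reproduces SST up to rounding error.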

Page 31: Powerpoint - Regression and Correlation Analysis

Linear Regression vs. ANOVA

ANOVA
  Dependent: Continuous
  Independent: Categorical

Linear regression
  Dependent: Continuous
  Independent: Continuous

ANOVA and regression are the same thing: both are linear models!

Page 32: Powerpoint - Regression and Correlation Analysis

Misuse of Regression and Correlation Analysis

• Performing regression and correlation on spurious data could give significant results. But this is not a valid indication of a linear relationship.


Page 33: Powerpoint - Regression and Correlation Analysis

Misuse of Regression and Correlation Analysis

• Extrapolation of results
  o Scope of data is extended. Examples:
    § If the relationship between the yield of IR8 and stemborer incidence is extended to cover all rice varieties
    § If the relationship between grain yield and protein content from varietal trials is assumed to be applicable to other types of experiments, such as fertilizer trials
  o Functional relationship is assumed to hold beyond the range of X values tested

Page 34: Powerpoint - Regression and Correlation Analysis

Misuse of Regression and Correlation Analysis

y = 23.751x + 4307.2, r = 0.987**

[Figure: scatter plot with fitted line; x-axis: N-rate (kg/ha), y-axis: Grain Yield (kg/ha)]

There is no evidence that a linear relationship still holds above N = 180 kg/ha.
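One defensive habit is to refuse predictions outside the range of X values that were actually tested. The sketch below wraps the fitted line from this slide in a hypothetical helper, `predict_yield`. The upper limit of 180 kg/ha comes from the slide; the lower limit of 0 kg/ha is read off the plot axis and is an assumption.

```python
# Fitted line from the slide: y = 23.751x + 4307.2
SLOPE = 23.751
INTERCEPT = 4307.2

# Tested range of N-rate (kg/ha). The upper bound is stated on the slide;
# the lower bound of 0 is an assumption based on the plot axis.
N_RATE_MIN = 0.0
N_RATE_MAX = 180.0


def predict_yield(n_rate):
    """Hypothetical helper: predict grain yield (kg/ha) inside the tested range only."""
    if not N_RATE_MIN <= n_rate <= N_RATE_MAX:
        raise ValueError(
            f"N-rate {n_rate} kg/ha is outside the tested range "
            f"[{N_RATE_MIN}, {N_RATE_MAX}]; the fitted line is not supported there"
        )
    return INTERCEPT + SLOPE * n_rate


inside = predict_yield(90)

try:
    predict_yield(240)
    extrapolation_refused = False
except ValueError:
    extrapolation_refused = True

print(f"predicted yield at 90 kg/ha: {inside:.2f}")
print(f"prediction at 240 kg/ha refused: {extrapolation_refused}")
```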

Page 35: Powerpoint - Regression and Correlation Analysis

Coefficient of Determination (R2)

• Percentage of the total variation that is explained by the linear function.


For example, with an R2 value of 0.64, the implication is 64% [(0.64)(100) = 64] of the variation in the variable Y can be explained by the linear function of the variable X.

Page 36: Powerpoint - Regression and Correlation Analysis

Problems with R2

• R2 tends to increase as additional variables are added to a regression equation, regardless of their true importance in determining the values of the dependent variable.

The adjusted R2 (Ra2) compensates for this effect:

$$R_a^2 = 1 - \frac{n-1}{n-(p+1)}\,(1 - R^2)$$

where n = no. of observations and p = no. of independent variables.

• Gives no information on the appropriateness of the model
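As a quick check, the adjusted value can be reproduced for the growth rate example using the standard formula 1 − (n − 1)/(n − (p + 1)) · (1 − R2), with n = 5 observations and p = 1 independent variable. The function name `adjusted_r_squared` is illustrative.

```python
def adjusted_r_squared(r_squared, n, p):
    """Adjusted R^2 for n observations and p independent variables."""
    return 1 - (n - 1) / (n - (p + 1)) * (1 - r_squared)


# Growth rate example: R^2 = 0.969523 from 5 observations and 1 predictor (DAS).
adj = adjusted_r_squared(0.969523, n=5, p=1)
print(f"adjusted R^2 = {adj:.6f}")
```

The result should match the Adj. R Squared value in the Model Summary output of the growth rate example.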

Page 37: Powerpoint - Regression and Correlation Analysis

Problems with R2


Curvilinear data fitted by a straight line with high R2

Segregated data fitted by a straight line with high R2

For detecting these kinds of departures from the regression model, there is no substitute for plotting the data.
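The curvilinear case is easy to reproduce with made-up data (not from the slides): fitting a straight line to y = x² for x = 1 to 10 gives a high R2, but the residuals show a systematic pattern that a plot would reveal immediately.

```python
# Hypothetical curvilinear data: y = x^2.
x = list(range(1, 11))
y = [xi ** 2 for xi in x]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Straight-line least squares fit.
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
slope = sxy / sxx
intercept = y_bar - slope * x_bar

fitted = [intercept + slope * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

sst = sum((yi - y_bar) ** 2 for yi in y)
sse = sum(e ** 2 for e in residuals)
r_squared = 1 - sse / sst

print(f"R^2 = {r_squared:.3f}")
# Residual signs in order of x: positive at both ends, negative in the middle.
print("residual signs:", "".join("+" if e > 0 else "-" for e in residuals))
```

The run of same-signed residuals is the numerical counterpart of the curved pattern one would see in a plot of the data with the fitted line.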

Page 38: Powerpoint - Regression and Correlation Analysis

Thank you!
