Simple Linear Regression and Correlation


Introduction

• Regression refers to the statistical technique of modeling the relationship between variables.

• In simple linear regression, we model the relationship between two variables.

• One of the variables, denoted by Y, is called the dependent variable and the other, denoted by X, is called the independent variable.

• The model we will use to depict the relationship between X and Y will be a straight-line relationship.

• A graphical sketch of the pairs (X, Y) is called a scatter plot.

This scatterplot locates pairs of observations of advertising expenditures on the x-axis and sales on the y-axis. We notice that:

Larger (smaller) values of sales tend to be associated with larger (smaller) values of advertising.

[Figure: Scatterplot of Advertising Expenditures (X) and Sales (Y). Horizontal axis: Advertising, 0 to 50. Vertical axis: Sales, 0 to 140.]

The scatter of points tends to be distributed around a positively sloped straight line.

The pairs of values of advertising expenditures and sales are not located exactly on a straight line.

The scatter plot reveals a more or less strong tendency rather than a precise linear relationship.

The line represents the nature of the relationship on average.

Using Statistics

[Figure: Examples of Other Scatterplots. Several panels, each plotting Y against X.]

Simple Linear Regression Model

The equation that describes how y is related to x and an error term is called the regression model.

The simple linear regression model is:

$y = a + bx + \varepsilon$

where:

a and b are called parameters of the model; a is the intercept and b is the slope.

$\varepsilon$ is a random variable called the error term.

• The relationship between X and Y is a straight-line relationship.

• The errors $\varepsilon_i$ are normally distributed with mean 0 and variance $\sigma^2$. The errors are uncorrelated (not related) in successive observations.

• That is: $\varepsilon \sim N(0, \sigma^2)$
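The assumptions above can be illustrated with a small simulation. This is only a sketch: the function name `simulate` and the parameter values (intercept -0.1, slope 0.7, sigma 0.6) are chosen here for illustration and are not part of the slides. Each error is drawn independently from a normal distribution with mean 0, so successive errors are uncorrelated by construction.

```python
import random

def simulate(a, b, sigma, xs, seed=0):
    """Draw one Y per X from Y = a + b*X + error, with error ~ N(0, sigma^2)."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    return [a + b * x + rng.gauss(0.0, sigma) for x in xs]

xs = [1, 2, 3, 4, 5]
ys = simulate(a=-0.1, b=0.7, sigma=0.6, xs=xs)

# The errors are the vertical distances from the true line E[Y] = a + b*X.
errors = [y - (-0.1 + 0.7 * x) for x, y in zip(xs, ys)]
print(ys)
print(errors)
```

Setting `sigma=0.0` puts every simulated point exactly on the line, which is one way to see that the error term alone produces the scatter.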


Assumptions of the Simple Linear Regression Model

[Figure: Y plotted against X with the line $E[Y] = \beta_0 + \beta_1 X$. Identical normal distributions of errors, all centered on the regression line.]

Errors in Regression

[Figure: Y plotted against X, showing the observed data point $(X_i, Y_i)$ and the fitted regression line $\hat{Y} = a + bX$.]

$\hat{Y}_i$ is the predicted value of Y for $X_i$.

Error: $e_i = Y_i - \hat{Y}_i$

SIMPLE REGRESSION AND CORRELATION

Estimating Using the Regression Line

First, let's look at the equation of a straight line:

$Y = a + bX$

where Y is the dependent variable, a is the Y-intercept, b is the slope of the line, and X is the independent variable.


The Method of Least Squares

To estimate the straight line we have to use the least squares method.

This method minimizes the sum of squares of error between the estimated points on the line and the actual observed points.


The estimating line: $\hat{Y} = a + bX$

Slope of the best-fitting regression line:

$b = \frac{n\sum XY - \sum X \sum Y}{n\sum X^2 - \left(\sum X\right)^2}$

Y-intercept of the best-fitting regression line:

$a = \bar{Y} - b\bar{X}$
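The slope and intercept formulas translate directly into code. The sketch below is one possible implementation; the function name `least_squares` and the test points are made up for illustration. The points are chosen to lie exactly on Y = 1 + 2X so the result is easy to check by eye.

```python
def least_squares(xs, ys):
    """Return (a, b) for the best-fitting line Y = a + b*X."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    # Slope: (n*SumXY - SumX*SumY) / (n*SumX^2 - (SumX)^2)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # Intercept: mean(Y) - b * mean(X)
    a = sum_y / n - b * (sum_x / n)
    return a, b

a, b = least_squares([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b)
```

The denominator is zero when all X values are identical, in which case no unique line exists; this sketch does not guard against that case.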

SIMPLE REGRESSION - EXAMPLE

Suppose an appliance store conducts a five-month experiment to determine the effect of advertising on sales revenue. The results are shown below. (File PPT_Regr_example.sav)

Month   Advertising Exp. ($100s)   Sales Rev. ($1000s)
1       1                          1
2       2                          1
3       3                          2
4       4                          2
5       5                          4


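X    Y    X^2   XY
1    1    1     1
2    1    4     2
3    2    9     6
4    2    16    8
5    4    25    20

$\sum X = 15$, $\sum Y = 10$, $\sum X^2 = 55$, $\sum XY = 37$

$\bar{X} = \frac{15}{5} = 3$, $\bar{Y} = \frac{10}{5} = 2$

$b = \frac{n\sum XY - \sum X \sum Y}{n\sum X^2 - \left(\sum X\right)^2} = \frac{5(37) - (15)(10)}{5(55) - (15)^2} = \frac{35}{50} = 0.7$

$a = \bar{Y} - b\bar{X} = 2 - 0.7(3) = -0.1$

$\hat{Y} = -0.1 + 0.7X$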
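As a quick check that the fitted line really is the least squares line for the example data, the sketch below compares the sum of squared errors for the fitted line with two nearby lines. The helper name `sse` and the two alternative lines are chosen here for illustration only.

```python
xs = [1, 2, 3, 4, 5]   # advertising ($100s)
ys = [1, 1, 2, 2, 4]   # sales revenue ($1000s)

def sse(a, b):
    """Sum of squared errors of the line Y = a + b*X on the example data."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Residuals of the fitted line Y-hat = -0.1 + 0.7X.
residuals = [y - (-0.1 + 0.7 * x) for x, y in zip(xs, ys)]

best = sse(-0.1, 0.7)
steeper = sse(-0.1, 0.8)   # same intercept, slightly larger slope
shifted = sse(0.0, 0.7)    # same slope, slightly larger intercept
print(best, steeper, shifted)
```

The residuals of the fitted line sum to zero (up to rounding), and both perturbed lines give a larger sum of squared errors.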

Standard Error of Estimate

The standard error of estimate is used to measure the reliability of the estimating equation.

It measures the variability or scatter of the observed values around the regression line.

$s_e = \sqrt{\frac{\sum \left(Y - \hat{Y}\right)^2}{n - 2}}$

Short-cut:

$s_e = \sqrt{\frac{\sum Y^2 - a\sum Y - b\sum XY}{n - 2}}$


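$Y^2$: 1, 1, 4, 4, 16

$\sum Y^2 = 26$

$s_e = \sqrt{\frac{\sum Y^2 - a\sum Y - b\sum XY}{n - 2}} = \sqrt{\frac{26 - (-0.1)(10) - 0.7(37)}{5 - 2}} = \sqrt{\frac{1.1}{3}} = 0.6055$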
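The two formulas for the standard error of estimate should agree. The following sketch computes both for the example data; the variable names are chosen here for illustration.

```python
import math

xs = [1, 2, 3, 4, 5]
ys = [1, 1, 2, 2, 4]
a, b = -0.1, 0.7       # fitted intercept and slope from the example
n = len(xs)

# Definition: square root of the sum of squared residuals over n - 2.
se_direct = math.sqrt(
    sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / (n - 2)
)

# Short-cut: uses only the column sums already computed for the fit.
sum_y = sum(ys)
sum_y2 = sum(y * y for y in ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
se_shortcut = math.sqrt((sum_y2 - a * sum_y - b * sum_xy) / (n - 2))

print(round(se_direct, 4), round(se_shortcut, 4))
```

The short-cut form matches the definition only when a and b are the least squares estimates; with rounded coefficients the two values can drift apart slightly.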

Correlation Analysis

Correlation analysis is used to describe the degree to which one variable is linearly related to another.

There are two measures for describing correlation:

1. The Coefficient of Correlation

2. The Coefficient of Determination

The correlation between two random variables, X and Y, is a measure of the degree of linear association between the two variables.

The population correlation, denoted by $\rho$, can take on any value from -1 to 1.

$\rho = -1$ indicates a perfect negative linear relationship
$-1 < \rho < 0$ indicates a negative linear relationship
$\rho = 0$ indicates no linear relationship
$0 < \rho < 1$ indicates a positive linear relationship
$\rho = 1$ indicates a perfect positive linear relationship

The absolute value of $\rho$ indicates the strength or exactness of the relationship.

Correlation

[Figure: Illustrations of Correlation. Six scatter plots of Y against X, with $\rho = 0$, $\rho = -0.8$, $\rho = 0.8$, $\rho = 0$, $\rho = -1$, and $\rho = 1$.]

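The coefficient of correlation:

$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}$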
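A minimal sketch of the coefficient of correlation formula, applied to the advertising and sales data from the example. The function name `pearson_r` is chosen here for illustration.

```python
import math

def pearson_r(xs, ys):
    """Sample coefficient of correlation from the raw column sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator

r = pearson_r([1, 2, 3, 4, 5], [1, 1, 2, 2, 4])
print(round(r, 4))
```

For points that lie exactly on an upward-sloping line, such as (1, 2), (2, 4), (3, 6), the same function returns 1.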

Sample Coefficient of Determination, $r^2$

Alternate formula:

$r^2 = \frac{a\sum Y + b\sum XY - n\bar{Y}^2}{\sum Y^2 - n\bar{Y}^2}$

For the example:

$r^2 = \frac{(-0.1)(10) + 0.7(37) - 5(2)^2}{26 - 5(2)^2} = \frac{4.9}{6} = 0.8167$

Interpretation: We can conclude that 81.67% of the variation in sales revenue is explained by the variation in advertising expenditure.

The coefficient of determination is the percentage of total variation explained by the regression.
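The "percentage of variation explained" reading can be checked numerically. This sketch computes the coefficient of determination for the example in two ways: with the alternate formula, and as one minus the ratio of unexplained to total variation. The variable names are chosen here for illustration.

```python
xs = [1, 2, 3, 4, 5]
ys = [1, 1, 2, 2, 4]
a, b = -0.1, 0.7       # fitted intercept and slope from the example
n = len(ys)
y_bar = sum(ys) / n

# Alternate formula from the column sums.
sum_y = sum(ys)
sum_y2 = sum(y * y for y in ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
r2_alternate = (a * sum_y + b * sum_xy - n * y_bar ** 2) / (sum_y2 - n * y_bar ** 2)

# Explained share = 1 - (unexplained variation / total variation).
unexplained = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
total = sum((y - y_bar) ** 2 for y in ys)
r2_from_variation = 1 - unexplained / total

print(round(r2_alternate, 4), round(r2_from_variation, 4))
```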

The Coefficient of Correlation or Karl Pearson’s Coefficient of Correlation

The coefficient of correlation is the square root of the coefficient of determination.

The sign of r indicates the direction of the relationship between the two variables X and Y.

The sign of r will be the same as the sign of the coefficient “b” in the regression equation Y = a + bX.

If the slope of the estimating line is positive, r is the positive square root.

If the slope of the estimating line is negative, r is the negative square root.

$r = \pm\sqrt{r^2}$

$r = \sqrt{0.8167} = 0.9037$

The relationship between the two variables is direct.

Hypothesis Tests for the Correlation Coefficient

H0: $\rho = 0$ (No linear relationship)
H1: $\rho \neq 0$ (Some linear relationship)

Test statistic: $t_{(n-2)} = \frac{r}{\sqrt{\frac{1 - r^2}{n - 2}}}$
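A short sketch of the correlation test statistic, using the sample values from the advertising example (r = 35 / sqrt(1500), n = 5). The function name `correlation_t` is chosen here for illustration; the critical value 3.182 is the two-tailed 0.05 value for 3 degrees of freedom used later in these slides.

```python
import math

def correlation_t(r, n):
    """t statistic with n - 2 degrees of freedom for H0: rho = 0."""
    return r / math.sqrt((1 - r ** 2) / (n - 2))

r = 35 / math.sqrt(1500)   # sample correlation for the example data
n = 5
t = correlation_t(r, n)

critical_value = 3.182     # t table value, 3 degrees of freedom, 0.05 two-tailed
reject_h0 = abs(t) > critical_value
print(round(t, 3), reject_h0)
```

The statistic is undefined when r is exactly 1 or -1 because the denominator becomes zero; this sketch does not handle that case.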

Analysis-of-Variance Table and an F Test of the Regression Model

Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square   F Ratio
Regression            SSR              (1)                  MSR           MSR/MSE
Error                 SSE              (n-2)                MSE
Total                 SST              (n-1)                MST

H0: The regression model is not significant
H1: The regression model is significant
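The entries of the analysis-of-variance table can be computed for the advertising example as follows. This is a sketch that assumes the usual definitions, which the slides do not spell out: SST is the squared deviation of Y from its mean, SSR the squared deviation of the fitted values from that mean, and SSE the squared residuals.

```python
xs = [1, 2, 3, 4, 5]
ys = [1, 1, 2, 2, 4]
a, b = -0.1, 0.7       # fitted intercept and slope from the example
n = len(xs)
y_bar = sum(ys) / n
fitted = [a + b * x for x in xs]

sst = sum((y - y_bar) ** 2 for y in ys)               # total variation
ssr = sum((f - y_bar) ** 2 for f in fitted)           # explained by regression
sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))   # unexplained (error)

msr = ssr / 1          # regression has 1 degree of freedom
mse = sse / (n - 2)    # error has n - 2 degrees of freedom
f_ratio = msr / mse

print(round(ssr, 3), round(sse, 3), round(sst, 3), round(f_ratio, 3))
```

SSR and SSE add up to SST, which is a useful arithmetic check before reading off the F ratio.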

We pose the question:

Is the independent variable linearly related to the dependent variable?

To answer the question we test the hypothesis

H0: b = 0

H1: b is not equal to zero.

If b is not equal to zero, the model has some validity.

Testing for the existence of linear relationship

Test statistic, with n-2 degrees of freedom: $t = \frac{b}{s_b}$

Correlations

                                                   Advertising expenses ($00)   Sales revenue ($000)
Advertising expenses ($00)   Pearson Correlation   1                            .904*
                             Sig. (2-tailed)                                    .035
                             N                     5                            5
Sales revenue ($000)         Pearson Correlation   .904*                        1
                             Sig. (2-tailed)       .035
                             N                     5                            5

*. Correlation is significant at the 0.05 level (2-tailed).

Model Summary

Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .904(a)   .817       .756                .606

a. Predictors: (Constant), Advertising expenses ($00)

ANOVA(b)

Model            Sum of Squares   df   Mean Square   F        Sig.
1   Regression   4.900            1    4.900         13.364   .035(a)
    Residual     1.100            3    .367
    Total        6.000            4

a. Predictors: (Constant), Advertising expenses ($00)
b. Dependent Variable: Sales revenue ($000)

Alternately, R² = 1 - [SS(Residual) / SS(Total)] = 1 - (1.1/6.0) = 0.817.

When adjusted for degrees of freedom, Adjusted R² = 1 - [SS(Residual)/(n-k-1)] / [SS(Total)/(n-1)] = 1 - [1.1/3] / [6/4] = 0.756, where k is the number of independent variables (here k = 1).
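The two calculations above can be reproduced from the sums of squares in the ANOVA output. The function names below are chosen here for illustration; k is the number of independent variables.

```python
def r_squared(ss_residual, ss_total):
    """R^2 = 1 - SS(Residual) / SS(Total)."""
    return 1 - ss_residual / ss_total

def adjusted_r_squared(ss_residual, ss_total, n, k):
    """Adjusted R^2: each sum of squares is divided by its degrees of freedom."""
    return 1 - (ss_residual / (n - k - 1)) / (ss_total / (n - 1))

r2 = r_squared(1.1, 6.0)
adj_r2 = adjusted_r_squared(1.1, 6.0, n=5, k=1)
print(round(r2, 3), round(adj_r2, 3))
```

With only five observations the adjustment is noticeable: the adjusted value is smaller than the unadjusted one because one degree of freedom out of four is spent on the slope.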

Coefficients(a)

                                   Unstandardized Coefficients   Standardized Coefficients
Model                              B        Std. Error           Beta                        t       Sig.
1   (Constant)                     -.100    .635                                             -.157   .885
    Advertising expenses ($00)     .700     .191                 .904                        3.656   .035

a. Dependent Variable: Sales revenue ($000)

$\hat{Y} = -0.1 + 0.7X$

Test statistic: $F = \frac{MSR}{MSE}$

Value of the test statistic: F = 13.364

Conclusion: There is sufficient evidence to reject the null hypothesis in favor of the alternative hypothesis. The slope b is not equal to zero. Thus, the independent variable is linearly related to Y. This linear regression model is valid.

The p-value is 0.035.

Test statistic, with n-2 degrees of freedom: $t = \frac{b}{s_b}$

Rejection region: $|t| > t_{0.05/2,\,3} = 3.182$

Value of the test statistic: $t = \frac{0.7}{0.191} = 3.66$

Conclusion:

The calculated test statistic is 3.66, which is outside the acceptance region. Alternately, the actual significance is 0.035. Therefore we will reject the null hypothesis. Advertising expenses are a significant explanatory variable.
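The standard error of the slope, 0.191, is taken from the SPSS output above; the slides do not show how it is obtained. The sketch below assumes the standard textbook formula, s_b = s_e / sqrt(sum of (X - mean of X)^2), and then forms the t statistic. Variable names are chosen here for illustration.

```python
import math

xs = [1, 2, 3, 4, 5]
ys = [1, 1, 2, 2, 4]
a, b = -0.1, 0.7       # fitted intercept and slope from the example
n = len(xs)
x_bar = sum(xs) / n

# Standard error of estimate (scatter of Y around the fitted line).
s_e = math.sqrt(sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / (n - 2))

# Assumed formula: standard error of the slope.
s_b = s_e / math.sqrt(sum((x - x_bar) ** 2 for x in xs))

t = b / s_b
critical_value = 3.182  # two-tailed 0.05 value for n - 2 = 3 degrees of freedom
print(round(s_b, 3), round(t, 3), abs(t) > critical_value)
```

In simple regression this t statistic squared equals the F ratio from the ANOVA table, which is why both tests report the same significance of 0.035.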
