Simple Linear Regression and Correlation
Jan 03, 2016
Introduction
• Regression refers to the statistical technique of modeling the relationship between variables.
• In simple linear regression, we model the relationship between two variables.
• One of the variables, denoted by Y, is called the dependent variable and the other, denoted by X, is called the independent variable.
• The model we will use to depict the relationship between X and Y will be a straight-line relationship.
• A graphical sketch of the pairs (X, Y) is called a scatter plot.
This scatterplot locates pairs of observations of advertising expenditures on the x-axis and sales on the y-axis. We notice that:
Larger (smaller) values of sales tend to be associated with larger (smaller) values of advertising.
[Figure: Scatterplot of Advertising Expenditures (X) and Sales (Y). x-axis: Advertising, 0 to 50; y-axis: Sales, 0 to 140.]
The scatter of points tends to be distributed around a positively sloped straight line.
The pairs of values of advertising expenditures and sales are not located exactly on a straight line.
The scatter plot reveals a more or less strong tendency rather than a precise linear relationship.
The line represents the nature of the relationship on average.
Using Statistics
[Figure: Examples of Other Scatterplots. Each panel plots Y against X.]
Simple Linear Regression Model
The equation that describes how y is related to x and an error term is called the regression model.
The simple linear regression model is:
$$y = a + bx + \varepsilon$$
where:
a and b are called parameters of the model; a is the intercept and b is the slope.
$\varepsilon$ is a random variable called the error term.
Assumptions of the Simple Linear Regression Model
• The relationship between X and Y is a straight-line relationship.
• The errors $\varepsilon_i$ are normally distributed with mean 0 and variance $\sigma^2$. The errors are uncorrelated (not related) in successive observations.
• That is: $\varepsilon \sim N(0, \sigma^2)$
[Figure: Identical normal distributions of errors, all centered on the regression line $E[Y] = \beta_0 + \beta_1 X$ (the intercept $\beta_0$ and slope $\beta_1$ correspond to a and b above).]
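To make these assumptions concrete, the sketch below simulates observations from a straight line plus independent normal errors. The parameter values (a = -0.1, b = 0.7, sigma = 0.6), the seed, the repeated design, and the function name `simulate` are illustrative assumptions, not values prescribed by the slides.

```python
import random

def simulate(x_values, a, b, sigma, seed=0):
    """Generate y = a + b*x + e with e ~ N(0, sigma^2), drawn independently."""
    rng = random.Random(seed)
    return [a + b * x + rng.gauss(0.0, sigma) for x in x_values]

# Hypothetical design: each x value repeated 200 times.
x_values = [1, 2, 3, 4, 5] * 200
y_values = simulate(x_values, a=-0.1, b=0.7, sigma=0.6)

# Under the assumptions, the errors average out to roughly zero
# around the line E[Y] = a + bX.
errors = [y - (-0.1 + 0.7 * x) for x, y in zip(x_values, y_values)]
mean_error = sum(errors) / len(errors)
print(f"mean error = {mean_error:.3f}")
```

With 1000 simulated points the mean error should sit close to zero, which is what "centered on the regression line" means in practice.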
Errors in Regression
[Figure: The fitted regression line $\hat{Y} = a + bX$, with an observed data point $(X_i, Y_i)$ and its vertical distance from the line.]
$\hat{Y}_i$: the predicted value of Y for $X_i$
Error: $e_i = Y_i - \hat{Y}_i$
SIMPLE REGRESSION AND CORRELATION
Estimating Using the Regression Line
First, let's look at the equation of a straight line:
$$Y = a + bX$$
Y: dependent variable
X: independent variable
a: Y-intercept
b: slope of the line
The Method of Least Squares
To estimate the straight line we use the least squares method.
This method minimizes the sum of squared errors between the estimated points on the line and the actual observed points.
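The estimating line:
$$\hat{Y} = a + bX$$
Slope of the best-fitting regression line:
$$b = \frac{n\sum XY - \sum X \sum Y}{n\sum X^2 - \left(\sum X\right)^2}$$
Y-intercept of the best-fitting regression line:
$$a = \bar{Y} - b\bar{X}$$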
SIMPLE REGRESSION - EXAMPLE
Suppose an appliance store conducts a five-month experiment to determine the effect of advertising on sales revenue. The results are shown below. (File PPT_Regr_example.sav)

Month   Advertising Exp. ($100s)   Sales Rev. ($1000s)
1       1                          1
2       2                          1
3       3                          2
4       4                          2
5       5                          4
X     Y     X^2   XY
1     1     1     1
2     1     4     2
3     2     9     6
4     2     16    8
5     4     25    20

$\sum X = 15$, $\sum Y = 10$, $\sum X^2 = 55$, $\sum XY = 37$
$\bar{X} = 15/5 = 3$, $\bar{Y} = 10/5 = 2$
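$$b = \frac{n\sum XY - \sum X \sum Y}{n\sum X^2 - \left(\sum X\right)^2} = \frac{5(37) - (15)(10)}{5(55) - (15)^2} = \frac{35}{50} = 0.7$$
$$a = \bar{Y} - b\bar{X} = 2 - 0.7(3) = -0.1$$
$$\hat{Y} = -0.1 + 0.7X$$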
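The least squares arithmetic for the appliance store example can be checked with a short script. This is a minimal sketch using only the sum formulas for the slope and intercept; the function name `least_squares_fit` is a hypothetical helper, not something the slides define.

```python
# Data from the appliance store example.
x = [1, 2, 3, 4, 5]   # advertising expenditure ($100s)
y = [1, 1, 2, 2, 4]   # sales revenue ($1000s)

def least_squares_fit(x, y):
    """Return (a, b) for the line Y = a + bX using the sum formulas."""
    n = len(x)
    sum_x = sum(x)
    sum_y = sum(y)
    sum_x2 = sum(xi * xi for xi in x)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    # Slope: (n*SumXY - SumX*SumY) / (n*SumX^2 - (SumX)^2)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # Intercept: mean of Y minus slope times mean of X
    a = sum_y / n - b * (sum_x / n)
    return a, b

a, b = least_squares_fit(x, y)
print(f"a = {a:.1f}, b = {b:.1f}")
```

The script should reproduce the hand calculation, a = -0.1 and b = 0.7.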
Standard Error of Estimate
The standard error of estimate is used to measure the reliability of the estimating equation.
It measures the variability or scatter of the observed values around the regression line.
$$s_e = \sqrt{\frac{\sum \left(Y - \hat{Y}\right)^2}{n - 2}}$$
Short-cut:
$$s_e = \sqrt{\frac{\sum Y^2 - a\sum Y - b\sum XY}{n - 2}}$$
Y^2: 1, 1, 4, 4, 16
$\sum Y^2 = 26$
$$s_e = \sqrt{\frac{\sum Y^2 - a\sum Y - b\sum XY}{n - 2}} = \sqrt{\frac{26 - (-0.1)(10) - 0.7(37)}{5 - 2}} = 0.6055$$
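As a check on the two formulas for the standard error of estimate, the sketch below computes it once from the residuals and once from the short-cut sums, using the example data and the fitted values a = -0.1 and b = 0.7. The variable names are illustrative.

```python
import math

x = [1, 2, 3, 4, 5]   # advertising expenditure ($100s)
y = [1, 1, 2, 2, 4]   # sales revenue ($1000s)
a, b = -0.1, 0.7      # fitted intercept and slope from the example
n = len(x)

# Definition: square root of the sum of squared residuals over n - 2.
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
se_definition = math.sqrt(sse / (n - 2))

# Short-cut: uses only the column sums from the worksheet.
sum_y = sum(y)
sum_y2 = sum(yi * yi for yi in y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
se_shortcut = math.sqrt((sum_y2 - a * sum_y - b * sum_xy) / (n - 2))

print(f"{se_definition:.4f} {se_shortcut:.4f}")
```

Both routes should agree at about 0.6055, since the short-cut is an algebraic rearrangement of the definition when a and b are the least squares estimates.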
Correlation Analysis
Correlation analysis is used to describe the degree to which one variable is linearly related to another.
There are two measures for describing correlation:
1. The Coefficient of Correlation
2. The Coefficient of Determination
The correlation between two random variables, X and Y, is a measure of the degree of linear association between the two variables.
The population correlation, denoted by $\rho$, can take on any value from -1 to 1.
$\rho = -1$ indicates a perfect negative linear relationship
$-1 < \rho < 0$ indicates a negative linear relationship
$\rho = 0$ indicates no linear relationship
$0 < \rho < 1$ indicates a positive linear relationship
$\rho = 1$ indicates a perfect positive linear relationship
The absolute value of $\rho$ indicates the strength or exactness of the relationship.
[Figure: Illustrations of Correlation. Six scatterplot panels of Y against X with $\rho = 0$, $\rho = -0.8$, $\rho = 0.8$, $\rho = 0$, $\rho = -1$, and $\rho = 1$.]
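The coefficient of correlation:
$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}$$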
Sample Coefficient of Determination $r^2$
Alternate Formula
$$r^2 = \frac{a\sum Y + b\sum XY - n\bar{Y}^2}{\sum Y^2 - n\bar{Y}^2}$$
$$r^2 = \frac{a\sum Y + b\sum XY - n\bar{Y}^2}{\sum Y^2 - n\bar{Y}^2} = \frac{(-0.1)(10) + 0.7(37) - 5(2)^2}{26 - 5(2)^2} = 0.8167$$
Interpretation: We can conclude that 81.67% of the variation in sales revenue is explained by the variation in advertising expenditure.
$r^2$ is the percentage of total variation explained by the regression.
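The coefficient of determination for the example can be verified with the alternate formula. The sketch below assumes the fitted values a = -0.1 and b = 0.7 from the example; the variable names are illustrative.

```python
x = [1, 2, 3, 4, 5]   # advertising expenditure ($100s)
y = [1, 1, 2, 2, 4]   # sales revenue ($1000s)
a, b = -0.1, 0.7      # fitted intercept and slope from the example
n = len(y)

y_bar = sum(y) / n
sum_y = sum(y)
sum_y2 = sum(yi * yi for yi in y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))

# Alternate formula: explained variation over total variation.
explained = a * sum_y + b * sum_xy - n * y_bar ** 2   # -1 + 25.9 - 20
total = sum_y2 - n * y_bar ** 2                       # 26 - 20
r_squared = explained / total

print(f"r^2 = {r_squared:.4f}")
```

The numerator works out to 4.9 and the denominator to 6, giving the 0.8167 shown in the worked example.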
The Coefficient of Correlation or Karl Pearson’s Coefficient of Correlation
The coefficient of correlation is the square root of the coefficient of determination.
The sign of r indicates the direction of the relationship between the two variables X and Y.
The sign of r will be the same as the sign of the coefficient “b” in the regression equation Y = a + bX.
$$r = \pm\sqrt{r^2}$$
If the slope of the estimating line is positive: r is the positive square root.
If the slope of the estimating line is negative: r is the negative square root.
$$r = \sqrt{0.8167} = 0.9037$$
The relationship between the two variables is direct.
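The value of r can also be computed directly from the Pearson formula instead of taking the square root of $r^2$. The sketch below does this for the example data and confirms that the sign of r matches the sign of the slope; the variable names are illustrative.

```python
import math

x = [1, 2, 3, 4, 5]   # advertising expenditure ($100s)
y = [1, 1, 2, 2, 4]   # sales revenue ($1000s)
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_x2 = sum(xi * xi for xi in x)
sum_y2 = sum(yi * yi for yi in y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))

# Pearson's coefficient of correlation from the raw sums.
numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = numerator / denominator

# The slope shares the numerator of r, so the two always have the same sign.
b = numerator / (n * sum_x2 - sum_x ** 2)

print(f"r = {r:.4f}, r^2 = {r * r:.4f}, b = {b:.1f}")
```

Because the direct formula carries its own sign, it avoids the separate step of choosing the positive or negative root.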
Hypothesis Tests for the Correlation Coefficient
$H_0: \rho = 0$ (No linear relationship)
$H_1: \rho \neq 0$ (Some linear relationship)
Test statistic:
$$t_{(n-2)} = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}}$$
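Applied to the advertising example (r = 0.9037, n = 5), the test statistic can be evaluated as in the sketch below. The critical value 3.182 is the two-tailed 5% point of the t distribution with 3 degrees of freedom; it is hard-coded here as an assumption rather than computed.

```python
import math

r = 0.9037   # sample correlation from the example
n = 5        # number of observations

# t statistic with n - 2 degrees of freedom for H0: rho = 0.
t_stat = r / math.sqrt((1 - r ** 2) / (n - 2))

# Two-tailed critical value at the 0.05 level with 3 degrees of freedom
# (hard-coded, not computed).
t_critical = 3.182
reject_h0 = abs(t_stat) > t_critical

print(f"t = {t_stat:.2f}, reject H0: {reject_h0}")
```

The statistic comes out at about 3.66, which exceeds 3.182, so the null hypothesis of no linear relationship is rejected at the 5% level.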
Analysis-of-Variance Table and an F Test of the Regression Model
Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square   F Ratio
Regression            SSR              1                    MSR           MSR/MSE
Error                 SSE              n-2                  MSE
Total                 SST              n-1                  MST

H0: The regression model is not significant
H1: The regression model is significant
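The entries of the analysis-of-variance table can be filled in for the appliance store example with a few lines of code. This is a sketch assuming the fitted line Y = -0.1 + 0.7X from the example; the variable names are illustrative.

```python
x = [1, 2, 3, 4, 5]   # advertising expenditure ($100s)
y = [1, 1, 2, 2, 4]   # sales revenue ($1000s)
a, b = -0.1, 0.7      # fitted intercept and slope from the example
n = len(y)

y_bar = sum(y) / n
y_hat = [a + b * xi for xi in x]

# Sums of squares: total, regression (explained), and error (residual).
sst = sum((yi - y_bar) ** 2 for yi in y)
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))

# Mean squares use 1 and n - 2 degrees of freedom respectively.
msr = ssr / 1
mse = sse / (n - 2)
f_ratio = msr / mse

print(f"SSR={ssr:.3f} SSE={sse:.3f} SST={sst:.3f} F={f_ratio:.3f}")
```

The sums of squares satisfy SSR + SSE = SST (4.9 + 1.1 = 6.0), and the F ratio is about 13.364.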
Testing for the existence of linear relationship
We pose the question:
Is the independent variable linearly related to the dependent variable?
To answer the question we test the hypothesis
H0: b = 0
H1: b is not equal to zero.
If b is not equal to zero, the model has some validity.
Test statistic, with n-2 degrees of freedom:
$$t = \frac{b}{s_b}$$
Correlations
                                                   Advertising expenses ($00)   Sales revenue ($000)
Advertising expenses ($00)   Pearson Correlation   1                            .904*
                             Sig. (2-tailed)                                    .035
                             N                     5                            5
Sales revenue ($000)         Pearson Correlation   .904*                        1
                             Sig. (2-tailed)       .035
                             N                     5                            5
*. Correlation is significant at the 0.05 level (2-tailed).
Model Summary
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .904a   .817       .756                .606
a. Predictors: (Constant), Advertising expenses ($00)
ANOVAb
Model            Sum of Squares   df   Mean Square   F        Sig.
1   Regression   4.900            1    4.900         13.364   .035a
    Residual     1.100            3    .367
    Total        6.000            4
a. Predictors: (Constant), Advertising expenses ($00)
b. Dependent Variable: Sales revenue ($000)
Alternately, R^2 = 1 - [SS(Residual) / SS(Total)] = 1 - (1.1/6.0) = 0.817.
When adjusted for degrees of freedom, Adjusted R^2 = 1 - [SS(Residual)/(n-k-1)] / [SS(Total)/(n-1)] = 1 - [1.1/3] / [6/4] = 0.756, where k is the number of independent variables (here k = 1).
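The two quantities above follow directly from the ANOVA sums of squares. The sketch below wraps them in small functions; the function names `r_squared` and `adjusted_r_squared` are hypothetical helpers, not SPSS functions.

```python
def r_squared(ss_residual, ss_total):
    """R^2 = 1 - SS(Residual) / SS(Total)."""
    return 1 - ss_residual / ss_total

def adjusted_r_squared(ss_residual, ss_total, n, k):
    """Adjusted R^2, dividing each sum of squares by its degrees of freedom."""
    return 1 - (ss_residual / (n - k - 1)) / (ss_total / (n - 1))

# Values from the ANOVA table: n = 5 observations, k = 1 predictor.
r2 = r_squared(1.1, 6.0)
adj_r2 = adjusted_r_squared(1.1, 6.0, n=5, k=1)

print(f"R^2 = {r2:.3f}, adjusted R^2 = {adj_r2:.3f}")
```

With so few observations the adjustment is noticeable: the penalty for the single predictor lowers the figure from about 0.817 to about 0.756.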
Coefficientsa
                                 Unstandardized Coefficients   Standardized Coefficients
Model                            B        Std. Error           Beta                        t       Sig.
1   (Constant)                   -.100    .635                                             -.157   .885
    Advertising expenses ($00)   .700     .191                 .904                        3.656   .035
a. Dependent Variable: Sales revenue ($000)

$$\hat{Y} = -0.1 + 0.7X$$
Test statistic:
$$F = \frac{MSR}{MSE}$$
Value of the test statistic: F = 13.364
The p-value is 0.035
Conclusion: There is sufficient evidence to reject the null hypothesis in favor of the alternative hypothesis. The slope b is not equal to zero. Thus, the independent variable is linearly related to y. This linear regression model is valid.
Test statistic, with n-2 degrees of freedom:
$$t = \frac{b}{s_b}$$
Rejection region: $|t| > t_{0.05/2,\,3} = 3.182$
Value of the test statistic:
$$t = \frac{0.7}{0.191} = 3.66$$
Conclusion:
The calculated test statistic is 3.66, which is outside the acceptance region. Alternately, the actual significance is 0.035. Therefore we reject the null hypothesis. Advertising expense is a significant explanatory variable.
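The standard error of the slope, 0.191, is read from the SPSS output rather than derived in the slides. The sketch below assumes the usual textbook formula $s_b = s_e / \sqrt{\sum (X - \bar{X})^2}$, which is not stated in the slides, and uses the critical value 3.182 from the rejection region above.

```python
import math

x = [1, 2, 3, 4, 5]   # advertising expenditure ($100s)
y = [1, 1, 2, 2, 4]   # sales revenue ($1000s)
a, b = -0.1, 0.7      # fitted intercept and slope from the example
n = len(x)

# Standard error of estimate from the residuals.
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
s_e = math.sqrt(sse / (n - 2))

# Standard error of the slope (assumed textbook formula, see lead-in).
x_bar = sum(x) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
s_b = s_e / math.sqrt(sxx)

# t statistic for H0: b = 0, compared with the critical value from the slides.
t_stat = b / s_b
reject_h0 = abs(t_stat) > 3.182

print(f"s_b = {s_b:.3f}, t = {t_stat:.2f}, reject H0: {reject_h0}")
```

Under that assumption the script reproduces the SPSS figures (standard error about 0.191, t about 3.66), and the t statistic matches the one obtained from the correlation coefficient, as expected in simple regression.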