
Statistics For Management Unit 12

Sikkim Manipal University

Unit 12: Simple Correlation & Regression

Structure

12.1 Introduction

Objectives

12.2 Correlation

12.2.1 Causation and Correlation

12.2.2 Types of Correlation

12.3 Measures of Correlation

12.3.1 Scatter Diagram

12.3.2 Karl Pearson’s Correlation Coefficient

12.3.3 Properties of Karl Pearson’s Correlation Coefficient

12.3.4 Factors Influencing the Size of Correlation Coefficient

12.4 Problems

12.5 Probable Error

12.6 Spearman’s Rank Correlation Coefficient

12.7 Partial Correlation

12.8 Multiple Correlation

12.9 Regression

12.9.1 Regression Analysis

12.9.2 Regression Lines

12.9.3 About Regression Coefficient

12.9.4 Differences Between Correlation Coefficient and Regression Coefficient

12.9.5 Examples

12.10 Standard Error Of Estimate

12.11 Multiple Regression Analysis

12.12 Reliability of Estimates

12.13 Application of Multiple Regression

Self Assessment Questions

12.14 Summary

Terminal Questions

Answer to SAQ’s and TQ’s


12.1 Introduction

Both correlation and regression are used to measure the strength of relationships between variables.

The following statistical tools measure the relationship between the variables analysed in social science research:

1. Correlation
a. Simple correlation – the relationship between two variables is studied.
b. Partial correlation – the relationship between any two variables is studied, keeping all others constant.
c. Multiple correlation – the relationship between several variables is studied simultaneously.
2. Regression
a. Simple regression
b. Multiple regression
3. Association of Attributes

Correlation measures the relationship (positive or negative, perfect or imperfect) between two variables. Regression analysis considers the relationship between variables and estimates the value of one variable from the given value of another. Association of attributes attempts to ascertain the extent of association between two attributes.

Learning Objectives

In this unit students will learn about:

1. Simple, partial and multiple correlation
2. Parametric and non-parametric measures of correlation
3. The method of estimating unknown values from known values through regression equations

12.2 Correlation

When two or more variables move in sympathy with each other, they are said to be correlated. If both variables move in the same direction, they are positively correlated; if they move in opposite directions, they are negatively correlated; if they move haphazardly, there is no correlation between them.

Correlation analysis deals with:

1) Measuring the relationship between variables.
2) Testing the relationship for its significance.
3) Giving a confidence interval for the population correlation measure.


12.2.1 Causation and Correlation

The correlation between two variables may be due to the following causes:

i) Small sample size: correlation may be present in the sample but not in the population.
ii) A third factor: correlation between the yields of rice and tea may be due to a third factor, rain.

12.2.2 Types of Correlation

The types of correlation are:

a. Positive or negative
b. Simple, partial and multiple
c. Linear and non-linear

Positive correlation: Both variables (X and Y) vary in the same direction. If variable X increases, variable Y also increases; if variable X decreases, variable Y also decreases.

Negative correlation: The variables vary in opposite directions. If one variable increases, the other decreases.

Simple, partial and multiple correlation: In simple correlation, the relationship between two variables is studied. In partial and multiple correlation, three or more variables are involved. In multiple correlation, three or more variables are studied simultaneously. In partial correlation, more than two variables are involved, but the effect of one variable is kept constant and the relationship between the other two is studied.

Linear and non-linear correlation: This depends on the constancy of the ratio of change between the variables. In linear correlation, the change in one variable bears a constant ratio to the change in the other; it is not so in non-linear correlation.

12.3 Measures of Correlation

i) Scatter diagram
ii) Karl Pearson's correlation coefficient
iii) Spearman's rank correlation coefficient

12.3.1 Scatter Diagram

The ordered pairs of observed values are plotted on the xy-plane as dots; hence it is also known as a dot diagram. It is a diagrammatic representation of the relationship.


If the dots lie exactly on a straight line that runs from bottom left to top right, the variables are said to be perfectly positively correlated (fig. i).

If the dots lie close to a straight line that runs from bottom left to top right, the variables are said to be positively correlated (fig. ii).

If the dots lie exactly on a straight line that runs from top left to bottom right, the variables are said to be perfectly negatively correlated (fig. iii).

If the dots lie very close to a straight line that runs from top left to bottom right, the variables are said to be negatively correlated (fig. iv).

If the dots lie scattered all over the graph paper, the variables have zero correlation (fig. v).

A scatter diagram tells us the direction in which the variables are related, but it does not give any quantitative measure for comparison between sets of data.

12.3.2 Karl Pearson’s Correlation Coefficient

It is defined as

i. r = ∑xy / (N σx σy)        … (A)

where x = X - X̄ and y = Y - Ȳ.

(Figures i–v referenced above are scatter diagrams showing perfect positive, positive, perfect negative, negative and zero correlation.)


σx² = ∑x² / N,   σy² = ∑y² / N

N – number of paired observations; ∑xy / N is called the covariance of x and y. The other forms of this formula are:

ii. r = ∑xy / √((∑x²)(∑y²))        … (B)

iii. r = (N∑XY - ∑X∑Y) / (√(N∑X² - (∑X)²) · √(N∑Y² - (∑Y)²))        … (C)

iv. r = (N∑dxdy - (∑dx)(∑dy)) / (√(N∑dx² - (∑dx)²) · √(N∑dy² - (∑dy)²))        … (D)

where dx and dy are deviations from assumed means. For all practical purposes we can conveniently use form D; whenever summary information is given, choose the proper form from A to C.

12.3.3 Properties of Karl Pearson's Correlation Coefficient

- Its value always lies between -1 and +1.
- It is not affected by change of origin or change of scale.
- It is a relative measure (it does not have any unit attached to it).

12.3.4 Factors Influencing the Size of the Correlation Coefficient

The size of r is very much dependent upon the variability of measured values in the correlation

sample. The greater the variability, the higher will be the correlation, everything else being equal.

The size of r is altered when researchers select extreme groups of subjects in order to compare

these groups with respect to certain behaviors. Selecting extreme groups on one variable

increases the size of r over what would be obtained with more random sampling.

Combining two groups which differ in their mean values on one of the variables is not likely to

faithfully represent the true situation as far as the correlation is concerned.

Addition of an extreme case (and conversely dropping of an extreme case) can lead to changes

in the amount of correlation. Dropping of such a case leads to reduction in the correlation while

the converse is also true. (Source: Aggarwal.Y.P, Statistical Methods, Sterling Publishers Pvt

Ltd., New Delhi, 1998, p.131).



12.4 Problems

Example 1: Find Karl Pearson's correlation coefficient, given

X: 20 16 12 8 4
Y: 22 14 4 12 8

Solution:

X     Y     X²     Y²     XY
20    22    400    484    440
16    14    256    196    224
12     4    144     16     48
 8    12     64    144     96
 4     8     16     64     32

∑X = 60   ∑Y = 60   ∑X² = 880   ∑Y² = 904   ∑XY = 840

Applying formula (C) and substituting the values from the table:

r = (N∑XY - ∑X∑Y) / (√(N∑X² - (∑X)²) · √(N∑Y² - (∑Y)²))
  = (5 × 840 - 60 × 60) / (√(5 × 880 - 60²) · √(5 × 904 - 60²))
  = 600 / (√800 × √920)
  = 0.70
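As an illustration, form (C) can be computed directly in plain Python. This is a sketch, not part of the original text, and the function name is my own; the data are those of Example 1.

```python
from math import sqrt

def pearson_r(xs, ys):
    # form (C): r = (N*SXY - SX*SY) / sqrt((N*SX2 - SX^2) * (N*SY2 - SY^2))
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sx2 = sum(x * x for x in xs)
    sy2 = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return (n * sxy - sx * sy) / sqrt((n * sx2 - sx ** 2) * (n * sy2 - sy ** 2))

X = [20, 16, 12, 8, 4]
Y = [22, 14, 4, 12, 8]
print(round(pearson_r(X, Y), 2))  # 0.7
```

Since forms (A)–(D) are algebraically equivalent, the same function reproduces the later worked examples as well.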

Example 2: Calculate Karl Pearson's coefficient of correlation from the following data:

Year:                 1985 1986 1987 1988 1989 1990 1991 1992
Index of Production:  100  102  104  107  105  112  103  99
Number of unemployed: 15   12   13   11   12   12   19   26



Solution: (taking deviations from the actual means, X̄ = 104 and Ȳ = 15)

Year   X     x = X - X̄   x²    Y     y = Y - Ȳ   y²    xy
1985   100   -4           16    15     0           0      0
1986   102   -2            4    12    -3           9     +6
1987   104    0            0    13    -2           4      0
1988   107   +3            9    11    -4          16    -12
1989   105   +1            1    12    -3           9     -3
1990   112   +8           64    12    -3           9    -24
1991   103   -1            1    19    +4          16     -4
1992    99   -5           25    26   +11         121    -55

∑X = 832   ∑x = 0   ∑x² = 120   ∑Y = 120   ∑y = 0   ∑y² = 184   ∑xy = -92

X̄ = 104,   Ȳ = 15

r = ∑xy / √((∑x²)(∑y²)) = -92 / √(120 × 184) = -0.619

Therefore the correlation between production and unemployment is negative.

Example 3: Calculate the correlation coefficient from the following data:

X: 50 60 58 47 49 33 65 43 46 68
Y: 48 65 50 48 55 58 63 48 50 70

Solution: (taking assumed means 50 for X and 55 for Y)

X    dx = X - 50   dx²   Y    dy = Y - 55   dy²   dx·dy
50     0             0   48    -7            49       0
60   +10           100   65   +10           100    +100
58    +8            64   50    -5            25     -40
47    -3             9   48    -7            49     +21
49    -1             1   55     0             0       0
33   -17           289   58    +3             9     -51


65   +15           225   63    +8            64    +120
43    -7            49   48    -7            49     +49
46    -4            16   50    -5            25     +20
68   +18           324   70   +15           225    +270

∑X = 519   ∑dx = +19   ∑dx² = 1077   ∑Y = 535   ∑dy = +5   ∑dy² = 595   ∑dx·dy = 489

Using formula (D):

r = (N∑dxdy - (∑dx)(∑dy)) / (√(N∑dx² - (∑dx)²) · √(N∑dy² - (∑dy)²))

Substituting the values, r = (10 × 489 - 19 × 5) / (√(10 × 1077 - 19²) · √(10 × 595 - 5²)) = 0.611

Example 4: In bivariate data on x and y, the variance of x = 49, the variance of y = 9, and the covariance of x and y = -17.5. Find the coefficient of correlation between x and y.

Solution: We know

r = ∑xy / (N σx σy)

Given ∑xy / N = -17.5, σx = √49 = 7 and σy = √9 = 3,

r = -17.5 / (7 × 3) = -0.833

There is a high negative correlation.

Example 5: Ten observations on weight (x) and height (y) of a particular age group gave the following data:

∑x = 56   ∑y = 138   ∑x² = 1357   ∑y² = 2136   ∑xy = 836

Find r.

Solution: We know

r = (N∑xy - ∑x∑y) / (√(N∑x² - (∑x)²) · √(N∑y² - (∑y)²))



Given N = 10, ∑x = 56, ∑y = 138, ∑x² = 1357, ∑y² = 2136, ∑xy = 836:

r = (10 × 836 - 56 × 138) / (√(10 × 1357 - 56²) · √(10 × 2136 - 138²)) = 632 / (√10434 × √2316) = 0.1286

The correlation is practically nil.

12.5 Probable Error

The probable error measures the extent to which the correlation coefficient is dependable. It is an old measure for testing the reliability of r, given by

P.E. = 0.6745 (1 - r²) / √n

where r is measured from a sample of size n.

It is used to:

i) Interpret the value of r:
a) If r < P.E., then r is not at all significant.
b) If r > 6 P.E., then r is highly significant.
c) If P.E. < r < 6 P.E., we cannot say anything about the significance of r.

ii) Construct confidence limits within which the population correlation ρ is expected to lie.

Conditions under which the P.E. can be used:

1. Samples should be drawn from a normal population.
2. The value of r must be determined from sample values.
3. Samples must have been selected at random.

Example 6: If r = 0.6 and N = 64, a) interpret r, and b) find the limits within which ρ is supposed to lie.

Solution:

P.E. = 0.6745 (1 - 0.6²) / √64 = 0.6745 × 0.64 / 8 = 0.054

a) 6 P.E. = 6 × 0.054 = 0.324. Since r (0.6) > 6 P.E., r is highly significant.

b) Limits for the population ρ: 0.6 ± 0.054


= 0.546 to 0.654
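As a quick check of Example 6, the probable error can be computed in a couple of lines of plain Python (a sketch; the function name is my own):

```python
from math import sqrt

def probable_error(r, n):
    # P.E. = 0.6745 * (1 - r^2) / sqrt(n)
    return 0.6745 * (1 - r * r) / sqrt(n)

pe = probable_error(0.6, 64)
print(round(pe, 3))      # 0.054
print(round(6 * pe, 3))  # 0.324 -> r = 0.6 > 6 P.E., so r is highly significant
```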

12.6 Spearman's Rank Correlation Coefficient

Karl Pearson's correlation coefficient assumes that:

i) Samples are drawn from a normal population.
ii) The variables under study are affected by a large number of independent causes, so as to form a normal distribution.

When we do not know the shape of the population distribution, or when the data are of a qualitative type, Spearman's rank correlation coefficient is used to measure the relationship. It is defined as

ρ = 1 - 6∑D² / (N³ - N)

where D is the difference between the ranks assigned to the variables and N is the number of pairs. The value of ρ lies between -1 and +1, and its interpretation is the same as that of Karl Pearson's correlation coefficient.

There are three types of problems:

i. Ranks are already assigned.
ii. Ranks are to be assigned and there are no ties.
iii. There are ties between ranks.

i. When ranks are already assigned

Example 7: In a singing competition, two judges assigned the following ranks to 7 competitors. Find Spearman's rank correlation coefficient.

Competitor: 1 2 3 4 5 6 7
Judge I:    5 6 4 3 2 7 1
Judge II:   6 4 5 1 2 7 3

Solution:

Competitor   R1   R2   D = R1 - R2   D²
1             5    6   -1             1
2             6    4   +2             4
3             4    5   -1             1
4             3    1   +2             4
5             2    2    0             0



6             7    7    0             0
7             1    3   -2             4

∑D² = 14

ρ = 1 - 6∑D² / (N³ - N) = 1 - (6 × 14) / (7 × 48) = 1 - 84/336 = 0.75

ii. When ranks are to be assigned and there are no ties

Example 8: Rank difference coefficient of correlation (case of no ties)

Student   Score on Test I (X)   Score on Test II (Y)   Rank on Test I (R1)   Rank on Test II (R2)   D = R1 - R2   D²
A         16                     8                     2                     5                      -3             9
B         14                    14                     3                     3                       0             0
C         18                    12                     1                     4                      -3             9
D         10                    16                     4                     2                      +2             4
E          2                    20                     5                     1                      +4            16

N = 5   ∑D² = 38

Applying the rank correlation formula:

ρ = 1 - 6∑D² / (N³ - N) = 1 - (6 × 38) / (5³ - 5) = 1 - 228/120 = -0.9

The relationship between the scores on Test I and Test II is very high and inverse.
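When the ranks are given and there are no ties, the formula reduces to a few lines of code. A plain-Python sketch (the function name is my own), applied to the ranks of Example 8:

```python
def spearman_no_ties(r1, r2):
    # rho = 1 - 6 * sum(D^2) / (N^3 - N), D = difference between paired ranks
    n = len(r1)
    d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))
    return 1 - 6 * d2 / (n ** 3 - n)

rho = spearman_no_ties([2, 3, 1, 4, 5], [5, 3, 4, 2, 1])  # ranks from Example 8
print(round(rho, 2))  # -0.9
```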

Example 9: The sales statistics of 6 sales representatives in two different localities are given below. Find whether there is a relationship between the buying habits of the people in the two localities.

Representative: 1   2   3   4   5   6
Locality I:     70  40  65  110 60  20
Locality II:    70  30  80  100 90  20

Solution: (ranking the sales within each locality)

Representative   R1 (Locality I)   R2 (Locality II)   D = R1 - R2   D²
1                2                 4                  -2             4
2                5                 5                   0             0
3                3                 3                   0             0
4                1                 1                   0             0



5                4                 2                  +2             4
6                6                 6                   0             0

∑D = 0   ∑D² = 8

ρ = 1 - 6∑D² / (N³ - N) = 1 - (6 × 8) / (6 × (6² - 1)) = 1 - 48/210 = 0.77

There is a high positive correlation between the buying habits of the people in the two localities.

iii. When ranks are repeated (ties)

Example 10: Find the rank correlation coefficient for the following data.

Student:           A   B   C   D   E   F   G   H   I   J
Score on Test I:   20  30  22  28  32  40  20  16  14  18
Score on Test II:  32  32  48  36  44  48  28  20  24  28

Solution: Tied scores share the average of the ranks they would otherwise occupy.

Student   X    Y    R1    R2    D      D²
A         20   32   6.5   5.5   +1.0    1.00
B         30   32   3     5.5   -2.5    6.25
C         22   48   5     1.5   +3.5   12.25
D         28   36   4     4      0      0
E         32   44   2     3     -1.0    1.00
F         40   48   1     1.5   -0.5    0.25
G         20   28   6.5   7.5   -1.0    1.00
H         16   20   9     10    -1.0    1.00
I         14   24   10    9     +1.0    1.00
J         18   28   8     7.5   +0.5    0.25

N = 10   ∑D² = 24

When ranks are tied, a correction term (mi³ - mi)/12 is added to ∑D² for each group of mi tied ranks:

ρ = 1 - 6[∑D² + (1/12)(m1³ - m1) + (1/12)(m2³ - m2) + …] / (N³ - N)

where mi represents the number of times a rank is repeated. Here there are four tied pairs (one in Test I, three in Test II), each contributing (2³ - 2)/12 = 0.5:

ρ = 1 - 6[24 + 0.5 + 0.5 + 0.5 + 0.5] / (10 (10² - 1)) = 1 - 156/990 = 0.842
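The tie-corrected coefficient can be sketched in plain Python (the function names are my own), assigning average ranks to tied scores and applying the standard (m³ - m)/12 correction; the data are those of Example 10:

```python
def rank_with_ties(scores):
    # rank 1 = highest score; tied scores share the average of their ranks
    order = sorted(scores, reverse=True)
    ranks = []
    for s in scores:
        first = order.index(s)   # 0-based position of the first occurrence
        m = order.count(s)       # size of the tied group
        ranks.append(first + (m + 1) / 2)
    return ranks

def spearman_tied(x, y):
    n = len(x)
    rx, ry = rank_with_ties(x), rank_with_ties(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    # one correction term (m^3 - m)/12 for each group of m equal scores
    correction = sum((v.count(s) ** 3 - v.count(s)) / 12
                     for v in (x, y) for s in set(v))
    return 1 - 6 * (d2 + correction) / (n ** 3 - n)

test1 = [20, 30, 22, 28, 32, 40, 20, 16, 14, 18]
test2 = [32, 32, 48, 36, 44, 48, 28, 20, 24, 28]
print(round(spearman_tied(test1, test2), 3))  # 0.842
```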

Testing of Correlation

The t-test is used to test a correlation coefficient. Consider the heights and weights of a random sample of six adults:

Height (cm): 170 175 176 178 183 185
Weight (kg): 57  64  70  76  71  82

It is reasonable to assume that these variables are normally distributed, so the Karl Pearson correlation coefficient is the appropriate measure of the degree of association between height and weight. Here r = 0.875.

Hypothesis test for Pearson's population correlation coefficient:

H0: ρ = 0 (no correlation between the variables in the population)
H1: ρ > 0 (positive correlation in the population: increasing height is associated with increasing weight)

A 5% significance level is taken. The test statistic is

t = r √((n - 2) / (1 - r²)) = 0.875 √(4 / (1 - 0.875²)) = 3.61

The table value at the 5% significance level with 6 - 2 = 4 degrees of freedom is 2.132. Since the calculated value is more than the table value, the null hypothesis is rejected: there is a significant positive correlation between height and weight.
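The test statistic can be checked with a one-line function (a sketch; the name is my own):

```python
from math import sqrt

def t_stat(r, n):
    # t = r * sqrt((n - 2) / (1 - r^2)) with n - 2 degrees of freedom
    return r * sqrt((n - 2) / (1 - r * r))

print(round(t_stat(0.875, 6), 2))  # 3.61 > 2.132 (5% point, 4 df): reject H0
```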

12.7 Partial Correlation

Partial correlation is used in situations where three or four variables are involved, for example age, height and weight. The correlation between height and weight can be computed by keeping age constant, since age may be an important factor influencing the strength of the relationship between height and weight. The effect of one variable is partialled out from the correlation between the other two variables; this statistical technique is known as partial correlation.

The correlation between variables x and y is denoted rxy. Partial correlation is denoted by the symbol r12.3: the correlation between variables 1 and 2, keeping the 3rd variable constant.



r12.3 = (r12 - r13 r23) / (√(1 - r13²) × √(1 - r23²))

where
r12.3 = partial correlation between variables 1 and 2, keeping the 3rd constant
r12 = correlation between variables 1 and 2
r13 = correlation between variables 1 and 3
r23 = correlation between variables 2 and 3

Similarly,

r13.2 = (r13 - r12 r23) / (√(1 - r12²) × √(1 - r23²))

r23.1 = (r23 - r12 r13) / (√(1 - r12²) × √(1 - r13²))
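A plain-Python sketch of the first-order formula (the function name is my own, and the zero-order values below are made up purely for illustration, not taken from the text):

```python
from math import sqrt

def partial_r(r12, r13, r23):
    # r12.3: correlation of variables 1 and 2 with variable 3 held constant
    return (r12 - r13 * r23) / (sqrt(1 - r13 ** 2) * sqrt(1 - r23 ** 2))

print(round(partial_r(0.9, 0.5, 0.4), 3))  # 0.882
```

Permuting the subscripts in the same way gives r13.2 and r23.1.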

Self Assessment Questions

1. From the following data, calculate the correlation between variables 1 and 2, keeping the 3rd constant:
r12 = 0.7; r13 = 0.6; r23 = 0.4

2. Calculate r23.1 and r13.2 from the following:
r12 = 0.60; r13 = 0.51; r23 = 0.40

3. Given the zero-order correlation coefficients r12 = 0.8, r13 = 0.6 and r23 = 0.5, calculate the partial correlation between variables 1 and 3, keeping the 2nd constant. Interpret your result.

12.8 Multiple Correlation

Three or more variables are involved in multiple correlation. The dependent variable is denoted by X1 and the other variables by X2, X3, etc. Gupta S.P. has expressed that "the coefficient of multiple linear correlation is represented by R1 and it is common to add subscripts designating the variables involved. Thus R1.234 would represent the coefficient of multiple linear correlation between X1 on the one hand and X2, X3 and X4 on the other. The subscript of the dependent variable is always to the left of the point."

The coefficients of multiple correlation in terms of r12, r13 and r23 can be expressed as:

R1.23 = √[(r12² + r13² - 2 r12 r13 r23) / (1 - r23²)]

R2.13 = √[(r12² + r23² - 2 r12 r13 r23) / (1 - r13²)]

R3.12 = √[(r13² + r23² - 2 r12 r13 r23) / (1 - r12²)]



The coefficient of multiple correlation R1.23 is the same as R1.32.

A coefficient of multiple correlation lies between 0 and 1. If it is 1, the correlation is perfect; if it is 0, there is no linear relationship between the variables. Coefficients of multiple correlation are always positive in sign and range from 0 to +1.

The coefficient of multiple determination can be obtained by squaring R1.23. An alternative formula for computing R1.23 is:

R1.23 = √[r12² + r13.2² (1 - r12²)]   or   R²1.23 = r12² + r13.2² (1 - r12²)

Similarly, alternative formulas for R1.24 and R1.34 can be written.

The following formula can be used to determine a multiple correlation coefficient with three independent variables:

R1.234 = √[1 - (1 - r12²)(1 - r13.2²)(1 - r14.23²)]

Multiple correlation analysis measures the relationship between the given variables: the degree of association between one variable taken as the dependent variable and a group of other variables taken as the independent variables.

Example 11: The following zero-order correlation coefficients are given: r12 = 0.98, r13 = 0.44, r23 = 0.54. Calculate the multiple correlation coefficient treating the first variable as dependent and the second and third variables as independent. (Source: Gupta S.P., Statistical Methods.)

Solution: The first variable is dependent; the second and third variables are independent. Using the formula for R1.23:

R1.23 = √[(0.98² + 0.44² - 2 × 0.98 × 0.44 × 0.54) / (1 - 0.54²)] = 0.986
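The computation of Example 11 can be reproduced directly (a sketch; the function name is my own):

```python
from math import sqrt

def multiple_R1_23(r12, r13, r23):
    # R1.23 = sqrt((r12^2 + r13^2 - 2*r12*r13*r23) / (1 - r23^2))
    return sqrt((r12 ** 2 + r13 ** 2 - 2 * r12 * r13 * r23) / (1 - r23 ** 2))

print(round(multiple_R1_23(0.98, 0.44, 0.54), 3))  # 0.986
```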

State whether the following are True or False:

1. A scatter diagram does not give us a quantitative measure of the correlation coefficient.
2. Correlation studies estimate the values of one variable from knowledge of the other.
3. The correlation coefficient is an absolute measure.
4. The correlation coefficient is the geometric mean of the regression coefficients.


5. The regression lines pass through (X̄, Ȳ).
6. byx = r.
7. The higher the angle between the regression lines, the lower is the correlation coefficient.
8. The correlation studied between height and weight, keeping age constant, is a partial correlation.

12.9 Regression

Regression is defined as "the measure of the average relationship between two or more variables in terms of the original units of the data."

Correlation analysis studies the relationship between the two variables x and y; regression analysis attempts to predict the average value of one variable for a given value of the other. In regression we attempt to quantify the dependence of one variable on the other. Example: there are two variables x and y, and y depends on x; the dependence is expressed in the form of an equation.

12.9.1 Regression Analysis

Regression analysis is used to estimate the values of the dependent variable from the values of the independent variables, and to get a measure of the error involved in using the regression line as a basis for estimation. The regression coefficients are also used to calculate the correlation coefficient: the square of the correlation coefficient (r²) measures the proportion of variance shared between the given two variables.

12.9.2 Regression Lines

For a set of paired observations there exist two straight lines. The line drawn such that the sum of vertical deviations is zero and the sum of their squares is minimum is called the regression line of y on x; it is used to estimate y-values for given x-values. The line drawn such that the sum of horizontal deviations is zero and the sum of their squares is minimum is called the regression line of x on y; it is used to estimate x-values for given y-values. The smaller the angle between these lines, the higher is the correlation between the variables. The regression lines always intersect at (X̄, Ȳ).

The regression lines have the equations:

(i) Regression equation of y on x:  Y - Ȳ = byx (X - X̄)

(ii) Regression equation of x on y:  X - X̄ = bxy (Y - Ȳ)


where

byx = (N∑dxdy - (∑dx)(∑dy)) / (N∑dx² - (∑dx)²)

and

bxy = (N∑dxdy - (∑dx)(∑dy)) / (N∑dy² - (∑dy)²)

Regression equations found by the above conditions are said to be fitted by the method of least squares. byx and bxy are called regression coefficients.

12.9.3 About Regression Coefficients

- byx · bxy = r², so that r = ±√(byx · bxy)
- byx · bxy ≤ 1
- If byx is negative, then bxy is also negative and r is negative.
- They can also be expressed as byx = r (σy/σx) and bxy = r (σx/σy).
- A regression coefficient is an absolute measure (it has the units of the data attached to it).

12.9.4 Differences Between Correlation Coefficient and Regression Coefficients

Correlation coefficient                      Regression coefficients
rxy = ryx                                    byx ≠ bxy
-1 ≤ r ≤ +1                                  byx can be greater than one, but then bxy
                                             must be less than one so that byx · bxy ≤ 1
It has no units attached to it               It has units attached to it
Nonsense correlation can exist               There is no such nonsense regression
It is not based on a cause-and-effect        It is based on a cause-and-effect
relationship                                 relationship
It indirectly helps in estimation            It is meant for estimation



12.9.5 Examples

Example 11: Find the regression equations from the following data, and hence calculate the correlation coefficient.

Age of Husband (x): 18 19 20 21 22 23 24 25 26 27
Age of Wife (y):    17 17 18 18 19 19 19 20 21 22

Solution: (taking assumed means 22 for x and 19 for y)

x     dx = x - 22   dx²   y     dy = y - 19   dy²   dx·dy
18    -4            16    17    -2            4      8
19    -3             9    17    -2            4      6
20    -2             4    18    -1            1      2
21    -1             1    18    -1            1      1
22     0             0    19     0            0      0
23     1             1    19     0            0      0
24     2             4    19     0            0      0
25     3             9    20     1            1      3
26     4            16    21     2            4      8
27     5            25    22     3            9     15

∑x = 225   ∑dx = 5   ∑dx² = 85   ∑y = 190   ∑dy = 0   ∑dy² = 24   ∑dx·dy = 43

X̄ = 225/10 = 22.5,   Ȳ = 190/10 = 19

byx = (N∑dxdy - ∑dx∑dy) / (N∑dx² - (∑dx)²) = (10 × 43 - 5 × 0) / (10 × 85 - 5²) = 430/825 = 0.521

Regression equation of y on x:

Y - Ȳ = byx (X - X̄)
Y - 19 = 0.521 (X - 22.5)
Y = 0.521X + 7.2775

bxy = (N∑dxdy - ∑dx∑dy) / (N∑dy² - (∑dy)²) = (10 × 43 - 5 × 0) / (10 × 24 - 0²) = 430/240 = 1.792

Regression equation of x on y:

X - X̄ = bxy (Y - Ȳ)
X - 22.5 = 1.792 (Y - 19)
X = 1.792Y - 11.548

r = √(byx · bxy) = √(0.521 × 1.792) = 0.966
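The same fit can be sketched in plain Python (the function name is my own), using deviations from the actual means, which is algebraically equivalent to the assumed-mean form used above:

```python
def regression_coeffs(xs, ys):
    # byx = S(xy)/S(x^2), bxy = S(xy)/S(y^2), sums of deviations from the means
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    byx = sxy / sum((x - mx) ** 2 for x in xs)  # slope of Y on X
    bxy = sxy / sum((y - my) ** 2 for y in ys)  # slope of X on Y
    return byx, bxy

X = [18, 19, 20, 21, 22, 23, 24, 25, 26, 27]  # age of husband
Y = [17, 17, 18, 18, 19, 19, 19, 20, 21, 22]  # age of wife
byx, bxy = regression_coeffs(X, Y)
print(round(byx, 3), round(bxy, 3))   # 0.521 1.792
print(round((byx * bxy) ** 0.5, 3))   # r = 0.966
```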


Example 12: In a correlation study we have the following data:

                          Series X   Series Y
Mean                      65         67
S.D.                      2.5        3.5
Correlation coefficient   0.8

Find the two regression equations.

Solution:

Regression equation of y on x:

Y - Ȳ = r (σy/σx) (X - X̄)
Y - 67 = 0.8 (3.5/2.5) (X - 65)
Y = 1.12X - 5.8

Regression equation of x on y:

X - X̄ = r (σx/σy) (Y - Ȳ)
X - 65 = 0.8 (2.5/3.5) (Y - 67)
X = 0.57Y + 26.72
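When only the means, standard deviations and r are given, as in Example 12, both lines follow mechanically. A plain-Python sketch (the function name is my own):

```python
def regression_lines(mx, my, sx, sy, r):
    byx = r * sy / sx      # slope of Y on X
    bxy = r * sx / sy      # slope of X on Y
    ay = my - byx * mx     # intercept: Y = ay + byx * X
    ax = mx - bxy * my     # intercept: X = ax + bxy * Y
    return (ay, byx), (ax, bxy)

y_line, x_line = regression_lines(65, 67, 2.5, 3.5, 0.8)
print([round(v, 2) for v in y_line])  # [-5.8, 1.12]   i.e. Y = 1.12X - 5.8
print([round(v, 2) for v in x_line])  # [26.71, 0.57]  i.e. X = 0.57Y + 26.71
```

Rounding bxy to 0.57 before substituting, as the worked solution does, gives the slightly different intercept 26.72.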

12.10 Standard Error of Estimate

The standard error of estimate measures the accuracy of the estimated figures in regression analysis. If the value of the standard error of estimate is small, the estimates provided by the regression equation are better and closer to the actual values; if it is zero, there is no variation about the line and the correlation is perfect. The standard error of estimate is used to ascertain how good and representative the regression line is as a description of the average relationship between the two series.



The standard error of regression of X values from Xc is:

Sx.y = √[∑(X - Xc)² / N],   also   Sx.y = σx √(1 - r²)   and   Sx.y = √[(∑X² - a∑X - b∑XY) / N]

Similarly, the standard error of regression of Y values from Yc is:

Sy.x = √[∑(Y - Yc)² / N]
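The variance form is the quickest to evaluate when the summary measures are known. A plain-Python sketch (the function name is my own; the values σ = 15 and r = 0.42 are taken purely as an illustration):

```python
from math import sqrt

def se_of_estimate(sigma, r):
    # S = sigma * sqrt(1 - r^2): standard error of estimate from the S.D. and r
    return sigma * sqrt(1 - r ** 2)

print(round(se_of_estimate(15, 0.42), 2))  # 13.61
```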

Example 13

1. The following results were worked out from scores in Statistics and Mathematics in a certain examination:

                     Scores in Statistics (X)   Scores in Mathematics (Y)
Mean                 40                         48
Standard Deviation   10                         15

Karl Pearson's correlation coefficient between X and Y is +0.42. Find the regression lines of x on y and y on x, and use them to find the value of Y when X = 50 and the value of X when Y = 30.

Solution: Given X̄ = 40, Ȳ = 48, σx = 10, σy = 15, r = 0.42.

The regression line of x on y is (X - X̄) = r (σx/σy) (Y - Ȳ) …(1)
The regression line of y on x is (Y - Ȳ) = r (σy/σx) (X - X̄) …(2)

Substituting the values, the respective equations are:

X = 0.28Y + 26.56 …(3)
Y = 0.63X + 22.80 …(4)

Therefore, when Y = 30, X = 34.96 by equation (3), and when X = 50, Y = 54.3 by equation (4).

2. From the following data obtain the two regression equations:

X: 12 4 20 8 16


Y: 18 22 10 16 14

Estimate Y for X = 15 and estimate X for Y = 20.

Solution:

X̄ = (12 + 4 + 20 + 8 + 16) / 5 = 12 = mean of X
Ȳ = (18 + 22 + 10 + 16 + 14) / 5 = 16 = mean of Y

X    Y    x = X - X̄   y = Y - Ȳ   x²    y²    xy
12   18    0           +2           0     4      0
 4   22   -8           +6          64    36    -48
20   10   +8           -6          64    36    -48
 8   16   -4            0          16     0      0
16   14   +4           -2          16     4     -8

∑x² = 160   ∑y² = 80   ∑xy = -104

byx = ∑xy / ∑x² = -104/160 = -0.65
bxy = ∑xy / ∑y² = -104/80 = -1.3

Regression equation of X on Y:

(X - X̄) = bxy (Y - Ȳ)
X - 12 = -1.3 (Y - 16)
X = 32.8 - 1.3Y

When Y = 20, X = 32.8 - 1.3 × 20 = 6.8

Regression equation of Y on X:

(Y - Ȳ) = byx (X - X̄)
Y - 16 = -0.65 (X - 12)
Y = 23.8 - 0.65X

When X = 15, Y = 23.8 - 0.65 × 15 = 14.05


12.11 Multiple Regression Analysis

Multiple regression analysis is an extension of two-variable regression analysis. In this analysis, two or more independent variables are used to estimate the values of a dependent variable, instead of one independent variable.

The objectives of multiple regression analysis are:

- To derive an equation which provides estimates of the dependent variable from the values of two or more independent variables.
- To obtain a measure of the error involved in using the regression equation as a basis of estimation.
- To obtain a measure of the proportion of variance in the dependent variable accounted for, or explained by, the independent variables.

The multiple regression equation describes the average relationship between the given variables, and this relationship is used to estimate the dependent variable.

Example 14: Estimating the dependent variable X1 from the independent variables X2, X3, … is known as the regression of X1 on X2, X3, …

The regression equation, when three variables are involved, is:

X1.23 = a1.23 + b12.3 X2 + b13.2 X3

where

X1.23 = estimated value of the dependent variable
X2 and X3 = independent variables
a1.23 = constant: the intercept made by the regression plane. It gives the value of the dependent variable when all the independent variables assume the value zero.
b12.3 and b13.2 = partial (net) regression coefficients. b12.3 measures the amount by which a unit change in X2 is expected to affect X1 when X3 is held constant.

Deviations taken from actual means: writing x1 = X1 - X̄1, x2 = X2 - X̄2 and x3 = X3 - X̄3, the equation becomes

x1.23 = b12.3 x2 + b13.2 x3

b12.3 and b13.2 can be obtained by solving the normal equations:

∑x1x2 = b12.3 ∑x2² + b13.2 ∑x2x3
∑x1x3 = b12.3 ∑x2x3 + b13.2 ∑x3²
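The two normal equations form a 2 × 2 linear system that can be solved by elimination. A plain-Python sketch (the function name and the illustrative deviation sums are my own, not from the text):

```python
def solve_normal_eqs(s12, s13, s22, s33, s23):
    # solves  s12 = b1*s22 + b2*s23
    #         s13 = b1*s23 + b2*s33   for the coefficients (b1, b2)
    det = s22 * s33 - s23 ** 2
    b1 = (s12 * s33 - s13 * s23) / det
    b2 = (s13 * s22 - s12 * s23) / det
    return b1, b2

# illustrative deviation sums: sum x1x2, sum x1x3, sum x2^2, sum x3^2, sum x2x3
b1, b2 = solve_normal_eqs(30.0, 24.0, 20.0, 18.0, 6.0)
print(round(b1, 3), round(b2, 3))  # 1.222 0.926
```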


The partial regression coefficients can also be expressed in terms of the zero-order correlations and the standard deviations S1, S2 and S3:

b12.3 = [(r12 - r13 r23) / (1 - r23²)] (S1/S2)   and   b13.2 = [(r13 - r12 r23) / (1 - r23²)] (S1/S3)

For the regression equation of X3 on X2 and X1:

b32.1 = [(r23 - r13 r12) / (1 - r12²)] (S3/S2)   and   b31.2 = [(r13 - r23 r12) / (1 - r12²)] (S3/S1)

12.12 Reliability of Estimates

Reliability of estimates tests whether an estimated value obtained by applying the regression equation is close to the actual observed value. The standard error is used to measure this closeness: it is an average of the deviations of the actual values of the dependent variable from the estimates given by the regression equation. Determining the accuracy of estimates from a multiple regression in this way is the reliability of estimates; the measure is also known as the standard error of estimate.

The standard error of estimate of X1 on X2 and X3 is:

S1.23 = √[∑(X1 - X1c)² / (N - 3)]

where

S1.23 = standard error of estimate of X1 on X2 and X3
X1c = estimated value of X1 as calculated from the regression equation

12.13 Application of Multiple Regression

Multiple regression can be applied to test how factors such as export elasticity, import elasticity and structural change (the contribution of the manufacturing sector to GDP) influence employment; here employment is the dependent variable. Similarly, researchers can apply multiple regression in their own research work where appropriate.

Self Assessment Questions

State whether the following are True or False:

a. The correlation coefficient is the geometric mean of the regression coefficients.
b. The regression lines pass through (X̄, Ȳ).
c. byx = r × (S.D. of X / S.D. of Y).



d. The higher the angle between the regression lines, the lower is the correlation coefficient.

12.14 Summary

In this unit we studied the concepts of correlation and regression and their different types. We saw how regression helps us to estimate unknown values of a variable with the help of known values, and how a reliability measure can be established for the estimated values.

Terminal Questions

1. Test the significance of the correlation based on i) 10 and ii) 100 observations, for r = 0.4 and r = 0.9.

2. The following table gives the marks obtained by 10 students in Commerce and Statistics. Calculate the rank correlation.

Marks in Statistics: 35 90 70 40 95 45 60 85 80 50
Marks in Commerce:   45 70 65 30 90 40 50 75 85 60

3. Calculate Spearman's rank correlation coefficient between series A and B given below:

Series A: 57  59  62  63  64  65  55  58  57
Series B: 113 117 126 126 130 129 111 116 112

4. Obtain the two lines of regression and estimate the blood pressure when age is 50 years.

Age (X) in yrs: 56  42  72  39  63  47  52  49  40  42  68  60
B.P. (Y):       127 112 140 118 129 116 130 125 115 120 135 133

5. The following results were worked out from scores in Statistics and Mathematics in a certain examination:

                     Scores in Statistics (X)   Scores in Mathematics (Y)
Mean                 39.5                       47.5
Standard Deviation   10.8                       17.8


Karl Pearson's correlation coefficient between X and Y = 0.42. Find both regression lines. Use these lines to estimate the value of Y when X = 50 and the value of X when Y = 30.

Answers to Self Assessment Questions

True/False questions (section 12.8): 1) True 2) False 3) False 4) True
True/False questions (section 12.13): 1) True 2) True 3) False 4) True

Answers to Terminal Questions

1. i) Not significant; the other combinations are highly significant.
2. 0.903
3. 0.967
4. X = -95 + 1.184Y; Y = 87.2 + 0.724X
5. X = 27.62 + 0.25Y; Y = 20.24 + 0.69X