Chapter 10 Correlation and Regression

Feb 05, 2016

Transcript
Page 1: Chapter 10 Correlation and Regression

Chapter 10: Correlation and Regression

We deal with two variables, x and y.

Main goal:

Investigate how x and y are related, or correlated; how much they depend on each other.

Page 2: Chapter 10 Correlation and Regression

Example

• x is the height of mother

• y is the height of daughter

Question: are the heights of daughters independent of the heights of their mothers? Or is there a correlation between the heights of mothers and those of daughters? If yes, how strong is it?

Page 3: Chapter 10 Correlation and Regression

Example:

This table includes a random sample of heights of mothers, fathers, and their daughters.

Heights of mothers and their daughters in this sample seem to be strongly correlated…

But heights of fathers and their daughters in this sample seem to be weakly correlated (if at all).

Page 4: Chapter 10 Correlation and Regression

Section 10-2

Correlation between two variables (x and y)

Page 5: Chapter 10 Correlation and Regression

Definition

A correlation exists between two variables when the values of one variable are associated in some way with the values of the other.

Page 6: Chapter 10 Correlation and Regression

Key Concept

Linear correlation coefficient, r, is a numerical measure of the strength of the linear relationship between two variables, x and y, representing quantitative data.

Then we use that value to conclude that there is (or is not) a linear correlation between the two variables.

Note: r always lies in the closed interval [–1, 1], i.e., –1 ≤ r ≤ 1.

Page 7: Chapter 10 Correlation and Regression

Exploring the Data

We can often see a relationship between two variables by constructing a scatterplot.

Page 8: Chapter 10 Correlation and Regression

Scatterplots of Paired Data

Page 9: Chapter 10 Correlation and Regression

Scatterplots of Paired Data

Page 10: Chapter 10 Correlation and Regression

Scatterplots of Paired Data

Page 11: Chapter 10 Correlation and Regression

Requirements

1. The sample of paired (x, y) data is a random sample of quantitative data.

2. Visual examination of the scatterplot must confirm that the points approximate a straight-line pattern.

3. Any outliers must be removed if they are known to be errors. (We will not do that in this course…)

Page 12: Chapter 10 Correlation and Regression

Notation for the Linear Correlation Coefficient

n = number of pairs of sample data

Σ denotes the addition of the items indicated.

Σx denotes the sum of all x-values.

Σx² indicates that each x-value should be squared and then those squares added.

(Σx)² indicates that the x-values should be added and then the total squared.

Page 13: Chapter 10 Correlation and Regression

Notation for the Linear Correlation Coefficient

Σxy indicates that each x-value should first be multiplied by its corresponding y-value. After obtaining all such products, find their sum.

r = linear correlation coefficient for sample data.

ρ = linear correlation coefficient for population data, i.e., the linear correlation between two populations.

Page 14: Chapter 10 Correlation and Regression

Formula

r = [n(Σxy) – (Σx)(Σy)] / ( √[n(Σx²) – (Σx)²] · √[n(Σy²) – (Σy)²] )

The linear correlation coefficient r measures the strength of a linear relationship between the paired values in a sample.

We should use computer software or a calculator to compute r.
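For readers following along in software, the sums formula above can be sketched in Python; the function name and the sample values are my own, not from the text:

```python
from math import sqrt

def linear_correlation(xs, ys):
    """r = [n(Sxy) - Sx*Sy] / (sqrt(n*Sxx - Sx^2) * sqrt(n*Syy - Sy^2))."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)              # sum of x^2
    syy = sum(y * y for y in ys)              # sum of y^2
    sxy = sum(x * y for x, y in zip(xs, ys))  # sum of x*y
    return (n * sxy - sx * sy) / (
        sqrt(n * sxx - sx ** 2) * sqrt(n * syy - sy ** 2))

# A perfectly linear sample gives r = 1 (up to floating-point rounding):
r = linear_correlation([1, 2, 3, 4], [2, 4, 6, 8])
```

In practice one checks such a hand computation against software output, as the slide suggests.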

Page 15: Chapter 10 Correlation and Regression

Linear correlation by TI-83/84

• Enter x-values into list L1 and y-values into list L2
• Press STAT and select TESTS
• Scroll down to LinRegTTest, press ENTER
• Make sure that XList: L1 and YList: L2
• Choose β & ρ: ≠ 0
• Press Calculate
• Read r² = … and r = …
• Also read the P-value p = …

Page 16: Chapter 10 Correlation and Regression

Interpreting r

Using Table A-6:

If the absolute value of the computed value of r, denoted |r|, exceeds the value in Table A-6, conclude that there is a linear correlation.

Otherwise, there is not sufficient evidence to support the conclusion of a linear correlation.

Note: In most cases we use the significance level α = 0.05 (the middle column of Table A-6).

Page 17: Chapter 10 Correlation and Regression

Interpreting r

Using P-value computed by calculator:

If the P-value is ≤ α, conclude that there is a linear correlation.

Otherwise, there is not sufficient evidence to support the conclusion of a linear correlation.

Note: In most cases we use the significance level α = 0.05.
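Behind the calculator's P-value is the test statistic t = r·√(n – 2)/√(1 – r²), which follows a Student t distribution with n – 2 degrees of freedom when ρ = 0. A minimal sketch (the helper name is mine; the values match the paper/glass example later in the chapter):

```python
from math import sqrt

def t_statistic(r, n):
    """Test statistic for H0: rho = 0; it has n - 2 degrees of freedom."""
    return r * sqrt((n - 2) / (1 - r * r))

# For r = 0.117 and n = 62 pairs, t is about 0.91, which is far from
# significant at alpha = 0.05; the calculator reports the matching P-value.
t = t_statistic(0.117, 62)
```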

Page 18: Chapter 10 Correlation and Regression

Caution

Know that the methods of this section apply only to a linear correlation.

If you conclude that there is no linear correlation, it is possible that there is some other association that is not linear.

Page 19: Chapter 10 Correlation and Regression

Rounding the Linear Correlation Coefficient r

Round r to three decimal places so that it can be compared to the critical values in Table A-6.

Use a calculator or computer if possible.

Page 20: Chapter 10 Correlation and Regression

Properties of the Linear Correlation Coefficient r

1. –1 ≤ r ≤ 1

2. If all values of either variable are converted to a different scale, the value of r does not change.

3. The value of r is not affected by the choice of x and y. Interchange all x- and y-values and the value of r will not change.

4. r measures the strength of a linear relationship.

5. r is very sensitive to outliers; they can dramatically affect its value.

Page 21: Chapter 10 Correlation and Regression

Example:

The paired pizza/subway fare costs from Table 10-1 are shown here in a scatterplot. Find the value of the linear correlation coefficient r for the paired sample data.

Page 22: Chapter 10 Correlation and Regression

Example - 1:

Using software or a calculator, r is automatically calculated:

Page 23: Chapter 10 Correlation and Regression

Interpreting the Linear Correlation Coefficient r

Critical Values from Table A-6 and the Computed Value of r

Page 24: Chapter 10 Correlation and Regression

Example - 2:

Using a 0.05 significance level, interpret the value of r = 0.117 found using the 62 pairs of weights of discarded paper and glass listed in Data Set 22 in Appendix B.

Is there sufficient evidence to support a claim of a linear correlation between the weights of discarded paper and glass?

Page 25: Chapter 10 Correlation and Regression

Example:

Using Table A-6 to Interpret r:

If we refer to Table A-6 with n = 62 pairs of sample data, we obtain the critical value of 0.254 (approximately) for α = 0.05. Because |0.117| does not exceed the value of 0.254 from Table A-6, we conclude that there is not sufficient evidence to support a claim of a linear correlation between weights of discarded paper and glass.

Page 26: Chapter 10 Correlation and Regression

Interpreting r: Explained Variation

The value of r2 is the proportion of the variation in y that is explained by the linear relationship between x and y.

Page 27: Chapter 10 Correlation and Regression

Example:

Using the pizza/subway fare costs, we have found that the linear correlation coefficient is r = 0.988. What proportion of the variation in the subway fare can be explained by the variation in the costs of a slice of pizza?

With r = 0.988, we get r² = 0.976.

We conclude that 0.976 (or about 98%) of the variation in the cost of subway fares can be explained by the linear relationship between the costs of pizza and subway fares. This implies that about 2% of the variation in the cost of subway fares cannot be explained by the costs of pizza.

Page 28: Chapter 10 Correlation and Regression

Common Errors Involving Correlation

1. Causation: It is wrong to conclude that correlation implies causality.

2. Linearity: There may be some relationship between x and y even when there is no linear correlation.

Page 29: Chapter 10 Correlation and Regression

Caution

Know that correlation does not imply causality.

There may be correlation without causality.

Page 30: Chapter 10 Correlation and Regression

Section 10-3

Regression

Page 31: Chapter 10 Correlation and Regression

Regression

The typical equation of a straight line, y = mx + b, is expressed in the form ŷ = b0 + b1x, where b0 is the y-intercept and b1 is the slope.

The regression equation expresses a relationship between x (called the explanatory variable, predictor variable, or independent variable) and ŷ (called the response variable or dependent variable).

Page 32: Chapter 10 Correlation and Regression

Definitions

Regression Equation

Given a collection of paired data, the regression equation

ŷ = b0 + b1x

algebraically describes the relationship between the two variables.

Regression Line

The graph of the regression equation is called the regression line (or line of best fit, or least squares line).

Page 33: Chapter 10 Correlation and Regression

Example:

Page 34: Chapter 10 Correlation and Regression

Notation for Regression Equation

                                        Population Parameter    Sample Statistic
y-intercept of regression equation      β0                      b0
Slope of regression equation            β1                      b1
Equation of the regression line         y = β0 + β1x            ŷ = b0 + b1x

Page 35: Chapter 10 Correlation and Regression

Requirements

1. The sample of paired (x, y) data is a random sample of quantitative data.

2. Visual examination of the scatterplot shows that the points approximate a straight-line pattern.

3. Any outliers must be removed if they are known to be errors. Consider the effects of any outliers that are not known errors.

Page 36: Chapter 10 Correlation and Regression

Rounding the y-intercept b0 and the Slope b1

Round to three significant digits.

If you use the formulas from the book, do not round intermediate values.

Page 37: Chapter 10 Correlation and Regression

Example:

Refer to the sample data given in Table 10-1 in the Chapter Problem. Use technology to find the equation of the regression line in which the explanatory variable (or x variable) is the cost of a slice of pizza and the response variable (or y variable) is the corresponding cost of a subway fare. (CPI=Consumer Price Index, not used)

Page 38: Chapter 10 Correlation and Regression

Example:

Requirements are satisfied: simple random sample; the scatterplot approximates a straight line; no outliers. Here are results from four different technologies.

Page 39: Chapter 10 Correlation and Regression

Example:

Note: in TI-83/84, a means b0 and b means b1.

All of these technologies show that the regression equation is

ŷ = 0.0346 + 0.945x,

where ŷ is the predicted cost of a subway fare and x is the cost of a slice of pizza.

We should know that the regression equation is an estimate of the true regression equation.
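Using the fitted equation, a prediction is a single substitution; the function name is mine, while the coefficients come from the example:

```python
B0, B1 = 0.0346, 0.945   # estimates from the example above

def predict_fare(pizza_cost):
    """Predicted subway fare: y-hat = b0 + b1 * x."""
    return B0 + B1 * pizza_cost

# A $2.00 slice of pizza predicts a fare of 0.0346 + 0.945 * 2.00 = 1.9246.
fare = predict_fare(2.00)
```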

Page 40: Chapter 10 Correlation and Regression

Example:

Graph the regression equation ŷ = 0.0346 + 0.945x (from the preceding Example) on the scatterplot of the pizza/subway fare data and examine the graph to subjectively determine how well the regression line fits the data.

Page 41: Chapter 10 Correlation and Regression

Example:

Page 42: Chapter 10 Correlation and Regression

Using the Regression Equation for Predictions

1. The predicted value of y is ŷ = b0 + b1x.

2. Use the regression equation for predictions only if the graph of the regression line on the scatterplot confirms that the regression line fits the points reasonably well.

3. Use the regression equation for predictions only if the linear correlation coefficient r indicates that there is a linear correlation between the two variables.

Page 43: Chapter 10 Correlation and Regression

Using the Regression Equation for Predictions

4. Use the regression line for predictions only if the value of x does not go much beyond the scope of the available sample data. (Predicting too far beyond the scope of the available sample data is called extrapolation, and it could result in bad predictions.)

5. If the regression equation does not appear to be useful for making predictions, the best predicted value of a variable is its point estimate, which is its sample mean, ȳ.

Page 44: Chapter 10 Correlation and Regression

Strategy for Predicting Values of Y

Page 45: Chapter 10 Correlation and Regression

Using the Regression Equation for Predictions

If the regression equation is not a good model, the best predicted value of y is simply ȳ, the mean of the y values.

Remember, this strategy applies to linear patterns of points in a scatterplot.

Page 46: Chapter 10 Correlation and Regression

Definition

For a pair of sample x and y values, the residual is the difference between the observed sample value of y and the y-value that is predicted by using the regression equation. That is,

residual = observed y – predicted y = y – ŷ
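The definition translates directly to code; the helper name is mine:

```python
def residuals(observed, predicted):
    """residual = observed y - predicted y, for each sample pair."""
    return [y - yh for y, yh in zip(observed, predicted)]

# Observed values above the line give positive residuals, values below
# give negative residuals, and points on the line give zero:
res = residuals([3.0, 5.0, 4.0], [2.5, 5.5, 4.0])
# → [0.5, -0.5, 0.0]
```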

Page 47: Chapter 10 Correlation and Regression

Residuals

Page 48: Chapter 10 Correlation and Regression

Definition

A straight line satisfies the least-squares property if the sum of the squares of the residuals is the smallest sum possible.