Chapter 3 – Examining Relationships


Jan 16, 2016


Rhoda

Transcript
Page 1: Chapter 3 – Examining Relationships

Chapter 3 – Examining Relationships

Page 2: Chapter 3 – Examining Relationships

Scatterplots and Correlation - 3.1

Page 3: Chapter 3 – Examining Relationships

Scatterplot: Shows the relationship between two variables.

Explanatory Variable: Variable on the x-axis. Influences the response.

Response Variable: Variable on the y-axis. Responds to the explanatory variable.

Page 4: Chapter 3 – Examining Relationships

Looking at Scatterplots:

• Direction: Positive: as x increases, y increases. Negative: as x increases, y decreases.

• Form: Is there a linear relationship between the two variables?

• Strength: Do the points follow a single stream that is tight to the line or is there considerable spread (or variability) around the line?

Page 5: Chapter 3 – Examining Relationships

Calculator Tip: Scatterplots

L1: Explanatory Variable

L2: Response Variable

Use statplot to graph

Page 6: Chapter 3 – Examining Relationships

Example #1: Suppose you were to collect data for each pair of variables below. Which variable is the explanatory and which is the response? Determine the likely direction and strength of the relationship.

1. T-shirts at a store: Price of each, Number Sold

Explanatory (x): Price of shirt (axis sketched from $5 to $50)

Response (y): # sold (axis sketched from 1 to 100)

D: negative

S: strong

Page 7: Chapter 3 – Examining Relationships

2. Drivers: Reaction Time, Blood Alcohol Level

Explanatory (x): BAC (axis sketched from .01 to .5)

Response (y): Reaction time (axis sketched from 1 to 10)

D: positive

S: strong

Page 8: Chapter 3 – Examining Relationships

3. Cars: Age of Owner, Weight of the Car

Makes no sense!!!

Page 9: Chapter 3 – Examining Relationships

Example #2: In a study of whether a relationship exists between a child's aptitude and the age at which he/she first speaks, researchers recorded the age (in months) of a child's first speech and the child's score on an aptitude test. The data for these 21 children follow:

Make a scatterplot and describe the relationship in the context of the problem.

Page 10: Chapter 3 – Examining Relationships

D: positive

F: curved

S: moderate

Page 11: Chapter 3 – Examining Relationships

Correlation:

Measures the direction and strength of the linear relationship between two variables

Denoted “r”

Both variables must be quantitative

Page 12: Chapter 3 – Examining Relationships

Attributes of the Correlation

1.The correlation coefficient is a unit-less measurement, denoted with the letter r, and has values between -1 and 1.

2. When r = 1 all the data points form a perfect straight line relationship with a positive slope.

3. When r = -1 all the data points form a perfect straight line relationship with a negative slope.

Page 13: Chapter 3 – Examining Relationships

Attributes of the Correlation

4. Values of r close to 0 mean that the linear relationship is weak. There may be a general linear trend, but there is a lot of variability around that trend.

5. When r = 0 there is no linear relationship between the two variables. In other words, the best-fitting line has a slope of zero.

Page 14: Chapter 3 – Examining Relationships

Attributes of the Correlation

6. Outliers have a large influence on the correlation coefficient. The correlation is NOT resistant to outliers.

7. Correlation does not describe curved relationships! (ONLY LINEAR)

Page 15: Chapter 3 – Examining Relationships

Types of Correlation:

r = 0 r = -0.3

r = 0.5 r = -0.7

r = 0.9 r = -0.99

Page 16: Chapter 3 – Examining Relationships

Example #3: What is wrong with the following statements?

1. There is a strong correlation between the gender of American workers and their income.

Gender is categorical; correlation requires two quantitative variables.

Page 17: Chapter 3 – Examining Relationships

2. We found a high correlation (r = 1.09) between students’ rating of faculty teaching and ratings made by other faculty members.

r can’t be bigger than 1

Page 18: Chapter 3 – Examining Relationships

3. We found a very weak correlation (r = -0.95) which suggests little relationship between income and hours spent at casinos.

r = -0.95 is a strong negative relationship

Page 19: Chapter 3 – Examining Relationships

4. We found a very weak correlation (r = 0.01) which suggests little relationship between age and death rate.

Should be a very strong relationship!

Page 20: Chapter 3 – Examining Relationships

Guidelines: How strong is the linear relationship?

0 < r < 0.3 = weak positive; -0.3 < r < 0 = weak negative

0.4 < r < 0.7 = moderate positive; -0.7 < r < -0.4 = moderate negative

0.8 < r < 1 = strong positive; -1 < r < -0.8 = strong negative

Page 21: Chapter 3 – Examining Relationships

HOW TO CALCULATE THE CORRELATION COEFFICIENT

Remember how to calculate the z-score? We used this calculation to determine how many standard deviations an observation was from the mean.

RECALL:

$z\text{-score} = z = \dfrac{x - \text{mean}}{\text{standard deviation}}$

Page 22: Chapter 3 – Examining Relationships

In this case, we were only concerned with one variable.

Now, we are considering two variables and each must be standardized.

Page 23: Chapter 3 – Examining Relationships

Notation:

$r$ = correlation

$n$ = total number of observations

$\bar{x}$ = sample mean of the x's

$x_i$ = the i-th observation of the x's

$S_x$ = sample standard deviation of the x's

$\bar{y}$ = sample mean of the y's

$y_i$ = the i-th observation of the y's

$S_y$ = sample standard deviation of the y's

Page 24: Chapter 3 – Examining Relationships

FORMULA:

$$r = \frac{1}{n-1} \sum \left(\frac{x_i - \bar{x}}{S_x}\right)\left(\frac{y_i - \bar{y}}{S_y}\right)$$

Page 25: Chapter 3 – Examining Relationships

Calculator Tip: Correlation

L1: Explanatory Variable

L2: Response Variable

Stat-calc-LinReg(a+bx), L1, L2

(make sure your diagnostic is on!!!)

Page 26: Chapter 3 – Examining Relationships

Example #4:

Speed (x): 20 30 40

MPG (y): 25 35 45

Step #1: Find the following summary statistics:

n = 3

SPEED: $\bar{x} = 30$, $S_x = 10$

MPG: $\bar{y} = 35$, $S_y = 10$

Page 27: Chapter 3 – Examining Relationships

Step #2: Calculate z-scores

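SPEED: $Z(x_1) = \frac{20 - 30}{10} = -1$, $Z(x_2) = \frac{30 - 30}{10} = 0$, $Z(x_3) = \frac{40 - 30}{10} = 1$

MPG: $Z(y_1) = \frac{25 - 35}{10} = -1$, $Z(y_2) = \frac{35 - 35}{10} = 0$, $Z(y_3) = \frac{45 - 35}{10} = 1$

PRODUCT: $Z(x_1)Z(y_1) = 1$, $Z(x_2)Z(y_2) = 0$, $Z(x_3)Z(y_3) = 1$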

Page 28: Chapter 3 – Examining Relationships

Step #3: Calculate the Correlation

$r = \frac{1}{3 - 1}(1 + 0 + 1)$

$r = \frac{1}{2}(2)$

$r = 1$
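The three steps of Example #4 can be sketched in Python. This is a minimal illustration, not part of the original slides; the helper names `mean`, `sample_sd`, and `correlation` are assumptions chosen for this sketch.

```python
from math import sqrt

def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)

def sample_sd(values):
    """Sample standard deviation (divides by n - 1)."""
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def correlation(xs, ys):
    """r = 1/(n-1) * sum of the products of the paired z-scores."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = sample_sd(xs), sample_sd(ys)
    z_products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                  for x, y in zip(xs, ys)]
    return sum(z_products) / (n - 1)

speed = [20, 30, 40]   # x values from Example #4
mpg = [25, 35, 45]     # y values from Example #4

r = correlation(speed, mpg)
print(r)  # the three points lie exactly on a line with positive slope
```

Because the points fall on a perfect line, the z-score products are 1, 0, and 1, and dividing their sum by n - 1 = 2 gives r = 1.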

Page 29: Chapter 3 – Examining Relationships

3.2 – Least-Squares Regression

Page 30: Chapter 3 – Examining Relationships

Regression line: straight line that describes the linear relationship between an explanatory variable and a response variable.

Page 31: Chapter 3 – Examining Relationships

LEAST SQUARES REGRESSION LINE:

• This is the best-fitting line to the data.

• The goal is to minimize the (vertical) distances of your observations (data) from your line.

• Again, we must square the distances (like the calculation of the variance) because some data points lie above the line (positive distance) and some lie below it (negative distance), so the raw distances would cancel each other out. To compensate, they are squared.

Page 32: Chapter 3 – Examining Relationships

We can use this line to predict a response, y, from a given explanatory variable, x.

Page 33: Chapter 3 – Examining Relationships

Remember graphing??

Slope-Intercept formula for a line:

y = mx + b where m = slope and b = y-intercept

Do you remember the SLOPE?

$\text{slope} = \dfrac{\text{rise}}{\text{run}} = \dfrac{\Delta y}{\Delta x}$

In statistics, we write it

$\hat{y} = a + bx$

Page 34: Chapter 3 – Examining Relationships

Example #1: Wildlife researchers monitor many wildlife populations by taking aerial photographs in order to estimate the weights of alligators. Here is the regression line of the weights of adult alligators (in pounds) and their lengths (in inches) based on the data collected from captured alligators.

Predicted Weight = – 393 + 5.9(length)

1. What is the slope of the line? What does it mean?

Slope: b = 5.9

For each additional inch in length, the predicted weight increases by 5.9 pounds.

Page 35: Chapter 3 – Examining Relationships

2. What is the y-intercept of the line? What does it mean?

Y-intercept: a = -393

If an alligator is 0 inches long, then it is predicted to weigh -393 lbs. This makes no sense!!!

Page 36: Chapter 3 – Examining Relationships

3. Describe the relationship between weight and length of alligators.

As the length increases, their weight increases.

Page 37: Chapter 3 – Examining Relationships

4. What is the predicted weight for an alligator 90 inches long?

$\hat{y} = -393 + 5.9(90)$

$\hat{y} = -393 + 531$

$\hat{y} = 138$ lbs

Page 38: Chapter 3 – Examining Relationships

CALCULATION:

$\hat{y} = a + bx$

1. Slope: $b = r\dfrac{S_y}{S_x}$ (Calculate this first!)

2. Y-intercept: $a = \bar{y} - b\bar{x}$
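The two-step calculation above (slope first, then y-intercept) can be sketched as a small Python function. The function name `lsrl_from_summary` is an assumption for this illustration; the inputs reuse the summary statistics of the Speed/MPG example (r = 1, mean speed 30, SD 10, mean MPG 35, SD 10).

```python
def lsrl_from_summary(r, x_bar, s_x, y_bar, s_y):
    """Return (a, b) for y-hat = a + b*x from summary statistics."""
    b = r * s_y / s_x        # step 1: slope
    a = y_bar - b * x_bar    # step 2: y-intercept (needs the slope)
    return a, b

# Summary statistics from the Speed (x) / MPG (y) example
a, b = lsrl_from_summary(r=1.0, x_bar=30, s_x=10, y_bar=35, s_y=10)
print(f"y-hat = {a} + {b}x")
```

The slope has to be computed first because the intercept formula uses it; that ordering is the reason for the "Calculate this first!" reminder.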

Page 39: Chapter 3 – Examining Relationships

Facts about Least Squares Regression:

1. The distinction between explanatory and response variables is essential (which variable is used to predict which?).

2. The line always passes through the point $(\bar{x}, \bar{y})$.

3. Correlation ‘r’ describes the direction and strength of the straight-line relationship, but it tells us nothing more about the slope than whether it is positive, negative, or zero.

Page 40: Chapter 3 – Examining Relationships

Extrapolation: Predicting outside the range of the x values

Page 41: Chapter 3 – Examining Relationships

Calculator Tip: LSRL

L1: Explanatory Variable

L2: Response Variable

Stat-calc-LinReg(a+bx), L1, L2, vars/y-vars/Function/ Y1

Page 42: Chapter 3 – Examining Relationships

Example #2: Is there a relationship between wine consumption (in liters) and yearly deaths from heart disease (deaths per 100,000)? Here are the summary statistics:

Mean wine consumption: 3,026

SD of wine consumption: 2,510

Mean deaths from heart disease: 191,053

SD of heart disease deaths: 68,396

Correlation coefficient between wine consumption and yearly deaths from heart disease = -.0843

a. Interpret the value of the correlation coefficient in the context of the problem.

As wine consumption increases, deaths from heart disease tend to decrease.

Page 43: Chapter 3 – Examining Relationships

b. Calculate the least-squares regression line predicting death rate from wine consumption.

1. Slope: $b = r\dfrac{S_y}{S_x}$ = -0.0843(68,396/2,510) = -2.2971

2. Y-intercept: $a = \bar{y} - b\bar{x}$ = 191,053 – (-2.2971*3,026) = 198,004.0991

$\hat{y}$ = 198,004.0991 – 2.2971x

Page 44: Chapter 3 – Examining Relationships

c. Use your line to predict death rate for an average adult who consumes 4 liters of wine.

$\hat{y}$ = 198,004.0991 – 2.2971x

$\hat{y}$ = 198,004.0991 – 2.2971(4)

$\hat{y}$ = 197,994.9107

Page 45: Chapter 3 – Examining Relationships

Example #3: The following data describe the relationship between a tree trunk's diameter and its height. Make a scatterplot of the data and find the LSRL. Define any variables used in this equation. How strong is the association?

Trunk Diameter: 8 9 7 6 13 7 11 12

Tree Height: 35 49 27 33 60 21 45 51

Page 46: Chapter 3 – Examining Relationships

$\hat{y} = -1.31467 + 4.54133x$

Where x = trunk diameter and $\hat{y}$ = predicted tree height

Strong correlation, r ≈ 0.89
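As a cross-check on the tree example, here is a minimal Python sketch that computes the LSRL and the correlation directly from the eight data pairs instead of a calculator. It is only an illustration; the variable names are assumptions. It uses the fact that the slope $r\,S_y/S_x$ equals the sum of cross-deviations divided by the sum of squared x-deviations.

```python
from math import sqrt

diameter = [8, 9, 7, 6, 13, 7, 11, 12]     # x: trunk diameter
height = [35, 49, 27, 33, 60, 21, 45, 51]  # y: tree height

n = len(diameter)
x_bar = sum(diameter) / n
y_bar = sum(height) / n

# Sums of squared deviations and cross-deviations
sxx = sum((x - x_bar) ** 2 for x in diameter)
syy = sum((y - y_bar) ** 2 for y in height)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(diameter, height))

b = sxy / sxx                # slope, algebraically equal to r * S_y / S_x
a = y_bar - b * x_bar        # y-intercept
r = sxy / sqrt(sxx * syy)    # correlation coefficient

print(f"y-hat = {a:.5f} + {b:.5f}x, r = {r:.3f}")
```

The fitted coefficients agree with the equation on the slide, and r comes out a little above 0.88.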

Page 47: Chapter 3 – Examining Relationships

Residual: How close is the data to the line?

Residual = observed y – predicted $\hat{y}$ = $y - \hat{y}$

Page 48: Chapter 3 – Examining Relationships

residual

Page 49: Chapter 3 – Examining Relationships

Residual Plot: A plot that shows the residuals for all the data. A good line has no pattern.

Calculator Tip: Residual Plot

Calculate the LSRL

L3: vars/ y-vars/ function/ Y1(L1)

L4: L2 – L3

Scatterplot: L1, L4

Page 50: Chapter 3 – Examining Relationships

Example of random residual plots

Page 51: Chapter 3 – Examining Relationships

Example of curved residual plots

Not a linear model.

Page 52: Chapter 3 – Examining Relationships

Example of fanning residual plots

Less accurate for larger x values.

Page 53: Chapter 3 – Examining Relationships

Standard Deviation of the residuals:

Used to measure the prediction error of the line

$$s = \sqrt{\frac{\sum \text{residuals}^2}{n - 2}}$$

Calculator Tip: SD of residuals

Find residuals; in L5: L4²; 2nd List/ math/ sum(L5)

Page 54: Chapter 3 – Examining Relationships

Example #4: The ages (in years) of seven men and their systolic blood pressures are given below:

Age (x): 16 25 39 45 49 64 70

Systolic BP (y): 100 120 140 160 165 185 200

Predicted Pressure ($\hat{y}$): 102.2 118.5 143.8 154.7 161.9 189 199.8

Regression Equation: $\hat{y} = 73.3589 + 1.8068x$

Residuals: -2.27 1.47 -3.82 5.34 3.11 -3.99 .17

Page 55: Chapter 3 – Examining Relationships

Residual Plot:

No apparent pattern.

Page 56: Chapter 3 – Examining Relationships

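Standard deviation of the residuals:

-2.27 1.47 -3.82 5.34 3.11 -3.99 .17

$$s = \sqrt{\frac{\sum \text{residuals}^2}{n - 2}}$$

$$s = \sqrt{\frac{(-2.27)^2 + (1.47)^2 + (-3.82)^2 + (5.34)^2 + (3.11)^2 + (-3.99)^2 + (.17)^2}{7 - 2}}$$

$$s = \sqrt{\frac{76.03275905}{5}}$$

$$s = 3.899557899$$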
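The residuals and their standard deviation for the blood pressure example can be reproduced with a short Python sketch. It assumes the regression equation stated in the example (intercept 73.3589, slope 1.8068); the function name `predict` is an assumption for this illustration.

```python
from math import sqrt

age = [16, 25, 39, 45, 49, 64, 70]               # x
systolic = [100, 120, 140, 160, 165, 185, 200]   # y

def predict(x):
    """Predicted pressure from the example's regression equation."""
    return 73.3589 + 1.8068 * x

# Residual = observed y - predicted y-hat
residuals = [y - predict(x) for x, y in zip(age, systolic)]

# Standard deviation of the residuals divides by n - 2, not n - 1
n = len(residuals)
s = sqrt(sum(e ** 2 for e in residuals) / (n - 2))

print([round(e, 2) for e in residuals])
print(round(s, 4))
```

Working from unrounded residuals gives a sum of squares near 76.03, which is why the slide's numerator differs slightly from what the two-decimal residuals would give.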

Page 57: Chapter 3 – Examining Relationships

Assessing the Predictive Power of the Equation:

1. Coefficient of Determination: $r^2$ = the correlation coefficient, squared

2. It is the fraction (or percent) of the variation in the values of y that is explained by the least-squares regression of y on x.

3. The closer $r^2$ is to 1, the better the regression line describes the connection between x and y – in particular, predictions made with the equation will be more accurate.
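The "fraction of variation explained" reading in item 2 can be checked numerically. The sketch below uses the tree diameter and height data from Example #3 and compares r squared with 1 minus (sum of squared residuals) / (total sum of squares about the mean of y). The variable names are assumptions for this illustration.

```python
from math import sqrt

diameter = [8, 9, 7, 6, 13, 7, 11, 12]     # x
height = [35, 49, 27, 33, 60, 21, 45, 51]  # y

n = len(diameter)
x_bar = sum(diameter) / n
y_bar = sum(height) / n
sxx = sum((x - x_bar) ** 2 for x in diameter)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(diameter, height))

# Least-squares line
b = sxy / sxx
a = y_bar - b * x_bar

# Total variation in y, and the variation left over after fitting the line
sst = sum((y - y_bar) ** 2 for y in height)
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(diameter, height))

r = sxy / sqrt(sxx * sst)
r_squared = r ** 2
explained_fraction = 1 - sse / sst

print(round(r_squared, 4), round(explained_fraction, 4))
```

The two printed values match, so about 79% of the variation in tree height is explained by the regression on trunk diameter for these data.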

Page 58: Chapter 3 – Examining Relationships

3.2 & 3.3 – Coefficient of Determination, Lurking Variables

Page 59: Chapter 3 – Examining Relationships

Coefficient of Determination: ($r^2$)

The fraction of the variation in the y values that is explained by the x values

Page 60: Chapter 3 – Examining Relationships

Reading Computer Output:

Predictor     Coef       StDev   T   P
Constant      (y-int)
x-variable    (Slope)

S =      R-Sq = ($r^2$)      R-Sq(adj) =

Page 61: Chapter 3 – Examining Relationships

Example #1: The correlation between alcohol and yearly deaths from heart disease was -0.843. What percent of the variation in the yearly deaths from heart disease can be explained by the regression of yearly deaths on alcohol consumption?

r = -0.843

$r^2 = (-0.843)^2 = 0.710649$

71% of the variation in yearly deaths from heart disease can be explained by the regression on alcohol consumption.

Page 62: Chapter 3 – Examining Relationships

Example #2: Is there a linear relationship between marijuana consumption and other drug usage? For this regression, the percent of variability in other drug usage explained by the regression of other drugs on marijuana use was 66.5%. What is the correlation coefficient?

$r^2 = 0.665$

$r = \sqrt{0.665} = 0.815475$

Page 63: Chapter 3 – Examining Relationships

Example #3: Fast Food Sandwiches: The mean serving size for fast food sandwiches is 7.557 ounces with a standard deviation of 2.008 ounces. The mean number of calories per sandwich is 446.9 with a standard deviation of 143. The correlation between serving size and calories is 0.849.

a. Calculate the LSRL.

1. Slope: $b = r\dfrac{S_y}{S_x}$ = 0.849(143/2.008) = 60.46165339

2. Y-intercept: $a = \bar{y} - b\bar{x}$ = 446.9 – (60.46*7.557) = -10.00871464

$\hat{y} = -10.0087 + 60.4617x$

$\hat{y}$ is the predicted number of calories and x is the serving size.

Page 64: Chapter 3 – Examining Relationships

b. What percent of the variability in calories is explained by the least squares line with serving size?

$r^2 = 0.849^2 = 0.720801$

72% of the variability in calories is explained by serving size

Page 65: Chapter 3 – Examining Relationships

c. Use this regression line to predict the average number of calories in a 35-ounce serving. Explain if the least squares would be appropriate to use in this situation.

$\hat{y} = -10.0087 + 60.4617x$

$\hat{y} = -10.0087 + 60.4617(35)$

$\hat{y} = 2106.1508$

No. This is extrapolation: 35 ounces is too far from the normal serving sizes.
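To make "too far from the normal values" concrete, this sketch recomputes the fast food line from the summary statistics and reports how many standard deviations a 35-ounce serving sits above the mean serving size. Using a z-score as a rough extrapolation check is an assumption made here because the slides do not give the actual range of serving sizes.

```python
# Summary statistics from the fast food example
r = 0.849
x_bar, s_x = 7.557, 2.008    # serving size (ounces)
y_bar, s_y = 446.9, 143      # calories

b = r * s_y / s_x            # slope
a = y_bar - b * x_bar        # y-intercept

serving = 35
predicted = a + b * serving

# Rough check (assumption): how unusual is this x value?
z = (serving - x_bar) / s_x

print(f"predicted calories: {predicted:.1f}")
print(f"z-score of a 35 oz serving: {z:.1f}")
```

A serving more than 13 standard deviations above the mean is far outside anything the line was fitted to, so the prediction of roughly 2,106 calories should not be trusted.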

Page 66: Chapter 3 – Examining Relationships

Example #3: Commercial airlines need to know the operating cost per hour of flight for each plane in their fleet. In a study of the relationship between operating cost per hour and number of passenger seats, investigators computed the regression of operating cost per hour on the number of passenger seats. The 12 sample aircraft used in the study included planes with as few as 126 passenger seats and planes with as many as 410 passenger seats. Operating cost per hour ranged between $3,600 and $7,800. Some computer output from a regression analysis of these data is shown below.

Page 67: Chapter 3 – Examining Relationships
Page 68: Chapter 3 – Examining Relationships

a. What is the equation of the least squares regression line that describes the relationship between operating cost per hour and number of passenger seats in the plane? Define any variables used in this equation.

$\hat{y} = 1136 + 14.673x$

$\hat{y}$ is the predicted operating cost per hour and x is the # of passenger seats

Page 69: Chapter 3 – Examining Relationships

b. What is the value of the correlation coefficient for operating cost per hour and number of passenger seats in the plane? Interpret this correlation.

$r = \sqrt{0.57} = 0.75498$

There is a strong positive correlation between the number of passenger seats and operating cost per hour.

Page 70: Chapter 3 – Examining Relationships

c. Suppose that you want to describe the relationship between operating cost per hour and number of passenger seats in the planes only in the range of 250 to 350 seats. Does the line shown in the scatterplot still provide the best description of the relationship for data in this range? Why or why not?

No. Between 250 and 350 seats, the direction looks negative.

Page 71: Chapter 3 – Examining Relationships

Cautions in Making Predictions with Regression Lines:

1. If the correlation is not strong, predictions will not be accurate.

2. Extrapolation: Do not make predictions outside of the range for which you have data.

3. Correlation does not imply causation

• The correlation may be a coincidence

• Both correlated variables might be directly influenced by some common underlying cause

Page 72: Chapter 3 – Examining Relationships

Lurking Variables:

A lurking variable is a variable that is not among the explanatory or response variables, but influences the interpretation of the relationship.

Causation: X → Y

Common Response (z = lurking variable): Z → X and Z → Y

Page 73: Chapter 3 – Examining Relationships

Example #4: There is a positive correlation between the number of deaths by drowning and the number of ice cream cones sold. Is this evidence that people are not heeding the old advice to wait 2 hours after eating before swimming and are paying the price for it?

No! Summer is the lurking variable

Page 74: Chapter 3 – Examining Relationships

Example #5: Smoke Causes Coughs: A strong relationship is found between weekly sales of firewood and weekly sales of cough drops from September to March. Can we conclude that smoke from the fires causes coughs?

No! Winter is the lurking variable

Page 75: Chapter 3 – Examining Relationships

Outlier: Observation away from the other data points

Influential Point: Observation that drastically changes the LSRL
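A small Python sketch can show what "drastically changes the LSRL" looks like. It fits the line to the tree diameter and height data from the earlier example, then refits after adding one extra point. The extra point (diameter 30, height 20) is hypothetical and is not in the original data; the function name `fit_lsrl` is also an assumption for this illustration.

```python
def fit_lsrl(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

diameter = [8, 9, 7, 6, 13, 7, 11, 12]
height = [35, 49, 27, 33, 60, 21, 45, 51]

a_before, b_before = fit_lsrl(diameter, height)

# Hypothetical extra tree (NOT in the original data): very thick but short.
a_after, b_after = fit_lsrl(diameter + [30], height + [20])

print(f"without the point: y-hat = {a_before:.2f} + {b_before:.2f}x")
print(f"with the point:    y-hat = {a_after:.2f} + {b_after:.2f}x")
```

One point that is extreme in the x direction is enough to flip the slope from clearly positive to slightly negative, which is what makes it influential rather than just an outlier.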

Page 76: Chapter 3 – Examining Relationships

Applet: http://bcs.whfreeman.com/tps3e/pages/bcs-main.asp?v=category&s=00020&n=99000&i=99020.01&o=|00510|00520|00530|00010|00020|00030|00040|00050|00060|00070|00080|00110|00120|00300|0P000|01000|02000|03000|04000|05000|06000|07000|08000|09000|10000|11000|12000|13000|14000|15000|99000|