Copyright © 2010 Pearson Education, Inc.

The correlation between two scores X and Y equals 0.8. If both the X scores and the Y scores are converted to z-scores, then the correlation between the z-scores for X and the z-scores for Y would be:

a. -0.8

b. -0.2

c. 0.0

d. 0.2

e. 0.8
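A quick numerical check can make the idea behind this question concrete. The sketch below uses made-up scores (not data from the slides) and computes the correlation before and after converting both variables to z-scores. Standardizing only shifts and rescales each variable, so the two values should agree, which points to choice (e).

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def sample_sd(values):
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def z_scores(values):
    m, s = mean(values), sample_sd(values)
    return [(v - m) / s for v in values]

def correlation(xs, ys):
    # r = sum(z_x * z_y) / (n - 1)
    zx, zy = z_scores(xs), z_scores(ys)
    return sum(a * b for a, b in zip(zx, zy)) / (len(xs) - 1)

# Hypothetical scores, chosen only for illustration.
x_scores = [2.0, 4.0, 5.0, 7.0, 9.0]
y_scores = [1.0, 3.0, 2.0, 6.0, 5.0]

r_raw = correlation(x_scores, y_scores)
r_standardized = correlation(z_scores(x_scores), z_scores(y_scores))
print(round(r_raw, 4), round(r_standardized, 4))  # the two values match
```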



Chapter 8: Linear Regression


Fat Versus Protein: An Example

The following is a scatterplot of total fat versus protein for 30 items on the Burger King menu, with a correlation of 0.83:


The Linear Model

The linear model (also called the line of best fit, least squares line, or regression line) is just the equation of a straight line through the data that shows us how the values are associated.

Using this line we will be able to predict values. Predicted values are denoted $\hat{y}$ (also called y-hat).

The hat tells you they are predicted values. The difference between the observed value and the predicted value is called the residual.

residual = observed - predicted = $y - \hat{y}$


A negative residual means the predicted value’s too big (an overestimate).

A positive residual means the predicted value’s too small (an underestimate).

In the figure, the predicted fat for the BK Broiler chicken sandwich is 36 g, while the true fat content is 25 g. What is the residual?
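The residual for the BK Broiler can be worked out directly from the definition residual = observed - predicted. A minimal sketch (the function name is just for illustration):

```python
def residual(observed, predicted):
    """Residual = observed value minus predicted value."""
    return observed - predicted

# BK Broiler chicken sandwich: observed 25 g of fat, predicted 36 g.
bk_broiler_residual = residual(observed=25, predicted=36)
print(bk_broiler_residual)  # -11
```

The result is negative, so the model overestimates the fat in this sandwich by 11 g.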


“Best Fit” Means Least Squares

Some residuals are positive, others are negative, and, on average, they cancel each other out. To calculate how well the line fits the data we square the residuals (to eliminate the negatives) then find the sum of the squares.

The smaller the sum, the better the fit. That is why the line is also called the least squares line.
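To see what "smaller sum of squares" means in practice, the sketch below compares two candidate lines on a small made-up data set. The data and both lines are assumptions chosen for illustration. The first line is the least squares line for these points, and the second is an arbitrary alternative.

```python
def sum_squared_residuals(xs, ys, intercept, slope):
    """Sum of squared residuals for the line y-hat = intercept + slope * x."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

# Least squares line for these points: y-hat = 2.2 + 0.6x
sse_least_squares = sum_squared_residuals(xs, ys, intercept=2.2, slope=0.6)

# An arbitrary competing line: y-hat = 1 + 1x
sse_other = sum_squared_residuals(xs, ys, intercept=1.0, slope=1.0)

print(round(sse_least_squares, 2), round(sse_other, 2))  # 2.4 4.0
```

The least squares line has the smaller sum, so by this criterion it is the better fit.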


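If the variables are standardized (converted to z-scores, which measure distance from the mean in standard deviations):

The equation of the line of best fit is:

$\hat{z}_y = r \cdot z_x$

In words: predicted y = correlation × x, with both measured in standard deviations from the mean.

The correlation (also called r) is the same whether we predict y from x or x from y, because it is computed from the standardized values. Therefore:

$\hat{z}_x = r \cdot z_y$

In words: predicted x = correlation × y, with both measured in standard deviations from the mean.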


Example: A scatterplot of house prices vs. house size for houses shows a relationship that is straight, with only moderate scatter and no outliers. The correlation between house price and house size is 0.77.

a. You go to an open house and find the house is 1 standard deviation above the mean in size. What would you guess about its price?

b. You read an ad for a house priced 2 standard deviations below the mean. What would you guess about its size?

c. A friend tells you about a house whose size in square meters (he’s European) is 1.5 standard deviations above the mean. What would you guess about its size in square feet?
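Parts (a) and (b) can be sketched with the standardized prediction rule, predicted z = r × z, using the correlation of 0.77 from the example. The function name is an assumption for illustration. Part (c) needs no regression at all: changing units from square meters to square feet does not change a z-score, so the size is still 1.5 standard deviations above the mean.

```python
def predicted_z(r, z_known):
    """Predicted z-score of one variable given the z-score of the other."""
    return r * z_known

r = 0.77  # correlation between house price and house size

# (a) Size is 1 SD above the mean: predict the price.
price_z = predicted_z(r, 1.0)

# (b) Price is 2 SDs below the mean: predict the size.
size_z = predicted_z(r, -2.0)

print(round(price_z, 2), round(size_z, 2))  # 0.77 -1.54
```

So we would guess a price about 0.77 standard deviations above the mean in (a), and a size about 1.54 standard deviations below the mean in (b).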


Sometimes we are given the regression line in REAL UNITS!!!

The regression line for the Burger King data fits the data well. The equation is

$\widehat{fat} = 6.8 + 0.97 \cdot protein$

Example: What is the predicted fat content for a BK Broiler chicken sandwich (with 30 g of protein)?


To find the regression line (in real units):

You may be given the standard deviations, correlation and means.

OR …You may be given raw data.


First make sure a regression is appropriate: Since regression and correlation are closely related, we need to check the same conditions for regressions as we did for correlations:

Quantitative Variables Condition

Straight Enough Condition (look at scatterplot)

Outlier Condition (look at scatterplot)


To create the Regression Line in Real Units given the standard deviations, correlation (r), and means:

You know the equation of a line:

$y = mx + b$

In Statistics we use a slightly different notation:

$\hat{y} = b_0 + b_1 x$

We write $b_1$ and $b_0$ for the slope and intercept of the line. (The slope is always in units of y per unit of x.)

$b_1 = r \frac{s_y}{s_x}$

$b_0 = \bar{y} - b_1 \bar{x}$
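The two formulas, slope = r × (SD of y) / (SD of x) and intercept = (mean of y) - slope × (mean of x), can be sketched as a small function. The summary statistics below are invented for illustration and do not come from the slides.

```python
def regression_from_summary(r, s_x, s_y, mean_x, mean_y):
    """Return (intercept b0, slope b1) from summary statistics."""
    slope = r * s_y / s_x                 # b1 = r * s_y / s_x
    intercept = mean_y - slope * mean_x   # b0 = y-bar - b1 * x-bar
    return intercept, slope

# Hypothetical summary statistics, for illustration only.
b0, b1 = regression_from_summary(r=0.5, s_x=2.0, s_y=4.0, mean_x=10.0, mean_y=30.0)
print(b0, b1)  # 20.0 1.0

# Use the line to predict y at x = 12.
y_hat = b0 + b1 * 12.0
print(y_hat)  # 32.0
```

Note that the line always passes through the point of means: plugging in x = 10 gives a prediction of 30.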


To find a regression line (linear model) with raw data:

Use your calculator! First, be sure to check:

Quantitative

Linear (scatterplot)

No outliers (scatterplot)

If the data are not quantitative, not linear, or have outliers, a linear model is NOT appropriate for the data.
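For readers without a calculator at hand, the same least squares line can be computed from raw data in a few lines. This is a sketch using only the Python standard library and made-up data. It is not the calculator's own routine, but it should give the same slope and intercept.

```python
def fit_line(xs, ys):
    """Least squares fit: returns (intercept, slope) for y-hat = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical raw data, for illustration only.
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 4, 6, 6]

intercept, slope = fit_line(xs, ys)
print(round(intercept, 2), round(slope, 2))  # 1.4 1.3
```

As with the calculator, the conditions still need to be checked on a scatterplot first, since the arithmetic works on any data whether or not a line is appropriate.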


TI Tips: Equation of the Regression Line

STAT, CALC, Choose LinReg(a + bx) (Not the first one … the second one … scroll down!)

Specify that x and y are YR and TUIT (we put these in our calculator before.)


TI Tips: Equation of the Regression Line Graphed on the Scatterplot

STAT, CALC, Choose LinReg(a + bx) (Not the first one … the second one … scroll down!)

Specify that x and y are YR and TUIT (we put these in our calculator before.)

We want the screen to say: LinReg(a+bx) YR, TUIT, Y1 (this will send the equation to Y1 and then we will see it on our graph)

To add Y1 to the end: VARS, Y-VARS, 1:Function and choose Y1

Press ENTER to see the equation. It has also been placed in Y1. Press GRAPH.


Example: Using the relationship between house price (in thousands of dollars) and house size (in thousands of square feet), the regression model is:

$\widehat{Price} = -3.117 + 94.454 \cdot Size$

a. What is the slope and what does it mean?

b. What are the units of the slope?

c. Your house is 2000 square feet bigger than your neighbor’s house. How much more do you expect it to be worth?

d. Is the y-intercept of -3.117 meaningful? Explain.
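Part (c) is a slope calculation, and the units matter: size is in thousands of square feet and price is in thousands of dollars, so 2000 square feet is 2 units of size. A short sketch using the model's coefficients (the function and constant names are illustrative):

```python
INTERCEPT = -3.117  # thousands of dollars
SLOPE = 94.454      # thousands of dollars per thousand square feet

def predicted_price(size_thousands_sqft):
    """Predicted price in thousands of dollars."""
    return INTERCEPT + SLOPE * size_thousands_sqft

# 2000 square feet = 2.0 thousand square feet of extra size.
extra_value = SLOPE * 2.0
print(round(extra_value, 3))  # 188.908 (thousand dollars)
```

The model predicts the larger house to be worth about $188,908 more. The intercept drops out of the comparison because it is the same for both houses.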


Example: The linear model relating hurricanes’ wind speeds to their central pressures was:

Predicted MaxWindSpeed = 955.27 - 0.897 × CentralPressure

Hurricane Katrina had a central pressure measured at 920 millibars. What does our regression model predict for her maximum wind speed? How good is that prediction, given that Katrina’s actual wind speed was measured at 110 knots?
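The Katrina prediction and its residual follow directly from the model. A minimal sketch, with function and variable names chosen for illustration:

```python
def predicted_max_wind_speed(central_pressure_mb):
    """Predicted maximum wind speed (knots) from central pressure (millibars)."""
    return 955.27 - 0.897 * central_pressure_mb

katrina_predicted = predicted_max_wind_speed(920)
katrina_residual = 110 - katrina_predicted  # observed - predicted

print(round(katrina_predicted, 2))  # 130.03
print(round(katrina_residual, 2))   # -20.03
```

The model predicts about 130 knots, so it overestimates Katrina's measured wind speed by about 20 knots.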


More about Residuals

A scatterplot of the residuals should look completely random! It should show no bends and should have no outliers.


Draw examples of a residual graph that is not random.


TI Tips – Residual Plots

You look at the scatterplot to make sure it is linear. Sometimes it is hard to tell. After you do a regression do a residual plot. If the residual plot is completely random, you know your scatterplot was linear.

The calculator automatically stores the residuals in a list named RESID after you run a regression. To look at them … STAT EDIT cursor over to RESID.

To create the residual plot … STAT PLOT, Plot2, Xlist:YR and Ylist: RESID

Y= may still have your regression line in it. You can either turn it off or remove it.

ZoomStat. Do you see a curve?
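To illustrate what a non-random residual plot looks like, the sketch below fits a straight line to deliberately curved, made-up data (y = x squared) and lists the residuals. The data and helper function are assumptions for illustration and are unrelated to the YR and TUIT lists.

```python
def fit_line(xs, ys):
    """Least squares fit: returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = s_xy / s_xx
    return mean_y - slope * mean_x, slope

# Hypothetical curved data: y = x^2, which a straight line cannot describe.
xs = [0, 1, 2, 3, 4, 5, 6]
ys = [x ** 2 for x in xs]

intercept, slope = fit_line(xs, ys)
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

for x, e in zip(xs, residuals):
    print(x, round(e, 2))
```

The residuals are positive at both ends and negative in the middle. That U-shaped bend is the kind of pattern that signals the original relationship was not linear, even though the residuals still sum to zero.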


Example: Our linear model for homes uses the model: predicted price = -3.117 + (94.454)(size)

a. Would you prefer to find a home with a negative or a positive residual? Explain.

b. You plan to look for a home of about 3000 square feet. How much should you expect to have to pay?

c. You find a nice home that size selling for $300,000. What’s the residual?
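Parts (b) and (c) can be sketched numerically. Since the model measures size in thousands of square feet and price in thousands of dollars, 3000 square feet is entered as 3.0 and $300,000 as 300. The names below are illustrative.

```python
def predicted_price(size_thousands_sqft):
    """Predicted price in thousands of dollars."""
    return -3.117 + 94.454 * size_thousands_sqft

# (b) Expected price of a 3000-square-foot home.
expected_price = predicted_price(3.0)

# (c) Residual for a home of that size selling for $300,000.
home_residual = 300.0 - expected_price  # observed - predicted

print(round(expected_price, 3))  # 280.245 (thousand dollars)
print(round(home_residual, 3))   # 19.755 (thousand dollars)
```

The positive residual means this home costs about $19,755 more than the model predicts for its size. That is also a hint for part (a): a buyer would generally prefer a negative residual, a home priced below what the model predicts.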


The Residual Standard Deviation

The standard deviation of the residuals, $s_e$, measures how much the points spread around the regression line.

Interpreting $s_e$: “Errors in predictions based on this model have a standard deviation of $s_e$ (in y units).”

We estimate the SD of the residuals using:

$s_e = \sqrt{\frac{\sum e^2}{n - 2}}$
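The formula can be sketched in code: fit the line, compute the residuals, then take the square root of the sum of squared residuals divided by n - 2. The data are made up for illustration.

```python
from math import sqrt

def fit_line(xs, ys):
    """Least squares fit: returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = s_xy / s_xx
    return mean_y - slope * mean_x, slope

def residual_sd(xs, ys):
    """s_e = sqrt(sum of squared residuals / (n - 2))."""
    intercept, slope = fit_line(xs, ys)
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    return sqrt(sum(e ** 2 for e in residuals) / (len(xs) - 2))

# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

s_e = residual_sd(xs, ys)
print(round(s_e, 3))  # 0.894
```

The divisor is n - 2 rather than n - 1 because two quantities (slope and intercept) were estimated from the data.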


$R^2$: The Variation Accounted For (cont.)

All regression analyses include this statistic (the squared correlation, $r^2$), although by tradition it is written $R^2$ (pronounced “R-squared”). An $R^2$ of 0 means that none of the variance in the data is in the model; all of it is still in the residuals.

When interpreting a regression model you need to tell what $R^2$ means: “The percentage of variability in y that is explained by x is $R^2$.”

$R^2$ is always between 0% and 100%. What makes a “good” $R^2$ value depends on the kind of data you are analyzing and on what you want to do with it.

Always report the slope, the intercept, and $R^2$ for a regression so that readers can judge for themselves how successful the regression is at fitting the data.
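One way to see what "variation accounted for" means is to compute $R^2$ as 1 minus the ratio of residual variation to total variation, and compare it with the squared correlation. The sketch below does this on made-up data. The function name and data are assumptions for illustration.

```python
from math import sqrt

def r_and_r_squared(xs, ys):
    """Return (r, R^2), with R^2 computed as 1 - SSE / SST."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)  # total variation in y (SST)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

    r = s_xy / sqrt(s_xx * s_yy)
    return r, 1 - sse / s_yy

# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r, r_squared = r_and_r_squared(xs, ys)
print(round(r, 4), round(r_squared, 4), round(r ** 2, 4))  # 0.7746 0.6 0.6
```

Here 60% of the variability in y is accounted for by the linear model, and the remaining 40% is left in the residuals.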


Assumptions and Conditions

Quantitative Variables Condition: Regression can only be done on two quantitative variables (and not two categorical variables).

Straight Enough Condition: The linear model assumes that the relationship between the variables is linear (check with a scatterplot).


Assumptions and Conditions (cont.)

If the scatterplot is not straight enough, stop here. You can only use a linear model on two variables that are related linearly!

Some nonlinear relationships can be saved by re-expressing the data to make the scatterplot more linear.

It’s a good idea to check linearity again after computing the regression, when we can examine the residuals.

Does the Plot Thicken? Condition: The residual plot should show random scatter, with about the same spread throughout.

Don’t confuse this with Normal probability plots from Unit 1, where a straight line indicates that the data are approximately Normal.



Outlier Condition: Watch out for outliers. Outlying points can dramatically change a regression model. Outliers can even change the sign of the slope, misleading us about the underlying relationship between the variables.


What Can Go Wrong?

Don’t fit a straight line to a nonlinear relationship.

Beware extraordinary points (y-values that stand off from the linear pattern or extreme x-values).

Don’t extrapolate beyond the data. The linear model may no longer hold outside of the range of the data.

Don’t infer that x causes y just because there is a good linear model for their relationship. Association is not causation.

Don’t choose a model based on $R^2$ alone.


A few IMPORTANT things to remember:

“The percentage of variability in y that is explained by x is $R^2$.” (An example of this will be homework problem #7.)

Correlation = $r = \pm\sqrt{r^2}$ (you need to decide whether it is + or - depending on whether the association is positive or negative).

residual = observed - predicted = $y - \hat{y}$

$R^2$ tells you how well the actual data fit the model (1 is perfect, 0 means no linear association).

$1 - r^2$ is the fraction of the original variance left in the residuals.

Be careful not to use a regression to extrapolate (predict values beyond the scope/time frame of the model).


Homework (Day 1) Pg. 192 1-33 odd (skip 9)
