Chapter 8 Linear Regression
Objectives & Learning Goals
Understand Linear Regression (linear modeling): Create and interpret a linear regression model
using technology, equations (when statistics are provided) and when given computer output:
Discuss slope and y-intercept in context; Check conditions by graphing and interpreting residuals; Make predictions within the range of data using the created
model.
Fast Food - Fat Versus Protein r = 0.83; describe the association…
“There is a strong, positive linear association between fat and protein. In general, as protein increases, so does fat. We may be concerned about these points as possible outliers…”
We can say even more with a model…
The Linear Model The linear model is the equation of the “line of best fit.”
Models aren’t perfect. No matter how “best” our line of best fit is, points will fall above and below the line.
Residuals are the vertical distances from the best fit line to each data point. They’re the parts of the relationship between the variables that our model can’t explain, in other words the model’s errors.
More on Residuals
To find the residual of a data point, subtract the predicted value from the observed value:
$\text{residual} = \text{actual} - \text{predicted} = y - \hat{y}$
Residuals (cont.) A negative residual means
the model’s prediction is too large (the model overestimates).
A positive residual means the predicted value is too small (it underestimates).
In this example, the linear model overestimates the fat content of a sandwich (33g); the actual value is 25g, so the residual is negative 8g.
Residuals are key because our technology uses them to
determine which line is “best.”
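The arithmetic for the sandwich example above can be checked directly. This is a minimal sketch; the variable names are made up for illustration, and the two values are the ones quoted on the slide.

```python
# Residual = actual (observed) value - predicted value.
actual_fat = 25      # grams, the observed fat content quoted above
predicted_fat = 33   # grams, what the linear model predicted

residual = actual_fat - predicted_fat
print(residual)  # -8: negative, so the model overestimated
```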
“Best Fit” Means Least Squares Some residuals are positive and others are negative; on average these errors cancel each other out.
We can’t assess how well the line fits by adding up all the residuals (we would just get zero!).
Solution?? Anyone?? Anyone?? Bueller?? Similar to what we did with standard deviation, we square the
residuals, then add up the squares. The smaller the sum, the better the line.
The line of best fit is the line that minimizes the sum of the squared residuals. We call this the least squares regression line (LSRL).
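To see the least squares idea in action, here is a small sketch in plain Python. The five data points are hypothetical (invented only for illustration), and the helper names `residuals` and `sse` are assumptions, not anything from the textbook or a calculator. It shows that the residuals of the LSRL add up to zero, and that nudging the line in any direction makes the sum of squared residuals larger.

```python
# Hypothetical (x, y) data, invented purely for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

def residuals(b0, b1):
    """Observed minus predicted for the line y-hat = b0 + b1*x."""
    return [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

def sse(b0, b1):
    """Sum of squared residuals for the line y-hat = b0 + b1*x."""
    return sum(e ** 2 for e in residuals(b0, b1))

# Least squares slope and intercept from the standard formulas.
n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
      / sum((x - x_bar) ** 2 for x in xs))
b0 = y_bar - b1 * x_bar

print(round(sum(residuals(b0, b1)), 10))  # the plain sum is (essentially) zero
print(sse(b0, b1))         # sum of squares for the least squares line
print(sse(b0 + 0.5, b1))   # shifting the line up makes it larger
print(sse(b0, b1 + 0.2))   # tilting the line makes it larger too
```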
Conditions for Regression
We check the same conditions for regression as we did for correlation. These are???
Quantitative Variables Condition
Straight Enough Condition
Outlier Condition
The Regression Line in Real Units In Algebra we learned that the equation of a line can be written as? ___________
$y = mx + b$
In Statistics we use slightly different notation:
$\hat{y} = b_0 + b_1 x$
$b_0$ is the y-intercept
$b_1$ is the slope
We use $\hat{y}$ (“y-hat”) to emphasize that the points that satisfy the equation are predictions made using a model, not data.
Back to Burgers! An Example The LSRL is shown for the Burger King sandwich data. The regression equation is:
To predict the fat content (!!“fat hat”!!) for a BK sandwich with 30g of protein, PLUG AND CHUG into the formula (do it now) !!
Conclude with a statement like: “According to our model, we expect a sandwich with 30 grams of protein to have about 36g of fat.”
35.9 g
Class Example…
Country      Marijuana (%)   Other Drugs (%)
Czech Rep    22              4
Denmark      17              3
England      40              21
Finland      5               1
Ireland      37              16
Italy        19              8
Ireland      23              14
Norway       6               3
Portugal     7               3
Scotland     53              31
USA          34              24
The table above shows the % of city teens who use Marijuana recreationally and the % of teens who use other, more dangerous drugs. Enter the data in your calculator lists and let’s build and analyze a model!!
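If you want to check your calculator's answer, the same model can be built by hand in a few lines of Python. This is only a sketch using the standard library; the list names are made up, and the data are the eleven rows of the table above, with marijuana use as x and other drug use as y.

```python
from statistics import mean, stdev

# Data from the class example table (percent of teens).
marijuana = [22, 17, 40, 5, 37, 19, 23, 6, 7, 53, 34]
other_drugs = [4, 3, 21, 1, 16, 8, 14, 3, 3, 31, 24]

n = len(marijuana)
x_bar, y_bar = mean(marijuana), mean(other_drugs)
s_x, s_y = stdev(marijuana), stdev(other_drugs)

# Correlation: sum of the products of z-scores, divided by n - 1.
r = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
        for x, y in zip(marijuana, other_drugs)) / (n - 1)

# Slope and intercept of the least squares line.
b1 = r * s_y / s_x
b0 = y_bar - b1 * x_bar

print(f"r = {r:.3f}, r^2 = {r * r:.3f}")
print(f"other-hat = {b0:.3f} + {b1:.3f} * marijuana")
```

The slope should come out near 0.6 and the intercept near -3, so compare those with what your calculator's linear regression reports.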
Correlation and the Line of Best Fit This figure shows the
scatterplot of the z-scores for fat and protein.
If a burger has an average protein content, it should also have an average fat content. So in z-score world, the LSRL passes through the origin: (0, 0).
This means that in the “real” world the LSRL passes through:
_________
$(\bar{x}, \bar{y})$
In z-score world, r is the slope of the LSRL!
Moving one standard deviation away from the mean of x moves us r standard deviations away from the mean in y.
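The claim that r is the slope in z-score world can be checked numerically. The sketch below reuses the class example drug data (any data set would do); the helper `z_scores` is a made-up name for illustration. It standardizes both variables, fits a least squares line to the z-scores, and compares the slope with r.

```python
from statistics import mean, stdev

# Class example data (percent of teens).
marijuana = [22, 17, 40, 5, 37, 19, 23, 6, 7, 53, 34]
other_drugs = [4, 3, 21, 1, 16, 8, 14, 3, 3, 31, 24]

def z_scores(values):
    """Standardize: subtract the mean, divide by the standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

zx, zy = z_scores(marijuana), z_scores(other_drugs)
n = len(zx)

# Correlation from z-scores.
r = sum(a * b for a, b in zip(zx, zy)) / (n - 1)

# Least squares line fitted to the z-scores themselves.
zx_bar, zy_bar = mean(zx), mean(zy)
slope = (sum((a - zx_bar) * (b - zy_bar) for a, b in zip(zx, zy))
         / sum((a - zx_bar) ** 2 for a in zx))
intercept = zy_bar - slope * zx_bar

print(round(r, 4), round(slope, 4))  # the two values agree
print(round(intercept, 10))          # essentially 0: the line passes through (0, 0)
```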
How Big Can Predicted Values Get?
Because we move r standard deviations in the y direction for every standard deviation in the x direction, and -1 ≤ r ≤ +1, each y-hat is no more standard deviations from its mean than the corresponding x value is from its mean.
This property of the linear model is called regression to the mean (predictions using the LSRL will tend toward the mean of y).
R²: Accounting for the Model’s Variation The variation in the
residuals is the key to assessing how well the model fits.
In our BK sandwich example, the standard deviation of the variable fat is 16.4 grams.
The standard deviation of the residuals (the errors after we’ve applied the model) is only 9.2 grams.
[Figure: “Original Data” vs. “After Model”]
R²: The Variation Accounted For (cont.) If the correlation were 1.0 and the
model predicted fat values perfectly, all the residuals would be zero (no variation).
As it is, the correlation is 0.83 - not perfect; but pretty good!
The model’s residuals have less variation than total fat alone. So the model “explains” some of the variability in fat by accounting for protein content.
To build a model that accounts for the remaining errors we would use multiple regression (beyond the scope of this course).
R²: The Variation Accounted For (cont.)
The correlation coefficient squared (r²) gives the percentage of the variance in the response variable accounted for by the model.
For the BK model, r² = 0.83² = 0.69. This tells us that 69% of the variability in fat can be explained by changes in protein levels.
The remaining 31% of the variability in fat is left in the residuals (“other” reasons).
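The numbers quoted for the BK example fit together, and a quick sketch shows how. The variable names are made up; the three values (r = 0.83, 16.4 g, 9.2 g) are the ones given on the slides. The fraction of variance left over is roughly the squared ratio of the two standard deviations, so one minus that ratio should land close to r². The match is approximate because the slide values are rounded and the residual standard deviation divides by n - 2 rather than n - 1.

```python
r = 0.83             # correlation between protein and fat
sd_fat = 16.4        # grams, standard deviation of fat
sd_residuals = 9.2   # grams, standard deviation of the residuals

r_squared = r ** 2
# Variance is the square of the standard deviation, so this is the
# (approximate) fraction of the variation in fat left in the residuals.
leftover = (sd_residuals / sd_fat) ** 2

print(round(r_squared, 2))     # 0.69
print(round(1 - leftover, 2))  # 0.69
print(round(leftover, 2))      # 0.31
```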
How Big Should R² Be? Well……(you guessed it!)……it depends!
R² is always between 0% and 100% (written as a decimal, it is never larger than |r|, and it is smaller unless r = 0, 1, or -1). What makes a “good” R² value depends on the kind of data you’re analyzing and what you want to do with it.
Examples?
Along with the slope and intercept for a regression, you should always report R² so that readers can judge for themselves how successful the regression is at fitting the data.
AP Test Requirements
For the AP test, you are expected to be able to find / analyze regression equations THREE ways:
1) Using summary statistics & formulas;
2) Using technology (calculators) when given data;
3) From computer-generated regression output.
Worksheet!!!
The Regression Line in Real Units Since the linear model integrates “z-score
world” (r), the line of best fit we use is a little more complicated than the ones you’re used to from your algebra courses…
We want our model to be useful in real units so we have to back out of z-score world...
To find a linear model’s slope, we use: $b_1 = r \dfrac{s_y}{s_x}$
To find the model’s y-intercept: $b_0 = \bar{y} - b_1 \bar{x}$
The model is then: $\hat{y} = b_0 + b_1 x$
Income vs. Housing Cost Example
A governmental organization is interested in building a model to predict a person’s housing costs based on the person’s income (using tax return data). They capture a sample of data and find that the mean income is $46,209 with a standard deviation of $7,004. The mean housing cost for this same sample is $324 with a standard deviation of $119; r = 0.62. Is a linear model appropriate? Find the regression equation. Explain what the slope and the y-intercept mean in context. Compute and interpret r-squared. The organization then decides to use this same data to predict a person’s income based on their housing costs. Find the new regression equation.
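Once you have worked the example by hand, a short sketch like the one below can check the arithmetic. The function name `lsrl_from_summary` is made up for illustration; it simply applies the slope formula (r times s_y over s_x) and the intercept formula (y-bar minus slope times x-bar) to the summary statistics given in the problem.

```python
# Summary statistics from the problem statement.
mean_income, sd_income = 46209, 7004
mean_housing, sd_housing = 324, 119
r = 0.62

def lsrl_from_summary(x_bar, s_x, y_bar, s_y, r):
    """Return (intercept, slope) of the LSRL predicting y from x."""
    b1 = r * s_y / s_x
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Predict housing cost from income.
b0, b1 = lsrl_from_summary(mean_income, sd_income, mean_housing, sd_housing, r)
print(f"housing-hat = {b0:.2f} + {b1:.5f} * income")

# Predict income from housing cost: swap the roles of x and y.
c0, c1 = lsrl_from_summary(mean_housing, sd_housing, mean_income, sd_income, r)
print(f"income-hat = {c0:.2f} + {c1:.2f} * housing")

print(f"r^2 = {r ** 2:.4f}")
```

Notice that the second equation is not just the first one solved for income. The two slopes multiply to r², not to 1, which is why the regression has to be redone when x and y trade places.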
Reading Computer Output – HP vs. MPG
Write the regression equation now…
$\widehat{mpg} = 38.4542 - 0.0918175\,(hp)$
More on Residuals
The linear model assumes the relationship between the two variables is a straight line. The residuals are the part of the data that hasn’t been accounted for by the model.
Actual Data = Model + Residual
or
Residual = Actual – Prediction (AP)
In symbols:
$e = y - \hat{y}$
Residuals (cont.) Residuals help us to see whether the
model is appropriate. When it is, there should be nothing interesting left behind in the residuals (just random error).
After we fit a regression model, we ALWAYS make a scatterplot of the residuals vs. x or y-hat (both show the same pattern) hoping to find……….. well, nothing interesting.
Residuals (cont.) The residual plot for the BK sandwich
regression looks random (no pattern) – and that’s good!
Residuals (cont.) This residual plot shows a pattern, indicating
that our assumption of linearity may be wrong.
Residuals (cont.) Another “bad” residual plot…
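To see what a “bad” pattern looks like in numbers, here is a sketch that fits a straight line to data that are deliberately curved. The data are hypothetical (y = x² for x = 1 to 7), chosen only so the pattern is obvious; the variable names are made up.

```python
# Hypothetical, deliberately curved data: y = x^2.
xs = [1, 2, 3, 4, 5, 6, 7]
ys = [x ** 2 for x in xs]

# Fit the least squares line anyway.
n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
      / sum((x - x_bar) ** 2 for x in xs))
b0 = y_bar - b1 * x_bar

# Residual = actual - predicted, listed in order of x.
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
for x, e in zip(xs, residuals):
    print(x, round(e, 2))
```

The residuals are positive at both ends and negative in the middle. That U shape is exactly the kind of “something interesting” that tells us a straight line is the wrong model.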
Residual Standard Deviation The standard deviation of the residuals, $s_e$, measures how spread out the points are around the regression line. The equation is:
$s_e = \sqrt{\dfrac{\sum e^2}{n - 2}}$
Once again, we don’t actually calculate this manually; our technology does it for us!!!
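For the curious, this is roughly what the technology is doing behind the scenes. The sketch below reuses the class example drug data and made-up variable names: it fits the LSRL, computes each residual, then takes the square root of the sum of squared residuals divided by n - 2.

```python
from math import sqrt

# Class example data (percent of teens).
marijuana = [22, 17, 40, 5, 37, 19, 23, 6, 7, 53, 34]
other_drugs = [4, 3, 21, 1, 16, 8, 14, 3, 3, 31, 24]

n = len(marijuana)
x_bar = sum(marijuana) / n
y_bar = sum(other_drugs) / n

# Least squares slope and intercept.
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(marijuana, other_drugs))
      / sum((x - x_bar) ** 2 for x in marijuana))
b0 = y_bar - b1 * x_bar

# Residuals, then the residual standard deviation (note the n - 2).
residuals = [y - (b0 + b1 * x) for x, y in zip(marijuana, other_drugs)]
s_e = sqrt(sum(e ** 2 for e in residuals) / (n - 2))

print(round(s_e, 2))  # in percentage points, the same units as the response
```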
The Residual Standard Deviation Examine the residual plot to make sure the residuals have about the same amount of scatter throughout (this is called the Equal Variance Assumption). A Normal Probability Plot or a Histogram of the residuals shows whether their distribution is roughly symmetric with no outliers.
Reality Check: Does the Regression Make Sense? When statistics are based on real data, the results
of a statistical analysis should reinforce your common sense.
If the results are surprising, then you’ve either learned something new about the world or your analysis is incorrect.
Which do you think is more likely?
When you perform a regression, think about what the coefficients mean and ask yourself whether they make sense.
What Can Go Wrong? Don’t fit a straight line to a nonlinear
relationship. Beware extraordinary points (y-values that
stand off from the linear pattern or extreme x-values).
Don’t infer that x causes y just because there is a good linear model for their relationship - association is not causation.
Don’t choose a model based on R² alone.
Get it??
http://xkcd.com/552/
What Have We Learned?
When the relationship between two quantitative variables is “straight enough”, a linear model can help summarize that relationship. The regression line doesn’t pass through all points, but
it is the “best” line because it minimizes the sum of squared residuals.
What Have We Learned? Correlation tells us lots of things:
The slope of the line is based on the correlation, adjusted for the units of x and y.
For each SD in x that we are away from x-bar, we expect to be r SDs in y away from y-bar.
Since r is always between –1 and +1, each y-hat is fewer SDs away from its mean than the corresponding x was (regression to the mean).
R² tells us the percent of the variation in the response accounted for by the regression model; the rest is error.