Chapter 5
ST 370 - Correlation and Regression
Readings: Chapter 11.1-11.4, 11.7.2-11.8, Chapter 12.1-12.2

Recap: So far we've learned:
• Why we want a 'random' sample and how to achieve it (sampling scheme)
• How to use randomization, replication, and control/blocking to create a valid experiment
• Methods for summarizing and graphing data in meaningful ways
• Analyzing CRD-type designs to investigate which factors are important for a particular response
• Next we'll look at how to deal with quantitative explanatory variables and a quantitative response

Multi-Way ANOVA vs Linear Regression
• Multi-Way ANOVA: used with categorical explanatory variable(s) and a quantitative response
  – Determines which factors are significantly associated with the response
• Correlation and Linear Regression: used with quantitative explanatory variable(s) and a quantitative response
  – Determines if there is a significant linear relationship between the variables
  – Not necessarily cause and effect!
What would a correlation of 1 or -1 look like on a scatterplot?
• Not resistant to outliers, why?
r = 1: all points fall on a perfectly straight increasing line; r = -1: all points fall on a perfectly straight decreasing line.

Its calculation depends on the sample means and sample variances, which are sensitive to outliers.

Example: For the cups of coffee and quiz score example, we can find the sample correlation, r (which estimates the true underlying population correlation, ρ), and interpret it.
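As a sketch of the computation, the sample correlation can be found directly from its definition, r = S_xy / sqrt(S_xx * S_yy). The (x, y) values below are hypothetical stand-ins, since the coffee/quiz data are not reproduced in these notes.

```python
from math import sqrt

def sample_correlation(x, y):
    # r = S_xy / sqrt(S_xx * S_yy), using the sums of squared deviations
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    s_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    s_xx = sum((xi - xbar) ** 2 for xi in x)
    s_yy = sum((yi - ybar) ** 2 for yi in y)
    return s_xy / sqrt(s_xx * s_yy)

x = [1, 2, 3, 4, 5]    # cups of coffee (hypothetical values)
y = [5, 7, 8, 11, 12]  # quiz scores (hypothetical values)
r = sample_correlation(x, y)
```

Note that points lying exactly on an increasing line give r = 1, matching the answer above.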
Linear Regression

What if I want to use an explanatory variable (x) to predict a response (y)?

• Simple Linear Regression (SLR) finds the line of best fit between one quantitative explanatory variable x and the response y measured on the same subjects.
• Data come in the form of pairs of observations (x1, y1), (x2, y2), ..., (xn, yn) - the same setup as correlation data.
Often done using a model of the form:
Note: β0 and β1 are true unknown parameters. When we have data we 'fit' this model and get estimates denoted by
E(Y|x) = beta0 + beta1 * x;  Y = beta0 + beta1 * x + epsilon, where epsilon is the random error term with mean 0 and variance sigma^2.
beta0_hat and beta1_hat.
How to get the estimates?
Names for the line with the estimates plugged in: 'fitted line', 'regression line', 'prediction line', 'prediction equation', or 'least squares line'.
The method of least squares: L = sum_{i=1,...,n} (y_i - beta0 - beta1 * x_i)^2. We want to find beta0 and beta1 to minimize L. The resulting minimizers are called the least squares estimators: beta0_hat = ybar - beta1_hat * xbar and beta1_hat = S_xy / S_xx, where S_xy = sum_{i=1,...,n} (y_i - ybar)(x_i - xbar), S_xx = sum_{i=1,...,n} (x_i - xbar)^2, and xbar and ybar are the sample averages of the x's and y's, respectively.
fitted line: y_hat = beta0_hat + beta1_hat * x
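The least squares formulas above translate directly into code. This is a minimal sketch with hypothetical data (chosen to lie exactly on y = 1 + 2x, so the estimates can be checked by eye); it is not the course data.

```python
def least_squares(x, y):
    # beta1_hat = S_xy / S_xx, beta0_hat = ybar - beta1_hat * xbar
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    s_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    s_xx = sum((xi - xbar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    b0 = ybar - b1 * xbar
    return b0, b1

# points on the exact line y = 1 + 2x, so we expect b0 = 1, b1 = 2
b0, b1 = least_squares([1, 2, 3, 4], [3, 5, 7, 9])
```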
Interpretation of line:
Major use of Regression line!!
Useful for predicting the response (called y_hat) for a given value of x
• Just plug the new x into the equation!
• Warning! Extrapolation is bad!
y_hat = beta0_hat + beta1_hat * x_new
Extrapolation means you predict the response for an x value that is outside the range of the x values in the data used to fit the least squares line. In the figure above, for example, predicting the response at x = 0 would be extrapolation.
Interpretation of parameter estimates:
• beta0_hat = b0 = sample intercept
• beta1_hat = b1 = sample slope
Where the fitted line intercepts the Y-axis, i.e., the value of y_hat when x = 0.
The change in the fitted response (y_hat) when the explanatory variable (x) increases by 1 unit. It can be positive or negative, depending on the sign of beta1_hat.

The residual, e_i = y_i - y_i_hat, measures the difference between the actual response value and the fitted/predicted response value based on the obtained least squares line.
Here y_i_hat = beta0_hat + beta1_hat * x_i is called the fitted (or predicted) value when x = x_i.
The estimated error variance: sigma2_hat = SS(E) / (n - 2) = MSE, where n is the number of observations (x_i, y_i)'s, and SS(E) = sum_{i=1,...,n} e_i^2 = sum_{i=1,...,n} (y_i - y_i_hat)^2.
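The residuals and the estimated error variance MSE = SS(E)/(n - 2) can be sketched as follows, again with hypothetical data rather than the course data.

```python
def fit_and_mse(x, y):
    # fit the least squares line, then compute residuals e_i = y_i - y_i_hat
    # and the estimated error variance MSE = SS(E) / (n - 2)
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
          / sum((xi - xbar) ** 2 for xi in x))
    b0 = ybar - b1 * xbar
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    sse = sum(e ** 2 for e in resid)
    return resid, sse / (n - 2)

resid, mse = fit_and_mse([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1])
```

A useful sanity check: least squares residuals always sum to (numerically) zero.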
Example: Coffee and Quiz Score
1. Find the fitted least squares regression line.
2. Interpret the slope.
3. Interpret the intercept. Is it meaningful in this example?
4. Jenny happens to stop in the coffee shop. She notices your plot and says, 'I had 5 cups of coffee the night before the quiz. Guess what my score was!'
5. Jenny actually scored a 14. Find her residual.
6. If Enrique drank 12 cups of coffee, can we safely predict his quiz score?
For each additional cup of coffee, the predicted quiz score increases by 1.03 points.

The predicted quiz score when drinking 0 cups of coffee. Here 0 is a meaningful value for x, so the intercept is meaningful in this example.
y_hat = 3.96 + 1.03 * 5 = 9.11
e = 14 - 9.11 = 4.89
We should be cautious since 12 is outside the range of the data (1 to 10 cups). If the same linear relationship still holds, the predicted quiz score is 3.96 + 1.03 * 12 = 16.32.
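The arithmetic in the coffee example can be checked with the fitted line y_hat = 3.96 + 1.03x reported in the notes:

```python
def y_hat(x):
    # fitted line from the notes: y_hat = 3.96 + 1.03 x
    return 3.96 + 1.03 * x

pred_jenny = y_hat(5)           # predicted score for 5 cups: 9.11
resid_jenny = 14 - pred_jenny   # Jenny's residual: 14 - 9.11 = 4.89
pred_enrique = y_hat(12)        # 16.32, but x = 12 is extrapolation
```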
Note: Always Graph your Data
• To check linearity assumption.
• Four data sets below give the same fitted line, but only the first one makes sense!
Note: Causation vs Correlation
• Association between variables does not mean that one variable causes the changes in another.
• Only well-designed experiments can establish that an explanatory variable causes a change in a response variable.
The relationship is not linear here.

There is one outlier in the data.

The explanatory variable is not quantitative.
How do we know we have a Significant Linear Relationship?
IDEA:
• Given data, we find
• An estimate of the (true) mean response
• If we did another experiment would we get the same data??
• Need to do a statistical test to see if there is a significant linear relationship between the explanatory and response variables.
What ’hypothesis’ do we want to test?
the best fitted line using the least squares method.
y_hat = beta0_hat + beta1_hat * x
In our example, we want to know whether there is a statistically significant relationship between cups of coffee and quiz score.
H_0: beta1 = 0 vs. H_a: beta1 is not equal to 0. Under H_0, the linear model becomes Y = beta0 + epsilon. In other words, the responses are randomly distributed around a constant beta0 and have no relationship with the explanatory variable X. Under H_a, there is a linear relationship between the response variable and explanatory variable, and the direction of the association (positive vs. negative) depends on the sign of beta1.
How to investigate this hypothesis? ........ ANOVA!
ANOVA = ANalysis Of VAriance. What variation are we measuring here?
Which line above gives more evidence of a significant linear relationship? Why?
A measure of the total amount of variability in the response is the sample variance of Y. Consider its numerator, which we call the total sum of squares:

SS(Tot) = sum_{i=1,...,n} (y_i - ybar)^2   (variability of observations about the mean)

df_Tot = n - 1
The left panel, because the data are less spread out around the fitted line; in other words, the fitted line explains more of the variation in the data compared with the right panel.
This variability is partitioned into independent components.

Sum of squares due to regression, SS(R) (or sum of squares model, SS(M)):

SS(R) = sum_{i=1,...,n} (y_i_hat - ybar)^2   (variability of the fitted line about the mean)

df_R = 1

Sum of squares due to error, SS(E):

SS(E) = sum_{i=1,...,n} e_i^2 = sum_{i=1,...,n} (y_i - y_i_hat)^2   (variability of observations about the fitted line)

df_E = n - 2

Note: SS(Tot) = SS(R) + SS(E) and df_Tot = df_R + df_E.
(Recall: df = number of independent pieces of data for the sum of squares.)
The ANOVA table from simple linear regression:

Source       df      Sum of Squares   Mean Square           F-Ratio
Regression   1       SS(R)            MS(R) = SS(R)/1       F = MS(R)/MS(E)
Error        n - 2   SS(E)            MS(E) = SS(E)/(n-2)
Total        n - 1   SS(Tot)
Meaning of pieces very similar to Multi-Way ANOVA!
• mean squares - represent standardized measures of variation due to the different sources and are given by SS(source)/df_source.
• F-Ratio - ratios of mean squares often follow an F-distribution and are appropriate for testing different hypotheses of interest.

In this case, to test H0: β1 = 0 vs H1: β1 ≠ 0,

use the test statistic F = MS(R)/MS(E).

This is used to find a p-value which can then be used to determine if we have a statistically significant result.
MS(E) is an estimator of sigma^2.
The larger the F statistic, the smaller the p-value, and thus the stronger the evidence to reject the null hypothesis and claim that the response and explanatory variable have a significant linear relationship.
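The whole decomposition can be sketched directly from the definitions above: fit the line, split SS(Tot) into SS(R) + SS(E), and form F = MS(R)/MS(E). Data are hypothetical, not the course data; the p-value step (which StatCrunch supplies) is omitted here.

```python
def anova_f(x, y):
    # least squares fit, then the ANOVA decomposition for SLR
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
          / sum((xi - xbar) ** 2 for xi in x))
    b0 = ybar - b1 * xbar
    fitted = [b0 + b1 * xi for xi in x]
    ss_tot = sum((yi - ybar) ** 2 for yi in y)                 # df = n - 1
    ss_r = sum((fi - ybar) ** 2 for fi in fitted)              # df = 1
    ss_e = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))    # df = n - 2
    ms_r, ms_e = ss_r / 1, ss_e / (n - 2)
    return ss_tot, ss_r, ss_e, ms_r / ms_e

ss_tot, ss_r, ss_e, f = anova_f([1, 2, 3, 4, 5], [2.0, 4.1, 5.9, 8.2, 9.8])
```

The identity SS(Tot) = SS(R) + SS(E) holds (up to rounding) for any least squares fit.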
Conducting the analysis in StatCrunch
Questions:
1. Is there a statistically significant linear relationship between cups of coffee and quiz score? Why?
2. What is the error sum of squares?
3. What is the estimated error variance?
4. What are the estimated standard errors of the least squares estimates, beta0_hat and beta1_hat?
r^2, the 'coefficient of determination'

r^2 = SS(Model)/SS(Tot) - Interpretation:

1. If r^2 is high, then the line is good for prediction.
2. In the coffee example, we obtained r^2 = 0.8298. Interpretation:
It is a measure of the proportion of variation in Y that can be explained by the fitted least squares line based on X. Often expressed as a percentage.
It means about 83% of the variation in the quiz score can be explained by the fitted least squares line based on the cups of coffee.
Some remarks: A steep slope (positive or negative) does not necessarily indicate a strong linear relationship, and a small (in absolute value) slope does not indicate a weak one; we can have a perfect linear relationship along a fairly flat line. That is why we use correlation, a unit-free measure, to express the strength and direction of the linear relationship between X and Y.
r^2 ranges from 0 to 100% and is equal to the square of the sample correlation.
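The identity r^2 = SS(R)/SS(Tot) = (sample correlation)^2 can be verified numerically; a minimal sketch with hypothetical data:

```python
from math import sqrt

def r_and_r2(x, y):
    # returns (r, r^2 computed as SS(R)/SS(Tot)) to check they agree
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    s_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    s_xx = sum((xi - xbar) ** 2 for xi in x)
    s_yy = sum((yi - ybar) ** 2 for yi in y)     # = SS(Tot)
    r = s_xy / sqrt(s_xx * s_yy)
    b1 = s_xy / s_xx
    b0 = ybar - b1 * xbar
    ss_r = sum((b0 + b1 * xi - ybar) ** 2 for xi in x)
    return r, ss_r / s_yy

r, r2 = r_and_r2([1, 2, 3, 4, 5], [1.9, 4.2, 5.8, 8.1, 10.0])
```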
The sample correlation r measures the strength and direction of the linear relationship between X and Y. The sign of r is the same as the sign of the slope estimate.
beta1_hat = sqrt(S_yy / S_xx) * r, where S_yy = sum_{i=1,...,n} (y_i - ybar)^2.
Example: One type of fuel is biodiesel, which comes from plants. An experiment was done to determine how much biodiesel could be generated from a certain type of plant grown in different media. The final biomass was also recorded on the 44 plants from the experiment. Let's consider these two variables, the log of biodiesel and biomass.
proc reg data=bioexp;
  model logbiodiesel = biomass / clb; * clb: confidence limits for the parameter estimates;
run;
1. Describe the scatterplot (3 things!).
2. Give the value of the sample correlation.
Form: linear; Strength: moderate; Direction: positive association.
r =
Note: the 95% prediction interval is usually wider than the 95% confidence interval. Recall our linear model is Y = beta0 + beta1 * x + epsilon. The confidence interval is for the mean response, estimated by y_hat = beta0_hat + beta1_hat * x; the prediction interval is for a new observation y = beta0 + beta1 * x + epsilon, which carries the extra variability of epsilon.
r = +sqrt(0.3348) = 0.58 (positive because the slope estimate is positive).
Output From Proc Reg for Biomass and Log(Biodiesel) Example 1

The REG Procedure
Model: MODEL1
Dependent Variable: logBiodiesel
Consider the regression of Y = brain size (MRI counts) on x = Height (in). Use the output below to answer the following:

1. Interpret the scatterplot and state the correlation.
2. What is the % of variability in the MRI counts explained by the data?
3. Determine the fitted regression line.
4. Is there a significant linear relationship between the variables? Why or why not?
5. Does the intercept have a meaningful interpretation?
6. Predict brain size for an individual that is 60 inches tall.
fitted least squares line?
Form: linear; Strength: moderate; Direction: positive; correlation = 0.6.
36.2%
MRI_Count = 153834.09 + 11022.732*Height
Yes, there is, because the p-value in the ANOVA table is < 0.0001.
No, since no one will have height 0.
MRI_count = 153834.09 + 11022.732 * 60 = 815198.01
Multiple Linear Regression (Optional Reading)

These ideas can easily be extended to the case of more than 1 quantitative explanatory variable. This is called Multiple Linear Regression:

Model: Y_i = beta0 + beta1 * x_i1 + beta2 * x_i2 + ... + betap * x_ip + E_i

• The beta_j are called 'partial slopes'
• The x's can be quadratic values, cubics, or interaction terms
  – e.g. x1^2, x1^3, x1 * x2
• Fitted model: y_hat = b0 + b1 * x1 + b2 * x2 + ... + bp * xp
• The model is often still fit using 'least squares', i.e. minimize SS(E) = sum_{i=1,...,n} (y_i - y_i_hat)^2
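A minimal sketch of a multiple-regression least squares fit, using numpy's lstsq on the design matrix [1, x1, x2]. The data below are hypothetical (generated from an exact plane so the recovered coefficients are known); the Devore soil-adsorption data are not reproduced in these notes.

```python
import numpy as np

# hypothetical predictors and a response lying exactly on a plane
x1 = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
x2 = np.array([5.0, 3.0, 8.0, 6.0, 9.0])
y = 2.0 + 0.5 * x1 + 1.5 * x2          # true plane: b0=2, b1=0.5, b2=1.5

# design matrix with a column of ones for the intercept
X = np.column_stack([np.ones_like(x1), x1, x2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)   # b = (b0, b1, b2)

fitted = X @ b
sse = float(np.sum((y - fitted) ** 2))      # SS(E), minimized by least squares
```

Since the responses lie exactly on the plane, the fit recovers the coefficients and SS(E) is essentially zero.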
Example (taken from Probability and Statistics, Devore): Soil and sediment adsorption, the extent to which chemicals collect in a condensed form on the surface, is an important characteristic influencing the effectiveness of pesticides and various agricultural chemicals. A study was done on 13 soil samples that measured Y = phosphate adsorption index, X1 = amount of extractable aluminum, and X2 = amount of extractable iron. The data are given below: