lecture 21b practice problems - Laulima : Gateway · 2017. 11. 7. · Lecture 21b: Practice Problems for Lecture 21 “Regression and Correlation” Everything that appears in these


Statistics 21b_practice.pdf

Michael Hallstone, Ph.D.

Lecture 21b: Practice Problems for Lecture 21 “Regression and Correlation” Everything that appears in these lecture notes is fair game for the test. They are the best “study guide” I can provide. It is impossible to provide a “list” that is more comprehensive than the lecture notes above. However, here are a few additional practice exercises or practice concepts.

Do the following steps for a Regression/Correlation Analysis for the problems below:

Step 1: discuss the probability of a logical relationship between your two variables. You want to discuss
• whether it should be cause and effect, common cause, or no logical relationship
• whether you expect the relationship to be positive or negative
• whether or not it is linear in nature

Step 2: draw a scatter plot and look for evidence of a linear relationship. Discuss what you see in the scatter plot (i.e. does it look linear or not? Is it positive or negative?)

Step 3: do the 7 steps for a regression test. I won’t make you do the computations for step 6 by hand in this class. But if given the SPSS output you will have to
• find the “TR” value and make the correct statistical decision (regarding whether or not to reject the null hypothesis)
• find the p-value and use it to make the correct statistical decision (regarding whether or not to reject the null hypothesis)
• write out the equation for the regression line using the slope and y-intercept from the output
• correctly draw the regression line on the scatter plot

Step 4: What does r² mean? I won’t make you compute r² by hand in this class. But if given SPSS output you will have to
• be able to find r²,
• know if the regression is significant,
• and if the regression is significant say what r² means in plain English

Step 5: based upon the SPSS output
• write out the equation for the regression line (“y-hat” line)
• draw the line on your scatter plot

Step 6: What does r mean? I won’t make you compute r by hand in this class. But if given SPSS output you will have to
• be able to find r,
• know if the regression is significant,
• and if the regression is significant say what r means in plain English
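The quantities named in the steps above (slope, y-intercept, r, r² and the TR value) can be sketched in a few lines of Python. This is only an illustration of the standard least-squares formulas, not the SPSS procedure itself; the function name `simple_regression` and the small data set are made up for the example.

```python
import math

def simple_regression(x, y):
    """Least-squares fit of y on x, plus the test ratio (TR) for H0: B = 0."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squares and cross-products around the means.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    r2 = r ** 2
    # TR is compared with a t distribution with df = n - 2.
    # (A perfect fit, r2 == 1, is not handled in this sketch.)
    tr = r * math.sqrt(n - 2) / math.sqrt(1 - r2)
    return {"slope": slope, "intercept": intercept,
            "r": r, "r2": r2, "tr": tr, "df": n - 2}

# Made-up data, only to show the call.
result = simple_regression([0, 1, 2, 3], [1, 3, 2, 5])
print(result)
```

The returned TR is then compared with the critical t value for df = n-2, exactly as in the decision rule of the 7 steps.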


1. Size of the population = x and amount of tax revenue generated = y with α =.05

A public administrator wants to forecast her budget in the coming years. She knows that the amount of money available to her agency will be based upon tax revenue and she knows that tax revenue fluctuates with the ebb and flow of the population. The following data are forecasts of population and tax revenue for the county in which she works; the SPSS output follows.

data point      x       y
1             146    12.4
2              85     6.1
3              21     2.4
4              47     4.3
5             115     9.5
6              90     7.6
7             100     8.5
sum=          604    50.8
mean=        86.3     7.3

SPSS output for population and tax revenue

ANSWER BELOW

Step 1: discuss the probability of a logical relationship between your two variables:

In general we would expect a cause and effect relationship that is linear and positive in nature. One might argue it is common cause, but in general tax revenue is based upon how many people are paying taxes. Of course it depends upon the type of taxes, as property tax, sales tax, income tax, etc. will all differ slightly with respect to population. But in general, the more people in the county, the more “tax base” exists, so the relationship is positive, and we would expect a generally linear relationship: as population continues to grow, taxes should as well. We would not expect taxes to decline significantly when a population continues to grow, so it is probably not a curvilinear relationship.

Step 2: draw a scatter plot and look for evidence of a linear relationship. Discuss what you see in the scatter plot (i.e. does it look linear or not? Is it positive or negative?)

Based upon this scatter plot it looks like a LINEAR POSITIVE relationship.

Step 3: do the 7 steps for a regression test

1. State null and alternative hypothesis. H0: B = 0 H1: B ≠ 0

2. State level of significance or α “alpha.” For this problem alpha =.05

3. Determine the test distribution to use – use Z if # of data points (x,y) >30 otherwise use t. [If t is used: df=n-2 where n = # of data points (x,y)]

In this case we have 7 pairs of data so n=7. Use a t distribution with df=n-2 or df=7-2=5

4. Define the rejection regions. And draw a picture! ALL regression tests are TWO TAILED tests so α/2 goes into each tail.

In this case critical t value = 2.571

5. State the decision rule. Reject null if TR> 2.571 or TR< - 2.571 otherwise FTR.

6. Perform necessary calculations on data and compute TR value.

SPSS output below


TR= 15.096

Step 7: Compare TR value with the decision rule and make a statistical decision. (Write out decision in plain English)

15.096 is greater than 2.571 so we reject the null and conclude the alternative. We conclude that there is a slope to the population regression line and a meaningful regression relationship does exist between the size of the population and the amount of tax revenue generated. We are at least 95% confident of this statement.

Step 4:

• r² = .979 or approx. 98%
• the regression is significant based upon the decision in step 7 or p=.000 (SPSS shows .000 when p < .001)
• since the regression is significant, r² of 98% means that 98% of the variation in tax revenue in the county can be explained by the population (of citizens) in the county
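To see where the "percent of variation explained" reading of r² comes from, the following sketch recomputes it from the problem 1 data as 1 minus (unexplained variation / total variation). It is a hand check under the usual least-squares definitions, not a reproduction of the SPSS output; the variable names are illustrative.

```python
# Population (x) and tax revenue (y) from problem 1.
x = [146, 85, 21, 47, 115, 90, 100]
y = [12.4, 6.1, 2.4, 4.3, 9.5, 7.6, 8.5]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Least-squares slope and y-intercept.
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Total variation in y, and the part left over after using the line.
total_variation = sum((yi - mean_y) ** 2 for yi in y)
unexplained = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

r_squared = 1 - unexplained / total_variation
print(round(slope, 3), round(intercept, 3), round(r_squared, 3))
```

The slope, intercept and r² printed here should agree with the values read from the SPSS output to three decimals.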

Step 5: based upon the SPSS output
• write out the equation for the regression line: y-hat = .079x + .415
• draw the line on your scatter plot: I can’t do this on the computer. One point is (0, 0.415). Plug another x into the equation to get a corresponding y. Use x=10 or something easy: y = .079(10) + .415 = 1.205, so a second point is (10, 1.205).
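Finding the two points needed to draw the line is just plugging x values into the equation. A minimal sketch, using the rounded slope and intercept from the output (the function name `y_hat` is only a label for this example):

```python
def y_hat(x):
    """Predicted tax revenue from the fitted line y-hat = .079x + .415."""
    return 0.079 * x + 0.415

# Two points are enough to draw a straight line.
# x = 0 and x = 10 follow the text; any other convenient x works too.
points = [(x, round(y_hat(x), 3)) for x in (0, 10)]
print(points)
```

Connecting the two points with a ruler gives the regression line on the scatter plot.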


Step 6:
• r = .989
• the regression is significant based upon step 7 or p=.000
• since the regression is significant, r indicates a STRONG POSITIVE CORRELATION


2. number of prior convictions = x and length of current sentence = y with α =.05

(This is sort of a dumb example as it’s obvious that there is a relationship, but please just use it for practice.) Pretend the sentencing law for drug offenses in California dictates that judges must take into account the number of prior convictions when they sentence a person who is convicted of a drug crime. Pretend that as the number of prior convictions rises, the length of the current sentence gets longer. That makes sense: the judge will sentence you to “more time” if you have more prior convictions and “less time” if you have fewer prior convictions. A prison administrator is trying to figure out if the judges are following the law and collects the following data where x= the number of priors and y= the length of the current sentence.

data point      x       y
1               0     1.5
2               1       3
3               2       5
4               3       7
5               4       6
6               5      10
7               6      10

SPSS output

ANSWER BELOW

Step 1: discuss the probability of a logical relationship between your two variables

In general we would expect a cause and effect relationship that is linear and positive in nature. If the law dictates that the current sentence gets longer (bigger) as the number of prior convictions grows, then it is a positive relationship. Depending upon how the law is written, it may be a linear relationship that levels off, as we would expect a limit to the length of sentence regardless of how many prior convictions there are: after a certain number of priors the judge may stop adding time. We would not expect the judge to sentence someone to 100 years or 150 years or even 50 years for a drug offense.

Step 2: draw a scatter plot and look for evidence of a linear relationship. Discuss what you see in the scatter plot (i.e. does it look linear or not? Is it positive or negative?)


Based upon this scatter plot it looks like there is a LINEAR POSITIVE relationship between the two variables.

Step 3: do the 7 steps for a regression test

1. State null and alternative hypothesis. H0: B = 0 H1: B ≠ 0

2. State level of significance or α “alpha.” For this problem alpha =.05

3. Determine the test distribution to use – use Z if # of data points (x,y) >30 otherwise use t. [If t is used: df=n-2 where n = # of data points (x,y)]

In this case we have 7 pairs of data so n=7. Use a t distribution with df=n-2 or df=7-2=5

4. Define the rejection regions. And draw a picture! ALL regression tests are TWO TAILED tests so α/2 goes into each tail.

In this case critical t value = 2.571

5. State the decision rule. Reject null if TR> 2.571 or TR< - 2.571 otherwise FTR.

6. Perform necessary calculations on data and compute TR value. SPSS Output


TR=7.950

Step 7: Compare TR value with the decision rule and make a statistical decision. (Write out decision in plain English)

7.950 is greater than 2.571 so we reject the null and conclude the alternative. We conclude that there is a slope to the population regression line and a meaningful regression relationship does exist between the number of prior convictions and the length of the current sentence. We are at least 95% confident of this statement.

Step 4:

• r² = .927 or approx. 93%
• the regression is significant based upon the decision in step 7 or p=.001
• since the regression is significant, r² of 93% means that 93% of the variation in length of current sentence can be explained by number of prior convictions

Step 5: based upon the SPSS output
• write out the equation for the regression line: y-hat = 1.446x + 1.732
• draw the line on your scatter plot: I can’t do this on the computer. One point is (0, 1.732). Plug another x into the equation to get a corresponding y. Use x=10 or something easy.


Step 6:
• r = .963
• the regression is significant based upon step 7 or p=.001
• since the regression is significant, r indicates a STRONG POSITIVE CORRELATION
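Although step 6 of the regression test is not done by hand in this class, the TR value can be checked as the slope divided by its standard error. The sketch below applies that textbook formula to the problem 2 data; it is an assumption that this is how the reported TR was produced, and the variable names are illustrative.

```python
import math

# Prior convictions (x) and length of current sentence (y) from problem 2.
x = [0, 1, 2, 3, 4, 5, 6]
y = [1.5, 3, 5, 7, 6, 10, 10]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Sum of squared residuals around the fitted line.
sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
# Standard error of the slope, using df = n - 2.
se_slope = math.sqrt(sse / (n - 2)) / math.sqrt(sxx)

tr = slope / se_slope
reject_null = abs(tr) > 2.571  # critical t for alpha = .05, two-tailed, df = 5
print(round(slope, 3), round(intercept, 3), round(tr, 3), reject_null)
```

The TR computed this way should match the 7.950 used in step 7, up to rounding.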


3. number of injuries = x and number of overtime shifts = y with α =.05

A prison administrator is trying to figure out whether the number of injuries amongst prison guards can predict the number of overtime shifts (worked by prison guards). Since overtime shifts “kill” her budget she wants to see if a meaningful relationship exists and, if it does, she will try to make the workplace safer and thus reduce overtime shifts (and costs). She collects the following data where x= the number of injuries amongst prison guards and y= the number of overtime shifts worked by prison guards.

data point      x       y
1               0      10
2               2      20
3               3      30
4               4      30
5               5      40
6               5      40
7               5      45

SPSS output for # of injuries and overtime shifts

ANSWER BELOW

Step 1: discuss the probability of a logical relationship between your two variables

In general we would expect a cause and effect relationship that is linear and positive in nature. It makes sense that injured workers would mean a “gap” or “hole” in the schedule and would require another worker to fill in for that shift. That would mean the worker who fills in is working an extra shift and thus gets overtime. Of course this is dependent upon the fact that the prison administrator does NOT have “part time” prison guards who could fill in for the injured workers.


Step 2: draw a scatter plot and look for evidence of a linear relationship. Discuss what you see in the scatter plot (i.e. does it look linear or not? Is it positive or negative?)

This scatter plot seems to show a POSITIVE LINEAR relationship, more or less.

Step 3: do the 7 steps for a regression test









6. Perform necessary calculations on data and compute TR value.

SPSS Output

TR=9.731

Step 7: Compare TR value with the decision rule and make a statistical decision. (Write out decision in plain English)

9.731 is greater than 2.571 so we reject the null and conclude the alternative. We conclude that there is a slope to the population regression line and a meaningful regression relationship does exist between the number of injuries and the number of overtime shifts. We are at least 95% confident of this statement.

Step 4:

• r² = .950 or approx. 95%
• the regression is significant based upon the decision in step 7 or p=.000 (SPSS shows .000 when p < .001)
• since the regression is significant, r² of 95% means that 95% of the variation in the number of overtime shifts worked by prison guards can be explained by the number of injuries suffered by prison guards. (If the administrator can reduce the number of injuries she could probably reduce her overtime costs.)

Step 5: based upon the SPSS output
• write out the equation for the regression line: y-hat = 6.349x + 8.947
• draw the line on your scatter plot: I can’t do this on the computer. One point is (0, 8.947).
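Once the line is written out, it can be used the way the administrator intends: to predict overtime shifts for a given number of injuries. A short sketch using the rounded equation above (the function name is hypothetical, and predictions are only sensible for x values inside the observed range of 0 to 5 injuries):

```python
def predicted_overtime_shifts(injuries):
    """y-hat = 6.349x + 8.947, from the SPSS output for problem 3."""
    return 6.349 * injuries + 8.947

# Predictions at a few x values inside the range of the data.
for injuries in (0, 2, 5):
    print(injuries, round(predicted_overtime_shifts(injuries), 1))

# Predicted difference between 5 injuries and 2 injuries:
# three injuries times the slope.
change = predicted_overtime_shifts(5) - predicted_overtime_shifts(2)
print(round(change, 3))
```

The slope says each additional injury goes with about 6.3 more overtime shifts on average; it describes the fitted line and is not a guarantee for any single period.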





4. number of prior DUI convictions = x and BAC (Blood Alcohol Content) = y with α =.05

(As you will see in the discussion of the logical relationship below, there is a problem with the relationship between these two variables, but we will use this for practice purposes.) This is an actual “problem” from my DUI study in Honolulu but with fake data. It seems reasonable that people with more prior DUI convictions have a significant problem with alcohol and get “drunker” than those with fewer DUI convictions. So people with more prior DUI convictions will probably have a higher BAC. Below is the fake data:

data point      x       y
1               0    0.10
2               1    0.12
3               2    0.15
4               3    0.16
5               4    0.20
6               5    0.21
7               6    0.22

SPSS Output

ANSWER BELOW

Step 1: discuss the probability of a logical relationship between your two variables:

In general we would expect a relationship that is linear and positive in nature. However, we would not expect it to be direct cause and effect; we would expect it to be a common cause relationship: the number of prior DUI convictions does NOT directly cause an increase in BAC. But there is something associated with prior DUI convictions (people with more prior DUI convictions are severe “alcoholics”) that probably also causes an increase in BAC. So this is not really appropriate for simple linear regression and is more appropriate for multiple regression (where we examine multiple x’s or multiple independent variables).



This scatter plot seems to indicate a POSITIVE and LINEAR relationship between the two variables.

Step 3: do the 7 steps for a regression test










SPSS Output

TR=13.536

Step 7: Compare TR value with the decision rule and make a statistical decision. (Write out decision in plain English)

13.536 is greater than 2.571 so we reject the null and conclude the alternative. We conclude that there is a slope to the population regression line and a meaningful regression relationship does exist between the number of prior convictions and BAC. We are at least 95% confident of this statement.

Step 4:

• r² = .973 or approx. 97%
• the regression is significant based upon the decision in step 7 or p=.000 (SPSS shows .000 when p < .001)
• since the regression is significant, r² of 97% means that 97% of the variation in BAC is explained by the number of prior DUI convictions. (Again, this is probably a common cause relationship and NOT direct cause and effect, but we’d need to do a multiple regression to find out more about this relationship.)

Step 5: based upon the SPSS output
• write out the equation for the regression line: y-hat = .021x + 0.103
• draw the line on your scatter plot: I can’t do this on the computer. One point is (0, .103). Plug another x into the equation to get a corresponding y. Use x=10 or something easy.




5. number of police patrols = x and crime rate = y with α =.05

A police department administrator wants to know if increasing the number of police patrols will reduce the crime rate in her city. Pretend this is data where x= number of police patrols and y= the crime rate.

data point      x       y
1               2      10
2               3       9
3               4       9
4               5       8
5               6       7
6               7       7
7               8       7

SPSS output

ANSWER BELOW

Step 1: discuss the probability of a logical relationship between your two variables:

In general it is logical to hope for a direct cause and effect relationship that is NEGATIVE in nature. Whether or not it is a linear relationship is a bit harder to discern, but if increased police patrols actually deter potential criminals then we would expect the crime rate to decline, although it is hard to say if it will continue to drop in a purely linear fashion. If increasing police patrols actually reduces crime, we might expect that there needs to be a significant increase in police patrols before criminals notice it, and then perhaps even with a whole lot of police patrols the truly hard core criminals will no longer be deterred. So perhaps it is a negative curvilinear relationship.



The relationship appears negative and somewhat linear.

Step 3: do the 7 steps for a regression test










SPSS Output

TR= -6.994

Step 7: Compare TR value with the decision rule and make a statistical decision. (Write out decision in plain English)

-6.994 is less than -2.571 so we reject the null and conclude the alternative. We conclude that there is a slope to the population regression line and a meaningful regression relationship does exist between the number of police patrols and the crime rate. We are at least 95% confident of this statement.

Step 4:

• r² = .907 or approx. 91%
• the regression is significant based upon the decision in step 7 or p=.001
• since the regression is significant, r² of 91% means that 91% of the variation in the crime rate is explained by the number of police patrols.

Step 5: based upon the SPSS output
• write out the equation for the regression line: y-hat = -.536x + 10.821
• draw the line on your scatter plot: I can’t do this on the computer. One point is (0, 10.821).


Step 6: What does r mean? I won’t make you compute r by hand in this class. Note that SPSS has an annoying quirk: the regression output for r does not give the negative sign although the slope of the regression line is negative. The R in the regression output is always reported as a non-negative number, so take the sign from the slope or from the correlation output. Below is the correlation output from SPSS which shows that r is negative (note that r is the same number as in the regression output).


• r = -.953
• the regression is significant based upon step 7 or p=.001
• since the regression is significant, r indicates a STRONG NEGATIVE CORRELATION
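The sign issue can be checked directly from the problem 5 data. The sketch below computes r with the usual correlation formula, which carries the sign of the slope, and then takes its absolute value to mimic the unsigned number shown in the regression output. Variable names such as `reported_r` are illustrative only.

```python
import math

# Police patrols (x) and crime rate (y) from problem 5.
x = [2, 3, 4, 5, 6, 7, 8]
y = [10, 9, 9, 8, 7, 7, 7]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

slope = sxy / sxx
r = sxy / math.sqrt(sxx * syy)  # signed correlation, same sign as the slope
reported_r = abs(r)             # unsigned value, as in the regression output

print(round(slope, 3), round(r, 3), round(reported_r, 3), round(r ** 2, 3))
```

Because r and the slope share the numerator `sxy`, a negative slope always means a negative r.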

6. number of police patrols = x and average speed on highway = y with α =.05

A police department administrator wants to know if increasing the number of police patrols will reduce the average speed of drivers on the highway. The administrator will be measuring average speed of ALL drivers as collected by one of those road-side radar guns connected to a data collector. So the drivers, as a group, will NOT be seeing a cop with a radar gun pointed at them and might not be deterred. Pretend this is data where x= number of police patrols and y= the average speed.

data point      x       y
1               1      68
2               2      68
3               3      69
4               4      67
5               5      66
6               6      67
7               7      67


SPSS output

ANSWER BELOW

Step 1: discuss the probability of a logical relationship between your two variables:

In general it is logical to hope for a direct cause and effect relationship that is NEGATIVE in nature. Hopefully most drivers do not want to receive speeding tickets, will lower their speed on the basis of seeing more police on the roads, and are thus deterred. However, since we are measuring average speed of ALL drivers as collected by one of those road-side radar guns connected to a data collector, there would need to be enough patrols to deter the WHOLE group of drivers over the study period.



This scatter plot does not seem to indicate a linear relationship between the two variables at all. If you were to draw a circle around all of the dots it would look like a round blob rather than a “skinny egg.”

Step 3: do the 7 steps for a regression test










SPSS Output

TR= -1.826

Step 7: Compare TR value with the decision rule and make a statistical decision. (Write out decision in plain English)

-1.826 is GREATER than -2.571 (and less than 2.571) so we FAIL TO REJECT the null. We conclude that there is NOT sufficient evidence to suggest there is a slope to the population regression line, so we cannot conclude that a meaningful regression relationship exists between the number of police patrols and the average speed of drivers on the highway. We have NO IDEA of how confident we are of this statement.

Step 4:

• r² = .400 or approx. 40%
• the regression is NOT significant based upon the decision in step 7 or p=.127
• since the regression is NOT significant, r² is meaningless! However, had the regression been significant it WOULD HAVE meant that 40% of the variation in the average speed on highways is explained by the number of police patrols.

Step 5: based upon the SPSS output
• write out the equation for the regression line: y-hat = -.286x + 68.571
• draw the line on your scatter plot: I can’t do this on the computer. One point is (0, 68.571).


Step 6: What does r mean? I won’t make you compute r by hand in this class. Note that SPSS has an annoying quirk: the regression output for r does not give the negative sign although the slope of the regression line is negative. The R in the regression output is always reported as a non-negative number, so take the sign from the slope or from the correlation output. Below is the correlation output from SPSS which shows that r is negative (note that r is the same number as in the regression output).

• r = -.632
• the regression is NOT significant based upon the decision in step 7 or p=.127
• since the regression is NOT significant, r is meaningless! However, had the regression been significant it WOULD HAVE indicated a moderately NEGATIVE correlation.
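All six problems use the same decision rule (two-tailed, α = .05, df = 5, critical t = 2.571), so the step 7 decisions can be summarized in one short sketch. The TR values are the ones reported above; the function name `decide` is just a label for this example.

```python
CRITICAL_T = 2.571  # two-tailed, alpha = .05, df = 7 - 2 = 5

def decide(tr, critical=CRITICAL_T):
    """Reject H0 if TR falls in either tail beyond the critical value."""
    return "reject H0" if abs(tr) > critical else "fail to reject H0"

# TR values reported for problems 1 through 6.
test_ratios = {1: 15.096, 2: 7.950, 3: 9.731, 4: 13.536, 5: -6.994, 6: -1.826}

decisions = {problem: decide(tr) for problem, tr in test_ratios.items()}
for problem, decision in decisions.items():
    print(problem, test_ratios[problem], decision)
```

Using the absolute value of TR handles both tails at once, which is why -6.994 leads to rejection while -1.826 does not.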
