Statistics 21b_practice.pdf
Michael Hallstone, Ph.D.
Lecture 21b: Practice Problems for Lecture 21 “Regression and
Correlation”

Everything that appears in these lecture notes is fair game for the
test. They are the best “study guide” I can provide. It is
impossible to provide a “list” that is more comprehensive than the
lecture notes above. However, here are a few additional practice
exercises or practice concepts.

Do the following steps for a Regression/Correlation Analysis for
the problems below:

Step 1: discuss the probability of a logical relationship between
your two variables. You want to discuss
• whether it should be cause and effect, common cause, or no
logical relationship
• whether you expect the relationship to be positive or negative
• whether or not it is linear in nature

Step 2: draw a scatter plot and look for evidence of a linear
relationship. Discuss what you see in the scatter plot (i.e. does
it look linear or not? Is it positive or negative?)

Step 3: do the 7 steps for a regression test. I won’t make you do
the computations for step 6 by hand in this class. But if given the
SPSS output you will have to
• find the “TR” value and make the correct statistical decision
(regarding whether or not to reject the null hypothesis)
• find the p-value and use it to make the correct statistical
decision (regarding whether or not to reject the null hypothesis)
• write out the equation for the regression line using the slope
and y-intercept from the output
• correctly draw the regression line on the scatter plot

Step 4: What does r² mean? I won’t make you compute r² by hand in
this class. But if given SPSS output you will have to
• be able to find r²,
• know if the regression is significant,
• and if the regression is significant say what r² means in plain
English

Step 5: based upon the SPSS output
• write out the equation for the regression line (“y-hat” line)
• draw the line on your scatter plot

Step 6: What does r mean? I won’t make you compute r by hand in
this class. But if given SPSS output you will have to
• be able to find r,
• know if the regression is significant,
• and if the regression is significant say what r means in plain
English

1. Size of the population = x and amount of tax revenue generated
= y, with α = .05

A public administrator wants to forecast her budget in the coming
years. She knows that the amount of money available to her agency
will be based upon tax revenue, and she knows that tax revenue
fluctuates with the ebb and flow of the population. The following
data are forecasts of population and tax revenue for the county in
which she works, followed by the SPSS output.

data points   x       y
1             146     12.4
2             85      6.1
3             21      2.4
4             47      4.3
5             115     9.5
6             90      7.6
7             100     8.5
sum=          604     50.8
mean=         86.3    7.3

SPSS output for population and tax revenue

ANSWER BELOW

Step 1: discuss the probability of a logical relationship between
your two variables:
In general we would expect a cause and effect relationship that is
linear and positive in nature. One might argue it is common cause,
but in general tax revenue is based upon how many people are paying
taxes. Of course it depends upon the type of taxes, as property
tax, sales tax, income tax, etc. will all differ slightly with
respect to population. But in general, the more people in the
county, the more “tax base” exists, so the relationship is
positive. We would also expect a generally linear relationship: as
population continues to grow, taxes should as well. We would not
expect taxes to decline significantly when a population continues
to grow, so it is probably not a curvilinear relationship.

Step 2: draw a scatter plot and look for evidence of a linear
relationship. Discuss what you see in the scatter plot (i.e. does
it look linear or not? Is it positive or negative?)
Based upon this scatter plot it looks like a LINEAR POSITIVE
relationship.

Step 3: do the 7 steps for a regression test
1. State null and alternative hypothesis. H0: B = 0 H1: B ≠ 0
2. State level of significance or α “alpha.” For this problem
alpha =.05
3. Determine the test distribution to use: use Z if # of data
points (x,y) >30 otherwise use t. [If t is used: df=n-2 where n
= # of data points (x,y)]
In this case we have 7 pairs of data so n=7. Use a t
distribution with df=n-2 or df=7-2=5
4. Define the rejection regions. And draw a picture! ALL
regression tests are TWO TAILED tests so α/2 goes into each
tail.
In this case critical t value = 2.571
5. State the decision rule. Reject null if TR> 2.571 or
TR< - 2.571 otherwise FTR (fail to reject).
6. Perform necessary calculations on data and compute TR value.
SPSS output below
TR= 15.096

Step 7: Compare TR value with the decision rule and make a
statistical decision. (Write out decision in plain English)
15.096 is greater than 2.571 so we reject the null and conclude the
alternative. We conclude that there is a slope to the population
regression line and a meaningful regression relationship does exist
between the size of the population and the amount of tax revenue
generated. We are at least 95% confident of this statement.

Step 4:
• r²=.979 or approx 98%
• the regression is significant based upon the decision in step 7
or p=.000 (SPSS shows three decimals, so read this as p < .001)
• since the regression is significant, r² of 98% means that 98% of
the variation in tax revenue in the county can be explained by the
population (of citizens) in the county
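
You will not have to do this for the test, but if you want to see where the SPSS numbers come from, here is a minimal Python sketch that recomputes the slope, y-intercept, r² and TR from the seven data points above. Python is not part of this class, and the function name simple_regression is just a name made up for this illustration. The results should agree with the SPSS output up to rounding.

```python
import math

# Population (x) and tax revenue (y) pairs from the data table in problem 1.
x = [146, 85, 21, 47, 115, 90, 100]
y = [12.4, 6.1, 2.4, 4.3, 9.5, 7.6, 8.5]


def simple_regression(x, y):
    """Return least-squares slope, intercept, r-squared and TR for the slope."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squares and cross-products around the means.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    # Test ratio for H0: B = 0 with df = n - 2; it carries the sign of the slope.
    tr = math.copysign(math.sqrt(r_squared * (n - 2) / (1 - r_squared)), slope)
    return slope, intercept, r_squared, tr


slope, intercept, r_squared, tr = simple_regression(x, y)
print(f"y-hat = {slope:.3f}x + {intercept:.3f}")
print(f"r2 = {r_squared:.3f}, TR = {tr:.2f}")
```
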

Step 5: based upon the SPSS output
• write out equation for the regression line: y-hat= .079x+.415
• draw the line on your scatter plot: I can’t do this on the
computer. One point is (0, 0.415). Plug another x into the equation
to get a corresponding y. Use x=10 or something easy:
y = .079(10)+.415 = 1.205, so a second point is (10, 1.205).
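
The "plug in an x" step can be sketched in a few lines of Python. The slope and y-intercept are the rounded values from the SPSS output above; the function name y_hat and the choice of x values are only for illustration.

```python
# Coefficients as reported (rounded) in the SPSS output for problem 1.
SLOPE = 0.079
INTERCEPT = 0.415


def y_hat(x):
    """Predicted tax revenue (y-hat) for a given population value x."""
    return SLOPE * x + INTERCEPT


# Two points are enough to draw a straight line on the scatter plot.
for x in (0, 10):
    print(f"({x}, {y_hat(x):.3f})")
```

Any two x values will do. Picking them far apart (for example near the smallest and largest x in the data) makes the line easier to draw accurately.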

Step 6: What does r mean? I won’t make you compute r by hand in
this class, but given SPSS output you will have to find r and
interpret it:
• r= .989
• the regression is significant based upon step 7 or p=.000
• since the regression is significant, r indicates a STRONG
POSITIVE CORRELATION

2. Number of prior convictions = x and length of current sentence
= y, with α = .05

(This is sort of a dumb example as it’s obvious that there is a
relationship, but please just use it for practice.) Pretend the
sentencing law for drug offenses in California dictates that judges
must take into account the number of prior convictions when they
sentence a person who is convicted of a drug crime. Pretend that as
the number of prior convictions rises, the length of the current
sentence gets longer. That makes sense: the judge will sentence you
to “more time” if you have more prior convictions and “less time”
if you have fewer prior convictions. A prison administrator is
trying to figure out if the judges are following the law and
collects the following data, where x= the number of priors and y=
the length of the current sentence.

data points   x     y
1             0     1.5
2             1     3
3             2     5
4             3     7
5             4     6
6             5     10
7             6     10

SPSS output

ANSWER BELOW

Step 1: discuss the probability of a logical relationship between
your two variables:
In general we would expect a cause and effect relationship that is
linear and positive in nature. If the law dictates that the current
sentence gets longer (bigger) as the number of prior convictions
grows, then it is a positive relationship. Depending upon how the
law is written, it may be a linear relationship that levels off, as
we would expect a limit to the length of sentence regardless of how
many prior convictions there are: after a certain number of priors
the judge may stop adding time. We would not expect the judge to
sentence someone to 100 years or 150 years or even 50 years for a
drug offense.

Step 2: draw a scatter plot and look for evidence of a linear
relationship. Discuss what you see in the scatter plot (i.e. does
it look linear or not? Is it positive or negative?)
Based upon this scatter plot it looks like there is a LINEAR
POSITIVE relationship between the two variables.

Step 3: do the 7 steps for a regression test
1. State null and alternative hypothesis. H0: B = 0 H1: B ≠
0
2. State level of significance or α “alpha.” For this problem
alpha =.05
3. Determine the test distribution to use – use Z if # of data
points (x,y) >30 otherwise use t. [If t is used: df=n-2 where n
= # of data points (x,y)]
In this case we have 7 pairs of data so n=7. Use a t
distribution with df=n-2 or df=7-2=5
4. Define the rejection regions. And draw a picture! ALL
regression tests are TWO TAILED tests so α/2 goes into each
tail.
In this case critical t value = 2.571
5. State the decision rule. Reject null if TR> 2.571 or
TR< - 2.571 otherwise FTR.
6. Perform necessary calculations on data and compute TR value.
SPSS Output
TR=7.950

Step 7: Compare TR value with the decision rule and make a
statistical decision. (Write out decision in plain English)
7.950 is greater than 2.571 so we reject the null and conclude the
alternative. We conclude that there is a slope to the population
regression line and a meaningful regression relationship does exist
between the number of prior convictions and the length of the
current sentence. We are at least 95% confident of this statement.
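
The decision rule in step 5 and the comparison in step 7 are mechanical enough to write down as a tiny function. This is only a sketch: the name decide is made up for illustration, and the critical value 2.571 is the t-table value quoted above for a two-tailed test with α =.05 and df=5.

```python
# Critical t value for a two-tailed test with alpha = .05 and df = 5 (from the t table).
CRITICAL_T = 2.571


def decide(tr, critical_t=CRITICAL_T):
    """Two-tailed decision rule: reject H0 when TR falls in either tail."""
    if tr > critical_t or tr < -critical_t:
        return "reject H0"
    return "fail to reject H0"


print(decide(7.950))  # TR from the SPSS output for this problem
```

Because the test is two tailed, a large negative TR rejects the null just as a large positive TR does. Only a TR between -2.571 and 2.571 leads to FTR.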

Step 4:
• r²=.927 or approx 93%
• the regression is significant based upon the decision in step 7
or p=.001
• since the regression is significant, r² of 93% means that 93% of
the variation in length of current sentence can be explained by
number of prior convictions

Step 5: based upon the SPSS output
• write out equation for the regression line: y-hat= 1.446x+1.732
• draw the line on your scatter plot: I can’t do this on the
computer. One point is (0, 1.732). Plug another x into the equation
to get a corresponding y. Use x=10 or something easy.

Step 6: What does r mean? I won’t make you compute r by hand in
this class, but given SPSS output you will have to find r and
interpret it:
• r= .963
• the regression is significant based upon step 7 or p=.001
• since the regression is significant, r indicates a STRONG
POSITIVE CORRELATION
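
For the curious, here is a short Python sketch of how r could be computed directly from the seven data points in this problem. It uses the usual Pearson correlation formula; the function name pearson_r is made up for illustration. Squaring the result should give the r² reported in step 4, up to rounding.

```python
import math

# Number of priors (x) and length of current sentence (y) from problem 2.
priors = [0, 1, 2, 3, 4, 5, 6]
sentence = [1.5, 3, 5, 7, 6, 10, 10]


def pearson_r(x, y):
    """Pearson correlation coefficient for paired data."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / math.sqrt(sxx * syy)


r = pearson_r(priors, sentence)
print(f"r = {r:.3f}, r squared = {r * r:.3f}")
```
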

3. Number of injuries = x and number of overtime shifts = y, with
α = .05

A prison administrator is trying to figure out whether the number
of injuries amongst prison guards can predict the number of
overtime shifts (worked by prison guards). Since overtime shifts
“kill” her budget, she wants to see if a meaningful relationship
exists. If it does, she will try to make the workplace safer and
thus reduce overtime shifts (and costs). She collects the following
data, where x= the number of injuries amongst prison guards and y=
the number of overtime shifts worked by prison guards.

data points   x     y
1             0     10
2             2     20
3             3     30
4             4     30
5             5     40
6             5     40
7             5     45

SPSS output for # of injuries and overtime shifts

ANSWER BELOW

Step 1: discuss the probability of a logical relationship between
your two variables:
In general we would expect a cause and effect relationship that is
linear and positive in nature. It makes sense that injured workers
would mean a “gap” or “hole” in the schedule and would require
another worker to fill in for that shift. That would mean the
worker who fills in is working an extra shift and thus gets
overtime. Of course this depends upon the prison administrator NOT
having “part time” prison guards who could fill in for the injured
workers.

Step 2: draw a scatter plot and look for evidence of a linear
relationship. Discuss what you see in the scatter plot (i.e. does
it look linear or not? Is it positive or negative?)
This scatter plot seems to show a POSITIVE LINEAR relationship,
more or less.

Step 3: do the 7 steps for a regression test
1. State null and alternative hypothesis. H0: B = 0 H1: B ≠
0
2. State level of significance or α “alpha.” For this problem
alpha =.05
3. Determine the test distribution to use – use Z if # of data
points (x,y) >30 otherwise use t. [If t is used: df=n-2 where n
= # of data points (x,y)]
In this case we have 7 pairs of data so n=7. Use a t
distribution with df=n-2 or df=7-2=5
4. Define the rejection regions. And draw a picture! ALL
regression tests are TWO TAILED tests so α/2 goes into each
tail.
In this case critical t value = 2.571
5. State the decision rule. Reject null if TR> 2.571 or
TR< - 2.571 otherwise FTR.
6. Perform necessary calculations on data and compute TR
value.
SPSS Output
TR=9.731

Step 7: Compare TR value with the decision rule and make a
statistical decision. (Write out decision in plain English)
9.731 is greater than 2.571 so we reject the null and conclude the
alternative. We conclude that there is a slope to the population
regression line and a meaningful regression relationship does exist
between the number of injuries and the number of overtime shifts.
We are at least 95% confident of this statement.

Step 4:
• r²=.950 or approx 95%
• the regression is significant based upon the decision in step 7
or p=.000
• since the regression is significant, r² of 95% means that 95% of
the variation in the number of overtime shifts worked by prison
guards can be explained by the number of injuries suffered by
prison guards. (If the administrator can reduce the number of
injuries she could probably reduce her overtime costs.)
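
To make "95% of the variation can be explained" more concrete, here is a minimal Python sketch that splits the total variation in overtime shifts into an explained part and an unexplained part, using the data from this problem. The variable names (sst, sse, ssr) are common textbook shorthand chosen here for illustration, not something taken from the SPSS output.

```python
# Injuries (x) and overtime shifts (y) from problem 3.
injuries = [0, 2, 3, 4, 5, 5, 5]
overtime = [10, 20, 30, 30, 40, 40, 45]

n = len(injuries)
mean_x = sum(injuries) / n
mean_y = sum(overtime) / n

# Least-squares regression line.
sxx = sum((x - mean_x) ** 2 for x in injuries)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(injuries, overtime))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Predicted y (y-hat) for each observed x.
predicted = [slope * x + intercept for x in injuries]

sst = sum((y - mean_y) ** 2 for y in overtime)                 # total variation in y
sse = sum((y - p) ** 2 for y, p in zip(overtime, predicted))   # variation left unexplained
ssr = sst - sse                                                # variation explained by x

r_squared = ssr / sst
print(f"total = {sst:.1f}, explained = {ssr:.1f}, unexplained = {sse:.1f}")
print(f"r squared = {r_squared:.3f}")
```
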

Step 5: based upon the SPSS output
• write out equation for the regression line: y-hat= 6.349x+8.947
• draw the line on your scatter plot: I can’t do this on the
computer. One point is (0, 8.947). Plug another x into the equation
to get a corresponding y. Use x=10 or something easy.

Step 6: What does r mean? I won’t make you compute r by hand in
this class, but given SPSS output you will have to find r and
interpret it:
• r= .975
• the regression is significant based upon step 7 or p=.000
• since the regression is significant, r indicates a STRONG
POSITIVE CORRELATION

4. Number of prior DUI convictions = x and BAC (Blood Alcohol
Content) = y, with α = .05

(As you will see in the discussion of the logical relationship
below, there is a problem with the relationship between these two
variables, but we will use this for practice purposes.) This is an
actual “problem” from my DUI study in Honolulu but with fake data.
It seems reasonable that people with more prior DUI convictions
have a significant problem with alcohol and get “drunker” than
those with fewer DUI convictions. So people with more prior DUI
convictions will probably have a higher BAC. Below is the fake
data:

data points   x     y
1             0     0.10
2             1     0.12
3             2     0.15
4             3     0.16
5             4     0.20
6             5     0.21
7             6     0.22

SPSS Output

ANSWER BELOW

Step 1: discuss the probability of a logical relationship between
your two variables:
In general we would expect a relationship that is linear and
positive in nature. We would not expect it to be direct cause and
effect, however. We would expect it to be a common cause
relationship: the number of prior DUI convictions does NOT directly
cause an increase in BAC. But there is something associated with
prior DUI convictions (people with more prior DUI convictions are
severe “alcoholics”) that probably also causes an increase in BAC.
So this is not really appropriate for simple linear regression and
is more appropriate for multiple regression (where we examine
multiple x’s or multiple independent variables).

Step 2: draw a scatter plot and look for evidence of a linear
relationship. Discuss what you see in the scatter plot (i.e. does
it look linear or not? Is it positive or negative?)
This scatter plot seems to indicate a POSITIVE and LINEAR
relationship between the two variables.

Step 3: do the 7 steps for a regression test
1. State null and alternative hypothesis. H0: B = 0 H1: B ≠
0
2. State level of significance or α “alpha.” For this problem
alpha =.05
3. Determine the test distribution to use – use Z if # of data
points (x,y) >30 otherwise use t. [If t is used: df=n-2 where n
= # of data points (x,y)]
In this case we have 7 pairs of data so n=7. Use a t
distribution with df=n-2 or df=7-2=5
4. Define the rejection regions. And draw a picture! ALL
regression tests are TWO TAILED tests so α/2 goes into each
tail.
In this case critical t value = 2.571
5. State the decision rule. Reject null if TR> 2.571 or
TR< - 2.571 otherwise FTR.
6. Perform necessary calculations on data and compute TR
value.
SPSS Output
TR=13.536

Step 7: Compare TR value with the decision rule and make a
statistical decision. (Write out decision in plain English)
13.536 is greater than 2.571 so we reject the null and conclude the
alternative. We conclude that there is a slope to the population
regression line and a meaningful regression relationship does exist
between the number of prior DUI convictions and BAC. We are at
least 95% confident of this statement.

Step 4:
• r²=.973 or approx 97%
• the regression is significant based upon the decision in step 7
or p=.000
• since the regression is significant, r² of 97% means that 97% of
the variation in BAC is explained by the number of prior DUI
convictions. (Again, this is probably a common cause relationship
and NOT direct cause and effect, but we’d need to do a multiple
regression to find out more about this relationship.)
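
The TR value is the slope divided by its standard error. As an optional illustration, the Python sketch below works that out from the fake BAC data in this problem. The variable names (mse, se_slope) are chosen for this illustration, and the result should match the SPSS TR up to rounding.

```python
import math

# Prior DUI convictions (x) and BAC (y) from problem 4 (fake data).
priors = [0, 1, 2, 3, 4, 5, 6]
bac = [0.10, 0.12, 0.15, 0.16, 0.20, 0.21, 0.22]

n = len(priors)
mean_x = sum(priors) / n
mean_y = sum(bac) / n

# Least-squares slope and intercept.
sxx = sum((x - mean_x) ** 2 for x in priors)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(priors, bac))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Leftover (residual) variation around the line, averaged over df = n - 2.
sse = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(priors, bac))
mse = sse / (n - 2)

# Standard error of the slope, then the test ratio for H0: B = 0.
se_slope = math.sqrt(mse / sxx)
tr = slope / se_slope

print(f"slope = {slope:.3f}, standard error = {se_slope:.4f}, TR = {tr:.2f}")
```
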

Step 5: based upon the SPSS output
• write out equation for the regression line: y-hat= .021x+0.103
• draw the line on your scatter plot: I can’t do this on the
computer. One point is (0, .103). Plug another x into the equation
to get a corresponding y. Use x=10 or something easy.

Step 6: What does r mean? I won’t make you compute r by hand in
this class, but given SPSS output you will have to find r and
interpret it:
• r= .987
• the regression is significant based upon step 7 or p=.000
• since the regression is significant, r indicates a STRONG
POSITIVE CORRELATION

5. Number of police patrols = x and crime rate = y, with α = .05

A police department administrator wants to know if increasing the
number of police patrols will reduce the crime rate in her city.
Pretend this is data where x= number of police patrols and y= the
crime rate.

data points   x     y
1             2     10
2             3     9
3             4     9
4             5     8
5             6     7
6             7     7
7             8     7

SPSS output

ANSWER BELOW

Step 1: discuss the probability of a logical relationship between
your two variables:
In general it is logical to hope for a direct cause and effect
relationship that is NEGATIVE in nature. Whether or not it is a
linear relationship is a bit harder to discern. If increased police
patrols actually deter potential criminals then we would expect the
crime rate to decline, although it is hard to say if it will
continue to drop in a purely linear fashion. If increasing police
patrols actually reduces crime, we might expect that it takes a
significant increase in police patrols before criminals notice it,
and that even with a whole lot of police patrols the truly hard
core criminals will still not be deterred. So perhaps it is a
negative curvilinear relationship.

Step 2: draw a scatter plot and look for evidence of a linear
relationship. Discuss what you see in the scatter plot (i.e. does
it look linear or not? Is it positive or negative?)
The relationship appears negative and somewhat linear.

Step 3: do the 7 steps for a regression test
1. State null and alternative hypothesis. H0: B = 0 H1: B ≠
0
2. State level of significance or α “alpha.” For this problem
alpha =.05
3. Determine the test distribution to use – use Z if # of data
points (x,y) >30 otherwise use t. [If t is used: df=n-2 where n
= # of data points (x,y)]
In this case we have 7 pairs of data so n=7. Use a t
distribution with df=n-2 or df=7-2=5
4. Define the rejection regions. And draw a picture! ALL
regression tests are TWO TAILED tests so α/2 goes into each
tail.
In this case critical t value = 2.571
5. State the decision rule. Reject null if TR> 2.571 or
TR< - 2.571 otherwise FTR.
6. Perform necessary calculations on data and compute TR
value.
SPSS Output
TR= -6.994

Step 7: Compare TR value with the decision rule and make a
statistical decision. (Write out decision in plain English)
-6.994 is less than -2.571 so we reject the null and conclude the
alternative. We conclude that there is a slope to the population
regression line and a meaningful regression relationship does exist
between the number of police patrols and the crime rate. We are at
least 95% confident of this statement.

Step 4:
• r²=.907 or approx 91%
• the regression is significant based upon the decision in step 7
or p=.001
• since the regression is significant, r² of 91% means that 91% of
the variation in the crime rate is explained by the number of
police patrols.

Step 5: based upon the SPSS output
• write out equation for the regression line: y-hat= -.536x +
10.821
• draw the line on your scatter plot: I can’t do this on the
computer. One point is (0, 10.821). Plug another x into the
equation to get a corresponding y. Use x=10 or something easy.

Step 6: What does r mean? I won’t make you compute r by hand in
this class, but given SPSS output you will have to find r and
interpret it.
Note that SPSS has an annoying quirk. The regression output for r
does not give the negative sign even though the slope of the
regression line is negative. This is just SPSS’s company being
“lazy” and having somewhat inferior software. Below is the
correlation output from SPSS which shows that r is negative (note
that, apart from the sign, r is the same number as in the
regression output).
• r= -.953
• the regression is significant based upon step 7 or p=.001
• since the regression is significant, r indicates a STRONG
NEGATIVE CORRELATION
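
If you only have the regression output, one way to recover the sign of r is to take the square root of r² and give it the sign of the slope. The Python sketch below illustrates this with the patrol data from this problem. It is an illustration under that assumption (for simple regression with one x, r always has the same sign as the slope), not a description of how SPSS works internally.

```python
import math

# Police patrols (x) and crime rate (y) from problem 5.
patrols = [2, 3, 4, 5, 6, 7, 8]
crime_rate = [10, 9, 9, 8, 7, 7, 7]

n = len(patrols)
mean_x = sum(patrols) / n
mean_y = sum(crime_rate) / n

sxx = sum((x - mean_x) ** 2 for x in patrols)
syy = sum((y - mean_y) ** 2 for y in crime_rate)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(patrols, crime_rate))

slope = sxy / sxx
r_squared = sxy ** 2 / (sxx * syy)

# The square root of r-squared has no sign; borrow the sign from the slope.
unsigned_r = math.sqrt(r_squared)
r = math.copysign(unsigned_r, slope)

print(f"slope = {slope:.3f}, unsigned r = {unsigned_r:.3f}, signed r = {r:.3f}")
```
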

6. Number of police patrols = x and average speed on highway = y,
with α = .05

A police department administrator wants to know if increasing the
number of police patrols will reduce the average speed of drivers
on the highway. The administrator will be measuring average speed
of ALL drivers as collected by one of those road-side radar guns
connected to a data collector. So the drivers, as a group, will NOT
be seeing a cop with a radar gun pointed at them and might not be
deterred. Pretend this is data where x= number of police patrols
and y= the average speed.

data points   x     y
1             1     68
2             2     68
3             3     69
4             4     67
5             5     66
6             6     67
7             7     67

SPSS output

ANSWER BELOW

Step 1: discuss the probability of a logical relationship between
your two variables:
In general it is logical to hope for a direct cause and effect
relationship that is NEGATIVE in nature. Hopefully most drivers do
not want to receive speeding tickets and will lower their speed on
the basis of seeing more police on the roads and are thus deterred.
However, since we are measuring average speed of ALL drivers as
collected by one of those road-side radar guns connected to a data
collector, there would need to be enough patrols to deter the WHOLE
group of drivers over the study period.

Step 2: draw a scatter plot and look for evidence of a linear
relationship. Discuss what you see in the scatter plot (i.e. does
it look linear or not? Is it positive or negative?)
This scatter plot does not seem to indicate a linear relationship
between the two variables at all. If you were to draw a circle
around all of the dots it would look like a round blob rather than
a “skinny egg.”

Step 3: do the 7 steps for a regression test
1. State null and alternative hypothesis. H0: B = 0 H1: B ≠
0
2. State level of significance or α “alpha.” For this problem
alpha =.05
3. Determine the test distribution to use – use Z if # of data
points (x,y) >30 otherwise use t. [If t is used: df=n-2 where n
= # of data points (x,y)]
In this case we have 7 pairs of data so n=7. Use a t
distribution with df=n-2 or df=7-2=5
4. Define the rejection regions. And draw a picture! ALL
regression tests are TWO TAILED tests so α/2 goes into each
tail.
In this case critical t value = 2.571
5. State the decision rule. Reject null if TR> 2.571 or
TR< - 2.571 otherwise FTR.
6. Perform necessary calculations on data and compute TR
value.
SPSS Output
TR= -1.826

Step 7: Compare TR value with the decision rule and make a
statistical decision. (Write out decision in plain English)
-1.826 is GREATER than -2.571 (and less than 2.571) so we FAIL TO
REJECT the null. We conclude that there is NOT sufficient evidence
to suggest that there is a slope to the population regression line
or that a meaningful regression relationship exists between the
number of police patrols and the average speed of drivers on the
highway. We have NO IDEA of how confident we are of this statement.

Step 4:
• r²=.400 or approx 40%
• the regression is NOT significant based upon the decision in
step 7 or p=.127
• since the regression is NOT significant, r² is meaningless!
However, had the regression been significant it WOULD HAVE meant
that 40% of the variation in the average speed on highways is
explained by the number of police patrols.
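
The p-value of .127 is the area in both tails of the t distribution (df=5) beyond a TR of 1.826 in either direction. You will always read it from the SPSS output in this class, but as a rough illustration, the Python sketch below approximates that area numerically. The function names t_pdf and two_tailed_p are made up for this example, and Simpson's rule is just one convenient way to approximate the area; it is not a claim about how SPSS computes p.

```python
import math


def t_pdf(x, df):
    """Height of Student's t distribution with df degrees of freedom at x."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def two_tailed_p(tr, df, steps=2000):
    """Two-tailed p-value: 1 minus the area between -|TR| and +|TR|.

    The area from 0 to |TR| is approximated with Simpson's rule
    (steps must be even) and doubled because the curve is symmetric.
    """
    upper = abs(tr)
    h = upper / steps
    total = t_pdf(0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_zero_to_tr = total * h / 3
    return 1 - 2 * area_zero_to_tr


p = two_tailed_p(-1.826, df=5)
print(f"p = {p:.3f}")  # compare with alpha = .05
```

Since p is larger than α =.05, the p-value leads to the same decision as the TR comparison in step 7: fail to reject the null.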

Step 5: based upon the SPSS output
• write out equation for the regression line: y-hat= -.286x +
68.571
• draw the line on your scatter plot: I can’t do this on the
computer. One point is (0, 68.571). Plug another x into the
equation to get a corresponding y. Use x=10 or something easy.

Step 6: What does r mean? I won’t make you compute r by hand in
this class, but given SPSS output you will have to find r and
interpret it.
Note that SPSS has an annoying quirk. The regression output for r
does not give the negative sign even though the slope of the
regression line is negative. This is just SPSS’s company being
“lazy” and having somewhat inferior software. Below is the
correlation output from SPSS which shows that r is negative (note
that, apart from the sign, r is the same number as in the
regression output).
• r= -.632
• the regression is NOT significant based upon the decision in
step 7 or p=.127
• since the regression is NOT significant, r is meaningless!
However, had the regression been significant it WOULD HAVE
indicated a moderate NEGATIVE correlation.