In this example, a city’s temperature is likely to be dependent upon its latitude – not the other way around. Temperature cannot affect a city’s latitude.
The latitude is called the independent (or explanatory) variable. The temperature is called the dependent (or response) variable.
Scatter graphs
When plotting scatter graphs, the convention is to always plot the independent variable on the horizontal axis and the dependent variable on the vertical axis.
Scatter graph showing how January temperatures change with latitude
The type of correlation existing between two variables can be described in terms of the gradient of the trend formed by the points (positive or negative) and how close the points lie to a straight line.
Strong positive correlation – the points lie close to a straight line with positive gradient.
Weak positive correlation – the points are more scattered but follow a general upward trend.
The diagram shows a positive correlation between cigarette consumption and life expectancy. However, it would be wrong to conclude that consuming more cigarettes causes people to live longer.
Correlation vs. causation
This type of correlation is sometimes referred to as nonsense correlation.
The relationship can be explained because both life expectancy and cigarette consumption for a country are correlated with a third variable – the wealth of the country.
Examination-style question: A researcher believes there is a relationship between a country’s annual income per head (x, in $1000) and the per capita carbon dioxide emissions (c, tonnes). He collects data from a random sample of 10 countries and records the following results:
Product–moment correlation coefficient
$\sum x = 96.5$, $\sum c = 54$, $\sum x^2 = 2156.9$, $\sum c^2 = 383.54$, $\sum xc = 619.6$
Calculate the value of the product–moment correlation coefficient and comment on the implications of your answer.
Therefore, the product–moment correlation coefficient is:
$$r = \frac{S_{xc}}{\sqrt{S_{xx}\,S_{cc}}}$$
Income shows weak positive correlation with CO2 emissions – emissions are generally higher in wealthier countries. However, as the correlation is low, the result is somewhat inconclusive.
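The calculation for this question can be checked with a short Python sketch. It assumes the summary statistics read as Σx = 96.5, Σc = 54, Σx² = 2156.9, Σc² = 383.54 and Σxc = 619.6 with n = 10, and it uses the standard definitions S_xx = Σx² − (Σx)²/n, S_cc = Σc² − (Σc)²/n and S_xc = Σxc − ΣxΣc/n. The function name `pmcc_from_sums` is chosen here for illustration only.

```python
from math import sqrt

def pmcc_from_sums(n, sum_x, sum_y, sum_xx, sum_yy, sum_xy):
    """Product-moment correlation coefficient from summary statistics."""
    s_xx = sum_xx - sum_x ** 2 / n       # spread of the first variable
    s_yy = sum_yy - sum_y ** 2 / n       # spread of the second variable
    s_xy = sum_xy - sum_x * sum_y / n    # joint variation of the two variables
    return s_xy / sqrt(s_xx * s_yy)

# Assumed reading of the summary statistics for the 10 countries
# (x = income per head in $1000, c = CO2 emissions per capita in tonnes).
r = pmcc_from_sums(n=10, sum_x=96.5, sum_y=54,
                   sum_xx=2156.9, sum_yy=383.54, sum_xy=619.6)
print(round(r, 3))  # prints 0.293
```

A value of about 0.29 is consistent with the description of the correlation as weak and positive.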
Sometimes a variable is controlled by the experimenter – they decide in advance what values that variable should take. If a variable is controlled, then it is non-random.
Random variables take values that cannot be predicted with certainty before collecting the data.
Example: An experiment is carried out into how fast a mug of coffee cools. The temperature of the coffee is measured every 2 minutes until 10 minutes have passed.
Types of variables
Time (minutes) 0 2 4 6 8 10
Temperature (°C) 95 83 73 64 55 48
The values for the time were chosen by the experimenter. If the experiment is repeated, the values for the time will be the same. Therefore, time is a non-random variable.
Temperature is a random variable. The values for this variable may be different if the experiment is repeated.
The best-fitting line is the one that minimizes the sum of the squared deviations, $\sum d_i^2$, where $d_i$ is the vertical distance between the $i$th point and the line.
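As a minimal sketch of this idea, the Python below fits a least-squares line to the coffee-cooling data given earlier, using the standard result b = S_xy / S_xx and a = ȳ − b x̄. It then checks that nudging the gradient makes the sum of squared deviations larger. The function names are illustrative, not taken from the source.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the line y = a + b*x that minimizes sum of d_i^2."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b

def sum_squared_deviations(xs, ys, a, b):
    """Sum of squared vertical distances between the points and y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Coffee-cooling data from the example above
time = [0, 2, 4, 6, 8, 10]           # minutes
temp = [95, 83, 73, 64, 55, 48]      # degrees Celsius

a, b = least_squares_line(time, temp)
best = sum_squared_deviations(time, temp, a, b)
nudged = sum_squared_deviations(time, temp, a, b + 0.1)  # slightly different gradient

print(f"y = {a:.2f} + ({b:.3f})x")
print(best < nudged)  # the fitted line has the smaller sum of squares
```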
Consider again the temperature data presented earlier.
Example: The table shows the latitude, x, and mean January temperature (°C), y, for a sample of 10 cities in the northern hemisphere.
Calculate the equation of the regression line of y on x and use it to predict the mean January temperature for the city of Los Angeles, which has a latitude of 34°N.
This prediction for the mean January temperature in Los Angeles is based purely on the city’s latitude.
There are likely to be additional factors that can affect the climate of a city, for example:
Regression – random on random
The concept of regression we have considered here can be extended to incorporate other relevant factors, producing a new formula. This allows for more accurate prediction.
A regression equation can only confidently be used to predict values of y that correspond to x values that lie within the range of the data values available.
The dangers of extrapolation
It can be dangerous to extrapolate, i.e. to predict from the graph a value for y that corresponds to a value of x lying beyond the range of the values in the data set.
It is reasonably safe to make predictions within the range of the data.
It is unwise to extrapolate beyond the given data.
This is because we cannot be sure that the relationship between the two variables will continue to be true.
The graph indicates that there is fairly strong positive correlation between weight and wingspan – this means that wingspan tends to be longer in heavier birds.
c) When the weight is 160 g, we can predict the wingspan to be:
y = 20.0 + 0.176x = 20.0 + 0.176 × 160 ≈ 48.2 cm
d) The average weight of a duck is outside the range of weights provided in the data. It would therefore be inappropriate to use the regression line to predict the wingspan of a duck, as we cannot be certain that the same relationship will continue to be true at higher weights.
Note: The regression coefficient (0.176) can be interpreted here as follows: as the weight increases by 1 g, the wingspan increases by 0.176 cm, on average.
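The point about extrapolation in part d) can be illustrated with a small Python sketch that refuses to predict outside the range of the data. The regression line y = 20.0 + 0.176x is taken from the text. The data range (50 g to 250 g) and the 1000 g weight used for the out-of-range case are hypothetical values chosen only for illustration, as are the function and variable names.

```python
def predict_within_range(a, b, x, x_min, x_max):
    """Predict y = a + b*x, but only for x inside the observed data range."""
    if not (x_min <= x <= x_max):
        raise ValueError(f"x = {x} lies outside the data range [{x_min}, {x_max}]")
    return a + b * x

# Regression line from the text: wingspan (cm) = 20.0 + 0.176 * weight (g)
A, B = 20.0, 0.176

# HYPOTHETICAL range of weights in the sample; the source does not state it.
X_MIN, X_MAX = 50.0, 250.0

wingspan = predict_within_range(A, B, 160, X_MIN, X_MAX)
print(round(wingspan, 1))  # interpolation: safe to report

try:
    # HYPOTHETICAL weight well above the sample range
    predict_within_range(A, B, 1000, X_MIN, X_MAX)
    extrapolation_refused = False
except ValueError as err:
    extrapolation_refused = True
    print("Refused:", err)
```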
Predicting x from y – random on random
We now turn our attention to the situation where we wish to estimate a value of x when we are given a value of y. We will continue to assume that both variables are random.
To predict x given y (when both variables are random), we use the regression line of x on y. This line has the equation:
x = a′ + b′y
Note that both the regression line of x on y and the regression line of y on x pass through the mean point $(\bar{x}, \bar{y})$.
The two lines will not in general coincide unless the points lie exactly on a straight line.
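The coefficients of the line of x on y are not written out here. Under the usual least-squares definitions, with the roles of x and y swapped, they would be:

```latex
b' = \frac{S_{xy}}{S_{yy}}, \qquad a' = \bar{x} - b'\,\bar{y}
```

Substituting $y = \bar{y}$ into $x = a' + b'y$ gives $x = \bar{x}$, which is why this line also passes through the mean point.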
Examination-style question: 15 AS-level mathematics students sit papers in C1 and S1. Their results are summarized below, with c representing the percentage mark in C1, and s the percentage mark in S1.
$\sum c = 888$, $\sum c^2 = 58\,362$, $\sum s = 943$, $\sum s^2 = 66\,445$, $\sum cs = 61\,878$
a) Calculate the regression line of s on c and the regression line of c on s.
b) Caroline was absent for her C1 examination, but scored 51% in S1. Use the appropriate regression line to estimate her percentage score in the C1 paper.
c) Calculate the product–moment correlation coefficient between the marks in the two papers. Comment on the implications of this for the accuracy of the estimate found in b).
The PMCC indicates that there is very strong positive correlation between the marks in C1 and S1 – the points on the scatter graph would lie very close to a straight line.
This suggests that the mark estimated in b) is likely to be fairly accurate.
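The working for this question can be sketched in Python as follows. It assumes the summary statistics read as Σc = 888, Σc² = 58 362, Σs = 943, Σs² = 66 445 and Σcs = 61 878 with n = 15. The variable names are illustrative. Because both marks are random variables and c is being estimated from s, part b) uses the line of c on s.

```python
from math import sqrt

n = 15
sum_c, sum_cc = 888, 58362    # C1 marks: sum and sum of squares
sum_s, sum_ss = 943, 66445    # S1 marks: sum and sum of squares
sum_cs = 61878                # sum of products

c_bar, s_bar = sum_c / n, sum_s / n
s_cc = sum_cc - sum_c ** 2 / n
s_ss = sum_ss - sum_s ** 2 / n
s_cs = sum_cs - sum_c * sum_s / n

# a) Regression line of s on c: s = a + b*c
b_s_on_c = s_cs / s_cc
a_s_on_c = s_bar - b_s_on_c * c_bar

#    Regression line of c on s: c = a' + b'*s
b_c_on_s = s_cs / s_ss
a_c_on_s = c_bar - b_c_on_s * s_bar

# b) Estimate the C1 mark for an S1 mark of 51% using c on s
c_estimate = a_c_on_s + b_c_on_s * 51

# c) Product-moment correlation coefficient
r = s_cs / sqrt(s_cc * s_ss)

print(f"s = {a_s_on_c:.2f} + {b_s_on_c:.3f}c")
print(f"c = {a_c_on_s:.2f} + {b_c_on_s:.3f}s")
print(f"estimated C1 mark: {c_estimate:.1f}%")
print(f"r = {r:.3f}")
```

With these figures r comes out at roughly 0.94, in line with the comment that the correlation is very strong.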
We will now consider a situation where one of the variables (here assumed to be x) is a controlled variable. This means that the values of x are fixed – they were decided upon when the experiment was planned.
If x is a controlled variable, the regression line of x on y does not have any statistical meaning, since the values of x are not random.
We consequently use only the regression line of y on x, whether we are estimating a y or an x value.
Examination-style question: An agricultural researcher wishes to explore how the yield of a crop is affected by the amount of fertilizer used. She designs an experiment in which she fertilizes a small plot of land with a pre-determined amount of fertilizer. She obtains the following results:
Amount of fertilizer (kg), x 2 4 6 8 10 12
Crop yield (kg), y 8.55 9.34 9.52 10.39 11.42 11.57
a) Calculate the regression line of y on x.
b) The regression line of x on y is: x = –23.8 + 3.04y
Use the appropriate regression line to estimate how much fertilizer would be needed to achieve a crop yield of 10 kg. Explain how you decided which regression line to use.
b) Since x is a controlled variable, only the regression line of y on x has meaning. Therefore, this equation should be used to estimate x when y = 10:
10 = 7.91 + 0.317x, so x = (10 – 7.91) / 0.317 ≈ 6.59 kg
Note: The intercept (7.91) represents the crop yield that might be expected if no fertilizer were to be applied. The equation of the line also shows that increasing the amount of fertilizer by 1 kg increases the expected crop yield by 0.317 kg.
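The figures in this question can be reproduced from the table with a short Python sketch. It computes the regression line of y on x from the raw data and then rearranges that same line to estimate the fertilizer needed for a 10 kg yield, as x is the controlled variable. The variable names are illustrative.

```python
fertilizer = [2, 4, 6, 8, 10, 12]                       # x, kg (controlled)
crop_yield = [8.55, 9.34, 9.52, 10.39, 11.42, 11.57]    # y, kg (random)

n = len(fertilizer)
x_bar = sum(fertilizer) / n
y_bar = sum(crop_yield) / n

s_xx = sum((x - x_bar) ** 2 for x in fertilizer)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(fertilizer, crop_yield))

# a) Regression line of y on x: y = a + b*x
b = s_xy / s_xx
a = y_bar - b * x_bar

# b) x is controlled, so rearrange the y-on-x line to estimate x for y = 10
target_yield = 10
x_needed = (target_yield - a) / b

print(f"y = {a:.2f} + {b:.3f}x")
print(f"fertilizer needed for {target_yield} kg yield: {x_needed:.2f} kg")
```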