Regression

Linear regression involves finding the equation of the line of best fit on a scatter graph.

The equation obtained can then be used to make an estimate of one variable given the value of the other variable.

There are two cases to consider, depending upon whether:

1. We wish to find a value of y given a value for x, or

2. We want to estimate x given y.

S1 deals with the first situation.

The best fitting line is the one that minimizes the sum of the squared deviations, $\sum d_i^2$, where $d_i$ is the vertical distance between the $i$th point and the line.

[Figure: scatter graph with a line of best fit, showing the vertical distances d1 to d6 between each point and the line.]

The distances $d_i$ are sometimes referred to as residuals.
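As a minimal sketch of this idea, the snippet below computes the sum of squared residuals for two candidate lines. The function name `sum_squared_residuals` and the four data points are assumptions invented purely for illustration; they do not come from the examples in this section.

```python
def sum_squared_residuals(xs, ys, a, b):
    """Return the sum of d_i^2, where d_i = y_i - (a + b * x_i)."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical points, made up for illustration only.
xs = [1, 2, 3, 4]
ys = [2, 4, 5, 8]

# Two candidate lines through the same points.
guess = sum_squared_residuals(xs, ys, a=0.0, b=2.0)  # line y = 2x
best = sum_squared_residuals(xs, ys, a=0.0, b=1.9)   # line y = 1.9x

print(guess)  # 1.0
print(best)   # approximately 0.7, smaller than for y = 2x
```

For these made-up points, y = 1.9x happens to be the least squares line, so no other straight line gives a smaller total.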

As stated previously, the best fitting line should pass through the mean point, $(\bar{x}, \bar{y})$.

The line that minimizes the sum of squared deviations is formally known as the least squares regression line of y on x.

The equation of the least squares regression line of y on x is:

$y = a + bx$

where:

$b = \frac{S_{xy}}{S_{xx}}$ and $a = \bar{y} - b\bar{x}$

Recall: $S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n}$ and $S_{xy} = \sum xy - \frac{\sum x \sum y}{n}$

b is sometimes referred to as the regression coefficient.
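The formulas for a and b can be sketched in a few lines of Python. The function name `regression_line` is a hypothetical choice, and the test points are synthetic: they are constructed to lie exactly on y = 1 + 2x so that the expected answer is known in advance.

```python
def regression_line(xs, ys):
    """Return (a, b) for the least squares regression line of y on x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    b = s_xy / s_xx                # gradient (regression coefficient)
    a = sum_y / n - b * sum_x / n  # intercept: a = y-bar - b * x-bar
    return a, b

# Synthetic points lying exactly on y = 1 + 2x (illustration only).
a, b = regression_line([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b)  # 1.0 2.0
```

Because a is defined as the mean of y minus b times the mean of x, the fitted line always passes through the mean point.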

Example: The table shows the latitude, x, and mean January temperature (°C), y, for a sample of 10 cities in the northern hemisphere.

Calculate the equation of the regression line of y on x and use it to predict the mean January temperature for the city of Los Angeles, which has a latitude of 34°N.

City           Latitude (x)   Mean Jan. temp. (°C) (y)
Belgrade       45             1
Bangkok        14             32
Cairo          30             14
Dublin         50             3
Havana         23             22
Kuala Lumpur   3              27
Madrid         40             5
New York       41             0
Reykjavik      30             –1
Tokyo          36             5

We begin by finding summary statistics for the table:

$\sum x = 312$, $\sum y = 108$, $\sum x^2 = 11\,636$, $\sum y^2 = 2494$, $\sum xy = 2000$

We then use these to calculate the gradient (b) and y-intercept (a) for the regression line.
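The summary statistics can be checked with a short script. This is a sketch only: the variable names are arbitrary choices, and the data are copied from the table of cities.

```python
# (city, latitude x, mean January temperature y in °C), copied from the table.
cities = [
    ("Belgrade", 45, 1), ("Bangkok", 14, 32), ("Cairo", 30, 14),
    ("Dublin", 50, 3), ("Havana", 23, 22), ("Kuala Lumpur", 3, 27),
    ("Madrid", 40, 5), ("New York", 41, 0), ("Reykjavik", 30, -1),
    ("Tokyo", 36, 5),
]

xs = [lat for _, lat, _ in cities]
ys = [temp for _, _, temp in cities]

n = len(cities)
sum_x = sum(xs)
sum_y = sum(ys)
sum_x2 = sum(x * x for x in xs)
sum_y2 = sum(y * y for y in ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))

print(n, sum_x, sum_y, sum_x2, sum_y2, sum_xy)  # 10 312 108 11636 2494 2000
```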

To find the gradient, we need $S_{xy}$ and $S_{xx}$:

$S_{xy} = \sum xy - \frac{\sum x \sum y}{n} = 2000 - \frac{312 \times 108}{10} = -1369.6$

$S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n} = 11\,636 - \frac{312^2}{10} = 1901.6$

Therefore:

$b = \frac{S_{xy}}{S_{xx}} = \frac{-1369.6}{1901.6} = -0.720$ (to 3 s.f.)

To find the y-intercept we also need $\bar{x}$ and $\bar{y}$:

$\bar{x} = \frac{312}{10} = 31.2$ and $\bar{y} = \frac{108}{10} = 10.8$

So: $a = \bar{y} - b\bar{x} = 10.8 - (-0.720 \times 31.2) = 33.3$ (to 3 s.f.)

Therefore, the equation of the regression line is:

y = 33.3 – 0.720x

So, when x = 34, y = 33.3 – 0.720 × 34 = 8.82°C.

This is our estimate of the mean January temperature in Los Angeles.
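The calculation can be reproduced from the summary statistics alone. The sketch below uses arbitrary variable names and a hypothetical helper, `predict`. It keeps full precision throughout, so it gives about 8.78°C rather than the 8.82°C obtained above from coefficients rounded to 3 s.f.; the two values agree to within rounding error.

```python
# Summary statistics for the ten cities, as quoted in the worked example.
n = 10
sum_x, sum_y = 312, 108
sum_x2, sum_xy = 11636, 2000

s_xy = sum_xy - sum_x * sum_y / n   # -1369.6
s_xx = sum_x2 - sum_x ** 2 / n      # 1901.6

b = s_xy / s_xx                     # gradient
a = sum_y / n - b * sum_x / n       # intercept: y-bar - b * x-bar

def predict(x):
    """Estimate the mean January temperature (°C) at latitude x."""
    return a + b * x

print(f"b = {b:.3f}, a = {a:.1f}")  # b = -0.720, a = 33.3
print(f"{predict(34):.2f}")         # 8.78 (unrounded coefficients)
```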

This prediction for the mean January temperature in Los Angeles is based purely on the city’s latitude.

There are likely to be additional factors that can affect the climate of a city, for example:

altitude;

proximity to the coast;

ocean currents;

prevailing winds.

The concept of regression we have considered here can be extended to incorporate other relevant factors, producing a new formula. This allows for more accurate prediction.

The dangers of extrapolation

A regression equation can only confidently be used to predict values of y that correspond to x values that lie within the range of the data values available.

It can be dangerous to extrapolate (i.e. to predict) from the graph a value for y that corresponds to a value of x that lies beyond the range of the values in the data set.

[Figure: scatter graph with a regression line, marking the region within the range of the data and the region beyond it.]

It is reasonably safe to make predictions within the range of the data. It is unwise to extrapolate beyond the given data. This is because we cannot be sure that the relationship between the two variables will continue to be true.
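One way to guard against accidental extrapolation is to check the x value against the range of the data before predicting. The sketch below is one possible approach, not a standard routine: the function name `predict_in_range` is hypothetical, and it reuses the latitudes and the rounded line y = 33.3 – 0.720x from the cities example.

```python
def predict_in_range(x_new, xs, a, b):
    """Predict y = a + b * x_new, refusing to extrapolate beyond the data."""
    x_min, x_max = min(xs), max(xs)
    if not x_min <= x_new <= x_max:
        raise ValueError(
            f"x = {x_new} lies outside the data range [{x_min}, {x_max}]"
        )
    return a + b * x_new

# Latitudes from the cities example and its fitted line (rounded to 3 s.f.).
latitudes = [45, 14, 30, 50, 23, 3, 40, 41, 30, 36]
a, b = 33.3, -0.720

print(predict_in_range(34, latitudes, a, b))  # within range: about 8.82

try:
    predict_in_range(60, latitudes, a, b)     # beyond the largest latitude, 50
except ValueError as err:
    print("Refused:", err)
```

Raising an error is a deliberately strict choice; a warning would also be reasonable if an extrapolated value is still wanted.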

Examination-style question: The average weight and wingspan of 9 species of British birds are given in the table.

Bird          Weight (g)   Wingspan (cm)
Wren          10           15
Robin         18           21
Chaffinch     18           24
Cuckoo        57           33
Blackbird     100          37
Pigeon        300          67
Lapwing       220          70
Crow          500          99
Common gull   400          100

a) Plot the data on a scatter graph. Comment on the relationship between the variables.

b) Calculate the regression line of wingspan on weight.

c) Use your regression line to estimate the wingspan of a jay, if its average weight is 160 g.

d) Explain why it would be inappropriate to use your line to estimate the wingspan of a duck, if the average weight of a duck is 1 kg.


a) [Figure: Scatter graph showing the weight and wingspan of birds. Horizontal axis: Weight (g). Vertical axis: Wingspan (cm).]

The graph indicates that there is fairly strong positive correlation between weight and wingspan – this means that wingspan tends to be longer in heavier birds.

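b) Summary values for the paired data, where x = weight and y = wingspan, are:

$\sum x = 1623$, $\sum y = 466$, $\sum x^2 = 562\,397$, $\sum y^2 = 32\,890$, $\sum xy = 131\,541$

These can be used to find the gradient of the regression line:

$S_{xy} = \sum xy - \frac{\sum x \sum y}{n} = 131\,541 - \frac{1623 \times 466}{9} = 47\,505.67$

$S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n} = 562\,397 - \frac{1623^2}{9} = 269\,716$

Therefore:

$b = \frac{S_{xy}}{S_{xx}} = \frac{47\,505.67}{269\,716} = 0.176$ (to 3 s.f.)

To find the y-intercept we also need $\bar{x}$ and $\bar{y}$:

$\bar{x} = \frac{1623}{9} = 180.33$ and $\bar{y} = \frac{466}{9} = 51.78$

So: $a = \bar{y} - b\bar{x} = 51.78 - (0.176 \times 180.33) = 20.04$

Therefore, the equation of the regression line is:

y = 20.0 + 0.176x

where y = wingspan and x = weight.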

c) When the weight is 160 g, we can predict the wingspan to be:

y = 20.0 + 0.176x = 20.0 + (0.176 × 160) = 48.2 cm (to 3 s.f.)

d) The average weight of a duck is outside the range of weights provided in the data. It would therefore be inappropriate to use the regression line to predict the wingspan of a duck, as we cannot be certain that the same relationship will continue to be true at higher weights.

Note: The regression coefficient (0.176) can be interpreted here as follows: as the weight increases by 1 g, the wingspan increases by 0.176 cm, on average.
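Parts b) to d) can be checked numerically. The sketch below works from the raw table values; the helper `estimate_wingspan` and the variable names are hypothetical choices made for illustration.

```python
# Bird data from the table: weight in grams (x), wingspan in centimetres (y).
weights = [10, 18, 18, 57, 100, 300, 220, 500, 400]
wingspans = [15, 21, 24, 33, 37, 67, 70, 99, 100]

n = len(weights)
mean_x = sum(weights) / n
mean_y = sum(wingspans) / n

s_xy = sum(x * y for x, y in zip(weights, wingspans)) - sum(weights) * sum(wingspans) / n
s_xx = sum(x * x for x in weights) - sum(weights) ** 2 / n

b = s_xy / s_xx          # regression coefficient (cm per gram)
a = mean_y - b * mean_x  # intercept

def estimate_wingspan(weight_g):
    """Estimate wingspan (cm) from weight (g) using the fitted line."""
    return a + b * weight_g

jay = estimate_wingspan(160)
duck_in_range = min(weights) <= 1000 <= max(weights)

print(f"y = {a:.1f} + {b:.3f}x")  # y = 20.0 + 0.176x
print(f"{jay:.1f}")               # 48.2
print(duck_in_range)              # False: 1 kg exceeds the heaviest bird (500 g)
```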
