Chapter 4 Bivariate Data Notes PWE 2015 - PAUL WEISER (drweiser.weebly.com/uploads/5/2/6/4/52647653/...)

Page 1 of 30

Year 11

General Maths

Year 10

General Mathematics Unit 2

Bivariate Data – Chapter 4

Chapter Four

Exercise   1st Edition                                  2nd Edition 2013
4A         1, 2, 3, 4, 6, 7, 8, 9, 10, 11               1, 2, 3, 4, 6, 7, 8, 9, 10, 11
2F (FM)    1, 2(ii), 3, 4                               1, 2(ii), 3, 4
4C         1 (a, b, c, d, g, h), 2, 3, 4, 5, 6, 7, 8    1 (a, b, c, d, g, h), 2, 3, 4, 5, 6, 7, 8
4D         1, 3, 5, 6, 7, 8, 9, 10, 13, 14, 15, 16      1, 3, 5, 6, 7, 8, 9, 10, 13, 14, 15, 16
3A (FM)    1                                            1
3B (FM)    1, 2, 3, 8, 9                                1, 2, 3, 8, 9
3C (FM)    1, 2, 3, 4, 5                                1, 2, 3, 4, 5
3E (FM)    1, 2, 3, 5, 8                                1, 2, 3, 5, 8

Please note that exercises marked (FM) are from the Year 12 Further Maths textbook. These exercises ARE included as part of these handout notes.

More resources

http://drweiser.weebly.com

Page 2 of 30

Table of Contents

Bivariate Data – Chapter 4 ................................................................................................................ 1

4A – SCATTERPLOTS ......................................................................................................................... 3

Drawing conclusions/causation ................................................................................................................... 4

2F (Further Mathematics) ................................................................................................................. 6

Pearson’s Product-Moment Correlation Coefficient (r) .............................................................................. 6

Exercise 2F .................................................................................................................................................. 7

4C – LINEAR MODELLING .................................................................................................................. 8

4D – MAKING PREDICTIONS ........................................................................................................... 12

Interpolation and Extrapolation ................................................................................................................ 14

Reliability of Results .................................................................................................................................. 14

3A INTRODUCTION TO REGRESSION (Further Maths) ..................................................................... 15

3A Method of Fitting Lines by Eye ............................................................................................................. 15

Exercise 3A ................................................................................................................................................ 16

3B Fitting a straight line – the 3-median method .......................................................................... 17

Graphical approach ................................................................................................................................... 17

Arithmetic approach ................................................................................................................................. 17

CAS CALCULATOR: Fitting a Straight Line Using the 3 Median Method ........................................................ 20

Exercise 3B ................................................................................................................................................ 21

3C Fitting a straight line – least-squares regression ....................................................................... 22

Choosing Between 3-Median and Least-Squares Regression .................................................................... 22

Calculating the least-squares regression line by hand .................................................................................. 23

Exercise 3C ................................................................................................................................................ 25

3E Residual analysis ....................................................................................................................... 26

Residual Plot ............................................................................................................................................. 27

Exercise 3E ................................................................................................................................................ 30

Page 3 of 30

4A – SCATTERPLOTS

Bivariate data result from measurements being made on each of the two variables for a given set of items.

Bivariate data can be graphed on a scatterplot (or scattergraph) as shown at left.

Each of the data points is represented by a single visible point on the graph.

When drawing a scatterplot, we need to choose the correct variable to assign to each of the axes.

The convention is to place the independent variable on the x-axis and the dependent variable on the y-axis.

The independent variable in an experiment or investigation is the variable that is deliberately controlled or adjusted by the investigator.

The dependent variable is the variable that responds to changes in the independent variable.

Example 1

The operators of a casino keep records of the number of people playing a ‘Jackpot’ type game. The table below shows the number of players for different prize amounts.

a) Draw a scatter plot of the data (no calculator)

Page 4 of 30

Drawing conclusions/causation

When data are graphed, we can often estimate by eye (rather than measure) the type of correlation involved. Our ability to make these qualitative judgements can be seen from the following examples, which summarise the different types of correlation that might appear in a scatterplot.

Page 5 of 30

Example 2

Using the same data in the first example:

a) Draw a scatter plot of the data using your CAS calculator.

b) State the type of correlation that the scatterplot shows.

c) Suggest why the plot is not perfectly linear.

Page 6 of 30

2F (Further Mathematics)

Pearson’s Product-Moment Correlation Coefficient (r)

A more precise tool for measuring the correlation between two variables is Pearson’s product-moment correlation coefficient (denoted by the symbol r). It measures the strength of the linear relationship between two variables. The value of r ranges from −1 to 1; that is, −1 ≤ r ≤ 1.
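As a rough illustration of what r measures, the Python sketch below computes it from the standard definition: the sum of the products of the deviations from the means, divided by the square root of the product of the two sums of squared deviations. The function name `pearson_r` and the small data set are made up for illustration and are not taken from these notes; in class this calculation is done on the CAS calculator.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's product-moment correlation coefficient for paired lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 points")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sums of products and squares of the deviations from the means
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
print(round(pearson_r(xs, ys), 2))  # 0.77
```

A value such as 0.77 indicates a positive linear relationship; points lying exactly on a rising straight line would give r = 1.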

Following is a gallery of scatterplots with the corresponding value of r for each.

Page 7 of 30

Exercise 2F

Page 8 of 30

4C – LINEAR MODELLING

If a linear relationship exists between a pair of variables then it is useful to be able to summarise the relationship in terms of an equation. This equation can then be used to make predictions about the levels of one variable given the value of the other.

The process of finding the equation is known as linear modelling.

An equation can be found to represent the line which passes through any two points by using two coordinate geometry formulas.

The gradient of the line passing through $(x_1, y_1)$ and $(x_2, y_2)$ is given by:

$m = \dfrac{y_2 - y_1}{x_2 - x_1}$

The equation of a straight line with gradient m that passes through $(x_1, y_1)$ is given by:

$y - y_1 = m(x - x_1)$, if you use the point $(x_1, y_1)$.

Alternatively, substitute m and the coordinates of one point into $y = mx + c$ and solve for c.

Example 1

Find the equation of the line passing through the points (2, 6) and (5, 12).
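The two formulas above can be checked with a short Python sketch. The helper name `line_through` is hypothetical and only illustrates the arithmetic: find the gradient from the two points, then substitute one point into y = mx + c to find c.

```python
def line_through(p1, p2):
    """Return (m, c) for the line y = m*x + c through two points."""
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2:
        raise ValueError("a vertical line has no gradient")
    m = (y2 - y1) / (x2 - x1)   # gradient formula
    c = y1 - m * x1             # substitute (x1, y1) into y = m*x + c
    return m, c

m, c = line_through((2, 6), (5, 12))
print(f"y = {m}x + {c}")  # y = 2.0x + 2.0
```

Substituting the second point gives the same line: 2 × 5 + 2 = 12.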

Page 9 of 30

To find the equation for a scatterplot that consists of many points we need to fit a straight line through the whole set of points.

The process of fitting a line to a set of points is often referred to as regression. The regression line or trend line (also known as the line of best fit) may be placed on a scatterplot by eye or by using the 3-median method (covered in section 3B).

The line of best fit is the straight line which most closely fits the data.

Ski Resort Data

Its equation can then be found by using the method in the previous example by choosing any two points that are on the line.

The y-intercept is the value of y when x is zero, that is, where the line crosses the y-axis.

The gradient (slope) of the line represents the rate of change of variable y with changing x.

Sometimes after drawing a scatterplot it is clear that the points represent a relationship that is not linear. The relationship might be one of the non-linear types shown below.

In such cases it is not appropriate to try to model the data by attempting to fit a straight line through the points and find its equation. It is similarly inappropriate to attempt to fit a linear model (straight line) through a scatterplot if it shows that there is no correlation between the variables.

Page 10 of 30

Example 2

The following table shows the fare charged by a bus company for journeys of differing length.

a) Represent the data using a scatterplot and place in the trend line by eye.

b) Find an equation that relates the fare, F, to distance travelled, d.

c) Explain in words the meaning of the y-intercept and gradient of the line.

Page 11 of 30

Example 3

The table below gives the times (in hours) spent by 8 students studying for a measurement test and the marks (in %) obtained on the test.

a) Draw the scatterplot to represent the data. Use your Calculator.

b) Using your calculator, find the equation of the line of best fit. Write your equation in terms of the variables: time spent studying and test mark.

Page 12 of 30

4D – MAKING PREDICTIONS

The equation of the trend line may be used to make predictions about the variables by substituting a value into the equation.

Example 1

It is found that the relationship between the number of people playing a casino Jackpot game and the prize money offered is given by the equation N = 0.07p + 220, where N is the number of people playing and p is the prize money.

a) Find the number of people playing when the prize money is $2500.

b) Find the likely prize on offer if there were 500 people playing.
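Both parts are substitutions into N = 0.07p + 220: part (a) substitutes p directly, while part (b) rearranges the equation to p = (N − 220) / 0.07. The Python sketch below shows the arithmetic; the function names are made up for illustration.

```python
def players_for_prize(p):
    """Predicted number of players N for prize money p (dollars)."""
    return 0.07 * p + 220

def prize_for_players(n):
    """Prize money p that predicts n players (equation rearranged for p)."""
    return (n - 220) / 0.07

print(round(players_for_prize(2500)))  # 395
print(round(prize_for_players(500)))   # 4000
```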

Using technology:

Alternatively, a prediction could be made from the graph’s trend line.

Page 13 of 30

Example 2

The scatterplots below show the depth of snow and the corresponding number of skiers.

From the graph’s trend line find:

a) the number of skiers when snow depth was 3 m.

b) the depth of snow that would attract about 400 people.

Page 14 of 30

Interpolation and Extrapolation

We use the term interpolation when we make predictions from a graph’s trend line from within the bounds of the original experimental data.

We use the term extrapolation when we make predictions from a graph’s trend line from outside the bounds of the original experimental data.

Data can be interpolated or extrapolated either algebraically or graphically.

Reliability of Results

Results predicted (whether algebraically or graphically) from the trend line of a scatterplot can be considered reliable only if:

1. a reasonably large number of points were used to draw the scatterplot,

2. a reasonably strong correlation was shown to exist between the variables (the stronger the correlation, the greater the confidence in predictions),

3. the predictions were made using interpolation and not extrapolation. Extrapolated results can never be considered to be reliable because when extrapolation is used we are assuming that the relationship holds true for untested values.
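The distinction between interpolation and extrapolation comes down to whether the x-value used for the prediction lies inside the range of the original data. A minimal Python sketch, with a hypothetical function name and made-up snow depths:

```python
def prediction_type(x_values, x_new):
    """Classify a prediction at x_new against the range of the original data."""
    lowest, highest = min(x_values), max(x_values)
    if lowest <= x_new <= highest:
        return "interpolation"
    return "extrapolation"

# Hypothetical snow depths (m) from an original data set
depths = [0.5, 1.0, 1.5, 2.0, 2.5, 3.5]
print(prediction_type(depths, 3.0))  # interpolation
print(prediction_type(depths, 5.0))  # extrapolation
```

A prediction labelled extrapolation fails condition 3 above, however strong the correlation is.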

Page 15 of 30

3A INTRODUCTION TO REGRESSION (Further Maths)

The process of ‘fitting’ straight lines to bivariate data enables us to analyse relationships between the data and possibly make predictions based on the given data set.

Regression analysis is concerned with finding these straight lines using various methods so that the number of points above and below the line is ‘balanced’.

3A Method of Fitting Lines by Eye

There should be an equal number of points above and below the line.

Example 1:

Fit a straight line to the data in the figure using the equal-number-of-points method.

Page 16 of 30

Exercise 3A

1. Fit a straight line to the data in the scatterplots using the equal-number-of-points method.

Page 17 of 30

3B Fitting a straight line – the 3-median method

Fitting lines by eye is useful but it is not the most accurate of methods.

We can find the line of best fit in the form of ___________________________________

One method to find the line of best fit is called the 3-median method.

This method is as follows:

Step 1. Plot the points on a scatterplot.

Step 2. Divide the points into 3 groups (lower, middle and upper) using vertical divisions

(a) If the number of points is divisible by 3, divide them into 3 equal groups

(b) If there is 1 extra point, put the extra point in the middle group

(c) If there are 2 extra points, put 1 extra point in each of the outer groups

Step 3. Find the median point of each of the 3 groups and mark each median on the scatterplot (the median of the x-values and the median of the y-values in the group).

(a) The median of the lower group is denoted by $(x_L, y_L)$

(b) The median of the middle group is denoted by $(x_M, y_M)$

(c) The median of the upper group is denoted by $(x_U, y_U)$

Note: Although the x-values are already in ascending order on the scatterplot, the y-values within each group may need re-ordering before you can find the median.
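Steps 2 and 3 can be sketched in Python as follows. The function names are hypothetical, and the data are borrowed from the mobile phone example in section 3C purely for illustration. The grouping follows the rules in Step 2: one extra point goes in the middle group, two extra points go one in each outer group.

```python
from statistics import median

def three_median_groups(points):
    """Split (x, y) points into lower, middle and upper groups by x-value."""
    pts = sorted(points)               # order the points by x-value
    n = len(pts)
    base, extra = divmod(n, 3)
    # extra == 1: the extra point falls in the middle group
    # extra == 2: one extra point goes in each outer group
    outer = base + (1 if extra == 2 else 0)
    return pts[:outer], pts[outer:n - outer], pts[n - outer:]

def median_point(group):
    """Median of the x-values and median of the y-values in a group."""
    xs = [x for x, _ in group]
    ys = [y for _, y in group]
    return median(xs), median(ys)

minutes = [1, 3, 4, 7, 10, 12, 14, 15]
calls = [11, 9, 10, 6, 8, 4, 3, 1]
lower, middle, upper = three_median_groups(zip(minutes, calls))
print(median_point(lower), median_point(middle), median_point(upper))
```

With 8 points there are 2 extra, so the groups have 3, 2 and 3 points. Note that `median` sorts the y-values itself, which matches the re-ordering mentioned in the note above.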

Steps 4 and 5 can be completed using 2 different approaches; graphical or arithmetic

Graphical approach

Step 4. Place your ruler so that it passes through the lower and upper medians. Move the ruler a third of the way toward the middle group median while maintaining the slope. Hold the ruler there and draw the line.

Step 5. Find the equation of the line (general form y = mx + c).

There are two general methods.

(a) Method A: Choose two points which lie on the line and use these to find the gradient of the line and then the equation of the line.

$m = \dfrac{y_2 - y_1}{x_2 - x_1}$

Substitute the coordinates of one point and m into the equation to find c.

(b) Method B: If the scale on the axes begins at zero, you can read off the y-intercept of the line and calculate the gradient of the line.

Arithmetic approach

Step 4. Calculate the gradient (m) of the line. Use the rule: $m = \dfrac{y_U - y_L}{x_U - x_L}$

Step 5. Calculate the y-intercept (c) of the line. Use the rule: $c = \dfrac{1}{3}\left[(y_L + y_M + y_U) - m(x_L + x_M + x_U)\right]$

Thus, the equation of the regression line is y = mx + c.
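The arithmetic approach translates directly into code. In the Python sketch below, the function name `three_median_line` and the three median points are illustrative assumptions, not values from the worked examples in these notes.

```python
def three_median_line(lower, middle, upper):
    """Return (m, c) for the 3-median regression line y = m*x + c.

    Each argument is the median point (x, y) of one group.
    """
    (x_l, y_l), (x_m, y_m), (x_u, y_u) = lower, middle, upper
    # Step 4: gradient from the lower and upper median points
    m = (y_u - y_l) / (x_u - x_l)
    # Step 5: y-intercept from all three median points
    c = ((y_l + y_m + y_u) - m * (x_l + x_m + x_u)) / 3
    return m, c

# Illustrative median points only
m, c = three_median_line((3, 10), (8.5, 7), (14, 3))
print(f"y = {m:.2f}x + {c:.2f}")  # y = -0.64x + 12.08
```

Using all three medians for c has the same effect as sliding the ruler a third of the way toward the middle median in the graphical approach.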

Page 18 of 30

Example 1: Find the equation of the regression line for the data in the table using the 3-median method. Give coefficients correct to 2 decimal places.

1. Sketch the scatterplot then divide it into 3 groups.

2. Use the graphical approach to find the equation of the line of best fit.

Page 19 of 30

3. Use the arithmetic approach to find the equation of the line of best fit.

i. Find the gradient of the line

ii. Find the y-intercept

iii. Find the equation of the line

Page 20 of 30

CAS CALCULATOR: Fitting a Straight Line Using the 3 Median Method

Example 2

Find the equation of the regression line for the data in the table below using the 3-median method. Give coefficients correct to 2 decimal places.

On a Lists & Spreadsheet page, enter x-values into column A and y-values into column B. Label the columns accordingly.

To draw a scatterplot of the data, add a Data & Statistics page.

Press Tab to move to each axis and select ‘Click to add variable’. Place x on the horizontal axis and y on the vertical axis.

The graph should appear as shown. If you move the pointer over any point and press Click twice, the coordinates for that point will be displayed.

To fit a regression line, complete the following steps. Press:

• MENU

• 4: Analyse

• 6: Regression

• 3: Show Median–Median

Page 21 of 30

Exercise 3B

Page 22 of 30

3C Fitting a straight line – least-squares regression

Another method for finding the equation of a straight line fitted to data is known as the method of least-squares regression. It is used when data show a linear relationship and have no obvious outliers.

To understand the underlying theory behind least-squares, consider the regression line shown below.

We wish to minimise the total of the vertical distances between the data points and the line, or ‘errors’, in some way. One option would be to balance the errors above and below the line. This is reasonable, but for sophisticated mathematical reasons it is preferable to minimise the sum of the squares of these errors. This is the essential mathematics of least-squares regression.
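In symbols (a sketch of the standard formulation, using m and c for the gradient and y-intercept as elsewhere in these notes): if the line is y = mx + c, the error for the data point $(x_i, y_i)$ is the vertical distance $e_i$, and least-squares regression chooses m and c to make the sum of the squared errors S as small as possible.

```latex
e_i = y_i - (m x_i + c), \qquad
S(m, c) = \sum_{i=1}^{n} e_i^{2} = \sum_{i=1}^{n} \bigl( y_i - m x_i - c \bigr)^{2}
```

Squaring makes every term non-negative, so errors above and below the line cannot cancel each other out.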

Choosing Between 3-Median and Least-Squares Regression

The 3-median method should be used in preference to the least-squares regression method if there are clear outliers in the data.

The calculation of the equation of a least-squares regression line is simple using a CAS calculator.

Example 3

A study shows the more calls a teenager makes on their mobile phone, the less time they spend on each call. Find the equation of the linear regression line for the number of calls made plotted against call time in minutes using the least-squares method on a CAS calculator. Express coefficients correct to 2 decimal places.

Number of minutes 1 3 4 7 10 12 14 15

Number of calls 11 9 10 6 8 4 3 1
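As a cross-check on the calculator result, the same line can be computed in Python from the standard least-squares formulas: the gradient is the sum of the products of the deviations divided by the sum of the squared x-deviations, and the line passes through the point of means. The function name `least_squares_line` is hypothetical; the data are the values in the table above.

```python
def least_squares_line(xs, ys):
    """Return (m, c) for the least-squares regression line y = m*x + c."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    m = sxy / sxx
    c = y_bar - m * x_bar   # the line passes through (x_bar, y_bar)
    return m, c

minutes = [1, 3, 4, 7, 10, 12, 14, 15]
calls = [11, 9, 10, 6, 8, 4, 3, 1]
m, c = least_squares_line(minutes, calls)
print(f"calls = {m:.2f} x minutes + {c:.2f}")  # calls = -0.63 x minutes + 11.73
```

The negative gradient agrees with the study's finding that longer call times go with fewer calls.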

On a Lists & Spreadsheet page, enter the minutes values into column A and the number of calls values into column B. Label the columns accordingly.

Page 23 of 30

To draw a scatterplot of the data in a Data & Statistics page, press Tab to move to each axis and select ‘Click to add variable’. Place minutes on the horizontal axis and calls on the vertical axis. The graph will appear as shown.

To fit a least-squares regression line, complete the following steps. Press:

• MENU

• 4: Analyse

• 6: Regression

• 1: Show Linear (mx+b)

To find r and r², return to the Lists & Spreadsheet page by pressing Ctrl and then the left arrow. Summary variables are found by pressing:

• MENU

• 4: Statistics

• 1: Stat Calculations

• 3: Linear Regression (mx+b)

Complete the table as shown below and press OK to display the statistical parameters. Notice that the equation is stored and labelled as function f1.

The regression information is stored in the first available column on the spreadsheet.

Calculating the least-squares regression line by hand

Summary data needed:

$\bar{x}$: the mean of the independent variable (x-variable)

$\bar{y}$: the mean of the dependent variable (y-variable)

$s_x$: the standard deviation of the independent variable

$s_y$: the standard deviation of the dependent variable

r: Pearson’s product–moment correlation coefficient.

Page 24 of 30

Formula to use:

The general form of the least-squares regression line is $y = mx + c$

where the slope of the regression line is $m = r\,\dfrac{s_y}{s_x}$

and the y-intercept of the regression line is $c = \bar{y} - m\bar{x}$

Example 4:

A study to find a relationship between the height of husbands and the height of their wives revealed the following details.

Mean height of the husbands: 180 cm

Mean height of the wives: 169 cm

Standard deviation of the height of the husbands: 5.3 cm

Standard deviation of the height of the wives: 4.8 cm

Correlation coefficient, r = 0.85

The form of the least-squares regression line is to be: Height of wife = m × height of husband + c

(a) Which variable is the dependent variable? ______________________________

(b) Calculate the value of m for the regression line (to 2 decimal places).

(c) Calculate the value of c for the regression line (to 2 decimal places).

(d) Use the equation of the regression line to predict the height of a wife whose husband is 195 cm tall (to the nearest cm).
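Parts (b) to (d) can be checked with a few lines of Python. This sketch assumes the standard by-hand formulas, slope = r × (s_y / s_x) and intercept = (mean of y) − slope × (mean of x), with the husband's height as x and the wife's height as y. The variable names are made up for illustration.

```python
# Summary statistics from the example
mean_husband, mean_wife = 180, 169   # cm
sd_husband, sd_wife = 5.3, 4.8       # cm
r = 0.85

# Slope and intercept, with husband as x and wife as y (assumed formulas)
m = r * sd_wife / sd_husband
c = mean_wife - m * mean_husband

predicted_wife = m * 195 + c         # husband 195 cm tall
print(round(m, 2), round(c, 2), round(predicted_wife))  # 0.77 30.43 181
```

Here c is calculated from the unrounded slope; using the rounded value 0.77 gives c = 30.40, and the prediction still rounds to 181 cm. Since 195 cm lies well above the mean husband height, the prediction may well be an extrapolation and should be treated with caution.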

Page 25 of 30

Exercise 3C

Page 26 of 30

3E Residual analysis

There are situations where the mere fitting of a regression line to some data is not enough to convince us that the data set is truly linear. Even if the correlation is close to +1 or −1 it still may not be convincing enough.

The next stage is to analyse the residuals, or deviations, of each data point from the straight line.

A residual is the vertical difference between each data point and the regression line.

When we plot the residual values against the original x-values and the points are randomly scattered above and below zero (the x-axis), then the original data is most likely to have a linear relationship.

If the residual plot shows some sort of pattern then the original data is probably not linear.

Page 27 of 30

Residual Plot

To produce a residual plot, carry out the following steps:

Step 1. Draw up a table as follows

x                         1    2    3    4    5    6    7    8    9   10
y                         5    6    8   15   24   47   77  112  187  309
ypred
Residuals (y − ypred)

Step 2. Find the equation of the least-squares regression line y = mx + b using the graphics calculator.

Step 3. Calculate the predicted y-values (ypred) using the least-squares regression equation.

The predicted y-values are the y-values on the regression line.

Put these values into the table.

Step 4. Calculate the residuals.

Residual value = y − ypred, where y is the actual data value and ypred is the y-value from the regression line.

Enter these values into the table.

Note: for a least-squares regression line, the residuals always sum to zero (or very close to zero once values are rounded).

Step 5. Plot the residual values against the original x-values.

If the data points in the residual plot are randomly scattered above and below zero (the x-axis), then the original data will probably be linear.

If the residual plot shows a pattern then the original data is probably not linear.
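Steps 2 to 4 can be sketched in Python using the x- and y-values from the Step 1 table. The helper `least_squares_line` is a hypothetical name and uses the standard least-squares formulas; on the CAS calculator the same values come from the Linear Regression (mx+b) command.

```python
def least_squares_line(xs, ys):
    """Return (m, b) for the least-squares regression line y = m*x + b."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    m = sxy / sxx
    return m, y_bar - m * x_bar

xs = list(range(1, 11))
ys = [5, 6, 8, 15, 24, 47, 77, 112, 187, 309]

m, b = least_squares_line(xs, ys)                  # Step 2
y_pred = [m * x + b for x in xs]                   # Step 3
residuals = [y - yp for y, yp in zip(ys, y_pred)]  # Step 4

for x, y, yp, res in zip(xs, ys, y_pred, residuals):
    print(f"x={x:2d}  y={y:3d}  ypred={yp:7.2f}  residual={res:7.2f}")
print("sum of residuals is close to zero:", abs(sum(residuals)) < 1e-9)
```

For these data the residuals are positive at both ends and negative in the middle. That curved pattern is the kind of feature Step 5 asks you to look for in the residual plot.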

Example 8

Use the data below to produce a residual plot and comment on the likely linearity of the data.

Step 1.

x                         1    2    3    4    5    6    7    8    9   10
y                         5    6    8   15   24   47   77  112  187  309
ypred
Residual (y − ypred)

Step 2. Equation of the least-squares regression line.

y = mx + b

Page 28 of 30

Step 3. Calculate the predicted y-values using the equation _________________________________

When x = 1 ypred =

=

=

When x = 2 ypred =

=

=

Alternatively, use the CAS calculator to get the ypred values from the regression line: open a Graphs & Geometry page, enter the equation of the least-squares regression line and press Enter.

Once you have the graph, press:

• MENU

• 5: Trace

• 1: Graph Trace

• Type in the x-value and the corresponding y-value will appear.

Step 4. Calculate the residuals.

Residual = y − ypred

Residual = Residual =

= =

= =

Calculate the rest of the residuals and enter them into the table.

Add all the residuals to check that their sum equals zero (or is very close to zero).

Step 5. Plot residual values against original x-values.

[Blank grid for the residual plot: x from 0 to 10 on the horizontal axis; Residual from −50 to 100 on the vertical axis]

Page 29 of 30

The residual plot shows _____________________________________________________________________

________________________________________________________________________________________

________________________________________________________________________________________

Using a CAS calculator

Find the equation of a least-squares regression line.

Enter the data on a Lists & Spreadsheet page.

To find the values of m and b for the equation y = mx + b, press:

• MENU

• 4: Statistics

• 1: Stat Calculations

• 3: Linear Regression (mx + b)

To generate the residual values in their own column, move to the shaded cell in column E and press:

• Ctrl

• MENU

• 4: Variables …

• 3: Link To:

• Select the list stat6.resid

Write down all of the residuals displayed in the column. Scroll down for the complete list of values.

Note: The stat number will vary depending on the calculator and previously stored data.

Example 9

Using the same data as in Example 8, plot the residuals and discuss the features of the residual plot.

Generate the list of residuals as demonstrated in Example 8.

On the Data & Statistics page select x for the x-axis and stat.resid for the y-axis.

To identify if a pattern exists, it is useful to join the residual points.

To do this, press:

• MENU

• 2: Plot Properties

• 1: Connect Data Points

Page 30 of 30

Exercise 3E
