
SCHOLAR Study Guide

Higher Applications of Mathematics: Statistics and probability

Authored by:

David Young (The University of Strathclyde)

John Reilly (Education Scotland)

Heriot-Watt University

Edinburgh EH14 4AS, United Kingdom.


First published 2021 by Heriot-Watt University.

This edition published in 2022 by Heriot-Watt University SCHOLAR.

Copyright © 2022 SCHOLAR Forum.

Members of the SCHOLAR Forum may reproduce this publication in whole or in part for educational purposes within their establishment providing that no profit accrues at any stage. Any other use of the materials is governed by the general copyright statement that follows.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, without written permission from the publisher.

Heriot-Watt University accepts no responsibility or liability whatsoever with regard to the information contained in this study guide.

Distributed by the SCHOLAR Forum.

SCHOLAR Study Guide Higher Applications of Mathematics: Statistics and probability

Higher Applications of Mathematics Course Code: C844 76

Print Production and Fulfilment in UK by Print Trail www.printtrail.com


Acknowledgements

Thanks are due to the members of Heriot-Watt University's SCHOLAR team who planned and created these materials, and to the many colleagues who reviewed the content.

We would like to acknowledge the assistance of the education authorities, colleges, teachers and students who contributed to the SCHOLAR programme and who evaluated these materials.

Grateful acknowledgement is made for permission to use the following material in the SCHOLAR programme:

The Scottish Qualifications Authority for permission to use Past Papers assessments.

The Scottish Government for financial support.

The content of this Study Guide is aligned to the Scottish Qualifications Authority (SQA) curriculum.

All brand names, product names, logos and related devices are used for identification purposes only and are trademarks, registered trademarks or service marks of their respective holders.


Contents

2 Statistics and probability

1 Basic probability

2 Interpreting data

3 Correlation and linear regression

4 Hypothesis testing and confidence intervals

5 Statistics and probability: End of section test

Glossary

Answers to questions and activities


Unit 2: Statistics and probability

1 Basic probability
1.1 Introduction
1.2 Definitions
1.3 Tree diagrams
1.4 Venn diagrams
1.5 Carrying out basic calculations involving combination of events
1.6 Manipulating probabilities
1.7 Summary
1.8 End of topic test

2 Interpreting data
2.1 Introduction
2.2 Basic terminology
2.3 Types of data
2.4 Graphical displays
2.5 Numerical summaries
2.6 Choice of descriptive statistics
2.7 Computing descriptive statistics
2.8 The normal distribution
2.9 Exercises
2.10 Summary
2.11 Resources
2.12 End of topic test

3 Correlation and linear regression
3.1 Introduction
3.2 Correlation
3.3 Linear regression
3.4 Exercises
3.5 Summary
3.6 Resources
3.7 End of topic test

4 Hypothesis testing and confidence intervals
4.1 Introduction
4.2 Hypothesis testing
4.3 Tests for comparing two groups
4.4 Tests for comparing more than two groups
4.5 Tests for categorical data
4.6 Hypothesis tests for normality and correlation
4.7 Notes on statistical testing
4.8 Confidence intervals
4.9 Exercises
4.10 Summary
4.11 Resources
4.12 End of topic test

5 Statistics and probability: End of section test

© HERIOT-WATT UNIVERSITY


Unit 2 Topic 1

Basic probability

Contents

1.1 Introduction
1.2 Definitions
1.3 Tree diagrams
1.4 Venn diagrams
1.5 Carrying out basic calculations involving combination of events
1.6 Manipulating probabilities
1.6.1 Intersection and union of events
1.6.2 Complementary events
1.6.3 Independence
1.7 Summary
1.8 End of topic test

Learning objective

By the end of this topic, you should be able to:

• understand some basic definitions relating to probability;

• construct and interpret tree diagrams;

• construct and interpret Venn diagrams;

• carry out basic calculations involving combination of events, where information may be displayed in tables or graphs.


1.1 Introduction

In a survey of a group of 85 high school students, 27 receive private tutoring. If a student is selected at random, will that student have a tutor? It is possible to estimate the chance that the randomly selected student has a tutor based on the information given; 27 out of 85 have private tutors. The information from the survey can therefore be used to predict whether a randomly selected student has a tutor or not.

The proportion of students with tutors among all the students is $\frac{27}{85} = 0.318$. For those without tutors, the proportion is $\frac{85 - 27}{85} = \frac{58}{85} = 0.682$. These proportions (or relative frequencies) can be used as the probability of choosing a student who does or does not have a tutor. If a student is chosen from the sample randomly, the probability of choosing one with a tutor is 0.318, which can be written P(student has a tutor) = 0.318, and the probability of choosing one without a tutor is 0.682, which can be written P(student does not have a tutor) = 0.682.

If students were selected at random over and over, it would be possible to calculate the proportion of samples in which the selected student had a tutor. It would be expected that this proportion should be close to 0.32. The more times the repeat sampling is done, the closer the proportion should come to 0.32.
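The repeated-sampling idea can be tried out with a short simulation. The following Python sketch is only an illustration: the function name `proportion_with_tutor`, the fixed seed and the numbers of draws are assumptions made for this example, not part of the course.

```python
import random

random.seed(1)  # fixed seed (an arbitrary choice) so that a run can be repeated

# The survey group: True marks the 27 students with a tutor, False the 58 without.
students = [True] * 27 + [False] * 58

def proportion_with_tutor(n_draws):
    """Select one student at random n_draws times and return the
    proportion of selections in which the student had a tutor."""
    hits = sum(random.choice(students) for _ in range(n_draws))
    return hits / n_draws

exact = 27 / 85  # the relative frequency from the survey

# The simulated proportion should settle near the exact value as n grows.
for n in (100, 10_000, 100_000):
    print(n, round(proportion_with_tutor(n), 3), "exact:", round(exact, 3))
```

With only 100 draws the simulated proportion can be noticeably different from 0.318; with many more draws it is expected to lie much closer.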

Statistically this kind of estimation problem is often approached by considering the number of ways that one student can be selected from a group of 85. If each student is equally likely to be selected, then there are 85 ways of selecting a student. If we are interested in inferring something about students who have a tutor, the number of ways of selecting a student with a tutor from the 85 is 27. Selecting from a sample (of students in this case) an individual with the characteristic of interest (a tutor in this case) is usually referred to as a 'success'.

Since the probability is a ratio of two counts, it cannot be negative. If all 85 students have a tutor then the number of successes is the same as the number of students so the probability of a success is equal to 1, i.e. picking a student with a tutor would be a certainty. If none of the students have a tutor then the number of successes, and hence the probability of a success, is equal to 0, i.e. it would be impossible to pick a success. Therefore, the value of a probability must lie between 0 and 1.

Example

A pack of 52 cards is shuffled and one card is drawn. A 'success' is defined as selecting an ace.

There are four aces in a pack of cards which means there are four possible 'successes', therefore the probability of success is $\frac{4}{52} = \frac{1}{13} = 0.077$.

This can be written as P(drawing an ace) = 0.077.

In this example, it is possible to calculate the exact value of the probability of a success since the number of possible successes and the total number of ways of selecting a card are known. In most real-life situations it is impossible to look at every member of a population and thus we must take a sample. The sample result can then be used to estimate the probability from the number of successes in the sample.


Top tip

The following mathematical notation is not required for this course but is useful shorthand.

The formula for calculating the probability of success (p) is

$p = \frac{r}{n}$

where r is the number of possible successes and n is the number of possible outcomes.

• 0 ≤ p ≤ 1

• p = 0 is an impossible event.

• p = 1 is a certain event.
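As a rough illustration of this formula, the sketch below applies it to the two situations met so far (the tutoring survey and drawing an ace). The function name `probability` and its error check are assumptions made for the example.

```python
from fractions import Fraction

def probability(r, n):
    """Return p = r / n for r possible successes among n equally likely outcomes."""
    if n <= 0 or not 0 <= r <= n:
        raise ValueError("need n > 0 and 0 <= r <= n")
    return Fraction(r, n)

p_tutor = probability(27, 85)   # student has a tutor
p_ace = probability(4, 52)      # drawing an ace

print(p_tutor, round(float(p_tutor), 3))
print(p_ace, round(float(p_ace), 3))

# The two boundary cases: an impossible event and a certain event.
p_impossible = probability(0, 85)
p_certain = probability(85, 85)
```

Using `Fraction` keeps the answers exact, so $\frac{4}{52}$ is automatically reduced to $\frac{1}{13}$ before it is rounded to a decimal.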

1.2 Definitions

A trial is a process or experiment that yields information. If the trial is repeated several times then it gives a set of observations, e.g.

• randomly selecting a student from a class and recording whether they have a private tutor or not;

• throwing a coin and recording whether it lands on heads or tails;

• giving a group of 100 patients with a heart condition a new drug and observing the number for whom the drug improves their condition after treatment;

• randomly selecting a card from a pack.

An outcome is the result of carrying out a trial, e.g.

• the student has a private tutor;

• the coin lands on heads;

• the number of patients whose condition improves is 67;

• the card is the seven of clubs.

An event is a set that consists of one or more possible outcomes of a trial, e.g.

• the student has no tutor;

• the coin lands on heads;

• the number of patients whose condition improves is between 25 and 75;

• an ace is drawn from the pack.


The sample space is the set of all possible outcomes of the trial, e.g.

• the student has a private tutor or the student has no tutor;

• the coin lands on heads or the coin lands on tails;

• the number of patients whose condition improves lies between 0 and 100;

• the card is any one of the 52 cards in the pack.

Examples

1. In a trial, a coin is thrown three times and the sequence of results is recorded as the outcome. The sample space (i.e. all possible outcomes from the trial) is:

HHH, THH, HTH, HHT, TTH, HTT, THT, TTT

where H = heads and T = tails.

Define three events as follows:

• at least one throw of the coin lands on heads (HHH, THH, HTH, HHT, TTH, HTT, THT);

• all throws of the coin land on tails (TTT);

• exactly two throws of the coin land on heads (THH, HTH, HHT).

What is the probability of each event?

There are eight outcomes in the sample space.

• There are seven outcomes in the event so the probability that at least one throw of the coin lands on heads is $\frac{7}{8}$, or P(at least one throw of the coin lands on heads) = 0.875.

• There is one outcome in the event so the probability that all throws of the coin land on tails is $\frac{1}{8}$, or P(all throws of the coin land on tails) = 0.125.

• There are three outcomes in the event so the probability that exactly two throws of the coin land on heads is $\frac{3}{8}$, or P(exactly two throws of the coin land on heads) = 0.375.
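The counting in this example can be reproduced by listing the sample space and counting the outcomes that belong to each event. This is a minimal sketch; the helper name `prob` and the variable names are assumptions chosen for the illustration.

```python
from fractions import Fraction
from itertools import product

# All sequences of three throws, e.g. "HHT"; there are 2 * 2 * 2 = 8 of them.
sample_space = ["".join(throws) for throws in product("HT", repeat=3)]

def prob(event):
    """Probability of an event, given as a test applied to each equally likely outcome."""
    favourable = [outcome for outcome in sample_space if event(outcome)]
    return Fraction(len(favourable), len(sample_space))

p_at_least_one_head = prob(lambda o: "H" in o)
p_all_tails = prob(lambda o: o == "TTT")
p_exactly_two_heads = prob(lambda o: o.count("H") == 2)

print(p_at_least_one_head, p_all_tails, p_exactly_two_heads)
```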

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

2. A coin is thrown three times and the number of heads recorded. The sample space is:

0, 1, 2, 3.

Determine the probabilities associated with each outcome in this sample space.

Looking at the outcomes from the sample space for the previous example (HHH, THH, HTH, HHT, TTH, HTT, THT, TTT) makes it possible to determine the probabilities associated with each outcome in this sample space.


Outcome | 0 | 1 | 2 | 3
Matching outcomes from previous example | TTT | HTT, THT, TTH | HHT, HTH, THH | HHH
P(outcome) | $\frac{1}{8}$ | $\frac{3}{8}$ | $\frac{3}{8}$ | $\frac{1}{8}$
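The table can be rebuilt by grouping the eight equally likely sequences according to their number of heads. The sketch below is one way of doing this; the variable names are assumptions for the example.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

sequences = list(product("HT", repeat=3))  # the eight equally likely sequences

# Count how many sequences give each number of heads.
heads_count = Counter(seq.count("H") for seq in sequences)

# Convert the counts to probabilities.
distribution = {
    heads: Fraction(count, len(sequences))
    for heads, count in sorted(heads_count.items())
}

for heads, p in distribution.items():
    print(heads, p)
```

Note that the four outcomes 0, 1, 2 and 3 are not equally likely, even though the eight underlying sequences are.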

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

3. First year students at university either live at home, in halls of residence or in a rented flat.

Registration information shows that:

• three quarters of all first year students live at home;

• one fifth of the remaining students live in rented flats.

Determine the probabilities associated with each outcome in this sample space.

The probability that a first year student lives at home is given to us as three quarters. We are also told that one fifth of the remaining quarter live in rented flats so the probability of that is $\frac{1}{5} \times \frac{1}{4} = \frac{1}{20}$. The remainder of the students must live in halls of residence so the probability of that is $\frac{1}{4} - \frac{1}{20} = \frac{5}{20} - \frac{1}{20} = \frac{4}{20} = \frac{1}{5}$.

Type of accommodation | Home | Halls of residence | Rented flat
P(type of accommodation) | $\frac{3}{4}$ | $\frac{1}{5}$ | $\frac{1}{20}$

Note: In this and the previous example, the sum of all the probabilities is equal to 1.
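The arithmetic for the accommodation example, and the check that the probabilities add up to 1, can be written out as follows. This is only a sketch; the variable names are assumptions for the illustration.

```python
from fractions import Fraction

p_home = Fraction(3, 4)                 # three quarters live at home
p_flat = Fraction(1, 5) * (1 - p_home)  # one fifth of the remaining quarter
p_halls = 1 - p_home - p_flat           # everyone else lives in halls

print(p_home, p_halls, p_flat)
print("total:", p_home + p_halls + p_flat)
```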

Top tip

The following mathematical notation is not required for this course but is useful shorthand.

We can use formal notation to represent a sample space as follows:

S = {s1, s2, . . . }

where each si is a possible outcome.

The sum of all the probabilities of all of the outcomes in a trial is equal to 1. This is an important observation in probability theory.

In many real-life situations, the sample space of a trial will consist of events that are not equally likely. For example, consider the example of treating patients with a heart condition with a new drug. The number out of 100 for whom the drug works will depend on how good the drug is, so some values out of 100 (the outcomes or elementary events in the sample space) will be more likely than others, e.g. if the drug is very good, then it would be expected that outcomes such as 70, 80 or 90 would be more likely than outcomes such as 0, 5, or 20.


1.3 Tree diagrams

A tree diagram is a good visual aid in computing probabilities. Consider the example where a coin is thrown three times and the results recorded, giving a sequence of heads (H) and tails (T). The sample space is HHH, THH, HTH, HHT, TTH, HTT, THT, TTT.

Results are equally likely on each throw, so the probability of either a heads or a tails each throw is a half. The tree diagram in Figure 1.1 represents the sequence of throws.

Figure 1.1: Tree diagram for throwing a coin three times

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

A particular outcome in the sample space is found by following the branches of the tree. To derive the probability of a particular outcome, e.g. all heads or outcome HHH, the probabilities associated with each branch of that outcome are multiplied together, e.g. P(all heads) = $\frac{1}{2} \times \frac{1}{2} \times \frac{1}{2} = \frac{1}{8}$.

To derive the probability of a particular event in the sample space, e.g. at least one heads, which includes outcomes HHH, THH, HTH, HHT, TTH, HTT, THT, the probabilities associated with each outcome in that event are added together, e.g. P(at least one heads) = $\frac{1}{8} + \frac{1}{8} + \frac{1}{8} + \frac{1}{8} + \frac{1}{8} + \frac{1}{8} + \frac{1}{8} = \frac{7}{8}$.
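The two tree-diagram rules (multiply along the branches of a path, then add over the paths that make up an event) can be sketched in Python as below. The dictionary `branch` and the other names are assumptions made for this illustration.

```python
from fractions import Fraction
from itertools import product
from math import prod

# Probability attached to each branch of the tree on any single throw.
branch = {"H": Fraction(1, 2), "T": Fraction(1, 2)}

# Rule 1: the probability of a path is the product of its branch probabilities.
paths = {
    "".join(path): prod(branch[side] for side in path)
    for path in product("HT", repeat=3)
}
p_all_heads = paths["HHH"]

# Rule 2: the probability of an event is the sum over the paths it contains.
p_at_least_one_head = sum(p for outcome, p in paths.items() if "H" in outcome)

print(p_all_heads, p_at_least_one_head)
```

Because the branch probabilities are kept in one place, changing them (for example to model a biased coin) would update every path probability without altering the rest of the sketch.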


Tree diagrams: Quiz

Q1: Three people are randomly selected from voter registration and driving records to report for jury duty. The gender of each person is noted by the clerk.

a) List the possible outcomes in the sample space and construct a tree diagram to illustrate the selection process.

b) If each person is just as likely to be female or male, what is the probability of each outcome in the sample space?

c) What is the probability that one male is selected?

d) What is the probability that three females are selected?

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Q2: A supermarket plans to conduct an experiment to compare its own brand of cola with two well-known brands. One shopper is selected at random and asked to taste and rank the three brands without knowing which is which: the supermarket own brand, well-known brand 1 and well-known brand 2.

a) List the outcomes in the sample space and construct a tree diagram to illustrate the selection process.

b) What is the probability associated with each outcome (each possible rank order) in the sample space?

c) If the shopper has no ability to distinguish a difference in tastes, what is the probability that the supermarket own brand will be ranked first?


1.4 Venn diagrams

A Venn diagram is another way in which we can sort groups of data and visually represent probabilities. A Venn diagram consists of a rectangle and one or more circles. The rectangle represents all of the values that we need to consider and is known as the universal set. Each circle represents a set. If the circles overlap then we have the intersection of two or more sets.

Example

The whole numbers from 0 to 10 are distributed into the following sets:

• set A, the odd numbers between 0 and 10 (1, 3, 5, 7, 9);

• set B, the numbers between 5 and 10 (5, 6, 7, 8, 9, 10).

This information can be displayed in a Venn diagram as follows.

From the diagram, it can be seen that the numbers 0, 2 and 4 are not in either of the circles, or sets, but are inside the rectangle, the universal set. Also, it can be seen that the numbers 5, 7 and 9 are in both. We can represent this information in the following Venn diagram.
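The regions of this Venn diagram can also be worked out with Python's built-in set type. The sketch below is an illustration only; the variable names are assumptions.

```python
# Universal set: the whole numbers from 0 to 10.
universal = set(range(0, 11))

A = {n for n in universal if n % 2 == 1}    # odd numbers: 1, 3, 5, 7, 9
B = {n for n in universal if 5 <= n <= 10}  # numbers from 5 to 10

both = A & B                   # overlap of the two circles
only_A = A - B                 # in circle A but not in circle B
only_B = B - A                 # in circle B but not in circle A
neither = universal - (A | B)  # inside the rectangle, outside both circles

print(sorted(only_A), sorted(both), sorted(only_B), sorted(neither))
```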


Venn diagrams: Quiz

Q3: The numbers 1 to 10 are distributed into two sets:

• set A is the set of even numbers;

• set B is the set of multiples of 3.

Draw a Venn diagram to illustrate this data.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Q4: A music school survey of 50 pupils provides the following information:

• 18 pupils play both piano and guitar;

• 30 play piano;

• 28 play guitar.

a) Draw a Venn diagram to illustrate this data.

b) How many pupils play neither instrument?

c) What is the probability that a pupil chosen at random plays neither instrument?

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Q5: In a fitness survey of students, the following data on the types of activity were obtained:

• 53 students cycle;

• 69 students run;

• 56 students swim;

• 32 students swim and cycle;

• 42 students run and swim;

• 38 students run and cycle;

• 24 students swim, cycle and run.

a) Draw a Venn diagram to illustrate this data.

b) How many students took part in the survey?

c) What is the probability that a student chosen at random will participate in all three activities?

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .


Q6: There are 70 pupils in a year group studying languages.

All 70 pupils study at least one language:

• 8 students study French, Spanish and Italian;

• 18 of the students speak French and Italian;

• 28 of the students speak French and Spanish;

• 17 of the students speak Spanish and Italian;

• 43 students speak French;

• 48 students speak Spanish.

a) Draw a Venn diagram to show this information.

b) How many students speak Italian only?

c) What is the probability that a student chosen at random speaks French but not Italian?

1.5 Carrying out basic calculations involving combination of events

Conditional probability involves events that are inter-related so that the outcome of one event is affected by the result of another event, e.g. picking two cards from a pack without replacing them, where the second choice is from a total of one fewer cards excluding the card that was chosen first.

Examples

1. A group of 120 guests attended a wedding reception. Guests had a choice of main meal: chicken or a vegetarian option. 90 chose chicken. Two days later, 75 of the guests are suffering from a form of food poisoning called salmonella. Of these, 65 ate chicken.

Salmonella infection is often associated with chicken and this is the type of data that public health officials may collect and analyse if an outbreak of disease occurs.

a) If a public health investigator randomly chooses a guest to interview, what is the probability that the guest ate chicken?

b) If a guest is randomly chosen from only those that are currently ill, what is the probability that the guest ate chicken?

If the following events are defined:

• guest ate chicken;

• guest is ill;

then the following summary table can be produced.


 | Guest ate chicken | Guest did not eat chicken | Total
Guest is ill | 65 | 10 | 75
Guest is not ill | 25 | 20 | 45
Total | 90 | 30 | 120

a) P(guest ate chicken) = $\frac{90}{120} = \frac{3}{4} = 0.75$

b) If the public health investigator is only interested in guests that are ill (which is more likely to be the case), and thus only selects a guest to interview from those that are ill, then the probability that the guest ate chicken will be different.

As can be seen from the summary table, P(guest ate chicken given guest is ill) = $\frac{65}{75} = \frac{13}{15} = 0.867$.

Another way of looking at this is that P(guest ate chicken and guest is ill) = $\frac{65}{120}$ and P(guest is ill) = $\frac{75}{120}$, so the probability that the guest ate chicken given that the guest is ill is $\frac{65/120}{75/120} = \frac{65}{75} = \frac{13}{15} = 0.867$.
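Both routes to the conditional probability can be checked directly from the counts in the summary table. The sketch below is an illustration; the dictionary layout and variable names are assumptions.

```python
from fractions import Fraction

# Counts from the summary table, keyed by (health status, meal).
counts = {
    ("ill", "chicken"): 65,
    ("ill", "no chicken"): 10,
    ("not ill", "chicken"): 25,
    ("not ill", "no chicken"): 20,
}

total = sum(counts.values())
n_ill = sum(v for (status, _), v in counts.items() if status == "ill")
n_chicken = sum(v for (_, meal), v in counts.items() if meal == "chicken")

# a) Any guest chosen at random.
p_chicken = Fraction(n_chicken, total)

# b) Route 1: restrict attention to the guests who are ill.
p_chicken_given_ill = Fraction(counts[("ill", "chicken")], n_ill)

# b) Route 2: P(chicken and ill) divided by P(ill).
p_ratio = Fraction(counts[("ill", "chicken")], total) / Fraction(n_ill, total)

print(p_chicken, p_chicken_given_ill, p_ratio)
```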

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

2. A common real-life application of conditional probability is in diagnostic testing. Consider the example of random drug testing of athletes before a big event. The issue with this type of testing is that the test may not be 100% accurate and thus athletes may be incorrectly accused of drug taking while athletes who do take drugs may be missed.

Assume it is known that 36% of all athletes take drugs (this would be known from estimates in research or monitoring studies), the probability that the test can correctly identify a drug user is 98% (known as the sensitivity of the test) and the probability that the test can correctly identify a non-user is 92% (the specificity of the test).

What is the probability that a randomly selected athlete tests positive for drugs?

Summarising the information from the question:

• P(athlete is drug user) = 0.36;

• P(athlete is not drug user) = 0.64;

• P(athlete has positive test given athlete is drug user) = 0.98;

• P(athlete has control (negative) test given athlete is drug user) = 0.02;

• P(athlete has positive test given athlete is not drug user) = 0.08;

• P(athlete has control (negative) test given athlete is not drug user) = 0.92.

These probabilities are ideally illustrated using a tree diagram where D = drug user, N = not drug user, P = positive test result and C = control (negative) test result.


Figure 1.2: Tree diagram for testing of athletes for drug use

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

From Figure 1.2 and the associated probabilities, it is possible to calculate the probability that a randomly selected athlete tests positive. This event is made up of true positives (drug users who test positive, DP in the diagram) and false positives (non-users who test positive, NP in the diagram). Multiplying along each branch gives P(DP) = 0.36 × 0.98 = 0.353 and P(NP) = 0.64 × 0.08 = 0.051.

P(athlete has positive test) = 0.353 + 0.051 = 0.404

Thus, any athlete being tested for drugs has around a 40% chance of testing positive.
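The calculation for the drug-testing tree can be sketched as follows, using the figures given in the example. The variable names are assumptions made for the illustration.

```python
p_user = 0.36       # P(athlete is drug user)
sensitivity = 0.98  # P(positive test given drug user)
specificity = 0.92  # P(negative test given not drug user)

# Multiply along each branch that ends in a positive test.
p_true_positive = p_user * sensitivity               # branch DP
p_false_positive = (1 - p_user) * (1 - specificity)  # branch NP

# Add the two branches to get the overall chance of a positive test.
p_positive = p_true_positive + p_false_positive

print(round(p_true_positive, 3), round(p_false_positive, 3), round(p_positive, 3))
```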


Carrying out basic calculations involving combination of events: Quiz

A student is examined on two tests; one in January and one in June. From previous results it can be seen that students who fail the first test are more likely to fail the second test than those who do not fail the first test. In particular, it is known that the probability of passing the first test is 0.6, the probability of passing the second test given that they passed the first test is 0.9 and that the probability of passing the second test given that they failed the first test is 0.1.

Q7: Construct a tree diagram for this scenario.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Q8: If a student must pass both tests to pass the class, calculate the overall pass-rate.

In a particular city, airport A handles 50% of all flights and airports B and C handle 30% and 20% respectively. The detection probabilities for weapons are 0.9, 0.5 and 0.4 for airports A, B and C respectively.

Q9: Construct a tree diagram for this scenario.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Q10: If a weapon is detected, what is the probability this is at:

a) airport A?

b) airport B?

c) airport C?

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Q11: In a standard deck, there are 52 cards. Twelve cards are face cards (F) and 40 cards are not face cards (N).

If two cards are drawn one at a time without replacement, draw and label a tree diagram showing the probabilities of all possible outcomes.

Top tip

The following mathematical notation is not required for this course but is useful shorthand.

Formally, the conditional probability of A given B is calculated as

$P(A|B) = \frac{P(A \cap B)}{P(B)}$.

Rearranging this allows $P(A \cap B)$ to be calculated as $P(A \cap B) = P(A|B)P(B)$.
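The rearranged form can be checked on the card-drawing situation mentioned at the start of this section. The sketch below takes B as "the first card is an ace" and A as "the second card is an ace" when two cards are drawn without replacement; this choice of events, and the variable names, are assumptions made for the illustration.

```python
from fractions import Fraction
from itertools import permutations

# Multiplication rule: P(A and B) = P(A|B) * P(B).
p_B = Fraction(4, 52)          # first card is an ace
p_A_given_B = Fraction(3, 51)  # second is an ace, given one ace has already gone
p_both_rule = p_A_given_B * p_B

# Check by listing every ordered pair of two different cards.
deck = ["ace"] * 4 + ["other"] * 48
pairs = list(permutations(range(52), 2))
both_aces = sum(1 for i, j in pairs if deck[i] == "ace" and deck[j] == "ace")
p_both_enumerated = Fraction(both_aces, len(pairs))

print(p_both_rule, p_both_enumerated)
```

The two results agree because each ordered pair of cards is equally likely, so counting pairs is just the "successes over outcomes" method applied to the two-card trial.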


1.6 Manipulating probabilities

When probabilities are associated with particular events, it is often useful to know the probabilities of other events from the sample space. Some basic rules of probabilities allow them to be manipulated.

1.6.1 Intersection and union of events

If there are two events from a sample space then the intersection of events is the set of outcomes that appear in both events.

This can be represented on a Venn diagram where the intersection is shaded as follows.

If two events have no outcomes in common then they are said to be mutually exclusive. This means they cannot occur at the same time.

For mutually exclusive events, there is no intersection on a Venn diagram.

If there are two events from a sample space then the union of events is the set of outcomes that appear in one or other or both events.

This can be represented on a Venn diagram where the union is shaded as follows.

If this union represents the full sample space then it is said to be exhaustive.

Example

When three coins are tossed and the sequence of events recorded, the sample space is HHH, THH, HTH, HHT, TTH, HTT, THT, TTT.

Define events:

• at least one tails is thrown (THH, HTH, HHT, TTH, HTT, THT, TTT);

• all tails are thrown (TTT);

• two heads are thrown (THH, HTH, HHT).

What is the probability of each of the following?

a) The intersection of at least one tails is thrown and all tails are thrown.

b) The intersection of at least one tails is thrown and two heads are thrown.

c) The intersection of all tails are thrown and two heads are thrown.

d) The union of at least one tails is thrown and all tails are thrown.

e) The union of at least one tails is thrown and two heads are thrown.

f) The union of all tails are thrown and two heads are thrown.

Given that the sample space contains eight outcomes and using the method for equally likely events as before:

a) the intersection of at least one tails is thrown and all tails are thrown is TTT so the probability is 1/8;

b) the intersection of at least one tails is thrown and two heads are thrown is THH, HTH, HHT so the probability is 3/8;

c) the intersection of all tails are thrown and two heads are thrown is empty, so these events are mutually exclusive and the probability is 0;

d) the union of at least one tails is thrown and all tails are thrown is THH, HTH, HHT, TTH, HTT, THT, TTT so the probability is 7/8;

e) the union of at least one tails is thrown and two heads are thrown is THH, HTH, HHT, TTH, HTT, THT, TTT so the probability is 7/8;

f) the union of all tails are thrown and two heads are thrown is HTH, HHT, THH, TTT so the probability is 1/2.
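The six answers can be checked by listing the sample space and counting outcomes. The following Python sketch is one way to do this; the variable names and the `prob` helper are assumptions for illustration, and Python's set operators `&` and `|` stand in for intersection and union.

```python
from fractions import Fraction
from itertools import product

# All 8 equally likely sequences from tossing three coins, e.g. "HTH".
sample_space = {"".join(seq) for seq in product("HT", repeat=3)}

at_least_one_tail = {s for s in sample_space if "T" in s}
all_tails = {s for s in sample_space if s == "TTT"}
two_heads = {s for s in sample_space if s.count("H") == 2}


def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(sample_space))


answers = {
    "a": prob(at_least_one_tail & all_tails),  # intersection
    "b": prob(at_least_one_tail & two_heads),
    "c": prob(all_tails & two_heads),          # empty set: mutually exclusive
    "d": prob(at_least_one_tail | all_tails),  # union
    "e": prob(at_least_one_tail | two_heads),
    "f": prob(all_tails | two_heads),
}
for part, p in answers.items():
    print(part, p)
```

Running this prints 1/8, 3/8, 0, 7/8, 7/8 and 1/2 for parts a) to f).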

Top tip

The following mathematical notation is not required for this course but is useful shorthand.

The number of outcomes in A and B (the intersection) is written as N(A ∩ B).

For mutually exclusive events, N(A ∩ B) = 0.

The number of outcomes in A or B (the union) is written as N(A ∪ B).

Rule 1: The addition rule for mutually exclusive events

If A and B are mutually exclusive events from a discrete sample space S then

P (A ∪ B) = P (A) + P (B).

Extending this to k mutually exclusive events, E1 . . . Ek

P (E1 ∪ E2 ∪ . . . ∪ Ek) = P (E1) + P (E2) + . . . + P (Ek).

Rule 2: The general addition rule

More generally, if A and B are not mutually exclusive

P (A ∪ B) = P (A) + P (B) − P (A ∩ B).

Rule 3: Total probability

If we have a set of mutually exclusive and exhaustive events E1 . . . Ek then

P (E1) + P (E2) + . . . + P (Ek) = 1

which is the law of total probability.
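The three rules can be checked numerically on a small sample space. The sketch below assumes one roll of a fair die; the events (`low`, `high`, `even`, `above_3`) are made up for illustration and do not come from the examples above.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}  # assumed example: one roll of a fair die


def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(sample_space))


# Rule 1: mutually exclusive events have no outcomes in common.
low, high = {1, 2}, {5, 6}
rule1_holds = prob(low | high) == prob(low) + prob(high)

# Rule 2: overlapping events, so the intersection is subtracted once.
even, above_3 = {2, 4, 6}, {4, 5, 6}
rule2_holds = (
    prob(even | above_3) == prob(even) + prob(above_3) - prob(even & above_3)
)

# Rule 3: the six single outcomes are mutually exclusive and exhaustive.
total = sum(prob({face}) for face in sample_space)

print(rule1_holds, rule2_holds, total)  # True True 1
```

In Rule 2 the outcomes 4 and 6 belong to both events, so adding P(A) and P(B) counts them twice; subtracting P(A ∩ B) corrects this.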

1.6.2 Complementary events

For an event from a sample space, the complement is the set of all outcomes that are not in the event. This means that the event and its complement are mutually exclusive.

The complement of A is shown by the shaded area in the following Venn diagram.

Example

A dice is rolled and the number it lands on is recorded. Define events:

• the number rolled is 6;

• the number rolled is higher than 3.

What is the complement of each of these events?

• The only outcome in the first event is 6, so the complement is 1-5.

• The outcomes in the second event are 4-6, so the complement is 1-3.

Top tip

The following mathematical notation is not required for this course but is useful shorthand.

Given that A and its complement, A′, are mutually exclusive and together make up the whole sample space, P (A ∪ A′) = P (A) + P (A′) = 1.

Rule 4: Complementary rule

P (A′) = 1 − P (A)
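As a short sketch, the two dice events from the example can be used to check the complementary rule. The helper names `prob` and `complement` are assumptions for illustration.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}  # faces of a fair die


def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(sample_space))


def complement(event):
    """All outcomes in the sample space that are not in the event."""
    return sample_space - event


six = {6}
higher_than_3 = {4, 5, 6}

for event in (six, higher_than_3):
    not_event = complement(event)
    # The last two values printed on each line should agree: P(A') = 1 - P(A).
    print(sorted(not_event), prob(not_event), 1 - prob(event))
```

The rule is most useful when the complement is easier to count than the event itself.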

1.6.3 Independence

In the 'carrying out basic calculations involving combination of events' examples, the events were inter-related. In the first example, the chance of a guest being ill depended on whether or not they ate chicken; in the second, the chance that an athlete had a positive drugs test result depended on whether or not they were a drug user.

In real-life situations it is also possible to have trials with associated events that are not related. This is known as independence. Examples of independence include the following.

• A telesales employee for a mobile phone company calls three homes to try and sell the company's latest offer. Whether or not they make a sale on the second call is likely to be independent of whether or not they make a sale with the first call.

• On a busy motorway during rush hour, the chance of a crash occurring on one particular morning is unlikely to depend on whether or not a crash occurred the previous morning.

Example

Assume that the drug test in the earlier example is not working properly. Previously, it was known that 36% of all athletes took drugs, the probability that the test could correctly identify a drug user was 98% and the probability that the test could correctly identify a non-user was 92%. Now the test gives a positive result 98% of the time irrespective of whether or not the athlete is a drug user.

What is the probability that an athlete has a positive test result?

Given that the test is no longer working properly and gives a positive result 98% of the time, the probability that an athlete has a positive test result is now 0.98.

To check this, we can summarise the updated information from the question:

• P(athlete is drug user) = 0.36;

• P(athlete is not drug user) = 0.64;

• P(athlete has positive test given athlete is drug user) = 0.98;

• P(athlete has negative test given athlete is drug user) = 0.02;

• P(athlete has positive test given athlete is not drug user) = 0.98;

• P(athlete has negative test given athlete is not drug user) = 0.02.

From the associated probabilities, it is possible to calculate the probability that a randomly selected athlete tests positive. This event is made up of true positives (i.e. drug users who test positive) and false positives (i.e. non-drug users who test positive).

P(athlete is drug user and athlete has positive test) = 0.36 × 0.98 = 0.353

P(athlete is not drug user and athlete has positive test) = 0.64 × 0.98 = 0.627

P(athlete has positive test) = 0.353 + 0.627 = 0.98

Thus, any athlete being tested for drugs has a 98% chance of testing positive.

Note: once again, it is worth stating that the probabilities associated with each trial total to 1.
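The calculation can be sketched in Python as follows. The function name `p_positive` is an assumption for illustration. For comparison, the sketch also evaluates the working test, taking its false positive rate to be 1 − 0.92 = 0.08 from the figures quoted in the question.

```python
p_user = 0.36
p_non_user = 1 - p_user


def p_positive(p_pos_given_user, p_pos_given_non_user):
    """P(positive test) = true positives + false positives."""
    return p_user * p_pos_given_user + p_non_user * p_pos_given_non_user


# Working test: 98% of users and 8% of non-users test positive.
working = p_positive(0.98, 1 - 0.92)

# Faulty test: 98% of athletes test positive whether or not they are users.
faulty = p_positive(0.98, 0.98)

print(round(working, 3))  # 0.404
print(round(faulty, 3))   # 0.98
```

For the faulty test, the overall probability of a positive result equals the probability of a positive result given that the athlete is a drug user. Knowing whether the athlete is a user therefore tells us nothing about the test result, which is what independence means.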

Top tip

The following mathematical notation is not required for this course but is useful shorthand.

Rule 5: Multiplication rule for independence

If events A and B are independent then P (A|B) = P (A) so

P (A ∩ B) = P (A)P (B).
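As a rough check of Rule 5, the sketch below returns to the three-coin sample space. The events chosen (heads on the first coin, heads on the second coin) are assumptions for illustration. A pair of events that are clearly related is included for contrast.

```python
from fractions import Fraction
from itertools import product

# All 8 equally likely sequences from tossing three coins.
sample_space = {"".join(seq) for seq in product("HT", repeat=3)}


def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(sample_space))


# Two events that should not influence each other.
first_heads = {s for s in sample_space if s[0] == "H"}
second_heads = {s for s in sample_space if s[1] == "H"}
independent = (
    prob(first_heads & second_heads) == prob(first_heads) * prob(second_heads)
)

# Two related events: "all tails" can only happen if "at least one tail" does.
at_least_one_tail = {s for s in sample_space if "T" in s}
all_tails = {"TTT"}
related = (
    prob(at_least_one_tail & all_tails)
    == prob(at_least_one_tail) * prob(all_tails)
)

print(independent, related)  # True False
```

The multiplication rule holds for the first pair (1/4 = 1/2 × 1/2) but fails for the second (1/8 is not 7/8 × 1/8), so only the first pair is independent.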

1.7 Summary

Summary

• A trial is a process or experiment that yields information. If the trial is repeated several times then it gives a set of observations.

• An outcome is the result of carrying out a trial.

• An event is a set that consists of one or more possible outcomes of a trial and is therefore a subset of the sample space.

• The sample space is the list of all possible outcomes of the trial.

• A tree diagram is a good visual aid in computing probabilities.

• To derive the probabilities associated with each outcome in a tree diagram, the probabilities associated with each branch of the outcome are multiplied together.

• To derive the probability of any event within the sample space, the probabilities associated with each of the outcomes in that event are added together.

• A Venn diagram is a good way to sort groups of data and visually represent probabilities.

• The rectangle in a Venn diagram represents all of the values that need to be considered.

• Each circle in a Venn diagram represents a set.

• If the circles in a Venn diagram overlap, this shows the intersection of two or more sets.

• Conditional probability involves events that are inter-related so the outcome of one event is affected by the result of another event.

1.8 End of topic test

Go online: Basic probability topic test

Q12: An experiment with a jar containing three blue marbles and ten green marbles is set up. Two marbles are randomly taken from the jar one at a time and are not replaced.

Calculate the probability of randomly choosing:

a) two green marbles;

b) one green and one blue marble.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Q13: A football team plays two matches. A win is worth 3 points, a draw 1 point and a loss 0 points. Results from last season showed that the probability the team wins is 0.1 and the probability they draw is 0.6.

Calculate the probability that the team will gain at least 3 points over the two matches.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Q14: Use the Venn diagram to answer the following questions.

a) List the members of set S.

b) List the members of the intersection of set S and set T.

c) List the members that are in neither set S nor set T.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Q15: In a survey of university students:

• 92 had taken maths;

• 54 had taken chemistry;

• 48 had taken physics;

• 26 had taken maths and physics;

• 28 had taken maths and chemistry;

• 18 had taken chemistry and physics;

• 12 had taken all three courses.

Draw a Venn diagram and answer the following questions.

a) How many students are there altogether?

b) How many students took only one course?

c) What is the probability that a student chosen at random is studying all three subjects?

Unit 2 Topic 2

Interpreting data

Contents

2.1 Introduction
2.2 Basic terminology
2.3 Types of data
2.3.1 Categorical data
2.3.2 Numerical data
2.4 Graphical displays
2.4.1 Pie chart
2.4.2 Bar chart
2.4.3 Scatter plot
2.4.4 Histogram
2.4.5 Graphics commands
2.5 Numerical summaries
2.5.1 Measures of location
2.5.2 Measures of spread
2.6 Choice of descriptive statistics
2.7 Computing descriptive statistics
2.7.1 Notes on quartiles
2.7.2 Box plots
2.8 The normal distribution
2.9 Exercises
2.10 Summary
2.11 Resources
2.12 End of topic test

Learning objective

By the end of this topic, you should be able to:

• describe different types of data;

• understand the difference between populations and samples;

• explain and understand the importance of data outliers;

• construct and interpret statistical diagrams, for example:

◦ pie charts;

◦ bar charts;

◦ scatter plots;

◦ histograms;

◦ box plots.

• interpret the distribution of data, with particular reference to symmetry, normality, and skewness;

• derive, understand, and interpret sample measures of location and spread, including mean and standard deviation, and median and interquartile range.

2.1 Introduction

In the course of a research study, vast amounts of data may be gathered. It is necessary to summarise this data in a useful way for reporting purposes. Data can be summarised either graphically (e.g. in the form of a pie chart), numerically (e.g. by computing a mean value) or both.

2.2 Basic terminology

Suppose a researcher in the UK is developing a new drug for the treatment of HIV. In statistical terms, the population relevant for consideration in such a study would be everybody in the UK who is HIV+. In order to determine the effectiveness of this new treatment, a study would have to be conducted where members of the population are given the new drug and the effects recorded.

For each subject in this study, various parameters of interest would be recorded, e.g. their age, gender, T-cell count etc. The values of these parameters are known as data. The data are used in the analysis to determine the effectiveness of the new treatment and also to see if there is an unexpected effect (e.g. it may be that certain drugs work better in younger people or in males rather than females).

To prove that this new drug compound is 100% effective, it would be necessary to give it to all HIV+ patients in the UK. Obviously this would be impractical in terms of time, money, staff, resources etc. To overcome this problem a sample is used. A sample is chosen from the population in such a way that it is representative of the population as a whole (e.g. the age range, gender ratio etc. should be the same in the sample as in the population). This is known as random sampling. The idea of sampling is illustrated in Figure 2.1. It is very important that a sample is chosen randomly, otherwise the observed effect of the new compound in the sample may not be seen when it is marketed and made available to the population as a whole.

Figure 2.1: Relationship between a population and sample
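Choosing a sample at random can be sketched in a few lines of Python. Everything here is hypothetical: the population is simply a list of 1,000 made-up patient ID numbers, and the sample size of 50 and the seed are arbitrary choices for illustration.

```python
import random

# Hypothetical population: ID numbers for 1,000 patients (illustrative only).
population = list(range(1, 1001))

# A fixed seed makes the sketch repeatable; a real study would not fix it.
rng = random.Random(2023)

# Draw 50 different patients, each equally likely to be chosen.
sample = rng.sample(population, k=50)

print(len(sample), len(set(sample)))  # 50 50
```

Because every member of the population has the same chance of selection, no group is systematically favoured. This is what makes a random sample likely to be representative of the population.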

2.3 Types of data

There are two main types of data: categorical data and numerical data. From a statistical perspective it is important to be able to identify the data type in order to choose the most appropriate method of reporting and the correct statistical test to apply.

2.3.1 Categorical data

Categorical data can be classified into a number of specific categories. Sometimes categorical data is referred to as qualitative data. There are two subgroups of categorical data: nominal and ordinal.

Nominal categorical data is data that can be classified into a number of specific categories with no particular ordering. For example, if gender is recorded for each participant in a study then this would be nominal categorical data. The data could be coded as 1 for male and 2 for female. Alternatively, 1 could be used for female and 2 for male. The numbers simply identify the categories and the order is not important.

Ordinal categorical data can be classified into a number of specific categories but the ordering is important. For example, suppose the amount of pain that a patient suffers after an operation is recorded as none, mild, moderate, severe or unbearable. These could be coded 1, 2, 3, 4 and 5 respectively. However, there is an ordering here: it is obviously of more concern if a patient has severe pain rather than mild pain. The higher numbers therefore represent something different from the lower numbers in medical terms.

2.3.2 Numerical data

Numerical data is data that is recorded as numerical values. Numerical data is sometimes referred to as quantitative data. Again there are two subgroups for numerical data: discrete and continuous.

Discrete numerical data is recorded as a whole number. For example, if the number of times a person cycles to work each week is recorded, this is numerical data (since it is an actual numerical value). It is also discrete since it can only take whole number values: a person would not be reported as cycling to work 3.576 times each week. The range of possible values is also limited in some sense, since there would be an upper bound or limit to the number of times the person would cycle to work in a week.

Continuous numerical data can take decimal values and may be measured precisely by a machine, e.g. a weight of 65.5 kg.

Occasionally there may be some overlap within the categories and it can be difficult to discern whether data is discrete or continuous numerical data. However, it will always be possible to classify data as categorical or numerical and this distinction is sufficient for most simple statistical analysis.

Go online: Types of data quiz

Identify each of the following variables as quantitative or qualitative.

Q1: Bacterial counts

a) Qualitative
b) Quantitative

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Q2: Average monthly rainfall

a) Qualitative
b) Quantitative

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Q3: Temperature

a) Qualitative
b) Quantitative

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Q4: Diagnosis

a) Qualitative
b) Quantitative

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Q5: Voting preference

a) Qualitative
b) Quantitative

Identify each of the following categorical variables as nominal or ordinal.

Q6: Make of car

a) Nominal
b) Ordinal

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Q7: Deprivation scores

a) Nominal
b) Ordinal

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Q8: Blood type

a) Nominal
b) Ordinal

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Q9: Species of bird

a) Nominal
b) Ordinal

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Q10: Level of happiness rated from 1-10

a) Nominal
b) Ordinal

Identify each of the following quantitative variables as discrete or continuous.

Q11: Time to run 100 m

a) Continuous
b) Discrete

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Q12: Yield in kg of wheat in a field

a) Continuous
b) Discrete

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Q13: Shoe size

a) Continuous
b) Discrete

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Q14: Age in years

a) Continuous
b) Discrete

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Q15: Height of a volcano

a) Continuous
b) Discrete

2.4 Graphical displays

Graphs are useful for detecting patterns or outliers (i.e. values which are unusual compared to the others). They can also help to identify relationships between variables and can be used to determine the distribution of quantitative data. However, it should be noted that graphical methods lead to subjective interpretation of the data: not all researchers will interpret a graph in the same way. Interpretation from graphs alone should therefore be done with caution.

Depending on the type of data being summarised (i.e. categorical or numerical), different graphs are appropriate for display purposes.

2.4.1 Pie chart

A pie chart is useful for displaying categorical data. Figure 2.2 shows a pie chart of data gathered from a study to investigate the sources of global warming.

Figure 2.2: Pie chart of global warming emissions by economic sector

From this it is easy to see that over 25% of the global warming emissions come from energy supplies.

2.4.2 Bar chart

Another graphical display for categorical data is called a bar chart. Bar charts are useful for displaying ordinal data. The bar chart in Figure 2.3 shows population data for several countries. This is a clustered bar chart: the bar for each country is split by year (1996 or 2007) to allow a comparison of population over time within each country.

Figure 2.3: Populations of major European countries in the years 1996 and 2007

2.4.3 Scatter plot

The relationship between two numerical variables can be assessed using a scatter plot. The plot in Figure 2.4 shows the healing time in days (discrete numerical data) for wounds of varying dimensions (continuous numerical data). The graph clearly shows that as the wound dimension increases, the time to heal increases. The relationship between these two variables is approximately linear.

Figure 2.4: Scatter plot illustrating the relationship between wound dimension and healing time

It should be noted here that one of the data points lies outwith the range of the other data. This patient had a wound dimension of about 62 mm which took around 105 days to heal. A data value that is unusual when compared with the rest of the data is known as an outlier. It is important to identify outliers and attempt to explain the cause. For example, it could be that there was a typing error in transcribing the data into the spreadsheet. In this situation the reason was determined to be that this patient had a surgical site infection that resulted in the wound taking longer to heal than expected.

Linear relationships between variables of this type can be further investigated using methods known as correlation and regression. These techniques will be applied to this data later.

2.4.4 Histogram

One of the most important graphical displays for numerical data is called a histogram. The shape of a histogram illustrates the distribution (i.e. the shape) of data. The histogram in Figure 2.5 shows the age distribution of the subjects from the wound dimension and healing time study. From this it can be seen that subjects ranged from about 15-50 years and the majority of patients were around 30 years old.

Figure 2.5: Histogram illustrating the age distribution of patients

This histogram is bell shaped and data with this characteristic shape is referred to as normal data or data which is normally distributed. This is one of the most important distributions in statistics and will be considered in more detail later.

Other distributional shapes are illustrated in the histograms in Figure 2.6 and Figure 2.7. These are called skewed data. Figure 2.6 is skewed with a tail to the right. If this was an age distribution, then most of the people in the sample would be younger. Conversely, the data in Figure 2.7 is skewed with a tail to the left and if this was an age distribution, most of the sample would be older, with just a few younger people included.

Figure 2.6: Positively skewed

Figure 2.7: Negatively skewed

It is important to be able to determine whether the distribution of numerical data is normal or skewed in order to:

• report the correct summary statistics for the data, and

• analyse the data using the correct statistical test.

Visual interpretation of a histogram is usually sufficient to determine if a set of data is approximately normally distributed or not. An objective way of determining the normality of a set of data is explained later on.

2.4.5 Graphics commands

Graphics commands in Minitab

The Graph menu in Minitab is used to produce graphical displays of data. Each graph style offers options to edit the layout, formatting, labels and scales. Alternatively, right clicking on a graph gives additional options to edit colours and labels. Graphs can be saved in various formats using the menu option File > Save Graph As... Note that the default is to save the graph as a Minitab graph format (*.mgf) but these graphs will not display on a computer which does not have Minitab installed.

Graphics commands in R Studio

Pie charts can be constructed using the pie command. Use help(pie) for a description of the various options, which include section labels and a graph title. Bar charts, scatter plots and histograms can be constructed in a similar way using the barplot, plot and hist commands respectively. Using the help function gives information on the various options for each.

Figure 2.8: Scatter plot produced in R Studio

For example, the scatter plot of healing times for different wound dimensions shown in Figure 2.8 was produced by reading the data into R and then plotting it.

Graphics commands in Excel

The graphics options in Excel are contained in the Insert tab at the top of the spreadsheet. There are options there to do pie charts, bar charts and scatter plots. Excel is less user friendly at producing histograms and some manipulation is needed. It is necessary to install the Data Analysis option to Excel and histograms can be constructed using the Data tab then selecting Data Analysis and choosing Histogram from the Analysis Tools menu.

Figure 2.9 shows the scatter plot produced using the Chart option in Excel.

Figure 2.9: Scatter plot produced in Excel

2.5 Numerical summaries

It is impossible to interpret vast amounts of data simply by looking at it. In addition to graphical displays, numerical summaries are useful for condensing the information gathered into a format that is meaningful to report.

2.5.1 Measures of location

The obvious way to summarise a list of numbers is to calculate and quote some average, or mean, value. The mean value is computed by adding all the data values together and dividing by the total number of data points in the sample. For example, consider the following list of numbers:

6, 8, 3, 6, 4, 7, 5, 9, 2, 4

The mean value is calculated in the following way.

$$\bar{x} = \frac{6 + 8 + 3 + 6 + 4 + 7 + 5 + 9 + 2 + 4}{10} = \frac{54}{10} = 5.4$$
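The same calculation can be sketched in Python, both by hand and with the `mean` function from the standard `statistics` module. The variable names are assumptions for illustration.

```python
from statistics import mean

data = [6, 8, 3, 6, 4, 7, 5, 9, 2, 4]

# By hand: add the values together and divide by how many there are.
by_hand = sum(data) / len(data)

print(by_hand, mean(data))  # 5.4 5.4
```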

The mean of a sample is usually denoted as x̄ (read as 'x bar'). For example, a scientific paper may state that "x̄ = 9 days", which simply states that the mean, or average, was 9 days.

Quoting the mean as a numerical summary for a set of data is a measure of location for the data. The idea of location can be thought of as an indication of where most of the data points would be expected to lie, or what a 'typical' or 'normal' value would be. There are two other measures of location that are often seen in scientific literature: the median and mode.

The median of a set of numbers is simply the 'middle' number when the data are arranged in ascending order of magnitude. If the number of data points is odd then this is the middle number. If the number of data points is even, the median is calculated as the average of the 'middle two' numbers.

Examples

1. What is the median of the numbers 15, 17, 17, 19, 24, 36 and 42?

There are an odd number of numbers and the middle, hence median, value is 19.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

2. What is the median of the numbers 1, 2, 5, 7, 7, and 8?

There are an even number of numbers with 5 and 7 in the middle. The median is therefore (5 + 7)/2 = 6.

The median value is often denoted Q2 in computer output.

The mode is the value that occurs most often in a set of data. In the first example the mode is 17 and in the second the mode is 7.
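As a quick check of both examples, the sketch below uses the `median` and `mode` functions from Python's standard `statistics` module. The list names `first` and `second` are assumptions for illustration.

```python
from statistics import median, mode

first = [15, 17, 17, 19, 24, 36, 42]   # odd number of values
second = [1, 2, 5, 7, 7, 8]            # even number of values

print(median(first), mode(first))      # 19 17
print(median(second), mode(second))    # 6.0 7
```

The `median` function sorts the data itself, so the values do not need to be entered in ascending order.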

2.5.2 Measures of spread

Information about the range of data is lost when a set of data is summarised using a measure of location. For example, if the average age of a group of people is 30 years, without the actual data being available it is not possible to know the ages of the oldest or youngest in the sample. The youngest and oldest could be 28 and 34 or 2 and 99 respectively. In order to account for this when summarising data, a measure of how spread the data is should be included along with the measure of location.

One way of expressing how spread out the data values are is to quote a measure of the average distance of all the points from the mean. The set of numbers

6, 8, 3, 6, 4, 7, 5, 9, 2, 4

have a mean value of 5.4. Subtracting this mean value from each of the numbers in the list will give an idea of the 'distance' from each point to the mean. The set of values obtained by subtracting the mean from each data point are

0.6, 2.6, –2.4, 0.6, –1.4, 1.6, –0.4, 3.6, –3.4, –1.4

This indicates, for example, that the first data point is 0.6 above the mean, the second data point is 2.6 above the mean, the third data point is 2.4 below the mean, etc.

A measure of the 'average' of these distances from the mean will give an indication of how spreadout the data points are. However, calculating the average in the conventional way leads to thenumbers cancelling each out giving a total of zero (try it!). A mathematical trick of removing thenegative signs to avoid these numbers cancelling each other out is to square them. Squaring anumber means multiplying it by itself. For example, the square of 3 is 3 × 3 = 9. Similarly, thesquare of –3 is −3 × − 3 = 9. Squaring each of the numbers in the list gives

0.36, 6.76, 5.76, 0.36, 1.96, 2.56, 0.16, 12.96, 11.56, 1.96

Adding together these squared differences from the mean gives a total of 44.4. A quantity called the variance of the data is computed by dividing this total of squared differences by one less than the number of data points. Since there are 10 data points in this sample, the variance of the data is

44.4 / 9 = 4.93.

Suppose the data represented by the original set of numbers were the number of kilograms that 10 participants lost on a weight loss program. The mean weight loss would then be 5.4 kg. However, in squaring the numbers for the calculation of the variance, the units of measurement would be square kilograms (i.e. variance = 4.93 kg²). This is obviously a unit of measurement which makes no sense. Taking the square root of the variance will convert the units back into kilograms. The square root of the variance is called the standard deviation.

standard deviation = √variance

The standard deviation is commonly quoted as a measure of the spread of a set of data. The interpretation of a standard deviation is considered further later on. In this example, the standard deviation is

√4.93 = 2.22 kg.
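The steps of this calculation can be retraced in R. The following is a minimal sketch using the ten weight loss values from the text; the variable names (x, m, d, ss, v, s) are chosen here purely for illustration.

```r
# Weight loss (kg) for the 10 participants in the worked example
x <- c(6, 8, 3, 6, 4, 7, 5, 9, 2, 4)

m  <- sum(x) / length(x)       # mean
d  <- x - m                    # signed distance of each point from the mean
ss <- sum(d^2)                 # total of the squared differences
v  <- ss / (length(x) - 1)     # variance: divide by one less than the number of points
s  <- sqrt(v)                  # standard deviation, back in kilograms

# Manual results, rounded as in the text
round(c(mean = m, variance = v, sd = s), 2)

# The built-in functions should give the same variance and standard deviation
c(var(x), sd(x))
```

The differences in d sum to zero, which is why they are squared before being averaged.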

Other measures of spread that are commonly computed are the range and interquartile range. The range is probably the simplest and most intuitive way of expressing the spread of a data set. It is computed by subtracting the minimum value in the data set from the maximum value. Here the range is 9 − 2 = 7 kg.

The problem with using the range to express the spread of data is that it is clearly going to be heavily influenced by the presence of outliers. Recall that an outlier is a value which is unusual compared to the other data points. Typically, an outlier in a data set will be a point which is much greater or much smaller than the other data values.

A way of compensating for the presence of outliers in a set of data is to compute the inter-quartile range as a measure of spread. The inter-quartile range (IQR) is calculated by placing the data in ascending order and splitting it into quarters. Recall that Q2 is the median of the data set. Q1 can be thought of as the 'median' of the lower half of the data set and Q3 as the 'median' of the upper half. The inter-quartile range is computed as

IQR = Q3 − Q1

The IQR spans the middle 50% of the data.
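As an illustration, the range and the 'median of each half' description of the quartiles can be applied to the weight loss data in R. This is a sketch of that one rule only (when the number of points is odd, the version below leaves the middle value out of both halves, which is an assumption made here); software packages may use other rules and give slightly different quartiles.

```r
x  <- c(6, 8, 3, 6, 4, 7, 5, 9, 2, 4)   # weight loss data (kg)
xs <- sort(x)                            # place the data in ascending order
n  <- length(xs)

data_range <- max(xs) - min(xs)          # range = maximum - minimum

half  <- n %/% 2                         # number of points in each half
lower <- xs[1:half]                      # lower half of the ordered data
upper <- xs[(n - half + 1):n]            # upper half of the ordered data

q1  <- median(lower)                     # 'median' of the lower half
q3  <- median(upper)                     # 'median' of the upper half
iqr <- q3 - q1                           # IQR = Q3 - Q1

c(range = data_range, Q1 = q1, Q3 = q3, IQR = iqr)
```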


2.6 Choice of descriptive statistics

When a set of data is summarised, a measure of location and a measure of spread should be quoted in order to ensure sufficient information is reported. The choice of which measure of location and spread to report for a set of data depends on the distribution of the data.

The following hypothetical data refer to the length of time spent in hospital (in days) for 7 patients after a particular operation

2, 2, 3, 2, 15, 1, 3

This type of information may have been gathered in order to:

• inform patients who are undergoing this particular operation how long they can expect to spend in hospital, or;

• inform the NHS finance board how long patients stay in hospital after this operation to facilitate budget plans.

To address either of those questions, the information has to be summarised and reported with a measure of location and a measure of spread.

By simply looking at the data it would seem reasonable that a patient who undergoes this operation should be out of hospital in 2 or 3 days at the most. There is one patient who is in for 15 days which is unusual compared to the other patients (an outlier). Using a statistical software package we can compute the possible measures of location and spread for this data. The following output is from Minitab.

Descriptive Statistics: Days in Hospital

Statistics

Variable            N   Mean   StDev   Minimum   Q1   Median   Q3   Maximum
Days in Hospital    7      4     4.9         1    2        2    3        15

The decision has to be made as to which are more appropriate measures of location (mean or median) and spread (standard deviation or interquartile range). This can be decided based on the distribution of the data.

If the value of the outlier was, say, 245 rather than 15, the median value would still be 2 days. Therefore, the median value is unaffected by extreme points. However, the mean value would be affected: it would increase to 36.9 days (check this!). The mean value is therefore affected by outliers.

Extreme points (outliers) are also omitted from the calculation of the interquartile range. Increasing the value of the outlier in the data set from 15 to 245 would have no effect on the IQR. However, the standard deviation is a measure of the variability of the data around the mean. An extreme value of 245 would be included in the calculation of the standard deviation and would in fact increase it to 91.8 days (verify this).
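The 'check this' and 'verify this' calculations can be carried out in R. The sketch below compares the original data with a copy in which the outlier is 245; the names days, days_big and summarise are illustrative choices, and IQR() here returns the width Q3 − Q1 (1 day) rather than the interval from 2 to 3 days.

```r
days     <- c(2, 2, 3, 2, 15, 1, 3)            # original lengths of stay
days_big <- replace(days, days == 15, 245)     # same data with a more extreme outlier

# Collect the four summary statistics discussed in the text
summarise <- function(x) {
  c(mean = mean(x), median = median(x), sd = sd(x), IQR = IQR(x))
}

res <- round(rbind(original = summarise(days),
                   extreme  = summarise(days_big)), 1)
res
```

Only the mean and standard deviation columns change between the two rows; the median and IQR are the same for both data sets.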

A histogram of this data would reveal that the distribution of length of stay is skewed with a tail to the right (similar to Figure 2.6). For skewed data, the mean and standard deviation are affected by the outliers. When data is normally distributed, however, there are no extreme observations to influence the mean value or inflate the standard deviation. For a normally distributed data set, the mean and median will be fairly similar.


Considering this data in the context of reporting the measures of location and spread, there are two options.

1. The mean is 4 days and the standard deviation is 4.9 days.

2. The median is 2 days and the IQR is from 2 to 3 days.

Option 1 indicates that the average length of time spent in the hospital was 4 days from this sample. However, inspection of the raw data shows that none of the patients was in for as long as 4 days, apart from the outlier at 15 days. For skewed data, the more accurate way to report summary statistics is to give a median and inter-quartile range, i.e. option 2. This in fact agrees with the subjective impression of length of stay from the visual inspection of the raw data.

It is important to accurately and honestly report summary data, and this should be done by determining the distribution of the data and quoting the appropriate summary statistics. Quoting the mean and standard deviation for this set of data in the context of the hypothetical study used for illustration would result in:

• patients being prepared to be in hospital for a longer stay than in reality they should be expecting, or;

• the hospital being able to claim more money from the NHS board for keeping patients in hospital longer than they actually do!

It is important in real-world data analysis to know the distribution of data. Many textbooks will quote data like weight being normally distributed. However, in medical studies, weight data is often from patients who are struggling with their weight and this data is likely to be skewed like Figure 2.7. The choice of summary statistics is also important for formal statistical analysis and this is covered in detail in a later chapter.

Summary

The mean and median give an indication of the 'typical' value of a set of data. This is known as a measure of location. The standard deviation, range and interquartile range all give an indication of how spread out the data are about the average. The median and IQR are unaffected by outliers. If the distribution of a set of data is normal then the mean and standard deviation should be reported as summary statistics. For skewed data, the appropriate measures of location and spread are the median and IQR.


2.7 Computing descriptive statistics

To ensure a statistical report is of high quality and accuracy, computation of descriptive statistics should only ever be done using a statistical software package.

Computing descriptive statistics in Minitab

For numerical data, descriptive statistics can be computed using the menu option Stat > Basic Statistics > Display Descriptive Statistics... To specify only selected summary data, use the Statistics option to specify, e.g. the mean and standard deviation. The following output shows the descriptive statistics for the hypothetical data seen previously (Days in Hospital: 2, 2, 3, 2, 15, 1, 3).

Descriptive Statistics: Days in Hospital

Statistics

Variable            N   Mean   StDev   Minimum   Q1   Median   Q3   Maximum
Days in Hospital    7      4     4.9         1    2        2    3        15

Computing descriptive statistics in R Studio

In R, the function mean(x), where x is a numerical variable, can be used to compute the sample mean. Similarly, median(x) computes the median. A useful command is summary(x), which gives the minimum, maximum, mean, median and upper and lower quartiles of x. The standard deviation is easily computed using sd(x). The R code and output for the computation of descriptive statistics are shown.

> Days <- c(2, 2, 3, 2, 15, 1, 3)
> summary(Days)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      1       2       2       4       3      15

Computing descriptive statistics in Excel

The Analysis ToolPak add-in needs to be installed in Excel to compute the descriptive statistics. On the Data tab, click Data Analysis and then in the Analysis Tools list choose Descriptive Statistics. The relevant data should then be highlighted in the Input Range box and the Summary Statistics option produces a table with various statistics, including the mean, median and standard deviation. These can also be computed individually using the function commands AVERAGE, MEDIAN and STDEV respectively. The functions QUARTILE(x,1) and QUARTILE(x,3) can be used to get the lower and upper quartiles of x. The output in Table 2.1 was generated using the Descriptive Statistics option in the Data Analysis tab for the data.


Table 2.1: Descriptive statistics in Excel

Statistic             Length of stay
Mean                  4
Standard error        1.85164
Median                2
Mode                  2
Standard deviation    4.898979
Sample variance       24
Kurtosis              6.568056
Skewness              2.536243
Range                 14
Minimum               1
Maximum               15
Sum                   28
Count                 7

The quartiles can be computed using the function commands QUARTILE(Days,1) and QUARTILE(Days,3), where Days is the list of seven data points recorded as the length of stay. These are computed as 2 and 3 days respectively.

2.7.1 Notes on quartiles

Note that quartiles are estimated using interpolation, and computer-generated estimates will not always agree since different software packages may use different rules to compute the quartiles. More information on the differing methods can be found at http://dsearls.org/other/CalculatingQuartiles/CalculatingQuartiles.htm.

The estimated values will generally be fairly similar from package to package and any differences will not be of concern for this introductory course in statistics.
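The effect of different rules can be seen within R itself, since quantile() accepts a type argument that selects the interpolation rule. The sketch below applies three rules to the weight loss data from the section on measures of spread; type 7 is R's default, and R's documentation describes type 6 as the rule used by Minitab and SPSS. The object names are illustrative.

```r
x <- c(6, 8, 3, 6, 4, 7, 5, 9, 2, 4)   # weight loss data (kg)

q7     <- unname(quantile(x, c(0.25, 0.75), type = 7))  # R's default rule
q6     <- unname(quantile(x, c(0.25, 0.75), type = 6))  # an alternative interpolation rule
hinges <- fivenum(x)[c(2, 4)]                           # Tukey's hinges (medians of the halves)

quartiles <- rbind(type7 = q7, type6 = q6, hinges = hinges)
colnames(quartiles) <- c("Q1", "Q3")
quartiles
```

For this small data set the three rules give lower quartiles of 4, 3.75 and 4 and upper quartiles of 6.75, 7.25 and 7: similar values, but not identical.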

2.7.2 Box plots

An alternative graphical representation to a histogram to determine the distribution of a set of data is a box plot. This shows a horizontal line at the median of the data and draws a box from the first quartile Q1 to the third quartile Q3. This box therefore represents the middle 50% of the data. A line then extends from Q1 to a lower limit which is Q1 − 1.5(Q3 − Q1) and another from Q3 (i.e. the top of the box) to an upper limit of Q3 + 1.5(Q3 − Q1). Values above or below those limits are generally highlighted as outliers. Figure 2.10 is a box plot of the ages of the participants in the wound dimension study. This is the data which was illustrated as a histogram in Figure 2.5. The vertical symmetry of the box plot indicates that the data is approximately normally distributed.


Figure 2.10: Box plot of ages from wound dimension study
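The box plot limits can be worked out by hand for a small data set. The sketch below uses the length of stay data from earlier in this topic (the age data behind Figure 2.10 is not reproduced here) and R's default quartile rule; the object names are illustrative.

```r
days <- c(2, 2, 3, 2, 15, 1, 3)                      # lengths of stay (days)

q   <- quantile(days, c(0.25, 0.75), names = FALSE)  # Q1 and Q3
iqr <- q[2] - q[1]                                   # Q3 - Q1

lower_limit <- q[1] - 1.5 * iqr                      # Q1 - 1.5(Q3 - Q1)
upper_limit <- q[2] + 1.5 * iqr                      # Q3 + 1.5(Q3 - Q1)

# Values beyond either limit would be highlighted as outliers on the plot
outliers <- days[days < lower_limit | days > upper_limit]

c(lower = lower_limit, upper = upper_limit)
outliers
```

Here the limits are 0.5 and 4.5 days, so the stay of 15 days is the only value flagged. Note that R's boxplot() function draws each line only as far as the most extreme observation that lies inside these limits.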

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Box plots are particularly useful for comparing distributions across different groups. The nature of a histogram would make this impossible as groups would overlap and the distribution would be unclear. Figure 2.11 shows the comparison of air pollution levels between two cities. This clearly shows that pollution levels are higher in City B than in City A. Note the outliers for the data from City A. The distribution of the data from City B looks close to being normally distributed.


Figure 2.11: Box plot of air pollution levels in two cities

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Note: The ozone.csv (https://scholar.hw.ac.uk/download/2021/H-APP/ozone.csv) file contains the data for this box plot and can be downloaded from the website. The following R code will produce the box plot as shown.

> ozone <- read.csv("ozone.csv", header = TRUE)
> attach(ozone)
> # The plotting arguments below are illustrative and assume one column of readings per city
> boxplot(ozone, main = "Ozone levels", xlab = "City", ylab = "Ozone")


2.8 The normal distribution

The characteristic bell-shape of the normal distribution describes many distributions that occur in practice. For example, physical measurements in which there is natural variation (e.g. height and weight) are closely approximated by the normal distribution. Data that can be approximated by a normal distribution can be analysed using parametric methods of statistical testing.

Two important mathematical properties of the normal distribution are:

• it is symmetric about the mean, and;

• approximately 95% of the data points lie within 2 standard deviations of the mean.

The histogram in Figure 2.12 shows the distribution of the wound dimension age data with the normal approximation to the data superimposed as a smooth curve.

Figure 2.12: Histogram illustrating normally distributed age data with theoretical curve superimposed

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

This clearly illustrates the characteristic bell shape. Since normally distributed data is symmetric about the mean, the histogram shows that the average age of subjects in this data set was around 30 years.

Example

In a scientific paper it is reported that the mean age of 500 injecting drug users in a city is 28 years with a standard deviation of 5 years. Assuming that the authors have reported the correct summary statistics for the data they have gathered, what can be assumed about the distribution of the ages? What do the reported summary statistics reveal about the location and spread of the ages of injecting drug users in the city?

Since the mean and standard deviation have been reported as the measures of location and spread to describe this data set, it must be assumed that the data is normally distributed. The average age of injecting drug users is 28 years with a standard deviation of 5 years. For a normal distribution it is known that approximately 95% of the data points lie within 2 standard deviations of the mean. Two standard deviations in this case is 10 years. Therefore, approximately 95% of the injecting drug users in this sample would be aged between 18 and 38 years.
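The arithmetic in this example can be reproduced in R. The sketch below assumes, as the example does, that the ages follow a normal distribution with mean 28 and standard deviation 5; pnorm() gives the proportion of a normal distribution lying below a given value, and the object names are illustrative.

```r
m <- 28   # reported mean age (years)
s <- 5    # reported standard deviation (years)

# Interval covering 2 standard deviations either side of the mean
interval <- c(lower = m - 2 * s, upper = m + 2 * s)

# Proportion of a normal distribution falling inside that interval
coverage <- pnorm(interval["upper"], mean = m, sd = s) -
            pnorm(interval["lower"], mean = m, sd = s)

# Approximate number of the 500 users expected inside the interval
expected_n <- 500 * unname(coverage)

interval
round(unname(coverage), 3)
round(expected_n)
```

The computed proportion is slightly above 0.95, which is why the property is stated as 'approximately 95%'.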

Without having the actual data, it is possible to use the properties of the normal distribution to make inferences about the original data. The histogram in Figure 2.13 shows the actual age distribution of this data with the normal probability curve superimposed.

Figure 2.13: Histogram illustrating actual age data with normal probability curve superimposed

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .


2.9 Exercises

Interpreting data exercises

The file activity.csv (https://scholar.hw.ac.uk/download/2021/H-APP/activity.csv) contains data relating to health care commissioned by the Scottish government to address health inequalities.

To import the file to R in preparation for the following questions, use the following commands.

> activity <- read.csv("activity.csv", header = TRUE)
> attach(activity)

Q16: Construct a table to examine the distribution of patient activity.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Q17: Produce a pie chart of the activity levels. Use the various options to change the titles and format of the chart.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Q18: Produce a histogram of the pulse rates. Comment on the distribution of the data and construct a table of the most appropriate statistics to summarise the location and spread of the pulse rates.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Q19: Compute descriptive statistics for the pulse rates for each level of activity and comment on any differences between the three groups.


The following data shows rainfall levels for each month (mm) for a location in the USA.

SamSamWater Climate Tool

Name of location (approximately): Palm Springs, CA 92264, USA

Latitude: 33.82790 (decimal degrees)

Longitude: -116.57257 (decimal degrees)

Altitude: (m above mean sea level)

Average precipitation (in mm or litres per m²) for this location is listed in the table.

Month Rainfall (mm)

January 99

February 82

March 93

April 58

May 24

June 8

July 19

August 27

September 22

October 24

November 66

December 82

This data is available at http://www.samsamwater.com/climate/index.php.

Q20: Produce a bar chart of this data to illustrate the trend over time – the data is availablein the file rainfall.csv (https://scholar.hw.ac.uk/download/2021/H-APP/rainfall.csv).

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Download rainfall data at another location from this website and produce a similar plot.


Crime data for Tayside police force areas on SIMD and crime counts (which relate to selected recorded offences, not all crimes committed in the area) is available in the file crimes.csv (https://scholar.hw.ac.uk/download/2021/H-APP/crimes.csv).

Q21: Produce a plot to illustrate the relationship between SIMD and crime rates and comment on the relationship.

People who are concerned about their health may prefer hot dogs that are low in salt and calories. The hotdogs.csv (https://scholar.hw.ac.uk/download/2021/H-APP/hotdogs.csv) file contains data on the sodium and calories contained in each of 54 major hot dog brands.

The hot dogs are classified by type: beef, poultry, and meat (mostly pork and beef, but up to 15% poultry meat).

Q22: Construct a table to show the number and percentage of each type of hot dog in the data set.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Q23: Produce a plot to compare the calories between the different types of hot dog and give a subjective impression of any differences. (Hint: This could be done using a box plot of calories for each type of hot dog on the same graph, which allows a good visual comparison).

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Q24: Produce a plot to investigate any association between the amount of sodium in the hot dogs and their calorie content and interpret the relationship. (Hint: A scatter plot is the appropriate graphical display to use to subjectively assess the association between two quantitative variables, i.e. Sodium and Calories).


2.10 Summary

Summary

• Categorical data can be classified into a number of specific categories. Sometimes categorical data is referred to as qualitative data.

• Numerical data is data that is recorded as numerical values. Numerical data can also be referred to as quantitative data.

• Depending on the type of data being summarised (i.e. categorical or numerical), different graphs are appropriate for display purposes. For example, pie charts and bar charts are useful for displaying categorical data.

• The relationship (usually linear) between two numerical variables can be assessed using a scatter plot.

• A data value that is unusual when compared with the rest of the data is known as an outlier.

• An important graphical display for numerical data is called a histogram. The shape of a histogram illustrates the distribution (i.e. the shape) of data. If the histogram is bell shaped, the data is referred to as normal data or data which is normally distributed.

• Other distributional histogram shapes can represent skewed data either to the right (positively skewed) or left (negatively skewed).

• The mean and median give an indication of the 'typical' value of a set of data. This is known as a measure of location.

• The standard deviation, range and inter-quartile range all give an indication of how spread out the data are about the average.

• The median and inter-quartile range are unaffected by outliers. If the distribution of a set of data is normal then the mean and standard deviation should be reported as summary statistics.

• For skewed data, the appropriate measures of location and spread are the median and inter-quartile range.

• An alternative graphical representation to a histogram to determine the distribution of a set of data is a box plot.

• Two important mathematical properties of the normal distribution are: it is symmetric about the mean, and approximately 95% of the data points lie within two standard deviations of the mean.


2.11 Resources

Downloads

• activity.csv - https://scholar.hw.ac.uk/download/2021/H-APP/activity.csv

• crimes.csv - https://scholar.hw.ac.uk/download/2021/H-APP/crimes.csv

• hotdogs.csv - https://scholar.hw.ac.uk/download/2021/H-APP/hotdogs.csv

• ozone.csv - https://scholar.hw.ac.uk/download/2021/H-APP/ozone.csv

• wound.csv - https://scholar.hw.ac.uk/download/2021/H-APP/wound.csv

Links

• https://www.minitab.com/en-us/products/minitab/ - powerful statistical software everyone can use (free trial).

• https://www.r-project.org/ - a free software environment for statistical computing and graphics.

• http://dsearls.org/other/CalculatingQuartiles/CalculatingQuartiles.htm - notes on quartiles


2.12 End of topic test

Interpreting data topic test

For each variable in the data set, decide which are qualitative and which are quantitative.

     Gender   Activity   Smokes   Height   Weight   Pulse
1    Male     Moderate   No       66       140      64
2    Male     Slight     Yes      73       190      66
3    Male     A lot      No       72       150      84
4    Female   Moderate   No       61       140      96
5    Female   Moderate   No       70       120      62
6    Female   Slight     No       65       118      84

Q25: Gender

a) Quantitative

b) Qualitative

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Q26: Activity

a) Quantitative

b) Qualitative

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Q27: Smokes

a) Quantitative

b) Qualitative

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Q28: Height

a) Quantitative

b) Qualitative

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Q29: Weight

a) Quantitative

b) Qualitative

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Q30: Pulse

a) Quantitative

b) Qualitative


Q31: Which graphs would be suitable for displaying the distribution of votes in a general election?

a) Box plot

b) Histogram

c) Pie chart

d) Scatter plot

e) Bar chart

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Q32: A set of data which is skewed with a tail to the right should be summarised using which measures of location and spread?

a) Mean

b) Median

c) Range

d) Interquartile range

e) Standard deviation

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Q33: For a set of data which is normally distributed with mean 15 and standard deviation 3.5, which of the following is true?

a) A histogram of the data would be symmetric about 15.

b) 95% of the data would lie between 8 and 22.

c) 95% of the data would lie between 11.5 and 18.5.

d) The median value of the data would be approximately 15.

e) The interquartile range is from 8 to 22.


Unit 2 Topic 3

Correlation and linear regression

Contents

3.1 Introduction
    3.1.1 Terminology
3.2 Correlation
    3.2.1 Interpretation of the correlation coefficient
    3.2.2 Computing the correlation
3.3 Linear regression
    3.3.1 Fitting a regression line
    3.3.2 Model predictions
    3.3.3 Use of regression line
    3.3.4 Coefficient of determination
    3.3.5 Computing regression lines
3.4 Exercises
3.5 Summary
3.6 Resources
3.7 End of topic test

Learning objective

By the end of this topic, you should be able to:

• describe and provide examples of dependent and independent variables;

• construct and interpret scatter plots;

• explore trends in data, for example seasonality;

• compute, understand and interpret correlations;

• understand the applicability of Pearson's product-moment correlation coefficient;

• use software to apply simple linear regression;

• interpret the slope and intercept parameters in relation to data;

• use linear models for prediction and assess the accuracy of predictions.


3.1 Introduction

Often in research it is of interest to see how one variable relates to another variable. For example, what happens to the systolic blood pressure in a group of patients as the dose of a drug is increased? It is also possible to construct a model which would make it possible to estimate the value of one variable from a number of known factors, e.g. what would be the expected reduction in systolic blood pressure in a 50-year-old female given a specified dose of a new drug?

3.1.1 Terminology

A variable that is under the control of an investigator is known as the independent variable. In the example in the introduction, the dose of the drug administered would be the independent variable. The dependent variable is the variable that the investigator is trying to predict (in this case systolic blood pressure). It is usually of interest to try to predict what will happen to the dependent variable as the independent variable changes.

A good way to initially inspect the relationship between two variables and look for patterns is to produce a scatter plot with the independent variable on the x-axis and the dependent variable on the y-axis, as illustrated in Figure 3.1. This shows that as the independent variable increases, the dependent variable decreases in an approximately linear way.

Figure 3.1: Scatter plot illustrating the relationship between two variables


The scatter plot shown in Figure 3.2 shows the relationship between serving size (in grams) and calories for a number of sandwiches from nutritional data published by a fast-food restaurant. This illustrates a positive relationship between the two variables in that a higher calorie content is associated with larger serving sizes.

Figure 3.2: Scatter plot illustrating a positive linear relationship between serving size and calories in a fast food restaurant

3.2 Correlation

The correlation between two numerical variables is a measure of the strength of linearity between them. It is measured on a scale ranging from –1 to +1. If large values of one variable occur with large values of the other then the correlation is said to be positive (as in Figure 3.2). If large values of one variable coincide with small values of the other then the correlation is negative (as in Figure 3.1). If the observations in the scatter plot lie close to a straight line then the correlation value will be high. A high positive correlation will be close to +1 and a high negative correlation will be close to –1. If there is no obvious linear relationship between the two variables then the correlation will be close to zero. Figure 3.3, Figure 3.4, Figure 3.5 and Figure 3.6 show examples of correlations for four pairs of variables.
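The sign and size of a correlation can be explored in R with the cor() function. The data in the sketch below is hypothetical, invented purely for illustration, and is not the data behind any of the figures.

```r
# Hypothetical data, invented purely to illustrate the sign and size of a correlation
x     <- 1:10
y_pos <- 2 * x + c(1, -1, 2, -2, 1, -1, 2, -2, 1, -1)  # rises with x, with some scatter
y_neg <- -y_pos                                        # falls as x rises

r_pos   <- cor(x, y_pos)        # close to +1: strong positive linear relationship
r_neg   <- cor(x, y_neg)        # close to -1: strong negative linear relationship
r_exact <- cor(x, 3 * x + 1)    # exactly +1: the points lie on a straight line

round(c(positive = r_pos, negative = r_neg, exact = r_exact), 3)
```

Reversing the direction of the relationship changes only the sign of the correlation, not its size.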


Figure 3.3: Highly correlated variables

Figure 3.4: Less highly correlated variables

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .


Figure 3.5: Negatively correlated variables

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Figure 3.6: No correlation between variables

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .


3.2.1 Interpretation of the correlation coefficient

The correlation coefficient should always be interpreted with care since there may be no directconnection between two highly correlated variables. For example, between 1900 and 2000 theincrease in world population was three times greater than the entire previous history of humanity- an increase from 1.5 to 6.1 billion in just 100 years. Global temperatures have also risen about1.30◦C over the past century. There is a positive correlation between these two variables (sinceboth have increased over this period of time). However, there is clearly no causal relationshipbetween these two variables - the global temperature rise is not causing the population explosion.This may appear to be a rather exaggerated example, but it is a common mistake for researchersto make claims of causal relationships between variables based on high correlation values. Often,in research, the correlation will be due to a causal relationship between the two variables, e.g. anincrease in systolic blood pressure could be caused by a higher dose of a drug, but other statisticalmethods are necessary to show this.

Note too that a low correlation value does not necessarily mean a low degree of association. The scatter plot in Figure 3.7 shows two variables that are strongly related, the association being quadratic. However, since the correlation coefficient measures the degree of linear association between the variables, the correlation is low (since the relationship is not linear).

Figure 3.7: Scatter plot illustrating a strong association between two variables (quadratic relationship) but low correlation
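The effect illustrated in Figure 3.7 can be reproduced with a small sketch in R. The data below are hypothetical (they are not the data behind the figure): y is determined exactly by x through a quadratic curve, yet the Pearson correlation is essentially zero because the relationship has no linear component.

```r
# Hypothetical data: y depends exactly on x, but through a quadratic curve
x <- -5:5
y <- x^2

# Pearson correlation only measures the linear part of the relationship
r <- cor(x, y)
print(r)
```

Because the x values are symmetric about zero, the positive and negative products in the numerator of the correlation cancel, even though knowing x determines y completely.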


Figure 3.8: Perfect agreement between two thermometers

Figure 3.9: Poor agreement between both thermometers


Another way in which correlations are often misused is in method comparison studies. For example, a scientist may want to ensure that two thermometers are calibrated with each other before using them in different labs. In this case, data could be gathered by placing both thermometers in beakers of water at varying temperatures and recording the temperature reading on each. Figure 3.8 shows some hypothetical data for what would be expected if there was perfect agreement between the readings measured by each thermometer. Here the correlation is 1, i.e. a perfect linear relationship. Suppose instead that the temperature reading from thermometer B was not the same as that from thermometer A. Figure 3.9 shows hypothetical data where thermometer B gives a reading which is numerically half the reading from thermometer A. Clearly, the measurements do not agree, but there is still a perfect, positive linear relationship between the two temperature readings (i.e. a correlation of 1). Correlation is often used inappropriately in this type of data analysis. Recall that correlation is simply a measure of how linear a relationship is between two variables. Agreement in a method comparison study requires more than this: the points would have to lie on a straight line passing through the origin at a 45 degree angle (i.e. the line y = x). There are statistical methods for method comparison studies, but these are not covered here.
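The thermometer scenario can be sketched in R as follows. The readings are hypothetical values chosen for illustration (they are not the data plotted in Figures 3.8 and 3.9), and the variable names are assumptions.

```r
# Hypothetical readings (degrees C) from thermometer A
temp_a <- c(10, 20, 30, 40, 50, 60)
# Thermometer B reads numerically half of thermometer A
temp_b <- temp_a / 2

r <- cor(temp_a, temp_b)             # strength of the linear relationship
mean_diff <- mean(temp_a - temp_b)   # average disagreement between readings

print(r)
print(mean_diff)
```

The correlation is 1 even though the two thermometers disagree by 17.5 degrees on average, which is why correlation on its own is not a suitable measure of agreement.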

Example

Figure 3.10 illustrates the relationship between the dimension of a wound and the healing time in days.

Figure 3.10: Healing time for wounds

It is clear from this plot that as the wound dimension increases, the time to heal increases. The correlation coefficient between these two (numerical) variables is 0.864 (after removing the outlier). This indicates a strong, positive correlation (since high values of wound dimension coincide with high healing times). Remember that this cannot be interpreted as a causal relationship based on this high correlation, although this is probably the case in this instance.


3.2.2 Computing the correlation

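If x denotes the independent variable and y the dependent variable, the formula for computing the correlation coefficient between n pairs of observations whose values are $(x_i, y_i)$ is

$$ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}}, $$

where $\bar{x}$ and $\bar{y}$ are the means of the x and y values.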

The numerator in this calculation simply produces a high positive number if the relationship between x and y is a positive linear relationship. If the relationship is a negative linear relationship then the numerator will be a large negative number. This computation is illustrated in Figure 3.11 which shows a positive linear relationship.

Figure 3.11: Computation of the correlation

The point $(\bar{x}, \bar{y})$, whose coordinates are the means of the x and y values, splits the graph region into four quadrants. For each point $(x_i, y_i)$ in the top right quadrant, the values of $(x_i - \bar{x})$ and $(y_i - \bar{y})$ will both be positive, so the product $(x_i - \bar{x})(y_i - \bar{y})$ that this point contributes to the numerator of the correlation coefficient will be positive. Similarly, for data points lying in the top left quadrant, the contribution will be negative. For data points lying in the bottom right quadrant, the contribution will also be negative, and it will be positive for those in the bottom left quadrant.


The data in Figure 3.11 is approximately linear, with most of the data points therefore lying in the top right and bottom left quadrants. The numerator for the correlation coefficient will therefore be positive. Had the relationship in the data been that y decreased as x increased, most of the data points would have been in the top left and bottom right quadrants, so the numerator would have been negative. If there was no linear relationship, the points would be randomly scattered throughout the four quadrants and the numerator would be close to zero since the positive and negative products would cancel each other out. Therefore the numerator of the correlation coefficient r will give the following:

• a positive number if x and y both increase in an approximate linear pattern (i.e. the data slope upwards);

• a negative number if y decreases as x increases in an approximate linear pattern (i.e. the data slope downwards), and;

• a value close to zero if there is no linear relationship between x and y, i.e. the points are randomly scattered.

The numerator value obtained can lie anywhere between –∞ and +∞. The effect of the denominator is to scale the value in the numerator so that –1 ≤ r ≤ +1. This scaling allows the direct comparison of correlation coefficients from different data sets. In practice, the computation is done using a statistical software package.
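As an illustration of the roles of the numerator and denominator, the formula can be evaluated step by step in R. This is a minimal sketch using hypothetical data (not wound.csv); the variable names are assumptions, and the built-in function cor() is used only as a check.

```r
# Hypothetical paired observations (not taken from wound.csv)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

dx <- x - mean(x)   # deviations of x from its mean
dy <- y - mean(y)   # deviations of y from its mean

numerator <- sum(dx * dy)                    # unscaled, not limited to [-1, 1]
denominator <- sqrt(sum(dx^2) * sum(dy^2))   # scaling factor
r_by_hand <- numerator / denominator

print(numerator)
print(r_by_hand)
print(cor(x, y))    # built-in Pearson correlation, for comparison
```

For these values the numerator is well above 1, while the scaled value r lies between –1 and +1 and matches the value returned by cor().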

The data for the following examples is available in the file wound.csv (https://scholar.hw.ac.uk/download/2021/H-APP/wound.csv).

Computing the correlation in Minitab

Pearson's correlation coefficient can be computed in Minitab using the menu option Stat > Basic Statistics > Correlation. For the wound dimension data with the outlier removed, the output is as shown.

Correlation: dim, time

Correlations
Pearson correlation   0.864
P-value               0.000

Computing the correlation in R Studio

The following code can be used to input the data for the wound dimension study and compute the correlation. The column names dim and time are assumed to be those shown in the Minitab output.

> wound <- read.csv("wound.csv", header = TRUE)
> attach(wound)
> cor(dim, time, use = "complete.obs")
[1] 0.8642507

Note the additional option use = "complete.obs" is needed in order to compute the correlation with the outlier removed. For a complete set of data with no missing values, this part of the code is not needed.


Computing the correlation in Excel

In Excel, the correlation option is available in the Data Analysis package. Selecting the option Labels in first row produces the output shown.

        wound        time
wound   1
time    0.864250698  1

Correlation: Quiz

Q1: If the correlation between body weight and annual income was high and positive, this would indicate that:

a) high incomes cause people to eat more food.
b) low incomes cause people to eat less food.
c) high income people tend to spend a greater proportion of their income on food than low income people, on average.
d) high income people tend to be heavier than low income people, on average.
e) high incomes cause people to gain weight.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Q2: A research study has reported a correlation of –0.693 between the eye colour (brown, green, blue) of experimental animals and the amount of nicotine that is fatal to the animal when consumed. This would indicate that:

a) nicotine is less harmful to one eye colour than others.
b) lethal dose decreases as eye colour changes.
c) eye colour of animals must always be considered in assessing the effect of nicotine consumption.
d) further study is required to explain this correlation.
e) correlation is not an appropriate measure of association.


3.3 Linear regression

When a linear relationship exists between two variables, it is possible to develop an equation to predict values of the dependent variable (y-axis) from knowledge of the independent variable (x-axis). Linear regression is a statistical modelling technique that fits a straight line to data. The general equation of a straight line is given by

y = a + bx

where b is the slope (or gradient) of the line and a is the intercept of the line with the y-axis. This is illustrated in Figure 3.12.

Figure 3.12: Mathematical representation of a straight line


Example

A study was conducted to investigate the development of a foetus throughout the growth period. Data was recorded on 84 foetuses about the date of conception and the length (measured using ultrasound). The date of conception allows the age of the foetus to be calculated. The two variables age and length are clearly related. The aim of the study was to model the length and age data and use it to determine if a foetus of known age is developing at an appropriate rate. The summary statistics in Minitab for the two variables of interest are shown. Note that the mean and median values for both length and age are very close, indicating that these variables are possibly normally distributed.

Descriptive Statistics: age, length

Statistics
Variable  N   Mean  StDev  Minimum  Q1  Median  Q3  Maximum
age       84  (values not legible in the source)
length    84  (values not legible in the source)

The scatter plot in Figure 3.13 shows the relationship between the two variables. It would appear that age and length are strongly related, in a linear way. It is therefore possible to fit a straight line to this data and use the line to model the relationship between the variables.

Figure 3.13: Scatter plot showing the relationship between the age and length data


3.3.1 Fitting a regression line

If all the data in Figure 3.13 lay exactly on a straight line, and there was no random variation about that line, it would be simple to draw that straight line on the scatter plot.

However, this is not the case for real data. For a given value of the independent variable, there will be a range of observed values of the dependent variable. For example, in the graph, it is clear that at age about 75 days the corresponding lengths range from about 5.5 to 7.5 cm. Clearly from this plot, different analysts would estimate different regression lines. In order to have objective results, it is necessary to define a criterion that identifies the best fitting straight line for a given set of data.

The best fitting straight line for such data would be one which, on average, has all the points in the scatter plot as close as possible to it. A method called least squares is used to find this line. The method of least squares minimises the sum of squared vertical deviations from the line and the resulting line is known as the least squares linear regression line.

The idea of least squares is illustrated in Figure 3.14 which shows a scatter plot and the line of best fit identified by the method of least squares. The method can be thought of as fitting an approximate straight line on the graph and then calculating the vertical distance from each of the data points to this line (where the distances are denoted as $e_i$). The best fitting straight line is the one for which the sum of the squared distances from the points to the line is the smallest.

Figure 3.14: Fitting a regression line using least squares


Points lying above the line will give positive distances and those below, negative distances. In order to avoid these cancelling out, the distances are squared (similar to the calculation of the standard deviation) then summed. If the line is then moved slightly, a new sum of squared differences can be obtained at the new position. This is done until the minimum sum of squared differences is achieved. The corresponding line is known as the least squares linear regression line. Note: this 'movement' of the line is used to illustrate the concept of least squares fitting. In practice, the values of a and b that minimise the sum of the squared residuals are found using differentiation.

The vertical distances between each point and the fitted line (denoted $e_i$) are known as 'residuals'. The smaller the residuals, the better the model is at describing the data (i.e. the better the fit of the regression line). Examining the residuals is a good way of assessing how well the linear model fits the data. All calculations for estimating the regression line using least squares are done by computer and the results of applying these methods to the foetal development data are shown.
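The differentiation step can be sketched as follows. This is the standard derivation rather than part of the original text; S(a, b) is a name introduced here for the sum of squared residuals, and $\bar{x}$ and $\bar{y}$ denote the means of the x and y values.

```latex
% S(a, b): sum of squared residuals for the line y = a + bx
S(a, b) = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - a - b x_i)^2

% Setting the partial derivative with respect to a to zero
\frac{\partial S}{\partial a} = -2 \sum_{i=1}^{n} (y_i - a - b x_i) = 0
\quad\Rightarrow\quad a = \bar{y} - b\bar{x}

% Setting the partial derivative with respect to b to zero
\frac{\partial S}{\partial b} = -2 \sum_{i=1}^{n} x_i (y_i - a - b x_i) = 0

% Substituting a = \bar{y} - b\bar{x} and using
% \sum (x_i - \bar{x}) = \sum (y_i - \bar{y}) = 0
\sum_{i=1}^{n} (x_i - \bar{x}) \left[ (y_i - \bar{y}) - b (x_i - \bar{x}) \right] = 0
\quad\Rightarrow\quad
b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
```

These expressions agree with the slope and intercept formulas quoted in Section 3.3.5.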

Example

The estimated least squares linear regression line (y = a + bx) for the foetal development data is

length = −2.69 + 0.12 × age.

This line cuts the y-axis at –2.69 and has a slope (or gradient) of 0.12. Therefore, as the age of a foetus increases by one day, the length increases by 0.12 cm.

3.3.2 Model predictions

The least squares linear regression line can be used to predict the length of a foetus of known age and thereby assess if it is growing at an appropriate rate. For example, the length of a foetus aged 85 days would be estimated as

length = −2.69 + 0.12 × age
       = −2.69 + (0.12 × 85)
       = 7.51.

Therefore, the length of a foetus aged 85 days should be approximately 7.51 cm if it is developing at the correct rate.

Clearly there will be other factors which determine the actual length of a foetus at 85 days, e.g. height of parents. In order to allow for such variability, a prediction interval can be computed using statistical software. This is similar to a confidence interval which is described in detail later. The prediction interval for a foetus of age 85 days is from 7.01 to 8.08 cm. This would mean that if a foetus of age 85 days is measured using an ultrasound scan, and has a length of between 7.01 and 8.08 cm, then it would be assumed that it is the correct length for its age (i.e. developing at the correct rate). If the length is less than 7.01 cm, there is evidence to suggest that it is not developing as it should. A length greater than 8.08 cm suggests that it is larger than it should be. In this way linear regression can be used to highlight any potential problems in the development of a foetus from measurements taken using ultrasound.
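The prediction and the interval check described above can be sketched in R. The coefficients and interval limits are the rounded values quoted in the text; the function names predict_length and within_interval are hypothetical helpers introduced only for this illustration.

```r
# Rounded coefficients quoted in the text for the foetal development model
intercept <- -2.69
slope <- 0.12

# Hypothetical helper: predicted length (cm) for a given age (days)
predict_length <- function(age) {
  intercept + slope * age
}

predicted <- predict_length(85)
print(predicted)

# Prediction interval for a foetus aged 85 days, as quoted in the text (cm)
pi_lower <- 7.01
pi_upper <- 8.08

# Hypothetical helper: does a measured length fall inside the interval?
within_interval <- function(length_cm) {
  length_cm >= pi_lower & length_cm <= pi_upper
}

print(within_interval(7.6))   # a length inside the interval
print(within_interval(6.8))   # a length below the interval
```

A measured length of 7.6 cm would be treated as consistent with normal development at 85 days, whereas 6.8 cm falls below the interval and would prompt further investigation.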


3.3.3 Use of regression line

When a linear relationship exists between two variables, the least squares linear regression line may be used to estimate a value of the dependent variable given a value of the independent variable. The value of the independent variable used for prediction purposes should be within the range of the given data. In the previous example, the regression model would be valid for predicting the length for a foetus aged between 45 and 100 days. Outside of this range it is not obvious that the relationship is still valid. For example, predicting the length of a foetus aged 10 days using the model would give the estimated length as

length = −2.69 + (0.12 × 10) = −1.49.

Obviously, a length of −1.49 cm makes no sense. In the context of the study though, it would not be possible to measure the length of a foetus at 10 days. The equation could also be used to estimate the length of a foetus at 300 days. However, a foetus of age 300 days cannot exist! The regression line is therefore only valid to use for predicting within the range of data gathered from the study.
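One way to guard against this kind of extrapolation is to check the age against the range of the data before predicting. The sketch below uses the 45 to 100 day range and the rounded coefficients quoted in the text; safe_predict_length is a hypothetical helper, and returning NA outside the range is a design choice made for this illustration.

```r
# Age range (days) stated in the text as covered by the study data
age_min <- 45
age_max <- 100

# Hypothetical helper: predict only inside the observed range of ages
safe_predict_length <- function(age) {
  if (age < age_min || age > age_max) {
    return(NA_real_)   # refuse to extrapolate beyond the data
  }
  -2.69 + 0.12 * age
}

print(safe_predict_length(85))    # inside the range
print(safe_predict_length(10))    # below the range
print(safe_predict_length(300))   # above the range
```

Ages of 10 and 300 days return NA rather than a meaningless length, while 85 days returns the prediction of 7.51 cm.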

The predicted value of the dependent variable (i.e. the length) is only an estimate and other factors apart from the age will contribute to the length of a foetus (e.g. genetics, health of the mother etc.). Prediction intervals are useful for providing a range of predicted values of the dependent variable.

3.3.4 Coefficient of determination

When regression analysis is performed the computer will also provide a value called the coefficient of determination. This is a measure of the amount of variability in the data that is explained by the regression line and is usually expressed as a percentage. It therefore gives an indication of how well the regression line fits the data. Since a regression line which has a high coefficient of determination (usually denoted R²) fits the data well, the predicted values of the dependent variable will be more accurate for lines which have high R² values. In the example of foetal development, the coefficient of determination is 97.6%, indicating that the linear regression model is a good fit. Using the model for predictive purposes should therefore provide reliable estimates of length.

The values of the dependent variable (length) are not all the same in this data, since length was found to depend on the independent variable (i.e. the age of the foetus). The coefficient of determination indicates that 97.6% of the variability in the lengths was explained by the variability in the ages. Therefore, 2.4% of the variability was not explained and has to be attributed to other factors (e.g. genetics, diet, smoking habits of the mother, etc.).
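The idea of explained and unexplained variability can be made concrete with a short R sketch. The data are hypothetical (not the foetus data), and the names sse and sst are assumptions introduced here for the unexplained and total sums of squares.

```r
# Hypothetical data (not the foetus data set)
x <- c(1, 2, 3, 4, 5, 6)
y <- c(1.2, 1.9, 3.2, 3.8, 5.1, 5.8)

fit <- lm(y ~ x)

sse <- sum(residuals(fit)^2)   # variability left unexplained by the line
sst <- sum((y - mean(y))^2)    # total variability in y
r_squared <- 1 - sse / sst     # proportion of variability explained

print(r_squared)
print(summary(fit)$r.squared)  # value reported by R
print(cor(x, y)^2)             # square of the correlation coefficient
```

For a straight line fitted to one independent variable, the coefficient of determination equals the square of the correlation coefficient, so all three printed values agree.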

Figure 3.15 shows the actual least squares linear regression line for this study superimposed on the original data. The high coefficient of determination indicated that the linear model was valid and visually it can be seen that the line does indeed fit the data well.


Figure 3.15: Original data with least squares linear regression line superimposed

3.3.5 Computing regression lines

Mathematically the least squares linear regression slope and intercept parameters are estimated as

$$ b = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \quad \text{and} \quad a = \bar{y} - b\bar{x}. $$

Statistical analysis packages will estimate these parameters to compute the line of best fit.
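As a check on these formulas, the slope and intercept can be computed directly and compared with the values returned by lm(). This is a minimal sketch on hypothetical data; the variable names are assumptions.

```r
# Hypothetical paired observations
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Slope and intercept from the least squares formulas
b <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
a <- mean(y) - b * mean(x)

fit <- lm(y ~ x)   # the same line fitted by R

print(c(a, b))
print(coef(fit))
```

The hand-computed values (a = 0.05, b = 1.99 for these data) match the intercept and slope reported by lm().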

Computing regression lines in Minitab

There are two options to fit regression lines in Minitab. Stat > Regression > Fitted Line Plot... produces a scatter plot with the least squares regression line superimposed along with the regression equation. This produces a plot similar to the one in Figure 3.15. Stat > Regression > Regression > Fit Regression Model... is an alternative and allows additional options like fitting multiple regression models and using the analysis for prediction purposes. Part of the output from this approach shows the regression model.


Regression Analysis: length versus age

Coefficients
Term      Coef     SE Coef  T-Value  P-Value  VIF
Constant  -2.691   0.149    -18.11   0.000
age       0.12046  0.00205  58.62    0.000    1.00

Regression Equation

length = -2.691 + 0.12046 age

Once Stat > Regression > Regression > Fit Regression Model... has been used, the option Stat > Regression > Regression > Predict... can be used to predict the length of a foetus at a given age. The output obtained for a prediction at 75 days produces the prediction interval as shown.

Prediction for length

Regression Equation

length = -2.691 + 0.12046 age

Settings

Variable  Setting
age       75

Prediction
Fit      SE Fit   95% CI              95% PI
6.34327  0.03015  (6.28329, 6.40325)  (5.81162, 6.87492)

Computing regression lines in R Studio

The following code can be used to estimate the slope and intercept parameters for the foetus example.

> foetus <- read.csv("foetus.csv", header = TRUE)
> attach(foetus)
> lm(length ~ age)

Call:
lm(formula = length ~ age)

Coefficients:
(Intercept)          age
    -2.6909       0.1205

To get the predicted length of a foetus at age 75 days, along with a 95% prediction interval, the following code can be used.

> predict(lm(length ~ age), newdata = data.frame(age = 75), interval = "prediction")
       fit      lwr      upr
1 6.343272 5.811621 6.874923


The following code will produce a plot of the data with the least squares linear regression line superimposed. The plot produced using this R code is shown in Figure 3.16. Arguments such as main, xlab and ylab can be added to plot() to set the title and axis labels.

> plot(age, length)
> abline(lm(length ~ age))

Figure 3.16: Fitted line plot produced using R


Computing regression lines in Excel

The regression line can be computed using the Data Analysis tool and selecting Regression from the options. The dependent and independent variables have to be identified and the output is shown.

SUMMARY OUTPUT

            Coefficients   Standard Error  t Stat        P-value
Intercept   -2.690880514   0.148621305     -18.10561753  2.19402E-30
Age         0.120455364    0.002054818     58.62093636   1.03478E-68

The linear regression line can be superimposed on a scatter plot in Excel. This can be done by drawing a simple scatter plot in Excel and then selecting the chart and using the drop down menu at the top of the Excel workbook marked Chart Tools and under that Layout. There is a Trendline option which can be used to add the least squares linear regression line to the scatter plot, and there are further options to put the equation of the line, and the coefficient of determination, onto the scatter plot. This produces the plot shown in Figure 3.17.

Figure 3.17: Fitted line plot produced using Excel


Linear regression: Quiz

Decide whether each of the following statements is true or false.

Q3: Regression analysis predictions are inaccurate if made outwith the range of x data.

a) True
b) False

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Q4: Correlation measures association between two numerical variables.

a) True
b) False

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Q5: Low correlation means no association between the variables.

a) True
b) False

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Q6: The intercept value in a regression analysis is meaningless.

a) True
b) False

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Q7: A negative value of the slope parameter in a regression analysis indicates a negative correlation.

a) True
b) False


3.4 Exercises

Correlation and linear regression exercises

The file foetus.csv (https://scholar.hw.ac.uk/download/2021/H-APP/foetus.csv) contains the data on the age and length of 84 foetuses measured from ultrasound scans.

Q8: Produce descriptive statistics to summarise the variables age and length and comment on the distribution of each.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Q9: Produce a scatter plot with age on the x-axis and length on the y-axis.

Edit the graph to have a suitable title and axis labels and comment on the relationship between the age and length of typically developing foetuses.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Q10: Compute the correlation coefficient between age and length.

What can be concluded about the relationship between age and length?

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Q11: Compute the least squares linear regression line which would model the length of a foetus in terms of the age.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Q12: Compute the coefficient of determination for the regression model.

How can this be interpreted in terms of the fitted model?

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Q13: Use the fitted model to predict the length of a foetus at 85 days and the length of a foetus at 120 days.

For each prediction, comment on two factors which would indicate the predictions are accurate.

The cucumbers.csv (https://scholar.hw.ac.uk/download/2021/H-APP/cucumbers.csv) data set contains randomly collected data on growing season precipitation (mm) and cucumber yield (kg/m²). This data is available at http://www.physicalgeography.net/fundamentals/3h.html. It is reasonable to suggest that the amount of water received on a field during the growing season will influence the yield of cucumbers growing on it.

Q14: Produce a scatter plot with precipitation on the x-axis and yield on the y-axis.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Q15: Compute the correlation between these two variables and comment on the result.

Research has shown that high school students perform better academically with more sleep each night. A small study was done on a class of high school students who were each asked to record the average number of hours sleep per night they get over a one week period. The data is stored in the file performance.csv (https://scholar.hw.ac.uk/download/2021/H-APP/performance.csv).


Q16: Produce a scatter plot of amount of sleep and exam performance on the appropriate x- and y-axes.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Q17: From the scatter plot, comment on the average numbers of hours sleep that the students have recorded, explaining why some seem unusual.

Suggest how this might have been measured.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Q18: Compute the correlation between amount of sleep and exam performance and write a sentence to interpret the result.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Q19: Compute the least squares linear regression line to predict exam performance from amount of sleep.

Interpret the model coefficients in the context of the problem.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Q20: Predict the performance of an individual who has 10 hours of sleep and comment on the accuracy of this prediction.

3.5 Summary

Summary

• A variable that is under the control of an investigator is known as the independent variable.

• The dependent variable is the variable that the investigator is trying to predict.

• Inspect the relationship between two variables (the independent variable on the x-axis and dependent variable on the y-axis) by constructing a scatter plot.

• The correlation between two numerical variables is a measure of the strength of linearity between them.

• A high positive correlation will be close to +1, a high negative correlation will be close to –1, and if there is no obvious linear relationship between the two variables then the correlation will be close to zero.

• The correlation coefficient should always be interpreted with care since there may be no direct connection between two highly correlated variables.

• A low correlation value does not necessarily mean a low degree of association.

• Understand how the correlation coefficient, r, is calculated, but you will not be asked to compute this by hand.


$$ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}} $$

• When a linear relationship exists between two variables, it is possible to develop an equation to predict values of the dependent variable, y = a + bx, where b is the slope or gradient and a is the y-intercept of the line.

• When a linear relationship exists between two variables, the least squares linear regression line may be used to estimate a value of the dependent variable given a value of the independent variable.

• The predicted value of the dependent variable is only an estimate and other factors should be taken into account.

• The coefficient of determination is a measure of the amount of variability in the data that is explained by the regression line and is usually expressed as a percentage. It therefore gives an indication of how well the regression line fits the data.

3.6 Resources

Downloads

• cucumbers.csv - https://scholar.hw.ac.uk/download/2021/H-APP/cucumbers.csv

• durability.csv - https://scholar.hw.ac.uk/download/2021/H-APP/durability.csv

• foetus.csv - https://scholar.hw.ac.uk/download/2021/H-APP/foetus.csv

• performance.csv - https://scholar.hw.ac.uk/download/2021/H-APP/performance.csv

• wound.csv - https://scholar.hw.ac.uk/download/2021/H-APP/wound.csv

Links

• https://www.minitab.com/en-us/products/minitab/ - powerful statistical software everyone can use (free trial).

• https://www.r-project.org/ - a free software environment for statistical computing and graphics.


3.7 End of topic test

Correlation and linear regression topic test

A leather production company is interested in characteristics of cowhides that are associated with the quality of their final product, particularly the softness of the leather. A random sample of 10 hides was used and the following properties of each measured: collagen content, fat content and softness.

The following table shows the correlations between these variables and the corresponding p-values (each p-value is shown beneath its correlation).

            Fat       Collagen
Collagen    0.713
            0.011
Softness    –0.810    –0.581
            0.001     0.054

Q21: Interpret the results of this analysis by completing the following paragraphs using the numbers and words given.

Correlation analysis was used to quantify the degree of _______ between the variables.

The correlation between collagen content and fat content is 0.713 and represents a _______ relationship between the variables. As collagen _______, fat content increases. The p-value is _______, which is less than the significance level of 0.05. The p-value indicates that the correlation is _______.

The correlation coefficient between fat content and softness is _______ and the p-value is 0.001. The p-value is less than the significance level of _______, which indicates that the correlation is significant. As fat content increases, softness _______.

The correlation between _______ and softness is –0.581 and the p-value is _______.

Numbers and words: –0.810, 0.011, 0.05, 0.054, collagen, decreases, increases, linearity, positive, significant.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Q22: Because the sample has only 10 observations, the statistician requests the company increase the sample size. This may be problematic due to:

i. cost;
ii. sourcing enough hides;
iii. time required to collect the data;
iv. ethical issues related to animal research.

A designer of mobile devices at a multinational technology company specialising in consumer electronics, computer software and online services wants to assess the durability of touch screens in their products. Data is gathered on the durability of screens in relation to the optical lamination from a random sample of products.

The designer uses simple regression to determine whether the optical lamination of the screens is associated with durability. A fitted line plot is produced in R Studio.

Q23: Interpret the results of this analysis by completing the following paragraphs using the numbers and words given.

The regression analysis appears to indicate that there is a _______ relationship between the two variables. As _______ increases, the durability of the screen _______. The least squares linear regression line estimates that as the optical lamination increases by _______ unit, the durability increases by _______. The _______ value of –18.9 is meaningless in the context of this analysis.

There appears to be an _______ in the top right corner of the fitted line plot, which could have an effect on the results. The designer should investigate this to determine its cause.

Numbers and words: 3.5, decreases, increases, intercept, lamination, linear, one, outlier.

Unit 2 Topic 4

Hypothesis testing and confidence intervals

Contents

4.1 Introduction
4.2 Hypothesis testing
    4.2.1 The p-value
    4.2.2 Two sample t-test
    4.2.3 Choice of test
4.3 Tests for comparing two groups
    4.3.1 Paired data
4.4 Tests for comparing more than two groups
4.5 Tests for categorical data
    4.5.1 Chi-squared test of independence
    4.5.2 Chi-squared test for goodness-of-fit
    4.5.3 Paired analyses and small samples
    4.5.4 Z-test for two proportions
4.6 Hypothesis tests for normality and correlation
    4.6.1 Normality tests
    4.6.2 Test for correlation
4.7 Notes on statistical testing
    4.7.1 Errors in Statistical Tests
    4.7.2 Power and sample size
4.8 Confidence intervals
    4.8.1 Confidence interval for the difference in means
    4.8.2 Relationship between CIs and p-values
    4.8.3 Clinical significance of a statistical analysis
4.9 Exercises
4.10 Summary
4.11 Resources
4.12 End of topic test

Learning objective

By the end of this topic, you should be able to:

• formulate research questions;

• interpret and relate the results of a hypothesis test to the original research question;

• generate, understand, and interpret confidence intervals;

• perform simple analyses using t-tests and paired t-tests;

• use z-tests for two proportions;

• understand how errors can arise in statistical testing.

4.1 Introduction

Hypothesis testing is used extensively in research, for example, to determine if one treatment is more effective than another. The ideas behind hypothesis testing and the interpretation of the results (normally in the form of p-values) are described here. The choice of appropriate test depends on the type and distribution of data being analysed. Once the appropriate test has been performed, the result has to be interpreted in the context of the research study, and confidence intervals are useful for this. These give some idea of the magnitude of an effect on the population in general (e.g. to indicate by how much a new cholesterol-lowering drug will be expected to reduce cholesterol in a patient).

Statistical analysis considers the probability of an event being due to chance. For example, what is the chance that the third child in a family will be female if the first two were male? It is usually not possible to be 100% certain of an event or outcome, but mathematically it is possible to say how likely it is to occur. This forms the basis for statistical testing or hypothesis testing.

4.2 Hypothesis testing

A statistical test is designed by a researcher in an attempt to find evidence for some hypothesis of interest, e.g. that a new treatment is more effective than the existing treatment. Initially it assumes the contrary view (i.e. the new treatment is not more effective than the existing one) and only comes down in support of the hypothesis of interest if the data gathered are sufficiently unlikely to have been generated under the contrary view. The contrary view is known as the null hypothesis and the research hypothesis of interest is known as the alternative hypothesis. These ideas are illustrated in the following example.

Example

A study was conducted to compare the effect of two different analgesics (pain killers) on blood glucose levels. Fifteen subjects were given analgesic A and 12 were given analgesic B and the blood glucose levels recorded in mg/kg as shown in Table 4.1. The objective of the study is to determine if the blood glucose levels are higher with one or other analgesic.

Table 4.1: Blood glucose levels after administration of two analgesics

Analgesic A 44, 69, 51, 71, 52, 71, 55, 76, 60, 82, 62, 91, 66, 108, 68

Analgesic B 52, 95, 64, 97, 68, 107, 77, 116, 79, 83, 84, 88

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Null hypothesis – there is no difference in the blood glucose levels between the two analgesic groups.

Alternative hypothesis – there is some difference in the blood glucose levels between the two analgesic groups.

The summary statistics for the blood glucose levels in these two groups calculated using Minitab are shown.

Descriptive Statistics: Analgesic_A, Analgesic_B

Statistics

Variable      N   Mean   StDev  Minimum  Q1     Median  Q3     Maximum
Analgesic_A   15  68.40  16.47  44.00    55.00  68.00   76.00  108.00
Analgesic_B   12  84.17  18.12  52.00    70.25  83.50   96.50  116.00

Since the mean and median values are approximately equal in each of the groups, it seems reasonable to assume that the data are normally distributed. This can be verified by examining histograms for each which are shown in Figure 4.1 and Figure 4.2. Determining whether or not the data are normally distributed is an important step in identifying which hypothesis test to apply.
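The comparison of mean and median can also be checked directly in R. The following is a minimal sketch using the values from Table 4.1; the names analgesic_a, analgesic_b and summary_stats are chosen here for illustration and do not come from the course files.

```r
# Blood glucose levels (mg/kg) from Table 4.1
analgesic_a <- c(44, 69, 51, 71, 52, 71, 55, 76, 60, 82, 62, 91, 66, 108, 68)
analgesic_b <- c(52, 95, 64, 97, 68, 107, 77, 116, 79, 83, 84, 88)

# Mean and median for each group; similar values suggest a roughly
# symmetric distribution, which is consistent with (but does not prove) normality
summary_stats <- data.frame(
  group  = c("Analgesic A", "Analgesic B"),
  n      = c(length(analgesic_a), length(analgesic_b)),
  mean   = c(mean(analgesic_a), mean(analgesic_b)),
  median = c(median(analgesic_a), median(analgesic_b)),
  sd     = c(sd(analgesic_a), sd(analgesic_b))
)
print(summary_stats)
```

A histogram of each group, for example with hist(analgesic_a), gives the visual check described above.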

Figure 4.1: Distribution of blood glucose levels in patients receiving analgesic A

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Figure 4.2: Distribution of blood glucose levels in patients receiving analgesic B

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

It would appear from the summary statistics that the average blood glucose level in patients given analgesic A (68.4 mg/kg) is lower than in those given analgesic B (84.17 mg/kg). This is clear from the box plot shown in Figure 4.3.

Figure 4.3: Box plot comparing blood glucose levels in two analgesics

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

On average therefore, patients given analgesic B have blood glucose levels that are 15.77 mg/kg (i.e. 84.17 − 68.4) higher than those given analgesic A. The hypothesis test will estimate the probability of a difference of this magnitude occurring. More specifically, under the null hypothesis (i.e. that the glucose level in both groups is not different), the hypothesis test estimates the probability that a difference of at least 15.77 mg/kg would occur in a sample of this size purely by chance (i.e. simply due to random variability). If the probability of such a difference happening by chance is very small, it would be unlikely to happen if the null hypothesis was true. The conclusion would then be that there is a significant difference between the two analgesics.

4.2.1 The p-value

The p-value is the probability of getting data at least as extreme as those actually observed in the experiment if the null hypothesis was true. The lower the p-value, the more evidence there is against the null hypothesis (i.e. in favour of the study hypothesis). The conventional cut-off for significance is 5%. Therefore, if the p-value is 5% or less, there would be evidence to suggest that the null hypothesis is false and that the study hypothesis of interest (i.e. the alternative hypothesis) is true.
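One way to see what "as extreme as those actually observed if the null hypothesis was true" means is to simulate it. The sketch below is not the t-test used in this topic; it is a simple random re-labelling (permutation) approach applied to the blood glucose values from Table 4.1. If the analgesic made no difference, the group labels would be arbitrary, so shuffling them shows how large a difference in means arises by chance alone. The names observed_diff, n_sims and p_approx, and the choice of 10,000 shuffles, are assumptions made for illustration.

```r
# Blood glucose levels (mg/kg) from Table 4.1
analgesic_a <- c(44, 69, 51, 71, 52, 71, 55, 76, 60, 82, 62, 91, 66, 108, 68)
analgesic_b <- c(52, 95, 64, 97, 68, 107, 77, 116, 79, 83, 84, 88)

# Difference in means actually observed (B minus A)
observed_diff <- mean(analgesic_b) - mean(analgesic_a)

pooled <- c(analgesic_a, analgesic_b)
n_a <- length(analgesic_a)

set.seed(1)        # make the simulation reproducible
n_sims <- 10000    # number of random re-labellings

# Shuffle the pooled values, split them into groups of the original sizes
# and record the difference in means each time
sim_diffs <- replicate(n_sims, {
  shuffled <- sample(pooled)
  mean(shuffled[-(1:n_a)]) - mean(shuffled[1:n_a])
})

# Proportion of shuffles giving a difference at least as large as the
# observed one, in either direction
p_approx <- mean(abs(sim_diffs) >= abs(observed_diff))
p_approx
```

The resulting proportion is an approximation to a p-value; it will vary slightly with the random seed and is only expected to be of similar size to the value reported by a t-test.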

4.2.2 Two sample t-test

The appropriate test to perform in order to examine differences in mean values between two independent groups of normally distributed data is known as a two sample t-test. The following example output shows some of the results generated from performing this test in Minitab on the blood glucose levels.

Example

Estimation for Difference

Difference   95% CI for Difference
-15.77       (-29.75, -1.78)

Test

Null hypothesis          H₀: μ₁ − μ₂ = 0
Alternative hypothesis   H₁: μ₁ − μ₂ ≠ 0

T-Value   DF   P-Value
-2.34     22   0.029

The computed p-value is printed on the last line and is 0.029. This means that, if the null hypothesis was true, the probability of getting a difference of at least 15.77 mg/kg in mean blood glucose levels between the two groups by chance is 0.029 or 2.9%. Since this probability is less than 5%, it is unlikely that such a difference would happen by chance, and it is therefore attributed to the different effects of the two analgesics. Conclude that the blood glucose level for analgesic A is significantly lower than it is for analgesic B.

Note that it is possible to make this claim only if there are no other differences between the two groups. It is important therefore that at the start of a study, subjects are allocated randomly into the two groups. If they are not, then the observed difference between the groups may be due to another factor. For example, if the average age in one group is higher than the other, age may be the factor that influences the blood glucose levels rather than the analgesic.
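To show where the T-Value, DF and P-Value come from, the following sketch computes them step by step in R from the Table 4.1 data. It assumes the unequal-variance (Welch) form of the two sample t-test, which is the default in R's t.test; the variable names are illustrative only.

```r
# Blood glucose levels (mg/kg) from Table 4.1
analgesic_a <- c(44, 69, 51, 71, 52, 71, 55, 76, 60, 82, 62, 91, 66, 108, 68)
analgesic_b <- c(52, 95, 64, 97, 68, 107, 77, 116, 79, 83, 84, 88)

n_a <- length(analgesic_a)
n_b <- length(analgesic_b)

# Estimated variance of each sample mean
var_mean_a <- var(analgesic_a) / n_a
var_mean_b <- var(analgesic_b) / n_b

# Test statistic: difference in means divided by its standard error
std_error <- sqrt(var_mean_a + var_mean_b)
t_stat <- (mean(analgesic_a) - mean(analgesic_b)) / std_error

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- (var_mean_a + var_mean_b)^2 /
  (var_mean_a^2 / (n_a - 1) + var_mean_b^2 / (n_b - 1))

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df = df_welch)

round(c(t = t_stat, df = df_welch, p = p_value), 4)
```

The degrees of freedom from this formula are not a whole number. Minitab reports DF as 22, which may explain why its p-value differs very slightly from the one R gives.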

Two sample t-test in Minitab

If the data from the two independent samples are entered into two columns of Minitab, the test is performed using the menu option Stat > Basic Statistics > 2-Sample t...

The default setting is Each sample is in its own column and the two columns should be identified as Sample 1 and Sample 2 in the appropriate boxes in the settings menu. This gives the output shown in the example.

If the data is in a 'stacked' format, the test is done by selecting Both samples are in one column from the settings drop-down menu. The data column can then be indicated in the Samples box and the variable which indicates sample 1 or 2 should be entered into the Sample IDs box.

Two sample t-test in R Studio

The following code in R will enter the data and perform a two sample t-test.

> Analgesic_A <- c(44,51,52,55,60,62,66,68,69,71,71,76,82,91,108)
> Analgesic_B <- c(52,64,68,77,79,83,84,88,95,97,107,116)
> t.test(Analgesic_A, Analgesic_B)

        Welch Two Sample t-test

data:  Analgesic_A and Analgesic_B
t = -2.3382, df = 22.591, p-value = 0.02841
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -29.72983  -1.80350
sample estimates:
mean of x mean of y
 68.40000  84.16667

This gives essentially the same p-value as Minitab. Rather than entering the data manually, the following code can be used to read the data from the file bloodglucose.csv (https://scholar.hw.ac.uk/download/2021/H-APP/bloodglucose.csv).

> glucose <- read.csv("bloodglucose.csv")
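When the data are held in a single 'stacked' column with a second column identifying the group, R's formula interface to t.test can be used instead. The following sketch builds such a data frame from the Table 4.1 values; the data frame name stacked and the column names glucose and analgesic are assumptions for illustration, and the layout of bloodglucose.csv may differ.

```r
# Stacked (long) format: one column of values, one column of group labels
stacked <- data.frame(
  glucose = c(44, 69, 51, 71, 52, 71, 55, 76, 60, 82, 62, 91, 66, 108, 68,
              52, 95, 64, 97, 68, 107, 77, 116, 79, 83, 84, 88),
  analgesic = rep(c("A", "B"), times = c(15, 12))
)

# glucose ~ analgesic reads as "compare glucose between the levels of analgesic"
result <- t.test(glucose ~ analgesic, data = stacked)
result$p.value
```

The analysis is the same as the two-column call; only the way the data are supplied changes.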

Two sample t-test in Excel

In the Data Analysis package there is an option to perform t-Test: Two-Sample Assuming Unequal Variances and this option will produce the same analysis as those described for Minitab and R. The data for Analgesic A and Analgesic B should be highlighted to indicate the range in the appropriate Variable Range boxes and the output is shown in Table 4.2.

Table 4.2: T-test output from Excel

t-test: Two-Sample Assuming Unequal Variances

Variable 1 Variable 2

Mean 68.4 84.1666667

Variance 271.4 328.515152

Observations 15 12

Hypothesised mean difference 0

df 23

t Stat -2.33818752

P(T<=t) one-tail 0.01421893

t Critical one-tail 1.71387153

P(T<=t) two-tail 0.02843786

t Critical two-tail 2.06865761

4.2.3 Choice of test

The choice of test to apply depends on whether the data is categorical or numerical and on the distribution of the data, i.e. whether or not it is normally distributed. If the distribution is normal then the type of tests available are known as parametric tests. For non-normal data, non-parametric tests are applicable.

Note that in probability theory there is a result known as the central limit theorem. This roughly states that the sampling distribution of the mean of any set of data will be approximately normal if the sample size is large enough. Using this result, it is possible to apply parametric tests to non-normal data so long as the sample size is large. However, for smaller samples (say < 50) it is advisable to check the distribution of the data and choose a parametric or non-parametric analysis as appropriate. Non-parametric tests are not covered in this course.
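The central limit theorem can be illustrated with a short simulation. The sketch below is only an illustration under assumed settings: it draws repeated samples of size 50 from an exponential distribution (which is strongly skewed, with mean 1 and standard deviation 1) and looks at the distribution of the sample means. The seed, sample size and number of samples are arbitrary choices.

```r
set.seed(2023)

sample_size <- 50     # observations in each sample
n_samples   <- 5000   # number of repeated samples

# Mean of each simulated sample from a right-skewed distribution
sample_means <- replicate(n_samples, mean(rexp(sample_size, rate = 1)))

# The sample means centre on 1, with spread close to 1 / sqrt(sample_size)
mean(sample_means)
sd(sample_means)
1 / sqrt(sample_size)

# hist(sample_means) shows a roughly bell-shaped histogram,
# even though the individual observations are far from normal
```

Repeating the simulation with a much smaller sample_size (say 5) gives a noticeably skewed histogram, which is why checking the distribution matters more for small samples.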

The actual test used is then determined by the comparison of interest. The set up and interpretation of the test is the same each time; it is just the name of the actual test to use that needs to be decided, which is based on the research hypothesis and the type and distribution of the data. Examples of commonly used tests are given in the following section.

4.3 Tests for comparing two groups

To compare mean values between two independent groups, where the data in both groups is numerical and normally distributed, a t-test is used (usually referred to as a two sample t-test). This is a parametric test, i.e. it applies to normally distributed data. If the data are not normally distributed, the non-parametric equivalent of a t-test has to be used. This test is known as a Mann-Whitney test.

Example

A clinical trial was conducted to determine the effectiveness of a drug in preventing premature births. In the trial, 200 pregnant women were studied, 100 in a treatment group who received the drug and 100 in a control group who received a placebo. The patients were administered a fixed dose of the drug on a one time only basis between the 24th and 28th week of pregnancy. Patients were randomly assigned to the groups. The response variable of interest was the birthweight (in kg) of the baby. The data are available in the file birthweights.csv (https://scholar.hw.ac.uk/download/2021/H-APP/birthweights.csv).

Perform a hypothesis test to determine if there is any evidence that the drug is effective.

The descriptive statistics and boxplot show that birthweights appear to be slightly higher in the treatment group.

Descriptive Statistics: Treatment, Control

Statistics

Variable    N     Mean     StDev
Treatment   100   7.0011   0.8510
Control     100   6.4539   0.8704

The weights are roughly symmetrical and two outliers are highlighted in the treatment group. Since the sample size is large, and the comparison is between independent groups (the women in the treatment group are not the same women as those in the control group), the research question can be addressed using a two-sample t-test.

Estimation for Difference

Difference   95% CI for Difference
0.547        (0.307, 0.787)

Test

Null hypothesis          H₀: μ₁ − μ₂ = 0
Alternative hypothesis   H₁: μ₁ − μ₂ ≠ 0

T-Value   DF    P-Value
4.50      197   0.000

Here p = 0.000, rounded to three decimal places. Since a p-value can never actually be zero, this is usually reported as p < 0.001. The null hypothesis for the test is that there is no difference in the average birthweight of babies born in the two groups. Since p < 0.05, the null hypothesis is rejected and it is possible to claim that the average birthweight in the treated group is significantly greater than in the control group.

The estimated difference is 0.547 so, on average, those born in the treated group are about 0.55 kg heavier. The 95% confidence interval indicates that the average birthweight in the treated group is likely to be between 0.3 and 0.8 kg higher than in the control group.
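The confidence interval can be reproduced from summary values alone. The sketch below uses the means, variances and sample sizes reported in the Excel output for this example and assumes the unequal-variance (Welch) form of the interval; the variable names are illustrative.

```r
# Summary values taken from the Excel output for this example
mean_treat <- 7.001093969; var_treat <- 0.724238748; n_treat <- 100
mean_ctrl  <- 6.45385583;  var_ctrl  <- 0.75755486;  n_ctrl  <- 100

# Estimated difference and its standard error
diff_means <- mean_treat - mean_ctrl
std_error  <- sqrt(var_treat / n_treat + var_ctrl / n_ctrl)

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- (var_treat / n_treat + var_ctrl / n_ctrl)^2 /
  ((var_treat / n_treat)^2 / (n_treat - 1) +
   (var_ctrl / n_ctrl)^2 / (n_ctrl - 1))

# 95% interval: estimate plus or minus (critical t value x standard error)
t_crit <- qt(0.975, df = df_welch)
ci <- diff_means + c(-1, 1) * t_crit * std_error

round(c(difference = diff_means, lower = ci[1], upper = ci[2]), 3)
```

Because the interval lies entirely above zero, it agrees with the hypothesis test in rejecting a difference of zero.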

Since the women in this study were randomised into the two groups, the only differences between the groups should be in the treatment administered, i.e. the drug or placebo. Since the statistical analysis shows that there is a significant difference in the birthweights, this can be attributed to the drug being effective at preventing premature births.

T-test in Minitab

With the weights in separate columns of the spreadsheet, use Stat > Basic Statistics > 2-Sample t... to see the output shown in the example.

T-test in R Studio

The following code can be used to perform the two-sample t-test on this data in R Studio.

> weights <- read.csv("birthweights.csv")
> attach(weights)
> t.test(Treatment, Control)

The mean difference, p-value and confidence interval are produced as shown.

        Welch Two Sample t-test

data:  Treatment and Control
t = 4.4996, df = 197.93, p-value = 1.16e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.3083922 0.7896078
sample estimates:
mean of x mean of y
    7.003     6.454

T-test in Excel

The following output was obtained using the Data Analysis package with the option to 't-Test: Two-Sample Assuming Unequal Variances'.

t-test: Two-Sample Assuming Unequal Variances

Variable 1 Variable 2

Mean 7.001093969 6.45385583

Variance 0.724238748 0.75755486

Observations 100 100

Hypothesised mean difference 0

df 198

t Stat 4.49554651

P(T<=t) one-tail 5.89926E-06

t Critical one-tail 1.652585784

P(T<=t) two-tail 1.17985E-05

t Critical two-tail 1.972017478

The appropriate p-value is the one labelled P(T<=t) two-tail.

4.3.1 Paired data

A special case of comparing two mean values occurs when data are recorded on the same individuals twice. For example, a patient's blood pressure may be recorded before and after some treatment, or their weight measured before and after a diet intervention. This type of data is known as paired data. There tends to be less variability in paired data since observations are made twice on the same individual rather than two separate individuals, thereby reducing the natural variability between individuals. The appropriate parametric test is called a paired t-test.

Example

A sports company wants to compare two materials, A and B, for use on the soles of running shoes. In this example, each of ten participants in a study wore running shoes with the sole of one shoe made from Material A and the sole on the other shoe made from Material B. The sole types were randomly assigned to account for systematic differences in wear between the left and right foot. After three months, the shoes are measured for wear and the amount of wear is reflected in the data shown in Table 4.3 where a higher value indicates a better performance of the material.

Table 4.3: Amount of wear over three months for different materials

Material A 13.2, 8.2, 10.9, 14.3, 10.7, 6.6, 9.5, 10.8, 8.8, 13.3

Material B 14.0, 8.8, 11.2, 14.2, 11.8, 6.4, 9.8, 11.3, 9.3, 13.6

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

These data come from a paired study design. In this analysis, the variability due to the differences between the pairs is removed. For example, one participant may do a lot more running on a rough surface while another may run a lot less and on a more forgiving surface.

Descriptive statistics show that the average wear value is higher in material B.

Descriptive Statistics: Material_A, Material_B

Sample       N    Mean    StDev   SE Mean
Material_A   10   10.63   2.451   0.775
Material_B   10   11.04   2.518   0.796

Estimation for Paired Difference

Mean     StDev   SE Mean   95% CI for μ_difference
-0.410   0.387   0.122     (-0.687, -0.133)

μ_difference: mean of (Material_A - Material_B)

Test

Null hypothesis          H₀: μ_difference = 0
Alternative hypothesis   H₁: μ_difference ≠ 0

T-Value   P-Value
-3.35     0.009

The mean difference (material A – material B) is –0.410 and the p-value from the test is 0.009.

Since p < 0.05, the null hypothesis of no difference in wear between the two materials can be rejected and it is possible to conclude that the performance of material B is significantly greater than for material A in terms of wear over the three month period.

The large variability in the measurements for each material is clear from looking at the data (e.g. the range of data for material A is from 6.6 to 14.3). This relatively large variability obscures the somewhat small difference in wear between the two shoes of each pair (the largest difference between shoes was 1.10). This is why a paired experimental design and subsequent analysis with a paired t-test, where appropriate, is often much more powerful than an unpaired approach.
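The gain from pairing can be seen by analysing the same data both ways. In the sketch below the unpaired (independent samples) test is run only for contrast, since it is not the appropriate analysis for this design; the object names unpaired and paired are illustrative.

```r
# Wear measurements from Table 4.3
material_a <- c(13.2, 8.2, 10.9, 14.3, 10.7, 6.6, 9.5, 10.8, 8.8, 13.3)
material_b <- c(14.0, 8.8, 11.2, 14.2, 11.8, 6.4, 9.8, 11.3, 9.3, 13.6)

# Treating the samples as independent ignores the pairing, so the
# runner-to-runner variability swamps the small difference in means
unpaired <- t.test(material_a, material_b)

# The paired analysis works with the within-runner differences only
paired <- t.test(material_a, material_b, paired = TRUE)

round(c(unpaired = unpaired$p.value, paired = paired$p.value), 4)
```

The unpaired p-value is far above 0.05 while the paired p-value is well below it, even though both analyses estimate the same mean difference.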

The confidence interval estimates the difference in wear as being between 0.133 and 0.687 less for material A. This interval does not span zero, which would indicate no difference, and therefore agrees with the results of the hypothesis test which rejected the null hypothesis of zero difference in wear between the two materials.

Assumptions for paired t-test

Note that the paired t-test is a parametric test. The normality assumption for this test is required for the differences between the paired observations. In the previous example, the distribution of wear on the shoes for material A and material B does not impact the choice of test - it is determined by examining the distribution of the differences in wear. If these differences are approximately normally distributed the paired t-test is performed. Otherwise, the non-parametric hypothesis test, called a Wilcoxon test, should be performed.
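Since the test depends only on the differences, it can be written out directly in terms of them. The following sketch, using the Table 4.3 data and illustrative variable names, computes the differences and then the paired t statistic as a one-sample calculation on those differences.

```r
# Wear measurements from Table 4.3
material_a <- c(13.2, 8.2, 10.9, 14.3, 10.7, 6.6, 9.5, 10.8, 8.8, 13.3)
material_b <- c(14.0, 8.8, 11.2, 14.2, 11.8, 6.4, 9.8, 11.3, 9.3, 13.6)

# The paired t-test is based on the within-pair differences
differences <- material_a - material_b
n <- length(differences)

# It is these values, not the raw measurements, that should look roughly normal
sort(differences)

# Test statistic: mean difference divided by its standard error
t_stat  <- mean(differences) / (sd(differences) / sqrt(n))

# Two-sided p-value with n - 1 degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df = n - 1)

round(c(mean_diff = mean(differences), t = t_stat, p = p_value), 4)
```

A histogram or normal probability plot of differences, for example hist(differences), is a simple way to examine the assumption.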

Paired t-test in Minitab

The Minitab output for the paired study shown in the example was produced using the commands Stat > Basic Statistics > Paired t...

Paired t-test in R Studio

The R code to enter the data and perform the paired analysis is:

> Material_A <- c(13.2,8.2,10.9,14.3,10.7,6.6,9.5,10.8,8.8,13.3)
> Material_B <- c(14.0,8.8,11.2,14.2,11.8,6.4,9.8,11.3,9.3,13.6)
> t.test(Material_A, Material_B, paired=TRUE)

        Paired t-test

data:  Material_A and Material_B
t = -3.3489, df = 9, p-value = 0.008539
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.6869539 -0.1330461
sample estimates:
mean of the differences
                  -0.41

Rather than manually enter the data it can be read in directly from the file runningshoes.csv (https://scholar.hw.ac.uk/download/2021/H-APP/runningshoes.csv).

Paired t-test in Excel

In the Data Analysis package there is an option to perform t-Test: Paired Two Sample for Means and this option performs a paired t-test. The data for Materials A and B should be highlighted to indicate the range in the appropriate Variable Range boxes and the output is shown in Table 4.4.

Table 4.4: Paired t-test output from Excel

t-Test: Paired Two Sample for Means

Variable 1 Variable 2

Mean 10.63 11.04

Variance 6.009 6.3426667

Observations 10 10

Pearson Correlation 0.98822553

Hypothesised mean difference 0

df 9

t Stat -3.3488765

P(T<=t) one-tail 0.00426939

t Critical one-tail 1.83311293

P(T<=t) two-tail 0.00853878

t Critical two-tail 2.26215716

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Example

A group of biologists were interested in measuring the metal content in the wood of poplar trees growing in a polluted area. Measurements were made of 13 poplar clones, once in August and once in November. The concentrations of aluminium (in micrograms of Al per gram of wood) are shown in Table 4.5.

Table 4.5: Concentration of aluminium in poplar clones

Clone August November Difference (November − August)

Columbia River 18.3 12.7 -5.6

Fritzi Pauley 13.3 11.1 -2.2

Hazendans 16.5 15.3 -1.2

Primo 12.6 12.7 0.1

Raspalje 9.5 10.5 1.0

Hoogvorst 13.6 15.6 2.0

Balsam Spire 8.1 11.2 3.1

Gibecq 8.9 14.2 5.3

Beaupre 10.0 16.3 6.3

Unal 8.3 15.5 7.2

Trichobel 7.9 19.9 12.0

Gaver 8.1 20.4 12.3

Wolterson 13.4 36.8 23.4

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

This is a paired study design since both estimates of aluminium content are from the same trees in August and then in November. Assuming that the differences in aluminium content are normally distributed, the appropriate analysis would be a paired t-test. The Minitab output is shown.

Descriptive Statistics: August, November

Sample     N    Mean    StDev   SE Mean
August     13   11.42   3.45    0.96
November   13   16.32   6.89    1.91

The estimate of the difference between August and November is as follows.

Estimation for Paired Difference

Mean    StDev   SE Mean   95% CI for μ_difference
-4.90   7.65    2.12      (-9.52, -0.28)

μ_difference: mean of (August - November)

Test

Null hypothesis          H₀: μ_difference = 0
Alternative hypothesis   H₁: μ_difference ≠ 0

T-Value   P-Value
-2.31     0.040

The results of the paired t-test give p = 0.040.

Since p < 0.05, reject the null hypothesis and conclude that there is evidence that the mean difference is not equal to zero. Therefore, it would appear that the average aluminium levels are higher in November than they were in August. This can be seen from Figure 4.4. Note that this plot can be reproduced using the following R code.

> aluminium <- read.csv("...")
> attach(aluminium)
> boxplot(August, November, names=c("August","November"))

Figure 4.4: Concentration of aluminium in poplar clones

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Paired t-test in Minitab

The paired t-test can be performed using the menu option Stat > Basic Statistics > Paired t...

Paired t-test in R Studio

The following command performs the paired t-test in R.

> t.test(August, November, paired=TRUE)

It produces the following output.

        Paired t-test

data:  August and November
t = -2.3089, df = 12, p-value = 0.03956
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -9.5239348 -0.2760652
sample estimates:
mean of the differences
                   -4.9

Paired t-test in Excel

This can be done using the paired t-test option in the Data Analysis package.

The p-value produced is 0.040 (to three decimal places) as shown.

t-Test: Paired Two Sample for Means

Variable 1 Variable 2

Mean 11.4230769 16.323077

Variance 11.9135897 47.430256

Observations 13 13

Pearson Correlation 0.01669772

Hypothesised mean difference 0

df 12

t Stat -2.3088957

P(T<=t) one-tail 0.01977763

t Critical one-tail 1.78228756

P(T<=t) two-tail 0.03955525

t Critical two-tail 2.17881283

4.4 Tests for comparing more than two groups

The tests described are only appropriate for comparing two groups. Comparison between three or more mean values is done using a test called analysis of variance (ANOVA). ANOVA is a parametric test and is appropriate for use with numerical data that is normally distributed in each of the groups being compared. The equivalent non-parametric test is called a Kruskal-Wallis test.

When data from a study requires comparison of three or more groups of data from the same source, the analysis has to take into account the repeated measures on the same subjects. For example, a group of people may have their fitness levels measured prior to starting an exercise program, then after 10 weeks on the program and at 6 months. There are measurements at three time points on the same participants. This is equivalent to the paired study designs which were analysed using a paired t-test or the non-parametric Wilcoxon test. The appropriate parametric and non-parametric tests for this analysis are repeated measures ANOVA and the Friedman test respectively. These tests can be performed quite simply in Minitab and R. These tests, and the subsequent more complex experimental designs with quantitative outcome measures, are beyond the scope of this course.

4.5 Tests for categorical data

The hypothesis tests described so far are applicable for quantitative data. Commonly in research, categorical data is of interest. With this type of data there are several tests available. The most common is the chi-squared test.

4.5.1 Chi-squared test of independence

The chi-squared test of independence is used to determine if there is a relationship between two categorical variables. The frequency of one categorical variable is compared across the different values of the second categorical variable. For example, a researcher wants to examine the relationship between gender (male vs. female) and smoking (yes vs. no). The chi-squared test of independence can be used to examine this relationship. If the null hypothesis is not rejected, there is no evidence of a relationship between gender and smoking. If the null hypothesis is rejected, the implication is that there is a relationship between gender and smoking (e.g. females are more or less likely to smoke than males).

Consider the data in Table 4.6.

Table 4.6: Relationship between gender and smoking

Smoking status

Gender Smoker Non-smoker Total

Female 30 70 100

Male 80 20 100

Total 110 90 200

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

If the null hypothesis were true, there would be no relationship between gender and smoking. In that case, it is possible to work out what would be expected in the table. There are 200 subjects in the study and half are female and half male. If the null hypothesis were true then half of the smokers would be male and half would be female and the data should then look like that in Table 4.7.

Table 4.7: Expected data under the null hypothesis

Smoking status

Gender Smoker Non-smoker Total

Female 55 45 100

Male 55 45 100

Total 110 90 200

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

The chi-squared test compares the observed data (from the study) to the expected data (under the null hypothesis) and computes the probability of obtaining a difference at least as large as that between the observed and expected data if the null hypothesis is true. As with the other hypothesis tests considered previously, if p < 0.05 then the data is unlikely to be consistent with the null hypothesis being true. This gives evidence to suggest that smoking and gender are not independent. In this example it could be inferred, from comparison of the observed and expected frequencies, that males were more likely than females to be smokers.
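The comparison of observed and expected counts can be sketched in a few lines of Python (standard library only; Python is not one of the packages used in this guide). The sketch uses the counts from Table 4.6, the usual statistic formed by summing (observed - expected)² / expected over the cells, and the fact that for 1 degree of freedom the upper-tail chi-squared probability equals erfc(√(x/2)). No continuity correction is applied, so statistical packages that apply one by default for 2×2 tables may report a slightly different value.

```python
import math

observed = [[30, 70],   # Female: smoker, non-smoker
            [80, 20]]   # Male:   smoker, non-smoker

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count for each cell under independence.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Chi-squared statistic: sum over cells of (O - E)^2 / E.
chi_sq = sum((o - e) ** 2 / e
             for obs_row, exp_row in zip(observed, expected)
             for o, e in zip(obs_row, exp_row))

# For a 2 x 2 table there is 1 degree of freedom, and the upper-tail
# probability of chi-squared with 1 df equals erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi_sq / 2))

print(expected)
print(f"chi-squared = {chi_sq:.3f}, p = {p_value:.2e}")
```

The expected counts reproduce Table 4.7, and the very small p-value reflects how far the observed table is from it.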

Example

Table 4.8 shows the number of buildings in four North American cities which are between 80 and 100 stories (tall buildings) and over 100 stories (the architectural definition of a 'skyscraper'). It is of interest to determine if the two categorical variables (city and height of building) are independent.

Table 4.8: Heights of buildings in North American cities

                                    City
Height of building                  Toronto  Vancouver  New York  LA
Buildings > 100 stories                   2          3        26  12
Buildings > 80 and < 100 stories         10          9        31  16

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .


The following output was produced in Minitab.

Rows: Building   Columns: City

                          Toronto  Vancouver  New York      LA  All

> 100 stories                   2          3        26      12   43
                            4.734      4.734    22.486  11.046

> 80 and < 100 stories         10          9        31      16   66
                            7.266      7.266    34.514  16.954

All                            12         12        57      28  109

Cell Contents
      Count
      Expected count

Chi-Square Test

                  Chi-Square  DF  P-Value
Pearson                4.699   3    0.195
Likelihood Ratio       5.084   3    0.166

2 cell(s) with expected counts less than 5.

The p-value for the chi-squared test is 0.195 so there is no evidence to reject the null hypothesis that city and height of buildings are independent. (This example will be re-analysed in more detail later.)

Chi-squared test for independence in Minitab

This data can be analysed in Minitab using the menu option Stat > Tables > Chi-Square Test for Association...

Since this data is in the form of a cross tabulation, select Summarized data in a two-way table from the drop-down menu in the settings. The four columns of data (one per city), which store the numbers of buildings of each height, can then be added to the box Columns containing the table.

Chi-squared test for independence in R Studio

The simplest way to perform the analysis on the contingency table is to save the data as a csv file and read it directly into R. This produces a data frame as shown.

> buildings <- read.csv("<path to csv file>")
> buildings
      Height.of.Building Toronto Vancouver New.York Los.Angeles
1          > 100 stories       2         3       26          12
2 > 80 and < 100 stories      10         9       31          16


The chi-squared test can then be implemented using the following command.

> chisq.test(buildings[, 2:5])

This selects columns 2-5 of the data frame for the test, giving the following output.

        Pearson's Chi-squared test

data:  buildings[, 2:5]
X-squared = 4.6994, df = 3, p-value = 0.1952

Warning message:
In chisq.test(buildings[, 2:5]) :
  Chi-squared approximation may be incorrect

The warning about the approximation being incorrect will be considered later.

Chi-squared test for independence in Excel

The data has to be set up in Excel as two tables: one of the observed data from the study and the other with the expected values if the null hypothesis is true. For an r × c contingency table, each expected value can easily be computed as

(row total × column total) / n

where n is the sample size (109 in this example). The format is shown in Table 4.9.
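The expected-value formula can be checked with a short Python sketch (standard library only, used here purely for illustration). It uses the observed counts from Table 4.8 and, as a shortcut that is only valid for exactly 3 degrees of freedom, the closed form erfc(√(x/2)) + √(2x/π)·e^(-x/2) for the upper-tail chi-squared probability.

```python
import math

observed = [[2, 3, 26, 12],    # > 100 stories
            [10, 9, 31, 16]]   # > 80 and < 100 stories

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)  # 109

# Expected value for each cell: (row total x column total) / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]

chi_sq = sum((o - e) ** 2 / e
             for obs_row, exp_row in zip(observed, expected)
             for o, e in zip(obs_row, exp_row))

df = (len(observed) - 1) * (len(observed[0]) - 1)  # (2 - 1) * (4 - 1) = 3

# Upper-tail probability for chi-squared with exactly 3 df (closed form;
# this shortcut is only valid when df == 3).
p_value = (math.erfc(math.sqrt(chi_sq / 2))
           + math.sqrt(2 * chi_sq / math.pi) * math.exp(-chi_sq / 2))

for row in expected:
    print([round(e, 3) for e in row])
print(f"chi-squared = {chi_sq:.3f}, df = {df}, p = {p_value:.3f}")
```

The rounded expected values and the p-value can be compared with the figures reported by the statistical packages for this example.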

Table 4.9: Test for independence Excel

Observed data

Height of building        Toronto  Vancouver  New York      LA
> 100 stories                   2          3        26      12
> 80 and < 100 stories         10          9        31      16

Expected data

Height of building        Toronto  Vancouver  New York      LA
> 100 stories               4.734      4.734    22.486  11.046
> 80 and < 100 stories      7.266      7.266    34.514  16.954

p-value                   0.19516

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

The p-value is computed using the function

=CHITEST(B6:E7,B12:E13)

where B6:E7 are the cells containing the observed data and B12:E13 are the cells of expected data. The p-value is then inserted at the chosen cell on the spreadsheet, which is shown at the bottom of Table 4.9.


4.5.2 Chi-squared test for goodness-of-fit

Another use of the chi-squared test is for goodness-of-fit. This test is used to find out how the observed values of a given categorical outcome differ from the expected values. The term 'goodness of fit' refers to the comparison of the observed sample distribution with the expected probability distribution. In a chi-squared goodness-of-fit test, sample data is divided into intervals (or categories) and the numbers of points that fall into each interval are compared with the expected numbers of points in each interval. It is a simple modification of the chi-squared test of association and an example is included here; however, performing this test is outwith the scope of the Higher course.

The following table gives accidental deaths from falls, by month, for the year 2009. It is of interest to know if falls are randomly (i.e. evenly) distributed across the months.

Month Deaths (from falls)

January 1150

February 1034

March 1080

April 1126

May 1142

June 1100

July 1112

August 1099

September 1114

October 1079

November 999

December 1181

Total 13,216

The chi-squared goodness-of-fit test can be used here to determine if the observed data is consistent with deaths from falls being equally distributed across the months. Over the course of the year there were 13,216 reported deaths from falls. If the null hypothesis is true then the expected number of deaths each month would be 13,216 / 12 = 1101.33. The distributional fit that is being tested is that the deaths follow a Uniform distribution. The observed data is compared with the data that would be expected if the null hypothesis was true (i.e. 1101.33 deaths each month) using the chi-squared goodness-of-fit test. Some of the Minitab output is shown.


Observed and expected counts

                           Test                 Contribution
Category   Observed  Proportion  Expected     to Chi-Square
January        1150   0.0833333   1101.33           2.15052
February       1034   0.0833333   1101.33           4.11663
March          1080   0.0833333   1101.33           0.41324
April          1126   0.0833333   1101.33           0.55246
May            1142   0.0833333   1101.33           1.50161
June           1100   0.0833333   1101.33           0.00161
July           1112   0.0833333   1101.33           0.10331
August         1099   0.0833333   1101.33           0.00494
September      1114   0.0833333   1101.33           0.14568
October        1079   0.0833333   1101.33           0.45289
November        999   0.0833333   1101.33           9.50858
December       1181   0.0833333   1101.33           5.76281

Chi-square test

    N  DF   Chi-Sq  P-Value
13216  11  24.7143    0.010

Since p < 0.05, the null hypothesis is rejected and it is possible to conclude that there is evidence that the deaths from falls are not evenly distributed across the months. For example, it appears that there are fewer deaths than expected in November and more than expected in December.
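The goodness-of-fit calculation can also be sketched in Python (standard library only, for illustration). The monthly counts are those in the table of deaths above; the helper `chi2_upper_tail_odd_df` is a name introduced here, and it is an assumption-laden shortcut that handles only an odd number of degrees of freedom (11 in this example), building the tail probability up from the 1 df case.

```python
import math

deaths = [1150, 1034, 1080, 1126, 1142, 1100,
          1112, 1099, 1114, 1079, 999, 1181]

total = sum(deaths)                 # 13216
expected = total / len(deaths)      # same expected count every month
df = len(deaths) - 1                # 11

chi_sq = sum((obs - expected) ** 2 / expected for obs in deaths)


def chi2_upper_tail_odd_df(x, df):
    """P(X >= x) for chi-squared with an odd number of degrees of freedom."""
    if df % 2 == 0:
        raise ValueError("this sketch only handles odd df")
    prob = math.erfc(math.sqrt(x / 2))                      # df = 1
    term = math.sqrt(2 * x / math.pi) * math.exp(-x / 2)    # step from 1 to 3 df
    for k in range(3, df + 1, 2):
        prob += term
        term *= x / k                                       # step from k to k + 2 df
    return prob


p_value = chi2_upper_tail_odd_df(chi_sq, df)
print(f"expected = {expected:.2f}, chi-squared = {chi_sq:.3f}, p = {p_value:.4f}")
```

Most of the statistic comes from the November and December terms, which is consistent with the interpretation given above.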

Chi-squared goodness of fit in Minitab

The test can be performed in Minitab using Stat > Tables > Chi-Square Goodness-of-Fit Test (One Variable)...

The Observed counts box takes the column of data with the number of deaths per month, and there is an option to input the names of the months to the Category names (optional) box, which helps in interpreting the output. Since the hypothesis being tested is whether deaths occur equally across the months, the Equal proportions test should be selected.

Chi-squared goodness of fit in R Studio

The R function chisq.test() can be used. The default for this is the test of equal frequencies (equivalent to the Equal proportions option in Minitab) so the distribution does not need to be specified for this example. The R code and output to do the goodness-of-fit test on the deaths data is shown.

> months <- c(1150, 1034, 1080, 1126, 1142, 1100, 1112, 1099, 1114, 1079, 999, 1181)
> chisq.test(months)

        Chi-squared test for given probabilities

data:  months
X-squared = 24.714, df = 11, p-value = 0.01004


Chi-squared goodness of fit in Excel

The format of the data for Excel should be as shown in Table 4.10. The expected values have to be computed as required and set up as shown.

Table 4.10: Goodness of fit in Excel

Month No. of falls Expected falls

January 1150 1101.33

February 1034 1101.33

March 1080 1101.33

April 1126 1101.33

May 1142 1101.33

June 1100 1101.33

July 1112 1101.33

August 1099 1101.33

September 1114 1101.33

October 1079 1101.33

November 999 1101.33

December 1181 1101.33

0.010036

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Select a cell for the p-value from the test and enter the function

=CHITEST(B2:B13,C2:C13)

where B2:B13 are the cells containing the observed data and C2:C13 are the cells of expected data. The p-value is then inserted at the chosen cell on the spreadsheet (shown here at the bottom of Table 4.10).

4.5.3 Paired analyses and small samples

The equivalent test for paired categorical data is known as McNemar's test. This test would be appropriate, for example, when categorical observations were made on the same individuals over time. A group of young people could be classified as suffering from hay fever or not at age 6. The same group could then be followed up as teenagers and again classified as suffering from hay fever or not. Some who did not have the allergy at age 6 may now have developed it while others may have grown out of it. The cross-tabulation would look something like Table 4.11 and would be analysed using McNemar's test.


Table 4.11: Data illustrating McNemar's test

Hay fever at 13

Hay fever at 6 Yes No

Yes A B

No C D

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
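To give a feel for how McNemar's test uses a table like Table 4.11, the following Python sketch (standard library only) computes the usual McNemar statistic (B - C)² / (B + C), without continuity correction, and refers it to a chi-squared distribution with 1 degree of freedom. The four counts are hypothetical values chosen for illustration; they are not data from this guide.

```python
import math

# Hypothetical counts for the cells A, B, C and D of Table 4.11
# (illustrative values only, not data from the guide).
a, b, c, d = 20, 5, 15, 60

# Only the discordant pairs carry information about change:
# b = hay fever at 6 but not at 13, c = no hay fever at 6 but hay fever at 13.
mcnemar_stat = (b - c) ** 2 / (b + c)

# The statistic is referred to a chi-squared distribution with 1 df,
# whose upper-tail probability is erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(mcnemar_stat / 2))

print(f"McNemar statistic = {mcnemar_stat:.2f}, p = {p_value:.4f}")
```

The concordant cells A and D do not enter the statistic, because individuals whose status did not change provide no evidence about a shift between the two ages.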

The chi-squared test is a large sample test and when the sample is not large enough, warning messages may be printed with the output as mentioned previously. For Minitab this was

* NOTE * 2 cells with expected counts less than 5

and in R:

Chi-squared approximation may be incorrect

Specifically, the sample size requirements relate to the expected values for the test. For the test to be valid, all expected values should be greater than 1 and no more than 20% should be less than 5. Looking at the expected values for the buildings example, two of the eight cells (i.e. 25%) had expected values less than 5. This led to the warnings in the statistical packages. When this happens it is sometimes possible to regroup the data as shown in the following example. If this is not possible then, for a 2×2 table for which a chi-squared test is not valid, Fisher's exact test can be performed; however, performing this test is beyond the scope of this course.
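The rule of thumb for the expected values can be written as a small check. The Python sketch below (standard library only; the function names are introduced here for illustration) applies it to the expected counts for the city table and for the table regrouped by country (Canada: 5 and 19, USA: 38 and 47) that is used in the example that follows.

```python
def chi_squared_is_valid(expected):
    """Apply the rule of thumb: every expected count greater than 1,
    and no more than 20% of expected counts less than 5."""
    cells = [e for row in expected for e in row]
    all_above_one = all(e > 1 for e in cells)
    share_below_five = sum(e < 5 for e in cells) / len(cells)
    return all_above_one and share_below_five <= 0.20, share_below_five


def expected_counts(observed):
    """Expected counts (row total x column total / n) for a two-way table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


by_city = expected_counts([[2, 3, 26, 12], [10, 9, 31, 16]])
by_country = expected_counts([[5, 38], [19, 47]])

print(chi_squared_is_valid(by_city))      # two of eight cells fall below 5
print(chi_squared_is_valid(by_country))   # no cell falls below 5
```

Each call returns whether the rule is satisfied together with the share of cells with an expected count below 5.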

Example

Since the expected values for the buildings example did not allow the chi-squared analysis to be valid, it is possible in this case to regroup the data in a logical way and still perform the test. For example, rather than test the hypothesis that city and height of buildings are independent, the hypothesis could be changed to either 'country (Canada vs. USA) and height of buildings are independent' or 'North American coast (East vs. West) and height of buildings are independent'. Regrouping the data to address the hypothesis about country and height of building would give the data shown in Table 4.12.

Table 4.12: Relationship between country and height of buildings

                                    Country
Height of building                  Canada  USA
Buildings > 100 stories                  5   38
Buildings > 80 and < 100 stories        19   47

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .


The Minitab output from the chi-squared analysis on this condensed set of data is shown.

Rows: Building   Columns: Country

                          Canada     USA  All

> 100 stories                  5      38   43
                           9.468  33.532

> 80 and < 100 stories        19      47   66
                          14.532  51.468

All                           24      85  109

Cell Contents
      Count
      Expected count

Chi-Square Test

                  Chi-Square  DF  P-Value
Pearson                4.465   1    0.035
Likelihood Ratio       4.772   1    0.029

All of the expected values are greater than 5 so the test is valid. In addition, p < 0.05 so the null hypothesis can be rejected and there is evidence to suggest an association between country and height of buildings. Comparison of the observed and expected values in the Minitab output indicates that there are fewer skyscrapers in Canada than expected and more in the USA than expected if the null hypothesis was true.

4.5.4 Z-test for two proportions

Notice that in the previous example, when the table has been reduced to a 2×2 table, the association of interest could actually be addressed by considering whether or not the proportion of skyscrapers differs between Canada and the USA. This hypothesis can be tested using a z-test for two proportions. The hypotheses become:

• Null hypothesis: no difference in proportion of skyscrapers between Canada and USA;

• Alternative hypothesis: some difference between the two proportions.

The data for the test is that, of 24 tall buildings in Canada, 5 are skyscrapers (i.e. 5/24 = 20.8%), compared to 38/85 = 44.7% in the USA. Some of the output from the test in Minitab is shown.

Descriptive Statistics

Sample     N  Event  Sample p
Sample 1  24      5  0.208333
Sample 2  85     38  0.447059

Estimation for difference

Difference  95% CI for Difference
 -0.238725  (-0.432557, -0.044894)

CI based on normal approximation

Test

Null hypothesis         H0: p1 - p2 = 0
Alternative hypothesis  H1: p1 - p2 ≠ 0

Method                Z-Value  P-Value
Normal approximation    -2.11    0.035
Fisher's exact                   0.057

The pooled estimate of the proportion (0.394495) is used for the tests.

Since p < 0.05, reject the null hypothesis and conclude that there is a significantly higher proportion of skyscrapers in the USA than in Canada. Notice that the p-value is the same as that for the chi-squared test.

This approach actually provides more information, which aids in interpretation of the data. The difference is -0.239, which estimates that the proportion of skyscrapers is around 24 percentage points higher in the USA than in Canada, with a 95% confidence interval for the true difference of between 4.5 and 43.3 percentage points.
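The z-test for two proportions can be sketched in Python using only the standard library (for illustration; this is not one of the packages used in this guide). The sketch assumes the usual normal-approximation method: the test statistic uses the pooled proportion under the null hypothesis, while the confidence interval uses the unpooled standard error.

```python
import math
from statistics import NormalDist

x1, n1 = 5, 24     # skyscrapers / tall buildings, Canada
x2, n2 = 38, 85    # skyscrapers / tall buildings, USA

p1, p2 = x1 / n1, x2 / n2
diff = p1 - p2

# Test statistic uses the pooled proportion under the null hypothesis.
pooled = (x1 + x2) / (n1 + n2)
se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
z = diff / se_pooled
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

# The confidence interval uses the unpooled standard error.
se_unpooled = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
z_crit = NormalDist().inv_cdf(0.975)
ci = (diff - z_crit * se_unpooled, diff + z_crit * se_unpooled)

print(f"difference = {diff:.3f}, z = {z:.2f}, p = {p_value:.3f}")
print(f"95% CI: ({ci[0]:.3f}, {ci[1]:.3f})")
print(f"z squared = {z * z:.3f}")  # equals the Pearson chi-squared statistic
```

The square of the z statistic equals the Pearson chi-squared statistic for the 2×2 table, which is why the two tests give the same p-value.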

Z-test for two proportions in Minitab

The test can be performed in Minitab using the menu option Stat > Basic Statistics > 2 Proportions...

Since the data is in summary form in the contingency table, select the option Summarized data in the settings drop-down menu. The data for Number of events and Number of trials can then be entered into the Sample 1 and Sample 2 boxes for Canada and the USA.

Z-test for two proportions in R Studio

The following code can be used to perform the test on the summarised data.

> prop.test(x = c(5, 38), n = c(24, 85), correct = FALSE)

        2-sample test for equality of proportions without continuity correction

data:  c(5, 38) out of c(24, 85)
X-squared = 4.4652, df = 1, p-value = 0.03459
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.43255700 -0.04489398
sample estimates:
   prop 1    prop 2
0.2083333 0.4470588


Z-test for two proportions in Excel

There is no standard test for the comparison of two proportions in Excel. It is possible to code the test into a spreadsheet and some online programs are available (e.g. https://www.xlstat.com/en/solutions/features/comparison-of-two-proportions).

4.6 Hypothesis tests for normality and correlation

There are additional hypothesis tests which relate to some of the methods covered previously.

4.6.1 Normality tests

In 'Interpreting data', the properties of the normal distribution were described. A histogram shows the distribution of data and an approximate bell shape indicates that the data is approximately normal. This assumption of normality is necessary when applying parametric tests to test a hypothesis related to a mean value. The assumption of normality can be formally tested using a hypothesis test.

There are several tests for normality. The null hypothesis is that the data act as a random sample from a normal distribution. The test then looks at how likely data like this would be if it came from such a distribution. A p-value of < 0.05 indicates that the data would be unlikely if the null hypothesis were true and suggests therefore that the assumption of normality is not valid.

The tests for normality of data and for the significance of a correlation coefficient are illustrated here to aid in understanding fully how both methods work in practice; however, performing these tests is not required at Higher level.

Examples

1. The age data in age.csv (https://scholar.hw.ac.uk/download/2021/H-APP/age.csv) looks from the histogram to be normally distributed.

The output from the Anderson-Darling normality test in Minitab is shown.

The graph gives a graphic illustration of how close the age data lies to a perfect normal distribution, which is the line on this plot. The p-value is given in the text box as 0.098. Since p > 0.05, do not reject the null hypothesis and conclude that there is no evidence to suggest that this data is not normally distributed.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

2. The distribution of biceps skinfold measures for patients with Crohn's disease and others, some with coeliac disease, is available in biceps.csv (https://scholar.hw.ac.uk/download/2021/H-APP/biceps.csv). The distribution of these measurements for those with coeliac disease appears to be skewed as shown.

The output from the Anderson-Darling normality test in Minitab is shown.

The p-value for the normality test in Minitab is 0.026. The null hypothesis is therefore rejected, so there is evidence that this data is not normally distributed.

Normality tests in Minitab

The normality test can be performed using the menu option Stat > Basic Statistics > Normality Test...

There are three different tests listed: Anderson-Darling, Ryan-Joiner and Kolmogorov-Smirnov. For each, the null hypothesis is that the data is normally distributed.

Normality tests in R Studio

The simplest normality test to perform in R is the Shapiro-Wilk test. This is similar to the Ryan-Joiner test in Minitab. To perform this test on the biceps skinfold measures from patients with coeliac disease, the following R code can be used to produce the output shown.

> biceps <- c(1.8, 2.0, 2.0, 2.0, 3.0, 3.8, 4.2, 5.4, 9.6)
> shapiro.test(biceps)

        Shapiro-Wilk normality test

data:  biceps
W = 0.78313, p-value = 0.01309

Since p < 0.05, the null hypothesis is rejected, so there is evidence that this data is not normally distributed.

Normality tests in Excel

Excel does not have a normality test in the basic Data Analysis package. It is possible to find online applications written in Excel which will perform the calculations and estimate the p-value for the hypothesis test.

4.6.2 Test for correlation

In 'Correlation and linear regression', the correlation coefficient was defined as a measure of the linear association between two numerical variables. An r value close to +1 or -1 indicates a strong linear relationship. It is possible to use a hypothesis test to determine if there is evidence for or against the null hypothesis of zero correlation between the two variables.

Example

Returning to the data in wound.csv (https://scholar.hw.ac.uk/download/2021/H-APP/wound.csv), there was an approximate linear relationship between the dimension of a wound and the time it took to heal. This linear relationship was quantified by computing the correlation coefficient and the Minitab output for the computation is shown.

Correlation: wound, time

Correlations

Pearson correlation  0.864
P-value              0.000

The p-value is automatically produced with the correlation coefficient and is reported as p < 0.001. Therefore the null hypothesis of no correlation is rejected and there is evidence of a significant, positive linear relationship between wound dimension and healing time.

Test for correlation in Minitab

Pearson's correlation coefficient can be computed in Minitab using the menu option Stat > Basic Statistics > Correlation...

The output, as shown in the example, gives the correlation coefficient and the p-value for the hypothesis test with null hypothesis of no correlation.

Test for correlation in R Studio

Here the cor.test function can be used. The computation of the correlation using R was performed in 'Correlation and linear regression'. This just produced the correlation coefficient. The following code gives the correlation along with the results of the hypothesis test.

> wound_data <- read.csv("wound.csv")
> attach(wound_data)
> cor.test(wound, time)

The output gives p < 0.001, indicating a significant correlation between the two variables.

> cor.test(wound, time)

        Pearson's product-moment correlation

data:  wound and time
t = 10.017, df = 34, p-value = 1.116e-11
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.7481397 0.9290068
sample estimates:
      cor
0.8642507

Test for correlation in Excel

The Correlation function in the Data Analysis package will easily compute the correlation coefficient in Excel. The output is shown in Table 4.13.

Table 4.13: Correlation in Excel

wound time

wound 1

time 0.864251 1
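The Excel output gives only the correlation coefficient. A test of zero correlation can be built from it with the standard result that t = r√(n - 2) / √(1 - r²) follows a t distribution with n - 2 degrees of freedom when the null hypothesis is true. The Python sketch below (standard library only, for illustration) assumes there are n = 36 pairs of wound and time measurements, i.e. 34 degrees of freedom; that sample size is an assumption stated here, not a value given in Table 4.13.

```python
import math

r = 0.864251   # correlation coefficient from Table 4.13
n = 36         # ASSUMPTION: number of (wound, time) pairs, so df = n - 2 = 34

df = n - 2
t_stat = r * math.sqrt(df) / math.sqrt(1 - r ** 2)

print(f"t = {t_stat:.3f} on {df} degrees of freedom")
```

A t value of this size is far beyond the two-sided 5% critical value for 34 degrees of freedom (about 2.03), which is consistent with a p-value below 0.001.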


4.7 Notes on statistical testing

Hypothesis tests are simple to implement using a statistical software package such as Minitab or R Studio. Most of the simpler tests are also available in Excel. The format of the test is always the same:

• determine the null and alternative hypotheses;

• choose the appropriate test based on the type and distribution of the data;

• if the p-value is less than 0.05, reject the null hypothesis and conclude that there is evidence to support the alternative hypothesis;

• if the p-value is not significant (i.e. > 0.05), conclude there is no evidence to reject the null hypothesis.

4.7.1 Errors in Statistical Tests

Since it is not possible to be 100% certain of an event, there is a margin for error each time a statistical test is performed. The two types of error that can occur are known as Type I and Type II errors.

Type I error

In this case the study finds a significant difference (i.e. p < 0.05) but that difference does not really exist. Here the null hypothesis would be rejected when in fact it is true. A Type I error is also known as a false positive result.

Type II error

Here the study finds no significant difference between groups that are in fact different. The null hypothesis would not be rejected when it is actually false. This is a false negative result.

Since the conventional cut-off for significance is p < 0.05, a hypothesis test is significant if the probability of observing data at least as extreme as that obtained, when the null hypothesis is true, is less than 0.05. A small probability like this would suggest that a chance occurrence is unlikely. However, it still could happen. When the null hypothesis is true there is therefore a 1 in 20 chance that a Type I error will occur. This means that, if there is really no difference, there is a 5% chance of finding a significant result every time a significance test is carried out. In certain medical situations this may not be acceptable. For example, in testing a very toxic therapy a researcher would want to be sure that it is effective before allowing patients to be treated. It may be appropriate in such a situation to set a more stringent significance level (e.g. significant only if p < 0.01). Reducing the probability of a Type I error in this way, however, will increase the chance of a Type II error. This is when, for example, one treatment is better than the other but the results of the hypothesis test do not produce a p-value which is less than the chosen cut-off. To reduce the chance of a Type II error, it is important to ensure the sample size of the study is adequate.
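The 1 in 20 chance of a Type I error can be illustrated by simulation. The Python sketch below (standard library only) repeatedly draws two groups from the same normal population, so the null hypothesis is true in every simulated study, and counts how often a two-sided z-test is significant at the 5% level. The group size, number of simulated studies and random seed are arbitrary choices made for illustration, and the population standard deviation is assumed known so that a z-test can be used.

```python
import math
import random

rng = random.Random(2023)          # fixed seed so the run is repeatable
n_per_group = 30
n_studies = 2000
z_crit = 1.96                      # two-sided 5% cut-off for a z-test

false_positives = 0
for _ in range(n_studies):
    # Both groups come from the SAME population, so the null hypothesis is true.
    group_a = [rng.gauss(0, 1) for _ in range(n_per_group)]
    group_b = [rng.gauss(0, 1) for _ in range(n_per_group)]
    diff = sum(group_a) / n_per_group - sum(group_b) / n_per_group
    z = diff / math.sqrt(2 / n_per_group)   # population sd is known to be 1
    if abs(z) > z_crit:
        false_positives += 1               # a Type I error

type_1_rate = false_positives / n_studies
print(f"Proportion of significant results: {type_1_rate:.3f}")
```

The proportion of significant results should be close to 0.05, even though no real difference exists in any of the simulated studies.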

4.7.2 Power and sample size

The success or failure of a study can depend on whether the sample size used is appropriate. In studies involving humans or animals, study participants may be put at risk (or at least inconvenienced). If the study is undersized, scientific knowledge is not being advanced and resources (i.e. time, money, staff) are being diverted from more worthwhile research. If the sample size is too large, more subjects are put at risk (or inconvenienced) than necessary and resources are overstretched.

In order to be 100% certain of a difference between treatments A and B it would be necessary to study every subject to whom either A or B could potentially be given (the target population). It is not feasible (or possible) to study the target population so a representative sample from the target population is used to make inferences about the effects of treatment on the population. It is then expected that the difference between sample mean responses to two treatments would be similar to the true underlying mean difference.

A hypothesis test provides the logical framework to enable the objective assessment of the weight of evidence the data supply either for or against the study hypothesis. Without testing the whole population it is impossible to prove a hypothesis beyond all doubt. The test proceeds by initially stating a hypothesis which is the opposite of the study hypothesis (the null hypothesis, NH) and then trying to disprove it. The study hypothesis (i.e. the hypothesis of real interest) is called the alternative hypothesis (AH).

When the gathered data are used to weigh up the evidence against the null hypothesis (and hence in favour of the study hypothesis), there are two outcomes:

• a significant outcome where there is evidence that the null hypothesis is false, or;

• a statistically non-significant outcome where there is no strong evidence against the null hypothesis.

Therefore, a statistical test can make two types of decision error. Sample size determination is based on controlling these error probabilities. The test outcomes are illustrated in Table 4.14.

Table 4.14: Outcomes of a hypothesis test

                                 Conclusion of hypothesis test
True status                      Non-significant                         Significant
NH true (groups do not differ)   Do not reject NH when NH is true        Reject NH when NH is true
                                 (correct conclusion)                    (incorrect conclusion) Type I error
NH false (groups differ)         Do not reject NH when NH is false       Reject NH when NH is false
                                 (incorrect conclusion) Type II error    (correct conclusion)

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

The two error probabilities are conventionally named as follows:

• α is the probability of a Type I error (false positive result);

• β is the probability of a Type II error (false negative result).

α is called the significance level of the test and is usually fixed at some low value (generally 0.05). The power of the test gives information about the other error probability:

POWER = 1 − β = Prob(reject NH when NH false)

Since α and β are error probabilities, ideally the significance level (α) would be as small as possible (i.e. close to 0) and the power (1 − β) would be as large as possible (i.e. close to 1). A smaller α is usually associated with a larger β (i.e. smaller 1 − β) for a given experimental design and sample size. The type of design and sample size required to simultaneously achieve a small α (approximately 0.05) and a large 1 − β (over 0.8 is generally considered to be acceptable) should be determined in advance of gathering the data.
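The trade-off between α, β and sample size can be illustrated numerically. The Python sketch below (standard library only) uses a normal approximation for a two-sided test comparing two group means with a known common standard deviation; the function name, the difference of 5 units and the standard deviation of 10 are hypothetical planning values chosen only for illustration, and this is not a substitute for a proper sample size calculation.

```python
import math
from statistics import NormalDist


def power_two_group_z(delta, sigma, n_per_group, alpha=0.05):
    """Approximate power of a two-sided z-test comparing two group means.

    delta: true difference in means; sigma: common standard deviation
    (assumed known); n_per_group: sample size in each group.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = delta / (sigma * math.sqrt(2 / n_per_group))
    # Probability that the test statistic falls in either rejection region.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


# Hypothetical planning values: difference of 5 units, standard deviation 10.
for n in (20, 40, 63, 100):
    print(n, round(power_two_group_z(5, 10, n), 3))
```

Under these assumptions the power rises with the sample size, and using a smaller α for the same sample size lowers the power, which is the relationship between α and β described above.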

Sample size calculations require estimates of certain study outcomes which are not reliably known until after the study has been conducted. In a two group study, for example, the sample size computation would require the following points to be addressed.

• What is the main purpose of the study and hence the major endpoint?

• What would be expected to happen in the control group?

• How small a difference between groups would it be important to fail to detect (minimumclinically significant difference)?

• How certain of detecting this difference do you want to be and at what level of significance (i.e.the required power)?

These questions form an important part of the process of determining the required study size. This information can be obtained either from previously published studies or from clinical experience. In some studies it may be necessary to conduct a pilot study prior to designing a full study. The mathematical computations are beyond the scope of this course.
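Although the computations are not required for this course, a rough idea of how the answers to these questions feed into a sample size can be given. The sketch below uses a common normal-approximation formula for comparing two means, n = 2((z for α/2 + z for power) × σ / δ)² per group. The formula choice, the function name and the example numbers are assumptions made for illustration and do not come from the study guide.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Approximate sample size per group for comparing two means.

    delta: minimum clinically significant difference (hypothetical value)
    sigma: expected standard deviation in each group (hypothetical value)
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_power = std_normal.inv_cdf(power)          # required power
    return ceil(2 * ((z_alpha + z_power) * sigma / delta) ** 2)


print(n_per_group(delta=5, sigma=10))    # detect a difference of 5
print(n_per_group(delta=2.5, sigma=10))  # a smaller difference needs far more data
```

Halving the difference to be detected roughly quadruples the required sample size, which is why the minimum clinically significant difference has to be decided before the study starts.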

4.8 Confidence intervals

The sample mean, computed from a set of data, is only an estimate of the population mean. Any parameters estimated in this way will depend on the sample from which they are calculated. Different samples will give different estimated mean values and no single one will be more precise than another.

It is possible to produce a range of plausible values for the mean and thereby produce an interval in which it is relatively certain the true population mean value lies. Such an interval is known as a confidence interval (CI). As with hypothesis testing, confidence intervals cannot be 100% guaranteed to contain the true population mean value. Usually confidence intervals are quoted at a level of 95% certainty.

Confidence intervals are generally based on the mathematical properties of the distribution that the data follows. For data that is normally distributed, approximately 95% of the data points will lie within about 2 standard deviations of the mean (see 'Interpreting data'). This is the idea behind the mathematical computation of a confidence interval for a parameter, although the distributional assumptions are a little more complicated. The mathematics can be ignored at this stage but it is important to have an understanding of the interpretation and inference that can be drawn from confidence intervals.
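The "about 2 standard deviations" rule can be checked directly from the standard normal distribution. This is a small illustrative sketch using Python's standard library; the helper name `coverage` is hypothetical.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1


def coverage(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)


print(round(coverage(2), 4))     # 0.9545
print(round(coverage(1.96), 4))  # 0.95
```

Exactly 2 standard deviations covers slightly more than 95% of a normal distribution; the multiplier 1.96 gives 95%.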

4.8.1 Confidence interval for the difference in means

Previously, blood glucose levels were compared between the two groups, using a two sample t-test. The Minitab output from the test gave a p-value of 0.029, and the conclusion was that the blood glucose levels for those given analgesic B were significantly greater than the levels in those on analgesic A. However, this gives no information on the magnitude of the difference in blood glucose.

The observed difference in mean values between analgesic A and analgesic B from this study was 68.4 − 84.2 ≈ −15.77 mg/kg (the group means are quoted to one decimal place). Therefore, the best estimate of the true population difference between analgesics A and B is that blood glucose levels for analgesic A are, on average, 15.77 mg/kg less than the levels for B. Since this specific value is based on the sample chosen (i.e. the 27 people on this study), choosing a different sample would change this estimated mean difference. However, it is possible to compute a confidence interval for this mean difference and this is shown in the Minitab output as

95% CI for difference: (-29.75, -1.78)

The confidence interval gives a range of values in which it is 95% certain that the true difference lies. The confidence interval is (−29.75, −1.78), which implies that the true population mean difference could be as much as 29.75 mg/kg less on A than B or as little as 1.78 mg/kg less. Therefore, it is possible to be reasonably confident that, on average, people taking analgesic A have a blood glucose level between 1.78 and 29.75 mg/kg lower than those taking analgesic B. The information in this confidence interval is therefore important for the clinical interpretation of the analysis in terms of being able to report the likely effect size that will be seen in practice.
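A confidence interval of this kind is symmetric about the point estimate, so the estimate and the margin of error can be recovered from the two limits. The short sketch below illustrates this with the interval quoted above; the variable names are hypothetical.

```python
# 95% confidence limits for the difference in means (A - B), in mg/kg
lower, upper = -29.75, -1.78

estimate = (lower + upper) / 2  # midpoint of the interval
margin = (upper - lower) / 2    # margin of error either side of the estimate

print(f"estimate = {estimate:.3f} mg/kg")
print(f"margin of error = {margin:.3f} mg/kg")
```

The midpoint agrees, up to rounding, with the observed mean difference of −15.77 mg/kg, and the margin of error of about 14 mg/kg shows how imprecisely the difference has been estimated.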

4.8.2 Relationship between CIs and p-values

The two sample t-test indicated that the blood glucose levels on analgesic A were significantly lower than on analgesic B (since the p-value was less than 0.05). The confidence interval gives a range of values between which it is possible to be relatively sure the true population difference lies. The range of the confidence interval in this case was from −29.75 to −1.78 mg/kg. Since this interval does not include zero, zero is not a plausible value for the difference between the mean values. If the confidence interval did contain zero, this would imply that the true difference between blood glucose levels on the two analgesics could be zero, suggesting that there would be no significant difference between the two groups. Therefore a 95% confidence interval for the difference between two means that does not contain the value of zero will correspond with a p-value of less than 0.05 (i.e. a significant hypothesis test) from the two sample t-test.
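The rule in this section amounts to a simple check of whether zero lies inside the interval. The following sketch expresses that check; the function name is hypothetical and the second interval is an invented example, not a result from the study guide.

```python
def ci_suggests_difference(lower, upper, null_value=0.0):
    """Return True when the null value lies outside the confidence interval."""
    return not (lower <= null_value <= upper)


# Interval from the blood glucose example: zero is excluded
print(ci_suggests_difference(-29.75, -1.78))

# Hypothetical interval that contains zero: no evidence of a difference
print(ci_suggests_difference(-3.0, 4.0))
```

For a 95% interval, a result of True corresponds to a p-value below 0.05 from the matching two-sided test.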

4.8.3 Clinical significance of a statistical analysis

As illustrated previously, a confidence interval gives additional information that may be practically useful. For example, a study was conducted to compare two methods of measuring peak flow (l/min) in 50 patients who had severe asthma. The aim of the study was to see if the two methods were similar. If they were, the differences between the paired measurements would be expected to be close to zero.

The mean difference between the measurements was estimated from the data as 5.3 l/min and the paired t-test produced a p-value of 0.32. Since this is not less than 0.05, the null hypothesis is not rejected so there is no evidence to suggest that there is a significant difference between the methods of measuring peak flow. However, the 95% confidence interval for the mean difference is (−104.7, 115.3). Therefore, it is possible to be 95% sure that the true mean difference between measurements lies within this interval. Measurements by one method could potentially be up to 104.7 l/min less than or 115.3 l/min more than the other.

The results of this analysis indicate that there is no evidence of a difference between the two methods of measuring peak flow. However, given the large variability between measurements seen in the confidence interval, it would seem from a clinical perspective that the two measures do not agree well enough for use in practice. Often in medical studies like this, final recommendations from a study are based on clinical judgment by the researcher. Statistical significance does not always mean clinical significance, and a non-significant result does not always mean that there is no clinically important difference.

Note that equivalence studies like this should not be analysed using a hypothesis test. In an equivalence study it is of interest to know if two methods of measurement give the same results. Such studies should be analysed using the ideas of confidence intervals to get an idea of the variability in the paired measurements. This is done using what is commonly known as 'limits of agreement'.

4.9 Exercises

Go online: Data analysis, interpretation and communication exercises

Q1: The tennis.csv (https://scholar.hw.ac.uk/download/2021/H-APP/tennis.csv) data set contains information on the number of Facebook fans and Twitter followers of 40 randomly selected tennis players.

a) Produce a box plot to compare the distributions of Facebook fans and Twitter followers for these tennis players.

b) Perform an appropriate hypothesis test to determine if there is evidence of a difference in the numbers of fans engaging with the players through Facebook and Twitter.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Q2: The actors.csv (https://scholar.hw.ac.uk/download/2021/H-APP/actors.csv) data set shows the same information on 40 randomly selected actors, combined with the tennis data to give 80 individual observations with a variable Profession indicating actors or tennis players.

a) Explain the missing values in this data.

b) Produce a box plot to visually compare the total numbers of social media followers for actors and tennis players.

c) Perform a hypothesis test to determine if there is a difference in the total number of social media followers between these two professions.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Q3: The data shown in the table is taken from the work of Katkici et al. (Ü. Katkici, M.S. Özkök, M. Örsal: An autopsy evaluation of defense wounds in 195 homicidal deaths due to stabbing, J. Forensic Sci. Soc., 34 (1994), pp. 237-240) and shows results of their survey of defence wounds observed during the post-mortem examination of 195 Turkish victims of all forms of stabbing. Of the 195 victims, 162 were male and 33 female.

Gender    Defence wounds present    Defence wounds absent    Total
Male      57                        105                      162
Female    18                        15                       33
Total     75                        120                      195

It is of interest to determine whether there is any evidence of a difference between the behaviour of males and females during the course of a fatal stabbing.

a) Compute the proportion of males and females with defence wounds observed at post-mortem.

b) Perform an appropriate hypothesis test to determine if there is any evidence of a difference between the behaviour of male and female victims.

c) Write a sentence to explain the confidence interval reported with the hypothesis test output in (b), in the context of this problem.

d) How does this confidence interval agree with the results of the hypothesis test in (b)?

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Q4: The workforce.csv (https://scholar.hw.ac.uk/download/2021/H-APP/workforce.csv) data set shows the labour force participation rates for males and females over time for several countries in Europe and North America. The data for this example were taken from http://wdi.worldbank.org/table/2.2.

a) Perform a hypothesis test to determine if there has been a change in rates of females participating in the labour force across both continents from 2000 to 2013.

b) Perform a hypothesis test to determine if there is evidence of a difference in female labour force participation rates in 2013 between Europe and North America.

c) Comment on the significance of the difference seen in (b).

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Q5: A study was conducted to investigate a claim that playing music to dairy cattle increases their milk production. A herd of dairy cattle was divided into two groups. Music was played to one group and the control group did not have music played. The result of the study was that the average increase in production was 2.5 l/cow over the time period in question. A 95% confidence interval for the difference (music − control) in the mean production was computed to be (1.5, 3.5) l/cow.

What does this mean?

4.10 Summary

Summary

• Statistical analysis considers the probability of an event being due to chance. This forms the basis for hypothesis testing.

• The null hypothesis assumes the opposite of what a researcher wants to investigate, e.g. that two population mean values do not differ. The alternative hypothesis is generally the main aim of the study, e.g. that the mean value in one group is different from (greater than or less than) the mean of the other group.

• Construct and interpret histograms to investigate whether the data is normally distributed or not.

• Construct box plots to provide a visual representation of median, inter-quartile range, outliers etc.

• Compute and interpret descriptive statistics, commenting on the means or medians and the spread of data from the standard deviation or IQR.

• The p-value is the probability of getting data at least as extreme as those actually observed in the experiment if the null hypothesis was true.

• The lower the p-value, the more evidence there is against the null hypothesis.

• If the p-value is less than 0.05, reject the null hypothesis and conclude that there is evidence to support the alternative hypothesis.

• If the p-value is greater than 0.05, do not reject the null hypothesis and conclude that there is insufficient evidence against it.

• A confidence interval is a range of values in which it is likely the true population mean value lies.

• For data that is normally distributed, approximately 95% of the data points will lie within about two standard deviations of the mean.

• A two sample t-test is an appropriate test to perform in order to compare mean values between two independent groups of normally distributed data.

• A paired t-test is a special case of comparing two mean values when the data are recorded on the same individuals or objects twice.

• A z-test for two proportions allows you to compare two proportions. The null hypothesis would be that there is no difference between the two proportions and the alternative hypothesis would be that the proportions differ, i.e. one is significantly greater or less than the other.
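As an illustration of the last point, the sketch below carries out a z-test for two proportions using only Python's standard library. It is a minimal sketch under stated assumptions: it uses the pooled estimate of the proportion in the standard error (statistical packages may offer a separate-estimates option that gives a slightly different z-value), and the function name and counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z-test for a difference between two proportions.

    x1, x2: number of 'events' in each sample; n1, n2: sample sizes.
    Returns the z-value and the two-sided p-value.
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)                        # proportion under the null hypothesis
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))  # pooled standard error
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical data: 45 events out of 100 in one group, 30 out of 100 in the other
z, p_value = two_proportion_z_test(45, 100, 30, 100)
print(f"z = {z:.2f}, p-value = {p_value:.3f}")
```

If the p-value is below 0.05, the null hypothesis of no difference between the proportions is rejected.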

4.11 Resources

Downloads

• actors.csv - https://scholar.hw.ac.uk/download/2021/H-APP/actors.csv

• ages.csv - https://scholar.hw.ac.uk/download/2021/H-APP/ages.csv

• biceps.csv - https://scholar.hw.ac.uk/download/2021/H-APP/biceps.csv

• birthweights.csv - https://scholar.hw.ac.uk/download/2021/H-APP/birthweights.csv

• bloodglucose.csv - https://scholar.hw.ac.uk/download/2021/H-APP/bloodglucose.csv

• buildings.csv - https://scholar.hw.ac.uk/download/2021/H-APP/buildings.csv

• runningshoes.csv - https://scholar.hw.ac.uk/download/2021/H-APP/runningshoes.csv

• tennis.csv - https://scholar.hw.ac.uk/download/2021/H-APP/tennis.csv

• trees.csv - https://scholar.hw.ac.uk/download/2021/H-APP/trees.csv

• workforce.csv - https://scholar.hw.ac.uk/download/2021/H-APP/workforce.csv

• wound.csv - https://scholar.hw.ac.uk/download/2021/H-APP/wound.csv

Links

• https://www.minitab.com/en-us/products/minitab/ - powerful statistical software everyone can use (free trial).

• https://www.r-project.org/ - a free software environment for statistical computing and graphics.

4.12 End of topic test

Go online: Hypothesis testing and confidence intervals topic test

Q6: Treatment A has been used for years in order to control pain after a particular operation, but a study is to be conducted in order to see whether a new treatment, B, is more effective than A. A will continue to be used unless there is sufficient evidence that B is more effective.

What is the alternative hypothesis for this study?

a) Treatment A is more effective than Treatment B

b) Treatment B is more effective than Treatment A

c) Treatment A is not more effective than Treatment B

d) Treatment B is not more effective than Treatment A

e) Treatments A and B differ in effectiveness

Explain whether each of the following statements is true or false.

Q7: Rejecting the null hypothesis when it is true is a Type I error.

a) True

b) False

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Q8: Prob(Type I error) = significance level of the test.

a) True

b) False

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Q9: Power of a test should be 5%.

a) True

b) False

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Q10: β is the probability of accepting the NH when it is false.

a) True

b) False

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Q11: Larger sample size increases the chance of a Type II error.

a) True

b) False

Unit 2 Topic 5

Statistics and probability: End of section test

Go online: Statistics and probability section test

A group of friends is listed by surname: Evans, Kanias, Lopez, Mason, Nalty, Ochoa, Patel, Quinn, Smith, Trott, Usman, Valdo, White, Xiang.

Sub-groups of the friends enjoy the following sporting activities:

• football: Kanias, Lopez, Ochoa, Quinn, Smith, Trott;

• tennis: Evans, Kanias, Patel, Valdo;

• volleyball: Kanias, Lopez, Patel, Trott, White.

Q1: Illustrate where their preferences for each of these sports lie within a Venn diagram.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Q2: If names of the friends are selected at random, what is the probability that:

a) the selected friend plays football?

b) the selected friend plays all three sports?

c) the selected friend plays tennis and volleyball?

d) the selected friend plays none of the sports?

You are off to football practice and love playing in goal, but that depends on who is coaching today. From your past training sessions you know that:

• with coach A the probability of being selected to play in goal is 0.5;

• with coach B the probability of being selected to play in goal is 0.3;

• coach A is in charge of training in about 6 out of every 10 games.

Q3: Illustrate the situation using a tree diagram.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Q4: What is the probability you will end up getting to play in goal today?

State whether the following variables are quantitative or qualitative.

Q5: Gender

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Q6: Height

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Q7: Pulse rate

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Q8: Smoker

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Q9: Weight

State whether the following variables are nominal or ordinal.

Q10: Customer satisfaction level

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Q11: Exam grade

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Q12: Gender

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Q13: Hair colour

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Q14: Marital status

State whether the following variables are continuous or discrete.

Q15: Height of a building

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Q16: Number of pupils in a class

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Q17: Shoe size

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Q18: Speed of a train

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Q19: Weight of a car

A quality engineer for an airline supply company wants to decrease the number of fuselage panels that are rejected because of paint flaws. Data from 22 panels is shown.

Panel  Flaw type  Shift      Panel  Flaw type  Shift
1      Other      Day        12     Peel       Day
2      Scratch    Day        13     Smudge     Day
3      Scratch    Day        14     Peel       Day
4      Peel       Day        15     Peel       Night
5      Smudge     Night      16     Scratch    Night
6      Other      Night      17     Peel       Night
7      Scratch    Night      18     Peel       Night
8      Peel       Night      19     Scratch    Weekend
9      Other      Weekend    20     Peel       Weekend
10     Scratch    Weekend    21     Smudge     Weekend
11     Scratch    Weekend    22     Peel       Weekend

Q20: Which is the most common type of paint flaw?

a) Peel

b) Scratch

c) Smudge

d) Other

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Q21: Which is the shift on which most flaws occurred?

a) Day

b) Night

c) Weekend

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Q22: Draw a pie chart to illustrate this data.

A statistical method called capture/recapture was used across several large cities to estimate the number of pigeons. Pigeon droppings can cause a disease called histoplasmosis, which affects the lungs, although most people who contract it suffer no symptoms and it cannot be transmitted between people.

This data was then matched with the incidence of histoplasmosis taken from medical surveillance data with the following results.

Correlation: pigeons, histoplasmosis

Correlations

Pearson correlation    [value not legible in the source]

P-value                [value not legible in the source]

Q23: What kind of correlation is shown between the number of pigeons and incidence ofhistoplasmosis?

a) Positive

b) Negative

c) No correlation

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Q24: How strong is the correlation?

a) Significant

b) Insignificant

c) No correlation

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Q25: From the information provided, suggest a possible source of error which should be considered when interpreting these results.

A clothing manufacturer wants to mass produce non-iron (anti-wrinkle) cotton t-shirts. As part of the production process, they try to understand how formaldehyde (added to help hold the garment dye) impacts wrinkle resistance. 32 pieces of cotton cellulose produced with differing formaldehyde concentrations were measured for their durable press (DP) rating. Higher DP is associated with better quality and less wrinkling.

A regression analysis was performed to model the durable press rating from formaldehyde concentration with the following results.

Call:
lm(formula = Rating ~ Conc)

Coefficients:
(Intercept)         Conc
     1.4255       0.3234

Q26: What is the slope parameter?

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Q27: Use the model to estimate the durable press rating at a formaldehyde concentration of 5.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Q28: Suggest any possible confounding factors that may have an influence on the durable press ratings.

A data set contains test scores in Maths and English for 93 students applying for further education placements. It is of interest to determine if this particular group of students tended to score higher on the English or Maths tests.

Q29: State the name of the statistical test which should be used to analyse this data and explain why this test is appropriate.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

An appropriate test was performed to address the question of interest and some of the output is shown.

data:  Maths and English
t = 5.6101, df = 93, p-value = 2.07e-07
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 3.199232 6.705024
sample estimates:
mean of the differences
               4.952128

Q30: Complete the following paragraph that summarises the results of the study using the numbers and words provided.

Since p < 0.001, the _______ hypothesis is _______ and there is evidence to suggest that the mean scores for Maths and English are significantly _______. The confidence interval suggests that on average the Maths scores are between _______ % and _______ % _______ than the English scores.

Numbers and words: 3.2, 5.0, 5.6, 6.7, accepted, alternative, different, higher, lower, null, rejected.

Results of a Scotland-wide study have shown that the Pfizer/BioNTech COVID-19 vaccine provides 79% protection against the Delta coronavirus variant two weeks after the second jab, while the second dose of the Oxford/AstraZeneca vaccine offers 60% protection. The data from the study are shown in the following table.

Vaccine               No. on study    No. of Covid-19 cases    % protection
Pfizer/BioNTech       142             30                       79
Oxford/AstraZeneca    178             71                       60

A hypothesis test is performed to determine if there is a statistically significant difference in protection between the two vaccines with the following results.

Descriptive Statistics

Sample      N     Event   Sample p
Sample 1    142   112     0.788732
Sample 2    178   107     0.601124

Estimation for Difference

Difference   95% CI for Difference
0.187608     (0.089208, 0.286008)

Test

Null hypothesis          H0: p1 - p2 = 0
Alternative hypothesis   H1: p1 - p2 ≠ 0

Method                 Z-Value   P-Value
Normal approximation   3.74      0.000
Fisher's exact                   0.000

Q31: Is there evidence of a difference in the proportions of those who are protected by the two vaccines?

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Q32: The confidence interval indicates that protection by Pfizer/BioNTech is achieved in between approximately _______ % and _______ % more people than AstraZeneca.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Q33: The authors did point out that, given the observational nature of these data, the estimates of vaccine effectiveness needed to be interpreted with caution.

Suggest a reason why the authors added a disclaimer to their report.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Q34: What could be done to avoid the problem which led to the disclaimer?

Glossary

Alternative hypothesis

a hypothesis that says there is a difference or relationship between the two variables being tested - it is the hypothesis that the researcher is trying to find evidence for

Categorical data

information that may be divided into groups, e.g. race or sex

Central limit theorem

states that the distribution of the sample mean approximates a normal distribution as the sample size becomes larger (say >50) regardless of the population distribution shape

Continuous

data that can take any numerical value within a range and is usually measured, e.g. height or weight

Correlation

quantifies the degree to which two numerical variables are linearly related

Discrete

data that can only take certain values or whole numbers, e.g. the number of students in a class

Event

an outcome or defined collection of outcomes of an experiment which is therefore a subset of the sample space.

Intercept

the point at which the regression line crosses the y-axis (i.e. the estimated value of y when x = 0)

Likert scale

a point scale which produces ordinal categorical data used in surveys to record opinions, usually on a 5 or 7-point scale with options that range from one extreme to another, e.g. how satisfied were you with your meal tonight?

Nominal

categorical data that is used for naming or labelling variables without any quantitative value

Non-parametric test

methods of statistical analysis that do not require the data to follow a particular distribution (used particularly if the data is not normally distributed)

Normal data

numerical data that come from a normal distribution and have a characteristic bell shape when displayed on a histogram

Null hypothesis

a hypothesis that says there is no difference or relationship between the two variables being tested - it is the hypothesis that the researcher is trying to disprove

Numerical data

information which has meaning as a measurement, e.g. age, weight, height

Ordinal

categorical data where the variables have naturally ordered categories but the distances between categories are not known, e.g. data recorded on a Likert scale

Outcome

the result of carrying out a trial.

Paired data

this data arises from the same individual at different points in time, e.g. a group of children have their heights measured at age 5 and again one year later

Paired t-test

a hypothesis test used when it is of interest to examine the mean difference between two measurements of the same variable on the same subject

Parametric test

these hypothesis tests assume a normal distribution of data values (i.e. a bell-shaped curve)

Power

the probability of making a correct decision in rejecting the null hypothesis when it is false(equal to 1 − β)

P-value

a number describing how likely it is that data at least as extreme as yours would have occurred by random chance if the null hypothesis were true

Qualitative data

another term for categorical data

Quantitative data

another term for numerical data

Regression

a statistical method that helps analyse and understand the relationship between two numerical variables by estimating the equation of the best fitting straight line

Sample space

the list of all possible outcomes of a trial.

Significance level

the probability of rejecting the null hypothesis when it is true (i.e. a Type I error, usually denoted α)

Skewed data

asymmetric data which deviates from the symmetrical bell curve typically with a tail in the distribution more toward high or low values

Slope

computed in the regression analysis, it estimates how much y increases (or decreases) as x increases by 1 unit

Tree diagram

a visual aid in computing probabilities.

Trial

a process or experiment that yields information - if it is repeated several times then it gives a set of observations.

Two sample t-test

a hypothesis test used to determine if there is a significant difference between the means of two groups that are independent

Type I error

occurs if the null hypothesis is rejected when it is true (also known as a false positive; its probability is denoted α)

Type II error

occurs if the null hypothesis is not rejected when it is false (also known as a false negative; its probability is denoted β)

Universal set

all of the values that need to be considered in an experiment.

Venn diagram

a way of sorting groups of data and visually representing probabilities.

Z-test

a statistical test that can be used to determine whether there is a difference between two proportions

Answers to questions and activities for Unit 2

Topic 1: Basic probability

Tree diagrams: Quiz (page 9)

Q1:

a) The sample space is FFF, FFM, FMF, FMM, MFF, MFM, MMF, MMM where F is female and M is male. The following is a tree diagram illustrating the jury duty selection process.

b) The probability of each outcome in the sample space is found by multiplying the probabilities associated with each branch of that outcome together. In this case, each branch is equally likely so all outcomes are equally likely, i.e. P(outcome) = 1/2 × 1/2 × 1/2 = 1/8 = 0.125.

c) The outcomes where one male is selected are FFM, FMF or MFF.

P(one male selected) = 1/8 + 1/8 + 1/8 = 3/8 = 0.375.

d) The outcome where three females are selected is FFF.


P(three females selected) = $\frac{1}{8} = 0.125$.
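The calculations in parts b) to d) can be checked by listing the sample space in R. The following is a minimal sketch rather than part of the original answer; the object names (jurors, n_male and so on) are illustrative, and it assumes, as in part b), that each juror is equally likely to be female or male.

```r
# Enumerate the sample space for three jurors (illustrative object names)
jurors <- expand.grid(first = c("F", "M"), second = c("F", "M"),
                      third = c("F", "M"), stringsAsFactors = FALSE)

# Each outcome has probability 1/2 x 1/2 x 1/2 (product along the branches)
jurors$prob <- (1 / 2)^3

# Count the number of males in each outcome
n_male <- rowSums(jurors[, c("first", "second", "third")] == "M")

p_one_male     <- sum(jurors$prob[n_male == 1])   # FFM, FMF, MFF
p_three_female <- sum(jurors$prob[n_male == 0])   # FFF
```

Printing p_one_male and p_three_female allows a comparison with the values 0.375 and 0.125 obtained above.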

Q2:

a) The sample space is S12, S21, 1S2, 12S, 2S1, 21S where S is the supermarket own brand, 1 is well known brand 1 and 2 is well known brand 2. The following is a tree diagram illustrating the selection process.

b) The probability of each outcome in the sample space is found by multiplying together the probabilities associated with each branch of that outcome. In this case, each outcome is equally likely so P(outcome) = $\frac{1}{3} \times \frac{1}{2} = \frac{1}{6} = 0.167$.

c) Two outcomes in the sample space have the supermarket own brand ranked first: S12 and S21.

P(supermarket own brand ranked first) = $\frac{1}{6} + \frac{1}{6} = \frac{1}{3} = 0.333$.
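The six rankings and the probability in part c) can also be generated in R. This is a sketch under the assumption stated in part b) that every ranking is equally likely; the object names are illustrative.

```r
# Build every ordering of the three brands (S = own brand, 1 and 2 = well known brands)
brands <- c("S", "1", "2")
rankings <- character(0)
for (a in brands) {
  for (b in setdiff(brands, a)) {
    last <- setdiff(brands, c(a, b))          # the one brand not yet ranked
    rankings <- c(rankings, paste0(a, b, last))
  }
}

p_each      <- 1 / length(rankings)           # equally likely rankings
p_own_first <- sum(substr(rankings, 1, 1) == "S") * p_each
```

The vector rankings should contain the same six outcomes as the sample space in part a).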


Venn diagrams: Quiz (page 11)

Q3:

Q4:

a)

b) 10 pupils play neither instrument.

c) P(pupil plays neither instrument) = $\frac{10}{50} = \frac{1}{5} = 0.2$


Q5:

a) • Start with the number of people participating in all three activities, 24, and enter this where all three circles overlap.

• Now consider the number who swim and cycle, 32. Subtract 24 from 32 and enter the answer, 8, in the appropriate part of the diagram.

• Similarly, the number who run and swim is 42. Subtract 24 from 42 and enter the answer, 18, in the diagram.

• Now consider the number who run and cycle, 38. Subtract 24 from 38 and enter the answer, 14, in the diagram.

• Calculate the number who only swim: 56 − (8 + 24 + 18) = 6

• Calculate the number who only run: 69 − (18 + 24 + 14) = 13

• Calculate the number who only cycle: 53 − (14 + 24 + 8) = 7

b) The total number of students responding to the survey is 6 + 8 + 7 + 24 + 18 + 14 + 13 = 90.

c) P(student participates in all three activities) = $\frac{24}{90} = \frac{4}{15} = 0.267$, or approximately 27%.
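The arithmetic for the Venn diagram can be reproduced in R as a check. This is only a sketch using the counts quoted in the answer; the object names are illustrative.

```r
# Counts quoted in the answer
all_three  <- 24
swim_cycle <- 32 - all_three    # swim and cycle but not run
run_swim   <- 42 - all_three    # run and swim but not cycle
run_cycle  <- 38 - all_three    # run and cycle but not swim

# Single-activity regions: activity total minus the three overlapping regions
swim_only  <- 56 - (swim_cycle + all_three + run_swim)
run_only   <- 69 - (run_swim + all_three + run_cycle)
cycle_only <- 53 - (run_cycle + all_three + swim_cycle)

regions <- c(swim_only, swim_cycle, cycle_only, all_three,
             run_swim, run_cycle, run_only)
total <- sum(regions)             # number of students responding
p_all_three <- all_three / total  # probability for part c)
```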


Q6:

a)

b) 7 students speak Italian only.

c) P(student speaks French but not Italian) = $\frac{25}{70} = \frac{5}{14} = 0.357$.

Carrying out basic calculations involving combination of events: Quiz (page 15)

Q7: The following is a tree diagram illustrating test results where P is a pass and F is a fail.


Q8: The outcome where a student passes both tests is PP, so P(student passes both tests) = 0.6 × 0.9 = 0.54.

Since the probability that a student passes the class is 0.54, it can be expected that the pass rate for the class will be about 54%.

Q9: The following is a tree diagram illustrating the detection of weapons on passengers at airports, where A is airport A, B is airport B, C is airport C, D means a weapon is detected and N means no weapon is detected.

Q10: The outcomes where a weapon is detected are AD, BD and CD, so P(weapon is detected) = 0.45 + 0.15 + 0.08 = 0.68.

a) P(airport A and weapon is detected) = 0.45

P(airport A given weapon is detected) = $\frac{0.45}{0.68} = 0.662$

b) P(airport B and weapon is detected) = 0.15

P(airport B given weapon is detected) = $\frac{0.15}{0.68} = 0.221$

c) P(airport C and weapon is detected) = 0.08

P(airport C given weapon is detected) = $\frac{0.08}{0.68} = 0.118$
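The three conditional probabilities all come from dividing a joint probability by P(weapon is detected). A short R sketch of that step, using the values quoted above and illustrative object names, is:

```r
# Joint probabilities P(airport and weapon detected), as quoted in the answer
joint <- c(A = 0.45, B = 0.15, C = 0.08)

p_detected <- sum(joint)                   # P(weapon is detected)
p_airport_given_detected <- joint / p_detected

round(p_airport_given_detected, 3)
```

Because the three conditional probabilities cover every way a weapon can be detected, they should add up to 1.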


Q11: The following is a tree diagram illustrating card selection where F is a face card and N is a non-face card.

Basic probability topic test (page 22)

Q12: This question can be answered using a tree diagram to represent the experiment where G is a green marble and B is a blue marble.

a) P(two green marbles) = 0.577

b) The probability of randomly choosing one green and one blue marble is the sum of the probabilities of choosing a green and then a blue marble (outcome GB), and a blue and then a green marble (outcome BG), so P(one green marble and one blue marble) = 0.192 + 0.192 = 0.384.

Q13: This question can be answered using a tree diagram to show all of the possible scenarios where W is a win, D is a draw and L is a loss.

The outcomes resulting in at least 3 points are WW, WD, WL, DW, LW.

P(team gains at least 3 points over two matches)

= (0.1 × 0.1) + (0.1 × 0.6) + (0.1 × 0.3) + (0.6 × 0.1) + (0.3 × 0.1)
= 0.01 + 0.06 + 0.03 + 0.06 + 0.03
= 0.19
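The same total can be obtained by enumerating all nine two-match outcomes in R. This sketch assumes the usual football scoring of 3 points for a win, 1 for a draw and 0 for a loss (consistent with the outcomes listed above, but not restated in this answer) and treats the two matches as independent, as the multiplication above does. The object names are illustrative.

```r
# Probabilities used in the answer and the assumed points per result
p      <- c(W = 0.1, D = 0.6, L = 0.3)
points <- c(W = 3, D = 1, L = 0)      # assumption: standard football scoring

# All nine outcomes for two matches
outcomes <- expand.grid(first = names(p), second = names(p),
                        stringsAsFactors = FALSE)
outcomes$prob   <- unname(p[outcomes$first] * p[outcomes$second])
outcomes$points <- unname(points[outcomes$first] + points[outcomes$second])

p_at_least_3 <- sum(outcomes$prob[outcomes$points >= 3])
```

An equivalent shortcut is 1 − (0.9 × 0.9), since at least 3 points requires at least one win.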


Q14:

a) 1, 4, 9

b) 4

c) 3, 5, 7, 11

Q15: Let M, P and C represent maths, physics and chemistry courses respectively.

• Step 1: Work backwards from the number of students taking all three courses (12).

• Step 2: Calculate the number of students taking maths and physics but not chemistry (26 − 12 = 14).

• Step 3: Calculate the number of students taking maths and chemistry but not physics (28 − 12 = 16).

• Step 4: Calculate the number of students taking physics and chemistry but not maths (18 − 12 = 6).

Now use these values to calculate the number of students taking only maths, only physics and only chemistry, as follows.

• Step 5: Maths only: 92 − (14 + 12 + 16) = 50.

• Step 6: Physics only: 48 − (14 + 12 + 6) = 16.

• Step 7: Chemistry only: 54 − (16 + 12 + 6) = 20.

We can use a Venn diagram to represent this information.

a) Altogether there are 50 + 14 + 16 + 16 + 12 + 6 + 20 = 134 students.

b) Students only studying maths + students only studying physics + students only studying chemistry = 50 + 16 + 20 = 86 students.

c) P(student studies maths, physics and chemistry) = $\frac{12}{134} = \frac{6}{67} = 0.0896$ (approx 9%).
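As a cross-check on the region-by-region working, the total can also be found with the inclusion-exclusion rule, which is a different route from the steps above. The following R sketch uses the course totals and overlaps quoted in the steps; the object names are illustrative.

```r
# Totals and overlaps quoted in the steps above
n_M <- 92; n_P <- 48; n_C <- 54        # maths, physics, chemistry
n_MP <- 26; n_MC <- 28; n_PC <- 18     # pairwise overlaps (including all three)
n_MPC <- 12                            # all three courses

# Inclusion-exclusion: add singles, subtract pairs, add back the triple
total <- n_M + n_P + n_C - n_MP - n_MC - n_PC + n_MPC

# Students taking exactly one course
only_one <- c(maths     = n_M - n_MP - n_MC + n_MPC,
              physics   = n_P - n_MP - n_PC + n_MPC,
              chemistry = n_C - n_MC - n_PC + n_MPC)

p_all_three <- n_MPC / total
```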


Topic 2: Interpreting data

Types of data: Quiz (page 29)

Q1: b) Quantitative

Q2: b) Quantitative

Q3: b) Quantitative

Q4: a) Qualitative

Q5: a) Qualitative

Q6: a) Nominal

Q7: b) Ordinal

Q8: a) Nominal

Q9: a) Nominal

Q10: b) Ordinal

Q11: a) Continuous

Q12: a) Continuous

Q13: b) Discrete

Q14: b) Discrete

Q15: a) Continuous

Interpreting data exercises (page 48)

Q16: The number and proportion of subjects reporting each activity level can be computed as shown.

1 � ������ ��� ��

23 1 �� (��� ?��!�

4 #$ 8$ =

5 � ������ ����1� ��� �� I � �� ����� "�� � ��� �� � F����F

6 � ���� ���������J$66 I ������ ��� �� � ��� �!� �" ����

78 1 �� (��� ?��!�

9 #&�698=# 89�6&#=9 =�7=6$$


Q17: The pie chart is produced using the following code.

> ���������$.��������

In order to produce a more detailed pie chart, the following code adds labels to the chart indicating the percentage of subjects in each category.

1 � ������������� ���������J$66 �6� I ����� ��� �!� � 6 ��

2 � ������� ��� ��K�������� I ���� ��� �! ��!��

3 � ���� ����� ��� ��4�#5����

4 � ��� ������ �� 4�#5������������ � ������?�" ��� � �� ��� �

������

Q18: The histogram of pulse rates shown is obtained using the following code.

> �� ��-�� �, .��2���� !, ���23=� ��5� �7 ��� � �� 3�

It is possible to experiment with other colours using the col= option.

Q19: Descriptive statistics can be computed for levels of a categorical variable using the aggregate function in R as shown.

> aggregate(Pulse, list(Activity), summary)
   Group.1 x.Min. x.1st Qu. x.Median x.Mean x.3rd Qu. x.Max.
1    A lot  58.00     64.00    70.00  71.57     76.00  92.00
2 Moderate  54.00     66.00    70.00  72.74     80.00 100.00
3   Slight  62.00     74.00    82.00  79.56     90.00  90.00

Means and standard deviations can be calculated by group in the same way by replacing summary with mean or sd, for example.

> aggregate(Pulse, list(Activity), sd)
   Group.1         x
1    A lot  9.625858
2 Moderate 10.984689
3   Slight 10.477489

Q20: The data can be imported to R Studio using the read.csv command as usual. Alternatively, the data can be entered manually using the following commands.

> month <- c("January","February","March","April","May","June","July","August","September","October","November","December")
> rainfall <- c(99,82,93,58,24,8,19,27,22,24,66,82)

These two variables can be combined in a data frame using the command

> weather<-data.frame(month,rainfall)

The bar chart shown can be constructed using the following code. Note that the option las=2 rotates the names of the months so that they are perpendicular to the x-axis; otherwise the full month names are too long for the graph.

> ��������7��, ��� �52�����, ���23&��7�� ����3, � 2", .��23�@ �5���3�

Note that you can see a complete list of the 657 colours in R Studio by typing colors() in the console window.

Q21: The scatter plot shown can be constructed using the following code.

1 � ��� �?<(%�0���#6$#� ��D�?<(%�0�� �#6$#����� � ������,�� �������

� �� ?<(% ��� ��� � ���

Note: to change the default circles to dots, use pch=19 in the code. The code

1 � ��� �?<(%�0���#6$#� ��D�?<(%�0�� �#6$#����� � ������,�� �������

� �� ?<(% ��� ��� � ��� ����$=� ����� ��� ���

will produce the same plot with red dots rather than black circles.


Q22: The table of the number and percentage of each hot dog type is obtained using the following code.

1 � ��������

23 2" (� @��� �

4 #6 $9 $9

5 � ������ ��������

6 � ��� �! ����������� ���������J$66 �$� I ��� �� � K� � $ ��

7 � ��� ���������������� �!� I ������ ����� �� � ��� �!�

8 � ���

9 ���� ��� �!

10 2" #6 &9�6

11 (� $9 &$�'

12 @��� � $9 &$�'

Q23: The box plot shown was constructed using the following code.

1 � ������ �0����� : ��� � �� � � �� ��!� � ���� � ������ ���� �

�0������� ���� � �2����� �" ������ �� �� �" �� ��!��

�������� &��


Q24: The scatter plot shown can be constructed using the following code.

1 � ��� �?������0����� �������,�� ������� � �� ������ ��� ������

��� � �� ����$=� �������!� �$��


Interpreting data topic test (page 53)

Q25: b) Qualitative

Q26: b) Qualitative

Q27: b) Qualitative

Q28: a) Quantitative

Q29: a) Quantitative

Q30: a) Quantitative

Q31: c) Pie chart or e) Bar chart.

This is a categorical variable since each voter would have either voted for a political party or not. Categorical data can be displayed graphically using a pie chart or bar chart.

Histograms and box plots are for displaying a single numerical variable and a scatter plot shows the relationship between two numerical variables.

Q32: b) Median and d) Interquartile range.

Any skewed distribution (i.e. skewed to the left or to the right) should be summarised using the median as a measure of location and the interquartile range as a measure of spread.

Q33: a) A histogram of the data would be symmetric about 15, b) 95% of the data would lie between 8 and 22 and d) The median value of the data would be approximately 15.

Approximately 95% of the area under the curve of a normal distribution lies within two standard deviations of the mean. The interquartile range is the middle 50% of the data.
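The two-standard-deviation rule can be checked numerically in R. This sketch assumes a standard deviation of 3.5, inferred from the interval 8 to 22 being the mean of 15 plus or minus two standard deviations; that value and the object names are assumptions for illustration.

```r
mu    <- 15
sigma <- (22 - 8) / 4      # assumed: 8 and 22 are mu - 2*sigma and mu + 2*sigma

# Proportion of a normal distribution lying between 8 and 22
p_within <- pnorm(22, mean = mu, sd = sigma) - pnorm(8, mean = mu, sd = sigma)

# Lower and upper quartiles, which bound the middle 50% of the data
quartiles <- qnorm(c(0.25, 0.75), mean = mu, sd = sigma)
```

The value of p_within is slightly above 0.95, which is why the rule is stated as approximate, and the quartiles sit symmetrically either side of 15.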


Topic 3: Correlation and linear regression

Correlation: Quiz (page 65)

Q1: d) high income people tend to be heavier than low income people, on average.

Q2: e) correlation is not an appropriate measure of association.

Linear regression: Quiz (page 75)

Q3: a) True

Q4: b) False

Q5: b) False

Q6: b) False

Q7: a) True

Correlation and linear regression exercises (page 76)

Q8: Histograms of both variables to visually inspect normality can be produced using the following code.

> hist(age)
> hist(length)

It would also be appropriate to perform normality tests on these variables.

> shapiro.test(age)

        Shapiro-Wilk normality test

data:  age
W = 0.9769, p-value = 0.1355

> shapiro.test(length)

        Shapiro-Wilk normality test

data:  length
W = 0.97324, p-value = 0.07727

Since both p-values are > 0.05, the null hypothesis of normality is not rejected; therefore, the appropriate measures of location and spread are the mean and standard deviation, as shown.

> mean(age)
[1] 70.94048
> sd(age)
[1] 14.18497
> mean(length)
[1] 5.85428
> sd(length)
[1] 1.728921

Q9: The following scatter plot was produced using:

> plot(age, length, main="Growth of foetus", xlab="Age (days)", ylab="Length (cm)")

Growth rate of a healthy foetus

Q10: The correlation coefficient and test for NH: no correlation, can be computed using:

> cor.test(age,length)

and the output is shown.

        Pearson's product-moment correlation

data:  age and length
t = 58.621, df = 82, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.9819383 0.9924014
sample estimates:
      cor
0.9882783

Q11: The least squares linear regression line is shown as:

> lm(length~age)

Call:
lm(formula = length ~ age)

Coefficients:
(Intercept)          age
    -2.6909       0.1205

and can be superimposed on the original data to produce the following plot using:

> plot(foetus, main="Fitted line plot (length = -2.691 + 0.1205 age)")
> abline(lm(length ~ age))

Fitted line plot

Q12: The following commands will extract the coefficient of determination from the linear regression analysis output:

> summary(lm(length~age))$r.squared
[1] 0.976694

Q13: To predict the length at age 85 days, along with a prediction interval, use:

> predict(lm(length ~ age), newdata=data.frame(age=85), interval="prediction")
       fit      lwr      upr
1 7.547825 7.013335 8.082316
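The fitted value can be approximated by hand from the regression line, which shows what predict is doing for the point estimate. The sketch below takes the rounded coefficients −2.6909 and 0.1205 and the correlation 0.9882783 as read from the earlier output; the object names are illustrative.

```r
# Rounded coefficients from the least squares line (length = intercept + slope * age)
intercept <- -2.6909
slope     <- 0.1205

age_days <- 85
fitted_length <- intercept + slope * age_days   # point prediction at 85 days

# For simple linear regression the coefficient of determination is r squared
r <- 0.9882783
r_squared <- r^2
```

The hand calculation differs from the fit reported by predict only in the third decimal place, because predict uses unrounded coefficients. The prediction interval itself also depends on the residual standard error, which is not reproduced here.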

Q14: The following scatter plot was produced using:

> �����-�.��������, ?����, ���23-�.�������� ����3, ���236�.���� ����� �@5I�"�3�

Scatterplot of cucumber yield vs precipitation

Q15: Use the following code to compute the correlation coefficient:

> cor.test(Precipitation,Yield)

This gives the p-value for the correlation test.

Q16: The following scatter plot was produced using:

> plot(Ave_sleep, Performance, xlab="Average sleep (hours)", ylab="Exam performance (%)")

Relationship between sleep and exam performance

Q17: Some of the average values are exact numbers (e.g. 8 hours), which suggests this is an estimate rather than measured in a scientific way. Possibly these students have just guessed an amount. Data should be collected in a way that is accurate and consistent, e.g. participants could have been issued with a fitness tracker that measures sleep, with detailed instructions on how to use it and record the information.

Q18: The correlation and hypothesis test is computed using:

> cor.test(Ave_sleep,Performance)

        Pearson's product-moment correlation

data:  Ave_sleep and Performance
t = 5.866, df = 28, p-value = 2.626e-06
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.5218688 0.8700524
sample estimates:
      cor
0.7425322

Q19: The least squares linear regression line is:

> lm(Performance ~ Ave_sleep)

Call:
lm(formula = Performance ~ Ave_sleep)

Coefficients:
(Intercept)    Ave_sleep
    0.05173      7.32729

Q20: The predicted exam performance, with a prediction interval, for a student with an average of 10 hours sleep is:

> predict(lm(Performance ~ Ave_sleep), newdata=data.frame(Ave_sleep=10), interval="prediction")
       fit     lwr      upr
1 73.32464 58.4109 88.23838

Note that the adjusted coefficient of determination, which helps with the interpretation of the accuracy here, can be computed as shown:

> summary(lm(Performance ~ Ave_sleep))$adj.r.squared
[1] 0.5353309

Correlation and linear regression topic test (page 79)

Q21: Correlation analysis was used to quantify the degree of linearity between the variables.

The correlation between collagen content and fat content is 0.713 and represents a positive relationship between the variables. As collagen increases, fat content increases. The p-value is 0.011, which is less than the significance level of 0.05. The p-value indicates that the correlation is significant.

The correlation coefficient between fat content and softness is –0.810 and the p-value is 0.001. The p-value is less than the significance level of 0.05, which indicates that the correlation is significant. As fat content increases, softness decreases.

The correlation between collagen and softness is –0.581 and the p-value is 0.054.

Q22: i. cost, iii. time required to collect the data, and iv. ethical issues related to animal research.

Cost is a likely factor, as losing hides to a study means fewer hides turned into leather to sell, which impacts profit. Sourcing enough hides shouldn't be a problem as that is their business. Time required to collect the data may be a problem; it is unknown who would have to do this and the cost involved. Ethical issues can play a part in animal studies and could be an issue here if these hides are scrapped after the study.

Q23: The regression analysis appears to indicate that there is a linear relationship between the two variables. As lamination increases, the durability of the screen increases. The least squares linear regression line estimates that as the optical lamination increases by one unit, the durability increases by 3.5. The intercept value of –18.9 is meaningless in the context of this analysis.

There appears to be an outlier in the top right corner of the fitted line plot, which could have an effect on the results. The designer should investigate this to determine its cause.

Topic 4: Hypothesis testing and confidence intervals

Data analysis, interpretation and communication exercises (page 118)

Q1: To import the file to R, use the following commands.

1 � ��������� ����� ���������������,-. ��������

2 � � ���� �����

a) The box plot to compare Facebook fans and Twitter followers can be constructed using the following code.

1 � ������ �H�����D�"������� �"������ ���� �����

������0�������� �" ������ ���� "��������

2 � ���� �����H�����D "��������� "��������

3 � �����$�� �$;#������������

b) Since this is paired data, the appropriate test and output are as follows.

> t.test(Facebook.fans, Twitter.followers, paired=TRUE)

        Paired t-test

data:  Facebook.fans and Twitter.followers
t = 1.8087, df = 35, p-value = 0.07909
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -111528.5 1933969.9
sample estimates:
mean of the differences
               911220.7

Additional learning point (beyond the scope of this course)
The assumption for the test is that the differences between the number of Facebook fans and Twitter followers are normally distributed. This should really be checked prior to performing the paired t-test. The non-parametric option is the Wilcoxon test. In R the analysis is as follows.

> wilcox.test(Facebook.fans, Twitter.followers, paired=TRUE)

        Wilcoxon signed rank test

data:  Facebook.fans and Twitter.followers
V = 419, p-value = 0.1814
alternative hypothesis: true location shift is not equal to 0

Since p > 0.05, there is no evidence from this analysis to suggest a difference.

Q2: To import the file to R, use the following commands.

1 � �� ������� ������� �������������,-.��������

2 � � ������ ���

a) Several of the Twitter followers values are missing. It would seem reasonable to assume this is because these individuals do not have Twitter accounts.

b) The R code for the box plot is as follows.

1 � ������ ��� �� : @�"����� ��������� �� ������ ���� "������ ��

��"�������


This plot indicates actors have more total social media followers than tennis players.

c) Since these are different professions, the appropriate test is a two sample t-test.

> t.test(Total ~ Profession)

        Welch Two Sample t-test

data:  Total by Profession
t = 7.4207, df = 42.61, p-value = 3.358e-09
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 21243414 37104657
sample estimates:
           mean in group Actor mean in group Athlete - Tennis
                      32018482                        2844446

Since p < 0.001, reject NH and conclude that there is evidence to suggest that actors have significantly more social media followers than tennis players.

Non-parametric analysis (beyond the scope of this course)
Note: The box plot shows that the distribution of social media followers for the tennis players is skewed. The Mann-Whitney test is perhaps more appropriate; however, the sample size is large enough to use the central limit theorem and perform the two sample t-test. The non-parametric analysis leads to the same conclusion.

> wilcox.test(Total ~ Profession)

        Wilcoxon rank sum test

data:  Total by Profession
W = 1538, p-value < 2.2e-16
alternative hypothesis: true location shift is not equal to 0

Q3: Since this is a comparison of two proportions (i.e. the proportion of males and females with evidence of defence wounds), the data can be analysed using a z-test for two proportions.

a) For males the proportion is $\frac{57}{162} = 0.352$, i.e. 35.2%, and for females $\frac{18}{33} = 0.545$, i.e. 54.5%.

b) The test for two proportions can be done on the summarised data in the table using the following command.

> prop.test(x = c(57, 18), n = c(162, 33), correct=FALSE)

        2-sample test for equality of proportions without continuity correction

data:  c(57, 18) out of c(162, 33)
X-squared = 4.3415, df = 1, p-value = 0.03719
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.37872207 -0.00848332
sample estimates:
   prop 1    prop 2
0.3518519 0.5454545

Since p = 0.037, reject the NH and conclude that there is evidence that significantly more females than males have defence wounds as a result of fatal stabbings.

c) It is possible to be 95% sure that in the population, between 0.8% and 37.9% more females than males will exhibit defence wounds as a result of a fatal stabbing.

d) Since the confidence interval does not contain zero, this is not a plausible value for the difference. This agrees with the hypothesis test which indicated that significantly more females than males would have defence wounds, hence the difference is not zero.
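The test statistic, p-value and confidence interval can be reproduced by hand in R, which makes clear how the z-test for two proportions works. This is a sketch using the counts from part a) (57 of 162 males and 18 of 33 females); the object names are illustrative, and the squared z statistic is what prop.test reports as X-squared when no continuity correction is applied.

```r
x <- c(57, 18)       # number with defence wounds: males, females
n <- c(162, 33)      # group sizes: males, females

p_hat  <- x / n                  # sample proportions
p_pool <- sum(x) / sum(n)        # pooled proportion under the null hypothesis

# z statistic uses the pooled standard error
se_pool <- sqrt(p_pool * (1 - p_pool) * sum(1 / n))
z       <- (p_hat[1] - p_hat[2]) / se_pool
p_value <- 2 * pnorm(-abs(z))    # two-sided p-value

# Confidence interval uses the unpooled standard error
se_diff <- sqrt(sum(p_hat * (1 - p_hat) / n))
ci <- (p_hat[1] - p_hat[2]) + c(-1, 1) * qnorm(0.975) * se_diff
```

The interval is for the male proportion minus the female proportion, so both limits are negative; reversing the sign gives the statement in part c) about how many more females than males exhibit defence wounds.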


Q4: To import the file to R, use the following commands.

1 � ��D"�� ����� �������D"�������������,-.��������

2 � � ������D"���

a) The paired t-test can be performed using the following command.

> t.test(female2000, female2013, paired=TRUE)

        Paired t-test

data:  female2000 and female2013
t = -4.3656, df = 45, p-value = 7.341e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -3.589846 -1.323198
sample estimates:
mean of the differences
              -2.456522

Since p < 0.001, reject NH and conclude that there is evidence of an increase in the rates of females participating in the labour force over time across both Europe and North America.

b) Since the two continents constitute independent samples, a two sample t-test is appropriate.

> t.test(female2013 ~ Continent)

        Welch Two Sample t-test

data:  female2013 by Continent
t = -1.8841, df = 13.893, p-value = 0.08065
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -10.9929599   0.7151821
sample estimates:
       mean in group Europe mean in group North America
                   52.36111                    57.50000

Although the mean is higher in North America than in Europe, p = 0.08065 indicates that it is reasonable to conclude that this difference could have occurred by chance. There is no evidence at the 5% significance level that the rates of females participating in the labour force in 2013 differ between the two continents.

c) The difference between North America and Europe may become significant if data from more countries within each continent were included.

Q5: The interpretation of the confidence interval is that it is possible to be 95% sure that the true population difference in mean milk production lies within the interval (1.5, 3.5) l/cow. Therefore, by playing music to a herd of dairy cattle, it is possible to be 95% sure that the milk yield will increase by between 1.5 and 3.5 l/cow. This result would have to be interpreted in the 'clinical' context. Even though the milk production increased, the size of the effect is unclear since there is no information on the time period in question. The actual increase in yield may not therefore be profitable in practice.

Hypothesis testing and confidence intervals topic test (page 122)

Q6: b) Treatment B is more effective than Treatment A

Q7: a) True

Q8: a) True

Q9: b) False

Q10: a) True

Q11: b) False


Topic 5: Statistics and probability: End of section test

Statistics and probability section test (page 124)

Q1:

Q2:

a) P(football) = $\frac{6}{14} = 0.429$

b) P(all three sports) = $\frac{1}{14} = 0.071$

c) P(tennis and volleyball) = $\frac{2}{14} = 0.143$

d) P(no sports) = $\frac{4}{14} = 0.286$

Q3:

Q4: The probability of being in goal if coach A is there today = 0.6 × 0.5 = 0.3

The probability of being in goal if coach A is not there today = (1 − 0.6) × 0.3 = 0.12

Therefore the probability of being chosen to play in goal today is 0.3 + 0.12 = 0.42, i.e. a 42% chance.
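This is an application of adding the probabilities along the two branches of a tree diagram. A short R sketch follows; it reads 0.6 as the probability that coach A is there, and 0.5 and 0.3 as the probabilities of being put in goal with and without coach A, which is an interpretation of the arithmetic above rather than a restatement of the question. The object names are illustrative.

```r
p_coach_A          <- 0.6   # assumed meaning: P(coach A is there today)
p_goal_given_A     <- 0.5   # assumed meaning: P(in goal | coach A is there)
p_goal_given_not_A <- 0.3   # assumed meaning: P(in goal | coach A is not there)

# Multiply along each branch, then add the two branches
p_goal <- p_coach_A * p_goal_given_A +
  (1 - p_coach_A) * p_goal_given_not_A
```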

Q5: Qualitative

Q6: Quantitative

Q7: Quantitative

Q8: Qualitative

Q9: Quantitative

Q10: Ordinal

Q11: Ordinal

Q12: Nominal

Q13: Nominal

Q14: Nominal


Q15: Continuous

Q16: Discrete

Q17: Discrete

Q18: Continuous

Q19: Continuous

Q20: a) Peel

Q21: b) Night

Q22:


Q23: a) Positive, given that the incidence of histoplasmosis tends to increase as the number of pigeons increases.

Q24: a) Significant, since p < 0.001.

Q25: Estimating the number of pigeons (capture/recapture method won't give total accuracy) and the number of cases of the disease (it is stated that most people have no symptoms).

Q26: 0.3234

Q27: 3.04

Q28: Colour of the dye, size of the t-shirt, other materials/chemicals in the manufacturing process.

Q29: A paired t-test is appropriate since the data for each of the two test results are from the same student.

Q30: Since p < 0.001, the null hypothesis is rejected and there is evidence to suggest that the mean scores for Maths and English are significantly different. The confidence interval suggests that on average the Maths scores are between 3.5% and 6.7% higher than the English scores.

Q31: Yes, since p < 0.001, there is evidence of a difference in the proportions of those who are protected by the two vaccines.

Q32: The confidence interval indicates that protection by Pfizer/BioNTech is achieved in between 8.9% and 28.6% more people than AstraZeneca.

Q33: Given that the study is observational, there could be other differences not accounted for in the analysis, e.g. those who got AstraZeneca could be older, or more susceptible, or more at risk of contracting it due to lifestyle, etc.

Q34: Trials could be randomised.
