
Part 2: Summarising Data Numerically and

Graphically

Matthew Sperrin and Juhyun Park

December 12, 2008

1 Introduction

How long do University students spend on social networking websites per day? A

random sample of 50 students were asked to record their social networking website

usage for one day. The results, in minutes spent, are given in Table 1. Looking at

the table only, what can you learn from the data?

Figure 1 shows exactly the same data in a histogram, with minutes spent plotted

along the horizontal axis, and the height of the bars representing the number of

students in each region.

Exercise 1. Can you learn more about social networking site usage from looking at

the graph than you can from the table? Is the graph or the table easier to interpret?

0 4 2 0 6 17 0 51 13 10

17 9 5 11 18 10 4 12 15 3

22 21 21 4 4 24 2 6 3 6

2 7 29 1 9 6 34 32 27 19

26 185 5 7 9 68 42 4 2 3

Table 1: Minutes spent on social networking websites per day


Figure 1: Minutes spent on social networking websites per day (histogram; horizontal axis: minutes spent)

Presenting data graphically can often help us to learn things from the data.

Table 2 gives the birthweights (in grams) of 44 babies born in the Mater Mothers’

Hospital in Brisbane, Queensland, Australia, on December 18, 1997.

Exercise 2. What is the variable being measured here? Is it quantitative? If so, is

it continuous or discrete?¹

Suppose you are asked ‘Tell me about the birthweights of these babies’. What
would you say?

¹ See Part 1 for an introduction to these concepts.


3837 3334 3554 3838 3625 2208 1745 2846 3166 3520 3380

3294 2576 3208 3521 3746 3523 2902 2635 3920 3690 3430

3480 3116 3428 3783 3345 3034 2184 3300 2383 3428 4162

3630 3406 3402 3500 3736 3370 2121 3150 3866 3542 3278

Table 2: Birthweights of babies born in the Mater Mothers’ Hospital in Brisbane,

Queensland, Australia, on December 18, 1997

It would be time-consuming and uninformative to list the weights of every single

baby — it would be better to specify a few numbers that summarise in some way the

weights of the babies.

Graphical and numerical summaries of data come under the joint heading of
exploratory data analysis.

This part of the course has the following objectives:

1. To introduce some numerical and graphical summary methods.

2. To explore which graphical methods/summary statistics are useful in certain

situations, and how to use them together sensibly.

3. To extend the ideas to situations where we are interested in the relationship

between two variables.

2 Visualising the Data

Suppose that we are interested in analysing the birthweight data given in Table 2.

Where should we start?

Whatever question we are trying to answer, and whatever the purpose of the
analysis, a useful first step is to look at the data. Looking at the table itself
is probably not very helpful: it is very difficult to get any sort of intuition
about what the data is like. Getting a visual impression of the data will help us
to decide what we can do with it, as we will see later.


A possible first step in visualising the data is to produce a histogram, so we briefly

introduce histograms in this section. We work through the process of constructing a

histogram using the birthweight data.

1. We divide the range of the data into (equally) sized bins. Here, the lightest baby

has weight 1745 grams and the heaviest has weight 4162 grams. We will use 6

bins: 1501-2000, 2001-2500, 2501-3000, 3001-3500, 3501-4000 and 4001-4500.

2. Record the number of observations that fall into each bin. Here, we need to

draw a frequency table, and tally the number of birthweights in each category.

For example, the first weight listed in Table 2 is 3837 grams, so falls into the

category ‘3501-4000’.

Bin Number of Observations

1501-2000 1

2001-2500 4

2501-3000 4

3001-3500 19

3501-4000 15

4001-4500 1

3. Plot on a graph. The x axis is the range of the data and the y axis is the number

of observations (count) in each bin. Figure 2 gives the final histogram.
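The three steps above can be sketched in a few lines of Python. This is only an illustration: the function name `tally` and the text-based "plot" are our own choices, and the bin edges are the six bins chosen in step 1.

```python
# Birthweights (grams) from Table 2.
birthweights = [
    3837, 3334, 3554, 3838, 3625, 2208, 1745, 2846, 3166, 3520, 3380,
    3294, 2576, 3208, 3521, 3746, 3523, 2902, 2635, 3920, 3690, 3430,
    3480, 3116, 3428, 3783, 3345, 3034, 2184, 3300, 2383, 3428, 4162,
    3630, 3406, 3402, 3500, 3736, 3370, 2121, 3150, 3866, 3542, 3278,
]

# Step 1: six equally sized bins, each 500 grams wide.
bin_labels = ["1501-2000", "2001-2500", "2501-3000",
              "3001-3500", "3501-4000", "4001-4500"]


def tally(weights, first_lower=1501, width=500, n_bins=6):
    """Step 2: count how many weights fall into each bin."""
    counts = [0] * n_bins
    for w in weights:
        index = (w - first_lower) // width  # 0 for 1501-2000, 1 for 2001-2500, ...
        counts[index] += 1
    return counts


counts = tally(birthweights)

# Step 3: a crude text histogram, one '#' per observation.
for label, count in zip(bin_labels, counts):
    print(f"{label:>9}  {count:2d}  {'#' * count}")
```

The printed counts should agree with the frequency table in step 2.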

Exercise 3. What do you learn about the babies’ birthweights from this histogram?

For example, what sort of weight are the heaviest babies? What do the babies typically

weigh?

This gives a very brief introduction to histograms, and more technical aspects will

be given in Section 4.

3 Measuring Location and Spread

Recall that we have, in Table 2, the birthweights (in grams) of 44 babies. What sort

of questions might be of interest regarding these birthweights?

Figure 2: Histogram of Birthweights (horizontal axis: birthweight in grams, 1500 to 4500)

We might be interested in:

1. What does a typical baby weigh?

2. How spread out are the weights of the babies? Or, how light are ‘light’ babies,

and how heavy are ‘heavy’ babies?

The first question can be answered by calculating a statistic that summarises
location. We will introduce two ways to measure location: the median and the
mean. The second question can be answered by calculating a statistic that
summarises spread. We will introduce three ways to measure spread: the range,
the interquartile range, and the standard deviation. Later, we will talk about
the relative advantages and disadvantages of each measure, and discuss the
circumstances when each might be used.

Before we get started, a brief digression on notation is necessary.

Some Notation

Firstly, we choose a letter to denote the variable we are measuring — X is a common

choice. It is good practice to make it clear what your variables mean, by writing at the

beginning ‘let X denote [the variable we are measuring]’. Instead of saying ‘the first

observation from the data’ we will simply write x1 (If we had chosen Y to denote the

variable, this would be y1). The tenth observation is written as x10, and so on. We call

the total number of observations in our dataset n, so the final observation is xn. For

example, in the birthweight data in Table 2, let X be the weights of the babies. Then,

reading across the first row we get x1 = 3837, x2 = 3334, x3 = 3554, . . . , x44 = 3278,

and n = 44, as there are 44 births recorded.

Notice the difference here between upper case and lower case letters. Upper case

letters are not numbers — they are variables, telling us what it is we are measuring.

They are random because each time we take a measurement we will get a different

answer. Lower case letters are numbers, because they tell us the value the variable

takes for a specific measurement.

We use brackets if we have put the data in ascending numerical order — the

first observation (which is now the smallest) is called x(1), the second observation (so

second smallest) is called x(2), and so on, up to the largest observation x(n). Table 3

gives the birthweight data in ascending numerical order. Looking at this we can see

that x(1) = 1745, x(2) = 2121,. . ., and x(44) = 4162.

1745 2121 2184 2208 2383 2576 2635 2846 2902 3034 3116

3150 3166 3208 3278 3294 3300 3334 3345 3370 3380 3402

3406 3428 3428 3430 3480 3500 3520 3521 3523 3542 3554

3625 3630 3690 3736 3746 3783 3837 3838 3866 3920 4162

Table 3: Birthweights of babies born in the Mater Mothers’ Hospital in Brisbane,

Queensland, Australia, on December 18, 1997; in ascending numerical order.
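The notation above maps naturally onto a list in Python, with one caveat: Python counts from 0, whereas the notes count from 1. The helper names `obs` and `ordered_obs` below are our own, introduced only to make that shift explicit.

```python
# Birthweights (grams) in the order they appear in Table 2.
x = [
    3837, 3334, 3554, 3838, 3625, 2208, 1745, 2846, 3166, 3520, 3380,
    3294, 2576, 3208, 3521, 3746, 3523, 2902, 2635, 3920, 3690, 3430,
    3480, 3116, 3428, 3783, 3345, 3034, 2184, 3300, 2383, 3428, 4162,
    3630, 3406, 3402, 3500, 3736, 3370, 2121, 3150, 3866, 3542, 3278,
]
n = len(x)           # total number of observations
ordered = sorted(x)  # the data in ascending order, as in Table 3


def obs(i):
    """The i-th observation as recorded (i counts from 1, as in the notes)."""
    return x[i - 1]


def ordered_obs(i):
    """The i-th smallest observation (i counts from 1)."""
    return ordered[i - 1]


print(n, obs(1), obs(2), obs(3), obs(n))
print(ordered_obs(1), ordered_obs(2), ordered_obs(n))
```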


3.1 The Median

The median is one way of answering the question ‘What does a typical baby weigh?’.

Exercise 4. Suppose we have placed the 44 birthweights in ascending numerical order

(as in Table 3). Which of the following values would best reflect what a typical baby

weighs?

• x(2) — i.e. the 2nd smallest value?

• x(22) — a value somewhere around the middle?

• x(42) — one of the largest values?

The median value of a collection of data is the ‘middle’ value when the data is in

numerical order. We use the symbol xm for the median.

Exercise 5. Looking at Table 3, find the ‘middle’ value for the birthweight.

We now give a general procedure for calculating the median, using the birthweight

data as an example.

1. Place the data in numerical order. This is done in Table 3.

2. Take the total number of observations and add 1. So here, there are 44 obser-

vations, so adding 1 gives 45.

3. Divide by 2 — call the result t. So t = 45/2 = 22.5.

4. If t is a whole number, the median is x(t). Otherwise, the median is the
average of the two observations either side of position t. We have t = 22.5,
which is not a whole number, so our answer is the average of x(22) and x(23).
So we get

\[ x_m = \frac{x_{(22)} + x_{(23)}}{2} = \frac{3402 + 3406}{2} = 3404 \]

So the median of the birthweight data is 3404 grams.
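The four-step procedure above can be written as a short Python function. This is a sketch that follows the steps literally; the function name `median` is our own choice.

```python
def median(data):
    """Median via the 'add 1, divide by 2' procedure described above."""
    ordered = sorted(data)           # step 1: numerical order
    n = len(ordered)
    t = (n + 1) / 2                  # steps 2 and 3
    if t == int(t):                  # step 4: t is a whole number
        return ordered[int(t) - 1]   # x_(t); the -1 is because Python counts from 0
    below = ordered[int(t) - 1]      # observation just below position t
    above = ordered[int(t)]          # observation just above position t
    return (below + above) / 2


birthweights = [
    3837, 3334, 3554, 3838, 3625, 2208, 1745, 2846, 3166, 3520, 3380,
    3294, 2576, 3208, 3521, 3746, 3523, 2902, 2635, 3920, 3690, 3430,
    3480, 3116, 3428, 3783, 3345, 3034, 2184, 3300, 2383, 3428, 4162,
    3630, 3406, 3402, 3500, 3736, 3370, 2121, 3150, 3866, 3542, 3278,
]

print(median(birthweights))
```

Python's standard library also provides `statistics.median`, which uses the same convention for an even number of observations.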

Exercise 6. Why do we add 1 to the total number of observations before we divide

by 2? HINT: you can get intuition by considering a data set with 3 observations (i.e.

n = 3).


Exercise 7. In calculating the median, does it matter whether the data is arranged

in ascending or descending numerical order?

So, the median gives us an answer to the question ‘what does a typical baby

weigh?’ — the median birthweight is 3404 grams.

3.2 The Range and Interquartile Range

We now consider the question, ‘How spread out are the weights of the babies? Or, how

light are ‘light’ babies, and how heavy are ‘heavy’ babies?’ Given that the ‘typical

baby’ weighs 3404 grams, would you be surprised to see a baby weighing 3600 grams, for

example? These are the sorts of questions that we can answer by having an indication

of the spread of the data.

One way we may consider looking at the spread of the data would be the difference

between the heaviest baby and the lightest baby — this is called the range, and in

the notation:

Range = x(n) − x(1)

So, for the birthweight data, the lightest baby is x(1) = 1745, and the heaviest

baby is x(n) = 4162. So the range is

Range = x(n) − x(1) = 4162 − 1745 = 2417 grams
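As a quick check, the same calculation in Python (the function name `data_range` is our own; note that no sorting is needed, since `max` and `min` find the largest and smallest observations directly):

```python
def data_range(data):
    """Range: largest observation minus smallest observation."""
    return max(data) - min(data)


birthweights = [
    3837, 3334, 3554, 3838, 3625, 2208, 1745, 2846, 3166, 3520, 3380,
    3294, 2576, 3208, 3521, 3746, 3523, 2902, 2635, 3920, 3690, 3430,
    3480, 3116, 3428, 3783, 3345, 3034, 2184, 3300, 2383, 3428, 4162,
    3630, 3406, 3402, 3500, 3736, 3370, 2121, 3150, 3866, 3542, 3278,
]

print(data_range(birthweights), "grams")
```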

Exercise 8. Table 4 gives the time 50 randomly selected University students spend

on social networking websites in a day (in minutes), in ascending numerical order.

Calculate the range of this data. Why might the range not be a good description of

the spread of the data in this case?

Since the range can be so sensitive to outliers — values that are unusually large

or small — we consider a slightly different measure, called the Interquartile Range

(IQR). The IQR takes the range of the middle 50% of the data, meaning it is not

affected by the outliers.

0 0 0 1 2 2 2 2 3 3

3 4 4 4 4 4 5 5 6 6

6 6 7 7 9 9 9 10 10 11

12 13 15 17 17 18 19 21 21 22

24 26 27 29 32 34 42 51 68 185

Table 4: Minutes spent on social networking websites per day, in ascending numerical

order.

• The lower quartile (LQ) is the value that has one quarter of the data smaller,
and three quarters of the data larger than it. It is also known as the 25th
percentile (or 25% quantile), as 25% of the data is smaller than it.

• Similarly, the upper quartile (UQ) is the value that has one quarter of the data
larger, and three quarters of the data smaller than it. It is also known as the
75th percentile (or 75% quantile), as 75% of the data is smaller than it.

• The Interquartile Range (IQR) is the range of the middle 50% of the data, and
is calculated by taking the difference between the UQ and the LQ:
IQR = UQ − LQ.

Exercise 9. Looking at Table 3, find the lower quartile (one quarter of the way

along the data), the upper quartile (three quarters of the way along the data) and the

interquartile range (difference between the two).

As with the median, there are technicalities involved when the LQ and UQ lie

‘in-between’ two data points. However, there is no agreement on what to do in these

cases. We suggest calculating the LQ as ‘the median of the lower half of the data’,

and the UQ as ‘the median of the upper half of the data’. If there is an odd number

of data points, the usual convention is to exclude the median from both calculations.

We will now give a general algorithm for calculating the LQ, UQ and IQR, using

the birthweight data as an example.

The LQ is the median of the first half of the data, i.e. the top two rows of
Table 3, or the first 22 observations. The number of observations in this half
plus 1 is therefore 23. Dividing this by two, and calling the answer t, gives
t = 11.5. So the result is the average of x(11) and x(12),

\[ \mathrm{LQ} = \frac{x_{(11)} + x_{(12)}}{2} = \frac{3116 + 3150}{2} = 3133 \]

The UQ is the median of the second half of the data, i.e. the bottom two rows
of Table 3, or observations 23 to 44, the last 22 observations. The number of
observations in this half plus 1 is therefore 23. Divide this by two to get
t = 11.5. So the result is the average of the 11th and 12th observations in the
second half of the data. This is NOT the same as x(11) and x(12) because they
are in the first half of the data! We add 22 to tell us where to find the UQ
because we didn’t use the first 22 observations. So 11.5 + 22 = 33.5, so the
result is the average of x(33) and x(34),

\[ \mathrm{UQ} = \frac{x_{(33)} + x_{(34)}}{2} = \frac{3554 + 3625}{2} = 3589.5 \]

Now the IQR is simply the difference between the UQ and the LQ:

IQR = UQ − LQ = 3589.5 − 3133 = 456.5 grams
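The ‘median of each half’ convention suggested above can be sketched in Python as follows. The function names are our own. Other conventions exist, and statistical software may give slightly different quartiles for the same data.

```python
def median_of_sorted(ordered):
    """Median of a list that is already in ascending order."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(data):
    """LQ and UQ as the medians of the lower and upper halves of the data.

    With an odd number of observations the overall median is left out
    of both halves, as suggested in the text.
    """
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower_half = ordered[:half]
    upper_half = ordered[n - half:]
    return median_of_sorted(lower_half), median_of_sorted(upper_half)


birthweights = [
    3837, 3334, 3554, 3838, 3625, 2208, 1745, 2846, 3166, 3520, 3380,
    3294, 2576, 3208, 3521, 3746, 3523, 2902, 2635, 3920, 3690, 3430,
    3480, 3116, 3428, 3783, 3345, 3034, 2184, 3300, 2383, 3428, 4162,
    3630, 3406, 3402, 3500, 3736, 3370, 2121, 3150, 3866, 3542, 3278,
]

lq, uq = quartiles(birthweights)
iqr = uq - lq
print(lq, uq, iqr)
```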

At this point, we can return to our original question that we were trying to answer

by calculating the spread of the data — how heavy are ‘heavy’ babies and how light

are ‘light’ babies? It remains somewhat subjective what we mean by light and heavy,

but we could say that the upper 25% are ‘heavy’ and the lower 25% are ‘light’. We

know these quantities — they are just the UQ and the LQ. So, ‘heavy’ babies have

weight greater than 3589.5 grams, and ‘light’ babies have weight less than 3133 grams.

Exercise 10. Calculate the Median, LQ, UQ and IQR of the social networking
website usage data. Table 4 gives the data in ascending numerical order.

Box Plots

The median, quartiles and range can be summarised in a graphical format, which can

be useful when comparing one sample against another. Firstly, the quantities can be

summarised in a so-called five number summary. This consists of five numbers —

the smallest observation, the lower quartile, the median, the upper quartile and the

largest observation (in that order). For example, for the birthweight data we would

write the five number summary as (1745, 3133, 3404, 3589.5, 4162).
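A minimal sketch of computing the five number summary in Python, assuming the ‘median of each half’ convention for the quartiles used earlier in this section. The function names are ours and are not part of any library.

```python
def median_of_sorted(ordered):
    """Median of a list that is already in ascending order."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def five_number_summary(data):
    """(smallest, LQ, median, UQ, largest), in that order."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lq = median_of_sorted(ordered[:half])       # median of the lower half
    uq = median_of_sorted(ordered[n - half:])   # median of the upper half
    return (ordered[0], lq, median_of_sorted(ordered), uq, ordered[-1])


birthweights = [
    3837, 3334, 3554, 3838, 3625, 2208, 1745, 2846, 3166, 3520, 3380,
    3294, 2576, 3208, 3521, 3746, 3523, 2902, 2635, 3920, 3690, 3430,
    3480, 3116, 3428, 3783, 3345, 3034, 2184, 3300, 2383, 3428, 4162,
    3630, 3406, 3402, 3500, 3736, 3370, 2121, 3150, 3866, 3542, 3278,
]

print(five_number_summary(birthweights))
```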

This summary can also be drawn in a box plot. To create a box plot, the horizontal
axis is the range of the data (in this case, the weights of the babies). Draw a small
vertical line at each of the five numbers from the five number summary, then connect
the ends of the line at the LQ to the ends of the line at the UQ, so that the middle
50% of the data is enclosed in a box. Figure 3 gives the box plot for the birthweight
data.

Figure 3: Boxplot of Birthweights (horizontal axis: birthweight in grams, 2000 to 4000)

Exercise 11. Suppose that we separate the weights of the 44 babies into boys and girls.

We calculate the five number summaries for each group, and they are, for the boys

(2121,3166,3404,3630,4162) and for the girls (1745,2576,3381,3523,3866). Draw the

two boxplots, one below the other, on a single graph, using the same horizontal axis.

Use this to compare the two sets of birthweights.


3.3 The Mean

We have discussed that it is useful to find location statistics, in order to answer a

question such as, in our birthweight example, what a typical baby weighs. The mean

is an alternative way to calculate this. We will explore the similarities and differences

between the mean and median in Section 3.5.

The mean is calculated by adding up the values of all the observations, then
dividing by the number of observations. This corresponds to sharing the values
equally amongst all the observations. The sample mean is denoted by x̄ (say
‘x-bar’).² The formula for the mean is

\[ \bar{x} = \frac{x_1 + x_2 + \ldots + x_n}{n} \]

In this formula, we have used ‘. . .’ to show that we have missed out all the middle
terms, but you would of course fill these in when using the formula. There is a nicer
way to indicate a sum of observations than this, by using the symbol ∑, which means
‘the sum of’.

So we can rewrite the formula for the mean as

\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \]

We use the area below the ∑ sign to indicate where the sum starts from, and the
area above to indicate where the sum finishes. The i in this case is called an index.
It changes its value from the smallest value in the sum to the largest value, going
through every whole number in between (i = 1, then i = 2, etc.).

As an example, let’s calculate the mean of the birthweight data. Putting the data
into the formula we get

\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i = \frac{3837 + 3334 + \ldots + 3278}{44} \approx 3275.955 \]

so the mean is approximately x̄ = 3276. So, the weight of an average baby is 3276
grams.
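The same calculation can be checked in Python. In the sketch below, `sum` plays the role of the ∑ sign: it adds up every x_i from i = 1 to i = n.

```python
birthweights = [
    3837, 3334, 3554, 3838, 3625, 2208, 1745, 2846, 3166, 3520, 3380,
    3294, 2576, 3208, 3521, 3746, 3523, 2902, 2635, 3920, 3690, 3430,
    3480, 3116, 3428, 3783, 3345, 3034, 2184, 3300, 2383, 3428, 4162,
    3630, 3406, 3402, 3500, 3736, 3370, 2121, 3150, 3866, 3542, 3278,
]

n = len(birthweights)      # number of observations
total = sum(birthweights)  # x_1 + x_2 + ... + x_n
x_bar = total / n          # the sample mean

print(n, total, round(x_bar, 3))
```

The standard library function `statistics.mean` performs the same calculation.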

² It is x̄ because X is the letter used to denote the variable. If we had used Y, it would be ȳ.


3.4 The Variance and the Standard Deviation

Just as the mean is an alternative to the median for measuring location, there is
also an alternative to the IQR for measuring spread.

Exercise 12. Using the time spent on social networking websites data in Table 4,

demonstrate why the IQR cannot be said to use all of the data.

HINT: Think about changing the value of the largest time (currently 185) to 285.

Would the IQR change?

The sample variance is defined as the average squared distance from an observation
to the sample mean. The ‘distance’ from an observation x_i to the sample mean x̄ is
(x_i − x̄). The ‘squaring’ removes any negative values here (e.g. if x_i = 3 but
x̄ = 5, then x_i − x̄ = −2, but squaring this gives 4).

We call the sample variance s², and the formula is

\[ s^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 \]

Since we have squared the (x_i − x̄) part, the variance is not in the same units as
the observations. For example, if our observations were in metres, the variance would
be in square metres! This makes the variance difficult to interpret.

The standard deviation remedies this problem by simply taking the square root
of the variance. It is denoted by s, and its formula is given by

\[ s = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2} \]

If necessary, we can subscript by the letter of the variable, e.g. s²_X and s_X for
the variance and standard deviation of the variable X. It is useful to do this if we
are dealing with more than one variable.

As an indication of how the standard deviation describes spread in a dataset,
Figure 4 gives four examples of histograms, on the same scale, and the sample
standard deviation in each case. Each of the datasets has a mean of 3. The
relationship is that the further from the mean the data tends to be, the larger the
standard deviation is. For example, in the bottom right histogram of Figure 4, all
the values are far from 3, resulting in a large standard deviation. On the other
hand, in the bottom left histogram of Figure 4, the values are all fairly close to 3,
resulting in a smaller standard deviation.

Figure 4: Histograms of four datasets (each with 100 observations) and their
associated standard deviations (one panel is titled ‘Standard Deviation = 1.06’;
horizontal axes run from 0 to 6).

As an example we will now calculate the variance and standard deviation of the
birthweight data, giving a step-by-step approach to doing the calculations.

1. Produce a table with three columns: the original x_i values, the x_i − x̄ values,
and finally the (x_i − x̄)² values.

2. Total up the final column of the table and divide by the number of observations.
This is the variance.

3. Take the square root of the answer. This is the standard deviation.

x_i     x_i − x̄                (x_i − x̄)²
3837    3837 − 3276 = 561      561² = 314721
3334    3334 − 3276 = 58       58² = 3364
3554    3554 − 3276 = 278      278² = 77284
...     ...                    ...
3278    2                      4

Sum of the (x_i − x̄)² column:                   11989186
Variance (above line divided by n):             272481.5
Standard Deviation (square root of variance):   522

Table 5: Calculation of variance and standard deviation for the birthweight data

These steps are summarised for the birthweight data in Table 5. We calculated
earlier that x̄ = 3276 grams. Constructing a table like this is helpful when
calculating the variance and standard deviation by hand.

So we conclude that the sample standard deviation of the birthweights is 522

grams. We have omitted many lines in the table — it would be a good exercise to

check that you can reproduce the table in full. These calculations are somewhat time-

consuming to do by hand. Most scientific calculators, and many computer packages,

will do the calculation for you.
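As one illustration of letting a computer do the work, the sketch below rebuilds the three columns of the table in Python. It assumes, as the table does, the rounded mean of 3276 grams, and it divides by n as in the formula given in these notes.

```python
import math
import statistics

birthweights = [
    3837, 3334, 3554, 3838, 3625, 2208, 1745, 2846, 3166, 3520, 3380,
    3294, 2576, 3208, 3521, 3746, 3523, 2902, 2635, 3920, 3690, 3430,
    3480, 3116, 3428, 3783, 3345, 3034, 2184, 3300, 2383, 3428, 4162,
    3630, 3406, 3402, 3500, 3736, 3370, 2121, 3150, 3866, 3542, 3278,
]
n = len(birthweights)
x_bar = 3276  # the mean, rounded to the nearest gram as in the table

# Step 1: one row per observation: x_i, x_i - x_bar, (x_i - x_bar)^2.
rows = [(x, x - x_bar, (x - x_bar) ** 2) for x in birthweights]

# Step 2: total the final column and divide by n to get the variance.
sum_of_squares = sum(squared for _, _, squared in rows)
variance = sum_of_squares / n

# Step 3: the standard deviation is the square root of the variance.
sd = math.sqrt(variance)

for row in rows[:3]:
    print(row)
print(sum_of_squares, variance, round(sd))

# Library version that also divides by n (it uses the unrounded mean).
print(round(statistics.pstdev(birthweights)))
```

Be aware that some calculators and packages divide by n − 1 rather than n by default (for example, Python's `statistics.stdev`), which gives a slightly larger answer.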

The standard deviation can loosely be interpreted as ‘how far a typical observation

is from the mean’.

Exercise 13. Suppose you wish to hire a typing assistant. The number of pages

typed per day by Assistant A has mean 65 pages and standard deviation 3 pages.


The number of pages typed per day by Assistant B has mean 75 pages and standard

deviation 20 pages.

• Which assistant is the most consistent?

• Which assistant would you expect to type the most pages over a week?

• If these were the two applicants for the job of typing assistant, which would you

hire (assuming you know nothing else about them)?

A small standard deviation means a high consistency or precision.

Exercise 14. Calculate the variance and standard deviation of the following times

spent by University students on social networking websites:³

4 6 51 17 11 4 3 21 24 3

Exercise 15. Is it possible for either the variance or the standard deviation to be

negative?

3.5 Mean or Median?

We conclude this section by looking at the differences between the two location
statistics we have introduced, the mean and the median, and discussing the
situations in which each might be used.

Exercise 16. Suppose a student receives the following marks in nine courses (placed

in ascending numerical order). The University awards a first class degree to any

student who earns 70% or more overall.

35 40 42 56 70 70 71 73 73

The exam board proposes that the overall degree classification is calculated based

on the median of the nine marks — here the median is 70, so the student is awarded

a first class degree. Is this a fair result?

³ This is a subset of Table 4, to make the calculation more manageable.


Exercise 17. Figure 5 gives the annual wage of 50 UK full time workers, chosen in

a random sample. The median of the wages is 21.5 thousand pounds, and the mean

is 28.7 thousand pounds. Explain why the mean is larger than the median. Which of

the mean and the median is more useful here?

Figure 5: Histogram of the wages of 50 randomly selected full time workers in the
UK (horizontal axis: annual wage in £1000s, 0 to 250).

The above questions illustrate situations where it is possible to select one of the

location statistics over the other — the mean is sometimes more useful than the

median, but in other situations the median can be more useful than the mean.

In general, the problem with the mean is that it is affected by extreme values. In
the wages example, there is one person in the sample earning £250 000, which is far
more than everybody else. This causes the mean to be larger. Therefore, in cases
where there are outliers in the data, the median is often a better choice.

The weakness of the median is that it cares only about the order that the data

is in, and the value of the middle observation. In the degree classification example

above, the median of 70% does not represent well what the data looks like. When

there are no outliers in the data, the mean is often more useful. So, it is important to

look at the data (for example by viewing histograms) before deciding which summary

statistics to report.
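The effect of a single extreme value on the two statistics can be seen in a small experiment. The five wages below are made-up illustrative numbers (in £1000s), not data from these notes.

```python
import statistics

# Hypothetical annual wages in thousands of pounds (illustrative only).
wages = [18, 20, 21, 22, 25]

# The same sample, but with the highest earner replaced by an extreme value.
wages_with_outlier = [18, 20, 21, 22, 250]

mean_before = statistics.mean(wages)
mean_after = statistics.mean(wages_with_outlier)
median_before = statistics.median(wages)
median_after = statistics.median(wages_with_outlier)

print("mean:  ", mean_before, "->", mean_after)
print("median:", median_before, "->", median_after)
```

The mean moves a long way when one value changes, while the median, which depends only on the ordering and the middle value, does not move at all.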

4 The Shape of the Data

We introduced the histogram in Section 2. In this section we look in more detail at
the histogram, and introduce other graphical methods to visualise the shape of the
data.

4.1 Histograms Revisited

When we first introduced the histogram in Section 2, we selected 6 equally sized bins

to divide the data into, without explaining why we made this choice. In fact, choosing

the number of bins to use is something of an art in producing histograms, and is often

done by trial and error. Figure 6 gives two histograms of the birthweight data: the
first has two bins (each bin being 2000 grams wide), and the second has 48 bins
(each bin being 50 grams wide). Both were created using identical data, namely the
birthweight data from Table 2.

Exercise 18. Using Figures 2 and 6, comment on the consequences of making a poor

choice for the number of bins.

You will find when you use statistical packages to produce histograms that an

appropriate number of bins is selected automatically.
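To experiment with the choice of bins yourself, a small sketch like the following may help. The function `bin_counts` and the starting points 1000, 1500 and 1700 are our own assumptions; the narrow-bin case is only roughly comparable to the one in Figure 6, since it does not use exactly the same bin edges.

```python
birthweights = [
    3837, 3334, 3554, 3838, 3625, 2208, 1745, 2846, 3166, 3520, 3380,
    3294, 2576, 3208, 3521, 3746, 3523, 2902, 2635, 3920, 3690, 3430,
    3480, 3116, 3428, 3783, 3345, 3034, 2184, 3300, 2383, 3428, 4162,
    3630, 3406, 3402, 3500, 3736, 3370, 2121, 3150, 3866, 3542, 3278,
]


def bin_counts(data, start, width):
    """Counts for bins (start, start + width], (start + width, start + 2*width], ...

    Assumes whole-number data with every value larger than `start`.
    """
    indices = [(x - start - 1) // width for x in data]
    counts = [0] * (max(indices) + 1)
    for i in indices:
        counts[i] += 1
    return counts


wide = bin_counts(birthweights, start=1000, width=2000)   # very few bins
medium = bin_counts(birthweights, start=1500, width=500)  # the six bins of Section 2
narrow = bin_counts(birthweights, start=1700, width=50)   # very many bins

for name, counts in [("wide", wide), ("medium", medium), ("narrow", narrow)]:
    print(f"{name:>6}: {len(counts):2d} bins, counts = {counts}")
```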

Exercise 19. Table 4 gives the time spent by 50 randomly chosen students on social
networking websites (in minutes). Produce a histogram, with appropriate bin width,
for this data. (Note: you may find it useful to exclude the largest time (185 minutes)
to produce a more interesting histogram.) Interpret the results.

Figure 6: Histogram of Birthweights, with 2 bins (top) and 48 bins (bottom).
Horizontal axes: birthweight (grams).

Density Histograms

Exercise 20. Suppose you take the names of all the babies born in the Mater Mothers’

Hospital in Brisbane, Queensland, Australia, on December 18, 1997, put them in

a hat, and select one at random. What is the chance of the chosen baby having

birthweight between 3001 and 3500 grams?

In Figure 7 we have the same histogram as in Figure 2, but this time we rescale

the height of the bars so that the area of the bar in each bin represents the chance of

a randomly chosen birthweight coming from that bin.

Figure 7: Density Histogram of Birthweights (horizontal axis: birthweight in grams, 1500 to 4500)

How have we calculated this? The chance of a randomly chosen birthweight (or,
in general, a randomly chosen observation) coming from a bin should be

\[ \frac{\text{Number of observations in bin}}{\text{Total number of observations}} \]

The area of the bar is Height of bar × Width of bar. Since we want the chance and
the area to match, this means that

\[ \text{Height of bar} = \frac{\text{Number of observations in bin}}{\text{Width of bar} \times \text{Total number of observations}} \]
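A short numerical sketch of this rescaling, using the bin counts from the frequency table in Section 2 and the bin width of 500 grams. The variable names are our own.

```python
# Bin counts for the six birthweight bins (1501-2000, ..., 4001-4500).
counts = [1, 4, 4, 19, 15, 1]
width = 500      # every bin is 500 grams wide
n = sum(counts)  # total number of observations

# Height of each bar in the density histogram.
heights = [count / (width * n) for count in counts]

# Area of each bar = height x width.
areas = [height * width for height in heights]

for count, height in zip(counts, heights):
    print(f"count = {count:2d}   height = {height:.6f}")
print("total area =", round(sum(areas), 10))
```

Because all the bins here have the same width, the density histogram has the same shape as the frequency histogram in Figure 2, and only the scale on the vertical axis changes.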

Exercise 21. Explain why the total area of all the bars in the new histogram is 1.


4.2 Bar Charts

You are probably familiar with a method of displaying categorical data that is very

similar to a histogram — the bar chart.

A bar chart could be used, for example, to visualise and compare the number of

people in a sample with different hair colours (this example was also used in Part 1).

Suppose we collect a sample of 100 people and record their hair colour. We record

the results in a frequency table:

Hair Colour Number of Observations

Black 10

Brown 51

Fair 28

Ginger 3

Other 8

A bar chart is then constructed from the frequency table in exactly the same way

as we did for the histogram. Figure 8 gives two possible bar charts we could construct

from this frequency table.⁴ The difference between the two is that we have changed

the order of the bars along the horizontal axis.

Exercise 22. Why is it not sensible to change the order of the bars along the
horizontal axis in a histogram, like that in Figure 7?

Bar charts are useful ways to display categorical data. They must not be confused

with histograms, which are used to summarise quantitative data.
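Since the categories of a bar chart have no natural order, we are free to choose one. The sketch below takes the hair colour frequency table, arranges the categories in descending order of count (a common choice), and prints a rough text bar chart. The variable names are ours.

```python
# Frequency table of hair colour for the sample of 100 people.
hair_counts = {"Black": 10, "Brown": 51, "Fair": 28, "Ginger": 3, "Other": 8}

# Arrange the categories from the tallest bar to the shortest.
by_height = sorted(hair_counts.items(), key=lambda item: item[1], reverse=True)

# A rough text bar chart: one '#' per person.
for colour, count in by_height:
    print(f"{colour:<7}{'#' * count} {count}")
```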

4.3 Empirical Distribution Functions

The histogram is very useful for estimating the chance of a randomly chosen
observation being within a given range of values, or equivalently, the proportion of
observations within that same region (see Section 2). However, we have already seen
that the histogram is limited in that its shape depends on the number of bins we
choose to classify our data into. In particular, if we are interested in estimating,
say, the chance of a random observation being less than a certain value, we will get
different answers depending on the bins we have chosen for the histogram.

Figure 8: Two barcharts showing the sample proportions of 100 people’s hair colour.

⁴ Note though, it is usual to arrange the bars in descending height order (as in
Figure 8, left panel).

The empirical distribution function (e.d.f.) is a means of displaying all the
quantiles of a set of data. Two special quantiles, the LQ (25% quantile) and the UQ
(75% quantile), have already been introduced in Section 3.2. Producing the e.d.f.
does not require any subjective decisions (like choosing the number of bins in the
histogram setting). Therefore, there is one unique e.d.f. for any collection of data.
There is an added bonus with the e.d.f.: we can read the median and quartiles from
the graph quickly and easily, as we will see.

We label the e.d.f. as a capital letter with a tilde (‘˜’) above it, for example F̃.

The formula for the e.d.f. is

\[ \tilde{F}(x) = \frac{\text{Number of observations smaller than } x}{\text{Total number of observations}} \]

For example, the 25% quantile, or lower quartile, is the value which has 25% of
the data smaller than it. In Section 3.2 we calculated that the LQ for the
birthweight data is 3133 grams. So, F̃(3133) = 0.25 approximately, because the e.d.f.
measures the proportion of the data smaller than a fixed value.

We will now work through the construction of the e.d.f. using the birthweight data.

1. We construct the graph using the ordered data (as in Table 3 for the birthweight

data). The x axis covers the range of the data (so you could use the same range

as the histogram, for example), and the y axis gives the value of F (x). The

graph is built like a staircase, starting from the y value of 0, on the left of the

smallest observation on the x axis, and finishing at the y value of 1, on the right

of the largest observation on the x axis.

2. Now, every time we encounter an observation, the y value goes up by one ‘step’.
The amount we go up is 1/n, with n = number of observations, as usual. If there
are ‘ties’ (i.e. more than one observation with the same value), we take a bigger
step, whose size is (1/n) × number of ties.

3. The graph finishes when the y value reaches 1. If you run out of data before

reaching 1, or go past 1, you know you have made a mistake somewhere! Figure

9 shows what the final e.d.f. for the birthweight data should look like.
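The staircase construction can be sketched in Python as follows. The function `edf` follows the formula given earlier literally (it counts observations strictly smaller than x), and `staircase` lists the height reached immediately after each jump. Both names are our own. Note that some textbooks and packages count observations ‘smaller than or equal to x’ instead, which changes the value only at the jump points themselves.

```python
from bisect import bisect_left
from collections import Counter

birthweights = [
    3837, 3334, 3554, 3838, 3625, 2208, 1745, 2846, 3166, 3520, 3380,
    3294, 2576, 3208, 3521, 3746, 3523, 2902, 2635, 3920, 3690, 3430,
    3480, 3116, 3428, 3783, 3345, 3034, 2184, 3300, 2383, 3428, 4162,
    3630, 3406, 3402, 3500, 3736, 3370, 2121, 3150, 3866, 3542, 3278,
]


def edf(data, x):
    """Proportion of observations strictly smaller than x."""
    ordered = sorted(data)
    return bisect_left(ordered, x) / len(ordered)


def staircase(data):
    """(value, height just after the jump at that value) for each distinct value."""
    n = len(data)
    ties = Counter(data)       # how many times each value occurs
    steps = []
    seen = 0
    for value in sorted(ties):
        seen += ties[value]    # a tie of size k gives a jump of k/n
        steps.append((value, seen / n))
    return steps


steps = staircase(birthweights)
print(steps[:3], "...", steps[-1])
print(edf(birthweights, 3133))
```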

Ensure you understand why constructing the graph in this way gives the e.d.f.

that is defined in Equation 4.3. Also ensure that you understand that, unlike with

histograms, there is no aspect of the e.d.f. that we can change to give a ‘different’

graph for the same set of data. Notice that the e.d.f. looks like a staircase — it moves

up in ‘jumps’.

Exercise 23. Two consecutive birthweights from Table 3 are 2208 and 2383. Explain

why F (2210) and F (2300) must be the same.

Exercise 24. Produce an e.d.f. for the social networking website data (Table 4).

Reading the Median and Quantiles from the e.d.f.

Recall that the median of a collection of data is the ‘middle value’. The e.d.f. is
defined in Equation 4.3 as

\[ \tilde{F}(x) = \frac{\text{Number of observations smaller than } x}{\text{Total number of observations}} \]

Figure 9: e.d.f. of Birthweights (horizontal axis: birthweight, 1500 to 4000)

Exercise 25. Explain why F̃(x_m) = 0.5, i.e. the value of the e.d.f. at the median is 0.5.

Since the value of the e.d.f. at the median is 0.5, we can easily read off an estimate

of the median from our e.d.f. We do this by working backwards. As we know that

F (xm) = 0.5, if we draw a line across from the y-axis at 0.5 to the e.d.f., reading

off the corresponding x-value gives us the median. This process is illustrated in the

birthweight data in Figure 10, and we get an estimate for the median of around 3400.
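‘Working backwards’ from a height on the y-axis to a value on the x-axis can also be done numerically. The sketch below walks up the staircase and returns the first observation at which the staircase reaches the requested height. The function name `read_from_edf` is our own.

```python
birthweights = [
    3837, 3334, 3554, 3838, 3625, 2208, 1745, 2846, 3166, 3520, 3380,
    3294, 2576, 3208, 3521, 3746, 3523, 2902, 2635, 3920, 3690, 3430,
    3480, 3116, 3428, 3783, 3345, 3034, 2184, 3300, 2383, 3428, 4162,
    3630, 3406, 3402, 3500, 3736, 3370, 2121, 3150, 3866, 3542, 3278,
]


def read_from_edf(data, height):
    """First observation at which the e.d.f. staircase reaches `height`."""
    ordered = sorted(data)
    n = len(ordered)
    for k, value in enumerate(ordered, start=1):
        if k / n >= height:  # after k observations the staircase is at k/n
            return value
    return ordered[-1]


print(read_from_edf(birthweights, 0.5))
```

For the birthweight data the staircase first reaches 0.5 at 3402 grams and stays at that height until the next observation, 3406 grams. Any value in that flat stretch is a reasonable reading, which agrees with the estimate of around 3400 and with the median of 3404 grams calculated in Section 3.1.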

Exercise 26. Show that, in the same way, the value of the e.d.f. at the LQ and UQ

is 0.25 and 0.75 respectively. Use this to estimate from the e.d.f. in Figure 9 the

LQ, UQ and IQR of the birthweight data. Compare with the estimates you already

calculated from the data, and explain why they may not be exactly the same.

Figure 10: e.d.f. of birthweights, with median line added (horizontal axis: birthweight)

Exercise 27. Summarise the advantages and disadvantages of Empirical Distribution

Functions versus Histograms.

5 The Relationship Between Two Variables

So far in this part of the course, we have looked at graphical methods and summary

statistics for a single variable only. Can you think of situations where it may be of

interest to look at the relationship between two variables?

There has been much said about the supposed link between smoking and lung

cancer. We have data available, from 44 US states, on two variables.

• Smoke: Number of cigarettes smoked (hundreds per person) in 1960.

• Lung: Deaths per 100 000 population from lung cancer.

The data is given in Table 6.

Smoke   Lung    Smoke   Lung    Smoke   Lung    Smoke   Lung
18.20   17.05   25.82   19.80   18.24   15.98   28.60   22.07
31.10   22.83   33.60   24.55   40.46   27.27   28.27   23.57
20.10   13.58   27.91   22.80   26.18   20.30   22.12   16.59
21.84   16.84   23.44   17.71   21.58   25.45   28.92   20.94
25.91   26.48   26.92   22.04   24.96   22.72   22.06   14.20
16.08   15.60   27.56   20.98   23.75   19.50   23.32   16.70
42.40   23.03   28.64   25.95   21.16   14.59   29.14   25.02
19.96   12.12   26.38   21.89   23.44   19.45   23.78   12.11
29.18   23.68   18.06   17.45   20.94   14.11   20.08   17.60
22.57   20.74   14.00   12.01   25.89   21.22   21.17   20.34
21.25   20.55   22.86   15.53   28.04   15.92   30.34   25.88

Table 6: Numbers of cigarettes smoked (hundreds per person) in 1960 and deaths per 100 000 population from lung cancer, in 44 US states.

Looking at the table, can you see any relationship between the number of cigarettes

smoked, and the rate of deaths from lung cancer?

Clearly, this is difficult if not impossible to do.

5.1 Scatter Graphs

A scatter graph is a visualisation of the relationship of two quantitative (numeric)

variables. So far, we have been using X to denote our variable. Now that there are

two variables, we will use X to denote the number of cigarettes smoked, and Y to

denote the number of deaths from lung cancer.

To construct a scatter graph, we plot the specific X and Y values for each state

onto a graph as co-ordinates. The horizontal axis has the range of the X variable,

and the vertical axis takes the range of the Y variable. Then, for example, for the


first state in the table, we draw a cross on the graph at (18.20, 17.05). After plotting

all of the points from Table 6, we end up with a scatter graph — as in Figure 11.

Figure 11: Scatter plot of the number of cigarettes smoked against the number of deaths from lung cancer, for 44 US states. (Horizontal axis: Cigarettes Smoked (hundreds per capita) in 1960.)

Exercise 28. What does Figure 11 suggest about the relationship between cigarettes

smoked and the number of deaths through lung cancer?

The diagonal line in Figure 11 is a line of best fit. This is drawn in such a way

as to represent the underlying relationship between the X and Y variables. Since

this line slopes upwards we would say there is a positive relationship between X and

Y, i.e. as the number of cigarettes smoked increases, so does the number of deaths

through lung cancer.


Exercise 29. How might the line of best fit be used to estimate the death rate from

lung cancer in a state for which we only know that the number of (hundreds of)

cigarettes smoked per person was 30?

Exercise 30. Can you think of two variables that may have a negative relationship?

i.e. as one increases in value, the other one decreases?

5.2 Correlation

Suppose you are asked to describe the relationship between smoking and lung cancer

using Figure 11 over the telephone. What would you say? Is this easy to do?

As well as having a picture of the relationship between two variables, it is also

useful to have some sort of numerical summary (this would be a lot easier to describe

over the telephone!).

The correlation between two variables is a number that describes the strength of

the linear (i.e. straight line) relationship between them. We use the symbol r for

the sample correlation, and subscript it with the two variables we are calculating the

correlation between. For example, r_XY is the sample correlation between smoking

and lung cancer, from the previous section. The correlation is always between −1

and 1. A positive number corresponds to a positive linear relationship between the

two variables (i.e. as one increases, the other increases), and a negative number

corresponds to a negative relationship (i.e. as one increases, the other decreases).

Figure 12 shows some scatter graphs of the relationship between two variables, and

the associated correlations, to give a feel for what the numbers mean.

The fact that correlation measures linear relationships between variables is im-

portant to remember. It is always useful to plot the data in a scatter graph first, to

see whether there is an indication of a relationship between the two variables that is

not linear. Figure 13 gives some examples of calculated correlations where the rela-

tionship between two variables is not linear. The correlation can be very misleading

in these instances.

Figure 12: Scatter plots of nine datasets (each with 100 observations and 2 variables per observation) and their associated correlations. Panel titles include Correlation = 1, 0.95, 0.78, 0.45, 0.21, 0.08, −1 and −0.79.

Figure 13: Scatter plots of two datasets (each with 100 observations and 2 variables per observation) and their associated correlations, where the relationship between the two variables is not linear. Panel titles: Correlation = 0.02 and Correlation = 0.

The formula for calculating the sample correlation between two variables X and Y is

\[ r_{XY} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n \, s_X \, s_Y}. \]

We will calculate the correlation for the lung cancer and smoking data. As we did for the variance, it is helpful to draw a table here. We have calculated the means and standard deviations in advance, using the divisor n as in our definition of the variance. For the ‘smoke’ variable, x̄ = 24.91 and sX = 5.51. For the ‘lung’ variable, ȳ = 19.65 and sY = 4.18 (feel free to verify these yourself!).

x_i (Smoke)   y_i (Lung)   (x_i − x̄)   (y_i − ȳ)   (x_i − x̄) × (y_i − ȳ)
18.20         17.05        −6.71        −2.60        17.48
31.10         22.83         6.18         3.18        19.65
20.10         13.58        −4.81        −6.07        29.24
...           ...           ...          ...         ...
30.34         25.88         5.43         6.23        33.79

Sum of (x_i − x̄) × (y_i − ȳ) over all 44 states: 706.66

Correlation (above line divided by (n × sX × sY)): 0.70

A correlation of 0.70 is a fairly strong correlation.
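The table calculation can be checked with a short Python sketch of the formula above. The function name `sample_correlation` is our own, the standard deviations are computed with divisor n (an assumption made to match the way the variance is calculated elsewhere in these notes), and the pairs are typed in from Table 6, reading across each row.

```python
from math import sqrt

# (Smoke, Lung) pairs for the 44 states in Table 6.
pairs = [
    (18.20, 17.05), (25.82, 19.80), (18.24, 15.98), (28.60, 22.07),
    (31.10, 22.83), (33.60, 24.55), (40.46, 27.27), (28.27, 23.57),
    (20.10, 13.58), (27.91, 22.80), (26.18, 20.30), (22.12, 16.59),
    (21.84, 16.84), (23.44, 17.71), (21.58, 25.45), (28.92, 20.94),
    (25.91, 26.48), (26.92, 22.04), (24.96, 22.72), (22.06, 14.20),
    (16.08, 15.60), (27.56, 20.98), (23.75, 19.50), (23.32, 16.70),
    (42.40, 23.03), (28.64, 25.95), (21.16, 14.59), (29.14, 25.02),
    (19.96, 12.12), (26.38, 21.89), (23.44, 19.45), (23.78, 12.11),
    (29.18, 23.68), (18.06, 17.45), (20.94, 14.11), (20.08, 17.60),
    (22.57, 20.74), (14.00, 12.01), (25.89, 21.22), (21.17, 20.34),
    (21.25, 20.55), (22.86, 15.53), (28.04, 15.92), (30.34, 25.88),
]


def sample_correlation(xs, ys):
    """Sum of cross-deviations divided by n * s_X * s_Y."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_x = sqrt(sum((x - x_bar) ** 2 for x in xs) / n)   # divisor n
    s_y = sqrt(sum((y - y_bar) ** 2 for y in ys) / n)   # divisor n
    cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return cross / (n * s_x * s_y)


smoke = [x for x, _ in pairs]
lung = [y for _, y in pairs]
r = sample_correlation(smoke, lung)
print(round(r, 2))
```

Because the sketch works with unrounded means and standard deviations, its result may differ slightly in the second decimal place from a hand calculation based on rounded values.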

6 Summary

There were three objectives to this part of the course. Let’s summarise how we have

tackled each one.

1. To introduce some basic graphical methods and summary statistics.

We have introduced a range of techniques. This is not exhaustive: there are

many other graphical methods and summary statistics we could have considered.

2. To motivate which graphical methods/summary statistics are useful in certain

situations, and how to use them together sensibly.

It is vital to identify which tools are useful to us for different purposes. Usually,

we will use a combination of graphical representations, and summary statistics,

to learn about the variables in the data we have collected.


3. To extend the ideas to situations where we are interested in measuring two

variables.

The final section introduces scatter plots and correlation, as methods specific to

dealing with the relationship between two variables. Do remember though, it is

still important to look at each of the variables individually, using the methods

in the earlier sections.

A Hints and Answers to Exercises

Exercise 1: Hopefully, you believe the graph is easier to interpret. We can see that

the majority of people spend less than ten minutes on social networking websites

in a day, with a very small number spending longer than 50 minutes. There is one

outlying observation — one person spends 185 minutes in the day, which is far longer

than everybody else.

Exercise 2: The variable is ‘Weight of baby’. It is certainly quantitative, and

since a weight is measured on a scale, it is continuous (although, in the data we have

in Table 2, it is rounded off to the nearest gram).

Exercise 3: All babies are between 1500 and 4500 grams, with most weights lying

between 3000 and 4000 grams. You may have other comments.

Exercise 4: x(22) would be the most sensible.

Exercise 5: This is worked through in the text.

Exercise 6: This is easily seen if we think about the case with three observations

only. For example, suppose we have observations 4, 7 and 12. If we did not add one,

we’d be looking for the median in between the 1st and 2nd value, which clearly doesn’t

make sense. This peculiarity is because of the way we count — if we started counting

from 0 rather than 1, we would not have this problem!

Exercise 7: No. Whichever way around the numbers are written, the same numbers will always be in the middle. However, it is conventional to write the numbers in ascending numerical order.

Exercise 8:

Range = x(n) − x(1) = 185 − 0 = 185 minutes

This is not particularly useful because the largest time, x(n) = 185 does not fit in with

the rest of the data — if we excluded this value the range would be only 68 minutes.

It is undesirable that the range is so sensitive to this one, unusual value. This is why

the interquartile range is more commonly used for measuring spread.

Exercise 9: Answered in the text.

Exercise 10: To calculate the median, total number of observations + 1 = 51. Then t = 51/2 = 25.5. So the median is the average of the 25th and 26th values in Table 4:

\[ x_m = \frac{x_{(25)} + x_{(26)}}{2} = 9 \text{ minutes} \]

The LQ is the median of the first 25 observations, so LQ = x(13) = 4 minutes. The UQ is the median of the last 25 observations, so UQ = x(38) = 21 minutes. Then IQR = UQ − LQ = 21 − 4 = 17 minutes.
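The same calculation can be sketched in Python, following the method used here: the quartiles are the medians of the lower and upper halves of the ordered data. The helper name `median_of_sorted` is our own, and leaving the middle value out of both halves when n is odd is an assumption, since this example only covers an even n. The data are the 50 times from Table 1.

```python
def median_of_sorted(values):
    """Median of a list that is already in ascending order."""
    n = len(values)
    middle = n // 2
    if n % 2 == 1:
        return values[middle]
    return (values[middle - 1] + values[middle]) / 2


times = [
    0, 4, 2, 0, 6, 17, 0, 51, 13, 10,
    17, 9, 5, 11, 18, 10, 4, 12, 15, 3,
    22, 21, 21, 4, 4, 24, 2, 6, 3, 6,
    2, 7, 29, 1, 9, 6, 34, 32, 27, 19,
    26, 185, 5, 7, 9, 68, 42, 4, 2, 3,
]

ordered = sorted(times)
n = len(ordered)
half = n // 2                                  # size of each half (25 here)

median = median_of_sorted(ordered)
lq = median_of_sorted(ordered[:half])          # median of the first 25 values
uq = median_of_sorted(ordered[n - half:])      # median of the last 25 values
iqr = uq - lq
print(median, lq, uq, iqr)
```

Statistical software often uses slightly different quartile conventions, so built-in quantile functions may not reproduce these values exactly.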

Exercise 11: Your box plot should look like Figure 14, although you may have the

boxplots the other way around, which is fine.

Comparing the two, the boys are heavier at birth on average (the median is

slightly larger), and the IQR for the boys is much smaller, meaning that in general,

boys’ weights cluster more closely around the median than the girls’.

Figure 14: Boxplot of the birthweights of the boys (top) and the girls (bottom).

Exercise 12: Imagine partitioning the data into three groups:

1. Smaller than the LQ

2. Between the LQ and the UQ

3. Larger than the UQ

Then, changing the value of any observation would not affect the IQR, provided

the observation stays in the same group. For example, changing the value of the

largest time from 185 to 285 would not change the IQR, since the value clearly remains in the group ‘larger than the UQ’.

Exercise 13: Assistant A is the most consistent as the standard deviation is

smaller, but Assistant B would be expected to type the most pages over the week as

the mean of Assistant B is larger. Either answer is correct for the final part — it

depends on your preferences. If quantity only is important, then Assistant B is the

best choice. However, if consistent output on a daily basis is required then Assistant

A should be chosen.

Exercise 14: We reproduce the table in full. First, calculate the mean and check

you get 14.4. Also, the number of observations n = 10.

x_i    x_i − x̄    (x_i − x̄)²
  4    −10.4       108.16
  6     −8.4        70.56
 51     36.6      1339.56
 17      2.6         6.76
 11     −3.4        11.56
  4    −10.4       108.16
  3    −11.4       129.96
 21      6.6        43.56
 24      9.6        92.16
  3    −11.4       129.96

Sum of (x_i − x̄)²: 2040.4

Variance (above line divided by n): 204.04

Standard Deviation (square root of variance): 14.28

So the sample variance is 204.04 and the sample standard deviation is 14.28.⁵
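These figures can be reproduced with Python's standard `statistics` module, as a quick check rather than a replacement for the table. In that module `pvariance` and `pstdev` divide by n, while `variance` and `stdev` divide by n − 1. The variable names are our own.

```python
import statistics

times = [4, 6, 51, 17, 11, 4, 3, 21, 24, 3]

mean = statistics.mean(times)

# Divisor n: the definition used in the table above.
variance_n = statistics.pvariance(times)
sd_n = statistics.pstdev(times)

# Divisor n - 1: what many calculators and packages report by default.
variance_n_minus_1 = statistics.variance(times)
sd_n_minus_1 = statistics.stdev(times)

print(mean, round(variance_n, 2), round(sd_n, 2))
print(round(variance_n_minus_1, 2), round(sd_n_minus_1, 2))
```

The second line of output shows the larger values that a calculator may give.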

Exercise 15: No. We are adding together lots of numbers which are zero or larger. Therefore, variances and standard deviations are never negative, and indeed it does not make sense to have a negative value. A variance or standard deviation of zero means there is ‘no spread’, i.e. all the observations have exactly the same value.

⁵ If you check this result using a calculator or a computer, you may get a variance of 226.71, and a standard deviation of 15.06. Do not worry. The computer is calculating the unbiased estimate of the population variance/standard deviation, which we will come on to in Part 4.

Exercise 16: This would seem a little generous to the student, as she only just got a first in five of the modules, and was some way off achieving a first in the other

Exercise 17: The mean has been distorted by the single large time (of 185 minutes). The median is not affected by this value. Therefore, for most purposes, such as giving the time a typical University student spends on social networking websites per day, the median would be a better measure.
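A small Python sketch makes the point concrete by computing both measures with and without the largest time. It uses the 50 times from Table 1 and the standard `statistics` module; the variable names are our own.

```python
import statistics

times = [
    0, 4, 2, 0, 6, 17, 0, 51, 13, 10,
    17, 9, 5, 11, 18, 10, 4, 12, 15, 3,
    22, 21, 21, 4, 4, 24, 2, 6, 3, 6,
    2, 7, 29, 1, 9, 6, 34, 32, 27, 19,
    26, 185, 5, 7, 9, 68, 42, 4, 2, 3,
]

# Drop the single largest observation (185 minutes).
without_largest = sorted(times)[:-1]

mean_all = statistics.mean(times)
median_all = statistics.median(times)
mean_trimmed = statistics.mean(without_largest)
median_trimmed = statistics.median(without_largest)

print(round(mean_all, 2), median_all)
print(round(mean_trimmed, 2), median_trimmed)
```

Removing one observation moves the mean by more than three minutes, while the median stays where it was.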

Exercise 18: The top panel in Figure 6 is not very informative — Figure 2 gives

a far more informative picture of the data. The bottom panel in Figure 6 is very

difficult to interpret — it is not very smooth, and does not give a good overview of

the shape of the data.

Exercise 19: Including the largest value, we took 14 bins, with each bin having width 10 minutes. The resulting histogram is that seen in the Introduction, reproduced in the top panel of Figure 15 for convenience. Excluding the largest value, we take bins of width 5 minutes to obtain the histogram in the bottom panel of Figure 15.

We can say that the largest time is abnormally large compared to the rest of the data. Excluding this, the remaining times are between 0 and 70 minutes. Most of them are between 0 and 30 minutes.

Exercise 20: The chance is the same as the proportion of babies between 3001 and

3500 grams, which we can get straight from the histogram (or the frequency table).

\[ P(\text{randomly chosen baby between 3001 and 3500 grams}) = \frac{\text{number of babies between 3001 and 3500 grams}}{\text{total number of babies}} = \frac{19}{44} \]
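The count behind this proportion can be checked directly from the raw data. The sketch below assumes the interval includes both endpoints (3001 and 3500) and uses the 44 birthweights from Table 2; the variable names are our own.

```python
from fractions import Fraction

birthweights = [
    3837, 3334, 3554, 3838, 3625, 2208, 1745, 2846, 3166, 3520, 3380,
    3294, 2576, 3208, 3521, 3746, 3523, 2902, 2635, 3920, 3690, 3430,
    3480, 3116, 3428, 3783, 3345, 3034, 2184, 3300, 2383, 3428, 4162,
    3630, 3406, 3402, 3500, 3736, 3370, 2121, 3150, 3866, 3542, 3278,
]

# Count the babies whose weight falls in the interval, endpoints included.
in_interval = sum(1 for weight in birthweights if 3001 <= weight <= 3500)
proportion = Fraction(in_interval, len(birthweights))

print(in_interval, len(birthweights), float(proportion))
```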

Figure 15: Histogram of daily time spent on social networking websites (top: all values included; bottom: largest value excluded) with bins of width 10 and 5 minutes respectively.

Exercise 21: The total area is the same as the probability of a randomly chosen baby having any weight. The probability of any certain event is 1 (this is how probability is defined).

Exercise 22: There is a meaning to the position of the bars in a histogram — they

are in numerical order. This is not the case in general for a bar chart.

Exercise 23: The value of the e.d.f. only changes when we encounter an observation; at this point it jumps up by 1/n (step 2 of the construction of the e.d.f.). So the e.d.f. went up by 1/n at 2208. It then stays the same until we get to 2383, when it jumps up by 1/n again. So the values of the e.d.f. at 2210 and 2300 must both be the same.
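This can be confirmed numerically by evaluating the e.d.f. at a few points. The sketch below takes the e.d.f. at x to be the number of observations less than or equal to x, divided by n, and uses `bisect_right` from the standard library to do the counting on the sorted birthweights from Table 2. The function name `edf` is our own.

```python
from bisect import bisect_right

birthweights = [
    3837, 3334, 3554, 3838, 3625, 2208, 1745, 2846, 3166, 3520, 3380,
    3294, 2576, 3208, 3521, 3746, 3523, 2902, 2635, 3920, 3690, 3430,
    3480, 3116, 3428, 3783, 3345, 3034, 2184, 3300, 2383, 3428, 4162,
    3630, 3406, 3402, 3500, 3736, 3370, 2121, 3150, 3866, 3542, 3278,
]
ordered = sorted(birthweights)


def edf(x):
    """Proportion of observations that are less than or equal to x."""
    return bisect_right(ordered, x) / len(ordered)


for x in (2208, 2210, 2300, 2383):
    print(x, edf(x))
```

The heights at 2208, 2210 and 2300 agree, and the next jump appears only at 2383.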

Exercise 24: The final e.d.f. is given in Figure 16.

Figure 16: e.d.f. of State areas. (Horizontal axis: State Area.)

Exercise 25: The median is the middle value, and since the value of the e.d.f. at x is the proportion of observations that are no larger than x, we expect half of the observations to be no larger than the middle value.

Exercise 26: The arguments for F(LQ) = 0.25 and F(UQ) = 0.75 are similar

to the previous exercise. Figure 17 is the birthweight data e.d.f. with the lower

quartile and upper quartile lines added. We can estimate from this that the LQ is

around 3100 and the UQ around 3600, to give an estimate of the IQR of 500.

These do not match exactly with the values calculated from the data because we

are only able to estimate from the graph.

Figure 17: e.d.f. of birthweights, with lower quartile and upper quartile lines added (horizontal axis: birthweight).

Exercise 27:

• The histogram requires a subjective decision to be made about the bin sizes,

whereas there are no subjectivities involved in producing an e.d.f.

• It is easier to get an impression of the shape of the distribution from a (well

constructed) histogram.

• It is easier to calculate from the e.d.f. the probability of a random observation landing in any given interval, whereas with histograms we are restricted to the bins that we have defined.

Exercise 28: There is some positive relationship, i.e. in states where the number of cigarettes smoked (hundreds per capita) is larger than average, we would also expect the number of deaths from lung cancer per 100 000 to be larger than average.

Exercise 29: We could read, from the scatter graph, the number of deaths per

100 000 that corresponds to 30 cigarettes smoked (hundreds per capita). We get an

estimate of around 22 deaths per 100 000. See Figure 18 for a visualisation of how

this works.

Figure 18: Scatter plot of the number of cigarettes smoked against the number of deaths from lung cancer, for 44 US states. Looking up expected number of deaths for a state where 30 cigarettes (hundreds per capita) are smoked. (Horizontal axis: Cigarettes Smoked (hundreds per capita) in 1960.)
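The reading from the graph can be compared with a calculated line. The notes do not say how the line of best fit was drawn, so the sketch below assumes ordinary least squares, which is one common choice and not necessarily the method used for the figure. The data are the (Smoke, Lung) pairs from Table 6, and all names are our own.

```python
# (Smoke, Lung) pairs for the 44 states in Table 6.
pairs = [
    (18.20, 17.05), (25.82, 19.80), (18.24, 15.98), (28.60, 22.07),
    (31.10, 22.83), (33.60, 24.55), (40.46, 27.27), (28.27, 23.57),
    (20.10, 13.58), (27.91, 22.80), (26.18, 20.30), (22.12, 16.59),
    (21.84, 16.84), (23.44, 17.71), (21.58, 25.45), (28.92, 20.94),
    (25.91, 26.48), (26.92, 22.04), (24.96, 22.72), (22.06, 14.20),
    (16.08, 15.60), (27.56, 20.98), (23.75, 19.50), (23.32, 16.70),
    (42.40, 23.03), (28.64, 25.95), (21.16, 14.59), (29.14, 25.02),
    (19.96, 12.12), (26.38, 21.89), (23.44, 19.45), (23.78, 12.11),
    (29.18, 23.68), (18.06, 17.45), (20.94, 14.11), (20.08, 17.60),
    (22.57, 20.74), (14.00, 12.01), (25.89, 21.22), (21.17, 20.34),
    (21.25, 20.55), (22.86, 15.53), (28.04, 15.92), (30.34, 25.88),
]

n = len(pairs)
x_bar = sum(x for x, _ in pairs) / n
y_bar = sum(y for _, y in pairs) / n

# Least-squares slope and intercept (assumed method, see the note above).
slope = (sum((x - x_bar) * (y - y_bar) for x, y in pairs)
         / sum((x - x_bar) ** 2 for x, _ in pairs))
intercept = y_bar - slope * x_bar

# Estimated death rate for a state with 30 (hundreds of) cigarettes per person.
predicted = intercept + slope * 30
print(round(predicted, 1))
```

Under this assumption the prediction lands close to the value of around 22 read from the graph.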

Exercise 30: There are many possible examples. For example, one variable being

‘temperature’ and the other ‘scarf sales’.
