Part 2: Summarising Data Numerically and
Graphically
Matthew Sperrin and Juhyun Park
December 12, 2008
1 Introduction
How long do University students spend on social networking websites per day? A
random sample of 50 students were asked to record their social networking website
usage for one day. The results, in minutes spent, are given in Table 1. Looking at
the table only, what can you learn from the data?
Figure 1 shows exactly the same data in a histogram, with minutes spent plotted
along the horizontal axis, and the height of the bars representing the number of
students in each region.
Exercise 1. Can you learn more about social networking site usage from looking at
the graph than you can from the table? Is the graph or the table easier to interpret?
0 4 2 0 6 17 0 51 13 10
17 9 5 11 18 10 4 12 15 3
22 21 21 4 4 24 2 6 3 6
2 7 29 1 9 6 34 32 27 19
26 185 5 7 9 68 42 4 2 3
Table 1: Minutes spent on social networking websites per day
[Figure 1: histogram titled "Minutes spent on social networking websites per day"; horizontal axis: minutes spent; vertical axis: frequency.]
Figure 1: Minutes spent on social networking websites per day
Presenting data graphically can often help us to learn things from the data.
Table 2 gives the birthweights (in grams) of 44 babies born in the Mater Mothers’
Hospital in Brisbane, Queensland, Australia, on December 18, 1997.
Exercise 2. What is the variable being measured here? Is it quantitative? If so, is
it continuous or discrete?1
Suppose you are asked ‘Tell me about the birthweights of these babies’. What
would you say?
1See Part 1 for an introduction to these concepts.
3837 3334 3554 3838 3625 2208 1745 2846 3166 3520 3380
3294 2576 3208 3521 3746 3523 2902 2635 3920 3690 3430
3480 3116 3428 3783 3345 3034 2184 3300 2383 3428 4162
3630 3406 3402 3500 3736 3370 2121 3150 3866 3542 3278
Table 2: Birthweights of babies born in the Mater Mothers’ Hospital in Brisbane,
Queensland, Australia, on December 18, 1997
It would be time-consuming and uninformative to list the weights of every single
baby — it would be better to specify a few numbers that summarise in some way the
weights of the babies.
Graphical and numerical summaries of data come under the joint heading of ex-
ploratory data analysis.
This part of the course has the following objectives:
1. To introduce some numerical and graphical summary methods.
2. To explore which graphical methods/summary statistics are useful in certain
situations, and how to use them together sensibly.
3. To extend the ideas to situations where we are interested in the relationship
between two variables.
2 Visualising the Data
Suppose that we are interested in analysing the birthweight data given in Table 2.
Where should we start?
Whatever the question is that we are going to try and answer, whatever the
purpose of the analysis, a useful first step is to look at the data. Looking at the table
itself is probably not very helpful — it is very difficult to get any sort of intuition on
what the data is like. Getting some visual impression of what the data we have is
like will help us in deciding what we can do with the data, as we will see later.
A possible first step in visualising the data is to produce a histogram, so we briefly
introduce histograms in this section. We work through the process of constructing a
histogram using the birthweight data.
1. We divide the range of the data into (equally) sized bins. Here, the lightest baby
has weight 1745 grams and the heaviest has weight 4162 grams. We will use 6
bins: 1501-2000, 2001-2500, 2501-3000, 3001-3500, 3501-4000 and 4001-4500.
2. Record the number of observations that fall into each bin. Here, we need to
draw a frequency table, and tally the number of birthweights in each category.
For example, the first weight listed in Table 2 is 3837 grams, so falls into the
category ‘3501-4000’.
Bin Number of Observations
1501-2000 1
2001-2500 4
2501-3000 4
3001-3500 19
3501-4000 15
4001-4500 1
3. Plot on a graph. The x axis is the range of the data and the y axis is the number
of observations (count) in each bin. Figure 2 gives the final histogram.
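The tallying in step 2 can be sketched in Python. This is a minimal illustration rather than part of the original notes: the function name `tally` is hypothetical, and treating each bin as including its upper edge is an assumption chosen to match the bins 1501-2000, ..., 4001-4500 used above.

```python
# Birthweights (grams) from Table 2.
birthweights = [
    3837, 3334, 3554, 3838, 3625, 2208, 1745, 2846, 3166, 3520, 3380,
    3294, 2576, 3208, 3521, 3746, 3523, 2902, 2635, 3920, 3690, 3430,
    3480, 3116, 3428, 3783, 3345, 3034, 2184, 3300, 2383, 3428, 4162,
    3630, 3406, 3402, 3500, 3736, 3370, 2121, 3150, 3866, 3542, 3278,
]

# Bin edges for the six bins 1501-2000, 2001-2500, ..., 4001-4500.
# Assumption: a weight w belongs to a bin when lower < w <= upper.
edges = [1500, 2000, 2500, 3000, 3500, 4000, 4500]


def tally(data, edges):
    """Count how many observations fall into each bin (lower, upper]."""
    counts = [0] * (len(edges) - 1)
    for value in data:
        for i in range(len(edges) - 1):
            if edges[i] < value <= edges[i + 1]:
                counts[i] += 1
                break
    return counts


counts = tally(birthweights, edges)
for i, count in enumerate(counts):
    # A crude text histogram: one '#' per observation in the bin.
    print(f"{edges[i] + 1}-{edges[i + 1]}: {count:2d} {'#' * count}")
```

The printed counts should agree with the frequency table above, and they add up to the 44 observations.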
Exercise 3. What do you learn about the babies’ birthweights from this histogram?
For example, what sort of weight are the heaviest babies? What do the babies typically
weigh?
This gives a very brief introduction to histograms, and more technical aspects will
be given in Section 4.
3 Measuring Location and Spread
Recall that we have, in Table 2, the birthweights (in grams) of 44 babies. What sort
of questions might be of interest regarding these birthweights?
[Figure 2: histogram; horizontal axis: birthweight (grams), from 1500 to 4500; vertical axis: frequency.]
Figure 2: Histogram of Birthweights
We might be interested in:
1. What does a typical baby weigh?
2. How spread out are the weights of the babies? Or, how light are ‘light’ babies,
and how heavy are ‘heavy’ babies?
The first question can be answered by calculating a statistic that summarises loca-
tion. We will introduce two ways to measure location — the median and mean. The
second question can be answered by calculating a statistic that summarises spread.
We will introduce three ways to measure spread — the range, the interquartile range,
and the standard deviation. Later, we will talk about the relative advantages and
disadvantages of each measure, and discuss the circumstances when each might be
used.
Before we get started, a brief digression on notation is necessary.
Some Notation
Firstly, we choose a letter to denote the variable we are measuring — X is a common
choice. It is good practice to make it clear what your variables mean, by writing at the
beginning ‘let X denote [the variable we are measuring]’. Instead of saying ‘the first
observation from the data’ we will simply write x1. (If we had chosen Y to denote the
variable, this would be y1.) The tenth observation is written as x10, and so on. We call
the total number of observations in our dataset n, so the final observation is xn. For
example, in the birthweight data in Table 2, let X be the weights of the babies. Then,
reading across the first row we get x1 = 3837, x2 = 3334, x3 = 3554, . . . , x44 = 3278,
and n = 44, as there are 44 births recorded.
Notice the difference here between upper case and lower case letters. Upper case
letters are not numbers — they are variables, telling us what it is we are measuring.
They are random because each time we take a measurement we will get a different
answer. Lower case letters are numbers, because they tell us the value the variable
takes for a specific measurement.
We use brackets if we have put the data in ascending numerical order — the
first observation (which is now the smallest) is called x(1), the second observation (so
second smallest) is called x(2), and so on, up to the largest observation x(n). Table 3
gives the birthweight data in ascending numerical order. Looking at this we can see
that x(1) = 1745, x(2) = 2121,. . ., and x(44) = 4162.
1745 2121 2184 2208 2383 2576 2635 2846 2902 3034 3116
3150 3166 3208 3278 3294 3300 3334 3345 3370 3380 3402
3406 3428 3428 3430 3480 3500 3520 3521 3523 3542 3554
3625 3630 3690 3736 3746 3783 3837 3838 3866 3920 4162
Table 3: Birthweights of babies born in the Mater Mothers’ Hospital in Brisbane,
Queensland, Australia, on December 18, 1997; in ascending numerical order.
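The difference between x1 (the first observation recorded) and x(1) (the smallest observation) can be checked with a short Python sketch. The variable names `x` and `ordered` are illustrative choices; note that Python lists are indexed from 0, so x1 is stored at position 0.

```python
# Birthweights (grams) in the order recorded in Table 2.
x = [
    3837, 3334, 3554, 3838, 3625, 2208, 1745, 2846, 3166, 3520, 3380,
    3294, 2576, 3208, 3521, 3746, 3523, 2902, 2635, 3920, 3690, 3430,
    3480, 3116, 3428, 3783, 3345, 3034, 2184, 3300, 2383, 3428, 4162,
    3630, 3406, 3402, 3500, 3736, 3370, 2121, 3150, 3866, 3542, 3278,
]

n = len(x)            # total number of observations
ordered = sorted(x)   # the data in ascending order, as in Table 3

# Python counts from 0, so x_1 is x[0] and x_n is x[n - 1].
print("x_1 =", x[0], " x_n =", x[n - 1])
# Likewise x_(1) is ordered[0] and x_(n) is ordered[n - 1].
print("x_(1) =", ordered[0], " x_(n) =", ordered[n - 1])
```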
3.1 The Median
The median is one way of answering the question ‘What does a typical baby weigh?’.
Exercise 4. Suppose we have placed the 44 birthweights in ascending numerical order
(as in Table 3). Which of the following values would best reflect what a typical baby
weighs?
• x(2) — i.e. the 2nd smallest value?
• x(22) — a value somewhere around the middle?
• x(42) — one of the largest values?
The median value of a collection of data is the ‘middle’ value when the data is in
numerical order. We use the symbol xm for the median.
Exercise 5. Looking at Table 3, find the ‘middle’ value for the birthweight.
We now give a general procedure for calculating the median, using the birthweight
data as an example.
1. Place the data in numerical order. This is done in Table 3.
2. Take the total number of observations and add 1. So here, there are 44 obser-
vations, so adding 1 gives 45.
3. Divide by 2 — call the result t. So t = 45/2 = 22.5.
4. If the result is a whole number the median is x(t). Otherwise, the result is the
average of the two numbers either side of t. We have t = 22.5, not a whole
number, so our answer is the average of x(22) and x(23). So we get
$$x_m = \frac{x_{(22)} + x_{(23)}}{2} = \frac{3402 + 3406}{2} = 3404$$
So the median of the birthweight data is 3404 grams.
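The four-step procedure can be written as a short Python function. This is a sketch that follows the steps above literally; the function name `median` is our own choice, and the only adjustment is subtracting 1 from positions because Python lists are indexed from 0.

```python
# Birthweights (grams) from Table 2.
birthweights = [
    3837, 3334, 3554, 3838, 3625, 2208, 1745, 2846, 3166, 3520, 3380,
    3294, 2576, 3208, 3521, 3746, 3523, 2902, 2635, 3920, 3690, 3430,
    3480, 3116, 3428, 3783, 3345, 3034, 2184, 3300, 2383, 3428, 4162,
    3630, 3406, 3402, 3500, 3736, 3370, 2121, 3150, 3866, 3542, 3278,
]


def median(data):
    """Median via the (n + 1) / 2 rule described in the text."""
    ordered = sorted(data)            # step 1: numerical order
    t = (len(ordered) + 1) / 2        # steps 2 and 3
    if t == int(t):
        # step 4, whole number: the median is x_(t)
        return ordered[int(t) - 1]
    # step 4, otherwise: average the observations either side of t
    below = ordered[int(t) - 1]       # x_(22) when t = 22.5
    above = ordered[int(t)]           # x_(23) when t = 22.5
    return (below + above) / 2


x_m = median(birthweights)
print(x_m)  # 3404.0
```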
Exercise 6. Why do we add 1 to the total number of observations before we divide
by 2? HINT: you can get intuition by considering a data set with 3 observations (i.e.
n = 3).
Exercise 7. In calculating the median, does it matter whether the data is arranged
in ascending or descending numerical order?
So, the median gives us an answer to the question ‘what does a typical baby
weigh?’ — the median birthweight is 3404 grams.
3.2 The Range and Interquartile Range
We now consider the question, ‘How spread out are the weights of the babies? Or, how
light are ‘light’ babies, and how heavy are ‘heavy’ babies?’ Given that the ‘typical
baby’ weighs 3404 grams, would we be surprised to see a baby weighing 3600 grams, for
example? These are the sorts of questions that we can answer by having an indication
of the spread of the data.
One way we may consider looking at the spread of the data would be the difference
between the heaviest baby and the lightest baby — this is called the range, and in
the notation:
Range = x(n) − x(1)
So, for the birthweight data, the lightest baby is x(1) = 1745, and the heaviest
baby is x(n) = 4162. So the range is
Range = x(n) − x(1) = 4162− 1745 = 2417 grams
Exercise 8. Table 4 gives the time 50 randomly selected University students spend
on social networking websites in a day (in minutes), in ascending numerical order.
Calculate the range of this data. Why might the range not be a good description of
the spread of the data in this case?
Since the range can be so sensitive to outliers — values that are unusually large
or small — we consider a slightly different measure, called the Interquartile Range
(IQR). The IQR takes the range of the middle 50% of the data, meaning it is not
affected by the outliers.
0 0 0 1 2 2 2 2 3 3
3 4 4 4 4 4 5 5 6 6
6 6 7 7 9 9 9 10 10 11
12 13 15 17 17 18 19 21 21 22
24 26 27 29 32 34 42 51 68 185
Table 4: Minutes spent on social networking websites per day, in ascending numerical
order.
• The lower quartile (LQ) is the value that has one quarter of the data smaller,
and three quarters of the data larger than it. It is also known as the 25th
percentile (the 0.25 quantile), as 25% of the data is smaller than it.
• Similarly, the upper quartile (UQ) is the value that has one quarter of the data
larger, and three quarters of the data smaller than it. It is also known as the
75th percentile (the 0.75 quantile), as 75% of the data is smaller than it.
• The Interquartile Range (IQR) is the range of the middle 50% of the data, and
is calculated by taking the difference between the UQ and the LQ — IQR =
UQ− LQ.
Exercise 9. Looking at Table 3, find the lower quartile (one quarter of the way
along the data), the upper quartile (three quarters of the way along the data) and the
interquartile range (difference between the two).
As with the median, there are technicalities involved when the LQ and UQ lie
‘in-between’ two data points. However, there is no agreement on what to do in these
cases. We suggest calculating the LQ as ‘the median of the lower half of the data’,
and the UQ as ‘the median of the upper half of the data’. If there is an odd number
of data points, the usual convention is to exclude the median from both calculations.
We will now give a general algorithm for calculating the LQ, UQ and IQR, using
the birthweight data as an example.
The LQ is the median of the first half of the data, i.e. the top two rows of Table
3, or the first 22 observations. The total number of observations plus 1 is therefore
23. Dividing this by two and calling the answer t gives t = 11.5. So the result is the
average of x(11) and x(12),
$$\text{LQ} = \frac{x_{(11)} + x_{(12)}}{2} = \frac{3116 + 3150}{2} = 3133$$
The UQ is the median of the second half of the data, i.e. the bottom two rows
of Table 3, or observations 23 to 44, the last 22 observations. The total number of
observations plus 1 is therefore 23. Divide this by two to get t = 11.5. So the result
is the average of the 11th and 12th observations in the second half of the data. This
is NOT the same as x(11) and x(12) because they are in the first half of the data! We
add 22 to tell us where to find the UQ because we didn’t use the first 22 observations.
So 11.5 + 22 = 33.5, so the result is the average of x(33) and x(34),
$$\text{UQ} = \frac{x_{(33)} + x_{(34)}}{2} = \frac{3554 + 3625}{2} = 3589.5$$
Now the IQR is simply the difference between the UQ and the LQ:
IQR = UQ− LQ = 3589.5− 3133 = 456.5
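The ‘median of each half’ rule suggested above can be sketched in Python as follows. The function names `median` and `quartiles` are illustrative, and the code assumes the convention stated in the text: with an odd number of data points, the middle value is left out of both halves.

```python
# Birthweights (grams) from Table 2.
birthweights = [
    3837, 3334, 3554, 3838, 3625, 2208, 1745, 2846, 3166, 3520, 3380,
    3294, 2576, 3208, 3521, 3746, 3523, 2902, 2635, 3920, 3690, 3430,
    3480, 3116, 3428, 3783, 3345, 3034, 2184, 3300, 2383, 3428, 4162,
    3630, 3406, 3402, 3500, 3736, 3370, 2121, 3150, 3866, 3542, 3278,
]


def median(data):
    """Median via the (n + 1) / 2 rule from Section 3.1."""
    ordered = sorted(data)
    t = (len(ordered) + 1) / 2
    if t == int(t):
        return ordered[int(t) - 1]
    return (ordered[int(t) - 1] + ordered[int(t)]) / 2


def quartiles(data):
    """Return (LQ, UQ) as the medians of the lower and upper halves."""
    ordered = sorted(data)
    half = len(ordered) // 2                 # size of each half
    lower_half = ordered[:half]              # first `half` observations
    upper_half = ordered[len(ordered) - half:]  # last `half` observations
    # When n is odd, the middle observation belongs to neither half.
    return median(lower_half), median(upper_half)


lq, uq = quartiles(birthweights)
iqr = uq - lq
print(lq, uq, iqr)  # 3133.0 3589.5 456.5
```

Because there is no single agreed rule for quartiles, library routines may follow a different convention and return slightly different values for the same data.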
At this point, we can return to our original question that we were trying to answer
by calculating the spread of the data — how heavy are ‘heavy’ babies and how light
are ‘light’ babies? It remains somewhat subjective what we mean by light and heavy,
but we could say that the upper 25% are ‘heavy’ and the lower 25% are ‘light’. We
know these quantities — they are just the UQ and the LQ. So, ‘heavy’ babies have
weight greater than 3589.5 grams, and ‘light’ babies have weight less than 3133 grams.
Exercise 10. Calculate the Median, LQ, UQ and IQR of the social networking web-
site usage data. Table 4 gives the data in ascending numerical order.
Box Plots
The median, quartiles and range can be summarised in a graphical format, which can
be useful when comparing one sample against another. Firstly, the quantities can be
summarised in a so-called five number summary. This consists of five numbers —
the smallest observation, the lower quartile, the median, the upper quartile and the
largest observation (in that order). For example, for the birthweight data we would
write the five number summary as (1745, 3133, 3404, 3589.5, 4162).
This summary can also be drawn in a box plot. To create a box plot, draw a horizontal
axis covering the range of the data (in this case, the weights of the babies). Draw a small
vertical line at each of the five numbers from the five number summary, then connect
the ends of the line at the LQ to the ends of the line at the UQ to form a box. Figure 3
gives the box plot for the birthweight data.
[Figure 3: box plot drawn against a horizontal axis marked from 2000 to 4000.]
Figure 3: Boxplot of Birthweights
Exercise 11. Suppose that we separate the weights of the 44 babies into boys and girls.
We calculate the five number summaries for each group, and they are, for the boys
(2121,3166,3404,3630,4162) and for the girls (1745,2576,3381,3523,3866). Draw the
two boxplots, one below the other, on a single graph, using the same horizontal axis.
Use this to compare the two sets of birthweights.
3.3 The Mean
We have discussed that it is useful to find location statistics, in order to answer a
question such as, in our birthweight example, what a typical baby weighs. The mean
is an alternative way to calculate this. We will explore the similarities and differences
between the mean and median in Section 3.5.
The mean is calculated by adding up the values of all the observations, then
dividing by the number of observations. This corresponds to sharing the values equally
amongst all the observations. The sample mean is denoted by x̄ (say ‘x-bar’).2 The
formula for the mean is
$$\bar{x} = \frac{x_1 + x_2 + \ldots + x_n}{n}$$
In this formula, we have used ‘. . .’ to show that we have missed out all the middle
terms, but you would of course fill these in when using the formula. There is a nicer
way to indicate a sum of observations than this, by using the symbol ∑, which means
‘the sum of’.
So we can rewrite the formula for the mean as
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
We use the area below the ∑ sign to indicate where the sum starts from, and the
area above to indicate where the sum finishes. The i in this case is called an index.
It changes its value from the smallest value in the sum to the largest value, going
through every whole number in between (i = 1, then i = 2, etc.).
As an example, let’s calculate the mean of the birthweight data. Putting the data
into the formula we get
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i = \frac{3837 + 3334 + \ldots + 3278}{44} \approx 3275.955$$
so the mean is approximately x̄ = 3276. So, the weight of an average baby is 3276
grams.
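The same calculation takes only a few lines of Python. This sketch simply applies the formula (sum of the observations divided by n); the names `total` and `x_bar` are illustrative.

```python
# Birthweights (grams) from Table 2.
birthweights = [
    3837, 3334, 3554, 3838, 3625, 2208, 1745, 2846, 3166, 3520, 3380,
    3294, 2576, 3208, 3521, 3746, 3523, 2902, 2635, 3920, 3690, 3430,
    3480, 3116, 3428, 3783, 3345, 3034, 2184, 3300, 2383, 3428, 4162,
    3630, 3406, 3402, 3500, 3736, 3370, 2121, 3150, 3866, 3542, 3278,
]

n = len(birthweights)        # number of observations
total = sum(birthweights)    # x_1 + x_2 + ... + x_n
x_bar = total / n            # the sample mean

print(total, n, round(x_bar, 3))  # 144142 44 3275.955
```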
2It is x̄ because X is the letter used to denote the variable. If we had used Y, it would be ȳ.
3.4 The Variance and the Standard Deviation
Just as the mean is an alternative to the median for measuring locations, there is also
an alternative to the IQR for measuring spread.
Exercise 12. Using the time spent on social networking websites data in Table 4,
demonstrate why the IQR cannot be said to use all of the data.
HINT: Think about changing the value of the largest time (currently 185) to 285.
Would the IQR change?
The sample variance is defined as the average squared distance from an observation
to the sample mean. The ‘distance’ from an observation xi to the sample mean x̄ is
(xi − x̄). The ‘squaring’ removes any negative values here (e.g. if xi = 3 but x̄ = 5,
then xi − x̄ = −2, but squaring this gives 4).
We call the sample variance s², and the formula is
$$s^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$$
Since we have squared the (xi − x̄) part, the variance is not in the same units as
the observations. For example, if our observations were in metres, the variance would
be in square metres! This makes the variance difficult to interpret.
The standard deviation remedies this problem by simply taking the square root
of the variance. It is denoted by s, and its formula is given by
$$s = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$
If necessary, we can subscript by the letter of the variable, e.g. s²_X and s_X for the
variance and standard deviation of the variable X. It is useful to do this if we are
dealing with more than one variable.
As an indication of how the standard deviation describes spread in a dataset, Fig-
ure 4 gives four examples of histograms, on the same scale, and the sample standard
deviation in each case. Each of the datasets has a mean of 3. The relationship is that
the further from the mean the data tends to be, the larger the standard deviation
is. For example, in the bottom right histogram of Figure 4, all the values are far
from 3, resulting in a large standard deviation. On the other hand, in the bottom
left histogram of Figure 4, the values are all fairly close to 3, resulting in a smaller
standard deviation.
[Figure 4: four histograms of x (horizontal axis from 0 to 6) against frequency, with panel titles Standard Deviation = 1.06, Standard Deviation = 1.63, Standard Deviation = 0.29 and Standard Deviation = 2.54.]
Figure 4: Histograms of four datasets (each with 100 observations) and their associ-
ated standard deviations.
As an example we will now calculate variance and standard deviation of the birth-
weight data, giving a step-by-step approach to doing the calculations.
1. Produce a table with three columns: the original xi values, the xi − x̄ values,
and finally the (xi − x̄)² values.
xi      xi − x̄                 (xi − x̄)²
3837    3837 − 3276 = 561      561² = 314721
3334    3334 − 3276 = 58       58² = 3364
3554    3554 − 3276 = 278      278² = 77284
...     ...                    ...
3278    3278 − 3276 = 2        2² = 4
Sum of (xi − x̄)² for i = 1, ..., n:             11989186
Variance (above line divided by n):             272481.5
Standard Deviation (square root of variance):   522
Table 5: Calculation of variance and standard deviation for the birthweight data
2. Total up the final column of the table and divide by the number of observations.
This is the variance.
3. Take the square root of the answer. This is the standard deviation.
These steps are summarised for the birthweight data in Table 5. We calculated
earlier that x̄ = 3276 grams. Constructing a table like this is helpful when
calculating the variance and standard deviation by hand.
So we conclude that the sample standard deviation of the birthweights is 522
grams. We have omitted many lines in the table — it would be a good exercise to
check that you can reproduce the table in full. These calculations are somewhat time-
consuming to do by hand. Most scientific calculators, and many computer packages,
will do the calculation for you.
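As one example of letting a computer do the work, the three steps can be sketched in Python. The code uses the divisor n from the formula in this section; the variable names are illustrative. The last lines repeat the sum of squares with the mean rounded to 3276, which is the assumption Table 5 makes.

```python
import math

# Birthweights (grams) from Table 2.
birthweights = [
    3837, 3334, 3554, 3838, 3625, 2208, 1745, 2846, 3166, 3520, 3380,
    3294, 2576, 3208, 3521, 3746, 3523, 2902, 2635, 3920, 3690, 3430,
    3480, 3116, 3428, 3783, 3345, 3034, 2184, 3300, 2383, 3428, 4162,
    3630, 3406, 3402, 3500, 3736, 3370, 2121, 3150, 3866, 3542, 3278,
]

n = len(birthweights)
x_bar = sum(birthweights) / n

# Step 1: the squared distance of every observation from the mean.
squared_deviations = [(x - x_bar) ** 2 for x in birthweights]

# Step 2: total the squared distances and divide by n (divisor n, as in the text).
variance = sum(squared_deviations) / n

# Step 3: the standard deviation is the square root of the variance.
std_dev = math.sqrt(variance)

print(round(variance, 1), round(std_dev))  # 272481.5 522

# Table 5 rounds the mean to 3276 before subtracting; that gives a whole-number total.
ss_rounded = sum((x - 3276) ** 2 for x in birthweights)
print(ss_rounded)  # 11989186
```

Many calculators and packages divide by n − 1 rather than n when computing a sample variance, so their answers can differ slightly from the ones here. In Python's standard library, `statistics.pvariance` divides by n and `statistics.variance` divides by n − 1.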
The standard deviation can loosely be interpreted as ‘how far a typical observation
is from the mean’.
Exercise 13. Suppose you wish to hire a typing assistant. The number of pages
typed per day by Assistant A has mean 65 pages and standard deviation 3 pages.
The number of pages typed per day by Assistant B has mean 75 pages and standard
deviation 20 pages.
• Which assistant is the most consistent?
• Which assistant would you expect to type the most pages over a week?
• If these were the two applicants for the job of typing assistant, which would you
hire (assuming you know nothing else about them)?
A small standard deviation means a high consistency or precision.
Exercise 14. Calculate the variance and standard deviation of the following times
spent by University students on social networking websites:3
4 6 51 17 11 4 3 21 24 3
Exercise 15. Is it possible for either the variance or the standard deviation to be
negative?
3.5 Mean or Median?
We conclude this section by looking at the differences between the two location statis-
tics we have introduced — the mean and the median — and discussing the situations
in which each might be used.
Exercise 16. Suppose a student receives the following marks in nine courses (placed
in ascending numerical order). The University awards a first class degree to any
student who earns 70% or more overall.
35 40 42 56 70 70 71 73 73
The exam board proposes that the overall degree classification is calculated based
on the median of the nine marks — here the median is 70, so the student is awarded
a first class degree. Is this a fair result?
3This is a subset of Table 4, to make the calculation more manageable.
Exercise 17. Figure 5 gives the annual wage of 50 UK full time workers, chosen in
a random sample. The median of the wages is 21.5 thousand pounds, and the mean
is 28.7 thousand pounds. Explain why the mean is larger than the median. Which of
the mean and the median is more useful here?
[Figure 5: histogram; horizontal axis: annual wage (in £1000s), from 0 to 250; vertical axis: frequency.]
Figure 5: Histogram of the wages of 50 randomly selected full time workers in the
UK.
The above questions illustrate situations where it is possible to select one of the
location statistics over the other — the mean is sometimes more useful than the
median, but in other situations the median can be more useful than the mean.
In general, the problem with the mean is that it is affected by extreme values. In
the wages example, there is one person in the sample earning £250,000, which is far
more than everybody else. This pulls the mean upwards. Therefore, in cases
where there are outliers in the data, the median is often a better choice.
The weakness of the median is that it cares only about the order that the data
is in, and the value of the middle observation. In the degree classification example
above, the median of 70% does not represent well what the data looks like. When
there are no outliers in the data, the mean is often more useful. So, it is important to
look at the data (for example by viewing histograms) before deciding which summary
statistics to report.
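The effect of a single extreme value on the two location statistics can be seen in a small Python experiment. The scenario is hypothetical and not part of the original data: we pretend the heaviest birthweight, 4162 grams, was mistyped as 41620 grams, and compare the mean and median before and after. The sketch uses the `mean` and `median` functions from Python's standard `statistics` module.

```python
import statistics

# Birthweights (grams) from Table 2.
birthweights = [
    3837, 3334, 3554, 3838, 3625, 2208, 1745, 2846, 3166, 3520, 3380,
    3294, 2576, 3208, 3521, 3746, 3523, 2902, 2635, 3920, 3690, 3430,
    3480, 3116, 3428, 3783, 3345, 3034, 2184, 3300, 2383, 3428, 4162,
    3630, 3406, 3402, 3500, 3736, 3370, 2121, 3150, 3866, 3542, 3278,
]

# Hypothetical typing error: the heaviest weight gains an extra zero.
with_outlier = [41620 if w == 4162 else w for w in birthweights]

for label, data in [("original", birthweights), ("with outlier", with_outlier)]:
    # The median depends only on the middle of the ordered data;
    # the mean uses the actual value of every observation.
    print(label, round(statistics.mean(data), 1), statistics.median(data))
```

The median is the same in both cases, while the mean rises by several hundred grams because of one value.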
4 The Shape of the Data
We introduced the histogram in Section 2. In this section we look in more detail at
the histogram, and introduce other graphical methods to visualise the shape of the
data.
4.1 Histograms Revisited
When we first introduced the histogram in Section 2, we selected 6 equally sized bins
to divide the data into, without explaining why we made this choice. In fact, choosing
the number of bins to use is something of an art in producing histograms, and is often
done by trial and error. Figure 6 gives two histograms of the birthweight data: the
first has two bins (each bin being 2000 grams wide), and the second has 48 bins (each
bin being 50 grams wide). Both were created using identical data, namely the birthweight
data from Table 2.
Exercise 18. Using Figures 2 and 6, comment on the consequences of making a poor
choice for the number of bins.
You will find when you use statistical packages to produce histograms that an
appropriate number of bins is selected automatically.
[Figure 6: two histograms of birthweight (grams) against frequency; the top panel's horizontal axis runs from 1000 to 5000, the bottom panel's from 2000 to 4000.]
Figure 6: Histogram of Birthweights, with 2 bins (top) and 48 bins (bottom).
Exercise 19. Table 4 gives the time spent by 50 randomly chosen students on social
networking websites (in minutes). Produce a histogram, with appropriate bin width,
for this data. (Note: you may find it useful to exclude the largest time (185 minutes)
to produce a more interesting histogram). Interpret the results.
Density Histograms
Exercise 20. Suppose you take the names of all the babies born in the Mater Mothers’
Hospital in Brisbane, Queensland, Australia, on December 18, 1997, put them in
a hat, and select one at random. What is the chance of the chosen baby having
birthweight between 3001 and 3500 grams?
In Figure 7 we have the same histogram as in Figure 2, but this time we rescale
the height of the bars so that the area of the bar in each bin represents the chance of
a randomly chosen birthweight coming from that bin.
[Figure 7: density histogram; horizontal axis: birthweight (grams), from 1500 to 4500; vertical axis: density, from 0 to 0.0008.]
Figure 7: Density Histogram of Birthweights
How have we calculated this? The chance of a randomly chosen birthweight (or,
in general, a randomly chosen observation) coming from a bin should be
$$\frac{\text{Number of Observations in Bin}}{\text{Total Number of Observations}}$$
The area of the bar is Height of bar × Width of bar.
We want the chance and the area to match, which means that
$$\text{Height of bar} = \frac{\text{Number of Observations in Bin}}{\text{Width of bar} \times \text{Total Number of Observations}}$$
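The rescaling can be tried out in Python. This sketch starts from the bin counts in the frequency table of Section 2 and assumes every bin is 500 grams wide, as chosen there; the names `heights` and `areas` are illustrative.

```python
# Bin counts from the frequency table in Section 2
# (bins 1501-2000, 2001-2500, ..., 4001-4500).
counts = [1, 4, 4, 19, 15, 1]
width = 500          # every bin is 500 grams wide
n = sum(counts)      # total number of observations

# Height of bar = count / (width * total number of observations).
heights = [count / (width * n) for count in counts]

# Area of bar = height * width, which is the proportion of data in the bin.
areas = [height * width for height in heights]

for count, height, area in zip(counts, heights, areas):
    print(f"count={count:2d}  height={height:.6f}  area={area:.4f}")
print("total area:", round(sum(areas), 10))
```

The tallest bar has height 19 / (500 × 44), a little under 0.0009, which is the scale of the vertical axis in Figure 7.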
Exercise 21. Explain why the total area of all the bars in the new histogram is 1.
4.2 Bar Charts
You are probably familiar with a method of displaying categorical data that is very
similar to a histogram — the bar chart.
A bar chart could be used, for example, to visualise and compare the number of
people in a sample with different hair colours (this example was also used in Part 1).
Suppose we collect a sample of 100 people and record their hair colour. We record
the results in a frequency table:
Hair Colour Number of Observations
Black 10
Brown 51
Fair 28
Ginger 3
Other 8
A bar chart is then constructed from the frequency table in exactly the same way
as we did for the histogram. Figure 8 gives two possible bar charts we could construct
from this frequency table.4 The difference between the two is that we have changed
the order of the bars along the horizontal axis.
Exercise 22. Why is it not sensible to change the order of the bars along the hori-
zontal axis in a histogram, like that in Figure 7?
Bar charts are useful ways to display categorical data. They must not be confused
with histograms, which are used to summarise quantitative data.
Figure 8: Two bar charts showing the sample proportions of 100 people’s hair colour.
4Note though, it is usual to arrange the bars in descending height order (as in Figure 8, left
panel).
4.3 Empirical Distribution Functions
The histogram is very useful for estimating the chance of a randomly chosen ob-
servation being within a given range of values, or equivalently, the proportion of
observations within that same region (see Section 2). However, we have already seen
that the histogram is limited in that its shape depends on the number of bins we
choose to classify our data into. In particular, if we are interested in estimating,
say, the chance of a random observation being less than a certain value, we will get
different answers depending on the bins we have chosen for the histogram.
The empirical distribution function (e.d.f.) is a means of displaying all the quantiles
of a set of data. Two special quantiles, the LQ (25th percentile) and the UQ (75th
percentile), have already been introduced in Section 3.2. Producing the e.d.f. does not
require any subjective decisions (like choosing the number of bins in the histogram
setting). Therefore, there is one unique e.d.f. for any collection of data. There is an
added bonus with the e.d.f. — we can read the median and quartiles from the graph
quickly and easily, as we will see.
We label the e.d.f. as a capital letter with a tilde (‘˜’) above it.
The formula for the e.d.f. is
$$\tilde{F}(x) = \frac{\text{Number of observations smaller than } x}{\text{Total number of observations}}$$
For example, the 25th percentile, or lower quartile, is the value which has 25% of
the data smaller than it. In Section 3.2 we calculated that the LQ for the birthweight
data is 3133 grams. So, F̃(3133) = 11/44 = 0.25, because the e.d.f. measures the
proportion of the data smaller than a fixed value.
We will now work through the construction of the e.d.f. using the birthweight
data.
1. We construct the graph using the ordered data (as in Table 3 for the birthweight
data). The x axis covers the range of the data (so you could use the same range
as the histogram, for example), and the y axis gives the value of F (x). The
graph is built like a staircase, starting from the y value of 0, on the left of the
smallest observation on the x axis, and finishing at the y value of 1, on the right
of the largest observation on the x axis.
2. Now, every time we encounter an observation, the y value goes up by one ‘step’.
The amount we go up is 1/n, with n = number of observations, as usual. If there
are ‘ties’ (i.e. more than one observation with the same value), we take a bigger
step, whose size is (1/n) × number of ties.
3. The graph finishes when the y value reaches 1. If you run out of data before
reaching 1, or go past 1, you know you have made a mistake somewhere! Figure
9 shows what the final e.d.f. for the birthweight data should look like.
Ensure you understand why constructing the graph in this way gives the e.d.f.
that is defined in Equation 4.3. Also ensure that you understand that, unlike with
histograms, there is no aspect of the e.d.f. that we can change to give a ‘different’
graph for the same set of data. Notice that the e.d.f. looks like a staircase — it moves
up in ‘jumps’.
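The definition of the e.d.f. translates directly into Python. This sketch follows the formula in the text, counting observations strictly smaller than x; the function name `edf` is our own. Many textbooks and packages instead count observations less than or equal to x, and the two versions differ only at the data points themselves.

```python
# Birthweights (grams) from Table 2.
birthweights = [
    3837, 3334, 3554, 3838, 3625, 2208, 1745, 2846, 3166, 3520, 3380,
    3294, 2576, 3208, 3521, 3746, 3523, 2902, 2635, 3920, 3690, 3430,
    3480, 3116, 3428, 3783, 3345, 3034, 2184, 3300, 2383, 3428, 4162,
    3630, 3406, 3402, 3500, 3736, 3370, 2121, 3150, 3866, 3542, 3278,
]


def edf(data, x):
    """Proportion of observations strictly smaller than x."""
    smaller = sum(1 for value in data if value < x)
    return smaller / len(data)


print(edf(birthweights, 1500))  # 0.0, to the left of the smallest observation
print(edf(birthweights, 3133))  # 0.25, the value at the lower quartile
print(edf(birthweights, 4500))  # 1.0, to the right of the largest observation
```

Evaluating `edf` at any two values that lie between the same pair of consecutive observations gives the same answer, which is why the graph is flat between jumps.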
Exercise 23. Two consecutive birthweights from Table 3 are 2208 and 2383. Explain
why F (2210) and F (2300) must be the same.
Exercise 24. Produce an e.d.f. for the social networking website data (Table 4).
Reading the Median and Quantiles from the e.d.f.
Recall that the median of a collection of data is the ‘middle value’. The e.d.f. is
defined in Equation 4.3 as
$$\tilde{F}(x) = \frac{\text{Number of observations smaller than } x}{\text{Total number of observations}}$$
[Figure 9: staircase plot of the e.d.f.; horizontal axis: birthweight, from 1500 to 4000; vertical axis: Fn(x), from 0.0 to 1.0.]
Figure 9: e.d.f. of Birthweights
Exercise 25. Explain why F (xm) = 0.5, i.e. the value of the e.d.f. at the median is
0.5.
Since the value of the e.d.f. at the median is 0.5, we can easily read off an estimate
of the median from our e.d.f. by working backwards. As we know that
F(xm) = 0.5, we draw a horizontal line across from 0.5 on the y-axis to the e.d.f. and
read off the corresponding x-value, which gives us the median. This process is illustrated
for the birthweight data in Figure 10, and we get an estimate for the median of around 3400.
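The ‘working backwards’ idea can also be sketched in Python. This is only an illustration under stated assumptions: the function name `read_quantile` and the data are hypothetical, and we take the estimate to be the smallest observation at which the e.d.f. has reached the required height, which is one reasonable way of formalising reading across from the y-axis.

```python
def read_quantile(data, p):
    """Smallest observed x at which the e.d.f. has reached height p."""
    ordered = sorted(data)
    n = len(ordered)
    for count, x in enumerate(ordered, start=1):
        # count / n is the height reached after the first `count` ordered observations
        if count / n >= p:
            return x
    raise ValueError("p must be between 0 and 1")

# Illustrative data (not from the notes).
example = [12, 7, 3, 7, 10]
print(read_quantile(example, 0.5))   # height 0.5 is first reached at x = 7
```

Because the e.d.f. moves in jumps, a value read off in this way can differ slightly from a median or quartile calculated directly from the ordered data.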
Exercise 26. Show that, in the same way, the value of the e.d.f. at the LQ and UQ
is 0.25 and 0.75 respectively. Use this to estimate from the e.d.f. in Figure 9 the
LQ, UQ and IQR of the birthweight data. Compare with the estimates you already
calculated from the data, and explain why they may not be exactly the same.
Figure 10: e.d.f. of birthweights, with median line added (horizontal axis: birthweight; vertical axis: Fn(x))
Exercise 27. Summarise the advantages and disadvantages of Empirical Distribution
Functions versus Histograms.
5 The Relationship Between Two Variables
So far in this part of the course, we have looked at graphical methods and summary
statistics for a single variable only. Can you think of situations where it may be of
interest to look at the relationship between two variables?
There has been much said about the supposed link between smoking and lung
cancer. We have data available, from 44 US states, on two variables.
• Smoke: Number of cigarettes smoked (hundreds per person) in 1960.
• Lung: Deaths per 100 000 population from lung cancer.
Smoke  Lung   Smoke  Lung   Smoke  Lung   Smoke  Lung
18.20  17.05  25.82  19.80  18.24  15.98  28.60  22.07
31.10  22.83  33.60  24.55  40.46  27.27  28.27  23.57
20.10  13.58  27.91  22.80  26.18  20.30  22.12  16.59
21.84  16.84  23.44  17.71  21.58  25.45  28.92  20.94
25.91  26.48  26.92  22.04  24.96  22.72  22.06  14.20
16.08  15.60  27.56  20.98  23.75  19.50  23.32  16.70
42.40  23.03  28.64  25.95  21.16  14.59  29.14  25.02
19.96  12.12  26.38  21.89  23.44  19.45  23.78  12.11
29.18  23.68  18.06  17.45  20.94  14.11  20.08  17.60
22.57  20.74  14.00  12.01  25.89  21.22  21.17  20.34
21.25  20.55  22.86  15.53  28.04  15.92  30.34  25.88
Table 6: Numbers of cigarettes smoked (hundreds per person) in 1960 and deaths
per 100 000 population from lung cancer, in 44 US states.
The data is given in Table 6.
Looking at the table, can you see any relationship between the number of cigarettes
smoked, and the rate of deaths from lung cancer?
Clearly, this is difficult if not impossible to do.
5.1 Scatter Graphs
A scatter graph is a visualisation of the relationship between two quantitative (numeric)
variables. So far, we have been using X to denote our variable. Now that there are
two variables, we will use X to denote the number of cigarettes smoked (hundreds per
person), and Y to denote the number of deaths from lung cancer per 100 000 population.
To construct a scatter graph, we plot the specific X and Y values for each state
onto a graph as co-ordinates. The horizontal axis has the range of the X variable,
and the vertical axis takes the range of the Y variable. Then, for example, for the
first state in the table, we draw a cross on the graph at (18.20, 17.05). After plotting
all of the points from Table 6, we end up with a scatter graph — as in Figure 11.
Figure 11: Scatter plot of the number of cigarettes smoked against the number of
deaths from lung cancer, for 44 US states (horizontal axis: Cigarettes Smoked (hundreds
per capita) in 1960; vertical axis: Deaths per 100 000 of Lung Cancer).
Exercise 28. What does Figure 11 suggest about the relationship between cigarettes
smoked and the number of deaths through lung cancer?
The diagonal line in Figure 11 is a line of best fit. This is drawn in such a way
as to represent the underlying relationship between the X and Y variables. Since
this line slopes upwards we would say there is a positive relationship between X and
Y , i.e. as the number of cigarettes smoked increases, so does the number of deaths
through lung cancer.
Exercise 29. How might the line of best fit be used to estimate the death rate from
lung cancer in a state for which we only know that the number of (hundreds of)
cigarettes smoked per person was 30?
Exercise 30. Can you think of two variables that may have a negative relationship,
i.e. as one increases in value, the other one decreases?
5.2 Correlation
Suppose you are asked to describe the relationship between smoking and lung cancer
using Figure 11 over the telephone. What would you say? Is this easy to do?
As well as having a picture of the relationship between two variables, it is also
useful to have some sort of numerical summary (this would be a lot easier to describe
over the telephone!).
The correlation between two variables is a number that describes the strength of
the linear (i.e. straight line) relationship between them. We use the symbol r for
the sample correlation, and subscript it with the two variables we are calculating the
correlation between. For example rXY is the sample correlation between smoking
and lung cancer, from the previous section. The correlation is always between −1
and 1. A positive number corresponds to a positive linear relationship between the
two variables (i.e. as one increases, the other increases), and a negative number
corresponds to a negative relationship (i.e. as one increases, the other decreases).
Figure 12 shows some scatter graphs of the relationship between two variables, and
the associated correlations, to give a feel for what the numbers mean.
The fact that correlation measures linear relationships between variables is im-
portant to remember. It is always useful to plot the data in a scatter graph first, to
see whether there is an indication of a relationship between the two variables that is
not linear. Figure 13 gives some examples of calculated correlations where the rela-
tionship between two variables is not linear. The correlation can be very misleading
in these instances.
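To see this numerically, here is a minimal Python sketch. It is our own illustration rather than one of the datasets in the figures: the function name `correlation` and the seven data points are assumptions made for the example, and the calculation uses the divisor-n formula for the sample correlation given in this section. The points lie exactly on the curve y = x², yet the correlation is zero.

```python
from math import sqrt

def correlation(xs, ys):
    """Sample correlation, using the divisor n throughout."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    s_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return cross / (n * s_x * s_y)

# Illustrative data: a perfect curved (quadratic) relationship.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]
print(correlation(xs, ys))
```

Y is completely determined by X here, but because the relationship is not a straight line the correlation does not pick it up. A scatter graph of the same points would show the pattern immediately.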
Figure 12: Scatter plots of nine datasets (each with 100 observations and 2 variables
per observation) and their associated correlations. Panel titles: Correlation = 1, 0.95,
0.78, 0.45, 0.21, 0.08, −1, −0.79 and −0.54.
Figure 13: Scatter plots of four datasets (each with 100 observations and 2 variables
per observation) and their associated correlations, where the relationship between the
two variables is not linear. Panel titles: Correlation = −0.03, 0.02, −0.13 and 0.
The formula for calculating the sample correlation between two variables X and Y is:
\[ r_{XY} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n \, s_X \, s_Y} \]
We will calculate the correlation for the lung cancer and smoking data. As we did
for the variance, it is helpful to draw a table here. We have calculated the means and
standard deviations in advance, using the divisor n for the standard deviations as before.
For the ‘smoke’ variable, $\bar{x} = 24.91$ and $s_X = 5.51$.
For the ‘lung’ variable, $\bar{y} = 19.65$ and $s_Y = 4.18$ (feel free to verify these yourself!).
$x_i$ (Smoke)   $y_i$ (Lung)   $x_i - \bar{x}$   $y_i - \bar{y}$   $(x_i - \bar{x})(y_i - \bar{y})$
18.20           17.05          −6.71             −2.60             17.48
31.10           22.83          6.18              3.18              19.65
20.10           13.58          −4.81             −6.07             29.24
⋮               ⋮              ⋮                 ⋮                 ⋮
30.34           25.88          5.43              6.23              33.79
$\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$: 706.66
Correlation (above line divided by $n \times s_X \times s_Y$): 0.70
A correlation of 0.70 is a fairly strong correlation.
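The whole table can be reproduced with a short Python sketch. The variable names are our own, and the pairs are typed in from Table 6 reading down each pair of columns. The same divisor n is used for both standard deviations and for the correlation itself. Mixing this formula with standard deviations calculated using n − 1 (which is what many calculators and computer packages report) would shrink the answer by a factor of (n − 1)/n.

```python
from math import sqrt

# (Smoke, Lung) pairs from Table 6, read down each pair of columns.
table6 = [
    (18.20, 17.05), (31.10, 22.83), (20.10, 13.58), (21.84, 16.84),
    (25.91, 26.48), (16.08, 15.60), (42.40, 23.03), (19.96, 12.12),
    (29.18, 23.68), (22.57, 20.74), (21.25, 20.55), (25.82, 19.80),
    (33.60, 24.55), (27.91, 22.80), (23.44, 17.71), (26.92, 22.04),
    (27.56, 20.98), (28.64, 25.95), (26.38, 21.89), (18.06, 17.45),
    (14.00, 12.01), (22.86, 15.53), (18.24, 15.98), (40.46, 27.27),
    (26.18, 20.30), (21.58, 25.45), (24.96, 22.72), (23.75, 19.50),
    (21.16, 14.59), (23.44, 19.45), (20.94, 14.11), (25.89, 21.22),
    (28.04, 15.92), (28.60, 22.07), (28.27, 23.57), (22.12, 16.59),
    (28.92, 20.94), (22.06, 14.20), (23.32, 16.70), (29.14, 25.02),
    (23.78, 12.11), (20.08, 17.60), (21.17, 20.34), (30.34, 25.88),
]

n = len(table6)
mean_x = sum(x for x, _ in table6) / n
mean_y = sum(y for _, y in table6) / n

# Standard deviations with divisor n, matching the variance formula in these notes.
s_x = sqrt(sum((x - mean_x) ** 2 for x, _ in table6) / n)
s_y = sqrt(sum((y - mean_y) ** 2 for _, y in table6) / n)

# Sum of the final column of the working table.
cross_sum = sum((x - mean_x) * (y - mean_y) for x, y in table6)

r_xy = cross_sum / (n * s_x * s_y)
print(round(mean_x, 2), round(mean_y, 2), round(r_xy, 2))
```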
6 Summary
There were three objectives to this part of the course; let’s summarise how we have
tackled each one.
1. To introduce some basic graphical methods and summary statistics.
We have introduced a range of techniques. This is not exhaustive: there are
many other graphical methods and summary statistics we could have considered.
2. To motivate which graphical methods/summary statistics are useful in certain
situations, and how to use them together sensibly.
It is vital to identify which tools are useful to us for different purposes. Usually,
we will use a combination of graphical representations, and summary statistics,
to learn about the variables in the data we have collected.
3. To extend the ideas to situations where we are interested in measuring two
variables.
The final section introduces scatter plots and correlation, as methods specific to
dealing with the relationship between two variables. Do remember though, it is
still important to look at each of the variables individually, using the methods
in the earlier sections.
A Hints and Answers to Exercises
Exercise 1: Hopefully, you believe the graph is easier to interpret. We can see that
the majority of people spend less than ten minutes on social networking websites
in a day, with a very small number spending longer than 50 minutes. There is one
outlying observation — one person spends 185 minutes in the day, which is far longer
than everybody else.
Exercise 2: The variable is ‘Weight of baby’. It is certainly quantitative, and
since a weight is measured on a scale, it is continuous (although, in the data we have
in Table 2, it is rounded off to the nearest gram).
Exercise 3: All babies are between 1500 and 4500 grams, with most weights lying
between 3000 and 4000 grams. You may have other comments.
Exercise 4: x(22) would be the most sensible.
Exercise 5: This is worked through in the text.
Exercise 6: This is easily seen if we think about the case with three observations
only. For example, suppose we have observations 4, 7 and 12. If we did not add one,
we’d be looking for the median in between the 1st and 2nd value, which clearly doesn’t
make sense. This peculiarity is because of the way we count: if we started counting
from 0 rather than 1, we would not have this problem!
Exercise 7: No. Whichever way around the numbers are written, the same num-
bers will always be in the middle. However, it is conventional to write the numbers
in ascending numerical order.
Exercise 8:
Range = x(n) − x(1) = 185− 0 = 185 minutes
This is not particularly useful because the largest time, x(n) = 185 does not fit in with
the rest of the data — if we excluded this value the range would be only 68 minutes.
It is undesirable that the range is so sensitive to this one, unusual value. This is why
the interquartile range is more commonly used for measuring spread.
Exercise 9: Answered in the text.
Exercise 10: To calculate the median, total number of observations +1 = 51.
Then t = 51/2 = 25.5. So the median is the average of the 25th and 26th value in
Table 4:
\[ x_m = \frac{x_{(25)} + x_{(26)}}{2} = \frac{9 + 9}{2} = 9 \text{ minutes} \]
The LQ is the median of the first 25 observations, so LQ = x(13) = 4 minutes.
The UQ is the median of the last 25 observations, so UQ = x(38) = 21 minutes. Then
IQR = UQ− LQ = 21− 4 = 17 minutes.
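These values can be checked with a short Python sketch. The data are the 50 times from Table 1. The helper name `median` and the rule for splitting the data into halves are our own choices: with an even number of observations each half contains n/2 values, as above, and for an odd number of observations we assume the middle value is left out of both halves.

```python
def median(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2

# Minutes spent on social networking websites per day (Table 1).
minutes = [
    0, 4, 2, 0, 6, 17, 0, 51, 13, 10,
    17, 9, 5, 11, 18, 10, 4, 12, 15, 3,
    22, 21, 21, 4, 4, 24, 2, 6, 3, 6,
    2, 7, 29, 1, 9, 6, 34, 32, 27, 19,
    26, 185, 5, 7, 9, 68, 42, 4, 2, 3,
]

ordered = sorted(minutes)
half = len(ordered) // 2
x_m = median(ordered)
lq = median(ordered[:half])    # median of the first 25 observations
uq = median(ordered[-half:])   # median of the last 25 observations
iqr = uq - lq
print(x_m, lq, uq, iqr)
```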
Exercise 11: Your box plot should look like Figure 14, although you may have the
boxplots the other way around, which is fine.
Comparing the two, the boys are heavier at birth on average (the median is
slightly larger), and the IQR for the boys is much smaller, meaning that in general,
boys’ weights cluster more closely around the median than the girls’.
Figure 14: Boxplot of the birthweights of the boys (top) and the girls (bottom), on a
horizontal axis running from 2000 to 4000.
Exercise 12: Imagine partitioning the data into three groups:
1. Smaller than the LQ
2. Between the LQ and the UQ
3. Larger than the UQ
Then, changing the value of any observation would not affect the IQR, provided
the observation stays in the same group. For example, changing the value of the
largest time from 185 to 285 would not change the IQR, since the value clearly re-
mains in the group ‘larger than the UQ’.
Exercise 13: Assistant A is the most consistent as the standard deviation is
smaller, but Assistant B would be expected to type the most pages over the week as
the mean of Assistant B is larger. Either answer is correct for the final part — it
depends on your preferences. If quantity only is important, then Assistant B is the
best choice. However, if consistent output on a daily basis is required then Assistant
A should be chosen.
Exercise 14: We reproduce the table in full. First, calculate the mean and check
you get 14.4. Also, the number of observations n = 10.
$x_i$   $x_i - \bar{x}$   $(x_i - \bar{x})^2$
4       −10.4             108.16
6       −8.4              70.56
51      36.6              1339.56
17      2.6               6.76
11      −3.4              11.56
4       −10.4             108.16
3       −11.4             129.96
21      6.6               43.56
24      9.6               92.16
3       −11.4             129.96
$\sum_{i=1}^{n} (x_i - \bar{x})^2$: 2040.4
Variance (above line divided by n): 204.04
Standard Deviation (square root of variance): 14.28
So the sample variance is 204.04 and the sample standard deviation is 14.28 (see footnote 5).
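If you would like to check the table by computer, the following Python sketch uses the standard library `statistics` module. Its `pvariance` and `pstdev` functions divide by n, as we do here, while `variance` and `stdev` divide by n − 1, which is the version many calculators and packages report by default. The variable names are our own.

```python
import statistics

# The ten observations from the table above.
times = [4, 6, 51, 17, 11, 4, 3, 21, 24, 3]

# Divisor n: the definition used in these notes.
var_n = statistics.pvariance(times)
sd_n = statistics.pstdev(times)

# Divisor n - 1: what many calculators and packages report.
var_n1 = statistics.variance(times)
sd_n1 = statistics.stdev(times)

print(round(var_n, 2), round(sd_n, 2))     # 204.04 14.28
print(round(var_n1, 2), round(sd_n1, 2))   # 226.71 15.06
```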
Exercise 15: No. We are adding together lots of numbers which are zero or larger.
Therefore, variances and standard deviations are never negative, and indeed it
would not make sense to have a negative value. A variance or standard deviation of
zero means there is ‘no spread’, i.e. all the observations have exactly the same value.
Footnote 5: If you check this result using a calculator or a computer, you may get a
variance of 226.71 and a standard deviation of 15.06. Do not worry: the computer is
calculating the unbiased estimate of the population variance/standard deviation, which
we will come on to in Part 4.
Exercise 16: This would seem a little generous to the student, as she only just
got a first in five of the modules, and was some way off achieving a first in the other
four.
Exercise 17: The mean has been distorted by the single large time (of 185 min-
utes). The median is not affected by this value. Therefore, for most purposes, such
as giving the time a typical University student spends on social networking websites
per day, the median would be a better measure.
Exercise 18: The top panel in Figure 6 is not very informative — Figure 2 gives
a far more informative picture of the data. The bottom panel in Figure 6 is very
difficult to interpret — it is not very smooth, and does not give a good overview of
the shape of the data.
Exercise 19: Including the largest value, we took 14 bins, with each bin having
width 10 minutes. The resulting histogram is that seen in the Introduction, repro-
duced in the top panel of Figure 15 for convenience. Excluding the largest value, we
take bins of width 5 minutes to obtain the histogram in the bottom panel of Figure
15.
We can say that the largest time is abnormally large compared to the rest of the
data. Excluding this, the remaining times are between 0 and 70 minutes. Most of
them are between 0 and 30 minutes.
Exercise 20: The chance is the same as the proportion of babies between 3001 and
3500 grams, which we can get straight from the histogram (or the frequency table).
\[
\begin{aligned}
&\text{Probability[Randomly chosen baby between 3001 and 3500 grams]} \\
&\quad = \frac{\text{Number of babies between 3001 and 3500}}{\text{Total number of babies}} = \frac{19}{44} = 0.43
\end{aligned}
\]
Figure 15: Histogram of daily time spent on social networking websites (top: all
values included, bottom: largest value excluded) with bins of width 10 and 5 minutes
respectively. In both panels the horizontal axis is ‘Minutes spent on social networking
websites per day’ and the vertical axis is ‘Frequency’.
Exercise 21: The total area is the same as the probability of a randomly chosen
baby having any weight. The probability of any certain event is 1 (this is how prob-
ability is defined).
Exercise 22: There is a meaning to the position of the bars in a histogram — they
are in numerical order. This is not the case in general for a bar chart.
Exercise 23: The value of the e.d.f. only changes when we encounter an observation;
at this point it jumps up by 1/n (step 2 of the construction of the e.d.f.). So the
e.d.f. went up by 1/n at 2208. It then stays the same until we get to 2383, when it
jumps up by 1/n again. So the values of the e.d.f. at 2210 and 2300 must be the same.
Exercise 24: The final e.d.f. is given in Figure 16.
Figure 16: e.d.f. of State areas (horizontal axis: State Area; vertical axis: Fn(x)).
Exercise 25: The median is the middle value, and since the value of the e.d.f. is
\[ F(x) = \frac{\text{Number of observations smaller than } x}{\text{Total number of observations}} \]
then we expect half of the observations to be smaller than the middle value.
Exercise 26: The arguments for F(LQ) = 0.25 and F(UQ) = 0.75 are similar
to the previous exercise. Figure 17 is the birthweight data e.d.f. with the lower
quartile and upper quartile lines added. We can estimate from this that the LQ is
around 3100 and the UQ around 3600, to give an estimate of the IQR of 500.
These do not match exactly with the values calculated from the data because we
are only able to estimate from the graph.
Figure 17: e.d.f. of birthweights, with lower quartile and upper quartile lines added
(horizontal axis: birthweight; vertical axis: Fn(x)).
Exercise 27:
• The histogram requires a subjective decision to be made about the bin sizes,
whereas there are no subjectivities involved in producing an e.d.f.
• It is easier to get an impression of the shape of the distribution from a (well
constructed) histogram.
• It is easier to calculate from the e.d.f. the probability of a random observation
landing in any given interval, whereas with histograms we are restricted
to the bins that we have defined.
Exercise 28: There is some positive relationship, i.e. in states where the number
of cigarettes smoked (hundreds per capita) is larger than average, we would also
expect the number of deaths from lung cancer per 100 000 to be larger than average.
Exercise 29: We could read, from the scatter graph, the number of deaths per
100 000 that corresponds to 30 cigarettes smoked (hundreds per capita). We get an
estimate of around 22 deaths per 100 000. See Figure 18 for a visualisation of how
this works.
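The reading from the graph can be cross-checked numerically. The notes do not say how the line of best fit in Figure 11 was obtained, so the Python sketch below assumes the usual least-squares line (slope equal to the sum of cross-products divided by the sum of squared X deviations). The variable names and the helper `predict` are our own, and the pairs are typed in from Table 6.

```python
# (Smoke, Lung) pairs from Table 6, read down each pair of columns.
table6 = [
    (18.20, 17.05), (31.10, 22.83), (20.10, 13.58), (21.84, 16.84),
    (25.91, 26.48), (16.08, 15.60), (42.40, 23.03), (19.96, 12.12),
    (29.18, 23.68), (22.57, 20.74), (21.25, 20.55), (25.82, 19.80),
    (33.60, 24.55), (27.91, 22.80), (23.44, 17.71), (26.92, 22.04),
    (27.56, 20.98), (28.64, 25.95), (26.38, 21.89), (18.06, 17.45),
    (14.00, 12.01), (22.86, 15.53), (18.24, 15.98), (40.46, 27.27),
    (26.18, 20.30), (21.58, 25.45), (24.96, 22.72), (23.75, 19.50),
    (21.16, 14.59), (23.44, 19.45), (20.94, 14.11), (25.89, 21.22),
    (28.04, 15.92), (28.60, 22.07), (28.27, 23.57), (22.12, 16.59),
    (28.92, 20.94), (22.06, 14.20), (23.32, 16.70), (29.14, 25.02),
    (23.78, 12.11), (20.08, 17.60), (21.17, 20.34), (30.34, 25.88),
]

n = len(table6)
mean_x = sum(x for x, _ in table6) / n
mean_y = sum(y for _, y in table6) / n

# Least-squares slope and intercept (an assumed fitting method).
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in table6)
s_xx = sum((x - mean_x) ** 2 for x, _ in table6)
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x

def predict(smoke):
    """Estimated deaths per 100 000 for a given value of Smoke."""
    return intercept + slope * smoke

estimate = predict(30)
print(round(estimate, 1))
```

The result should be close to the value of around 22 read from the graph; small differences are to be expected, since a reading from a graph is only approximate.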
Figure 18: Scatter plot of the number of cigarettes smoked against the number of
deaths from lung cancer, for 44 US states. Looking up expected number of deaths
for a state where 30 cigarettes (hundreds per capita) are smoked. (Horizontal axis:
Cigarettes Smoked (hundreds per capita) in 1960; vertical axis: Deaths per 100 000
of Lung Cancer.)
Exercise 30: There are many possible examples. For example, one variable being
‘temperature’ and the other ‘scarf sales’.