The Institute of Chartered Accountants of Sri Lanka
Postgraduate Diploma in Business Finance and Strategy
Quantitative Methods for Business Studies
Handout 02: Presentation and Analysis of data
Tables and Charts for Categorical Data
One-way tabulation
A one-way table has two columns. One column lists the categories, and the other gives the frequencies
or percentages with which the items in those categories occur.
Example:
Two-way tabulation
When the data are tabulated according to two characteristics at a time, it is said to be double tabulation
or two-way tabulation.
Example:
Complex Tabulation
When the data are tabulated according to many characteristics, it is said to be complex tabulation.
Example
Pie Charts
A pie chart is a circle divided into wedges of varying sizes, like slices of a pie or pizza. The
relative sizes of the wedges correspond to the relative frequencies of the categories.
Simple bar Chart
In bar charts, each category of data is depicted by a bar, the height of which represents the
frequency or percentage of observations falling into a category.
Multiple bar chart
In a multiple bar chart, two or more sets of inter-related data are represented, facilitating comparison
between two or more phenomena.
Stacked Bar chart
In a stacked bar chart, data series are stacked one on top of the other in vertical columns.
[Figure: chart titled "How Do You Spend the Holidays?" with the categories At home with family, Travel to visit family, Vacation, Catching up on work and Other; shares shown: 45%, 38%, 5%, 5%, 7%]
Pictogram
A pictogram is a statistical graphic in which the size of the picture is intended to represent the
frequencies or size of the values being represented.
Organizing Numerical data
As the number of observations gets large, it becomes more and more difficult to focus on the major
features in a set of data. We need ways to organize the observations so that we can better
understand what information the data is conveying. Large sets of data are presented under groups
to facilitate better presentation.
Presenting Numerical data in tables and graphs
Frequency Distribution
A frequency distribution is a table in which the data is arranged into conveniently established,
numerically ordered class groupings or categories.
Frequency Distribution for Ungrouped Data
When the number of values in a variable is small, an ungrouped frequency distribution would be appropriate.
Exercise
20 families were surveyed to find how many children they had. The data obtained were as follows.
0, 2, 3, 1, 1, 3, 4, 2, 0, 3, 4, 2, 2, 1, 0, 4, 1, 2, 2, 3
Construct a frequency distribution.
Frequency Distribution for Grouped Data
Sometimes it is impractical to prepare a frequency distribution table using ungrouped data. This is particularly true when there is a large number of observations. In such cases it is better to collect the observations into groups or classes with clearly defined upper and lower limits. The following steps have to be followed to construct a frequency distribution for a grouped set of data.
1 Find the range of the distribution (i.e. the difference between the highest and lowest values).
2 Select class intervals of a convenient size. For most distributions about 6 to 12 classes will be sufficient. Usually interval widths of 5, 10, 20 and so on are suitable.
3 Mark the number of values falling within each class interval using tally marks and construct the frequency distribution table.
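The steps for a grouped frequency distribution can be sketched in Python as follows. This is only an illustration: the data values, the starting limit of 10 and the class width of 10 are assumptions chosen for a tiny sample (so it uses fewer classes than the 6 to 12 suggested for real data), and the helper name `grouped_frequency` is hypothetical. It assumes whole-number observations.

```python
from collections import Counter

# Hypothetical whole-number observations (not taken from the handout).
data = [12, 27, 33, 18, 41, 25, 36, 22, 29, 44, 15, 31]

def grouped_frequency(values, start, width):
    """Return (lower limit, upper limit, frequency) for each class."""
    # Class index of a value: how many full widths it lies above `start`.
    counts = Counter((v - start) // width for v in values)
    n_classes = (max(values) - start) // width + 1
    table = []
    for k in range(n_classes):
        lower = start + k * width
        table.append((lower, lower + width - 1, counts.get(k, 0)))
    return table

data_range = max(data) - min(data)                    # step 1: range
table = grouped_frequency(data, start=10, width=10)   # steps 2 and 3

for lower, upper, freq in table:
    print(f"{lower}-{upper}: {freq}")
```

The class width of 10 is the difference between consecutive lower class limits (10, 20, 30, 40).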
Important definitions relating to frequency distribution.
Lower class limit
These are the smallest numbers that can actually belong to different classes.
Upper class limit
These are the largest numbers that can actually belong to different classes.
Class width
This is the difference between two consecutive lower class limits or upper class limits.
Cumulative Frequency distributions
The cumulative frequency of a class is the sum of frequencies for that class and all previous classes.
Relative Frequency Distribution
The relative frequency of a class can be found by dividing the class frequency by the total of all
frequencies. This can also be given as a percentage.
Relative frequency = class frequency / total frequency
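As a small illustration, cumulative and relative frequencies could be computed as below. The class labels and frequencies are hypothetical values, not data from this handout.

```python
from itertools import accumulate

# Hypothetical class frequencies (not taken from the handout).
classes = ["10-19", "20-29", "30-39", "40-49"]
freqs = [3, 4, 3, 2]

total = sum(freqs)
cumulative = list(accumulate(freqs))       # running total of frequencies
relative = [f / total for f in freqs]      # class frequency / total frequency

for label, f, cf, rf in zip(classes, freqs, cumulative, relative):
    print(f"{label}: f={f}, cumulative={cf}, relative={rf:.1%}")
```

The last cumulative frequency equals the total number of observations, and the relative frequencies add up to 1 (or 100%).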
Frequency Histogram
A histogram is a column or bar graph of a frequency table. However, unlike bar graphs, there are no gaps between adjacent bars.
Frequency Polygon
A frequency polygon is a line graph drawn by joining the mid-points of the tops of the rectangles in a histogram with straight lines. To construct a frequency polygon, plot the frequencies against the mid-points of each interval and join the points with straight lines.
Ogive or Cumulative Frequency Graph
An ogive is the graphical presentation of a cumulative frequency distribution. Ogives are used when it is considered more useful to determine the number (or proportion) of data items that fall above or below a particular value rather than within a given interval. There are 2 types of ogives: “less than” and “greater than” ogives. The following steps have to be followed to construct a less than ogive.
1 Compute the cumulative frequencies of the distribution in ascending order.
2 Prepare a graph with the cumulative frequency on the vertical axis and the class intervals on the horizontal axis.
3 After plotting the first point, plot the respective cumulative frequencies against the upper limits of each class.
4 Join all the points with straight lines.
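The points of a less than ogive could be prepared as in the sketch below. The classes are hypothetical, and the choice of first point (the lower limit of the first class with a cumulative frequency of 0) is an assumption, since the steps above do not say which point to start from.

```python
from itertools import accumulate

# Hypothetical classes as (lower limit, upper limit, frequency).
classes = [(10, 19, 3), (20, 29, 4), (30, 39, 3), (40, 49, 2)]

def less_than_ogive_points(classes):
    """Return (x, cumulative frequency) pairs to plot and join with lines."""
    cumulative = accumulate(f for _, _, f in classes)
    # Assumed first point: lower limit of the first class, nothing below it.
    first_point = (classes[0][0], 0)
    return [first_point] + [
        (upper, cf) for (_, upper, _), cf in zip(classes, cumulative)
    ]

points = less_than_ogive_points(classes)
print(points)
```

Joining these points with straight lines on a graph gives the ogive; the height at any upper limit is the number of observations at or below that limit.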
Exercise
1. The manager of Jerry’s salon recently asked his last 50 customers to punch a time card when
they first arrived at the shop and to punch out right after they paid for their hair cut. He then
used the data on the cards to measure how long it took Jerry and his hair dressers to cut hair in
order to schedule their appointment intervals. The following data were tabulated.
50 21 36 35 35 27 38 51 28 35
32 32 27 25 24 38 43 46 29 45
39 27 36 38 35 31 28 38 33 46
35 31 38 48 23 25 43 31 32 38
43 32 18 43 52 52 49 53 46 19
i. Form the frequency distribution and percentage distribution.
ii. Plot the histogram
iii. Plot the frequency polygon.
iv. Form the cumulative percentage distribution.
v. Plot the cumulative percentage polygon.
vi. On the basis of the results of (i)-(v), comment on the time gap to be kept between two
consecutive appointments for haircuts.
2. Moore Travel, a nationwide travel agency, offers special rates on certain Caribbean cruises to
senior citizens. The president of Moore Travel wants additional information on the ages of
those people taking cruises. A random sample of 40 customers taking a cruise last year revealed
these ages:
77 18 63 84 38 54 50 59 54 56
36 26 50 34 44 41 58 58 53 51
62 43 52 53 63 62 62 65 61 52
60 60 45 66 83 71 63 58 61 71
i. Organize the data into a frequency distribution, using 7 classes and 15 as the
lower limit of the first class.
ii. Where do the data tend to cluster?
iii. Determine the relative frequency distribution.
iv. Describe the distribution.
The Stem and Leaf Display
A stem-and-leaf diagram, also called a stem-and-leaf plot, is a diagram that quickly summarizes
data while maintaining the individual data points. In such a diagram, the "stem" is a column of the
unique elements of data after removing the last digit. The final digits ("leaves") of each column
are then placed in a row next to the appropriate column and sorted in numerical order.
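A stem-and-leaf display for two-digit (or longer) whole numbers could be built as in this sketch. The data values are hypothetical and the function name `stem_and_leaf` is an illustrative choice; it assumes non-negative whole numbers.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem (all digits but the last) to its sorted leaves."""
    stems = defaultdict(list)
    for v in sorted(values):          # sorting first keeps leaves in order
        stems[v // 10].append(v % 10)
    return dict(stems)

# Hypothetical data (not taken from the handout).
data = [23, 31, 25, 47, 38, 31, 29, 40]
display = stem_and_leaf(data)

for stem in sorted(display):
    leaves = " ".join(str(leaf) for leaf in display[stem])
    print(f"{stem} | {leaves}")
```

Each printed row shows one stem followed by its leaves, so the individual data points can still be read back (for example, stem 2 with leaf 3 is the value 23).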
Exercise
1. A bank wants to find the number of times a particular automated teller machine (ATM) is
used each day. The following is the number of times it was used during each of the last 30
days. Develop a stem-and-leaf display. Summarize the data on the number of times the
machine was used:
83 64 84 76 84 54 75 59 70 61
63 80 84 73 68 52 65 90 52 77
95 36 78 61 59 84 47 87 60 66
How many times was the ATM used on a typical day?
What were the largest and the smallest number of times the ATM was used?
Around what values did the number of times the ATM was used, tend to cluster?
2. The back-to-back stem-and-leaf plot below shows the LDL cholesterol levels (in milligrams per
deciliter, mg/dL) of two groups of people, smokers and non-smokers. The digits in the stem
represent the hundreds and tens, and the digit in the leaf represents the ones. For example,
11|8 = 118 and so on.
i. People with a cholesterol level of 129 or less are said to have a near ideal level of
cholesterol. How many people, in each group, have a near ideal level of cholesterol?
ii. People with a cholesterol level between 130 and 159 inclusive are said to be in the borderline
high range. How many people, in each group, are in the borderline high range?
iii. People with a cholesterol level between 160 and 189 inclusive are said to have a high
level of cholesterol. How many people, in each group, have a high level of cholesterol?
iv. People with a cholesterol level of 190 or above are said to have a very high level of
cholesterol. How many people, in each group, have a very high level of cholesterol?
v. Comparing the two groups, which group has more people with a higher level of
cholesterol?
Measures of Central Tendency and Dispersion
Measures of Central Tendency
A measure of central tendency is a value at the center or middle of a data set. It condenses the
mass of data into one single value and enables us to get an idea of the entire set of data. It also
enables comparison of two or more sets of data.
Mean
The mean is computed by dividing the sum of the values of each and every observation by the
total number of observations.
Mean for population:
$$\mu = \frac{\sum_{i=1}^{N} X_i}{N}$$
Mean for sample:
$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$$
Where,
$\sum$ = symbol representing summation,
$X_i$ = the values in the data set,
N = number of values in the population,
n = number of values in the sample set.
When the mean is computed for grouped data, the midpoint of every class interval is taken to
represent the values within the class. The mean is given as,
$$\bar{x} = \frac{\sum f x}{\sum f}$$
Where, f = the frequency of each class
x = mid point of each class
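The grouped mean can be illustrated with a short sketch. The class table below is hypothetical (it is not one of the exercise tables), and each class is assumed to be given by its real lower and upper limits.

```python
# Hypothetical classes as (real lower limit, real upper limit, frequency).
classes = [(0, 10, 3), (10, 20, 7), (20, 30, 5), (30, 40, 1)]

def grouped_mean(classes):
    """Mean of grouped data: sum(f * x) / sum(f), with x the class midpoint."""
    total_f = sum(f for _, _, f in classes)
    total_fx = sum(f * (lower + upper) / 2 for lower, upper, f in classes)
    return total_fx / total_f

mean = grouped_mean(classes)
print(mean)
```

Here the midpoints are 5, 15, 25 and 35, so the sum of f·x is 15 + 105 + 125 + 35 = 280 and the mean is 280 / 16 = 17.5.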
Median
Median is the middle value in a range of values arranged in sequence by size.
When the data is arranged in the ascending or descending order,
Median = the [(n+1)/2]th value.
The median for a grouped set of data is the (n/2)th term. To calculate the median for grouped
data, the cumulative frequency distribution has to be constructed first and then the following
formula can be applied.
$$\text{Median} = L + \left(\frac{n/2 - C}{f}\right) \times i$$
Where,
L = real lower limit of the median class,
n = number of items of data,
C = cumulative frequency of the class prior to the median class,
f = frequency of the median class,
i = class interval width.
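A minimal sketch of the grouped median calculation follows, using a hypothetical class table. The median class is taken to be the first class whose cumulative frequency reaches n/2.

```python
# Hypothetical classes as (real lower limit, real upper limit, frequency).
classes = [(0, 10, 3), (10, 20, 7), (20, 30, 5), (30, 40, 1)]

def grouped_median(classes):
    """Median = L + ((n/2 - C) / f) * i for the median class."""
    n = sum(f for _, _, f in classes)
    target = n / 2
    cumulative = 0                      # C: cumulative frequency so far
    for lower, upper, f in classes:
        if f > 0 and cumulative + f >= target:
            return lower + (target - cumulative) / f * (upper - lower)
        cumulative += f
    raise ValueError("classes must contain at least one observation")

median = grouped_median(classes)
print(round(median, 2))
```

With n = 16 the target is the 8th term, which falls in the class 10-20 (C = 3, f = 7, i = 10), so the median is 10 + (5/7) × 10, roughly 17.14.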
Mode
Mode is the most frequently occurring figure in an ungrouped set of data.
For grouped data, mode can be estimated using the following formula.
$$\text{Mode} = L + \left(\frac{d_1}{d_1 + d_2}\right) \times i$$
Where,
L = real lower limit of the modal class,
d1 = difference in frequencies between the modal class and the preceding class,
d2 = difference in frequencies between the modal class and the following class,
i = class interval width.
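The mode estimate for grouped data could be computed as sketched below. The class table is hypothetical, and treating a missing neighbouring class (when the modal class is the first or last one) as having frequency 0 is an assumption of this sketch.

```python
# Hypothetical classes as (real lower limit, real upper limit, frequency).
classes = [(0, 10, 3), (10, 20, 7), (20, 30, 5), (30, 40, 1)]

def grouped_mode(classes):
    """Mode = L + (d1 / (d1 + d2)) * i for the modal class."""
    freqs = [f for _, _, f in classes]
    m = freqs.index(max(freqs))                  # first class with highest frequency
    lower, upper, f_modal = classes[m]
    # Assumption: a missing neighbouring class counts as frequency 0.
    f_prev = freqs[m - 1] if m > 0 else 0
    f_next = freqs[m + 1] if m < len(freqs) - 1 else 0
    d1 = f_modal - f_prev
    d2 = f_modal - f_next
    # Note: d1 + d2 is zero only if both neighbours equal the modal frequency.
    return lower + d1 / (d1 + d2) * (upper - lower)

mode = grouped_mode(classes)
print(round(mode, 2))
```

For this table the modal class is 10-20 with d1 = 7 − 3 = 4 and d2 = 7 − 5 = 2, giving 10 + (4/6) × 10, roughly 16.67.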
Exercises
1. Refer to the hair cutting data for Jerry’s salon given in the frequency distribution exercises
above. Compute the mean, median and mode.
The weighted mean
Weighted mean is used in situations where scores vary in their degree of importance. In such
situations, different weights are attached to different scores. A weight is a value corresponding
to how much the score is counted. Given a list of scores x1, x2, …, xn and a corresponding list of
weights w1, w2, …, wn, the weighted mean is obtained by the formula
$$\text{Weighted mean} = \frac{\sum (w \cdot x)}{\sum w}$$
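As an illustration of the weighted mean, the sketch below uses hypothetical scores and weights (deliberately different from the exercise that follows). The function name `weighted_mean` is an illustrative choice.

```python
def weighted_mean(scores, weights):
    """Weighted mean = sum(w * x) / sum(w)."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    return sum(w * x for x, w in zip(scores, weights)) / sum(weights)

# Hypothetical scores and weights (not taken from the handout).
scores = [50, 60, 80]
weights = [2, 3, 5]

result = weighted_mean(scores, weights)
print(result)
```

Here the weighted sum is 100 + 180 + 400 = 680 and the weights total 10, so the weighted mean is 68, compared with an unweighted mean of about 63.3.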
Exercise
The final score of a course is computed as a weighted mean with the weights 10% for mid term
test, 30% for assignment and 60% for the written exam. A student scored 80 marks for the mid
term test, 70 for the assignment and 60 for the written exam. Find the final mark obtained for
the course.
Measures of Non Central Tendency
Quartiles
Quartiles are employed particularly when summarising or describing the properties of large
sets of numerical data. The quartiles are descriptive measures that split the ordered data into
four quarters.
In an ungrouped set of scores, Q1 = [(n+1)/4]th value and Q3 = [3(n+1)/4]th value.
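Quartiles for grouped data:
$$Q_1 \text{ (lower quartile)} = L + \left(\frac{n/4 - C}{f_{Q_1}}\right) \times i$$
$$Q_2 \text{ (median)} = L + \left(\frac{n/2 - C}{f_{Q_2}}\right) \times i$$
$$Q_3 \text{ (upper quartile)} = L + \left(\frac{3n/4 - C}{f_{Q_3}}\right) \times i$$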
where
L = real lower limit of the quartile class ,
n = total number of observations in the entire data set,
C = cumulative frequency in the class immediately before the quartile class,
fQ = frequency of the relevant quartile class,
i = the length of the real class interval of the relevant quartile class.
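The three grouped-data quartile formulas share one pattern, which the sketch below captures in a single hypothetical helper, `grouped_quartile`. The class table is illustrative and not taken from the exercises.

```python
# Hypothetical classes as (real lower limit, real upper limit, frequency).
classes = [(0, 10, 3), (10, 20, 7), (20, 30, 5), (30, 40, 1)]

def grouped_quartile(classes, k):
    """k-th quartile (k = 1, 2, 3): L + ((k*n/4 - C) / f) * i."""
    n = sum(f for _, _, f in classes)
    target = k * n / 4
    cumulative = 0                      # C: cumulative frequency before the class
    for lower, upper, f in classes:
        if f > 0 and cumulative + f >= target:
            return lower + (target - cumulative) / f * (upper - lower)
        cumulative += f
    raise ValueError("classes must contain at least one observation")

q1 = grouped_quartile(classes, 1)
q2 = grouped_quartile(classes, 2)
q3 = grouped_quartile(classes, 3)
print(round(q1, 2), round(q2, 2), round(q3, 2))
```

With n = 16, Q1 and Q2 fall in the class 10-20 and Q3 falls in the class 20-30 (C = 10, f = 5), so Q3 = 20 + (2/5) × 10 = 24.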
Exercises
1. A manufacturer of flashlight batteries took a sample of 13 batteries from a day’s production
and used them continuously until they were drained. The number of hours they were used
until failure were
342, 426, 317, 545, 264, 451, 1049, 631, 512, 266, 492, 562, 298
i. Compute the mean, median, mode and mid range (average of highest and lowest).
ii. Looking at the distribution of times to failure, which measure do you think is best?
iii. In what ways would this information be useful to the manufacturer? Discuss.
iv. Using the information above, what would you advise if the manufacturer wanted to be able
to say in advertisements that “these batteries should last 400 hours”?
2. The following frequency distribution shows the annual profit levels in last financial year
with respect to 100 small enterprises in a city. Previous studies show that average monthly
profit of small enterprises for the previous financial year was Rs 40,000.
Profit (Rs '000)    Number of enterprises
0-10                3
10-20               6
20-30               11
30-40               18
40-50               23
50-60               16
60-70               15
70-80               8
Total               100
i. Find the mean, median and mode, and comment on the profit values compared to
the previous year.
ii. Find the profit below which the lowest 25% of the enterprises fall.
iii. Draw the cumulative percentage distribution and find the answer for part ii graphically.
Measures of Variation
Measures of variation describe the spread of individual values around the central value.
Range
The range is the difference between the highest value and the lowest value in a set of data.
Interquartile Range
The inter-quartile range (IQR) = Q3 - Q1
where Q1 = lower quartile, Q3 = upper quartile.
IQR, unlike the range, is not affected by the extreme values. It shows the spread of the
middle 50% of data.
The five number summary and Box plot
The five numbers that help describe the center, spread and shape of data are:
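Xsmallest, First Quartile (Q1), Median (Q2), Third Quartile (Q3) and Xlargest
A box plot is a graphical display of the data based on the five-number summary.
[Figure: box plot marking Xsmallest, Q1, Median, Q3 and Xlargest; each of the four sections between these points contains 25% of the data]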
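The five-number summary and the inter-quartile range for ungrouped data could be computed as in the sketch below, using the position rules Q1 = [(n+1)/4]th value, median = [(n+1)/2]th value and Q3 = [3(n+1)/4]th value. The data are hypothetical, and the linear interpolation used when a position is not a whole number is an assumption, since the handout does not state a rule for that case.

```python
def value_at(sorted_values, position):
    """Value at a 1-based position in sorted data.

    Fractional positions are linearly interpolated between neighbours
    (an assumption of this sketch, not a rule from the handout).
    """
    lower = int(position)
    fraction = position - lower
    if lower < 1:
        return sorted_values[0]
    if lower >= len(sorted_values):
        return sorted_values[-1]
    below, above = sorted_values[lower - 1], sorted_values[lower]
    return below + fraction * (above - below)

def five_number_summary(values):
    ordered = sorted(values)
    n = len(ordered)
    q1 = value_at(ordered, (n + 1) / 4)
    q2 = value_at(ordered, (n + 1) / 2)
    q3 = value_at(ordered, 3 * (n + 1) / 4)
    return ordered[0], q1, q2, q3, ordered[-1]

# Hypothetical data (not taken from the handout).
data = [12, 4, 30, 7, 18, 9, 15]
summary = five_number_summary(data)
iqr = summary[3] - summary[1]
print(summary, iqr)
```

With n = 7 the quartile positions are the 2nd, 4th and 6th ordered values, giving 7, 12 and 18 and an IQR of 11. The largest value, 30, affects the range but not the IQR.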
Mean Absolute Deviation
The mean deviation is the average of the absolute deviations taken from the mean.
Considering sample data,
$$\text{Mean deviation} = \frac{\sum |x - \bar{x}|}{n}$$
Variance
The variance is the average of the squared deviations of the values from the mean.
Variance of population:
$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$$
Variance of the sample:
$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$$
Variance for grouped data:
$$S^2 = \frac{\sum f (X_i - \bar{X})^2}{n - 1}$$
where f = the frequency of each class and $X_i$ = mid point of each class.
Standard deviation
Standard deviation shows the variation of the figures about the mean. It is the square root of
the variance.
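Standard deviation for population:
$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}}$$
Standard deviation of sample:
$$S = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$$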
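The difference between the population and sample versions of the variance and standard deviation (dividing by N versus n − 1) can be seen in a short sketch. The data are hypothetical; Python's standard `statistics` module is used only to cross-check the manual calculation.

```python
import math
import statistics

# Hypothetical data (not taken from the handout).
data = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(data)
mean = sum(data) / n
squared_deviations = sum((x - mean) ** 2 for x in data)

population_variance = squared_deviations / n       # divide by N
sample_variance = squared_deviations / (n - 1)     # divide by n - 1
population_sd = math.sqrt(population_variance)
sample_sd = math.sqrt(sample_variance)

print(population_variance, population_sd)
print(round(sample_variance, 4), round(sample_sd, 4))

# Cross-check against the standard library.
print(statistics.pstdev(data), statistics.stdev(data))
```

For these values the squared deviations sum to 32, so the population variance is 32 / 8 = 4 (standard deviation 2), while the sample variance is 32 / 7, which is slightly larger.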
Exercises
1. The following data, collected over 10 days, show the number of rejected cathode
ray tubes out of 120 inspected per day.
Number rejected: 8 9 11 7 11 6 7 9 12 10
i. What is the average number of rejects per day?
ii. What is the standard deviation of the rejects?
2. A set of final examination grades in an introductory statistics course was found to be distributed
with a mean of 73 and a standard deviation of 8. Are you better off with a grade of 81 on this
exam or a grade of 68 on a different exam where the mean is 62 and standard deviation is 3?
Show statistically and explain.
3. Refer to the hair cutting data for Jerry’s salon given in the frequency distribution exercises
above. Compute the standard deviation.
Coefficient of variation (Cv)
The coefficient of variation measures the scatter in the data relative to the mean. A lower
coefficient of variation indicates a lower relative dispersion.
$$C_v = \frac{s}{\bar{x}}$$
Where,
s - standard deviation
$\bar{x}$ - mean.
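The sketch below compares two hypothetical data sets that have the same standard deviation but very different means, to show why the coefficient of variation is a relative measure. The sample standard deviation is used here, which is an assumption about which version of s is intended.

```python
import statistics

def coefficient_of_variation(values):
    """Cv = s / mean, using the sample standard deviation for s."""
    return statistics.stdev(values) / statistics.mean(values)

# Hypothetical data sets (not taken from the handout).
set_a = [10, 12, 14]       # mean 12, s = 2
set_b = [100, 102, 104]    # mean 102, s = 2

cv_a = coefficient_of_variation(set_a)
cv_b = coefficient_of_variation(set_b)
print(round(cv_a, 4), round(cv_b, 4))
```

Both sets have s = 2, but set B has the lower coefficient of variation (about 0.02 against about 0.17), so its values are less dispersed relative to their mean.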
Exercise
To set an appropriate price for a product, it is necessary to be able to estimate its cost of
production. One element of the cost is based on the length of time spent by workers to produce
the product. The most widely used technique for making such a measurement is the time study.
In a time study, the task to be studied is divided into measurable parts and each is measured
with a stopwatch or filmed for later analysis. For each worker, this process is repeated many
times for each sub task. Then the average and standard deviation of the time required to
complete each subtask are computed for each worker.
The data (in minutes) given in the table are the result of a time study of a production operation
involving two tasks.
            Worker A               Worker B
Repetition  Subtask 1  Subtask 2   Subtask 1  Subtask 2
1           30         2           31         7
2           28         4           30         2
3           31         3           32         6
4           38         3           30         5
5           25         2           29         4
6           29         4           30         1
7           30         3           31         4
i. For each worker, find the mean and the standard deviation for each subtask.
ii. If you could choose workers similar to A or B to perform subtasks 1 and 2, which type
would you assign to each subtask?
iii. Explain your decision on the basis of your answer to part ii above.
Shape of a Distribution
Skewness
A symmetrical distribution is represented by a curve that can be divided by a vertical line into
two parts which are mirror images of each other.
A skewed distribution is represented by a curve that lacks symmetry.
Differences among mean, median and mode can be seen from the graphs of symmetrical and
skewed distributions.
Symmetric: Mean = Median = Mode
Left-Skewed: Mean < Median < Mode
Right-Skewed: Mode < Median < Mean
Measure of skewness
$$\text{Pearson's coefficient of skewness} = \frac{3(\text{Mean} - \text{Median})}{\text{Standard deviation}}$$
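Pearson's coefficient of skewness is commonly computed as three times the difference between the mean and the median, divided by the standard deviation. The sketch below applies it to hypothetical data and uses the sample standard deviation, which is an assumption of this example.

```python
import statistics

# Hypothetical data (not taken from the handout).
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)
median = statistics.median(data)
s = statistics.stdev(data)          # sample standard deviation (assumed)

skewness = 3 * (mean - median) / s
print(round(skewness, 3))
```

Here the mean (5) exceeds the median (4.5), so the coefficient is positive, which points to a right-skewed distribution. A negative value would point to a left-skewed one, and a value near zero to a roughly symmetric one.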
Exercises:
1. In a factory, the time during working hours in which a machine is not operating as a result of
breakage or failure is called the “downtime”. The following distribution shows a sample of 100
downtimes of a certain machine (rounded to the nearest minute):
i. Find the mean, median and the mode
ii. Find the standard deviation
iii. Comment about the skewness of data
2. Data on the frequency of absenteeism in a plant owned by Novel Electronics is shown in the table
below. Management expects your statistical expertise to analyse their employee absenteeism
levels as compared with national norms. Studies show that average number of days per year
that employees are absent across the nation for similar plants is about 27.
Days absent per year    Frequency
0-10                    2
10-20                   5
20-30                   6
30-40                   7
40-50                   5
50-60                   3
60-70                   2
Total                   30
i. Calculate the mean and the median levels of absenteeism and comment about the
absenteeism level of Novel compared to the national norms.
ii. Calculate the standard deviation of absenteeism.
iii. If the management of Novel decides to give a reward to employees who are in the lower
25% of absenteeism levels, find the level of absenteeism below which the reward is
applicable.
iv. Draw the cumulative frequency polygon for the absenteeism levels and find the answer for
part iii graphically.
3. A cinema is showing 3 films, A,B and C. The ages of people watching the films are
illustrated in the following box and whisker plots.
Describe the ages of people watching the three films. In your view, which film is suitable
for children, for young adults and for adults?
4. A bank called for applications for the post of management trainee. The applicants were asked to
sit for three papers: Language, Numerical skills and general knowledge. The results of the
performance for the three papers are given below.
Analyse the performance of the candidates in the three exam papers.
Paper               Mean      Median    Std. Deviation   Skewness
Language            77.2071   80.1000   7.91857          -0.801
Numerical           68.1429   71.1000   13.11099         -1.168
General knowledge   70.0071   72.9000   12.60088         -0.424
Total               71.7857   73.3500   11.84818         -0.984