
The Institute of Chartered Accountants of Sri Lanka

Postgraduate Diploma in Business Finance and Strategy

Quantitative Methods for Business Studies

Handout 02: Presentation and Analysis of data

Tables and Charts for Categorical Data

One way tabulation

A one-way table has two columns. One column lists the categories, and the other gives the frequencies or percentages with which the items in the categories occur.

Example:

Two way tabulation

When the data are tabulated according to two characteristics at a time, it is said to be double tabulation or two-way tabulation.

Example:

Complex Tabulation

When the data are tabulated according to many characteristics, it is said to be complex tabulation.

Example

Pie Charts

A pie chart is a circle divided into wedges of varying sizes, like slices of a pie or pizza. The relative sizes of the wedges correspond to the relative frequencies of the categories.

Simple bar Chart

In bar charts, each category of data is depicted by a bar, the height of which represents the

frequency or percentage of observations falling into a category.

Multiple bar chart

In a multiple bar chart, two or more sets of inter-related data are represented, facilitating comparison between more than one phenomenon.

Stacked Bar chart

In a stacked bar chart, data series are stacked one on top of the other in vertical columns.

[Figure: pie chart titled "How Do You Spend the Holidays?" with the categories At home with family, Travel to visit family, Vacation, Catching up on work and Other; the wedge labels read 45%, 38%, 5%, 5% and 7%.]

Pictogram

A pictogram is a statistical graphic in which the size of the picture is intended to represent the

frequencies or size of the values being represented.

Organizing Numerical data

As the number of observations gets large, it becomes more and more difficult to focus on the major features in a set of data. We need ways to organize the observations so that we can better understand what information the data are conveying. Large sets of data are therefore grouped into classes before they are presented.

Presenting Numerical data in tables and graphs

Frequency Distribution

A frequency distribution is a table in which the data is arranged into conveniently established,

numerically ordered class groupings or categories.

Frequency Distribution for Ungrouped Data

When the number of values in a variable is small, an ungrouped frequency distribution would be appropriate.
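As a minimal sketch of an ungrouped frequency distribution, the Python snippet below counts how often each distinct value occurs. The data set `defects` is hypothetical and is used only for illustration.

```python
from collections import Counter

# Hypothetical data: number of defects found in each of 12 inspected items.
defects = [0, 1, 1, 2, 0, 3, 1, 2, 1, 0, 2, 1]

# Count how many times each distinct value occurs, listed in ascending order.
frequency = dict(sorted(Counter(defects).items()))

for value, count in frequency.items():
    print(f"{value}\t{count}")
```

Each printed row pairs a value with its frequency, and the frequencies add up to the number of observations.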

Exercise

20 families were surveyed to find how many children they had. The data obtained were as follows.

0, 2, 3, 1, 1, 3, 4, 2, 0, 3, 4, 2, 2, 1, 0, 4, 1, 2, 2, 3

Construct a frequency distribution.

Frequency Distribution for Grouped Data

Sometimes it is impractical to prepare a frequency distribution table using ungrouped data. This is particularly true when there is a large number of observations. In such cases it is better to collect the observations into groups or classes with clearly defined upper and lower limits. The following steps have to be followed to construct a frequency distribution for a grouped set of data.

1. Find the range of the distribution (i.e. the difference between the highest and lowest values).

2. Select class intervals of a convenient size. For most distributions about 6 to 12 classes will be sufficient. Usually interval widths of 5, 10, 20 and so on are suitable.

3. Mark the number of values falling within each class interval using tally marks and construct the frequency distribution table.
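The steps above can be sketched in Python as follows. The data, the class width of 10 and the convention that each class includes its lower limit but excludes its upper limit are all assumptions made for this illustration. With only 20 values the sketch produces fewer classes than the 6 to 12 suggested for real data sets.

```python
# Hypothetical data: 20 observed values.
values = [12, 47, 33, 25, 8, 41, 29, 36, 18, 22,
          31, 44, 27, 15, 38, 24, 35, 9, 28, 21]

# Step 1: range of the distribution.
lowest, highest = min(values), max(values)
data_range = highest - lowest

# Step 2: choose a convenient class width (assumed here) and a starting limit.
width = 10
start = (lowest // width) * width  # round the lowest value down to a multiple of width

# Step 3: count the values falling in each class [lower, lower + width).
table = {}
lower = start
while lower <= highest:
    upper = lower + width
    table[(lower, upper)] = sum(lower <= v < upper for v in values)
    lower = upper

for (lo, hi), count in table.items():
    print(f"{lo}-{hi}\t{count}")
```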

Important definitions relating to frequency distribution.

Lower class limit

These are the smallest numbers that can actually belong to different classes.

Upper class limit

These are the largest numbers that can actually belong to different classes

Class width

This is the difference between two consecutive lower class limits or upper class limits.

Cumulative Frequency distributions

The cumulative frequency of a class is the sum of frequencies for that class and all previous classes.

Relative Frequency Distribution

The relative frequency of a class can be found by dividing the class frequency by the total of all

frequencies. This can also be given as a percentage.

Relative frequency = class frequency / total frequency
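A short Python sketch of the cumulative and relative frequency calculations, using hypothetical class frequencies:

```python
from itertools import accumulate

# Hypothetical class frequencies for five consecutive classes.
frequencies = [2, 3, 7, 5, 3]
total = sum(frequencies)

# Cumulative frequency: running total of this class and all previous classes.
cumulative = list(accumulate(frequencies))

# Relative frequency: class frequency divided by the total frequency.
relative = [f / total for f in frequencies]
percentage = [100 * r for r in relative]
```

The last cumulative frequency always equals the total frequency, and the relative frequencies always sum to 1.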

Frequency Histogram

A histogram is a column or bar graph of a frequency table. However, unlike bar graphs, there are no gaps between adjacent bars.

Frequency Polygon

A frequency polygon is a line graph drawn by joining the mid-points of the tops of each rectangle in a histogram with straight lines. To construct a frequency polygon, plot the frequencies against the mid-points of each interval and join the points with straight lines.

Ogive or Cumulative Frequency Graph

An ogive is the graphical presentation of a cumulative frequency distribution. Ogives are used when it is considered more useful to determine the number (or proportion) of data items that fall above or below a particular value rather than within a given interval. There are 2 types of ogives - “less than” and “greater than” ogives. The following steps have to be followed to construct a less than ogive.

1. Compute the cumulative frequencies of the distribution in ascending order.

2. Prepare a graph with the cumulative frequency on the vertical axis and the class intervals on the horizontal axis.

3. After plotting the first point, the respective cumulative frequencies must be plotted against the upper limits of each class.

4. Join all the points with straight lines.
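The points of a "less than" ogive can be computed as in the sketch below. The grouped data are hypothetical, and taking the first point at the lower limit of the first class with a cumulative frequency of zero is a common convention assumed here. Drawing the graph itself is left out.

```python
from itertools import accumulate

# Hypothetical grouped data: (lower limit, upper limit) of each class and its frequency.
classes = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50)]
frequencies = [2, 3, 7, 5, 3]

# Step 1: cumulative frequencies in ascending order.
cumulative = list(accumulate(frequencies))

# First point (assumed convention): lower limit of the first class, cumulative frequency 0.
points = [(classes[0][0], 0)]

# Step 3: each cumulative frequency is plotted against the upper limit of its class.
points += [(upper, cf) for (_, upper), cf in zip(classes, cumulative)]
```

Joining `points` in order with straight lines (step 4) gives the ogive.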

Exercise

1. The manager of Jerry’s salon recently asked his last 50 customers to punch a time card when

they first arrived at the shop and to punch out right after they paid for their hair cut. He then

used the data on the cards to measure how long it took Jerry and his hair dressers to cut hair in

order to schedule their appointment intervals. The following data were tabulated.

50 21 36 35 35 27 38 51 28 35
32 32 27 25 24 38 43 46 29 45
39 27 36 38 35 31 28 38 33 46
35 31 38 48 23 25 43 31 32 38
43 32 18 43 52 52 49 53 46 19

i. Form the frequency distribution and percentage distribution.

ii. Plot the histogram

iii. Plot the frequency polygon.

iv. Form the cumulative percentage distribution.

v. Plot the cumulative percentage polygon.

vi. On the basis of the results of parts (i)-(v), comment on the time gap to be kept between two consecutive appointments for haircuts.

2. Moore Travel, a nationwide travel agency, offers special rates on certain Caribbean cruises to

senior citizens. The president of Moore Travel wants additional information on the ages of

those people taking cruises. A random sample of 40 customers taking a cruise last year revealed

these ages:

77 18 63 84 38 54 50 59 54 56

36 26 50 34 44 41 58 58 53 51

62 43 52 53 63 62 62 65 61 52

60 60 45 66 83 71 63 58 61 71

i. Organize the data into a frequency distribution, using 7 classes and 15 as the

lower limit of the first class.

ii. Where do the data tend to cluster?

iii. Determine the relative frequency distribution.

iv. Describe the distribution.

The Stem and Leaf Display

A stem-and-leaf diagram, also called a stem-and-leaf plot, is a diagram that quickly summarizes data while maintaining the individual data points. In such a diagram, the "stem" is a column of the unique elements of data after removing the last digit. The final digits ("leaves") of the values are then placed in a row next to the appropriate stem and sorted in numerical order.
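The construction described above can be sketched in Python. The two-digit data set is hypothetical; the stem is taken as everything except the last digit and the leaf as the last digit.

```python
from collections import defaultdict

# Hypothetical two-digit observations.
data = [23, 25, 31, 38, 38, 42, 47, 47, 49, 51]

stems = defaultdict(list)
for value in sorted(data):
    stem, leaf = divmod(value, 10)  # stem = leading digits, leaf = final digit
    stems[stem].append(leaf)

# One row per stem, with its leaves in ascending order.
for stem in sorted(stems):
    print(f"{stem} | {' '.join(str(leaf) for leaf in stems[stem])}")
```

Because the data are sorted before being split, the leaves in each row are already in numerical order.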

Exercise

1. A bank wants to find the number of times a particular automated teller machine (ATM) is used each day. The following is the number of times it was used during each of the last 30 days. Develop a stem-and-leaf display. Summarize the data on the number of times the machine was used:

83 64 84 76 84 54 75 59 70 61
63 80 84 73 68 52 65 90 52 77
95 36 78 61 59 84 47 87 60 66

How many times was the ATM used on a typical day?

What were the largest and the smallest number of times the ATM was used?

Around what values did the number of times the ATM was used tend to cluster?

2. The back-to-back stem and leaf plot below shows the LDL cholesterol levels (in milligrams per deciliter, mg/dL) of two groups of people, smokers and non-smokers. The digits in the stem represent the hundreds and tens and the digit in the leaf represents the ones. For example, 11|8 = 118 and so on.

i. People with a cholesterol level of 129 or less are said to have a near ideal level of

cholesterol. How many people, in each group, have a near ideal level of cholesterol?

ii. People with a cholesterol level between 130 and 159 inclusive are said to be in the borderline-high range. How many people, in each group, are in the borderline-high range?

iii. People with a cholesterol level between 160 and 189 inclusive are said to have a high

level of cholesterol. How many people, in each group, have a high level of cholesterol?

iv. People with a cholesterol level of 190 or above are said to have a very high level of

cholesterol. How many people, in each group, have a very high level of cholesterol?

v. Comparing the two groups, which group has more people with a higher level of

cholesterol?

Measures of Central Tendency and Dispersion

Measures of Central Tendency

A measure of central tendency is a value at the center or middle of a data set. It condenses the mass of data into one single value and enables us to get an idea of the entire set of data. It also enables comparison of two or more sets of data.

Mean

The mean is computed by dividing the sum of the values of each and every observation by the

total number of observations.

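Mean for population:

$$\mu = \frac{\sum_{i=1}^{N} X_i}{N}$$

Mean for sample:

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$$

Where,

Σ = symbol representing summation,

X_i = the i-th value in the data set,

N = number of values in the population,

n = number of values in the sample set.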

When the mean is computed for grouped data, the midpoint of every class interval is taken to represent the values within the class. The mean is given as,

$$\bar{x} = \frac{\sum f x}{\sum f}$$

Where, f = the frequency of each class

x = mid point of each class
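A brief Python sketch of both calculations, the mean of raw observations and the mean of grouped data. The sample and the grouped distribution are hypothetical.

```python
# Hypothetical sample of five observations.
sample = [4, 8, 6, 5, 7]
mean = sum(sample) / len(sample)  # sum of the values / number of values

# Hypothetical grouped data: class limits and class frequencies.
classes = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50)]
frequencies = [2, 5, 8, 4, 1]

# The midpoint of each class represents the values within the class.
midpoints = [(lower + upper) / 2 for lower, upper in classes]

# Grouped mean = sum of f * x divided by sum of f.
grouped_mean = sum(f * x for f, x in zip(frequencies, midpoints)) / sum(frequencies)
```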

Median

Median is the middle value in a range of values arranged in sequence by size.

When the data is arranged in the ascending or descending order,

Median = the [(n+1)/2]th value.

The median for a grouped set of data is the (n/2)th term. To calculate the median for grouped

data, the cumulative frequency distribution has to be constructed first and then the following

formula can be applied.

$$\text{Median} = L + \left(\frac{\frac{n}{2} - C}{f}\right) \times i$$

Where,

L = real lower limit of the median class,

n = number of items of data,

C = cumulative frequency of the class prior to the median class,

f = frequency of the median class,

i = class interval width.
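The grouped median can be sketched in Python as below. The function name `grouped_median` and the data are hypothetical, and the class limits are assumed to be the real limits of continuous classes.

```python
def grouped_median(classes, frequencies):
    """Median of grouped data: L + ((n/2 - C) / f) * i."""
    n = sum(frequencies)
    cumulative = 0  # C: cumulative frequency of the classes before the current one
    for (lower, upper), f in zip(classes, frequencies):
        # The median class is the first class whose cumulative frequency reaches n/2.
        if f > 0 and cumulative + f >= n / 2:
            return lower + (n / 2 - cumulative) / f * (upper - lower)
        cumulative += f


# Hypothetical grouped data.
classes = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50)]
frequencies = [2, 5, 8, 4, 1]

median = grouped_median(classes, frequencies)
```

Here n = 20, so the median is the 10th term. It falls in the class 20-30, where L = 20, C = 7, f = 8 and i = 10.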

Mode

Mode is the most frequently occurring figure in an ungrouped set of data.

For grouped data, mode can be estimated using the following formula.

$$\text{Mode} = L + \left(\frac{d_1}{d_1 + d_2}\right) \times i$$

Where,

L = real lower limit of the modal class,

d1 = difference in frequencies between the modal class and the preceding class,

d2 = difference in frequencies between the modal class and the following class,

i = class interval width.
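A Python sketch of the grouped mode estimate follows. The function name `grouped_mode` and the data are hypothetical. Treating a missing neighbouring class as having frequency zero, and taking the first class when several share the highest frequency, are assumptions of this sketch.

```python
def grouped_mode(classes, frequencies):
    """Mode of grouped data: L + (d1 / (d1 + d2)) * i."""
    m = frequencies.index(max(frequencies))  # position of the modal class
    before = frequencies[m - 1] if m > 0 else 0
    after = frequencies[m + 1] if m < len(frequencies) - 1 else 0
    d1 = frequencies[m] - before  # difference from the preceding class
    d2 = frequencies[m] - after   # difference from the following class
    lower, upper = classes[m]
    return lower + d1 / (d1 + d2) * (upper - lower)


# Hypothetical grouped data.
classes = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50)]
frequencies = [2, 5, 8, 4, 1]

mode = grouped_mode(classes, frequencies)
```

For these data the modal class is 20-30, with d1 = 8 - 5 = 3 and d2 = 8 - 4 = 4, so the estimate is 20 + (3/7) × 10.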

Exercises

1. Refer to the hair cutting data for Jerry’s salon given in the frequency distribution exercise earlier in this handout. Compute the mean, median and mode.

The weighted mean

Weighted mean is used in situations where scores vary in their degree of importance. In such situations, different weights are attached to different scores. A weight is a value corresponding to how much the score is counted. Given a list of scores x1, x2, …, xn and a corresponding list of weights w1, w2, …, wn, the weighted mean is obtained by the formula

$$\text{Weighted mean} = \frac{\sum (w \cdot x)}{\sum w}$$
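As a small illustration in Python, with hypothetical scores and weights:

```python
# Hypothetical scores and the weights attached to them.
scores = [90, 80, 70]
weights = [1, 2, 2]

# Weighted mean = sum of (weight * score) divided by the sum of the weights.
weighted_mean = sum(w * x for w, x in zip(weights, scores)) / sum(weights)
```

The weights do not have to add up to 1 or 100%, because the formula divides by their sum.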

Exercise

The final score of a course is computed as a weighted mean with the weights 10% for mid term

test, 30% for assignment and 60% for the written exam. A student scored 80 marks for the mid

term test, 70 for the assignment and 60 for the written exam. Find the final mark obtained for

the course.

Measures of Non Central Tendency

Quartiles

Quartiles are employed particularly when summarising or describing the properties of large sets of numerical data. The quartiles are descriptive measures that split the ordered data into four quarters.

In an ungrouped set of scores, Q1 = [(n+1)/4]th value and Q3 = [3(n+1)/4]th value.

Quartiles for grouped data:

$$Q_1 \text{ (lower quartile)} = L + \left(\frac{\frac{n}{4} - C}{f_{Q_1}}\right) \times i$$

$$Q_2 \text{ (median)} = L + \left(\frac{\frac{n}{2} - C}{f_{Q_2}}\right) \times i$$

$$Q_3 \text{ (upper quartile)} = L + \left(\frac{\frac{3n}{4} - C}{f_{Q_3}}\right) \times i$$

where

L = real lower limit of the quartile class,

n = total number of observations in the entire data set,

C = cumulative frequency in the class immediately before the quartile class,

fQ = frequency of the relevant quartile class,

i = the length of the real class interval of the relevant quartile class.
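The three quartile formulas differ only in the fraction of n they target, so they can be sketched as one Python function. The name `grouped_quantile` and the data are hypothetical, and the class limits are assumed to be real limits.

```python
def grouped_quantile(classes, frequencies, fraction):
    """Quartile of grouped data: L + ((fraction * n - C) / f) * i."""
    n = sum(frequencies)
    target = fraction * n  # n/4 for Q1, n/2 for Q2, 3n/4 for Q3
    cumulative = 0         # C: cumulative frequency before the current class
    for (lower, upper), f in zip(classes, frequencies):
        # The quartile class is the first class whose cumulative frequency reaches the target.
        if f > 0 and cumulative + f >= target:
            return lower + (target - cumulative) / f * (upper - lower)
        cumulative += f


# Hypothetical grouped data.
classes = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50)]
frequencies = [2, 5, 8, 4, 1]

q1 = grouped_quantile(classes, frequencies, 0.25)
q2 = grouped_quantile(classes, frequencies, 0.50)
q3 = grouped_quantile(classes, frequencies, 0.75)
```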

Exercises

1. A manufacturer of flashlight batteries took a sample of 13 batteries from a day’s production

and used them continuously until they were drained. The number of hours they were used

until failure were

342, 426, 317, 545, 264, 451, 1049, 631, 512, 266, 492, 562, 298

i. Compute the mean, median, mode and mid-range (the average of the highest and lowest values).

ii. Looking at the distribution of times to failure, which measure do you think is best?

iii. In what ways would this information be useful to the manufacturer? Discuss.

iv. Using the information above, what would you advise if the manufacturer wanted to be able to say in advertisements that “these batteries should last 400 hours”?

2. The following frequency distribution shows the annual profit levels in last financial year

with respect to 100 small enterprises in a city. Previous studies show that average monthly

profit of small enterprises for the previous financial year was Rs 40,000.

Profit (Rs '000)   Number of enterprises
0-10               3
10-20              6
20-30              11
30-40              18
40-50              23
50-60              16
60-70              15
70-80              8
Total              100

i. Find the mean, median and mode and comment on the profit values compared to the previous year.

ii. Find the profit below which, the lowest 25% of the enterprises fall.

iii. Draw the cumulative percentage distribution and find the answer for part ii graphically.

Measures of Variation

Measures of variation describe the spread of individual values around the central value.

Range

The range is the difference between the highest value and the lowest value in a set of data.

Interquartile Range

The inter-quartile range (IQR) = Q3 - Q1

where Q1 = lower quartile, Q3 = upper quartile.

IQR, unlike the range, is not affected by the extreme values. It shows the spread of the

middle 50% of data.

The five number summary and Box plot

The five numbers that help describe the center, spread and shape of data are:

Xsmallest , First Quartile (Q1), Median (Q2), Third Quartile (Q3) and Xlargest

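Box plot is a graphical display of the data based on the five-number summary:

[Figure: box plot marking Xsmallest, Q1, Median, Q3 and Xlargest; each of the four segments between them contains 25% of the data.]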
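The five-number summary, the range and the IQR can be sketched with Python's standard `statistics` module. The data are hypothetical. The `"exclusive"` method is chosen because it places the quartiles at the [(n+1)/4]th, [2(n+1)/4]th and [3(n+1)/4]th ordered values, which matches the ungrouped rule used in this handout.

```python
import statistics

# Hypothetical observations, sorted into ascending order.
data = sorted([22, 15, 30, 12, 25, 18, 20])

# Quartiles at the (n+1)/4, 2(n+1)/4 and 3(n+1)/4-th ordered values,
# with linear interpolation when a position is not a whole number.
q1, q2, q3 = statistics.quantiles(data, n=4, method="exclusive")

five_number_summary = (data[0], q1, q2, q3, data[-1])
iqr = q3 - q1                    # spread of the middle 50% of the data
value_range = data[-1] - data[0]  # highest value minus lowest value
```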

Mean Absolute Deviation

The mean deviation is the average of the absolute deviations taken from the mean.

Considering sample data,

$$\text{Mean deviation} = \frac{\sum |x - \bar{x}|}{n}$$
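A minimal Python sketch of the mean absolute deviation, using a hypothetical sample:

```python
# Hypothetical sample data.
data = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(data)
mean = sum(data) / n

# Average of the absolute deviations taken from the mean.
mean_deviation = sum(abs(x - mean) for x in data) / n
```

Taking absolute values stops the positive and negative deviations from cancelling out. Without them the deviations from the mean would always sum to zero.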

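Variance

Average of squared deviations of values from the mean.

Variance of population:

$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$$

Variance of the sample:

$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$$

Variance for grouped data:

$$S^2 = \frac{\sum f (X - \bar{X})^2}{n - 1}$$

where, for grouped data, f is the frequency of each class, X is the mid point of each class and n = Σf.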

Standard deviation

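Standard deviation shows the variation of the figures about the mean. It is the square root of the variance.

Standard deviation for population:

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}}$$

Standard deviation of sample:

$$S = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$$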
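The variance and standard deviation definitions can be sketched in Python as follows. The data are hypothetical. The same numbers are treated first as a whole population (divide by N) and then as a sample (divide by n - 1), purely to show the difference between the two divisors.

```python
import math

# Hypothetical data.
data = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(data)
mean = sum(data) / n

# Sum of squared deviations of the values from the mean.
sum_of_squares = sum((x - mean) ** 2 for x in data)

population_variance = sum_of_squares / n    # data treated as a population
sample_variance = sum_of_squares / (n - 1)  # data treated as a sample

# Standard deviation is the square root of the variance.
population_sd = math.sqrt(population_variance)
sample_sd = math.sqrt(sample_variance)
```

The sample figures are slightly larger than the population figures because the divisor is n - 1 rather than n.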

Exercises

1. The following data cover 10 days and show the number of rejected cathode ray tubes out of 120 inspected per day.

Number rejected: 8 9 11 7 11 6 7 9 12 10

i. What is the average number of rejects per day?

ii. What is the standard deviation of the rejects?

2. A set of final examination grades in an introductory statistics course was found to be distributed

with a mean of 73 and a standard deviation of 8. Are you better off with a grade of 81 on this

exam or a grade of 68 on a different exam where the mean is 62 and standard deviation is 3?

Show statistically and explain.

3. Refer to the hair cutting data for Jerry’s salon given in the frequency distribution exercise earlier in this handout. Compute the standard deviation.

Coefficient of variation (Cv)

The coefficient of variation measures the scatter in the data relative to the mean. A lower

coefficient of variation indicates a lower relative dispersion.

$$C_v = \frac{s}{\bar{x}}$$

Where,

s - standard deviation

x̄ - mean.
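A short Python sketch comparing the relative dispersion of two hypothetical data sets measured on very different scales. Using the sample standard deviation is an assumption of this sketch.

```python
import statistics


def coefficient_of_variation(values):
    """Cv = s / mean, with s taken as the sample standard deviation."""
    return statistics.stdev(values) / statistics.mean(values)


# Hypothetical data sets on different scales.
series_a = [10, 12, 11, 13, 14]
series_b = [100, 140, 90, 120, 150]

cv_a = coefficient_of_variation(series_a)
cv_b = coefficient_of_variation(series_b)
```

Because Cv is a ratio, the two series can be compared directly even though their means differ by a factor of ten. Here `cv_a` is smaller than `cv_b`, so series A has the lower relative dispersion.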

Exercise

To set an appropriate price for a product, it is necessary to be able to estimate its cost of

production. One element of the cost is based on the length of time spent by workers to produce

the product. The most widely used technique for making such a measurement is the time study.

In a time study, the task to be studied is divided into measurable parts and each is measured

with a stopwatch or filmed for later analysis. For each worker, this process is repeated many

times for each sub task. Then the average and standard deviation of the time required to

complete each subtask are computed for each worker.

The data (in minutes) given in the table are the result of a time study of a production operation involving two tasks.

              Worker A                Worker B
Repetition    Subtask 1   Subtask 2   Subtask 1   Subtask 2
1             30          2           31          7
2             28          4           30          2
3             31          3           32          6
4             38          3           30          5
5             25          2           29          4
6             29          4           30          1
7             30          3           31          4

i. For each worker, find the mean and the standard deviation for each subtask.

ii. If you could choose workers similar to A or B to perform subtasks 1 and 2, which type

would you assign to each subtask?

iii. Explain your decision on the basis of your answer to part b above.

Shape of a Distribution

Skewness

A symmetrical distribution is represented by a curve that can be divided by a vertical line into

two parts which are mirror images of each other.

A skewed distribution is represented by a curve that lacks symmetry.

Differences among mean, median and mode can be seen from the graphs of symmetrical and

skewed distributions.

Symmetric: Mean = Median = Mode

Left-Skewed: Mean < Median < Mode

Right-Skewed: Mode < Median < Mean

Measure of skewness

$$\text{Pearson's coefficient of skewness} = \frac{3(\text{Mean} - \text{Median})}{\text{Standard deviation}}$$
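Pearson's coefficient of skewness divides three times the gap between the mean and the median by the standard deviation. The Python sketch below applies it to a hypothetical sample; using the sample standard deviation is an assumption of this sketch.

```python
import statistics

# Hypothetical sample with one large value pulling the mean upwards.
data = [2, 3, 3, 4, 5, 6, 12]

mean = statistics.mean(data)
median = statistics.median(data)
sd = statistics.stdev(data)  # sample standard deviation (assumed here)

# Pearson's coefficient of skewness = 3 * (mean - median) / standard deviation.
skewness = 3 * (mean - median) / sd
```

The mean (5) exceeds the median (4), so the coefficient is positive, which indicates a right-skewed distribution. A negative value would indicate a left-skewed distribution and a value of zero a symmetric one.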

Exercises:

1. In a factory, the time during working hours in which a machine is not operating as a result of breakage or failure is called the “downtime”. The following distribution shows a sample of 100 downtimes of a certain machine (rounded to the nearest minute):

i. Find the mean, median and the mode

ii. Find the standard deviation

iii. Comment about the skewness of data

2. Data on the frequency of absenteeism in a plant owned by Novel Electronics are shown in the table below. Management expects you to use your statistical expertise to analyse their employee absenteeism levels as compared with national norms. Studies show that the average number of days per year that employees are absent across the nation for similar plants is about 27.

Days absent per year   Frequency
0-10                   2
10-20                  5
20-30                  6
30-40                  7
40-50                  5
50-60                  3
60-70                  2
Total                  30

i. Calculate the mean and the median levels of absenteeism and comment about the

absenteeism level of Novel compared to the national norms.

ii. Calculate the standard deviation of absenteeism.

iii. If the management of Novel decides to give a reward to employees who are in the lower

25% of absenteeism levels, find the level of absenteeism below which the reward is

applicable.

iv. Draw the cumulative frequency polygon for the absenteeism levels and find the answer for part iii graphically.

3. A cinema is showing 3 films, A,B and C. The ages of people watching the films are

illustrated in the following box and whisker plots.

Describe the ages of people watching the three films. In your view, which film is suitable for children, which for young adults and which for adults?

4. A bank called for applications for the post of management trainee. The applicants were asked to sit for three papers: Language, Numerical skills and General knowledge. The results for the three papers are given below.

Analyse the performance of the candidates in the three exam papers.

Paper               Mean      Median    Std. Deviation   Skewness
Language            77.2071   80.1000   7.91857          -.801
Numerical           68.1429   71.1000   13.11099         -1.168
General knowledge   70.0071   72.9000   12.60088         -.424
Total               71.7857   73.3500   11.84818         -.984
