Data Summary Using Descriptive Measures, Sections 3.1 – 3.6, 3.8. Based on Introduction to Business Statistics, Kvanli / Pavur / Keeling.


|»| Summary of Descriptive Measures

DESCRIPTIVE MEASURES: A descriptive measure is a single number computed from the sample data that provides information about the data.

An example of such a measure is the mean, which is the average of all the observations in a sample or a population.

Measures of Central Tendency: Determine the center of the data values or possibly the most typical value.

  Mean: The average of the data values.
  Median: The value in the center of the ordered data values.
  Mode: The value that occurs more than once and the most often.
  Midrange: The average of the highest and the lowest values.

Measures of Variation: Determine the spread of the data.

  Range: Range = H - L, the highest value minus the lowest value.
  Variance: The average of the squared differences between the individual values and the mean.
  Standard Deviation: The positive square root of the variance.
  Coefficient of Variation: The standard deviation expressed in terms of the mean.

Measures of Position: Indicate how a particular data point fits in with all the other data points.

  Percentile: P% of the values fall below the P-th percentile and (100 - P)% fall above it.
  Quartiles: The 25th, 50th and 75th percentiles.
  Z-Score: Expresses the number of standard deviations the value x is from the mean.

Measures of Shape: Indicate how the data points are distributed.

  Skewness: The tendency of a distribution to stretch out in a particular direction.
  Kurtosis: A measure of the peakedness of a distribution.


|»| The Mean

The mean represents the average of the data and is computed by dividing the sum of the data points by the number of the data points.

It is the most popular measure of central tendency.

We can easily compute and explain the mean.

We have two types of mean depending on whether the data set includes all items of a population or a subset of items of a population – Sample Mean and Population Mean.


|»| Sample Mean

It is the sum of the data values in a sample divided by the number of data values in that sample.

We use $\bar{X}$ (X-bar) to denote the sample mean, and n to denote the number of data values in a sample.

Therefore, for ungrouped data, we obtain,

$$\bar{X} = \frac{\sum x}{n} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n}$$

Example 3.1 (Accident Data): The following sample represents the number of accidents (monthly) over 11 months: 18, 10, 15, 13, 17, 15, 12, 15, 18, 16, 11. Compute the mean number of monthly accidents, i.e., compute the sample mean.

$$\bar{X} = \frac{\sum x}{n} = \frac{18 + 10 + 15 + 13 + 17 + 15 + 12 + 15 + 18 + 16 + 11}{11} = \frac{160}{11} = 14.55$$
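The calculation in Example 3.1 can be checked with a few lines of Python. This is only an illustrative sketch; the names `accidents` and `sample_mean` are chosen here and do not come from the textbook.

```python
# Accident data from Example 3.1 (monthly accidents over 11 months).
accidents = [18, 10, 15, 13, 17, 15, 12, 15, 18, 16, 11]


def sample_mean(values):
    """Sum of the data values divided by the number of data values."""
    return sum(values) / len(values)


x_bar = sample_mean(accidents)
print(round(x_bar, 2))  # 14.55
```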


|»| Sample Mean (cont.)

Example 3.2: The mean of a sample with 5 observations is 20. If the sum of four of the observations is 75, what is the value of the fifth observation?

$$\sum_{i=1}^{4} x_i = 75, \qquad \sum_{i=1}^{5} x_i = 5 \times 20 = 100$$

$$x_5 = \sum_{i=1}^{5} x_i - \sum_{i=1}^{4} x_i = 100 - 75 = 25$$

|»| Population Mean

It is the sum of the data values in a population divided by the number of data values in that population.

We use μ to denote the population mean, and N to denote the number of data values in a population.

Therefore, we obtain,

$$\mu = \frac{\sum x}{N} = \frac{x_1 + x_2 + x_3 + \cdots + x_N}{N}$$

|»| The Median, Md

The Median (Md) of a set of data is the value in the center of the data values when they are arranged from lowest to highest.

The same number of data values lie to the left and to the right of the median.

The median is preferred to the mean as a measure of central tendency for data sets with outliers.

Calculating the median from a sample involves the following steps:

Arrange the data values in ascending order.

Find the position of the median. The median position is the $\frac{n+1}{2}$-th ordered value.

Find the median value. If n is even, the position falls halfway between two ordered values, and the median is their average.

|»| The Median (cont.)

Example 3.3: Compute the median for the accident data given in Example 3.1.

Ascending order: 10, 11, 12, 13, 15, 15, 15, 16, 17, 18, 18.

n = 11, Median Position = (11+1)/2 = 6th ordered value.

Md = 15.
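The three steps for the median can be sketched in Python as follows. The function name `median` and the handling of an even n (averaging the two middle values) are assumptions made for this sketch, not part of the slides.

```python
def median(values):
    """Median via the (n + 1)/2 position rule on the ordered data."""
    ordered = sorted(values)              # step 1: ascending order
    n = len(ordered)
    position = (n + 1) / 2                # step 2: 1-based median position
    if position.is_integer():             # odd n: a single middle value
        return ordered[int(position) - 1]
    lower = ordered[int(position) - 1]    # even n: the two middle values
    upper = ordered[int(position)]
    return (lower + upper) / 2            # step 3: median value


accidents = [18, 10, 15, 13, 17, 15, 12, 15, 18, 16, 11]
print(median(accidents))  # 15
```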


|»| The Mode, Mo

The Mode (Mo) of a data set is the value that occurs more than once and the most often.

Mode is not always a measure of central tendency; this value need not occur in the center of the data.

There may be more than one mode if several numbers occur the same (and the largest) number of times.

Mode is extensively used in areas such as manufacturing of clothing, shoes, etc.

Example 3.4: Find the mode for the accident data given in Example 3.1.

The data point 15 appears most often (three times), so Mo = 15.
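As a rough illustration, the mode can be found by counting how often each value occurs. The helper `modes` is a hypothetical name; it returns every value tied for the highest count and returns an empty list when no value occurs more than once, matching the definition above.

```python
from collections import Counter


def modes(values):
    """All values that occur most often, provided they occur more than once."""
    counts = Counter(values)
    top = max(counts.values())
    if top < 2:                 # no value repeats, so there is no mode
        return []
    return sorted(v for v, c in counts.items() if c == top)


accidents = [18, 10, 15, 13, 17, 15, 12, 15, 18, 16, 11]
print(modes(accidents))  # [15]
```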


|»| The Midrange, Mr

It is the average of the highest and the lowest values of a data set.

Midrange provides an easy-to-grasp measure of central tendency.

If we use H to denote the highest value and L to denote the lowest value of a data set, we obtain,

$$Mr = \frac{L + H}{2}$$

Example 3.5: Find the midrange for the accident data given in Example 3.1.

L = 10, H = 18, so Mr = (L + H)/2 = (10 + 18)/2 = 14.

|»| The Range, R

The numerical difference between the largest value (H) and the smallest value (L). That is, Range = H – L.

Example 3.6: The range for the accident data given in Example 3.1 is H – L = 18 – 10 = 8.

The range is a crude measure of variation, but it is easy to calculate and contains valuable information in some situations.

For instance, stock reports cite the high and low prices of the day.

Similarly, weather forecasts use daily high and low temperatures. The range is strongly influenced by outliers.
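A small sketch of the midrange and the range for the accident data, including how a single outlier changes the range. The extra value 40 is a made-up observation used only to illustrate the point about outliers.

```python
accidents = [18, 10, 15, 13, 17, 15, 12, 15, 18, 16, 11]

low, high = min(accidents), max(accidents)
midrange = (low + high) / 2      # (10 + 18) / 2 = 14.0
data_range = high - low          # 18 - 10 = 8

# Hypothetical outlier: one unusually bad month with 40 accidents.
with_outlier = accidents + [40]
range_with_outlier = max(with_outlier) - min(with_outlier)  # 40 - 10 = 30

print(midrange, data_range, range_with_outlier)
```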


|»| The Variance

Variance describes the spread of the data values from the mean.

It is the average of the squared differences between the individual values and the mean.

Two types of variance are (1) Sample variance, and (2) Population variance.

|»| Sample Variance, S²

S² describes the variation of the sample values about the sample mean.

It is the sum of the squared differences between the individual values and the sample mean, divided by n − 1.

That is,

$$S^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1}$$

|»| Sample Variance - Example

Example 3.7: Calculate the sample variance for the accident data.

x      x - x̄                    (x - x̄)²
18     18 - 14.55 = 3.45        11.9025
10     10 - 14.55 = -4.55       20.7025
15     15 - 14.55 = 0.45        0.2025
13     13 - 14.55 = -1.55       2.4025
17     17 - 14.55 = 2.45        6.0025
15     15 - 14.55 = 0.45        0.2025
12     12 - 14.55 = -2.55       6.5025
15     15 - 14.55 = 0.45        0.2025
18     18 - 14.55 = 3.45        11.9025
16     16 - 14.55 = 1.45        2.1025
11     11 - 14.55 = -3.55       12.6025
                                ∑(x - x̄)² = 74.7275

$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{74.7275}{11 - 1} = \frac{74.7275}{10} = 7.47$$
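The two forms of the sample variance formula can be compared in a short Python sketch. The function names are illustrative. Because the code uses the unrounded mean 160/11 rather than 14.55, the sum of squared differences comes out slightly different from the table's 74.7275, but the variance still rounds to 7.47.

```python
def sample_variance(values):
    """Definitional form: sum of squared deviations divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def sample_variance_shortcut(values):
    """Computational form: (sum of x^2 - (sum of x)^2 / n) / (n - 1)."""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(x * x for x in values)
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)


accidents = [18, 10, 15, 13, 17, 15, 12, 15, 18, 16, 11]
s2 = sample_variance(accidents)
s2_shortcut = sample_variance_shortcut(accidents)
print(round(s2, 2), round(s2_shortcut, 2))  # 7.47 7.47
```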


|»| Sample Variance - Examples

Example 3.8: From a sample of 50 data values, the statistics ∑x and ∑x² are calculated to be 20 and 33, respectively. Compute the sample variance.

$$S^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1} = \frac{33 - \frac{(20)^2}{50}}{50 - 1} = \frac{33 - 8}{49} = 0.51$$

Example 3.9: The differences between the data values and the sample mean are -5, 1, -3, 2, 3, and 2. What is the variance of the data?

x - x̄      (x - x̄)²
-5         (-5)² = 25
1          (1)² = 1
-3         (-3)² = 9
2          (2)² = 4
3          (3)² = 9
2          (2)² = 4
           ∑(x - x̄)² = 52

$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{52}{6 - 1} = \frac{52}{5} = 10.4$$
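Examples 3.8 and 3.9 start from summary information rather than raw data. The sketch below, with hypothetical helper names, shows that the variance can be computed either from n, ∑x and ∑x² or directly from the deviations.

```python
def variance_from_sums(n, sum_x, sum_x2):
    """Sample variance from n, the sum of x, and the sum of x squared."""
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)


def variance_from_deviations(deviations):
    """Sample variance when the differences (x - x_bar) are already known."""
    return sum(d * d for d in deviations) / (len(deviations) - 1)


s2_example_38 = variance_from_sums(50, 20, 33)                       # 25 / 49
s2_example_39 = variance_from_deviations([-5, 1, -3, 2, 3, 2])       # 52 / 5
print(round(s2_example_38, 2), round(s2_example_39, 2))  # 0.51 10.4
```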


|»| Population Variance, σ²

σ² describes the variation of the population values about the population mean.

It is the average of the squared differences between the individual values and the population mean.

That is,

$$\sigma^2 = \frac{\sum (x - \mu)^2}{N}$$

|»| The Standard Deviation

Standard deviation is the positive square root of the variance.

The positive square root of the sample variance is the sample standard deviation, denoted by S.

The positive square root of the population variance is the population standard deviation, denoted by σ.

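$$S = \sqrt{S^2} = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} = \sqrt{\frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1}}$$

$$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum (x - \mu)^2}{N}}$$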

|»| Standard Deviation

Example 3.10: Find the sample standard deviations for Examples 3.7, 3.8, and 3.9.

From Example 3.7: $s = \sqrt{s^2} = \sqrt{7.47} = 2.73$

From Example 3.8: $s = \sqrt{s^2} = \sqrt{0.51} = 0.71$

From Example 3.9: $s = \sqrt{s^2} = \sqrt{10.4} = 3.22$
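These square roots can be reproduced with Python's standard library. The sketch uses `statistics.stdev` (which divides by n − 1) and `statistics.pstdev` (which divides by N) on the accident data, and `math.sqrt` for the two examples where only the variance is known; treating the accident data as a whole population in `sigma` is done purely for comparison.

```python
import math
import statistics

accidents = [18, 10, 15, 13, 17, 15, 12, 15, 18, 16, 11]

s = statistics.stdev(accidents)        # sample standard deviation (n - 1)
sigma = statistics.pstdev(accidents)   # same data treated as a population (N)

s_example_38 = math.sqrt(25 / 49)      # variance from Example 3.8
s_example_39 = math.sqrt(52 / 5)       # variance from Example 3.9

print(round(s, 2), round(s_example_38, 2), round(s_example_39, 2))
# 2.73 0.71 3.22
```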


|»| Coefficient of Variation, CV

The coefficient of variation measures the standard deviation in terms of the mean.

For example, what percentage of x-bar is s?

The Coefficient of Variation (CV) is used to compare the variation of two or more data sets where the values of the data differ greatly.

$$CV = \frac{s}{\bar{x}} \times 100$$

Example 3.11: The scores for team 1 were 70, 60, 65, and 69. The scores for team 2 were 72, 58, 61, and 73. Compare the coefficients of variation for these two teams.

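For team 1: $\bar{X} = 66$, $S = 4.55$, so $CV = \frac{s}{\bar{X}} \times 100 = 6.89$.

For team 2: $\bar{X} = 66$, $S = 7.62$, so $CV = \frac{s}{\bar{X}} \times 100 = 11.55$.

Team 2's scores vary more relative to the mean than team 1's scores.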
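A minimal sketch of the comparison in Example 3.11. The function name is chosen for illustration. Note that the code keeps the unrounded standard deviation, so team 2 comes out near 11.54 rather than the 11.55 obtained from the rounded value s = 7.62.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the sample mean."""
    return statistics.stdev(values) / statistics.mean(values) * 100


team1 = [70, 60, 65, 69]
team2 = [72, 58, 61, 73]

cv1 = coefficient_of_variation(team1)
cv2 = coefficient_of_variation(team2)
print(round(cv1, 2), round(cv2, 2))
```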


|»| Percentile

The P-th percentile is a number such that P% of the measurements fall below the P-th percentile and (100-P)% fall above it.

Most common measure of position.

How to calculate a percentile:

Arrange the data in ascending order.

Find the location of the P-th percentile, $n \cdot \frac{P}{100}$.

Find the percentile using the following rules:

Location Rule 1: If n · P/100 is not a counting number, round it up, and the P-th percentile will be the value in this position of the ordered data.

Location Rule 2: If n · P/100 is a counting number, the P-th percentile is the average of the number in this location (of the ordered data) and the number in the next largest location.

|»| Percentile - Example

Example 3.12: Find the 35th percentile from the following aptitude data (Aptitude Data).

Number of data values, n = 50.

22 25 28 31 34 35 39 39 40 42 44 44 46 48 49 51 53 53 55 55

56 57 59 60 61 63 63 63 65 66 68 68 69 71 72 72 74 75 75 76

78 78 80 82 83 85 88 90 92 96

35th Percentile = P35. So,

$$\text{Location of } P_{35} = n \cdot \frac{P}{100} = 50 \cdot \frac{35}{100} = 17.5$$

17.5 is NOT a counting number. So, using Location Rule 1, P35 = 18th value = 53.
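The two location rules can be sketched in Python as shown below. The function `percentile` is a hypothetical helper that assumes P is a whole number between 1 and 99; statistical libraries often use other interpolation conventions and may return slightly different values.

```python
def percentile(values, p):
    """P-th percentile by the two location rules (p: integer from 1 to 99)."""
    ordered = sorted(values)
    n = len(ordered)
    whole, remainder = divmod(n * p, 100)   # location = n * p / 100
    if remainder != 0:
        # Rule 1: not a counting number, so round up to position whole + 1,
        # which is index `whole` in a 0-based list.
        return ordered[whole]
    # Rule 2: counting number, so average positions whole and whole + 1.
    return (ordered[whole - 1] + ordered[whole]) / 2


aptitude = [
    22, 25, 28, 31, 34, 35, 39, 39, 40, 42, 44, 44, 46, 48, 49, 51, 53, 53, 55, 55,
    56, 57, 59, 60, 61, 63, 63, 63, 65, 66, 68, 68, 69, 71, 72, 72, 74, 75, 75, 76,
    78, 78, 80, 82, 83, 85, 88, 90, 92, 96,
]
print(percentile(aptitude, 35))  # 53
```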


|»| Quartiles and Interquartile Range

Quartiles are merely particular percentiles that divide the data into quarters, namely:

Q1 = 1st quartile = 25th percentile (P25)

Q2 = 2nd quartile = 50th percentile (P50) = Median

Q3 = 3rd quartile = 75th percentile (P75)

Example 3.13: Determine the quartiles for the aptitude data.

Q1: n · P/100 = 50 · 25/100 = 12.5, so Q1 = 13th ordered value = 46.

Q2: n · P/100 = 50 · 50/100 = 25, so Q2 = Median = (61 + 63)/2 = 62.

Q3: n · P/100 = 50 · 75/100 = 37.5, so Q3 = 38th ordered value = 75.

Interquartile Range (IQR)

The range for the middle 50% of the data.

IQR = Q3 – Q1. For aptitude data: IQR = 75 – 46 = 29.
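The quartiles and the IQR for the aptitude data can be reproduced by applying the same location rules at P = 25, 50 and 75. As before, `percentile` is an illustrative helper that assumes a whole-number P between 1 and 99.

```python
def percentile(values, p):
    """P-th percentile by the location rules (p: integer from 1 to 99)."""
    ordered = sorted(values)
    whole, remainder = divmod(len(ordered) * p, 100)
    if remainder != 0:
        return ordered[whole]                          # Rule 1: round up
    return (ordered[whole - 1] + ordered[whole]) / 2   # Rule 2: average


aptitude = [
    22, 25, 28, 31, 34, 35, 39, 39, 40, 42, 44, 44, 46, 48, 49, 51, 53, 53, 55, 55,
    56, 57, 59, 60, 61, 63, 63, 63, 65, 66, 68, 68, 69, 71, 72, 72, 74, 75, 75, 76,
    78, 78, 80, 82, 83, 85, 88, 90, 92, 96,
]

q1, q2, q3 = (percentile(aptitude, p) for p in (25, 50, 75))
iqr = q3 - q1
print(q1, q2, q3, iqr)  # 46 62.0 75 29
```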


|»| Z-Scores

The Z-score determines the relative position of any particular data value X and is based on the mean and standard deviation of the data set.

The Z-score expresses the number of standard deviations the value x is from the mean.

A negative Z-score implies that x is to the left of the mean and a positive Z-score implies that x is to the right of the mean.

$$z = \frac{x - \bar{x}}{s}$$

Example 3.14: Find the z-score for an aptitude test score of 83.

$$z = \frac{x - \bar{x}}{s} = \frac{83 - 60.36}{18.61} = 1.22$$

|»| Standardizing Sample Data

The process of subtracting the mean and dividing by the standard deviation is referred to as standardizing the sample data.

The corresponding z-score is the standardized score.
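A short sketch of standardizing, assuming the rounded aptitude summary values 60.36 and 18.61 from Example 3.14. The accident data from Example 3.1 are reused here only to illustrate that a standardized sample has mean 0 and standard deviation 1.

```python
import statistics


def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd


# Example 3.14: aptitude score of 83 with x-bar = 60.36 and s = 18.61.
z = z_score(83, 60.36, 18.61)
print(round(z, 2))  # 1.22

# Standardizing a whole sample.
accidents = [18, 10, 15, 13, 17, 15, 12, 15, 18, 16, 11]
mean = statistics.mean(accidents)
sd = statistics.stdev(accidents)
standardized = [z_score(x, mean, sd) for x in accidents]
```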


|»| Skewness, Sk

Skewness measures the tendency of a distribution to stretch out in a particular direction.

Pearson’s coefficient of skewness is used to calculate skewness:

$$Sk = \frac{3(\bar{x} - Md)}{s}$$

The values of Sk will always fall between -3 and 3.

A positive Sk implies a shape that is skewed right, typically with mode < median < mean.

A negative Sk implies a shape that is skewed left, typically with mean < median < mode.

Example 3.15: Find the skewness for the aptitude data.

Sk = 3(60.36 – 62)/18.61 = 3(-1.64)/18.61 = -4.92/18.61 = -0.26
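Pearson's coefficient of skewness is simple enough to check by hand or in code. The sketch below plugs in the rounded aptitude summary values used in Example 3.15; the function name is an assumption made for illustration.

```python
def pearson_skewness(mean, median, sd):
    """Pearson's coefficient of skewness: 3 * (mean - median) / sd."""
    return 3 * (mean - median) / sd


# Aptitude data summary: x-bar = 60.36, Md = 62, s = 18.61.
sk = pearson_skewness(60.36, 62, 18.61)
print(round(sk, 2))  # -0.26, a slight left (negative) skew
```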


|»| Skewness, Sk – In Graphs

[Figure: Histogram of symmetric data, where x̄ = Md = Mo. Vertical axis: frequency.]


[Figure: Histogram with right (positive) skew, Sk > 0, marking the mode (Mo), median (Md), and mean (x̄). Vertical axis: relative frequency.]


[Figure: Histogram with left (negative) skew, Sk < 0, marking the mode (Mo), median (Md), and mean (x̄). Vertical axis: relative frequency.]

|»| Kurtosis

Kurtosis is a measure of the peakedness of a distribution.

Large values occur when there is a high frequency of data near the mean and in the tails.

The calculation is cumbersome and the measure is used infrequently.

|»| Interpreting X-bar and S

How many, or what percentage, of the data values lie within two standard deviations of the mean?

There are usually three ways to answer this:

Actual percentage based on the sample

Chebyshev’s Inequality

Empirical Rule


|»| Interpreting X-bar and S (cont.)

According to Chebyshev, in general, at least $\left(1 - \frac{1}{k^2}\right) \times 100\%$ of the data values lie between $\bar{x} - ks$ and $\bar{x} + ks$ (have z-scores between –k and k) for any k > 1.

Chebyshev’s Inequality is usually conservative but makes no assumption about the distribution of the population.

The Empirical Rule assumes a bell-shaped distribution of the population, i.e., a normal population.

Between                Actual Percentage (Aptitude Data)    Chebyshev’s Inequality Percentage    Empirical Rule Percentage
x̄ - s and x̄ + s       66% (33 out of 50)                   not applicable (k must exceed 1)     ≈ 68%
x̄ - 2s and x̄ + 2s     98% (49 out of 50)                   ≥ 75%                                ≈ 95%
x̄ - 3s and x̄ + 3s     100% (50 out of 50)                  ≥ 89%                                ≈ 100%
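The "Actual Percentage" and Chebyshev entries in the table can be recomputed from the aptitude data. This is a sketch with illustrative function names; it counts values inside x̄ ± ks and evaluates the Chebyshev bound, which is meaningful only for k > 1.

```python
import statistics

aptitude = [
    22, 25, 28, 31, 34, 35, 39, 39, 40, 42, 44, 44, 46, 48, 49, 51, 53, 53, 55, 55,
    56, 57, 59, 60, 61, 63, 63, 63, 65, 66, 68, 68, 69, 71, 72, 72, 74, 75, 75, 76,
    78, 78, 80, 82, 83, 85, 88, 90, 92, 96,
]
mean = statistics.mean(aptitude)
s = statistics.stdev(aptitude)


def chebyshev_bound(k):
    """Minimum percentage of values within k standard deviations (k > 1)."""
    return (1 - 1 / k ** 2) * 100


def actual_percentage(values, mean, s, k):
    """Percentage of values lying between mean - k*s and mean + k*s."""
    inside = sum(1 for x in values if mean - k * s <= x <= mean + k * s)
    return inside / len(values) * 100


for k in (1, 2, 3):
    print(k, round(actual_percentage(aptitude, mean, s, k)))
```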


|»| A Bell-Shaped (Normal) Population


|»| Bivariate Data

Data collected on two variables for each item.

Example 3.16: Data for 10 families on income (thousands of dollars) and square footage of home (hundreds of square feet) (Income-Footage Data).

Income (000s), X    Sq Footage of Home (00s), Y
32                  16
36                  17
55                  26
47                  24
38                  22
60                  21
66                  32
44                  18
70                  30
50                  20

|»| Scatter Diagram

Graphical illustration of bivariate data.

Each observation is represented by a point, where the X-axis is always horizontal and the Y-axis is vertical.

[Figure: Scatter diagrams (a) and (b) of the Income-Footage Data. X-axis: Income (thousands), 20 to 80. Y-axis: Square footage (hundreds), 5 to 35.]

|»| Coefficient of Correlation, r

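Measures the strength of the linear relationship between the X variable and the Y variable.

$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2}\,\sqrt{\sum (y - \bar{y})^2}} = \frac{\sum xy - \frac{(\sum x)(\sum y)}{n}}{\sqrt{\sum x^2 - \frac{(\sum x)^2}{n}}\,\sqrt{\sum y^2 - \frac{(\sum y)^2}{n}}} = \frac{SCP_{xy}}{\sqrt{SS_x}\,\sqrt{SS_y}}$$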

r ranges from -1 to 1.

The larger |r| is, the stronger the linear relationship between X and Y.

If r = 1 or r = -1, X and Y are perfectly correlated.

If r > 0, X and Y have a positive relationship (i.e., large values of X are associated with large values of Y).

If r < 0, X and Y have a negative relationship (i.e., large values of X are associated with small values of Y).


|»| Coefficient of Correlation – Example

Example 3.17: Calculate r for Income-Footage Data.

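Income, X    Footage, Y    XY                  X²                Y²
32           16            32 x 16 = 512       (32)² = 1024      (16)² = 256
36           17            36 x 17 = 612       (36)² = 1296      (17)² = 289
55           26            55 x 26 = 1430      3025              676
47           24            47 x 24 = 1128      2209              576
38           22            38 x 22 = 836       1444              484
60           21            60 x 21 = 1260      3600              441
66           32            66 x 32 = 2112      4356              1024
44           18            44 x 18 = 792       1936              324
70           30            70 x 30 = 2100      4900              900
50           20            50 x 20 = 1000      2500              400
Total: 498   226           11782               26290             5370

$$r = \frac{\sum xy - \frac{(\sum x)(\sum y)}{n}}{\sqrt{\sum x^2 - \frac{(\sum x)^2}{n}}\,\sqrt{\sum y^2 - \frac{(\sum y)^2}{n}}} = \frac{11782 - \frac{(498)(226)}{10}}{\sqrt{26290 - \frac{(498)^2}{10}}\,\sqrt{5370 - \frac{(226)^2}{10}}} = 0.843$$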
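The computational formula for r can be followed step by step in Python. This is a sketch with illustrative variable names (`scp`, `ss_x`, `ss_y` mirror SCP, SSx and SSy in the formula); the intermediate values for the Income-Footage Data are SCP = 527.2, SSx = 1489.6 and SSy = 262.4.

```python
import math

income = [32, 36, 55, 47, 38, 60, 66, 44, 70, 50]    # X, thousands of dollars
footage = [16, 17, 26, 24, 22, 21, 32, 18, 30, 20]   # Y, hundreds of sq ft


def correlation(xs, ys):
    """Coefficient of correlation r from the sums used in Example 3.17."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    scp = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n   # SCP_xy
    ss_x = sum(x * x for x in xs) - sum_x ** 2 / n                 # SS_x
    ss_y = sum(y * y for y in ys) - sum_y ** 2 / n                 # SS_y
    return scp / math.sqrt(ss_x * ss_y)


r = correlation(income, footage)
print(round(r, 3))  # 0.843
```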


|»| Coefficient of Correlation, r – In Graphs

[Figure: Scatter diagrams of y against x illustrating (a) r = 0, (b) r = 1, (c) r = -1, and (d) r = .9.]

|»| Coefficient of Correlation, r – In Graphs

[Figure: Scatter diagrams of y against x illustrating (e) r = -.8 and (f) r = .5.]
