Data Summary Using Descriptive Measures, Sections 3.1 – 3.6, 3.8. Based on Introduction to Business Statistics, Kvanli / Pavur / Keeling
Jan 18, 2016
|»| Summary of Descriptive Measures
Descriptive measure: a single number computed from the sample data that provides information about the data.
An example of such a measure is the mean, which is the average of all the observations in a sample or a population.
Measures of Central Tendency: determine the center of the data values or possibly the most typical value.
  Mean: the average of the data values.
  Median: the value in the center of the ordered data values.
  Mode: the value that occurs more than once and the most often.
  Midrange: the average of the highest and the lowest values.
Measures of Variation: determine the spread of the data.
  Range: Range = H - L.
  Variance: the average of the squared differences between the individual values and the mean.
  Standard Deviation: the positive square root of the variance.
  Coefficient of Variation: the standard deviation in terms of the mean.
Measures of Position: indicate how a particular data point fits in with all the other data points.
  Percentile: P% of the data lie below the P-th percentile and (100 - P)% lie above it.
  Quartiles: the 25th, 50th and 75th percentiles.
  Z-Score: expresses the number of standard deviations the value x is from the mean.
Measures of Shape: indicate how the data points are distributed.
  Skewness: the tendency of a distribution to stretch out in a particular direction.
  Kurtosis: a measure of the peakedness of a distribution.
|»| The Mean
The mean represents the average of the data and is computed by dividing the sum of the data points by the number of the data points.
It is the most popular measure of central tendency.
We can easily compute and explain the mean.
We have two types of mean depending on whether the data set includes all items of a population or a subset of items of a population – Sample Mean and Population Mean.
|»| Sample Mean
It is the sum of the data values in a sample divided by the number of data values in that sample.
We use (X-bar) to denote the sample mean, and n to denote the number of data values in a sample.
Therefore, for ungrouped data, we obtain,
$$\bar{X} = \frac{\sum x}{n} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n}$$
Example 3.1 (Accident Data): The following sample represents the number of accidents (monthly) over 11 months: 18, 10, 15, 13, 17, 15, 12, 15, 18, 16, 11. Compute the mean number of monthly accidents, i.e., compute the sample mean.
$$\bar{X} = \frac{\sum x}{n} = \frac{18 + 10 + 15 + 13 + 17 + 15 + 12 + 15 + 18 + 16 + 11}{11} = \frac{160}{11} = 14.55$$
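The calculation in Example 3.1 can be checked with a few lines of Python. This is only an illustrative sketch; the names `accidents` and `sample_mean` are chosen here and do not come from the textbook.

```python
# Monthly accident counts from Example 3.1.
accidents = [18, 10, 15, 13, 17, 15, 12, 15, 18, 16, 11]


def sample_mean(values):
    """Sum of the data values divided by the number of data values."""
    return sum(values) / len(values)


x_bar = sample_mean(accidents)
print(round(x_bar, 2))  # 14.55
```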
|»| Sample Mean (cont.)
Example 3.2: The mean of a sample with 5 observations is 20. If the sum of four of the observations is 75, what is the value of the fifth observation?
$$\sum_{i=1}^{4} x_i = 75, \qquad \sum_{i=1}^{5} x_i = 5 \times 20 = 100$$
$$x_5 = \sum_{i=1}^{5} x_i - \sum_{i=1}^{4} x_i = 100 - 75 = 25$$
|»| Population Mean
It is the sum of the data values in a population divided by the number of data values in that population.
We use μ to denote the population mean, and N to denote the number of data values in a population.
Therefore, we obtain,
$$\mu = \frac{\sum x}{N} = \frac{x_1 + x_2 + x_3 + \cdots + x_N}{N}$$
|»| The Median, Md
The Median (Md) of a set of data is the value in the center of the data values when they are arranged from lowest to highest.
An equal number of data values lie to its left and to its right.
The median is preferred to the mean as a measure of central tendency for data sets with outliers.
Calculating the median from a sample involves the following steps:
1. Arrange the data values in ascending order.
2. Find the position of the median. The median position is the $\frac{n+1}{2}$-th ordered value.
3. Find the median value. If $\frac{n+1}{2}$ is not a whole number, the median is the average of the two ordered values on either side of that position.
|»| The Median (cont.)
Example 3.3: Compute the median for the accident data given in Example 3.1.
Ascending order: 10, 11, 12, 13, 15, 15, 15, 16, 17, 18, 18.
n = 11, Median Position = (11+1)/2 = 6th ordered value.
Md = 15.
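The median steps used in Example 3.3 can be sketched in Python as follows. The function name `median` and the handling of an even number of values (averaging the two middle values) are assumptions made for this illustration.

```python
def median(values):
    """Median via the (n + 1) / 2 position rule on the ordered data."""
    ordered = sorted(values)          # Step 1: ascending order
    n = len(ordered)
    mid = n // 2                      # 0-based index of the upper middle value
    if n % 2 == 1:
        # Odd n: position (n + 1) / 2 is a whole number.
        return ordered[mid]
    # Even n: average the two values around position (n + 1) / 2.
    return (ordered[mid - 1] + ordered[mid]) / 2


accidents = [18, 10, 15, 13, 17, 15, 12, 15, 18, 16, 11]
md = median(accidents)
print(md)  # 15
```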
|»| The Mode, Mo
The Mode (Mo) of a data set is the value that occurs more than once and the most often.
Mode is not always a measure of central tendency; this value need not occur in the center of the data.
There may be more than one mode if several numbers occur the same (and the largest) number of times.
Mode is extensively used in areas such as manufacturing of clothing, shoes, etc.
Example 3.4: Find the mode for the accident data given in Example 3.1.
The value 15 occurs most often (three times), so Mo = 15.
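A small Python sketch of the mode definition above, including the case of several modes. The function name `modes` is hypothetical, and returning an empty list when no value repeats is a design choice made here to mirror the "occurs more than once" requirement.

```python
from collections import Counter


def modes(values):
    """All values that occur most often, provided they occur more than once."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    if top < 2:
        return []  # no value repeats, so there is no mode
    return sorted(v for v, c in counts.items() if c == top)


accidents = [18, 10, 15, 13, 17, 15, 12, 15, 18, 16, 11]
print(modes(accidents))  # [15]
```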
|»| The Midrange, Mr
It is the average of the highest and the lowest values of a data set.
Midrange provides an easy-to-grasp measure of central tendency.
If we use H to denote the highest value and L to denote the lowest value of a data set, we obtain
$$Mr = \frac{L + H}{2}$$
Example 3.5: Find the midrange for the accident data given in Example 3.1.
L = 10, H = 18, so Mr = (L + H)/2 = (10 + 18)/2 = 14.
|»| The Range, R
The range is the numerical difference between the largest value (H) and the smallest value (L). That is, Range = H – L.
Example 3.6: The range for the accident data given in Example 3.1 is H – L = 18 – 10 = 8.
The range is a crude measure of variation, but it is easy to calculate and contains valuable information in some situations.
For instance, stock reports cite the high and low prices of the day.
Similarly, weather forecasts use daily high and low temperatures.
The range is strongly influenced by outliers.
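Both the range and the midrange depend only on the highest and lowest values, which a short Python sketch makes explicit. The function names below are chosen for illustration only.

```python
def data_range(values):
    """Range = H - L."""
    return max(values) - min(values)


def midrange(values):
    """Midrange = (L + H) / 2."""
    return (min(values) + max(values)) / 2


accidents = [18, 10, 15, 13, 17, 15, 12, 15, 18, 16, 11]
print(data_range(accidents))  # 8
print(midrange(accidents))    # 14.0
```

Because only two data points enter either formula, a single outlier changes both results directly.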
|»| The Variance
Variance describes the spread of the data values from the mean.
It is the average of the squared differences between the individual values and the mean.
Two types of variance are (1) Sample variance, and (2) Population variance.
|»| Sample Variance, S²
S² describes the variation of the sample values about the sample mean.
It is an average of the squared differences between the individual values and the sample mean, computed by dividing their sum by n - 1.
That is,
$$S^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1}$$
|»| Sample Variance - Example
Example 3.7: Calculate the sample variance for the accident data.
$x$    $x - \bar{x}$            $(x - \bar{x})^2$
18     18 - 14.55 =  3.45       11.9025
10     10 - 14.55 = -4.55       20.7025
15     15 - 14.55 =  0.45        0.2025
13     13 - 14.55 = -1.55        2.4025
17     17 - 14.55 =  2.45        6.0025
15     15 - 14.55 =  0.45        0.2025
12     12 - 14.55 = -2.55        6.5025
15     15 - 14.55 =  0.45        0.2025
18     18 - 14.55 =  3.45       11.9025
16     16 - 14.55 =  1.45        2.1025
11     11 - 14.55 = -3.55       12.6025
Sum: $\sum (x - \bar{x})^2 = 74.7275$
$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{74.7275}{11 - 1} = \frac{74.7275}{10} = 7.47$$
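The table in Example 3.7 can be reproduced with a short Python sketch of the definitional formula. The function name `sample_variance` is chosen here; the code uses the unrounded mean, so intermediate sums differ slightly from the table, which rounds the mean to 14.55.

```python
def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    squared_deviations = [(x - x_bar) ** 2 for x in values]
    return sum(squared_deviations) / (n - 1)


accidents = [18, 10, 15, 13, 17, 15, 12, 15, 18, 16, 11]
s2 = sample_variance(accidents)
print(round(s2, 2))  # 7.47
```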
|»| Sample Variance - Examples
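Example 3.8: From a sample of 50 data values, the statistics ∑x and ∑x² are calculated to be 20 and 33, respectively. Compute the sample variance.
$$S^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1} = \frac{33 - \frac{(20)^2}{50}}{50 - 1} = \frac{33 - 8}{49} = \frac{25}{49} = 0.51$$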
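When only the summary statistics ∑x and ∑x² are available, as in Example 3.8, the shortcut form of the variance formula applies. A minimal Python sketch, with a hypothetical function name `variance_from_sums`:

```python
def variance_from_sums(sum_x, sum_x2, n):
    """Shortcut formula: (sum of x^2 - (sum of x)^2 / n) / (n - 1)."""
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)


# Example 3.8: n = 50, sum of x = 20, sum of x^2 = 33.
s2 = variance_from_sums(20, 33, 50)
print(round(s2, 2))  # 0.51
```

Applied to the accident data (∑x = 160, ∑x² = 2402, n = 11), the same function gives about 7.47, which agrees with the definitional formula.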
Example 3.9: The differences between the data values and the sample mean are -5, 1, -3, 2, 3, and 2. What is the variance of the data?
$x - \bar{x}$    $(x - \bar{x})^2$
-5               (-5)² = 25
 1               (1)²  = 1
-3               (-3)² = 9
 2               (2)²  = 4
 3               (3)²  = 9
 2               (2)²  = 4
Sum: $\sum (x - \bar{x})^2 = 52$
$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{52}{6 - 1} = \frac{52}{5} = 10.4$$
|»| Population Variance, σ²
σ² describes the variation of the population values about the population mean.
It is the average of the squared differences between the individual values and the population mean.
That is,
$$\sigma^2 = \frac{\sum (x - \mu)^2}{N}$$
|»| The Standard Deviation
Standard deviation is the positive square root of the variance.
The positive square root of the sample variance is the sample standard deviation, denoted by S.
The positive square root of the population variance is the population standard deviation, denoted by σ.
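$$S = \sqrt{S^2} = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} = \sqrt{\frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1}}$$
$$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum (x - \mu)^2}{N}}$$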
|»| Standard Deviation
Example 3.10: Find the sample standard deviations for Examples 3.7, 3.8, and 3.9.
From Example 3.7: $s = \sqrt{s^2} = \sqrt{7.47} = 2.73$
From Example 3.8: $s = \sqrt{s^2} = \sqrt{0.51} = 0.71$
From Example 3.9: $s = \sqrt{s^2} = \sqrt{10.4} = 3.22$
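The three square roots in Example 3.10 can be checked in Python. The dictionary below simply restates the rounded variances from Examples 3.7 to 3.9; the variable names are illustrative.

```python
import math

# Rounded sample variances taken from Examples 3.7, 3.8 and 3.9.
variances = {"Example 3.7": 7.47, "Example 3.8": 0.51, "Example 3.9": 10.4}

# The standard deviation is the positive square root of the variance.
std_devs = {name: math.sqrt(s2) for name, s2 in variances.items()}

for name, s in std_devs.items():
    print(name, round(s, 2))
```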
|»| Coefficient of Variation, CV
The coefficient of variation measures the standard deviation in terms of the mean.
For example, what percentage of x-bar is s?
The Coefficient of Variation (CV) is used to compare the variation of two or more data sets where the values of the data differ greatly.
$$CV = \frac{s}{\bar{x}} \times 100$$
Example 3.11: The scores for team 1 were 70, 60, 65, and 69. The scores for team 2 were 72, 58, 61, and 73. Compare the coefficients of variation for these two teams.
For team 1: $\bar{X} = 66$, $S = 4.55$, so $CV = \frac{s}{\bar{X}} \times 100 = 6.89$.
For team 2: $\bar{X} = 66$, $S = 7.62$, so $CV = \frac{s}{\bar{X}} \times 100 = 11.55$.
Team 2's scores vary more relative to their mean than team 1's scores do.
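The comparison in Example 3.11 can be sketched with Python's standard `statistics` module. The function name `coefficient_of_variation` is chosen here. Because the code does not round s before dividing, the last digit for team 2 can differ slightly from a hand calculation that uses the rounded value S = 7.62.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the sample mean."""
    return statistics.stdev(values) / statistics.mean(values) * 100


team1 = [70, 60, 65, 69]
team2 = [72, 58, 61, 73]

cv1 = coefficient_of_variation(team1)
cv2 = coefficient_of_variation(team2)
print(round(cv1, 2), round(cv2, 2))
```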
|»| Percentile
The P-th percentile is a number such that P% of the measurements fall below the P-th percentile and (100-P)% fall above it.
It is the most common measure of position.
How to calculate a percentile:
1. Arrange the data in ascending order.
2. Find the location of the P-th percentile:
$$\text{Location} = n \cdot \frac{P}{100}$$
3. Find the percentile using the following rules:
Location Rule 1: If n · P/100 is not a counting number, round it up, and the P-th percentile will be the value in this position of the ordered data.
Location Rule 2: If n · P/100 is a counting number, the P-th percentile is the average of the number in this location (of the ordered data) and the number in the next location.
|»| Percentile - Example
Example 3.12: Find the 35th percentile from the following aptitude data (Aptitude Data).
22 25 28 31 34 35 39 39 40 42 44 44 46 48 49 51 53 53 55 55
56 57 59 60 61 63 63 63 65 66 68 68 69 71 72 72 74 75 75 76
78 78 80 82 83 85 88 90 92 96
Number of data values, n = 50.
35th percentile = P35. So,
$$\text{Location of } P_{35} = n \cdot \frac{P}{100} = 50 \cdot \frac{35}{100} = 17.5$$
17.5 is NOT a counting number. So, using Location Rule 1, P35 = 18th value = 53.
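The two location rules can be sketched in Python as shown below. The function name `percentile` is hypothetical, and the sketch assumes that P is a whole number strictly between 0 and 100 and that the data are already sorted. Other software packages often use different interpolation conventions, so their results may not match these rules.

```python
def percentile(ordered, p):
    """P-th percentile of sorted data using Location Rules 1 and 2."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    n = len(ordered)
    # n * p / 100 split into its whole part and remainder (integer arithmetic).
    whole, remainder = divmod(n * p, 100)
    if remainder:
        # Rule 1: location is not a counting number, so round it up.
        # 1-based position whole + 1 is 0-based index whole.
        return ordered[whole]
    # Rule 2: average the value at this location and the next one.
    return (ordered[whole - 1] + ordered[whole]) / 2


aptitude = [
    22, 25, 28, 31, 34, 35, 39, 39, 40, 42, 44, 44, 46, 48, 49, 51, 53, 53, 55, 55,
    56, 57, 59, 60, 61, 63, 63, 63, 65, 66, 68, 68, 69, 71, 72, 72, 74, 75, 75, 76,
    78, 78, 80, 82, 83, 85, 88, 90, 92, 96,
]
p35 = percentile(aptitude, 35)
print(p35)  # 53
```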
|»| Quartiles and Interquartile Range
Quartiles are merely particular percentiles that divide the data into quarters, namely:
Q1 = 1st quartile = 25th percentile (P25)
Q2 = 2nd quartile = 50th percentile (P50) = Median
Q3 = 3rd quartile = 75th percentile (P75)
Example 3.13: Determine the quartiles for the aptitude data.
Q1: location $= n \cdot \frac{P}{100} = 50 \cdot \frac{25}{100} = 12.5$, so Q1 = 13th ordered value = 46
Q2: location $= n \cdot \frac{P}{100} = 50 \cdot \frac{50}{100} = 25$, so Q2 = Median = (61 + 63)/2 = 62
Q3: location $= n \cdot \frac{P}{100} = 50 \cdot \frac{75}{100} = 37.5$, so Q3 = 38th ordered value = 75
Interquartile Range (IQR)
The range for the middle 50% of the data.
IQR = Q3 – Q1. For aptitude data: IQR = 75 – 46 = 29.
|»| Z-Scores
The Z-score determines the relative position of any particular data value X and is based on the mean and standard deviation of the data set.
The Z-score expresses the number of standard deviations the value x is from the mean.
A negative Z-score implies that x is to the left of the mean and a positive Z-score implies that x is to the right of the mean.
$$z = \frac{x - \bar{x}}{s}$$
Example 3.14: Find the z-score for an aptitude test score of 83.
$$z = \frac{x - \bar{x}}{s} = \frac{83 - 60.36}{18.61} = 1.22$$
|»| Standardizing Sample Data
The process of subtracting the mean and dividing by the standard deviation is referred to as standardizing the sample data.
The corresponding z-score is the standardized score.
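Standardizing can be sketched in Python as follows. The function names `z_score` and `standardize` are illustrative. The single z-score uses the aptitude-data summary values quoted in Example 3.14 (mean 60.36, s = 18.61), while the full standardization is shown on the smaller accident data set from Example 3.1.

```python
import statistics


def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / std_dev


def standardize(values):
    """Subtract the sample mean and divide by the sample standard deviation."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    return [z_score(x, mean, s) for x in values]


z_83 = z_score(83, 60.36, 18.61)
print(round(z_83, 2))  # 1.22

accidents = [18, 10, 15, 13, 17, 15, 12, 15, 18, 16, 11]
z_values = standardize(accidents)
```

Standardized scores always have mean 0 and standard deviation 1, which is what makes them comparable across data sets.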
|»| Skewness, Sk
Skewness measures the tendency of a distribution to stretch out in a particular direction.
Pearson’s coefficient of skewness is used to calculate skewness:
$$Sk = \frac{3(\bar{x} - Md)}{s}$$
Example 3.15: Find the skewness for aptitude data. Sk = 3(60.36 – 62)/18.61 = 3(-1.64)/18.61 = -4.92/18.61 = -0.26
The values of Sk will always fall between -3 and 3.
A positive Sk implies a shape that is skewed right, typically with mode < median < mean.
A negative Sk implies a shape that is skewed left, typically with mean < median < mode.
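Pearson's coefficient of skewness needs only three summary values, so a Python sketch is short. The function name `pearson_skewness` is chosen here, and the inputs are the aptitude-data summary values quoted in Example 3.15.

```python
def pearson_skewness(mean, median, std_dev):
    """Pearson's coefficient of skewness: 3 * (mean - median) / s."""
    return 3 * (mean - median) / std_dev


sk = pearson_skewness(mean=60.36, median=62, std_dev=18.61)
print(round(sk, 2))  # -0.26
```

The negative sign indicates a slight left skew, consistent with the mean lying below the median.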
|»| Skewness, Sk – In Graphs
[Figure: Histogram of Symmetric Data. Mean = Md = Mo at the center; vertical axis: Frequency.]
|»| Skewness, Sk – In Graphs
[Figure: Histogram with Right (Positive) Skew, Sk > 0. Mode (Mo), Median (Md) and Mean are marked; vertical axis: Relative Frequency.]
|»| Skewness, Sk – In Graphs
[Figure: Histogram with Left (Negative) Skew, Sk < 0. Mode (Mo), Median (Md) and Mean are marked; vertical axis: Relative Frequency.]
|»| Kurtosis
Kurtosis is a measure of the peakedness of a distribution.
Large values occur when there is a high frequency of data near the mean and in the tails.
The calculation is cumbersome and the measure is used infrequently.
|»| Interpreting X-bar and S
How many, or what percentage, of the data values lie within two standard deviations of the mean?
There are usually three ways to answer this:
Actual percentage based on the sample
Chebyshev’s Inequality
Empirical Rule
|»| Interpreting X-bar and S (cont.)
According to Chebyshev, in general, at least $\left(1 - \frac{1}{k^2}\right) \times 100\%$ of the data values lie between $\bar{x} - ks$ and $\bar{x} + ks$ (have z-scores between –k and k) for any k > 1.
Chebyshev’s Inequality is usually conservative but makes no assumption about the distribution of the population.
The Empirical Rule assumes a bell-shaped distribution of the population, i.e., a normal population.
Between                Actual Percentage (Aptitude Data)   Chebyshev’s Inequality   Empirical Rule
x̄ - s and x̄ + s        66% (33 out of 50)                  n/a (requires k > 1)     ≈ 68%
x̄ - 2s and x̄ + 2s      98% (49 out of 50)                  ≥ 75%                    ≈ 95%
x̄ - 3s and x̄ + 3s      100% (50 out of 50)                 ≥ 89%                    ≈ 100%
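The "actual" and "Chebyshev" columns of the table can be reproduced with a short Python sketch. The function names `chebyshev_bound` and `count_within` are illustrative, and the data are the 50 aptitude scores listed in Example 3.12.

```python
import statistics

aptitude = [
    22, 25, 28, 31, 34, 35, 39, 39, 40, 42, 44, 44, 46, 48, 49, 51, 53, 53, 55, 55,
    56, 57, 59, 60, 61, 63, 63, 63, 65, 66, 68, 68, 69, 71, 72, 72, 74, 75, 75, 76,
    78, 78, 80, 82, 83, 85, 88, 90, 92, 96,
]


def chebyshev_bound(k):
    """Minimum percentage of values within k standard deviations (k > 1)."""
    return (1 - 1 / k ** 2) * 100


def count_within(values, k):
    """Number of values between mean - k*s and mean + k*s."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    return sum(1 for x in values if mean - k * s <= x <= mean + k * s)


for k in (1, 2, 3):
    actual = count_within(aptitude, k) / len(aptitude) * 100
    bound = chebyshev_bound(k) if k > 1 else None
    print(k, actual, bound)
```

For every k > 1 the actual percentage is at least as large as the Chebyshev bound, which illustrates why the inequality is described as conservative.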
|»| A Bell-Shaped (Normal) Population
|»| Bivariate Data
Data collected on two variables for each item.
Example 3.16: Data for 10 families on income (thousands of dollars) and square footage of home (hundreds of square feet) (Income-Footage Data).
Income (000s), X    Sq Footage of Home (00s), Y
32                  16
36                  17
55                  26
47                  24
38                  22
60                  21
66                  32
44                  18
70                  30
50                  20
|»| Scatter Diagram
A scatter diagram is a graphical illustration of bivariate data.
Each observation is represented by a point, where the X-axis is always horizontal and the Y-axis is vertical.
[Figure: scatter diagrams, panels (a) and (b). X-axis: Income (thousands), 20 to 80; Y-axis: Square footage (hundreds), 5 to 35.]
|»| Coefficient of Correlation, r
Measures the strength of the linear relationship between the X variable and the Y variable.
$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}} = \frac{\sum xy - \frac{(\sum x)(\sum y)}{n}}{\sqrt{\left(\sum x^2 - \frac{(\sum x)^2}{n}\right)\left(\sum y^2 - \frac{(\sum y)^2}{n}\right)}} = \frac{SCP_{xy}}{\sqrt{SS_x \, SS_y}}$$
r ranges from -1 to 1.
The larger |r| is, the stronger the linear relationship between X and Y.
If r = 1 or r = -1, X and Y are perfectly correlated.
If r > 0, X and Y have a positive relationship (i.e., large values of X are associated with large values of Y).
If r < 0, X and Y have a negative relationship (i.e., large values of X are associated with small values of Y).
|»| Coefficient of Correlation – Example
Example 3.17: Calculate r for Income-Footage Data.
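Income, X    Footage, Y    XY                X²               Y²
32           16            32 × 16 = 512     (32)² = 1024     (16)² = 256
36           17            36 × 17 = 612     (36)² = 1296     (17)² = 289
55           26            55 × 26 = 1430    3025             676
47           24            47 × 24 = 1128    2209             576
38           22            38 × 22 = 836     1444             484
60           21            60 × 21 = 1260    3600             441
66           32            66 × 32 = 2112    4356             1024
44           18            44 × 18 = 792     1936             324
70           30            70 × 30 = 2100    4900             900
50           20            50 × 20 = 1000    2500             400
Sum: 498     226           11782             26290            5370
$$r = \frac{\sum xy - \frac{(\sum x)(\sum y)}{n}}{\sqrt{\left(\sum x^2 - \frac{(\sum x)^2}{n}\right)\left(\sum y^2 - \frac{(\sum y)^2}{n}\right)}} = \frac{11782 - \frac{(498)(226)}{10}}{\sqrt{\left(26290 - \frac{(498)^2}{10}\right)\left(5370 - \frac{(226)^2}{10}\right)}} = 0.843$$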
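The computation in Example 3.17 can be checked with a Python sketch of the shortcut formula. The function name `correlation` and the intermediate names `scp_xy`, `ss_x` and `ss_y` are chosen here to echo the SCP and SS notation on the previous slide.

```python
import math

income = [32, 36, 55, 47, 38, 60, 66, 44, 70, 50]    # X, thousands of dollars
footage = [16, 17, 26, 24, 22, 21, 32, 18, 30, 20]   # Y, hundreds of square feet


def correlation(xs, ys):
    """Coefficient of correlation r from the sums of x, y, xy, x^2 and y^2."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    scp_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    ss_x = sum(x * x for x in xs) - sum_x ** 2 / n
    ss_y = sum(y * y for y in ys) - sum_y ** 2 / n
    return scp_xy / math.sqrt(ss_x * ss_y)


r = correlation(income, footage)
print(round(r, 3))  # 0.843
```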
|»| Coefficient of Correlation, r – In Graphs
[Figure: scatter diagrams of y against x. Panel (a): r = 0; panel (b): r = 1; panel (c): r = -1; panel (d): r = .9.]
|»| Coefficient of Correlation, r – In Graphs
[Figure: scatter diagrams of y against x. Panel (e): r = -.8; panel (f): r = .5.]