1
Data Description
Chapter 3
2
Through this chapter you will learn:
o Measures of central tendency
o Measures of dispersion
o Measures of position
3
A statistic is a characteristic or measure obtained by using the data values from a sample.
A parameter is a characteristic or measure obtained by using all the data values for a specific population.
4
Population Arithmetic Mean
$\mu = \frac{\sum X}{N}$
X: each value; N: total number of values in the population
5
Sample Arithmetic Mean
$\bar{X} = \frac{\sum X}{n}$
X: each value in the sample; n: total number of observations in the sample (sample size)
6
Example 1
Find the mean of the following sample data: 7, 4, 8, 8, 10, 12, 12.
$\sum X = 7 + 4 + 8 + 8 + 10 + 12 + 12 = 61$
$\bar{X} = \frac{\sum X}{n} = \frac{61}{7} \approx 8.71$
7
Estimate the Mean of Data Grouped into a Frequency Distribution
$\bar{X} = \frac{\sum f \cdot X_m}{n}$
f: frequency of each class
$X_m$: midpoint of each class
n: total number of frequencies ($n = \sum f$)
8
Example 2
Given the frequency distribution below, estimate the mean.
Class boundaries    Frequency
5.5-10.5            1
10.5-15.5           2
15.5-20.5           3
20.5-25.5           5
25.5-30.5           4
30.5-35.5           3
35.5-40.5           2
9
Example 2 (Cont.)
Class boundaries    Frequency f    Midpoint X_m    f·X_m
5.5-10.5            1              8               8
10.5-15.5           2              13              26
15.5-20.5           3              18              54
20.5-25.5           5              23              115
25.5-30.5           4              28              112
30.5-35.5           3              33              99
35.5-40.5           2              38              76
Total               n = Σf = 20                    Σ f·X_m = 490
10
Example 2 (Cont.)
$\bar{X} = \frac{\sum f \cdot X_m}{n} = \frac{490}{20} = 24.5$
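The grouped-mean estimate of Example 2 can be checked with a short Python sketch. The function name `grouped_mean` and its argument layout are illustrative choices, not part of the slides.

```python
def grouped_mean(boundaries, freqs):
    """Estimate the mean of grouped data from class boundaries and frequencies."""
    # Class midpoint X_m = (lower boundary + upper boundary) / 2
    midpoints = [(lo + hi) / 2 for lo, hi in boundaries]
    n = sum(freqs)  # total number of frequencies
    return sum(f * xm for f, xm in zip(freqs, midpoints)) / n


# Frequency distribution from Example 2
boundaries = [(5.5, 10.5), (10.5, 15.5), (15.5, 20.5), (20.5, 25.5),
              (25.5, 30.5), (30.5, 35.5), (35.5, 40.5)]
freqs = [1, 2, 3, 5, 4, 3, 2]

print(grouped_mean(boundaries, freqs))  # 24.5
```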
11
Median
The median is the midpoint of the data array.
Steps in finding the median of a data array:
Step 1: Arrange the data in order.
Step 2: Select the midpoint of the array as the median.
12
Example 3
Find the median of the scores 7 2 3 7 6 9 10 8 9 9 10.
Arrange the data in order to obtain
2 3 6 7 7 8 9 9 9 10 10
We have 11 values. 8 is the exact middle value and hence it is the median.
13
Example 4
Find the median of the scores 7 2 3 7 6 9 10 8 9 9
Arrange the data in order
2 3 6 7 7 8 9 9 9 10
With these ten scores, no single score is at the exact middle. Instead, the two scores of 7 and 8 share the middle. We therefore find the mean of these two scores.
14
Example 4 (Cont.)
$\text{Median} = \frac{7 + 8}{2} = 7.5$
The median is 7.5.
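The two cases of Examples 3 and 4 (odd and even number of values) can be sketched in Python. The helper name `median_of` is an illustrative choice.

```python
def median_of(values):
    """Return the median of a list of numbers."""
    ordered = sorted(values)          # Step 1: arrange the data in order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # odd count: a single middle value
        return ordered[mid]
    # even count: mean of the two values that share the middle
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median_of([7, 2, 3, 7, 6, 9, 10, 8, 9, 9, 10]))  # 8   (Example 3)
print(median_of([7, 2, 3, 7, 6, 9, 10, 8, 9, 9]))      # 7.5 (Example 4)
```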
15
Estimating the Median of Data Grouped into a Frequency Distribution
$\text{Median} = LB + \frac{n/2 - CF}{f} \cdot w$
16
LB: lower boundary of the median class
n: total number of frequencies
f: frequency of the median class
CF: cumulative frequency of the class preceding the median class
w: class width
17
Example 5
Given the frequency distribution below, estimate the median.
Class    Frequency
30-39    4
40-49    6
50-59    8
60-69    12
70-79    9
80-89    7
90-99    4
18
Example 5 (Cont.)
First find the cumulative frequency (CF).
Class    Frequency    CF
30-39    4            4
40-49    6            10
50-59    8            18
60-69    12           30
70-79    9            39
80-89    7            46
90-99    4            50
19
Example 5 (Cont.)
w = 10, n = 50, and hence n/2 = 25. The median falls in the class 60-69 (class boundaries 59.5-69.5).
$\text{Median} = LB + \frac{n/2 - CF}{f} \cdot w = 59.5 + \frac{25 - 18}{12} \cdot 10 \approx 65.33$
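The grouped-median formula can be sketched in Python as follows. The function name `grouped_median` is illustrative, and the sketch assumes equal class widths and classes listed in ascending order.

```python
def grouped_median(lower_boundaries, freqs, width):
    """Estimate the median of grouped data: LB + ((n/2 - CF) / f) * w."""
    n = sum(freqs)
    half = n / 2
    cf = 0  # cumulative frequency of the classes before the current one
    for lb, f in zip(lower_boundaries, freqs):
        if cf + f >= half:            # the median class contains position n/2
            return lb + (half - cf) / f * width
        cf += f
    raise ValueError("the frequency distribution is empty")


# Example 5: classes 30-39, ..., 90-99 have lower boundaries 29.5, ..., 89.5
lower_boundaries = [29.5, 39.5, 49.5, 59.5, 69.5, 79.5, 89.5]
freqs = [4, 6, 8, 12, 9, 7, 4]

print(round(grouped_median(lower_boundaries, freqs, 10), 2))  # 65.33
```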
20
Example 6
Estimate the median for the frequency distribution below.
Class      Frequency
80-89      5
90-99      9
100-109    20
110-119    8
120-129    6
130-139    2
21
Mode
o For data grouped into a frequency distribution, the mode can be estimated by the class midpoint of the modal class (the class with the highest frequency).
o It can also be found by the formula
$\text{Mode} = LB + \frac{d_1}{d_1 + d_2} \cdot w$
22
where
o LB: lower boundary of the modal class
o w: class width
o $d_1$: difference between the frequency of the modal class and that of the class preceding it
o $d_2$: difference between the frequency of the modal class and that of the class right after it
23
Example 7
Estimate the mode of the distribution below.
Class        Frequency (f)
5.5-10.5     1
10.5-15.5    2
15.5-20.5    3
20.5-25.5    5    (modal class)
25.5-30.5    4
30.5-35.5    3
35.5-40.5    2
24
Example 7 (Cont.)
LB = 20.5
w = 5
$d_1 = 5 - 3 = 2$
$d_2 = 5 - 4 = 1$
$\text{Mode} = 20.5 + \frac{2}{2 + 1} \cdot 5 \approx 23.83$
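A minimal Python sketch of the mode formula, applied to Example 7. The name `grouped_mode` is illustrative. Treating a missing neighbouring class as having frequency 0 is an assumption of this sketch; the slides do not cover that case.

```python
def grouped_mode(lower_boundaries, freqs, width):
    """Estimate the mode of grouped data: LB + d1 / (d1 + d2) * w."""
    # Index of the modal class (first class with the highest frequency)
    i = max(range(len(freqs)), key=freqs.__getitem__)
    # Assumption: a missing neighbouring class counts as frequency 0
    prev_f = freqs[i - 1] if i > 0 else 0
    next_f = freqs[i + 1] if i < len(freqs) - 1 else 0
    d1 = freqs[i] - prev_f
    d2 = freqs[i] - next_f
    return lower_boundaries[i] + d1 / (d1 + d2) * width


# Example 7
lower_boundaries = [5.5, 10.5, 15.5, 20.5, 25.5, 30.5, 35.5]
freqs = [1, 2, 3, 5, 4, 3, 2]

print(round(grouped_mode(lower_boundaries, freqs, 5), 2))  # 23.83
```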
25
The Midrange
$MR = \frac{\text{lowest value} + \text{highest value}}{2}$
26
Example 8
The midrange of the data set 2, 3, 6, 8, 4, 1 is
MR = (1 + 8)/2 = 4.5
27
The Weighted Mean
$\bar{X}_w = \frac{w_1 X_1 + w_2 X_2 + \cdots + w_n X_n}{w_1 + w_2 + \cdots + w_n} = \frac{\sum_{i=1}^{n} w_i X_i}{\sum_{i=1}^{n} w_i}$
$X_i$: the values
$w_i$: the weights
28
Example 8
A student obtained marks of 40, 50, 60, 80, and 55 in Math, Statistics, Physics, Chemistry, and Biology, respectively. Assuming weights of 5, 2, 4, 3, and 1, respectively, for these subjects, find the weighted arithmetic mean per subject.
29
Example 8 (cont.)
Subject      Marks obtained (X)    Weight (w)    w·X
Math         40                    5             200
Statistics   50                    2             100
Physics      60                    4             240
Chemistry    80                    3             240
Biology      55                    1             55
Total                              15            835
30
Example 8 (Cont.)
$\bar{X}_w = \frac{835}{15} \approx 55.667$ marks per subject
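The weighted-mean calculation can be sketched in Python, using the marks and weights listed in the table for this example. The function name `weighted_mean` is an illustrative choice.

```python
def weighted_mean(values, weights):
    """Weighted mean: sum(w_i * X_i) / sum(w_i)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


# Marks and weights as listed in the table (Math, Statistics, Physics,
# Chemistry, Biology)
marks = [40, 50, 60, 80, 55]
weights = [5, 2, 4, 3, 1]

print(round(weighted_mean(marks, weights), 3))  # 55.667
```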
31
Distribution Shapes
a) Positively skewed or right-skewed
(Figure: right-skewed curve; from left to right along the x-axis: Mode, Median, Mean.)
32
Distribution Shapes (cont.)
b) Negatively skewed or left-skewed
(Figure: left-skewed curve; from left to right along the x-axis: Mean, Median, Mode.)
33
Distribution Shapes (cont.)
c) Symmetric
Mean = Median = Mode
(Figure: symmetric curve with the mean, median, and mode at the center.)
34
Range
The range is the highest value minus the lowest value. The symbol R is used for the range.
$R = \text{highest value} - \text{lowest value}$
35
Mean Deviation
$\text{Mean Deviation} = \frac{\sum |X - \bar{X}|}{n}$
36
Example 9
The number of patients seen in the emergency room in a hospital for a sample of 5 days last year was: 103, 97, 101, 106, and 103. Determine the mean deviation and interpret.
37
Example 9 (Cont.)
First find the arithmetic mean:
$\bar{X} = \frac{103 + 97 + 101 + 106 + 103}{5} = \frac{510}{5} = 102$
38
Example 9 (Cont.)
Number of cases    Deviation          Absolute deviation
103                103 - 102 = 1      1
97                 97 - 102 = -5      5
101                101 - 102 = -1     1
106                106 - 102 = 4      4
103                103 - 102 = 1      1
Total                                 12
39
Example 9 (Cont.)
$MD = \frac{\sum |X - \bar{X}|}{n} = \frac{12}{5} = 2.4$
Hence the mean deviation is 2.4 patients per day. The number of patients deviates, on average, by 2.4 patients from the mean of 102 patients per day.
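The mean deviation can be computed with a few lines of Python. The function name `mean_deviation` is an illustrative choice; the second call uses the crate weights of Example 10.

```python
def mean_deviation(values):
    """Mean deviation: average absolute distance from the arithmetic mean."""
    mean = sum(values) / len(values)
    return sum(abs(x - mean) for x in values) / len(values)


patients = [103, 97, 101, 106, 103]              # Example 9
crates = [95, 103, 105, 110, 104, 105, 112, 90]  # Example 10

print(mean_deviation(patients))  # 2.4
print(mean_deviation(crates))    # 5.25
```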
40
Example 10
The weights of a group of crates being shipped to Ireland are (in pounds)
95, 103, 105, 110, 104, 105, 112, and 90.
a) What is the range of the weights?
b) Compute the arithmetic mean weight.
c) Compute the mean deviation of the weights.
(Answers: a) 22, b) 103, c) 5.25 pounds)
41
Population Variance and Standard Deviation
$\sigma^2 = \frac{\sum (X - \mu)^2}{N}$
$\sigma = \sqrt{\frac{\sum (X - \mu)^2}{N}}$
Remember: Standard deviation is the positive square root of variance.
42
Example 11
Find the variance and standard deviation for the population data: 35, 45, 30, 35, 40, 25.
Solution
First find the arithmetic mean:
$\sum X = 35 + 45 + 30 + 35 + 40 + 25 = 210$
$\mu = 210/6 = 35$
Then construct the table.
43
Example 11 (Cont.)
X     X - μ    (X - μ)²
35    0        0
45    10       100
30    -5       25
35    0        0
40    5        25
25    -10      100
Total          250
44
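Example 11 (Cont.)
The population variance is
$\sigma^2 = \frac{\sum (X - \mu)^2}{N} = \frac{250}{6} \approx 41.7$
The population standard deviation is
$\sigma = \sqrt{\frac{\sum (X - \mu)^2}{N}} = \sqrt{41.7} \approx 6.5$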
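The population variance and standard deviation of Example 11 can be reproduced with a short Python sketch. The function name `population_variance` is an illustrative choice.

```python
import math


def population_variance(values):
    """Population variance: sum((X - mu)^2) / N."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)


data = [35, 45, 30, 35, 40, 25]  # Example 11
variance = population_variance(data)
std_dev = math.sqrt(variance)    # positive square root of the variance

print(round(variance, 1), round(std_dev, 1))  # 41.7 6.5
```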
45
Sample Variance and Standard Deviation
Sample Variance (Conceptual formula)
$s^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$
Sample Variance (Computational formula)
$s^2 = \frac{\sum X^2 - (\sum X)^2 / n}{n - 1}$
46
Sample Variance and Standard Deviation (Cont.)
Sample Standard Deviation (Conceptual formula)
$s = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}$
Sample Standard Deviation (Computational formula)
$s = \sqrt{\frac{\sum X^2 - (\sum X)^2 / n}{n - 1}}$
47
Example 12
Find the sample variance and standard deviation for the amount of European auto sales for a sample of 6 years shown. The data are in millions of dollars.
11.2, 11.9, 12.0, 12.8, 13.4, 14.3
48
Example 12 (Cont.)
Method 1
Find the mean: $\bar{X} = 12.6$
X        X - X̄    (X - X̄)²
11.20    -1.40    1.96
11.90    -0.70    0.49
12.00    -0.60    0.36
12.80    0.20     0.04
13.40    0.80     0.64
14.30    1.70     2.89
Total             6.38
49
Example 12 (Cont.)
Method 1
The variance is
$s^2 = \frac{6.38}{6 - 1} \approx 1.28$
and hence the standard deviation is
$s = \sqrt{1.28} \approx 1.13$
50
Example 12 (Cont.)
Method 2
We compute
$\sum X = 11.2 + 11.9 + 12.0 + 12.8 + 13.4 + 14.3 = 75.6$
$\sum X^2 = 11.2^2 + 11.9^2 + 12.0^2 + 12.8^2 + 13.4^2 + 14.3^2 = 958.94$
The variance is computed by
$s^2 = \frac{958.94 - (75.6)^2 / 6}{5} \approx 1.28$
The standard deviation is $s \approx 1.13$.
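Both sample-variance formulas can be compared side by side in Python. The function names are illustrative; the data are the auto sales figures of Example 12.

```python
import math


def sample_variance_conceptual(values):
    """s^2 = sum((X - mean)^2) / (n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def sample_variance_computational(values):
    """s^2 = (sum(X^2) - (sum(X))^2 / n) / (n - 1)."""
    n = len(values)
    return (sum(x * x for x in values) - sum(values) ** 2 / n) / (n - 1)


sales = [11.2, 11.9, 12.0, 12.8, 13.4, 14.3]  # Example 12
v1 = sample_variance_conceptual(sales)
v2 = sample_variance_computational(sales)

print(round(v1, 2), round(v2, 2))  # 1.28 1.28
print(round(math.sqrt(v1), 2))     # 1.13
```

The two formulas are algebraically equivalent, so any difference between `v1` and `v2` is floating-point rounding only.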
51
Example 13
Suppose the numbers of minutes you spent traveling to school on the last 7 days are 9, 12, 9, 15, 10, 11, 15. Find the variance of the number of minutes using both formulas.
52
Variance and Standard Deviation for Grouped Data
$s^2 = \frac{\sum f \cdot X_m^2 - (\sum f \cdot X_m)^2 / n}{n - 1}$
f: class frequency
$X_m$: class midpoint (class mark)
n: total number of frequencies
53
Example 14
Find the variance and the standard deviation for the frequency distribution of the data representing the number of miles that 20 runners ran during one week.
54
Example 14 (Cont.)
Class        Frequency
5.5-10.5     1
10.5-15.5    2
15.5-20.5    3
20.5-25.5    5
25.5-30.5    4
30.5-35.5    3
35.5-40.5    2
55
Example 14 (Cont.)
Class boundary    Freq. f    Midpoint X_m    f·X_m    f·X_m²
5.5-10.5          1          8               8        64
10.5-15.5         2          13              26       338
15.5-20.5         3          18              54       972
20.5-25.5         5          23              115      2645
25.5-30.5         4          28              112      3136
30.5-35.5         3          33              99       3267
35.5-40.5         2          38              76       2888
Total                                        490      13310
56
Example 14 (Cont.)
Hence, the variance is
$s^2 = \frac{13310 - (490)^2 / 20}{20 - 1} = \frac{1305}{19} \approx 68.7$
and the standard deviation is $s = \sqrt{68.7} \approx 8.3$.
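The grouped-data variance of Example 14 can be verified with a short Python sketch. The function name `grouped_sample_variance` is an illustrative choice.

```python
import math


def grouped_sample_variance(midpoints, freqs):
    """s^2 = (sum(f * Xm^2) - (sum(f * Xm))^2 / n) / (n - 1)."""
    n = sum(freqs)
    sum_fx = sum(f * xm for f, xm in zip(freqs, midpoints))
    sum_fx2 = sum(f * xm ** 2 for f, xm in zip(freqs, midpoints))
    return (sum_fx2 - sum_fx ** 2 / n) / (n - 1)


# Example 14: class midpoints and frequencies
midpoints = [8, 13, 18, 23, 28, 33, 38]
freqs = [1, 2, 3, 5, 4, 3, 2]

variance = grouped_sample_variance(midpoints, freqs)
print(round(variance, 1), round(math.sqrt(variance), 1))  # 68.7 8.3
```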
57
Coefficient of Variation
The coefficient of variation is the standard deviation divided by the mean. The result is expressed as a percentage.
For samples: $CVar = \frac{s}{\bar{X}} \cdot 100\%$
For populations: $CVar = \frac{\sigma}{\mu} \cdot 100\%$
58
Example 15
The mean of the number of sales of cars over a 3-month period is 87, and the standard deviation is 5. The mean of the commissions is $5225, and the standard deviation is $773. Compare the variations of the two.
59
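Example 15 (Cont.)
Sales: $CVar = \frac{s}{\bar{X}} \cdot 100\% = \frac{5}{87} \cdot 100\% \approx 5.7\%$
Commissions: $CVar = \frac{s}{\bar{X}} \cdot 100\% = \frac{773}{5225} \cdot 100\% \approx 14.8\%$
Since the coefficient of variation is larger for commissions, the commissions are more variable than the sales.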
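The comparison in Example 15 can be sketched in Python. The function name `coefficient_of_variation` is an illustrative choice.

```python
def coefficient_of_variation(std_dev, mean):
    """Coefficient of variation as a percentage: (s / mean) * 100."""
    return std_dev / mean * 100


cv_sales = coefficient_of_variation(5, 87)
cv_commissions = coefficient_of_variation(773, 5225)

print(round(cv_sales, 1), round(cv_commissions, 1))  # 5.7 14.8
print(cv_commissions > cv_sales)                     # True
```

Because the result is unit-free, it allows a comparison between a count (cars sold) and a dollar amount (commissions).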
60
Example 16
The mean for the number of pages of a sample of women’s fitness magazines is 132, with a variance of 23; the mean for the number of advertisements of a sample of women’s fitness magazines is 182, with a variance of 62. Compare the variations.
(Answer: 3.6% for pages, 4.3% for advertisements)
61
Chebyshev’s theorem
The proportion of values from a data set that will fall within k standard deviations of the mean will be at least $1 - 1/k^2$, where k is a number greater than 1 (k is not necessarily an integer).
62
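Chebyshev’s theorem
(Figure: number line marked at $\bar{X} - 3s$, $\bar{X} - 2s$, $\bar{X}$, $\bar{X} + 2s$, $\bar{X} + 3s$.)
At least 75% of the values fall within 2 standard deviations of the mean.
At least 88.89% of the values fall within 3 standard deviations of the mean.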
63
Example 17
The mean price of houses in a certain neighborhood is $50,000, and the standard deviation is $10,000. Find the price range for which at least 75% of the houses will sell.
64
Example 17 (Cont.)
Chebyshev’s theorem states that at least three-fourths, or 75%, of the data values will fall within 2 standard deviations of the mean. Thus,
$50,000 + 2($10,000) = $70,000
and
$50,000 - 2($10,000) = $30,000
Hence, at least 75% of all homes sold in the area will have a price between $30,000 and $70,000.
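Chebyshev's bound and the resulting interval can be sketched in Python. The function names `chebyshev_min_proportion` and `chebyshev_interval` are illustrative choices.

```python
def chebyshev_min_proportion(k):
    """Minimum proportion of values within k standard deviations: 1 - 1/k^2."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k ** 2


def chebyshev_interval(mean, std_dev, k):
    """Interval mean - k*s to mean + k*s."""
    return mean - k * std_dev, mean + k * std_dev


print(chebyshev_min_proportion(2))            # 0.75
print(round(chebyshev_min_proportion(3), 4))  # 0.8889
print(chebyshev_interval(50000, 10000, 2))    # (30000, 70000)
```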
65
Example 18
A survey of local companies found that the mean amount of travel allowance for executives was $0.25 per mile. The standard deviation was $0.02. Using Chebyshev’s theorem, find the minimum percentage of the data values that will fall between $0.20 and $0.30.
66
The Empirical (Normal) Rule
Chebyshev’s theorem applies to any distribution regardless of its shape. However, when a distribution is bell-shaped (or what is called normal), the following statements, which make up the empirical rule, are true.
67
The Empirical (Normal) Rule
Approximately 68% of the data values will fall within 1 standard deviation of the mean.
Approximately 95% of the data values will fall within 2 standard deviations of the mean.
Approximately 99.7% (almost all) of the data values will fall within 3 standard deviations of the mean.
68
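The Empirical (Normal) Rule
(Figure: bell-shaped curve with the horizontal axis marked at $\bar{X} - 3s$, $\bar{X} - 2s$, $\bar{X} - 1s$, $\bar{X}$, $\bar{X} + 1s$, $\bar{X} + 2s$, $\bar{X} + 3s$; about 68% of the values lie within 1s, 95% within 2s, and 99.7% within 3s of the mean.)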
69
Measures of Position
Standard Scores
A z score or standard score for a value is obtained by subtracting the mean from the value and dividing the result by the standard deviation. The symbol for a standard score is z.
$z = \frac{\text{value} - \text{mean}}{\text{standard deviation}}$
70
Measures of Position
Standard Scores
The z score represents the number of standard deviations that a data value falls above or below the mean.
71
Example 19 A student scored 65 on a calculus test that had a mean of 50 and a standard deviation of 10; she scored 30 on a history test with a mean of 25 and a standard deviation of 5. Compare her relative position on the two tests.
72
Example 19 (Cont.)
For calculus, the z score is
$z = \frac{X - \bar{X}}{s} = \frac{65 - 50}{10} = 1.5$
For history, the z score is
$z = \frac{X - \bar{X}}{s} = \frac{30 - 25}{5} = 1.0$
Since the z score for calculus is larger, her relative position in the calculus class is higher than her relative position in the history class.
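The comparison in Example 19 can be sketched in Python. The function name `z_score` is an illustrative choice.

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations a value lies above or below the mean."""
    return (value - mean) / std_dev


z_calculus = z_score(65, 50, 10)
z_history = z_score(30, 25, 5)

print(z_calculus, z_history)   # 1.5 1.0
print(z_calculus > z_history)  # True
```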
73
Percentiles
Percentiles divide the data set into 100 equal groups.
There are several mathematical methods for computing percentiles for data. These methods can be used to find approximate percentile rank of a data value or to find a data value corresponding to a given percentile.
74
Find a Percentile Rank Corresponding to a Value
The percentile corresponding to a given value X is computed by using the following formula:
$\text{Percentile} = \frac{(\text{number of values below } X) + 0.5}{\text{total number of values}} \cdot 100\%$
75
Example 20
A teacher gives a 20-point test to 10 students. The scores are shown here. Find the percentile rank of a score of 12.
18, 15, 12, 6, 8, 2, 3, 5, 20, 10
76
Example 20 (Cont.)
Arrange the data in order from lowest to highest:
2, 3, 5, 6, 8, 10, 12, 15, 18, 20
Six values lie below 12, so
$\text{Percentile} = \frac{6 + 0.5}{10} \cdot 100\% = 65\text{th percentile}$
Thus, a student whose score was 12 did better than 65% of the class.
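The percentile-rank formula can be sketched in Python. The function name `percentile_rank` is an illustrative choice; the data are the test scores of Example 20.

```python
def percentile_rank(values, x):
    """Percentile rank of x: (number of values below x + 0.5) / n * 100."""
    below = sum(1 for v in values if v < x)
    return 100 * (below + 0.5) / len(values)


scores = [18, 15, 12, 6, 8, 2, 3, 5, 20, 10]  # Example 20

print(percentile_rank(scores, 12))  # 65.0
```

Sorting is not required here, because only the count of values below X enters the formula.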
77
Finding a Data Value Corresponding to a Given Percentile
o Arrange the data in order from lowest to highest.
o Compute c = (n·p)/100, where n is the total number of observations and p is the percentile.
o If c is not a whole number, round up to the next whole number. Starting at the lowest value, count over to the number that corresponds to the rounded-up value.
o If c is a whole number, use the value halfway between the cth and (c+1)th values when counting up from the lowest value.
78
Example 21
A teacher gives a 20-point test to 10 students. The scores are shown here. Find the value corresponding to the 25th percentile.
18, 15, 12, 6, 8, 2, 3, 5, 20, 10
79
Example 21 (Cont.)
o Arrange the data in order from lowest to highest:
2, 3, 5, 6, 8, 10, 12, 15, 18, 20
o n = 10, p = 25, so c = (10 × 25)/100 = 2.5.
o We round it up to get c = 3. Start at the lowest value and count over to the third value, which is 5. Hence, the value 5 corresponds to the 25th percentile.
80
Example 22
A teacher gives a 20-point test to 10 students. The scores are shown here. Find the value corresponding to the 60th percentile.
18, 15, 12, 6, 8, 2, 3, 5, 20, 10
81
Example 22 (Cont.)
o Arrange the data in order from smallest to largest:
2, 3, 5, 6, 8, 10, 12, 15, 18, 20
o n = 10, p = 60, so c = (10 × 60)/100 = 6.
o Since c is a whole number, we use the value halfway between the 6th and 7th values when counting up from the lowest value.
o The 60th percentile is (10 + 12)/2 = 11.
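The counting procedure used in Examples 21 and 22 can be sketched in Python. The function name `value_at_percentile` is an illustrative choice, and the restriction to 0 < p < 100 is an assumption of this sketch so that the cth and (c+1)th values always exist.

```python
import math


def value_at_percentile(values, p):
    """Data value at the p-th percentile, using the c = n*p/100 rule."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    ordered = sorted(values)          # arrange from lowest to highest
    c = len(ordered) * p / 100
    if c != int(c):                   # not a whole number: round up
        return ordered[math.ceil(c) - 1]
    c = int(c)                        # whole number: halfway between the
    return (ordered[c - 1] + ordered[c]) / 2  # c-th and (c+1)-th values


scores = [18, 15, 12, 6, 8, 2, 3, 5, 20, 10]

print(value_at_percentile(scores, 25))  # 5    (Example 21)
print(value_at_percentile(scores, 60))  # 11.0 (Example 22)
```

Other textbooks and software packages use different interpolation rules, so their percentiles may differ slightly from the values produced by this rule.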
82
Quartiles
Quartiles divide the distribution into 4 equal groups, separated by Q1, Q2, and Q3.
(Figure: the data from the lowest value L to the highest value H, split by Q1, Q2, and Q3 into four groups of 25% each.)
83
Quartiles
Quartiles can be computed using the formula for computing percentiles.
o 1st quartile corresponds to the 25th percentile.
o 2nd quartile corresponds to the 50th percentile.
o 3rd quartile corresponds to the 75th percentile.
2nd quartile = 50th percentile = median
84
Example 23
Find the first, second, and third quartiles for the data set 15, 13, 6, 5, 12, 50, 22, 18.
Arrange the data in order from smallest to largest:
5, 6, 12, 13, 15, 18, 22, 50
85
Example 23 (Cont.)
o First quartile = 25th percentile.
c = (8 × 25)/100 = 2
Hence, the first quartile is the value halfway between the 2nd and 3rd values. That is, Q1 = (6 + 12)/2 = 9.
o Second quartile = 50th percentile.
c = (8 × 50)/100 = 4
Hence, Q2 = (4th value + 5th value)/2 = (13 + 15)/2 = 14.
86
Example 23 (Cont.)
o Third quartile = 75th percentile.
c = (8 × 75)/100 = 6
Hence, Q3 = (6th value + 7th value)/2 = (18 + 22)/2 = 20.
87
Interquartile Range, Quartile Deviation and Midquartile Range
o Interquartile range: IQR = Q3 - Q1
o Quartile deviation: QD = (Q3 - Q1)/2
o The quartile deviation is also referred to as the semi-interquartile range.
o Midquartile range: (Q3 + Q1)/2
88
Quartiles of Data Grouped into a Freq. Dist.
o First quartile: $Q_1 = LB + \frac{n/4 - CF}{f} \cdot w$
o Second quartile (median): $Q_2 = LB + \frac{n/2 - CF}{f} \cdot w$
o Third quartile: $Q_3 = LB + \frac{3n/4 - CF}{f} \cdot w$
Example 24
The office manager of the Mallard Glass Co. is investigating the ages in months of the company’s PCs currently in use. The ages of 30 units selected at random were organized into a frequency distribution. Compute the quartile deviation.
89
90
Example 24 (Cont.)
Age (in months)    # of PCs
20-24              3
25-29              5
30-34              10
35-39              7
40-44              4
45-49              1
91
Example 24 (Cont.)
Age (in months)    # of PCs    Cumu. Freq.
20-24              3           3
25-29              5           8
30-34              10          18
35-39              7           25
40-44              4           29
45-49              1           30
92
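Example 24 (Cont.)
n/4 = 7.5 falls in the class 25-29 (lower boundary 24.5), so
$Q_1 = 24.5 + \frac{30/4 - 3}{5} \cdot 5 = 29$
3n/4 = 22.5 falls in the class 35-39 (lower boundary 34.5), so
$Q_3 = 34.5 + \frac{3 \cdot 30/4 - 18}{7} \cdot 5 \approx 37.71$
Hence, QD = (37.71 - 29)/2 ≈ 4.36 months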
93
Example 25
The weekly income of a sample of 60 part time employees of a fast-food restaurant chain was organized into the following frequency distribution. Compute the standard deviation and quartile deviation.
94
Example 25 (Cont.)
Weekly Incomes    Number of Employees
100-149 5
150-199 9
200-249 20
250-299 18
300-349 5
350-399 3
95
Outliers
An outlier is an extremely high or an extremely low data value when compared with the rest of the data values.
An outlier can strongly affect the mean and standard deviation of a variable.
There are several ways to check a data set for outliers. One of them is as follows:
96
Outliers (Cont.)
Step 1: Arrange the data in order and find Q1 and Q3.
Step 2: Find the interquartile range: IQR = Q3 - Q1.
Step 3: Multiply the IQR by 1.5.
Step 4: Check the data set for any data value that is smaller than Q1 - 1.5·IQR or larger than Q3 + 1.5·IQR.
97
Outliers: Example 26
Check the following data set for outliers.
5, 6, 12, 13, 15, 18, 22, 50
We found Q1 = 9 and Q3 = 20.
Interquartile range: IQR = 20 - 9 = 11
Compute the dividing points:
Q1 - 1.5·IQR = 9 - 1.5(11) = -7.5
Q3 + 1.5·IQR = 20 + 1.5(11) = 36.5
The data value of 50 is greater than the upper dividing point of 36.5. So, the data value of 50 is considered an outlier.
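The outlier check can be sketched in Python. The function names `iqr_fences` and `find_outliers` are illustrative choices; the quartiles are passed in directly, using the values Q1 = 9 and Q3 = 20 found for this data set.

```python
def iqr_fences(q1, q3, factor=1.5):
    """Dividing points: Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    iqr = q3 - q1
    return q1 - factor * iqr, q3 + factor * iqr


def find_outliers(values, q1, q3):
    """Values below the lower dividing point or above the upper one."""
    low, high = iqr_fences(q1, q3)
    return [v for v in values if v < low or v > high]


data = [5, 6, 12, 13, 15, 18, 22, 50]  # Example 26

print(iqr_fences(9, 20))           # (-7.5, 36.5)
print(find_outliers(data, 9, 20))  # [50]
```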
98
Exploratory Data Analysis
o In exploratory data analysis (EDA) the data are presented graphically using a box plot (sometimes called a box-and-whisker plot).
o The purpose of exploratory data analysis is to examine data to find out what information can be discovered about the data, such as the center and the spread.
o EDA was developed by John Tukey.
99
Exploratory Data Analysis
A box plot can be used to graphically represent the data set. These plots involve five specific values:
o The lowest value (i.e., minimum)
o Q1
o Median (Q2)
o Q3
o The highest value (i.e., maximum)
100
Example 27 (Box-plot)
A stockbroker recorded the number of clients she saw each day over an 11-day period. The data are shown below. Construct a box plot for the data.
33, 38, 43, 30, 29, 40, 51, 27, 42, 23, 31
101
Example 27 (Box-plot)
o Arrange the data in order from lowest to highest:
23, 27, 29, 30, 31, 33, 38, 40, 42, 43, 51
o We obtain: the lowest value = 23, Q1 = 29, median Q2 = 33, Q3 = 42, and the highest value = 51.
(Figure: box plot on a scale from 20 to 50, with whiskers ending at 23 and 51 and the box marked at 29, 33, and 42.)
102
THE END!