Copyright Reserved 1
Chapter 3 - Descriptive stats: Numerical measures
3.1 Measures of Location
Mean
Perhaps the most important measure of location is the mean (average).
Sample mean: $\bar{x} = \frac{\sum x_i}{n}$
where $x_i$ = value of the ith observation and n = sample size
Example:
The number of students per class is as follows:
46 54 42 46 32
The mean is: $\bar{x} = \frac{\sum x_i}{n} = \frac{46 + 54 + 42 + 46 + 32}{5} = \frac{220}{5} = 44$
Median
The median is another measure of location for a variable.
The median is the value in the middle when the data are arranged in ascending order (smallest
to largest value).
Computation:
o Arrange the data in ascending order (smallest to largest value)
o For an odd number of observations, the median is the middle value
o For an even number of observations, the median is the average of the middle 2 values
Example:
The number of students per class is as follows:
46 54 42 46 32
The median is:
Arrange the values from smallest to largest:
32 42 46 46 54
Middle value = Median = 46
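The two calculations above can be checked with a short Python sketch. It uses the standard library `statistics` module; the variable names are only illustrative.

```python
from statistics import mean, median

# Number of students per class (data from the example above)
class_sizes = [46, 54, 42, 46, 32]

sample_mean = mean(class_sizes)      # sum of the values divided by n: 220 / 5
sample_median = median(class_sizes)  # middle value of the sorted data

print("Sorted data:", sorted(class_sizes))
print("Mean:", sample_mean)
print("Median:", sample_median)
```

For an even number of observations, `median` averages the two middle values, which matches the computation rule given above.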
Example
The yearly income (R1000’s) of 8 workers is as follows:
95 102 105 120
125 150 220 450
1. Calculate the mean and the median.
Answers:
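Mean/average:
$\bar{x} = \frac{\sum x_i}{n} = \frac{95 + 102 + 105 + 120 + 125 + 150 + 220 + 450}{8} = \frac{1367}{8} = 170.875$ (i.e. R170 875)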
Median:
For the median, we arrange the values from smallest to largest:
95 102 105 120 125 150 220 450
Median = $\frac{120 + 125}{2} = 122.5$ (the average of the middle 2 values, i.e. R122 500)
Although the mean is the more commonly used measure of central location, in some
situations the median is preferred.
The mean is influenced by extremely small and large data values, while the median is not
influenced by extreme values.
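A minimal sketch of this effect, using the yearly income data from the example above. Dropping the largest income (450) is a hypothetical step added only for illustration; it is not part of the original example.

```python
from statistics import mean, median

# Yearly income (R1000's) of the 8 workers from the example above
incomes = [95, 102, 105, 120, 125, 150, 220, 450]

mean_all = mean(incomes)
median_all = median(incomes)

# Hypothetical comparison: the same data without the extreme value 450
incomes_trimmed = [x for x in incomes if x != 450]
mean_trimmed = mean(incomes_trimmed)
median_trimmed = median(incomes_trimmed)

print("With 450:    mean =", mean_all, " median =", median_all)
print("Without 450: mean =", mean_trimmed, " median =", median_trimmed)
```

The mean changes from 170.875 to 131, while the median only moves from 122.5 to 120, which illustrates why the median is preferred when extreme values are present.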
Mode
Definition:
The mode is the value that occurs with greatest frequency.
Example:
The number of students per class is as follows:
46 54 42 46 32
The mode is: 46
Note:
Bi-modal:
If the data have exactly 2 modes.
Example of a bi-modal data set: 46 54 42 46 32 54
Multimodal:
If data have more than 2 modes.
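As a sketch, the modes of the two example data sets can be found with `statistics.multimode` (available from Python 3.8), which returns every value that occurs with the greatest frequency.

```python
from statistics import multimode

class_sizes = [46, 54, 42, 46, 32]          # one mode
bimodal_sizes = [46, 54, 42, 46, 32, 54]    # bi-modal example from the note above

modes_single = multimode(class_sizes)
modes_bimodal = multimode(bimodal_sizes)

print("Mode(s) of the class sizes:", modes_single)
print("Mode(s) of the bi-modal data set:", modes_bimodal)
```

A result with exactly 2 values indicates a bi-modal data set; more than 2 values indicates a multimodal one.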
Example:
Give the appropriate measure of location for the following data:
Soft drink      Frequency
Coke Classic    19
Diet Coke        8
Dr. Pepper       5
Pepsi-Cola      13
Sprite           5
The mode is: Coke Classic
For categorical data such as this, the mean and the median cannot be computed, so the mode is the only appropriate measure of location.
Using Microsoft Excel 2007 to compute the mean, median and mode
[Figures: Excel formula worksheet and value worksheet]
Percentiles
Definition: The pth percentile is a value such that at least p percent of the observations are less than or equal to this value and at least (100 – p) percent of the observations are greater than or equal to this value.
Calculating the pth percentile:
• Arrange the data in ascending order (smallest to largest value)
• Compute an index i: $i = \left(\frac{p}{100}\right) n$
where p = percentile of interest
n = sample size
(a) If i is not an integer, round up. The next integer greater than i gives the position of the pth percentile.
(b) If i is an integer, the pth percentile is the average of the values in positions i and (i + 1)
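The steps above can be sketched in Python as follows. The function name `percentile` is hypothetical, p is assumed to be a whole number between 1 and 99, and the yearly income data from Section 3.1 is reused as input. Note that library routines such as `statistics.quantiles` use other interpolation rules and may give slightly different answers.

```python
import math

def percentile(data, p):
    """Return the p-th percentile using the index rule i = (p / 100) * n.

    Assumes p is an integer with 0 < p < 100.
    """
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    values = sorted(data)            # Step 1: ascending order
    n = len(values)
    if (p * n) % 100 == 0:           # Step 2(b): i is an integer
        i = p * n // 100
        # average of the values in positions i and i + 1 (1-based positions)
        return (values[i - 1] + values[i]) / 2
    i = math.ceil(p * n / 100)       # Step 2(a): round up
    return values[i - 1]

# Yearly income (R1000's) of 8 workers, from the example in Section 3.1
incomes = [95, 102, 105, 120, 125, 150, 220, 450]

p50 = percentile(incomes, 50)   # i = 4, so average of the 4th and 5th values
p85 = percentile(incomes, 85)   # i = 6.8, round up to the 7th value
print("P50 =", p50, " P85 =", p85)
```

The 50th percentile obtained this way equals the median calculated earlier for the same data (122.5).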
Example:
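Determine the 85th percentile ($P_{85}$) for the starting salary data:
Step 1: Arrange the data in ascending order
Step 2: $i = \left(\frac{p}{100}\right) n = \left(\frac{85}{100}\right)(12) = 10.2$
Step 3: Since i is not an integer, round up. $P_{85}$ is the value in the 11th position (after being arranged in ascending order): $P_{85} = 3\,730$.
Interpretation: 85% of the graduates have a starting salary of R3 730 or less.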
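Determine the 33rd percentile ($P_{33}$) for the starting salary:
Step 1: Arrange the data in ascending order
Step 2: $i = \left(\frac{p}{100}\right) n = \left(\frac{33}{100}\right)(12) = 3.96$
Step 3: Since i is not an integer, round up. $P_{33}$ is the value in the 4th position (after being arranged in ascending order): $P_{33} = 3\,480$.
Interpretation: 33% of the graduates have a starting salary of R3 480 or less.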
Determine the median ($P_{50}$) for the starting salary:
Step 1: Arrange the data in ascending order
Step 2: $i = \left(\frac{p}{100}\right) n = \left(\frac{50}{100}\right)(12) = 6$, so i + 1 = 7
Step 3: Since i is an integer, the median is the average of the values in the 6th and 7th positions: Median = $P_{50} = 3\,505$
Interpretation: 50% of the graduates have a starting salary of R3 505 or less.
Determine the 25th percentile ($Q_1$) for the starting salary:
Step 1: Arrange the data in ascending order
Step 2: $i = \left(\frac{p}{100}\right) n = \left(\frac{25}{100}\right)(12) = 3$, so i + 1 = 4
Step 3: Since i is an integer, $Q_1$ is the average of the values in the 3rd and 4th positions: $Q_1 = 3\,465$
Interpretation: 25% of the graduates have a starting salary of R3 465 or less.
Determine the 75th percentile ($Q_3$) for the starting salary:
Step 1: Arrange the data in ascending order
Step 2: $i = \left(\frac{p}{100}\right) n = \left(\frac{75}{100}\right)(12) = 9$, so i + 1 = 10
Step 3: Since i is an integer, $Q_3$ is the average of the values in the 9th and 10th positions: $Q_3 = 3\,600$
Interpretation: 75% of the graduates have a starting salary of R3 600 or less.
Quartiles
$Q_1$ = First quartile, 25th percentile
$Q_2$ = Second quartile, 50th percentile, median
$Q_3$ = Third quartile, 75th percentile
3.2 Measures of variability
Range
Range = Largest Value – Smallest Value
Example of the salary data:
Range = 3 925 – 3 310 = 615
Advantages:
o Easy to calculate
Disadvantages:
o It is based on only 2 data values: the Largest Value and the Smallest Value.
o It is unstable, because it is strongly influenced by extreme values.
Suppose one of the graduates received a starting salary of 10 000 per month. Then the range becomes:
Range = 10 000 – 3 310 = 6 690.
Interquartile Range - IQR
It is the range for the middle 50% of the data:
IQR = $Q_3 - Q_1$
Example of the salary data.
The interquartile range for the salary data is: IQR = 3 600 – 3 465 = 135
Advantages:
o Easy to interpret
o Is not influenced by extreme values
Disadvantages:
o It’s only based on the middle 50% of the data.
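The contrast between the range and the interquartile range can be sketched as follows. The helper `percentile` is a hypothetical function implementing the index rule from Section 3.1, the income data from Section 3.1 is reused, and the value 4 500 is an invented extreme value used only for illustration.

```python
import math

def percentile(data, p):
    """p-th percentile via the index rule i = (p / 100) * n (integer p, 0 < p < 100)."""
    values = sorted(data)
    n = len(values)
    if (p * n) % 100 == 0:                     # i is an integer: average positions i and i + 1
        i = p * n // 100
        return (values[i - 1] + values[i]) / 2
    return values[math.ceil(p * n / 100) - 1]  # otherwise round i up

def value_range(data):
    return max(data) - min(data)

def iqr(data):
    return percentile(data, 75) - percentile(data, 25)

incomes = [95, 102, 105, 120, 125, 150, 220, 450]
# Hypothetical variant: the largest income replaced by a far more extreme value
incomes_extreme = [95, 102, 105, 120, 125, 150, 220, 4500]

print("Range:", value_range(incomes), "->", value_range(incomes_extreme))
print("IQR:  ", iqr(incomes), "->", iqr(incomes_extreme))
```

In this sketch the range jumps from 355 to 4 405, while the IQR stays at 81.5, because the extreme value lies outside the middle 50% of the data.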
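Variance
The variance is a measure of variability that utilizes all the data.
The Sample Variance
$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
Standard Deviation
Sample Standard Deviation
$s = \sqrt{s^2}$ and therefore $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$
Example
Given: 46 54 42 46 32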
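Example: Calculate the standard deviation of the class sizes.
Number of students   Mean class size   Deviation about the mean   Squared deviation about the mean
in class ($x_i$)     ($\bar{x}$)       ($x_i - \bar{x}$)          ($(x_i - \bar{x})^2$)
46                   44                  2                          4
54                   44                 10                        100
42                   44                 -2                          4
46                   44                  2                          4
32                   44                -12                        144
$\sum (x_i - \bar{x}) = 0$ and $\sum (x_i - \bar{x})^2 = 256$
$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{256}{4} = 64$ and $s = \sqrt{64} = 8$
OR
$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{(46 - 44)^2 + (54 - 44)^2 + (42 - 44)^2 + (46 - 44)^2 + (32 - 44)^2}{5 - 1} = \frac{4 + 100 + 4 + 4 + 144}{4} = \frac{256}{4} = 64$
and $s = \sqrt{64} = 8$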
Interpretation:
The class sizes deviate from the average class size (44) by roughly 8 students on average.
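The class size calculation can be reproduced with a short sketch. It computes the squared deviations by hand and compares the result with `statistics.variance` and `statistics.stdev`, which both use the n – 1 divisor for sample data.

```python
from statistics import mean, stdev, variance

class_sizes = [46, 54, 42, 46, 32]

x_bar = mean(class_sizes)
deviations = [x - x_bar for x in class_sizes]     # deviations about the mean
squared = [d ** 2 for d in deviations]            # squared deviations about the mean

# Sample variance and standard deviation computed from the definition
s2_manual = sum(squared) / (len(class_sizes) - 1)
s_manual = s2_manual ** 0.5

print("Sum of deviations:", sum(deviations))
print("Sum of squared deviations:", sum(squared))
print("Manual:  s^2 =", s2_manual, " s =", s_manual)
print("Library: s^2 =", variance(class_sizes), " s =", stdev(class_sizes))
```

The deviations about the mean always sum to zero, which is why they are squared before being added.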
Coefficient of Variation
It is a relative measure of variability.
It measures the standard deviation relative to the mean.
Coefficient of Variation: $CV = \left(\frac{s}{\bar{x}} \times 100\right)\%$
The coefficient of variation tells us what percentage of the sample mean the sample standard deviation is.
Example:
The class test mark (out of 10) and the semester test mark (out of 50) of 5 students are investigated.
Class test (out of 10)               Semester test (out of 50)
4                                    13
5                                    20
7                                    25
6                                    32
8                                    40
Average of class test marks = 6      Average of semester test marks = 26
Variance of class test marks = 2.5   Variance of semester test marks = 109.5
Which test has the biggest relative variation? Calculate the relevant numerical measures.
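Coefficient of variation for the class test marks:
$CV = \frac{s}{\bar{x}} \times 100 = \frac{\sqrt{2.5}}{6} \times 100 = 26.35\%$
Coefficient of variation for the semester test marks:
$CV = \frac{s}{\bar{x}} \times 100 = \frac{\sqrt{109.5}}{26} \times 100 = 40.25\%$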
Therefore, the semester test has the biggest relative variation.
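A minimal sketch of the comparison above; the function name `coefficient_of_variation` is only illustrative.

```python
from statistics import mean, stdev

def coefficient_of_variation(data):
    """Sample standard deviation as a percentage of the sample mean."""
    return stdev(data) / mean(data) * 100

class_test = [4, 5, 7, 6, 8]            # marks out of 10
semester_test = [13, 20, 25, 32, 40]    # marks out of 50

cv_class = coefficient_of_variation(class_test)
cv_semester = coefficient_of_variation(semester_test)

print(f"Class test CV:    {cv_class:.2f}%")
print(f"Semester test CV: {cv_semester:.2f}%")
```

Because the coefficient of variation is unit-free, it allows tests marked out of different totals to be compared directly.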
Using Microsoft Excel 2007's Descriptive Statistics Tool
Self-study (see page 115)
3.3 Measures of Distribution Shape, Relative Location and Detecting Outliers
Distribution Shapes
Read through by yourself.
z-Scores
z-score: $z_i = \frac{x_i - \bar{x}}{s}$
The z-score is called the standardized value.
It can be interpreted as the number of standard deviations $x_i$ is from the mean $\bar{x}$.
Example:
z-scores of the class sizes dataset.
(We calculated the mean and standard deviation previously: $\bar{x} = 44$ and s = 8).
Number of students   Deviation about the mean   z-score
in class ($x_i$)     ($x_i - \bar{x}$)          ($\frac{x_i - \bar{x}}{s}$)
46                     2                         0.25
54                    10                         1.25
42                    -2                        -0.25
46                     2                         0.25
32                   -12                        -1.50
Interpretation:
54 is 1.25 standard deviations above the mean.
32 is 1.5 standard deviations below the mean.
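The z-scores of the class sizes can be computed with a short sketch such as the following (variable names are illustrative).

```python
from statistics import mean, stdev

class_sizes = [46, 54, 42, 46, 32]

x_bar = mean(class_sizes)   # 44
s = stdev(class_sizes)      # sample standard deviation, 8

# z-score: number of standard deviations each value lies from the mean
z_scores = [(x - x_bar) / s for x in class_sizes]

for x, z in zip(class_sizes, z_scores):
    side = "above" if z > 0 else "below"
    print(f"{x} is {abs(z):.2f} standard deviations {side} the mean")
```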
Example:
The Mathematics marks of 2 students are compared.
Student 1 75% (in School A)
Student 2 80% (in School B)
Which student has done better, relative to his school?
School   $\bar{x}$   $s^2$   s
A        55          64      8
B        80          144     12
Student 1: $z = \frac{x - \bar{x}}{s} = \frac{75 - 55}{8} = 2.5$
Student 1’s mark is 2.5 standard deviations above the mean.
Student 2: $z = \frac{x - \bar{x}}{s} = \frac{80 - 80}{12} = 0$
Student 2’s mark is exactly equal to the mean.
Conclusion:
Student 1 has done relatively better in his school than Student 2.
Chebyshev’s Theorem – Not for examination
Empirical Rule
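Empirical Rule: For data with a bell-shaped distribution:
Approximately 68% of the data values will be within 1 standard deviation of the mean.
Approximately 95% of the data values will be within 2 standard deviations of the mean.
Almost all (approximately 100%) of the data values will be within 3 standard deviations of the mean.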
Example of the application of the empirical rule:
Suppose IQ scores have a bell-shaped distribution with a mean of 100 and a standard deviation of 15.
a) What percentage of people should have an IQ score between 85 and 115? Answer = 68%
b) What percentage of people should have an IQ score between 70 and 130? Answer = 95%
c) What percentage of people should have an IQ score of more than 130? Answer = 2.5%
100% – 95% = 5% lies outside the interval from 70 to 130, and half of this lies above 130: $\frac{5\%}{2} = 2.5\%$
d) The 16th percentile ($P_{16}$) is equal to:
100% – 68% = 32% and $\frac{32\%}{2} = 16\%$. Therefore, $P_{16} = 85$.
e) The 84th percentile ($P_{84}$) is equal to:
16% + 68% = 84%. Therefore, $P_{84} = 115$.
f) Is a person with an IQ score of 160 seen as an outlier?
Yes, since approximately 100% of the values are between 55 and 145, an IQ score of 160 is seen as an outlier.
OR
$z = \frac{160 - 100}{15} = 4 > 3$ (see the next section on outliers).
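As a rough check of the answers above, the sketch below assumes that IQ scores follow an exact normal distribution, which is one specific bell-shaped model and an assumption not made in the example itself. It uses `statistics.NormalDist` from the Python standard library.

```python
from statistics import NormalDist

# Assumption: IQ scores are normally distributed with mean 100 and std dev 15
iq = NormalDist(mu=100, sigma=15)

within_1 = iq.cdf(115) - iq.cdf(85)    # between 85 and 115
within_2 = iq.cdf(130) - iq.cdf(70)    # between 70 and 130
above_130 = 1 - iq.cdf(130)            # more than 130
z_160 = (160 - iq.mean) / iq.stdev     # z-score of an IQ of 160

print(f"Within 1 std dev: {within_1:.1%}")
print(f"Within 2 std dev: {within_2:.1%}")
print(f"Above 130:        {above_130:.1%}")
print(f"z-score of 160:   {z_160}")
```

Under this assumption the exact percentages differ slightly from the rounded figures of the empirical rule (for example about 2.3% rather than 2.5% above 130), which is why the rule is stated with "approximately".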
Detecting Outliers
Sometimes a data set will have one or more observations with unusually large or unusually
small values.
Extreme values are called outliers.
Standardized values (z-scores) can be used to identify outliers.
In the case of a bell-shaped distribution, the following rule can be applied:
Since almost all of the data will be within 3 standard deviations of the mean, we recommend treating any data value with a z-score less than –3 or greater than +3 as an outlier.
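The rule above can be sketched as a small function. The function name `z_score_outliers` and the 20 scores below are hypothetical and were made up purely for illustration; they do not come from the notes.

```python
from statistics import mean, stdev

def z_score_outliers(data, threshold=3.0):
    """Return the values whose z-score is below -threshold or above +threshold."""
    x_bar = mean(data)
    s = stdev(data)
    return [x for x in data if abs((x - x_bar) / s) > threshold]

# Hypothetical data set: 19 scores close to 100 and one unusually large score
scores = [98, 102, 95, 105, 100, 97, 103, 99, 101, 96,
          104, 100, 98, 102, 99, 101, 100, 97, 103, 160]

outliers = z_score_outliers(scores)
print("Outliers:", outliers)
```

Note that the mean and standard deviation used here are themselves affected by the extreme value, so the rule is best treated as a screening tool rather than a definitive test.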
3.4 Exploratory Data Analysis
Five-Number Summary
The following 5 numbers are used to summarize the data:
1. Smallest Value
2. First Quartile ($Q_1$)
3. Second Quartile ($Q_2$, the median)
4. Third Quartile ($Q_3$)
5. Largest Value
The five-number summary of the salary data is:
Smallest value = 3310
$Q_1$ = 3465
$Q_2$ = 3505 (Median)
$Q_3$ = 3600
Largest value = 3925
(These values have been calculated previously).
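A five-number summary can be assembled with a short sketch like the one below. It is applied to the yearly income data from Section 3.1 rather than the salary data; `percentile` and `five_number_summary` are hypothetical helper names, with `percentile` following the index rule from Section 3.1.

```python
import math

def percentile(data, p):
    """p-th percentile via the index rule i = (p / 100) * n (integer p, 0 < p < 100)."""
    values = sorted(data)
    n = len(values)
    if (p * n) % 100 == 0:                     # i is an integer: average positions i and i + 1
        i = p * n // 100
        return (values[i - 1] + values[i]) / 2
    return values[math.ceil(p * n / 100) - 1]  # otherwise round i up

def five_number_summary(data):
    """Smallest value, Q1, Q2 (median), Q3 and largest value."""
    return (min(data), percentile(data, 25), percentile(data, 50),
            percentile(data, 75), max(data))

# Yearly income (R1000's) of 8 workers, from the example in Section 3.1
incomes = [95, 102, 105, 120, 125, 150, 220, 450]

summary = five_number_summary(incomes)
for label, value in zip(["Smallest", "Q1", "Q2 (median)", "Q3", "Largest"], summary):
    print(f"{label}: {value}")
```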
Box Plot
A box plot is a graphical summary of data that is based on a five-number summary.
A box plot provides another way to identify outliers.