McGraw-Hill/Irwin Copyright © 2009 by The McGraw-Hill Companies, Inc. All Rights Reserved. Chapter 3 Descriptive Statistics: Numerical Methods
Jan 03, 2016
3-2
Descriptive Statistics
3.1 Describing Central Tendency
3.2 Measures of Variation
3.3 Percentiles, Quartiles and Box-and-Whiskers Displays
3.4 Covariance, Correlation, and the Least Squares Line
3.5 Weighted Means and Grouped Data (Optional)
3.6 The Geometric Mean (Optional)
3-3
Describing Central Tendency
• In addition to describing the shape of a distribution, want to describe the data set’s central tendency
– A measure of central tendency represents the center or middle of the data
3-4
Parameters and Statistics
• A population parameter is a number calculated from all the population measurements that describes some aspect of the population
• A sample statistic is a number calculated using the sample measurements that describes some aspect of the sample
3-5
Measures of Central Tendency
Mean, µ: The average or expected value
Median, Md: The value of the middle point of the ordered measurements
Mode, Mo: The most frequent value
3-6
The Mean
Population: X1, X2, …, XN
Population Mean
$$\mu = \frac{\sum_{i=1}^{N} X_i}{N}$$
Sample: x1, x2, …, xn
Sample Mean
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
3-7
The Sample Mean
For a sample of size n, the sample mean is defined as
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$
and is a point estimate of the population mean µ
• It is the value to expect, on average and in the long run
3-8
Example: Car Mileage Case
• Example 3.1: Sample mean for first five car mileages from Table 3.1
30.8, 31.7, 30.1, 31.6, 32.1
$$\bar{x} = \frac{\sum_{i=1}^{5} x_i}{5} = \frac{x_1 + x_2 + x_3 + x_4 + x_5}{5} = \frac{30.8 + 31.7 + 30.1 + 31.6 + 32.1}{5} = \frac{156.3}{5} = 31.26$$
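The arithmetic in Example 3.1 can be checked with a few lines of Python. This is only a minimal sketch: the names `mileages` and `sample_mean` are illustrative and do not come from the slides.

```python
# First five car mileages quoted in Example 3.1.
mileages = [30.8, 31.7, 30.1, 31.6, 32.1]


def sample_mean(values):
    """Return the sum of the measurements divided by how many there are."""
    return sum(values) / len(values)


x_bar = sample_mean(mileages)
print(round(x_bar, 2))  # 31.26
```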
3-9
The Median
The median Md is a value such that 50% of all measurements, after having been arranged in numerical order, lie above (or below) it
1. If the number of measurements is odd, the median is the middlemost measurement in the ordering
2. If the number of measurements is even, the median is the average of the two middlemost measurements in the ordering
3-10
Example: Car Mileage Case
• Example 3.1: First five observations from Table 3.1:
30.8, 31.7, 30.1, 31.6, 32.1
• In order: 30.1, 30.8, 31.6, 31.7, 32.1
• The number of measurements is odd, so the median is the middle one, 31.6
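The odd and even rules for the median can be sketched as below. The function name `median` is an assumption for illustration. The even case reuses only the first four mileages, which is not an example taken from the slides.

```python
def median(values):
    """Middle value of the ordered data, or the average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the middlemost measurement.
        return ordered[mid]
    # Even count: average of the two middlemost measurements.
    return (ordered[mid - 1] + ordered[mid]) / 2


mileages = [30.8, 31.7, 30.1, 31.6, 32.1]
md_odd = median(mileages)       # five values, so the third in order
md_even = median(mileages[:4])  # four values, so average the second and third in order
print(md_odd, md_even)
```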
3-11
The Mode
The mode Mo of a population or sample of measurements is the measurement that occurs most frequently
– Modes are the values that are observed “most typically”
– Sometimes higher frequencies at two or more values
• If there are two modes, the data is bimodal
• If more than two modes, the data is multimodal
– When data are in classes, the class with the highest frequency is the modal class
• The tallest box in the histogram
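A small sketch of finding the mode or modes by counting frequencies. The two data sets are hypothetical and chosen only to show a single mode and a bimodal case. `modes` is an illustrative name.

```python
from collections import Counter


def modes(values):
    """Return every value that occurs with the highest frequency, in sorted order."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)


one_mode = modes([3, 4, 4, 5, 5, 5, 6])   # hypothetical data with a single mode
two_modes = modes([1, 2, 2, 3, 4, 4, 5])  # hypothetical bimodal data
print(one_mode, two_modes)  # [5] [2, 4]
```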
3-12
Relationships Among Mean, Median, and Mode
3-13
Measures of Variation
• Knowing the measures of central tendency is not enough
• Both of the distributions below have identical measures of central tendency
3-14
Measures of Variation
Range: Largest minus the smallest measurement
Variance: The average of the squared deviations of all the population measurements from the population mean
Standard Deviation: The square root of the variance
3-15
The Range
• Largest minus smallest
• Measures the interval spanned by all the data
• For Figure 3.13, largest is 5 and smallest is 3
• Range is 5 – 3 = 2 days
3-16
Population Variance and Standard Deviation
• The population variance (σ²) is the average of the squared deviations of the individual population measurements from the population mean (µ)
• The population standard deviation (σ) is the positive square root of the population variance
3-17
Variance
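• For a population of size N, the population variance σ² is:
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} = \frac{(x_1 - \mu)^2 + (x_2 - \mu)^2 + \cdots + (x_N - \mu)^2}{N}$$
• For a sample of size n, the sample variance s² is:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}$$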
3-18
Standard Deviation
• Population standard deviation (σ):
$$\sigma = \sqrt{\sigma^2}$$
• Sample standard deviation (s):
$$s = \sqrt{s^2}$$
3-19
Example: Chris’s Class Sizes This Semester
• Data points are: 60, 41, 15, 30, 34
• Mean is 36
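• Variance is:
$$\sigma^2 = \frac{(60-36)^2 + (41-36)^2 + (15-36)^2 + (30-36)^2 + (34-36)^2}{5} = \frac{576 + 25 + 441 + 36 + 4}{5} = \frac{1082}{5} = 216.4$$
• Standard deviation is:
$$\sigma = \sqrt{216.4} = 14.71$$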
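The class-size calculation treats the five classes as a whole population, so the divisor is N. A minimal sketch that reproduces it is given here, with `population_variance` as an illustrative name.

```python
import math

# Chris's five class sizes, treated as the whole population.
class_sizes = [60, 41, 15, 30, 34]


def population_variance(values):
    """Average squared deviation from the population mean (divide by N)."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)


sigma_sq = population_variance(class_sizes)
sigma = math.sqrt(sigma_sq)
print(round(sigma_sq, 1), round(sigma, 2))  # 216.4 14.71
```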
3-20
Example: Sample Variance and Standard Deviation
• Example 3.7: data for first five car mileages from Table 3.1 are 30.8, 31.7, 30.1, 31.6, 32.1
• The sample mean is 31.26
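$$s^2 = \frac{\sum_{i=1}^{5} (x_i - \bar{x})^2}{5 - 1} = \frac{(30.8-31.26)^2 + (31.7-31.26)^2 + (30.1-31.26)^2 + (31.6-31.26)^2 + (32.1-31.26)^2}{4} = \frac{2.572}{4} = 0.643$$
$$s = \sqrt{s^2} = \sqrt{0.643} = 0.8019$$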
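For a sample the divisor is n - 1 rather than n. The sketch below checks the Example 3.7 figures under that definition. `sample_variance` is an illustrative name, not one used in the slides.

```python
import math

# First five car mileages from Example 3.7.
mileages = [30.8, 31.7, 30.1, 31.6, 32.1]


def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


s_sq = sample_variance(mileages)
s = math.sqrt(s_sq)
print(round(s_sq, 3), round(s, 4))  # 0.643 0.8019
```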
3-21
The Empirical Rule for Normal Populations
• If a population has mean µ and standard deviation σ and is described by a normal curve, then
– 68.26% of the population measurements lie within one standard deviation of the mean: [µ-σ, µ+σ]
– 95.44% of the population measurements lie within two standard deviations of the mean: [µ-2σ, µ+2σ]
– 99.73% of the population measurements lie within three standard deviations of the mean: [µ-3σ, µ+3σ]
3-22
Chebyshev’s Theorem
• Let µ and σ be a population’s mean and standard deviation; then, for any value k > 1,
• At least 100(1 - 1/k²)% of the population measurements lie in the interval [µ-kσ, µ+kσ]
• Only practical for a non-mound-shaped population that is not very skewed
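To get a feel for the bound, the sketch below evaluates 100(1 - 1/k²) for two values of k. The function name `chebyshev_lower_bound` is an assumption for illustration.

```python
def chebyshev_lower_bound(k):
    """Minimum percentage of measurements within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's Theorem requires k > 1")
    return 100 * (1 - 1 / k ** 2)


bound_k2 = chebyshev_lower_bound(2)  # at least 75% within two standard deviations
bound_k3 = chebyshev_lower_bound(3)  # at least about 88.89% within three
print(bound_k2, round(bound_k3, 2))
```

These are guaranteed minimums for any population, which is why they are lower than the Empirical Rule percentages that apply only to normal populations.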
3-23
z Scores
• For any x in a population or sample, the associated z score is
$$z = \frac{x - \text{mean}}{\text{standard deviation}}$$
• The z score is the number of standard deviations that x is from the mean
– A positive z score is for x above (greater than) the mean
– A negative z score is for x below (less than) the mean
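As an illustration, the sketch below computes z scores for two of the car mileages, using the sample mean 31.26 and sample standard deviation 0.8019 from the earlier example. Applying the z score to these particular values is an assumption made here, not an example from the slides.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / std_dev


# Sample mean and standard deviation of the five car mileages.
z_high = z_score(32.1, 31.26, 0.8019)  # above the mean, so positive
z_low = z_score(30.1, 31.26, 0.8019)   # below the mean, so negative
print(round(z_high, 2), round(z_low, 2))
```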
3-24
Coefficient of Variation
• Measures the size of the standard deviation relative to the size of the mean
• Coefficient of variation = (standard deviation / mean) × 100%
• Used to:
– Compare the relative variabilities of values about the mean
– Compare the relative variability of populations or samples with different means and different standard deviations
– Measure risk
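A short sketch comparing relative variability for the two data sets used earlier in the chapter: class sizes (mean 36, standard deviation 14.71) and car mileages (mean 31.26, standard deviation 0.8019). The comparison itself is an illustration added here, and it mixes a population with a sample purely to show the arithmetic.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return std_dev / mean * 100


cv_classes = coefficient_of_variation(14.71, 36)      # class sizes
cv_mileage = coefficient_of_variation(0.8019, 31.26)  # car mileages
print(round(cv_classes, 2), round(cv_mileage, 2))
```

The class sizes vary far more relative to their mean than the mileages do, even though the two means are of similar size.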
3-25
Percentiles, Quartiles, and Box-and-Whiskers Displays
For a set of measurements arranged in increasing order, the pth percentile is a value such that p percent of the measurements fall at or below the value and (100-p) percent of the measurements fall at or above the value
• The first quartile Q1 is the 25th percentile
• The second quartile (or median) is the 50th percentile
• The third quartile Q3 is the 75th percentile
• The interquartile range IQR is Q3 - Q1
3-26
Calculating Percentiles
1. Arrange the measurements in increasing order
2. Calculate the index i = (p/100)n, where p is the percentile to find
3. (a) If i is not an integer, round up: the next integer greater than i denotes the position of the pth percentile
(b) If i is an integer, the pth percentile is the average of the measurements in positions i and i+1
3-27
Percentile Example (p=10th Percentile)
• i=(10/100)12=1.2
• Not an integer so round up to 2
• 10th percentile is in the second position so 11,070
7,524 11,070 18,211 26,817 36,551 41,286
49,312 57,283 72,814 90,416 135,540 190,250
3-28
Percentile Example (p=25th Percentile)
• i=(25/100)12=3
• Integer so average values in positions 3 and 4
• 25th percentile is (18,211 + 26,817)/2, or 22,514
7,524 11,070 18,211 26,817 36,551 41,286
49,312 57,283 72,814 90,416 135,540 190,250
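The three-step percentile procedure can be sketched as follows, using the twelve measurements from the two examples. The function name `percentile` is illustrative, and the sketch assumes p is a whole number strictly between 0 and 100 so that the index can be computed with exact integer arithmetic.

```python
def percentile(values, p):
    """pth percentile by the index rule i = (p/100)n; assumes integer 0 < p < 100."""
    ordered = sorted(values)                 # step 1: increasing order
    n = len(ordered)
    whole, remainder = divmod(p * n, 100)    # step 2: i = whole + remainder/100
    if remainder != 0:
        # Step 3(a): i is not an integer, so round up to position whole + 1.
        return ordered[whole]                # 0-based index of that position
    # Step 3(b): i is an integer, so average positions i and i + 1.
    return (ordered[whole - 1] + ordered[whole]) / 2


measurements = [7524, 11070, 18211, 26817, 36551, 41286,
                49312, 57283, 72814, 90416, 135540, 190250]
p10 = percentile(measurements, 10)
p25 = percentile(measurements, 25)
print(p10, p25)  # 11070 22514.0
```

Statistical packages often use slightly different interpolation rules, so their percentiles may not match this method exactly.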
3-29
Five Number Summary
1. The smallest measurement
2. The first quartile, Q1
3. The median, Md
4. The third quartile, Q3
5. The largest measurement
• Displayed visually using a box-and-whiskers plot
3-30
Box-and-Whiskers Plots
• The box plots the:
– first quartile, Q1
– median, Md
– third quartile, Q3
– inner fences
– outer fences
3-31
Box-and-Whiskers Plots Continued
• Inner fences
– Located 1.5 × IQR away from the quartiles:
• Q1 – (1.5 × IQR)
• Q3 + (1.5 × IQR)
• Outer fences
– Located 3 × IQR away from the quartiles:
• Q1 – (3 × IQR)
• Q3 + (3 × IQR)
3-32
Box-and-Whiskers Plots Continued
• The “whiskers” are dashed lines that plot the range of the data
– A dashed line drawn from the box below Q1 down to the smallest measurement
– Another dashed line drawn from the box above Q3 up to the largest measurement
3-33
Box-and-Whiskers Plots Continued
3-34
Outliers
• Outliers are measurements that are very different from other measurements
– They are either much larger or much smaller than most of the other measurements
• Outliers lie beyond the fences of the box-and-whiskers plot
– Measurements between the inner and outer fences are mild outliers
– Measurements beyond the outer fences are severe outliers
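The fences and the outlier classification can be sketched as below. As an assumption for illustration, it reuses the twelve measurements from the percentile examples and the index rule for quartiles. The names `percentile` and `classify_outliers` are hypothetical.

```python
def percentile(values, p):
    """pth percentile by the index rule i = (p/100)n; assumes integer 0 < p < 100."""
    ordered = sorted(values)
    whole, remainder = divmod(p * len(ordered), 100)
    if remainder != 0:
        return ordered[whole]
    return (ordered[whole - 1] + ordered[whole]) / 2


def classify_outliers(values):
    """Return inner fences, outer fences, mild outliers and severe outliers."""
    q1, q3 = percentile(values, 25), percentile(values, 75)
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    outer = (q1 - 3 * iqr, q3 + 3 * iqr)
    # Mild: beyond an inner fence but not beyond the outer fence on that side.
    mild = [x for x in values
            if outer[0] <= x < inner[0] or inner[1] < x <= outer[1]]
    # Severe: beyond an outer fence.
    severe = [x for x in values if x < outer[0] or x > outer[1]]
    return inner, outer, mild, severe


measurements = [7524, 11070, 18211, 26817, 36551, 41286,
                49312, 57283, 72814, 90416, 135540, 190250]
inner, outer, mild, severe = classify_outliers(measurements)
print(inner, outer, mild, severe)
```

Under these assumptions the largest measurement, 190,250, falls between the upper inner fence (170,266.5) and the upper outer fence (258,918), so it would count as a mild outlier.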
3-35
Covariance, Correlation, and the Least Squares Line
• When points on a scatter plot seem to fluctuate around a straight line, there is a linear relationship between x and y
• A measure of the strength of a linear relationship is the covariance sxy
$$s_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$
3-36
Covariance
• A positive covariance indicates a positive linear relationship between x and y
– As x increases, y increases
• A negative covariance indicates a negative linear relationship between x and y
– As x increases, y decreases
3-37
Correlation Coefficient
• Magnitude of covariance does not indicate the strength of the relationship
– Magnitude depends on the unit of measurement used for the data
• Correlation coefficient (r) is a measure of the strength of the relationship that does not depend on the magnitude of the data
$$r = \frac{s_{xy}}{s_x s_y}$$
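A minimal sketch of the sample covariance and correlation coefficient follows. The paired data are hypothetical and chosen only to keep the arithmetic easy to follow. The function names are illustrative.

```python
import math

# Hypothetical paired observations (not from the slides).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]


def sample_covariance(xs, ys):
    """Sum of cross-products of deviations from the means, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((a - x_bar) * (b - y_bar) for a, b in zip(xs, ys)) / (n - 1)


def sample_std(values):
    """Sample standard deviation: square root of the covariance of values with themselves."""
    return math.sqrt(sample_covariance(values, values))


s_xy = sample_covariance(x, y)
r = s_xy / (sample_std(x) * sample_std(y))
print(s_xy, round(r, 4))  # 1.5 0.7746
```

Rescaling x (say, multiplying every value by 100) would multiply the covariance by 100 but leave r unchanged, which is the point made above about units of measurement.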
3-38
Correlation Coefficient Continued
• Sample correlation coefficient r is always between -1 and +1
– Values near -1 show strong negative correlation
– Values near 0 show no correlation
– Values near +1 show strong positive correlation
• Sample correlation coefficient is the point estimate for the population correlation coefficient ρ
3-39
Least Squares Line
• If there is a linear relationship between x and y, might wish to predict y on the basis of x
• This requires the equation of a line describing the linear relationship
• Line is calculated using the least squares method
– Discussed in detail in Chapter 13
3-40
Least Squares Line Continued
• Need to calculate slope (b1) and y-intercept (b0)
• $$b_1 = \frac{s_{xy}}{s_x^2}$$
• $$b_0 = \bar{y} - b_1 \bar{x}$$
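The slope and intercept can be computed directly from the sample covariance and the sample variance of x. The sketch below uses hypothetical data and an illustrative function name, `least_squares_line`. It is not the fuller treatment deferred to Chapter 13.

```python
# Hypothetical paired observations (not from the slides).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]


def least_squares_line(xs, ys):
    """Return (b0, b1) with b1 = s_xy / s_x^2 and b0 = y_bar - b1 * x_bar."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(xs, ys)) / (n - 1)
    s_x_sq = sum((a - x_bar) ** 2 for a in xs) / (n - 1)
    b1 = s_xy / s_x_sq
    b0 = y_bar - b1 * x_bar
    return b0, b1


b0, b1 = least_squares_line(x, y)
predicted_at_6 = b0 + b1 * 6  # predicted y for a hypothetical x of 6
print(round(b0, 2), round(b1, 2), round(predicted_at_6, 2))  # 2.2 0.6 5.8
```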
3-41
Weighted Means
• Sometimes, some measurements are more important than others
– Assign numerical “weights” to the data
• Weights measure relative importance of the value
• Calculate weighted mean as
$$\frac{\sum w_i x_i}{\sum w_i}$$
where wi is the weight assigned to the ith measurement xi
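A small sketch of the weighted mean follows. The scores and weights are hypothetical, for example three assessments counted with weights 2, 3 and 5. `weighted_mean` is an illustrative name.

```python
def weighted_mean(values, weights):
    """Sum of weight times value, divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("each measurement needs exactly one weight")
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


scores = [80, 90, 70]   # hypothetical measurements
weights = [2, 3, 5]     # hypothetical relative importance
result = weighted_mean(scores, weights)
print(result)  # 78.0
```

The unweighted mean of the same scores is 80, so giving the lowest score the largest weight pulls the weighted mean below it.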
3-42
Descriptive Statistics for Grouped Data
• Data already categorized into a frequency distribution or a histogram is called grouped data
• Can calculate the mean and variance even when the raw data is not available
• Calculations are slightly different for data from a sample and data from a population
3-43
Descriptive Statistics for Grouped Data (Sample)
• Sample mean for grouped data:
$$\bar{x} = \frac{\sum f_i M_i}{\sum f_i} = \frac{\sum f_i M_i}{n}$$
• Sample variance for grouped data:
$$s^2 = \frac{\sum f_i (M_i - \bar{x})^2}{n - 1}$$
fi is the frequency for class i
Mi is the midpoint of class i
n = Σfi = sample size
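The grouped-data formulas for a sample can be sketched as below. The class midpoints and frequencies are hypothetical, and `grouped_sample_stats` is an illustrative name.

```python
def grouped_sample_stats(midpoints, frequencies):
    """Return (mean, variance) for sample data given as class midpoints and frequencies."""
    n = sum(frequencies)  # sample size is the sum of the class frequencies
    mean = sum(f * m for m, f in zip(midpoints, frequencies)) / n
    variance = sum(f * (m - mean) ** 2
                   for m, f in zip(midpoints, frequencies)) / (n - 1)
    return mean, variance


# Hypothetical frequency distribution: classes 0-10, 10-20, 20-30.
midpoints = [5, 15, 25]
frequencies = [2, 5, 3]
grouped_mean, grouped_var = grouped_sample_stats(midpoints, frequencies)
print(grouped_mean, round(grouped_var, 2))  # 16.0 54.44
```

Because every measurement in a class is represented by the class midpoint, these values only approximate what the raw data would give.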
3-44
Descriptive Statistics for Grouped Data (Population)
• Population mean for grouped data:
$$\mu = \frac{\sum f_i M_i}{\sum f_i} = \frac{\sum f_i M_i}{N}$$
• Population variance for grouped data:
$$\sigma^2 = \frac{\sum f_i (M_i - \mu)^2}{N}$$
fi is the frequency for class i
Mi is the midpoint of class i
N = Σfi = population size
3-45
The Geometric Mean (Optional)
• For rates of return of an investment, use the geometric mean to give the correct wealth at the end of the investment
• Suppose the rates of return (expressed as decimal fractions) are R1, R2, …, Rn for periods 1, 2, …, n
• The mean of all these returns is then calculated as the geometric mean:
$$R_g = \sqrt[n]{(1 + R_1)(1 + R_2) \cdots (1 + R_n)} - 1$$
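To see why the geometric mean gives the correct ending wealth, consider a hypothetical investment that gains 100% in one period and loses 50% in the next. The sketch below uses illustrative names and made-up returns.

```python
def geometric_mean_return(returns):
    """nth root of the product of (1 + R_i), minus 1; returns are decimal fractions."""
    growth = 1.0
    for r in returns:
        growth *= 1 + r
    return growth ** (1 / len(returns)) - 1


returns = [1.0, -0.5]  # hypothetical: +100% then -50%
r_g = geometric_mean_return(returns)
arithmetic = sum(returns) / len(returns)

# Compounding the geometric mean reproduces the actual ending wealth per unit invested.
ending_wealth = (1 + r_g) ** len(returns)
print(r_g, arithmetic, ending_wealth)  # 0.0 0.25 1.0
```

The arithmetic mean suggests a 25% return per period, yet the investment ends exactly where it started, which is what the geometric mean of 0 reports.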