Univariate Descriptive Statistics
Dr. Shane Nordyke
University of South Dakota
This material is distributed under an Attribution-NonCommercial-ShareAlike 3.0 Unported Creative Commons License, the full details of which may be found online here: http://creativecommons.org/licenses/by-nc-sa/3.0/ . You may re-use, edit, or redistribute the content provided that the original source is cited, it is for non-commercial purposes, and it is distributed under a similar license.
CC BY-NC-SA Nordyke 2010
• We use the label univariate descriptive statistics to refer to a variety of measures of center and variation that are useful for understanding the nature and distribution of a single variable.
• They can allow us to quickly understand a large amount of information about a single variable.
• They make data meaningful!
Making Data Meaningful
Age of Volunteer: 15 19 22 17 39 17 26
A relatively small sample of the ages of volunteers at a local non-profit agency in the community.
What does this list tell us about the age of volunteers in the agency?
• Measures of central tendency
– Mean, Median, Mode, Midrange
• Measures of distribution
– Range, Min, Max, Percentiles
• Measures of variation
– Standard Deviation, Variance, Coefficient of Variation
Some initial notation
Σ indicates the addition of a set of values
y is the variable used to represent the individual data values
n represents the number of values in a sample
N represents the number of values in a population
Measures of Central Tendency - Mean
The sample mean is the mathematical average of the data and is the measure of central tendency we use most often.
Measures of Central Tendency - Mean
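Sample Mean: $\bar{y} = \frac{\sum y}{n}$

Observation # | Age of Volunteer
1 | 15
2 | 17
3 | 17
4 | 19
5 | 22
6 | 26
7 | 39

$\sum y = 155$, the sum of all of the observations
$n = 7$, the number of observations
$\bar{y} = \frac{155}{7} \approx 22.14$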
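The calculation on this slide can be checked with a few lines of Python. This is a minimal sketch: the function name `sample_mean` is an illustrative choice, not part of the original material.

```python
# Volunteer ages from the sample used throughout the slides.
ages = [15, 17, 17, 19, 22, 26, 39]


def sample_mean(values):
    """Return the arithmetic mean: the sum of the values divided by n."""
    n = len(values)
    if n == 0:
        raise ValueError("the mean is undefined for an empty sample")
    return sum(values) / n


total = sum(ages)             # sum of all observations
mean_age = sample_mean(ages)  # total divided by the number of observations

print(total, len(ages), f"{mean_age:.2f}")  # 155 7 22.14
```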
Measures of Central Tendency - Median
The sample median is the middle value when the original data values are arranged in order of increasing (or decreasing) magnitude. If there isn’t one value in the middle we take the average of the two middle values.
The median is not affected by extreme values.
Measures of Central Tendency - Median
Median: ỹ
The median is often denoted by ỹ, which is pronounced “y-tilde”.
Measures of Central Tendency - Median
15 17 17 19 22 26 39
Sample ages are arranged in ascending order.
The middle value is the median: $\tilde{y} = 19$
Measures of Central Tendency - Median
15 17 17 19 22 26 34 39
If there are two values in the middle, we take the average of the two.
$\tilde{y} = \frac{19 + 22}{2} = 20.5$
Measures of Central Tendency - Median
15 17 17 19 22 26 34 99
Note that the presence of an extreme value doesn’t change the median.
$\tilde{y} = \frac{19 + 22}{2} = 20.5$
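The three median examples can be reproduced with a short function. This is a sketch under the definition given earlier (sort, then take the middle value or the average of the two middle values); the name `sample_median` is illustrative. The means are printed alongside to show that the extreme value moves the mean but not the median.

```python
def sample_median(values):
    """Return the middle value of the sorted data, or the average of the
    two middle values when the number of observations is even."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median is undefined for an empty sample")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_sample = [15, 17, 17, 19, 22, 26, 39]
even_sample = [15, 17, 17, 19, 22, 26, 34, 39]
with_outlier = [15, 17, 17, 19, 22, 26, 34, 99]

print(sample_median(odd_sample))    # 19
print(sample_median(even_sample))   # 20.5
print(sample_median(with_outlier))  # 20.5

# The mean, by contrast, shifts when 39 is replaced by 99.
print(sum(even_sample) / len(even_sample))    # 23.625
print(sum(with_outlier) / len(with_outlier))  # 31.125
```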
Measures of Central Tendency - Mode
The mode is the value that occurs most frequently.
– Not every sample has a distinct mode. A sample may be bimodal (two modes) or multimodal (three or more modes), or it may have no mode at all.
– The mode is the only measure of central tendency we can use for nominal data.
Measures of Central Tendency - Mode
15 17 17 19 22 26 39
17 is the only value that occurs more than once, so it is the value that occurs most frequently and therefore the mode.
The mode is often denoted with the symbol M.
M = 17
Measures of Central Tendency - Mode
Blue Green Green Purple Purple Red Red Red Red Yellow Yellow Yellow
M = Red
20 29 33 33 34 41 41 42 43 45 45
Multimodal (33, 41, and 45 each occur twice)
1.1 2.3 4.1 5.3 4.3 6.7 8.2 8.3 8.7 8.9 10.3
No mode
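The mode examples can be checked with a small helper built on `collections.Counter` from the Python standard library. This is a sketch; the function name `modes` and the convention of returning an empty list when no value repeats are assumptions chosen to match the "no mode" case on the slide.

```python
from collections import Counter


def modes(values):
    """Return every value tied for the highest frequency.

    Returns an empty list when no value occurs more than once, which is
    the "no mode" case. Works for nominal data such as colors.
    """
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1:
        return []
    return [value for value, count in counts.items() if count == top]


ages = [15, 17, 17, 19, 22, 26, 39]
colors = ["Blue", "Green", "Green", "Purple", "Purple",
          "Red", "Red", "Red", "Red", "Yellow", "Yellow", "Yellow"]
scores = [20, 29, 33, 33, 34, 41, 41, 42, 43, 45, 45]
decimals = [1.1, 2.3, 4.1, 5.3, 4.3, 6.7, 8.2, 8.3, 8.7, 8.9, 10.3]

print(modes(ages))      # [17]
print(modes(colors))    # ['Red']
print(modes(scores))    # [33, 41, 45]
print(modes(decimals))  # []
```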
Measures of Central Tendency - Midrange
The midrange, or middle of the range, is the average of the highest and lowest values.
There is no distinct symbol for the midrange.
$\text{Midrange} = \frac{\text{highest value} + \text{lowest value}}{2}$
Measures of Central Tendency - Midrange
15 17 17 19 22 26 39
$\text{Midrange} = \frac{39 + 15}{2} = \frac{54}{2} = 27$
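A short sketch of the midrange calculation follows. The function name `midrange` is illustrative, and the second sample (with 39 replaced by 99) is a hypothetical variation added only to show how strongly a single extreme value pulls the midrange.

```python
def midrange(values):
    """Return the average of the highest and lowest values."""
    if not values:
        raise ValueError("the midrange is undefined for an empty sample")
    return (max(values) + min(values)) / 2


ages = [15, 17, 17, 19, 22, 26, 39]
ages_with_outlier = [15, 17, 17, 19, 22, 26, 99]  # hypothetical variation

print(midrange(ages))               # 27.0
print(midrange(ages_with_outlier))  # 57.0
```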
Comparing Measures of Central Tendency
15 17 17 19 22 26 39
Mean = 22.14
Median = 19
Mode = 17
Midrange = 27
Comparing Measures of Center

Measure of Center (listed from most used to least used) | Does it always exist? | Does it take into account every value? | Is it affected by extreme values?
Mean | Always | Yes | Yes
Median | Always | No | No
Mode | Might not exist, may have more than one | No | No
Midrange | Always | No | Yes
The Range
• The range of a sample is the difference between the highest value and the lowest value.
15 17 17 19 22 26 39
In our example the Range = 39 – 15 or 24; there are 24 years between our youngest and oldest volunteers in the sample.
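The range takes one subtraction once the minimum and maximum are known. A minimal sketch, with `value_range` as an illustrative name (chosen to avoid shadowing Python's built-in `range`):

```python
def value_range(values):
    """Return the difference between the highest and lowest values."""
    if not values:
        raise ValueError("the range is undefined for an empty sample")
    return max(values) - min(values)


ages = [15, 17, 17, 19, 22, 26, 39]

print(min(ages), max(ages), value_range(ages))  # 15 39 24
```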
Measures of Variance
• Where measures of central tendency try to give us an idea of where the middle of the data lies, measures of variance (or variation) tell us about how the data is distributed around that center.
• Our three primary measures of variance are:
– Standard Deviation
– Variance
– Coefficient of Variation
Measures of Variance – Standard Deviation
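Sample Standard Deviation: $s = \sqrt{\frac{\sum (y - \bar{y})^2}{n - 1}}$
Population Standard Deviation: $\sigma = \sqrt{\frac{\sum (y - \mu)^2}{N}}$, where $\mu$ is the population mean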
The Standard Deviation is a measure of the variation of values around the mean.
Some Key Points for Understanding Standard Deviation
• The standard deviation is never negative; it is zero only when every value in the sample is the same.
• The standard deviation of a sample will always be in the same units as the observations in the sample.
• Extreme values or outliers can change the value of the standard deviation substantially.
• The size of the sample does not by itself make the standard deviation smaller; as the sample size increases, the sample standard deviation settles toward the population standard deviation. What shrinks with larger samples is the uncertainty in the estimate of the mean.
Measures of Variance - Variance
• The variance of a sample is just the standard deviation of the sample squared.
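Sample Variance: $s^2 = \frac{\sum (y - \bar{y})^2}{n - 1}$
Population Variance: $\sigma^2 = \frac{\sum (y - \mu)^2}{N}$, where $\mu$ is the population mean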
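To make the sample and population versions concrete, the sketch below applies the usual textbook definitions (divide the summed squared deviations by n - 1 for a sample and by N for a population) to the volunteer ages. The function names are illustrative, and the results are cross-checked against the `statistics` module in the Python standard library.

```python
import math
import statistics

ages = [15, 17, 17, 19, 22, 26, 39]


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("sample variance needs at least two observations")
    mean = sum(values) / n
    return sum((y - mean) ** 2 for y in values) / (n - 1)


def population_variance(values):
    """Sum of squared deviations from the mean, divided by N."""
    n = len(values)
    if n == 0:
        raise ValueError("population variance needs at least one observation")
    mean = sum(values) / n
    return sum((y - mean) ** 2 for y in values) / n


s2 = sample_variance(ages)
s = math.sqrt(s2)  # the standard deviation is the square root of the variance
sigma2 = population_variance(ages)
sigma = math.sqrt(sigma2)

print(f"sample:     variance = {s2:.2f}, standard deviation = {s:.2f}")
print(f"population: variance = {sigma2:.2f}, standard deviation = {sigma:.2f}")

# Cross-check against the standard library implementations.
assert math.isclose(s2, statistics.variance(ages))
assert math.isclose(sigma2, statistics.pvariance(ages))
```

For these seven ages the sample standard deviation comes out at roughly 8.3 years, in the same units as the observations.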
Standard Deviation and Variance Notation
Sample: $s$ = standard deviation, $s^2$ = variance
Population: $\sigma$ = standard deviation, $\sigma^2$ = variance
Seeing Standard Deviations
• [Planned figure: curves contrasting a distribution with a small standard deviation (tall and narrow) with one that has a large standard deviation (broad and flat).]
Back to our example
• In our sample of volunteer ages, the mean was 22.14 years.
• We can calculate the standard deviation to better understand how the values are distributed around that mean.
When data sets have distributions that are approximately bell shaped, the following is true:
• About 68% of all values fall within 1 standard deviation of the mean
• About 95% of all values fall within 2 standard deviations of the mean
• About 99.7% of all values fall within 3 standard deviations of the mean
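These percentages can be checked by simulation. The sketch below draws bell-shaped data with `random.gauss` and counts the share of values within 1, 2, and 3 standard deviations of the mean. The mean of 50, standard deviation of 10, sample size, and seed are arbitrary assumptions chosen for illustration; they do not come from the slides.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Simulated, approximately bell-shaped data (parameters are arbitrary).
data = [random.gauss(50, 10) for _ in range(100_000)]

mean = statistics.fmean(data)
sd = statistics.stdev(data)


def share_within(values, center, spread, k):
    """Return the fraction of values within k spreads of the center."""
    low, high = center - k * spread, center + k * spread
    return sum(low <= v <= high for v in values) / len(values)


shares = {k: share_within(data, mean, sd, k) for k in (1, 2, 3)}

for k, share in shares.items():
    print(f"within {k} standard deviation(s): {share:.1%}")
```

The printed shares should land close to 68%, 95%, and 99.7%. The rule is only an approximation, and it does not apply to data that are strongly skewed or otherwise far from bell shaped.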
The Empirical Rule
[Figure: bell curve divided into bands by distance from the mean, in standard deviations]
• 68% of values fall within 1 standard deviation of the mean (34% on each side of the mean).
• 95% of values fall within 2 standard deviations of the mean (a further 13.5% on each side).
• 99.7% of values fall within 3 standard deviations of the mean (a further 2.4% on each side).
Measures of Variance – Coefficient of Variation
• The Coefficient of Variation (CV) is a measure of the standard deviation of a sample relative to its mean.
• CVs are useful when you are comparing the variation of variables that are measured in different units.
Measures of Variance – Coefficient of Variation
An example: You are comparing the heights and weights of fourth graders.
Height: mean = 52”, s = 4”
Weight: mean = 80 lbs., s = 10 lbs.
Which variable has greater variance? How can we compare 4” to 10 lbs?
Measures of Variance – Coefficient of Variation
$CV = \frac{s}{\bar{y}} \times 100\%$
Height: mean = 52”, s = 4”, so $CV = \frac{4}{52} \times 100\% \approx 7.7\%$, or about 8%
Weight: mean = 80 lbs., s = 10 lbs., so $CV = \frac{10}{80} \times 100\% = 12.5\%$
The standard deviation of height is about 8% of the mean of height, whereas the standard deviation of weight is 12.5% of the mean of weight, so there is greater variation in the weight of the fourth graders than in the height.
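The comparison can be reproduced with a one-line formula: the standard deviation divided by the mean, times 100. In this sketch the function name `coefficient_of_variation` is illustrative. The unrounded height CV is about 7.7%, which the slide rounds to 8%.

```python
def coefficient_of_variation(std_dev, mean):
    """Return the standard deviation as a percentage of the mean."""
    if mean == 0:
        raise ValueError("the CV is undefined when the mean is zero")
    return std_dev / mean * 100


height_cv = coefficient_of_variation(4, 52)   # inches
weight_cv = coefficient_of_variation(10, 80)  # pounds

print(f"height CV = {height_cv:.1f}%")  # height CV = 7.7%
print(f"weight CV = {weight_cv:.1f}%")  # weight CV = 12.5%
print("weight varies more" if weight_cv > height_cv else "height varies more")
```

Because the units cancel in the ratio, the two percentages can be compared directly even though one variable is in inches and the other in pounds.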