-
Univariate Descriptive Statistics
Dr. Shane Nordyke
University of South Dakota
This material is distributed under an
Attribution-NonCommercial-ShareAlike 3.0 Unported Creative Commons
License, the full details of which may be found online here:
http://creativecommons.org/licenses/by-nc-sa/3.0/ . You may re-use,
edit, or redistribute the content provided that the original source
is cited, it is for non-commercial purposes, and provided it is
distributed under a similar license.
CC BY-NC-SA Nordyke 2010
-
Why do we need descriptive statistics?
We use the label univariate descriptive statistics to refer to a
variety of measures of center and variation that are useful for
understanding the nature and distribution of a single variable.
They can allow us to quickly understand a large amount of
information about a single variable.
They make data meaningful!
-
Making Data Meaningful
Age of Volunteer
15 19 22 17 39 17 26
A relatively small sample of the ages of volunteers at a local
non-profit agency in the community.
What does this list tell us about the age of volunteers in the
agency?
-
Making Data Meaningful
Age of Volunteer
15 17 17 19 22 26 39
Sorting the list can provide a starting place.
What do we know now?
-
Making Data Meaningful
What if the sample is larger?
39 25 22 40 37 15 30 16 25 28
16 31 50 46 30 15 25 20 17 22 43 27 42 43 17 16 33 26 31 30
38 43 40 22 19 15 24 19 26 40 39 27 35 28 26 28 41 43 47 22
36 41 25 38 25 36 38 38 18 45 16 30 40 21 16 48 48 46 30 31
31 16 26 49 24 44 39 15 21 24 24 41 42 49 44 24 18 28 22 38
22 47 44 20 31 24 24 27 34 33 17 49 33 44 27 43 49 16 23 25 35
34 20 26 29 44 17 42 43 29
32 33 18 24 45 50 21 39 40 21 28 31 19 16 26 26 16 45 22 21
47 15 39 49 33 29 40 20 18 37
49 16 19 23 34 37 18 15 19 41
-
The Menu of Basic Descriptive Statistics
Measures of central tendency
Mean, Median, Mode, Midrange
Measures of distribution
Range, Min, Max, Percentiles
Measures of Variation
Standard Deviation, Variance, Coefficient of Variation
-
Some initial notation
Σ indicates the addition of a set of values
y is the variable used to represent the individual data values
n represents the number of values in a sample
N represents the number of values in a population
-
Measures of Central Tendency - Mean
The sample mean is the mathematical average of the data and is
the measure of central tendency we use most often.
-
Measures of Central Tendency - Mean
Sample Mean:
$\bar{y} = \frac{\sum_{i=1}^{n} y_i}{n} = \frac{155}{7} = 22.14$

Observation #    Age of Volunteer
1                15
2                17
3                17
4                19
5                22
6                26
7                39
Sum              155

$\sum y$ = 155, the sum of all of the observations
n = 7, the number of observations
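The same calculation can be sketched in Python. This is a minimal illustration rather than part of the original slides; the variable names (ages, total, n, sample_mean) are assumptions chosen for the example.

```python
# Ages of the seven volunteers from the example sample
ages = [15, 17, 17, 19, 22, 26, 39]

total = sum(ages)          # sum of all of the observations
n = len(ages)              # number of observations in the sample
sample_mean = total / n    # 155 / 7

print(total, n, round(sample_mean, 2))  # 155 7 22.14
```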
-
Measures of Central Tendency - Median
The sample median is the middle value when the original data
values are arranged in order of increasing (or decreasing)
magnitude. If there isn't one value in the middle, we take the
average of the two middle values.
The median is not affected by extreme values.
-
Measures of Central Tendency - Median
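Median: $\tilde{y}$ = the middle value of the ordered data, or, when there are two middle values, (sum of the two middle values) / 2
The median is often denoted by $\tilde{y}$, which is pronounced "y-tilde".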
-
Measures of Central Tendency - Median
15 17 17 19 22 26 39
Sample ages are arranged in ascending order.
The middle value is the median: $\tilde{y}$ = 19
-
Measures of Central Tendency - Median
15 17 17 19 22 26 34 39
If there are two values in the middle, we take the average of
the two.
$\tilde{y} = \frac{19 + 22}{2} = 20.5$
-
Measures of Central Tendency - Median
15 17 17 19 22 26 34 99
Note that the presence of an extreme value doesn't change the
median.
$\tilde{y} = \frac{19 + 22}{2} = 20.5$
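The three median examples can be checked with a short Python sketch. It uses the standard library's statistics.median; the list names are assumptions made for this illustration.

```python
from statistics import median

odd_sample = [15, 17, 17, 19, 22, 26, 39]        # seven values: one middle value
even_sample = [15, 17, 17, 19, 22, 26, 34, 39]   # eight values: two middle values
with_outlier = [15, 17, 17, 19, 22, 26, 34, 99]  # same as above, but 39 replaced by 99

median_odd = median(odd_sample)          # the single middle value
median_even = median(even_sample)        # average of 19 and 22
median_outlier = median(with_outlier)    # still the average of 19 and 22

print(median_odd, median_even, median_outlier)  # 19 20.5 20.5
```

Replacing 39 with 99 changes only the largest value, so the two middle values, and therefore the median, stay the same.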
-
Measures of Central Tendency - Mode
The mode is the value that occurs most frequently.
Not every sample has a single distinct mode. A sample may be bimodal
(two modes) or multimodal (three or more modes), or it may have
no mode at all.
The mode is the only measure of central tendency we can use for
nominal data.
-
Measures of Central Tendency - Mode
15 17 17 19 22 26 39
17 is the only value that occurs more than once, so it is the
value that occurs most
frequently and the mode.
Mode is often denoted with the symbol M
M = 17
-
Measures of Central Tendency - Mode
Blue Green Green Purple Purple Red Red Red Red Yellow Yellow Yellow
M = Red

20 29 33 33 34 41 41 42 43 45 45
Multimodal (33, 41, and 45 each occur twice)

1.1 2.3 4.1 5.3 4.3 6.7 8.2 8.3 8.7 8.9 10.3
No mode (every value occurs only once)
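The three cases above (one mode, several modes, no mode) can be sketched in Python. The helper find_modes is a hypothetical function written for this illustration; it follows the convention used in these slides that a sample in which every value occurs only once has no mode, and it assumes the input is not empty.

```python
from collections import Counter

def find_modes(values):
    """Return the most frequent value(s); an empty list means there is no mode."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []  # every value occurs once, so no value stands out
    return [value for value, count in counts.items() if count == highest]

colors = ["Blue", "Green", "Green", "Purple", "Purple",
          "Red", "Red", "Red", "Red", "Yellow", "Yellow", "Yellow"]
multimodal = [20, 29, 33, 33, 34, 41, 41, 42, 43, 45, 45]
no_mode = [1.1, 2.3, 4.1, 5.3, 4.3, 6.7, 8.2, 8.3, 8.7, 8.9, 10.3]

print(find_modes(colors))      # ['Red']
print(find_modes(multimodal))  # [33, 41, 45]
print(find_modes(no_mode))     # []
```

Because the helper only counts occurrences, it works for nominal data such as colors as well as for numbers.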
-
Measures of Central Tendency - Midrange
The midrange, or middle of the range, is the average of the
highest and lowest values.
There is no distinct symbol for the midrange.
Midrange = $\frac{\text{lowest value} + \text{highest value}}{2}$
-
Measures of Central Tendency - Midrange
15 17 17 19 22 26 39
Midrange = $\frac{\text{lowest value} + \text{highest value}}{2} = \frac{15 + 39}{2} = 27$
-
Comparing Measures of Central Tendency
15 17 17 19 22 26 39
Mean = 22.14 Median = 19 Mode = 17 Midrange = 27
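All four measures can be computed side by side with a short Python sketch. The mean, median, and mode come from the standard library's statistics module; the midrange has no built-in function, so it is computed directly. The variable names are assumptions made for this illustration.

```python
from statistics import mean, median, mode

ages = [15, 17, 17, 19, 22, 26, 39]

sample_mean = mean(ages)                  # about 22.14
sample_median = median(ages)              # the middle value of the sorted data
sample_mode = mode(ages)                  # the most frequent value
midrange = (min(ages) + max(ages)) / 2    # (15 + 39) / 2

print(round(sample_mean, 2), sample_median, sample_mode, midrange)  # 22.14 19 17 27.0
```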
-
Comparing Measures of Center
Measure of Center (listed from most used to least used)

Measure    Does it always exist?                     Takes every value into account?   Affected by extreme values?
Mean       Always                                    Yes                               Yes
Median     Always                                    No                                No
Mode       Might not exist, may have more than one   No                                No
Midrange   Always                                    No                                Yes
-
The Range
The range of a sample is the difference between the highest
value and the lowest value.
15 17 17 19 22 26 39
In our example, the Range = 39 - 15 = 24; there are 24 years
between our youngest and oldest volunteers in the sample.
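As a quick check, the range can be computed in Python from the minimum and maximum of the sample. The variable names are assumptions made for this illustration.

```python
ages = [15, 17, 17, 19, 22, 26, 39]

youngest = min(ages)              # lowest value in the sample
oldest = max(ages)                # highest value in the sample
age_range = oldest - youngest     # highest value minus lowest value

print(youngest, oldest, age_range)  # 15 39 24
```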
-
Measures of Variance
Where measures of central tendency try to give us an idea of
where the middle of the data lies, measures of variance (or
variation) tell us about how the data is distributed around that
center.
Our three primary measures of variance are: Standard Deviation,
Variance, and Coefficient of Variation.
-
Measures of Variance - Standard Deviation
Sample Standard Deviation: $s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}}$
Population Standard Deviation: $\sigma = \sqrt{\frac{\sum_{i=1}^{N} (y_i - \mu)^2}{N}}$
The Standard Deviation is a measure of the variation of values
around the mean.
-
Some Key Points for Understanding Standard Deviation
The standard deviation is never negative; it is zero only when every value in the sample is the same.
The standard deviation of a sample will always be in the same
units as the observations in the sample.
Extreme values or outliers can change the value of the standard
deviation substantially.
The size of the sample affects how well the sample standard
deviation estimates the population standard deviation; as the
sample size increases, the sample value tends to settle near the
population value rather than shrinking toward zero.
-
Measures of Variance - Variance
The variance of a sample is just the standard deviation of the
sample squared.
Sample Variance: $s^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}$
Population Variance: $\sigma^2 = \frac{\sum_{i=1}^{N} (y_i - \mu)^2}{N}$
-
Standard Deviation and Variance Notation
Sample                       Population
s = standard deviation       σ = standard deviation
s² = variance                σ² = variance
-
Seeing Standard Deviations
Once I figure out how to draw the curves, this will be a slide
that shows the difference between a distribution with a small
standard deviation (tall and narrow) and a large one (broad and
flat).
-
Back to our example
In our sample of volunteer ages, the mean was 22.14 years.
We can calculate the standard deviation to better understand how
the values are distributed around that mean.
15 17 17 19 22 26 39
-
Back to our example
Sample Standard Deviation: $s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}}$

y     $\bar{y}$    $y - \bar{y}$    $(y - \bar{y})^2$
15    22.14        -7.14            50.9796
17    22.14        -5.14            26.4196
17    22.14        -5.14            26.4196
19    22.14        -3.14            9.8596
22    22.14        -0.14            0.0196
26    22.14        3.86             14.8996
39    22.14        16.86            284.2596

$\sum (y - \bar{y})^2$ = 412.8572
-
Back to our example
Sample Standard Deviation: $s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}} = \sqrt{\frac{412.86}{7 - 1}} = 8.3$
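The worked example can be reproduced step by step in Python. This is a sketch with variable names chosen for illustration. It uses the unrounded mean, so the sum of squared deviations comes out as 412.857... instead of the 412.8572 obtained with the mean rounded to 22.14; the difference disappears after rounding. The last line compares the result with the standard library's statistics.stdev, which also divides by n - 1.

```python
import math
import statistics

ages = [15, 17, 17, 19, 22, 26, 39]

n = len(ages)
sample_mean = sum(ages) / n                               # unrounded mean, about 22.14
squared_deviations = [(y - sample_mean) ** 2 for y in ages]
sum_of_squares = sum(squared_deviations)                  # about 412.86
sample_variance = sum_of_squares / (n - 1)                # divide by n - 1 for a sample
sample_sd = math.sqrt(sample_variance)                    # square root of the variance

print(round(sum_of_squares, 2), round(sample_sd, 1))      # 412.86 8.3
print(math.isclose(sample_sd, statistics.stdev(ages)))    # True
```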
-
Copyright 2004 Pearson Education,
Inc.
How are standard deviations helpful?
The Empirical Rule
When data sets have distributions that are approximately bell
shaped, the following is true:
About 68% of all values fall within 1 standard deviation of the
mean
About 95% of all values fall within 2 standard deviations of the
mean
About 99.7% of all values fall within 3 standard deviations of
the mean
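The three percentages can be checked against an exactly bell-shaped (normal) distribution with a short Python sketch. It assumes Python 3.8 or later, where the standard library provides statistics.NormalDist; the dictionary name within is an assumption made for this illustration. Real data that are only approximately bell shaped will match these figures only approximately.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)  # mean 0, standard deviation 1

# Share of values within k standard deviations of the mean, for k = 1, 2, 3
within = {
    k: standard_normal.cdf(k) - standard_normal.cdf(-k)
    for k in (1, 2, 3)
}

for k, share in within.items():
    print(k, round(share * 100, 1))
# 1 68.3
# 2 95.4
# 3 99.7
```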
-
The Empirical Rule
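[Figure: bell-shaped curve divided into bands of 1, 2, and 3 standard deviations on each side of the mean]
68% of values fall within 1 standard deviation of the mean (34% on each side)
95% of values fall within 2 standard deviations of the mean (a further 13.5% on each side)
99.7% of values fall within 3 standard deviations of the mean (a further 2.4% on each side)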
-
Measures of Variance - Coefficient of Variation
The Coefficient of Variation (CV) is a measure of the standard
deviation of a sample relative to its mean.
CVs can be useful when you are comparing the standard deviations
of variables that are in two different units.
-
Measures of Variance - Coefficient of Variation
An example: You are comparing the heights and weights of fourth
graders.
Height: mean = 52 in., s = 4 in.
Weight: mean = 80 lbs., s = 10 lbs.
Which variable has greater variation? How can we compare 4 in. to 10
lbs.?
-
Measures of Variance - Coefficient of Variation
CV = $\frac{s}{\bar{y}} \times 100\%$

Height: mean = 52 in., s = 4 in.
CV = $\frac{4}{52} \times 100\% \approx 8\%$

Weight: mean = 80 lbs., s = 10 lbs.
CV = $\frac{10}{80} \times 100\% = 12.5\%$

The standard deviation of height is about 8% of the mean of height,
whereas the standard deviation of weight is 12.5% of the mean of
weight, so there is greater variation in the weight of the fourth
graders than in the height.
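The comparison can be sketched in Python. The function coefficient_of_variation is a hypothetical helper written for this illustration; it divides the standard deviation by the mean and expresses the result as a percentage. Because the units cancel in the division, the two results can be compared directly.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation as a percentage of the mean."""
    return std_dev / mean * 100

height_cv = coefficient_of_variation(4, 52)    # about 7.7, rounded to 8 in the example
weight_cv = coefficient_of_variation(10, 80)   # 12.5

print(round(height_cv, 1), weight_cv)  # 7.7 12.5
print(weight_cv > height_cv)           # True: weight varies more relative to its mean
```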