Univariate Descriptive Statistics
Dr. Shane Nordyke
University of South Dakota

This material is distributed under an Attribution-NonCommercial-ShareAlike 3.0 Unported Creative Commons License, the full details of which may be found online here: http://creativecommons.org/licenses/by-nc-sa/3.0/ . You may re-use, edit, or redistribute the content provided that the original source is cited, it is for non-commercial purposes, and provided it is distributed under a similar license.

CC BY-NC-SA Nordyke 2010

Dec 29, 2015


Ann Miller


Why do we need descriptive statistics?

• We use the label univariate descriptive statistics to refer to a variety of measures of center and variation that are useful for understanding the nature and distribution of a single variable.

• They can allow us to quickly understand a large amount of information about a single variable.

• They make data meaningful!


Making Data Meaningful

Age of Volunteer: 15, 19, 22, 17, 39, 17, 26

A relatively small sample of the ages of volunteers at a local non-profit agency in the community.

What does this list tell us about the age of volunteers in the agency?


Making Data Meaningful

Age of Volunteer: 15, 17, 17, 19, 22, 26, 39

Sorting the list can provide a starting place.

What do we know now?


Making Data Meaningful

What if the sample is larger?

39 25 22 40 37 15 30 16 25 28
16 31 50 46 30 15 25 20 17 22
43 27 42 43 17 16 33 26 31 30
38 43 40 22 19 15 24 19 26 40
39 27 35 28 26 28 41 43 47 22
36 41 25 38 25 36 38 38 18 45
16 30 40 21 16 48 48 46 30 31
31 16 26 49 24 44 39 15 21 24
24 41 42 49 44 24 18 28 22 38
22 47 44 20 31 24 24 27 34 33
17 49 33 44 27 43 49 16 23 25
35 34 20 26 29 44 17 42 43 29
32 33 18 24 45 50 21 39 40 21
28 31 19 16 26 26 16 45 22 21
47 15 39 49 33 29 40 20 18 37
49 16 19 23 34 37 18 15 19 41


The Menu of Basic Descriptive Statistics

• Measures of central tendency
– Mean, Median, Mode, Midrange

• Measures of distribution
– Range, Min, Max, Percentiles

• Measures of Variation
– Standard Deviation, Variance, Coefficient of Variation


Some initial notation

$\sum$ (capital sigma) indicates the addition of a set of values

y is the variable used to represent the individual data values

n represents the number of values in a sample

N represents the number of values in a population


Measures of Central Tendency - Mean

The sample mean is the mathematical average of the data and is the measure of central tendency we use most often.


Measures of Central Tendency - Mean

Sample Mean: $\bar{y} = \frac{\sum y}{n}$

Observation #   Age of Volunteer
1               15
2               17
3               17
4               19
5               22
6               26
7               39

$\sum y = 155$, the sum of all of the observations

$n = 7$, the number of observations

$\bar{y} = 155 / 7 \approx 22.14$
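
The sample mean for the seven volunteer ages can be checked with a short Python sketch. The function name `sample_mean` is an illustrative choice, not something from the slides.

```python
# Ages of the seven volunteers from the slide.
ages = [15, 17, 17, 19, 22, 26, 39]


def sample_mean(values):
    """Return the sum of the observations divided by the number of observations."""
    return sum(values) / len(values)


total = sum(ages)             # sum of all of the observations
n = len(ages)                 # number of observations
mean_age = sample_mean(ages)

print(f"sum = {total}, n = {n}, mean = {mean_age:.2f}")
# prints: sum = 155, n = 7, mean = 22.14
```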


Measures of Central Tendency - Median

The sample median is the middle value when the original data values are arranged in order of increasing (or decreasing) magnitude. If there isn’t one value in the middle we take the average of the two middle values.

The median is not affected by extreme values.


Measures of Central Tendency - Median

Median: ỹ

The median is often denoted by ỹ, which is pronounced “y-tilde”.


Measures of Central Tendency - Median

15 17 17 19 22 26 39

Sample ages are arranged in ascending order

The middle value is the median.

ỹ = 19


Measures of Central Tendency - Median

15 17 17 19 22 26 34 39

If there are two values in the middle, we take the average of the two.

ỹ = (19 + 22) / 2 = 20.5


Measures of Central Tendency - Median

15 17 17 19 22 26 34 99

Note that the presence of an extreme value doesn’t change the median.

ỹ = (19 + 22) / 2 = 20.5
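
The three median examples above can be reproduced with a minimal sketch. The helper name `sample_median` is hypothetical; it simply sorts the values and applies the middle-value rule described on the slides.

```python
def sample_median(values):
    """Middle value of the sorted data, or the average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd number of values: exactly one value sits in the middle.
        return ordered[middle]
    # Even number of values: average the two middle values.
    return (ordered[middle - 1] + ordered[middle]) / 2


print(sample_median([15, 17, 17, 19, 22, 26, 39]))      # 19
print(sample_median([15, 17, 17, 19, 22, 26, 34, 39]))  # 20.5
print(sample_median([15, 17, 17, 19, 22, 26, 34, 99]))  # 20.5
```

Replacing 39 with 99 leaves the two middle values untouched, which is why the last two results match.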


Measures of Central Tendency - Mode

The mode is the value that occurs most frequently.

– Not every sample has a distinct mode. Sometimes it is bimodal (two modes) or multimodal (three or more modes), or sometimes there is no mode at all.

– The mode is the only measure of central tendency we can use for nominal data.


Measures of Central Tendency - Mode

15 17 17 19 22 26 39

17 is the only value that occurs more than once, so it is the value that occurs most frequently and the mode.

Mode is often denoted with the symbol M

M = 17


Measures of Central Tendency - Mode

Blue, Green, Green, Purple, Purple, Red, Red, Red, Red, Yellow, Yellow, Yellow

M = Red

20, 29, 33, 33, 34, 41, 41, 42, 43, 45, 45

Multimodal (33, 41, and 45 each occur twice)

1.1, 2.3, 4.1, 5.3, 4.3, 6.7, 8.2, 8.3, 8.7, 8.9, 10.3

No Mode
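
A small sketch of how these cases could be detected in Python. The function `modes` is a hypothetical helper, and it follows the convention used on the slide that a sample in which no value repeats has no mode.

```python
from collections import Counter


def modes(values):
    """Return every value tied for the highest count, or [] if nothing repeats."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        # Convention assumed here: no repeated value means no mode.
        return []
    return [value for value, count in counts.items() if count == highest]


colors = ["Blue", "Green", "Green", "Purple", "Purple", "Red", "Red",
          "Red", "Red", "Yellow", "Yellow", "Yellow"]

print(modes([15, 17, 17, 19, 22, 26, 39]))                      # [17]
print(modes(colors))                                            # ['Red']
print(modes([20, 29, 33, 33, 34, 41, 41, 42, 43, 45, 45]))      # [33, 41, 45]
print(modes([1.1, 2.3, 4.1, 5.3, 4.3, 6.7, 8.2, 8.3, 8.7, 8.9, 10.3]))  # []
```

Because the function only counts occurrences, it works for nominal data such as colors as well as for numbers.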


Measures of Central Tendency - Midrange

The midrange, or middle of the range, is the average of the highest and lowest values.

There is no distinct symbol for the midrange.

Midrange = $\frac{\text{highest value} + \text{lowest value}}{2}$


Measures of Central Tendency - Midrange

15 17 17 19 22 26 39

Midrange = $\frac{39 + 15}{2} = \frac{54}{2} = 27$


Comparing Measures of Central Tendency

15 17 17 19 22 26 39

Mean = 22.14
Median = 19
Mode = 17
Midrange = 27


Comparing Measures of Center

Measure of Center (listed from most used to least used) | Does it always exist? | Does it take into account every value? | Is it affected by extreme values?
Mean | Always | Yes | Yes
Median | Always | No | No
Mode | Might not exist, may have more than one | No | No
Midrange | Always | No | Yes
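
The last column of the comparison can be illustrated with a brief sketch that recomputes all four measures after swapping in an extreme value. The second list, with 99 in place of 39, is a hypothetical variation on the volunteer sample, and `midrange` and `summarize` are illustrative helper names; `statistics.mean`, `statistics.median`, and `statistics.mode` come from the Python standard library.

```python
import statistics


def midrange(values):
    """Average of the lowest and highest values."""
    return (min(values) + max(values)) / 2


def summarize(values):
    """Collect the four measures of center for one sample."""
    return {
        "mean": round(statistics.mean(values), 2),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "midrange": midrange(values),
    }


original = [15, 17, 17, 19, 22, 26, 39]
with_outlier = [15, 17, 17, 19, 22, 26, 99]  # hypothetical: 39 replaced by 99

before = summarize(original)
after = summarize(with_outlier)
for name in before:
    print(f"{name:>8}: {before[name]} -> {after[name]}")
```

In this sketch the mean and midrange shift noticeably while the median and mode stay at 19 and 17, matching the "Yes" and "No" entries in the table.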


The Range

• The range of a sample is the difference between the highest value and the lowest value.

15 17 17 19 22 26 39

In our example the Range = 39 – 15 or 24; there are 24 years between our youngest and oldest volunteers in the sample.


Measures of Variance

• Where measures of central tendency try to give us an idea of where the middle of the data lies, measures of variance (or variation) tell us about how the data is distributed around that center.

• Our three primary measures of variance are:
– Standard Deviation,
– Variance and
– Coefficient of Variation


Measures of Variance – Standard Deviation

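Sample Standard Deviation: $s = \sqrt{\frac{\sum (y - \bar{y})^2}{n - 1}}$

Population Standard Deviation: $\sigma = \sqrt{\frac{\sum (y - \mu)^2}{N}}$

Here $\bar{y}$ is the sample mean and $\mu$ is the population mean.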

The Standard Deviation is a measure of the variation of values around the mean.


Some Key Points for Understanding Standard Deviation

• The standard deviation is never negative; it is zero only when every value in the sample is the same.

• The standard deviation of a sample will always be in the same units as the observations in the sample.

• Extreme values or outliers can change the value of the standard deviation substantially.

• The size of the sample affects how reliably the sample standard deviation estimates the population value, but the standard deviation does not systematically decrease as the sample size increases; it is the standard error of the mean, $s / \sqrt{n}$, that decreases.


Measures of Variance - Variance

• The variance of a sample is just the standard deviation of the sample squared.

Sample Variance: $s^2 = \frac{\sum (y - \bar{y})^2}{n - 1}$

Population Variance: $\sigma^2 = \frac{\sum (y - \mu)^2}{N}$
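
The difference between the two variance formulas is only the divisor, which a minimal sketch can make concrete. Treating the seven volunteer ages as a complete population is an assumption made purely for comparison, and the function names are illustrative.

```python
ages = [15, 17, 17, 19, 22, 26, 39]


def sum_of_squared_deviations(values):
    """Sum of (y - mean)^2 over all values."""
    center = sum(values) / len(values)
    return sum((y - center) ** 2 for y in values)


def sample_variance(values):
    """Divide by n - 1 when the values are a sample."""
    return sum_of_squared_deviations(values) / (len(values) - 1)


def population_variance(values):
    """Divide by N when the values are the whole population."""
    return sum_of_squared_deviations(values) / len(values)


print(f"sample variance     = {sample_variance(ages):.2f}")      # 68.81
print(f"population variance = {population_variance(ages):.2f}")  # 58.98
```

Taking the square root of either result gives the corresponding standard deviation.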


Standard Deviation and Variance Notation

Sample: $s$ = standard deviation, $s^2$ = variance

Population: $\sigma$ = standard deviation, $\sigma^2$ = variance


Seeing Standard Deviations

• Once I figure out how to draw the curves, this will be a slide that shows the difference between a distribution with a small standard deviation (tall and narrow) and a large one (broad and flat).


Back to our example

• In our sample of volunteer ages, the mean was 22.14 years.

• We can calculate the standard deviation to better understand how the values are distributed around that mean.

15 17 17 19 22 26 39


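Back to our example

Sample Standard Deviation: $s = \sqrt{\frac{\sum (y - \bar{y})^2}{n - 1}}$

$y$    $\bar{y}$    $(y - \bar{y})$    $(y - \bar{y})^2$
15   22.14   -7.14    50.9796
17   22.14   -5.14    26.4196
17   22.14   -5.14    26.4196
19   22.14   -3.14     9.8596
22   22.14   -0.14     0.0196
26   22.14    3.86    14.8996
39   22.14   16.86   284.2596

$\sum (y - \bar{y})^2 = 412.8572$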


Back to our example

Sample Standard Deviation: $s = \sqrt{\frac{412.8572}{7 - 1}} = \sqrt{68.81} \approx 8.3$
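
The worked calculation can be retraced step by step with a short sketch. It keeps the mean at full precision, so the sum of squared deviations may differ from the slide's 412.8572 in the last decimal place (the slide rounds the mean to 22.14 first); the variable names are illustrative.

```python
import math

ages = [15, 17, 17, 19, 22, 26, 39]
n = len(ages)
mean_age = sum(ages) / n

# One row per observation: value, deviation from the mean, squared deviation.
rows = [(y, y - mean_age, (y - mean_age) ** 2) for y in ages]
for y, deviation, squared in rows:
    print(f"{y:>3} {deviation:>8.2f} {squared:>10.4f}")

# Add up the squared deviations, divide by n - 1, then take the square root.
sum_of_squares = sum(squared for _, _, squared in rows)
s = math.sqrt(sum_of_squares / (n - 1))

print(f"sum of squared deviations = {sum_of_squares:.4f}")
print(f"s = {s:.1f}")  # s = 8.3
```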


Copyright © 2004 Pearson Education, Inc.

How are standard deviations helpful?

The Empirical Rule

When data sets have distributions that are approximately bell shaped, the following is true:

• About 68% of all values fall within 1 standard deviation of the mean

• About 95% of all values fall within 2 standard deviations of the mean

• About 99.7% of all values fall within 3 standard deviations of the mean
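
One way to see the rule at work is to simulate bell-shaped data and count how many values land inside each band. This is only a sketch: the normal distribution with mean 100 and standard deviation 15, the sample size, and the seed are all arbitrary assumptions, not values from the slides.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable
# Hypothetical bell-shaped data: 100,000 draws from a normal distribution.
values = [random.gauss(100, 15) for _ in range(100_000)]

n = len(values)
mean = sum(values) / n
s = (sum((y - mean) ** 2 for y in values) / (n - 1)) ** 0.5


def share_within(k):
    """Fraction of values within k standard deviations of the mean."""
    lower, upper = mean - k * s, mean + k * s
    return sum(lower <= y <= upper for y in values) / n


shares = {k: share_within(k) for k in (1, 2, 3)}
for k, share in shares.items():
    print(f"within {k} standard deviation(s): {share:.1%}")
```

The printed shares should come out close to 68%, 95%, and 99.7%. For data that are not approximately bell shaped, the shares can differ considerably.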


The Empirical Rule

[Figure: bell curve with 34% of values on each side of the mean, so 68% of values fall within 1 standard deviation of the mean.]


The Empirical Rule

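[Figure: bell curve. 34% of values lie on each side of the mean within 1 standard deviation, so 68% of values fall within 1 standard deviation of the mean. A further 13.5% lie on each side between 1 and 2 standard deviations, so 95% of values fall within 2 standard deviations of the mean.]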


The Empirical Rule

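[Figure: bell curve. 34% of values lie on each side of the mean within 1 standard deviation, so 68% of values fall within 1 standard deviation of the mean. A further 13.5% lie on each side between 1 and 2 standard deviations, so 95% of values fall within 2 standard deviations of the mean. A further 2.4% lie on each side between 2 and 3 standard deviations, so 99.7% of values fall within 3 standard deviations of the mean.]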


Measures of Variance – Coefficient of Variation

• The Coefficient of Variation (CV) is a measure of the standard deviation of a sample relative to its mean: $CV = \frac{s}{\bar{y}} \times 100\%$

• CVs can be useful when you are comparing the standard deviations of variables that are in two different units.


Measures of Variance – Coefficient of Variation

An example: You are comparing the heights and weights of fourth graders.

Height: mean = 52”, s = 4”

Weight: mean = 80 lbs., s = 10 lbs.

Which variable has greater variation? How can we compare 4” to 10 lbs.?


Measures of Variance – Coefficient of Variation

Height: mean = 52”, s = 4”

$CV = \frac{s}{\bar{y}} \times 100\% = \frac{4}{52} \times 100\% \approx 8\%$

Weight: mean = 80 lbs., s = 10 lbs.

$CV = \frac{s}{\bar{y}} \times 100\% = \frac{10}{80} \times 100\% = 12.5\%$

The standard deviation of height is about 8% of the mean of height, whereas the standard deviation of weight is 12.5% of the mean of weight, so there is greater variation in the weight of the fourth graders than in the height.
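
The same comparison can be written as a minimal sketch, using the means and standard deviations from the fourth-grader example. The function name `coefficient_of_variation` is an illustrative choice.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return std_dev / mean * 100


height_cv = coefficient_of_variation(4, 52)    # inches
weight_cv = coefficient_of_variation(10, 80)   # pounds

print(f"height CV = {height_cv:.1f}%")  # height CV = 7.7%
print(f"weight CV = {weight_cv:.1f}%")  # weight CV = 12.5%
```

Because the units cancel in the ratio, the two percentages can be compared directly even though one variable is measured in inches and the other in pounds.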