Lecture 2 Describing Data II ©

Lecture 2 Describing Data II ©. Summarizing and Describing Data Frequency distribution and the shape of the distribution Frequency distribution and the.

Dec 17, 2015


Kelly Townsend
Transcript
Page 1:

Lecture 2

Describing Data II

©

Page 2:

Summarizing and Describing Data

Frequency distribution and the shape of the distribution

Measures of variability

Page 3:

1. Frequency distribution and the shape of the distribution

In the previous lecture, we saw that the mean of household savings gives an inflated picture of the savings of a “normal household”.

This was because the shape of the histogram was not symmetric.

It is important to look at how the observations are distributed.

Page 4:

Japanese household savings

[Figure: Histogram of Japanese Household Savings. Horizontal axis: savings in thousand yen; vertical axis: percentage (0 to 16). Sample average = 17,280,000 yen; median = 10,520,000 yen.]

Percentage in each savings range (thousand yen), as labeled on the bars:

below 2,000: 14.1
2,000-4,000: 10.6
4,000-6,000: 9.5
6,000-8,000: 8.2
8,000-10,000: 6.9
10,000-12,000: 6.2
12,000-14,000: 5.1
14,000-16,000: 4.5
16,000-18,000: 3.5
18,000-20,000: 3
20,000-22,000: 3
22,000-24,000: 2.7
24,000-26,000: 2
26,000-28,000: 2
28,000-30,000: 1.9
30,000-32,000: 1.7
32,000-34,000: 1.2
34,000-36,000: 1.3
36,000-38,000: 1
38,000-40,000: 1
Above 40,000: 10.7

Page 5:

1-1 Frequency Distribution

The frequency table that we used in the previous lecture is also called the frequency distribution. The term usually refers to how the observations are distributed. When we plot the frequency table, the plot is called a histogram.

A histogram usually shows the number of observations in each range. Sometimes, however, it shows the percentage of observations in each range.
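As a minimal sketch of how a frequency table with both counts and percentages can be built, consider the following. The ages, the bin settings, and the helper name `frequency_table` are hypothetical and chosen only for illustration; they are not the data from the lecture.

```python
def frequency_table(values, start, width, n_bins):
    """Count how many values fall in each range [lower, lower + width).

    Returns one (lower, upper, count, percent) row per range.
    Values outside the covered ranges are not counted in any row.
    """
    counts = [0] * n_bins
    for v in values:
        index = int((v - start) // width)
        if 0 <= index < n_bins:
            counts[index] += 1
    total = len(values)
    rows = []
    for i, count in enumerate(counts):
        lower = start + i * width
        rows.append((lower, lower + width, count, 100 * count / total))
    return rows


# Hypothetical ages of 10 clients (illustrative data only).
ages = [23, 27, 31, 34, 35, 38, 41, 44, 47, 52]
table = frequency_table(ages, start=20, width=10, n_bins=4)

for lower, upper, count, percent in table:
    print(f"{lower}-{upper}: count={count}, percent={percent:.0f}%")
```

Plotting the `count` column against the ranges gives a histogram of counts; plotting the `percent` column gives a histogram of percentages with the same shape.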

Page 6:

1-2 Shape of the Distribution

The shape of the distribution refers to the shape of the Histogram.

Page 7:

1-3 Symmetric Distribution

The shape of the distribution is said to be symmetric if the observations are balanced, or evenly distributed, about the mean. The shape of the distribution is symmetric if the shape of the histogram is symmetric.

Page 8:

Symmetric Distribution

[Figure: histogram titled “Symmetric Distribution”. Horizontal axis: values 1 to 9; vertical axis: frequency (0 to 10).]

Note: For a symmetric distribution, the mean and median are equal.

Page 9:

Symmetric Distribution: An example

The age distribution of the clients (from the previous lecture note) is nearly symmetric.

[Figure: Histogram. Horizontal axis: clients' age range; vertical axis: frequency (0 to 12). Frequencies labeled on the bars, left to right: 0, 0, 4, 5, 11, 11, 6, 4, 2, 0, 0.]

Page 10:

1-4 Skewed Distribution

A distribution is skewed if the observations are not symmetrically distributed above and below the mean. A positively skewed (or skewed to the right) distribution has a tail that extends to the right in the direction of positive values. A negatively skewed (or skewed to the left) distribution has a tail that extends to the left in the direction of negative values.

Page 11:

Positively skewed distribution

[Figure: histogram titled “Positively Skewed Distribution”. Horizontal axis: values 1 to 9; vertical axis: frequency (0 to 12).]

Page 12:

Positively skewed distribution: An example

The household saving histogram (from the previous lecture) is an example of a positively skewed distribution.

[Figure: Histogram of Japanese Household Savings (the same figure as on Page 4). Horizontal axis: savings in thousand yen; vertical axis: percentage. Sample average = 17,280,000 yen; median = 10,520,000 yen.]

Page 13:

Positively skewed distribution: A note

For a positively skewed distribution, the mean is typically greater than the median.

Page 14:

Negatively skewed distribution

[Figure: histogram titled “Negatively Skewed Distribution”. Horizontal axis: values 1 to 9; vertical axis: frequency (0 to 12).]

Note: For a negatively skewed distribution, the mean is typically less than the median.
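The notes on the mean and the median in this section can be checked numerically. The sketch below uses three hypothetical frequency tables on the values 1 to 9 (not the data behind the lecture's figures), and the helper `expand` is introduced only for this illustration.

```python
import statistics


def expand(frequencies):
    """Turn a {value: frequency} table into a flat list of observations."""
    return [value for value, freq in frequencies.items() for _ in range(freq)]


# Hypothetical frequency tables (value -> number of observations).
symmetric = expand({1: 1, 2: 2, 3: 4, 4: 7, 5: 9, 6: 7, 7: 4, 8: 2, 9: 1})
positive = expand({1: 10, 2: 8, 3: 6, 4: 4, 5: 3, 6: 2, 7: 1, 8: 1, 9: 1})
negative = expand({1: 1, 2: 1, 3: 1, 4: 2, 5: 3, 6: 4, 7: 6, 8: 8, 9: 10})

for name, data in [("symmetric", symmetric),
                   ("positively skewed", positive),
                   ("negatively skewed", negative)]:
    mean = statistics.mean(data)
    median = statistics.median(data)
    print(f"{name}: mean={mean:.2f}, median={median:.2f}")
```

In this example the long right tail pulls the mean above the median, and the long left tail pulls it below, while the symmetric table gives the same value for both.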

Page 15:

2. Measures of Variability

Variance

Standard deviation

Page 16:

Example

The data set “Sales at two different stores” contains daily sales data for two different stores. The data were collected for 60 days.

Store A’s average daily sales are 231,800 yen. Store B’s average daily sales are 230,500 yen.

Can we say that they are similar stores? Look at the following graphs.

Page 17:

Daily sales of the two stores

[Figure: Store A: Daily Sales. Horizontal axis: day (0 to 70); vertical axis: daily sales in 1,000 yen (0.0 to 450.0). Average = 231,800 yen.]

[Figure: Store B: Daily Sales. Horizontal axis: day (0 to 70); vertical axis: daily sales in 1,000 yen (0.0 to 450.0). Average = 230,500 yen.]

Page 18:

Daily sales of the two stores

The difference between the two stores is that Store A’s sales vary much more than Store B’s sales.

We need a measure of variability in data.

Page 19:

2-1 How to measure the variability (1)

Take Store A’s data as an example. The variability of each observation can be seen from the difference between the observation and the mean.

But how do we measure the overall variability of the data?

[Figure: Store A: Daily Sales. Horizontal axis: day (0 to 70); vertical axis: daily sales in 1,000 yen (0.0 to 450.0). Average = 231,800 yen. Annotation: For each observation, you can compute the difference from the average.]

Page 20:

How to measure the variability (2)

Overall variability

How about taking the average of all differences?

This is not a good idea: the differences can be positive or negative, and they always sum to zero.

Therefore, we take the square of each difference. This is the first step in computing the “Variance”, a measure of overall variability.

[Figure: Store A: Daily Sales. Horizontal axis: day (0 to 70); vertical axis: daily sales in 1,000 yen (0.0 to 450.0). Average = 231,800 yen. Annotation: For each observation, you can compute the difference from the average.]
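A small numerical check of why the plain average of the differences does not work, and why squaring helps. The five daily sales figures below are hypothetical (in thousand yen) and are not taken from the lecture's data file.

```python
# Hypothetical daily sales in thousand yen (illustrative data only).
sales = [180, 250, 310, 150, 270]

mean = sum(sales) / len(sales)

# Difference between each observation and the mean.
deviations = [x - mean for x in sales]

# Positive and negative differences cancel out.
sum_of_deviations = sum(deviations)

# Squared differences are never negative, so they cannot cancel.
squared_deviations = [d ** 2 for d in deviations]
sum_of_squares = sum(squared_deviations)

print("mean:", mean)
print("deviations:", deviations)
print("sum of deviations:", sum_of_deviations)
print("sum of squared deviations:", sum_of_squares)
```

The sum of the deviations is zero for any data set, because the mean is defined as the sum of the observations divided by their number.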

Page 21:

2-2 Variance: A measure of variability

Variance is computed in the following way.

1. Subtract the mean from each observation (compute the difference between each observation and the mean; note that the difference can be negative).
2. Square each difference.
3. Sum all the squared differences.
4. Divide the sum of squared differences by n-1 (the number of observations minus 1).

We will learn why we divide the sum of squares by n-1 after we learn the concept of expectation.
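The four steps above can be sketched in Python as follows. The function name `sample_variance` and the five sales figures are hypothetical; the result is compared with `statistics.variance` from the Python standard library, which also divides by n-1.

```python
import statistics


def sample_variance(data):
    """Compute the variance with the four steps, dividing by n-1."""
    n = len(data)
    if n < 2:
        raise ValueError("at least two observations are needed")
    mean = sum(data) / n
    differences = [x - mean for x in data]   # step 1: subtract the mean
    squared = [d ** 2 for d in differences]  # step 2: square each difference
    total = sum(squared)                     # step 3: sum the squares
    return total / (n - 1)                   # step 4: divide by n-1


# Hypothetical daily sales in thousand yen (illustrative data only).
sales = [180, 250, 310, 150, 270]

variance = sample_variance(sales)
print("step-by-step variance:", variance)
print("statistics.variance:  ", statistics.variance(sales))
```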

Page 22:

Computation of the variance: Exercise

Open the data “Computation of Variance”, and compute the variance of Store A’s daily sales.

Compute the variance of Store B’s daily sales.

Page 23:

Computation of the variance: Exercise

Store A: Average daily sales = 231.8 thousand yen; Variance = 4979.9
Store B: Average daily sales = 230.5 thousand yen; Variance = 335.9

Notice that the variance for Store A is higher than that for Store B. This is because the variation in daily sales is higher for Store A.

Page 24:

Variance: note

In the previous slide, we did not use any unit of measurement for the variance. (For example, we do not say that the variance for Store A is 4979.9 thousand yen.)

This is because, when we compute the variance, we square the data. Therefore, the unit of measurement for the variance is “square of thousand yen”, which is not a meaningful unit.

For this reason, we also use the standard deviation, another measure of variation.

Page 25:

2-3 A measure of variability: Standard deviation

Standard deviation is the square root of the variance.

$$\text{Standard deviation} = \sqrt{\text{Variance}}$$

Exercise: Compute the standard deviation of the daily sales for Store A and Store B.

Page 26:

Standard Deviation: Store sales data example

Standard deviation of Store A’s daily sales = 70.57 thousand yen.

Standard deviation of Store B’s daily sales = 18.33 thousand yen.

Roughly speaking, this means that Store A’s daily sales typically deviate from the average by about 70.6 thousand yen, and Store B’s daily sales by about 18.3 thousand yen.
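The two standard deviations can be reproduced from the variances reported earlier in the lecture (4979.9 for Store A and 335.9 for Store B) by taking square roots. This is only a quick check; the variable names are chosen for this sketch.

```python
import math

# Variances reported in the lecture (unit: squared thousand yen).
variance_a = 4979.9
variance_b = 335.9

# The standard deviation is the square root of the variance,
# so its unit is thousand yen again.
sd_a = math.sqrt(variance_a)
sd_b = math.sqrt(variance_b)

print(f"Store A: {sd_a:.2f} thousand yen")
print(f"Store B: {sd_b:.2f} thousand yen")
```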

Page 27:

Standard deviation and variance as measures of risk (or uncertainty)

Often standard deviation and variance are used as measures of uncertainty or risk.

If you would like to work as a store manager, then Store B may be the better store to work for: although its average sales are almost the same as Store A’s, the uncertainty is lower (lower standard deviation).

Page 28:

Standard deviation and variance as measures of risk (or uncertainty)

In the store sales data, the average sales for both stores are similar.

However, on many other occasions, higher return (higher average sales) comes with higher risk (higher standard deviation).

One makes a decision by choosing a good combination of return and risk. For example, if you invest in a stock, you would choose a stock with a combination of return and risk that suits your preference.

Therefore, standard deviation and variance are important numerical summaries of data for decision-making purposes.

Page 29:

2-4. Understanding the mathematical notation of the variance

Most of the time, we only have sample data (not population data).

Variance computed from a sample is called sample variance. We denote sample variance by $s^2$.

When we have population data (which does not happen often), we can compute the population variance. We denote the population variance by $\sigma^2$.

Page 30:

Understanding the mathematical notation of sample variance

Observation id    Variable X
1                 $x_1$
2                 $x_2$
3                 $x_3$
:                 :
n                 $x_n$

The typical data we use comes in this format. Using this format, we would like to represent variance in a mathematical form.

Page 31:

Understanding the mathematical notation of sample variance

Obs id     Variable X    Each data - the mean    (Each data - the mean)^2
1          $x_1$         $x_1 - \bar{X}$         $(x_1 - \bar{X})^2$
2          $x_2$         $x_2 - \bar{X}$         $(x_2 - \bar{X})^2$
3          $x_3$         $x_3 - \bar{X}$         $(x_3 - \bar{X})^2$
:          :             :                       :
n          $x_n$         $x_n - \bar{X}$         $(x_n - \bar{X})^2$
Average    $\bar{X}$

The first steps of computing the variance are written in the table.

The variance can be computed by summing the last column and dividing the sum by (n-1).

Therefore, mathematically, the sample variance, $s^2$, can be written as shown on the next page.

Page 32:

Understanding the mathematical notation for sample variance

Mathematically, the sample variance, denoted as $s^2$, can be written as

$$s^2 = \frac{(x_1-\bar{X})^2 + (x_2-\bar{X})^2 + (x_3-\bar{X})^2 + \cdots + (x_n-\bar{X})^2}{n-1} = \frac{\sum_{i=1}^{n}(x_i-\bar{X})^2}{n-1}$$

Page 33:

Mathematical notation for population variance

Though not often, we may have population data. Then we can compute the population variance. We use the notation $\sigma^2$ to denote the population variance. We also use upper case $N$ to denote the number of observations. The mathematical notation for the population variance is

$$\sigma^2 = \frac{(x_1-\mu)^2 + (x_2-\mu)^2 + (x_3-\mu)^2 + \cdots + (x_N-\mu)^2}{N} = \frac{\sum_{i=1}^{N}(x_i-\mu)^2}{N}$$

where $\mu$ is the population mean.

Unlike the case for sample variance, we do not have to divide the sum of squares by N-1. We simply divide it by N.
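The difference between the two divisors can be seen with the Python standard library, where `statistics.variance` divides by n-1 (sample variance) and `statistics.pvariance` divides by N (population variance). The eight numbers below are hypothetical and serve only as an illustration.

```python
import statistics

# Hypothetical data: mean is 5, sum of squared differences is 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Treating the data as a sample: divide by n - 1 = 7.
sample_var = statistics.variance(data)

# Treating the data as the whole population: divide by N = 8.
population_var = statistics.pvariance(data)

print("sample variance:    ", sample_var)
print("population variance:", population_var)
```

For the same numbers, the sample variance is always at least as large as the population variance, because the same sum of squares is divided by a smaller number.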

Page 34:

2-5. Mathematical notation for the sample standard deviation

The sample standard deviation, $s$, is written as

$$s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n}(x_i-\bar{X})^2}{n-1}}$$

Page 35:

Mathematical Notation for population standard deviation

The population standard deviation, $\sigma$, is written as

$$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum_{i=1}^{N}(x_i-\mu)^2}{N}}$$

Page 36:

2-6. Short-cut formula for sample variance

The short-cut formula for the sample variance is:

$$s^2 = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{X}^2}{n-1}$$
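A minimal sketch comparing the short-cut formula with the definition of the sample variance. The function names and the five sales figures are hypothetical and are not the Store A data used in the exercise.

```python
def variance_by_definition(data):
    """Sum of squared differences from the mean, divided by n-1."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / (n - 1)


def variance_by_shortcut(data):
    """Sum of squares minus n times the squared mean, divided by n-1."""
    n = len(data)
    mean = sum(data) / n
    return (sum(x ** 2 for x in data) - n * mean ** 2) / (n - 1)


# Hypothetical daily sales in thousand yen (illustrative data only).
sales = [180, 250, 310, 150, 270]

by_definition = variance_by_definition(sales)
by_shortcut = variance_by_shortcut(sales)

print("definition:", by_definition)
print("short-cut: ", by_shortcut)
```

The two results agree because expanding the square gives $\sum (x_i-\bar{X})^2 = \sum x_i^2 - n\bar{X}^2$. On a computer, the short-cut form can lose some precision when the values are very large compared with their spread, so the definition is often the safer choice in code.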

Page 37:

Exercise

Compute the variance for the sales of Store A by applying the short-cut formula for sample variance, and show that this indeed coincides with our previous calculation.

Page 38:

Other Measures of Variability

1. The Range

The range in a set of data is the difference between the largest and smallest observations.
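As a short illustration of the definition, the range can be computed as follows. The helper name `data_range` and the sales figures are hypothetical.

```python
def data_range(data):
    """Difference between the largest and the smallest observation."""
    return max(data) - min(data)


# Hypothetical daily sales in thousand yen (illustrative data only).
sales = [180, 250, 310, 150, 270]

sales_range = data_range(sales)
print("range:", sales_range)
```

Note that the range depends only on the two most extreme observations, so a single unusual day can change it a lot.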

Page 39:

Other Measures of Central Tendency

2. Mode

The mode, if one exists, is the most frequently occurring observation in the sample or population.
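A minimal sketch of finding the mode by counting. The helper `modes` is hypothetical, and it adopts one possible convention for “if one exists”: when every value occurs only once, it reports no mode; when several values tie for the highest count, it reports all of them.

```python
from collections import Counter


def modes(data):
    """Return the most frequent value(s), or [] if no value repeats."""
    if not data:
        return []
    counts = Counter(data)
    highest = max(counts.values())
    if highest == 1:
        return []  # assumption: no repeated value means no mode
    return [value for value, count in counts.items() if count == highest]


print(modes([3, 5, 5, 7, 8]))   # one value occurs most often
print(modes([1, 1, 2, 2, 3]))   # two values tie
print(modes([1, 2, 3]))         # no value repeats
```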

Page 40:

This lecture note covers:

Textbook P23~P28: Frequency distribution

Textbook 3.1, 3.2: Measures of central tendency and variability