Lecture 2 Describing Data II ©. Summarizing and Describing Data Frequency distribution and the shape of the distribution Frequency distribution and the.

Post on 17-Dec-2015


Lecture 2

Describing Data II

Summarizing and Describing Data

Frequency distribution and the shape of the distribution

Measures of variability

1. Frequency distribution and the shape of the distribution

In the previous lecture, we saw that the mean of household savings gives an inflated picture of what a “normal household” saves.

This was because the shape of the histogram was not symmetric.

It is important to look at how the observations are distributed.

Japanese household savings

[Figure: Histogram of Japanese Household Savings. Horizontal axis: savings in thousand yen, in classes of width 2,000 from “below 2,000” to “above 40,000”. Vertical axis: percentage. Bar labels (percentage) as extracted: 14.1, 10.6, 9.5, 8.2, 6.9, 6.2, 5.1, 4.5, 3.5, 3, 3, 2.7, 2, 2, 1.9, 1.7, 1.2, 1.3, 1, 1, 10.7. Sample average = 17,280,000; median = 10,520,000.]

1-1 Frequency Distribution

The frequency table that we used in the previous lecture is also called the frequency distribution. A frequency distribution describes how the observations are distributed. When we plot the frequency table, the plot is called a histogram.

A histogram usually shows the number of observations in a specific range. However, sometimes, it shows the percentage of observations in a specific range.
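The two kinds of histogram described above (counts and percentages) can be sketched in Python as follows. The ages and the class width of 10 are hypothetical values chosen for illustration; they are not the data from the lecture.

```python
from collections import Counter

# Hypothetical ages of 12 clients (illustrative data, not from the lecture).
ages = [23, 27, 31, 34, 35, 38, 41, 42, 44, 47, 52, 58]

CLASS_WIDTH = 10  # assumed width of each class (range)


def class_start(value, width=CLASS_WIDTH):
    """Return the lower bound of the class that contains value."""
    return (value // width) * width


# Frequency distribution: number of observations in each class.
frequency = Counter(class_start(a) for a in ages)

# Relative frequency distribution: percentage of observations in each class.
n = len(ages)
percentage = {start: 100 * count / n for start, count in frequency.items()}

for start in sorted(frequency):
    label = f"{start}-{start + CLASS_WIDTH - 1}"
    print(f"{label}: count={frequency[start]}, percent={percentage[start]:.1f}")
```

Both versions have the same shape; only the scale of the vertical axis changes.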

1-2 Shape of the Distribution

The shape of the distribution refers to the shape of the Histogram.

1-3 Symmetric Distribution

The shape of the distribution is said to be symmetric if the observations are balanced, or evenly distributed, about the mean. Equivalently, the distribution is symmetric if the shape of the histogram is symmetric.

Symmetric Distribution

[Figure: Symmetric Distribution. Horizontal axis: values 1 to 9; vertical axis: frequency.]

Note: For a symmetric distribution, the mean and median are equal.
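A quick numerical check of this note, as a minimal sketch. The frequencies below are hypothetical and were chosen to be symmetric about 5; they are not read off the chart.

```python
from statistics import mean, median

# Hypothetical frequencies for the values 1..9, symmetric about 5
# (illustrative only; not taken from the lecture's chart).
frequency = {1: 1, 2: 2, 3: 4, 4: 7, 5: 9, 6: 7, 7: 4, 8: 2, 9: 1}

# Expand the frequency table back into individual observations.
observations = [value for value, count in frequency.items() for _ in range(count)]

# For this symmetric data set the mean and the median coincide.
print(mean(observations), median(observations))
```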

Symmetric Distribution: An example

The age distribution of the clients (from the previous lecture note) is nearly symmetric.

[Figure: Histogram of clients' ages. Horizontal axis: clients' age range; vertical axis: frequency (0 to 12). Bar labels as extracted: 0, 0, 4, 5, 11, 11, 6, 4, 2, 0, 0.]

1-4 Skewed Distribution

A distribution is skewed if the observations are not symmetrically distributed above and below the mean. A positively skewed (or skewed to the right) distribution has a tail that extends to the right, in the direction of positive values. A negatively skewed (or skewed to the left) distribution has a tail that extends to the left, in the direction of negative values.

Positively skewed distribution

[Figure: Positively Skewed Distribution. Horizontal axis: values 1 to 9; vertical axis: frequency.]

Positively skewed distribution: An example

The household saving histogram (from the previous lecture) is an example of a positively skewed distribution.

[Figure: Histogram of Japanese Household Savings (savings in thousand yen against percentage). Sample average = 17,280,000; median = 10,520,000.]

Positively skewed distribution: A note

For a positively skewed distribution the mean is greater than the median.

Negatively skewed distribution

[Figure: Negatively Skewed Distribution. Horizontal axis: values 1 to 9; vertical axis: frequency.]

Note: For a negatively skewed distribution, the mean is less than the median.
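The relationship between the mean and the median under skewness can be illustrated with a small sketch. The data are hypothetical. The pattern shown is the typical one for a single-peaked distribution with one long tail, where the extreme values in the tail pull the mean toward them while the median barely moves.

```python
from statistics import mean, median

# Hypothetical data with a long right tail (positively skewed).
right_skewed = [1, 2, 2, 3, 3, 3, 4, 4, 5, 20]

# Mirror image of the same data: a long left tail (negatively skewed).
left_skewed = [21 - x for x in right_skewed]

# Right tail: the mean is pulled above the median.
print(mean(right_skewed), median(right_skewed))

# Left tail: the mean is pulled below the median.
print(mean(left_skewed), median(left_skewed))
```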

2. Measures of Variability

Variance

Standard deviation

Example

The data set “Sales at two different stores” contains daily sales data for two different stores. The data are collected for 60 days.

Store A’s average daily sales is 231,800 yen. Store B’s average daily sales is 230,500 yen.

Can we say that they are similar stores? Look at the following graphs.

Daily sales of the two stores

[Figure: Store A: Daily Sales. Horizontal axis: day; vertical axis: daily sales in 1000 yen (0 to 450). Average = 231,800 yen.]

[Figure: Store B: Daily Sales. Horizontal axis: day; vertical axis: daily sales in 1000 yen (0 to 450). Average = 230,500 yen.]

The difference between the two stores is that Store A’s sales vary much more from day to day than Store B’s sales.

We need a measure of variability in data.

2-1 How to measure the variability (1)

Take Store A’s data as an example. The variability of each observation can be seen from the difference between the observation and the mean.

But, how do we measure the overall variability of the data?

[Figure: Store A: Daily Sales. Horizontal axis: day; vertical axis: daily sales in 1000 yen (0 to 450). Average = 231,800 yen. Annotation: for each observation, you can compute the difference from the average.]

How to measure the variability (2)

Overall variability

How about taking the average of all differences?

This is not a good idea: the differences can be positive or negative, and the positive and negative differences from the mean cancel each other, so they always sum to zero.

Therefore, we take the square of each difference. This is the first step to compute the “Variance”, a measure of overall variability.
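The cancellation described above can be seen in a minimal sketch. The six sales figures are hypothetical (in thousand yen) and stand in for the lecture's 60-day data set.

```python
# Hypothetical daily sales in thousand yen (illustrative only).
sales = [150, 310, 200, 290, 180, 250]

average = sum(sales) / len(sales)

# Difference between each observation and the mean.
differences = [x - average for x in sales]

# Positive and negative differences cancel each other.
print(sum(differences))

# Squared differences are never negative, so they cannot cancel.
squared = [d ** 2 for d in differences]
print(sum(squared))
```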

[Figure: Store A: Daily Sales, with the difference of each observation from the average (231,800 yen) marked.]

2-2 Variance: A measure of variability

Variance is computed in the following way.

1. Subtract the mean from each observation (compute the difference between each observation and the mean; note that the difference can be negative).

2. Then, square each difference.

3. Sum all the squared differences.

4. Divide the sum of squared differences by n-1 (the number of observations minus 1).

We will learn the reason why we divide the sum of squares by n-1 after we learn the concept of the expectation.
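The four steps can be sketched in Python as below. The function name `sample_variance` and the six sales figures are assumptions made for illustration; the result is cross-checked against `statistics.variance` from the standard library, which also divides by n-1.

```python
from statistics import variance


def sample_variance(data):
    """Sample variance, following the four steps listed in the lecture."""
    n = len(data)
    if n < 2:
        raise ValueError("at least two observations are needed")
    mean = sum(data) / n
    differences = [x - mean for x in data]   # step 1: subtract the mean
    squared = [d ** 2 for d in differences]  # step 2: square each difference
    total = sum(squared)                     # step 3: sum the squares
    return total / (n - 1)                   # step 4: divide by n-1


# Hypothetical daily sales in thousand yen (illustrative only).
sales = [150, 310, 200, 290, 180, 250]

print(sample_variance(sales))  # 4040.0
print(variance(sales))         # same value from the standard library
```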

Computation of the variance: Exercise

Open the data “Computation of Variance”, and compute the variance of Store A’s daily sales.

Compute the variance of Store B’s daily sales.

Store A: Average daily sales = 231.8 thousand yen, Variance = 4979.9

Store B: Average daily sales = 230.5 thousand yen, Variance = 335.9

Notice that the variance for Store A is higher than that for Store B. This is because the variation in the daily sales is higher for Store A.

Variance: note

In the previous slide, we did not use any unit of measurement for variance. (For example, we do not say that the variance for Store A is 4979.9 thousand yen.)

This is because, when we compute the variance, we square the data. Therefore, the unit of measurement for variance is “square of thousand yen”, which is not a meaningful unit.

Therefore, we use the Standard Deviation, another measure of variation.

2-3 A measure of variability: Standard deviation

Standard deviation is the square root of the variance.

$$\text{Standard Deviation} = \sqrt{\text{Variance}}$$

Exercise: Compute the standard deviation of the daily sales for Store A and Store B.
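The point about units can be made concrete with a short sketch. It records the same hypothetical sales once in thousand yen and once in yen: the variance changes by a factor of 1000 squared, while the standard deviation changes by a factor of 1000, just like the data themselves.

```python
from statistics import stdev, variance

# Hypothetical daily sales (illustrative only), in thousand yen and in yen.
sales_thousand_yen = [150, 310, 200, 290, 180, 250]
sales_yen = [x * 1000 for x in sales_thousand_yen]

# Variance is in squared units, so it scales by 1000 ** 2.
variance_ratio = variance(sales_yen) / variance(sales_thousand_yen)
print(variance_ratio)

# Standard deviation is in the original units, so it scales by 1000.
stdev_ratio = stdev(sales_yen) / stdev(sales_thousand_yen)
print(stdev_ratio)
```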

Standard Deviation: Store sales data example

Standard deviation of Store A’s daily sales=70.57 thousand yen.

Standard deviation for Store B’s daily sales= 18.33 thousand yen.

This means that, roughly speaking, Store A’s daily sales typically deviate from their average by about 70.6 thousand yen, and Store B’s daily sales by about 18.3 thousand yen.
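The reported standard deviations can be checked directly from the variances given earlier in the lecture (4979.9 for Store A and 335.9 for Store B), since the standard deviation is the square root of the variance. The variable names are chosen here for illustration.

```python
import math

# Variances reported in the lecture (unit: thousand yen, squared).
variance_a = 4979.9
variance_b = 335.9

sd_a = math.sqrt(variance_a)
sd_b = math.sqrt(variance_b)

print(round(sd_a, 2))  # 70.57
print(round(sd_b, 2))  # 18.33
```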

Standard deviation and variance as measures of risk (or uncertainty)

Often standard deviation and variance are used as measures of uncertainty or risk.

If you would like to work as a store manager, then Store B may be the better store to work for: although its average sales are almost the same as Store A’s, the uncertainty is lower (lower standard deviation).

In the store sales data, the average sales for both stores are similar. However, on many other occasions, higher return (higher average sales) comes with higher risk (higher standard deviation).

One makes a decision by choosing a good combination of return and risk. For example, if you invest in a stock, you would choose a stock with a combination of return and risk that suits your preference.

Therefore, standard deviation and variance are important numerical summaries of data for decision making.

2-4. Understanding the mathematical notation of the variance

Most of the time, we only have sample data (not population data).

Variance computed from a sample is called sample variance. We denote sample variance by $s^2$.

When we have population data (which does not happen often), we can compute the population variance. We denote the population variance by $\sigma^2$.

Understanding the mathematical notation of sample variance

Observation id    Variable X
1                 $x_1$
2                 $x_2$
3                 $x_3$
...               ...
n                 $x_n$

The typical data we use come in this format. Using this format, we would like to represent variance in a mathematical form.

Obs id     Variable X    Each data - the mean    (Each data - the mean)^2
1          $x_1$         $x_1 - \bar{X}$         $(x_1 - \bar{X})^2$
2          $x_2$         $x_2 - \bar{X}$         $(x_2 - \bar{X})^2$
3          $x_3$         $x_3 - \bar{X}$         $(x_3 - \bar{X})^2$
:          :             :                       :
n          $x_n$         $x_n - \bar{X}$         $(x_n - \bar{X})^2$
Average    $\bar{X}$

The first steps of computing the variance are written in the table.

The variance can be computed by summing the last column and dividing the sum by (n-1).

Therefore, mathematically, the sample variance, denoted $s^2$, can be written as

$$s^2 = \frac{(x_1 - \bar{X})^2 + (x_2 - \bar{X})^2 + (x_3 - \bar{X})^2 + \cdots + (x_n - \bar{X})^2}{n-1} = \frac{\sum_{i=1}^{n} (x_i - \bar{X})^2}{n-1}$$

Mathematical notation for population variance

Though not often, we may have population data. Then we can compute the population variance. We use the notation $\sigma^2$ to denote the population variance. We also use upper case $N$ to denote the number of observations. The mathematical notation for the population variance, where $\mu$ is the population mean, is

$$\sigma^2 = \frac{(x_1 - \mu)^2 + (x_2 - \mu)^2 + (x_3 - \mu)^2 + \cdots + (x_N - \mu)^2}{N} = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$

Unlike the case for sample variance, we do not have to divide the sum of squares by N-1. We simply divide it by N.
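The difference between the two divisors can be seen with Python's standard library, where `statistics.variance` divides by n-1 (sample variance) and `statistics.pvariance` divides by N (population variance). The data are hypothetical and are used once as if they were a sample and once as if they were a whole population.

```python
from statistics import pvariance, variance

# Hypothetical daily sales in thousand yen (illustrative only).
data = [150, 310, 200, 290, 180, 250]

sample_var = variance(data)       # sum of squares divided by n-1 = 5
population_var = pvariance(data)  # sum of squares divided by N = 6

print(sample_var)
print(population_var)
```

The sum of squared differences is the same in both cases (20200 here); only the divisor differs, so the population formula always gives the smaller value.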

2-5. Mathematical notation for the sample standard deviation

The sample standard deviation, $s$, is written as

$$s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{X})^2}{n-1}}$$

Mathematical notation for population standard deviation

The population standard deviation, $\sigma$, is written (with $\mu$ the population mean) as

$$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$$

2-6. Short-cut formula for sample variance

The short-cut formula for the sample variance is:

$$s^2 = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{X}^2}{n-1}$$
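The lecture does not show why the short-cut formula agrees with the definition. One way to see it is to expand the square in the numerator and use the fact that $\sum_{i=1}^{n} x_i = n\bar{X}$:

```latex
\begin{aligned}
\sum_{i=1}^{n} (x_i - \bar{X})^2
  &= \sum_{i=1}^{n} x_i^2 - 2\bar{X}\sum_{i=1}^{n} x_i + n\bar{X}^2 \\
  &= \sum_{i=1}^{n} x_i^2 - 2n\bar{X}^2 + n\bar{X}^2 \\
  &= \sum_{i=1}^{n} x_i^2 - n\bar{X}^2
\end{aligned}
```

Dividing both sides by $n-1$ gives the short-cut formula.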

Exercise

Compute the variance for the sales of Store A by applying the short-cut formula for sample variance, and show that this indeed coincides with our previous calculation.
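The same kind of check can be sketched on a small data set. The six sales figures are hypothetical, so this does not give away the answer for Store A; it only shows that the definition and the short-cut formula give the same number.

```python
# Hypothetical daily sales in thousand yen (illustrative only).
sales = [150, 310, 200, 290, 180, 250]

n = len(sales)
mean = sum(sales) / n

# Definition: sum of squared differences from the mean, divided by n-1.
by_definition = sum((x - mean) ** 2 for x in sales) / (n - 1)

# Short-cut: sum of squared observations minus n times the squared mean,
# divided by n-1.
by_shortcut = (sum(x ** 2 for x in sales) - n * mean ** 2) / (n - 1)

print(by_definition, by_shortcut)
```

With floating-point data, the short-cut form subtracts two large numbers of similar size and can lose precision, so results should be compared with a tolerance, not with exact equality.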

Other Measures of Variability

1. The Range

The range in a set of data is the difference between the largest and smallest observations.
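As a minimal sketch with hypothetical sales figures (in thousand yen), the range takes one line:

```python
# Hypothetical daily sales in thousand yen (illustrative only).
sales = [150, 310, 200, 290, 180, 250]

# Range: largest observation minus smallest observation.
data_range = max(sales) - min(sales)
print(data_range)  # 160
```

The range depends only on the two extreme observations, so a single unusual day can change it a lot.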

Other Measures of Central Tendency

2. Mode

The mode, if one exists, is the most frequently occurring observation in the sample or population.
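A short sketch of finding the mode with `statistics.multimode` (Python 3.8 or later), which returns every value tied for the highest frequency. The data are hypothetical.

```python
from statistics import multimode

# Hypothetical client ages (illustrative only): 41 occurs most often.
ages = [34, 41, 38, 41, 29, 38, 41]
print(multimode(ages))  # [41]

# Two values tied for the highest frequency: both are returned.
tied = [1, 1, 2, 2, 3]
print(multimode(tied))  # [1, 2]
```

When several values are returned, the data have no single mode, which is the situation the phrase “if one exists” allows for.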

This lecture note covers:

Textbook P23~P28: Frequency distribution

Textbook 3.1, 3.2: Measures of central tendency and variability
