1 Numerical Summary Measures Lecture 03: Measures of Variation and Interpretation, and Measures of Relative Position

Dec 22, 2015

Numerical Summary Measures

Lecture 03: Measures of Variation and Interpretation, and Measures of Relative Position

Measures of Variation

• Consider the following three data sets:

– Data 1: 1, 2, 3, 4, 5

– Data 2: 1, 1, 3, 5, 5

– Data 3: 3, 3, 3, 3, 3

• For all three data sets, the mean and the median are identical (both equal 3).

• But they are different data sets!

• Hence the need to measure the variation in the data.
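
A quick check of this claim, as a minimal Python sketch. The names `data_sets`, `centers`, and `deviations` and the use of the standard `statistics` module are illustrative choices, not part of the lecture. The three data sets share the same mean and median, while the deviations of the observations from the mean differ.

```python
from statistics import mean, median

# The three data sets from the slide.
data_sets = {
    "Data 1": [1, 2, 3, 4, 5],
    "Data 2": [1, 1, 3, 5, 5],
    "Data 3": [3, 3, 3, 3, 3],
}

# Mean and median of each data set.
centers = {name: (mean(xs), median(xs)) for name, xs in data_sets.items()}

# Deviations of each observation from its own sample mean.
deviations = {
    name: [x - mean(xs) for x in xs] for name, xs in data_sets.items()
}

for name in data_sets:
    print(name, "mean/median:", centers[name], "deviations:", deviations[name])
```

The centers agree, but the deviations are spread differently, which is what a measure of variation should capture.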

On the Perils of an “Average Value”

• Situation: A man has his head in a very hot compartment, and his feet are feeling very cold.

• Question: Mr., how are you feeling?

• Reply: Oh, on the average, I am just fine! …

• Crash! Dead!

Sample Variance

• To measure the degree of variation, one could look at the deviations of the observations from their sample mean.

• The sample variance, denoted by S², is defined to be the ‘average’ of the squared deviations of the observations from their sample mean:

$$S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2$$
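
The definitional formula can be sketched directly in Python. The function name `sample_variance` is a hypothetical helper introduced here for illustration; it uses the divisor n - 1, as in the definition, and is applied to the three small data sets from the start of the lecture.

```python
def sample_variance(xs):
    """Sample variance via the definitional formula (divisor n - 1)."""
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two observations")
    x_bar = sum(xs) / n  # sample mean
    # Sum of squared deviations from the sample mean, divided by n - 1.
    return sum((x - x_bar) ** 2 for x in xs) / (n - 1)


print(sample_variance([1, 2, 3, 4, 5]))  # 2.5
print(sample_variance([1, 1, 3, 5, 5]))  # 4.0
print(sample_variance([3, 3, 3, 3, 3]))  # 0.0
```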

Computational Formula

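• The definitional formula is not very efficient for computing the sample variance.

• The computational formula is often used instead:

$$S^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} X_i^2 - n\bar{X}^2\right] = \frac{1}{n-1}\left[\sum_{i=1}^{n} X_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} X_i\right)^2\right]$$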
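
As a rough sketch, the computational formula needs only the sum of the observations and the sum of their squares. The function name `sample_variance_computational` is a hypothetical helper, and the comparison against `statistics.variance` from the Python standard library is added here only as a sanity check.

```python
import math
from statistics import variance


def sample_variance_computational(xs):
    """Sample variance from the sum of X and the sum of X^2 (divisor n - 1)."""
    n = len(xs)
    sum_x = sum(xs)
    sum_x2 = sum(x * x for x in xs)
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)


data = [1, 1, 3, 5, 5]
v = sample_variance_computational(data)
print(v)                               # 4.0
print(math.isclose(v, variance(data)))  # True
```

One caveat worth noting: in floating-point arithmetic, subtracting two large and nearly equal sums can lose precision, so the definitional form may be the safer choice when the mean is large relative to the spread.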

Properties

• It has squared units … which leads to defining the standard deviation.

• It is always nonnegative, and equals zero if and only if all the observations are identical.

• The larger the value, the more variation in the data.

• The divisor of (n-1) instead of n makes the sample variance “unbiased” for the population variance (σ²) … will be explained when we get into inference.

Standard Deviation

• The sample standard deviation, denoted by S, is the positive square root of the sample variance.

• Purpose: to have a measure with the same units of measurement as the original observations.

$$S = \sqrt{S^2}$$
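
A short illustration, assuming Python's standard `statistics` module (whose `variance` and `stdev` functions use the n - 1 divisor): the standard deviation is the square root of the variance, so it is back on the scale of the observations. The variable name `data` is just for this sketch.

```python
import math
from statistics import stdev, variance

data = [1, 1, 3, 5, 5]  # "Data 2" from the start of the lecture

s2 = variance(data)  # sample variance, in squared units
s = stdev(data)      # sample standard deviation, in the original units

print(s2, s)
print(math.isclose(s, math.sqrt(s2)))  # True
```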

Illustration of Computation

• Data set in the example for the mean and median.

• Data: 122, 135, 110, 126, 100, 110, 110, 126, 94, 124, 108, 110, 92, 98, 118, 110, 102, 108, 126, 104, 110, 120, 110, 118, 100, 110, 120, 100, 120, 92

• We illustrate computations using the definitional and computational formulas in a spreadsheet-type format.

Example … continued

• The spreadsheet-type table on the next slide is obtained from an Excel worksheet.

• The first three columns illustrate the computation using the definitional formula.

• The last column is used to illustrate the computation using the computational formula.

• Details will be provided in class!

Stat 700 Computation of the Variance and Standard Deviation

X       Dev=X-Mean    Dev^2      X^2
122      10.90        118.81     14884
135      23.90        571.21     18225
110      -1.10          1.21     12100
126      14.90        222.01     15876
100     -11.10        123.21     10000
110      -1.10          1.21     12100
110      -1.10          1.21     12100
126      14.90        222.01     15876
 94     -17.10        292.41      8836
124      12.90        166.41     15376
108      -3.10          9.61     11664
110      -1.10          1.21     12100
 92     -19.10        364.81      8464
 98     -13.10        171.61      9604
118       6.90         47.61     13924
110      -1.10          1.21     12100
102      -9.10         82.81     10404
108      -3.10          9.61     11664
126      14.90        222.01     15876
104      -7.10         50.41     10816
110      -1.10          1.21     12100
120       8.90         79.21     14400
110      -1.10          1.21     12100
118       6.90         47.61     13924
100     -11.10        123.21     10000
110      -1.10          1.21     12100
120       8.90         79.21     14400
100     -11.10        123.21     10000
120       8.90         79.21     14400
 92     -19.10        364.81      8464

Sum_Of_X    Sum_Of_Dev    Sum_Of_Dev^2         Sum_Of_X^2
3333        0.00          3580.70              373877.00

Mean_Of_X                 Variance_Of_X        Variance_Of_X
111.1                     123.47               123.47

                          Standard_Dev_Of_X    Standard_Dev_Of_X
                          11.11                11.11

Explanations of Columns in the Sheet

• Column 1: contains the values of X, Sum of X, and Sample Mean.

• Column 2: contains the deviations, Dev = X-SampleMean, and the Sum of Deviations.

• Column 3: contains the squared deviations, Sum of squared deviations, variance, and the standard deviation (via definitional formula).

• Column 4: contains the squared X; sum of squared X, and the variance (via the computational formula).
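
The four columns of the sheet can be reproduced with a few lines of Python. This is only a sketch of the same arithmetic, not the Excel worksheet used in class; the variable names (`dev`, `sum_dev2`, `var_def`, `var_comp`, and so on) are chosen here for illustration.

```python
import math

# The 30 observations from the example.
x = [122, 135, 110, 126, 100, 110, 110, 126, 94, 124,
     108, 110, 92, 98, 118, 110, 102, 108, 126, 104,
     110, 120, 110, 118, 100, 110, 120, 100, 120, 92]

n = len(x)
sum_x = sum(x)                       # Column 1: sum of X
mean_x = sum_x / n                   # Column 1: sample mean

dev = [xi - mean_x for xi in x]      # Column 2: deviations from the mean
sum_dev = sum(dev)                   # Column 2: sum of deviations (about 0)

sum_dev2 = sum(d * d for d in dev)   # Column 3: sum of squared deviations
var_def = sum_dev2 / (n - 1)         # Column 3: variance, definitional formula
sd_def = math.sqrt(var_def)          # Column 3: standard deviation

sum_x2 = sum(xi * xi for xi in x)    # Column 4: sum of squared X
var_comp = (sum_x2 - sum_x ** 2 / n) / (n - 1)  # Column 4: computational formula

print(sum_x, round(mean_x, 1), round(sum_dev2, 2), sum_x2)
print(round(var_def, 2), round(var_comp, 2), round(sd_def, 2))
```

Both formulas give the same variance up to rounding, matching the bottom rows of the sheet.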

Population Parameters (Analogs)

• If the quantities are computed from the population values, then we obtain population parameters such as the population mean, variance, and standard deviation.

• The notation is as follows:

Quantity              Sample (based on sample values)    Population (based on population values)
Mean                  X̄                                  μ
Variance              S²                                 σ²
Standard Deviation    S                                  σ

Information from Mean and Standard Deviation

• Empirical Rule: For symmetric mound-shaped distributions:

– Percentage of all observations within 1 standard deviation of the mean is approximately 68%.

– Percentage of all observations within 2 standard deviations of the mean is approximately 95%.

– Percentage of all observations within 3 standard deviations of the mean is approximately 100% (about 99.7% for a normal distribution).

– Thus, usually no observations will be more than 3 standard deviations away from the mean!

Information … continued

• Chebyshev’s Rule: For any distribution (be it symmetric, skewed, bi-modal, etc.), we always have that:

– Percentage of all observations within 1 standard deviation of the mean is at least 0%.

– Percentage of all observations within 2 standard deviations of the mean is at least 75%.

– Percentage of all observations within 3 standard deviations of the mean is at least 88.89%.

– More generally, for any k > 1, the percentage of observations within k standard deviations of the mean is at least 100(1 - 1/k²)%.
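
The general bound is easy to tabulate. The sketch below uses a hypothetical helper named `chebyshev_lower_bound` and evaluates 100(1 - 1/k²) for a few values of k; for k ≤ 1 the formula gives nothing useful, so the result is floored at 0%.

```python
def chebyshev_lower_bound(k):
    """Minimum percentage of observations within k standard deviations of the mean."""
    if k <= 0:
        raise ValueError("k must be positive")
    # The bound is only informative for k > 1; floor it at 0% otherwise.
    return max(0.0, 100 * (1 - 1 / k ** 2))


for k in (1, 2, 3):
    print(k, round(chebyshev_lower_bound(k), 2))
# 1 0.0
# 2 75.0
# 3 88.89
```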

Illustration of these Rules

• Consider the sample data with 30 observations considered earlier.

• Data: 122, 135, 110, 126, 100, 110, 110, 126, 94, 124, 108, 110, 92, 98, 118, 110, 102, 108, 126, 104, 110, 120, 110, 118, 100, 110, 120, 100, 120, 92

• Recall that:

– Sample mean = 111.1

– Sample standard deviation = 11.11

• We compute the percentages of observations in intervals of the form [Mean - kS, Mean + kS].

Percentages in Certain Intervals

Interval              Limits of Interval    Number of Observations    Percentage of Observations
Within 1S of Mean     [100, 122.2]          21                        70.00
Within 2S of Mean     [88.9, 133.3]         29                        96.67
Within 3S of Mean     [77.8, 144.4]         30                        100.00

For the interval within 2S of the mean:

Lower Limit = (Sample mean) - 2(Std Dev) = 111.1 - 2(11.1) = 88.9
Upper Limit = (Sample mean) + 2(Std Dev) = 111.1 + 2(11.1) = 133.3

Going through the 30 observations, 29 of them are between 88.9 and 133.3, which is (29/30)(100) = 96.67% of all the observations.

Note that the observed percentages satisfy the lower bounds provided by Chebyshev's Inequality (0%, 75%, and 88.89%).

Also, note that the observed percentages are very close to the percentages specified by the Empirical Rule (68%, 95%, and approximately 100%). This is because the histogram is somewhat symmetric.
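
The counts in the table can be checked with a short Python sketch. The helper name `percent_within` is hypothetical; it counts the observations in [Mean - kS, Mean + kS] using the unrounded sample mean and standard deviation.

```python
import math

# The 30 observations from the example.
x = [122, 135, 110, 126, 100, 110, 110, 126, 94, 124,
     108, 110, 92, 98, 118, 110, 102, 108, 126, 104,
     110, 120, 110, 118, 100, 110, 120, 100, 120, 92]

n = len(x)
mean_x = sum(x) / n
s = math.sqrt(sum((xi - mean_x) ** 2 for xi in x) / (n - 1))


def percent_within(k):
    """Count and percentage of observations within k standard deviations of the mean."""
    lower, upper = mean_x - k * s, mean_x + k * s
    count = sum(lower <= xi <= upper for xi in x)
    return count, 100 * count / n


for k in (1, 2, 3):
    count, pct = percent_within(k)
    print(k, count, round(pct, 2))
# 1 21 70.0
# 2 29 96.67
# 3 30 100.0
```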

Measure of Relative Standing: Z-Score

Given a data set, the z-score, also called the standardized score, associated with an observation whose value is x is given by

$$Z = \frac{x - \bar{X}}{S}.$$

It measures the distance of x from the sample mean in terms of the number of standard deviations. A negative (positive) value indicates the value x is smaller (larger) than the sample mean.
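
As an illustration, the sketch below computes z-scores for the largest and smallest of the 30 observations used earlier. The function name `z_score` is a hypothetical helper, and the sample mean and standard deviation are recomputed from the data rather than taken from the rounded values.

```python
import math

# The 30 observations from the earlier example.
x = [122, 135, 110, 126, 100, 110, 110, 126, 94, 124,
     108, 110, 92, 98, 118, 110, 102, 108, 126, 104,
     110, 120, 110, 118, 100, 110, 120, 100, 120, 92]

n = len(x)
mean_x = sum(x) / n
s = math.sqrt(sum((xi - mean_x) ** 2 for xi in x) / (n - 1))


def z_score(value):
    """Number of standard deviations by which value lies above (+) or below (-) the mean."""
    return (value - mean_x) / s


print(round(z_score(135), 2))  # 2.15: about 2.15 standard deviations above the mean
print(round(z_score(92), 2))   # -1.72: about 1.72 standard deviations below the mean
```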

Percentiles

• Given a set of n observations, the 100pth percentile, where 0 < p < 1, is the value which is larger than 100p% of all the observations, and smaller than 100(1-p)% of the observations.

• For example, the 95th percentile is the value that is larger than 95% of all the observations and smaller than 5% of all the observations.

Measures of Relative Standing: Quartiles

• The first quartile, denoted by Q1, is the 25th percentile of the data set.

• The third quartile, denoted by Q3, is the 75th percentile of the data set.

• The second quartile, which is the 50th percentile, is simply the median of the data set, M.

Computing the Quartiles

• Divide the arranged data set into two parts using the median as cut-off.

• If the sample size n is odd, then the median should be included in each group; while if n is even then the median is not included in either group.

• First quartile (Q1) is the median of the lower group.

• Third quartile (Q3) is the median of the upper group.

Example: Quartile Computation

• Arranged Data:

• 92, 92, 94, 98, 100, 100, 100, 102, 104, 108, 108, 110, 110, 110, 110, 110, 110, 110, 110, 118, 118, 120, 120, 120, 122, 124, 126, 126, 126, 135

• M = 110 = average of the 15th and 16th values.

• Q1 = median of the lower 15 values = value in the 8th position = 102.

• Q3 = median of the upper 15 values = value in the 23rd position = 120.
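
The quartile procedure described in this lecture can be sketched as follows. The function name `quartiles` is a hypothetical helper; it follows the median-split rule stated above (median included in both halves when n is odd, excluded when n is even). Statistical software often uses other conventions, so results from packages may differ slightly.

```python
from statistics import median


def quartiles(xs):
    """Return (Q1, M, Q3) using the median-split rule from the lecture."""
    s = sorted(xs)
    n = len(s)
    half = n // 2
    if n % 2 == 1:
        # n odd: the median is included in both the lower and upper groups.
        lower, upper = s[:half + 1], s[half:]
    else:
        # n even: the data split cleanly into two halves.
        lower, upper = s[:half], s[half:]
    return median(lower), median(s), median(upper)


x = [122, 135, 110, 126, 100, 110, 110, 126, 94, 124,
     108, 110, 92, 98, 118, 110, 102, 108, 126, 104,
     110, 120, 110, 118, 100, 110, 120, 100, 120, 92]

q1, m, q3 = quartiles(x)
print(q1, m, q3)  # 102 110.0 120
```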

Box Plots

• Another graphical summary of the data is provided by the boxplot. This provides information about the presence of outliers.

• Steps in constructing a boxplot are as follows:

– Calculate M, Q1, Q3, and the minimum and maximum values.

– Form a box with left and right ends being at Q1 and Q3, respectively.

– Draw a vertical line in the box at the location of the median.

– Connect the min and max values to the box by lines.

The BoxPlot

• For the systolic blood pressure data set, the resulting boxplot, obtained using Minitab, is shown below.

[Figure: boxplot with the positions LV, Q1, M, Q3, and HV labeled.]

Comparative BoxPlots

The boxplot can also be used to compare the distributions of different groups, by presenting the boxplots of the groups side by side.

We demonstrate this idea using the Beanie Babies Data on page 91. This data set contains the following variables:

Name: name of beanie baby
Age: in months, since 9/98
Status: R=retired, C=current
Value: Value of baby

Comparative BoxPlots of Value by Status

[Figure: side-by-side boxplots of Value (vertical axis, ticks from 0 to 2000) by Status (C, R).]

The distributions for both groups are very right-skewed!

Comparative BoxPlots of Log(Value) by Status

[Figure: side-by-side boxplots of LogValue (vertical axis, ticks from 2 to 8) by Status (C, R).]

Relationship Between Age and Value

[Figure: scatter plot of Value (vertical axis, ticks from 0 to 2000) against Age0998 (horizontal axis, ticks from 0 to 70).]

Relationship Between Age and Log(Value)

[Figure: scatter plot of LogValue (vertical axis, ticks from 2 to 8) against Age0998 (horizontal axis, ticks from 0 to 70).]