
Dec 19, 2015


Copyright (c) Bani Mallick 1

Lecture 2

Stat 651


Topics in Lecture #2 Population and sample parameters

More on populations and samples

The Median and Percentiles

Robustness of the median and IQR, lack of robustness for the mean and standard deviation

Variability: standard deviation and interquartile range (IQR)

Boxplots


Book Sections Covered in Lecture #2

Chapter 3.4

Chapter 3.5, up to page 88

Chapter 3.6


Review of Lecture #1

We described samples and populations

We want to make inference about populations

We draw samples from the population to do so

Different samples give different results


Review of Lecture #1

I will make quite a big deal about the difference between samples and populations

One major thing we do in statistics is to quantify: how far is what we see in the sample from what we cannot see in the population?


Review of Lecture #1

We defined the relative frequency histogram

This counts the percentage of the sample that falls into computer-generated categories

We studied the NHANES case-control study

This had samples from 2 (sub)populations

Those who developed breast cancer

Those who did not


Review of Lecture #1

In the NHANES data, the sample who developed breast cancer seemed to have smaller values of saturated fat than the sample that did not develop breast cancer

What we will try to do today is to quantify those differences


Review of Lecture #1: The Population Mean

• In many problems, the goal is to make inference about the population mean of a numerical variable, e.g., saturated fat intake

• The only thing we have available is data from a sample, e.g., the sample mean.

• Define in words what you mean by the population mean and the sample mean!


Review of Lecture #1: The Population Mean

• In many problems, the goal is to make inference about the population mean of a numerical variable, e.g., saturated fat intake

• You’re right! The population mean is the average of all the outcomes in the population

• It can’t be measured, hence we take samples.

• BTW, what’s an average?


Review of Lecture #1: The Population Mean

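Formal definition. If the sample is of size $n$ and the data are $X_1, \ldots, X_n$, then the sample mean is

$$\bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n} = \frac{\sum_{i=1}^{n} X_i}{n}$$

The numerator is the total sum of the values in the sample

The denominator $n$ is the number of observations in the sample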


Parameters and Statistics

Parameter: numerical characteristic of a population

Statistic: numerical characteristic of a sample

[Diagram: Population and Sample]


Parameters and Statistics

Parameter: numerical characteristic of a population, called $\mu$

Statistic: numerical characteristic of a sample, called $\bar{X}$

[Diagram: Population and Sample]


Parameters and Statistics

Parameter: numerical characteristic of a population, called $\mu$ (never known!!!)

Statistic: numerical characteristic of a sample, called $\bar{X}$

[Diagram: Population and Sample]

We want to make inference about the population from the sample


Case-control data: NHANES log(Saturated Fat)

Which sample has the larger sample mean?

[Figure: relative frequency histograms (percent, 0% to 15%) of Log(Saturated Fat) for the Cancer and Healthy samples; horizontal axis from 2.00 to 4.00]


Case-control data: NHANES log(Saturated Fat)

Cancer: $\bar{X}$ = 2.70

Healthy: $\bar{X}$ = 2.99

[Figure: relative frequency histograms (percent, 0% to 15%) of Log(Saturated Fat) for the Cancer and Healthy samples; horizontal axis from 2.00 to 4.00]


The Concept of a Median

When we talk about the population median age of graduate students at Texas A&M, what do we mean?

When we talk about the sample median age of graduate students at Texas A&M, what do we mean?


The Concept of a Median

The population median is the central point

1/2 the population falls below the population median

1/2 the population falls above the population median

Look in newspapers for the use of the median and the mean


The Concept of a Median

The sample median is the central point of the sample

1/2 the sample falls below the sample median

1/2 the sample falls above the sample median

We can use the sample median to try to estimate the population median

Remember though, different samples give different numbers


The Concept of a Median

The sample median is computed by SPSS, or by hand as follows

Order the data

If n is an odd number, the sample median is the (n+1)/2 point in order

If n is even, it is the average of the n/2 point in order and the (n/2+1) point in order


The Concept of a Median

Data 97 99 93 96 91 90 95: n = 7

Ordered: 90 91 93 95 96 97 99

If n is an odd number, the sample median is the (n+1)/2 point in order

(n+1)/2 = 4

Sample median = 95


The Concept of a Median

Data 97 93 96 91 90 95: n = 6

Ordered: 90 91 93 95 96 97

If n is even, it is the average of the n/2 point in order and the (n/2+1) point in order

n/2 = 3, (n/2+1) = 4

Sample median = average of 93 and 95 = 94
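
The by-hand rule above (order the data, then take the middle point or the average of the two middle points) can be sketched in a few lines of Python. This is only an illustration of the rule, not what SPSS runs internally; the function name `sample_median` is made up for this sketch, and the two data sets are the ordered values from the worked examples.

```python
def sample_median(data):
    """Sample median by the by-hand rule: order the data, then pick the middle."""
    ordered = sorted(data)
    n = len(ordered)
    if n == 0:
        raise ValueError("need at least one observation")
    if n % 2 == 1:
        # n odd: the (n+1)/2-th point in order (1-based) is index n // 2 (0-based)
        return ordered[n // 2]
    # n even: average of the (n/2)-th and (n/2 + 1)-th points in order (1-based)
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2


odd_example = [90, 91, 93, 95, 96, 97, 99]   # n = 7
even_example = [90, 91, 93, 95, 96, 97]      # n = 6

print(sample_median(odd_example))   # 95
print(sample_median(even_example))  # 94.0
```

Both results agree with the worked examples: 95 for n = 7 and 94 for n = 6.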


Summary Statistics in SPSS

Select “analyze”, “descriptive statistics”, “explore”

Select your variables (“Dependent”) and your populations (“Factor List”)

Ask for “Statistics”

Cut and paste as needed


Descriptives: Log(Saturated Fat) by Healthy Status Numerical (0 = Healthy, 1 = Cancer)

                                     Healthy                  Breast Cancer
                                     Statistic  Std. Error    Statistic  Std. Error
Mean                                 2.9905     7.969E-02     2.6969     8.362E-02
95% Confidence Interval for Mean
  Lower Bound                        2.8310                   2.5295
  Upper Bound                        3.1500                   2.8642
5% Trimmed Mean                      3.0015                   2.6886
Median                               2.9957                   2.8332
Variance                             .381                     .413
Std. Deviation                       .6173                    .6423
Minimum                              1.39                     1.39
Maximum                              4.26                     4.77
Range                                2.88                     3.38
Interquartile Range                  .9130                    .8755
Skewness                             -.332      .309          .156       .311
Kurtosis                             -.138      .608          .748       .613

SPSS Output for NHANES: you see variable, populations, sample means, medians, minimum and maximum


Summary Statistics in SPSS

For both measures of central tendency, healthy women had reported more saturated fat on the day they were interviewed


VARIABILITY

Variability is one of the hardest concepts to understand, and to measure.

Variability means how widely spread out the population is

Populations with large variability should have samples that are more spread out than are samples from populations with low variability

Variability is measured by the average distance of points to center


Shelf life of two drugs (past shelf life the drug is harmful).

Which drug would you take, assuming they are equally effective on average?

[Figure: relative frequency histograms (percent, 0% to 20%) of the sample from Drug A and the sample from Drug B; horizontal axis (v2) from -4 to 4]


Measuring Variability by Distances

Absolute distance from sample mean: $|X_i - \bar{X}|$

Absolute distance from sample median: $|X_i - \text{median}|$

Squared distance to sample mean: $(X_i - \bar{X})^2$

All numerical measures of variability are measures of “the center of the distance”


Variability as average squared distance

The sample variance, $s^2$, is defined as follows

Compute (X - sample mean)$^2$ for every observation: squared distances

Sum these numbers up

Divide by n-1

Except for the n-1, this is the sample mean of the squared differences


COMPUTING FORMULA for the Sample Variance $s^2$

$$s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}$$

Note how $s^2$ measures how far the data are from the sample mean


The Standard Deviation, s

The sample standard deviation is called s

It is the square root of the sample variance

Its units are the same as the units of the data

$$s = \sqrt{s^2}$$
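
The definitions of the sample variance and the sample standard deviation can be sketched directly in Python. This is a minimal illustration, not SPSS's implementation; the function names are made up for this sketch, and the data are the seven values from the earlier median example, reused here only for convenience.

```python
import math
import statistics


def sample_variance(data):
    """Sample variance: sum of squared distances to the sample mean, divided by n - 1."""
    n = len(data)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(data) / n
    squared_distances = [(x - mean) ** 2 for x in data]
    return sum(squared_distances) / (n - 1)


def sample_std(data):
    """Sample standard deviation: square root of the sample variance."""
    return math.sqrt(sample_variance(data))


data = [97, 99, 93, 96, 91, 90, 95]
s2 = sample_variance(data)
s = sample_std(data)
print(f"s^2 = {s2:.3f}, s = {s:.3f}")

# Cross-check against the standard library, which also divides by n - 1.
assert math.isclose(s2, statistics.variance(data))
assert math.isclose(s, statistics.stdev(data))
```

Because s is a square root of squared data units, it is back on the scale of the data, which is why it is usually the number reported.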


Aortic Stenosis Data

• Two populations: healthy kids and kids with aortic stenosis

• Two outcomes: body surface area and aortic valve area

• Size-adjusted aortic valve area is the ratio of aortic valve area to body surface area


Aortic Valve Area for Healthy and Stenotic Kids

Which has larger mean, larger variance?

[Figure: relative frequency histograms (percent, 5% to 25%) of Aortic Valve Area for the Healthy and Stenotic samples; horizontal axis from 0.000 to 3.000]


Aortic Valve Area for Healthy and Stenotic Kids

Which has larger mean, larger variance?

[Figure: relative frequency histograms (percent, 5% to 25%) of Aortic Valve Area for the Healthy and Stenotic samples; horizontal axis from 0.000 to 3.000]

Mean = 1.06, Median = 0.80, std dev = 0.78

Mean = 0.83, Median = 0.69, std dev = 0.64


SAMPLE MEANS ARE NOT ROBUST

HEIGHTS

-4 -2 0 2 4 6

MED=

-40 -2 0 2 4 6

MED=

-2 0 2 4 6 40

MED=


SAMPLE MEANS ARE NOT ROBUST

Outliers affect sample means much more than they do sample medians.

You should look out for wild points

They may be errors, or naturally occurring variability, but they have the potential for mischief

We will develop statistical methods that help us understand whether our conclusions are being driven by a few wild points.
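
A small sketch can make the lack of robustness concrete. It assumes that the three rows of numbers on the “HEIGHTS” slide are three small data sets, one without a wild point and two with one; that reading of the slide is an assumption, and the dictionary labels are made up for this sketch. The values printed are computed here, not taken from the slide.

```python
import statistics

# Assumption: each row of numbers on the HEIGHTS slide is one small data set.
samples = {
    "no wild point": [-4, -2, 0, 2, 4, 6],
    "low wild point": [-40, -2, 0, 2, 4, 6],
    "high wild point": [-2, 0, 2, 4, 6, 40],
}

summary = {}
for label, values in samples.items():
    # Store (sample mean, sample median) for each data set.
    summary[label] = (statistics.mean(values), statistics.median(values))
    mean, median = summary[label]
    print(f"{label:16s} mean = {mean:6.2f}   median = {median:5.2f}")
```

Replacing -4 by -40 drags the mean from 1 to -5 while the median stays at 1. In the third data set the mean (about 8.3) is pulled well above the median (3) by the single value 40.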


PERCENTILES: SAT SCORES

• If you are in the 90th percentile of the population, what percent scored higher than you?

• If you are in the 25th percentile, what percent scored less than or equal to you?

• What percent lie between the 25th and 75th percentiles?

• What percentile is the median?


INTERQUARTILE RANGE (IQR)

Defined as the difference between the 75th and 25th percentiles

The length of the interval needed to cover the middle 50% of the data.

This is a natural, robust measure of variability

Why do I say it is robust?

Why do I say it is natural?


Aortic Valve Area for Healthy and Stenotic Kids

Which has larger mean, larger variance?

[Figure: relative frequency histograms (percent, 5% to 25%) of Aortic Valve Area for the Healthy and Stenotic samples; horizontal axis from 0.000 to 3.000]

Mean = 1.06, Median = 0.80, std dev = 0.78, IQR = 0.98

Mean = 0.83, Median = 0.69, std dev = 0.64, IQR = 0.59


INTERQUARTILE RANGE

Defined as the difference between the 75th and 25th percentiles

The length of the interval needed to cover the middle 50% of the data.

This is a natural, robust measure of variability

If the standard deviation and the IQR rank two samples differently in variability, there is a good chance that an outlier is responsible
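
The contrast between the standard deviation and the IQR can be sketched with one clean sample and one copy in which a single value has been corrupted. The corrupted value (990 in place of 99) is hypothetical, and the helper name `iqr` is made up for this sketch. Software packages use slightly different percentile conventions, so the quartiles from Python's `statistics.quantiles` may not match SPSS exactly.

```python
import statistics


def iqr(data):
    """Interquartile range: 75th percentile minus 25th percentile."""
    # method="inclusive" interpolates between order statistics of the data itself.
    q1, _median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return q3 - q1


clean = [90, 91, 93, 95, 96, 97, 99]
wild = [90, 91, 93, 95, 96, 97, 990]  # hypothetical data-entry error in the largest value

for label, values in (("clean", clean), ("wild", wild)):
    print(f"{label:5s}  std dev = {statistics.stdev(values):8.2f}   IQR = {iqr(values):5.2f}")
```

The single wild point inflates the standard deviation enormously, while the IQR is unchanged because the 25th and 75th percentiles do not depend on how extreme the largest value is.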


Descriptives: Log(Saturated Fat) by Healthy Status Numerical (0 = Healthy, 1 = Cancer)

                                     Healthy                  Breast Cancer
                                     Statistic  Std. Error    Statistic  Std. Error
Mean                                 2.9905     7.969E-02     2.6969     8.362E-02
95% Confidence Interval for Mean
  Lower Bound                        2.8310                   2.5295
  Upper Bound                        3.1500                   2.8642
5% Trimmed Mean                      3.0015                   2.6886
Median                               2.9957                   2.8332
Variance                             .381                     .413
Std. Deviation                       .6173                    .6423
Minimum                              1.39                     1.39
Maximum                              4.26                     4.77
Range                                2.88                     3.38
Interquartile Range                  .9130                    .8755
Skewness                             -.332      .309          .156       .311
Kurtosis                             -.138      .608          .748       .613

Look for variance, standard deviation and IQR


Box Plots

Box plots are a means of visualizing data from different populations, especially comparing them

Far less clunky than histograms

Clear definitions, don’t have to worry about # of bars, class intervals, etc.

Easily available


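BASIC FORM OF THE BOXPLOT

[Figure: a box whose top edge is the 75th percentile, whose middle line is the median, and whose bottom edge is the 25th percentile; the IQR can be read off as the height of the box]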


Box Plot Additions: Technical

To the basic box, whiskers are added:

Each whisker goes out to the furthest data point that lies within 1.5 × IQR of the 75th or 25th percentile

• Any points beyond the whiskers are called “Suspicious”, “Extreme”, or “Outliers”
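
The whisker rule can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `boxplot_summary` and the example data are made up for this sketch, and the quartiles come from Python's `statistics.quantiles`, whose percentile convention may differ slightly from the one SPSS uses.

```python
import statistics


def boxplot_summary(data):
    """Box edges, whisker ends, and flagged points under the 1.5 x IQR rule."""
    ordered = sorted(data)
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Whiskers stop at the most extreme data points that are still inside the fences.
    inside = [x for x in ordered if lower_fence <= x <= upper_fence]
    outliers = [x for x in ordered if x < lower_fence or x > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": outliers,
    }


summary = boxplot_summary([-2, 0, 2, 4, 6, 40])  # illustrative data with one wild point
for name, value in summary.items():
    print(f"{name:12s} {value}")
```

For this example the box runs from 0.5 to 5.5, the whiskers end at -2 and 6, and the value 40 is flagged because it lies more than 1.5 × IQR above the 75th percentile.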


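BASIC FORM OF THE BOXPLOT

[Figure: box with the 75th percentile, median, and 25th percentile marked; the IQR can be read off as the height of the box; each whisker ends at a data point (“Point here”); one point beyond a whisker is marked “Outlier”]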


IQR AS A MEASURE OF VARIABILITY

The box is from the 25th to the 75th sample percentile

This means 50% of the data are in the box

Hence, IQR = range needed to cover 50% of the data

IQR is a very robust measure of variability which can be judged graphically


NHANES Saturated Fat Data:

[Figure: boxplots of Saturated Fat (vertical axis from 0.00 to 100.00) for the Cancer and Healthy samples, with moderately outlying and unusually outlying points marked]


Box Plots

The SPSS plot actually displays the median line, but it did not translate into PowerPoint

You go to graphs, interactive and boxplot to get these things

Here’s about what the thing looks like in SPSS (then I’ll show you SPSS)


NHANES Data not done interactively (just graphs and boxplot): cannot edit

[Figure: SPSS boxplots of Saturated Fat (vertical axis from -20 to 140) by Health Status; N = 60 Healthy, 59 Cancer; points labeled 60, 118, and 119 are plotted beyond the whiskers]


Aortic Valve Area for Healthy and Stenotic Kids: done interactively, hence the median is not labeled when imported to PowerPoint

[Figure: boxplots of Aortic Valve Area (vertical axis from 0.000 to 3.000) for the Healthy and Stenotic samples]