Statistics & Data Analysis Course Number B01.1305 Course Section 31 Meeting Time Wednesday 6-8:50 pm CLASS #6

Feb 25, 2016

Page 1: Statistics & Data Analysis

Statistics & Data Analysis

Course Number B01.1305
Course Section 31
Meeting Time Wednesday 6-8:50 pm

CLASS #6

Page 2: Statistics & Data Analysis

Professor S. D. Balkin -- March 5, 2003

- 2 -

Class #6 Outline

Point estimation

Confidence interval estimation

Determining sample sizes

Introduction to Regression and Correlation Analysis

Page 3: Statistics & Data Analysis

Review of Last Class

Sampling distributions for sample statistics

Page 4: Statistics & Data Analysis

Review of Notation

                     Sample        Population
Mean                 $\bar{X}$     μ
Standard Deviation   s             σ

Page 5: Statistics & Data Analysis

Point and Interval Estimation

Chapter 8

Page 6: Statistics & Data Analysis

Review

Basic problem of statistical theory is how to infer a population or process value given only sample data

Any sample statistic will vary from sample to sample

Any sample statistic will differ from the true, population value

Must consider random error in sample statistic estimation

Page 7: Statistics & Data Analysis

Chapter Goals

Summarize sample data
• Choosing an estimator
• Unbiased estimator

Constructing confidence intervals for means with known standard deviation

Constructing confidence intervals for proportions

Determining how large a sample is needed

Constructing confidence intervals when standard deviation is not known

Understanding the key assumptions underlying confidence interval methods

Page 8: Statistics & Data Analysis

Reminder: Statistical Inference

Problem of Inferential Statistics:
• Make inferences about one or more population parameters based on observable sample data

Forms of Inference:
• Point estimation: Single best guess regarding a population parameter
• Interval estimation: Specifies a reasonable range for the value of the parameter
• Hypothesis testing: Isolating a particular possible value for the parameter and testing if this value is plausible given the available data

Page 9: Statistics & Data Analysis

Point Estimators

Computing a single statistic from the sample data to estimate a population parameter

Choosing a point estimator:
• What is the shape of the distribution?
• Do you suspect outliers exist?
• Plausible choices:
  • Mean
  • Median
  • Mode
  • Trimmed Mean

Page 10: Statistics & Data Analysis

Technical Definitions

ESTIMATOR: An estimator $\hat{\theta}$ of a parameter θ is a function of a random sample that yields a point estimate for θ. An estimator is itself a random variable and therefore it has a theoretical sampling distribution.

UNBIASED ESTIMATOR: An estimator $\hat{\theta}$ that is a function of the sample data is called unbiased for the population parameter θ if its expected value equals θ.

EFFICIENT ESTIMATOR: An estimator is called most efficient for a particular problem if it has the smallest standard error of all possible unbiased estimators.

Page 11: Statistics & Data Analysis

Example

I used R to draw 1,000 samples, each of size 30, from a normally distributed population having mean 50 and standard deviation 10.

    data.mean = data.median = numeric(0)
    for (i in 1:1000) {
      data = rnorm(n=30, mean=50, sd=10)   # one sample of size 30
      data.mean[i] = mean(data)            # sample mean of this sample
      data.median[i] = median(data)        # sample median of this sample
    }

For each sample the mean and median are computed.

Do these statistics appear unbiased?

Which is more efficient?
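One way to explore the two questions is to summarize the simulated estimates. The sketch below repeats the simulation with a fixed seed (an assumption, added only for reproducibility) and compares the average and the spread of the two estimators; the names `n.sims`, `bias`, and `std.err` are illustrative and do not come from the slides.

```r
set.seed(1)                      # assumption: fixed seed for a reproducible run
n.sims = 1000
data.mean = data.median = numeric(n.sims)
for (i in 1:n.sims) {
  data = rnorm(n = 30, mean = 50, sd = 10)
  data.mean[i] = mean(data)
  data.median[i] = median(data)
}

# Bias: average estimate minus the true population mean (50)
bias = c(mean = mean(data.mean) - 50, median = mean(data.median) - 50)

# Efficiency: the estimator with the smaller standard error is more efficient
std.err = c(mean = sd(data.mean), median = sd(data.median))

print(bias)
print(std.err)
```

Both biases should come out close to zero, while the standard error of the sample mean should be the smaller of the two for normal data.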

Page 12: Statistics & Data Analysis

Example

I used R to draw 1,000 samples, each of size 30, from an extremely skewed population with mean and standard deviation both equal to 2.

    data.mean = data.median = numeric(0)
    for (i in 1:1000) {
      data = rexp(n=30, rate=1/2)     # exponential: mean 2, standard deviation 2
      data.mean[i] = mean(data)
      data.median[i] = median(data)
    }

For each sample the mean and median are computed.

Do these statistics appear unbiased?

Which is more efficient?

Page 13: Statistics & Data Analysis

Expressing Uncertainty

Suppose we are trying to make inferences about a population mean μ based on a sample of size n. The sample mean $\bar{X}$ is a point estimator of the parameter μ. Used by itself, $\bar{X}$ is of limited usefulness because it contains no information about its own reliability.

Furthermore, the reporting of $\bar{X}$ alone may leave the false impression that $\bar{X}$ estimates μ with complete accuracy.

Page 14: Statistics & Data Analysis

Confidence Interval

An interval with random endpoints which contains the parameter of interest (in this case, μ) with a pre-specified probability, denoted by 1 - α.

The confidence interval automatically provides a margin of error to account for the sampling variability of the sample statistic.

Example: A machine is supposed to fill “12 ounce” bottles of Guinness. To see if the machine is working properly, we randomly select 100 bottles recently filled by the machine, and find that the average amount of Guinness is 11.95 ounces. Can we conclude that the machine is not working properly?

Page 15: Statistics & Data Analysis

No! By simply reporting the sample mean, we are neglecting the fact that the amount of beer varies from bottle to bottle and that the value of the sample mean depends on the luck of the draw

It is possible that a value as low as 11.95 is within the range of natural variability for the sample mean, even if the average amount for all bottles is in fact μ = 12 ounces.

Suppose we know from past experience that the amounts of beer in bottles filled by the machine have a standard deviation of σ = 0.05 ounces.

Since n = 100, we can assume (using the Central Limit Theorem) that the sample mean is normally distributed with mean μ (unknown) and standard error 0.005

What does the Empirical Rule tell us about the sample mean?
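A minimal sketch of the arithmetic, using the numbers from the example. It assumes the usual Empirical Rule reading that about 95% of values fall within two standard errors of the mean; the variable names are illustrative.

```r
sigma = 0.05      # known standard deviation of fill amounts (ounces)
n = 100           # number of bottles sampled
x.bar = 11.95     # observed sample mean

std.err = sigma / sqrt(n)            # standard error of the sample mean

# If mu really were 12, the sample mean would land in this range about 95% of the time
range.95 = 12 + c(-2, 2) * std.err

# How many standard errors the observed mean sits below 12
z = (x.bar - 12) / std.err

print(std.err)
print(range.95)
print(z)
```

With these inputs the observed mean lies far outside the two-standard-error range around 12, which is the kind of evidence a confidence interval formalizes.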

Page 16: Statistics & Data Analysis

Why does it work?

[Figure: sampling distribution of $\bar{X}$. $\bar{X}$ is in the interval around μ 95% of the time, so μ is in the interval around $\bar{X}$, built from its standard error $S_{\bar{X}}$, about 95% of the time.]

Page 17: Statistics & Data Analysis


Using the Empirical Rule Assuming Normality

Page 18: Statistics & Data Analysis

Confidence Intervals

“Statistics is never having to say you're certain”.
• (Tee shirt, American Statistical Association).

Any sample statistic will vary from sample to sample

Point estimates are almost inevitably in error to some degree

Thus, we need to specify a probable range or interval estimate for the parameter

Page 19: Statistics & Data Analysis

Confidence Interval

100(1 − α)% CONFIDENCE INTERVAL FOR μ WITH σ KNOWN

Using the sample mean as an estimate of the population mean, allow for sampling error with a plus-or-minus term equal to a z-table value times the standard error of the sample mean:

$$\bar{y} - z_{\alpha/2}\,\sigma_{\bar{Y}} \le \mu \le \bar{y} + z_{\alpha/2}\,\sigma_{\bar{Y}}$$

Page 20: Statistics & Data Analysis

Example

An airline needs an estimate of the average number of passengers on a newly scheduled flight

Its experience is that data for the first month of flights are unreliable, but thereafter the passenger load settles down

The mean passenger load is calculated for the first 20 weekdays of the second month after initiation of this particular flight

If the sample mean is 112 and the population standard deviation is assumed to be 25, find a 95% confidence interval for the true, long-run average number of passengers on this flight

Find the 90% confidence interval for the mean
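The airline example can be worked with a small helper that applies the z-based interval for a mean with known standard deviation. The function `z.interval` is a hypothetical name introduced here, not something from the slides; `qnorm` is the standard normal quantile function in base R.

```r
# Hypothetical helper: z-based confidence interval for a mean, sigma known
z.interval = function(y.bar, sigma, n, conf.level = 0.95) {
  alpha = 1 - conf.level
  z = qnorm(1 - alpha / 2)               # z-table value for the chosen level
  y.bar + c(-1, 1) * z * sigma / sqrt(n) # lower and upper endpoints
}

ci.95 = z.interval(y.bar = 112, sigma = 25, n = 20, conf.level = 0.95)
ci.90 = z.interval(y.bar = 112, sigma = 25, n = 20, conf.level = 0.90)

print(round(ci.95, 2))
print(round(ci.90, 2))
```

The 90% interval is narrower than the 95% interval because a smaller z value is needed for the lower confidence level.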

Page 21: Statistics & Data Analysis

Interpretation

The confidence level of the confidence interval refers to the process of constructing confidence intervals

Each particular confidence interval either does or does not include the true value of the parameter being estimated

We can’t say that this particular estimate is correct to within the error

So, we say that we have XX% confidence that the population parameter is contained in the interval

Or…the interval is the result of a process that in the long run has a XX% probability of being correct

Page 22: Statistics & Data Analysis

Imagine Many Samples

[Figure: confidence intervals from many samples drawn on an axis from 22 to 24, around the population mean = 23.29. One is labeled "The interval you computed", the rest "Other intervals you might have computed", and two are marked "Missed!"]
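The idea of many possible intervals can be imitated by simulation. The sketch below draws repeated samples from a normal population with mean 23.29 (the value shown above) and counts how often a 95% z-interval covers it. The standard deviation, the sample size, and the seed are assumptions chosen only for illustration.

```r
set.seed(1)        # assumption: fixed seed for reproducibility
mu = 23.29         # population mean from the figure
sigma = 1          # assumption: population standard deviation
n = 25             # assumption: sample size
n.sims = 1000

z = qnorm(0.975)
half.width = z * sigma / sqrt(n)
covered = logical(n.sims)

for (i in 1:n.sims) {
  y.bar = mean(rnorm(n, mean = mu, sd = sigma))
  # TRUE when this sample's interval contains the population mean
  covered[i] = (y.bar - half.width <= mu) && (mu <= y.bar + half.width)
}

coverage = mean(covered)   # long-run fraction of intervals that did not miss
print(coverage)
```

The coverage should come out near 0.95, which is what the confidence level describes: a property of the procedure, not of any single interval.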

Page 23: Statistics & Data Analysis

Example

A signal with value μ is sent from California; the value received in NY is normally distributed with mean μ and variance 4.

To reduce error, the same value is sent 9 times

If the successive values received are:
• 5, 8.5, 12, 15, 7, 9, 7.5, 6.5, 10.5

Construct a 99% confidence interval for μ
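A short sketch of this calculation, assuming the known-variance z interval applies (the variance of 4 is given, so the standard deviation is 2). Variable names are illustrative.

```r
values = c(5, 8.5, 12, 15, 7, 9, 7.5, 6.5, 10.5)
sigma = sqrt(4)                 # known standard deviation
n = length(values)
y.bar = mean(values)

z = qnorm(1 - 0.01 / 2)         # z value for 99% confidence
ci.99 = y.bar + c(-1, 1) * z * sigma / sqrt(n)

print(y.bar)
print(round(ci.99, 2))
```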

Page 24: Statistics & Data Analysis

Getting Realistic

The population standard deviation is rarely known

Usually both the mean and standard deviation must be estimated from the sample

Estimate σ with s

However…with this added source of random errors, we need to handle this problem using the t-distribution (later on)

Page 25: Statistics & Data Analysis

Confidence Intervals for Proportions

We can also construct confidence intervals for proportions of successes

Recall that the expected value and standard error for the proportion of successes in a sample are:

$$E(\hat{\pi}) = \pi; \quad \sigma_{\hat{\pi}} = \sqrt{\pi(1-\pi)/n}$$

How can we construct a confidence interval for a proportion?

Page 26: Statistics & Data Analysis

Example

Suppose that in a sample of 2,200 households with one or more television sets, 471 watch a particular network’s show at a given time.

Find a 95% confidence interval for the population proportion of households watching this show.
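One common approach, sketched here under the assumption that the large-sample normal approximation is intended, plugs the sample proportion into the standard error formula. The name `p.hat` stands for the sample proportion and is introduced for illustration.

```r
x = 471          # households watching the show
n = 2200         # households sampled

p.hat = x / n                                  # sample proportion
std.err = sqrt(p.hat * (1 - p.hat) / n)        # estimated standard error
ci.95 = p.hat + c(-1, 1) * qnorm(0.975) * std.err

print(round(p.hat, 4))
print(round(ci.95, 4))
```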

Page 27: Statistics & Data Analysis

Example

The 1992 presidential election looked like a very close three-way race at the time when news polls reported that of 1,105 registered voters surveyed:
• Perot: 33%
• Bush: 31%
• Clinton: 28%

Construct a 95% confidence interval for Perot's share.

What is the margin of error?

What happened here?

Page 28: Statistics & Data Analysis

Example

A survey found that out of 800 people, 46% thought that Clinton’s first approved budget represented a major change in the direction of the country.

Another 45% thought it did not represent a major change.

Compute a 95% confidence interval for the percent of people who had a positive response.

What is the margin of error?

Interpret…

Page 29: Statistics & Data Analysis

Choosing a Sample Size

Gathering information for a statistical study can be expensive, time consuming, etc.

So…the question of how much information to gather is very important

When considering a confidence interval for a population mean μ, there are three quantities to consider:

$$\bar{Y} \pm z_{\alpha/2}\,\sigma/\sqrt{n}$$

Page 30: Statistics & Data Analysis

Choosing a Sample Size (cont)

Tolerability Width: The margin of acceptable error
• ±3%
• ± $10,000

Derive the required sample size using:
• Margin of error (tolerability width)
• Level of Significance (z-value)
• Standard deviation (given, assumed, or calculated)
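The three inputs listed above combine into a formula for n by solving the plus-or-minus term for the sample size. In the short derivation below, E denotes the margin of error; that symbol is introduced here and does not appear in the slides.

```latex
E = z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}
\quad\Longrightarrow\quad
\sqrt{n} = \frac{z_{\alpha/2}\,\sigma}{E}
\quad\Longrightarrow\quad
n = \left(\frac{z_{\alpha/2}\,\sigma}{E}\right)^{2}
```

Because n must be a whole number, the result is rounded up.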

Page 31: Statistics & Data Analysis

Example

Union officials are concerned about reports of inferior wages being paid to employees of a company under its jurisdiction

How large a sample is needed to obtain a 90% confidence interval for the population mean hourly wage with width equal to $1.00? Assume that σ = 4.
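A sketch of the sample size calculation. It assumes that "width equal to $1.00" means the total width of the interval, so the margin of error is half of that; if the width were meant as plus or minus $1.00, `E` would be 1 instead. The names `E` and `n.required` are illustrative.

```r
sigma = 4                 # assumed population standard deviation (dollars)
width = 1.00              # desired total width of the interval
E = width / 2             # assumption: margin of error is half the width

z = qnorm(1 - 0.10 / 2)   # z value for 90% confidence

# Solve z * sigma / sqrt(n) = E for n, then round up to a whole observation
n.required = ceiling((z * sigma / E)^2)
print(n.required)
```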

Page 32: Statistics & Data Analysis

Example

A direct-mail company must determine its credit policies very carefully.

The firm suspects that advertisements in a certain magazine have led to an excessively high rate of write-offs.

The firm wants to establish a 90% confidence interval for this magazine’s write-off proportion that is accurate to ± 2.0%
• How many accounts must be sampled to guarantee this goal?
• If this many accounts are sampled and 10% of the sampled accounts are determined to be write-offs, what is the resulting 90% confidence interval?
• What kind of difference do we see by using an observed proportion over a conservative guess?
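The three questions can be explored with the sketch below. It assumes the conservative guess is a proportion of 0.5, which maximizes the standard error, and it uses the normal approximation throughout; all variable names are illustrative.

```r
z = qnorm(1 - 0.10 / 2)    # z value for 90% confidence
E = 0.02                   # desired margin of error (2 percentage points)

# Conservative sample size: a proportion of 0.5 gives the largest standard error
n.conservative = ceiling(z^2 * 0.5 * 0.5 / E^2)

# Interval obtained if 10% of that many sampled accounts are write-offs
p.hat = 0.10
margin = z * sqrt(p.hat * (1 - p.hat) / n.conservative)
ci.90 = p.hat + c(-1, 1) * margin

# Sample size that would have sufficed had 10% been used as the planning value
n.observed = ceiling(z^2 * p.hat * (1 - p.hat) / E^2)

print(n.conservative)
print(round(ci.90, 3))
print(n.observed)
```

The realized margin is smaller than the 2% target because the observed proportion is well away from 0.5.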

Page 33: Statistics & Data Analysis

The t Distribution

Up until now, we have assumed that the population standard deviation is known or that we choose a large enough sample so the sample standard deviation s can replace σ.

Sometimes a large sample is not possible

So far, we’ve based the confidence interval on the z statistic:

$$Z = \frac{\bar{Y} - \mu}{\sigma/\sqrt{n}}$$

Page 34: Statistics & Data Analysis

The t Distribution (cont)

When the population standard deviation is unknown, it must be replaced by the sample standard deviation s

This yields the summary statistic

$$t = \frac{\bar{Y} - \mu}{s/\sqrt{n}}$$

This statistic follows a t-Distribution

Page 35: Statistics & Data Analysis

The t Distribution (cont)

This statistic was derived by W. S. Gosset

Gosset obtained a post as a chemist in the Guinness brewery in Dublin in 1899 and did important work on statistics

He invented the t Distribution to handle small samples for quality control in brewing

He wrote under the name "Student"

Page 36: Statistics & Data Analysis

Properties of the t Distribution

Symmetric about the mean 0

More variable than the z-distribution

[Figure: density curves of the Normal and t distributions plotted from -4 to 4]

Page 37: Statistics & Data Analysis

Properties of the t Distribution (cont)

There are many different t distributions.
• We specify a particular one by its degrees of freedom
• If a random sample is taken from a normal population, then the statistic

$$t = \frac{\bar{Y} - \mu}{s/\sqrt{n}}$$

has a t distribution with d.f. = n-1

As sample size increases, the t-distribution approaches the z-distribution

R functions
• pt(t, df): Cumulative distribution function; $P\left(\frac{\bar{Y}-\mu}{s/\sqrt{n}} \le t\right)$
• qt(p, df): Inverse CDF; the value $t$ for which $P\left(\frac{\bar{Y}-\mu}{s/\sqrt{n}} \le t\right) = p$
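A brief illustration of `pt` and `qt`, showing that they are inverses of each other and that t quantiles shrink toward the normal quantile as the degrees of freedom grow. The particular degrees of freedom are arbitrary choices for the demonstration.

```r
p = 0.975

# Upper 2.5% cutoff for several degrees of freedom, compared with the normal cutoff
for (df in c(2, 7, 30, 1000)) {
  cat("df =", df, " qt =", qt(p, df), "\n")
}
cat("qnorm =", qnorm(p), "\n")

# pt undoes qt: the probability below the cutoff is p again
t.crit = qt(p, df = 7)
check = pt(t.crit, df = 7)
print(check)
```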

Page 38: Statistics & Data Analysis

Degrees of Freedom

Technical definition fairly complex

Intuitively: d.f. refers to the estimated standard deviation and is used to indicate the number of pieces of information available for that estimate

The standard deviation is based on n deviations from the mean, but the deviations must sum to 0, so only n-1 deviations are free to vary

Page 39: Statistics & Data Analysis

Example

How long should you wait before ordering new inventory?
• If you choose too soon, you pay stocking costs
• If you choose too late, you risk stock-outs

Your supplier says goods will arrive in two weeks (10 business days), but you made note of how many business days it actually took: 10, 9, 7, 10, 3, 9, 12, 5

Calculate the sample mean, standard deviation, and standard error for this sample

What is the probability a shipment takes more than two weeks?
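The summary statistics requested can be computed directly in R. This sketch covers only the mean, standard deviation, and standard error; the variable names are illustrative.

```r
days = c(10, 9, 7, 10, 3, 9, 12, 5)   # observed delivery times in business days

n = length(days)
y.bar = mean(days)         # sample mean
s = sd(days)               # sample standard deviation (divides by n - 1)
std.err = s / sqrt(n)      # standard error of the sample mean

print(c(mean = y.bar, sd = s, se = std.err))
```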

Page 40: Statistics & Data Analysis

Confidence Intervals for the t Distribution

100(1 − α)% CONFIDENCE INTERVAL FOR μ WITH σ UNKNOWN

Using the sample mean as an estimate of the population mean, allow for sampling error with a plus-or-minus term equal to a t-table value times the standard error of the sample mean:

$$\bar{y} - t_{\alpha/2}\, s/\sqrt{n} \le \mu \le \bar{y} + t_{\alpha/2}\, s/\sqrt{n}$$

where $t_{\alpha/2}$ is the tabulated t value cutting off a right-tail area of α/2 with n − 1 d.f.

Page 41: Statistics & Data Analysis

Example

How long should you wait before ordering new inventory?
• If you choose too soon, you pay stocking costs
• If you choose too late, you risk stock-outs

Your supplier says goods will arrive in two weeks (10 business days), but you made note of how many business days it actually took: 10, 9, 7, 10, 3, 9, 12, 5

Calculate a 95% confidence interval for the mean delivery time
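A sketch of the t-based interval for the delivery times, built by hand from the t-table value and then compared with base R's `t.test`, which reports the same kind of interval. Variable names are illustrative.

```r
days = c(10, 9, 7, 10, 3, 9, 12, 5)
n = length(days)

t.crit = qt(0.975, df = n - 1)    # t value with n - 1 degrees of freedom
ci.95 = mean(days) + c(-1, 1) * t.crit * sd(days) / sqrt(n)

# Cross-check against the interval computed by t.test
ci.check = as.numeric(t.test(days, conf.level = 0.95)$conf.int)

print(round(ci.95, 2))
print(round(ci.check, 2))
```

Note that the supplier's quoted 10 business days falls inside the interval.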

Page 42: Statistics & Data Analysis

Assumptions

Assumptions needed for validity of the Confidence Interval

1. Data are a RANDOM SAMPLE from the population of interest
• (So that the sample can tell you about the population)

2. The sample average is approximately NORMAL
• Either the data are normal (check the histogram)
• Or the central limit theorem applies:
  – Large enough sample size n, distribution not too skewed
• (So that the t table is technically appropriate)

Page 43: Statistics & Data Analysis

Linear Regression and Correlation Methods

Chapter 11

Page 44: Statistics & Data Analysis

Chapter Goals

Introduction to Bivariate Data Analysis
• Introduction to Simple Linear Regression Analysis
• Introduction to Linear Correlation Analysis

Interpret scatter plots

Page 45: Statistics & Data Analysis

Motivating Example

Before a pharmaceutical sales rep can speak about a product to physicians, he must pass a written exam

An HR Rep designed such a test with the hopes of hiring the best possible reps to promote a drug in a high potential area

In order to check the validity of the test as a predictor of weekly sales, he chose 5 experienced sales reps and piloted the test with each one

The test scores and weekly sales are given in the following table:

Page 46: Statistics & Data Analysis

Motivating Example (cont)

SALESPERSON   TEST SCORE   WEEKLY SALES
JOHN          4            $5,000
BRENDA        7            $12,000
GEORGE        3            $4,000
HARRY         6            $8,000
AMY           10           $11,000

Page 47: Statistics & Data Analysis

Introduction to Bivariate Data

Up until now, we’ve focused on univariate data

Analyzing how two (or more) variables “relate” is very important to managers
• Prediction equations
• Estimate uncertainty around a prediction
• Identify unusual points
• Describe relationship between variables

Visualization
• Scatterplot

Page 48: Statistics & Data Analysis

Scatterplot

Do Test Score and Weekly Sales appear related?

[Figure: scatterplot of sales (4000 to 12000) against score (3 to 10)]

Page 49: Statistics & Data Analysis

Correlation

Boomers' Little Secret Still Smokes Up the Closet
July 14, 2002

…Parental cigarette smoking, past or current, appeared to have a stronger correlation to children's drug use than parental marijuana smoking, Dr. Kandel said. The researchers concluded that parents influence their children not according to a simple dichotomy — by smoking or not smoking — but by a range of attitudes and behaviors, perhaps including their style of discipline and level of parental involvement. Their own drug use was just one component among many…

A Bit of a Hedge to Balance the Market Seesaw
July 7, 2002

…Some so-called market-neutral funds have had as many years of negative returns as positive ones. And some have a high correlation with the market's returns…

Page 50: Statistics & Data Analysis

Correlation Analysis

Statistical techniques used to measure the strength of the relationship between two variables

Correlation Coefficient: describes the strength of the relationship between two sets of variables
• Denoted r
• r assumes a value between –1 and +1
• r = -1 or r = +1 indicates a perfect correlation
• r = 0 indicates no linear relationship between the two sets of variables
• Direction of the relationship is given by the coefficient’s sign
• Strength of relationship does not depend on the direction
• r means LINEAR relationship ONLY
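The point that r captures only linear relationships can be seen with made-up data. In the sketch below, which uses invented values rather than anything from the slides, one variable is an exact straight-line function of x and another is an exact quadratic function of x.

```r
x = -5:5                 # invented data, symmetric around zero

y.linear = 2 * x + 1     # exact straight-line relationship
y.curved = x^2           # exact relationship, but not linear

r.linear = cor(x, y.linear)
r.curved = cor(x, y.curved)

print(r.linear)
print(r.curved)
```

The quadratic variable is completely determined by x, yet its correlation with x is essentially zero because the relationship is not linear.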

Page 51: Statistics & Data Analysis

Example Correlations

[Figure: six scatterplots with r = -0.9, r = -0.73, r = -0.25, r = 0.34, r = 0.7, and r = 0.88]

Correlation Demo

Page 52: Statistics & Data Analysis

Scatterplot

r = 0.88

[Figure: scatterplot of sales (4000 to 12000) against score (3 to 10)]
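The reported correlation can be reproduced from the table of test scores and weekly sales using base R's `cor`. The vector names `score` and `sales` follow the axis labels of the plot.

```r
score = c(4, 7, 3, 6, 10)                     # test scores
sales = c(5000, 12000, 4000, 8000, 11000)     # weekly sales in dollars

r = cor(score, sales)     # Pearson correlation coefficient
print(round(r, 2))
```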

Page 53: Statistics & Data Analysis

Correlation and Causation

Must be very careful in interpreting correlation coefficients

Just because two variables are highly correlated does not mean that one causes the other
• Ice cream sales and the number of shark attacks on swimmers are correlated
• The miracle of the "Swallows" of Capistrano takes place each year at the Mission San Juan Capistrano, on March 19th and is accompanied by a large number of human births around the same time
• The number of cavities in elementary school children and vocabulary size have a strong positive correlation.

To establish causation, a designed experiment must be run

CORRELATION DOES NOT IMPLY CAUSATION

Page 54: Statistics & Data Analysis

Regression Analysis

Simple Regression Analysis is predicting one variable from another
• Past data on relevant variables are used to create and evaluate a prediction equation

Variable being predicted is called the dependent variable

Variable used to make prediction is an independent variable

Page 55: Statistics & Data Analysis

Introduction to Regression

Predicting future values of a variable is a crucial management activity
• Future cash flows
• Needs for raw materials in a supply chain
• Future personnel or real estate needs

Explaining past variation is also an important activity
• Explain past variation in demand for services
• Impact of an advertising campaign or promotion

Page 56: Statistics & Data Analysis

Introduction to Regression (cont.)

Prediction: Reference to future values

Explanation: Reference to current or past values

Simple Linear Regression: Single independent variable predicting a dependent variable
• Independent variable is typically something we can control
• Dependent variable is typically something that is linearly related to the value of the independent variable

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$$

Page 57: Statistics & Data Analysis

Introduction to Regression (cont.)

Basic Idea: Fit a straight line that relates dependent variable (y) and independent variable (x)

Linearity Assumption: Slope of the equation does not change as x changes

Assuming linearity we can write

$$y = \beta_0 + \beta_1 x + \varepsilon$$

which says that Y is made up of a predictable part (due to X) and an unpredictable part

Coefficients are interpreted as the true, underlying intercept and slope

Page 58: Statistics & Data Analysis

Regression Assumptions

We start by assuming that for each value of X, the corresponding value of Y is random, and has a normal distribution.

Page 59: Statistics & Data Analysis

Which Line?

There are many good fitting lines through these points

http://www.ruf.rice.edu/~lane/stat_sim/reg_by_eye/index.html

[Figure: scatterplot of sales (4000 to 12000) against score (3 to 10)]

Page 60: Statistics & Data Analysis

Least Squares Principle

This method gives a best-fitting straight line by minimizing the sum of the squares of the vertical deviations about the line

Regression Coefficient Interpretations:
• $\hat{\beta}_0$: Y-Intercept; estimated value of Y when X = 0
• $\hat{\beta}_1$: Slope of the line; average change in predicted value of Y for each change of one unit in the independent variable X

Page 61: Statistics & Data Analysis

Least Square Estimates

$$\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}}; \quad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$

where

$$S_{xx} = \sum_i (x_i - \bar{x})^2$$

$$S_{xy} = \sum_i (x_i - \bar{x})(y_i - \bar{y})$$

Page 62: Statistics & Data Analysis

Back to the Example

[Figure: scatterplot of y (4000 to 12000) against x (3 to 10) with the fitted least squares line]

y = 1133.33 x + 1200.00

simple.lm(score, sales)
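The fitted line can be checked by applying the least squares formulas by hand and comparing with base R's `lm`. The call `simple.lm` above is not part of base R and presumably comes from an add-on package used with the course, so `lm` is substituted here as an assumption; the names `Sxx`, `Sxy`, `b0`, and `b1` are illustrative.

```r
score = c(4, 7, 3, 6, 10)
sales = c(5000, 12000, 4000, 8000, 11000)

# Least squares estimates from the sums of squares and cross products
Sxx = sum((score - mean(score))^2)
Sxy = sum((score - mean(score)) * (sales - mean(sales)))
b1 = Sxy / Sxx                          # slope
b0 = mean(sales) - b1 * mean(score)     # intercept

# The same fit using lm, with sales as the dependent variable
fit = lm(sales ~ score)

print(c(intercept = b0, slope = b1))
print(coef(fit))
```

Both routes should agree, giving a slope of roughly 1133 dollars of weekly sales per test point.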

Page 63: Statistics & Data Analysis


Back to the Example

Page 64: Statistics & Data Analysis

Example

It is well known that the more beer you drink, the more your blood alcohol level rises.

However, how much it rises per additional beer is not clear.

Calculate the correlation coefficient

Perform a regression analysis

Student   1      2      3      4      5      6      7      8      9      10
Beers     5      2      9      8      3      7      3      5      3      5
BAL       0.100  0.030  0.190  0.120  0.040  0.095  0.070  0.060  0.020  0.050
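A minimal sketch of both requested calculations using base R. The vector names `beers` and `bal` are illustrative, and `lm` is used for the regression as an assumption about tooling.

```r
beers = c(5, 2, 9, 8, 3, 7, 3, 5, 3, 5)
bal = c(0.100, 0.030, 0.190, 0.120, 0.040, 0.095, 0.070, 0.060, 0.020, 0.050)

r = cor(beers, bal)       # correlation coefficient
fit = lm(bal ~ beers)     # regress blood alcohol level on number of beers

print(round(r, 3))
print(coef(fit))
```

The slope estimates the average rise in blood alcohol level per additional beer.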

Page 65: Statistics & Data Analysis

Homework #6

Hildebrand/Ott
• HO: 7.1, page 204
• HO: 7.2, page 204-205
• HO: 7.14, page 211
• HO: 7.17, page 211
• HO: 7.18, page 211
• HO: 7.20, page 214
• HO: 7.21, page 214
• HO: 7.30, page 218
• HO: 7.39, page 229
• HO: 7.74, page 244

Verzani
• 13.4 – first part (do not test the hypothesis). Provide an interpretation.