Biostat 200 Lecture 5

Biostat 200 Lecture 5 1. Normal distribution If X ~ N(µ,σ) then Z=(X- µ)/σ – To find the proportion of values in the population that exceed some threshold,

Jan 13, 2016


Adrian Casey
Page 1

Biostat 200 Lecture 5

Page 2

Normal distribution

• If X ~ N(µ,σ) then Z = (X - µ)/σ ~ N(0,1)
  – To find the proportion of values in the population that exceed some threshold, i.e. P(X>x)
    • Transform x to z = (x - µ)/σ and look up P(Z>z) using display 1-normal(z)
  – To find the percentiles of the distribution
    • E.g. to find the 97.5th percentile, the value for which 97.5% of the values are smaller, find z by using display invnormal(.975) and solve z = (x - µ)/σ for x, i.e. x = µ + zσ
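As a rough cross-check outside Stata, the two lookups above can be sketched in Python with the standard library's statistics.NormalDist. The population values (µ=100, σ=15) and the threshold x=130 are made up for illustration and do not come from the lecture.

```python
from statistics import NormalDist

# Hypothetical population: mu, sigma and the threshold x are illustrative only.
mu, sigma = 100.0, 15.0
x = 130.0

std_normal = NormalDist()  # Z ~ N(0, 1)

# P(X > x): standardize, then take the upper tail (Stata: display 1-normal(z)).
z = (x - mu) / sigma
p_exceed = 1 - std_normal.cdf(z)

# 97.5th percentile: find z (Stata: display invnormal(.975)), then solve for x.
z_975 = std_normal.inv_cdf(0.975)
x_975 = mu + z_975 * sigma

print(f"z = {z:.2f}, P(X > {x}) = {p_exceed:.4f}")
print(f"z_0.975 = {z_975:.4f}, 97.5th percentile = {x_975:.2f}")
```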

Page 3

• For the normal distribution
  – Stata calculates P(Z<z)
  – If you have z and you want p, use display normal(z)
  – If you have p and you want z, use display invnormal(p)
• For the t distribution
  – Stata calculates P(T>t)
  – If you have t and you want p, use display ttail(df,t)
  – If you have p and you want t, use display invttail(df,p)

[Figures: standard normal distribution with P(Z<1.96) highlighted; standard normal distribution; t distribution. x axis from -5 to 5.]

Page 4

• But usually you want to summarize the values, and make inference about the mean
• The Central Limit Theorem says that, if n is large enough, the distribution of a sample mean X̄ is approximately normal with mean µ and standard deviation σ/√n

Page 5

Imagine that we did the following
1. Draw a sample of size n from a population
2. Calculate the sample mean
3. Save the sample mean as an observation in the data set
4. Repeat steps 1-3 many times, enough to draw a reasonable histogram

The histogram will show you the distribution of the sample means

If the samples themselves were large enough (n), that distribution of the sample means will appear normal
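The thought experiment above can be sketched as a small simulation. This is only an illustration under stated assumptions: the population is taken to be days of the month drawn uniformly from 1 to 31, and n=8, 5000 repetitions, and the seed are arbitrary choices, not values from the lecture.

```python
import random
import statistics

def simulate_sample_means(n, reps, seed=0):
    """Steps 1-4: draw `reps` samples of size n and keep each sample mean."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    means = []
    for _ in range(reps):
        sample = [rng.randint(1, 31) for _ in range(n)]  # step 1 (assumed population)
        means.append(statistics.fmean(sample))           # steps 2 and 3
    return means

n = 8
means = simulate_sample_means(n=n, reps=5000)

# Population mean and SD of a uniform draw from 1..31.
population = list(range(1, 32))
pop_mean = statistics.fmean(population)
pop_sd = statistics.pstdev(population)

print(f"mean of sample means: {statistics.fmean(means):.2f} (population mean {pop_mean:.2f})")
print(f"SD of sample means:   {statistics.stdev(means):.2f} (sigma/sqrt(n) = {pop_sd / n ** 0.5:.2f})")
```

A histogram of `means` should look roughly bell-shaped, and its spread should be close to σ/√n.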

Page 6

[Figure: graphs of averages of days of birth. Four density histograms: Day of birth of student (x axis: day of month); Avg of student + 1 relative dob; Avg of student + 3 relative dob; Avg of student + 7 relative dob (x axis: avg day of month).]

Page 7

Confidence intervals

• Using the central limit theorem, we can make probability statements about values of X̄
• These probability statements can be used to construct confidence intervals
• Confidence intervals for means are statements about the probability that a given interval contains the true population mean µ
• Confidence intervals can be constructed for any point estimate, e.g., odds ratios, hazard ratios, correlations, regression coefficients, standard deviations, …
• The methods for constructing these will vary

Page 8

• The width of the confidence interval depends on the confidence level (1-α) and n
• If σ is unknown, we use the t distribution and the sample standard deviation s in our computation

Page 9

Confidence intervals for means

We know from the normal distribution that P(-1.96 ≤ Z ≤ 1.96) = 0.95

Substituting the formula for Z into the above we get

\[ P\left(-1.96 \le \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le 1.96\right) = 0.95 \]

After rearranging, the left and right are the confidence limits

\[ P\left(\bar{X} - 1.96\,\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + 1.96\,\frac{\sigma}{\sqrt{n}}\right) = 0.95 \]

And if σ is unknown,

\[ P\left(\bar{X} - t_{df,.025}\,\frac{s}{\sqrt{n}} \le \mu \le \bar{X} + t_{df,.025}\,\frac{s}{\sqrt{n}}\right) = 0.95 \]

Pagano and Gauvreau, Chapter 9
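A minimal sketch of the known-σ interval, X̄ ± 1.96·σ/√n, in Python. The function name and the inputs (sample mean 120, σ=15, n=25) are hypothetical and chosen only to make the arithmetic easy to follow; the t-based interval needs t quantiles, which this sketch does not cover.

```python
from math import sqrt
from statistics import NormalDist

def mean_ci_known_sigma(xbar, sigma, n, level=0.95):
    """Return (lower, upper) limits for mu when sigma is known."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)  # about 1.96 for a 95% interval
    half_width = z * sigma / sqrt(n)
    return xbar - half_width, xbar + half_width

# Hypothetical inputs, not from the lecture data.
lower, upper = mean_ci_known_sigma(xbar=120.0, sigma=15.0, n=25)
print(f"95% CI: ({lower:.2f}, {upper:.2f})")
```

With these inputs the standard error is 15/√25 = 3, so the half-width is about 1.96 × 3 = 5.88.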

Page 10

Confidence intervals for proportions

• As n increases, the binomial distribution approaches the normal distribution
• The normal approximation: as n increases, X ~ N( np, √(np(1-p)) )
  – X is the total count of the successes
  – np is the mean of the binomial
  – √(np(1-p)) is the standard deviation of the binomial
• X/n = p̂ ~ N( p, √(p(1-p)/n) )
  – X/n is the proportion of successes
  – p is the population proportion
  – √(p(1-p)/n) is the standard deviation of X/n if X is binomial
• So we can use the normal approximation to calculate confidence intervals for p

Page 11

Confidence intervals for proportions

• But if n is small or p is small or both, then the normal approximation is not good and the intervals using the normal approximation are not good
• We need to use the binomial distribution
• Confidence intervals using the binomial distribution are called exact confidence intervals
• Exact confidence intervals are not symmetric (because neither is the binomial for small n or small p)

Page 12

[Figure: binomial probability versus number of successes, eight panels: n=10, 20, 50, and 100 with p=0.05, and n=10, 20, 50, and 100 with p=0.35.]

For p=0.05, even for large n, the distribution of X doesn’t exactly look normal

For p=0.35, the distribution does look fairly normal at small sample sizes

Page 13

E.g. Proportion of smokers in our class data set who smoke 10 or more cigarettes/day

. gen cigs10=1 if cigs>=10 & cigs != .
(541 missing values generated)

. replace cigs10=0 if cigs<10
(23 real changes made)

. label values cigs10 yesnol

. tab cigs10

     cigs10 |      Freq.     Percent        Cum.
------------+-----------------------------------
         no |         23       82.14       82.14
        yes |          5       17.86      100.00
------------+-----------------------------------
      Total |         28      100.00

. ci cigs10

    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
-------------+---------------------------------------------------------------
      cigs10 |         28    .1785714     .073707        .0273371    .3298058

. ci cigs10, binomial

                                                         -- Binomial Exact --
    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
-------------+---------------------------------------------------------------
      cigs10 |         28    .1785714    .0723789        .0606429    .3689333

How do these confidence intervals differ?
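To see where an exact interval comes from, here is a sketch in Python of the Clopper-Pearson construction, applied to the 5 of 28 smokers above. It rests on the assumption that Stata's "Binomial Exact" interval is the Clopper-Pearson interval. The helper names are hypothetical, and the bisection solver is a simple stand-in for a proper root finder.

```python
from math import comb, sqrt
from statistics import NormalDist

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def bisect(f, lo, hi, tol=1e-10):
    """Root of f on [lo, hi], assuming f(lo) and f(hi) have opposite signs."""
    f_lo_positive = f(lo) > 0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if (f(mid) > 0) == f_lo_positive:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def exact_ci(x, n, level=0.95):
    """Clopper-Pearson limits: each tail probability is set to alpha/2."""
    alpha = 1 - level
    # Lower limit: the p for which P(X >= x) = alpha/2.
    lower = 0.0 if x == 0 else bisect(
        lambda p: (1 - binom_cdf(x - 1, n, p)) - alpha / 2, 0.0, 1.0)
    # Upper limit: the p for which P(X <= x) = alpha/2.
    upper = 1.0 if x == n else bisect(
        lambda p: binom_cdf(x, n, p) - alpha / 2, 0.0, 1.0)
    return lower, upper

def normal_approx_ci(x, n, level=0.95):
    """Symmetric interval p_hat +/- z * sqrt(p_hat(1 - p_hat)/n)."""
    p_hat = x / n
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    half = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half

print("exact:        ", exact_ci(5, 28))
print("normal approx:", normal_approx_ci(5, 28))
```

The normal-approximation interval is symmetric around 5/28, while the exact interval is shifted to the right. Note that the plain `ci cigs10` output above uses a t-based interval for a mean, so its limits differ slightly from the z-based interval sketched here.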

Page 14

E.g. proportion of smokers in the sample

. ci smoke

    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
-------------+---------------------------------------------------------------
       smoke |        546    .0531136    .0096062        .0342438    .0719833

. ci smoke, binomial

                                                         -- Binomial Exact --
    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
-------------+---------------------------------------------------------------
       smoke |        546    .0531136    .0095974        .0358559    .0753919

• How do these confidence intervals differ?

Page 15

Hypothesis testing

• Confidence intervals tell us about the uncertainty in a point estimate of the population mean or proportion (or some other entity)
• Hypothesis testing allows us to draw a conclusion about a population parameter that we have estimated in a sample
• Remember, we are taking samples from the population and we never know the truth

Pagano and Gauvreau, Chapter 10

Page 16

Hypothesis testing

• To conduct a hypothesis test about the mean of a population, we postulate a value for it, and call that value µ0.
  – We write this H0 : µ = µ0
  – We call this the null hypothesis
• We set an alternative hypothesis for all other values of µ
  – We write this: HA : µ ≠ µ0
  – We call this the alternative hypothesis

Pagano and Gauvreau, Chapter 10

Page 17

Hypothesis testing for a mean

• After formulating the hypotheses, draw a random sample
• Compare the mean of the random sample, X̄, to the hypothesized mean µ0
• Is the difference between the sample mean X̄ and the hypothesized population mean µ0 likely due to chance (remember you have taken a sample), or too large to be attributed to chance?
• If too large to be due to chance, we say we have statistical significance

Page 18

• Just like a 95% confidence interval has a 5% probability of not including the population parameter (e.g. the mean or the proportion), there is some probability that we will incorrectly reject the null hypothesis
• Like confidence intervals, this is because we are taking a sample and do not know the true distribution
  – We set that probability in advance, before collecting the data and doing the analysis
  – That probability is called the significance level
  – The significance level is denoted by α, and is frequently set to 0.05.

Page 19

• Criminal trials in the US – innocent until proven guilty
• In hypothesis testing, we assume the null hypothesis is true, and only reject the null if there is enough evidence that the sample did not come from the hypothesized population (e.g. that the mean is not µ0)
• In jury trials there is a chance that someone who is innocent is found guilty, but our method of innocent until proven guilty is in place to minimize this
• In hypothesis testing, a true null hypothesis might be mistakenly rejected
  – This mistake is called a Type I error
  – The probability of this mistake is chosen in advance, and is the significance level, which is referred to as α

Page 20

• The probability of obtaining a mean as extreme as or more extreme than the observed sample mean X̄, given that the null hypothesis is true, is called the p-value of the test, or p.

Page 21

Terminology

• Before the test, set the significance level, α
  – This is the probability of rejecting the null hypothesis when it is true (also called a Type I error)
• When you run the test, you get a probability of observing a mean as extreme as or more extreme than the one you observed, given that the null hypothesis is true
  – This probability is the p-value
• If the p-value is less than α (e.g. α=0.05 and p=0.013), then the test is called statistically significant
  – If the p-value is ≥ α, the test is called “not statistically significant”

Page 22

Hypothesis testing

• 3 steps
  – Specify the null and alternative hypothesis
    • E.g. null: Is the level of CD4 count in a sample of new HIV positives in Uganda below 200 cells/mm³?
  – Determine the compatibility of the data with the null hypothesis
    • We do this by using the data to calculate a test statistic that will be compared to a theoretical statistical distribution, e.g., the standard normal distribution or the t-distribution
    • If the test statistic is very large, then our data are very unlikely under the null hypothesis
  – Either:
    • Reject the null OR
    • Fail to reject the null

Pagano and Gauvreau, Chapter 10

Page 23

1. State the hypotheses

• In epidemiology the null hypothesis is often that there is no association between the exposure and the outcome, e.g.,
  – The difference in disease risk=0
    • R_exposed - R_not exposed = 0
  – The ratio of risk/prevalence/etc=1
    • R_exposed / R_not exposed = 1
  – Clinical trials: The mean value of something does not differ between two groups
• The alternative hypothesis is usually an association or risk difference
  – E.g.,
    • R_exposed - R_not exposed > 0
    • R_exposed / R_not exposed > 1
    • The mean response in the treated group is greater than in the untreated group
• H0 (the null) and HA (the alternative) must be mutually exclusive (no overlap) and include all the possibilities

Pagano and Gauvreau, Chapter 10

Page 24

2. Determine the compatibility with the null.

• We determine if there is enough evidence to reject the null hypothesis (i.e. calculate the test statistic and look up the p-value)
  – If the null is true, what is the probability of obtaining sample data as extreme or more extreme?
  – This probability is called the p-value

Pagano and Gauvreau, Chapter 10

Page 25

3. Reject or fail to reject

• If the p-value (the probability of obtaining sample data as extreme or more extreme if the null is true) is sufficiently small (we often use 5%) then we reject the null and say the test was statistically significant.
• If we fail to reject the null, it does not mean that we accept it.
  – We might fail to reject the null if the sample size is too small or the variability in the sample was too large

Pagano and Gauvreau, Chapter 10

Page 26

Hypothesis testing

[Figure: number line from 0 to 1 marking α, the significance level, set a priori, and the p-value, the result of your statistical test]

• If p<α, reject the null: the test is statistically significant
• If p≥α, do not reject the null: the test is not statistically significant

Pagano and Gauvreau, Chapter 10

Page 27

Tests of one mean

• We want to test whether a mean is equal to some hypothesized value
• If we believe there might be deviations in either direction, we use a two-sided test
• Two sided test:
  – Null hypothesis: H0: μ=μ0
  – Alternative hypothesis: HA: μ≠μ0
• If we only care about values above or below a certain value, we use a one-sided test
• One sided test:
  – Null hypothesis: H0: μ≥μ0
  – Alternative hypothesis: HA: μ<μ0
  or
  – Null hypothesis: H0: μ≤μ0
  – Alternative hypothesis: HA: μ>μ0

Pagano and Gauvreau, Chapter 10

Page 28

Lexicon

• For one sided tests, people often say they are testing the hypothesis that the mean is less than or more than xxx (μ0). When they say this they are usually stating the alternative hypothesis.

• For two sided tests, people often say they are testing the hypothesis that the mean is μ0. When they say this they are stating the null hypothesis.

• Since you know that the null is the complement of the alternative, you don’t usually state both in practice (but we will for this class)

Page 29

Tests of one mean

• The distribution of a sample mean, if n is large enough, is normal. For a normally distributed random variable, calculate the Z statistic

\[ Z_{stat} = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} \]

• If the standard deviation (σ) is not known, calculate the t-statistic

\[ t_{stat} = \frac{\bar{X} - \mu_0}{s/\sqrt{n}} \]

• Compare your test statistic to the appropriate distribution to get your p-value
  – Find P(Z>zstat) or P(T>tstat) for a one-sided hypothesis test
  – Find 2*P(Z>|zstat|) or 2*P(T>|tstat|) for a two-sided hypothesis test

Pagano and Gauvreau, Chapter 10

Page 30

Test of one mean, one sided

Non-pneumatic anti-shock garment for the treatment of obstetric hemorrhage in Nigeria

• Mean initial blood loss of 83 women: 1413.1 ml, SD=491.3
• Our question: Are these women hemorrhaging (>750 ml blood loss)?
  – H0: μ≤750   HA: μ>750   α=0.05
• T-statistic:

\[ t_{stat} = \frac{1413.1 - 750}{491.3/\sqrt{83}} = 12.3 \]

• p-value: P(T>12.3)

Data adapted from Miller, S et al. Int J Gynecol Obstet (2009)
Pagano and Gauvreau, Chapter 10
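The arithmetic behind the t-statistic above can be checked with a few lines of Python. The function name is hypothetical; the inputs are the summary statistics given on the slide.

```python
from math import sqrt

def one_sample_t(xbar, s, n, mu0):
    """Return the one-sample t statistic and its degrees of freedom."""
    standard_error = s / sqrt(n)
    return (xbar - mu0) / standard_error, n - 1

t_stat, df = one_sample_t(xbar=1413.1, s=491.3, n=83, mu0=750)
print(f"t = {t_stat:.4f} on {df} degrees of freedom")
```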

Page 31

Test of one mean

• 12.3 is off the graph
• So the probability of observing a sample mean of 1413 with n=83 and SD=491.3 if the true mean is ≤750 is very, very low

Pagano and Gauvreau, Chapter 10

Page 32

Test of one mean, one sided hypothesis, Stata

P(T>test stat)

. display ttail(82, 12.3)
1.308e-20

We reject the null

[Figure: t distribution, x axis from -5 to 5, showing the upper tail P(T>t). Use ttail(df,tstat)]

Pagano and Gauvreau, Chapter 10

Page 33

Test of one mean, one sided hypothesis, Stata

Another way: Stata immediate code for ttests:

ttesti samplesize samplemean samplesd hypothesizedmean

. ttesti 83 1413.1 491.3 750

One-sample t test
------------------------------------------------------------------------------
         |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       x |      83      1413.1    53.92718       491.3    1305.822    1520.378
------------------------------------------------------------------------------
    mean = mean(x)                                                t =  12.2962
Ho: mean = 750                                   degrees of freedom =       82

    Ha: mean < 750               Ha: mean != 750               Ha: mean > 750
 Pr(T < t) = 1.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 0.0000

We reject the null

Pagano and Gauvreau, Chapter 10

Page 34

Test of one mean, two sided hypothesis

Example: Do children with congenital heart disease walk at a different age than healthy children? Healthy children start walking at mean age of 11.4 months
  – Null hypothesis: H0: μ = 11.4 months (μ0)
  – Alternative hypothesis: HA: μ ≠ 11.4 months
  – Significance level α=0.05
  – Data: Sample of children with congenital heart defects, n=9, sample mean=12.8, sample SD=2.4

Pagano and Gauvreau, Chapter 10

Page 35

Calculate the test statistic and its associated p value

\[ t_{stat} = \frac{12.8 - 11.4}{2.4/\sqrt{9}} = 1.75 \]

• P value is P(T<-1.75) + P(T>1.75) = 2*P(T>1.75)

. display ttail(8,1.75)*2
.11823278

Or run ttesti in Stata

. ttesti 9 12.8 2.4 11.4

One-sample t test
------------------------------------------------------------------------------
         |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       x |       9        12.8          .8         2.4     10.9552     14.6448
------------------------------------------------------------------------------
    mean = mean(x)                                                t =   1.7500
Ho: mean = 11.4                                  degrees of freedom =        8

   Ha: mean < 11.4              Ha: mean != 11.4              Ha: mean > 11.4
 Pr(T < t) = 0.9409         Pr(|T| > |t|) = 0.1182          Pr(T > t) = 0.0591

Fail to reject the null
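For readers without Stata, the two-sided p-value above can be approximated in plain Python by integrating the t density numerically. This is a sketch, not how Stata computes ttail: the function names are hypothetical, and Simpson's rule is adequate for moderate t values like 1.75 but not for extreme tails such as ttail(82, 12.3).

```python
from math import gamma, pi, sqrt

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    const = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return const * (1 + t * t / df) ** (-(df + 1) / 2)

def t_upper_tail(df, t, steps=2000):
    """Approximate P(T > t) using Simpson's rule on [0, |t|] (steps must be even)."""
    a = abs(t)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_0_to_a = total * h / 3
    upper = 0.5 - area_0_to_a          # by symmetry, P(T > 0) = 0.5
    return upper if t >= 0 else 1 - upper

# Summary statistics from the walking-age example.
n, xbar, s, mu0 = 9, 12.8, 2.4, 11.4
t_stat = (xbar - mu0) / (s / sqrt(n))
p_one_sided = t_upper_tail(n - 1, t_stat)
p_two_sided = 2 * t_upper_tail(n - 1, abs(t_stat))

print(f"t = {t_stat:.4f}")
print(f"P(T > t) = {p_one_sided:.4f}, two-sided p = {p_two_sided:.4f}")
```

The results should agree with the Pr(T > t) and Pr(|T| > |t|) entries in the ttesti output to about four decimal places.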

Page 36

Hypothesis testing

• One sided hypotheses: HA: μ<μ0 or HA: μ>μ0
• For a one sided hypothesis, you calculate either P(Z<z) or P(Z>z) for the above alternative hypotheses respectively.
• You are only looking at one tail of the distribution
  – For P(Z<z) to be <0.05, z must be < -1.645

    . di invnormal(.05)
    -1.6448536

  – For P(Z>z) to be <0.05, your test statistic z must be >1.645

    . di invnormal(.95)
    1.6448536

Page 37

Hypothesis testing

• Two sided hypotheses: HA: μ≠μ0
  – For a two sided hypothesis, you are considering the probability that μ>μ0 or μ<μ0, so you calculate P(Z>z) or P(Z<-z), which is both tails of the distribution
  – This is then 2*P(Z>z)
  – For 2*P(Z>z) to be <0.05, your test statistic z must be >1.96

    . di invnormal(.975)
    1.959964

[Figure: standard normal distribution, x axis from -5 to 5]

Page 38

Hypothesis testing

• So at the same significance level, here α=0.05, less evidence is needed to reject the null for a one sided test as compared to a two sided test
• So a two sided test is more conservative
• What if you ran a test and got z=1.83? If your hypothesis was one-sided, you’d reject. If your hypothesis was two-sided, you’d fail to reject.
• Therefore it is very important to specify your test a priori!
• For clinical trials, most people use two-sided hypotheses
  – That way no one will suspect you of changing your hypothesis test a posteriori
  – On the other hand, does it really make sense to have an alternative hypothesis that a new treatment is either better or worse than the old one?
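The z=1.83 example can be checked numerically. A small Python sketch, using the standard library's statistics.NormalDist in place of Stata's normal(); the variable names are illustrative.

```python
from statistics import NormalDist

alpha = 0.05
z = 1.83

# One-sided p-value for HA: mu > mu0, and the two-sided p-value.
p_one_sided = 1 - NormalDist().cdf(z)
p_two_sided = 2 * p_one_sided

print(f"one-sided p = {p_one_sided:.4f}, reject: {p_one_sided < alpha}")
print(f"two-sided p = {p_two_sided:.4f}, reject: {p_two_sided < alpha}")
```

The same test statistic falls just under α for the one-sided test and just over it for the two-sided test, which is why the hypothesis has to be fixed before looking at the data.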

Page 39

Inference on proportions

• Variables that are measured as successes or failures, disease or no disease, etc., can be considered binomial random variables
  – x=number of successes
  – n=number of “trials”
  – x/n = proportion diseased

Pagano and Gauvreau, Chapter 14

Page 40

Hypothesis test of a proportion

• Under the CLT, the sampling distribution of an estimated proportion p̂, if n is large enough, is approximately a normal distribution with mean p and standard deviation=√(p(1-p)/n)

Pagano and Gauvreau, Chapter 14

Page 41

Hypothesis test of a proportion

• Therefore we can test whether a population proportion p is equal to, greater than, or less than some hypothesized p0 by comparing the sample proportion p̂ to p0 using the Z statistic

\[ z_{stat} = \frac{\hat{p} - p_0}{\sqrt{p_0(1 - p_0)/n}} \]

Pagano and Gauvreau, Chapter 14

Page 42

Hypothesis test of a proportion

• Example: Micronutrient intake of black women in South Africa
• Study: Pre-menopausal women randomly selected based on geographic location
• Micronutrient intake was determined using a Quantitative Food Frequency Questionnaire developed for South Africans
• Question: Do more than 10% of the women have a micronutrient intake of less than the RDA?

Pagano and Gauvreau, Chapter 14
Data adapted from Hattingh Z. et al, West Indian Med J 2008: 57 (5):431

Page 43

Hypothesis test of a proportion

• Null hypothesis:
  – 10% or fewer of the women aged 25-34 have dietary folate levels of less than 268 µg (a cutoff based on RDA)
  – H0: p ≤ 0.10
• Alternative hypothesis:
  – More than 10% of the women aged 25-34 have dietary folate levels of less than 268 µg
  – HA: p > 0.10
• Significance level set at α = 0.05

Pagano and Gauvreau, Chapter 10

Page 44

Hypothesis test of a proportion

• The data: 158/279=56.6% had folate levels of less than the cutoff, 268 µg
• Using the normal approximation to the binomial distribution (formula on slide 40):
  – Z = (0.566-0.10)/√(0.10*0.90/279) = 25.9
  – The chance of observing a z statistic this large (25.9) if the null is true is very, very small (<0.05)
  – So we reject the null hypothesis

Pagano and Gauvreau, Chapter 14

Page 45

Hypothesis testing of a proportion

prtesti samplesize observedp hypothp

. prtesti 279 .566 .1

One-sample test of proportion                      x: Number of obs =      279
------------------------------------------------------------------------------
    Variable |       Mean   Std. Err.                     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |       .566   .0296723                      .5078434    .6241566
------------------------------------------------------------------------------
    p = proportion(x)                                             z =  25.9458
Ho: p = 0.1

     Ha: p < 0.1                 Ha: p != 0.1                   Ha: p > 0.1
 Pr(Z < z) = 1.0000         Pr(|Z| > |z|) = 0.0000          Pr(Z > z) = 0.0000

. display 1-normal(25.9)
0

Pagano and Gauvreau, Chapter 14

Page 46

Hypothesis testing of a proportion

• Another example
• Null hypothesis:
  – 10% or fewer of the women aged 25-34 have dietary zinc levels of less than 5 mg
  – H0: p ≤ 0.10
• Alternative hypothesis:
  – More than 10% of the women aged 25-34 have dietary zinc levels of less than 5 mg
  – HA: p > 0.10
• Significance level α = 0.05

Pagano and Gauvreau, Chapter 14

Page 47

Hypothesis testing of a proportion

• The data: 31/279=11.1% had zinc levels of less than 5 mg

. prtesti 279 .111 .1

One-sample test of proportion                      x: Number of obs =      279
------------------------------------------------------------------------------
    Variable |       Mean   Std. Err.                     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |       .111   .0188066                      .0741397    .1478603
------------------------------------------------------------------------------
    p = proportion(x)                                             z =   0.6125
Ho: p = 0.1

     Ha: p < 0.1                 Ha: p != 0.1                   Ha: p > 0.1
 Pr(Z < z) = 0.7299         Pr(|Z| > |z|) = 0.5402          Pr(Z > z) = 0.2701

• The data are NOT inconsistent with the null hypothesis, therefore we fail to reject the null

Pagano and Gauvreau, Chapter 14
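Both proportion tests (folate and zinc) use the same Z statistic, so they can be reproduced with one small function. A Python sketch with a hypothetical function name; the inputs are the rounded proportions passed to prtesti on the slides.

```python
from math import sqrt
from statistics import NormalDist

def prop_z_test(p_hat, p0, n):
    """Return (z, upper-tail p-value) for H0: p <= p0 versus HA: p > p0."""
    se_under_null = sqrt(p0 * (1 - p0) / n)  # standard deviation of p-hat if p = p0
    z = (p_hat - p0) / se_under_null
    return z, 1 - NormalDist().cdf(z)

z_folate, p_folate = prop_z_test(0.566, 0.10, 279)
z_zinc, p_zinc = prop_z_test(0.111, 0.10, 279)

print(f"folate: z = {z_folate:.4f}, P(Z > z) = {p_folate:.4f}")
print(f"zinc:   z = {z_zinc:.4f}, P(Z > z) = {p_zinc:.4f}")
```

Note that the Std. Err. column in the prtesti output is computed from the observed proportion and is used for the confidence interval, while the z statistic uses the standard deviation under the null, as in the function above.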

Page 48

Hypothesis testing of a proportion

• Remember that if n or p is small, you may need to use an exact test
• The commands in Stata to do this are
  display binomialtail(n,k,p)
  or
  bitesti samplesize numsuccesses hypothp
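What binomialtail(n,k,p) returns, the probability of k or more successes in n trials, can be sketched directly from the binomial formula. The Python function below is an illustrative stand-in, not Stata's implementation; it is applied to the zinc data (31 of 279, p0=0.10) only as an example.

```python
from math import comb

def binomial_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p), summed term by term."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Exact one-sided p-value for HA: p > 0.10 with 31 successes in 279 trials.
p_exact = binomial_tail(279, 31, 0.10)
print(f"P(X >= 31) = {p_exact:.4f}")
```

The exact p-value can be compared with the normal-approximation value Pr(Z > z) from prtesti; with n this large the two should lead to the same conclusion.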

Page 49

Hypothesis tests versus confidence intervals

• You can reach the same conclusion with confidence intervals as with two-sided hypothesis tests
  – Reject the null if the 95% confidence interval does not include the hypothesized value μ0
• Confidence intervals give us more information: the range of reasonable values for μ and the uncertainty in our estimate X̄.
• Many statisticians and others prefer using confidence intervals to using hypothesis tests.
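The rule above can be illustrated with the 95% confidence intervals printed in the two ttesti outputs earlier in the lecture. The function name in this Python sketch is hypothetical.

```python
def reject_by_ci(ci_lower, ci_upper, mu0):
    """Two-sided test at level alpha using a (1 - alpha) confidence interval."""
    return not (ci_lower <= mu0 <= ci_upper)

# Walking age: 95% CI (10.9552, 14.6448), hypothesized mean 11.4 months.
walking = reject_by_ci(10.9552, 14.6448, 11.4)

# Blood loss: 95% CI (1305.822, 1520.378), hypothesized mean 750 ml.
blood_loss = reject_by_ci(1305.822, 1520.378, 750)

print(f"walking age: reject H0? {walking}")
print(f"blood loss:  reject H0? {blood_loss}")
```

The walking-age interval contains 11.4, matching the two-sided p-value of 0.1182 (fail to reject), while the blood-loss interval excludes 750, matching the two-sided p-value of 0.0000 in that output.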

Page 50: Biostat 200 Lecture 5 1. Normal distribution If X ~ N(µ,σ) then Z=(X- µ)/σ – To find the proportion of values in the population that exceed some threshold,

Hypothesis test controversy (e.g. in psychology)

• Lykken, D. T. (1991). What’s wrong with psychology? In D. Cicchetti & W. M. Grove (Eds.), Thinking Clearly about Psychology, vol. 1: Matters of Public Interest, Essays in honor of Paul E. Meehl (pp. 3-39). Minneapolis, MN: University of Minnesota Press.
• Meehl, P. E. (1967). Theory-testing in psychology and physics: A methodological paradox. Philosophy of Science, 34, 103-115.
• Meehl, P. E. (1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. Journal of Consulting and Clinical Psychology, 46, 806-834.
• Rozeboom, W. W. (1960). The fallacy of the null hypothesis significance test. Psychological Bulletin, 57, 416-428.
• Bakan, D. (1966). The test of significance in psychological research. Psychological Bulletin, 66, 423-437. [optional]
• Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997-1003.
• Gigerenzer, G. (1993). The superego, the ego, and the id in statistical reasoning. In G. Keren & C. Lewis (Eds.), A handbook for data analysis in the behavioral sciences: Methodological issues (pp. 311-339). Hillsdale, NJ: Lawrence Erlbaum Associates.
• Schmidt, F. L., & Hunter, J. E. (1997). Eight common but false objections to the discontinuation of significance testing in the analysis of research data. In L. A. Harlow, S. A. Mulaik, & J. H. Steiger (Eds.), What if there were no significance tests? (pp. 37-64). Mahwah, NJ: Lawrence Erlbaum Associates.
• Oakes, M. (1986). Statistical inference: A commentary for the social and behavioral sciences. New York: Wiley. (Chapter 2 [A Critique of Significance Tests]) [optional]
• Frick, R. W. (1996). The appropriate use of null hypothesis testing. Psychological Methods, 1, 379-390.
• Hagen, R. L. (1997). In praise of the null hypothesis statistical test. American Psychologist, 52, 15-24.
• Wilkinson, L., & the Task Force on Statistical Inference. (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54, 594-604.
• Wainer, H. (1999). One cheer for null hypothesis significance testing. Psychological Methods, 4, 212-213.
• Mulaik, S. A., Raju, N. S., & Harshman, R. A. (1997). There is a time and place for significance testing. In L. A. Harlow, S. A. Mulaik, & J. H. Steiger (Eds.), What if there were no significance tests? (pp. 65-116). Mahwah, NJ: Lawrence Erlbaum Associates. [optional]
• Abelson, R. P. (1997). On the surprising longevity of flogged horses: Why there is a case for the significance test. Psychological Science, 8, 12-15.
• Krueger, J. (2001). Null hypothesis significance testing: On the survival of a flawed method. American Psychologist, 56, 16-26.
• Scarr, S. (1997). Rules of evidence: A larger context for the statistical debate. Psychological Science, 8, 16-17.
• Greenwald, A. G., Gonzalez, R., Harris, R. J., & Guthrie, D. (1996). Effect sizes and p values: What should be reported and what should be replicated? Psychophysiology, 33, 175-183.
• Nickerson, R. S. (2000). Null hypothesis significance testing: A review of an old and continuing controversy. Psychological Methods, 5, 241-301. [optional]
• Harris, R. J. (1997). Significance tests have their place. Psychological Science, 8, 8-11. [optional]

Source: http://www.uic.edu/classes/psych/psych548/fraley/

Page 51

Types of error

• Type I error – α = significance level of the test = P(reject H0 | H0 is true)

• Incorrectly reject the null

• We take a sample from the population, calculate the statistics, and make inference about the true population. If we did this repeatedly, we would incorrectly reject the null 5% of the time that it is true if α is set to 0.05.
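
The repeated-sampling idea can be checked with a small simulation. This is only a sketch under stated assumptions: a one-sided (upper-tail) z test with known σ, normally distributed data, and a population whose mean really equals μ0. The values μ0 = 11.4, σ = 2 and n = 9 come from the walking-age example later in this lecture; the number of simulated samples and the random seed are arbitrary choices.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def type_i_error_rate(mu0=11.4, sigma=2.0, n=9, alpha=0.05,
                      n_sims=20_000, seed=1):
    """Share of simulated samples that reject H0 when H0 is actually true."""
    rng = random.Random(seed)                   # fixed seed so the run is reproducible
    z_crit = NormalDist().inv_cdf(1 - alpha)    # one-sided critical value
    se = sigma / sqrt(n)
    rejections = 0
    for _ in range(n_sims):
        # Draw a sample from the null population (true mean equals mu0).
        sample = [rng.gauss(mu0, sigma) for _ in range(n)]
        zstat = (fmean(sample) - mu0) / se
        if zstat > z_crit:
            rejections += 1
    return rejections / n_sims


rate = type_i_error_rate()
print(f"Proportion of samples that reject a true H0: {rate:.3f}")
```

The printed proportion should land close to the chosen α, with some simulation noise.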

Page 52

Types of error

• Type II error – β = P(do not reject H0 | H0 is false)

• Incorrectly fail to reject the null

• This happens when the test statistic is not large enough to reject, even though the true population distribution differs from the one specified under H0

Page 53

Types of error

• Remember, H0 is a statement about the population and is either true or false

• We take a sample and use the information in the sample to try to determine the answer

• Whether we make a Type I error or a Type II error depends on whether H0 is true or false

• We can design our study to minimize the chance of a Type II error

                    True state
Decision            H0 is true          H0 is false
Do not reject H0    Correct             Type II error = β
Reject H0           Type I error = α    Correct

Page 54

Finding β, P(Type II error)

• Example: Mean age of walking

– H0: μ ≤ 11.4 months (μ0)

– Alternative hypothesis: HA: μ > 11.4 months

– Known SD = 2

– Significance level = 0.05

– Sample size = 9

• We will reject the null if the zstat (assuming σ known) > 1.645

• So we will reject the null if zstat = (x̄ − μ0)/(σ/√n) > 1.645, that is, if x̄ > 1.645·σ/√n + μ0

• For our example, the null will be rejected if X̄ > 1.645*2/3 + 11.4 = 12.5

Page 55

• But if the true mean is really 16, what is the probability that the null will not be rejected?
– The probability of a Type II error?

• The null will be rejected if the sample mean is > 12.5, not rejected if it is ≤ 12.5

• What is the probability of getting a sample mean of ≤ 12.5 if the true mean is 16?

• P(Z < (12.5-16)/(2/3)) = P(Z < -5.25) < 0.0001

• So if the true mean is 16 and the sample size is 9, the probability of incorrectly failing to reject the null is less than 0.0001

Page 56

• Note that this depended on:
– The alternative population mean (e.g. 16)

• What is the probability of failing to reject the null if the true population mean is 15?
– P(Z < (12.5-15)/.6667) = 0.00009

• What is the probability of failing to reject the null if the true population mean is 14?
– P(Z < (12.5-14)/.6667) = 0.012

• 1 − β is called the power of a statistical test
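
The Type II error calculation can be reproduced directly from the stated inputs (μ0 = 11.4, σ = 2, n = 9, α = 0.05). The sketch below uses Python's standard-library `statistics.NormalDist` in place of Stata's `display normal()` and `display invnormal()`. It assumes a one-sided z test with known σ; the helper name `beta` is introduced here only for illustration.

```python
from math import sqrt
from statistics import NormalDist

mu0, sigma, n, alpha = 11.4, 2.0, 9, 0.05
se = sigma / sqrt(n)                        # standard error of the sample mean

# Critical z for a one-sided test, and the matching cutoff on the sample-mean scale.
z_crit = NormalDist().inv_cdf(1 - alpha)
cutoff = mu0 + z_crit * se


def beta(true_mean):
    """P(sample mean <= cutoff) when the population mean is true_mean."""
    return NormalDist(mu=true_mean, sigma=se).cdf(cutoff)


print(f"critical z = {z_crit:.3f}; reject H0 if sample mean > {cutoff:.2f}")
for mu_a in (16, 15, 14):
    b = beta(mu_a)
    print(f"true mean {mu_a}: beta = {b:.5f}, power = {1 - b:.5f}")
```

Evaluating `beta(mu0)` returns 1 − α, which is a quick sanity check: when the true mean sits exactly at the null value, the test fails to reject 95% of the time.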

Page 57

Power

• The power of a test is the probability that you will reject the null hypothesis when it is not true; it equals 1 − β

• You can construct a power curve by plotting power versus different alternative hypotheses

• You can also construct a power curve by plotting power versus different sample sizes (n’s)
– This curve will allow you to see what gains in power you might achieve versus the cost of adding participants
– Power curves are not linear
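
As a rough illustration, the points of both kinds of power curve can be tabulated for the walking-age test (μ0 = 11.4, σ = 2, α = 0.05, one-sided z test with known σ). The grids of alternative means and sample sizes, and the fixed alternative of 12.5 used for the sample-size curve, are hypothetical choices made for this sketch; plotting is left out so that only the standard library is needed.

```python
from math import sqrt
from statistics import NormalDist


def power(mu_a, n, mu0=11.4, sigma=2.0, alpha=0.05):
    """Power of the one-sided z test of H0: mu <= mu0 when the true mean is mu_a."""
    se = sigma / sqrt(n)
    cutoff = mu0 + NormalDist().inv_cdf(1 - alpha) * se
    return 1 - NormalDist(mu=mu_a, sigma=se).cdf(cutoff)


# Power versus the alternative mean, holding n = 9 fixed.
by_mean = [(m, power(m, n=9)) for m in (11.5, 12.0, 12.5, 13.0, 13.5, 14.0)]

# Power versus sample size, holding the alternative mean at 12.5 (hypothetical).
by_n = [(n, power(12.5, n=n)) for n in (5, 10, 20, 40, 80)]

print("alternative mean -> power (n = 9)")
for m, p in by_mean:
    print(f"  {m:5.1f} -> {p:.3f}")

print("sample size -> power (true mean = 12.5)")
for n, p in by_n:
    print(f"  {n:3d} -> {p:.3f}")
```

Reading down the second table shows the nonlinearity: each doubling of n buys less additional power once power is already high.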

Page 58

• The power of a statistical test is lower for alternative values that are closer to the null value.

• The power of a statistical test can also be increased by increasing n

• For a fixed n, the power of a statistical test can also be increased by increasing α, the probability of a Type I error

• The balance between type I error and type II error (or power) should depend on the context of the study

• When setting up a study, most investigators make sure that there will be at least 80% power. Sample size formulas allow you to specify the desired α and β levels.
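
One common sample size formula, for a one-sided one-sample z test with known σ, is n = ((z_α + z_β)·σ / (μ1 − μ0))², rounded up, where z_α and z_β are the standard normal quantiles for 1 − α and for the desired power 1 − β. The sketch below applies it to the walking-age setting with a hypothetical alternative mean of 12.5 months; this particular formula and alternative are assumptions for illustration, and other designs (two-sided tests, unknown σ) need different formulas.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size(mu0, mu_a, sigma, alpha=0.05, power=0.80):
    """Smallest n giving at least the requested power (one-sided z test, sigma known)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    z_beta = NormalDist().inv_cdf(power)
    return ceil(((z_alpha + z_beta) * sigma / (mu_a - mu0)) ** 2)


def achieved_power(n, mu0, mu_a, sigma, alpha=0.05):
    """Power actually obtained with n participants under the same assumptions."""
    se = sigma / sqrt(n)
    cutoff = mu0 + NormalDist().inv_cdf(1 - alpha) * se
    return 1 - NormalDist(mu=mu_a, sigma=se).cdf(cutoff)


# mu_a = 12.5 is a hypothetical alternative; mu0 and sigma follow the walking-age example.
n_needed = sample_size(mu0=11.4, mu_a=12.5, sigma=2.0)
print("participants needed for 80% power:", n_needed)
print(f"achieved power: {achieved_power(n_needed, 11.4, 12.5, 2.0):.3f}")
```

Checking `achieved_power` at `n_needed - 1` confirms that one fewer participant falls just short of the 80% target.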

Page 59

For next time

• Read Pagano and Gauvreau

– Chapters 10 and 14 (pages 329-330): today’s material

– Pagano and Gauvreau Chapters 11-12 and 14 (pages 332-338)