Multiplicity, how to deal with the testing of more than one hypothesis.


Multiplicity

Gaetan Lion

July 2013


Probability of Making a Type I error* when using a t test with > 2 Samples

*A false positive. Rejecting the null hypothesis when it is true.

[Chart: Prob of Type I Error (Initial Confidence Level 95%), y-axis 0% to 45%, against # of Hypotheses, 1 to 10]

Confidence level 95%

Unadjusted α value 5%

Probability of Type I Error

# of          Bonferroni   Sidak   Sidak
hypotheses    Logic        Logic   Logic
1             0.05         0.05    0.05
2             0.10         0.10    0.10
3             0.15         0.14    0.14
4             0.20         0.19    0.19
5             0.25         0.23    0.23
6             0.30         0.26    0.26
7             0.35         0.30    0.30
8             0.40         0.34    0.34
9             0.45         0.37    0.37
10            0.50         0.40    0.40
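The table above can be reproduced with a few lines of Python. This is a minimal sketch, assuming the Bonferroni column is simply the number of hypotheses times α and the Sidak column is 1 - (1 - α)^k. The function names are illustrative and do not come from the slides.

```python
def type1_bonferroni_logic(alpha: float, k: int) -> float:
    """Additive bound on the chance of at least one false positive in k tests."""
    return min(1.0, k * alpha)


def type1_sidak_logic(alpha: float, k: int) -> float:
    """Chance of at least one false positive in k independent tests."""
    return 1.0 - (1.0 - alpha) ** k


if __name__ == "__main__":
    alpha = 0.05  # unadjusted alpha value used on the slide
    print("k   Bonferroni  Sidak")
    for k in range(1, 11):
        b = type1_bonferroni_logic(alpha, k)
        s = type1_sidak_logic(alpha, k)
        print(f"{k:<3} {b:<11.2f} {s:.2f}")
```

The two columns stay close for a small number of hypotheses and drift apart as the number grows, which is what the table shows.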


How to test > 2 Samples

Two Basic Steps:

1) Choose a specific ANOVA method, given your testing framework: Between-Groups, Within-Groups, or Mixed ANOVA… or use a nonparametric equivalent if warranted. This is to figure out whether your groups or samples are different overall.

2) Decide in advance whether to conduct Post Hoc (after the fact) or Planned Comparison tests to figure out which specific groups are different.


The ANOVAs

Between-Groups
Unpaired testing. Difference between independent groups. Single measures or observations.

Within-Groups
Paired testing. Difference between same groups before and after treatment. Repeated measures or observations.

Mixed
Mixed testing. Difference between independent groups before and after treatment. Repeated measures or observations.


ANOVA semantics

One-Way Between-Groups ANOVA means an ANOVA with independent groups measuring one single independent variable and one dependent variable. The independent variable could be type of students by Major and the dependent variable could be math proficiency.

A Four-Way Between-Groups ANOVA would use the same data but, in addition to Major, would also look at Gender, Class (Freshman, Sophomore,…), and Ethnicity. So, you now have four independent variables.

Balanced ANOVA means that each group or sample is of the same size (same number of males and females, etc.). An Unbalanced ANOVA means that some of the samples are of different sizes.


Excel ANOVA(s) Add-in Cryptic Semantics

“Factor” means the same as “Way.” They both mean Independent Variable.

“With Replication” can be confused with “Repeated Measures,” which typically means “Within Group” or paired testing.

“Without Replication” can be confused with “Single Measure,” which typically means “Between Groups” or unpaired testing.

In the Excel Add-in, “With Replication” means you have more than one data point per group or sample, which is almost always the case.

“Without Replication” in the Excel Add-in can be used for two very different situations:

1) Two-Way Between Groups ANOVA with a single observation per category; and
2) One-Way Within Groups ANOVA.

ANOVA method                                                                     Excel Add-in corresponding tool
One-Way Between-Groups ANOVA                                                     Anova: Single Factor
Two-Way Between Groups ANOVA, more than one observation per category (standard)  Anova: Two-Factor With Replication
Two-Way Between Groups ANOVA, a single observation per category                  Anova: Two-Factor Without Replication
One-Way Within Groups ANOVA                                                      Anova: Two-Factor Without Replication


Post Hoc vs Planned Comparison Tests

Purpose
  Post Hoc test: Exploratory. You test the difference between all potential combinations of Groups.
  Planned Comparison test: Confirmation of theory or hypothesis. You test only the Groups you expect to be different in a specific direction (greater, lower).

Risk of Type I error
  Post Hoc test: Very low. Very unlikely to generate a false positive (reject the null hypothesis when it is true).
  Planned Comparison test: Low. Not quite as conservative as a Post Hoc test, but conservative enough.

Risk of Type II error
  Post Hoc test: High. Not unlikely to generate a false negative (accept the null hypothesis when it is false). This test lacks Power.
  Planned Comparison test: Lower risk of Type II error than a Post Hoc test. The test is more sensitive, more likely to uncover a difference. It has more Power.


PH means Post Hoc test
PC means Planned Comparison test

HYPOTHESIS TESTING FLOW CHART

Multiple hypothesis testing. > 2 Samples or Groups

Research structure: Are we testing different groups once? Are we testing the same group(s) at different times?

Unpaired testing. Difference between independent groups (Between-Groups). Single measure or observation.
  Multiple hypothesis test (Are the groups different?): Normal: Between-Groups ANOVA. Not normal: Kruskal-Wallis test.
  Transition test to facilitate Post Hoc test: Normal: Unpaired t test. Not normal: Mann-Whitney.

Paired testing. Difference between same group before and after treatment (Within-Groups). Repeated measures or observations.
  Multiple hypothesis test (Are the groups different?): Normal: Within-Groups ANOVA. Not normal: Friedman test.
  Transition test to facilitate Post Hoc test: Normal: Paired t test. Not normal: Wilcoxon Signed-Rank Test.

Mixed testing. Difference between independent groups before and after treatment (Mixed). Repeated measures or observations.
  Multiple hypothesis test (Are the groups different?): Normal: Mixed ANOVA. Not normal: No nonparametric alternative.
  No Post Hoc test.

Post Hoc Test (Which group is different?):
  Tukey's HSD test (PH)
  Scheffe test (PH)
  REGWQ test (PH)
  Dunnett test (PC)
  Bonferroni test (PH)
  Sidak test (PH)
  Simple contrasts (PC)
  Repeated contrasts (PC)


Two-Way Between-Groups ANOVA example


Two-Way Between-Groups ANOVA

Two Independent variables: Cowboy preference in movies, Gender

One Dependent variable: Intelligence score

Data Format

For Excel Add-in

                 Male   Female
John Wayne       71     66
                 76     64
                 84     66
                 72     47
                 68     66
Clint Eastwood   65     73
                 53     80
                 70     81
                 46     88
                 53     72
None             81     72
                 69     75
                 55     73
                 60     54
                 61     65

For XLStat

XLStat treats this ANOVA as a linear regression with one dependent variable and two qualitative independent variables.

Y            X1            X2
Int. Score   Cowboy        Gender
71           J. Wayne      Male
76           J. Wayne      Male
84           J. Wayne      Male
72           J. Wayne      Male
68           J. Wayne      Male
66           J. Wayne      Female
64           J. Wayne      Female
66           J. Wayne      Female
47           J. Wayne      Female
66           J. Wayne      Female
65           C. Eastwood   Male
53           C. Eastwood   Male
70           C. Eastwood   Male
46           C. Eastwood   Male
53           C. Eastwood   Male
73           C. Eastwood   Female
80           C. Eastwood   Female
81           C. Eastwood   Female
88           C. Eastwood   Female
72           C. Eastwood   Female
81           None          Male
69           None          Male
55           None          Male
60           None          Male
61           None          Male
72           None          Female
75           None          Female
73           None          Female
54           None          Female
65           None          Female


Excel Long Hand

Between-Sample Variability (BSV)

                       Sample size   Average   Total Avg.   Differ.^2
J. Wayne - Male        5             74.2      67.5         44.4
J. Wayne - Female      5             61.8      67.5         32.9
C. Eastwood - Male     5             57.4      67.5         102.7
C. Eastwood - Female   5             78.8      67.5         126.9
None - Male            5             65.2      67.5         5.4
None - Female          5             67.8      67.5         0.1
J. Wayne               10            68.0      67.5         0.2
C. Eastwood            10            68.1      67.5         0.3
None                   10            66.5      67.5         1.1
Male                   15            65.6      67.5         3.7
Female                 15            69.5      67.5         3.7

                  SS        DF (k - 1)   MS
Corrected Model   1,562.3   5            312.5
Cowboy            16        2            8.0
Gender            112.1     1            112.1

Within-Sample Variability (WSV)

                       Sample size - 1   STDEV   Variance
J. Wayne - Male        4                 6.2     38.2
J. Wayne - Female      4                 8.3     69.2
C. Eastwood - Male     4                 9.8     96.3
C. Eastwood - Female   4                 6.5     42.7
None - Male            4                 10.2    103.2
None - Female          4                 8.6     73.7

SS Within    1,693.2
DF (n - k)   24
MS Within    70.5

Between-Sample Variability/Within-Sample Variability Output (BSV/WSV)

Source            SS        df   MS      F      Sign.
Model             1,562.3   5    312.5   4.4    0.005
Cowboy            16.1      2    8.0     0.1    0.893
Gender            112.1     1    112.1   1.6    0.220
Cowboy*Gender     1,434.1   2    717.0   10.2   0.001
Error/Residual    1,693.2   24   70.5
Corrected Total   3,255.5   29
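The long-hand calculation can be checked outside Excel. The sketch below is illustrative only: it assumes a balanced design (equal cell sizes, as in this example), so that the interaction sum of squares can be obtained as Model minus Cowboy minus Gender. The function name `two_way_anova` and the dictionary layout are assumptions, not part of the original workbook. It stops at the F ratios, because the Sign. column needs an F distribution that the standard library does not provide.

```python
from statistics import mean

# Intelligence scores from the example, keyed by (cowboy, gender).
SCORES = {
    ("J. Wayne", "Male"): [71, 76, 84, 72, 68],
    ("J. Wayne", "Female"): [66, 64, 66, 47, 66],
    ("C. Eastwood", "Male"): [65, 53, 70, 46, 53],
    ("C. Eastwood", "Female"): [73, 80, 81, 88, 72],
    ("None", "Male"): [81, 69, 55, 60, 61],
    ("None", "Female"): [72, 75, 73, 54, 65],
}


def two_way_anova(cells):
    """Return {source: (SS, df, MS, F)} for a balanced two-way design."""
    values = [x for cell in cells.values() for x in cell]
    grand = mean(values)

    def between_ss(position):
        # Pool the cells that share one level of the chosen factor.
        pooled = {}
        for key, cell in cells.items():
            pooled.setdefault(key[position], []).extend(cell)
        ss = sum(len(g) * (mean(g) - grand) ** 2 for g in pooled.values())
        return ss, len(pooled) - 1

    ss_model = sum(len(c) * (mean(c) - grand) ** 2 for c in cells.values())
    ss_a, df_a = between_ss(0)
    ss_b, df_b = between_ss(1)
    ss_inter = ss_model - ss_a - ss_b  # valid only for balanced data

    ss_within = 0.0
    for cell in cells.values():
        cell_mean = mean(cell)
        ss_within += sum((x - cell_mean) ** 2 for x in cell)
    df_within = len(values) - len(cells)
    ms_within = ss_within / df_within

    table = {}
    for name, ss, df in [("Model", ss_model, len(cells) - 1),
                         ("Cowboy", ss_a, df_a),
                         ("Gender", ss_b, df_b),
                         ("Cowboy*Gender", ss_inter, df_a * df_b)]:
        table[name] = (ss, df, ss / df, (ss / df) / ms_within)
    table["Error/Residual"] = (ss_within, df_within, ms_within, None)
    return table


if __name__ == "__main__":
    for source, (ss, df, ms, f) in two_way_anova(SCORES).items():
        f_text = "" if f is None else f"{f:.2f}"
        print(f"{source:<15} {ss:>8.1f} {df:>3} {ms:>7.1f} {f_text:>6}")
```

The printed sums of squares and F ratios should line up with the Output table above, up to rounding.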


Excel Add-in

Anova: Two-Factor With Replication

SUMMARY        Male    Female   Total
J. Wayne
  Count        5       5        10
  Sum          371     309      680
  Average      74.2    61.8     68
  Variance     38.2    69.2     90.4
C. Eastwood
  Count        5       5        10
  Sum          287     394      681
  Average      57.4    78.8     68.1
  Variance     96.3    42.7     189.0
None
  Count        5       5        10
  Sum          326     339      665
  Average      65.2    67.8     66.5
  Variance     103.2   73.7     80.5
Total
  Count        15      15
  Sum          984     1042
  Average      65.6    69.5
  Variance     118.4   106.1

ANOVA

Source of Variation             SS       df   MS      F       P-value   F crit
Sample (Cowboy)                 16.1     2    8.0     0.11    0.893     3.4
Columns (Gender)                112.1    1    112.1   1.59    0.220     4.3
Interaction                     1434.1   2    717.0   10.16   0.001     3.4
Within (Error/Residual)         1693.2   24   70.6
Total (Corrected Total/Model)   3255.5   29


XLStat ANOVA

[Chart: Pred(I Score) / I Score. I Score plotted against Pred(I Score), both axes from 45 to 90]

Analysis of variance:

Source            DF   SS       MS      F      Pr > F
Model             5    1562.3   312.5   4.43   0.005
Error             24   1693.2   70.5
Corrected Total   29   3255.5

Computed against model Y=Mean(Y)

Type I Sum of Squares analysis:

Source          DF   SS     MS      F      Pr > F
Cowboy          2    16     8.0     0.1    0.893
Gender          1    112    112.1   1.6    0.220
Cowboy*Gender   2    1434   717.0   10.2   0.001


Post Hoc and

Planned Comparison tests


Tukey’s HSD (PH) vs Dunnett test (PC)

for Cowboys



Tukey’s Test (PH) (for Cowboys)

Tukey's Honestly Significant Difference (HSD) test. Post Hoc test

MS Within        70.5
n                10     Number per treatment/Number of Groups
Standard Error   2.66   SQRT(MS within × (1/n))

Average intelligence score:
C. Eastwood   68.1
J. Wayne      68.0
None          66.5

                                Standard.        Alpha       Estimated Z value   Estimated
                      Differ.   Difference (A)   0.05        (A/B*1.96)          2-tail P val.
C. East vs None       1.60      0.60             Not sign.   0.33                0.74
C. East vs J. Wayne   0.10      0.04             Not sign.   0.02                0.98
J. Wayne vs None      1.50      0.56             Not sign.   0.31                0.75

Critical value @ α 0.05. 2-tail
df within                   24
k # groups                  3
alpha 0.05 from table (B)   3.53


Dunnett test (PC) (for Cowboys)

Dunnett test. Planned Comparison

MS Within        70.5
n                10
Standard Error   3.76   SQRT(2 × MS within / n)

Average intelligence score:
C. Eastwood   68.1
J. Wayne      68.0
None          66.5

                                Standard.        Alpha       Estimated Z value   Est.
                      Differ.   Difference (A)   0.05        (A/B*1.96)          2-tail P val.
C. East vs None       1.60      0.43             Not sign.   0.36                0.72
C. East vs J. Wayne   0.10      0.03             Not sign.   0.02                0.98
J. Wayne vs None      1.50      0.40             Not sign.   0.33                0.74

Critical value @ α 0.05. 2-tail
df within                   24
k # groups                  3
alpha 0.05 from table (B)   2.35


Comparing Dunnett vs Tukey’s across various Mean difference levels

            Standard error   α 5% 2-tail critical value
Dunnett's   3.76             2.35
Tukey's     2.66             3.53
Z                            1.96

Tukey's Test

Mean         Standard.    % of 2-tail     2-tail       2-tail
difference   difference   critical val.   Z equival.   P val est.
0.5          0.19         5.3%            0.10         0.92
1.0          0.38         10.7%           0.21         0.83
2.0          0.75         21.3%           0.42         0.68
3.0          1.13         32.0%           0.63         0.53
4.0          1.51         42.7%           0.84         0.40
5.0          1.88         53.3%           1.05         0.30
6.0          2.26         64.0%           1.25         0.21
7.0          2.64         74.7%           1.46         0.14
8.0          3.01         85.3%           1.67         0.09
9.0          3.39         96.0%           1.88         0.06

Dunnett Test

Mean         Standard.    % of 2-tail     2-tail       2-tail
difference   difference   critical val.   Z equival.   P val est.
0.5          0.13         5.7%            0.11         0.91
1.0          0.27         11.3%           0.22         0.82
2.0          0.53         22.7%           0.44         0.66
3.0          0.80         34.0%           0.67         0.51
4.0          1.06         45.3%           0.89         0.37
5.0          1.33         56.6%           1.11         0.27
6.0          1.60         68.0%           1.33         0.18
7.0          1.86         79.3%           1.55         0.12
8.0          2.13         90.6%           1.78         0.08
9.0          2.40         102.0%          2.00         0.05
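One way to read the "% of 2-tail critical val." column is to ask how large a mean difference must be before it reaches 100%. A minimal sketch, assuming the standard errors and critical values quoted in the comparison above, with a hypothetical helper name:

```python
from math import sqrt

MS_WITHIN = 70.5
N_PER_GROUP = 10


def min_significant_difference(std_error: float, critical_value: float) -> float:
    """Smallest mean difference whose standardized value reaches the critical value."""
    return std_error * critical_value


tukey_threshold = min_significant_difference(sqrt(MS_WITHIN / N_PER_GROUP), 3.53)
dunnett_threshold = min_significant_difference(sqrt(2 * MS_WITHIN / N_PER_GROUP), 2.35)

if __name__ == "__main__":
    print(f"Tukey:   mean difference must exceed {tukey_threshold:.2f}")
    print(f"Dunnett: mean difference must exceed {dunnett_threshold:.2f}")
```

With these inputs the Dunnett threshold falls just below a mean difference of 9.0 and the Tukey threshold just above it, consistent with the last row of each table (102.0% versus 96.0% of the critical value).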


Dunnett vs Tukey’s visual comparison

[Chart: Dunnett vs Tukey 2-tail p value (0.0 to 1.0) against Mean difference (0.5 to 9.0). Series: Tukey, Dunnett]

The Dunnett test is only marginally more sensitive (has more Power) than Tukey’s test, that is, more likely to find a statistically significant difference, when using a 2-tail test. However, with Dunnett, if warranted, you can also use a 1-tail test, which would make a huge difference. With Tukey, you can’t do that.


Bonferroni vs Sidak Test adjusted α value

Multiple hypothesis testing adjustments

Corresponding to α: 5%

Adjusted α value:

# of hypotheses   Bonferroni   Sidak
1                 5.00%        5.00%
2                 2.50%        2.53%
3                 1.67%        1.70%
4                 1.25%        1.27%
5                 1.00%        1.02%
6                 0.83%        0.85%
7                 0.71%        0.73%
8                 0.63%        0.64%
9                 0.56%        0.57%
10                0.50%        0.51%

Bonferroni: α / # of hypotheses

Sidak: 1 - (1 - α)^(1 / # of hypotheses)

These tests consist of adjusting the relevant Alpha threshold (i.e. 5%) for the number of hypotheses you are testing. Bonferroni simply divides the Alpha value by the # of hypotheses. Sidak uses a compounding formula that is technically more accurate but makes no material difference in this situation.
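The two adjustment formulas are short enough to write out directly. The sketch below uses illustrative function names and assumes α = 5%, as in the table.

```python
def bonferroni_alpha(alpha: float, k: int) -> float:
    """Per-test threshold: alpha divided by the number of hypotheses."""
    return alpha / k


def sidak_alpha(alpha: float, k: int) -> float:
    """Per-test threshold from the compounding formula 1 - (1 - alpha)^(1/k)."""
    return 1.0 - (1.0 - alpha) ** (1.0 / k)


if __name__ == "__main__":
    alpha = 0.05
    print("k   Bonferroni  Sidak")
    for k in range(1, 11):
        print(f"{k:<3} {bonferroni_alpha(alpha, k):<11.2%} {sidak_alpha(alpha, k):.2%}")
```

The Sidak threshold is always slightly higher than the Bonferroni one, so Sidak is marginally less conservative, but with 10 or fewer hypotheses the gap stays within a few hundredths of a percentage point.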


What would be the qualifying α value?

At what familywise level α would a single hypothesis qualify (Sidak logic)

# of          Original unpaired t test p value
hypotheses    0.5%   1%     2.5%   5%     10%    15%
1             0.01   0.01   0.03   0.05   0.10   0.15
2             0.01   0.02   0.05   0.10   0.19   0.28
3             0.01   0.03   0.07   0.14   0.27   0.39
4             0.02   0.04   0.10   0.19   0.34   0.48
5             0.02   0.05   0.12   0.23   0.41   0.56
6             0.03   0.06   0.14   0.26   0.47   0.62
7             0.03   0.07   0.16   0.30   0.52   0.68
8             0.04   0.08   0.18   0.34   0.57   0.73
9             0.04   0.09   0.20   0.37   0.61   0.77
10            0.05   0.10   0.22   0.40   0.65   0.80
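Reading this table is equivalent to comparing the original p value against the Sidak-adjusted threshold, just seen from the other side. The sketch below illustrates that equivalence under the Sidak logic; the function names are hypothetical.

```python
def familywise_level(p_value: float, k: int) -> float:
    """Familywise alpha at which a single p value would just qualify among k tests."""
    return 1.0 - (1.0 - p_value) ** k


def qualifies(p_value: float, k: int, alpha: float = 0.05) -> bool:
    """True if the p value still qualifies at the familywise alpha."""
    return familywise_level(p_value, k) <= alpha


def qualifies_by_adjusted_alpha(p_value: float, k: int, alpha: float = 0.05) -> bool:
    """Same decision, made by comparing p to the Sidak-adjusted per-test threshold."""
    return p_value <= 1.0 - (1.0 - alpha) ** (1.0 / k)


if __name__ == "__main__":
    for k in (1, 5, 6, 10):
        level = familywise_level(0.01, k)
        print(f"p = 1%, {k:>2} hypotheses: familywise level {level:.3f}, "
              f"qualifies at 5%: {qualifies(0.01, k)}")
```

For example, an original p value of 1% still qualifies at a 5% familywise level with 5 hypotheses but no longer does with 6, which matches the 1% column of the table.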


A Radical Idea: Skipping ANOVA





Streamlined Testing



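Research structure: Are we testing different groups once? Are we testing the same group(s) at different times?

Unpaired testing. Difference between independent groups (Between-Groups). Single measure or observation.
  Normal: Unpaired t test
  Not normal: Mann-Whitney

Paired testing. Difference between same group before and after treatment (Within-Groups). Repeated measures or observations.
  Normal: Paired t test
  Not normal: Wilcoxon Signed-Rank test

Post Hoc Test (Which group is different?): Bonferroni or Sidak test (PH)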


Disadvantages of Streamlined Testing

• You can’t run Tukey’s (PH) and Dunnett (PC). Those tests are not just adjustments to the P value, and may be superior in certain circumstances;

• You don’t have access to any Planned Comparison tests, which are more sensitive (more Power) and can allow you to use a 1-tail P value when warranted;

• ANOVA gives you valuable information about the different independent variables and their interaction.

Between-Sample Variability/Within-Sample Variability Output (BSV/WSV)

Source          SS        df   MS       F       Sign.
Model           1,562.3   5    312.45   4.43    0.005
Cowboy          16.1      2    8.03     0.11    0.893
Gender          112.1     1    112.13   1.59    0.220
Cowboy*Gender   1,434.1   2    717.03   10.16   0.001
