Page 1: Tuan V. Nguyen

Tuan V. Nguyen Garvan Institute of Medical Research

Sydney, Australia

Garvan Institute Biostatistical Workshop 16/7/2015 © Tuan V. Nguyen

Page 2: Tuan V. Nguyen

Analysis of variance

• Between-group and within-group variation explained

• Model of ANOVA

• Post-hoc comparisons

• R implementation

Page 3: Tuan V. Nguyen

Between-group variation and within-group variation

Page 4: Tuan V. Nguyen

Sir Ronald A. Fisher, inventor of ANOVA

Ronald A. Fisher, geneticist, statistician, philosopher

"a genius who almost single-handedly created the foundations for modern statistical science"

"Fisher laid the foundations for most of experimental design, analysis of variance and much of statistical inference,"

Ronald Fisher (1890 – 1962)

Page 5: Tuan V. Nguyen

The idea of ANOVA

• The analysis variable (Y) is continuous

• Comparison of multiple groups (k > 2)

• Null hypothesis: all means are equal

H0: µ1 = µ2 = … = µk

• Alternative hypothesis Ha: at least one pair of means is different

Page 6: Tuan V. Nguyen

The concept of "variation"

• Given a series of n observed values Xi (X1, X2, X3, …) with mean M, a deviate is defined as:

D = Xi – M

• Squared deviate:

D² = (Xi – M)²

• Sum of squares (i.e., variation):

SS = (X1 – M)² + (X2 – M)² + (X3 – M)² + … + (Xn – M)²

   = Σi=1..n (Xi – M)²
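This sum of squares can be computed directly in R; a minimal sketch using group A of the hormone example that appears later in the slides:

```r
# Sum of squares (variation) around the mean M, for group A of the
# hormone example (values taken from the R code later in the slides)
X <- c(8, 9, 11, 4, 7, 8, 5)
M <- mean(X)              # sample mean
SS <- sum((X - M)^2)      # sum of squared deviates
round(SS, 1)              # 33.7
```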

Page 7: Tuan V. Nguyen

Between- and within-group variation

Key to understanding ANOVA:

• Between-group variation

• Within-group variation

Page 8: Tuan V. Nguyen

"Between-group" variation

Group 1 Group 2 Group 3 Group k

X11 X21 X31 Xk1

X12 X22 X32 Xk2

X13 X23 X33 Xk3

X14 X24 X34 Xk4

X15 X25 X35 Xk5

X16 X26 X36 Xk6

M1 M2 M3 Mk

Page 9: Tuan V. Nguyen

"Within-group" variation

Group 1 Group 2 Group 3 Group k

X11 X21 X31 Xk1

X12 X22 X32 Xk2

X13 X23 X33 Xk3

X14 X24 X34 Xk4

X15 X25 X35 Xk5

X16 X26 X36 Xk6

M1 M2 M3 Mk

Page 10: Tuan V. Nguyen

Logic of ANOVA

Group 1 Group 2 Group 3 Group k

X11 X21 X31 Xk1

X12 X22 X32 Xk2

X13 X23 X33 Xk3

X14 X24 X34 Xk4

X15 X25 X35 Xk5

X16 X26 X36 Xk6

M1 M2 M3 Mk

• Compare between-group variation (B) with within-group variation (W)

• If B is large relative to W, that is a signal of a difference between groups

Page 11: Tuan V. Nguyen

An example

A hormone measured in 4 groups of patients

 A   B   C   D
 8   7  28  26
 9  17  21  16
11  10  26  13
 4  14  11  12
 7  12  24   9
 8  24  19  10
 5  11      11
    22      17
            15

Page 12: Tuan V. Nguyen

Between-group variation

      A     B     C     D
      8     7    28    26
      9    17    21    16
     11    10    26    13
      4    14    11    12
      7    12    24     9
      8    24    19    10
      5    11          11
           22          17
                       15

Mean  7.4  14.6  21.5  14.3

Overall mean = 14.2

Page 13: Tuan V. Nguyen

Between-group variation

       A     B     C     D

Mean   7.4  14.6  21.5  14.3
N      7    8     6     9

Overall mean = 14.2

Sum of squares of between-group differences:

SSB = 7*(7.4 – 14.2)² + 8*(14.6 – 14.2)² + 6*(21.5 – 14.2)² + 9*(14.3 – 14.2)² = 643.9
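The same quantity can be computed from the raw data in R. Using exact rather than rounded group means gives 642.3, which matches the aov() output on the "Result" slide; the 643.9 above reflects the rounded means:

```r
# Between-group sum of squares from the raw data of the hormone example
groups <- list(A = c(8, 9, 11, 4, 7, 8, 5),
               B = c(7, 17, 10, 14, 12, 24, 11, 22),
               C = c(28, 21, 26, 11, 24, 19),
               D = c(26, 16, 13, 12, 9, 10, 11, 17, 15))
grand <- mean(unlist(groups))          # overall mean, about 14.2
SSB <- sum(sapply(groups, function(g) length(g) * (mean(g) - grand)^2))
round(SSB, 1)                          # 642.3
```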

Page 14: Tuan V. Nguyen

Within-group variation

      A     B     C     D
      8     7    28    26
      9    17    21    16
     11    10    26    13
      4    14    11    12
      7    12    24     9
      8    24    19    10
      5    11          11
           22          17
                       15

Mean  7.4  14.6  21.5  14.3

Sum of within-group sum of squares:

SSWA = (8 – 7.4)² + (9 – 7.4)² + ... + (5 – 7.4)² = 33.7

Page 15: Tuan V. Nguyen

Within-group variation

      A     B     C     D
      8     7    28    26
      9    17    21    16
     11    10    26    13
      4    14    11    12
      7    12    24     9
      8    24    19    10
      5    11          11
           22          17
                       15

Mean  7.4  14.6  21.5  14.3

SSWA = (8 – 7.4)² + (9 – 7.4)² + ... + (5 – 7.4)² = 33.7
SSWB = 247.9
SSWC = 185.5
SSWD = 214.6

Page 16: Tuan V. Nguyen

Within-group variation

SSWA = (8 – 7.4)² + (9 – 7.4)² + ... + (5 – 7.4)² = 33.7

SSWB = 247.9

SSWC = 185.5

SSWD = 214.6

SSW = 33.7 + 247.9 + 185.5 + 214.6 = 681.6
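The within-group sum of squares can be obtained from the raw data in one line; the exact total (679.1) matches the Residuals line of the aov() output later in the slides, while the hand total above (681.6) differs slightly, likely because of intermediate rounding:

```r
# SSW: each observation's squared deviation from its own group mean
groups <- list(A = c(8, 9, 11, 4, 7, 8, 5),
               B = c(7, 17, 10, 14, 12, 24, 11, 22),
               C = c(28, 21, 26, 11, 24, 19),
               D = c(26, 16, 13, 12, 9, 10, 11, 17, 15))
SSW <- sum(sapply(groups, function(g) sum((g - mean(g))^2)))
round(SSW, 1)    # 679.1
```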

Page 17: Tuan V. Nguyen

ANOVA table

Source           Degrees of freedom   Sum of squares (SS)   Mean square (MS)
Between groups                        643.9
Within group                          681.6

Page 18: Tuan V. Nguyen

Analysis of variance table

Source           Degrees of freedom   Sum of squares (SS)   Mean square (MS)
Between groups   3                    643.9                 214.6
Within group     26                   681.6                 26.2
Total            29                   1325.5

F-test = 214.6 / 26.2 = 8.2
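The F statistic and its p-value follow directly from the table quantities; a short R sketch using the slide's figures:

```r
# F statistic and p-value from the ANOVA table quantities
SSB <- 643.9; df1 <- 3     # between groups: k - 1 = 4 - 1
SSW <- 681.6; df2 <- 26    # within groups:  N - k = 30 - 4
MSB <- SSB / df1           # mean square between, ~214.6
MSW <- SSW / df2           # mean square within,  ~26.2
Fstat <- MSB / MSW         # ~8.2
pval <- pf(Fstat, df1, df2, lower.tail = FALSE)   # upper-tail p-value
```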

Page 19: Tuan V. Nguyen

R code

A = c(8, 9, 11, 4, 7, 8, 5)
B = c(7, 17, 10, 14, 12, 24, 11, 22)
C = c(28, 21, 26, 11, 24, 19)
D = c(26, 16, 13, 12, 9, 10, 11, 17, 15)

x = c(A, B, C, D)
group = c(rep("A", 7), rep("B", 8), rep("C", 6), rep("D", 9))

data = data.frame(x, group)
data

av = aov(x ~ group)
summary(av)

Page 20: Tuan V. Nguyen

Result

> av = aov(x ~ group)
> summary(av)
            Df Sum Sq Mean Sq F value   Pr(>F)
group        3  642.3  214.09   8.197 0.000528 ***
Residuals   26  679.1   26.12
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

There is a difference between groups

Page 21: Tuan V. Nguyen

Analysis of variance: Summary

• ANOVA is used for testing hypotheses that involve the comparison of multiple group means

• R function for ANOVA: aov

analysis = aov(y ~ group)

Page 22: Tuan V. Nguyen

Post hoc analyses

Page 24: Tuan V. Nguyen

Multiple tests of hypothesis

• In a study with m groups, the number of possible pairwise comparisons is:

m(m – 1) / 2

• Each pairwise comparison has a 5% chance of producing a type I error (assuming α = 5%)
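For the four-group hormone example, this count can be verified in R:

```r
# Number of possible pairwise comparisons among m groups: m(m - 1)/2
m <- 4                    # the four-group hormone example
m * (m - 1) / 2           # 6
choose(m, 2)              # same count as a binomial coefficient: 6
```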

Page 25: Tuan V. Nguyen

Comparison 1P=0.95

P=0.05

Accept H0

Reject H0

P(Accept,Accept) = 0.95*0.95

=0.9025

P(Accept,Reject) = 0.95*0.05

=0.0475

P(Reject,Accept) = 0.05*0.95

=0.0475

P(Reject,Reject) = .05*0.05

=0.0025

Accept H0

Reject H0

Accept H0

Reject H0

P=0.95

P=0.05

P=0.95

P=0.05

Comparison 2

When there are multiple tests

Page 26: Tuan V. Nguyen

Inflation of type I error

• Type I error: probability of significance (P < α) when there is in fact no difference

• Multiple tests of hypothesis → increase in type I error

– For a single test, type I error = α

– For 6 tests, with α = 0.05, type I error = 1 – (1 – α)^6 = 0.26

– For k tests, type I error = 1 – (1 – α)^k

(This is also called the experiment-wise error rate, or family-wise type I error rate)
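The inflation is easy to see numerically in R:

```r
# Family-wise type I error rate grows quickly with the number of tests
alpha <- 0.05
k <- 1:10
fwer <- 1 - (1 - alpha)^k
round(fwer[6], 2)    # 0.26 for 6 tests, as on the slide
```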

Page 27: Tuan V. Nguyen

Post-hoc comparisons

• Planned comparisons: used when there is a small number of hypotheses to be tested

• Post-hoc comparisons: used when all possible comparisons are tested

• Omnibus F test

– Significance (P < 0.05) signals that a difference exists

– But which specific groups are different?

Page 28: Tuan V. Nguyen

Methods for post-hoc comparisons

• Classic methods

– Least significant difference (LSD): protected t-test

– Bonferroni's adjustment

• Standard methods

– Tukey's Honestly Significant Difference

– Newman-Keuls, Ryan, Scheffé, etc.

• Newer / modern methods

– False discovery rate (FDR)

Page 29: Tuan V. Nguyen

Least significant difference (LSD)

• Multiple t-tests with no correction:

t = (X̄1 – X̄2) / √(2 MSerror / n)
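In R, the LSD approach corresponds to pairwise t tests with a pooled SD and no p-value adjustment; a sketch using the hormone data:

```r
# LSD = uncorrected pairwise t tests with a pooled SD (hormone data)
x <- c(8, 9, 11, 4, 7, 8, 5,               # group A
       7, 17, 10, 14, 12, 24, 11, 22,      # group B
       28, 21, 26, 11, 24, 19,             # group C
       26, 16, 13, 12, 9, 10, 11, 17, 15)  # group D
group <- rep(c("A", "B", "C", "D"), times = c(7, 8, 6, 9))
res <- pairwise.t.test(x, group, p.adjust.method = "none", pool.sd = TRUE)
res$p.value   # raw p-values: no protection against multiplicity
```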

Page 30: Tuan V. Nguyen

Carlo Emilio Bonferroni (1892 - 1960)

• Italian statistician, University of Florence

• Best known for "Bonferroni's inequality", and

• Bonferroni's correction in post-hoc analysis

Page 31: Tuan V. Nguyen

Bonferroni's adjustment

• When there are c tests of hypothesis, the probability of finding a chance significance is 1 – (1 – α)^c

• Correcting for multiple tests of hypothesis

• Use α* = α / c, where c is the number of tests

• If the observed P value < α*, then declare "significant"
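The adjustment can be applied in two equivalent ways in R: shrink α, or inflate the p-values with p.adjust(). The three raw p-values here are hypothetical, for illustration only:

```r
# Bonferroni two ways: shrink alpha, or inflate the p-values
p <- c(0.011, 0.020, 0.040)   # three hypothetical raw p-values
c_tests <- length(p)
alpha_star <- 0.05 / c_tests                 # 0.0167: compare raw p to this
p_adj <- p.adjust(p, method = "bonferroni")  # or compare adjusted p to 0.05
p_adj                                        # 0.033 0.060 0.120
```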

Page 32: Tuan V. Nguyen

Tukey’s HSD

• HSD = Honestly Significant Difference

Q = (X̄j – X̄k) / √(MSW / n)

n is the average number of subjects per group

• If observed Q > theoretical Q (theoretical Tukey's Studentized critical value) then the difference is "statistically significant"

Page 33: Tuan V. Nguyen

Tukey's studentized

• Studentized range statistic:

Q = (max X̄i – min X̄i) / √(MSW / N)

• The difference between X̄i and X̄j is significant if:

Qij = (X̄i – X̄j) / √(MSW / N) > Qα,k,n–k

where Qα,k,n–k is the theoretical (critical) value of the studentized range distribution

• When the group sizes ni are not the same, use

N = 2 ni nj / (ni + nj)

Page 34: Tuan V. Nguyen

Newer method: False discovery rate (FDR)

• Proposed by Benjamini and Hochberg in a seminal paper (J Roy Statist Soc B, 1995) with almost 27,000 citations

• Question: how many false discoveries have we made?

FDR = (# of falsely rejected null hypotheses) / (total # of rejected null hypotheses)

Page 35: Tuan V. Nguyen

Benjamini & Hochberg's method

• Create a vector A of the sorted p-values

• Create a vector B with entries j(α/n) for j = 1, 2, ..., n

• Subtract vector B from vector A; call this vector C

• Find the largest index d (from 1 to n) for which the corresponding entry of C is negative

• Reject all null hypotheses whose p-values are ≤ Pd

• The null hypotheses for all other tests are not rejected
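The steps above can be written out in a few lines of R and checked against the built-in p.adjust(); the ten p-values here are hypothetical:

```r
# Benjamini-Hochberg by hand, on ten hypothetical p-values
p <- c(0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216)
n <- length(p)
A <- sort(p)               # vector A: sorted p-values
B <- (1:n) * (0.05 / n)    # vector B: j * (alpha / n)
d <- max(which(A <= B))    # largest index with p_(j) <= j*alpha/n
                           # (assumes at least one such index exists)
rejected <- p <= A[d]      # reject hypotheses with p <= P_d
sum(rejected)              # 2 rejections
# agrees with R's built-in adjustment:
all(rejected == (p.adjust(p, method = "BH") <= 0.05))   # TRUE
```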

Page 36: Tuan V. Nguyen

FDR explained

• The FDR of a set of hypothesis tests is the expected percent of falsely rejected hypotheses.

• If FDR = 0.3, then we should expect 70% of the rejected hypotheses to be true discoveries

Page 37: Tuan V. Nguyen

Which method is appropriate?

The method that yields the narrowest confidence interval

Page 38: Tuan V. Nguyen

Adjustment for multiple comparisons

# Bonferroni
pairwise.t.test(x, group, p.adjust.method = "bonferroni", pool.sd = TRUE)

# Benjamini-Hochberg
pairwise.t.test(x, group, p.adjust.method = "BH", pool.sd = TRUE)

Page 39: Tuan V. Nguyen

> pairwise.t.test(x, group, p.adjust.method = "bonferroni", pool.sd = TRUE)

	Pairwise comparisons using t tests with pooled SD

data:  x and group

  A       B       C
B 0.06876 -       -
C 0.00023 0.11673 -
D 0.07547 1.00000 0.07911

> pairwise.t.test(x, group, p.adjust.method = "BH", pool.sd = TRUE)

	Pairwise comparisons using t tests with pooled SD

data:  x and group

  A       B       C
B 0.01978 -       -
C 0.00023 0.02335 -
D 0.01978 0.90741 0.01978

P value adjustment method: BH

Page 40: Tuan V. Nguyen

R code - Tukey’s Method

m = aov(x ~ group)

TukeyHSD(m)

> TukeyHSD(m)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = x ~ group)

$group
          diff          lwr        upr     p adj
B-A  7.1964286  -0.05969765 14.4525548 0.0525014
C-A 14.0714286   6.27132726 21.8715299 0.0002134
D-A  6.9047619  -0.16073856 13.9702624 0.0571911
C-B  6.8750000  -0.69675602 14.4467560 0.0850381
D-B -0.2916667  -7.10424368  6.5209103 0.9994049
D-C -7.1666667 -14.55594392  0.2226106 0.0597131

Page 41: Tuan V. Nguyen

plot(TukeyHSD(m), ordered=T)

[Plot: 95% family-wise confidence intervals for the differences in mean levels of group (B-A, C-A, D-A, C-B, D-B, D-C), horizontal axis from –10 to 20]

Page 42: Tuan V. Nguyen

ANOVA: summary

• ANOVA

– An extension of the 2-sample t-test to multiple groups

– Comparing between-group variation to within-group variation

• Post-hoc comparisons

– Help control the type I error

– Beware of alpha-level and power issues