Statistical Tests for Computational Intelligence Research and Human Subjective Tests
Hideyuki TAKAGI
Kyushu University, Japan
http://www.design.kyushu-u.ac.jp/~takagi/
ver. March 26, 2015
ver. July 15, 2013
ver. July 11, 2013
ver. April 23, 2013
Slides are downloadable from http://www.design.kyushu-u.ac.jp/~takagi

Contents
data distribution                   | relation               | 2 groups                              | n groups (n > 2)
Parametric test (normality)         | unpaired (independent) | unpaired t-test                       | one-way ANOVA (one-way data)
Parametric test (normality)         | paired (related)       | paired t-test                         | two-way ANOVA (two-way data)
Non-parametric test (no normality)  | unpaired (independent) | Mann-Whitney U-test                   | Kruskal-Wallis test (one-way data)
Non-parametric test (no normality)  | paired (related)       | sign test, Wilcoxon signed-ranks test | Friedman test (two-way data)
(ANOVA: Analysis of Variance)
+ Scheffé's method of paired comparison for Human Subjective Tests
How to Show Significance?
[Figure: convergence curves (fitness vs. generations) of a conventional EC, proposed EC1, and proposed EC2.]
Fig. XX Average convergence curves of n times of trial runs.
Just compare averages visually? It is not scientific.
How to Show Significance?
Sound design concept: exciting
[Figure: sound made by conventional IEC, sound made by proposed IEC1, sound made by proposed IEC2.]
Which method is good for making an exciting sound? How to show it?
You cannot show the superiority of your method without statistical tests.
Papers without statistical tests may be rejected.
With a statistical test, you can say: "My method is significantly better!"
Which Test Should We Use?
data distribution                   | relation               | 2 groups                              | n groups (n > 2)
Parametric test (normality)         | unpaired (independent) | unpaired t-test                       | one-way ANOVA (one-way data)
Parametric test (normality)         | paired (related)       | paired t-test                         | two-way ANOVA (two-way data)
Non-parametric test (no normality)  | unpaired (independent) | Mann-Whitney U-test                   | Kruskal-Wallis test (one-way data)
Non-parametric test (no normality)  | paired (related)       | sign test, Wilcoxon signed-ranks test | Friedman test (two-way data)
(ANOVA: Analysis of Variance)
t-Test
When (p > 0.05), we assume that there is no significant difference between the variances $\sigma_A^2$ and $\sigma_B^2$.
Excel (32-bit version only?) has t-tests and ANOVA in its Data Analysis Tools. You must install the add-in (File -> Options -> Add-ins, then enable it).
When (p-value < 0.01 or 0.05), there is/are significant difference(s) somewhere among the data groups.
• No significant difference is found among Samples (e.g. initial conditions) (p > 0.05).
• A significant difference is found somewhere among Columns (e.g. three methods) (p < 0.01).
• We need not consider an interaction effect between the two factors (e.g. initial condition vs. methods) (p > 0.05).
A B C
11.0 12.8 9.4
9.3 11.3 12.4
11.5 9.5 16.8
16.4 14.0 14.3
16.0 15.2 17.0
15.0 13.0 14.6
12.8 12.4 17.0
13.6 15.0 14.3
13.0 12.4 15.6
12.0 17.8 15.0
13.4 12.6 18.6
10.0 13.4 12.4
10.8 16.8 15.4
(Rows: sample factor. Columns: column factor.)
ANOVA: Analysis of Variance
Source of Variation | SS       | df | MS       | F        | P-value  | F crit
Sample              | 0.755233 | 2  | 0.377617 | 2.755097 | 0.103596 | 3.885294
Columns             | 3.582272 | 1  | 3.582272 | 26.13631 | 0.000256 | 4.747225
Interaction         | 0.139411 | 2  | 0.069706 | 0.508573 | 0.613752 | 3.885294
Within              | 1.644733 | 12 | 0.137061 |          |          |
Total               | 6.12165  | 17 |          |          |          |
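As a small check of how the ANOVA table fits together, the sketch below recomputes each F value as MS / MS(Within), with MS = SS / df. The numbers are copied from the table above; the function and variable names are only illustrative.

```python
# (source, SS, df) rows copied from the ANOVA table above.
rows = [
    ("Sample", 0.755233, 2),
    ("Columns", 3.582272, 1),
    ("Interaction", 0.139411, 2),
]
ss_within, df_within = 1.644733, 12


def mean_square(ss, df):
    """MS = SS / df (unbiased variance of one source of variation)."""
    return ss / df


ms_within = mean_square(ss_within, df_within)

# F = MS of the source divided by MS of the within-group (error) term.
f_values = {name: mean_square(ss, df) / ms_within for name, ss, df in rows}

for name, f in f_values.items():
    print(f"{name}: F = {f:.4f}")
```

Each F value is then compared with its F crit (or its P-value with 0.05 / 0.01), which is how the three bullet conclusions above are read off the table.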
Q1: Where is the significant difference among A, B, and C?
A1: Apply multiple comparisons between all pairs among columns.
(Fisher's PLSD method, Scheffé method, Bonferroni-Dunn test, Dunnett method, Williams method, Tukey method, Nemenyi test, Tukey-Kramer method, Games/Howell method, Duncan's new multiple range test, Student-Newman-Keuls method, etc. Each has different characteristics.)
Non-Parametric Tests
If normality and equal variances are not guaranteed, use non-parametric tests.
Mann-Whitney U-test
Mann-Whitney U-test (Wilcoxon-Mann-Whitney test, two-sample Wilcoxon test)
1. Comparison of two groups.
2. Data have no normality.
3. There are no corresponding data between the two groups (independent).
[Figure: data of two groups at the n-th generation, with no normality.]
Mann-Whitney U-test (Wilcoxon-Mann-Whitney test, two-sample Wilcoxon test)
1. Calculate a U value.
   Counts in the figure: 0, 2, 3, 4 (when two values are the same, count as 0.5).
   U = 0 + 2 + 3 + 4 = 9
   U' = 11 (U + U' = n1 n2)
2. See a U-test table.
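The counting in step 1 can be sketched as follows. The data values are hypothetical (the slide shows them only as a figure); they were chosen so that the counts come out as 0, 2, 3, 4. Which group is counted against which is also an assumption here: for each value of one group we count how many values of the other group are smaller.

```python
def mann_whitney_u(group_a, group_b):
    """Return (U, U').

    U sums, over every value of group_a, the number of values in group_b
    that are smaller; a tie counts as 0.5. U' = n1 * n2 - U.
    """
    u = 0.0
    for a in group_a:
        for b in group_b:
            if b < a:
                u += 1.0
            elif b == a:
                u += 0.5
    u_prime = len(group_a) * len(group_b) - u
    return u, u_prime


# Hypothetical data (not from the slides), giving counts 0, 2, 3, 4.
group_a = [1, 5, 7, 9]
group_b = [2, 4, 6, 8, 10]

u, u_prime = mann_whitney_u(group_a, group_b)
print(u, u_prime, min(u, u_prime))
```

The smaller of U and U' is the value that is looked up in the U-test table.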
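Mann-Whitney U-test (cont.) (Wilcoxon-Mann-Whitney test, two-sample Wilcoxon test)
• Use the smaller value of U or U'.
• When n1 ≤ 20 and n2 ≤ 20, see a Mann-Whitney test table (where n1 and n2 are the # of data of the two groups).
• Otherwise, since U roughly follows the normal distribution
  $N(\mu_U, \sigma_U^2) = N\left(\frac{n_1 n_2}{2}, \frac{n_1 n_2 (n_1 + n_2 + 1)}{12}\right)$,
  normalize U as
  $z = \frac{U - \mu_U}{\sigma_U} = \frac{U - \frac{n_1 n_2}{2}}{\sqrt{\frac{n_1 n_2 (n_1 + n_2 + 1)}{12}}}$
  and check a standard normal distribution table with the z, where $\mu_U = \frac{n_1 n_2}{2}$ and $\sigma_U = \sqrt{\frac{n_1 n_2 (n_1 + n_2 + 1)}{12}}$.
Use an Excel function to calculate the p-value for the z-value:
p-value = 1 - NORM.S.DIST( z, TRUE ), using the absolute value of z (the smaller U gives a negative z).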
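The normal approximation above can be sketched with the standard library only. The sample sizes and the U value are hypothetical, and taking the absolute value of z before computing the upper-tail probability is an assumption made here so that the smaller U (which gives a negative z) still yields a small p-value.

```python
import math


def u_to_z(u, n1, n2):
    """Normalize U with mean n1*n2/2 and variance n1*n2*(n1+n2+1)/12."""
    mu = n1 * n2 / 2.0
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    return (u - mu) / sigma


def upper_tail_p(z):
    """1 - Phi(|z|) for the standard normal distribution (one-sided)."""
    return 0.5 * math.erfc(abs(z) / math.sqrt(2.0))


# Hypothetical large-sample case (values not from the slides).
n1, n2, u = 25, 30, 250.0
z = u_to_z(u, n1, n2)
p = upper_tail_p(z)
print(f"z = {z:.4f}, one-sided p = {p:.4f}")
```

`upper_tail_p` plays the role of `1 - NORM.S.DIST(|z|, TRUE)` in Excel; doubling it would give a two-sided p-value.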
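Examples: Mann-Whitney U-test (Wilcoxon-Mann-Whitney test, two-sample Wilcoxon test)
Ex.1: counts 0, 2, 3, 4; U = 9, U' = 11
Ex.2: counts 3.5, 5, 5, 5, 5; U = 23.5, U' = 1.5
Ex.3: counts 0, 0.5, 2.5, 4, 5; U = 12, U' = 13

U-test table (p < 0.05); rows: n2, columns: n1
n2 \ n1 |  4  |  5  |  6  | ...
...     |  -  | ... | ... | ...
4       |  0  |  1  |  2  | ...
5       |     |  2  |  3  | ...
...     |     | ... | ... | ...

U-test table (p < 0.01); rows: n2, columns: n1
n2 \ n1 |  4  |  5  |  6  | ...
...     |  -  | ... | ... | ...
4       |  -  |  -  |  0  | ...
5       |     |  1  |  1  | ...
...     |     | ... | ... | ...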
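Exercise: Mann-Whitney U-test (Wilcoxon-Mann-Whitney test, two-sample Wilcoxon test)
Counts: 2.5, 4, 5, 6, 6, 6
U = 29.5, U' = 6.5

U-test table (p < 0.05); rows: n2, columns: n1
n2 \ n1 |  4  |  5  |  6  |  7
3       |  -  |  0  |  1  |  1
4       |  0  |  1  |  2  |  3
5       |     |  2  |  3  |  5
6       |     |     |  5  |  6

U-test table (p < 0.01); rows: n2, columns: n1
n2 \ n1 |  4  |  5  |  6  |  7
3       |  -  |  -  |  -  |  -
4       |  -  |  -  |  0  |  0
5       |     |  1  |  1  |  1
6       |     |     |  2  |  3

Since U' (= 6.5) > 5, (p > 0.05): significance is not found.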
Sign Test
(1) Sign Test: significance test between the # of wins and losses.
(2) Wilcoxon's Signed-Ranks Test: significance test using both the # of wins and losses and the level of wins/losses.

data of 2 groups | # of wins and losses (sign) | level of wins/losses (difference)
173  174         | -                           | -1
143  137         | +                           | +6
158  151         | +                           | +7
156  143         | +                           | +13
176  180         | -                           | -4
165  162         | +                           | +3
Sign Test
[Figure: paired runs of two methods compared at the n-th generation.]
1. Calculate the # of wins and losses by comparing runs with the same initial data.
2. Check a sign test table to show the significance of the difference between the two methods.
Sign Test
[Figure: sign test results over generations 0 to 50 for benchmark functions F1 to F8, comparing DE_N with DE_LR, DE_LS, DE_FR_GLB_nD, DE_FR_LOC_nD, DE_FR_GLB_1D, and DE_FR_LOC_1D.]
Fig. 3 in Y. Pei and H. Takagi, "Fourier analysis of the fitness landscape for evolutionary search acceleration," IEEE Congress on Evolutionary Computation (CEC), pp.1-7, Brisbane, Australia (June 10-15, 2012).
The (+, -) marks show whether our proposed methods converge significantly better or poorer than normal DE, respectively (p ≤ 0.05).
Fig.2 in the same paper.
Task Example
Are the performances of pattern recognition methods A and B significantly different?
n1 cases: Both methods succeeded.
n2 cases: Method A succeeded, and method B failed.
n3 cases: Method A failed, and method B succeeded.
n4 cases: Both methods failed.
How to check?
1. Set N = n2 + n3.
2. Check the sign test table (levels of significance 5% and 1%) with the N.
3. If min(n2, n3) is smaller than the number for the N, we can say that there is a significant difference at the significance level of XX.
Exercise
Is there a significant difference for n2 = 12 and n3 = 28?
ANSWER: Check the sign test table with N = 40. As n2 (= 12) is bigger than 11 and smaller than 13, we can say that there is a significant difference between the two with (p < 0.05) but cannot say so with (p < 0.01).
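Instead of a printed sign test table, the p-value can be computed directly from the binomial distribution. The sketch below assumes a two-sided test with equal chances of a win and a loss under the null hypothesis; the function name is only illustrative.

```python
from math import comb


def sign_test_p(wins, losses):
    """Two-sided exact p-value when wins and losses are equally likely."""
    n = wins + losses
    k = min(wins, losses)
    # Probability of k or fewer occurrences of the rarer sign.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Exercise above: n2 = 12 and n3 = 28 (N = 40).
p_exercise = sign_test_p(12, 28)

# Six paired data with 4 wins and 2 losses.
p_small = sign_test_p(4, 2)

print(f"12 vs. 28: p = {p_exercise:.4f}")
print(f"4 vs. 2:   p = {p_small:.4f}")
```

For 12 vs. 28 the p-value falls between 0.01 and 0.05, in line with the table-based answer; with only six pairs split 4 vs. 2, the sign test cannot show significance.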
Let's think about the case of N = 17.
To say that n1 and n2 are significantly different, we need
(n1 vs. n2) = (17 vs. 0), (16 vs. 1), or (15 vs. 2) (p < 0.01)
or
(n1 vs. n2) = (14 vs. 3) or (13 vs. 4) (p < 0.05).
Exercise: Sign Test
Check the significance of:
16 vs. 4
14 vs. 1
9 vs. 3
18 vs. 5
Wilcoxon Signed-Ranks Test
Wilcoxon Signed-Ranks Test
Q: What should we do when a sign test could not show significance?
A: Try the Wilcoxon signed-ranks test. It is more sensitive than a simple sign test because it uses more information.
Wilcoxon Signed-Ranks Test
(step 1) Calculate the difference d.
(step 2) Rank |d|.
(step 3) Add the sign to the ranks.
(step 4) Pick the ranks of the fewer # of signs.
(step 5) T = Σ (ranks of step 4).
(step 6) Check a Wilcoxon test table.

v (system A) | v (system B) | difference d | rank of |d| | signed rank | rank of fewer # of signs
182          | 163          | 19           | 7           | 7           |
169          | 142          | 27           | 8           | 8           |
173          | 172          | 1            | 1           | 1           |
143          | 137          | 6            | 4           | 4           |
158          | 151          | 7            | 5           | 5           |
156          | 143          | 13           | 6           | 6           |
176          | 172          | 4            | 3           | 3           |
165          | 168          | -3           | 2           | -2          | 2
T = 2

As T (= 2) < 3, there is a significant difference between A and B (p < 0.05). But, as 0 < T (= 2), we cannot say so with the significance level of (p < 0.01).
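Steps 1 to 5 can be sketched as below, using the eight data pairs of the example. As an assumption, T is taken as the smaller of the two signed-rank sums, which coincides with "ranks of the fewer # of signs" in this example; tied |d| values share their average rank and d = 0 is dropped.

```python
def wilcoxon_t(values_a, values_b):
    """Return (T, n): the smaller signed-rank sum and the # of non-zero d."""
    diffs = [a - b for a, b in zip(values_a, values_b) if a != b]
    ordered = sorted(diffs, key=abs)
    ranks = [0.0] * len(ordered)
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over the run of equal |d| values.
        while j + 1 < len(ordered) and abs(ordered[j + 1]) == abs(ordered[i]):
            j += 1
        average_rank = (i + 1 + j + 1) / 2.0  # tied |d| share the average rank
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    positive = sum(r for d, r in zip(ordered, ranks) if d > 0)
    negative = sum(r for d, r in zip(ordered, ranks) if d < 0)
    return min(positive, negative), len(ordered)


system_a = [182, 169, 173, 143, 158, 156, 176, 165]
system_b = [163, 142, 172, 137, 151, 143, 172, 168]

t_value, n = wilcoxon_t(system_a, system_b)
print(f"T = {t_value}, n = {n}")
```

The resulting T and n are then looked up in the Wilcoxon test table (step 6).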
Exercise 2: Wilcoxon Signed-Ranks Test
(step 1) difference d, (step 2) rank of |d|, (step 3) add sign to the ranks, (step 4) rank of fewer # of signs, (step 5) T = Σ (ranks of step 4), (step 6) Wilcoxon test table.

v (system A) | v (system B) | difference d | rank of |d| | signed rank | rank of fewer # of signs
27           | 31           |              |             |             |
20           | 25           |              |             |             |
34           | 33           |              |             |             |
25           | 27           |              |             |             |
31           | 31           |              |             |             |
23           | 29           |              |             |             |
26           | 27           |              |             |             |
24           | 30           |              |             |             |
35           | 34           |              |             |             |
n =

Answer:
v (system A) | v (system B) | difference d | rank of |d| | signed rank | rank of fewer # of signs
27           | 31           | -4           | 5           | -5          |
20           | 25           | -5           | 6           | -6          |
34           | 33           | 1            | 2           | 2           | 2
25           | 27           | -2           | 4           | -4          |
31           | 31           | 0            |             |             |
23           | 29           | -6           | 7.5         | -7.5        |
26           | 27           | -1           | 2           | -2          |
24           | 30           | -6           | 7.5         | -7.5        |
35           | 34           | 1            | 2           | 2           | 2
n = 8 (the case of d = 0 is not counted.)
T = 4
As T > 3, we cannot say that there is a significant difference between A and B.
Exercise 3: Wilcoxon Signed-Ranks Test
Explain how to apply this test to check whether two groups are significantly different at the n-th generation shown in the figure.
Kruskal-Wallis Test
Kruskal-Wallis Test
1. Comparison of more than two groups.
2. Data have no normality.
3. There are no corresponding data among groups (independent).
[Figure: data of three groups at the n-th generation, with no normality.]

Kruskal-Wallis Test
Let's use ranks of data.
[Figure: all 17 data of the three groups are ranked from 1 to 17.]
Kruskal-Wallis Test
N: total # of data
k: # of groups
n_i: # of data of group i
R_i: sum of ranks of group i

1. Rank all data.
2. Calculate N, k, n_i and R_i.
3. Calculate the statistical value H:
   $H = \frac{12}{N(N+1)} \sum_{i=1}^{k} \frac{R_i^2}{n_i} - 3(N+1)$
4. If k = 3 and N ≤ 17, compare H with a significant point in a Kruskal-Wallis test table. Otherwise, assume that H follows the χ² distribution and test H using a χ² table with (k - 1) degrees of freedom.

Example: R1 = 38, R2 = 69, R3 = 46, so H = 6.609.
Since the significant points of (p<0.05) and (p<0.01) for (n1, n2, n3) = (6, 5, 6) are 5.765 and 8.124, respectively, and 5.765 < 6.609 < 8.124, there is/are significant difference(s) somewhere among the three groups (p<0.05).
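The statistic H = 12 / (N(N+1)) · Σ R_i² / n_i − 3(N+1) can be checked with a few lines of code. The sketch below uses the rank sums (38, 69, 46) and group sizes (6, 5, 6) of the example; the function name is only illustrative.

```python
def kruskal_wallis_h(rank_sums, group_sizes):
    """H = 12 / (N (N + 1)) * sum(R_i^2 / n_i) - 3 (N + 1)."""
    n_total = sum(group_sizes)
    weighted = sum(r * r / n for r, n in zip(rank_sums, group_sizes))
    return 12.0 / (n_total * (n_total + 1)) * weighted - 3.0 * (n_total + 1)


h = kruskal_wallis_h([38, 69, 46], [6, 5, 6])
print(f"H = {h:.3f}")

# Significant points for (n1, n2, n3) = (6, 5, 6), taken from the slide.
print("p < 0.05:", h > 5.765)
print("p < 0.01:", h > 8.124)
```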
Q1: Where is the significant difference among A, B, and C?
A1: Apply multiple comparisons between all pairs among columns.
(Fisher's PLSD method, Scheffé method, Bonferroni-Dunn test, Dunnett method, Williams method, Tukey method, Nemenyi test, Tukey-Kramer method, Games/Howell method, Duncan's new multiple range test, Student-Newman-Keuls method, etc. Each has different characteristics.)
Exercise: Kruskal-Wallis Test
Ranks of group 1: 1, 2, 4, 7, 10
Ranks of group 2: 8, 11, 12, 13
Ranks of group 3: 3, 5, 6, 9

N = n1 + n2 + n3 =
k =
(n1, n2, n3) =
(R1, R2, R3) =
$H = \frac{12}{N(N+1)} \sum_{i=1}^{k} \frac{R_i^2}{n_i} - 3(N+1) =$

Answer:
N = 13 samples
k = 3 groups
(n1, n2, n3) = (5, 4, 4)
(R1, R2, R3) = (24, 44, 23)
H = 6.227
significance point of (p<0.05): 5.657
significance point of (p<0.01): 7.760
Since 5.657 < 6.227 < 7.760, there is/are significant difference(s) somewhere among the three groups (p<0.05).
Friedman Test
Friedman Test
When
(1) there are more than two groups,
(2) data have correspondence (not independent), but
(3) the conditions of two-way ANOVA are not satisfied,
let's use ranks of data and the Friedman test.

(ex.) Comparison of recognition rates.
benchmark tasks | method a | method b | method c | method d
A               | 0.92     | 0.75     | 0.65     | 0.81
B               | 0.48     | 0.45     | 0.41     | 0.52
C               | 0.56     | 0.41     | 0.47     | 0.50
D               | 0.61     | 0.50     | 0.56     | 0.54
[Figure: the four methods are ranked from 1 to 4 within each benchmark task.]
Friedman Test
Step 1: Make a ranking table.
Step 2: Sum the ranks of the factor that you want to test.
Step 3: Calculate the Friedman test value, $\chi_r^2$:
   $\chi_r^2 = \frac{12}{nk(k+1)} \sum_{i=1}^{k} R_i^2 - 3n(k+1)$
   where (k, n) are the # of levels of factors 1 and 2.
Step 4: If k = 3 or 4, compare $\chi_r^2$ with a significant point in a Friedman test table. Otherwise, use a χ² table of (k - 1) degrees of freedom.

Ranking among methods (# of methods: k = 4; # of data: n = 4)
benchmark tasks | method a | method b | method c | method d
A               | 4        | 2        | 1        | 3
B               | 3        | 2        | 1        | 4
C               | 4        | 1        | 2        | 3
D               | 4        | 1        | 3        | 2
Σ               | 15       | 6        | 7        | 12
Example: Friedman Test
Step 1: Make a ranking table.
Step 2: Sum the ranks of the factor that you want to test.

Ranking among methods (# of methods: k = 4; # of data: n = 4)
benchmark tasks | method a | method b | method c | method d
A               | 4        | 2        | 1        | 3
B               | 3        | 2        | 1        | 4
C               | 4        | 1        | 2        | 3
D               | 4        | 1        | 3        | 2
Σ               | 15       | 6        | 7        | 12

Step 3: Calculate the Friedman test value, $\chi_r^2$:
   $\chi_r^2 = \frac{12}{nk(k+1)} \sum_{i=1}^{k} R_i^2 - 3n(k+1) = \frac{12}{4 \cdot 4 \cdot 5}(15^2 + 6^2 + 7^2 + 12^2) - 3 \cdot 4 \cdot 5 = 8.1$
Step 4: Since the significant point for (k, n) = (4, 4) is 7.80 (p<0.05) and 7.80 < 8.1 < 9.60, there is/are significant difference(s) somewhere among the four methods, a, b, c, and d (p<0.05).

Friedman test table.
k | n | p<0.05 | p<0.01
3 | 3 | 6.00   | -
3 | 4 | 6.50   | 8.00
3 | 5 | 6.40   | 8.40
3 | 6 | 7.00   | 9.00
3 | 7 | 7.14   | 8.86
3 | 8 | 6.25   | 9.00
3 | 9 | 6.22   | 9.56
3 | ∞ | 5.99   | 9.21
4 | 3 | 7.40   | 9.00
4 | 4 | 7.80   | 9.60
4 | 5 | 7.80   | 9.96
4 | ∞ | 7.81   | 11.34
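Steps 1 to 3 of the Friedman test can be sketched as follows, starting from the recognition rates of the example (tasks A to D in rows, methods a to d in columns). The ranking helper assumes no ties within a row, which holds for this data; all names are illustrative.

```python
def rank_row(row):
    """Rank the values of one row, 1 = smallest; assumes no ties in the row."""
    order = sorted(row)
    return [order.index(v) + 1 for v in row]


def friedman_statistic(table):
    """Return (rank sums per column, Friedman test value chi2_r)."""
    n = len(table)      # # of data (rows, e.g. benchmark tasks)
    k = len(table[0])   # # of methods (columns)
    rank_sums = [0] * k
    for row in table:
        for col, rank in enumerate(rank_row(row)):
            rank_sums[col] += rank
    chi2_r = (12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums)
              - 3.0 * n * (k + 1))
    return rank_sums, chi2_r


# Recognition rates of methods a, b, c, d on benchmark tasks A, B, C, D.
rates = [
    [0.92, 0.75, 0.65, 0.81],
    [0.48, 0.45, 0.41, 0.52],
    [0.56, 0.41, 0.47, 0.50],
    [0.61, 0.50, 0.56, 0.54],
]

rank_sums, chi2_r = friedman_statistic(rates)
print(rank_sums, round(chi2_r, 2))
```

The value is then compared with the significant point for (k, n) = (4, 4) in the Friedman test table.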
Q1: Where is the significant difference among a, b, c, and d?
A1: Apply multiple comparisons between all pairs among columns.
(Fisher's PLSD method, Scheffé method, Bonferroni-Dunn test, Dunnett method, Williams method, Tukey method, Nemenyi test, Tukey-Kramer method, Games/Howell method, Duncan's new multiple range test, Student-Newman-Keuls method, etc. Each has different characteristics.)
Multiple Comparisons
When there is a significant difference among groups, multiple comparison is used to find which group is significantly different from the others.
Example
$_4C_2$ = 6 pair comparisons, each with (p < 0.05):
$1 - (1 - 0.05)^6$ = significance level of 26.5%!
The solution is to apply the multiple pair comparisons with a stricter significance level.
Multiple Comparisons
-- Bonferroni method --
When pair comparisons are applied m times, let's use a significance level of p / m.
Example: $_4C_2$ = 6 pair comparisons, each with (p < 0.05 / 6).
Features:
(1) Simple.
(2) Rather strict, i.e. showing significance is rather hard.
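The 26.5% figure and the Bonferroni level can be reproduced with a short calculation. The sketch assumes the m comparisons are independent, which is what the expression 1 - (1 - 0.05)^6 implies; the names are illustrative.

```python
from math import comb


def familywise_error(alpha, m):
    """Chance of at least one false positive in m independent comparisons."""
    return 1.0 - (1.0 - alpha) ** m


m = comb(4, 2)                       # # of pairs among 4 groups
fwe = familywise_error(0.05, m)      # overall level without correction
bonferroni_level = 0.05 / m          # per-comparison level after correction

print(f"m = {m}")
print(f"uncorrected overall level = {fwe:.3f}")
print(f"Bonferroni per-comparison level = {bonferroni_level:.4f}")
print(f"overall level after correction = {familywise_error(bonferroni_level, m):.3f}")
```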
Multiple Comparisons
-- Holm method --
Corrected Bonferroni method to detect significances easily.
Example: six pair comparisons, sorted by p-value.
p-value | corrected p-value eqn. | corrected p-value
0.0076  | = p-value * 6          | 0.0456
0.0095  | = p-value * 5          | 0.0475
0.0280  | = p-value * 4          | 0.1120
0.0320  | = p-value * 3          | 0.0960
0.0380  | = p-value * 2          | 0.0760
0.0410  | = p-value * 1          | 0.0410
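The corrected p-values of the Holm example can be reproduced as below. One point goes beyond the slide and is stated here as the usual step-down convention: comparisons are tested from the smallest p-value, and once one corrected p-value fails to reach the level, all later comparisons are treated as not significant as well.

```python
def holm(p_values, alpha=0.05):
    """Return (corrected p-values in ascending order of p, reject flags)."""
    m = len(p_values)
    ordered = sorted(p_values)
    # The i-th smallest p-value is multiplied by (m - i), i = 0, 1, ...
    corrected = [p * (m - i) for i, p in enumerate(ordered)]
    rejects = []
    still_rejecting = True
    for c in corrected:
        # Step-down rule: stop rejecting at the first non-significant result.
        still_rejecting = still_rejecting and c < alpha
        rejects.append(still_rejecting)
    return corrected, rejects


p_values = [0.0076, 0.0095, 0.0280, 0.0320, 0.0380, 0.0410]
corrected, rejects = holm(p_values)

for p, c, r in zip(sorted(p_values), corrected, rejects):
    print(f"p = {p:.4f}  corrected = {c:.4f}  significant = {r}")
```

Under this convention only the first two comparisons are significant at the 5% level, even though the last corrected p-value (0.0410) is itself below 0.05.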
Scheffé's Method of Paired Comparison
+Scheffé's method of paired comparison for Human Subjective Tests
Interactive Evolutionary Computation (IEC)
[Figure: evolutionary computation optimizes a target system based on human subjective evaluations. Application examples: lighting design of 3-D CG, measuring mental scale, geological simulation, hearing-aid fitting, MEMS design, image enhancement processing, room layout planning design, and room lighting design by optimizing LED assignments.]
Scheffé's Method of Paired Comparison
ANOVA based on $_nC_2$ paired comparisons for n objects.
[Figure: each pair is rated on a scale "better / slightly better / even / slightly better / better"; the ratings go through ANOVA, followed by a significance check using a yardstick.]
Scheffé's Method of Paired Comparison
Original method and three modified methods
                  | All subjects must evaluate all pairs: no | All subjects must evaluate all pairs: yes
order effect: yes | original (原法, 1952)                     | Ura's variation (浦の変法, 1956)
order effect: no  | Haga's variation (芳賀の変法)              | Nakaya's variation (中屋の変法, 1970)

Order Effect
Presenting the same pair in the two possible orders, (1) one object and then the other, and (2) the reverse, may result in different evaluations.
Scheffé's Method of Paired Comparison
1. Ask N human subjects to evaluate t objects in 3, 5 or 7 grades.
2. Assign [-1, +1], [-2, +2] or [-3, +3] to these grades, respectively.
3. Then, start calculation (see other material).

Questionnaire: each pair is rated on a scale "better / slightly better / even / slightly better / better".

Total raw data: paired comparisons for t = 3 objects by six subjects (N = 6).
        | O1 | O2 | O3 | O4 | O5 | O6
A1 - A2 | 2  | 1  | 1  | 2  | 1  | 2
A1 - A3 | 2  | 2  | 1  | 1  | 1  | 1
A2 - A3 | 1  | 0  | 1  | 1  | -1 | 0
Application Example:
What is the best present to be her/his boy/girl friend?
Ex. Q.
[SITUATION] She/he is my longing. I want to be her/his boy/girl friend before we graduate from our university. To get over my one-way love, I decided to present something of about 3,000 JPY and express my heart.
I show you $_5C_2$ pairs of presents. Please compare each pair and mark your relative evaluation in five levels.
Presents: strap for a mobile phone, invitation to a dinner, tea / coffee, stuffed animal, fountain pen.
Results of Scheffé's Method of Paired Comparison (Nakaya's variation)
What is the best present to be her/his boy/girl friend?
[Figure: for presents from a male and presents from a female, the five presents are placed on a scale from -1 (less effective) to +1 (more effective), once for "I think effective." and once for "Reality is ...", with significant differences marked. Comments in the figure: "I will catch her heart by dinner.", "How about tea leaves or a stuffed animal?", "Eat! Eat! Eat!", "I hesitate to accept it as we have not gone about with him."]
Scheffé's Method of Paired Comparison
Modified methods by Ura and Nakaya

Scheffé's Method of Paired Comparison
Modified method by Ura
Pairwise comparisons for objects which are affected by display order (order effect).
Ask N human subjects to evaluate 2 × $_tC_2$ pairs for t objects in 3, 5 or 7 grades and assign [-1, +1], [-2, +2] or [-3, +3], respectively.
[Figure: questionnaire in which every pair is shown in both presentation orders and rated on a five-grade scale from -2 to +2 ("better / slightly better / even / slightly better / better").]
Scheffé's Method of Paired Comparison
Modified method by Ura
Step 1: Make a paired comparison table for each human subject (subjects O1, O2, O3, ...).
   | A1 | A2 | A3 | A4
A1 |    | 0  | -1 | -1
A2 | 3  |    | 0  | 0
A3 | 3  | 1  |    | -1
A4 | 3  | 3  | 1  |
$x_{ijl}$: evaluation value when the l-th human subject compares the i-th object with the j-th object.
Scheffé's Method of Paired Comparison
Modified method by Ura
Step 2: Make a table summing all subjects' data and calculate the average evaluations for all objects.
$\hat{\alpha}_i = \frac{1}{2tN}(x_{i\cdot\cdot} - x_{\cdot i\cdot})$
where t: # of objects (4), N: # of human subjects (3).

Average of four objects
object | $x_{i\cdot\cdot} - x_{\cdot i\cdot}$ | $\hat{\alpha}_i$
A1     | 27                                   | 1.1250
A2     | 13                                   | 0.5417
A3     | -12                                  | -0.5000
A4     | -28                                  | -1.1667
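As a quick check of Step 2, the sketch below applies the average formula of Ura's variation, α̂_i = (x_i.. − x_.i.) / (2tN), to the four values 27, 13, -12, -28 with t = 4 and N = 3. Reading those four values as the differences x_i.. − x_.i. is an assumption, supported by the fact that they reproduce the averages shown on the slide.

```python
def ura_average_preferences(row_minus_column_sums, n_subjects):
    """alpha_i = (x_i.. - x_.i.) / (2 t N) for every object i."""
    t = len(row_minus_column_sums)
    return [d / (2.0 * t * n_subjects) for d in row_minus_column_sums]


# Differences for objects A1..A4 (t = 4) summed over N = 3 subjects.
alphas = ura_average_preferences([27, 13, -12, -28], n_subjects=3)

for name, alpha in zip(["A1", "A2", "A3", "A4"], alphas):
    print(f"{name}: {alpha:.4f}")
```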
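Scheffé's Method of Paired Comparison
Modified method by Ura
Step 3: Make an ANOVA table.
$S_\alpha = \frac{1}{2tN} \sum_i (x_{i\cdot\cdot} - x_{\cdot i\cdot})^2$
$S_{\alpha(B)} = \frac{1}{2t} \sum_l \sum_i (x_{i\cdot l} - x_{\cdot il})^2 - S_\alpha$
$S_\gamma = \frac{1}{2N} \sum_i \sum_{j>i} (x_{ij\cdot} - x_{ji\cdot})^2 - S_\alpha$
$S_\delta = \frac{1}{t(t-1)N} x_{\cdot\cdot\cdot}^2$
$S_{\delta(B)} = \frac{1}{t(t-1)} \sum_l x_{\cdot\cdot l}^2 - S_\delta$
$S_T = \sum_l \sum_i \sum_{j \ne i} x_{ijl}^2$
$S_\varepsilon = S_T - S_\alpha - S_{\alpha(B)} - S_\gamma - S_\delta - S_{\delta(B)}$
where the table lists $S_\alpha$, $S_{\alpha(B)}$, $S_\gamma$, $S_\delta$, $S_{\delta(B)}$, $S_\varepsilon$, $S_T$, and f: degree of freedom.
unbiased variance = S / f
F = (unbiased variance) / (unbiased variance of $S_\varepsilon$), used for F tests.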
Scheffé's Method of Paired Comparison
Modified method by Ura
ANOVA table.
$\hat{\alpha}_1$ (A1) = 1.1250, $\hat{\alpha}_2$ (A2) = 0.5417, $\hat{\alpha}_3$ (A3) = -0.5000, $\hat{\alpha}_4$ (A4) = -1.1667
There are significant differences among A1 - A4.
Scheffé's Method of Paired Comparison
Modified method by Ura
Step 4: Apply multiple comparisons between all pairs and find which distance is significant.
Q1: Where is the significant difference among A1, A2, A3, and A4?
A1: Apply multiple comparisons between all pairs.
(Fisher's PLSD method, Scheffé method, Bonferroni-Dunn test, Dunnett method, Williams method, Tukey method, Nemenyi test, Tukey-Kramer method, Games/Howell method, Duncan's new multiple range test, Student-Newman-Keuls method, etc. Each has different characteristics.)

Example of a simple multiple comparison.
• Calculate a studentized yardstick.
• When a difference of averages > the studentized yardstick, the distance is significant.
$\hat{\alpha}_1$ (A1) = 1.1250, $\hat{\alpha}_2$ (A2) = 0.5417, $\hat{\alpha}_3$ (A3) = -0.5000, $\hat{\alpha}_4$ (A4) = -1.1667

$Y_\phi = q_\phi(t, f) \sqrt{\hat{\sigma}^2 / (2tN)}$ (studentized yardstick)
where $(\hat{\sigma}^2, t, N)$ are the unbiased variance of $S_\varepsilon$, the # of objects, and the # of human subjects; $q_\phi(t, f)$ is a studentized range obtained from a statistical test table for t, the degree of freedom of $S_\varepsilon$ (f), and the significance level φ; see these variables in an ANOVA table.
When (t, f) = (4, 21), the studentized yardsticks for significance levels of 5% and 1% are calculated with this equation.
Scheffé's Method of Paired Comparison
Modified method by Nakaya
Pairwise comparisons for objects that can be compared without order effect.
1. Ask N human subjects to evaluate t objects in 3, 5 or 7 grades.
2. Assign [-1, +1], [-2, +2] or [-3, +3] to these grades, respectively.
3. Then, start calculation (see other material).
[Figure: questionnaire in which each pair is rated on a five-grade scale from -2 to +2 ("better / slightly better / even / slightly better / better").]
Paired comparisons for t = 3 objects by six human subjects (N = 6).
        | O1 | O2 | O3 | O4 | O5 | O6
A1 - A2 | 2  | 3  | 3  | 2  | 0  | 1
A1 - A3 | 2  | 0  | 0  | 1  | 1  | 0
A2 - A3 | -3 | -2 | -1 | -1 | -3 | -2
Scheffé's Method of Paired Comparison
Modified method by Nakaya
Step 1: Make a paired comparison table for each human subject.
$x_{ijl}$: evaluation value when the l-th human subject compares the i-th object with the j-th object.
Step 2: Make a table summing all subjects' data and calculate the average evaluations for all objects.
$\hat{\alpha}_i = \frac{1}{tN} x_{i\cdot\cdot}$
where t: # of objects (3), N: # of human subjects (6).
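Scheffé's Method of Paired Comparison
Modified method by Nakaya
Step 3: Make an ANOVA table.
$\hat{\alpha}_i = \frac{1}{tN} x_{i\cdot\cdot}$
$S_\alpha = \frac{1}{tN} \sum_i x_{i\cdot\cdot}^2$
$S_{\alpha(B)} = \frac{1}{t} \sum_l \sum_i x_{i\cdot l}^2 - S_\alpha$
$S_\gamma = \frac{1}{N} \sum_i \sum_{j>i} x_{ij\cdot}^2 - S_\alpha$
$S_T = \sum_l \sum_i \sum_{j>i} x_{ijl}^2$
$S_\varepsilon = S_T - S_\alpha - S_{\alpha(B)} - S_\gamma$
F = (unbiased variance) / (unbiased variance of $S_\varepsilon$)
ANOVA table.
There are significant differences among A1 - A3.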
Scheffé's Method of Paired Comparison
Modified method by Nakaya
Step 4: Apply multiple comparisons between all pairs and find which distance is significant.
Q1: Where is the significant difference among A1, A2, and A3?
A1: Apply multiple comparisons between all pairs.
(Fisher's PLSD method, Scheffé method, Bonferroni-Dunn test, Dunnett method, Williams method, Tukey method, Nemenyi test, Tukey-Kramer method, Games/Howell method, Duncan's new multiple range test, Student-Newman-Keuls method, etc. Each has different characteristics.)

Example of a simple multiple comparison.
• Calculate a studentized yardstick.
• When a difference of averages > the studentized yardstick, the distance is significant.

$Y_\phi = q_\phi(t, f) \sqrt{\hat{\sigma}^2 / (tN)}$ (studentized yardstick)
where $(\hat{\sigma}^2, t, N)$ are the unbiased variance of $S_\varepsilon$, the # of objects, and the # of human subjects; $q_\phi(t, f)$ is a studentized range obtained from a statistical test table for t, the degree of freedom of $S_\varepsilon$ (f), and the significance level φ; see these variables in an ANOVA table.
$Y_{0.05} = 4.60 \sqrt{1.79 / (3 \times 6)} = 1.4506$
$Y_{0.01} = 6.97 \sqrt{1.79 / (3 \times 6)} = 2.1980$
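The studentized yardstick can be sketched as a small function. It uses Y = q · sqrt(σ̂² / (tN)) for Nakaya's variation and, as in the Ura slides, Y = q · sqrt(σ̂² / (2tN)) when the order effect is considered. The input values (q = 4.60 and 6.97, σ̂² = 1.79, t = 3, N = 6) are read from the Nakaya example on the slide; the function name and the `order_effect` flag are hypothetical.

```python
import math


def yardstick(q, variance, t, n_subjects, order_effect=False):
    """Studentized yardstick Y = q * sqrt(variance / (c * t * N)).

    c = 2 when both presentation orders are evaluated (Ura's variation),
    c = 1 otherwise (Nakaya's variation).
    """
    divisor = (2.0 if order_effect else 1.0) * t * n_subjects
    return q * math.sqrt(variance / divisor)


# Nakaya example: t = 3 objects, N = 6 subjects, unbiased variance 1.79.
y_05 = yardstick(4.60, 1.79, t=3, n_subjects=6)
y_01 = yardstick(6.97, 1.79, t=3, n_subjects=6)

print(f"Y(0.05) = {y_05:.4f}")
print(f"Y(0.01) = {y_01:.4f}")
```

A pair of objects is judged significantly different when the difference of their average evaluations exceeds the yardstick for the chosen level.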