Page 1
Dario Bruzzese, Domenico Vistocco
Dario Bruzzese, Domenico Vistocco () Compstat 2010 1 / 19
Cutting the dendrogram through permutation tests
Department of Preventive Medical Sciences
UNIVERSITY OF NAPLES, ITALY
Department of Economics
UNIVERSITY OF CASSINO, ITALY
Page 2
La Carte
1 Motivation
2 The stairstep-like permutation procedure
  Notation
  The outline
3 Some results
  Real datasets
  Synthetic dataset
4 ToDo List
Page 4
Motivation
Automatically determine the optimal cut-off level of a dendrogram
Explore partitions different from those allowed by a horizontal cut
The rep1HighNoise dataset
Yeung KY, Medvedovic M, Bumgarner RE: Clustering gene-expression data with repeated measurements. Genome Biology, 2003, 4:R34
n = 200, p = 20
[Figure: dendrogram with a horizontal cut, k = 3]
[Figure: dendrogram with an alternative cut, k = 3]
Page 9
Notation
Let:
$n$ be the number of objects to classify;
$C^k_L$ and $C^k_R$ be the two classes merged at level $k$ ($k = 1, \dots, n-1$);
$h(C^k_L \cup C^k_R)$ be the height necessary to merge $C^k_L$ and $C^k_R$;
$h(C^k_j)$ be the height at which $C^k_j$ has been obtained ($j \in \{L, R\}$).
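The notation above can be mirrored in a small data structure. The following is a minimal sketch, not the authors' code: the `Node` class, the `merge` helper, and the toy heights are all hypothetical and serve only to show where $h(C^k_L)$, $h(C^k_R)$, and $h(C^k_L \cup C^k_R)$ live in a binary dendrogram.

```python
class Node:
    """One class of a binary dendrogram (hypothetical helper, not from the slides)."""

    def __init__(self, members, height=0.0, left=None, right=None):
        self.members = tuple(members)  # objects contained in this class
        self.height = height           # h(.), assumed 0 for a single object
        self.left = left               # C_L (None for a single object)
        self.right = right             # C_R (None for a single object)


def merge(left, right, height):
    """Build the class C_L ∪ C_R obtained by merging two classes at `height`."""
    return Node(left.members + right.members, height, left, right)


# Toy dendrogram on n = 4 objects, hence n - 1 = 3 merges (heights are made up).
a, b, c, d = (Node((name,)) for name in "abcd")
ab = merge(a, b, 1.0)
cd = merge(c, d, 1.5)
root = merge(ab, cd, 4.0)

h_union = root.height        # h(C_L ∪ C_R) for the top-level merge
h_left = root.left.height    # h(C_L): height at which C_L was obtained
h_right = root.right.height  # h(C_R): height at which C_R was obtained
```

Storing the height on each node makes all three quantities of a merge available from the merged node alone.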
[Figure: example dendrogram highlighting, for $k = 1, 2, 3$, the classes $C^k_L$ and $C^k_R$ together with the heights $h(C^k_L \cup C^k_R)$, $h(C^k_L)$, and $h(C^k_R)$.]
Page 26
The algorithm - Pseudo Code
Input: A dataset and its related dendrogram
Output: A partition of the dataset
initialization:
    aggregationLevelsToVisit ← $h(C^1_L \cup C^1_R)$
    permClusters ← [ ]
    i ← 1
repeat
    if $C^i_L \equiv C^i_R$ then
        add $C^i_L \cup C^i_R$ to permClusters
    else
        add $h(C^i_L)$ and $h(C^i_R)$ to aggregationLevelsToVisit
        sort aggregationLevelsToVisit in descending order
    end
    remove the first element from aggregationLevelsToVisit
    i ← i + 1
until aggregationLevelsToVisit is empty
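The pseudocode can be turned into a short runnable sketch. This is an illustration under stated assumptions, not the authors' R implementation: the `Node` class is a hypothetical dendrogram representation, the list holds nodes rather than bare heights, single objects are kept as clusters without testing, and `is_homogeneous` is a placeholder for the permutation test of $H_0$ (here replaced by a made-up height threshold).

```python
class Node:
    """Binary dendrogram node (hypothetical helper, not from the slides)."""

    def __init__(self, members, height=0.0, left=None, right=None):
        self.members = tuple(members)
        self.height = height
        self.left = left
        self.right = right


def stairstep_cut(root, is_homogeneous):
    """Visit aggregation levels from the highest down and collect clusters.

    `is_homogeneous(node)` stands in for the test of H0: node.left ≡ node.right.
    """
    to_visit = [root]    # aggregationLevelsToVisit, stored as nodes
    perm_clusters = []   # permClusters
    while to_visit:
        node = to_visit.pop(0)  # highest remaining aggregation level
        if node.left is None or is_homogeneous(node):
            # H0 not rejected (or a single object): keep the class as a cluster.
            perm_clusters.append(node.members)
        else:
            # H0 rejected: both children must be examined in turn.
            to_visit.extend([node.left, node.right])
            to_visit.sort(key=lambda nd: nd.height, reverse=True)
    return perm_clusters


# Toy dendrogram on five objects (heights are made up for illustration).
a, b, c, d, e = (Node((name,)) for name in "abcde")
ab = Node(a.members + b.members, 1.0, a, b)
cd = Node(c.members + d.members, 1.5, c, d)
cde = Node(cd.members + e.members, 3.0, cd, e)
root = Node(ab.members + cde.members, 5.0, ab, cde)

# Placeholder decision rule: treat merges below height 2.0 as homogeneous.
partition = stairstep_cut(root, lambda nd: nd.height < 2.0)
print(partition)  # [('c', 'd'), ('a', 'b'), ('e',)]
```

Popping the first element before adding the children gives the same visiting order as the pseudocode whenever a parent is at least as high as its children, which holds for monotone dendrograms.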
Page 31
The algorithm - The outline
Initialization, i ← 0
aggregationLevelsToVisit: $h(C^1_L \cup C^1_R)$
permClusters: (empty)

Iteration i ← 1
level visited: $h(C^1_L \cup C^1_R)$
clusters to compare: $C^1_L$, $C^1_R$
$H_0 : C^1_L \equiv C^1_R$ ↦ reject
aggregationLevelsToVisit: $h(C^1_L \cup C^1_R)$, $h(C^1_R)$, $h(C^1_L)$
after removing the first element: $h(C^1_R)$, $h(C^1_L)$
permClusters: (empty)

Iteration i ← 2
level visited: $h(C^1_R)$
clusters to compare: $C^2_L$, $C^2_R$
$H_0 : C^2_L \equiv C^2_R$ ↦ reject
aggregationLevelsToVisit: $h(C^1_R)$, $h(C^1_L)$, $h(C^2_R)$, $h(C^2_L)$
after removing the first element: $h(C^1_L)$, $h(C^2_R)$, $h(C^2_L)$
permClusters: (empty)

Iteration i ← 3
level visited: $h(C^1_L)$
clusters to compare: $C^3_L$, $C^3_R$
$H_0 : C^3_L \equiv C^3_R$ ↦ reject
aggregationLevelsToVisit (after adding, sorting, and removing the first element): $h(C^3_R)$, $h(C^2_R)$, $h(C^2_L)$, $h(C^3_L)$
permClusters: (empty)

Iteration i ← 4
level visited: $h(C^3_R)$
clusters to compare: $C^4_L$, $C^4_R$
$H_0 : C^4_L \equiv C^4_R$ ↦ accept
aggregationLevelsToVisit: $h(C^3_R)$, $h(C^2_R)$, $h(C^2_L)$, $h(C^3_L)$
permClusters: $C^4_L \cup C^4_R \Leftrightarrow C^3_R$

Iteration i ← 9
aggregationLevelsToVisit: (empty)
permClusters: $C^3_L$, $C^3_R$, $C^2_L$, $C^4_L$, $C^4_R$
Page 48
The algorithm - The permutation Test
For each aggregation level $k$, a permutation test is designed to test the null hypothesis that the two groups $C^k_L$ and $C^k_R$ really belong to the same cluster, i.e.:
$H_0 : C^k_L \equiv C^k_R$
Under this null hypothesis, mixing up (permuting) the statistical units of $C^k_L$ and $C^k_R$ should not alter the aggregation process that results in their merging.
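One way to picture the "mixing up" of statistical units is to pool the two groups and deal them out again at random. The sketch below is only an illustration: the function name `permute_units` is hypothetical, and it assumes that the permuted groups keep the sizes of the original ones, which the slides do not state explicitly.

```python
import random


def permute_units(left, right, rng):
    """Randomly reassign the pooled units of two groups.

    Assumption (not stated in the slides): the permuted groups keep
    the sizes of the original left and right groups.
    """
    pooled = list(left) + list(right)
    rng.shuffle(pooled)  # random relabelling of the units
    return pooled[:len(left)], pooled[len(left):]


left = ["u1", "u2", "u3"]
right = ["u4", "u5"]
m_left, m_right = permute_units(left, right, random.Random(0))
```

Passing an explicit `random.Random` instance keeps the permutations reproducible across runs.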
For each $k$, the difference between $\max_{j \in \{L,R\}} h(C^k_j)$ and $\min_{j \in \{L,R\}} h(C^k_j)$ can be considered as the minimum cost necessary to merge the two classes.
The difference between $h(C^k_L \cup C^k_R)$ and $\max_{j \in \{L,R\}} h(C^k_j)$ can instead be considered as the cost actually incurred for merging $C^k_L$ and $C^k_R$.
The ratio between these two differences:
$$\mathrm{cost}(C^k_L \cup C^k_R) = \frac{\max_{j \in \{L,R\}} h(C^k_j) - \min_{j \in \{L,R\}} h(C^k_j)}{h(C^k_L \cup C^k_R) - \max_{j \in \{L,R\}} h(C^k_j)}$$
is thus a measure that characterizes the aggregation process resulting in the new class $C^k_L \cup C^k_R$.
[Figure: dendrogram marking $\min_j h(C^3_j)$, $\max_j h(C^3_j)$, and $h(C^3_L \cup C^3_R)$.]
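As a worked illustration of the cost ratio, the following sketch computes it from three heights. The function name `merge_cost` and the example heights are hypothetical, and raising an error when the denominator is zero is an assumption of this sketch, since the slides do not say how that case is handled.

```python
def merge_cost(h_left, h_right, h_union):
    """Ratio between the minimum cost and the cost actually incurred by a merge.

    h_left, h_right: heights at which the two classes were obtained.
    h_union: height necessary to merge them.
    """
    h_max = max(h_left, h_right)
    h_min = min(h_left, h_right)
    actual = h_union - h_max  # cost actually incurred for the merge
    if actual == 0:
        # Assumption: the zero-denominator case is not covered by the slides.
        raise ValueError("merge height equals the highest child height")
    return (h_max - h_min) / actual


# Made-up heights: children obtained at 1.0 and 1.5, merged at 4.0.
cost = merge_cost(1.0, 1.5, 4.0)  # (1.5 - 1.0) / (4.0 - 1.5) = 0.2
```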
[Figure: $C^3_L$ and $C^3_R$, their permuted counterparts ${}^{m}C^3_L$ and ${}^{m}C^3_R$, and the heights $h({}^{m}C^3_L)$ and $h({}^{m}C^3_R)$.]
The ratio:
$$\mathrm{cost}({}^{m}C^k_L \cup {}^{m}C^k_R) = \frac{\max_{j \in \{L,R\}} h({}^{m}C^k_j) - \min_{j \in \{L,R\}} h({}^{m}C^k_j)}{h(C^k_L \cup C^k_R) - \max_{j \in \{L,R\}} h({}^{m}C^k_j)}$$
is thus a measure that characterizes the aggregation process resulting in the new (potential) class ${}^{m}C^k_L \cup {}^{m}C^k_R$.
Under $H_0$ the aggregation process resulting in the new cluster $C^k_L \cup C^k_R$ should be very similar to the one that potentially produces ${}^{m}C^k_L \cup {}^{m}C^k_R$; thus the two values $\mathrm{cost}({}^{m}C^k_L \cup {}^{m}C^k_R)$ and $\mathrm{cost}(C^k_L \cup C^k_R)$ should be close enough.
The permutation procedure is repeated $M$ times, and each time a new pair ${}^{m}C^k_L$, ${}^{m}C^k_R$ is obtained. The Monte Carlo p-value is thus computed as:
$$p = \frac{\#\{\mathrm{cost}({}^{m}C^k_L \cup {}^{m}C^k_R) \le \mathrm{cost}(C^k_L \cup C^k_R)\} + 1}{M + 1}$$
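The Monte Carlo p-value is a simple count over the $M$ permuted costs. The sketch below assumes the permuted costs have already been computed and are passed in as a list; the function name `monte_carlo_p_value` and the example numbers are hypothetical.

```python
def monte_carlo_p_value(observed_cost, permuted_costs):
    """p = (#{permuted cost <= observed cost} + 1) / (M + 1)."""
    m = len(permuted_costs)  # M, the number of permutations
    count = sum(1 for c in permuted_costs if c <= observed_cost)
    return (count + 1) / (m + 1)


# Made-up costs from M = 4 permutations; two of them do not exceed 0.2.
p = monte_carlo_p_value(0.2, [0.1, 0.3, 0.5, 0.15])
print(p)  # 0.6
```

The added 1 in the numerator and denominator keeps the p-value strictly positive, so its smallest attainable value is $1 / (M + 1)$.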
Page 58
Some results - Real datasets
The yeast galactose dataset
Ideker T, Thorsson V, Ranish JA, Christmas R, Buhler J, Eng JK, Bumgarner RE, Goodlett DR, Aebersold R, Hood L: Integrated genomic and proteomic analyses of a systemically perturbed metabolic network. Science 2001, 292:929-934.
n = 205, p = 80
% of misclassification = 1.5
Page 60
Some results - Real datasets
The diabetes dataset
Banfield JD, Raftery AE: Model-based Gaussian and Non-Gaussian Clustering. Biometrics, 1993, 49, 803-821.
n = 145, p = 3
% of misclassification = 15.2
Page 62
Some results - Synthetic dataset
Qiu W.-L., Joe H. (2009). clusterGeneration: random cluster generation (with specified degree of separation). R package version 1.2.7.
different numbers of clusters (k = 2, 3, 4, 5, 6, 7)
separation index = 0.01
different numbers of variables (p = 5, 10, 15)
100 replications for each combination of k and p
Page 63
Some results - Synthetic dataset (p=5)
Page 64
Some results - Synthetic dataset (p=10)
Page 65
Some results - Synthetic dataset (p=15)
Page 67
ToDo List
Statistical issues
Introducing a penalty term in the permutation test step
Quality measures of the obtained partition
Multiple Testing Problem (???)
Computational issues
profiling and optimizing the R code
use of compiled code
deploying a package