DISSERTATION
Classifier Diversity
in Combined
Pattern Recognition Systems
A Thesis
Presented to the School of Information and Communication Technologies
University of Paisley
In Partial Fulfillment
of the Requirements for the Degree
Doctor of Philosophy
By
Dymitr Ruta MSc, Eng
Applied Computational Intelligence Research Unit
University of Paisley, Scotland
September 2003
- To Ola and Robert, my Mum and Dad -
Abstract
This work presents exploratory investigations of diversity in relation to multiple
classifier systems (MCS). The notion of diversity emerged as an attempt to explain
the sources of considerable performance improvement that can be observed when
classifiers are combined. At this early stage in the development of the young and
promising discipline of classifier fusion, it is unclear whether to select the single
best model or to combine several, and if so, which models to include. With respect to these problems,
the role of diversity as an explanatory and diagnostic tool guiding the optimal design of
a multiple classifier system is addressed and thoroughly examined in three different
contexts:
majority voting performance and its limits;
the relation between diversity measures and combined performance;
and classifier selection guided by various criteria.
In the case of majority voting (MV), the behaviour of combined performance
is investigated and traced back to the specific distributions of classifier outputs,
in an attempt to extract classifier characteristics that could explain the variability
of combined performance. An in-depth parametric analysis of the impact of the
classifier output distribution and of various MCS parameters on combined performance is conducted.
The results provide clear and comprehensive explanations of what makes majority
voting work, facilitated by a number of novel findings related to MV error limits,
extendibility of MCS and optimal patterns of outputs distribution.
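The independent-classifier case discussed above rests on the Bernoulli model developed in Chapter 3, which admits a compact numerical sketch. The following is an illustrative implementation, not the thesis code; the function name and the example values (15 classifiers at a 40% error rate, matching Figures 3.1-3.2) are chosen here purely for demonstration.

```python
from math import comb

def mv_error(m, e):
    """Majority voting error under the Bernoulli model: the probability
    that more than m/2 of m independent classifiers, each with individual
    error rate e, are simultaneously wrong."""
    k_min = m // 2 + 1  # smallest wrong majority for odd m
    return sum(comb(m, k) * e**k * (1 - e)**(m - k)
               for k in range(k_min, m + 1))

# 15 independent classifiers at 40% individual error combine
# to roughly 21% majority voting error
print(round(mv_error(15, 0.4), 3))  # → 0.213
```

The rapid drop from 40% individual error to about 21% combined error is exactly the kind of gain whose sources the diversity analysis in later chapters seeks to explain.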
Given a clear picture of the mechanisms driving performance gain in combined
systems, various models of diversity are evaluated in terms of their ability to ex-
plain the variability of combined performance and/or its improvement over indi-
vidual classifiers. The complex interplay of individual performances and of various
relationships among classifier outputs in their relation to MV performance revealed
a dissonance between traditionally perceived diversity and the performance of
majority voting. The constructive conclusions from that analysis laid the grounds for
the development of a new strategy for constructing diversity measures that are
optimised with respect to the combiner. To that end, two novel diversity measures
have been proposed using systematic and set-based analysis, and their advantages
over existing diversity measures have been demonstrated experimentally. These
promising results, together with the concept of ambiguity adopted from regression
problems, provided the inspiration for extending the strategy of modelling the
improvement of combiner performance, up to using the combined performance
directly in order to satisfy the requirements set for diversity measures. It is
demonstrated and experimentally justified that such a combiner-specific perception
of diversity is more suitable for the diagnostics and design of MCS.
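To make the discussion of diversity measures concrete, a widely used pairwise measure from the existing literature, the Q statistic (one of the baselines the novel measures are compared against, cf. Table 4.4), can be computed from binary "oracle" outputs, where 1 marks a correct decision and 0 an error. This is a generic sketch rather than the thesis's own measures:

```python
def q_statistic(a, b):
    """Yule's Q statistic between two classifiers' oracle outputs
    (sequences of 1 = correct, 0 = error on each sample).
    Q = 1 for identical error behaviour, Q = -1 for errors that
    never coincide."""
    n11 = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    n00 = sum(1 for x, y in zip(a, b) if x == 0 and y == 0)
    n10 = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    n01 = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    denom = n11 * n00 + n01 * n10
    if denom == 0:
        return 0.0  # degenerate case: no co-occurrence information
    return (n11 * n00 - n01 * n10) / denom
```

Two classifiers that err on exactly the same samples give Q = 1 (no diversity in this sense), while classifiers whose errors never coincide give Q = -1 (maximal diversity).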
Classifier selection represents the ultimate test of the usefulness of diversity in
practical applications of multiple classifier systems. Complex though precise
performance-driven classifier selection methods are confronted with simple
diversity-guided selection techniques. Extensive experimental work with a number
of novel search algorithms is carried out, and its results are used to develop an
original multistage organisation system employing both classifier fusion and
selection at many layers of its structure. Finally, a new mechanism for processing
a number of the best classifier combinations at each layer is proposed, and its
positive effect on the generalisation ability of the whole system is demonstrated
over a number of standard datasets.
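The core selection step, choosing a classifier subset by the majority voting error of its combined output, can be sketched as a brute-force search over a binary oracle-output matrix (rows = classifiers, 1 = correct). The matrix below is invented for illustration; the experiments in Chapter 5 operate on separate validation and testing matrices and also use heuristic, greedy and evolutionary search rather than exhaustive search alone.

```python
from itertools import combinations

def majority_vote(rows):
    """Combine oracle-output rows by majority vote, sample by sample."""
    return [1 if sum(col) * 2 > len(col) else 0 for col in zip(*rows)]

def best_subset(outputs, k):
    """Exhaustively find the k-classifier subset with the lowest
    majority voting error on the given output matrix."""
    best_idx, best_err = None, float("inf")
    for idx in combinations(range(len(outputs)), k):
        combined = majority_vote([outputs[i] for i in idx])
        err = combined.count(0) / len(combined)  # 0 marks an error
        if err < best_err:
            best_idx, best_err = idx, err
    return best_idx, best_err

# Toy output matrix: 4 classifiers, 4 samples (1 = correct decision)
outputs = [[1, 1, 0, 1],
           [0, 1, 1, 1],
           [1, 0, 1, 0],
           [1, 1, 1, 0]]
```

Since exhaustive search grows combinatorially with the ensemble size, this sketch also motivates the cheaper search algorithms compared in Chapter 5.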
Declaration
The work contained in this thesis is the result of my own investigations and has
not been accepted nor concurrently submitted in candidature for any other award.
Copyright © 2003 Dymitr Ruta
The copyright of the thesis belongs to the author under the terms of the United
Kingdom Copyright Acts as qualified by the University of Paisley. Due acknowl-
edgements must be made of the use of any material contained in, or derived from,
this thesis. Power of discretion is granted to the depository libraries to allow the
thesis to be copied in whole or in part without further reference to the author. This
permission covers only single copies made for study purposes, subject to normal
conditions of acknowledgement.
Acknowledgments
I am deeply indebted to my supervisor Dr Bogdan Gabrys for his courage in taking
me on as his first PhD student and in attacking the very young and uncertain
discipline of combined pattern recognition systems. His passion for intelligent
systems, combined with the emerging potential of the novel area of information
fusion, encouraged me to join the battle for an alternative route to improving
pattern recognition systems: classifier fusion. His invaluable gift of filtering out
and proposing good ideas, and his efficient brainstorming sessions, were important
factors stimulating the successful accomplishment of this thesis. Full credit goes
also to him for establishing the financial support for the whole project.
The stimulating discussions with my second supervisor Prof. Colin Fyfe, and
also his great generosity in supporting my participation in a number of research
conferences, are gratefully acknowledged.
It is a pleasure to express my gratitude to all the members of our Applied
Computational Intelligence Research Unit, and in particular to Lina Petrakieva, for
their everlasting willingness to debate computational, mathematical and philosophical
issues and for the excellent ambience in which doing research was a real pleasure.
Additionally, the input and interest of the Pattern Recognition Group of the
Delft University of Technology, led by Robert Duin, who developed the Matlab
Pattern Recognition Toolbox (PRTools), were of great help.
I have profited from numerous exchanges of views and e-mails with several
experienced colleagues actively participating in the series of International Workshops
on Multiple Classifier Systems.
Finally, I wish to send hugs and kisses to my wife Aleksandra for several private
reasons, but particularly for her constant engagement with my son Robert, which
was a necessary condition for this dissertation being completed.
Contents
Contents vi
List of Figures ix
List of Tables xv
Abbreviations xvi
1 Introduction 1
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Project description . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Original contributions . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Organisation of the thesis . . . . . . . . . . . . . . . . . . . . . . . . 6
2 Overview of pattern recognition and classifier fusion 7
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Pattern classification . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.1 Classifier design cycle . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.2 Classification error . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3 Information fusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.3.1 Data fusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.3.2 Feature fusion . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3.3 Decision fusion . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3.4 Classifier outputs . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.4 Classifier fusion systems . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.4.1 Combining based on classifiers outputs . . . . . . . . . . . . . 29
2.4.2 Combining based on training style . . . . . . . . . . . . . . . . 34
2.4.3 Coverage vs decision optimisation . . . . . . . . . . . . . . . . 35
2.4.4 Decomposition approaches . . . . . . . . . . . . . . . . . . . . 36
2.4.5 Properties of classifier fusion . . . . . . . . . . . . . . . . . . . 37
3 Combining classifiers by majority voting 46
3.1 Theoretical background . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.2 Combining independent classifiers . . . . . . . . . . . . . . . . . . . . 50
3.2.1 Bernoulli model . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.2.2 Relaxation of the equal performance assumption . . . . . . . . 51
3.2.3 Parametric performance analysis . . . . . . . . . . . . . . . . 53
3.2.4 Beneficial system extendibility . . . . . . . . . . . . . . . . . . 55
3.3 Error limits for dependent classifiers . . . . . . . . . . . . . . . . . . . 60
3.3.1 Patterns of boundary error distribution . . . . . . . . . . . . . 61
3.3.2 Stable boundary error distributions . . . . . . . . . . . . . . . 64
3.3.3 The limits of majority voting error . . . . . . . . . . . . . . . 67
3.4 Multistage organisations . . . . . . . . . . . . . . . . . . . . . . . . . 70
3.4.1 Optimal distribution of outputs for MOMV . . . . . . . . . . 72
3.4.2 Optimal permutation . . . . . . . . . . . . . . . . . . . . . . . 75
3.4.3 Optimal structure . . . . . . . . . . . . . . . . . . . . . . . . . 76
3.4.4 Error limits for MOMV . . . . . . . . . . . . . . . . . . . . . 77
3.5 Performance stability of majority voting - experimental insight . . . . 79
3.6 Concluding remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
4 The notion of diversity 86
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.1.1 Software diversity . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.1.2 Classifier diversity . . . . . . . . . . . . . . . . . . . . . . . . 91
4.1.3 Perception of diversity . . . . . . . . . . . . . . . . . . . . . . 93
4.2 Measuring diversity . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
4.2.1 Pairwise diversity measures . . . . . . . . . . . . . . . . . . . 96
4.2.2 Non-pairwise diversity measures . . . . . . . . . . . . . . . . . 97
4.2.3 Diversity measure properties . . . . . . . . . . . . . . . . . . . 99
4.2.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
4.3 Analysis of error coincidences for majority voting . . . . . . . . . . . 104
4.3.1 Error distributions . . . . . . . . . . . . . . . . . . . . . . . . 108
4.3.2 Set representation of coincident errors . . . . . . . . . . . . . . 112
4.3.3 Relations with majority voting . . . . . . . . . . . . . . . . . . 119
4.3.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
4.3.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
4.4 Combiner specific diversity . . . . . . . . . . . . . . . . . . . . . . . . 128
4.4.1 Usefulness of diversity . . . . . . . . . . . . . . . . . . . . . . 131
4.4.2 Relative error measure . . . . . . . . . . . . . . . . . . . . . . 131
4.4.3 Complexity reduction . . . . . . . . . . . . . . . . . . . . . . . 133
4.4.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
4.4.5 Concluding remarks . . . . . . . . . . . . . . . . . . . . . . . . 137
5 Classifier selection 141
5.1 Selection model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
5.1.1 Static vs dynamic selection . . . . . . . . . . . . . . . . . . . . 144
5.1.2 Representation . . . . . . . . . . . . . . . . . . . . . . . . . . 146
5.1.3 Selection criterion . . . . . . . . . . . . . . . . . . . . . . . . . 146
5.2 Search algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
5.2.1 Heuristic techniques . . . . . . . . . . . . . . . . . . . . . . . 148
5.2.2 Greedy approaches . . . . . . . . . . . . . . . . . . . . . . . . 149
5.2.3 Evolutionary algorithms . . . . . . . . . . . . . . . . . . . . . 150
5.2.4 Experimental investigations . . . . . . . . . . . . . . . . . . . 154
5.3 Multistage selection-fusion model (MSF) . . . . . . . . . . . . . . . . 161
5.3.1 Network of outputs . . . . . . . . . . . . . . . . . . . . . . . . 164
5.3.2 Analysis of generalisation ability . . . . . . . . . . . . . . . . . 165
5.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
6 Conclusions 171
6.1 Justification for the line of research . . . . . . . . . . . . . . . . . . . 171
6.2 Major findings and contributions . . . . . . . . . . . . . . . . . . . . 172
6.3 The role of diversity . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
6.4 Further research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
A Datasets and classifiers used in Experiments 180
A.1 Description of datasets . . . . . . . . . . . . . . . . . . . . . . . . . . 180
A.2 Description of classifiers . . . . . . . . . . . . . . . . . . . . . . . . . 185
B Generation of classification outputs 191
B.1 The training methodology . . . . . . . . . . . . . . . . . . . . . . . . 191
B.2 Testing individual classifiers . . . . . . . . . . . . . . . . . . . . . . . 192
Bibliography 195
List of Figures
2.1 Pattern recognition and classification design cycles . . . . . . . . . . . 11
2.2 Two examples of two dimensional datasets. . . . . . . . . . . . . . . . 14
2.3 Visualisation of the training process for 3 common classifiers. Plots
b,c,d show superposition of discriminative functions within 2-dimensional
feature space. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.4 Operational scope of fusion in combining classifiers . . . . . . . . . . 23
2.5 Classifier outputs. Transferability of one type into another (top).
Different soft measures and their associations (bottom) . . . . . . . . 27
2.6 Training ability of the fusion operator. . . . . . . . . . . . . . . . . . 35
2.7 Different variations of optimisation relations among data (D), classi-
fiers (C) and fusion operator (F). Greyed examples represent optimi-
sation models not yet designed. . . . . . . . . . . . . . . . . . . . . . 43
2.8 Combining architectures. Different models of decision processing
(top). Decision aggregation models - comparison between organi-
sation and network (bottom) . . . . . . . . . . . . . . . . . . . . . . . 45
3.1 Discrete error distribution with normal distribution approximation.
15 independent classifiers have been used with 40% error each. Shaded
bars refer to errors in majority voting sense. The majority voting er-
ror rate corresponds to the sum of all shaded bars. . . . . . . . . . . . 53
3.2 Normalised continuous error distribution for 15 independent classi-
fiers with 40% error each. Shaded area refers to majority voting error
rate. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.3 A family of normalised continuous error distributions for increasing
number of classifiers with the same individual error rates of 40%.
Decreasing shaded area corresponds to reducing majority vote error. . 56
3.4 Variability of the normalised variance and its effect on majority voting
error. The continuous line represents the maximum variance limit
subject to fixing the mean and the number of classifiers. The surfaces
depict random variability of the normalised variance presented as a
function of the normalised mean error rate and with correspondence
to majority voting error. . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.5 The relationship (3.23) between majority voting error and error rates
e2 and e3 of a pair of classifiers added to a single classifier with error
rate e1 = 0.2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.6 Extendibility curves for different errors of a single classifier. Dashed
lines limit the area corresponding to individual errors of joining clas-
sifiers e1, e2 greater than error e1 but producing MV error lower than
e1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.7 Discrete error distributions for Iris, Biomed, and Chromo datasets
classified by 15 different classifiers (see Appendix A for details of
datasets and classifiers). Shaded bars correspond to errors in majority
voting sense. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.8 Visualisation of the distribution of success (DS) and failure (DF) for
Iris, Biomed, and Chromo datasets classified by 15 different classifiers
(see Appendix A for details of datasets and classifiers). Shaded bars
correspond to errors in majority voting sense. . . . . . . . . . . . . . 64
3.9 Visualisation of stable distributions of success and failure for Iris,
Biomed, and Chromo datasets classified by 15 different classifiers (see
Appendix A for details of datasets and classifiers). Shaded bars cor-
respond to errors in majority voting sense. . . . . . . . . . . . . . . . 67
3.10 Majority voting error limits presented as a function of the number
of classifiers (M = 3 : 99) and mean classifier error rate. Dotted
lines in the 2-D projection (b) represent independent MV error and
correspond to the internal surface in 3-D plot (a). . . . . . . . . . . . 69
3.11 Multistage organisation with 15 classifiers and structure S15 = (5, 3).
The outputs from the classifiers are permutated and passed to layer 1.
At each layer majority voting is applied to each group and the outputs
are passed on to the next layer until the final output is obtained. . . . 71
3.12 Multistage organisation with 27 classifiers and structure S27 = (3, 3, 3).
The first four rows illustrate examples of optimal permutations of out-
puts for given structure. Note that as little as 8 out 27 1s at the first
layer can propagate the correct decision up to the final layer. . . . . . 72
3.13 Majority voting error limits for MOMV presented as a function of
the number of classifiers (M = 3 : 2187) and mean classifier error
rate. Dotted lines on the 2-D projection (b) represent independent
MV error and correspond to the internal surface in 3-D plot (a). . . . 78
3.14 Majority vote errors observed for different boundary error distribu-
tions expressed as a function of mutation rate and mean classifier
error. Plots (a)-(f) correspond to DS, DF, SDS, SDF, DSMOMV ,
DFMOMV respectively. . . . . . . . . . . . . . . . . . . . . . . . . . . 81
3.15 Differences among majority voting errors for different boundary error
distributions expressed as a function of mutation rate and mean classifier
error. Plots (a),(b) correspond to DS-SDS, DF-SDF; plots (c),(d) show
the differences between DS-DSMOMV, DF-DFMOMV; and plots (e),(f)
show the differences between SDS-DSMOMV, SDF-DFMOMV. . . . . 82
4.1 Venn diagrams visualising the concept of diversity among classifiers.
Classifiers - the grey thin-lined circles - are trying to estimate the
true target classification function T - empty thick-lined circle. . . . . 92
4.2 Diagrams depicting relationship between diversity measures and (a)
MVE, (b) MVI. The position of each cell determines the correspond-
ing diversity measure (columns) and the dataset (rows) for which the
analysis was carried out. The points in each cell depict a depen-
dence between diversity measure and MVE (a), MVI (b) obtained
for all combinations of 3 out of 15 classifiers. Details of datasets and
classifiers are provided in Appendix A. . . . . . . . . . . . . . . . . . 105
4.3 Diagrams presenting correlation coefficients between diversity mea-
sures and (a) MVE, (b) MVI. Fields in a grid correspond to various
measures and datasets as in Figure 4.2. The darker the field, the higher
the corresponding correlation coefficient. The bars underneath the diagrams
depict the correlation coefficients averaged along all datasets. Details
of datasets and classifiers are provided in Appendix A. . . . . . . . . 106
4.4 Averaged evolution of the correlation coefficients between diversity
measures and (a) MVE, (b) MVI. The graphs show the average corre-
lation coefficient measured for all combinations of 3,5,...,13 classifiers
from the ensemble of 15 classifiers. Details of datasets and classifiers
are provided in Appendix A. . . . . . . . . . . . . . . . . . . . . . . . 107
4.5 Discrete error distributions presented for the ensembles of 15 clas-
sifiers on 27 datasets. The shapes of error distributions (bars with
continuous line joining their tops) are compared with the equivalent
distributions for independent classifiers (continuous lines). Details of
datasets and classifiers are provided in Appendix A. . . . . . . . . . . 109
4.6 Error distribution (thick line) decomposed into 15 partial error distri-
butions (thin lines) corresponding to 15 classifiers applied to Chromo
dataset. Details of datasets and classifiers are provided in Appendix A. . . 111
4.7 Relationship between fault majority measure (FM) and the majority
voting error obtained for the combinations of 3 out of 15 classifiers
over 4 typical datasets. For comparison the same plots have been
obtained for MVE, F2 and ME measures analysed in Section 4.2.4.
Correlation coefficient c is included for each graph. . . . . . . . . . . 113
4.8 Visualisation of a set representation of coincident errors. (A) Binary
outputs from 3 classifiers (0-correct, 1-error). (B),(C) Venn Diagrams
showing all mutually exclusive subsets. (D) Venn Diagram with the
indices of samples put in the appropriate subsets positions. . . . . . . 114
4.9 Venn Diagrams for more than 3 classifiers. (A) 5 congruent ellipses.
(B) 6 triangles. (C) 7 symmetrical sets - Grunbaum construction.
(D) bipartite plot of 8 sets - Edwards construction. See [124] for
further details. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
4.10 Two types of error coincidences for the classifiers D1 and D3 of the
ensemble {D1, D2, D3}. (A) An example of error indices distribution.
(B) General coincidences CG({D1, D3}) = {3, 5, 6}. (C) Exclusive
coincidences CE({D1, D3}) = {6}. . . . . . . . . . . . . . . . . . . . . 117
4.11 Collection generation. A: Algorithm. B: Visualisation of the collec-
tion generation process. . . . . . . . . . . . . . . . . . . . . . . . . . . 119
4.12 Graphs associated with Venn Diagrams. A: An ordered graph of
exclusive coincidences for 3 classifier sets. B: Unordered graph for
Edwards construction of 5 sets. To order the graph, all vertices have
to be directed towards lower order coincidence. . . . . . . . . . . . . . 119
4.13 Evolution of correlation coefficients along different levels of GC. Cor-
relation coefficient were measured between MVE and GC grouped in
series of 3,5,7,9 out of 11 classifiers for 8 considered datasets. . . . . . 124
4.14 Evolution of correlation coefficients along different levels of EC. Cor-
relation coefficients were measured between MVE and EC levels grouped
in series of 3, 5, 7, 9 out of 11 classifiers for 2 representative datasets
showing typical patterns of the relationship observed. . . . . . . . . . 125
4.15 Evolution of correlation coefficients between MVE and the type 1 sum
(from the 1st to the kth GC level) presented as a function of the number
of GC levels taken into the sum (shown in bold lines). For compari-
son, correlation curves of the individual GC levels are also shown in
thin lines. Plots are presented for 4 datasets corresponding to most
representative patterns of the relationship observed. . . . . . . . . . . 125
4.16 Evolution of correlation coefficients between MVE and type 2 and
3 sums of coincidence levels shown as a function of the number of
levels taken into the sum. A: type 2 sum (from the kth to the Mth level) of
EC levels, (shown in bold lines). For comparison, correlation curves
of the individual EC levels are also shown in thin lines. B: type 3
sum of GC levels (bold lines) with correlation curves of the individual
GC levels shown in thin lines. Details of datasets and classifiers are
provided in Appendix A. . . . . . . . . . . . . . . . . . . . . . . . . . 126
4.17 Illustration of the importance of correlation coefficients for classifier
selection in the example of the relation between majority voting error
and general coincidence levels of 3 out of 11 classifiers applied to the
Liver dataset. (A) Relation of the first general coincidence levels. (B)
Relation of the second general coincidence levels. (C) Relation of the
sum of the first and second general coincidence level with majority
voting error. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
4.18 Graphical interpretation of the RE in two versions: with E0 as the
independent majority voting error 4.18(a), and with E0 denoting the
mean classifier error 4.18(b). . . . . . . . . . . . . . . . . . . . . . . . 133
4.19 Linear regression of the normalised higher levels of general coinci-
dence calculated as a result of the 11 classifier system applied to
some typical real-world datasets. (a) The LGi values for increasing
levels in logarithmic scale. (b) Lines matched in the logarithmic scale
to the higher levels (6:11) of general coincidence. . . . . . . . . . . . . 136
4.20 Visualisation of correlations between the improvement of the major-
ity voting error and the measures from Table 4.4. Coordinates of all
points represent the measures examined for all 3-element combina-
tions out of 11 classifiers for which the measures were applied. . . . . 138
4.21 The diversity separation experiment. Majority voting error limits
diagrams with the points corresponding to the classification results
for increasingly trained teams of 5 classifiers. Suspected constant
diversity of the data matches the lines representing the same values
of the RE measure with the independent majority voting error as the
zero point (E0) 4.21(a), in contrast to the second version of the RE
measure with E0 denoting the mean classifier error 4.21(b). . . . . . . 139
5.1 Visualisation of the majority voting errors presented in Table 5.3.
The lighter the field the lower the majority voting error. Details of
datasets and classifiers are provided in Appendix A. . . . . . . . . . . 158
5.2 Comparison of the errors from 50 best combinations of classifiers
found by four population-based searching methods: ES, SS, GS, PS. . 164
5.3 Evolution of the MVE for the MSF model with a network of 5 layers
and 15 nodes at each layer. The thick line shows the MVE values
for the best combinations found by different search algorithms at
each layer (1-5) of the MSF model. For comparison purposes this
line starts from the error of the single best classifier (layer 0), the
level of which is also marked by the dotted line. The thin line shows
the analogous evolution of the mean MVE from all the combinations
selected at each layer. Details of datasets and classifiers are provided
in Appendix A. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
5.4 The network (5 × 15) resulting from the application of the MSF model
with M = 15 classifiers, majority voting and exhaustive search on the
phoneme dataset. Layer 0 represents individual classifiers and their
individual errors are marked underneath. The best combination at
each layer is marked by an enlarged black circle. The validation
and testing errors of the best combination at each layer are marked
below the layer labels. Details of datasets and classifiers
are provided in Appendix A. . . . . . . . . . . . . . . . . . . . . . . . 169
List of Tables
4.1 Summary of the measures applied in the experiments. . . . . . . . . . 103
4.2 Comparison of the time needed to extract cardinalities of all general
coincidences from a binary matrix of outputs and a collection for
different numbers of classifiers. . . . . . . . . . . . . . . . . . . . . . . 121
4.3 Comparison between the real and approximated values of the major-
ity voting error for all datasets and applying all 11 classifiers. The
error rates are shown in percentages. . . . . . . . . . . . . . . . . . . 135
4.4 Correlations between the improvement of the majority voting error
over the mean classifier error (MVE-ME) and both versions of the
RE measure compared against Q statistics and double fault mea-
sures. The correlation coefficients were measured separately for the
combinations of 3, 5, 7, and 9 out of 11 classifiers within each dataset. 137
5.1 Individual best classifier errors for 27 available datasets. The first 3
columns correspond to majority voting errors obtained for SB applied
to validation matrix, testing matrix and validation matrix but tested
on the testing matrix. The following two columns show the index
of the best classifier evaluated separately in BV and BT matrices.
Details of datasets and classifiers are provided in Appendix A. . . . . 155
5.2 Summary of searching methods, selection criteria and datasets used
in experiments. Description of datasets and classifiers is provided in
Appendix A. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
5.3 Majority voting errors obtained for best combinations of classifiers
selected by various searching methods (columns) and selection crite-
ria (rows). The results are averaged over 27 datasets. The bottom
row and right-most column show the averaged values of MVE for
the searching methods and selection criteria respectively. Details of
datasets and classifiers are provided in Appendix A. . . . . . . . . . . 157
5.4 Best combination of classifiers found by the exhaustive search from
the ensemble of 15 classifiers. Columns 2-4 present the MVE val-
ues for the best combination found in the validation matrix, testing
matrix and validation best tested on the testing matrix, respectively.
Columns 4 and 5 show indices of the classifiers forming the best val-
idation and testing combinations. Details of datasets and classifiers
are provided in Appendix A. . . . . . . . . . . . . . . . . . . . . . . . 160
5.5 Validation errors (obtained from the validation matrices) of the ma-
jority voting combiner obtained for the best combinations and mean
from 50 best (if possible) combinations of classifiers found by 8 dif-
ferent search algorithms for 27 datasets. Details of datasets and clas-
sifiers are provided in Appendix A. . . . . . . . . . . . . . . . . . . . 162
5.6 Generalisation errors (evaluated on the testing matrices) of the ma-
jority voting combiner obtained for the best combinations and mean
from 50 best (if possible) combinations of classifiers found by 8 dif-
ferent search algorithms for 27 datasets. Details of datasets and clas-
sifiers are provided in Appendix A. . . . . . . . . . . . . . . . . . . . 163
5.7 Generalisation errors (evaluated on the testing matrices) of the ma-
jority voting combiner obtained for the best combinations from the
5-layer selection-fusion model. The columns show the minimum er-
rors obtained and the layer indices at which the minimum errors were
observed. Details of datasets and classifiers are provided in Appendix
A. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
A.1 A list of datasets used in the experiments. . . . . . . . . . . . . . . . 186
A.2 A list of classifiers used in the experiments. . . . . . . . . . . . . . . . 190
B.1 Optimal classifier parameters found exhaustively for each dataset.
The remaining classifiers (loglc, nmc, pfsvc, knnc, parzenc) have in-
ternal optimisation or work well with default parameters. . . . . . . . 193
B.2 Individual classifier errors obtained during classification of 27 datasets. . 194
Abbreviations
ANN Artificial Neural Network
BKS Behaviour Knowledge Space
BS Backward Search
CC Computational Complexity
CFD Coincident Failure Diversity
DCS Dynamic Classifier Selection
DED Discrete Error Distribution
DF Boundary Distribution of Failure
DFD Distinct Failure Diversity
DI Difficulty Measure
DS Boundary Distribution of Success
ECOC Error Correcting Output Coding
EL Eckhardt and Lee
FM Fault Majority
FS Forward Search
GA Genetic Algorithm
GD Generalised Diversity
IA Interrater Agreement Measure
KW Kohavi Wolpert
LM Littlewood and Miller
MCS Multiple Classifier System
ME Mean Error
MMI Maximum Mutual Information
MOMV Multistage Organisation with Majority Voting
MV Majority Voting
MVE Majority Voting Error
MVI Majority Voting Performance Improvement
NCED Normalised Continuous Error Distribution
NDM Non-Pairwise Diversity Measure
OWA Ordered Weighted Average
PBIL Population Based Incremental Learning
PCA Principal Component Analysis
PDED Partial Discrete Error Distribution
PDM Pairwise Diversity Measure
PK Partridge and Krzanowski
RSD Random Scatter Diversity
SB Single Best
SCS Static Classifier Selection
SD Specialisation Diversity
SDF Stable Distribution of Failure
SDS Stable Distribution of Success
SS Stochastic Hill-Climbing Search
TS Tabu Search
Chapter 1
Introduction
Endowed with a number of diverse senses, humans effortlessly tackle astoundingly
complex processes that underlie the act of pattern recognition. The astonishing ease with which we can recognise faces, understand spoken words, eliminate rotten eggs by smell, select the right coin from a pocket by touch or distinguish beer from champagne by taste is apparently in conflict with the overwhelming complexity of computer-based pattern recognition systems. The explanation of this superior
performance seems to be related to highly specialised and complementary sensing
models that work simultaneously and are combined by a decision mechanism in
the human brain. Recent advances in combining pattern recognition systems seem
to support this conjecture although it is still not clear what exactly drives the
improvement in their performance. Is it complementarity among individual diverse
classification models, or are there some specific strengths of a particular combiner
that cause compensation for individual errors observed in classifier fusion systems?
The unresolved interplay between diversity and individual classifier performances, and their joint impact on combined performance, remains another challenge. Multi-faceted diversity is believed to be the key to explaining performance variability in combining classifiers. However, due to the multitude of perceptions, interpretations and hence measuring methodologies, diversity still has no clear link with combined performance and is therefore not used in applications. These and many other related problems prevent a full explanation of the mechanisms governing classifier fusion and hence limit our ability to predict and control the behaviour of the combined performance, so much appreciated in commercial applications.
One of the research project's goals is to establish the relationship between the performance of the combined system and various properties of the multiple classifier system (MCS). Diversity, identified as a promising descriptive tool, is thoroughly investigated and the role it plays in classifier fusion examined in an attempt to
provide diagnostic tools invaluable during the complex process of designing an MCS. All these questions, doubts and challenges are addressed in this thesis within a general framework of diversity analysis for combined pattern recognition.
1.1 Background
Research efforts dedicated to supervised pattern recognition, invariably focussed on further improvement of the recognition rate, have recently been undergoing a significant change. The traditional, continuous development of ever more sophisticated classification models turns out to provide benefits only in specific
problem domains where some prior background knowledge or new evidence can be
exploited to further improve classification performance. In general, however, related research shows that no individual method can be expected to deal well with all kinds of classification tasks [148], [28], [7], [137]. The realisation of the inevitable imperfections of individual classifiers catalysed the emergence of a new design strategy that assumes combining different classifiers as the main source of performance improvement [137], [7], [158]. Classifier fusion methodology has recently exploded into a wide variety of models, some of which have been shown to be very successful [148], [28], [7], [137], [80], [81], [15], [165], [60], [158], [71], [53], [55], [65], [58], [27].
Although spectacular improvement of the recognition rate in combined pattern
classification systems has been demonstrated on a number of problem domains, the
explanation of that phenomenon remains vague and very general. On the one hand, the process of classifier fusion is explicit and definable. On the other, the complexity of individual classification models limits the interpretability of the combined performance behaviour in terms of the various individual and relational characteristics exhibited among classifiers. Transparency of pattern recognition systems becomes a crucial property in commercial or industrial applications, where, due to security or revenue maximisation, the risk associated with employing a highly complex composite classification system is high and has to be minimised. To this end, various attempts at controlling or diagnosing the behaviour of combined performance have shown only partially positive and still confusing results [138], [164], [88], [122]. A reflection of that fact can be found in safety-critical pattern recognition systems, where simple yet well explained and easily controllable techniques, commonly based on a 'try all and choose the best' strategy, are preferred [139].
Research efforts towards explanations in combined classification systems focus on two approaches. One way is to analyse the specific combining method and use its characteristics, backpropagated into relations among classifiers, to model or directly measure combining performance or its improvement [131], [128], [75], [164].
The other method assumes the existence of underlying diversity among classifiers,
which together with the individual classifier performances determine in some im-
plicit way the combined performance. In this interpretation the notion of diver-
sity embodies the concepts of team strength or complementarity among classifiers
and is believed to have a key impact on combining performance [126], [89], [140].
There are, though, a number of uncertainties associated with diversity at both the conceptual and practical levels. First, it is not clear whether diversity as a concept is independent of individual performances and the combining method used. These doubts directly translate into problems of measuring diversity in a consistent manner, independent of a number of variable parameters of the multiple classifier system [128], [126], [87], [140]. Another aspect, which complicates the issue even more, is the doubt whether diversity should be considered together with the combiner and its properties, or whether it should consistently represent a fixed concept, ignoring any bonds with the fusion system. In other words, it is not clear whether diversity should
be perceived universally as an independent concept or if it should be biased by the
specific features of the particular combiner. The latter option would be particularly
justified by the diagnostic and control requirements so that diversity, being tuned
to the combiner, could be applied during the design process. Both models of diver-
sity pursuing explanations of the performance behaviour in combined classification
systems form the main theme investigated in this thesis. Extensive experimental
work attempts to justify the practical applicability of diversity during the process
of composite classifier design and accordingly verify the usefulness of the diversity
concept for combining classifiers.
1.2 Project description
The overall goal of the project is to explore the multi-modal concept of classifier diversity, broken down into various interdependencies among individual models in classifier ensembles, and to investigate its explanatory strength in the context of performance variability of the combined system. Although the notion of diversity is
approached on many distinct platforms including perception, representation and
measuring, particular emphasis is put on the potential applicability of diversity
analysis in the process of designing multiple classifier systems. The research intends
to exploit diversity as a diagnostic tool capable of guiding or at least indicating
which classifier ensembles are most likely to show good combined results as opposed
to those classifiers which if combined do not show any improvement or even lead to
deterioration of the performance compared with the individually best model.
The initial investigations revealed a number of strategies for tackling diversity in relation to combining classifiers. However, due to the large size and complexity of the problem, the scope of the project is technically narrowed down to the phenomena observed and investigated only for the majority voting (MV) combiner operating on an ensemble of different classification models. Within this setup the notion of classifier diversity is targeted in three different contexts:

- Exploratory investigations of the behaviour of majority voting performance and its limits - looking at the mechanisms responsible for performance improvement in multiple classifier systems.

- Analysis of the relation between combined performance behaviour and various models of diversity - trying to identify the bonds between the two and investigate the possibilities of their enhancement.

- Diversity in classifier selection - an experimental study attempting to apply diversity measures as effective selection criteria capable of extracting optimal ensembles of classifiers.
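As a concrete point of reference for the combiner studied throughout, majority voting over crisp classifier outputs can be sketched in a few lines. The vote labels below are invented for illustration and the tie-breaking convention is one of several possibilities, not the one analysed in the thesis:

```python
from collections import Counter

def majority_vote(labels):
    """Return the label supported by the largest number of classifiers.

    Ties are broken by first occurrence (a simplifying assumption of
    this sketch, not a claim about the thesis's MV analysis).
    """
    return Counter(labels).most_common(1)[0][0]

# Three hypothetical classifiers voting on a single pattern:
print(majority_vote(["cat", "dog", "cat"]))  # cat
```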
These three issues consistently build up into a comprehensive evaluation of the role diversity plays in combined pattern recognition systems and directly assess the usefulness of diversity analysis in designing multiple classifier systems.
1.3 Original contributions
This section provides a brief summary of the major original findings arising from the study. It serves both to provide a clearer presentation throughout later chapters and an explicit specification of the thesis's contributions to the field. The study has been summarised in a number of peer-reviewed publications [125], [130], [126], [131], [128], [129], [132], [127], encompassing both the theoretical and experimental material realising the project goals. The contributions concern three problem domains, following the investigative strategy of the project as outlined in the previous section, and are summarised in the following list:
- Proposition of a new systematic ordering and terminology describing in a uniform manner a wide family of classifier fusion systems, Section 2.4, [125].

- Introduction of the error-distribution-based analysis of majority voting performance behaviour for a large number of differently performing classifiers, Sections 3.2.2 and 3.2.3, [130].

- A new, simple form of the ensemble extendibility condition for independent classifiers, Section 3.2.4, [130].

- Parametric analysis and extensive visualisation of majority voting error limits, Sections 3.3 and 3.3.3, [130].

- Definition of new patterns of boundary distributions of classifier outputs, defined for the full range of mean classifier error, [0,1], and proposition of their stable alternatives justified by analysis of classifier margins, Section 3.3, [130].

- Definition of a multistage organisation with majority voting and presentation of the effect of error-limit widening, along with the conditions necessary for its occurrence, Section 3.4, [130].

- Extensive analysis of the correlation between majority voting error and various binary operating diversity measures, Section 4.2, [126].

- Definition of the asymmetry property of diversity measures and a demonstration of its importance in the correlation analysis, Section 4.2, [126].

- Definition of the Fault Majority measure as an example of a measure optimised for the combiner, Section 4.3.1, [126].

- Presentation of the set-based analysis of error coincidences and its use for rapid extraction of error coincidences among classifiers, definition of new measures of diversity and a decomposition of majority voting error, Section 4.3, [131].

- Definition of a robust Relative Error measure, promoting a combiner-specific approach to diversity measures, justified experimentally, Section 4.4, [128].

- Development of a new methodology for pattern classification based on the concept of information fields, inspired by physical potential fields, [129], [132].

- Definition of gravity and electrostatic models of classification and a demonstration of their good performance in terms of both recognition rate and diversity, [132].

- Development and evaluation of a number of search algorithms applied to classifier selection with various selection criteria, Section 5.2, [127].

- Evaluation of diversity measures as classifier selection criteria, Section 5.2.

- Proposition of network-based processing of the population of combinations of classifier outputs, Section 5.3.

- Development of a multilayer selection-fusion model, analysis of its structural optimality and extensive evaluation showing improvement of the generalisation performance, Section 5.3.
1.4 Organisation of the thesis
Chapter 2 outlines the context and theoretical background for this work. It provides a general overview of pattern recognition methodology and, on the grounds of advances in information fusion, illustrates the state of the art in multiple classifier systems.
The material presented in the next three chapters covers the original contributions
summarised in the previous section.
Chapter 3 attempts to uncover various mechanisms driving performance im-
provement in majority voting. Parametric analysis of individual error coincidences
is formalised and used to explain several aspects of the behaviour of MV error and its
limits. In the second part majority voting is presented in a multistage organisation
setup and its interesting effects on the combined performance are discussed.
The next chapter summarises various models and perceptions of diversity and addresses the problem of its representation and measurement. The relation between diversity among classifiers and the performance of majority voting is investigated experimentally, and the results are compared with exhaustively extracted optimal ensembles of classifiers. The conclusions drawn from these experiments are directly exploited in promoting a new form of diversity, conceptually biased by the definition of the combiner's performance. Supported by a comprehensive analysis of the error coincidences, this combiner-specific diversity is presented and embodied in a series of novel measures, ultimately leading to convergence between the concept of diversity and combined performance.
Chapter 5 focuses on the application side of diversity measures, presenting extensive experimental results of classifier selection guided by various measures of diversity and performance. Among the many different selection algorithms and criteria, the best setup is analysed and expanded into a multilayer network preventing selection overfitting and improving the generalisation properties of the system.
The concluding chapter summarises the main findings of the project and indicates
directions for further research.
Chapter 2
Overview of pattern recognition
and classifier fusion
2.1 Introduction
In the early developments of automated pattern recognition systems, inspiration was always found in the biological world, where we humans exhibit a remarkable blend of recognition skills. Humans seem to be more efficient at solving many complex, especially vaguely specified, classification tasks owing to their natural ability to cope with uncertain or ambiguous data coming in a variety of forms from different sources. In some more specific applications, like fingerprint recognition [118] or DNA sequence identification [101], automated pattern recognition systems have vastly outperformed humans, mainly due to the enormous size of the data and the interdependency
between the factors to be analysed and processed. It seems, then, that a successful pattern recognition system has to exhibit both the efficiency of a biological cognitive system and the processing power of modern computing systems. Indeed, in cases like vision or speech recognition, understanding biological cognitive mechanisms and implementing them on fast computer systems would open up enormous capabilities. However, there are also pattern recognition problems, like DNA identification [101], gas detection [62] or infra-red target tracking [7], which not only remain far beyond our cognitive and processing capabilities but also require specific mathematical models and sophisticated hardware sensing of a type unreachable for humans. In general, there is no single strategy or recipe for successful pattern recognition systems. Instead there is a rich variety of individual, problem-dependent methods dealing well with very specific problems but failing to generalise to other tasks.
In parallel to the efforts at improving individual pattern recognition models, a
completely new trend emerged recently, attracting a lot of scientific attention. Following the advances made in electronics and computer science, pattern recognition has been undergoing rapid improvement, encouraged by gradually relaxing complexity constraints. The pioneering efforts of Dasarathy [22], but also those of many other works reviewed in [22], initiated an entirely new branch of pattern recognition: classifier fusion. The inspiration can be traced back as far as ancient Greece, whose citizens were the first to reach decisions collectively in order to improve their quality and minimise the risk of individual failures [116]. Omnipresent in current societies, group decision making indeed proves to secure well balanced decisions crucial for the stability and prosperity of today's democracies [50], [134], [9]. In a similar
fashion, it has been noticed that applying multiple classification models to the same task and combining their results can lead to spectacular performance improvements compared with the individual best model [158], [22], [121], [58]. It turned out that fusion may in fact be successful not only when applied to classifier decisions but at other stages of the classification cycle, starting from data fusion [49], [7], [54], [32], [36], through feature (processed data) fusion [7], [68], [35], [33], up to the aforementioned
classifier fusion [22], [137], [7]. Section 2.3 discusses in detail various issues related
to information fusion.
These findings triggered the development of very complex systems in which mixtures of fusion, combining and selection of partial evidence, applied to input data, features or classifier outputs, cover uncountable variations and structures of potential pattern recognition systems. It is therefore not surprising that, due to the potentially large variety of combined pattern recognition designs, there is still no consistent and commonly agreed taxonomy naming and categorising the different combining techniques. Some recent attempts at a very general classification of fusion methods into coverage and decision optimisation techniques [57] assume that either the classifiers or the combiner is to be optimised while the other remains fixed. However,
the state of the art in classifier fusion seems to be much wider and more complex
with the multiplicity of classifiers used in many different ways beyond these two types of combining. One example could be a modular decomposition system where the single best classifier, or a number of the best classifiers, is applied to different classification subtasks controlled by the classifier selection process [137], [122].
Moreover, combining classifiers involves a number of other aspects, including architectures for combining and the training abilities of the fusion operator, and may also relate to fusion at different levels of abstraction within the classification cycle [7]. On top of that, all the different styles, paradigms and properties of combining may appear at the same time during the design process. For example, there is nothing wrong with
coverage and decision optimisation methods being combined together. Facing this profusion of varieties, rather than contributing to the overall non-specificity in the field, we present classifier fusion as a scheme uniformly described by three distinct properties and show in Section 2.4.5 that this noncompetitive approach covers all the different models and designs of combining.
The high complexity, and hence the computational power demands, of classifier fusion systems is one of the reasons they are not yet widely applied. Among other reasons, the major problem is the lack of interpretability of complex systems. Unfortunately, these drawbacks usually eliminate such fusion systems from industrial applications, where suddenly emerging problems require a quick explanation and fix, while system performance should be predictable and stable. Although complexity can increasingly be dealt with and there is a prospect of a stability gain, there is very little one can usually do with systems that occasionally do not work, or work beyond one's control. The major issue addressed in this thesis, diversity among classifiers, is believed to provide theoretical and practical answers, accounting for the diagnostic and explanative capabilities of diversity in the context of classifier fusion.
The term diversity related to combining evidence originated in the software engineering domain [29], [73], [99], [112], where the reliability of conventionally coded programs was improved by combining independently written versions of the same algorithm. Appearing under many names in the literature, diversity is believed to be a major source of performance improvement in combined pattern recognition [110], [138], [111], [131], [87]. A large variety of representations, models and data types constitute some of the many faces of diversity related to classifiers. In this thesis the emphasis is put on the practical aspects of diversity: the ways it can be measured and understood, and eventually whether it can explain, and possibly diagnose, why and when combining classifiers could be an effective alternative to individual classification models. Detailed conceptual and experimental investigations related to diversity are undertaken in Chapter 4 and partially in Chapter 5.
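To make the measurement question concrete, one widely used pairwise measure from the diversity literature, the disagreement measure, can be computed from the binary correctness records of two classifiers (1 = correct, 0 = incorrect). The vectors below are illustrative only, not taken from the thesis experiments:

```python
def disagreement(a, b):
    """Pairwise disagreement: the fraction of patterns on which exactly
    one of the two classifiers is correct (a, b are 0/1 correctness lists).
    Higher values indicate more diverse error behaviour."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b)) / len(a)

# Hypothetical correctness records of two classifiers over five patterns:
c1 = [1, 1, 0, 1, 0]
c2 = [1, 0, 1, 1, 0]
print(disagreement(c1, c2))  # 0.4
```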
2.2 Pattern classification
Pattern recognition is a scientific discipline one of whose goals is to classify objects into a number of categories called classes. Objects represent compact data units specific to a particular problem, like images, spoken words or handwritten characters, and are in general referred to as patterns. The process of pattern recognition normally entails a sequence of well separated operations [28]. It begins with collecting
the evidence acquired from various sensing devices. In the ideal situation the data is
low-dimensional, independent and discriminative so that its values are very similar
for patterns in the same class but very different for patterns from different classes.
Raw data rarely satisfy these conditions, and therefore a set of procedures called feature generation, extraction and selection is required to provide relevant input for the classification system. Data sensing and feature extraction are beyond the scope of this thesis. It is noted, however, that the product of these two components of the pattern recognition design is a set of feature vectors representing the input data for classification systems.
Given the feature vectors x ∈ X provided by a feature extractor, the objective of the supervised classification method, the classifier, is to assign the new object x to a relevant class ω_j ∈ Ω, where Ω = {ω_1, ..., ω_C}, based on previous observations of labelled patterns, the training data X_T = {(x, ω)}. The overall classification process can be broken down into four major components: model choice, data preprocessing, training, and testing or evaluation. Evaluation closes the classification part of the pattern recognition design, which then enters the post-processing and overall system evaluation stage. There is great flexibility of operation in this last phase of pattern recognition design. It may just involve risk or reliability analysis, system tuning aimed at minimising cost, or further context-based optimisation. There is also space for combining classifiers or, in general, for processing the outputs from many classifiers returned by the classification process. The diagram of the pattern recognition design and the subset involving the classification cycle is shown in Figure 2.1.
The major issue treated in this thesis, diversity among classifiers, narrows the operational scope down to just the two last components of the pattern recognition design: classification and post-processing. Classification, broken down into the design cycle, is presented in the following section, with particular emphasis put on the limitations of the individual model implementation. This is followed by a formal definition of classification error, pointing out its sources and indicating methods for its elimination, leading to the development of the combined system presented in Section 2.4.
2.2.1 Classifier design cycle
In the supervised pattern recognition task considered in this thesis, the classifier's goal is to assign the unlabelled object x to a class label based on the evidence learned from the labelled training set X_T = {(x_i, ω_j)}.

Figure 2.1: Pattern recognition and classification design cycles

Mathematically, classifiers simply represent a discriminative function trying to separate classes from each other in the multidimensional input space. In the general case such a function provides class support vectors w = [w_1, ..., w_C], which, depending on the classification model, may represent probabilities, fuzzy membership values or any other measures that can be understood, compared and handled in the post-processing phase. Classification can therefore be interpreted as a mapping:

    D = f([x_1, ..., x_K]^T) = [y_1, ..., y_C]^T    (2.1)

where y_j denotes a degree of support for class ω_j, estimating the probability P(ω_j | x). The difficulty of the classification problem depends on the variability in the feature
values within the same classes relative to the differences between feature values for
patterns from different classes. Among other phenomena complicating the classification task, the major contribution is attributed to the lack or incompleteness of the data, the high complexity of the problem and, above all, noise, accounting for all kinds of randomness in pattern variability that are not due to the underlying model [28], [148]. The performance of a classifier is the result of the trade-off between the conceptual adequacy of the classification model and its complexity control mechanisms. As mentioned before, the classification process can be segmented into four distinct operations: model choice, data preprocessing, training, and evaluation.
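The mapping of Equation 2.1 can be illustrated directly: the classifier emits a support vector and the crisp label is the class of maximum support. The support values below are invented for illustration:

```python
def decide(supports):
    """Crisp decision from a class support vector [y_1, ..., y_C]:
    return the index j maximising y_j, the estimate of P(omega_j | x)."""
    return max(range(len(supports)), key=lambda j: supports[j])

y = [0.1, 0.7, 0.2]  # hypothetical supports for classes omega_1..omega_3
print(decide(y))  # 1 (zero-based index, i.e. class omega_2)
```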
Model choice
The decision regarding the selection of the classification model is very important
and difficult especially if there is little prior knowledge about the nature of the
problem. Additional difficulties come from a fact that the classification process is to
a large extent unpredictable and quite often nondeterministic which means that the
choice can not be immediately justified. The only effective quantitative feedback
comes from the evaluation of the overall classifier performance which means that
a designer has to come through the whole classification cycle to verify his choice.
Sometimes assumptions made by a classifier match the problem characteristics or
the problem is so specific that there is only one method suitable, in which cases
the choice is straightforward. In general however, with respect to the no free lunch
theorem [28], [156], there is no individual method providing the best solution for all
types of pattern recognition problems. In a typical scenario, given a classification
problem, the designer has typically plenty of different classification models at hand
and optimistically only a rough idea which ones could be the most successful. Unless
there is clear evidence of the model's match to the problem, a tedious 'try all and choose the best' approach quite trivially seems to provide a justifiable strategy. Even then, due to limited evaluation capabilities, assigning a single classifier to the task puts the optimality of performance at risk. Another aspect arising from the model selection stage is the loss of the valuable evidence provided by competitive classifiers ranked just behind the winner. These conceptual and practical difficulties in classifier selection contributed to the development of classifier fusion systems, where all the complementary evidence and knowledge is jointly incorporated into the decision process. Further details related to classifier fusion are presented in Sections 2.3 and
2.4.
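The 'try all and choose the best' strategy reduces to a loop over candidate models evaluated on held-out data. In this sketch the model names and their validation errors are placeholders standing in for trained classifiers, not results from the thesis experiments:

```python
def select_best(models, validation_error):
    """'Try all and choose the best': keep the candidate model with the
    lowest error measured on a held-out validation set."""
    return min(models, key=validation_error)

# Hypothetical validation errors for three candidate classifiers:
errors = {"knn": 0.12, "tree": 0.18, "svm": 0.09}
print(select_best(errors, lambda m: errors[m]))  # svm
```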
Data collection and preprocessing
Once the model is chosen, the input data are prepared to be passed on to the classifier. These data are in fact k-component feature vectors of the form x = [x_1, ..., x_k]^T returned from the feature extraction stage of the pattern recognition design. Individual patterns represent points in the k-dimensional input space, examples of which are depicted for two-dimensional cases in Figure 2.2.
Although during the feature extraction phase the data may already have been preprocessed to enhance their class-discriminative power, the choice of the classification model usually dictates further adjustments. Various types of normalisation are routinely required. For example, to achieve invariance to displacements and scale changes, one might transform the data so that they have zero mean and unit variance [148]. Some models may require the data to lie in a specific range, for example (0, 1), in which case normalisation also has to be applied [32], [33], [35], [36], [34]. Normalisation may destroy the original data structure if there are outliers, hence removal of outliers may be required prior to normalisation [132], [148]. Missing feature values are another common data problem that has to be treated to avoid failures [28], [148], [34], [106]. For some complex classifiers the number of features returned from the feature extraction process may lead to intractability. Various techniques aimed at reducing the data size may therefore be required. Applying data editing or data condensation techniques [18], [83] would directly reduce the number of patterns while trying to preserve the structure of the data. Alternatively, data dimensionality may be targeted, and methods based on feature selection [28], principal/independent component analysis (PCA/ICA) [107], [63] or maximum mutual information (MMI) [119], [149] applied to reduce the number of dimensions with minimal impact on the discriminatory strength of the remaining features.
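The zero-mean, unit-variance normalisation mentioned above amounts to the following per-feature transformation; a minimal, dependency-free sketch:

```python
def standardise(column):
    """Scale a feature column to zero mean and unit variance."""
    n = len(column)
    mean = sum(column) / n
    var = sum((v - mean) ** 2 for v in column) / n
    std = var ** 0.5 or 1.0  # guard against constant (zero-variance) features
    return [(v - mean) / std for v in column]

feature = [2.0, 4.0, 6.0]
print(standardise(feature))  # [-1.2247..., 0.0, 1.2247...]
```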
Further processing may be required if a multiple classifier system is to be ap-
plied. The input space may for instance be segmented and the training set effec-
tively split into parts fed to different classifiers like in dynamic classifier selection
(DCS) [41], [43], [40] systems. For the same purpose, the data may be grouped into
many subsets of features and applied separately for building many versions of the
model to be combined [84], [164]. Finally, there is yet another reason for data preprocessing
prior to classification: different classifiers may be encouraged to be diverse
by being provided with as much distinct evidence related to the same problem as possible.
Alongside the already mentioned input space partitioning and selection of different feature
subsets, there are also simpler methods, such as injecting noise or differentiating initial
conditions [25], and many different linear and non-linear transformations [138]
that could potentially be used to enforce diversity among classifiers.
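As a concrete illustration of the feature-subset route to diversity, the sketch below draws a random subset of feature indices for each ensemble member (the random subspace idea); the function name and the sizes used are invented for this example:

```python
import numpy as np

def random_feature_subsets(n_features, n_models, subset_size, seed=0):
    """Draw a random feature subset for each ensemble member.

    Each classifier is then trained only on the columns listed in its
    subset, which encourages the combined models to make different
    (diverse) errors on the same problem.
    """
    rng = np.random.default_rng(seed)
    return [rng.choice(n_features, size=subset_size, replace=False)
            for _ in range(n_models)]

# Five hypothetical ensemble members, each seeing 4 of 10 features.
subsets = random_feature_subsets(n_features=10, n_models=5, subset_size=4)
```

A training routine would then fit model `i` on `X[:, subsets[i]]`; the fused decision is obtained by one of the combiners discussed later in this chapter.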
Figure 2.2: Two examples of two-dimensional datasets: (a) artificial, 2-D, 8 classes; (b) artificial, 2-D, 3 classes.
Training
Training is the actual process of classifier learning. Although this thesis is only
concerned with supervised learning, the training process is a good place to briefly
discuss different learning models [28] as they directly affect the way training is
carried out.
Depending on the availability and reliability of the evidence one can distinguish
three learning strategies: supervised, unsupervised and reinforcement learning. In
supervised learning the classifier is given a labelled training set to build the model on.
It is called supervised as it can be thought of as a teacher providing the patterns
and their true classes, on the basis of which the classifier model learns how to return
an optimal solution to the problem. In some cases training data of known classes
may not be available, which eliminates the availability of a teacher. Such learning
on the basis of unlabelled data is called unsupervised learning. In the intermediate
case, reinforcement learning, although the true labels of patterns are not available,
feedback is given on whether the classifier output is correct or incorrect, without
specifying what the correct answer is.
Classification models are normally fully learnt from labelled pattern examples.
An important fact to realise is that the amount of labelled data is limited,
usually very small, and costly to obtain. Another important fact is that these data
also have to be used for performance evaluation. This implies that a part of the
available data has to be left out for testing purposes, which further narrows down
the amount of data to be used for a proper training of the classifier.
Given a set of all available labelled data X:
X : \{x_i = [x_1, \ldots, x_k]^T,\; \omega_j\} \qquad i = 1, \ldots, N \quad j = 1, \ldots, C \qquad (2.2)
we denote the training set by X_T, where X_T ⊂ X, and note that the remaining data X_E = X \ X_T ¹ will be used for testing (see Section 2.2.1). Normally, the more training data is used, the more adequately the model reflects the problem and the better
its performance. Some characteristics of classifier training are captured in the
form of a learning curve, showing the relation between the classifier's generalisation
performance and the size of the training set used to train the model. Figure 2.3(a)
shows examples of such learning curves for three typical classifiers. The examples
present three types of learning behaviour. For the first, linear classifier, adding more
training data does not improve its performance, as the data are simply highly non-linear.
The decision tree classifier shows the optimal amount of training data, above
which it becomes overtrained. The third, highly non-linear k nearest neighbour
classifier seems to benefit consistently from adding more training data, although at
the level of 400 samples it appears to reach a plateau and adding large amounts of new
training data does not improve the classifier's performance significantly. What it certainly
does, though, is increase model complexity and reduce the size of prospective testing
samples. If the size of the labelled data is seriously limited then some more elaborate
splitting and error estimation techniques are required [154]. Figure 2.3 also provides
a visualisation of the three classifiers after training. For the 2-dimensional problem
it visualises discriminative functions and shows the resulting decision boundaries.
¹ \ denotes the set subtraction operator: A \ B = {c : c ∈ A ∧ c ∉ B}
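A learning curve like the ones in Figure 2.3(a) can be reproduced in miniature with a toy experiment; the Gaussian data generator and the nearest-mean classifier below are invented stand-ins for illustration, not the classifiers used in the thesis:

```python
import numpy as np

rng = np.random.default_rng(42)

def make_data(n):
    """n samples per class from two well-separated 2-D Gaussian classes."""
    X0 = rng.normal(loc=[-2.0, 0.0], scale=1.0, size=(n, 2))
    X1 = rng.normal(loc=[+2.0, 0.0], scale=1.0, size=(n, 2))
    return np.vstack([X0, X1]), np.array([0] * n + [1] * n)

def nearest_mean_error(X_tr, y_tr, X_te, y_te):
    """Train a nearest-mean classifier and return its test error rate."""
    means = np.array([X_tr[y_tr == c].mean(axis=0) for c in (0, 1)])
    d = ((X_te[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    return float((d.argmin(axis=1) != y_te).mean())

X_te, y_te = make_data(500)      # fixed test set
sizes = [5, 20, 100, 400]        # training-set sizes per class
curve = [nearest_mean_error(*make_data(n), X_te, y_te) for n in sizes]
```

With classes this well separated, the error falls quickly towards the Bayes rate and flattens out, mirroring the plateau described above.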
Figure 2.3: Visualisation of the training process for 3 common classifiers: (a) learning curves (error rate vs. number of samples for ldc, treec and knnc) on the artificial 2-D, 8-class dataset; (b) linear discriminant classifier; (c) decision tree classifier; (d) k nearest neighbours classifier. Plots b, c, d show a superposition of discriminative functions within the 2-dimensional feature space.
Testing
The importance of model evaluation stems from the fact that it provides the most in-
formative measure of classifier performance which then could justify its use, leading
to possible optimisation, redesign or elimination if other models show better per-
formance. The common belief that a more elaborate classifier producing complex
non-linear class boundaries is better than a simple linear model may not always be
true. Complex models tend to overfit the training data, so that although their performance
on the training set is usually much better than that of simple linear models, they
may show very weak performance on new patterns [28]. Data overfitting is a typical
trap for sophisticated systems unless some complexity control mechanisms are
incorporated in the design of such a classifier. It is believed that a model with
well-balanced complexity should perform similarly on the training and testing data
as well as any data from the problem domain [28], [148].
Given the limited amount of training data, the precise estimation of the true
model performance or error rate is quite a challenge. There is no issue if the size
of available training data is huge compared to the number of classes. According
to standard statistical analysis carried out in [154], 1000 testing samples should
provide satisfactory error tolerance of the predicted performance for most of the
cases. Problems start to emerge if there is less or much less data available. Random
multiple splitting into training and testing sets is the simplest method to enhance
the reliability of performance estimation. For smaller testing sets multiple splitting
still holds a high risk that some regions of the input space may be scarcely covered
leading to substantial bias in performance estimate. In such cases multiple cross-
validation procedures show quite satisfactory results [154]. In cross-validation, the
testing set is rotated over exclusive subsets exhaustively covering the whole dataset.
The extreme cross-validation with a rotation of only a single pattern used for testing
is called leave-one-out [154] and is preferred whenever its application is computationally
tractable. For sample sizes smaller than 50, leave-one-out can be supported
by bootstrapping [154], [28], generating a test set by sampling with replacement from
a training set. More precise guidelines for the use of true performance estimation
methods depending on the size of the testing set can be found in [154].
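The rotation scheme described above can be sketched generically; `train_and_test_error` is a placeholder for any classifier-fitting routine, and the toy majority-class predictor below is invented purely for illustration:

```python
import numpy as np

def kfold_error(X, y, k, train_and_test_error, seed=0):
    """Estimate the true error by k-fold cross-validation.

    The test fold is rotated over k exclusive subsets that exhaustively
    cover the dataset; with k equal to the number of samples this
    becomes leave-one-out.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    errs = []
    for i in range(k):
        te = folds[i]
        tr = np.concatenate([folds[j] for j in range(k) if j != i])
        errs.append(train_and_test_error(X[tr], y[tr], X[te], y[te]))
    return float(np.mean(errs))

# Toy usage: a predictor that always outputs the majority training label.
def majority_rule(X_tr, y_tr, X_te, y_te):
    pred = np.bincount(y_tr).argmax()
    return float((y_te != pred).mean())

X = np.zeros((100, 2))
y = np.array([0] * 70 + [1] * 30)
cv_err = kfold_error(X, y, k=10, train_and_test_error=majority_rule)
```

For this degenerate predictor the cross-validated error simply recovers the minority-class fraction, 0.30, regardless of how the folds are drawn.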
A final comment relates to the combined systems where the combiner may require
individual classifier performance estimates to decide which ones to combine. In
such a case, apart from the training and testing sets used for individual
performance estimation, an additional validation set is needed for
estimating combiner performance. Normally, the combiner could be perceived
as a more general classifier, which would require a separate set for building the
combination model and a separate one for testing its performance. However, separating
an additional set from the overall classification dataset would further limit the training
and evaluation capabilities of the individual classifiers.
Due to the large number of classifiers and datasets considered throughout the
experimental parts of this thesis, estimation of individual performances is based on
random multiple splitting. The estimation of combiner performance is based on
the same testing set as the one used for evaluation of individual classifier perfor-
mances. These choices have been taken to maintain simplicity and uniformity of the
experimental results and to ensure a coherent comparison between individual and
combined performances.
2.2.2 Classification error
Pattern classification incorporates supervised learning mechanisms and therefore
shares a similar description of the model error [28], [137]. The major objective
of supervised learning is to construct a predictor which, given the limited amount
of training data, will be able to estimate a target function T : x → y with a possibly
minimal error. Excluding artificial data, the mapping x → y usually reflects a
real-world learning problem, which commonly depends on a large number
of factors. Due to a number of constraints, the predictor tries to select only the
minimum number of factors which jointly describe the problem and are
sufficient to give reliable predictions. However, the fact that they never cover the
whole knowledge space supporting the solution of the problem limits the ability to
generate correct outputs, according to the following formula:
y = E(y|x) + \varepsilon \qquad (2.3)
where E(y|x) represents the expectation of y given x and \varepsilon stands for white
noise. An additional portion of model error stems from the limited, usually small,
training set. Instead of using a whole input space X, which is commonly unknown,
the predictor uses only selected known training data XT for generation of predictions
for unknown data: f(x,XT ) with an unknown level of representativeness related to
x. After this additional constraint all considerations are forced to be targeted at
training dataset XT , which could be additionally split in order to leave out some
part for testing the accuracy of predictions. The total mean squared error of the
model can be now formulated as [39], [137]:
e_f^2 = E_{X_T}\{[y - f(x, X_T)]^2\} = E(\varepsilon^2) + E_{X_T}\{[E(y|x) - f(x, X_T)]^2\} \qquad (2.4)
Some further algebra results in:
e_f^2 = \underbrace{E(\varepsilon^2)}_{\text{noise}} + \underbrace{E_{X_T}^2[f(x, X_T) - E(y|x)]}_{\text{bias}} + \underbrace{E_{X_T}\{[f(x, X_T) - E_{X_T}[f(x, X_T)]]^2\}}_{\text{variance}} \qquad (2.5)
As is clear from equation (2.5), a simple decomposition leads to a separation
of three independent components of model error. The first term is called white noise,
which cannot be reduced unless further evidence is provided. The second term, bias,
can be intuitively characterised as a measure of the predictor's ability to generalise well
once trained. Finally, the third term, variance, can be similarly interpreted as a
measure of sensitivity of predictor outputs over different training sets. The model
error can therefore be rewritten in the concise form:
e_f^2 = \sigma^2 + B^2(f) + V(f) \qquad (2.6)
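The decomposition in (2.5)-(2.6) can be verified numerically for a toy predictor; the shrunken-mean estimator below is an invented example, chosen only so that all three components are non-zero:

```python
import numpy as np

rng = np.random.default_rng(0)
m, sigma, n, trials = 2.0, 1.0, 25, 20000

# Hypothetical predictor: a shrunken sample mean, f = 0.9 * mean(training set),
# trained on `trials` independent training sets of size n drawn from N(m, sigma).
preds = np.array([0.9 * rng.normal(m, sigma, n).mean() for _ in range(trials)])

noise = sigma ** 2                    # E(eps^2): irreducible white noise
bias_sq = (preds.mean() - m) ** 2     # squared bias, as in (2.5)
variance = preds.var()                # variability over training sets

# Total squared error of predicting fresh targets y = m + eps with each model:
ys = m + rng.normal(0.0, sigma, trials)
total = ((ys - preds) ** 2).mean()
```

Up to Monte Carlo error, `total` matches `noise + bias_sq + variance`: shrinking the mean buys a bias of about -0.2 (so bias squared near 0.04) while the variance term, 0.81 sigma^2/n, stays small.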
In the classification model the only difference from the general prediction model
comes from the fact that classification operates on assignments to the crisp class
labels ω_j as elements of the set Ω. The individual error of the classifier thus
occurs in the form of picking the wrong class label, not as a bias from some true value
measured continuously as in regression problems. The variability of the classification
outputs requires a specific description, leading to slight differences in error
representation compared with the prediction model in (2.5). Considering
classification error within a probabilistic frame of reference, each classifier produces
probabilistic outputs supporting the different classes: D_i = [p(ω_1), ..., p(ω_C)]^T. Denoting
by ω_T = arg max_j [p(ω_j|x)] the true class for a given input pattern x, and by ω_f
the classifier choice arising from ω_f = arg max_ω [f(x, X_T) = ω], the error decomposition
can be reformulated from (2.5) to the following form [137]:
e_f = \underbrace{1 - p(\omega_T|x)}_{\text{Bayes error}} + \underbrace{p(\omega_f|f, x)[p(\omega_T|x) - p(\omega_f|x)]}_{\text{bias}} + \underbrace{\sum_{\omega \neq \omega_f} p(\omega|f, x)[p(\omega_T|x) - p(\omega|x)]}_{\text{spread}} \qquad (2.7)
The Bayes error, appearing in equation (2.7) in place of the noise component of
(2.5), forms the lower bound on the classification error and is only a function of
the problem complexity and the available evidence. The bias expresses classifier
goodness in modelling the problem while the spread (equivalent to variance in (2.5))
describes the variability of the model outputs.
While the Bayes error component cannot be reduced by any means, the remaining
bias and spread error components are fixed only for individual classifiers. In
multiple classifier systems, the spread component is likely to be reduced by the parallel
combining of redundant classifiers [137]. In such a case the variability of classifier
outputs is stabilised as a result of the applied aggregation [137]. On the other hand,
bias can only be reduced by a better classification model, which can potentially
be achieved by applying modular decomposition of the classification task
and assigning different classifiers to the subtasks for which they perform best [137].
A more detailed analysis of the error in combined multiple classifier systems will be
provided in Section 2.4.5, which deals with different combining paradigms.
2.3 Information fusion
Two important developments at the end of the 20th century contributed
to the enormous dynamism we observe today in the area of evidence fusion. The first
was the emergence of multi-modal detection systems providing coordinated data
from multiple sensors of different types facilitated by immense information content
from highly developed interconnected information systems [49], [7]. Treating all
types of evidence separately with a single method was an unsuccessful option, lead-
ing to either complex hybridisation of the system or no gain in performance. What
led to the breakthrough was the fusion of distinct evidence on many different levels
from pure data to the decisions of individual experts operating on different parts
of the available evidence [49], [7], [22]. Another important point to note was that
individual classification methods provide alternative knowledge even in the absence
of alternative data. It turned out that even if applied to the same task using the
same data, a joint decision of combined classifiers is potentially more effective than
any one individual [22], [15], [70], [137]. These facts, emerging in an environment
of rapidly growing technology, cheap computational power and exponentially ex-
panding internet resources led to a sudden turn to fusion in the pattern recognition
domain.
Fusion of information can be carried out on many different levels of abstrac-
tion closely connected with the flow of the classification process: data level fusion,
feature level fusion, and classifier fusion [7]. There is little theory about the first
two levels of information fusion. However, there have been successful attempts to
transform numerical, interval and linguistic data into a single space of symmetric
trapezoidal fuzzy numbers [54], [115], and some heuristic methods have been successfully
used for feature level fusion [7], [68]. Classifier fusion has attracted the most
scientific attention and continues to expand under many different names, including:
classifier fusion, combining classifiers, mixture of experts, ensemble systems,
multiple classifier systems, composite classifiers, etc. [22], [137], [7], [25], [70], [122].
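Since majority voting is the combiner examined throughout this thesis, a minimal decision-level fusion sketch may be helpful here; the vote matrix `D` is invented, and the tie-breaking convention (lowest label index wins) is an assumption of this sketch:

```python
import numpy as np

def majority_vote(decisions):
    """Combine crisp label outputs of several classifiers by majority voting.

    `decisions` has shape (n_classifiers, n_samples); each column collects
    the labels assigned to one pattern, and the most frequent label wins
    (ties broken in favour of the lowest label index).
    """
    decisions = np.asarray(decisions)
    n_classes = decisions.max() + 1
    # Count the votes for each class, column by column.
    votes = np.apply_along_axis(np.bincount, 0, decisions, minlength=n_classes)
    return votes.argmax(axis=0)

# Three classifiers, five patterns: the ensemble corrects isolated errors.
D = np.array([[0, 1, 1, 2, 0],
              [0, 1, 2, 2, 0],
              [1, 1, 1, 2, 0]])
fused = majority_vote(D)
```

Note how the single dissenting votes on the first and third patterns are outvoted, which is exactly the error-correcting effect whose limits are analysed later in this thesis.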
2.3.1 Data fusion
At the basic level of data sensing, the fusion of data from various modalities has been
used to resolve the occlusion problem in vision systems [7]. In another application,
fusion of differently sensed images improved object detection by overlapping many
partially discriminative projections [54]. In [54], [115] a method of combining various
types of data is presented. The proposed new data model, called heterogeneous
fuzzy data, incorporates characteristics of real numerical values, confidence intervals
and linguistic information in a single representation. A generic neuro-fuzzy pattern
recognition model in which data can be processed in a generalised form of confidence
intervals has also been proposed in [32], [36]. These studies are supported by the
theory of fuzzy sets, details of which can be found in [163], [72], [114]. Emerging from
this theory, fuzzy measures are considered a generalisation of probabilistic measures within
the general theory of evidence [72], and provide various information modelling tools
that can be used in data fusion.
2.3.2 Feature fusion
There is little evidence of feature fusion in the literature. Fusion at this level
is considered more general than data fusion and often resembles classifier
fusion techniques. Some authors even suggest that the difference between
feature fusion and combining classifiers is somewhat arbitrary [7]. It commonly
involves combining multidimensional quantitative feature vectors, possibly supported
by some qualitative measures. An example of feature fusion has been shown by
Keller and Gader [67] where the data features extracted from Geo-Centers GPR
system have been combined by a fuzzy rule incorporating some shape characteris-
tics of the raw data. Again an improvement, in the form of a reduction of
false alarms, has been observed. Another example of what may be considered
feature fusion has been proposed in [33], where the combination of multiple versions
of neuro-fuzzy classifiers is performed at the classifier model level. In this approach
hyperbox fuzzy sets representing clusters of data in different models are combined.
The resulting classifier complexity and transparency are comparable with those of
classifiers generated during a single cross-validation procedure, while the improved
classification performance and reduced variance are comparable to those of an ensemble
of classifiers with combined decisions.
2.3.