THEORETICAL ADVANCES
MOPG: a multi-objective evolutionary algorithm for prototype generation
Hugo Jair Escalante • Maribel Marin-Castro • Alicia Morales-Reyes •
Mario Graff • Alejandro Rosales-Pérez • Manuel Montes-y-Gómez •
Carlos A. Reyes • Jesus A. Gonzalez
Received: 26 March 2014 / Accepted: 22 January 2015
© Springer-Verlag London 2015
Abstract Prototype generation deals with the problem of generating a small set of instances, from a large data set, to be used by KNN for classification. The two key aspects to consider when developing a prototype generation method are: (1) the generalization performance of a KNN classifier when using the prototypes; and (2) the amount of data set reduction, as given by the number of prototypes. Both factors are in conflict because, in general, maximizing data set reduction implies decreasing accuracy, and vice versa. Therefore, this problem can be naturally approached with multi-objective optimization techniques. This paper introduces a novel multi-objective evolutionary algorithm for prototype generation in which the objectives are precisely the amount of reduction and an estimate of the generalization performance achieved by the selected prototypes. Through a comprehensive experimental study we show that the proposed approach outperforms most of the prototype generation methods proposed so far. Specifically, the proposed approach obtains prototypes that offer a better tradeoff between accuracy and reduction than alternative methodologies.
Keywords Prototype generation · Evolutionary algorithms · 1NN classification · Multi-objective optimization
1 Introduction
k-Nearest neighbors (KNN) is one of the most widely used models for pattern classification [31]. This is due in part to its asymptotic behavior as the number of training instances tends to infinity [6]. Also, the bias and variance of the KNN model can be adjusted by varying the value of k [18]. In addition to its effectiveness, KNN is quite popular because it is easy to understand and implement. Unfortunately, there are two main issues that limit the application of KNN in certain domains: memory storage requirements and efficiency. On the one hand, KNN is known to be very effective when a large number of instances are available. However, a large set of training objects implies (1) the requirement of storing all of the training objects in memory and (2) estimating the distance (or similarity) from a test object to all of the training objects each time a new object has to be classified. This is a major complication because big-data problems are becoming ubiquitous. Therefore, to keep KNN a suitable option for cutting-edge classification problems, strategies for making it scalable and efficient must be applied.

Prototype-based classification (PBC) aims to amend these issues for KNN [17, 24, 28]. PBC is a methodology that considers only a subset of representative training instances for making predictions, still under KNN's classification rule. In PBC there are two main ways of reducing the original training set: prototype selection (PS) and prototype generation (PG). The goal of PS methods is to select, from the training data, the objects that better
H. J. Escalante (✉) · M. Marin-Castro · A. Morales-Reyes · A. Rosales-Pérez · M. Montes-y-Gómez · C. A. Reyes · J. A. Gonzalez
INAOE, Luis Enrique Erro No. 1, Tonantzintla, Puebla 72840, Mexico

M. Graff
INFOTEC, Cátedras CONACyT, Aguascalientes, Mexico

Pattern Anal Applic, DOI 10.1007/s10044-015-0454-6
sorting and crowding distance; see [5, 8, 9] for a detailed explanation of these concepts. NSGA-II operates as follows (see Algorithm 1). First, a population of solutions, also called individuals, is generated, and the solutions are evaluated according to the objectives (steps 1 and 2 in Algorithm 1). Then, an iterative process begins in which evolutionary operators, such as tournament selection, recombination, and mutation, are applied to generate a child population (step 5). After that, the fitness values for each member of the child population are estimated (step 6). Next, parent and child populations are combined, and all non-dominated fronts are identified using the non-dominance sorting mechanism, which ranks solutions according to their non-dominance level: from the combined parent–child population, the solutions that are non-dominated with respect to all others constitute the first front; the second front is formed by the non-dominated solutions among the remaining ones, and this procedure continues until every solution has been assigned to a non-dominance level (steps 3 and 7). A new population of individuals is then selected for the next iteration by choosing solutions from both the parent and child populations using the previously identified non-dominated fronts: whole fronts are added until the population size is exceeded, and the individuals from the last added front are then chosen one by one according to their crowding distance in the objective space (steps 8–12). The algorithm stops after g repetitions of the iterative process. The rest of this section describes the NSGA-II technique as we used it for PG.
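The non-dominance ranking and crowding distance that drive this loop can be sketched as follows. This is a minimal illustration of our own (function names are ours, not from the paper), assuming both objectives are to be maximized, as in MOPG:

```python
def dominates(a, b):
    """True if solution a dominates b (maximization of every objective)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def non_dominated_sort(objs):
    """Group solution indices (objs: list of objective tuples) into fronts F1, F2, ..."""
    remaining = list(range(len(objs)))
    fronts = []
    while remaining:
        front = [i for i in remaining
                 if not any(dominates(objs[j], objs[i]) for j in remaining if j != i)]
        fronts.append(front)
        remaining = [i for i in remaining if i not in front]
    return fronts

def crowding_distance(objs, front):
    """Crowding distance of each solution in a front (larger = less crowded)."""
    dist = {i: 0.0 for i in front}
    for m in range(len(objs[0])):
        ordered = sorted(front, key=lambda i: objs[i][m])
        span = objs[ordered[-1]][m] - objs[ordered[0]][m] or 1.0
        dist[ordered[0]] = dist[ordered[-1]] = float("inf")  # boundary solutions kept
        for k in range(1, len(ordered) - 1):
            dist[ordered[k]] += (objs[ordered[k + 1]][m] - objs[ordered[k - 1]][m]) / span
    return dist
```

Boundary solutions of each front receive infinite distance so that the extremes of the tradeoff are always preserved when a front is truncated in step 12.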
3.3.1 Initialization
A common practice in evolutionary algorithms is to randomly initialize solutions [13]. In this work, however, we initialize them using information from the training data, to speed up the convergence of MOPG. Recall that in PG each solution consists of a matrix P ∈ R^{P×(d+1)}, where P can vary for each solution. Solutions are initialized on a per-class basis: for each class k, an integer number Pk between 1 and Nk is randomly chosen, with Nk the number of instances in D that belong to class k. Next, we randomly select Pk instances from the Nk that belong to class k to generate an individual P. We repeat this process for each
Algorithm 1 NSGA-II algorithm [9]
Require: Npop, f, g
  {Npop: number of individuals (solutions); g: number of generations; f = ⟨f1(P), f2(P)⟩: objectives}
1:  Initialize population X0
2:  Evaluate objective functions f = ⟨f1(P), f2(P)⟩, ∀P ∈ X0
3:  Identify fronts F1, ..., Fl by sorting solutions according to their non-dominance level, ∀P ∈ X0
4:  for i = 1 to g do
5:    Create child population Qi from Xi by applying evolutionary operators
6:    Evaluate objective functions f, ∀P ∈ Qi
7:    Identify fronts F1, ..., Fl by sorting solutions according to their non-dominance level, ∀P ∈ Xi ∪ Qi
8:    Xi+1 = ∅; j = 1
9:    while |Xi+1| < Npop do
10:     Xi+1 = Xi+1 ∪ Fj; j = j + 1
11:   end while
12:   Select the last individuals for Xi+1 from Fj using crowding distance
13: end for
Pattern Anal Applic
123
class to generate each of the initial solutions of the population. In this way, the initial population of prototypes belongs to the data set D, and a different number of prototypes per class is allowed.

We control the number of prototypes to be considered for the initialization of each individual of the population through the parameter Ip = (Σk Pk) / N, where Ip is simply the fraction of the training instances that will be considered for initialization. One should note that the evolutionary operators we propose allow MOPG to reduce the number of prototypes from one generation to the next (via condensation in crossover).
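This per-class initialization can be sketched as follows (illustrative code of ours, not the authors'; D is assumed to be a list of (feature_vector, label) pairs, and the Ip cap on the total fraction of instances is omitted for brevity):

```python
import random

def init_individual(D, rng=random):
    """Build one solution: for each class k, draw Pk ~ U{1, ..., Nk} and
    copy Pk randomly chosen instances of that class as initial prototypes."""
    by_class = {}
    for x, y in D:
        by_class.setdefault(y, []).append(x)
    prototypes = []
    for k, members in by_class.items():
        Pk = rng.randint(1, len(members))        # 1 <= Pk <= Nk
        for x in rng.sample(members, Pk):
            prototypes.append((list(x), k))      # one row of P: d features + class
    return prototypes
```

Because each prototype is copied from D, the initial population lies entirely inside the training data, as the text describes.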
3.3.2 Fitness functions

Solutions are evaluated with respect to the two objectives defined above: f(P) = ⟨f1(P), f2(P)⟩. Objective f2(P) = c(D, P) is related to the generalization performance of a 1NN classifier using prototypes P. There are several ways of estimating the performance of a classifier on unseen data, including cross-validation, bootstrapping, the jackknife, etc. In this work we considered a simple hold-out estimate in which the training data D are further divided into training, DT, and validation, DV, data sets. The training partition is used to initialize the population and to apply evolutionary operators, whereas the validation data are used to evaluate solutions. Hence, we define c(D, P) as the accuracy obtained by a 1NN classifier using P as training data when classifying DV. On the other hand, since we defined d(D, P) = 1 − P/N, we can use d(D, P) directly as the objective f1, which is related to the amount of reduction.
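Under these definitions, evaluating a solution reduces to one 1NN pass over the validation split. A minimal sketch of ours (prototypes are (feature_vector, label) pairs; N is the size of the full training set D):

```python
def nn1_predict(prototypes, x):
    """Label of the nearest prototype under squared Euclidean distance."""
    def sqdist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(prototypes, key=lambda p: sqdist(p[0], x))[1]

def fitness(prototypes, D_val, N):
    """f1 = d(D, P) = 1 - P/N (reduction); f2 = c(D, P) = 1NN accuracy on DV."""
    f1 = 1.0 - len(prototypes) / N
    f2 = sum(nn1_predict(prototypes, x) == y for x, y in D_val) / len(D_val)
    return f1, f2
```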
To generate DT and DV we define a parameter g ∈ [0, 1] that controls the fraction of instances from each class to be used for training and validation. For each class k, we randomly select ⌈Nk · g⌉ instances from D and use them as the training examples of class k; the other ⌊Nk · (1 − g)⌋ instances are used for the validation set. Repeating this process for each class, we form training and validation partitions that maintain the original distribution of classes in D.
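The stratified hold-out split described above can be sketched as follows (illustrative code of ours; the ceiling rounding follows the formulas in the text, so the remainder per class is automatically ⌊Nk · (1 − g)⌋):

```python
import math
import random

def stratified_split(D, g, rng=random):
    """Split D into DT/DV, taking ceil(Nk * g) training instances per class."""
    DT, DV = [], []
    by_class = {}
    for x, y in D:
        by_class.setdefault(y, []).append((x, y))
    for members in by_class.values():
        rng.shuffle(members)
        n_train = math.ceil(len(members) * g)   # the rest go to the validation set
        DT.extend(members[:n_train])
        DV.extend(members[n_train:])
    return DT, DV
```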
In each iteration of the genetic algorithm we update the training and validation partitions, to prevent the prototypes from overfitting a single validation data set. Each time the partitions are updated, we re-evaluate all of the solutions in the Pareto set with the new validation partition, to avoid evaluating solutions from different iterations on different data. Note that with a dynamic fitness function it is still possible to obtain solutions in the Pareto set that performed well on only a single partition of the data (e.g., if a solution in the last iteration performs well on the last partition). Even with this limitation, however, we consider a dynamic fitness function advantageous over optimizing performance on a fixed data partition.
3.3.3 Evolutionary operators
In evolutionary computation, variation operators are used to generate new solutions by updating the available ones [13]; the two fundamental operators in evolutionary algorithms are crossover and mutation. When solutions are encoded as numerical vectors of fixed dimension, there is a wide diversity of variation operators that can be applied. However, since in PG each solution is encoded as a matrix with a variable number of rows, we must propose ad hoc operators for this representation.
The goal of the crossover operator is to generate new (child) solutions by combining the elements that form two other individuals (parents), with the aim that the new solutions are better than their ancestors. In PG we want to combine two sets of prototypes P1 and P2 to generate solutions P′1 and P′2. Accordingly, we propose a crossover operator that interchanges the individual prototypes that form solutions P1 and P2. Given two parent solutions P1 and P2, we randomly select a prototype pP1 ∈ P1. Then we identify those prototypes from solution P2 that belong to the same class as pP1, and we replace a randomly chosen prototype of that class in P2 with pP1. Next we apply, with uniform probability, either: (a) replace pP1 ∈ P1 with a prototype randomly chosen from P2 that belongs to the same class as pP1, or (b) replace pP1 ∈ P1 with the average of the prototypes in P2 belonging to the same class as pP1. The aim of (a) is to replace a prototype with another one of the same class, hence allowing the interchange of information between solutions. The goal of (b), on the other hand, is to condense a set of prototypes in such a way that the new solution summarizes the position in the search space of all of the prototypes of the same class.
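The crossover just described can be sketched as follows (our own reading of the operator; solutions are lists of (feature_vector, label) pairs, and all names are ours):

```python
import random

def crossover(P1, P2, rng=random):
    """Class-aware crossover: exchange or condense prototypes of one class."""
    C1 = [(list(x), y) for x, y in P1]           # work on copies of the parents
    C2 = [(list(x), y) for x, y in P2]
    i = rng.randrange(len(C1))
    x1, k = C1[i]                                # pP1 and its class k
    same = [j for j, (_, y) in enumerate(C2) if y == k]
    if not same:
        return C1, C2                            # P2 has no prototype of class k
    donors = [x for x, y in P2 if y == k]        # P2's class-k prototypes
    C2[rng.choice(same)] = (list(x1), k)         # P2 side: receive pP1
    if rng.random() < 0.5:
        # (a) interchange: a random class-k prototype from P2
        C1[i] = (list(rng.choice(donors)), k)
    else:
        # (b) condensation: average of P2's class-k prototypes
        C1[i] = ([sum(v) / len(donors) for v in zip(*donors)], k)
    return C1, C2
```

Both branches preserve the class of the replaced prototype, so the per-class structure of each child matches its parent.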
Mutation operators aim to incorporate diversity into the population through random modifications of solutions. For PG we propose a mutation operator that, given a solution P1, generates a new solution P′1 in which one prototype of P1 is modified. We randomly select a prototype pP1 ∈ P1. Next, we apply, with uniform probability, either: (a) adding a vector of random numbers to pP1, where the numbers are uniformly generated in the range of the values of each dimension of pP1, or (b) replacing pP1 with an instance in the training set DT that belongs to the same class. The aim of (a) is to introduce slight perturbations to solutions in the current population, whereas (b) aims to introduce new instances (not considered during initialization) into the individuals.
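A sketch of the mutation operator (our own code; for case (a) we read "the range of the values of each dimension" as the per-dimension range over the solution's prototypes, which is one possible interpretation):

```python
import random

def mutate(P, DT, rng=random):
    """Perturb one prototype (a) or replace it with a same-class instance of DT (b)."""
    C = [(list(x), y) for x, y in P]
    i = rng.randrange(len(C))
    x, k = C[i]
    if rng.random() < 0.5:
        # (a) move the prototype uniformly within each dimension's observed range
        lo = [min(p[0][d] for p in C) for d in range(len(x))]
        hi = [max(p[0][d] for p in C) for d in range(len(x))]
        C[i] = ([v + rng.uniform(l - v, h - v) for v, l, h in zip(x, lo, hi)], k)
    else:
        # (b) swap in a training instance of the same class
        cands = [inst for inst, y in DT if y == k]
        if cands:
            C[i] = (list(rng.choice(cands)), k)
    return C
```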
To select the solutions to which the crossover and mutation operators will be applied, we used a binary tournament scheme [8, 9, 13]. Crossover and mutation operators are applied to individuals in the population with probabilities Prc and Prm, respectively. We use default values for these probabilities, but we also conducted experiments to assess the impact of the parameter values and the evolutionary operators on MOPG.
3.3.4 Selecting a single solution
The output of the NSGA-II method is a set of non-dominated solutions found during the search, i.e., an approximation of the Pareto optimal set. Theoretically, none of these solutions is better or worse than any other. However, having a set of solutions instead of a single one does not make sense for PG. Of course, one could use the union of all prototypes included in all solutions, but this would be misleading because the amount of reduction would decrease and we could have redundant and noisy prototypes. Instead, we propose a simple mechanism to select a single solution out of the set.

We evaluate the classification performance of 1NN when using the prototypes associated with each solution to classify all the training instances (D). Then we choose, as the final output of our method, the solution that obtained the highest classification performance. One should note that, since prototypes were generated using different partitions of training and validation data, the performance on the whole original data set is not expected to be perfect; thus, we can use this performance as an indicator of the accuracy of the prototypes. In Sect. 5 we evaluate the effectiveness of this selection strategy.
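This selection step can be sketched as follows (our own illustration; pareto_set is the list of prototype sets returned by the search and D the full training set as (feature_vector, label) pairs):

```python
def select_final(pareto_set, D):
    """Return the non-dominated solution with highest 1NN accuracy on all of D."""
    def sqdist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    def accuracy(P):
        hits = sum(min(P, key=lambda p: sqdist(p[0], x))[1] == y for x, y in D)
        return hits / len(D)
    return max(pareto_set, key=accuracy)
```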
3.4 Discussion
In this section we have described MOPG, our multi-objective approach to PG, in detail. MOPG uses NSGA-II to explore the search space of prototypes. A variable-length representation is adopted, and ad hoc evolutionary operators have been designed. Solutions are evaluated on a subset of the training data that is updated every iteration. Also, a strategy for selecting a single solution from the set of non-dominated solutions returned by NSGA-II has been presented.
The main benefit of MOPG compared to other PG methods is that it searches for solutions that offer a good tradeoff between reduction and accuracy, which is the ultimate goal of PG methods. Multi-objective optimization can naturally deal with this type of problem. To the best of our knowledge, no other author has approached the PG problem in this way. Besides, the way we evaluate the fitness functions (updating the training and validation partitions in each iteration) and select a single solution allows MOPG to avoid overfitting to some extent. In Sect. 5 we show experimental results that evidence the effectiveness of MOPG.
One should note that, since MOPG is a population-based technique, it requires the evaluation of many solutions (as many as Npop × (g + 1)). Thus, depending on the data set size, applying our method could be a computationally demanding process. Fortunately, there is growing interest in the research community in the development of efficient and scalable methods for PG (see, e.g., [29]), which could be applied to our proposed technique. On the other hand, in evolutionary computation there are also alternative methodologies that can be adopted to speed up MOPG [7, 26].
4 Experimental settings
This section describes the experimental settings we adopted for the evaluation of MOPG. We considered the suite of data sets described in Table 1. These data sets were collected by the authors of [28] and have been used for benchmarking many of the PG methods proposed so far [2, 10, 14, 15, 23, 28, 30]. The data sets differ from each other in terms of number of instances, attributes² (numerical/nominal), and classes, which allows us to assess the performance of MOPG under different circumstances. Data sets with a large number of instances are included in the benchmark: in [28] the authors distinguished small (fewer than 2,000 instances) from large (at least 2,000 instances) data sets.
To make our experimental results comparable with others [14, 28, 30], we applied tenfold cross-validation over the 59 data sets to evaluate the performance of MOPG. In each experiment, for each data set, we applied MOPG 10 times using the training partitions generated via tenfold cross-validation; we evaluated the performance of the generated prototypes in each of the 10 runs using the corresponding test partitions. Hence, for a single experiment over all of the data sets we ran MOPG 590 times. We evaluated two main aspects of MOPG: accuracy on unseen data (test partitions) and amount of training set reduction. Additionally, we also evaluated the tradeoff in performance by measuring the product (reduction × accuracy) and the harmonic mean of both objectives, (2 × reduction × accuracy) / (reduction + accuracy). One should note that, since we used exactly the same tenfold cross-validation partitions as in [14, 28, 30], we can directly compare the performance of MOPG with those PG methods.
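The two tradeoff measures can be computed as follows (a trivial helper of ours, with both quantities taken as fractions in [0, 1]):

```python
def tradeoff_metrics(reduction, accuracy):
    """Product and harmonic mean of reduction and accuracy."""
    product = reduction * accuracy
    hmean = 2 * reduction * accuracy / (reduction + accuracy)
    return product, hmean
```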
5 Experimental results
This section reports experimental results obtained with
MOPG over the suite of data sets introduced in [28] and
2 One should note that both numeric and nominal attributes are included among the considered data sets. For simplicity, we have deliberately transformed nominal attributes into integers and applied MOPG without any modification.
described in Sect. 4. The goal of our experiments is to evaluate the effectiveness of MOPG for the generation of prototypes and to compare its performance with that reported for alternative methods. First, we report the results of a study that evaluates the reproducibility of results. Next, we evaluate the effectiveness of the strategy for selecting solutions from the Pareto set. Then, we assess the performance of MOPG under different parameter settings, related both to MOPG itself and to the evolutionary algorithm we considered (NSGA-II). Finally, we compare the performance of MOPG to that of other state-of-the-art approaches.
5.1 Reproducibility of results
This section aims to provide evidence on the reproducibility of the results obtained with MOPG. Since MOPG is based on NSGA-II, which is a stochastic optimization technique, there is no guarantee that the same results will be obtained with each execution. Accordingly, in this section our aim is to determine whether the results obtained with MOPG are due to chance.

We assess the reproducibility of MOPG's results by performing experiments with two different parameter settings. The two configurations differed in the population size and number of generations. For the first configuration, called 50–50, 50 individuals and 50 generations were considered; for the second one, called 250–250, 250 individuals and 250 generations were considered (the rest of the parameters were fixed according to the best results from Sect. 5.3). The intuition behind considering two parameter settings was to assess the reproducibility of results when the search is non-intensive (the 50–50 setting) and somewhat intensive (the 250–250 setting). We ran