A GRASP algorithm for fast hybrid (filter-wrapper) feature subset selection in high-dimensional datasets

Pablo Bermejo, Jose A. Gámez, Jose M. Puerta
{Pablo.Bermejo,Jose.Gamez,Jose.Puerta}@uclm.es
Intelligent Systems and Data Mining Laboratory. Computing Systems Department
Universidad de Castilla-La Mancha. Albacete, 02071, Spain

Abstract
Feature subset selection is a key problem in the data-mining classification task that helps to obtain more compact and understandable models without degrading (or even improving) their performance. In this work we focus on FSS in high-dimensional datasets, that is, with a very large number of predictive attributes. In this case, standard sophisticated wrapper algorithms cannot be applied because of their complexity, and computationally lighter filter-wrapper algorithms have recently been proposed. In this work we propose a stochastic algorithm based on the GRASP meta-heuristic, with the main goal of speeding up the feature subset selection process, basically by reducing the number of wrapper evaluations to carry out. GRASP is a multi-start constructive method which constructs a solution in its first stage, and then runs an improving stage over that solution. Several instances of the proposed GRASP method are experimentally tested and compared with state-of-the-art algorithms over 12 high-dimensional datasets. The statistical analysis of the results shows that our proposal is comparable in accuracy and cardinality of the selected subset to previous algorithms, but requires significantly fewer evaluations.

Keywords: Feature selection, classification, GRASP, filter, wrapper, high-dimensional datasets
and so, S = {X1, X2, X4, X8} is the selected subset. □
With the goal of obtaining more compact subsets and of avoiding overfitting, in Ruiz et al. (2006) wrapper evaluation is carried out by using a 5-fold cross validation, and a t-test (confidence level α = 0.1) that takes as input the classification accuracy over the five folds to decide when the inclusion of a new variable in S is significant. Of course, the same five folds are used in all the wrapper evaluations in order to have fair comparisons. This and other criteria were studied in Bermejo et al. (2008), concluding that an appropriate method is to use a purely heuristic criterion that pursues the same goals: the inclusion of a variable is significant if the averaged classification accuracy over the 5 folds is greater than the current one, with this advantage also holding in k of the 5 folds (k = 2 and k = 3 are the values recommended in Bermejo et al. (2008)). Therefore, now, the decision of acceptance or rejection of the studied variable is not based on a comparison between two numbers (as in Example 1), acc(S1) > acc(S2), but on a comparison between two vectors, $\vec{acc}(S_1) \rhd \vec{acc}(S_2)$, where acc(S) is the accuracy of the classifier trained using only S as predictive attributes, and $\vec{acc}(S)$ is a vector containing the five accuracies corresponding to the five-fold cross validation carried out using only S as predictive attributes.
More formally, the relevance criterion used in this paper ($\rhd$) is defined as follows:

$$
\vec{acc}(S_1) \rhd \vec{acc}(S_2) =
\begin{cases}
\text{true} & \text{iff } \operatorname{average}(\vec{acc}(S_1)) > \operatorname{average}(\vec{acc}(S_2)) \\
            & \text{and } \operatorname{count}(\vec{acc}(S_1)[i] > \vec{acc}(S_2)[i]) \geq k \\
\text{false} & \text{otherwise}
\end{cases}
\qquad (1)
$$
Example 2. Let us consider $\vec{acc}(S_1) = [0.7, 0.72, 0.75, 0.73, 0.69]$ and $\vec{acc}(S_2) = [0.7, 0.71, 0.72, 0.74, 0.69]$; then $\vec{acc}(S_1) \rhd \vec{acc}(S_2)$ is true if k = 2 and false if k = 3, because $\operatorname{average}(\vec{acc}(S_1)) = 0.718 > \operatorname{average}(\vec{acc}(S_2)) = 0.712$ and $\operatorname{count}(\vec{acc}(S_1)[i] > \vec{acc}(S_2)[i]) = 2$. However, if we have $\vec{acc}(S_1) = [0.75, 0.7, 0.7, 0.7, 0.7]$ and $\vec{acc}(S_2) = [0.7, 0.7, 0.7, 0.7, 0.7]$, then $\vec{acc}(S_1) \rhd \vec{acc}(S_2)$ is false, because $\operatorname{average}(\vec{acc}(S_1)) > \operatorname{average}(\vec{acc}(S_2))$ is due to only one of the five folds, and so we can consider that this success is due to noise or randomness in the partition. □
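To make the criterion concrete, the following Python sketch (our own illustration; the function name `better` and the plain-list representation of the five fold accuracies are ours, not from the paper) implements eq. (1) and reproduces Example 2:

```python
import statistics

def better(acc1, acc2, k=2):
    """The |> relevance criterion of eq. (1).

    acc1, acc2: lists with the five per-fold accuracies of two subsets.
    Returns True iff the mean accuracy improves AND the improvement
    holds in at least k of the folds.
    """
    wins = sum(1 for a1, a2 in zip(acc1, acc2) if a1 > a2)
    return statistics.mean(acc1) > statistics.mean(acc2) and wins >= k

# Example 2 revisited:
acc_s1 = [0.7, 0.72, 0.75, 0.73, 0.69]
acc_s2 = [0.7, 0.71, 0.72, 0.74, 0.69]
print(better(acc_s1, acc_s2, k=2))  # True:  mean 0.718 > 0.712, 2 folds win
print(better(acc_s1, acc_s2, k=3))  # False: only 2 folds win
```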
The IWSS algorithm is very efficient, and linear in the number of attributes, O(n), because it carries out exactly n filter and wrapper evaluations. However, it also presents two main disadvantages: (1) it relies on a univariate ranking, so some interesting variables can be judged irrelevant/relevant just because some others have been judged irrelevant/relevant before; and (2) to be sure all the potentially relevant variables have been analyzed, the full ranking must be explored. As an example, let us observe Figure 1, where we have plotted, for the twelve datasets used in our experiments, the relation between the number of variables in the datasets and the position in the SU-based ranking of the last variable selected by IWSS. As we can observe, in 8 out of the 12 datasets, variables after position 100 (the threshold commonly used by the linear-forward algorithm (Gutlein et al., 2009)) are selected, while the same happens in 6 out of the 12 datasets if we consider the first 10% of the ranking as threshold.
[Figure 1: log-log scatter plot. X-axis: number of attributes in the dataset; Y-axis: position in the ranking of the last selected attribute.]

Figure 1: Relation between the number of attributes in the dataset and the position of the last attribute selected by IWSS.
Disadvantage (1) can be alleviated by using other algorithms that follow the incremental behaviour but somehow take into account possible interactions between the variables (IWSSr (Bermejo et al., 2009), BARS (Ruiz et al., 2009) and SFS (Kittler, 1978)). However, these improved incremental algorithms are more complex, with O(n²) worst case complexity (or even worse in the case of BARS), though in practice this complexity reduces to a sub-quadratic number of wrapper evaluations. Therefore, different approaches are needed to deal with datasets having a very large number of attributes.
5. A GRASP algorithm for FSS in high-dimensional datasets
In this section we describe a proposal for the GRASP algorithm that reduces the number of evaluations to be sub-linear with respect to the number of attributes (n = |X|), and so it is suitable for solving FSS in high-dimensional datasets. We start by discussing the idea or intuition behind the proposed algorithm and then provide a detailed description.
5.1. The idea
As mentioned above, our idea is similar to the one simultaneously proposed by Esseghir (2010), that is, to use a fast algorithm in the constructive step and a more sophisticated one in the improving step. However, given the focus of this paper, which deals with datasets having a (very) large number of predictive attributes, the use of standard filter (wrapper) FSS algorithms in the constructive (improving) step is not suitable. Notice that our goal is to drastically reduce the number of (wrapper) evaluations carried out, so the straightforward use of these standard algorithms several times is not a solution.

What we propose in this study is to use a standard hybrid algorithm (IWSS) for the first phase, but to run it over a small subset of the available variables. The subset used in each iteration of the constructive phase is sampled by using problem-specific knowledge, specifically by using proportional selection based on a (noisy) filter score, e.g. SU(Xi, C) + ε, where ε is a tiny positive number. In this way, more promising variables have a greater chance of being selected, but even variables marginally independent of C will receive a small chance, because they can become (conditionally) relevant given some other variable.
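As an illustration of this sampling scheme, here is a minimal roulette-wheel sketch (the function name `sample_subset` and its signature are ours, not from the paper); it draws `size` distinct variable indices with probability proportional to the noisy score:

```python
import random

def sample_subset(scores, size, eps=1e-10, rng=random):
    """Roulette-wheel sampling without replacement: draw `size` distinct
    variable indices, each spin proportional to its noisy filter score
    (e.g. SU(Xi, C) + eps). The eps term gives variables with a zero
    marginal score a small but non-zero chance, since they may still be
    conditionally relevant given other variables."""
    weights = [s + eps for s in scores]
    chosen = []
    for _ in range(min(size, len(weights))):
        total = sum(weights)
        if total <= 0:
            break
        r = rng.uniform(0, total)        # spin the wheel
        acc = 0.0
        for i, w in enumerate(weights):
            acc += w
            if w > 0 and acc >= r:
                chosen.append(i)
                weights[i] = 0.0         # remove: without replacement
                break
    return chosen
```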
Regarding the improving step, we also need to reduce the number of available variables in this phase. Notice that if, as usual, we start from the solution returned by the constructive step, S, and try to improve it by using a local search procedure (e.g. hill climbing) that has at its disposal all the n predictive attributes, then the requirements of such a local optimizer are too high (exactly n wrapper evaluations for each hill-climbing step). Therefore, we must think of a different way of improving.
Our idea tackles the problem of the comparison of two different solutions. When using incremental FSS algorithms this comparison is easy, because we operate in a local way. That is, we need to compare a subset S with S ∪ {Xi}. Then, as the goal of FSS is to reduce the number of variables used while maintaining or improving the expected accuracy, the inclusion of Xi only makes sense if the accuracy of S ∪ {Xi} is better (using the ⊲ criterion, eq. 1) than that of using only S. However, when moving to global search, as is the case with GRASP, we need to compare solutions coming from non-adjacent points in the search space, which means setting up the comparison not only by accuracy but also by subset cardinality. Therefore, we can tackle the problem as a bi-objective one and give the following definition:
Definition 1. Given two candidate subsets S1 and S2, we say that S1 dominates S2 if |S1| ≤ |S2| and $\vec{acc}(S_1) \rhd \vec{acc}(S_2)$. Otherwise we say that S2 is non-dominated by S1.
Example 3. Let us consider two different solutions $sol_1 = \langle\{X_1, X_2\}, 0.9, (f_1^1, \ldots, f_5^1)\rangle$ and $sol_2 = \langle\{X_1, X_3, X_4\}, 0.92, (f_1^2, \ldots, f_5^2)\rangle$, where the first component is the subset of selected variables, the second one is the average accuracy over the five folds, and $f_i^j$ is the accuracy in fold $i$ for solution $j$. Then, which one is better? Perhaps the correct answer depends on some context, but without extra information, neither sol1 dominates sol2 nor does sol2 dominate sol1, so it is difficult to decide. □
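Under the same representation used in the earlier sketch (a solution as a pair of a subset and its five fold accuracies), Definition 1 can be written as a small predicate; `dominates` and its arguments are our naming:

```python
def dominates(sol1, sol2, k=2):
    """Definition 1: sol1 dominates sol2 iff |S1| <= |S2| and
    acc(S1) |> acc(S2) (eq. 1). A solution is modelled as a
    (subset, fold_accuracies) pair; `better` is the |> predicate
    sketched after Example 2."""
    s1, acc1 = sol1
    s2, acc2 = sol2
    return len(s1) <= len(s2) and better(acc1, acc2, k)
```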
In our proposal we will maintain a set of non-dominated solutions (NDS) found during each search performed in the constructive phase. Thus, each time a new solution is provided by the constructive step, we will update NDS by using it (function update in Figure 2). Since we can expect solutions inside NDS to be of good quality, we make a pool, Xnds, with all the variables contained in the non-dominated solutions, and the local search used in the improving step will be limited to the use of only variables contained in this set, which has a much lower cardinality than the original set of variables. In this way we expect the local search in the improving step to be very fast, and also to perform well because of the quality of the available variables.
In:  NDS: the set of non-dominated solutions; sol: the candidate solution to be studied.
Out: true if NDS is modified, false otherwise; parameter NDS is modified.

1  if sol is dominated by any solution s ∈ NDS then return false
2  else
3      delete from NDS all solutions dominated by sol
4      include sol in NDS
5      return true

Figure 2: Auxiliary function update(NDS, sol).
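A direct Python transcription of Figure 2, assuming the `dominates` predicate sketched above and NDS kept as a plain list, might look as follows:

```python
def update(nds, sol, k=2):
    """Auxiliary function of Figure 2. Returns True (and modifies nds
    in place) iff sol is non-dominated; solutions dominated by sol are
    dropped from nds before sol is included."""
    if any(dominates(s, sol, k) for s in nds):
        return False
    nds[:] = [s for s in nds if not dominates(sol, s, k)]
    nds.append(sol)
    return True
```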
5.2. The algorithm
The pseudo-code of the proposed algorithm is shown in Figure 3. The next three subsections describe it in detail.
5.2.1. Initialization
Lines 1-5 account for the initialization part of the algorithm. Specifically, we initialize the set of non-dominated solutions (NDS) to be empty, and compute the marginal filter score for each variable with respect to the class in lines 2 and 3. We store the results in scores[] in order to avoid having to re-compute them each time a filter score is needed (constructive step). In this study M_T(Xi, C) corresponds to the computation of SU(Xi, C) over the training set T.

Lines 4 and 5 compute and store the probability of selection for each variable using the proportional rule, that is, the typical roulette wheel or fitness-proportionate selection used in Genetic Algorithms (Goldberg, 1989).
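For reference, symmetrical uncertainty is conventionally defined as SU(X, C) = 2·IG(X; C)/(H(X) + H(C)); a minimal sketch of lines 1-5 of Figure 3 under that standard definition (helper names are ours) is:

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy of a discrete column."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def su(x, c):
    """Symmetrical uncertainty SU(X, C) = 2*IG(X; C) / (H(X) + H(C)),
    for a discrete column x and class labels c (standard definition;
    the paper computes it over the training set T)."""
    hx, hc = entropy(x), entropy(c)
    hxc = entropy(list(zip(x, c)))          # joint entropy H(X, C)
    denom = hx + hc
    return 2 * (hx + hc - hxc) / denom if denom > 0 else 0.0

EPS = 1e-10                                 # the eps of line 3, Figure 3

def init_scores(columns, c):
    """Lines 1-5 of Figure 3: noisy scores and selection probabilities."""
    scores = [su(col, c) + EPS for col in columns]
    total = sum(scores)
    prob_sel = [s / total for s in scores]
    return scores, prob_sel
```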
5.2.2. Constructive step
Lines 7-16 account for the constructive phase of GRASP. We can divide them into two parts: (1) line 7 selects the subset of variables to be used in this iteration according to the selection probabilities stored in probSel[]; (2) lines 8-16 are the code of the IWSS algorithm. Thus, each iteration of the constructive part corresponds to a randomized execution of IWSS, where the randomization comes from the fact that only a subset of sampled variables can be used.
In:  T: training set; M: filter measure; C: classifier algorithm;
     size: number of variables to consider at each iteration;
     numIt: number of iterations; improving method.
Out: S // the selected subset

// initialization
1   NDS ← ∅
2   for each Xi ∈ X
3       scores[i] = M_T(Xi, C) + ε        // e.g. SU(Xi, C) and ε = 10^-10
4   for each Xi ∈ X                       // prob. of selecting each Xi
5       probSel[i] = scores[i] / Σ_{j=1..n} scores[j]

// GRASP
6   for it = 1 to numIt
        // constructive step
7       subset ← sample size variables from X without replacement by using probSel[]
8       R[] ← create a rank for variables in subset by using scores[]
9       S = {R[1]}                        // S will contain the solution obtained by IWSS
10      BestData = evaluate(C, S, T)
11      for i = 2 to R.size()
12          Saux = S ∪ {R[i]}
13          AuxData = evaluate(C, Saux, T)
14          if (AuxData ⊲ BestData) then
15              S = Saux
16              BestData = AuxData
        // improving step
17      if (update(NDS, S)) then
18          Xnds ← ∪_{Si ∈ NDS} Si
19          S′ ← runImprovingMethod(Xnds, S, C, T)
20          update(NDS, S′)
21  return all or best solution(s) in NDS

Figure 3: Proposed GRASP algorithm for FSS.
IWSS starts in line 8 by ranking those variables (subset) according to SU. However, because these values have been previously pre-computed and stored in scores[] (lines 2 and 3), no new computations are required. Then, it initializes the subset of selected variables (S) to the first variable in the ranking, R[1] (line 9), and the goodness of the selected variable to be the evaluation of such a subset (line 10). Notice that the call evaluate(C, S, T) corresponds to carrying out a 5-fold cross validation for classifier C over dataset T restricted to using only the variables in S as predictive variables (see Section 4). Therefore, BestData is a vector of size 5 containing the accuracy over the 5 test folds used in this inner cross validation. Lines 11-16 are the main loop of IWSS as described in Section 4, where variables are tested one at a time by following the ranking (lines 12-13) and added to S (lines 14-16) only if their inclusion is significant according to the ⊲ criterion (eq. 1).

At the end of this phase we obtain a pair ⟨S, BestData⟩ with the selected subset and its accuracies from the cross validation.
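Putting the pieces together, one possible rendering of the constructive step (lines 7-16 of Figure 3) in Python, reusing the `sample_subset` and `better` sketches above and assuming an `evaluate(S)` callback that returns the five inner-CV accuracies, is:

```python
import random

def constructive_step(scores, size, evaluate, k=2, rng=random):
    """One GRASP iteration's constructive phase (lines 7-16 of Figure 3):
    a randomized run of IWSS over a sampled subset of variables.
    `evaluate(S)` is assumed to return the five inner-CV fold accuracies
    of classifier C trained on T using only the attributes in S."""
    # line 7: proportional sampling (scores[] already carries the eps term)
    subset = sample_subset(scores, size, eps=0.0, rng=rng)
    # line 8: rank the sampled variables by their pre-computed SU score
    ranking = sorted(subset, key=lambda i: scores[i], reverse=True)
    S = [ranking[0]]                     # line 9
    best = evaluate(S)                   # line 10
    for xi in ranking[1:]:               # lines 11-16
        cand = S + [xi]
        acc = evaluate(cand)
        if better(acc, best, k):         # the |> criterion, eq. (1)
            S, best = cand, acc
    return S, best
```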
5.2.3. Improving step
Lines 17-20 correspond to the improving phase. Following the idea described above, we only take into consideration the solution S returned by the constructive phase if it is a non-dominated solution, so the improving step starts by checking this requirement (line 17). Notice that the call to the update function (line 17) returns true if S is a non-dominated solution, and it also modifies NDS. If S is dominated by any solution in NDS, we simply skip the improving step in this iteration; otherwise the set of available variables (Xnds) for the local search is constructed as the union of the variables in all non-dominated solutions (line 18), and a wrapper local-search-based FSS algorithm is run, provided with S as its starting point and restricted to the use of only variables included in Xnds (line 19). Finally, because a new solution is obtained, we again update NDS (line 20).
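For completeness, a sketch of lines 17-20 (again reusing the `update` function above; `run_improving_method` is a stand-in for whichever local-search procedure is plugged in, and solutions are (subset, fold_accuracies) pairs as before) could be:

```python
def improving_step(nds, S, best, run_improving_method, k=2):
    """Lines 17-20 of Figure 3: only a non-dominated constructive
    solution triggers the wrapper local search, which is restricted
    to the pool Xnds of variables appearing in non-dominated solutions."""
    if update(nds, (S, best), k):                       # line 17 (Figure 2)
        x_nds = sorted({v for subset, _ in nds for v in subset})  # line 18
        s2, acc2 = run_improving_method(x_nds, S)       # line 19
        update(nds, (s2, acc2), k)                      # line 20
```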
In this study we propose different choices for the local-search-based FSS procedure:
• Hill Climbing. We use a classical hill-climbing algorithm, taking as starting point the solution S and Xnds as the list of possible attributes. The neighbourhood is made up of all the subsets at Hamming distance 1 from the current solution. In this way, |Xnds| is the number of evaluations per iteration during the hill-climbing search (a minimal sketch is given after this list).
• IWSS. The method described in Section 4 and also used in the constructive phase (lines 8-16 in Figure 3), but now limited to the variables in Xnds. The number of evaluations is exactly |Xnds|.
• IWSSr. This algorithm is an enhancement of IWSS obtained by adding the operation of replacement (Bermejo et al., 2009). Thus, when the attribute ranked in position i is analyzed, not only is its inclusion studied but also its interchange with each of the variables already included in S. In this way, the algorithm can retract some of its previous decisions, that is, a previously selected variable can become useless after adding some others. As shown in Bermejo et al. (2009), this algorithm obtains more compact subsets than IWSS. In the worst case IWSSr needs O(|Xnds|²) evaluations, but in practice the exponent reduces to 1.2-1.3 (Bermejo et al., 2009).
• SFS. The classical Sequential Forward Selection (Kittler, 1978) described in Section 2.
• BARS. The Best Agglomerative Ranked Subset (Ruiz et al., 2009) alternates between the construction of a ranking of the available subsets (initially single variables) and a growing heuristic process that obtains all the combinations (by merging) of the first three subsets in the ranking with each of the remaining ones. After the growing phase, all the subsets with worse accuracy than the current best one are pruned; then a new ranking is created, and so on. The worst case complexity of BARS is exponential, but in practice it evaluates fewer candidates than IWSSr.
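A minimal first-improvement version of the hill-climbing option, restricted to Xnds and reusing the `better` predicate of eq. (1) (function and argument names are ours), might read:

```python
def hill_climbing(S, x_nds, evaluate, k=2):
    """Bit-flip hill climbing restricted to the variables in Xnds.
    Neighbours are all subsets at Hamming distance 1 from the current
    solution, so each sweep costs |Xnds| wrapper evaluations."""
    current, best = list(S), evaluate(S)
    improved = True
    while improved:
        improved = False
        for xi in x_nds:
            # flip xi: remove it if present, add it otherwise
            neigh = ([v for v in current if v != xi]
                     if xi in current else current + [xi])
            if not neigh:
                continue                  # skip the empty subset
            acc = evaluate(neigh)
            if better(acc, best, k):      # the |> criterion, eq. (1)
                current, best = neigh, acc
                improved = True
    return current, best
```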
Notice that in some of these procedures the only input needed is Xnds, while in others, such as Hill Climbing, both the solution to optimize, S, and the list of available variables, Xnds, are used as input.
5.3. Discussion
The complexity, in terms of wrapper evaluations carried out, of the proposed GRASP algorithm comes from the following parameters: n, the number of predictive variables; m, the cardinality of the subset selected for the constructive step; k, the number of iterations carried out by the GRASP algorithm; and the wrapper FSS method selected for the improving step.

Due to the use of IWSS in the constructive step, the algorithm carries out exactly k · m wrapper evaluations. That is, by fixing these parameters, the number of evaluations is constant regardless of the target dataset. For example, if k = 50 and m = 100, then it needs 5000 wrapper evaluations, which for large datasets is far from the number of evaluations required by most of the algorithms used in this study.
Regarding the improving phase, the complexity depends on the method used, the number of times it is called, and on the size of Xnds. Thus, the worst case complexity is bounded by O(k · (k · m)²), assuming an FSS method with quadratic worst case complexity (e.g. SFS or IWSSr). However, as can be observed in the experiments (Section 6), the number of wrapper evaluations carried out in this phase is lower than the number carried out in the constructive one. Thus, the in-practice complexity is bounded by O(2 · k · m), which in the case of datasets with many thousands of variables is far lower than the complexity of standard wrapper FSS algorithms.
Apart from complexity, there are some other aspects deserving discussion:

• Randomization of IWSS in the constructive phase has a positive side effect (besides being fast): there is room for the selection of variables that otherwise would always be discarded. For example, let us suppose that our ranking starts with X1, X2, X3, . . . and that in the score assigned by the wrapper evaluator the following holds: acc(X1, X2) > acc(X1);