Page 1: Combining Lazy Learning, Racing and Subsampling for Effective Feature Selection

Combining Lazy Learning, Racing and Subsampling for Effective Feature Selection

Gianluca Bontempi, Mauro Birattari, Patrick E. Meyer

{gbonte,mbiro,pmeyer}@ulb.ac.be

ULB, Université Libre de Bruxelles

Boulevard de Triomphe - CP 212

Bruxelles, Belgium

http://www.ulb.ac.be/di/mlg


Page 2: Combining Lazy Learning, Racing and Subsampling for Effective Feature Selection

Outline
• Local vs. global modeling

• Wrapper feature selection and local modeling

• F-Racing and subsampling

• Experimental results


Page 3: Combining Lazy Learning, Racing and Subsampling for Effective Feature Selection

The global modeling approach

[Figure: x-y plot with a query point q]
Input-output regression problem.

Page 4: Combining Lazy Learning, Racing and Subsampling for Effective Feature Selection

The global modeling approach

[Figure: x-y plot with the query point q and the training points]
Training data set.

Page 5: Combining Lazy Learning, Racing and Subsampling for Effective Feature Selection

The global modeling approach

[Figure: a global model fitted through all the training points]
Global model fitting.

Page 6: Combining Lazy Learning, Racing and Subsampling for Effective Feature Selection

The global modeling approach

[Figure: the fitted global model evaluated at the query point q, training points discarded]
Prediction by discarding the data and using the fitted global model.

Page 7: Combining Lazy Learning, Racing and Subsampling for Effective Feature Selection

The global modeling approach

[Figure: the fitted global model evaluated at a second query point q]
Another prediction by using the fitted global model.

Page 8: Combining Lazy Learning, Racing and Subsampling for Effective Feature Selection

The local modeling approach

[Figure: x-y plot with a query point q]
Input-output regression problem.

Page 9: Combining Lazy Learning, Racing and Subsampling for Effective Feature Selection

The local modeling approach

[Figure: x-y plot with the query point q and the training points]
Training data set.

Page 10: Combining Lazy Learning, Racing and Subsampling for Effective Feature Selection

The local modeling approach

[Figure: neighbours of the query point q selected and a local model fitted around q]
Ranking of data according to a metric, selection of neighbours, local fitting and prediction.

Page 11: Combining Lazy Learning, Racing and Subsampling for Effective Feature Selection

The local modeling approach

[Figure: a second query point q with its own neighbourhood and local fit]
Another prediction: again ranking of data according to a metric, selection of neighbours, local fitting and prediction.

Page 12: Combining Lazy Learning, Racing and Subsampling for Effective Feature Selection

Global models: pros and cons
• Examples of global models are linear regression models and neural networks.
• PRO: even for huge datasets, a parametric model can be stored in a small amount of memory.
• CON:
  • In the nonlinear case, learning procedures are typically slow and analytically intractable.
  • Validation methods, which address the problem of assessing a global model on the basis of a finite amount of noisy samples, are computationally prohibitive.


Page 13: Combining Lazy Learning, Racing and Subsampling for Effective Feature Selection

Local models: pros and cons
• Examples of local models are locally weighted regression and nearest neighbours.
• We will consider here a Lazy Learning algorithm [2, 5, 4] published in previous works.
• PRO: fast and easy local linear learning procedures for parametric identification and validation.
• CON:
  • The dataset of observed input/output data must always be kept in memory.
  • Each prediction requires a repetition of the learning procedure.


Page 14: Combining Lazy Learning, Racing and Subsampling for Effective Feature Selection

Complexity in global and local modeling
• Consider a nonlinear regression problem where we have N training samples, n given features and Q query points (i.e. Q predictions to be performed).
• Let us compare the computational cost of a nonlinear global learner (e.g. a neural network) and a local learner (with k << N neighbours).
• Suppose that the nonlinear global learning procedure relies on a nonlinear parametric identification step (e.g. backpropagation to compute the weights) and a structural identification step (e.g. K-fold cross-validation to define the number of hidden nodes).
• Suppose that the local learning relies on a local leave-one-out linear criterion (the PRESS statistic), sketched below.
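A minimal sketch (not the authors' implementation) of such a local fit: a linear model on the k nearest neighbours of the query, whose leave-one-out error is obtained in closed form through the PRESS residuals e_i / (1 - h_ii). All names (local_press, X, y, query, k) are illustrative.

    import numpy as np

    def local_press(X, y, query, k):
        # Fit a local linear model on the k nearest neighbours of `query`
        # and return the prediction together with the leave-one-out (PRESS)
        # mean squared error of the local fit.
        dist = np.linalg.norm(X - query, axis=1)       # distance metric
        idx = np.argsort(dist)[:k]                     # k nearest neighbours
        Xk = np.hstack([np.ones((k, 1)), X[idx]])      # local design matrix (intercept + inputs)
        yk = y[idx]
        beta, *_ = np.linalg.lstsq(Xk, yk, rcond=None)
        residuals = yk - Xk @ beta
        # PRESS: closed-form leave-one-out residuals e_i / (1 - h_ii)
        h = np.diag(Xk @ np.linalg.pinv(Xk.T @ Xk) @ Xk.T)
        press_mse = np.mean((residuals / (1.0 - h)) ** 2)
        prediction = np.r_[1.0, query] @ beta
        return prediction, press_mse

The PRESS criterion is what makes the local approach cheap: the leave-one-out assessment of one neighbourhood costs no more than a single linear fit.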


Page 15: Combining Lazy Learning, Racing and Subsampling for Effective Feature Selection

Complexity in global and local modeling

                                              GLOBAL             LOCAL
Parametric identification                     C_NLS              O(Nn) + C_LS
Structural identification (K-fold CV)         K * C_NLS          small
Cost of Q predictions                         (K + 1) * C_NLS    Q * (O(Nn) + C_LS)

where C_NLS and C_LS represent the cost of a nonlinear and of a linear least-squares fit, respectively.

The global modeling approach is computationally advantageous with respect to the local modeling one when the same model is expected to be used for many predictions. Otherwise, a local approach is to be preferred.
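Purely as an illustration of this crossover (the unit costs C_NLS and C_LS below are assumptions, not values from the slides):

    # Back-of-the-envelope comparison of the two columns above.
    N, n, K = 10_000, 20, 10        # training size, features, CV folds
    C_NLS = 5e6                     # assumed cost of one nonlinear fit
    C_LS = 1e3                      # assumed cost of one local linear fit

    global_cost = (K + 1) * C_NLS            # independent of Q
    for Q in (1, 10, 100, 1000):
        local_cost = Q * (N * n + C_LS)      # neighbour search + local fit per query
        print(Q, "local cheaper" if local_cost < global_cost else "global cheaper")

With these assumed costs the local approach wins for small Q and loses once Q grows into the thousands.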


Page 16: Combining Lazy Learning, Racing and Subsampling for Effective Feature Selection

Feature selection
• In recent years many applications of data mining (text mining, bioinformatics, sensor networks) deal with a very large number n of features (e.g. tens or hundreds of thousands of variables) and often comparably few samples.
• In these cases, it is common practice to adopt feature selection algorithms [7] to improve the generalization accuracy.
• Several techniques exist for feature selection: we focus here on wrapper search techniques.
• Wrapper methods assess subsets of variables according to their usefulness to a given learning machine. These methods conduct a search for a good subset using the learning algorithm itself as part of the evaluation function. The problem boils down to a stochastic state-space search.
• A well-known example of a greedy wrapper search is forward selection, sketched below.
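A minimal sketch of such a greedy forward wrapper, where score_subset is any assessment of a candidate subset (for instance the local PRESS criterion sketched earlier); the function name and stopping rule are illustrative, not the paper's.

    def forward_selection(n_features, score_subset, max_size=10):
        # Greedy wrapper search: repeatedly add the single feature whose
        # inclusion most improves the estimated generalization error.
        selected, best = [], float("inf")
        remaining = set(range(n_features))
        while remaining and len(selected) < max_size:
            scores = {f: score_subset(selected + [f]) for f in remaining}
            f_best = min(scores, key=scores.get)
            if scores[f_best] >= best:          # no further improvement: stop
                break
            selected.append(f_best)
            best = scores[f_best]
            remaining.remove(f_best)
        return selected, best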


Page 17: Combining Lazy Learning, Racing and Subsampling for Effective Feature Selection

Why being local in feature selection?
• Suppose that we have F feature set candidates, N training samples, and that the assessment is performed by leave-one-out.
• The conventional approach is to test all the F leave-one-out models on all the N samples and choose the best.
• This requires the training of F ∗ N different models, each one used for a single prediction.
• The use of a global model demands a huge retraining cost.
• Local approaches appear to be an effective alternative.


Page 18: Combining Lazy Learning, Racing and Subsampling for Effective Feature Selection

Racing and subsampling: an analogy
• You are a national team football trainer who has to select the goalkeeper among a set of four candidates for the next World Cup, starting next month.
• You have available only twenty days of training sessions and eight days to let the players play matches.
• Two options:
  1. (i) Train all the candidates during the first twenty days, (ii) test all of them with matches during the last eight days, and (iii) make a decision.
  2. (i) Alternate each week of training with two matches, (ii) after each week, assess the candidates and, if someone is significantly worse than the others, discard him, (iii) keep selecting among the others.
• In our analogy the players are the feature subsets, the training days are the training data, and the matches are the test data.


Page 19: Combining Lazy Learning, Racing and Subsampling for Effective Feature Selection

The racing idea
• Suppose that we have F feature set candidates, N training samples, and that the assessment is performed by leave-one-out.
• The conventional approach is to test all the F models on all the N samples and eventually choose the best.
• The racing idea [8] is to test each feature set on one point at a time.
• After only a small number of points, by using statistical tests, we can detect that some feature sets are significantly worse than others.
• We can discard them and keep focusing on the others.


Page 20: Combining Lazy Learning, Racing and Subsampling for Effective Feature Selection

Non-racing approach
Consider this simple example: we have F = 5 feature subsets and N = 10 samples, and we select the best feature set by leave-one-out cross-validation.

Squared error

          F1      F2      F3      F4      F5
i=1       0.1     0.3     0.2     0.0     0.05
i=2       0.4     0.6     0.5     0.1     0.2
i=3       0.3     1.7     0.4     0.1     0.4
i=4       0.7     2.5     1.2     0.9     0.8
i=5       0.5     2.0     1.0     0.4     0.5
i=6       2.0     3.1     2.7     1.9     2.4
i=7       0.1     4.0     3.5     0.0     3.0
i=8       4.0     5.2     5.3     3.5     8.4
i=9       3.2     4.0     3.9     3.4     4.2
i=10      4.0     4.0     4.0     0.2     3.9

ESTIMATED MSE
          1.5     2.7     2.2     1.0     2.4
                                  WINNER

After 50 training and test procedures, we have the best candidate.
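As a quick check of this bookkeeping, using the squared errors from the table above (the slide reports the rounded MSEs 1.5, 2.7, 2.2, 1.0, 2.4):

    import numpy as np

    errors = np.array([                        # rows i=1..10, columns F1..F5
        [0.1, 0.3, 0.2, 0.0, 0.05],
        [0.4, 0.6, 0.5, 0.1, 0.2],
        [0.3, 1.7, 0.4, 0.1, 0.4],
        [0.7, 2.5, 1.2, 0.9, 0.8],
        [0.5, 2.0, 1.0, 0.4, 0.5],
        [2.0, 3.1, 2.7, 1.9, 2.4],
        [0.1, 4.0, 3.5, 0.0, 3.0],
        [4.0, 5.2, 5.3, 3.5, 8.4],
        [3.2, 4.0, 3.9, 3.4, 4.2],
        [4.0, 4.0, 4.0, 0.2, 3.9],
    ])
    mse = errors.mean(axis=0)                  # estimated MSE per feature subset
    print(mse)                                 # about 1.53, 2.74, 2.27, 1.05, 2.39
    print("winner: F%d" % (mse.argmin() + 1))  # F4
    print("evaluations:", errors.size)         # 50 training and test procedures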

Page 21: Combining Lazy Learning, Racing and Subsampling for Effective Feature Selection

Racing approach

Squared error

          F1      F2      F3      F4      F5
i=1       0.1     0.3     0.2     0.0     0.05
i=2       0.4     0.6     0.5     0.1     0.2
                  OUT

After i = 2, F2 is already statistically significantly worse than the others and is discarded from the race.


Page 22: Combining Lazy Learning, Racing and Subsampling for Effective Feature Selection

Racing approach

Squared error

          F1      F2      F3      F4      F5
i=1       0.1     0.3     0.2     0.0     0.05
i=2       0.4     0.6     0.5     0.1     0.2
i=3       0.3     OUT     0.4     0.1     0.4
i=4       0.7             1.2     0.9     0.8
i=5       0.5             1.0     0.4     0.5
                          OUT

After i = 5, F3 is also significantly worse than the survivors and is discarded.


Page 23: Combining Lazy Learning, Racing and Subsampling for Effective Feature Selection

Racing approach

Squared error

          F1      F2      F3      F4      F5
i=1       0.1     0.3     0.2     0.0     0.05
i=2       0.4     0.6     0.5     0.1     0.2
i=3       0.3     OUT     0.4     0.1     0.4
i=4       0.7             1.2     0.9     0.8
i=5       0.5             1.0     0.4     0.5
i=6       2.0             OUT     1.9     2.4
                                          OUT

After i = 6, F5 is discarded as well; only F1 and F4 remain in the race.


Page 24: Combining Lazy Learning, Racing and Subsampling for Effective Feature Selection

Racing approach

Squared error

          F1      F2      F3      F4      F5
i=1       0.1     0.3     0.2     0.0     0.05
i=2       0.4     0.6     0.5     0.1     0.2
i=3       0.3     OUT     0.4     0.1     0.4
i=4       0.7             1.2     0.9     0.8
i=5       0.5             1.0     0.4     0.5
i=6       2.0             OUT     1.9     2.4
i=7       0.1                     0.0     OUT
i=8       4.0                     3.5
i=9       3.2                     3.4
i=10      4.0                     0.2

MSE       OUT                     1.0 (WINNER)

After only 33 training and test procedures (5·2 + 4·3 + 3·1 + 2·4), we have the best candidate: F4.

Page 25: Combining Lazy Learning, Racing and Subsampling for Effective Feature Selection

F-racing for feature selection
• We propose a nonparametric multiple test, the Friedman test [6], to compare different configurations of input variables and to select the ones to be eliminated from the race.
• The use of the Friedman test for racing was first proposed by one of the authors in the context of a technique for comparing metaheuristics for combinatorial optimization problems [3]. This is the first time that the technique is used in a feature selection setting.
• The main merit of this nonparametric approach is that it does not require formulating hypotheses on the distribution of the observations.
• The idea of F-racing is to use blocking and paired multiple tests to compare different models under similar conditions and to discard the worst ones as soon as possible; one elimination step is sketched below.
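A hedged sketch of one elimination step: SciPy's Friedman test is used as the omnibus test and, as a simplification of the exact F-Race post-hoc analysis of [3], candidates are compared against the current best with pairwise (two-sided) Wilcoxon signed-rank tests. Function and variable names (race_step, errors, alive) are illustrative.

    import numpy as np
    from scipy import stats

    def race_step(errors, alive, alpha=0.01):
        # `errors`: leave-one-out errors seen so far (rows = test points,
        # columns = candidate feature subsets); `alive`: surviving columns.
        if len(alive) < 3 or errors.shape[0] < 3:
            return alive                                   # too little evidence yet
        _, p = stats.friedmanchisquare(*(errors[:, j] for j in alive))
        if p >= alpha:
            return alive                                   # no detectable difference
        best = alive[int(np.argmin(errors[:, alive].mean(axis=0)))]
        survivors = [best]
        for j in alive:
            if j == best:
                continue
            diff = errors[:, j] - errors[:, best]
            if not np.any(diff):                           # identical so far: keep
                survivors.append(j)
                continue
            _, p_j = stats.wilcoxon(diff, zero_method="zsplit")
            if p_j >= alpha:                               # not significantly worse
                survivors.append(j)
        return sorted(survivors)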


Page 26: Combining Lazy Learning, Racing and Subsampling for Effective Feature Selection

Sub-sampling and LL
• The goal of feature selection is to find the best subset among a set of alternatives.
• Given a set of alternative subsets, what we expect is a correct ranking of their generalization accuracy (e.g. F2 > F3 > F5 > F1 > F4).
• By subsampling we mean using a random subset of the training set to perform the assessment of the different feature sets.
• The rationale of subsampling is that by reducing the training set size N we deteriorate the accuracy of each single feature subset without affecting their ranking.
• In LL, reducing the training set size N reduces the computational cost.
• This makes the LL approach more competitive; a sketch of subsampled assessment is given below.
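A minimal sketch of subsampled assessment, assuming a generic wrapper criterion score_subset(X, y, subset) such as the local PRESS sketched earlier; only the induced ranking of the candidates is retained.

    import numpy as np

    def subsampled_ranking(X, y, subsets, score_subset, frac=0.2, seed=0):
        # Assess each candidate feature subset on a random fraction of the
        # training set and return the candidates ordered from best to worst.
        rng = np.random.default_rng(seed)
        m = max(1, int(frac * X.shape[0]))
        idx = rng.choice(X.shape[0], size=m, replace=False)    # the subsample
        scores = [score_subset(X[idx], y[idx], s) for s in subsets]
        # The absolute errors are pessimistic because of the reduced N,
        # but the ranking is what drives the selection.
        return [subsets[i] for i in np.argsort(scores)]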


Page 27: Combining Lazy Learning, Racing and Subsampling for Effective Feature Selection

RACSAM for feature selection
We proposed the RACSAM (RACing + SAMpling) algorithm:

1. Define an initial group of promising feature subsets.
2. Start with small training and test sets.
3. Discard by racing all the feature subsets that appear to be significantly worse than the others.
4. Increase the training and test size until at most W winning models remain.
5. Update the group with new candidates proposed by the search strategy and go back to step 3.
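The loop below sketches, under stated assumptions, how these five steps might fit together. It is not the paper's implementation: it reuses the hypothetical race_step from the F-racing sketch above and assumes two further illustrative helpers, loo_error(X, y, subset, i) (the leave-one-out error of point i for a given subset) and propose_new(survivors) (the search strategy injecting new candidates).

    import numpy as np

    def racsam(X, y, candidates, loo_error, propose_new,
               W=5, n0=50, growth=2.0, alpha=0.01, rounds=10, seed=0):
        # Race candidate feature subsets on growing random subsamples,
        # dropping losers early and refilling the pool from the search strategy.
        rng = np.random.default_rng(seed)
        for _ in range(rounds):
            n = min(len(y), n0)
            sample = rng.choice(len(y), size=n, replace=False)     # subsample
            alive = list(range(len(candidates)))
            errors = np.zeros((0, len(candidates)))
            for i in sample:                                       # one test point at a time
                row = [loo_error(X, y, candidates[j], i) if j in alive else np.nan
                       for j in range(len(candidates))]
                errors = np.vstack([errors, row])
                alive = race_step(errors, alive, alpha)            # racing elimination
                if len(alive) <= W:                                # at most W winners left
                    break
            candidates = [candidates[j] for j in alive]
            candidates = candidates + propose_new(candidates)      # refresh the pool
            n0 = int(growth * n0)                                  # grow training/test size
        return candidates[:W]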


Page 28: Combining Lazy Learning, Racing and Subsampling for Effective Feature Selection

Experimental session
• We compare the prediction accuracy of the LL algorithm enhanced by the RACSAM procedure to the accuracy of two state-of-the-art algorithms, an SVM for regression and a regression tree (RTREE).
• Two versions of the RACSAM algorithm were tested: the first (LL-RAC1) takes as feature set the best one (in terms of estimated Mean Absolute Error (MAE)) among the W winning candidates; the second (LL-RAC2) averages the predictions of the best W LL predictors.
• W = 5 and the significance level is 0.01.


Page 29: Combining Lazy Learning, Racing and Subsampling for Effective Feature Selection

Experimental results
Five-fold cross-validation on six real datasets of high dimensionality: Ailerons (N = 14308, n = 40), Pole (N = 15000, n = 48), Elevators (N = 16599, n = 18), Triazines (N = 186, n = 60), Wisconsin (N = 194, n = 32) and Census (N = 22784, n = 137).

Dataset     AIL       POL     ELE       TRI     WIS      CEN
LL-RAC1     9.7e-5    3.12    1.6e-3    0.21    27.39    0.17
LL-RAC2     9.0e-5    3.13    1.5e-3    0.12    27.41    0.16
SVM         1.3e-4    26.5    1.9e-3    0.11    29.91    0.21
RTREE       1.8e-4    8.80    3.1e-3    0.11    33.02    0.17


Page 30: Combining Lazy Learning, Racing and Subsampling for Effective Feature Selection

Statistical significance
• LL-RAC1 vs. LL-RAC2:
  • LL-RAC2 is significantly better than LL-RAC1 3 times out of 6.
  • LL-RAC2 is never significantly worse than LL-RAC1.
• LL-RAC2 vs. state-of-the-art techniques:
  • LL-RAC2 is never significantly worse than SVM and/or RTREE.
  • LL-RAC2 is 5 times out of 6 significantly better than SVM and 6 times out of 6 significantly better than RTREE.


Page 31: Combining Lazy Learning, Racing and Subsampling for Effective Feature Selection

Software
• MATLAB toolbox on Lazy Learning [1].
• R contributed packages:
  • lazy package.
  • racing package.

• Web page: http://iridia.ulb.ac.be/~lazy.

• About 5000 accesses since October 2002.


Page 32: Combining Lazy Learning, Racing and Subsampling for Effective Feature Selection

Conclusions
• Wrapper strategies ask for a huge number of assessments. It is important to make this process faster and less prone to instability.
• Local strategies reduce the computational cost of training models that have to be used for only a few predictions.
• Racing speeds up the evaluation by discarding bad candidates as soon as they appear to be statistically significantly worse than the others.
• Sub-sampling combined with local learning can speed up the training in the preliminary phases, when it is important to discard the largest number of bad candidates.


Page 33: Combining Lazy Learning, Racing and Subsampling for Effective Feature Selection

ULB Machine Learning Group (MLG)
• 7 researchers (1 professor, 6 PhD students), 4 graduate students.
• Research topics: Local learning, Classification, Computational statistics, Data mining, Regression, Time series prediction, Sensor networks, Bioinformatics.
• Computing facilities: cluster of 16 processors, LEGO Robotics Lab.
• Website: www.ulb.ac.be/di/mlg.
• Scientific collaborations in ULB: IRIDIA (Sciences Appliquées), Physiologie Moléculaire de la Cellule (IBMM), Conformation des Macromolécules Biologiques et Bioinformatique (IBMM), CENOLI (Sciences), Microarray Unit (Hôpital Jules Bordet), Service d'Anesthésie (ERASME).
• Scientific collaborations outside ULB: UCL Machine Learning Group (B), Politecnico di Milano (I), Università del Sannio (I), George Mason University (US).
• The MLG is part of the "Groupe de Contact FNRS" on Machine Learning.


Page 34: Combining Lazy Learning, Racing and Subsampling for Effective Feature Selection

ULB-MLG: running projects
1. "Integrating experimental and theoretical approaches to decipher the molecular networks of nitrogen utilisation in yeast": ARC (Action de Recherche Concertée) funded by the Communauté Française de Belgique (2004-2009). Partners: IBMM (Gosselies and La Plaine), CENOLI.
2. "COMP2SYS" (COMPutational intelligence methods for COMPlex SYStems): MARIE CURIE Early Stage Research Training funded by the European Union (2004-2008). Main contractor: IRIDIA (ULB).
3. "Predictive data mining techniques in anaesthesia": FIRST Europe Objectif 1 funded by the Région wallonne and the Fonds Social Européen (2004-2009). Partners: Service d'anesthésie (ERASME).
4. "AIDAR - Adressage et Indexation de Documents Multimédias Assistés par des techniques de Reconnaissance Vocale": funded by the Région Bruxelles-Capitale (2004-2006). Partners: Voice Insight, RTBF, Titan.


Page 35: Combining Lazy Learning, Racing and Subsampling for Effective Feature Selection

References

[1] M. Birattari and G. Bontempi. The Lazy Learning Toolbox, for use with MATLAB. Technical Report TR/IRIDIA/99-7, IRIDIA-ULB, Brussels, Belgium, 1999.

[2] M. Birattari, G. Bontempi, and H. Bersini. Lazy learning meets the recursive least-squares algorithm. In M. S. Kearns, S. A. Solla, and D. A. Cohn, editors, NIPS 11, pages 375-381, Cambridge, 1999. MIT Press.

[3] M. Birattari, T. Stützle, L. Paquete, and K. Varrentrapp. A racing algorithm for configuring metaheuristics. In W. B. Langdon, editor, GECCO 2002, pages 11-18. Morgan Kaufmann, 2002.

[4] G. Bontempi, M. Birattari, and H. Bersini. Lazy learning for modeling and control design. International Journal of Control, 72(7/8):643-658, 1999.

[5] G. Bontempi, M. Birattari, and H. Bersini. A model selection approach for local learning. Artificial Intelligence Communications, 121(1), 2000.

[6] W. J. Conover. Practical Nonparametric Statistics. John Wiley & Sons, New York, NY, USA, third edition, 1999.

[7] I. Guyon and A. Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157-1182, 2003.

[8] O. Maron and A. Moore. The racing algorithm: Model selection for lazy learners. Artificial Intelligence Review, 11(1-5):193-225, 1997.
