Sequential Genetic Search for Ensemble Feature Selection. Alexey Tsymbal, Padraig Cunningham (Department of Computer Science, Trinity College Dublin, Ireland); Mykola Pechenizkiy (Department of Computer Science, University of Jyväskylä, Finland). IJCAI’2005, Edinburgh, Scotland, August 1-5, 2005.
Transcript
Sequential Genetic Search for Ensemble Feature Selection

Alexey Tsymbal, Padraig Cunningham
Department of Computer Science
Trinity College Dublin, Ireland

Mykola Pechenizkiy
Department of Computer Science
University of Jyväskylä, Finland

IJCAI’2005, Edinburgh, Scotland, August 1-5, 2005
IJCAI’2005, Edinburgh, Scotland, August 1-5, 2005. Sequential Genetic Search for Ensemble Feature Selection by Tsymbal A., Pechenizkiy M., Cunningham P.
How to prepare inputs for generation of the base classifiers?
– Sampling the training set
– Manipulation of input features
– Manipulation of output targets (class values)
Goal of traditional feature selection:
– find and remove features that are unhelpful or misleading to learning (making one feature subset for a single classifier)

Goal of ensemble feature selection:
– find and remove features that are unhelpful or destructive to learning, making different feature subsets for a number of classifiers
– find feature subsets that will promote diversity (disagreement) between classifiers
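The diversity (disagreement) mentioned above can be sketched as the fraction of instances on which two classifiers' predictions differ, averaged over all pairs. This is a plain pairwise disagreement measure for illustration; the exact F/N-F disagreement variant used in the experiments may differ in detail.

```python
def disagreement(preds_a, preds_b):
    """Pairwise diversity: fraction of instances on which two base
    classifiers predict different labels."""
    assert len(preds_a) == len(preds_b)
    return sum(a != b for a, b in zip(preds_a, preds_b)) / len(preds_a)

def ensemble_diversity(all_preds):
    """Average pairwise disagreement over all classifier pairs."""
    pairs = [(i, j) for i in range(len(all_preds))
             for j in range(i + 1, len(all_preds))]
    return sum(disagreement(all_preds[i], all_preds[j]) for i, j in pairs) / len(pairs)
```

Higher values mean the classifiers err on different instances, which is exactly what ensemble feature selection tries to promote.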
Basic Idea behind GA for EFS

[Figure: schematic of the GA-based ensemble feature selection loop. RSM initializes the current population of feature subsets (evaluated for diversity); base classifiers BC_1 … BC_i are built; the new population is selected by fitness; the final ensemble of the given size is generated. Fitness combines accuracy and diversity: Fitness_i = acc_i + α·div_i]
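The fitness of each candidate feature subset combines the accuracy of its base classifier with its diversity, Fitness_i = acc_i + α·div_i. A minimal sketch, with α as a free parameter (its values per data set are discussed later in the talk):

```python
def fitness(acc: float, div: float, alpha: float) -> float:
    """Fitness of one candidate feature subset:
    accuracy plus alpha-weighted diversity (Fitness_i = acc_i + alpha * div_i)."""
    return acc + alpha * div
```

For example, with alpha = 1.0, a subset whose classifier has accuracy 0.75 and diversity 0.25 gets fitness 1.0.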
Basic Idea behind GAS-SEFS

[Figure: schematic of GAS-SEFS. Base classifiers BC_1 … BC_i are already fixed in the ensemble; genetic process GA_{i+1} searches for the next classifier BC_{i+1}. RSM generates the current population (evaluated for accuracy); diversity with respect to the ensemble built so far enters the fitness, Fitness_i = acc_i + α·div_i; the new BC is selected by fitness]
GAS-SEFS 1 of 2

GAS-SEFS (Genetic Algorithm-based Sequential Search for Ensemble Feature Selection):
– instead of maintaining a set of feature subsets in each generation like GA, it applies a series of genetic processes, one for each base classifier, sequentially;
– after each genetic process, one base classifier is selected into the ensemble;
– GAS-SEFS uses the same fitness function, but
  • diversity is calculated with respect to the base classifiers already formed by previous genetic processes,
  • in the first genetic process, accuracy only is used;
– GAS-SEFS uses the same genetic operators as GA.
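The sequential strategy above can be sketched as follows. This is an illustrative skeleton, not the paper's implementation: `run_genetic_process` uses random subsets in place of real crossover and mutation, and the `accuracy`/`diversity` callables stand in for training and evaluating base classifiers.

```python
import random

def run_genetic_process(n_features, fitness, n_generations=10, pop_size=10, rng=None):
    """One genetic process: evolve a population of feature subsets and
    return the fittest. Random candidates stand in for crossover/mutation."""
    rng = rng or random.Random()
    def random_subset():
        k = rng.randint(1, n_features - 1)   # full feature sets are not allowed
        return frozenset(rng.sample(range(n_features), k))
    population = [random_subset() for _ in range(pop_size)]
    for _ in range(n_generations):
        children = [random_subset() for _ in range(pop_size)]
        population = sorted(population + children, key=fitness, reverse=True)[:pop_size]
    return population[0]

def gas_sefs(n_features, ensemble_size, accuracy, diversity, alpha=1.0, rng=None):
    """GAS-SEFS: a series of genetic processes, one per base classifier.
    The first process scores by accuracy only; later processes add
    diversity with respect to the members already selected."""
    ensemble = []
    for _ in range(ensemble_size):
        def fitness(subset):
            div = diversity(subset, ensemble) if ensemble else 0.0
            return accuracy(subset) + alpha * div
        ensemble.append(run_genetic_process(n_features, fitness, rng=rng))
    return ensemble
```

Note how the fitness closure changes between processes: each new classifier is rewarded for disagreeing with the ensemble built so far, which is the key difference from the plain GA strategy.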
GAS-SEFS 2 of 2

GA and GAS-SEFS peculiarities:
– full feature sets are not allowed in RSM;
– the crossover operator may not produce a full feature subset;
– individuals for crossover are selected randomly, proportional to log(1+fitness) instead of just fitness;
– the generation of children identical to their parents is prohibited;
– to provide better diversity in the length of feature subsets, two different mutation operators are used:
  • Mutate1_0 deletes features randomly with a given probability;
  • Mutate0_1 adds features randomly with a given probability.
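A minimal sketch of the two mutation operators and the log(1+fitness) parent selection. The guards against empty and full subsets are assumptions consistent with the "full feature sets are not allowed" rule above; function names are illustrative.

```python
import math
import random

def mutate_1_0(subset, n_features, p):
    """Mutate1_0: delete each selected feature with probability p."""
    kept = {f for f in subset if random.random() >= p}
    return kept or set(subset)          # never return an empty subset

def mutate_0_1(subset, n_features, p):
    """Mutate0_1: add each unselected feature with probability p."""
    added = {f for f in range(n_features) if f not in subset and random.random() < p}
    new = set(subset) | added
    return new if len(new) < n_features else set(subset)  # full subsets not allowed

def select_parent(population, fitnesses):
    """Roulette-wheel selection weighted by log(1 + fitness)."""
    weights = [math.log1p(f) for f in fitnesses]
    return random.choices(population, weights=weights, k=1)[0]
```

Weighting by log(1+fitness) flattens the selection pressure, so slightly fitter individuals do not dominate the mating pool as strongly as under plain fitness-proportional selection.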
Computational complexity

Complexity of GA-based search does not depend on the number of features.

GA: S'·N_gen evaluations; GAS-SEFS: S·S'·N_gen evaluations,

where S is the number of base classifiers, S' is the number of individuals (feature subsets) in one generation, and N_gen is the number of generations.

EFSS and EBSS: on the order of S·N·N' evaluations,

where S is the number of base classifiers, N is the total number of features, and N' is the number of features included or deleted on average in an FSS or BSS search.
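Plugging in the experimental settings reported later in the talk (S' = 40 evaluated subsets per generation, N_gen = 10 generations, S = 10 base classifiers) reproduces the 400 vs. 4000 feature-subset counts:

```python
S_prime, N_gen, S = 40, 10, 10    # subsets/generation, generations, base classifiers

ga_evaluations = S_prime * N_gen            # one genetic process (plain GA)
gas_sefs_evaluations = S * ga_evaluations   # one process per base classifier

print(ga_evaluations, gas_sefs_evaluations)  # 400 4000
```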
Integration of classifiers

Motivation for dynamic integration: each classifier is best in some sub-areas of the whole data set, where its local error is comparatively lower than the corresponding errors of the other classifiers.

[Figure: taxonomy of integration methods along two axes, static vs. dynamic and selection vs. combination: Static Selection (CVM), Weighted Voting (WV), Dynamic Selection (DS), Dynamic Voting with Selection (DVS)]
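A minimal sketch of the dynamic selection idea: estimate each classifier's local error from its mistakes on the k nearest validation instances and delegate the prediction to the locally best one. The data layout and names here are illustrative assumptions, not the paper's implementation.

```python
import math

def dynamic_selection(classifiers, x, validation, k=3):
    """Predict x with the classifier that has the lowest local error.
    validation: list of (instance, errors) pairs, where errors[j] is 1
    if classifier j misclassified that validation instance, else 0."""
    neighbours = sorted(validation, key=lambda ve: math.dist(ve[0], x))[:k]
    local_error = [sum(errors[j] for _, errors in neighbours) / k
                   for j in range(len(classifiers))]
    best = min(range(len(classifiers)), key=lambda j: local_error[j])
    return classifiers[best](x)
```

DVS follows the same pattern but lets all classifiers whose local error is close to the best one vote, instead of selecting a single classifier.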
Experimental Design

Parameter settings for GA and GAS-SEFS:
– mutation rate: 50%;
– population size: 10;
– search length of 40 feature subsets/individuals per generation:
  • 20 are offspring of the current population of 10 classifiers, generated by crossover,
  • 20 are mutated offspring (10 with each mutation operator);
– 10 generations of individuals were produced;
– 400 (GA) and 4000 (GAS-SEFS) feature subsets in total.

To evaluate GA and GAS-SEFS:
– 5 integration methods;
– Simple Bayes as the base classifier;
– stratified random sampling with 60%/20%/20% of instances in the training/validation/test sets;
– 70 test runs on each of 21 UCI data sets for each strategy and diversity measure.
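The stratified 60/20/20 split can be sketched with the standard library: shuffle the indices of each class separately and slice them in the target proportions, so every set preserves the class distribution. A minimal sketch, not the paper's exact sampling code:

```python
import random
from collections import defaultdict

def stratified_split(labels, fractions=(0.6, 0.2, 0.2), seed=0):
    """Return train/validation/test index lists, stratified by class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train, valid, test = [], [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        a = int(fractions[0] * len(idxs))           # end of training slice
        b = a + int(fractions[1] * len(idxs))       # end of validation slice
        train += idxs[:a]; valid += idxs[a:b]; test += idxs[b:]
    return train, valid, test
```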
GA vs GAS-SEFS on two groups of data sets

[Figure: ensemble accuracies (GA_gr1, GAS-SEFS_gr1, GA_gr2, GAS-SEFS_gr2) for GA and GAS-SEFS on two groups of data sets, (1) fewer than 9 features and (2) 9 or more features, for four ensemble sizes (3, 5, 7, 10); accuracy axis 0.810 to 0.840; integration method DVS, F/N-F disagreement diversity]
GA vs GAS-SEFS for Five Integration Methods

[Figure: ensemble accuracies for five integration methods (SS, WV, DS, DV, DVS) on Tic-Tac-Toe, GA vs GAS-SEFS, ensemble size = 10; accuracy axis 0.65 to 0.95]
Conclusions and Future Work

Diversity in an ensemble of classifiers is very important.
We have considered two genetic search strategies for EFS.
The new strategy, GAS-SEFS, consists in employing a series of genetic search processes, one for each base classifier.
GAS-SEFS results in better ensembles with greater accuracy:
– especially for data sets with relatively larger numbers of features;
– one reason: each of the core GA processes leads to significant overfitting of the corresponding ensemble member.
GAS-SEFS is significantly more time-consuming than GA:
– GAS-SEFS = ensemble_size × GA.
[Oliveira et al., 2003] report better results for single FSS based on Pareto-front dominating solutions:
– adaptation of this technique to EFS is an interesting topic for further research.
Thank you!

Alexey Tsymbal, Padraig Cunningham
Dept of Computer Science
References
• [Kuncheva, 1993] Ludmila I. Kuncheva. Genetic algorithm for feature selection for parallel classifiers, Information Processing Letters 46: 163-168, 1993.
• [Kuncheva and Jain, 2000] Ludmila I. Kuncheva and Lakhmi C. Jain. Designing classifier fusion systems by genetic algorithms, IEEE Transactions on Evolutionary Computation 4(4): 327-336, 2000.
• [Oliveira et al., 2003] Luiz S. Oliveira, Robert Sabourin, Flavio Bortolozzi, and Ching Y. Suen. A methodology for feature selection using multi-objective genetic algorithms for handwritten digit string recognition, International Journal of Pattern Recognition and Artificial Intelligence 17(6): 903-930, 2003.
• [Opitz, 1999] David Opitz. Feature selection for ensembles. In Proceedings of the 16th National Conference on Artificial Intelligence, pages 379-384, 1999, AAAI Press.
GAS-SEFS Algorithm

[Figure: the GAS-SEFS algorithm in pseudocode]
Other interesting findings

• α (the diversity weight in the fitness function):
– values were different for different data sets;
– for both GA and GAS-SEFS, α for the dynamic integration methods is bigger than for the static ones (2.2 vs 0.8 on average);
– GAS-SEFS needs slightly higher values of α than GA (1.8 vs 1.5 on average).
• GAS-SEFS always starts with a classifier based on accuracy only, and the subsequent classifiers need more diversity than accuracy.
• The number of selected features falls as the ensemble size grows:
– this is especially clear for GAS-SEFS, as the base classifiers need more diversity.
• Integration methods (for both GA and GAS-SEFS):
– the static methods, SS and WV, and the dynamic DS start to overfit the validation set after only 5 generations and show lower accuracies;
– accuracies of DV and DVS continue to grow up to 10 generations.
Paper Summary

[Figure: ensemble accuracies (0.810 to 0.840) for GA and GAS-SEFS on the two groups of data sets for ensemble sizes 3, 5, 7 and 10]

• A new strategy for genetic ensemble feature selection, GAS-SEFS, is introduced.
• In contrast with the previously considered algorithm (GA), it is sequential: a series of genetic processes, one for each base classifier.
• It is more time-consuming, but gives better accuracy.
• Each base classifier has a considerable level of overfitting with GAS-SEFS, but the ensemble accuracy grows.
• Experimental comparisons demonstrate clear superiority on 21 UCI datasets, especially for data sets with many features (the group-1 vs. group-2 comparison).
Simple Bayes as Base Classifier

Bayes theorem:
P(C|X) = P(X|C)·P(C) / P(X)

Naïve assumption: attribute independence
P(x1,…,xk|C) = P(x1|C)·…·P(xk|C)

If the i-th attribute is categorical: P(xi|C) is estimated as the relative frequency of samples having value xi as the i-th attribute in class C.
If the i-th attribute is continuous: P(xi|C) is estimated through a Gaussian density function.
Computationally easy in both cases.
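The categorical case above can be sketched in a few lines: estimate P(C) and per-class relative frequencies, then take the class maximizing P(C)·∏ P(xi|C). Laplace smoothing is added here as an illustrative detail (continuous attributes would use a fitted Gaussian density instead, omitted for brevity):

```python
import math
from collections import Counter, defaultdict

def train_simple_bayes(X, y):
    """Estimate class priors P(C) and per-class, per-attribute value counts
    (categorical attributes only)."""
    priors = {c: cnt / len(y) for c, cnt in Counter(y).items()}
    freq = defaultdict(Counter)          # (class, attribute index) -> value counts
    for xs, c in zip(X, y):
        for i, v in enumerate(xs):
            freq[(c, i)][v] += 1
    return priors, freq

def predict(priors, freq, xs):
    """argmax_C P(C) * prod_i P(x_i | C), computed in log space
    with add-one (Laplace) smoothing."""
    def log_post(c):
        lp = math.log(priors[c])
        for i, v in enumerate(xs):
            counts = freq[(c, i)]
            lp += math.log((counts[v] + 1) / (sum(counts.values()) + len(counts) + 1))
        return lp
    return max(priors, key=log_post)
```

Training is a single pass over the data and prediction is a product of lookups, which is why simple Bayes is cheap enough to serve as the base classifier inside a genetic search that trains hundreds of classifiers.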
Dataset characteristics

Data set            Instances  Classes  Categ. feat.  Num. feat.
Balance                   625        3             0           4
Breast Cancer             286        2             9           0
Car                      1728        4             6           0
Diabetes                  768        2             0           8
Glass Recognition         214        6             0           9
Heart Disease             270        2             0          13
Ionosphere                351        2             0          34
Iris Plants               150        3             0           4
LED                       300       10             7           0
LED17                     300       10            24           0
Liver Disorders           345        2             0           6
Lymphography              148        4            15           3
MONK-1                    432        2             6           0
MONK-2                    432        2             6           0
MONK-3                    432        2             6           0
Soybean                    47        4             0          35
Thyroid                   215        3             0           5
Tic-Tac-Toe               958        2             9           0
Vehicle                   846        4             0          18
Voting                    435        2            16           0
Zoo                       101        7            16           0
GA vs GAS-SEFS for Five Integration Methods

[Figure: ensemble accuracies (0.65 to 0.95) for GA (left) and GAS-SEFS (right) for five integration methods (SS, WV, DS, DV, DVS) and four ensemble sizes (3, 5, 7, 10) on Tic-Tac-Toe]