Feature Selection: Concepts and Methods
Electronic & Computer Department, Isfahan University of Technology
Reza Ramezani
Jan 20, 2015
What are Features?
Features are the attributes whose values make up an instance.
With features we can identify instances.
Features are the determinant values that decide which class an instance belongs to.
Classifying Features
Relevance: features that have an influence on the output and whose role cannot be assumed by the rest.
Irrelevance: features that have no influence on the output; their values are generated at random for each example.
Redundancy: a redundancy exists whenever a feature can take over the role of another.
What is Feature Selection?
Feature selection is a preprocessing step to machine learning that chooses a subset of the original features according to a certain evaluation criterion. It is effective in:
• removing or reducing the effect of irrelevant data
• removing redundant data
• reducing dimensionality (binary model)
• increasing learning accuracy
• improving result comprehensibility
Other Definitions
A process that selects a subset of features, defined by one of three approaches:
1) the subset of a specified size that optimizes an evaluation measure
2) the smallest subset that satisfies a certain restriction on the evaluation measure
3) the subset with the best compromise between its size and the value of its evaluation measure (general case)
Feature Selection Algorithm (FSA)
An FSA is a computational solution motivated by a certain definition of relevance.
1) The relevance of a feature may have several definitions, depending on the objective being pursued.
2) In the general case, an FSA finds a compromise between minimizing the subset size and maximizing the evaluation measure.
3) An irrelevant feature is not useful for induction, but not all relevant features are necessarily useful for induction.
Classifying FSAs
FSAs can be classified according to the kind of output they yield:
1) Algorithms that give a weighted linear order of features (continuous feature selection problem).
2) Algorithms that give a subset of the original features (binary feature selection problem).
Note that both types can be seen in a unified way by noting that in (2) the weighting is binary.
Notation
• X = feature set; X' = a feature subset; xᵢ = a feature
• I = instances; E = instance space
• p = probability distribution on E
• W = space of labels (e.g. classes)
• c = objective function c: E → W that labels each instance according to its relevant features (the classifier)
• S = data set (training set)
Relevance of a Feature
The purpose of an FSA is to identify relevant features according to a definition of relevance.
Unfortunately, the notion of relevance in machine learning has not yet been rigorously defined by common agreement.
Let us define relevance from several aspects:
Relevance with Respect to an Objective
A feature xᵢ is relevant to the objective function c if there exist two examples A and B in the instance space E such that A and B differ only in their assignment to xᵢ and c(A) ≠ c(B).
Strong Relevance with Respect to S
A feature xᵢ is strongly relevant to the sample S if there exist two examples A and B in S that differ only in their assignment to xᵢ and have different labels.
That is the same as the last definition, but now A, B ∈ S and the definition is with respect to S.
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
(Refund and Marital Status are categorical, Taxable Income is continuous, Cheat is the class.)
[Figure: a decision tree for this data — Refund (Yes → NO; No → MarSt), MarSt (Married → NO; Single, Divorced → TaxInc), TaxInc (< 80K → NO; > 80K → YES).]
Strong Relevance with Respect to p
A feature xᵢ is strongly relevant to the objective c in the distribution p if there exist two examples A and B with p(A) > 0 and p(B) > 0 that differ only in their assignment to xᵢ and c(A) ≠ c(B).
This definition is the natural extension of the last one but, contrary to it, the distribution is assumed to be known.
Weak Relevance with Respect to S
A feature xᵢ is weakly relevant to the sample S if there exists at least one proper subset X' ⊂ X containing xᵢ for which xᵢ is strongly relevant with respect to S.
A weakly relevant feature can appear when a subset containing at least one strongly relevant feature is removed.
Weak Relevance with Respect to p
A feature xᵢ is weakly relevant to the objective c in the distribution p if there exists at least one proper subset X' ⊂ X containing xᵢ for which xᵢ is strongly relevant with respect to p.
These five definitions are important in deciding which features should be conserved and which can be eliminated.
Strongly Relevant Features
Strongly relevant features are, in theory, important for maintaining the structure of the domain.
They should be conserved by any feature selection algorithm in order to avoid adding ambiguity to the sample.
Weakly Relevant Features
Weakly relevant features could be important or not, depending on:
The other features already selected.
The evaluation measure that has been chosen (accuracy, simplicity, consistency, etc.).
Relevance as a Complexity Measure
Define r(S, c) as the smallest number of features relevant to c such that the error in S is the least possible for the inducer.
In other words, it is the smallest number of features required by a specific inducer to reach optimum performance in the task of modeling c using S.
Incremental Usefulness
Given a data sample S, a learning algorithm L, and a subset of features X':
The feature xᵢ is incrementally useful to L with respect to X' if the accuracy that L reaches using the group of features {xᵢ} ∪ X' is better than the accuracy reached using only the subset X'.
Example
    X1.........X11..........X21.........X30
    100000000000000000000000000000  +
    111111111100000000000000000000  +
    000000000011111111110000000000  +
    000000000000000000001111111111  +
    000000000000000000000000000000  –
X1 is strongly relevant; the rest are weakly relevant. r(S, c) = 3.
Incremental usefulness: after choosing {X1, X2}, none of X3…X10 would be incrementally useful, but any of X11…X30 would.
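The strong-relevance claim in this example can be checked mechanically. Below is a small sketch (not part of the original slides) that flags a feature as strongly relevant with respect to S when removing it leaves two identical instances with different classes:

```python
# Hedged sketch: a feature is "strongly relevant" w.r.t. sample S if
# dropping it makes two instances identical while their classes differ.

def strongly_relevant_features(rows, labels):
    """Return indices of features whose removal makes S inconsistent."""
    n = len(rows[0])
    relevant = []
    for i in range(n):
        reduced = [tuple(r[:i] + r[i + 1:]) for r in rows]
        seen = {}
        for inst, lab in zip(reduced, labels):
            if seen.setdefault(inst, lab) != lab:
                relevant.append(i)  # collision with a different class
                break
    return relevant

# The five instances from the example (X1..X30), classes + / -.
rows = [
    [1] + [0] * 29,
    [1] * 10 + [0] * 20,
    [0] * 10 + [1] * 10 + [0] * 10,
    [0] * 20 + [1] * 10,
    [0] * 30,
]
labels = ["+", "+", "+", "+", "-"]
print(strongly_relevant_features(rows, labels))  # → [0]  (only X1)
```

Removing X1 makes the first and last instances identical (all zeros) while their classes differ; removing any other single feature leaves S consistent, matching the slide.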
General Schemes for Feature Selection
The relationship between an FSA and the inducer (the process chosen to evaluate the usefulness of the features, i.e. the learning process) defines three schemes:
• Filter scheme
• Wrapper scheme
• Embedded scheme
Filter Scheme
The feature selection process takes place before the induction step.
This scheme is independent of the induction algorithm.
• High speed
• Low accuracy
Wrapper Scheme
Uses the learning algorithm as a subroutine to evaluate the feature subsets.
The inducer must be known.
• Low speed
• High accuracy
Embedded Scheme
Similar to the wrapper approach: features are specifically selected for a certain inducer.
The inducer selects the features in the process of learning (explicitly or implicitly).
[Figure: decision tree — Refund (Yes → NO; No → MarSt), MarSt (Married → NO; Single, Divorced → TaxInc), TaxInc (< 80K → NO; > 80K → YES).]
Embedded Scheme Example
Refund  Marital Status  Taxable Income  Age  Cheat
Yes     Single          125K            18   No
No      Married         100K            30   No
No      Single          70K             28   No
Yes     Married         120K            19   No
No      Divorced        95K             18   Yes
No      Married         60K             20   No
Yes     Divorced        220K            25   No
No      Single          85K             30   Yes
No      Married         75K             20   No
No      Single          90K             18   Yes
(Refund and Marital Status are categorical; Taxable Income and Age are continuous; Cheat is the class.)
A decision tree maker algorithm will automatically remove the 'Age' feature.
Characterization of FSAs
Search Organization: the general strategy with which the space of hypotheses is explored.
Generation of Successors: the mechanism by which possible successor candidates of the current state are proposed.
Evaluation Measure: the function by which successor candidates are evaluated.
Types of Search Organization
We consider three types of search:
Exponential
Sequential
Random
Exponential Search
Algorithms that carry out searches with cost O(2^|X|).
The best solution is guaranteed. Exhaustive search is an optimal search, but an optimal search need not be exhaustive:
• Branch and Bound, for a monotonic evaluation measure
• A* search with an admissible heuristic
• A measure J is monotonic if for any two subsets X'₁ ⊆ X'₂, J(X'₁) ≤ J(X'₂).
Sequential Search
This strategy selects one among all the successors of the current state.
Once the state is selected, it is not possible to go back. The number of such steps is bounded by |X|.
Let k be the number of subsets evaluated in each state change. The cost of this search is therefore polynomial, O(k·|X|).
These methods do not guarantee an optimal result.
Random Search
Uses randomness to keep the algorithm from getting stuck in a local minimum.
Allows temporary moves to other states with worse solutions.
These are anytime algorithms.
They can give several optimal subsets as the solution.
Types of Successor Generation
• Forward
• Backward
• Compound
• Weighting
• Random
Forward Successor Generation
Starts with X' = ∅.
Adds features to the current solution X', among those not yet selected.
In each step, the feature that makes J(X') greatest is added to the solution.
The cost of the operator is O(|X|).
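The forward operator described above amounts to greedy hill-climbing on J. A sketch follows; the evaluation measure J, its merit values, and the redundancy set are invented purely for illustration:

```python
def forward_selection(features, J, k):
    """Greedy forward generation: start from the empty set and at each
    step add the not-yet-selected feature that maximizes J(subset)."""
    selected = []
    remaining = list(features)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda f: J(selected + [f]))
        if J(selected + [best]) <= J(selected):
            break  # stopping criterion: J no longer increases
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy evaluation measure (an assumption, for illustration only):
# each feature has an individual merit; a redundant pair adds nothing
# beyond its stronger member.
merit = {"a": 3.0, "b": 2.0, "c": 2.0, "d": 0.5}
redundant = {frozenset(("b", "c"))}

def J(subset):
    score = sum(merit[f] for f in subset)
    for pair in redundant:
        if pair <= set(subset):
            score -= min(merit[f] for f in pair)
    return score

print(forward_selection("abcd", J, 3))  # → ['a', 'b', 'd']
```

Note how the greedy search skips 'c' once 'b' is in the solution: adding a redundant feature no longer increases J, which is exactly the behavior the forward operator relies on.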
Backward Successor Generation
Starts with X' = X.
Removes features from the current solution X', among those not yet removed.
In each step, the feature whose removal makes J(X') greatest is removed from the solution.
The cost of the operator is O(|X|).
Forward and Backward Methods, Stopping Criteria
• |X'| reaches a size fixed in advance.
• The value of J has not increased in the last k steps.
• The value of J has surpassed a prefixed value J₀.
In practice, the backward method demands more computation than its forward counterpart.
Compound Successor Generation
Applies f consecutive forward steps and b consecutive backward ones.
If f > b the net result is a forward operator; otherwise it is a backward one.
This method allows the discovery of new interactions among features.
Other stopping conditions should be established if f = b.
In a sequential FSA, the condition f ≠ b assures that the number of net steps is bounded, keeping the total cost polynomial.
Weighting Successor Generation
Weighting operators address the continuous feature selection problem:
all of the features are present in the solution to a certain degree.
A successor state is a state with a different weighting.
This is typically done by iteratively sampling the available set of instances.
Random Successor Generation
Includes those operators that can potentially generate any other state in a single step.
Restricted to some criterion of advance:
• in the number of features
• in improving the measure J at each step
Evaluation Measures
• Probability of Error
• Divergence
• Dependence
• Interclass Distance
• Information or Uncertainty
• Consistency
The relative values assigned to different subsets reflect their greater or lesser relevance to the objective function.
Let J(X') be an evaluation measure to be maximized, where X' is a (weighted) feature subset.
Evaluation Measures: Probability of Error
The ultimate goal is to build a classifier that minimizes the (Bayesian) probability of error.
The probability of error Pe of the classifier therefore seems the most natural choice of measure.
Evaluation Measures: Probability of Error
Since the class-conditional densities are usually unknown, they can either be explicitly modeled (using parametric or non-parametric methods) or the error can be estimated empirically from data.
Evaluation Measures: Probability of Error
Provided the classifier has been built using only a subset X' of the features, we have:
• T is a test data sample; T_c is the subset of T on which the classifier performed correctly.
Finally: Pe(X') = 1 − |T_c| / |T|.
Evaluation Measures: Divergence
These measures compute a probabilistic distance or divergence among the class-conditional probability densities P(x | class), using a general formula of the form J = ∫ f(P(x | c₁), P(x | c₂)) dx.
Evaluation Measures: Divergence
For a valid measure, the function f must be such that the value of J satisfies the following conditions:
1) J = 0 only when the class-conditional densities are equal
2) J is maximum when they are non-overlapping
If the features used in a solution are good ones, the divergence will be significant.
Divergence: Some Classical Choices
[Formulas omitted; classical choices include, e.g., the Kullback–Leibler divergence and the Bhattacharyya distance.]
Evaluation Measures: Dependence
These measures quantify how strongly two features are associated with one another:
knowing the value of one feature, it is possible to predict the value of the other.
The correlation coefficient is a classical measure still in use for these methods.
Evaluation Measures: Interclass Distance
These measures are based on the assumption that instances of different classes are distant in the instance space.
A typical measure averages the pairwise distances between instances of different classes, with xᵢ⁽ᶜ⁾ being the i-th instance of class c and n_c the number of instances of class c.
The most usual distances belong to the Euclidean family.
Evaluation Measures: Consistency
An inconsistency in X' and S is defined as: two instances in S that are equal when considering only the features in X', but that belong to different classes.
The aim is thus to find the minimum subset of features leading to zero inconsistencies.
Evaluation Measures: Consistency
The inconsistency count of an instance A is defined as IC(A) = n(A) − max_c n_c(A), where:
• n(A) is the number of instances in S equal to A using only the features in X'
• n_c(A) is the number of instances in S of class c equal to A using only the features in X'
Evaluation Measures: Consistency
The inconsistency rate of a feature subset X' in a sample S is IR(X') = Σ_A IC(A) / |S|, summing over the distinct instances A of S.
This measure lies in [0, 1] and must be minimized.
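The inconsistency rate defined above has a direct implementation. A sketch, assuming discrete feature values:

```python
from collections import Counter, defaultdict

def inconsistency_rate(instances, labels, subset):
    """IR(X') = sum over matching groups of (group size - majority class
    count), divided by |S|; lies in [0, 1] and is to be minimized."""
    groups = defaultdict(Counter)
    for inst, lab in zip(instances, labels):
        key = tuple(inst[i] for i in subset)  # project onto subset X'
        groups[key][lab] += 1
    inconsistencies = sum(sum(c.values()) - max(c.values())
                          for c in groups.values())
    return inconsistencies / len(instances)

data = [(0, 0), (0, 0), (0, 1), (1, 1)]
labels = ["no", "no", "yes", "yes"]
print(inconsistency_rate(data, labels, [0, 1]))  # → 0.0
print(inconsistency_rate(data, labels, [0]))     # → 0.25
```

Using both features the sample is fully consistent; projecting onto feature 0 alone merges a "no" pair with a "yes" instance, giving one inconsistency out of four instances.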
General Algorithm for Feature Selection
Every FSA can be represented in a space of characteristics according to the criteria of:
• search organization (Org)
• generation of successor states (GS)
• evaluation measure (J)
This space <Org, GS, J> encompasses the whole spectrum of possibilities for an FSA.
An FSA is hybrid when it requires more than one point on the same coordinate to be characterized.
FCBF: Fast Correlation-Based Filter
(Filter Mode)
<Sequential, Compound, Information>
Previous Works and Their Defects
1) Huge time complexity
Binary mode: subset search algorithms search through candidate feature subsets, guided by a certain search strategy and an evaluation measure.
Different search strategies, namely exhaustive, heuristic, and random search, are combined with this evaluation measure to form different algorithms.
Previous Works and Their Defects
The time complexity is exponential in the data dimensionality for exhaustive search, and quadratic for heuristic search.
The complexity can be linear in the number of iterations for random search, but experiments show that, to find the best feature subset, the number of iterations required is usually at least quadratic in the number of features.
Previous Works and Their Defects
2) Inability to recognize redundant features
Relief: the key idea of Relief is to estimate the relevance of features according to how well their values distinguish between instances of the same and of different classes that are near each other.
Relief randomly samples a number m of instances from the training set and updates the relevance estimate of each feature based on the difference between the selected instance and its two nearest instances of the same and of the opposite class.
Previous Works and Their Defects
The time complexity of Relief for a data set with M instances and N features is O(mMN).
With m a constant, the time complexity becomes O(MN), which makes Relief very scalable to data sets with both a huge number of instances and a very high dimensionality.
However, Relief does not help with removing redundant features.
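The Relief procedure described above fits in a few lines. This is an illustrative re-implementation (binary classes, numeric features, m sampled instances), not the original authors' code:

```python
import random

def relief(X, y, m=20, seed=0):
    """Relief sketch: sample m instances, find each one's nearest hit
    (same class) and nearest miss (other class), and update each feature
    weight by +diff(miss)/m - diff(hit)/m."""
    rng = random.Random(seed)
    n_feat = len(X[0])
    # Normalize per-feature ranges so diffs are comparable across features.
    ranges = [max(max(r[j] for r in X) - min(r[j] for r in X), 1e-12)
              for j in range(n_feat)]
    diff = lambda j, a, b: abs(a[j] - b[j]) / ranges[j]
    dist = lambda a, b: sum(diff(j, a, b) for j in range(n_feat))
    w = [0.0] * n_feat
    for _ in range(m):
        i = rng.randrange(len(X))
        hit = min((k for k in range(len(X)) if k != i and y[k] == y[i]),
                  key=lambda k: dist(X[i], X[k]))
        miss = min((k for k in range(len(X)) if y[k] != y[i]),
                   key=lambda k: dist(X[i], X[k]))
        for j in range(n_feat):
            w[j] += (diff(j, X[i], X[miss]) - diff(j, X[i], X[hit])) / m
    return w

# Feature 0 separates the classes; feature 1 is noise.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 0, 1, 1]
w = relief(X, y, m=40)
print(w[0] > w[1])  # → True
```

Note that features 0 and 1 here are not redundant; if they were identical copies, Relief would assign them identical high weights, which is exactly the defect the slide points out.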
Good Feature
A feature is good if it is relevant to the class concept but not redundant to any of the other relevant features.
Correlation as a goodness measure: a feature is good if it is highly correlated with the class but not highly correlated with any of the other features.
Approaches to Measure The Correlation
Classical Linear Correlation (Linear Correlation Coefficient)
Information theory (Entropy or Uncertainty)
Linear Correlation Coefficient
For a pair of variables (X, Y), the linear correlation coefficient r is given by the formula
r = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / √( Σᵢ (xᵢ − x̄)² · Σᵢ (yᵢ − ȳ)² )
If X and Y are completely correlated, r takes the value 1 or −1.
If X and Y are totally independent, r is zero.
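The coefficient is straightforward to compute directly from its definition; a minimal sketch:

```python
import math

def pearson_r(x, y):
    """Linear correlation coefficient: covariance of X and Y divided by
    the product of their standard deviations (up to a common factor n)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

print(round(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]), 6))   # → 1.0
print(round(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]), 6))   # → -1.0
```

A perfectly linear relationship yields r = ±1; uncorrelated data yields values near 0, matching the two limiting cases on the slide.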
Advantages
• It helps remove features with near-zero linear correlation to the class.
• It helps reduce redundancy among the selected features.
Disadvantages
• It may not capture correlations that are not linear in nature.
• The calculation requires all features to contain numerical values.
Entropy
The entropy of a variable (feature) X is defined as
H(X) = −Σᵢ P(xᵢ) log₂ P(xᵢ)
The entropy of X after observing values of another variable Y is defined as
H(X|Y) = −Σⱼ P(yⱼ) Σᵢ P(xᵢ|yⱼ) log₂ P(xᵢ|yⱼ)
Entropy, Information Gain
The amount by which the entropy of X decreases reflects the additional information about X provided by Y:
IG(X|Y) = H(X) − H(X|Y)
Feature Y is regarded as more correlated to feature X than to feature Z if IG(X|Y) > IG(Z|Y).
Information gain is symmetrical for two random variables X and Y: IG(X|Y) = IG(Y|X).
Entropy, Symmetrical Uncertainty
Information gain is biased in favor of features with more values, so it must be normalized:
SU(X, Y) = 2 · IG(X|Y) / (H(X) + H(Y))
SU(X, Y) values are normalized to the range [0, 1]:
• the value 1 indicates that knowledge of the value of either variable completely predicts the value of the other
• the value 0 indicates that X and Y are independent
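The three quantities above (entropy, conditional entropy, symmetrical uncertainty) can be sketched for nominal features as follows:

```python
from collections import Counter
from math import log2

def entropy(xs):
    """H(X) = -sum_i P(x_i) * log2 P(x_i), with empirical probabilities."""
    n = len(xs)
    return -sum((c / n) * log2(c / n) for c in Counter(xs).values())

def conditional_entropy(xs, ys):
    """H(X|Y) = sum_j P(y_j) * H(X | Y = y_j)."""
    n = len(ys)
    by_y = {}
    for x, y in zip(xs, ys):
        by_y.setdefault(y, []).append(x)
    return sum(len(g) / n * entropy(g) for g in by_y.values())

def symmetrical_uncertainty(xs, ys):
    """SU(X, Y) = 2 * IG(X|Y) / (H(X) + H(Y)), normalized to [0, 1]."""
    ig = entropy(xs) - conditional_entropy(xs, ys)
    denom = entropy(xs) + entropy(ys)
    return 2 * ig / denom if denom else 0.0

x = [0, 0, 1, 1]
print(symmetrical_uncertainty(x, x))             # → 1.0 (fully predictive)
print(symmetrical_uncertainty(x, [0, 1, 0, 1]))  # → 0.0 (independent)
```

The two prints exercise exactly the endpoints described on the slide: a variable fully predicts itself (SU = 1), and an independent variable carries no information (SU = 0).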
Entropy, Symmetrical Uncertainty
Symmetrical uncertainty still treats a pair of features symmetrically.
Entropy-based measures require nominal features, but they can be applied to measure correlations between continuous features as well, if the values are discretized properly in advance.
Algorithm Steps
Two aspects of developing a procedure that selects good features for classification:
1) How to decide whether a feature is relevant to the class or not (C-correlation).
2) How to decide whether such a relevant feature is redundant or not when considered together with other relevant features (F-correlation).
For step 1, select the features whose SU with the class is greater than a threshold.
Predominant Correlation
The correlation SUᵢ,c between a feature fᵢ and the class C is predominant iff:
• SUᵢ,c ≥ δ (a given threshold), and
• there exists no feature fⱼ (j ≠ i) such that SUⱼ,ᵢ ≥ SUᵢ,c.
Redundant Feature
If SUⱼ,ᵢ ≥ SUᵢ,c, feature fⱼ is redundant to feature fᵢ; we use S_Pᵢ to denote the set of all redundant peers of fᵢ.
We divide S_Pᵢ into two parts: S_Pᵢ⁺ = {fⱼ : SUⱼ,c > SUᵢ,c} and S_Pᵢ⁻ = {fⱼ : SUⱼ,c ≤ SUᵢ,c}.
Predominant Feature
A feature is predominant to the class iff:
• its correlation to the class is predominant, or
• it can become predominant after removing its redundant peers.
Feature selection for classification is a process that identifies all features predominant to the class concept and removes the rest.
Heuristic
We must use heuristics in order to avoid pairwise analysis of the F-correlations between all relevant features.
Heuristic (if S_Pᵢ⁺ = ∅): treat fᵢ as a predominant feature, remove all features in S_Pᵢ⁻, and skip identifying redundant peers for them.
FCBF Algorithm
[Pseudocode figure omitted.]
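Putting the pieces together, here is a compact sketch of the FCBF idea: rank features by their C-correlation SU(f, C), then repeatedly keep the best-ranked feature and drop every later feature more correlated to it than to the class. This is an illustration based on the description above, not the paper's exact pseudocode:

```python
from collections import Counter
from math import log2

def _H(xs):
    n = len(xs)
    return -sum(c / n * log2(c / n) for c in Counter(xs).values())

def _SU(xs, ys):
    # Symmetrical uncertainty: 2 * (H(X) - H(X|Y)) / (H(X) + H(Y)).
    n = len(xs)
    h_xy = sum(cnt / n * _H([x for x, y in zip(xs, ys) if y == yv])
               for yv, cnt in Counter(ys).items())
    denom = _H(xs) + _H(ys)
    return 2 * (_H(xs) - h_xy) / denom if denom else 0.0

def fcbf(features, classes, delta=0.0):
    """FCBF sketch: keep features with SU(f, C) > delta, rank them by
    C-correlation, then remove every feature whose SU to an earlier
    (predominant) feature is at least its SU to the class."""
    ranked = sorted((f for f in features if _SU(features[f], classes) > delta),
                    key=lambda f: -_SU(features[f], classes))
    selected = []
    while ranked:
        fp = ranked.pop(0)
        selected.append(fp)
        ranked = [fq for fq in ranked
                  if _SU(features[fp], features[fq]) < _SU(features[fq], classes)]
    return selected

classes = [0, 0, 1, 1]
features = {
    "f1": [0, 0, 1, 1],   # perfectly predictive
    "f2": [1, 1, 0, 0],   # redundant (inverted copy of f1)
    "f3": [0, 1, 0, 1],   # irrelevant
}
print(fcbf(features, classes))  # → ['f1']
```

f3 is dropped by the relevance threshold (SU to the class is 0), and f2 is dropped as a redundant peer of f1, leaving a single predominant feature.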
GA-SVM: Genetic Algorithm + Support Vector Machine
(Wrapper Mode)
<Sequential, Compound, Classifier>
Support Vector Machine (SVM)
SVM is one of the best techniques for pattern classification, widely used in many application areas.
SVM classifies data by determining a set of support vectors and their distance to a separating hyperplane.
SVM provides a generic mechanism that fits the hyperplane surface to the training data.
SVM Main Idea
Under the hypothesis that the classes are linearly separable, build the hyperplane with maximum margin that separates the classes.
When the classes are not linearly separable, map them to a higher-dimensional space in which they become linearly separable.
[Figure: separating surface between the A+ and A− classes.]
Support Vector
Support vectors are the training-set instances nearest to the hyperplane; the SVM uses the support vectors instead of the whole training set.
Hyperplane equation: w · x + b = 0 (w and b are unknown).
[Figure: two classes (+1 and −1) in the (X1, X2) plane with the support vectors marked on the margin.]
Kernel
[Figure: a one-dimensional data set (points 1, 2, 4, 5, 6) whose classes are not linearly separable in 1 dimension but become separable after mapping to 2 dimensions.]
Kernel: data in a higher dimension!
The user may select a kernel function for the SVM during the training process.
The kernel parameter settings used in a training process impact the classification accuracy.
The parameters that should be optimized include the penalty parameter C and the kernel function parameters.
Linear SVM
SVM concepts for typical two-class classification problems:
Training set of instance-label pairs (xᵢ, yᵢ), i = 1…n, with yᵢ ∈ {+1, −1}.
For the linearly separable case, the data points will be correctly classified by
yᵢ (w · xᵢ + b) ≥ 1
Linear SVM
Find an optimal separating hyperplane with the maximum margin by solving the following optimization problem:
min ½‖w‖²  subject to  yᵢ (w · xᵢ + b) ≥ 1
To solve this quadratic optimization problem, one must find the saddle point of the Lagrange function
L(w, b, α) = ½‖w‖² − Σᵢ αᵢ [yᵢ (w · xᵢ + b) − 1]
where αᵢ denotes the Lagrange multipliers, hence αᵢ ≥ 0.
Linear SVM
After differentiating and applying the Karush–Kuhn–Tucker (KKT) conditions:
w = Σᵢ αᵢ yᵢ xᵢ,  Σᵢ αᵢ yᵢ = 0
The αᵢ values determine the parameters w and b of the optimal hyperplane; thus we obtain an optimal decision hyperplane.
Linear Generalized SVM
When the data cannot be linearly separated, the goal is to construct the hyperplane that makes the smallest number of errors, using non-negative slack variables ξᵢ:
min ½‖w‖² + C Σᵢ ξᵢ  subject to  yᵢ (w · xᵢ + b) ≥ 1 − ξᵢ,  ξᵢ ≥ 0
C is the tradeoff parameter between error and margin; Σᵢ ξᵢ bounds the number of misclassified instances.
Linear Generalized SVM
This optimization model can be solved using the Lagrangian method.
The penalty parameter C is now the upper bound on the multipliers: 0 ≤ αᵢ ≤ C.
Nonlinear SVM
The nonlinear SVM maps the training samples from the input space into a higher-dimensional feature space via a mapping function Φ. Inner products are then replaced by a kernel function:
K(xᵢ, xⱼ) = Φ(xᵢ) · Φ(xⱼ)
Nonlinear SVM, Kernels
Final hyperplane equation:
f(x) = sign( Σᵢ αᵢ yᵢ K(xᵢ, x) + b )
Nonlinear SVM, Kernels
In order to improve classification accuracy, the parameters in the kernel functions should be properly set.
• Polynomial kernel: K(xᵢ, xⱼ) = (xᵢ · xⱼ + 1)^d
• Radial basis function kernel: K(xᵢ, xⱼ) = exp(−γ ‖xᵢ − xⱼ‖²)
• Sigmoid kernel: K(xᵢ, xⱼ) = tanh(κ xᵢ · xⱼ + θ)
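The three kernels listed above can be written directly; x and y are plain vectors, and gamma, d, kappa, theta are the tunable kernel parameters just mentioned:

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def polynomial_kernel(x, y, d=2):
    # K(x, y) = (x . y + 1)^d
    return (dot(x, y) + 1) ** d

def rbf_kernel(x, y, gamma=0.5):
    # K(x, y) = exp(-gamma * ||x - y||^2)
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def sigmoid_kernel(x, y, kappa=1.0, theta=0.0):
    # K(x, y) = tanh(kappa * x . y + theta)
    return math.tanh(kappa * dot(x, y) + theta)

x, y = [1.0, 0.0], [0.0, 1.0]
print(polynomial_kernel(x, y))  # → 1.0  ((0 + 1)^2)
print(rbf_kernel(x, x))         # → 1.0  (zero distance)
```

Each kernel returns the inner product of the two points in some implicit feature space, which is exactly what lets the SVM separate data that is not linearly separable in the input space.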
Genetic Algorithm (GA)
Genetic algorithms, as an optimization search methodology, are a promising alternative to conventional heuristic methods.
A GA works with a set of candidate solutions called a population.
Based on the Darwinian principle of 'survival of the fittest', the GA obtains the optimal solution after a series of iterative computations.
The GA generates successive populations of alternative solutions, each represented by a chromosome.
A fitness function assesses the quality of a solution in the evaluation step.
GA Feature Selection Structure
The chromosome comprises three parts: C, γ, and the feature mask. (Different parameters are encoded when other types of kernel functions are used.)
A binary coding system represents the chromosome: some bits represent the parameter C, some bits represent the parameter γ, and the remaining bits represent the feature mask.
The numbers of bits for C and γ are chosen according to the required calculation precision.
Evaluation Measure
Three criteria are used to design the fitness function:
• classification accuracy
• the number of selected features
• the feature cost
Thus an individual (chromosome) with high classification accuracy, a small number of features, and low total feature cost produces a high fitness value.
Evaluation Measure
fitness = W_A · SVM_accuracy + W_F · (Σᵢ Cᵢ · Fᵢ)⁻¹
where W_A and W_F weight the accuracy and feature terms, Cᵢ is the cost of feature i, and Fᵢ ∈ {0, 1} indicates whether feature i is selected.
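A fitness of this shape can be sketched as follows; the weights and the "+1" guard in the cost term are assumptions for illustration, not values from the slides:

```python
def fitness(accuracy, feature_mask, feature_costs, w_a=0.8, w_f=0.2):
    """Hedged sketch of a GA-SVM fitness: reward classification accuracy,
    penalize the total cost of the selected features. The weights w_a, w_f
    and the +1 guard against division by zero are illustrative choices."""
    total_cost = sum(c for c, used in zip(feature_costs, feature_mask) if used)
    return w_a * accuracy + w_f / (total_cost + 1)

costs = [1.0, 1.0, 5.0]
# Same accuracy, fewer and cheaper features → higher fitness.
print(fitness(0.9, [1, 0, 0], costs) > fitness(0.9, [1, 1, 1], costs))  # → True
```

The comparison shows the intended selection pressure: between two chromosomes with equal SVM accuracy, the one using a cheaper feature subset wins.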
Thanks for Your Attention