Page 1: Feature selection concepts and methods

1

Feature Selection

Concepts and Methods

Electronic & Computer Department

Isfahan University Of Technology

Reza Ramezani

Page 2: Feature selection concepts and methods

What are Features?

Features are attributes whose values make up an instance.

With features we can identify instances.

Features are the determinant values that decide which class an instance belongs to.

2

Page 3: Feature selection concepts and methods

Classifying Features

Relevance: features that have an influence on the output and whose role cannot be assumed by the rest.

Irrelevance: features that do not have any influence on the output, and whose values are generated at random for each example.

Redundancy: a redundancy exists whenever a feature can take the role of another.

3

Page 4: Feature selection concepts and methods

What is Feature Selection?

Feature selection is a preprocessing step in machine learning that chooses a subset of the original features according to a certain evaluation criterion. It is effective in:

Removing or reducing the effect of irrelevant data

Removing redundant data

Reducing dimensionality (binary model)

Increasing learning accuracy

Improving result comprehensibility

4

Page 5: Feature selection concepts and methods

Other Definitions

A process that selects a subset of features, defined by one of three approaches:

1) The subset of a specified size that optimizes an evaluation measure

2) The smallest subset that satisfies a certain restriction on the evaluation measure

3) The subset with the best compromise between its size and the value of its evaluation measure (general case)

5

Page 6: Feature selection concepts and methods

Feature Selection Algorithm (FSA)

An FSA is a computational solution that is motivated by a certain definition of relevance.

1) The relevance of a feature may have several definitions, depending on the objective that is sought.

2) Find a compromise between minimizing and maximizing (general case).

3) An irrelevant feature is not useful for induction, but not all relevant features are necessarily useful for induction.

6

Page 7: Feature selection concepts and methods

Classifying FSAs

FSAs can be classified according to the kind of output they yield:

1) Algorithms that give a weighted linear order of the features (continuous feature selection problem)

2) Algorithms that give a subset of the original features (binary feature selection problem)

Note that both types can be seen in a unified way by noting that in (2) the weighting is binary.

7

Page 8: Feature selection concepts and methods

Notation

X = feature set
X' = a feature subset
x_i = feature i
E = the set of instances (instance space)
p = probability distribution on E
W = space of labels (e.g., classes)
c = objective function c: E → W that labels each instance according to its relevant features (the classifier)
S = data set (training set)

8

Page 9: Feature selection concepts and methods

Relevance of a feature

The purpose of an FSA is to identify relevant features according to a definition of relevance.

Unfortunately, the notion of relevance in machine learning has not yet been rigorously defined by common agreement.

Let us define relevance from several aspects:

9

Page 10: Feature selection concepts and methods

Relevance with respect to an objective

A feature x_i is relevant to the objective function c if there exist two examples A and B in the instance space E that differ only in their assignment to x_i and for which c(A) ≠ c(B).

10

Page 11: Feature selection concepts and methods

Strong relevance with respect to S

A feature x_i is strongly relevant to the sample S if there exist two examples A and B in S that differ only in their assignment to x_i and have different labels.

That is to say, it is the same as the previous definition, but now A, B ∈ S and the definition is with respect to S.

11

Page 12: Feature selection concepts and methods

Example training set. Refund and Marital Status are categorical, Taxable Income is continuous, and Cheat is the class:

Tid | Refund | Marital Status | Taxable Income | Cheat
1   | Yes    | Single         | 125K           | No
2   | No     | Married        | 100K           | No
3   | No     | Single         | 70K            | No
4   | Yes    | Married        | 120K           | No
5   | No     | Divorced       | 95K            | Yes
6   | No     | Married        | 60K            | No
7   | Yes    | Divorced       | 220K           | No
8   | No     | Single         | 85K            | Yes
9   | No     | Married        | 75K            | No
10  | No     | Single         | 90K            | Yes

[Figure: decision tree built from this data. Root node Refund (Yes → NO; No → MarSt). MarSt (Married → NO; Single, Divorced → TaxInc). TaxInc (< 80K → NO; > 80K → YES).]

12

Page 13: Feature selection concepts and methods

Strong relevance with respect to p

A feature x_i is strongly relevant to an objective c in the distribution p if there exist two examples A and B with p(A) ≠ 0 and p(B) ≠ 0 that differ only in their assignment to x_i and for which c(A) ≠ c(B).

This definition is the natural extension of the previous one but, contrary to it, the distribution p is assumed to be known.

13

Page 14: Feature selection concepts and methods

Weak relevance with respect to S

A feature x_i is weakly relevant to the sample S if there exists at least a proper subset of features X' ⊂ X, containing x_i, such that x_i is strongly relevant with respect to S restricted to X'.

A weakly relevant feature can appear when a subset of features containing at least one strongly relevant feature is removed.

14

Page 15: Feature selection concepts and methods

Weak relevance with respect to p

A feature x_i is weakly relevant to the objective c in the distribution p if there exists at least a proper subset of features X' ⊂ X, containing x_i, such that x_i is strongly relevant with respect to p restricted to X'.

These five definitions are important for deciding which features should be conserved and which can be eliminated.

15

Page 16: Feature selection concepts and methods

Strongly Relevant Features

The strongly relevant features are, in theory, important for maintaining the structure of the domain, and they should be conserved by any feature selection algorithm in order to avoid adding ambiguity to the sample.

16

Page 17: Feature selection concepts and methods

Weakly Relevant Features

Weakly relevant features could be important or not, depending on:

The other features already selected.

The evaluation measure that has been chosen (accuracy, simplicity, consistency, etc.).

17

Page 18: Feature selection concepts and methods

Relevance as a complexity measure

Define r(S, c) as the smallest number of features relevant to c such that the error in S is the least possible for the inducer.

In other words, it is the smallest number of features required by a specific inducer to reach optimum performance in the task of modeling c using S.

18

Page 19: Feature selection concepts and methods

Incremental usefulness

Given a data sample S, a learning algorithm L, and a subset of features X', the feature x_i is incrementally useful to L with respect to X' if the accuracy that L reaches using the group of features {x_i} ∪ X' is better than the accuracy reached using only the subset of features X'.

19

Page 20: Feature selection concepts and methods

Example

Instances over 30 binary features X1 … X30 (columns X1, X11, X21, X30 marked):

100000000000000000000000000000  +
111111111100000000000000000000  +
000000000011111111110000000000  +
000000000000000000001111111111  +
000000000000000000000000000000  –

X1 is strongly relevant; the rest are weakly relevant.

r(S, c) = 3

Incremental usefulness: after choosing {X1, X2}, none of X3 … X10 would be incrementally useful, but any of X11 … X30 would.

20
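A minimal sketch of the strong-relevance check from the earlier definition, applied to this toy sample (the helper names are illustrative, not from the slides): a feature is flagged when two instances differ only in that feature and carry different labels.

from itertools import combinations

# Toy sample from the slide: 5 instances over 30 binary features, labels +/-.
data = [
    ("100000000000000000000000000000", "+"),
    ("111111111100000000000000000000", "+"),
    ("000000000011111111110000000000", "+"),
    ("000000000000000000001111111111", "+"),
    ("000000000000000000000000000000", "-"),
]
X = [[int(b) for b in bits] for bits, _ in data]
y = [label for _, label in data]
n_features = len(X[0])

def strongly_relevant(i, rows, labels):
    """x_i is strongly relevant w.r.t. the sample if two instances
    differ only in x_i and carry different labels."""
    for (a, la), (b, lb) in combinations(zip(rows, labels), 2):
        differ = [k for k in range(len(a)) if a[k] != b[k]]
        if differ == [i] and la != lb:
            return True
    return False

strong = [i + 1 for i in range(n_features) if strongly_relevant(i, X, y)]
print("Strongly relevant features:", strong)   # expected: [1], i.e. X1 only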

Page 21: Feature selection concepts and methods

General Schemes for Feature Selection

The schemes are distinguished by the relationship between the FSA and the inducer, where the inducer is:

• The process chosen to evaluate the usefulness of the features
• The learning process itself

Filter Scheme

Wrapper Scheme

Embedded Scheme

21

Page 22: Feature selection concepts and methods

Filter Scheme

The feature selection process takes place before the induction step.

This scheme is independent of the induction algorithm.

• High speed
• Low accuracy

22

Page 23: Feature selection concepts and methods

Wrapper Scheme

Uses the learning algorithm as a subroutine to evaluate the feature subsets.

The inducer must be known.

• Low speed
• High accuracy

23

Page 24: Feature selection concepts and methods

Embedded Scheme

Similar to the wrapper approach, features are specifically selected for a certain inducer.

The inducer selects the features during the process of learning (explicitly or implicitly).

24
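As an illustration of the three schemes, a minimal scikit-learn sketch (the dataset, the estimators, and the choice of ten features are arbitrary examples, not the slides' setup): filter-style scoring with mutual information, wrapper-style recursive elimination driven by a logistic-regression inducer, and embedded selection via random-forest importances.

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, mutual_info_classif, RFE
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Filter: rank features before (and independently of) the inducer.
filter_sel = SelectKBest(mutual_info_classif, k=10).fit(X, y)

# Wrapper: use the inducer itself to score and prune feature subsets.
wrapper_sel = RFE(LogisticRegression(max_iter=5000), n_features_to_select=10).fit(X, y)

# Embedded: the inducer selects features while it learns (tree importances).
embedded = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
top10 = embedded.feature_importances_.argsort()[::-1][:10]

print("filter  :", filter_sel.get_support(indices=True))
print("wrapper :", wrapper_sel.get_support(indices=True))
print("embedded:", sorted(top10))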

Page 25: Feature selection concepts and methods

Embedded Scheme Example

Training set. Refund and Marital Status are categorical, Taxable Income and Age are continuous, and Cheat is the class:

Refund | Marital Status | Taxable Income | Age | Cheat
Yes    | Single         | 125K           | 18  | No
No     | Married        | 100K           | 30  | No
No     | Single         | 70K            | 28  | No
Yes    | Married        | 120K           | 19  | No
No     | Divorced       | 95K            | 18  | Yes
No     | Married        | 60K            | 20  | No
Yes    | Divorced       | 220K           | 25  | No
No     | Single         | 85K            | 30  | Yes
No     | Married        | 75K            | 20  | No
No     | Single         | 90K            | 18  | Yes

[Figure: the resulting decision tree. Root Refund (Yes → NO; No → MarSt); MarSt (Married → NO; Single, Divorced → TaxInc); TaxInc (< 80K → NO; > 80K → YES).]

The decision tree learning algorithm will automatically remove the 'Age' feature.

25

Page 26: Feature selection concepts and methods

Characterization of FSAs

Search Organization: General strategy with which the hypothesis space is explored.

Generation of Successors: Mechanism by which possible successor candidates of the current state are proposed.

Evaluation Measure: Function by which successor candidates are evaluated.

26

Page 27: Feature selection concepts and methods

Types of Search Organization

We consider three types of search:

Exponential

Sequential

Random

27

Page 28: Feature selection concepts and methods

Exponential Search

Algorithms that carry out searches with exponential cost in the number of features.

The best solution is guaranteed. The exhaustive search is an optimal search, but an optimal search need not be exhaustive:

• Branch and Bound, for a monotonic evaluation measure

• A* search with an admissible heuristic

A measure J is monotonic if for any two subsets X1 ⊆ X2, J(X1) ≤ J(X2).

28

Page 29: Feature selection concepts and methods

Sequential Search

This strategy selects one among all the successors of the current state. Once the state is selected, it is not possible to go back.

The number of such steps is bounded by the number of features. If a fixed number of candidate subsets is evaluated at each state change, the cost of this search is therefore polynomial.

These methods do not guarantee an optimal result.

29

Page 30: Feature selection concepts and methods

Random Search

Uses randomness to prevent the algorithm from getting stuck in a local minimum.

Allows temporarily moving to other states with worse solutions.

These are anytime algorithms.

Can give several optimal subsets as a solution.

30

Page 31: Feature selection concepts and methods

Types of Successor Generation

Forward

Backward

Compound

Weighting

Random

31

Page 32: Feature selection concepts and methods

Forward Successors Generation

Starting with X' = ∅.

Adds features to the current solution X', among those that have not been selected yet.

In each step, the feature that makes J greater is added to the solution.

The cost of the operator is proportional to the number of features not yet selected.

32

Page 33: Feature selection concepts and methods

Backward Successors Generation

Starting with X' = X.

Removes features from the current solution X', among those that have not been removed yet.

In each step, the feature whose removal makes J greatest is removed from the solution.

The cost of the operator is proportional to the number of features still in the solution.

33

Page 34: Feature selection concepts and methods

Forward and Backward Methods, Stopping Criteria

The size of the subset has been fixed in advance.

The value of J has not increased in the last k steps.

The value of J reaches a prefixed value.

In practice, the backward method demands more computation than its forward counterpart.

34
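A minimal sketch of a sequential search with forward successor generation, using the cross-validated accuracy of a k-NN inducer as the measure J and stopping when J no longer increases (the dataset, inducer, and parameters are illustrative choices, not from the slides).

import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)

def J(subset):
    """Evaluation measure: mean CV accuracy using only the given features."""
    clf = KNeighborsClassifier(n_neighbors=5)
    return cross_val_score(clf, X[:, subset], y, cv=5).mean()

selected, remaining = [], list(range(X.shape[1]))
best_score = 0.0
while remaining:
    # Generate forward successors: current solution plus one unselected feature.
    scores = {f: J(selected + [f]) for f in remaining}
    f_best, score = max(scores.items(), key=lambda kv: kv[1])
    if score <= best_score:          # stopping criterion: J did not increase
        break
    selected.append(f_best)
    remaining.remove(f_best)
    best_score = score

print("Selected features:", selected, "J =", round(best_score, 3))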

Page 35: Feature selection concepts and methods

Compound Successors Generation

Apply f consecutive forward steps and b consecutive backward ones.

If f > b, the net result is a forward operator; otherwise it is a backward one.

This method allows discovering new interactions among features.

Other stopping conditions should be established if f = b.

In a sequential FSA, the condition f ≠ b assures a bounded maximum number of steps, and thus a polynomial total cost.

35

Page 36: Feature selection concepts and methods

Weighting Successors Generation

Weighting operators correspond to the continuous feature selection problem.

All of the features are present in the solution to a certain degree.

A successor state is a state with a different weighting.

This is typically done by iteratively sampling the available set of instances.

36

Page 37: Feature selection concepts and methods

Random Successors Generation

Includes those operators that can potentially generate any other state in a single step.

Restricted by some criterion of advance:

• In the number of features

• In improving the measure J at each step

37

Page 38: Feature selection concepts and methods

Evaluation Measures

• Probability of error
• Divergence
• Dependence
• Interclass distance
• Information or uncertainty
• Consistency

The relative values assigned to different subsets reflect their greater or lesser relevance to the objective function.

Let J(X') be an evaluation measure to be maximized, where X' is a (weighted) feature subset.

38

Page 39: Feature selection concepts and methods

Evaluation Measures, Probability of Error

The ultimate goal is to build a classifier that minimizes the (Bayesian) probability of error.

The probability of error of the classifier therefore seems to be the most natural choice of evaluation measure.

39

Page 40: Feature selection concepts and methods

Evaluation Measures, Probability of Error

Since the class-conditional densities are usually unknown, they can either be explicitly modeled (using parametric or non-parametric methods) or the probability of error can be estimated empirically from a data sample, as on the next slide.

40

Page 41: Feature selection concepts and methods

Evaluation Measures, Probability of Error

Provided the classifier has been built using only a subset X' of the features, we have:

S_T is a test data sample, and S_T' is the subset of S_T where the classifier performed correctly.

Finally, the measure is the empirical accuracy J(X') = |S_T'| / |S_T| (equivalently, the estimated error is 1 − |S_T'| / |S_T|).

41
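A small sketch of this estimate: train an inducer on a feature subset and take the fraction of a held-out test sample classified correctly (the dataset, the decision-tree inducer, and the split are arbitrary examples).

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

def error_measure(subset):
    """J(X') = |S_T'| / |S_T|: fraction of test instances classified correctly
    when the inducer sees only the features in `subset`."""
    clf = DecisionTreeClassifier(random_state=0).fit(X_train[:, subset], y_train)
    correct = (clf.predict(X_test[:, subset]) == y_test).sum()
    return correct / len(y_test)

print(error_measure([0, 1, 2]))                  # accuracy using only three features
print(error_measure(list(range(X.shape[1]))))    # accuracy using all features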

Page 42: Feature selection concepts and methods

Evaluation Measures, Divergence

These measures compute a probabilistic distance or divergence among the class-conditional probability densities P(x | ω_i), using a general formula of the form

J = ∫ f( P(x | ω_1), P(x | ω_2) ) dx

for some function f of the two densities.

42

Page 43: Feature selection concepts and methods

Evaluation Measures, Divergence

For a valid measure, the function f must be such that the value of J satisfies the following conditions:

1) J = 0 only when the class-conditional densities are equal

2) J is maximum when the densities are non-overlapping

If the features used in a solution are good ones, the divergence will be significant.

43

Page 44: Feature selection concepts and methods

Divergence, Some classical choices:

44

Page 45: Feature selection concepts and methods

Divergence, Some classical choices:

45
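Two classical choices of this kind from the literature, given here as illustrative examples that fit the general form and the two conditions above, are the Kullback-Leibler divergence and the Bhattacharyya distance:

% Kullback-Leibler divergence between the class-conditional densities
J_{KL} = \int P(x \mid \omega_1)\,\ln\frac{P(x \mid \omega_1)}{P(x \mid \omega_2)}\,dx

% Bhattacharyya distance
J_{B} = -\ln \int \sqrt{P(x \mid \omega_1)\,P(x \mid \omega_2)}\,dx

Both vanish only when the two densities coincide and grow without bound as the densities stop overlapping.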

Page 46: Feature selection concepts and methods

Evaluation Measures, Dependence

These measures quantify how strongly two features are associated with one another, so that knowing the value of one it is possible to predict the value of the other.

The correlation coefficient is a classical measure that is still used for these methods.

46

Page 47: Feature selection concepts and methods

Evaluation Measures, Interclass Distance

These measures are based on the assumption that instances of different classes are distant from one another in the instance space.

A typical measure is the average distance between instances of different classes, with x_i^(c) denoting the i-th instance of class c and n_c the number of instances of class c.

The most usual distances belong to the Euclidean family.

47

Page 48: Feature selection concepts and methods

Evaluation Measures, Consistency

An inconsistency in X' and S is defined as two instances in S that are equal when considering only the features in X' and that belong to different classes.

The aim is thus to find the minimum subset of features leading to zero inconsistencies.

48

Page 49: Feature selection concepts and methods

Evaluation Measures, Consistency

The inconsistency count of an instance A ∈ S is defined as:

IC_X'(A) = X'(A) − max_c X'_c(A)

where X'(A) is the number of instances in S equal to A using only the features in X', and X'_c(A) is the number of instances in S of class c equal to A using only the features in X'.

49

Page 50: Feature selection concepts and methods

Evaluation Measures, Consistency

The inconsistency rate of a feature subset X' in a sample S is:

IR(X', S) = ( Σ_A IC_X'(A) ) / |S|

where the sum runs over the distinct patterns of S projected onto X'. Finally we take J(X') = IR(X', S).

This measure lies in [0, 1] and must be minimized.

50
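A minimal sketch of the inconsistency rate, following the common formulation above in which each distinct projected pattern contributes its instance count minus its largest single-class count (the toy data are illustrative).

from collections import Counter, defaultdict

def inconsistency_rate(X, y, subset):
    groups = defaultdict(list)
    for row, label in zip(X, y):
        key = tuple(row[i] for i in subset)   # projection onto the feature subset
        groups[key].append(label)
    total = 0
    for labels in groups.values():
        total += len(labels) - max(Counter(labels).values())
    return total / len(y)

# Toy data: with only feature 0 the two middle rows collide with different classes.
X = [[0, 1], [0, 0], [1, 1], [1, 0]]
y = ["a", "a", "a", "b"]
print(inconsistency_rate(X, y, [0]))     # 0.25 -> one inconsistency remains
print(inconsistency_rate(X, y, [0, 1]))  # 0.0  -> fully consistent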

Page 51: Feature selection concepts and methods

51

Page 52: Feature selection concepts and methods

General Algorithm for Feature Selection

All FSAs can be represented in a space of characteristics according to the criteria of:

• Search organization (Org)
• Generation of successor states (GS)
• Evaluation measure (J)

This space <Org, GS, J> encompasses the whole spectrum of possibilities for an FSA.

An FSA is called hybrid when it requires more than one point on the same coordinate to be characterized.

52

Page 53: Feature selection concepts and methods

53

Page 54: Feature selection concepts and methods

FCBF: Fast Correlation-Based Filter (Filter Mode)

<Sequential, Compound, Information>

54

Page 55: Feature selection concepts and methods

Previous Works and Their Defects

1) Huge Time Complexity

Binary mode: subset search algorithms search through candidate feature subsets guided by a certain search strategy and an evaluation measure.

Different search strategies, namely exhaustive, heuristic, and random search, are combined with this evaluation measure to form different algorithms.

55

Page 56: Feature selection concepts and methods

Previous Works and Their Defects

The time complexity is exponential in terms of data dimensionality for exhaustive search and quadratic for heuristic search.

The complexity can be linear in the number of iterations for a random search, but experiments show that, in order to find the best feature subset, the number of iterations required is usually at least quadratic in the number of features.

56

Page 57: Feature selection concepts and methods

Previous Works and Their Defects

2) Inability to recognize redundant features.

Relief: the key idea of Relief is to estimate the relevance of features according to how well their values distinguish between instances of the same and of different classes that are near each other.

Relief randomly samples a number m of instances from the training set and updates the relevance estimate of each feature based on the difference between the selected instance and the two nearest instances of the same and of the opposite class.

57

Page 58: Feature selection concepts and methods

Previous Works and Their Defects

The time complexity of Relief for a data set with M instances and N features is O(mMN). With m being a constant, the time complexity becomes O(MN), which makes it very scalable to data sets with both a huge number of instances and very high dimensionality.

However, Relief does not help with removing redundant features.

58
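A minimal sketch of the basic Relief idea just described, assuming numeric features and two classes; the sample size m, the feature scaling, and the toy data are illustrative choices, not the original algorithm's exact implementation.

import numpy as np

def relief(X, y, m=50, rng=np.random.default_rng(0)):
    n_samples, n_features = X.shape
    # scale features to [0, 1] so per-feature differences are comparable
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0
    Xs = (X - X.min(axis=0)) / span
    w = np.zeros(n_features)
    for _ in range(m):
        i = rng.integers(n_samples)
        dists = np.abs(Xs - Xs[i]).sum(axis=1)
        dists[i] = np.inf
        same = np.where(y == y[i])[0]
        diff = np.where(y != y[i])[0]
        hit = same[np.argmin(dists[same])]     # nearest neighbor of the same class
        miss = diff[np.argmin(dists[diff])]    # nearest neighbor of the opposite class
        # relevance grows when the feature separates the classes, shrinks otherwise
        w += np.abs(Xs[i] - Xs[miss]) - np.abs(Xs[i] - Xs[hit])
    return w / m

# Example: feature 0 carries the class signal, feature 1 is noise.
rng = np.random.default_rng(1)
y = rng.integers(0, 2, 200)
X = np.column_stack([y + 0.1 * rng.standard_normal(200), rng.standard_normal(200)])
print(relief(X, y))   # weight of feature 0 should clearly exceed that of feature 1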

Page 59: Feature selection concepts and methods

Good Feature

A feature is good if it is relevant to the class concept but is not redundant with any of the other relevant features.

Correlation as Goodness Measure

A feature is good if it is highly correlated with the class but not highly correlated with any of the other features.

59

Page 60: Feature selection concepts and methods

Approaches to Measure The Correlation

Classical Linear Correlation (Linear Correlation Coefficient)

Information theory (Entropy or Uncertainty)

60

Page 61: Feature selection concepts and methods

Linear Correlation Coefficient

For a pair of variables (X, Y), the linear correlation coefficient r is given by the formula:

r = Σ_i (x_i − x̄)(y_i − ȳ) / ( sqrt( Σ_i (x_i − x̄)² ) · sqrt( Σ_i (y_i − ȳ)² ) )

If X and Y are completely correlated, r takes the value 1 or −1.

If X and Y are totally independent, r is zero.

61
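A small numeric check of r on made-up values (all numbers are arbitrary examples).

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_lin = 2 * x + 1                          # completely correlated with x
y_ind = np.array([1.0, 3.0, 2.0, 3.0, 1.0])  # constructed to be uncorrelated with x

def r(a, b):
    a, b = a - a.mean(), b - b.mean()
    return (a * b).sum() / np.sqrt((a * a).sum() * (b * b).sum())

print(r(x, y_lin))   # 1.0
print(r(x, y_ind))   # 0.0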

Page 62: Feature selection concepts and methods

Advantages

• It helps to remove features with near-zero linear correlation to the class.
• It helps to reduce redundancy among selected features.

Disadvantages

• It may not be able to capture correlations that are not linear in nature.
• The calculation requires all features to contain numerical values.

62

Page 63: Feature selection concepts and methods

Entropy

The entropy of a variable (feature) X is defined as:

H(X) = − Σ_i P(x_i) log₂ P(x_i)

The entropy of X after observing values of another variable Y is defined as:

H(X|Y) = − Σ_j P(y_j) Σ_i P(x_i | y_j) log₂ P(x_i | y_j)

63

Page 64: Feature selection concepts and methods

Entropy, Information Gain

The amount by which the entropy of X decreases reflects the additional information about X provided by Y:

IG(X|Y) = H(X) − H(X|Y)

Feature Y is regarded as more correlated to feature X than to feature Z if IG(X|Y) > IG(Z|Y).

Information gain is symmetrical for two random variables X and Y: IG(X|Y) = IG(Y|X).

64

Page 65: Feature selection concepts and methods

Entropy, Symmetrical Uncertainty

Information gain is biased in favor of features with more values, thus we must normalize it:

SU(X, Y) = 2 · IG(X|Y) / ( H(X) + H(Y) )

SU(X, Y) values are normalized to the range [0, 1]: the value 1 indicates that knowledge of the value of either one completely predicts the value of the other; the value 0 indicates that X and Y are independent.

65

Page 66: Feature selection concepts and methods

Entropy, Symmetrical Uncertainty

Symmetrical uncertainty still treats a pair of features symmetrically.

Entropy-based measures require nominal features.

Entropy-based measures can be applied to measure correlations between continuous features as well, if the values are discretized properly in advance.

66
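A minimal sketch of these entropy-based measures for nominal features, using empirical frequencies (the toy variables are illustrative).

from collections import Counter
from math import log2

def entropy(xs):
    n = len(xs)
    return -sum((c / n) * log2(c / n) for c in Counter(xs).values())

def conditional_entropy(xs, ys):
    n = len(ys)
    h = 0.0
    for y, cy in Counter(ys).items():
        xs_given_y = [x for x, yy in zip(xs, ys) if yy == y]
        h += (cy / n) * entropy(xs_given_y)
    return h

def information_gain(xs, ys):
    return entropy(xs) - conditional_entropy(xs, ys)

def symmetrical_uncertainty(xs, ys):
    return 2.0 * information_gain(xs, ys) / (entropy(xs) + entropy(ys))

# Toy nominal data: X is fully determined by Y, Z is unrelated to Y.
X = ["a", "a", "b", "b"]
Y = ["p", "p", "q", "q"]
Z = ["a", "b", "a", "b"]
print(symmetrical_uncertainty(X, Y))  # 1.0  (either variable predicts the other)
print(symmetrical_uncertainty(Z, Y))  # 0.0  (independent)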

Page 67: Feature selection concepts and methods

Algorithm Steps

Aspects of developing a procedure to select good features for classification:

1) How to decide whether a feature is relevant to the class or not (C-correlation: the SU between a feature and the class).

2) How to decide whether such a relevant feature is redundant or not when considering it together with other relevant features (F-correlation: the SU between two features).

For step 1, select the features whose C-correlation SU is greater than a threshold.

67

Page 68: Feature selection concepts and methods

Predominant Correlation

The correlation between a feature F_i and the class C, SU_{i,c}, is predominant iff:

SU_{i,c} ≥ δ (the threshold), and

there exists no feature F_j (j ≠ i) such that SU_{j,i} ≥ SU_{i,c}.

68

Page 69: Feature selection concepts and methods

Redundant Feature

If F_j is redundant to feature F_i (that is, SU_{j,i} ≥ SU_{i,c}), we use S_Pi to denote the set of all redundant peers of F_i.

We divide S_Pi into two parts:

S_Pi+ = { F_j ∈ S_Pi : SU_{j,c} > SU_{i,c} } and S_Pi− = { F_j ∈ S_Pi : SU_{j,c} ≤ SU_{i,c} }

69

Page 70: Feature selection concepts and methods

Predominant Feature

A feature is predominant to the class iff its correlation to the class is predominant, or can become predominant after removing its redundant peers.

Feature selection for classification is a process that identifies all predominant features to the class concept and removes the rest.

70

Page 71: Feature selection concepts and methods

Heuristic

We must use heuristics in order to avoid pairwise analysis of the F-correlations between all relevant features.

Heuristic (if S_Pi+ = ∅): treat F_i as a predominant feature, remove all features in S_Pi−, and skip identifying redundant peers for them.

71

Page 72: Feature selection concepts and methods

72

FCBF Algorithm
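A hedged sketch of an FCBF-style selection loop consistent with the description above (this is not the authors' published code; the threshold delta, the data format, and all names are illustrative assumptions).

from collections import Counter
from math import log2

def H(xs):
    n = len(xs)
    return -sum(c / n * log2(c / n) for c in Counter(xs).values())

def SU(xs, ys):
    n = len(xs)
    h_x_given_y = sum(Counter(ys)[y] / n * H([x for x, yy in zip(xs, ys) if yy == y])
                      for y in set(ys))
    ig = H(xs) - h_x_given_y
    return 2.0 * ig / (H(xs) + H(ys)) if H(xs) + H(ys) > 0 else 0.0

def fcbf(features, target, delta=0.1):
    """features: dict name -> list of nominal values; target: list of class labels."""
    # Step 1 (C-correlation): keep features with SU(feature, class) >= delta,
    # ordered by decreasing SU.
    c_corr = {f: SU(v, target) for f, v in features.items()}
    ordered = [f for f, s in sorted(c_corr.items(), key=lambda kv: -kv[1]) if s >= delta]
    # Step 2 (F-correlation): a kept (predominant) feature removes every later
    # feature that is more correlated with it than with the class (a redundant peer).
    selected = []
    while ordered:
        fi = ordered.pop(0)
        selected.append(fi)
        ordered = [fj for fj in ordered if SU(features[fj], features[fi]) < c_corr[fj]]
    return selected

data = {"f1": ["a", "a", "b", "b"], "f2": ["a", "a", "b", "b"], "f3": ["a", "b", "a", "b"]}
labels = ["p", "p", "q", "q"]
print(fcbf(data, labels))   # keeps one of the duplicated features, drops the copy and the noise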

Page 73: Feature selection concepts and methods

73

Page 74: Feature selection concepts and methods

74

Page 75: Feature selection concepts and methods

GA-SVM: Genetic Algorithm + Support Vector Machine (Wrapper Mode)

<Sequential, Compound, Classifier>

75

Page 76: Feature selection concepts and methods

Support Vector Machine (SVM)

SVM is one of the best techniques for pattern classification and is widely used in many application areas.

SVM classifies data by determining a set of support vectors and their distance to the separating hyperplane.

SVM provides a generic mechanism that fits the hyperplane surface to the training data.

76

Page 77: Feature selection concepts and methods

SVM Main Idea

Under the hypothesis that the classes are linearly separable, build the hyperplane with maximum margin that separates the classes.

When the classes are not linearly separable, map them to a high-dimensional space in which they can be linearly separated.

[Figure: a separating surface between the two classes A+ and A−.]

77

Page 78: Feature selection concepts and methods

Support Vectors

The support vectors are the training set instances nearest to the hyperplane; the SVs can be used in place of the whole training set.

Hyperplane equation: w · x + b = 0 (w and b are unknown).

[Figure: two classes (+1 and −1) in the X1-X2 plane with the separating line and the three support vectors marked.]

78

Page 79: Feature selection concepts and methods

Kernel

[Figure: points at 1, 2, 4, 5, 6 on a line, labeled class 1 / class 2 / class 1, shown first in 1 dimension (not linearly separable) and then mapped to 2 dimensions (linearly separable).]

79

Page 80: Feature selection concepts and methods

Kernel

The data are mapped into a higher-dimensional space. The user may select a kernel function for the SVM during the training process.

The kernel parameter settings for the SVM during training impact the classification accuracy.

The parameters that should be optimized include the penalty parameter C and the kernel function parameters.

80

Page 81: Feature selection concepts and methods

Linear SVM

SVM concepts for typical two-class classification problems:

Training set of instance-label pairs (x_i, y_i), i = 1, ..., m, with y_i ∈ {+1, −1}.

For the linearly separable case, the data points will be correctly classified by:

y_i (w · x_i + b) ≥ 1 for all i

81

Page 82: Feature selection concepts and methods

Linear SVM

Find an optimal separating hyperplane with the maximum margin by solving the following optimization problem:

minimize (1/2) ||w||²  subject to  y_i (w · x_i + b) ≥ 1 for all i

To solve this quadratic optimization problem one must find the saddle point of the Lagrange function:

L(w, b, α) = (1/2) ||w||² − Σ_i α_i [ y_i (w · x_i + b) − 1 ]

where the α_i denote the Lagrange multipliers, hence α_i ≥ 0.

82

Page 83: Feature selection concepts and methods

Linear SVM

After differentiating and applying the Karush-Kuhn-Tucker (KKT) conditions, the resulting α_i values determine the parameters w and b of the optimal hyperplane:

w = Σ_i α_i y_i x_i

Thus, we obtain the optimal decision hyperplane f(x) = sign( Σ_i α_i y_i (x_i · x) + b ).

83

Page 84: Feature selection concepts and methods

Linear Generalized SVM

When the data cannot be separated linearly, the goal is to construct a hyperplane that makes the smallest number of errors, using non-negative slack variables ξ_i.

Solve:

minimize (1/2) ||w||² + C Σ_i ξ_i  subject to  y_i (w · x_i + b) ≥ 1 − ξ_i,  ξ_i ≥ 0

C is the tradeoff parameter between error and margin; Σ_i ξ_i bounds the number of misclassified instances.

84

Page 85: Feature selection concepts and methods

Linear Generalized SVM

This optimization model can be solved using the Lagrangian method.

The penalty parameter C is now the upper bound on the Lagrange multipliers α_i: 0 ≤ α_i ≤ C.

85

Page 86: Feature selection concepts and methods

Nonlinear SVM

The nonlinear SVM maps the training samples from the input space into a higher-dimensional feature space via a mapping function Φ; the inner products in that space are computed by a kernel function:

K(x_i, x_j) = Φ(x_i) · Φ(x_j)

86

Page 87: Feature selection concepts and methods

Nonlinear SVM, Kernels

The final hyperplane (decision function) equation becomes:

f(x) = sign( Σ_i α_i y_i K(x_i, x) + b )

87

Page 88: Feature selection concepts and methods

Nonlinear SVM, Kernels

In order to improve classification accuracy, the kernel parameters in the kernel functions should be properly set.

Polynomial kernel: K(x_i, x_j) = (γ x_i · x_j + r)^d

Radial basis function kernel: K(x_i, x_j) = exp( −γ ||x_i − x_j||² )

Sigmoid kernel: K(x_i, x_j) = tanh( γ x_i · x_j + r )

88
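A small illustration of how the penalty parameter C and the RBF kernel parameter gamma affect accuracy, using scikit-learn's SVC (the dataset and the grid of values are arbitrary examples).

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
for C in (0.1, 1, 10):
    for gamma in (1e-4, 1e-3, 1e-2):
        acc = cross_val_score(SVC(kernel="rbf", C=C, gamma=gamma), X, y, cv=5).mean()
        print(f"C={C:<5} gamma={gamma:<7} CV accuracy={acc:.3f}")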

Page 89: Feature selection concepts and methods

Genetic Algorithm (GA)

Genetic algorithms, as an optimization search methodology, are a promising alternative to conventional heuristic methods.

A GA works with a set of candidate solutions called a population.

Based on the Darwinian principle of 'survival of the fittest', the GA obtains the optimal solution after a series of iterative computations.

The GA generates successive populations of alternative solutions, each represented by a chromosome.

A fitness function assesses the quality of a solution in the evaluation step.

89

Page 90: Feature selection concepts and methods

90

Page 91: Feature selection concepts and methods

GA Feature Selection Structure

The chromosome comprises three parts: C, γ, and the features mask. (Different parameters are encoded when other types of kernel functions are used.)

A binary coding system is used to represent the chromosome:

n_C is the number of bits representing the parameter C
n_γ is the number of bits representing the parameter γ
n_f is the number of bits representing the features

Choose n_C and n_γ according to the required calculation precision.

91

Page 92: Feature selection concepts and methods

Evaluation Measure

Three criteria are used to design the fitness function:

• Classification accuracy
• The number of selected features
• The feature cost

Thus, an individual (chromosome) with high classification accuracy, a small number of features, and a low total feature cost produces a high fitness value.

92

Page 93: Feature selection concepts and methods

Evaluation Measure

fitness = W_A × SVM_accuracy + W_F × ( Σ_i C_i × F_i )^(−1)

where W_A and W_F are the weights of the accuracy term and of the feature term, C_i is the cost of feature i, and F_i is 1 if feature i is selected and 0 otherwise.

93
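A hedged sketch of decoding such a chromosome and evaluating this kind of fitness with an SVM inducer; the bit widths, parameter ranges, weights, unit feature costs, and dataset are all illustrative assumptions rather than the original experimental setup.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
N_C, N_G, N_F = 8, 8, X.shape[1]          # bits for C, for gamma, and the feature mask
W_A, W_F = 0.8, 0.2                        # weights for accuracy and the feature term
COSTS = np.ones(N_F)                       # assumed per-feature costs

def decode_bits(bits, low, high):
    """Map a bit string to a real value in [low, high]."""
    value = int("".join(map(str, bits)), 2)
    return low + (high - low) * value / (2 ** len(bits) - 1)

def fitness(chromosome):
    C = decode_bits(chromosome[:N_C], 0.01, 100.0)
    gamma = decode_bits(chromosome[N_C:N_C + N_G], 1e-4, 1.0)
    mask = np.array(chromosome[N_C + N_G:], dtype=bool)
    if not mask.any():
        return 0.0                          # a chromosome selecting no feature is useless
    acc = cross_val_score(SVC(kernel="rbf", C=C, gamma=gamma), X[:, mask], y, cv=3).mean()
    return W_A * acc + W_F / COSTS[mask].sum()

rng = np.random.default_rng(0)
chromosome = list(rng.integers(0, 2, N_C + N_G + N_F))
print(fitness(chromosome))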

Page 94: Feature selection concepts and methods

94

Page 95: Feature selection concepts and methods

95

Thanks For Your Regard
