Feature Selection (I)
Data Mining II, Year 2010-11, Lluís Belanche
The problem
Science and variable acquisition
Economy + simplicity vs. abundance + necessity
Relevant, redundant and irrelevant variables
Is exhaustive search adequate?
  Computationally
  Scientifically
  Statistically
Selection vs. weighting
Reasons for dimension reduction
Improve learning efficiency
Reduce the cost of measurements
Enable better understanding of the underlying process
Eliminate noise
Improve the prediction performance
Visualization
Approaches to dimension reduction
Select a subset of the original features
Construct features that replace the original feature set
Construct features that add to the original feature set
Weight features differentially
A combination of the above
Φ Transformation
Φ(x) = [Φ1(x), Φ2(x), …, Φm(x)],  x = [x1, x2, …, xn]
Useful for “ascending” (m > n) or “descending” (m < n) in dimension: visualization, compaction, noise reduction, removal of useless information
Feature extraction, feature selection (or both!)
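To make the two directions concrete, a minimal sketch (not from the slides; the arrays and the projection W are invented for illustration) contrasting feature selection, which keeps a subset of the original xi, with feature extraction, which constructs new features from all of them:

import numpy as np

# Illustrative sketch (not from the slides): the same mapping Phi can describe
# feature selection (keep a subset of the original features) or feature
# extraction (construct new features from all of them).
X = np.random.rand(100, 5)             # 100 examples, n = 5 original features

selected = X[:, [1, 3]]                # selection: Phi(x) = [x2, x4], m = 2
W = np.random.rand(5, 2)               # e.g. a projection learned elsewhere (PCA)
extracted = X @ W                      # extraction: Phi(x) = W'x, m = 2

print(selected.shape, extracted.shape) # (100, 2) (100, 2)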
What are features?
The good and the bad
Example 1
Consider a battery of diagnostic tests T1, ..., Tm for a fairly rare disease, which around 5% of all patients tested actually have.
Suppose test T1 correctly picks up 99% of the real cases and has a very low false positive rate.
However, there is a rare special form of the disease that T1 cannot detect, but T2 can, yet T2 is inaccurate on the normal disease form.
If we test the diagnostic tests one at a time, we will never even think of including T2, yet T1 and T2 together may give a nearly perfect classifier by declaring a patient diseased if T1 is positive or T1 is negative and T2 is positive.
Ripley, B. Pattern Recognition and Neural Networks, 1996, p. 327
Example 1 (cont.)
Battery of diagnostic tests T1, …, T10 for a rare disease (5% incidence in patients)
T1 detects 90% of positives and has 0 FPs
T2 detects 40% of positives but detects 90% of the FNs for T1
T7 detects 20% of positives but detects 90% of the FNs for T2
Assume the rest of the tests are (almost) useless
a) How useful are these three tests taken together?
b) What are the chances that we come across T1, T2, T7 by selecting features in the full set T1, …, T10 ?
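For question (a), a minimal worked sketch, under the assumption that “detects 90% of the FNs” means each test recovers 90% of the true cases missed by the previous one:

# Sketch for question (a), assuming "detects 90% of the FNs" means each test
# recovers 90% of the true cases missed by the previous one.
sens_T1 = 0.90
missed_after_T1 = 1.0 - sens_T1                        # 0.10
recovered_by_T2 = 0.90 * missed_after_T1               # 0.09
missed_after_T2 = missed_after_T1 - recovered_by_T2    # 0.01
recovered_by_T7 = 0.90 * missed_after_T2               # 0.009

combined = sens_T1 + recovered_by_T2 + recovered_by_T7
print(combined)   # 0.999: together the three tests miss only ~0.1% of true cases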
Example 2
Data set with three nominal features and four classes
Optimal subset: {X2, X3}
# X1 X2 X3 C
1 1 2 1 1
2 1 3 1 1
3 2 1 2 2
4 2 2 2 2
5 2 2 3 2
6 2 1 3 3
7 3 4 2 3
8 3 1 1 3
9 3 2 4 4
10 3 3 2 4
11 4 3 2 4
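A minimal sketch (not part of the slides) that verifies the claim by checking, for every feature subset, whether two examples can agree on the subset yet belong to different classes:

from itertools import combinations

# Illustrative check (not part of the slides): a subset is "consistent" if no
# two examples agree on those features but disagree on the class C.
data = [  # rows of the table: (X1, X2, X3, C)
    (1, 2, 1, 1), (1, 3, 1, 1), (2, 1, 2, 2), (2, 2, 2, 2), (2, 2, 3, 2),
    (2, 1, 3, 3), (3, 4, 2, 3), (3, 1, 1, 3), (3, 2, 4, 4), (3, 3, 2, 4),
    (4, 3, 2, 4),
]
names = {0: "X1", 1: "X2", 2: "X3"}

def consistent(subset):
    seen = {}
    for row in data:
        key = tuple(row[i] for i in subset)
        if seen.setdefault(key, row[3]) != row[3]:
            return False
    return True

for size in (1, 2, 3):
    for subset in combinations(range(3), size):
        print([names[i] for i in subset], consistent(subset))
# Only {X2, X3} (and the full feature set) determine the class without conflicts.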
Naïve sequential feature selection
Filtering
Evaluation independent of learning algorithm
Filter methods
Laplacian Score
Fisher Score (= Fisher's discriminant ratio)
Entropy-based (Mutual Information, Gini, SU, mRMR)
Distance
Correlation (CFS)
Separability
Kruskal-Wallis
ReliefF
Chi-square, t- and F-scores
Zhao et al. Advancing Feature Selection Research (can be fetched at http://featureselection.asu.edu/)
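As one concrete instance of the scores above, a minimal sketch (not the course's code) of the Fisher score of each feature, computed as between-class scatter divided by within-class scatter:

import numpy as np

# Illustrative sketch (not the course's code) of one of the listed filters:
# the Fisher score of each feature; larger scores mean better class separation.
def fisher_scores(X, y):
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    overall_mean = X.mean(axis=0)
    num = np.zeros(X.shape[1])
    den = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        num += len(Xc) * (Xc.mean(axis=0) - overall_mean) ** 2
        den += len(Xc) * Xc.var(axis=0)
    return num / (den + 1e-12)

# Usage: rank the features and keep the top k
# ranking = np.argsort(fisher_scores(X, y))[::-1]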
The RELIEF algorithm [Kira & Rendell 1992]
ReliefF: extensions, empirical and theoretical analysis in [Robnik-Sikonja & Kononenko 2003]
Idea: favor features having:
1. Different values in dissimilar examples of different classes
2. Equal values in similar examples of the same class
w[] := 0
for i := 1 to p = c·|S| do
  pick an example E at random from S
  M := nearest example to E from a different class
  H := nearest example to E from the same class
  for k := 1 to n do
    w[k] := w[k] + d(Ek, Mk)/n - d(Ek, Hk)/n
  end
end
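A minimal Python rendering of the loop above (an illustrative sketch, assuming numeric features scaled to [0, 1] so that the absolute difference can play the role of d):

import numpy as np

# Illustrative sketch of the Relief weighting loop, assuming numeric features
# scaled to [0, 1] so the absolute difference serves as the distance d.
def relief(X, y, num_samples):
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    n_examples, n_features = X.shape
    w = np.zeros(n_features)
    rng = np.random.default_rng(0)
    for _ in range(num_samples):
        i = rng.integers(n_examples)
        E, cls = X[i], y[i]
        dist = np.abs(X - E).sum(axis=1)
        dist[i] = np.inf                                    # exclude E itself
        same, other = (y == cls), (y != cls)
        H = X[np.where(same)[0][np.argmin(dist[same])]]     # nearest hit
        M = X[np.where(other)[0][np.argmin(dist[other])]]   # nearest miss
        w += np.abs(E - M) / n_features - np.abs(E - H) / n_features
    return w   # larger weights = more relevant features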
Wrapper methods
Evaluation uses the same learning algorithm that will be used after feature selection (equating the biases)
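A minimal wrapper sketch (assuming scikit-learn is available; the learner and the stopping rule are illustrative choices): greedy forward selection in which the same classifier that will be deployed afterwards scores every candidate subset by cross-validation:

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Illustrative wrapper sketch (assumes scikit-learn): greedy forward selection
# scored by the same learner that will be used after feature selection.
def forward_selection(X, y, learner=KNeighborsClassifier()):
    remaining = list(range(X.shape[1]))
    selected, best = [], -np.inf
    while remaining:
        scores = {f: cross_val_score(learner, X[:, selected + [f]], y, cv=5).mean()
                  for f in remaining}
        f_best = max(scores, key=scores.get)
        if scores[f_best] <= best:      # adding any feature no longer helps
            break
        best = scores[f_best]
        selected.append(f_best)
        remaining.remove(f_best)
    return selected, best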
A Filter/Wrapper synergy
1. Use a filter method able to detect interactions between features (e.g. ReliefF)
2. Sort the features by relevance, from best to worst
3. Use the learning algorithm to check whether the elimination of the worst feature is useful
4. If this is the case, eliminate it and go to 1
5. If this is not the case, STOP
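A minimal sketch of this loop (assuming scikit-learn; filter_scores is a placeholder for any interaction-aware filter, e.g. the relief() sketch above):

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Illustrative sketch of the filter/wrapper loop above (assumes scikit-learn;
# filter_scores is a placeholder for an interaction-aware filter like Relief).
def hybrid_selection(X, y, filter_scores, learner=KNeighborsClassifier()):
    features = list(range(X.shape[1]))
    while len(features) > 1:
        # Steps 1-2: rank the surviving features with the filter, best to worst
        order = np.argsort(filter_scores(X[:, features], y))[::-1]
        features = [features[i] for i in order]
        # Step 3: does removing the worst-ranked feature hurt the learner?
        full = cross_val_score(learner, X[:, features], y, cv=5).mean()
        reduced = cross_val_score(learner, X[:, features[:-1]], y, cv=5).mean()
        if reduced >= full:
            features.pop()     # Step 4: eliminate the worst feature, go to 1
        else:
            break              # Step 5: no improvement, STOP
    return features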
Filters vs. Wrappers