Feature Selection (I) · Data Mining II · Year 2010-11 · Lluís Belanche

Page 1

Feature Selection (I)

Data Mining II · Year 2010-11 · Lluís Belanche

Page 2

The problem

Science and variable acquisition
Economy + simplicity vs. abundance + necessity
Relevant, redundant and irrelevant variables
Is exhaustive search adequate?
  Computationally
  Scientifically
  Statistically

Selection vs. weighing

Page 3

Reasons for dimension reduction

Improve learning efficiency
Reduce the cost of measurements
Enable better understanding of the underlying process
Eliminate noise
Improve the prediction performance
Visualization

Page 4

Approaches to dimension reduction

Select a subset of the original features
Construct features that replace the original feature set
Construct features that add to the original feature set
Weight features differentially
A combination of the above

Page 5

Φ Transformation

Φ(x) = [ Φ1(x), Φ2(x), …, Φm(x) ],  where x = [ x1, x2, …, xn ]

Useful for "ascending" (m > n) or "descending" (m < n) in dimension: visualization, compaction, noise reduction, removal of useless information

feature extraction, feature selection (or both!)
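To make the two directions concrete, here is a minimal numpy sketch (the names `phi_ascend` and `phi_descend` are illustrative, not from the slides): an ascending Φ maps n = 2 inputs to m = 5 polynomial features, while a descending Φ projects n = 3 inputs onto m = 1 principal direction.

```python
import numpy as np

def phi_ascend(x):
    """Ascending map (m > n): augment x = [x1, x2] with degree-2 terms."""
    x1, x2 = x
    return np.array([x1, x2, x1 * x1, x1 * x2, x2 * x2])

def phi_descend(X):
    """Descending map (m < n): project onto the top principal component."""
    Xc = X - X.mean(axis=0)                  # center the data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[0]                        # 1-D coordinates along PC1

x = np.array([2.0, 3.0])
print(phi_ascend(x))         # 5 features from 2 inputs: dimension goes up
X = np.random.default_rng(0).normal(size=(10, 3))
print(phi_descend(X).shape)  # (10,): each point reduced to 1 coordinate
```

The first map is typical of feature extraction before a linear model; the second is the usual compaction/visualization direction.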

Page 6

What are features?

Page 7

The good and the bad

Page 8

Example 1

Consider a battery of diagnostic tests T1, ..., Tm for a fairly rare disease, which around 5% of all patients tested actually have.

Suppose test T1 correctly picks up 99% of the real cases and has a very low false positive rate.

However, there is a rare special form of the disease that T1 cannot detect, but T2 can, yet T2 is inaccurate on the normal disease form.

If we evaluate the tests one at a time, we will never even think of including T2; yet T1 and T2 together may give a nearly perfect classifier: declare a patient diseased if T1 is positive, or if T1 is negative and T2 is positive.

Ripley, B. Pattern Recognition and Neural Networks, 1996, p. 327

Page 9

Example 1 (cont.)

Battery of diagnostic tests T1, …, T10 for a rare disease (5% incidence in patients)

T1 detects 90% of positives and has 0 FPs

T2 detects 40% of positives but detects 90% of the FNs for T1

T7 detects 20% of positives but detects 90% of the FNs for T2

Assume the rest of the tests are (almost) useless

a) How useful are these three tests taken together?

b) What are the chances that we come across T1, T2, T7 by selecting features in the full set T1, …, T10 ?
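For intuition on (a), a back-of-the-envelope sketch, assuming the stated rates compose exactly as described (each later test catches 90% of the positives the previous one missed):

```python
# Sensitivity of the T1-then-T2-then-T7 cascade, assuming each test
# catches 90% of the previous test's false negatives, as stated above.
sens_t1 = 0.90
miss_after_t1 = 1.0 - sens_t1          # 10% of positives slip past T1
miss_after_t2 = miss_after_t1 * 0.10   # T2 catches 90% of those
miss_after_t7 = miss_after_t2 * 0.10   # T7 catches 90% of the rest
combined_sensitivity = 1.0 - miss_after_t7
print(combined_sensitivity)            # about 0.999
```

On their own, T2 and T7 look weak (40% and 20% sensitivity), which is exactly why a one-test-at-a-time selection procedure is likely to discard them.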

Page 10

Example 2

Data set with three nominal features and four classes
Optimal subset: {X2, X3}

 #  X1  X2  X3  C

 1   1   2   1  1
 2   1   3   1  1
 3   2   1   2  2
 4   2   2   2  2
 5   2   2   3  2
 6   2   1   3  3
 7   3   4   2  3
 8   3   1   1  3
 9   3   2   4  4
10   3   3   2  4
11   4   3   2  4
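One way to see that {X2, X3} is optimal is a consistency check: a subset is acceptable if no two rows agree on the selected features but disagree on the class. A small sketch over the table above:

```python
from itertools import combinations

# the eleven rows of the Example 2 data set: (X1, X2, X3, class)
rows = [
    (1, 2, 1, 1), (1, 3, 1, 1), (2, 1, 2, 2), (2, 2, 2, 2),
    (2, 2, 3, 2), (2, 1, 3, 3), (3, 4, 2, 3), (3, 1, 1, 3),
    (3, 2, 4, 4), (3, 3, 2, 4), (4, 3, 2, 4),
]

def consistent(idx):
    """True if no two rows agree on the features in idx
    (0-based: 0 = X1, 1 = X2, 2 = X3) but disagree on the class."""
    seen = {}
    for *xs, c in rows:
        key = tuple(xs[i] for i in idx)
        if seen.setdefault(key, c) != c:
            return False
    return True

for k in (1, 2):
    for subset in combinations(range(3), k):
        print([f"X{i+1}" for i in subset], consistent(subset))
# only ['X2', 'X3'] comes out consistent among subsets of size 1 or 2
```

Every single feature and the other two pairs map some identical feature tuple to two different classes, so {X2, X3} is the unique consistent pair.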

Page 11

Naïve sequential feature selection

Page 12

Filtering

Evaluation independent of learning algorithm

Page 13

Filter methods

Laplacian Score
Fisher Score (= Fisher's discriminant ratio)
Entropy-based (Mutual Information, Gini, SU, mRMR)
Distance
Correlation (CFS)
Separability
Kruskal-Wallis
ReliefF
Chi-square, t- and F-scores

Zhao et al., Advancing Feature Selection Research (available at http://featureselection.asu.edu/)
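As an illustration of the entropy-based family, here is a minimal sketch of mutual information and symmetrical uncertainty (SU) for nominal variables; the helper names are mine, not from any particular library:

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy H of a list of nominal values, in bits."""
    n = len(values)
    return -sum(c / n * log2(c / n) for c in Counter(values).values())

def mutual_information(feature, classes):
    """I(X; C) = H(X) + H(C) - H(X, C) for nominal variables."""
    return (entropy(feature) + entropy(classes)
            - entropy(list(zip(feature, classes))))

def symmetrical_uncertainty(feature, classes):
    """SU = 2 I(X; C) / (H(X) + H(C)), normalized to [0, 1]."""
    hx, hc = entropy(feature), entropy(classes)
    return 2 * mutual_information(feature, classes) / (hx + hc) if hx + hc else 0.0

cls = [1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4]
informative = [1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4]  # copies the class
constant    = [1] * 11                           # carries no information
print(symmetrical_uncertainty(informative, cls))  # 1.0
print(symmetrical_uncertainty(constant, cls))     # 0.0
```

As a filter score, SU is computed once per feature, independently of any learner, and features are ranked by it.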

Page 14

The RELIEF algorithm [Kira & Rendell 1992]
ReliefF: extensions, empirical and theoretical analysis in [Robnik-Sikonja & Kononenko 2003]

Idea: favor features having:
1. different values in similar examples of different classes
2. equal values in similar examples of the same class

w[k] := 0 for all k
for i := 1 to p = c·|S| do
    pick an example E at random from S
    M := nearest example to E from a different class
    H := nearest example to E from the same class
    for k := 1 to n do
        w[k] := w[k] - d(Ek, Hk)/n + d(Ek, Mk)/n
    end
end
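A runnable version of the pseudocode above, assuming numeric features with per-feature distance |a - b| (a sketch of basic Relief, not the full ReliefF with k neighbors and missing-value handling):

```python
import random

def relief(S, n, c=5):
    """Basic Relief: S is a list of (x, y) pairs with x a tuple of n
    numeric features; runs p = c*|S| random picks, as in the pseudocode.
    Assumes every class in S has at least two examples."""
    w = [0.0] * n
    for _ in range(c * len(S)):
        xe, ye = random.choice(S)
        sq = lambda pair: sum((a - e) ** 2 for a, e in zip(pair[0], xe))
        # nearest hit H (same class, not E itself) and nearest miss M
        H = min(((x, y) for x, y in S if y == ye and x != xe), key=sq)[0]
        M = min(((x, y) for x, y in S if y != ye), key=sq)[0]
        for k in range(n):
            w[k] += (abs(xe[k] - M[k]) - abs(xe[k] - H[k])) / n
    return w

random.seed(0)
# feature 0 copies the class, feature 1 is pure noise
S = [((float(i % 2), random.random()), i % 2) for i in range(20)]
w = relief(S, 2)
print(w[0] > w[1])  # the informative feature gets the larger weight
```

The weight of a feature grows when it separates E from its nearest miss and shrinks when it differs between E and its nearest hit, matching the two-point idea on this slide.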

Page 15

Wrapper methods

Evaluation uses the same learning algorithm that is used after the feature selection (equalizing the biases)

Page 16

A Filter/Wrapper synergy

1. Use a filter method able to detect interactions between features (e.g. ReliefF)
2. Sort the features by relevance from best to worst
3. Use the learning algorithm to check whether eliminating the worst feature is useful
4. If it is, eliminate it and go to 1
5. If it is not, STOP
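The steps above can be sketched generically; `rank` and `useful_without` stand in for the filter (e.g. ReliefF) and the wrapper check, and are placeholders supplied by the caller, not a real library API:

```python
def filter_wrapper_select(features, rank, useful_without):
    """Sketch of the hybrid scheme on this slide.
    rank(features): the filter step, returns features best-to-worst.
    useful_without(features, f): the wrapper step, True if the learner
    does at least as well after dropping f (e.g. by cross-validation)."""
    features = list(features)
    while len(features) > 1:
        worst = rank(features)[-1]           # steps 1-2: filter ranking
        if useful_without(features, worst):  # step 3: wrapper check
            features.remove(worst)           # step 4: eliminate, repeat
        else:
            break                            # step 5: stop
    return features

# toy demo: fixed relevance scores and a threshold-based wrapper check
relevance = {"a": 0.9, "b": 0.5, "c": 0.1, "d": 0.0}
rank = lambda fs: sorted(fs, key=relevance.get, reverse=True)
useful_without = lambda fs, f: relevance[f] < 0.3
print(filter_wrapper_select(["a", "b", "c", "d"], rank, useful_without))
# keeps the two relevant features, ["a", "b"]
```

The wrapper check is only ever applied to the single worst-ranked feature, so the learner is retrained at most once per eliminated feature, which is the point of the synergy.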

Page 17

Filters vs. Wrappers