Feature Selection (I)
Data Mining II, Year 2010-11, Lluís Belanche
The problem
Science and variable acquisition
Economy + simplicity vs. abundance + necessity
Relevant, redundant and irrelevant variables
Is exhaustive search adequate?
  Computationally
  Scientifically
  Statistically
Selection vs. weighting
Reasons for dimension reduction
Improve learning efficiency
Reduce the cost of measurements
Enable better understanding of the underlying process
Eliminate noise
Improve the prediction performance
Visualization
Approaches to dimension reduction
Select a subset of the original features
Construct features that replace the original feature set
Construct features that add to the original feature set
Weight features differentially
A combination of the above
Φ Transformation
Φ(x) = [Φ1(x), Φ2(x), …, Φm(x)],  x = [x1, x2, …, xn]
Useful for “ascending” (m > n) or “descending” (m < n) in dimension: visualization, compaction, noise reduction, removal of useless information
Feature extraction, feature selection (or both!)
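To make the two directions concrete, a minimal sketch (not from the slides; the arrays and the projection W are invented for illustration) contrasting feature selection, which keeps a subset of the original xi, with feature extraction, which constructs new features from all of them:

import numpy as np

# Illustrative sketch (not from the slides): the same mapping Phi can describe
# feature selection (keep a subset of the original features) or feature
# extraction (construct new features from all of them).
X = np.random.rand(100, 5)             # 100 examples, n = 5 original features

selected = X[:, [1, 3]]                # selection: Phi(x) = [x2, x4], m = 2
W = np.random.rand(5, 2)               # e.g. a projection learned elsewhere (PCA)
extracted = X @ W                      # extraction: Phi(x) = W'x, m = 2

print(selected.shape, extracted.shape) # (100, 2) (100, 2)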
What are features?
The good and the bad
Example 1
Consider a battery of diagnostic tests T1, ..., Tm for a fairly rare disease, which around 5% of all patients tested actually have.
Suppose test T1 correctly picks up 99% of the real cases and has a very low false positive rate.
However, there is a rare special form of the disease that T1 cannot detect, but T2 can, yet T2 is inaccurate on the normal disease form.
If we test the diagnostic tests one at a time, we will never even think of including T2, yet T1 and T2 together may give a nearly perfect classifier by declaring a patient diseased if T1 is positive or T1 is negative and T2 is positive.
Ripley, B. Pattern Recognition and Neural Networks, 1996, p. 327
Example 1 (cont.)
Battery of diagnostic tests T1, …, T10 for a rare disease (5% incidence in patients)
T1 detects 90% of positives and has 0 FPs
T2 detects 40% of positives but detects 90% of the FNs for T1
T7 detects 20% of positives but detects 90% of the FNs for T2
Assume the rest of the tests are (almost) useless
a) How useful are these three tests taken together?
b) What are the chances that we come across T1, T2, T7 by selecting features in the full set T1, …, T10 ?
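For question (a), a minimal worked sketch, under the assumption that “detects 90% of the FNs” means each test recovers 90% of the true cases missed by the previous one:

# Sketch for question (a), assuming "detects 90% of the FNs" means each test
# recovers 90% of the true cases missed by the previous one.
sens_T1 = 0.90
missed_after_T1 = 1.0 - sens_T1                        # 0.10
recovered_by_T2 = 0.90 * missed_after_T1               # 0.09
missed_after_T2 = missed_after_T1 - recovered_by_T2    # 0.01
recovered_by_T7 = 0.90 * missed_after_T2               # 0.009

combined = sens_T1 + recovered_by_T2 + recovered_by_T7
print(combined)   # 0.999: together the three tests miss only ~0.1% of true cases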
Example 2
Data set with three nominal features and four classes
Optimal subset: {X2, X3}
# X1 X2 X3 C
1 1 2 1 1
2 1 3 1 1
3 2 1 2 2
4 2 2 2 2
5 2 2 3 2
6 2 1 3 3
7 3 4 2 3
8 3 1 1 3
9 3 2 4 4
10 3 3 2 4
11 4 3 2 4
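A minimal sketch (not part of the slides) that verifies the claim by checking, for every feature subset, whether two examples can agree on the subset yet belong to different classes:

from itertools import combinations

# Illustrative check (not part of the slides): a subset is "consistent" if no
# two examples agree on those features but disagree on the class C.
data = [  # rows of the table: (X1, X2, X3, C)
    (1, 2, 1, 1), (1, 3, 1, 1), (2, 1, 2, 2), (2, 2, 2, 2), (2, 2, 3, 2),
    (2, 1, 3, 3), (3, 4, 2, 3), (3, 1, 1, 3), (3, 2, 4, 4), (3, 3, 2, 4),
    (4, 3, 2, 4),
]
names = {0: "X1", 1: "X2", 2: "X3"}

def consistent(subset):
    seen = {}
    for row in data:
        key = tuple(row[i] for i in subset)
        if seen.setdefault(key, row[3]) != row[3]:
            return False
    return True

for size in (1, 2, 3):
    for subset in combinations(range(3), size):
        print([names[i] for i in subset], consistent(subset))
# Only {X2, X3} (and the full feature set) determine the class without conflicts.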
Naïve sequential feature selection
Filtering
Evaluation independent of learning algorithm
Filter methods
Laplacian Score
Fisher Score (= Fisher's discriminant ratio)
Entropy-based (Mutual Information, Gini, SU, mRMR)
Distance
Correlation (CFS)
Separability
Kruskal-Wallis
ReliefF
Chi-square, t- and F-scores
Zhao et al. Advancing Feature Selection Research (can be fetched at http://featureselection.asu.edu/)
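As one concrete instance of the scores above, a minimal sketch (not the course's code) of the Fisher score of each feature, computed as between-class scatter divided by within-class scatter:

import numpy as np

# Illustrative sketch (not the course's code) of one of the listed filters:
# the Fisher score of each feature; larger scores mean better class separation.
def fisher_scores(X, y):
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    overall_mean = X.mean(axis=0)
    num = np.zeros(X.shape[1])
    den = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        num += len(Xc) * (Xc.mean(axis=0) - overall_mean) ** 2
        den += len(Xc) * Xc.var(axis=0)
    return num / (den + 1e-12)

# Usage: rank the features and keep the top k
# ranking = np.argsort(fisher_scores(X, y))[::-1]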
The RELIEF algorithm [Kira & Rendell 1992]
ReliefF: extensions, empirical and theoretical analysis in [Robnik-Sikonja & Kononenko 2003]
Idea: favor features having:
1. Different values in dissimilar examples of different classes
2. Equal values in similar examples of the same class
w[] := 0
for i := 1 to p = c·|S| do
  pick an example E at random from S
  M := nearest example to E from a different class
  H := nearest example to E from the same class
  for k := 1 to n do
    w[k] := w[k] + d(Ek, Mk)/n - d(Ek, Hk)/n
  end
end
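A minimal Python rendering of the loop above (an illustrative sketch, assuming numeric features scaled to [0, 1] so that the absolute difference can play the role of d):

import numpy as np

# Illustrative sketch of the Relief weighting loop, assuming numeric features
# scaled to [0, 1] so the absolute difference serves as the distance d.
def relief(X, y, num_samples):
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    n_examples, n_features = X.shape
    w = np.zeros(n_features)
    rng = np.random.default_rng(0)
    for _ in range(num_samples):
        i = rng.integers(n_examples)
        E, cls = X[i], y[i]
        dist = np.abs(X - E).sum(axis=1)
        dist[i] = np.inf                                    # exclude E itself
        same, other = (y == cls), (y != cls)
        H = X[np.where(same)[0][np.argmin(dist[same])]]     # nearest hit
        M = X[np.where(other)[0][np.argmin(dist[other])]]   # nearest miss
        w += np.abs(E - M) / n_features - np.abs(E - H) / n_features
    return w   # larger weights = more relevant features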
Wrapper methods
Evaluation uses the same learning algorithm that will be used after feature selection (equating the biases)
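A minimal wrapper sketch (assuming scikit-learn is available; the learner and the stopping rule are illustrative choices): greedy forward selection in which the same classifier that will be deployed afterwards scores every candidate subset by cross-validation:

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Illustrative wrapper sketch (assumes scikit-learn): greedy forward selection
# scored by the same learner that will be used after feature selection.
def forward_selection(X, y, learner=KNeighborsClassifier()):
    remaining = list(range(X.shape[1]))
    selected, best = [], -np.inf
    while remaining:
        scores = {f: cross_val_score(learner, X[:, selected + [f]], y, cv=5).mean()
                  for f in remaining}
        f_best = max(scores, key=scores.get)
        if scores[f_best] <= best:      # adding any feature no longer helps
            break
        best = scores[f_best]
        selected.append(f_best)
        remaining.remove(f_best)
    return selected, best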
A Filter/Wrapper synergy
1. Use a filter method able to detect interactions between features (e.g. ReliefF)
2. Sort the features by relevance, from best to worst
3. Use the learning algorithm to check whether the elimination of the worst feature is useful
4. If this is the case, eliminate it and go to 1
5. If this is not the case, STOP
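A minimal sketch of this loop (assuming scikit-learn; filter_scores is a placeholder for any interaction-aware filter, e.g. the relief() sketch above):

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Illustrative sketch of the filter/wrapper loop above (assumes scikit-learn;
# filter_scores is a placeholder for an interaction-aware filter like Relief).
def hybrid_selection(X, y, filter_scores, learner=KNeighborsClassifier()):
    features = list(range(X.shape[1]))
    while len(features) > 1:
        # Steps 1-2: rank the surviving features with the filter, best to worst
        order = np.argsort(filter_scores(X[:, features], y))[::-1]
        features = [features[i] for i in order]
        # Step 3: does removing the worst-ranked feature hurt the learner?
        full = cross_val_score(learner, X[:, features], y, cv=5).mean()
        reduced = cross_val_score(learner, X[:, features[:-1]], y, cv=5).mean()
        if reduced >= full:
            features.pop()     # Step 4: eliminate the worst feature, go to 1
        else:
            break              # Step 5: no improvement, STOP
    return features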
Filters vs. Wrappers