Feb 23, 2016
Selection of Relevant Features and Examples in Machine Learning
Paper By: Avrim L. Blum and Pat Langley
Presented By: Arindam Bhattacharya (10305002), Akshat Malu (10305012), Yogesh Kakde (10305039), Tanmay Haldankar (10305911)
Overview
• Introduction
• Selecting Relevant Features
  – Embedded Approaches
  – Filter Approaches
  – Wrapper Approaches
  – Feature Weighting Approaches
• Selecting Relevant Examples
  – Selecting Labeled Data
  – Selecting Unlabeled Data
• Challenges and Future Work
Introduction
• Machine learning is addressing larger and more complex tasks.
• The Internet contains a huge volume of low-quality information.
• We focus on:
  – Selecting the most relevant features
  – Selecting the most relevant examples
Problems of Irrelevant Features
• Not helpful in classification
• Slow down the learning process [1]
• The number of training examples required grows exponentially with the number of irrelevant features [2]

[1] Cover and Hart, 1967
[2] Langley and Iba, 1993
Blum et al., 1997
Definitions of Relevance
Definition 1: Relevant to the Target
A feature xi is relevant to a target concept C if there exists a pair of examples A and B such that A and B differ only in feature xi and C(A) ≠ C(B).
John, Kohavi and Pfleger (1994)
Definitions of Relevance
Definition 2: Strongly Relevant to the Sample
A feature xi is said to be strongly relevant to the sample S if there exist examples A and B in S that differ only in feature xi and have different labels.
John, Kohavi and Pfleger (1994)
Definitions of Relevance
Definition 3: Weakly Relevant to the Sample
A feature xi is said to be weakly relevant to the sample S if it is possible to remove a subset of the features so that xi becomes strongly relevant.
Blum et al, 1997
Definitions of Relevance
Definition 4: Relevance as a Complexity Measure
Given a sample S and a set of concepts C, let r(S, C) be the number of features relevant (using Definition 1) to a concept in C that, out of all those whose error over S is least, has the fewest relevant features.
Caruana and Freitag, 1994
Definitions of Relevance
Definition 5: Incremental Usefulness
Given a sample S, a learning algorithm L, and a feature set A, feature xi is incrementally useful to L if the accuracy of the hypothesis that L produces using the feature set {xi} ∪ A is better than the accuracy achieved using just the feature set A.
Example
Suppose that concepts can be expressed as disjunctions and that the algorithm sees the following examples:
x1 x2 x3 x4 x5 | label
 1  0  0  0  0 | +
 1  1  1  0  0 | +
 0  0  0  1  0 | +
 0  0  0  0  1 | +
 0  0  0  0  0 | -
Example (continued)
• Using Definitions 2 and 3, x1 is strongly relevant while x2 is weakly relevant.
• Using Definition 4, there are three relevant features (r(S, C) = 3).
• Using Definition 5, given the feature set {x1, x2}, the third feature may not be useful, but features x4 and x5 would be useful.
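To make Definition 2 concrete on this sample, here is a small Python check (our own illustration, not code from the paper): a feature xi is strongly relevant if some pair of examples in S differ only in xi and have different labels.

```python
# Strong relevance (Definition 2) on the toy sample above: our own sketch.
from itertools import combinations

S = [((1, 0, 0, 0, 0), '+'), ((1, 1, 1, 0, 0), '+'), ((0, 0, 0, 1, 0), '+'),
     ((0, 0, 0, 0, 1), '+'), ((0, 0, 0, 0, 0), '-')]

def strongly_relevant(i, sample):
    # xi is strongly relevant if two examples differ only in position i
    # and carry different labels.
    for (a, ya), (b, yb) in combinations(sample, 2):
        differing = [j for j in range(len(a)) if a[j] != b[j]]
        if differing == [i] and ya != yb:
            return True
    return False

print([f"x{i + 1}" for i in range(5) if strongly_relevant(i, S)])  # -> ['x1', 'x4', 'x5']
```

Note that x4 and x5 also come out strongly relevant here, consistent with r(S, C) = 3 above; x2 only becomes strongly relevant after removing a subset of the other features (e.g. x1 and x3), which is exactly Definition 3.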
Feature Selection as Heuristic Search
Heuristic search is an ideal paradigm for feature selection algorithms.
Feature Selection as Heuristic Search
• Search space: the set of feature subsets, organized as a partial order
Four Basic Issues
• Where to start?
  – Forward selection
  – Backward elimination
Four Basic Issues
• How to organize the search?
  – Exhaustive search: 2^n possibilities for n attributes
  – Greedy search: hill climbing, best-first search
Four Basic Issues
• Which alternative is better? (the strategy used to evaluate alternative feature subsets)
  – Accuracy on the training set or on a separate evaluation set
  – Interaction between feature selection and the basic induction method
Four Basic Issues
• When to stop?
  – Stop when nothing improves
  – Continue until things worsen
  – Reach the end of the search and select the best subset seen
  – Stop when each combination of selected features maps to a single class
  – Order features by relevance and determine a break point
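The four issues can be tied together in one sketch. The following fragment (our own illustration; `estimate_accuracy` is a placeholder for any evaluation function, e.g. hold-out accuracy of an induction algorithm) performs greedy forward selection: start from the empty set, add one feature per step, and stop when no addition improves the estimate.

```python
# Greedy forward selection: a sketch of the four issues, not a specific system.
def forward_selection(all_features, estimate_accuracy):
    selected = set()
    best_acc = estimate_accuracy(selected)            # where to start: the empty set
    while True:
        remaining = [f for f in all_features if f not in selected]
        if not remaining:
            break
        # how to organize: hill climbing, one feature added per step
        scored = [(estimate_accuracy(selected | {f}), f) for f in remaining]
        acc, best_f = max(scored, key=lambda t: t[0])  # which is better: the evaluation function decides
        if acc <= best_acc:                            # when to stop: nothing improves
            break
        selected.add(best_f)
        best_acc = acc
    return selected
```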
An Example – Set Cover Algorithm
1. Begin with the disjunction of zero features.
2. While any safe feature that improves performance remains: from the safe features, select the one that maximizes the number of correctly classified positive examples.
3. Output the selected features.

• Begins at the left of the search space (the empty feature set) and incrementally moves right.
• Evaluates candidates by performance on the training set, with an infinite penalty for misclassifying a negative example (so only "safe" features, which keep all negatives correctly classified, are ever added).
• Halts when no further step improves performance.
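A rough Python rendering of this procedure for monotone disjunctions (our own sketch, not the paper's implementation):

```python
# Greedy set-cover style learner for disjunctions: a sketch of the steps above.
def set_cover_disjunction(examples):
    """examples: list of (0/1 feature tuple, label) with label 1 = positive."""
    n = len(examples[0][0])
    positives = [x for x, y in examples if y == 1]
    negatives = [x for x, y in examples if y == 0]
    # "Safe" features are never 1 on a negative example (infinite penalty otherwise).
    safe = [i for i in range(n) if all(x[i] == 0 for x in negatives)]
    chosen = []
    uncovered = list(positives)
    while uncovered and safe:
        # Pick the safe feature that covers the most still-uncovered positives.
        gains = {i: sum(x[i] == 1 for x in uncovered) for i in safe}
        best = max(gains, key=gains.get)
        if gains[best] == 0:            # no further step improves performance: halt
            break
        chosen.append(best)
        uncovered = [x for x in uncovered if x[best] == 0]
        safe.remove(best)
    return chosen                       # indices of features in the learned disjunction

S = [((1, 0, 0, 0, 0), 1), ((1, 1, 1, 0, 0), 1), ((0, 0, 0, 1, 0), 1),
     ((0, 0, 0, 0, 1), 1), ((0, 0, 0, 0, 0), 0)]
print(set_cover_disjunction(S))         # -> [0, 3, 4], i.e. x1 OR x4 OR x5
```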
Feature Selection Methods
• Feature selection methods are grouped into three classes:
  – Those that embed the selection within the induction algorithm
  – Those that use a feature selection algorithm to filter the attributes passed to the induction algorithm
  – Those that treat feature selection as a wrapper around the induction process
Embedded Approaches to Feature Selection
• For this class of algorithms, feature selection is embedded within the basic induction algorithm.
• Most algorithms for inducing logical concepts (e.g. the set-cover algorithm) add or remove features from the concept description based on prediction errors.
• For these algorithms, the feature space is also the concept space.
Embedded Approaches in Binary Feature Space
• Gives attractive results for systems learning pure conjunctive (or pure disjunctive) rules.
• The learned hypothesis uses at most a logarithmic factor more features than the smallest possible hypothesis.
• Also applies in settings where the target hypothesis is characterized by a conjunction (or disjunction) of functions produced by induction algorithms, e.g. algorithms for learning DNF in O(n^log n) time [1].

[1] Verbeurgt, 1990
Embedded Approaches for Complex Logical Concepts
• In this approach, the core method adds/removes features to induce complex logical concepts.
• Examples: ID3 [1] and C4.5 [2]
• Greedy search through the space of decision trees
• At each stage, select the attribute that best discriminates among the classes using an evaluation function (usually based on information theory; a sketch follows below)

[1] Quinlan, 1983
[2] Quinlan, 1993
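As an illustration of such an information-theoretic evaluation function, the following sketch (generic, not ID3's or C4.5's actual code) computes the information gain of a candidate attribute over a set of labeled examples:

```python
# Information gain: the attribute chosen at each stage is the one with the highest gain.
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(examples, attr):
    """examples: list of (feature_dict, label); attr: feature name."""
    labels = [y for _, y in examples]
    base = entropy(labels)
    by_value = {}
    for x, y in examples:                       # partition examples by the attribute's value
        by_value.setdefault(x[attr], []).append(y)
    remainder = sum(len(ys) / len(examples) * entropy(ys) for ys in by_value.values())
    return base - remainder
```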
Embedded Approaches: Scalability Issues
• Experimental studies [1] suggest that decision-list learners scale linearly with the number of irrelevant features for some target concepts.
• For other target concepts, they exhibit exponential growth.
• Kira and Rendell (1992) show a substantial decrease in accuracy when irrelevant features are inserted into a Boolean target concept.

[1] Langley and Sage, 1997
Embedded Approaches: Remedies
• The problems are caused by the reliance on greedy selection of attributes to discriminate among the classes.
• Some researchers [1] have attempted to replace the greedy approach with look-ahead techniques.
• Others let the greedy search take larger steps [2].
• None of these has been able to handle scaling effectively.

[1] Norton, 1989
[2] Matheus and Rendell, 1989; Pagallo and Haussler, 1990
John et al, 1994.
Filter Approaches
• Feature selection is done based on some general characteristics of the training set.
• Independent of the induction algorithm used, and thus can be combined with any such method.
Blum et al, 1997.
A Simple Filtering Scheme
• Evaluate each feature individually based on its correlation with the target function.
• Select the k features with the highest value.
• The best choice of k can be determined by testing on a holdout set.
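A minimal sketch of this scheme (details such as the correlation measure are our own choices, not prescribed by the paper):

```python
# Filter: score each feature by its absolute correlation with the labels, keep the k best.
import numpy as np

def top_k_by_correlation(X, y, k):
    """X: (m, n) array of feature values, y: length-m array of labels."""
    scores = []
    for j in range(X.shape[1]):
        c = np.corrcoef(X[:, j], y)[0, 1]
        scores.append(0.0 if np.isnan(c) else abs(c))   # constant features score 0
    return list(np.argsort(scores)[::-1][:k])           # indices of the k best features

# The best k can itself be chosen by testing each candidate value on a holdout set.
```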
Almuallim and Dietterich, 1991.
FOCUS Algorithm
• Looks for the minimal combinations of attributes that perfectly discriminate among the classes.
• Halts only when a pure partition of the training set is generated.
• Performance: under similar conditions, FOCUS was almost unaffected by the introduction of irrelevant attributes, whereas decision-tree accuracy degraded significantly.
[Figure: candidate feature subsets { f1, f2, f3, …, fn }]
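A rough sketch of the FOCUS idea (our own illustration, not the original system): breadth-first search over feature subsets of increasing size, halting at the first subset that induces a pure partition of the training set.

```python
# FOCUS-style search for a minimal consistent feature subset.
from itertools import combinations

def focus(examples):
    """examples: list of (feature_tuple, label). Returns a minimal consistent subset."""
    n = len(examples[0][0])
    for size in range(n + 1):                        # smallest subsets first (breadth-first)
        for subset in combinations(range(n), size):
            seen = {}
            consistent = True
            for x, y in examples:
                key = tuple(x[i] for i in subset)
                if key in seen and seen[key] != y:   # conflicting labels: impure partition
                    consistent = False
                    break
                seen[key] = y
            if consistent:
                return subset                        # first consistent subset is minimal
    return tuple(range(n))
```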
Blum et al, 1997.
Comparing Various Filter Approaches
AUTHORS (SYSTEM)      STARTING POINT   SEARCH CONTROL   HALTING CRITERION   INDUCTION ALGORITHM
Almuallim (FOCUS)     None             Breadth-first    Consistency         Decision tree
Cardie                None             Greedy           Consistency         Nearest neighbor
Koller/Sahami         All              Greedy           Threshold           Tree/Bayes
Kubat et al.          None             Greedy           Consistency         Naïve Bayes
Singh/Provan          None             Greedy           No info. gain       Bayes net
John et al, 1994.
Wrapper Approaches (1/2)
• Motivation: The features selected should depend not only on the relevance of the data, but also on the learning algorithm.
Wrapper Approaches (2/2)
• Advantage: The inductive method that uses the feature subset provides a better estimate of accuracy than a separate measure that may have an entirely different inductive bias.
• Disadvantage: Computational cost, which results from calling the induction algorithm for each feature set considered.
• Modifications:
  – Caching decision trees
  – Reducing the percentage of training cases used
Langley and Sage, 1994.
OBLIVION Algorithm
• Carries out a backward elimination search through the space of feature sets.
• Start with all features and iteratively remove the one whose removal leads to the greatest improvement in the estimated accuracy.
• Continue this process as long as the estimated accuracy does not decrease; halt when any further removal makes it worse.
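A rough sketch of wrapper-style backward elimination in this spirit (our own illustration, not OBLIVION itself; `estimate_accuracy` stands in for the induction algorithm plus an accuracy estimate such as cross-validation):

```python
# Wrapper backward elimination: start from all features, drop the feature whose
# removal gives the best estimated accuracy, stop when every removal makes it worse.
def backward_elimination(all_features, estimate_accuracy):
    current = set(all_features)
    best_acc = estimate_accuracy(current)
    while len(current) > 1:
        candidates = [(estimate_accuracy(current - {f}), f) for f in current]
        acc, worst = max(candidates, key=lambda t: t[0])  # removal that hurts least / helps most
        if acc < best_acc:                                # halting criterion: things get worse
            break
        current.remove(worst)
        best_acc = acc
    return current
```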
Blum et al, 1997.
Comparing Various Wrapper Approaches

AUTHORS (SYSTEM)            STARTING POINT   SEARCH CONTROL   HALTING CRITERION   INDUCTION ALGORITHM
Caruana/Freitag (CAP)       Comparison       Greedy           All used            Decision tree
John/Kohavi/Pfleger         Comparison       Greedy           No better           Decision tree
Langley/Sage (OBLIVION)     All              Greedy           Worse               Nearest neighbor
Langley/Sage (SEL. BAYES)   None             Greedy           Worse               Naïve Bayes
Moore/Lee (RACE)            Comparison       Greedy           No better           Nearest neighbor
Singh/Provan (K2-AS)        None             Greedy           Worse               Bayes net
Skalak                      Random           Mutation         Enough times        Nearest neighbor
Feature Selection vs. Feature Weighting

FEATURE SELECTION
• Explicitly attempts to select a 'most relevant' subset of features.
• Most natural when the result is to be understood by humans, or fed into another algorithm.
• Most commonly characterized in terms of heuristic search.

FEATURE WEIGHTING
• Assigns degrees of perceived relevance to features via a weighting function.
• Easier to implement in on-line, incremental settings.
• Most common techniques involve some form of gradient descent, updating weights in successive passes through the training instances.
Littlestone, 1988.
Winnow Algorithm
1. Initialize the weights w1, w2, …, wn of the features to 1.
2. Given an example (x1, …, xn), output 1 if w1x1 + … + wnxn ≥ n, and output 0 otherwise.
3. If the algorithm predicts negative on a positive example: for each xi equal to 1, double the value of wi.
4. If the algorithm predicts positive on a negative example: for each xi equal to 1, cut the value of wi in half.
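A direct rendering of these steps in Python (the training loop over repeated passes is our own framing):

```python
# Winnow (Littlestone, 1988) as described in the steps above.
def winnow_train(examples, n, passes=10):
    """examples: list of (x, y) with x a 0/1 vector of length n, y in {0, 1}."""
    w = [1.0] * n                                   # step 1: all weights start at 1
    for _ in range(passes):
        for x, y in examples:
            pred = 1 if sum(w[i] * x[i] for i in range(n)) >= n else 0   # step 2: threshold n
            if pred == 0 and y == 1:                # step 3: predicted negative on a positive
                w = [w[i] * 2 if x[i] == 1 else w[i] for i in range(n)]
            elif pred == 1 and y == 0:              # step 4: predicted positive on a negative
                w = [w[i] / 2 if x[i] == 1 else w[i] for i in range(n)]
    return w
```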
References
• A.L. Blum and P. Langley, Selection of relevant features and examples in machine learning. Artificial Intelligence 97 (1–2), pp. 245–271, (1997).
• D. Aha, A study of instance-based algorithms for supervised learning tasks: mathematical, empirical and psychological evaluations. University of California, Irvine, CA, (1990).
• K. Verbeurgt, Learning DNF under the uniform distribution in quasi-polynomial time. In: Proceedings 3rd Annual Workshop on Computational Learning Theory, San Francisco, CA, Morgan Kaufmann, San Mateo, CA, pp. 314–325, (1990).
• T.M. Cover and P.E. Hart, Nearest neighbor pattern classification. IEEE Transactions on Information Theory 13, pp. 21–27, (1967).
• P. Langley and W. Iba, Average-case analysis of a nearest neighbor algorithm. In: Proceedings IJCAI-93, pp. 889–894, (1993).
References (contd…)
• G.H. John, R. Kohavi and K. Pfleger, Irrelevant features and the subset selection problem. In: Proceedings 11th International Conference on Machine Learning, New Brunswick, NJ, Morgan Kaufmann, San Mateo, CA, pp. 121–129, (1994).
• J.R. Quinlan, Learning efficient classification procedures and their application to chess end games. In: R.S. Michalski, J.G. Carbonell and T.M. Mitchell (Eds.), Machine Learning: An Artificial Intelligence Approach, Morgan Kaufmann, San Mateo, CA, (1983).
• J.R. Quinlan, C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA, (1993).
• C.J. Matheus and L.A. Rendell, Constructive induction on decision trees. In: Proceedings IJCAI-89, Detroit, MI, Morgan Kaufmann, San Mateo, CA, pp. 645–650, (1989).
• N. Littlestone, Learning quickly when irrelevant attributes abound: a new linear-threshold algorithm. Machine Learning 2, pp. 285–318, (1988).