Feb 23, 2016
Selection of Relevant Features and Examples in Machine Learning
Paper By: Avrim L. Blum and Pat Langley
Presented By: Arindam Bhattacharya (10305002), Akshat Malu (10305012), Yogesh Kakde (10305039), Tanmay Haldankar (10305911)
Overview
• Introduction
• Selecting Relevant Features
  – Embedded Approaches
  – Filter Approaches
  – Wrapper Approaches
  – Feature Weighting Approaches
• Selecting Relevant Examples
  – Selecting Labeled Data
  – Selecting Unlabeled Data
• Challenges and Future Work
Introduction
• Machine learning is addressing larger and more complex tasks.
• The Internet contains a huge volume of low-quality information.
• We focus on:
  – Selecting the most relevant features
  – Selecting the most relevant examples
Problems of Irrelevant Features
• Not helpful in classification
• Slow down the learning process [1]
• The number of training examples required grows exponentially with the number of irrelevant features [2]

[1] Cover and Hart, 1967
[2] Langley and Iba, 1993
Blum et al., 1997
Definitions of Relevance
Definition 1: Relevant to the Target
A feature xi is relevant to a target concept C if there exists a pair of examples A and B such that A and B differ only in feature xi and C(A) ≠ C(B).
John, Kohavi and Pfleger (1994)
Definitions of Relevance
Definition 2: Strongly Relevant to the Sample
A feature xi is said to be strongly relevant to the sample S if there exist examples A and B in S that differ only in feature xi and have different labels.
John, Kohavi and Pfleger (1994)
Definitions of Relevance
Definition 3: Weakly Relevant to the Sample
A feature xi is said to be weakly relevant to the sample S if it is possible to remove a subset of the features so that xi becomes strongly relevant.
Blum et al, 1997
Definitions of Relevance
Definition 4: Relevance as a Complexity Measure
Given a sample S and a set of concepts C, let r(S, C) be the number of features relevant (using Definition 1) to a concept in C that, out of all those whose error over S is least, has the fewest relevant features.
Caruana and Freitag, 1994
Definitions of Relevance
Definition 5: Incremental Usefulness
Given a sample S, a learning algorithm L, and a feature set A, feature xi is incrementally useful to L if the accuracy of the hypothesis that L produces using the feature set {xi} ∪ A is better than the accuracy achieved using just the feature set A.
Example
Suppose that concepts can be expressed as disjunctions and that the algorithm sees the following examples:
x1 x2 x3 x4 x5 | label
 1  0  0  0  0 | +
 1  1  1  0  0 | +
 0  0  0  1  0 | +
 0  0  0  0  1 | +
 0  0  0  0  0 | -
Example (continued)
• Using Definitions 2 and 3, x1 is strongly relevant while x2 is weakly relevant.
• Using Definition 4, there are three relevant features (r(S, C) = 3).
• Using Definition 5, given the feature set {x1, x2}, the third feature may not be useful, but features x4 and x5 would be useful.
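To make Definition 2 concrete on this sample, here is a small Python check (our own illustration, not code from the paper): a feature xi is strongly relevant if some pair of examples in S differ only in xi and have different labels.

```python
# Strong relevance (Definition 2) on the toy sample above: our own sketch.
from itertools import combinations

S = [((1, 0, 0, 0, 0), '+'), ((1, 1, 1, 0, 0), '+'), ((0, 0, 0, 1, 0), '+'),
     ((0, 0, 0, 0, 1), '+'), ((0, 0, 0, 0, 0), '-')]

def strongly_relevant(i, sample):
    # xi is strongly relevant if two examples differ only in position i
    # and carry different labels.
    for (a, ya), (b, yb) in combinations(sample, 2):
        differing = [j for j in range(len(a)) if a[j] != b[j]]
        if differing == [i] and ya != yb:
            return True
    return False

print([f"x{i + 1}" for i in range(5) if strongly_relevant(i, S)])  # -> ['x1', 'x4', 'x5']
```

Note that x4 and x5 also come out strongly relevant here, consistent with r(S, C) = 3 above; x2 only becomes strongly relevant after removing a subset of the other features (e.g. x1 and x3), which is exactly Definition 3.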
Feature Selection as Heuristic Search
Heuristic search is an ideal paradigm for feature selection algorithms.
Feature Selection as Heuristic Search
• Search space: the set of feature subsets, organized as a partial order
Four Basic Issues
• Where to start?
  – Forward selection
  – Backward elimination
Four Basic Issues
• How to organize the search?
  – Exhaustive search: 2^n possibilities for n attributes
  – Greedy search: hill climbing, best-first search
Four Basic Issues
• Which alternative is better? (the strategy used to evaluate alternative feature subsets)
  – Accuracy on the training set or on a separate evaluation set
  – Interaction between feature selection and the basic induction method
Four Basic Issues
• When to stop?
  – Stop when nothing improves
  – Continue until things worsen
  – Reach the end of the search and select the best subset seen
  – Stop when each combination of selected features maps to a single class
  – Order features by relevance and determine a break point
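The four issues can be tied together in one sketch. The following fragment (our own illustration; `estimate_accuracy` is a placeholder for any evaluation function, e.g. hold-out accuracy of an induction algorithm) performs greedy forward selection: start from the empty set, add one feature per step, and stop when no addition improves the estimate.

```python
# Greedy forward selection: a sketch of the four issues, not a specific system.
def forward_selection(all_features, estimate_accuracy):
    selected = set()
    best_acc = estimate_accuracy(selected)            # where to start: the empty set
    while True:
        remaining = [f for f in all_features if f not in selected]
        if not remaining:
            break
        # how to organize: hill climbing, one feature added per step
        scored = [(estimate_accuracy(selected | {f}), f) for f in remaining]
        acc, best_f = max(scored, key=lambda t: t[0])  # which is better: the evaluation function decides
        if acc <= best_acc:                            # when to stop: nothing improves
            break
        selected.add(best_f)
        best_acc = acc
    return selected
```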
An Example – Set Cover Algorithm
1. Begin with the disjunction of zero features.
2. While any safe feature that improves performance remains: from the safe features, select the one that maximizes the number of correctly classified positive examples.
3. Output the selected features.

• Begins at the left of the search space (the empty feature set) and incrementally moves right.
• Evaluates candidates by performance on the training set, with an infinite penalty for misclassifying a negative example (so only "safe" features, which keep all negatives correctly classified, are ever added).
• Halts when no further step improves performance.
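A rough Python rendering of this procedure for monotone disjunctions (our own sketch, not the paper's implementation):

```python
# Greedy set-cover style learner for disjunctions: a sketch of the steps above.
def set_cover_disjunction(examples):
    """examples: list of (0/1 feature tuple, label) with label 1 = positive."""
    n = len(examples[0][0])
    positives = [x for x, y in examples if y == 1]
    negatives = [x for x, y in examples if y == 0]
    # "Safe" features are never 1 on a negative example (infinite penalty otherwise).
    safe = [i for i in range(n) if all(x[i] == 0 for x in negatives)]
    chosen = []
    uncovered = list(positives)
    while uncovered and safe:
        # Pick the safe feature that covers the most still-uncovered positives.
        gains = {i: sum(x[i] == 1 for x in uncovered) for i in safe}
        best = max(gains, key=gains.get)
        if gains[best] == 0:            # no further step improves performance: halt
            break
        chosen.append(best)
        uncovered = [x for x in uncovered if x[best] == 0]
        safe.remove(best)
    return chosen                       # indices of features in the learned disjunction

S = [((1, 0, 0, 0, 0), 1), ((1, 1, 1, 0, 0), 1), ((0, 0, 0, 1, 0), 1),
     ((0, 0, 0, 0, 1), 1), ((0, 0, 0, 0, 0), 0)]
print(set_cover_disjunction(S))         # -> [0, 3, 4], i.e. x1 OR x4 OR x5
```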
Feature Selection Methods
• Feature selection methods are grouped into three classes:
  – Those that embed the selection within the induction algorithm
  – Those that use a feature selection algorithm to filter the attributes passed to the induction algorithm
  – Those that treat feature selection as a wrapper around the induction process
Embedded Approaches to Feature Selection
• For this class of algorithms, feature selection is embedded within the basic induction algorithm.
• Most algorithms for inducing logical concepts (e.g. the set-cover algorithm) add or remove features from the concept description based on prediction errors.
• For these algorithms, the feature space is also the concept space.
Embedded Approaches in Binary Feature Space
• Gives attractive results for systems learning pure conjunctive (or pure disjunctive) rules.
• The learned hypothesis uses at most a logarithmic factor more features than the smallest possible hypothesis.
• Also applies in settings where the target hypothesis is characterized by a conjunction (or disjunction) of functions produced by induction algorithms, e.g. algorithms for learning DNF in O(n^log n) time [1].

[1] Verbeurgt, 1990
Embedded Approaches for Complex Logical Concepts
• In this approach, the core method adds/removes features to induce complex logical concepts.
• Examples: ID3 [1] and C4.5 [2]
• Greedy search through the space of decision trees
• At each stage, select the attribute that best discriminates among the classes using an evaluation function (usually based on information theory; a sketch follows below)

[1] Quinlan, 1983
[2] Quinlan, 1993
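As an illustration of such an information-theoretic evaluation function, the following sketch (generic, not ID3's or C4.5's actual code) computes the information gain of a candidate attribute over a set of labeled examples:

```python
# Information gain: the attribute chosen at each stage is the one with the highest gain.
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(examples, attr):
    """examples: list of (feature_dict, label); attr: feature name."""
    labels = [y for _, y in examples]
    base = entropy(labels)
    by_value = {}
    for x, y in examples:                       # partition examples by the attribute's value
        by_value.setdefault(x[attr], []).append(y)
    remainder = sum(len(ys) / len(examples) * entropy(ys) for ys in by_value.values())
    return base - remainder
```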
Embedded Approaches: Scalability Issues
• Experimental studies [1] suggest that decision-list learners scale linearly with the number of irrelevant features for some target concepts.
• For other target concepts, they exhibit exponential growth.
• Kira and Rendell (1992) show a substantial decrease in accuracy when irrelevant features are inserted into a Boolean target concept.

[1] Langley and Sage, 1997
Embedded Approaches: Remedies
• The problems are caused by the reliance on greedy selection of attributes to discriminate among the classes.
• Some researchers [1] have attempted to replace the greedy approach with look-ahead techniques.
• Others let the greedy search take larger steps [2].
• None of these has been able to handle scaling effectively.

[1] Norton, 1989
[2] Matheus and Rendell, 1989; Pagallo and Haussler, 1990
John et al, 1994.
Filter Approaches
• Feature selection is done based on some general characteristics of the training set.
• Independent of the induction algorithm used, and thus can be combined with any such method.
Blum et al, 1997.
A Simple Filtering Scheme
• Evaluate each feature individually based on its correlation with the target function.
• Select the k features with the highest value.
• The best choice of k can be determined by testing on a holdout set.
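A minimal sketch of this scheme (details such as the correlation measure are our own choices, not prescribed by the paper):

```python
# Filter: score each feature by its absolute correlation with the labels, keep the k best.
import numpy as np

def top_k_by_correlation(X, y, k):
    """X: (m, n) array of feature values, y: length-m array of labels."""
    scores = []
    for j in range(X.shape[1]):
        c = np.corrcoef(X[:, j], y)[0, 1]
        scores.append(0.0 if np.isnan(c) else abs(c))   # constant features score 0
    return list(np.argsort(scores)[::-1][:k])           # indices of the k best features

# The best k can itself be chosen by testing each candidate value on a holdout set.
```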
Almuallim and Dietterich, 1991.
FOCUS Algorithm
• Looks for the minimal combinations of attributes that perfectly discriminate among the classes.
• Halts only when a pure partition of the training set is generated.
• Performance: under similar conditions, FOCUS was almost unaffected by the introduction of irrelevant attributes, whereas decision-tree accuracy degraded significantly.
[Figure: candidate feature subsets { f1, f2, f3, …, fn }]
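A rough sketch of the FOCUS idea (our own illustration, not the original system): breadth-first search over feature subsets of increasing size, halting at the first subset that induces a pure partition of the training set.

```python
# FOCUS-style search for a minimal consistent feature subset.
from itertools import combinations

def focus(examples):
    """examples: list of (feature_tuple, label). Returns a minimal consistent subset."""
    n = len(examples[0][0])
    for size in range(n + 1):                        # smallest subsets first (breadth-first)
        for subset in combinations(range(n), size):
            seen = {}
            consistent = True
            for x, y in examples:
                key = tuple(x[i] for i in subset)
                if key in seen and seen[key] != y:   # conflicting labels: impure partition
                    consistent = False
                    break
                seen[key] = y
            if consistent:
                return subset                        # first consistent subset is minimal
    return tuple(range(n))
```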
Blum et al, 1997.
Comparing Various Filter Approaches
AUTHORS (SYSTEM)      STARTING POINT   SEARCH CONTROL   HALTING CRITERION   INDUCTION ALGORITHM
Almuallim (FOCUS)     None             Breadth-first    Consistency         Decision tree
Cardie                None             Greedy           Consistency         Nearest neighbor
Koller/Sahami         All              Greedy           Threshold           Tree/Bayes
Kubat et al.          None             Greedy           Consistency         Naïve Bayes
Singh/Provan          None             Greedy           No info. gain       Bayes net
John et al, 1994.
Wrapper Approaches (1/2)
• Motivation: The features selected should depend not only on the relevance of the data, but also on the learning algorithm.
Wrapper Approaches (2/2)
• Advantage: The inductive method that uses the feature subset provides a better estimate of accuracy than a separate measure that may have an entirely different inductive bias.
• Disadvantage: Computational cost, which results from calling the induction algorithm for each feature set considered.
• Modifications:
  – Caching decision trees
  – Reducing the percentage of training cases used
Langley and Sage, 1994.
OBLIVION Algorithm
• Carries out a backward elimination search through the space of feature sets.
• Start with all features and iteratively remove the one whose removal leads to the greatest improvement in the estimated accuracy.
• Continue this process as long as the estimated accuracy does not decrease; halt when any further removal makes it worse.
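A rough sketch of wrapper-style backward elimination in this spirit (our own illustration, not OBLIVION itself; `estimate_accuracy` stands in for the induction algorithm plus an accuracy estimate such as cross-validation):

```python
# Wrapper backward elimination: start from all features, drop the feature whose
# removal gives the best estimated accuracy, stop when every removal makes it worse.
def backward_elimination(all_features, estimate_accuracy):
    current = set(all_features)
    best_acc = estimate_accuracy(current)
    while len(current) > 1:
        candidates = [(estimate_accuracy(current - {f}), f) for f in current]
        acc, worst = max(candidates, key=lambda t: t[0])  # removal that hurts least / helps most
        if acc < best_acc:                                # halting criterion: things get worse
            break
        current.remove(worst)
        best_acc = acc
    return current
```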
Blum et al, 1997.
Comparing Various Wrapper Approaches

AUTHORS (SYSTEM)            STARTING POINT   SEARCH CONTROL   HALTING CRITERION   INDUCTION ALGORITHM
Caruana/Freitag (CAP)       Comparison       Greedy           All used            Decision tree
John/Kohavi/Pfleger         Comparison       Greedy           No better           Decision tree
Langley/Sage (OBLIVION)     All              Greedy           Worse               Nearest neighbor
Langley/Sage (SEL. BAYES)   None             Greedy           Worse               Naïve Bayes
Moore/Lee (RACE)            Comparison       Greedy           No better           Nearest neighbor
Singh/Provan (K2-AS)        None             Greedy           Worse               Bayes net
Skalak                      Random           Mutation         Enough times        Nearest neighbor
Feature Selection vs. Feature Weighting

FEATURE SELECTION
• Explicitly attempts to select a 'most relevant' subset of features.
• Most natural when the result is to be understood by humans, or fed into another algorithm.
• Most commonly characterized in terms of heuristic search.

FEATURE WEIGHTING
• Assigns degrees of perceived relevance to features via a weighting function.
• Easier to implement in on-line, incremental settings.
• Most common techniques involve some form of gradient descent, updating weights in successive passes through the training instances.
Littlestone, 1988.
Winnow Algorithm
1. Initialize the weights w1, w2, …, wn of the features to 1.
2. Given an example (x1, …, xn), output 1 if w1x1 + … + wnxn ≥ n, and output 0 otherwise.
3. If the algorithm predicts negative on a positive example: for each xi equal to 1, double the value of wi.
4. If the algorithm predicts positive on a negative example: for each xi equal to 1, cut the value of wi in half.
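A direct rendering of these steps in Python (the training loop over repeated passes is our own framing):

```python
# Winnow (Littlestone, 1988) as described in the steps above.
def winnow_train(examples, n, passes=10):
    """examples: list of (x, y) with x a 0/1 vector of length n, y in {0, 1}."""
    w = [1.0] * n                                   # step 1: all weights start at 1
    for _ in range(passes):
        for x, y in examples:
            pred = 1 if sum(w[i] * x[i] for i in range(n)) >= n else 0   # step 2: threshold n
            if pred == 0 and y == 1:                # step 3: predicted negative on a positive
                w = [w[i] * 2 if x[i] == 1 else w[i] for i in range(n)]
            elif pred == 1 and y == 0:              # step 4: predicted positive on a negative
                w = [w[i] / 2 if x[i] == 1 else w[i] for i in range(n)]
    return w
```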
References
• A.L. Blum and P. Langley, Selection of relevant features and examples in machine learning. Artificial Intelligence 97 (1–2), pp. 245–271, (1997).
• D. Aha, A study of instance-based algorithms for supervised learning tasks: mathematical, empirical and psychological evaluations. University of California, Irvine, CA, (1990).
• K. Verbeurgt, Learning DNF under the uniform distribution in quasi-polynomial time. In: Proceedings 3rd Annual Workshop on Computational Learning Theory, San Francisco, CA, Morgan Kaufmann, San Mateo, CA, pp. 314–325, (1990).
• T.M. Cover and P.E. Hart, Nearest neighbor pattern classification. IEEE Transactions on Information Theory 13, pp. 21–27, (1967).
• P. Langley and W. Iba, Average-case analysis of a nearest neighbor algorithm. In: Proceedings IJCAI-93, pp. 889–894, (1993).
References (contd…)
• G.H. John, R. Kohavi and K. Pfleger, Irrelevant features and the subset selection problem. In: Proceedings 11th International Conference on Machine Learning, New Brunswick, NJ, Morgan Kaufmann, San Mateo, CA, pp. 121–129, (1994).
• J.R. Quinlan, Learning efficient classification procedures and their application to chess end games. In: R.S. Michalski, J.G. Carbonell and T.M. Mitchell (Eds.), Machine Learning: An Artificial Intelligence Approach, Morgan Kaufmann, San Mateo, CA, (1983).
• J.R. Quinlan, C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA, (1993).
• C.J. Matheus and L.A. Rendell, Constructive induction on decision trees. In: Proceedings IJCAI-89, Detroit, MI, Morgan Kaufmann, San Mateo, CA, pp. 645–650, (1989).
• N. Littlestone, Learning quickly when irrelevant attributes abound: a new linear-threshold algorithm. Machine Learning 2, pp. 285–318, (1988).