ASSESSING LEARNING ALGORITHMS Yılmaz KILIÇASLAN


Dec 13, 2015

Transcript
Page 1

ASSESSING LEARNING ALGORITHMS

Yılmaz KILIÇASLAN

Page 2

Assessing the performance of the learning algorithm

A learning algorithm is good if it produces hypotheses that do a good job of predicting the classifications of unseen examples.

Later on, we may see how prediction quality can be estimated in advance.

For now, we will look at a methodology for assessing prediction quality after the fact.

Page 3

Assessing the quality of a hypothesis

A methodology for assessing the quality of a hypothesis:

1. Collect a large set of examples.

2. Divide it into two disjoint sets:

a. the training set and

b. the test set.

3. Apply the learning algorithm to the training set, generating a hypothesis h.

4. Measure the percentage of examples in the test set that are correctly classified by h.

5. Repeat steps 2 to 4 for different sizes of training sets and different randomly selected training sets of each size.
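The five steps above can be sketched in Python. This is a minimal illustration, not code from the slides: the dataset and the majority-class "learner" below are stand-ins, and any real learning algorithm with the same interface could be plugged in.

```python
import random

def evaluate(examples, learn, train_sizes, trials=20):
    """Holdout evaluation: for each training-set size, repeatedly split
    the data, train on one part, and score on the rest (steps 2-5)."""
    results = {}
    for size in train_sizes:
        accuracies = []
        for _ in range(trials):
            shuffled = examples[:]
            random.shuffle(shuffled)                # step 2: random disjoint split
            train, test = shuffled[:size], shuffled[size:]
            h = learn(train)                        # step 3: induce hypothesis h
            correct = sum(h(x) == y for x, y in test)
            accuracies.append(correct / len(test))  # step 4: test-set accuracy
        results[size] = sum(accuracies) / trials    # step 5: average over trials
    return results

# Toy stand-in for a real learner: always predict the training set's majority class.
def majority_learner(train):
    labels = [y for _, y in train]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority

data = [(i, i % 2 == 0) for i in range(100)]  # hypothetical 100-example dataset
curve = evaluate(data, majority_learner, train_sizes=[20, 40, 60, 80])
```

Plotting `curve` (mean test accuracy against training-set size) gives a learning curve like the one on the next slide.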

Page 4

The learning curve for the algorithm

[Figure: proportion correct on the test set plotted against training set size, 0 to 100.]

A learning curve for the decision tree algorithm on 100 randomly generated examples in the restaurant domain. The graph summarizes 20 trials.

Page 5

Evaluation Metrics: Accuracy, Recall and Precision

Accuracy: A simple metric that computes the fraction of instances for which the correct result is returned.

Recall: The fraction of correct instances among all instances that actually belong to the relevant subset.

Precision: The fraction of correct instances among those that the algorithm believes to belong to the relevant subset.

Page 6

Recall versus Precision (in information retrieval)

Precision can be seen as a measure of exactness or fidelity, whereas recall is a measure of completeness.

[Figure: Recall and precision depend on the outcome (oval) of a query and its relation to all relevant documents (left) and the non-relevant documents (right). The more correct results (green), the better. Precision corresponds to the horizontal arrow, recall to the diagonal arrow.]

Page 7

Recall and Precision in Information Retrieval

In the field of information retrieval, precision is the fraction of retrieved documents that are relevant to the search:

    precision = |{relevant documents} ∩ {retrieved documents}| / |{retrieved documents}|

Recall is the fraction of the documents relevant to the query that are successfully retrieved:

    recall = |{relevant documents} ∩ {retrieved documents}| / |{relevant documents}|
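As a toy illustration of these set-based definitions (the document IDs and query outcome below are invented, not from the slides), both measures follow directly from the retrieved and relevant sets:

```python
# Hypothetical query outcome: which of the relevant documents were retrieved?
relevant  = {"d1", "d2", "d3", "d4"}     # documents relevant to the query
retrieved = {"d1", "d2", "d5"}           # documents the system returned

hits = relevant & retrieved              # the correct results (green region)
precision = len(hits) / len(retrieved)   # exactness: 2 of 3 results are relevant
recall    = len(hits) / len(relevant)    # completeness: 2 of 4 relevant docs found
```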

Page 8

Recall and Precision in Classification - I

In the context of classification tasks, the terms true positives, true negatives, false positives and false negatives are used to compare the given classification of an item (the class label assigned to the item by a classifier) with the desired correct classification (the class the item actually belongs to). This is illustrated by the table below:

                                 Correct Result / Classification
                                 E1                    E2
    Obtained Result /   E1   tp (true positive)   fp (false positive)
    Classification      E2   fn (false negative)  tn (true negative)

Page 9

Recall and Precision in Classification - II

Recall and precision are then defined as:

    recall    = tp / (tp + fn)
    precision = tp / (tp + fp)
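These count-based definitions, together with accuracy from Page 5, translate directly into code. The counts below are hypothetical, chosen only to exercise the formulas:

```python
def metrics(tp, fp, fn, tn):
    """Accuracy, precision and recall from confusion-table counts."""
    accuracy  = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)  # among items labeled E1, how many truly are E1
    recall    = tp / (tp + fn)  # among true E1 items, how many were labeled E1
    return accuracy, precision, recall

# Invented example counts: 8 true positives, 2 false positives,
# 4 false negatives, 6 true negatives.
acc, prec, rec = metrics(tp=8, fp=2, fn=4, tn=6)
```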

Page 10

Peeking at the test data

Obviously, the learning algorithm must not be allowed to "see" the test data before the learned hypothesis is tested on them.

Unfortunately, it is all too easy to fall into the trap of peeking at the test data: a learning algorithm can have various "knobs" that can be twiddled to tune its behavior, for example, various different criteria for choosing the next attribute in decision tree learning.

We generate hypotheses for various different settings of the knobs, measure their performance on the test set, and report the prediction performance of the best hypothesis.

Alas, peeking has occurred!
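One standard way to tune the knobs without peeking is to carve a validation set out of the training data, compare knob settings on it, and touch the test set exactly once at the end. The sketch below assumes a toy threshold "learner"; the helper names and data are illustrative, not from the slides.

```python
def accuracy(h, data):
    """Fraction of (x, y) examples that hypothesis h classifies correctly."""
    return sum(h(x) == y for x, y in data) / len(data)

def pick_knob(train, valid, learn_with_knob, knob_settings):
    """Choose the knob setting that scores best on the validation set.
    The test set is never consulted here, so no peeking occurs."""
    return max(knob_settings,
               key=lambda k: accuracy(learn_with_knob(train, k), valid))

def learn(train, k):
    """Toy 'learner' with one knob k: classify x as positive when x >= k.
    (It ignores train; a real learner would fit it.)"""
    return lambda x: x >= k

# Deterministic toy splits; the true concept is x >= 50.
train = [(x, x >= 50) for x in range(0, 100, 2)]
valid = [(x, x >= 50) for x in range(1, 100, 4)]
test  = [(x, x >= 50) for x in range(3, 100, 4)]

best_k = pick_knob(train, valid, learn, knob_settings=[25, 50, 75])
final_score = accuracy(learn(train, best_k), test)  # one look at the test data
```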

Page 11

Noise and overfitting - I

Unfortunately, this is far from the whole story. It is quite possible, and in fact likely, that even when vital information is missing, the decision tree learning algorithm will find a decision tree that is consistent with all the examples.

This is because the algorithm can use the irrelevant attributes, if any, to make spurious distinctions among the examples.

Page 12

Noise and overfitting - II

Consider the problem of trying to predict the roll of a die. Suppose that experiments are carried out during an extended period of time with various dice and that the attributes describing each training example are as follows:

1. Day: the day on which the die was rolled (Mon, Tue, Wed, Thu).
2. Month: the month in which the die was rolled (Jan or Feb).
3. Color: the color of the die (Red or Blue).

As long as no two examples have identical descriptions, Decision-Tree-Learning will find an exact hypothesis, which is in fact spurious.

What we would like is for Decision-Tree-Learning to return a single leaf node with probabilities close to 1/6 for each roll.
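The effect can be sketched with a memorizing hypothesis as a stand-in for a tree grown until every leaf holds a single example (the data below are randomly generated for illustration, not from the slides): the hypothesis is consistent with all training examples, yet its test accuracy stays near the chance level of 1/6.

```python
import itertools
import random

random.seed(1)  # make this sketch reproducible

days, months, colors = ["Mon", "Tue", "Wed", "Thu"], ["Jan", "Feb"], ["Red", "Blue"]
combos = list(itertools.product(days, months, colors))  # 16 distinct descriptions

# Training set: 12 distinct descriptions, each paired with an independent roll.
train = [(attrs, random.randint(1, 6)) for attrs in random.sample(combos, 12)]
# Test set: fresh rolls -- the attributes carry no signal about the outcome.
test = [(random.choice(combos), random.randint(1, 6)) for _ in range(600)]

table = dict(train)                        # memorize: one "leaf" per example
guess = lambda attrs: table.get(attrs, 1)  # unseen description: arbitrary guess

train_acc = sum(guess(a) == r for a, r in train) / len(train)  # exactly 1.0
test_acc  = sum(guess(a) == r for a, r in test) / len(test)    # near 1/6
```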

Page 13

Decision tree pruning

Decision tree pruning is a simple technique for the treatment of overfitting.

Pruning works by preventing recursive splitting on attributes that are not clearly relevant, even when the data at that node in the tree are not uniformly classified.

The information gain is a good clue to irrelevance.

But, how large a gain should we require in order to split on a particular attribute?

Page 14

Statistical significance test

We can answer the preceding question by using a statistical significance test.

Such a test begins by assuming that there is no underlying pattern (the so-called null hypothesis). Then the actual data are analyzed to calculate the extent to which they deviate from a perfect absence of pattern. If the degree of deviation is statistically unlikely (usually taken to mean a 5% probability or less), then that is considered to be good evidence for the presence of a significant pattern in the data.

The probabilities are calculated from standard distributions of the amount of deviation one would expect to see in random sampling.

Page 15

Deviation from the null hypothesis

Suppose that a split on an attribute divides the p positive and n negative examples at a node into v subsets, where the k-th subset contains p_k positive and n_k negative examples. Under the null hypothesis (the attribute is irrelevant), the expected numbers of positive and negative examples in subset k are:

    p̂_k = p × (p_k + n_k) / (p + n)
    n̂_k = n × (p_k + n_k) / (p + n)

Page 16

χ² (chi-squared) distribution

A convenient measure of the total deviation is given by

    D = Σ_{k=1..v} [ (p_k - p̂_k)² / p̂_k + (n_k - n̂_k)² / n̂_k ]

where p_k and n_k are the observed positive and negative counts in the k-th of the v subsets induced by the split, and p̂_k and n̂_k are their expected values under the null hypothesis.

Under the null hypothesis, the value of D is distributed according to the χ² (chi-squared) distribution with v - 1 degrees of freedom.

The probability that the attribute is really irrelevant can be calculated with the help of standard χ² tables or with statistical software.
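The test can be sketched as follows. The critical values are the standard 5% χ² quantiles, and the two example splits at the end are invented for illustration: one attribute separates the classes perfectly, the other carries no information.

```python
def chi_squared_deviation(branches, p, n):
    """Total deviation D for a split. `branches` is a list of (p_k, n_k)
    counts in each child; p, n are the parent's totals. Under the null
    hypothesis, each child keeps the parent's positive/negative proportions."""
    d = 0.0
    for p_k, n_k in branches:
        size = p_k + n_k
        p_hat = p * size / (p + n)  # expected positives in this child
        n_hat = n * size / (p + n)  # expected negatives in this child
        d += (p_k - p_hat) ** 2 / p_hat + (n_k - n_hat) ** 2 / n_hat
    return d

# 5%-level critical values of the chi-squared distribution, indexed by
# degrees of freedom (v - 1 for a split into v branches).
CHI2_CRITICAL_05 = {1: 3.84, 2: 5.99, 3: 7.81, 4: 9.49, 5: 11.07}

def split_is_significant(branches, p, n):
    """Keep the split only if D exceeds the 5% critical value."""
    return chi_squared_deviation(branches, p, n) > CHI2_CRITICAL_05[len(branches) - 1]
```

For example, splitting 8 positives and 8 negatives into branches (8, 0) and (0, 8) gives D = 16, well above 3.84, so the attribute is kept; branches (4, 4) and (4, 4) give D = 0 and would be pruned.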

Page 17

Reference

Russell, S. and Norvig, P. (2003). Artificial Intelligence: A Modern Approach (2nd ed.). Prentice Hall.