Evaluation
Dec 30, 2015
Slide 1
Evaluation
Slide 2
Interactive decision tree construction
• Load segment-challenge.arff; look at the dataset
• Select UserClassifier (tree classifier)
• Use the test set segment-test.arff
• Examine data visualizer and tree visualizer
• Plot region-centroid-row vs intensity-mean
• Rectangle, Polygon and Polyline selection tools
… several selections …
• Right click in Tree visualizer and Accept the tree
Over to you: how well can you do?
Be a classifier!
Slide 3
Build a tree: what strategy did you use?
Given enough time, you could produce a “perfect”
tree for the dataset
• but would it perform well on the test set?
Be a classifier!
Slide 4
[Diagram: training data → ML algorithm → classifier → Deploy!; test data → classifier → evaluation results]
Training and Testing
Slide 5
[Diagram: training data → ML algorithm → classifier → Deploy!; test data → classifier → evaluation results]
Basic assumption: training and test sets produced by independent sampling from an infinite population
Training and Testing
Slide 6
Use J48 to analyze the segment dataset
• Open file segment‐challenge.arff
• Choose J48 decision tree learner (trees>J48)
• Supplied test set segment‐test.arff
• Run it: 96% accuracy
• Evaluate on training set: 99% accuracy
• Evaluate on percentage split: 95% accuracy
• Do it again: get exactly the same result!
Training and Testing
Slide 7
Basic assumption:
• training and test sets sampled independently
from an infinite population
Just one dataset? — hold some out for testing
Expect slight variation in results… but Weka
produces the same results each time… Why?
• E.g. J48 on segment‐challenge dataset
Training and Testing
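The reason the result repeats exactly is that the shuffling behind a percentage split is driven by a seeded random-number generator: the same seed always produces the same partition. A minimal Python sketch of the idea (not Weka's actual code; `holdout_split` and its parameters are made up for illustration):

```python
import random

def holdout_split(n_instances, test_fraction, seed):
    """Shuffle instance indices with a fixed seed, then split.

    Illustrative sketch: the same seed gives the same shuffle,
    hence the same train/test partition and the same accuracy
    on every run.
    """
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)   # seeded RNG -> deterministic
    n_test = int(n_instances * test_fraction)
    return indices[n_test:], indices[:n_test]   # train, test

# Same seed twice: identical splits, which is why results repeat exactly.
split_a = holdout_split(100, 0.34, seed=1)
split_b = holdout_split(100, 0.34, seed=1)
assert split_a == split_b

# A different seed gives a different split, hence a different result.
split_c = holdout_split(100, 0.34, seed=2)
print(split_a == split_c)
```

Changing the seed (as the next slide does) is what exposes the underlying variation.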
Slide 8
Evaluate J48 on segment‐challenge
• With segment‐challenge and J48 (trees>J48)
• Set percentage split to 90%
• Run it: 96.7% accuracy
• [More options] Repeat with a different random seed
• Use 2, 3, 4, 5, 6, 7, 8, 9, 10
Repeated Training and Testing
Results with seeds 1–10: 0.967, 0.940, 0.940, 0.967, 0.953, 0.967, 0.920, 0.947, 0.933, 0.947
Slide 9
Results with seeds 1–10: 0.967, 0.940, 0.940, 0.967, 0.953, 0.967, 0.920, 0.947, 0.933, 0.947
Sample mean: x̄ = Σ xᵢ / n
Variance: σ² = Σ (xᵢ − x̄)² / (n − 1)
Standard deviation: σ
x̄ = 0.949, σ = 0.0158
Repeated Training and Testing
Evaluate J48 on segment‐challenge
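The summary statistics can be checked with a few lines of Python, using the accuracy figures quoted on the slide (with these figures the mean works out to about 0.948, and the standard deviation matches the slide's 0.0158):

```python
import math

# The ten percentage-split accuracies from the slide (seeds 1-10).
accuracies = [0.967, 0.940, 0.940, 0.967, 0.953,
              0.967, 0.920, 0.947, 0.933, 0.947]

n = len(accuracies)
mean = sum(accuracies) / n                                     # sample mean
variance = sum((x - mean) ** 2 for x in accuracies) / (n - 1)  # n-1: sample variance
std_dev = math.sqrt(variance)

print(f"mean = {mean:.3f}, std dev = {std_dev:.4f}")
```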
Slide 10
Basic assumption:
• training and test sets sampled independently
from an infinite population
Expect slight variation in results … get it by
setting the random‐number seed
Can calculate mean and standard deviation
experimentally
Repeated Training and Testing
Slide 11
Use diabetes dataset and default holdout
• Open file diabetes.arff
• Test option: Percentage split
Try these classifiers:
• trees > J48: 76%
• bayes > NaiveBayes: 77%
• lazy > IBk: 73%
• rules > PART: 74%
768 instances (500 negative, 268 positive)
Always guess “negative”: 500/768 = 65%
• rules > ZeroR: most likely class!
Baseline Accuracy
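ZeroR's baseline follows directly from the class counts on the slide: always predict the most frequent class. An illustrative sketch (not Weka's implementation), using the diabetes counts:

```python
from collections import Counter

# Class counts quoted on the slide for the diabetes dataset.
labels = ["negative"] * 500 + ["positive"] * 268   # 768 instances

# ZeroR idea: predict the majority class for every instance.
counts = Counter(labels)
majority_class, majority_count = counts.most_common(1)[0]
baseline_accuracy = majority_count / len(labels)

print(majority_class)               # negative
print(f"{baseline_accuracy:.1%}")   # 65.1%
```

Any classifier worth using should beat this number.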
Slide 12
Sometimes baseline is best!
• Open supermarket.arff and blindly apply:
• rules > ZeroR: 64%
• trees > J48: 63%
• bayes > NaiveBayes: 63%
• lazy > IBk: 38%
• rules > PART: 63%
• Attributes are not informative
• Caution: Don’t just apply Weka to a dataset:
you need to understand what’s going on
Baseline Accuracy
Slide 13
Consider whether differences are significant
Always try a simple baseline, e.g. rules > ZeroR
Caution: Don’t just apply Weka to a dataset: you
need to understand what’s going on
Baseline Accuracy
Slide 14
Can we improve upon repeated holdout (i.e.
reduce variance)?
Cross‐validation
Stratified cross‐validation
Cross-Validation
Slide 15
Repeated holdout: hold out 10% for testing, repeat 10 times
Cross-Validation
Slide 16
10‐fold cross‐validation
• Divide dataset into 10 parts (folds)
• Hold out each part in turn
• Average the results
Each data point is used once for testing, 9 times for training
Stratified cross‐validation
• Ensure that each fold has the right proportion of each class value
Cross-Validation
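One way to sketch the stratified fold assignment in Python (illustrative only; `stratified_folds` is a made-up helper, not Weka's algorithm): shuffle each class's indices separately, then deal them round-robin across the folds so every fold keeps the class proportions.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=10, seed=1):
    """Assign each instance index to one of k folds so that every
    fold has roughly the right proportion of each class value."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for j, idx in enumerate(indices):
            folds[j % k].append(idx)   # deal round-robin -> balanced classes
    return folds

# Each fold is held out once for testing; the other k-1 folds train.
labels = ["neg"] * 50 + ["pos"] * 30
folds = stratified_folds(labels, k=10)
for fold in folds:
    neg = sum(labels[i] == "neg" for i in fold)
    print(neg, len(fold) - neg)   # every fold: 5 neg, 3 pos
```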
Slide 17
Cross‐validation better than repeated holdout
Stratified is even better
Practical rule of thumb:
• Lots of data? – use percentage split
• Else: stratified 10‐fold cross‐validation
Cross-Validation
Slide 18
Is cross‐validation really better than repeated holdout?
Diabetes dataset
Baseline accuracy (rules > ZeroR): 65.1%
trees > J48 with 10‐fold cross‐validation: 73.8%
… with different random-number seeds:
seed:     1     2     3     4     5     6     7     8     9     10
accuracy: 73.8  75.0  75.5  75.5  74.4  75.6  73.6  74.0  74.5  73.0
Cross-Validation Results
Slide 19
holdout (10%):              75.3, 77.9, 80.5, 74.0, 71.4, 70.1, 79.2, 71.4, 80.5, 67.5
cross‐validation (10‐fold): 73.8, 75.0, 75.5, 75.5, 74.4, 75.6, 73.6, 74.0, 74.5, 73.0
Sample mean: x̄ = Σ xᵢ / n
Variance: σ² = Σ (xᵢ − x̄)² / (n − 1)
Standard deviation: σ
Holdout: x̄ = 74.8, σ = 4.6
Cross‐validation: x̄ = 74.5, σ = 0.9
Cross-Validation Results
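These figures can be reproduced from the two accuracy lists on the slide with the sample-statistics formulas (`mean_and_std` is an illustrative helper):

```python
import math

def mean_and_std(xs):
    """Sample mean and standard deviation (n - 1 denominator)."""
    n = len(xs)
    m = sum(xs) / n
    var = sum((x - m) ** 2 for x in xs) / (n - 1)
    return m, math.sqrt(var)

# Accuracy figures from the slide (J48 on diabetes, seeds 1-10).
holdout_10pct = [75.3, 77.9, 80.5, 74.0, 71.4, 70.1, 79.2, 71.4, 80.5, 67.5]
cv_10fold     = [73.8, 75.0, 75.5, 75.5, 74.4, 75.6, 73.6, 74.0, 74.5, 73.0]

for name, xs in [("holdout (10%)", holdout_10pct),
                 ("10-fold CV", cv_10fold)]:
    m, s = mean_and_std(xs)
    print(f"{name}: mean = {m:.1f}, std dev = {s:.1f}")
```

The means are nearly identical, but cross-validation's standard deviation (0.9) is far smaller than repeated holdout's (4.6): it is a much more stable estimate.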
Slide 20
Why 10‐fold? E.g. 20‐fold: 75.1%
Cross‐validation really is better than repeated holdout
It reduces the variance of the estimate
Cross-Validation Results
Slide 21
Evaluation Methods: Exercises
Slide 22
Plan
To evaluate the performance of machine learning algorithms classifying Tic-Tac-Toe games.
Slide 23
Classification on Tic-Tac-Toe
Download Tic-Tac-Toe dataset tic-tac-toe.zip from Course Page.
Work as a team to evaluate the performance of machine learning algorithms classifying Tic-Tac-Toe games.
Slide 24
Evaluation Methods
Using Training Set (use 100% of instances to train/learn and use 100% of instances to test performance)
10-fold Cross-Validation
Split 70% (use 70% of instances to train/learn and use the remaining 30% of instances to test performance)
Slide 25
Classifiers Being Used
Decision Tree
• trees > J48
Neural Network
• functions > MultilayerPerceptron (trainingTime = 50)
Bayes Network
• bayes > NaiveBayes
Nearest Neighbor
• lazy > IBk (k = 3)
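For intuition about the lazy > IBk (k = 3) entry, here is a toy 3-nearest-neighbour classifier for Tic-Tac-Toe boards. This is a sketch only, not Weka's code: the tiny training set is made up, boards are written as 9-character strings (x / o / b), and distance is simply the number of differing squares.

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify by majority vote of the k nearest training instances
    (the idea behind IBk with k = 3)."""
    def distance(a, b):
        # Overlap distance for nominal attributes: count differing squares.
        return sum(ca != cb for ca, cb in zip(a, b))

    neighbours = sorted(train, key=lambda item: distance(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical training boards: (board, "positive" if x wins).
train = [
    ("xxxoobbbb", "positive"),   # x wins top row
    ("xxxoboobb", "positive"),
    ("oxxxooxbb", "negative"),
    ("oobxxxbbo", "positive"),   # x wins middle row
    ("ooboxxxbb", "negative"),
]
print(knn_predict(train, "xxxobobbb"))   # prints: positive
```

Being "lazy", IBk does no work at training time; all the effort happens at prediction time, when distances to the stored instances are computed.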
Slide 26
Using Weka
• Extract Tic-Tac-Toe.zip to the Weka folder
• Load the Weka program
• Open Tic-Tac-Toe.arff
• Choose Explorer
Slide 27
Using Weka (cont.)
• Click the Classify tab
• Choose the J48 classifier under trees
• Set the Test options to Use training set
• Enable Output predictions in More options
• Click Start to run
Slide 28
Using Weka (cont.)
[Screenshot: accuracy rate in the Classifier output]
Slide 29
Reporting
• Download Tic-tac-toe-report.docx
• Complete the table evaluating the performance of different learning methods in Q1
• Find the best performer in Q2, Q3, and Q4