Classification (slides adapted from Rob Schapire) Eran Segal Weizmann Institute
Transcript
Page 1:

Classification (slides adapted from Rob Schapire)

Eran Segal Weizmann Institute

Page 2:

Classification Scheme

Labeled training examples → classification algorithm → classification rule

A test example is fed to the classification rule to produce the predicted classification

Page 3:

Building a Good Classifier
Need enough training examples
Good performance on the training set
Classifier is not too complex

Measures of complexity:
Number of bits needed to write the classifier
Number of parameters
VC dimension

Page 4:

Example

Page 5:

Example

Page 6:

Classification Algorithms
Nearest neighbors
Naïve Bayes
Decision trees
Boosting
Neural networks
SVMs
Bagging
…

Page 7:

Nearest Neighbor Classification

Popular nonlinear classifier
Find the k nearest neighbors of the unknown (test) vector among the training vectors
Assign the unknown (test) vector to the most frequent class among its k nearest neighbors

Question: how to select a similarity measure between vectors?

Problem: in high-dimensional data, nearest neighbors are still not ‘near’
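A minimal k-nearest-neighbor sketch in Python/NumPy of the procedure above; Euclidean distance is assumed as the similarity measure, and the names (knn_predict, X_train, ...) are illustrative rather than taken from the slides:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=3):
    """Predict the label of x_test by majority vote over its k nearest training vectors."""
    dists = np.linalg.norm(X_train - x_test, axis=1)        # Euclidean distance to every training vector
    nearest = np.argsort(dists)[:k]                         # indices of the k closest training vectors
    return Counter(y_train[nearest]).most_common(1)[0][0]   # most frequent class among the neighbors

# Toy usage
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.8, 0.9]), k=3))  # -> 1
```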

Page 8:

Naïve Bayes Classifier
Input: labeled examples
m1 = {c, x1, ..., xn}
m2 = {c, x1, ..., xn}
…

Parameter estimation / learning phase

Prediction: compute the assignment to the classes and choose the most likely class

[Graphical model: class variable C with children X1, X2, …, Xn]

Naïve Bayes:

P(C[m] = c | x[m], θ) ∝ P(C[m] = c | θ) · P(x[m] | C[m] = c, θ)
    = P(C[m] = c | θ) · ∏_{i=1}^{n} P(x_i[m] | C[m] = c, θ)

Parameter estimates (relative frequencies over the M training examples):

θ[c] = (1/M) · ∑_{m=1}^{M} δ(C[m] = c)

θ[x_i, c] = (1/M) · ∑_{m=1}^{M} δ(X_i[m] = x_i, C[m] = c)
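A small Python/NumPy sketch of the counting-based parameter estimation and the prediction rule above, assuming binary 0/1 features; the Laplace smoothing and all names are additions made so that the example runs cleanly:

```python
import numpy as np

def nb_fit(X, y, alpha=1.0):
    """Estimate theta[c] and theta[x_i, c] by counting over the M labeled examples
    (with Laplace smoothing alpha, an addition not shown on the slides)."""
    classes = np.unique(y)
    prior = {c: np.mean(y == c) for c in classes}                                  # theta[c]
    cond = {c: (X[y == c].sum(axis=0) + alpha) / ((y == c).sum() + 2 * alpha)      # P(X_i = 1 | C = c)
            for c in classes}
    return prior, cond

def nb_predict(x, prior, cond):
    """Choose the class maximizing log P(c) + sum_i log P(x_i | c)."""
    scores = {}
    for c, p in prior.items():
        p_xi = np.where(x == 1, cond[c], 1.0 - cond[c])   # per-feature likelihoods
        scores[c] = np.log(p) + np.sum(np.log(p_xi))
    return max(scores, key=scores.get)

# Toy usage with binary features
X = np.array([[1, 0, 1], [1, 1, 1], [0, 0, 1], [0, 1, 0]])
y = np.array([1, 1, 0, 0])
prior, cond = nb_fit(X, y)
print(nb_predict(np.array([1, 0, 1]), prior, cond))  # -> 1
```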

Page 9:

Decision Trees

Page 10:

Decision Trees Example

Page 11:

Building Decision Trees
Choose a rule to split on
Divide the data into disjoint subsets based on the splitting rule
Repeat recursively for each subset
Stop when leaves are (almost) “pure”

Page 12:

Choosing the Splitting Rule
Choose the rule that leads to the greatest increase in “purity”

Purity measures:
Entropy: −(p+ ln(p+) + p− ln(p−))
Gini index: p+ · p−
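A brief Python/NumPy sketch of the two purity measures and of picking the split with the greatest increase in purity; weighting the child impurities by subset size is an assumption, since the slides do not spell out how the children are combined:

```python
import numpy as np

def entropy(y):
    """-(p+ ln p+ + p- ln p-) for 0/1 labels, treating 0·ln 0 as 0."""
    p = np.bincount(y, minlength=2) / len(y)
    return -sum(pi * np.log(pi) for pi in p if pi > 0)

def gini(y):
    """p+ · p- for 0/1 labels."""
    p = np.mean(y)
    return p * (1 - p)

def best_split(X, y, impurity=entropy):
    """Pick the binary feature whose split most reduces the (size-weighted) impurity."""
    gains = []
    for j in range(X.shape[1]):
        left, right = y[X[:, j] == 0], y[X[:, j] == 1]
        if len(left) == 0 or len(right) == 0:      # split does not divide the data
            gains.append(0.0)
            continue
        children = (len(left) * impurity(left) + len(right) * impurity(right)) / len(y)
        gains.append(impurity(y) - children)       # increase in purity
    return int(np.argmax(gains))

X = np.array([[1, 0], [1, 1], [0, 0], [0, 1]])
y = np.array([1, 1, 0, 0])
print(best_split(X, y))  # -> 0: the first feature separates the two classes perfectly
```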

Page 13:

Tree Size vs. Prediction Accuracy

Trees must be big enough to fit the training data, but not so big that they overfit (capture noise or spurious patterns)

Page 14:

Decision Tree Summary
Best-known packages:
C4.5 (Quinlan)
CART (Breiman, Friedman, Olshen & Stone)

Very fast to train and evaluate
Relatively easy to interpret
But: accuracy is often not state-of-the-art
Work well within boosting approaches

Page 15:

Boosting
Main observation: it is easy to find simple ‘rules of thumb’ that are ‘often’ correct

General approach:
Concentrate on “hard” examples
Derive a ‘rule of thumb’ for these examples
Combine the rule with previous rules by taking a weighted majority of all current rules
Repeat T times

Boosting guarantee: given sufficient data, and an algorithm that can consistently find classifiers (‘rules of thumb’) slightly better than random, a high-accuracy classifier can be built

Page 16:

AdaBoost

Page 17:

Setting α

Error ε_t: the error of the weak classifier h_t over the N training examples (weighted by the current example weights),
ε_t = P(h_t(x_i) ≠ y_i),  i = 1, …, N

Setting α_t:
α_t = (1/2) · ln((1 − ε_t) / ε_t)

Classifier weight is inversely proportional to classifier error, i.e., classifier weight increases with classification accuracy
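A compact AdaBoost sketch in Python/NumPy that uses the weighted error ε_t and α_t = (1/2) ln((1 − ε_t)/ε_t) above, with decision stumps as the weak ‘rules of thumb’; the stump search, the ±1 label convention, and the exponential re-weighting of hard examples are standard AdaBoost details assumed here rather than quoted from the slides:

```python
import numpy as np

def best_stump(X, y, w):
    """Weak learner: threshold a single feature so as to minimize the weighted error."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = np.where(X[:, j] >= thr, sign, -sign)
                err = np.sum(w * (pred != y))
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best

def adaboost(X, y, T=10):
    """y must be in {-1, +1}; returns the weighted list of stumps."""
    N = len(y)
    w = np.full(N, 1.0 / N)                 # start with uniform example weights
    ensemble = []
    for _ in range(T):
        err, j, thr, sign = best_stump(X, y, w)
        err = max(err, 1e-12)               # avoid division by zero for a perfect stump
        alpha = 0.5 * np.log((1 - err) / err)
        pred = np.where(X[:, j] >= thr, sign, -sign)
        w *= np.exp(-alpha * y * pred)      # up-weight the "hard" (misclassified) examples
        w /= w.sum()
        ensemble.append((alpha, j, thr, sign))
    return ensemble

def boosted_predict(ensemble, X):
    """Weighted majority vote of the weak rules."""
    votes = sum(a * np.where(X[:, j] >= thr, s, -s) for a, j, thr, s in ensemble)
    return np.sign(votes)

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([-1, -1, 1, 1])
model = adaboost(X, y, T=5)
print(boosted_predict(model, X))  # -> [-1. -1.  1.  1.]
```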

Page 18:

Toy Example
Weak classifiers: vertical or horizontal half-planes

Page 19:

Round 1

Page 20:

Round 2

Page 21:

Round 3

Page 22:

Final Boosting Classifier

Page 23:

Test Error Behavior

[Plots of test-error behavior: “Expected” vs. “Typical”]

Page 24:

The Margins Explanation
Training error measures classification accuracy, but the confidence of the classifications is also important

Recall: H_final is a weighted majority vote of the weak rules

Measure confidence by the margin = strength of the vote

Empirical evidence and mathematical proof that:
Large margins give better generalization error
Boosting tends to increase the margins of training examples

Page 25:

Boosting Summary
Relatively fast (though not as fast as some other algorithms)
Simple and easy to program
Flexible: can combine with any learning algorithm, e.g., C4.5 or very simple rules of thumb
Provable guarantees
State-of-the-art accuracy
Tends not to overfit (but does sometimes)
Many applications

Page 26:

Neural Networks
Basic unit: perceptron (linear threshold function)
Neural network: perceptrons connected in a network
Weights on the edges
Each unit is a perceptron

Page 27:

Perceptron Units
Problem: the network computation is discontinuous due to the threshold function g(x)
Solution: approximate g(x) with a smoothed threshold function, e.g., g(x) = 1/(1 + exp(−x))
H_w(x) is now continuous and differentiable in both x and w
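A one-unit Python/NumPy sketch of the smoothed perceptron: the hard threshold is replaced by g(x) = 1/(1 + exp(−x)), so H_w(x) becomes differentiable in both x and w; the squared-error loss used for the gradient step is an assumption, not stated on the slide:

```python
import numpy as np

def g(z):
    """Smoothed threshold (sigmoid): 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def unit_output(w, x):
    """A single perceptron unit with a sigmoid activation: H_w(x) = g(w · x)."""
    return g(np.dot(w, x))

def squared_error_grad(w, x, y):
    """Gradient of 0.5 * (H_w(x) - y)^2 with respect to w, using g'(z) = g(z) (1 - g(z))."""
    h = unit_output(w, x)
    return (h - y) * h * (1 - h) * x

# One gradient-descent step on a single example (illustrative values)
w = np.zeros(3)
x, y = np.array([1.0, 0.5, -0.2]), 1.0
w -= 0.5 * squared_error_grad(w, x, y)
print(unit_output(w, x))  # moves from 0.5 toward the target 1.0
```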

Page 28:

Finding Weights

Page 29:

Neural Network Summary
Slow to converge
Difficult to get the network architecture and parameters right
Not state-of-the-art accuracy as a general method
Can be tuned to specific applications and then achieve good performance

Page 30:

Support Vector Machines (SVMs)

Given linearly separable data
Choose the hyperplane that maximizes the minimum margin (= distance to the separating hyperplane)
Intuition: separate the +’s from the −’s as much as possible

Page 31:

Finding Max-Margin Hyperplane

Page 32:

Non-Linearly Separable Data
Penalize each point by its distance from the margin 1, i.e., minimize the sum of these penalties
Map the data into a high-dimensional space in which the data becomes linearly separable
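The slides state the soft-margin idea but not an optimization procedure; below is a simple Pegasos-style stochastic subgradient sketch in Python/NumPy that minimizes a hinge-loss penalty plus a norm regularizer for a linear separator (±1 labels assumed). The mapping to a high-dimensional space (kernels) is not covered here:

```python
import numpy as np

def linear_svm(X, y, lam=0.01, epochs=100, seed=0):
    """Minimize lam/2 * ||w||^2 + (1/N) * sum_i max(0, 1 - y_i * w·x_i) by stochastic subgradient steps.
    y must be in {-1, +1}; the hinge term penalizes points inside or on the wrong side of the margin."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    w = np.zeros(d)
    for t in range(1, epochs * N + 1):
        i = rng.integers(N)
        eta = 1.0 / (lam * t)                  # decaying step size
        if y[i] * np.dot(w, X[i]) < 1:         # point violates the margin
            w = (1 - eta * lam) * w + eta * y[i] * X[i]
        else:
            w = (1 - eta * lam) * w
    return w

# Toy usage on a linearly separable set
X = np.array([[2.0, 2.0], [1.5, 2.5], [-2.0, -1.0], [-1.0, -2.0]])
y = np.array([1, 1, -1, -1])
w = linear_svm(X, y)
print(np.sign(X @ w))  # expected: [ 1.  1. -1. -1.]
```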

Page 33:

SVM Summary
Fast algorithms available
Not simple to program
State-of-the-art accuracy
Theoretical justification
Many applications

Page 34:

Assignment
Classify tissue samples into various clinical states (e.g., tumor/normal) based on microarray profiles

Classifiers to compare:
Naïve Bayes
Naïve Bayes + feature selection
Boosting with decision tree stumps (the weak classifiers are decision trees with a single split)

Page 35:

Assignment
Data: breast cancer dataset
295 samples
data.tab: genes on rows; the first column is the gene identifier (ID), columns 2-296 are the expression levels of the gene in each array
27 clinical attributes
experiment_attributes.tab: attributes on columns, samples on rows; typically 0/1 indicates the association of a sample with an attribute

Evaluation: use a 10-fold cross-validation scheme and compute the prediction accuracy for the clinical attributes: met_in_5_yr_, eventmet_, Alive_8yr_, eventdea_, GradeIII_
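A minimal 10-fold cross-validation loop in Python/NumPy for the evaluation step; fit and predict stand for whichever of the compared classifiers is being trained (Naïve Bayes, Naïve Bayes + feature selection, or boosted stumps), and the shuffling and fold assignment are assumptions, since the assignment text does not specify them:

```python
import numpy as np

def cross_val_accuracy(X, y, fit, predict, n_folds=10, seed=0):
    """Average prediction accuracy over n_folds held-out folds (X, y are NumPy arrays)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))            # shuffle the samples once
    folds = np.array_split(idx, n_folds)     # 10 (nearly) equal folds
    accs = []
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        model = fit(X[train], y[train])      # train on the other 9 folds
        accs.append(np.mean(predict(model, X[test]) == y[test]))
    return float(np.mean(accs))
```

The same loop would be run once per clinical attribute (one 0/1 label vector per attribute) to compare the three classifiers.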