Classification (slides adapted from Rob Schapire) Eran Segal Weizmann Institute
Transcript
Page 1:

Classification (slides adapted from Rob Schapire)

Eran Segal Weizmann Institute

Page 2:

Classification Scheme

Labeled training examples → classification algorithm → classification rule

A test example is fed to the classification rule to produce the predicted classification

Page 3:

Building a Good Classifier
Need enough training examples
Good performance on the training set
Classifier is not too complex

Measures of complexity:
Number of bits needed to write the classifier
Number of parameters
VC dimension

Page 4:

Example

Page 5:

Example

Page 6:

Classification Algorithms
Nearest neighbors
Naïve Bayes
Decision trees
Boosting
Neural networks
SVMs
Bagging
…

Page 7:

Nearest Neighbor Classification

Popular nonlinear classifier
Find the k nearest neighbors of the unknown (test) vector among the training vectors
Assign the unknown (test) vector to the most frequent class among its k nearest neighbors

Question: how to select a similarity measure between vectors?

Problem: in high-dimensional data, nearest neighbors are still not ‘near’
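A minimal k-nearest-neighbor sketch in Python/NumPy of the procedure above; Euclidean distance is assumed as the similarity measure, and the names (knn_predict, X_train, ...) are illustrative rather than taken from the slides:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=3):
    """Predict the label of x_test by majority vote over its k nearest training vectors."""
    dists = np.linalg.norm(X_train - x_test, axis=1)        # Euclidean distance to every training vector
    nearest = np.argsort(dists)[:k]                         # indices of the k closest training vectors
    return Counter(y_train[nearest]).most_common(1)[0][0]   # most frequent class among the neighbors

# Toy usage
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.8, 0.9]), k=3))  # -> 1
```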

Page 8:

Naïve Bayes Classifier
Input: labeled examples
m1 = {c, x1, ..., xn}
m2 = {c, x1, ..., xn}
…

Parameter estimation / learning phase

Prediction: compute the assignment to the classes and choose the most likely class

[Graphical model: class variable C with children X1, X2, …, Xn]

Naïve Bayes:

P(C[m] = c | x[m], θ) ∝ P(C[m] = c | θ) · P(x[m] | C[m] = c, θ)
    = P(C[m] = c | θ) · ∏_{i=1}^{n} P(x_i[m] | C[m] = c, θ)

Parameter estimates (relative frequencies over the M training examples):

θ[c] = (1/M) · ∑_{m=1}^{M} δ(C[m] = c)

θ[x_i, c] = (1/M) · ∑_{m=1}^{M} δ(X_i[m] = x_i, C[m] = c)
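A small Python/NumPy sketch of the counting-based parameter estimation and the prediction rule above, assuming binary 0/1 features; the Laplace smoothing and all names are additions made so that the example runs cleanly:

```python
import numpy as np

def nb_fit(X, y, alpha=1.0):
    """Estimate theta[c] and theta[x_i, c] by counting over the M labeled examples
    (with Laplace smoothing alpha, an addition not shown on the slides)."""
    classes = np.unique(y)
    prior = {c: np.mean(y == c) for c in classes}                                  # theta[c]
    cond = {c: (X[y == c].sum(axis=0) + alpha) / ((y == c).sum() + 2 * alpha)      # P(X_i = 1 | C = c)
            for c in classes}
    return prior, cond

def nb_predict(x, prior, cond):
    """Choose the class maximizing log P(c) + sum_i log P(x_i | c)."""
    scores = {}
    for c, p in prior.items():
        p_xi = np.where(x == 1, cond[c], 1.0 - cond[c])   # per-feature likelihoods
        scores[c] = np.log(p) + np.sum(np.log(p_xi))
    return max(scores, key=scores.get)

# Toy usage with binary features
X = np.array([[1, 0, 1], [1, 1, 1], [0, 0, 1], [0, 1, 0]])
y = np.array([1, 1, 0, 0])
prior, cond = nb_fit(X, y)
print(nb_predict(np.array([1, 0, 1]), prior, cond))  # -> 1
```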

Page 9:

Decision Trees

Page 10:

Decision Trees Example

Page 11:

Building Decision Trees
Choose a rule to split on
Divide the data into disjoint subsets based on the splitting rule
Repeat recursively for each subset
Stop when leaves are (almost) “pure”

Page 12:

Choosing the Splitting Rule
Choose the rule that leads to the greatest increase in “purity”

Purity measures:
Entropy: −(p+ ln(p+) + p− ln(p−))
Gini index: p+ · p−
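A brief Python/NumPy sketch of the two purity measures and of picking the split with the greatest increase in purity; weighting the child impurities by subset size is an assumption, since the slides do not spell out how the children are combined:

```python
import numpy as np

def entropy(y):
    """-(p+ ln p+ + p- ln p-) for 0/1 labels, treating 0·ln 0 as 0."""
    p = np.bincount(y, minlength=2) / len(y)
    return -sum(pi * np.log(pi) for pi in p if pi > 0)

def gini(y):
    """p+ · p- for 0/1 labels."""
    p = np.mean(y)
    return p * (1 - p)

def best_split(X, y, impurity=entropy):
    """Pick the binary feature whose split most reduces the (size-weighted) impurity."""
    gains = []
    for j in range(X.shape[1]):
        left, right = y[X[:, j] == 0], y[X[:, j] == 1]
        if len(left) == 0 or len(right) == 0:      # split does not divide the data
            gains.append(0.0)
            continue
        children = (len(left) * impurity(left) + len(right) * impurity(right)) / len(y)
        gains.append(impurity(y) - children)       # increase in purity
    return int(np.argmax(gains))

X = np.array([[1, 0], [1, 1], [0, 0], [0, 1]])
y = np.array([1, 1, 0, 0])
print(best_split(X, y))  # -> 0: the first feature separates the two classes perfectly
```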

Page 13:

Tree Size vs. Prediction Accuracy

Trees must be big enough to fit the training data, but not so big that they overfit (capture noise or spurious patterns)

Page 14:

Decision Tree Summary
Best-known packages:
C4.5 (Quinlan)
CART (Breiman, Friedman, Olshen & Stone)

Very fast to train and evaluate
Relatively easy to interpret
But: accuracy is often not state-of-the-art
Work well within boosting approaches

Page 15:

Boosting
Main observation: it is easy to find simple ‘rules of thumb’ that are ‘often’ correct

General approach:
Concentrate on “hard” examples
Derive a ‘rule of thumb’ for these examples
Combine the rule with previous rules by taking a weighted majority of all current rules
Repeat T times

Boosting guarantee: given sufficient data, and an algorithm that can consistently find classifiers (‘rules of thumb’) slightly better than random, a high-accuracy classifier can be built

Page 16:

AdaBoost

Page 17:

Setting α

Error ε_t: the error of the weak classifier h_t over the N training examples (weighted by the current example weights),
ε_t = P(h_t(x_i) ≠ y_i),  i = 1, …, N

Setting α_t:
α_t = (1/2) · ln((1 − ε_t) / ε_t)

Classifier weight is inversely proportional to classifier error, i.e., classifier weight increases with classification accuracy
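A compact AdaBoost sketch in Python/NumPy that uses the weighted error ε_t and α_t = (1/2) ln((1 − ε_t)/ε_t) above, with decision stumps as the weak ‘rules of thumb’; the stump search, the ±1 label convention, and the exponential re-weighting of hard examples are standard AdaBoost details assumed here rather than quoted from the slides:

```python
import numpy as np

def best_stump(X, y, w):
    """Weak learner: threshold a single feature so as to minimize the weighted error."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = np.where(X[:, j] >= thr, sign, -sign)
                err = np.sum(w * (pred != y))
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best

def adaboost(X, y, T=10):
    """y must be in {-1, +1}; returns the weighted list of stumps."""
    N = len(y)
    w = np.full(N, 1.0 / N)                 # start with uniform example weights
    ensemble = []
    for _ in range(T):
        err, j, thr, sign = best_stump(X, y, w)
        err = max(err, 1e-12)               # avoid division by zero for a perfect stump
        alpha = 0.5 * np.log((1 - err) / err)
        pred = np.where(X[:, j] >= thr, sign, -sign)
        w *= np.exp(-alpha * y * pred)      # up-weight the "hard" (misclassified) examples
        w /= w.sum()
        ensemble.append((alpha, j, thr, sign))
    return ensemble

def boosted_predict(ensemble, X):
    """Weighted majority vote of the weak rules."""
    votes = sum(a * np.where(X[:, j] >= thr, s, -s) for a, j, thr, s in ensemble)
    return np.sign(votes)

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([-1, -1, 1, 1])
model = adaboost(X, y, T=5)
print(boosted_predict(model, X))  # -> [-1. -1.  1.  1.]
```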

Page 18:

Toy Example
Weak classifiers: vertical or horizontal half-planes

Page 19:

Round 1

Page 20:

Round 2

Page 21:

Round 3

Page 22:

Final Boosting Classifier

Page 23:

Test Error Behavior

[Plots of test-error behavior: “Expected” vs. “Typical”]

Page 24:

The Margins Explanation
Training error measures classification accuracy, but the confidence of the classifications is also important

Recall: H_final is a weighted majority vote of the weak rules

Measure confidence by the margin = strength of the vote

Empirical evidence and mathematical proof that:
Large margins give better generalization error
Boosting tends to increase the margins of training examples

Page 25:

Boosting Summary
Relatively fast (though not as fast as some other algorithms)
Simple and easy to program
Flexible: can combine with any learning algorithm, e.g., C4.5 or very simple rules of thumb
Provable guarantees
State-of-the-art accuracy
Tends not to overfit (but does sometimes)
Many applications

Page 26:

Neural Networks
Basic unit: perceptron (linear threshold function)
Neural network: perceptrons connected in a network
Weights on the edges
Each unit is a perceptron

Page 27:

Perceptron Units
Problem: the network computation is discontinuous due to the threshold function g(x)
Solution: approximate g(x) with a smoothed threshold function, e.g., g(x) = 1/(1 + exp(−x))
H_w(x) is now continuous and differentiable in both x and w
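A one-unit Python/NumPy sketch of the smoothed perceptron: the hard threshold is replaced by g(x) = 1/(1 + exp(−x)), so H_w(x) becomes differentiable in both x and w; the squared-error loss used for the gradient step is an assumption, not stated on the slide:

```python
import numpy as np

def g(z):
    """Smoothed threshold (sigmoid): 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def unit_output(w, x):
    """A single perceptron unit with a sigmoid activation: H_w(x) = g(w · x)."""
    return g(np.dot(w, x))

def squared_error_grad(w, x, y):
    """Gradient of 0.5 * (H_w(x) - y)^2 with respect to w, using g'(z) = g(z) (1 - g(z))."""
    h = unit_output(w, x)
    return (h - y) * h * (1 - h) * x

# One gradient-descent step on a single example (illustrative values)
w = np.zeros(3)
x, y = np.array([1.0, 0.5, -0.2]), 1.0
w -= 0.5 * squared_error_grad(w, x, y)
print(unit_output(w, x))  # moves from 0.5 toward the target 1.0
```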

Page 28:

Finding Weights

Page 29:

Neural Network Summary
Slow to converge
Difficult to get the network architecture and parameters right
Not state-of-the-art accuracy as a general method
Can be tuned to specific applications and then achieve good performance

Page 30:

Support Vector Machines (SVMs)

Given linearly separable data
Choose the hyperplane that maximizes the minimum margin (= distance to the separating hyperplane)
Intuition: separate the +’s from the −’s as much as possible

Page 31:

Finding Max-Margin Hyperplane

Page 32:

Non-Linearly Separable Data
Penalize each point by its distance from the margin 1, i.e., minimize the sum of these penalties
Map the data into a high-dimensional space in which the data becomes linearly separable
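The slides state the soft-margin idea but not an optimization procedure; below is a simple Pegasos-style stochastic subgradient sketch in Python/NumPy that minimizes a hinge-loss penalty plus a norm regularizer for a linear separator (±1 labels assumed). The mapping to a high-dimensional space (kernels) is not covered here:

```python
import numpy as np

def linear_svm(X, y, lam=0.01, epochs=100, seed=0):
    """Minimize lam/2 * ||w||^2 + (1/N) * sum_i max(0, 1 - y_i * w·x_i) by stochastic subgradient steps.
    y must be in {-1, +1}; the hinge term penalizes points inside or on the wrong side of the margin."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    w = np.zeros(d)
    for t in range(1, epochs * N + 1):
        i = rng.integers(N)
        eta = 1.0 / (lam * t)                  # decaying step size
        if y[i] * np.dot(w, X[i]) < 1:         # point violates the margin
            w = (1 - eta * lam) * w + eta * y[i] * X[i]
        else:
            w = (1 - eta * lam) * w
    return w

# Toy usage on a linearly separable set
X = np.array([[2.0, 2.0], [1.5, 2.5], [-2.0, -1.0], [-1.0, -2.0]])
y = np.array([1, 1, -1, -1])
w = linear_svm(X, y)
print(np.sign(X @ w))  # expected: [ 1.  1. -1. -1.]
```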

Page 33:

SVM Summary
Fast algorithms available
Not simple to program
State-of-the-art accuracy
Theoretical justification
Many applications

Page 34:

Assignment
Classify tissue samples into various clinical states (e.g., tumor/normal) based on microarray profiles

Classifiers to compare:
Naïve Bayes
Naïve Bayes + feature selection
Boosting with decision tree stumps (the weak classifiers are decision trees with a single split)

Page 35:

Assignment
Data: breast cancer dataset
295 samples
data.tab: genes on rows; the first column is the gene identifier (ID), columns 2-296 are the expression levels of the gene in each array
27 clinical attributes
experiment_attributes.tab: attributes on columns, samples on rows; typically 0/1 indicates the association of a sample with an attribute

Evaluation: use a 10-fold cross-validation scheme and compute the prediction accuracy for the clinical attributes: met_in_5_yr_, eventmet_, Alive_8yr_, eventdea_, GradeIII_
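A minimal 10-fold cross-validation loop in Python/NumPy for the evaluation step; fit and predict stand for whichever of the compared classifiers is being trained (Naïve Bayes, Naïve Bayes + feature selection, or boosted stumps), and the shuffling and fold assignment are assumptions, since the assignment text does not specify them:

```python
import numpy as np

def cross_val_accuracy(X, y, fit, predict, n_folds=10, seed=0):
    """Average prediction accuracy over n_folds held-out folds (X, y are NumPy arrays)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))            # shuffle the samples once
    folds = np.array_split(idx, n_folds)     # 10 (nearly) equal folds
    accs = []
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        model = fit(X[train], y[train])      # train on the other 9 folds
        accs.append(np.mean(predict(model, X[test]) == y[test]))
    return float(np.mean(accs))
```

The same loop would be run once per clinical attribute (one 0/1 label vector per attribute) to compare the three classifiers.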