Introduction to Machine Learning
Isabelle Guyon, [email protected]
Jan 09, 2022

Transcript
Page 1: Introduction Machine Learning

Introduction to Machine Learning

Isabelle Guyon
[email protected]

Page 2: Introduction Machine Learning

What is Machine Learning?

[Diagram: TRAINING DATA feeds a learning algorithm, which produces a trained machine; a query presented to the trained machine returns an answer.]

Page 3: Introduction Machine Learning

What for?

• Classification
• Time series prediction
• Regression
• Clustering

Page 4: Introduction Machine Learning

Some Learning Machines

• Linear models
• Kernel methods
• Neural networks
• Decision trees

Page 5: Introduction Machine Learning

Applications

[Chart: application domains plotted by number of inputs vs. number of training examples (both axes from 10 to 10^5): Bioinformatics, Ecology, OCR/HWR, Market Analysis, Text Categorization, Machine Vision, System Diagnosis.]

Page 6: Introduction Machine Learning

Banking / Telecom / Retail

• Identify:
  – Prospective customers
  – Dissatisfied customers
  – Good customers
  – Bad payers

• Obtain:
  – More effective advertising
  – Less credit risk
  – Less fraud
  – Decreased churn rate

Page 7: Introduction Machine Learning

Biomedical / Biometrics

• Medicine:
  – Screening
  – Diagnosis and prognosis
  – Drug discovery

• Security:
  – Face recognition
  – Signature / fingerprint / iris verification
  – DNA fingerprinting

Page 8: Introduction Machine Learning

Computer / Internet

• Computer interfaces:
  – Troubleshooting wizards
  – Handwriting and speech
  – Brain waves

• Internet:
  – Hit ranking
  – Spam filtering
  – Text categorization
  – Text translation
  – Recommendation

Page 9: Introduction Machine Learning

Challenges

[Chart: the challenge datasets plotted by number of inputs vs. number of training examples (both axes from 10 to 10^5): Arcene, Dorothea, Hiva, Sylva, Gisette, Gina, Ada, Dexter, Nova, Madelon.]

NIPS 2003 & WCCI 2006

Page 10: Introduction Machine Learning

Ten Classification Tasks

[Figure: histograms of test BER across challenge entries for the ten tasks: ADA, GINA, HIVA, NOVA, SYLVA (BER from 0 to 0.5) and ARCENE, DEXTER, DOROTHEA, GISETTE, MADELON (test BER in % from 0 to 50).]

Page 11: Introduction Machine Learning

Challenge Winning Methods

[Bar chart: BER relative to the average BER (BER/<BER>, 0 to 1.8) of the winning entries on each dataset: Gisette (HWR), Gina (HWR), Dexter (Text), Nova (Text), Madelon (Artificial), Arcene (Spectral), Dorothea (Pharma), Hiva (Pharma), Ada (Marketing), Sylva (Ecology), grouped by method family: Linear/Kernel, Neural Nets, Trees/RF, Naïve Bayes.]

Page 12: Introduction Machine Learning

Conventions

X = {xij}: the m × n data matrix (m patterns xi, n features)
y = {yj}: the vector of target values
w, α: parameter vectors (weights w, example coefficients α)

Page 13: Introduction Machine Learning

Learning problem

Colon cancer, Alon et al 1999

Unsupervised learning: Is there structure in the data?

Supervised learning: Predict an outcome y.

Data matrix: X

m lines = patterns (data points, examples): samples, patients, documents, images, …

n columns = features (attributes, input variables): genes, proteins, words, pixels, …

Page 14: Introduction Machine Learning

Linear Models

• f(x) = w • x + b = Σj=1:n wj xj + b

• f(x) = w • Φ(x) + b = Σj wj φj(x) + b   (Perceptron)

• f(x) = Σi=1:m αi k(xi, x) + b   (Kernel method)

Linearity is in the parameters, NOT in the input components.
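To make the three forms concrete, here is a minimal NumPy sketch (not from the slides; the variable names, example basis functions, and example kernel are illustrative only):

```python
import numpy as np

def f_linear(x, w, b):
    # f(x) = w . x + b : linear in the parameters w and in the inputs x
    return np.dot(w, x) + b

def f_perceptron(x, w, b, phi):
    # f(x) = w . Phi(x) + b : still linear in w, possibly non-linear in x
    # through the basis functions phi (a list of functions of x)
    return sum(wj * phij(x) for wj, phij in zip(w, phi)) + b

def f_kernel(x, alpha, b, X_train, k):
    # f(x) = sum_i alpha_i k(x_i, x) + b : one coefficient per training example
    return sum(ai * k(xi, x) for ai, xi in zip(alpha, X_train)) + b

# Toy usage with made-up numbers
x = np.array([1.0, 2.0])
print(f_linear(x, w=np.array([0.5, -0.3]), b=0.1))

phi = [lambda x: x[0], lambda x: x[1], lambda x: x[0] * x[1]]  # example basis functions
print(f_perceptron(x, w=[0.5, -0.3, 0.2], b=0.1, phi=phi))

X_train = np.array([[0.0, 1.0], [1.0, 1.0]])
linear_k = lambda s, t: float(np.dot(s, t))  # the simplest kernel: a plain dot product
print(f_kernel(x, alpha=[0.7, -0.2], b=0.0, X_train=X_train, k=linear_k))
```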

Page 15: Introduction Machine Learning

Artificial Neurons

f(x) = w • x + b

[Diagram: an artificial neuron. Inputs x1, x2, …, xn and a constant 1 are weighted by w1, w2, …, wn and the bias b, then summed by Σ and passed through an activation function to give f(x). Biological analogy: dendrites carry the activation of other neurons through synapses; the cell potential is integrated and propagated along the axon.]

McCulloch and Pitts, 1943

Page 16: Introduction Machine Learning

Linear Decision Boundary

[Figure: a hyperplane separating two classes, shown in the space of inputs x1, x2, x3.]

Page 17: Introduction Machine Learning

Perceptron

Rosenblatt, 1957

f(x) = w • Φ(x) + b

[Diagram: inputs x1, x2, …, xn are mapped to basis functions φ1(x), φ2(x), …, φN(x); these, plus a constant 1, are weighted by w1, w2, …, wN and the bias b, then summed by Σ to give f(x).]

Page 18: Introduction Machine Learning

NL Decision Boundary

[Figure: a non-linear decision boundary separating two classes in the space of three inputs x1, x2, x3 (genes Hs.128749, Hs.234680, Hs.7780 from the colon cancer data).]

Page 19: Introduction Machine Learning

Kernel Method

Potential functions, Aizerman et al 1964

f(x) = Σi αi k(xi,x) + b

[Diagram: inputs x1, x2, …, xn feed kernel units k(x1,x), k(x2,x), …, k(xm,x), one per training example; these, plus a constant 1, are weighted by α1, α2, …, αm and the bias b, then summed by Σ to give f(x).]

k(. ,. ) is a similarity measure or “kernel”.

Page 20: Introduction Machine Learning

Hebb’s Rule

wj ← wj + yi xij

[Diagram: a synapse connects input xj (dendrite) through weight wj to the summing unit Σ and its output y (axon); the weight is reinforced when input and output are active together.]

Link to “Naïve Bayes”
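As a minimal sketch of the update above (assuming ±1 labels and a single pass over the data; not the slides' own code):

```python
import numpy as np

def hebb_train(X, y):
    # X: (m, n) data matrix, y: (m,) labels in {-1, +1}
    # Hebb's rule: w_j <- w_j + y_i * x_ij, applied for every example i
    w = np.zeros(X.shape[1])
    for xi, yi in zip(X, y):
        w += yi * xi
    return w  # equivalently: X.T @ y

# Toy usage
X = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.2]])
y = np.array([1.0, 1.0, -1.0])
w = hebb_train(X, y)
print(w, np.sign(X @ w))  # weights and the resulting predictions on the training points
```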

Page 21: Introduction Machine Learning

Kernel “Trick” (for Hebb’s rule)

• Hebb’s rule for the Perceptron:

w = Σi yi Φ(xi)

f(x) = w • Φ(x) = Σi yi Φ(xi) • Φ(x)

• Define a dot product: k(xi,x) = Φ(xi) • Φ(x)

f(x) = Σi yi k(xi,x)
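A small numerical check of this equivalence (a sketch; it uses the plain dot product as the kernel, so Φ is the identity map):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))          # 5 training examples, 3 features
y = rng.choice([-1.0, 1.0], size=5)  # labels
x = rng.normal(size=3)               # a query point

# Primal form: w = sum_i y_i Phi(x_i), here Phi = identity
w = (y[:, None] * X).sum(axis=0)
f_primal = w @ x

# Dual form: f(x) = sum_i y_i k(x_i, x) with the linear kernel k(s, t) = s . t
f_dual = sum(yi * (xi @ x) for xi, yi in zip(X, y))

print(np.isclose(f_primal, f_dual))  # True: both forms give the same prediction
```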

Page 22: Introduction Machine Learning

Kernel “Trick” (general)

• f(x) = Σi αi k(xi, x)

• k(xi, x) = Φ(xi) • Φ(x)

• f(x) = w • Φ(x)

• w = Σi αi Φ(xi)

Dual forms

Page 23: Introduction Machine Learning

What is a Kernel?

A kernel is:
• a similarity measure
• a dot product in some feature space: k(s, t) = Φ(s) • Φ(t)

But we do not need to know the Φ representation.

Examples:

• k(s, t) = exp(-||s - t||²/σ²)   (Gaussian kernel)

• k(s, t) = (s • t)^q   (Polynomial kernel)
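The two example kernels as small Python functions (a sketch; the hyper-parameter names sigma and q follow the formulas above):

```python
import numpy as np

def gaussian_kernel(s, t, sigma=1.0):
    # k(s, t) = exp(-||s - t||^2 / sigma^2)
    return np.exp(-np.sum((s - t) ** 2) / sigma ** 2)

def polynomial_kernel(s, t, q=2):
    # k(s, t) = (s . t)^q
    return np.dot(s, t) ** q

s, t = np.array([1.0, 0.0]), np.array([0.5, 0.5])
print(gaussian_kernel(s, t, sigma=0.7), polynomial_kernel(s, t, q=3))
```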

Page 24: Introduction Machine Learning

Multi-Layer Perceptron

Back-propagation, Rumelhart et al, 1986

[Diagram: inputs xj feed a layer of summing units Σ, the "hidden units" (internal "latent" variables), whose outputs feed a final summing unit Σ.]
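A minimal forward pass for such a network (a sketch; the back-propagation training itself is not shown, and tanh is just one common choice of activation for the hidden units):

```python
import numpy as np

def mlp_forward(x, W1, b1, w2, b2):
    # Hidden layer: each hidden unit computes a weighted sum then an activation
    h = np.tanh(W1 @ x + b1)   # "hidden units" = internal latent variables
    # Output unit: weighted sum of the hidden activations
    return w2 @ h + b2

# Toy usage with made-up weights: 3 inputs, 2 hidden units, 1 output
rng = np.random.default_rng(0)
x = np.array([0.2, -1.0, 0.5])
W1, b1 = rng.normal(size=(2, 3)), rng.normal(size=2)
w2, b2 = rng.normal(size=2), 0.0
print(mlp_forward(x, W1, b1, w2, b2))
```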

Page 25: Introduction Machine Learning

Chessboard Problem

Page 26: Introduction Machine Learning

Tree Classifiers

CART (Breiman, 1984) or C4.5 (Quinlan, 1993)

At each step, choose the feature that "reduces entropy" most. Work towards "node purity".

[Diagram: starting from all the data, the tree splits on feature f2 at one node and on feature f1 at another, as illustrated by the histograms of f1 and f2.]
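To illustrate "reducing entropy", here is a sketch of scoring a threshold split on one feature by its information gain (an illustration only: CART actually uses Gini impurity and C4.5 information gain, and both search over many candidate thresholds):

```python
import numpy as np

def entropy(y):
    # Shannon entropy of the class labels y
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def entropy_reduction(x, y, threshold):
    # Information gain of splitting on feature values x at the given threshold
    left, right = y[x <= threshold], y[x > threshold]
    child = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
    return entropy(y) - child

# Toy usage: feature f2 separates the classes better than f1 at threshold 0.5
y  = np.array([0, 0, 1, 1])
f1 = np.array([0.1, 0.9, 0.2, 0.8])
f2 = np.array([0.1, 0.2, 0.8, 0.9])
print(entropy_reduction(f1, y, 0.5), entropy_reduction(f2, y, 0.5))
```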

Page 27: Introduction Machine Learning

Iris Data (Fisher, 1936)

[Figure: decision boundaries of four classifiers on the Iris data (classes setosa, versicolor, virginica): linear discriminant, tree classifier, Gaussian mixture, kernel method (SVM).]

Figure from Norbert Jankowski and Krzysztof Grabczewski

Page 28: Introduction Machine Learning

Fit / Robustness Tradeoff

[Figure: two decision boundaries in the (x1, x2) plane illustrating the fit / robustness tradeoff.]

Page 29: Introduction Machine Learning

Performance evaluation

[Figure: two classifiers in the (x1, x2) plane; the decision boundary f(x) = 0 separates the region f(x) > 0 from the region f(x) < 0.]

Page 30: Introduction Machine Learning

Performance evaluation

[Figure: the same two classifiers with the decision threshold moved to f(x) = -1, separating f(x) > -1 from f(x) < -1.]

Page 31: Introduction Machine Learning

Performance evaluation

[Figure: the same two classifiers with the decision threshold moved to f(x) = 1, separating f(x) > 1 from f(x) < 1.]

Page 32: Introduction Machine Learning

ROC Curve

For a given threshold on f(x), you get a point on the ROC curve.

[Figure: ROC curve. Horizontal axis: 1 - negative class success rate (false alarm rate, 1 - specificity), from 0 to 100%. Vertical axis: positive class success rate (hit rate, sensitivity), from 0 to 100%. The actual ROC curve lies between the random ROC (the diagonal) and the ideal ROC curve.]

Page 33: Introduction Machine Learning

ROC Curve

For a given threshold on f(x), you get a point on the ROC curve.

0 ≤ AUC ≤ 1

[Figure: the same ROC plot annotated with the area under the curve (AUC): ideal ROC curve (AUC = 1), actual ROC curve, random ROC (AUC = 0.5).]
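A sketch of how the ROC points and the AUC can be computed by sweeping the threshold over the scores f(x) (illustrative code, not the challenge software):

```python
import numpy as np

def roc_points(scores, y):
    # One (false alarm rate, hit rate) point per threshold on the scores f(x)
    pos, neg = np.sum(y == 1), np.sum(y == -1)
    points = [(0.0, 0.0)]
    for t in np.sort(np.unique(scores))[::-1]:
        pred = np.where(scores >= t, 1, -1)
        tp = np.sum((pred == 1) & (y == 1))
        fp = np.sum((pred == 1) & (y == -1))
        points.append((fp / neg, tp / pos))  # (false alarm rate, hit rate)
    return points

def auc(points):
    # Area under the ROC curve by the trapezoidal rule
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points[:-1], points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

scores = np.array([0.9, 0.8, 0.3, 0.2, 0.1])   # f(x) for five examples
labels = np.array([  1,   1,  -1,   1,  -1])   # their true classes
pts = roc_points(scores, labels)
print(pts, auc(pts))  # AUC = 5/6 for this toy example
```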

Page 34: Introduction Machine Learning

Lift Curve

Customers are ranked according to f(x); the top-ranking customers are selected.

[Figure: lift curve. Horizontal axis: fraction of customers selected, from 0 to 100%. Vertical axis: hit rate = fraction of good customers selected, from 0 to 100%. The actual lift lies between the random lift (the diagonal) and the ideal lift; the areas O and M between these curves define the Gini coefficient.]

Gini = 2 AUC - 1
0 ≤ Gini ≤ 1

Page 35: Introduction Machine Learning

Performance Assessment

Confusion matrix (predictions F(x) vs. truth y):

                        Predictions F(x)
                        Class -1       Class +1       Total
Truth y   Class -1      tn             fp             neg = tn+fp
          Class +1      fn             tp             pos = fn+tp
          Total         rej = tn+fn    sel = fp+tp    m = tn+fp+fn+tp

A cost matrix with the same layout can assign a cost to each of the four outcomes.

Derived quantities:
• False alarm rate = fp/neg = type I error rate = 1 - specificity
• Hit rate = tp/pos = 1 - type II error rate = sensitivity = recall = test power
• Precision = tp/sel
• Fraction selected = sel/m

Compare F(x) = sign(f(x)) to the target y, and report:
• Error rate = (fn + fp)/m
• {Hit rate, False alarm rate} or {Hit rate, Precision} or {Hit rate, Frac. selected}
• Balanced error rate (BER) = (fn/pos + fp/neg)/2 = 1 - (sensitivity + specificity)/2
• F measure = 2 precision·recall / (precision + recall)

Vary the decision threshold θ in F(x) = sign(f(x) + θ), and plot:
• ROC curve: Hit rate vs. False alarm rate
• Lift curve: Hit rate vs. Fraction selected
• Precision/recall curve: Hit rate vs. Precision
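A sketch that computes the confusion-matrix counts and the reported quantities directly from the definitions above (labels assumed to be in {-1, +1}):

```python
import numpy as np

def assess(F, y):
    # F: predicted classes in {-1, +1}, y: true classes in {-1, +1}
    tp = np.sum((F == 1) & (y == 1))
    fp = np.sum((F == 1) & (y == -1))
    tn = np.sum((F == -1) & (y == -1))
    fn = np.sum((F == -1) & (y == 1))
    pos, neg, sel, m = fn + tp, tn + fp, fp + tp, len(y)
    hit_rate = tp / pos            # sensitivity, recall
    false_alarm = fp / neg         # 1 - specificity
    precision = tp / sel
    return {
        "error rate": (fn + fp) / m,
        "BER": (fn / pos + fp / neg) / 2,
        "hit rate": hit_rate,
        "false alarm rate": false_alarm,
        "precision": precision,
        "F measure": 2 * precision * hit_rate / (precision + hit_rate),
        "frac. selected": sel / m,
    }

y = np.array([ 1,  1, -1, -1,  1, -1])
F = np.array([ 1, -1, -1,  1,  1, -1])
print(assess(F, y))
```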

Page 36: Introduction Machine Learning

What is a Risk Functional?

A function of the parameters of the learning machine, assessing how much it is expected to fail on a given task.

Examples:

• Classification:
  – Error rate: (1/m) Σi=1:m 1(F(xi) ≠ yi)
  – 1 - AUC (Gini index = 2 AUC - 1)

• Regression:
  – Mean square error: (1/m) Σi=1:m (f(xi) - yi)²
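The two example risk functionals, written out as functions of the predictions on a dataset (a minimal sketch with illustrative names):

```python
import numpy as np

def classification_risk(F, y):
    # Error rate: (1/m) sum_i 1(F(x_i) != y_i)
    return np.mean(F != y)

def regression_risk(f, y):
    # Mean square error: (1/m) sum_i (f(x_i) - y_i)^2
    return np.mean((f - y) ** 2)

print(classification_risk(np.array([1, -1, 1]), np.array([1, 1, 1])))
print(regression_risk(np.array([0.9, 2.1]), np.array([1.0, 2.0])))
```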

Page 37: Introduction Machine Learning

How to train?

• Define a risk functional R[f(x,w)]
• Optimize it w.r.t. w (gradient descent, mathematical programming, simulated annealing, genetic algorithms, etc.)

[Figure: the risk R[f(x,w)] plotted over the parameter space (w), with its minimum at w*.]

(… to be continued in the next lecture)

Page 38: Introduction Machine Learning

How to Train?

• Define a risk functional R[f(x,w)]
• Find a method to optimize it, typically "gradient descent":

  wj ← wj - η ∂R/∂wj

  or any other optimization method (mathematical programming, simulated annealing, genetic algorithms, etc.)
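A minimal gradient-descent sketch for a linear model under the mean-square-error risk (the learning rate η and the number of iterations are illustrative choices, not prescribed by the slides):

```python
import numpy as np

def train_linear_gd(X, y, eta=0.1, n_iter=200):
    # Minimize R[f(x,w)] = (1/m) sum_i (w . x_i + b - y_i)^2 by gradient descent:
    # w_j <- w_j - eta * dR/dw_j
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(n_iter):
        residual = X @ w + b - y            # f(x_i) - y_i for every example
        w -= eta * (2 / m) * (X.T @ residual)
        b -= eta * (2 / m) * residual.sum()
    return w, b

# Toy usage: recover a known linear relation
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = X @ np.array([1.5, -2.0]) + 0.5
print(train_linear_gd(X, y))  # approximately ([1.5, -2.0], 0.5)
```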

(… to be continued in the next lecture)

Page 39: Introduction Machine Learning

Summary

• With linear threshold units ("neurons") we can build:
  – Linear discriminants (including Naïve Bayes)
  – Kernel methods
  – Neural networks
  – Decision trees

• The architectural hyper-parameters may include:
  – The choice of basis functions φ (features)
  – The kernel
  – The number of units

• Learning means fitting:
  – Parameters (weights)
  – Hyper-parameters
  – Be aware of the fit vs. robustness tradeoff

Page 40: Introduction Machine Learning

Want to Learn More?

• Pattern Classification, R. Duda, P. Hart, and D. Stork. Standard pattern recognition textbook. Limited to classification problems. Matlab code. http://rii.ricoh.com/~stork/DHS.html

• The Elements of Statistical Learning: Data Mining, Inference, and Prediction, T. Hastie, R. Tibshirani, J. Friedman. Standard statistics textbook. Includes all the standard machine learning methods for classification, regression, clustering. R code. http://www-stat-class.stanford.edu/~tibs/ElemStatLearn/

• Linear Discriminants and Support Vector Machines, I. Guyon and D. Stork. In Smola et al., Eds., Advances in Large Margin Classifiers, pages 147-169, MIT Press, 2000. http://clopinet.com/isabelle/Papers/guyon_stork_nips98.ps.gz

• Feature Extraction: Foundations and Applications, I. Guyon et al., Eds. Book for practitioners, with the datasets of the NIPS 2003 challenge, tutorials, best performing methods, Matlab code, and teaching material. http://clopinet.com/fextract-book