Classification: Introduction to Supervised Learning
Ana Karina Fermin
M2 ISEFAR
http://fermin.perso.math.cnrs.fr/

Source: fermin.perso.math.cnrs.fr/Files/Cours-Introduction-ISEFAR.pdf

Sep 26, 2020

Page 1

Classification: Introduction to Supervised Learning

Ana Karina Fermin

M2 ISEFAR
http://fermin.perso.math.cnrs.fr/

Page 2

Motivation

Credit Default, Credit Score, Bank Risk, Market Risk Management

Data: Client profile, client credit history...
Input: Client profile
Output: Credit risk

Page 3

Motivation

Spam detection (Text classification)

Data: email collection
Input: email
Output: Spam or No Spam

Page 4

Motivation

Face Detection

Data: Annotated database of images
Input: Sub-window in the image
Output: Presence or absence of a face...

Page 5

Motivation

Number Recognition

Data: Annotated database of images (each image is represented by a vector of 28 × 28 = 784 pixel intensities)
Input: Image
Output: Corresponding number

Page 6

Machine Learning

A definition by Tom Mitchell (http://www.cs.cmu.edu/~tom/):
A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

Page 7

Supervised Learning

Supervised Learning Framework

Input measurement: X = (X^(1), X^(2), ..., X^(d)) ∈ X
Output measurement: Y ∈ Y.
(X, Y) ~ P with P unknown.
Training data: D_n = {(X_1, Y_1), ..., (X_n, Y_n)} (i.i.d. ~ P)

Often X ∈ R^d and Y ∈ {−1, 1} (classification), or X ∈ R^d and Y ∈ R (regression).

A classifier is a function in F = {f : X → Y}

Goal

Construct a good classifier f̂ from the training data.

We need to specify the meaning of "good".
Formally, classification and regression are the same problem!
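As an illustration of this framework, here is a minimal Python sketch (illustrative only; the course's own experiments use R): a toy distribution P, an i.i.d. training sample D_n, and a fixed classifier f : X → Y. The distribution and the classifier are assumptions chosen for the example, not part of the course material.

```python
import random

random.seed(0)

# Toy distribution P on X x Y = [-1, 1]^2 x {-1, 1}:
# X uniform on the square, Y determined by the first coordinate.
def sample_pair():
    x = (random.uniform(-1, 1), random.uniform(-1, 1))
    y = 1 if x[0] >= 0 else -1
    return x, y

# Training data D_n = {(X_1, Y_1), ..., (X_n, Y_n)}, i.i.d. ~ P
n = 100
D_n = [sample_pair() for _ in range(n)]

# A classifier is any function f : X -> Y; here a fixed linear rule.
def f(x):
    return 1 if x[0] + x[1] >= 0 else -1

predictions = [f(x) for x, _ in D_n]
```

Note that f was fixed before seeing the data; constructing f̂ *from* D_n is precisely what the rest of the course is about.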

Page 8

Loss and Probabilistic Framework

Loss function
A loss function ℓ(f(x), y) measures how well f(x) "predicts" y.
Examples:
Prediction loss: ℓ(Y, f(X)) = 1_{Y ≠ f(X)}
Quadratic loss: ℓ(Y, f(X)) = |Y − f(X)|^2

Risk of a generic classifier
The risk is the average loss for a new couple:

R(f) = E[ℓ(Y, f(X))]

Examples:
Prediction loss: E[ℓ(Y, f(X))] = P{Y ≠ f(X)}
Quadratic loss: E[ℓ(Y, f(X))] = E[|Y − f(X)|^2]

Beware: as f̂ depends on D_n, its risk R(f̂) is a random variable.
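For a fixed classifier, the risk can be approximated by averaging the loss over fresh draws from P. A minimal Python sketch, assuming an illustrative toy distribution with 10% label noise (so the true risk of the sign rule is exactly 0.1):

```python
import random

random.seed(1)

def sample_pair():
    """Toy P on R x {-1, 1}: X uniform on [-1, 1],
    Y = sign(X) flipped with probability 0.1 (label noise)."""
    x = random.uniform(-1, 1)
    y = 1 if x >= 0 else -1
    if random.random() < 0.1:
        y = -y
    return x, y

def f(x):
    """A fixed classifier, chosen independently of any sample."""
    return 1 if x >= 0 else -1

# Monte Carlo approximation of R(f) = E[l(Y, f(X))].
# With the 0/1 (prediction) loss, R(f) = P{Y != f(X)}.
m = 100_000
errors = 0
for _ in range(m):
    x, y = sample_pair()
    if y != f(x):
        errors += 1
risk_hat = errors / m   # close to the true risk 0.1
```

Here f is fixed, so risk_hat is an honest estimate; for a data-dependent f̂ the same average over the *training* points would be optimistically biased.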

Page 9

Supervised Learning

Experience, Task and Performance measure
Training data: D_n = {(X_1, Y_1), ..., (X_n, Y_n)} (i.i.d. ~ P)
Predictor: f : X → Y
Cost/Loss function: ℓ(f(X), Y)
Risk: R(f) = E[ℓ(Y, f(X))]

Often ℓ(f(X), Y) = |f(X) − Y|^2 or ℓ(f(X), Y) = 1_{Y ≠ f(X)}

Goal

Learn a rule to construct a classifier f̂ ∈ F from the training data D_n such that the risk R(f̂) is small on average or with high probability with respect to D_n.

Page 10

Goal

Machine Learning

Learn a rule to construct a classifier f̂ ∈ F from the training data D_n such that the risk R(f̂) is small on average or with high probability with respect to D_n.

Canonical example: Empirical Risk Minimizer
One restricts f to a subset of functions S = {f_θ, θ ∈ Θ}.
One replaces the minimization of the average loss by the minimization of the empirical loss:

f̂ = f_θ̂ = argmin_{f_θ, θ ∈ Θ} (1/n) Σ_{i=1}^{n} ℓ(Y_i, f_θ(X_i))

Example: linear discrimination with

S = {x ↦ sign(β^T x + β_0) : β ∈ R^d, β_0 ∈ R}
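A brute-force Python sketch of this empirical risk minimizer over linear rules (the grid search, the toy data, and the 0/1 loss are illustrative assumptions; real implementations minimize a surrogate loss instead):

```python
import random

random.seed(2)

# Illustrative training data: X in R^2, labels from a linear boundary.
n = 200
D = []
for _ in range(n):
    x = (random.uniform(-1, 1), random.uniform(-1, 1))
    y = 1 if x[0] + 2 * x[1] >= 0 else -1   # "true" boundary
    D.append((x, y))

def sign(t):
    return 1 if t >= 0 else -1

def empirical_risk(beta, beta0):
    """(1/n) sum of 0/1 losses of f(x) = sign(beta . x + beta0) on D."""
    errs = sum(1 for x, y in D
               if sign(beta[0] * x[0] + beta[1] * x[1] + beta0) != y)
    return errs / n

# Crude ERM: exhaustive search over a small grid of (beta_1, beta_2, beta_0).
grid = [k / 2 for k in range(-4, 5)]
best = min(((b1, b2, b0) for b1 in grid for b2 in grid for b0 in grid),
           key=lambda t: empirical_risk((t[0], t[1]), t[2]))
best_risk = empirical_risk((best[0], best[1]), best[2])
```

Since the grid contains a rescaling of the true boundary, the minimal empirical risk here is (essentially) zero; on noisy data it would not be.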

Page 11

Example: TwoClass Dataset

Synthetic dataset:
Two features/covariates.
Two classes.

Dataset from Applied Predictive Modeling, M. Kuhn and K. Johnson, Springer.
Numerical experiments with R and the caret package.

Page 12

Example: Linear Discrimination

Page 13

Example: More Complex Model

Page 14

Under-fitting / Over-fitting Issue

Different behavior for different model complexity:
Under-fit: low-complexity models are easily learned, but too simple to explain the truth.
Over-fit: high-complexity models memorize the data they have seen and are unable to generalize to unseen examples.

Page 15

Under-fitting / Over-fitting Issue

We can determine whether a predictive model is under-fitting or over-fitting the training data by looking at the prediction error on the training data and the test data.
How to estimate the test error?
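One common answer, sketched in Python under illustrative assumptions: hold out part of the data as a test set. A 1-nearest-neighbour rule memorizes its training set (training error 0), and the gap to its held-out error exposes the over-fit:

```python
import random

random.seed(3)

# Illustrative data: X uniform on [-1, 1], Y = sign(X) with 20% label noise.
def sample_pair():
    x = random.uniform(-1, 1)
    y = 1 if x >= 0 else -1
    if random.random() < 0.2:
        y = -y
    return x, y

data = [sample_pair() for _ in range(400)]
train, test = data[:300], data[300:]   # hold out 100 points as a test set

def one_nn(x, sample):
    """1-nearest-neighbour rule: copy the label of the closest sample point."""
    return min(sample, key=lambda p: abs(p[0] - x))[1]

def error(predict, sample):
    return sum(1 for x, y in sample if predict(x) != y) / len(sample)

# 1-NN memorizes the training set, so its training error is 0;
# its error on held-out data reveals the over-fit.
train_err = error(lambda x: one_nn(x, train), train)
test_err = error(lambda x: one_nn(x, train), test)
```

With 20% label noise, the held-out error of 1-NN sits near 2·0.2·0.8 ≈ 0.32, far above its zero training error.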

Page 16

Binary Classification Loss Issue

Empirical Risk Minimizer

f̂ = argmin_{f ∈ S} (1/n) Σ_{i=1}^{n} ℓ_{0/1}(Y_i, f(X_i))

Classification loss: ℓ_{0/1}(y, f(x)) = 1_{y ≠ f(x)}

Not convex and not smooth!

Page 17

Statistical Point of View
Ideal Solution and Estimation

The best solution f* (which is independent of D_n) is

f* = argmin_{f ∈ F} R(f) = argmin_{f ∈ F} E[ℓ(Y, f(X))]

Bayes Predictor (ideal solution)
In binary classification with the 0−1 loss:

f*(X) = +1 if P{Y = +1 | X} ≥ P{Y = −1 | X}, and −1 otherwise.

Issue: the explicit solution requires knowing E[Y | X] for all values of X!
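When P is fully known, the Bayes predictor and its risk can be computed directly. A Python sketch on a toy model (an assumption made for illustration) where X ~ Uniform[0, 1] and P{Y = +1 | X = x} = x, so the Bayes risk is the integral of min(x, 1 − x) over [0, 1], i.e. 1/4:

```python
# Toy model where P is fully known: X ~ Uniform[0, 1] and
# eta(x) = P{Y = +1 | X = x} = x.
def eta(x):
    return x

def f_star(x):
    """Bayes predictor: +1 iff P{Y=+1|X=x} >= P{Y=-1|X=x}, i.e. eta(x) >= 1/2."""
    return 1 if eta(x) >= 0.5 else -1

# Bayes risk R(f*) = E[min(eta(X), 1 - eta(X))], approximated here by a
# midpoint sum; the exact value for this model is 1/4.
m = 100_000
bayes_risk = sum(min(eta((i + 0.5) / m), 1 - eta((i + 0.5) / m)) / m
                 for i in range(m))
```

No data appears anywhere in this computation: the Bayes predictor depends only on P, which is exactly why it is unattainable in practice.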

Page 18

Conditional prob. and Bayes Predictor

Page 19

Classification Loss and Convexification

Classification loss: ℓ_{0/1}(y, f(x)) = 1_{y ≠ f(x)}

Not convex and not smooth!

Classical convexification:
Logistic loss: ℓ(y, f(x)) = log(1 + e^{−y f(x)}) (Logistic / NN)
Hinge loss: ℓ(y, f(x)) = (1 − y f(x))_+ (SVM)
Exponential loss: ℓ(y, f(x)) = e^{−y f(x)} (Boosting...)
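These surrogates are easiest to compare as functions of the margin y·f(x). A small Python sketch checking, at a few margins, that the hinge and exponential losses upper-bound the 0/1 loss (and the logistic loss does so after division by log 2):

```python
import math

# The surrogates written as functions of the margin m = y * f(x);
# a correct, confident prediction has a large positive margin.
def logistic_loss(m):
    return math.log(1 + math.exp(-m))

def hinge_loss(m):
    return max(0.0, 1.0 - m)        # (1 - y f(x))_+

def exp_loss(m):
    return math.exp(-m)

def zero_one_loss(m):
    return 0.0 if m > 0 else 1.0    # 1_{y != f(x)}

# Convex upper bounds of the 0/1 loss, checked at a few margins.
for margin in (-2.0, -0.5, 0.5, 2.0):
    assert hinge_loss(margin) >= zero_one_loss(margin)
    assert exp_loss(margin) >= zero_one_loss(margin)
    assert logistic_loss(margin) / math.log(2) >= zero_one_loss(margin)
```

Minimizing any of these convex surrogates over S is tractable, which is the whole point of the convexification step.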

Page 20

Machine Learning

Page 21

Methods (Today):

1. k-Nearest Neighbors
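A minimal Python sketch of the k-nearest-neighbour rule, on illustrative synthetic data (not the course's TwoClass dataset): predict by majority vote among the labels of the k closest training points.

```python
import random
from collections import Counter

random.seed(4)

def knn_predict(x, sample, k):
    """k-nearest-neighbour rule: majority label among the k training
    points closest to x (squared Euclidean distance in R^2)."""
    nearest = sorted(sample,
                     key=lambda p: (p[0][0] - x[0]) ** 2
                                 + (p[0][1] - x[1]) ** 2)[:k]
    return Counter(y for _, y in nearest).most_common(1)[0][0]

# Illustrative training set: labels given by a linear boundary.
train = []
for _ in range(200):
    x = (random.uniform(-1, 1), random.uniform(-1, 1))
    train.append((x, 1 if x[0] + x[1] >= 0 else -1))

# Points deep inside each region should receive that region's label.
pred_plus = knn_predict((0.5, 0.5), train, k=5)
pred_minus = knn_predict((-0.5, -0.5), train, k=5)
```

Note how k controls the complexity discussed earlier: k = 1 memorizes the sample (over-fit), while very large k approaches a constant majority vote (under-fit).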

Page 22

Bibliography

T. Hastie, R. Tibshirani, and J. Friedman (2009). The Elements of Statistical Learning. Springer Series in Statistics.

G. James, D. Witten, T. Hastie, and R. Tibshirani (2013). An Introduction to Statistical Learning with Applications in R. Springer Texts in Statistics.

B. Schölkopf and A. Smola (2002). Learning with Kernels. The MIT Press.