Theory of Statistical Learning
Introduction to machine learning and pattern recognition
Damien Garreau, Rémi Flamary
January 8, 2020
Organization of the course
Two parallel courses:
- me (Damien Garreau) → supervised learning
- Thomas Laloë → unsupervised learning
Structure
- lectures
- practical sessions (Python)
- final grade: project + final exam
Any questions?
- send me an email at [email protected]
- with a clear subject line + your question
Resources
Websites
- en.wikipedia.org → great resource; who do you think writes the articles?
- scholar.google.com → access to scientific articles
- wolframalpha.com → easy-to-use symbolic calculator
Books
- Hastie, Tibshirani, Friedman, The Elements of Statistical Learning, Springer, 2001
- Devroye, Györfi, Lugosi, A Probabilistic Theory of Pattern Recognition, Springer, 1996
Course overview
- Introduction: examples of statistical learning problems; typology of machine learning problems; definitions
- Unsupervised learning, data description/exploration: clustering; probability density estimation; dimensionality reduction, visualization
- Supervised learning: classification; regression
- Implementation of a machine learning system: real-life data; parameter and model selection
What is statistical learning?
General idea of statistical learning
- instead of setting up a complicated rule to solve a given problem, learn the rule from (many!) examples
- How? Adjust the parameters of a (potentially complicated) model
- Goal: get the best performance
Differences with statistics
- statistics is concerned with a model for the data
- ideally, statisticians would like to do statistical inference and say something meaningful about the distribution of the data
- the two are easy to confuse; it depends on your point of view
What is pattern recognition?
From the literature
- "process of assigning a pre-specified category to a physical object or event" (Duda and Hart)
- "using several examples of complex signals and associated labels (or decisions), process of automatic decisions for new signals" (Ripley)
- "process of assigning a name y to an observation x" (Schürmann)
Goal of Pattern Recognition
- automated detection of patterns and regularities in data
- machine learning is one way to do pattern recognition
Examples of statistical learning problems
Computer vision
- product inspection in manufacturing
- military target identification
- facial recognition
Optical Character Recognition
- automatic mail classification
- automatic reading of check amounts
Computer Aided Diagnosis
- medical signals and imaging (EEG, ECG)
- assist physicians (not replace them)
Typology of machine learning problems
Unsupervised learning
- Clustering: organize objects into similar groups (taxonomy of animal species)
- Probability density estimation: estimate probability distributions from data (distribution of noise)
- Dimensionality reduction: represent high-dimensional data in a small-dimensional space for better visualization and interpretation (recommender systems)
Supervised learning
- Classification: assign a class to an observation (handwriting / object recognition, automated diagnosis)
- Regression: predict a continuous value from an observation (weather temperature)
Reinforcement learning
- train a machine to choose actions that maximize a reward (games)
Components of a machine learning system
A typical system is composed of:
- data acquisition (sensors)
- pre-processing of the data (missing values, measurement errors)
- feature extraction (Fourier transform)
- prediction step (classification / regression)
Training datasets
Unsupervised learning
- $x \in \mathcal{X}$ is an observation
- generally, $d$ features (= parameters): $\mathcal{X} = \mathbb{R}^d$
- the training set contains the observations $\{x_i\}_{i=1}^n$, where $n$ is the number of training points (= examples)
- examples are often stored as a matrix $X \in \mathbb{R}^{n \times d}$ with $X = [x_1, \ldots, x_n]^\top$: training examples are vectors, and vectors are columns!
- $d$ and $n$ define the dimensionality of the learning problem
Supervised learning
- a label $y_i \in \mathcal{Y}$ is associated with each training sample $x_i$
- the prediction space $\mathcal{Y}$ can be:
  - $\mathcal{Y} = \{-1, 1\}$ or $\mathcal{Y} = \{1, \ldots, m\}$ for classification problems
  - $\mathcal{Y} = \mathbb{R}$ for regression problems
  - structured, for structured prediction (graphs, ...)
- labels can be concatenated in a vector $y \in \mathcal{Y}^n$
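As a toy illustration (mine, not from the course), here is how such a training set could be stored with NumPy; the feature values are made up:

```python
import numpy as np

# n = 4 training examples, d = 3 features each
X = np.array([[1.0, 2.0, 0.5],
              [0.3, 1.5, 2.2],
              [2.1, 0.1, 1.0],
              [0.7, 0.9, 1.8]])   # X has shape (n, d) = (4, 3)

# one label per example (binary classification, Y = {-1, 1})
y = np.array([1, -1, 1, -1])      # y has shape (n,)

n, d = X.shape
print(f"n = {n} examples, d = {d} features")
print("third training example:", X[2])  # rows of X are the examples
```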
Features and patterns
- feature = distinct trait or detail of an object
- can be symbolic (color, type) or numeric (size, intensity)
- definition:
  - a combination of features is represented as a vector $x$ of dimensionality $d$
  - the $d$-dimensional space containing the examples is called the feature space
  - objects can be represented as points in this space; this representation is called a scatter plot
- pattern = set of traits for an observation; in a classification problem, a pattern is composed of a feature vector and a label
Features
What is a “good” feature?
The quality of a feature depends on the learning problem.
- Classification: samples from the same class should have similar feature values; examples from different classes should have different feature values.
- Regression: feature values should help better predict the target (correlation, or at least non-independence, with the value to predict).
Other properties of the features also matter in practice.
Unsupervised learning, data description/exploration
Let $\{x_i\}_{i=1}^n$ be a training set of $n$ samples of dimension $d$.
Examples
- Clustering: $\{x_i\}_{i=1}^n \mapsto \{y_i\}_{i=1}^n$, where the $y_i$ are the group ids
- Probability density estimation: $\{x_i\}_{i=1}^n \mapsto p$, where $p$ is the pdf of the data
- Generative modeling: $\{x_i\}_{i=1}^n \mapsto G$ such that the distribution of $G(z)$, $z \sim \mathcal{N}(0, \sigma^2)$, matches the data distribution $p(x)$
- Dimensionality reduction: $\{x_i \in \mathbb{R}^d\}_{i=1}^n \mapsto \{\tilde{x}_i \in \mathbb{R}^p\}_{i=1}^n$ with $p \ll d$
Clustering
Goal
- organize training examples in coherent groups
- $\{x_i\}_{i=1}^n \mapsto \{y_i\}_{i=1}^n$, where $y_i \in \mathcal{Y}$ represents a class ($\mathcal{Y} = \{1, \ldots, m\}$)
- parameters:
  - $m$, the number of classes (optional)
  - a similarity measure $\delta$ (Euclidean distance)
Methods
- k-means
- Gaussian mixtures
- spectral clustering
- hierarchical clustering
- ...
Examples
- animal taxonomy
- gene clustering
- social networks
- ...
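A minimal clustering sketch (my own example, assuming scikit-learn is available), running k-means on synthetic data:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# two synthetic groups in R^2, 50 points each
X = np.vstack([rng.normal(loc=[0, 0], scale=0.5, size=(50, 2)),
               rng.normal(loc=[3, 3], scale=0.5, size=(50, 2))])

# m = 2 classes; the similarity measure is the Euclidean distance (implicit in k-means)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:5], kmeans.labels_[-5:])  # group ids y_i
print(kmeans.cluster_centers_)                  # one centroid per group
```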
Probability density estimation
Goal
- estimate the probability distribution that generated the data
- $\{x_i\}_{i=1}^n \mapsto p$, where $p : \mathcal{X} \to \mathbb{R}$ is a probability density function ($\int_{\mathcal{X}} p(x)\,dx = 1$)
- parameters:
  - type of distribution (Gaussian)
  - parameters of the law ($\mu$, $\Sigma$)
[Figure: a Gaussian density $\mathcal{N}(\mu, \Sigma)$ and a density with 4 modes]
Methods
- kernel density estimation (Parzen)
- histogram rules (1D/2D)
- Gaussian mixtures
- ...
Examples
- noise estimation
- data generation
- novelty detection
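A minimal kernel density estimation sketch (my own example, assuming scikit-learn; the bandwidth value is arbitrary):

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
# 1D data drawn from a mixture of two Gaussians (two modes)
x = np.concatenate([rng.normal(-2, 0.5, 200), rng.normal(2, 0.8, 200)])

# Parzen window estimate with a Gaussian kernel
kde = KernelDensity(kernel="gaussian", bandwidth=0.3).fit(x[:, None])

grid = np.linspace(-5, 5, 9)[:, None]
density = np.exp(kde.score_samples(grid))  # score_samples returns log p(x)
print(np.round(density, 3))                # higher values near the two modes
```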
Generative modeling
Objective
- find a mapping function $G$ that generates samples similar to $\{x_i\}_{i=1}^n$
- namely, $G(z)$ with $z \sim \mathcal{N}(\mu, \Sigma)$ should be close to the data in distribution
- parameters:
  - class of distribution for $z$ (Gaussian)
  - type of function $G$
  - measure of similarity between the distribution of $G(z)$ and $p(x)$
Methods
- Principal Component Analysis (PCA)
- Generative Adversarial Networks (GAN)
- Variational Auto-Encoders (VAE)
Examples
- generate realistic images
- style adaptation
- data modeling
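The simplest instance of this idea (a sketch of mine, not the course's method): fit a Gaussian to the data and take $G(z) = \mu + Lz$ with $LL^\top = \Sigma$:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal([1.0, -2.0], [[1.0, 0.6], [0.6, 2.0]], size=500)

# fit a Gaussian model to the training data
mu = X.mean(axis=0)
Sigma = np.cov(X, rowvar=False)

# G(z) = mu + L z with L L^T = Sigma maps z ~ N(0, I) to ~ N(mu, Sigma)
L = np.linalg.cholesky(Sigma)
z = rng.standard_normal((5, 2))
new_samples = mu + z @ L.T
print(new_samples)  # synthetic points distributed like the data
```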
Dimensionality reduction, visualization
Objective
- project the data into a low-dimensional space
- $\{x_i \in \mathbb{R}^d\}_{i=1}^n \mapsto \{\tilde{x}_i \in \mathbb{R}^p\}_{i=1}^n$ with $p \ll d$ (often $p = 2$)
- parameters:
  - type of projection
  - a similarity measure $\delta$
Methods
- feature selection
- PCA
- nonlinear dimensionality reduction (MDS, t-SNE, auto-encoders)
Examples
- visualization in 2D/3D
- data interpretation (is the feature space discriminant?)
- recommender systems
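A dimensionality reduction sketch (my own illustration, assuming scikit-learn), projecting synthetic 10-dimensional data to $p = 2$ with PCA:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# d = 10 features, but the data really lives near a 2D subspace
latent = rng.normal(size=(100, 2))
X = latent @ rng.normal(size=(2, 10)) + 0.05 * rng.normal(size=(100, 10))

pca = PCA(n_components=2)              # project to p = 2 dimensions
X_low = pca.fit_transform(X)           # shape (100, 2), ready for a scatter plot
print(pca.explained_variance_ratio_)   # close to 1 in total: 2D captures most variance
```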
Supervised learning
Reminder: $\{(x_i, y_i)\}_{i=1}^n$ is the training set; $f(x)$ is the class of $x$ (classification) or a continuous value (regression).
General framework of supervised learning
- loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$: $\ell(y, f(x))$ is small if the prediction is good
- ideally, pick a function $f^\star$ that performs accurately on unseen data, i.e.,
$$f^\star \in \operatorname*{arg\,min}_{f} \ \mathbb{E}_{(x, y) \sim (X, Y)} \left[ \ell(y, f(x)) \right].$$
- in real life:
  - restricted set of functions $f_\theta : \mathcal{X} \to \mathcal{Y}$, depending on a parameter $\theta \in \Theta$
  - only a finite number of training samples
- Empirical Risk Minimization:
$$f^\star \in \operatorname*{arg\,min}_{f_\theta,\ \theta \in \Theta} \left\{ \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, f_\theta(x_i)) \right\}.$$
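A minimal ERM sketch (my own example): a linear model $f_\theta(x) = w^\top x + b$ with the squared loss, whose empirical risk minimizer has a closed form via least squares:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))
y = X @ np.array([1.5, -2.0, 0.5]) + 1.0 + 0.1 * rng.normal(size=n)

# ERM with squared loss l(y, f(x)) = (y - f(x))^2 over linear models:
# minimize (1/n) * sum_i (y_i - w.x_i - b)^2, solved by least squares
X1 = np.hstack([X, np.ones((n, 1))])          # append a column of ones for b
theta, *_ = np.linalg.lstsq(X1, y, rcond=None)
w, b = theta[:d], theta[d]

empirical_risk = np.mean((y - (X @ w + b)) ** 2)
print(w, b, empirical_risk)  # recovers weights ~ (1.5, -2, 0.5) and b ~ 1
```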
Examples of functions
Linear model
$$f_\theta(x) = b + \sum_{j=1}^{d} w_j x_j = b + w^\top x,$$
parametrized by $\theta = (b, w)$, with $w \in \mathbb{R}^d$ and $b \in \mathbb{R}$.
Logistic regression
$$f_\theta(x) = P(Y = k \mid X = x) = \frac{\exp(\beta_{k0} + \beta_k^\top x)}{1 + \sum_{i=1}^{m-1} \exp(\beta_{i0} + \beta_i^\top x)},$$
parametrized by $\theta = (\beta_i)_{1 \le i \le m-1}$.
Neural network
$$f_\theta(x) = w_2^\top\, \sigma(W_1 x + b_1) + b_2,$$
for instance a one-hidden-layer network with nonlinearity $\sigma$, parametrized by $\theta = (W_1, b_1, w_2, b_2)$.
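A sketch of these three function classes in NumPy (my own code; all parameter values are arbitrary):

```python
import numpy as np

d = 4
x = np.array([0.5, -1.0, 2.0, 0.1])

# linear model: f(x) = b + w.x
w, b = np.array([1.0, 0.5, -0.2, 2.0]), 0.3
f_linear = b + w @ x

# logistic regression (binary case, m = 2): P(Y = 1 | X = x)
beta0, beta = 0.1, np.array([0.4, -0.3, 0.8, 0.0])
f_logistic = np.exp(beta0 + beta @ x) / (1 + np.exp(beta0 + beta @ x))

# one-hidden-layer neural network with ReLU nonlinearity
W1, b1 = np.full((3, d), 0.1), np.zeros(3)
w2, b2 = np.array([1.0, -1.0, 0.5]), 0.0
f_nn = w2 @ np.maximum(W1 @ x + b1, 0.0) + b2

print(f_linear, f_logistic, f_nn)
```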
Binary classification
Objective
- train a function that predicts $-1$ or $1$
- $\{(x_i, y_i)\}_{i=1}^n \mapsto f$
- prediction: the sign of $f$
- $f(x) = 0$: decision boundary
- parameters:
  - type of function
  - performance measure (what is optimized)
[Figure: two classes C1 and C2 separated by a decision boundary]
Methods
- linear discrimination
- Support Vector Machines (SVM)
- decision trees, random forests
Examples
- Optical Character Recognition (OCR)
- computer-aided diagnosis
- weather prediction
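A binary classification sketch (my own example, assuming scikit-learn), training a linear SVM on two synthetic classes:

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
# two Gaussian classes labelled -1 and 1
X = np.vstack([rng.normal([-1, -1], 0.7, (100, 2)),
               rng.normal([1, 1], 0.7, (100, 2))])
y = np.array([-1] * 100 + [1] * 100)

clf = LinearSVC().fit(X, y)
print(clf.predict([[2.0, 2.0], [-2.0, -2.0]]))   # -> [ 1 -1]
print(clf.decision_function([[0.0, 0.0]]))       # f(x) near 0: close to the boundary
```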
Multiclass classification
Principle
A classifier partitions the feature space into several regions associated with different classes.
- boundaries between the regions are called decision boundaries
- classifying a new example $x$ = finding its region and assigning the corresponding label
One-Against-All strategy
- the classifier is represented by an ensemble of discriminant functions $g_i(x)$: the predicted class for sample $x$ is the class $j$ such that $g_j(x) > g_i(x)$ for all $i \neq j$
- the scores can be used to output probabilities for each class, using the softmax function instead of the max
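A one-against-all decision sketch in NumPy (my own illustration, with made-up scores $g_i(x)$):

```python
import numpy as np

# discriminant scores g_i(x) for m = 4 classes on one sample x
g = np.array([1.2, -0.3, 2.5, 0.8])

# hard decision: pick the class with the largest score
predicted_class = np.argmax(g)

# soft decision: softmax turns scores into class probabilities
probs = np.exp(g - g.max())   # subtract the max for numerical stability
probs /= probs.sum()

print(predicted_class)        # -> 2
print(np.round(probs, 3))     # probabilities summing to 1, largest for class 2
```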
Regression
Principle
- train a function predicting a continuous value
- $\{(x_i, y_i)\}_{i=1}^n \mapsto f$
- parameters:
  - type of function
  - performance measure
  - prediction error
Methods
- Ordinary Least Squares (OLS)
- ridge regression
- LASSO
- kernel regression
Examples
- movement prediction
- inverse problems
- weather prediction (temperature)
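A regression sketch (my own example, assuming scikit-learn; the regularization strength alpha is arbitrary):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([2.0, 0.0, -1.0, 0.5, 0.0]) + 0.1 * rng.normal(size=100)

reg = Ridge(alpha=1.0).fit(X, y)     # squared loss + L2 penalty on the weights
print(reg.coef_, reg.intercept_)     # approximately recovers the true weights
print(reg.predict(X[:3]))            # continuous predicted values
```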
Real data (I)
Each issue below is illustrated by a scatter plot of two classes C1 and C2:
- unrelated features
- non-representative samples (a region not covered by the measurements)
- noise (a region with a higher noise level?)
- outliers (isolated points marked "?")
Real data (II)
Dataset dimensionality
We always have a finite number n of training samples of dimensionality d.
Curse of dimensionality
[Figure: sampling a domain in dimension D = 1, 2, 3 (axes $x_1$, $x_2$, $x_3$)]
The curse of dimensionality refers to the fact that, as the dimensionality of the data increases, the number of samples necessary to cover the domain grows exponentially with the dimension.
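A back-of-the-envelope illustration (mine): covering each axis with $k = 10$ bins requires $k^D$ cells, hence on the order of $k^D$ samples to see each cell once:

```python
# number of cells when each of the D axes is split into k bins
k = 10
for D in (1, 2, 3, 10):
    print(f"D = {D:2d}: {k**D:,} cells to cover the domain")
# D =  1: 10 cells ... D = 10: 10,000,000,000 cells
```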
Model selection
How to select?
- model too simple: not complex enough, large test error
- model too complex: complicated to train, low training error, but generally high test error (over-fitting)
- in the end, we want to predict well on new data!
Validation
- split the data into learning / validation sets
- maximize performance on the validation data
- validation requires a good performance measure
[Figure: three decision boundaries between classes C1 and C2: too simple? too complex? well-suited?]
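A validation sketch (my own example, assuming scikit-learn): hold out part of the data and pick the model that performs best on it. The gamma values are arbitrary; a larger gamma gives a more complex RBF model:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([-1, -1], 0.8, (100, 2)),
               rng.normal([1, 1], 0.8, (100, 2))])
y = np.array([-1] * 100 + [1] * 100)

# hold out 30% of the data for validation
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# compare models of increasing complexity on the validation set
for gamma in (0.01, 1.0, 100.0):
    clf = SVC(kernel="rbf", gamma=gamma).fit(X_tr, y_tr)
    print(f"gamma={gamma:7.2f}  train acc={clf.score(X_tr, y_tr):.2f}  "
          f"val acc={clf.score(X_val, y_val):.2f}")
# the most complex model may fit the training set best yet do worse on validation
```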