Theory of Statistical Learning
Introduction to machine learning and pattern recognition
Damien Garreau, Rémi Flamary
January 8, 2020
Organization of the course
Two parallel courses:
- me (Damien Garreau) → supervised learning
- Thomas Laloë → unsupervised learning
Structure
- lectures
- practical sessions (Python)
- final grade: project + final exam
Any questions?
- send me an email at [email protected]
- with a clear subject line + your question
Resources
Websites
- en.wikipedia.org → great resource; who do you think writes the articles?
- scholar.google.com → access to scientific articles
- wolframalpha.com → easy-to-use symbolic calculator
Books
- Hastie, Tibshirani, Friedman, The Elements of Statistical Learning, Springer, 2001
- Devroye, Györfi, Lugosi, A Probabilistic Theory of Pattern Recognition, Springer, 1996
Course overview
- Introduction: examples of statistical learning problems; typology of machine learning problems; definitions
- Unsupervised learning, data description/exploration: clustering; probability density estimation; dimensionality reduction, visualization
- Supervised learning: classification; regression
- Implementation of a machine learning system: real-life data; parameter and model selection
What is statistical learning?
General idea of statistical learning
- instead of setting up a complicated rule to solve a given problem, learn the rule from (many!) examples
- How? Adjust the parameters of a (potentially complicated) model
- Goal: get the best performance
Differences with statistics
- statistics is concerned with a model for the data
- ideally, statisticians would like to do statistical inference and say something meaningful about the distribution of the data
- the two are easy to confuse; it depends on your point of view
What is pattern recognition?
From the literature
- "process of assigning a pre-specified category to a physical object or event" (Duda and Hart)
- "using several examples of complex signals and associated labels (or decisions), process of automatic decisions for new signals" (Ripley)
- "process of assigning a name y to an observation x" (Schürmann)
Goal of Pattern Recognition
- automated detection of patterns and regularities in data
- machine learning is one way to do pattern recognition
Examples of statistical learning problems
Computer vision
- product inspection in manufacturing
- military target identification
- facial recognition
Optical Character Recognition
- automatic mail classification
- automatic reading of check amounts
Computer Aided Diagnosis
- medical signals and imaging (EEG, ECG)
- assist physicians (not replace them)
Typology of machine learning problems
Unsupervised learning
- Clustering: organize objects into similar groups (taxonomy of animal species)
- Probability density estimation: estimate probability distributions from data (distribution of noise)
- Dimensionality reduction: represent high-dimensional data in a small-dimensional space for better visualization and interpretation (recommender systems)
Supervised learning
- Classification: assign a class to an observation (handwriting / object recognition, automated diagnosis)
- Regression: predict a continuous value from an observation (weather temperature)
Reinforcement learning
- train a machine to choose actions that maximize a reward (games)
Components of a machine learning system
A typical system is composed of:
- data acquisition (sensors)
- pre-processing of the data (missing values, measurement errors)
- feature extraction (Fourier transform)
- prediction step (classification / regression)
Training datasets
Unsupervised learning
- $x \in \mathcal{X}$ is an observation
- generally, $d$ features (= parameters): $\mathcal{X} = \mathbb{R}^d$
- the training set contains the observations $\{x_i\}_{i=1}^n$, where $n$ is the number of training points (= examples)
- examples are often stored as a matrix $X \in \mathbb{R}^{n \times d}$ with $X = [x_1, \ldots, x_n]^\top$: training examples are vectors, and vectors are columns!
- $d$ and $n$ define the dimensionality of the learning problem
Supervised learning
- a label $y_i \in \mathcal{Y}$ is associated with each training sample $x_i$
- the prediction space $\mathcal{Y}$ can be:
  - $\mathcal{Y} = \{-1, 1\}$ or $\mathcal{Y} = \{1, \ldots, m\}$ for classification problems
  - $\mathcal{Y} = \mathbb{R}$ for regression problems
  - structured, for structured prediction (graphs, ...)
- labels can be concatenated in a vector $y \in \mathcal{Y}^n$
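As a toy illustration (mine, not from the course), here is how such a training set could be stored with NumPy; the feature values are made up:

```python
import numpy as np

# n = 4 training examples, d = 3 features each
X = np.array([[1.0, 2.0, 0.5],
              [0.3, 1.5, 2.2],
              [2.1, 0.1, 1.0],
              [0.7, 0.9, 1.8]])   # X has shape (n, d) = (4, 3)

# one label per example (binary classification, Y = {-1, 1})
y = np.array([1, -1, 1, -1])      # y has shape (n,)

n, d = X.shape
print(f"n = {n} examples, d = {d} features")
print("third training example:", X[2])  # rows of X are the examples
```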
Features and patterns
- feature = distinct trait or detail of an object
- can be symbolic (color, type) or numeric (size, intensity)
- definition:
  - a combination of features is represented as a vector $x$ of dimensionality $d$
  - the $d$-dimensional space containing the examples is called the feature space
  - objects can be represented as points in this space; this representation is called a scatter plot
- pattern = set of traits for an observation; in a classification problem, a pattern is composed of a feature vector and a label
Features
What is a “good” feature?
The quality of a feature depends on the learning problem.
- Classification: samples from the same class should have similar feature values; examples from different classes should have different feature values.
- Regression: feature values should help better predict the target (correlation, or at least non-independence, with the value to predict).
Other properties of the features also matter in practice.
Unsupervised learning, data description/exploration
Let $\{x_i\}_{i=1}^n$ be a training set of $n$ samples of dimension $d$.
Examples
- Clustering: $\{x_i\}_{i=1}^n \mapsto \{y_i\}_{i=1}^n$, where the $y_i$ are the group ids
- Probability density estimation: $\{x_i\}_{i=1}^n \mapsto p$, where $p$ is the pdf of the data
- Generative modeling: $\{x_i\}_{i=1}^n \mapsto G$ such that the distribution of $G(z)$, $z \sim \mathcal{N}(0, \sigma^2)$, matches the data distribution $p(x)$
- Dimensionality reduction: $\{x_i \in \mathbb{R}^d\}_{i=1}^n \mapsto \{\tilde{x}_i \in \mathbb{R}^p\}_{i=1}^n$ with $p \ll d$
Clustering
Goal
- organize training examples in coherent groups
- $\{x_i\}_{i=1}^n \mapsto \{y_i\}_{i=1}^n$, where $y_i \in \mathcal{Y}$ represents a class ($\mathcal{Y} = \{1, \ldots, m\}$)
- parameters:
  - $m$, the number of classes (optional)
  - a similarity measure $\delta$ (Euclidean distance)
Methods
- k-means
- Gaussian mixtures
- spectral clustering
- hierarchical clustering
- ...
Examples
- animal taxonomy
- gene clustering
- social networks
- ...
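A minimal clustering sketch (my own example, assuming scikit-learn is available), running k-means on synthetic data:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# two synthetic groups in R^2, 50 points each
X = np.vstack([rng.normal(loc=[0, 0], scale=0.5, size=(50, 2)),
               rng.normal(loc=[3, 3], scale=0.5, size=(50, 2))])

# m = 2 classes; the similarity measure is the Euclidean distance (implicit in k-means)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:5], kmeans.labels_[-5:])  # group ids y_i
print(kmeans.cluster_centers_)                  # one centroid per group
```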
Probability density estimation
Goal
- estimate the probability distribution that generated the data
- $\{x_i\}_{i=1}^n \mapsto p$, where $p : \mathcal{X} \to \mathbb{R}$ is a probability density function ($\int_{\mathcal{X}} p(x)\,dx = 1$)
- parameters:
  - type of distribution (Gaussian)
  - parameters of the law ($\mu$, $\Sigma$)
[Figure: a Gaussian density $\mathcal{N}(\mu, \Sigma)$ and a density with 4 modes]
Methods
- kernel density estimation (Parzen)
- histogram rules (1D/2D)
- Gaussian mixtures
- ...
Examples
- noise estimation
- data generation
- novelty detection
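A minimal kernel density estimation sketch (my own example, assuming scikit-learn; the bandwidth value is arbitrary):

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
# 1D data drawn from a mixture of two Gaussians (two modes)
x = np.concatenate([rng.normal(-2, 0.5, 200), rng.normal(2, 0.8, 200)])

# Parzen window estimate with a Gaussian kernel
kde = KernelDensity(kernel="gaussian", bandwidth=0.3).fit(x[:, None])

grid = np.linspace(-5, 5, 9)[:, None]
density = np.exp(kde.score_samples(grid))  # score_samples returns log p(x)
print(np.round(density, 3))                # higher values near the two modes
```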
Generative modeling
Objective
- find a mapping function $G$ that generates samples similar to $\{x_i\}_{i=1}^n$
- namely, $G(z)$ with $z \sim \mathcal{N}(\mu, \Sigma)$ should be close to the data in distribution
- parameters:
  - class of distribution for $z$ (Gaussian)
  - type of function $G$
  - measure of similarity between the distribution of $G(z)$ and $p(x)$
Methods
- Principal Component Analysis (PCA)
- Generative Adversarial Networks (GAN)
- Variational Auto-Encoders (VAE)
Examples
- generate realistic images
- style adaptation
- data modeling
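The simplest instance of this idea (a sketch of mine, not the course's method): fit a Gaussian to the data and take $G(z) = \mu + Lz$ with $LL^\top = \Sigma$:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal([1.0, -2.0], [[1.0, 0.6], [0.6, 2.0]], size=500)

# fit a Gaussian model to the training data
mu = X.mean(axis=0)
Sigma = np.cov(X, rowvar=False)

# G(z) = mu + L z with L L^T = Sigma maps z ~ N(0, I) to ~ N(mu, Sigma)
L = np.linalg.cholesky(Sigma)
z = rng.standard_normal((5, 2))
new_samples = mu + z @ L.T
print(new_samples)  # synthetic points distributed like the data
```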
Dimensionality reduction, visualization
Objective
- project the data into a low-dimensional space
- $\{x_i \in \mathbb{R}^d\}_{i=1}^n \mapsto \{\tilde{x}_i \in \mathbb{R}^p\}_{i=1}^n$ with $p \ll d$ (often $p = 2$)
- parameters:
  - type of projection
  - a similarity measure $\delta$
Methods
- feature selection
- PCA
- nonlinear dimensionality reduction (MDS, t-SNE, auto-encoders)
Examples
- visualization in 2D/3D
- data interpretation (is the feature space discriminant?)
- recommender systems
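A dimensionality reduction sketch (my own illustration, assuming scikit-learn), projecting synthetic 10-dimensional data to $p = 2$ with PCA:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# d = 10 features, but the data really lives near a 2D subspace
latent = rng.normal(size=(100, 2))
X = latent @ rng.normal(size=(2, 10)) + 0.05 * rng.normal(size=(100, 10))

pca = PCA(n_components=2)              # project to p = 2 dimensions
X_low = pca.fit_transform(X)           # shape (100, 2), ready for a scatter plot
print(pca.explained_variance_ratio_)   # close to 1 in total: 2D captures most variance
```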
Supervised learning
Reminder: $\{(x_i, y_i)\}_{i=1}^n$ is the training set; $f(x)$ is the class of $x$ (classification) or a continuous value (regression).
General framework of supervised learning
- loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$: $\ell(y, f(x))$ is small if the prediction is good
- ideally, pick a function $f^\star$ that performs accurately on unseen data, i.e.,
$$f^\star \in \operatorname*{arg\,min}_{f} \ \mathbb{E}_{(x, y) \sim (X, Y)} \left[ \ell(y, f(x)) \right].$$
- in real life:
  - restricted set of functions $f_\theta : \mathcal{X} \to \mathcal{Y}$, depending on a parameter $\theta \in \Theta$
  - only a finite number of training samples
- Empirical Risk Minimization:
$$f^\star \in \operatorname*{arg\,min}_{f_\theta,\ \theta \in \Theta} \left\{ \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, f_\theta(x_i)) \right\}.$$
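A minimal ERM sketch (my own example): a linear model $f_\theta(x) = w^\top x + b$ with the squared loss, whose empirical risk minimizer has a closed form via least squares:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))
y = X @ np.array([1.5, -2.0, 0.5]) + 1.0 + 0.1 * rng.normal(size=n)

# ERM with squared loss l(y, f(x)) = (y - f(x))^2 over linear models:
# minimize (1/n) * sum_i (y_i - w.x_i - b)^2, solved by least squares
X1 = np.hstack([X, np.ones((n, 1))])          # append a column of ones for b
theta, *_ = np.linalg.lstsq(X1, y, rcond=None)
w, b = theta[:d], theta[d]

empirical_risk = np.mean((y - (X @ w + b)) ** 2)
print(w, b, empirical_risk)  # recovers weights ~ (1.5, -2, 0.5) and b ~ 1
```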
Examples of functions
Linear model
$$f_\theta(x) = b + \sum_{j=1}^{d} w_j x_j = b + w^\top x,$$
parametrized by $\theta = (b, w)$, with $w \in \mathbb{R}^d$ and $b \in \mathbb{R}$.
Logistic regression
$$f_\theta(x) = P(Y = k \mid X = x) = \frac{\exp(\beta_{k0} + \beta_k^\top x)}{1 + \sum_{i=1}^{m-1} \exp(\beta_{i0} + \beta_i^\top x)},$$
parametrized by $\theta = (\beta_i)_{1 \le i \le m-1}$.
Neural network
$$f_\theta(x) = w_2^\top\, \sigma(W_1 x + b_1) + b_2,$$
for instance a one-hidden-layer network with nonlinearity $\sigma$, parametrized by $\theta = (W_1, b_1, w_2, b_2)$.
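A sketch of these three function classes in NumPy (my own code; all parameter values are arbitrary):

```python
import numpy as np

d = 4
x = np.array([0.5, -1.0, 2.0, 0.1])

# linear model: f(x) = b + w.x
w, b = np.array([1.0, 0.5, -0.2, 2.0]), 0.3
f_linear = b + w @ x

# logistic regression (binary case, m = 2): P(Y = 1 | X = x)
beta0, beta = 0.1, np.array([0.4, -0.3, 0.8, 0.0])
f_logistic = np.exp(beta0 + beta @ x) / (1 + np.exp(beta0 + beta @ x))

# one-hidden-layer neural network with ReLU nonlinearity
W1, b1 = np.full((3, d), 0.1), np.zeros(3)
w2, b2 = np.array([1.0, -1.0, 0.5]), 0.0
f_nn = w2 @ np.maximum(W1 @ x + b1, 0.0) + b2

print(f_linear, f_logistic, f_nn)
```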
Binary classification
Objective
- train a function that predicts $-1$ or $1$
- $\{(x_i, y_i)\}_{i=1}^n \mapsto f$
- prediction: the sign of $f$
- $f(x) = 0$: decision boundary
- parameters:
  - type of function
  - performance measure (what is optimized)
[Figure: two classes C1 and C2 separated by a decision boundary]
Methods
- linear discrimination
- Support Vector Machines (SVM)
- decision trees, random forests
Examples
- Optical Character Recognition (OCR)
- computer-aided diagnosis
- weather prediction
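A binary classification sketch (my own example, assuming scikit-learn), training a linear SVM on two synthetic classes:

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
# two Gaussian classes labelled -1 and 1
X = np.vstack([rng.normal([-1, -1], 0.7, (100, 2)),
               rng.normal([1, 1], 0.7, (100, 2))])
y = np.array([-1] * 100 + [1] * 100)

clf = LinearSVC().fit(X, y)
print(clf.predict([[2.0, 2.0], [-2.0, -2.0]]))   # -> [ 1 -1]
print(clf.decision_function([[0.0, 0.0]]))       # f(x) near 0: close to the boundary
```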
Multiclass classification
Principle
A classifier partitions the feature space into several regions associated with different classes.
- boundaries between the regions are called decision boundaries
- classifying a new example $x$ = finding its region and assigning the corresponding label
One-Against-All strategy
- the classifier is represented by an ensemble of discriminant functions $g_i(x)$: the predicted class for sample $x$ is the class $j$ such that $g_j(x) > g_i(x)$ for all $i \neq j$
- the scores can be used to output probabilities for each class, using the softmax function instead of the max
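A one-against-all decision sketch in NumPy (my own illustration, with made-up scores $g_i(x)$):

```python
import numpy as np

# discriminant scores g_i(x) for m = 4 classes on one sample x
g = np.array([1.2, -0.3, 2.5, 0.8])

# hard decision: pick the class with the largest score
predicted_class = np.argmax(g)

# soft decision: softmax turns scores into class probabilities
probs = np.exp(g - g.max())   # subtract the max for numerical stability
probs /= probs.sum()

print(predicted_class)        # -> 2
print(np.round(probs, 3))     # probabilities summing to 1, largest for class 2
```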
Regression
Principle
- train a function predicting a continuous value
- $\{(x_i, y_i)\}_{i=1}^n \mapsto f$
- parameters:
  - type of function
  - performance measure
  - prediction error
Methods
- Ordinary Least Squares (OLS)
- ridge regression
- LASSO
- kernel regression
Examples
- movement prediction
- inverse problems
- weather prediction (temperature)
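A regression sketch (my own example, assuming scikit-learn; the regularization strength alpha is arbitrary):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([2.0, 0.0, -1.0, 0.5, 0.0]) + 0.1 * rng.normal(size=100)

reg = Ridge(alpha=1.0).fit(X, y)     # squared loss + L2 penalty on the weights
print(reg.coef_, reg.intercept_)     # approximately recovers the true weights
print(reg.predict(X[:3]))            # continuous predicted values
```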
Real data (I)
Each issue below is illustrated by a scatter plot of two classes C1 and C2:
- unrelated features
- non-representative samples (a region not covered by the measurements)
- noise (a region with a higher noise level?)
- outliers (isolated points marked "?")
Real data (II)
Dataset dimensionality
We always have a finite number n of training samples of dimensionality d.
Curse of dimensionality
[Figure: sampling a domain in dimension D = 1, 2, 3 (axes $x_1$, $x_2$, $x_3$)]
The curse of dimensionality refers to the fact that, as the dimensionality of the data increases, the number of samples necessary to cover the domain grows exponentially with the dimension.
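A back-of-the-envelope illustration (mine): covering each axis with $k = 10$ bins requires $k^D$ cells, hence on the order of $k^D$ samples to see each cell once:

```python
# number of cells when each of the D axes is split into k bins
k = 10
for D in (1, 2, 3, 10):
    print(f"D = {D:2d}: {k**D:,} cells to cover the domain")
# D =  1: 10 cells ... D = 10: 10,000,000,000 cells
```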
Model selection
How to select?
- model too simple: not complex enough, large test error
- model too complex: complicated to train, low training error, but generally high test error (over-fitting)
- in the end, we want to predict well on new data!
Validation
- split the data into learning / validation sets
- maximize performance on the validation data
- validation requires a good performance measure
[Figure: three decision boundaries between classes C1 and C2: too simple? too complex? well-suited?]
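A validation sketch (my own example, assuming scikit-learn): hold out part of the data and pick the model that performs best on it. The gamma values are arbitrary; a larger gamma gives a more complex RBF model:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([-1, -1], 0.8, (100, 2)),
               rng.normal([1, 1], 0.8, (100, 2))])
y = np.array([-1] * 100 + [1] * 100)

# hold out 30% of the data for validation
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# compare models of increasing complexity on the validation set
for gamma in (0.01, 1.0, 100.0):
    clf = SVC(kernel="rbf", gamma=gamma).fit(X_tr, y_tr)
    print(f"gamma={gamma:7.2f}  train acc={clf.score(X_tr, y_tr):.2f}  "
          f"val acc={clf.score(X_val, y_val):.2f}")
# the most complex model may fit the training set best yet do worse on validation
```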