Classification: Supervised and unsupervised. Tormod Næs, Matforsk and University of Oslo
Page 1

Classification: Supervised and unsupervised

Tormod Næs

Matforsk and University of Oslo

Page 2

Classification

• Unsupervised (cluster analysis)
  – Searching for groups in the data
    • Suspicion or general exploration
  – Hierarchical methods, partitioning methods
• Supervised (discriminant analysis)
  – Groups determined by other information
    • External or from a cluster analysis
  – Understand differences between groups
  – Allocate new objects to the groups
    • Scoring, finding degree of membership

Page 3

[Figure: Group 1, Group 2 and a new object X whose membership is unknown. What is the difference between the groups? Where?]

Page 4

Why supervised classification?

• Authenticity studies
  – Adulteration, impurities, different origin, species etc.
  – Raw materials
  – Consumer products according to specification
• When quality classes are more important than chemical values
  – Raw materials acceptable or not
  – Raw materials for different products

Page 5

Flow chart for discriminant analysis

Page 6

Main problems

• Selectivity
  – Multivariate methods are needed
• Collinearity
  – Data compression is needed
• Complex group structures
  – Ellipses, squares or "bananas"?

Page 7

[Figure: the selectivity problem. Authentic and adulterated samples plotted against two measured variables, X1 and X2.]

Page 8

Solving the selectivity problem

• Using several measurements at the same time
  – The information is there!
• Multivariate methods: these combine several instrumental NIR variables in order to determine the property of interest
• Mathematical "purification" instead of wet chemical analysis

Page 9

Multivariate methods

• Too many variables can also sometimes create problems
  – Interpretation
  – Computations: time and numerical stability
  – Simple and difficult regions (nonlinearity)
  – Overfitting is easier (dependent on the method used)
• Sometimes important to find good compromises (variable selection)

Page 10

Conflict between flexibility and stability

[Figure: the trade-off between estimation error and model error.]

Page 11

Some main classes of methods

• Classical Bayes classification
  – LDA, QDA
• Variants and modifications used to solve the collinearity problem
  – RDA, DASCO, SIMCA
• Classification based on regression analysis
  – DPLS, DPCR
• KNN methods, flexible with respect to the shape of the groups

Page 12

Bayes classification

• Assume prior probabilities p_j for the groups
  – If unknown, fix them to p_j = 1/C or to the proportions in the dataset
• Assume a known probability model within each class, f_j(x) (the resulting rule is written out below)
  – Estimated from the data, usually covariance matrices and means
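In symbols (the standard Bayes rule, not spelled out on this slide): with priors p_j and within-class densities f_j(x), a new object x is allocated to the group with the largest posterior probability

$$P(j \mid x) = \frac{p_j\, f_j(x)}{\sum_{k=1}^{C} p_k\, f_k(x)}.$$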

Page 13

Bayes classification

Advantages (+)
• Well understood, much used, often good properties, easy to validate
• Easy to modify for collinear data
• Easy to update (means and covariances)
• Can be modified for costs
• Outlier diagnostics (not directly, but can be done, e.g. via the Mahalanobis distance)

Drawbacks (-)
• Cannot handle too complex group structures; designed for elliptic structures
• Not so easy to interpret directly
• Often followed by Fisher's linear discriminant analysis, which is directly related to interpreting differences between groups

Page 14

Bayes rule: maximise the posterior probability

For normal data, minimise

$$L_j(x_i) = (x_i - \mu_j)^T \Sigma_j^{-1} (x_i - \mu_j) + \log|\Sigma_j| - 2\log p_j$$

Estimate the model parameters from the training data:

$$\hat{L}_j(x_i) = (x_i - \hat{\mu}_j)^T \hat{\Sigma}_j^{-1} (x_i - \hat{\mu}_j) + \log|\hat{\Sigma}_j| - 2\log \hat{p}_j$$

Squared Mahalanobis distance, plus a log-determinant term, minus a prior-probability term.
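As a minimal illustration of the estimated rule above, the following numpy sketch computes L̂_j for a new sample and allocates it to the group with the smallest value. All data, group means and covariances here are simulated for the example only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training data: two groups measured on two variables
groups = [
    rng.multivariate_normal([0.0, 0.0], [[1.0, 0.3], [0.3, 1.0]], size=50),
    rng.multivariate_normal([2.0, 2.0], [[0.5, 0.0], [0.0, 0.5]], size=30),
]

# Estimate priors, means and covariance matrices from the training data
n_total = sum(len(X) for X in groups)
priors = [len(X) / n_total for X in groups]
means = [X.mean(axis=0) for X in groups]
covs = [np.cov(X, rowvar=False) for X in groups]

def L_hat(x, mean, cov, prior):
    """Squared Mahalanobis distance + log|Sigma_j| - 2 log p_j (smaller is better)."""
    diff = x - mean
    return diff @ np.linalg.solve(cov, diff) + np.log(np.linalg.det(cov)) - 2.0 * np.log(prior)

x_new = np.array([1.0, 0.8])
scores = [L_hat(x_new, m, S, p) for m, S, p in zip(means, covs, priors)]
print("L_hat per group:", np.round(scores, 2), "-> allocate to group", int(np.argmin(scores)) + 1)
```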

Page 15

Different covariance structures

Page 16

Mahalanobis distance is constant on ellipsoids

Page 17

Best known members

• Equal covariance matrix for each group
  – LDA
• Unequal covariance matrices
  – QDA
• Collinear data: unstable inverted covariance matrix (see the equation above)
  – Use principal components (or PLS components); a sketch follows below
  – RDA and DASCO estimate stable inverse covariance matrices
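A minimal scikit-learn sketch of the compression idea (simulated collinear "spectral" data and an arbitrary choice of 5 components, both assumptions made only for this example): project onto principal component scores first, then run LDA or QDA on the scores.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# Hypothetical collinear data: 60 samples, 200 correlated variables, 3 groups
y = np.repeat([0, 1, 2], 20)
latent = rng.normal(size=(60, 3)) + y[:, None]           # low-dimensional group structure
X = latent @ rng.normal(size=(3, 200)) + 0.1 * rng.normal(size=(60, 200))

for clf in (LinearDiscriminantAnalysis(), QuadraticDiscriminantAnalysis()):
    model = make_pipeline(StandardScaler(), PCA(n_components=5), clf)
    rate = cross_val_score(model, X, y, cv=5).mean()     # cross-validated classification rate
    print(type(clf).__name__, round(rate, 2))
```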

Page 18

Classification by regression

• 0/1 dummy variables for each group
• Run PLS-2 (or PCR) or any other method which solves the collinearity problem
• Predict class membership
  – The class with the highest value gets the vote (see the sketch below)
• All the regular interpretation tools are available: variable selection, plotting, outlier diagnostics, etc.
• Linear borders between subgroups, so not too complicated group shapes
• Related to LDA (not covered here)
• With large data sets, we can use more flexible methods
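A sketch of the dummy-variable approach with scikit-learn's PLSRegression (simulated data; the 0/1 dummy matrix and the argmax vote follow the recipe above, while the data and the choice of 4 components are made up for the example):

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(1)
# Hypothetical data: 3 groups, 90 samples, 50 collinear variables
y = np.repeat([0, 1, 2], 30)
X = rng.normal(size=(90, 5)) @ rng.normal(size=(5, 50)) + y[:, None]

Y_dummy = np.eye(3)[y]                     # one 0/1 dummy column per group (PLS-2)
pls = PLSRegression(n_components=4).fit(X, Y_dummy)

Y_pred = pls.predict(X)                    # continuous prediction for each dummy column
y_hat = Y_pred.argmax(axis=1)              # the class with the highest value gets the vote
print("training classification rate:", (y_hat == y).mean())
```

For an honest classification rate this would of course be cross-validated or tested on new samples, as stressed elsewhere in these slides; the training rate above is only for illustration.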

Page 19

Example: classification of mayonnaise based on different oils

The oils were
• soybean
• sunflower
• canola
• olive
• corn
• grapeseed

16 samples in each group

Feasibility study, authenticity

Indahl et al. (1999), Chemometrics and Intelligent Laboratory Systems

Page 20

Classification properties of QDA, LDA and regression

Start out low

Page 21

Comparison

• LDA and QDA gave almost identical results

• It was substantially better to use LDA/QDA based on PLS/PCA components instead of using PLS directly

Page 22

Fisher's linear discriminant analysis

• Closely related to LDA
• Focuses on interpretation
  – Use "spectral loadings" or group averages
• Finds the directions in space which distinguish the most between the groups
  – Uncorrelated
• Sensitive to overfitting, so use PCs first (see the sketch below)
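A minimal sketch of this two-step recipe (simulated data; the 10 principal components and 2 canonical variates are arbitrary choices for the example): compress with PCA, then let LDA's transform produce the canonical variates used for group-separation plots.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(2)
# Hypothetical data: 4 groups, 80 samples, 120 collinear variables
y = np.repeat([0, 1, 2, 3], 20)
group_directions = rng.normal(size=(4, 120))             # each group shifted along its own direction
X = rng.normal(size=(80, 5)) @ rng.normal(size=(5, 120)) + 2.0 * group_directions[y]

scores = PCA(n_components=10).fit_transform(X)           # compress to principal components first
lda = LinearDiscriminantAnalysis(n_components=2).fit(scores, y)
canonical = lda.transform(scores)                        # canonical variates: directions separating the groups most
print(canonical.shape)                                   # (80, 2) -- plot these to inspect group separation
```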

Page 23

Fisher's method. Næs, Isaksson, Fearn and Davies (2001), A user-friendly guide to calibration and classification.

Page 24

Not possible to distinguish the groups from each other

Plot of PC1 vs PC2

Page 25

Mayonnaise data, clear separation

Canonical variates based on PCs

Page 26

[Figure: PCA scores and Fisher's method applied to the wine data; groups: Barolo, Grignolino, Barbera.]

Italian wines from the same region, but based on different cultivars; 27 chromatic and chemical variables

Forina et al. (1986), Vitis

Page 27

Correct classification rates, validated properly

• LDA: Barolo 100%, Grignolino 97.7%, Barbera 100%
• QDA: Barolo 100%, Grignolino 100%, Barbera 100%

Page 28

KNN methods

• No model assumptions

• Therefore: needs data from “everywhere” and many data points

• Flexible, complex data structures

• Sensitive to overfitting, use PCs

Page 29

[Figure: a new sample and its neighbours. KNN finds the N training samples that are closest; in this case the 3 nearest. A sketch follows below.]
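A minimal KNN sketch (simulated data with a non-elliptic, "banana"-like group shape; k = 3 as in the figure, and PCA compression first since the method is sensitive to overfitting; everything else is made up for the example):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
# Hypothetical data: 2 groups with a non-elliptic structure, embedded in 100 variables
t = rng.uniform(0, np.pi, size=100)
curve = np.c_[np.cos(t), np.sin(t)]                         # group 0 lies along an arc
blob = rng.normal([0.0, 0.2], 0.1, size=(100, 2))           # group 1 is a compact blob
low_dim = np.vstack([curve, blob])
y = np.repeat([0, 1], 100)
X = low_dim @ rng.normal(size=(2, 100)) + 0.05 * rng.normal(size=(200, 100))

model = make_pipeline(PCA(n_components=2), KNeighborsClassifier(n_neighbors=3))
print("CV classification rate:", cross_val_score(model, X, y, cv=5).mean().round(2))
```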

Page 30

Cluster analysis: unsupervised classification

• Identifying groups in the data
  – Explorative

Page 31

Examples of use

• Forina et al. (1982). Olive oil from different regions (fatty acid composition). Annali di Chimica.
• Armanino et al. (1989). Olive oils from different Tuscan provinces (acids, sterols, alcohols). Chemometrics and Intelligent Laboratory Systems.

Page 32

Methods

• PCA (informal/graphical)
  – Look for structures in scores plots
  – Interpretation of subgroups using loadings plots
• Hierarchical methods (more formal)
  – Based on distances between objects (Euclidean or Mahalanobis)
  – Join the two most similar clusters at each step
  – Interpret dendrograms (see the sketch below)
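A minimal sketch of both approaches on simulated data (Euclidean distances and Ward linkage are one concrete choice here; other distance measures and linkage rules can be plugged into the same calls):

```python
import numpy as np
from sklearn.decomposition import PCA
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

rng = np.random.default_rng(4)
# Hypothetical data: three latent groups, 20 samples each, 15 variables
X = np.vstack([rng.normal(loc, 0.5, size=(20, 15)) for loc in (0.0, 2.0, 4.0)])

# Informal/graphical route: look for structure in a PCA scores plot
scores = PCA(n_components=2).fit_transform(X)
print("first rows of the scores plot coordinates:\n", scores[:3].round(2))

# More formal route: hierarchical clustering, joining the two most similar clusters at each step
Z = linkage(X, method="ward")                      # Euclidean distances, Ward linkage
labels = fcluster(Z, t=3, criterion="maxclust")    # cut the tree into three clusters
print("cluster sizes:", np.bincount(labels)[1:])

tree = dendrogram(Z, no_plot=True)                 # merge structure; plot with matplotlib to interpret
print("leaf order of the dendrogram:", tree["leaves"][:10], "...")
```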

Page 33

Armanino et al. (1989), Chemometrics and Intelligent Laboratory Systems.

120 olive oils from one region in Italy, 29 variables (fatty acids, sterols, etc.)