PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION

Introduction

In the previous chapter, we explored a class of regression models having particularly simple analytical and computational properties.

We now discuss an analogous class of models for solving classification problems.

The goal in classification is to take an input vector x and to assign it to one of K discrete classes Ck where k = 1, . . . , K.

In the most common scenario, the classes are taken to be disjoint, so that each input is assigned to one and only one class.


The input space is thereby divided into decision regions whose boundaries are called decision boundaries or decision surfaces.

In this chapter we consider linear models for classification, by which we mean that the decision surfaces are linear functions of the input vector x and hence are defined by (D − 1)-dimensional hyperplanes within the D-dimensional input space.

Data sets whose classes can be separated exactly by linear decision surfaces are said to be linearly separable.


Discriminant Functions

A discriminant is a function that takes an input vector x and assigns it to one of K classes, denoted Ck. In this chapter we restrict attention to linear discriminants.

To simplify the discussion, we consider first the case of two classes.


Two classes

The simplest representation of a linear discriminant function is obtained by taking a linear function of the input vector, y(x) = wTx + w0,

where w is called a weight vector, and w0 is a bias

The negative of the bias is sometimes called a threshold.

An input vector x is assigned to class C1 if y(x) ≥ 0 and to class C2 otherwise.


The corresponding decision boundary is therefore defined by the relation y(x) = 0,

which corresponds to a (D − 1)-dimensional hyperplane within the D-dimensional input space.

Consider two points xA and xB both of which lie on the decision surface. Because y(xA) = y(xB) = 0, we have wT(xA−xB) = 0,

hence the vector w is orthogonal to every vector lying within the decision surface, and so w determines the orientation of the decision surface.

Similarly, if x is a point on the decision surface, then y(x) = 0, and so the normal distance from the origin to the decision surface is given by wTx/||w|| = −w0/||w||.
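As a minimal NumPy sketch (not from the slides; the weight vector and bias below are made up), these quantities can be computed directly: the class assignment from the sign of y(x), the signed perpendicular distance of a point from the decision surface y(x)/||w||, and the distance of the surface from the origin −w0/||w||.

import numpy as np

# Hypothetical parameters of a two-class linear discriminant y(x) = w^T x + w0.
w = np.array([2.0, -1.0])   # weight vector: determines the orientation of the decision surface
w0 = -0.5                   # bias: its negative acts as a threshold

def y(x):
    """Linear discriminant value."""
    return w @ x + w0

def assign_class(x):
    """Assign x to class C1 if y(x) >= 0, otherwise to class C2."""
    return "C1" if y(x) >= 0 else "C2"

def signed_distance(x):
    """Signed perpendicular distance of x from the decision surface y(x) = 0."""
    return y(x) / np.linalg.norm(w)

x = np.array([1.0, 0.5])
print(assign_class(x), signed_distance(x))
print(-w0 / np.linalg.norm(w))   # distance from the origin to the decision surface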


Multiple classes

Now consider the extension of linear discriminants to K > 2 classes.

We might be tempted to build a K-class discriminant by combining a number of two-class discriminant functions.

However, this leads to some serious difficulties (Duda and Hart, 1973) as we now show.

Consider the use of K−1 classifiers each of which solves a two-class problem of separating points in a particular class Ck from points not in that class. This is known as a one-versus-the-rest classifier.


An alternative is to introduce K(K − 1)/2 binary discriminant functions, one for every possible pair of classes. This is known as a one-versus-one classifier. Each point is then classified according to a majority vote amongst the discriminant functions.

We can avoid these difficulties by considering a single K-class discriminant comprising K linear functions of the form yk(x) = wkTx + wk0, and then assigning a point x to class Ck if yk(x) > yj(x) for all j ≠ k.



The decision boundary between class Ck and class Cj is therefore given by yk(x) = yj(x), and hence corresponds to a (D − 1)-dimensional hyperplane defined by (wk − wj)Tx + (wk0 − wj0) = 0.

The decision regions of such a discriminant are always singly connected and convex.


To see that the decision regions are convex, consider two points xA and xB, both of which lie inside decision region Rk. Any point x̂ on the line connecting them can be written as x̂ = λxA + (1 − λ)xB with 0 ≤ λ ≤ 1. By linearity, yk(x̂) = λyk(xA) + (1 − λ)yk(xB), and because yk(xA) > yj(xA) and yk(xB) > yj(xB) for all j ≠ k, it follows that yk(x̂) > yj(x̂), so x̂ also lies inside Rk. The decision region Rk is therefore singly connected and convex.
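A minimal NumPy sketch of this single K-class discriminant, with hypothetical weight vectors and biases (values made up for illustration), assigning each point to the class whose linear function is largest:

import numpy as np

# Hypothetical parameters for K = 3 classes: one weight vector and bias per class.
W = np.array([[ 1.0,  0.0],    # w_1
              [-1.0,  1.0],    # w_2
              [ 0.0, -1.0]])   # w_3
w0 = np.array([0.0, 0.2, -0.1])

def classify(X):
    """Assign each row of X to the class k with the largest y_k(x) = w_k^T x + w_k0."""
    Y = X @ W.T + w0           # shape (N, K): all K linear functions evaluated at once
    return np.argmax(Y, axis=1)

X = np.array([[0.5, 0.5],
              [-2.0, 1.0],
              [0.0, -3.0]])
print(classify(X))             # class indices in {0, 1, 2}

# The decision regions produced this way are convex: any point on the line between
# two points assigned to class k is itself assigned to class k.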


Least squares for classification

Each class Ck is described by its own linear model, so that yk(x) = wkTx + wk0, where k = 1, . . . , K. These can be grouped together using vector notation as y(x) = W̃Tx̃, where the k-th column of W̃ is w̃k = (wk0, wkT)T and x̃ = (1, xT)T is the augmented input vector.

We now determine the parameter matrix W̃ by minimizing a sum-of-squares error function, as we did for regression in Chapter 3.


Consider a training data set {xn, tn}, n = 1, . . . , N, and define a matrix T whose n-th row is the vector tnT, together with a matrix X̃ whose n-th row is x̃nT. Minimizing the sum-of-squares error function ED(W̃) = (1/2) Tr{(X̃W̃ − T)T(X̃W̃ − T)} with respect to W̃ then gives the solution W̃ = X̃†T.


Pseudo-inverse

Here X̃† = (X̃TX̃)⁻¹X̃T denotes the Moore–Penrose pseudo-inverse of the matrix X̃.
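A short sketch of this least-squares solution, assuming synthetic data, a 1-of-K target coding, and NumPy's np.linalg.pinv for the pseudo-inverse:

import numpy as np

rng = np.random.default_rng(0)

# Synthetic two-class data in two dimensions (illustrative only).
X = np.vstack([rng.normal([0, 0], 1, size=(50, 2)),
               rng.normal([3, 3], 1, size=(50, 2))])
labels = np.repeat([0, 1], 50)

T = np.eye(2)[labels]                              # 1-of-K (one-hot) target matrix, shape (N, K)
X_tilde = np.hstack([np.ones((len(X), 1)), X])     # augmented inputs (1, x^T)

# Least-squares solution: W_tilde = pinv(X_tilde) @ T
W_tilde = np.linalg.pinv(X_tilde) @ T

def predict(Xnew):
    Xa = np.hstack([np.ones((len(Xnew), 1)), Xnew])
    return np.argmax(Xa @ W_tilde, axis=1)         # assign to the class with the largest output

print(np.mean(predict(X) == labels))               # training accuracy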


Linearity

An interesting property of least-squares solutions with multiple target variables is that if every target vector in the training set satisfies some linear constraint aTtn + b = 0, for some constants a and b, then the model prediction for any value of x will satisfy the same constraint, aTy(x) + b = 0. For example, with a 1-of-K target coding the components of y(x) will sum to 1 for any x, although they are not constrained to lie in the interval (0, 1) and so cannot be interpreted as probabilities.

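A quick numerical check of this property, again on made-up data: with one-hot targets, whose components sum to 1 by construction, the least-squares predictions also sum to 1 at arbitrary inputs.

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 3))                    # arbitrary inputs, D = 3
labels = rng.integers(0, 4, size=60)            # K = 4 classes
T = np.eye(4)[labels]                           # one-hot targets: every row sums to 1

X_tilde = np.hstack([np.ones((60, 1)), X])
W_tilde = np.linalg.pinv(X_tilde) @ T

X_test = rng.normal(size=(5, 3))                # new, arbitrary inputs
Y = np.hstack([np.ones((5, 1)), X_test]) @ W_tilde
print(Y.sum(axis=1))                            # ~1.0 for every row, though entries may lie outside [0, 1]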


Outliers

Least squares is highly sensitive to outliers, unlike logistic regression. The sum-of-squares error function penalizes predictions that are ‘too correct’, in that they lie a long way on the correct side of the decision boundary. In Chapter 7 we shall consider several alternative error functions for classification and shall see that they do not suffer from this difficulty.



On the left is the result of using a least-squares discriminant; on the right, the result of using logistic regression.


Fisher’s linear discriminant

One way to view a linear classification model is in terms of dimensionality reduction.

Consider first the case of two classes. We take the D-dimensional input vector x and project it down to one dimension using y = wTx.

If we place a threshold on y and classify y ≥ −w0 as class C1, and otherwise class C2, then we obtain the standard linear classifier discussed in the previous section.

In general, the projection onto one dimension leads to a considerable loss of information, and classes that are well separated in the original D-dimensional space may become strongly overlapping in one dimension.


The simplest measure of the separation of the two classes, when projected onto w, is the separation of the projected class means. This expression can be made arbitrarily large simply by increasing the magnitude of w, so we could constrain w to have unit length. Even then, classes that are well separated in the original D-dimensional space can have considerable overlap when projected onto the line joining their means. Fisher's idea is therefore to maximize a criterion that gives a large separation between the projected class means while also giving a small variance within each class, which leads to a weight vector w ∝ SW⁻¹(m2 − m1), where SW is the total within-class covariance matrix and m1, m2 are the class means.
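A minimal sketch of the two-class Fisher discriminant on synthetic data, computing w proportional to SW⁻¹(m2 − m1) from the within-class scatter matrix:

import numpy as np

rng = np.random.default_rng(2)
# Synthetic two-class data (illustrative only).
X1 = rng.multivariate_normal([0, 0], [[2.0, 1.5], [1.5, 2.0]], size=80)
X2 = rng.multivariate_normal([3, 2], [[2.0, 1.5], [1.5, 2.0]], size=80)

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)          # class means

# Within-class scatter matrix S_W = sum over classes of sum_n (x_n - m_k)(x_n - m_k)^T
S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)

# Fisher direction: w proportional to S_W^{-1} (m2 - m1)
w = np.linalg.solve(S_W, m2 - m1)
w /= np.linalg.norm(w)

# Project the data onto w; a well-chosen w keeps the classes separated in one dimension.
y1, y2 = X1 @ w, X2 @ w
print(y1.mean(), y2.mean())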


The least-squares approach to the determination of a linear discriminant was based on the goal of making the model predictions as close as possible to a set of target values.

By contrast, the Fisher criterion was derived by requiring maximum class separation in the output space.

It is interesting to see the relationship between these two approaches.

In particular, we shall show that, for the two-class problem, the Fisher solution can be obtained as a special case of least squares.


Fisher and least squares
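The derivation itself is not reproduced in this transcript, but the relationship can be illustrated numerically. A sketch, assuming synthetic data and the special target coding N/N1 for class C1 and −N/N2 for class C2, checking that the least-squares weight vector is parallel to the Fisher direction:

import numpy as np

rng = np.random.default_rng(3)
X1 = rng.multivariate_normal([0, 0], np.eye(2), size=40)   # class C1
X2 = rng.multivariate_normal([2, 1], np.eye(2), size=60)   # class C2
N1, N2 = len(X1), len(X2)
N = N1 + N2

X = np.vstack([X1, X2])
# Special target coding: N/N1 for class C1 and -N/N2 for class C2.
t = np.concatenate([np.full(N1, N / N1), np.full(N2, -N / N2)])

# Least-squares fit of y(x) = w^T x + w0 to these targets.
X_aug = np.hstack([np.ones((N, 1)), X])
w_tilde = np.linalg.pinv(X_aug) @ t
w_ls = w_tilde[1:]                                          # drop the bias component

# Fisher direction from the within-class scatter matrix.
m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
w_fisher = np.linalg.solve(S_W, m2 - m1)

# The two directions should be parallel: |cosine of the angle| close to 1.
cos = w_ls @ w_fisher / (np.linalg.norm(w_ls) * np.linalg.norm(w_fisher))
print(abs(cos))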


Fisher’s discriminant for multiple classes


The perceptron algorithm

Another example of a linear discriminant model is the perceptron of Rosenblatt (1962), which occupies an important place in the history of pattern recognition algorithms.

It corresponds to a two-class model in which the input vector x is first transformed using a fixed nonlinear transformation to give a feature vector φ(x), and this is then used to construct a generalized linear model of the form y(x) = f(wTφ(x)), where the nonlinear activation function f is a step function: f(a) = +1 for a ≥ 0 and f(a) = −1 for a < 0. It is convenient to use target values t = +1 for class C1 and t = −1 for class C2.


The perceptron algorithm

The perceptron criterion associates zero error with any pattern that is correctly classified, whereas for a misclassified pattern xn it tries to minimize the quantity −wTφ(xn)tn.

The perceptron criterion is therefore given by EP(w) = −Σn∈M wTφntn, where M denotes the set of all misclassified patterns. We now apply the stochastic gradient descent algorithm to this error function, giving the update w(τ+1) = w(τ) + ηφntn for each misclassified pattern. Because y(x) is unchanged if w is multiplied by a constant, we can set the learning rate parameter η equal to 1 without loss of generality.
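A minimal sketch of this learning rule on synthetic, linearly separable data, assuming the simple feature map φ(x) = (1, x) and targets t ∈ {−1, +1}:

import numpy as np

rng = np.random.default_rng(4)
# Linearly separable synthetic data with targets t in {-1, +1}.
X = np.vstack([rng.normal([0, 0], 0.5, size=(30, 2)),
               rng.normal([3, 3], 0.5, size=(30, 2))])
t = np.concatenate([-np.ones(30), np.ones(30)])

Phi = np.hstack([np.ones((60, 1)), X])   # simple feature map phi(x) = (1, x)
w = np.zeros(Phi.shape[1])

# Stochastic gradient descent on the perceptron criterion (eta = 1):
# cycle through the patterns, and for each misclassified pattern add phi_n * t_n to w.
for epoch in range(100):
    misclassified = 0
    for phi_n, t_n in zip(Phi, t):
        if t_n * (w @ phi_n) <= 0:       # wrong side of (or exactly on) the boundary
            w += phi_n * t_n
            misclassified += 1
    if misclassified == 0:               # every pattern correctly classified: converged
        break

print(epoch, w)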


The perceptron algorithm

The perceptron convergence theorem states that if there exists an exact solution (in other words, if the training data set is linearly separable), then the perceptron learning algorithm is guaranteed to find an exact solution in a finite number of steps.



Probabilistic Generative Models



Continuous inputs



Maximum likelihood solution



Probabilistic Generative Models

Illustration of the role of nonlinear basis functions in linear classification models.

The left plot shows the original input space (x1, x2) together with data points from two classes labelled red and blue.

Two ‘Gaussian’ basis functions φ1(x) and φ2(x) are defined in this space with centres shown by the green crosses and with contours shown by the green circles.
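A small sketch of how two such Gaussian basis functions might be evaluated; the centres and width below are assumptions, since the figure's actual values are not given in the transcript:

import numpy as np

# Assumed centres and width for two Gaussian basis functions in the (x1, x2) plane.
centres = np.array([[-1.0, -1.0],
                    [ 1.0,  1.0]])
s = 1.0   # common width parameter

def phi(X):
    """Map inputs X of shape (N, 2) to the feature space (phi1, phi2)."""
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=-1)   # squared distances to the centres
    return np.exp(-d2 / (2 * s ** 2))

X = np.array([[0.0, 0.0], [-1.0, -1.0], [2.0, 2.0]])
print(phi(X))   # classes that overlap in x-space can become linearly separable in phi-space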



Logistic regression



Iterative reweighted least squares

The error function can be minimized by an efficient iterative technique based on the Newton–Raphson optimization scheme, which uses a local quadratic approximation to the log likelihood function. The Newton–Raphson update takes the form w(new) = w(old) − H⁻¹∇E(w), where H is the Hessian matrix; applied to the logistic regression model this gives the iterative reweighted least squares (IRLS) algorithm.
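A compact sketch of the resulting updates for two-class logistic regression on synthetic data, using the standard gradient ΦT(y − t) and Hessian ΦTRΦ with R = diag(yn(1 − yn)); the small ridge term is an added numerical safeguard, not part of the algorithm:

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(5)
X = np.vstack([rng.normal([0, 0], 1.0, size=(50, 2)),
               rng.normal([2, 2], 1.0, size=(50, 2))])
t = np.concatenate([np.zeros(50), np.ones(50)])     # targets in {0, 1}

Phi = np.hstack([np.ones((100, 1)), X])             # design matrix with a bias feature
w = np.zeros(Phi.shape[1])

for _ in range(10):                                 # a few Newton-Raphson steps usually suffice
    y = sigmoid(Phi @ w)                            # current predicted probabilities
    grad = Phi.T @ (y - t)                          # gradient of the cross-entropy error
    R = y * (1 - y)                                 # diagonal of the weighting matrix R
    H = Phi.T @ (Phi * R[:, None]) + 1e-8 * np.eye(Phi.shape[1])   # Hessian Phi^T R Phi (tiny ridge for stability)
    w = w - np.linalg.solve(H, grad)                # Newton-Raphson / IRLS update

print(w)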


Multiclass logistic regression


Probit regression

We have seen that, for a broad range of class-conditional distributions described by the exponential family, the resulting posterior class probabilities are given by a logistic (or softmax) transformation acting on a linear function of the feature variables.

However, not all choices of class-conditional density give rise to such a simple form for the posterior probabilities.

This suggests that it might be worth exploring other types of discriminative probabilistic model.
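One such model is probit regression, in which the logistic sigmoid is replaced by the cumulative distribution function of a standard Gaussian. A brief sketch (the linear scores below are made up for illustration):

import numpy as np
from math import erf

def probit(a):
    """Standard normal CDF Phi(a), used as the activation in probit regression."""
    return 0.5 * (1.0 + erf(a / np.sqrt(2.0)))

# Hypothetical linear scores a = w^T phi(x) for a few inputs.
a = np.array([-2.0, 0.0, 1.5])
p = np.array([probit(ai) for ai in a])    # posterior class probabilities p(C1 | x)
print(p)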
