Linear Classification with Perceptrons
Page 1:

Linear Classification with Perceptrons

Page 2:

Framework

• Assume our data consists of instances x = (x1, x2, ..., xn)

• Assume data can be separated into two classes, positive and negative, by a linear decision surface.

• Learning: Assuming data is n-dimensional, learn (n−1)-dimensional hyperplane to classify the data into classes.

Page 3:

Linear Discriminant

[Figure: two classes of points in the (Feature 1, Feature 2) plane, separated by a linear discriminant.]

Page 4:

Linear Discriminant

[Figure: a linear discriminant separating the two classes, axes Feature 1 and Feature 2.]

Page 5:

Linear Discriminant

[Figure: another candidate linear discriminant for the same data, axes Feature 1 and Feature 2.]

Page 6:

Example where a line won’t work?

[Figure: data in the (Feature 1, Feature 2) plane that no single line can separate.]

Page 7:

Perceptrons

• Discriminant function:

  f(x) = w0 + w1x1 + w2x2 + ... + wnxn = w0 + w · x

  w0 is called the “bias”.

  −w0 is called the “threshold”.

• Classification: output +1 if f(x) > 0, and −1 otherwise (equivalently, +1 exactly when w · x exceeds the threshold −w0).
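As a concrete illustration, here is a minimal Python sketch of the discriminant function and classification rule above (the function names and example weights are ours, not the slides'):

# Minimal sketch of the perceptron discriminant and classification rule.
def discriminant(w0, w, x):
    """f(x) = w0 + w . x for weights w and input x."""
    return w0 + sum(wi * xi for wi, xi in zip(w, x))

def classify(w0, w, x):
    """Return +1 if f(x) > 0, else -1."""
    return 1 if discriminant(w0, w, x) > 0 else -1

# Example with bias w0 = -0.5 and weights (1, 1):
print(classify(-0.5, [1.0, 1.0], [0.0, 0.0]))  # -1, since f = -0.5
print(classify(-0.5, [1.0, 1.0], [1.0, 1.0]))  # +1, since f =  1.5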

Page 8:

Perceptrons as simple neural networks

[Figure: a perceptron drawn as a one-layer neural network: inputs x1, x2, ..., xn with weights w1, w2, ..., wn, plus a constant input +1 with weight w0, all feeding a single output unit.]

Page 9:

Example

• What is the class y?

[Figure: a perceptron with weights .4, −.4, and −.1 applied to inputs +1, 1, and −1.]

Page 10:

Geometry of the perceptron

[Figure: the separating hyperplane in the (Feature 1, Feature 2) plane.]

In 2d, the hyperplane is the set of points where

  w1x1 + w2x2 + w0 = 0

which can be rewritten as

  x2 = −(w1/w2) x1 − w0/w2

i.e., a line with slope −w1/w2 and intercept −w0/w2.

Page 11:

In-class exercise

Work with one neighbor on this:

(a) Find weights for a perceptron that separates “true” and “false” in x1 ∧ x2. Find the slope and intercept, and sketch the separation line defined by this discriminant.

(b) What (if anything) might make one separation line better than another?

Page 12:

• To simplify notation, assume a “dummy” coordinate (or attribute) x0 = 1. Then we can write the discriminant as a single dot product:

  w · x = w0x0 + w1x1 + ... + wnxn = Σj=0..n wj xj
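A small sketch of the dummy-coordinate trick, assuming NumPy (the variable names are ours):

# Prepend x0 = 1 so the bias w0 is absorbed into a single dot product.
import numpy as np

w = np.array([0.1, 0.1, -0.3])       # (w0, w1, w2)
x = np.array([0.0, 1.0])             # original input (x1, x2)

x_aug = np.concatenate(([1.0], x))   # dummy coordinate x0 = 1 in front
activation = w @ x_aug               # w . x, bias included automatically
print(activation)                    # 0.1*1 + 0.1*0 + (-0.3)*1 = -0.2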

Page 13:

Notation

• Let S = {(xk, tk): k = 1, 2, ..., m} be a training set.

  xk is a vector of inputs: xk = (xk,1, xk,2, ..., xk,n)

  tk ∈ {+1, −1} for binary classification; tk ∈ ℝ for regression.

• Output o:

  o = sgn(w · x) = sgn(Σj=0..n wj xj), using the dummy coordinate x0 = 1

• Error of a perceptron on the kth training example, (xk, tk):

  Ek = ½ (tk − ok)²

Page 14:

Example

• Training set:

  x1 = (0,0), t1 = −1

  x2 = (0,1), t2 = 1

• Let w = (w0, w1, w2) = (0.1, 0.1, −0.3)

[Figure: the corresponding perceptron: bias input +1 with weight 0.1, input x1 with weight 0.1, input x2 with weight −0.3, output o.]

What is E1?

What is E2?
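A quick way to check the answers: a hedged sketch assuming the thresholded output ok = sgn(w · xk) and the error Ek = ½(tk − ok)² defined on the previous slide. (With the unthresholded linear output instead, the values would be E1 = 0.605 and E2 = 0.72.)

# Compute E1 and E2 for the example above, assuming o = sgn(w . x).
import numpy as np

w = np.array([0.1, 0.1, -0.3])                 # (w0, w1, w2)
examples = [(np.array([0.0, 0.0]), -1.0),      # (x1, t1)
            (np.array([0.0, 1.0]),  1.0)]      # (x2, t2)

for k, (x, t) in enumerate(examples, start=1):
    x_aug = np.concatenate(([1.0], x))         # dummy coordinate x0 = 1
    o = 1.0 if w @ x_aug > 0 else -1.0         # thresholded output
    print(f"E{k} = {0.5 * (t - o) ** 2}")      # E1 = 2.0, E2 = 2.0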

Page 15:

Coursepack is now (really!) on reserve at library

Reading for next week:

Coursepack: Learning From Examples,

Sections 7.3–7.5 (pp. 39–45).

T. Fawcett, “An introduction to ROC analysis”,

Sections 1-4, 7

(linked from the course web page)

Page 16:

Clarification on HW 1, Q. 2

[Table: features f1, f2, f3, f4, f5, f6, f7, f8.]

Page 17:

Perceptrons (Recap)

[Figure: perceptron diagram: input x = (x1, x2, ..., xn) with weights w1, w2, ..., wn, constant input 1 with weight w0, producing output o; each input xi is multiplied by its weight wi.]

Page 18:

Perceptrons (Recap)

[Figure: the same perceptron diagram as Page 17.]

Page 19:

Perceptrons (Recap)

[Figure: the same perceptron diagram.]

w0 is called the “bias”.

−w0 is called the “threshold”.

Page 20:

Perceptrons (Recap)

[Figure: the same perceptron diagram.]

w0 is called the “bias”.

−w0 is called the “threshold”.

If w1x1 + w2x2 + ... + wnxn > −w0 (equivalently, if w · x > 0 with the dummy coordinate), then o = 1.

Page 21:

Notation

Target value tk ∈ {+1, −1} for binary classification.

Error of a perceptron on the kth training example, (xk, tk):

  Ek = ½ (tk − ok)²

Page 22:

How do we train a perceptron?

Gradient descent in weight space.

[Figure: the error E plotted as a surface over weight space, with gradient descent stepping downhill. From T. M. Mitchell, Machine Learning.]

Page 23:

Perceptron learning algorithm

• Start with random weights w = (w1, w2, ... , wn).

• Do gradient descent in weight space, in order to minimize error E:

– Given error E, want to modify weights w so as to take a step in direction of steepest descent.

Page 24:

Gradient descent

• We want to find w so as to minimize the sum-squared error (or loss):

  E(w) = ½ Σk=1..m (tk − ok)²

• To minimize, take the derivative of E(w) with respect to w.

• A vector derivative is called a “gradient”:

  ∇E(w) = [ ∂E/∂w0, ∂E/∂w1, ..., ∂E/∂wn ]

Page 25:

• Here is how we change each weight:

  wj ← wj + Δwj

  where

  Δwj = −η ∂E/∂wj

  and η is the learning rate.
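To make the update rule concrete, here is a tiny hedged sketch: gradient descent on a one-dimensional toy error function of our own choosing (not from the slides):

# Gradient descent on the toy error E(w) = (w - 3)^2, with dE/dw = 2(w - 3).
eta = 0.1        # learning rate
w = 0.0          # arbitrary starting weight
for step in range(50):
    dE_dw = 2 * (w - 3)        # derivative of E at the current weight
    w = w + (-eta * dE_dw)     # w <- w + delta_w, with delta_w = -eta * dE/dw
print(w)  # approximately 3, the minimizer of E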

Page 26:

• The error function E has to be differentiable, so the output function o also has to be differentiable.

Page 27:

Activation functions

[Figure: two plots of output vs. activation: a step function jumping from −1 to +1 at 0, and a linear function through the origin.]

  o = sgn(Σj wj xj + w0)   (step function: not differentiable)

  o = Σj wj xj + w0   (linear function: differentiable)

Page 28:

Activation functions

[Figure: the same two plots as Page 27.]

Approximate this:  o = sgn(Σj wj xj + w0)   (not differentiable)

With this:  o = Σj wj xj + w0   (differentiable)
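With the linear output, the gradient of the sum-squared error has a closed form. A short derivation (a standard calculation consistent with the definitions above; the slides' own derivation was shown as images):

\frac{\partial E}{\partial w_j}
  = \frac{\partial}{\partial w_j}\,\frac{1}{2}\sum_{k=1}^{m}(t_k - o_k)^2
  = \sum_{k=1}^{m}(t_k - o_k)\,\frac{\partial}{\partial w_j}(t_k - o_k)
  = -\sum_{k=1}^{m}(t_k - o_k)\,x_{k,j},
\quad \text{since } o_k = \sum_{j} w_j x_{k,j}.

\Delta w_j = -\eta\,\frac{\partial E}{\partial w_j}
  = \eta \sum_{k=1}^{m}(t_k - o_k)\,x_{k,j}.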

Page 29:

Putting the pieces together, the weight update is Δwj = η Σk=1..m (tk − ok) xk,j. This is called the perceptron learning rule, with “true gradient descent”.

Page 30:

Training a perceptron (true gradient descent)

Assume there are m training examples and let η be the learning rate (a user-set parameter).

1. Create a perceptron with small random weights, w = (w1, w2, ..., wn).

2. For k = 1 to m:

   Run the perceptron with input xk and weights w to obtain ok.

3. Update each weight: wj ← wj + η Σk=1..m (tk − ok) xk,j

4. Go to 2.
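A runnable sketch of this batch procedure: our own Python/NumPy translation of the steps above (not the slides' code), using the linear output during training and the sgn threshold only for classification:

# Hedged sketch of true (batch) gradient descent for a perceptron.
import numpy as np

def train_batch(X, t, eta=0.1, epochs=100, seed=0):
    rng = np.random.default_rng(seed)
    X = np.hstack([np.ones((len(X), 1)), np.asarray(X, dtype=float)])  # x0 = 1
    t = np.asarray(t, dtype=float)
    w = rng.normal(scale=0.1, size=X.shape[1])   # small random weights
    for _ in range(epochs):
        o = X @ w                                # linear outputs, all k at once
        w = w + eta * X.T @ (t - o)              # batch update of every weight
    return w

def classify(w, x):
    x_aug = np.concatenate(([1.0], np.asarray(x, dtype=float)))
    return 1 if w @ x_aug > 0 else -1

# Example: the training set used in the in-class exercises.
X = [(0, 0), (0, 1), (1, 1)]
t = [-1, 1, 1]
w = train_batch(X, t)
print([classify(w, x) for x in X])  # [-1, 1, 1]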

Page 31:

• Problems with true gradient descent:

  – The training process is slow.

  – The training process can land in a local optimum.

• Common approach to this: use stochastic gradient descent:

  – Instead of doing the weight update after all training examples have been processed, do the weight update after each training example has been processed (i.e., each time a perceptron output has been calculated).

  – Stochastic gradient descent approximates true gradient descent increasingly well as η → 0.

Page 32:

Training a perceptron (stochastic gradient descent)

1. Start with random weights, w = (w1, w2, ..., wn).

2. Select a training example (xk, tk).

3. Run the perceptron with input xk and weights w to obtain ok.

4. Update each weight: wj ← wj + η (tk − ok) xk,j

5. Go to 2.
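The only change from the batch sketch on Page 30 is that the update fires after every example. A hedged sketch, again our own translation:

# Stochastic gradient descent for a perceptron: per-example weight updates.
import numpy as np

def train_sgd(X, t, eta=0.2, epochs=50, seed=0):
    rng = np.random.default_rng(seed)
    X = np.hstack([np.ones((len(X), 1)), np.asarray(X, dtype=float)])  # x0 = 1
    w = rng.normal(scale=0.1, size=X.shape[1])
    for _ in range(epochs):
        for xk, tk in zip(X, t):
            ok = w @ xk                    # linear output for this example
            w = w + eta * (tk - ok) * xk   # update before the next example
    return w

Running the inner loop by hand for one epoch, with η = 0.2 and the starting weights on the next slide, is exactly the in-class exercise that follows.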

Page 33:

In-class exercise

Training set:

((0,0), −1)

((0,1), 1)

((1,1), 1)

Let w = (w0, w1, w2) = (0.1, 0.1, −0.3)

[Figure: the corresponding perceptron: bias input +1 with weight 0.1, input x1 with weight 0.1, input x2 with weight −0.3, output o.]

Perceptron learning rule: wj ← wj + η (tk − ok) xk,j

1. Calculate the new perceptron weights after each training example is processed. Let η = 0.2.

2. What is the accuracy on the training data after one epoch of training? Did the accuracy improve?

Page 34:

• 1960s: Rosenblatt proved that the perceptron learning rule converges to correct weights in a finite number of steps, provided the training examples are linearly separable.

• 1969: Minsky and Papert proved that perceptrons cannot represent non-linearly separable target functions.

• However, they proved that any transformation can be carried out by adding a fully connected hidden layer.

Page 35:

Multi-layer perceptron example

[Figure: decision regions of a multilayer feedforward network. The network was trained to recognize 1 of 10 vowel sounds occurring in the context “h_d” (e.g., “had”, “hid”). The network input consists of two parameters, F1 and F2, obtained from a spectral analysis of the sound. The 10 network outputs correspond to the 10 possible vowel sounds.]

(From T. M. Mitchell, Machine Learning)

Page 36:

• Good news: Adding hidden layer allows more target functions to be represented.

• Bad news: No algorithm for learning in multi-layered networks, and no convergence theorem!

• Quote from Minsky and Papert’s book, Perceptrons (1969):

“[The perceptron] has many features to attract attention: its linearity; its intriguing learning theorem; its clear paradigmatic simplicity as a kind of parallel computation. There is no reason to suppose that any of these virtues carry over to the many-layered version. Nevertheless, we consider it to be an important research problem to elucidate (or reject) our intuitive judgment that the extension is sterile.”

Page 37:

• Two major problems they saw were:

1. How can the learning algorithm apportion credit (or blame) for incorrect classifications to individual weights, when the classification depends on a (sometimes) large number of weights?

2. How can such a network learn useful higher-order features?

• Good news: Successful credit-apportionment learning algorithms were developed soon afterwards (e.g., back-propagation). They are still successful, in spite of the lack of a convergence theorem.