Page 1: Chapter 18

Learning from Observations

Artificial Neural Networks

These lecture notes are an updated version of the lecture slides prepared by Stuart Russell and Peter Norvig.

Page 2: Chapter 18

Learning

Learning is essential for unknown environments.

Learning modifies the agent’s decision mechanisms to improve performance.

Machine Learning is concerned with how to construct computer programs that can automatically improve with experience.

Learning Process:

• Choosing a training set.

• Choosing the target function.

• Choosing an approximation/optimization method for the target function.

• Testing the induced function (performance).


Page 3: Chapter 18

Learning

Simplest form: learn a function from examples

f is the target function

An example is a pair (x, f(x)).

Problem: find a hypothesis h such that h ≈ f, given a training set of examples

(This is a highly simplified model of real learning:

– Ignores prior knowledge

– Assumes a deterministic, observable “environment”

– Assumes examples are given

– Assumes that the agent wants to learn f—why?)
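To make the terminology concrete, here is a minimal Python sketch; the target function f, the example inputs, and the two candidate hypotheses are all invented for illustration:

```python
# A minimal sketch of learning-from-examples terminology.
# The learner never sees f directly, only the example pairs (x, f(x)).

def f(x):                       # the (normally unknown) target function
    return 2 * x + 1

examples = [(x, f(x)) for x in [0, 1, 2, 3]]   # training set of pairs

def h1(x): return 2 * x + 1     # one candidate hypothesis
def h2(x): return x + 2         # another candidate

def consistent(h):              # h agrees with f on all training examples
    return all(h(x) == y for x, y in examples)

print(consistent(h1), consistent(h2))   # True False
```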


Page 4: Chapter 18

Learning method

Construct/adjust h to agree with f on the training set (h is consistent if it agrees with f on all examples)

E.g., curve fitting:

[Figure: training examples plotted as points (x, f(x)), with a candidate hypothesis curve]


Page 9: Chapter 18

Learning method

Construct/adjust h to agree with f on the training set (h is consistent if it agrees with f on all examples)

E.g., curve fitting:

[Figure: training examples (x, f(x)) with several candidate hypothesis curves of varying complexity]

Ockham’s razor: maximize a combination of consistency and simplicity
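As a hedged illustration of the consistency/simplicity trade-off, the sketch below fits polynomials of two degrees to a few noisy points; the data, the noise level, and the choice of numpy.polyfit are all invented for this example:

```python
import numpy as np

# Invented noisy samples of an underlying linear trend.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 7)
y = 2 * x + 0.1 * rng.standard_normal(7)

# A degree-1 fit is simple but not exactly consistent; a degree-6 fit
# passes through every point (consistent) but is far less simple.
for degree in (1, 6):
    coeffs = np.polyfit(x, y, degree)
    residual = np.max(np.abs(np.polyval(coeffs, x) - y))
    print(degree, residual)   # degree 6 drives training error to ~0
```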


Page 10: Chapter 18

Attribute-based representations

Examples are described by attribute values (Boolean, discrete, continuous, etc.). E.g., situations where I will/won’t wait for a table:

Example   Alt  Bar  Fri  Hun  Pat   Price  Rain  Res  Type     Est    WillWait
X1        T    F    F    T    Some  $$$    F     T    French   0–10   T
X2        T    F    F    T    Full  $      F     F    Thai     30–60  F
X3        F    T    F    F    Some  $      F     F    Burger   0–10   T
X4        T    F    T    T    Full  $      F     F    Thai     10–30  T
X5        T    F    T    F    Full  $$$    F     T    French   >60    F
X6        F    T    F    T    Some  $$     T     T    Italian  0–10   T
X7        F    T    F    F    None  $      T     F    Burger   0–10   F
X8        F    F    F    T    Some  $$     T     T    Thai     0–10   T
X9        F    T    T    F    Full  $      T     F    Burger   >60    F
X10       T    T    T    T    Full  $$$    F     T    Italian  10–30  F
X11       F    F    F    F    None  $      F     F    Thai     0–10   F
X12       T    T    T    T    Full  $      F     F    Burger   30–60  T

(Alt through Est are the attributes; WillWait is the target.)

Classification of examples is positive (T) or negative (F)
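One natural way to hold such attribute-based examples in code is a list of records. This hedged sketch encodes the first two rows of the table as Python dicts; the field names simply follow the table headers:

```python
# Two rows of the restaurant table as attribute-value records.
examples = [
    {"Alt": True, "Bar": False, "Fri": False, "Hun": True, "Pat": "Some",
     "Price": "$$$", "Rain": False, "Res": True, "Type": "French",
     "Est": "0-10", "WillWait": True},    # X1
    {"Alt": True, "Bar": False, "Fri": False, "Hun": True, "Pat": "Full",
     "Price": "$", "Rain": False, "Res": False, "Type": "Thai",
     "Est": "30-60", "WillWait": False},  # X2
]

# Split each example into (attributes, target) for a learner.
data = [({k: v for k, v in e.items() if k != "WillWait"}, e["WillWait"])
        for e in examples]
print(data[0][1])  # True: X1 is a positive example
```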


Page 11: Chapter 18

Neural networks

Chapter 20, Section 5


Page 12: Chapter 18

Neural networks

♦ Brains

♦ Neural networks

♦ Perceptrons

♦ Multilayer perceptrons

♦ Applications of neural networks


Page 13: Chapter 18

Brains

10¹¹ neurons of >20 types, 10¹⁴ synapses, 1ms–10ms cycle time. Signals are noisy “spike trains” of electrical potential.

[Figure: schematic neuron: cell body (soma) containing the nucleus, dendrites, an axon ending in an axonal arborization, and synapses where axons from other cells connect]


Page 14: Chapter 18

McCulloch–Pitts “unit”

Output is a “squashed” linear function of the inputs:

ai ← g(ini) = g(Σj Wj,i aj)

[Figure: unit diagram: input links aj with weights Wj,i, plus a bias weight W0,i on the fixed input a0 = −1, feed the input function ini = Σj Wj,i aj; the activation function g produces the output ai = g(ini) on the output links]

A gross oversimplification of real neurons, but its purpose is to develop understanding of what networks of simple units can do


Page 15: Chapter 18

Activation functions

[Figure: two activation functions g(ini) plotted against ini, each saturating at +1: (a) a hard threshold, (b) a smooth sigmoid]

(a) is a step function or threshold function

(b) is a sigmoid function 1/(1 + e^(−x))

Changing the bias weight W0,i moves the threshold location
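A hedged sketch of a single unit with the two activation functions above. The weights and inputs are arbitrary, and the bias is handled through the fixed input a0 = −1, as in the unit diagram:

```python
import math

def step(z):                 # (a) threshold activation
    return 1.0 if z >= 0 else 0.0

def sigmoid(z):              # (b) sigmoid activation 1/(1 + e^(-z))
    return 1.0 / (1.0 + math.exp(-z))

def unit_output(weights, inputs, g):
    # in_i = sum_j W_j,i * a_j, with a_0 = -1 carrying the bias weight W_0,i
    a = [-1.0] + list(inputs)
    in_i = sum(w * x for w, x in zip(weights, a))
    return g(in_i)

w = [0.5, 1.0, 1.0]          # [W0 (bias), W1, W2]: arbitrary for illustration
print(unit_output(w, [1, 0], step))     # 1.0
print(unit_output(w, [1, 0], sigmoid))  # ~0.62
```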


Page 16: Chapter 18

Implementing logical functions

AND: W0 = 1.5, W1 = 1, W2 = 1

OR: W0 = 0.5, W1 = 1, W2 = 1

NOT: W0 = −0.5, W1 = −1

McCulloch and Pitts: every Boolean function can be implemented
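The gate weights above can be checked directly. This is a sketch only; the threshold_unit helper mirrors the unit from the previous slide, with a0 = −1 carrying the bias weight W0:

```python
def threshold_unit(weights, inputs):
    # a0 = -1 carries the bias weight W0; the unit fires iff the weighted sum >= 0
    a = [-1.0] + list(inputs)
    return 1 if sum(w * x for w, x in zip(weights, a)) >= 0 else 0

AND = [1.5, 1, 1]    # W0 = 1.5, W1 = 1, W2 = 1
OR  = [0.5, 1, 1]    # W0 = 0.5, W1 = 1, W2 = 1
NOT = [-0.5, -1]     # W0 = -0.5, W1 = -1

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, threshold_unit(AND, [x1, x2]), threshold_unit(OR, [x1, x2]))
print(threshold_unit(NOT, [0]), threshold_unit(NOT, [1]))  # 1 0
```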


Page 17: Chapter 18

Network structures

Feed-forward networks:
– single-layer perceptrons
– multi-layer perceptrons

Feed-forward networks implement functions, have no internal state

Recurrent networks:
– Hopfield networks have symmetric weights (Wi,j = Wj,i); g(x) = sign(x), ai = ±1; holographic associative memory (sketched below)
– Boltzmann machines use stochastic activation functions, ≈ MCMC in Bayes nets
– recurrent neural nets have directed cycles with delays ⇒ have internal state (like flip-flops), can oscillate, etc.
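A hedged, minimal sketch of the Hopfield case just described: symmetric weights store one invented ±1 pattern via the outer-product (Hebbian) rule, and sign-threshold updates recall it from a corrupted copy. The pattern, the corruption, and the update count are all arbitrary choices:

```python
import numpy as np

# Store one +/-1 pattern in symmetric weights W = p p^T (zero diagonal),
# then update units by a_i <- sign(sum_j W_ij a_j) until a fixed point.
pattern = np.array([1, -1, 1, 1, -1])
W = np.outer(pattern, pattern).astype(float)
np.fill_diagonal(W, 0.0)                 # W is symmetric: W_ij = W_ji

state = np.array([1, -1, -1, 1, -1])     # corrupted copy (one flipped unit)
for _ in range(5):                       # synchronous updates, for simplicity
    state = np.where(W @ state >= 0, 1, -1)
print(np.array_equal(state, pattern))    # True: the memory is recalled
```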


Page 18: Chapter 18

Feed-forward example

[Figure: network with input units 1 and 2, hidden units 3 and 4, and output unit 5, connected by weights W1,3, W1,4, W2,3, W2,4, W3,5, W4,5]

Feed-forward network = a parametrized family of nonlinear functions:

a5 = g(W3,5 · a3 + W4,5 · a4)

= g(W3,5 · g(W1,3 · a1 + W2,3 · a2) + W4,5 · g(W1,4 · a1 + W2,4 · a2))

Adjusting weights changes the function: do learning this way!
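A hedged numerical sketch of the a5 expression above, using a sigmoid for g and invented weights (biases are omitted, matching the slide's expression):

```python
import math

def g(z):                          # sigmoid activation
    return 1.0 / (1.0 + math.exp(-z))

# Arbitrary weights for the 2-input, 2-hidden, 1-output network above.
W13, W14, W23, W24, W35, W45 = 0.5, -0.3, 0.8, 0.1, 1.2, -0.7

def network(a1, a2):
    a3 = g(W13 * a1 + W23 * a2)
    a4 = g(W14 * a1 + W24 * a2)
    return g(W35 * a3 + W45 * a4)  # a5: a nonlinear function of (a1, a2)

print(network(1.0, 0.0))           # changing any weight changes this value
```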


Page 19: Chapter 18

Single-layer perceptrons

[Figure: left, a single-layer perceptron: input units connected directly to output units by weights Wj,i; right, a surface plot of the perceptron output (0 to 1) over inputs x1, x2, showing a soft threshold “cliff”]

Output units all operate separately—no shared weights

Adjusting the weights moves the location, orientation, and steepness of the cliff


Page 20: Chapter 18

Expressiveness of perceptrons

Consider a perceptron with g = step function (Rosenblatt, 1957, 1960)

Can represent AND, OR, NOT, majority, etc., but not XOR

Represents a linear separator in input space:

Σj Wj xj > 0, or W · x > 0

[Figure: points in (x1, x2) space with outputs marked for (a) x1 AND x2 and (b) x1 OR x2, each separable by a line, and (c) x1 XOR x2, for which no separating line exists]

Minsky & Papert (1969) pricked the neural network balloon
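The XOR claim can be checked by brute force. This hedged sketch scans a coarse, invented grid of weights (W0, W1, W2) for a step unit that fires iff W1·x1 + W2·x2 exceeds the threshold W0; a fit exists for AND but none for XOR:

```python
import itertools

def fits(weights, truth_table):
    # A step unit fires iff W1*x1 + W2*x2 exceeds the threshold W0.
    w0, w1, w2 = weights
    return all((w1 * x1 + w2 * x2 > w0) == out
               for (x1, x2), out in truth_table.items())

AND = {(0, 0): False, (0, 1): False, (1, 0): False, (1, 1): True}
XOR = {(0, 0): False, (0, 1): True, (1, 0): True, (1, 1): False}

grid = [i / 4 for i in range(-12, 13)]   # coarse weight grid, -3.0 to 3.0
print(any(fits(w, AND) for w in itertools.product(grid, repeat=3)))  # True
print(any(fits(w, XOR) for w in itertools.product(grid, repeat=3)))  # False
```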


Page 21: Chapter 18

Perceptron learning

Learn by adjusting weights to reduce error on training set

The squared error for an example with input x and true output y is

E = ½ Err² ≡ ½ (y − hW(x))²

Perform optimization search by gradient descent:

∂E/∂Wj = Err × ∂Err/∂Wj = Err × ∂/∂Wj (y − g(Σj=0..n Wj xj))

= −Err × g′(in) × xj

Simple weight update rule:

Wj ← Wj + α × Err × g′(in) × xj

E.g., +ve error ⇒ increase network output ⇒ increase weights on +ve inputs, decrease on −ve inputs
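A hedged sketch of this update rule for a single sigmoid unit (so g′(in) = g(in)(1 − g(in))); the OR training data, learning rate, and epoch count are invented for illustration:

```python
import math, random

def g(z):                            # sigmoid activation
    return 1.0 / (1.0 + math.exp(-z))

# Training data for OR (each x includes the fixed bias input a0 = -1).
data = [([-1, 0, 0], 0), ([-1, 0, 1], 1), ([-1, 1, 0], 1), ([-1, 1, 1], 1)]

random.seed(0)
W = [random.uniform(-0.5, 0.5) for _ in range(3)]
alpha = 0.5

for epoch in range(2000):
    for x, y in data:
        in_ = sum(w * xi for w, xi in zip(W, x))
        err = y - g(in_)                         # Err = y - hW(x)
        gprime = g(in_) * (1 - g(in_))           # g'(in) for the sigmoid
        W = [w + alpha * err * gprime * xi for w, xi in zip(W, x)]

# Outputs approach [0, 1, 1, 1] on the four training inputs.
print([round(g(sum(w * xi for w, xi in zip(W, x))), 2) for x, _ in data])
```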


Page 22: Chapter 18

Perceptron learning contd.

The perceptron learning rule converges to a consistent function for any linearly separable data set

[Figure: two learning curves, proportion correct on test set vs. training set size (0–100), comparing perceptron and decision tree. Left: MAJORITY on 11 inputs. Right: RESTAURANT data.]

Perceptron learns majority function easily, DTL is hopeless

DTL learns restaurant function easily, perceptron cannot represent it


Page 23: Chapter 18

Multilayer perceptrons

Layers are usually fully connected; the number of hidden units is typically chosen by hand

[Figure: layered feed-forward network: input units ak connect to hidden units aj via weights Wk,j, which connect to output units ai via weights Wj,i]


Page 24: Chapter 18

Expressiveness of MLPs

All continuous functions w/ 2 layers, all functions w/ 3 layers

[Figure: two surface plots of hW(x1, x2) over the input plane: left, a ridge; right, a bump]

Combine two opposite-facing threshold functions to make a ridge

Combine two perpendicular ridges to make a bump

Add bumps of various sizes and locations to fit any surface

Proof requires exponentially many hidden units (cf DTL proof)
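A hedged numeric sketch of the ridge-and-bump construction: two opposite-facing sigmoids sum to a ridge, and two perpendicular ridges, pushed through one more soft threshold, give a localized bump. All slopes, offsets, and test points are invented:

```python
import math

def sig(z):
    return 1.0 / (1.0 + math.exp(-z))

def ridge(t):
    # Two opposite-facing soft thresholds make a ridge centered on [-1, 1].
    return sig(5 * (t + 1)) + sig(-5 * (t - 1)) - 1

def bump(x1, x2):
    # Two perpendicular ridges, thresholded once more, are high only
    # near the origin.
    return sig(10 * (ridge(x1) + ridge(x2) - 1.5))

for p in [(0, 0), (0, 3), (3, 0), (3, 3)]:
    print(p, round(bump(*p), 3))   # ~1 at the origin, ~0 elsewhere
```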


Page 25: Chapter 18

Back-propagation learning

Output layer: same as for single-layer perceptron,

Wj,i ← Wj,i + α × aj × ∆i

where ∆i = Erri × g′(ini)

Hidden layer: back-propagate the error from the output layer:

∆j = g′(inj) Σi Wj,i ∆i

Update rule for weights in hidden layer:

Wk,j ← Wk,j + α × ak × ∆j

(Most neuroscientists deny that back-propagation occurs in the brain)
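A hedged sketch of one back-propagation step for a tiny 2–2–1 sigmoid network, following the ∆i and ∆j rules above directly; the weights, input, target, and learning rate are invented, and bias terms are omitted to match the slide:

```python
import math

def g(z):
    return 1.0 / (1.0 + math.exp(-z))

def gprime(z):                      # g'(z) = g(z)(1 - g(z)) for the sigmoid
    s = g(z)
    return s * (1 - s)

# Invented weights: Wkj[k][j] input->hidden, Wji[j] hidden->output.
Wkj = [[0.2, -0.4], [0.7, 0.1]]
Wji = [0.5, -0.3]
a_k, y, alpha = [1.0, 0.5], 1.0, 0.1

# Forward pass.
in_j = [sum(Wkj[k][j] * a_k[k] for k in range(2)) for j in range(2)]
a_j = [g(z) for z in in_j]
in_i = sum(Wji[j] * a_j[j] for j in range(2))
a_i = g(in_i)

# Backward pass: Delta_i at the output, then Delta_j back-propagated.
delta_i = (y - a_i) * gprime(in_i)
delta_j = [gprime(in_j[j]) * Wji[j] * delta_i for j in range(2)]

# Updates: W_j,i += alpha*a_j*Delta_i ; W_k,j += alpha*a_k*Delta_j.
Wji = [Wji[j] + alpha * a_j[j] * delta_i for j in range(2)]
Wkj = [[Wkj[k][j] + alpha * a_k[k] * delta_j[j] for j in range(2)]
       for k in range(2)]
print(round(a_i, 3), [round(w, 3) for w in Wji])
```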


Page 26: Chapter 18

Back-propagation derivation

The squared error on a single example is defined as

E = ½ Σi (yi − ai)²,

where the sum is over the nodes in the output layer.

∂E/∂Wj,i = −(yi − ai) ∂ai/∂Wj,i = −(yi − ai) ∂g(ini)/∂Wj,i

= −(yi − ai) g′(ini) ∂ini/∂Wj,i = −(yi − ai) g′(ini) ∂/∂Wj,i (Σj Wj,i aj)

= −(yi − ai) g′(ini) aj = −aj ∆i


Page 27: Chapter 18

Back-propagation derivation contd.

∂E/∂Wk,j = −Σi (yi − ai) ∂ai/∂Wk,j = −Σi (yi − ai) ∂g(ini)/∂Wk,j

= −Σi (yi − ai) g′(ini) ∂ini/∂Wk,j = −Σi ∆i ∂/∂Wk,j (Σj Wj,i aj)

= −Σi ∆i Wj,i ∂aj/∂Wk,j = −Σi ∆i Wj,i ∂g(inj)/∂Wk,j

= −Σi ∆i Wj,i g′(inj) ∂inj/∂Wk,j = −Σi ∆i Wj,i g′(inj) ∂/∂Wk,j (Σk Wk,j ak)

= −Σi ∆i Wj,i g′(inj) ak = −ak ∆j


Page 28: Chapter 18

Back-propagation learning contd.

At each epoch, sum gradient updates for all examples and apply
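A hedged sketch of this epoch-level (batch) scheme, contrasting with the per-example updates sketched earlier: gradient contributions from all examples are summed before one weight change is applied. Shown for a single sigmoid unit with invented data and learning rate:

```python
import math

def g(z):
    return 1.0 / (1.0 + math.exp(-z))

# Invented 2-input data (each x includes the fixed bias input a0 = -1).
data = [([-1, 0, 0], 0), ([-1, 0, 1], 1), ([-1, 1, 0], 1), ([-1, 1, 1], 1)]
W, alpha = [0.0, 0.0, 0.0], 0.5

for epoch in range(1000):
    grad = [0.0, 0.0, 0.0]
    for x, y in data:                 # sum gradient updates over all examples
        in_ = sum(w * xi for w, xi in zip(W, x))
        err = y - g(in_)
        for j, xi in enumerate(x):
            grad[j] += err * g(in_) * (1 - g(in_)) * xi
    W = [w + alpha * d for w, d in zip(W, grad)]  # ...then apply once

print([round(g(sum(w * xi for w, xi in zip(W, x)))) for x, _ in data])  # [0, 1, 1, 1]
```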

Training curve for 100 restaurant examples: finds exact fit

[Figure: total error on training set vs. number of epochs (0–400); the error falls steadily toward zero]

Typical problems: slow convergence, local minima


Page 29: Chapter 18

Back-propagation learning contd.

Learning curve for MLP with 4 hidden units:

[Figure: proportion correct on test set vs. training set size (0–100) on the RESTAURANT data, comparing a decision tree and a multilayer network]

MLPs are quite good for complex pattern recognition tasks, but the resulting hypotheses cannot be understood easily


Page 30: Chapter 18

Handwritten digit recognition

3-nearest-neighbor = 2.4% error

400–300–10 unit MLP = 1.6% error

LeNet: 768–192–30–10 unit MLP = 0.9% error

Current best (kernel machines, vision algorithms) ≈ 0.6% error


Page 31: Chapter 18

Summary

Most brains have lots of neurons; each neuron ≈ linear–threshold unit (?)

Perceptrons (one-layer networks) insufficiently expressive

Multi-layer networks are sufficiently expressive; they can be trained by gradient descent, i.e., error back-propagation

Many applications: speech, driving, handwriting, fraud detection, etc.

The engineering, cognitive modelling, and neural system modelling subfields have largely diverged
