START OF DAY 4 Reading: Chap. 3 & 4
Transcript
Page 1:

START OF DAY 4
Reading: Chap. 3 & 4

Page 2:

Project

Page 3:

Topics & Teams

• Select topics/domains
• Select teams
• Deliverables
  – Description of the problem
  – Selection of objective(s)
  – Description of the methods used
    • Data preparation
    • Learning algorithms used
  – Description of the results
• Project presentations

Page 4:

Perceptron

Page 5:

Neural Networks

• Sub-symbolic approach:
  – Does not use symbols to denote objects
  – Views intelligence/learning as arising from the collective behavior of a large number of simple, interacting components
• Motivated by biological plausibility

Page 6:

Natural Neuron

Page 7:

Artificial Neuron

Captures the essence of the natural neuron

• (Dendrites) Input values Xi from the environment or other neurons

• (Synapses) Real-valued weights wi associated with each input

• (Soma’s chemical reaction) Function F({Xi},{wi}) computing activation as a function of input values and weights

• (Axon) Activation value that may serve as input to other neurons
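To make the mapping concrete, here is a minimal sketch of such a unit in Python (illustrative only; the step activation and the particular numbers are arbitrary choices, not from the slides):

    # A single artificial neuron: weighted sum of inputs followed by an activation.
    def neuron_output(xs, ws, threshold=0.0):
        # (Dendrites/synapses) combine the input values with their weights
        net = sum(x * w for x, w in zip(xs, ws))
        # (Soma) activation function F({Xi},{wi}); here a simple step function
        return 1 if net > threshold else 0

    # (Axon) the returned activation could in turn serve as input to other neurons
    print(neuron_output([0.8, 0.3], [0.4, -0.2], threshold=0.1))   # -> 1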

Page 8:

Feedforward Neural Networks

• Sets of (highly) interconnected artificial neurons (i.e., simple computational units)
  – Layered organization
• Characteristics
  – Massive parallelism
  – Distributed knowledge representation (i.e., implicit in patterns of interactions)
  – Graceful degradation (e.g., grandmother cell)
  – Less susceptible to brittleness
  – Noise tolerant
  – Opaque (i.e., black box)

There exist other types of NNs

Page 9:

FFNN Topology

• Pattern of interconnections among neurons: primary source of inductive bias

• Characteristics
  – Number of layers
  – Number of neurons per layer
  – Interconnectivity (fully connected, mesh, etc.)

Page 10:

Perceptrons (1958)

• Simplest class of neural networks
• Single-layer, i.e., only one set of connection weights between inputs and outputs
• Boolean activation (aka step function)

[Diagram: single perceptron with inputs x1, x2, …, xn, weights w1, w2, …, wn, and output z]

Page 11:

Learning for Perceptrons

• Algorithm devised by Rosenblatt in 1958
• Given an example (i.e., labeled input pattern):
  – Compute output
  – Check output against target
  – Adapt weights

Page 12:

Example (I)

[Diagram: two-input perceptron with weights .4 and -.2, threshold .1, and output z]

Training set:
x1  x2  t
.8  .3  1
.4  .1  0

First example: net = .8*.4 + .3*(-.2) = .26, so z = 1

Output matches target

Page 13:

Example (II)

[Diagram: the same perceptron; weights .4 and -.2, threshold .1, output z]

Training set:
x1  x2  t
.8  .3  1
.4  .1  0

Second example: net = .4*.4 + .1*(-.2) = .14, so z = 1

Output does not match target

Page 14:

Learn-Perceptron

• When should weights be changed?
  – Output does not match target: (ti-zi)
• How should weights be changed?
  – By some fixed amount (learning rate): c(ti-zi)
  – Proportional to input value: c(ti-zi)xi
• Algorithm:
  – Initialize weights (typically random)
  – For each new training example
    • Compute network output
    • Change weights: Δwi = c(ti – zi)xi
  – Repeat until no change in weights
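A compact sketch of Learn-Perceptron in Python (illustrative; the AND training set and the zero initialization are arbitrary choices, and the threshold is handled as a bias weight with a constant 1 input, as discussed on the next slide):

    # Learn-Perceptron: repeat Δwi = c(ti - zi)xi until no weight changes.
    def learn_perceptron(examples, c=1.0, max_epochs=100):
        n = len(examples[0][0])              # inputs per example (bias input included)
        w = [0.0] * n                        # initial weights (random is also common)
        for _ in range(max_epochs):
            changed = False
            for x, t in examples:
                net = sum(wi * xi for wi, xi in zip(w, x))
                z = 1 if net > 0 else 0      # Boolean (step) activation
                if z != t:
                    w = [wi + c * (t - z) * xi for wi, xi in zip(w, x)]
                    changed = True
            if not changed:                  # converged: a full pass with no change
                return w
        return w                             # may never converge if not linearly separable

    # Logical AND, with a constant 1 appended to each pattern as the bias input.
    data = [([0, 0, 1], 0), ([0, 1, 1], 0), ([1, 0, 1], 0), ([1, 1, 1], 1)]
    print(learn_perceptron(data))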

Page 15:

What About Θ?

1 0 1 -> 0
1 0 0 -> 1

Augmented Version (constant 1 input added):
1 0 1 1 -> 0
1 0 0 1 -> 1

• Treat the threshold like any other weight. Call it a bias since it biases the output up or down
• Since we start with random weights anyway, ignore the –Θ notation; just think of the bias as an extra available weight
• Always use a bias weight

Page 16:

Example

• Assume 3-input perceptron (plus bias, outputs 1 if net > 0, else 0)
• Assume c=1 and initial weights all 0
• Training set:
  0 0 1 -> 0
  1 1 1 -> 1
  1 0 1 -> 1
  0 1 1 -> 0

Pattern   Target   Weights    Net   Output   ΔW
0 0 1 1   0        0 0 0 0    0     0        0 0 0 0
1 1 1 1   1        0 0 0 0    0     0        1 1 1 1
1 0 1 1   1        1 1 1 1    3     1        0 0 0 0
0 1 1 1   0        1 1 1 1    3     1        0 -1 -1 -1
0 0 1 1   0        1 0 0 0    0     0        0 0 0 0
1 1 1 1   1        1 0 0 0    1     1        0 0 0 0
1 0 1 1   1        1 0 0 0    1     1        0 0 0 0
0 1 1 1   0        1 0 0 0    0     0        0 0 0 0

Δwi = c(ti – zi)xi

Page 17:

Another Example

• Assume 2-input perceptron (plus bias, outputs 1 if net > 0, else 0)
• Assume c=1 and initial weights all 0
• Training set:
  0 0 -> 0
  1 0 -> 1
  1 1 -> 0
  0 1 -> 1

Pattern   Target   Weights    Net   Output   ΔW
0 0 1     0        0 0 0      0     0        0 0 0
1 0 1     1        0 0 0      0     0        1 0 1
1 1 1     0        1 0 1      2     1        -1 -1 -1
0 1 1     1        0 -1 0     -1    0        0 1 1
0 0 1     0        0 0 1      1     1        0 0 -1
1 0 1     1        0 0 0      0     0        1 0 1
1 1 1     0        1 0 1      2     1        -1 -1 -1
0 1 1     1        0 -1 0     -1    0        0 1 1

Δwi = c(ti – zi)xi

What is happening? Why?

Page 18:

Decision Surface

• Assume a 2-input perceptron
  – z=1 if w1x1+w2x2 ≥ Θ
  – z=0 if w1x1+w2x2 < Θ
• Decision boundary: w1x1+w2x2 = Θ
  – A line with slope –w1/w2 and intercept Θ/w2
  – With no bias, the line goes through the origin
• In general: a hyperplane (i.e., a linear surface)
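For instance (an illustrative case, not from the slides): with w1 = w2 = 1 and Θ = 1.5, the boundary is the line x1 + x2 = 1.5 (slope –1, intercept 1.5); points above it, such as (1,1), give z = 1 and points below it, such as (0,1), give z = 0, so the perceptron computes logical AND.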

Page 19:

Linear Separability

Generalization: noise vs. exception

Limited functionality?

Page 20:

The Plague of Linear Separability

• The good news is:
  – Learn-Perceptron is guaranteed to converge to a correct assignment of weights if such an assignment exists
• The bad news is:
  – Such an assignment exists only for linearly separable tasks
• The really bad news is:
  – There is a large number of non-linearly separable tasks
    • Let d be the number of inputs
  – Too many tasks escape the algorithm
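To give a sense of scale (standard counts, not part of the transcript): with d Boolean inputs there are 2^(2^d) possible functions, but only a small fraction are linearly separable – for d = 2, 14 of the 16 functions are separable (XOR and XNOR are not), and for d = 4 only 1,882 of the 65,536 functions are.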

Page 21:

Are We Stuck?

• So far we have used: net = Σi wi xi
• What if we preprocessed the inputs in a non-linear way and did: net = Σi wi fi(x), where the fi are (possibly non-linear) functions of the original inputs?
• To the perceptron algorithm it would look just the same, except with different inputs
• For example, for a problem with two inputs x and y (plus the bias), we could also add the inputs x², y², and x·y
• The perceptron would just think it is a 5-dimensional task, and it is linear in those 5 dimensions
  – But what kind of decision surfaces would it allow for the 2-d input space?
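A minimal sketch of this preprocessing idea in Python (illustrative; the feature set follows the slide's x², y², x·y suggestion):

    # Expand a 2-d input (x, y) into quadratic features (plus the bias input).
    # The perceptron then learns a linear separator in the expanded space,
    # which corresponds to a quadratic decision surface in the original 2-d space.
    def quadric_features(x, y):
        return [x, y, x * x, y * y, x * y, 1.0]

    # The perceptron update itself is unchanged; only the inputs differ.
    def perceptron_step(w, feats, t, c=1.0):
        z = 1 if sum(wi * fi for wi, fi in zip(w, feats)) > 0 else 0
        return [wi + c * (t - z) * fi for wi, fi in zip(w, feats)]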

Page 22:

Quadric Machine Example

• All quadratic surfaces (2nd order)
  – ellipsoid
  – parabola
  – etc.
• For example:

[Figure: a 1-d data set plotted along feature f1 (axis from -3 to 3), and the same data re-plotted in the (f1, f2) plane]

• A perceptron with just feature f1 cannot separate the data
• Assume we add another feature to our perceptron: f2 = f1²

Page 23:

Quadric Machine

• All quadratic surfaces (2nd order)
  – ellipsoid
  – parabola
  – etc.

• That significantly increases the number of problems that can be solved, but there are still many problems that are not quadrically separable

• Could go to 3rd and higher order features, but number of possible features grows exponentially

• Multi-layer neural networks will allow us to discover high-order features automatically from the input space

Page 24:

Backpropagation

Page 25:

Towards a Solution

• Main problem:
  – Learn-Perceptron implements a discrete model of error (i.e., identifies the existence of error and adapts to it, but not the magnitude of the error – since step function)
• First thing to do:
  – Allow nodes to have real-valued activations (amount of error = difference between computed and target output)
• Second thing to do:
  – Design a learning rule that adjusts weights based on error
• Last thing to do:
  – Use the learning rule to implement a multi-layer algorithm

Page 26:

Real-valued Activation

• Replace the threshold unit (step function) with a linear unit whose output is simply the net input:

  o = Σi wi xi

For instance d:  od = Σi wi xid

Page 27:

Defining Error

• We define the training error of a hypothesis, or weight vector, by:

  E(w) = ½ Σd (td – od)²   (summing over all training instances d)

• Goal: minimize E
  – Find the direction of steepest ascent of E (aka the gradient)
  – Move in the opposite direction (i.e., to decrease E)

Page 28:

Minimizing the Error
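The derivation on this slide is not reproduced in the transcript; the standard steps for the error E defined above are:

    ∂E/∂wi = ∂/∂wi [ ½ Σd (td – od)² ]
           = Σd (td – od) · ∂/∂wi (td – od)
           = –Σd (td – od) xid              since od = Σi wi xid

so moving against the gradient gives the update  Δwi = –c ∂E/∂wi = c Σd (td – od) xid.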

Page 29:

The Delta Rule

• Gradient descent on the error surface: Δwi = –c ∂E/∂wi

• Initialize weights to small random values
• Repeat until no progress
  – Initialize each Δwi to 0
  – For each training example <x,t>
    • Compute output o for x
    • For each weight wi
      – Δwi ← Δwi + c(t – o)xi
  – For each weight wi
    • wi ← wi + Δwi

Stochastic version: update each weight immediately after each example
• For each weight wi
  – wi ← wi + c(t – o)xi

Better?

Note the change in sign (compared to the gradient), since we minimize E
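A sketch of the stochastic version in Python (illustrative; compared with Learn-Perceptron, the only change is that the raw linear output o is used in place of the thresholded output z):

    # Stochastic delta rule for a linear unit: wi <- wi + c(t - o)xi after each example.
    def delta_rule(examples, c=0.1, epochs=50):
        n = len(examples[0][0])
        w = [0.0] * n
        for _ in range(epochs):
            for x, t in examples:
                o = sum(wi * xi for wi, xi in zip(w, x))   # linear (unthresholded) output
                w = [wi + c * (t - o) * xi for wi, xi in zip(w, x)]
        return w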

Page 30:

Discussion

• Gradient-descent learning (with linear units) requires more than one pass through the training set

• The good news is:
  – Convergence is guaranteed if the problem is solvable
• The bad news is:
  – Still produces only linear functions
  – Even when used in a multi-layer context (a composition of linear functions is a linear function)

• Needs to be further generalized!

Page 31:

Non-linear Activation

• Introduce non-linearity with a sigmoid function: o = σ(net) = 1 / (1 + e^(–net))

1. Differentiable (required for gradient descent)
2. Most unstable (i.e., steepest) in the middle

Page 32:

Derivative of the Sigmoid

σ'(net) = σ(net)(1 – σ(net)) = o(1 – o)

You need only compute the sigmoid; its derivative comes for free!
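In code (illustrative), this means the slope can be obtained from the already-computed activation rather than by differentiating again:

    import math

    def sigmoid(net):
        return 1.0 / (1.0 + math.exp(-net))

    o = sigmoid(0.26)
    slope = o * (1.0 - o)   # derivative of the sigmoid at that point, reusing o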

Page 33:

Multi-layer Feed-forward NN

[Diagram: multi-layer feed-forward network; input units indexed by i, hidden units by j, output units by k]

Page 34:

Backpropagation Learning

• Repeat
  – Present a training instance d
  – Compute the error δk of each output unit
  – For each hidden layer
    • Compute the error δj using the errors from the next layer
  – Update all weights: wpq ← wpq + Δwpq, where Δwpq = c δq zp (zp is the output of unit p)
• Until stopping criterion

Note that BP is stochastic
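A compact sketch of these steps for a single hidden layer, in Python (illustrative only; the network shape and learning rate are placeholders, and the δ formulas anticipate the derivation on the following slides):

    import math

    def sigmoid(net):
        return 1.0 / (1.0 + math.exp(-net))

    def backprop_step(x, t, W_hid, W_out, c=0.5):
        # One stochastic update. x and t are lists; x already includes the bias input 1.
        # W_hid[j] holds the weights into hidden unit j; W_out[k] those into output k.
        # Forward pass
        h = [sigmoid(sum(w * xi for w, xi in zip(W_hid[j], x))) for j in range(len(W_hid))]
        hb = h + [1.0]                                    # hidden outputs plus bias
        o = [sigmoid(sum(w * z for w, z in zip(W_out[k], hb))) for k in range(len(W_out))]
        # Errors (deltas): output layer first, then propagated back to the hidden layer
        d_out = [o[k] * (1 - o[k]) * (t[k] - o[k]) for k in range(len(o))]
        d_hid = [h[j] * (1 - h[j]) * sum(W_out[k][j] * d_out[k] for k in range(len(o)))
                 for j in range(len(h))]
        # Weight updates: w_pq <- w_pq + c * delta_q * z_p
        for k in range(len(W_out)):
            W_out[k] = [w + c * d_out[k] * z for w, z in zip(W_out[k], hb)]
        for j in range(len(W_hid)):
            W_hid[j] = [w + c * d_hid[j] * xi for w, xi in zip(W_hid[j], x)]
        return o

    # Shapes matching the later example slide: 3 inputs + bias, 1 hidden unit, 2 outputs,
    # all weights initialized to 0.2, c = 0.5.
    W_hid = [[0.2] * 4]
    W_out = [[0.2] * 2, [0.2] * 2]
    backprop_step([1, 0, 1, 1], [0, 1], W_hid, W_out, c=0.5)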

Page 35:

Setting Up the Derivation
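The content of this slide is missing from the transcript; the standard setup, in the deck's notation (wpq is the weight from unit p into unit q, zp is the output of unit p), is:

    For a single training instance:  Ed = ½ Σk (tk – ok)²
    With netq = Σp wpq zp, the chain rule gives
      ∂Ed/∂wpq = (∂Ed/∂netq) · (∂netq/∂wpq) = (∂Ed/∂netq) · zp
    Defining δq = –∂Ed/∂netq yields the update  Δwpq = c δq zp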

Page 36:

Output Units
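The formula itself is missing from the transcript; for a sigmoid output unit k the standard result is:

    δk = ok (1 – ok) (tk – ok)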

Page 37:

Hidden Units
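Likewise, the standard result for a hidden unit j, which receives error only through the units k of the next layer that it feeds into, is:

    δj = oj (1 – oj) Σk wjk δk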

Page 38:

Putting it all together

[Diagram: complete backpropagation update for a network with input units i, hidden unit j, and output units k]

Max when the sigmoid is most unstable

Page 39:

Example (I)

• Consider a simple network composed of:
  – 3 inputs: x, y, z
  – 1 hidden node: h
  – 2 outputs: q, r
• Assume c=0.5, all weights are initialized to 0.2, and weight updates are incremental
• Consider the training set:
  – 1 0 1 -> 0 1
  – 0 1 1 -> 1 1

• 4 iterations over the training set

Page 40:

Example (II)

Page 41:

Local Minima

• FFNN can get stuck in a local minimum
  – More common for small networks
  – For most large networks (many weights), local minima rarely occur in practice

With many weight dimensions, the network is unlikely to be at a minimum in every dimension simultaneously – there is almost always a way down (e.g., water running down a high-dimensional surface)

If needed, one can use momentum or train several NNs

Page 42:

Momentum

• Simple speed-up modification

• Weight update maintains momentum in the direction it has been going
  – Faster in flats
  – Could leap past minima (good or bad)
  – Significant speed-up; common value ≈ .9
  – Effectively increases the learning rate in areas where the gradient is consistently the same sign (a common approach in adaptive learning rate methods)
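Written out (a standard formulation, not reproduced in the transcript), the update at step t adds a fraction α of the previous update:

    Δwpq(t) = c δq zp + α Δwpq(t–1),   with α (the momentum term) typically around .9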

Page 43:

Learning Parameters

• Connectivity: typically fully connected between layers
• Number of hidden nodes:
  – Too many nodes make learning slower and could overfit
  – Too few will underfit
• Number of layers: 1 or 2 hidden layers are usually sufficient (attenuation makes learning very slow with more) – 1 is most common
• Momentum: (.5 – .99)
• Most common method to set parameters: a few trial-and-error runs (CV)
• All of these could be set automatically by the learning algorithm, and there are numerous approaches to do so

Page 44:

Backpropagation Summary

• Most common neural network approach
  – Many other different styles of neural networks (RBF, Hopfield, etc.)
• Excellent empirical results
• Scaling – the pleasant surprise
  – Local minima very rare as problem and network complexity increase
• User-defined parameters usually handled by multiple experiments
• Many variants, such as
  – Regression – typically linear output nodes, normal hidden nodes
  – Adaptive parameters, ontogenic (growing and pruning) learning algorithms
  – Many different learning algorithm approaches
  – Recurrent networks
  – Deep networks
• Still an active research area

Page 45:

END OF DAY 4
Homework: Decision Tree Learning