CS-424 Gregory Dudek
Today’s Lecture
• Neural networks
  – Training
• Backpropagation of error (backprop)
  – Example
  – Radial basis functions
Simple neural models
• Oldest ANN model is the McCulloch-Pitts neuron [1943].
  – Inputs are +1 or -1 with real-valued weights.
  – If the sum of weighted inputs is > 0, then the neuron “fires” and gives +1 as an output.
  – Showed you can compute logical functions.
  – Relation to learning proposed (later!) by Donald Hebb [1949].
• Perceptron model [Rosenblatt, 1958].
  – Single-layer network with the same kind of neuron.
    • Fires when the input is above a threshold: ∑xiwi > t.
  – Added a learning rule to allow weight selection.
Not in text
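The firing rule above can be sketched directly. This is a minimal illustration of a McCulloch-Pitts style neuron; the weights and threshold used for AND are illustrative choices, not part of the original model.

```python
# McCulloch-Pitts style neuron: inputs are +1 or -1; the neuron "fires"
# (outputs +1) when the weighted sum exceeds the threshold t, else -1.
def mp_neuron(inputs, weights, t=0.0):
    s = sum(x * w for x, w in zip(inputs, weights))
    return 1 if s > t else -1

# Logical AND over {+1, -1} inputs (illustrative weights and threshold):
# fires only when both inputs are +1.
and_out = mp_neuron([1, 1], [1.0, 1.0], t=1.0)   # → +1
```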
Perceptrons: Early motivation
• Von Neumann (1951), among others, worried about the theory behind how you get a network like the brain to learn things given:
  – Random (?) connections
  – Broken connections
  – Noisy signals.
• Boolean algebra & symbolic logic were not well suited to these questions.
• The perceptron (originally “photo-perceptron”) was used as a model for a cell that responds to illumination patterns.
Not in text
Perceptron nets
Perceptron learning
• Perceptron learning:
  – Have a set of training examples (TS) encoded as input values (i.e. in the form of binary vectors).
  – Have a set of desired output values associated with these inputs.
• This is supervised learning.
  – Problem: how to adjust the weights to make the actual outputs match the training examples.
• NOTE: we do not allow the topology to change! [You should be thinking of a question here.]
• Intuition: when a perceptron makes a mistake, its weights are wrong.
  – Modify them to make the output bigger or smaller, as desired.
Learning algorithm
• Desired output Ti; actual output Oi.
• Weight update formula (weight from unit j to i):

      Wj,i = Wj,i + k * xj * (Ti - Oi)

  where k is the learning rate.
• If the examples can be learned (encoded), then the perceptron learning rule will find the weights.
  – How? Gradient descent.
• The key thing to prove is the absence of local minima.
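The update rule above can be sketched as follows. This is a minimal illustration, assuming a 0/1 step activation and a bias input folded into the weight vector; the learning rate, epoch count, and the OR task are illustrative choices.

```python
# Perceptron learning rule from the slide:
#   W[j] <- W[j] + k * x[j] * (T - O)
# where k is the learning rate, T the desired output, O the actual output.
def perceptron_output(weights, x, t=0.0):
    # Step activation: fire (1) when the weighted sum exceeds the threshold.
    return 1 if sum(w * xi for w, xi in zip(weights, x)) > t else 0

def train(weights, examples, k=0.1, epochs=20):
    for _ in range(epochs):
        for x, target in examples:
            o = perceptron_output(weights, x)
            # Apply the update rule to every weight.
            weights = [w + k * xj * (target - o) for w, xj in zip(weights, x)]
    return weights

# Learn logical OR (linearly separable); the first input is a constant
# bias term of 1, so the threshold is effectively learned as a weight.
data = [([1, 0, 0], 0), ([1, 0, 1], 1), ([1, 1, 0], 1), ([1, 1, 1], 1)]
w = train([0.0, 0.0, 0.0], data)
```

Because OR is linearly separable, the rule converges to weights that classify all four examples correctly.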
Perceptrons: what can they learn?
• Only linearly separable functions [Minsky & Papert 1969].
• In N dimensions: the decision boundary is a hyperplane.
More general networks
• Generalize in 3 ways:
  – Allow continuous output values [0,1].
  – Allow multiple layers.
    • This is key to learning a larger class of functions.
  – Allow a more complicated function than thresholded summation. [why??]
• Generalize the learning rule to accommodate this: let’s see how it works.
The threshold
• The key variant:
  – Change the threshold into a differentiable function.
  – Sigmoid, known as a “soft non-linearity” (silly).

      M = ∑ xi wi
      O = 1 / (1 + e^(-kM))
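The sigmoid above is straightforward to code. A minimal sketch; the default steepness k=1 is an illustrative choice.

```python
import math

# Sigmoid activation from the slide: O = 1 / (1 + e^(-k*M)),
# where M is the weighted input sum and k controls the steepness
# (larger k makes the sigmoid closer to a hard threshold).
def sigmoid(m, k=1.0):
    return 1.0 / (1.0 + math.exp(-k * m))
```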
Recall: training
• For a single input-output layer, we could adjust the weights to get linear classification.
  – The perceptron computed a hyperplane over the space defined by the inputs.
• This is known as a linear classifier.
• By stacking layers, we can compute a wider range of functions.
• Compute the error derivative with respect to the weights.

[Figure: network with inputs, a hidden layer, and an output]
• “Train” the weights to correctly classify a set of examples (TS: the training set).
• Started with the perceptron, which used summing and a step function, and binary inputs and outputs.
• Embellished by allowing continuous activations and a more complex “threshold” function.
  – In particular, we considered a sigmoid activation function, which is like a “blurred” threshold.
The Gaussian
• Another continuous, differentiable function that is commonly used is the Gaussian:

      Gaussian(x) = e^(-x²/(2σ²))

  where σ is the width of the Gaussian.
• The Gaussian is a continuous, differentiable version of the step function.

[Figure: a step (boxcar) function of width w centered at c, rising from 0 to 1]
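The Gaussian above is a one-liner in code. A minimal sketch; the centre parameter c mirrors the centre of the step function in the figure and the defaults are illustrative.

```python
import math

# Gaussian "soft step" from the slide: exp(-(x-c)^2 / (2*sigma^2)),
# where sigma controls the width and c the centre of the bump.
def gaussian(x, c=0.0, sigma=1.0):
    return math.exp(-((x - c) ** 2) / (2.0 * sigma ** 2))
```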
What is learning?
• For a fixed set of weights w1,...,wn,

      f(x1,...,xn) = Sigma(x1 w1 + ... + xn wn)

  represents a particular scalar function of n variables.
• If we allow the weights to vary, then we can represent a family of scalar functions of n variables:

      F(x1,...,xn,w1,...,wn) = Sigma(x1 w1 + ... + xn wn)

• If the weights are real-valued, then the family of functions is determined by an n-dimensional parameter space, Rn.
• Learning involves searching in this parameter space.
Basis functions
• Here is another family of functions. In this case, the family is defined by a linear combination of basis functions {g1, g2, ..., gn}:

      F(x, w1,...,wn) = w1 g1(x) + ... + wn gn(x)

• The input x could be scalar or vector valued.
Combining basis functions
We can build a network as follows:

      g1(x) --- w1 ---\
      g2(x) --- w2 ----\
       ...              ∑ --- f(x)
       ...             /
      gn(x) --- wn ---/

E.g. from the basis {1, x, x²} we can build quadratics:

      F(x, w1, w2, w3) = w1 + w2 x + w3 x²
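The quadratic example above can be coded directly as a weighted sum of basis functions. A minimal sketch; the helper name and the sample weights are illustrative.

```python
# Linear combination of the basis {1, x, x^2} from the slide:
#   F(x) = w1*1 + w2*x + w3*x^2
def make_quadratic(w1, w2, w3):
    basis = [lambda x: 1.0, lambda x: x, lambda x: x * x]
    weights = [w1, w2, w3]
    # Return F as a function of x alone, with the weights fixed.
    return lambda x: sum(w * g(x) for w, g in zip(weights, basis))

f = make_quadratic(1.0, -2.0, 1.0)   # builds (x - 1)^2
```

Learning in this family means searching over (w1, w2, w3), exactly the parameter-space view of the previous slide.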
Receptive Fields
• This construction can be generalized to an arbitrary vector space (e.g., Rn).
• Often used to model what are called “localized receptive fields” in biological learning theory.
  – Such receptive fields are specially designed to represent the output of a learned function on a small portion of the input space.
  – How would you approximate an arbitrary continuous function using a sum of Gaussians, or a sum of piecewise constant functions of the sort described above?
Backprop
• Consider sigmoid activation functions.
• We can examine the output of the net as a function of the weights.
  – How does the output change with changes in the weights?
  – Linear analysis: consider the partial derivative of the output with respect to the weight(s).
    • We saw this last lecture.
  – If we have multiple layers, consider the effect on each layer as a function of the preceding layer(s).
• We propagate the error backwards through the net (using the chain rule for differentiation).
• Derivation on overheads [reference: DAA p. 212]
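The chain-rule idea can be sketched on the smallest possible network: one sigmoid hidden unit feeding one sigmoid output unit. This is an illustrative sketch only (not the derivation on the overheads); it assumes a squared-error cost E = ½(T - O)² and a learning rate of 0.5.

```python
import math

def sigmoid(m):
    return 1.0 / (1.0 + math.exp(-m))

# One gradient-descent step on a 1-1-1 network, propagating the error
# backwards through the net with the chain rule.
def backprop_step(x, target, w1, w2, lr=0.5):
    # Forward pass.
    h = sigmoid(w1 * x)      # hidden activation
    o = sigmoid(w2 * h)      # output activation
    # Backward pass: dE/d(net) at each layer, using sigmoid' = s*(1-s).
    delta_o = (o - target) * o * (1 - o)
    delta_h = delta_o * w2 * h * (1 - h)   # error pushed back one layer
    # Gradient-descent updates.
    return w1 - lr * delta_h * x, w2 - lr * delta_o * h

# Repeated steps drive the output toward the target.
w1, w2 = 0.0, 0.0
for _ in range(200):
    w1, w2 = backprop_step(1.0, 1.0, w1, w2)
```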
Backprop observations
• We can do gradient descent in weight space.
• What is the dimensionality of this space?
  – Very high: each weight is a free variable.
    • There are as many dimensions as weights.
    • A “typical” net might have hundreds of weights.
• Can we find the minimum?
  – It turns out that for multi-layer networks, the error surface (often called the “energy” of the network) is NOT CONVEX. [so?]
  – Commonest approach: multiple-restart gradient descent.
    • i.e. Try learning from various random initial weight distributions.
Success? Stopping?
• We have a training algorithm (backprop).
• We might like to ask:
  1. Have we done enough training (yet)?
  2. How good is our network at solving the problem?
  3. Should we try again to learn the problem (from the beginning)?
• The first 2 questions have standard answers:
  – Can’t just look at the energy. Why not?
    • Because we want to GENERALIZE across examples. “I understand multiplication: I know 3*6=18, 5*4=20.” What’s 7*3? Hmmmm.
  – Must have additional examples to validate the training.
    • Separate the input data into 2 classes: training and testing sets. Can also use cross-validation.
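The train/test separation above can be sketched as a simple shuffled split. A minimal illustration; the 80/20 fraction and the fixed seed are illustrative choices.

```python
import random

# Split the examples into a training set and a held-out testing set so
# that generalization is checked on data the network never trained on.
def split(examples, train_fraction=0.8, seed=0):
    rng = random.Random(seed)       # fixed seed for reproducibility
    shuffled = examples[:]          # copy: don't disturb the caller's list
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

train_set, test_set = split(list(range(10)))
```

Cross-validation repeats this idea, rotating which portion of the data is held out.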
What can we learn?
• For any mapping from input to output units, we can learn it if we have enough hidden units with the right weights!
• In practice, many weights means difficulty.
• The right representation is critical!
• Generalization depends on bias.
  – The hidden units form an internal representation of the problem: we want them to learn something general.
• Bad example: one hidden unit learns exactly one training example.
  – We want to avoid learning by table lookup.
Representation
• Much learning can be equated with selecting a good problem representation.
  – If we have the right hidden layer, things become easy.
• Consider the problem of face recognition from photographs. Or fingerprints.
  – Digitized photos: a big array (256x256 or 512x512) of intensities.
  – How do we match one array to another? (Either manually or by computer.)
  – Key: measure important properties and use those as criteria for estimating similarity.
Faces (an example)
• What is an important property to measure for faces?
  – Eye distance?
  – Average intensity?
    • BAD!
  – Nose width?
  – Forehead height?
• These measurements form the basis functions for describing faces.
  – BUT NOT NECESSARILY photographs!!!
• We don’t need to reconstruct the photo. Some information is not needed.
Radial basis functions
• Use “blobs” summed together to create an arbitrary function.
  – A good kind of blob is a Gaussian: circular, variable width, can be easily generalized to 2D, 3D, ....
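Summing Gaussian blobs can be sketched in a few lines. A minimal 1-D illustration; the centres, widths, and weights below are illustrative values, and in an RBF network they would be chosen or learned from data.

```python
import math

# Radial basis function sum: F(x) = sum_i w_i * exp(-(x - c_i)^2 / (2*s_i^2)).
# Each (centre c_i, width s_i) pair is one Gaussian "blob"; the weights w_i
# scale the blobs so their sum approximates the target function.
def rbf(x, centres, widths, weights):
    return sum(w * math.exp(-((x - c) ** 2) / (2.0 * s ** 2))
               for c, s, w in zip(centres, widths, weights))

# Two well-separated blobs: near each centre, F(x) is close to that
# blob's weight, because the other blob has decayed to almost zero.
value_at_first_centre = rbf(0.0, [0.0, 10.0], [1.0, 1.0], [2.0, 3.0])
```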
Topology changes
• Can we get by with fewer connections?
• When every neuron in one layer is connected to every neuron in the next layer, we call the network fully-connected.
• What if we allow signals to flow backwards to a preceding layer?
  – Recurrent networks.