Lectures 12&13&14: Multilayer Perceptrons (MLP) Networks
MultiLayer Perceptron (MLP)
• formulated from loose biological principles
• popularized mid 1980s
▫ Rumelhart, Hinton & Williams 1986; Werbos 1974; Ho 1964
• "learns" the pre-processing stage from data
• layered, feed-forward structure
▫ sigmoidal pre-processing
▫ task-specific output
• non-linear model
MLP
Input layer; Hidden layers; Output layer
[Figure: a feed-forward network with inputs x1 … xd entering the input layer, one or more hidden layers, and outputs y1 … yn leaving the output layer; the layers are labeled L0, L1, …, LM-1, LM, with weights wji connecting unit i of one layer to unit j of the next.]
A solution for the XOR problem
[Figure: a two-layer network of sign units solving XOR; inputs x1, x2 ∈ {−1, +1}, weights of ±1 and small thresholds (e.g. 0.1) on the units.]

x1   x2   x1 xor x2
-1   -1      -1
-1    1       1
 1   -1       1
 1    1      -1

φ(v) = +1 if v > 0, −1 if v ≤ 0, is the sign function.
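A network of sign units of this shape can be checked directly in code. The particular weights and thresholds below are one illustrative choice that realizes XOR on ±1 inputs, not necessarily the exact values in the figure:

```python
def sign(v):
    # phi(v) = +1 if v > 0, -1 otherwise (the sign function above)
    return 1 if v > 0 else -1

def xor_net(x1, x2):
    # Hidden layer: an OR-like unit and an AND-like unit
    # (illustrative weights, not necessarily those in the figure).
    h1 = sign(x1 + x2 + 1)     # fires unless both inputs are -1
    h2 = sign(x1 + x2 - 1)     # fires only if both inputs are +1
    # Output unit: "h1 AND NOT h2", i.e. exactly one input is +1.
    return sign(h1 - h2 - 0.5)

for x1, x2, target in [(-1, -1, -1), (-1, 1, 1), (1, -1, 1), (1, 1, -1)]:
    assert xor_net(x1, x2) == target
```

The key point is that no single sign unit can separate the XOR truth table, but two hidden units plus one output unit can.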
MLP
• Hidden layers of computation nodes
• Input propagates in a forward direction, on a layer-by-layer basis
▫ also called Multilayer Feedforward Network, MLP
• Error back-propagation algorithm
▫ supervised learning algorithm
▫ error-correction learning algorithm
▫ Forward pass: an input vector is applied to the input nodes; its effects propagate through the network layer by layer with fixed synaptic weights
▫ Backward pass: synaptic weights are adjusted in accordance with the error signal; the error signal propagates backward, in a layer-by-layer fashion
MLP Distinct Characteristics
• Non-linear activation function
▫ differentiable:
  y_j = 1 / (1 + exp(−v_j))
▫ sigmoidal function, logistic function
▫ nonlinearity prevents reduction to a single-layer perceptron!
[Figure: four activation functions of a: threshold, linear, piece-wise linear, and sigmoid.]
MLP Distinct Characteristics
• One or more layers of hidden neurons
▫ progressively extracting more meaningful features from input patterns
• High degree of connectivity
• Nonlinearity and the high degree of connectivity make theoretical analysis difficult
• Learning process is hard to visualize
• BP is a landmark in NN: computationally efficient training
NN: Universal Approximator?
• Any desired continuous function can be implemented by a three-layer network given a sufficient number of hidden units, proper nonlinearities, and weights (Kolmogorov)
• Kolmogorov proved that any continuous function g(x) can be represented as
  g(x) = Σ_{j=1}^{2d+1} Ξ_j ( Σ_{i=1}^{d} ψ_ij(x_i) )
for properly chosen Ξ_j and ψ_ij.
(A. N. Kolmogorov. On the representation of continuous functions of several variables by superposition of continuous functions of one variable and addition. Doklady Akademii Nauk SSSR, 114(5):953-956, 1957)
Universal Approximation Property of ANN
Boolean functions
• Every boolean function can be represented by a network with a single hidden layer
• But it might require exponential (in the number of inputs) hidden units
Continuous functions
• Every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer [Cybenko 1989, Hornik 1989]
• Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988]
Preliminaries
• Function signal
▫ input signals come in at the input end of the network
▫ propagate forward to the output nodes
• Error signal
▫ originates from the output neurons
▫ propagates backward to the input nodes
• Two computations in training
▫ computation of the function signal
▫ computation of an estimate of the gradient vector
 gradient of the error surface with respect to the weights
MLP: Nonlinear multilayer networks
[Figure: a two-layer network with inputs x_i, 1st-layer outputs y_j, and outputs z_k; 1st-layer weights w_ji from i to j, 2nd-layer weights w_kj from j to k.]
Multi-Modal Cost Surface
[Figure: an error surface over two weights, showing both a global minimum and a local minimum; which way does the gradient point?]
Heading Downhill
• assume
▫ minimization (e.g. SSE)
▫ analytically intractable
• step parameters downhill
• w_new = w_old + step in the right direction
• backpropagation (of error)
▫ slow but efficient
• conjugate gradients, Levenberg-Marquardt
▫ for preference
Neuron with Sigmoid Function
[Figure: inputs x1 … xn with weights w1 … wn feeding a single unit.]
activation: a = Σ_{i=1}^{n} w_i x_i
output: y = σ(a) = 1/(1 + e^{−a})
Sigmoid Unit
[Figure: inputs x1 … xn with weights w1 … wn, plus a bias input x0 = −1 with weight w0 = θ.]
a = Σ_{i=0}^{n} w_i x_i
y = σ(a) = 1/(1 + e^{−a})
σ(x) is the sigmoid function: 1/(1 + e^{−x})
dσ(x)/dx = σ(x) (1 − σ(x))
Derive gradient descent rules to train:
• one sigmoid unit:
  ∂E/∂w_i = −(d − y) y (1 − y) x_i
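The derivative identity dσ/dx = σ(x)(1 − σ(x)) can be checked numerically with a small self-contained sketch:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_deriv(x):
    # d(sigma)/dx = sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

# Compare the closed form against a central-difference approximation.
for x in (-2.0, 0.0, 1.5):
    h = 1e-6
    numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
    assert abs(sigmoid_deriv(x) - numeric) < 1e-8
```

This identity is what makes backpropagation cheap: the derivative is computed from the unit's output y alone, with no extra exponentials.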
Gradient Descent Rule for Sigmoid Output Function
E[w1,…,wn] = ½ (d − y)²
∂E/∂w_i = ∂/∂w_i ½ (d − y)²
        = ∂/∂w_i ½ (d − σ(Σ_i w_i x_i))²
        = (d − y) σ′(Σ_i w_i x_i) (−x_i)
for y = σ(a) = 1/(1 + e^{−a}):
  σ′(a) = e^{−a}/(1 + e^{−a})² = σ(a)(1 − σ(a))
w′_i = w_i + Δw_i = w_i + η y(1 − y)(d − y) x_i
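This update rule for a single sigmoid unit can be sketched as follows; the training example, target, and learning rate below are made up for illustration:

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def train_step(w, x, d, eta):
    """One gradient-descent step for a single sigmoid unit:
    w'_i = w_i + eta * y*(1-y)*(d-y) * x_i."""
    a = sum(wi * xi for wi, xi in zip(w, x))
    y = sigmoid(a)
    return [wi + eta * y * (1 - y) * (d - y) * xi for wi, xi in zip(w, x)]

# Repeatedly fit one (input, target) pair: the output should approach d.
w = [0.1, -0.2]
for _ in range(2000):
    w = train_step(w, [1.0, 0.5], d=0.9, eta=0.5)
```

After enough steps the unit's output on this input approaches the target d = 0.9.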
Gradient Descent Learning Rule
[Figure: pre-synaptic neuron i with activation x_i feeding post-synaptic neuron j with output y_j through weight w_ji.]
  Δw_ji = η y_j(1 − y_j) (d_j − y_j) x_i
▫ η: learning rate
▫ y_j(1 − y_j): derivative of the activation function
▫ (d_j − y_j): error of the post-synaptic neuron
▫ x_i: activation of the pre-synaptic neuron
Learning with hidden units
• Networks without hidden units are very limited in the input-output mappings they can model.
▫ More layers of linear units do not help; it is still linear.
▫ Fixed output nonlinearities are not enough.
• We need multiple layers of adaptive nonlinear hidden units. This gives us a universal approximator. But how can we train such nets?
▫ We need an efficient way of adapting all the weights, not just the last layer; this is hard. Learning the weights going into hidden units is equivalent to learning features.
▫ Nobody is telling us directly what hidden units should do.
Learning by perturbing weights
[Figure: a network with input units, hidden units, and output units. Learning the hidden-to-output weights is easy; learning the input-to-hidden weights is hard.]
• Randomly perturb one weight and see if it improves performance. If so, save the change.
▫ Very inefficient. We need to do multiple forward passes on a representative set of training data just to change one weight.
▫ Towards the end of learning, large weight perturbations will nearly always make things worse.
• We could randomly perturb all the weights in parallel and correlate the performance gain with the weight changes.
▫ Not any better, because we need lots of trials to "see" the effect of changing one weight through the noise created by all the others.
The idea behind backpropagation
• We do not know what the hidden units might do, but we can compute how fast the error changes as we change a hidden activity.
▫ Instead of using desired activities to train the hidden units, use error derivatives w.r.t. hidden activities.
▫ Each hidden activity can affect many output units and can therefore have many separate effects on the error. These effects must be combined.
▫ We can compute error derivatives for all the hidden units efficiently.
▫ Once we have the error derivatives for the hidden activities, it is easy to get the error derivatives for the weights going into a hidden unit.
Multi-Layer Networks
[Figure: an input layer with units x_i, a hidden layer connected by weights w_ji, and an output layer with units y_j.]
Back-Propagation Algorithm (BPA)
• Error signal for neuron j at iteration n:
  e_j(n) = d_j(n) − y_j(n)
• Total error energy:
  E(n) = ½ Σ_{j∈C} e_j²(n)
▫ C is the set of output nodes
• Average squared error energy:
  E_av = (1/N) Σ_{n=1}^{N} E(n)
▫ average over all training samples
▫ cost function as a measure of learning performance
• Objective of the learning process
▫ adjust NN parameters (synaptic weights) to minimize E(n) or E_av
• Weights are updated on a pattern-by-pattern basis until one epoch
▫ complete presentation of the entire training set
BPA
• Induced local field:
  v_j(n) = Σ_{i=0}^{m} w_ji(n) y_i(n)
• Output of neuron j:
  y_j(n) = φ_j(v_j(n))
• Gradient: ∂E(n)/∂w_ji(n)
▫ sensitivity factor: determines the direction of search in weight space
▫ according to the chain rule:
  ∂E(n)/∂w_ji(n) = [∂E(n)/∂e_j(n)] [∂e_j(n)/∂y_j(n)] [∂y_j(n)/∂v_j(n)] [∂v_j(n)/∂w_ji(n)]
with
  ∂E(n)/∂e_j(n) = e_j(n)
  ∂e_j(n)/∂y_j(n) = −1
  ∂y_j(n)/∂v_j(n) = φ′_j(v_j(n))
  ∂v_j(n)/∂w_ji(n) = y_i(n)
Gradient Descent
• Therefore,
  ∂E(n)/∂w_ji(n) = −e_j(n) φ′_j(v_j(n)) y_i(n)
• By the delta rule,
  Δw_ji(n) = −η ∂E(n)/∂w_ji(n)
which is gradient descent in weight space.
• Local gradient:
  δ_j(n) = −∂E(n)/∂v_j(n) = −[∂E(n)/∂e_j(n)] [∂e_j(n)/∂y_j(n)] [∂y_j(n)/∂v_j(n)] = e_j(n) φ′_j(v_j(n))
so
  Δw_ji(n) = η δ_j(n) y_i(n)
Local Gradient
• Neuron j is an output node:
  e_j(n) = d_j(n) − y_j(n)
• Neuron j is a hidden node
▫ credit assignment problem: how to determine its share of responsibility
  δ_j(n) = −[∂E(n)/∂y_j(n)] [∂y_j(n)/∂v_j(n)] = −[∂E(n)/∂y_j(n)] φ′_j(v_j(n))
• With E(n) = ½ Σ_{k∈C} e_k²(n) summed over output neurons k:
  ∂E(n)/∂y_j(n) = Σ_k e_k(n) ∂e_k(n)/∂y_j(n) = Σ_k e_k(n) [∂e_k(n)/∂v_k(n)] [∂v_k(n)/∂y_j(n)]
Local Gradient
• Error in neuron k:
  e_k(n) = d_k(n) − y_k(n) = d_k(n) − φ_k(v_k(n))
• Hence
  ∂e_k(n)/∂v_k(n) = −φ′_k(v_k(n))
• Since v_k(n) = Σ_{j=0}^{m} w_kj(n) y_j(n),
  ∂v_k(n)/∂y_j(n) = w_kj(n)
• Desired partial derivative:
  ∂E(n)/∂y_j(n) = −Σ_k e_k(n) φ′_k(v_k(n)) w_kj(n) = −Σ_k δ_k(n) w_kj(n)
• Back-propagation formula for hidden neuron j:
  δ_j(n) = φ′_j(v_j(n)) Σ_k δ_k(n) w_kj(n)
BP Summary
  (weight correction Δw_ji(n)) = (learning-rate parameter η) × (local gradient δ_j(n)) × (input signal to neuron j, y_i(n))
• forward pass:
  v_j(n) = Σ_{i=0}^{m} w_ji(n) y_i(n),  y_j(n) = φ_j(v_j(n))
• backward pass
▫ recursively compute the local gradients from the output layer towards the input layer
▫ synaptic weight change by the delta rule
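The forward and backward passes can be sketched for one hidden layer of logistic units. This is a minimal illustration in plain Python (list-of-lists weight matrices, made-up shapes), not a full reproduction of the lecture's notation:

```python
import math
import random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def backprop_step(W1, W2, x, d, eta):
    """One pattern-by-pattern BP update on a 1-hidden-layer net.
    W1: hidden weights (rows index hidden units), W2: output weights."""
    # Forward pass: v_j = sum_i w_ji * y_i, y_j = phi(v_j), layer by layer.
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    y = [sigmoid(sum(w * hj for w, hj in zip(row, h))) for row in W2]
    # Output local gradients: delta_k = e_k * phi'(v_k) = (d-y) y (1-y).
    d_out = [(dk - yk) * yk * (1 - yk) for dk, yk in zip(d, y)]
    # Hidden local gradients: delta_j = phi'(v_j) * sum_k delta_k w_kj.
    d_hid = [hj * (1 - hj) * sum(d_out[k] * W2[k][j] for k in range(len(W2)))
             for j, hj in enumerate(h)]
    # Delta rule: w_ji <- w_ji + eta * delta_j * y_i.
    for k, row in enumerate(W2):
        for j in range(len(row)):
            row[j] += eta * d_out[k] * h[j]
    for j, row in enumerate(W1):
        for i in range(len(row)):
            row[i] += eta * d_hid[j] * x[i]
    return y
```

Repeated calls on the same pattern should drive the output toward the target, which is an easy sanity check for an implementation.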
Indeed …
• Back-propagation algorithm
▫ Forward step: function signals
▫ Backward step: error signals
• It adjusts the weights of the NN in order to minimize the average squared error.
Activation Function (logistic function)
• Sigmoidal function:
  φ(v_j) = 1 / (1 + exp(−a v_j))
[Figure: logistic curves plotted over v_j ∈ [−10, 10]; the curve steepens with increasing a.]
• induced local field of neuron j:
  v_j = Σ_{i=0}^{m} w_ji y_i
• Most common form of activation function
• a smooth approximation of a threshold function
• Differentiable
Activation Function (logistic function)
  y_j(n) = φ_j(v_j(n)) = 1 / (1 + exp(−a v_j(n))),  a > 0, −∞ < v_j(n) < ∞
  φ′_j(v_j(n)) = a exp(−a v_j(n)) / [1 + exp(−a v_j(n))]² = a y_j(n)[1 − y_j(n)]
• local gradient
▫ for an output node:
  δ_j(n) = e_j(n) φ′_j(v_j(n)) = a [d_j(n) − y_j(n)] y_j(n)[1 − y_j(n)]
▫ for a hidden node:
  δ_j(n) = φ′_j(v_j(n)) Σ_k δ_k(n) w_kj(n) = a y_j(n)[1 − y_j(n)] Σ_k δ_k(n) w_kj(n)
Activation Function (hyperbolic tangent function)
  y_j(n) = φ_j(v_j(n)) = a tanh(b v_j(n)),  (a, b) > 0
  φ′_j(v_j(n)) = a b sech²(b v_j(n)) = a b (1 − tanh²(b v_j(n))) = (b/a)[a − y_j(n)][a + y_j(n)]
• local gradient
▫ for an output node:
  δ_j(n) = e_j(n) φ′_j(v_j(n)) = (b/a)[d_j(n) − y_j(n)][a − y_j(n)][a + y_j(n)]
▫ for a hidden node:
  δ_j(n) = φ′_j(v_j(n)) Σ_k δ_k(n) w_kj(n) = (b/a)[a − y_j(n)][a + y_j(n)] Σ_k δ_k(n) w_kj(n)
Momentum term
• BP approximates the trajectory of steepest descent
▫ a smaller learning-rate parameter makes a smoother path
• increases the rate of learning while avoiding the danger of instability:
  Δw_ji(n) = α Δw_ji(n−1) + η δ_j(n) y_i(n)
where α is the momentum constant
  Δw_ji(n) = −η Σ_{t=0}^{n} α^{n−t} ∂E(t)/∂w_ji(t) = η Σ_{t=0}^{n} α^{n−t} δ_j(t) y_i(t)
▫ converges if 0 ≤ |α| < 1
▫ if the partial derivative has the same sign on consecutive iterations, Δw grows in magnitude: accelerated descent
▫ opposite sign: Δw shrinks: stabilizing effect
• benefit of preventing the learning process from terminating in a shallow local minimum
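A small sketch of the momentum update, using made-up values of η, α, δ, and y. When the gradient keeps the same sign, the step size grows toward the geometric-series limit η δ y / (1 − α):

```python
def momentum_update(w, dw_prev, delta, y_in, eta=0.1, alpha=0.9):
    """Delta rule with momentum:
    dw(n) = alpha * dw(n-1) + eta * delta(n) * y(n)."""
    dw = alpha * dw_prev + eta * delta * y_in
    return w + dw, dw

# A constant local gradient (same sign every step) accelerates descent:
# the per-step change dw grows toward eta*delta*y / (1 - alpha) = 1.0.
w, dw = 0.0, 0.0
steps = []
for _ in range(50):
    w, dw = momentum_update(w, dw, delta=1.0, y_in=1.0)
    steps.append(dw)
```

With an alternating-sign gradient the same recursion shrinks the step instead, which is the stabilizing effect noted above.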
Mode of Training
▫ Epoch: one complete presentation of the training data
▫ randomize the order of presentation for each epoch
• Sequential mode
▫ for each training sample, the synaptic weights are updated
▫ requires less storage
▫ converges fast, particularly when the training data is redundant
▫ random order makes trapping at a local minimum less likely
• Batch mode
▫ at the end of one epoch, the synaptic weights are updated:
  Δw_ji = −η ∂E_av/∂w_ji = −(η/N) Σ_{n=1}^{N} e_j(n) ∂e_j(n)/∂w_ji
▫ may be robust with outliers
Stopping Criteria
• No well-defined stopping criteria
• Terminate when the gradient vector g(w) = 0
▫ located at a local or global minimum
• Terminate when the error measure is stationary
• Terminate if the NN's generalization performance is adequate
Two-layer networks
[Figure: inputs x_k, hidden units with outputs z_j (output of the 1st layer), and outputs y_i; 1st-layer weights v_jk from k to j, 2nd-layer weights w_ij from j to i.]
An Example
[Figure: a 2-2-2 network with inputs x1, x2, two hidden units, and outputs y1, y2.]
All biases are set to 1 (not drawn, for clarity).
Learning rate η = 0.1.
1st-layer weights: v11 = −1, v21 = 0, v12 = 0, v22 = 1
2nd-layer weights: w11 = 1, w21 = −1, w12 = 0, w22 = 1
Input x = [0, 1] (x1 = 0, x2 = 1) with target [1, 1].
Use the identity activation function (i.e. g(a) = a).
An Example
Forward pass: calculate the hidden unit inputs:
  u1 = v11 x1 + v12 x2 + 1 = (−1)(0) + (0)(1) + 1 = 1
  u2 = v21 x1 + v22 x2 + 1 = (0)(0) + (1)(1) + 1 = 2
An Example
Calculate the activities of the hidden units:
  z1 = g(u1) = 1
  z2 = g(u2) = 2
An Example
Calculate the outputs:
  y1 = w11 z1 + w12 z2 + 1 = 2
  y2 = w21 z1 + w22 z2 + 1 = 3
An Example
Backward pass: calculate the error signals of the output units.
Target = [1, 1], so:
  Δ1 = (t1 − y1) = 1 − 2 = −1
  Δ2 = (t2 − y2) = 1 − 3 = −2
An Example
Calculate the error signals of the hidden units:
  δ_j = g′(u_j) Σ_i Δ_i w_ij
For the identity activation, g′(u_j) = 1; the contributions are:
  to δ1: Δ1 w11 = −1, Δ2 w21 = 2
  to δ2: Δ1 w12 = 0, Δ2 w22 = −2
An Example
  δ1 = −1 + 2 = 1
  δ2 = 0 − 2 = −2
An Example
Update the weights in the output layer:
  w_ij(t+1) = w_ij(t) + η Δ_i(t) z_j(t)
The products are:
  Δ1 z1 = −1, Δ1 z2 = −2, Δ2 z1 = −2, Δ2 z2 = −4
An Example
Output layer weight changes:
  w11 = 1 + 0.1(−1)(1) = 0.9
  w12 = 0 + 0.1(−1)(2) = −0.2
  w21 = −1 + 0.1(−2)(1) = −1.2
  w22 = 1 + 0.1(−2)(2) = 0.6
An Example
Hidden layer weight changes:
  v_jk(t+1) = v_jk(t) + η δ_j(t) x_k(t)
The products are:
  δ1 x1 = 0, δ1 x2 = 1, δ2 x1 = 0, δ2 x2 = −2
so
  v11 = −1 + 0.1(1)(0) = −1
  v12 = 0 + 0.1(1)(1) = 0.1
  v21 = 0 + 0.1(−2)(0) = 0
  v22 = 1 + 0.1(−2)(1) = 0.8
Practical considerations in MLP
Data:
• Learning set
• Test set
• Validation set
• Stopping criterion
• Learning curve: the average error per pattern
• Cross-validation
• The total training error is minimized
• It usually decreases monotonically, even though this is not the case for the error on each individual pattern.
MLP
• When the training set is small, one can generate surrogate training patterns.
• In the absence of problem-specific information, the surrogate patterns can be made by adding Gaussian noise to true training points; the category label should be left unchanged.
• If we know about the sources of variation among patterns, we can manufacture training data.
• The number of hidden units (neurons) should be less than the total number of training points n, say roughly n/10.
• Initializing weights
• We cannot initialize the weights to 0 (why?).
• uniform learning => choose weights randomly from a single distribution
• Input-to-hidden weights:
  −1/√d < w_ji < +1/√d, where d is the number of input units
• Hidden-to-output weights:
  −1/√n_H < w_kj < +1/√n_H, where n_H is the number of hidden units
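A sketch of this initialization scheme. The ranges are reconstructed here as ±1/√d and ±1/√n_H (i.e. inverse square root of the fan-in), and the layer sizes below are made up for illustration:

```python
import random

def init_weights(n_in, n_hidden, n_out, seed=0):
    """Initialize MLP weights uniformly in +/- 1/sqrt(fan-in):
    input-to-hidden in +/- 1/sqrt(d), hidden-to-output in +/- 1/sqrt(n_H)."""
    rng = random.Random(seed)
    r1 = 1.0 / n_in ** 0.5       # input-to-hidden range
    r2 = 1.0 / n_hidden ** 0.5   # hidden-to-output range
    W1 = [[rng.uniform(-r1, r1) for _ in range(n_in)] for _ in range(n_hidden)]
    W2 = [[rng.uniform(-r2, r2) for _ in range(n_hidden)] for _ in range(n_out)]
    return W1, W2

W1, W2 = init_weights(4, 10, 2)
```

Non-zero random values break the symmetry between hidden units; the fan-in scaling keeps the induced local fields out of the saturated region of the sigmoid at the start of training.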
Newton's Method to speed up
• The idea is to minimize the quadratic approximation of the cost function E(w) around the current point w(n).
• Using a second-order Taylor series expansion of the cost function around the point w(n):
  ΔE[w(n)] ≈ gᵀ(n)Δw(n) + ½ Δwᵀ(n) H(n) Δw(n)
• g(n) is the m-by-1 gradient vector of the cost function E(w) evaluated at the point w(n). The matrix H(n) is the m-by-m Hessian of E(w) (second derivative), H = ∇²E(w).
Newton's Method to speed up
• H = ∇²E(w) requires the cost function E(w) to be twice continuously differentiable with respect to the elements of w.
• Differentiating ΔE[w(n)] ≈ gᵀ(n)Δw(n) + ½ Δwᵀ(n)H(n)Δw(n) with respect to Δw, the change ΔE(n) is minimized when
  g(n) + H(n)Δw(n) = 0  →  Δw(n) = −H⁻¹(n) g(n)
• w(n+1) = w(n) + Δw(n) = w(n) − H⁻¹(n) g(n)
• where H⁻¹(n) is the inverse of the Hessian of E(w).
• Newton's method converges quickly asymptotically and does not exhibit the zigzagging behavior.
• Newton's method requires the Hessian H(n) to be invertible for all n!
Gauss-Newton Method
• It is applicable to a cost function that is expressed as the sum of error squares:
  E(w) = ½ Σ_{i=1}^{n} e²(i)
note that all the error terms are calculated on the basis of a weight vector w that is fixed over the entire observation interval 1 ≤ i ≤ n.
• The error signal e(i) is a function of the adjustable weight vector w. Given an operating point w(n), we linearize the dependence of e(i) on w by writing
  e′(i, w) = e(i) + [∂e(i)/∂w]ᵀ|_{w=w(n)} (w − w(n)),  i = 1, 2, …, n
Gauss-Newton Method
In matrix form:
  e′(n, w) = e(n) + J(n)(w − w(n))
where e(n) is the error vector e(n) = [e(1), e(2), …, e(n)]ᵀ and J(n) is the n-by-m Jacobian matrix of e(n). (The Jacobian J(n) is the transpose of the m-by-n gradient matrix ∇e(n), where ∇e(n) = [∇e(1), ∇e(2), …, ∇e(n)].)
  w(n+1) = arg min_w {½‖e′(n, w)‖²}
  ½‖e′(n, w)‖² = ½‖e(n)‖² + eᵀ(n)J(n)(w − w(n)) + ½(w − w(n))ᵀJᵀ(n)J(n)(w − w(n))
Differentiating this expression with respect to w and setting the result equal to zero:
Gauss-Newton Method
  Jᵀ(n)e(n) + Jᵀ(n)J(n)(w − w(n)) = 0
  w(n+1) = w(n) − [Jᵀ(n)J(n)]⁻¹ Jᵀ(n) e(n)
The Gauss-Newton method requires only the Jacobian matrix of the error vector e(n).
For the Gauss-Newton iteration to be computable, the matrix product Jᵀ(n)J(n) must be nonsingular. Jᵀ(n)J(n) is always nonnegative definite; to ensure that it is nonsingular, the Jacobian J(n) must have row rank n. → add the diagonal matrix δI to the matrix Jᵀ(n)J(n), where the parameter δ is a small positive constant.
Gauss-Newton Method
• Jᵀ(n)J(n) + δI: positive definite for all n.
• → The Gauss-Newton method is implemented in the following form:
  w(n+1) = w(n) − [Jᵀ(n)J(n) + δI]⁻¹ Jᵀ(n) e(n)
• This is the solution to the modified cost function:
  E(w) = ½{δ‖w − w(0)‖² + Σ_{i=1}^{n} e²(i)}
where w(0) is the initial value of w.
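One damped Gauss-Newton step can be written out for a tiny two-parameter problem. The model below (a line fit, where the errors are linear in w, so a single step essentially solves the problem) and its data are made up for illustration; the 2x2 linear solve is done by hand to keep the sketch dependency-free:

```python
def gauss_newton_step(w, xs, ys, delta=1e-3):
    """One damped Gauss-Newton step for the model y = w0 + w1*x:
    w(n+1) = w(n) - [J^T J + delta*I]^(-1) J^T e(n),
    where e(i) = (w0 + w1*x_i) - y_i, so row i of J is [1, x_i]."""
    e = [w[0] + w[1] * x - y for x, y in zip(xs, ys)]
    # Accumulate J^T J + delta*I (2x2) and J^T e (2x1) explicitly.
    a = len(xs) + delta                  # sum of 1*1, damped
    b = sum(xs)                          # sum of 1*x_i
    c = sum(x * x for x in xs) + delta   # sum of x_i*x_i, damped
    g0 = sum(e)
    g1 = sum(x * ei for x, ei in zip(xs, e))
    # Solve the damped normal equations with Cramer's rule.
    det = a * c - b * b
    dw0 = (c * g0 - b * g1) / det
    dw1 = (a * g1 - b * g0) / det
    return [w[0] - dw0, w[1] - dw1]

xs, ys = [0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0]   # data from y = 1 + 2x
w = gauss_newton_step([0.0, 0.0], xs, ys)
```

Because the model is linear in w, one step lands (up to the small damping δ) on the least-squares solution w ≈ [1, 2]; for the nonlinear errors of an MLP, the step would be iterated, which is the core of Levenberg-Marquardt training.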
Heuristics for making BP Better
• Training with BP is more an art than a science
• Sequential vs. batch update
• Maximizing information content
▫ examples with the largest training error
▫ examples radically different from previous ones
• Randomize the order of presentation
▫ successive examples rarely belong to the same class
• Activation function
▫ an antisymmetric function learns faster:
  φ(v) = a tanh(bv)
Heuristics for making BP Better
• Normalizing the inputs
▫ preprocess so that each input's mean value is close to zero
▫ input variables should be uncorrelated: use principal component analysis
▫ scale so that the covariances are equal
• Weight initialization
▫ large weight values => saturation: the local gradient value is small => slow learning
▫ small weight values => operation on a flat area => slow learning
▫ somewhere between the two extremes
Heuristics for making BP Better
• Learning from hints
▫ prior information should be included in the learning process
• Learning rate
▫ all the neurons should learn at the same rate
 the last layer has a large local gradient, so it learns fast
 η of the last layer should be assigned a smaller value
Overfitting
• The training data contains information about the regularities in the mapping from input to output. But it also contains noise.
▫ The target values may be unreliable.
▫ There is sampling error. There will be accidental regularities just because of the particular training cases that were chosen.
• When we fit the model, it cannot tell which regularities are real and which are caused by sampling error.
▫ So it fits both kinds of regularity.
▫ If the model is very flexible, it can model the sampling error really well. This is a disaster.
A simple example of overfitting
• Which model do you believe?
▫ The complicated model fits the data better.
▫ But it is not economical.
• A model is convincing when it fits a lot of data surprisingly well.
▫ It is not surprising that a complicated model can fit a small amount of data.
Generalization
• The objective of learning is to achieve good generalization to new cases; otherwise, just use a look-up table.
• Generalization can be defined as a mathematical interpolation or regression over a set of training points.
[Figure: a smooth curve f(x) interpolating a set of training points.]
Training Set Size for Generalization
• Generalization is influenced by
▫ size of the training set
▫ architecture of the neural network
• Given the architecture, determine the size of the training set for good generalization
• Given a set of training samples, determine the best architecture for good generalization
• rule of thumb: N = O(W/ε), where W is the number of weights and ε the permitted error fraction
Approximation of Functions
• Non-linear input-output mapping
▫ m0-dimensional input space to mL-dimensional output space
• What is the minimum number of hidden layers in an MLP that can approximate any continuous mapping?
• Universal Approximation Theorem
▫ existence of an approximation of an arbitrary continuous function
▫ a single hidden layer is sufficient for an MLP to compute a uniform ε-approximation to a given training set
▫ not saying a single layer is optimum in the sense of training time, ease of implementation, or generalization
• Bound of approximation errors of single-hidden-layer NN
▫ the larger the number of hidden nodes, the more accurate the approximation
▫ the smaller the number of hidden nodes, the more accurate the empirical fit, i.e., better generalization
Curse of Dimensionality
• For good generalization, N > W/ε
▫ where W is the total number of synaptic weights
• We need dense sample points to learn the mapping well.
• Dense samples are hard to find in high dimensions
▫ exponential growth in complexity as the dimension increases
Practical Consideration
• Single hidden layer vs. double (multiple) hidden layers
▫ a single-hidden-layer NN is good for approximation of any continuous function
▫ a double-hidden-layer NN may be better sometimes
• double (multiple) hidden layers
▫ first hidden layer: local feature detection
▫ second hidden layer: global feature detection
Cross-Validation
• Validate the learned model on different sets to assess the generalization performance
▫ guarding against overfitting
• Partition the training set into
▫ estimation subset (or training subset)
▫ validation subset (or test subset)
• cross-validation for
▫ best model selection
▫ determining when to stop training
Model selection
• Choosing the MLP with the best number of free parameters given N training samples
• The issue is to choose r
▫ which determines the split of the training set between estimation set and validation set
▫ to minimize the classification error of the model trained on the estimation set when it is tested with the validation set
• Kearns (1996): qualitative properties of the optimum r
▫ for small-complexity problems (desired response small compared to N), the performance of cross-validation is insensitive to r
▫ a single fixed r is nearly optimal for a wide range of target functions
• suggests r = 0.2
▫ 80% of the training set is the estimation set
Stopping method of training
• Right time to stop training
▫ to avoid overfitting
• Early stopping method
▫ after some training, with the synaptic weights fixed, compute the test error
▫ resume training after computing the test error
[Figure: mean-squared error vs. number of epochs for the training sample and the test sample; the early stopping point is where the test-sample error starts to rise while the training-sample error keeps falling.]
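The early-stopping loop in the figure can be sketched generically. The callbacks `step_fn` and `eval_fn` are hypothetical stand-ins for one epoch of training and for measuring the test-sample error, and the toy error curve is invented to show the mechanism:

```python
def train_with_early_stopping(step_fn, eval_fn, max_epochs=100, patience=5):
    """Generic early stopping: after each epoch, measure the validation
    (test-sample) error; stop once it has not improved for `patience`
    epochs, and report the best epoch seen.
    step_fn(epoch) runs one epoch of training; eval_fn() returns the
    current validation error (both are hypothetical helpers)."""
    best_err, best_epoch, since_best = float("inf"), 0, 0
    for epoch in range(1, max_epochs + 1):
        step_fn(epoch)
        err = eval_fn()
        if err < best_err:
            best_err, best_epoch, since_best = err, epoch, 0
        else:
            since_best += 1
            if since_best >= patience:
                break
    return best_epoch, best_err

# Toy validation-error curve: falls until epoch 30, then rises again.
curve = [abs(e - 30) + 1.0 for e in range(1, 101)]
state = {"epoch": 0}
def step(epoch): state["epoch"] = epoch
def val_err(): return curve[state["epoch"] - 1]
best_epoch, best_err = train_with_early_stopping(step, val_err)
```

In practice the network weights at the best epoch would be saved and restored, which this sketch omits.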
Stopping method
• Amari (1996)
▫ for N < W: early stopping improves generalization
▫ for N < 30W: overfitting occurs
  r_opt = 1 − (√(2W − 1) − 1)/(2(W − 1)) ≈ 1 − 1/√(2W) for large W
 example: W = 100 gives r_opt ≈ 0.93: 93% for estimation, 7% for validation
▫ for N > 30W: the early stopping improvement is small
• Leave-one-out method
Network Pruning
• Minimizing the network improves generalization
▫ less likely to learn idiosyncrasies or noise
• Network growing
• Network pruning
▫ weakening or eliminating synaptic weights
• Complexity regularization
▫ tradeoff between reliability of the training data and goodness of the model
▫ supervised learning by minimizing the risk function
  R(w) = E_s(w) + λ E_c(w)
where
  E_s(w): standard performance measure
  E_c(w): complexity penalty
Complexity Regularization
• Weight decay:
  E_c(w) = ‖w‖² = Σ_i w_i²
▫ some weights are forced to take values near zero
▫ the weights in the network are grouped into two categories:
 those of large influence
 those of little or no influence: excess weights
• Weight elimination:
  E_c(w) = Σ_i (w_i/w_0)² / (1 + (w_i/w_0)²)
▫ when w_i << w_0, w_i is eliminated
• Approximate smoother
Hessian-based Network Pruning
• Identify parameters whose deletion will cause the least increase in E_av
• by Taylor series:
  E_av(w + Δw) = E_av(w) + gᵀ(w)Δw + ½ Δwᵀ H Δw + O(‖Δw‖³)
▫ parameters are deleted after the training process has converged (i.e., g(w) ≈ 0)
▫ quadratic approximation (i.e., the higher-order terms ≈ 0):
  ΔE_av = E_av(w + Δw) − E_av(w) ≈ ½ Δwᵀ H Δw
• eliminate the weights with small effect
• Solve the constrained optimization problem: minimize ½ ΔwᵀHΔw subject to deleting weight w_i (i.e., w_i + Δw_i = 0):
  Δw = −(w_i / [H⁻¹]_{i,i}) H⁻¹ 1_i
▫ if [H⁻¹]_{i,i} is small, even a small weight is important
Optimal Brain Surgeon
• Saliency of w_i:
  S_i = w_i² / (2 [H⁻¹]_{i,i})
▫ represents the increase in the mean-squared error from deleting w_i
• OBS procedure
▫ the weight with the smallest saliency is deleted
• Optimal Brain Damage
▫ with the assumption that the Hessian matrix is diagonal
• Computation of the inverse of the Hessian
Accelerated Convergence
Heuristics
1. Each adjustable weight should have its own learning-rate parameter
2. Learning-rate parameters should be allowed to vary from one iteration to the next
3. If the sign of the derivative is the same for several iterations, the learning-rate parameter should be increased
▫ apply the momentum idea even to the learning-rate parameters
4. If the sign of the derivative alternates for several iterations, the learning-rate parameter should be decreased
Network Design and Training Issues
Design:
• Architecture of the network
• Structure of artificial neurons
• Learning rules
Training:
• Ensuring optimum training
• Learning parameters
• Data preparation
• and more ....
Network Design
Architecture of the network: How many nodes?
• Determines the number of network weights
• How many layers?
• How many nodes per layer?
  Input Layer - Hidden Layer - Output Layer
• Automated methods:
▫ augmentation (cascade correlation)
▫ weight pruning and elimination
Network Design
Architecture of the network: Connectivity?
• Concept of model or hypothesis space
• Constraining the number of hypotheses:
▫ selective connectivity
▫ shared weights
▫ recursive connections
Network Design
Structure of artificial neuron nodes
• Choice of input integration:
▫ summed; squared and summed
▫ multiplied
• Choice of activation (transfer) function:
▫ sigmoid (logistic)
▫ hyperbolic tangent
▫ Gaussian
▫ linear
▫ soft-max
Network Design
Selecting a Learning Rule
• Generalized delta rule (steepest descent)
• Momentum descent
• Advanced weight-space search techniques
• The global error function can also vary
▫ normal, quadratic, cubic
Network Training
How do you ensure that a network has been well trained?
• Objective: to achieve good generalization accuracy on new examples/cases
• Establish a maximum acceptable error rate
• Train the network using a validation test set to tune it
• Validate the trained network against a separate test set, usually referred to as a production test set
Network Training
Approach #1: Large Sample
When the amount of available data is large ...
• Divide the available examples randomly: 70% training set, 30% test set
• The training set is used to develop one ANN model
• The test set is used to compute the test error, which estimates the generalization error
Network Training
Approach #2: Cross-validation
When the amount of available data is small ...
• Repeat 10 times: randomly split the available examples into a 90% training set and a 10% test set
• The splits are used to develop 10 different ANN models; accumulate the test errors
• The generalization error is determined by the mean and standard deviation of the test errors
Network Training
How do you select between two ANN designs?
• A statistical test of hypothesis is required to ensure that a significant difference exists between the error rates of the two ANN models
• If the Large Sample method has been used, then apply McNemar's test
• If Cross-validation, then use a paired t-test for the difference of two proportions
Network Training
Mastering ANN Parameters
                Typical    Range
learning rate     0.1     0.01 - 0.99
momentum          0.8     0.1  - 0.9
weight-cost       0.1     0.001 - 0.5
• Fine tuning: adjust individual parameters at each node and/or connection weight
▫ automatic adjustment during training
Network Training
Network weight initialization
• Random initial values within +/- some range
• Smaller weight values for nodes with many incoming connections
• Rule of thumb: the initial weight range should be approximately
  ± 1/√(# weights coming into a node)
Network Training
Typical Problems During Training
[Figure: three curves of total error E vs. number of iterations.]
• Would like: a steady, rapid decline in total error
• But sometimes the error plateaus: seldom a local minimum; reduce the learning or momentum parameter
• Or the error stays high: reduce the learning parameters; may indicate the data is not learnable
An Example

Three-layer network for solving the Exclusive-OR operation

[Figure: inputs x1 and x2 enter at input neurons 1 and 2; they feed hidden-layer neurons 3 and 4 through weights w13, w14, w23 and w24; neurons 3 and 4 feed output-layer neuron 5 through weights w35 and w45, producing output y5. Neurons 3, 4 and 5 each also have a threshold input.]
An Example

The effect of the bias applied to a neuron in the hidden or output layer is represented by its weight, θ, connected to a fixed input equal to -1. The initial weights and threshold levels are set randomly as follows:
w13 = 0.5, w14 = 0.9, w23 = 0.4, w24 = 1.0, w35 = -1.2, w45 = 1.1, θ3 = 0.8, θ4 = -0.1 and θ5 = 0.3.
An Example

We consider a training set where inputs x1 and x2 are equal to 1 and the desired output yd,5 is 0. The actual outputs of neurons 3 and 4 in the hidden layer are calculated as

y3 = sigmoid(x1·w13 + x2·w23 - θ3) = 1 / (1 + e^-(1·0.5 + 1·0.4 - 1·0.8)) = 0.5250
y4 = sigmoid(x1·w14 + x2·w24 - θ4) = 1 / (1 + e^-(1·0.9 + 1·1.0 + 1·0.1)) = 0.8808

Now the actual output of neuron 5 in the output layer is determined as:

y5 = sigmoid(y3·w35 + y4·w45 - θ5) = 1 / (1 + e^-(-0.5250·1.2 + 0.8808·1.1 - 1·0.3)) = 0.5097

Thus, the following error is obtained:

e = yd,5 - y5 = 0 - 0.5097 = -0.5097
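The forward pass above can be checked numerically with a few lines of code, using the initial weights and thresholds from the example:

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

x1, x2 = 1, 1                           # training inputs
w13, w14, w23, w24 = 0.5, 0.9, 0.4, 1.0 # input -> hidden weights
w35, w45 = -1.2, 1.1                    # hidden -> output weights
th3, th4, th5 = 0.8, -0.1, 0.3          # thresholds

y3 = sigmoid(x1*w13 + x2*w23 - th3)     # ~ 0.5250
y4 = sigmoid(x1*w14 + x2*w24 - th4)     # ~ 0.8808
y5 = sigmoid(y3*w35 + y4*w45 - th5)     # ~ 0.5097
e  = 0 - y5                             # desired output is 0 -> e ~ -0.5097
```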
An Example

The next step is weight training. To update the weights and threshold levels in our network, we propagate the error, e, from the output layer backward to the input layer.
First, we calculate the error gradient for neuron 5 in the output layer:

δ5 = y5·(1 - y5)·e = 0.5097·(1 - 0.5097)·(-0.5097) = -0.1274

Then we determine the weight corrections, assuming that the learning rate parameter, α, is equal to 0.1:

Δw35 = α·y3·δ5 = 0.1·0.5250·(-0.1274) = -0.0067
Δw45 = α·y4·δ5 = 0.1·0.8808·(-0.1274) = -0.0112
Δθ5 = α·(-1)·δ5 = 0.1·(-1)·(-0.1274) = 0.0127
An Example

Next we calculate the error gradients for neurons 3 and 4 in the hidden layer:

δ3 = y3·(1 - y3)·δ5·w35 = 0.5250·(1 - 0.5250)·(-0.1274)·(-1.2) = 0.0381
δ4 = y4·(1 - y4)·δ5·w45 = 0.8808·(1 - 0.8808)·(-0.1274)·1.1 = -0.0147

We then determine the weight corrections:

Δw13 = α·x1·δ3 = 0.1·1·0.0381 = 0.0038
Δw23 = α·x2·δ3 = 0.1·1·0.0381 = 0.0038
Δθ3 = α·(-1)·δ3 = 0.1·(-1)·0.0381 = -0.0038
Δw14 = α·x1·δ4 = 0.1·1·(-0.0147) = -0.0015
Δw24 = α·x2·δ4 = 0.1·1·(-0.0147) = -0.0015
Δθ4 = α·(-1)·δ4 = 0.1·(-1)·(-0.0147) = 0.0015
An Example

At last, we update all weights and threshold levels:

w13 = w13 + Δw13 = 0.5 + 0.0038 = 0.5038
w14 = w14 + Δw14 = 0.9 - 0.0015 = 0.8985
w23 = w23 + Δw23 = 0.4 + 0.0038 = 0.4038
w24 = w24 + Δw24 = 1.0 - 0.0015 = 0.9985
w35 = w35 + Δw35 = -1.2 - 0.0067 = -1.2067
w45 = w45 + Δw45 = 1.1 - 0.0112 = 1.0888
θ3 = θ3 + Δθ3 = 0.8 - 0.0038 = 0.7962
θ4 = θ4 + Δθ4 = -0.1 + 0.0015 = -0.0985
θ5 = θ5 + Δθ5 = 0.3 + 0.0127 = 0.3127

The training process is repeated until the sum of squared errors is less than 0.001.
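The complete forward and backward pass of this worked example fits in a few lines of code; running one training step reproduces the updated weights and thresholds (0.5038, 0.8985, ..., 0.3127):

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

alpha = 0.1                       # learning rate
x1, x2, yd = 1, 1, 0              # training pattern and desired output
w13, w14, w23, w24 = 0.5, 0.9, 0.4, 1.0
w35, w45 = -1.2, 1.1
th3, th4, th5 = 0.8, -0.1, 0.3

# Forward pass
y3 = sigmoid(x1*w13 + x2*w23 - th3)
y4 = sigmoid(x1*w14 + x2*w24 - th4)
y5 = sigmoid(y3*w35 + y4*w45 - th5)
e  = yd - y5

# Error gradients (delta terms)
d5 = y5 * (1 - y5) * e
d3 = y3 * (1 - y3) * d5 * w35
d4 = y4 * (1 - y4) * d5 * w45

# Weight and threshold updates (thresholds use a fixed input of -1)
w13 += alpha * x1 * d3;  w23 += alpha * x2 * d3
w14 += alpha * x1 * d4;  w24 += alpha * x2 * d4
w35 += alpha * y3 * d5;  w45 += alpha * y4 * d5
th3 += alpha * (-1) * d3
th4 += alpha * (-1) * d4
th5 += alpha * (-1) * d5
```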
Learning curve for operation Exclusive-OR

[Figure: Sum-Squared Network Error for 224 Epochs, plotted on a logarithmic scale from 10^1 down to 10^-4 against epoch number; the error declines steadily until it falls below the 10^-3 target.]
Final results of three-layer network learning

Inputs      Desired     Actual      Error     Sum of
x1   x2     output yd   output y5   e         squared errors
1    1      0           0.0155      -0.0155   0.0010
0    1      1           0.9849       0.0151
1    0      1           0.9849       0.0151
0    0      0           0.0175      -0.0175
Software

• Neural Networks for Face Recognition
http://www.cs.cmu.edu/afs/cs.cmu.edu/user/mitchell/ftp/faces.html
• SNNS Stuttgart Neural Networks Simulator
http://www-ra.informatik.uni-tuebingen.de/SNNS
• Neural Networks at your fingertips
http://www.geocities.com/CapeCanaveral/1624/
• Neural Network Design Demonstrations
http://ee.okstate.edu/mhagan/nndesign_5.ZIP
• Bishop's network toolbox
• Matlab Neural Network toolbox
MLP for object recognition from images

• Objective
▫ Identify interesting objects from input images
 Face recognition
 Locate faces, happy/sad faces, gender, face pose, orientation
 Recognize specific faces: authorization
 Vehicle recognition (traffic control or safe-driving assistant)
 Passenger car, van, pick-up, bus, truck
 Traffic sign detection
• Challenges
▫ Image size (100x100, 10240x10240)
▫ Object size, pose and object orientation
▫ Illumination
Example: Face Detection Challenges

• pose variation
• lighting condition variation
• facial expression variation
Normal procedures

• Training (identify your problem and build a specific model)
▫ Build the training dataset
 Isolate sample images
 Images containing faces
 Extract regions containing the objects
 e.g., regions containing faces
 Normalization (size and illumination)
 200x200, etc.
 Select counter-class examples
 Non-face regions
▫ Determine the Neural Net
 The input layer is determined by the input images
 E.g., a 200x200 image requires 40,000 input dimensions, each containing a value between 0-255
 Neural net architectures
 A three-layer FF NN (two hidden layers) is common practice
 The output layer is determined by the learning problem
 Bi-class classification or multi-class classification
▫ Train the Neural Net
Normal procedures

• Test
▫ Given a test image
 Select a small region (considering all possibilities of the object location and size)
 Scanning from the top left to the bottom right
 Sampling at different scale levels
 Feed the region into the network; determine whether this region contains the object or not
 Repeat the above process
 This is a time-consuming process
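The scanning step above can be sketched as a sliding window over the image at several scales. Window size, step, scales, and the nearest-neighbour downsampling are illustrative choices, and the detector network itself is omitted; the point is how many candidate regions even a small image produces:

```python
def scan_image(image, window=20, step=4, scales=(1.0, 0.5, 0.25)):
    """Slide a fixed-size window over the image at several scales and
    collect every region that would be fed to the detector network.
    `image` is a list of rows of grayscale values."""
    regions = []
    for scale in scales:
        h = int(len(image) * scale)
        w = int(len(image[0]) * scale)
        # naive nearest-neighbour downsampling to the current scale
        scaled = [[image[int(r / scale)][int(c / scale)] for c in range(w)]
                  for r in range(h)]
        for top in range(0, h - window + 1, step):    # top-left -> bottom-right
            for left in range(0, w - window + 1, step):
                patch = [row[left:left + window]
                         for row in scaled[top:top + window]]
                regions.append((scale, top, left, patch))
    return regions

# Even a 40x40 dummy image yields dozens of candidate regions to classify,
# which is why this exhaustive scan is time-consuming
regions = scan_image([[0] * 40 for _ in range(40)])
```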
CMU Neural Nets for Face Pose Recognition

Head pose (1-of-4): 90% accuracy
Face recognition (1-of-20): 90% accuracy
Neural Net Based Face Detection
• Large training set of faces and small set of non-faces
• Training set of non-faces automatically built up:
• Set of images with no faces
• Every ‘face’ detected is added to the non-face training set.
Traffic sign detection

• Demo
▫ http://www.mathworks.com/products/demos/videoimage/traffic_sign/vipwarningsigns.html
• Intelligent traffic light control system
▫ Instead of using loop detectors (like metal detectors)
▫ Using surveillance video: detecting vehicles and bicycles
Vehicle Detection

• Intelligent vehicles aim at improving driving safety by machine vision techniques
http://www.mobileye.com/visionRange.shtml
Reading

• S. Haykin, Neural Networks: A Comprehensive Foundation, 2007 (Chapter 5).