Multilayer Perceptrons
Lecture 11
Chain Rule
- Let's say we have two functions f(x) and g(x).
- What is the derivative of f(g(x))?

f(x) = x^5, \quad g(x) = x^2 + 1, \quad f(g(x)) = (x^2 + 1)^5

\frac{df(x)}{dx} = 5x^4, \quad \frac{df(g)}{dg} = 5g^4, \quad \frac{dg(x)}{dx} = 2x

\frac{df(g(x))}{dx} = \frac{df(g)}{dg} \cdot \frac{dg(x)}{dx} = 5(x^2 + 1)^4 \cdot 2x
Chain Rule

y = e^{\sin x^2}

Decompose with v = h(x) = x^2, \; u = g(v) = \sin v, \; y = f(u) = e^u:

\frac{dy}{du} = e^u = e^{\sin x^2}, \quad \frac{du}{dv} = \cos v = \cos x^2, \quad \frac{dv}{dx} = 2x

\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dv} \cdot \frac{dv}{dx} = e^{\sin x^2} \cdot \cos x^2 \cdot 2x
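The chain-rule result can be sanity-checked numerically. A minimal sketch in Python (standard library only); the test point x = 0.7 is an arbitrary choice, not from the slides:

```python
import math

# y = exp(sin(x^2)) decomposed as v = x^2, u = sin(v), y = exp(u)
def y(x):
    return math.exp(math.sin(x ** 2))

# chain rule: dy/dx = exp(sin(x^2)) * cos(x^2) * 2x
def dy_dx(x):
    return math.exp(math.sin(x ** 2)) * math.cos(x ** 2) * 2 * x

# compare against a central finite difference at an arbitrary point
x, h = 0.7, 1e-6
numeric = (y(x + h) - y(x - h)) / (2 * h)
print(dy_dx(x), numeric)  # the two values agree to several decimal places
```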
Multilayer Perceptron
- Graph of a multilayer perceptron with two hidden layers
[Figure: input layer, first hidden layer, second hidden layer, output layer]
Signal Flow Graph (Output Neuron j)
[Figure: inputs y_0 = +1 (bias term) through y_M(n) enter neuron j with weights w_{j0}(n), ..., w_{ji}(n); their weighted sum v_j(n) passes through the activation \varphi_j(v_j(n)) to give y_j(n); subtracting y_j(n) from the desired output d_j(n) gives the error e_j(n) at node j]

v_j(n) = \sum_{i=0}^{M} w_{ji}(n) y_i(n)

y_j(n) = \varphi_j(v_j(n))

e_j(n) = d_j(n) - y_j(n)
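These three equations describe everything a single neuron does. A small sketch in Python; tanh and the specific numbers are assumptions for illustration (the slides leave \varphi_j generic):

```python
import math

def neuron(weights, inputs, phi=math.tanh):
    """v_j = sum_i w_ji * y_i, then y_j = phi(v_j).
    inputs[0] must be the constant +1 so weights[0] acts as the bias w_j0."""
    v = sum(w * y for w, y in zip(weights, inputs))
    return phi(v)

inputs = [1.0, 0.5, -0.2]      # [y_0 = +1 bias, y_1, y_2]
weights = [0.1, 0.4, -0.3]     # [w_j0, w_j1, w_j2]
y_j = neuron(weights, inputs)
e_j = 1.0 - y_j                # error e_j = d_j - y_j with desired d_j = 1
print(y_j, e_j)
```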
Goal: Minimize Total Error at the Output
[Figure: same signal-flow graph as above]

E(n) = \frac{1}{2} \sum_{j \in O/P} e_j^2(n)

with v_j(n) = \sum_{i=0}^{M} w_{ji}(n) y_i(n), \; y_j(n) = \varphi_j(v_j(n)), \; e_j(n) = d_j(n) - y_j(n).
Update Weights

w_{ji}(n+1) = w_{ji}(n) - \eta \frac{\partial E(n)}{\partial w_{ji}(n)}

Take a small step against the direction of the gradient to minimize the error.
Update for Weights

w_{ji}(n+1) = w_{ji}(n) - \eta \frac{\partial E(n)}{\partial w_{ji}(n)}

where

E(n) = \frac{1}{2} \sum_{j \in O/P} e_j^2(n), \quad e_j(n) = d_j(n) - y_j(n), \quad y_j(n) = \varphi_j(v_j(n)), \quad v_j(n) = \sum_{i=0}^{M} w_{ji}(n) y_i(n)

- Applying the chain rule we get:

\frac{\partial E(n)}{\partial w_{ji}(n)} = \frac{\partial E(n)}{\partial e_j(n)} \cdot \frac{\partial e_j(n)}{\partial y_j(n)} \cdot \frac{\partial y_j(n)}{\partial v_j(n)} \cdot \frac{\partial v_j(n)}{\partial w_{ji}(n)}

Evaluating each factor:

\frac{\partial E(n)}{\partial e_j(n)} = \frac{\partial}{\partial e_j(n)} \left( \frac{1}{2} \sum_{j \in O/P} e_j^2(n) \right) = e_j(n)

\frac{\partial e_j(n)}{\partial y_j(n)} = \frac{\partial}{\partial y_j(n)} \left( d_j(n) - y_j(n) \right) = -1

\frac{\partial y_j(n)}{\partial v_j(n)} = \frac{\partial}{\partial v_j(n)} \varphi_j(v_j(n)) = \varphi_j'(v_j(n))

\frac{\partial v_j(n)}{\partial w_{ji}(n)} = \frac{\partial}{\partial w_{ji}(n)} \left( \sum_{i=0}^{M} w_{ji}(n) y_i(n) \right) = y_i(n)

Combining the factors:

\frac{\partial E(n)}{\partial w_{ji}(n)} = e_j(n) \cdot (-1) \cdot \varphi_j'(v_j(n)) \cdot y_i(n)
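The combined gradient can be checked against a finite difference. A sketch for one output neuron; tanh and the specific numbers are assumptions, not from the slides:

```python
import math

phi = math.tanh
def dphi(v):
    return 1.0 - math.tanh(v) ** 2   # derivative of tanh

y_in = [1.0, 0.5, -0.2]   # inputs y_i, with y_0 = +1 bias
w = [0.1, 0.4, -0.3]      # weights w_ji
d = 1.0                   # desired output d_j

def E(weights):
    """E = 1/2 e_j^2 for a single output neuron."""
    v = sum(wi * yi for wi, yi in zip(weights, y_in))
    e = d - phi(v)
    return 0.5 * e * e

# analytic gradient: dE/dw_ji = e_j * (-1) * phi'(v_j) * y_i
v = sum(wi * yi for wi, yi in zip(w, y_in))
e = d - phi(v)
grad_w1 = e * (-1.0) * dphi(v) * y_in[1]

# numeric gradient with respect to w_j1
h = 1e-6
numeric = (E([w[0], w[1] + h, w[2]]) - E([w[0], w[1] - h, w[2]])) / (2 * h)
print(grad_w1, numeric)   # the two agree closely
```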
Update for Weights
- Applying the chain rule we get:
- Delta rule for updating weights:

w_{ji}(n+1) = w_{ji}(n) - \eta \frac{\partial E(n)}{\partial w_{ji}(n)}

\Rightarrow \quad w_{ji}(n+1) = w_{ji}(n) + \underbrace{\eta}_{\text{learning rate}} \cdot \underbrace{e_j(n)\,\varphi_j'(v_j(n))}_{\text{local gradient}} \cdot \underbrace{y_i(n)}_{\text{input}}
Update for Weights
- Delta rule for updating weights
- For output neurons this rule can be directly applied:

w_{ji}(n+1) = w_{ji}(n) + \underbrace{\eta}_{\text{learning rate}} \cdot \underbrace{\delta_j(n)}_{\text{local gradient}} \cdot \underbrace{y_i(n)}_{\text{input}}

\delta_j(n) = -\frac{\partial E(n)}{\partial v_j(n)} = e_j(n)\,\varphi_j'(v_j(n))
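Applied repeatedly, the delta rule drives the output error down. A sketch for a single output neuron; tanh, the learning rate, and the data are assumptions for illustration:

```python
import math

phi = math.tanh
dphi = lambda v: 1.0 - math.tanh(v) ** 2
eta = 0.1                        # learning rate
y_in = [1.0, 0.5, -0.2]          # y_0 = +1 bias plus two inputs
w = [0.1, 0.4, -0.3]
d = 1.0                          # desired output

errors = []
for n in range(20):
    v = sum(wi * yi for wi, yi in zip(w, y_in))
    e = d - phi(v)
    errors.append(e)
    delta = e * dphi(v)          # local gradient delta_j
    # w_ji(n+1) = w_ji(n) + eta * delta_j * y_i
    w = [wi + eta * delta * yi for wi, yi in zip(w, y_in)]

print(f"{errors[0]:.4f} -> {errors[-1]:.4f}")  # the error shrinks step by step
```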
How to update weights for hidden layers?
- Delta rule for updating weights:

w_{ji}(n+1) = w_{ji}(n) + \underbrace{\eta}_{\text{learning rate}} \cdot \underbrace{\delta_j(n)}_{\text{local gradient}} \cdot \underbrace{y_i(n)}_{\text{input}}

- Credit-assignment problem:
  - Even though the hidden neurons are not directly accessible, they share responsibility for the error.
  - How do we penalize or reward hidden neurons?
Signal Flow at Hidden Node h
[Figure: inputs y_0 = +1 through y_{M_2}(n) enter hidden neuron h with weights w_{h0}(n), ..., w_{hi}(n), giving v_h(n) and output y_h(n) = \varphi_h(v_h(n)); y_h(n) then feeds output neuron j with weight w_{jh}(n), whose output y_j(n) is compared with the desired value d_j(n) to give the error e_j(n)]
How to update weights for hidden layers?
- Local gradient of hidden neuron h:

\delta_h(n) = -\frac{\partial E(n)}{\partial v_h(n)} = -\frac{\partial E(n)}{\partial y_h(n)} \cdot \frac{\partial y_h(n)}{\partial v_h(n)}

\delta_h(n) = -\frac{\partial E(n)}{\partial y_h(n)} \cdot \frac{\partial}{\partial v_h(n)} \varphi_h(v_h(n)) = -\frac{\partial E(n)}{\partial y_h(n)} \cdot \varphi_h'(v_h(n))
How to update weights for hidden layers?
- We know

E(n) = \frac{1}{2} \sum_{j \in O/P} e_j^2(n) \quad \Rightarrow \quad \frac{\partial E(n)}{\partial y_h(n)} = \sum_{j \in O/P} e_j(n) \frac{\partial e_j(n)}{\partial y_h(n)}

- Again applying the chain rule:

\frac{\partial E(n)}{\partial y_h(n)} = \sum_{j \in O/P} e_j(n) \frac{\partial e_j(n)}{\partial v_j(n)} \cdot \frac{\partial v_j(n)}{\partial y_h(n)}
How to update weights for hidden layers?
- Evaluating the two factors:

\frac{\partial e_j(n)}{\partial v_j(n)} = \frac{\partial}{\partial v_j(n)} \left( d_j(n) - y_j(n) \right) = \frac{\partial}{\partial v_j(n)} \left( d_j(n) - \varphi_j(v_j(n)) \right) = -\varphi_j'(v_j(n))

\frac{\partial v_j(n)}{\partial y_h(n)} = \frac{\partial}{\partial y_h(n)} \left( \sum_{h=0}^{M} w_{jh}(n) y_h(n) \right) = w_{jh}(n)
How to update weights for hidden layers?
- Local gradient of hidden neuron h:

\delta_h(n) = \varphi_h'(v_h(n)) \sum_{j \in O/P} \underbrace{e_j(n)\,\varphi_j'(v_j(n))}_{\text{local gradient } \delta_j(n)} w_{jh}(n)

\delta_h(n) = \varphi_h'(v_h(n)) \sum_{j \in O/P} \delta_j(n)\, w_{jh}(n)
How to update weights for hidden layers?

\delta_h(n) = \varphi_h'(v_h(n)) \sum_{j \in O/P} \delta_j(n)\, w_{jh}(n)

w_{hi}(n+1) = w_{hi}(n) + \underbrace{\eta}_{\text{learning rate}} \cdot \underbrace{\delta_h(n)}_{\text{local gradient}} \cdot \underbrace{y_i(n)}_{\text{input}}
Back-Propagation of Errors
[Figure: output errors e_1(n), ..., e_M(n) are multiplied by \varphi_1'(v_1(n)), ..., \varphi_M'(v_M(n)) to give \delta_1(n), ..., \delta_M(n), which flow backward through the weights w_{jh}(n) to produce \delta_h(n)]

\delta_h(n) = \varphi_h'(v_h(n)) \sum_{j \in O/P} \delta_j(n)\, w_{jh}(n)

Intuition: weight the error at each output node by the strength of the hidden node's connection to that output node, and assign the resulting sum as the error caused by the hidden node.
Back Propagation Algorithm
- Output node:

w_{ji}(n+1) = w_{ji}(n) + \underbrace{\eta}_{\text{learning rate}} \cdot \underbrace{\delta_j(n)}_{\text{local gradient}} \cdot \underbrace{y_i(n)}_{\text{input}}, \quad \delta_j(n) = e_j(n)\,\varphi_j'(v_j(n))

- Hidden node:

w_{hi}(n+1) = w_{hi}(n) + \underbrace{\eta}_{\text{learning rate}} \cdot \underbrace{\delta_h(n)}_{\text{local gradient}} \cdot \underbrace{y_i(n)}_{\text{input}}, \quad \delta_h(n) = \varphi_h'(v_h(n)) \sum_{j \in O/P} \delta_j(n)\, w_{jh}(n)
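Putting both rules together gives the full algorithm. A compact sketch for one hidden layer; tanh activations, the XOR data, the layer sizes, and the learning rate are all assumptions for illustration, not prescriptions from the slides:

```python
import math
import random

random.seed(0)
eta, n_in, n_hid, n_out = 0.1, 2, 3, 1
phi = math.tanh
dphi = lambda v: 1.0 - math.tanh(v) ** 2

# row j holds the weights of neuron j; index 0 is the +1 bias weight
U = [[random.uniform(-1, 1) for _ in range(n_in + 1)] for _ in range(n_hid)]
W = [[random.uniform(-1, 1) for _ in range(n_hid + 1)] for _ in range(n_out)]

def train_step(x, d):
    # forward pass
    xs = [1.0] + x
    vh = [sum(u * xi for u, xi in zip(row, xs)) for row in U]
    ys = [1.0] + [phi(v) for v in vh]
    vo = [sum(w * yi for w, yi in zip(row, ys)) for row in W]
    o = [phi(v) for v in vo]
    # backward pass: delta_j = e_j * phi'(v_j) at the output...
    d_out = [(dj - oj) * dphi(v) for dj, oj, v in zip(d, o, vo)]
    # ...and delta_h = phi'(v_h) * sum_j delta_j * w_jh at the hidden layer
    d_hid = [dphi(vh[h]) * sum(d_out[j] * W[j][h + 1] for j in range(n_out))
             for h in range(n_hid)]
    # weight updates: w <- w + eta * delta * input
    for j in range(n_out):
        W[j] = [w + eta * d_out[j] * yi for w, yi in zip(W[j], ys)]
    for h in range(n_hid):
        U[h] = [u + eta * d_hid[h] * xi for u, xi in zip(U[h], xs)]
    return sum(0.5 * (dj - oj) ** 2 for dj, oj in zip(d, o))

# XOR with targets in {-1, +1}
data = [([0.0, 0.0], [-1.0]), ([0.0, 1.0], [1.0]),
        ([1.0, 0.0], [1.0]), ([1.0, 1.0], [-1.0])]
epoch_err = [sum(train_step(x, t) for x, t in data) for _ in range(2000)]
print(f"E: {epoch_err[0]:.3f} -> {epoch_err[-1]:.3f}")
```

Whether training fully solves XOR depends on the random initialization; the point here is only that the two update rules, applied together, reduce the total error from its starting value.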
An Example
- Let's assume a simple MLP with one hidden layer.
[Figure: inputs x_1, x_2 plus bias x_0 = 1 feed a hidden layer with bias y_0 = 1, which feeds the outputs o_1, o_2]
An Example
- Begin with a random assignment of weights; among them:

u_{11} = -1, \quad u_{22} = 1, \quad w_{11} = 1, \quad w_{20} = 1

(u denotes the input-to-hidden weights and w the hidden-to-output weights; only these four are labeled in the figure.)
An Example
- Let the input be x = [0, 1] and the desired output be d = [1, 0]; \eta = 0.1.
An Example
- Forward pass, hidden layer:

v_1 = u_{10} x_0 + u_{11} x_1 + u_{12} x_2 = 1
v_2 = u_{20} x_0 + u_{21} x_1 + u_{22} x_2 = 2
An Example
- Forward pass: let's assume the identity activation function \varphi_j(x) = x, so \varphi'(x) = 1.

y_1 = \varphi(v_1) = 1
y_2 = \varphi(v_2) = 2
An Example
- Forward pass, output layer:

ov_1 = w_{10} y_0 + w_{11} y_1 + w_{12} y_2 = 2
ov_2 = w_{20} y_0 + w_{21} y_1 + w_{22} y_2 = 2
An Example
- Forward pass, output layer:

o_1 = \varphi(ov_1) = 2
o_2 = \varphi(ov_2) = 2

Desired output: d_1 = 1, d_2 = 0.
45!
An Example!g Forward pass: Output layer"
o1=2
o2=2
u11= -1
v22= 1
w11= 1
x0=1 y0=1
w20= 1
y1=1
y2=2
x1=0
x2=1
e1=d1-o1
Desired o/p d1=1 d2=0
e2=d2-o2
46!
An Example!g Backward pass: Output layer"
o1=2
o2=2
u11= -1
v22= 1
w11= 1
x0=1 y0=1
w20= 1
y1=1
y2=2
x1=0
x2=1
e1=-1
Desired o/p d1=1 d2=0
e2=-2
!
w ji(n +1) = w ji(n) +"e j (n)yi(n) [output]
u ji(n +1) = u ji(n) +"xi(n) e j (n)# w jh (n) [hidden]j$O /P%
47!
An Example!g Backward pass: "
o1=2
o2=2
u11= -1
v22= 1
w11= 1
x0=1 y0=1
w20= 1
y1=1
y2=2
x1=0
x2=1
e1=-1
e2=-2
!
" j (n) = e j (n)# $ j%(v j (n))
"h (n) =$h%(vh (n)) " j (n)# w jh (n)
j&O /P'
!
w ji(n +1) = w ji(n) +"e j (n)yi(n) [output]
u ji(n +1) = u ji(n) +"xi(n) e j (n)# w jh (n) [hidden]j$O /P%
48!
An Example!g Backward pass: "
o1=2
o2=2
u11= -1
v22= 1
w11= 1
x0=1 y0=1
w20= 1
y1=1
y2=2
x1=0
x2=1
δ1=-1
δ2=-2
!
" j (n) = e j (n)# $ j%(v j (n))
"h (n) =$h%(vh (n)) " j (n)# w jh (n)
j&O /P'
!
w ji(n +1) = w ji(n) +"e j (n)yi(n) [output]
u ji(n +1) = u ji(n) +"xi(n) e j (n)# w jh (n) [hidden]j$O /P%
49!
An Example!g Backward pass: "
o1=2
o2=2
u11= -1
v22= 1
w11= 1
x0=1 y0=1
w20= 1
y1=1
y2=2
x1=0
x2=1
δ1=-1
δ2=-2
!
" j (n) = e j (n)# $ j%(v j (n))
"h (n) =$h%(vh (n)) " j (n)# w jh (n)
j&O /P'
δ1h=?
δ2h=?
!
w ji(n +1) = w ji(n) +"e j (n)yi(n) [output]
u ji(n +1) = u ji(n) +"xi(n) e j (n)# w jh (n) [hidden]j$O /P%
50!
An Example!g Backward pass: "
o1=2
o2=2
u11= -1
v22= 1
δ1w11
x0=1 y0=1
δ2w20
y1=1
y2=2
x1=0
x2=1
δ1=-1
δ2=-2
!
" j (n) = e j (n)# $ j%(v j (n))
"h (n) =$h%(vh (n)) " j (n)# w jh (n)
j&O /P'
δ1h=?
δ2h=?
!
w ji(n +1) = w ji(n) +"e j (n)yi(n) [output]
u ji(n +1) = u ji(n) +"xi(n) e j (n)# w jh (n) [hidden]j$O /P%
51!
An Example!g Backward pass: "
o1=2
o2=2
u11= -1
v22= 1
δ1w11
x0=1 y0=1
δ2w20
y1=1
y2=2
x1=0
x2=1
δ1=-1
δ2=-2
!
" j (n) = e j (n)# $ j%(v j (n))
"h (n) =$h%(vh (n)) " j (n)# w jh (n)
j&O /P'
δ1h=1
δ2h=-2
!
w ji(n +1) = w ji(n) +"e j (n)yi(n) [output]
u ji(n +1) = u ji(n) +"xi(n) e j (n)# w jh (n) [hidden]j$O /P%
52!
An Example
- Backward pass: weight updates (\eta = 0.1), for example:

w_{11}(n+1) = w_{11} + \eta\, \delta_1\, y_1 = 1 + 0.1 \cdot (-1) \cdot 1 = 0.9
w_{20}(n+1) = w_{20} + \eta\, \delta_2\, y_0 = 1 + 0.1 \cdot (-2) \cdot 1 = 0.8
u_{22}(n+1) = u_{22} + \eta\, \delta_{2h}\, x_2 = 1 + 0.1 \cdot (-2) \cdot 1 = 0.8

The remaining weights are updated the same way.
An Example
- Forward pass again, with the updated weights:

y_1 = 1.2, \quad y_2 = 1.6
o_1 = 1.66, \quad o_2 = 0.32

Desired output: d_1 = 1, d_2 = 0. Notice that the error has been reduced.
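The whole worked example can be replayed in a few lines. The slides print only four of the twelve weights; the remaining values below are assumptions chosen to be consistent with every number the slides do show (v, ov, the deltas, and the second-pass outputs):

```python
eta = 0.1
x = [1.0, 0.0, 1.0]            # [x0 = 1 bias, x1 = 0, x2 = 1]
d = [1.0, 0.0]                 # desired output

# hidden weights u[j] = [u_j0, u_j1, u_j2]; the slides give u11 = -1, u22 = 1
u = [[1.0, -1.0, 0.0],
     [1.0,  0.0, 1.0]]
# output weights w[j] = [w_j0, w_j1, w_j2]; the slides give w11 = 1, w20 = 1
w = [[1.0,  1.0, 0.0],
     [1.0, -1.0, 1.0]]

def forward(u, w, x):
    # identity activation: y = v and o = ov
    v = [sum(ui * xi for ui, xi in zip(row, x)) for row in u]
    y = [1.0] + v              # prepend the y0 = 1 bias
    o = [sum(wi * yi for wi, yi in zip(row, y)) for row in w]
    return v, y, o

v, y, o = forward(u, w, x)     # v = [1, 2], o = [2, 2] as on the slides

# backward pass (phi' = 1 for the identity activation)
delta = [dj - oj for dj, oj in zip(d, o)]                       # [-1, -2]
delta_h = [sum(delta[j] * w[j][h + 1] for j in range(2))
           for h in range(2)]                                   # [1, -2]

w = [[wi + eta * delta[j] * yi for wi, yi in zip(w[j], y)] for j in range(2)]
u = [[ui + eta * delta_h[h] * xi for ui, xi in zip(u[h], x)] for h in range(2)]

v, y, o = forward(u, w, x)
print(o)                       # close to [1.66, 0.32]: the error has shrunk
```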
What does each layer do?
- The 1st layer draws linear boundaries.
- The 2nd layer combines those boundaries.
- The 3rd layer can generate arbitrarily complex boundaries.