Backpropagation: An efficient way to compute the gradient
Hung-yi Lee
Review: Notation
Layer $l-1$ has $N_{l-1}$ nodes; layer $l$ has $N_l$ nodes.

$a_i^l$: output of a neuron (neuron $i$ in layer $l$)
$a^l$: output of a layer (a vector)
$z_i^l$: input of activation function (neuron $i$ in layer $l$)
$z^l$: input of activation function for a layer (a vector)
Review: Notation
Layer $l-1$ has $N_{l-1}$ nodes; layer $l$ has $N_l$ nodes.

$w_{ij}^l$: a weight (from neuron $j$ in layer $l-1$ to neuron $i$ in layer $l$)
$b_i^l$: a bias
$b^l$: a bias for all neurons in a layer (a vector)
$W^l$: the weights between layers $l-1$ and $l$ (a matrix whose $(i,j)$ entry is $w_{ij}^l$)
Review: Relations between Layer Outputs
$z^l = W^l a^{l-1} + b^l$
$a^l = \sigma(z^l)$, where the activation function $\sigma$ is applied elementwise
Combined: $a^l = \sigma(W^l a^{l-1} + b^l)$
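The layer relation above can be sketched directly in NumPy. This is a minimal illustration, assuming a sigmoid activation (the slides leave $\sigma$ generic) and the function name `layer_forward` is illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer_forward(W, b, a_prev):
    """One layer: z^l = W^l a^{l-1} + b^l, then a^l = sigma(z^l).

    W has shape (N_l, N_{l-1}), b and z have shape (N_l,)."""
    z = W @ a_prev + b   # input of the activation function
    a = sigmoid(z)       # output of the layer
    return z, a
```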
Review: Neural Network is a function
$y = f(x; W^1, b^1, W^2, b^2, \ldots, W^L, b^L)$
vector $x \rightarrow \sigma(W^1 x + b^1) = a^1 \rightarrow \sigma(W^2 a^1 + b^2) = a^2 \rightarrow \cdots \rightarrow \sigma(W^L a^{L-1} + b^L) = y$ (vector $y$)
$W^1, b^1, W^2, b^2, \ldots, W^L, b^L$ (to be learned from training examples)
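Viewing the network as a function is just repeated application of the layer relation. A minimal sketch (sigmoid activation assumed; `network` is an illustrative name):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def network(x, weights, biases):
    """y = f(x; W^1, b^1, ..., W^L, b^L): apply a^l = sigma(W^l a^{l-1} + b^l)
    layer by layer, starting from a^0 = x."""
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)
    return a
```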
Review: Gradient Descent
• Given training examples: $(x^1, \hat y^1), (x^2, \hat y^2), \ldots, (x^R, \hat y^R)$
• Find a set of parameters $\theta^*$ minimizing the error function $C(\theta)$
  $C(\theta) = \dfrac{1}{R} \sum_r C^r(\theta)$, where $C^r(\theta) = \|f(x^r; \theta) - \hat y^r\|^2$
• We have to compute $\dfrac{\partial C^r}{\partial w_{ij}^l}$ and $\dfrac{\partial C^r}{\partial b_i^l}$
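The error function above can be written out as a short sketch. Here `forward` stands in for any network function $f(\cdot;\theta)$, and the averaging over $R$ examples follows the reconstruction above:

```python
import numpy as np

def total_cost(forward, xs, yhats):
    """C(theta) = (1/R) sum_r ||f(x^r; theta) - yhat^r||^2."""
    costs = [np.sum((forward(x) - yhat) ** 2) for x, yhat in zip(xs, yhats)]
    return sum(costs) / len(costs)
```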
Neat Representation
• $\dfrac{\partial C^r}{\partial w_{ij}^l}$ is the multiplication of two terms:
  $\dfrac{\partial C^r}{\partial w_{ij}^l} = \dfrac{\partial z_i^l}{\partial w_{ij}^l} \dfrac{\partial C^r}{\partial z_i^l}$
Intuition: $\Delta w_{ij}^l \rightarrow \Delta z_i^l \rightarrow \Delta C^r$ (a change in the weight changes $z_i^l$, which in turn changes the cost).
Neat Representation – First Term
Consider the first term, $\dfrac{\partial z_i^l}{\partial w_{ij}^l}$, in $\dfrac{\partial C^r}{\partial w_{ij}^l} = \dfrac{\partial z_i^l}{\partial w_{ij}^l} \dfrac{\partial C^r}{\partial z_i^l}$.
Neat Representation – First Term
If $l > 1$: $z_i^l = \sum_j w_{ij}^l a_j^{l-1} + b_i^l$, so
  $\dfrac{\partial z_i^l}{\partial w_{ij}^l} = a_j^{l-1}$
Neat Representation – First Term
If $l = 1$: $z_i^1 = \sum_j w_{ij}^1 x_j^r + b_i^1$, so
  $\dfrac{\partial z_i^1}{\partial w_{ij}^1} = x_j^r$
If $l > 1$ (from the previous slide): $\dfrac{\partial z_i^l}{\partial w_{ij}^l} = a_j^{l-1}$
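The first term is easy to sanity-check numerically: since $z_i^l$ is linear in $w_{ij}^l$, a finite difference recovers $a_j^{l-1}$ exactly (up to floating point). A small sketch with an illustrative function name:

```python
import numpy as np

def dz_dw_numeric(W, b, a_prev, i, j, eps=1e-6):
    """Finite-difference estimate of dz_i/dw_ij for z = W a_prev + b.

    It should equal a_prev[j], the first term from the slides."""
    Wp = W.copy()
    Wp[i, j] += eps
    z0 = W @ a_prev + b
    z1 = Wp @ a_prev + b
    return (z1[i] - z0[i]) / eps
```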
Neat Representation – Second Term
• $\dfrac{\partial C^r}{\partial w_{ij}^l}$ is always the multiplication of two terms:
  $\dfrac{\partial C^r}{\partial w_{ij}^l} = \dfrac{\partial z_i^l}{\partial w_{ij}^l} \dfrac{\partial C^r}{\partial z_i^l}$
Define the second term as $\delta_i^l \equiv \dfrac{\partial C^r}{\partial z_i^l}$.
Neat Representation – Second Term
Collect the $\delta_i^l$ into vectors $\delta^l$, one per layer: $\delta^l, \delta^{l+1}, \ldots, \delta^L$.

Two questions:
1. How to compute $\delta^L$ (at the output layer)
2. The relation of $\delta^l$ and $\delta^{l+1}$ (so every $\delta^l$ can be computed layer by layer, from layer $L$ backward)
Neat Representation – Second Term
1. How to compute $\delta^L$:
  $\delta_n^L = \dfrac{\partial C^r}{\partial z_n^L} = \dfrac{\partial y_n^r}{\partial z_n^L} \dfrac{\partial C^r}{\partial y_n^r} = \sigma'(z_n^L)\, \dfrac{\partial C^r}{\partial y_n^r}$
since $y_n^r = a_n^L = \sigma(z_n^L)$ at the output layer.
$\dfrac{\partial C^r}{\partial y_n^r}$ depends on the definition of the error function.
Neat Representation – Second Term
In vector form:
  $\delta^L = \sigma'(z^L) \odot \nabla C^r(y^r)$
where $\sigma'(z^L) = (\sigma'(z_1^L), \sigma'(z_2^L), \ldots, \sigma'(z_n^L))$, $\nabla C^r(y^r) = \left(\dfrac{\partial C^r}{\partial y_1^r}, \ldots, \dfrac{\partial C^r}{\partial y_n^r}\right)$, and $\odot$ denotes elementwise multiplication.
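The output-layer formula can be sketched for the squared-error cost $C^r = \|y - \hat y\|^2$, whose gradient in $y$ is $2(y - \hat y)$. Sigmoid activation is assumed, and `delta_output` is an illustrative name:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def delta_output(z_L, yhat):
    """delta^L = sigma'(z^L) (elementwise *) dC^r/dy for squared error."""
    y = sigmoid(z_L)
    sigma_prime = y * (1.0 - y)        # derivative of the sigmoid
    return sigma_prime * 2.0 * (y - yhat)
```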
Neat Representation – Second Term
2. The relation of $\delta^l$ and $\delta^{l+1}$: a change $\Delta z_i^l$ changes $\Delta a_i^l$, which changes every $\Delta z_k^{l+1}$ in layer $l+1$, which changes $\Delta C^r$. By the chain rule,
  $\delta_i^l = \dfrac{\partial C^r}{\partial z_i^l} = \dfrac{\partial a_i^l}{\partial z_i^l} \sum_k \dfrac{\partial z_k^{l+1}}{\partial a_i^l} \dfrac{\partial C^r}{\partial z_k^{l+1}} = \dfrac{\partial a_i^l}{\partial z_i^l} \sum_k \dfrac{\partial z_k^{l+1}}{\partial a_i^l}\, \delta_k^{l+1}$
Neat Representation – Second Term
Since $z_k^{l+1} = \sum_i w_{ki}^{l+1} a_i^l + b_k^{l+1}$,
  $\dfrac{\partial z_k^{l+1}}{\partial a_i^l} = w_{ki}^{l+1}$, and $\dfrac{\partial a_i^l}{\partial z_i^l} = \sigma'(z_i^l)$.
Therefore:
  $\delta_i^l = \sigma'(z_i^l) \sum_k w_{ki}^{l+1}\, \delta_k^{l+1}$
Neat Representation – Second Term
$\delta_i^l = \sigma'(z_i^l) \sum_k w_{ki}^{l+1}\, \delta_k^{l+1}$ can be viewed as a new type of neuron: its inputs are $\delta_1^{l+1}, \delta_2^{l+1}, \ldots, \delta_k^{l+1}$, its weights are $w_{1i}^{l+1}, w_{2i}^{l+1}, \ldots, w_{ki}^{l+1}$, and instead of an activation function it multiplies by a constant, $\sigma'(z_i^l)$; its output is $\delta_i^l$.
Neat Representation – Second Term
In matrix form:
  $\delta^l = \sigma'(z^l) \odot (W^{l+1})^T \delta^{l+1}$
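One step of the matrix-form recursion translates directly into code. A minimal sketch, again assuming sigmoid activation (so $\sigma'(z) = \sigma(z)(1-\sigma(z))$); `delta_backward` is an illustrative name:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def delta_backward(z_l, W_next, delta_next):
    """delta^l = sigma'(z^l) (elementwise *) (W^{l+1})^T delta^{l+1}."""
    a = sigmoid(z_l)
    sigma_prime = a * (1.0 - a)
    return sigma_prime * (W_next.T @ delta_next)
```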
Neat Representation – Second Term
Compare the backward recursion with the forward one:
  Backward: $\delta^l = \sigma'(z^l) \odot (W^{l+1})^T \delta^{l+1}$
  Forward: $a^{l+1} = \sigma(W^{l+1} a^l + b^{l+1})$
Both propagate a vector across the same weights; the backward pass uses the transpose $(W^{l+1})^T$.
Neat Representation – Second Term
Two questions, answered:
1. How to compute $\delta^L$: $\delta^L = \sigma'(z^L) \odot \nabla C^r(y^r)$
2. The relation of $\delta^l$ and $\delta^{l+1}$: $\delta^l = \sigma'(z^l) \odot (W^{l+1})^T \delta^{l+1}$
Starting from $\nabla C^r(y^r) = \left(\dfrac{\partial C^r}{\partial y_1^r}, \ldots, \dfrac{\partial C^r}{\partial y_n^r}\right)$ at the output layer, the $\delta$'s propagate backward through $(W^L)^T, (W^{L-1})^T, \ldots, (W^{l+1})^T$.
Backpropagation
$\dfrac{\partial C^r}{\partial w_{ij}^l} = a_j^{l-1}\, \delta_i^l$ (with $a^0 = x^r$)

Forward pass:
  $z^1 = W^1 x^r + b^1$
  $a^l = \sigma(z^l)$
  $z^{l+1} = W^{l+1} a^l + b^{l+1}$

Backward pass:
  $\delta^L = \sigma'(z^L) \odot \nabla C^r(y^r)$
  $\delta^l = \sigma'(z^l) \odot (W^{l+1})^T \delta^{l+1}$
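The two passes above can be sketched end to end. This is a minimal NumPy implementation, assuming the sigmoid activation and squared-error cost $C^r = \|a^L - \hat y^r\|^2$ used earlier; the function name `backprop` and the list-based parameter layout are illustrative choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(x, yhat, weights, biases):
    """Forward pass stores every a^l and z^l; backward pass computes every
    delta^l; the gradients are dC/dw^l_ij = a^{l-1}_j delta^l_i and
    dC/db^l_i = delta^l_i."""
    # forward pass
    a, activations = x, [x]
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)
        activations.append(a)
    # delta^L = sigma'(z^L) * 2 (y - yhat), using sigma'(z) = a (1 - a)
    sp = activations[-1] * (1.0 - activations[-1])
    delta = sp * 2.0 * (activations[-1] - yhat)
    grads_W, grads_b = [], []
    # backward pass, layer by layer from L down to 1
    for l in range(len(weights) - 1, -1, -1):
        grads_W.append(np.outer(delta, activations[l]))   # a^{l-1}_j delta^l_i
        grads_b.append(delta)
        if l > 0:
            sp = activations[l] * (1.0 - activations[l])
            delta = sp * (weights[l].T @ delta)           # (W^{l+1})^T delta^{l+1}
    return grads_W[::-1], grads_b[::-1]
```

A finite-difference check against the cost confirms the analytic gradients match the numerical ones.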
Appendix
A reverse network (formed by the new type of neurons): it has the same layout as the original network but runs in the opposite direction. Its input is $\nabla C^r(y^r)$ at layer $L$, its weights are the transposes $(W^{l+1})^T, (W^{l+2})^T, \ldots$, and its outputs are the $\delta^l$:
1. $\delta^L = \sigma'(z^L) \odot \nabla C^r(y^r)$
2. $\delta^l = \sigma'(z^l) \odot (W^{l+1})^T \delta^{l+1}$
Review: Gradient descent
Start at parameter $\theta^0$
Compute gradient at $\theta^0$: $g^0$; move to $\theta^1 = \theta^0 - \mu g^0$
Compute gradient at $\theta^1$: $g^1$; move to $\theta^2 = \theta^1 - \mu g^1$
……
Each step is a movement against the gradient $g^t$, scaled by the learning rate $\mu$.
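The iteration above can be sketched on a toy one-dimensional cost $C(\theta) = \theta^2$, whose gradient is $2\theta$; the function name and the choice of cost are illustrative:

```python
def gradient_descent(theta0, mu, steps):
    """theta^{t+1} = theta^t - mu * g^t, with g^t = 2 theta^t for C = theta^2."""
    theta = theta0
    for _ in range(steps):
        g = 2.0 * theta        # gradient of C at the current theta
        theta = theta - mu * g # move against the gradient
    return theta
```

With a suitable learning rate the iterates shrink toward the minimizer $\theta^* = 0$.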
Neat Representation – First Term
The first terms for every layer are just the layer outputs computed in a forward pass on input $x^r$:
  $\sigma(W^1 x^r + b^1) = a^1$, …, $\sigma(W^l a^{l-1} + b^l) = a^l$, …
so $\dfrac{\partial z_i^l}{\partial w_{ij}^l} = a_j^{l-1}$ (and $x_j^r$ for $l = 1$).
Neat Representation – Second Term
At the output layer (layer $L$):
  $\delta_n^L = \dfrac{\partial C^r}{\partial z_n^L} = \dfrac{\partial y_n^r}{\partial z_n^L} \dfrac{\partial C^r}{\partial y_n^r} = \sigma'(z_n^L)\, \dfrac{\partial C^r}{\partial y_n^r}$
The inputs of the reverse network are $\dfrac{\partial C^r}{\partial y_1^r}, \dfrac{\partial C^r}{\partial y_2^r}, \ldots, \dfrac{\partial C^r}{\partial y_n^r}$.
Neat Representation – Second Term
In vector form:
  $\delta^L = \sigma'(z^L) \odot \nabla C^r(y^r)$
with $\sigma'(z^L) = (\sigma'(z_1^L), \ldots, \sigma'(z_n^L))$ and $\nabla C^r(y^r) = \left(\dfrac{\partial C^r}{\partial y_1^r}, \ldots, \dfrac{\partial C^r}{\partial y_n^r}\right)$.
Reference
• https://theclevermachine.wordpress.com/