Backpropagation: Understanding How to Update ANN Weights Step-by-Step
Ahmed Fawzy Gad, [email protected]
Menoufia University, Faculty of Computers and Information, Information Technology Department
• The backpropagation algorithm is used to update the NN weights when they are not able to make correct predictions. Hence, we should train the NN before applying backpropagation.
[Diagram: training with the initial weights produces a prediction; backpropagation then updates the weights.]
Neural Network Training Example
Training Data: X1 = 0.1, X2 = 0.3, desired output = 0.03
Initial Weights: W1 = 0.5, W2 = 0.2, b = 1.83
[Network diagram: inputs X1 = 0.1 and X2 = 0.3 enter a single neuron (In/Out) with weights W1 = 0.5 and W2 = 0.2; a bias input +1 carries weight b = 1.83.]
Network Training
• Steps to train our network:
1. Prepare the activation function input (the sum of products between inputs and weights).
2. Compute the activation function output.
Network Training: Sum of Products
• After calculating the sum of products (SOP) between inputs and weights, the next step is to use this SOP as the input to the activation function.
$s = X_1 W_1 + X_2 W_2 + b$
$s = 0.1 \times 0.5 + 0.3 \times 0.2 + 1.83$
$s = 1.94$
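As a minimal sketch in Python (the variable names are mine, not from the slides), this forward step is:

```python
# Forward pass, step 1: the sum of products (SOP) between inputs and weights.
X1, X2 = 0.1, 0.3           # training sample inputs
W1, W2, b = 0.5, 0.2, 1.83  # initial weights and bias

s = X1 * W1 + X2 * W2 + b
print(round(s, 2))  # 1.94
```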
Network Training: Activation Function
• In this example, the sigmoid activation function is used.
• Based on the SOP calculated previously, the output is as follows:
$f(s) = \frac{1}{1 + e^{-s}}$
$f(s) = \frac{1}{1 + e^{-1.94}} = \frac{1}{1 + 0.144} = \frac{1}{1.144} = 0.874$
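The same computation as a small Python sketch:

```python
import math

def sigmoid(s):
    # Sigmoid activation: f(s) = 1 / (1 + e^(-s))
    return 1.0 / (1.0 + math.exp(-s))

predicted = sigmoid(1.94)
print(round(predicted, 3))  # 0.874
```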
Network Training: Prediction Error
• After getting the predicted output, the next step is to measure the prediction error of the network.
• We can use the squared error function, defined as follows:
• Based on the predicted output, the prediction error is:
$E = \frac{1}{2}(desired - predicted)^2$
$E = \frac{1}{2}(0.03 - 0.874)^2 = \frac{1}{2}(-0.844)^2 = \frac{1}{2}(0.712) = 0.356$
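And the error computation in the same sketch style:

```python
def squared_error(desired, predicted):
    # Squared error: E = (1/2) * (desired - predicted)^2
    return 0.5 * (desired - predicted) ** 2

print(round(squared_error(0.03, 0.874), 3))  # 0.356
```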
How to Minimize Prediction Error?
• There is a prediction error, and it should be minimized until it reaches an acceptable level.
What should we do in order to minimize the error?
• There must be something to change in order to minimize the error. In our example, the only parameters to change are the weights.
How do we update the weights?
• We can use the weights update equation:
$W_{new} = W_{old} + \eta (d - Y) X$
Weights Update Equation
• We can use the weights update equation:
$W_{new}$: the new (updated) weights.
$W_{old}$: the current weights: [1.83, 0.5, 0.2]
$\eta$: the network learning rate: 0.01
$d$: the desired output: 0.03
$Y$: the predicted output: 0.874
$X$: the current inputs at which the network made a false prediction: [+1, 0.1, 0.3]
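A sketch of one application of this update rule with the values above (the vector layout [b, W1, W2] with a leading +1 input mirrors the slide):

```python
W_old = [1.83, 0.5, 0.2]  # [b, W1, W2]
X = [1.0, 0.1, 0.3]       # [+1, X1, X2]
eta = 0.01                # learning rate
d, Y = 0.03, 0.874        # desired and predicted outputs

# W_new = W_old + eta * (d - Y) * X, applied element-wise.
W_new = [w + eta * (d - Y) * x for w, x in zip(W_old, X)]
print(W_new)  # approximately [1.82156, 0.499156, 0.197468]
```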
• The backpropagation algorithm is used to answer these questions and to understand the effect of each weight on the prediction error.
Forward Vs. Backward Passes
• When training a neural network, there are two passes: forward and backward.
• The goal of the backward pass is to know how each weight affects the total error. In other words, how does changing the weights change the prediction error?
Backward Pass
• Let us work with a simpler example:
$Y = X^2 Z + H$
• How do we answer this question: what is the effect on the output Y of a change in the variable X?
• This question is answered using derivatives. The derivative of Y with respect to X ($\frac{\partial Y}{\partial X}$) tells us the effect of changing the variable X on the output Y.
Calculating Derivatives
• The derivative $\frac{\partial Y}{\partial X}$ can be calculated as follows:
$\frac{\partial Y}{\partial X} = \frac{\partial}{\partial X}(X^2 Z + H)$
• Based on these two derivative rules:
Square rule: $\frac{\partial}{\partial X} X^2 = 2X$
Constant rule: $\frac{\partial}{\partial X} C = 0$
• The result will be:
$\frac{\partial Y}{\partial X} = 2XZ + 0 = 2XZ$
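A quick check of this derivative with SymPy (my choice of tool; the slides do no such verification):

```python
import sympy as sp

X, Z, H = sp.symbols('X Z H')
Y = X**2 * Z + H

# The constant H drops out; the square rule gives 2*X*Z.
print(sp.diff(Y, X))  # 2*X*Z
```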
Prediction Error – Weight Derivative
• By analogy with $\frac{\partial Y}{\partial X}$ (the change in Y with respect to X), we need $\frac{\partial E}{\partial W}$ (the change in the error E with respect to each weight W), where
$E = \frac{1}{2}(desired - predicted)^2$
• Substituting into the error function step by step:
$E = \frac{1}{2}(desired - predicted)^2$
$desired = 0.03$ (constant)
$Predicted = f(s) = \frac{1}{1 + e^{-s}}$
$E = \frac{1}{2}\left(desired - \frac{1}{1 + e^{-s}}\right)^2$
$s = X_1 W_1 + X_2 W_2 + b$
$E = \frac{1}{2}\left(desired - \frac{1}{1 + e^{-(X_1 W_1 + X_2 W_2 + b)}}\right)^2$
Multivariate Chain Rule
• The chain from the weights to the prediction error: the weights ($W_1, W_2$) feed the SOP $s = X_1 W_1 + X_2 W_2 + b$, the SOP feeds the predicted output $f(s) = \frac{1}{1 + e^{-s}}$, and the predicted output feeds the prediction error $E = \frac{1}{2}(desired - predicted)^2$.
$\frac{\partial E}{\partial W} = \frac{\partial}{\partial W}\left(\frac{1}{2}\left(desired - \frac{1}{1 + e^{-(X_1 W_1 + X_2 W_2 + b)}}\right)^2\right)$
• By the multivariate chain rule, this derivative decomposes along the chain into three simpler partial derivatives: $\frac{\partial E}{\partial Predicted}$, $\frac{\partial Predicted}{\partial s}$, and $\frac{\partial s}{\partial W_1}$ (or $\frac{\partial s}{\partial W_2}$).
Let’s calculate these individual partial derivatives.
$\frac{\partial E}{\partial W_1} = \frac{\partial E}{\partial Predicted} \times \frac{\partial Predicted}{\partial s} \times \frac{\partial s}{\partial W_1}$
$\frac{\partial E}{\partial W_2} = \frac{\partial E}{\partial Predicted} \times \frac{\partial Predicted}{\partial s} \times \frac{\partial s}{\partial W_2}$
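As a sanity check on this decomposition, SymPy can differentiate E with respect to W1 directly and confirm it equals the three-factor product (the factor forms used below are derived in the following sections):

```python
import sympy as sp

W1, W2, X1, X2, b, d = sp.symbols('W1 W2 X1 X2 b d')
s = X1 * W1 + X2 * W2 + b
predicted = 1 / (1 + sp.exp(-s))
E = sp.Rational(1, 2) * (d - predicted) ** 2

direct = sp.diff(E, W1)  # direct differentiation
# Chain-rule product: (dE/dPredicted) * (dPredicted/ds) * (ds/dW1)
product = (predicted - d) * predicted * (1 - predicted) * X1
print(sp.simplify(direct - product))  # 0
```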
Error-Predicted ($\frac{\partial E}{\partial Predicted}$) Partial Derivative
Given $E = \frac{1}{2}(desired - predicted)^2$:
$\frac{\partial E}{\partial Predicted} = \frac{\partial}{\partial Predicted}\left(\frac{1}{2}(desired - predicted)^2\right)$
$= 2 \times \frac{1}{2}(desired - predicted)^{2-1} \times (0 - 1)$
$= (desired - predicted) \times (-1)$
$= predicted - desired$
Substitution:
$\frac{\partial E}{\partial Predicted} = predicted - desired = 0.874 - 0.03 = 0.844$
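Numerically, this first factor is just a subtraction:

```python
desired, predicted = 0.03, 0.874

dE_dpred = predicted - desired  # dE/dPredicted
print(round(dE_dpred, 3))  # 0.844
```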
Predicted-SOP ($\frac{\partial Predicted}{\partial s}$) Partial Derivative
Given $Predicted = \frac{1}{1 + e^{-s}}$:
$\frac{\partial Predicted}{\partial s} = \frac{\partial}{\partial s}\left(\frac{1}{1 + e^{-s}}\right) = \frac{1}{1 + e^{-s}}\left(1 - \frac{1}{1 + e^{-s}}\right)$
Substitution:
$\frac{\partial Predicted}{\partial s} = \frac{1}{1 + e^{-1.94}}\left(1 - \frac{1}{1 + e^{-1.94}}\right) = \frac{1}{1.144}\left(1 - \frac{1}{1.144}\right) = 0.874(1 - 0.874) = 0.874 \times 0.126 = 0.11$
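A numeric check of this factor, reusing the sigmoid output from the forward pass:

```python
import math

p = 1.0 / (1.0 + math.exp(-1.94))  # predicted = 0.874

dpred_ds = p * (1 - p)  # sigmoid derivative f(s) * (1 - f(s))
print(round(dpred_ds, 2))  # 0.11
```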
SOP-$W_1$ ($\frac{\partial s}{\partial W_1}$) Partial Derivative
Given $s = X_1 W_1 + X_2 W_2 + b$:
$\frac{\partial s}{\partial W_1} = \frac{\partial}{\partial W_1}(X_1 W_1 + X_2 W_2 + b) = 1 \times X_1 \times W_1^{1-1} + 0 + 0 = X_1 W_1^0 = X_1(1) = X_1$
Substitution:
$\frac{\partial s}{\partial W_1} = X_1 = 0.1$
SOP-$W_2$ ($\frac{\partial s}{\partial W_2}$) Partial Derivative
Given $s = X_1 W_1 + X_2 W_2 + b$:
$\frac{\partial s}{\partial W_2} = \frac{\partial}{\partial W_2}(X_1 W_1 + X_2 W_2 + b) = 0 + 1 \times X_2 \times W_2^{1-1} + 0 = X_2 W_2^0 = X_2(1) = X_2$
Substitution:
$\frac{\partial s}{\partial W_2} = X_2 = 0.3$
Error-$W_1$ ($\frac{\partial E}{\partial W_1}$) Partial Derivative
• After calculating each individual derivative, we can multiply all of them to get the desired relationship between the prediction error and each weight.
$\frac{\partial E}{\partial W_1} = \frac{\partial E}{\partial Predicted} \times \frac{\partial Predicted}{\partial s} \times \frac{\partial s}{\partial W_1}$
Calculated derivatives: $\frac{\partial E}{\partial Predicted} = 0.844$, $\frac{\partial Predicted}{\partial s} = 0.11$, $\frac{\partial s}{\partial W_1} = 0.1$
$\frac{\partial E}{\partial W_1} = 0.844 \times 0.11 \times 0.1 = 0.01$
Error-$W_2$ ($\frac{\partial E}{\partial W_2}$) Partial Derivative
$\frac{\partial E}{\partial W_2} = \frac{\partial E}{\partial Predicted} \times \frac{\partial Predicted}{\partial s} \times \frac{\partial s}{\partial W_2}$
Calculated derivatives: $\frac{\partial E}{\partial Predicted} = 0.844$, $\frac{\partial Predicted}{\partial s} = 0.11$, $\frac{\partial s}{\partial W_2} = 0.3$
$\frac{\partial E}{\partial W_2} = 0.844 \times 0.11 \times 0.3 = 0.03$
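Putting the whole backward pass together in one sketch (the names are illustrative; the slides' rounded values 0.01 and 0.03 come out as 0.009 and 0.028 without intermediate rounding):

```python
import math

# Forward pass with the example's inputs and initial weights.
X1, X2 = 0.1, 0.3
W1, W2, b = 0.5, 0.2, 1.83
desired = 0.03

s = X1 * W1 + X2 * W2 + b               # 1.94
predicted = 1.0 / (1.0 + math.exp(-s))  # 0.874

# Backward pass: multiply the three partial derivatives per weight.
dE_dpred = predicted - desired          # 0.844
dpred_ds = predicted * (1 - predicted)  # 0.11
dE_dW1 = dE_dpred * dpred_ds * X1
dE_dW2 = dE_dpred * dpred_ds * X2
print(round(dE_dW1, 3), round(dE_dW2, 3))  # 0.009 0.028
```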
Interpreting Derivatives
• There are two useful pieces of information in the derivatives calculated previously.