Corneliu Florea. Machine Learning pentru Aplicatii Vizuale
Perceptron
• Historically, the first machine learning structure
• Inspired by the biological neuron
• Capable of linear separation only – this limitation led to a 15-year gap in research
• Its learning algorithm is a precursor to the powerful gradient descent
Biological Neuron
The Neuron - A Biological Information Processor
• dendrites - the receivers
• soma - neuron cell body (sums input signals)
• axon - the transmitter
• synapse - point of transmission
• the neuron activates after a certain threshold is met
Learning occurs via electro-chemical changes in the effectiveness of the synaptic junction.

An Artificial Neuron - The Perceptron
The basic function of the neuron is to sum its inputs and produce an output when the sum is greater than a threshold.
Artificial Perceptron
• Model similar to logistic regression
• Artificial Neuron – Perceptron
• The model was introduced by McCulloch and Pitts in 1943, with a hard-limiting function
• Typically trained by the Delta rule algorithm
Neuron Model

[Diagram: inputs x1, x2, …, xN with weights w1, w2, …, wN, a summation node with bias w0 (threshold), and output o = 0 or 1]

o = f( Σ_{i=1}^{N} w_i x_i + w_0 )

Learning:
- During training the weights w0, w1, …, wN are found
- The perceptron works on linearly separable problems only!!!
Activation Functions
1) Threshold function: f(v) = 1 if v ≥ 0; 0 otherwise
2) Piecewise-linear function: f(v) = 1 if v ≥ ½; v if -½ < v < ½; 0 otherwise
3) Sigmoid function: f(v) = 1/(1 + exp(-a v))
etc.
Neurons can use any differentiable transfer function f to generate their output
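The three activation functions above can be sketched in Python (a small illustration of ours, not code from the lecture):

```python
import math

def threshold(v):
    # 1) hard-limiting (Heaviside) function
    return 1.0 if v >= 0 else 0.0

def piecewise_linear(v):
    # 2) saturating linear function
    if v >= 0.5:
        return 1.0
    if v > -0.5:
        return v
    return 0.0

def sigmoid(v, a=1.0):
    # 3) logistic sigmoid with slope parameter a
    return 1.0 / (1.0 + math.exp(-a * v))
```

Note that only the sigmoid is differentiable everywhere, which is what back-propagation will later require.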
Separating hyperplane
Example
Perceptron Learning Algorithm:
1. Initialize weights with small random values
2. Present a pattern, xi and target output, yi
3. Compute output: o = f( Σ_{i=1}^{N} w_i x_i + w_0 )
4. Update weights: w(t+1) = w(t) + Δw(t)
Repeat starting at 2, until an acceptable level of error is reached
Learning in a Simple Neuron
Widrow-Hoff or Delta Rule for weight modification

w_i(t+1) = w_i(t) + Δw_i(t), with Δw_i(t) = η ε(t) x_i(t) = η (y(t) - o(t)) x_i(t)   (for input weights)
w_0(t+1) = w_0(t) + η ε(t)   (for the bias)

Where:
η = learning rate (0 < η ≤ 1), typically set to 0.1 or 0.2
ε(t) = error signal = desired output y(t) - network output o(t)
Weight Updates
• Binary classification – weight updates
Δw(t) = η ε(t) x(t) = η (y(t) - o(t)) x(t)

o(t)  y(t)  ε(t)  Δw(t)
0     0     0     0
0     1     +1    η x(t)
1     0     -1    -η x(t)
1     1     0     0
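The update rule in the table can be sketched as follows (a minimal illustration of ours; the function names are not from the lecture):

```python
# One delta-rule update for a single perceptron with a hard-limiting activation.
def predict(w, x, bias):
    # fire (output 1) when the weighted sum plus bias reaches the threshold
    s = sum(wi * xi for wi, xi in zip(w, x)) + bias
    return 1 if s >= 0 else 0

def delta_update(w, bias, x, y, eta=0.2):
    # epsilon(t) = y(t) - o(t); each weight moves by eta * epsilon * its input
    o = predict(w, x, bias)
    eps = y - o
    w = [wi + eta * eps * xi for wi, xi in zip(w, x)]
    bias = bias + eta * eps  # the bias sees a constant input of 1
    return w, bias
```

Presenting x = (1, 0) with target y = 1 to a neuron that currently outputs 0 reproduces the table's row "ε = +1, Δw = η x(t)"; when output and target agree, ε = 0 and nothing changes.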
Perceptron learning
• Presented with enough training vectors, the perceptron's weight vector w(n) tends to the correct value w.
• Rosenblatt proved that if the input patterns are linearly separable, then the perceptron learning law converges, and the hyperplane separating the two classes of input patterns can be determined.
Example
Logical OR Function
x1  x2  y
0   0   0
0   1   1
1   0   1
1   1   1

[Plot: the four points (0,0), (0,1), (1,0), (1,1) in the (x1, x2) plane; a single line separates (0,0) from the other three]
y = f(w0+w1x1+w2x2)
Simple Neural Network
Line: ?x1+?x2 = ? or -?x1-?x2 + ? = 0
Training

o(t)  y(t)  ε(t)  Δw(t)
0     0     0     0
0     1     +1    η x(t)
1     0     -1    -η x(t)
1     1     0     0
η=0.2
Iterat  w1    w2     w0    x1  x2  y=(x1 or x2)  o'    o  ε  Δw1  Δw2  Δw0
1       0.02  -0.15  0.09  1   0   1             0.11  0  1  0.2  0    0.2
2       0.22  -0.15  0.29  0   1   1             0.14  0  1  0    0.2  0.2
3       0.22  0.05   0.49  1   1   1             0.76  1  0  0    0    0
4       0.22  0.05   0.49  1   0   1             0.71  1  0  0    0    0
5       0.22  0.05   0.49  0   0   0             0.49  0  0  0    0    0
6       0.22  0.05   0.49  1   1   1             0.76  1  0  0    0    0

o = f(w0 + w1x1 + w2x2) = f(o'), where f is the threshold function:
f(x) = 1 if x > 0.5; 0 if x ≤ 0.5
Example
Iterat  w1    w2     w0    x1  x2  y  o'    o  ε  Δw1  Δw2  Δw0
1       0.02  -0.15  0.09  1   0   1  0.11  0  1  0.2  0    0.2
2       0.22  -0.15  0.29  0   1   1  0.14  0  1  0    0.2  0.2
3       0.22  0.05   0.49  1   1   1  0.76  1  0  0    0    0
4       0.22  0.05   0.49  1   0   1  0.71  1  0  0    0    0
5       0.22  0.05   0.49  0   0   0  0.49  0  0  0    0    0
6       0.22  0.05   0.49  1   1   1  0.76  1  0  0    0    0
7       0.22  0.05   0.49  0   0   0  0.49  0  0  0    0    0
8       0.22  0.05   0.49  1   1   1  0.76  1  0  0    0    0
9       0.22  0.05   0.49  1   0   1  0.71  1  0  0    0    0
10      0.22  0.05   0.49  0   1   1  0.54  1  0  0    0    0
11      0.22  0.05   0.49  0   0   0  0.49  0  0  0    0    0
12      0.22  0.05   0.49  1   1   1  0.76  1  0  0    0    0
13      0.22  0.05   0.49  1   0   1  0.71  1  0  0    0    0
14      0.22  0.05   0.49  0   1   1  0.54  1  0  0    0    0
15      0.22  0.05   0.49  1   0   1  0.71  1  0  0    0    0
16      0.22  0.05   0.49  0   0   0  0.49  0  0  0    0    0
Line: 0.22x1 + 0.05x2 = 0.01 (from w1x1 + w2x2 + w0 = 0.5 with the converged weights)
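The whole training run above can be reproduced in a few lines (our own sketch; the patterns are presented in a fixed cyclic order rather than the table's order, but learning converges to the same weights):

```python
# Perceptron training on logical OR with the delta rule, eta = 0.2,
# threshold f(x) = 1 if x > 0.5 else 0, starting from the table's weights.
def f(x):
    return 1 if x > 0.5 else 0

def train_or(w1=0.02, w2=-0.15, w0=0.09, eta=0.2, epochs=20):
    data = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 1)]  # (x1, x2, x1 OR x2)
    for _ in range(epochs):
        for x1, x2, y in data:
            o = f(w0 + w1 * x1 + w2 * x2)
            eps = y - o
            w1 += eta * eps * x1
            w2 += eta * eps * x2
            w0 += eta * eps
    return w1, w2, w0

w1, w2, w0 = train_or()
```

OR is linearly separable, so the weights stop changing once every pattern is classified correctly.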
Convergence
FAILURE
Logical XOR Function
x1  x2  y
0   0   0
0   1   1
1   0   1
1   1   0

[Plot: in the (x1, x2) plane, (0,0) and (1,1) belong to one class, (0,1) and (1,0) to the other; no single line separates them]

Two neurons are needed! Their combined results can produce a good classification.
XOR- Failure
Iterat  w1     w2     w0    x1  x2  y  o'     o  ε   Δw1   Δw2   Δw0
1       0.02   -0.15  0.09  1   0   0  0.11   0  0   0     0     0
2       0.02   -0.15  0.09  0   1   0  -0.06  0  0   0     0     0
3       0.02   -0.15  0.09  1   1   1  -0.04  0  1   0.2   0.2   0.2
4       0.22   0.05   0.29  1   0   0  0.51   1  -1  -0.2  0     -0.2
5       0.02   0.05   0.09  0   0   1  0.09   0  1   0     0     0.2
6       0.02   0.05   0.29  1   1   1  0.36   0  1   0.2   0.2   0.2
7       0.22   0.25   0.49  0   0   1  0.49   0  1   0     0     0.2
8       0.22   0.25   0.69  1   1   1  1.16   1  0   0     0     0
9       0.22   0.25   0.69  1   0   0  0.91   1  -1  -0.2  0     -0.2
10      0.02   0.25   0.49  0   1   0  0.74   1  -1  0     -0.2  -0.2
11      0.02   0.05   0.29  0   0   1  0.29   0  1   0     0     0.2
12      0.02   0.05   0.49  1   1   1  0.56   1  0   0     0     0
13      0.02   0.05   0.49  1   0   0  0.51   1  -1  -0.2  0     -0.2
14      -0.18  0.05   0.29  0   1   0  0.34   0  0   0     0     0
15      -0.18  0.05   0.29  1   0   0  0.11   0  0   0     0     0
16      -0.18  0.05   0.29  0   0   1  0.29   0  1   0     0     0.2
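The failure can be demonstrated directly (our own sketch): the same delta-rule update that converged for OR keeps cycling on XOR, because XOR is not linearly separable, so no epoch ever reaches zero errors.

```python
# Run the delta-rule perceptron on XOR and record the errors per epoch.
def f(x):
    return 1 if x > 0.5 else 0

def run_epoch(w1, w2, w0, data, eta=0.2):
    errors = 0
    for x1, x2, y in data:
        eps = y - f(w0 + w1 * x1 + w2 * x2)
        if eps != 0:
            errors += 1
        w1 += eta * eps * x1
        w2 += eta * eps * x2
        w0 += eta * eps
    return w1, w2, w0, errors

xor = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]
w1, w2, w0 = 0.02, -0.15, 0.09
history = []
for _ in range(100):
    w1, w2, w0, e = run_epoch(w1, w2, w0, xor)
    history.append(e)
# every epoch misclassifies at least one of the four patterns: an epoch with
# zero errors would mean the current line separates XOR, which is impossible
```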
Geometric interpretation of the learning law
Multi-Layer Perceptron
• Multi-Layer Perceptron = Artificial Neural Network = FeedForward Network = Fully Connected Network

Over the 15 years (1969-1984) some research continued ...
• a hidden layer of nodes allowed combinations of linear functions
• non-linear activation functions displayed properties closer to real neurons:
  – output varies continuously but not linearly
  – differentiable .... sigmoid
• a non-linear ANN classifier was possible:

f(a) = 1 / (1 + e^(-a))
Collection of Artificial Neurons
[Diagram: input nodes I1 I2 I3 I4, a layer of hidden nodes, and output nodes O1 O2 - "distributed processing and representation"]

A 3-layer network has 2 active layers.
Mathematical formulation
• The output of the i-th neuron in the j-th layer is computed as

o_j = f( Σ_{i=1}^{N} w_ij x_i + w_0j )

Where:
• f() is a non-linear function
• the x_i are: inputs, for the first layer; outputs of the previous layer, otherwise
• w_ij – weights
• w_0j – bias
Learning: - find the values of all weights
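The layer formula above can be sketched as follows (a minimal illustration of ours, not code from the lecture), with each layer feeding the next:

```python
import math

def sigmoid(a):
    # the non-linear function f()
    return 1.0 / (1.0 + math.exp(-a))

def layer_forward(x, W, b):
    # W[j][i] is the weight from input i to neuron j; b[j] is neuron j's bias
    return [sigmoid(sum(wji * xi for wji, xi in zip(Wj, x)) + bj)
            for Wj, bj in zip(W, b)]

def mlp_forward(x, layers):
    # layers: list of (W, b) pairs; the outputs of one layer are the
    # inputs of the next, exactly as in the formula above
    for W, b in layers:
        x = layer_forward(x, W, b)
    return x
```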
Training MLPs
• Initially there was no learning algorithm to adjust the weights of a multi-layer network
  – weights had to be set by hand
• How could the weights below the hidden layer be updated?

The Back-propagation Algorithm
• 1986: the solution to the multi-layer ANN weight-update problem was rediscovered
• Conceptually simple - the global error is propagated backward to the network nodes, and weights are modified in proportion to their contribution
• The most important ANN learning algorithm
• It became known as back-propagation because the error is sent back through the network to correct all weights
The Back-Propagation Algorithm
• Like the Perceptron, the calculation of error is based on the difference between target and actual output:

L = ½ Σ_j (y_j - o_j)²

• However, in BP it is the rate of change of the error which is the important feedback through the network - the generalized delta rule:

Δw_ij = -η ∂L/∂w_ij

• Relies on the sigmoid activation function
The Back-Propagation Algorithm
Objective: compute ∂L/∂w_ij for all weights.
Definitions:
• w_ij = weight from node i to node j
• x_j = Σ_{i=0}^{n} w_ij o_i = total weighted input of node j
• o_j = f(x_j) = 1/(1 + e^(-x_j)) = output of node j
• L = error (loss) for one pattern over all output nodes
The Back-Propagation Algorithm
Objective: compute the derivatives ∂L/∂w_ij for all w_ij

Four-step process:
1. Compute how fast the error changes as the output of node j is changed
2. Compute how fast the error changes as the total input to node j is changed
3. Compute how fast the error changes as the weight w_ij coming into node j is changed
4. Compute how fast the error changes as the output of node i in the previous layer is changed
The Back-Propagation Algorithm
On-Line algorithm:
1. Initialize weights
2. Present a pattern (training example) x_i and target output y_i
3. Compute output: o_j = f( Σ_{i=0}^{n} w_ij o_i )
4. Update weights: w_ij(t+1) = w_ij(t) + Δw_ij, where Δw_ij = -η ∂E/∂w_ij
Repeat starting at 2 until an acceptable level of error is reached
Back-Propagation
Where:

Δw_ij = η ε_j o_i = -η ∂L/∂w_ij

For output nodes:

ε_j = o_j (1 - o_j) (y_j - o_j)

For hidden nodes:

ε_i = o_i (1 - o_i) Σ_j ε_j w_ij
BackPropagation
• All node outputs assume a sigmoid function:

f(a) = 1 / (1 + e^(-a))

• We need to compute its derivative:

f'(a) = f(a) (1 - f(a))
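This identity is easy to check numerically (a sketch of ours) by comparing it against a finite-difference estimate of the sigmoid's slope:

```python
import math

def f(a):
    return 1.0 / (1.0 + math.exp(-a))

def f_prime(a):
    # the derivative identity used by back-propagation
    return f(a) * (1.0 - f(a))

h = 1e-6
for a in (-2.0, 0.0, 1.5):
    numeric = (f(a + h) - f(a - h)) / (2 * h)  # central difference
    assert abs(numeric - f_prime(a)) < 1e-8
```

The maximum slope, f'(0) = 0.25, occurs at the sigmoid's midpoint.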
The Back-propagation Algorithm
Visualizing the BackProp learning process:
The algorithm performs a gradient descent in weight space toward a minimum level of error, using a fixed step size (the learning rate η).
The gradient is given by:

∂E/∂w_ij = rate at which the error changes as the weights change
The Back-propagation Algorithm
Momentum Descent: minimization can be sped up if an additional momentum term α[w_ij(t) - w_ij(t-1)] is added to the update equation:

Δw_ij(t) = η δ_j o_i + α Δw_ij(t-1), with 0 < α < 1

Thus the effective learning rate is augmented, varying the amount a weight is updated:
• Analogous to the momentum of a ball - maintains direction
• Rolls through small local minima
• Increases the weight update when on a stable gradient
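The momentum term can be sketched per weight (our own illustration): each weight remembers its previous step Δw(t-1) and adds α times it to the new gradient step, so on a stable gradient the effective step grows toward (gradient step)/(1 - α).

```python
def momentum_step(grad_step, prev_delta, alpha=0.9):
    # grad_step holds eta * delta_j * o_i for each weight;
    # prev_delta holds the previous Delta w(t-1), same shape
    return [g + alpha * p for g, p in zip(grad_step, prev_delta)]

# with a constant gradient step of 0.1 and alpha = 0.5,
# the step converges to 0.1 / (1 - 0.5) = 0.2
delta = [0.0, 0.0]
for _ in range(50):
    delta = momentum_step([0.1, -0.1], delta, alpha=0.5)
```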
The Back-propagation Algorithm
Line Search Techniques:
• Steepest and momentum descent use only the gradient of the error surface
• More advanced techniques explore the weight space using various heuristics
• The most common is to search ahead in the direction defined by the gradient
The Back-propagation Algorithm
On-line vs. Batch algorithms:
• The batch (or cumulative) method reviews a set of training examples known as an epoch and computes the global error:

E = ½ Σ_p Σ_j (t_j - o_j)²

• Weight updates are based on this cumulative error signal
• On-line is more stochastic and typically a little more accurate; batch is more efficient
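The batch error above accumulates over all patterns p and all output nodes j before any update, which can be sketched as (our own illustration):

```python
def batch_error(targets, outputs):
    # E = 1/2 * sum over patterns p and output nodes j of (t_j - o_j)^2
    return 0.5 * sum((t - o) ** 2
                     for tp, op in zip(targets, outputs)
                     for t, o in zip(tp, op))
```

An on-line algorithm would instead update the weights after each individual pattern's error.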
The Back-propagation Algorithm
Several Questions:
• What is BP's inductive bias?
• Can BP get stuck in local minimum?
• How does learning time scale with size of the network & number of training examples?
• Is it biologically plausible?
• Do we have to use the sigmoid activation function?
• How well does a trained network generalize to unseen test cases?
Example
• Taken from: https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/
• Network: [diagram of the network]
Example
• Initial weights

The Forward Pass
Here's how we calculate the total net input for h1:
We use the logistic function to get the output of h1:
Carrying out the same process for h2 we get:
Example
Continuing the forward pass
• We repeat this process for the output layer neurons, using the output from the hidden layer neurons as inputs.
• Here's the output for o1:
And carrying out the same process for o2 we get:
Example
Calculating the Total Error
We can now calculate the error for each output neuron using the squared error function and sum them to get the total error:
For example, the target output for o1 is 0.01 but the neural network outputs 0.75136507; therefore its error is:
Repeating this process for o2 (remembering that the target is 0.99) we get:
The total error for the neural network is the sum of these errors:
Example
The Backwards Pass
Our goal with backpropagation is to update each of the weights in the network so that they cause the actual output to be closer to the target output, thereby minimizing the error for each output neuron and the network as a whole.

Output Layer
Consider w5. We want to know how much a change in w5 affects the total error, i.e. ∂E_total/∂w5.
By applying the chain rule we know that:
Visually, here’s what we’re doing:
Example
We need to figure out each piece in this equation.
First, how much does the total error change with respect to the output?
When we take the partial derivative of the total error with respect to out_o1, the quantity ½(target_o2 - out_o2)² becomes zero, because out_o1 does not affect it, which means we are taking the derivative of a constant, which is zero.
Example
Next, how much does the output of o1 change with respect to its total net input?
The partial derivative of the logistic function (a.k.a. sigmoid) is the output multiplied by 1 minus the output:
Finally, how much does the total net input of o1 change with respect to w5 ?
Putting it all together:
Example
You'll often see this calculation combined in the form of the delta rule:
Alternatively, we have ∂E_total/∂out_o1 and ∂out_o1/∂net_o1, which can be combined and written as ∂E_total/∂net_o1, aka δ_o1, the node delta. We can use this to rewrite the calculation above:
Therefore:
Some sources extract the negative sign from δ, so it would be written as:
To decrease the error, we then subtract this value from the current weight (multiplied by some learning rate, eta, which is set here to 0.5):
We can repeat this process to get the new output-layer weights w6, w7 and w8:
Example
Hidden Layer
Next, we'll continue the backwards pass by calculating new values for w1, w2, w3 and w4. We need to figure out:
The process is similar to the one for the output layer, but slightly different, to account for the fact that the output of each hidden-layer neuron contributes to the output (and therefore the error) of multiple output neurons. We know that out_h1 affects both out_o1 and out_o2, therefore ∂E_total/∂out_h1 needs to take into consideration its effect on both output neurons:
We can calculate ∂E_o1/∂out_h1 using values we calculated earlier:
And ∂net_o1/∂out_h1 is equal to w5:
Example
Plugging them in:
Following the same process for ∂E_o2/∂out_h1, we get:
Therefore:
Now that we have ∂E_total/∂out_h1, we need to figure out ∂out_h1/∂net_h1 and then ∂net_h1/∂w for each weight:
We calculate the partial derivative of the total net input to h1 with respect to w1 the same as we did for the output neuron:
Putting it all together:
Example
We can now update w1:
Repeating this for w2, w3 and w4:
Finally, we’ve updated all of our weights!
When we fed forward the 0.05 and 0.1 inputs originally, the error on the network was 0.298371109. After this first round of backpropagation, the total error is now down to 0.291027924. It might not seem like much, but after repeating this process 10,000 times, for example, the error plummets to 0.0000351085. At this point, when we feed forward 0.05 and 0.1, the two output neurons generate 0.015912196 (vs 0.01 target) and 0.984065734 (vs 0.99 target).
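The forward pass and the w5 update of this example can be reproduced numerically (a sketch of ours; note the initial weights w1..w8 and biases b1, b2 below are taken from the linked article's network figure, not from this text, and δ uses the article's sign convention δ = ∂E/∂net):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

i1, i2 = 0.05, 0.10          # inputs
t1, t2 = 0.01, 0.99          # targets for o1, o2
# assumed initial weights/biases, from the article's figure
w1, w2, w3, w4, b1 = 0.15, 0.20, 0.25, 0.30, 0.35
w5, w6, w7, w8, b2 = 0.40, 0.45, 0.50, 0.55, 0.60
eta = 0.5

# forward pass
out_h1 = sigmoid(w1 * i1 + w2 * i2 + b1)
out_h2 = sigmoid(w3 * i1 + w4 * i2 + b1)
out_o1 = sigmoid(w5 * out_h1 + w6 * out_h2 + b2)
out_o2 = sigmoid(w7 * out_h1 + w8 * out_h2 + b2)
E_total = 0.5 * (t1 - out_o1) ** 2 + 0.5 * (t2 - out_o2) ** 2

# backward pass: node delta for o1, then one gradient step on w5
delta_o1 = (out_o1 - t1) * out_o1 * (1 - out_o1)
w5_new = w5 - eta * delta_o1 * out_h1
```

Under these assumed weights this reproduces the article's numbers: out_o1 = 0.75136507, E_total = 0.298371109, and the updated w5 = 0.35891648.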