Corneliu Florea. Machine Learning pentru Aplicatii Vizuale
Perceptron
• Historically, the first machine learning structure
• Inspired by the biological neuron
• Capable of linear separation only – this limitation led to a 15-year gap in research
• Its learning algorithm is a precursor to the powerful gradient descent
Biological Neuron
The Neuron - A Biological Information Processor
• dendrites - the receivers
• soma - neuron cell body (sums input signals)
• axon - the transmitter
• synapse - point of transmission
• the neuron activates after a certain threshold is met
Learning occurs via electro-chemical changes in the effectiveness of the synaptic junction.

An Artificial Neuron - The Perceptron
The basic function of the neuron is to sum its inputs and produce an output when the sum is greater than a threshold.
Artificial Perceptron
• Model similar to logistic regression
• Artificial Neuron – Perceptron
• The model was introduced by McCulloch and Pitts in 1943, with a hard-limiting function
• Typically trained by the Delta rule algorithm
Neuron Model

[Diagram: inputs x1, x2, …, xN with weights w1, w2, …, wN, a summation node with bias w0 (threshold), and output o = 0 or 1]

o = f( Σ_{i=1}^{N} w_i x_i + w_0 )

Learning:
- During training the weights w0, w1, …, wN are found
- The perceptron works on linearly separable problems only!!!
Activation Functions
1) Threshold function: f(v) = 1 if v ≥ 0; 0 otherwise
2) Piecewise-linear function: f(v) = 1 if v ≥ ½; v if -½ < v < ½; 0 otherwise
3) Sigmoid function: f(v) = 1/(1 + exp(-a v))
etc.
Neurons can use any differentiable transfer function f to generate their output
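The three activation functions above can be sketched in Python (a small illustration of ours, not code from the lecture):

```python
import math

def threshold(v):
    # 1) hard-limiting (Heaviside) function
    return 1.0 if v >= 0 else 0.0

def piecewise_linear(v):
    # 2) saturating linear function
    if v >= 0.5:
        return 1.0
    if v > -0.5:
        return v
    return 0.0

def sigmoid(v, a=1.0):
    # 3) logistic sigmoid with slope parameter a
    return 1.0 / (1.0 + math.exp(-a * v))
```

Note that only the sigmoid is differentiable everywhere, which is what back-propagation will later require.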
Separating hyperplane
Example
Perceptron Learning Algorithm:
1. Initialize weights with small random values
2. Present a pattern, xi and target output, yi
3. Compute output: o = f( Σ_{i=1}^{N} w_i x_i + w_0 )
4. Update weights: w(t+1) = w(t) + Δw(t)
Repeat starting at 2, until an acceptable level of error is reached
Learning in a Simple Neuron
Widrow-Hoff or Delta Rule for weight modification

w_i(t+1) = w_i(t) + Δw_i(t), with Δw_i(t) = η ε(t) x_i(t) = η (y(t) - o(t)) x_i(t)   (for input weights)
w_0(t+1) = w_0(t) + η ε(t)   (for the bias)

Where:
η = learning rate (0 < η ≤ 1), typically set to 0.1 or 0.2
ε(t) = error signal = desired output y(t) - network output o(t)
Weight Updates
• Binary classification – weight updates
Δw(t) = η ε(t) x(t) = η (y(t) - o(t)) x(t)

o(t)  y(t)  ε(t)  Δw(t)
0     0     0     0
0     1     +1    η x(t)
1     0     -1    -η x(t)
1     1     0     0
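The update rule in the table can be sketched as follows (a minimal illustration of ours; the function names are not from the lecture):

```python
# One delta-rule update for a single perceptron with a hard-limiting activation.
def predict(w, x, bias):
    # fire (output 1) when the weighted sum plus bias reaches the threshold
    s = sum(wi * xi for wi, xi in zip(w, x)) + bias
    return 1 if s >= 0 else 0

def delta_update(w, bias, x, y, eta=0.2):
    # epsilon(t) = y(t) - o(t); each weight moves by eta * epsilon * its input
    o = predict(w, x, bias)
    eps = y - o
    w = [wi + eta * eps * xi for wi, xi in zip(w, x)]
    bias = bias + eta * eps  # the bias sees a constant input of 1
    return w, bias
```

Presenting x = (1, 0) with target y = 1 to a neuron that currently outputs 0 reproduces the table's row "ε = +1, Δw = η x(t)"; when output and target agree, ε = 0 and nothing changes.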
Perceptron learning
• Presented with enough training vectors, the perceptron's weight vector w(n) tends to the correct value w.
• Rosenblatt proved that if the input patterns are linearly separable, then the perceptron learning law converges, and the hyperplane separating the two classes of input patterns can be determined.
Example
Logical OR Function
x1  x2  y
0   0   0
0   1   1
1   0   1
1   1   1

[Plot: the four points (0,0), (0,1), (1,0), (1,1) in the (x1, x2) plane; a single line separates (0,0) from the other three]
y = f(w0+w1x1+w2x2)
Simple Neural Network
Line: ?x1+?x2 = ? or -?x1-?x2 + ? = 0
Training

o(t)  y(t)  ε(t)  Δw(t)
0     0     0     0
0     1     +1    η x(t)
1     0     -1    -η x(t)
1     1     0     0
η=0.2
Iterat  w1    w2     w0    x1  x2  y=(x1 or x2)  o'    o  ε  Δw1  Δw2  Δw0
1       0.02  -0.15  0.09  1   0   1             0.11  0  1  0.2  0    0.2
2       0.22  -0.15  0.29  0   1   1             0.14  0  1  0    0.2  0.2
3       0.22  0.05   0.49  1   1   1             0.76  1  0  0    0    0
4       0.22  0.05   0.49  1   0   1             0.71  1  0  0    0    0
5       0.22  0.05   0.49  0   0   0             0.49  0  0  0    0    0
6       0.22  0.05   0.49  1   1   1             0.76  1  0  0    0    0

o = f(w0 + w1x1 + w2x2) = f(o'), where f is the threshold function:
f(x) = 1 if x > 0.5; 0 if x ≤ 0.5
Example
Iterat  w1    w2     w0    x1  x2  y  o'    o  ε  Δw1  Δw2  Δw0
1       0.02  -0.15  0.09  1   0   1  0.11  0  1  0.2  0    0.2
2       0.22  -0.15  0.29  0   1   1  0.14  0  1  0    0.2  0.2
3       0.22  0.05   0.49  1   1   1  0.76  1  0  0    0    0
4       0.22  0.05   0.49  1   0   1  0.71  1  0  0    0    0
5       0.22  0.05   0.49  0   0   0  0.49  0  0  0    0    0
6       0.22  0.05   0.49  1   1   1  0.76  1  0  0    0    0
7       0.22  0.05   0.49  0   0   0  0.49  0  0  0    0    0
8       0.22  0.05   0.49  1   1   1  0.76  1  0  0    0    0
9       0.22  0.05   0.49  1   0   1  0.71  1  0  0    0    0
10      0.22  0.05   0.49  0   1   1  0.54  1  0  0    0    0
11      0.22  0.05   0.49  0   0   0  0.49  0  0  0    0    0
12      0.22  0.05   0.49  1   1   1  0.76  1  0  0    0    0
13      0.22  0.05   0.49  1   0   1  0.71  1  0  0    0    0
14      0.22  0.05   0.49  0   1   1  0.54  1  0  0    0    0
15      0.22  0.05   0.49  1   0   1  0.71  1  0  0    0    0
16      0.22  0.05   0.49  0   0   0  0.49  0  0  0    0    0
Line: 0.22x1 + 0.05x2 = 0.01 (from w1x1 + w2x2 + w0 = 0.5 with the converged weights)
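The whole training run above can be reproduced in a few lines (our own sketch; the patterns are presented in a fixed cyclic order rather than the table's order, but learning converges to the same weights):

```python
# Perceptron training on logical OR with the delta rule, eta = 0.2,
# threshold f(x) = 1 if x > 0.5 else 0, starting from the table's weights.
def f(x):
    return 1 if x > 0.5 else 0

def train_or(w1=0.02, w2=-0.15, w0=0.09, eta=0.2, epochs=20):
    data = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 1)]  # (x1, x2, x1 OR x2)
    for _ in range(epochs):
        for x1, x2, y in data:
            o = f(w0 + w1 * x1 + w2 * x2)
            eps = y - o
            w1 += eta * eps * x1
            w2 += eta * eps * x2
            w0 += eta * eps
    return w1, w2, w0

w1, w2, w0 = train_or()
```

OR is linearly separable, so the weights stop changing once every pattern is classified correctly.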
Convergence
FAILURE
Logical XOR Function
x1  x2  y
0   0   0
0   1   1
1   0   1
1   1   0

[Plot: in the (x1, x2) plane, (0,0) and (1,1) belong to one class, (0,1) and (1,0) to the other; no single line separates them]

Two neurons are needed! Their combined results can produce a good classification.
XOR- Failure
Iterat  w1     w2     w0    x1  x2  y  o'     o  ε   Δw1   Δw2   Δw0
1       0.02   -0.15  0.09  1   0   0  0.11   0  0   0     0     0
2       0.02   -0.15  0.09  0   1   0  -0.06  0  0   0     0     0
3       0.02   -0.15  0.09  1   1   1  -0.04  0  1   0.2   0.2   0.2
4       0.22   0.05   0.29  1   0   0  0.51   1  -1  -0.2  0     -0.2
5       0.02   0.05   0.09  0   0   1  0.09   0  1   0     0     0.2
6       0.02   0.05   0.29  1   1   1  0.36   0  1   0.2   0.2   0.2
7       0.22   0.25   0.49  0   0   1  0.49   0  1   0     0     0.2
8       0.22   0.25   0.69  1   1   1  1.16   1  0   0     0     0
9       0.22   0.25   0.69  1   0   0  0.91   1  -1  -0.2  0     -0.2
10      0.02   0.25   0.49  0   1   0  0.74   1  -1  0     -0.2  -0.2
11      0.02   0.05   0.29  0   0   1  0.29   0  1   0     0     0.2
12      0.02   0.05   0.49  1   1   1  0.56   1  0   0     0     0
13      0.02   0.05   0.49  1   0   0  0.51   1  -1  -0.2  0     -0.2
14      -0.18  0.05   0.29  0   1   0  0.34   0  0   0     0     0
15      -0.18  0.05   0.29  1   0   0  0.11   0  0   0     0     0
16      -0.18  0.05   0.29  0   0   1  0.29   0  1   0     0     0.2
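The failure can be demonstrated directly (our own sketch): the same delta-rule update that converged for OR keeps cycling on XOR, because XOR is not linearly separable, so no epoch ever reaches zero errors.

```python
# Run the delta-rule perceptron on XOR and record the errors per epoch.
def f(x):
    return 1 if x > 0.5 else 0

def run_epoch(w1, w2, w0, data, eta=0.2):
    errors = 0
    for x1, x2, y in data:
        eps = y - f(w0 + w1 * x1 + w2 * x2)
        if eps != 0:
            errors += 1
        w1 += eta * eps * x1
        w2 += eta * eps * x2
        w0 += eta * eps
    return w1, w2, w0, errors

xor = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]
w1, w2, w0 = 0.02, -0.15, 0.09
history = []
for _ in range(100):
    w1, w2, w0, e = run_epoch(w1, w2, w0, xor)
    history.append(e)
# every epoch misclassifies at least one of the four patterns: an epoch with
# zero errors would mean the current line separates XOR, which is impossible
```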
Geometric interpretation of the learning law
Multi-Layer Perceptron
• Multi-Layer Perceptron = Artificial Neural Network = FeedForward Network = Fully Connected Network

Over the 15 years (1969-1984) some research continued ...
• a hidden layer of nodes allowed combinations of linear functions
• non-linear activation functions displayed properties closer to real neurons:
  – output varies continuously but not linearly
  – differentiable .... sigmoid
• a non-linear ANN classifier was possible:

f(a) = 1 / (1 + e^(-a))
Collection of Artificial Neurons
[Diagram: input nodes I1 I2 I3 I4, a layer of hidden nodes, and output nodes O1 O2 - "distributed processing and representation"]

A 3-layer network has 2 active layers.
Mathematical formulation
• The output of the i-th neuron in the j-th layer is computed as

o_j = f( Σ_{i=1}^{N} w_ij x_i + w_0j )

Where:
• f() is a non-linear function
• the x_i are: inputs, for the first layer; outputs of the previous layer, otherwise
• w_ij – weights
• w_0j – bias
Learning: - find the values of all weights
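The layer formula above can be sketched as follows (a minimal illustration of ours, not code from the lecture), with each layer feeding the next:

```python
import math

def sigmoid(a):
    # the non-linear function f()
    return 1.0 / (1.0 + math.exp(-a))

def layer_forward(x, W, b):
    # W[j][i] is the weight from input i to neuron j; b[j] is neuron j's bias
    return [sigmoid(sum(wji * xi for wji, xi in zip(Wj, x)) + bj)
            for Wj, bj in zip(W, b)]

def mlp_forward(x, layers):
    # layers: list of (W, b) pairs; the outputs of one layer are the
    # inputs of the next, exactly as in the formula above
    for W, b in layers:
        x = layer_forward(x, W, b)
    return x
```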
Training MLPs
• Initially there was no learning algorithm to adjust the weights of a multi-layer network
  – weights had to be set by hand
• How could the weights below the hidden layer be updated?

The Back-propagation Algorithm
• 1986: the solution to the multi-layer ANN weight-update problem was rediscovered
• Conceptually simple - the global error is propagated backward to the network nodes, and weights are modified in proportion to their contribution
• The most important ANN learning algorithm
• It became known as back-propagation because the error is sent back through the network to correct all weights
The Back-Propagation Algorithm
• Like the Perceptron, the calculation of error is based on the difference between target and actual output:

L = ½ Σ_j (y_j - o_j)²

• However, in BP it is the rate of change of the error which is the important feedback through the network - the generalized delta rule:

Δw_ij = -η ∂L/∂w_ij

• Relies on the sigmoid activation function
The Back-Propagation Algorithm
Objective: compute ∂L/∂w_ij for all weights.
Definitions:
• w_ij = weight from node i to node j
• x_j = Σ_{i=0}^{n} w_ij o_i = total weighted input of node j
• o_j = f(x_j) = 1/(1 + e^(-x_j)) = output of node j
• L = error (loss) for one pattern over all output nodes
The Back-Propagation Algorithm
Objective: compute the derivatives ∂L/∂w_ij for all w_ij

Four-step process:
1. Compute how fast the error changes as the output of node j is changed
2. Compute how fast the error changes as the total input to node j is changed
3. Compute how fast the error changes as the weight w_ij coming into node j is changed
4. Compute how fast the error changes as the output of node i in the previous layer is changed
The Back-Propagation Algorithm
On-Line algorithm:
1. Initialize weights
2. Present a pattern (training example) x_i and target output y_i
3. Compute output: o_j = f( Σ_{i=0}^{n} w_ij o_i )
4. Update weights: w_ij(t+1) = w_ij(t) + Δw_ij, where Δw_ij = -η ∂E/∂w_ij
Repeat starting at 2 until an acceptable level of error is reached
Back-Propagation
Where:

Δw_ij = η ε_j o_i = -η ∂L/∂w_ij

For output nodes:

ε_j = o_j (1 - o_j) (y_j - o_j)

For hidden nodes:

ε_i = o_i (1 - o_i) Σ_j ε_j w_ij
BackPropagation
• All node outputs assume a sigmoid function:

f(a) = 1 / (1 + e^(-a))

• We need to compute its derivative:

f'(a) = f(a) (1 - f(a))
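This identity is easy to check numerically (a sketch of ours) by comparing it against a finite-difference estimate of the sigmoid's slope:

```python
import math

def f(a):
    return 1.0 / (1.0 + math.exp(-a))

def f_prime(a):
    # the derivative identity used by back-propagation
    return f(a) * (1.0 - f(a))

h = 1e-6
for a in (-2.0, 0.0, 1.5):
    numeric = (f(a + h) - f(a - h)) / (2 * h)  # central difference
    assert abs(numeric - f_prime(a)) < 1e-8
```

The maximum slope, f'(0) = 0.25, occurs at the sigmoid's midpoint.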
The Back-propagation Algorithm
Visualizing the BackProp learning process:
The algorithm performs a gradient descent in weight space toward a minimum level of error, using a fixed step size (the learning rate η).
The gradient is given by:

∂E/∂w_ij = rate at which the error changes as the weights change
The Back-propagation Algorithm
Momentum Descent: minimization can be sped up if an additional momentum term α[w_ij(t) - w_ij(t-1)] is added to the update equation:

Δw_ij(t) = η δ_j o_i + α Δw_ij(t-1), with 0 < α < 1

Thus the effective learning rate is augmented, varying the amount a weight is updated:
• Analogous to the momentum of a ball - maintains direction
• Rolls through small local minima
• Increases the weight update when on a stable gradient
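The momentum term can be sketched per weight (our own illustration): each weight remembers its previous step Δw(t-1) and adds α times it to the new gradient step, so on a stable gradient the effective step grows toward (gradient step)/(1 - α).

```python
def momentum_step(grad_step, prev_delta, alpha=0.9):
    # grad_step holds eta * delta_j * o_i for each weight;
    # prev_delta holds the previous Delta w(t-1), same shape
    return [g + alpha * p for g, p in zip(grad_step, prev_delta)]

# with a constant gradient step of 0.1 and alpha = 0.5,
# the step converges to 0.1 / (1 - 0.5) = 0.2
delta = [0.0, 0.0]
for _ in range(50):
    delta = momentum_step([0.1, -0.1], delta, alpha=0.5)
```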
The Back-propagation Algorithm
Line Search Techniques:
• Steepest and momentum descent use only the gradient of the error surface
• More advanced techniques explore the weight space using various heuristics
• The most common is to search ahead in the direction defined by the gradient
The Back-propagation Algorithm
On-line vs. Batch algorithms:
• The batch (or cumulative) method reviews a set of training examples known as an epoch and computes the global error:

E = ½ Σ_p Σ_j (t_j - o_j)²

• Weight updates are based on this cumulative error signal
• On-line is more stochastic and typically a little more accurate; batch is more efficient
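The batch error above accumulates over all patterns p and all output nodes j before any update, which can be sketched as (our own illustration):

```python
def batch_error(targets, outputs):
    # E = 1/2 * sum over patterns p and output nodes j of (t_j - o_j)^2
    return 0.5 * sum((t - o) ** 2
                     for tp, op in zip(targets, outputs)
                     for t, o in zip(tp, op))
```

An on-line algorithm would instead update the weights after each individual pattern's error.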
The Back-propagation Algorithm
Several Questions:
• What is BP's inductive bias?
• Can BP get stuck in local minimum?
• How does learning time scale with size of the network & number of training examples?
• Is it biologically plausible?
• Do we have to use the sigmoid activation function?
• How well does a trained network generalize to unseen test cases?
Example
• Taken from: https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/
• Network: [diagram of the network]
Example
• Initial weights

The Forward Pass
Here's how we calculate the total net input for h1:
We use the logistic function to get the output of h1:
Carrying out the same process for h2 we get:
Example
Continuing the forward pass
• We repeat this process for the output layer neurons, using the output from the hidden layer neurons as inputs.
• Here's the output for o1:
And carrying out the same process for o2 we get:
Example
Calculating the Total Error
We can now calculate the error for each output neuron using the squared error function and sum them to get the total error:
For example, the target output for o1 is 0.01 but the neural network outputs 0.75136507; therefore its error is:
Repeating this process for o2 (remembering that the target is 0.99) we get:
The total error for the neural network is the sum of these errors:
Example
The Backwards Pass
Our goal with backpropagation is to update each of the weights in the network so that they cause the actual output to be closer to the target output, thereby minimizing the error for each output neuron and the network as a whole.

Output Layer
Consider w5. We want to know how much a change in w5 affects the total error, i.e. ∂E_total/∂w5.
By applying the chain rule we know that:
Visually, here’s what we’re doing:
Example
We need to figure out each piece in this equation.
First, how much does the total error change with respect to the output?
When we take the partial derivative of the total error with respect to out_o1, the quantity ½(target_o2 - out_o2)² becomes zero, because out_o1 does not affect it, which means we are taking the derivative of a constant, which is zero.
Example
Next, how much does the output of o1 change with respect to its total net input?
The partial derivative of the logistic function (a.k.a. sigmoid) is the output multiplied by 1 minus the output:
Finally, how much does the total net input of o1 change with respect to w5 ?
Putting it all together:
Example
You'll often see this calculation combined in the form of the delta rule:
Alternatively, we have ∂E_total/∂out_o1 and ∂out_o1/∂net_o1, which can be combined and written as ∂E_total/∂net_o1, aka δ_o1, the node delta. We can use this to rewrite the calculation above:
Therefore:
Some sources extract the negative sign from δ, so it would be written as:
To decrease the error, we then subtract this value from the current weight (multiplied by some learning rate, eta, which is set here to 0.5):
We can repeat this process to get the new output-layer weights w6, w7 and w8:
Example
Hidden Layer
Next, we'll continue the backwards pass by calculating new values for w1, w2, w3 and w4. We need to figure out:
The process is similar to the one for the output layer, but slightly different, to account for the fact that the output of each hidden-layer neuron contributes to the output (and therefore the error) of multiple output neurons. We know that out_h1 affects both out_o1 and out_o2, therefore ∂E_total/∂out_h1 needs to take into consideration its effect on both output neurons:
We can calculate ∂E_o1/∂out_h1 using values we calculated earlier:
And ∂net_o1/∂out_h1 is equal to w5:
Example
Plugging them in:
Following the same process for ∂E_o2/∂out_h1, we get:
Therefore:
Now that we have ∂E_total/∂out_h1, we need to figure out ∂out_h1/∂net_h1 and then ∂net_h1/∂w for each weight:
We calculate the partial derivative of the total net input to h1 with respect to w1 the same as we did for the output neuron:
Putting it all together:
Example
We can now update w1:
Repeating this for w2, w3 and w4:
Finally, we’ve updated all of our weights!
When we fed forward the 0.05 and 0.1 inputs originally, the error on the network was 0.298371109. After this first round of backpropagation, the total error is now down to 0.291027924. It might not seem like much, but after repeating this process 10,000 times, for example, the error plummets to 0.0000351085. At this point, when we feed forward 0.05 and 0.1, the two output neurons generate 0.015912196 (vs 0.01 target) and 0.984065734 (vs 0.99 target).
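The forward pass and the w5 update of this example can be reproduced numerically (a sketch of ours; note the initial weights w1..w8 and biases b1, b2 below are taken from the linked article's network figure, not from this text, and δ uses the article's sign convention δ = ∂E/∂net):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

i1, i2 = 0.05, 0.10          # inputs
t1, t2 = 0.01, 0.99          # targets for o1, o2
# assumed initial weights/biases, from the article's figure
w1, w2, w3, w4, b1 = 0.15, 0.20, 0.25, 0.30, 0.35
w5, w6, w7, w8, b2 = 0.40, 0.45, 0.50, 0.55, 0.60
eta = 0.5

# forward pass
out_h1 = sigmoid(w1 * i1 + w2 * i2 + b1)
out_h2 = sigmoid(w3 * i1 + w4 * i2 + b1)
out_o1 = sigmoid(w5 * out_h1 + w6 * out_h2 + b2)
out_o2 = sigmoid(w7 * out_h1 + w8 * out_h2 + b2)
E_total = 0.5 * (t1 - out_o1) ** 2 + 0.5 * (t2 - out_o2) ** 2

# backward pass: node delta for o1, then one gradient step on w5
delta_o1 = (out_o1 - t1) * out_o1 * (1 - out_o1)
w5_new = w5 - eta * delta_o1 * out_h1
```

Under these assumed weights this reproduces the article's numbers: out_o1 = 0.75136507, E_total = 0.298371109, and the updated w5 = 0.35891648.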