LINEAR CLASSIFICATION
Biological inspirations
Some numbers…
- The human brain contains about 10 billion nerve cells (neurons)
- Each neuron is connected to the others through about 10,000 synapses

Properties of the brain
- It can learn and reorganize itself from experience
- It adapts to the environment
- It is robust and fault tolerant
Biological neuron (simplified model)
A neuron has
- A branching input structure (the dendrites)
- A branching output structure (the axon)

The information circulates from the dendrites to the axon via the cell body.
The cell body sums up the inputs in some way and fires (generates a signal through the axon) if the result is greater than some threshold.
An Artificial Neuron
[Figure: an artificial neuron combining its inputs through weights]
Definition: a nonlinear, parameterized function with a restricted output range.
Activation Function
Usually not pictured (we’ll see why), but you can imagine a threshold parameter here.
Same Idea using the Notation in the Book
The Output of a Neuron
As described so far…
This simplest form of a neuron is also called a perceptron.
The Output of a Neuron
Other possibilities, such as the sigmoid function for continuous output.
g(in_j) = 1 / (1 + e^(−k · in_j))

where g(in_j) is the activation of the neuron and k is a parameter which controls the shape of the curve.
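The sigmoid neuron above can be sketched directly in code. This is a minimal illustration, not from the slides; the slope parameter k here is an assumption standing in for the shape-controlling parameter mentioned above.

```python
import math

def sigmoid(x, k=1.0):
    """Logistic activation: squashes any real input into (0, 1).
    k is an assumed slope parameter controlling the curve's shape."""
    return 1.0 / (1.0 + math.exp(-k * x))

def neuron_output(weights, inputs):
    """Activation of one neuron: weighted sum of inputs through the sigmoid."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return sigmoid(total)
```

Note that sigmoid(0) = 0.5, and the output approaches 0 or 1 as the weighted sum grows large in either direction — a smooth version of the threshold behavior.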
Linear Regression using a Perceptron
Linear regression:
Find a linear function (straight line) that best predicts the continuous-valued output.
Linear Regression As an Optimization Problem
Finding the optimal weights could be solved through:
- Gradient descent
- Simulated annealing
- Genetic algorithms
- … and now: neural networks
Linear Regression using a Perceptron
With input x (weight w1) and a constant input 1 (weight w0), the perceptron computes:

f(x) = w1·x + w0
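As a sketch of how such a single neuron can fit a line, the loop below runs stochastic gradient descent on squared error; the data, learning rate, and iteration count are illustrative assumptions, not from the slides.

```python
# Fit f(x) = w1*x + w0 by gradient descent on squared error.
# Illustrative data drawn from the known line y = 2x + 1 (no noise).
data = [(x, 2.0 * x + 1.0) for x in [0.0, 1.0, 2.0, 3.0, 4.0]]

w0, w1 = 0.0, 0.0
alpha = 0.05                       # assumed learning rate
for _ in range(2000):
    for x, y in data:
        err = y - (w1 * x + w0)    # prediction error on this sample
        w1 += alpha * err * x      # gradient step for the slope weight
        w0 += alpha * err          # bias weight: its input is the constant 1
```

After training, (w0, w1) is close to the generating line (1, 2), because on noiseless data the only fixed point of the updates is zero error on every sample.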
The Bias Term
So far we have defined the output of a perceptron as controlled by a threshold:

x1w1 + x2w2 + x3w3 + … + xnwn ≥ t

But just like the weights, this threshold is a parameter that needs to be adjusted.
Solution: make it another weight:

x1w1 + x2w2 + x3w3 + … + xnwn + (1)(−t) ≥ 0

The weight −t on the constant input 1 is the bias term.
A Neuron with a Bias Term
Another Example
Assign weights to perform the logical OR operation.
A perceptron with inputs A (weight w2) and B (weight w1) and a constant input 1 (weight w0) fires when:

w2·A + w1·B + w0 ≥ 0

w0 = ?
w1 = ?
w2 = ?
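The weights are left blank as an exercise on the slide. One assignment that happens to satisfy all four input cases (an illustrative answer, not the only one) is w0 = −0.5, w1 = w2 = 1, which a short check confirms:

```python
def perceptron_or(A, B, w0=-0.5, w1=1.0, w2=1.0):
    """Threshold unit: fires (returns 1) when w2*A + w1*B + w0 >= 0.
    The default weights are one illustrative solution for logical OR."""
    return 1 if w2 * A + w1 * B + w0 >= 0 else 0
```

Any weights with w0 < 0, w1 + w0 ≥ 0, and w2 + w0 ≥ 0 would work equally well.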
Artificial Neural Network (ANN)
- A mathematical model to solve engineering problems
- A group of highly connected neurons realizing compositions of nonlinear functions

Tasks
- Classification
- Discrimination
- Estimation
Feed-Forward Neural Networks
- The information is propagated from the inputs to the outputs
- There are no cycles between outputs and inputs: the state of the system is not preserved from one iteration to another
[Figure: inputs x1, x2, …, xn feeding a 1st hidden layer, a 2nd hidden layer, and an output layer]
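A forward pass through such a layered network can be sketched as follows; the weight layout (bias first, one row per node) is an assumed convention for this example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer(weights, inputs):
    """One fully connected layer. Each row of `weights` describes one node:
    [bias, w_1, ..., w_n], where the bias multiplies a constant input of 1."""
    return [sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], inputs)))
            for w in weights]

def feed_forward(network, inputs):
    """Propagate the inputs through each layer in turn: no cycles,
    and no state carried over between iterations."""
    for weights in network:
        inputs = layer(weights, inputs)
    return inputs
```

For example, a network whose weights are all zero outputs sigmoid(0) = 0.5 regardless of its inputs.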
ANN Structure
- A finite number of inputs
- Zero or more hidden layers
- One or more outputs
- All nodes in the hidden and output layers contain a bias term
Examples
- Handwriting character recognition
- Control of a virtual agent
- ALVINN: a neural-network-controlled AGV (1994)
http://blog.davidsingleton.org/nnrccar
Learning
The procedure of estimating the weight parameters so that the whole network can perform a specific task.
The learning process (supervised)
- Present the network with a number of inputs and their corresponding outputs
- See how closely the actual outputs match the desired ones
- Modify the parameters to better approximate the desired outputs
Perceptron Learning Rule
1. Initialize the weights to some random values (or 0)
2. For each sample (x, y) in the training set:
   a. Calculate the current output of the perceptron, h(x)
   b. Update the weights: w_i ← w_i + α · (y − h(x)) · x_i
3. Repeat until the error is smaller than some predefined threshold

α is the learning rate.
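The steps above can be sketched as follows, assuming a step activation (threshold at 0, with the bias as weight 0) and a fixed number of epochs in place of the error-threshold test:

```python
def train_perceptron(samples, n_inputs, alpha=0.1, epochs=100):
    """Perceptron learning rule: w_i <- w_i + alpha * (y - h(x)) * x_i.
    weights[0] is the bias weight; its input is the constant 1."""
    weights = [0.0] * (n_inputs + 1)
    for _ in range(epochs):
        for x, y in samples:
            total = weights[0] + sum(w * xi for w, xi in zip(weights[1:], x))
            h = 1 if total >= 0 else 0      # step activation
            err = y - h
            weights[0] += alpha * err       # bias input is 1
            for i, xi in enumerate(x):
                weights[i + 1] += alpha * err * xi
    return weights

# Logical AND is linearly separable, so the rule converges on it.
and_samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w = train_perceptron(and_samples, 2)
```

On separable data like AND, the perceptron convergence theorem guarantees the loop stops making updates after finitely many epochs.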
Linear Separability
Perceptrons can classify any input that is linearly separable.
For more complex problems we need a more complex model.
Different Non-Linearly Separable Problems
Structure      Types of decision regions
Single-Layer   Half plane bounded by a hyperplane
Two-Layer      Convex open or closed regions
Three-Layer    Arbitrary (complexity limited by the number of nodes)

[Figure panels: each structure's decision regions illustrated on the exclusive-OR problem, on classes with meshed regions, and on the most general region shapes]
Calculating the Weights
The weights form a vector of parameters for which we need to find a global optimum.
This could be solved by:
- Simulated annealing
- Gradient descent
- Genetic algorithms
http://www.youtube.com/watch?v=0Str0Rdkxxo
Perceptron learning rule is pretty much gradient descent.
Learning the Weights in a Neural Network
The perceptron learning rule (gradient descent) worked before, but it required us to know the correct output of the node.
How do we know the correct output of a given hidden node?
Backpropagation Algorithm
- Gradient descent over the entire network weight vector
- Easily generalized to arbitrary directed graphs
- Will find a local, not necessarily global, error minimum
- In practice it often works well (can be invoked multiple times with different initial weights)
Backpropagation Algorithm
1. Initialize the weights to some random values (or 0)
2. For each sample in the training set:
   a. Calculate the current output h(x_j) of each node j
   b. For each output node k, compute its error term:
      Δ_k = (y_k − h(x_k)) · h(x_k) · (1 − h(x_k))
   c. For each hidden node j, propagate the error backwards:
      Δ_j = h(x_j) · (1 − h(x_j)) · Σ_k w_{j,k} · Δ_k
   d. For all network weights, update:
      w_{i,j} ← w_{i,j} + α · Δ_j · x_j
3. Repeat until the weights converge or the desired accuracy is achieved
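One update step of this algorithm, written out for a tiny 2-2-1 network of sigmoid nodes, might look like the sketch below; the starting weights, learning rate, and training sample are illustrative assumptions. Repeating the step on a single sample drives the network's output toward its target.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Tiny 2-2-1 network; per node: [bias, w_from_input_1, w_from_input_2].
hidden = [[0.1, 0.4, -0.3], [0.2, -0.2, 0.5]]   # two hidden nodes
out = [0.3, 0.6, -0.4]                           # one output node

def forward(x):
    h = [sigmoid(w[0] + w[1] * x[0] + w[2] * x[1]) for w in hidden]
    o = sigmoid(out[0] + out[1] * h[0] + out[2] * h[1])
    return h, o

def backprop_step(x, y, alpha=0.5):
    """One pass of the update rules above:
    output:  delta_o = (y - o) * o * (1 - o)
    hidden:  delta_j = h_j * (1 - h_j) * w_j * delta_o
    weights: w <- w + alpha * delta * input"""
    h, o = forward(x)
    delta_o = (y - o) * o * (1 - o)
    # hidden deltas must use the output weights *before* they are updated
    delta_h = [h[j] * (1 - h[j]) * out[j + 1] * delta_o for j in range(2)]
    out[0] += alpha * delta_o                    # bias input is 1
    for j in range(2):
        out[j + 1] += alpha * delta_o * h[j]
        hidden[j][0] += alpha * delta_h[j]
        hidden[j][1] += alpha * delta_h[j] * x[0]
        hidden[j][2] += alpha * delta_h[j] * x[1]

x, y = (1.0, 0.0), 1.0
_, before = forward(x)
for _ in range(200):
    backprop_step(x, y)
_, after = forward(x)
```

After the repeated updates, the output on this sample is strictly closer to the target than it was before training.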
Intuition
General idea: hidden nodes are "responsible" for some of the error at the output nodes they connect to.
The change in the hidden weights is proportional to the strength (magnitude) of the connection between the hidden node and the output node.
This is the same as the perceptron learning rule, but for a sigmoid decision function instead of a step decision function (full derivation on p. 726)
w_i ← w_i + α · (y_j − h(x_j)) · h(x_j) · (1 − h(x_j)) · x_j
Questions