Neural network for machine learning


INTRODUCTION TO NEURAL NETS

WHAT IS MACHINE LEARNING?

It's very hard to write programs that solve problems like recognizing a 3D object from a novel viewpoint. Even if we could write such a program, it would be very complicated.

It's hard to write a program that detects credit card fraud. There are no specific rules that are simple and reliable, so we need to combine a large number of weak rules. Fraud is a moving target, so the program needs to keep changing.

THE MACHINE LEARNING APPROACH

Instead of writing a program by hand for each specific task, we collect a lot of examples that specify the correct output for a given input.

The machine learning algorithm then takes these examples and produces a program that does the job. The resulting program looks very different from a typical hand-written program. If training goes well, it works for new cases as well as for the ones it was trained on. If the data changes, the program can change too by training on the new data.

Massive amounts of computation are now cheaper than paying someone to write a task-specific program.

REASONS TO STUDY NEURAL NETWORKS

To study how the brain actually works. It is very big and complicated, so we need computer simulation.

To understand the style of parallel computation inspired by neurons and their adaptive connections. This is very different from sequential computation. It should be good at things the brain is good at (e.g. vision) and bad at things the brain is bad at (e.g. computing 24 x 44).

To solve practical problems by using novel learning algorithms inspired by the brain.

IT BEGAN WITH MCCULLOCH AND PITTS

Revolutionary Idea: think of neural tissue as circuits performing mathematical computation

THE MCCULLOCH-PITTS NEURON

Linear weighted sum of inputs

Non-linear, possibly stochastic transfer function

Learning rule

A TYPICAL CORTICAL NEURON

Gross physical structure: there is one axon that branches, and a dendritic tree that collects input from other neurons.

Axons typically contact dendritic trees at synapses. A spike of activity in the axon causes charge to be injected into the post-synaptic neuron.

Spike generation: there is an axon hillock that generates outgoing spikes whenever enough charge has flowed in at synapses to depolarize the cell membrane.

IDEALIZED NEURONS

To model things we have to idealize them (e.g. atoms)

Idealization removes complicated details that are not essential for understanding the main principles

It allows us to apply mathematics and to make analogies to other familiar systems

It's worth understanding models that are known to be wrong, e.g. neurons that communicate real values rather than discrete spikes of activity

LINEAR NEURONS

These are simple but computationally limited

If we can make them learn, we may get insight into more complicated neurons

BINARY THRESHOLD NEURONS

First compute the weighted sum of the inputs

Then send out a fixed-size spike of activity if the weighted sum exceeds a threshold

There are two equivalent ways to write the equations for a binary threshold neuron: with an explicit threshold, or with a bias equal to minus the threshold

RECTIFIED LINEAR NEURONS

Also called threshold linear neurons

They compute a linear weighted sum of their inputs

The output is a non-linear function of the total input: zero when the total input is negative, and equal to the total input otherwise

SIGMOID NEURONS

These give a real-valued output that is a smooth and bounded function of their total input

Typically they use the logistic function

They have nice derivatives, which makes learning easy

STOCHASTIC BINARY NEURONS

They use the same equation as logistic units

They treat the output of the logistic as the probability of producing a spike in a short time window
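
The neuron types above differ only in the function applied to the total input. The following is a minimal sketch of that idea, not code from the slides: the function names are made up for illustration, and the binary threshold unit assumes the threshold has been folded into the bias so that it fires when the total input is at least zero.

```python
import math
import random

def weighted_sum(x, w, b=0.0):
    """Total input z = b + sum_i x_i * w_i."""
    return b + sum(xi * wi for xi, wi in zip(x, w))

def linear(z):
    return z

def binary_threshold(z):
    # Assumes the threshold is folded into the bias: fire when z >= 0.
    return 1 if z >= 0 else 0

def rectified_linear(z):
    return z if z > 0 else 0.0

def logistic(z):
    # Numerically stable form of 1 / (1 + exp(-z)).
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def stochastic_binary(z, rng=random):
    # Treat the logistic output as the probability of emitting a spike.
    return 1 if rng.random() < logistic(z) else 0

z = weighted_sum([1.0, 2.0], [0.5, -0.25], b=0.1)  # 0.1 + (0.5 - 0.5)
```

All five units share `weighted_sum`; only the output function changes.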

TYPES OF LEARNING TASK

Supervised Learning: learn to predict the output when given an input vector

Reinforcement Learning: learn to select an action to maximize payoff

Unsupervised Learning: discover a good internal representation of the input

SUPERVISED LEARNING

Each training case consists of an input vector x and a target output t

Regression: the target output is a real number or a whole vector of real numbers, e.g. the price of a stock in six months' time, or the temperature at noon tomorrow

Classification: the target output is a class label. The simplest case is a choice between 1 and 0; we can also have multiple alternative labels

Working: we start by choosing a model class, y = f(x, w). A model class f is a way of using some numerical parameters w to map each input vector x into a predicted output y
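
As an illustration of a model class y = f(x, w), here is a small sketch that assumes a linear model with a bias term and measures the fit with squared error. Both choices, and the toy data, are assumptions made for the example; the slides do not fix a particular f or error measure at this point.

```python
def f(x, w):
    """A linear model class: y = w[0] + w[1]*x[0] + w[2]*x[1] + ..."""
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))

def squared_error(cases, w):
    """Half the summed squared difference between target t and prediction y."""
    return 0.5 * sum((t - f(x, w)) ** 2 for x, t in cases)

# Hypothetical training cases (input vector x, target t) with t = 1 + 2x.
cases = [([0.0], 1.0), ([1.0], 3.0), ([2.0], 5.0)]
good_w = [1.0, 2.0]   # parameters that reproduce every target
bad_w = [0.0, 0.0]    # parameters that always predict 0
```

Different settings of w pick out different members of the same model class; learning means searching for a w with a small discrepancy between targets and predictions.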

REINFORCEMENT LEARNING

In reinforcement learning, the output is an action or sequence of actions, and the only supervisory signal is an occasional scalar reward. The goal in selecting each action is to maximize the expected sum of the future rewards.

Reinforcement learning is difficult. The rewards are typically delayed, so it's hard to know where we went wrong, and a scalar reward does not supply much information.

ARCHITECTURE OF NEURAL NETWORKS

The architecture of a neural network is the way in which the neurons are connected to each other

FEED-FORWARD NEURAL NETWORK

This is the most common type. The first layer is the input and the last layer is the output

If there is more than one hidden layer, we call it a “deep” neural network

They compute a series of transformations that change the similarities between cases

The activities of the neurons in each layer are a non-linear function of the activities in the layer below
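
The layer-by-layer computation can be sketched as follows. This is an illustrative forward pass only: the logistic non-linearity, the 2-3-1 layout and the hand-picked weights are assumptions, not values from the slides.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(activities, weights, biases):
    """One layer: each unit applies a logistic to a weighted sum of the layer below."""
    return [logistic(b + sum(w * a for w, a in zip(row, activities)))
            for row, b in zip(weights, biases)]

def forward(x, layers):
    """Pass the input through every (weights, biases) pair in order."""
    activities = x
    for weights, biases in layers:
        activities = layer(activities, weights, biases)
    return activities

# Hypothetical 2-3-1 network with hand-picked weights.
net = [
    ([[0.5, -0.5], [1.0, 1.0], [-1.0, 0.5]], [0.0, -1.0, 0.5]),  # hidden layer
    ([[1.0, -1.0, 0.5]], [0.0]),                                  # output layer
]
y = forward([1.0, 0.0], net)
```

Each row of a weight matrix holds the incoming weights of one unit, so adding a hidden layer is just adding one more (weights, biases) pair to the list.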

RECURRENT NEURAL NETWORKS

These have directed cycles in their connection graph. This means that you can sometimes get back to where you started by following the arrows

They can have complicated dynamics, and this can make them very difficult to train

They are more biologically realistic

They are a natural way to model sequential data. They are equivalent to very deep nets with one hidden layer per time slice, except that they use the same weights at every time slice and get input at every time slice

They have the ability to remember information in their hidden state for a long time, but it's hard to train them to use this potential

SYMMETRICALLY CONNECTED NETWORK

Like recurrent networks, but the connections between units are symmetrical (they have the same weight in both directions)

They are much easier to analyze than recurrent networks

They are more restricted in what they can do because they obey an energy function, e.g. they cannot model cycles

Symmetrically connected nets without hidden units are called Hopfield nets

PERCEPTRONS

PERCEPTRONS ARE LINEAR CLASSIFIERS

DECISION BOUNDARY OFF THE ORIGIN?

WEIGHT SPACE

The space has one dimension for each weight

A point in the space represents a particular setting of all the weights

Each training case represents a hyperplane

The weights must lie on one side of this hyperplane to get the answer correct

THRESHOLD VS BIASES

PERCEPTRON LEARNING RULE

PERCEPTRON CONVERGENCE THEOREM

Theorem: If a problem is linearly separable, then a perceptron will learn it in a finite number of steps
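
The slide with the perceptron learning rule itself did not survive extraction. The sketch below uses the standard form of the rule (add the input vector when the output should have been 1, subtract it when the output should have been 0, and leave the weights alone when the output is correct), with the bias treated as a weight on a constant input of 1. The AND data set and the function names are illustrative assumptions.

```python
def predict(w, x):
    # x already includes a constant 1 as its last component (the bias input).
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else 0

def train_perceptron(cases, n_inputs, max_epochs=100):
    w = [0.0] * (n_inputs + 1)
    for _ in range(max_epochs):
        mistakes = 0
        for x, t in cases:
            x = list(x) + [1.0]
            if predict(w, x) != t:
                mistakes += 1
                sign = 1.0 if t == 1 else -1.0   # add x if t = 1, subtract if t = 0
                w = [wi + sign * xi for wi, xi in zip(w, x)]
        if mistakes == 0:                        # every case correct: stop
            break
    return w

# Logical AND is linearly separable, so training stops after finitely many updates.
and_cases = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w = train_perceptron(and_cases, n_inputs=2)
```

On a problem that is not linearly separable (XOR, for example) the loop would simply run until max_epochs without ever reaching zero mistakes.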

BACKPROPAGATION ALGORITHM

MORE THAN 2 CLASSES

THE LEAST MEAN SQUARE LEARNING ALGORITHM

LMS/WIDROW-HOFF RULE

Works fine for a single layer of trainable weights, but what about multi-layer networks?

WHAT CAN BE DONE WITH NON-LINEAR UNITS?

HOW DO WE TRAIN A MULTI-LAYER NETWORK?


LEARNING THE WEIGHTS OF A LINEAR NEURON

In a perceptron, the weights are always getting closer to a good set of weights

In a linear neuron, the outputs are always getting closer to the target outputs

Why can't the perceptron convergence procedure be generalized to hidden layers? The perceptron learning algorithm works by ensuring that every time the weights change, they get closer to a generously feasible set of weights. This type of guarantee cannot be extended to more complex networks

For a linear neuron we instead show that the actual output values get closer to the target values. This need not hold for a perceptron, where the outputs may move away from the target outputs even as the weights improve
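
A minimal sketch of learning the weights of a linear neuron with the LMS (Widrow-Hoff, or delta) rule: after each training case, every weight moves in proportion to its input and to the residual error t - y. The learning rate, the number of epochs and the toy data are assumptions chosen so that the example converges quickly.

```python
def lms_train(cases, n_inputs, learning_rate=0.05, epochs=200):
    """Online delta rule for a linear neuron: w_i += rate * x_i * (t - y)."""
    w = [0.0] * n_inputs
    for _ in range(epochs):
        for x, t in cases:
            y = sum(wi * xi for wi, xi in zip(w, x))     # linear output
            w = [wi + learning_rate * xi * (t - y) for wi, xi in zip(w, x)]
    return w

# Hypothetical data generated exactly by the weights (2, -1).
cases = [((1.0, 0.0), 2.0), ((0.0, 1.0), -1.0),
         ((1.0, 1.0), 1.0), ((2.0, 1.0), 3.0)]
w = lms_train(cases, n_inputs=2)
```

Because the targets here are exactly realizable, the outputs approach the targets and the weights approach (2, -1); with noisy targets the weights would instead hover near the least-squares solution.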

BEHAVIOR OF ITERATIVE LEARNING PROCEDURE

Does the learning procedure eventually get the right answer? There may be no perfect answer, but by making the learning rate small enough we can get very close to the best answer.

How quickly do the weights converge? It can be very slow if the input dimensions are highly correlated

OPTIMIZATION TECHNIQUES

PARAMETER OPTIMIZATION

Selection of parameter values which are optimal in some desired sense, e.g. minimizing an objective function over a dataset

The parameters are the weights and biases

Training neural nets is iterative and time consuming, so it is in our interest to reduce training time

Methods: gradient descent, line search, conjugate gradient search

LINEAR OPTIMIZATION

NON-LINEAR OPTIMIZATION

THE PARAMETER SPACE

APPROXIMATING ERROR SURFACE BEHAVIOR

NEAR A MINIMUM

ERROR SURFACE FOR A LINEAR NEURON

The error surface lies in a space with one horizontal axis for each weight and one vertical axis for the error

For a linear neuron with squared error, it is a quadratic bowl

Vertical cross-sections are parabolas

Horizontal cross-sections are ellipses

CONVERGENCE SPEED OF FULL BATCH LEARNING

The gradient is big in the direction in which we only want to travel a small distance

The gradient is small in the direction in which we want to travel a large distance

HOW THE LEARNING GOES WRONG

If the learning rate is big, the weights slosh to and fro across the ravine. If the learning rate is too big, this oscillation diverges

What we would like to achieve:

Move quickly in directions with small but consistent gradients

Move slowly in directions with big but inconsistent gradients
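
The effect of the learning rate can be seen on a deliberately elongated quadratic bowl. The error function E = 0.5 * (100 * w1^2 + w2^2), the starting point and the three learning rates below are assumptions picked for the illustration.

```python
def descend(learning_rate, steps=50):
    """Gradient descent on E = 0.5 * (100 * w1**2 + w2**2), an elongated bowl."""
    w1, w2 = 1.0, 1.0
    for _ in range(steps):
        g1, g2 = 100.0 * w1, w2      # the gradient is steep across the ravine (w1)
        w1 -= learning_rate * g1
        w2 -= learning_rate * g2
    return w1, w2

small = descend(0.001)    # stable, but w2 has barely moved after 50 steps
big = descend(0.019)      # w1 flips sign every step while slowly shrinking
too_big = descend(0.021)  # w1 oscillates with growing amplitude: divergence
```

Each step multiplies w1 by (1 - 100 * rate) and w2 by (1 - rate), so any rate above 0.02 makes the steep direction diverge while the shallow direction is still crawling.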

GRADIENT DESCENT SEARCH PROS AND CONS

Straightforward, iterative, tractable, locally optimal descent in error

Cannot avoid local minima and cannot escape them (it may overshoot them)

Cannot guarantee a scalable bound on time complexity

Search direction only locally optimal

AVOIDING/ESCAPING LOCAL MINIMA

Escaping local minima is possible by random perturbation

Stochastic gradient descent is a form of injecting randomness into gradient descent

RECURRENT NEURAL NETWORK

TARGETS WHEN MODELLING SEQUENCES

When applying machine learning to sequences, we often want to turn an input sequence into an output sequence that lives in a different domain, e.g. turn a sequence of sound pressures into a sequence of word identities

When there is no separate target sequence, we can get a teaching signal by trying to predict the next term in the input sequence. The target output sequence is the input sequence advanced by 1 step. It's like predicting one pixel of an image from the other pixels, or one patch of the image from the other patches

MEMORYLESS MODELS FOR SEQUENCE

Autoregressive models

The output depends linearly on its own previous values: take a fixed number of previous terms and predict the next term as a weighted average of them

Feed-forward neural network

Take in a few terms, put them through some hidden units and predict the next term

Connections between units do not form a directed cycle
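
A memoryless autoregressive predictor can be sketched in a few lines. The order-2 coefficients below are a hypothetical choice that extrapolates a linear trend exactly; they form a weighted sum rather than a true average, because one coefficient is negative.

```python
def ar_predict(history, coeffs):
    """Predict the next term as a weighted sum of the last len(coeffs) terms."""
    recent = history[-len(coeffs):]
    return sum(c * v for c, v in zip(coeffs, recent))

# Hypothetical order-2 model: next = 2 * x[t] - x[t-1].
series = [1.0, 3.0, 5.0, 7.0]
coeffs = [-1.0, 2.0]                      # applied to (x[t-1], x[t])
next_term = ar_predict(series, coeffs)    # -1 * 5 + 2 * 7
```

The model keeps no state between predictions: everything it knows about the past is in the fixed window of previous terms.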

RECURRENT NETWORKS

Recurrent means feeding back on itself

They are powerful because they combine two properties:

Distributed hidden state: several different units can be active at once, so they can remember multiple values at once

Non-linear dynamics: these allow the hidden state to be updated in complicated ways

WHAT KIND OF BEHAVIOR CAN RNN EXHIBIT?

They can oscillate – good for motor control

They can settle to point attractors – good for retrieving memories

They can behave chaotically – bad for information processing

They can implement many small programs that run in parallel

TYPES OF RECURRENT NETWORK

Recurrent backpropagation networks

Discrete time: Simple Recurrent Network (Elman net, Jordan net), fixed point attractor network

Continuous time

Spin-glass models: Hopfield, Boltzmann

Interactive activation model: cognitive modeling

Competitive networks: self-organizing feature maps

DISCRETE TIME RECURRENT BACKPROP

SIMPLE RECURRENT NETWORK

JORDAN NET: SEQUENCE NETWORK

TRAINING RNN WITH BACKPROP

Assume that there is a time delay of one in using each connection

The recurrent net is just a layered net that keeps reusing the same weights
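
The view of a recurrent net as a layered net that reuses its weights can be sketched with a single hidden unit. Only the forward pass is shown; the tanh non-linearity and the weight values are assumptions for the illustration.

```python
import math

def rnn_forward(inputs, w_in, w_rec, w_out, h0=0.0):
    """Scalar RNN: the same three weights are reused at every time step."""
    h = h0
    outputs = []
    for x in inputs:
        h = math.tanh(w_in * x + w_rec * h)   # new hidden state from input and old state
        outputs.append(w_out * h)
    return outputs

ys = rnn_forward([1.0, 0.0, 0.0], w_in=1.0, w_rec=0.5, w_out=2.0)
```

Unrolling the loop gives one "layer" per time step, all sharing w_in, w_rec and w_out; backpropagation through time then has to sum the derivatives that each copy of a shared weight receives.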

PROVIDING INPUT TO RECURRENT NETWORK

We can specify inputs in several ways:

Specify the initial states of all the units

Specify the initial states of a subset of the units

Specify the states of the same subset of the units at every time step

We can specify targets in several ways:

Specify the desired final activities of all the units

Specify the desired activities of all units for the last few steps

Specify the desired activity of a subset of the units

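TOY EXAMPLE OF TRAINING AN RNN

Limitations of a feed-forward net on the same task:

The maximum number of digits must be decided in advance

What is learned for one part of a long number does not generalize to other parts, because each digit position uses different weights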

A RECURRENT NET FOR BINARY ADDITION

The network has two input units and one output unit

It is given two input digits at each time step

The desired output at each time step is the output for the column that was provided as input two time steps ago: it takes one time step to update the hidden units based on the input, and another time step for the hidden units to cause the output
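
To make the two-step delay concrete, the sketch below builds the input and target sequences for one addition problem. It covers data preparation only, not the network, and it assumes the digits are fed least-significant column first; the function name and the use of None for "no target yet" are illustrative choices.

```python
def addition_sequences(a, b, n_bits):
    """Build per-time-step inputs and delayed targets for adding a and b.

    Digits are fed least-significant first; the target at step t is the sum
    digit for the column presented at step t - 2 (None before that).
    """
    inputs, sum_bits, carry = [], [], 0
    for i in range(n_bits + 1):              # one extra column for the final carry
        bit_a, bit_b = (a >> i) & 1, (b >> i) & 1
        total = bit_a + bit_b + carry
        inputs.append((bit_a, bit_b))
        sum_bits.append(total & 1)
        carry = total >> 1
    delay = 2
    inputs += [(0, 0)] * delay               # pad so the last outputs can appear
    targets = [None] * delay + sum_bits
    return inputs, targets

inputs, targets = addition_sequences(3, 5, n_bits=3)
```

Because the same sequence format works for any n_bits, a recurrent net trained on short numbers can in principle be run on longer ones without changing its weights.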

WHY IS IT DIFFICULT TO TRAIN RNN

There is a big difference between the forward pass and the backward pass

In the forward pass, we use squashing functions (like the logistic) to prevent the activity vectors from exploding

The backward pass is completely linear. If you double the error derivatives at the final layer, all the error derivatives will double

THE PROBLEM OF EXPLODING GRADIENTS

What happens to the magnitude of the gradients as we backpropagate through many layers? If the weights are small, the gradients shrink exponentially. If the weights are big, the gradients grow exponentially

Typical feed-forward nets can cope with these exponential effects because they have only a few hidden layers

In an RNN trained on long sequences, the gradients can easily explode or vanish. This can be mitigated by initializing the weights very carefully
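
The exponential shrinking or growth follows from repeated multiplication. This sketch assumes the simplest possible case, a single scalar weight with a constant local slope, so the backward pass multiplies the gradient by the same factor at every step.

```python
def backprop_scale(weight, n_steps, slope=1.0):
    """Factor by which a gradient is scaled after n_steps of a scalar linear backward pass."""
    factor = 1.0
    for _ in range(n_steps):
        factor *= weight * slope   # each step multiplies by the weight and the local slope
    return factor

shrunk = backprop_scale(0.5, 100)   # far below 1e-29: the gradient has vanished
grown = backprop_scale(1.5, 100)    # above 1e17: the gradient has exploded
```

With a handful of layers the same factors stay within a manageable range, which is why the problem bites mainly in nets unrolled over long sequences.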

WHY THE BACKPROP GRADIENT BLOWS UP

FOUR EFFECTIVE WAYS TO TRAIN RNN

Long Short Term Memory: make the RNN out of little modules that are designed to hold values for a long time

Hessian-Free Optimization: deals with the vanishing gradient problem by using a more sophisticated optimizer, e.g. the HF optimizer

Echo State Networks: initialize the connections so that the hidden state has a huge reservoir of weakly coupled oscillators

Good initialization with momentum: initialize as in echo state networks, but learn all the connections using momentum

LONG SHORT TERM MEMORY

The dynamic state of a neural network is a short-term memory; the aim is to make this short-term memory last for a long time

Very successful for tasks like recognizing handwriting

Example considered: getting an RNN to remember things for a long time (like hundreds of time steps). The memory cell uses logistic and linear units:

Write gate: information gets in

Keep gate: information stays stored

Read gate: information is extracted

IMPLEMENTING A MEMORY CELL IN NEURAL NETS

The circuit implements an analog memory cell

A linear unit with a self-link of weight 1 will maintain its state

Activate the write gate to store information

Activate the read gate to retrieve information

Backprop through the circuit is possible because the logistic units have nice derivatives
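
The store, hold and read cycle of the memory cell can be sketched as below. The gate values are set by hand to 0 or 1 to make the behaviour obvious; in a trained network they would be logistic outputs computed from the rest of the net, and the function name is an illustrative assumption.

```python
def memory_cell_step(state, value_in, write, keep, read):
    """One time step of a gated analog memory cell.

    write, keep and read are gate activities in [0, 1].
    """
    state = keep * state + write * value_in   # self-link of weight 1, scaled by the keep gate
    output = read * state
    return state, output

state = 0.0
state, out = memory_cell_step(state, 1.7, write=1.0, keep=0.0, read=0.0)      # store 1.7
for _ in range(100):
    state, out = memory_cell_step(state, 0.3, write=0.0, keep=1.0, read=0.0)  # hold
state, out = memory_cell_step(state, 0.0, write=0.0, keep=1.0, read=1.0)      # read back
```

With the keep gate at 1 the stored value passes through 100 steps unchanged, which is exactly the property that lets error derivatives travel back over long time spans.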

ECHO STATE NETWORK AND PERCEPTRON

Perceptron: make the early layers random and fixed. We learn only the last layer, which is a linear model that uses the transformed inputs to predict the output

Echo state network: fix the input->hidden and hidden->hidden connections at random values and learn only the hidden->output connections. The random connections must be chosen carefully

COMPETITIVE LEARNING AND KOHONEN MAPS

In competitive learning, neurons compete among themselves to be activated

HOW DO WE FIND CLUSTERS?

K-MEANS WITH K=4

K-MEANS COMPUTES FOUR CLUSTERS

NEAREST NEIGHBOUR VORONOI DIAGRAM

COMPETITIVE LEARNING

Output units are said to be in competition for input patterns

During training, the output unit that provides the highest activation to a given pattern is declared the winner and is moved closer to the input pattern

It is a form of unsupervised learning

Also called winner-takes-all: one neuron wins over all others, and only the winning neuron learns

Hard learning: only the weights of the winner are updated

Soft learning: the weights of the winner and its close associates are updated

COMPETITIVE LEARNING ALGORITHM

KOHONEN SELF-ORGANIZING MAPS

Produces a mapping from a multi-dimensional input space onto a lattice of clusters. The mapping is topology-preserving and is typically organized as a 1D or 2D lattice

Has a strong neurological basis: topology is preserved, e.g. if we touch parts of the body that are close together, the groups of cells that fire are also close together

The Kohonen SOM results from the synergy of three basic processes: competition, cooperation, adaptation

COMPETITION

Each neuron in the SOM is assigned a weight vector with the same dimensionality N as the input space

Any given input pattern is compared to the weight vector of each neuron, and the closest neuron is declared the winner

The Euclidean norm is usually used to measure distance

COOPERATION

The activation of the winning neuron is spread to neurons in its immediate neighbourhood. This allows topologically close neurons to become sensitive to similar patterns

The size of the neighbourhood is initially large, but shrinks over time. A large neighbourhood promotes a topology-preserving mapping; a smaller neighbourhood allows neurons to specialize in the later stages of training

ADAPTATION

During training, the winning neuron and its topological neighbours are adapted to make their weight vectors more similar to the input pattern that caused the activation

Neurons that are closer to the winner adapt more heavily than neurons far away

The magnitude of the adaptation is controlled by the learning rate
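
The three processes can be combined into a single update step. This sketch assumes a 1D lattice and a Gaussian neighbourhood function; the slides only say that closer neighbours adapt more, so the Gaussian form, the function name and the numbers are illustrative choices.

```python
import math

def som_step(weights, x, learning_rate, radius):
    """One SOM update on a 1D lattice of weight vectors (modified in place).

    Returns the index of the winning neuron.
    """
    # Competition: the neuron whose weight vector is closest to x wins.
    dists = [math.sqrt(sum((xi - wi) ** 2 for xi, wi in zip(x, w))) for w in weights]
    winner = dists.index(min(dists))
    for j, w in enumerate(weights):
        # Cooperation: influence decays with lattice distance from the winner.
        h = math.exp(-((j - winner) ** 2) / (2.0 * radius ** 2))
        # Adaptation: move the weight vector towards x, scaled by h.
        weights[j] = [wi + learning_rate * h * (xi - wi) for wi, xi in zip(w, x)]
    return winner

weights = [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]   # hypothetical 3-neuron lattice
winner = som_step(weights, [0.9, 1.0], learning_rate=0.5, radius=1.0)
```

A full training loop would call som_step repeatedly while shrinking radius (and usually learning_rate) over time.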

ALGORITHM

A neuron learns by shifting its weights from inactive connections to active ones

The change Δw_ij applied to synaptic weight w_ij is

$$\Delta w_{ij} = \begin{cases} \alpha \, (x_i - w_{ij}), & \text{if neuron } j \text{ wins the competition} \\ 0, & \text{if neuron } j \text{ loses the competition} \end{cases}$$

where x_i is the input signal and α is the learning rate parameter

The overall effect is to move the synaptic weight vector of the winning neuron towards the input pattern

The matching criterion is equivalent to the minimum Euclidean distance between vectors

The Euclidean distance is given by

$$d = \lVert X - W_j \rVert = \left[ \sum_{i=1}^{n} (x_i - w_{ij})^2 \right]^{1/2}$$

where x_i and w_ij are the ith elements of the vectors X and W_j, respectively.

To identify the winning neuron, j_X, that best matches the input vector X, we may apply the following condition:

$$j_X = \arg\min_j \lVert X - W_j \rVert, \qquad j = 1, 2, \ldots, m$$

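EXAMPLE

Suppose a 2D input vector X is presented to a three-neuron Kohonen network:

$$X = \begin{bmatrix} 0.52 \\ 0.12 \end{bmatrix}$$

The initial weight vectors are given by

$$W_1 = \begin{bmatrix} 0.27 \\ 0.81 \end{bmatrix}, \quad W_2 = \begin{bmatrix} 0.42 \\ 0.70 \end{bmatrix}, \quad W_3 = \begin{bmatrix} 0.43 \\ 0.21 \end{bmatrix}$$

We find the winning neuron using the minimum-distance Euclidean criterion:

$$d_1 = \sqrt{(x_1 - w_{11})^2 + (x_2 - w_{21})^2} = \sqrt{(0.52 - 0.27)^2 + (0.12 - 0.81)^2} = 0.73$$

$$d_2 = \sqrt{(x_1 - w_{12})^2 + (x_2 - w_{22})^2} = \sqrt{(0.52 - 0.42)^2 + (0.12 - 0.70)^2} = 0.59$$

$$d_3 = \sqrt{(x_1 - w_{13})^2 + (x_2 - w_{23})^2} = \sqrt{(0.52 - 0.43)^2 + (0.12 - 0.21)^2} = 0.13$$

Neuron 3 is the winner and its weight vector W_3 is updated according to the competitive learning rule:

$$\Delta w_{13} = \alpha \, (x_1 - w_{13}) = 0.1 \, (0.52 - 0.43) \approx 0.01$$

$$\Delta w_{23} = \alpha \, (x_2 - w_{23}) = 0.1 \, (0.12 - 0.21) \approx -0.01$$

The updated weight vector at iteration (p + 1) is determined as

$$W_3(p+1) = W_3(p) + \Delta W_3(p) = \begin{bmatrix} 0.43 \\ 0.21 \end{bmatrix} + \begin{bmatrix} 0.01 \\ -0.01 \end{bmatrix} = \begin{bmatrix} 0.44 \\ 0.20 \end{bmatrix}$$

The weight vector W_3 of the winning neuron 3 becomes closer to the input vector X with each iteration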
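
The worked example can be checked numerically. The sketch below implements hard competitive learning (only the winner is updated) with the input, initial weights and learning rate of 0.1 from the example; the function names are illustrative.

```python
import math

def distance(x, w):
    """Euclidean distance between input vector x and weight vector w."""
    return math.sqrt(sum((xi - wi) ** 2 for xi, wi in zip(x, w)))

def competitive_update(x, weights, alpha):
    """Move only the closest weight vector towards x; return (winner index, new weights)."""
    dists = [distance(x, w) for w in weights]
    winner = dists.index(min(dists))
    new_weights = [list(w) for w in weights]
    new_weights[winner] = [wi + alpha * (xi - wi)
                           for xi, wi in zip(x, weights[winner])]
    return winner, new_weights

x = [0.52, 0.12]
weights = [[0.27, 0.81], [0.42, 0.70], [0.43, 0.21]]
winner, updated = competitive_update(x, weights, alpha=0.1)
# winner is index 2 (neuron 3); its new weights are about [0.439, 0.201],
# which round to the [0.44, 0.20] quoted in the example.
```

Calling competitive_update again with the updated weights moves neuron 3 a further tenth of the remaining distance towards X.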

EXAMPLE

A Kohonen network with 100 neurons arranged in the form of a 2D lattice with 10 rows and 10 columns

The network is required to classify 2D input vectors – each neuron should respond to input vectors occurring in that region only

The network is trained with 1000 2D input vectors generated randomly in a square region in the interval between -1 and +1

Learning rate parameter is 0.1

[Figure: four plots of the neuron weights, W(2,j) against W(1,j), with both axes running from -1 to 1. Recovered panel titles: "Initial random weights" and "Network after 100 iterations"; the titles of the last two panels are missing.]

HOPFIELD NEURAL NETS

Serves as Content-Addressable Memory system with binary threshold nodes

Provides model for understanding human memory

Used for storing memory as distributed patterns of activity

Stable states are fixed point attractors

ALGORITHM

Two ways of updating:

Asynchronous: pick one neuron, calculate its weighted input sum and update it immediately. This can be done in a fixed order, or neurons can be picked at random

Synchronous: the weighted input sums are calculated for all neurons without updating any of them, then all neurons are set to their new values

Conditions on the weight matrix:

Symmetry: w_ij = w_ji

No self-connections: w_ii = 0

THE ENERGY FUNCTION

The global energy is a sum of contributions, each of which depends on one connection weight and the binary states of two neurons

Terms in the energy:

Weight of the connection between two neurons

Activities of the two connected neurons

Bias term

EXAMPLE

A NEAT WAY TO USE THIS ENERGY

Memories could be energy minima of a neural net. The binary threshold decision rule can then be used to clean up incomplete or corrupted memories

Using energy minima to represent memories gives a content-addressable memory: an item can be accessed by just knowing a part of its content
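
A small sketch of content-addressable recall in a Hopfield net. Several things here are assumptions rather than content recovered from the slides: the energy is written in its standard form E = -sum_i s_i b_i - sum_{i<j} s_i s_j w_ij, the unit states are +1/-1, and the memory is stored with a simple Hebbian rule.

```python
def energy(state, weights, biases):
    """E = -sum_i s_i*b_i - sum_{i<j} s_i*s_j*w_ij (symmetric weights, zero diagonal)."""
    n = len(state)
    e = -sum(s * b for s, b in zip(state, biases))
    for i in range(n):
        for j in range(i + 1, n):
            e -= state[i] * state[j] * weights[i][j]
    return e

def store(patterns):
    """Hebbian storage for +1/-1 patterns (an assumed rule, not given in the slides)."""
    n = len(patterns[0])
    w = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:                    # no self-connections
                    w[i][j] += p[i] * p[j]
    return w

def recall(state, weights, biases, sweeps=5):
    """Asynchronous updates in a fixed order using the binary threshold rule."""
    state = list(state)
    for _ in range(sweeps):
        for i in range(len(state)):
            total = biases[i] + sum(weights[i][j] * state[j] for j in range(len(state)))
            state[i] = 1 if total >= 0 else -1
    return state

memory = [1, -1, 1, -1, 1, -1]
weights = store([memory])
biases = [0.0] * len(memory)
corrupted = [1, -1, -1, -1, 1, -1]            # one unit flipped
restored = recall(corrupted, weights, biases)
```

The corrupted state has higher energy than the stored memory, and each asynchronous update can only keep the energy the same or lower it, so the state settles into the stored pattern.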
