
Neural Networks

Pabitra Mitra
Computer Science and Engineering
IIT Kharagpur
pabitra@gmail.com


The Neuron

• The neuron is the basic information processing unit of a NN. It consists of:
  1. A set of synapses or connecting links, each link characterized by a weight: w1, w2, …, wm
  2. An adder function (linear combiner) which computes the weighted sum of the inputs:
     u = Σ_{j=1}^{m} w_j x_j
  3. An activation function (squashing function) for limiting the amplitude of the output of the neuron:
     y = φ(u + b)

Computation at Units

• Compute a 0-1 or a graded function of the weighted sum of the inputs:
  g(w · x) = g(Σ_i w_i x_i)
• g(·) is the activation function

[Figure: a unit with inputs x1, x2, …, xn and weights w1, w2, …, wn computing g(w · x)]
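As a concrete illustration of this computation, here is a minimal sketch in Python (NumPy), assuming a step activation with threshold t; the function and variable names are ours, not from the slides:

```python
import numpy as np

def step(u, t=0.0):
    """Step activation: 1 if the local field reaches the threshold t, else 0."""
    return 1.0 if u >= t else 0.0

def neuron(x, w, b=0.0, g=step):
    """Single unit: weighted sum of the inputs plus bias, passed through activation g."""
    u = np.dot(w, x)      # linear combiner: u = sum_j w_j * x_j
    return g(u + b)       # squashing function applied to the induced field v = u + b

# Example: two inputs with equal weights
print(neuron(np.array([1.0, 0.0]), np.array([0.5, 0.5])))  # -> 1.0
```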


The Neuron

[Figure: input signals x1, x2, …, xm enter through synaptic weights w1, w2, …, wm; a summing function with bias b produces the local field v, which passes through the activation function φ(·) to give the output y]

Common Activation Functions

• Step function: g(x) = 1 if x >= t, g(x) = 0 if x < t (t is a threshold)
• Sign function: g(x) = 1 if x >= t, g(x) = -1 if x < t (t is a threshold)
• Sigmoid function: g(x) = 1/(1 + exp(-x))
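A minimal sketch of these three activation functions in Python, written with NumPy so they also work elementwise on arrays (function names are ours):

```python
import numpy as np

def step(x, t=0.0):
    """Step function: 1 where x >= t, 0 otherwise."""
    return np.where(x >= t, 1.0, 0.0)

def sign_fn(x, t=0.0):
    """Sign function: 1 where x >= t, -1 otherwise."""
    return np.where(x >= t, 1.0, -1.0)

def sigmoid(x):
    """Sigmoid (logistic) function: smooth, differentiable squashing to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

print(step(np.array([-1.0, 0.5])))     # [0. 1.]
print(sign_fn(np.array([-1.0, 0.5])))  # [-1.  1.]
print(sigmoid(0.0))                    # 0.5
```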


Bias of a Neuron

• Bias b has the effect of applying an affine transformation to u:
  u = Σ_{j=1}^{m} w_j x_j
  v = u + b
• v is the induced field of the neuron


Bias as extra input

• Bias is an external parameter of the neuron. It can be modeled by adding an extra input x0 = +1 with weight w0 = b:
  v = Σ_{j=0}^{m} w_j x_j,  where w0 = b

[Figure: as before, but with the fixed input x0 = +1 and its weight w0 feeding the summing function alongside x1, …, xm]
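A small sketch showing that absorbing the bias as an extra weight w0 on a fixed input x0 = +1 gives the same induced field as keeping an explicit bias term (names and values are illustrative):

```python
import numpy as np

def field_with_bias(x, w, b):
    """Induced field v = w . x + b with an explicit bias term."""
    return np.dot(w, x) + b

def field_augmented(x, w, b):
    """Same field, with the bias folded in as weight w0 on a constant input x0 = +1."""
    x_aug = np.concatenate(([1.0], x))   # x0 = +1
    w_aug = np.concatenate(([b], w))     # w0 = b
    return np.dot(w_aug, x_aug)

x, w, b = np.array([0.2, -0.7]), np.array([1.5, 0.4]), -0.1
print(field_with_bias(x, w, b), field_augmented(x, w, b))  # identical values
```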


Face Recognition

90% accuracy at learning head pose and recognizing 1-of-20 faces


Handwritten digit recognition

Computing with spaces

[Figure: perceptual features x1, x2 feed a unit with output y; the output space separates the two classes, with +1 = cat and -1 = dog]

• Output: y = g(Wx)
• Error: E = (y - g(Wx))^2

Can Implement Boolean Functions

• A unit can implement And, Or, and Not
• Need to map True and False to numbers: e.g. True = 1.0, False = 0.0
• (Exercise) Use a step function and show how to implement various simple Boolean functions (a sketch follows below)
• Combining the units, we can get any Boolean function of n variables; logical circuits are obtained as a special case
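A minimal sketch of the exercise, assuming a step activation and the True = 1.0 / False = 0.0 encoding from the slide; the particular weights and thresholds are one valid choice among many:

```python
import numpy as np

def unit(x, w, t):
    """Linear threshold unit: output 1.0 if w . x >= t, else 0.0."""
    return 1.0 if np.dot(w, x) >= t else 0.0

def AND(a, b): return unit(np.array([a, b]), np.array([1.0, 1.0]), t=2.0)
def OR(a, b):  return unit(np.array([a, b]), np.array([1.0, 1.0]), t=1.0)
def NOT(a):    return unit(np.array([a]),    np.array([-1.0]),     t=0.0)

for a in (0.0, 1.0):
    for b in (0.0, 1.0):
        print(a, b, AND(a, b), OR(a, b))
print(NOT(0.0), NOT(1.0))  # 1.0 0.0
```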

Network Structures

• Feedforward (no cycles): less powerful, easier to understand
  – Input units
  – Hidden layers
  – Output units
• Perceptron: no hidden layer, so it basically corresponds to one unit; essentially a linear threshold function (ltf)
• Ltf: defined by weights w and threshold t; value is 1 iff w · x >= t, otherwise 0


Single Layer Feed-forward

[Figure: an input layer of source nodes fully connected to an output layer of neurons]


Multi layer feed-forward

[Figure: a 3-4-2 network — input layer of 3 nodes, hidden layer of 4 neurons, output layer of 2 neurons]
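As an illustration, a minimal forward pass for a 3-4-2 feedforward network in Python, assuming sigmoid units and randomly initialized weights (the layer sizes follow the figure; everything else is our choice):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # input (3) -> hidden (4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)   # hidden (4) -> output (2)

def forward(x):
    """Forward pass through the 3-4-2 network: one hidden layer, sigmoid activations."""
    h = sigmoid(W1 @ x + b1)   # hidden layer activations
    y = sigmoid(W2 @ h + b2)   # output layer activations
    return y

print(forward(np.array([0.5, -1.0, 2.0])))  # two output values in (0, 1)
```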

Network Structures

• Recurrent (cycles exist): more powerful, as they can implement state, but harder to analyze. Examples:
  – Hopfield network: symmetric connections, interesting properties, useful for implementing associative memory
  – Boltzmann machines: more general, with applications in constraint satisfaction and combinatorial optimization

Simple recurrent networks (Elman, 1990)

[Figure: an input layer x1, x2 plus context units feed a hidden layer z1, z2, which feeds the output layer; a copy connection carries activity from one time step (input x(i-1)) to the next (input x(i))]
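A minimal sketch of running a simple recurrent network, assuming the standard Elman formulation in which the context units hold a copy of the previous hidden activations; the sizes and names are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 2, 3, 2
W_in  = rng.normal(size=(n_hid, n_in))    # input -> hidden
W_ctx = rng.normal(size=(n_hid, n_hid))   # context (previous hidden) -> hidden
W_out = rng.normal(size=(n_out, n_hid))   # hidden -> output

def run(sequence):
    """Process a sequence; the context is a copy of the hidden layer from the previous step."""
    context = np.zeros(n_hid)
    outputs = []
    for x in sequence:
        hidden = sigmoid(W_in @ x + W_ctx @ context)   # current input plus copied context
        outputs.append(sigmoid(W_out @ hidden))
        context = hidden.copy()                        # copy for the next time step
    return outputs

print(run([np.array([1.0, 0.0]), np.array([0.0, 1.0])]))
```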

Perceptron Capabilities

• Quite expressive: many, but not all, Boolean functions can be expressed. Examples:
  – conjunctions and disjunctions, e.g. x1 ∨ x2: true iff x1 + x2 >= 1
  – more generally, can represent functions that are true if and only if at least k of the n inputs are true: x1 + x2 + … + xn >= k
  – Can't represent XOR

Representable Functions

• Perceptrons have a monotonicity property: if a link has positive weight, activation can only increase as the corresponding input value increases (irrespective of other input values)
• Can't represent functions where input interactions can cancel one another's effect (e.g. XOR)

Representable Functions

• Can represent only linearly separable functions

• Geometrically: only if there is a line (plane) separating the positives from the negatives

• The good news: such functions are PAC learnable and learning algorithms exist

Linearly Separable

[Figure: positive (+) and negative (-) examples in the plane that can be separated by a single straight line]

NOT linearly Separable

[Figure: positive (+) and negative (-) examples arranged so that no single straight line separates them]

Problems with simple networks

• Some kinds of data are not linearly separable

[Figure: in the (x1, x2) plane, AND and OR are linearly separable, but XOR is not — no single line splits its positives from its negatives]

A solution: multiple layers

[Figure: input layer x1, x2 → hidden layer z1, z2 → output layer y]
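A minimal sketch of how a hidden layer solves XOR, reusing the threshold unit from the earlier sketch (redefined here so the block is self-contained); this particular choice of weights, with hidden units computing OR and AND, is one of many that work and is our own illustration:

```python
import numpy as np

def unit(x, w, t):
    """Linear threshold unit: 1.0 if w . x >= t, else 0.0."""
    return 1.0 if np.dot(w, x) >= t else 0.0

def xor(x1, x2):
    z1 = unit(np.array([x1, x2]), np.array([1.0, 1.0]), t=1.0)      # hidden unit: x1 OR x2
    z2 = unit(np.array([x1, x2]), np.array([1.0, 1.0]), t=2.0)      # hidden unit: x1 AND x2
    return unit(np.array([z1, z2]), np.array([1.0, -1.0]), t=1.0)   # OR and not AND

for a in (0.0, 1.0):
    for b in (0.0, 1.0):
        print(a, b, xor(a, b))   # 0 0 0 / 0 1 1 / 1 0 1 / 1 1 0
```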

The Perceptron Learning Algorithm

• Example of current-best-hypothesis (CBH) search (so incremental, etc.):
  – Begin with a hypothesis (a perceptron)
  – Repeat over all examples several times, adjusting weights as examples are seen
  – Until all examples are correctly classified or a stopping criterion is reached

Method for Adjusting Weights

• One weight update possibility:
  – If classification is correct, don't change
  – Otherwise, if false negative, add the input: w_j ← w_j + x_j
  – Otherwise, if false positive, subtract the input: w_j ← w_j - x_j
• Intuition: for instance, if the example is positive, strengthen/increase the weights corresponding to the positive attributes of the example

Properties of the Algorithm

• In general, also apply a learning rate α: w_j ← w_j ± α x_j
• The adjustment is in the direction of minimizing error on the example
• If the learning rate is appropriate and the examples are linearly separable, then after a finite number of iterations the algorithm converges to a linear separator
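A minimal sketch of this perceptron training loop in Python, assuming 0/1 targets, a step unit with the threshold folded in as w0 on a constant input, and a fixed learning rate; the dataset and names are illustrative only:

```python
import numpy as np

def train_perceptron(X, y, alpha=0.1, epochs=50):
    """Perceptron rule: on a mistake, add the input (false negative) or subtract it (false positive)."""
    X = np.hstack([np.ones((len(X), 1)), X])   # x0 = +1 absorbs the threshold
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        errors = 0
        for xi, target in zip(X, y):
            pred = 1.0 if np.dot(w, xi) >= 0.0 else 0.0
            if pred != target:
                w += alpha * (target - pred) * xi   # +x on false negative, -x on false positive
                errors += 1
        if errors == 0:                             # all examples correctly classified
            break
    return w

# Learn OR, which is linearly separable
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 1], dtype=float)
print(train_perceptron(X, y))
```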

Another Algorithm(least-sum-squares algorithm)

• Define and minimize an error function
• S is the set of examples, f(·) is the ideal function, h(·) is the linear function corresponding to the current perceptron
• Error of the perceptron (over all examples):
  E(h) = (1/2) Σ_{e ∈ S} (f(e) - h(e))^2
• Note: h(e) = w · x(e) = Σ_i w_i x_i(e)

The Delta Rule

[Figure: perceptual features x1, x2 feed a unit with output y; +1 = cat, -1 = dog]

• Error: E = (y - g(Wx))^2
• Update each weight against the gradient of the error: Δw_ij ∝ -∂E/∂w_ij
  ∂E/∂w_ij = -2 (y - g(Wx)) g'(Wx) x_j
  Δw_ij ∝ (y - g(Wx)) g'(Wx) x_j = (output error) × (influence of input)
• Works for any function g with derivative g'

Derivative of Error

• Gradient (derivative) of E:
  ∇E(w) = [∂E/∂w_0, ∂E/∂w_1, …, ∂E/∂w_n]
• Take the steepest descent direction:
  w_i ← w_i + Δw_i,  where  Δw_i = -α ∂E/∂w_i
• ∂E/∂w_i is the gradient along w_i, α is the learning rate

Gradient Descent

• The algorithm: pick initial random perceptron and repeatedly compute error and modify the perceptron (take a step along the reverse of gradient)

[Figure: the error E plotted against a weight w_ij; the gradient direction ∂E/∂w_ij points uphill, the descent direction -∂E/∂w_ij points downhill toward the minimum]

  Δw_ij = -α ∂E/∂w_ij   (α is the learning rate)
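A minimal sketch of this gradient-descent loop for a single sigmoid unit, implementing the delta-rule update Δw_j ∝ (y - g(w·x)) g'(w·x) x_j summed over the examples; the data and settings are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_delta_rule(X, y, alpha=0.5, epochs=2000):
    """Batch gradient descent on E = (1/2) sum_e (y(e) - g(w.x(e)))^2 for one sigmoid unit."""
    X = np.hstack([np.ones((len(X), 1)), X])   # bias as extra input x0 = +1
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        out = sigmoid(X @ w)                    # current outputs h(e)
        err = y - out                           # output errors
        grad = -(err * out * (1.0 - out)) @ X   # dE/dw, using g'(u) = g(u)(1 - g(u))
        w -= alpha * grad                       # step along the reverse of the gradient
    return w

# Learn OR with graded (sigmoid) outputs
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 1], dtype=float)
w = train_delta_rule(X, y)
print(np.round(sigmoid(np.hstack([np.ones((4, 1)), X]) @ w), 2))
```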

General-purpose learning mechanisms

Gradient Calculation

∂E/∂w_i = ∂/∂w_i [ (1/2) Σ_{e ∈ S} (f(e) - h(e))^2 ]
        = (1/2) Σ_{e ∈ S} ∂/∂w_i (f(e) - h(e))^2
        = Σ_{e ∈ S} (f(e) - h(e)) · ∂/∂w_i (f(e) - h(e))

Derivation (cont.)

∂E/∂w_i = Σ_{e ∈ S} (f(e) - h(e)) · ( ∂f(e)/∂w_i - ∂(w · x(e))/∂w_i )
        = -Σ_{e ∈ S} (f(e) - h(e)) x_i(e)

since ∂f(e)/∂w_i = 0, as f(e) is constant, and ∂(w · x(e))/∂w_i = ∂(Σ_j w_j x_j(e))/∂w_i = x_i(e).

Properties of the algorithm

• Error function has no local minima (it is quadratic)

• The algorithm is a gradient descent method to the global minimum, and will asymptotically converge

• Even if not linearly separable, can find a good (minimum error) linear classifier

• Incremental?

Multilayer Feed-Forward Networks

• Multiple perceptrons, layered
• Example: a two-layer network with 3 inputs (x1, x2, x3), one output, and one hidden layer (two hidden units)

[Figure: inputs layer x1, x2, x3 → hidden layer (two units) → output layer (one unit)]

Power/Expressiveness

• Can represent interactions among inputs (unlike perceptrons)

• Two layer networks can represent any Boolean function, and continuous functions (within a tolerance), as long as the number of hidden units is sufficient and appropriate activation functions are used

• Learning algorithms exist, but weaker guarantees than perceptron learning algorithms

Back-Propagation

• Similar to the perceptron learning algorithm and gradient descent for perceptrons

• Problem to overcome: How to adjust internal links (how to distribute the “blame” or the error)

• Assumption: internal units use differentiable, nonlinear functions
• Sigmoid functions are convenient


Recurrent network

• Recurrent network with hidden neuron(s): the unit delay operator z^-1 implies a dynamic system

[Figure: input, hidden, and output units with feedback connections passing through unit delay (z^-1) elements]

Back-Propagation (cont.)

• Start with a network with random weights
• Repeat until a stopping criterion is met:
  – For each example, compute the network output and, for each unit i, its error term δ_i
  – Update each weight w_ij (weight of the link going from node i to node j):
    w_ij ← w_ij + Δw_ij,  where  Δw_ij = α δ_j o(i)
    and o(i) is the output of unit i

The Error Term

• δ_i = Err(i) · o'(i)
• Err(i) = f_i(e) - h_i(e), if i is an output node
• Err(i) = Σ_j w_ij δ_j, if i is an internal node
• o'(i) is the derivative of the activation of node i
• For the sigmoid, o'(i) = o(i) (1 - o(i))

Derivation

• Write the error for a single training example; as before, use the sum of squared errors (convenient for differentiation, etc.):
  E = (1/2) Σ_{i ∈ output units} (f_i(e) - h_i(e))^2
• Differentiate (with respect to each weight…)
• For example, for the weight w_ji connecting node j to output i we get:
  ∂E/∂w_ji = -o(j) o'(i) Err(i) = -o(j) δ_i
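A minimal sketch of these updates for a small sigmoid network trained on XOR, following the error terms above (output units: δ = Err · o'; internal units: δ = o' · Σ w δ); the architecture, data, seed, and learning rate are our illustrative choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)   # input -> hidden
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)   # hidden -> output
alpha = 0.5

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([0, 1, 1, 0], dtype=float)          # XOR targets

for _ in range(10000):
    for x, t in zip(X, T):
        # forward pass
        h = sigmoid(W1 @ x + b1)
        y = sigmoid(W2 @ h + b2)
        # error terms: delta = Err * derivative, with sigmoid' = o(1 - o)
        delta_out = (t - y) * y * (1.0 - y)              # output unit
        delta_hid = h * (1.0 - h) * (W2.T @ delta_out)   # internal units: sum of w * delta
        # weight updates: delta_w = alpha * delta_j * o(i)
        W2 += alpha * np.outer(delta_out, h); b2 += alpha * delta_out
        W1 += alpha * np.outer(delta_hid, x); b1 += alpha * delta_hid

# Outputs should approach 0, 1, 1, 0 (exact values depend on the random initialization)
print(np.round([sigmoid(W2 @ sigmoid(W1 @ x + b1) + b2)[0] for x in X], 2))
```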

Properties

• Converges to a minimum, but could be a local minimum

• Could be slow to converge (Note: training a three-node net is NP-Complete!)
• Must watch for over-fitting just as in decision trees (use validation sets, etc.)
• Network structure? Often two layers suffice; start with relatively few hidden units

Properties (cont.)

• Many variations on basic back-propagation exist, e.g. adding a momentum term (a sketch follows below):
  Δw_ij(n) = α δ_j o(i) + β Δw_ij(n-1),  0 <= β < 1
  where Δw_ij(n) is the nth update amount and β is a constant
• Reduce the learning rate α with time (applies to perceptrons as well)
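A small sketch of the momentum variant, assuming the gradient-style update from earlier; beta and the variable names are illustrative:

```python
import numpy as np

def momentum_step(w, grad_term, prev_update, alpha=0.1, beta=0.9):
    """One momentum update: delta_w(n) = alpha * grad_term + beta * delta_w(n-1)."""
    update = alpha * grad_term + beta * prev_update
    return w + update, update   # return new weights and this step's update for next time

w = np.zeros(3)
prev = np.zeros(3)
for grad_term in (np.array([1.0, 0.0, -1.0]), np.array([1.0, 0.0, -1.0])):
    w, prev = momentum_step(w, grad_term, prev)
print(w)   # the second step is larger because it reuses part of the first
```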


NN properties

• Can handle domains with:
  – continuous and discrete attributes
  – many attributes
  – noisy data

• Could be slow at training but fast at evaluation time

• Human understanding of what the network does could be limited


Networks, features, and spaces

• Artificial neural networks can represent any continuous function…

• Simple algorithms for learning from data
  – fuzzy boundaries
  – effects of typicality

• A way to explain how people could learn things that look like rules and symbols…

• Big question: how much of cognition can be explained by the input data?

Challenges for neural networks

• Being able to learn anything can make it harder to learn specific things
  – this is the "bias-variance tradeoff"
