Page 1: Neural Networks for Classification


Neural Networks for Classification

Andrei Alexandrescu

June 19, 2007

Page 2: Neural Networks for Classification

Introduction

Section contents:
■ Neural Networks: History
■ What is a Neural Network?
■ Examples of Neural Networks
■ Elements of a Neural Network

Page 3: Neural Networks for Classification

Neural Networks: History


■ Modeled after the human brain
■ Experimentation and marketing predated theory
■ Considered the forefront of the AI spring
■ Suffered from the AI winter
■ Theory today still not fully developed and understood

Page 4: Neural Networks for Classification

What is a Neural Network?


■ Essentially: a network of interconnected functional elements, each with several inputs and one output:

$y(x_1, \ldots, x_n) = f(w_1 x_1 + w_2 x_2 + \ldots + w_n x_n)$ (1)

■ The $w_i$ are parameters
■ $f$ is the activation function
■ Crucial for learning that addition is used for integrating the inputs

Page 5: Neural Networks for Classification

Examples of Neural Networks


■ Logical functions with 0/1 inputs and outputs
■ Fourier series:

$F(x) = \sum_{i \ge 0} (a_i \cos(ix) + b_i \sin(ix))$ (2)

■ Taylor series:

$F(x) = \sum_{i \ge 0} a_i (x - x_0)^i$ (3)

■ Automata

Page 6: Neural Networks for Classification

Elements of a Neural Network


■ The function performed by an element
■ The topology of the network
■ The method used to train the weights

Page 7: Neural Networks for Classification

Single-Layer Perceptrons

Section contents:
■ The Perceptron
■ Perceptron Capabilities
■ Bias
■ Training the Perceptron
■ Algorithm
■ Summary of Simple Perceptrons

Page 8: Neural Networks for Classification

The Perceptron


■ $n$ inputs, one output:

$y(x_1, \ldots, x_n) = f(w_1 x_1 + \ldots + w_n x_n)$ (4)

■ Oldest activation function (McCulloch/Pitts) is the step function, sketched in code below:

$f(v) = \mathbf{1}_{v \ge 0}$ (5)
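To make the definition concrete, here is a minimal sketch (mine, not from the slides) of a McCulloch/Pitts perceptron in Python; the function names are illustrative only.

```python
def step(v):
    """McCulloch/Pitts activation: 1 if v >= 0, else 0."""
    return 1.0 if v >= 0.0 else 0.0

def perceptron(weights, inputs):
    """One unit: weighted sum of the inputs passed through the step activation."""
    v = sum(w * x for w, x in zip(weights, inputs))
    return step(v)

# Example: with weights (1, -1) the unit outputs 1 exactly when x1 >= x2.
print(perceptron([1.0, -1.0], [0.3, 0.7]))  # -> 0.0, since 0.3 < 0.7
```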

Page 9: Neural Networks for Classification

Perceptron Capabilities


■ Advertised to be as capable as the brain itself
■ Can (only) distinguish between two linearly separable sets
■ Smallest Boolean function it cannot decide: XOR
■ Minsky's proof started the AI winter
■ It was not fully understood what connected layers could do

Page 10: Neural Networks for Classification

Bias


■ Notice that the decision hyperplane must go through the origin
■ Could be achieved by preprocessing the input
■ Not always desirable or possible
■ Add a bias input:

$y(x_1, \ldots, x_n) = f(w_0 + w_1 x_1 + \ldots + w_n x_n)$ (6)

■ Same as an input connected to the constant 1 (see the sketch below)
■ We consider that ghost input implicit henceforth
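A small illustration (mine, not the author's) of treating the bias as a ghost input fixed to 1, so the biased unit reuses the unbiased formula:

```python
def perceptron_with_bias(weights, inputs):
    """weights = [w0, w1, ..., wn]; the bias w0 multiplies a constant ghost input of 1."""
    ghost_inputs = [1.0] + list(inputs)
    v = sum(w * x for w, x in zip(weights, ghost_inputs))
    return 1.0 if v >= 0.0 else 0.0

# With the bias the decision boundary no longer has to pass through the origin:
# weights [-1.5, 1, 1] compute logical AND on 0/1 inputs.
print(perceptron_with_bias([-1.5, 1.0, 1.0], [1.0, 1.0]))  # -> 1.0
print(perceptron_with_bias([-1.5, 1.0, 1.0], [0.0, 1.0]))  # -> 0.0
```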

Page 11: Neural Networks for Classification

Training the Perceptron


■ Switch to vector notation:

$y(\mathbf{x}) = f(\mathbf{w} \cdot \mathbf{x}) = f_{\mathbf{w}}(\mathbf{x})$ (7)

■ Assume we need to separate sets of points $A$ and $B$; count the misclassifications:

$E(\mathbf{w}) = \sum_{x \in A} (1 - f_{\mathbf{w}}(x)) + \sum_{x \in B} f_{\mathbf{w}}(x)$ (8)

■ Goal: $E(\mathbf{w}) = 0$
■ Start from a random $\mathbf{w}$ and improve it

Page 12: Neural Networks for Classification

Algorithm


1. Start with a random $w$, set $t = 0$
2. Select a vector $x \in A \cup B$
3. If $x \in A$ and $w_t \cdot x \le 0$, then $w_{t+1} = w_t + x$
4. Else if $x \in B$ and $w_t \cdot x \ge 0$, then $w_{t+1} = w_t - x$
5. If any point is still misclassified, go to step 2

■ Guaranteed to converge if and only if $A$ and $B$ are linearly separable! (See the sketch below.)
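A runnable sketch of this algorithm (mine, with illustrative helper names), assuming the points are given as Python lists with any bias already folded in as a constant coordinate:

```python
import random

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def train_perceptron(A, B, max_passes=1000):
    """Perceptron learning: seeks w with w.x > 0 on A and w.x < 0 on B,
    which succeeds when the two sets are linearly separable."""
    dim = len(A[0])
    w = [random.uniform(-1, 1) for _ in range(dim)]
    for _ in range(max_passes):
        updated = False
        for x in A:
            if dot(w, x) <= 0:
                w = [wi + xi for wi, xi in zip(w, x)]
                updated = True
        for x in B:
            if dot(w, x) >= 0:
                w = [wi - xi for wi, xi in zip(w, x)]
                updated = True
        if not updated:        # every point classified correctly
            return w
    return w                   # gave up: not separable (or not yet converged)

# Toy example: A lies above the line x1 + x2 = 0, B below it; the third coordinate is the ghost bias input.
A = [[1.0, 2.0, 1.0], [2.0, 1.0, 1.0]]
B = [[-1.0, -2.0, 1.0], [-2.0, -0.5, 1.0]]
print(train_perceptron(A, B))
```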

Page 13: Neural Networks for Classification

Summary of Simple Perceptrons


■ Simple training
■ Limited capabilities
■ Reasonably efficient training
  ◆ Simplex and linear programming are better

Page 14: Neural Networks for Classification

Multi-Layer Perceptrons

Section contents:
■ Multi-Layer Perceptrons
■ A Misunderstanding of Epic Proportions
■ Workings
■ Capabilities
■ Training Prerequisite
■ Output Activation
■ The Backpropagation Algorithm
■ The Task
■ Training. The Delta Rule
■ Gradient Locality
■ Regularization
■ Local Minima

Page 15: Neural Networks for Classification

Multi-Layer Perceptrons


■ Let's connect the output of a perceptron to the input of another
■ What can we compute with this horizontal combination?
■ (We already take vertical combination for granted)

Page 16: Neural Networks for Classification

A Misunderstanding of Epic Proportions


■ Some say "two-layered" network
  ◆ Two cascaded layers of computational units
■ Some say "three-layered" network
  ◆ There is one extra input layer that does nothing
■ Let's arbitrarily choose "three-layered"
  ◆ Input
  ◆ Hidden
  ◆ Output

Page 17: Neural Networks for Classification

Workings


■ The hidden layer maps inputs into a second space: the "feature space" or "classification space"
■ This makes the job of the output layer easier

Page 18: Neural Networks for Classification

Capabilities


■ Each hidden unit computes a linear separation of the input space
■ Several hidden units can carve a polytope in the input space
■ Output units can distinguish polytope membership
■ Any union of polytopes can be decided (see the XOR sketch below)
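As an illustration (my example, not the slides'), two hidden step units can carve the region where exactly one input is 1, and one output unit then decides XOR, the function a single perceptron cannot compute:

```python
def step(v):
    return 1.0 if v >= 0.0 else 0.0

def xor_net(x1, x2):
    """Two hidden units carve half-planes; the output unit intersects them."""
    h1 = step(x1 + x2 - 0.5)        # fires when x1 OR x2
    h2 = step(-x1 - x2 + 1.5)       # fires when NOT (x1 AND x2)
    return step(h1 + h2 - 1.5)      # fires only when both hidden units fire

for a in (0.0, 1.0):
    for b in (0.0, 1.0):
        print(a, b, xor_net(a, b))  # prints the XOR truth table
```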

Page 19: Neural Networks for Classification

Training Prerequisite


■ The step function is bad for gradient-descent techniques
■ Replace it with a smooth step function, the logistic sigmoid:

$f(v) = \frac{1}{1 + e^{-v}}$ (9)

■ Notable fact: $f'(v) = f(v)(1 - f(v))$
■ Makes the function CPU-cycle friendly: the derivative reuses the already-computed $f(v)$ (see the sketch below)
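A small sketch (mine) of the sigmoid together with a numerical check of the derivative identity above:

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def sigmoid_prime(v):
    """Derivative via the identity f'(v) = f(v) * (1 - f(v))."""
    fv = sigmoid(v)
    return fv * (1.0 - fv)

# Compare against a finite-difference estimate at a sample point.
v, h = 0.7, 1e-6
numeric = (sigmoid(v + h) - sigmoid(v - h)) / (2 * h)
print(sigmoid_prime(v), numeric)   # the two values agree closely
```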

Page 20: Neural Networks for Classification

Output Activation


■ Simple binary discrimination: zero-centered sigmoid

$f(v) = \frac{1 - e^{-v}}{1 + e^{-v}}$ (10)

■ Probability distribution: softmax

$f(v_i) = \frac{e^{v_i}}{\sum_j e^{v_j}}$ (11)

(Both are sketched in code below.)
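A minimal sketch (my code) of these two output activations; note the zero-centered sigmoid is algebraically the same as tanh(v/2):

```python
import math

def zero_centered_sigmoid(v):
    """Maps v to (-1, 1); equal to tanh(v / 2)."""
    return (1.0 - math.exp(-v)) / (1.0 + math.exp(-v))

def softmax(vs):
    """Turns a list of scores into a probability distribution."""
    exps = [math.exp(v) for v in vs]
    total = sum(exps)
    return [e / total for e in exps]

print(zero_centered_sigmoid(0.0))   # -> 0.0
print(softmax([1.0, 2.0, 3.0]))     # three positive numbers summing to 1
```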

Page 21: Neural Networks for Classification

The Backpropagation Algorithm


■ Works on any differentiable activation function
■ Gradient descent in weight space
■ Metaphor: a ball rolls on the error function's envelope
■ Condition: no flat portion, or the ball would stop in indifferent equilibrium
■ Some add a slight pull term:

$f(v) = \frac{1 - e^{-v}}{1 + e^{-v}} + cv$ (12)

Page 22: Neural Networks for Classification

The Task


■ Minimize the error function:

$E = \frac{1}{2} \sum_{i=1}^{p} \|o_i - t_i\|^2$ (13)

where:

  ◆ $o_i$: actual outputs
  ◆ $t_i$: desired outputs
  ◆ $p$: number of patterns

(A small numeric sketch follows.)
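A tiny sketch (mine) of computing this error over a batch of patterns, with outputs and targets given as lists of vectors:

```python
def sum_squared_error(outputs, targets):
    """E = 1/2 * sum over patterns of the squared Euclidean distance between o_i and t_i."""
    total = 0.0
    for o, t in zip(outputs, targets):
        total += sum((oj - tj) ** 2 for oj, tj in zip(o, t))
    return 0.5 * total

outputs = [[0.8, 0.2], [0.3, 0.7]]
targets = [[1.0, 0.0], [0.0, 1.0]]
print(sum_squared_error(outputs, targets))  # -> 0.13
```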

Page 23: Neural Networks for Classification

Training. The Delta Rule


■ Compute $\nabla E = \left( \frac{\partial E}{\partial w_1}, \ldots, \frac{\partial E}{\partial w_l} \right)$
■ Update the weights:

$\Delta w_i = -\gamma \frac{\partial E}{\partial w_i}, \quad i = 1, \ldots, l$ (14)

■ Expect to find a point where $\nabla E = 0$
■ The algorithm for computing $\nabla E$: backpropagation
■ Beyond the scope of this class (the sketch below uses a finite-difference gradient instead)
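A sketch (mine) of the delta rule itself; since backpropagation is out of scope here, the gradient is estimated by finite differences, which is slow but enough to show the update:

```python
def numerical_gradient(E, w, h=1e-5):
    """Finite-difference estimate of dE/dw_i for each weight (a stand-in for backprop)."""
    grad = []
    for i in range(len(w)):
        w_plus = w[:]; w_plus[i] += h
        w_minus = w[:]; w_minus[i] -= h
        grad.append((E(w_plus) - E(w_minus)) / (2 * h))
    return grad

def delta_rule_step(E, w, gamma):
    """One update: Delta w_i = -gamma * dE/dw_i."""
    grad = numerical_gradient(E, w)
    return [wi - gamma * gi for wi, gi in zip(w, grad)]

# Toy error surface with its minimum at w = (1, -2); repeated steps move toward it.
E = lambda w: (w[0] - 1.0) ** 2 + (w[1] + 2.0) ** 2
w = [0.0, 0.0]
for _ in range(50):
    w = delta_rule_step(E, w, gamma=0.1)
print(w)   # close to [1.0, -2.0]
```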

Page 24: Neural Networks for Classification

Gradient Locality


■ Only summation guarantees locality of backpropagation
■ Otherwise backpropagation would propagate errors due to one input to all inputs
■ Essential to use summation as input integration!

Page 25: Neural Networks for Classification

Regularization


■ Weights can grow uncontrollably
■ Add a regularization term that opposes weight growth:

$\Delta w_i = -\gamma \frac{\partial E}{\partial w_i} - \alpha w_i$ (15)

■ Very important practical trick (see the sketch below)
■ Also avoids overspecialization
■ Forces a smoother output
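The same delta-rule step as before (my sketch), extended with the weight-decay term $-\alpha w_i$:

```python
def regularized_delta_rule_step(grad, w, gamma, alpha):
    """Delta w_i = -gamma * dE/dw_i - alpha * w_i (gradient step plus weight decay)."""
    return [wi - gamma * gi - alpha * wi for wi, gi in zip(w, grad)]

# With a zero gradient the weights simply shrink toward 0, which is the point of the term.
print(regularized_delta_rule_step([0.0, 0.0], [2.0, -4.0], gamma=0.1, alpha=0.01))
# -> [1.98, -3.96]
```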

Page 26: Neural Networks for Classification

Local Minima


■ The gradient descent can stop in a local minimum
■ Biggest issue with neural networks
■ Overspecialization is the second biggest
■ Convergence is not guaranteed either, but regularization helps

Page 27: Neural Networks for Classification

Accommodating Discrete Inputs

Section contents:
■ Discrete Inputs
■ One-Hot Encoding
■ Optimizing One-Hot Encoding
■ One-Hot Encoding: Interesting Tidbits

Page 28: Neural Networks for Classification

Discrete Inputs


■ Many NLP applications rely on discrete features
■ Neural nets expect real numbers
■ Neural nets are smooth: similar outputs for similar inputs
■ Any two discrete inputs are "just as different" from each other
■ Treating them as integers is undemocratic: it imposes an arbitrary ordering and distance

Page 29: Neural Networks for Classification

One-Hot Encoding


■ One discrete feature with $n$ values → $n$ real inputs
■ The $i$th feature value sets the $i$th input to 1 and all others to 0 (see the sketch below)
■ The Hamming distance between any two distinct inputs is now constant!
■ Disadvantage: the input vector is much larger
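A minimal sketch (mine) of one-hot encoding a discrete feature, here a part-of-speech tag drawn from a small illustrative vocabulary:

```python
def one_hot(value, vocabulary):
    """n discrete values -> n real inputs; the chosen value's slot is 1, the rest are 0."""
    vec = [0.0] * len(vocabulary)
    vec[vocabulary.index(value)] = 1.0
    return vec

POS_TAGS = ["NOUN", "VERB", "ADJ", "ADV"]
print(one_hot("VERB", POS_TAGS))   # -> [0.0, 1.0, 0.0, 0.0]
# Any two distinct encodings differ in exactly two positions (constant Hamming distance).
```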

Page 30: Neural Networks for Classification

Optimizing One-Hot Encoding


■ Each hidden unit has all inputs zero except the $i$th one
■ Even that one is just multiplied by 1
■ Regroup the weights by discrete input, not by hidden unit!
■ Matrix $w$ of size $n \times l$
■ Input $i$ just copies row $i$ to the output (virtual multiplication by 1); see the sketch below
■ Cheap computation
■ The delta rule applies as usual
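A sketch (mine) of this optimization: instead of multiplying an n-dimensional one-hot vector by an n x l weight matrix, just select row i:

```python
import random

n, l = 4, 3                       # n discrete values, l hidden units
w = [[random.uniform(-0.1, 0.1) for _ in range(l)] for _ in range(n)]

def one_hot_times_matrix(i, w):
    """The literal computation: one-hot(i) multiplied by the n x l matrix w."""
    one_hot = [1.0 if k == i else 0.0 for k in range(len(w))]
    return [sum(one_hot[k] * w[k][j] for k in range(len(w))) for j in range(len(w[0]))]

def row_lookup(i, w):
    """The optimized version: just copy row i."""
    return w[i]

i = 2
print(one_hot_times_matrix(i, w) == row_lookup(i, w))  # -> True
```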

Page 31: Neural Networks for Classification

One-Hot Encoding: Interesting Tidbits


■ The row $w_i$ is a continuous representation of discrete feature $i$
■ Only one row is trained per sample
■ The size of the continuous representation can be chosen depending on the feature's complexity
■ Mix this continuous representation freely with "truly" continuous features, such as acoustic features

Page 32: Neural Networks for Classification

Outputs

Section contents:
■ Multi-Label Classification
■ Soft Training

Page 33: Neural Networks for Classification

Multi-Label Classification


■ $n$ real outputs summing to 1
■ Normalization is included in the softmax function:

$f(v_i) = \frac{e^{v_i}}{\sum_j e^{v_j}} = \frac{e^{v_i - v_{\max}}}{\sum_j e^{v_j - v_{\max}}}$ (16)

■ Train with $1 - \epsilon$ for the known label and $\frac{\epsilon}{n-1}$ for all others (avoids saturation); see the sketch below
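A sketch (mine) of the max-subtracted softmax, which is numerically safer and gives the same result, plus the smoothed target vector described above:

```python
import math

def softmax_stable(vs):
    """Subtracting the max does not change the result but avoids overflow in exp."""
    vmax = max(vs)
    exps = [math.exp(v - vmax) for v in vs]
    total = sum(exps)
    return [e / total for e in exps]

def smoothed_targets(known_label, n, eps=0.05):
    """1 - eps for the known label, eps / (n - 1) for every other label."""
    return [1.0 - eps if i == known_label else eps / (n - 1) for i in range(n)]

print(softmax_stable([1000.0, 1001.0, 1002.0]))  # no overflow, still sums to 1
print(smoothed_targets(known_label=2, n=4))      # -> [0.0166..., 0.0166..., 0.95, 0.0166...]
```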

Page 34: Neural Networks for Classification

Soft Training


■ Maybe the targets are a known probability distribution
■ Or we want to reduce the number of training cycles
■ Train with the actual desired distributions as desired outputs
■ Example: for feature vector $x$, labels $l_1$, $l_2$, $l_3$ are possible with equal probability
■ Train with $\frac{1-\epsilon}{3}$ for the three labels and $\frac{\epsilon}{n-3}$ for all others (see the sketch below)
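Continuing the previous sketch (mine), the target vector for several equally likely labels:

```python
def soft_targets(possible_labels, n, eps=0.05):
    """(1 - eps) shared equally among the possible labels, eps shared among the rest."""
    k = len(possible_labels)
    return [(1.0 - eps) / k if i in possible_labels else eps / (n - k)
            for i in range(n)]

print(soft_targets({0, 1, 2}, n=6))       # three entries of ~0.3166, three of ~0.0166
print(sum(soft_targets({0, 1, 2}, n=6)))  # -> 1.0 (up to floating-point rounding)
```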

Page 35: Neural Networks for Classification

NLP Applications

Section contents:
■ Language Modeling
■ Lexicon Learning
■ Word Sense Disambiguation

Page 36: Neural Networks for Classification

Language Modeling


■ Input: $n$-gram context
■ May include arbitrary word features (cool!!!)
■ Output: probability distribution over the next word
■ Automatically figures out which features are important

Page 37: Neural Networks for Classification

Lexicon Learning


■ Input: word-level features (root, stem, morphology)
■ Input: most frequent previous/next words
■ Output: probability distribution over the word's possible parts of speech

Page 38: Neural Networks for Classification

Word Sense Disambiguation


■ Input: bag of words in the context, local collocations
■ Output: probability distribution over senses

Page 39: Neural Networks for Classification

Conclusions


Page 40: Neural Networks for Classification

Conclusions


■ Neural nets are a respectable machine-learning technique
■ The theory is not fully developed
■ Local optima and overspecialization are the killers
■ Yet they can learn very complex functions
■ Long training time
■ Short testing time
■ Small memory requirements