Page 1: Neural Networks for Classification


Neural Networks for Classification

Andrei Alexandrescu

June 19, 2007

Page 2: Neural Networks for Classification

Introduction

Section contents:
■ Neural Networks: History
■ What is a Neural Network?
■ Examples of Neural Networks
■ Elements of a Neural Network

Page 3: Neural Networks for Classification

Neural Networks: History


■ Modeled after the human brain
■ Experimentation and marketing predated theory
■ Considered the forefront of the AI spring
■ Suffered from the AI winter
■ Theory today still not fully developed and understood

Page 4: Neural Networks for Classification

What is a Neural Network?


■ Essentially: a network of interconnected functional elements, each with several inputs and one output:

$y(x_1, \ldots, x_n) = f(w_1 x_1 + w_2 x_2 + \ldots + w_n x_n)$ (1)

■ The $w_i$ are parameters
■ $f$ is the activation function
■ Crucial for learning that addition is used for integrating the inputs

Page 5: Neural Networks for Classification

Examples of Neural Networks


■ Logical functions with 0/1 inputs and outputs
■ Fourier series:

$F(x) = \sum_{i \ge 0} (a_i \cos(ix) + b_i \sin(ix))$ (2)

■ Taylor series:

$F(x) = \sum_{i \ge 0} a_i (x - x_0)^i$ (3)

■ Automata

Page 6: Neural Networks for Classification

Elements of a Neural Network


■ The function performed by an element
■ The topology of the network
■ The method used to train the weights

Page 7: Neural Networks for Classification

Single-Layer Perceptrons

Section contents:
■ The Perceptron
■ Perceptron Capabilities
■ Bias
■ Training the Perceptron
■ Algorithm
■ Summary of Simple Perceptrons

Page 8: Neural Networks for Classification

The Perceptron


■ $n$ inputs, one output:

$y(x_1, \ldots, x_n) = f(w_1 x_1 + \ldots + w_n x_n)$ (4)

■ Oldest activation function (McCulloch/Pitts) is the step function, sketched in code below:

$f(v) = \mathbf{1}_{v \ge 0}$ (5)
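To make the definition concrete, here is a minimal sketch (mine, not from the slides) of a McCulloch/Pitts perceptron in Python; the function names are illustrative only.

```python
def step(v):
    """McCulloch/Pitts activation: 1 if v >= 0, else 0."""
    return 1.0 if v >= 0.0 else 0.0

def perceptron(weights, inputs):
    """One unit: weighted sum of the inputs passed through the step activation."""
    v = sum(w * x for w, x in zip(weights, inputs))
    return step(v)

# Example: with weights (1, -1) the unit outputs 1 exactly when x1 >= x2.
print(perceptron([1.0, -1.0], [0.3, 0.7]))  # -> 0.0, since 0.3 < 0.7
```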

Page 9: Neural Networks for Classification

Perceptron Capabilities


■ Advertised to be as capable as the brain itself
■ Can (only) distinguish between two linearly separable sets
■ Smallest Boolean function it cannot decide: XOR
■ Minsky's proof started the AI winter
■ It was not fully understood what connected layers could do

Page 10: Neural Networks for Classification

Bias


■ Notice that the decision hyperplane must go through the origin
■ Could be achieved by preprocessing the input
■ Not always desirable or possible
■ Add a bias input:

$y(x_1, \ldots, x_n) = f(w_0 + w_1 x_1 + \ldots + w_n x_n)$ (6)

■ Same as an input connected to the constant 1 (see the sketch below)
■ We consider that ghost input implicit henceforth
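A small illustration (mine, not the author's) of treating the bias as a ghost input fixed to 1, so the biased unit reuses the unbiased formula:

```python
def perceptron_with_bias(weights, inputs):
    """weights = [w0, w1, ..., wn]; the bias w0 multiplies a constant ghost input of 1."""
    ghost_inputs = [1.0] + list(inputs)
    v = sum(w * x for w, x in zip(weights, ghost_inputs))
    return 1.0 if v >= 0.0 else 0.0

# With the bias the decision boundary no longer has to pass through the origin:
# weights [-1.5, 1, 1] compute logical AND on 0/1 inputs.
print(perceptron_with_bias([-1.5, 1.0, 1.0], [1.0, 1.0]))  # -> 1.0
print(perceptron_with_bias([-1.5, 1.0, 1.0], [0.0, 1.0]))  # -> 0.0
```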

Page 11: Neural Networks for Classification

Training the Perceptron


■ Switch to vector notation:

$y(\mathbf{x}) = f(\mathbf{w} \cdot \mathbf{x}) = f_{\mathbf{w}}(\mathbf{x})$ (7)

■ Assume we need to separate sets of points $A$ and $B$; count the misclassifications:

$E(\mathbf{w}) = \sum_{x \in A} (1 - f_{\mathbf{w}}(x)) + \sum_{x \in B} f_{\mathbf{w}}(x)$ (8)

■ Goal: $E(\mathbf{w}) = 0$
■ Start from a random $\mathbf{w}$ and improve it

Page 12: Neural Networks for Classification

Algorithm


1. Start with a random $w$, set $t = 0$
2. Select a vector $x \in A \cup B$
3. If $x \in A$ and $w_t \cdot x \le 0$, then $w_{t+1} = w_t + x$
4. Else if $x \in B$ and $w_t \cdot x \ge 0$, then $w_{t+1} = w_t - x$
5. If any point is still misclassified, go to step 2

■ Guaranteed to converge if and only if $A$ and $B$ are linearly separable! (See the sketch below.)
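A runnable sketch of this algorithm (mine, with illustrative helper names), assuming the points are given as Python lists with any bias already folded in as a constant coordinate:

```python
import random

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def train_perceptron(A, B, max_passes=1000):
    """Perceptron learning: seeks w with w.x > 0 on A and w.x < 0 on B,
    which succeeds when the two sets are linearly separable."""
    dim = len(A[0])
    w = [random.uniform(-1, 1) for _ in range(dim)]
    for _ in range(max_passes):
        updated = False
        for x in A:
            if dot(w, x) <= 0:
                w = [wi + xi for wi, xi in zip(w, x)]
                updated = True
        for x in B:
            if dot(w, x) >= 0:
                w = [wi - xi for wi, xi in zip(w, x)]
                updated = True
        if not updated:        # every point classified correctly
            return w
    return w                   # gave up: not separable (or not yet converged)

# Toy example: A lies above the line x1 + x2 = 0, B below it; the third coordinate is the ghost bias input.
A = [[1.0, 2.0, 1.0], [2.0, 1.0, 1.0]]
B = [[-1.0, -2.0, 1.0], [-2.0, -0.5, 1.0]]
print(train_perceptron(A, B))
```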

Page 13: Neural Networks for Classification

Summary of Simple Perceptrons


■ Simple training
■ Limited capabilities
■ Reasonably efficient training
  ◆ Simplex and linear programming are better

Page 14: Neural Networks for Classification

Multi-Layer Perceptrons

Section contents:
■ Multi-Layer Perceptrons
■ A Misunderstanding of Epic Proportions
■ Workings
■ Capabilities
■ Training Prerequisite
■ Output Activation
■ The Backpropagation Algorithm
■ The Task
■ Training. The Delta Rule
■ Gradient Locality
■ Regularization
■ Local Minima

Page 15: Neural Networks for Classification

Multi-Layer Perceptrons


■ Let's connect the output of a perceptron to the input of another
■ What can we compute with this horizontal combination?
■ (We already take vertical combination for granted)

Page 16: Neural Networks for Classification

A Misunderstanding of Epic Proportions


■ Some say "two-layered" network
  ◆ Two cascaded layers of computational units
■ Some say "three-layered" network
  ◆ There is one extra input layer that does nothing
■ Let's arbitrarily choose "three-layered"
  ◆ Input
  ◆ Hidden
  ◆ Output

Page 17: Neural Networks for Classification

Workings


■ The hidden layer maps inputs into a second space: the "feature space" or "classification space"
■ This makes the job of the output layer easier

Page 18: Neural Networks for Classification

Capabilities


■ Each hidden unit computes a linear separation of the input space
■ Several hidden units can carve a polytope in the input space
■ Output units can distinguish polytope membership
■ Any union of polytopes can be decided (see the XOR sketch below)
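As an illustration (my example, not the slides'), two hidden step units can carve the region where exactly one input is 1, and one output unit then decides XOR, the function a single perceptron cannot compute:

```python
def step(v):
    return 1.0 if v >= 0.0 else 0.0

def xor_net(x1, x2):
    """Two hidden units carve half-planes; the output unit intersects them."""
    h1 = step(x1 + x2 - 0.5)        # fires when x1 OR x2
    h2 = step(-x1 - x2 + 1.5)       # fires when NOT (x1 AND x2)
    return step(h1 + h2 - 1.5)      # fires only when both hidden units fire

for a in (0.0, 1.0):
    for b in (0.0, 1.0):
        print(a, b, xor_net(a, b))  # prints the XOR truth table
```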

Page 19: Neural Networks for Classification

Training Prerequisite


■ The step function is bad for gradient-descent techniques
■ Replace it with a smooth step function, the logistic sigmoid:

$f(v) = \frac{1}{1 + e^{-v}}$ (9)

■ Notable fact: $f'(v) = f(v)(1 - f(v))$
■ Makes the function CPU-cycle friendly: the derivative reuses the already-computed $f(v)$ (see the sketch below)
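A small sketch (mine) of the sigmoid together with a numerical check of the derivative identity above:

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def sigmoid_prime(v):
    """Derivative via the identity f'(v) = f(v) * (1 - f(v))."""
    fv = sigmoid(v)
    return fv * (1.0 - fv)

# Compare against a finite-difference estimate at a sample point.
v, h = 0.7, 1e-6
numeric = (sigmoid(v + h) - sigmoid(v - h)) / (2 * h)
print(sigmoid_prime(v), numeric)   # the two values agree closely
```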

Page 20: Neural Networks for Classification

Output Activation


■ Simple binary discrimination: zero-centered sigmoid

$f(v) = \frac{1 - e^{-v}}{1 + e^{-v}}$ (10)

■ Probability distribution: softmax

$f(v_i) = \frac{e^{v_i}}{\sum_j e^{v_j}}$ (11)

(Both are sketched in code below.)
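A minimal sketch (my code) of these two output activations; note the zero-centered sigmoid is algebraically the same as tanh(v/2):

```python
import math

def zero_centered_sigmoid(v):
    """Maps v to (-1, 1); equal to tanh(v / 2)."""
    return (1.0 - math.exp(-v)) / (1.0 + math.exp(-v))

def softmax(vs):
    """Turns a list of scores into a probability distribution."""
    exps = [math.exp(v) for v in vs]
    total = sum(exps)
    return [e / total for e in exps]

print(zero_centered_sigmoid(0.0))   # -> 0.0
print(softmax([1.0, 2.0, 3.0]))     # three positive numbers summing to 1
```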

Page 21: Neural Networks for Classification

The Backpropagation Algorithm


■ Works on any differentiable activation function
■ Gradient descent in weight space
■ Metaphor: a ball rolls on the error function's envelope
■ Condition: no flat portion, or the ball would stop in indifferent equilibrium
■ Some add a slight pull term:

$f(v) = \frac{1 - e^{-v}}{1 + e^{-v}} + cv$ (12)

Page 22: Neural Networks for Classification

The Task


■ Minimize the error function:

$E = \frac{1}{2} \sum_{i=1}^{p} \|o_i - t_i\|^2$ (13)

where:

  ◆ $o_i$: actual outputs
  ◆ $t_i$: desired outputs
  ◆ $p$: number of patterns

(A small numeric sketch follows.)
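A tiny sketch (mine) of computing this error over a batch of patterns, with outputs and targets given as lists of vectors:

```python
def sum_squared_error(outputs, targets):
    """E = 1/2 * sum over patterns of the squared Euclidean distance between o_i and t_i."""
    total = 0.0
    for o, t in zip(outputs, targets):
        total += sum((oj - tj) ** 2 for oj, tj in zip(o, t))
    return 0.5 * total

outputs = [[0.8, 0.2], [0.3, 0.7]]
targets = [[1.0, 0.0], [0.0, 1.0]]
print(sum_squared_error(outputs, targets))  # -> 0.13
```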

Page 23: Neural Networks for Classification

Training. The Delta Rule


■ Compute $\nabla E = \left( \frac{\partial E}{\partial w_1}, \ldots, \frac{\partial E}{\partial w_l} \right)$
■ Update the weights:

$\Delta w_i = -\gamma \frac{\partial E}{\partial w_i}, \quad i = 1, \ldots, l$ (14)

■ Expect to find a point where $\nabla E = 0$
■ The algorithm for computing $\nabla E$: backpropagation
■ Beyond the scope of this class (the sketch below uses a finite-difference gradient instead)
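A sketch (mine) of the delta rule itself; since backpropagation is out of scope here, the gradient is estimated by finite differences, which is slow but enough to show the update:

```python
def numerical_gradient(E, w, h=1e-5):
    """Finite-difference estimate of dE/dw_i for each weight (a stand-in for backprop)."""
    grad = []
    for i in range(len(w)):
        w_plus = w[:]; w_plus[i] += h
        w_minus = w[:]; w_minus[i] -= h
        grad.append((E(w_plus) - E(w_minus)) / (2 * h))
    return grad

def delta_rule_step(E, w, gamma):
    """One update: Delta w_i = -gamma * dE/dw_i."""
    grad = numerical_gradient(E, w)
    return [wi - gamma * gi for wi, gi in zip(w, grad)]

# Toy error surface with its minimum at w = (1, -2); repeated steps move toward it.
E = lambda w: (w[0] - 1.0) ** 2 + (w[1] + 2.0) ** 2
w = [0.0, 0.0]
for _ in range(50):
    w = delta_rule_step(E, w, gamma=0.1)
print(w)   # close to [1.0, -2.0]
```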

Page 24: Neural Networks for Classification

Gradient Locality


■ Only summation guarantees locality of backpropagation
■ Otherwise backpropagation would propagate errors due to one input to all inputs
■ Essential to use summation as input integration!

Page 25: Neural Networks for Classification

Regularization


■ Weights can grow uncontrollably
■ Add a regularization term that opposes weight growth:

$\Delta w_i = -\gamma \frac{\partial E}{\partial w_i} - \alpha w_i$ (15)

■ Very important practical trick (see the sketch below)
■ Also avoids overspecialization
■ Forces a smoother output
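The same delta-rule step as before (my sketch), extended with the weight-decay term $-\alpha w_i$:

```python
def regularized_delta_rule_step(grad, w, gamma, alpha):
    """Delta w_i = -gamma * dE/dw_i - alpha * w_i (gradient step plus weight decay)."""
    return [wi - gamma * gi - alpha * wi for wi, gi in zip(w, grad)]

# With a zero gradient the weights simply shrink toward 0, which is the point of the term.
print(regularized_delta_rule_step([0.0, 0.0], [2.0, -4.0], gamma=0.1, alpha=0.01))
# -> [1.98, -3.96]
```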

Page 26: Neural Networks for Classification

Local Minima


■ The gradient descent can stop in a local minimum
■ Biggest issue with neural networks
■ Overspecialization is the second biggest
■ Convergence is not guaranteed either, but regularization helps

Page 27: Neural Networks for Classification

Accommodating Discrete Inputs

Section contents:
■ Discrete Inputs
■ One-Hot Encoding
■ Optimizing One-Hot Encoding
■ One-Hot Encoding: Interesting Tidbits

Page 28: Neural Networks for Classification

Discrete Inputs


■ Many NLP applications rely on discrete features
■ Neural nets expect real numbers
■ Neural nets are smooth: similar outputs for similar inputs
■ Any two discrete inputs are "just as different" from each other
■ Treating them as integers is undemocratic: it imposes an arbitrary ordering and distance

Page 29: Neural Networks for Classification

One-Hot Encoding


■ One discrete feature with $n$ values → $n$ real inputs
■ The $i$th feature value sets the $i$th input to 1 and all others to 0 (see the sketch below)
■ The Hamming distance between any two distinct inputs is now constant!
■ Disadvantage: the input vector is much larger
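A minimal sketch (mine) of one-hot encoding a discrete feature, here a part-of-speech tag drawn from a small illustrative vocabulary:

```python
def one_hot(value, vocabulary):
    """n discrete values -> n real inputs; the chosen value's slot is 1, the rest are 0."""
    vec = [0.0] * len(vocabulary)
    vec[vocabulary.index(value)] = 1.0
    return vec

POS_TAGS = ["NOUN", "VERB", "ADJ", "ADV"]
print(one_hot("VERB", POS_TAGS))   # -> [0.0, 1.0, 0.0, 0.0]
# Any two distinct encodings differ in exactly two positions (constant Hamming distance).
```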

Page 30: Neural Networks for Classification

Optimizing One-Hot Encoding


■ Each hidden unit has all inputs zero except the $i$th one
■ Even that one is just multiplied by 1
■ Regroup the weights by discrete input, not by hidden unit!
■ Matrix $w$ of size $n \times l$
■ Input $i$ just copies row $i$ to the output (virtual multiplication by 1); see the sketch below
■ Cheap computation
■ The delta rule applies as usual
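A sketch (mine) of this optimization: instead of multiplying an n-dimensional one-hot vector by an n x l weight matrix, just select row i:

```python
import random

n, l = 4, 3                       # n discrete values, l hidden units
w = [[random.uniform(-0.1, 0.1) for _ in range(l)] for _ in range(n)]

def one_hot_times_matrix(i, w):
    """The literal computation: one-hot(i) multiplied by the n x l matrix w."""
    one_hot = [1.0 if k == i else 0.0 for k in range(len(w))]
    return [sum(one_hot[k] * w[k][j] for k in range(len(w))) for j in range(len(w[0]))]

def row_lookup(i, w):
    """The optimized version: just copy row i."""
    return w[i]

i = 2
print(one_hot_times_matrix(i, w) == row_lookup(i, w))  # -> True
```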

Page 31: Neural Networks for Classification

One-Hot Encoding: Interesting Tidbits


■ The row $w_i$ is a continuous representation of discrete feature $i$
■ Only one row is trained per sample
■ The size of the continuous representation can be chosen depending on the feature's complexity
■ Mix this continuous representation freely with "truly" continuous features, such as acoustic features

Page 32: Neural Networks for Classification

Outputs

Section contents:
■ Multi-Label Classification
■ Soft Training

Page 33: Neural Networks for Classification

Multi-Label Classification


■ $n$ real outputs summing to 1
■ Normalization is included in the softmax function:

$f(v_i) = \frac{e^{v_i}}{\sum_j e^{v_j}} = \frac{e^{v_i - v_{\max}}}{\sum_j e^{v_j - v_{\max}}}$ (16)

■ Train with $1 - \epsilon$ for the known label and $\frac{\epsilon}{n-1}$ for all others (avoids saturation); see the sketch below
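A sketch (mine) of the max-subtracted softmax, which is numerically safer and gives the same result, plus the smoothed target vector described above:

```python
import math

def softmax_stable(vs):
    """Subtracting the max does not change the result but avoids overflow in exp."""
    vmax = max(vs)
    exps = [math.exp(v - vmax) for v in vs]
    total = sum(exps)
    return [e / total for e in exps]

def smoothed_targets(known_label, n, eps=0.05):
    """1 - eps for the known label, eps / (n - 1) for every other label."""
    return [1.0 - eps if i == known_label else eps / (n - 1) for i in range(n)]

print(softmax_stable([1000.0, 1001.0, 1002.0]))  # no overflow, still sums to 1
print(smoothed_targets(known_label=2, n=4))      # -> [0.0166..., 0.0166..., 0.95, 0.0166...]
```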

Page 34: Neural Networks for Classification

Soft Training


■ Maybe the targets are a known probability distribution
■ Or we want to reduce the number of training cycles
■ Train with the actual desired distributions as desired outputs
■ Example: for feature vector $x$, labels $l_1$, $l_2$, $l_3$ are possible with equal probability
■ Train with $\frac{1-\epsilon}{3}$ for the three labels and $\frac{\epsilon}{n-3}$ for all others (see the sketch below)
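Continuing the previous sketch (mine), the target vector for several equally likely labels:

```python
def soft_targets(possible_labels, n, eps=0.05):
    """(1 - eps) shared equally among the possible labels, eps shared among the rest."""
    k = len(possible_labels)
    return [(1.0 - eps) / k if i in possible_labels else eps / (n - k)
            for i in range(n)]

print(soft_targets({0, 1, 2}, n=6))       # three entries of ~0.3166, three of ~0.0166
print(sum(soft_targets({0, 1, 2}, n=6)))  # -> 1.0 (up to floating-point rounding)
```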

Page 35: Neural Networks for Classification

NLP Applications

Section contents:
■ Language Modeling
■ Lexicon Learning
■ Word Sense Disambiguation

Page 36: Neural Networks for Classification

Language Modeling


■ Input: $n$-gram context
■ May include arbitrary word features (cool!!!)
■ Output: probability distribution over the next word
■ Automatically figures out which features are important

Page 37: Neural Networks for Classification

Lexicon Learning


■ Input: word-level features (root, stem, morphology)
■ Input: most frequent previous/next words
■ Output: probability distribution over the word's possible parts of speech

Page 38: Neural Networks for Classification

Word Sense Disambiguation


■ Input: bag of words in the context, local collocations
■ Output: probability distribution over senses

Page 39: Neural Networks for Classification

Conclusions


Page 40: Neural Networks for Classification

Conclusions


■ Neural nets are a respectable machine-learning technique
■ The theory is not fully developed
■ Local optima and overspecialization are the killers
■ Yet they can learn very complex functions
■ Long training time
■ Short testing time
■ Small memory requirements