Feed-forward Neural Networks

Ying Wu

Electrical Engineering and Computer Science, Northwestern University

Evanston, IL 60208

http://www.eecs.northwestern.edu/~yingwu

Connectionism

◮ How does our brain process information?

◮ Are we Turing Machines?

◮ Things that are difficult for Turing Machines

◮ Perception is difficult for Turing Machines
◮ We have so many neurons

◮ How do they work?

◮ Can we have computational models?

◮ Connectionism vs. Computationalism

History

◮ In the 1940s
  ◮ The model for the neuron (McCulloch & Pitts, 1943)
  ◮ The Hebbian learning rule (Hebb, 1949)
◮ ր in the 1950s
  ◮ The Perceptron (Rosenblatt, 1950s)
◮ ց in the 1960s
  ◮ Limitations of the Perceptron (Minsky & Papert, 1969)
  ◮ Expert systems were hot by then
◮ ր again in the 1980s
  ◮ Hopfield feed-back network (Hopfield, 1982)
  ◮ Back-propagation algorithm (Rumelhart & Le Cun, 1986)
  ◮ Expert systems ⇓
◮ ց again in the 1990s
  ◮ Overfitting in neural networks
  ◮ SVM was hot (Vapnik, 1995)
◮ Where to go next?

Outline

Neuron Model

Multi-Layer Perceptron

Radial Basis Function Networks

Neuron: the Basic Unit

[Figure: a single neuron — inputs x_1, x_2, ..., x_d weighted by w_1, w_2, ..., w_d and combined to produce the output y]

◮ Input $\mathbf{x} = [1, x_1, \ldots, x_d]^T \in \mathbb{R}^{d+1}$
◮ Connection weights (i.e., synapses) $\mathbf{w} = [w_0, w_1, \ldots, w_d]^T \in \mathbb{R}^{d+1}$
◮ Net activation:
  $$net = \sum_{i=0}^{d} w_i x_i = \mathbf{w}^T \mathbf{x}$$
◮ Activation function and output:
  $$y = f(net) = f(\mathbf{w}^T \mathbf{x})$$

Activation Function

◮ The activation function introduces nonlinearity
◮ We can use the sign function
  $$f(x) = \mathrm{sgn}(x) = \begin{cases} 1 & x \ge 0 \\ -1 & x < 0 \end{cases}$$
◮ Or we can use the sigmoid function
  $$f(x) = \frac{2}{1 + e^{-2x}} - 1, \qquad f(x) \in (-1, 1)$$
  with derivative
  $$f'(x) = 1 - f^2(x)$$
◮ Or
  $$f(x) = \frac{1}{1 + e^{-x}}, \qquad f(x) \in (0, 1)$$
  with derivative
  $$f'(x) = f(x)[1 - f(x)] = \frac{e^{-x}}{(1 + e^{-x})^2}$$
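
To make the neuron model concrete, here is a minimal NumPy sketch of a single unit with the two sigmoid-type activations above and their derivatives; the function and variable names are illustrative choices, not part of the handout.

```python
import numpy as np

def tanh_sigmoid(x):
    """f(x) = 2 / (1 + exp(-2x)) - 1, i.e. tanh(x); output in (-1, 1)."""
    return 2.0 / (1.0 + np.exp(-2.0 * x)) - 1.0

def tanh_sigmoid_deriv(x):
    """f'(x) = 1 - f(x)^2."""
    f = tanh_sigmoid(x)
    return 1.0 - f ** 2

def logistic(x):
    """f(x) = 1 / (1 + exp(-x)); output in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def logistic_deriv(x):
    """f'(x) = f(x) * (1 - f(x))."""
    f = logistic(x)
    return f * (1.0 - f)

def neuron_output(x, w, f=logistic):
    """y = f(net) with net = w^T x; x is augmented with a leading 1 for the bias w_0."""
    x_aug = np.concatenate(([1.0], x))   # x = [1, x_1, ..., x_d]^T
    net = w @ x_aug                      # net = sum_i w_i x_i
    return f(net)

# Example: a neuron with d = 2 inputs and weights [w_0, w_1, w_2]
w = np.array([0.5, -1.0, 2.0])
print(neuron_output(np.array([0.3, 0.7]), w))
```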

Outline

Neuron Model

Multi-Layer Perceptron

Radial Basis Function Networks

Perceptron

[Figure: a two-layer perceptron — input layer x_1, x_2, ..., x_d fully connected to output layer z_1, ..., z_c]

◮ Two layers (input and linear output)
◮ Desired output $\mathbf{t} = [t_1, \ldots, t_c]^T \in \mathbb{R}^c$
◮ Actual output $z_i = \mathbf{w}_i^T \mathbf{x}$, $i = 1, \ldots, c$
◮ Learning (Widrow-Hoff)
  $$\mathbf{w}_i(t+1) = \mathbf{w}_i(t) + \eta (t_i - z_i)\mathbf{x} = \mathbf{w}_i(t) + \eta (t_i - \mathbf{w}_i^T \mathbf{x})\mathbf{x}$$
◮ It only works for linearly separable patterns
◮ It cannot even solve the simple XOR problem
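
A minimal sketch of the Widrow-Hoff rule above, assuming hypothetical arrays `X` (inputs, each row already augmented with a leading 1 for the bias) and `T` (desired outputs):

```python
import numpy as np

def widrow_hoff(X, T, eta=0.1, epochs=20):
    """Train c linear output units with w_i <- w_i + eta * (t_i - w_i^T x) * x.

    X: (n, d+1) inputs, each row augmented with a leading 1.
    T: (n, c) desired outputs.
    Returns W of shape (c, d+1), one weight vector per output unit.
    """
    n, d1 = X.shape
    c = T.shape[1]
    W = np.zeros((c, d1))
    for _ in range(epochs):
        for x, t in zip(X, T):
            z = W @ x                        # actual outputs z_i = w_i^T x
            W += eta * np.outer(t - z, x)    # all c updates at once
    return W
```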

Multi-layer Network

[Figure: a multi-layer network — input layer x_1, x_2, ..., x_d, a hidden layer, and an output layer]

◮ Input layer $\mathbf{x} = [1, x_1, \ldots, x_d]^T \in \mathbb{R}^{d+1}$
◮ Hidden layer $y_j = f(\mathbf{w}_j^T \mathbf{x})$, $j = 1, \ldots, n_H$
◮ Output layer $z_k = f(\mathbf{w}_k^T \mathbf{y})$, $k = 1, \ldots, c$
◮ The weight between hidden node $y_j$ and input node $x_i$ is $w_{ji}$
◮ The weight between output node $z_k$ and hidden node $y_j$ is $w_{kj}$
◮ There may be multiple hidden layers

Discriminant Function

◮ For a 3-layer network, the discriminant function is
  $$g_k(\mathbf{x}) = z_k = f\left( \sum_{j=1}^{n_H} w_{kj} \, f\left( \sum_{i=1}^{d} w_{ji} x_i + w_{j0} \right) + w_{k0} \right)$$
◮ Kolmogorov showed that a 3-layer structure is enough to approximate any nonlinear function
◮ We expect a 3-layer MLP to be able to form any decision boundary
◮ Certainly, the nonlinearity depends on $n_H$, the number of hidden units
  ◮ A larger $n_H$ results in overfitting
  ◮ A smaller $n_H$ leads to underfitting
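
A minimal sketch of this forward computation for a single sample, assuming hypothetical weight matrices `W_hidden` and `W_out` that carry the biases $w_{j0}$ and $w_{k0}$ in their first columns:

```python
import numpy as np

def forward(x, W_hidden, W_out, f=np.tanh):
    """Compute g_k(x) = f( sum_j w_kj f( sum_i w_ji x_i + w_j0 ) + w_k0 ).

    x:        (d,) raw input
    W_hidden: (n_H, d+1) hidden weights, column 0 holds the biases w_j0
    W_out:    (c, n_H+1) output weights, column 0 holds the biases w_k0
    """
    x_aug = np.concatenate(([1.0], x))   # prepend 1 for the bias
    y = f(W_hidden @ x_aug)              # hidden activations y_j
    y_aug = np.concatenate(([1.0], y))
    z = f(W_out @ y_aug)                 # outputs z_k = g_k(x)
    return y, z
```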

Training the Network

◮ Desired output of the network $\mathbf{t} = [t_1, \ldots, t_c]^T$
◮ The objective in training the weights $\{\mathbf{w}_j, \mathbf{w}_k\}$ is
  $$J(\mathbf{w}) = \frac{1}{2} \|\mathbf{t} - \mathbf{z}\|^2$$
◮ We need to find the set of $\{\mathbf{w}_j, \mathbf{w}_k\}$ that minimizes $J$
◮ This can be done through gradient-based optimization
◮ In a general form
  $$\mathbf{w}(k+1) = \mathbf{w}(k) - \eta \frac{\partial J}{\partial \mathbf{w}}$$
◮ To make it clear, let's do it component by component
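
As a quick sketch of the generic update rule (not yet the back-propagation computation derived next), here is a gradient-descent step using a finite-difference gradient; `J` stands for any hypothetical scalar loss over a flat weight vector:

```python
import numpy as np

def numerical_grad(J, w, eps=1e-6):
    """Finite-difference estimate of dJ/dw; fine for sanity checks, too slow for training."""
    g = np.zeros_like(w)
    for i in range(w.size):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += eps
        w_minus[i] -= eps
        g[i] = (J(w_plus) - J(w_minus)) / (2 * eps)
    return g

def gradient_step(J, w, eta=0.1):
    """w(k+1) = w(k) - eta * dJ/dw."""
    return w - eta * numerical_grad(J, w)
```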

Back-propagation (BP): output-to-hidden weights $w_{kj}$

◮ $w_{kj}$ is the weight between output node $k$ and hidden node $j$
  $$\frac{\partial J}{\partial w_{kj}} = \frac{\partial J}{\partial net_k}\,\frac{\partial net_k}{\partial w_{kj}}$$
◮ Define the sensitivity of a general node $i$ as
  $$\delta_i = -\frac{\partial J}{\partial net_i}$$
◮ In this case, for the output node $k$,
  $$\delta_k = -\frac{\partial J}{\partial net_k} = -\frac{\partial J}{\partial z_k}\,\frac{\partial z_k}{\partial net_k} = (t_k - z_k)\,f'(net_k)$$
◮ As $net_k = \sum_{j=1}^{n_H} w_{kj} y_j$, it is clear that
  $$\frac{\partial net_k}{\partial w_{kj}} = y_j$$
◮ So we have $\Delta w_{kj} = \eta\,\delta_k\,y_j = \eta\,(t_k - z_k)\,f'(net_k)\,y_j$, which is a generalization of Widrow-Hoff

Back-propagation (BP): hidden-to-input weights $w_{ji}$

◮ As before,
  $$\frac{\partial J}{\partial w_{ji}} = \frac{\partial J}{\partial y_j}\,\underbrace{\frac{\partial y_j}{\partial net_j}\,\frac{\partial net_j}{\partial w_{ji}}}_{\text{easy}}$$
◮ The first factor is a little more complicated:
  $$\frac{\partial J}{\partial y_j} = \frac{\partial}{\partial y_j}\,\frac{1}{2}\sum_{k=1}^{c}(t_k - z_k)^2 = -\sum_{k=1}^{c}(t_k - z_k)\frac{\partial z_k}{\partial y_j} = -\sum_{k=1}^{c}(t_k - z_k)\frac{\partial z_k}{\partial net_k}\,\frac{\partial net_k}{\partial y_j} = -\sum_{k=1}^{c}(t_k - z_k)\,f'(net_k)\,w_{kj} = -\sum_{k=1}^{c}\delta_k w_{kj}$$
◮ We can then compute the sensitivity of the hidden node $j$:
  $$\delta_j = -\frac{\partial J}{\partial net_j} = -\frac{\partial J}{\partial y_j}\,\frac{\partial y_j}{\partial net_j} = f'(net_j)\sum_{k=1}^{c} w_{kj}\delta_k$$

Why is it Called “Back Propagation”?

[Figure: hidden node j connected to output nodes 1, 2, ..., k, ..., c by weights w_kj; the output-layer sensitivities flow backward through these weights to node j]

◮ The sensitivity $\delta_i$ reflects the information at node $i$
◮ $\delta_j$ of a hidden node $j$ combines two sources of information:
  ◮ a linear combination of the output-layer sensitivities, $\sum_{k=1}^{c} w_{kj}\delta_k$
  ◮ its local information $f'(net_j)$
◮ The learning rule for the hidden-to-input weights is
  $$\Delta w_{ji} = \eta\,\delta_j\,x_i = \eta\,f'(net_j)\left[\sum_{k=1}^{c} w_{kj}\delta_k\right] x_i$$

Algorithm: Back-propagation (BP)

Algorithm 1: Stochastic Back-propagation

  Init: n_H, w, stopping criterion θ, η, k = 0
  Do
      k ← k + 1
      x^k ← randomly picked training sample
      forward: compute y and then z
      backward: compute {δ_k} and then {δ_j}
      w_kj ← w_kj + η δ_k y_j
      w_ji ← w_ji + η δ_j x_i
  Until ‖∇J(w)‖ < θ
  Return w

◮ This is the one-sample BP training
◮ It can be easily extended to batch training
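
A minimal NumPy sketch of this one-sample BP loop for a single hidden layer, using tanh units so that $f'(net) = 1 - f(net)^2$; the interface (argument names, a fixed epoch count instead of the gradient-norm test) is an illustrative choice, not the handout's:

```python
import numpy as np

def train_bp(X, T, n_hidden, eta=0.05, epochs=500, seed=0):
    """Stochastic back-propagation for a d -> n_hidden -> c network with tanh units.

    X: (n, d) inputs, T: (n, c) desired outputs in (-1, 1).
    Returns (W_h, W_o), both with a bias column at index 0.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    c = T.shape[1]
    W_h = rng.normal(0, 0.1, (n_hidden, d + 1))   # w_ji, bias in column 0
    W_o = rng.normal(0, 0.1, (c, n_hidden + 1))   # w_kj, bias in column 0

    for _ in range(epochs):
        for idx in rng.permutation(n):            # x^k <- randomly picked
            x = np.concatenate(([1.0], X[idx]))
            t = T[idx]
            # forward: compute y, then z
            y = np.concatenate(([1.0], np.tanh(W_h @ x)))
            z = np.tanh(W_o @ y)
            # backward: delta_k, then delta_j
            delta_k = (t - z) * (1.0 - z ** 2)                     # (t_k - z_k) f'(net_k)
            delta_j = (1.0 - y[1:] ** 2) * (W_o[:, 1:].T @ delta_k)  # f'(net_j) sum_k w_kj delta_k
            # weight updates
            W_o += eta * np.outer(delta_k, y)                      # w_kj += eta delta_k y_j
            W_h += eta * np.outer(delta_j, x)                      # w_ji += eta delta_j x_i
    return W_h, W_o

# Example: the XOR problem that a single-layer perceptron cannot solve
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[-1], [1], [1], [-1]], dtype=float)
W_h, W_o = train_bp(X, T, n_hidden=3, eta=0.1, epochs=2000)
```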

Bayes Discriminant and MLP

◮ For linear discriminative models, we know that the MSE/MMSE solutions approximate the Bayes discriminant asymptotically
◮ An MLP can do better by approximating the posteriors
◮ Suppose we have $c$ classes and the desired output is $t_k(\mathbf{x}) = 1$ if $\mathbf{x} \in \omega_k$, and $0$ otherwise
◮ The MLP criterion is
  $$J(\mathbf{w}) = \sum_{\mathbf{x}} [g_k(\mathbf{x}; \mathbf{w}) - t_k]^2 = \sum_{\mathbf{x} \in \omega_k} [g_k(\mathbf{x}; \mathbf{w}) - 1]^2 + \sum_{\mathbf{x} \notin \omega_k} [g_k(\mathbf{x}; \mathbf{w}) - 0]^2$$
◮ It can be shown that minimizing $\lim_{n \to \infty} \frac{1}{n} J(\mathbf{w})$ is equivalent to minimizing
  $$\int [g_k(\mathbf{x}; \mathbf{w}) - P(\omega_k | \mathbf{x})]^2 \, p(\mathbf{x}) \, d\mathbf{x}$$
◮ This means the output units represent the posteriors:
  $$g_k(\mathbf{x}; \mathbf{w}) \simeq P(\omega_k | \mathbf{x})$$
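
For reference, a sketch of the standard argument, under the usual assumption that the training samples are drawn i.i.d. from $p(\mathbf{x}, \omega)$: the sample average of the squared error converges to an expectation that splits into a $\mathbf{w}$-dependent and a $\mathbf{w}$-independent part,

$$\lim_{n\to\infty} \frac{1}{n} J(\mathbf{w}) = \int [g_k(\mathbf{x};\mathbf{w}) - P(\omega_k|\mathbf{x})]^2 \, p(\mathbf{x})\, d\mathbf{x} + \int P(\omega_k|\mathbf{x})\,[1 - P(\omega_k|\mathbf{x})]\, p(\mathbf{x})\, d\mathbf{x}$$

The second integral does not depend on $\mathbf{w}$, so minimizing $J$ asymptotically drives $g_k(\mathbf{x};\mathbf{w})$ toward $P(\omega_k|\mathbf{x})$ in the mean-squared sense.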

Outputs as Probabilities

◮ If we want the outputs of the MLP to be posteriors
◮ The desired outputs in training should then be in $[0, 1]$
◮ As we have a limited number of training samples, the outputs may not sum to 1
◮ We can use a different activation function for the output layer
◮ Softmax activation:
  $$z_k = \frac{e^{net_k}}{\sum_{m=1}^{c} e^{net_m}}$$
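
A short, numerically stable sketch of the softmax activation (subtracting the maximum before exponentiating is a common implementation trick to avoid overflow, not something required by the formula itself):

```python
import numpy as np

def softmax(net):
    """z_k = exp(net_k) / sum_m exp(net_m), computed stably."""
    e = np.exp(net - np.max(net))   # shifting by max(net) leaves the ratio unchanged
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))   # entries in (0, 1) and sum to 1
```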

Practice: Number of Hidden Units

◮ The number of hidden nodes is the most critical parameter in an MLP
◮ It determines the expressive power of the network and the complexity of the decision boundary
◮ A smaller number leads to a simpler boundary, while a larger number can produce a very complicated one
◮ Overfitting and generalizability
◮ Unfortunately, there is no foolproof method for choosing this parameter
◮ Many heuristics have been proposed

Practice: Learning Rates

◮ Another critical parameter in an MLP is the learning rate $\eta$
◮ In principle, if $\eta$ is small enough, the iteration converges
  ◮ But very slowly
◮ To speed it up, we need to use 2nd-order gradient information, e.g., Newton's method, in training

Practice: Plateaus and Momentum

◮ Error surfaces often have plateaus where $\frac{\partial J(\mathbf{w})}{\partial \mathbf{w}}$ is very small, so the iteration can hardly move on
◮ Introduce momentum to push it along
◮ Momentum uses the weight change from the previous iteration:
  $$\mathbf{w}(k+1) = \mathbf{w}(k) + (1 - \alpha)\,\Delta\mathbf{w}_{bp}(k) + \alpha\,\Delta\mathbf{w}(k-1)$$

Algorithm 2: Stochastic Back-propagation with Momentum

  Init: n_H, w, b = 0, stopping criterion θ, η, α, k = 0
  Do
      k ← k + 1
      x^k ← randomly picked training sample
      b_kj ← η(1 − α) δ_k y_j + α b_kj ;   b_ji ← η(1 − α) δ_j x_i + α b_ji
      w_kj ← w_kj + b_kj ;   w_ji ← w_ji + b_ji
  Until ‖∇J(w)‖ < θ
  Return w
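
A minimal sketch of the momentum update inside the training loop, keeping a running weight-change term `b` per weight matrix (the variable and function names are illustrative):

```python
import numpy as np

def momentum_step(W, b, grad_step, alpha=0.9):
    """Blend the current BP step with the previous change: b <- (1-alpha)*grad_step + alpha*b.

    W:         current weight matrix
    b:         previous accumulated change (same shape as W), initially zeros
    grad_step: the plain BP change, e.g. eta * np.outer(delta_k, y)
    Returns the updated (W, b).
    """
    b = (1.0 - alpha) * grad_step + alpha * b
    return W + b, b
```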

Outline

Neuron Model

Multi-Layer Perceptron

Radial Basis Function Networks

Radial Basis Function Network

[Figure: an RBF network — input layer x_1, x_2, ..., x_d, a hidden layer of kernel units K, and an output layer connected by weights w_kj]

◮ The input-hidden weights are all 1
◮ The activation function for the hidden units is a Radial Basis Function (RBF), e.g.,
  $$K(\|\mathbf{x} - \mathbf{x}_c\|) = \exp\left\{ -\frac{\|\mathbf{x} - \mathbf{x}_c\|^2}{2\sigma^2} \right\}$$
◮ The output is
  $$z_k(\mathbf{x}) = \sum_{j=0}^{n_H} w_{kj} K(\mathbf{x}, \mathbf{x}_j)$$
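
A minimal sketch of the RBF forward pass, with Gaussian kernels around hypothetical `centers` and the bias handled as the $j = 0$ term:

```python
import numpy as np

def rbf_features(x, centers, sigma=1.0):
    """Hidden outputs K(||x - x_j||) = exp(-||x - x_j||^2 / (2 sigma^2)), plus a constant 1 for j = 0."""
    d2 = np.sum((centers - x) ** 2, axis=1)
    return np.concatenate(([1.0], np.exp(-d2 / (2.0 * sigma ** 2))))

def rbf_output(x, centers, W, sigma=1.0):
    """z_k(x) = sum_j w_kj K(x, x_j); W has shape (c, n_H + 1)."""
    return W @ rbf_features(x, centers, sigma)
```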

Interpretation

◮ It can be treated as function approximation: a linear combination of a set of basis functions
◮ The hidden units transform the original feature space into another (high-dimensional) feature space by using the kernel
◮ We hope the data become linearly separable in the new feature space
◮ This is what we did with kernel machines!

Learning

◮ Parameters in the RBF network:
  ◮ The basis center $\mathbf{x}_j$ for each hidden node
  ◮ The variance $\sigma$ of the RBF
  ◮ The weights $\mathbf{W}$
◮ Once the RBF parameters are set, $\mathbf{W}$ can be found by the pseudo-inverse or by Widrow-Hoff
◮ Finding the RBF parameters is not easy
  ◮ Uniformly select the centers
  ◮ Use the data cluster centers as the $\mathbf{x}_j$

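A minimal sketch of fitting $\mathbf{W}$ by the pseudo-inverse once the centers and $\sigma$ are fixed; choosing the centers as a random subset of the data is just one simple option (cluster centers, e.g. from k-means, are another), and the names here are illustrative:

```python
import numpy as np

def fit_rbf_weights(X, T, n_centers=10, sigma=1.0, seed=0):
    """Pick centers, build the RBF design matrix, then solve for W by pseudo-inverse.

    X: (n, d) inputs, T: (n, c) desired outputs.
    """
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=n_centers, replace=False)]
    # Design matrix Phi: row i is [1, K(x_i, x_1), ..., K(x_i, x_{n_H})]
    d2 = np.sum((X[:, None, :] - centers[None, :, :]) ** 2, axis=2)   # (n, n_centers)
    Phi = np.hstack([np.ones((len(X), 1)), np.exp(-d2 / (2.0 * sigma ** 2))])
    # Least-squares fit: Phi @ W.T ~ T  =>  W = (pinv(Phi) @ T).T
    W = (np.linalg.pinv(Phi) @ T).T
    return centers, W
```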