Feed-forward Neural Networks
Ying Wu
Electrical Engineering and Computer Science
Northwestern University
Evanston, IL 60208
http://www.eecs.northwestern.edu/~yingwu
Connectionism
◮ How does our brain process information?
◮ Are we Turing Machines?
◮ Things that are difficult for Turing Machines
◮ Perception is difficult for Turing Machines
◮ We have so many neurons
◮ How do they work?
◮ Can we have computational models?
◮ Connectionism vs. Computationalism
History
◮ ր in the 1940s
  ◮ The model for the neuron (McCulloch & Pitts, 1943)
  ◮ The Hebbian learning rule (Hebb, 1949)
◮ ր in the 1950s
  ◮ Perceptron (Rosenblatt, 1950s)
◮ ց in the 1960s
  ◮ Limitation of the Perceptron (Minsky & Papert, 1969)
  ◮ Expert systems were so hot by then
◮ ր again in the 1980s
  ◮ Hopfield feed-back network (Hopfield, 1982)
  ◮ Back-propagation algorithm (Rumelhart & Le Cun, 1986)
  ◮ Expert systems ⇓
◮ ց again in the 1990s
  ◮ Overfitting in neural networks
  ◮ SVM was so hot (Vapnik, 1995)
◮ Where to go?
Outline
Neuron Model
Multi-Layer Perceptron
Radial Basis Function Networks
Neuron: the Basic Unit
[Figure: a single neuron with inputs x_1, . . . , x_d, weights w_1, . . . , w_d, and output y]
◮ Input x = [1, x_1, . . . , x_d]^T ∈ R^{d+1}
◮ Connection weights (i.e., synapses) w = [w_0, . . . , w_d]^T ∈ R^{d+1}
◮ Net activation:

  net = Σ_{i=0}^{d} w_i x_i = w^T x

◮ Activation function and output

  y = f(net) = f(w^T x)
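As a minimal illustration of the neuron model above, the following Python sketch (assuming NumPy; the weights and inputs are arbitrary numbers of my own) computes the net activation and passes it through an activation function.

import numpy as np

def neuron_output(w, x, f=np.tanh):
    # Augment the input with a leading 1 so that w[0] plays the role of the bias w_0
    x_aug = np.concatenate(([1.0], x))   # x = [1, x_1, ..., x_d]^T
    net = w @ x_aug                      # net = sum_{i=0}^{d} w_i x_i = w^T x
    return f(net)                        # y = f(net)

# Illustration with arbitrary numbers
w = np.array([0.5, -1.0, 2.0])           # [w_0, w_1, w_2]
x = np.array([1.5, 0.3])                 # [x_1, x_2]
y = neuron_output(w, x)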
Activation Function
◮ The activation function introduces nonlinearity
◮ We can use

  f(x) = sgn(x) = { +1 if x ≥ 0; −1 if x < 0 }

◮ Or we can use the sigmoid function

  f(x) = 2 / (1 + e^{−2x}) − 1,   f(x) ∈ (−1, 1)

  with derivative f'(x) = 1 − f^2(x)
◮ Or

  f(x) = 1 / (1 + e^{−x}),   f(x) ∈ (0, 1)

  with derivative f'(x) = f(x)[1 − f(x)] = e^{−x} / (1 + e^{−x})^2
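A small Python sketch of the three activation functions above together with the stated derivatives (expressed in terms of f(x), as on the slide); the function names are my own.

import numpy as np

def sign(x):
    return np.where(x >= 0, 1.0, -1.0)            # f(x) = sgn(x)

def sigmoid_symmetric(x):
    return 2.0 / (1.0 + np.exp(-2.0 * x)) - 1.0   # f(x) in (-1, 1); equal to tanh(x)

def sigmoid_symmetric_deriv(x):
    f = sigmoid_symmetric(x)
    return 1.0 - f**2                             # f'(x) = 1 - f(x)^2

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))               # f(x) in (0, 1)

def logistic_deriv(x):
    f = logistic(x)
    return f * (1.0 - f)                          # f'(x) = f(x)[1 - f(x)]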
Outline
Neuron Model
Multi-Layer Perceptron
Radial Basis Function Networks
Perceptron
[Figure: a two-layer perceptron with input layer x_1, . . . , x_d and output layer z_1, . . . , z_c]
◮ Two layers (input and linear output)
◮ Desired output t = [t_1, . . . , t_c]^T ∈ R^c
◮ Actual output z_i = w_i^T x, i = 1, . . . , c
◮ Learning (Widrow-Hoff)

  w_i(t+1) = w_i(t) + η(t_i − z_i)x = w_i(t) + η(t_i − w_i^T x)x

◮ It only works for linearly separable patterns
◮ It cannot even solve the simple XOR problem
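A minimal Python sketch of the Widrow-Hoff update above; the learning rate and number of epochs are placeholders, and the data layout (augmented inputs, one desired output vector per sample) is an assumption for illustration.

import numpy as np

def widrow_hoff(X, T, eta=0.1, epochs=100):
    # X: n x (d+1) augmented inputs [1, x_1, ..., x_d]; T: n x c desired outputs
    n, d1 = X.shape
    c = T.shape[1]
    W = np.zeros((c, d1))                      # one weight vector w_i per output node
    for _ in range(epochs):
        for x, t in zip(X, T):
            z = W @ x                          # actual outputs z_i = w_i^T x
            W += eta * np.outer(t - z, x)      # w_i <- w_i + eta (t_i - z_i) x
    return W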
Multi-layer Network
[Figure: a multi-layer network with input layer x_1, . . . , x_d, a hidden layer, and an output layer]
◮ Input layer x = [1, x_1, . . . , x_d]^T ∈ R^{d+1}
◮ Hidden layer y_j = f(w_j^T x), j = 1, . . . , n_H
◮ Output layer z_k = f(w_k^T y), k = 1, . . . , c
◮ Weight between hidden node y_j and input node x_i is w_ji
◮ Weight between output node z_k and hidden node y_j is w_kj
◮ May have multiple hidden layers
Discriminant Function
◮ For a 3-layer network, the discriminant function is

  g_k(x) = z_k = f( Σ_{j=1}^{n_H} w_kj f( Σ_{i=1}^{d} w_ji x_i + w_j0 ) + w_k0 )

◮ Kolmogorov showed that a 3-layer structure is enough to approximate any nonlinear function
◮ We expect a 3-layer MLP to produce any decision boundary
◮ Certainly, the nonlinearity depends on n_H, the number of hidden units
◮ A larger n_H results in overfitting
◮ A smaller n_H leads to underfitting
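A minimal Python sketch of the 3-layer forward pass (the discriminant function g_k above), assuming NumPy and with the bias terms w_j0 and w_k0 folded into the weight matrices as a leading column; the choice of tanh as the activation f is only for illustration.

import numpy as np

def mlp_forward(x, W_hidden, W_output, f=np.tanh):
    # W_hidden: n_H x (d+1) weights w_ji (column 0 holds w_j0)
    # W_output: c x (n_H+1) weights w_kj (column 0 holds w_k0)
    x_aug = np.concatenate(([1.0], x))
    y = f(W_hidden @ x_aug)                 # hidden layer: y_j = f(w_j^T x)
    y_aug = np.concatenate(([1.0], y))
    z = f(W_output @ y_aug)                 # output layer: z_k = g_k(x)
    return y, z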
Training the Network
◮ Desired output of the network t = [t_1, . . . , t_c]^T
◮ The objective in training the weights {w_j, w_k} is

  J(w) = (1/2) ||t − z||^2

◮ We need to find the best set of {w_j, w_k} that minimizes J
◮ It can be done through gradient-based optimization
◮ In a general form

  w(k+1) = w(k) − η ∂J/∂w

◮ To make it clear, let's do it component by component
Back-propagation (BP): output-hidden weights w_k
◮ w_kj is the weight between output node k and hidden node j

  ∂J/∂w_kj = (∂J/∂net_k)(∂net_k/∂w_kj)

◮ Define the sensitivity for a general node i as

  δ_i = −∂J/∂net_i

◮ In this case, for the output node k

  δ_k = −∂J/∂net_k = −(∂J/∂z_k)(∂z_k/∂net_k) = (t_k − z_k) f'(net_k)

◮ As net_k = Σ_{j=1}^{n_H} w_kj y_j, it is clear that

  ∂net_k/∂w_kj = y_j

◮ So we have Δw_kj = η δ_k y_j = η (t_k − z_k) f'(net_k) y_j
◮ This is a generalization of Widrow-Hoff
Back-propagation (BP): hidden-input weights w_j
◮ As before,

  ∂J/∂w_ji = (∂J/∂y_j)(∂y_j/∂net_j)(∂net_j/∂w_ji)

  where the last two factors are easy to compute
◮ The first factor is a little more complicated

  ∂J/∂y_j = ∂/∂y_j [ (1/2) Σ_{k=1}^{c} (t_k − z_k)^2 ]
          = −Σ_{k=1}^{c} (t_k − z_k) ∂z_k/∂y_j
          = −Σ_{k=1}^{c} (t_k − z_k) (∂z_k/∂net_k)(∂net_k/∂y_j)
          = −Σ_{k=1}^{c} (t_k − z_k) f'(net_k) w_kj
          = −Σ_{k=1}^{c} δ_k w_kj

◮ So we can compute the sensitivity for the hidden node j

  δ_j = −∂J/∂net_j = −(∂J/∂y_j)(∂y_j/∂net_j) = f'(net_j) Σ_{k=1}^{c} w_kj δ_k
Why is it Called “Back Propagation”?
[Figure: output nodes 1, 2, . . . , k, . . . , c feed their sensitivities back to hidden node j through the weights w_kj]
◮ The sensitivity δ_i reflects the information on node i
◮ δ_j of a hidden node j combines two sources of information:
  ◮ a linear combination of those from the output layer, Σ_{k=1}^{c} w_kj δ_k
  ◮ its local information f'(net_j)
◮ The learning rule for the hidden-input weights is

  Δw_ji = η δ_j x_i = η f'(net_j) [ Σ_{k=1}^{c} w_kj δ_k ] x_i
Algorithm: Back-propagation (BP)
Algorithm 1: Stochastic Back-propagation
  Init: n_H, w, stopping criterion θ, learning rate η, k = 0
  Do  k ← k + 1
      x^k ← a randomly picked training sample
      forward: compute y and then z
      backward: compute {δ_k} and then {δ_j}
      w_kj ← w_kj + η δ_k y_j
      w_ji ← w_ji + η δ_j x_i
  Until ||∇J(w)|| < θ
  Return w
◮ This is the one-sample BP training
◮ It can be easily extended to batch training
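A compact Python sketch of Algorithm 1, implementing the sensitivities δ_k and δ_j derived on the previous slides; it assumes NumPy, the tanh-like sigmoid activation from the activation-function slide, biases folded into the weight matrices, and a fixed number of epochs instead of the ||∇J(w)|| < θ stopping test (a simplification of my own).

import numpy as np

def f(net):              # f(x) = 2/(1+exp(-2x)) - 1, i.e. tanh
    return np.tanh(net)

def f_prime(net):        # f'(x) = 1 - f(x)^2
    return 1.0 - np.tanh(net)**2

def train_bp(X, T, n_hidden, eta=0.01, epochs=1000, rng=np.random.default_rng(0)):
    # X: n x d inputs; T: n x c desired outputs
    n, d = X.shape
    c = T.shape[1]
    W_h = rng.normal(scale=0.1, size=(n_hidden, d + 1))    # w_ji (column 0 = bias)
    W_o = rng.normal(scale=0.1, size=(c, n_hidden + 1))    # w_kj (column 0 = bias)
    for _ in range(epochs):
        i = rng.integers(n)                        # x^k <- randomly picked sample
        x = np.concatenate(([1.0], X[i]))
        t = T[i]
        # forward: compute y and then z
        net_j = W_h @ x
        y = np.concatenate(([1.0], f(net_j)))
        net_k = W_o @ y
        z = f(net_k)
        # backward: compute {delta_k} and then {delta_j}
        delta_k = (t - z) * f_prime(net_k)                   # delta_k = (t_k - z_k) f'(net_k)
        delta_j = f_prime(net_j) * (W_o[:, 1:].T @ delta_k)  # delta_j = f'(net_j) sum_k w_kj delta_k
        # weight updates
        W_o += eta * np.outer(delta_k, y)          # w_kj <- w_kj + eta delta_k y_j
        W_h += eta * np.outer(delta_j, x)          # w_ji <- w_ji + eta delta_j x_i
    return W_h, W_o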
Bayes Discriminant and MLP
◮ For linear discriminant models, we know that the MSE and MMSE solutions approximate the Bayes discriminant asymptotically
◮ An MLP can do better by approximating the posteriors
◮ Suppose we have c classes and the desired output is t_k(x) = 1 if x ∈ ω_k, and 0 otherwise
◮ The MLP criterion

  J(w) = Σ_x [g_k(x; w) − t_k]^2 = Σ_{x∈ω_k} [g_k(x; w) − 1]^2 + Σ_{x∉ω_k} [g_k(x; w) − 0]^2

◮ It can be shown that minimizing lim_{n→∞} J(w) is equivalent to minimizing

  ∫ [g_k(x; w) − P(ω_k|x)]^2 p(x) dx

◮ This means the output units represent the posteriors

  g_k(x; w) ≃ P(ω_k|x)
Outputs as Probabilities
◮ If we want the outputs of the MLP to be posteriors
◮ The desired outputs in training should be in [0, 1]
◮ As we have a limited number of training samples, the outputs may not sum to 1
◮ We can use a different activation function for the output layer
◮ Softmax activation

  z_k = e^{net_k} / Σ_{m=1}^{c} e^{net_m}
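A one-function Python sketch of the softmax activation above; the max-subtraction for numerical stability is my addition, not part of the slide, and does not change the result.

import numpy as np

def softmax(net):
    # z_k = exp(net_k) / sum_m exp(net_m); shifting by max(net) avoids overflow
    e = np.exp(net - np.max(net))
    return e / e.sum()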
Practice: Number of Hidden Units
◮ The number of hidden nodes is the most critical parameter in an MLP
◮ It determines the expressive power of the network and the complexity of the decision boundary
◮ A smaller number leads to a simpler boundary, and a larger number can produce a very complicated one
◮ Overfitting and generalizability
◮ Unfortunately, there is no foolproof method to choose this parameter
◮ Many heuristics have been proposed
Practice: Learning Rates
◮ Another critical parameter in MLP is the learning rate η
◮ In principle, if η is small enough, the iteration converges
◮ But very slowly
◮ To speed it up, we need to use 2nd-order gradient information, e.g., Newton's method, in training
Practice: Plateaus and Momentum
◮ Error surfaces often have plateaus where ∂J(w)/∂w is very small
◮ Then the iteration can hardly move on
◮ Introduce momentum to push it
◮ Momentum uses the weight change at the previous iteration

  w(k+1) = w(k) + (1 − α) Δw_bp(k) + α Δw(k − 1)
Algorithm 2: Stochastic Back-propagation with Momentum
  Init: n_H, w, stopping criterion θ, η, α, b = 0, k = 0
  Do  k ← k + 1
      x^k ← a randomly picked training sample
      b_kj ← η(1 − α) δ_k y_j + α b_kj
      b_ji ← η(1 − α) δ_j x_i + α b_ji
      w_kj ← w_kj + b_kj
      w_ji ← w_ji + b_ji
  Until ||∇J(w)|| < θ
  Return w
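A Python sketch of Algorithm 2 under the same assumptions as the plain BP sketch above (tanh activation, biases folded into the weights, fixed number of epochs); the momentum buffers b correspond to b_kj and b_ji, and the value of α is a placeholder.

import numpy as np

def train_bp_momentum(X, T, n_hidden, eta=0.01, alpha=0.9, epochs=1000,
                      f=np.tanh, f_prime=lambda v: 1.0 - np.tanh(v)**2,
                      rng=np.random.default_rng(0)):
    n, d = X.shape
    c = T.shape[1]
    W_h = rng.normal(scale=0.1, size=(n_hidden, d + 1))
    W_o = rng.normal(scale=0.1, size=(c, n_hidden + 1))
    b_h = np.zeros_like(W_h)      # momentum buffer for w_ji
    b_o = np.zeros_like(W_o)      # momentum buffer for w_kj
    for _ in range(epochs):
        i = rng.integers(n)
        x = np.concatenate(([1.0], X[i]))
        net_j = W_h @ x
        y = np.concatenate(([1.0], f(net_j)))
        net_k = W_o @ y
        z = f(net_k)
        delta_k = (T[i] - z) * f_prime(net_k)
        delta_j = f_prime(net_j) * (W_o[:, 1:].T @ delta_k)
        # blend the current BP step with the previous step
        b_o = eta * (1 - alpha) * np.outer(delta_k, y) + alpha * b_o   # b_kj
        b_h = eta * (1 - alpha) * np.outer(delta_j, x) + alpha * b_h   # b_ji
        W_o += b_o                                                     # w_kj <- w_kj + b_kj
        W_h += b_h                                                     # w_ji <- w_ji + b_ji
    return W_h, W_o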
Outline
Neuron Model
Multi-Layer Perceptron
Radial Basis Function Networks
Radial Basis Function Network
[Figure: an RBF network with input layer x_1, . . . , x_d, a hidden layer of kernel units K, and output weights w_kj]
◮ The input-hidden weights are all 1
◮ The activation function for the hidden units is the Radial Basis Function (RBF), e.g.,

  K(||x − x_c||) = exp{ −||x − x_c||^2 / (2σ^2) }

◮ The output

  z_k(x) = Σ_{j=0}^{n_H} w_kj K(x, x_j)
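A Python sketch of the RBF network output above with the Gaussian kernel; the centers, output weights, and σ are placeholders, and the j = 0 term is treated as a constant bias unit.

import numpy as np

def rbf_forward(x, centers, W, sigma=1.0):
    # centers: n_H x d basis centers x_j; W: c x (n_H+1) output weights w_kj
    # K(||x - x_j||) = exp(-||x - x_j||^2 / (2 sigma^2))
    k = np.exp(-np.sum((centers - x)**2, axis=1) / (2.0 * sigma**2))
    k_aug = np.concatenate(([1.0], k))       # j = 0 term acts as a bias unit
    return W @ k_aug                         # z_k(x) = sum_j w_kj K(x, x_j)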
Interpretation
◮ It can be treated as a function approximation, i.e., a linear combination of a set of bases
◮ The hidden units transform the original feature space into another (high-dimensional) feature space by using the kernel
◮ We hope the data become linearly separable in the new feature space
◮ This is what we did in the Kernel Machines!
Learning
◮ Parameters in the RBF network:
  ◮ The basis center x_j for each hidden node
  ◮ The variance σ of the RBF
  ◮ The weights W
◮ Once the RBF parameters are set, W can be found by the pseudo-inverse or Widrow-Hoff
◮ Finding the RBF parameters is not easy
  ◮ Uniformly select the centers
  ◮ Use the data cluster centers as the x_j
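Once the centers and σ are fixed, solving for W is a linear least-squares problem; a Python sketch of the pseudo-inverse solution mentioned above (the variable names are my own).

import numpy as np

def rbf_weights_pinv(X, T, centers, sigma=1.0):
    # Least-squares fit of the output weights: Phi @ W^T ~= T,
    # where Phi[i, j] = K(||x_i - center_j||) plus a bias column for j = 0
    d2 = np.sum((X[:, None, :] - centers[None, :, :])**2, axis=2)   # squared distances
    Phi = np.exp(-d2 / (2.0 * sigma**2))                            # n x n_H kernel matrix
    Phi = np.hstack([np.ones((X.shape[0], 1)), Phi])                # add the bias column
    W = (np.linalg.pinv(Phi) @ T).T                                 # W is c x (n_H + 1)
    return W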