Feed-forward Neural Networks
Ying Wu
Electrical Engineering and Computer Science
Northwestern University
Evanston, IL 60208
http://www.eecs.northwestern.edu/~yingwu
Connectionism
◮ How does our brain process information?
◮ Are we Turing Machines?
◮ Things that are difficult for Turing Machines
◮ Perception is difficult for Turing Machines
◮ We have so many neurons
◮ How do they work?
◮ Can we have computational models?
◮ Connectionism vs. Computationalism
History
◮ ր in the 1940s
  ◮ The model for the neuron (McCulloch & Pitts, 1943)
  ◮ The Hebbian learning rule (Hebb, 1949)
◮ ր in the 1950s
  ◮ Perceptron (Rosenblatt, 1950s)
◮ ց in the 1960s
  ◮ Limitation of the Perceptron (Minsky & Papert, 1969)
  ◮ Expert systems were so hot by then
◮ ր again in the 1980s
  ◮ Hopfield feed-back network (Hopfield, 1982)
  ◮ Back-propagation algorithm (Rumelhart & Le Cun, 1986)
  ◮ Expert systems ⇓
◮ ց again in the 1990s
  ◮ Overfitting in neural networks
  ◮ SVM was so hot (Vapnik, 1995)
◮ Where to go?
Outline
Neuron Model
Multi-Layer Perceptron
Radial Basis Function Networks
Neuron: the Basic Unit
[Figure: a single neuron with inputs x_1, . . . , x_d, weights w_1, . . . , w_d, and output y]
◮ Input x = [1, x_1, . . . , x_d]^T ∈ R^{d+1}
◮ Connection weights (i.e., synapses) w = [w_0, . . . , w_d]^T ∈ R^{d+1}
◮ Net activation:

  net = Σ_{i=0}^{d} w_i x_i = w^T x

◮ Activation function and output

  y = f(net) = f(w^T x)
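As a minimal illustration of the neuron model above, the following Python sketch (assuming NumPy; the weights and inputs are arbitrary numbers of my own) computes the net activation and passes it through an activation function.

import numpy as np

def neuron_output(w, x, f=np.tanh):
    # Augment the input with a leading 1 so that w[0] plays the role of the bias w_0
    x_aug = np.concatenate(([1.0], x))   # x = [1, x_1, ..., x_d]^T
    net = w @ x_aug                      # net = sum_{i=0}^{d} w_i x_i = w^T x
    return f(net)                        # y = f(net)

# Illustration with arbitrary numbers
w = np.array([0.5, -1.0, 2.0])           # [w_0, w_1, w_2]
x = np.array([1.5, 0.3])                 # [x_1, x_2]
y = neuron_output(w, x)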
Activation Function
◮ The activation function introduces nonlinearity
◮ We can use

  f(x) = sgn(x) = { +1 if x ≥ 0; −1 if x < 0 }

◮ Or we can use the sigmoid function

  f(x) = 2 / (1 + e^{−2x}) − 1,   f(x) ∈ (−1, 1)

  with derivative f'(x) = 1 − f^2(x)
◮ Or

  f(x) = 1 / (1 + e^{−x}),   f(x) ∈ (0, 1)

  with derivative f'(x) = f(x)[1 − f(x)] = e^{−x} / (1 + e^{−x})^2
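A small Python sketch of the three activation functions above together with the stated derivatives (expressed in terms of f(x), as on the slide); the function names are my own.

import numpy as np

def sign(x):
    return np.where(x >= 0, 1.0, -1.0)            # f(x) = sgn(x)

def sigmoid_symmetric(x):
    return 2.0 / (1.0 + np.exp(-2.0 * x)) - 1.0   # f(x) in (-1, 1); equal to tanh(x)

def sigmoid_symmetric_deriv(x):
    f = sigmoid_symmetric(x)
    return 1.0 - f**2                             # f'(x) = 1 - f(x)^2

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))               # f(x) in (0, 1)

def logistic_deriv(x):
    f = logistic(x)
    return f * (1.0 - f)                          # f'(x) = f(x)[1 - f(x)]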
Outline
Neuron Model
Multi-Layer Perceptron
Radial Basis Function Networks
Perceptron
[Figure: a two-layer perceptron with input layer x_1, . . . , x_d and output layer z_1, . . . , z_c]
◮ Two layers (input and linear output)
◮ Desired output t = [t_1, . . . , t_c]^T ∈ R^c
◮ Actual output z_i = w_i^T x, i = 1, . . . , c
◮ Learning (Widrow-Hoff)

  w_i(t+1) = w_i(t) + η(t_i − z_i)x = w_i(t) + η(t_i − w_i^T x)x

◮ It only works for linearly separable patterns
◮ It cannot even solve the simple XOR problem
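A minimal Python sketch of the Widrow-Hoff update above; the learning rate and number of epochs are placeholders, and the data layout (augmented inputs, one desired output vector per sample) is an assumption for illustration.

import numpy as np

def widrow_hoff(X, T, eta=0.1, epochs=100):
    # X: n x (d+1) augmented inputs [1, x_1, ..., x_d]; T: n x c desired outputs
    n, d1 = X.shape
    c = T.shape[1]
    W = np.zeros((c, d1))                      # one weight vector w_i per output node
    for _ in range(epochs):
        for x, t in zip(X, T):
            z = W @ x                          # actual outputs z_i = w_i^T x
            W += eta * np.outer(t - z, x)      # w_i <- w_i + eta (t_i - z_i) x
    return W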
Multi-layer Network
[Figure: a multi-layer network with input layer x_1, . . . , x_d, a hidden layer, and an output layer]
◮ Input layer x = [1, x_1, . . . , x_d]^T ∈ R^{d+1}
◮ Hidden layer y_j = f(w_j^T x), j = 1, . . . , n_H
◮ Output layer z_k = f(w_k^T y), k = 1, . . . , c
◮ Weight between hidden node y_j and input node x_i is w_ji
◮ Weight between output node z_k and hidden node y_j is w_kj
◮ May have multiple hidden layers
Discriminant Function
◮ For a 3-layer network, the discriminant function is

  g_k(x) = z_k = f( Σ_{j=1}^{n_H} w_kj f( Σ_{i=1}^{d} w_ji x_i + w_j0 ) + w_k0 )

◮ Kolmogorov showed that a 3-layer structure is enough to approximate any nonlinear function
◮ We expect a 3-layer MLP to produce any decision boundary
◮ Certainly, the nonlinearity depends on n_H, the number of hidden units
◮ A larger n_H results in overfitting
◮ A smaller n_H leads to underfitting
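A minimal Python sketch of the 3-layer forward pass (the discriminant function g_k above), assuming NumPy and with the bias terms w_j0 and w_k0 folded into the weight matrices as a leading column; the choice of tanh as the activation f is only for illustration.

import numpy as np

def mlp_forward(x, W_hidden, W_output, f=np.tanh):
    # W_hidden: n_H x (d+1) weights w_ji (column 0 holds w_j0)
    # W_output: c x (n_H+1) weights w_kj (column 0 holds w_k0)
    x_aug = np.concatenate(([1.0], x))
    y = f(W_hidden @ x_aug)                 # hidden layer: y_j = f(w_j^T x)
    y_aug = np.concatenate(([1.0], y))
    z = f(W_output @ y_aug)                 # output layer: z_k = g_k(x)
    return y, z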
Training the Network
◮ Desired output of the network t = [t_1, . . . , t_c]^T
◮ The objective in training the weights {w_j, w_k} is

  J(w) = (1/2) ||t − z||^2

◮ We need to find the best set of {w_j, w_k} that minimizes J
◮ It can be done through gradient-based optimization
◮ In a general form

  w(k+1) = w(k) − η ∂J/∂w

◮ To make it clear, let's do it component by component
Back-propagation (BP): output-hidden weights w_k
◮ w_kj is the weight between output node k and hidden node j

  ∂J/∂w_kj = (∂J/∂net_k)(∂net_k/∂w_kj)

◮ Define the sensitivity for a general node i as

  δ_i = −∂J/∂net_i

◮ In this case, for the output node k

  δ_k = −∂J/∂net_k = −(∂J/∂z_k)(∂z_k/∂net_k) = (t_k − z_k) f'(net_k)

◮ As net_k = Σ_{j=1}^{n_H} w_kj y_j, it is clear that

  ∂net_k/∂w_kj = y_j

◮ So we have Δw_kj = η δ_k y_j = η (t_k − z_k) f'(net_k) y_j
◮ This is a generalization of Widrow-Hoff
Back-propagation (BP): hidden-input weights w_j
◮ As before,

  ∂J/∂w_ji = (∂J/∂y_j)(∂y_j/∂net_j)(∂net_j/∂w_ji)

  where the last two factors are easy to compute
◮ The first factor is a little more complicated

  ∂J/∂y_j = ∂/∂y_j [ (1/2) Σ_{k=1}^{c} (t_k − z_k)^2 ]
          = −Σ_{k=1}^{c} (t_k − z_k) ∂z_k/∂y_j
          = −Σ_{k=1}^{c} (t_k − z_k) (∂z_k/∂net_k)(∂net_k/∂y_j)
          = −Σ_{k=1}^{c} (t_k − z_k) f'(net_k) w_kj
          = −Σ_{k=1}^{c} δ_k w_kj

◮ So we can compute the sensitivity for the hidden node j

  δ_j = −∂J/∂net_j = −(∂J/∂y_j)(∂y_j/∂net_j) = f'(net_j) Σ_{k=1}^{c} w_kj δ_k
Why is it Called “Back Propagation”?
[Figure: output nodes 1, 2, . . . , k, . . . , c feed their sensitivities back to hidden node j through the weights w_kj]
◮ The sensitivity δ_i reflects the information on node i
◮ δ_j of a hidden node j combines two sources of information:
  ◮ a linear combination of those from the output layer, Σ_{k=1}^{c} w_kj δ_k
  ◮ its local information f'(net_j)
◮ The learning rule for the hidden-input weights is

  Δw_ji = η δ_j x_i = η f'(net_j) [ Σ_{k=1}^{c} w_kj δ_k ] x_i
Algorithm: Back-propagation (BP)
Algorithm 1: Stochastic Back-propagation
  Init: n_H, w, stopping criterion θ, learning rate η, k = 0
  Do  k ← k + 1
      x^k ← a randomly picked training sample
      forward: compute y and then z
      backward: compute {δ_k} and then {δ_j}
      w_kj ← w_kj + η δ_k y_j
      w_ji ← w_ji + η δ_j x_i
  Until ||∇J(w)|| < θ
  Return w
◮ This is the one-sample BP training
◮ It can be easily extended to batch training
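A compact Python sketch of Algorithm 1, implementing the sensitivities δ_k and δ_j derived on the previous slides; it assumes NumPy, the tanh-like sigmoid activation from the activation-function slide, biases folded into the weight matrices, and a fixed number of epochs instead of the ||∇J(w)|| < θ stopping test (a simplification of my own).

import numpy as np

def f(net):              # f(x) = 2/(1+exp(-2x)) - 1, i.e. tanh
    return np.tanh(net)

def f_prime(net):        # f'(x) = 1 - f(x)^2
    return 1.0 - np.tanh(net)**2

def train_bp(X, T, n_hidden, eta=0.01, epochs=1000, rng=np.random.default_rng(0)):
    # X: n x d inputs; T: n x c desired outputs
    n, d = X.shape
    c = T.shape[1]
    W_h = rng.normal(scale=0.1, size=(n_hidden, d + 1))    # w_ji (column 0 = bias)
    W_o = rng.normal(scale=0.1, size=(c, n_hidden + 1))    # w_kj (column 0 = bias)
    for _ in range(epochs):
        i = rng.integers(n)                        # x^k <- randomly picked sample
        x = np.concatenate(([1.0], X[i]))
        t = T[i]
        # forward: compute y and then z
        net_j = W_h @ x
        y = np.concatenate(([1.0], f(net_j)))
        net_k = W_o @ y
        z = f(net_k)
        # backward: compute {delta_k} and then {delta_j}
        delta_k = (t - z) * f_prime(net_k)                   # delta_k = (t_k - z_k) f'(net_k)
        delta_j = f_prime(net_j) * (W_o[:, 1:].T @ delta_k)  # delta_j = f'(net_j) sum_k w_kj delta_k
        # weight updates
        W_o += eta * np.outer(delta_k, y)          # w_kj <- w_kj + eta delta_k y_j
        W_h += eta * np.outer(delta_j, x)          # w_ji <- w_ji + eta delta_j x_i
    return W_h, W_o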
Bayes Discriminant and MLP
◮ For linear discriminant models, we know that the MSE and MMSE solutions approximate the Bayes discriminant asymptotically
◮ An MLP can do better by approximating the posteriors
◮ Suppose we have c classes and the desired output is t_k(x) = 1 if x ∈ ω_k, and 0 otherwise
◮ The MLP criterion

  J(w) = Σ_x [g_k(x; w) − t_k]^2 = Σ_{x∈ω_k} [g_k(x; w) − 1]^2 + Σ_{x∉ω_k} [g_k(x; w) − 0]^2

◮ It can be shown that minimizing lim_{n→∞} J(w) is equivalent to minimizing

  ∫ [g_k(x; w) − P(ω_k|x)]^2 p(x) dx

◮ This means the output units represent the posteriors

  g_k(x; w) ≃ P(ω_k|x)
Outputs as Probabilities
◮ If we want the outputs of the MLP to be posteriors
◮ The desired outputs in training should be in [0, 1]
◮ As we have a limited number of training samples, the outputs may not sum to 1
◮ We can use a different activation function for the output layer
◮ Softmax activation

  z_k = e^{net_k} / Σ_{m=1}^{c} e^{net_m}
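A one-function Python sketch of the softmax activation above; the max-subtraction for numerical stability is my addition, not part of the slide, and does not change the result.

import numpy as np

def softmax(net):
    # z_k = exp(net_k) / sum_m exp(net_m); shifting by max(net) avoids overflow
    e = np.exp(net - np.max(net))
    return e / e.sum()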
Practice: Number of Hidden Units
◮ The number of hidden nodes is the most critical parameter in an MLP
◮ It determines the expressive power of the network and the complexity of the decision boundary
◮ A smaller number leads to a simpler boundary, and a larger number can produce a very complicated one
◮ Overfitting and generalizability
◮ Unfortunately, there is no foolproof method to choose this parameter
◮ Many heuristics have been proposed
Practice: Learning Rates
◮ Another critical parameter in MLP is the learning rate η
◮ In principle, if η is small enough, the iteration converges
◮ But very slowly
◮ To speed it up, we need to use 2nd-order gradient information, e.g., Newton's method, in training
Practice: Plateaus and Momentum
◮ Error surfaces often have plateaus where ∂J(w)/∂w is very small
◮ Then the iteration can hardly move on
◮ Introduce momentum to push it
◮ Momentum uses the weight change at the previous iteration

  w(k+1) = w(k) + (1 − α) Δw_bp(k) + α Δw(k − 1)
Algorithm 2: Stochastic Back-propagation with Momentum
  Init: n_H, w, stopping criterion θ, η, α, b = 0, k = 0
  Do  k ← k + 1
      x^k ← a randomly picked training sample
      b_kj ← η(1 − α) δ_k y_j + α b_kj
      b_ji ← η(1 − α) δ_j x_i + α b_ji
      w_kj ← w_kj + b_kj
      w_ji ← w_ji + b_ji
  Until ||∇J(w)|| < θ
  Return w
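A Python sketch of Algorithm 2 under the same assumptions as the plain BP sketch above (tanh activation, biases folded into the weights, fixed number of epochs); the momentum buffers b correspond to b_kj and b_ji, and the value of α is a placeholder.

import numpy as np

def train_bp_momentum(X, T, n_hidden, eta=0.01, alpha=0.9, epochs=1000,
                      f=np.tanh, f_prime=lambda v: 1.0 - np.tanh(v)**2,
                      rng=np.random.default_rng(0)):
    n, d = X.shape
    c = T.shape[1]
    W_h = rng.normal(scale=0.1, size=(n_hidden, d + 1))
    W_o = rng.normal(scale=0.1, size=(c, n_hidden + 1))
    b_h = np.zeros_like(W_h)      # momentum buffer for w_ji
    b_o = np.zeros_like(W_o)      # momentum buffer for w_kj
    for _ in range(epochs):
        i = rng.integers(n)
        x = np.concatenate(([1.0], X[i]))
        net_j = W_h @ x
        y = np.concatenate(([1.0], f(net_j)))
        net_k = W_o @ y
        z = f(net_k)
        delta_k = (T[i] - z) * f_prime(net_k)
        delta_j = f_prime(net_j) * (W_o[:, 1:].T @ delta_k)
        # blend the current BP step with the previous step
        b_o = eta * (1 - alpha) * np.outer(delta_k, y) + alpha * b_o   # b_kj
        b_h = eta * (1 - alpha) * np.outer(delta_j, x) + alpha * b_h   # b_ji
        W_o += b_o                                                     # w_kj <- w_kj + b_kj
        W_h += b_h                                                     # w_ji <- w_ji + b_ji
    return W_h, W_o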
Outline
Neuron Model
Multi-Layer Perceptron
Radial Basis Function Networks
Radial Basis Function Network
[Figure: an RBF network with input layer x_1, . . . , x_d, a hidden layer of kernel units K, and output weights w_kj]
◮ The input-hidden weights are all 1
◮ The activation function for the hidden units is the Radial Basis Function (RBF), e.g.,

  K(||x − x_c||) = exp{ −||x − x_c||^2 / (2σ^2) }

◮ The output

  z_k(x) = Σ_{j=0}^{n_H} w_kj K(x, x_j)
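A Python sketch of the RBF network output above with the Gaussian kernel; the centers, output weights, and σ are placeholders, and the j = 0 term is treated as a constant bias unit.

import numpy as np

def rbf_forward(x, centers, W, sigma=1.0):
    # centers: n_H x d basis centers x_j; W: c x (n_H+1) output weights w_kj
    # K(||x - x_j||) = exp(-||x - x_j||^2 / (2 sigma^2))
    k = np.exp(-np.sum((centers - x)**2, axis=1) / (2.0 * sigma**2))
    k_aug = np.concatenate(([1.0], k))       # j = 0 term acts as a bias unit
    return W @ k_aug                         # z_k(x) = sum_j w_kj K(x, x_j)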
Interpretation
◮ It can be treated as a function approximation, i.e., a linear combination of a set of bases
◮ The hidden units transform the original feature space into another (high-dimensional) feature space by using the kernel
◮ We hope the data become linearly separable in the new feature space
◮ This is what we did in the Kernel Machines!
Learning
◮ Parameters in the RBF network:
  ◮ The basis center x_j for each hidden node
  ◮ The variance σ of the RBF
  ◮ The weights W
◮ Once the RBF parameters are set, W can be found by the pseudo-inverse or Widrow-Hoff
◮ Finding the RBF parameters is not easy
  ◮ Uniformly select the centers
  ◮ Use the data cluster centers as the x_j
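Once the centers and σ are fixed, solving for W is a linear least-squares problem; a Python sketch of the pseudo-inverse solution mentioned above (the variable names are my own).

import numpy as np

def rbf_weights_pinv(X, T, centers, sigma=1.0):
    # Least-squares fit of the output weights: Phi @ W^T ~= T,
    # where Phi[i, j] = K(||x_i - center_j||) plus a bias column for j = 0
    d2 = np.sum((X[:, None, :] - centers[None, :, :])**2, axis=2)   # squared distances
    Phi = np.exp(-d2 / (2.0 * sigma**2))                            # n x n_H kernel matrix
    Phi = np.hstack([np.ones((X.shape[0], 1)), Phi])                # add the bias column
    W = (np.linalg.pinv(Phi) @ T).T                                 # W is c x (n_H + 1)
    return W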