Artificial Neural Networks
• Threshold units
• Gradient descent
• Multilayer networks
• Backpropagation
• Hidden layer representations
• Example: Face recognition
• Advanced topics
Connectionist Models
Consider humans:
• Neuron switching time ~ 0.001 second
• Number of neurons ~ 10^10
• Connections per neuron ~ 10^4 to 10^5
• Scene recognition time ~ 0.1 second
• 100 inference steps (0.1 s of recognition at ~1 ms per neuron switch) does not seem like enough
→ must use lots of parallel computation!

Properties of artificial neural nets (ANNs):
• Many neuron-like threshold switching units
• Many weighted interconnections among units
• Highly parallel, distributed process
• Emphasis on tuning weights automatically
When to Consider Neural Networks
• Input is high-dimensional discrete or real-valued (e.g., raw sensor input)
• Output is discrete or real-valued
• Output is a vector of values
• Possibly noisy data
• Form of target function is unknown
• Human readability of result is unimportant
Represents some useful functions
• What weights represent g(x1, x2) = AND(x1, x2)? (one possible setting is worked out below)
But some functions are not representable
• e.g., not linearly separable
• therefore, we will want networks of these ...
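One possible answer (a worked example, not the only valid weights): with a threshold unit that outputs 1 if $w_0 + w_1 x_1 + w_2 x_2 > 0$ and $-1$ otherwise, take $w_0 = -0.8$, $w_1 = w_2 = 0.5$. Then input $(1, 1)$ gives $-0.8 + 0.5 + 0.5 = 0.2 > 0$ (output 1), while $(0, 0)$, $(0, 1)$ and $(1, 0)$ give $-0.8$ or $-0.3$ (output $-1$), so the unit computes AND.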
Perceptron Training Rule
$$w_i \leftarrow w_i + \Delta w_i$$
where
$$\Delta w_i = \eta\,(t - o)\,x_i$$
• $t = c(\vec{x})$ is the target value
• $o$ is the perceptron output
• $\eta$ is a small constant (e.g., 0.1) called the learning rate

Can prove it will converge
• if the training data is linearly separable
• and $\eta$ is sufficiently small
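A minimal runnable sketch of this rule (assuming targets in {−1, +1}, a bias handled as an extra input x0 = 1, and illustrative names such as `perceptron_update` that are not from the slides):

```python
import numpy as np

def perceptron_output(w, x):
    """Threshold unit: +1 if w . x > 0, else -1 (x includes the bias input x0 = 1)."""
    return 1 if np.dot(w, x) > 0 else -1

def perceptron_update(w, x, t, eta=0.1):
    """One application of the perceptron training rule: w_i <- w_i + eta * (t - o) * x_i."""
    o = perceptron_output(w, x)
    return w + eta * (t - o) * x

# Toy usage: learn AND of two inputs (a linearly separable problem).
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
T = np.array([-1, -1, -1, 1])
w = np.zeros(3)
for _ in range(20):                  # a few passes over the data are enough here
    for x, t in zip(X, T):
        w = perceptron_update(w, x, t)
print(w, [perceptron_output(w, x) for x in X])   # predictions should match T
```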
Gradient Descent
To understand, consider a simpler linear unit, where
$$o = w_0 + w_1 x_1 + \cdots + w_n x_n$$
Idea: learn $w_i$'s that minimize the squared error
$$E[\vec{w}] \equiv \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$$
where $D$ is the set of training examples.
Gradient Descent
Gradient:
$$\nabla E[\vec{w}] \equiv \left[ \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \ldots, \frac{\partial E}{\partial w_n} \right]$$
Training rule:
$$\Delta \vec{w} = -\eta \nabla E[\vec{w}]$$
i.e.,
$$\Delta w_i = -\eta \frac{\partial E}{\partial w_i}$$
Gradient Descent
$$\frac{\partial E}{\partial w_i} = \frac{\partial}{\partial w_i} \frac{1}{2} \sum_d (t_d - o_d)^2$$
$$= \frac{1}{2} \sum_d \frac{\partial}{\partial w_i} (t_d - o_d)^2$$
$$= \frac{1}{2} \sum_d 2\,(t_d - o_d)\,\frac{\partial}{\partial w_i} (t_d - o_d)$$
$$= \sum_d (t_d - o_d)\,\frac{\partial}{\partial w_i} \left( t_d - \vec{w} \cdot \vec{x}_d \right)$$
$$\frac{\partial E}{\partial w_i} = \sum_d (t_d - o_d)\,(-x_{i,d})$$
Gradient Descent
GRADIENT-DESCENT(training_examples, η)
Each training example is a pair of the form ⟨x, t⟩, where x is the vector of input values and t is the target output value. η is the learning rate (e.g., 0.05).
• Initialize each w_i to some small random value
• Until the termination condition is met, do
  - Initialize each Δw_i to zero.
  - For each ⟨x, t⟩ in training_examples, do
    * Input the instance x to the unit and compute the output o
    * For each linear unit weight w_i, do
      Δw_i ← Δw_i + η(t − o)x_i
  - For each linear unit weight w_i, do
    w_i ← w_i + Δw_i
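A runnable Python sketch of this procedure for a linear unit (the function name, the fixed epoch count used as the termination condition, and the toy data are illustrative assumptions, not from the slides):

```python
import numpy as np

def gradient_descent(training_examples, eta=0.05, epochs=500):
    """Batch gradient descent for a linear unit o = w . x (x includes the bias input x0 = 1)."""
    n = len(training_examples[0][0])
    w = np.random.uniform(-0.05, 0.05, size=n)   # small random initial weights
    for _ in range(epochs):                      # termination condition: fixed number of passes
        delta_w = np.zeros(n)                    # initialize each delta_w_i to zero
        for x, t in training_examples:
            o = np.dot(w, x)                     # compute the linear unit's output
            delta_w += eta * (t - o) * x         # accumulate eta * (t - o) * x_i
        w += delta_w                             # apply the accumulated update once per pass
    return w

# Toy usage: recover the linear target t = 1 + 2*x1 - 3*x2 from noiseless examples.
rng = np.random.default_rng(0)
examples = [(np.array([1.0, x1, x2]), 1 + 2 * x1 - 3 * x2)
            for x1, x2 in rng.uniform(-1, 1, size=(20, 2))]
print(gradient_descent(examples))   # weights should end up close to [1, 2, -3]
```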
Summary
Perceptron training rule guaranteed to succeed if
• Training examples are linearly separable
• Sufficiently small learning rate η

Linear unit training rule uses gradient descent
• Guaranteed to converge to hypothesis with minimum squared error
• Given sufficiently small learning rate η
• Even when training data contains noise
• Even when training data not separable by H
Incremental (Stochastic) Gradient Descent
Batch mode Gradient Descent:
Do until satisfied:
1. Compute the gradient $\nabla E_D[\vec{w}]$
2. $\vec{w} \leftarrow \vec{w} - \eta \nabla E_D[\vec{w}]$
where
$$E_D[\vec{w}] \equiv \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$$

Incremental mode Gradient Descent:
Do until satisfied:
- For each training example d in D
  1. Compute the gradient $\nabla E_d[\vec{w}]$
  2. $\vec{w} \leftarrow \vec{w} - \eta \nabla E_d[\vec{w}]$
where
$$E_d[\vec{w}] \equiv \frac{1}{2} (t_d - o_d)^2$$

Incremental Gradient Descent can approximate Batch Gradient Descent arbitrarily closely if η is made small enough.
Multilayer Networks of Sigmoid Units
[Figure: network with inputs x1 and x2, hidden units h1 and h2, and output o; regions labeled d0–d3]
Multilayer Decision Space
Sigmoid Unit
[Figure: sigmoid unit with inputs x_0 = 1, x_1, ..., x_n, weights w_0, ..., w_n, a summing node Σ, and a sigmoid output]
$$net = \sum_{i=0}^{n} w_i x_i$$
$$o = \sigma(net) = \frac{1}{1 + e^{-net}}$$
$\sigma(x) = \frac{1}{1 + e^{-x}}$ is the sigmoid function
Nice property: $\frac{d\sigma(x)}{dx} = \sigma(x)\,(1 - \sigma(x))$
We can derive gradient descent rules to train
• One sigmoid unit
• Multilayer networks of sigmoid units → Backpropagation
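The "nice property" is a short calculus check: since $1 - \sigma(x) = \frac{e^{-x}}{1 + e^{-x}}$,
$$\frac{d\sigma(x)}{dx} = \frac{d}{dx}\left(1 + e^{-x}\right)^{-1} = \frac{e^{-x}}{\left(1 + e^{-x}\right)^2} = \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}} = \sigma(x)\,\big(1 - \sigma(x)\big),$$
which lets the backpropagation derivation express the derivative of a unit's output purely in terms of the output itself.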
Convergence of Backpropagation
Gradient descent to some local minimum
• Perhaps not global minimum
• Momentum can cause quicker convergence (a common form of the momentum term is sketched below)
• Stochastic gradient descent also results in faster convergence
• Can train multiple networks and get different results (using different initial weights)

Nature of convergence
• Initialize weights near zero
• Therefore, initial networks near-linear
• Increasingly non-linear functions as training progresses
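As a reference for the momentum remark above (a standard textbook formulation, not spelled out on these slides): the weight update at iteration $n$ adds a fraction $0 \le \alpha < 1$ of the previous update,
$$\Delta w_{ji}(n) = \eta\, \delta_j\, x_{ji} + \alpha\, \Delta w_{ji}(n-1),$$
so the search tends to keep moving in its previous direction, which can speed progress along flat stretches of the error surface and roll through some small local minima.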
Expressive Capabilities of ANNs
Boolean functions:
• Every Boolean function can be represented by a network with a single hidden layer
• But that might require an exponential (in the number of inputs) number of hidden units

Continuous functions:
• Every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer [Cybenko 1989; Hornik et al. 1989]
• Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988]