CSE 5526: Introduction to Neural Networks
Regression and the LMS Algorithm
Problem statement
Linear regression with one variable

• Given a set of N pairs of data {x_i, d_i}, approximate d by a linear function of x (the regressor), i.e.

  d ≈ wx + b

  or

  d_i = y_i + ε_i = φ(wx_i + b) + ε_i = wx_i + b + ε_i

  where the activation function φ(x) = x is a linear function, corresponding to a linear neuron. y_i is the output of the neuron, and

  ε_i = d_i − y_i

  is called the regression (expectational) error
Linear regression (cont.)

• The problem of regression with one variable is how to choose w and b to minimize the regression error
• The least squares method aims to minimize the square error:

  E = (1/2) Σ_{i=1}^N ε_i² = (1/2) Σ_{i=1}^N (d_i − y_i)²
Linear regression (cont.)

• To minimize the two-variable square function, set

  ∂E/∂b = 0
  ∂E/∂w = 0
Linear regression (cont.)

  ∂E/∂b = ∂/∂b [(1/2) Σ_i (d_i − wx_i − b)²] = −Σ_i (d_i − wx_i − b) = 0

  ∂E/∂w = ∂/∂w [(1/2) Σ_i (d_i − wx_i − b)²] = −Σ_i (d_i − wx_i − b) x_i = 0
Analytic solution approaches

• Approach 1: solve one equation for b in terms of w; substitute into the other equation and solve for w; substitute the solution for w back into the equation for b
• Approach 2: set up the system of equations in matrix notation; solve the matrix equation
• Approach 3: rewrite the problem in matrix form; compute the matrix gradient; solve for w
Linear regression (cont.)

• Hence

  b = [Σ_i x_i² Σ_i d_i − Σ_i x_i Σ_i x_i d_i] / [N Σ_i x_i² − (Σ_i x_i)²]

  w = Σ_i (x_i − x̄)(d_i − d̄) / Σ_i (x_i − x̄)²

  where an overbar (e.g. x̄) indicates the mean

• Derive yourself!
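As a sanity check, the closed-form expressions above can be evaluated directly. The sketch below uses NumPy on synthetic data; the seed and the true parameters (w = 2, b = 1) are illustrative assumptions:

```python
import numpy as np

# Synthetic data around the line d = 2x + 1 (illustrative)
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
d = 2.0 * x + 1.0 + 0.01 * rng.standard_normal(50)
N = len(x)

# w = sum (x_i - xbar)(d_i - dbar) / sum (x_i - xbar)^2
w = np.sum((x - x.mean()) * (d - d.mean())) / np.sum((x - x.mean()) ** 2)

# b = [sum x_i^2 sum d_i - sum x_i sum x_i d_i] / [N sum x_i^2 - (sum x_i)^2]
b = (np.sum(x**2) * np.sum(d) - np.sum(x) * np.sum(x * d)) / (
    N * np.sum(x**2) - np.sum(x) ** 2
)
```

Equivalently, once w is known, b = d̄ − w x̄; both forms follow from the two derivative equations.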
Linear regression in matrix notation

• Let X = [x_1 x_2 x_3 … x_N]^T
• Then the model predictions are y = Xw
• And the mean square error can be written

  E(w) = ‖d − y‖² = ‖d − Xw‖²

• To find the optimal w, set the gradient of the error with respect to w equal to 0 and solve for w:

  ∂E(w)/∂w = 0 = ∂/∂w ‖d − Xw‖²

• See The Matrix Cookbook (Petersen & Pedersen)
Linear regression in matrix notation

• Expanding the gradient:

  ∂E(w)/∂w = ∂/∂w ‖d − Xw‖²
           = ∂/∂w (d − Xw)^T (d − Xw)
           = ∂/∂w [d^T d − 2w^T X^T d + w^T X^T X w]
           = −2X^T d + 2X^T X w

• Setting ∂E(w)/∂w = 0 = −2X^T d + 2X^T X w gives

  w = (X^T X)^{-1} X^T d

• Much cleaner!
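The matrix solution can likewise be verified numerically. A sketch with made-up data follows; in practice a least-squares solver (or `np.linalg.solve` on the normal equations, as below) is preferred over explicitly inverting XᵀX:

```python
import numpy as np

# Made-up data: d generated from a known weight vector (illustrative)
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 3))
w_true = np.array([1.0, -2.0, 0.5])
d = X @ w_true + 0.01 * rng.standard_normal(100)

# Normal equations: solve (X^T X) w = X^T d rather than inverting X^T X
w_hat = np.linalg.solve(X.T @ X, X.T @ d)

# Numerically stabler built-in least-squares solver gives the same answer
w_lstsq, *_ = np.linalg.lstsq(X, d, rcond=None)
```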
Finding optimal parameters via search

• Often there is no closed-form solution for ∂E(w)/∂w = 0
• We can still use the gradient in a numerical solution
• We will still use the same example to permit comparison
• For simplicity's sake, set b = 0:

  E(w) = (1/2) Σ_{i=1}^N (d_i − wx_i)²

• E(w) is called a cost function
Cost function

[Figure: cost curve E(w) with minimum E_min at w = w*, and an initial weight w_0]

Question: how can we update w from w_0 to minimize E?
Gradient and directional derivatives

• Consider a two-variable function f(x, y). Its gradient at the point (x_0, y_0)^T is defined as

  ∇f(x_0, y_0) = (∂f(x_0, y_0)/∂x, ∂f(x_0, y_0)/∂y)^T = f_x(x_0, y_0) u_x + f_y(x_0, y_0) u_y

  where u_x and u_y are unit vectors in the x and y directions, and f_x = ∂f/∂x and f_y = ∂f/∂y
Gradient and directional derivatives (cont.)

• For any given direction u = a u_x + b u_y, with a² + b² = 1, the directional derivative at (x_0, y_0)^T along the unit vector u is

  D_u f(x_0, y_0) = lim_{h→0} [f(x_0 + ha, y_0 + hb) − f(x_0, y_0)] / h
                  = lim_{h→0} {[f(x_0 + ha, y_0 + hb) − f(x_0, y_0 + hb)] + [f(x_0, y_0 + hb) − f(x_0, y_0)]} / h
                  = a f_x(x_0, y_0) + b f_y(x_0, y_0)
                  = ∇f(x_0, y_0)^T u

• Which direction has the greatest slope? The gradient, because of the dot product!
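The identity D_u f = ∇f(x_0, y_0)^T u, and the fact that the gradient direction has the greatest slope, can be checked with finite differences. The function f below is a hypothetical example chosen only for illustration:

```python
import numpy as np

# Hypothetical smooth function and its analytic gradient (illustrative)
def f(p):
    x, y = p
    return x**2 + 3.0 * x * y - y**2

def grad_f(p):
    x, y = p
    return np.array([2.0 * x + 3.0 * y, 3.0 * x - 2.0 * y])

p0 = np.array([1.0, -0.5])
g = grad_f(p0)
h = 1e-6

# Finite-difference directional derivatives over many unit vectors u:
# each matches grad . u, and the largest slope is about ||grad||,
# attained when u points along the gradient
slopes = []
for theta in np.linspace(0.0, 2.0 * np.pi, 64, endpoint=False):
    u = np.array([np.cos(theta), np.sin(theta)])  # a^2 + b^2 = 1
    slopes.append((f(p0 + h * u) - f(p0)) / h)
best = max(slopes)
```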
Gradient and directional derivatives (cont.)

• Example: f(x, y) = (5/2)x² − 3xy + (5/2)y² + 2x + 2y

[Figure: plot of f(x, y)]

[Figure: plot of f(x, y)]
Gradient and directional derivatives (cont.)
• The level curves of a function 𝑓𝑓(𝑥𝑥,𝑥𝑥) are curves such that 𝑓𝑓 𝑥𝑥,𝑥𝑥 = 𝑘𝑘
• Thus, the directional derivative along a level curve is 0
• And the gradient vector is perpendicular to the level curve
0u),(f 00u =∇= TyxD
Gradient and directional derivatives (cont.)

• The gradient of a cost function is a vector with the dimension of w that points in the direction of maximum increase of E, with a magnitude equal to the slope of the tangent of the cost function along that direction
• Can the slope be negative?
Gradient illustration

[Figure: cost curve E(w) with minimum E_min at w*, initial weight w_0, and step Δw]

  ∇E(w_0) = lim_{Δw→0} [E(w_0 + Δw) − E(w_0 − Δw)] / (2Δw)
Gradient descent

• Minimize the cost function via gradient (steepest) descent, a case of hill-climbing:

  w(n+1) = w(n) − η ∇E(n)

  where n is the iteration number and η is the learning rate
• See previous figure
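A minimal sketch of this update applied to the one-variable cost E(w) = (1/2) Σ_i (d_i − wx_i)², with synthetic data and an illustrative learning rate:

```python
import numpy as np

# Batch gradient descent on E(w) = (1/2) sum_i (d_i - w x_i)^2
rng = np.random.default_rng(2)
x = rng.standard_normal(50)
d = 3.0 * x                              # targets from the true weight 3

w = 0.0
eta = 0.01
for n in range(500):
    grad = -np.sum((d - w * x) * x)      # dE/dw
    w = w - eta * grad                   # w(n+1) = w(n) - eta * grad E
```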
Gradient descent (cont.)

• For the mean-square-error cost function and linear neurons:

  E(n) = (1/2) e²(n) = (1/2) [d(n) − y(n)]² = (1/2) [d(n) − w(n)x(n)]²

  ∇E(n) = ∂E(n)/∂w(n) = e(n) ∂e(n)/∂w(n) = −e(n) x(n)
Gradient descent (cont.)

• Hence

  w(n+1) = w(n) + η e(n) x(n) = w(n) + η [d(n) − y(n)] x(n)

• This is the least-mean-square (LMS) algorithm, or the Widrow-Hoff rule
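A one-weight online sketch of the LMS rule, processing one sample per iteration; the data stream, true weight 3, and η are illustrative:

```python
import numpy as np

# Online LMS for a single linear neuron, one sample per step
rng = np.random.default_rng(3)
w = 0.0
eta = 0.1
for n in range(2000):
    x = rng.standard_normal()            # input sample x(n)
    d = 3.0 * x                          # target from the true weight 3
    y = w * x                            # neuron output y(n)
    e = d - y                            # error e(n) = d(n) - y(n)
    w = w + eta * e * x                  # LMS / Widrow-Hoff update
```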
Stochastic gradient descent

• If the cost function is of the form

  E(w) = Σ_{n=1}^N E_n(w)

• Then one gradient descent step requires computing

  Δw = ∂E(w)/∂w = Σ_{n=1}^N ∂E_n(w)/∂w

• This means computing E(w) or its gradient for every data point
• Many steps may be required to reach an optimum
Stochastic gradient descent

• It is generally much more computationally efficient to use

  Δw = Σ_{n=n_i}^{n_i + n_b − 1} ∂E_n(w)/∂w

  for small values of n_b
• This update rule may converge in many fewer passes through the data (epochs)
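A minibatch version of the LMS update, sketched under the same linear-model assumptions; the batch size n_b, η, and the data are illustrative:

```python
import numpy as np

# Minibatch gradient steps for linear regression
rng = np.random.default_rng(4)
X = rng.standard_normal((200, 2))
w_true = np.array([1.5, -0.5])
d = X @ w_true                           # noiseless targets (illustrative)

w = np.zeros(2)
eta, n_b = 0.05, 8                       # learning rate and minibatch size
for epoch in range(50):
    order = rng.permutation(len(X))      # shuffle each pass through the data
    for start in range(0, len(X), n_b):
        idx = order[start:start + n_b]
        e = d[idx] - X[idx] @ w          # per-sample errors in the minibatch
        w = w + eta * X[idx].T @ e       # summed gradient over the minibatch
```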
Stochastic gradient descent example

[Figure]

Stochastic gradient descent error functions

[Figure]

Stochastic gradient descent gradients

[Figure]

Stochastic gradient descent animation

[Animation]

Gradient descent animation

[Animation]
Multi-variable LMS

• The analysis for the one-variable case extends to the multi-variable case:

  E(n) = (1/2) [d(n) − w^T(n) x(n)]²

  ∇E(w) = (∂E/∂w_0, ∂E/∂w_1, …, ∂E/∂w_m)^T

  where w_0 = b (bias) and x_0 = 1, as done for perceptron learning
Multi-variable LMS (cont.)

• The LMS algorithm

  w(n+1) = w(n) − η ∇E
         = w(n) + η e(n) x(n)
         = w(n) + η [d(n) − y(n)] x(n)
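A sketch of the multi-variable online LMS update with the bias folded in as w_0 = b and x_0 = 1; the true parameters and η are illustrative:

```python
import numpy as np

# Online LMS with weight vector w = [b, w_1, w_2] and x_0 = 1 for the bias
rng = np.random.default_rng(5)
w = np.zeros(3)
eta = 0.05
for n in range(5000):
    x = np.concatenate(([1.0], rng.standard_normal(2)))  # x(n), with x_0 = 1
    d = 0.5 + 2.0 * x[1] - 1.0 * x[2]    # target: true b = 0.5, weights [2, -1]
    e = d - w @ x                        # e(n) = d(n) - w^T(n) x(n)
    w = w + eta * e * x                  # w(n+1) = w(n) + eta e(n) x(n)
```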
LMS algorithm remarks

• The LMS rule is exactly the same equation as the perceptron learning rule
• Perceptron learning is for nonlinear (M-P) neurons, whereas LMS learning is for linear neurons
  • i.e., perceptron learning is for classification and LMS is for function approximation
• LMS should be less sensitive to noise in the input data than perceptrons
• On the other hand, LMS learning converges slowly
• Newton's method changes weights in the direction of the minimum of E(w) and leads to fast convergence
  • But it is not online and is computationally expensive
Stability of adaptation

• When η is too small, learning converges slowly
• When η is too large, learning doesn't converge
Learning rate annealing

• Basic idea: start with a large rate but gradually decrease it
• Stochastic approximation:

  η(n) = c/n

  where c is a positive parameter
Learning rate annealing (cont.)

• Search-then-converge:

  η(n) = η_0 / (1 + n/τ)

  where η_0 and τ are positive parameters
• When n is small compared to τ, the learning rate is approximately constant
• When n is large compared to τ, the schedule roughly follows stochastic approximation
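The two schedules can be compared directly; c, η_0, and τ below are illustrative values:

```python
# Learning rate schedules (c, eta0, tau are illustrative choices)
def stochastic_approximation(n, c=1.0):
    return c / n                          # eta(n) = c / n

def search_then_converge(n, eta0=0.5, tau=100.0):
    return eta0 / (1.0 + n / tau)         # eta(n) = eta0 / (1 + n/tau)

# For n small vs tau the rate stays near eta0; for n large vs tau it
# decays like eta0 * tau / n, matching the stochastic approximation form
for n in [1, 10, 100, 1000, 10000]:
    print(n, stochastic_approximation(n), search_then_converge(n))
```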
Rate annealing illustration

[Figure]
Nonlinear neurons

• To extend the LMS algorithm to nonlinear neurons, consider a differentiable activation function φ at iteration n:

  E(n) = (1/2) [d(n) − y(n)]²
       = (1/2) [d(n) − φ(Σ_j w_j x_j(n))]²
Nonlinear neurons (cont.)

• By the chain rule of differentiation:

  ∂E/∂w_j = (∂E/∂y)(∂y/∂v)(∂v/∂w_j)
          = −[d(n) − y(n)] φ′(v(n)) x_j(n)
          = −e(n) φ′(v(n)) x_j(n)
Nonlinear neurons (cont.)

• Gradient descent gives

  w_j(n+1) = w_j(n) + η e(n) φ′(v(n)) x_j(n)
           = w_j(n) + η δ(n) x_j(n)

• The above is called the delta (δ) rule
• If we choose a logistic sigmoid for φ,

  φ(v) = 1 / (1 + exp(−av))

  then

  φ′(v) = a φ(v)[1 − φ(v)]   (see textbook)
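A sketch of the delta rule for a single logistic neuron; the target here is generated by a hypothetical "true" neuron with weights [1, −2], and η, a, and the data stream are illustrative:

```python
import numpy as np

def phi(v, a=1.0):
    return 1.0 / (1.0 + np.exp(-a * v))   # logistic sigmoid

rng = np.random.default_rng(6)
a, eta = 1.0, 0.5
w = np.zeros(2)
errs = []
for n in range(20000):
    x = rng.standard_normal(2)
    d = phi(1.0 * x[0] - 2.0 * x[1], a)   # target from the "true" neuron [1, -2]
    v = w @ x
    y = phi(v, a)
    e = d - y
    delta = e * a * y * (1.0 - y)         # delta(n) = e(n) * phi'(v(n))
    w = w + eta * delta * x               # delta rule update
    errs.append(e * e)
```

Because φ′(v) = a φ(v)[1 − φ(v)] is largest near v = 0, updates are largest for inputs whose induced local field is near zero.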
Role of activation function

[Figure: φ(v) and its derivative φ′(v) plotted against v]

The role of φ′: the weight update is most sensitive when v is near zero