CSE 5526: Introduction to Neural Networks
Regression and the LMS Algorithm
mr-pc.org/t/cse5526/pdf/02-regression.pdf

May 20, 2020

Transcript
Page 1:

CSE 5526: Introduction to Neural Networks

Regression and the LMS Algorithm

Page 2:

Problem statement

Page 3:

Linear regression with one variable

• Given a set of N pairs of data {x_i, d_i}, approximate d by a linear function of x (the regressor), i.e.

$$d_i \approx w x_i + b$$

or

$$d_i = y_i + \varepsilon_i = \varphi(w x_i + b) + \varepsilon_i = w x_i + b + \varepsilon_i$$

where the activation function φ(x) = x is a linear function, corresponding to a linear neuron, y is the output of the neuron, and

$$\varepsilon_i = d_i - y_i$$

is called the regression (expectational) error

Page 4:

Linear regression (cont.)

• The problem of regression with one variable is how to choose w and b to minimize the regression error

• The least squares method aims to minimize the square error:

$$E = \frac{1}{2}\sum_{i=1}^{N} \varepsilon_i^2 = \frac{1}{2}\sum_{i=1}^{N} (d_i - y_i)^2$$
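As a concrete illustration of this cost, here is a minimal sketch (numpy and the made-up data and candidate parameters are assumptions, not from the slides) that evaluates the squared error for two candidate lines:

```python
# A minimal sketch (numpy assumed, data made up): evaluating the squared-error
# cost E = 1/2 * sum_i (d_i - (w*x_i + b))^2 for two candidate (w, b) pairs.
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
d = np.array([1.0, 3.1, 4.9, 7.2])

def cost(w, b):
    return 0.5 * np.sum((d - (w * x + b)) ** 2)

print(cost(1.0, 0.0))   # a poor fit gives a large E (about 15.7)
print(cost(2.0, 1.0))   # close to the data-generating line, so E is small (about 0.03)
```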

Page 5:

Linear regression (cont.)

• To minimize this squared-error function of the two variables w and b, set

$$\frac{\partial E}{\partial b} = 0, \qquad \frac{\partial E}{\partial w} = 0$$

Page 6:

Linear regression (cont.)

$$\frac{\partial E}{\partial b} = \frac{\partial}{\partial b}\,\frac{1}{2}\sum_i (d_i - w x_i - b)^2 = -\sum_i (d_i - w x_i - b) = 0$$

$$\frac{\partial E}{\partial w} = \frac{\partial}{\partial w}\,\frac{1}{2}\sum_i (d_i - w x_i - b)^2 = -\sum_i (d_i - w x_i - b)\, x_i = 0$$

Page 7:

Analytic solution approaches

• Solve one equation for b in terms of w; substitute into the other equation and solve for w; then substitute the solution for w back into the equation for b
• Set up the system of equations in matrix notation and solve the matrix equation
• Rewrite the problem in matrix form, compute the matrix gradient, and solve for w

Page 8:

Linear regression (cont.)

• Hence

$$b = \frac{\sum_i d_i \sum_i x_i^2 - \sum_i x_i \sum_i x_i d_i}{N \sum_i x_i^2 - \left(\sum_i x_i\right)^2}$$

$$w = \frac{\sum_i (x_i - \bar{x})(d_i - \bar{d})}{\sum_i (x_i - \bar{x})^2}$$

where an overbar (e.g. $\bar{x}$) indicates the mean

Derive yourself!
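To make the closed-form result concrete, here is a minimal sketch (numpy and the small made-up data set are assumptions, not course code) that computes w from the sample means and recovers b from the first normal equation:

```python
# A minimal sketch: closed-form one-variable least squares,
# w = sum((x - x_mean)(d - d_mean)) / sum((x - x_mean)^2), and b from dE/db = 0.
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
d = np.array([1.1, 2.9, 5.2, 7.1, 8.8])   # roughly d = 2x + 1 plus noise

x_bar, d_bar = x.mean(), d.mean()
w = np.sum((x - x_bar) * (d - d_bar)) / np.sum((x - x_bar) ** 2)
b = d_bar - w * x_bar                      # follows from setting dE/db = 0

print(f"w = {w:.3f}, b = {b:.3f}")
```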

Page 9:

Linear regression in matrix notation

• Let $X = [\mathbf{x}_1\; \mathbf{x}_2\; \mathbf{x}_3\; \dots\; \mathbf{x}_N]^T$
• Then the model predictions are $y = Xw$
• And the squared error can be written

$$E(w) = \|d - y\|^2 = \|d - Xw\|^2$$

• To find the optimal w, set the gradient of the error with respect to w equal to 0 and solve for w

$$\frac{\partial}{\partial w} E(w) = 0 = \frac{\partial}{\partial w}\, \|d - Xw\|^2$$

• See The Matrix Cookbook (Petersen & Pedersen)

Page 10:

Linear regression in matrix notation

•
$$\frac{\partial}{\partial w} E(w) = \frac{\partial}{\partial w}\, \|d - Xw\|^2 = \frac{\partial}{\partial w}\, (d - Xw)^T (d - Xw) = \frac{\partial}{\partial w}\, \left(d^T d - 2\, w^T X^T d + w^T X^T X w\right) = -2 X^T d + 2 X^T X w$$

• Setting the gradient to zero:

$$\frac{\partial}{\partial w} E(w) = 0 \;\Rightarrow\; X^T X w = X^T d \;\Rightarrow\; w = (X^T X)^{-1} X^T d$$

• Much cleaner!
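A sketch of how this could look in code (numpy assumed; the synthetic data and the column-of-ones trick for the bias are illustrative choices, not from the slides):

```python
# A minimal sketch: solving the normal equations X^T X w = X^T d with numpy.
# np.linalg.lstsq is the numerically preferred route; the explicit solve mirrors
# w = (X^T X)^{-1} X^T d from the derivation above.
import numpy as np

rng = np.random.default_rng(0)
N = 100
X = np.column_stack([rng.normal(size=N), np.ones(N)])   # second column of 1s absorbs the bias b
true_w = np.array([2.0, 1.0])
d = X @ true_w + 0.1 * rng.normal(size=N)

w_solve = np.linalg.solve(X.T @ X, X.T @ d)              # normal equations
w_lstsq, *_ = np.linalg.lstsq(X, d, rcond=None)          # least-squares solver

print(w_solve, w_lstsq)                                  # both close to [2.0, 1.0]
```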

Page 11:

Finding optimal parameters via search

• Often there is no closed-form solution for $\frac{\partial}{\partial w} E(w) = 0$

• We can still use the gradient in a numerical solution
• We will still use the same example to permit comparison
• For simplicity's sake, set b = 0

E(w) is called a cost function

$$E(w) = \frac{1}{2}\sum_{i=1}^{N} (d_i - w x_i)^2$$

Page 12:

Cost function

[Figure: parabolic cost function E(w) with minimum E_min at w = w*, and an initial weight w_0]

Question: how can we update w from w_0 to minimize E?

Page 13:

Gradient and directional derivatives

• Consider a two-variable function f(x, y). Its gradient at the point (x0, y0)T is defined as

$$\nabla f(x_0, y_0) = \left[\frac{\partial f(x_0, y_0)}{\partial x},\; \frac{\partial f(x_0, y_0)}{\partial y}\right]^T = f_x(x_0, y_0)\,\mathbf{u}_x + f_y(x_0, y_0)\,\mathbf{u}_y$$

where $\mathbf{u}_x$ and $\mathbf{u}_y$ are unit vectors in the x and y directions, $f_x = \partial f/\partial x$, and $f_y = \partial f/\partial y$

Page 14:

Gradient and directional derivatives (cont.)

• In any given direction $\mathbf{u} = a\,\mathbf{u}_x + b\,\mathbf{u}_y$, with $a^2 + b^2 = 1$, the directional derivative at $(x_0, y_0)^T$ along the unit vector u is

$$
\begin{aligned}
D_{\mathbf{u}} f(x_0, y_0) &= \lim_{h \to 0} \frac{f(x_0 + ha,\, y_0 + hb) - f(x_0, y_0)}{h} \\
&= \lim_{h \to 0} \frac{[f(x_0 + ha,\, y_0 + hb) - f(x_0,\, y_0 + hb)] + [f(x_0,\, y_0 + hb) - f(x_0, y_0)]}{h} \\
&= a\, f_x(x_0, y_0) + b\, f_y(x_0, y_0) \\
&= \nabla f(x_0, y_0)^T \mathbf{u}
\end{aligned}
$$

• Which direction has the greatest slope? The gradient direction, because of the dot product!
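A quick numeric sanity check of this identity (the function f(x, y) = x² + 3xy, the point, and the direction below are arbitrary choices for illustration, not from the slides): a finite-difference directional derivative should match the dot product of the gradient with u.

```python
# A small illustrative check: the directional derivative D_u f equals grad(f) . u.
import numpy as np

def f(x, y):
    return x**2 + 3*x*y

def grad_f(x, y):
    return np.array([2*x + 3*y, 3*x])     # analytic gradient of the example f

x0, y0 = 1.0, 2.0
u = np.array([3.0, 4.0]) / 5.0            # unit vector, so a^2 + b^2 = 1

h = 1e-6
numeric = (f(x0 + h*u[0], y0 + h*u[1]) - f(x0, y0)) / h
analytic = grad_f(x0, y0) @ u
print(numeric, analytic)                  # both approximately 7.2
```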

Page 15:

Gradient and directional derivatives (cont.)

• Example: $f(x, y) = \frac{5}{2}x^2 - 3xy + \frac{5}{2}y^2 + 2x + 2y$  [figure]

Page 16:

Gradient and directional derivatives (cont.)

• Example: $f(x, y) = \frac{5}{2}x^2 - 3xy + \frac{5}{2}y^2 + 2x + 2y$  [figure]

Page 17:

Gradient and directional derivatives (cont.)

• The level curves of a function f(x, y) are curves such that f(x, y) = k

• Thus, the directional derivative along a level curve is 0

• And the gradient vector is perpendicular to the level curve

$$D_{\mathbf{u}} f = \nabla f(x_0, y_0)^T \mathbf{u} = 0$$

Page 18:

Gradient and directional derivatives (cont.)

• The gradient of a cost function is a vector with the dimension of w that points in the direction of maximum increase of E, with a magnitude equal to the slope of the tangent of the cost function along that direction
• Can the slope be negative?

Page 19:

Gradient illustration

[Figure: cost function E(w) with minimum E_min at w*, an initial weight w_0, and a step Δw illustrating the gradient at w_0]

$$\frac{\partial E(w_0)}{\partial w} = \lim_{\Delta w \to 0} \frac{E(w_0 + \Delta w) - E(w_0 - \Delta w)}{2\, \Delta w}$$

Page 20:

Gradient descent

• Minimize the cost function via gradient (steepest) descent – a case of hill-climbing

$$w(n+1) = w(n) - \eta\, \nabla E(n)$$

where n is the iteration number and η is the learning rate (see the previous figure)
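A minimal sketch of this update loop (numpy, made-up data, and b = 0 as on the earlier slide are assumptions, not course code):

```python
# Batch gradient descent w(n+1) = w(n) - eta * dE/dw
# for E(w) = 1/2 * sum_i (d_i - w*x_i)^2.
import numpy as np

x = np.array([0.5, 1.0, 1.5, 2.0, 2.5])
d = 2.0 * x + 0.05 * np.random.default_rng(1).normal(size=x.size)

eta, w = 0.05, 0.0
for n in range(100):
    e = d - w * x                 # per-sample errors
    grad = -np.sum(e * x)         # dE/dw of the summed squared-error cost
    w = w - eta * grad            # gradient descent step

print(w)                          # approaches the least-squares solution (about 2)
```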

Page 21:

Gradient descent (cont.)

• For the mean-square-error cost function and linear neurons

$$E(n) = \frac{1}{2}\, e^2(n) = \frac{1}{2}\,[d(n) - y(n)]^2 = \frac{1}{2}\,[d(n) - w(n)\,x(n)]^2$$

$$\nabla E(n) = \frac{\partial E(n)}{\partial w(n)} = \frac{\partial}{\partial w(n)}\,\frac{1}{2}\, e^2(n) = -e(n)\, x(n)$$

Page 22:

Gradient descent (cont.)

• Hence

$$w(n+1) = w(n) + \eta\, e(n)\, x(n) = w(n) + \eta\,[d(n) - y(n)]\, x(n)$$

• This is the least-mean-square (LMS) algorithm, or the Widrow-Hoff rule
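An online (sample-by-sample) version of this rule might look like the following sketch (numpy and the synthetic data are assumptions, not from the slides):

```python
# A minimal sketch of the scalar LMS / Widrow-Hoff update,
# w <- w + eta * (d - y) * x, applied one sample at a time.
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, size=500)
d = 2.0 * x + 0.05 * rng.normal(size=x.size)   # target slope ~ 2, b = 0

eta, w = 0.1, 0.0
for xn, dn in zip(x, d):
    y = w * xn                  # linear neuron output y(n)
    e = dn - y                  # instantaneous error e(n)
    w += eta * e * xn           # LMS update

print(w)                        # close to 2
```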

Page 23:

Stochastic gradient descent

• If the cost function is of the form

$$E(w) = \sum_{n=1}^{N} E_n(w)$$

• Then one gradient descent step requires computing

$$\Delta w = \frac{\partial}{\partial w} E(w) = \sum_{n=1}^{N} \frac{\partial}{\partial w} E_n(w)$$

• Which means computing E_n(w) or its gradient for every data point

• Many steps may be required to reach an optimum

Page 24:

Stochastic gradient descent

• It is generally much more computationally efficient to use

$$\Delta w = \sum_{n = n_i}^{n_i + n_b - 1} \frac{\partial}{\partial w} E_n(w)$$

• For small values of the batch size $n_b$
• This update rule may converge in many fewer passes through the data (epochs)
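A mini-batch version could be sketched as follows (numpy is assumed; the data, batch size n_b, and learning rate are illustrative choices, not values from the course):

```python
# Mini-batch stochastic gradient descent for the same squared-error cost with b = 0.
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, size=1000)
d = 2.0 * x + 0.05 * rng.normal(size=x.size)

eta, w, n_b = 0.05, 0.0, 20
for epoch in range(5):
    order = rng.permutation(x.size)             # shuffle each pass (epoch) through the data
    for start in range(0, x.size, n_b):
        idx = order[start:start + n_b]          # indices of the current mini-batch
        e = d[idx] - w * x[idx]
        w += eta * np.sum(e * x[idx])           # update from the mini-batch gradient only

print(w)                                        # close to 2 after a few epochs
```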

Page 25:

Stochastic gradient descent example

Page 26:

Stochastic gradient descent error functions

Page 27:

Stochastic gradient descent gradients

Page 28:

Stochastic gradient descent animation

Page 29:

Gradient descent animation

Page 30:

Multi-variable LMS

• The analysis for the one-variable case extends to the multi-variable case, where w_0 = b (bias) and x_0 = 1, as done for perceptron learning

$$E(n) = \frac{1}{2}\left[d(n) - \mathbf{w}^T(n)\,\mathbf{x}(n)\right]^2$$

$$\nabla E(\mathbf{w}) = \left[\frac{\partial E}{\partial w_0},\, \frac{\partial E}{\partial w_1},\, \dots,\, \frac{\partial E}{\partial w_m}\right]^T$$

Page 31:

Multi-variable LMS (cont.)

• The LMS algorithm

$$\mathbf{w}(n+1) = \mathbf{w}(n) - \eta\, \nabla E(n) = \mathbf{w}(n) + \eta\, e(n)\, \mathbf{x}(n) = \mathbf{w}(n) + \eta\,[d(n) - y(n)]\, \mathbf{x}(n)$$
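A sketch of the vector LMS loop with the bias folded in as w_0 and x_0 = 1 (numpy and the synthetic two-input data are assumptions, not course code):

```python
# Vector LMS rule: w <- w + eta * e * x, with w0 = b and x0 = 1.
import numpy as np

rng = np.random.default_rng(4)
N = 1000
X = np.column_stack([np.ones(N), rng.uniform(-1, 1, size=(N, 2))])  # x0 = 1 for the bias
true_w = np.array([0.5, 2.0, -1.0])                                  # [b, w1, w2]
d = X @ true_w + 0.05 * rng.normal(size=N)

eta = 0.1
w = np.zeros(3)
for xn, dn in zip(X, d):
    e = dn - w @ xn              # scalar error e(n)
    w += eta * e * xn            # LMS weight update

print(w)                          # approaches [0.5, 2.0, -1.0]
```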

Page 32:

LMS algorithm remarks

• The LMS rule is exactly the same equation as the perceptron learning rule

• Perceptron learning is for nonlinear (M-P) neurons, whereas LMS learning is for linear neurons
• i.e., perceptron learning is for classification and LMS is for function approximation
• LMS should be less sensitive to noise in the input data than perceptrons
• On the other hand, LMS learning converges slowly
• Newton's method changes weights in the direction of the minimum of E(w) and leads to fast convergence
• But it is not online and is computationally expensive

Page 33:

Stability of adaptation

When η is too small, learning converges slowly

When η is too large, learning doesn’t converge

Page 34:

Learning rate annealing

• Basic idea: start with a large rate but gradually decrease it
• Stochastic approximation:

$$\eta(n) = \frac{c}{n}$$

where c is a positive parameter

Page 35:

Learning rate annealing (cont.)

• Search-then-converge:

$$\eta(n) = \frac{\eta_0}{1 + n/\tau}$$

where $\eta_0$ and τ are positive parameters
• When n is small compared to τ, the learning rate is approximately constant
• When n is large compared to τ, the learning rate schedule roughly follows the stochastic approximation
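The two schedules can be compared directly in a short sketch; the constants below are arbitrary illustrations, not values from the course:

```python
# Comparing the two learning-rate schedules from these slides:
# stochastic approximation eta(n) = c/n and search-then-converge eta(n) = eta0 / (1 + n/tau).
c, eta0, tau = 1.0, 0.1, 100.0

def eta_stochastic(n):
    return c / n                    # decays from the very first step

def eta_search_then_converge(n):
    return eta0 / (1.0 + n / tau)   # roughly constant while n << tau, ~ eta0*tau/n for n >> tau

for n in (1, 10, 100, 1000):
    print(n, eta_stochastic(n), eta_search_then_converge(n))
```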

Page 36:

Rate annealing illustration

Page 37:

Nonlinear neurons

• To extend the LMS algorithm to nonlinear neurons, consider a differentiable activation function φ. At iteration n,

$$E(n) = \frac{1}{2}\,[d(n) - y(n)]^2 = \frac{1}{2}\left[d(n) - \varphi\!\left(\sum_j w_j x_j(n)\right)\right]^2$$

Page 38:

Nonlinear neurons (cont.)

• By the chain rule of differentiation,

$$\frac{\partial E}{\partial w_j} = \frac{\partial E}{\partial y}\,\frac{\partial y}{\partial v}\,\frac{\partial v}{\partial w_j} = -[d(n) - y(n)]\,\varphi'(v(n))\,x_j(n) = -e(n)\,\varphi'(v(n))\,x_j(n)$$

Page 39:

Nonlinear neurons (cont.)

• Gradient descent gives

$$w_j(n+1) = w_j(n) + \eta\, e(n)\, \varphi'(v(n))\, x_j(n) = w_j(n) + \eta\, \delta(n)\, x_j(n)$$

• The above is called the delta (δ) rule
• If we choose a logistic sigmoid for φ,

$$\varphi(v) = \frac{1}{1 + \exp(-av)}$$

then

$$\varphi'(v) = a\, \varphi(v)\,[1 - \varphi(v)] \qquad \text{(see textbook)}$$
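A sketch of the delta rule for a logistic-sigmoid neuron (numpy, a = 1, the learning rate, and the synthetic data are assumptions made for illustration):

```python
# Delta rule for a sigmoid neuron: w_j <- w_j + eta * e * phi'(v) * x_j,
# using phi'(v) = a * phi(v) * (1 - phi(v)).
import numpy as np

def phi(v, a=1.0):
    return 1.0 / (1.0 + np.exp(-a * v))

rng = np.random.default_rng(5)
N = 2000
X = np.column_stack([np.ones(N), rng.uniform(-2, 2, size=(N, 2))])  # x0 = 1 bias input
true_w = np.array([-0.5, 1.5, -1.0])
d = phi(X @ true_w)                 # targets generated by a sigmoid neuron

eta, a = 0.5, 1.0
w = np.zeros(3)
for xn, dn in zip(X, d):
    v = w @ xn                      # induced local field v(n)
    y = phi(v, a)
    e = dn - y
    delta = e * a * y * (1.0 - y)   # delta(n) = e(n) * phi'(v(n))
    w += eta * delta * xn           # delta-rule update

print(w)                            # moves toward [-0.5, 1.5, -1.0]
```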

Page 40:

Role of activation function

[Figure: plots of φ(v) and φ′(v) versus v]

The role of φ′: weight update is most sensitive when v is near zero