Widrow-Hoff Learning (LMS Algorithm)
Hagan, Demuth, Beale, De Jesus. Approximate steepest descent. Least mean square.
Slide annotations by Anthony S Maida.

Mar 18, 2021

Transcript
Page 1: Widrow-Hoff Learning (LMS Algorithm)

Widrow-Hoff Learning (LMS Algorithm)

Notes (Anthony S Maida):
Hagan, Demuth, Beale, De Jesus.
Approximate steepest descent.
Least mean square.
Page 2: ADALINE Network

[Figure: linear neuron layer. Input p (R x 1), weight matrix W (S x R), bias b (S x 1), net input n (S x 1), output a (S x 1); R inputs, S neurons; a = purelin(Wp + b).]

a = purelin(Wp + b) = Wp + b

a_i = purelin(n_i) = purelin(w_i^T p + b_i) = w_i^T p + b_i

w_i = [w_{i,1}, w_{i,2}, ..., w_{i,R}]^T

Notes (Anthony S Maida):
ADALINE = adaptive linear neuron.
A single-layer network with trainable linear output units.
w_i is the incoming weight vector to output unit i; a_i is the output of unit i.
Page 3: Two-Input ADALINE

[Figure: two-input neuron. Inputs p_1 and p_2 with weights w_{1,1} and w_{1,2}, bias b, summer Σ, output a = purelin(Wp + b).]

[Figure: decision boundary in the (p_1, p_2) plane. The line w_1^T p + b = 0 crosses the axes at p_1 = -b/w_{1,1} and p_2 = -b/w_{1,2}; a > 0 on one side and a < 0 on the other, with w_1 normal to the boundary.]

a = purelin(n) = purelin(w_1^T p + b) = w_1^T p + b

a = w_1^T p + b = w_{1,1} p_1 + w_{1,2} p_2 + b

Page 4: Mean Square Error

Training Set: {p_1, t_1}, {p_2, t_2}, ..., {p_Q, t_Q}
Input: p_q    Target: t_q

Notation:
x = [w_1; b],    z = [p; 1]
a = w_1^T p + b = x^T z

Mean Square Error:
F(x) = E[e^2] = E[(t - a)^2] = E[(t - x^T z)^2]

Notes (Anthony S Maida):
Labeled data: Q data items.
x holds the trainable parameters (the weights and the trainable bias); z is the input. Packing the bias into the weight vector simplifies notation.
E[.] is the expected value; F(x) is the expected squared error, i.e., (target - actual output) squared, in expectation.
Page 5: Error Analysis

F(x) = E[e^2] = E[(t - a)^2] = E[(t - x^T z)^2]

F(x) = E[t^2 - 2 t x^T z + x^T z z^T x]

F(x) = E[t^2] - 2 x^T E[t z] + x^T E[z z^T] x

F(x) = c - 2 x^T h + x^T R x,   where   c = E[t^2],   h = E[t z],   R = E[z z^T]

The mean square error for the ADALINE network is a quadratic function:

F(x) = c + d^T x + (1/2) x^T A x,   with   d = -2h,   A = 2R

Notes (Anthony S Maida):
Substitute the linear output and expand the quadratic; the expected value lets you break the sum into components. The goal is to minimize F(x).
z: inputs; x: trainable parameters.
h is the cross-correlation between input and target; R is the input correlation matrix. h and R can be difficult to calculate.
The standard form is another quadratic; the characteristics of the quadratic depend on A. Converting to standard form allows further analysis.
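As a concrete illustration of these statistics (my addition, not part of the original slides), here is a minimal NumPy sketch that estimates c, h, and R from a batch of labeled samples and evaluates F(x) = c - 2 x^T h + x^T R x. The data, dimensions, and variable names are made up for the example.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical labeled data: Q samples, R features each (assumption for illustration).
Q, R_feat = 1000, 3
P = rng.normal(size=(Q, R_feat))          # inputs p_q
t = P @ np.array([1.0, -2.0, 0.5]) + 0.3  # targets from a made-up linear rule

# Augment inputs: z = [p; 1] so the bias rides along with the weights.
Z = np.hstack([P, np.ones((Q, 1))])

# Sample estimates of the statistics on the slide.
c = np.mean(t**2)                  # c = E[t^2]
h = (Z * t[:, None]).mean(axis=0)  # h = E[t z]
Rmat = (Z.T @ Z) / Q               # R = E[z z^T]

def F(x):
    """Mean square error in quadratic form: F(x) = c - 2 x^T h + x^T R x."""
    return c - 2 * x @ h + x @ Rmat @ x

x = np.zeros(R_feat + 1)           # x = [w_1; b]
print(F(x))                        # equals E[t^2] when all parameters are zero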
Page 6: Stationary Point

∇F(x) = ∇( c + d^T x + (1/2) x^T A x ) = d + A x = -2h + 2Rx

Setting the gradient to zero:

-2h + 2Rx = 0,   i.e.,   Rx = h,   so   x* = R^{-1} h

Hessian Matrix:   A = 2R

The correlation matrix R must be at least positive semidefinite (only nonnegative eigenvalues). If there are any zero eigenvalues, the performance index will either have a weak minimum or else no stationary point; otherwise there will be a unique global minimum x*.

If R is positive definite (all eigenvalues > 0), x* = R^{-1} h is guaranteed to be the unique global minimum.

Notes (Anthony S Maida):
Quadratic analysis of F(x): take the gradient of the quadratic, set it to zero, and solve Rx = h for the global minimum.
R may be expensive to compute if the data set is large.
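To make the stationary-point calculation concrete, the following sketch (my addition, with an arbitrary made-up positive definite R and h) checks definiteness via the eigenvalues and solves Rx = h directly.

import numpy as np

# Hypothetical statistics for illustration (any positive definite R works).
R = np.array([[1.0, 0.2, 0.0],
              [0.2, 1.5, 0.1],
              [0.0, 0.1, 0.8]])   # input correlation matrix, R = E[z z^T]
h = np.array([0.4, -0.3, 0.1])    # cross-correlation, h = E[t z]

# R must be at least positive semidefinite; all eigenvalues > 0 means a unique minimum.
eigvals = np.linalg.eigvalsh(R)
assert np.all(eigvals > 0), "R is only semidefinite: weak minimum or no stationary point"

# Stationary point: solve R x = h rather than forming R^{-1} explicitly.
x_star = np.linalg.solve(R, h)

# Gradient of F at x*: -2h + 2 R x* should be numerically zero.
print(x_star, -2 * h + 2 * R @ x_star)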
Page 7: Approximate Steepest Descent

Approximate mean square error (one sample):

F̂(x) = ( t(k) - a(k) )^2 = e^2(k)

Approximate (stochastic) gradient:

∇̂F(x) = ∇e^2(k)

[∇e^2(k)]_j = ∂e^2(k) / ∂w_{1,j} = 2 e(k) ∂e(k) / ∂w_{1,j},   j = 1, 2, ..., R

[∇e^2(k)]_{R+1} = ∂e^2(k) / ∂b = 2 e(k) ∂e(k) / ∂b

Notes (Anthony S Maida):
Design an algorithm that locates the minimum point while avoiding the calculation of h and R.
The expectation of the squared error is replaced by the squared error at iteration k, where k indexes a trial. This is also called online or incremental learning because of the trial-by-trial update.
R is the number of input features; j varies over 1, ..., R, and component R+1 is the bias, so the gradient vector has a total of R+1 components.
The square comes down as the factor 2e(k) when taking the derivative.
Page 8: Approximate Gradient Calculation

∂e(k)/∂w_{1,j} = ∂[t(k) - a(k)]/∂w_{1,j} = ∂/∂w_{1,j} [ t(k) - ( w_1^T p(k) + b ) ]

∂e(k)/∂w_{1,j} = ∂/∂w_{1,j} [ t(k) - ( \sum_{i=1}^{R} w_{1,i} p_i(k) + b ) ]

∂e(k)/∂w_{1,j} = -p_j(k),      ∂e(k)/∂b = -1

∇̂F(x) = ∇e^2(k) = -2 e(k) z(k)

Notes (Anthony S Maida):
Calculate the partial derivatives of e = t - a; all steps involve substitutions.
Pack the result into vectorized form.
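A quick way to sanity-check the closed form -2 e(k) z(k) (my own illustration, with made-up numbers) is to compare it against a finite-difference gradient of e^2(k):

import numpy as np

rng = np.random.default_rng(1)

# One made-up training sample and parameter vector (assumptions for illustration).
p = rng.normal(size=4)            # input p(k)
t = 0.7                           # target t(k)
z = np.append(p, 1.0)             # z(k) = [p(k); 1]
x = rng.normal(size=5)            # x = [w_1; b]

def sq_err(x):
    """e^2(k) for a linear (ADALINE) unit: e = t - x^T z."""
    return (t - x @ z) ** 2

# Closed-form stochastic gradient from the slide: grad e^2(k) = -2 e(k) z(k).
e = t - x @ z
grad_closed = -2 * e * z

# Central finite differences as a numerical check.
eps = 1e-6
grad_fd = np.array([(sq_err(x + eps * d) - sq_err(x - eps * d)) / (2 * eps)
                    for d in np.eye(5)])

print(np.max(np.abs(grad_closed - grad_fd)))   # should be tiny (~1e-8 or smaller)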
Page 9: LMS Algorithm

x_{k+1} = x_k - α ∇F(x) |_{x = x_k}

x_{k+1} = x_k + 2 α e(k) z(k)

w_1(k+1) = w_1(k) + 2 α e(k) p(k)

b(k+1) = b(k) + 2 α e(k)

Notes (Anthony S Maida):
Stochastic (trial-by-trial) gradient descent, using the approximate gradient from the previous slide.
w_1 is the first-layer weight vector; the last two equations are the weight and bias updates.
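The update rule above is just a few lines of code. Here is a minimal sketch (mine, not from the slides) of single-neuron LMS in the combined x = [w_1; b], z = [p; 1] notation, trained on a made-up linear target:

import numpy as np

rng = np.random.default_rng(2)

# Made-up stream of samples from a linear target (assumption for illustration).
R_feat = 3
w_true, b_true = np.array([1.0, -2.0, 0.5]), 0.3

x = np.zeros(R_feat + 1)   # x = [w_1; b], initialized to zero
alpha = 0.05               # learning rate (must satisfy 0 < alpha < 1/lambda_max)

for k in range(2000):
    p = rng.normal(size=R_feat)
    t = w_true @ p + b_true
    z = np.append(p, 1.0)          # z(k) = [p(k); 1]
    e = t - x @ z                  # e(k) = t(k) - a(k), with a(k) = x^T z(k)
    x = x + 2 * alpha * e * z      # x_{k+1} = x_k + 2 alpha e(k) z(k)

print(x)   # should approach [1.0, -2.0, 0.5, 0.3]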
Page 10: Multiple-Neuron Case

w_i(k+1) = w_i(k) + 2 α e_i(k) p(k)

b_i(k+1) = b_i(k) + 2 α e_i(k)

Matrix Form:

W(k+1) = W(k) + 2 α e(k) p^T(k)

b(k+1) = b(k) + 2 α e(k)

Notes (Anthony S Maida):
Since each output unit is independent, simply convert to matrix form.
Initialize the weights to 0.
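For the multi-neuron case, the same update written in matrix form might look like the following sketch (my illustration; the S x R target mapping is invented):

import numpy as np

rng = np.random.default_rng(3)

S, R_feat = 2, 3                        # S output neurons, R inputs
W_true = rng.normal(size=(S, R_feat))   # made-up target mapping
b_true = rng.normal(size=S)

W = np.zeros((S, R_feat))               # initialize weights to 0
b = np.zeros(S)
alpha = 0.05

for k in range(3000):
    p = rng.normal(size=R_feat)
    t = W_true @ p + b_true
    a = W @ p + b                        # a = purelin(Wp + b)
    e = t - a                            # error vector, one entry per neuron
    W = W + 2 * alpha * np.outer(e, p)   # W(k+1) = W(k) + 2 alpha e(k) p^T(k)
    b = b + 2 * alpha * e                # b(k+1) = b(k) + 2 alpha e(k)

print(np.max(np.abs(W - W_true)), np.max(np.abs(b - b_true)))   # both near zero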
Page 11: Analysis of Convergence

x_{k+1} = x_k + 2 α e(k) z(k)

E[x_{k+1}] = E[x_k] + 2 α E[e(k) z(k)]

E[x_{k+1}] = E[x_k] + 2 α { E[t(k) z(k)] - E[(x_k^T z(k)) z(k)] }

E[x_{k+1}] = E[x_k] + 2 α { E[t(k) z(k)] - E[(z(k) z^T(k)) x_k] }

E[x_{k+1}] = E[x_k] + 2 α { h - R E[x_k] }      (assuming x_k is independent of z(k), since x_k depends only on earlier inputs)

E[x_{k+1}] = [I - 2 α R] E[x_k] + 2 α h

For stability, the eigenvalues of this matrix must fall inside the unit circle.

Notes (Anthony S Maida):
Stability analysis of the mean weight vector.
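The mean-weight recursion can be iterated directly. The small sketch below (my addition, using an arbitrary positive definite R and h, not values from the slides) shows convergence for a stable α and divergence once α exceeds 1/λ_max:

import numpy as np

# Arbitrary illustrative statistics (not from the slides).
R = np.array([[1.0, 0.3],
              [0.3, 0.5]])
h = np.array([0.2, -0.1])
lam_max = np.linalg.eigvalsh(R).max()

def iterate_mean(alpha, steps=200):
    """Iterate E[x_{k+1}] = (I - 2 alpha R) E[x_k] + 2 alpha h, starting from zero."""
    Ex = np.zeros(2)
    for _ in range(steps):
        Ex = (np.eye(2) - 2 * alpha * R) @ Ex + 2 * alpha * h
    return Ex

x_star = np.linalg.solve(R, h)
print(iterate_mean(0.5 / lam_max), x_star)        # stable: converges to x* = R^{-1} h
print(np.abs(iterate_mean(1.2 / lam_max)).max())  # unstable: blows up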
Page 12: Conditions for Stability

The eigenvalues of [I - 2αR] are

eig([I - 2αR]) = 1 - 2αλ_i      (where λ_i is an eigenvalue of R),

so stability requires |1 - 2αλ_i| < 1. Since λ_i > 0, 1 - 2αλ_i < 1 always holds, and the condition reduces to 1 - 2αλ_i > -1.

Therefore the stability condition simplifies to

α < 1/λ_i   for all i,   i.e.,   0 < α < 1/λ_max

Notes (Anthony S Maida):
Stability condition.
Page 13: Steady State Response

E[x_{k+1}] = [I - 2αR] E[x_k] + 2αh

If the system is stable, then a steady state condition will be reached:

E[x_ss] = [I - 2αR] E[x_ss] + 2αh

The solution to this equation is

E[x_ss] = R^{-1} h = x*

This is also the strong minimum of the performance index.

Page 14: Example

Banana:  p_1 = [-1; 1; -1],  t_1 = -1
Apple:   p_2 = [ 1; 1; -1],  t_2 = 1

R = E[p p^T] = (1/2) p_1 p_1^T + (1/2) p_2 p_2^T

R = (1/2) [-1; 1; -1] [-1 1 -1] + (1/2) [1; 1; -1] [1 1 -1]

R = [ 1  0  0
      0  1 -1
      0 -1  1 ]

λ_1 = 1.0,   λ_2 = 0.0,   λ_3 = 2.0

α < 1/λ_max = 1/2.0 = 0.5

Notes (Anthony S Maida):
Set up the apple/banana example and choose α to guarantee stability, from slide 12. Two trials per batch.
Compute the expectation to obtain the input correlation matrix (positive semidefinite), then find its eigenvalues; the largest eigenvalue is 2.0.
α must be < 0.5, but the following example chose 0.2.
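This calculation is easy to reproduce numerically; a short sketch (my addition) follows, using the banana/apple vectors from the slide:

import numpy as np

# Banana and apple prototype vectors and targets from the slide.
p1, t1 = np.array([-1.0, 1.0, -1.0]), -1.0
p2, t2 = np.array([ 1.0, 1.0, -1.0]),  1.0

# Input correlation matrix: R = (1/2) p1 p1^T + (1/2) p2 p2^T.
R = 0.5 * np.outer(p1, p1) + 0.5 * np.outer(p2, p2)
eigvals = np.linalg.eigvalsh(R)

print(R)                     # [[1 0 0], [0 1 -1], [0 -1 1]]
print(eigvals)               # 0.0, 1.0, 2.0
print(1.0 / eigvals.max())   # stable learning rates: 0 < alpha < 0.5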
Page 15: Iteration One

Banana:

a(0) = W(0) p(0) = W(0) p_1 = [0 0 0] [-1; 1; -1] = 0

e(0) = t(0) - a(0) = t_1 - a(0) = -1 - 0 = -1

W(1) = W(0) + 2 α e(0) p^T(0)

W(1) = [0 0 0] + 2 (0.2) (-1) [-1; 1; -1]^T = [0.4 -0.4 0.4]

Notes (Anthony S Maida):
Start training with the weights initialized to 0 and α = 0.2. The last line gives the new weights.
Page 16: Iteration Two

Apple:

a(1) = W(1) p(1) = W(1) p_2 = [0.4 -0.4 0.4] [1; 1; -1] = -0.4

e(1) = t(1) - a(1) = t_2 - a(1) = 1 - (-0.4) = 1.4

W(2) = [0.4 -0.4 0.4] + 2 (0.2) (1.4) [1; 1; -1]^T = [0.96 0.16 -0.16]

Page 17: Iteration Three

Banana again:

a(2) = W(2) p(2) = W(2) p_1 = [0.96 0.16 -0.16] [-1; 1; -1] = -0.64

e(2) = t(2) - a(2) = t_1 - a(2) = -1 - (-0.64) = -0.36

W(3) = W(2) + 2 α e(2) p^T(2) = [1.1040 0.0160 -0.0160]

W(∞) = [1 0 0]

Notes (Anthony S Maida):
Use the banana input again. The weights are gradually approaching the value reached after convergence, W(∞).
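The three hand iterations, and the limit W(∞), can be replayed in a few lines. This sketch is mine, alternating banana and apple presentations as the slides do:

import numpy as np

p1, t1 = np.array([-1.0, 1.0, -1.0]), -1.0   # banana
p2, t2 = np.array([ 1.0, 1.0, -1.0]),  1.0   # apple

W = np.zeros(3)     # weights initialized to 0; no bias in this example
alpha = 0.2

samples = [(p1, t1), (p2, t2)]               # presented alternately
for k in range(60):
    p, t = samples[k % 2]
    e = t - W @ p                            # e(k) = t(k) - a(k)
    W = W + 2 * alpha * e * p                # W(k+1) = W(k) + 2 alpha e(k) p^T(k)
    if k < 3:
        print(k + 1, W)   # [0.4 -0.4 0.4], [0.96 0.16 -0.16], [1.104 0.016 -0.016]

print(W)                  # approaches W(infinity) = [1 0 0]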
Page 18: Adaptive Filtering

Tapped Delay Line / Adaptive Filter

[Figure: the input sequence y(k) passes through a chain of delays D, producing p_1(k) = y(k), p_2(k) = y(k - 1), ..., p_R(k) = y(k - R + 1). These feed an ADALINE with weights w_{1,1}, ..., w_{1,R} and bias b, giving a(k) = purelin(Wp(k) + b).]

a(k) = purelin(Wp + b) = \sum_{i=1}^{R} w_{1,i} y(k - i + 1) + b

Notes (Anthony S Maida):
The tapped delay line is a new building block: it takes an input sequence and remembers past values.
The ADALINE with a tapped delay line is a finite impulse response (FIR) filter with trainable weights.
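As a sketch of what the tapped-delay-line ADALINE computes (my code; the weights and input sequence are arbitrary), the filter output is a sliding dot product plus a bias:

import numpy as np

# Arbitrary illustrative filter: R = 3 taps plus a bias.
w = np.array([0.5, -0.25, 0.1])   # w_{1,1}, w_{1,2}, w_{1,3}
b = 0.05

def adaline_filter(y, w, b):
    """a(k) = sum_i w_{1,i} * y(k - i + 1) + b, treating samples before the start as zero."""
    R = len(w)
    a = np.zeros(len(y))
    for k in range(len(y)):
        for i in range(R):
            if k - i >= 0:            # y(k - i + 1) in the slide's 1-based indexing
                a[k] += w[i] * y[k - i]
        a[k] += b
    return a

y = np.sin(2 * np.pi * np.arange(20) / 8)   # made-up input sequence
print(adaline_filter(y, w, b))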
Page 19: Example: Noise Cancellation

[Figure: noise-cancellation block diagram. A 60-Hz noise source produces v, which passes through a noise path filter to become the contaminating noise m. The (random) EEG signal s plus m gives the contaminated signal t. The adaptive filter takes v as input and outputs a, the adaptively filtered noise used to cancel the contamination; the "error" e = t - a is the restored signal. The adaptive filter adjusts to minimize the error, and in doing so removes the 60-Hz noise from the contaminated signal.]

Notes (Anthony S Maida):
Use an ADALINE for the adaptive filter and train it.
The electroencephalogram (EEG) signal s is contaminated by 60-Hz noise; the noise v is the input to the filter.
We need to restore the original signal s. Training drives a toward m, so that e = t - a approaches s.
Page 20: Noise Cancellation Adaptive Filter

[Figure: condensed ADALINE for this example. The input v(k) and a one-step delay D feed the weights w_{1,1} and w_{1,2} and a summer Σ; there is no bias.]

a(k) = w_{1,1} v(k) + w_{1,2} v(k - 1)

Notes (Anthony S Maida):
Condensed diagram for this example: the filter remembers one step into the past and has no bias.
v is the noise signal, a 60-Hz sine wave sampled at 180 Hz.
Train so that a approaches m.
Page 21: Correlation Matrix

R = E[z z^T],      h = E[t z]

z(k) = [ v(k)
         v(k - 1) ]

t(k) = s(k) + m(k)

R = [ E[v^2(k)]          E[v(k) v(k - 1)]
      E[v(k - 1) v(k)]   E[v^2(k - 1)]    ]

h = [ E[(s(k) + m(k)) v(k)]
      E[(s(k) + m(k)) v(k - 1)] ]

Notes (Anthony S Maida):
From slide 5: find the input correlation matrix R and the input/target cross-correlation h, using expected values.
The input vector is z(k); the target t(k) is the current EEG signal s(k) plus the filtered noise m(k) from slide 19.
s: EEG signal; m: filtered noise.
Page 22: Signals

v(k) = 1.2 sin(2πk / 3)

E[v^2(k)] = (1.2)^2 (1/3) \sum_{k=1}^{3} ( sin(2πk/3) )^2 = (1.2)^2 (0.5) = 0.72

E[v^2(k - 1)] = E[v^2(k)] = 0.72

E[v(k) v(k - 1)] = (1/3) \sum_{k=1}^{3} ( 1.2 sin(2πk/3) ) ( 1.2 sin(2π(k - 1)/3) ) = (1.2)^2 (0.5) cos(2π/3) = -0.36

R = [  0.72  -0.36
      -0.36   0.72 ]

m(k) = 1.2 sin( 2πk/3 - 3π/4 )

Notes (Anthony S Maida):
Artificial data.
Page 23: Stationary Point

E[(s(k) + m(k)) v(k)] = E[s(k) v(k)] + E[m(k) v(k)]

E[(s(k) + m(k)) v(k - 1)] = E[s(k) v(k - 1)] + E[m(k) v(k - 1)]

Since the EEG signal s is uncorrelated with the noise v, E[s(k) v(k)] = 0 and E[s(k) v(k - 1)] = 0.

E[m(k) v(k)] = (1/3) \sum_{k=1}^{3} ( 1.2 sin(2πk/3 - 3π/4) ) ( 1.2 sin(2πk/3) ) = -0.51

E[m(k) v(k - 1)] = (1/3) \sum_{k=1}^{3} ( 1.2 sin(2πk/3 - 3π/4) ) ( 1.2 sin(2π(k - 1)/3) ) = 0.70

h = [ E[(s(k) + m(k)) v(k)]
      E[(s(k) + m(k)) v(k - 1)] ]  =  [ -0.51
                                         0.70 ]

x* = R^{-1} h = [  0.72  -0.36 ]^{-1} [ -0.51 ]  =  [ -0.30 ]
                [ -0.36   0.72 ]      [  0.70 ]     [  0.82 ]

Page 24: Performance Index

F(x) = c - 2 x^T h + x^T R x

c = E[t^2(k)] = E[(s(k) + m(k))^2]

c = E[s^2(k)] + 2 E[s(k) m(k)] + E[m^2(k)]

E[s^2(k)] = (1/0.4) \int_{-0.2}^{0.2} s^2 ds = (1 / (3(0.4))) s^3 |_{-0.2}^{0.2} = 0.0133

E[m^2(k)] = (1/3) \sum_{k=1}^{3} ( 1.2 sin(2πk/3 - 3π/4) )^2 = 0.72

Since s and m are uncorrelated, E[s(k) m(k)] = 0, so

c = 0.0133 + 0.72 = 0.7333

F(x*) = 0.7333 - 2(0.72) + 0.72 = 0.0133
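These hand calculations are easy to check numerically. The sketch below (my addition) averages over one period of the sampled sinusoids (k = 1, 2, 3) and uses the variance of the uniform EEG amplitude, reproducing R, h, x*, c, and F(x*):

import numpy as np

k = np.arange(1, 4)                          # one period of the sampled sinusoid: k = 1, 2, 3
v = 1.2 * np.sin(2 * np.pi * k / 3)          # noise source v(k)
v_prev = 1.2 * np.sin(2 * np.pi * (k - 1) / 3)
m = 1.2 * np.sin(2 * np.pi * k / 3 - 3 * np.pi / 4)   # filtered (contaminating) noise m(k)

# Input correlation matrix and cross-correlation (s is uncorrelated with v, so it drops out of h).
R = np.array([[np.mean(v * v),      np.mean(v * v_prev)],
              [np.mean(v_prev * v), np.mean(v_prev * v_prev)]])
h = np.array([np.mean(m * v), np.mean(m * v_prev)])

x_star = np.linalg.solve(R, h)

# c = E[t^2] = E[s^2] + E[m^2]; s is uniform on [-0.2, 0.2], so E[s^2] = 0.2^2 / 3.
c = 0.2**2 / 3 + np.mean(m * m)
F_star = c - 2 * x_star @ h + x_star @ R @ x_star

print(R)          # [[0.72 -0.36], [-0.36 0.72]]
print(h)          # approximately [-0.51, 0.70]
print(x_star)     # approximately [-0.30, 0.82]
print(c, F_star)  # approximately 0.7333 and 0.0133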

Page 25: LMS Response

[Figure: three plots. Left: the LMS weight trajectory in the (w_{1,1}, w_{1,2}) plane, jittering around the minimum point. Top right: original and restored EEG signals over 0 to 0.5 s. Bottom right: EEG signal minus restored signal over the same time interval.]

Notes (Anthony S Maida):
Assessing the performance, with α = 0.1. The plots illustrate how the filter adapts to cancel the noise; initially the restored signal is a poor approximation to the original signal.
The jitter around the minimum is caused by using the approximate (stochastic) gradient.
The bottom plot shows s - e; it does not stabilize at exactly 0 because of the jitter.
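A small simulation shows the same behavior (my code; the iteration count and the uniform stand-in for the EEG signal are assumptions): the two-weight ADALINE trained with LMS at α = 0.1 learns to subtract the 60-Hz contamination, and the restored signal e = t - a approaches the EEG signal s, with residual jitter.

import numpy as np

rng = np.random.default_rng(4)
alpha = 0.1
n_steps = 400                                  # assumed number of time steps

w = np.zeros(2)                                # [w_{1,1}, w_{1,2}], no bias
v_prev = 0.0
err_vs_s = []

for k in range(1, n_steps + 1):
    v = 1.2 * np.sin(2 * np.pi * k / 3)                    # 60-Hz noise sampled at 180 Hz
    m = 1.2 * np.sin(2 * np.pi * k / 3 - 3 * np.pi / 4)    # noise after the noise path filter
    s = rng.uniform(-0.2, 0.2)                             # stand-in for the random EEG signal
    t = s + m                                              # contaminated signal

    z = np.array([v, v_prev])                  # z(k) = [v(k); v(k-1)]
    a = w @ z                                  # filter output
    e = t - a                                  # "error" = restored signal
    w = w + 2 * alpha * e * z                  # LMS update
    v_prev = v
    err_vs_s.append(abs(e - s))                # distance of restored signal from the EEG

print(w)                                       # approaches x* = [-0.30, 0.82]
print(np.mean(err_vs_s[:50]), np.mean(err_vs_s[-50:]))   # residual shrinks but jitters above 0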
Page 26: Echo Cancellation

[Figure: echo-cancellation block diagram. Two phones are connected through hybrids and transmission lines; at each end an adaptive filter taps the signal arriving from the far end, and its output is subtracted (+/-) from the outgoing signal at the hybrid to cancel the echo.]

Notes (Anthony S Maida):
A larger application involving transmission over phone lines.