Widrow-Hoff Learning (LMS Algorithm)
Hagan, Demuth, Beale, De Jesus. Approximate steepest descent. Least mean square.
Slide annotations by Anthony S Maida.

Mar 18, 2021

Transcript
Page 1: Widrow-Hoff Learning (LMS Algorithm)

Widrow-Hoff Learning (LMS Algorithm)

Notes (Anthony S Maida):
Hagan, Demuth, Beale, De Jesus.
Approximate steepest descent.
Least mean square.
Page 2: ADALINE Network

[Figure: linear neuron layer. Input p (R x 1), weight matrix W (S x R), bias b (S x 1), net input n (S x 1), output a (S x 1); R inputs, S neurons; a = purelin(Wp + b).]

a = purelin(Wp + b) = Wp + b

a_i = purelin(n_i) = purelin(w_i^T p + b_i) = w_i^T p + b_i

w_i = [w_{i,1}, w_{i,2}, ..., w_{i,R}]^T

Notes (Anthony S Maida):
ADALINE = adaptive linear neuron.
A single-layer network with trainable linear output units.
w_i is the incoming weight vector to output unit i; a_i is the output of unit i.
Page 3: Two-Input ADALINE

[Figure: two-input neuron. Inputs p_1 and p_2 with weights w_{1,1} and w_{1,2}, bias b, summer Σ, output a = purelin(Wp + b).]

[Figure: decision boundary in the (p_1, p_2) plane. The line w_1^T p + b = 0 crosses the axes at p_1 = -b/w_{1,1} and p_2 = -b/w_{1,2}; a > 0 on one side and a < 0 on the other, with w_1 normal to the boundary.]

a = purelin(n) = purelin(w_1^T p + b) = w_1^T p + b

a = w_1^T p + b = w_{1,1} p_1 + w_{1,2} p_2 + b

Page 4: Mean Square Error

Training Set: {p_1, t_1}, {p_2, t_2}, ..., {p_Q, t_Q}
Input: p_q    Target: t_q

Notation:
x = [w_1; b],    z = [p; 1]
a = w_1^T p + b = x^T z

Mean Square Error:
F(x) = E[e^2] = E[(t - a)^2] = E[(t - x^T z)^2]

Notes (Anthony S Maida):
Labeled data: Q data items.
x holds the trainable parameters (the weights and the trainable bias); z is the input. Packing the bias into the weight vector simplifies notation.
E[.] is the expected value; F(x) is the expected squared error, i.e., (target - actual output) squared, in expectation.
Page 5: Error Analysis

F(x) = E[e^2] = E[(t - a)^2] = E[(t - x^T z)^2]

F(x) = E[t^2 - 2 t x^T z + x^T z z^T x]

F(x) = E[t^2] - 2 x^T E[t z] + x^T E[z z^T] x

F(x) = c - 2 x^T h + x^T R x,   where   c = E[t^2],   h = E[t z],   R = E[z z^T]

The mean square error for the ADALINE network is a quadratic function:

F(x) = c + d^T x + (1/2) x^T A x,   with   d = -2h,   A = 2R

Notes (Anthony S Maida):
Substitute the linear output and expand the quadratic; the expected value lets you break the sum into components. The goal is to minimize F(x).
z: inputs; x: trainable parameters.
h is the cross-correlation between input and target; R is the input correlation matrix. h and R can be difficult to calculate.
The standard form is another quadratic; the characteristics of the quadratic depend on A. Converting to standard form allows further analysis.
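As a concrete illustration of these statistics (my addition, not part of the original slides), here is a minimal NumPy sketch that estimates c, h, and R from a batch of labeled samples and evaluates F(x) = c - 2 x^T h + x^T R x. The data, dimensions, and variable names are made up for the example.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical labeled data: Q samples, R features each (assumption for illustration).
Q, R_feat = 1000, 3
P = rng.normal(size=(Q, R_feat))          # inputs p_q
t = P @ np.array([1.0, -2.0, 0.5]) + 0.3  # targets from a made-up linear rule

# Augment inputs: z = [p; 1] so the bias rides along with the weights.
Z = np.hstack([P, np.ones((Q, 1))])

# Sample estimates of the statistics on the slide.
c = np.mean(t**2)                  # c = E[t^2]
h = (Z * t[:, None]).mean(axis=0)  # h = E[t z]
Rmat = (Z.T @ Z) / Q               # R = E[z z^T]

def F(x):
    """Mean square error in quadratic form: F(x) = c - 2 x^T h + x^T R x."""
    return c - 2 * x @ h + x @ Rmat @ x

x = np.zeros(R_feat + 1)           # x = [w_1; b]
print(F(x))                        # equals E[t^2] when all parameters are zero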
Page 6: Stationary Point

∇F(x) = ∇( c + d^T x + (1/2) x^T A x ) = d + A x = -2h + 2Rx

Setting the gradient to zero:

-2h + 2Rx = 0,   i.e.,   Rx = h,   so   x* = R^{-1} h

Hessian Matrix:   A = 2R

The correlation matrix R must be at least positive semidefinite (only nonnegative eigenvalues). If there are any zero eigenvalues, the performance index will either have a weak minimum or else no stationary point; otherwise there will be a unique global minimum x*.

If R is positive definite (all eigenvalues > 0), x* = R^{-1} h is guaranteed to be the unique global minimum.

Notes (Anthony S Maida):
Quadratic analysis of F(x): take the gradient of the quadratic, set it to zero, and solve Rx = h for the global minimum.
R may be expensive to compute if the data set is large.
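To make the stationary-point calculation concrete, the following sketch (my addition, with an arbitrary made-up positive definite R and h) checks definiteness via the eigenvalues and solves Rx = h directly.

import numpy as np

# Hypothetical statistics for illustration (any positive definite R works).
R = np.array([[1.0, 0.2, 0.0],
              [0.2, 1.5, 0.1],
              [0.0, 0.1, 0.8]])   # input correlation matrix, R = E[z z^T]
h = np.array([0.4, -0.3, 0.1])    # cross-correlation, h = E[t z]

# R must be at least positive semidefinite; all eigenvalues > 0 means a unique minimum.
eigvals = np.linalg.eigvalsh(R)
assert np.all(eigvals > 0), "R is only semidefinite: weak minimum or no stationary point"

# Stationary point: solve R x = h rather than forming R^{-1} explicitly.
x_star = np.linalg.solve(R, h)

# Gradient of F at x*: -2h + 2 R x* should be numerically zero.
print(x_star, -2 * h + 2 * R @ x_star)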
Page 7: Approximate Steepest Descent

Approximate mean square error (one sample):

F̂(x) = ( t(k) - a(k) )^2 = e^2(k)

Approximate (stochastic) gradient:

∇̂F(x) = ∇e^2(k)

[∇e^2(k)]_j = ∂e^2(k) / ∂w_{1,j} = 2 e(k) ∂e(k) / ∂w_{1,j},   j = 1, 2, ..., R

[∇e^2(k)]_{R+1} = ∂e^2(k) / ∂b = 2 e(k) ∂e(k) / ∂b

Notes (Anthony S Maida):
Design an algorithm that locates the minimum point while avoiding the calculation of h and R.
The expectation of the squared error is replaced by the squared error at iteration k, where k indexes a trial. This is also called online or incremental learning because of the trial-by-trial update.
R is the number of input features; j varies over 1, ..., R, and component R+1 is the bias, so the gradient vector has a total of R+1 components.
The square comes down as the factor 2e(k) when taking the derivative.
Page 8: Approximate Gradient Calculation

∂e(k)/∂w_{1,j} = ∂[t(k) - a(k)]/∂w_{1,j} = ∂/∂w_{1,j} [ t(k) - ( w_1^T p(k) + b ) ]

∂e(k)/∂w_{1,j} = ∂/∂w_{1,j} [ t(k) - ( \sum_{i=1}^{R} w_{1,i} p_i(k) + b ) ]

∂e(k)/∂w_{1,j} = -p_j(k),      ∂e(k)/∂b = -1

∇̂F(x) = ∇e^2(k) = -2 e(k) z(k)

Notes (Anthony S Maida):
Calculate the partial derivatives of e = t - a; all steps involve substitutions.
Pack the result into vectorized form.
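A quick way to sanity-check the closed form -2 e(k) z(k) (my own illustration, with made-up numbers) is to compare it against a finite-difference gradient of e^2(k):

import numpy as np

rng = np.random.default_rng(1)

# One made-up training sample and parameter vector (assumptions for illustration).
p = rng.normal(size=4)            # input p(k)
t = 0.7                           # target t(k)
z = np.append(p, 1.0)             # z(k) = [p(k); 1]
x = rng.normal(size=5)            # x = [w_1; b]

def sq_err(x):
    """e^2(k) for a linear (ADALINE) unit: e = t - x^T z."""
    return (t - x @ z) ** 2

# Closed-form stochastic gradient from the slide: grad e^2(k) = -2 e(k) z(k).
e = t - x @ z
grad_closed = -2 * e * z

# Central finite differences as a numerical check.
eps = 1e-6
grad_fd = np.array([(sq_err(x + eps * d) - sq_err(x - eps * d)) / (2 * eps)
                    for d in np.eye(5)])

print(np.max(np.abs(grad_closed - grad_fd)))   # should be tiny (~1e-8 or smaller)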
Page 9: LMS Algorithm

x_{k+1} = x_k - α ∇F(x) |_{x = x_k}

x_{k+1} = x_k + 2 α e(k) z(k)

w_1(k+1) = w_1(k) + 2 α e(k) p(k)

b(k+1) = b(k) + 2 α e(k)

Notes (Anthony S Maida):
Stochastic (trial-by-trial) gradient descent, using the approximate gradient from the previous slide.
w_1 is the first-layer weight vector; the last two equations are the weight and bias updates.
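The update rule above is just a few lines of code. Here is a minimal sketch (mine, not from the slides) of single-neuron LMS in the combined x = [w_1; b], z = [p; 1] notation, trained on a made-up linear target:

import numpy as np

rng = np.random.default_rng(2)

# Made-up stream of samples from a linear target (assumption for illustration).
R_feat = 3
w_true, b_true = np.array([1.0, -2.0, 0.5]), 0.3

x = np.zeros(R_feat + 1)   # x = [w_1; b], initialized to zero
alpha = 0.05               # learning rate (must satisfy 0 < alpha < 1/lambda_max)

for k in range(2000):
    p = rng.normal(size=R_feat)
    t = w_true @ p + b_true
    z = np.append(p, 1.0)          # z(k) = [p(k); 1]
    e = t - x @ z                  # e(k) = t(k) - a(k), with a(k) = x^T z(k)
    x = x + 2 * alpha * e * z      # x_{k+1} = x_k + 2 alpha e(k) z(k)

print(x)   # should approach [1.0, -2.0, 0.5, 0.3]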
Page 10: Multiple-Neuron Case

w_i(k+1) = w_i(k) + 2 α e_i(k) p(k)

b_i(k+1) = b_i(k) + 2 α e_i(k)

Matrix Form:

W(k+1) = W(k) + 2 α e(k) p^T(k)

b(k+1) = b(k) + 2 α e(k)

Notes (Anthony S Maida):
Since each output unit is independent, simply convert to matrix form.
Initialize the weights to 0.
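For the multi-neuron case, the same update written in matrix form might look like the following sketch (my illustration; the S x R target mapping is invented):

import numpy as np

rng = np.random.default_rng(3)

S, R_feat = 2, 3                        # S output neurons, R inputs
W_true = rng.normal(size=(S, R_feat))   # made-up target mapping
b_true = rng.normal(size=S)

W = np.zeros((S, R_feat))               # initialize weights to 0
b = np.zeros(S)
alpha = 0.05

for k in range(3000):
    p = rng.normal(size=R_feat)
    t = W_true @ p + b_true
    a = W @ p + b                        # a = purelin(Wp + b)
    e = t - a                            # error vector, one entry per neuron
    W = W + 2 * alpha * np.outer(e, p)   # W(k+1) = W(k) + 2 alpha e(k) p^T(k)
    b = b + 2 * alpha * e                # b(k+1) = b(k) + 2 alpha e(k)

print(np.max(np.abs(W - W_true)), np.max(np.abs(b - b_true)))   # both near zero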
Page 11: Analysis of Convergence

x_{k+1} = x_k + 2 α e(k) z(k)

E[x_{k+1}] = E[x_k] + 2 α E[e(k) z(k)]

E[x_{k+1}] = E[x_k] + 2 α { E[t(k) z(k)] - E[(x_k^T z(k)) z(k)] }

E[x_{k+1}] = E[x_k] + 2 α { E[t(k) z(k)] - E[(z(k) z^T(k)) x_k] }

E[x_{k+1}] = E[x_k] + 2 α { h - R E[x_k] }      (assuming x_k is independent of z(k), since x_k depends only on earlier inputs)

E[x_{k+1}] = [I - 2 α R] E[x_k] + 2 α h

For stability, the eigenvalues of this matrix must fall inside the unit circle.

Notes (Anthony S Maida):
Stability analysis of the mean weight vector.
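The mean-weight recursion can be iterated directly. The small sketch below (my addition, using an arbitrary positive definite R and h, not values from the slides) shows convergence for a stable α and divergence once α exceeds 1/λ_max:

import numpy as np

# Arbitrary illustrative statistics (not from the slides).
R = np.array([[1.0, 0.3],
              [0.3, 0.5]])
h = np.array([0.2, -0.1])
lam_max = np.linalg.eigvalsh(R).max()

def iterate_mean(alpha, steps=200):
    """Iterate E[x_{k+1}] = (I - 2 alpha R) E[x_k] + 2 alpha h, starting from zero."""
    Ex = np.zeros(2)
    for _ in range(steps):
        Ex = (np.eye(2) - 2 * alpha * R) @ Ex + 2 * alpha * h
    return Ex

x_star = np.linalg.solve(R, h)
print(iterate_mean(0.5 / lam_max), x_star)        # stable: converges to x* = R^{-1} h
print(np.abs(iterate_mean(1.2 / lam_max)).max())  # unstable: blows up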
Page 12: Conditions for Stability

The eigenvalues of [I - 2αR] are

eig([I - 2αR]) = 1 - 2αλ_i      (where λ_i is an eigenvalue of R),

so stability requires |1 - 2αλ_i| < 1. Since λ_i > 0, 1 - 2αλ_i < 1 always holds, and the condition reduces to 1 - 2αλ_i > -1.

Therefore the stability condition simplifies to

α < 1/λ_i   for all i,   i.e.,   0 < α < 1/λ_max

Notes (Anthony S Maida):
Stability condition.
Page 13: Steady State Response

E[x_{k+1}] = [I - 2αR] E[x_k] + 2αh

If the system is stable, then a steady state condition will be reached:

E[x_ss] = [I - 2αR] E[x_ss] + 2αh

The solution to this equation is

E[x_ss] = R^{-1} h = x*

This is also the strong minimum of the performance index.

Page 14: Example

Banana:  p_1 = [-1; 1; -1],  t_1 = -1
Apple:   p_2 = [ 1; 1; -1],  t_2 = 1

R = E[p p^T] = (1/2) p_1 p_1^T + (1/2) p_2 p_2^T

R = (1/2) [-1; 1; -1] [-1 1 -1] + (1/2) [1; 1; -1] [1 1 -1]

R = [ 1  0  0
      0  1 -1
      0 -1  1 ]

λ_1 = 1.0,   λ_2 = 0.0,   λ_3 = 2.0

α < 1/λ_max = 1/2.0 = 0.5

Notes (Anthony S Maida):
Set up the apple/banana example and choose α to guarantee stability, from slide 12. Two trials per batch.
Compute the expectation to obtain the input correlation matrix (positive semidefinite), then find its eigenvalues; the largest eigenvalue is 2.0.
α must be < 0.5, but the following example chose 0.2.
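This calculation is easy to reproduce numerically; a short sketch (my addition) follows, using the banana/apple vectors from the slide:

import numpy as np

# Banana and apple prototype vectors and targets from the slide.
p1, t1 = np.array([-1.0, 1.0, -1.0]), -1.0
p2, t2 = np.array([ 1.0, 1.0, -1.0]),  1.0

# Input correlation matrix: R = (1/2) p1 p1^T + (1/2) p2 p2^T.
R = 0.5 * np.outer(p1, p1) + 0.5 * np.outer(p2, p2)
eigvals = np.linalg.eigvalsh(R)

print(R)                     # [[1 0 0], [0 1 -1], [0 -1 1]]
print(eigvals)               # 0.0, 1.0, 2.0
print(1.0 / eigvals.max())   # stable learning rates: 0 < alpha < 0.5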
Page 15: Iteration One

Banana:

a(0) = W(0) p(0) = W(0) p_1 = [0 0 0] [-1; 1; -1] = 0

e(0) = t(0) - a(0) = t_1 - a(0) = -1 - 0 = -1

W(1) = W(0) + 2 α e(0) p^T(0)

W(1) = [0 0 0] + 2 (0.2) (-1) [-1; 1; -1]^T = [0.4 -0.4 0.4]

Notes (Anthony S Maida):
Start training with the weights initialized to 0 and α = 0.2. The last line gives the new weights.
Page 16: Iteration Two

Apple:

a(1) = W(1) p(1) = W(1) p_2 = [0.4 -0.4 0.4] [1; 1; -1] = -0.4

e(1) = t(1) - a(1) = t_2 - a(1) = 1 - (-0.4) = 1.4

W(2) = [0.4 -0.4 0.4] + 2 (0.2) (1.4) [1; 1; -1]^T = [0.96 0.16 -0.16]

Page 17: Iteration Three

Banana again:

a(2) = W(2) p(2) = W(2) p_1 = [0.96 0.16 -0.16] [-1; 1; -1] = -0.64

e(2) = t(2) - a(2) = t_1 - a(2) = -1 - (-0.64) = -0.36

W(3) = W(2) + 2 α e(2) p^T(2) = [1.1040 0.0160 -0.0160]

W(∞) = [1 0 0]

Notes (Anthony S Maida):
Use the banana input again. The weights are gradually approaching the value reached after convergence, W(∞).
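The three hand iterations, and the limit W(∞), can be replayed in a few lines. This sketch is mine, alternating banana and apple presentations as the slides do:

import numpy as np

p1, t1 = np.array([-1.0, 1.0, -1.0]), -1.0   # banana
p2, t2 = np.array([ 1.0, 1.0, -1.0]),  1.0   # apple

W = np.zeros(3)     # weights initialized to 0; no bias in this example
alpha = 0.2

samples = [(p1, t1), (p2, t2)]               # presented alternately
for k in range(60):
    p, t = samples[k % 2]
    e = t - W @ p                            # e(k) = t(k) - a(k)
    W = W + 2 * alpha * e * p                # W(k+1) = W(k) + 2 alpha e(k) p^T(k)
    if k < 3:
        print(k + 1, W)   # [0.4 -0.4 0.4], [0.96 0.16 -0.16], [1.104 0.016 -0.016]

print(W)                  # approaches W(infinity) = [1 0 0]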
Page 18: Adaptive Filtering

Tapped Delay Line / Adaptive Filter

[Figure: the input sequence y(k) passes through a chain of delays D, producing p_1(k) = y(k), p_2(k) = y(k - 1), ..., p_R(k) = y(k - R + 1). These feed an ADALINE with weights w_{1,1}, ..., w_{1,R} and bias b, giving a(k) = purelin(Wp(k) + b).]

a(k) = purelin(Wp + b) = \sum_{i=1}^{R} w_{1,i} y(k - i + 1) + b

Notes (Anthony S Maida):
The tapped delay line is a new building block: it takes an input sequence and remembers past values.
The ADALINE with a tapped delay line is a finite impulse response (FIR) filter with trainable weights.
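As a sketch of what the tapped-delay-line ADALINE computes (my code; the weights and input sequence are arbitrary), the filter output is a sliding dot product plus a bias:

import numpy as np

# Arbitrary illustrative filter: R = 3 taps plus a bias.
w = np.array([0.5, -0.25, 0.1])   # w_{1,1}, w_{1,2}, w_{1,3}
b = 0.05

def adaline_filter(y, w, b):
    """a(k) = sum_i w_{1,i} * y(k - i + 1) + b, treating samples before the start as zero."""
    R = len(w)
    a = np.zeros(len(y))
    for k in range(len(y)):
        for i in range(R):
            if k - i >= 0:            # y(k - i + 1) in the slide's 1-based indexing
                a[k] += w[i] * y[k - i]
        a[k] += b
    return a

y = np.sin(2 * np.pi * np.arange(20) / 8)   # made-up input sequence
print(adaline_filter(y, w, b))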
Page 19: Example: Noise Cancellation

[Figure: noise-cancellation block diagram. A 60-Hz noise source produces v, which passes through a noise path filter to become the contaminating noise m. The (random) EEG signal s plus m gives the contaminated signal t. The adaptive filter takes v as input and outputs a, the adaptively filtered noise used to cancel the contamination; the "error" e = t - a is the restored signal. The adaptive filter adjusts to minimize the error, and in doing so removes the 60-Hz noise from the contaminated signal.]

Notes (Anthony S Maida):
Use an ADALINE for the adaptive filter and train it.
The electroencephalogram (EEG) signal s is contaminated by 60-Hz noise; the noise v is the input to the filter.
We need to restore the original signal s. Training drives a toward m, so that e = t - a approaches s.
Page 20: Noise Cancellation Adaptive Filter

[Figure: condensed ADALINE for this example. The input v(k) and a one-step delay D feed the weights w_{1,1} and w_{1,2} and a summer Σ; there is no bias.]

a(k) = w_{1,1} v(k) + w_{1,2} v(k - 1)

Notes (Anthony S Maida):
Condensed diagram for this example: the filter remembers one step into the past and has no bias.
v is the noise signal, a 60-Hz sine wave sampled at 180 Hz.
Train so that a approaches m.
Page 21: Correlation Matrix

R = E[z z^T],      h = E[t z]

z(k) = [ v(k)
         v(k - 1) ]

t(k) = s(k) + m(k)

R = [ E[v^2(k)]          E[v(k) v(k - 1)]
      E[v(k - 1) v(k)]   E[v^2(k - 1)]    ]

h = [ E[(s(k) + m(k)) v(k)]
      E[(s(k) + m(k)) v(k - 1)] ]

Notes (Anthony S Maida):
From slide 5: find the input correlation matrix R and the input/target cross-correlation h, using expected values.
The input vector is z(k); the target t(k) is the current EEG signal s(k) plus the filtered noise m(k) from slide 19.
s: EEG signal; m: filtered noise.
Page 22: Signals

v(k) = 1.2 sin(2πk / 3)

E[v^2(k)] = (1.2)^2 (1/3) \sum_{k=1}^{3} ( sin(2πk/3) )^2 = (1.2)^2 (0.5) = 0.72

E[v^2(k - 1)] = E[v^2(k)] = 0.72

E[v(k) v(k - 1)] = (1/3) \sum_{k=1}^{3} ( 1.2 sin(2πk/3) ) ( 1.2 sin(2π(k - 1)/3) ) = (1.2)^2 (0.5) cos(2π/3) = -0.36

R = [  0.72  -0.36
      -0.36   0.72 ]

m(k) = 1.2 sin( 2πk/3 - 3π/4 )

Notes (Anthony S Maida):
Artificial data.
Page 23: Stationary Point

E[(s(k) + m(k)) v(k)] = E[s(k) v(k)] + E[m(k) v(k)]

E[(s(k) + m(k)) v(k - 1)] = E[s(k) v(k - 1)] + E[m(k) v(k - 1)]

Since the EEG signal s is uncorrelated with the noise v, E[s(k) v(k)] = 0 and E[s(k) v(k - 1)] = 0.

E[m(k) v(k)] = (1/3) \sum_{k=1}^{3} ( 1.2 sin(2πk/3 - 3π/4) ) ( 1.2 sin(2πk/3) ) = -0.51

E[m(k) v(k - 1)] = (1/3) \sum_{k=1}^{3} ( 1.2 sin(2πk/3 - 3π/4) ) ( 1.2 sin(2π(k - 1)/3) ) = 0.70

h = [ E[(s(k) + m(k)) v(k)]
      E[(s(k) + m(k)) v(k - 1)] ]  =  [ -0.51
                                         0.70 ]

x* = R^{-1} h = [  0.72  -0.36 ]^{-1} [ -0.51 ]  =  [ -0.30 ]
                [ -0.36   0.72 ]      [  0.70 ]     [  0.82 ]

Page 24: Performance Index

F(x) = c - 2 x^T h + x^T R x

c = E[t^2(k)] = E[(s(k) + m(k))^2]

c = E[s^2(k)] + 2 E[s(k) m(k)] + E[m^2(k)]

E[s^2(k)] = (1/0.4) \int_{-0.2}^{0.2} s^2 ds = (1 / (3(0.4))) s^3 |_{-0.2}^{0.2} = 0.0133

E[m^2(k)] = (1/3) \sum_{k=1}^{3} ( 1.2 sin(2πk/3 - 3π/4) )^2 = 0.72

Since s and m are uncorrelated, E[s(k) m(k)] = 0, so

c = 0.0133 + 0.72 = 0.7333

F(x*) = 0.7333 - 2(0.72) + 0.72 = 0.0133
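These hand calculations are easy to check numerically. The sketch below (my addition) averages over one period of the sampled sinusoids (k = 1, 2, 3) and uses the variance of the uniform EEG amplitude, reproducing R, h, x*, c, and F(x*):

import numpy as np

k = np.arange(1, 4)                          # one period of the sampled sinusoid: k = 1, 2, 3
v = 1.2 * np.sin(2 * np.pi * k / 3)          # noise source v(k)
v_prev = 1.2 * np.sin(2 * np.pi * (k - 1) / 3)
m = 1.2 * np.sin(2 * np.pi * k / 3 - 3 * np.pi / 4)   # filtered (contaminating) noise m(k)

# Input correlation matrix and cross-correlation (s is uncorrelated with v, so it drops out of h).
R = np.array([[np.mean(v * v),      np.mean(v * v_prev)],
              [np.mean(v_prev * v), np.mean(v_prev * v_prev)]])
h = np.array([np.mean(m * v), np.mean(m * v_prev)])

x_star = np.linalg.solve(R, h)

# c = E[t^2] = E[s^2] + E[m^2]; s is uniform on [-0.2, 0.2], so E[s^2] = 0.2^2 / 3.
c = 0.2**2 / 3 + np.mean(m * m)
F_star = c - 2 * x_star @ h + x_star @ R @ x_star

print(R)          # [[0.72 -0.36], [-0.36 0.72]]
print(h)          # approximately [-0.51, 0.70]
print(x_star)     # approximately [-0.30, 0.82]
print(c, F_star)  # approximately 0.7333 and 0.0133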

Page 25: LMS Response

[Figure: three plots. Left: the LMS weight trajectory in the (w_{1,1}, w_{1,2}) plane, jittering around the minimum point. Top right: original and restored EEG signals over 0 to 0.5 s. Bottom right: EEG signal minus restored signal over the same time interval.]

Notes (Anthony S Maida):
Assessing the performance, with α = 0.1. The plots illustrate how the filter adapts to cancel the noise; initially the restored signal is a poor approximation to the original signal.
The jitter around the minimum is caused by using the approximate (stochastic) gradient.
The bottom plot shows s - e; it does not stabilize at exactly 0 because of the jitter.
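A small simulation shows the same behavior (my code; the iteration count and the uniform stand-in for the EEG signal are assumptions): the two-weight ADALINE trained with LMS at α = 0.1 learns to subtract the 60-Hz contamination, and the restored signal e = t - a approaches the EEG signal s, with residual jitter.

import numpy as np

rng = np.random.default_rng(4)
alpha = 0.1
n_steps = 400                                  # assumed number of time steps

w = np.zeros(2)                                # [w_{1,1}, w_{1,2}], no bias
v_prev = 0.0
err_vs_s = []

for k in range(1, n_steps + 1):
    v = 1.2 * np.sin(2 * np.pi * k / 3)                    # 60-Hz noise sampled at 180 Hz
    m = 1.2 * np.sin(2 * np.pi * k / 3 - 3 * np.pi / 4)    # noise after the noise path filter
    s = rng.uniform(-0.2, 0.2)                             # stand-in for the random EEG signal
    t = s + m                                              # contaminated signal

    z = np.array([v, v_prev])                  # z(k) = [v(k); v(k-1)]
    a = w @ z                                  # filter output
    e = t - a                                  # "error" = restored signal
    w = w + 2 * alpha * e * z                  # LMS update
    v_prev = v
    err_vs_s.append(abs(e - s))                # distance of restored signal from the EEG

print(w)                                       # approaches x* = [-0.30, 0.82]
print(np.mean(err_vs_s[:50]), np.mean(err_vs_s[-50:]))   # residual shrinks but jitters above 0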
Page 26: Echo Cancellation

[Figure: echo-cancellation block diagram. Two phones are connected through hybrids and transmission lines; at each end an adaptive filter taps the signal arriving from the far end, and its output is subtracted (+/-) from the outgoing signal at the hybrid to cancel the echo.]

Notes (Anthony S Maida):
A larger application involving transmission over phone lines.