Discrete Denoising with Shifts

Taesup Moon
Yahoo! Labs

EE477 Guest Lecture, November 10, 2011
Outline

1. Prediction with Experts' Advice
2. Discrete Denoising with Shifts
   - Recap of DUDE
   - Motivation
   - New algorithm: S-DUDE
   - Results
Recap of DUDE

Discrete denoising

- X_t, Z_t, X̂_t take values in finite alphabets
- Choose X̂_1^n as close as possible to X_1^n, based on the entire Z_1^n
- Ex) text correction, image denoising, DNA sequence analysis, etc.
- Performance metric: per-symbol average loss
DUDE is the first universal discrete denoiser

DUDE [Weissman et al. '05]: for each location t to be denoised,

1. fix the window size k
2. find the left k-context (l_1, ..., l_k) and right k-context (r_1, ..., r_k) of z_t:

   l_1 l_2 ... l_k  z_t  r_1 r_2 ... r_k

3. count all occurrences of symbols in z^n appearing with the same context
4. decide on x̂_t according to

   x̂_t(z_{t-k}^{t+k}) = simple rule(Π, Λ, count vector[z^n, z_{t-k}^{t-1}, z_{t+1}^{t+k}], z_t)

Whenever DUDE sees z_{t-k}^{t-1} z_t z_{t+1}^{t+k}, it makes the same decision for z_t:
DUDE is a "sliding window" denoiser
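Steps 2 and 3 of the first pass (gathering a count vector per two-sided context) can be sketched as follows. This is a minimal illustration of the counting only; the "simple rule" of step 4 is left abstract here.

```python
from collections import Counter, defaultdict

def context_counts(z, k):
    """For each (left, right) k-context in z, count how often each middle
    symbol occurs (steps 2-3 of DUDE's first pass)."""
    counts = defaultdict(Counter)
    for t in range(k, len(z) - k):
        left = tuple(z[t - k:t])
        right = tuple(z[t + 1:t + k + 1])
        counts[(left, right)][z[t]] += 1
    return counts

z = [0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0]
m = context_counts(z, k=1)
print(m[((0,), (1,))])   # middle-symbol counts for the context 0 . 1
```

One pass over z^n fills every context's count vector simultaneously, which is why DUDE runs in time linear in n.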
Ex 1: a stationary bit stream gets corrupted

X^n : 00000011111110000000000111111111100000001111111110000
Z^n : 00100011101110010001000111110111100000011110111110001

- source: symmetric binary Markov chain with transition probability p = 0.1, sequence length n = 10^6
  [two-state diagram: stay in 0 or 1 w.p. 1 − p, flip w.p. p]
- noise: BSC(δ = 0.1)
  [channel diagram: each bit flipped independently w.p. δ]

⇒ the optimal BER is attained by the forward-backward recursion
DUDE achieves the optimal BER as the window size grows

[Plot: bit error rate vs. window size k: Bayes optimum = 0.558, DUDE = 0.561]

Window size k is a design parameter for a given sequence length n
DUDE attains the optimum performance for stationary sources

For a denoiser X̂^n = {X̂_t(z^n)}_{t=1}^n,

   L_{X̂^n}(x^n, z^n) = (1/n) Σ_{t=1}^n Λ(x_t, X̂_t(z^n))

is the performance measure
Main results of DUDE: when k = k_n < ⌈(1/2) log_{|Z|} n⌉,

1. For any stationary process X,

      lim_{n→∞} [ E( L_{X̂^n_DUDE}(X^n, Z^n) ) − min_{X̂^n ∈ D_n} E( L_{X̂^n}(X^n, Z^n) ) ] = 0

   - D_n is the set of all denoisers in the world
   - DUDE attains the Bayes optimal performance

2. For all x ∈ X^∞,

      lim_{n→∞} [ L_{X̂^n_DUDE}(x^n, Z^n) − D_k(x^n, Z^n) ] = 0   w.p. 1

   - D_k(x^n, z^n): the best performance among S_k, the k-th order sliding window denoisers
   - DUDE is as good as the best sliding window denoiser
Motivation

Ex 2: a piecewise stationary bit stream gets corrupted

X^n : 00000011111110000000000111111101100011011011011010110
Z^n : 00100011101110010001000111110101100011111011010010100

- source: binary Markov chain whose transition probability changes from p_1 = 0.01 to p_2 = 0.2 at t* = n/2
- noise: BSC(δ = 0.1)

⇒ the optimal BER is attained by the forward-backward recursion
Does DUDE achieve the optimal BER?

[Plot: bit error rate vs. window size k: Bayes optimum = 0.487, DUDE = 0.574 (+18%)]

- DUDE applies the same rule "regardless of the location"
- DUDE has a limitation for time- (space-) varying sources
In practice, many sources are time- (space-) varying

- text : English → Spanish → German → ...
- voice : [audio example]
- image : [image example]
New algorithm: S-DUDE

Can we do better than the DUDE when the source varies?

Questions
1. Can we perform as if we knew the source, including its change points?
2. If so, can we do it efficiently?

Answers
1. Yes. S-DUDE can do essentially as well as if it knew the source and its change points
2. Yes. S-DUDE is a linear-complexity algorithm

[Moon and Weissman, IEEE Trans. Info. Theory, Nov '09]
Take a closer look at the binary example

Binary alphabet, BSC(δ). Suppose DUDE with window size k = 3 decided as follows:

   z_{t-3}^{t+3} = 0100110 → x̂_t = 0,     z_{t-3}^{t+3} = 0101110 → x̂_t = 1

- the context 010 • 110 defined a "say-what-you-see" mapping in the middle
- DUDE employs the same mapping whenever it sees 010 • 110

Only 4 single-letter mappings exist in the binary case:
"say-what-you-see", "flip-what-you-see", "always-say-0", "always-say-1"

DUDE counts n_0 and n_1, the occurrences of 0 and 1 in the context 010 • 110, and picks:
- if n_0 ≈ n_1 → "say-what-you-see"
- if n_0 ≫ n_1 → "always-say-0"
- if n_0 ≪ n_1 → "always-say-1"
(the threshold depends on δ)
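The count-vs-threshold behavior above is an instance of DUDE's general single-site rule, x̂ = argmin_{x̂} mᵀ Π^{−T} (λ_{x̂} ⊙ π_{z_t}), where m is the context count vector and π_z, λ_{x̂} are columns of Π and Λ. A numeric sketch for the BSC(0.1)/Hamming-loss case; the counts n_0 = 98, n_1 = 2 are made up for illustration:

```python
import numpy as np

delta = 0.1
Pi = np.array([[1 - delta, delta],      # BSC channel matrix, Pi[x, z]
               [delta, 1 - delta]])
Lam = np.array([[0.0, 1.0],             # Hamming loss, Lam[x, xhat]
                [1.0, 0.0]])

def dude_rule(m, z):
    """xhat = argmin_xhat  (Pi^{-1} m) . (Lam[:, xhat] * Pi[:, z])."""
    q = np.linalg.solve(Pi, m)          # estimated clean-symbol counts
    scores = [q @ (Lam[:, xhat] * Pi[:, z]) for xhat in range(2)]
    return int(np.argmin(scores))

# heavily skewed counts: the "always-say-0" regime, even an observed 1 is flipped
print(dude_rule(np.array([98.0, 2.0]), z=1))   # -> 0
# balanced counts: the "say-what-you-see" regime
print(dude_rule(np.array([50.0, 50.0]), z=1))  # -> 1
```

Note that Π^{−1} m corrects the observed counts for the channel before comparing expected losses, which is what makes the threshold depend on δ.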
Employing shifting single-letter mappings will be helpful

Suppose the 0's and 1's observed at context 010 • 110 looked like

   00001000110000 11111111011101
      (all-0)        (all-1)

- shifting "always-say-0" → "always-say-1" may be better than the fixed "say-what-you-see"
- generally, if single-letter mappings have some freedom to shift, they can attain smaller loss
- how can we decide when to shift, and to what?
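A quick numeric check of the claim, assuming (for illustration only) that the underlying clean bits at this context really were a block of 0's followed by a block of 1's:

```python
z = [int(c) for c in "00001000110000" + "11111111011101"]  # observed middle bits
x = [0] * 14 + [1] * 14    # assumed clean bits: all-0 block, then all-1 block

# fixed "say-what-you-see" outputs z itself
swys_loss = sum(zt != xt for zt, xt in zip(z, x))
# one shift: "always-say-0" on the first block, "always-say-1" on the second
shift_loss = sum((0 if t < 14 else 1) != x[t] for t in range(28))

print(swys_loss, shift_loss)   # 5 0
```

Under this assumption, the fixed mapping pays for every channel flip (5 errors out of 28), while a single well-placed shift pays nothing.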
S^n_m is a class of shifting single-letter mappings

- Ideally, shifting every time to the correct mapping would be the best
  - equivalent to knowing the source sequence ⇒ impossible!
- We limit the number of shifts to m
- S^n_m : the class of single-letter mapping sequences {s_1, ..., s_n} that shift at most m times over a sequence of length n, e.g.,

     z^n :  [ swys | all-0 | all-1 | swys ]

- |S^n_m| ≤ (n choose m) · |S|^m, where |S| = |X|^{|Z|} is the number of single-letter mappings
- Deciding when to shift to what, m times ⇔ selecting the best combination in S^n_m
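The bound shows why limiting shifts keeps the class learnable: its log-size grows only like m log n, in contrast to the |S|^n unrestricted sequences. A quick check for the binary case, where |S| = 2^2 = 4:

```python
from math import comb, log2

n, m = 1000, 3
S = 2 ** 2                        # |S| = |X|^|Z| = 4 single-letter mappings
bound = comb(n, m) * S ** m       # |S^n_m| <= C(n, m) * |S|^m
# log-size ~ m*(log n + log |S|), tiny compared to n*log|S| for unrestricted sequences
print(bound, round(log2(bound), 1))
```

For n = 1000 and m = 3 the class is of order 10^10, i.e. about 33 bits of "model description", versus 2000 bits for fully unrestricted mapping sequences.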
The key tool is to devise an estimate of the loss Λ

Focus on the single-letter setting (s(·) : Z → X):   x → Z → X̂ = s(Z)

- Λ(x, s(Z)) : the loss between x and s(Z) (not observable)
- But, from the knowledge of Π, we can devise ℓ(Z, s) such that

      E_x( ℓ(Z, s) ) = E_x( Λ(x, s(Z)) )

- ℓ(Z, s) is an unbiased estimate of E_x( Λ(x, s(Z)) )
- ℓ(Z, s) : a "loss" between Z and s(·) (observable)

[Weissman et al., "Universal filtering via prediction", IEEE Trans. Info. Theory '07]
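One standard way to construct such an ℓ (a sketch of the inverse-channel construction; the exact form in the cited paper may differ in details): let h_s(x) = Σ_z Π(x, z) Λ(x, s(z)) be the conditional expected true loss, and set ℓ(·, s) = Π^{−1} h_s. Then Σ_z Π(x, z) ℓ(z, s) = h_s(x) for every x, which is exactly the unbiasedness requirement:

```python
import numpy as np

delta = 0.1
Pi = np.array([[1 - delta, delta],        # BSC: Pi[x, z] = P(Z = z | X = x)
               [delta, 1 - delta]])
Lam = np.array([[0.0, 1.0],               # Hamming loss, Lam[x, xhat]
                [1.0, 0.0]])

def estimated_loss(s):
    """ell(., s) = Pi^{-1} h_s, so that E[ell(Z, s) | x] = E[Lam(x, s(Z)) | x]."""
    h = np.array([sum(Pi[x, z] * Lam[x, s[z]] for z in range(2)) for x in range(2)])
    return np.linalg.solve(Pi, h)         # vector indexed by the observed z

for s in ([0, 0], [1, 1], [0, 1], [1, 0]):     # all 4 single-letter mappings
    ell = estimated_loss(s)
    for x in range(2):                         # unbiasedness check, for each x
        true = sum(Pi[x, z] * Lam[x, s[z]] for z in range(2))
        assert abs(Pi[x] @ ell - true) < 1e-12

# ell need not be a "real" loss, e.g. for "always-say-0" one entry is negative
print(estimated_loss([0, 0]))   # approx [-0.125, 1.125]
```

Unbiasedness is what lets S-DUDE compare mapping sequences using only the observed z^n: minimizing Σ ℓ(z_t, s_t) is, in expectation, minimizing the true cumulative loss.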
S-DUDE is defined by minimizing the sum of the estimated losses

For each context c (e.g., 010 • 110), S-DUDE finds

   Ŝ ≜ arg min_{S ∈ S^{n_c}_m} Σ_{i ∈ context c} ℓ(z_i, s_i)

(versus the unattainable arg min_{S ∈ S^{n_c}_m} Σ_{i ∈ context c} Λ(x_i, s_i(z_i)))
and applies the resulting mappings

Question: how can we obtain Ŝ = {ŝ_1, ..., ŝ_{n_c}} ∈ S^{n_c}_m efficiently?
S-DUDE can be implemented with a two-pass algorithm

Again the binary, BSC(δ) example.

Problem: find the best {s_1, ..., s_n} ∈ S^n_m that minimizes Σ_{t=1}^n ℓ(z_t, s_t),
where s_i ∈ {all-0, all-1, swys, fwys}

To solve:
1. allocate M_t ∈ R^{m×4} for each 1 ≤ t ≤ n
2. first pass: scan (z_1, ..., z_n) and update {M_t}_{t=1}^n by dynamic programming
3. second pass: from M_n, extract the best {ŝ_1, ..., ŝ_n} by a backward recursion
M_t stores the minimum sum of estimated losses up to t

Each element of M_t (rows: number of shifts i ≤ m; columns: all-0, all-1, swys, fwys) is
defined as the minimum sum of estimated losses up to time t, e.g.,

   M_t(i, swys) = min_{{s_1,...,s_t} ∈ S^t_i} { ℓ(z_t, s_t = swys) + Σ_{r=1}^{t−1} ℓ(z_r, s_r) }
First pass uses dynamic programming

Only two possible cases can attain M_t(i, swys):
1. the i-th shift occurred at t :      min_{1≤j≤|S|} M_{t−1}(i−1, j) + ℓ(z_t, swys)
2. the i-th shift occurred before t :  M_{t−1}(i, swys) + ℓ(z_t, swys)
Combining the two cases,

   M_t(i, swys) = ℓ(z_t, swys) + min{ M_{t−1}(i, swys), min_{1≤j≤|S|} M_{t−1}(i−1, j) },

and the same recursion holds for all the other elements
Second pass extracts Ŝ and denoises

When t = n:

   ŝ_n = arg min_{j ∈ {all-0, all-1, swys, fwys}} M_n(m, j),   x̂_n = ŝ_n(z_n)

(the row M_n(m, ·) holds min_{S ∈ S^n_m} Σ_{t=1}^n ℓ(z_t, s_t))

For t = n−1, ..., 1 : follow the optimal path backward and denoise!
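The two passes can be sketched end-to-end as follows. This is a minimal illustrative implementation, with ℓ supplied as a precomputed table ell[t][j]; the array layout and function name are my own choices, not the paper's:

```python
import numpy as np

INF = float("inf")

def s_dude_path(ell, m):
    """First pass: M[t, i, j] = min total estimated loss over z_1..z_t using
    exactly i shifts and mapping j at time t.  Second pass: backtrack the
    optimal mapping sequence with at most m shifts."""
    ell = np.asarray(ell, dtype=float)
    n, S = ell.shape
    M = np.full((n, m + 1, S), INF)
    M[0, 0, :] = ell[0]                          # the initial choice is not a shift
    for t in range(1, n):
        for i in range(m + 1):
            for j in range(S):
                stay = M[t - 1, i, j]            # i-th shift happened before t
                jump = M[t - 1, i - 1].min() if i > 0 else INF   # i-th shift at t
                M[t, i, j] = ell[t, j] + min(stay, jump)
    # second pass: follow the optimal path backward
    i, j = np.unravel_index(np.argmin(M[n - 1]), (m + 1, S))
    path = [int(j)]
    for t in range(n - 1, 0, -1):
        stay = M[t - 1, i, j]
        jump = M[t - 1, i - 1].min() if i > 0 else INF
        if jump < stay:                          # the shift to mapping j happened at t
            i -= 1
            j = int(np.argmin(M[t - 1, i]))
        path.append(int(j))
    path.reverse()
    return path, float(M[n - 1].min())

# mapping 0 is cheap on the first half, mapping 1 on the second: one shift suffices
path, cost = s_dude_path([[0, 1], [0, 1], [1, 0], [1, 0]], m=1)
print(path, cost)   # [0, 0, 1, 1] 0.0
```

Each time step touches O(m|S|) table entries, matching the linear-in-n and linear-in-m complexity claimed on the next slide.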
The complexity of S-DUDE is linear in n and m

Complexity
- space : O(mn|Z|^{2k})
- time : O(mn|Z|^{2k})
- practical
Discrete Denoising with Shifts New algorithm: S-DUDE
The complexity of S-DUDE is linear in nand m
Complexityspace : O(mn|Z|2k)time : O(mn|Z|2k)
practical
Taesup Moon (Yahoo! Labs) EE477 Guest Lecture Nov 10, 2011 19 / 24
Discrete Denoising with Shifts New algorithm: S-DUDE
The complexity of S-DUDE is linear in nand m
Complexityspace : O(mn|Z|2k)time : O(mn|Z|2k)practical
Taesup Moon (Yahoo! Labs) EE477 Guest Lecture Nov 10, 2011 19 / 24
Discrete Denoising with Shifts New algorithm: S-DUDE
Summary of S-DUDE
S-DUDE (Shifting DUDE)
For location t to be denoised, do :
1 fix the window size k, set the number of shifts m
2 find the left k-context (ℓ1, . . . , ℓk) and right k-context (r1, . . . , rk) of zt
ℓ1 ℓ2 · · · ℓk zt r1 r2 · · · rk
3 over all positions that share the same context c with zt,
find S = arg min_{S ∈ S_m^{n_c}} Σ_{t ∈ context c} ℓ(z_t, s_t)
4 decide on xt according to
xt = st(zt), where st(·) comes from S
We can also show that if we set m = 0, S-DUDE coincides with DUDE
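As in DUDE, step 3 above operates separately on each group of positions that share a two-sided k-context. A small sketch of that grouping (the function name and the tuple representation of a context are my own, not from the lecture):

```python
from collections import defaultdict

def context_groups(z, k):
    """Group positions t by their two-sided k-context (l_1..l_k, r_1..r_k).

    The per-context optimization of step 3 runs independently on each
    such group of positions.
    """
    groups = defaultdict(list)
    for t in range(k, len(z) - k):  # positions with a full two-sided context
        c = (tuple(z[t - k:t]), tuple(z[t + 1:t + 1 + k]))
        groups[c].append(t)
    return groups
```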
Discrete Denoising with Shifts Results
S-DUDE achieves the optimum loss for time- (space-) varying sources
When k = k_n < (1/2) log_{|Z|} n :

Theorem 1 (stochastic setting)
For all piecewise stationary processes X,
lim_{n→∞} [ E( L_{X̂^n_{S-DUDE}}(X^n, Z^n) ) − min_{X̂^n ∈ D_n} E( L_{X̂^n}(X^n, Z^n) ) ] = 0,
provided that the number of stationary segments is m = o(n) w.p. 1

Theorem 2 (individual sequence setting)
When m = o(n), for all x ∈ X^∞,
lim_{n→∞} [ L_{X̂^n_{S-DUDE}}(x^n, Z^n) − D_{k,m}(x^n, Z^n) ] = 0 w.p. 1,
where D_{k,m}(x^n, z^n) is the best performance attained by k-th order sliding-window denoisers that can shift at most m times
Discrete Denoising with Shifts Results
No denoiser is better than S-DUDE
Strong converse
If m = Θ(n), no denoiser can achieve the guarantees of the previous theorems.
⇒ m = o(n) is a necessary and sufficient condition for the previous theorems!
Discrete Denoising with Shifts Results
Ex 2 : piecewise stationary bit stream (revisited)
Xn : 00000011111110000000000111111111100000001111111110000
Zn : 00100011101110010001000111110111100000011110111110001
source : binary Markov chain whose transition probability switches from p1 = 0.01 to p2 = 0.2 at t* = n/2
[Figure: two-state Markov chain with transition probability p and self-loop probability 1 − p]
noise : BSC that flips bits with probability δ = 0.1
[Figure: binary symmetric channel with crossover probability δ]
⇒ optimal BER attained by the Forward-Backward Recursion
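A quick way to reproduce the setup of Ex 2 is to simulate the switching source and the channel. This sketch assumes a symmetric two-state chain (flip probability p, matching the figure) and uses a hypothetical function name of my own:

```python
import random

def piecewise_markov_bsc(n, p1, p2, delta, seed=0):
    """Simulate Ex 2: a symmetric binary Markov chain whose transition
    probability switches from p1 to p2 at t* = n/2, observed through a
    BSC(delta) that flips each bit independently with probability delta."""
    rng = random.Random(seed)
    x, state = [], 0
    for t in range(n):
        p = p1 if t < n // 2 else p2
        if rng.random() < p:   # leave the current state w.p. p
            state = 1 - state
        x.append(state)
    z = [b ^ (rng.random() < delta) for b in x]  # BSC noise
    return x, z
```

The first half of the clean sequence has long runs (few transitions at p1 = 0.01) while the second half switches roughly twenty times as often, which is what makes a single sliding-window rule a poor fit for the whole sequence.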
Discrete Denoising with Shifts Results
Can S-DUDE achieve the Bayes optimal performance?
[Figure: bit error rate plotted against window size k — Bayes Optimum = 0.487, DUDE = 0.574, S-DUDE (m = 1) = 0.498, i.e. within 2.3% of the Bayes optimum]
⇒ m can be regarded as another design parameter in devising a discrete denoiser