WITMSE 2013

The MDL principle for arbitrary data: either discrete or continuous or none of them

Joe Suzuki

Osaka University

WITMSE 2013, Sanjo-Kaikan, University of Tokyo, Japan

August 26, 2013

Road Map

1 Problem
2 The Ryabko measure
3 The Radon-Nikodym theorem
4 Generalization
5 Universal Histogram Sequence
6 Conclusion


The slides of this talk are available online.

keywords: Joe Suzuki slideshare

http://www.slideshare.net/prof-joe/


Problem

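Given $\{(x_i, y_i)\}_{i=1}^{n}$, identify whether $X \perp\!\!\!\perp Y$ or not

$A, B$: finite sets

$x^n = (x_1, \cdots, x_n) \in A^n$, $y^n = (y_1, \cdots, y_n) \in B^n$

$P^n(x^n|\theta)$, $P^n(y^n|\theta)$, $P^n(x^n, y^n|\theta)$: expressed by parameter $\theta$

$p$: the prior probability of $X \perp\!\!\!\perp Y$

Bayesian solution

$X \perp\!\!\!\perp Y \iff p\, Q^n(x^n) Q^n(y^n) \geq (1 - p)\, Q^n(x^n, y^n)$

$Q^n(x^n) := \int P^n(x^n|\theta) w(\theta) d\theta$, $Q^n(y^n) := \int P^n(y^n|\theta) w(\theta) d\theta$

$Q^n(x^n, y^n) := \int P^n(x^n, y^n|\theta) w(\theta) d\theta$

using a weight $w$ over $\theta$.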
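
The decision rule above can be sketched in a few lines. This is only an illustration under stated assumptions: the weight $w$ is taken to be the Dirichlet(1/2, ..., 1/2) prior (the Krichevsky-Trofimov choice), the data are i.i.d., and the names `kt_log_prob` and `independent` are hypothetical helpers, not part of the talk.

```python
from collections import Counter
from math import lgamma, log


def kt_log_prob(symbols, m):
    """log Q^n(x^n) under a Dirichlet(1/2, ..., 1/2) weight, alphabet size m (assumed w)."""
    n = len(symbols)
    out = lgamma(m / 2) - lgamma(n + m / 2)
    for c in Counter(symbols).values():
        out += lgamma(c + 0.5) - lgamma(0.5)
    return out


def independent(xs, ys, size_a, size_b, p=0.5):
    """True when p Q(x^n) Q(y^n) >= (1 - p) Q(x^n, y^n), compared in the log domain."""
    lhs = log(p) + kt_log_prob(xs, size_a) + kt_log_prob(ys, size_b)
    rhs = log(1 - p) + kt_log_prob(list(zip(xs, ys)), size_a * size_b)
    return lhs >= rhs


# Every (x, y) pair equally frequent: the rule should prefer independence.
balanced = independent([0, 0, 1, 1] * 25, [0, 1, 0, 1] * 25, 2, 2)
# y is a copy of x: the rule should reject independence.
copied = independent([0, 1] * 50, [0, 1] * 50, 2, 2)
```

Working with logarithms avoids underflow, since $Q^n(x^n)$ shrinks exponentially in $n$.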


Q should be an alternative to P as n grows

A: the finite set in which X takes values.

Q is a Bayesian measure

Kraft's inequality: $\sum_{x^n \in A^n} Q^n(x^n) \leq 1$  (1)

For example, $Q^n(x^n) = |A|^{-n}$, $x^n \in A^n$, satisfies (1) but does not converge to $P^n$ in any sense.
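
A short worked calculation makes the failure of the uniform measure concrete. It assumes, for illustration only, that $P$ is i.i.d. with entropy $H(P)$, so that the law of large numbers applies:

```latex
\frac{1}{n}\log\frac{P^n(x^n)}{Q^n(x^n)}
  = \frac{1}{n}\log P^n(x^n) + \log|A|
  \;\longrightarrow\; \log|A| - H(P) \;\geq\; 0 ,
```

with equality only when $P$ is itself uniform on $A$. For every other i.i.d. $P$ the per-symbol log ratio stays bounded away from zero.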


Universal Bayesian Measures

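$Q^n(x^n) := \int P^n(x^n|\theta) w(\theta) d\theta$

$w(\theta) \propto \prod_{x \in A} \theta[x]^{-a[x]}$ with $\{a[x] = 1/2\}_{x \in A}$ (Krichevsky-Trofimov)

$-\frac{1}{n} \log Q^n(x^n) \to H(P)$

for any $P^n(x^n|\theta) = \prod_{x \in A} \theta[x]^{c[x]}$, where $c[x]$ is the number of occurrences of $x \in A$ in $x^n \in A^n$.

Shannon-McMillan-Breiman:

$-\frac{1}{n} \log P^n(x^n|\theta) \to H(P)$

for any stationary ergodic $P$, so that for $P^n(x^n) := P^n(x^n|\theta)$,

$\frac{1}{n} \log \frac{P^n(x^n)}{Q^n(x^n)} \to 0$.  (2)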
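
The convergence of $-\frac{1}{n} \log Q^n(x^n)$ to $H(P)$ can be checked numerically. The sketch below assumes an i.i.d. source with an arbitrarily chosen distribution `probs` and uses the closed form of the Dirichlet(1/2) mixture; `kt_log_prob` and `entropy` are illustrative names.

```python
import random
from collections import Counter
from math import lgamma, log


def kt_log_prob(symbols, m):
    """log Q^n(x^n) for the Krichevsky-Trofimov mixture over an alphabet of size m."""
    n = len(symbols)
    out = lgamma(m / 2) - lgamma(n + m / 2)
    for c in Counter(symbols).values():
        out += lgamma(c + 0.5) - lgamma(0.5)
    return out


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(q * log(q) for q in probs if q > 0)


random.seed(0)
probs = [0.7, 0.2, 0.1]          # illustrative source, not from the talk
n = 20000
xs = random.choices(range(3), weights=probs, k=n)
rate = -kt_log_prob(xs, 3) / n   # per-symbol code length of Q^n
print(rate, entropy(probs))
```

The two printed numbers should be close for large `n`; the gap is the per-symbol redundancy plus sampling fluctuation.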


When X has a density function f

There exists a $g$ s.t.

$\int_{x^n \in \mathbb{R}^n} g^n(x^n) dx^n \leq 1$  (3)

$\frac{1}{n} \log \frac{f^n(x^n)}{g^n(x^n)} \to 0$  (4)

for any $f$ satisfying a condition mentioned later (Ryabko 2009).


The problem in this paper

Universal Bayesian measure in the general settings

What are (1)(2) and (3)(4) for general random variables?

1 without assuming either discrete or continuous
2 removing the constraint Ryabko poses


The Ryabko measure

Ryabko measure: X has a density function f

$A$: the set in which $X$ takes values.

$\{A_j\}_{j=0}^{\infty}$: $A_0 := \{A\}$, and $A_{j+1}$ is a refinement of $A_j$.

For example, for $A = [0, 1)$,

$A_0 = \{[0, 1)\}$
$A_1 = \{[0, 1/2), [1/2, 1)\}$
$A_2 = \{[0, 1/4), [1/4, 1/2), [1/2, 3/4), [3/4, 1)\}$
$\ldots$
$A_j = \{[0, 2^{-j}), [2^{-j}, 2 \cdot 2^{-j}), \cdots, [(2^{j} - 1) 2^{-j}, 1)\}$
$\ldots$

$s_j : A \to A_j$: $x \in a \in A_j \Longrightarrow s_j(x) = a$

$\lambda$: the Lebesgue measure

$f_j(x) := \frac{P_j(s_j(x))}{\lambda(s_j(x))}$ for $x \in A$
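
The quantizer $s_j$ and the histogram density $f_j$ can be sketched for the dyadic example $A = [0, 1)$. The sketch assumes that $P_j(a)$ denotes $P(X \in a)$, and it uses the density $f(x) = 2x$ purely as an illustrative choice; `cell`, `f_j` and `cdf` are hypothetical names.

```python
from math import floor


def cell(x, j):
    """s_j(x): the dyadic cell [i 2^-j, (i + 1) 2^-j) of A_j containing x in [0, 1)."""
    i = floor(x * 2 ** j)
    return (i / 2 ** j, (i + 1) / 2 ** j)


def f_j(x, j, cdf):
    """P_j(s_j(x)) / lambda(s_j(x)), taking P_j(a) = P(X in a) (an assumption)."""
    a, b = cell(x, j)
    return (cdf(b) - cdf(a)) / (b - a)


def cdf(t):
    """Distribution function of the illustrative density f(x) = 2x on [0, 1)."""
    return t * t


coarse = f_j(0.3, 0, cdf)   # level 0: the single cell [0, 1), so the value is 1
fine = f_j(0.3, 12, cdf)    # level 12: close to f(0.3) = 0.6
```

As $j$ grows the cell around $x$ shrinks, and for this smooth $f$ the histogram value approaches $f(x)$.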


Given $x^n = (x_1, \cdots, x_n) \in A^n$ s.t. $(s_j(x_1), \cdots, s_j(x_n)) = (a_1, \cdots, a_n) \in A_j^n$,

$f_j^n(x^n) := f_j(x_1) \cdots f_j(x_n) = \frac{P_j(a_1) \cdots P_j(a_n)}{\lambda(a_1) \cdots \lambda(a_n)}$

$g_j^n(x^n) := \frac{Q_j^n(a_1, \cdots, a_n)}{\lambda(a_1) \cdots \lambda(a_n)}$

$Q_j$: a universal Bayesian measure w.r.t. the finite set $A_j$.

$f^n(x^n) := f(x_1) \cdots f(x_n)$

$g^n(x^n) := \sum_{j=0}^{\infty} \omega_j g_j^n(x^n)$ for $\{\omega_j\}_{j=0}^{\infty}$ s.t. $\sum_j \omega_j = 1$, $\omega_j > 0$

$\frac{1}{n} \log \frac{f^n(x^n)}{g^n(x^n)} \to 0$

for any $f$ s.t. the differential entropy $h(f_j) \to h(f)$ as $j \to \infty$ (Ryabko, 2009)
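
A minimal numerical sketch of the mixture $g^n$ follows. It is not the construction as used in the talk: it assumes $A = [0, 1)$ with dyadic cells, takes $Q_j$ to be the Krichevsky-Trofimov mixture, truncates the infinite sum at a hypothetical `max_level`, and picks weights proportional to $2^{-j}$.

```python
from collections import Counter
from math import exp, floor, lgamma, log


def kt_log_prob(symbols, m):
    """log Q^n for the Krichevsky-Trofimov mixture over an alphabet of size m."""
    n = len(symbols)
    out = lgamma(m / 2) - lgamma(n + m / 2)
    for c in Counter(symbols).values():
        out += lgamma(c + 0.5) - lgamma(0.5)
    return out


def log_g_level(xs, j):
    """log g_j^n(x^n) = log Q_j^n(a_1..a_n) - sum_i log lambda(a_i), lambda(a_i) = 2^-j."""
    cells = [floor(x * 2 ** j) for x in xs]
    return kt_log_prob(cells, 2 ** j) + len(xs) * j * log(2)


def log_g(xs, max_level=8):
    """log of the truncated mixture sum_j w_j g_j^n, with w_j proportional to 2^-j."""
    weights = [2.0 ** -(j + 1) for j in range(max_level + 1)]
    total = sum(weights)
    terms = [log(w / total) + log_g_level(xs, j) for j, w in enumerate(weights)]
    top = max(terms)
    return top + log(sum(exp(t - top) for t in terms))  # log-sum-exp for stability


concentrated = log_g([i / 1000 for i in range(100)])  # all points inside [0, 0.1)
```

Data concentrated in a small region receive a large density value, because the finer levels assign almost all mass to a few narrow cells.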


The Radon-Nikodym theorem

In general, exactly when does a density function exist?

$\mathcal{B}$: the Borel sets of $\mathbb{R}$

$\mu(D) := P(X \in D)$: the probability of $(X \in D)$ for $D \in \mathcal{B}$

$F_X$: the distribution function of $X$

$\mu$ is absolutely continuous w.r.t. $\lambda$ ($\mu \ll \lambda$)

The following two are equivalent:

1 $f : \mathbb{R} \to \mathbb{R}$ exists s.t. $P(X \leq x) = F_X(x) = \int_{t \leq x} f(t) dt$

2 for any $D \in \mathcal{B}$, $\lambda(D) := \int_D dx = 0 \Longrightarrow \mu(D) = 0$.

$f(x) = \frac{dF_X(x)}{dx}$


Even discrete variables have density functions!

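$B$: a countable subset of $\mathbb{R}$

$\mu(D) := P(X \in D)$: the probability of $(X \in D)$ for $D \subseteq B$

$r : B \to \mathbb{R}$

$\mu$ is absolutely continuous w.r.t. $\eta$ ($\mu \ll \eta$)

The following two are equivalent:

1 $f : B \to \mathbb{R}$ exists s.t. $P(X \in D) = \sum_{x \in D} f(x) r(x)$, $D \subseteq B$

2 for any $D \subseteq B$, $\eta(D) := \sum_{x \in D} r(x) = 0 \Longrightarrow \mu(D) = 0$.

$f(x) = \frac{P(X = x)}{r(x)}$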
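
A small exact-arithmetic example may help here. The choices below are illustrative assumptions only: $X$ is geometric on $B = \{1, 2, \cdots\}$ with $P(X = x) = 2^{-x}$, and the reference weights are $r(x) = 3^{-x}$.

```python
from fractions import Fraction


def p(x):
    """P(X = x) for the illustrative geometric variable."""
    return Fraction(1, 2 ** x)


def r(x):
    """Reference weight r(x) defining eta(D) = sum of r(x) over x in D."""
    return Fraction(1, 3 ** x)


def f(x):
    """Density of mu w.r.t. eta: f(x) = P(X = x) / r(x)."""
    return p(x) / r(x)


D = [1, 3, 5]
mu_D = sum(p(x) for x in D)                 # P(X in D) computed directly
via_density = sum(f(x) * r(x) for x in D)   # the same probability via the density
```

The density depends on the reference measure: here $f(x) = (3/2)^x$ exceeds 1, which is harmless because only $\sum_x f(x) r(x) = 1$ is required.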


Radon-Nikodym

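$\mu, \eta$: $\sigma$-finite measures over a $\sigma$-field $\mathcal{F}$

$\mu$ is absolutely continuous w.r.t. $\eta$ ($\mu \ll \eta$)

The following two are equivalent:

1 an $\mathcal{F}$-measurable $f$ exists s.t. for any $A \in \mathcal{F}$, $\mu(A) = \int_A f(t) d\eta(t)$

2 for any $A \in \mathcal{F}$, $\eta(A) = 0 \Longrightarrow \mu(A) = 0$

$\int_A f(t) d\eta(t) := \sup_{\{A_i\}} \sum_i \left[ \inf_{x \in A_i} f(x) \right] \eta(A_i)$

$\frac{d\mu}{d\eta} := f$ is the density function w.r.t. $\eta$ when $\mu$ is a probability measure.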


Generalization

When Y has a density function w.r.t. η s.t. µ ≪ η

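$B$: the set in which $Y$ takes values.

$\{B_k\}_{k=0}^{\infty}$: $B_0 := \{B\}$, and $B_{k+1}$ is a refinement of $B_k$.

For example, for $B = \mathbb{N} := \{1, 2, \cdots\}$,

$B_0 = \{B\}$
$B_1 := \{\{1\}, \{2, 3, \cdots\}\}$
$B_2 := \{\{1\}, \{2\}, \{3, 4, \cdots\}\}$
$\ldots$
$B_k := \{\{1\}, \{2\}, \cdots, \{k\}, \{k + 1, k + 2, \cdots\}\}$
$\ldots$

$t_k : B \to B_k$: $y \in b \in B_k \Longrightarrow t_k(y) = b$

$\eta$: a measure s.t. $\mu \ll \eta$

$f_k(y) := \frac{P_k(t_k(y))}{\eta(t_k(y))}$ for $y \in B$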
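
For the example $B = \mathbb{N}$, the quantizer $t_k$ and the ratio $f_k$ can be written out exactly. The sketch assumes geometric point weights for both measures ($P(Y = y) = 2^{-y}$ and $\eta(\{y\}) = 3^{-y}$), chosen only so that the tail cells have finite, closed-form mass; `t_k`, `mass` and `f_k` are hypothetical names, and $P_k(b)$ is read as $P(Y \in b)$.

```python
from fractions import Fraction


def t_k(y, k):
    """Cell of B_k containing y: the singleton {y} if y <= k, else the tail {k+1, k+2, ...}."""
    return ("point", y) if y <= k else ("tail", k + 1)


def mass(cell, base):
    """Measure of a cell when each point z carries weight base**-z."""
    kind, y = cell
    if kind == "point":
        return Fraction(1, base ** y)
    # Geometric series: sum over z >= y of base**-z = base**-(y-1) / (base - 1).
    return Fraction(1, base ** (y - 1) * (base - 1))


def f_k(y, k):
    """P_k(t_k(y)) / eta(t_k(y)) with P using base 2 and eta using base 3 (assumptions)."""
    b = t_k(y, k)
    return mass(b, 2) / mass(b, 3)


inside_tail = f_k(2, 1)   # y = 2 still lies in the tail {2, 3, ...} of B_1
resolved = f_k(2, 5)      # y = 2 is a singleton of B_5, so f_k(2) = f(2) = 9/4
```

Once $k \geq y$ the cell is the singleton $\{y\}$ and $f_k(y)$ no longer changes.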


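Given $y^n = (y_1, \cdots, y_n) \in B^n$ s.t. $(t_k(y_1), \cdots, t_k(y_n)) = (b_1, \cdots, b_n) \in B_k^n$,

$f_k^n(y^n) := f_k(y_1) \cdots f_k(y_n) = \frac{P_k(b_1) \cdots P_k(b_n)}{\eta(b_1) \cdots \eta(b_n)}$

$g_k^n(y^n) := \frac{Q_k^n(b_1, \cdots, b_n)}{\eta(b_1) \cdots \eta(b_n)}$

$Q_k$: a universal Bayesian measure w.r.t. the finite set $B_k$

Similarly, with $g^n(y^n) := \sum_{k=0}^{\infty} \omega_k g_k^n(y^n)$,

$\frac{1}{n} \log \frac{f^n(y^n)}{g^n(y^n)} \to 0$

for any $f$ s.t. $h(f_k) \to h(f)$ as $k \to \infty$

$h(f) := \int -f(y) \log f(y) d\eta(y)$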


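$\mu^n(D^n) := \int_{D^n} f^n(y^n) d\eta^n(y^n)$, $D^n \in \mathcal{B}^n$

$\nu^n(D^n) := \int_{D^n} g^n(y^n) d\eta^n(y^n)$, $D^n \in \mathcal{B}^n$

$\frac{f^n(y^n)}{g^n(y^n)} = \frac{d\mu^n}{d\eta^n}(y^n) \Big/ \frac{d\nu^n}{d\eta^n}(y^n) = \frac{d\mu^n}{d\nu^n}(y^n)$

$D(\mu||\nu) := \int d\mu \log \frac{d\mu}{d\nu}$

$h(f) := \int -f(y) \log f(y) d\eta(y) = -\int \frac{d\mu}{d\eta}(y) \log \frac{d\mu}{d\eta}(y) \cdot d\eta(y) = -D(\mu||\eta)$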


Result 1

Proposition 1 (Suzuki, 2011)

If $\mu \ll \eta$, a $\nu \ll \eta$ exists s.t. $\nu^n(\mathbb{R}^n) \leq 1$ and

$\frac{1}{n} \log \frac{d\mu^n}{d\nu^n}(y^n) \to 0$

for any $\mu$ s.t. $D(\mu_k||\eta) \to D(\mu||\eta)$ as $k \to \infty$.


The solution of the exercise in the Introduction

$\{A_j \times B_k\}$

$g_{j,k}^n(x^n, y^n) := \frac{Q_{j,k}^n(a_1, b_1, \cdots, a_n, b_n)}{\lambda(a_1) \cdots \lambda(a_n) \eta(b_1) \cdots \eta(b_n)}$

$g^n(x^n, y^n) := \sum_{j,k} \omega_{j,k} g_{j,k}^n(x^n, y^n)$ for $\{\omega_{j,k}\}$ s.t. $\sum_{j,k} \omega_{j,k} = 1$, $\omega_{j,k} > 0$

$\frac{1}{n} \log \frac{f^n(x^n, y^n)}{g^n(x^n, y^n)} \to 0$

Solution

We estimate $\frac{f^n(x^n, y^n)}{f^n(x^n) f^n(y^n)}$ by $\frac{g^n(x^n, y^n)}{g^n(x^n) g^n(y^n)}$, extending $\frac{Q^n(x^n, y^n)}{Q^n(x^n) Q^n(y^n)}$.
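
The ratio in the solution can be approximated numerically. The sketch below is a simplification under several assumptions that are not in the talk: both variables take values in $[0, 1)$, $\eta = \lambda$, the cells are dyadic, the sums over $j$ and $k$ are truncated to a hypothetical number of `levels`, the weights are uniform over the retained levels, and $Q$ is the Krichevsky-Trofimov mixture.

```python
import random
from collections import Counter
from math import exp, floor, lgamma, log


def kt_log_prob(symbols, m):
    """log Q^n for the Krichevsky-Trofimov mixture over an alphabet of size m."""
    n = len(symbols)
    out = lgamma(m / 2) - lgamma(n + m / 2)
    for c in Counter(symbols).values():
        out += lgamma(c + 0.5) - lgamma(0.5)
    return out


def logsumexp(terms):
    top = max(terms)
    return top + log(sum(exp(t - top) for t in terms))


def log_g_marginal(xs, levels):
    """log g^n(x^n): uniform weights over dyadic levels 0..levels-1."""
    n = len(xs)
    return logsumexp([
        -log(levels) + kt_log_prob([floor(x * 2 ** j) for x in xs], 2 ** j) + n * j * log(2)
        for j in range(levels)
    ])


def log_g_joint(xs, ys, levels):
    """log g^n(x^n, y^n): uniform weights over the level pairs (j, k)."""
    n = len(xs)
    terms = []
    for j in range(levels):
        for k in range(levels):
            cells = [(floor(x * 2 ** j), floor(y * 2 ** k)) for x, y in zip(xs, ys)]
            terms.append(-2 * log(levels) + kt_log_prob(cells, 2 ** (j + k))
                         + n * (j + k) * log(2))
    return logsumexp(terms)


def log_ratio(xs, ys, levels=4):
    """log of g(x, y) / (g(x) g(y)); large positive values suggest dependence."""
    return log_g_joint(xs, ys, levels) - log_g_marginal(xs, levels) - log_g_marginal(ys, levels)


random.seed(1)
xs = [random.random() for _ in range(200)]
dependent = log_ratio(xs, xs)   # y is an exact copy of x
```

By analogy with the discrete rule, independence would be accepted when this log ratio is at most $\log \frac{p}{1 - p}$.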


Further generalization

Proposition 1 assumes

a specific histogram sequence $\{B_k\}$; and
$\mu$ should satisfy $D(\mu_k||\eta) \to D(\mu||\eta)$ as $k \to \infty$

$\{B_k\}$ should be universal

Construct $\{B_k\}$ s.t. $D(\mu_k||\eta) \to D(\mu||\eta)$ as $k \to \infty$ for any $\mu$


Universal Histogram Sequence

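Universal histogram sequence $\{B_k\}$

$\mu, \sigma \in \mathbb{R}$, $\sigma > 0$.

$\{C_k\}_{k=0}^{\infty}$:

$C_0 = \{(-\infty, \infty)\}$

$C_1 = \{(-\infty, \mu], (\mu, \infty)\}$

$C_2 = \{(-\infty, \mu - \sigma], (\mu - \sigma, \mu], (\mu, \mu + \sigma], (\mu + \sigma, \infty)\}$

$\cdots$

$C_k \to C_{k+1}$:

$(-\infty, \mu - (k - 1)\sigma] \mapsto (-\infty, \mu - k\sigma], (\mu - k\sigma, \mu - (k - 1)\sigma]$

$(a, b] \mapsto (a, \frac{a + b}{2}], (\frac{a + b}{2}, b]$

$(\mu + (k - 1)\sigma, \infty) \mapsto (\mu + (k - 1)\sigma, \mu + k\sigma], (\mu + k\sigma, \infty)$

$B$: the set in which $Y$ takes values

$B_k := \{B \cap c \mid c \in C_k\} \setminus \{\emptyset\}$.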
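
The refinement rule $C_k \to C_{k+1}$ translates directly into code. In this sketch a pair `(a, b)` stands for the interval $(a, b]$ (with $b = \infty$ read as an open end), and `restrict` intersects the cells with a finite sample of $B$, since an infinite $B$ cannot be enumerated; all function names are illustrative.

```python
import math


def next_level(cells, k, mu, sigma):
    """C_k -> C_{k+1}: split the two unbounded cells at mu -/+ k*sigma, halve the rest."""
    if k == 0:
        return [(-math.inf, mu), (mu, math.inf)]
    out = []
    for a, b in cells:
        if a == -math.inf:
            cut = mu - k * sigma
        elif b == math.inf:
            cut = mu + k * sigma
        else:
            cut = (a + b) / 2
        out += [(a, cut), (cut, b)]
    return out


def histogram_sequence(depth, mu, sigma):
    """Return [C_0, C_1, ..., C_depth]."""
    levels = [[(-math.inf, math.inf)]]
    for k in range(depth):
        levels.append(next_level(levels[-1], k, mu, sigma))
    return levels


def restrict(cells, points):
    """B_k = {B ∩ c : c in C_k} without empty cells, for a finite sample `points` of B."""
    parts = [[y for y in points if a < y <= b] for a, b in cells]
    return [part for part in parts if part]


levels = histogram_sequence(3, mu=1, sigma=1)
b3 = restrict(levels[3], range(1, 11))   # first ten naturals as a stand-in for N
```

With $\mu = 1$, $\sigma = 1$ and the naturals as $B$, the non-empty cells at level 3 are $\{1\}, \{2\}, \{3\}$ and the tail, matching the sequence $B_k$ used for $\mathbb{N}$.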


$B = \mathbb{R}$ and $\mu \ll \lambda$

$\{B_k\} = \{C_k\}$

For each $y \in B$, there exist $K \in \mathbb{N}$ and a unique $\{(a_k, b_k]\}_{k=K}^{\infty}$ s.t.

$y \in (a_k, b_k] \in B_k$, $k = K, K + 1, \cdots$
$|a_k - b_k| \to 0$, $k \to \infty$

$F_Y$: the distribution function of $Y$

$f_k(y) = \frac{P(Y \in (a_k, b_k])}{\lambda((a_k, b_k])} = \frac{F_Y(b_k) - F_Y(a_k)}{b_k - a_k} \to f(y)$, $y \in B$

$h(f_k) \to h(f)$

as $k \to \infty$ for any $f$


$B = \mathbb{N}$ and $\mu \ll \eta$

$B_0 = \{B\}$
$B_1 := \{\{1\}, \{2, 3, \cdots\}\}$
$B_2 := \{\{1\}, \{2\}, \{3, 4, \cdots\}\}$
$\ldots$
$B_k := \{\{1\}, \{2\}, \cdots, \{k\}, \{k + 1, k + 2, \cdots\}\}$
$\ldots$

can be obtained via $\mu = 1$, $\sigma = 1$.

For each $y \in B$, there exist $K \in \mathbb{N}$ and a unique $\{D_k\}_{k=1}^{\infty}$ s.t.

$y \in D_k \in B_k$, $k = 1, 2, \cdots$
$\{y\} = D_k \in B_k$, $k = K, K + 1, \cdots$

$f_k(y) = \frac{P(Y \in D_k)}{\eta(D_k)} \to f(y) = \frac{P(Y = y)}{\eta(\{y\})}$, $y \in B$

$h(f_k) \to h(f)$

as $k \to \infty$ for any $f$


Result 2

Theorem 1

If $\mu \ll \eta$, a $\nu \ll \eta$ exists s.t. $\nu^n(\mathbb{R}^n) \leq 1$ and for any $\mu$

$\frac{1}{n} \log \frac{d\mu^n}{d\nu^n}(y^n) \to 0$

The proof is based on the following observation:

Billingsley: Probability and Measure, Problem 32.13

$\lim_{h \to 0} \frac{\mu((x - h, x + h])}{\eta((x - h, x + h])} = f(x)$, $x \in \mathbb{R}$

to remove the condition Ryabko posed: "for any $\mu$ s.t. $D(\mu_k||\eta) \to D(\mu||\eta)$ as $k \to \infty$"
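
The limit in the observation is easy to see numerically in the simplest case. The sketch assumes $\eta = \lambda$ (Lebesgue measure) and takes $\mu$ to be the standard normal distribution, both chosen only for illustration; `Phi`, `phi` and `ratio` are hypothetical names.

```python
from math import erf, exp, pi, sqrt


def Phi(x):
    """Standard normal distribution function."""
    return 0.5 * (1 + erf(x / sqrt(2)))


def phi(x):
    """Standard normal density, the expected limit f(x)."""
    return exp(-x * x / 2) / sqrt(2 * pi)


def ratio(x, h):
    """mu((x - h, x + h]) / lambda((x - h, x + h]) for the standard normal mu."""
    return (Phi(x + h) - Phi(x - h)) / (2 * h)


wide = ratio(0.5, 1.0)      # a coarse interval: visibly different from phi(0.5)
narrow = ratio(0.5, 1e-3)   # a narrow interval: very close to phi(0.5)
```

Shrinking `h` plays the role of refining the histogram cell that contains `x`.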


Conclusion

Summary and Discussion

Universal Bayesian Measure

the random variables may be either discrete or continuous

a universal histogram sequence to remove Ryabko's condition

Many Applications

Bayesian network structure estimation (DCC 2012)

The Bayesian Chow-Liu Algorithm (PGM 2012)

Markov order estimation even when $\{X_i\}$ is continuous

Extending MDL:

$g^n(y^n|m)$: the universal Bayesian measure w.r.t. model $m$ given $y^n \in B^n$

$p_m$: the prior probability of model $m$

$-\log g^n(y^n|m) - \log p_m \to \min$
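
The minimisation above can be illustrated on Markov order estimation for a finite alphabet. This is a simplified discrete sketch, not the continuous construction of the talk: it assumes one Krichevsky-Trofimov estimator per context, codes the first `order` symbols uniformly, and uses a uniform prior over the candidate orders; all names are hypothetical.

```python
from collections import Counter
from math import lgamma, log


def kt_log_prob(symbols, size):
    """log Q^n for the Krichevsky-Trofimov mixture over an alphabet of `size` symbols."""
    n = len(symbols)
    out = lgamma(size / 2) - lgamma(n + size / 2)
    for c in Counter(symbols).values():
        out += lgamma(c + 0.5) - lgamma(0.5)
    return out


def log_g_markov(ys, order, size):
    """log g^n(y^n | m) for the model 'Markov chain of the given order' (assumed form)."""
    by_context = {}
    for i in range(order, len(ys)):
        by_context.setdefault(tuple(ys[i - order:i]), []).append(ys[i])
    head = -order * log(size)  # first `order` symbols coded uniformly (simplification)
    return head + sum(kt_log_prob(s, size) for s in by_context.values())


def mdl_select(ys, orders, prior, size):
    """Return the order minimising -log g^n(y^n | m) - log p_m."""
    return min(orders, key=lambda k: -log_g_markov(ys, k, size) - log(prior[k]))


ys = [0, 1] * 100   # a strictly alternating binary sequence
best = mdl_select(ys, [0, 1, 2], {0: 1 / 3, 1: 1 / 3, 2: 1 / 3}, size=2)
```

For the alternating sequence, order 1 already predicts every symbol, and the extra contexts of order 2 only add description length.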

