A survey on mixing coefficients: computation and estimation.
Vitaly Kuznetsov
Courant Institute of Mathematical Sciences, New York University
October 29, 2013
Introduction
Binary classification
Receive a sample X1, . . . , Xm with labels in {0, 1}.
Choose a hypothesis h that has good expected performance on unseen data.
X1, . . . , Xm are typically assumed to be i.i.d.
Introduction (continued)
Much of learning theory operates under the assumption that the data come from an i.i.d. source.
In certain scenarios this assumption is not appropriate, e.g. time series analysis.
To extend learning theory to these scenarios we need to find a suitable relaxation of the i.i.d. requirement.
One common approach in the literature is to impose various "mixing conditions".
Under these mixing conditions, the strength of dependence between random variables is measured using "mixing coefficients".
Outline
Mixing conditions and coefficients: definitions and basic properties.
Computational aspects.
Estimating mixing coefficients.
Discussion.
How can we measure dependence between random variables?
Common measures of dependence are the so-called "mixing" coefficients.
They were originally introduced to prove laws of large numbers for sequences of dependent variables.
α-mixing coefficient between two σ-algebras
Given a probability space (Ω, F, P) and two sub-σ-algebras σ1 and σ2, define the α-mixing coefficient
α(σ1, σ2) = sup_{A,B} |P(A)P(B) − P(A ∩ B)|
where the supremum is taken over all A ∈ σ1 and B ∈ σ2.
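For small finite alphabets this definition can be evaluated directly by brute force over all pairs of events. The sketch below (illustrative, not from the talk; all names are my own) does exactly this for a joint pmf given as a matrix. The search is exponential in the alphabet sizes, which foreshadows the hardness result discussed later.

```python
from itertools import chain, combinations
import numpy as np

def all_events(n):
    """All subsets of {0, ..., n-1}: the events of the discrete sigma-algebra."""
    return chain.from_iterable(combinations(range(n), r) for r in range(n + 1))

def alpha_coefficient(theta):
    """Brute-force alpha(sigma(X), sigma(Y)) for a joint pmf matrix theta,
    where theta[i, j] = P(X = i, Y = j). Exponential in the alphabet sizes."""
    theta = np.asarray(theta, dtype=float)
    mu = theta.sum(axis=1)   # marginal of X
    nu = theta.sum(axis=0)   # marginal of Y
    best = 0.0
    for A in all_events(theta.shape[0]):
        a = list(A)
        for B in all_events(theta.shape[1]):
            b = list(B)
            pA, pB = mu[a].sum(), nu[b].sum()
            pAB = theta[np.ix_(a, b)].sum()
            best = max(best, abs(pA * pB - pAB))
    return best

# Independent fair bits: alpha = 0; perfectly correlated fair bits: alpha = 1/4.
print(alpha_coefficient(np.outer([0.5, 0.5], [0.5, 0.5])))  # 0.0
print(alpha_coefficient([[0.5, 0.0], [0.0, 0.5]]))          # 0.25
```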
ϕ-mixing coefficient
Define the ϕ-mixing coefficient
ϕ(σ1|σ2) = sup_{A,B} |P(A) − P(A|B)|
where the supremum is taken over all A ∈ σ1 and B ∈ σ2.
Note that the ϕ coefficient is not symmetric.
β-mixing coefficient
Define the β-mixing coefficient between two σ-algebras σ1 and σ2:
β(σ1, σ2) = E sup_A |P(A) − P(A | σ2)|
where the supremum is taken over all A ∈ σ1.
We can rewrite the β-mixing coefficient as follows:
β(σ1, σ2) = (1/2) sup Σ_{i=1}^{I} Σ_{j=1}^{J} |P(Ai)P(Bj) − P(Ai ∩ Bj)|
where the supremum is taken over all finite partitions A1, . . . , AI and B1, . . . , BJ of Ω such that Ai ∈ σ1 and Bj ∈ σ2.
Alternative definitions of the β-mixing coefficient
This leads to yet another characterization of the β-mixing coefficient:
β(σ1, σ2) = ‖P_{σ1} ⊗ P_{σ2} − P_{σ1 ⊗ σ2}‖
where ‖ · ‖ denotes the total variation distance, i.e. ‖P − Q‖ = sup_A |P(A) − Q(A)|.
Assuming the distributions P and Q have densities f and g respectively,
‖P − Q‖ = (1/2) ∫ |f − g|.
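As a sanity check of the density formula, the total variation distance can be approximated on a grid; the sketch below uses two Gaussian densities (a purely illustrative choice, not from the talk). For N(0, 1) versus N(d, 1) the exact value is 2Φ(d/2) − 1.

```python
import numpy as np

def tv_distance(f, g, dx):
    """Approximate (1/2) * integral |f - g| for densities sampled on a uniform grid."""
    return 0.5 * np.abs(f - g).sum() * dx

x = np.linspace(-10.0, 10.0, 200001)
dx = x[1] - x[0]
gauss = lambda x, mu: np.exp(-((x - mu) ** 2) / 2) / np.sqrt(2 * np.pi)

print(tv_distance(gauss(x, 0.0), gauss(x, 0.0), dx))  # 0.0 (identical densities)
print(tv_distance(gauss(x, 0.0), gauss(x, 3.0), dx))  # ~0.8664, i.e. 2*Phi(1.5) - 1
```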
Relations between mixing coefficients
We have the following:
2α(σ1, σ2) ≤ β(σ1, σ2) ≤ ϕ(σ1, σ2)
The second inequality is immediate from the definitions.
Proof of the first inequality: for any A ∈ σ1 and B ∈ σ2, the partitions {A, A^c} and {B, B^c} give
|P(A)P(B) − P(A ∩ B)| + |P(A)P(B^c) − P(A ∩ B^c)| + |P(A^c)P(B) − P(A^c ∩ B)| + |P(A^c)P(B^c) − P(A^c ∩ B^c)| ≤ 2β(σ1, σ2).
Since all four terms on the left are equal, 4|P(A)P(B) − P(A ∩ B)| ≤ 2β(σ1, σ2); taking the supremum over A and B yields 2α(σ1, σ2) ≤ β(σ1, σ2).
From two variables to stochastic processes (i)
Let {Xt}_{t=−∞}^{∞} be a doubly infinite sequence of random variables.
Notation:
X_i^j = (Xi, Xi+1, . . . , Xj)
P_i^j is the joint probability distribution of X_i^j
σ_i^j is the σ-algebra generated by X_i^j
From two variables to stochastic processes (ii)
Define the following mixing coefficients:
α(a) = sup_t α(σ_{−∞}^t, σ_{t+a}^∞)
β(a) = sup_t β(σ_{−∞}^t, σ_{t+a}^∞)
ϕ(a) = sup_t ϕ(σ_{−∞}^t, σ_{t+a}^∞)
We say that a sequence of random variables X_{−∞}^∞ is α-, β- or ϕ-mixing if the corresponding mixing coefficient → 0 as a → ∞.
These coefficients measure the dependence between the future and the past separated by a time units.
Stationary stochastic processes
A stochastic process X_{−∞}^∞ is (strictly) stationary if for any t ∈ Z and k, n ∈ N the distribution of X_t^{t+n} is the same as the distribution of X_{t+k}^{t+k+n}.
For stationary processes the mixing coefficients simplify to
α(a) = α(σ_{−∞}^0, σ_a^∞)
β(a) = β(σ_{−∞}^0, σ_a^∞)
ϕ(a) = ϕ(σ_{−∞}^0, σ_a^∞)
Connections to machine learning
Theorem (M. Mohri, A. Rostamizadeh, 2009): Let H = {h : X → Y} be a set of hypotheses and L be an M-bounded loss function. Let S be a sample of size m = 2µa from a stationary β-mixing process on X × Y. For any δ > 4(µ − 1)β(a), with probability at least 1 − δ′ the following holds for all h ∈ H:
E[L(h(X), Y)] ≤ (1/m) Σ_{i=1}^m L(h(Xi), Yi) + R_{Sµ}(L ∘ H) + 3M √(log(4/δ′) / (2µ))
where R_{Sµ} denotes the empirical Rademacher complexity and δ′ = δ − 4(µ − 1)β(a).
Other results of a similar nature are due to R. Meir, M. Mohri and A. Rostamizadeh, and I. Steinwart et al., to name a few.
Can we compute mixing coefficients?
Theorem (M. Ahsen, M. Vidyasagar, 2013): Suppose X and Y are discrete random variables with known joint and marginal probability distributions. Then computing the α-mixing coefficient is NP-hard (by equivalence to the "partition problem").
Ahsen and Vidyasagar also give efficiently computable upper and lower bounds.
Can we compute mixing coefficients? (continued)
Theorem (M. Ahsen, M. Vidyasagar, 2013): Suppose X and Y are discrete random variables with known joint distribution θij and marginal probability distributions µi and νj. Then one has that
β(σ(X), σ(Y)) = (1/2) Σ_i Σ_j |γij|
ϕ(σ(X), σ(Y)) = max_j (1/νj) Σ_i max(γij, 0)
where γij = θij − µiνj. Thus, β(σ(X), σ(Y)) and ϕ(σ(X), σ(Y)) are both computable in polynomial time.
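These closed-form expressions translate directly into a few lines of code; a minimal sketch (the function name and test distributions are my own) for a joint pmf θ given as a matrix:

```python
import numpy as np

def beta_phi(theta):
    """beta and phi mixing coefficients of discrete (X, Y) from joint pmf theta.

    Implements beta = (1/2) sum_ij |gamma_ij| and
    phi = max_j (1/nu_j) sum_i max(gamma_ij, 0), with gamma_ij = theta_ij - mu_i nu_j.
    """
    theta = np.asarray(theta, dtype=float)
    mu = theta.sum(axis=1)              # marginal of X
    nu = theta.sum(axis=0)              # marginal of Y
    gamma = theta - np.outer(mu, nu)
    beta = 0.5 * np.abs(gamma).sum()
    phi = (np.clip(gamma, 0.0, None).sum(axis=0) / nu).max()
    return beta, phi

# Independent fair bits: both coefficients are 0.
print(beta_phi(np.outer([0.5, 0.5], [0.5, 0.5])))  # (0.0, 0.0)
# Perfectly correlated fair bits: beta = phi = 1/2.
print(beta_phi([[0.5, 0.0], [0.0, 0.5]]))          # (0.5, 0.5)
```

For the perfectly correlated fair bits, 2α = 0.5 = β = ϕ, consistent with the inequalities 2α ≤ β ≤ ϕ.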
Estimation of mixing coefficients: naive approach (i)
Question: Given i.i.d. samples (X1, Y1), . . . , (Xm, Ym) from a joint distribution of real-valued (X, Y), can we estimate any of the mixing coefficients?
Define the following empirical c.d.f. estimators of the marginal and joint distributions:
F̂(x) = (1/m) Σ_{i=1}^m I_{Xi ≤ x}
Ĝ(y) = (1/m) Σ_{i=1}^m I_{Yi ≤ y}
Ĥ(x, y) = (1/m) Σ_{i=1}^m I_{Xi ≤ x, Yi ≤ y}
Let β̂ and ϕ̂ be the estimators of β and ϕ based on these empirical c.d.f.'s.
Estimation of mixing coefficients: naive approach (ii)
Theorem (M. Ahsen, M. Vidyasagar, 2013):
ϕ̂ ≥ β̂ = (m − 1)/m → 1 as m → ∞
Justification: Under the empirical probability distribution each sample point has mass 1/m. The marginals are also uniform, and hence the product distribution assigns mass 1/m² to each point of the grid (xi, yj). The conclusion now follows from the above formula for the discrete β.
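The degeneracy is easy to reproduce numerically: treat the m (distinct) sample points as a discrete joint distribution with mass 1/m per atom and apply the discrete β formula. The result is (m − 1)/m regardless of the data (a small sanity-check sketch, not from the talk):

```python
import numpy as np

def naive_beta_hat(m):
    """Discrete beta of the empirical joint of m distinct (x_i, y_i) pairs.

    The empirical joint puts mass 1/m on each of m 'diagonal' grid points,
    while the product of the (uniform) empirical marginals puts mass 1/m^2
    on every point of the m x m grid.
    """
    theta = np.eye(m) / m                      # empirical joint distribution
    gamma = theta - np.full((m, m), 1 / m**2)  # theta - mu (x) nu
    return 0.5 * np.abs(gamma).sum()           # discrete beta formula

for m in (2, 10, 100):
    print(m, naive_beta_hat(m))  # (m - 1)/m: 0.5, 0.9, 0.99
```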
Estimation of mixing coefficients: histograms (i)
A histogram estimator f̂ of a density f based on a sample X1, . . . , Xm is
f̂(x) = Σ_{j=1}^J (pj / (m wj)) I_{Bj}(x)
where
Bj's are the bins partitioning the region containing the observations,
pj = Σ_{i=1}^m I_{Bj}(Xi) counts the number of samples in bin Bj,
wj is the width of the j-th bin.
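A minimal sketch of this estimator (equal-width bins for simplicity; the equal-count binning used on the next slide is a straightforward variation, and all names here are my own):

```python
import numpy as np

def histogram_density(sample, bins):
    """Histogram density estimator: f_hat(x) = p_j / (m * w_j) on bin B_j."""
    sample = np.asarray(sample, dtype=float)
    m = len(sample)
    edges = np.linspace(sample.min(), sample.max(), bins + 1)
    counts, _ = np.histogram(sample, bins=edges)   # p_j
    widths = np.diff(edges)                        # w_j
    heights = counts / (m * widths)

    def f_hat(x):
        j = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, bins - 1)
        return heights[j]

    return f_hat, edges

rng = np.random.default_rng(0)
f_hat, edges = histogram_density(rng.standard_normal(10000), bins=30)
mids = (edges[:-1] + edges[1:]) / 2
# The estimator integrates to 1 by construction:
print(np.sum(np.diff(edges) * f_hat(mids)))  # ~1.0
```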
Estimation of mixing coefficients: histograms (ii)
Given m samples, choose Jm intervals on R so that each bin contains ⌊m/Jm⌋ or ⌊m/Jm⌋ + 1 samples from both X and Y.
Theorem (M. Ahsen, M. Vidyasagar, 2013): Suppose (X, Y) ∼ θ, X ∼ µ and Y ∼ ν, with θ absolutely continuous with respect to µ ⊗ ν. Then β̂ converges to β provided that Jm/m → 0. If, in addition, the density f ∈ L∞, then α̂ and ϕ̂ also converge to α and ϕ respectively.
The measure-theoretic arguments used in the proof establish consistency of the estimators but do not yield error rates.
Estimation of mixing coefficients: stochastic processes (i)
Two-step approximation:
|β̂d(a) − β(a)| ≤ |β̂d(a) − βd(a)| + |βd(a) − β(a)|
where βd(a) = sup_t β(σ_{t−d}^t, σ_{t+a}^{t+a+d}) and β̂d(a) is the plug-in estimator
β̂d(a) = (1/2) ∫ |f̂d ⊗ f̂d − f̂2d|
with f̂d, f̂2d being d- and 2d-dimensional histogram estimators.
Estimation of mixing coefficients: stochastic processes (ii)
Theorem (D. McDonald, C. Shalizi, M. Schervish, 2011): Let X_1^m be a sample from a stationary β-mixing process. For m = 2µm bm and d ≤ µm we have that
P(|β̂d(a) − βd(a)| ≥ ε) ≤ 2 exp(−µm ε1² / 2) + 2 exp(−µm ε2² / 2) + 4(µm − 1)β(bm)
where ε1 = ε/2 − E[∫ |f̂d − fd|] and ε2 = ε − E[∫ |f̂2d − f2d|].
The proof is based on the blocking technique.
Estimation of mixing coefficients: stochastic processes (iii)
For |βd(a) − β(a)|, a measure-theoretic argument can be used to show that this term → 0 as d → ∞.
Under the assumption that the densities fd and f2d are in the Sobolev space H2, McDonald, Shalizi and Schervish argue that f̂d and f̂2d are consistent.
Choosing dm = O(exp(W(log m))) and wm = O(m^{−km}), where
km = (W(log m) + (1/2) log m) / (log m ((1/2) exp(W(log m)) + 1))
and W is the inverse of w ↦ w exp(w) (the Lambert W function), they show that the histogram-based estimator of β is consistent.
Estimation of mixing coefficients: discussion
The results do not provide convergence rates.
High-dimensional histogram estimation may not be accurate.
Instead of estimating β directly, an intermediate density-estimation step is used.
Could estimators based on kernels be used instead of histograms?