A survey on mixing coefficients: computation and estimation.
Vitaly Kuznetsov
Courant Institute of Mathematical Sciences, New York University
October 29, 2013
Introduction
Binary classification
Receive a sample X1, . . . , Xm with labels in {0, 1}.
Choose a hypothesis h that has good expected performance on unseen data.
X1, . . . , Xm are typically assumed to be i.i.d.
Introduction (continued)
Much of learning theory operates under the assumption that the data come from an i.i.d. source.
In certain scenarios this assumption is not appropriate, e.g. time series analysis.
To extend learning theory to these scenarios we need to find a suitable relaxation of the i.i.d. requirement.
One common approach in the literature is to impose various "mixing conditions".
Under these mixing conditions, the strength of dependence between random variables is measured using "mixing coefficients".
Outline
Mixing conditions and coefficients: definitions and basic properties.
Computational aspects.
Estimating mixing coefficients.
Discussion.
How can we measure dependence between random variables?
Common measures of dependence are the so-called "mixing" coefficients.
They were originally introduced to prove laws of large numbers for sequences of dependent variables.
α-mixing coefficient between two σ-algebras
Given a probability space (Ω, F, P) and two sub-σ-algebras σ1 and σ2, define the α-mixing coefficient
α(σ1, σ2) = sup_{A,B} |P(A)P(B) − P(A ∩ B)|
where the supremum is taken over all A ∈ σ1 and B ∈ σ2.
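For small finite alphabets this definition can be evaluated directly by brute force over all pairs of events. The sketch below (illustrative, not from the talk; all names are my own) does exactly this for a joint pmf given as a matrix. The search is exponential in the alphabet sizes, which foreshadows the hardness result discussed later.

```python
from itertools import chain, combinations
import numpy as np

def all_events(n):
    """All subsets of {0, ..., n-1}: the events of the discrete sigma-algebra."""
    return chain.from_iterable(combinations(range(n), r) for r in range(n + 1))

def alpha_coefficient(theta):
    """Brute-force alpha(sigma(X), sigma(Y)) for a joint pmf matrix theta,
    where theta[i, j] = P(X = i, Y = j). Exponential in the alphabet sizes."""
    theta = np.asarray(theta, dtype=float)
    mu = theta.sum(axis=1)   # marginal of X
    nu = theta.sum(axis=0)   # marginal of Y
    best = 0.0
    for A in all_events(theta.shape[0]):
        a = list(A)
        for B in all_events(theta.shape[1]):
            b = list(B)
            pA, pB = mu[a].sum(), nu[b].sum()
            pAB = theta[np.ix_(a, b)].sum()
            best = max(best, abs(pA * pB - pAB))
    return best

# Independent fair bits: alpha = 0; perfectly correlated fair bits: alpha = 1/4.
print(alpha_coefficient(np.outer([0.5, 0.5], [0.5, 0.5])))  # 0.0
print(alpha_coefficient([[0.5, 0.0], [0.0, 0.5]]))          # 0.25
```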
ϕ-mixing coefficient
Define the ϕ-mixing coefficient
ϕ(σ1|σ2) = sup_{A,B} |P(A) − P(A|B)|
where the supremum is taken over all A ∈ σ1 and B ∈ σ2.
Note that the ϕ coefficient is not symmetric.
β-mixing coefficient
Define the β-mixing coefficient between two σ-algebras σ1 and σ2:
β(σ1, σ2) = E sup_A |P(A) − P(A | σ2)|
where the supremum is taken over all A ∈ σ1.
We can rewrite the β-mixing coefficient as follows:
β(σ1, σ2) = (1/2) sup Σ_{i=1}^{I} Σ_{j=1}^{J} |P(Ai)P(Bj) − P(Ai ∩ Bj)|
where the supremum is taken over all finite partitions A1, . . . , AI and B1, . . . , BJ of Ω such that Ai ∈ σ1 and Bj ∈ σ2.
Alternative definitions of the β-mixing coefficient
This leads to yet another characterization of the β-mixing coefficient:
β(σ1, σ2) = ‖P_{σ1} ⊗ P_{σ2} − P_{σ1 ⊗ σ2}‖
where ‖ · ‖ denotes the total variation distance, i.e. ‖P − Q‖ = sup_A |P(A) − Q(A)|.
Assuming the distributions P and Q have densities f and g respectively,
‖P − Q‖ = (1/2) ∫ |f − g|.
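As a sanity check of the density formula, the total variation distance can be approximated on a grid; the sketch below uses two Gaussian densities (a purely illustrative choice, not from the talk). For N(0, 1) versus N(d, 1) the exact value is 2Φ(d/2) − 1.

```python
import numpy as np

def tv_distance(f, g, dx):
    """Approximate (1/2) * integral |f - g| for densities sampled on a uniform grid."""
    return 0.5 * np.abs(f - g).sum() * dx

x = np.linspace(-10.0, 10.0, 200001)
dx = x[1] - x[0]
gauss = lambda x, mu: np.exp(-((x - mu) ** 2) / 2) / np.sqrt(2 * np.pi)

print(tv_distance(gauss(x, 0.0), gauss(x, 0.0), dx))  # 0.0 (identical densities)
print(tv_distance(gauss(x, 0.0), gauss(x, 3.0), dx))  # ~0.8664, i.e. 2*Phi(1.5) - 1
```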
Relations between mixing coefficients
We have the following:
2α(σ1, σ2) ≤ β(σ1, σ2) ≤ ϕ(σ1, σ2)
The second inequality is immediate from the definitions.
Proof of the first inequality: for any A ∈ σ1 and B ∈ σ2, the partitions {A, A^c} and {B, B^c} give
|P(A)P(B) − P(A ∩ B)| + |P(A)P(B^c) − P(A ∩ B^c)| + |P(A^c)P(B) − P(A^c ∩ B)| + |P(A^c)P(B^c) − P(A^c ∩ B^c)| ≤ 2β(σ1, σ2).
Since all four terms on the left are equal, 4|P(A)P(B) − P(A ∩ B)| ≤ 2β(σ1, σ2); taking the supremum over A and B yields 2α(σ1, σ2) ≤ β(σ1, σ2).
From two variables to stochastic processes (i)
Let {Xt}_{t=−∞}^{∞} be a doubly infinite sequence of random variables.
Notation:
X_i^j = (Xi, Xi+1, . . . , Xj)
P_i^j is the joint probability distribution of X_i^j
σ_i^j is the σ-algebra generated by X_i^j
From two variables to stochastic processes (ii)
Define the following mixing coefficients:
α(a) = sup_t α(σ_{−∞}^t, σ_{t+a}^∞)
β(a) = sup_t β(σ_{−∞}^t, σ_{t+a}^∞)
ϕ(a) = sup_t ϕ(σ_{−∞}^t, σ_{t+a}^∞)
We say that a sequence of random variables X_{−∞}^∞ is α-, β- or ϕ-mixing if the corresponding mixing coefficient → 0 as a → ∞.
These coefficients measure the dependence between the future and the past separated by a time units.
Stationary stochastic processes
A stochastic process X_{−∞}^∞ is (strictly) stationary if for any t ∈ Z and k, n ∈ N the distribution of X_t^{t+n} is the same as the distribution of X_{t+k}^{t+k+n}.
For stationary processes the mixing coefficients simplify to
α(a) = α(σ_{−∞}^0, σ_a^∞)
β(a) = β(σ_{−∞}^0, σ_a^∞)
ϕ(a) = ϕ(σ_{−∞}^0, σ_a^∞)
Connections to machine learning
Theorem (M. Mohri, A. Rostamizadeh, 2009): Let H = {h : X → Y} be a set of hypotheses and L be an M-bounded loss function. Let S be a sample of size m = 2µa from a stationary β-mixing process on X × Y. For any δ > 4(µ − 1)β(a), with probability at least 1 − δ′ the following holds for all h ∈ H:
E[L(h(X), Y)] ≤ (1/m) Σ_{i=1}^m L(h(Xi), Yi) + R_{Sµ}(L ∘ H) + 3M √(log(4/δ′) / (2µ))
where R_{Sµ} denotes the empirical Rademacher complexity and δ′ = δ − 4(µ − 1)β(a).
Other results of a similar nature are due to R. Meir, M. Mohri and A. Rostamizadeh, and I. Steinwart et al., to name a few.
Can we compute mixing coefficients?
Theorem (M. Ahsen, M. Vidyasagar, 2013): Suppose X and Y are discrete random variables with known joint and marginal probability distributions. Then computing the α-mixing coefficient is NP-hard (by equivalence to the "partition problem").
Ahsen and Vidyasagar also give efficiently computable upper and lower bounds.
Can we compute mixing coefficients? (continued)
Theorem (M. Ahsen, M. Vidyasagar, 2013): Suppose X and Y are discrete random variables with known joint distribution θij and marginal probability distributions µi and νj. Then one has that
β(σ(X), σ(Y)) = (1/2) Σ_i Σ_j |γij|
ϕ(σ(X), σ(Y)) = max_j (1/νj) Σ_i max(γij, 0)
where γij = θij − µiνj. Thus, β(σ(X), σ(Y)) and ϕ(σ(X), σ(Y)) are both computable in polynomial time.
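These closed-form expressions translate directly into a few lines of code; a minimal sketch (the function name and test distributions are my own) for a joint pmf θ given as a matrix:

```python
import numpy as np

def beta_phi(theta):
    """beta and phi mixing coefficients of discrete (X, Y) from joint pmf theta.

    Implements beta = (1/2) sum_ij |gamma_ij| and
    phi = max_j (1/nu_j) sum_i max(gamma_ij, 0), with gamma_ij = theta_ij - mu_i nu_j.
    """
    theta = np.asarray(theta, dtype=float)
    mu = theta.sum(axis=1)              # marginal of X
    nu = theta.sum(axis=0)              # marginal of Y
    gamma = theta - np.outer(mu, nu)
    beta = 0.5 * np.abs(gamma).sum()
    phi = (np.clip(gamma, 0.0, None).sum(axis=0) / nu).max()
    return beta, phi

# Independent fair bits: both coefficients are 0.
print(beta_phi(np.outer([0.5, 0.5], [0.5, 0.5])))  # (0.0, 0.0)
# Perfectly correlated fair bits: beta = phi = 1/2.
print(beta_phi([[0.5, 0.0], [0.0, 0.5]]))          # (0.5, 0.5)
```

For the perfectly correlated fair bits, 2α = 0.5 = β = ϕ, consistent with the inequalities 2α ≤ β ≤ ϕ.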
Estimation of mixing coefficients: naive approach (i)
Question: Given i.i.d. samples (X1, Y1), . . . , (Xm, Ym) from a joint distribution of real-valued (X, Y), can we estimate any of the mixing coefficients?
Define the following empirical c.d.f. estimators of the marginal and joint distributions:
F̂(x) = (1/m) Σ_{i=1}^m I_{Xi ≤ x}
Ĝ(y) = (1/m) Σ_{i=1}^m I_{Yi ≤ y}
Ĥ(x, y) = (1/m) Σ_{i=1}^m I_{Xi ≤ x, Yi ≤ y}
Let β̂ and ϕ̂ be the estimators of β and ϕ based on these empirical c.d.f.'s.
Estimation of mixing coefficients: naive approach (ii)
Theorem (M. Ahsen, M. Vidyasagar, 2013):
ϕ̂ ≥ β̂ = (m − 1)/m → 1 as m → ∞
Justification: Under the empirical probability distribution each sample point has mass 1/m. The marginals are also uniform, and hence the product distribution assigns mass 1/m² to each point of the grid (xi, yj). The conclusion now follows from the above formula for the discrete β.
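The degeneracy is easy to reproduce numerically: treat the m (distinct) sample points as a discrete joint distribution with mass 1/m per atom and apply the discrete β formula. The result is (m − 1)/m regardless of the data (a small sanity-check sketch, not from the talk):

```python
import numpy as np

def naive_beta_hat(m):
    """Discrete beta of the empirical joint of m distinct (x_i, y_i) pairs.

    The empirical joint puts mass 1/m on each of m 'diagonal' grid points,
    while the product of the (uniform) empirical marginals puts mass 1/m^2
    on every point of the m x m grid.
    """
    theta = np.eye(m) / m                      # empirical joint distribution
    gamma = theta - np.full((m, m), 1 / m**2)  # theta - mu (x) nu
    return 0.5 * np.abs(gamma).sum()           # discrete beta formula

for m in (2, 10, 100):
    print(m, naive_beta_hat(m))  # (m - 1)/m: 0.5, 0.9, 0.99
```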
Estimation of mixing coefficients: histograms (i)
A histogram estimator f̂ of a density f based on a sample X1, . . . , Xm is
f̂(x) = Σ_{j=1}^J (pj / (m wj)) I_{Bj}(x)
where
Bj's are the bins partitioning the region containing the observations,
pj = Σ_{i=1}^m I_{Bj}(Xi) counts the number of samples in bin Bj,
wj is the width of the j-th bin.
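A minimal sketch of this estimator (equal-width bins for simplicity; the equal-count binning used on the next slide is a straightforward variation, and all names here are my own):

```python
import numpy as np

def histogram_density(sample, bins):
    """Histogram density estimator: f_hat(x) = p_j / (m * w_j) on bin B_j."""
    sample = np.asarray(sample, dtype=float)
    m = len(sample)
    edges = np.linspace(sample.min(), sample.max(), bins + 1)
    counts, _ = np.histogram(sample, bins=edges)   # p_j
    widths = np.diff(edges)                        # w_j
    heights = counts / (m * widths)

    def f_hat(x):
        j = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, bins - 1)
        return heights[j]

    return f_hat, edges

rng = np.random.default_rng(0)
f_hat, edges = histogram_density(rng.standard_normal(10000), bins=30)
mids = (edges[:-1] + edges[1:]) / 2
# The estimator integrates to 1 by construction:
print(np.sum(np.diff(edges) * f_hat(mids)))  # ~1.0
```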
Estimation of mixing coefficients: histograms (ii)
Given m samples, choose Jm intervals on R so that each bin contains ⌊m/Jm⌋ or ⌊m/Jm⌋ + 1 samples from both X and Y.
Theorem (M. Ahsen, M. Vidyasagar, 2013): Suppose (X, Y) ∼ θ, X ∼ µ and Y ∼ ν, with θ absolutely continuous with respect to µ ⊗ ν. Then β̂ converges to β provided that Jm/m → 0. If, in addition, the density f ∈ L∞, then α̂ and ϕ̂ also converge to α and ϕ respectively.
The measure-theoretic arguments used in the proof establish consistency of the estimators but do not yield error rates.
Estimation of mixing coefficients: stochastic processes (i)
Two-step approximation:
|β̂d(a) − β(a)| ≤ |β̂d(a) − βd(a)| + |βd(a) − β(a)|
where βd(a) = sup_t β(σ_{t−d}^t, σ_{t+a}^{t+a+d}) and β̂d(a) is the plug-in estimator
β̂d(a) = (1/2) ∫ |f̂d ⊗ f̂d − f̂2d|
with f̂d, f̂2d being d- and 2d-dimensional histogram estimators.
Estimation of mixing coefficients: stochastic processes (ii)
Theorem (D. McDonald, C. Shalizi, M. Schervish, 2011): Let X_1^m be a sample from a stationary β-mixing process. For m = 2µm bm and d ≤ µm we have that
P(|β̂d(a) − βd(a)| ≥ ε) ≤ 2 exp(−µm ε1² / 2) + 2 exp(−µm ε2² / 2) + 4(µm − 1)β(bm)
where ε1 = ε/2 − E[∫ |f̂d − fd|] and ε2 = ε − E[∫ |f̂2d − f2d|].
The proof is based on the blocking technique.
Estimation of mixing coefficients: stochastic processes (iii)
For |βd(a) − β(a)|, a measure-theoretic argument can be used to show that this term → 0 as d → ∞.
Under the assumption that the densities fd and f2d are in the Sobolev space H2, McDonald, Shalizi and Schervish argue that f̂d and f̂2d are consistent.
Choosing dm = O(exp(W(log m))) and wm = O(m^{−km}), where
km = (W(log m) + (1/2) log m) / (log m ((1/2) exp(W(log m)) + 1))
and W is the inverse of w ↦ w exp(w) (the Lambert W function), they show that the histogram-based estimator of β is consistent.
Estimation of mixing coefficients: discussion
The results do not provide convergence rates.
High-dimensional histogram estimation may not be accurate.
Instead of estimating β directly, an intermediate density-estimation step is used.
Could estimators based on kernels be used instead of histograms?