Almost optimal sequential detection in multiple data streams

Georgios Fellouris

Department of Statistics

University of Illinois

Joint work with Alexander Tartakovsky

University of Michigan, Ann Arbor, May 13th, 2015

Georgios Fellouris (UIUC) Almost optimal sequential tests July 9, 2012 1 / 49

Outline

1 Simple null against simple alternative

2 A simple null against a finite number of alternatives

3 The continuous-parameter case


Sequential testing of two simple hypotheses

Sequentially acquired observations

$$X_1, \ldots, X_t, \ldots \overset{\text{iid}}{\sim} f.$$

Stop sampling as soon as possible and distinguish between

$$H_0 : f = f_0 \quad \text{and} \quad H_1 : f = f_1.$$

Let $\mathcal{F}_t$ be the history of observations up to time $t$,

$$\mathcal{F}_t = \sigma(X_s : 1 \le s \le t).$$


Wald’s formulation

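Find an $\mathcal{F}_t$-stopping time, $T$, at which to stop sampling and an $\mathcal{F}_T$-measurable r.v., $d_T$, so that

$$\{d_T = 1\} = \{\text{Accept } H_1,\ T < \infty\}, \qquad \{d_T = 0\} = \{\text{Accept } H_0,\ T < \infty\}.$$

A sequential test is such a pair $(T, d_T)$.

Goal: Minimize $E_0[T]$ and $E_1[T]$ in

$$C_{\alpha,\beta} = \{(T, d_T) : P_0(d_T = 1) \le \alpha \ \text{and}\ P_1(d_T = 0) \le \beta\}.$$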


Wald’s SPRT (1945)

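Let $Z_t$ be the log-likelihood ratio of the first $t$ observations:

$$Z_t := \sum_{s=1}^{t} \log \frac{f_1(X_s)}{f_0(X_s)}, \qquad Z_0 := 0.$$

Sequential Probability Ratio Test (SPRT)

Let $\alpha, \beta$ be such that $\alpha + \beta < 1$ and let $A, B > 0$ be fixed thresholds. Define

$$S = \inf\{t \ge 1 : Z_t \notin (-A, B)\}, \qquad d_S = \begin{cases} 0, & \text{if } Z_S \le -A, \\ 1, & \text{if } Z_S \ge B. \end{cases}$$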
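The stopping rule above can be sketched in a few lines of Python. This is only an illustration under assumptions that are not part of the slides: the densities are taken to be f0 = N(0, 1) and f1 = N(1, 1), the helper names `sprt` and `gaussian_log_lr` are hypothetical, and the thresholds are set by Wald's conservative bounds A = |log β|, B = |log α| rather than by the exact calibration.

```python
import math
import random

def sprt(sample, log_lr, A, B, max_steps=100_000):
    """Run an SPRT: accumulate Z_t and stop when it leaves (-A, B).

    sample  -- zero-argument callable returning the next observation
    log_lr  -- function x -> log(f1(x) / f0(x))
    Returns (stopping time, decision), decision 1 = accept H1, 0 = accept H0.
    """
    z = 0.0
    for t in range(1, max_steps + 1):
        z += log_lr(sample())
        if z >= B:
            return t, 1
        if z <= -A:
            return t, 0
    raise RuntimeError("no decision within max_steps")

def gaussian_log_lr(x, theta=1.0):
    # Assumed model: f0 = N(0, 1), f1 = N(theta, 1)
    return theta * x - 0.5 * theta ** 2

alpha, beta = 0.01, 0.01
# Wald's conservative thresholds (an assumption, not the exact calibration)
A, B = abs(math.log(beta)), abs(math.log(alpha))

rng = random.Random(0)
t, d = sprt(lambda: rng.gauss(1.0, 1.0), gaussian_log_lr, A, B)
```

With these thresholds the error constraints hold but are not met with equality, so the expected sample size is somewhat larger than for the exactly calibrated SPRT.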


An exact optimality property (Wald & Wolfowitz (1948))

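Suppose that $A, B$ are selected so that

$$P_0(d_S = 1) = \alpha \quad \text{and} \quad P_1(d_S = 0) = \beta.$$

Then,

$$E_0[S] = \inf_{(T, d_T) \in C_{\alpha,\beta}} E_0[T] \quad \text{and} \quad E_1[S] = \inf_{(T, d_T) \in C_{\alpha,\beta}} E_1[T].$$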


Optimal Asymptotic Performance (Woodroofe, 1982)

Suppose that $\beta |\log\alpha| + \alpha |\log\beta| \to 0$ and $E_i[Z_1^2] < \infty$, $i = 0, 1$.

Then, as $\alpha, \beta \to 0$,

$$E_1[S] = \frac{1}{I_1}\big[\,|\log\alpha| + \rho_1 + \log\delta_1 + o(1)\big],$$

$$E_0[S] = \frac{1}{I_0}\big[\,|\log\beta| + \rho_0 + \log\delta_0 + o(1)\big].$$

$I_1 := D(f_1 \| f_0)$ and $I_0 := D(f_0 \| f_1)$ are the K-L information numbers.

Let $H_1$ be the asymptotic distribution of the overshoot of $Z$ under $P_1$. Then

$$\rho_1 := \int x \, H_1(dx) \quad \text{and} \quad \delta_1 := \int e^{-x} \, H_1(dx).$$

Let $H_0$ be the asymptotic distribution of the overshoot of $-Z$ under $P_0$. Then

$$\rho_0 := \int x \, H_0(dx) \quad \text{and} \quad \delta_0 := \int e^{-x} \, H_0(dx).$$


Sequentially testing a simple null against a finite number of alternatives


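$$X_1, \ldots, X_t, \ldots \overset{\text{iid}}{\sim} f.$$

$$H_0 : f = f_0 \quad \text{vs} \quad H_1 : f \in \{f_1, \ldots, f_M\}.$$

$$\mathcal{F}_t = \sigma(X_s : 1 \le s \le t).$$

Find $(T, d_T)$, where $T$ is an $\mathcal{F}_t$-stopping time and $d_T$ an $\mathcal{F}_T$-measurable r.v. such that

$$\{d_T = i\} = \{\text{Accept } H_i,\ T < \infty\}, \quad i = 0, 1.$$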


1st Motivation: The multichannel problem



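Observations are collected from $K$ independent sources, so that

$$X_t = (X_t^1, \ldots, X_t^K).$$

For every sensor $k$, the true density is $f^k$ and

$$X_1^k, \ldots, X_t^k, \ldots \overset{\text{iid}}{\sim} f^k = \begin{cases} f_0^k, & \text{noise} \\ f_1^k, & \text{signal} \end{cases}$$

We want to test the simple hypothesis

$$H_0 : f^k = f_0^k \quad \forall\, k \in \{1, \ldots, K\}$$

against

$$H_1 : f^k = \begin{cases} f_0^k, & k \notin \mathcal{A} \\ f_1^k, & k \in \mathcal{A} \end{cases}$$

where $\mathcal{A}$ is an unknown subset of $\{1, \ldots, K\}$.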



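The subset $\mathcal{A}$ of sensors carrying signal is known to belong to some class $\mathcal{P}$ of subsets of $\{1, \ldots, K\}$.

Then, $H_1$ contains $M = |\mathcal{P}|$ possibilities, where $|\mathcal{P}|$ is the size of the class $\mathcal{P}$.

When signal can be present in only one sensor, $|\mathcal{P}| = K$.

When signal can be present in at most $L$ sensors, $|\mathcal{P}| = \sum_{k=1}^{L} \binom{K}{k}$.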
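To get a feel for how quickly the number of alternatives grows, the count above can be evaluated directly. The function name `num_alternatives` is hypothetical and only used for this sketch.

```python
from math import comb

def num_alternatives(K, L):
    """Number of nonempty subsets of {1, ..., K} with at most L elements."""
    return sum(comb(K, k) for k in range(1, L + 1))

# Signal in exactly one of 5 sensors, in at most 2, and in any nonempty subset
counts = [num_alternatives(5, 1), num_alternatives(5, 2), num_alternatives(5, 5)]
```

For L = K the count is 2^K - 1, so the number of likelihood-ratio statistics to track grows exponentially when no structure is assumed on where the signal may be.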


2nd motivation: Discretization of a continuous alternative


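$$X_1, \ldots, X_t, \ldots \overset{\text{iid}}{\sim} f \in \{f_\theta,\ \theta \in \Theta\}$$

$$H_0 : \theta = \theta_0 \quad \text{and} \quad H_1 : \theta \in \Theta_1,$$

where $\theta_0 \notin \Theta_1 \subset \Theta$.

Then, we would like to minimize $E_{\theta_0}[T]$ and $E_\theta[T]$ for every $\theta \in \Theta_1$ in

$$C_{\alpha,\beta} = \{(T, d_T) : P_{\theta_0}(d_T = 1) \le \alpha \ \text{and}\ \sup_{\theta \in \Theta_1} P_\theta(d_T = 0) \le \beta\}.$$

Approximating $\Theta_1$ by $\{\theta_1, \ldots, \theta_M\} \subset \Theta_1$ may have (computational) benefits.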



Sequentially testing a simple null against a finite number of alternatives

$P_i$ is the probability measure and $E_i$ the expectation when

$$f = f_i, \quad i = 0, 1, \ldots, M.$$

Goal: Minimize $E_0[T]$ and $E_i[T]$, $i = 1, \ldots, M$,

among sequential tests in

$$C_{\alpha,\beta} = \{(T, d_T) : P_0(d_T = 1) \le \alpha \ \text{and}\ \max_{1 \le i \le M} P_i(d_T = 0) \le \beta\}.$$

This can be done only in an asymptotic sense, i.e., as $\alpha, \beta \to 0$.


Generalized Sequential Likelihood Ratio Test (GSLRT)

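For $i = 1, \ldots, M$ let

$$\Lambda_t^i := \prod_{s=1}^{t} \frac{f_i(X_s)}{f_0(X_s)}, \qquad Z_t^i := \log \Lambda_t^i, \quad t \in \mathbb{N}.$$

GSLRT

Following a maximum likelihood approach, we obtain

$$\hat{S} = \inf\Big\{t \ge 1 : \max_{1 \le i \le M} Z_t^i \notin (-A, B)\Big\},$$

$$\{d_{\hat{S}} = 1\} = \Big\{\max_{1 \le i \le M} Z_{\hat{S}}^i \ge B\Big\}, \qquad \{d_{\hat{S}} = 0\} = \Big\{\max_{1 \le i \le M} Z_{\hat{S}}^i \le -A\Big\},$$

where $A, B > 0$ are fixed thresholds.

Studied by Tartakovsky et al. (2003).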


Weighted Sequential Likelihood Ratio Test (WSLRT)

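Recall that

$$\Lambda_t^i := \prod_{s=1}^{t} \frac{f_i(X_s)}{f_0(X_s)}, \qquad Z_t^i := \log \Lambda_t^i, \quad t \in \mathbb{N}.$$

We will call $q = (q_1, \ldots, q_M)$ a weight if $q_i > 0$ for every $i$. We will write:

$$\Lambda_t(q) := \sum_{i=1}^{M} q_i \Lambda_t^i \quad \text{and} \quad Z_t(q) := \log \Lambda_t(q).$$

WSLRT

$$\tilde{S} = \inf\{t \ge 1 : Z_t(q) \notin (-A, B)\},$$

$$\{d_{\tilde{S}} = 1\} = \{Z_{\tilde{S}}(q) \ge B\}, \qquad \{d_{\tilde{S}} = 0\} = \{Z_{\tilde{S}}(q) \le -A\}.$$

An idea that goes back to Wald (1945) in the case of a continuous parameter.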


Weighted GSLRT

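In order to treat the maximizing and the averaging approach similarly, let $q = (q_1, \ldots, q_M)$ be a weight. We will write:

$$\Lambda_t(q) := \max_{1 \le i \le M} \big(q_i \Lambda_t^i\big) \quad \text{and} \quad Z_t(q) = \log \Lambda_t(q).$$

Weighted GSLRT (WGSLRT)

$$\hat{S} = \inf\{t \ge 1 : Z_t(q) \notin (-A, B)\},$$

$$\{d_{\hat{S}} = 1\} = \{Z_{\hat{S}}(q) \ge B\}, \qquad \{d_{\hat{S}} = 0\} = \{Z_{\hat{S}}(q) \le -A\}.$$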
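The weighted-sum (mixture) and weighted-max statistics differ only in how the M weighted likelihood ratios are combined, which a short sketch makes concrete. The function names `weighted_stats` and `run_test`, the Gaussian log-likelihood ratios and the numerical thresholds are all assumptions made for illustration; the mixture is computed with a log-sum-exp to avoid overflow.

```python
import math

def weighted_stats(z, q):
    """Given z[i] = Z_t^i and weights q[i] > 0, return
    (log sum_i q_i exp(z_i), log max_i q_i exp(z_i))."""
    terms = [math.log(qi) + zi for qi, zi in zip(q, z)]
    m = max(terms)
    mixture = m + math.log(sum(math.exp(x - m) for x in terms))  # log-sum-exp
    return mixture, m

def run_test(observations, log_lrs, q, A, B, use_max=False):
    """Stop when the chosen statistic leaves (-A, B).
    Returns (time, decision) or None if the data run out first."""
    z = [0.0] * len(log_lrs)
    for t, x in enumerate(observations, start=1):
        z = [zi + f(x) for zi, f in zip(z, log_lrs)]
        mixture, mx = weighted_stats(z, q)
        stat = mx if use_max else mixture
        if stat >= B:
            return t, 1
        if stat <= -A:
            return t, 0
    return None

# Assumed alternatives: N(1, 1) and N(2, 1) against N(0, 1)
log_lrs = [lambda x: x - 0.5, lambda x: 2.0 * x - 2.0]
q = [0.5, 0.5]
data = [2.0] * 5
mixture_result = run_test(data, log_lrs, q, A=4.0, B=3.5)
max_result = run_test(data, log_lrs, q, A=4.0, B=3.5, use_max=True)
```

Because the weighted sum is never smaller than the weighted maximum, the mixture statistic crosses a common upper threshold no later than the max statistic; in this toy run it stops one observation earlier.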


Controlling the error probabilities

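For any given $\alpha, \beta \in (0, 1)$, the WSLRT $\tilde{S}$ and the WGSLRT $\hat{S}$ belong to $C_{\alpha,\beta}$ when $A, B$ are chosen so that

$$A = |\log\beta| + \log\Big(\max_{1 \le k \le M} q_k\Big) \quad \text{and} \quad B = |\log\alpha| + \log\Big(\sum_{k=1}^{M} q_k\Big).$$

Suppose also that each $Z^i$ has a non-arithmetic distribution. Then, $P_0(d_S = 1) \sim \alpha$ when

$$B = |\log\alpha| + \log\Big(\sum_{i=1}^{M} q_i \delta_i\Big).$$

Let $H_i$ be the limiting distribution of the overshoot of the random walk $Z^i$ under $P_i$. That is, if we set

$$T_a^i := \inf\{t : Z_t^i \ge a\},$$

then $H_i$ is the limiting distribution of $Z^i_{T_a^i} - a$ as $a \to \infty$. Then

$$\delta_i := \int e^{-x} \, H_i(dx).$$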
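The two choices of the upper threshold B differ only by the overshoot factors, as the following sketch shows. The function name `upper_threshold` and the numerical values of the weights and of the overshoot constants are hypothetical; in practice the constants would have to be computed for the model at hand.

```python
import math

def upper_threshold(alpha, q, delta=None):
    """B = |log alpha| + log(sum_i q_i)            if delta is None,
       B = |log alpha| + log(sum_i q_i * delta_i)  otherwise."""
    weights = q if delta is None else [qi * di for qi, di in zip(q, delta)]
    return abs(math.log(alpha)) + math.log(sum(weights))

q = [0.5, 0.5]
delta = [0.56, 0.30]          # assumed overshoot constants in (0, 1)
B_bound = upper_threshold(0.01, q)
B_sharp = upper_threshold(0.01, q, delta)
```

Since every overshoot constant is at most 1, the corrected threshold is never larger than the conservative one, which is where the savings in expected sample size come from.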


Asymptotic Expansions under H1

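Suppose further that

each $Z_1^i$ has a finite second moment under $P_i$,

$|\log\alpha| / |\log\beta|$ goes to some constant as $\alpha, \beta \to 0$,

$A, B \to \infty$ so that

$$k_0\, \alpha (1 + o(1)) \le P_0(d_S = 1) \le \alpha (1 + o(1)),$$

$$k_1\, \beta (1 + o(1)) \le \max_{1 \le i \le M} P_i(d_S = 0) \le \beta (1 + o(1))$$

for some $k_0, k_1 \in (0, 1)$.

Then, as $A, B \to \infty$, we have for the WSLRT $\tilde{S}$ and the WGSLRT $\hat{S}$

$$E_i[\tilde{S}] = \frac{1}{I_i}\big[B + \rho_i - \log q_i\big] + o(1) = E_i[\hat{S}],$$

where $\rho_i$ is the limiting expected overshoot of $Z^i$ under $P_i$, i.e.,

$$\rho_i := \int x \, H_i(dx) = \lim_{a \to \infty} E_i\big[Z^i_{T_a^i} - a\big], \qquad T_a^i := \inf\{n : Z_n^i \ge a\}.$$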


Uniform Second-Order Asymptotic Optimality under H1

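Suppose that the previous assumptions hold.

If $A, B$ are selected so that the WSLRT $\tilde{S}$ and the WGSLRT $\hat{S}$ belong to $C_{\alpha,\beta}$, then for every $i$ we have

$$E_i[\tilde{S}] = \inf_{(T, d_T) \in C_{\alpha,\beta}} E_i[T] + O(1) = E_i[\hat{S}].$$

However, even first-order asymptotic optimality is lost when $f \notin \{f_1, \ldots, f_M\}$.

If $A, B$ are selected so that $P_0(d_{\tilde{S}} = 1) \sim \alpha \sim P_0(d_{\hat{S}} = 1)$, then

$$E_i[\tilde{S}] = \frac{1}{I_i}\Big[\,|\log\alpha| + \log\Big(\sum_{k=1}^{M} q_k \delta_k\Big) + \rho_i - \log q_i\Big] + o(1) \ge E_i[\hat{S}].$$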


Accuracy of asymptotic approximations

[Figure: three panels, "First Channel", "Second Channel" and "Third Channel", each plotting Expected Sample Size against log10(α) over the range 2.0 to 5.0. Setting: M = 3, exponential distribution, θ1 = 0.5, θ2 = 1, θ3 = 2.]


Asymptotic Optimality under H0

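Let $I_0^i = D(f_0 \| f_i)$ for every $1 \le i \le M$ and

$$I_0 = \min_{1 \le i \le M} I_0^i.$$

If there is a unique $i$ that attains $I_0$, then

$$E_0[S] = \frac{1}{I_0}\big[\,|\log\beta| + O(1)\big].$$

If not,

$$E_0[S] = \frac{1}{I_0}\big[\,|\log\beta| + \Theta\big(\sqrt{\log B}\big)\big].$$

The second-order term is not always constant.

If $A, B$ are selected so that the WSLRT $\tilde{S}$ and the WGSLRT $\hat{S}$ belong to $C_{\alpha,\beta}$, then

$$E_0[\tilde{S}] \sim \inf_{(T, d_T) \in C_{\alpha,\beta}} E_0[T] \sim E_0[\hat{S}].$$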


Remarks

These asymptotic results are based on non-linear renewal theory (Lai and Siegmund ’77, ’79, Woodroofe ’82, Zhang ’88) and on Dragalin et al. (’99, ’00).

They were known for the GSLRT (Tartakovsky et al., 2003).

Here, we have shown that they hold for arbitrary weights (and both tests).

How should one choose these weights?

We will show that a particular choice of weights satisfies an even stronger asymptotic optimality property.


Almost minimax?

What if we select $q$ so that

$$\max_{1 \le i \le M} E_i[S] = \inf_{(T, d_T) \in C_{\alpha,\beta}} \max_{1 \le i \le M} E_i[T] + o(1)?$$

This would require that $E_i[S] = E_j[S] + o(1)$ for every $1 \le i, j \le M$.

However, to have $E_i[S] \sim E_j[S]$ for every $1 \le i, j \le M$, we need

$$E_i[S] \sim \frac{|\log\alpha|}{I_i} \sim \frac{|\log\alpha|}{I_j} \sim E_j[S].$$

This is not possible unless $I_1 = \cdots = I_M$.


Almost optimality with respect to a weighted expected sample size

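Let $p = (p_1, \ldots, p_M)$ be a vector of positive numbers that add up to 1.

We will try to design the proposed tests (the WSLRT $\tilde{S}$ and the WGSLRT $\hat{S}$) so that

$$\sum_{i=1}^{M} p_i E_i[\tilde{S}] = \inf_{(T, d_T) \in C_{\alpha,\beta}} \sum_{i=1}^{M} p_i E_i[T] + o(1) = \sum_{i=1}^{M} p_i E_i[\hat{S}].$$

(We discuss later how to choose the $p_i$'s.)

For this, we need to generalize the class of sequential tests.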


Two Families of Sequential Tests

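Let $q_0, q_1$ be $M$-dimensional vectors of positive numbers.

WSLRT, with $Z_t(q) = \log \sum_{i=1}^{M} q_i \Lambda_t^i$:

$$\tilde{S} = \inf\{t \ge 1 : Z_t(q_1) \ge B \ \text{or}\ Z_t(q_0) \le -A\},$$

$$\{d_{\tilde{S}} = 1\} = \{Z_{\tilde{S}}(q_1) \ge B\}, \qquad \{d_{\tilde{S}} = 0\} = \{Z_{\tilde{S}}(q_0) \le -A\}.$$

WG-SLRT, with $Z_t(q) = \log \max_{1 \le i \le M} (q_i \Lambda_t^i)$:

$$\hat{S} = \inf\{t \ge 1 : Z_t(q_1) \ge B \ \text{or}\ Z_t(q_0) \le -A\},$$

$$\{d_{\hat{S}} = 1\} = \{Z_{\hat{S}}(q_1) \ge B\}, \qquad \{d_{\hat{S}} = 0\} = \{Z_{\hat{S}}(q_0) \le -A\}.$$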


Almost optimality

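Theorem (F. & Tartakovsky 2013)

If $A, B$ are chosen so that $(S, d_S) \in C_{\alpha,\beta}$ and

$$q_1^i = p_i / L_i, \qquad q_0^i = p_i L_i,$$

then as $\alpha, \beta \to 0$ so that $|\log\alpha| \sim |\log\beta|$ we have

$$\sum_{i=1}^{M} p_i E_i[S] = \inf_{(T, d_T) \in C_{\alpha,\beta}} \sum_{i=1}^{M} p_i E_i[T] + o(1).$$

The $L_i$ were introduced by Lorden (1977):

$$L_i := \exp\Big\{-\sum_{n=1}^{\infty} n^{-1}\big[P_0(Z_n^i > 0) + P_i(Z_n^i \le 0)\big]\Big\} = \delta_i I_i.$$

The condition $|\log\alpha| \sim |\log\beta|$ is more restrictive than what we had assumed before.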
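For a concrete model the series defining the L numbers can be summed numerically. The sketch below assumes a Gaussian mean-shift pair, f0 = N(0, 1) against fi = N(θ, 1), which is not taken from the slides; in that case Z_n is normal with mean ∓nθ²/2 and variance nθ², so both probabilities in the series equal Φ(-|θ|√n / 2). The function names and the truncation level `n_terms` are hypothetical.

```python
import math

def std_normal_cdf(x):
    """Standard normal distribution function via the complementary error function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def lorden_L_gaussian(theta, n_terms=2000):
    """Truncated series for L when f0 = N(0, 1) and fi = N(theta, 1).

    P0(Z_n > 0) = Pi(Z_n <= 0) = Phi(-|theta| sqrt(n) / 2), so each term of
    the series is 2 * Phi(-|theta| sqrt(n) / 2) / n.
    """
    s = sum(2.0 * std_normal_cdf(-abs(theta) * math.sqrt(n) / 2.0) / n
            for n in range(1, n_terms + 1))
    return math.exp(-s)

theta = 1.0
L = lorden_L_gaussian(theta)
I = 0.5 * theta ** 2          # K-L information number for this Gaussian pair
delta = L / I                 # uses the identity L = delta * I
```

The terms decay roughly like exp(-nθ²/8), so the default truncation is ample for θ = 1; a much smaller θ would need more terms.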


Ingredients of proof

Formulate a Bayesian problem, in which there is a penalty for a wrong decision under each hypothesis and a cost of sampling, $c$, per observation.

Show that the WSLRT with these particular weights that involve the $L$ numbers (and appropriate thresholds) attains the Bayes risk up to an $o(c)$ term (Lorden (1977)).

A third-order asymptotic expansion for the expected sample size of this rule:

$$E_i[S] = \frac{1}{I_i}\big[\,|\log\alpha| + \rho_i + \log\delta_i + C_i(p)\big] + o(1),$$

where

$$C_i(p) = \log\Big(\sum_{k=1}^{M} \frac{p_k}{I_k}\Big) - \log\Big(\frac{p_i}{I_i}\Big).$$


How to select p?

We have seen that an almost minimax rule does not make sense.

We may design the rule so that

$$\max_{1 \le i \le M} \big(I_i E_i[S]\big) = \inf_{(T, d_T) \in C_{\alpha,\beta}} \max_{1 \le i \le M} \big(I_i E_i[T]\big) + o(1).$$

This is done when $p_i$ is selected $\propto L_i e^{\rho_i}$.

It is not clear whether this is a good criterion.


Robustness

Let $S_i$ be the optimal SPRT for testing $f_0$ against $f_i$ and set

$$J_i[S] := \frac{E_i[S] - E_i[S_i]}{E_i[S_i]}$$

when both tests satisfy, at least approximately, the error probability constraints.

Based on the previous approximations,

$$J_i[S] \approx \frac{C_i(p)}{|\log\alpha| + \rho_i + \log\delta_i}, \quad \text{where} \quad C_i(p) = \log\Big(\sum_{k=1}^{M} \frac{p_k}{I_k}\Big) - \log\Big(\frac{p_i}{I_i}\Big).$$

Setting $p_i \propto I_i$ guarantees that

$$J_i[S] \sim J_j[S] \quad \forall\ 1 \le i \ne j \le M.$$
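The effect of choosing p proportional to the information numbers is easy to check numerically: every ratio p_k / I_k is then the same, so C_i(p) collapses to log M for all i. The function name `C` mirrors the slide's notation, while the information numbers below are made-up values used only for illustration.

```python
import math

def C(p, I):
    """C_i(p) = log(sum_k p_k / I_k) - log(p_i / I_i) for every i."""
    total = sum(pk / Ik for pk, Ik in zip(p, I))
    return [math.log(total) - math.log(pi / Ii) for pi, Ii in zip(p, I)]

I = [0.125, 0.5, 2.0]                      # assumed K-L information numbers
p_prop = [Ii / sum(I) for Ii in I]         # p_i proportional to I_i
p_unif = [1.0 / len(I)] * len(I)           # uniform weights, for comparison

C_prop = C(p_prop, I)
C_unif = C(p_unif, I)
```

With uniform p the penalty is largest for the alternative with the largest information number, whereas the proportional choice spreads the same additive penalty log M across all alternatives.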


Example

Two channels with densities

$$f_0^k(x) = h(x) \quad \text{and} \quad f_1^k(x) = e^{\theta_k x - \psi(\theta_k)}\, h(x), \quad k = 1, 2.$$

Say, $\theta_1 = 4$ (fixed) and let $\theta_2 = x$ vary.


Relative performance loss vs relative signal strength

[Figure: two panels, "First Channel (θ = 4)" and "Second Channel (θ = x)", plotting the relative performance loss (vertical axis, 0.0 to 0.6) against x (horizontal axis, 0 to 8).]

$$L_i \le I_i \le e^{\rho_i} L_i$$


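Sequential testing of a continuous parameter


$$X_1, \ldots, X_n, \ldots \overset{\text{iid}}{\sim} f \in \{f_\theta,\ \theta \in \Theta\}$$

$$H_0 : \theta = \theta_0 \quad \text{and} \quad H_1 : \theta \in \Theta_1,$$

where $\theta_0 \notin \Theta_1 \subset \Theta$.

Then, we would like to minimize $E_{\theta_0}[T]$ and $E_\theta[T]$ for every $\theta \in \Theta_1$ in

$$C_{\alpha,\beta} = \{(T, d_T) : P_{\theta_0}(d_T = 1) \le \alpha \ \text{and}\ \sup_{\theta \in \Theta_1} P_\theta(d_T = 0) \le \beta\}.$$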


A multi-parameter exponential family

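Setup

An exponential family

$$f_\theta(x) := e^{\langle \theta, x \rangle - \psi(\theta)}, \quad x \in \mathbb{R}^d, \quad \theta \in \Theta \subset \mathbb{R}^d.$$

$\Theta = \{\theta \in \mathbb{R}^d : \int e^{\langle \theta, x \rangle}\, \nu(dx) < \infty\}$ is the natural parameter space.

$\psi(\theta) = \log \int e^{\langle \theta, x \rangle}\, \nu(dx)$ is the log-moment generating function of $X$.

We denote by $\nabla\psi(\theta)$ the gradient and by $\nabla^2\psi(\theta)$ the Hessian matrix of $\psi(\theta)$.

We assume that $\nabla^2\psi(\theta)$ is non-singular for all $\theta \in \Theta$.

The Kullback–Leibler information number between $f_{\theta_2}$ and $f_{\theta_1}$ is

$$I(\theta_2, \theta_1) := E_{\theta_2}\Big[\log \frac{f_{\theta_2}(X)}{f_{\theta_1}(X)}\Big] = \langle \theta_2 - \theta_1, \nabla\psi(\theta_2) \rangle - [\psi(\theta_2) - \psi(\theta_1)].$$

$$I(\theta_0, \theta_1) > 0 \quad \forall\ \theta_0 \in \Theta_0,\ \theta_1 \in \Theta_1.$$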
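The closed form for the information number can be checked against a familiar case. The sketch below treats the one-dimensional situation (d = 1) and uses the Poisson family as an assumed example, where the natural parameter is θ = log(rate) and ψ(θ) = exp(θ); the function name `kl_exp_family` is hypothetical.

```python
import math

def kl_exp_family(theta2, theta1, psi, dpsi):
    """I(theta2, theta1) = (theta2 - theta1) * psi'(theta2) - [psi(theta2) - psi(theta1)]
    for a one-parameter exponential family with log-partition function psi."""
    return (theta2 - theta1) * dpsi(theta2) - (psi(theta2) - psi(theta1))

# Poisson family (assumed example): psi(theta) = exp(theta) = psi'(theta)
rate2, rate1 = 3.0, 1.5
info = kl_exp_family(math.log(rate2), math.log(rate1), math.exp, math.exp)

# Textbook Poisson K-L divergence, for comparison
direct = rate2 * math.log(rate2 / rate1) - rate2 + rate1
```

The two values agree, which is a quick sanity check that the gradient is evaluated at the first argument θ2 and not at θ1.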


The one-sided setup

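Suppose that sampling needs to stop only to reject $H_0$.

Then, we need to minimize $E_\theta[T]$ for every $\theta \in \Theta_1$ among stopping times in

$$C_\alpha = \{T : P_0(T < \infty) \le \alpha\}.$$

Let $\ell_n(\theta)$ be the likelihood of the first $n$ observations under $P_\theta$, i.e.,

$$\ell_n(\theta) = \prod_{k=1}^{n} f_\theta(X_k).$$

Open-ended WSPRT and GSLRT

Let $B > 1$ be a fixed threshold and $g$ a positive function on $\Theta_1$. Define

$$S_B(g) = \inf\{t \ge 1 : \Lambda_t(g) \ge B\}, \qquad \Lambda_t(g) = \frac{1}{\ell_t(\theta_0)} \int_{\Theta_1} \ell_t(\theta)\, g(\theta)\, d\theta,$$

$$S_B = \inf\{t \ge 1 : \Lambda_t \ge B\}, \qquad \Lambda_t = \frac{1}{\ell_t(\theta_0)} \sup_{\theta \in \Theta_1} \ell_t(\theta).$$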
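As an illustration of the open-ended GSLRT, consider the following sketch for an assumed Gaussian model: unit-variance observations with mean θ, θ0 = 0, and Θ1 = [lo, hi] with lo > 0. None of these choices come from the slides, and the function name is hypothetical. The log-likelihood ratio is concave and quadratic in θ, so the constrained maximizer is the sample mean clipped to [lo, hi].

```python
import math

def open_ended_glr_gaussian(xs, B, lo, hi):
    """Open-ended GLR rule for N(theta, 1) data, theta0 = 0, Theta1 = [lo, hi].

    B > 1 is a threshold on the likelihood-ratio scale. Returns the first
    time the generalized likelihood ratio reaches B, or None if it never
    does on the supplied data (the rule only stops to reject H0).
    """
    log_B = math.log(B)
    s = 0.0
    for t, x in enumerate(xs, start=1):
        s += x
        theta_hat = min(max(s / t, lo), hi)            # constrained MLE
        log_glr = theta_hat * s - 0.5 * t * theta_hat ** 2
        if log_glr >= log_B:
            return t
    return None

stop_signal = open_ended_glr_gaussian([1.0] * 20, B=100.0, lo=0.5, hi=2.0)
stop_noise = open_ended_glr_gaussian([-1.0] * 20, B=100.0, lo=0.5, hi=2.0)
```

The mixture rule would replace the supremum by a numerical integral of the likelihood ratio against g over Θ1, which is usually the more expensive of the two statistics to evaluate.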


A minimax, second-order property

The weighted idea goes back to Wald (1945).

The GSLRT has been studied by Schwarz (1962), Wong (1968), Lorden (1977), Lai (1988, 2004), etc.

If $\Theta_1$ is a compact set bounded away from 0, both tests attain

$$\inf_{T \in C_\alpha} \sup_{\theta \in \Theta_1} I(\theta, 0)\, E_\theta[T]$$

within an $O(1)$ term as $\alpha \to 0$.

Pollak (1978) proved this result for the WSPRT with any continuous mixing density whose support includes $\Theta_1$ (for a one-parameter exponential family).

Lai (2004) proved this result for the GSLRT.


Almost Minimax WSPRT

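Asymptotic average overshoot

Consider the one-sided SPRT for testing $f_\theta$ versus $f_0$,

$$T_a^\theta := \inf\{t : Z_t^\theta \ge a\}, \quad \text{where} \quad Z_t^\theta := \log \frac{\ell_t(\theta)}{\ell_t(\theta_0)},$$

and define $\kappa_\theta := \lim_{a \to \infty} E_\theta\big[Z^\theta_{T_a^\theta} - a\big]$.

Theorem (F. & Tartakovsky (2013))

Consider the WSPRT $S_B(g)$ with weight function

$$g(\theta) := e^{\kappa_\theta} \sqrt{\det(\nabla^2\psi(\theta)) / I(\theta, 0)}$$

and suppose that $P_0(S_B(g) < \infty) = \alpha$. Then, as $\alpha \to 0$,

$$\sup_{\theta \in \Theta_1} I(\theta, 0)\, E_\theta[S_B(g)] = \inf_{T \in C_\alpha} \sup_{\theta \in \Theta_1} I(\theta, 0)\, E_\theta[T] + o(1).$$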


Idea of Proof

Auxiliary Bayesian problem

Consider the sequential decision problem with

loss 1 when stopping under $P_0$,

sampling cost per observation equal to $c\, I(\theta, 0)$ under $P_\theta$,

conditional prior distribution on $\Theta_1$ given that $\theta \ne 0$ equal to $g$.

The WSPRT $S_B(g)$ is asymptotically Bayes as $c \to 0$ within an $o(c)$ term.

Almost equalizer

As $B \to \infty$,

$$I(\theta, 0)\, E_\theta[S_B(g)] = \log B + \frac{d}{2} \log\log B + C + o(1),$$

where $C$ is a constant term that does not depend on $\theta$.

Idea of proof

“Almost Bayesian + Almost Equalizer = Almost Minimax”


Almost Minimax Weighted GSLRT

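A weighted version of the GSLRT turns out to have the same optimality property.

Recall that

$$\Lambda_n = \sup_{\theta \in \Theta_1} \frac{\ell_n(\theta)}{\ell_n(\theta_0)} = \frac{\ell_n(\hat{\theta}_n)}{\ell_n(\theta_0)},$$

where $\hat{\theta}_n$ is the (constrained on $\Theta_1$) MLE of $\theta$ based on the first $n$ observations.

We define:

$$\hat{S}_B(g) = \inf\{n \ge 1 : \Lambda_n\, g(\hat{\theta}_n) \ge B\},$$

where $g$ is some positive function on $\Theta_1$ and $B > 1$ is a fixed threshold.

Theorem

Consider the WGSLRT $\hat{S}_B(g)$ with $g(\theta) = e^{\kappa_\theta}$. If $P_0(\hat{S}_B(g) < \infty) = \alpha$, then as $\alpha \to 0$

$$\sup_{\theta \in \Theta_1} I(\theta, 0)\, E_\theta[\hat{S}_B(g)] = \inf_{T \in C_\alpha} \sup_{\theta \in \Theta_1} I(\theta, 0)\, E_\theta[T] + o(1).$$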


Remarks

The function $\theta \mapsto \kappa_\theta$ usually does not admit a closed-form expression.

As a result, the previous nearly minimax sequential tests can be implemented only approximately, as the corresponding mixture-based and generalized likelihood ratio statistics can only be computed numerically.

For the two-sided testing problem, we need an almost Bayes rule for exponential families (Lorden (1977)) and we need to consider again the $L$ numbers (Keener (2005)).


Work in progress

Extension to a composite null hypothesis.

Extension to multiple hypotheses.


THE END

THANK YOU!


References

Dragalin, V. P., Tartakovsky, A. G., and Veeravalli, V. V. (2000). “Multihypothesis sequential probability ratio tests - Part II: Accurate asymptotic expansions for the expected sample size.” IEEE Trans. Inform. Theory 46, 1366-1383.

Fellouris, G. and Tartakovsky, A. G. (2012). “Nearly minimax mixture-based open-ended sequential tests.” Sequential Analysis 31, 297-325.

Fellouris, G. and Tartakovsky, A. G. (2013). “Almost optimal sequential tests for discrete composite hypotheses.” Statistica Sinica 23.

Lai, T. L. (1988). “Nearly optimal sequential tests of composite hypotheses.” Ann. Statist. 16, 856-886.

Lai, T. L. and Siegmund, D. (1977). “A nonlinear renewal theory with applications to sequential analysis I.” Ann. Statist. 5, 628-643.

Lai, T. L. and Siegmund, D. (1979). “A nonlinear renewal theory with applications to sequential analysis II.” Ann. Statist. 7, 60-76.

Lorden, G. (1967). “Integrated risk of asymptotically Bayes sequential tests.” Ann. Math. Statist. 38, 1399-1422.

Lorden, G. (1977). “Nearly optimal sequential tests for finitely many parameter values.” Ann. Statist. 5, 1-21.

Schwarz, G. (1962). “Asymptotic shapes of Bayes sequential testing regions.” Ann. Math. Statist. 33, 224-236.

Tartakovsky, A. G., Li, X. R., and Yaralov, G. (2003). “Sequential detection of targets in multichannel systems.” IEEE Trans. Inform. Theory 49, 425-445.

Wald, A. (1947). Sequential Analysis. Wiley, New York.

Wald, A. and Wolfowitz, J. (1948). “Optimum character of the sequential probability ratio test.” Ann. Math. Statist. 19, 326-339.

