Simple Adaptive Size-Exact Testing for Full-Vector and ...

Simple Adaptive Size-Exact Testing for Full-Vector and

Subvector Inference in Moment Inequality Models*

Gregory Cox Xiaoxia Shi

June 10, 2021

Abstract

We propose a simple test for moment inequalities that has exact size in normal mod-

els with known variance and has uniformly asymptotically exact size under asymptotic

normality. The test compares the quasi-likelihood ratio statistic to a chi-squared criti-

cal value, where the degree of freedom is the rank of the inequalities that are active in

finite samples. The test requires no simulation and thus is computationally fast and

especially suitable for constructing confidence sets for parameters by test inversion. It

uses no tuning parameter for moment selection and yet still adapts to the slackness of

the moment inequalities. Furthermore, we show how the test can be easily adapted

to inference on subvectors in the common empirical setting of conditional moment in-

equalities with nuisance parameters entering linearly. User-friendly Matlab code to

implement the test is provided.

Keywords: Moment Inequalities, Uniform Inference, Likelihood Ratio, Subvector Infer-

ence, Convex Polyhedron, Linear Programming

*We acknowledge helpful feedback from Donald Andrews, Isaiah Andrews, Xiaohong Chen, WhitneyNewey, Adam Rosen, Jonathan Roth, Matthew Shum, Jorg Stoye, the participants of the 2nd Economet-rics Jamboree at the UC Berkeley, the 2nd CEMMAP UCL/Vanderbilt Joint Conference, the 2020 WorldCongress of the Econometric Society, the 2021 Winter Meeting of the Econometric Society, and economet-rics seminars at Columbia University, the National University of Singapore, the UCSD, the UCLA, and theUniversity of Wisconsin at Madison.

Department of Economics, National University of Singapore ([email protected])Department of Economics, University of Wisconsin-Madison ([email protected])

1

mailto:[email protected]

mailto:[email protected]

1 Introduction

In the past decade or so, inequality testing has become a mainstream inference method used

for models where standard maximum likelihood or method of moments are difficult to use,

for reasons including multiple equilibria, incomplete data, or complicated dynamic patterns.

In such models, inequalities can often be derived from equilibrium conditions and rational

decision making. Inference can then be conducted by inverting tests for these inequalities

at each given parameter value. That is, by testing the inequalities at each parameter value

and collecting the values at which the test does not reject to form a confidence set.1

Although conceptually simple, conducting inference via test inversion poses considerable

computational challenges to practitioners. This is because, in order to get an accurate

calculation of the confidence set, one needs to test the inequalities at a set of parameter

values that is dense enough in the parameter space. Depending on the application, the

number of values that need to be tested can be astronomical and increases exponentially

with the dimension of the parameter space. Moreover, existing tests often require simulated

critical values that are nontrivial to compute even for a single value of the parameter, let

alone repeated for a large number of parameter values.2

Besides computational challenges, most existing methods for moment inequality models

involve tuning parameter sequences that are required to diverge at a certain rate as the sam-

ple size increases. The threshold in the generalized moment selection procedures (e.g. Rosen

(2008) and Andrews and Soares (2010)) and the subsample size in subsampling-based meth-

ods (e.g. Chernozhukov et al. (2007) and Romano and Shaikh (2012)) are notable examples.3

Appropriate choices often depend on data in complicated ways, and an inappropriate choice

can threaten the validity of the test.

Clearly, there are two ways to ease the computational burden: one is to make the in-

equality test easier for each parameter value, and the other is to reduce the number of

parameter values that need to be tested. We contribute to the literature in both. First, we

1An incomplete list of applications that use inequalities as estimation restrictions includes Tamer (2003),Uhlig (2005), Bajari et al. (2007), Blundell et al. (2007), Ciliberto and Tamer (2009), Beresteanu et al.(2011), Holmes (2011), Baccara et al. (2012), Chetty (2012), Nevo and Rosen (2012), Kawai and Watanabe(2013), Eizenberg (2014), Huber and Mellace (2015), Pakes et al. (2015), Magnolfi and Roncoroni (2016),Sheng (2016), Sullivan (2017), He (2017), Iaryczower et al. (2018), Wollman (2018), Fack et al. (2019), andMorales et al. (2019). For a recent overview of the literature, see for example Ho and Rosen (2017), Canayand Shaikh (2017), and Molinari (2020).

2Existing tests for general moment inequalities with simulated critical values include Chernozhukov et al.(2007), Romano and Shaikh (2008), Andrews and Guggenberger (2009), Andrews and Soares (2010), Bugni(2010), Canay (2010), Romano and Shaikh (2012), and Romano et al. (2014). See Canay and Shaikh (2017)and Molinari (2020) for more references.

3Arguably, the size of a first stage confidence set or the number of simulation/bootstrap draws are alsotuning parameters commonly used to test moment inequalities.

2

propose a simple test for general moment inequalities that requires no simulation. It simply

uses the (quasi-) likelihood ratio statistic (Tn) and a chi-squared critical value, where the

data-dependent degrees of freedom come as a by-product of computing Tn. We call it a

conditional chi-squared test. By not requiring simulation, the test saves computation time

hundreds-fold compared to tests involving simulated critical values, where a statistic needs

to be computed for each simulated sample. For example, in the simulation experiment re-

ported in Section 5.1, our test is about 200-400 times faster than the recommended testing

procedures in Andrews and Barwick (2012) (AB, hereafter) and Romano et al. (2014) (RSW,

hereafter).

Second, we then consider a conditional moment inequality model where the parameter

vector can be partitioned into two subvectors: (θ′, δ′)′. The subvector θ is the parameter of

interest, while δ is the subvector that the researcher is not interested in, commonly referred

to as the nuisance parameter. We specialize to the setting where δ enters the moment

inequalities linearly and propose a version of the conditional chi-squared test for θ. The

subvector test is based on eliminating the nuisance parameters from a system of inequalities.

By eliminating the nuisance parameters, one only needs to consider a grid on the space

of θ, which can be much lower dimensional than the space of (θ′, δ′)′. Thus, the number of

parameter values that need to be tested is drastically reduced. For example, in the simulation

experiment reported in Section 5.2 below, our subvector test uses only 10 seconds to compute

a confidence interval in a specification with a 4-dimensional δ and 32 moment inequalities.

In both contexts, the conditional chi-squared test is simulation and tuning parameter

free. Its critical value is simply the chi-squared critical value with degrees of freedom equal

to the rank of the active moment inequalities, where we call a moment inequality active if

it holds with equality at the restricted estimator of the moments.4 In a normal model with

known variance, the test is shown to have exact size in finite sample. That is, its worst case

rejection probability under the null hypothesis is equal to its nominal significance level. In

an asymptotically normal model, it is shown to be uniformly asymptotically valid. Moreover,

it automatically adapts to the slackness of the moment inequalities despite the absence of a

deliberate moment selection step. In particular, when all but one inequality get increasingly

slack, the test asymptotes to one that ignores all the slack inequalities, which coincides with

the uniformly most powerful test for the limiting model.

The idea of simple chi-squared critical values for testing inequalities appeared as early

as in Bartholomew (1961) and Rogers (1986) for testing one-sided alternatives against a

4Active inequalities are the sample counterpart of binding inequalities, which hold with equality at thepopulation expectation of the moments. An inequality that is not active is referred to as inactive. Aninequality that is not binding is referred to as slack.

3

simple null, but was only recently proved to be valid for a composite null in Mohamad et al.

(2020) in a normal model. We extend Mohamad et al. (2020) in four ways: (a) we allow

an intercept in the inequalities defining the null hypothesis and thus generalize the null

hypothesis from a cone to a polyhedron. This is important for moment inequality models

as, in the limit, the null hypothesis may not be a cone when some inequalities are close to

binding; (b) we design a simple but novel refinement to make the test size-exact; (c) we

prove the test is uniformly asymptotically valid in moment inequality models; and (d) we

show how to feasibly extend the test to the subvector inference context in the presence of

nuisance parameters that enter the moments linearly. Extensions (a)-(c) rely on technical

contributions described in the appendix. We highlight them briefly here, as they may be

useful in other contexts. The finite sample validity of the refinement relies on a careful

partition of the state space (see Lemmas 1 and 2) combined with an inequality on the tail

of the truncated normal distribution (see Lemma 4). The uniform asymptotic validity relies

on a lemma guaranteeing convergence of an arbitrary sequence of polyhedra to a limiting

polyhedron along a subsequence (see Lemma 7).

The idea of eliminating nuisance parameters from linear moment inequalities is first sug-

gested in Guggenberger et al. (2008), where they introduce Fourier-Motzkin elimination, a

classical algorithm for eliminating nuisance parameters from linear inequalities, to the liter-

ature and propose a Wald-type test on the resulting inequalities. Yet two main difficulties

hinder the application of this idea: (a) numerical calculation of the Fourier-Motzkin elimi-

nation in general is an NP-hard computational problem, and (b) the estimated coefficients

in front of the nuisance parameters enter the resulting inequalities via a non-differentiable

function, and could undermine the validity of testing procedures applied directly to them.

The first difficulty is circumvented because the conditional chi-squared test only relies on the

rank of the active inequalities, and results from the convex analysis literature (see Lemmas

12 and 13) allow us to compute the rank of the active inequalities without carrying out

Fourier-Motzkin elimination. The second difficulty is circumvented by considering models

where the moment inequalities hold conditional on a vector of instrumental variables, a class

of models first proposed by Andrews et al. (2019).

Andrews et al. (2019) (hereafter ARP) has the closest setting with our paper. They

propose a test based on the largest standardized sample moment. In the most basic version,

their test uses a conditional critical value from a truncated normal distribution. This basic

version involves no simulation or tuning parameter and as a result is easy to compute.

However, the basic version has poor power properties that prompt them to recommend a

hybrid test. The hybrid test uses a simulated critical value as well as a tuning parameter

that determines the size of a first-stage least favorable test.

4

There are a few papers in the literature that propose methods to mitigate the compu-

tational challenges described above. Kaido et al. (2019) cast the problem of finding the

bounds of the projection confidence interval of each parameter into a nonlinear nonconvex

constrained optimization problem, and provide a novel algorithm to solve this optimiza-

tion problem more efficiently. Our simple inequality test is complementary to Kaido et al.

(2019)’s algorithm in that we make testing for each value hundreds-fold easier while their

algorithm reduces the number of values that need to be tested. Bugni et al. (2017) propose

a profiling method that simplifies computation in the same way as the subvector confidence

set proposed in this paper, by reducing the search from the space of the whole parameter

vector to that of a low dimensional subvector. The difference is that our subvector test, by

taking advantage of the linearity of the model, is much easier to compute than Bugni et al.

(2017)’s test, which applies more generally. Chen et al. (2018) propose a quasi-Bayesian

method that can also be applied to subvector inference in moment inequality models, as well

as a simple method that applies to scalar parameters of interest.

A couple of other papers aim to reduce the sensitivity of testing to tuning parameters. AB

refines the procedure of Andrews and Soares (2010) (AS, hereafter) by computing an optimal

moment selection threshold that maximizes a weighted average power and a size correction.

Using the optimal threshold and the size correction provided in that paper, one no longer

needs to choose a tuning parameter. Computationally, it is the same as AS if one has 10

or fewer moment inequalities and can use the tables of optimal tuning and size correction

values in the paper. It is much more computationally demanding otherwise. RSW replace

the moment selection step of the previous literature with a confidence set for the slackness

parameter and employ a Bonferroni correction to take into account the error rate of this

confidence set. There is still a tuning parameter, the confidence level of the first step, but

this tuning parameter no longer affects the asymptotic size of the test. Computationally,

using the same number of bootstrap draws, it is slightly more costly than AS due to the first-

step confidence set construction. The recommended tests in AB and RSW are our points

of comparison in the simulation experiments in Section 5.1, where we show that our simple

test saves computational cost hundreds-fold, while having competitive size and power.

The remainder of this paper proceeds as follows. Section 2 describes our setup and several

examples. Section 3 describes how to implement the full-vector and subvector conditional

chi-squared tests. Section 4 states theoretical properties that the tests have. Section 5

reports the simulation results. Section 6 concludes. An appendix contains the proofs and

additional results.

5

2 Setup and Examples

This section describes the setup for full-vector and subvector moment inequality testing,

together with several examples.

2.1 Moment Inequality Model: Full-Vector Inference

Consider a dm-dimensional moment function, m(Wi, θ), that depends on a vector parameter

of interest, θ. Let Θ denote the parameter space for θ, and denote the data by Wini=1 with

joint distribution F . We assume the moments satisfy a vector of linear inequalities given by

AEFmn(θ) ≤ b, (1)

where A is a dA × dm matrix, b is a dA × 1 vector, and mn(θ) = n−1∑n

i=1m(Wi, θ). The

moment inequalities identify the true parameter value up to the identified set,5

Θ0(F ) = θ ∈ Θ : AEFmn(θ) ≤ b. (2)

The specification of a moment inequality model given by (1) is very general. Other papers

in the moment inequality literature, such as AS, specify moment inequalities of the form

EFm1(Wi, θ) ≤ 0 and EFm2(Wi, θ) = 0, (3)

where m1(Wi, θ) denotes a dm1-vector of moments that satisfy inequalities and m2(Wi, θ)

denotes a dm2-vector of moments that satisfy equalities. By including a coefficient matrix A

and an intercept b, (1) covers the specification in (3) with b = 0 and

A =

Idm10dm1×dm2

0dm2×dm1−Idm2

0dm2×dm1Idm2

, (4)

where dA = dm1 + 2dm2 . Introducing A and b is convenient because it allows us to succinctly

cover both equalities and inequalities. It also readily accommodates models with upper and

lower bounds with a deterministic gap in between.6 Below we assume the variance-covariance

matrix of the moments is invertible. Introducing A and b is useful because it specifies the

5The quantities A and b may depend on θ and the sample size n, a dependence that we keep implicitfor simplicity unless otherwise needed. If the dependence is made explicit, the formula for Θ0(F ) becomesθ ∈ Θ : A(θ)EFmn(θ) ≤ b(θ).

6For example, E[Wn] − 1 ≤ θ ≤ E[Wn] can be written in our notation with m(w, θ) = θ − w, A =(

1−1

)and b = ( 0

1 ).

6

inequalities as a linear combination of a “core” set of moments, and only the core set of

moments needs to have an invertible variance-covariance matrix.

Moment inequalities have become widely used in practice as the reference list given in

the first paragraph of the introduction shows. We mention two recent examples here.

Example 1. He (2017) uses a moment inequality model to estimate preferences of applicants

in a school admission problem under a matching mechanism called the Boston mechanism.

To fix ideas, consider a simple case with 3 schools, a, b, and c. Each applicant i submits

a rank-ordered list (r1i , r

2i , r

3i ) to the mechanism. The Boston mechanism first assigns as

many applicants as possible to the top-ranked school in their list while respecting the capacity

constraints of the schools. The unassigned applicants are considered by their second-ranked

school for the remaining school seats, if any. The process continues until all seats are filled or

all students assigned. The Boston mechanism is not strategy-proof in that applicants, instead

of submitting their true preference ranking, can benefit from an untruthful rank-ordered list.

He (2017) aims to answer an important policy question: does switching to a strategy-

proof mechanism make the less sophisticated applicants better off? The question necessitates

He to allow in his model less sophisticated applicants who do not form correct beliefs about

admission probabilities. He allows them to form individualized beliefs which become incidental

parameters that preclude full identification. However, He shows that the model can uniquely

predict the probability of some rank-ordered lists and bound that of other rank-ordered lists

using functions that do not involve beliefs. For example,

Pr((r1i , r

2i , r

3i ) = (0, 0, 0)) = Pr(uia < ui0, uib < ui0, uic < ui0) or equivalently (5)

E[1(r1i , r

2i , r

3i ) = (0, 0, 0)] = g000(θ) :=

∫1ua < u0, ub < u0, uc < u0dF (u0, ua, ub, uc|θ),

where uis is the utility of being admitted to school s, ui0 is the utility of the outside option, and

F (·|θ) is the joint distribution of (ui0, uia, uib, uic) assumed to be known up to the parameter

θ. Also,

Pr((r1i , r

2i , r

3i ) = (a, b, c)) ≤ Pr(uia ≥ ui0, uia ≥ minuib, uic), or equivalently (6)

E[1(r1i , r

2i , r

3i ) = (a, b, c)] ≤ gabc(θ) :=

∫1ua > u0, ua > minub, ucdF (u0, ua, ub, uc|θ),

where the preference restriction on the right-hand side of the first line means that school a

cannot be unacceptable (i.e., worse than the outside option) or be the least favorite. While

everyone who submits (a,b,c) must have those preferences, not everyone who has such pref-

erences will submit (a,b,c). For example, if an applicant expects the admission probability

7

at school a is too low even when it is top-ranked, she may rank it bottom. Thus we have

an inequality instead of an equality in (6). Observing a data set of r1i , r

2i , r

3i , one can use

moment equalities and inequalities like (5) and (6) to conduct inference on θ. In this exam-

ple, our formulation (1) using the A matrix and the b intercept is particularly useful because

there are equalities and inequalities, the probabilities on the left-hand side sum up to one,

and the probabilities may have both upper and lower bounds.

Example 2. Morales et al. (2019) use moment inequalities to estimate a model of interna-

tional trade to quantify the importance of extended gravity, which is the dependence of an

exporting firm’s entry cost to a new market on its previous exporting to similar markets.

Due to extended gravity, the set of markets the firm exports to has important dynamic im-

plications for future entry costs. Thus, it becomes necessary to consider a dynamic discrete

choice model with the choice set of each firm being the power set of potential markets. The

sheer size of this choice set makes it difficult, if not impossible, to estimate the model using

traditional maximum likelihood methods.

Morales et al. (2019) form moment inequalities by comparing the equilibrium profit with

the profit after a single-period perturbation from the optimal strategy. More specifically, they

obtain moment inequalities of the form

E[(πijj′t + δπijj′t+1)I(Zit)1dijt(1− dij′t) = 1] ≥ 0, (7)

where i is the firm index, j and j′ are indices for destination markets, t is time, dijt = 1

indicates that firm i exports to market j at time t and dijt = 0 indicates otherwise, πijj′t

is the loss of static profit at time t due to switching the time t export from country j to j′,

πijj′t+1 is the loss of static profit at time t+1 due to the same switch, δ is the discount factor,

Zit is a vector of instrumental variables, and I(Zit) is a vector of nonnegative functions of

Zit. Profit is parameterized to reflect revenue and costs with the extended gravity parameters

as part of the cost function. Morales et al. (2019) use the AS test on a grid of parameter

values to compute a joint confidence set for a 5-dimensional parameter.

2.2 Conditional Moment Inequality Model: Subvector Inference

We next consider a conditional moment inequality model with nuisance parameters entering

linearly. The model has three differences from the full-vector setup: (a) there are nuisance

parameters, denoted by δ, that enter the moments linearly, (b) the inequalities hold condi-

tionally on exogenous variables, Zini=1, and (c) the coefficients on δ depend only on the

8

exogenous variables, Zini=1. In mathematical terms, the model is given by

EFZ [BZmn(θ)− CZδ|Z] ≤ dZ , a.s. (8)

where Z = Zini=1 is a sample of instrumental variables (each Zi is taken to be a subvector

of Wi without loss of generality), BZ , CZ , and dZ are k×dm, k×p, and k×1 matrices, δ is a

vector of unknown nuisance parameters, θ is a vector of unknown parameters of interest, and

FZ denotes the conditional distribution of Wini=1 given Zini=1. The subscript Z is used to

denote dependence on Z1, ..., Zn. The quantities BZ , CZ , and dZ are also allowed to depend

on θ and the sample size n (while k and p are fixed), but we keep that implicit for notational

simplicity. Similar to the full-vector case, this model can succinctly cover both equalities and

inequalities by an appropriate choice of BZ , CZ , and dZ , as well as accommodating upper

and lower bounds with a gap between the bounds that depends on Zini=1.

The model (8) was first spotted as an interesting class of models by ARP. It is a special

case of the conditional moment inequality models considered in Andrews and Shi (2013). We

note two special features of this setup: (i) the nuisance parameter δ enters linearly, and (ii)

the coefficients on δ depend only on the exogenous variables Zini=1. ARP recognized that,

while these features significantly restrict the generality of the full-vector model, they are

common in many empirical models: exogenous covariates are frequently used to incorporate

heterogeneity and/or to control for confounders. We use these features to develop a con-

ditional subvector inference procedure, which, like our full-vector test, is tuning parameter

and simulation free. Here we describe three examples of models that fit this setup.

Example 3. Manski and Tamer (2002) consider an interval regression model:

Y ∗i = X ′iθ + Z ′ciδ + εi, (9)

where Y ∗i is a dependent variable, Xi is a vector of possibly endogenous variables, Zci is a

vector of exogenous covariates including the constant. There is a vector of excluded instru-

mental variables Zei that satisfies E[εi|Zi] = 0, where Zi = (Z ′ci, Z′ei)′. The outcome Y ∗i is

not observed. Instead, YLi and YUi are observed such that Y ∗i ∈ [YLi, YUi]. The imperfect

observation of Y ∗i may be caused by missing data or survey design where respondents are

given a few brackets to select from instead of asked to give a precise answer.

Let I(Zi) be a finite non-negative vector of instrumental functions. Then we have

E

[((YLi −X ′iθ0)I(Zi)

−(YUi −X ′iθ0)I(Zi)

)−

(I(Zi)Z

′ci

−I(Zi)Z′ci

)δ0

∣∣∣∣∣Zi]≤ 0, (10)

9

which yields a model of the form (8) with BZ = I, Wi = (YLi, YUi, X′i, Z

′i)′, m(Wi, θ) =(

(YLi−X′iθ)I(Zi)

−(YUi−X′iθ)I(Zi)

), CZ = n−1

∑ni=1

(I(Zi)Z

′ci

−I(Zi)Z′ci

), and dZ = 0.

Example 4. Gandhi et al. (2019) consider a generalized interval regression model to conduct

inference for an aggregate demand function when observed market shares of differentiated

products have many zero values. Mathematically, the latent inverse demand model is of the

form

ψ(Y ∗i , Xi, θ0) = Z ′ciδ0 + εi, E[εi|Zi] = 0, (11)

where ψ is a known function, but Y ∗i , the expectation of the market share in market i, is

unobserved. Under an assumption on the source of the zeroes, Gandhi et al. (2019) construct

bounds, ψUi (θ0) and ψLi (θ0), such that

E[ψLi (θ0)|Zi] ≤ E[ψ(Y ∗i , Xi, θ0)|Zi] ≤ E[ψUi (θ0)|Zi]. (12)

Then, analogously to the previous example, we have

E

[(ψLi (θ0)I(Zi)

−ψUi (θ0)I(Zi)

)−

(I(Zi)Z

′ci

−I(Zi)Z′ci

)δ0

∣∣∣∣∣Zi]≤ 0, (13)

where I(Zi) is a finite non-negative vector of instrumental functions of Zi = (Z ′ci, Z′ei)′. This

yields a model of the form (8), where BZ = I, Wi contains Zi as well as the variables used to

construct ψLi and ψUi , m(Wi, θ) =(

ψLi (θ)I(Zi)

−ψUi (θ)I(Zi)

), CZ = n−1

∑ni=1

(I(Zi)Z

′ci

−I(Zi)Z′ci

), and dZ = 0.

In Section 5.2, we consider a Monte Carlo example of a special case of this model where

we also provide more details on the bound construction. In the application of Gandhi et al.

(2019), control variables (Zci) are essential for the validity of the instruments.

Example 5. Eizenberg (2014) studies the portable PC market to quantify the welfare effect

of eliminating a product. Central to the question is the fixed cost of providing the product.

Eizenberg uses the revealed preference approach to construct bounds, Li and Ui, for the fixed

cost of product i. Let Zi be a vector of product characteristics (including the constant). One

can consider the following conditional moment inequality model:

E [(Li − P (Zi)′γ0)I(Zi)|Zi] ≤ 0 (14)

E [(−Ui + P (Zi)′γ0)I(Zi)|Zi] ≤ 0,

where P (Zi) is a vector of known functions of Zi and I(Zi) is a vector of nonnegative

instrumental functions. The function P (Zi)′γ0 captures the (observed) heterogeneity of fixed

costs across products. Using our method, one can construct confidence intervals for each

10

element of γ0 and any linear combinations of γ0 such as the average derivative.

Suppose the parameter of interest is the average derivative with respect to the first element

of Zi: θ0 = γ′0P1,n, where P1,n = n−1∑n

i=1 ∂P (Zi)/∂z1. One can rewrite (14) as

E[(Li − θ0)I(Zi)− I(Zi)(P (Zi)− P1,n)′γ0|Zi

]≤ 0

E[(−Ui + θ0)I(Zi) + I(Zi)(P (Zi)− P1,n)′γ0|Zi

]≤ 0, (15)

which falls into the framework of (8) where BZ = I, Wi contains Zi as well as the variables

used to construct Li and Ui, m(Wi, θ) =(

(Li−θ)I(Zi)−(Ui−θ)I(Zi)

), CZ = n−1

∑ni=1

(I(Zi)(P (Zi)−P1,n)

−I(Zi)(P (Zi)−P1,n)

),

and dZ = 0.

Two additional examples that fit into our subvector framework are Katz (2007) and

Wollman (2018) as reviewed in ARP.

3 Conditional Chi-Squared Tests: Implementation

In this section we define a new family of tests, called conditional chi-squared tests, for the

inequalities specified in (1) and (8). They are called conditional chi-squared tests because

they use a critical value that is a quantile of the chi-squared distribution, where the degree

of freedom depends on the active inequalities. We give instructions for implementing the

tests, which shows that they are easy to implement and have low computational cost.

3.1 Full-Vector Tests

We use the inequalities specified in (1) to test hypotheses on θ. Like most papers in the

literature, including AS, AB, and RSW, we conduct inference for the true parameter θ0 by

test inversion. That is, for a given significance level α ∈ (0, 1), one constructs a test φn(θ, α)

for H0 : θ = θ0, where φn(θ, α) = 1 indicates rejection and φn(θ, α) = 0 indicates a failure to

reject. One then obtains the confidence set for θ0 by calculating

CSn(1− α) = θ ∈ Θ : φn(θ, α) = 0. (16)

In practice, CSn(1− α) is calculated by testing H0 : θ = θ0 on a grid of values of θ ∈ Θ.

We introduce two new tests, one being a refinement of the other. Both are easy to

compute, requiring no tuning parameters or simulations. Both use the (quasi-) likelihood

ratio statistic,

Tn(θ) = minµ:Aµ≤b

n(mn(θ)− µ)′Σn(θ)−1(mn(θ)− µ), (17)

11

where Σn(θ) denotes an estimator of VarF (√nmn(θ)), the variance-covariance matrix of the

standardized moments. When Wini=1 is i.i.d., we can take

Σn(θ) =1

n

n∑i=1

(m(Wi, θ)−mn(θ))(m(Wi, θ)−mn(θ))′. (18)

When Wini=1 is not i.i.d., we can define Σn(θ) to account for the clustering or autocorrela-

tion in Wini=1.

Both tests use data-dependent critical values that are based on the rank of the rows of A

corresponding to the inequalities that are active in finite sample. To define them rigorously,

let µ be the solution to the minimization problem in (17). This is the restricted estimator

for the moments. It can be calculated using a quadratic programming algorithm. Let a′j

denote the jth row of A and let bj denote the jth element of b for j = 1, 2, . . . , dA. Let

J = j ∈ 1, 2, . . . , dA : a′jµ = bj, (19)

which is the set of indices for the active inequalities. For a set J ⊆ 1, 2, . . . , dA, let AJ be

the submatrix of A formed by the rows of A corresponding to the elements in J . Let rk(AJ)

denote the rank of AJ , and let r = rk(AJ). Note that for test inversion, µ, J , and r need to

be recalculated for every value of θ.

The critical value of the first simple test is the 100(1−α)% quantile of χ2r, the chi-squared

distribution with r degrees of freedom, denoted by χ2r,1−α. We denote the first simple test

by

φCCn (θ, α) = 1

Tn(θ) > χ2

r,1−α, (20)

where CC stands for “conditional chi-squared” indicating that the test uses the chi-squared

critical value conditional on the active inequalities.7 We show the validity of the CC test

below. The intuition is that Tn(θ) (asymptotically) follows the χ2r distribution conditional on

r when all inequalities are binding (that is, AEFmn(θ) = b), and is stochastically dominated

by the χ2r distribution when some of the inequalities are slack.

The CC test does not reject when r = 0, and rejects with probability at most α when

r > 0. Thus, an upper bound on its (asymptotic) null rejection probability is (1 − Pr(r =

0))α. This shows that the CC test can be somewhat conservative.

7The conditional aspect of our critical value gives it an apparent resemblance with the critical value ofthe conditional test in ARP. However, the resemblance is only superficial. Like any conditional test, what isimportant is the statistic that is conditioned on. That statistic is the set of active inequalities in our case,while it is the second largest standardized sample moment in ARP’s case.

12

We propose a second simple test that eliminates the conservativeness. We call this

the RCC (refined CC) test. We define the RCC test by adjusting the quantile of the χ21

distribution when r = 1. Instead of the 100(1−α)% quantile, the RCC test uses a 100(1−β)%

quantile, where β varies between α and 2α depending on how far from active the additional

(inactive) inequalities are. We now construct β carefully so that the refinement exactly

restores the size of the test.

When r = 1, suppose without loss of generality that the first inequality is active and

satisfies a1 6= 0.8 Next, for each j = 2, ..., dA, let

τj =

√n‖a1‖Σn(θ)(bj−a′j µ)

‖a1‖Σn(θ)‖aj‖Σn(θ)

−a′1Σn(θ)ajif ‖a1‖Σn(θ)‖aj‖Σn(θ) 6= a′1Σn(θ)aj

∞ otherwise, (21)

where ‖a‖Σ= (a′Σa)1/2. This τj is a normalized measure of the inactivity of the jth inequality.

It is essentially bj − a′jµ normalized using the ratio of the Euclidean norms of Σ1/2n (θ)a1 and

Σ1/2n (θ)aj and the angle between the two.9 Then let

τ = infj∈2,...,dA

τj. (22)

This is a measure of the minimum inactivity of the inactive inequalities. This quantity is

easy to compute and has a nice geometric interpretation that is illustrated in Illustration 1

below.

Now we can define

β =

2αΦ(τ) if r = 1

α otherwise,(23)

where Φ(·) is the standard normal cumulative distribution function (cdf). When a second

inequality is close to being active, τ is close to 0 and then β is close to α. When all the other

inequalities are far from active, then τ is very large and β is close to 2α. We define the RCC

test for H0 : θ = θ0 to be

φRCCn (θ, α) = 1Tn(θ) > χ2

r,1−β. (24)

Note that for test inversion, both r and β need to be recalculated for every value of θ since

they may depend on θ via A, b, and Σn(θ).

8In this case, other inequalities may be active too since we do not rule out the possibility that A containsredundant or zero rows. But this is possible only if the other active inequalities are collinear with a1.

9Note that a′1Σn(θ)aj = ‖a1‖Σn(θ)‖aj‖Σn(θ)cos γ, where γ stands for the angle.

13

Since τ ∈ [0,∞], β ∈ [α, 2α]. Thus we have the following comparison of the CC and the

RCC tests:

φRCCn (θ, α/2) ≤ φCC

n (θ, α) ≤ φRCCn (θ, α). (25)

Moreover, when an equality is being tested, at least two inequalities are always active, in

which case we have β = α, and the RCC test reduces to the CC test.

It helps to illustrate the CC and RCC tests in a simple two-inequality example.

Illustration 1. Consider an example where dm = 2, A = I, b = 0, and Σn(θ) = I. We

omit θ from the notation for ease of exposition. Thus, we are testing H0 : EFmn ≤ 0 using

the statistic√nmn, which asymptotically follows a bivariate standard normal distribution.

On the space of√nmn, the rejection region for the CC test is illustrated by the shaded

region in Figure 1. In this example, the likelihood ratio statistic is the squared distance

between√nmn and the third quadrant of the plane. If

√nmn lies in the second or fourth

quadrants of the plane, one inequality is active and the χ21 quantile is used. If

√nmn lies in

the first quadrant of the plane, two inequalities are active and the χ22 quantile is used. The

critical values for the RCC test are illustrated using a dashed line where they deviate from

the CC test.10

From the figure, we can see that the RCC test deviates from the CC test only when the

number of active inequalities is one (in the second and fourth quadrants of the plane). In

that case, a smaller critical value is used that depends on how far from active the other in-

equality is, measured using τ . The quantity τ has the following geometric interpretation: the

point√nµ is the projection of

√nmn onto a face of the polyhedron defined by the inequali-

ties. Continue that line into the interior of the polyhedron until you reach a point, y, that is

equidistant between two inequalities. In the figure, the set of points that are equidistant be-

tween two inequalities is represented by the dotted line, which is the 45-degree line. Then τ is

the distance between√nµ and y. This geometric interpretation extends to more complicated

examples with more inequalities or non-orthogonal inequalities.

The reason the refinement still controls size is that we condition on the event that√nmn

belongs to the ray that starts at y and emanates through√nµ and

√nmn. It is sufficient to

control the conditional rejection probability for every such ray. By conditioning on the ray,

the denominator of the conditional rejection probability is Φ(τ), which allows us to adjust α

up to β.

10The discontinuity in the critical value illustrated in Figure 1 is similar to the discontinuity in the rec-ommended generalized moment selection function (their ϕ(1)) in AB that occurs whenever a moment is atthe threshold of being selected.

14

√nmn

√nµ

yτ

√nmn1

√nmn2

Figure 1: Geometric representation of the CC test (shaded) and the RCC test (dashed) inIllustration 1.

It is also helpful to see the CC tests in a simple model with a scalar parameter of

interest, one upper bound, and one lower bound: E[Y L] ≤ θ ≤ E[Y U ]. This setup has

been considered, for example, in Stoye (2009). For simplicity, suppose Y L and Y U are

independent and have unit variance. Let Y Ln and Y U

n be the sample average of Y L and Y U ,

respectively. Then it is not difficult to find that (when ∆n :=√n(Y U

n − Y Ln ) > −z1−α/2) the

100(1−α)% CC confidence interval is [Y Ln − z1−α/2/

√n, Y U

n + z1−α/2/√n], and also that the

RCC confidence interval is the set of θ values that satisfy Y Ln − z1−αΦ(

√n(Y Un −θ)∨0)/

√n ≤ θ ≤

Y Un +z1−αΦ(

√n(θ−Y Ln )∨0)/

√n, where ∨ is the maximum operator. Solving numerically, we find

that the RCC confidence interval is [Y Ln − cα/

√n, Y U

n + cα/√n] where cα depends on ∆n.

For example, when α = 0.05, cα declines smoothly from 1.96 to 1.67 and then to 1.65 as ∆n

varies from −1.96 to 0 and then to 1. Thus, the refinement brings about a big improvement

in this simple setup.

To end this subsection, Algorithm 1 presents pseudo-code that can be used to compute

the CC and RCC tests. The pseudo-code is implemented in user-friendly Matlab code

provided in the replication files. The implementation requires a tolerance (tol) to account

for numerical imprecision in the quadratic programming used to compute Tn(θ). We use

10−8 in the Monte Carlo simulations.

Remark. Algorithm 1 makes clear some of the convenient features of the implementation

of the CC tests. We list them here for emphasis. (a) The CC tests do not require any tuning

parameters or simulations to implement. (b) The CC tests are simple to code. (c) There

15

Algorithm 1: Pseudo-code for implementing the CC and RCC tests.

1: %Compute the CC Test2: Tn(θ), µ ← minµ:Aµ≤b n(mn(θ)− µ)′Σn(θ)−1(mn(θ)− µ)

3: J := j = 1, . . . , dA : a′jµ = bj4: AJ ← J , A5: r := rk(AJ)6: φCC

n (θ, α) := 1Tn(θ) > maxχ2r,1−α, tol.

7:

8: %Compute the RCC Test9: Implement lines 2-5, and then

10: if r = 1 and χ21,1−2α ≤ Tn(θ) ≤ χ2

1,1−α then11: (suppose a′1µ = b1 and ‖a1‖6= 0)12: for j = 2, . . . , dA do

13: τj :=




∞ otherwise14: end for15: τ := inf2,...,dA τj16: β := 2αΦ(τ)17: φRCC

n (θ, α) := 1Tn(θ) > maxχ21,1−β, tol

18: else19: φRCC


20: end if

is also a third convenient feature of the implementation that is less clear from Algorithm

1, which is that the inequalities do not need to be “reduced” before implementing the test.

Often in practice a collection of inequalities contains redundant inequalities, or inequalities

that are implied by the other inequalities. The CC tests are invariant to the inclusion of

redundant inequalities. In contrast, other tests for moment inequalities, including AS, AB,

and RSW, are not invariant, and thus are improved by reducing the collection of inequalities

by removing the redundant ones before implementing the tests.

3.2 Subvector Tests

Next we use the inequalities in (8) to test hypotheses on θ. For a given value, θ0, testing

H0 : θ = θ0 amounts to testing the following hypothesis:

H0 : ∃δ such that BZEFZ [mn(θ0)|Z]− CZδ ≤ dZ , a.s. (26)

In this subsection, we define subvector versions of the conditional chi-squared tests for (26).

16

Directly testing (26) is difficult because it requires checking the validity of the inequality

for all values of δ. We construct our test using an equivalent form of (26) that eliminates δ:

H0 : AZEFZ [mn(θ0)|Z] ≤ bZ , (27)

for some matrix AZ and vector bZ that are deterministic functions of CZ , BZ , and dZ .

The existence of such a transformation is well-known in the theory of linear inequalities,

dating back to Fourier (1826). It has been noted in the moment inequality literature by

Guggenberger et al. (2008), but has not been used in practice to the best of our knowledge.

One significant obstacle is that calculating AZ and bZ is computationally difficult except

in small dimensions. The key innovation in our approach is to conduct the conditional

chi-squared test on (27) without calculating AZ and bZ , as we describe next.

The subvector CC (sCC) test for (26) is the full-vector CC test based on (27). It uses

the test statistic

Tn(θ) = minµ:AZµ≤bZ

n(mn(θ)− µ)′Σn(θ)−1(mn(θ)− µ), (28)

where Σn(θ) is an estimator of the conditional variance: Σn(θ) = Var(√nmn(θ)|Z), discussed

in more detail below. The critical value of the sCC test is χ2r,1−α, where r is the rank of the

active inequalities, defined as in the full-vector CC test applied to the problem in (28).

The first step to computing Tn(θ) without computing AZ and bZ is to recognize that

Tn(θ) = minδ,µ:BZµ−CZδ≤dZ

n(mn(θ)− µ)′Σn(θ)−1(mn(θ)− µ). (29)

One can calculate Tn(θ) without knowing AZ or bZ by quadratic programming, where (δ′, µ′)′

is the decision variable. Let (δ′, µ′)′ be the solution to the minimization problem.

Before we describe how to compute r without AZ or bZ , we briefly describe what AZ and

bZ are. There are multiple ways to define AZ and bZ for (27) to be equivalent to (26). The

Fourier-Motzkin algorithm noted in Guggenberger et al. (2008) is one of them. Another that

is particularly convenient for our purpose is to take convex combinations of the inequalities.

If we let h ∈ Rk denote a vector of nonnegative weights that sum to one, then the convex

combination of the inequalities in (26) is given by

h′BZEFZ [mn(θ0)|Z]− h′CZδ ≤ h′dZ . (30)

When h′CZ = 0, the δ parameter is eliminated from the inequalities. It follows from Gale’s

17

Theorem11 that it is sufficient to consider the set of all inequalities (30) indexed by

h ∈ H := h ∈ Rk : h ≥ 0, C ′Zh = 0, 1′h = 1. (31)

To connect this result to (27), note that H defines a convex polyhedron in Rk. Every element

of a convex polyhedron is a convex combination of its extreme points, or vertices. Thus, it

is sufficient to consider the vertices of H. That is, a particular value θ0 satisfies (26) if and

only if θ0 satisfies (30) for all h that are vertices of H. Equivalently, if we take H(CZ) to

denote a matrix where each row is a vertex of H, then defining

AZ = H(CZ)BZ and bZ = H(CZ)dZ (32)

renders (27) equivalent to (26). This result is formally stated in Lemma 12 in Appendix C.1.

Thus, to calculate AZ and bZ , we could enumerate the vertices of H. While vertex

enumeration seems simple, it can be computationally challenging when k and/or p are large.

(Experience suggests even moderate values of k and p can lead to computational challenges.)

As noted in various textbooks, including Sierksma and Zwols (2015), there is no polynomial

time algorithm for vertex enumeration available in general. We proceed to describe how to

compute r without AZ or bZ .

To compute r, we define the active inequalities. For any h ∈ H, we say that the inequality

in (30) is active if h′BZ µ = h′dZ , where µ is calculated from (29). Accordingly, let

H0 = h ∈ H : (BZ µ− dZ)′h = 0 (33)

denote the subset of H that characterizes the active inequalities. In fact, H0 is always a

face of H due to the definition of µ. By the definition of r and AZ , r is the maximum

number of linearly independent vectors of the form B′Zh, where h is a vertex of H0. The

key is to recognize that we do not need to enumerate the vertices of H0 to calculate r.

Instead, we only have to calculate the maximum number of linearly independent vectors

in B′ZH0 = B′Zh : h ∈ H0. Notationally, we call the maximum number of linearly

independent vectors in B′ZH0 the “rank of B′ZH0” and denote it by rk(B′ZH0).12 The fact

that r = rk(B′ZH0) is stated formally in Lemma 13 in Appendix C.1.

Therefore, to compute r one only needs to find rk(B′ZH0). It turns out that calculating

the rank of a polyhedron is much faster computationally than enumerating the vertices.

11See Theorem 2.7 in Gale (1960). Gale’s Theorem is considered by some authors (e.g. Bachem and Kern(1992), Theorem 4.1) to be a variant of Farkas’ Lemma, a result that may be familiar to readers who haveworked on nonnegative solutions to linear systems of equations.

12Usually, rk(·) is defined for matrices. Here we extend the definition to arbitrary sets of vectors.

18

Here, we present an algorithm based on solving k + 1 linear programming (LP) problems.

For exposition, we assume rk(BZ) = k, which is true in Examples 3-5, so that the rank of

BZH0 is equal to the rank of H0.13 Calculating the rank of H0 is equivalent to finding the

dimension of the smallest linear subspace containing H0, denoted by span(H0). Note that

H0 is defined by linear equalities and inequalities, where the inequalities are given by h ≥ 0.

Some of these inequalities may have to hold with equality due to the other equations in

the definition of H0. That is, for some j = 1, ..., k, h ∈ H0 may imply that hj = 0. If we

can figure out which of the inequalities have to hold with equality, we can find a system of

equations that defines span(H0), and from there figure out the dimension of span(H0).

Thus, the imminent question becomes: for which j does h ∈ H0 imply hj = 0? For each

j = 1, ..., k, we can answer this question with a LP problem.14 For each j = 1, ..., k, calculate

ζj = minh−hj s.t. h ≥ 0, C ′Zh = 0, (BZ µ− dZ)′h = 0, 1′h = 1. (34)

If ζj = 0, then there does not exist an h ∈ H0 with hj > 0, which means that the jth

inequality has to hold with equality. Let J0 be the collection of all j’s such that ζj = 0. Also

let IJ0 denote the rows of the k-dimensional identity matrix corresponding to indices in J0.

It follows that

span(H0) = h ∈ Rk : IJ0h = 0, C ′Zh = 0, (BZ µ− dZ)′h = 0. (35)

Correspondingly, the rank ofH0 is k minus the rank of the coefficients on the linear equations

defining span(H0):

rk(H0) = k − rk

(IJ0

C′Z(BZ µ−dZ)′

). (36)

This is how we compute r, and hence the sCC test, without computing AZ or bZ .

While implementing the sCC test does not require computing AZ or bZ , this is not the case

for the subvector RCC (sRCC) test. The refinement requires knowing AZ and bZ . However,

note that the refinement makes a difference only when r = 1 and Tn(θ) ∈ [χ21,1−2α, χ

21,1−α]

(because β ∈ [α, 2α]). Thus, to implement the sRCC test, we recommend computing r and

Tn(θ) first using the method outlined above, and only computing AZ , bZ , and the refinement

when r = 1 and Tn(θ) ∈ [χ21,1−2α, χ

21,1−α]. Our experience is that this event is rare when

13Appendix C.2 presents an algorithm for the case that rk(BZ) < k.14Before implementing these LP problems, one should first determine if H0 is empty. This can be done

by solving the LP problem: f := minh−(BZ µ− dZ)′h s.t. h ≥ 0, C ′Zh = 0, 1′h = 1. If f > 0, that indicatesthat all elements of AZ µ − bZ are negative and there is no active inequality. In this case, set H0 = ∅ andr = 0.

19

k and p are large enough to make computing AZ and bZ challenging.15 When k and p are

small, computing AZ and bZ via vertex enumeration is feasible.

Next we give two examples of the conditional variance estimator Σn(θ). The conditional

variance is the appropriate variance matrix to be estimated because the inequalities hold

conditionally on Z and the theoretical properties of the tests are derived using the conditional

distribution of mn(θ0) given Z. We describe two conditional variance matrix estimators, one

for discrete Zi and the other for continuous Zi, both in the context of i.i.d. data.

In the first case, Zi takes on a finite number of values in a set, Z. A straightforward

estimator of Var(√nmn(θ)|Z) is the weighted average of the sample variances of m(Wi, θ)

within each category of Zi:

Σn(θ) =∑`∈Z

n`n

1

n` − 1

n∑i=1

(m(Wi, θ)−m`n(θ))(m(Wi, θ)−m`

n(θ))′1Zi = `, (37)

where n` =∑n

i=1 1Zi = ` and m`n(θ) = 1

n`

∑ni=1m(Wi, θ)1Zi = `. As we show in

Appendix D.2, sufficient conditions for the consistency of this estimator involve boundedness

of the fourth moment of m(Wi, θ) and the assumption that every Zi value occurs twice or

more in the sample Zini=1 eventually. This is the estimator used in our Monte Carlo

simulations in Section 5.2.

In the second case, Zi contains continuous random variables. One can use a nearest

neighbor matching estimator similar to that used for the standard error of a regression

discontinuity estimator in Abadie, Imbens, and Zheng (2014).16 Let ΣZ,n = n−1∑n

i=1(Zi −Zn)(Zi − Zn)′ where Zn = n−1

∑ni=1 Zi. For each i, define the nearest neighbor to be

`Z(i) = argminj∈1,...,n,j 6=i(Zi − Zj)′Σ−1Z,n(Zi − Zj). (38)

When the argmin is not unique, picking one randomly does not affect the consistency of the

resulting estimator. The estimator of Σn(θ) is then given by

Σn(θ) =1

2n

n∑i=1

(m(Wi, θ)−m(W`Z(i), θ))(m(Wi, θ)−m(W`Z(i), θ))′. (39)

As we show in Appendix D.2, sufficient conditions for the consistency of this matching estima-

tor involves the boundedness of Zi∞i=1 and the Lipschitz continuity of Var(m(Wi, θ)|Zi = zi)

15Theoretically, if the moment inequalities are uncorrelated and k of them are binding, then the probabilityr = 1 is asymptotically k2−k. This is an upper bound for Pr(r = 1, Tn(θ) ∈ [χ2

1,1−2α, χ21,1−α]). In finite

sample, the number of near-binding moment inequalities also reduces this probability.16This is also the estimator used in ARP.

20

Algorithm 2: Pseudo-code for the sCC and sRCC tests when rk(BZ) = k.

1: %Compute the sCC Test2: Tn(θ), µ← minδ,µ:BZµ−CZδ≤dZ n(mn(θ)− µ)′Σn(θ)−1(mn(θ)− µ)3: f := minh∈H−(BZ µ− dZ)′h4: if f > tol then5: r := 06: else7: for j = 1, . . . , k do8: hmj ← CZ , BZ , dZ , µ by (34).9: end for

10: J0 := j = 1, . . . , k : hmj = 011: IJ0 ← J0

12: r := k − rk

(IJ0

C′Z(BZ µ−dZ)′

)13: end if14: φsCC


15:

16: %Compute the sRCC Test17: Implement lines 2-13, and then18: if r = 1 and Tn(θ) ∈ [χ2

1,1−2α, χ21,1−α] then

19: H(CZ)← H using a vertex enumeration algorithm, e.g. con2vert.m in Matlab(ref. Kleder (2020))

20: AZ , bZ ← H(CZ)BZ , H(CZ)dZ21: Suppose a′1µ = b1 and ‖a1‖6= 0. %Ignore the subscript Z for notational ease.22: for j = 2, . . . , dA do

23: τj :=




∞ otherwise24: end for25: τ := inf2,...,dA τj26: β := 2αΦ(τ)27: φsRCC

n (θ, α) := 1Tn(θ) > maxχ21,1−β, tol

28: else29: φsRCC


30: end if

in zi.

To end this subsection, Algorithm 2 presents pseudo-code that can be used to compute

the sCC and sRCC tests in the case where BZ has rank k. The pseudo-code is implemented

in user-friendly Matlab code provided in the replication files. The implementation requires

a tolerance to account for numerical imprecision in the quadratic programming used to

compute Tn(θ). We use 10−8 in the Monte Carlo simulations. Note that the sCC tests have

21

the same convenient implementation features listed in the remark on Algorithm 1. The third

feature, that the inequalities do not need to be “reduced” before implementing the tests, is

especially convenient for the sRCC test because the vertex enumeration used to calculate

AZ and bZ often delivers redundant inequalities, and these do not have to be removed before

implementing the sRCC test.

4 Theoretical Properties

Next we consider the theoretical properties of the CC tests. We show that both the full-

vector and subvector RCC tests are size-exact in finite samples under normality and known

variance-covariance matrix. We also show that they are uniformly asymptotically size-exact

when the moments are asymptotically normal. Moreover, we make precise the adaptiveness

of the tests to slackness of the moment inequalities.

4.1 Finite Sample Properties

When the moments are normally distributed with known variance-covariance matrix, the

following theorem states the finite sample properties of the RCC test. We define a reduced

test that only uses a subset of the inequalities. For J ⊆ 1, ..., dA, let φRCCn,J (θ, α) denote

the RCC test defined with AJ and bJ instead of A and b, where bJ denotes the subvector of

b formed by the elements of b corresponding to the indices in J . This test is a useful point

of comparison when the inequalities not in J are very slack.

Theorem 1. Suppose Σn(θ) is an invertible matrix such that√n(mn(θ) − EFmn(θ)) ∼

N(0,Σn(θ)) and Σn(θ) = Σn(θ) a.s. for all θ ∈ Θ. Then the following hold.

(a) For any θ ∈ Θ0(F ), EFφRCCn (θ, α) ≤ α.

(b) If AEFmn(θ) = b and A 6= 0, then EFφRCCn (θ, α) = α.

(c) If J ⊆ 1, ..., dA and θs∞s=1 ⊆ Θ is a sequence such that for all j /∈ J , aj 6= 0 and

(a′jEFmn(θs)− bj)/‖aj‖→ −∞ as s→∞, where the dependence of aj and bj on s via

θs is implicit, then

lims→∞

PrF(φRCCn (θs, α) 6= φRCC

n,J (θs, α))

= 0.

Remarks. (1) Part (a) shows the finite sample validity of the RCC test when the moments

are normally distributed with known variance. Part (b) shows that the RCC test is size-exact

22

when there is an F and θ ∈ Θ0(F ) under which all the inequalities bind.17 By size-exact,

we mean that the worst case null rejection probability is equal to α. This is compatible

with having other F and θ ∈ Θ0(F ) under which the rejection probability is less than α (i.e.

under-rejection). Indeed, the RCC test under-rejects when some inequalities do not bind,

which is a common feature of all moment inequality tests. The under-rejection does not

increase monotonically with the slackness of the slack inequalities, however. Part (c) shows

that when the slack inequalities get very slack, the RCC test reduces to a version of it that

does not use those inequalities. In particular, if some inequalities are binding while others

get very slack, the rejection rate of the RCC test gets close to α. Put another way, the RCC

test adapts to the slackness of the inequalities. We call this property “irrelevance of distant

inequalities” or IDI. This is especially useful if all but one inequality is very slack, because

the reduced test is the one-sided t-test for the sole binding inequality, which is uniformly

most powerful.

(2) Several papers, including Kudo (1963) and Wolak (1987), propose a classical test for

inequalities that can be applied here. The classical test is based on the 1 − α quantile of

the least favorable distribution of Tn(θ), which is a mixture of χ20, χ

21, . . . , χ

2dA

distributions.

This test also has exact size, but lacks the IDI property. When many inequalities tested are

slack, the power of this test can be very low. Besides, the critical value typically requires

simulation, which makes it computationally less attractive than the RCC test.

(3) AS introduced a generalized moment selection procedure that achieves an asymptotic

version of the IDI property via a sequence of tuning parameters. AB consider a test that

simulates a critical value from the asymptotic normal distribution. That test has finite

sample exact size (due to size correction), but it does not have the IDI property. The size

correction the test uses causes it to respond to very slack inequalities, albeit to a lesser extent

than the classical test.

(4) The only other moment inequality test with exact size and the IDI property is the

non-hybrid test in ARP. This test is based on the conditional distribution of the maximum

standardized element of√n(Amn(θ)− b) given the second-largest maximum. The test also

asymptotes to the one-sided t-test when all but one inequality get increasingly slack. How-

ever, the test has undesirable power when multiple inequalities are not well separated, which

prompts them to recommend a hybrid test instead.

(5) Theorem 1 and the other results in this paper are stated in terms of hypothesis tests.

However, they can be extended to results on the coverage probability of confidence sets

defined by test inversion in a standard way. Specifically, under the conditions of Theorem

17Using (25), part (b) also implies that the size of the CC test is between α/2 and α in this case.

23

1(a), we have for all θ0 ∈ Θ0(F ),

PrF(θ0 ∈ CSRCC

n (1− α))≥ 1− α, (40)

where CSRCCn (1 − α) = θ ∈ Θ : φRCC

n (θ, α) = 0 is the confidence set formed by inverting

the RCC test.

(6) The proof of Theorem 1 is challenging. It relies on a careful partition of the space

of realizations of the moments according to which inequalities are active (see Lemmas 1

and 2). It then uses a bound on probabilities of translations of sets to bound the rejection

probability conditional on each set in the partition (see Lemmas 3 and 4). Mohamad et al.

(2020) prove a special case of part (a) for the CC test when the inequalities define a cone.

We extend the result to the RCC test and allow the inequalities to define an arbitrary

polyhedron, an important extension for moment inequality models. In particular, the design

of the refinement is not obvious from Mohamad et. al (2020) and requires a careful study of

the geometric properties of the test statistic.

4.2 Asymptotics

Now we turn to the asymptotic properties of the CC tests when the moments are only

asymptotically normal and the variance-covariance matrix is estimated. For expositional

purposes, we focus on the independent and identically distributed (i.i.d.) data case here,

while results in Appendix B cover more general cases. With i.i.d. data, we can estimate

the variance matrix with Σn(θ) defined in (18). We show that the RCC test has correct

asymptotic size uniformly over a large class of data generating processes.

The following assumption defines the set of data generating processes allowed. Here

|·| denotes the matrix determinant, and ε and M are fixed positive constants that do not

dependent on F or θ.

Assumption 1. For all F ∈ F and θ ∈ Θ0(F ), the following hold.

(a) Wini=1 are i.i.d. under F .

(b) σ2F,j(θ) := VarF (mj(Wi, θ)) > 0 for j = 1, . . . , dm.

(c) |CorrF (m(Wi, θ))|> ε, where CorrF (m(Wi, θ)) is the correlation matrix of the random

vector m(Wi, θ) under F .

(d) EF |mj(Wi, θ)/σF,j(θ)|2+ε≤M for j = 1, . . . , dm.

24

Remarks. (1) This set of assumptions is commonly made in the moment inequality lit-

erature (see e.g. Andrews and Guggenberger (2009), AS, or Kaido et al. (2019)). Part (a)

assumes i.i.d. for simplicity, but is not essential for the results. One can use our method

on data with cluster, spatial, or temporal dependence, after changing Σn(θ) to a variance

estimator that appropriately accommodates the dependence. In that case, the validity of

our procedure follows from Theorem 3 in Appendix B. Part (b) is innocuous as it simply

requires the moment functions be nonconstant in Wi. Parts (a), (b), and (d) together imply

asymptotic normality of the sample moments via a Lyapunov central limit theorem.

(2) Part (c) requires uniform invertibility of the correlation matrix, which is imposed be-

cause we use the inverse of Σn(θ) in the test statistic. While this rules out perfectly correlated

moments and near-perfectly correlated moments, perfectly correlated moments can be han-

dled in specification (1) by an appropriate choice of A and b provided the perfect correlation

is known. For example, in Example 1, suppose one reaches the moment inequalities:

E[1(r1i , r

2i , r

3i ) = (0, 0, 0) − g000(θ)] ≤ 0

E[−1(r1i , r

2i , r

3i ) = (0, 0, 0)+ g000(θ)] ≤ 0

E[1(r1i , r

2i , r

3i ) = (a, b, c) − gabc(θ)] ≤ 0

...

E[1(r1i , r

2i , r

3i ) = (b, 0, 0) − gb00(θ)] ≤ 0

E[1(r1i , r

2i , r

3i ) = (c, 0, 0) − gc00(θ)] ≤ 0. (41)

These moment inequalities are collinear both because the first is the negative of the second

and because the probabilities of all rank-order lists add up to 1. The invertibility requirement

can still be satisfied by defining

m(Wi, θ) =

1(r1

i , r2i , r

3i ) = (0, 0, 0)

1(r1i , r

2i , r

3i ) = (a, b, c)...

1(r1i , r

2i , r

3i ) = (b, 0, 0)

, A =

1 0 ... 0−1 0 ... 00 1 ... 0...

......

0 0 ... 1−1 −1 ... −1

, and b =

g000(θ)

−g000(θ)

gabc(θ)...

gb00(θ)

gc00(θ)− 1

.

Note that m(Wi, θ) is the core set of moments, one for each possible rank-order list, omitting

the last one. This is similar to dealing with perfect multicollinearity in a linear regression

with binary variables.

Let DF (θ) denote the diagonal matrix formed by σ2F,j(θ) : j = 1, . . . , dm. For J ⊆

25

1, . . . , dA, let IJ denote the rows of the identity matrix corresponding to the indices in J .18

The following theorem states the asymptotic properties of the RCC test.

Theorem 2. Suppose Assumption 1 holds.

(a) limsupn→∞

supF∈F supθ∈Θ0(F ) EFφRCCn (θ, α) ≤ α.

For a sequence (Fn, θn) : Fn ∈ F , θn ∈ Θ0(Fn)∞n=1 such that A(θn)DFn(θn) → A∞, for

some matrix A∞ and for all J ⊆ 1, . . . , dA, rk(IJA(θn)DFn(θn)) = rk(IJA∞) for all n,

(b) if A∞ 6= 0 and for all j ∈ 1, ..., dA,√n(a′jEFnmn(θn)− bj)→ 0, then

limn→∞

EFnφRCCn (θn, α) = α, and

(c) if instead there is a J ⊆ 1, . . . , dA such that for all j /∈ J ,√n(a′jEFnmn(θn)− bj)→

−∞ as n→∞, then

limn→∞

PrFn(φRCCn (θn, α) 6= φRCC

n,J (θn, α))

= 0.

Remarks. (1) Part (a) shows that the RCC test is asymptotically uniformly valid. Part

(b) shows that when all the inequalities bind or are sufficiently close to binding, the RCC

test does not under-reject asymptotically, assuming the rank of combinations of rows of A

does not change in the limit. Part (c) shows an asymptotic IDI property of the RCC test: if

some inequalities are very slack, the test reduces to the one based only on the not-very-slack

inequalities.

(2) If θ and F are fixed and A and b do not depend on n, then the condition in part

(c) is satisfied with J equal to the set of all binding inequalities.19 If, in addition, AJ 6= 0,

parts (b) and (c) can be combined to show that the RCC test has exact asymptotic size

(and hence is asymptotically non-conservative). By exact asymptotic size, we mean that

there exists a sequence of θn ∈ Θ0(Fn) such that the limiting rejection probability is equal

to α. (In this case, the sequence is just the fixed sequence with (θn, Fn) = (θ, F ) for all

n.) This is compatible with the possibility that other sequences, θn ∈ Θ0(Fn), have limiting

rejection probability strictly less than α. Indeed, whenever some moment inequalities are

local to binding such that their slackness neither converges to zero nor diverges to infinity,

the limiting rejection probability will be less than α.

18Note that IJA is an alternate notation for AJ .19Technically, since F is the joint distribution of Wini=1, we need the marginal distribution of each Wi

to be fixed.

26

(3) Theorem 2 combines with (25) to imply that the CC test is asymptotically uniformly

valid, and, when the RCC test is asymptotically non-conservative, it can only be conservative

to a limited extent:

α/2 ≤ limsupn→∞

supF∈F

supθ∈Θ0(F )

EFφCCn (θ, α) ≤ α. (42)

(4) The outline of the proof of Theorem 2(a) is conceptually simple. The almost sure

representation theorem is invoked on the convergence of the moments, and then Theorem 1 is

invoked on the limiting experiment. However, the details are quite complicated. A technical

complication that arises is that the rank of the inequalities can be lower in the limit than

in the finite sample. This is handled by adding additional inequalities so the sequence of

polyhedra defined by the inequalities converges to a limiting polyhedron along a subsequence

(see Lemma 7 in the appendix).

4.3 Finite Sample Validity of the sCC and sRCC Tests

The following result states the finite sample properties of the sRCC test assuming normally

distributed moments and a known conditional variance matrix. The result is a corollary of

Theorem 1. Let z be a realization of Z, and let Θ0(Fz) = θ ∈ Θ : ∃δ s.t. BzEFz [mn(θ)|z]−Czδ ≤ dz.20 Let ej denote the RdA-vector with jth element one and all other elements zero.

For any J ⊆ 1, . . . , dA, let φsRCCn,J (θ, α) denote the sRCC test defined using IJAz and IJbz

in place of Az and bz.

Corollary 1. Suppose Σn(θ) is an invertible matrix such that the conditional distribution of√n(mn(θ)− EFZmn(θ)) given Z = z is distributed N(0,Σn(θ)) and Σn(θ) = Σn(θ) a.s. for

all θ ∈ Θ. Then the following hold.

(a) For any θ ∈ Θ0(Fz), EFz [φsRCCn (θ, α)|z] ≤ α.

(b) If AzEFz [mn(θ)|z] = bz and Az 6= 0, then EFz [φsRCCn (θ, α)|z] = α.

(c) If J ⊆ 1, ..., dA and θs∞s=1 ⊆ Θ is a sequence such that for all j /∈ J , e′jAz 6= 0 and

e′j(AzEFzmn(θs)− bz)/‖e′jAz‖→ −∞ as s→∞, where the dependence of Az and bz on

s via θs is implicit, then

lims→∞

PrFz(φsRCCn (θs, α) 6= φsRCC

n,J (θs, α)|z)

= 0.

Remarks. (1) Part (a) shows the finite sample validity of the sRCC test under normality.

Part (b) shows that the sRCC test is size-exact when there is an Fz under which all the

20Technically, z and the objects that are defined given z, including Θ0(Fz), depend on n as well. We keepthis dependence implicit for simplicity.

27

inequalities bind. Part (c) states the IDI property of the sRCC test. Since the sRCC

test rejects whenever the sCC test does, the corollary implies the validity of the sCC test:

EFz [φsCCn (θ, α)|z] ≤ α.

(2) A result on the asymptotic properties of the sRCC test is available in Appendix D. It

relies on the asymptotic normality of the moments conditional on Z1, ..., Zn and a consistent

estimator for Σn(θ).

(3) The condition in part (c) depends on Az = H(Cz)Bz and bz = H(Cz)dz, which are

the inequalities after eliminating the nuisance parameters. It is unclear whether an alter-

native sufficient condition can be formulated that depends only on the original inequalities.

In a model with a scalar parameter of interest, Rambachan and Roth (2020) use a linear

independence constraint qualification to show that the test in ARP reduces to the one-sided

t-test at an endpoint of the identified set. A similar constraint qualification may be helpful in

formulating a sufficient condition for part (c) that depends only on the original inequalities.

(4) The invertibility requirement on Σn(θ) guides the choice of instrumental functions,

I(Zi), in Examples 3-5. In those examples, the instrumental functions are used to increase

the number of moment inequalities in order to sharpen identification. The instrumental

functions in Andrews and Shi (2013) serve the same purpose. Like in Andrews and Shi

(2013), appropriate functions are indicators of cells defined by Zi. However, unlike Andrews

and Shi (2013), we do not recommend using cells of multiple levels of fineness. For example,

when Zi ∈ 0, 12, we do not recommend using both 1Zi = (0, 1)′, 1Zi = (0, 0)′, 1Zi =

(1, 1)′, 1Zi = (1, 0)′ and 1Zi ∈ (0, 1), (0, 0), 1Zi ∈ (1, 0), (1, 1). This is because

the subsequent moments are linearly dependent, causing Σn(θ) to be singular. We thus

recommend choosing a partition of the space of Zi and using the indicator of all cells in

that partition. For example, when Zi ∈ 0, 12, use I(Zi) = (1Zi = (0, 1)′, 1Zi =

(0, 0)′, 1Zi = (1, 1)′, 1Zi = (1, 0)′)′.The need to choose instrumental functions is common in conditional moment inequality

models. A complete cost and benefit analysis is beyond the scope of this paper, but we

can make some general observations. (a) A finer partition yields sharper identification,

meaning that Θ0(Fz) is smaller. (b) A finer partition also means fewer observations per cell,

potentially implying a worse normal approximation. A crude rule of thumb is to ensure that

the smallest cell in the partition contains 15 or more observations.

5 Monte Carlo Simulations

In the previous two sections, we have shown that the CC and RCC tests have a variety

of desirable properties, including convenient implementation features and theoretical results

28

on size and adaptation to slackness. In this section, we use Monte Carlo simulations to

compare the CC and RCC tests to alternative moment inequality tests in terms of size,

power, and computational cost. We consider two sets of Monte Carlo simulations, one to

evaluate the performance of the CC and RCC tests in a general moment inequality model

without nuisance parameters, and the second to evaluate the performance of the sCC and

sRCC tests in Example 4. In these simulations, no test should be expected to dominate any

other in terms of power. Still, we find that the CC and RCC tests are at least competitive

in terms of size and power and dominate in terms of computational cost.

5.1 Full-Vector Simulations

Our first set of simulations takes the generic moment inequality design from AB. This design

allows a variety of correlation structures across moments and thus can approximate a wide

range of applications.

We briefly describe the Monte Carlo design here and refer readers to Section 6 of AB for

further details. Consider the moment inequality model

E[θ −Wi] ≤ 0, (43)

and the null hypothesis H0 : θ = 0, where Wi is a k-dimensional random vector. Let the

data Wini=1 be i.i.d. with sample size n. Let Wi ∼ (µ,Ω), where Ω is a correlation matrix

and µ is a mean-vector. Three choices of Ω are considered: ΩNeg, ΩZero, and ΩPos. For ΩZero,

the moments are uncorrelated. For ΩPos, the moments are positively correlated. For ΩNeg,

some pairs of moments are strongly negatively correlated while other pairs of moments are

positively correlated. The exact numerical specifications of these matrices for different k’s

are in Section 4 of AB and Section S7.1 of the Supplemental Material of AB.

We consider separately cases with k ≤ 10 and cases with k ≥ 10. With k ≤ 10, we

compare the CC and the RCC tests to the recommended tests in AB and RSW. More

specifically, we compare to the bootstrap-based AQLR (adjusted quasi-likelihood ratio) test

in AB and two two-step procedures in RSW, one using their T qlrn statistic and the other using

their Tmaxn statistic.21 With k ≥ 10, we only compare the CC and RCC tests to the RSW

tests as the AB test is no longer computationally feasible. The RSW tests are implemented

using 499 bootstrap draws and with a first-step significance level of 0.005. The AB test is

implemented using 1000 bootstrap draws. These are the recommended values in RSW and

21We use the AB test for comparison because it is tuning parameter free (in the sense that AB proposeand use an optimal choice of the AS tuning parameter), and we use RSW’s two-step tests for comparisonbecause they should be insensitive to reasonable choices of their tuning parameters.

29

AB, respectively.

5.1.1 k ≤ 10

We approximate the size of the tests using the maximum null rejection probability (MNRP)

over a set of µ values that satisfies µ ≥ 0 for each combination of Ω and k. These µ values are

taken from AB, whose calculations suggest that these points are capable of approximating

the size of the tests. We also compute a weighted average power (WAP) for easy comparison.

The WAP is the simple average of a set of carefully chosen points in the alternative space.

We take these points also from AB, who design them to reflect cases with various degrees of

violation or slackness for each of the inequalities. These µ values are given in Section 4 of

AB and Section S7.1 of the Supplemental Material of AB. Besides WAP, we also report size-

corrected WAP, which is obtained by adding a (positive or negative) number to the critical

value where the number is set to make the size-corrected MNRP equal to the nominal level.

Table 1 shows the MNRP and WAP results when Wi is normally distributed with known

Ω. In this case, only the RCC test should have exact size. The CC test should be somewhat

under-sized especially with small k. The results are consistent with these theoretical predic-

tions. The MNRP of the RCC test is within simulation error of 5%, and the MNRP of the

CC test is somewhat below the MNRP of the RCC test. The MNRP’s of the AB and the

RSW1 tests appear to be more different from 5% than the RCC test, while the MNRP of

the RSW2 test is close to 5%.

Table 2 shows the results when Wi is normally distributed with estimated Ω. In this

case, none of the tests have exact size. The RCC test still has very good MNRP at k = 2,

but has noticeably larger MNRP (up to 7.4% from 5%) when k = 10 with Ω = ΩNeg. This

may reflect the difficulty in estimating Ω with a small sample size (n = 100). The AB test

and the RSW1 test continue to have good size, while the RSW2 test now exhibits some

over-rejection when k = 10 and Ω = Ωzero.

In terms of weighted average power, the RCC test has weakly higher ScWAP than both

RSW tests in all but one case in Table 2 (estimated Ω), and in all but two cases in Table 1

(known Ω). The RCC test has higher ScWAP than the AB test in 4 out of 9 cases in both

Table 1 and Table 2. The ScWAP of all the tests, except RSW2, are quite close to each other,

with differences between them no greater than 6 percentage points. The ScWAP of RSW2

is close to the other tests in all cases except when the moments have negative correlations

(Ω = ΩNeg), when they are much lower, especially for k = 10.

On the computational side, the AB test and the RSW1 test are 200-400 times as costly

as the RCC test, and the RSW2 test is 4-9 times as costly as the RCC test, as shown in the

Time columns in the tables. Also note that the AB test is computed using 1000 bootstrap

30

Table 1: Finite Sample Maximum Null Rejection Probabilities and Size-Corrected AveragePower of Nominal 5% Tests (normal distribution, known Ω, n = 100)

k = 10 k = 4 k = 2

Test MNRP WAP ScWAP Time MNRP WAP ScWAP Time MNRP WAP ScWAP Time

Ω = ΩNeg

RCC .051 .61 .61 .003 .052 .62 .62 .003 .051 .62 .62 .003CC .051 .61 .60 .003 .049 .60 .61 .003 .046 .58 .60 .003AB .046 .53 .55 1.11 .051 .59 .59 1.07 .059 .65 .64 1.40

RSW1 .054 .58 .56 .551 .056 .60 .59 .538 .052 .64 .63 .701RSW2 .050 .23 .23 .014 .052 .34 .34 .013 .052 .50 .49 .014

Ω = ΩZero

RCC .052 .63 .63 .003 .052 .65 .65 .003 .051 .68 .68 .003CC .050 .62 .62 .003 .045 .62 .64 .003 .038 .61 .66 .004AB .043 .65 .66 1.08 .050 .67 .67 1.07 .056 .69 .67 1.39

RSW1 .053 .61 .60 .545 .056 .63 .62 .539 .052 .65 .65 .699RSW2 .053 .54 .52 .014 .052 .62 .62 .014 .049 .66 .66 .014

Ω = ΩPos

RCC .051 .76 .75 .003 .053 .75 .74 .003 .051 .72 .71 .003CC .038 .72 .76 .003 .033 .68 .74 .003 .032 .62 .69 .003AB .042 .78 .80 1.05 .051 .75 .75 1.03 .059 .72 .70 1.34

RSW1 .053 .77 .77 .547 .056 .73 .71 .534 .052 .67 .66 .700RSW2 .052 .77 .77 .014 .052 .74 .74 .013 .049 .68 .69 .014

Note: CC, RCC, AB, RSW1 and RSW denote the conditional chi-squared test, the refined CC test, the adjusted quasi-likelihoodratio test with bootstrap critical value in AB, the two-step test in RSW based on the QLR statistic and that based on the Maxstatistic, respectively. MNRP, WAP, ScWAP and Time denote maximum null rejection probability, weighted average power, size-corrected WAP, and average computation time used in seconds in each Monte Carlo simulation. Cases with different k and knowledgestatus of Ω may have been assigned to different machines and their computation times are not comparable. But times across testsare comparable. The AB test and the RSW tests use 1000 and 499 bootstrap draws respectively. The results for the CC, RCC, andRSW2 tests are based on 5000 simulations, while those for the AB and RSW1 tests are based on 2000 simulations for feasibility.

Table 2: Finite Sample Maximum Null Rejection Probabilities and Size-Corrected AveragePower of Nominal 5% Tests (normal distribution, estimated Ω, n = 100)

k = 10 k = 4 k = 2


Ω = ΩNeg

RCC .074 .63 .54 .003 .058 .63 .61 .004 .053 .62 .61 .003CC .074 .63 .54 .003 .058 .61 .59 .004 .048 .59 .60 .003AB .046 .51 .53 1.23 .049 .58 .58 1.56 .056 .64 .63 1.13

RSW1 .054 .55 .53 .569 .053 .58 .58 .736 .050 .62 .62 .533RSW2 .053 .24 .23 .026 .051 .34 .33 .026 .051 .49 .48 .023

Ω = ΩZero

RCC .069 .65 .59 .003 .053 .66 .65 .004 .051 .68 .68 .003CC .069 .64 .57 .003 .049 .63 .63 .004 .039 .61 .66 .003AB .043 .62 .64 1.23 .048 .66 .67 1.55 .053 .68 .67 1.14

RSW1 .052 .58 .58 .570 .053 .62 .61 .732 .050 .64 .64 .541RSW2 .062 .54 .50 .026 .053 .61 .60 .026 .050 .64 .64 .024

Ω = ΩPos

RCC .056 .77 .75 .003 .054 .75 .74 .004 .051 .71 .71 .003CC .043 .73 .74 .003 .034 .68 .73 .004 .035 .63 .69 .003AB .044 .78 .79 1.19 .049 .74 .75 1.50 .055 .71 .70 1.07

RSW1 .053 .76 .75 .566 .053 .71 .71 .731 .052 .66 .66 .528RSW2 .056 .76 .74 .026 .052 .73 .72 .027 .050 .67 .67 .023

Note: Same as Table 1.

31

Table 3: Finite Sample Maximum Null Rejection Probabilities and Size-Corrected AveragePower of Nominal 5% Tests (t(3) distribution, known Ω, n = 100)

k = 10 k = 4 k = 2


Ω = ΩNeg

RCC .057 .61 .58 .003 .054 .62 .60 .003 .046 .62 .64 .003CC .057 .60 .57 .003 .053 .60 .58 .003 .043 .59 .62 .003AB .036 .54 .58 1.08 .048 .64 .64 1.07 .047 .71 .71 1.40

RSW1 .046 .58 .60 .551 .047 .65 .66 .536 .044 .70 .71 .702RSW2 .048 .24 .25 .014 .048 .36 .37 .013 .047 .54 .55 .014

Ω = ΩZero

RCC .059 .63 .59 .003 .050 .66 .66 .003 .047 .69 .70 .003CC .059 .61 .57 .003 .047 .63 .64 .003 .036 .62 .68 .003AB .037 .65 .70 1.07 .046 .72 .73 1.05 .052 .75 .74 1.39

RSW1 .045 .59 .60 .541 .047 .67 .68 .529 .046 .71 .72 .700RSW2 .046 .49 .52 .014 .043 .63 .65 .013 .046 .69 .71 .014

Ω = ΩPos

RCC .054 .76 .75 .003 .051 .75 .75 .003 .047 .72 .74 .003CC .045 .72 .73 .003 .033 .68 .74 .003 .032 .63 .71 .003AB .040 .82 .85 1.05 .049 .81 .82 1.04 .052 .78 .78 1.34

RSW1 .050 .80 .80 .547 .047 .78 .79 .537 .047 .73 .74 .700RSW2 .046 .79 .80 .014 .045 .78 .79 .013 .046 .72 .73 .014

Note: Same as Table 1

Table 4: Finite Sample Maximum Null Rejection Probabilities and Size-Corrected AveragePower of Nominal 5% Tests (mixed normal distribution, known Ω, n = 100)

k = 10 k = 4 k = 2


Ω = ΩNeg

RCC .049 .61 .62 .003 .051 .63 .62 .003 .055 .63 .61 .003CC .049 .60 .61 .003 .049 .61 .61 .003 .051 .59 .59 .003AB .049 .53 .55 1.08 .062 .61 .57 1.07 .069 .65 .61 1.40

RSW1 .055 .58 .56 .550 .063 .62 .59 .539 .063 .64 .60 .701RSW2 .058 .24 .22 .014 .057 .34 .32 .013 .062 .51 .47 .014

Ω = ΩZero

RCC .049 .64 .64 .003 .047 .66 .67 .003 .054 .70 .68 .003CC .046 .62 .64 .003 .043 .63 .65 .003 .041 .62 .66 .003AB .052 .66 .65 1.07 .064 .69 .65 1.05 .068 .70 .65 1.39

RSW1 .053 .61 .60 .540 .064 .65 .61 .529 .063 .67 .62 .699RSW2 .058 .57 .53 .014 .057 .64 .62 .013 .064 .67 .64 .014

Ω = ΩPos

RCC .053 .77 .76 .003 .048 .75 .75 .003 .056 .73 .72 .003CC .039 .73 .76 .003 .033 .68 .74 .003 .037 .64 .69 .003AB .045 .78 .80 1.04 .060 .76 .74 1.03 .069 .73 .68 1.34

RSW1 .051 .77 .77 .547 .058 .74 .72 .533 .063 .69 .64 .700RSW2 .058 .78 .75 .014 .057 .74 .73 .013 .062 .70 .66 .014


32

Table 5: Finite Sample Maximum Null Rejection Probabilities and Size-Corrected AveragePower of Nominal 5% Tests (t(3) distribution, estimated Ω, n = 100)

k = 10 k = 4 k = 2


Ω = ΩNeg

RCC .067 .70 .64 .003 .053 .69 .69 .004 .053 .69 .68 .003CC .067 .69 .64 .003 .053 .68 .67 .004 .046 .66 .67 .003AB .042 .56 .57 1.23 .058 .64 .63 1.56 .060 .70 .68 1.12

RSW1 .057 .60 .59 .569 .059 .66 .64 .732 .057 .69 .67 .529RSW2 .079 .29 .24 .026 .078 .40 .35 .026 .078 .56 .50 .023

Ω = ΩZero

RCC .063 .72 .67 .003 .051 .73 .72 .004 .056 .74 .73 .002CC .063 .70 .66 .003 .048 .70 .70 .004 .041 .68 .71 .003AB .043 .67 .69 1.22 .056 .72 .71 1.55 .066 .74 .69 1.11

RSW1 .057 .64 .63 .566 .059 .69 .67 .730 .057 .71 .68 .528RSW2 .093 .67 .57 .026 .085 .71 .63 .026 .085 .72 .65 .023

Ω = ΩPos

RCC .053 .80 .79 .003 .052 .79 .79 .004 .053 .77 .75 .002CC .043 .77 .78 .003 .037 .73 .78 .004 .032 .69 .75 .003AB .050 .80 .80 1.19 .058 .79 .77 1.50 .064 .77 .72 1.07

RSW1 .059 .79 .77 .565 .059 .77 .75 .728 .057 .73 .71 .528RSW2 .082 .79 .73 .026 .078 .78 .71 .026 .082 .74 .67 .023


Table 6: Finite Sample Maximum Null Rejection Probabilities and Size-Corrected AveragePower of Nominal 5% Tests (mixed normal distribution, estimated Ω, n = 100)

k = 10 k = 4 k = 2


Ω = ΩNeg

RCC .096 .64 .47 .003 .067 .63 .57 .004 .066 .63 .58 .003CC .096 .63 .46 .003 .066 .61 .56 .004 .063 .60 .56 .003AB .047 .49 .50 1.23 .053 .56 .53 1.56 .055 .62 .61 1.12

RSW1 .046 .52 .53 .568 .055 .56 .55 .732 .053 .60 .60 .529RSW2 .060 .24 .22 .026 .055 .33 .31 .026 .060 .48 .46 .023

Ω = ΩZero

RCC .096 .67 .51 .003 .074 .67 .60 .004 .066 .69 .64 .003CC .096 .65 .49 .003 .070 .64 .57 .004 .055 .62 .61 .003AB .046 .58 .59 1.22 .056 .64 .61 1.54 .053 .66 .65 1.11

RSW1 .045 .54 .57 .565 .055 .60 .58 .728 .053 .62 .61 .528RSW2 .086 .52 .41 .026 .072 .58 .51 .026 .067 .61 .56 .023

Ω = ΩPos

RCC .065 .77 .72 .003 .061 .75 .71 .004 .068 .72 .66 .002CC .053 .73 .72 .003 .044 .68 .71 .004 .050 .63 .64 .003AB .039 .75 .80 1.19 .046 .73 .75 1.50 .055 .70 .68 1.07

RSW1 .045 .73 .76 .565 .046 .70 .72 .728 .053 .65 .64 .527RSW2 .066 .74 .68 .026 .059 .70 .67 .026 .063 .65 .60 .023


33

draws while the RSW tests using 499 bootstrap draws. Increasing the number of bootstrap

draws would increase their computational costs proportionally.

The theoretical properties of the CC tests rely in an essential way on the normality or

asymptotic normality of the moments. Thus, we report results when Wi is not normal to

investigate the sensitivity of these simulations to the data distribution. Tables 3 and 5 report

results when Wi has a student t distribution with 3 degrees of freedom, denoted by t(3). This

distribution is chosen to investigate the consequences of thick tails on the tests. Tables 4 and

6 report results when Wi has a mixed normal distribution, taken to be the equal probability

mix of N(−2, 1) and N(2, 4) scaled to have unit variance. This distribution is chosen to

investigate the consequences of a skewed and bimodal distribution on the tests.

Tables 5 and 6 show the results for the tests when Ω is estimated. The size performance

of the RCC test varies somewhat with the data distribution. It has worse MNRPs under

the skewed mixed normal distribution (up to 9.6% for k = 10) than under normality (Table

6), while it has better MNRPs under the t(3) distribution than under normality (Table 5).

It is noteworthy that the bootstrap-based AB, RSW1, and RSW2 tests have lower worst

case MNRP’s across Tables 3-6 (6.9%, 6.4%, 9.3%, respectively, compared to 9.6% for the

RCC test). This may reflect a type of refinement property for the bootstrap, but more

investigation is needed. It also is interesting to note that the over-rejection either disappears

or is greatly reduced when the true Ω is used, as shown in Tables 3 and 4. This seems to

suggest that the non-normality of the sample moments is less of an issue than estimating

the variance matrix in a relatively small sample.

It is worth noting that, in this context, different tests direct power to different alterna-

tives, and there is not a test that is uniformly most powerful. The ScWAP comparison hides

important power variations across different directions. To investigate these power variations,

Figure 2 reports simulated power curves for the tests in the k = 2 case with normally dis-

tributed moments, estimated Ω, and Ω = ΩZero. The power curves are functions of the true

mean vector, µ = (µ1, µ2). When either µ1 or µ2 is negative, the corresponding inequality is

violated, and we expect higher power. As the figure shows, the RCC test has better power

when only one inequality is violated, while the AB and the RSW1 tests have better power

when both inequalities are violated. We expect this pattern extends to cases with more in-

equalities and more general variance-covariance matrices: the AB or RSW1 tests have better

power when all or most of the inequalities are violated, while the RCC test has better power

when few inequalities are violated. If a researcher has a preference for tests that direct power

in a particular direction, they can choose a test based on this pattern. Otherwise, when such

a preference is not present, the RCC test is at least competitive with the other tests in terms

of size and power.

34

Figure 2: Power Curves for 5 Tests (k = 2, normal distribution, Ω = ΩZero, Estimated Ω,n = 100, α = 5%)

−0.3 −0.2 −0.1 0 0.1 0.2 0.30

0.2

0.4

0.6

0.8

1

CCRCCAB

RSW1RSW2

(a) µ1 = −0.3

−0.3 −0.2 −0.1 0 0.1 0.2 0.30

0.2

0.4

0.6

0.8

1CC

RCCAB

RSW1RSW2

(b) µ1 = −0.15

−0.3 −0.2 −0.1 0 0.1 0.2 0.30

0.2

0.4

0.6

0.8

1CC

RCCAB

RSW1RSW2

(c) µ1 = 0

−0.3 −0.2 −0.1 0 0.1 0.2 0.30

0.2

0.4

0.6

0.8

1CC

RCCAB

RSW1RSW2

(d) µ1 = 0.15

Note: CC denotes the conditional chi-squared test, RCC denotes the refined CC test, AB denotes the adjusted quasi-likelihoodratio (AQLR) test with bootstrap critical value in AB, and RSW1 and RSW2 denote the two-step test in RSW based on the QLRstatistic and the Max statistic, respectively. The AB test uses 1000 bootstrap draws and the RSW tests uses 499 bootstrap draws.The results for the CC, RCC, and RSW2 tests are based on 5000 simulations, while the results for the AB and RSW1 tests arebased on 2000 simulations for computational reasons.

35

5.1.2 k ≥ 10

One advantage of the CC and RCC tests is that they remain feasible when the number of

inequalities (k) and the sample size (n) are both large. In this subsection, we investigate the

size and computational time of the RCC, CC, RSW1, and RSW2 tests when both k and n are

large. Table 7 reports the results for four pairs of (k, n): (10, 100), (50, 700), (100, 1600), and

(150, 2550), where the pairs are chosen so that k is approximately proportional to n/log(n).

The message from Table 7 is quite encouraging. The MNRP’s of all the tests appear to

be stable as we move across columns. The computational time of the RCC and CC tests

increases the slowest with k, while that for RSW2 increases the fastest.

None of these tests have been proven to control size asymptotically when k grows with

n at this rate, but these simulations suggest such a result could be formulated, under the

correct assumptions. Intuitively, if the moments are approximately normal, in some sense,

then one can appeal to Theorem 1 as a good approximation. We do not pursue this type

of result here, but note two challenges to keep in mind. On the theoretical side, a theory

of Gaussian approximations for quadratic forms, such as the likelihood ratio statistic, that

covers this high-dimensional case is an open question, to the best of our knowledge. On the

practical side, a consistent covariance matrix estimator can be difficult to find. A potential

way to improve covariance matrix estimation is to assume sparsity or use shrinkage as in

Ledoit and Wolf (2012). It would be interesting to study the theoretical properties of the

CC and RCC tests in settings with many inequalities, but we leave that to future research.

5.2 Subvector Inference in Interval Regression

To investigate the finite sample performance of the subvector CC and RCC tests, we consider

a special case of Example 4, where Y ∗i = s∗i is the probability of an event of interest. For

example, the event can be death by homicide for a random person in county i, or a product

being purchased by a random consumer in market i. For simplicity, a simple logit model is

assumed for the probability: s∗i =exp(X′iθ0+Z′ciδ0+εi)

1+exp(X′iθ0+Z′ciδ0+εi), where εi is the country or market level

unobservable that satisfies E[εi|Zi] = 0. Then (11) holds with

ψ(Y ∗i , Xi, θ) = log(s∗i /(1− s∗i ))−X ′iθ. (44)

The variable s∗i is unobserved, but we observe sN,i, an empirical estimate of s∗i based

on N independent chances for the event of interest to happen. NsN,i follows a binomial

distribution with parameters (N, s∗i ). For example, N could be the population of a county

while sN,i is the homicide rate of the county. We use the method introduced in Gandhi et al.

36

Table 7: Finite Sample Maximum Null Rejection Probabilities of Nominal 5% Tests,the Large k and n Cases (Estimated Ω, Ω = Ωzero)

k = 10, n = 100 k = 50, n = 700 k = 100, n = 1600 k = 150, n = 2550

Test MNRP Time MNRP Time MNRP Time MNRP Time

normal

RCC .069 .003 .074 .004 .076 .011 .081 .024CC .069 .003 .074 .005 .076 .011 .081 .024

RSW1 .052 .582 .062 1.07 .047 3.25 .051 7.96RSW2 .062 .027 .056 .211 .048 1.15 .045 4.04

t(3)

RCC .063 .003 .069 .004 .071 .010 .079 .024CC .063 .003 .069 .005 .071 .011 .079 .024

RSW1 .057 .568 .054 1.07 .050 3.23 .054 7.97RSW2 .093 .026 .069 .210 .061 1.24 .063 4.05

mixed normal

RCC .096 .003 .090 .004 .089 .010 .089 .024CC .096 .003 .090 .005 .089 .010 .089 .024

RSW1 .045 .567 .051 1.06 .059 3.20 .054 7.92RSW2 .086 .026 .069 .210 .065 1.24 .058 4.04

Note: CC denotes the conditional chi-squared test, RCC denotes the refined CC test, and RSW1 andRSW2 denote the two-step test in RSW based on the QLR statistic and the Max statistic, respectively.MNRP denotes maximum null rejection probability, and Time denotes average computation time inseconds for the test in each Monte Carlo repetition. The RSW tests use 499 critical value simulations.The results for the CC, RCC and RSW2 tests are based on 5000 simulations, while the results for theRSW1 tests are based on 2000 simulations for computational reasons.

Table 8: Average Value, Length, and Computation Time (in seconds) of Confidence Intervals

n = 500 n = 1000

CI Excess Length Time CI Excess Length Time

dc = 2, 8 moment inequalities

sRCC [-1.774, -.339] .989 2.5 [-1.609 ,-.440] .723 2.4sCC [-1.780, -.332] 1.00 2.5 [-1.615, -.433] .736 2.4

ARP Hybrid [-1.998, -.264] 1.29 111 [-1.736, -.395] .895 109


sRCC [-1.852, -.293] 1.11 7.1 [-1.659, -.404] .809 4.7sCC [-1.852, -.293] 1.11 7.1 [-1.659, -.404] .809 4.7

ARP Hybrid [-2.219, -.123] 1.65 199 [-1.883, -.287] 1.15 120


sRCC [-1.921, -.254] 1.22 6.5 [-1.718, -.366] .906 10sCC [-1.921, -.254] 1.22 6.5 [-1.718, -.366] .906 10

ARP Hybrid [-2.596, -.011] 2.14 97 [-2.104, -.180] 1.48 145

Note: The identified set for θ0 is [−1.203,−.757]. The computation times across different (n, dc) cases arenot comparable because they may have been performed by different computers on the computer cluster. Thecomputation of different tests within each (n, dc) case is always completed on the same computer. Thus thecomputation times across tests are comparable.

37

(2019) to construct ψLi (θ) and ψUi (θ) based on sN,i. By Gandhi et al. (2019), for N ≥ 100,

the following construction satisfies (12):

ψUi (θ) = log(sN,i + 2/N)− log(1− sN,i + s)−X ′iθ (45)

ψLi (θ) = log(sN,i + s)− log(1− sN,i + 2/N)−X ′iθ,

where s is the smaller of 0.05 and half of the minimum possible value of min(s∗i , 1 − s∗i ).22

We assume that s is known and refer the reader to Gandhi et al. (2019) for practical recom-

mendations regarding s.

Let the endogenous variable Xi be a scalar, and let Zei be a scalar excluded instrument.

Let there be a dc-dimensional exogenous covariate Zci = (1, Zc,2,i, . . . , Zc,dc,i)′. To generate

the data, we let N = 100, εi ∼ minmax−4, N(0, 1), 4, and let the non-constant elements

of Zi be mutually independent Bernoulli variables with success probability 0.5. Let Xi =

1Zei + εi/2 > 0. Also let θ0 = −1, δ0 = (0,−1,0′dc−2)′, parameters chosen so that the

identified set for θ0 does not change when dc is varied from 2 to 4. The value −1 is chosen

to match the typical sign of a price coefficient and to normalize the scale for presentation

purposes. Here since δ0 is the nuisance parameter, dc is the number of nuisance parameters.

Given this data generating process, the lowest and the highest possible values for s∗i are

respectivelyexp(−6)

1 + exp(−6)= 0.0025 and

exp(4)

1 + exp(4)= 0.982.

Thus s = 0.00125. Given s and N , we calculate numerically that the identified set of θ0 is

approximately [−1.203,−0.757]. Details of the calculation are given in Appendix E.1.

For instrumental functions, we use

I(Zi) = (1(Zei, Zc,2,i, . . . , Zc,dc,i) = z)z∈0,1dc . (46)

Thus, when dc = 2 (or 3, 4), there are 4 (or 8, 16) instrumental functions, which give us 8

(or 16, 32) moment inequalities.

We consider 5000 Monte Carlo repetitions. In each repetition, we generate an i.i.d. data

set, sN,i, Xi, Zini=1, for two sample sizes, n = 500 and n = 1000. For each repetition, we

compute an implied confidence interval for the sRCC test and the sCC test. We also include

the hybrid test of ARP (with their recommended tuning parameter and number of simulation

draws) for comparison. For all tests, the CI endpoints are computed with an accuracy to

the third digit. The details for computing the confidence intervals are given in Supplemental

22These bounds are not necessarily sharp, but that is not important for our purpose, which is to investigatethe statistical performance of the sCC and sRCC tests.

38

Appendix E.2.

Table 8 reports the average confidence interval (CI), average excess length (= length of CI

- length of identified set), as well as average computation time for the CI. As the table shows,

the average computation time of the sCC and the sRCC tests are identical to each other up

to 1/10 of a second in all cases. This is mainly because the vertex enumeration required to

compute the refinement is easy to compute for dc = 2, and the refinement is rarely needed

for dc = 3 and dc = 4. The sRCC and sCC tests are faster than the ARP hybrid test in

all cases. The relative computational cost of the ARP hybrid test seems to improve as the

model gets bigger, but it remains more than 14 times as costly as the subvector CC and

RCC tests when dc = 4.

In terms of length, the sRCC and sCC confidence intervals are similar to each other, and

are shorter on average than the ARP hybrid test for all cases. As we move from dc = 2

to dc = 4, the model contains more and more non-informative moment inequalities since

the added control variables do not contribute in the data generating process. All tests are

negatively affected by these non-informative inequalities to various degrees.

Figure 3 reports the rejection rates of the tests for H0 : θ0 = θ plotted against θ values in

[−2.5, 0.5].23 The shaded area indicates the identified set for θ0. As we can see, the rejection

rates for the points in the identified set are less than or equal to 5% in all cases. For dc = 3

and dc = 4, all tests shows some under-rejection at the boundary of the identified set, while

the under-rejection of the ARP hybrid test appears to be less. It is encouraging to see that

the under-rejection does not translate to poor power. The power of the sCC and sRCC tests

are nearly identical to each other and are higher than the power of the ARP hybrid test

except in the area of θ immediately next to the identified set, consistent with the excess

length results in Table 8. However, we note that this comparison is specific to this example,

and the power comparison may change with other examples or data generating processes.

6 Conclusion

This paper proposes the refined conditional chi-squared (RCC) test for moment inequality

models. This test compares a quasi-likelihood ratio statistic to a chi-squared critical value,

where the number of degrees of freedom is the rank of the active inequalities. This test has

many desirable properties, including being tuning parameter and simulation free, adaptive

to slackness, easy to code, and invariant to redundant inequalities. We show that, with an

easy refinement, it has exact size in normal models and has uniformly asymptotically exact

23For all three tests, a point is considered rejected if it is outside the confidence interval. We found in oursimulations that these rejection rates are slightly lower than those obtained directly point by point.

39

Figure 3: Rejection Rates of the sCC, sRCC, and ARP hybrid tests for dc ∈ 2, 3, 4 andfor Sample Size n ∈ 500, 1000 with Nominal Size 5%.

−2.5 −2 −1.5 −1 −0.5 0 0.50

0.2

0.4

0.6

0.8

1sCC

sRCCARP Hyb

(a) dc = 2, n = 500

−2.5 −2 −1.5 −1 −0.5 0 0.50

0.2

0.4

0.6

0.8

1sCC

sRCCARP Hyb

(b) dc = 2, n = 1000

−2.5 −2 −1.5 −1 −0.5 0 0.50

0.2

0.4

0.6

0.8

1sCC

sRCCARP Hyb

(c) dc = 3, n = 500

−2.5 −2 −1.5 −1 −0.5 0 0.50

0.2

0.4

0.6

0.8

1sCC

sRCCARP Hyb

(d) dc = 3, n = 1000

−2.5 −2 −1.5 −1 −0.5 0 0.50

0.2

0.4

0.6

0.8

1sCC

sRCCARP Hyb

(e) dc = 4, n = 500

−2.5 −2 −1.5 −1 −0.5 0 0.50

0.2

0.4

0.6

0.8

1sCC

sRCCARP Hyb

(f) dc = 4, n = 1000

40

size in asymptotically normal models. We also propose a version of the test for subvector

inference with conditional moment inequalities and when the nuisance parameters enter lin-

early. Simulations show the RCC and subvector RCC tests have a computational advantage

over alternatives while being competitive in terms of size and power.

References

Andrews, D. and Barwick, P. (2012). Inference for parameters defined by moment inequali-

ties: A recommended moment selection procedure. Econometrica, 80:2805–2862.

Andrews, D. and Guggenberger, P. (2009). Validity of subsampling and “plug-in asymptotic”

inference for parameters defined by moment inequalities. Econometric Theory, 25:669–709.

Andrews, D. and Soares, G. (2010). Inference for parameters defined by moment inequalities

using generalized moment selection. Econometrica, 78:119–157.

Andrews, D. W. and Shi, X. (2013). Inference based on conditional moment inequalities.

Econometrica, 81:609–666.

Andrews, I., Roth, J., and Pakes, A. (2019). Inference for linear conditional moment in-

equalities.

Baccara, M., Imrohoroglu, A., Wilson, A. J., and Yariv, L. (2012). A field study on matching

with network externalities. American Economic Review, 102:1773–1804.

Bachem, A. and Kern, W. (1992). Linear Programming Duality: An Introduction to Oriented

Matroids. Springer-Verlag Berlin, Heidelberg.

Bajari, P., Benkard, C. L., and Levin, J. (2007). Estimating dynamic models of imperfect

competition. Econometrica, 75:1331–1370.

Bartholomew, D. J. (1961). A test of homogeneity of means under restricted alternatives.

Journal of the Royal Statistical Society. Series B (Methodological), 23:239–281.

Beresteanu, A., Molchanov, I., and Molinari, F. (2011). Sharp identification regions in models

with convex moment predictions. Econometrica, 79:1785–1821.

Blundell, R., Gosling, A., Ichimura, H., and Meghir, C. (2007). Changes in the distribution

of male and female wages accounting for employment composition using bounds. Econo-

metrica, 75:323–363.

41

Bugni, F., Canay, I., and Shi, X. (2017). Inference for functions of partially identified

parameters in moment inequality models. Quantitative Economics, 8:1–38.

Bugni, F. A. (2010). Bootstrap inference in partially identified models defined by moment

inequalities: Coverage of the identified set. Econometrica, 78:735–753.

Canay, I. A. (2010). El inference for partially identified models: Large deviations optimality

and bootstrap validity. Journal of Econometrics, 156:408–425.

Canay, I. A. and Shaikh, A. (2017). Practical and theoretical advances for inference in par-

tially identified models. In B. Honore, A. Pakes, M. Piazzesi, and L. Samuelson (Eds)

Advances in Economics and Econometrics: Volume 2: Eleventh World Congress, (Econo-

metric Society Monographs, pp. 271-306). Cambridge University Press.

Chen, X., Christensen, T. M., and Tamer, E. (2018). Monte carlo confidence sets for identified

sets. Econometrica, 86:1965–2018.

Chernozhukov, V., Hong, H., and Tamer, E. (2007). Estimation and confidence regions for

parameter sets in econometric models. Econometrica, 75:1243–1284.

Chetty, R. (2012). Bounds on elasticities with optimization frictions: A synthesis of micro

and macro evidence on labor supply. Econometrica, 80:969–1018.

Ciliberto, F. and Tamer, E. (2009). Market structure and multiple equilibria in airline

markets. Econometrica, 77:1791–1828.

Eizenberg, A. (2014). Upstream innovation and product variety in the u.s. home pc market.

The Review of Economic Studies, 81:1003–1045.

Fack, G., Grenet, J., and He, Y. (2019). Beyond truth-telling: Preference estimation with

centralized school choice and college admissions. American Economic Review, 109:1486–

1529.

Fourier, J. B. J. (1826). Solution d’une question particuliere du calcul des inegalites. Nouveau

Bulletin des sciences par la Societe philomathique de Paris, p. 99, pages 317–319.

Gale, D. (1960). The Theory of Linear Economic Models. McGraw-Hill Book Company, 1

edition.

Gandhi, A. K., Lu, Z., and Shi, X. (2019). Estimating demand for differentiated products

with zeroes in market share data.

42

Guggenberger, P., Hahn, J., and Kim, K. (2008). Specification testing under moment in-

equalities. Economics Letters, 99:375–378.

He, Y. (2017). Gaming the boston school choice mechanism in beijing. Unpublished

manuscript, Department of Economics, Rice University.

Ho, K. and Rosen, A. (2017). Partial identification in applied research: Benefits and chal-

lenges. In B. Honore, A. Pakes, M. Piazzesi, and L. Samuelson (Eds) Advances in Eco-

nomics and Econometrics: Volume 2: Eleventh World Congress, (Econometric Society

Monographs, pp. 307-359). Cambridge University Press.

Holmes, T. J. (2011). The diffusion of wal-mart and economies of density. Econometrica,

79:253–302.

Huber, M. and Mellace, G. (2015). Testing instrument validity for late identification based

on inequality moment constraints. The Review of Economics and Statistics, 97:398–411.

Iaryczower, M., Shi, X., and Shum, M. (2018). Can words get in the way? the effect of

deliberation in collective decision-making. Journal of Political Economy, 126:688–734.

Kaido, H., Molinari, F., and Stoye, J. (2019). Confidence intervals for projections of partially

identified parameters. Econometrica, 87:1397–1432.

Katz, M. (2007). Supermarkets and zoning laws. Ph.D. dissertation, Harvard University.

Kawai, K. and Watanabe, Y. (2013). Inferring strategic voting. American Economic Review,

103:624–662.

Kleder, M. (2020). Con2vert - constraints to vertices

(https://www.mathworks.com/matlabcentral/ fileexchange/7894-con2vert-constraints-to-

vertices). MATLAB Central File Exchange. Retrieved June 9, 2020.

Kudo, A. (1963). A multivariate analogue of the one-sided test. Biometrika, 50:403–418.

Ledoit, O. and Wolf, M. (2012). Nonlinear shrinkage estimation of large-dimensional covari-

ance matrices. The Annals of Statistics, 40:1024–1060.

Magnolfi, L. and Roncoroni, C. (2016). Estimation of discrete games with weak assump-

tions on information. Unpublished manuscript, Department of Economics, University of

Wisconsin at Madison.

Manski, C. F. and Tamer, E. (2002). Inference on regressions with interval data on a regressor

or outcome. Econometrica, 70:519–546.

43

Mohamad, D. A., van Zwet, E. W., Cator, E. A., and Goeman, J. J. (2020). Adaptive critical

value for constrained likelihood ratio testing. Biometrica, 107:677–688.

Molinari, F. (2020). Econometrics with partial identification. In S. Durlauf, L. Hansen,

J. Heckman, and R. Matzkin (Eds) Handbook of Econometrics: Volume 7A, 1st Edition.

North Holland.

Morales, E., Sheu, G., and Zahler, A. (2019). Extended gravity. The Review of Economic

Studies, 86:2668–2712.

Nevo, A. and Rosen, A. (2012). Identification with imperfect instruments. The Review of

Economics and Statistics, 94:659–671.

Pakes, A., Porter, J., Ho, K., and Ishii, J. (2015). Moment inequalities and their applications.

Econometrica, 83:315–334.

Rambachan, A. and Roth, J. (2020). An honest approach to parallel trends. Unpublished

manuscript, Department of Economics, Harvard University.

Rogers, A. J. (1986). Modified lagrange multiplier tests for problems with one-sided alter-

natives. Journal of Econometrics, 31:341–361.

Romano, J., Shaikh, A., and Wolf, M. (2014). A practical two-step method for testing

moment inequalities. Econometrica, 82:1979–2002.

Romano, J. P. and Shaikh, A. M. (2008). Inference for identifiable parameters in partially

identified econometric models. Journal of Statistical Planning and Inference, 138:2786–

2807. Special Issue in Honor of Theodore Wilbur Anderson, Jr. on the Occasion of his

90th Birthday.

Romano, J. P. and Shaikh, A. M. (2012). On the uniform asymptotic validity of subsampling

and the bootstrap. the Annals of Statistics, 40:2798–2822.

Rosen, A. (2008). Confidence sets for partially identified parameters that satisfy a finite

number of moment inequalities. Journal of Econometrics, 146:107–117.

Sheng, S. (2016). A structural econometric analysis of network formation games. Unpublished

manuscript, Department of Economics, University of California Los Angeles.

Sierksma, G. and Zwols, Y. (2015). Linear and Integer Optimization: Theory and Practice.

CRC Press, Taylor & Francis Group, 3 edition.

44

Stoye, J. (2009). More on confidence intervals for partially identified parameters. economet-

rica, 77:1299–1315.

Sullivan, C. J. (2017). The ice cream split: Empirically distinguishing price and product space

collusion. Unpublished manuscript, Department of Economics, University of Wisconsin at

Madison.

Tamer, E. (2003). Incomplete simultaneous discrete response model with multiple equilibria.

The Review of Economic Studies, 70:147–165.

Uhlig, H. (2005). What are the effects of monetary policy on output? results from an

agnostic identification procedure. Journal of Monetary Economics, 52:381–419.

Wolak, F. (1987). An exact test for multiple inequality and equality constraints in the linear

regression model. Journal of the American Statistical Association, 82:782–793.

Wollman, T. G. (2018). Trucks without bailouts: Equilibrium product characteristics for

commercial vehicles. American Economic Review, 108:1364–1406.

45

Supplemental Appendix for “Simple Adaptive Size-Exact Testingfor Full-Vector and Subvector Inference in Moment Inequality

Models”

Gregory Cox and Xiaoxia Shi

This supplemental appendix contains proofs and other supporting materials for “Simple

Adaptive Size-Exact Testing for Full-Vector and Subvector Inference in Moment Inequality

Models” (henceforth referred to as the main text) by Gregory Cox and Xiaoxia Shi. The

following sections are included:

Section A contains the proof of Theorem 1 in the main text.

Section B contains the proof of Theorem 2 in the main text. This section also includes

Theorem 3, a general theorem for uniform asymptotic properties of the CC and RCC

tests, as well as the proof of Theorem 3. Theorem 3 is used to prove Theorem 2.

Section C contains supporting materials for Section 3.2 in the main text. This section

includes lemmas that reduce the calculation of the sCC test to a rank calculation

problem. It also includes an algorithm to carry out the rank calculation in the case

not covered in the main text.

Section D proves the asymptotic validity of the Subvector Tests by verifying the con-

ditional asymptotic normality of the sample moments and the consistency of the two

conditional variance matrix estimators proposed in Section 3.2 in the main text.

Section E provides details for the identified set calculation and the confidence interval

calculation in Section 5.2 in the main text.

A Proof of Theorem 1

For this proof, we assume Σn(θ) = nIdm . If this is not the case, then the following proof can be

applied after premultiplying mn(θ) by n1/2Σn(θ)−1/2 and postmultiplying A by n−1/2Σn(θ)1/2.

Fix θ and let X = mn(θ) ∼ N(µ, Idm), where µ = EFmn(θ). Let C = µ ∈ Rdm|Aµ ≤ b.The fact that θ ∈ Θ0(F ) implies µ ∈ C. These simplifications imply that Tn(θ) = ‖X − µ‖2

and

τj =

‖a1‖(bj−a′j µ)

‖a1‖‖aj‖−a′1ajif ‖a1‖‖aj‖6= a′1aj

∞ otherwise. (47)

46

The definitions of µ, J , r, τ , and β are unchanged. µ is the projection of X onto C. We also

denote it by PCX. We also denote J by J(X), r by r(X), τ by τ(X), and β by β(X).

A.1 Auxiliary Lemmas

The proof of Theorem 1 relies on four lemmas.

The first lemma partitions Rdm according to which inequalities are active. We define

some notation for the partition. For any J ⊆ 1, ..., dA, let J c = 1, ..., dA/J , and let

CJ = x ∈ C : ∀j ∈ J, a′jx = bj, and ∀j ∈ J c, a′jx < bj. Then CJ forms a partition of C.

Also let VJ = ∑

j∈J vjaj : vj ∈ R, vj ≥ 0, and let KJ = CJ + VJ .24 The following lemma

shows that KJ forms a partition that characterizes which inequalities are active.

Lemma 1. (a) If X ∈ KJ , then X − PCX ∈ VJ and PCX ∈ CJ .

(b) The set of all KJ for J ⊆ 1, ..., dA is a partition of Rdm.

(c) For every J ⊆ 1, ..., dA, X ∈ KJ iff J = J(X).

The next lemma considers the event r = 0 and partitions that event according to which

face of C is closest to the realization of X. Let J0 = j ∈ 1, ..., dA|aj = 0 and let

J00 = j ∈ 1, ..., dA|aj = 0 and bj = 0. Also let

J1 =J ⊆ 1, ..., dA|rk(AJ) = 1, J ∩ J0 = J00, (48)

and if j ∈ J, ` ∈ J c, s.t. ‖aj‖> 0, ‖a`‖> 0,

thenaj‖aj‖

6= a`‖a`‖

orbj‖aj‖

6= b`‖a`‖.

Further subdivide

J os1 =J ∈ J1| if j, ` ∈ J s.t. ‖aj‖> 0, ‖a`‖> 0, then

aj‖aj‖

=a`‖a`‖ (49)

J ts1 =J ∈ J1|∃j, ` ∈ J s.t. ‖aj‖> 0, ‖a`‖> 0,

aj‖aj‖

= − a`‖a`‖

andbj‖aj‖

= − b`‖a`‖.

The next lemma provides a partition of C0 := ∪J⊆1,...,dA:rk(AJ )=0CJ . (Note that for these

sets, CJ = KJ .) Let J6=0 = j = 1, . . . , dA : ‖aj‖6= 0. For each J ∈ J os1 , let

C∆J = x ∈ C0|argminj∈J 6=0

‖aj‖−1(bj − a′jx) = J ∩ J6=0. (50)

24When J = ∅, then VJ = 0dm.

47

The set C∆J is the set of points in C that are closer to CJ than to any other CJ for J ∈ J1.

It is helpful to picture CJ for J ∈ J1 as the faces of a polyhedron, C, and C∆J as a partition

of C into triangularly shaped sets. Also let

C | = C0/(∪J∈J os1

C∆J

). (51)

Lemma 2. (a) C0 = CJ00.

(b) The sets C | and C∆J for J ∈ J os

1 form a partition of C0.

(c) If A 6= 0dA×dm, then C | has Lebesgue measure zero.

(d) ∪J⊆1,...,dA|rk(AJ )=1KJ = ∪J∈J os1 ∪J ts1KJ .

The next lemma bounds the probabilities of translations of sets in the multivariate normal

distribution. Let V denote an arbitrary cone in Rr for a positive integer r.25 Let V ∗ denote

the polar cone. That is, V ∗ = γ ∈ Rr|〈y, γ〉 ≤ 0 for all y ∈ V . For any γ ∈ V ∗, let

Y ∼ N(γ, Ir). The following lemma provides a property of probabilities of cones under a

translation.

Lemma 3. For every γ ∈ V ∗, Prγ(‖Y ‖2> χ2r,1−α|Y ∈ V ) ≤ α, with equality if γ = 0.

Lemma 3 states that the probability that a random vector, Y , belongs to the tail of its

distribution, conditional on belonging to the cone, V , is less than or equal to α, where the

tail is any point outside a sphere of radius√χ2r,1−α. The key assumption is that the mean of

Y must belong to the polar cone, V ∗, which translates the distribution away from the cone,

V . When γ = 0, this lemma holds with equality because unconditionally ‖Y ‖2∼ χ2r, the tail

of which has mass exactly α, and because ‖Y ‖2 has exactly the same distribution whether

or not we condition on Y ∈ V . Lemma 3 follows from Lemma 1 in Mohamad et al. (2020),

and thus the proof is omitted.

The following lemma is the key to validity of the refinement to the CC test. It is a bound

on translations of sets in the univariate normal distribution.

Lemma 4. For every µ ≤ 0, for every τ ≥ 0, and for every α ∈ [0, 1],

Prµ(Z > z1−β/2|Z > −τ

)≤ α,

where Z ∼ N(µ, 1) and β = 2αΦ(τ), with equality if µ = 0.

25A cone is a set, V , such that for all v ∈ V and for all λ ≥ 0, λv ∈ V .

48

A.2 Proof of Theorem 1

First, we show part (a). Notice that

Pr(‖X − PCX‖2> χ2rk(AJ(X)),1−β(X))

=∑

J⊆1,...,dA

Pr(X ∈ KJ and ‖X − PCX‖2> χ2rk(AJ ),1−β(X))

=∑

J⊆1,...,dA|rk(AJ )≥2

Pr(X ∈ KJ and ‖X − PCX‖2> χ2rk(AJ ),1−α) (52)

+∑J∈J ts1

Pr(X ∈ KJ and ‖X − PCX‖2> χ21,1−α) (53)

+∑J∈J os1

Pr(X ∈ KJ and ‖X − PCX‖2> χ21,1−β(X)) (54)

+∑

J⊆1,...,dA|rk(AJ )=0

Pr(X ∈ KJ and ‖X − PCX‖2> χ20,1−α), (55)

where the first equality follows from Lemma 1(b,c), and the second equality uses Lemma

2(d) and the fact that β(X) = α whenever rk(AJ(X)) 6= 1 or J ∈ J ts1 . That latter fact follows

because for J ∈ J ts1 with X ∈ KJ , there exists j, ` ∈ J such that ‖a`‖−1a` = −‖aj‖−1aj

and ‖a`‖−1b` = −‖aj‖−1bj, which implies that b`− a′`PCX = bj − a′jPCX = 0 (and therefore

τ(X) = 0).

For each J , we consider the span of VJ as a subspace of Rdm . Let PJ denote the projection

onto span(VJ), and MJ denote the projection onto its orthogonal complement. We note that,

given J , there exists a κJ ∈ span(VJ) such that for every z ∈ CJ , PJz = κJ . This follows

because for two z1, z2 ∈ CJ , and for any v ∈ span(VJ), 〈z1 − z2, v〉 = 0, which implies

z1 − z2 ⊥ span(VJ), so that PJ(z1 − z2) = 0dm . Thus, for any X ∈ KJ , we can write

PJX = PJ(X − PCX) + PJPCX = X − PCX + κJ , where the second equality follows by

Lemma 1(a) and the above discussion. We also write MJX = X − PJX = PCX − κJ .

First, let’s consider the terms in (55). For J such that rk(AJ) = 0, we have span(VJ) =

0dm. Thus, PJX = κJ = 0dm . This implies that ‖X − PCX‖= 0. Therefore,

Pr(X ∈ KJ and ‖X − PCX‖2> χ20,1−α) = 0. (56)

For J such that rk(AJ) > 0, we define a linear isometry from span(VJ) to Rrk(AJ ). Let

BJ be a dm × rk(AJ) matrix whose columns form a basis for span(VJ). Then PJX =

BJ(B′JBJ)−1B′JX. The projection matrix BJ(B′JBJ)−1B′J is idempotent with rank rk(AJ),

and thus there exists a dm×rk(AJ) matrix with orthonormal columns, QJ , such that QJQ′J =

BJ(B′JBJ)−1B′J . The linear isometry from span(VJ) to Rrk(AJ ) is QJ(X) = Q′JX. This is an

49

isometry because for any v1, v2 ∈ span(VJ),

‖v1 − v2‖2 = (v1 − v2)′(v1 − v2)

= (v1 − v2)′(PJ(v1 − v2))

= (v1 − v2)′QJQ′J(v1 − v2)

= ‖QJ(v1)−QJ(v2)‖2, (57)

where the second equality holds because v1, v2 ∈ span(VJ). Now let Q′JVJ = Q′Jv : v ∈ VJ.Then PJX−κJ ∈ VJ if and only if Q′J(PJX−κJ) ∈ Q′JVJ because this isometry is bijective.

Next, we consider the terms in (52) and (53). Notice that

Pr(X ∈ KJ and ‖X − PCX‖2> χ2rk(AJ ),1−α)

= Pr(MJX + κJ ∈ CJ , PJX − κJ ∈ VJ , and ‖PJX − κJ‖2> χ2rk(AJ ),1−α)

= Pr(MJX + κJ ∈ CJ)× Pr(PJX − κJ ∈ VJ and ‖PJX − κJ‖2> χ2rk(AJ ),1−α), (58)

where the first equality uses Lemma 1(a) and the facts that MJX + κJ = PCX and X =

PJX + MJX, and the second equality follows from the fact that PJX is independent of

MJX. Applying the isometry, we have

Pr(PJX − κJ ∈ VJ and ‖PJX − κJ‖2> χ2rk(AJ ),1−α)

= Pr(Q′J(PJX − κJ) ∈ Q′JVJ and ‖Q′J(PJX − κJ)‖2> χ2rk(AJ ),1−α). (59)

We would like to apply Lemma 3 to this probability. Since X ∼ N (µ, I), we have

Q′J(PJX − κJ) ∼ N(Q′J(PJµ− κJ), Q′JIQJ) = N(Q′J(PJµ− κJ), I). (60)

Also note that Q′JVJ is a cone in Rrk(AJ ). The random vector Q′J(PJX−κJ) ∼ N(γ, I) where

γ = Q′J(PJµ− κJ). The vector γ is in the polar cone because, for all y ∈ Q′JVJ , there exists

a y =∑

j∈J vjaj ∈ VJ such that y = Q′Jy, and thus

〈γ, y〉 = 〈Q′J(PJµ− κJ), Q′Jy〉

= 〈PJµ− κJ , y〉

= 〈(µ−MJµ− PJz), y〉

= 〈(µ−MJµ− z +MJz), y〉

= 〈(µ− z), y〉

50

=∑j∈J

vj(〈µ, aj〉 − 〈z, aj〉) ≤ 0, (61)

where z is any element26 of CJ so that κJ = PJz, the second equality holds because 〈Q′J(PJµ−κJ), Q′Jy〉 = y′QJQ

′J(PJµ − κJ) = y′PJ(PJµ − κJ) = y′(PJµ − κJ), the fifth equality holds

because MJy = 0, and the inequality follows because 〈z, aj〉 = bj ≥ 〈µ, aj〉, using the facts

that z ∈ CJ and µ ∈ C.

Therefore, we can apply Lemma 3 to get that, for every J ⊆ 1, ..., dA such that rk(AJ) ≥1, we have

Pr(PJX − κJ ∈ VJ and ‖PJX − κJ‖2> χ2rk(AJ ),1−α)

= Pr(Q′J(PJX − κJ) ∈ Q′JVJ and ‖Q′J(PJX − κJ)‖2> χ2

rk(AJ ),1−α)

≤ αPr(Q′J(PJX − κJ) ∈ Q′JVJ)

= αPr(PJX − κJ ∈ VJ), (62)

where the inequality holds as equality if γ = Q′J(PJµ− κJ) = 0.

Next, consider the terms in (54). For each J ∈ J os1 , let j ∈ J such that ‖aj‖6= 0. Notice

that we can take BJ = aj, so that PJ = ‖aj‖−2aja′j, QJ = ‖aj‖−1aj, and κJ = ‖aj‖−2ajbj.

Notice that

Pr(X ∈ KJ and ‖X − PCX‖2> χ21,1−β(X))

= Pr(MJX + κJ ∈ CJ , PJX − κJ ∈ VJ , and ‖PJX − κJ‖2> χ21,1−β(X)) (63)

= Pr(MJX + κJ ∈ CJ , Q′J(PJX − κJ) ∈ Q′JVJ , and ‖Q′J(PJX − κJ)‖2> χ21,1−β(MJX+κJ )),

where the first equality uses the definitions of MJ , PJ , and κJ , and the second equality uses

the isometry QJ , together with the fact that β(X) depends on X only through MJX + κJ

(because the formula for τ(X) only depends on PCX = MJX + κJ).

We next note that Q′JVJ = [0,∞). This follows because for any c ≥ 0, c = Q′Jcaj‖aj‖,where caj‖aj‖∈ VJ . Conversely, for any v =

∑`∈J cà` ∈ VJ for some constants c` ≥ 0, we

have Q′Jv =∑

`∈J c`‖aj‖−1a′ja`, where a′ja` ≥ 0 because a` is either zero or a positive scalar

multiple of aj by the definition of J os1 .

Also, the fact that X ∼ N(µ, I) implies that Z := Q′J(PJX − κJ) ∼ N(γ, 1), where

γ = Q′J(PJµ− κJ) = ‖aj‖−1(a′jµ− bj) ≤ 0. Note that Z is independent of MJX.

26If CJ is empty, so that no such z exists, then (58) is zero, and so (62) below is not needed.

51

Let z1−α denote the 1− α quantile of the standard normal distribution. We have

Pr(MJX + κJ ∈ CJ , Q′J(PJX − κJ) ∈ Q′JVJ , and ‖Q′J(PJX − κJ)‖2> χ21,1−β(MJX+κJ ))

= Pr(MJX + κJ ∈ CJ , Z > z1−β(MJX+κJ )/2)

=E1(MJX + κJ ∈ CJ) Pr(Z > z1−β(MJX+κJ )/2|MJX + κJ)

≤αE1(MJX + κJ ∈ CJ) Pr(Z > −τ(MJX + κJ)|MJX + κJ) (64)

=αPr(MJX + κJ ∈ CJ , Z > −τ(MJX + κJ))

=αPr(MJX + κJ ∈ CJ , Q′J(PJX − κJ) ∈ Q′JVJ)

+ αPr(MJX + κJ ∈ CJ , Q′J(PJX − κJ) ∈ (−τ(MJX + κJ), 0))

=α(Pr(X ∈ KJ) + Pr(X ∈ C∆J )), (65)

where the first equality follows from the events Z ≥ 0 and Z2 > χ21,1−β(MJX+κJ ) being equiv-

alent to the event Z > z1−β(MJX+κJ )/2, the second and third equalities uses the conditional

distribution of Z given MJX + κJ , the inequality follows by Lemma 4, the fourth equal-

ity follows from splitting the event Z > −τ into Z ≥ 0 (equivalent to Z ∈ Q′JVJ) and

Z ∈ (−τ, 0), and the final equality follows from the fact that Q′J(PJX − κJ) ∈ Q′JVJ if and

only if PJX − κJ ∈ VJ , the characterization of KJ using Lemma 1(a), together with the

argument that follows.

To show (65), we show that for all J ∈ J os1 ,

C∆J = x ∈ Rdm|MJx+ κJ ∈ CJ and Q′J(PJx− κJ) ∈ (−τ(MJx+ κJ), 0). (66)

Denote the set on the right hand side of (66) by Υ. We show (1) x ∈ C∆J implies x ∈ Υ

and (2) x ∈ Υ implies x ∈ C∆J . It is useful to point out that for any x, we can write

Q′J(PJx−κJ) = ‖aj‖−1(a′jx− bj) and MJx+κJ = x−‖aj‖−2aj(a′jx− bj) using the formulas

for PJ , QJ , and κJ .

(1) Let x ∈ C∆J . We calculate that MJx + κJ ∈ CJ by showing that equality holds for

every ` ∈ J and strict inequality holds for every ` /∈ J . For any ` ∈ J either ` ∈ J00 or

` ∈ J ∩ J 6=0. If ` ∈ J00, a′`(MJx+ κJ) = 0 = b`, so equality holds. If ` ∈ J ∩ J 6=0,

a′`(x− ‖aj‖−2aj(a′jx− bj)) = a′`(x− ‖aj‖−1‖a`‖−1aj(a

′`x− b`)) = b`, (67)

where the first equality uses the fact that ‖a`‖−1(b`−a′`x) = ‖aj‖−1(bj−a′jx) by the definition

of C∆J , and the second equality uses the fact that a′àj = ‖aj‖‖a`‖ by the definition of J os

1 .

Therefore, equality holds for every ` ∈ J . For any ` ∈ J c, we show that strict inequality

52

holds. Either ` ∈ J0/J00 or ` ∈ J6=0/J . If ` ∈ J0/J00, a′`(MJx+ κJ) = 0 < b`.27 If ` ∈ J6=0,

a′`(x− ‖aj‖−2aj(a′jx− bj)) =a′`x− ‖aj‖−2a′àj(a

′jx− bj)

=a′`x− b` − ‖aj‖−2a′àj(a′jx− bj) + b`

<‖a`‖‖aj‖

(a′jx− bj)− ‖aj‖−2a′àj(a

′jx− bj) + b`

=(‖a`‖‖aj‖−a′àj)(a′jx− bj)

‖aj‖2+ b` ≤ b`, (68)

where the first inequality uses the fact that ‖aj‖−1(bj − a′jx) < ‖a`‖−1(b` − a′`x) by the

definition of x ∈ C∆J (because ` is not in the argmin), and the second inequality uses the

fact that a′jx < bj and ‖aj‖‖a`‖≥ a′à′j. This shows that for every ` ∈ J c the inequality is

strict. Therefore, MJx+ κJ ∈ CJ .

We also calculate that Q′J(PJx− κJ) ∈ (−τ(MJx+ κJ), 0). The fact that x ∈ C0 implies

that a′jx − bj < 0, and so Q′J(PJx − κJ) = ‖aj‖−1(a′jx − bj) < 0. Let ` ∈ 1, ..., dA/j.28

We show that

‖aj‖−1(a′jx− bj) > −τj(MJx+ κJ). (69)

If ‖a`‖‖aj‖−a′àj = 0, then by definition the right hand side of (69) is −∞. Otherwise, we

can plug in MJx+ κJ = x− ‖aj‖−2aja′jx+ ‖aj‖−2ajbj and rewrite (69) as

(a′jx− bj)(‖a`‖‖aj‖−a′àj) > −‖aj‖2(b` − a′`(x− ‖aj‖−2aja

′jx+ ‖aj‖−2ajbj)). (70)

We can simplify this to show that it holds if and only if

‖aj‖−1(bj − a′jx) < ‖a`‖−1(b` − a′`x). (71)

The fact that ‖a`‖‖aj‖6= a′àj implies that ` /∈ J (by the definition of J os1 ) and therefore, by

the definition of C∆J , (71) holds (because ` is not in the argmin). Therefore, (69) holds for

every ` ∈ 1, ..., dA/j, which implies that

Q′J(PJx− κJ) = ‖aj‖−1(a′jx− bj) > −τ(MJx+ κJ). (72)

This shows that x ∈ Υ.

(2) Let x ∈ Υ. Consider the set argminj∈J 6=0‖aj‖−1(bj − a′jx). We first show that the

argmin is equal to J ∩ J6=0. If ` ∈ J6=0/J , an algebraic manipulation similar to above shows

27b` cannot be negative because, by assumption, θ ∈ Θ0(F ), so µ ∈ C, and therefore C is non-empty.28We note here that τ(x) is defined for an arbitrary active inequality j ∈ J ∩ J 6=0. One can verify that the

definition of τ(x) does not depend on which j ∈ J ∩ J 6=0 is selected.

53

that

Q′J(PJx− κJ) > −τ(MJx+ κJ)

⇒ ‖aj‖−1(a′jx− bj) > −‖aj‖(b` − a′`(x− ‖aj‖−2aj(a

′jx− bj)))

‖aj‖‖a`‖−a′àj⇐⇒ ‖a`‖−1(b` − a′`x) > ‖aj‖−1(bj − a′jx), (73)

where the first implication uses the definition of τ(x) and the “iff” follows from multiplying

by ‖aj‖‖a`‖−a′àj and cancelling the a′àj term. This shows that ` ∈ J 6=0/J cannot be in the

argmin. Also consider ` ∈ J ∩ J6=0. The definition of J os1 implies that ‖a`‖−1a` = ‖aj‖−1aj.

Notice that

0 = bj − a′j(MJx+ κJ) = b` − a′`(MJx+ κJ)

⇐⇒ 0 = bj − a′j(x− ‖aj‖−2aj(a

′jx− bj)) = b` − a′`(x− ‖aj‖−2aj(a

′jx− bj))

⇒ b` = a′`(x− ‖aj‖−2aj(a′jx− bj))

⇐⇒ ‖a`‖−1(b` − a′`x) = ‖a`‖−1a′`‖aj‖−2aj(bj − a′jx) = ‖aj‖−1(bj − a′jx), (74)

where the first line holds because MJx + κJ ∈ CJ , the first “iff” holds by plugging in the

formula for MJx + κJ , the implication holds by solving for b` and cancelling bj, the second

“iff” holds by rearranging and the fact that a′àj = ‖a`‖‖aj‖. We have shown that ` should

be in the argmin. Therefore, the argmin is equal to J ∩ J6=0.

We also show that x ∈ C0. Note that a′jx < bj because Q′J(PJx− κJ) < 0 and plugging

in the formulas for QJ , PJ , and κJ . For any other ` ∈ J 6=0, we have

‖a`‖−1(b` − a′`x) ≥ ‖aj‖−1(bj − a′jx) > 0, (75)

because j belongs to the argmin. Thus, x ∈ C0 because all the inequalities for ` ∈ J6=0 are

inactive. Therefore, x ∈ C∆J .

Therefore, we have shown (66), which implies (65).

To finish the proof of part (b), we plug in (56), (58), (59), (62), (63), and (65) into (52),

(53), (54), and (55) to get that∑J⊆1,...,dA

Pr(X ∈ KJ and ‖X − PCX‖2> χ2rk(AJ ),1−α)

≤∑

J⊆1,...,dA:rk(AJ )≥2

αPr(MJX + κJ ∈ CJ)× Pr(PJX − κJ ∈ VJ)

54

+∑J∈J ts1

αPr(MJX + κJ ∈ CJ)× Pr(PJX − κJ ∈ VJ)

+∑J∈J os1

α(Pr(X ∈ KJ) + Pr(X ∈ C∆J ))

=α×

∑J⊆1,...,dA:rk(AJ(X))>0

Pr(X ∈ KJ) +∑J∈J os1

Pr(X ∈ C∆J )

=α(1− Pr(X ∈ C |)) ≤ α, (76)

where the first equality uses Lemma 1(a) and the fact that PJX is independent of MJX,

together with Lemma 2(d), and the second equality uses Lemma 1(b) and Lemma 2(b).

We next prove part (b). Fix J ⊆ 1, . . . , dA. We first note that when CJ is empty, the

inequality in (62) holds with equality because both sides are zero. When CJ 6= ∅, we show

that κJ = PJµ. Let z ∈ CJ and for every λ ∈ [0, 1] let µλ = λz + (1 − λ)µ. Recall that

Aµ = b. Thus, for each λ ∈ (0, 1]

a′jµλ = λa′jz + (1− λ)a′jµ = λbj + (1− λ)bj = bj for j ∈ J, and (77)

a′jµλ = λa′jz + (1− λ)a′jµ < λbj + (1− λ)bj = bj for j ∈ J c.

This implies that µλ ∈ CJ , and hence, for every λ ∈ (0, 1], PJµλ = κJ . Take λ → 0 and by

the continuity of the projection, κJ = PJµ. Thus, γ = Q′J(PJµ− κJ) = 0, implying that the

inequality in (62) holds with equality. The fact that γ = 0 also implies that the inequality

in (64) holds with equality by Lemma 4. The inequality in the last line of (76) holds with

equality by Lemma 2(c). This proves part (b).

Part (c). Let C = µ ∈ Rdm|AJµ ≤ bJ. Let Tn(θ), PCX, J(X), r(X), τ(X), and β(X)

be defined with AJ and bJ in place of A and b. For each L ⊆ J , also define CL and KL

similarly. Note that all these objects also depend on s because A and b may depend on θ.

Notice that

Pr(φRCCn (θs, α) 6= φRCC

n,J (θs, α))

=∑

L⊆1,...,dA

Pr(X ∈ KL and φRCCn (θs, α) 6= φRCC

n,J (θs, α))

=∑L⊆J


n,J (θs, α)) + o(1) (78)

=∑L⊆J

Pr(X ∈ KL and χ2

rk(AL),1−β(X) ≥ ‖X − PCX‖2> χ2rk(AL),1−β(X)

)+ o(1) (79)

55

=∑

L⊆J :rk(AL)=1

Pr(X ∈ KL and χ2

1,1−β(X) ≥ ‖X − PCX‖2> χ21,1−β(X)

)+ o(1) (80)

→ 0, (81)

where the first equality follows from Lemma 1(b) and the subsequent equalities and conver-

gence are justified below.

For (78), let X = X − µ ∼ N(0, I). Fix the value of X. We show that for any

L 6⊆ J , either (i) X + µ /∈ KL or (ii) 1‖X + µ − PC(X + µ)‖2> χ2r(X+µ),1−β(X+µ)

=

1‖X+µ−PC(X+µ)‖2> χ2r(X+µ),1−β(X+µ)

eventually as s→∞. Note that the expression

in (ii) is equivalent to φRCCn (θs, α) = φRCC

n,J (θs, α) when evaluated at X = X + µ. Suppose,

to reach a contradiction, there exists a subsequence in s such that (i) and (ii) are both false

for all s.29 Let ` ∈ L/J . Then by Lemma 1(a), we have a′`PC(X + µ) = b`. It follows that

‖X + µ− PC(X + µ)‖ ≥−a′`

(X + µ− PC(X + µ)

)‖a`‖

(82)

=−a′`X‖a`‖

− a′`µ− b`‖a`‖

→ +∞, (83)

where the inequality follows from Cauchy-Schwarz, and the convergence follows by assump-

tion for ` /∈ J , using the fact that X is fixed so ‖a`‖−1a′`X is bounded. This implies that

‖X + µ− PC(X + µ)‖2> χ2dA,1−α ≥ χ2

r(X+µ),1−β(X+µ)eventually as s→∞.

We next claim that there exists a further subsequence in s and an ˜ /∈ J such that

a′˜PC(X + µ) ≥ b˜ along the further subsequence. Such an ˜ and subsequence must exist

because otherwise, PC(X + µ) ∈ C, which implies that PC(X + µ) = PC(X + µ) (because

C ⊆ C), and which further implies by Lemma 1(a) that a′˜PC(X + µ) = a′˜PC(X + µ) = b˜

for every ˜∈ L. In that case, we can take ˜∈ L/J . It follows that

‖X + µ− PC(X + µ)‖ ≥−a′˜

(X + µ− PC(X + µ)

)‖a˜‖

(84)

≥−a′˜X‖a˜‖

−a′˜µ− b˜

‖a˜‖→ +∞, (85)

where the inequality follows from Cauchy-Schwarz, and the convergence follows by assump-

tion for ˜ /∈ J , using the fact that X is fixed so ‖a˜‖−1a′˜X is bounded. This implies that

‖X + µ− PC(X + µ)‖2> χ2dA,1−α ≥ χ2

r(X+µ),1−β(X+µ)eventually as s→∞. Therefore, along

this subsequence, condition (ii) holds eventually. This contradiction implies that for every

29For notational simplicity, we do not introduce notation for this subsequence or any further subsequence.

56

L 6⊆ J and for every fixed X either (i) or (ii) holds eventually. Therefore,


n,J (θs, α)

= Pr(X + µ ∈ KL and

1‖X+µ−PC(X+µ)‖2>χ2r(X+µ),1−β(X+µ)

6= 1‖X+µ−PC(X+µ)‖2>χ2r(X+µ),1−β(X+µ)

)

→0,

where the equality follows from the fact that X has the same distribution as X +µ, and the

convergence follows from the bounded convergence theorem.

For (79), note that for any L ⊆ J , if X ∈ KL, then PCX ∈ CL by Lemma 1(a). We argue

that PCX = PCX. By a property of projection onto convex sets, it is sufficient to show that

for all y ∈ C, we have 〈X − PCX, y− PCX〉 ≤ 0.30 This follows because by the definition of

KL, we can write X − PCX =∑

j∈L vjaj with vj ≥ 0, so

〈X − PCX, y − PCX〉 =∑j∈L

vj(〈aj, y〉 − 〈aj, PCX) ≤ 0,

where the inequality uses the fact that y ∈ C, so a′jy ≤ bj and PCX ∈ CL, so a′jPCX =

bj. Therefore, PCX = PCX. The fact that PCX ∈ CL implies that PCX ∈ CL because

CL ⊆ CL. By Lemma 1(b), this implies that X ∈ KL. Thus, r(X) = r(X) = rk(AL), and

Tn(θ) = ‖X −PCX‖2= ‖X −PCX‖2= Tn(θ). Also, the fact that β(X) ≥ β(X) implies that

the only way φRCCn (θs, α) 6= φRCC

n,J (θs, α) is if χ2rk(AL),1−β(X) ≥ ‖X − PCX‖2> χ2

rk(AL),1−β(X).

For (80), we use the fact that β(X) = β(X) = α whenever rk(AL) 6= 1.

Finally we show (81). For each L ⊆ J , note that rk(AL) may depend on s. Fix any

subsequence in s such that rk(AL) = 1 along the subsequence.31 Write PL = QPLQ

PL′

and

ML = QML Q

ML′, where QP

L is a dm× 1 vector with unit length, QML is a dm× (dm− 1) matrix

with orthonormal columns, and QPL′QML = 0. Let X = (x1, x2) ∼ N(0, I), where x1 ∈ Rdm−1

and x2 ∈ R. We can then write

X = QML x1 +QP

L x2 + µ. (86)

Note that X does not depend on s, while µ, QML , and QP

L may.

For each L, we can rewrite the term in (80) as∫x1

∫x2

1X ∈ KL1χ21,1−β(X) ≥ ‖X − PCX‖2> χ2

1,1−β(X)φ(x1)φ(x2)dx2dx1

30See section 3.12 in Luenberger (1969).31For notational simplicity, we do not introduce notation for this subsequence or any further subsequence.

57

=

∫x1

1MLX + κL ∈ CL∫x2

gs(x1, x2)φ(x2)dx2φ(x1)dx1, (87)

where X is viewed as a function of (x1, x2) using (86), φ(·) is the probability density function

of the standard normal distribution of the dimension determined by the dimension of its

argument, and

gs(x1, x2) = 1PLX − κL ∈ VL1χ21,1−β(MLX+κL) ≥ ‖PLX − κL‖2> χ2

1,1−β(MLX+κL), (88)

which uses the same decomposition of X ∈ KL as in (58), and the fact that MLX only

depends on x1 (and that β(X) and β(X) only depend on X through PCX = MLX + κL).

Fix x1. We show that the inner integral goes to zero as s→∞.

Fix an arbitrary subsequence in s. We show that there exists a further subsequence such

that the inner integral goes to zero. Since β(MLX + κL) and β(MLX + κL) do not depend

on x2 and both lie in [α, 2α] for all s, there exists a further subsequence along which both

converge. Denote the limits by β∞ and β∞. Also note that PLX − κL = QPL x2 + PLµ− κL.

Take a further subsequence such that PLµ − κL diverges or converges and such that QPL

converges to QPL,∞ (since QP

L has unit length, it must converge along a subsequence). We

consider two cases.

(i) If PLµ−κL diverges, then for every x2, ‖QPL x2+PLµ−κL‖2≥ (‖PLµ−κL‖−‖QP

L x2‖)2 →∞, so gs(x1, x2) = 0 eventually as s → ∞ along this subsequence. Therefore by the domi-

nated convergence theorem, the inner integral in (87) goes to zero.

(ii) If PLµ−κL converges to some κ∞, then fix x2 such that ‖QPL,∞x2+κ∞‖2 6= χ2

1,1−β∞ and

‖QPL,∞x2+κ∞‖2 6= χ2

1,1−β∞. Note that the set of such x2 is a set of probability one with respect

to x2 ∼ N(0, 1). We show that gs(x1, x2) = 0 eventually. Consider ‖a`‖−1a′`(µ − PCX) for

` /∈ J . If there exists an ` /∈ J and a subsequence of s such that ‖a`‖−1a′`(µ−PCX)→ ±∞,

then

‖X − PCX‖ = ‖[QPL , Q

ML ]′X + µ− PCX‖ (89)

≥±a′`

([QP

L , QML ]′X + µ− PCX

)‖a`‖

(90)

=±a′`[QP

L , QML ]′X

‖a`‖± a′` (µ− PCX)

‖a`‖→ +∞, (91)

where the inequality follows by Cauchy-Schwarz and the convergence follows from the fact

that ‖a`‖−1a′`[QPL , Q

ML ]′X is bounded. This shows that ‖PLX − κL‖2= ‖X − PCX‖2>

χ21,1−α ≥ χ2

1,1−β(MLX+κL) eventually, and therefore gs(x1, x2) = 0 eventually along this sub-

58

sequence. Otherwise, suppose ‖a`‖−1a′`(µ − PCX) is bounded along a subsequence for all

` /∈ J . We show that β∞ = β∞. Let j ∈ L such that aj 6= 0. Then note that for each ` /∈ J ,

τ`(X) =‖aj‖(b` − a′`PCX)

‖aj‖‖a`‖−a′ja`=

(b` − a′`PCX)/‖a`‖1− a′

ja`/(‖aj‖‖a`‖)

≥ 1

2

b` − a′`PCX‖a`‖

→ ∞, (92)

where the convergence follows because ‖a`‖−1(b` − a′`µ) → ∞ and ‖a`‖−1(a′`µ − a′`PCX) is

bounded (and if the denominator is zero, then τ`(X) =∞). Therefore,

τ(X) = inf6=jτ`(X) = min

(inf

`∈J ;` 6=jτ`(X), inf

`/∈Jτ`(X)

)= min

(τ(X), inf

`/∈Jτ`(X)

). (93)

If τ(X)→∞, then τ(X)→∞ too, and if τ(X) converges to a finite value, τ(X) converges

to the same value. This shows that β∞ = β∞. Therefore gs(x1, x2) = 0 eventually along this

subsequence. Since every subsequence has a further subsequence such that gs(x1, x2) = 0

eventually, it follows that gs(x1, x2) = 0 eventually along the original sequence. Therefore

by the dominated convergence theorem, the inner integral in (87) goes to zero.

Since the inner integral in (87) converges to zero in either case (i) or (ii) for every fixed x1,

by the dominated convergence theorem, the outer integral converges to zero too along this

subsequence. Since every subsequence has a further subsequence such that (87) converges

to zero, this shows (81).

A.3 Proofs of the Auxiliary Lemmas

Proof of Lemma 1. (a) By assumption, X ∈ KJ = CJ + VJ . So, we write X = X1 + X2,

where X1 ∈ CJ and X2 ∈ VJ . Then, PCX1 = X1 because X1 ∈ C already. We show that

PCX = X1. By a property of projection onto convex sets, it is necessary and sufficient that

for all y ∈ C, we have 〈X −X1, y −X1〉 ≤ 0.32 This follows because X2 =∑

j∈J vjaj with

vj ≥ 0, so

〈X2, y −X1〉 =∑j∈J

vj(〈aj, y〉 − 〈aj, X1〉) ≤ 0, (94)

where the inequality uses the fact that y ∈ C, so a′jy ≤ bj and X1 ∈ CJ , so a′jX1 = bj.

Combining these, we get that PCX = X1 ∈ CJ and X − PCX = X −X1 = X2 ∈ VJ .

(b) We first show that every X belongs to some KJ . For every X, PCX ∈ C, so there

exists a J such that PCX ∈ CJ .

By the inner-product property of projection, we know that for all y ∈ C, 〈y−PCX,X −32See Section 3.12 in Luenberger (1969). Hereafter, call this property of projection onto a convex set the

“inner-product property.”

59

PCX〉 ≤ 0. Using this fact, let z ⊥ span(VJ). Then, there exists a ε > 0 such that PCX + εz

and PCX − εz both belong to C.33 Then, 〈εz,X − PCX〉 ≤ 0 and 〈−εz,X − PCX〉 ≤ 0.

These two inequalities imply that 〈z,X − PCX〉 = 0. Thus, X − PCX is orthogonal to all

vectors, z, which are orthogonal to span(VJ). This implies that X − PCX ∈ span(VJ).

If X −PCX /∈ VJ , then by the separating hyperplane theorem,34 there exists a direction,

c ∈ span(VJ) such that 〈c,X−PCX〉 > 0 and 〈c, aj〉 < 0 for all j ∈ J . We consider PCX+εc.

We show that for ε sufficiently small, (1) PCX + εc ∈ C, and (2) 〈X − PCX, εc〉 > 0.

(1) For j ∈ J , 〈PCX + εc, aj〉 = bj + ε〈c, aj〉 < bj, where the equality follows because

PCX ∈ CJ and the inequality follows from the definition of c. For j ∈ J c, 〈PCX +

εc, aj〉 = 〈PCX, aj〉 + ε〈c, aj〉, which is less than bj for ε sufficiently small because

〈PCX, aj〉 < bj.

(2) 〈X − PCX, εc〉 = ε〈X − PCX, c〉 > 0 by the definition of c.

This contradicts the inner-product property of projection onto a convex set, and therefore

X − PCX ∈ VJ , and X ∈ KJ .

We next show that no X belongs to two distinct KJ . If X ∈ KJ and KJ ′ , then, by part

(a), PCX ∈ CJ and PCX ∈ CJ ′ . But this is a contradiction because the projection onto a

convex set is unique, and the CJ form a partition of C.

(c) If X ∈ KJ , then PCX ∈ CJ , so all the inequalities in J are active. If X /∈ KJ , then

X is in a different KJ ′ , for some J ′ 6= J , by part (b). Thus, J 6= J(X) = J ′.

Proof of Lemma 2. (a) Note that J00 satisfies rk(AJ00) = 0. Thus, it is sufficient to show

that CJ = ∅ for all J ⊆ J0 that are not J00. If J 6= J00, then either (i) there exists j ∈ J00/J

or (ii) there exists j ∈ J/J00. In the first case, any x ∈ CJ would have to satisfy 0′x < 0, a

contradiction. In the second case, any x ∈ CJ would have to satisfy 0′x = bj, where bj 6= 0,

another contradiction.

(b) We first show that the C∆J are disjoint for different J ∈ J os

1 . If x ∈ C∆J1∩ C∆

J2for

J1, J2 ∈ J os1 , then both

J6=0 ∩ J1 = J6=0 ∩ J2 = argminj∈J6=0‖aj‖−1(bj − a′jx) (95)

and

J0 ∩ J1 = J0 ∩ J2 = J00. (96)

This implies that J1 = J2. Part (b) then follows from the definition of C |.

33This uses the slackness of the inequalities in the definition of CJ .34See Section 11 of Rockafellar (1970) or Section 5.12 in Luenberger (1969).

60

(c) For any x ∈ C |, let J6=0(x) = argminj∈J 6=0‖aj‖−1(bj − a′jx). We show below that

∃j, ` ∈ J 6=0(x) s.t. ‖aj‖−1aj 6= ‖a`‖−1a`. (97)

That implies that for any x ∈ C |, there exists j, ` ∈ J6=0 such that ‖aj‖−1aj 6= ‖a`‖−1a` and

‖aj‖−1(bj − a′jx) = ‖a`‖−1(b` − a′`x). Or equivalently,

C | ⊆ ∪ j,`∈J 6=0‖aj‖a` 6=‖a`‖aj

x ∈ Rdm|‖aj‖b` − ‖a`‖bj = (‖aj‖a` − ‖a`‖aj)′x. (98)

Since the right hand-side is a finite union of measure-zero subspaces of Rdm , it must be that

C | has Lebesgue measure zero, establishing part (c).

Now we show (97). Let J(x) = J00 ∪ J6=0(x). We note that J 6=0(x) is not empty because

A 6= 0dA×dX . This implies that rk(AJ(x)) ≥ 1. Then there are two possibilities: rk(AJ(x)) ≥ 2

and rk(AJ(x)) = 1. In the first case, (97) holds trivially.

In the latter case, we first show that J(x) ∈ J1. Suppose there exists j ∈ J(x) and

` ∈ 1, ..., dA/J(x) such that ‖aj‖> 0, ‖a`‖> 0,aj‖aj‖ = a`

‖a`‖, and

bj‖aj‖ = b`

‖a`‖. This implies

‖a`‖−1(b` − a′`x) = ‖aj‖−1(bj − a′jx), (99)

so ` should also belong to J(x). Since such a j and ` cannot exist, it must be the case that

J(x) ∈ J1. The fact that x ∈ C | means that J(x) /∈ J os1 . Thus, it must be that J(x) ∈ J ts

1 ,

which also implies (97). Therefore (97) holds in all cases. This proves part (c).

(d) First note that for every J ∈ J os1 ∪ J ts

1 we have rk(AJ) = 1. Thus, it is sufficient to

show that for every J ⊆ 1, ..., dA with rk(AJ) = 1 and J /∈ J os1 ∪ J ts

1 , we have KJ = ∅.Note that if J ∩ J0 6= J00, then either (i) there exists j ∈ J00/(J0 ∩ J) or (ii) there exists

j ∈ (J ∩J0)/J00. In the first case, any x ∈ CJ would have to satisfy 0′x < 0, a contradiction.

In the second case, any x ∈ CJ would have to satisfy 0′x = bj, where bj 6= 0, another

contradiction. This implies that CJ , and therefore KJ , is empty.

We next note that if j ∈ J while ` ∈ 1, ..., dA/J with ‖aj‖> 0, ‖a`‖> 0,aj‖aj‖ = a`

‖a`‖,

andbj‖aj‖ = b`

‖a`‖, then any x ∈ CJ should satisfy

‖a`‖−1(b` − a′`x) = ‖aj‖−1(bj − a′jx) = 0, (100)

so ` should also belong to J . This contradiction implies that CJ , and therefore KJ , must be

empty.

This implies that the only nonempty KJ with rk(AJ) = 1 must belong to J1. If we

suppose that J /∈ J os1 , then there must exist j, ` ∈ J s.t. ‖aj‖> 0, ‖a`‖> 0, and

aj‖aj‖ 6=

a`‖a`‖

.

61

However, since rk(AJ) = 1, a` and aj must be collinear. This implies thataj‖aj‖ = − a`

‖a`‖.

Then, any x ∈ CJ must satisfy

0 = ‖a`‖−1(b` − a′`x) = ‖aj‖−1(bj − a′jx). (101)

This implies ‖a`‖−1b` = −‖aj‖−1bj, which implies that J ∈ J ts.

Therefore, the only J ⊆ 1, ..., dA with rk(AJ) = 1 and KJ 6= ∅ belong to J os1 ∪J ts

1 .

Proof of Lemma 4. For every λ ≥ 0, let

f(λ) =

∫ ∞−τ

(α− 1Z > z1−β/2)e−12

(Z+λ)2

dZ. (102)

Note that α ≤ 1 implies that β ≤ 2Φ(τ), which in turn implies that z1−β/2 ≥ −τ . We show

that f(λ) ≥ 0 for all λ ≥ 0. This is sufficient because

αPr µ(Z ≥ −τ)− Prµ(Z ≥ z1−β/2)

=

∫ ∞−τ

(α− 1Z > z1−β/2)1√2πe−

12

(Z−µ)2

dZ

=f(−µ)√

2π≥ 0 (103)

for all µ ≤ 0.

Let f ′(λ) denote the derivative of f . We show that (1) f(0) ≥ 0 and (2) for all λ ≥ 0,

f ′(λ) ≥ −(z1−β/2 + λ

)f(λ). Together, these two properties imply that f(λ) ≥ 0 because, if

not, then there exists a λ > 0 such that f(λ) < 0. Then, by the mean value theorem, there

exists a λ ∈ (0, λ) such that f(λ) < 0 and f ′(λ) < 0, which contradicts property (2).

Property (1) holds because

f(0)√2π

=

∫ ∞−τ

(α− 1Z > z1−β/2)1√2πe−

12Z2

dZ

= αΦ(τ)− (1− Φ(z1−β/2)) = αΦ(τ)− β/2 = 0. (104)

This also shows that equality holds when µ = 0.

To show that property (2) holds, we evaluate

f ′(λ) =d

dλ

∫ ∞−τ

(α− 1Z > z1−β/2)e−12

(Z+λ)2

dZ

= −∫ ∞−τ

(Z + λ)(α− 1Z > z1−β/2)e−12

(Z+λ)2

dZ

62

= −∫ z1−β/2

−τα(Z + λ)e−

12

(Z+λ)2

dZ +

∫ ∞z1−β/2

(1− α)(Z + λ)e−12

(Z+λ)2

dZ

≥ −∫ z1−β/2

−τα(z1−β/2 + λ)e−

12

(Z+λ)2

dZ +

∫ ∞z1−β/2

(1− α)(z1−β/2 + λ)e−12

(Z+λ)2

dZ

= −(z1−β/2 + λ)

∫ ∞−τ

(α− 1Z > z1−β/2)e−12

(Z+λ)2

dZ

= −(z1−β/2 + λ)f(λ), (105)

where the second equality follows by dominated convergence and the inequality follows from

the events Z > z1−β/2 and Z ≤ z1−β/2.

B Theorem 3 and the Proof of Theorem 2

In this section we prove Theorem 3, a general theorem for uniform asymptotic properties of

the CC and RCC tests. Theorem 3 is used to prove Theorem 2.

B.1 Theorem 3: A General Asymptotic Theorem

In this section, we sometimes make explicit the dependence of A and b on θ, denoting them

by A(θ) and b(θ). The rows of A(θ) are denoted by aj(θ), and submatrices composed of the

rows of A(θ) are denoted by AJ(θ).

Assumption 2. The given sequence (Fn, θn) : Fn ∈ F , θn ∈ Θ0(Fn)∞n=1 satisfies, for every

subsequence, nm, there exists a further subsequence, nq, and there exists a sequence of positive

definite dm × dm matrices, Dq such that:

(a) Under the sequence Fnq∞q=1,

√nqD

−1/2q (mnq(θnq)− EFnqmnq(θnq))→d N(0,Ω), (106)

for some positive definite correlation matrix, Ω, and

‖D−1/2q Σnq(θnq)D

−1/2q − Ω‖→p 0. (107)

(b) ΛqA(θnq)Dq → A0 for some dA × dm matrix A0, and for every J ⊆ 1, ..., dA,rk(IJA(θnq)Dq) = rk(IJA0), where Λq is the diagonal dA × dA matrix whose jth diagonal

entry is one if e′jA(θnq) = 0 and ‖e′jA(θnq)Dq‖−1 otherwise.

Remark. The matrix Dq typically is the diagonal matrix of variances of the elements of√nqmnq(θnq). In part (a), we allow each diagonal element to go to zero (or infinity) at

63

different rates, to incorporate the cases where different moments are on different scales or

where different moments involve time series processes that are integrated at different orders.

Andrews and Guggenberger (2009), Andrews and Soares (2010), and Andrews et al. (2020)

also use a diagonal normalizing matrix for this purpose. Moreover, the matrix Dq can

be non-diagonal, which is useful when the asymptotic variance matrix of√nq(mnq(θnq) −

EFnmnq(θnq)) is singular but a certain rotation of the vector with proper scaling has a non-

singular asymptotic variance matrix.

Part (b) is not required to show the uniform asymptotic validity of the RCC test. It is

only used to show asymptotic size-exact and the asymptotic IDI property. The existence

of A0 follows by the choice of the subsequence, while the rank condition is used to verify

Lemma 6, below.

The following theorem is a general asymptotic theorem used to show the uniform asymp-

totic properties of the RCC test.

Theorem 3. (a) Suppose Assumption 2(a) holds for all sequences (Fn, θn) : Fn ∈ F , θn ∈Θ0(Fn)∞n=1. Then,

limsupn→∞

supF∈F

supθ∈Θ0(F )

EF (φRCCn (θ, α)) ≤ α.

Next consider a sequence (Fn, θn) : Fn ∈ F , θn ∈ Θ0(Fn)∞n=1 satisfying Assumption 2(a,b).

(b) If, along any further subsequence, for all j = 1, ..., dA,√nqe

′jΛq(A(θnq)EFnqmnq(θnq)−

b(θnq))→ 0, and if A0 6= 0dA×dm, then,

limn→∞

EFnφRCCn (θn, α) = α.

(c) If, for J ⊆ 1, ..., dA, along any further subsequence,√nqe

′jΛq(A(θnq)EFnqmnq(θnq)−

b(θnq))→ −∞ as q →∞, for all j /∈ J , then

limn→∞

PrFn(φRCCn (θn, α) 6= φRCC

n,J (θn, α))

= 0.

Remarks. (1) Notice that no assumptions are placed on A(θ) for Theorem 3(a). It can be

low-rank or any submatrix of A(θ) can be local to singular as θ varies. This is achieved by

an extra step in the proof that adds inequalities that are redundant in the finite sample but

are relevant in the limit (see Lemma 7 below).

(2) If θn and Fn are such that EFnmn(θn) does not depend on n (for example, if Wini=1

is stationary under Fn with a fixed marginal distribution and θ ∈ Θn(Fn) is fixed), and if

A(θn) and b(θn) are fixed, then the condition in part (c) is automatically satisfied with J

64

equal to the set of all binding inequalities. If, in addition, AJ(θn) 6= 0, parts (b) and (c) can

be combined to show that the RCC test has exact asymptotic size.

B.2 Auxiliary Lemmas for Theorem 3

The proof of Theorem 3 uses four important lemmas. Lemma 5 establishes a condition under

which the projection onto a sequence of polyhedra converges when the coefficient matrix

defining the polyhedra converges. The condition is verified in a special context in Lemma

6, which is used to prove part (b) of Theorem 3. The conditions for part (a) are not strong

enough for us to apply Lemma 5 because we do not restrict the rank of A(θ). Nonetheless,

Lemma 7 shows that inequalities that are redundant in finite sample but relevant in the

limit can be added to guarantee the condition of Lemma 5, and help us to prove part (a) of

Theorem 3. Lemma 8 shows that the additional inequalities from Lemma 7 do not change

the definition of β.

First we define some notation. For any dA × dm real-valued matrix A and vector h ∈RdA

+,∞ := [0,∞]dA , let poly(A, h) = µ ∈ Rdm : Aµ ≤ h denote the polyhedron defined by

inequalities with coefficients given by A and constants given by h. Also define

µ∗(x;A, h) = argminµ∈poly(A,h)

‖x− µ‖2. (108)

The lemma considers a sequence of dA × dm real-valued matrices An∞n=1 and a sequence

of dA × 1 vectors hn ∈ RdA+ := [0,∞)dA such that, as n → ∞, An → A0 and hn → h0 for a

dA × dm real-valued matrix A0 and a vector h0 ∈ RdA+,∞. Also, let xn ∈ Rdm be a sequence of

vectors such that xn → x0 ∈ Rdm as n → ∞. We say that a sequence of sets, poly(An, hn),

Kuratowski converges to a limit set, poly(A0, h0), denoted by

poly(An, hn)K→ poly(A0, h0), (109)

if (i) for every x0 ∈ poly(A0, h0) there exists a sequence xn ∈ poly(An, hn) such that xn → x0,

and (ii) for every subsequence nq and for every converging sequence xnq ∈ poly(Anq , hnq) that

converges to a point x0, we have x0 ∈ poly(A0, h0).35

Lemma 5. If poly(An, hn)K→ poly(A0, h0), then µ∗(xn;An, hn)→ µ∗(x0;A0, h0).

We denote submatrices of An and A0 formed by the rows with indices in J ⊆ 1, ..., dAby AJ,n and AJ,0. Important for the following lemma is the fact that every element of hn is

nonnegative for all n.

35One can check that this definition of Kuratowski convergence is equivalent to other definitions given in,for example Aubin and Frankowska (1990).

65

Lemma 6. If for all J ⊆ 1, . . . , dA, rk(AJ,n) = rk(AJ,0) for all n, then poly(An, hn)K→

poly(A0, h0).

For any dA × dm matrix, A, and for any vector, g, let J(x;A, g) = j ∈ 1, . . . , dA :

a′jµ∗(x;A, g) = gj. This generalizes the previous notation for active inequalities to make

explicit the dependence on A and g. Also let [A;B] denotes the vertical concatenation of

two matrices, A and B.

Lemma 7. Let An be a sequence of dA × dm matrices such that each row is either zero

or belongs to the unit circle. Let gn be a sequence of nonnegative dA-vectors. Then, there

exists a subsequence, nq, a sequence of dB × dm matrices, Bq, and a sequence of nonnegative

dB-vectors hq such that the following hold.

(a) Anq → A0, Bq → B0, gnq → g0, and hq → h0 (some of the elements of g0 and h0 may

be +∞, in which case the convergence/divergence occurs elementwise).

(b) poly(Anq , gnq) ⊆ poly(Bq, hq) for all q.

(c) For all q and for all x ∈ poly(Anq , gnq),

rk(IJ(x;Anq ,gnq )Anq) = rk([IJ(x;Anq ,gnq )Anq ; IJ(x;Bq ,hq)Bq]).

(d) poly([Anq ;Bq], [gnq ;hq])K→ poly([A0;B0], [g0;h0]) as q →∞.

Suppose that j ∈ J(x;A, g) and aj 6= 0. If such a j does not exist, let τj(x;A, g) = 0 for

all j ∈ 1, ..., dA. Otherwise, let

τj(x;A, g) =

‖aj‖(gj−a′jµ∗(x;A,g))

‖aj‖‖aj‖−a′jajif ‖aj‖‖aj‖6= a′jaj

∞ otherwise.(110)

Let τ(x;A, g) = infj∈1,...,dA τj(x;A, g). One can verify that the definition of τ(x;A, g) does

not depend on which j ∈ J(x;A, g) is used to define it, when more than one is available.

This definition coincides with the definition of τ or τ(X), as used in the proof of Theorem

1, making explicit the dependence on A and g.

Lemma 8. If poly(A, g) ⊆ poly(B, h), then τ(x;A, g) = τ(x; [A;B], [g;h]) for all x ∈ Rdm.

66

B.3 Proof of Theorems 2 and 3

Proof of Theorem 2

We verify the conditions of Theorem 3. We first show that Assumption 1 implies that

Assumption 2(a) holds for any sequence (θn, Fn) : Fn ∈ F , θn ∈ Θ0(Fn). Fix an arbitrary

sequence (θn, Fn) : Fn ∈ F , θn ∈ Θ0(Fn). Let Σn = VarFn(m(Wi, θn)), which does not

depend on i due to Assumption 1(a). Let Dn be the diagonal matrix formed by the diagonal

elements of Σn. By Assumption 1(b), Dn is invertible and thus we can define

Ωn = D−1/2n ΣnD

−1/2n = CorrFn(m(Wi, θn)). (111)

The elements of Ωn ∈ [−1, 1] which is a compact set. Thus, for any subsequence of n,there is a further subsequence nq such that

Ωnq → Ω, (112)

for some matrix Ω. By Assumption 1(c), Ω is positive definite.

Now consider an arbitrary vector a ∈ Rdm such that a′a = 1, and consider the sequence

of random variables:

n1/2a′D−1/2q (mnq(θnq)− EFnqmnq(θnq))

= n−1/2

n∑i=1

a′D−1/2q (m(Wi, θnq)− EFnqm(Wi, θnq))

→d N(0, a′Ωa), (113)

by the Lindeberg-Feller central limit theorem, where the Lindeberg condition holds because

EFnq |a′D−1/2

q m(Wi, θnq)|2+ε≤ EFnq [dm∑j=1

aj|mj(Wi, θnq)/σFnq ,j(θnq)|2+ε] ≤M <∞, (114)

where the first inequality holds by the convexity of g(x) = |x|2+ε and the second and the

third inequalities hold by Assumption 1(d).

The Cramer-Wold device combined with (113) proves (106) in Assumption 2.

To show (107), consider that

D−1/2q Σnq(θnq)D

−1/2q

67

= n−1

n∑i=1

D−1/2q (m(Wi, θnq)− EFnqm(Wi, θnq))(m(Wi, θnq)− EFnqm(Wi, θnq))

′D−1/2q

−D−1/2q (mnq(θnq)− EFnqm(Wi, θnq))(mnq(θnq)− EFnqm(Wi, θnq))

′D−1/2q . (115)

By Assumptions 1(a) and (d), the law of large numbers for rowwise i.i.d. triangular arrays

applies and gives us

n−1

n∑i=1

D−1/2q (m(Wi, θnq)− EFnqm(Wi, θnq))(m(Wi, θnq)− EFnqm(Wi, θnq))

′D−1/2q

→p Ω, (116)

and similarly

D−1/2q (mnq(θnq)− EFnqm(Wi, θnq))→p 0. (117)

Thus, (107) is also verified.

Next, we show that Assumption 1, combined with the additional assumptions in The-

orem 2(b), implies Assumption 2(b). First note that each element of Λq is either one or

‖e′jA(θnq)Dq‖−1. By the common additional condition for Theorem 2(b,c), ‖e′jA(θnq)Dq‖−1→‖e′jA∞‖−1. Note that e′jA(θnq)Dq cannot go to zero if e′jA(θnq) 6= 0 because that would vi-

olate the common additional condition for Theorem 2(b,c) for J = j. Therefore, there

exists a further subsequence along which Λq → Λ∞ for a positive definite diagonal matrix

Λ∞. Therefore, ΛqA(θnq)Dq → A0 = Λ∞A∞. Also note that for each J ⊆ 1, ..., dA,rk(AJ(θnq)Dq) = rk(IJA∞) = rk(IJA0), where the first equality follows from the common

additional condition for Theorem 2(b,c) and the second equality follows because each row

of A0 is a positive scalar multiple of the corresponding row of A∞. This verifies Assumption

2(b).

We also note that along every further subsequence, each diagonal element of Λq converges

to a positive value. This implies that, for part (b), we have for every j = 1, ..., dA,

√nqe

′jΛq(Aj(θnq)EFnqmnq(θnq)− b(θnq))→ 0. (118)

Also, for part (c), we have for every j /∈ J ,

√nqe

′jΛq(Aj(θnq)EFnqmnq(θnq)− b(θnq))→ −∞. (119)

Also, for part (b), A0 6= 0 is implied by A∞ 6= 0 because Λ∞ is positive definite.

68

Therefore, Theorem 2 follows from Theorem 3.

Proof of Theorem 3

We first prove part (a). Let θn, Fn∞n=1 be an arbitrary sequence satisfying Fn ∈ F and

θn ∈ Θ0(Fn) for all n. Let nm be an arbitrary subsequence of n. It is sufficient to show

that there exists a further subsequence, nq, such that as q →∞,

liminfq→∞

Pr Fnq

(Tnq(θnq) ≤ χ2

r,1−β

)≥ 1− α. (120)

Fix an arbitrary subsequence, nm. By Assumption 2(a), there exists a further subse-

quence, nq, a sequence of positive definite matrices, Dq, and a positive definite correlation

matrix, Ω0, such that36

√nqD

−1/2q (mnq(θnq)− EFnqmnq(θnq))→d Y ∼ N(0,Ω0), and (121)

D−1/2q Σnq(θnq)D

−1/2q →p Ω0. (122)

We introduce some simplified notation. Let Ωq = D−1/2q Σnq(θnq)D

−1/2q , X = Ω

−1/20 Y ∼

N(0, I), Yq =√nqD

−1/2q (mnq(θnq)−EFnqmnq(θnq)), and Xq = Ω

−1/2q Yq. Equations (121) and

(122) imply that

Xq →d X ∼ N(0, I), and (123)

Ωq →p Ω0. (124)

The remainder of the proof proceeds in four steps. (A) In the first step, the problem

defined in (17) is transformed to include additional inequalities. (B) In the second step,

notation is defined for partitioning Rdm according to Lemma 1, for both finite q and the

limit. (C) In the third step, the almost sure representation theorem is invoked on the

convergence in (123) and (124). (D) In the final step, we show that (almost surely) the event

Tnq(θnq) ≤ χ2r,1−β eventually implies a limiting event based on X and Ω0. This limiting event

has probability at least 1− α from Theorem 1.

(A) Consider the sequence of matrices A(θnq)D1/2q . For each q, let Λq denote a dA × dA

diagonal matrix with positive entries on the diagonal such that each row of ΛqA(θnq)D1/2q

is either zero or belongs to the unit circle. Such a Λq always exists by taking the diagonal

element to be the inverse of the magnitude of the corresponding row of A(θnq)Dq, if it

is nonzero, and one otherwise. Let gq =√nqΛq(b(θnq) − A(θnq)EFnqmnq(θnq)). With this

36For notational simplicity, we denote all further subsequences by nq

69

notation, we can write

Tnq(θnq) = infy:ΛqA(θnq )D

1/2q y≤gq

(Yq − y)′Ω−1q (Yq − y), (125)

which adds and subtracts EFnqmnq(θnq) in the objective and applies the change of variables,

y =√nqD

−1/2q (µ− EFnqmnq(θnq)).

We can apply Lemma 7 to ΛqA(θnq)D1/2q and gq to get a further subsequence, nq, a

sequence of matrices, Bq, a sequence of vectors, hq, matrices A0 and B0, and vectors g0 and

h0, satisfying Lemma 7(a-d). Let

Aq =

[ΛqA(θnq)D

1/2q

Bq

]and hq =

[gq

hq

], (126)

and similarly for A0 and h0. Let dA = dA + dB. We have that

Tnq(θnq) = infy:Aqy≤hq

(Yq − y)′Ω−1q (Yq − y) (127)

= inft:AqΩ

1/2q t≤hq

(Xq − t)′(Xq − t), (128)

where the first equation follows from Lemma 7(b) and the second equation follows from the

change of variables t = Ω−1/2q y.

Equation (128) has changed the problem by adding additional inequalities. We verify

that the rank of the active inequalities is unchanged. For any positive definite matrix, Ω, let

Jq(x,Ω) be the set of indices for the active inequalities in the problem:

infy:ΛqA(θnq )D

1/2q y≤gq

(x− y)′Ω−1(x− y). (129)

Recall that J is the set of active inequalities for the problem defined in (17), which is equal

to Jq(Yq, Ωq) by a change of variables. Similarly, let Jq(x,Ω) be the set of active inequalities

in the problem:

inft:AqΩ1/2t≤hq

(x− t)′(x− t). (130)

Also let t∗q(x,Ω) denote the unique minimizer. We have that for any y ∈ Rdm and for any

positive definite Ω,

rk(AJq(y,Ω)(θnq)) = rk(IJq(y,Ω)ΛqA(θnq)D1/2q )

= rk(IJq(Ω−1/2y,Ω)Aq) = rk(IJq(Ω−1/2y,Ω)AqΩ1/2), (131)

70

where the first equality follows because Λq is diagonal with positive entries on the diagonal

and Dq is positive definite, the second equality follows by Lemma 7(c), and the final equality

follows from the fact that Ω is positive definite.

Before proceeding to the next step, we simplify the rank calculation by taking a further

subsequence. Notice that for each J ⊆ 1, ..., dA, rk(IJAq) ∈ 1, ..., dm. We can denote it

by rqJ , and then take a subsequence, nq, so that for all J , rqJ does not depend on q. Similarly,

we define r∞J = rk(IJA0). Note that by the convergence of Aq to A0, rqJ ≥ r∞J for all J .

(B) For any positive definite dm × dm matrix, Ω, and for every J ⊆ 0, 1, ..., dA, let

Aq(Ω) = AqΩ1/2

aq`′(Ω) = `th row of Aq(Ω)

Cq(Ω) = x ∈ Rdm : aq`′(Ω)x ≤ h`,q for all ` = 1, ..., dA

CqJ(Ω) = x ∈ Cq(Ω) : aq`

′(Ω)x = h`,q for all ` ∈ J and aq`′(Ω)x < h`,q for all ` ∈ J c

V qJ (Ω) =

∑`∈J

vàq`(Ω) : v` ∈ R, v` ≥ 0

, and

KqJ(Ω) = Cq

J(Ω) + V qJ (Ω). (132)

Furthermore, for every J ⊆ 1, ..., dA, let P qJ (Ω) denote the projection onto span(V q

J (Ω)),

and let M qJ(Ω) denote its orthogonal projection. There exists a κqJ(Ω) ∈ span(V q

J (Ω)) such

that for every x ∈ CqJ(Ω), P q

J (Ω)x = κqJ(Ω). This follows because for two x1, x2 ∈ CqJ(Ω),

and for any v ∈ span(V qJ (Ω)), v′(x1 − x2) = 0, which implies that P q

J (Ω)(x1 − x2) = 0.

For every given Ω, we can apply Lemma 1 to the objects defined in (132). This implies

(a) if x ∈ KqJ(Ω) then x− t∗q(x,Ω) ∈ V q

J (Ω) and t∗q(x,Ω) ∈ CqJ(Ω),

(b) the sets KqJ(Ω) for all J ⊆ 1, . . . , dA form a partition of Rdm , and

(c) for each J ⊆ 1, . . . , dA, we have x ∈ KqJ(Ω) iff J = Jq(x,Ω).

These properties imply that, for all x ∈ KqJ(Ω), we can write

P qJ (Ω)x = P q

J (Ω)(x− t∗q(x,Ω)) + P qJ (Ω)t∗q(x,Ω) = x− t∗q(x,Ω) + κqJ(Ω), (133)

where the second equality follows by (a) and the definition of κqJ(Ω). Then, we can also write

M qJ(Ω)x = x− P q

J (Ω)x = t∗q(x,Ω)− κqJ(Ω).

71

Let rq(x,Ω) = rk(AJq(x,Ω),q). When rq(x,Ω) = 1, we can define

τ qj (x,Ω) =

‖aqj(Ω)‖(hj,q−aqj (Ω)′t∗q(x,Ω))

‖aqj (Ω)‖‖aqj(Ω)‖−aqj (Ω)′aq

j(Ω)

if ‖aqj(Ω)‖‖aqj(Ω)‖6= aqj(Ω)′aq

j(Ω)

∞ else, (134)

where j ∈ Jq(x,Ω) such that aqj(Ω) 6= 0. We also let τ q(x,Ω) = infj=1,...,dA

τ qj (x,Ω), and

βq(x,Ω) = 2αΦ(τ q(x,Ω)). When rq(x,Ω) 6= 1, let τ qj (x,Ω) = 0, so that βq(x,Ω) = α. Note

that β = βq(Xq, Ωq), where the addition of extra inequalities via Lemma 7 has no effect on

β or τ because of Lemma 8, where the condition is satisfied by Lemma 7(b).

We define similar notation for the limiting objects. Let J∞ = ` ∈ 1, ..., dA : h`,0 <∞.These are the indices for the inequalities that are “close-to-binding.” For any positive definite

matrix, Ω, let A∞(Ω) denote the matrix formed by the rows of A0Ω1/2 associated with the

indices in J∞. For notational simplicity, we refer to the rows of A∞(Ω) using ` ∈ J∞ even

though the matrix A∞(Ω) has been compressed.

For every J ⊆ J∞, let

a∞`′(Ω) = `th row of A∞(Ω) for ` ∈ J∞

C∞(Ω) = x ∈ Rdm : a∞` (Ω)′x ≤ h`,0 for all ` ∈ J∞

C∞J (Ω) = x ∈ C∞(Ω) : a∞` (Ω)′x = h`,0 ∀` ∈ J and a∞` (Ω)′x < h`,0 ∀` ∈ J∞/J

V ∞J (Ω) =

∑`∈J

và∞` (Ω) : v` ∈ R, v` ≥ 0

, and

K∞J (Ω) = C∞J (Ω) + V ∞J (Ω). (135)

Furthermore, for every J ⊆ J∞, let P∞J (Ω) denote the projection onto span(V ∞J (Ω)). There

exists a κ∞J (Ω) ∈ span(V ∞J (Ω)) such that for every x ∈ C∞J (Ω), P∞J (Ω)x = κ∞J (Ω). This

follows because for two x1, x2 ∈ C∞J (Ω), and for any v ∈ span(V ∞J (Ω)), v′(x1 − x2) = 0,

which implies that P∞J (Ω)(x1 − x2) = 0.

Let h∞ denote the vector formed from the elements of h0 that are finite. Let J∞(x,Ω)

be the indices for the binding inequalities in the problem:

inft:A∞(Ω)t≤h∞

(x− t)′(x− t). (136)

Also let t∗∞(x,Ω) denote the unique minimizer. We can apply Lemma 1 to the objects defined

in (135). This implies that

(a∞) if x ∈ K∞J (Ω) then x− t∗∞(x,Ω) ∈ V ∞J (Ω) and t∗∞(x,Ω) ∈ C∞J (Ω),

72

(b∞) the set of all K∞J (Ω) form a partition of Rdm , and

(c∞) for each J ⊆ J∞, we have x ∈ K∞J (Ω) iff J = J∞(x,Ω).

Let r∞(x,Ω) = rk(A∞J∞(x,Ω)). When r∞(x,Ω) = 1, we can define

τ∞j (x,Ω) =

‖a∞j

(Ω)‖(hj,0−a∞j (Ω)′t∗∞(x,Ω))

‖a∞j (Ω)‖‖a∞j

(Ω)‖−a∞j (Ω)′a∞j

(Ω)if ‖a∞j (Ω)‖‖a∞j (Ω)‖6= a∞j (Ω)′a∞j (Ω)

∞ else, (137)

where j ∈ J∞(x,Ω) such that a∞j (Ω) 6= 0. We also let τ∞(x,Ω) = infj=J∞ τ∞j (x,Ω), and

β∞(x,Ω) = 2αΦ(τ∞(x,Ω)). When r∞(x,Ω) 6= 1, let τ∞j (x,Ω) = 0, so that β∞(x,Ω) = α.

Before proceeding to the next step, consider M qJ(Ω0), which is a sequence of projection

matrices in Rdm onto a space of dimension dm − rqJ . Since the space of such matrices is

compact, we can find a subsequence, nq, such that for all J ⊆ 1, ..., dA, M qJ(Ω0) → MN

J ,

where MNJ is a projection matrix onto a subspace, NJ , of dimension dm−rqJ .37 Furthermore,

for any sequence of positive definite matrices such that Ωq → Ω0, we have M qJ(Ωq) → MN

J .

This follows because, if we let Eq denote a dm×rqJ matrix whose columns form an orthonormal

basis for span(V qJ (Ω0)) (which is the range of P q

J (Ω0)) then for any positive definite matrix,

Ω, the columns of Ω1/2Ω−1/20 Eq form a basis for span(V q

J (Ω)), which implies that

M qJ(Ωq) = Idm − Ω1/2

q Ω−1/20 Eq(E

′qΩ−1/20 ΩqΩ

−1/20 Eq)

−1E ′qΩ−1/20 Ω1/2

q

= Idm − Eq(E ′qEq)−1E ′q + o(1) = M qJ(Ω0) + o(1). (138)

(C) Next, we invoke the almost sure representation theorem on the convergence in (123)

and (124).38 Then, we can treat the convergence in (123) and (124) as holding almost

surely.39 For the rest of the proof of part (a), consider A∞(Ω), P∞J (Ω), κ∞J (Ω), and the

objects defined in (135) and let the objects without the argument (Ω) denote the objects

evaluated at Ω0. For example, A∞ = A∞(Ω0).

We now construct an event, Ξ ⊆ Rdm , such that Pr(X ∈ Ξ) = 1. For every L ⊆ J∞, let

V ∞L+ = x ∈ V ∞L |∀L′ ⊆ L, if rqL′ < r∞L then MNL′x 6= 0. (139)

For each L ⊆ J∞ such that r∞L > 0, let

ΞL = x ∈ K∞L : P∞L x− κ∞L ∈ V ∞L+ and (P∞L x− κ∞L )′(P∞L x− κ∞L ) 6= χ2r∞L ,1−β∞(x). (140)

37Recall that rqJ does not depend on q due to the construction of the subsequence nq.38See van der Vaart and Wellner (1996), Theorem 1.10.3, for the a.s. representation theorem.39This can be formalized by defining random variables, Xq, X, and Ωq, satisfying Xq =d Xq, X =d X,

Ωq =d Ωq, Xq →a.s. X, and Ωq →a.s. Ω0.

73

Since r∞L > 0, P∞L X ∼ N(0, P∞L ), which is absolutely continuous on span(V ∞L ), and

therefore the probability that P∞L X − κ∞L lies in any one of the finitely many subspaces,

null(MNL′ ) = x ∈ Rdm : MN

L′x = 0, each with dimension rqL′ < r∞L , is zero. Also,

(P∞L X − κ∞L )′(P∞L X − κ∞L ) is absolutely continuous because it can be written as the sum

of rk(A∞L ) squared normal random variables. Also, χ2r∞L ,1−β∞(X) depends on X only through

M∞L X, which is independent of P∞L X. Therefore, for each fixed M∞

L X, the conditional

probability that (P∞L X − κ∞L )′(P∞L X − κ∞L ) = χ2r∞L ,1−β∞(M∞L X+κ∞L ) is zero. This implies that

the unconditional probability is also zero. Therefore,

Pr(P∞L X − κ∞L ∈ V ∞L /V ∞L+ or (P∞L X − κ∞L )′(P∞L X − κ∞L ) = χ2r∞L ,1−β∞(X)) = 0. (141)

For L ⊆ J∞ such that rk(A∞L ) = 0, let ΞL = K∞L . Then, let Ξ = ∪L⊆J∞ΞL. Therefore,

by property (b∞) and equation (141), Pr(X ∈ Ξ) = 1.

(D) We consider the set of all sequences such that xq → x∞ ∈ Ξ and Ωq → Ω0. By

the definition of Ξ and the almost sure convergence of (Xq, Ωq), these sequences occur with

probability one. Fix such a sequence for the remainder of the proof of part (a). For this

step, consider Aq(Ω), P qJ (Ω), and the objects defined in (132), and let the objects without

the argument (Ω) denote the objects evaluated at Ωq.

Below we show that for each sequence,

1‖xq − t∗q(xq)‖2≤ χ2rq(xq),1−βq(xq) ≥ 1‖x∞ − t∗∞(x∞)‖2≤ χ2

r∞(x∞),1−β∞(x∞) (142)

eventually. Notice that by (128) and (130), the left hand side is equal to 1Tnq(θnq) ≤ χ2r,1−β

with (Xq, Ωq) replaced by (xq,Ωq). If (142) holds, then by the bounded convergence theorem,

liminfq→∞

Pr Fnq (Tnq(θnq) ≤ χ2r,1−β) ≥ Pr (‖X − t∗∞(X,Ω0)‖2≤ χ2

r∞(X,Ω0),1−β∞(X,Ω0)). (143)

Also,

Pr (‖X − t∗∞(X,Ω0)‖2≤ χ2r∞(X,Ω0),1−β∞(X,Ω0)) ≥ 1− α (144)

by Theorem 1(a), which applies with n = 1 and mn(θ) = X because t∗∞(X,Ω0) = PC∞X,

where PC∞ is the projection of X onto C∞ = µ ∈ Rdm : A∞(Ω0)µ ≤ h∞. Together, (143)

and (144) imply (120) for the given subsequence, nq.

To finish the proof of part (a), we prove (142). Let L∞ be the subset of J∞ for which

x∞ ∈ K∞L∞ . We show that

1 =∑

L⊆1,...,dA

1xq ∈ KqL =

∑L⊆L∞:rqL≥r

∞L∞

1xq ∈ KqL (145)

74

eventually. Property (b) above implies that the first equality holds at every q. Thus it

is sufficient to show the second equality. Note that t∗q(xq) → t∗∞(x∞) by Lemma 5, using

Lemma 7(d) to verify the condition. For the second equality, it is sufficient to show that, for

all L /∈ L ⊆ L∞ : rqL ≥ r∞L∞, xq /∈ KqL eventually. Specifically, we consider three cases: (I)

L 6⊆ J∞, (II) L ⊆ J∞ but L 6⊆ L∞, (III) L ⊆ L∞ but rqL < r∞L∞ .

(I) Let L 6⊆ J∞. Then, there exists a ` ∈ L such that h`,q → ∞. Then aq`′t∗q(xq) < h`,q

eventually because t∗q(xq) → t∗∞(x∞). This implies that t∗q(xq) /∈ CqL, and therefore by (a),

xq /∈ KqL eventually.

(II) Let L ⊆ J∞ but L 6⊆ L∞. Then, there exists a ` ∈ L such that a∞`′t∗∞(x∞) < h`,0. By

the fact that aq`′t∗q(xq)→ a∞`

′t∗∞(x∞) and h`,q → h`,0, we have that aq`′t∗q(xq) < h`,q eventually.

This implies that z∗q (xq) /∈ CqL, and therefore by property (a) above, xq /∈ Kq

L eventually.

(III) Let L ⊆ L∞ such that rqL < r∞L∞ . This case is impossible if r∞L∞ = 0. Thus we only

need to consider r∞L∞ > 0. Note that x∞− t∗∞(x∞) = P∞L∞x∞−κ∞L∞ by property (a∞) above.

Also, by the definition of Ξ we have x∞ ∈ ΞL∞ , which implies that x∞ − t∗∞(x∞) ∈ V ∞L∞+,

which in turn means that MNL (x∞ − t∗∞(x∞)) 6= 0. By the convergence, M q

L(xq − t∗q(xq))→MN

L (x∞ − t∗∞(x∞)), we have that M qL(xq − t∗q(xq)) 6= 0 eventually. However, if xq ∈ Kq

L,

then by property (a) above, xq − t∗q(xq) ∈ VqL , which implies that M q

L(xq − t∗q(xq)) = 0. This

means that xq /∈ KqL eventually. Therefore, (145) holds eventually.

We next show that

lim supq→∞

βq(xq) ≤ β∞(x∞). (146)

It is sufficient to show that for every subsequence, there exists a further subsequence such

that (146) holds. Thus, it is without loss of generality to suppose that βq := βq(xq) converges.

If βq → α, then (146) holds simply because β∞(x∞) ≥ α. If limq→∞ βq > α, then for every

q large enough, there exists a jq such that aqjq6= 0 and aq

jq

′xq = hjq ,q. We can take a further

subsequence so that jq does not vary with q (and denote it by j). Then limq→∞ aqj

= a∞j 6= 0

because each row of Aq is either zero or belongs to the unit circle. Also, the fact that

hj,q = aqj

′xq → a∞j′x∞ = hj,0 implies that j ∈ J∞. Thus, j can be used to define τ∞j (x∞).

Take j ∈ J∞, and consider two cases. (i) For j such that ‖a∞j ‖‖a∞j ‖= a∞j

′a∞j , we have

τ∞j (t∗∞(x∞); A0Ω1/2, h0) =∞. (ii) For j such that ‖a∞j ‖‖a∞j ‖6= a∞j

′a∞j , we have

τ qj (xq) =‖aq

j‖(hj,q − aqj

′t∗q(xq))

‖aqj‖‖aqj‖−aqj

′aqj

→‖a∞j ‖(hj,0 − a

∞j′t∗∞(x∞)

‖a∞j ‖‖a∞j ‖−a∞j′a∞j

= τ∞j (x∞), (147)

which uses aqj→ a∞j , aqj → a∞j , hj,q → hj,0, and t∗q(xq) → t∗∞(x∞) by Lemma 5 and Lemma

75

7(d). Therefore,

limq→∞

τ q(xq) = limq→∞

infj∈1,...,dA

τ qj (xq)

≤ limq→∞

infj∈J∞:‖a∞

j‖‖a∞j ‖6=a∞j

′a∞jτ qj (xq) (148)

= infj∈J∞:‖a∞

j‖‖a∞j ‖6=a∞j

′a∞jτ∞j (x∞)

= infj∈J∞

τ∞j (x∞) = τ∞(x∞).

This shows that (146) holds.

We now verify (142). Notice that

1‖xq − t∗q(xq)‖2≤ χ2rq(xq),1−βq(xq)

=∑

J⊆1,...,dA

1‖xq − t∗q(xq)‖2≤ χ2rq(xq),1−βq(xq)1xq ∈ K

qJ

=∑

L⊆L∞:rqL≥r∞L∞

1‖xq − t∗q(xq)‖2≤ χ2rq(xq),1−βq(xq)1xq ∈ K

qL

=∑


1‖xq − t∗q(xq)‖2≤ χ2rqL,1−βq(xq)

1xq ∈ KqL

≥∑


1‖xq − t∗q(xq)‖2≤ χ2r∞L∞ ,1−βq(xq)

1xq ∈ KqL (149)

≥1‖x∞ − t∗∞(x∞)‖2≤ χ2r∞L∞ ,1−β∞(x∞)

∑L⊆L∞:rqL≥r

∞L∞

1xq ∈ KqL (150)

=1‖x∞ − t∗∞(x∞)‖2≤ χ2r∞L∞ ,1−β∞(x∞)1x∞ ∈ K∞L∞

=∑J⊆J∞

1‖x∞ − t∗∞(x∞)‖2≤ χ2r∞J ,1−β∞(x∞)1x∞ ∈ K∞J

=∑J⊆J∞

1‖x∞ − t∗∞(x∞)‖2≤ χ2r∞(x∞),1−β∞(x∞)1x∞ ∈ K∞J

=1‖x∞ − t∗∞(x∞)‖2≤ χ2r∞(x∞),1−β∞(x∞), (151)

where: the first equality follows from property (b); the second equality follows from (145);

the third equality follows from property (c); the first inequality follows because rqL ≥ r∞L∞ ;

the second inequality must hold eventually because, when r∞L∞ > 0,

‖xq − t∗q(xq)‖2→ ‖x∞ − t∗∞(x∞)‖2= ‖P∞L∞x∞ − κ∞L∞‖2 6= χ2r∞L∞ ,1−β∞(x∞) (152)

76

and

lim infq→∞

χ2r∞L∞ ,1−βq(xq)

≥ χ2r∞L∞ ,1−β∞(x∞) (153)

(the equality follows from property (a∞) above because x∞ ∈ K∞L∞ , the 6= follows because

x ∈ ΞL∞ , and the ≥ follows from (146)), and when r∞L∞ = 0, A∞L∞ = 0, we have AqL∞ = 0

eventually (because each row of A∞ either belongs to the unit circle or is zero), and therefore,

xq ∈ KqL for L ⊆ K∞ implies xq − t∗q(xq) = 0 eventually; the fourth equality follows from

(145) and x∞ ∈ K∞L∞ ; the fifth equality follows because all the terms with J 6= L∞ are zero;

the sixth equality follows from (c∞); and the final equality follows from (a∞). This verifies

(142), proving part (a).

Next, we prove part (b). Consider the sequence Fn, θn∞n=1. It is sufficient to show that for

every subsequence, nm, there exists a further subsequence, nq, such that

limq→∞

Pr Fnq (Tnq(θnq) ≤ χ2r,1−β) = 1− α. (154)

The proof follows that of part (a) with the following changes.

(i) The augmentation of the inequalities with additional inequalities defined by Bq and

hq in (126) is no longer needed. We can take Aq = ΛqA(θnq)D1/2q and hq = gnq .

(ii) Note that for each j ∈ 1, ..., dA, gj,q (which is equal to hj,q) is either zero or

√nqbj(θnq)− aj(θnq)′EFnqmn(θnq)

‖aj(θnq)′Dq‖→ 0, (155)

by assumption. Thus, h0 = 0dA .

(iii) Without Bq, (131) still holds without appealing to Lemma 7 by Assumption 2(b).

(iv) By Assumption 2(b), for all J ⊆ 1, ..., dA, rqJ = r∞J for all q.

(v) The appeal to Lemma 7 below (145) is replaced by Lemma 6 to get t∗q(xq)→ t∗∞(x∞).

(vi) The expression in (146) becomes

limq→∞

βq(xq) = β∞(x∞). (156)

To show this, we consider two cases. In the first case, we have rk(A∞L∞) = 0. Then by (145),

xq ∈ KqL for some L such that rk(AqL) = 0 eventually. This implies that βq(xq) = α = β∞(x∞)

eventually. In the second case, we have rk(A∞L∞) ≥ 1. Then by (145), xq ∈ KqL for some

L such that rk(AqL) ≥ 1 eventually. That is, for large enough q, there exists a jq such that

aqjq

′t∗q(xq) = hjq ,q. By a subsequencing argument, we can suppose jq does not depend on

q, and denote it by j. If ‖a∞j ‖‖a∞j ‖= a∞j

′a∞j , then τ∞j (x∞) = ∞ and τ qj (xq) = ∞, using

77

the fact that ‖aqj‖‖aqj‖= aqj

′aqj, which follows from the fact that rk(Aqj,j) = rk(A∞j,j). If

‖a∞j ‖‖a∞j ‖6= a∞j

′a∞j , then the convergence in (147) continues to hold by appealing to Lemma

6 instead of Lemma 7(d) to get t∗q(xq)→ t∗∞(x∞). This implies that the inequality in (148)

holds with equality because J∞ = 1, ..., dA, and for all j such that ‖a∞j ‖‖a∞j ‖= a∞j

′a∞j ,

τ qj (xq) =∞. Therefore, (156) holds.

(vii) Now (142) is satisfied with equality for the following reasons: (1) the inequality in

(149) holds with equality because rqL = r∞L , and thus r∞L = r∞L∞ for all L ⊆ L∞ : rqL ≥ r∞L∞ ,

and (2) the inequality in (150) holds with equality eventually using (156).

(viii) An appeal to Theorem 1(b) implies that equality holds in (144). The conditions

for Theorem 1(b) are satisfied because A∞ 6= 0 and h∞ = 0.

Combining these changes with the proof of part (a) proves (154), and therefore, part (b).

Finally, we prove part (c). We can apply the proof of part (b) twice. We can define t∗q(x),

rq(x) and βq(x) in the same way as t∗q(x), rq(x) and βq(x) except with AJ and bJ replacing

A and b. Consider any sequence Ωq → Ω0 and xq → x∞ ∈ Ξ. The proof of part (b) applied

to the original inequalities (A and b) implies (142) holds with equality eventually. We note

that the limiting objects are the same when applied to AJ and bJ because J∞ ⊆ J . This

is because the assumption for part (c) implies that the jth element of gq =√nqΛq(b(θnq)−

A(θnq)EFnqmnq(θnq)) diverges to +∞ for all j /∈ J . Therefore, when we apply the proof of

part (b) to the reduced inequalities (AJ and bJ), we get that

1‖xq − t∗q(xq)‖2≤ χ2rq(xq),1−βq(xq) = 1‖x∞ − t∗∞(x∞)‖2≤ χ2

r∞(x∞),1−β∞(x∞) (157)

eventually as q →∞. Therefore,

11‖xq − t∗q(xq)‖2≤ χ2rq(xq),1−βq(xq) 6= 1‖xq − t∗q(xq)‖2≤ χ2

rq(xq),1−βq(xq) = 0 (158)

eventually. Therefore, by the bounded convergence theorem,

PrFnq

(φRCCnq (θnq , α) 6= φRCC

nq ,J (θnq , α))→ 0. (159)

Since, for every subsequence, nm, there exists a further subsequence, nq, such that (159)

holds, part (c) of Theorem 3 follows.

78

B.4 Proof of Auxiliary Lemmas 5 - 8

Proof of Lemma 5. The assumption implies that there exists a sequence, z∗n, such that

Anz∗n ≤ hn and z∗n → µ∗(x0, A0, h0) as n→∞. (160)

This implies that

‖xn − µ∗(xn, An, hn)‖2≤ ‖xn − z∗n‖2→ ‖x0 − µ∗(x0, A0, h0)‖2. (161)

Taking lim sup on both sides, we get

lim sup n→∞‖xn − µ∗(xn, An, hn)‖2≤ ‖x0 − µ∗(x0, A0, h0)‖2. (162)

Now note that µ∗(xn, An, hn) = arg minz:Anz≤hn‖xn− z‖2. This sequence of minimizers is

necessarily bounded because otherwise (162) cannot hold. Thus for any subsequence nmthere is a further subsequence nq such that µ∗(xnq , Anq , hnq) → z∞ for some z∞ ∈ Rdm .

Since Anqµ∗(xnq , Anq , hnq) ≤ hnq , we have A0z∞ ≤ h0. Thus,

limq→∞‖xnq − µ∗(xnq , Anq , hnq)‖2= ‖x0 − z∞‖2≥ ‖x0 − µ∗(x0, A0, h0)‖2. (163)

Since the subsequence is arbitrary, this implies that

liminfn→∞‖xn − µ∗(xn, An, hn)‖2≥ ‖x0 − µ∗(x0, A0, h0)‖2. (164)

Combining (162) and (164), we have limn→∞‖xn − µ∗(xn, An, hn)‖2= ‖x0 − µ∗(x0, A0, h0)‖2.

This, (163), and the uniqueness of arg minz:A0z≤h0‖x0 − z‖2 together imply that

µ∗(xn, An, hn)→ z∞ = µ∗(x0, A0, h0) as n→∞, (165)

proving the lemma.

Proof of Lemma 6. Let a′j,0 denote the jth row of A0 and let a′j,n denote the jth row of An.

For part (ii) of the definition of convergence, let nq be a subsequence and zq be a sequence

such that zq ∈ poly(Anq , hnq) for all q and zq → z0 as q →∞. Then,

a′j,0z0 = limq→∞

a′j,nqzq ≤ lim supq→∞

hj,nq = hj,0, (166)

showing that z0 ∈ poly(A0, h0).

79

For part (i) of the definition of the convergence, let z0 ∈ poly(A0, h0), and let J0 = j =

1, . . . , dA : a′j,0z0 = hj,0. If J0 = ∅, then z∗n = z0 satisfies the requirement by An → A0 and

hn → h0. If J0 6= ∅ but rk(AJ0,0) = 0, then aj,0 = 0 for all j ∈ J0, which implies that aj,n = 0

for all j ∈ J0 by the rank condition stated in the lemma. Then, we can again let z∗n = z0

and a′j,nz∗n = 0 ≤ hj,n for all j ∈ J0. Again, z∗n satisfies the requirement due to An → A0

and hn → h0.

Now suppose that rk(AJ0,0) > 0. The key for the next step is to partition J0 into two

subsets J∗0 and Jo0 . We require the partition to satisfy the following conditions:

(i) J∗0 contains rk(AJ0,0) elements such that aj∗,0 : j∗ ∈ J∗0 has full rank, and for any

element in jo ∈ Jo0 , there exists a unique linear representation ajo,0 =∑

j∗∈J∗0wjo,j∗aj∗,0,

where wjo,j∗ : j∗ ∈ J∗0 are real-valued weights.

(ii) The linear representation satisfies: for any j∗ ∈ J∗0 and jo ∈ Jo0 such that wjo,j∗ 6= 0,

we have hj∗,0 ≤ hjo,0.

Such a partition always exists. To see why, note that the existence of a partition satisfying

(i) is guaranteed by linear algebra. The number of partitions satisfying (i) is finite because

J0 is a finite set. If we choose the partition to be one that minimizes∑

j∗∈J∗0hj∗,0 among

those satisfying (i), then the chosen partition also satisfies (ii).

We note that for all n, rk(AJ∗0 ,n) = rk(AJ0) implies that for every jo ∈ Jo0 and j∗ ∈ J∗0 ,

there exist weights, wjo,j∗,n, such that

ajo,n =∑j∗∈J∗0

wjo,j∗,naj∗,n. (167)

Furthermore, we know that if wjo,j∗,n 6= 0, then wjo,j∗ 6= 0. This follows because, otherwise,

we would have

rk(Ijo∪(J∗0 /j∗)An) > rk(I(J∗0 /j∗)An) = rk(I(J∗0 /j∗)A0) = rk(Ijo∪(J∗0 /j∗)A0), (168)

contradicting the assumed rank condition.

Let AJ∗0 ,0 denote the submatrix of A0 formed by the rows selected by J∗0 , and let AJ∗0 ,n,

hJ∗0 ,0, and hJ∗0 ,n be defined analogously. Now let D be a (dm − |J0|) × dm matrix, the rows

of which form an orthonormal basis for the orthogonal complement to the space spanned

by aj,0 : j ∈ J∗0. Then the matrix(AJ∗0 ,0D

)is invertible, which implies that the matrix(

AJ∗0 ,nD

)is invertible for large enough n. Let h∧J∗0 ,n = min(hJ∗0 ,n, hJ∗0 ,0), where the minimum

80

is taken element by element. Let

z†n =(AJ∗0 ,nD

)−1 ( h∧J∗0 ,nDz0

). (169)

It is easy to verify that

z†n →(AJ∗0 ,0D

)−1 ( hJ∗0 ,0Dz0

)= z0, and (170)

AJ∗0 ,nz†n = h∧J∗0 ,n ≤ hJ∗0 ,n. (171)

If a′j,nz†n ≤ hj,n for all j ∈ Jo0 for large enough n, then (160) holds with z∗n = z†n and we

are done. Otherwise, let

λn =

min

1,minj∈Jo0 :hj,0>0

hj,n

a′j,nz†n

if j ∈ Jo0 : hj,0 > 0 6= ∅

1 otherwise. (172)

This is well-defined for large enough n since a′j,nz†n → a′j,0z0 = hj,0 and thus a′j,nz

†n 6= 0 for

large enough n. Also, by definition λn ≤ 1, and

λn → minj∈Jo0 :hj,0>0

hj,0a′j,0z0

= 1. (173)

Now let

z∗n = λnz†n. (174)

Then for any j ∈ Jo0 such that hj,0 > 0, we have

a′j,nz∗n ≤ hj,n (175)

by the definition of λn. For any j ∈ Jo0 such that hj,0 = 0, we have

a′j,nz∗n = λn

∑j∗∈J∗0

wj,j∗,na′j∗,nz

†n

= λn∑j∗∈J∗0

wj,j∗,n min(hj∗,n, hj∗,0)

= 0 ≤ hj,n, (176)

where the first equality follows by the definition of the weights, wj,j∗,n, the second equality

81

follows from the definition of z†n, the third equality follows because, if wj,j∗,n 6= 0, then

wj,j∗ 6= 0, and therefore 0 ≤ min(hj∗,n, hj∗,0) ≤ hj∗,0 ≤ hj,0 = 0 by property (ii) of the

partition.

Equations (170), (173), and (174) together imply that z∗n → z0. This also implies that,

for all j /∈ J0, a′j,nz∗n − hj,n → a′j,0z0 − hj,0 < 0 and thus, for large enough n,

A1,...,dA/J0,nz∗n < h1,...,dA/J0,n. (177)

This combined with equations (171), λn ≤ 1, and (174)-(176) implies that Anz∗n ≤ hn.

Therefore, z∗n satisfies the requirement and the lemma is proved.

Proof of Lemma 7. The proof of Lemma 7 makes use of three additional lemmas which are

stated and proved at the end of this subsection. We use bj,q to denote the transpose of the

jth row of Bq, and similarly for aj,nq , aj,0, and bj,0. An equivalent way to state condition (d)

is:

(i) for any further subsequence, nq, and for every sequence xq ∈ poly(Anq , gnq)∩poly(Bq, hq)

such that xq → x0, x0 ∈ poly(A0, g0) ∩ poly(B0, h0), and

(ii) for every x0 ∈ poly(A0, g0)∩poly(B0, h0), there exists xq ∈ poly(Anq , gnq)∩poly(Bq, hq)

such that xq → x0.

Before proving the lemma, we note that for any subsequence, nq such that Anq → A0

and gnq → g0, and for any Bq → B0 and hq → h0, condition (d)(i) is satisfied. Specifically,

let xq denote a sequence that belongs to poly(Anq , gnq)∩poly(Bq, hq) for all q, and such that

xq → x0. Then

a′j,0x0 = limq→∞

a′j,nqxq ≤ limq→∞

gj,nq = gj,0. (178)

Also, by the convergence of hq, we have that

b′j,0x0 = limq→∞

b′j,qxq ≤ limq→∞

hj,q = hj,0. (179)

Therefore, x0 ∈ poly(A0, g0) ∩ poly(B0, h0).

We also note that for any q, Bq, and hq satisfying (b), condition (c) must also be satisfied.

If not, then there exists a q, an x ∈ poly(Anq , gnq) and a j′ ∈ J(x;Bq, hq) such that bj′,q

cannot be written as a linear combination of aj,nq for j ∈ J(x;Anq , gnq). This implies that

there exists a v such that b′j′,qv > 0 and v ⊥ aj,nq for all j ∈ J(x;Anq , gnq). But then,

x + αv ∈ poly(Anq , gnq) for sufficiently small α, at the same time that b′j′,q(x + αv) > hq.

This contradicts the fact that poly(Anq , gnq) ⊆ poly(Bq, hq). Therefore, (c) holds.

82

We now prove the lemma by finding a subsequence, nq, and sequences Bq and hqthat satisfy conditions (a), (b), and (d)(ii). We first consider An and gn. By the compactness

of the unit circle, let nq be a subsequence so that Anq converges to some A0. Also suppose

gnq converges along the subsequence to some vector g0 ∈ RdA+,∞.

Let J+A denote the subset of 1, ..., dA for which gj,0 > 0, and let J0

A denote the subset

for which gj,0 = 0. Consider AJ0A,0

, which defines a cone in Rdm : poly(AJ0A,0,0) = x ∈ Rdm :

AJ0A,0x ≤ 0. Let S denote the smallest linear subspace of Rdm that contains this cone. Let

the dimension of S be dS. Let JSA be the subset of J0A for which aj,0 ⊥ S for all j ∈ JSA. Let

JNA = 1, ..., dA/JSA.

Next, we define sequences Bq and hq that satisfy conditions (a), (b), and (d)(ii) by

induction on the dimension of S. If dS = 0, then no Bq or hq is required. Condition

(a) is satisfied by the above choice of the subsequence. Condition (b) is satisfied because

poly(Bq, hq) = Rdm for all q. Condition (d)(ii) is satisfied because poly(A0, g0) = 0,and then we can take xq = 0 for all q, which belongs to poly(Anq , gnq) and converges to

x0 = 0 ∈ poly(A0, g0).

If dS > 0, then suppose the conclusion of Lemma 7 holds for all values of the dimension

of S less than dS.

Let Cq = poly(AJSA,nq , gJSA,nq). Let CSq be the projection of Cq onto S. That is, CS

q =

PSx : x ∈ Cq, where PS denotes the projection onto S and MS = I − PS. The fact that

Cq is a polyhedral set (defined by finitely many affine inequalities) implies by Theorem 19.3

in Rockafellar (1970) that CSq is also a polyhedral set. Therefore, there exists a dB1 × dm

matrix of unit vectors in S, B1q and a vector h1

q such that CSq = y ∈ S : B1

qy ≤ h1q. We

note that CSq contains zero, so h1

q ≥ 0. In the special case of dS = dm, CSq = Cq and we let

B1q be the matrix composed of all the non-zero rows of Anq and let h1

q be the corresponding

elements of gnq .

Let nq be a further subsequence so that B1q → B1

0 and h1q → h1

0, where some of the

elements of h10 may be +∞, in which case the convergence holds elementwise. We note that

this construction satisfies conditions (a) and (b) because poly(Anq , gnq) ⊆ Cq ⊆ poly(B1q , h

1q)

for all q, where the second subset holds because B1qx = B1

qMSx+B1qPSx = B1

qPSx ≤ h1q for

all x ∈ Cq because the rows of B1q belong to S and PSx ∈ CS

q .

Let J+B denote the set of j ∈ 1, ..., dB1 for which h1

j,0 > 0, and let J0B denote the set for

which h1j,0 = 0, where h1

j,0 is the jth element of h10. Consider B1

J0B ,0

and AJ0A,0

, which together

define a cone in S: x ∈ S : B1J0B ,0x ≤ 0 and AJ0

A,0x ≤ 0. As before, let S† denote the

smallest linear subspace of S that contains this cone. Let JS†

B denote the set of all j ∈ J0B

for which b1j,0 ⊥ S†. Also let JS

†A denote the set of all j ∈ J0

A for which aj,0 ⊥ S†. Let the

dimension of S† be dS† .

83

If dS† < dS, then the result follows by the induction assumption. In particular, if we let

Aq =

[Anq

B1q

]and gq =

[gnq

h1q

],

then the subspace, S, defined to be the smallest linear subspace containing poly(A0, g0),

is equal to S†. Therefore, there exists a further subsequence, nq, and another matrix of

inequalities, B2q and h2

q such that: (a) B2q → B2

0 and h2q → h2

0, (b) poly(Aq, gq) ⊆ poly(B2q , h

2q)

for all q along the subsequence, and (d)(ii) poly(Aq, gq) ∩ poly(B2q , h

2q) → poly(A0, g0) ∩

poly(B20 , h

20) pointwise. It is easy to see that these conditions imply conditions (a), (b), and

(d)(ii) for the original An and gn along this subsequence, with

Bq =

[B1q

B2q

]and hq =

[h1q

h2q

],

using the fact that poly(Aq, gq) = poly(Anq , gnq) ∩ poly(B1q , h

1q).

Therefore, we only need to show condition (d)(ii) in the case that dS† = dS. In this

case, S = S†, and so JS†

B = ∅ and JS†

A = JSA. Fix x0 ∈ poly(A0, g0) ∩ poly(B10 , h

10). We

show that for every ε > 0 there exists a Q such that for all q ≥ Q there exists a yq ∈poly(Anq , gnq) ∩ poly(B1

q , h1q) such that ‖yq − x0‖≤ 2ε. If true, then this can be used to

construct a sequence satisfying yq → x0, establishing condition (d)(ii).

Fix ε > 0. By Lemma 9, there exists a point, x, in S that satisfies b1j,0′x < h1

j,0 for all

j ∈ 1, ..., dB1, and a′j,0x < gj,0 for all j ∈ JNA . There exists a λ ∈ (0, 1) small enough that

x† = λx + (1 − λ)x0 ∈ B(x0, ε), where B(x0, ε) denotes the closed ball of radius ε around

x0. Note that x† satisfies a′j,0x† < gj,0 for all j ∈ JNA and b1

j,0′x† < hj,0 for all j ∈ 1, ..., dB1.

Therefore, there exists a δ ∈ (0, ε) and a Q such that for all q ≥ Q, and for all x ∈ B(x†, δ),

b1j,q′x < h1

j,q for all j ∈ 1, ..., dB1, and a′j,nqx < gj,nq for all j ∈ JNA . Notice that, for all

q ≥ Q, x† ∈ CSq = y ∈ S : B1

qy ≤ h1q, which means that there exists a yq ∈ Cq such that

x† = PSyq. By Lemma 10 applied to K = x† (where the condition is satisfied because, by

Lemma 11, S = x ∈ Rdm : AJS ,0x ≤ 0), there exists a larger Q such that for all q ≥ Q,

yq ∈ B(x†, δ). Therefore, ‖yq − x0‖≤ 2ε.

Proof of Lemma 8. Fix x ∈ Rdm . The fact that poly(A, g) ⊆ poly(B, h) implies that

µ∗(x;A, g) = µ∗(x; [A;B], [g, h]). Denote the common value by µ∗.

If there does not exist a j ∈ J(x;A, g) such that aj 6= 0, then x = µ∗ and a′jx <

gj for all j /∈ J(x;A, g). Suppose, to reach a contradiction, that there does exist a j ∈J(x; [A;B], [g;h]) such that bj−dA 6= 0. Then, there would exist a point, y, very close to x

(say, y = x + εbj−dA for some ε > 0) such that y /∈ poly(B, h) but y ∈ poly(A, g). This

84

contradicts the assumption that poly(A, g) ⊆ poly(B, h). Therefore, there does not exist a

j ∈ J(x; [A;B], [g;h]) such that bj−dA 6= 0. This implies that, in this case, τj(x,A, g) = 0

for all j ∈ 1, ..., dA and τj(x; [A;B], [g;h]) = 0 for all j ∈ 1, ..., dA + dB. Therefore,

τ(x;A, g) = τ(x; [A;B], [g;h]).

Suppose there does exist a j ∈ J(x;A, g) such that aj 6= 0. Then, the same j can be

used to define τj(x; [A;B], [g;h]) because J(x;A, g) ⊆ J(x; [A;B], [g;h]).

We show that for every j = 1, ..., dB, τj+dA(x; [A;B], [g;h]) ≥ τ(x;A, g). The result holds

trivially if ‖bj‖‖aj‖= b′jaj because then τj+dA(x; [A;B], [g;h]) = ∞. Suppose, to reach a

contradiction, that τj+dA(x; [A;B], [g;h]) < τ(x;A, g). Let τ ∗ = τj+dA(x; [A;B], [g;h]), and

consider two cases.

(i) If τ ∗ = 0, then for some ε > 0, the point t∗ = µ∗ + ε(Idm − aja′j‖aj‖−2)bj belongs to

poly(A, g) but not poly(B, h). To see that t∗ ∈ poly(A, g), note that for all ` ∈ J(x;A, g),

the fact that τ(x;A, g) > 0 implies that a` is collinear with aj. Then, a′`t∗ = a′`µ

∗ = g`.

For all ` /∈ J(x;A, g), a′`µ∗ < g`, so ε can be chosen small enough that a′`t

∗ < g` for all

` /∈ J(x;A, g). To see that t∗ /∈ poly(B, h), note that

b′jt∗ = b′jµ

∗ + ε‖bj‖2−ε(b′jaj)‖aj‖−2= hj + ε‖bj‖2−ε(b′jaj)2‖aj‖−2> hj, (180)

where the second equality follows because τ ∗ = 0 and bj is not collinear with aj (so b′jµ∗ = hj),

and the inequality follows because (b′jaj)2 < ‖aj‖2‖bj‖2. This contradicts the assumption

that poly(A, g) ⊆ poly(B, h). Therefore, in this case, τ ∗ ≥ τ(x;A, g).

(ii) If τ ∗ > 0, then let t∗ = µ∗+ τ ∗(

bj‖bj‖ −

aj‖aj‖

). We show that t∗ belongs to the interior

of poly(A, g) but is on the boundary of poly(B, h). Note that for every ` ∈ 1, ..., dA,

a′`t∗ = a′`µ

∗ + τ ∗(a′`bj‖bj‖

−a′àj‖aj‖

)= a′`µ

∗ +‖aj‖(hj − b′jµ∗)‖aj‖‖bj‖−b′jaj

(a′`bj‖bj‖

−a′àj‖aj‖

). (181)

When a` is collinear with aj, the right hand side of (181) is less than g` because a′`µ∗ ≤ g`,

hj > b′jµ∗ (because τ ∗ > 0), and ‖aj‖‖bj‖> a′jbj (because bj is not collinear with aj). When

a` is not collinear with aj, the right hand side of (181) is less than g` because

‖aj‖(hj − b′jµ∗)(a′`bj‖bj‖

−a′àj‖aj‖

)− (g` − a′`µ∗)(‖aj‖‖bj‖−b′jaj)

<(hj − b′jµ∗)(‖aj‖

(a′`bj‖bj‖

−a′àj‖aj‖

)− (‖aj‖‖a`‖−a′àj)

)≤0, (182)

85

where the first inequality follows because

(g` − a′`µ∗)(‖aj‖‖bj‖−b′jaj) > (hj − b′jµ∗)(‖aj‖‖a`‖−a′àj) (183)

(by the assumption that τ ∗ < τ(x;A, g) ≤ τ`(x,A, g)), and the second inequality follows

because b′jµ∗ ≤ hj and ‖a`‖‖bj‖≥ a′`bj. This shows that t∗ is on the interior of poly(A, g).

We also show that t∗ is on the boundary of poly(B, h) by calculating that b′jt∗ = hj. By a

similar calculation to above, we see that

(b′jt∗ − hj)(‖bj‖‖aj‖−b′jaj)

=‖aj‖(hj − b′jµ∗)(‖bj‖−

b′jaj

‖aj‖

)+ (b′jµ

∗ − hj)(‖bj‖‖aj‖−b′jaj) = 0. (184)

This implies that there exists a point, y, very close to t∗ (say y = t∗+εbj for some ε > 0) such

that y /∈ poly(B, h) but y ∈ poly(A, g). This contradicts the assumption that poly(A, g) ⊆poly(B, h). Therefore, τj+dA(x; [A;B], [g;h]) ≥ τ(x;A, g) for all j = 1, ..., dB.

Lemma 9. Let A be a dA × dm matrix. Let g be nonnegative. Let J+ denote the subset of

1, ..., dA such that gj > 0, and let J0 denote the subset of 1, ..., dA such that gj = 0. Let

S denote the smallest linear subspace containing poly(AJ0 , 0) = x ∈ Rdm : AJ0x ≤ 0. Let

JS be the subset of J0 for which AJS ⊥ S. Let JN = 1, ..., dA/JS. There exists a x ∈ Ssuch that a′jx < gj for all j ∈ JN .

Proof of Lemma 9. First, let M > maxj∈J+‖aj‖, and let ε ∈ (0,minj∈J+gj/M). Then, for

all x ∈ B(0, ε), a′jx < gj for all j ∈ J+, where B(x, ε) denotes the closed ball of radius ε

around x. Also, for every j ∈ JN ∩ J0, x ∈ S : a′jx = 0 defines a subspace of S. We note

that for all j ∈ JN ∩ J0, x ∈ S : a′jx = 0 is a proper subset of S, because otherwise j

would belong to JS. By the definition of S, S ∩ poly(AJN∩J0 , 0) is not contained within any

of these subspaces. In particular, for each j ∈ JN ∩J0, we can find a xj and a neighborhood,

Nj, (relatively open in S) that belongs to S ∩ poly(AJN∩J0,0, 0)/x ∈ S : a′jx = 0. Indeed,

we can consider j ∈ JN ∩ J0 sequentially, and define each neighborhood to be a subset of

the previous one. Therefore, the final xj must belong to S ∩ poly(AJN∩J0,0,0) and satisfy

a′jx < 0 for all j ∈ JN ∩ J0. Take x = λxj, where λ > 0 is small enough that x ∈ B(0, ε).

Then, x satisfies a′jx < gj for all j ∈ JN .

Lemma 10. Let An → A0 and gn → 0, where gn ≥ 0 for all n. Suppose S = x ∈ Rdm :

A0x ≤ 0 is a linear subspace of Rdm. Let S⊥ denote the orthogonal subspace to S in Rdm.

Let PSx denote the projection of x ∈ Rdm onto S and let MSx denote x − PSx. Then, for

86

every K ⊆ S, compact, and for every ε > 0, we have

x ∈ poly(An, gn) : PSx ∈ K, ‖MSx‖≥ ε = ∅ (185)

eventually as n→∞.

Proof of Lemma 10. Suppose that the conclusion of the lemma is not true. Then there

exists a sequence xn ∈ poly(An, gn) and a subsequence nm such that PSxnm ∈ K and

‖MSxnm‖≥ ε for all m ≥ 1. Define the unit vector x⊥nm = MSxnm/‖MSxnm‖. Then, by

the compactness of K and the unit circle, there exists a further subsequence nq such that

PSxnq → xS and x⊥nq → x⊥ for some xS ∈ S and x⊥ ∈ S⊥ as q →∞.

Because x⊥ ∈ S⊥ and x⊥ 6= 0, we know that x⊥ /∈ S = x ∈ Rdm : A0x ≤ 0, and

therefore there exists a j such that

a′j,0x⊥ > 0. (186)

Also, since xS ∈ S, a′j,0xS ≤ 0. Since S is a linear subspace, we have a′j,0(−xS) ≤ 0 as well.

This shows that a′j,0xS = 0 (and more generally, S = x ∈ Rdm : A0x = 0).

Now consider

a′j,nqxnq − gj,nq = a′j,nqPSxnq + a′j,nqMSxnq − gj,nq= o(1) + a′j,0x

S + ‖MSxnq‖(o(1) + a′j,0x⊥)− o(1)

= o(1) + ‖MSxnq‖(o(1) + a′j,0x⊥). (187)

By (186), o(1) + a′j,0x⊥ > 0 eventually. This, combined with ‖MSxnq‖≥ ε implies that

a′j,nqxnq − gj,nq > 0 (188)

eventually. This contradicts the definition of the sequence xn which requires that xn ∈poly(An, gn) for all n.

Lemma 11. Let A be a matrix. Let S be the smallest linear subspace containing C =

poly(A,0). Let J = j : aj ⊥ S. Then, S = poly(AJ ,0).

Proof of Lemma 11. First, notice that if x ∈ S, then x ⊥ aj for all j ∈ J , and therefore,

AJx = 0, so x ∈ poly(AJ ,0).

To go the other way, let x ∈ poly(AJ ,0). Lemma 9 implies that there exists an x ∈ Ssuch that a′jx < 0 for all j ∈ J c, where J c = 1, ..., dA/J . Consider y = x+Mx for M large.

We note that AJy = AJx + MAJ x ≤ 0 since x ∈ poly(AJ ,0) and x ∈ S ⊆ poly(AJ ,0). We

87

also note that for every j ∈ J c, a′jy = a′jx+Ma′jx→ −∞ as M diverges. Thus, there exists

an M large enough that y ∈ poly(A,0). This implies that y ∈ S because poly(A,0) ⊆ S.

This also implies that x = y −Mx ∈ S because S is a linear subspace.

C Supporting Materials for Section 3.2

C.1 Lemmas 12-13 and Their Proofs

Lemma 12. Let B and C be conformable matrices and d be a conformable vector. There

exists a matrix A = A(B,C) and a vector b = b(C, d) such that

δ : Cδ ≥ Bµ− d 6= ∅ ⇔ Aµ ≤ b.

Furthermore, A(B,C) = H(C)B and b(C, d) = H(C)d, where H(C) is the matrix with rows

formed by the vertices of the polyhedron h ∈ Rk : h ≥ 0, C ′h = 0,1′h = 1.

Proof of Lemma 12. By Theorem 2.7 in Gale (1960), δ : Cδ ≥ Bµ − d 6= ∅ is equivalent

to

h′(Bµ− d) ≤ 0 for all h ≥ 0, C ′h = 0. (189)

The equivalence is not affected by adding the scale normalization: 1′h = 1 to (189). Thus,

δ : Cδ ≥ Bµ− d 6= ∅ is equivalent to

h′(Bµ− d) ≤ 0 for all h ∈ H := h ≥ 0 : C ′h = 0,1′h = 1. (190)

Since the rows of H(C) are vertices, and thus elements of H, we have

δ : Cδ ≥ Bµ− d 6= ∅ ⇒ H(C)(Bµ− d) ≤ 0.

Then by the definition of A and b, we have

δ : Cδ ≥ Bµ− d 6= ∅ ⇒ Aµ ≤ b.

Conversely, since the rows of H(C) are vertices ofH, for any h ∈ H, there exists Rk-vector

c ≥ 0 such that H(C)′c = h. Thus, if Aµ ≤ b, we must have h′(Bµ−d) = c′H(C)(Bµ−d) =

c′(Aµ− b) ≤ 0. That this holds for all h ∈ H implies that δ : Cδ ≥ Bµ− d 6= ∅. Thus the

lemma is proved.

88

Lemma 13. r = rk(B′ZH0).

Proof of Lemma 13. Denote BZ , CZ , and dZ by B, C, and d. Let h′1, . . . , h′m1

be all the

rows of H(C) orthogonal to Bµ− d. Then by definition, AJ = [B′h1, . . . , B′hm1 ]′, and thus

rk(AJ) = rk(B′h1, . . . , B′hm1). Since h1, . . . , hm1 ∈ h ≥ 0 : h′C = 0, h′(Bµ − d) = 0,

we have B′h1, . . . , B′hm1 ∈ B′h : h ≥ 0, h′C = 0, h′(Bµ − d) = 0. This implies that

rk(AJ) ≤ rk(B′h : h ≥ 0, h′C = 0, h′(Bµ− d) = 0) = rk(B′H0).

Next, suppose that h1, . . . , hm2 ∈ h ≥ 0 : h′C = 0, h′(Bµ − d) = 0 such that

rk(B′H0) = rk(B′h1, . . . , Bhm2). By the definition of H(C), h1, . . . , hm2 must all be con-

vex combinations of the rows of H(C). In fact, they must all be convex combinations of

h1, . . . , hm1 defined in the first part of the proof because any other row (say, h∗) of H(C)

must satisfy the strict inequality h′∗(Bµ− d) > 0 (since they correspond to the inactive in-

equalities). Consequently, B′h1, . . . , B′hm2 must be convex combinations of B′h1, . . . , B

′hm1 .

This implies that rk(B′H0) ≤ rk(AJ). Therefore, the lemma is proved.

C.2 General Algorithm for Calculating rk(B′ZH0)

We suppress the subscript Z in BZ , CZ , and dZ for notational ease. As discussed in Section

3.2, the rank of the polyhedron B′H0 is the dimension of the smallest linear subspace con-

taining it. That is, its linear span, denoted by span(B′H0). Let G :=(I ′J0

C Bµ− d)

,

where J0 is defined in Section 3.2. From the discussions in Section 3.2, we know that

span(B′H0) = B′h : h ∈ Rk, G′h = 0. (191)

The number of restrictions in G′h = 0 is given by the number of columns of G. However,

some of those restrictions may be redundant or contained in the null space of B′. We do

some linear algebra to calculate span(B′H0).

Let G = (g1, g2, ..., gk)′, where each g′j is the jth row of G. Suppose the first rk(G) rows of

G have rank rk(G). This can be achieved by rearranging the rows of G together with the rows

of B and the elements of h accordingly. Partition G as [G′1, G′2]′, where G1 = (g1, ..., grk(G))

′

and G2 = (grk(G)+1, ..., gk)′. We can then solve the equations, G′h = 0, for the first rk(G)

elements of h as a function of the other elements. Specifically, let h = (h′1, h′2)′ be partitioned

conformably with G. Then, we can solve for h1 = Γh2, where Γ = −(G1G′1)−1G1G

′2.

Then span(H0) can be written as

span(H0) = (h′1, h′2)′ ∈ Rk : h1 = Γh2. (192)

89

Accordingly,

span(B′H0) = B′span(H0) = (B′1Γ +B′2)h2 : h2 ∈ Rk−rk(G), (193)

where B = [B′1, B′2]′ is partitioned conformably with h. This implies that

rk(B′H0) = rk(Γ′B1 +B2).

To end this section, we provide the pseudo-code to calculate rk(B′H0) in Algorithm 3.

This can be plugged in Algorithm 2 in Section 3.2, replacing line 12, to compute the sCC

and the sRCC tests.

Algorithm 3: Pseudo-code for calculating rk(B′H0) when rk(B) < k.

1: %H0 = h ∈ Rk : G′h = 0, where G = (g1, . . . , gk)′ and g′j is the jth row of G.

2:

3: if rk(G) = k then4: rk(B′H0) := 05: else6: if rk(g1, . . . , grk(G)) < rk(G) then7: Rearrange rows of G so that rk(g1, . . . , grk(G)) = rk(G) holds. Rearrange the

elements of h and the rows of B accordingly.8: end if9: G1 := (g1, . . . , grk(G))

′

10: G2 := (grk(G)+1, . . . , gk)′

11: Γ := −(G1G′1)−1G1G

′2

12: rk(B′H0) := rk((Γ′, Ik−rk(G))B).13: end if

D Asymptotic Validity of the Subvector Tests

D.1 General Conditions for Asymptotic Validity

We fix a realization of Zini=1 and denote it by z. Let Fz be a collection of distributions Fz.

The following high-level assumption is sufficient for the uniform asymptotic validity of the

sCC and the sRCC tests. This assumption is the conditional version of Assumption 2.

Assumption 3. The given sequence (Fz,n, θn) : Fz,n ∈ Fz, θn ∈ Θ0(Fz,n)∞n=1 satisfies, for

every subsequence, nm, there exists a further subsequence, nq, and there exists a sequence of

positive definite dm × dm matrices, Dq, such that:

90

(a) Under the sequence Fz,nq∞q=1,

√nqD

−1/2q (mnq(θnq)− EFz,nqmnq(θnq))→d N(0,Ω), (194)

for a positive definite correlation matrix Ω, and

‖D−1/2q Σnq(θnq)D

−1/2q − Ω‖→p 0. (195)

(b) Let A(θ) and b(θ) be defined in Lemma 12. ΛqA(θnq)Dq → A0 for some dA × dm

matrix A0, and for every J ⊆ 1, ..., dA, rk(IJA(θnq)Dq) = rk(IJA0), where Λq is the

diagonal dA×dA matrix whose jth diagonal entry is one if e′jA(θnj) = 0 and ‖e′jA(θnq)Dq‖−1

otherwise.

The following corollary of Theorem 3 shows the asymptotic properties of the sRCC test.

Corollary 2. (a) Suppose Assumption 3(a) holds for all sequences (Fz,n, θn) : Fz,n ∈Fz, θn ∈ Θ0(Fz,n)∞n=1. Then,

limsupn→∞

supFz∈Fz

supθ∈Θ0(Fz)

EFz(φsRCCn (θ, α)|z) ≤ α.

Next consider a sequence (Fz,n, θn) : Fz,n ∈ Fz, θn ∈ Θ0(Fz,n)∞n=1 satisfying Assumption

3(a,b).

(b) If, along any further subsequence, for all j = 1, ..., dA,√nqe

′jΛq(A(θnq)EFz,nqmnq(θnq)−

b(θnq))→ 0, and if A0 6= 0dA×dm, then

limn→∞

EFz,nφsRCCn (θn, α) = α.

(c) If, for J ⊆ 1, ..., dA, along any further subsequence,√nqe

′jΛq(A(θnq)EFz,nqmnq(θnq)−

b(θnq))→ −∞ as q →∞, for all j /∈ J , then

limn→∞

PrFz,n(φRCCn (θn, α) 6= φRCC

n,J (θn, α)) = 0.

Remark. Corollary 2 follows from Theorem 3 because Θ0(Fz,n) has the equivalent repre-

sentation

Θ0(Fz,n) = θ ∈ Θ : A(θ)EFz,n [mn(θ)|z] ≤ b(θ), (196)

by Lemma 12. There is one subtle point: A(θ) = H(Cz(θ))Bz(θ) might change dimension

because H(Cz) might change dimension with Cz, and Cz might change with the sample size.

91

But this does not cause a problem because the dimension of H(Cz) and thus that of A(θ) is

bounded by a function of k and p which does not change with the sample size.40 Due to this

boundedness, for any subsequence of n we can always find a further subsequence along

which the dimension of A(θ) does not change. Then the problem falls into the framework of

Theorem 3.

D.2 Primitive Conditions under i.i.d. Sampling

Now we assume that Wini=1 is an i.i.d. sample unconditionally and derive primitive con-

ditions for Assumption 3(a). Let the conditional distribution of Wi given Zi = zi be

represented by the mapping: F| : zi 7→ F|zi . Let F| denote a collection of F| and let

Fz = ×ni=1F|zi : F| ∈ F|, where ×ni=1F|zi denotes the joint distribution whose marginal

distributions are independent F|zi . The following assumption is sufficient for (194) in As-

sumption 3. In the assumption σ2j|z(θ) := n−1

∑ni=1 VarF|zi (mj(Wi, θ)|zi) and

D|z(θ) = diag(σ21|z(θ), . . . , σ

2dm|z(θ)). (197)

Let eigmin(V ) denote the minimum eigenvalue of a matrix V .

Assumption 4. There exists an M0 < ∞ and an ε0 > 0 such that for all F| ∈ F|, the

following hold.

(a) σ2j|z(θ) > 0 for all j = 1, . . . , dm, θ ∈ Θ, and for all n.

(b) n−1∑n

i=1 EF|zi ((mj(Wi, θ)/σj|z(θ))4|zi) < M0 for all j, all θ ∈ Θ, and for all n.

(c) eigmin(n−1∑n

i=1[VarF|zi (D−1/2|z (θ)m(Wi, θ)|zi)]) > ε0 for all θ ∈ Θ and for all n.

Remark. Part (b) requires m(Wi, θ) to have finite 4th moment conditional on Zi = zi.

This is used both to derive the asymptotic normality of mn(θ) using the Lindeberg-Feller

central limit theorem under the sequence of Fz,n, and to show the consistency of the average

conditional variance estimator Σn(θ). Part (c) requires that the average conditional variance

ofm(Wi, θ) be invertible uniformly over θ and F| ∈ F|. This is required since we use the quasi-

likelihood ratio statistic which involves inverting an estimator of the average conditional

variance.

When the nearest neighbor matching variance estimator in (39) is used, the following

additional assumption is used for consistency.

Assumption 5. (a) zi∞i=1 is a bounded sequence of distinct values.

40This fact is know as the McMullen’s upper bound theorem. See e.g. Section 8.4 of Ziegler (1995).

92

(b) ΣZ,n → ΣZ where ΣZ is finite positive definite matrix.

(c) There exist Mg > 0 and MV > 0 such that for all θ ∈ Θ and F| ∈ F| the conditional

mean and variance, EF|zi [D|z(θ)−1/2m(Wi, θ)|zi] and V arF|zi (D|z(θ)

−1/2m(Wi, θ)|zi), are

Lipschitz continuous in zi with Lipschitz constants Mg and MV , respectively.

Remark. The boundedness part of part (a) is used to show that zi and its nearest neighbor

get close to each other on average as n→∞. This can be guaranteed by pre-normalizing Zi

before applying the matching estimator. For example, if the raw conditioning variable Zi is

supported in (0,∞), one can let Zi = Φ(Zi) where Φ(·) is the standard normal cumulative

distribution function. This and the Lipschitz continuity in part (c) together ensure that the

nearest neighbor provides the correct information about the conditional variance in the limit.

The distinct values required by part (a) ensures that each point can be the nearest neighbor

to at most a uniformly bounded number of other points. This holds with probability one if

Zi has no probability mass on any single point. It can be made to hold by adding a tiny

continuous noise to Zi when Zi has repeated values. The noise should be set small enough

to be a tie breaker only in the nearest neighbor calculation. Part (b) of the assumption can

be established for a probability-one set of Zi values by the strong law of large numbers.

The following theorem verifies Assumption 3(a).

Theorem 4. (a) Assumption 4 implies (194) in Assumption 3 for all sequences (Fz,n, θn) :

Fz,n ∈ Fz, θn ∈ Θ0(Fz,n)∞n=1.

(b) If zini=1 contains at least two instances of each value eventually as n → ∞, and

Assumption 4 holds, then (195) holds for Σn(θ) defined in (37), for all sequences

(Fz,n, θn) : Fz,n ∈ Fz, θn ∈ Θ0(Fz,n)∞n=1.

(c) If Assumptions 4 and 5 hold, then (195) holds for Σn(θ) defined in (39), for all se-

quences (Fz,n, θn) : Fz,n ∈ Fz, θn ∈ Θ0(Fz,n)∞n=1.

D.3 Proof of Theorem 4

(a) Let (Fz,n, θn) : Fz,n ∈ Fz, θn ∈ Θ0(Fz,n) be an arbitrary sequence. Let F|zi,n denote the

conditional distribution of Wi given Zi = zi implied by Fz,n. Let σ2j|z,n(θ) and D|z,n(θ) be

defined just like σ2j|z(θ) and D|z(θ) except with F|zi replaced by F|zi,n. Let Dn = D|z,n(θn).

Then Dn is positive definite for every n by Assumption 4(a).

Let Ωn = D−1/2n n−1

∑ni=1 VarF|zi,n(m(Wi, θn)|zi)D−1/2

n . Algebra shows that the square of

93

the (j, `)th element of Ωn is bounded by

2n−1

n∑i=1

EF|zi,n

[(mj(Wi, θn)

σj|z,n(θn)

)4

|zi

]+ 2n−1

n∑i=1

EF|zi,n

[(m`(Wi, θn)

σj|z,n(θn)

)4

|zi

], (198)

which is bounded by 4M0 by Assumption 4(a). Thus vec(Ωn) ∈ [0, 4M0]d2m , which is a

compact set. This implies that a subsequence nq can be found for any subsequence of nsuch that Ωnq → Ω∞. Furthermore, Assumption 4(c) implies that Ω∞ is positive definite.

It remains to verify the Lindeberg condition for the Lindeberg-Feller central limit theorem

(CLT) along the subsequence nq. Let a be an arbitrary real vector on the unit sphere in

Rdm . Let

mn,i(θ) = a′D−1/2n (m(Wi, θ)− EF|zi,n [m(Wi, θ)|zi]). (199)

Let

s2q = nq

−1

nq∑i=1

EF|zi,nq[mnq ,i(θnq)

2|zi]. (200)

For an arbitrary ε > 0, consider the derivation,

nq∑i=1

n−1q s−2

q EF|zi,nq [mnq ,i(θnq)21n−1

q s−2q mnq ,i(θnq)

2 > ε|zi]

≤ n−2q s−4

q ε−1

nq∑i=1

EF|zi,nq [mnq ,i(θnq)4|zi]

≤ 16n−2q s−4

q ε−1

nq∑i=1

EF|zi,nq [(a′D−1/2

nq m(Wi, θnq))4|zi]

≤ 16n−2q s−4

q ε−1

nq∑i=1

EF|zi,nq [‖D−1/2nq m(Wi, θnq)‖4|zi]

= O(n−1q s−4

q ε−1)

→ 0, as q →∞, (201)

where the first inequality holds because 1(x > ε) ≤ xε

for any x ≥ 0, the second inequality

holds because E[(X−E(X))4] ≤ 16E[X4], the third inequality holds by the Cauchy-Schwarz

inequality and ‖a‖= 1, the equality holds by Assumption 4(b), and the convergence holds

because s2q → a′Ω∞a by the definition of the subsequence nq. Therefore, the Lindeberg

condition holds and the CLT applies, proving part (a).

(b) Note that Σn(θ) is the weighted average of the standard sample variance estimator

94

within subsamples with same zi values. Thus, by standard argument, we have

EFz,n [Σn(θn)|z] =∑`∈Z

n`n

VarF|`,n(m(Wi, θn)|`) =1

n

n∑i=1

VarF|zi,n(m(Wi, θn)|zi), (202)

where the second equality holds by rearranging terms. Thus,

EFz,n [D−1/2n Σn(θn)D−1/2

n |z] = Ωn. (203)

Also by standard calculation, the (j, j′) element of D−1/2n Σn(θn)D

−1/2n has a conditional

variance given z:

1

n2

n∑i=1

VarF|zi,n

(mj(Wi, θn)mj′(Wi, θn)

σj|z,n(θn)σj′|z,n(θn)|zi)

+1

n2

n∑i=1

ω2j|zi,n(θn)ω2

j′|zi,n(θ) + ωjj′|zi,n(θn)2

nzi − 1,

(204)

where ωj|zi,n(θ) = VarF|zi,n

(mj(Wi,θ)

σj|z,n(θ)|zi)

and ωjj′|zi,n(θ) = CovF|zi,n

(mj(Wi,θ)

σj|z,n(θ),mj′ (Wi,θ)

σj′|z,n(θ)|zi)

. By

standard algebraic manipulation, we have

VarF|zi,n

(mj(Wi, θn)mj′(Wi, θn)

σj|z,n(θn)σj′|z,n(θn)|zi)≤ 1

2(Mji +Mj′i), and

ω2j|zi,n(θn)ω2

j′|zi,n(θ) + ωjj′|zi,n(θn)2 ≤Mji +Mj′i (205)

where Mji = EF|zi,n

[(mj(Wi,θn)

σj|z,n(θn)

)4

|zi]. Therefore, by Assumption 4 and the additional

assumption that nzi ≥ 2 for all i, we have that the expression in (204) is bounded by1n(M0 + 2M0), which converges to zero as n→∞. This proves part (b).

(c) First, we prove that

n−1

n∑i=1

‖zi − z`Z(i)‖2→ 0. (206)

To begin, define zi = Σ−1/2Z,n zi. By Assumption 5(b), Σ

−1/2Z,n → Σ

−1/2Z as n→∞ and this limit

is finite. Thus, Σ−1/2Z,n is uniformly bounded over all large enough n. This and Assumption

5(a) together imply that the elements of the array z1, . . . , znn≥1 are chosen from a bounded

set. Then Lemma 1 of Abadie and Imbens (2008) applies directly and implies that

n−1

n∑i=1

‖zi − z`Z(i)‖2→ 0. (207)

95

Consider the derivation

n−1

n∑i=1

‖zi − z`Z(i)‖2 = n−1

n∑i=1

(zi − z`Z(i))′ΣZ,n(zi − z`Z(i))

≤ n−1

n∑i=1

‖zi − z`Z(i)‖2eigmax(ΣZ,n)

→ 0, (208)

where eigmax(·) stands for maximum eigenvalue and the convergence holds by (207) and

Assumption 5(b). This proves (206).

Next consider an arbitrary unit vector a in Rdm , let

s2n,i(θ) = a′D−1/2

n (m(Wi, θ)−m(W`Z(i), θ))(m(Wi, θ)−m(W`Z(i), θ))′D−1/2

n a. (209)

Then a′D−1/2n Σn(θ)D

−1/2n a = 1

2n

∑ni=1 s

2n,i(θn). Since a is arbitrary, it suffices to show that

for any subsequence of n there exists a further subsequence nq such that

1

2nq

nq∑i=1

s2nq ,i(θnq)→p a

′Ω∞a. (210)

as q →∞.

Let mn,i(θ) be defined in the proof of part (a). Then

EFz,n [s2n,i(θn)|z]

= EFz,n [a′(mn,i(θn)− mn,`Z(i)(θn) + ∆ni)(mn,i(θn)− mn,`Z(i)(θn) + ∆ni)′a|z]

= a′EF|zi,n [mn,i(θn)mn,i(θn)′|zi]a+ a′EF|z`Z (i),n[mn,`Z(i)(θn)mn,`Z(i)(θn)′|z`Z(i)]a+ a′∆ni∆

′nia

= 2a′D−1/2n VarF|zi,n [m(Wi, θn)|zi]D−1/2

n a+ a′∆Vnia+ a′∆ni∆

′nia, (211)

where ∆ni = EF|zi,n [D−1/2n m(Wi, θn)|zi]− EF|z`Z (i),n

[D−1/2n m(W`Z(i), θn)|z`Z(i)], and

∆Vni = VarF|z`Z (i),n

[D−1/2n m(W`Z(i), θn)|z`Z(i)]− VarF|zi,n [D−1/2

n m(Wi, θn)|zi].

By Assumption 5(c) we have,

‖∆ni‖≤Mg‖zi − z`Z(i)‖ and ‖∆Vni‖≤MV ‖zi − z`Z(i)‖. (212)

96

Thus,

n−1

n∑i=1

a′∆ni∆′nia ≤ n−1

n∑i=1

‖∆ni‖2≤ n−1

n∑i=1

M2g ‖zi − z`Z(i)‖2→ 0, (213)

and

n−1

n∑i=1

a′∆Vnia ≤ n−1

n∑i=1

‖∆Vni‖≤ n−1

n∑i=1

MV ‖zi − z`Z(i)‖

≤MV

√√√√n−1

n∑i=1

‖zi − z`Z(i)‖2 → 0. (214)

For an arbitrary subsequence of n, consider a further subsequence nq such that Ωn → Ω∞.

Such a further subsequence always exists by the proof of part (a). Then as q →∞,

n−1q

nq∑i=1

2a′VarF|zi,nq [D−1/2nq m(Wi, θnq)|zi]a→ 2a′Ω∞a. (215)

Combining (211), (213), (214), and (215), we have

EFz,nq [a′D−1/2

nq Σnq(θnq)D−1/2nq a|z] =

1

2nq

nq∑i=1

EFz,nq [s2nq ,i(θn)|z]→ a′Ω∞a. (216)

Now it suffices to show that

EFz,n

(n−1

n∑i=1

(s2n,i(θn)− EFz,n [s2

n,i(θn)|z]))2

|z

→ 0. (217)

Let εi(θ) = a′mn,i(θ) and σ2i (θ) = a′VarF|zi,n(D

−1/2n m(Wi, θ|zi))a = EF|zi,n [ε2

i (θ)|zi]. Consider

n−1

n∑i=1

(s2n,i(θn)− EFz,n [s2

n,i(θn)|z])

= n−1

n∑i=1

(ε2i (θn)− σ2

i (θn))

+ n−1

n∑i=1

(ε2`Z(i)(θn)− σ2

`Z(i)(θn))

+ 2n−1

n∑i=1

(a′∆ni)εi(θn)

− 2n−1

n∑i=1

(a′∆ni)ε`Z(i)(θn)

97

+ 2n−1

n∑i=1

εi(θn)ε`Z(i)(θn). (218)

Clearly, all the summands on the right-hand side have conditional expectation zero. Now we

show that the conditional variance (which then is the conditional second moment) of each

of them converges to zero.

For the first summand on the right-hand side of (218), we have

EFz,n

(n−1

n∑i=1

(ε2i (θn)− σ2

i (θn))

)2

|z

=1

n2

n∑i=1

VarF|zi,n(ε2i (θn)|zi)

≤ 1

n2

n∑i=1

EF|zi,n [ε4i (θn)|zi]

≤ 16

n2

n∑i=1

EF|zi,n [(a′D−1/2n m(Wi, θn))4|zi]

≤ 16

n2

n∑i=1

EF|zi,n [‖D−1/2n m(Wi, θn)‖4|zi]

→ 0, (219)

where the convergence holds by Assumption 4(b). For the second summand on the right-hand

side of (218), we have

EFz,n

(n−1

n∑i=1

(ε2`Z(i)(θn)− σ2

`Z(i)(θn))

)2

|z

=

1

n2

n∑i=1

EFz,n [(ε2`Z(i)(θn)− σ2

`Z(i)(θn)))2 |z]

+2

n2

n∑i=1

n∑j=i+1

EFz,n [(ε2`Z(i)(θn)− σ2

`Z(i)(θn)) (ε2`Z(j)(θn)− σ2

`Z(j)(θn))|z]

≤ L+ 2L2

n2

n∑i=1

EF|zi,n [(ε2i (θn)− σ2

i (θn)))2 |zi]→ 0, (220)

where L is the maximum number of times a j is `Z(i) for some i. This number is bounded

by 3dz−1, which does not depend on n (see e.g. Zeger and Gersho (1994)). The convergence

holds by (219).

98

For the third summand in (218), we have

EFz,n

(n−1

n∑i=1

a′∆niεi(θn)

)2

|z

=1

n2

n∑i=1

(a′∆ni)2EF|zi,n [ε2

i (θn)|zi]

≤ MgB

n2

n∑i=1

EF|zi,n [ε2i (θn)|zi]

≤ MgB

n2

n∑i=1

(1 + EF|zi,n [ε4i (θn)|zi])

→ 0, (221)

where B is the maximum distance of two points in the sequence zini=1, which is bounded

by Assumption 5(a), the first inequality holds by Assumption 5(c), the second inequality

holds by x2 ≤ (max(1, |x|))2 ≤ max1, x4 ≤ 1 + x4 and the convergence holds by (219).

For the fourth summand in (218), we have

EFz,n

(n−1

n∑i=1

a′∆niε`Z(i)(θn)

)2

|z

=1

n2

n∑i=1

(a′∆i(θn))2EF|z`Z (i),n[ε2`Z(i)(θn)|z`Z(i)]

≤ MgB

n2

n∑i=1

EFz`Z (i),n[ε2`Z(i)(θn)|z`Z(i)]

≤ MgB

n2

n∑i=1

(1 + EF|z`Z (i),n[ε4`Z(i)(θn)|z`Z(i)])

≤ MgLB

n2

n∑i=1

(1 + EF|zi,n [ε4i (θn)|zi])

→ 0, (222)

where L is number discussed below (220).

For the fifth summand on the right-hand side of (218), we have

EFz,n

(n−1

n∑i=1

εi(θn)ε`Z(i)(θn)

)2

|z

=

1

n2

n∑i=1

EFz,n [ε2i (θn)ε2

`Z(i)(θn)|z]

+2

n2

n∑i=1

n∑j=i+1

EFz,n [εi(θn)ε`Z(i)(θn)εj(θn)ε`Z(j)(θn)|z]

99

≤ 1

n2

n∑i=1


`Z(i)(θn)|z] +L

n2

n∑i=1


`Z(i)(θn)|z]

≤ 1 + L

2n2

n∑i=1

EFz,n [ε4i (θn) + ε4

`Z(i)(θn)|z]

→ 0, (223)

where the first inequality holds because EFz,n [εi(θn)ε`Z(i)(θn)εj(θn)ε`Z(j)(θn)|z] is nonzero only

when j = `Z(i) and `Z(j) = i and this occurs at most L times for each i, the second inequality

holds by 2xy ≤ x2 + y2, and the convergence holds by (219) and the last two lines of (222).

Combining (218)-(223), we have that (217) holds, which then proves part (c).

E Numerical Details for Section 5.2

E.1 Calculation of the Identified Set

Let YU denote log(sN,i + 2/N)− log(1− sN,i + s) and let YL denote log(sN,i + s)− log(1−sN,i + 2/N). For every θ0 in the identified set, there exists a δ = (δ1, δ2)′ ∈ R2 such that for

all z = (zc, ze)′ ∈ 0, 12,

E[YU |z]− E[X|z]θ0 ≥ δ1 + δ2zc ≥ E[YL|z]− E[X|z]θ0. (224)

The identified set for θ0 can be solved via two linear programming problems once E[YU |z],

E[YL|z], and E[X|z] are calculated. Note that

E[X|z] = E[12ze + ε ≥ 0] = Φ(2ze). (225)

We also need to calculate E[YU |z] and E[YL|z]. Let

`(s,N, c) =N∑i=0

(Ni ) si(1− s)N−i log(i+ c). (226)

Then

E[YU |z] = E[`(s∗, N, 2)− `(1− s∗, N,Ns)|z]

E[YL|z] = E[`(s∗, N,Ns)− `(1− s∗, N, 2)|z], (227)

100

where s∗ = exp(−12ze+ε>0−zc+ε)1+exp(−12ze+ε>0−zc+ε) . The conditional expectations can then be calculated by

simulating a large number of ε draws. After obtaining, E[YU |z] and E[YL|z], we use linear

programming based on (224) to calculate the upper and the lower bound for θ0. We find the

bounds to be [−1.203,−0.757].

E.2 Bisection Algorithm for Calculating Confidence Sets

We compute the confidence sets for θ0 implied by the sRCC, sCC, and the ARP hybrid tests

by following the same steps:

1. Find a point in the confidence set. Let this point be denoted θ0. Methods to find such

a point are given below.

2. Find a point θL < θ0 that is outside the identified set. This is done by checking whether

θ0 − j is rejected for j = 1, 2, ... iteratively until a rejected point is found.

3. Use bisection to find a point between θ0 and θLB that is a boundary point of the

confidence set. Specifically, check whether θM = (θ0 + θLB)/2 is rejected or not. If yes,

check the midpoint between θM and θ0, and if not, check the midpoint between θM

and θLB. Continue this way until the desired accuracy is reached. The accuracy that

we use is 0.00049, yielding CI endpoints that are accurate to the third digit.

4. Let the last rejected point from Step 2 be denoted θL. This is the computed lower

endpoint of the confidence interval.

5. Follow analogous steps to find the upper endpoint θU .

6. Finally, the confidence interval is [θL, θU ].

Remarks. (1) The bisection algorithm is much more efficient than grid search in finding

the endpoints accurately. However there is one caveat: the confidence set from inverting

either the sRCC, the sCC, or the ARP hybrid test is not guaranteed to be an interval.

The algorithm above always yields an interval. This interval sometimes is the confidence

set itself, sometimes is the convex hull, and sometimes is a sub-interval of the confidence

set. In our experience, however, the difference between the bisection confidence interval

and the test-inversion confidence set is rarely big. The coverage probability of the bisection

confidence interval is in fact slightly higher than the test-inversion confidence set (computed

via grid-search).

(2) The initial point in the confidence set for the sRCC and sCC tests are obtained using

iterated generalized method of moments (GMM). To describe the iterated GMM procedure,

101

first note that the moment inequality model of the subvector simulation example is of the

form:

YZ −XZθ0 − CZδ ≤ 0, (228)

where

YZ =(

E[−n−1∑ni=1 YU,iI(Zi)|Zi]

E[n−1∑ni=1 YL,iI(Zi)|Zi]

),

XZ =(

E[−n−1∑ni=1 XiI(Zi)|Zi]

E[n−1∑ni=1 XiI(Zi)|Zi]

), and

CZ =(−n−1

∑ni=1 I(Zi)Z

′ci]

n−1∑ni=1 I(Zi)Z

′ci]

).

Let YZ =(−n−1

∑ni=1 YU,iI(Zi)

n−1∑ni=1 YL,iI(Zi)]

)and XZ =

(−n−1

∑ni=1 XiI(Zi)

n−1∑ni=1XiI(Zi)

). Let ΣZ(θ) be an estimator of

the conditional variance of√n(YZ −XZθ) given Zini=1 as a function of θ.

Let ΣZ = ΣZ(0), and let

(θ1, δ′, µ′)′ = arg min

θ,δ′,µ′:µ≤0(YZ −XZθ − CZδ − µ)′Σ−1

Z (YZ −XZθ − CZδ − µ)′. (229)

If θ1 is not rejected by the sCC (sRCC) test, let θ0 = θ1. Otherwise, update ΣZ with ΣZ(θ1)

and repeat (229). Iterate until either (1) a point that is not rejected is found, or (2) the

update to the GMM estimator of θ is small, or (3) a maximum number of iteration is reached.

If the iteration ended in either (2) or (3), let the confidence interval be the singleton that is

the last point checked. In our simulations, (2) and (3) never occurred. In the majority of

the times, θ1 is already not rejected and no iteration is needed.

(3) The initial point in the confidence set for the ARP hybrid test is obtained in a similar

iterative approach, except that instead of using GMM, we minimize the maximum violation

of the inequalities. That is, (229) is replaced by

(θ1, δ′, τ)′ = arg min

θ,δ′,τ :τ≥0τ s.t.

D−1/2Z (YZ −XZθ0 + CZδ) ≤ τ1, (230)

where DZ is the diagonal matrix sharing diagonal elements with ΣZ .

102

References

Abadie, A. and Imbens, G. W. (2008). Estimation of the conditional variance in paired

experiments. Annales d’Economie et de Statistique, No. 91/92, Econometric Evaluation

of Public Policies: Methods and Applications (July-December 2008):175–187.

Andrews, D., Cheng, X., and Guggenberger, P. (2020). Generic results for establishing the

asymptotic size of confidence sets and tests. Journal of Econometrics.

Andrews, D. and Guggenberger, P. (2009). Validity of subsampling and “plug-in asymptotic”

inference for parameters defined by moment inequalities. Econometric Theory, 25:669–709.

Andrews, D. and Soares, G. (2010). Inference for parameters defined by moment inequalities

using generalized moment selection. Econometrica, 78:119–157.

Aubin, J.-P. and Frankowska, H. (1990). Set-Valued Analysis. Birkhauser.

Gale, D. (1960). The Theory of Linear Economic Models. McGraw-Hill Book Company, 1

edition.

Luenberger, D. (1969). Optimization by Vector Space Methods. John Wiley & Sons, Inc.

Mohamad, D. A., van Zwet, E. W., Cator, E. A., and Goeman, J. J. (2020). Adaptive critical

value for constrained likelihood ratio testing. Biometrica, 107:677–688.

Rockafellar, R. (1970). Convex Analysis. Princeton University Press.

van der Vaart, A. and Wellner, J. (1996). Weak Convergence and Empirical Processes.

Springer.

Zeger, K. and Gersho, A. (1994). Number of nearest neighbors in a euclidean code. IEEE

Transactions on Information Theory, 40:1647–1649.

Ziegler, G. (1995). Lectures on Polytopes, Graduate Texts in mathematics, vol. 152. Springer,

Berlin.

103

Simple Adaptive Size-Exact Testing for Full-Vector and ...

Documents