Estimating Demand for Differentiated Products with Zeroes in
Market Share Data∗
Amit Gandhi
UW-Madison
Microsoft
Zhentong Lu
SUFE
Xiaoxia Shi †
UW-Madison
April 18, 2017
Abstract
In this paper we introduce a new approach to estimating differentiated product demand
systems that allows for products with zero sales in the data. Zeroes in demand are a common
problem in product differentiated markets, but fall outside the scope of existing demand es-
timation techniques. Our solution to the zeroes problem is based on constructing bounds for
the conditional expectation of the inverse demand. These bounds can be translated into
moment inequalities that are shown to yield a consistent and asymptotically normal point estimator
for demand parameters under natural conditions for differentiated product markets. In Monte
Carlo simulations, we demonstrate that the new approach works well even when the fraction of
zeroes is as high as 95%. We apply our estimator to supermarket scanner data and find price
elasticities become on the order of twice as large when zeroes are properly controlled.
Keywords: Demand Estimation, Differentiated Products, Profile, Measurement Error, Moment Inequality. JEL: C01, C12, L10, L81.

1 Introduction

In this paper we introduce a new approach to differentiated product demand estimation that allows
for zeroes in empirical market share data. Such zeroes are a highly prevalent feature of demand in
a variety of empirical settings, ranging from workhorse scanner retail data, to data as diverse as
∗ A previous version of this paper was circulated under the title “Estimating Demand for Differentiated Products with Error in Market Shares.”
† We are thankful to Steven Berry, Jean-Pierre Dubé, Philip Haile, Bruce Hansen, Ulrich Müller, Aviv Nevo, Jack Porter, and Chris Taber for insightful discussions and suggestions. We would also like to thank the participants at the MIT Econometrics of Demand Conference, the Chicago-Booth Marketing Lunch, the Northwestern Conference on “Junior Festival on New Developments in Microeconometrics,” the Cowles Foundation Conference on “Structural Empirical Microeconomic Models,” and the 3rd Cornell–Penn State Econometrics & Industrial Organization Workshop, as well as seminar participants at Wisconsin-Madison, Wisconsin-Milwaukee, Cornell, Indiana, Princeton, NYU, Penn, and the Federal Trade Commission for their many helpful comments and questions.
homicide rates and international trade flows (we discuss these examples in further depth below).
Zeroes naturally arise in “big data” applications which allow for increasingly granular views of
consumers, products, and markets (see for example Quan and Williams (2015), Nurski and Verboven
(2016)). Unfortunately, the standard estimation procedures following the seminal Berry, Levinsohn,
and Pakes (1995) (BLP for short) cannot be used in the presence of zero empirical shares - they
are simply not well defined when zeroes are present. Furthermore, ad hoc fixes to market zeroes
that are sometimes used in practice, such as dropping zeroes from the data or replacing them with
small positive numbers, are subject to biases which can be quite large (discussed further below).
This has left empirical work on demand for differentiated products without satisfying solutions to
the zero shares problem, which is the key void our paper aims to fill.
In this paper we provide an approach to estimating differentiated product demand models that
provides consistency (and asymptotic normality) for demand parameters despite a possibly large
presence of market zeroes in the data. We first isolate the econometric problem caused by zeroes
in the data. The problem, we show, is driven by the wedge between choice probabilities, which
are the theoretical outcome variables predicted by the demand model, and market shares, which
are the empirical revealed preference data used to estimate choice probabilities. Although choice
probabilities are strictly positive in the underlying model, market shares are often zero if choice
probabilities are small. The root of the zeroes problem is that substituting market shares (or
some other consistent estimate) for choice probabilities in the moment conditions that identify the
model, which is the basis for the traditional estimators, will generally lead to asymptotic bias.
While this bias is assumed away in the traditional approach, it cannot be avoided whenever zeroes
are prevalent in the data.
Our solution to this problem is to construct a set of moment inequalities for the model, which
are by design robust to the sampling error in market shares - our moment inequalities will hold at
the true value of the parameters regardless of the magnitude of the measurement error in market
shares for choice probabilities. Despite taking an inequality form, we use these moment inequalities
to form a GMM-type point estimator based on minimizing the deviations from the inequalities.
We show this estimator is consistent so long as there is a positive mass of observations whose
latent choice probabilities are bounded sufficiently away from zero, e.g., products for which market
shares are not likely to be zero. This is natural in many applications (as illustrated in Section 2),
and strictly generalizes the restrictions on choice probabilities for consistency under the traditional
approach. Asymptotic normality then follows by adapting arguments from censored regression
models by Khan and Tamer (2009).
Computationally, our estimator closely resembles the traditional approach with only a slight
adjustment in how the empirical moments are constructed. In particular it is no more burdensome
than the usual estimation procedures for BLP and can be implemented using either the standard
nested fixed point method of the original BLP, or the MPEC method as advocated more recently
by Dubé, Fox, and Su (2012).
We investigate the finite sample performance of the approach in a variety of mixed logit ex-
amples. We find that our estimator works well even when the fraction of zeroes is as high as
95%, while the standard procedure with the observations with zeroes deleted yields severely biased
estimators even with mild or moderate fractions of zeroes.
We apply our bounds approach to widely used scanner data from the Dominick's Finer Foods
(DFF) retail chain. In particular, we estimate demand for the tuna category as previously studied
by Chevalier, Kashyap, and Rossi (2003) and continued by Nevo and Hatzitaskos (2006) in the
context of testing the loss leader hypothesis of retail sales. We find that controlling for products
with zero demand using our approach gives demand estimates that can be more than twice as
elastic as standard estimates that select out the zeroes. We also show that the estimated price
elasticities do not increase during Lent, which is a high demand period for this product category,
after we control for the zeroes. Both of these findings have implications for reconciling the loss-
leader hypothesis with the data.
The plan of the paper is the following. In Section 2, we illustrate the stylized empirical pattern
of Zipf’s law where market zeroes naturally arise. In Section 3, we describe our solution to the
zeroes problem using a simple logit setup without random coefficients to make the essential matters
transparent. In Section 4, we introduce our general approach for discrete choice models with random
coefficients. Section 5 and 6 present results of Monte Carlo simulations and the application to the
DFF data, respectively. Section 7 concludes.
2 The Empirical Pattern of Market Zeroes
In this section we highlight some empirical patterns that arise in applications where the zero shares
problem arises, which will also help to motivate the general approach we take to it in the paper.
Here we will primarily use workhorse store level scanner data to illustrate these patterns. It is
this same data that will also be used for our empirical application. However we emphasize that
our focus here on scanner data is only for the sake of a concrete illustration of the market zeroes
problem - the key patterns we highlight in scanner data are also present in many other economic
settings where demand estimation techniques are used (discussed further below and illustrated in
the Appendix).
We employ here a widely studied store level scanner data from the Dominick’s Finer Foods
grocery chain, which is public data that has been used by many researchers.1 The data comprises
93 Dominick’s Finer Foods stores in the Chicago metropolitan area over the years from 1989 to
1997. Like other store level scanner data sets, this data set provides demand information (price,
sales, marketing) at a store/week/UPC level, where a UPC (universal product code) is a unique
1 For a complete list of papers using this data set, see the website of Dominick's Database: http://research.chicagobooth.edu/marketing/databases/dominicks/index.aspx
bar code that identifies a product.2
Table 1 presents information on the resulting product variety across the different product
categories in the data. The first column shows the number of products in an average store/week - the
number of UPC’s can be seen varying from roughly 50 (e.g., bath tissue) to over four hundred
(e.g., soft drinks) within even these fairly narrowly defined categories. Thus there is considerable
product variety in the data. The next two columns illustrate an important aspect of this large
product variety: there are often just a few UPC’s that dominate each product category whereas
most UPC’s are not frequently chosen. The second column illustrates this pattern by showing the
well known “80/20” rule prevails in our data: we see that roughly 80 percent of the total quantity
purchased in each category is driven by the top 20 percent of the UPC’s in the category. In con-
trast to these “top sellers”, the other 80 percent of UPC’s contain relatively “sparse sellers” that
share the remaining 20 percent of the total volume in the category. The third column shows an
important consequence of this sparsity: many UPC’s in a given week at a store simply do not sell.
In particular, we see that the fraction of observations with zero sales can even be nearly 60% for
some categories.
Table 1: Selected Product Categories in the Dominick's Database

Category              Average Number of UPC's   Percent of Total Sales   Percent of
                      in a Store/Week Pair      of the Top 20% UPC's     Zero Sales
--------------------  ------------------------  -----------------------  ----------
Beer                  179                       87.18%                   50.45%
Cereals               212                       72.08%                   27.14%
Crackers              112                       81.63%                   37.33%
Dish Detergent        115                       69.04%                   42.39%
Frozen Dinners        123                       66.53%                   38.32%
Frozen Juices          94                       75.16%                   23.54%
Laundry Detergents    200                       65.52%                   50.46%
Paper Towels           56                       83.56%                   48.27%
Refrigerated Juices    91                       83.18%                   27.83%
Soft Drinks           537                       91.21%                   38.54%
Snack Crackers        166                       76.39%                   34.53%
Soaps                 140                       77.26%                   44.39%
Toothbrushes          137                       73.69%                   58.63%
Canned Tuna           118                       82.74%                   35.34%
Bathroom Tissues       50                       84.06%                   28.14%
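The two summary columns in Table 1 are simple functions of a UPC-level quantity vector. A minimal sketch on simulated heavy-tailed sales (the Pareto/Poisson mixture and all numbers are illustrative, not the DFF data):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical weekly sales for 200 UPCs: a heavy-tailed mixture, so a few UPCs
# dominate and many observations are exactly zero (not the actual DFF data).
quantities = rng.poisson(rng.pareto(1.2, size=200))

def top_share(q, frac=0.2):
    """Fraction of total volume accounted for by the top `frac` of UPCs."""
    q_sorted = np.sort(q)[::-1]
    k = max(1, int(frac * len(q)))
    return q_sorted[:k].sum() / q.sum()

def zero_fraction(q):
    """Fraction of UPC observations with zero sales."""
    return np.mean(q == 0)

print(f"top-20% share: {top_share(quantities):.1%}")
print(f"zero sales:    {zero_fraction(quantities):.1%}")
```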
We can visualize this situation another way by fixing a product category (here we use canned
2 Store level scanner data can often be augmented with a panel of household level purchases (available, for example, through IRI or Nielsen). Although the DFF data do not contain this micro level data, the main points of our analysis are equally applicable to the case where household level data is available. In fact our general choice model will accommodate the possibility of micro data. Store level purchase data is actually a special case of household level data where all households are observationally identical (no observable individual level characteristics).
Figure 1: Zipf’s Law in Scanner Data
tuna) and simply plotting the histogram of the volume sold for each week/UPC realization for a
single store in the data. This frequency plot is given in Figure 1. As can be seen, there is a sharp
decay in the empirical frequency as the purchase quantity becomes larger, with a long thin tail. In
particular the bulk of UPC’s in the store have small purchase volume: the median UPC sells less
than 10 units a week, which is less than 1.5% of the median volume of Tuna the store sells in a
week. The mode of the frequency plot is a zero share.
This power-law decay in the frequency of product demand is often associated with “Zipf’s
law” or the “the long tail”, which has a long history in empirical economics.3 We present further
illustrations of this long-tail demand pattern found in international trade flows as well as cross-
county homicide rates in Appendix A, which provides a sense of the generality of these stylized
facts.
The key takeaway from these illustrations is that the presence of market zeroes in the data is
closely intertwined to the prevalence of power-law patterns of demand. We will later exploit this
relationship to place structure on the data generating process that underlies market zeroes.
3 A First Pass Through Logit Demand
Why do zero shares create a problem for demand estimation? In this section, we use the workhorse
multinomial logit model to explain the zeroes problem and introduce our new estimation strategy.
Formal treatment for general differentiated product demand models is given in the next section.
3 See Anderson (2006) for a historical summary of Zipf's law and many examples from the social and natural sciences. See Gabaix (1999a) for an application of Zipf's law to the economics literature.
3.1 Zeroes Problem in the Logit Model
Consider a multinomial logit model for the demand of J products (j = 1, . . . , J) and an outside
option (j = 0). A consumer i derives utility uijt = δjt+ εijt from product j in market t, where δjt is
the mean-utility of product j in market t, and εijt is the idiosyncratic taste shock that follows the
type-I extreme value distribution. As is standard, the mean-utility δjt of product j > 0 is modeled
as
δjt = x′jtβ + ξjt, (3.1)

where xjt is the vector of observable (product, market) characteristics, often including price, and
ξjt is the unobserved characteristic. The outside good j = 0 has mean utility normalized to δ0t = 0.
The parameter of interest is β.
Each consumer chooses the product that yields the highest utility. Aggregating consumers’
choices, we obtain the true choice probability of product j in market t, denoted as
πjt = Pr(product j is chosen in market t).
The standard approach introduced by Berry (1994) for estimating β is to combine demand system
inversion and instrumental variables.
First, for demand inversion, one uses the logit structure to find that
δjt = ln(πjt) − ln(π0t), for j = 1, . . . , J. (3.2)
Then, to handle the potential endogeneity of Xjt (correlation with ξjt), one finds a random vector
zjt, such that
E [ξjt| zjt] = 0. (3.3)
Then two stage least squares with δjt defined in terms of choice probabilities as the dependent
variable becomes the identification strategy for β.
Unfortunately πjt is not observed as data - it is a theoretical choice probability defined by the
model but only indirectly revealed through actual consumer choices. The standard approach to this
following Berry (1994), Berry, Levinsohn, and Pakes (1995), and many subsequent papers in the
literature has been to substitute sjt, the empirical market share of product j in market t based on
the choices of n potential consumers, for πjt, and run a two-stage least square with ln (sjt)− ln (s0t)
as dependent variable, xjt as covariates, and zjt as instruments to obtain estimates for β.
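With strictly positive shares, this whole procedure is a linear IV regression. A minimal numpy sketch of the two steps, inversion (3.2) and then 2SLS, on simulated logit data where the true choice probabilities are used directly (the data-generating process and instrument design are illustrative assumptions, not the paper's application):

```python
import numpy as np

rng = np.random.default_rng(1)
T, J = 500, 10                         # markets, inside products
beta = np.array([1.0, -2.0])           # true taste parameters

z = rng.normal(size=(T, J, 2))         # exogenous instruments
xi = rng.normal(scale=0.3, size=(T, J))
x = z + 0.5 * xi[..., None]            # characteristics, endogenous through xi
delta = x @ beta + xi                  # mean utilities, eq. (3.1)

expd = np.exp(delta)                   # logit choice probabilities
pi0 = 1.0 / (1.0 + expd.sum(axis=1))
pi = expd * pi0[:, None]

# Berry (1994) inversion (3.2): delta_jt = ln(pi_jt) - ln(pi_0t)
y = np.log(pi) - np.log(pi0)[:, None]

# 2SLS with instruments z: beta_hat = (X' Pz X)^{-1} X' Pz y
X, Z, Y = x.reshape(-1, 2), z.reshape(-1, 2), y.reshape(-1)
PzX = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)
beta_hat = np.linalg.solve(PzX.T @ X, PzX.T @ Y)
print(beta_hat)   # close to (1.0, -2.0): no zeroes arise when pi is observed
```

The clean result here depends entirely on feeding the inversion the true πjt; the rest of the section is about what breaks when only noisy empirical shares are available.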
Plugging in the estimate sjt for πjt appears innocuous at first glance because the number of
potential consumers (n) in a market from which sjt is constructed is typically large. Nevertheless
problems arise when there are (jt)’s for which πjt is very small. Because the slope of the natural
logarithm function approaches infinity when the argument approaches zero, even small estimation
error of πjt may lead to large error in the plugged-in version of δjt when πjt is very small. In particu-
lar, sjt may frequently equal zero in this case, causing the demand inversion to fail completely. The
first is the theoretical root of the small πjt problem, while the second is an unmistakable symptom.
Data sets with this symptom are frequently encountered in empirical research as discussed in
Section 2. With such data, a common practice is to ignore the (jt)'s with sjt = 0, effectively
lumping those j’s into the outside option in market t. This leads however to a selection problem.
To see this, suppose sjt = 0 for some (j, t) and one drops these observations from the analysis
- effectively one is using a selected sample where the selection criterion is sjt > 0. In this selected
sample, the conditional mean of ξjt is no longer zero, i.e.,
E[ξjt|xjt, sjt > 0] ≠ 0. (3.4)
This is the well-known selection-on-unobservables problem and with such sample selection, an
attenuation bias ensues.4 The attenuation bias generally leads to demand estimates that appear
to be too inelastic.5
Another commonly adopted empirical “trick” is to add a small positive number ε > 0 to the
sjt’s that are zero, and use the resulting modified shares sεjt > 0 in place of πjt.6 However, this
trick only treats the symptom, i.e., sjt = 0, but overlooks the nature of the problem: the true choice
probability πjt is small. In this case, small estimation error in any estimator π̂jt of πjt would
lead to large error in the plugged-in version of δjt and in the estimation of β. This problem manifests
itself directly: the estimate β̂ can be incredibly sensitive to the particular choice of the small
number being added, with little guidance on what the “right” choice of small number is. In general,
like selecting away the zeroes, the “adding a small number trick” yields a biased estimator for β.
We illustrate both biases in the Monte Carlo section (Section 5).
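Both fixes are easy to reproduce in a small logit Monte Carlo. The sketch below (all parameter values hypothetical; x is exogenous here, so plain OLS on the inverted shares suffices) drops the zeroes in one run and plugs in two different small numbers in another, illustrating both the selection problem and the sensitivity to the arbitrary choice of ε:

```python
import numpy as np

rng = np.random.default_rng(2)
T, J, n, beta = 2000, 15, 100, -1.0   # markets, products, consumers; true beta

x = rng.normal(size=(T, J))
xi = rng.normal(scale=0.5, size=(T, J))
delta = -3.5 + beta * x + xi          # low mean utilities -> many zero shares
expd = np.exp(delta)
pi0 = 1.0 / (1.0 + expd.sum(axis=1))  # outside-option choice probability
pi = expd * pi0[:, None]

# Empirical shares from n consumers per market; many s_jt are exactly zero.
counts = np.stack([rng.multinomial(n, np.append(p, p0))
                   for p, p0 in zip(pi, pi0)])
s, s0 = counts[:, :J] / n, counts[:, J] / n

def slope(y, xv):
    """OLS slope with an intercept (x is exogenous in this simulation)."""
    X = np.column_stack([np.ones_like(xv), xv])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

log_s0 = np.broadcast_to(np.log(s0)[:, None], s.shape)

# Fix 1: drop the zero-share observations (selection on s_jt > 0).
keep = s > 0
b_drop = slope(np.log(s[keep]) - log_s0[keep], x[keep])

# Fix 2: replace zeros with a small eps; the estimate depends on which eps.
b_eps = {eps: slope((np.log(np.maximum(s, eps)) - log_s0).ravel(), x.ravel())
         for eps in (1e-3, 1e-6)}
print(f"true beta: {beta}; drop-zeroes: {b_drop:.3f}; add-eps: {b_eps}")
```

In this design the drop-zeroes estimate is attenuated toward zero, and moving ε from 1e-3 to 1e-6 shifts the add-ε estimate noticeably, consistent with the discussion above.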
Despite their failure as general solutions, these “ad hoc zero fixes” have in them what could be a
useful idea: perhaps the variation among the non-zero share observations can be used to estimate
the model parameters, while at the same time the presence of zeroes is controlled in such a way
that avoids bias. We now present a new estimator that formalizes this possibility by using moment
inequalities to control for the zeroes in the data while using the variation in the remaining part of
the data to consistently estimate the demand parameters. We continue in this section to illustrate
our approach within the logit model before treatment of the general case in the next section.
3.2 A Bounds Estimator
Our bounds approach turns the selection-on-unobservables problem into a selection-on-observables
strategy, with the key features that the selection is not based on market share but on exogenous vari-
4 In fact, E[ξjt|xjt, sjt > 0] > 0 (3.5) in the homoskedastic case. This is because the criterion sjt > 0 selects high values of ξjt and leaves out low values of ξjt.
5 It is easy to see that the selection bias is of the same direction if the selection criterion is instead sjt > 0 for all t, as one is effectively doing when focusing on a few top sellers that never demonstrate zero sales in the data. The reason is that the event {sjt > 0 for all t} contains the event {sjt > 0} for a particular t. If the markets are weakly dependent, the particular-t part of the selection dominates.
6 Berry, Linton, and Pakes (2004) and Freyberger (2015) study the biasing effect of plugging in sjt for πjt. Their bias corrections do not apply when there are zeroes in the empirical shares.
ables, and is not determined ex ante by the econometrician but rather automatically performed by
the estimator. Specifically, we assume that there exists a set of “safe” product/markets (j, t), identified
by the instrumental variable zjt, with inherently thick demand such that sjt has a small chance
of being zero. In particular, we assume a partition of the support of zjt: supp(zjt) = Z = Z0 ∪ Z1,
that separates the safe product/markets (zjt ∈ Z0) from the remaining “risky” product/markets
(zjt ∈ Z1).7 The safe products have inherently desirable characteristics that often make them the
“top sellers” described in Section 2, while the risky products have less attractive characteristics that
often yield sparse demand. If we knew Z0 and focused on the observations such that zjt ∈ Z0, the
standard estimator would be consistent. The key challenge in the data is that the econometrician
will not know Z0 in advance. Our bounds estimator automatically utilizes the variation in Z0,
but at the same time safely controls for the observations in Z1, to consistently estimate β without
requiring the researcher either to know or to estimate the underlying partition (Z0, Z1).
Our approach first uses two mean-utility estimators, δujt and δℓjt, that are functions of empirical
market shares (rather than the true choice probability), to form bounds on E[δjt | zjt]:

E[δujt | zjt] ≥ E[δjt | zjt] ≥ E[δℓjt | zjt], ∀j, t a.s., (3.6)
where δjt is the true mean-utility in (3.1). Next, the inequalities (3.6) combined with (3.3) imply

E[δujt − x′jtβ | zjt] ≥ 0 ≥ E[δℓjt − x′jtβ | zjt] a.s. (3.7)

Observe that the moment restriction (3.3) implies that

E[(δjt − x′jtβ) g(zjt)] = 0 ∀g ∈ G,

where G is a set of instrumental variable functions. Using instead our upper and lower mean utility
estimators in place of the true mean utility, we have the following moment inequalities:

E[(δujt − x′jtβ) g(zjt)] ≥ 0 ≥ E[(δℓjt − x′jtβ) g(zjt)] ∀g ∈ G. (3.8)

Following Andrews and Shi (2013), we take each g ∈ G to be an indicator function for a hypercube
Bg ⊆ supp(z), i.e.,

g(zjt) = 1(zjt ∈ Bg),

and as long as G is rich enough, the identification information in (3.7) is preserved by the moment
inequalities (3.8).
7We will formalize the requirement on the partition in Section 4.
To form our estimator, define

ρuT(β, g) = (TJ)−1 ∑Tt=1 ∑Jj=1 (δujt − x′jtβ) g(zjt),
ρℓT(β, g) = (TJ)−1 ∑Tt=1 ∑Jj=1 (x′jtβ − δℓjt) g(zjt).

Let [a]− denote |min{0, a}|. Our estimator is then

β̂BD = arg minβ ∑g∈G µ(g) { [ρuT(β, g)]2− + [ρℓT(β, g)]2− }, (3.9)

where µ(g) is a probability density function on G, that is, µ(g) > 0 for all g ∈ G and ∑g∈G µ(g) = 1.
The function µ(g) is used to ensure summability of the terms, and the choice of µ(·) is discussed
in the next section.
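A compact sketch of the sample objective behind (3.9), using Andrews–Shi-style dyadic-interval indicators for a scalar instrument. The cube collection, the uniform weights µ(g), and the toy bounds data are all illustrative choices, not the paper's implementation:

```python
import numpy as np

def hypercubes(z, depths=(1, 2, 3)):
    """Dyadic intervals B_g covering the support of a scalar instrument z."""
    lo, hi = z.min(), z.max()
    cubes = []
    for d in depths:
        edges = np.linspace(lo, hi, 2 ** d + 1)
        cubes += list(zip(edges[:-1], edges[1:]))
    return cubes

def objective(beta, x, z, d_up, d_lo):
    """Sample analogue of (3.9): sum over g of mu(g)*([rho_u]_-^2 + [rho_l]_-^2)."""
    cubes = hypercubes(z)
    mu = 1.0 / len(cubes)                          # uniform weights over G
    total = 0.0
    for a, b in cubes:
        g = ((z >= a) & (z <= b)).astype(float)    # g(z) = 1(z in B_g)
        rho_u = np.mean((d_up - x * beta) * g)     # should be >= 0 at beta0
        rho_l = np.mean((x * beta - d_lo) * g)     # should be >= 0 at beta0
        total += mu * (min(rho_u, 0.0) ** 2 + min(rho_l, 0.0) ** 2)
    return total

# Toy data: scalar exogenous x = z, true beta0 = 1, valid bounds around x*beta0.
rng = np.random.default_rng(3)
z = rng.uniform(-1.0, 1.0, size=5000)
x = z
d_lo = x - 0.05 + rng.normal(scale=0.01, size=z.size)  # lower mean-utility estimate
d_up = x + 0.05 + rng.normal(scale=0.01, size=z.size)  # upper mean-utility estimate

grid = np.linspace(0.0, 2.0, 41)
vals = [objective(b, x, z, d_up, d_lo) for b in grid]
best = grid[int(np.argmin(vals))]
print(best)   # lands near beta0 = 1
```

Only the negative parts of the sample moments are penalized, so slack inequalities (wide but valid bounds) contribute nothing, exactly as in the heuristic argument that follows.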
Why is β̂BD consistent? A heuristic proof is as follows. Let us define the partition G = G0 ∪ G1,
where each g ∈ G0 has support inside Z0. This partition does not need to be explicitly formed
by the econometrician (only the flexible set of instrumental variable functions G over the entire
support of zjt in the observed data is needed as an input), but only needs to exist in the underlying
DGP. We can then separate the objective function underlying (3.9) into two additive pieces:

∑g∈G0 µ(g) { [ρuT(β, g)]2− + [ρℓT(β, g)]2− } + ∑g∈G1 µ(g) { [ρuT(β, g)]2− + [ρℓT(β, g)]2− }. (3.10)
Notice that at the true parameter value β0, each of these sums in (3.10) converges in probability
to 0 because of the validity of the moment inequalities (3.8) at the true value β0. What happens
away from the true value, at some β∗ ≠ β0? Observe that the second sum, over G1, is by construction
nonnegative regardless of the value of β. The first sum, on the other hand, converges, because
ρuT(β∗, g) and ρℓT(β∗, g) converge as T → ∞ for each g ∈ G0 (that is, for products whose zjt lies
in the safe set Z0); since the bounds are tight on Z0, their limits are E[(δjt − x′jtβ∗)g(zjt)] and its
negative, so the first sum approaches

∑g∈G0 µ(g) { [E[(δjt − x′jtβ∗)g(zjt)]]2− + [−E[(δjt − x′jtβ∗)g(zjt)]]2− }.

Then so long as the instruments {g(zjt)}g∈G0 have sufficient variation for the IV rank condition
with xjt to hold (the standard logit identifying condition), we are ensured that for at least a
positive mass of g ∈ G0 we have

E[(δjt − x′jtβ∗)g(zjt)] ≠ 0.

Thus the first sum in (3.10) will converge in probability to a strictly positive number. Hence
the limiting value of the objective function (3.9) attains its minimum at the true value β0, and thus
by standard arguments β̂BD →p β0.
Figure 2 provides a graphical illustration of the above arguments. In the safe products region Z0,
the bounds are tight and provide identification power, while in Z1, the bounds may be uninformative
but still valid. So instrumental functions such as g1 ∈ G0 will form moment equalities that point
identify the model. Other instrumental functions, such as g2, g3 ∈ G1, are associated with slack
moment inequalities so they do not undermine the identification.
Figure 2: Illustration of Bounds Approach

[Figure 2 plots E[δujt | zjt], E[δjt | zjt], and E[δℓjt | zjt] against zjt. Over the safe region Z0 the bounds are tight around E[δjt | zjt] (instrumental functions such as g1 live here), while over the risky region Z1 the bounds are wide but valid (instrumental functions such as g2 and g3).]
The bounds estimator thus controls for the zeroes in the data while using the variation among
the safe products to consistently estimate the model parameters. We now generalize this logic and
formalize it to the general differentiated product demand context with general error distribution
for the random utility model. We will show both consistency and asymptotic normality of the
estimator in this general case.
4 The General Model and Estimator
The researcher has data on a sample of markets t = 1, . . . , T, and for each market t, there is a sample
of individuals i = 1, . . . , nt choosing from the j = 0, . . . , Jt products in the market. A product j in
market t is characterized by a vector of characteristics xjt ∈ Rdx that are observed to the researcher,
and a scalar unobserved product attribute ξjt. We will refer to the bundle (xjt, ξjt) as j's product
characteristics (observed and unobserved). Note that, to better match the features of popular data
sets, we allow a t subscript for J; that is, different markets can have different numbers of products.
We will also allow a t subscript for n, the number of potential consumers.
In discrete choice models each consumer i = 1, . . . , nt in market t is assumed to make a single
choice from the product varieties j = 0, . . . , Jt in the market, where j = 0 denotes the outside
option of not purchasing. This choice is determined by maximizing a utility function that is random
from the perspective of the researcher. Specifically, the utility consumer i derives from consuming
product j in market t is given by
uijt = δjt + εijt,
where:
1. δjt is the mean-utility of product j in market t. Normalize δ0t = 0. As is standard, δjt is
modeled as
δjt = x′jtβ + ξjt, (4.1)
where xjt is the vector of observable (product, market) characteristics, often including price,
and ξjt is the unobserved characteristic;
2. εijt is the idiosyncratic taste shock governed by the following distribution:

εit = (εi0t, . . . , εiJtt) ∼ F(· | xt; λ), (4.2)

where xt stands for (x′1t, . . . , x′Jtt)′, and F(· | xt; λ) is a conditional cumulative distribution
function known up to the finite dimensional unknown parameter λ. Thus, the unknown
parameter in the model is θ = (β′, λ′)′. For clarity, we use θ0 ≡ (β′0, λ′0)′ to denote the true
value of the unknown parameter.
It is worth noting that allowing xt and the parameter λ to enter F makes this specification encom-
pass random coefficient specifications uijt = x′jtβi + ξjt, where βi follows some distribution (e.g.,
joint normal), because one can then view β as the mean of the random coefficients and εijt as the
sum of the products of the de-meaned random coefficients and the product characteristic xjt.8
We assume consumers demand the product that maximizes utility. Thus, integrating out εit
yields the demand system

πt ≡ (π1t, . . . , πJtt)′ = σ(δt, xt, λ), (4.3)

where δt = (δ1t, . . . , δJtt)′ and πjt = Pr(product j is chosen in market t) represents the true choice
probability of product j in market t. Let σ−1(πt, xt, λ) ≡ (σ−11(πt, xt, λ), . . . , σ−1Jt(πt, xt, λ))′
denote the inverse demand function such that

δt = σ−1(πt, xt, λ). (4.4)
8 Requiring F(·|xt, λ) to be known up to a finite dimensional parameter rules out the vertical model because, for the vertical model, εit is a function of the unobservable product characteristics (quality).
Note that in the simple logit model, σj(δt, xt, λ) reduces to σj(δt) = exp(δjt) / (1 + ∑Jtj′=1 exp(δj′t)),
and σ−1j(πt, xt, λ) reduces to σ−1j(πt) = ln(πjt) − ln(π0t).
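For random-coefficient specifications, σj(δt, xt, λ) typically has no closed form and is computed by simulating over the taste distribution. A sketch for a single normal random coefficient with scale λ (the draw count, parameter values, and one-dimensional setup are illustrative assumptions):

```python
import numpy as np

def sigma(delta, x, lam, draws):
    """Simulated choice probabilities for u_ij = delta_j + lam*v_i*x_j + logit shock."""
    v = delta[None, :] + lam * draws[:, None] * x[None, :]  # R x J utilities
    ev = np.exp(v)
    denom = 1.0 + ev.sum(axis=1, keepdims=True)             # outside utility 0
    return (ev / denom).mean(axis=0)                        # average over draws

rng = np.random.default_rng(4)
draws = rng.normal(size=10_000)        # v_i ~ N(0, 1) random-coefficient draws
delta = np.array([0.5, -1.0, 0.0])
x = np.array([1.0, 2.0, -1.0])

pi_logit = sigma(delta, x, lam=0.0, draws=draws)  # lam = 0: plain logit
pi_mixed = sigma(delta, x, lam=1.0, draws=draws)
print(pi_logit, pi_mixed)
```

With lam = 0 every simulated consumer has the same utilities, so the average collapses exactly to the closed-form logit probabilities above.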
Inverting the demand system allows for the use of instrumental variables to identify θ. In
particular, instruments for the model are a random vector zjt that satisfies
E[ξjt | zjt] = 0. (4.5)
Combining (4.4) and (4.5), the model yields the following moment restriction:

E[σ−1j(πt, xt, λ) − x′jtβ | zjt] = 0. (4.6)

If πt is observed, identification can be stated as follows. The model is identified if and only if, for
any θ = (β, λ) ≠ θ0,

PrF (m∗F(θ, zjt) ≠ 0 and zjt ∈ Z = supp(zjt)) > 0,

where

m∗F(θ, zjt) = E[σ−1j(πt, xt; λ) − x′jtβ | zjt]. (4.7)
Primitive conditions for identification are given in Berry and Haile (2014).
4.1 Bounds Estimator in the General Case
As in the logit case, we construct a pair of inverse demand functions, δujt(λ) and δℓjt(λ), to form
bounds on E[σ−1j(πt, xt, λ) | zjt], i.e.,

E[δujt(λ) | zjt] ≥ E[σ−1j(πt, xt, λ) | zjt] ≥ E[δℓjt(λ) | zjt], a.s. (4.8)

These inequalities combined with (4.5) form the moment inequalities that our estimation of θ is
based upon:

E[δujt(λ) − x′jtβ | zjt] ≥ 0 ≥ E[δℓjt(λ) − x′jtβ | zjt], a.s. (4.9)
To construct these upper and lower mean utility bounds δ^ℓ_jt(λ), δ^u_jt(λ), we start by applying the Laplace rule of succession to obtain an initial choice probability estimator that is free of zeros:

\[ \bar{s}_{jt} = \frac{n_t s_{jt} + 1}{n_t + J_t + 1}.^9 \]

We call this the Laplace share estimator. It is a good estimator for the choice probabilities when the only prior information is that these probabilities are positive, as argued in Jaynes (2003, Chap. 18), and thus provides a good starting point for our construction.
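The Laplace share construction is a one-line transformation of the raw purchase counts; a minimal sketch (the function name is illustrative, not from the paper):

```python
import numpy as np

def laplace_shares(counts):
    """Laplace share estimator for one market.

    counts : purchase counts (q_0, q_1, ..., q_J), outside good first,
             so n_t = counts.sum() and J_t = len(counts) - 1.
    Returns s_bar_j = (q_j + 1) / (n_t + J_t + 1): strictly positive
    even for products with zero recorded sales, and summing to one.
    """
    counts = np.asarray(counts)
    n, J = counts.sum(), len(counts) - 1
    return (counts + 1) / (n + J + 1)

s_bar = laplace_shares([95, 0, 3, 2])  # product 1 has zero sales
```

Even the zero-sales product receives the strictly positive share 1/(n_t + J_t + 1), which is what makes the subsequent inversion well defined.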
9 The Laplace rule of succession was proposed by Pierre-Simon Laplace in the early 19th century to predict the probability of an event given n independent past observations and the prior knowledge that the probability must be strictly between 0 and 1. It is a concept fundamental to modern probability theory despite being widely misunderstood and criticized. See Jaynes (2003, Chap. 18) for a thorough discussion.
We do not use the Laplace share estimator directly in place of π_t, but use it to construct the bounds δ^u_jt(λ) and δ^ℓ_jt(λ). Specifically, letting s̄_t denote the vector of Laplace shares, we define

\[ \delta_{jt}^u(\lambda) = \Delta_{jt}(\bar{s}_t, x_t; \lambda) + \log\left(\frac{\bar{s}_{jt} + \eta_t}{\bar{s}_{0t} - \eta_t}\right) \tag{4.10} \]

\[ \delta_{jt}^\ell(\lambda) = \Delta_{jt}(\bar{s}_t, x_t; \lambda) + \log\left(\frac{\bar{s}_{jt} - \eta_t}{\bar{s}_{0t} + \eta_t}\right), \tag{4.11} \]

where

\[ \Delta_{jt}(\bar{s}_t, x_t; \lambda) \equiv \sigma_j^{-1}(\bar{s}_t, x_t; \lambda) - \log\left(\frac{\bar{s}_{jt}}{\bar{s}_{0t}}\right), \tag{4.12} \]

and η_t is a scalar in (0, 1/(n_t + J_t + 1)).
It is instructive to consider the simple logit case, where σ_j^{-1}(s̄_t, x_t; λ) = log(s̄_jt/s̄_0t), so that Δ_jt(s̄_t, x_t; λ) = 0 and the bounds boil down to

\[ \delta_{jt}^u = \log\left(\frac{\bar{s}_{jt} + \eta_t}{\bar{s}_{0t} - \eta_t}\right) \quad \text{and} \quad \delta_{jt}^\ell = \log\left(\frac{\bar{s}_{jt} - \eta_t}{\bar{s}_{0t} + \eta_t}\right). \tag{4.13} \]

Thus the tuning term η_t perturbs the Laplace share (in both a positive and a negative direction) for each product one at a time, and δ^u_jt, δ^ℓ_jt are then formed by applying the logit inversion to the perturbed shares.
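The perturbed bounds (4.13) can be computed directly from the Laplace shares; a small self-contained sketch (names are ours; the outside good is listed first):

```python
import numpy as np

def logit_bounds(s_bar, eta):
    """Perturbed logit bounds of (4.13).

    s_bar : Laplace shares (s_bar_0, ..., s_bar_J), outside good first
    eta   : tuning scalar in (0, 1/(n_t + J_t + 1)), so s_bar_j - eta > 0
    Returns (lower, upper) mean-utility bounds for the J inside goods.
    """
    s0, sj = s_bar[0], s_bar[1:]
    upper = np.log((sj + eta) / (s0 - eta))
    lower = np.log((sj - eta) / (s0 + eta))
    return lower, upper

# Laplace shares from counts (95, 0, 3, 2): n_t = 100, J_t = 3
s_bar = (np.array([95, 0, 3, 2]) + 1) / 104
lower, upper = logit_bounds(s_bar, (1 - 1e-6) / 104)
gap = upper - lower  # widest for the product with zero observed sales
```

The bounds bracket the plain logit inversion of the Laplace shares, and the gap between them is largest exactly for the product with zero sales, previewing the remark below.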
Remark 1. Observe that the distance between δ^u_jt and δ^ℓ_jt is large when s̄_jt is small (i.e., respecting the large error caused by the noise in s̄_t) and is negligible when s̄_jt is large. Thus, intuitively, for observations such that z_jt ∈ Z_0, which defines the safe products in a market, s̄_jt is large with high probability, so that E[δ^u_jt | z_jt] and E[δ^ℓ_jt | z_jt] closely resemble E[δ_jt | z_jt]. On the other hand, the difference between E[δ^u_jt | z_jt] and E[δ^ℓ_jt | z_jt] may be large for risky products (i.e., z_jt ∈ Z_1) because s̄_jt has a high probability of being close to zero. This feature of the construction is key to the consistency result discussed later.
We now formally establish the validity of the bounds defined by (4.10) and (4.11).
Assumption 1. The conditional distribution of (n_t s_jt)_{j=0}^{J_t} given (π_jt, x_jt, z_jt)_{j=1}^{J_t} is multinomial with parameters n_t and (π_jt)_{j=0}^{J_t}.
Assumption 2. The inverse demand function σ−1j (·, xt, λ) is well-defined and continuous on the
Lemma 1. Suppose that Assumptions 1 and 2 hold. Then there exists η_t ∈ (0, 1/(n_t + J_t + 1)) such that the inequalities in (4.8) hold at λ = λ_0, with δ^u_jt(λ) and δ^ℓ_jt(λ) defined in (4.10) and (4.11).
Remark 2. The scalar η_t is chosen to guarantee (4.8). The η_t satisfying (4.8) may depend on π_t, x_t, and n_t, and thus may itself be a random variable, which makes it appear difficult to choose. However, we find that a rule of thumb works very well in both our Monte Carlo and empirical exercises. The rule is to start with, for example, η_t = (1 − 10^{-3})/(n_t + J_t + 1), increase it to η_t = (1 − 10^{-4})/(n_t + J_t + 1), then to η_t = (1 − 10^{-5})/(n_t + J_t + 1), and so on, until the estimates stabilize. To see why this rule of thumb is reasonable, note that if one choice, say η_t^1, satisfies (4.8), then another choice, say η_t^2, that lies between η_t^1 and 1/(n_t + J_t + 1) also satisfies (4.8). This is due to the monotonicity of the right-hand sides of (4.10) and (4.11) in η_t. On the other hand, using η_t's that are closer to the boundary 1/(n_t + J_t + 1) generally does not hurt estimation precision much, because identification is based on the safe products, for which even the upper bound 1/(n_t + J_t + 1) is negligible relative to s̄_jt with high probability. This suggests that we do not need to know the precise range of η_t's that work, but can afford to make a conservative choice, as our rule of thumb does.
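The rule of thumb can be written as a short search loop; a sketch, with `estimate_fn` standing in for the full bound estimator (which is not implemented here):

```python
def eta_rule(n_t, J_t, k):
    """Rule-of-thumb tuning value eta_t = (1 - 10^(-k)) / (n_t + J_t + 1)."""
    return (1.0 - 10.0 ** (-k)) / (n_t + J_t + 1)

def stabilize(estimate_fn, n_t, J_t, tol=1e-4, k_max=10):
    """Push eta toward the boundary (k = 3, 4, ...) until successive
    estimates move by less than tol.  estimate_fn(eta) is a placeholder
    for computing the bound estimator at a given eta."""
    prev = estimate_fn(eta_rule(n_t, J_t, 3))
    for k in range(4, k_max + 1):
        cur = estimate_fn(eta_rule(n_t, J_t, k))
        if abs(cur - prev) < tol:
            return cur
        prev = cur
    return prev
```

Each step moves η_t monotonically toward, but strictly below, the boundary 1/(n_t + J_t + 1), matching the conservative logic of the remark.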
In order to estimate θ based on the moment inequalities (4.9), we first transform the conditional moment inequalities into unconditional ones, following Andrews and Shi (2013), using a set G of instrumental functions, where an instrumental function is a function of z_jt. The set G that we use is given below, and it guarantees that (4.9) is equivalent to

\[ E\big[(\delta_{jt}^u(\lambda) - x_{jt}'\beta)\,g(z_{jt})\big] \geq 0 \geq E\big[(\delta_{jt}^\ell(\lambda) - x_{jt}'\beta)\,g(z_{jt})\big]. \tag{4.14} \]
Andrews and Shi (2013) discussed many different choices of G, including uncountable and countable sets. We only consider countable sets G. Thus, given a data set of J products from T markets, we can construct a sample criterion function as

\[ Q_T(\theta) = \sum_{g \in G} \big[\bar{\rho}_T^u(\theta, g)\big]_-^2\,\mu(g) + \sum_{g \in G} \big[\bar{\rho}_T^\ell(\theta, g)\big]_-^2\,\mu(g), \tag{4.15} \]

where [x]_- = min{x, 0} and

\[ \bar{\rho}_T^u(\theta, g) = (T\bar{J})^{-1} \sum_{t=1}^T \sum_{j=1}^{J_t} (\delta_{jt}^u(\lambda) - x_{jt}'\beta)\, g(z_{jt}), \]

\[ \bar{\rho}_T^\ell(\theta, g) = (T\bar{J})^{-1} \sum_{t=1}^T \sum_{j=1}^{J_t} (x_{jt}'\beta - \delta_{jt}^\ell(\lambda))\, g(z_{jt}), \tag{4.16} \]

where J̄ = T^{-1} ∑_{t=1}^T J_t is the average number of products in a market. The function μ(·) is a probability distribution on G, which assigns a weight to each unconditional moment inequality. Our choice of μ(·) is given below after the choice of G is introduced.
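A direct implementation of the criterion (4.15)–(4.16) for stacked observations might look as follows (a sketch; variable names are ours, and the instrumental functions are taken as a precomputed matrix):

```python
import numpy as np

def neg_part(x):
    """[x]_- = min(x, 0): only violated inequalities contribute to Q_T."""
    return np.minimum(x, 0.0)

def criterion(d_upper, d_lower, xb, g_mat, mu):
    """Sample criterion Q_T of (4.15)-(4.16), observations stacked over (j, t).

    d_upper, d_lower : delta^u_jt(lambda), delta^l_jt(lambda), shape (n_obs,)
    xb               : x_jt' beta, shape (n_obs,)
    g_mat            : nonnegative instrumental functions g(z_jt), (n_obs, n_g)
    mu               : weights on the moments, shape (n_g,)
    """
    rho_u = (g_mat * (d_upper - xb)[:, None]).mean(axis=0)  # should be >= 0
    rho_l = (g_mat * (xb - d_lower)[:, None]).mean(axis=0)  # should be >= 0
    return float(neg_part(rho_u) ** 2 @ mu + neg_part(rho_l) ** 2 @ mu)

xb = np.array([0.2, -0.1, 0.4, 0.0])
g = np.ones((4, 2))
mu = np.array([0.5, 0.5])
q_ok = criterion(xb + 1.0, xb - 1.0, xb, g, mu)   # bounds bracket x'beta
q_bad = criterion(xb - 1.0, xb - 2.0, xb, g, mu)  # upper bound violated
```

When the bounds bracket x'β the criterion is exactly zero; violations are penalized through the squared negative parts.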
Our bound estimator for θ = (β′, λ′)′ is defined as^10

\[ \hat{\theta}_T^{BD} = \arg\min_\theta Q_T(\theta). \tag{4.17} \]

Numerically solving for θ̂^BD_T is not much different from solving for the standard BLP estimator. As in the standard procedure, the criterion function is convex in β.^11 Thus, it is useful to separate the minimization problem into two steps:

\[ \min_\lambda \min_\beta Q_T(\beta, \lambda). \tag{4.18} \]

The β minimization can be solved efficiently and accurately even when many control variables are included in x_jt. The λ minimization is typically a low-dimensional problem. One point worth noting is that the inverse demand functions involved in the quantities δ^u_jt(λ) and δ^ℓ_jt(λ) can be solved for by the same contraction mapping algorithm used in the standard BLP procedure. Alternatively, the optimization problem (4.18) can be formulated and solved as an MPEC problem using the machinery of Dube, Fox, and Su (2012).
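The contraction step mentioned above is the familiar BLP fixed-point iteration δ ← δ + log s − log σ(δ); a minimal sketch, tested here against the simple logit case where the inverse is known in closed form (names are ours):

```python
import numpy as np

def blp_contraction(log_s, sigma, delta0, tol=1e-12, max_iter=5000):
    """BLP contraction: iterate delta <- delta + log(s) - log(sigma(delta))
    until the update is below tol; converges to sigma^{-1}(s) under the
    usual BLP conditions."""
    delta = np.array(delta0, dtype=float)
    for _ in range(max_iter):
        step = log_s - np.log(sigma(delta))
        delta += step
        if np.abs(step).max() < tol:
            break
    return delta

def sigma_logit(delta):
    """Simple-logit demand (outside good normalized to zero), used as a
    test case where the inverse is known in closed form."""
    e = np.exp(delta)
    return e / (1.0 + e.sum())

delta_true = np.array([0.3, -0.7, 1.1])
delta_hat = blp_contraction(np.log(sigma_logit(delta_true)), sigma_logit,
                            np.zeros(3))
```

In the estimator, the same routine would be applied to the perturbed Laplace shares rather than to model-generated shares.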
Now we define the instrumental function collection G and the weight function μ(·) on it that we use in the simulation and the empirical application of this paper.^12 For G, we divide the instrument vector z_jt into discrete instruments z_{d,jt} and continuous instruments z_{c,jt}. Let Z_d be the discrete set of values that z_{d,jt} can take. Normalize the continuous instruments to lie in [0, 1]: z̃_{c,jt} = F_{N(0,1)}(Σ̂_{z_c}^{-1/2} z_{c,jt}), where F_{N(0,1)}(·) is the standard normal cdf and Σ̂_{z_c} is the sample covariance matrix of z_{c,jt}. The set G is defined as

\[ G = \big\{ g_{a,r,\zeta}(z_d, z_c) = 1\big((\tilde{z}_c', z_d')' \in C_{a,r,\zeta}\big) : C_{a,r,\zeta} \in \mathcal{C} \big\}, \text{ where} \]

\[ \mathcal{C} = \big\{ \big( \times_{u=1}^{d_{z_c}} ((a_u - 1)/(2r),\ a_u/(2r)] \big) \times \{\zeta\} :\ a_u \in \{1, 2, \ldots, 2r\} \text{ for } u = 1, \ldots, d_{z_c},\ r = r_0, r_0 + 1, \ldots,\ \text{and } \zeta \in Z_d \big\}. \tag{4.19} \]

In practice, we truncate r at a finite value r_T. This does not affect the first-order asymptotic properties of our estimator as long as r_T → ∞. For μ(·), we use

\[ \mu(g_{a,r,\zeta}) \propto (100 + r)^{-2} (2r)^{-d_{z_c}} K_d^{-1} \quad \text{for } g_{a,r,\zeta} \in G, \tag{4.20} \]

where K_d is the number of elements in Z_d. The same μ measure is used and works well in Andrews and Shi (2013).
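A one-dimensional sketch of the hypercube instrumental functions in (4.19) and the weights in (4.20) may help fix ideas (we omit discrete instruments, so the K_d factor is dropped; function names are ours):

```python
import numpy as np
from math import erf, sqrt

def to_unit_interval(zc):
    """Studentize a continuous instrument and map it into (0, 1) with the
    standard normal cdf (one-dimensional version of the normalization)."""
    z = (np.asarray(zc, dtype=float) - np.mean(zc)) / np.std(zc)
    return np.array([0.5 * (1.0 + erf(v / sqrt(2.0))) for v in z])

def hypercube_instruments(u, r_max, r0=1):
    """Countable hypercube instrumental functions of (4.19), 1-d case:
    g_{a,r}(z) = 1{u in ((a-1)/(2r), a/(2r)]}, a = 1..2r, r = r0..r_max,
    with weights proportional to (100 + r)^{-2} (2r)^{-1} as in (4.20)."""
    cols, weights = [], []
    for r in range(r0, r_max + 1):
        for a in range(1, 2 * r + 1):
            cols.append(((u > (a - 1) / (2 * r)) & (u <= a / (2 * r))).astype(float))
            weights.append((100.0 + r) ** (-2) * (2.0 * r) ** (-1))
    return np.column_stack(cols), np.array(weights)

u = to_unit_interval([1.0, 2.0, 3.0, 4.0])
G, w = hypercube_instruments(u, r_max=3)
```

For each resolution r, the half-open cells partition (0, 1], so every observation activates exactly one indicator per r; finer cells receive geometrically smaller weight.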
10 When there is no partition of the space of z_jt that distinguishes the safe products, the moment inequalities (4.9) only partially identify θ. In that case, the confidence set procedure in Andrews and Shi (2013), as well as the profiling approach in an early version of this paper, may be used for inference. However, in the current version of this paper, we focus on the point identification case, which is much more computationally tractable.
11 The convexity can be seen by examining the second-order derivative of Q_T(θ) with respect to β.
12 We note that appropriate choices of G and μ are not unique. For other possible choices, see Andrews and Shi (2013).
4.2 Consistency and Asymptotic Normality

In the asymptotic framework, we let the number of markets T go to infinity, and let the number of consumers in each market, n_t, be a function of T that also goes to infinity as T does. The number of products J_t may also be a function of T that goes to infinity as T does, or it may stay finite.

The key concept behind our approach is the notion of safe products. We define the safe products according to the value that z_jt takes. Let Z_0 be a subset of R^{d_z}, where d_z is the dimension of z_jt. Product j is said to be a safe product in market t if z_jt ∈ Z_0. Thus, the instrumental variable not only induces exogenous variation in the explanatory variables, as in the standard setup, but also serves as an identifier of the safe products. The requirements on the set Z_0 are listed below.

If j is a safe product in market t, its market share π_jt tends to be sufficiently different from zero, so that the slope of σ_j^{-1}(π_t, x_t; λ) at the true choice probability π_t tends not to be large. As a result, the inverse demand function σ_j^{-1}(π̂_t, x_t; λ) should be sufficiently close to σ_j^{-1}(π_t, x_t; λ) for any consistent estimator π̂_t of π_t. Thus, the first requirement is as follows.
Assumption 3. For any estimator π̂_t of π_t such that sup_{j=0,...,J_t, t=1,...,T} |π̂_jt − π_jt| →_p 0, we have
A standard bootstrap procedure can be used to estimate the standard deviation of the estimator
in practice and we shall discuss the implementation details of this procedure in the empirical section.
4.3 Partial Identification as an Alternative
The approach above provides a consistent point estimator based on an underlying set of moment inequalities. Point estimation relies on Assumptions 3 and 4, which allow the variation among safe products to be used for consistency. This is natural in many applications where the long tail pattern is present, and we illustrate its performance in the Monte Carlo study below. Nevertheless, in settings where these assumptions are questionable, we can still use the underlying moment inequalities (4.14) as a basis for partial identification and inference.
The model (4.14) is a moment inequality model with many moment conditions. One can use the method developed in Andrews and Shi (2013) to construct a joint confidence set for the full vector θ_0. This confidence set is constructed by inverting an Anderson–Rubin test: CS = {θ : T(θ) ≤ c(θ)} for some test statistic T(θ) and critical value c(θ). Computing this set amounts to computing the 0-level set of the function T(θ) − c(θ), where c(θ) typically is a simulated quantile and thus a non-smooth function of θ. This is feasible if the dimension of θ_0 is moderate, especially if one has access to parallel computing technology. If the dimension is high, however, the computational cost grows exponentially, and methods for this case have not been well developed.
On the other hand, in demand estimation, θ_0 is high-dimensional mainly because of the many control variables included in x_jt. The coefficients of the control variables are nuisance parameters that often are of no particular interest. The typical parameters of interest are the price coefficient or the price elasticities, which are low-dimensional. Based on this observation, we propose a profiling method to profile out the nuisance parameters and construct confidence sets only for a parameter of interest. Since this part of the discussion is rather technical and tangential to our main contribution, we relegate it to Appendix D. Also, readers are referred to the early version of this paper (Gandhi, Lu, and Shi (2013)) for Monte Carlo simulations and empirical results using the profiling approach under partial identification.
5 Monte Carlo Simulations
In this section, we present two sets of Monte Carlo experiments with random coefficient logit models.
The first experiment investigates the performance of our approach with moderate fractions of zero
shares, which should cover most of the empirical scenarios. In the second experiment, we test
our estimator with a data generating process that produces extremely large fractions of zeros; the
purpose is to further illustrate the key idea of our estimator in exploiting the long tail pattern that
is naturally present in the data.
Both experiments use a random coefficient logit model, where the utility of consumer i for product j in market t is

\[ u_{ijt} = \alpha_0 + x_{jt}\beta_0 + \lambda_0 x_{jt} v_i + \xi_{jt} + \varepsilon_{ijt}, \]

where v_i ∼ N(0, 1), λ_0 is the standard deviation of the random coefficient on x_jt, and the ε_ijt's are i.i.d. across i, j, and t, following the Type I extreme value distribution. The parameters of interest are β_0 and λ_0, while α_0 is a nuisance parameter. In both experiments, we set λ_0 = .5 and β_0 = 1, and vary α_0 across designs. We simulate T markets, each with J products.
5.1 Moderately Many Zeroes

In the first experiment, the observed and unobserved characteristics are generated as x_jt = j/10 + N(0, 1) and ξ_jt ∼ N(0, .1²) for each product j in market t. Thus one feature of the design is that x_jt has some persistence across markets: products with a larger index tend to have higher values of x (which respects the nature of the variation in the scanner data shown in Section 2). Finally, the vector of empirical shares in market t, (s_0t, s_1t, ..., s_Jt), is generated from Multinomial(n, [π_0t, π_1t, ..., π_Jt]′)/n, where n represents the number of consumers in each market.^13
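Drawing the empirical shares for one market is a single multinomial draw; a minimal sketch (the probabilities below are illustrative, not the paper's design):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_market_shares(pi, n):
    """Empirical shares (s_0, ..., s_J) from n multinomial consumer draws
    with true choice probabilities pi = (pi_0, ..., pi_J)."""
    return rng.multinomial(n, pi) / n

# Illustrative long-tailed probabilities: a dominant outside good and
# several tiny inside shares, for which zeros are likely at n = 500.
pi = np.array([0.90, 0.06, 0.03, 0.009, 0.001])
s = simulate_market_shares(pi, 500)
```

With π_j as small as .001 and 500 consumers, the corresponding empirical share is zero with probability (1 − .001)^500 ≈ .61, which is exactly the zeroes problem the estimator is designed for.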
With the simulated data sets {(s_jt, x_jt) : j = 1, ..., J}_{t=1}^T, we compute our bound estimator (Bound), the standard BLP estimator using s_t in place of π_t and discarding observations with s_jt = 0 (ES), and the standard BLP estimator using the Laplace share s̄_t (which has no zeros) in place of π_t (LS).
All the estimators require simulating the market shares and solving demand systems for each
trial of λ in optimizing the objective function for estimation. We use the same set of random draws
13 π_t has no closed form in the random coefficient model, and thus we compute it via simulation, i.e.,

\[ \pi_{jt} = \frac{1}{S} \sum_{i=1}^{S} \frac{\exp(\alpha_0 + x_{jt}\beta_0 + \lambda_0 x_{jt} v_i + \xi_{jt})}{1 + \sum_{k=1}^{J} \exp(\alpha_0 + x_{kt}\beta_0 + \lambda_0 x_{kt} v_i + \xi_{kt})}, \]

where S = 1000 is the number of consumer type draws (v_i).
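The simulated choice probabilities in the footnote can be computed by averaging the logit kernel over consumer draws; a sketch (vectorized over draws; names are ours):

```python
import numpy as np

rng = np.random.default_rng(1)

def rc_logit_probs(x, xi, alpha0, beta0, lam0, n_draws=1000):
    """Simulated random-coefficient logit probabilities: average the logit
    kernel over n_draws consumer types v_i ~ N(0, 1).
    Returns (pi_1, ..., pi_J); the outside share is 1 - sum."""
    v = rng.standard_normal(n_draws)                 # (S,)
    u = alpha0 + np.outer(beta0 + lam0 * v, x) + xi  # (S, J) utilities
    e = np.exp(u)
    return (e / (1.0 + e.sum(axis=1, keepdims=True))).mean(axis=0)

x = np.array([1.0, 2.0])
xi = np.zeros(2)
pi = rc_logit_probs(x, xi, alpha0=-1.0, beta0=0.5, lam0=0.0)  # lam0 = 0: plain logit
```

Setting λ_0 = 0 collapses the model to the plain logit, which provides a convenient check on the simulator.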
of v_i as in the data generating process, to eliminate simulation error, since it is not the focus of this paper. The BLP contraction mapping method is employed to numerically solve the demand systems.

We simulate 1000 datasets {(s_t^r, x_t^r) : t = 1, ..., T}_{r=1}^{1000} and implement all the estimators mentioned above on each, for a repeated simulation study. For the instrumental functions, we use the countable hyper-cubes defined in (4.19) and set r_T = 50. We let η = (1 − ι)/(n + J + 1) with ι = 10^{-6} in constructing the bounds on the conditional expectation of the inverse demand function. Setting a smaller ι, e.g., 10^{-10}, gives virtually the same results as those reported in the following tables. For the BLP estimators, we use (1, x_jt, x_jt² − 1, x_jt³ − 3x_jt) (the first three Hermite polynomials) as instruments to construct the GMM objective function. Alternative transformations of x_jt as instruments yield effectively the same results.
The bias and standard deviation of the estimators are presented in Table 2. As we can see from the table, the standard estimator with s_t shows large bias for both β and λ. Replacing the empirical share s_t with the Laplace share s̄_t (and thus not discarding the observations with s_jt = 0) increases the bias for β although reducing it for λ. Our bound estimator is the least biased, and its bias is very small for both parameters, especially when the sample size (T) is larger.
Table 2: Monte Carlo Results: Random-Coefficient Logit Model

Note: 1. J = 50, N = 10,000, β_0 = 1, λ_0 = .5, Number of Repetitions = 1000.
2. "ES": Empirical Shares; "LS": Laplace Shares.
3. DGP: I, II, III and IV correspond to α_0 = −9, −10, −12 and −13, respectively.
5.2 Extremely Many Zeroes

Next we pressure-test our bound estimator by pushing the fraction of zeroes in the empirical shares toward the extreme. We modify the DGP slightly to produce a very high fraction of zeros. Specifically, we generate x_jt from the following discrete distribution:

    x              1     12    15
    Pr(x_jt = x)  .99   .005  .005

and

    ξ_jt ∼ 1(x_jt = 1) × N(0, 2²) + 1(x_jt ≠ 1) × N(0, .1²).

All other aspects of the DGP are identical to the previous DGP.

The fractions of zeroes are made very high (82%–96%) by choosing the α_0 parameter. With such high fractions of zeroes, the vast majority of observations are uninformative. Thus, we need a
larger sample size for any estimator to perform well. We consider T = 100, 200, 400. For simplicity of presentation and to reduce the computational burden, we here fix λ at its true value and only investigate the behavior of the estimators for β.
The results are reported in Table 3, and they are very encouraging for the bound approach. The ES estimator is severely biased toward 0, as is the LS estimator. The bound estimator is remarkably accurate in these extreme cases. The performance highlights the key identification idea behind our estimator: utilizing the information in safe products with inherently thick demand to identify the model, while properly controlling for the risky products with small/zero sales.
Table 3: Monte Carlo Results: Very Large Fraction of Zeros

DGP    T    Ave. %               β
            of Zeros       ES        Bound      LS
I      100  82.91%  Bias  -.3222    -.0072    -.2643
                    SD     .0272     .0342     .0240
       200  82.92%  Bias  -.3219    -.0072    -.2633
                    SD     .0142     .0095     .0041
       400  82.94%  Bias  -.3194    -.0060    -.2633
                    SD     .0267     .0068     .0031
II     100  89.59%  Bias  -.3777    -.0059    -.3311
                    SD     .0129     .0133     .0063
       200  89.57%  Bias  -.3777    -.0066    -.3308
                    SD     .0125     .0095     .0045
       400  89.55%  Bias  -.3759    -.0060    -.3308
                    SD     .0230     .0066     .0033
III    100  96.35%  Bias  -.5613    -.0060    -.5499
                    SD     .0090     .0139     .0090
       200  96.36%  Bias  -.5615    -.0064    -.5498
                    SD     .0069     .0097     .0064
       400  96.35%  Bias  -.5605    -.0061    -.5495
                    SD     .0102     .0071     .0046

Note: 1. J = 50, N = 10,000, β_0 = 1, λ_0 = .5, Number of Repetitions = 1000.
2. We fix λ = λ_0 (at the true value) without estimating it.
3. DGP: I, II, III correspond to α_0 = −13, −14, −17, respectively.
6 Empirical Application
In this section, we apply our estimator to the same DFF scanner data previewed in Section 2. In
particular, we focus on the canned tuna category, as previously studied by Chevalier, Kashyap, and
Rossi (2003) (CKR for short) and Nevo and Hatzitaskos (2006) (NH for short). CKR observed
using the DFF data discussed in Section 2 that the share weighted price of tuna fell by 15 percent
during Lent (which we replicate below in our sample from the same data source), which is a high
demand period for this product. They attributed the outcome to loss-leading behavior on the part
22
of retailers. NH on the other hand suggest that this pricing pattern in the tuna data could instead
be explained by increased price sensitivity of consumers (consistent with an increase in search)
which causes a re-allocation of market shares towards less expensive products in the Lent period,
and hence a fall in the observed share weighted price index. They test this hypothesis directly in
the data by estimating demand parameters separately in the Lent and Non-Lent periods, and find
that demand becomes more elastic in the high demand (Lent) period.
Here we revisit the groundwork laid by NH to examine the difference in price elasticity between
Lent and non-Lent periods. The main difference in our analysis is that we use data on all products in
the analysis, while NH restrict the sample to include only the top 30 UPCs and thus automatically
drop products with small/zero sales. There are two main questions we seek to address: a) Does the selection of UPCs with only positive shares significantly bias the estimates of price elasticity? and b) Does the difference in price elasticities between the Lent and non-Lent periods persist after properly controlling for zeroes?
To make the comparison clear, we use largely the same specification of the model used in NH.
In particular we consider a logit specification
uijt = αpjt + βxjt + ξjt + εijt,
where the control variables xjt consist of UPC fixed effects and a time trend.14 Thus the week
to week variation in the product-/market-level unobserved demand shock ξjt largely captures the
short-term promotional efforts, e.g., in-store advertising and shelving choices, because the UPC
fixed effects control the intrinsic product quality that is likely to be stable over short time horizon.
Because stores are likely to advertise or shelve the product more prominently during weeks when the product is on a price sale, we expect a negative correlation between price and the unobservable. We construct instruments for price by inverting DFF's data on gross margin to calculate
the chain’s wholesale costs, which is the standard price instrument in the literature that has studied
the DFF data.15
We implement our bound estimator defined by (4.17) to obtain point estimates of (α, β) in the model. The 95% confidence intervals for the parameters are obtained using a standard bootstrap procedure.^16
14 Empirical market shares are constructed using quantity sales and the number of people who visited the store that week (the customer count) as the relevant market size.
15 The gross margin is defined as (retail price − wholesale cost)/retail price, so we recover the wholesale cost as retail price × (1 − gross margin). The instrument is defensible in the store-disaggregated context we consider here because it has been shown that price sales in retail price primarily reflect a reduction in retailer margins rather than a reduction in marginal costs (see, e.g., Chevalier, Kashyap, and Rossi (2003) and Hosken and Reiffen (2004)). Thus sales (and hence promotions) are not being driven by the manufacturer through temporary reductions in marginal costs.
16 The procedure contains the following steps: 1) draw with replacement a bootstrap sample of markets, denoted t_1, ..., t_T; 2) compute the bound estimator θ̂^{BD*}_T using the bootstrap sample; 3) repeat 1)–2) B_T times to obtain B_T independent (conditional on the original sample) copies of θ̂^{BD*}_T; 4) let q*_T(τ) be the τ-th quantile of the B_T copies of (θ̂^{BD*}_T − θ̂^{BD}_T); then the 95% bootstrap confidence interval is [θ̂^{BD}_T − q*_T(.975), θ̂^{BD}_T − q*_T(.025)].
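The bootstrap in the footnote can be sketched as follows, with `estimate_fn` a stand-in for the bound estimator (not re-implemented here):

```python
import numpy as np

rng = np.random.default_rng(2)

def bootstrap_ci(estimate_fn, markets, theta_hat, B=200, level=0.95):
    """Market-level bootstrap confidence interval of footnote 16.

    estimate_fn : maps a list of market-level data to an estimate (a
                  stand-in here for the bound estimator)
    theta_hat   : estimate on the original sample
    """
    T = len(markets)
    deltas = []
    for _ in range(B):
        idx = rng.integers(0, T, size=T)  # resample markets with replacement
        deltas.append(estimate_fn([markets[i] for i in idx]) - theta_hat)
    q_lo, q_hi = np.quantile(deltas, [(1 - level) / 2, (1 + level) / 2])
    return theta_hat - q_hi, theta_hat - q_lo

markets = [float(i) for i in range(10)]  # toy "markets"
mean = lambda ms: sum(ms) / len(ms)      # toy estimator for illustration
ci = bootstrap_ci(mean, markets, mean(markets))
```

Resampling whole markets (rather than individual products) preserves the within-market dependence of the observations.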
The estimation results are presented in Tables 4 and 5.^17 Table 4 shows that the standard logit estimator that inverts empirical shares to recover mean utilities (and hence drops zeroes) has a significant selection bias toward zero. The UPC-level elasticities for the logit model are small in economic magnitude, with the average elasticity in the data being -.572. Furthermore, over 90% of products have inelastic demand. Using our bounds approach instead to control for zeroes has a major effect on the estimated elasticities. The average demand elasticity for UPCs becomes -1.362, and less than 35% of observations have inelastic demand. This change in the magnitude of the elasticities is consistent with the attenuation bias effects of dropping products with small/zero market shares.
Table 4: Demand Estimation Results

                                      BLP             Bound
Price Coefficient                    -.390            -.910
95% CI                           [-.40, -.38]     [-1.06, -.81]
Ave. Own Price Elasticity            -.572           -1.362
Fraction of Inelastic Products      90.04%           33.79%

Table 5: Demand Estimation Results, Lent vs. Non-Lent

                                        BLP                   Bound
                                  Lent     Non-Lent     Lent     Non-Lent
Ave. Own Price Elasticity        -.757      -.544      -1.09      -1.302
Fraction of Inelastic Products  84.02%     92.84%     43.65%      35.00%
No. of Obs.                     70,496    792,187     78,838     880,493
Our second result is that we do not find evidence that demand becomes more elastic in the high demand period, as shown in Table 5. The standard logit estimator with zeroes dropped yields findings consistent with Nevo and Hatzitaskos (2006): demand appears more elastic in the high demand Lent period. In contrast, this effect disappears, and marginally changes sign, under our bounds estimator that controls for the zeroes. Thus we do not see evidence in our estimation of price elasticity being higher during the high demand period.
This finding can be rationalized if the magnitude of the selection problem induced by dropping zeroes differs across the two periods. Such a change in the distribution of the unobservable ξ_jt in the Lent period is indeed consistent with several features of the data. To see this, first recall the main reduced-form fact documented by Nevo and Hatzitaskos (2006) that suggested
17 In principle we can estimate our model separately for each store, letting preferences vary freely across stores depending on local tastes. These results are available upon request. Here we present results for demand pooling all stores together, as was done by Nevo and Hatzitaskos (2006). The store-level regression results are very similar to the pooled regression, and the latter is a more concise summary of demand behavior.
a change in price sensitivity in the Lent period. We replicate this reduced form finding in Table
6, which shows that although the price index of tuna during Lent appears to be approximately 15
percent less expensive than other weeks (as previously underscored by CKR), the average price of
tuna is virtually unchanged between the Lent versus non-Lent period. Hence it is a re-allocation of
demand towards less expensive products during Lent that drives the change in the aggregate price
index.
Table 6: Regression of Price Index on Lent

           P (Price Index)    P̄ (Average Price)
Lent           -.150               -.009
s.e.          (.0005)             (.0003)
We take this decomposition one step further than NH, and examine the price index separately
for products “on sale” and “regularly priced” during these periods.18 As can be seen in Table 7,
it is the sales price index that is the key driver of the aggregate price index being cheaper during
Lent. However the average price of an “on-sale” product is not cheaper in the Lent period. This
shows that it is a re-allocation towards more steeply discounted "on-sale" products during Lent that is driving this change in the aggregate price index. But we do not see a corresponding reallocation for "regularly priced" products.
Table 7: Regression of Sales Price Index on Lent
This suggests a tighter coordination of promotional effort and discounting in the high demand period. In effect, more steeply discounted products receive greater promotional effort from the retailer during the high demand period, which is closer in spirit to the loss-leader hypothesis originally advanced for this data by CKR. Because promotional effort in the model is largely
captured through the unobservable ξjt, this change in behavior of the unobservable would also
account for the selection effect due to dropping zeroes changing across the two periods. This
hypothesis is also consistent with our estimated model: the correlation between pjt and ξjt among
products that are flagged as being on sale (having at least a 5% reduction from highest price of
previous 3 weeks) increases from -.16 to -.24 between the Non-Lent and Lent periods.
18 We flag an observation in the data as being on sale if that particular UPC in that particular store in that particular week has at least a 5% reduction from the highest price of the previous 3 weeks.
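The sale flag can be computed from a UPC-store price series with a rolling three-week lookback; a sketch (the convention that the first `window` weeks are never flagged is our assumption):

```python
import numpy as np

def flag_sales(prices, window=3, cut=0.05):
    """Flag week w as on sale if its price is at least `cut` (5%) below the
    highest price of the previous `window` weeks.  The first `window`
    weeks have no full lookback and are never flagged (our convention)."""
    prices = np.asarray(prices, dtype=float)
    flags = np.zeros(len(prices), dtype=bool)
    for w in range(window, len(prices)):
        ref = prices[w - window:w].max()
        flags[w] = prices[w] <= (1.0 - cut) * ref
    return flags

flags = flag_sales([1.00, 1.00, 1.00, 0.89, 1.00, 0.99])
```

Only the 11% cut in week 4 is flagged; the 1% dip in the final week falls short of the 5% threshold.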
7 Conclusion
We have shown that differentiated product demand models have enough content to construct a
system of moment inequalities that can be used to consistently estimate demand parameters despite
a possibly large presence of observations with zero market shares in the data. We construct a GMM-
type estimator based on these moment inequalities that is consistent and asymptotically normal
under assumptions that are a reasonable approximation to the DGP in many product differentiated
environments. Our application to scanner data reveals that taking the market zeroes in the data
into account has economically important implications for price elasticities.
A key message from our analysis is that it is critical not to ignore the zero shares when estimating discrete choice models with disaggregated market data. A potentially fruitful area for future research is the application of our approach to individual-level choice data, such as a household panel. Aggregating over households is still necessary to control for price endogeneity, as described by Berry, Levinsohn, and Pakes (2004) and Goolsbee and Petrin (2004), and thus zero market shares arising when we aggregate over a limited sample of households are a clear problem in many contexts. Nevertheless, the demographic richness in the household panel provides additional identifying power for random coefficients. The approach we describe can offer a novel solution to the joint problem of endogenous prices and flexible consumer heterogeneity with micro data, which we plan to pursue in future work.
A Further Illustrations of Zipf’s Law
In Figure 3 we illustrate this regularity using data from the two other applications that were
mentioned in Section 2: homicide rates and international trade flows. The left hand graph shows
the annual murder rate (per 10,000 people) for each county in the US from 1977-1992 (for details
about the data see Dezhbakhsh, Rubin, and Shepherd (2003)). The right hand side graph shows the
import trade flows (measured in millions of US dollars) among 160 countries that have a regional
trade agreement in the year 2006 (for details about the data see Head, Mayer, et al. (2013)). In each
of these two cases we see the characteristic pattern of Zipf’s law - a sharp decay in the frequency
for large outcomes and a large mass near zero (with a mode at zero in each case).
Figure 3: Zipf’s Law in Crime and Trade Data
B Proofs of Lemma 1 and Theorem 1
B.1 Proof of Lemma 1
Proof of Lemma 1. First consider the derivation:

\[
\begin{aligned}
E\left[\left.\ln\left(\frac{\bar{s}_{jt}+\eta}{\bar{s}_{0t}-\eta}\right)\right|\pi_t,x_t\right]
&= E\left[\left.\ln\left(\frac{n_t s_{jt}+1}{n_t+J_t+1}+\eta\right)\right|\pi_t,x_t\right]
 - E\left[\left.\ln\left(\frac{n_t s_{0t}+1}{n_t+J_t+1}-\eta\right)\right|\pi_t,x_t\right]\\
&\geq \ln\left(\frac{1}{n_t+J_t+1}+\eta\right)
 - E\left[\left.\ln\left(\frac{n_t s_{0t}+1}{n_t+J_t+1}-\eta\right)\right|\pi_t,x_t\right]\\
&\geq \ln\left(\frac{1}{n_t+J_t+1}+\eta\right)
 - \ln\left(\frac{n_t+1}{n_t+J_t+1}-\eta\right)\Pr(n_t s_{0t}\geq 1\,|\,\pi_t)
 - \ln\left(\frac{1}{n_t+J_t+1}-\eta\right)\Pr(n_t s_{0t}=0\,|\,\pi_t)\\
&\geq \ln\left(\frac{1}{n_t+J_t+1}+\eta\right)
 - \ln\left(\frac{n_t+1}{n_t+J_t+1}-\eta\right)
 - \ln\left(\frac{1}{n_t+J_t+1}-\eta\right)(1-\pi_{0t})^{n_t}\\
&\geq \ln\left(\frac{1+\eta(n_t+J_t+1)}{n_t+1+\eta(n_t+J_t+1)}\right)
 - \ln\left(\frac{1}{n_t+J_t+1}-\eta\right)(1-\pi_{0t})^{n_t},
\end{aligned}
\tag{B.1}
\]

where the first inequality holds because n_t s_jt ≥ 0, the second inequality holds because n_t s_0t ≤ n_t, the third inequality holds by Pr(n_t s_0t ≥ 1 | π_t) ≤ 1 and Assumption 1, and the last inequality holds because n_t + 1 − η(n_t + J_t + 1) ≤ n_t + 1 + η(n_t + J_t + 1). As η approaches 1/(n_t + J_t + 1) from below, the right-hand side diverges to positive infinity. Therefore, for any finite (π_t, x_t)-measurable quantity, there exists an η_t ∈ (0, 1/(n_t + J_t + 1)) such that E[ln((s̄_jt + η)/(s̄_0t − η)) | π_t, x_t] is greater than this quantity when η = η_t.
Next, define the ε-shrinkage of the J-dimensional simplex as

\[ \Delta_J^\varepsilon = \Big\{ (p_1, \ldots, p_J) \in (0,1)^J : p_j \geq \varepsilon,\ 1 - \sum_{j=1}^J p_j \geq \varepsilon \Big\}. \]

By the definition of the Laplace share, Δ_jt(s̄_t, x_t, λ_0) lies in the interval

\[ \Big[ \min_{\pi \in \Delta_{J_t}^{1/(n_t+J_t+1)}} \Delta_{jt}(\pi, x_t, \lambda_0),\ \max_{\pi \in \Delta_{J_t}^{1/(n_t+J_t+1)}} \Delta_{jt}(\pi, x_t, \lambda_0) \Big]. \tag{B.2} \]

The interval is well-defined and finite by Assumption 2. Similarly, δ_jt(λ_0) is finite. Therefore, there exists η_t such that

\[ E\left[\left.\ln\left(\frac{\bar{s}_{jt}+\eta_t}{\bar{s}_{0t}-\eta_t}\right)\right|\pi_t,x_t\right]
 \geq -\min_{\pi \in \Delta_{J_t}^{1/(n_t+J_t+1)}} \Delta_{jt}(\pi, x_t, \lambda_0) + \delta_{jt}(\lambda_0)
 \geq -E[\Delta_{jt}(\bar{s}_t, x_t, \lambda_0) \,|\, \pi_t, x_t] + \delta_{jt}(\lambda_0). \tag{B.3} \]

This shows that E[δ^u_jt(λ_0) | π_t, x_t] ≥ δ_jt(λ_0), which implies that

\[ E[\delta_{jt}^u(\lambda_0) \,|\, z_{jt}] \geq E[\delta_{jt}(\lambda_0) \,|\, z_{jt}]. \tag{B.4} \]

This proves the upper bound part of (4.8). The lower bound part is analogous and thus omitted.
B.2 Proof of Theorem 1
Next, we prove Theorem 1. To do so, we first present three lemmas; their proofs are given after that of Theorem 1. Consider the subset of the instrumental function collection:
where the first equality holds by rearranging terms, the inequality holds by the fact that |[a]_− − [b]_−| ≤ |a − b| for any a, b ∈ R and by the Cauchy–Schwarz inequality, the second equality holds by Assumptions C.1 and C.3 and Theorem 1, the third equality holds by a mean-value expansion, with θ̄_T being a point on the line segment joining θ̂^{BD}_T and θ_0, and the last equality holds by Assumption C.2 and Theorem 1. Similarly, we can show that ∑_{g∈G_0}
Online Appendix to "Estimating Demand for Differentiated Products with Zeroes in Market Share Data"
In this online appendix, we introduce the profiling approach for models defined by many moment
inequalities. The profiling approach developed here is similar to the penalized resampling approach
in Bugni, Canay, and Shi (2016) for unconditional moment inequality models. Section D describes
the profiling approach and gives the formal results, and Section E presents the proofs of those
results.
D The Profiling Approach
The profiling approach applies to general moment inequality models with many moment inequalities. Thus from this point on, we focus on the moment inequality model:
$$E\rho(w_t, \theta, g) \ge 0 \quad \text{for all } g \in G, \tag{D.1}$$
where $\rho$ takes values in $\mathbb{R}^k$. We also let $G$ be a general set of indices that can be either countable or uncountable. Let $\mu : G \to [0,1]$ denote a probability density on $G$. We assume the data $\{w_t\}_{t=1}^T$ are i.i.d. across $t$.
We assume that there is a parameter of interest, $\gamma_0$, that is related to $\theta_0$ through:
$$\gamma_0 \in \Gamma(\theta_0) \subseteq \mathbb{R}^{d_\gamma}, \tag{D.2}$$
where $\Gamma : \Theta \to 2^{\mathbb{R}^{d_\gamma}}$ is a known mapping and $2^{\mathbb{R}^{d_\gamma}}$ denotes the collection of all subsets of $\mathbb{R}^{d_\gamma}$. Three examples of $\Gamma$ are given below:
Example. $\Gamma(\theta) = \{\alpha\}$: $\gamma_0$ is the price coefficient $\alpha_0$. In the simple logit model, the price coefficient is all one needs to know to compute the demand elasticity.
Example. $\Gamma(\theta) = \{e_j(p, \pi, \theta, x)\}$, where $e_j(p, \pi, \theta, x) = (\alpha p_j/\pi_j)(\partial\sigma_j(\sigma^{-1}(\pi, x, \lambda), x, \lambda)/\partial\delta_j)$: $\gamma_0$ is the own-price demand elasticity of product $j$ at a given value of the price vector $p$, the choice probability vector $\pi$, and the covariates $x$.
Example. $\Gamma(\theta) = \{e_j(p, \pi, \theta, x) : \pi \in [\pi_l, \pi_u]\}$: $\gamma_0$ is the demand elasticity of product $j$ at a given value of the price vector $p$ and the covariates $x$, and at a choice probability vector that is known to lie between $\pi_l$ and $\pi_u$. This example is particularly useful when the elasticity depends on the choice probability but the choice probability is only known to lie in an interval.
Let $\Gamma_0$ be the identified set of $\gamma_0$: $\Gamma_0 = \{\gamma \in \mathbb{R}^{d_\gamma} : \exists \theta \in \Theta_0 \text{ s.t. } \Gamma(\theta) \ni \gamma\}$, where $\Theta_0 = \{\theta \in \Theta : E\rho(w_t, \theta, g) \ge 0\ \forall g \in G\}$. The profiling approach constructs a confidence set for $\gamma_0$ by inverting a test of the hypothesis:
$$H_0 : \gamma \in \Gamma_0, \tag{D.3}$$
for each parameter value $\gamma$. The confidence set is the collection of values that are not rejected by the test.
Let $\Gamma^{-1}(\gamma) = \{\theta \in \Theta : \Gamma(\theta) \ni \gamma\}$. The test to be inverted uses the profiled test statistic:
$$T_T(\gamma) = T \times \min_{\theta \in \Gamma^{-1}(\gamma)} Q_T(\theta), \tag{D.4}$$
where $Q_T(\theta)$ is an empirical measure of the violation of the moment inequalities. The confidence set of confidence level $p$ is the set of all points for which the test statistic does not exceed a critical value $c_T(\gamma, p)$:
$$CS_T = \{\gamma \in \mathbb{R}^{d_\gamma} : T_T(\gamma) \le c_T(\gamma, p)\}. \tag{D.5}$$
Notice that the new confidence set only involves computing a $d_\gamma$-dimensional level set, where $d_\gamma$ is often 1. The profiling transfers the burden of searching (for low values) over the surface of the nonsmooth function $T_T(\theta) - c_T(\theta)$ to searching over the surface of the typically smooth and often convex function $Q_T(\theta)$.
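As an illustration, the test inversion in (D.4)-(D.5) can be sketched as follows. This is a minimal sketch, not the paper's code: the grid approximation of $\Gamma^{-1}(\gamma)$ and the inputs `Q_T` and `c_T` are hypothetical user-supplied objects.

```python
def profiled_confidence_set(gamma_grid, Gamma_inv, Q_T, c_T, T):
    """Invert the profiled test over a grid of candidate gamma values.

    gamma_grid: candidate values for the low-dimensional parameter gamma
    Gamma_inv : gamma -> finite set of theta values approximating Gamma^{-1}(gamma)
    Q_T       : theta -> empirical measure of moment-inequality violation
    c_T       : gamma -> critical value c_T(gamma, p)
    T         : number of markets (sample size)
    """
    cs = []
    for gamma in gamma_grid:
        # Profiled statistic (D.4): T times the criterion minimized over Gamma^{-1}(gamma)
        stat = T * min(Q_T(theta) for theta in Gamma_inv(gamma))
        # (D.5): keep gamma if the statistic does not exceed the critical value
        if stat <= c_T(gamma):
            cs.append(gamma)
    return cs
```

When $\Gamma^{-1}(\gamma)$ is a singleton, this reduces to inverting an ordinary specification test pointwise in $\gamma$.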
We choose a critical value, $c_T(\gamma, p)$, of significance level $1-p \in (0, 0.5)$, to satisfy
$$\limsup_{T\to\infty} \sup_{(\gamma,F)\in H_0} \Pr_F\big(T_T(\gamma) > c_T(\gamma, p)\big) \le 1-p, \tag{D.6}$$
where $F$ is the distribution of $(w_t)_{t=1}^T$ and $H_0$ is the null parameter space of $(\gamma, F)$. The definition of $H_0$ along with other technical assumptions are given in Section D.4.$^{19}$
As a result of (D.6), the confidence set asymptotically has the correct minimum coverage probability:
$$\liminf_{T\to\infty} \inf_{(\gamma,F)\in H_0} \Pr_F(\gamma \in CS_T) \ge p. \tag{D.7}$$
The left hand side is called the “asymptotic size” of the confidence set in Andrews and Shi (2013).
We achieve the asymptotic size control by deriving an asymptotic approximation for the distribution
of the profiled test statistic TT (γ) that is uniformly valid over (γ, F ) ∈ H0 and simulating the
critical value from the approximating distribution through either a subsampling or a bootstrapping
procedure.
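The step from (D.6) to (D.7) is the usual test-inversion argument: for any $(\gamma, F) \in H_0$,
$$\Pr_F(\gamma \in CS_T) = \Pr_F\big(T_T(\gamma) \le c_T(\gamma, p)\big) = 1 - \Pr_F\big(T_T(\gamma) > c_T(\gamma, p)\big),$$
and taking $\liminf_{T\to\infty}\inf_{(\gamma,F)\in H_0}$ of the left hand side and applying (D.6) to the right hand side yields (D.7).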
In the next subsections, we describe the test statistic and the critical value in detail and show
that (D.7) holds.
$^{19}$ Note that we use $F$ to denote the distribution of the full observed data vector, and thus $(\gamma, F)$ captures everything unknown in the expression $\Pr_F(T_T(\gamma) > c_T(\gamma, p))$. This notation differs from the traditional literature, where the true distribution of the data is often indicated by the true value of $\theta$, but is standard in the recent partial identification literature. See Romano and Shaikh (2008) and Andrews and Shi (2013).
D.1 Test Statistic
The test statistic is the QLR statistic (i.e., a criterion-function-based statistic)$^{20}$
$$T_T(\gamma) = T \times \min_{\theta \in \Gamma^{-1}(\gamma)} Q_T(\theta) \quad\text{with}\quad Q_T(\theta) = \int_{G_T} S\big(\bar\rho_T(\theta, g), \hat\Sigma^\iota_T(\theta, g)\big)\, d\mu(g), \tag{D.8}$$
where $G_T$ is a truncated/simulated version of $G$ such that $G_T \uparrow G$ as $T \to \infty$, $\mu(\cdot)$ is a probability measure on $G$, $S(m, \Sigma)$ is a real-valued function that measures the discrepancy of $m$ from the inequality restriction $m \ge 0$, and
$$\bar\rho_T(\theta, g) = T^{-1}\sum_{t=1}^T \rho(w_t, \theta, g), \qquad \hat\Sigma^\iota_T(\theta, g) = \hat\Sigma_T(\theta, g) + \iota \times \hat\Sigma_T(\theta, 1),$$
$$\hat\Sigma_T(\theta, g) = T^{-1}\sum_{t=1}^T \rho(w_t, \theta, g)\rho(w_t, \theta, g)' - \bar\rho_T(\theta, g)\bar\rho_T(\theta, g)'. \tag{D.9}$$
In the above definition, $\iota$ is a small positive number. It is used because, for some forms of $S$ defined in Section D.4, the inverses of the diagonal elements of $\hat\Sigma^\iota_T(\theta, g)$ enter, and the $\iota$ prevents us from taking inverses of zeros. For other forms of $S$, e.g., the one defined below and used in the simulation and empirical sections of this paper, $\iota$ does not enter the test statistic because $S(m, \Sigma)$ does not depend on $\Sigma$.
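For a finite set of instrumental functions, the criterion in (D.8) with the unweighted $S$ of (D.11) (so that the $\Sigma$ argument drops out) can be sketched as below; the function names and data layout are illustrative assumptions, not the paper's code.

```python
def Q_T(theta, w, rho, G_T, mu):
    """Empirical criterion (D.8) with S(m) = sum_j [m_j]_-^2 as in (D.11).

    w   : list of market-level observations w_t, t = 1, ..., T
    rho : (w_t, theta, g) -> moment vector (list of floats) in R^k
    G_T : finite (truncated) list of instrumental functions g
    mu  : g -> weight of a probability mass function on G_T
    """
    total = 0.0
    for g in G_T:
        # Coordinate-wise sample mean of the moment vectors: rho_bar_T(theta, g)
        moments = [rho(wt, theta, g) for wt in w]
        rho_bar = [sum(col) / len(moments) for col in zip(*moments)]
        # S of (D.11): only negative coordinates (inequality violations) are penalized
        total += mu(g) * sum(min(x, 0.0) ** 2 for x in rho_bar)
    return total
```

Since only negative coordinates of the sample moments are penalized, $Q_T(\theta) = 0$ whenever every sample moment inequality holds at $\theta$.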
Section D.4 gives the assumptions that the user-chosen quantities $S$, $\mu$, $G$ and $G_T$ should satisfy. Under those assumptions, we can show that $\min_{\theta\in\Gamma^{-1}(\gamma)} Q_T(\theta)$ consistently estimates its population counterpart $\min_{\theta\in\Gamma^{-1}(\gamma)} Q_F(\theta)$.
The symbols “EF ” and “CovF ” denote expectation and covariance under the data distribution F
respectively. Notice that Γ0 depends on F . We make this explicit by changing the notation Γ0 to
Γ0,F for the rest of this paper.
We can also show that $\min_{\theta\in\Gamma^{-1}(\gamma)} Q_F(\theta) = 0$ if and only if $\gamma \in \Gamma_{0,F}$. This result, combined with the consistency of $\min_{\theta\in\Gamma^{-1}(\gamma)} Q_T(\theta)$, implies that $T_T(\gamma)$ diverges to infinity at $\gamma \notin \Gamma_{0,F}$. That
$^{20}$ Note that we do not follow the traditional QLR test exactly, which would define $T_T(\gamma) = T \times \min_{\theta\in\Gamma^{-1}(\gamma)} Q_T(\theta) - T \times \min_{\theta\in\Theta} Q_T(\theta)$. This is because the validity of our critical value depends on certain monotonicity properties of the asymptotic approximation of the test statistic, and the monotonicity does not hold with this alternative test statistic due to the subtraction of $T \times \min_{\theta\in\Theta} Q_T(\theta)$.
implies that there is no information loss in using such a test statistic. Lemma D.1 summarizes those two results. The parameter space $H$ of $(\gamma, F)$ appearing in the lemma is defined in Assumption D.2 in Section D.4.
Lemma D.1. Suppose that Assumptions D.1, D.2, D.4, D.5(a), and D.6(a) and (d) hold. Then for any $(\gamma, F) \in H$,
(a) $\min_{\theta\in\Gamma^{-1}(\gamma)} Q_T(\theta) \to_p \min_{\theta\in\Gamma^{-1}(\gamma)} Q_F(\theta)$ under $F$, and
(b) $\min_{\theta\in\Gamma^{-1}(\gamma)} Q_F(\theta) \ge 0$, with equality if and only if $\gamma \in \Gamma_{0,F}$.
In the simulation and the empirical application of this paper, the following choices of $S$, $G$, $G_T$ and $\mu$ are used, mainly for computational convenience. For $G$, we use the one defined in (4.19). For $G_T$, the truncated version of $G$, we define it to be the same as $G$ except that we let $r$ run from $r_0$ to $r_T$, where $r_T \to \infty$ as $T \to \infty$, in the definition.
For $S$, we use
$$S(m, \Sigma) = \sum_{j=1}^{k} [m_j]_-^2, \tag{D.11}$$
where $m_j$ is the $j$th coordinate of $m$ and $[x]_- = |\min\{x, 0\}|$. There may be efficiency loss from not weighting the moments using the variance matrix, but this $S$ function brings great computational convenience because it makes the minimization problem in (D.4) a convex one. For $\mu(\cdot)$, we use
$$\mu(g_{a,r,\zeta}) \propto (100 + r)^{-2}(2r)^{-d_{z_c}} K_d^{-1} \quad \text{for } g \in G_{d,cc}, \tag{D.12}$$
where $K_d$ is the number of elements in $Z_d$. The same $\mu$ measure is used and seems to work well in Andrews and Shi (2013).
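Concretely, the $S$ of (D.11) can be written as below (a small sketch; the $\Sigma$ argument is accepted but unused, matching the statement that this $S$ does not depend on $\Sigma$):

```python
def neg_part(x):
    """[x]_- = |min{x, 0}|: the magnitude of the negative part of x."""
    return abs(min(x, 0.0))

def S(m, Sigma=None):
    """Discrepancy function of (D.11): sum of squared negative parts.

    Sigma is accepted but ignored, since this form of S does not depend on it.
    """
    return sum(neg_part(mj) ** 2 for mj in m)
```

$S$ is zero exactly when every coordinate satisfies $m_j \ge 0$, and it is convex in $m$, which is what makes the minimization in (D.4) convex when the moments are linear in $\theta$.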
D.2 Critical Value
We propose two types of critical values, one based on standard subsampling and the other based
on a bootstrapping procedure with moment shrinking. Both are simple to compute. The bootstrap
critical value may have better small-sample properties, and is the procedure we use in the empirical section.$^{21}$ It is worth noting that we resample at the market level for both the subsampling and the bootstrap.
Let us formally define the subsampling critical value first. It is obtained through the standard subsampling steps: [1] from $\{1, \dots, T\}$, draw without replacement a subsample of market indices of size $b_T$; [2] compute $T_{T,b_T}(\gamma)$ in the same way as $T_T(\gamma)$, except using the subsample of markets corresponding to the indices drawn in [1] rather than the original sample; [3] repeat [1]-[2] $S_T$ times to obtain $S_T$ independent (conditional on the original sample) copies of $T_{T,b_T}(\gamma)$; [4] let $c^*_{sub}(\gamma, p)$ be the $p$ quantile of the $S_T$ independent copies. Let the subsampling critical value be
$$c^{sub}_T(\gamma, p) = c^*_{sub}(\gamma, p + \eta^*) + \eta^*, \tag{D.13}$$
$^{21}$ The bootstrap procedure here, like in most problems with partial identification, does not lead to higher-order improvement.
where $\eta^* > 0$ is an infinitesimal number. The infinitesimal number is used to avoid making hard-to-verify uniform continuity and strict monotonicity assumptions on the distribution of the test statistic. It can be set to zero if one is willing to make the continuity assumptions. Such infinitesimal numbers are also employed in Andrews and Shi (2013). One can follow their suggestion of using $\eta^* = 10^{-6}$.
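Steps [1]-[4] and (D.13) can be sketched as below. This is an illustrative sketch, assuming the user supplies a routine that recomputes the profiled statistic on a subsample of markets; the empirical quantile convention used here is one of several reasonable choices, not necessarily the paper's.

```python
import random

def subsampling_critical_value(T_stat_on, market_ids, b_T, S_T, p, eta=1e-6, seed=0):
    """Subsampling critical value of (D.13), resampling at the market level.

    T_stat_on : (list of market ids) -> profiled statistic T_{T,b_T}(gamma)
                computed on those markets (hypothetical user-supplied routine)
    market_ids: the T market indices {1, ..., T}
    b_T       : subsample size; S_T: number of subsampling draws
    p         : nominal level; eta plays the role of the infinitesimal eta*
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(S_T):
        sub = rng.sample(market_ids, b_T)   # [1] draw b_T markets without replacement
        draws.append(T_stat_on(sub))        # [2]-[3] recompute the statistic S_T times
    draws.sort()
    q = p + eta                             # [4] take the (p + eta*) quantile ...
    k = min(len(draws) - 1, max(0, int(q * len(draws)) - 1))
    return draws[k] + eta                   # ... and add eta* as in (D.13)
```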
Let us now define the bootstrap critical value. It is obtained through the following steps: [1] from the original sample $\{1, \dots, T\}$, draw with replacement a bootstrap sample of size $T$; denote the bootstrap sample by $\{t_1, \dots, t_T\}$; [2] let the bootstrap statistic be
(c) the class of functions $\{\rho(w_t, \theta, g) : (\theta, g) \in \Gamma^{-1}(\gamma) \times G\}$ is $F$-Donsker and pre-Gaussian uniformly over $H$;
(d) the class of functions $\{\rho(w_t, \theta, g)\rho(w_t, \theta, g)' : (\theta, g) \in \Gamma^{-1}(\gamma) \times G\}$ is Glivenko-Cantelli uniformly over $H$;
(e) $\rho_F(\theta, g)$ is differentiable with respect to $\theta \in \Theta$, and there exist constants $C$ and $\delta_1 > 0$ such that, for any $(\theta^{(1)}, \theta^{(2)})$, $\sup_{(\gamma,F)\in H,\, g\in G} \|\mathrm{vec}(G_F(\theta^{(1)}, g)) - \mathrm{vec}(G_F(\theta^{(2)}, g))\| \le C \times \|\theta^{(1)} - \theta^{(2)}\|^{\delta_1}$, and
(f) $\Sigma^\iota_F(\theta, g) \in \Psi$ for all $(\gamma, F) \in H$ and $\theta \in \Gamma^{-1}(\gamma)$, where $\Psi$ is a compact subset of $M$, and $\{\mathrm{vec}(\Sigma_F(\cdot, g^{(1)}, \cdot, g^{(2)})) : (\Gamma^{-1}(\gamma))^2 \to \mathbb{R}^{k^2} : (\gamma, F) \in H,\ g^{(1)}, g^{(2)} \in G\}$ are uniformly bounded and uniformly equicontinuous.
Remark. Part (a) is the i.i.d. assumption, which can be replaced with appropriate weak dependence conditions at the cost of a more complicated derivation of the uniform weak convergence of the bootstrap empirical process. Part (b) is a standard uniform Lindeberg condition. Parts (c)-(d) impose restrictions on the complexity of the set $G$ as well as on the shape of $\rho(w_t, \theta, g)$ as a function of $\theta$. A sufficient condition is that (i) $\rho(w_t, \theta, g)$ is Lipschitz continuous in $\theta$ with an integrable Lipschitz coefficient, and (ii) the set $C$ in the definition of $G$ forms a Vapnik-Cervonenkis class and $J_t$ is bounded. The Lipschitz continuity is also a sufficient condition for part (f).
The following assumption defines the null parameter space, $H_0$, for the pair $(\gamma, F)$.
Assumption D.3. The null parameter space $H_0$ is a subset of $H$ that satisfies:
(a) for every $(\gamma, F) \in H_0$, $\gamma \in \Gamma_{0,F}$, and
(b) there exist $C, c > 0$ and $2 \le \delta_2 < 2(\delta_1 + 1)$ such that $Q_F(\theta) \ge C \cdot \big(d(\theta, \Theta_{0,F}(\gamma))^{\delta_2} \wedge c\big)$ for all $(\gamma, F) \in H_0$ and $\theta \in \Gamma^{-1}(\gamma)$.
Remark. Part (b) is an identification strength assumption. It requires the criterion function to increase at a certain minimum rate as $\theta$ is perturbed away from the identified set. This assumption is weaker than the quadratic minorant assumption in Chernozhukov, Hong, and Tamer (2007) if $\delta_2 > 2$ and as strong as the latter if $\delta_2 = 2$. Putting part (b) and Assumption D.2(e) together, we can see that there is a trade-off between the minimum identification strength required and the degree of Hölder continuity of the first derivative of $\rho_F(\cdot, g)$. If $\rho_F(\cdot, g)$ is linear, $\delta_2$ can be arbitrarily large; that is, the criterion function can increase very slowly as $\theta$ is perturbed away from the identified set.
The following assumption is on the measure $\mu$. For any $\theta$, let a pseudo-metric on $G$ be: $\|g^{(1)} - g^{(2)}\|_{\theta,F} = \|\rho_{F,j}(\theta, g^{(1)}) - \rho_{F,j}(\theta, g^{(2)})\|$. This assumption is needed for Lemma D.1 and not needed for the asymptotic size result, Theorem D.1.
Assumption D.4. For any $\theta \in \Theta$, $\mu(\cdot)$ has full support on the metric space $(G, \|\cdot\|_{\theta,F})$.
Remark. Assumption D.4 implies that for any $\theta \in \Theta$, $F$ and $j$, if $\rho_{F,j}(\theta, g_0) < 0$ for some $g_0 \in G$, then there exists a neighborhood $\mathcal{N}(g_0)$ with positive $\mu$-measure such that $\rho_{F,j}(\theta, g) < 0$ for all $g \in \mathcal{N}(g_0)$.
The following assumption is on the set $G_T$.
Assumption D.5. (a) $G_T \uparrow G$ as $T \to \infty$, and
(b) $\limsup_{T\to\infty} \sup_{(\gamma,F)\in H_0} \sup_{\theta\in\Gamma^{-1}(\gamma)} \int_{G\setminus G_T} S\big(\sqrt{T}\rho_F(\theta, g), \Sigma_F(\theta, g)\big)\, d\mu(g) = 0.$
The following assumptions are imposed on the function S. For a ξ > 0, let the ξ-expansion of
Now it is left to show that $T^{med}_T(\gamma; B_1, B_2)$ and $T_T(\gamma; B_1, B_2)$ are close. First, we have
$$|T_T(\gamma; B_1, B_2) - T^{med}_T(\gamma; B_1, B_2)|$$
$$\le \sup_{\theta\in\Theta_{0,F}(\gamma),\, \lambda\in\Lambda^{B_2}_T(\theta,\gamma)} \int_G \Big|S\big(\nu^{B_1}_T(\theta + \lambda/\sqrt{T}, g) + G_F(\theta_T, g)\lambda + \sqrt{T}\rho_F(\theta, g),\ \hat\Sigma^\iota_T(\theta + \lambda/\sqrt{T}, g)\big) - S\big(\nu^{B_1}_T(\theta, g) + G_F(\theta, g)\lambda + \sqrt{T}\rho_F(\theta, g),\ \Sigma^\iota_F(\theta, g)\big)\Big|\, d\mu(g)$$
$$\le C^2 \times \sup_{\theta\in\Theta_{0,F}(\gamma),\, \lambda\in\Lambda^{B_2}_T(\theta,\gamma)} \max_{g\in G} c(\Delta_T(\theta, \lambda, g)) \times \int_G \big(1 + M_T(\theta, \lambda, g)\big)\, d\mu(g), \tag{E.26}$$
where $c(x) = (x + \sqrt{x^2 + 8x})/2$, $C$ is the constant in (E.9),
$$\Delta_T(\theta, \lambda, g) = \|\nu^{B_1}_T(\theta + \lambda/\sqrt{T}, g) - \nu^{B_1}_T(\theta, g) + G_F(\theta_T, g)\lambda - G_F(\theta, g)\lambda\|^2 + \|\mathrm{vech}(\hat\Sigma_T(\theta + \lambda/\sqrt{T}, g) - \Sigma_F(\theta, g))\| \quad\text{and}$$
$$M_T(\theta, \lambda, g) = S\big(\nu^{B_1}_T(\theta, g) + G_F(\theta, g)\lambda + \sqrt{T}\rho_F(\theta, g),\ \Sigma^\iota_F(\theta, g)\big). \tag{E.27}$$
Below we show that for any $\varepsilon > 0$ and some universal constant $C > 0$,
$$\sup_{(\gamma,F)\in H_0} \Pr_F\Big(\sup_{\theta\in\Theta_{0,F}(\gamma),\, \lambda\in\Lambda^{B_2}_T(\theta,\gamma),\, g\in G} \Delta_T(\theta, \lambda, g) > \varepsilon\Big) \to 0 \quad\text{and} \tag{E.28}$$
$$\sup_T \sup_{(\gamma,F)\in H_0} \sup_{\theta\in\Theta_{0,F}(\gamma),\, \lambda\in\Lambda^{B_2}_T(\theta,\gamma)} \int_G M_T(\theta, \lambda, g)\, d\mu(g) < C. \tag{E.29}$$
Once (E.28) and (E.29) are shown, it is immediate that for any $\varepsilon > 0$,
$$\sup_{(\gamma,F)\in H_0} \Pr_F\big(|T_T(\gamma; B_1, B_2) - T^{med}_T(\gamma; B_1, B_2)| > \varepsilon\big) \to 0. \tag{E.30}$$
This combined with (E.25) shows (E.6).
Now we show (E.28) and (E.29). The convergence result (E.28) is implied by the following results: for any $\varepsilon > 0$,
$$\sup_{(\gamma,F)\in H_0} \Pr_F\Big(\sup_{\theta\in\Theta_{0,F}(\gamma),\, \lambda\in\Lambda^{B_2}_T(\theta,\gamma),\, g\in G} \|\nu^{B_1}_T(\theta + \lambda/\sqrt{T}, g) - \nu^{B_1}_T(\theta, g)\| > \varepsilon\Big) \to 0,$$
$$\sup_{(\gamma,F)\in H_0} \sup_{\theta\in\Theta_{0,F}(\gamma),\, \lambda\in\Lambda^{B_2}_T(\theta,\gamma),\, g\in G} \|G_F(\theta_T, g)\lambda - G_F(\theta, g)\lambda\| \to 0, \quad\text{and}$$
$$\sup_{(\gamma,F)\in H_0} \Pr_F\Big(\sup_{\theta\in\Theta_{0,F}(\gamma),\, \lambda\in\Lambda^{B_2}_T(\theta,\gamma),\, g\in G} \|\mathrm{vech}(\hat\Sigma_T(\theta + \lambda/\sqrt{T}, g) - \Sigma_F(\theta, g))\| > \varepsilon\Big) \to 0. \tag{E.31}$$
The first result in the above display holds by the first result in equation (E.19) and the uniform stochastic equicontinuity of the empirical process $\nu_T(\cdot, g) : \Gamma^{-1}(\gamma) \to \mathbb{R}^{d_m}$ with respect to the Euclidean metric. The uniform equicontinuity is implied by Assumptions D.2(b), (c) and (f) by Theorem 2.8.2 of van der Vaart and Wellner (1996). The second result in the above display holds by the second result in (E.19). The third result in (E.31) holds by Assumptions D.2(d) and (f).
Result (E.29) holds because for any $\theta \in \Theta_{0,F}(\gamma)$ and $\lambda \in \Lambda^{B_2}_T(\theta, \gamma)$,
$$\int_G M_T(\theta, \lambda, g)\, d\mu(g) \le 2\int_G S\big(\nu^{B_1}_T(\theta, g), \Sigma^\iota_F(\theta, g)\big)\, d\mu(g) + 2\int_G S\big(G_F(\theta, g)\lambda + \sqrt{T}\rho_F(\theta, g), \Sigma^\iota_F(\theta, g)\big)\, d\mu(g)$$
$$\le \sup_{\Sigma\in\Psi} S(-B_1 1_k, \Sigma) + 2\int_G S\big(G_F(\theta, g)\lambda + \sqrt{T}\rho_F(\theta, g), \Sigma^\iota_F(\theta, g)\big)\, d\mu(g)$$
$$\le \sup_{\Sigma\in\Psi} S(-B_1 1_k, \Sigma) + 2B_2 + C^2(B_2 + 1) \times o(1), \tag{E.32}$$
where the first inequality holds by Assumption D.6(f), the second inequality holds by Assumption D.6(c), and the last inequality holds by the second and third inequalities in (E.22), with the $o(1)$ uniform over $(\theta, \lambda)$.
STEP 3. In order to show (E.7), first extend the definition of $T_T(\gamma; B_1, B_2)$ from Step 1 to allow $B_1$ and $B_2$ to take the value $\infty$, and observe that $T_T(\gamma; \infty, \infty) = T_T(\gamma)$.
Assumption D.2(c) and Lemma E.1 imply that for any $\varepsilon > 0$, there exists $B_{1,\varepsilon}$ large enough such that
$$\limsup_{T\to\infty} \sup_{(\gamma,F)\in H_0} \Pr_F\Big(\sup_{\theta\in\Theta,\, g\in G} \|\nu_T(\theta, g)\| > B_{1,\varepsilon}\Big) < \varepsilon. \tag{E.33}$$
Therefore we have, for all $B_2$,
$$\limsup_{T\to\infty} \sup_{(\gamma,F)\in H_0} \Pr_F\big(T_T(\gamma; \infty, B_2) \ne T_T(\gamma; B_{1,\varepsilon}, B_2)\big) < \varepsilon. \tag{E.34}$$
To show that $T_T(\gamma)$ and $T_T(\gamma; \infty, B_2)$ are close for $B_2$ large enough, first observe that:
$$T_T(\gamma) \le \sup_{\theta\in\Theta_{0,F}(\gamma)} \int_G S\big(\nu_T(\theta, g) + \sqrt{T}\rho_F(\theta, g),\ \hat\Sigma^\iota_T(\theta, g)\big)\, d\mu(g) \le \sup_{\theta\in\Theta_{0,F}(\gamma)} \int_G S\big(\nu_T(\theta, g),\ \hat\Sigma^\iota_T(\theta, g)\big)\, d\mu(g) = O_p(1), \tag{E.35}$$
where the first inequality holds because $0 \in \Lambda_T(\theta, \gamma)$, the second inequality holds because $\rho_F(\theta, g) \ge 0$ for $\theta \in \Theta_{0,F}(\gamma)$ and by Assumption D.6(c), and the equality holds by Assumptions D.6(a)-(c) and Assumptions D.2(c), (d) and (f). The $O_p(1)$ is uniform over $(\gamma, F) \in H_0$.
For any $T$, $\gamma$, $B_2$, if $T_T(\gamma) \ne T_T(\gamma; \infty, B_2)$, then there must be a $\theta^* \in \Gamma^{-1}(\gamma)$ such that $T \times Q_F(\theta^*) > B_2$ and
$$\int_G S\big(\nu_T(\theta^*, g) + \sqrt{T}\rho_F(\theta^*, g),\ \hat\Sigma^\iota_T(\theta^*, g)\big)\, d\mu(g) < O_p(1). \tag{E.36}$$
But
$$\int_G S\big(\nu_T(\theta^*, g) + \sqrt{T}\rho_F(\theta^*, g),\ \hat\Sigma^\iota_T(\theta^*, g)\big)\, d\mu(g)$$
$$\ge 2^{-1}\int_G S\big(\sqrt{T}\rho_F(\theta^*, g),\ \hat\Sigma^\iota_T(\theta^*, g)\big)\, d\mu(g) - \int_G S\big({-\nu_T(\theta^*, g)},\ \hat\Sigma^\iota_T(\theta^*, g)\big)\, d\mu(g)$$
$$\ge 2^{-1}\int_G S\big(\sqrt{T}\rho_F(\theta^*, g),\ \hat\Sigma^\iota_T(\theta^*, g)\big)\, d\mu(g) - O_p(1)$$
$$\ge 2^{-1}\Big[T Q_F(\theta^*) - \int_G \big|S\big(\sqrt{T}\rho_F(\theta^*, \cdot),\ \hat\Sigma^\iota_T(\theta^*, \cdot)\big) - S\big(\sqrt{T}\rho_F(\theta^*, \cdot),\ \Sigma^\iota_F(\theta^*, \cdot)\big)\big|\, d\mu\Big] - O_p(1)$$
$$\ge 2^{-1}\Big[T Q_F(\theta^*) - C^2 \sup_{g\in G} c\big(\|\mathrm{vech}(\hat\Sigma^\iota_T(\theta^*, g) - \Sigma^\iota_F(\theta^*, g))\|\big) \times \big(1 + T Q_F(\theta^*)\big)\Big] - O_p(1)$$
$$= B_2/2 - o(1) - o_p(1) \times C^2 \times B_2/4 - O_p(1), \tag{E.37}$$
where $c(x) = (x + \sqrt{x^2 + 8x})/2$ and $C$ is the constant in (E.9). The first inequality holds by Assumptions D.6(e)-(f), the second inequality holds by Assumption D.6(c) and Assumptions D.2(c)-(d) and (f), the third inequality holds by the triangle inequality, the fourth inequality holds by (E.9), and the equality holds by Assumption D.2(d). The $o(1)$, $o_p(1)$ and $O_p(1)$ terms are uniform over $\theta^* \in \Gamma^{-1}(\gamma)$ and $(\gamma, F) \in H_0$.
Then
$$\sup_{(\gamma,F)\in H_0} \Pr_F\big(T_T(\gamma) \ne T_T(\gamma; \infty, B_2)\big) \le \sup_{(\gamma,F)\in H_0} \Pr_F\big(2^{-1}(1 - o_p(1)) \times B_2 - o(1) - O_p(1) \le O_p(1)\big) = \sup_{(\gamma,F)\in H_0} \Pr_F\big(O_p(1) \ge B_2\big), \tag{E.38}$$
where the first inequality holds by (E.36) and (E.37). Then for any $\varepsilon$, there exists $B_{2,\varepsilon}$ such that
$$\lim_{T\to\infty} \sup_{(\gamma,F)\in H_0} \Pr_F\big(T_T(\gamma) \ne T_T(\gamma; \infty, B_{2,\varepsilon})\big) < \varepsilon. \tag{E.39}$$
Combining this with (E.34), we have (E.7).
STEP 4. In order to show (E.8), first extend the definition of $T^{appr}_T(\gamma; B_1, B_2)$ from Step 1 to allow $B_1$ and $B_2$ to take the value $\infty$, and observe that $T^{appr}_T(\gamma; \infty, \infty) = T^{appr}_T(\gamma)$.
By the same arguments as those for (E.34), for any $\varepsilon$ and $B_2$, there exists $B_{1,\varepsilon}$ large enough so that
$$\limsup_{T\to\infty} \sup_{(\gamma,F)\in H_0} \Pr_F\big(T^{appr}_T(\gamma; \infty, B_2) \ne T^{appr}_T(\gamma; B_{1,\varepsilon}, B_2)\big) < \varepsilon. \tag{E.40}$$
Also by the same reasons as those for (E.35), we have
$$T^{appr}_T(\gamma) \le \sup_{\theta\in\Theta_{0,F}(\gamma)} \int_G S\big(\nu_F(\theta, g), \Sigma^\iota_F(\theta, g)\big)\, d\mu(g), \tag{E.41}$$
where the right hand side is a real-valued random variable.
For any $T$ and $B_2$, if $T^{appr}_T(\gamma) \ne T^{appr}_T(\gamma; \infty, B_{2,\varepsilon})$, then there must be a $\theta^* \in \Theta_{0,F}(\gamma)$ and a $\lambda^{**} \in \{\lambda \in \Lambda_T(\theta^*, \gamma) : T \times Q_F(\theta^* + \lambda/\sqrt{T}) > B_2\}$ such that
$$I(\lambda^{**}) < \sup_{\theta\in\Theta_{0,F}(\gamma)} \int_G S\big(\nu_F(\theta, g), \Sigma^\iota_F(\theta, g)\big)\, d\mu(g), \tag{E.42}$$
where $I(\lambda) = \int_G S\big(\nu_F(\theta^*, g) + G_F(\theta^*, g)\lambda + \sqrt{T}\rho_F(\theta^*, g),\ \Sigma^\iota_F(\theta^*, g)\big)\, d\mu(g)$. Next we show that if $\lambda^{**}$ exists, then there must exist a $\lambda^*$ such that
$$\lambda^* \in \{\lambda \in \Lambda_T(\theta^*, \gamma) : T \times Q_F(\theta^* + \lambda/\sqrt{T}) \in (B_2, 2B_2]\} \quad\text{and}\quad I(\lambda^*) < \sup_{\theta\in\Theta_{0,F}(\gamma)} \int_G S\big(\nu_F(\theta, g), \Sigma^\iota_F(\theta, g)\big)\, d\mu(g). \tag{E.43}$$
If $T \times Q_F(\theta^* + \lambda^{**}/\sqrt{T}) \in (B_2, 2B_2]$, then we are done. If $T \times Q_F(\theta^* + \lambda^{**}/\sqrt{T}) > 2B_2$, there must be an $a^* \in (0, 1)$ such that $T \times Q_F(\theta^* + a^*\lambda^{**}/\sqrt{T}) \in (B_2, 2B_2]$, because $T Q_F(\theta^* + 0 \times \lambda^{**}/\sqrt{T}) = 0$ and $T Q_F(\theta^* + a\lambda^{**}/\sqrt{T})$ is continuous in $a$ (by Assumptions D.2(e) and D.6(a)). By Assumption D.6(f), $I(\lambda)$ is convex. Thus $I(a^*\lambda^{**}) \le a^* I(\lambda^{**}) + (1 - a^*)I(0)$. By the same arguments as those for (E.35), $I(0) \le \sup_{\theta\in\Theta_{0,F}(\gamma)} \int_G S\big(\nu_F(\theta, g), \Sigma^\iota_F(\theta, g)\big)\, d\mu(g)$. Thus, $I(a^*\lambda^{**}) < \sup_{\theta\in\Theta_{0,F}(\gamma)} \int_G S\big(\nu_F(\theta, g), \Sigma^\iota_F(\theta, g)\big)\, d\mu(g)$. Assumption D.1(c) and the definition of $\Lambda_T(\theta, \gamma)$ guar-