Harbingers of Failure...and Success

Chaoqun Chen
Assistant Professor, Marketing Department
Cox School of Business, Southern Methodist University
6214 Bishop Blvd., Dallas, TX 75275
[email protected]

Eric T. Anderson
Hartmarx Professor of Marketing, Marketing Department
Kellogg School of Management, Northwestern University
2001 Sheridan Rd., Evanston, IL 60611
[email protected]

Blakeley B. McShane
Associate Professor, Marketing Department
Kellogg School of Management, Northwestern University
2001 Sheridan Rd., Evanston, IL 60611
[email protected]

Author note: Correspondence concerning this manuscript should be addressed to Chaoqun Chen. Supplementary materials for this manuscript are available online. The order of authors other than the first was determined by alphabetical order.
Acknowledgements: NA
Financial Disclosure: NA
Harbingers of Failure...and Success
Abstract
We extend the work of Anderson et al. (2015), who find evidence of “harbingers of failure” (consumers who tend to purchase new products that are destined to meet a doomed fate), along four lines: (i) we replicate their findings in a dataset that covers over 400 U.S. retailers and a wide range of product categories, (ii) we develop a novel semi-parametric approach that yields interpretable consumer-level estimates and improved predictive accuracy, (iii) we characterize harbingers of failure showing that they are wealthier, have more children and larger family size, and shop at warehouse clubs, and (iv) we investigate potential mechanisms that explain the harbingers of failure phenomenon finding that harbingers of failure are more variety-seeking.
We find evidence for not only harbingers of failure but also harbingers of success (i.e., customers who tend to purchase new products that are likely to succeed) with the former making up 44% of consumers in our data and the latter making up the remaining 56%. Further, were all early sales of a new product to harbingers of failure as opposed to harbingers of success, the probability that the new product remains in the market two (four) years after introduction would be five (seven) percentage points lower; thus, sales to harbingers have a powerful impact, especially when considered in tandem with the high rate of new product failure.
Keywords: new products, harbinger, failure, methodology.
1 Introduction
New product development is the major driver of growth for consumer packaged goods (CPG)
firms (IRI, 2013). However, despite these firms' large investments in new product development,
the failure rate of new products introduced to market is as high as 75% (Schneider
and Hall, 2011). Moreover, even new products that exceed their expected first-year sales
tend not to last very long in the market.
That new product ideas, which have passed numerous tests from the idea generation
phase all the way through to the commercialization phase, fail at such high rates vexes both
practitioners and academics. Particularly surprising are the results from the market testing
phase as sales in the market might reasonably be thought to provide information about
future product performance. Indeed, new product forecasting tools typically assume that
early sales portend future success.
Recently, Anderson et al. (2015) have challenged this conventional wisdom by demonstrating
that not all early sales do in fact portend future success. Instead, certain consumers, so-called
“harbingers of failure,” tend to purchase new products that are destined to meet a doomed
fate. This suggests firms should pay attention not only to how much their new products are
selling but also to whom they are selling.
In this paper, we extend the work of Anderson et al. (2015) along four lines. First,
Anderson et al. (2015) possess data from only a single, national retail chain and it is thus
unclear whether their results generalize to a broader set of retailers. In contrast, we possess
data that covers over 400 U.S. retailers and a wide range of product categories and replicate
and extend the basic findings of Anderson et al. (2015) in this broader context.
Second, the empirical approach of Anderson et al. (2015) is rather ad hoc in several
respects, including (i) requiring that new products be arbitrarily classified as “successes”
or “failures” by the researcher rather than allowing for an objective, continuous measure of
product success and (ii) requiring that consumers be arbitrarily grouped into four segments
by the researcher rather than allowing for individual consumer-level estimates. In contrast,
we develop a novel semi-parametric approach that treats product success in a continuous
manner and yields both interpretable consumer-level (here household-level) estimates and
improved predictive accuracy. We view this model as one of the major contributions of this
paper and note that it can easily be (and is) generalized to accommodate cross-category
effects and yields much improved predictive performance relative to the model of Anderson
et al. (2015) and other competing models.
Third, Anderson et al. (2015) possess only transaction-level data thus limiting their
ability to understand the demographic profile of harbingers of failure. In contrast,
we possess rich, household-level demographic data that we can tie to our household-level
estimates thereby allowing us to characterize harbingers.
Finally, Anderson et al. (2015) provide only a limited investigation of potential mech-
anisms that explain the harbingers of failure phenomenon, suggesting that they have un-
representative tastes so that the products they choose do not match the preference of the
mass market and thus ultimately fail. In contrast, we develop several alternative hypotheses
that purport to explain what makes a consumer a harbinger of failure: Do they search less
such that they choose poor quality products that ultimately fail? Do they fail to be opinion
leaders who drive the word of mouth necessary for new product success? Are they more
innovative thus adopting new products at a rate that far outpaces that of more typical con-
sumers? Or are they more variety seeking thus failing to drive the repeat purchase necessary
for new product success?
To preview our results, we find evidence for not only harbingers of failure but also
harbingers of success (i.e., customers who tend to purchase new products that are likely
to succeed) with the former making up 44% of consumers in our data and the latter making
up the remaining 56%. Further, were all early sales of a new product to harbingers of failure
as opposed to harbingers of success, the probability that the new product remains in the
market two (four) years after introduction would be five (seven) percentage points lower; thus, sales
to harbingers have a powerful impact, especially when considered in tandem with the high
rate of new product failure.
We also find that harbingers of failure tend to be wealthier and to have more children
and larger family size. Further, harbingers’ purchase behavior consistently predicts new
product success across multiple categories; for example, if a consumer who has purchased
many newly introduced bakery products that have ultimately gone on to fail also purchases a
newly introduced beauty product, this portends ill for the newly introduced beauty product.
In addition to looking at the behavior of harbingers across categories, we also examine their
behavior across retail formats. Harbingers of failure spend highly at mass merchandisers and
warehouse clubs, whereas harbingers of success spend highly at drug stores and traditional
grocery stores; thus, although from the perspective of manufacturers harbingers of failure
portend new product failure, from the perspective of retailers (in particular mass merchandisers
and warehouse clubs) they are an important source of revenue. Finally, our results
suggest that harbingers of failure are more variety-seeking.
The remainder of this paper is organized as follows. In the next section, we review
related literature. In Section 3 we describe our data and in Section 4 we replicate the
results of Anderson et al. (2015) using our more extensive data. Next, we discuss our
more general model in Section 5 and present results from it in Section 6. In Section 7, we
investigate potential mechanisms that explain the harbingers of failure phenomenon. Finally,
we conclude with a brief discussion in Section 8.
2 Literature review
Successful new products are the growth engine for CPG firms (IRI, 2013). Despite the
strategic importance of developing innovative new products, the failure rate of new CPG
products has been high for decades. For example, Crawford (1977) summarizes more than
ten different sources that cite failure rates ranging from 40% to 90% for CPG products.
The high failure rate of new products has led many researchers to develop theories as to
why new products succeed or fail. Crawford (1977) argues that improved marketing research
could address many of the eight factors that he believes are linked to a high failure rate among
new products. He then offers a series of nine hypotheses as to why improved market research
may not yield better outcomes. We believe that our research on harbingers of failure and
success introduces a new hypothesis that has not been previously considered. That is, our
research shows that customers vary in their innate preferences and can be classified into
those that systematically purchase new products that ultimately go on to fail or succeed.
We agree with Crawford (1977) that improved consumer insight can address this issue and
believe that our model offers one such approach.
While better information can obviously reduce failure rates, there are many factors that
influence the success or failure of a new product. As discussed in Anderson et al. (2015), one
factor that contributes to success or failure is how managers make decisions. Escalated com-
mitment (Boulding et al., 1997; Brockner and Rubin, 1985; Brockner, 1992), an inability to
integrate information (Biyalogorsky et al., 2006), and distortions in management incentives
(Simester and Zhang, 2010) have all been offered as explanations for the high rate of failure
of new innovations.
A second broad factor that influences success or failure is organizational structure. Both
theoretical and empirical research has shown that the integration of marketing, sales, and
research and development is critical for new product success (see for example Ayers et al.
(1997) and Ernst et al. (2010)). In a study of hundreds of new product launches in Japan,
Song and Parry (1997) show that a firm’s ability to share information throughout the orga-
nization is a key success factor. Research by Sethi and Iqbal (2008) suggests a link between
managerial learning and organizational structure. In particular, they show that a rigid
stage-gate development process can impede learning and this effect is more pronounced in
turbulent markets.
A third factor is technical skills. While management decision making and organizational
factors clearly play a role, Calantone et al. (1996) examine hundreds of new product launches
in China and the U.S. and show that technical resources and skills are critical for new product
success.
The strategic importance of new products has led to the development of models that
can be used to understand those factors that predict success or failure. One of the earliest
models was developed by Fourt and Woodlock (1960) and focused on modeling the trial and
repeat purchase behavior of customers. These so-called “trial-repeat” models were further
developed and enhanced by subsequent researchers, including Massy (1969), Eskin (1973),
Eskin and Malec (1976), Silk and Urban (1978), and Pringle et al. (1982). Steenkamp
and Gielens (2003) provide an extensive analysis across many categories that illustrates the
complex interactions between consumer characteristics and marketing actions that influence
trial. A core finding of these research papers is that success follows from both attracting a
large base of buyers and then encouraging them to repeat purchase. In our research, this
finding is consistent with the behavior of harbingers of success; in contrast, repeat purchases
among harbingers of failure may signal failure (Anderson et al., 2015).
In addition to trial-repeat models, researchers have developed many other approaches to
predict new product success. Moe and Fader (2003) show how prelaunch sales of music, which
typically occur three to five weeks before the official launch, are predictive of overall music
sales. Neelamegham and Chintagunta (1999) show how domestic movie sales are predictive
of international movie sales. Garber et al. (2004) utilize the geographical distribution of
sales to predict the overall success of a new product launch. Finally, Calantone
and Cooper (1981) argue that success stems from the integration of multiple factors that
they refer to as scenarios; they analyze more than 200 industrial new product launches to
develop a taxonomy of scenarios that are correlated with success. Our model contributes to
these approaches by developing a novel way of predicting success that utilizes cross-category
data.
These models are part of a broader literature in marketing on predicting outcomes. Some
of the early work in this literature includes predictive model validation (Ryans, 1976), statis-
tics for model fit (Rust and Schmittlein, 1985), and methods for model selection (Bunn,
1979). More recently, techniques developed in machine learning have been applied to predic-
tive and other tasks in marketing applications (Cui and Curry, 2005; Dzyabura and Hauser,
2011). Our paper contributes to this literature by offering a novel methodology for predicting
new product success.
Finally, our work relates to a new and growing literature on harbingers of failure. For
instance, Simester et al. (2017) use data from two retailers to show that
zip codes that tend to have a high proportion of harbingers of failure for one retailer also
tend to have a high proportion of harbingers of failure for the other retailer; they also
analyze geographic customer movements to show that the harbinger trait is a stable customer
characteristic. Our research provides convergent support for both of these findings.
3 Data
Our principal dataset is the IRI U.S. consumer panel dataset which contains the transaction
records and demographics of 103,168 unique households from 2006 to 2009. The data is highly
comprehensive in that households report their shopping trips to over 400 major retailers
that sell products across eight grand categories (bakery, dairy, deli, edible, frozen, general
merchandise, health-beauty-care (HBC), and non-edible).
In addition, we also possess data that indicates when a product was first scanned and last
scanned in the national market through 2013. This allows us to identify new products that
were introduced to the market during the 2006 - 2009 period as well as their product lifetime
(i.e., the number of weeks sold in the national market). As a new product introduction
typically involves multiple versions or several different flavors and sizes, we restrict ourselves
to a subset of independent Universal Product Codes (UPCs) to avoid duplication in our data;
specifically, for each brand in a given new product group, we pick the UPC that was first
introduced to the market as the primary UPC. Further, as seasonal and holiday products
necessarily appear on shelves for only a limited period of time and thus their short lifetimes
are not indicative of product failure, we exclude these new products from our analysis. This
leaves us with 47,370 independent new products.
[Table 1 about here.]
In Table 1, we present several summary statistics that describe our 47,370 new products.
We focus on product lifetime as it will serve as our objective, continuous measure of
product success. We note that although lifetime is based on when a new product was first
scanned, a given new product may not be released in all geographic markets simultaneously;
further, even if a manufacturer or retailer discontinues a given new product at a given point
in time, sales may persist in subsequent periods due to inventory. Consequently, the new
product lifetime observed in our data may be longer than expected; however, as the definition
applies equally to all products, it should not affect the relative product lifetime observed in
our data. We note that 19,025 (40%) of our 47,370 new products were still being sold beyond
2013; consequently, the product lifetimes of these products are (right) censored. Nonetheless,
product lifetime varies substantially across the 60% of products which are uncensored.
We also present summary statistics for a variety of other variables in Table 1. As can
be seen, the majority of new products are relatively inexpensive, associated with national
brands, seldom promoted, and have comparably few unit sales in the first twenty-six weeks
after introduction.
[Figure 1 about here.]
For exploratory purposes, we compare the revenue of relatively short-lived versus relatively
long-lived new products in the top panel of Figure 1 (although we do not have access
to national sales data, the large number of households in our panel data provides a reasonable
approximation to revenue relative to category). In the figure, we consider only new products
that remained on shelves for at least one year, classifying those that remained for less (more)
than four years as short-lived (long-lived). The smooth curves provide the fit of a generalized
additive model with degree of smoothness estimated from the data separately for short-lived
and long-lived new products. As can be seen, the revenue of long-lived new products grows
rapidly in the first fifteen weeks after introduction and remains relatively stable thereafter;
on the other hand, the revenue of short-lived new products declines from the start.
As the difference in revenue between the two curves in the top panel of Figure 1 reflects
both differences in (i) adoption rates in the early phases and (ii) repeat purchase rates in
the later phases, we decompose these curves into the two components shown in the bottom
two panels of the figure. As can be seen, long-lived new products have both higher adoption
and repeat purchase rates relative to short-lived new products.
4 Replication of Anderson et al. (2015)
In this section, we replicate the results of Anderson et al. (2015), which were based on a
single, national retail chain, using our more extensive data. In particular, we apply exactly
the same methodology as Anderson et al. (2015) and find evidence for both harbingers of
failure as well as harbingers of success.
The methodology of Anderson et al. (2015) involves four steps. First, two quantities are
arbitrarily defined: new products are classified as successes or failures based on whether or
not their lifetimes exceed some threshold chosen by the researcher and an “initial evaluation
period” used to assess the early performance of a new product is defined. We choose four
years as our threshold for new product success versus failure; this implies an average failure
rate of 39% which is conservative as compared to the failure rate of 75% documented in
Schneider and Hall (2011) (to obtain a failure rate of 75% would require a threshold of six
years). We also define the first twenty-six weeks after introduction as the initial evaluation
period. Given this, we let $T_i$ denote the lifetime of new product $i$ in weeks, and we define
$y_i = \mathbf{1}(T_i > 208)$ as our new product success indicator variable and $x_{i,h}$ as the number of
units of new product $i$ purchased by household $h$ in the initial evaluation period. We also
note two facts that follow from these two definitions: (i) all new products with censored
product lifetimes had lifetimes in excess of four years and thus censoring has no impact on
any results presented in this section and (ii) 81,195 households purchased one or more new
products in the initial evaluation period of twenty-six weeks and thus impact results in this
section and in the remainder of this manuscript.
[Table 2 about here.]
Second, the data is split by product into three datasets: a calibration dataset, an in-
sample (or estimation) dataset, and an out-of-sample dataset. We follow Anderson et al.
(2015) and split our datasets by year of new product introduction; in particular, we use
all 25,957 new products introduced in 2006 and 2008 as our calibration dataset, a random
sample of 17,130 (80%) new products introduced in 2007 and 2009 as our in-sample dataset,
and the remaining 4,283 (20%) new products introduced in 2007 and 2009 as our out-of-
sample dataset (Anderson et al. (2015) conduct model evaluation only in-sample and thus
lack a separate out-of-sample dataset). We illustrate our notation and datasets in Table
2 and note that, in the remainder of this manuscript, the subscript $i_1$ always indexes new
products in the calibration dataset, the subscript $i_2$ always indexes new products in the
in-sample dataset, and the subscript $i_3$ always indexes new products in the out-of-sample
dataset.
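The three-way split described above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' code: the function name `split_products`, the input layout (a list of `(product_id, intro_year)` pairs), and the random seed are all hypothetical.

```python
import random

def split_products(products, holdout_frac=0.2, seed=0):
    """Split (product_id, intro_year) pairs into the three datasets.

    Assumed layout: products introduced in 2006 or 2008 form the
    calibration set; those introduced in 2007 or 2009 are randomly
    divided into in-sample and out-of-sample sets.
    """
    calibration = [p for p, year in products if year in (2006, 2008)]
    rest = [p for p, year in products if year in (2007, 2009)]
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    rng.shuffle(rest)
    n_out = round(len(rest) * holdout_frac)
    out_of_sample = rest[:n_out]       # held out for evaluation
    in_sample = rest[n_out:]           # used for estimation
    return calibration, in_sample, out_of_sample
```

With the counts reported in the text, the 2007 and 2009 products total 17,130 + 4,283 = 21,413, so an 80/20 split is consistent with the stated sizes.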
Third, the calibration dataset is used to compute $a_h$, the so-called “flop affinity” of each
household $h$. Formally, flop affinity is defined as
\[
a_h = \frac{\sum_{i_1} \mathbf{1}(y_{i_1} = 0)\,\mathbf{1}(x_{i_1,h} > 0)}{\sum_{i_1} \mathbf{1}(x_{i_1,h} > 0)}
\]
which is the fraction of new products in the calibration dataset purchased by household $h$
that are classified as failures (i.e., “flops”). Then, households are grouped into four equal-
sized flop affinity segments based on the quartiles $(a_1, a_2, a_3)$ of the distribution of $a_h$.
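As a rough illustration of this step, the sketch below computes flop affinity and quartile segments with NumPy. The array layout (products in rows, households in columns), the function names, and the use of segment 0 for households with undefined affinity are assumptions made for the example rather than details taken from Anderson et al. (2015).

```python
import numpy as np

def flop_affinity(units, failed):
    """Fraction of each household's purchased calibration products that failed.

    units  : (n_products, n_households) array of units x[i1, h]
    failed : (n_products,) boolean array, True where y[i1] == 0
    Returns NaN for households with no calibration purchases.
    """
    bought = np.asarray(units) > 0                       # 1(x > 0)
    n_bought = bought.sum(axis=0).astype(float)          # denominator
    n_flops = (bought & np.asarray(failed, dtype=bool)[:, None]).sum(axis=0)
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.where(n_bought > 0, n_flops / n_bought, np.nan)

def affinity_segments(a):
    """Assign segments 1..4 by quartile of a_h; 0 marks undefined affinity."""
    a = np.asarray(a, dtype=float)
    defined = ~np.isnan(a)
    cuts = np.quantile(a[defined], [0.25, 0.5, 0.75])    # (a1, a2, a3)
    seg = np.zeros(a.shape, dtype=int)
    # side="left" places a_h in segment j when a_{j-1} < a_h <= a_j
    seg[defined] = np.searchsorted(cuts, a[defined], side="left") + 1
    return seg
```

Because many households share the same affinity value (for example 0 or 1), quartile cuts on real data need not produce exactly equal-sized segments.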
Finally, the logistic regression model
\[
\operatorname{logit}(p_i) = \beta_0 + \sum_{j=1}^{4} \beta_j S_{i,j} + \beta_5 S_i \tag{1}
\]
is specified, where $p_i = P(y_i = 1)$; $S_{i,j} = \sum_h \mathbf{1}(a_{j-1} < a_h \le a_j)\, x_{i,h}$ is the total sales of new product
$i$ among households in flop affinity segment $j$ in the initial evaluation period, with
$a_0 = -\varepsilon$ and $a_4 = 1$ for any $\varepsilon \in \mathbb{R}^+$; and $S_i$ is the total sales of new product $i$ among
households for which flop affinity is undefined (i.e., households that did not purchase any
new products in the calibration dataset years 2006 and 2008 as well as the households that
were present in the data in only the in-sample and out-of-sample dataset years 2007 and
2009). The model is fit using each product $i_2$ in the in-sample dataset and evaluated using
each product $i_3$ in the out-of-sample dataset.
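The regressors of Equation (1) are simple aggregates of early sales by segment. A minimal sketch follows, assuming segments are coded 1 to 4 and households with undefined flop affinity are coded 0; the function name `segment_sales` and this coding are hypothetical choices for illustration.

```python
import numpy as np

def segment_sales(units, segments, n_segments=4):
    """Aggregate early sales by flop affinity segment.

    units    : (n_products, n_households) units bought in the evaluation period
    segments : (n_households,) ints in 0..n_segments, 0 = undefined affinity
    Returns (S, S_undef): S[i, j-1] is sales of product i to segment j and
    S_undef[i] is sales to households with undefined flop affinity.
    """
    units = np.asarray(units, dtype=float)
    segments = np.asarray(segments)
    S = np.column_stack([units[:, segments == j].sum(axis=1)
                         for j in range(1, n_segments + 1)])
    S_undef = units[:, segments == 0].sum(axis=1)
    return S, S_undef
```

Summing the four segment columns and the undefined-affinity column recovers total early sales per product, which is the single regressor used by the benchmark model.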
The performance of the model in Equation (1) is compared with that of a more typical
new product forecasting model in which sales to all households are treated equally, namely
the logistic regression model
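\[
\operatorname{logit}(p_i) = \beta_0 + \beta_1 \Big( \sum_{j=1}^{4} S_{i,j} + S_i \Big) = \beta_0 + \beta_1 S_i^{\mathrm{total}} \tag{2}
\]
where $S_i^{\mathrm{total}}$ gives the total sales of new product $i$ among all households in the initial evaluation
period, that is, the segment-level sales $S_{i,j}$ plus the sales $S_i$ to households for which flop affinity is undefined.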
[Table 3 about here.]
We present results from the benchmark model (Equation (2)) and the model of Anderson
et al. (2015) (Equation (1)) respectively in the first two columns of Table 3. The positive
coefficient of total sales S in the benchmark is consistent with the conventional wisdom that
early sales portend future success. However, the results of the model in Anderson et al.
(2015) yield a more nuanced interpretation: while the coefficients for sales to the first two
flop affinity segments are positive, the coefficients for the last two are negative. In other words,
sales to households with low flop affinity portend new product success while sales to those
with high flop affinity portend new product failure. Further, our model fit statistics (log
likelihood (LL) and area under the receiver operating characteristic curve (AUC), in-sample
and out-of-sample) show the model of Anderson et al. (2015) outperforms the benchmark
model.
To test the robustness of this result, we consider two additional models that generalize the
model of Anderson et al. (2015), in particular by successively adding three product covariates
(price, private label indicator, and promotion frequency) and category effects (fixed effects for
each of the eight categories; random effects for each of the 291 subcategories) to the model.
The results for these models are presented in the third and fourth columns of Table 3
respectively. As can be seen, the principal results remain unchanged: sales to households
with low (high) flop affinity portend new product success (failure).
In sum, we replicate the principal results of Anderson et al. (2015) and find evidence of
harbingers of failure using our more extensive data; we also find evidence of harbingers of
success.
For a further comparison to the model of Anderson et al. (2015), we also fit a logistic
regression model to the data treating the success of new products $i_2$ in the in-sample dataset
as binary as in Anderson et al. (2015) and this section but treating the $\beta_h$ as in Equation
(4) of the next section. Again, we find that households that purchase many short-lived new
products are more likely to be harbingers of failure; see Appendix A.1 for details.
5 Model
In this section, we introduce novel methodology that addresses a number of limitations of
the model of Anderson et al. (2015). First, rather than requiring that new products be
arbitrarily classified as successes or failures and using yi as the measure of product success,
we use an objective, continuous measure of product success, namely the product lifetime Ti.
Second, rather than requiring that households be arbitrarily grouped into four flop affinity
segments such that all households in a given segment are constrained to have the same effect
on new product success, we allow for individual household-level effects. Third, and perhaps
most subtly, rather than requiring that flop affinity (and thus the household-level effects) be
based merely on the fraction of new products purchased that are classified as failures (and
thus, for example, treating households that purchase one new product so classified out
of two total the same as households that purchase ten new products so classified out of
twenty total), we use a more general measure.
As a first step towards relaxing the first limitation, rather than employing a logistic
regression model as in Anderson et al. (2015), we employ a survival model, in particular a
Cox proportional hazards model (Cox, 1972), that models the product lifetime Ti. In its
most general form, our model is given by
\[
\lambda(T_i) = \lambda_0(T_i) \exp\Big( \beta_0 + \sum_{h=1}^{H} \beta_h x_{i,h} + \beta_{H+1} S_i \Big) \tag{3}
\]
where λ(Ti) is the hazard function for product i with lifetime Ti and λ0(Ti) is the baseline
hazard function. As before, the model is fit using each product i2 in the in-sample dataset
and evaluated using each product i3 in the out-of-sample dataset.
Within the hazards model framework, the analogue of the benchmark model of Equation
(2) involves constraining βh = β for h ∈ {1, ..., H + 1} in Equation (3) such that sales to
all households are treated equally and estimating β using each product i2 in the in-sample
dataset. Similarly, the analogue of the Anderson et al. (2015) model of Equation (1) involves
constraining the βh for h ∈ {1, ..., H} to follow a step function with three steps where the
locations of the steps are based on the quartiles of the distribution of ah and the levels of
the steps are estimated using each product i2 in the in-sample dataset; this in tandem with
Equation (3) yields the hazards model analogue of the model of Anderson et al. (2015).
While this moves toward relaxing the first limitation discussed above (i.e., in that the
dependent variable is the objective, continuous measure product lifetime Ti rather than the
arbitrary classification yi), it does not fully relax it as flop affinity ah still requires that new
products be arbitrarily classified as successes or failures; further, it does not relax either the
second or the third limitations.
Consequently, we take an alternative approach that fully relaxes all three limitations.
Key to our approach is recognizing that flop affinity can be written as
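\[
a_h = \frac{\sum_{i_1} \mathbf{1}(y_{i_1} = 0)\,\mathbf{1}(x_{i_1,h} > 0)}{\sum_{i_1} \mathbf{1}(x_{i_1,h} > 0)}
    = \frac{\sum_{i_1} \mathbf{1}(T_{i_1} \le 208)\,\mathbf{1}(x_{i_1,h} > 0)}{\sum_{i_1} \mathbf{1}(x_{i_1,h} > 0)}
    = \frac{\sum_{i_1} g_1(T_{i_1})\, g_2(x_{i_1,h})}{g_3(\mathbf{x}_h)}
\]
where $\mathbf{x}_h$ is a vector containing the $x_{i_1,h}$ and (i) $g_1(x) = \mathbf{1}(x \le 208)$; (ii) $g_2(x) = \mathbf{1}(x > 0)$;
and (iii) $g_3(\mathbf{x}) = \sum_i \mathbf{1}(x_i > 0)$. Given this, we alter $g_1$, $g_2$, and $g_3$ to create a generalized
flop affinity which will serve as our $\beta_h$. In particular, we set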
\[
\beta_h = \frac{\sum_{i_1} w(T_{i_1})\,\mathbf{1}(x_{i_1,h} > 0)}{\Big( \sum_{i_1} \mathbf{1}(x_{i_1,h} > 0) \Big)^{\gamma}} \tag{4}
\]
in which case (i) g1 = w is a flexible function estimated from the in-sample data using splines
(specifically thin plate regression splines (Wood, 2003)); (ii) g2 is as above; and (iii) g3 is
as above but exponentiated by γ ≥ 0. This model thus relaxes all three limitations of the
model of Anderson et al. (2015): (i) product lifetimes are treated continuously by both λ
and w; (ii) household effects βh are at the individual-level and are flexibly determined by w;
and (iii) the number of new products purchased impacts the household-level effects via γ.
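To make Equation (4) concrete, the sketch below evaluates the generalized flop affinity for a given weight function and exponent. It is illustrative only: in the paper $w$ is estimated with splines, whereas here `w` is any vectorized callable supplied by the user, and the function name and array layout are assumptions.

```python
import numpy as np

def generalized_flop_affinity(units, lifetimes, w, gamma):
    """Evaluate beta_h from Equation (4) for a given w and gamma.

    units     : (n_products, n_households) calibration purchases x[i1, h]
    lifetimes : (n_products,) product lifetimes T[i1] in weeks
    w         : vectorized callable mapping lifetimes to weights
    gamma     : exponent >= 0 applied to the number of products purchased
    Returns NaN for households with no calibration purchases.
    """
    bought = (np.asarray(units) > 0).astype(float)       # 1(x[i1, h] > 0)
    weights = w(np.asarray(lifetimes, dtype=float))      # w(T[i1])
    numer = weights @ bought                             # sum over products
    n_bought = bought.sum(axis=0)
    beta = np.full(n_bought.shape, np.nan)
    has = n_bought > 0
    beta[has] = numer[has] / n_bought[has] ** gamma
    return beta
```

Setting `w` to the indicator of a lifetime of at most 208 weeks and `gamma = 1` recovers the original flop affinity. With that same `w` and `gamma = 0.5`, a household with one failure out of two purchases gets roughly 0.71 while one with ten failures out of twenty gets roughly 2.24, so the two are no longer treated identically.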
Of particular note are two key differences in the manner in which the βh are estimated
by the model of Anderson et al. (2015) and this model. First, rather than weighting each
new product purchased in the calibration set in a binary manner via 1(Ti1 ≤ 208) as in
Anderson et al. (2015), this model weights each in a continuous manner via w, a function
of our objective, continuous measure of product success, namely the product lifetime Ti;
consequently, we hereafter refer to w as our weight function. Second, rather than imposing
a rigid functional form for the βh and estimating it in part from the calibration dataset and
in part from the in-sample dataset as in Anderson et al. (2015) (i.e., the locations of the step
function are estimated using (only) each product i1 in the calibration dataset while the levels
of the steps are estimated using each product i2 in the in-sample dataset), this model allows
a flexible functional form for the βh (via w and γ) and estimates it in a more principled
manner using only the in-sample dataset; consequently, it is semi-parametric not only in the
usual Cox proportional hazards sense but also in the sense that βh has both parametric and
non-parametric components.
The censoring of new products is naturally accommodated in the hazards model frame-
work; consequently, censoring of each product i2 in the in-sample dataset and each product
i3 in the out-of-sample dataset poses no difficulty for our model. However, censoring of each
product i1 in the calibration dataset does pose a problem as formally Ti1 (and thus w(Ti1))
is not defined for censored products. In our principal analysis presented in the main text
of this manuscript, we simply assume $T_{i_1}$ is equal to the final date in our dataset
(i.e., the last week of 2013); this necessarily underestimates $T_{i_1}$ as we know $T_{i_1}$ does in fact
fall beyond this date, resulting in a conservative assumption provided that longer product
lifetimes do indeed portend product success (i.e., $w$ is non-increasing). In an additional
analysis presented in Appendix A.2, we fully model the censoring of each product i1 in the
calibration dataset.
Estimation of our model parameters w and γ proceeds as follows. Conditional on γ, we
estimate w via penalized maximum likelihood using the gam function of the mgcv package in
R (Wood, 2011). We then conduct a grid search over γ to obtain the optimum w and γ.
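The two-stage estimation can be pictured as a profile search: fit $w$ for each candidate $\gamma$, then keep the best pair. The skeleton below is a sketch under assumptions. The callable `fit_given_gamma` is a hypothetical stand-in for the spline fit (done in the paper with R's mgcv, which is not reproduced here), and the use of a "higher is better" objective as the selection criterion is assumed, since the text does not state the criterion.

```python
def grid_search_gamma(fit_given_gamma, gammas):
    """Profile a fit over a grid of gamma values.

    fit_given_gamma : callable gamma -> (objective, fitted_model), where a
                      higher objective (e.g. a penalized log likelihood)
                      is assumed to be better
    gammas          : iterable of candidate gamma >= 0
    Returns (best_gamma, best_objective, best_model).
    """
    best = None
    for gamma in gammas:
        objective, model = fit_given_gamma(gamma)   # inner fit of w given gamma
        if best is None or objective > best[1]:
            best = (gamma, objective, model)
    if best is None:
        raise ValueError("gammas must contain at least one candidate")
    return best
```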
6 Results
6.1 Model Evaluation
To validate our proposed approach, we evaluate our model against three competitor models.
The first two are the hazard framework analogue of the benchmark model and the Anderson
et al. (2015) model discussed in Section 5; we label these models “Benchmark” and “Ander-
son” respectively. We also consider a competing approach that models the household-level
effects βh as a linear function of a vector of demographic variables dh such that βh = dh · α;
we label this model “Demographics” and note it is a particularly relevant competitor model
because managers often in practice use demographic variables for segmentation and targeting. We note that dh is composed of variables indicating household income, household size,
age of female and male head of household, and indicators for households that contain a single
child, contain two or more children, consist only of a female, and consist only of a male¹.
We evaluate our model specifications using two metrics, the partial log likelihood (PLL)
and the integrated area under the receiver operating characteristic curve (IAUC); these
metrics are the respective survival model analogues of LL and AUC used in Table 3. To
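define PLL, we let $Z_i = \beta_0 + \sum_h \beta_h x_{i,h} + \beta_{H+1} S_i$ such that $\lambda(T_i) = \lambda_0(T_i) \exp(Z_i)$. Then,
the partial log likelihood is given by

$$\mathrm{PLL} = \sum_{i: C_i = 0} \left[ Z_i - \log \sum_{j: T_j \geq T_i} \exp(Z_j) \right]$$
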
where Ci is a binary variable indicating that the lifetime of product i is right censored (i.e.,
is still being sold through 2013).
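The partial log likelihood defined above can be computed directly from the linear predictors, lifetimes, and censoring indicators. The following is a minimal sketch with hypothetical toy inputs; it is written for clarity rather than speed and treats tied lifetimes by including all tied products in each risk set.

```python
import math


def partial_log_likelihood(z, t, censored):
    """Sum over uncensored products i of
    z[i] - log(sum of exp(z[j]) over the risk set {j : t[j] >= t[i]})."""
    pll = 0.0
    for i in range(len(z)):
        if censored[i]:
            continue  # censored products enter only through risk sets
        risk_sum = sum(math.exp(z[j]) for j in range(len(z)) if t[j] >= t[i])
        pll += z[i] - math.log(risk_sum)
    return pll


# Toy data: three products with equal linear predictors; the third is censored.
example = partial_log_likelihood(
    z=[0.0, 0.0, 0.0], t=[1, 2, 3], censored=[False, False, True]
)
print(example)
```

With equal linear predictors, each uncensored product contributes minus the log of its risk-set size, so the toy value equals −log 3 − log 2.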
To define IAUC, we note that Zi can be used to predict whether or not product i fails
by time t, in particular by thresholding Zi at some value. By varying this value, one can
obtain specificity and sensitivity–and thus the receiver operating characteristic curve and
the AUC–as for any binary classifier (see Chambless and Diao (2006) for details). IAUC is
then defined as this AUC, which is a function of time t, averaged over all values of t.
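The idea of a time-dependent AUC averaged over time can be sketched as follows. This is a deliberately simplified illustration and not the Chambless and Diao (2006) estimator: at each time t it treats products observed to fail by t as cases and products still alive after t as controls, and it simply drops products censored before t rather than reweighting them. All inputs are hypothetical toy values.

```python
def auc_at_time(z, t, censored, horizon):
    """AUC for predicting failure by `horizon` from the linear predictor z.

    Cases: products observed to fail at or before the horizon.
    Controls: products with lifetime beyond the horizon.
    Products censored before the horizon are excluded (a simplification).
    """
    n = len(z)
    cases = [z[i] for i in range(n) if t[i] <= horizon and not censored[i]]
    controls = [z[i] for i in range(n) if t[i] > horizon]
    if not cases or not controls:
        return None  # AUC is undefined without both groups
    score = 0.0
    for z_case in cases:
        for z_control in controls:
            if z_case > z_control:
                score += 1.0  # higher hazard correctly ranked above
            elif z_case == z_control:
                score += 0.5  # ties count one half
    return score / (len(cases) * len(controls))


def integrated_auc(z, t, censored, horizons):
    """Average of the time-specific AUC over the horizons where it is defined."""
    values = [auc_at_time(z, t, censored, h) for h in horizons]
    values = [v for v in values if v is not None]
    return sum(values) / len(values)


z = [2.0, 1.0, 0.5, 0.1]            # larger z means higher predicted hazard
t = [10, 20, 30, 40]                # lifetimes in weeks
censored = [False, False, False, True]
iauc = integrated_auc(z, t, censored, horizons=[15, 25])
print(iauc)
```

In the toy data the predicted hazards are perfectly ordered by lifetime, so every case outranks every control at both horizons.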
[Table 4 about here.]
[Figure 2 about here.]
We present our model evaluation results in Table 4. As can be seen, our proposed
approach outperforms the alternative models on products i2 in the in-sample dataset and
products i3 in the out-of-sample dataset. We also plot the out-of-sample AUC across time–
the average of which is IAUC–in Figure 2. Again, our proposed approach outperforms the
alternative models.
6.2 Principal Results
[Figure 3 about here.]
We present our estimate of the weight function w in Figure 3. As can be seen, the
estimated weight function is decreasing in the product lifetime and, importantly, is positive
(negative) for sufficiently short (long) lifetimes. This implies that households that purchase
many short-lived new products are associated with higher failure risk; in other words, consistent
with the results of Anderson et al. (2015), such households are more likely to be harbingers
of failure.
[Figure 4 about here.]
We present our estimates of the βh (computed via Equation (4)) in Figure 4. In total, 44% of
households have positive βh, implying that 44% (56%) of households are harbingers of
failure (success); further, 31% of households have positive βh and 44% have negative βh
with 95% intervals that do not overlap zero (27% and 40%, respectively, with 99% intervals).
[Figure 5 about here.]
Our estimates of w and the βh naturally raise the question of the impact of harbingers of
success and failure on new product lifetimes. In Figure 5, we plot the Kaplan-Meier estimate
of the baseline survival probability along with how that estimate changes were all sales of
a given new product made to harbingers of success versus harbingers of failure. As can be seen,
the impact can be large. For example, the baseline survival probability of a new product
two (four) years after introduction is 0.82 (0.59); were all sales to harbingers of success, it would rise to
0.84 (0.62) but were all sales to harbingers of failure it would fall to 0.79 (0.55). In general,
the difference between the survival probability estimates when all sales are to harbingers of
success versus failure is larger than 0.05.
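To give a rough sense of scale for these shifts, recall that under proportional hazards multiplying the hazard by a constant ratio r changes a survival probability S to S^r. The sketch below backs out the ratio implied by the survival probabilities reported above. It rests on a simplifying assumption that is ours and not part of the analysis: each scenario is treated as a single proportional shift of the baseline, whereas the reported scenario probabilities are averages; the implied ratios are therefore only indicative.

```python
import math


def implied_hazard_ratio(baseline_survival, scenario_survival):
    """Ratio r such that baseline_survival ** r == scenario_survival."""
    return math.log(scenario_survival) / math.log(baseline_survival)


def shifted_survival(baseline_survival, hazard_ratio):
    """Survival probability after scaling the hazard by hazard_ratio."""
    return baseline_survival ** hazard_ratio


# Reported survival probabilities: years -> (baseline, all success, all failure)
reported = {2: (0.82, 0.84, 0.79), 4: (0.59, 0.62, 0.55)}

for years, (base, success, failure) in reported.items():
    hr_success = implied_hazard_ratio(base, success)
    hr_failure = implied_hazard_ratio(base, failure)
    print(years, round(hr_success, 3), round(hr_failure, 3))
```

A ratio below one corresponds to a lower hazard than baseline (harbingers of success) and a ratio above one to a higher hazard (harbingers of failure).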
Before proceeding and as discussed above, we note we present a further comparison to the
model of Anderson et al. (2015) in Appendix A.1 and additional analysis that fully models
the censoring of each product i1 in the calibration dataset in Appendix A.2.
6.3 Covariate Results
[Table 5 about here.]
To describe harbingers of success and failure in terms of demographic variables, we regress
the βh on various variables and present results in the first column of Table 5 (the second
column of this table will be discussed in Section 7.2). As can be seen, households with higher
income, more children, larger family size, and with only a female head of household are more
likely to be harbingers of failure (i.e., have βh > 0).
[Figure 6 about here.]
To evaluate the profitability of harbingers of success and failure to retailers, we present
their revenue contribution to a set of retailers in Figure 6. As can be seen, harbingers
of failure are not necessarily a customer segment to avoid for warehouse clubs and mass
merchandisers; indeed, they account for more than 50% of the revenue of these retailers
despite accounting for only 44% of the population. On the other hand, harbingers of success
spend disproportionately at channels such as drug stores and grocery stores. Thus, although
from the perspective of manufacturers harbingers of failure portend new product failure,
from the perspective of retailers, in particular mass merchandisers and warehouse clubs, they
are an important source of revenue.
6.4 Cross-category Results
Thus far, our model has treated all new products identically. In particular, the impact of
the lifetime of new product i1 purchased in the calibration dataset on the hazard rate of
new product i2 in the in-sample dataset (or product i3 in the out-of-sample dataset) does
not vary by the category of the products. As a robustness test, we now consider a simple
extension of our model that allows for differential impact by category. In particular, instead
of estimating a single βh for each household, we estimate two in order to account for whether
or not new products i1 and j match in category, for products i1 in the calibration dataset
and j in the in-sample or out-of-sample datasets; specifically, we replace Equation (4) by

$$\beta_{h,j} = \frac{\sum_{i_1} w_{\mathbf{1}(c(i_1) = c(j))}(T_{i_1}) \, \mathbf{1}(x_{i_1,h} > 0)}{\left( \sum_{i_1} \mathbf{1}(x_{i_1,h} > 0) \right)^{\gamma}}$$

so that the weight function w1 applies when the two products share a category and w0 applies otherwise.

... which amounts to reparameterizing the weight function w as ... where c(i) gives the category of product i and $u_{c(i_1),c(j)}(T_{i_1})$ accounts for the respective
categories of products i1 and j.
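The category-specific household effect βh,j that replaces Equation (4) can be sketched as follows. This is a minimal illustration under stated assumptions: the weight functions `w_same` and `w_diff` (standing in for w1 and w0) are hypothetical linear functions rather than fitted smooths, the toy data are invented, and setting the effect to zero for a household with no calibration purchases is our convention rather than something specified in the text.

```python
def category_effects(lifetimes, categories, purchases, target_category,
                     w_same, w_diff, gamma):
    """beta_{h,j} for every household h, for a target product j in target_category.

    lifetimes[i], categories[i]: lifetime and category of calibration product i.
    purchases[i][h]: units of calibration product i bought by household h.
    """
    n_households = len(purchases[0])
    beta = []
    for h in range(n_households):
        numerator, count = 0.0, 0
        for i, lifetime in enumerate(lifetimes):
            if purchases[i][h] > 0:
                # Pick the weight function by whether the categories match.
                w = w_same if categories[i] == target_category else w_diff
                numerator += w(lifetime)
                count += 1
        # Assumed convention: no calibration purchases gives a zero effect.
        beta.append(numerator / count ** gamma if count > 0 else 0.0)
    return beta


w_same = lambda lifetime: 0.05 - 0.0003 * lifetime          # hypothetical w1
w_diff = lambda lifetime: 0.5 * (0.05 - 0.0003 * lifetime)  # hypothetical w0

beta = category_effects(
    lifetimes=[30.0, 300.0],
    categories=["food", "health"],
    purchases=[[1, 0, 0],
               [1, 1, 0]],
    target_category="food",
    w_same=w_same, w_diff=w_diff, gamma=1.0,
)
print(beta)
```

Setting γ = 0 turns the effect into a sum over purchased products, while γ = 1 turns it into an average, so γ controls how strongly the number of calibration purchases is discounted.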
Because data for many category pairs is relatively sparse, our estimation of the u differs
from that of the w. Specifically, while we still use the gam function of the mgcv package
in R (Wood, 2011), the u are treated as random effects while the w are, as they have been
throughout, treated as fixed effects.
[Figure 10 about here.]
We present our results in Figure 10. As in Figures 3 and 7, the weight functions are
positive for sufficiently short lifetimes. This again implies that a household that purchases
many short-lived new products is more likely to be a harbinger of failure and, as indicated
in the right panel, this holds even when the product in the calibration dataset is from a
different category than the product in the in-sample or out-of-sample dataset.
List of Tables
1 Summary Statistics. Product lifetime serves as our objective, continuous measure of product success and varies substantially across the 60% of products which are uncensored. The majority of new products are relatively inexpensive, associated with national brands, seldom promoted, and have comparably few unit sales in the first twenty-six weeks after introduction.
2 Notation and Datasets. Ti gives the lifetime of new product i in weeks and xi,h gives the number of units of new product i purchased by household h in the initial evaluation period (i.e., first twenty-six weeks after introduction). The calibration dataset consists of all new products introduced in 2006 and 2008, the in-sample dataset consists of a random sample of 80% of new products introduced in 2007 and 2009, and the out-of-sample dataset consists of the remaining 20% of new products introduced in 2007 and 2009. In the remainder of this manuscript, the subscript i1 always indexes new products in the calibration dataset, the subscript i2 always indexes new products in the in-sample dataset, and the subscript i3 always indexes new products in the out-of-sample dataset.
3 Replication of Anderson et al. (2015). The model presented in the first column is the benchmark new product forecasting model and the model presented in the second column is that of Anderson et al. (2015). The models presented in the third and fourth columns generalize the model of Anderson et al. (2015) by successively adding three product covariates (price, private label indicator, and promotion frequency) and category effects (fixed effects for each of the eight categories; random effects for each of the 291 subcategories). The cells in the upper right subtable give coefficient estimates (estimated standard errors). LL denotes log likelihood and AUC denotes the area under the receiver operating characteristic curve.
4 Model Evaluation. We evaluate our proposed approach against three alternative models: the hazard framework analogue of the benchmark model and the Anderson et al. (2015) model as well as one that models the household-level effects as a linear function of a vector of demographic variables. Our proposed approach outperforms the alternative models on products i2 in the in-sample dataset and products i3 in the out-of-sample dataset.
5 Covariate Results. The model in column one (two) is a regression of the βh on demographic variables (demographic variables and behavioral variables). The second column of this table will be discussed in Section 7.2. Coefficient estimates are presented on the z-score scale such that they indicate the effect of a one standard deviation change in the covariate.
6 Correlation of βh and Survey Behavioral Variables. The results in the first (second) column calculate βh using purchase recall (purchase intention). Larger opinion leadership, innovativeness, and variety-seeking are all associated with larger βh using purchase intention. The standard error of all correlations is 0.06.
Table 1: Summary Statistics. Product lifetime serves as our objective, continuous measure of product success and varies substantially across the 60% of products which are uncensored. The majority of new products are relatively inexpensive, associated with national brands, seldom promoted, and have comparably few unit sales in the first twenty-six weeks after introduction.
                                      New Product Data        Household Purchase Data
Dataset                               Product ID   Lifetime   HH 1    HH 2    ...  HH h    ...  HH H
1. Calibration (2006, 2008; 100%)     ...          ...        ...     ...     ...  ...     ...  ...
                                      i1           Ti1        xi1,1   xi1,2   ...  xi1,h   ...  xi1,H
                                      ...          ...        ...     ...     ...  ...     ...  ...
2. In-sample (2007, 2009; 80%)        ...          ...        ...     ...     ...  ...     ...  ...
                                      i2           Ti2        xi2,1   xi2,2   ...  xi2,h   ...  xi2,H
                                      ...          ...        ...     ...     ...  ...     ...  ...
3. Out-of-sample (2007, 2009; 20%)    ...          ...        ...     ...     ...  ...     ...  ...
                                      i3           Ti3        xi3,1   xi3,2   ...  xi3,h   ...  xi3,H
                                      ...          ...        ...     ...     ...  ...     ...  ...
Table 2: Notation and Datasets. Ti gives the lifetime of new product i in weeks and xi,h gives the number of units of new product i purchased by household h in the initial evaluation period (i.e., first twenty-six weeks after introduction). The calibration dataset consists of all new products introduced in 2006 and 2008, the in-sample dataset consists of a random sample of 80% of new products introduced in 2007 and 2009, and the out-of-sample dataset consists of the remaining 20% of new products introduced in 2007 and 2009. In the remainder of this manuscript, the subscript i1 always indexes new products in the calibration dataset, the subscript i2 always indexes new products in the in-sample dataset, and the subscript i3 always indexes new products in the out-of-sample dataset.
Variable     Model 1      Model 2      Model 3      Model 4
Intercept    0.3521∗∗∗    0.3483∗∗∗    0.4180∗∗∗    0.1606
             (0.0159)     (0.0160)     (0.0242)     (0.2646)
S            0.0001∗∗∗
             (0.00002)
S.,1                      0.0021∗∗∗    0.0020∗∗∗    0.0017∗∗
                          (0.0008)     (0.0008)     (0.0008)
S.,2                      0.0019∗∗∗    0.0020∗∗∗    0.0015∗∗∗
                          (0.0003)     (0.0003)     (0.0003)
S.,3                      −0.0017∗∗∗   −0.0017∗∗∗   −0.0013∗∗∗
                          (0.0003)     (0.0003)     (0.0003)
S.,4                      −0.0011∗∗    −0.0011∗∗    −0.0008∗
                          (0.0005)     (0.0005)     (0.0005)
S                         −0.00002     0.0001       0.0002
                          (0.0008)     (0.0008)     (0.0008)
Table 3: Replication of Anderson et al. (2015). The model presented in the first column is the benchmark new product forecasting model and the model presented in the second column is that of Anderson et al. (2015). The models presented in the third and fourth columns generalize the model of Anderson et al. (2015) by successively adding three product covariates (price, private label indicator, and promotion frequency) and category effects (fixed effects for each of the eight categories; random effects for each of the 291 subcategories). The cells in the upper right subtable give coefficient estimates (estimated standard errors). LL denotes log likelihood and AUC denotes the area under the receiver operating characteristic curve.
Table 4: Model Evaluation. We evaluate our proposed approach against three alternative models: the hazard framework analogue of the benchmark model and the Anderson et al. (2015) model as well as one that models the household-level effects as a linear function of a vector of demographic variables. Our proposed approach outperforms the alternative models on products i2 in the in-sample dataset and products i3 in the out-of-sample dataset.
Variable                   Model 1       Model 2
Intercept                  −0.0003∗∗∗    −0.0003∗∗∗
                           (0.00002)     (0.00002)
Income                     0.0002∗∗∗     0.0002∗∗∗
                           (0.00003)     (0.00003)
Size                       0.0001∗       0.0001∗
                           (0.00004)     (0.00004)
Single child               0.0003∗∗∗     0.0003∗∗∗
                           (0.00003)     (0.00003)
Two+ children              0.0004∗∗∗     0.0004∗∗∗
                           (0.00004)     (0.00004)
Female head age            0.0001∗       0.00005
                           (0.00005)     (0.00005)
Male head age              −0.0001       −0.0001∗
                           (0.0001)      (0.0001)
Only a female              0.0001∗∗∗     0.0001∗∗∗
                           (0.00004)     (0.00005)
Only a male                0.0001        0.00002
                           (0.00004)     (0.00004)
Popularity rank                          0.0002∗∗∗
                                         (0.00002)
Store search                             −0.00003
                                         (0.00003)
Price search                             0.00002
                                         (0.00002)
Adoption lag                             0.0001∗∗∗
                                         (0.00002)
No. brands per category                  0.0001∗∗∗
                                         (0.00003)

∗p<0.1; ∗∗p<0.05; ∗∗∗p<0.01
Table 5: Covariate Results. The model in column one (two) is a regression of the βh on demographic variables (demographic variables and behavioral variables). The second column of this table will be discussed in Section 7.2. Coefficient estimates are presented on the z-score scale such that they indicate the effect of a one standard deviation change in the covariate.
Table 6: Correlation of βh and Survey Behavioral Variables. The results in the first (second) column calculate βh using purchase recall (purchase intention). Larger opinion leadership, innovativeness, and variety-seeking are all associated with larger βh using purchase intention. The standard error of all correlations is 0.06.
List of Figures
1 Performance of Short-lived and Long-lived New Products Over Time. The smooth curves are fit separately for relatively short-lived (lifetime between one and four years) and relatively long-lived (lifetime greater than four years) new products using a generalized additive model with the degree of smoothness estimated from the data. The revenue of long-lived new products grows rapidly in the first fifteen weeks after introduction and remains relatively stable thereafter while the revenue of short-lived new products declines from the start; long-lived new products have both higher adoption and repeat purchase rates relative to short-lived new products.
2 Out-of-Sample AUC Across Time. Our proposed approach outperforms the alternative models on products i3 in the out-of-sample dataset.
3 Estimated Weight Function w. The estimated weight function is decreasing in the product lifetime and, importantly, is positive (negative) for sufficiently short (long) lifetimes. This implies that a household that purchases many short-lived new products is more likely to be a harbinger of failure.
4 Estimates of βh. 44% of households have positive βh thus implying that 44% (56%) of households are harbingers of failure (success).
5 Effects of Harbingers on Survival Probability. The solid line provides the Kaplan-Meier estimator of the underlying baseline survival probability. The dashed lines provide the average survival probability were all sales to harbingers of success and failure. In general, the difference between the survival probability estimates when all sales are to harbingers of success versus failure is larger than 0.05.
6 Retailer Revenue Contribution of Harbingers of Failure. Harbingers of failure account for more than 50% of the revenue of particular mass merchandisers and warehouse clubs despite accounting for only 44% of the population.
7 Cross-category Estimated Weight Functions w1 and w0. Both weight functions are positive (negative) for sufficiently short (long) lifetimes. This implies that a household that purchases many short-lived new products is more likely to be a harbinger of failure and, as indicated in the bottom panel, this holds even when the product in the calibration dataset is from a different category than the product in the in-sample or out-of-sample dataset.
8 Logistic Regression Estimated Weight Function w. The estimated weight function is increasing in the product lifetime and, importantly, is negative (positive) for sufficiently short (long) lifetimes. This implies that a household that purchases many short-lived new products is more likely to be a harbinger of failure.
9 Estimated Weight Function w in the Full Censoring Model. The top (bottom) panel displays the weight function associated with uncensored (censored) new products in the calibration set. The estimated weight function for uncensored products is positive for short lifetimes but negative for long lifetimes; this again implies that a household that purchases many short-lived new products is more likely to be a harbinger of failure. It also shows the weight function for censored products is negative which implies that a household that purchases new products still being sold is less likely to be a harbinger of failure.
10 Cross-category Estimated Weight Functions w1 and w0. Both weight functions are positive for sufficiently short lifetimes. This implies that a household that purchases many short-lived new products is more likely to be a harbinger of failure and, as indicated in the bottom panel, this holds even when the product in the calibration dataset is from a different category than the product in the in-sample or out-of-sample dataset.
[Figure 1 plot: three panels showing Repeat Purchase Rate (%), Number of New Consumers, and Revenue Relative to Category Average (y-axes) against Product Lifetime (x-axis), with separate curves for long-lived and short-lived new products.]
Figure 1: Performance of Short-lived and Long-lived New Products Over Time. The smooth curves are fit separately for relatively short-lived (lifetime between one and four years) and relatively long-lived (lifetime greater than four years) new products using a generalized additive model with the degree of smoothness estimated from the data. The revenue of long-lived new products grows rapidly in the first fifteen weeks after introduction and remains relatively stable thereafter while the revenue of short-lived new products declines from the start; long-lived new products have both higher adoption and repeat purchase rates relative to short-lived new products.
[Figure 2 plot: AUC (y-axis) by Week (x-axis) for the Benchmark, Demographics, Anderson, and Proposed models.]
Figure 2: Out-of-Sample AUC Across Time. Our proposed approach outperforms the alternative models on products i3 in the out-of-sample dataset.
[Figure 3 plot: Weight Function (y-axis) against Product Lifetime (x-axis).]
Figure 3: Estimated Weight Function w. The estimated weight function is decreasing in the product lifetime and, importantly, is positive (negative) for sufficiently short (long) lifetimes. This implies that a household that purchases many short-lived new products is more likely to be a harbinger of failure.
[Figure 4 plot: Density (y-axis) of βh (x-axis).]
Figure 4: Estimates of βh. 44% of households have positive βh thus implying that 44% (56%) of households are harbingers of failure (success).
[Figure 5 plot: Survival Probability (y-axis) against Product Lifetime (x-axis) for three scenarios: Baseline, All success, and All failure.]
Figure 5: Effects of Harbingers on Survival Probability. The solid line provides the Kaplan-Meier estimator of the underlying baseline survival probability. The dashed lines provide the average survival probability were all sales to harbingers of success and failure. In general, the difference between the survival probability estimates when all sales are to harbingers of success versus failure is larger than 0.05.
[Figure 6 plot: Revenue Contribution (x-axis, 0% to 60%) by Retailer (y-axis). Retailers shown: Drug store B, Drug store A, Mass merchandiser C, Grocery C, Grocery B, Grocery A, Limit Assort A, Warehouse club B, Mass merchandiser B, Mass merchandiser A, Warehouse club A.]
Figure 6: Retailer Revenue Contribution of Harbingers of Failure. Harbingers of failure account for more than 50% of the revenue of particular mass merchandisers and warehouse clubs despite accounting for only 44% of the population.
[Figure 7 plot: two panels, Cross-Category and Same Category, showing Weight Function (y-axis) against Product Lifetime (x-axis).]
Figure 7: Cross-category Estimated Weight Functions w1 and w0. Both weight functions are positive (negative) for sufficiently short (long) lifetimes. This implies that a household that purchases many short-lived new products is more likely to be a harbinger of failure and, as indicated in the bottom panel, this holds even when the product in the calibration dataset is from a different category than the product in the in-sample or out-of-sample dataset.
[Figure 8 plot: Weight Function (y-axis) against Product Lifetime (x-axis).]
Figure 8: Logistic Regression Estimated Weight Function w. The estimated weight function is increasing in the product lifetime and, importantly, is negative (positive) for sufficiently short (long) lifetimes. This implies that a household that purchases many short-lived new products is more likely to be a harbinger of failure.
[Figure 9 plot: two panels, Censored and Uncensored, showing Weight Function (y-axis) against Product Lifetime (x-axis).]
Figure 9: Estimated Weight Function w in the Full Censoring Model. The top (bottom) panel displays the weight function associated with uncensored (censored) new products in the calibration set. The estimated weight function for uncensored products is positive for short lifetimes but negative for long lifetimes; this again implies that a household that purchases many short-lived new products is more likely to be a harbinger of failure. It also shows the weight function for censored products is negative which implies that a household that purchases new products still being sold is less likely to be a harbinger of failure.
[Figure 10 plot: two panels, Cross-Category and Same Category, showing Weight Function (y-axis) against Product Lifetime (x-axis).]
Figure 10: Cross-category Estimated Weight Functions w1 and w0. Both weight functions are positive for sufficiently short lifetimes. This implies that a household that purchases many short-lived new products is more likely to be a harbinger of failure and, as indicated in the bottom panel, this holds even when the product in the calibration dataset is from a different category than the product in the in-sample or out-of-sample dataset.