Can Microtargeting Improve Survey Sampling?
An Assessment of Accuracy and Bias in Consumer File Marketing Data
Josh Pasek
University of Michigan, Ann Arbor
S. Mo Jang
University of Michigan, Ann Arbor
Curtiss Cobb
GfK
Charles DiSogra
Abt SRBI
J. Michael Dennis
GfK
Please direct all correspondence to Josh Pasek, Assistant Professor of Communication Studies, University of Michigan, 5413 North Quad, 105 S. State Street, Ann Arbor, MI 48109. [email protected]. The authors thank Mike Traugott, Mansour Fahimi, and Michael Link for helpful comments and advice.
2012). We briefly discuss potentials for nonresponse adjustment, targeted sampling, and
sampling frame corrections.
Nonresponse adjustment. Any information that can be gathered on non-respondents has
the potential to mitigate survey bias (see e.g., Dixon & Tucker, 2010; Smith, 2011). Researchers
armed with ancillary data may be better able to estimate response propensities and examine the
relations between response propensities and survey variables (Groves, 2006; Kreuter & Olson,
2011; Little & Vartivarian, 2005). Beyond detecting nonresponse bias, ancillary data could
allow for more discretion in producing nonresponse adjustments (see e.g., Boehnke, 2003;
Brehm, 1999). For example, researchers might use ancillary data to create weights to match the
sample instead of the population or to develop imputation strategies (Peytchev, 2012; Smith &
Kim, 2009).
RUNNING HEAD: Supplementing Surveys with Consumer File Data
Targeted data collection. Ancillary data could also increase the efficiency of data
collection. Accurate ancillary data would allow researchers to know about key characteristics of
households selected for sampling. Such information could allow for targeted oversampling of
households in typically underrepresented population groups. Coupling demographic information
with knowledge of recruitment rates by group, researchers could sample groups in proportion to
their expected response rates, limiting the need for post hoc adjustments. Similarly,
demographic information acquired prior to data collection could streamline survey
administration. Ancillary data might allow firms to prepare for Spanish-speaking respondents,
difficult-to-reach individuals, or those who require special survey accommodations. Accurate
ancillary data put to each of these uses could reduce costs and improve statistical efficiency in
the sampling process.
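One way to read the allocation idea above is that each group is invited at a rate that offsets its expected response rate, so that expected completes land near the group's population share. The sketch below illustrates that arithmetic only; the group labels, shares, and response rates are hypothetical, and `invitations_needed` is an illustrative helper rather than anything used in the study.

```python
def invitations_needed(population_shares, response_rates, target_completes):
    """Invitations per group so expected completes follow population shares."""
    plan = {}
    for group, share in population_shares.items():
        expected_completes = share * target_completes
        # Groups with lower expected response rates need more invitations.
        plan[group] = round(expected_completes / response_rates[group])
    return plan


# Hypothetical population shares and expected response rates by age group.
shares = {"18-34": 0.30, "35-64": 0.50, "65+": 0.20}
rates = {"18-34": 0.05, "35-64": 0.10, "65+": 0.20}

plan = invitations_needed(shares, rates, target_completes=1000)
print(plan)
```

The plan is only as good as the group labels: if the ancillary classification is wrong for some households, realized completes will drift from these targets.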
Sampling frame corrections. Finally, ancillary data could help researchers supplement a
biased or unknown sampling frame to make the respondents resemble the entire population.
Ancillary data could reveal systematic distinctions between households with telephones and
those that do not have them in ways that might improve inferences from random digit dial
samples. More dramatically, matching schemes based on ancillary data might even allow
researchers to impute population parameters using nonprobability samples (see e.g.,
Ansolabehere & Hersh, 2012; Rivers, n.d.). By linking respondents with a representative set of
individuals in the target population, ancillary data could potentially build projections that might
mirror society at large.
Quality of Ancillary Data
The advantages of using ancillary data depend on a variety of assumptions being met. If
ancillary data are systematically inaccurate, sampling strategies based on ancillary data could
lead to invalid inferences. If ancillary data are missing for select groups of individuals,
attempted corrections for nonresponse may not improve survey estimates. And if interrelations
among ancillary data variables systematically differ between respondents and non-respondents,
the use of the additional data may introduce new sources of error (cf. Peytchev, 2012). These
potential challenges highlight a need for systematic evaluation of the accuracy and potential
biases in consumer file data. We can confidently use these data only when we can reasonably
identify (and ideally address) the biases that might occur.
Understanding the quality and accuracy of consumer file data is a critical step in using
these data to improve survey methods. Unfortunately, the purveyors of consumer file data
provide little information on the original source of these variables. As Smith (2011, p. 393)
notes, “many databases are ‘black boxes’ that do not disclose how they are constructed and what
rules are followed.” It is unclear how many sources of data were aggregated, how discrepancies
were identified and prioritized, what modes of data collection were used, and the extent to which
the data presented represent inferences rather than observations. Hence, practitioners should be
concerned that the quality of data may fluctuate both across variables and across respondents in
ways that could introduce systematic biases.
To date, no research has systematically evaluated the quality of ancillary data for survey
purposes.1 It is unclear how frequently ancillary data will correspond with self-reports and much
remains unknown about the scale of errors in ancillary data and the conditions under which
1 One earlier conference proceeding by DiSogra, Dennis, and Fahimi (2010) compared ancillary data to respondents’ self-reports and found that the match between the two was inconsistent. For example, ancillary data and self-reports largely corresponded in assessing homeownership, with 93 percent of individuals identified as homeowners in the ancillary data reporting that they actually owned their homes. In contrast, household income was poorly predicted by the ancillary data, matching respondents’ self-reports only around 50 percent of the time. Although the previous study represents an important first step, we could not identify any research that has appeared in the journal literature and many important questions remain unanswered.
ancillary data are more or less accurate. Inaccurate information could result in misleading
conclusions and could undermine sampling that relies on ancillary data.
Another concern stems from missing information. If households lacking ancillary data
on any particular variable differ systematically from households where those data are present,
sampling strategies relying on those data could lead to biased outcomes. Biases that stem from
missing information present particular challenges; if we cannot confidently predict their
occurrence, they may be impossible to correct for. Hence, analyses based on tenuous
assumptions about why ancillary data may be missing could lead to even greater total errors. It
is therefore important that we evaluate the characteristics of missingness in ancillary data. To do
so, however, we need datasets that provide reliable information about households for which
ancillary data are absent.
Present Study
The current research represents a first foray into evaluating the quality of one set of
consumer file ancillary data and addressing questions about the value of these data as a
nonresponse corrective. Using a unique dataset from GfK that links an address-based sample
(ABS) with ancillary demographic data from marketing companies, we first evaluate the
correspondence between ancillary data and self-reports by comparing survey responses in the
ABS sample to ancillary estimates about those same households for the same variables (Analysis
1). Then, we assess the nature of incompleteness in ancillary data by investigating the three
possible conditions of missingness: missing completely at random (MCAR), missing at random
(MAR), and nonignorable (Analysis 2). These evaluations primarily concern whether
inconsistencies between ancillary and self-report data and missingness appear to result from
systematic or random processes.
Inconsistencies between ancillary and self-reported results do not necessarily doom
attempts to correct for sampling biases with ancillary measures. Corrective tools may still work
if researchers can specify a model for when and where errors occur. To this end, a third analysis
investigates whether consumer file data may provide a useful strategy for addressing survey
nonresponse (Analysis 3). Using multiple imputation, we predict values for self-reported
variables among all sampled households, rather than simply among survey respondents. We then
compare the results of these imputations both to the parameter estimates obtained from the CPS
and to the estimates generated under the assumption that unit nonresponse was ignorable. We
illustrate circumstances under which imputations produced results that were more or less
representative and discuss the implications for the use of such strategies with both probability
and non-probability sampling designs.
Methods
Sample
Data for all analyses come from GfK. In January of 2011, GfK used the U.S. Postal
Service’s Computerized Delivery Sequence File (CDSF) to choose 25,000 random addresses that
would be recruited by mail (with telephone follow-ups where numbers were available) to join
KnowledgePanel®, an online probability-based sample of Americans. The CDSF covers well
over 95 percent of American households, making it one of the broadest potential sampling
frames (Iannacchione, 2011). Because of the breadth of the sampling frame, we could expect that a
sample with a 100% response rate among selected households would closely mirror the population
of all American households.
Of 25,000 sampled addresses, 2498 households were successfully recruited to join the
panel, a household response rate of 10.0% (AAPOR RR1). Multiple individuals were allowed to
sign up for the panel from each of these households. In total, 4472 individuals were recruited
from these households with the median household yielding two respondents.2
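The recruitment figures above can be checked with simple ratios. This is only a sketch: the full AAPOR RR1 definition distinguishes several case dispositions, and we assume here that it reduces to recruited households over sampled addresses for this sample.

```python
sampled_addresses = 25_000
recruited_households = 2_498
recruited_individuals = 4_472

# Household response rate, assuming every sampled address counts in the denominator.
household_response_rate = recruited_households / sampled_addresses

# Mean respondents per recruited household (compare with the 1.79 reported in footnote 2).
mean_respondents_per_household = recruited_individuals / recruited_households

print(f"{household_response_rate:.1%}")
print(f"{mean_respondents_per_household:.2f}")
```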
Because surveying for KnowledgePanel® is completed online, GfK provides a laptop
computer, Internet access, or both to panel members for whom these services are not already
available. Self-report data came from the Core Adult Profile, the first survey respondents
completed upon admission to the panel.
Ancillary Data
Ancillary data were matched to all 25,000 households in the ABS sample. These data
were provided by the Marketing Systems Group (MSG). MSG uses data from several databases
to compile information on American addresses. Data for the current study were originally sourced
from infoUSA, Experian, and Acxiom. Since the consumer file data were themselves produced
through a combination of aggregation and inferential techniques, it was impossible to trace the
source of any particular piece of information about a particular household.
Despite the opaque nature of individual ancillary measures, the firms providing the
information suggest that these data are ideal for tracking and identifying Americans. Acxiom
(2010) claims its data “covers more than 99% of marketable addresses worldwide” (p. 6) and
incorporates regular updates from the U.S. Postal Service. Experian (2012) notes that it excels at
linking identities between social media sites, phone numbers (landline and mobile), work and
home addresses, email accounts, and other online identifiers. And InfoUSA (2012) monitors
voter registration, utility, and real estate data to compile information, with monthly updates to
2 The mean number of respondents per household was 1.79. Fewer than 1% of all households yielded more than 4 respondents and no households yielded more than 10.
keep records current. Aggregating across these sources seems like a very effective way to keep
track of the American public.3
CPS Benchmarks
Benchmark values for analysis 3 came from the January 2011 CPS. The CPS included
data from 134,090 individuals in 53,867 American households with a response rate of 91%. The
Census Bureau provided weights to link data for sampled households to all American households
(and individuals). CPS numbers were weighted to match the household distribution unless stated
otherwise.
Variables
Homeownership, household income, and household size were measured in all three
datasets at the household level. Full question wordings and coding for household-level variables
in the GfK, ancillary, and CPS data are shown in Appendix A.4 Marital status, education, and age
were measured in all three datasets at the individual level. Question wordings and coding for
these variables are shown in Appendix B. Table 1 presents descriptive statistics for all measures
among respondents.
Weighting
Eight sets of weights were generated to assess correspondence between self-reported,
ancillary, and CPS estimates across the three analyses. No substantive differences were observed
3 Because of the nature of these data, we are not able to diagnose the source of discrepancies between ancillary and self-reported data. We also cannot conclude that other sources of ancillary data would not yield different results. Such comparisons should be a subject of continuing research.
4 Substantive variables used at both the household level and the individual level were those for which measurement categories could be matched across self-report, ancillary, and CPS measures. Three ancillary measures (presence of a telephone, racial/ethnic status, and number of children in the household) were excluded because the match between these measures in the datasets would have been inconsistent. They were used in the imputations in analysis 3 because additional variables should only serve to improve the fit of the multiple imputation procedure.
between weighting strategies, which led us to present only a single combination for each
analysis. Full information on weighting is presented in Appendix C.
Analysis 1: Correspondence Between Ancillary Data and Self-Reports
Analytic Method
To evaluate the ancillary data, we compared the values of variables in the ancillary data
with corresponding measures in the self-reported survey results. This evaluation proceeded in a
three-step process. We first assessed the extent to which self-reported and ancillary data
revealed matching information about households (and heads of household). High match rates
indicated relatively accurate ancillary data whereas low match rates would call that accuracy into
question.5 Second, to understand whether the ancillary and self-report data differed in systematic
ways, we explored whether ancillary measures tended to overestimate or underestimate
corresponding self-reports. If overestimates and underestimates were disproportional, the results
would suggest that misclassifications were a product of bias rather than stochastic error. Finally,
for measures with more than two categories, we assessed the proportion of responses that
differed by a large margin, defined as greater than 5 years for age and two or more categories
for ordinal measures. These “far-off” cases were unlikely to be a product of ancillary data that
were simply out of date.
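The three-step comparison described above can be sketched for a single ordinal measure as follows. This is an unweighted illustration, not the authors' code: `compare_ordinal` is a hypothetical helper, the category codes are invented, and cases lacking either source are simply dropped.

```python
def compare_ordinal(self_reports, ancillary, far_off_gap=2):
    """Match rate, direction of mismatch, and far-off rate for paired ordinal codes."""
    pairs = [(s, a) for s, a in zip(self_reports, ancillary)
             if s is not None and a is not None]
    if not pairs:
        raise ValueError("no cases with both self-report and ancillary values")
    n = len(pairs)
    return {
        # Step 1: share of cases where the two sources agree exactly.
        "match": sum(a == s for s, a in pairs) / n,
        # Step 2: direction of disagreement, relative to the self-report.
        "overestimate": sum(a > s for s, a in pairs) / n,
        "underestimate": sum(a < s for s, a in pairs) / n,
        # Step 3: disagreements of far_off_gap or more categories.
        "far_off": sum(abs(a - s) >= far_off_gap for s, a in pairs) / n,
    }


# Hypothetical category codes; None marks a missing value in either source.
self_codes = [1, 2, 3, 4, 2, None, 5, 3]
ancillary_codes = [1, 3, 3, 2, 2, 4, 5, None]

result = compare_ordinal(self_codes, ancillary_codes)
print(result)
```

For age, the same function could be applied to ages in years with `far_off_gap=6`, which corresponds to the "greater than 5 years" threshold.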
Results
Correspondence between ancillary and self-reported estimates differed markedly across
variables. The reports largely matched in cases like homeownership, marital status, and age,
5 Although some of these discrepancies could occur as a product of inaccuracies in survey estimates, we think this is not a major concern for two reasons. First, there is little evidence of systematic biases in reporting demographic variables in surveys. Second, web-based survey administration appears to minimize biases (see Chang & Krosnick, 2010), further mitigating this likelihood.
with 88.9%, 72.3% and 70.4% agreement respectively (Table 2). In contrast, estimates of
household income, household size, and education differed enormously between the data sources
(with 22.8%, 32.1%, and 38.9% matching). There was no apparent pattern discriminating
between variables with high and low correspondence.
Ancillary and self-reported measures corresponded at different rates across levels of the
same variables, a pattern indicative of bias. Looking only at households with both types of data,
homeownership, marital status, and income were consistently overestimated in the ancillary data
relative to the self-reports. 24.7% of households self-reported that they did not own their homes,
but fully 28.6% of these cases were classified as homeowners in the ancillary data. In contrast,
only 5.3% of self-reported homeowners were classified as non-owners in the ancillary data.
Perhaps most troublingly, of the 40.6% of individuals who reported they were unmarried in the
self-reports, more than half (51.0%) were classified as married in the ancillary data. Yet among
individuals reporting that they were married, 88.2% were similarly classified in the ancillary
data.
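As a consistency check, the conditional rates in this paragraph can be recombined into overall agreement rates and compared with the 88.9% and 72.3% reported in Table 2. The sketch assumes the quoted percentages share a common (weighted) base; `overall_agreement` is an illustrative helper.

```python
def overall_agreement(share_negative, error_among_negative, agree_among_positive):
    """Combine conditional classification rates into one agreement rate."""
    share_positive = 1 - share_negative
    return (share_negative * (1 - error_among_negative)
            + share_positive * agree_among_positive)


# Homeownership: 24.7% self-reported non-owners, 28.6% of them coded as owners;
# 5.3% of self-reported owners coded as non-owners.
homeownership = overall_agreement(0.247, 0.286, 1 - 0.053)

# Marital status: 40.6% self-reported unmarried, 51.0% of them coded as married;
# 88.2% of self-reported married individuals coded as married.
marital_status = overall_agreement(0.406, 0.510, 0.882)

print(f"{homeownership:.1%}, {marital_status:.1%}")
```

The high overall agreement for homeownership therefore coexists with a much larger error rate among non-owners than among owners, which is the asymmetry the text describes as bias.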
Self-report and ancillary estimates did not always diverge in similar patterns. Individuals
tended to self-report greater educational achievement than was apparent in the ancillary data and
discrepancies in reports of household size and age did not skew in a single direction. Overall,
there did not seem to be a clear pattern for when the estimates from the two data sources differed
in these manners (Table 2; Appendix D).
Large discrepancies between data sources emerged frequently for most variables. 44.1%
of households reported income that differed from the category suggested in the ancillary data by
more than $10,000 per year (Table 2). The number of occupants reported by a household
differed by 2 or more individuals from the ancillary estimate in 32.3% of cases. One in five
individuals reported an education level that differed by two or more categories from the ancillary
estimate. And 18.6% of self-reported ages differed from the ancillary estimate by six or more
years, even though cases were selected for the closest possible age match. Such large
discrepancies seem unlikely to have emerged from slightly outdated consumer file data.
Discussion
The correspondence between the ancillary data and self-reports varied across the six
variables. Disparities between self-report and ancillary results were fairly large for income,
household size, and education; they were generally more consistent when predicting
homeownership, though notable biases emerged.6 At a minimum, the findings indicate that the
ancillary data used may not be particularly accurate in their description of individual or
population parameters. Ancillary information was, however, considerably better than chance
determinations for all variables.
In considering uses of ancillary data, current results present a mixed picture. For certain
low-responding groups such as young Americans, increasing sampling rates, even at the cost of
some bias, could lead to more accurate overall estimates. Sampling firms and researchers using
such data, however, need to exercise caution. The bias-variance tradeoff induced by
oversampling in this manner will vary depending on the predictions being made. More
troublingly, researchers ignoring biased oversamples in making inferences could reach
inaccurate conclusions concerning both oversampled groups and society as a whole.
Analysis 2: Missingness in Ancillary Data
Analytic Method
6 In these respects, our results are largely a replication of DiSogra, Dennis, and Fahimi (2010) with a newer but very similar dataset.
Three tests were used to assess the scope and nature of missingness in the ancillary data.
First, we examined the extent of missingness in the ancillary data for each variable and across
individuals. Second, we compared missingness in ancillary data variables to self-reports for those
same measures. Differential missingness across self-report categories would provide strong
evidence of nonignorable missingness. Finally, we used logistic regressions to predict the
presence of missingness for each of the ancillary variables based on the values of self-report
measures and an OLS regression to predict the number of ancillary variables for which data were
missing across individuals. Presumably, if ancillary data were MCAR, missing data should not
be concentrated among specific individuals, should be unrelated to self-reports of those same
variables, and should be impossible to predict with the logistic regressions. If ancillary data
were MAR conditional on observable variables, we should see a strong ability for the logistic
regressions to predict missingness. Strong relations between self-reports and ancillary
missingness on the same variables, coupled with a relative inability to predict missingness in
regressions, would indicate that we did not have a good handle on when ancillary data were
missing.
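The second test, comparing ancillary missingness across self-report categories, amounts to a chi-square test of independence on a contingency table. A minimal unweighted sketch follows; the counts are hypothetical, and the p-value shortcut via `math.erfc` holds only for one degree of freedom.

```python
import math


def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


# Hypothetical counts. Rows: self-reported owners, non-owners.
# Columns: ancillary homeownership present, missing.
table = [[90, 10],
         [70, 30]]

stat, df = chi_square(table)
# With df = 1, the upper-tail probability equals erfc(sqrt(stat / 2)).
p_value = math.erfc(math.sqrt(stat / 2))
print(stat, df, p_value)
```

A small p-value here would indicate that missingness depends on the self-reported value of the same variable, which is inconsistent with MCAR.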
Missingness
Missingness indicator variables were created to identify cases missing ancillary data for
each of the variables of interest (0 = presence of data, 1 = absence of data). A total missingness
variable was defined as the number of ancillary measures missing for each individual (ranging
from 0 to 6).
Descriptive Statistics
Ancillary data were missing for a large number of cases. On average, 18.0% of
households were missing data for any given ancillary data variable. Missingness varied
considerably across variables, from only 6.1% of cases missing for household income to 28.3%
of households missing age information (Figure 1a).
We also explored how missing ancillary data varied across respondents. Although the
modal respondent was not missing data for any of the ancillary variables (45.7%; Figure 1b),
missingness was not concentrated in a few households. Only 8.2% were missing data for more
than two variables. Of households missing data, the vast majority was missing information for
only a single variable. To examine whether individuals lacking data on one variable were also
likely to have missing data on other variables, we conducted a reliability test among the six
missingness indicator variables. Cronbach’s alpha was .70, indicating a moderate consistency
among missingness indicators, but not enough to consider them a single factor.
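The missingness indicators, the total missingness count, and the reliability test can be sketched as below. The records and variable names are hypothetical (three variables rather than the six used in the study), and `cronbach_alpha` implements the standard formula rather than any particular software routine the authors may have used.

```python
from statistics import variance


def missing_indicators(record, variables):
    """1 where the ancillary value is absent, 0 where it is present."""
    return [1 if record.get(v) is None else 0 for v in variables]


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of cases, each a list of item scores."""
    k = len(rows[0])
    item_variances = [variance(column) for column in zip(*rows)]
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


variables = ["homeownership", "income", "age"]
# Hypothetical ancillary records; None marks a missing value.
records = [
    {"homeownership": "own", "income": None, "age": 40},
    {"homeownership": None, "income": None, "age": None},
    {"homeownership": "rent", "income": 3, "age": 25},
    {"homeownership": "own", "income": 5, "age": None},
]

indicators = [missing_indicators(r, variables) for r in records]
total_missing = [sum(row) for row in indicators]
alpha = cronbach_alpha(indicators)
print(total_missing, round(alpha, 2))
```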
Ancillary Missingness by Self-Report Value
Missing ancillary data appeared distinctly non-random when compared with self-reported
household status on the same variables. Ancillary homeownership was missing for 7.3% of
self-reported home-owning households and 29.5% of non-owning households (Figure 2a; χ²(1) =
181.4, p < .001). Ancillary income was missing systematically for households with lower
reported incomes. 10.4% of ancillary data were missing for households reporting an income of
below $35,000 per year whereas only 2.7% of ancillary data were missing for households with
incomes above $75,000 (Figure 2b; χ²(7) = 32.5, p < .001). Smaller households were missing
more information about household size than were larger households (Figure 2c; χ²(4) = 15.1,
p < .01). These results refuted the possibility that ancillary data were MCAR and indicated that
ancillary household data were likely nonignorable.
Rates of missingness in individual-level ancillary data also frequently depended on self-
reports for the same variables. When respondents reported that they were married, only 13.5%
of ancillary marital status data were missing. In contrast, 28.2% of ancillary marital information
was missing among unmarried individuals (Figure 2d; χ²(1) = 59.8, p < .001). Variation in
missingness across self-reported education levels was not statistically significant (Figure 2e;
χ²(4) = 5.8, p = .21). Finally, missing ancillary age information was more common among
younger individuals. For individuals aged 18-24, 40% of ancillary data were missing; only
14.0% of ancillary age data were missing for 55-64 year olds (Figure 2f; χ²(6) = 188.2, p < .001).
As with household-level variables, missing data for individual-level variables appeared
systematic.
Regressions
To understand patterns of missingness, regressions predicted the presence of missingness
for each of the ancillary data measures and the total number of missing ancillary variables as a
function of each respondent’s status on self-reported demographics. To conduct these
regressions, multiple imputations were used to account for missingness in self-reported values
among all respondents.7 All regression results were weighted.
Predictors of missing ancillary data varied depending on the missingness indicator being
predicted. Three self-reported demographics predicted multiple missingness indicators:
homeownership, household size, and ethnoracial identity. Compared to non-owners, self-
reported owners were less likely to lack ancillary data for homeownership, household income,
household size, marital status, and age (Table 3). The self-reported number of individuals in the
household predicted missing ancillary income, household size, and marital status, with larger
households translating into a reduced likelihood of missingness. Ethnoracial identity predicted
7 Multiple imputations were conducted by using Multiple Imputations via Chained Equations predicting each missing value with all self-reports and ancillary demographic variables. Running these same regressions without imputations led to the same conclusions. Imputed versions were used in case missingness in self-reports biased the process. Imputations were conducted to mirror the full set of respondents (not all sampled households).
inconsistently. Black Americans were more likely than Whites to lack ancillary homeownership
data. Hispanic and multiracial individuals were missing ancillary age information more
frequently than were other individuals. Self-reported marital status, education, and age each
predicted one of the missingness indicators. Married individuals were, ceteris paribus, less
likely to be missing information on marital status. Individuals who reported that they were
younger or had lower education were more likely to be missing ancillary age information.
Despite significant predictors for most missingness indicators, missing ancillary data was
not well estimated in the current analyses. Pseudo-R² values for the missingness indicators were
always below .20 and the full list of covariates improved prediction for only one of the six
indicators: missing ancillary age information (Table 3).
Predictions of the number of ancillary measures missing revealed similar challenges.
Although evidence suggested that missingness was related to less self-reported homeownership,
the number of missing measures remained poorly explained. These variables again captured only
a small portion of the variation across individuals (R² = .11).
Discussion
Results of analysis 2 suggested that missing ancillary data likely represented a
nonignorable source of bias. Ancillary data were missing systematically, but not in patterns that
could be predicted with simple covariates or regression techniques. Because missing data
appeared to violate both MCAR and MAR assumptions, the use of ancillary data for analytic
techniques could result in substantive error for all purposes we have considered.
The most pernicious of our results indicated that missingness in ancillary data was often
related to self-reported values of the same variables. Because ancillary missingness was
correlated with self-reports, sampling strategies based on ancillary data are likely to result in
sampled units that are differentially accurate across different variables. Rather than reducing the
weights required for a sample, oversampling with the use of ancillary variables
necessitates a two-stage weighting process whereby oversampling ratios must be corrected for
before weighting to population targets. Depending on the specific variables and sampling ratios
in play, such corrections could sometimes increase the variance and design effect of sample
weights instead of reducing them.
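The two-stage process can be sketched as follows: base weights first undo the oversampling applied to ancillary-defined strata, and a second step scales them to self-reported population targets. Kish's approximation then summarizes the cost in weight variability. The strata, ratios, targets, and case counts are all hypothetical, and both helpers are illustrative.

```python
def kish_design_effect(weights):
    """Kish's approximate design effect: n * sum(w^2) / (sum w)^2."""
    n = len(weights)
    return n * sum(w * w for w in weights) / sum(weights) ** 2


def two_stage_weights(cases, oversampling_ratio, population_share):
    # Stage 1: undo the oversampling applied to each ancillary-defined stratum.
    base = [1.0 / oversampling_ratio[c["stratum"]] for c in cases]
    # Stage 2: scale base weights so self-reported groups hit population targets.
    group_total = {}
    for c, b in zip(cases, base):
        group_total[c["group"]] = group_total.get(c["group"], 0.0) + b
    total = sum(base)
    return [b * population_share[c["group"]] * total / group_total[c["group"]]
            for c, b in zip(cases, base)]


# Hypothetical respondents: "stratum" comes from the ancillary age flag,
# "group" from the self-report. Some flagged-young cases are actually older.
cases = (
    [{"stratum": "flagged_young", "group": "young"}] * 3
    + [{"stratum": "flagged_young", "group": "older"}] * 1
    + [{"stratum": "not_flagged", "group": "young"}] * 1
    + [{"stratum": "not_flagged", "group": "older"}] * 5
)
weights = two_stage_weights(
    cases,
    oversampling_ratio={"flagged_young": 2.0, "not_flagged": 1.0},
    population_share={"young": 0.3, "older": 0.7},
)
young_share = (sum(w for c, w in zip(cases, weights) if c["group"] == "young")
               / sum(weights))
deff = kish_design_effect(weights)
print(round(young_share, 3), round(deff, 3))
```

Because the ancillary flag misclassifies some cases, respondents within the same self-reported group end up with different weights, and the design effect rises above 1.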
Analysis 3: Assessing Imputations Based on Ancillary Data
Analytic Methods
To evaluate whether the ancillary data allowed us to correct self-reports to better match
the sample, we produced three sets of comparisons for each of the variables of interest.
Specifically, we examined correspondence between household-level estimates provided by the
CPS and those of imputed self-report data for the entire sample (imputed), the original GfK data
among respondent households (raw), and the ancillary data (ancillary). Imputation procedures
are described in Appendix E. Household-level analyses were used to compare homeownership,
household income, and household size across each of these measures. Individual-level analyses
were also used to compare marital status, education, and age across each of these measures
(results presented in Appendix F). For each comparison, we calculated an absolute average error
across all analyses (cf. Yeager et al., 2010). If correspondence between imputed data and the
CPS was significantly stronger than that of the raw GfK data, we could conclude that the
inclusion of ancillary data in our models was helping to provide a more accurate understanding
of the population.
Household Results
Homeownership. Predictions of homeownership were slightly more accurate when using
the self-reported data without imputation than when using the imputations. Estimates of
homeownership rates in the raw sample (69.2%) did not significantly differ from those of the
CPS (68.7%; t=-.72, p=.47). Estimates from imputations remained close to the CPS estimate,
but were somewhat more discrepant. Discrepancies between imputations and the CPS ranged
from -3.51 percentage points to .07 percentage points (95% CI=-2.41 to -0.45). The average
discrepancy across the 100 imputations was 1.51 percentage points (p=.03).8 Overall, 90.5% of
the simulations were further from the CPS estimate than was the raw GfK estimate, a difference
just shy of statistical significance (p=.10). Ancillary estimates were considerably further from
CPS numbers in predicting that 77.0% of Americans owned their homes (t=26.94, p<.001).
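The Monte Carlo procedure described in footnote 8 can be sketched as below. The footnote does not state the sampling distribution for the simulated differences, so normal draws centered on each imputation's difference are an assumption here, as are the input differences and standard errors; `monte_carlo_p` is a hypothetical helper.

```python
import random


def monte_carlo_p(differences, standard_errors, draws=1000, seed=0):
    """Two-tailed p value from stacked simulated differences (see footnote 8)."""
    rng = random.Random(seed)
    # Step 2: simulate `draws` differences per imputed dataset (normality assumed).
    simulated = [rng.gauss(d, se)
                 for d, se in zip(differences, standard_errors)
                 for _ in range(draws)]
    # Step 3: proportion of stacked draws in the less-populated direction.
    above = sum(x > 0 for x in simulated)
    below = len(simulated) - above
    minority_share = min(above, below) / len(simulated)
    # Step 4: double for a two-tailed p value (capped at 1).
    return min(1.0, 2 * minority_share)


# Step 1 inputs (hypothetical): one difference from the benchmark and one
# standard error per imputed dataset, in percentage points.
differences = [-1.5 + 0.1 * ((i % 11) - 5) for i in range(100)]
standard_errors = [0.7] * 100

p_value = monte_carlo_p(differences, standard_errors)
print(p_value)
```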
The close correspondence between the CPS, raw, and imputed estimates can be seen in
Figure 3. The boxplot represents the range of estimates from the imputations, the horizontal red
line denotes the raw data prediction, the horizontal blue line indicates the ancillary estimate, and
the dashed green line is the prediction based on the CPS. From this figure, readers can see that
the raw estimate, on average, was a closer match to the CPS than were the imputations. The
ancillary estimate was more discrepant.
Household Income. Household income distribution estimates varied considerably across
sources, with the imputations most congruent with the CPS, followed closely by the raw
self-reports (Figure 4). Compared to the CPS, households earning under $15,000 and between
8 p values for the multiple imputations were generated in a four-step Monte Carlo process. First, weighted t-tests were used to compare each imputed dataset to the CPS estimate. Second, raw differences and standard errors for the estimates from each imputed dataset were used in a Monte Carlo simulation to produce 1000 simulated differences between each imputation and the CPS (for 100,000 total data points). Third, these Monte Carlo simulations were stacked, and a parameter was calculated as the proportion of the estimates across all simulations in the less-populated direction of difference from the CPS. Finally, this number was doubled to produce a two-tailed p value.
$50,000 and $150,000 per year were overrepresented in the raw data. The CPS, in contrast,
identified more households earning between $25,000 and $50,000 and more than $150,000 per
year than did the raw data. The imputed datasets performed similarly to the raw data. The
imputations corresponded more closely for four income categories and less closely for two
categories. On average, raw self-reports differed from the CPS by 2.74 percentage points (χ²(7) =
84.6, p<.001) and imputations differed from the CPS by 2.47 percentage points (p<.001).
Ancillary data estimates were somewhat more discrepant, differing by an average of 3.36
percentage points (χ²(7) = 2153.9, p<.001).
Household Size. Estimated distributions of household sizes also varied across sources,
with imputations generally the most accurate (Figure 5). All three estimates overstated the
proportion of households with a single resident and understated the proportion of households
with 4 or more residents relative to the CPS. For households with 2 or 3 residents, all three
estimates were reasonably close, with ancillary data closely reflecting the CPS estimate in 2-
resident households and both raw and imputation estimates mirroring the CPS in 3-resident
households. Across household size estimates, differences from the CPS were largest for the
ancillary data, averaging 10.89 percentage points (χ²(4) = 18114.4, p<.001), and smallest for the
imputations, averaging 4.05 percentage points (p<.001). Raw estimates differed from the CPS
by 8.80 percentage points on average (χ²(4) = 789.0, p<.001).
Across Measures. Overall, imputations represented a moderate improvement in
correspondence with the CPS over raw GfK self-reports. Among household-level measures, the
average difference between CPS and imputation estimates was 2.67 percentage points (See
Figure 6). Raw data differed from the CPS by a moderately worse 4.04 percentage points.
Ancillary data diverged from the CPS by an average of 7.52 percentage points.
Including individual-level measures as well, imputations outperformed raw and ancillary
data, though not by an enormous margin (Appendix F). Across all six variables, imputations
differed from the CPS by an average of 2.97 percentage points as compared to 5.26 percentage
points for raw data and 10.12 percentage points for ancillary data. Imputations thus reduced the
differences between the raw data and the CPS by 43.7 percent on average.
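As a check on the arithmetic, the proportional reduction can be recomputed from the rounded averages reported above. With these rounded inputs the result is roughly 43.5 percent; the small gap from the reported 43.7 percent presumably reflects rounding of the averages, although that is an assumption rather than something the text states.

```python
# Average absolute differences from the CPS, in percentage points,
# as reported in the text (rounded values).
raw_avg = 5.26
imputed_avg = 2.97

# Share of the raw-data discrepancy removed by the imputations.
reduction = (raw_avg - imputed_avg) / raw_avg
print(f"Reduction: {100 * reduction:.1f} percent")
```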
Discussion
Although imputations have been touted as a strong improvement over listwise deletion
for all types of missing data (see e.g., King, Honaker, Joseph, & Scheve, 2001) and as a
potentially more effective correction for survey nonresponse (Peytchev, 2012), we find only
moderate efficacy. Imputing estimates to the sample level using ancillary data reduced
discrepancies between self-reported data from a national sample of American households with
sizable nonresponse and that of the very high response rate CPS by less than half. Given that the
weights produced were not designed to correct for nonresponse in the raw data (unlike traditional
raking strategies), our evidence suggests that the imputation strategy employed was not a
particularly effective solution to nonresponse errors in the sample.
These findings suggest one of three possibilities. First, the processes producing
nonresponse errors in the raw GfK sample may approximate MCAR, meaning that non-respondents
could be safely removed via listwise deletion (cf. King et al., 2001). A second
possibility is that the demographic variables used in the current study were insufficient to
substantively improve upon errors in the GfK reports. If this is the case, many such variables
may be necessary for imputation results to represent a large improvement. Finally, the Bayesian
procedures used presume that ancillary and self-report data relate in similar ways among
respondents as they would (were self-reports available) among non-respondents. Biases in the
presence, nature, and correlates of the ancillary data could limit the capacity for data-driven
algorithms to result in accurate population inferences.
General Discussion
The current analyses represent a first foray into understanding the nature of potential
biases when using ancillary marketing data to supplement (or supplant) the traditional survey
process. In our initial analysis comparing six demographic variables across ancillary and survey
data for a probability sample of individuals, we find both hopeful and discomforting signs. The
ability to match ancillary data to the sample and the generally decent correspondence between
ancillary and self-reported data suggest that such data could prove useful for sampling purposes.
We were somewhat concerned, however, both by the large discrepancies that sometimes
emerged and by an inability to clearly characterize the nature of missingness in the ancillary
data.
One of the most notable findings in the current study is the nature of missingness in the
set of ancillary data used. The results from our second study suggest that the ancillary data were
not MCAR and that a small set of standard self-reported demographics was far from sufficient to
treat the missing ancillary data as MAR. Particularly
notable was the correspondence between missingness on many variables and self-report
measures of the same concept. Such biases are pernicious, as they are unlikely to be resolved
with ancillary data alone. Unfortunately, the opaque nature of the ancillary data made it difficult
to diagnose the reasons that such large biases were identified.
Finally, our results suggested that imputation procedures based on the ancillary data did
less than we had hoped to adjust for discrepancies between a low response rate probability
survey and the CPS, a high response rate government sample. This indicated that standard
imputation procedures based on the types of ancillary data used may be insufficient when
attempting to understand attributes of the sample.
The fact that the imputations did not eliminate most of the error in the raw data indicates
one of three likely conclusions: It is possible that nonresponse (both unit and item) may not lead
to substantive bias in the raw self-reports, that the ancillary data used may be insufficient to
correct for the distinctions between samples, or that the imputation procedure itself represents the
wrong model for addressing nonresponse bias. We discuss each of these possibilities and the
implications in turn.
Ignorable Nonresponse?
The possibility that nonresponse is relatively ignorable for a sample like that of the raw
self-reports (at least for these variables) is an intriguing one. The differences between the self-
reports of respondents and the data from the CPS were relatively small (averaging approximately
4 percentage points). Although these differences were statistically significant with a sample of
4,000 individuals across three measures, they were not substantively important. Hence, it might
require a large number of highly discriminating ancillary variables to meaningfully improve the
estimates. Further, some of the error that is present may be attributable to slight differences in
measurement rather than to nonresponse. The only question posed identically by both the CPS
and GfK that did not require weighting by household size was the question about
homeownership, for which the differences were no larger than those expected from sampling
error.
Insufficient Data
The nine ancillary measures used for the imputations seem to have led to only a modest
improvement in the predictions of the imputations over those of the raw self-reports. It may be
the case that more data or a different source of data would be needed to better correct these
estimates. Even if ancillary data sources can help distinguish between respondents and non-
respondents, the distinctions may be relatively poorly predicted through the combination of
ancillary variables presented here (cf. Mustillo, 2012). This would seem unlikely in light of the
sizable capacity for ancillary data to predict unit nonresponse, but it remains possible that a
combination of missing ancillary data (leading to biases in listwise deletion) and the lack of a
consistent correlation between ancillary and self-report measures could undermine the corrective
ability of the imputation procedure. Of course, it is also possible that other sources of ancillary
data might prove better on these metrics.
Imputation Assumptions
The multiple imputation procedure employed here – as for most imputation strategies –
proceeded in a completely data-driven fashion assuming a consistent set of links between
variables. MICE models the error in any given imputation using Monte Carlo simulation results
from the posterior distributions of linear (and logistic) predictions of categorization for each of
the variables used. This procedure is based on a variety of often-implicit assumptions. Notably,
relations between variables among respondents are presumed to perfectly mirror the relations
between variables among non-respondents. Slight discrepancies between interrelations for these
sets of individuals have the potential to skew the results of the imputation, leading to results that
are no more accurate (and occasionally less accurate) representations of the public at large than
those of respondents.
Non-Probability Samples?
The results presented here do not discount the possibility that some set of ancillary data
could operate as a corrective for data from non-probability samples; they do however lend
caution to the endeavor. Imputation procedures only modestly improve on point estimates from
a probability sampling frame, which suggests that they are unlikely to address all differences
between a non-probability sample and the population, at least with the set of variables employed
here. To the extent that researchers are interested in using ancillary data to adjust a non-
probability frame to match the public, additional scrutiny is warranted.
Looking Forward
We hope that the current study raises a number of issues for researchers thinking about
using ancillary data for data collection, nonresponse adjustment, and sampling frame correction.
Specifically, we highlighted notable challenges in the use of at least one source of these data and
caution researchers thinking of using these data to consider carefully the nature of the
assumptions they are making. For example, the capacity for ancillary data to identify hard-to-
reach populations may require less stringent assumptions than would efforts to correct for
unrepresentative sampling. In this vein, it is important for future research to estimate the size of
possible biases from the use of these data for various types of inference and to explore how
broadly the current results can generalize. Only when we can calculate the ways in which use of
a variety of types and sources of data alters total survey error and, more particularly, the
bias-variance tradeoff introduced in various applications can we truly understand where such data
may prove a boon to social research and where they might instead be a hindrance. These
questions, more than ease and ability, should guide practitioners in their decision-making.
It is also important to consider the use of a larger array of ancillary variables in developing
survey correctives. Notably, if imputation is indeed capable of bridging the gaps between
respondents and sample, we should progressively reduce the errors as more ancillary measures
are added. Advances in improving the quality of future consumer file data may also help to
address biases inherent in using such ancillary information. Better data aggregation and systems
for understanding the origins of the data that are available could also provide considerable
leverage.
Finally, future studies should consider higher levels of inference. It is possible that the
procedures used could better correct for results such as trends over time and experimental
interventions (cf. Berrens et al., 2003; Pasek & Krosnick, 2010). In this respect, we remain
optimistic about the potentials for ancillary data as a corrective tool. The current study suggests,
however, that we should remain guarded. Additional evidence is needed to determine conditions
under which ancillary data will reduce nonresponse biases and those for which it may not
represent an improvement.
References
Acxiom (2010). Maximizing Profit with Accurate Customer Data: How next-generation data
quality is improving marketing outcomes. White Paper [Accessed 10/31/12]. Available
Note: Far-off cases are counted when values from self-reported survey and ancillary data differ by more than five years (age), or more than one category (household income, household size, and education). *Ages within one year were considered equivalent.
Table 3. Logistic Regressions Predicting Missing Ancillary Data with Self-Reports
Note: OLS was used to predict the number of missing variables for each individual (Column 7). Number of missing variables ranged from 0 to 6. A negative binomial regression provided a poorer overall fit and was therefore not presented. *p < .05; **p < .01; ***p < .001
Figures
Figure 1. Missing Ancillary Data by Variables and Respondents
Figure 2. Missing Data by Self-Reported Variable Values
Figure 3. Homeownership Estimates Weighted By Household
Figure 4. Income Category Estimates Weighted By Household
Figure 5. Household Size Estimates Weighted By Household
Figure 6. Average Absolute Differences From CPS
Appendix A. Question Wordings for Household-Level Variables
Homeownership (Nominal; 2 categories)
Self-report. Respondents were asked: “Are your living quarters. . . Owned or being
bought by you or someone in your household, rented for cash, or occupied without payment of
cash rent.” Respondents who selected “Owned or being bought by you or someone in your
household” were coded 1, all other answers were coded 0.
Ancillary. Ancillary homeownership data were categorized 1 for homeowners and 0 for
non-owners.
CPS. Respondents were asked: “Are your living quarters. . . Owned or being bought by a
household member, rented for cash, or occupied without payment of cash rent.” Respondents
who selected “Owned or being bought by a household member” were coded 1, all other answers
were coded 0.
Recoding. Survey respondents and CPS respondents reported their homeownership in
three categories, “owned or being bought by you or someone in your household”, “rented for
cash”, “occupied without payment of cash rent”. To facilitate comparisons with the ancillary
data, the self-reported and CPS responses were recoded into two categories, “homeowners” and
“non-owners”.
Household Income (Ordinal)
Self-report. Respondents were asked: “Was your total HOUSEHOLD income in the past
12 months ...” Respondents could choose: “Below $35,000”, “$35,000 or more”, or “Don’t
Know”. Respondents who selected “Below $35,000” were asked: “We would like to get a better
estimate of your total HOUSEHOLD income in the past 12 months before taxes. Was it ...”
Respondents could choose: “Less than $5,000”, “$5,000 to $7,499”, “$7,500 to $9,999”,
“$10,000 to $12,499”, “$12,500 to $14,999”, “$15,000 to $19,999”, “$20,000 to $24,999”,
“$25,000 to $29,999”, or “ $30,000 to $34,999”. Respondents who selected “$35,000 or more”
were asked: “We would like to get a better estimate of your total HOUSEHOLD income in the
past 12 months before taxes. Was it ...” Respondents could choose: “$35,000 to $39,999”,
“$40,000 to $49,999”, “$50,000 to $59,999”, “$60,000 to $74,999”, “$75,000 to $84,999”,
“$85,000 to $99,999”, “$100,000 to $124,999”, “$125,000 to $149,000”, “$150,000 to
$174,999”, or “$175,000 or more”. Responses to all three questions were recoded into eight
categories: “Less than $14,999”, “$15,000 to $24,999”, “$25,000 to $34,999”, “$35,000 to
$49,999”, “$50,000 to $74,999”, “$75,000 to $99,999”, “$100,000 to $149,999”, and “More than
$150,000”.
Ancillary. Ancillary income was coded into categories for “$1,000-$14,999”, “,
Degree (For example: MD,DDS,DVM,LLB,JD)”, and “Doctorate degree (For example: PhD,
EdD)”.
Recoding. Education levels from all three data sources were coded into 5 categories for
“Less than High School”, “High School Graduate”, “Some College”, “College Graduate” and
“Post-Graduate” education levels.
Age
Self-report. Respondents entered their age in an open-ended format.
Ancillary. Data were requested on an individual’s age in each household.
CPS. Respondents were asked: “What is your date of birth?” They were then asked to
verify their age with the question: “As of last week, that would make (name/you) (approximately
(AGE) years old. Is that correct?”
Recoding. Ages for all individuals were coded to range from 18 to 90 in analyses 1 and 2
and from 18 to 80 in analysis 3. To facilitate presentation, all three data sources were also coded
into 7 categories for individuals aged 18-24, 25-34, 35-44, 45-54, 55-64, 65-74, and 75 and
older.
Appendix C. Weighting
Eight sets of weights were produced to match the various data sources with the American
public and to one another. These weights served two purposes. First, we adjusted for
differences between individual-level self-reports and household-level ancillary data. To do this,
we explored four strategies for aggregating information from individuals within a household.
Second, we corrected for unequal probabilities of selection introduced as part of the sampling
procedure. Two techniques were employed to cancel out any biases at the sample and
respondent levels. All combinations of these strategies (four corrections for individual-
household distinction x two levels of generalization) were explored, where appropriate, to ensure
that the results presented were robust to differing choices.
Household level weight
To produce data that could be compared across sources, we needed to match self-reports
to the values in ancillary data. The sampling procedure, however, allowed multiple individuals
from a single household to enter the panel. This introduced three potential problems. First, the
presence of multiple individuals from a single household introduced concerns about the
independence of observations. Second, the results of our analyses might be biased toward
households with multiple representatives. And third, it might be possible that ancillary data
could correctly match one individual in a household while providing an inaccurate portrait of
other individuals in the household. The first and second challenges are easily overcome by
weighting observations at the household, rather than individual, level. The third challenge is
more pernicious and requires that we consider the conditions under which household and
individual data should be considered a match. To circumvent these problems, we created four
sets of weights for respondents:
1. Pure household weight: (1) the weight was coded as the inverse of the number of
respondents in the household.
2. Household adult weight: (1) individuals under 18 were dropped; (2) the weight was
coded as the inverse of the number of respondents in the household.
3. Best ancillary match weight: (1) respondents were asked to provide the names and
ages of all individuals in the household; (2) in households where one of the respondents was
closest to the age9 indicated by the household ancillary data, all other members of the household
were dropped; (3) in households where no respondents were close to the age indicated by the
ancillary data or where multiple respondents were equally close, individuals under age 18 or who
were clearly not the best match were dropped and the weight was coded as the inverse of the
number of remaining individuals in the household.
4. Full household only best ancillary match weight: (1) respondents were asked to
provide the names and ages of all individuals in the household; (2) in households where all
individuals mentioned were respondents, we followed the same procedures as the best ancillary
match weight; (3) individuals in incomplete households were dropped from the analysis.
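The first two weighting schemes in the list above can be illustrated with a short Python sketch. The record layout, field names, and example households are hypothetical, and the sketch assumes that the household adult weight counts only the adult respondents who remain after minors are dropped, which the description leaves implicit. The two best-match weights depend on comparing ages against the ancillary file and are not reproduced here.

```python
from collections import Counter

def household_weights(respondents, adults_only=False):
    """Weight each respondent by 1 / (respondents kept in the household).

    respondents: list of dicts with hypothetical keys 'id', 'household', 'age'.
    adults_only: if True, drop respondents under 18 before counting.
    """
    kept = [r for r in respondents if not adults_only or r["age"] >= 18]
    counts = Counter(r["household"] for r in kept)
    return {r["id"]: 1.0 / counts[r["household"]] for r in kept}

# Invented example: household A has two respondents, household B has one.
panel = [
    {"id": "a1", "household": "A", "age": 40},
    {"id": "a2", "household": "A", "age": 16},
    {"id": "b1", "household": "B", "age": 30},
]
pure = household_weights(panel)                      # pure household weight
adult = household_weights(panel, adults_only=True)   # household adult weight
```

Under either scheme the weights within a household sum to one, so each household contributes equally to household-level estimates.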
Probability of sampling corrections
Probability of sampling corrections were created to adjust for several sources of deviation
from an equal probability of selection. As part of a procedure to increase the number of
respondents in traditionally underrepresented groups, GfK used a stratified sampling technique.
Households were categorized into four groups depending on the age and Hispanic status
indicated in the ancillary data. Sampling probabilities were assigned to oversample households
9 Age was used in these circumstances because it was the most commonly available piece of ancillary information and was the only piece of ancillary information that could be consistently expected to discriminate between members of a household. Other variables were either household-level or would be expected to match multiple household members (e.g. marital status).
that included Hispanics or individuals ages 18-24. Because our goal was to assess whether such
techniques might improve the survey process, we needed to eliminate any biases that might have
been introduced through this sampling procedure. To do so, we used three pieces of information:
the proportion of households out of a random sample of one million with each of the ancillary
demographic characteristics considered (population), the sampling probabilities used to identify
sampled households (sample), and the proportion of respondents in each demographic category
according to the ancillary data (respondents). Two sets of base weights were produced. Sample-
level weights were calculated to invert distortion produced by the mismatch between sample and
population proportions (weight = population / sample). Respondent-level weights were
calculated to match the characteristics of respondents to those of the population (weight =
population / respondents).
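The two base weights defined above are simple ratios within each stratum. The Python sketch below shows the calculation; the stratum labels and all proportions are invented for illustration, and treating the sampling probabilities as stratum shares is an assumption of this sketch.

```python
def base_weights(population, sample, respondents):
    """Return (sample-level, respondent-level) weights per stratum.

    Each argument maps a stratum label to a proportion.
    """
    sample_w = {g: population[g] / sample[g] for g in population}
    respondent_w = {g: population[g] / respondents[g] for g in population}
    return sample_w, respondent_w

# Invented proportions for two hypothetical strata, for illustration only.
population = {"oversampled": 0.25, "other": 0.75}
sample = {"oversampled": 0.5, "other": 0.5}
respondents = {"oversampled": 0.4, "other": 0.6}
sample_w, respondent_w = base_weights(population, sample, respondents)
```

In this invented case the oversampled stratum is weighted down and the remaining stratum is weighted up, which is the correction that both sets of base weights are meant to provide.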
Weights Used by Analysis
Analysis 1. Best ancillary match weights were adjusted to match the respondents for all
analyses presented. Alternate weighting strategies led to the same conclusions.
Analysis 2. Best ancillary match weights were adjusted to match the respondents for all
analyses presented. Alternate weighting strategies led to the same conclusions.
Analysis 3. Best ancillary match weights adjusted for the sample were used to correct for
all imputation and ancillary household-level estimates. Best ancillary match weights adjusted for
responding households were used to correct for raw household-level estimates. For individual-
level analyses, household adult weights were used to avoid bias due to the demarcation of some
individuals as “heads of household”. These weights were then adjusted to match the sample (for
imputation and ancillary estimates) or the household (raw GfK estimates) and were multiplied by
the number of persons in the household to produce individual-level weights. The number of
persons in the household was identified using ancillary data for ancillary estimates, using
imputed self-report data for imputation estimates, and using self-report data for raw estimates.
This set of weights produced the closest correspondence between the imputations and the CPS
among weighting strategies tested.
Appendix D. Comparison of Self-Report and Ancillary Values
Table D1 – Crosstabs Comparing Self-Report and Ancillary Values
Matches Bolded. Percentages are proportion of pairwise complete N (see Table 2).
Figure D1 – Comparisons of Self-Reported and Ancillary Age
Points are jittered to show density. Dashed line indicates 5-year margin.
Appendix E. Imputation Procedures for Analysis 3
Respondent-level variables were translated to match the sample by way of a multiple
imputation procedure. The package MICE: Multivariate Imputation by Chained Equations in R
was used to generate 100 multiply imputed datasets that provided values for all missing self-
report and ancillary variables in the entire sample (N = 26,974).10 MICE uses a Markov Chain
Monte Carlo (i.e. Bayesian) approach to imputation. The algorithm regresses each of the
variables onto all other variables in the dataset and generates a Monte Carlo prediction, based on
that regression, for all missing data points. The algorithm then proceeds iteratively, updating the
predictions for each of the missing variables using predictions from the newly imputed data from
all other variables. Each prediction was a result of up to 20 rounds of this type of imputation.11
The imputation strategy for each variable depended on the coding of that variable. Variables
identified as “Nominal; 2 categories” above were imputed using a logistic regression imputation.
Variables identified as “Ordinal” above were imputed using an ordinal logistic regression.
Variables identified as “Continuous” above were imputed using predictive mean matching.
Notably, the imputation was conducted at the individual, rather than household level, though this
should not represent a substantive drawback given the nature of the imputation procedure.
10 Note that N is this large because multiple respondents were chosen from households. Weighting techniques corrected for this in all analyses (see Appendix C).
11 Because these imputations converge quickly, 5 rounds are considered sufficient for most missing data problems.
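The iterative logic of chained equations can be illustrated with a deliberately simplified sketch. The Python code below is not the MICE package and does not reproduce its interface: it handles only two continuous variables, uses ordinary linear regression plus residual noise, and omits the posterior draws of regression parameters as well as the logistic, ordinal, and predictive mean matching models used in the analysis. All function names and data are hypothetical.

```python
import random
import statistics

def fit_line(xs, ys):
    """Least-squares intercept, slope, and residual SD for y ~ x."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    resid = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    sd = (sum(r * r for r in resid) / max(len(xs) - 2, 1)) ** 0.5
    return intercept, slope, sd

def chained_imputation(data, rounds=20, seed=0):
    """data: dict of two equal-length lists; None marks a missing value."""
    rng = random.Random(seed)
    names = list(data)
    missing = {v: [i for i, val in enumerate(data[v]) if val is None]
               for v in names}
    # Start by filling every gap with the observed mean of its variable.
    filled = {}
    for v in names:
        mean = statistics.fmean(val for val in data[v] if val is not None)
        filled[v] = [mean if val is None else val for val in data[v]]
    for _ in range(rounds):
        for target in names:
            predictor = next(v for v in names if v != target)
            miss = set(missing[target])
            obs = [i for i in range(len(filled[target])) if i not in miss]
            # Regress the target on the other variable using cases where
            # the target was observed (the predictor may hold imputations).
            a, b, sd = fit_line([filled[predictor][i] for i in obs],
                                [filled[target][i] for i in obs])
            # Replace each missing target value with a noisy prediction.
            for i in missing[target]:
                filled[target][i] = a + b * filled[predictor][i] + rng.gauss(0, sd)
    return filled

# Invented data in which y is exactly twice x and one y value is missing.
data = {"x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
        "y": [2.0, 4.0, None, 8.0, 10.0, 12.0]}
completed = chained_imputation(data)
```

Running the function repeatedly with different seeds would yield a set of completed datasets, analogous in spirit to the 100 imputed datasets described above.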
Appendix F. Individual-Level Imputation Results
Marital Status (Individual). Marital status estimates were the most accurate in the
imputed data, followed by raw self-reports and the ancillary data (Figure F1). The imputations
differed from the CPS estimate by an average of 2.1 percentage points (p=.03). The raw
estimates differed from the CPS by an average of 8.7 percentage points (t=9.0, p=.01). And the
ancillary estimates differed from those of the CPS by more than 30 percentage points (t=102,
p<.001). All of the imputation estimates were closer to the CPS estimate than was the raw GfK
estimate.
Education (Individual). All three measures overestimated the proportion of individuals
with some college education and underestimated the number of individuals with a high school
degree relative to the CPS. For the proportion of individuals with college or graduate
school degrees, both imputation and ancillary estimates closely matched the CPS, whereas the raw GfK data
somewhat overestimated it (Figure F2). Imputation estimates differed from the CPS
by an average of 5.1 percentage points, whereas raw GfK estimates differed from the CPS by 7.2
percentage points. Despite only reporting education for heads of households, ancillary estimates
most closely matched the CPS, differing by an average of only 3.5 percentage points. All of
these differences were highly statistically significant (ps<.001).
Age (Individual). Population age estimates from the three measures were all relatively
close to the CPS, differing by an average of only 2.57 percentage points for the imputations, 3.57
percentage points for the raw data, and 3.99 percentage points in the ancillary data (Figure F3;
ps<.001).
Figure F1. Marital Status Estimates Weighted By Individual
Figure F2. Education Estimates Weighted By Individual