European Survey Research Association

Can Microtargeting Improve Survey Sampling?

An Assessment of Accuracy and Bias in Consumer File Marketing Data

Josh Pasek

University of Michigan, Ann Arbor

S. Mo Jang

University of Michigan, Ann Arbor

Curtiss Cobb

GfK

Charles DiSogra

Abt SRBI

J. Michael Dennis

GfK

Please direct all correspondence to Josh Pasek, Assistant Professor of Communication Studies, University of Michigan, 5413 North Quad, 105 S. State Street, Ann Arbor, MI 48109. [email protected]. The authors thank Mike Traugott, Mansour Fahimi, and Michael Link for helpful comments and advice.


RUNNING HEAD: Supplementing Surveys with Consumer File Data


Abstract

With the move to address-based sampling, survey firms can now purchase data from

commercial marketing companies to supplement information about sampled households.

Because such ancillary data can provide information about non-respondents as well as

respondents, these data could aid nonresponse adjustment, targeted sampling, and sampling

frame corrections. For these potentials to be realized, it is important to document both accuracy

and missingness in ancillary data and to assess the use of ancillary data as a corrective tool. The

current study uses a unique dataset linking survey self-reports derived from an address-based

sample with one set of demographic ancillary data collected on all sampled individuals (not just

respondents). Exploring correspondence and discrepancies between the data sources, this study

identifies patterns of systematic inaccuracy and nonignorable missingness threatening

conclusions derived using the consumer file data. Multiple imputation is also used to generate

demographic information for all individuals in the sampling frame based on the ancillary data.

Comparing imputations with benchmark estimates reveals that matching strategies only partially

address biases due to unit non-response. Imputed demographics varied in their accuracy,

indicating that the imputation strategy was insufficient. Overall the results suggest that consumer

file ancillary data may be limited in its ability to improve population inference.



Can Microtargeting Improve Survey Sampling? An Assessment of Accuracy and Bias in Consumer File Marketing Data

With the increasing prevalence of address-based sampling (ABS), a new potential for

assessing and addressing survey nonresponse has emerged as a function of computerized data

collection and the rise of large consumer file marketing databases. Survey firms can purchase

data about all of the individuals in the sample, not just those who respond. Treated as an

ancillary data source, consumer file data could provide a window into factors related to each

individual’s propensity to respond. If each individual’s likelihood of response could be

determined without an interview, the predictors of responsiveness could improve sampling

techniques as well as our ability to measure and adjust for nonresponse bias (cf. Groves, 2006;

Maitland, Casas-Cordero, & Kreuter, 2009; Peytchev, 2012; Smith, 2011). In this manner,

ancillary data could improve upon classical weighting techniques (see e.g., Peytchev, 2012).

Before researchers use consumer file data to supplement survey samples, it is imperative

that we evaluate the accuracy and potential biases inherent in their use. To date, no research has

systematically examined the quality of consumer file ancillary data and assessed potential biases

that may be introduced where these ancillary data are missing. The lack of assessments in this

area could prove problematic if ancillary data are used for purposes to which they are ill suited.

To address this gap, the present study examines missingness in one set of consumer file ancillary

data using a unique dataset from GfK that links an ABS sample with ancillary demographic data

from marketing companies. Specifically, we assess how the consumer file ancillary data relate to

self-reported responses. In addition to exploring the extent of correspondence and missingness,

we conduct imputations to determine whether these ancillary data could improve nonresponse



adjustments. To assess this, we impute values for self-reported variables among all sampled

units and compare those values with Current Population Survey (CPS) estimates.

Uses of Ancillary Data

To facilitate marketing efforts, commercial companies collect a wide range of

information about all Americans that is linked to each individual’s residential address. These

ancillary variables range from demographics, including name, age, race, income, and education,

to financial and commercial information, such as credit scores and cable subscriptions. Although

these data were originally collected for marketing purposes, survey firms and researchers have

found the data increasingly useful with the resurgence of ABS (cf. Sarndal & Lundstrom, 2005;

Smith, 2011). Linking ancillary data to the units sampled in a survey, researchers may utilize

rich information about unreached individuals to correct for various survey errors, including

nonresponse and unrepresentative sampling frames (cf. Boehmke, 2003; Sakshaug & Kreuter,

2012). We briefly discuss potentials for nonresponse adjustment, targeted sampling, and

sampling frame corrections.

Nonresponse adjustment. Any information that can be gathered on non-respondents has

the potential to mitigate survey bias (see e.g., Dixon & Tucker, 2010; Smith, 2011). Researchers

armed with ancillary data may be better able to estimate response propensities and examine the

relations between response propensities and survey variables (Groves, 2006; Kreuter & Olson,

2011; Little & Vartivarian, 2005). Beyond detecting nonresponse bias, ancillary data could

allow for more discretion in producing nonresponse adjustments (see e.g., Boehmke, 2003;

Brehm, 1999). For example, researchers might use ancillary data to create weights to match the

sample instead of the population or to develop imputation strategies (Peytchev, 2012; Smith &

Kim, 2009).



Targeted data collection. Ancillary data could also increase the efficiency of data

collection. Accurate ancillary data would allow researchers to know about key characteristics of

households selected for sampling. Such information could allow for targeted oversampling of

households in typically underrepresented population groups. Coupling demographic information

with knowledge of recruitment rates by group, researchers could sample groups in inverse proportion to

their expected response rates, limiting the need for post hoc adjustments. Similarly,

demographic information acquired prior to data collection could streamline survey

administration. Ancillary data might allow firms to prepare for Spanish-speaking respondents,

difficult-to-reach individuals, or those who require special survey accommodations. Accurate

ancillary data put to each of these uses could reduce costs and improve statistical efficiency in

the sampling process.
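
As a rough illustration of the allocation idea described above, the following sketch computes how many addresses would need to be drawn from each group so that expected completed interviews fall in proportion to population shares. The group labels, population shares, and response rates are hypothetical placeholders rather than values from this study, and the approach assumes that the ancillary group labels are accurate.

```python
def addresses_to_sample(target_completes, pop_shares, expected_rr):
    """Addresses to draw per group so that expected completes are
    proportional to population shares (hypothetical helper)."""
    plan = {}
    for group, share in pop_shares.items():
        wanted = target_completes * share            # completes needed from this group
        plan[group] = round(wanted / expected_rr[group])  # inflate by 1 / response rate
    return plan

# Hypothetical inputs, for illustration only.
pop_shares = {"under_35": 0.30, "35_to_64": 0.50, "65_plus": 0.20}
expected_rr = {"under_35": 0.05, "35_to_64": 0.10, "65_plus": 0.20}

plan = addresses_to_sample(1000, pop_shares, expected_rr)
print(plan)
```

Groups with lower expected response rates receive proportionally more sampled addresses, which is why errors in the ancillary labels would translate directly into a misallocated sample.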

Sampling frame corrections. Finally, ancillary data could help researchers supplement a

biased or unknown sampling frame to make the respondents resemble the entire population.

Ancillary data could reveal systematic distinctions between households with telephones and

those that do not have them in ways that might improve inferences from random digit dial

samples. More dramatically, matching schemes based on ancillary data might even allow

researchers to impute population parameters using nonprobability samples (see e.g.,

Ansolabehere & Hersh, 2012; Rivers, n.d.). By linking respondents with a representative set of

individuals in the target population, ancillary data could potentially build projections that might

mirror society at large.

Quality of Ancillary Data

The advantages of using ancillary data depend on a variety of assumptions being met. If

ancillary data are systematically inaccurate, sampling strategies based on ancillary data could



lead to invalid inferences. If ancillary data are missing for select groups of individuals,

attempted corrections for nonresponse may not improve survey estimates. And if interrelations

among ancillary data variables systematically differ between respondents and non-respondents,

the use of the additional data may introduce new sources of error (cf. Peytchev, 2012). These

potential challenges highlight a need for systematic evaluation of the accuracy and potential

biases in consumer file data. We can confidently use these data only when we can reasonably

identify (and ideally address) the biases that might occur.

Understanding the quality and accuracy of consumer file data is a critical step in using

these data to improve survey methods. Unfortunately, the purveyors of consumer file data

provide little information on the original source of these variables. As Smith (2011, p. 393)

notes, “many databases are ‘black boxes’ that do not disclose how they are constructed and what

rules are followed.” It is unclear how many sources of data were aggregated, how discrepancies

were identified and prioritized, what modes of data collection were used, and the extent to which

the data presented represent inferences rather than observations. Hence, practitioners should be

concerned that the quality of data may fluctuate both across variables and across respondents in

ways that could introduce systematic biases.

To date, no research has systematically evaluated the quality of ancillary data for survey

purposes.1 It is unclear how frequently ancillary data will correspond with self-reports and much

remains unknown about the scale of errors in ancillary data and the conditions under which

1 One earlier conference proceeding by DiSogra, Dennis, and Fahimi (2010) compared ancillary data to respondents’ self-reports and found that the match between the two was inconsistent. For example, ancillary data and self-reports largely corresponded in assessing homeownership, with 93 percent of individuals identified as homeowners in the ancillary data reporting that they actually owned their homes. In contrast, household income was poorly predicted by the ancillary data, matching respondents’ self-reports only around 50 percent of the time. Although the previous study represents an important first step, we could not identify any research that has appeared in the journal literature and many important questions remain unanswered.



ancillary data are more or less accurate. Inaccurate information could result in misleading

conclusions and could undermine sampling that relies on ancillary data.

Another concern stems from missing information. If households lacking ancillary data

on any particular variable differ systematically from households where those data are present,

sampling strategies relying on those data could lead to biased outcomes. Biases that stem from

missing information present particular challenges; if we cannot confidently predict their

occurrence, they may be impossible to correct for. Hence, analyses based on tenuous

assumptions about why ancillary data may be missing could lead to even greater total errors. It

is therefore important that we evaluate the characteristics of missingness in ancillary data. To do

so, however, we need datasets that provide reliable information about households for which

ancillary data are absent.

Present Study

The current research represents a first foray into evaluating the quality of one set of

consumer file ancillary data and addressing questions about the value of these data as a

nonresponse corrective. Using a unique dataset from GfK that links an address-based sample

(ABS) with ancillary demographic data from marketing companies, we first evaluate the

correspondence between ancillary data and self-reports by comparing survey responses in the

ABS sample to ancillary estimates about those same households for the same variables (Analysis

1). Then, we assess the nature of incompleteness in ancillary data by investigating the three

possible conditions of missingness: missing completely at random (MCAR), missing at random

(MAR), and nonignorable (Analysis 2). These evaluations primarily concern whether

inconsistencies between ancillary and self-report data and missingness appear to result from

systematic or random processes.
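
For readers less familiar with these terms, the three conditions can be stated compactly in conventional missing-data notation. The symbols are introduced here for illustration only and are not used elsewhere in the paper: R indicates whether an ancillary value is missing, Y_obs denotes the observed data, and Y_mis denotes the values that are missing.

```latex
\begin{align*}
\text{MCAR:} \quad & P(R \mid Y_{\text{obs}}, Y_{\text{mis}}) = P(R) \\
\text{MAR:} \quad & P(R \mid Y_{\text{obs}}, Y_{\text{mis}}) = P(R \mid Y_{\text{obs}}) \\
\text{Nonignorable:} \quad & P(R \mid Y_{\text{obs}}, Y_{\text{mis}}) \text{ depends on } Y_{\text{mis}}
  \text{ even after conditioning on } Y_{\text{obs}}
\end{align*}
```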



Inconsistencies between ancillary and self-reported results do not necessarily doom

attempts to correct for sampling biases with ancillary measures. Corrective tools may still work

if researchers can specify a model for when and where errors occur. To this end, a third analysis

investigates whether consumer file data may provide a useful strategy for addressing survey

nonresponse (Analysis 3). Using multiple imputation, we predict values for self-reported

variables among all sampled households, rather than simply among survey respondents. We then

compare the results of these imputations both to the parameter estimates obtained from the CPS

and to the estimates generated under the assumption that unit nonresponse was ignorable. We

illustrate circumstances under which imputations produced results that were more or less

representative and discuss the implications for the use of such strategies with both probability

and non-probability sampling designs.

Methods

Sample

Data for all analyses come from GfK. In January of 2011, GfK used the U.S. Postal

Service’s Computerized Delivery Sequence File (CDSF) to choose 25,000 random addresses that

would be recruited by mail (with telephone follow-ups where numbers were available) to join

KnowledgePanel®, an online probability-based sample of Americans. The CDSF covers well

over 95 percent of American households, making it one of the broadest potential sampling

frames (Iannacchione, 2011). Because of the breadth of the sampling frame, we could expect that

a sample with a 100% response rate among selected households would closely mirror the population

of all American households.



Of 25,000 sampled addresses, 2498 households were successfully recruited to join the

panel, a household response rate of 10.0% (AAPOR RR1). Multiple individuals were allowed to

sign up for the panel from each of these households. In total, 4472 individuals were recruited

from these households with the median household yielding two respondents.2

Because surveying for KnowledgePanel® is completed online, GfK provides a laptop

computer, Internet access, or both to panel members for whom these services are not already

available. Self-report data came from the Core Adult Profile, the first survey respondents

completed upon admission to the panel.

Ancillary Data

Ancillary data were matched to all 25,000 households in the ABS sample. These data

were provided by the Marketing Systems Group (MSG). MSG uses data from several databases

to compile information on American addresses. Data for the current study were originally sourced

from infoUSA, Experian, and Acxiom. Since the consumer file data were themselves produced

through a combination of aggregation and inferential techniques, it was impossible to trace the

source of any particular piece of information about a particular household.

Despite the opaque nature of individual ancillary measures, the firms providing the

information suggest that these data are ideal for tracking and identifying Americans. Acxiom

(2010) claims its data “covers more than 99% of marketable addresses worldwide” (p. 6) and

incorporates regular updates from the U.S. Postal Service. Experian (2012) notes that it excels at

linking identities between social media sites, phone numbers (landline and mobile), work and

home addresses, email accounts, and other online identifiers. And InfoUSA (2012) monitors

voter registration, utility, and real estate data to compile information, with monthly updates to

2 The mean number of respondents per household was 1.79. Fewer than 1% of all households yielded more than 4 respondents and no households yielded more than 10.



keep records current. Aggregating across these sources seems like a very effective way to keep

track of the American public.3

CPS Benchmarks

Benchmark values for Analysis 3 came from the January 2011 CPS. The CPS included

data from 134,090 individuals in 53,867 American households with a response rate of 91%. The

Census Bureau provided weights to link data for sampled households to all American households

(and individuals). CPS numbers were weighted to match the household distribution unless stated

otherwise.

Variables

Homeownership, household income, and household size were measured in all three

datasets at the household level. Full question wordings and coding for household-level variables

in the GfK, ancillary, and CPS data are shown in Appendix A.4 Marital status, education, and age

were measured in all three datasets at the individual level. Question wordings and coding for

these variables are shown in Appendix B. Table 1 presents descriptive statistics for all measures

among respondents.

Weighting

Eight sets of weights were generated to assess correspondence between self-reported,

ancillary, and CPS estimates across the three analyses. No substantive differences were observed

3 Because of the nature of these data, we are not able to diagnose the source of discrepancies between ancillary and self-reported data. We also cannot conclude that other sources of ancillary data would not produce different results. Such comparisons should be a subject of continuing research.

4 Substantive variables used at both the household level and the individual level were those for which measurement categories could be matched across self-report, ancillary, and CPS measures. Three ancillary measures (presence of a telephone, racial/ethnic status, and number of children in the household) were excluded because the match between these measures in the datasets would have been inconsistent. They were nonetheless used in the imputations in Analysis 3 because additional variables should only serve to improve the fit of the multiple imputation procedure.



between weighting strategies, which led us to present only a single combination for each

analysis. Full information on weighting is presented in Appendix C.

Analysis 1: Correspondence Between Ancillary Data and Self-Reports

Analytic Method

To evaluate the ancillary data, we compared the values of variables in the ancillary data

with corresponding measures in the self-reported survey results. This evaluation proceeded in a

three-step process. We first assessed the extent to which self-reported and ancillary data

revealed matching information about households (and heads of household). High match rates

would indicate relatively accurate ancillary data, whereas low match rates would call that accuracy into

question.5 Second, to understand whether the ancillary and self-report data differed in systematic

ways, we explored whether ancillary measures tended to overestimate or underestimate

corresponding self-reports. If overestimates and underestimates were disproportional, the results

would suggest that misclassifications were a product of bias rather than stochastic error. Finally,

for measures with more than two categories, we assessed the proportion of responses that

differed by a large margin – defined as greater than 5 years for age and two or more categories

for ordinal measures. These “far-off” cases were unlikely to be a product of ancillary data that

were simply out of date.
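
The three steps can be sketched as follows for a single ordinal variable. This is a minimal illustration under stated assumptions, not the code used for the analyses: values are assumed to be integer category codes, missing values are assumed to be stored as None, and the example codes are hypothetical.

```python
def correspondence(self_report, ancillary, far_off=2):
    """Compare paired ordinal codes; pairs with a missing value (None) are skipped."""
    pairs = [(s, a) for s, a in zip(self_report, ancillary)
             if s is not None and a is not None]
    n = len(pairs)
    match = sum(a == s for s, a in pairs) / n
    over = sum(a > s for s, a in pairs) / n     # ancillary value above the self-report
    under = sum(a < s for s, a in pairs) / n    # ancillary value below the self-report
    far = sum(abs(a - s) >= far_off for s, a in pairs) / n  # "far-off" cases
    return {"match": match, "over": over, "under": under, "far_off": far}

# Hypothetical education codes (1 = lowest category, 5 = highest).
self_report = [3, 4, 2, 5, 1, 3, 4, None]
ancillary = [3, 2, 2, 4, 3, 3, None, 2]

result = correspondence(self_report, ancillary)
print(result)
```

Unequal "over" and "under" shares are the signal of directional bias described in the second step; roughly equal shares would be more consistent with stochastic error.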

Results

Correspondence between ancillary and self-reported estimates differed markedly across

variables. The reports largely matched in cases like homeownership, marital status, and age,

5 Although some of these discrepancies could occur as a product of inaccuracies in survey estimates, we think this is not a major concern for two reasons. First, there is little evidence of systematic biases in reporting demographic variables in surveys. Second, web-based survey administration appears to minimize biases (see Chang & Krosnick, 2010), further mitigating this likelihood.



with 88.9%, 72.3% and 70.4% agreement respectively (Table 2). In contrast, estimates of

household income, household size, and education differed enormously between the data sources

(with 22.8%, 32.1%, and 38.9% matching). There was no apparent pattern discriminating

between variables with high and low correspondence.

Ancillary and self-reported measures corresponded at different rates across levels of the

same variables, a pattern indicative of bias. Looking only at households with both types of data,

homeownership, marital status, and income were consistently overestimated in the ancillary data

relative to the self-reports. 24.7% of households self-reported that they did not own their homes,

but fully 28.6% of these cases were classified as homeowners in the ancillary data. In contrast,

only 5.3% of self-reported homeowners were classified as non-owners in the ancillary data.

Perhaps most troublingly, of the 40.6% of individuals who reported they were unmarried in the

self-reports, more than half (51.0%) were classified as married in the ancillary data. Yet among

individuals reporting that they were married, 88.2% were similarly classified in the ancillary

data.
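
As a consistency check, the overall agreement rates for the two binary variables can be recovered from the conditional rates reported in this paragraph. The short calculation below uses only figures given in the text; the function name is ours.

```python
def overall_agreement(share_a, correct_given_a, correct_given_b):
    """Overall match rate for a binary variable, from the share of group A
    and the conditional match rates within groups A and B."""
    return share_a * correct_given_a + (1 - share_a) * correct_given_b

# Homeownership: 24.7% self-reported non-owners, 28.6% of whom were misclassified;
# 5.3% of self-reported owners were misclassified.
home = overall_agreement(0.247, 1 - 0.286, 1 - 0.053)

# Marital status: 40.6% self-reported unmarried, 51.0% of whom were misclassified;
# 88.2% of self-reported married individuals were classified correctly.
marital = overall_agreement(0.406, 1 - 0.510, 0.882)

print(f"homeownership: {home:.1%}")
print(f"marital status: {marital:.1%}")
```

The results reproduce the 88.9% and 72.3% agreement rates reported above, which shows that the high overall agreement for these variables is driven largely by the majority category.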

Self-report and ancillary estimates did not always diverge in similar patterns. Individuals

tended to self-report greater educational achievement than was apparent in the ancillary data and

discrepancies in reports of household size and age did not skew in a single direction. Overall,

there did not seem to be a clear pattern for when the estimates from the two data sources differed

in these manners (Table 2; Appendix D).

Large discrepancies between data sources emerged frequently for most variables. 44.1%

of households reported income that differed from the category suggested in the ancillary data by

more than $10,000 per year (Table 2). The number of occupants reported by a household

differed by 2 or more individuals from the ancillary estimate in 32.3% of cases. One in five



individuals reported an education level that differed by two or more categories from the ancillary

estimate. And 18.6% of self-reported ages differed from the ancillary estimate by six or more

years, even though cases were selected for the closest possible age match. Such large

discrepancies seem unlikely to have emerged from slightly outdated consumer file data.

Discussion

The correspondence between the ancillary data and self-reports varied across the six

variables. Disparities between self-report and ancillary results were fairly large for income,

household size, and education; the two sources were generally more consistent for

homeownership, though notable biases emerged.6 At a minimum, the findings indicate that the

ancillary data used may not be particularly accurate in their description of individual or

population parameters. Ancillary information was, however, considerably better than chance

determinations for all variables.

In considering uses of ancillary data, current results present a mixed picture. For certain

low-responding groups such as young Americans, increasing sampling rates, even at the cost of

some bias, could lead to more accurate overall estimates. Sampling firms and researchers using

such data, however, need to exercise caution. The bias-variance tradeoff induced by

oversampling in this manner will vary depending on the predictions being made. More

troublingly, researchers ignoring biased oversamples in making inferences could reach

inaccurate conclusions concerning both oversampled groups and society as a whole.

Analysis 2: Missingness in Ancillary Data

Analytic Method

6 In these respects, our results are largely a replication of DiSogra, Dennis, and Fahimi (2010) with a newer but very similar dataset.



Three tests were used to assess the scope and nature of missingness in the ancillary data.

First, we examined the extent of missingness in the ancillary data for each variable and across

individuals. Second, we compared missingness in ancillary data variables to self-reports for those

same measures. Differential missingness across self-report categories would provide strong

evidence of nonignorable missingness. Finally, we used logistic regressions to predict the

presence of missingness for each of the ancillary variables based on the values of self-report

measures and an OLS regression to predict the number of ancillary variables for which data were

missing across individuals. Presumably, if ancillary data were MCAR, missing data should not

be concentrated among specific individuals, should be unrelated to self-reports of those same

variables, and should be impossible to predict with the logistic regressions. If ancillary data

were MAR conditional on observable variables, we should see a strong ability for the logistic

regressions to predict missingness. Strong relations between self-reports and ancillary

missingness on the same variables, coupled with a relative inability to predict missingness in

regressions, would indicate that we did not have a good handle on when ancillary data were

missing.

Missingness

Missingness indicator variables were created to identify cases missing ancillary data for

each of the variables of interest (0 = presence of data, 1 = absence of data). A total missingness

variable was defined as the number of ancillary measures missing for each individual (ranging

from 0 to 6).
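
A minimal sketch of this coding, assuming each ancillary record is a dictionary in which absent values are either omitted or stored as None. The variable names are hypothetical labels for the six measures, and the example record is invented.

```python
ANCILLARY_VARS = ["homeownership", "income", "household_size",
                  "marital_status", "education", "age"]

def missingness_indicators(record):
    """Return a 0/1 flag per ancillary variable (1 = value absent) and their total."""
    flags = {f"miss_{v}": int(record.get(v) is None) for v in ANCILLARY_VARS}
    flags["total_missing"] = sum(flags.values())  # ranges from 0 to 6
    return flags

# Hypothetical record: income and education are None, age is not present at all.
record = {"homeownership": "own", "income": None, "household_size": 3,
          "marital_status": "married", "education": None}

flags = missingness_indicators(record)
print(flags)
```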

Descriptive Statistics

Ancillary data were missing for a large number of cases. On average, 18.0% of

households were missing data for any given ancillary data variable. Missingness varied



considerably across variables, from only 6.1% of cases missing for household income to 28.3%

of households missing age information (Figure 1a).

We also explored how missing ancillary data varied across respondents. Although the

modal respondent was not missing data for any of the ancillary variables (45.7%; Figure 1b),

missingness was not concentrated in a few households. Only 8.2% were missing data for more

than two variables. Of households missing data, the vast majority was missing information for

only a single variable. To examine whether individuals lacking data on one variable were also

likely to have missing data on other variables, we conducted a reliability test among six

missingness indicator variables. Cronbach’s Alpha was .70, indicating a moderate consistency

among missingness indicators, but not enough to consider them a single factor.
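
For reference, the reliability statistic can be computed directly from the 0/1 indicators. The sketch below implements the standard formula for Cronbach's alpha using only the Python standard library; the four example rows with three indicators each are hypothetical and far smaller than the data analyzed here.

```python
from statistics import pvariance

def cronbach_alpha(rows):
    """Cronbach's alpha for a list of cases, each a list of k item scores."""
    k = len(rows[0])
    item_vars = [pvariance([row[j] for row in rows]) for j in range(k)]
    total_var = pvariance([sum(row) for row in rows])  # variance of the summed score
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical missingness indicators (1 = ancillary value absent).
rows = [[0, 0, 0],
        [1, 1, 1],
        [0, 0, 0],
        [1, 1, 0]]

alpha = cronbach_alpha(rows)
print(round(alpha, 3))
```

Population variances are used throughout; because the same denominator appears in both the item and total variances, the result is identical to the version based on sample variances.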

Ancillary Missingness by Self-Report Value

Missing ancillary data appeared distinctly non-random when compared with self-reported

household status on the same variables. Ancillary homeownership was missing for 7.3% of

self-reported home-owning households and 29.5% of non-owning households (Figure 2a; χ²(1) =

181.4, p < .001). Ancillary income was missing systematically for households with lower

reported incomes. Ancillary data were missing for 10.4% of households reporting an income

below $35,000 per year but for only 2.7% of households with

incomes above $75,000 (Figure 2b; χ²(7) = 32.5, p < .001). Smaller households were missing

more information about household size than were larger households (Figure 2c; χ²(4) = 15.1,

p < .01). These results refuted the possibility that ancillary data were MCAR and indicated that

missingness in ancillary household data was likely nonignorable.

Rates of missingness in individual-level ancillary data also frequently depended on self-

reports for the same variables. When respondents reported that they were married, only 13.5%



of ancillary marital status data were missing. In contrast, 28.2% of ancillary marital information

was missing among unmarried individuals (Figure 2d; χ²(1) = 59.8, p < .001). Variation in

missingness across self-reported education levels was not statistically significant (Figure 2e;

χ²(4) = 5.8, p = .21). Finally, missing ancillary age information was more common among

younger individuals. For individuals aged 18-24, 40% of ancillary data were missing; only

14.0% of ancillary age data were missing for 55-64 year olds (Figure 2f; χ²(6) = 188.2, p < .001).

As with household-level variables, missing data for individual-level variables appeared

systematic.
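
The comparisons in this section are tests of independence between a self-report category and an ancillary missingness indicator. A minimal, unweighted sketch of the Pearson chi-square statistic is given below. The cell counts are hypothetical, and the tests reported here may have incorporated weights or design adjustments that this sketch omits.

```python
def pearson_chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency
    table given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n  # count expected under independence
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

# Hypothetical counts: rows = self-reported owner / non-owner,
# columns = ancillary homeownership present / missing.
table = [[90, 10],
         [70, 30]]

stat, df = pearson_chi_square(table)
print(stat, df)
```

A large statistic relative to its degrees of freedom indicates that missingness differs across self-report categories, which is the pattern inconsistent with MCAR.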

Regressions

To understand patterns of missingness, regressions predicted the presence of missingness

for each of the ancillary data measures and the total number of missing ancillary variables as a

function of each respondent’s status on self-reported demographics. To conduct these

regressions, multiple imputations were used to account for missingness in self-reported values

among all respondents.7 All regression results were weighted.

Predictors of missing ancillary data varied depending on the missingness indicator being

predicted. Three self-reported demographics predicted multiple missingness indicators:

homeownership, household size, and ethnoracial identity. Compared to non-owners, self-

reported owners were less likely to lack ancillary data for homeownership, household income,

household size, marital status, and age (Table 3). The self-reported number of individuals in the

household predicted missing ancillary income, household size, and marital status, with larger

households translating into a reduced likelihood of missingness. Ethnoracial identity predicted

7 Multiple imputations were conducted using Multiple Imputations via Chained Equations, predicting each missing value with all self-reports and ancillary demographic variables. Running these same regressions without imputations led to the same conclusions. Imputed versions were used in case missingness in self-reports biased the process. Imputations were conducted to mirror the full set of respondents (not all sampled households).



missingness inconsistently. Black Americans were more likely than Whites to lack ancillary homeownership

data. Hispanic and multiracial individuals were missing ancillary age information more

frequently than were other individuals. Self-reported marital status, education, and age each

predicted one of the missingness indicators. Married individuals were, ceteris paribus, less

likely to be missing information on marital status. Individuals who reported that they were

younger or had lower education were more likely to be missing ancillary age information.

Despite significant predictors for most missingness indicators, missing ancillary data were

not well predicted in the current analyses. Pseudo R²s for the missingness indicators were

always below .20 and the full list of covariates improved prediction for only one of the six

indicators: missing ancillary age information (Table 3).
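
To make the fit statistic concrete, the sketch below computes one common pseudo R² for a logistic model. The paper does not specify which variant was used, so McFadden's version is an assumption here, and the outcomes and predicted probabilities are hypothetical.

```python
import math

def log_likelihood(y, p):
    """Bernoulli log-likelihood of 0/1 outcomes y under predicted probabilities p."""
    return sum(math.log(pi) if yi == 1 else math.log(1 - pi)
               for yi, pi in zip(y, p))

def mcfadden_r2(y, p_model):
    """McFadden's pseudo R-squared: 1 - LL(model) / LL(intercept-only model)."""
    base_rate = sum(y) / len(y)
    ll_null = log_likelihood(y, [base_rate] * len(y))
    return 1 - log_likelihood(y, p_model) / ll_null

# Hypothetical missingness indicator (1 = ancillary value absent).
y = [1, 0, 0, 0, 1, 0, 0, 0]
p_uninformative = [0.25] * 8                            # same as the base rate
p_better = [0.6, 0.1, 0.2, 0.1, 0.5, 0.2, 0.3, 0.2]     # hypothetical fitted values

r2_null = mcfadden_r2(y, p_uninformative)
r2_better = mcfadden_r2(y, p_better)
print(round(r2_null, 3), round(r2_better, 3))
```

A model whose predictions equal the overall missingness rate scores zero, so values near zero indicate that the covariates add little beyond knowing how often data are missing.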

Predictions of the number of ancillary measures missing revealed similar challenges.

Although evidence suggested that missingness was related to less self-reported homeownership,

smaller household sizes, unmarried status, and relative youth (Table 3, column 7), missingness

remained poorly explained. These variables again captured only a small portion of the variation

across individuals (R² = .11).

Discussion

Results of Analysis 2 suggested that missing ancillary data likely represented a

nonignorable source of bias. Ancillary data were missing systematically, but not in patterns that

could be predicted with simple covariates or regression techniques. Because missing data

appeared to violate both MCAR and MAR assumptions, the use of ancillary data for analytic

techniques could result in substantive error for all purposes we have considered.

The most pernicious of our results indicated that missingness in ancillary data was often

related to self-reported values of the same variables. Because ancillary missingness was

correlated with self-reports, sampling strategies based on ancillary data are likely to result in

sampled units that are differentially accurate across different variables. Rather than reducing the

weights required for a sample, oversampling with the use of ancillary variables

necessitates a two-stage weighting process whereby oversampling ratios must be corrected for

before weighting to population targets. Depending on the specific variables and sampling ratios

in play, such corrections could sometimes increase the variance and design effect of sample

weights instead of reducing them.
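The two-stage logic can be sketched with Kish's approximate design effect for unequal weights, n·Σw² / (Σw)². This is an illustration only: the strata, sampling rates, and adjustment factors are hypothetical, and the study does not state that this formula was used.

```python
def kish_design_effect(weights):
    """Kish's approximate design effect from unequal weights: n * sum(w^2) / (sum(w))^2."""
    n = len(weights)
    return n * sum(w * w for w in weights) / sum(weights) ** 2

# Stage 1 (hypothetical): a stratum flagged by ancillary data is sampled at
# twice the base rate, so its base weight (inverse relative rate) is 0.5.
sampling_rate = {"flagged": 2.0, "other": 1.0}
sample = ["flagged"] * 400 + ["other"] * 600
base_weights = [1.0 / sampling_rate[s] for s in sample]

# Stage 2 (hypothetical): adjustment factors toward population targets,
# applied on top of the base weights.
adjustment = {"flagged": 0.7, "other": 1.2}
final_weights = [w * adjustment[s] for w, s in zip(base_weights, sample)]

deff_equal = kish_design_effect([1.0] * len(sample))
deff_base = kish_design_effect(base_weights)
deff_final = kish_design_effect(final_weights)
print(deff_equal, round(deff_base, 3), round(deff_final, 3))
```

Here the second-stage adjustment pushes the weights further apart, so the design effect rises instead of falling. With other hypothetical factors it could fall, which is why the outcome depends on the specific variables and ratios involved.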

Analysis 3: Assessing Imputations Based on Ancillary Data

Analytic Methods

To evaluate whether the ancillary data allowed us to correct self-reports to better match

the sample, we produced three sets of comparisons for each of the variables of interest.

Specifically, we examined correspondence between household-level estimates provided by the

CPS and those of imputed self-report data for the entire sample (imputed), the original GfK data

among respondent households (raw), and the ancillary data (ancillary). Imputation procedures

are described in Appendix E. Household-level analyses were used to compare homeownership,

household income, and household size across each of these measures. Individual-level analyses

were also used to compare marital status, education, and age across each of these measures

(results presented in Appendix F). For each comparison, we calculated an average absolute error

across all analyses (cf. Yeager et al., 2011). If correspondence between imputed data and the

CPS was significantly stronger than that of the raw GfK data, we could conclude that the

inclusion of ancillary data in our models was helping to provide a more accurate understanding

of the population.
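The error metric can be sketched as the mean absolute difference, in percentage points, between an estimated distribution and the benchmark across response categories. The category percentages below are hypothetical and are not the study's figures.

```python
def average_absolute_error(estimate, benchmark):
    """Mean absolute difference, in percentage points, across response categories."""
    if estimate.keys() != benchmark.keys():
        raise ValueError("estimate and benchmark must cover the same categories")
    return sum(abs(estimate[k] - benchmark[k]) for k in benchmark) / len(benchmark)

# Hypothetical household-size distributions (percent of households)
benchmark = {"1": 27.0, "2": 33.0, "3": 16.0, "4": 14.0, "5+": 10.0}
raw = {"1": 30.0, "2": 32.0, "3": 16.0, "4": 13.0, "5+": 9.0}
imputed = {"1": 28.0, "2": 33.0, "3": 16.5, "4": 13.5, "5+": 9.0}

raw_error = average_absolute_error(raw, benchmark)
imputed_error = average_absolute_error(imputed, benchmark)
print(raw_error, imputed_error)
```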

Household Results

Homeownership. Predictions of homeownership were slightly more accurate when using

the self-reported data without imputation than when using the imputations. Estimates of

homeownership rates in the raw sample (69.2%) did not significantly differ from those of the

CPS (68.7%; t=-.72, p=.47). Estimates from imputations remained close to the CPS estimate,

but were somewhat more discrepant. Discrepancies between imputations and the CPS ranged

from -3.51 percentage points to .07 percentage points (95% CI=-2.41 to -0.45). The average

discrepancy across the 100 imputations was 1.51 percentage points (p=.03).8 Overall, 90.5% of

the simulations were further from the CPS estimate than was the raw GfK estimate, a difference

just shy of statistical significance (p=.10). Ancillary estimates were considerably further from

CPS numbers in predicting that 77.0% of Americans owned their homes (t=26.94, p<.001).

The close correspondence between the CPS, raw, and imputed estimates can be seen in

Figure 3. The boxplot represents the range of estimates from the imputations, the horizontal red

line denotes the raw data prediction, the horizontal blue line indicates the ancillary estimate, and

the dashed green line is the prediction based on the CPS. From this figure, readers can see that

the raw estimate, on average, was a closer match to the CPS than were the imputations. The

ancillary estimate was more discrepant.
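The Monte Carlo p values reported for the imputations (see footnote 8) can be sketched roughly as follows. The sketch assumes the per-imputation differences and standard errors from the weighted t-tests are already in hand, and it assumes normally distributed simulated differences, which the footnote does not specify. The function name and inputs are hypothetical.

```python
import random

def monte_carlo_p(differences, standard_errors, draws=1000, seed=0):
    """Two-tailed p value pooled across imputed datasets.

    differences[i] and standard_errors[i] come from comparing imputed dataset i
    to the benchmark. Each pair generates `draws` simulated differences
    (assumed normal here); all draws are stacked, the share falling in the
    less-populated direction is computed, and that share is doubled.
    """
    rng = random.Random(seed)
    simulated = [
        rng.gauss(diff, se)
        for diff, se in zip(differences, standard_errors)
        for _ in range(draws)
    ]
    share_positive = sum(1 for d in simulated if d > 0) / len(simulated)
    return min(1.0, 2 * min(share_positive, 1 - share_positive))

# Hypothetical inputs: 100 imputations, each 1.5 points below the benchmark
p_far = monte_carlo_p([-1.5] * 100, [0.5] * 100)
# Hypothetical inputs: no systematic difference from the benchmark
p_null = monte_carlo_p([0.0] * 100, [0.5] * 100)
print(p_far, p_null)
```

With 100 imputations and 1000 draws each, the stacked set contains 100,000 simulated differences, matching the count given in the footnote.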

Household Income. Household income distribution estimates varied considerably across

sources, with the imputations as most congruent with the CPS, followed closely by the raw self-

reports (Figure 4). Compared to the CPS, households earning under $15,000 and between

8 p values for the multiple imputations were generated in a four-step Monte Carlo process. First, weighted t-tests were used to compare each imputed dataset to the CPS estimate. Second, raw differences and standard errors for the estimates from each imputed dataset were used in a Monte Carlo simulation to produce 1000 simulated differences between each imputation and the CPS (for 100,000 total data points). Third, these Monte Carlo simulations were stacked, and a parameter was calculated as the proportion of the estimates across all simulations in the less-populated direction of difference from the CPS. Finally, this number was doubled to produce a two-tailed p value.

$50,000 and $150,000 per year were overrepresented in the raw data. The CPS, in contrast,

identified more households earning between $25,000 and $50,000 and more than $150,000 per

year than did the raw data. The imputed datasets performed similarly to the raw data. The

imputations corresponded more closely for four income categories and less closely for two

categories. On average, raw self-reports differed from the CPS by 2.74 percentage points (χ²(7) =

84.6, p<.001) and imputations differed from the CPS by 2.47 percentage points (p<.001).

Ancillary data estimates were somewhat more discrepant, differing by an average of 3.36

percentage points (χ²(7) = 2153.9, p<.001).

Household Size. Estimated distributions of household sizes also varied across sources,

with imputations generally the most accurate (Figure 5). All three estimates overstated the

proportion of households with a single resident and understated the proportion of households

with 4 or more residents relative to the CPS. For households with 2 or 3 residents, all three

estimates were reasonably close, with ancillary data closely reflecting the CPS estimate in 2-

resident households and both raw and imputation estimates mirroring the CPS in 3-resident

households. Across household size estimates, differences from the CPS were largest for the

ancillary data, averaging 10.89 percentage points (χ²(4) = 18114.4, p<.001), and smallest for the

imputations, averaging 4.05 percentage points (p<.001). Raw estimates differed from the CPS

by 8.80 percentage points on average (χ²(4) = 789.0, p<.001).

Across Measures. Overall, imputations represented a moderate improvement in

correspondence with the CPS over raw GfK self-reports. Among household-level measures, the

average difference between CPS and imputation estimates was 2.67 percentage points (See

Figure 6). Raw data differed from the CPS by a moderately worse 4.04 percentage points.

Ancillary data diverged from the CPS by an average of 7.52 percentage points.

Including individual-level measures as well, imputations outperformed raw and ancillary

data, though not by an enormous margin (Appendix F). Across all six variables, imputations

differed from the CPS by an average of 2.97 percentage points as compared to 5.26 percentage

points for raw data and 10.12 percentage points for ancillary data. Imputations thus reduced the

differences between the raw data and the CPS by 43.7 percent on average.
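The size of the improvement can be checked from the rounded averages reported above. The sketch below is simple arithmetic on those published figures; the function name is hypothetical.

```python
def percent_reduction(raw_error, imputed_error):
    """Share of the raw error removed by imputation, in percent."""
    return 100 * (raw_error - imputed_error) / raw_error

all_six = percent_reduction(5.26, 2.97)         # all six variables
household_only = percent_reduction(4.04, 2.67)  # three household-level variables
print(round(all_six, 1), round(household_only, 1))
```

Computed from the rounded averages, the reduction is roughly 43.5 percent across all six variables and roughly 33.9 percent for the household-level measures. The small gap from the reported 43.7 percent presumably reflects rounding of the inputs or averaging at the variable level, though the text does not say which.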

Discussion

Although imputations have been touted as a strong improvement over listwise deletion

for all types of missing data (see e.g., King, Honaker, Joseph, & Scheve, 2001) and as a

potentially more effective correction for survey nonresponse (Peytchev, 2012), we find only

moderate efficacy. Imputing estimates to the sample level using ancillary data reduced

discrepancies between self-reported data from a national sample of American households with

sizable nonresponse and data from the very high response rate CPS by less than half. Given that the

weights produced were not designed to correct for nonresponse in the raw data (unlike traditional

raking strategies), our evidence suggests that the imputation strategy employed was not a

particularly effective solution to nonresponse errors in the sample.

These findings suggest one of three possibilities. First, the processes producing

nonresponse errors in the raw GfK sample may approximate MCAR, meaning that non-

respondents could be safely removed via listwise deletion (cf. King, et al., 2001). A second

possibility is that the demographic variables used in the current study were insufficient to

substantively improve upon errors in the GfK reports. If this is the case, many such variables

may be necessary for imputation results to represent a large improvement. Finally, the Bayesian

procedures used presume that ancillary and self-report data relate in similar ways among

respondents as they would (were self-reports available) among non-respondents. Biases in the

presence, nature, and correlates of the ancillary data could limit the capacity for data-driven

algorithms to result in accurate population inferences.

General Discussion

The current analyses represent a first foray into understanding the nature of potential

biases when using ancillary marketing data to supplement (or supplant) the traditional survey

process. In our initial analysis comparing six demographic variables across ancillary and survey

data for a probability sample of individuals, we find both hopeful and discomforting signs. The

ability to match ancillary data to the sample and the generally decent correspondence between

ancillary and self-reported data suggest that such data could prove useful for sampling purposes.

We were somewhat concerned, however, both by the large discrepancies that sometimes

emerged and by an inability to clearly characterize the nature of missingness in the ancillary

data.

One of the most notable findings in the current study is the nature of missingness in the

set of ancillary data used. The results from our second analysis suggest that the ancillary data were

not MCAR and that the set of self-report measures used were far from sufficient to treat missing

ancillary data as MAR conditional on only a small set of standard demographics. Particularly

notable was the correspondence between missingness on many variables and self-report

measures of the same concept. Such biases are pernicious as they are unlikely to be resolved

with ancillary data alone. Unfortunately the opaque nature of the ancillary data made it difficult

to diagnose the reasons that such large biases were identified.

Finally, our results suggested that imputation procedures based on the ancillary data did

less than we had hoped to adjust for discrepancies between a low response rate probability

survey and the CPS, a high response rate government sample. This indicated that standard

imputation procedures based on the types of ancillary data used may be insufficient when

attempting to understand attributes of the sample.

The fact that the imputations did not eliminate most of the error in the raw data indicates

one of three likely conclusions: It is possible that nonresponse (both unit and item) may not lead

to substantive bias in the raw self-reports, that the ancillary data used may be insufficient to

correct for the distinctions between samples, or that the imputation procedure itself represents the

wrong model for addressing nonresponse bias. We discuss each of these possibilities and the

implications in turn.

Ignorable Nonresponse?

The possibility that nonresponse is relatively ignorable for a sample like that of the raw

self-reports (at least for these variables) is an intriguing one. The differences between the self-

reports of respondents and the data from the CPS were relatively small (averaging approximately

4 percentage points). Although these differences were statistically significant with a sample of

4,000 individuals across three measures, they were not substantively important. Hence, it might

require a large number of highly discriminating ancillary variables to meaningfully improve the

estimates. Further, some of the error that is present may be attributable to slight differences in

measurement rather than to nonresponse. The only question posed identically by both the CPS

and GfK that did not require weighting by household size was the question about

homeownership, for which the differences were no larger than those expected from sampling

error.

Insufficient Data

The nine ancillary measures used for the imputations seem to have led to only a modest

improvement in the predictions of the imputations over those of the raw self-reports. It may be

the case that more data or a different source of data would be needed to better correct these

estimates. Even if ancillary data sources can help distinguish between respondents and non-

respondents, the distinctions may be relatively poorly predicted through the combination of

ancillary variables presented here (cf. Mustillo, 2012). This would seem unlikely in light of the

sizable capacity for ancillary data to predict unit nonresponse, but it remains possible that a

combination of missing ancillary data (leading to biases in listwise deletion) and the lack of a

consistent correlation between ancillary and self-report measures could undermine the corrective

ability of the imputation procedure. Of course, it is also possible that other sources of ancillary

data might prove better on these metrics.

Imputation Assumptions

The multiple imputation procedure employed here – as for most imputation strategies –

proceeded in a completely data-driven fashion assuming a consistent set of links between

variables. MICE models the error in any given imputation using Monte Carlo simulation results

from the posterior distributions of linear (and logistic) predictions of categorization for each of

the variables used. This procedure is based on a variety of often-implicit assumptions. Notably,

relations between variables among respondents are presumed to perfectly mirror the relations

between variables among non-respondents. Slight discrepancies between interrelations for these

sets of individuals have the potential to skew the results of the imputation, leading to results that

are no more accurate (and occasionally less accurate) representations of the public at large than

those of respondents.
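The consequence of this assumption can be illustrated with a deliberately simplified sketch. It uses a single deterministic regression imputation rather than MICE, and every value is hypothetical: x stands in for an ancillary measure known for everyone, and y for a self-report observed only among respondents.

```python
def fit_line(xs, ys):
    """Ordinary least squares slope and intercept for one predictor."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x

def imputation_bias(nonrespondent_slope):
    """Bias in the full-sample mean of y after regression imputation."""
    x_resp = [1.0, 2.0, 3.0, 4.0]
    x_non = [3.0, 4.0, 5.0, 6.0]
    y_resp = [2.0 * x for x in x_resp]                     # respondents: y = 2x
    y_non_true = [nonrespondent_slope * x for x in x_non]  # unobserved in practice

    # Fit among respondents only, then impute for non-respondents
    slope, intercept = fit_line(x_resp, y_resp)
    y_non_imputed = [intercept + slope * x for x in x_non]

    n = len(x_resp) + len(x_non)
    imputed_mean = (sum(y_resp) + sum(y_non_imputed)) / n
    true_mean = (sum(y_resp) + sum(y_non_true)) / n
    return imputed_mean - true_mean

print(imputation_bias(2.0))  # same relation in both groups
print(imputation_bias(1.5))  # weaker relation among non-respondents
```

When the relation is identical in both groups the imputed mean is unbiased; when it is weaker among non-respondents, the imputed mean overshoots, and nothing in the observed data signals the problem.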

Non-Probability Samples?

The results presented here do not discount the possibility that some set of ancillary data

could operate as a corrective for data from non-probability samples; they do however lend

caution to the endeavor. Imputation procedures only modestly improve on point estimates from

a probability sampling frame, which suggests that they are unlikely to address all differences

between a non-probability sample and the population, at least with the set of variables employed

here. To the extent that researchers are interested in using ancillary data to adjust a non-

probability frame to match the public, additional scrutiny is warranted.

Looking Forward

We hope that the current study raises a number of issues for researchers thinking about

using ancillary data for data collection, nonresponse adjustment, and sampling frame correction.

Specifically, we highlighted notable challenges in the use of at least one source of these data and

caution researchers thinking of using these data to consider carefully the nature of the

assumptions they are making. For example, the capacity for ancillary data to identify hard-to-

reach populations may require less stringent assumptions than would efforts to correct for

unrepresentative sampling. In this vein, it is important for future research to estimate the size of

possible biases from the use of these data for various types of inference and to explore how

broadly the current results can generalize. Only when we can calculate the ways in which use of

a variety of types and sources of data alters total survey error and – more particularly – the

bias-variance tradeoff introduced in various applications can we truly understand where such data

may prove a boon to social research and where they might instead be a hindrance. These

questions, more than ease and ability, should guide practitioners in their decision-making.

It is also important to consider the use of a larger array of ancillary variables in developing

survey correctives. Notably, if imputation is indeed capable of bridging the gaps between

respondents and sample, we should progressively reduce the errors as more ancillary measures

are added. Advances in improving the quality of future consumer file data may also help to

address biases inherent in using such ancillary information. Better data aggregation and systems

for understanding the origins of the data that are available could also provide considerable

leverage.

Finally, future studies should consider higher levels of inference. It is possible that the

procedures used could better correct for results such as trends over time and experimental

interventions (cf. Berrens et al., 2003; Pasek and Krosnick, 2010). In this respect, we remain

optimistic about the potentials for ancillary data as a corrective tool. The current study suggests,

however, that we should remain guarded. Additional evidence is needed to determine conditions

under which ancillary data will reduce nonresponse biases and those for which it may not

represent an improvement.

References

Acxiom (2010). Maximizing Profit with Accurate Customer Data: How next-generation data

quality is improving marketing outcomes. White Paper [Accessed 10/31/12]. Available

from: http://www.acxiom.com/site-assets/whitepaper/wp-winterberry-2012/.

Anderson, R. M., Kasper, J., & Frankel, M. R. (1979). Total survey error. San Francisco:

Jossey-Bass.

Ansolabehere, S., & Hersh, E. (Forthcoming). Validation: What Big Data Reveal About Survey

Misreporting in the Real Electorate. Political Analysis.

Berrens, R. P., Bohara, A. K., Jenkins-Smith, H., Silva, C., & Weimer, D. L. (2003). The

Advent of Internet Surveys for Political Research: A Comparison of Telephone and

Internet Samples. Political Analysis, 11(1), 1-22.

Blumberg, S. J., & Luke, J. V. (2011). Wireless Substitution: Early Release of Estimates From

the National Health Interview Survey, January–June 2011. Division of Health Interview

Statistics, National Center for Health Statistics. [Accessed 8/23/12]. Available from:

http://www.cdc.gov/nchs/data/nhis/earlyrelease/wireless201112.pdf.

Boehmke, F. J. (2003). Using Auxiliary Data to Estimate Selection Bias Models, with an

Application to Interest Group Use of the Direct Initiative Process. Political Analysis,

11(3), 234-254.

Brehm, J. (1999). Alternative Corrections for Sample Truncation: Applications to the 1988,

1990, and 1992, Senate Election Studies. Political Analysis, 8(2), 183-199.

Chang, L., & Krosnick, J. A. (2010). Comparing Oral Interviewing with Self-Administered

Computerized Questionnaires: An Experiment. Public Opinion Quarterly, 74(1), 154-

167.

Converse, J. M. (1987). Survey Research in the United States: Roots and Emergence 1890-1960.

Berkeley, CA: University of California Press.

Deming, W. E. (1944). On errors in surveys. American Sociological Review, 9(4), 359-369.

DiSogra, C., Dennis, J. M., & Fahimi, M. (2010). On the quality of ancillary data available for

address based sampling. Paper presented at the Joint Statistics Meetings.

Dixon, J., & Tucker, C. (2010). Overview of design issues: Total survey error. In P. Marsden &

J. Wright (Eds.), Handbook of Survey Research (pp. 591-630). Bingley, UK: Emerald Group

Publishing Limited.

Experian (2012). Crossing the great divide: Resolving customer identities across online and

offline channels. White Paper [Accessed 10/31/12]. Available from:

http://www.experian.com/assets/marketing-services/white-papers/wp-crossing-the-great-

divide-experian.pdf.

Groves, R. M. (1989). Survey errors and survey costs. New York: Wiley.

Groves, R. M. (2006). Nonresponse rates and nonresponse bias in household surveys. Public

Opinion Quarterly, 70(5), 646-675. doi: 10.1093/poq/nfl033

Iannacchione, V. G. (2011). The changing role of address-based sampling in survey

research. Public Opinion Quarterly, 75(3), 556-575. doi: 10.1093/poq/nfr017

InfoUSA (2012). Data Quality at InfoUSA. [Accessed 10/31/12]. Available from:

http://www.infousa.com/data-quality/.

King, G., Honaker, J., Joseph, A., & Scheve, K. (2001). Analyzing Incomplete Political Science

Data: An Alternative Algorithm for Multiple Imputation. American Political Science

Review, 95(1), 49-69.

Kish, L. (1965). Survey sampling. New York: John Wiley and Sons, Inc.

Kreuter, F., & Olson, K. (2011). Multiple auxiliary variables in nonresponse adjustment.

Sociological Methods & Research, 40(2), 311-332. doi: 10.1177/0049124111400042

Little, R. J. A., & Vartivarian, S. (2005). Does weighting for nonresponse increase the variance of

survey means? Survey Methodology, 31, 161–168.

Maitland, A., Casas-Cordero, C., & Kreuter, F. (2009). An evaluation of nonresponse bias using

paradata from a health survey. In: Proceedings of the section on survey research methods

of the American statistical association.

Mustillo, S. (2012). The Effects of Auxiliary Variables on Coefficient Bias and Efficiency in

Multiple Imputation. Sociological Methods & Research, 41(2), 335-361.

Pasek, J., & Krosnick, J. A. (2010). Measuring Intent to Participate and Participation in the 2010

Census and Their Correlates and Trends: Comparisons of RDD Telephone and Non-

probability Sample Internet Survey Data. Statistical Research Division of the U.S. Census

Bureau (#2010-15). Available from: http://www.census.gov/srd/papers/pdf/ssm2010-

15.pdf.

Peytchev, A. (2012). Multiple Imputation for Unit Nonresponse and Measurement Error. Public

Opinion Quarterly, 76(2), 214-237.

Rivers, D. (n.d.) Sample Matching: Representative Sampling from Internet Panels. Available

from: http://www.polimetrix.com/documents/YGPolimetrixSampleMatching.pdf.

Särndal, C.-E., & Lundström, S. (2005). Estimation in surveys with nonresponse. New York:

Wiley.

Sakshaug, J. W., & Kreuter, F. (2012). Assessing the Magnitude of Non-Consent Biases in

Linked Survey and Administrative Data. Survey Research Methods, 6(2), 113-122.

Smith, T. W. (2011). The report of the international workshop on using multi-level data from

sample frames, auxiliary databases, paradata and related sources to detect and adjust for

nonresponse bias in surveys. International Journal of Public Opinion Research, 23(3),

389-402. doi: 10.1093/ijpor/edr035

Smith, T. W., & Kim, J. (2009). An assessment of the multi-level integrated database approach.

GSS Methodological Report (Vol. 116). Chicago, IL: NORC.

Yeager, D., Krosnick, J. A., Chang, L., Javitz, H. S., Levendusky, M. S., Simpser, A., & Wang,

R. (2011). Comparing the Accuracy of RDD Telephone Surveys and Internet Surveys

Conducted with Probability and Non-Probability Samples. Public Opinion Quarterly,

75(4), 709-747.

Table 1. Unweighted Descriptive Characteristics of Six Demographic Variables Among Respondents (N = 4472)

                             Survey Data                      Ancillary Data
                             Percent          N of Missing    Percent          N of Missing
Home Owner                   75.50%           6               77.40%           572
Non-Owner                    24.50%                           22.60%

Income < 15k                 14.70%           1120            4.30%            236
Income 15k – 25k             10.40%                           7.90%
Income 25k – 35k             9.80%                            10.70%
Income 35k – 50k             16.90%                           16.40%
Income 50k – 75k             19.90%                           23.80%
Income 75k – 100k            12.10%                           16.40%
Income 100k – 150k           11.50%                           14.40%
Income > 150k                4.70%                            6.10%

Household size – 1           13.40%           6               28.80%           246
Household size – 2           29.60%                           25.70%
Household size – 3           20.60%                           18.30%
Household size – 4           20.00%                           12.40%
Household size > 5           16.40%                           14.90%

Married                      52.20%           6               76.30%           909
Not Married                  47.80%                           23.70%

Education – Less than HS     13.80%           1883            23.40%           801
Education – HS               19.00%                           31.40%
Education – Some College     37.80%                           14.60%
Education – Bachelors        18.60%                           8.60%
Education – Post Grad        10.90%                           22.00%

Age                          41.16 (17.40)    0               47.58 (15.40)    1277

Note: Entries in age are mean values with standard deviations in parentheses.

Table 2. Variable Value Comparisons between Survey and Ancillary Data

                     Survey <     Survey =     Survey >     Total    Far-off    N
                     Ancillary    Ancillary    Ancillary             cases
Homeownership        7.00%        88.90%       4.00%        100%     NA         1460
Household Income     50.40%       22.80%       26.80%       100%     44.10%     1234
Household Size       32.50%       32.10%       35.30%       100%     32.30%     2133
Marital Status       20.70%       72.30%       7.00%        100%     NA         1078
Education            21.60%       38.90%       39.50%       100%     22.40%     1116
Age*                 13.90%       70.40%       15.70%       100%     18.60%     1629

Note: Far-off cases are counted when values from self-reported survey and ancillary data differ by more than five years (age), or more than one category (household income, household size, and education). *Ages within one year were considered equivalent.
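
The classification rules in this note can be sketched as a small helper. The function name and the numeric coding of ordinal categories are hypothetical; only the thresholds (one year for age equivalence, five years or one category for far-off cases) come from the note.

```python
def compare_values(survey, ancillary, tolerance=0, far_off_gap=1):
    """Classify one survey/ancillary pair.

    Returns (direction, far_off), where direction is '<', '=', or '>' for the
    survey value relative to the ancillary value, and far_off flags gaps larger
    than far_off_gap. Values within `tolerance` count as equal.
    """
    gap = survey - ancillary
    if abs(gap) <= tolerance:
        direction = "="
    elif gap < 0:
        direction = "<"
    else:
        direction = ">"
    return direction, abs(gap) > far_off_gap

# Age: within one year counts as equal; more than five years apart is far-off
print(compare_values(34, 35, tolerance=1, far_off_gap=5))
print(compare_values(34, 42, tolerance=1, far_off_gap=5))
# Ordinal categories (e.g. income coded 1-8): more than one category is far-off
print(compare_values(3, 5))
```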

Table 3. Logistic Regressions Predicting Missing Ancillary Data with Self-Reports

Note: OLS was used to predict the number of missing variables for each individual (Column 7). Number of missing variables ranged from 0 to 6. A negative binomial regression provided a poorer overall fit and was therefore not presented. *p < .05; **p < .01; ***p < .001


Figures

Figure 1. Missing Ancillary Data by Variables and Respondents

Figure 2. Missing Data by Self-Reported Variable Values

Figure 3. Homeownership Estimates Weighted By Household


Figure 4. Income Category Estimates Weighted By Household

Figure 5. Household Size Estimates Weighted By Household

Figure 6. Average Absolute Differences From CPS

Appendix A. Question Wordings for Household-Level Variables

Homeownership (Nominal; 2 categories)

Self-report. Respondents were asked: “Are your living quarters. . . Owned or being

bought by you or someone in your household, rented for cash, or occupied without payment of

cash rent.” Respondents who selected “Owned or being bought by you or someone in your

household” were coded 1, all other answers were coded 0.

Ancillary. Ancillary homeownership data were categorized 1 for homeowners and 0 for

non-owners.

CPS. Respondents were asked: “Are your living quarters. . . Owned or being bought by a

household member, rented for cash, or occupied without payment of cash rent.” Respondents

who selected “Owned or being bought by a household member” were coded 1, all other answers

were coded 0.

Recoding. Survey respondents and CPS respondents reported their homeownership in

three categories, “owned or being bought by you or someone in your household”, “rented for

cash”, “occupied without payment of cash rent”. To facilitate comparisons with the ancillary

data, the self-reported and CPS responses were recoded into two categories, “homeowners” and

“non-owners”.

Household Income (Ordinal)

Self-report. Respondents were asked: “Was your total HOUSEHOLD income in the past

12 months ...” Respondents could choose: “Below $35,000”, “$35,000 or more”, or “Don’t

Know”. Respondents who selected “Below $35,000” were asked: “We would like to get a better

estimate of your total HOUSEHOLD income in the past 12 months before taxes. Was it ...”

Respondents could choose: “Less than $5,000”, “$5,000 to $7,499”, “$7,500 to $9,999”,

“$10,000 to $12,499”, “$12,500 to $14,999”, “$15,000 to $19,999”, “$20,000 to $24,999”,

“$25,000 to $29,999”, or “$30,000 to $34,999”. Respondents who selected “$35,000 or more”

were asked: “We would like to get a better estimate of your total HOUSEHOLD income in the

past 12 months before taxes. Was it ...” Respondents could choose: “$35,000 to $39,999”,

“$40,000 to $49,999”, “$50,000 to $59,999”, “$60,000 to $74,999”, “$75,000 to $84,999”,

“$85,000 to $99,999”, “$100,000 to $124,999”, “$125,000 to $149,000”, “$150,000 to

$174,999”, or “$175,000 or more”. Responses to all three questions were recoded into eight

categories: “Less than $14,999”, “$15,000 to $24,999”, “$25,000 to $34,999”, “$35,000 to

$49,999”, “$50,000 to $74,999”, “$75,000 to $99,999”, “$100,000 to $149,999”, and “More than

$150,000”.

Ancillary. Ancillary income was coded into categories for “$1,000-$14,999”,

“$15,000-$24,999”, “$25,000-$34,999”, “$35,000-$49,999”, “$50,000-$74,999”, “$75,000-

$99,999”, “$100,000-$124,999”, “125,000-$149,999”, “$150,000-$174,999”, “175,000-

$199,999”, “$200,000-$249,999”, and “$250,000+”. Ancillary income data were recoded into

eight categories: “Less than $14,999”, “$15,000 to $24,999”, “$25,000 to $34,999”, “$35,000 to

$49,999”, “$50,000 to $74,999”, “$75,000 to $99,999”, “$100,000 to $149,999”, and “More than

$150,000”.

CPS. Respondents were asked: “Which category represents the total combined

income of all members of this FAMILY during the past 12 months? This includes money from

jobs, net income from business, farm or rent, pensions, dividends, interest, social security

payments and any other money income received by members of this family who are 15 years of

age or older?” Response options were: “Less than $5,000”, “5,000 to 7,499”, “7,500 to 9,999”,

“10,000 to 12,499”, “12,500 to 14,999”, “15,000 to 19,999”, “20,000 to 24,999”, “25,000 to

29,999”, “30,000 to 34,999”, “35,000 to 39,999”, “40,000 to 49,999”, “50,000 to 59,999”,

“60,000 to 74,999”, “75,000 to 99,999”, “100,000 to 149,000”, “150,000 or more”. Responses

were recoded into eight categories: “Less than $14,999”, “$15,000 to $24,999”, “$25,000 to

$34,999”, “$35,000 to $49,999”, “$50,000 to $74,999”, “$75,000 to $99,999”, “$100,000 to

$149,999”, and “More than $150,000”.

Recoding. Responses to household income questions in three sources were recoded into

eight categories: “Less than $14,999”, “$15,000 to $24,999”, “$25,000 to $34,999”, “$35,000 to

$49,999”, “$50,000 to $74,999”, “$75,000 to $99,999”, “$100,000 to $149,999”, and “$150,000

or more”.
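
The recoding can be sketched as a lookup on the lower bound of each source bracket. This assumes, as the bracket lists above suggest, that every source bracket nests inside one of the eight common categories; the function and constant names are hypothetical.

```python
# Lower bounds (in dollars) of the eight common income categories
CATEGORY_FLOORS = [0, 15000, 25000, 35000, 50000, 75000, 100000, 150000]
CATEGORY_LABELS = [
    "Less than $14,999", "$15,000 to $24,999", "$25,000 to $34,999",
    "$35,000 to $49,999", "$50,000 to $74,999", "$75,000 to $99,999",
    "$100,000 to $149,999", "$150,000 or more",
]

def recode_income(bracket_floor):
    """Map the lower bound of a source income bracket to one of the eight categories."""
    if bracket_floor < 0:
        raise ValueError("bracket_floor must be non-negative")
    # Pick the highest category whose floor does not exceed the bracket's floor
    index = max(i for i, floor in enumerate(CATEGORY_FLOORS) if bracket_floor >= floor)
    return CATEGORY_LABELS[index]

print(recode_income(12500))   # self-report bracket "$12,500 to $14,999"
print(recode_income(125000))  # ancillary bracket "$125,000-$149,999"
print(recode_income(250000))  # ancillary bracket "$250,000+"
```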

Number of Persons in Household (Ordinal)

Self-report. Respondents were asked: “Including yourself, how many people currently

live in your household at least 50% of the time? Please remember to include babies or small

children, include unrelated individuals (such as roommates), and also include those now away

traveling, at school, or in a hospital.” Respondents could enter a number between 1 and 15.

Responses indicating more than 5 household members were collapsed into the single category:

“5 or more”.

Ancillary. Data were requested on the number of adults in each household, the presence

of children in the household, and the number of children in the household. Households that were

not listed as having children present were coded 0 for the number of children (N=19,732). The

number of adults and children in the household were summed to produce a variable for the total

number of persons in the household. Sums indicating more than 5 individuals in the household

were collapsed into the single category: “5 or more”.

CPS. Respondents were asked: “What are the names of all persons living or staying

here?” After each name, respondents were asked: “What is the name of the next person?”

Respondents were later asked: “Are there any other persons 15 years old or older now living or

staying there?” If respondents report that there are, they are asked: “How many other?”

Respondents were then asked, in succession: “Have I missed any babies or small children?”,

“Have I missed anyone who usually lives here but is away now -traveling, at school, or in a

hospital?”, “Have I missed any lodgers, boarders, or persons you employ who live here?”, and

“Have I missed anyone else staying here?” The results from these questions were combined to

determine the number of individuals in each household. Sums indicating more than 5 individuals

in the household were collapsed into the single category: “5 or more”.

Recoding. Number of persons in the household was coded to range from 1 to 5 in all

three datasets, with values greater than 5 recoded to equal 5.
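
A minimal sketch of the ancillary household-size construction and the top-coding rule follows. The function name and the use of None for households not listed as having children are assumptions made for illustration.

```python
def household_size(adults, children=None, cap=5):
    """Total persons in the household, top-coded at `cap`.

    children=None stands for a household not listed as having children,
    which is treated as 0 children, as described for the ancillary data.
    """
    total = adults + (children or 0)
    return min(total, cap)

print(household_size(2, 1))     # 3
print(household_size(2, None))  # 2
print(household_size(3, 4))     # 5
```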

Number of Children in Household (Ordinal)

Self-report. Respondents were asked: “Including yourself, how many people currently

live in your household at least 50% of the time? Please remember to include babies or small

children, include unrelated individuals (such as roommates), and also include those now away

traveling, at school, or in a hospital.” Respondents could enter a number between 1 and 15.

Responses indicating more than 5 household members were collapsed into the single category:

“5 or more”.

Ancillary. Data were requested on the number of adults in each household, the presence

of children in the household, and the number of children in the household. Households that were

not listed as having children present were coded 0 for the number of children (N=19,732). The

number of adults and children in the household were summed to produce a variable for the total

number of persons in the household. Sums indicating more than 5 individuals in the household

were collapsed into the single category: “5 or more”.

CPS. Respondents were asked: “What are the names of all persons living or staying

here?” After each name, respondents were asked: “What is the name of the next person?”

Respondents were later asked: “Are there any other persons 15 years old or older now living or

staying there?” If respondents reported that there were, they were asked: “How many other?”

Respondents were then asked, in succession: “Have I missed any babies or small children?”,

“Have I missed anyone who usually lives here but is away now - traveling, at school, or in a

hospital?”, “Have I missed any lodgers, boarders, or persons you employ who live here?”, and

“Have I missed anyone else staying here?” The results from these questions were combined to

determine the number of individuals in each household. Sums indicating more than 5 individuals

in the household were collapsed into the single category: “5 or more”.

Presence of Telephone (Nominal; 2 categories)

Self-report. Respondents were asked: “Is there at least one telephone INSIDE your home

that is currently working and is not a cell phone?” Respondents who selected “Yes” were coded

1, all other respondents were coded 0.

Ancillary. Phone number matches were requested for all households in the sample.

Phone numbers were matched for 11,881 households and could not be matched for 13,119

households. Matched households were coded 1, all other households were coded 0.

CPS. Respondents were asked: “Is there a telephone in this house/apartment?”

Respondents who answered “Yes” were coded 1, respondents who answered “No” were coded 0.

Appendix B. Question Wordings for Individual-Level Variables

Marital Status (Nominal; 2 categories)

Self-report. Respondents were asked: “Are you now married, widowed, divorced,

separated, never married, or living with a partner?” Response options were “married”,

“widowed”, “divorced”, “separated”, “never married”, and “living with partner”. Responses

were recoded 1 for “married” and 0 for all others.

Ancillary. Data were requested on the marital status of an individual in each household.

Ancillary marital status data were categorized 1 for married and 0 for single.

CPS. Respondents were asked: “Are you now married,

widowed, divorced, separated, or never married?” Response options were “married – spouse

present”, “married – spouse absent”, “widowed”, “divorced”, “separated”, and “never married”.

Responses were recoded 1 for “married – spouse present” and “married – spouse absent” and 0

for all others.

Recoding. Responses to marital status from all three data sources were coded as 1 for

respondents who reported that they were currently married and 0 for all other respondents.

Marital status in the ancillary data was reported for individuals who were classified as “heads of

household”. This category (as all others in the ancillary data) was not defined.

Education

Self-report. Respondents were asked: “What is the highest level of school you have

completed?” Response options were “no formal education”, “first, second, third, or fourth

grade”, “fifth or sixth grade”, “seventh or eighth grade”, “ninth grade”, “tenth grade”, “eleventh

grade”, “twelfth grade no diploma”, “high school diploma or the equivalent”, “some college no

degree”, “associate degree”, “bachelor degree”, “master degree”, and “professional or doctoral

degree”.

Ancillary. Data were requested on the education level of an individual in each household.

Ancillary education was coded into six categories for “less than high school diploma”, “high

school diploma”, “some college”, “bachelor”, “graduate school”, and “Don’t know”.

CPS. Respondents were asked: “What is the highest level of school (name/you)

(have/has) completed or the highest degree (name/you) (have/has) received?” Response options

were “Less than 1st grade”, “1st, 2nd, 3rd or 4th grade”, “5th or 6th grade”, “7th or 8th grade”,

“9th grade”, “10th grade”, “11th grade”, “12th grade NO DIPLOMA”, “HIGH SCHOOL

GRADUATE- high school DIPLOMA or the equivalent (For example: GED)”, “Some college

but no degree”, “Associate degree in college - Occupational/vocational program”, “Associate

degree in college -- Academic program”, “Bachelor's degree (For example: BA, AB, BS)”,

“Master's degree (For example: MA, MS, MEng, MEd, MSW, MBA)”, “Professional School

Degree (For example: MD,DDS,DVM,LLB,JD)”, and “Doctorate degree (For example: PhD,

EdD)”.

Recoding. Education levels from all three data sources were coded into 5 categories for

“Less than High School”, “High School Graduate”, “Some College”, “College Graduate” and

“Post-Graduate” education levels.
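
As an illustration of the education recoding, the sketch below maps the ancillary labels onto the five analysis categories. The crosswalk is an assumption inferred from the category names; the appendix does not list the exact mapping, and treating “Don’t know” as missing is likewise an assumption. The names `ANCILLARY_TO_FIVE` and `recode_ancillary_education` are hypothetical.

```python
# Hypothetical crosswalk from ancillary labels to the five analysis categories.
# The exact mapping is not documented in the appendix; this is an assumption.
ANCILLARY_TO_FIVE = {
    "less than high school diploma": "Less than High School",
    "high school diploma": "High School Graduate",
    "some college": "Some College",
    "bachelor": "College Graduate",
    "graduate school": "Post-Graduate",
}


def recode_ancillary_education(label):
    """Return the five-category level, or None when the level is unknown."""
    # Labels not in the crosswalk (e.g. "Don't know") are treated as missing.
    return ANCILLARY_TO_FIVE.get(label.strip().lower())


print(recode_ancillary_education("Bachelor"))
print(recode_ancillary_education("Don't know"))
```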

Age

Self-report. Respondents could enter their age in an open-ended format.

Ancillary. Data were requested on the age of an individual in each household.

CPS. Respondents were asked: “What is your date of birth?” They were then asked to

verify their age with the question: “As of last week, that would make (name/you) (approximately)

(AGE) years old. Is that correct?”

Recoding. Ages for all individuals were coded to range from 18 to 90 in analyses 1 and 2

and from 18 to 80 in analysis 3. To facilitate presentation, all three data sources were also coded

into 7 categories for individuals aged 18-24, 25-34, 35-44, 45-54, 55-64, 65-74, and 75 and

older.
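
The age recoding can be sketched as follows. This is a minimal illustration, assuming that out-of-range ages are clipped to the stated bounds rather than dropped (the appendix does not say which); `clip_age` and `age_category` are hypothetical names.

```python
import bisect

AGE_LABELS = ["18-24", "25-34", "35-44", "45-54", "55-64", "65-74", "75+"]
AGE_CUTS = [25, 35, 45, 55, 65, 75]  # lower bounds of the 2nd through 7th categories


def clip_age(age, low=18, high=90):
    """Restrict age to [low, high]; use high=80 for analysis 3."""
    return max(low, min(age, high))


def age_category(age):
    """Assign a (clipped) age to one of the seven presentation categories."""
    return AGE_LABELS[bisect.bisect_right(AGE_CUTS, age)]


print(clip_age(95), clip_age(95, high=80))
print(age_category(24), age_category(25), age_category(90))
```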

Appendix C. Weighting

Eight sets of weights were produced to match the various data sources with the American

public and to one another. These weights served two purposes. First, we adjusted for

differences between individual-level self-reports and household-level ancillary data. To do this,

we explored four strategies for aggregating information from individuals within a household.

Second, we corrected for unequal probabilities of selection introduced as part of the sampling

procedure. Two techniques were employed to cancel out any biases at the sample and

respondent levels. All combinations of these strategies (four corrections for individual-

household distinction × two levels of generalization) were explored, where appropriate, to ensure

that the results presented were robust to differing choices.

Household level weight

To produce data that could be compared across sources, we needed to match self-reports

to the values in ancillary data. The sampling procedure, however, allowed multiple individuals

from a single household to enter the panel. This introduced three potential problems. First, the

presence of multiple individuals from a single household introduced concerns about the

independence of observations. Second, the results of our analyses might be biased toward

households with multiple representatives. And third, it might be possible that ancillary data

could correctly match one individual in a household while providing an inaccurate portrait of

other individuals in the household. The first and second challenges are easily overcome by

weighting observations at the household, rather than individual, level. The third challenge is

more pernicious and requires that we consider the conditions under which household and

individual data should be considered a match. To circumvent these problems, we created four

sets of weights for respondents:

1. Pure household weight: (1) the weight was coded as the inverse of the number of

respondents in the household.

2. Household adult weight: (1) individuals under 18 were dropped; (2) the weight was

coded as the inverse of the number of respondents in the household.

3. Best ancillary match weight: (1) respondents were asked to provide the names and

ages of all individuals in the household; (2) in households where one of the respondents was

closest to the age⁹ indicated by the household ancillary data, all other members of the household

were dropped; (3) in households where no respondents were close to the age indicated by the

ancillary data or where multiple respondents were equally close, individuals under age 18 or who

were clearly not the best match were dropped and the weight was coded as the inverse of the

number of remaining individuals in the household.

4. Full household only best ancillary match weight: (1) respondents were asked to

provide the names and ages of all individuals in the household; (2) in households where all

individuals mentioned were respondents, we followed the same procedures as the best ancillary

match weight; (3) individuals in incomplete households were dropped from the analysis.
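
The first two weights in the list above can be sketched in a few lines. This is an illustration only: the record layout (`id`, `household`, `age`) and the function name are hypothetical, and the best ancillary match rules are not reproduced here.

```python
from collections import Counter


def household_weights(respondents, adults_only=False):
    """Weight each respondent by 1 / (number of respondents in the household).

    With adults_only=True, respondents under 18 are dropped first
    (the "household adult weight").
    """
    kept = [r for r in respondents if not adults_only or r["age"] >= 18]
    counts = Counter(r["household"] for r in kept)
    return {r["id"]: 1.0 / counts[r["household"]] for r in kept}


# Hypothetical respondents from two households
respondents = [
    {"id": 1, "household": "A", "age": 40},
    {"id": 2, "household": "A", "age": 16},
    {"id": 3, "household": "B", "age": 30},
]
print(household_weights(respondents))
print(household_weights(respondents, adults_only=True))
```

Under either weight, each household contributes a total weight of 1, which addresses the over-representation of households with multiple respondents.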

Probability of sampling corrections

Probability of sampling corrections were created to adjust for several sources of deviation

from an equal probability of selection. As part of a procedure to increase the number of

respondents in traditionally underrepresented groups, GfK used a stratified sampling technique.

Households were categorized into four groups depending on the age and Hispanic status

indicated in the ancillary data. Sampling probabilities were assigned to oversample households

9 Age was used in these circumstances because it was the most commonly available piece of ancillary information and was the only piece of ancillary information that could be consistently expected to discriminate between members of a household. Other variables were either household-level or would be expected to match multiple household members (e.g. marital status).

that included Hispanics or individuals ages 18-24. Because our goal was to assess whether such

techniques might improve the survey process, we needed to eliminate any biases that might have

been introduced through this sampling procedure. To do so, we used three pieces of information:

the proportion of households out of a random sample of one million with each of the ancillary

demographic characteristics considered (population), the sampling probabilities used to identify

sampled households (sample), and the proportion of respondents in each demographic category

according to the ancillary data (respondents). Two sets of base weights were produced. Sample-

level weights were calculated to invert distortion produced by the mismatch between sample and

population proportions (weight = population / sample). Respondent-level weights were

calculated to match the characteristics of respondents to those of the population (weight =

population / respondents).
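
The two base weights share the same form and can be sketched as a ratio of proportions within each stratum. The stratum names and all proportions below are invented for illustration and are not values from the study.

```python
def base_weights(population, observed):
    """weight = population share / observed share, computed per stratum."""
    return {s: population[s] / observed[s] for s in population}


# Hypothetical shares for four age-by-Hispanic-status strata (not study data)
population = {"young_hispanic": 0.05, "young_other": 0.10,
              "older_hispanic": 0.10, "older_other": 0.75}
sample = {"young_hispanic": 0.10, "young_other": 0.20,
          "older_hispanic": 0.20, "older_other": 0.50}

sample_weights = base_weights(population, sample)
print(sample_weights)

# Applying the weights to the sample shares recovers the population shares.
reweighted = {s: sample[s] * sample_weights[s] for s in sample}
print(reweighted)
```

Respondent-level weights follow from the same function by passing the respondent shares in place of the sample shares.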

Weights Used by Analysis

Analysis 1. Best ancillary match weights were adjusted to match the respondents for all

analyses presented. Alternate weighting strategies led to the same conclusions.

Analysis 2. Best ancillary match weights were adjusted to match the respondents for all

analyses presented. Alternate weighting strategies led to the same conclusions.

Analysis 3. Best ancillary match weights adjusted for the sample were used to correct for

all imputation and ancillary household-level estimates. Best ancillary match weights adjusted for

responding households were used to correct for raw household-level estimates. For individual-

level analyses, household adult weights were used to avoid bias due to the demarcation of some

individuals as “heads of household”. These weights were then adjusted to match the sample (for

imputation and ancillary estimates) or the household (raw GfK estimates) and were multiplied by

the number of persons in the household to produce individual-level weights. The number of

persons in the household was identified using ancillary data for ancillary estimates, using

imputed self-report data for imputation estimates, and using self-report data for raw estimates.

This set of weights produced the closest correspondence between the imputations and the CPS

among weighting strategies tested.

Appendix D. Comparison of Self-Report and Ancillary Values

Table D1 – Crosstabs Comparing Self-Report and Ancillary Values

Matches Bolded. Percentages are proportion of pairwise complete N (see Table 2).

Figure D1 – Comparisons of Self-Reported and Ancillary Age

Points are jittered to show density. Dashed line indicates 5-year margin.

Appendix E. Imputation Procedures for Analysis 3

Respondent-level variables were translated to match the sample by way of a multiple

imputation procedure. The package MICE: Multivariate Imputation by Chained Equations in R

was used to generate 100 multiply imputed datasets that provided values for all missing self-

report and ancillary variables in the entire sample (N = 26,974).¹⁰ MICE uses a Markov Chain

Monte Carlo (i.e. Bayesian) approach to imputation. The algorithm regresses each of the

variables onto all other variables in the dataset and generates a Monte Carlo prediction, based on

that regression, for all missing data points. The algorithm then proceeds iteratively, updating the

predictions for each of the missing variables using predictions from the newly imputed data from

all other variables. Each prediction was a result of up to 20 rounds of this type of imputation.¹¹

The imputation strategy for each variable depended on the coding of that variable. Variables

identified as “Nominal; 2 categories” above were imputed using a logistic regression imputation.

Variables identified as “Ordinal” above were imputed using an ordinal logistic regression.

Variables identified as “Continuous” above were imputed using predictive mean matching.

Notably, the imputation was conducted at the individual, rather than household level, though this

should not represent a substantive drawback given the nature of the imputation procedure.
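
The iterative logic of chained equations can be illustrated with a deliberately simplified sketch. It is not the MICE package and does not reproduce the models used here: it assumes only two continuous variables, uses stochastic linear regression in place of logistic, ordinal logistic, or predictive mean matching models, and does not draw the regression parameters from their posterior. All function names are hypothetical.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, residual_sd)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    resid = [y - (a + b * x) for x, y in zip(xs, ys)]
    sd = (sum(r * r for r in resid) / max(len(xs) - 2, 1)) ** 0.5
    return a, b, sd


def chained_imputation(rows, n_iter=20, seed=0):
    """rows: list of [x, y] pairs, with None marking a missing value."""
    rng = random.Random(seed)
    missing = [[v is None for v in row] for row in rows]
    data = [list(row) for row in rows]
    # Initialise each gap with a randomly chosen observed value of that variable.
    for j in range(2):
        observed = [row[j] for row in rows if row[j] is not None]
        for i, row in enumerate(data):
            if missing[i][j]:
                row[j] = rng.choice(observed)
    # Repeatedly regress each variable on the other and redraw the missing cells.
    for _ in range(n_iter):
        for j in range(2):
            k = 1 - j  # index of the predictor variable
            train = [(data[i][k], data[i][j])
                     for i in range(len(data)) if not missing[i][j]]
            a, b, sd = fit_line([t[0] for t in train], [t[1] for t in train])
            for i in range(len(data)):
                if missing[i][j]:
                    data[i][j] = a + b * data[i][k] + rng.gauss(0, sd)
    return data


# Hypothetical data; one completed dataset per seed gives multiple imputations.
rows = [[1.0, 2.1], [2.0, 3.9], [3.0, None], [None, 8.2], [5.0, 9.8], [6.0, 12.1]]
imputations = [chained_imputation(rows, seed=s) for s in range(5)]
print(imputations[0])
```

Observed values are never overwritten; only the originally missing cells are redrawn at each round, and differences across the seeded datasets reflect imputation uncertainty.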

10 Note that N is this large because multiple respondents were chosen from households. Weighting techniques corrected for this in all analyses (see Appendix C).

11 Because these imputations converge quickly, 5 rounds are considered sufficient for most missing data problems.

Appendix F. Individual-Level Imputation Results

Marital Status (Individual). Marital status estimates were the most accurate in the

imputed data, followed by raw self-reports and the ancillary data (Figure F1). The imputations

differed from the CPS estimate by an average of 2.1 percentage points (p=.03). The raw

estimates differed from the CPS by an average of 8.7 percentage points (t=9.0, p=.01). And the

ancillary estimates differed from those of the CPS by more than 30 percentage points (t=102,

p<.001). All of the imputation estimates were closer to the CPS estimate than was the raw GfK

estimate.

Education (Individual). All three measures overestimated the proportion of individuals

with some college education and underestimated the number of individuals with a high school

degree relative to the CPS. CPS estimates for the proportion of individuals with college or graduate

school degrees were closely matched by both imputation and ancillary estimates, but were somewhat

overestimated by the raw GfK data (Figure F2). Imputation estimates differed from the CPS

by an average of 5.1 percentage points, whereas raw GfK estimates differed from the CPS by 7.2

percentage points. Despite only reporting education for heads of households, ancillary estimates

most closely matched the CPS, differing by an average of only 3.5 percentage points. All of

these differences were highly statistically significant (ps<.001).

Age (Individual). Population age estimates from the three measures were all relatively

close to the CPS, differing by an average of only 2.57 percentage points for the imputations, 3.57

percentage points for the raw data, and 3.99 percentage points in the ancillary data (Figure F3;

ps<.001).
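
The summary measure reported in this appendix can be sketched as follows, assuming that each "average difference" is an unweighted mean of absolute category-level gaps between an estimate and the CPS; the appendix does not spell out the computation. The function name and the percentages below are hypothetical.

```python
def mean_abs_diff(estimate, benchmark):
    """Average absolute gap, in percentage points, across benchmark categories."""
    gaps = [abs(estimate[k] - benchmark[k]) for k in benchmark]
    return sum(gaps) / len(gaps)


# Invented percentages for a two-category variable (not study results)
cps = {"married": 50.0, "not married": 50.0}
imputed = {"married": 52.0, "not married": 48.0}
print(mean_abs_diff(imputed, cps))
```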

Figure F1. Marital Status Estimates Weighted By Individual

Figure F2. Education Estimates Weighted By Individual

Figure F3. Age Estimates Weighted By Individual
