DEMOGRAPHIC RESEARCH
VOLUME 29, ARTICLE 22, PAGES 579-616
PUBLISHED 26 SEPTEMBER 2013
http://www.demographic-research.org/Volumes/Vol29/22/
DOI: 10.4054/DemRes.2013.29.22
Research Article
Validation of spatially allocated small area estimates for 1880 Census demography
Matt Ruther, Galen Maclaurin, Stefan Leyk, Barbara Buttenfield, Nicholas Nagle
© 2013 Ruther, Maclaurin, Leyk, Buttenfield & Nagle.
This open-access work is published under the terms of the Creative Commons Attribution NonCommercial License 2.0 Germany,
which permits use, reproduction & distribution in any medium for non-commercial purposes, provided the original author(s) and
source are given credit. See http://creativecommons.org/licenses/by-nc/2.0/de/
Table of Contents
1 Introduction
2 Background
2.1 Small area estimation using Census microdata
2.2 Maximum entropy microdata allocation
2.3 The context of the 1880 Census
3 Methods
3.1 Finding meaningful constraining variables
3.2 Establishing a validation procedure
3.2.1 Error in margin
3.2.2 Residuals and Standardized Allocation Error (SAE)
3.3 Modified z-statistic
4 Results
4.1 The selection of constraining variables
4.2 Post-allocation results: Comparison of allocated distributions to actual distributions
4.3 Post-allocation results: Comparison of the joint distribution of a constraining variable and a non-constraining variable
4.4 Post-allocation results: Comparison of the joint distribution of two non-constraining variables
4.5 Post-allocation results: Geographic heterogeneity in benchmark variable allocation errors
5 Discussion and concluding remarks
5.1 Limitations and future steps
6 Acknowledgement
References
Appendix
Matt Ruther¹
Galen Maclaurin²
Stefan Leyk²
Barbara Buttenfield²
Nicholas Nagle³
Abstract
OBJECTIVE
This paper details the validation of a methodology which
spatially allocates Census
microdata to census tracts, based on known, aggregate tract
population distributions. To
protect confidentiality, public-use microdata contain no spatial
identifiers other than the
code indicating the Public Use Microdata Area (PUMA) in which
the individual or
household is located. Confirmatory information including the
location of microdata
households can only be obtained in a Census Research Data Center
(CRDC). Due to
restrictions in place at CRDCs, a systematic procedure for
validating the spatial
allocation methodology needs to be implemented prior to
accessing CRDC data.
METHODS
This study demonstrates and evaluates such an approach, using
historical census data
for which a 100% count of the full population is available at a
fine spatial resolution.
The approach described allows for testing of the behavior of a
maximum entropy
imputation and spatial allocation model under different
specifications. The imputation
and allocation is performed using a microdata sample of records
drawn from the full
1880 Census enumeration and synthetic summary files created from
the same source.
The results of the allocation are then validated against the
actual values from the 100%
count of 1880.
1 University of Colorado Boulder, U.S.A. E-mail: [email protected].
2 University of Colorado Boulder, U.S.A.
3 University of Tennessee, U.S.A.
RESULTS
The results indicate that the validation procedure provides
useful statistics, allowing an
in-depth evaluation of the household allocation and identifying
optimal configurations
for model parameterization. This provides important insights as
to how to design a
validation procedure at a CRDC for spatial allocations using
contemporary census data.
1. Introduction
Census public-use microdata possess an attribute richness which
should make them
tremendously useful to researchers interested in demographic
small area estimation;
however, they are underutilized, largely due to their coarse
spatial resolution. The
smallest identifiable geographic areas in Census microdata
contain a minimum of
100,000 individuals, a restriction which may significantly
compromise the geographic
nature of a demographic study. Research which focuses on smaller
geographic areas
generally relies on a limited number of aggregate population
characteristics provided by
the Census Bureau in summary tables and cross-tabulations at the
census tract or block
group level. In order to better exploit the attribute richness
of Census microdata at finer
spatial scales, spatial allocation methods, which allocate
microdata households to small
areas and generate summary statistics for these smaller
geographic units using the
attributes of the allocated microdata households, may be used
(Johnston and Pattie
1993; Ballas et al. 2005; Assunção et al. 2005). Small area
estimates, which contain
extensive detail on the underlying population, are in great
demand and are important to
research on demographic and social processes such as migration,
impoverishment, and
human-environmental interactions.
A persistent shortcoming in the use of such spatial allocation
methods for deriving
demographic small area estimates is the lack of confirmatory
validation. There are
often few, if any, sources against which to compare the
estimated fine-scale population
counts and the associated distributions of population
characteristics. The main reason
for the absence of fine-resolution comparison data is the
confidentiality protection in
census surveys, which precludes the release of confirmatory
data. Although
demographic estimates based on U.S. Census data and geographies
may be validated at
a Census Research Data Center (CRDC), the expense of accessing a
CRDC and the
necessary confidentiality restrictions in place at the CRDC
mandate that the validation
process, which is not trivial, be fully realized prior to its
implementation at the CRDC.
This article describes one procedure for validating demographic
small area
estimates derived from spatially allocated household microdata.
In general terms,
spatial allocation refers to the process of assigning data from
a set of source zones into a
set of (different) target zones. The estimation methodology used
here was originally
developed to spatially allocate Public Use Microdata Sample
(PUMS) households,
which are spatially contained within Public Use Microdata Areas
(PUMAs) (source), to
census tracts (target), by imputing tract-specific sampling
weights for each microdata
household. The imputation is based on the principle of maximum
entropy: Conditional
on prior information known about the data, the most uniform
distribution (i.e., all
values have equal probability of occurrence) best represents the
data-generating process
(Phillips, Anderson, and Schapire 2006). Maximum entropy models
are constrained by
the information that is known about the process while making no
assumptions about
what is unknown. In this case, the model maximizes uniformity of
the distribution of
tract-specific sampling weights, subject to the constraint that
the weights sum to the
known, aggregate tract populations (summary statistics) (Nagle
et al. 2013; Leyk,
Buttenfield, and Nagle 2013).4
The fine-scale data necessary for the validation of this
methodology for
contemporary Censuses are available only at a CRDC. In contrast,
historical Census
data from 1880 are publicly available, and these data contain
the full demographic
detail for a 100% count of the population.5 These historical data are used to (1) generate a
is used to (1) generate a
nested data structure comparable to contemporary census data
(i.e., a 5% microdata
sample and small area population summary statistics), (2) run
the imputation model and
allocate households based on the imputed weights, and (3)
examine and validate model
performance.
4 Although originally developed and tested for the U.S. context using PUMAs and census tracts, the allocation
method described in this paper could conceivably be carried out using data from other nations. The
international version of the Integrated Public Use Microdata Series (IPUMS), maintained by the Minnesota
Population Center (MPC), includes microdata from many countries, and national statistical agencies
frequently provide aggregate population data for small sub-national geographies. The applicability of the
method described here to an international context will depend on the unit of microdata geography and the
unit of aggregate population geography used in a particular country. In the U.K., for example, the Sample of
Anonymised Records microdata regions are quite similar in size to PUMAs in the U.S., and the Output Areas
(the smallest geography for which U.K. aggregate population estimates are made) are comparable to U.S.
census tracts. In other countries (e.g., France and Germany), the smallest geographical unit identified in the
microdata (German states or French regions) is considerably larger than a PUMA; the spatial allocation may
not perform adequately in these cases.
5 In fact, historical records from U.S. Censuses in 1940 and decades prior are publicly available and preserved
on microfilm by the National Archives. Microdata samples from these historical Censuses are also available
in the IPUMS (Ruggles et al. 2010). The advantage of the 1880 Census, and the motivation for its use here,
is that 100% of the records have been transcribed and are digitally and freely available, thus allowing for the
validation procedure which follows. Full transcription of these other historical Censuses has not yet occurred.
In the context of methodological validation, the 1880 Census presents a unique
opportunity, as the publicly available data include the full count of the population at a
fine spatial resolution. The spatial structure of the 1880 Census data is comparable,
although not identical, to that of contemporary censuses, and the collected population
characteristics are similar. Thus the performance of spatial microdata allocation
microdata allocation
procedures can be objectively evaluated and interpreted to
better understand the quality
of finer resolution demographic estimates and how they reflect
underlying population
characteristics when the model parameters are changed. In order
to mimic the data
available in contemporary censuses, a random 5% sample of
population is drawn from
the full 1880 Census enumeration (comparable to current PUMS
data) and “synthetic”
summary tables are created from the same source (comparable to
SF3 files). The spatial
allocation procedure will be performed on these historical data
using different
combinations of constraining variables, and the results will be
validated against the
actual values from the 100% population count.
The primary purpose of this article is to evaluate the
performance of a spatial
allocation model which generates small area estimates, through
comparisons of these
estimates with actual population counts and an investigation of
model residuals and
their geographic variation. The paper will also shed light on
the evaluation process
itself, highlighting important considerations in parameter
selection and their influence
on resulting estimates for different population attributes.
These considerations are
crucial in designing a robust validation process prior to
undertaking the validation of
the allocation results using contemporary census data at a CRDC.
Data from the 1880
Census are utilized here as an easily accessible and appropriate
surrogate for
contemporary census data; as such, the priority in this analysis
is neither in historical
interpretations of these allocation results nor in drawing
substantive conclusions
regarding demographic processes in 1880. This article focuses on
confirmatory testing
that can be directly reproduced using contemporary public-domain
Census data, as well
as confidential data in a CRDC, and provides preliminary
validation measures for
spatial allocation methods.
2. Background
2.1 Small area estimation using Census microdata
Matching the distribution of spatially allocated survey data to
known census population
distributions has been widely employed in small area estimation
in the geographical and
other social sciences, using a variety of reweighting algorithms
or other allocation
techniques. To date, much of this research has occurred in the
United Kingdom
(Johnston and Pattie 1993; Williamson, Birkin, and Rees 1998;
Ballas et al. 2005;
Smith, Clarke, and Harland 2009) and Australia (Melhuish, Blake,
and Day 2002;
Tanton et al. 2011). Of particular relevance to the current
study is recent work that
focuses on the definition of appropriate goodness-of-fit
measures to assess the accuracy
of synthetic or reweighted microdata (Williamson, Birkin, and
Rees 1998; Voas and
Williamson 2001). A general shortcoming in validating the
performance of such models
is the lack of a “true” population against which the allocation
results can be compared.
Beckman, Baggerly, and McKay (1996) apply an iterative
proportional fitting (IPF)
technique to 1990 U.S. Census data and demonstrate that
estimated tract-level
household distributions are concordant with tract-level summary
statistics released by
the Census Bureau. However, they validate their estimates
against a different sample
drawn from the same population, not against the 100% population
count. Melhuish,
Blake, and Day (2002) use a reweighting process that allocates
Australian household
survey data which lack locative information to small census
districts based on the
known sociodemographic profiles of these small geographies.
Their evaluation of the
results suggests that the allocated populations correctly match
the 100% population
counts for most districts, but data to evaluate the joint
distributions for most population
characteristics are not publicly available. Hermes and Poulsen
(2012) provide a current
and general overview of the use of microdata reweighting
techniques in generating
small area estimates.
2.2 Maximum entropy microdata allocation
A methodology to allocate reweighted demographic microdata to
small enumeration
areas such as census tracts using decennial U.S. Census data has
been recently
described (Nagle et al. 2012; Leyk, Buttenfield, and Nagle
2013), based on the concepts
of dasymetric mapping and areal interpolation (Mrozinski and
Cromley 1999). In this
approach, maximum entropy methods impute a set of tract-specific
sampling weights
for each microdata record, with the initial tract-specific
weights derived from the survey
design weight. The imputed weights are constrained to match the
known (i.e., publicly
available) tract-level distributions for a number of population
characteristics; the
weights imputation is thus guided and influenced by this chosen
set of constraining
variables. Sampling weights for each microdata household sum
across all tracts to the
approximate design (or household) weight provided by the Census
Bureau. As the
design weight reflects the expected number of households in the
Public Use Microdata
Area (PUMA) that are similar to a given microdata record, each
constructed sampling
weight can be interpreted as the number of households of this
“type” that can be
expected in the respective census tract.
The maximum entropy imputation of sampling weights is
accomplished through an
iterative proportional fitting technique and uses nonlinear
optimization to improve
computational efficiency (Malouf 2002). Given a set of N
microdata household attribute
values Xi and a set of probabilities pij that a household
randomly selected from the
population has attributes similar to those of PUMS household i
and is located in census
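tract j, it is possible to impute the k-th tract-level attribute value by the equation
$\hat{y}_{jk} = N \sum_{i} p_{ij} X_{ik}$. At the outset of modeling, the probabilities $p_{ij}$ are unknown. The
imputation is constrained such that the imputed probabilities reproduce tract-level
populations given in Census summary files:

$$\max_{p} \; -\sum_{i} \sum_{j} p_{ij} \log\left(\frac{p_{ij}}{d_{ij}}\right) \qquad (1)$$

subject to $N \sum_{i} p_{ij} X_{ik} = y_{jk}$ and $\sum_{i,j} p_{ij} = 1$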
for all households i, tracts j, and attributes k. The yjk are
tract-level summaries of
attribute k from the summary files, and dij are prior estimates
of tract-level (i.e., for each
tract j) weights for each PUMS record i, which are subject to
reweighting (Leyk,
Buttenfield, and Nagle 2013). Following the maximum entropy
imputation, the set of
imputed weights guides allocation of households to individual
tracts or other sub-PUMA areas.
Although maximum entropy imputation is but one of many methods
through
which this type of data imputation or reweighting may be
performed, it offers certain
advantages. The tract-specific weights imputed through maximum
entropy will not lead
to negative population estimates, as can be the case with least
squares regression
techniques, and the reweighted tract population distributions
for constraining variables
will exactly reproduce those distributions in the summary
tables. The maximum entropy
procedure used here also retains, for each microdata record, the
full set of attributes
present in the record, allowing for the construction of revised
summary tables for every
available attribute or combinations of attributes.
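The reweighting idea can be illustrated with a small iterative proportional fitting sketch. This is a simplified illustration and not the authors' implementation: it rakes a uniform prior weight to hypothetical tract-level margins one tract at a time, omits the nonlinear optimization of Malouf (2002), and does not enforce that each household's weights sum to its design weight across tracts. The function name `rake_tract_weights`, the household records, and the target counts are all invented for the example.

```python
def rake_tract_weights(households, targets, design_weight=20.0,
                       max_sweeps=200, tol=1e-9):
    """Return {tract: [weight per household]} matching the target margins.

    households: list of dicts mapping a variable name to a category label.
    targets: {tract: {variable: {category: target household count}}}.
    """
    prior = design_weight / len(targets)  # uniform prior across tracts
    weights = {}
    for tract, margins in targets.items():
        w = [prior] * len(households)
        for _ in range(max_sweeps):
            largest_adjustment = 0.0
            for variable, counts in margins.items():
                for category, target in counts.items():
                    members = [i for i, h in enumerate(households)
                               if h[variable] == category]
                    current = sum(w[i] for i in members)
                    if current == 0.0:
                        continue  # no sample record can carry this category
                    factor = target / current
                    for i in members:
                        w[i] *= factor  # multiplicative update keeps w >= 0
                    largest_adjustment = max(largest_adjustment,
                                             abs(factor - 1.0))
            if largest_adjustment < tol:
                break  # all margins already matched in this sweep
        weights[tract] = w
    return weights


# Hypothetical sample of four household types and two tracts.
sample = [
    {"nativity": "foreign", "farm": "yes"},
    {"nativity": "foreign", "farm": "no"},
    {"nativity": "native", "farm": "yes"},
    {"nativity": "native", "farm": "no"},
]
tract_targets = {
    "A": {"nativity": {"foreign": 30, "native": 10},
          "farm": {"yes": 5, "no": 35}},
    "B": {"nativity": {"foreign": 10, "native": 30},
          "farm": {"yes": 20, "no": 20}},
}
fitted = rake_tract_weights(sample, tract_targets)
```

Because every update is a multiplication by a positive factor, the fitted weights cannot become negative, and the margins of the variables used as constraints are reproduced up to the convergence tolerance. These are the two properties highlighted in the paragraph above.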
Once allocated, the microdata household characteristics can be
summarized to (1)
create revised estimates of tract-level (or finer-scale)
demographic summary statistics,
(2) generate summary statistics of attributes not available in
summary files, and (3)
compute new cross-tabulations. In Leyk, Buttenfield, and Nagle
(2013), the revised
summary statistics were compared to original tract population
distributions from the
Census-produced summary tables (based on a 1-in-6 sample), and
allocation ambiguity
was evaluated for each household as a function of the
distribution of imputed sampling
weights over all census tracts. While correlations between the
revised tract-level
summaries and original tract summary statistics were found to be
high and statistically
significant for constraining and non-constraining variables, a
full validation could not
be conducted without access to the full population details
maintained at a CRDC.
In this paper, the same weights imputation technique will be
applied to a sample of
households from the 100% count of the 1880 Census. These
households will be
allocated to enumeration districts according to their exact
imputed sampling weights.
From these allocations, revised summary statistics are computed
for each enumeration
district. These revised tables are then compared against the
true aggregated population
attributes from the full (100%) population count. While the
maximum entropy
imputation model detailed above is explored here, any allocation
model that uses
similar demographic data could be validated using the procedure
outlined in this paper.
2.3 The context of the 1880 Census
The 1880 Census is considered the first high-quality enumeration
of the U.S. population
and full individual records from this historical census have
been digitally transcribed
and made available online (Goeken et al. 2003; Ruggles et al.
2010). Important for the
research reported here, the 1880 Census records contain
household microdata including
spatial identifiers for the geographic units – enumeration
districts – in which the
households were located as well as the spatial boundaries of
these districts. Although
neither PUMAs nor census tracts had yet been defined in 1880,
State Economic Areas
(SEAs) and enumeration districts (EDs) represent a spatial data structure similar to that
found in contemporary censuses. SEAs, which consist of single
counties or groups of
contiguous counties, were defined for the 1950 Census and
retroactively applied to
prior censuses by the Minnesota Population Center (Bogue 1951;
Ruggles et al. 2010).
SEAs were designed to have a minimum population of 100,000
people, much like
contemporary PUMAs, although the retrospective application of
SEAs to the 1880
Census may result in substantially different population sizes.
SEAs were divided into
minor subdivisions known as EDs, similar to contemporary census
tracts but slightly
smaller; these districts corresponded to the area that a
door-to-door enumerator could
cover during the Census period. EDs are fully nested in and
completely enclosed by
SEAs. The similarity between SEAs and PUMAs, and EDs and census
tracts, allows the
1880 Census to serve as a reasonable substitute for more current
censuses in performing
and validating the allocation method.
Although the questions on the 1880 Census covered a wide array
of social and
demographic characteristics, there are differences in attribute
coverage in the 1880
Census relative to recent censuses. Notably, the 1880 Census
carried no questions
regarding income or housing tenure, and the results from the
tendered questions on
educational attainment and literacy were not digitally
transcribed. This lack of direct
measures of socioeconomic status may require the use of less
distinct related data, such
as occupational class or standing, in the construction of
constraining variables. The
purpose of the constraining variables and the procedure used to
select them are
described in the Methods section below.
Because the number of observed attributes found in each
individual record is quite
large, the validation and discussion of the spatial allocation
results will focus on
selected benchmark variables commonly used by, and of particular
interest to,
demographic researchers. These benchmark variables include the
gender, age, race, and
marital status of the householder; the full list of benchmark
variables and their
categorizations are shown in Table 1.
Table 1: Benchmark variables for validation of the spatial allocation

Benchmark                                Number of    Measurement            Record count
                                         categories                          (PUMS N = 3,408)
Age of Householder                       4            Age 0-17                  87
                                                      Age 18-34                976
                                                      Age 35-49              1,278
                                                      Age 50+                1,067
Gender of Householder                    1            Male                   2,747
Race of Householder                      1            Non-White                151
Marital Status of Householder            2            Single                   305
                                                      Married                2,542
Presence of Children in Household        2            Any Children           2,528
                                                      5+ Children Present      555
Nativity of Householder (1)              2            Native Born              872
                                                      Foreign Born           1,918
Occupational Status of Householder (2)   4            Non-Worker               637
                                                      Low-Skill                997
                                                      Medium-Skill             909
                                                      High-Skill               865
Group Quarters Status of Household (3)   1            Group Quarters           293
Urban Status of Household (4)            1            Urban Household        2,788
Farm Status of Household                 1            Farm Household           219

Notes:
(1) Native born refers to individuals born in the U.S. with parents who were born in the U.S. Foreign born refers to individuals
not born in the U.S. A third grouping, U.S.-born household heads whose parents were foreign born, is not considered here.
(2) Occupational standing is measured using the occupational earnings score variable, with the observed variable broken into
tertiles (Low-Skill, Medium-Skill, and High-Skill). Non-workers were individuals outside of the labor force.
(3) Because households were not defined in the 1880 Census, the contemporary distinction between group quarters and
households is not relevant here.
(4) The converse of “Urban” is “Rural”, distinct from but correlated with the “Farm” designation.
These benchmark variables and the categorizations used in this
study are believed to be
fairly representative of the full range of population
characteristics available in this
census. To clarify, while the benchmark variables include some
variables that will be
used as constraining variables in the allocation procedure, the
function of the remaining
benchmark variables is to serve as validation instruments.
3. Methods
The 1880 Census data were used to run the weights imputation and
spatial allocation
model for Hamilton County, Ohio. This county was chosen based on
its stable
boundaries over time and the fact that it was coextensive with a
single SEA (SEA 336).
Although the 1880 Census did not define households in the same
way as is done in
contemporary Censuses, variables describing household
composition were added
retrospectively during data transcription (Goeken et al. 2003;
Ruggles et al. 2010).
There were 68,160 households (comprising 313,702 individuals) in
Hamilton County in
the 100% count of the 1880 Census. Household characteristics
were identified using the
records for all individuals listed as person number one (head of
household), and all
references to household or householder refer to the attributes
of this individual.
Individuals living in group quarters, who are not considered
household members in
current Censuses, are considered household members in this
study. Hamilton County
was divided into 135 EDs, which contained, on average, 505
households (or
approximately 2,300 individuals). A 5% sample, similar to a
contemporary PUMS, was
randomly drawn from the full count of households in the SEA, and
each household in
this sample was assigned a design weight (household weight) of
20. This “pseudo-PUMS” (N=3,408) comprises the analytical sample used in the
maximum entropy
procedure, which is subsequently spatially allocated among the
135 EDs covering the
county.
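The construction of the pseudo-PUMS and the synthetic summary tables can be sketched as follows. This is a minimal illustration under stated assumptions: the population below is synthetic (only the household count of 68,160 and the 135 EDs are taken from the text), the `nativity` values are placeholders, and the function names `draw_pseudo_pums` and `summary_table` are hypothetical.

```python
import random
from collections import Counter


def draw_pseudo_pums(households, fraction=0.05, seed=0):
    """Draw a simple random sample and attach the design weight 1/fraction."""
    rng = random.Random(seed)
    sample_size = round(len(households) * fraction)
    sample = rng.sample(households, sample_size)
    design_weight = 1.0 / fraction  # 20 for a 5% sample
    return [dict(h, weight=design_weight) for h in sample]


def summary_table(households, variable):
    """Count households by ED and by category of `variable` (full count)."""
    table = {}
    for h in households:
        table.setdefault(h["ed"], Counter())[h[variable]] += 1
    return table


# Synthetic stand-in for the 100% count: 68,160 households in 135 EDs.
population = [
    {"id": i, "ed": i % 135 + 1,
     "nativity": "foreign" if i % 3 == 0 else "other"}
    for i in range(68160)
]
pseudo_pums = draw_pseudo_pums(population)
ed_summaries = summary_table(population, "nativity")
```

The sample stands in for the public-use microdata, while `ed_summaries` plays the role of the summary files against which the imputed weights are constrained.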
Prior to running the weights imputation, a crucial task is the
selection of
constraining variables; this procedure is described first. Then
three different measures
are described that can be used to validate the imputation and
allocation results for
different combinations of constraining variables. As noted
above, this study focuses on
the validation procedure; technical details about the maximum
entropy weights
imputation and allocation model beyond the above summary are
described in Nagle et
al. (2012; 2013) and Leyk, Buttenfield, and Nagle (2013).
3.1 Finding meaningful constraining variables
The constraining variables in the maximum entropy weights
imputation should ideally
delineate different household-level residential patterns; this
will increase the variability
in the underlying data that can be explained and result in more
accurate estimates.
Population characteristics (such as gender) that are similarly
distributed among EDs are
unlikely to produce satisfactory allocation results when used as
constraints, since there
may be little variation to exploit. In addition, the inclusion
of multiple highly correlated
variables may be unnecessary, as highly correlated variables
will likely be redundant in
explaining variation in the underlying population distribution.
The choice of
constraining variables represents a difficult problem in survey sampling that has received
limited attention to date, and no standard method is in place.
Bivariate correlations of ED-level population characteristics
are calculated as one
obvious way of assessing highly correlated variables that would
be unsuitable
constraining variables if applied in concert. Principal
component analysis (PCA) is used
to examine how much variation in the data is explained by the
different population
characteristics, and thus to identify the variables that may be
most useful as constraints.
While PCA is commonly used to reduce the dimensionality in a
given set of data, it
may also be helpful in describing the associations between the
variables present in the
data (Jolliffe 2002; Demšar et al. 2013). Finally, a segregation
index, the index of
dissimilarity (D), is computed at the ED-level to determine
those variables that may
represent appropriate constraints. The index of dissimilarity is
a measure of the
evenness of the distribution of two groups (Massey and Denton
1988), and may
therefore be helpful in determining which variables best
differentiate (or segregate)
household residential patterns. Dissimilarity index values range
from 0 to 1, with values
tending towards 1 indicative of more highly segregated groups
and values tending
towards 0 suggesting low levels of segregation among the
groups.
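For reference, the index of dissimilarity for two groups is commonly written as D = 0.5 × Σ |a_i/A − b_i/B|, where a_i and b_i are the group counts in ED i and A and B are the group totals over all EDs. The sketch below computes it for made-up counts; the function name `dissimilarity_index` and the example values are illustrative assumptions, not data from the study.

```python
def dissimilarity_index(group_a, group_b):
    """Index of dissimilarity between two groups counted over the same EDs."""
    total_a = sum(group_a)
    total_b = sum(group_b)
    if total_a == 0 or total_b == 0:
        raise ValueError("both groups need at least one household")
    return 0.5 * sum(abs(a / total_a - b / total_b)
                     for a, b in zip(group_a, group_b))


# Two hypothetical EDs: complete separation versus identical shares.
fully_segregated = dissimilarity_index([10, 0], [0, 10])
evenly_spread = dissimilarity_index([5, 5], [20, 20])
```

A variable whose categories yield a value near 1 differentiates residential patterns strongly and is a candidate constraint, while a value near 0 suggests little spatial variation to exploit.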
3.2 Establishing a validation procedure
Weights imputation is performed using different combinations of
constraining variables
to examine the sensitivity of the allocation model to the number
and types of constraints
applied. As noted in the Methods section above, the weights
imputation redistributes
among the 135 EDs the original design weight for each household
in the pseudo-PUMS
sample, and then iteratively reweights the ED-level weights to
match the aggregate
summary statistics for each ED. Although these imputed weights
are not required, and
in reality are unlikely, to be whole numbers, the sum of the
weights for a particular
household record type across all EDs will be equal to the
expected number of
households of that type (with “type” characterized by the set of
constraining variables
used) in the SEA. Aggregating the imputed weights over those
households exhibiting a
particular attribute (e.g., foreign born household heads) within
each ED will result in a
revised summary statistic for that ED. This revised summary
statistic will match exactly
the actual count derived from the full enumeration if this
attribute has been used as a
constraining variable. An important component of the validation
task then is to establish
how well the revised summary statistics for household attributes
not used as constraints
replicate the actual number of households with those attributes
in each ED. Following
each model run, revised summary tables were generated by ED for
the attributes of
interest (benchmark variables as described above) based on the
allocated microdata.
The revised summary tables were compared to summary tables
constructed from the
100% enumeration of the population. To examine the accuracy of
allocation results
from different perspectives, three goodness-of-fit statistics
were calculated, as described
below.
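The aggregation of imputed weights into a revised summary statistic can be sketched as follows. The weights, the three sample households, and the function name `revised_summary` are hypothetical; the only property carried over from the text is that each household's weights sum across EDs to its design weight of 20.

```python
def revised_summary(weights, households, has_attribute):
    """Sum imputed weights, per ED, over households with a given attribute.

    weights: {ed: [imputed weight for each sample household]}.
    has_attribute: function returning True for households to be counted.
    """
    return {
        ed: sum(w for w, h in zip(ed_weights, households) if has_attribute(h))
        for ed, ed_weights in weights.items()
    }


# Three hypothetical sample households allocated over two EDs.
sample_households = [
    {"foreign_born": True},
    {"foreign_born": False},
    {"foreign_born": True},
]
imputed = {
    "ED1": [12.5, 5.0, 0.5],
    "ED2": [7.5, 15.0, 19.5],
}
foreign_by_ed = revised_summary(
    imputed, sample_households, lambda h: h["foreign_born"])
```

Comparing a dictionary such as `foreign_by_ed` with the same tabulation from the full enumeration yields the residuals on which the goodness-of-fit statistics are built.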
3.2.1 Error in margin
The actual number of households in the entire study area
exhibiting a particular
population characteristic will be compared to the total
allocated number of households
with the same characteristic in order to assess how well
individual variables are being
allocated overall; this difference is designated the error in
margin. While the error in
margin reveals little about the performance of the allocation
procedure in reproducing
the accurate population distribution within EDs, substantial
differences between total
household counts and allocated household counts will indicate
variables for which the
model critically fails. In short, concordance between the total
number of allocated
households and the total number of actual households is a
necessary, but not sufficient,
condition under which to validate model performance.
Importantly for the implementation of the allocation model with
current Census
data, the error in margin can be easily calculated in most cases
based on publicly
available data, even for attributes for which the other
goodness-of-fit statistics described
below cannot be derived. In such cases it is important to
examine how well errors in
margin correspond to the Standardized Allocation Error (SAE) or
z-statistics described below,
which quantify the error in the distribution. These measures are
sometimes irretrievable
from contemporary censuses without access to confidential
data.
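As a minimal sketch, the error in margin can be computed from study-area totals alone. The function name and the sign convention (allocated minus actual, so that positive values indicate over-allocation) are assumptions made for illustration; the counts are the male householder totals for Hamilton County reported with the five-constraint results.

```python
def error_in_margin(actual_total, allocated_total):
    """Relative difference between allocated and actual study-area totals.

    Assumed sign convention: positive means the model over-allocates.
    """
    if actual_total == 0:
        raise ValueError("actual total must be non-zero")
    return (allocated_total - actual_total) / actual_total


# Male householders in Hamilton County: 54,932 actual, 54,999 allocated.
male_error = error_in_margin(54_932, 54_999)
print(f"{male_error:.2%}")  # about 0.12%
```

Because only totals enter the calculation, a value near zero says nothing about how the households are distributed across EDs.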
3.2.2 Residuals and Standardized Allocation Error (SAE)
The residual is the difference within an ED between the actual
population count and the
allocated population count. Standardized Allocation Error (SAE)
is the sum over all
EDs of the absolute residuals standardized by the total expected
population:
$$\mathrm{SAE} = \frac{\sum_i \left| U_i - T_i \right|}{\sum_i U_i} \qquad (2)$$
where U_i is the actual count of the population in ED i and T_i is
the allocated count of the
population in ED i. SAE will generally fall between 0 and 2, with
values closer to 0
indicating a better fit between the actual and allocated
distributions. Because the
allocated margin is not required to match the actual margin for
non-constraining
variables, the SAE could, in theory, be greater than 2 for these
variables. The SAE
compares the actual ED-level household distribution to the
allocated ED-level
household distribution, and is a stricter evaluation of the
accuracy of the model than is
the error in margin described above; SAE is thus the primary
measure of model
performance. SAE is used to test the performance of a variety of
model specifications
(e.g., different variables used as constraining variables) and
to compare across
specifications. The SAE may also be computed for individual EDs,
or for individual
estimates within an ED. In this sense, the SAE is similar to a
coefficient of variation,
which is calculated as the standard error of an (average)
estimate divided by the
estimate itself.
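The SAE definition can be sketched in a few lines. The function name and the ED counts below are hypothetical; the example is chosen so that the allocated and actual margins are identical (zero error in margin) while the distributional error is not, and so that the upper bound of 2 under matching margins is visible.

```python
def standardized_allocation_error(actual, allocated):
    """Sum of absolute ED-level residuals divided by the total actual count."""
    if len(actual) != len(allocated):
        raise ValueError("need one actual and one allocated count per ED")
    total_actual = sum(actual)
    if total_actual == 0:
        raise ValueError("total actual count must be non-zero")
    return sum(abs(u - t) for u, t in zip(actual, allocated)) / total_actual


# Hypothetical counts for four EDs; both margins equal 100.
actual = [40, 30, 20, 10]
allocated = [25, 35, 30, 10]
print(standardized_allocation_error(actual, allocated))  # 0.3

# Worst case with matching margins: every household lands in the wrong ED.
print(standardized_allocation_error([50, 50, 0, 0], [0, 0, 50, 50]))  # 2.0
```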
3.3 Modified z-statistic
The modified z-statistic can be used to compare a table
representing the actual joint
distribution (or cross-tabulation) of multiple population
attributes with a table
representing the allocated joint distribution of those
attributes (Williamson, Birkin, and
Rees 1998). The z-statistic is calculated for each corresponding
pair of table cells, with
significant values representing those elements in the
distribution of the particular
population attribute for which the allocation procedure is
performing inadequately. The
modified z-statistic is calculated by
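$$Z_{ij} = \frac{\dfrac{T_{ij}}{\sum U_{ij}} - \dfrac{U_{ij}}{\sum U_{ij}}}{\sqrt{\dfrac{\dfrac{U_{ij}}{\sum U_{ij}}\left(1 - \dfrac{U_{ij}}{\sum U_{ij}}\right)}{\sum U_{ij}}}} \qquad (3)$$
where i and j indicate individual cells (row i and column j)
within the joint distribution
table of some population attributes, U_ij is the actual count for
cell ij in the ED and T_ij is
the allocated count for cell ij in the ED. Population attributes
for which the actual and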
for which the actual and
allocated distributions are poorly matched may require further
consideration, such as
additional constraining variables to be incorporated into the
model.
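A minimal sketch of the modified z-statistic for a single cell follows. It assumes that the sum of actual counts is taken over the cells sharing one column margin of the cross-tabulation within an ED (for an occupation by age table, one occupational class). The text does not state the summation range, so this is an assumption, although it reproduces the value reported for the first cell of Table 3. The function name and the NaN return for cells with zero variance are likewise illustrative choices.

```python
import math


def modified_z(actual_cell, allocated_cell, actual_total):
    """Modified z-statistic for one table cell.

    Both proportions use the actual total as denominator. Returns NaN when
    the actual proportion is 0 or 1, because the variance term is then zero.
    """
    p = actual_cell / actual_total       # actual proportion
    t = allocated_cell / actual_total    # allocated proportion, same base
    variance = p * (1.0 - p) / actual_total
    if variance == 0.0:
        return math.nan
    return (t - p) / math.sqrt(variance)


# ED 192, non-workers aged 19 or less (Table 3): 245 actual, 232 allocated,
# 404 actual non-worker households in the ED.
print(round(modified_z(245, 232, 404), 2))  # -1.32
```

Under the usual two-sided normal approximation, a cell would be flagged at the 5% level when the absolute value exceeds 1.96; that threshold is an assumption here rather than something stated in the text.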
The above three measures will highlight those variables which
show unusual
behavior within the allocation procedure and make it possible to
carry out an in-depth
validation based on the available full population count. Of
particular interest is the level
of accuracy with which non-constraining variables can be
estimated. An important
question is whether one can differentiate between those
non-constraining variables
which are strongly correlated with one or more constraining
variables, and those which
are seemingly unrelated to any of the constraining variables.
This will provide
important insight for the selection process of constraining
variables and the
configuration of the allocation model. The described validation
procedure will also
indicate whether the accuracies of the ED estimates for
different population
characteristics exhibit geographic heterogeneity through the
compilation of residual
maps, and whether the goodness-of-fit for an allocated
distribution, as measured by the
SAE, can be inferred from the error in margin.
The focus on these different measures of error, and the
relationships between the
measures, is based on the consideration that, in the
contemporary context, model
performance may need to be assessed under different conditions.
The error in margin
can be evaluated with no knowledge of the underlying tract-level
distribution of the
population and the data necessary to carry out this evaluation
is frequently available in
summary tables at the county- or PUMA-level. This is true even
for those population
attributes for which no census tract summaries are publicly
available. However, the
error in margin is limited in assessing model performance
because it does not provide
any information about the distributional accuracy of the model.
The SAE and the
modified z-statistics can be used to evaluate distributional
accuracy, but can be
calculated only when tract-level summary tables are available.
Of course, this is not to
say that the calculation of these latter measures requires the
100% count of the
population that is available here; however, having the 100%
count of the population
allows SAE to be calculated for the full range of
sociodemographic variables and, more
importantly, for any cross tabulations of variables in the
microdata.
4. Results
4.1 The selection of constraining variables
The first step in the allocation process is the selection of
those variables that will be
used as constraints. Although the digitally transcribed 1880
Census includes fewer
variables than more contemporary censuses, there is greater
flexibility in choosing
constraining variables using the 100% population count because
univariate and joint
distributions of any variables of choice can be constructed.
Thus this step is not limited
by the summary tables produced by the Census Bureau. As noted
above, while the
choice of constraining variables should be grounded in theory,
there are analytical
techniques that may guide the selection process. In this study
segregation indices,
bivariate correlations, and principal component analysis are
used to determine favorable
constraining variables, i.e., variables with higher potential
explanatory power that are
not strongly correlated.
Table 2 displays the index of dissimilarity, measured at the
level of the ED using
the aggregate summary tables, for each of the benchmark
variables. Some variables,
including the urban/rural dichotomy, residence in group
quarters, and farm residence,
display very high levels of segregation, due to their natural
geographical disparity.
However, several benchmark variables are highly correlated, and
the inclusion of
multiple highly correlated variables as constraints would be
redundant. Examples of
highly correlated variables include urban residence and farm
residence (Spearman ρ=-0.64) and group quarters status and single status (Spearman
ρ=0.69). The full
correlation matrix for all benchmark variables is displayed in
Appendix 1.
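For reference, the index of dissimilarity in its standard two-group form is half the sum, over areal units, of the absolute difference between each group's share of its own total. The sketch below uses that standard definition; the function name and the household counts are hypothetical and are not taken from the 1880 data.

```python
def dissimilarity_index(group_counts, other_counts):
    """Index of dissimilarity D between two groups across areal units."""
    group_total = sum(group_counts)
    other_total = sum(other_counts)
    if group_total == 0 or other_total == 0:
        raise ValueError("both groups need at least one household")
    return 0.5 * sum(
        abs(g / group_total - o / other_total)
        for g, o in zip(group_counts, other_counts)
    )


# Hypothetical household counts in four EDs, each wholly urban or wholly rural.
urban = [120, 80, 0, 0]
rural = [0, 0, 60, 40]
print(round(dissimilarity_index(urban, rural), 2))  # 1.0
```

The example mirrors the urban/rural case: when every ED is classified entirely as one category, D reaches its maximum of 1.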
Principal component analysis (PCA) provides another method of
selecting relevant
and non-superfluous constraining variables. The results from the
PCA run on the ED-level aggregate summary tables for the 19 benchmark variables
suggest that five
underlying latent variables explain more than 85% of the
variation in the benchmarks.
These five principal components all have eigenvalues greater
than 1; the sixth principal
component has a substantially smaller eigenvalue.6
6 PCA is commonly used to reduce the dimensionality (number of
variables) of a dataset by creating new variables (principal
components) that are combinations of the original variables and
that are uncorrelated with
each other. The principal components should retain as much of
the variation in the dataset that is explained by
the original variables as possible. Eigenvalues are the sample
variances of the principal component scores. The rubric of
retaining only those principal components with eigenvalues greater
than 1 (in cases where the
PCA was run on a correlation matrix) is known as Kaiser's Rule
(Kaiser 1960; Jolliffe 2002).
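The retention rule can be illustrated with a short sketch that takes a list of eigenvalues, keeps those greater than 1, and reports the cumulative share of variance explained. The function name and the eigenvalues are hypothetical (six standardized variables, not the 19 benchmark variables), and the eigenvalues are assumed to come from a PCA on a correlation matrix.

```python
def summarize_components(eigenvalues):
    """Apply Kaiser's Rule and report cumulative explained variance.

    Eigenvalues are assumed to come from a PCA on a correlation matrix,
    so they sum to the number of original variables.
    """
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    retained = [ev for ev in ordered if ev > 1.0]
    cumulative = []
    running = 0.0
    for ev in ordered:
        running += ev
        cumulative.append(running / total)
    return retained, cumulative


# Hypothetical eigenvalues for six standardized variables (not the paper's).
retained, cumulative = summarize_components([2.7, 1.8, 0.9, 0.3, 0.2, 0.1])
print(len(retained))            # 2 components pass Kaiser's Rule
print(round(cumulative[1], 2))  # 0.75 of the variance explained by them
```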
Table 2: Segregation indices for Hamilton County, Ohio
(diversity measured
by enumeration district)
Variable D
Urban vs. Rural 1.00
Farm vs. Non-farm 0.81
Group vs. Non-group 0.81
Male vs. Female 0.14
White vs. Non-white 0.53
Single vs. Non-single 0.25
Married vs. Non-married 0.13
Children present vs. No children present 0.13
5+ Children present vs. Less than 5 children present 0.15
Foreign born vs. Non-foreign born 0.28
Native vs. Non-native 0.39
Occupation: Non-worker vs. All other 0.13
Occupation: Low-skill vs. All other 0.27
Occupation: Medium-skill vs. All other 0.19
Occupation: High-skill vs. All other 0.14
Age: Age 0-17 vs. All other 0.76
Age: Age 18-34 vs. All other 0.07
Age: Age 35-49 vs. All other 0.06
Age: Age 50+ vs. All other 0.09
Note: The urban/rural dichotomy has an index of dissimilarity of
1 because EDs are wholly classified as either urban or rural, with
the
classification extending to all households within the district.
While no such “perfect” constraining variables will exist in
contemporary Census data, this variable was nevertheless
retained as a constraint.
Based on the PCA, the segregation indices, and the bivariate
correlations, five
constraining variables were selected for the analysis. Urban
status and group quarters
status loaded most heavily on principal components 1 and 2,
respectively, and were
retained; these variables also exhibited high dissimilarity
indices. Foreign born status
and native born status loaded most heavily on principal
component 3. Because these
variables display a (naturally) high correlation, only foreign
born status was kept as a
constraint. The variables loading most heavily on principal
component 4 were those
relating to the occupational status of the householder; all of
these variables were also
retained. Although the variables that displayed the highest
loadings on principal
components 5 (and 6) were those related to the age of the
householder, race, with the
highest loading on principal component 7, was nevertheless
chosen as a fifth
constraining variable. This substitution was made because
householder race displayed a
higher level of segregation than did most of the age categories.
The exclusion of age as
a constraining variable also allows for its use in the
confirmatory validation that
follows.
While the constraining variables used here are chosen through a
quantitatively
informed selection procedure, this procedure should not be
construed as the de facto
standard for choosing the “optimal” constraining variables for
the model. There are
variables available in the 1880 Census that are not considered
in this paper, and the
groupings of householder age and occupational status used here
may not reflect the
ideal categorizations for these variables. Cross-tabulations or
interactions of individual
variables (e.g. race by age, gender by occupational status)
could also be constructed and
used as constraints, in the hope that such interactions would
ultimately provide
improved estimates. However, the constraining variables selected
above are assumed to
be sufficiently robust for the validation procedure which
follows.
As noted above, the primary purpose of this paper is to describe
a method of
validating small area estimates using perfect and complete
census information and to
infer from the validation results a process for validating when
such complete
information is not available. A second purpose, however, is to
assess how changing
estimation parameters affect model performance and the estimates
themselves. To this
end, while the model with five constraints will form the base
model, models with 2-4
constraining variables will also be estimated. This step-wise
modeling approach will
facilitate the evaluation of the sensitivity of the estimation
procedure to changes in the
model parameterization. Because adding constraints is likely to
increase the accuracy of
the spatial allocation process in reproducing the actual 100%
population distribution, a
natural inclination would be to add constraining variables until
the supply of available
constraints was exhausted. However, overfitting of the maximum
entropy model
through the inclusion of an excessive number of constraining
variables may lead to
inefficiency and non-convergence. This may be particularly true
in cases where the
univariate or joint population distributions (such as summary
statistics from the Census
SF3 or American Community Survey) for constraint variables used
in the maximum
entropy imputation include sampling error or imputed data.
Following the maximum entropy imputation, the set of imputed
weights is applied
to allocate households to specific EDs. The imputed weight for a
single household in a
single ED represents the expected number of households of that
type within that ED.
Allocation can proceed by assigning fractional parts of
households in strict adherence to
the imputed weights, or by rounding the imputed weights to
integers and relaxing the
strict adherence (Leyk, Buttenfield, and Nagle 2013). The method
applied here utilizes
the exact imputed household weights.
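The step from imputed weights to a revised summary table can be sketched as follows. The data layout (one row of ED weights per microdata household), the function name, and all numbers are hypothetical. Plain rounding is used only as a stand-in for the integer alternative and is not claimed to be the procedure of Leyk, Buttenfield, and Nagle (2013).

```python
def allocated_counts(weights, has_attribute):
    """Expected number of households with an attribute in each ED.

    weights[h][e] is the imputed weight of microdata household h in ED e;
    has_attribute[h] flags whether household h has the attribute.
    """
    n_eds = len(weights[0])
    counts = [0.0] * n_eds
    for household_weights, flag in zip(weights, has_attribute):
        if flag:
            for ed, w in enumerate(household_weights):
                counts[ed] += w
    return counts


# Hypothetical weights for three microdata households across two EDs.
weights = [
    [1.75, 0.25],
    [0.25, 2.75],
    [2.0, 1.0],
]
is_farm = [True, False, True]

# Fractional allocation in strict adherence to the weights.
print(allocated_counts(weights, is_farm))  # [3.75, 1.25]

# Integer alternative (stand-in): round each weight before summing.
rounded = [[round(w) for w in row] for row in weights]
print(allocated_counts(rounded, is_farm))  # [4.0, 1.0]
```

The fractional version preserves the weighted totals exactly, whereas the rounded version trades that adherence for whole households.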
4.2 Post-allocation results: Comparison of allocated
distributions to actual
distributions
Figure 1 compares the total population counts in the SEA to the
allocated population
counts following the maximum entropy allocation model with five
constraining
variables. The variables used as constraints are listed first
(within the gray area),
followed by the additional benchmark variables. While the 100%
population counts and
the allocated population counts for constraining variables are,
by design, the same, this
chart highlights how the allocated total counts for the other
benchmark variables are
very close to their actual counts. For example, the actual
number of male householders
in Hamilton County is 54,932, while the number of male
householders predicted by the
model is only slightly larger, at 54,999.
Figure 1: Comparison of actual population count to allocated
count,
model with five constraints
The largest absolute errors in margin occur in the number of
households with five
or more children, for which 931 households are over-allocated
(9.1% error in margin),
and in the number of married householders, over-predicted by 589
households (1.2%
error in margin). Other than the variable denoting households with five or more
children, the only
benchmark variable with an error in margin greater than 5% is
the number of
householders younger than age 18 (6.5% error in margin).
Figure 2 displays the SAE (distributional error) metrics for
those benchmark
variables not used as constraining variables in the
five-constraint maximum entropy
model; by design the SAE for variables used as constraints is 0.
Although many of the
benchmarks appear to be well allocated by this measure, two
variables have noticeably
poorer fits: Householders younger than age 18 and farm
households. The SAE is
equivalent to the mean absolute residual divided by the mean actual
number of households. On
average, the number of allocated households in an ED is within
approximately 20% of
the actual number of households in that ED, for most benchmark
variables.
Figure 2: Standardized allocation error, model with five
constraints
The maximum entropy procedure was also run with different
numbers of
constraining variables to evaluate how additional constraints
affect the distribution of
allocation errors. Figure 3 displays the SAEs of the benchmark
variables for the
maximum entropy models with 2-4 constraining variables, as well
as the SAEs for the
baseline five-constraint model. As before, these SAEs fall to
zero when a variable is
used as a constraint in the model. In general, the addition of
constraining variables to
the model reduces the SAE for the benchmark variables, although
the magnitude of the
decrease appears to depend on the relationship between the
benchmark variable and the
newly added constraint. For example, the error in the allocation
of farm households
drops substantially when occupational status is added as a
constraining variable (most
farm householders have low occupational status), while the error
in the allocation of
native-born households is greatly reduced when foreign-born is
added as a constraint.
Several benchmark variables, including those representing ages
above 18, gender, and
marital status, exhibit little change when additional
constraints are added to the model.
These benchmarks are largely uncorrelated with any of the
constraining variables and
generally have small errors under any of the model
specifications.
Figure 3: Comparison of standardized allocation error for
different constraint
variable specifications
One final facet in the evaluation of model performance is the
association between
the error in margin and the SAE. Figure 4 highlights the
relationships between the
errors in margin of the benchmark variables (x-axis) and their
ED-level SAE (y-axis),
for the model with 5 constraining variables. A linear regression
line is provided to
summarize the point relationship between the two measures of
error. The error in
margin and the SAE exhibit a positive association, although it
is fairly weak. Notably,
the total allocated counts of both farm households and
householders less than age 18 are
very close to their actual counts in the population, but the
distribution of these
populations within specific EDs is much less successful. It
appears therefore that
inferences about the distributional performance of the
allocation model based on
agreement between the actual and allocated totals (error in
margin) should be
approached with caution. This underscores the earlier statement
that the error in margin
is itself insufficient in determining model performance.
Figure 4: Model with 5 constraints: Error in margin (ratio of
residual to actual
count) by error in distribution (ratio of summed absolute
residuals to
actual count)
4.3 Post-allocation results: Comparison of the joint
distribution of a constraining
variable and a non-constraining variable
To this point, only allocation errors in the univariate
distributions of the group of
benchmark variables have been explored. However, researchers are
often interested in
the joint distributions of variables; indeed, one anticipated
goal from developing spatial
allocation models using microdata is the ability to estimate
joint distributions of
variables for which none had previously existed. To assess the
accuracy of the spatial
allocation in duplicating the actual joint distributions of
variables, the z-statistic
described above may be used.
The top two panels of Table 3 display, for two randomly selected
enumeration
districts, the actual numbers of households, the allocated
numbers of households, and
the calculated z-statistics for the cross-tabbed distribution of
a household attribute used
as a constraining variable, householder occupational status, and
a household attribute
not used as a constraining variable, householder age. These
tables reveal ED-specific
discrepancies in the allocation performance of the model and may
also highlight broad
misallocation trends, such as that seen among the non-worker
occupational class in both
EDs. The last panel of Table 3 shows the aggregated performance
metrics for each
occupation/age group cell as the total number (and percent) of
EDs which are well-allocated for that cell. In general, the number of
well-allocated EDs is quite high for any
particular cell, with noticeable patterns of poor allocation
among the 25-29 age group
and among the oldest householders.
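The aggregation used in the last panel can be sketched as a count of EDs whose z-statistic for a given cell is not significant. The critical value of 1.96 (two-sided 5% level under a normal approximation), the function name, and the z-values are assumptions for illustration.

```python
def well_allocated_share(z_values, critical=1.96):
    """Count and share of EDs whose |z| stays below the critical value."""
    n_ok = sum(1 for z in z_values if abs(z) < critical)
    return n_ok, n_ok / len(z_values)


# Hypothetical z-statistics for one age/occupation cell in five EDs.
count, share = well_allocated_share([-1.3, 0.3, -4.6, 3.0, 0.0])
print(count, f"{share:.0%}")  # 3 60%
```

With 135 EDs in the study area, a cell that is well-allocated in 112 EDs corresponds to a share of about 83%.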
Table 3: Comparison of allocated age and occupational status
distribution to
100% count distribution
Enumeration District 192
Non-worker | Low Occupational Status | Medium Occupational Status | High Occupational Status
Age | 100% Allocated z-statistic | 100% Allocated z-statistic | 100% Allocated z-statistic | 100% Allocated z-statistic
19 or less 245 232 -1.32 36 25 2.29** 9 10 0.34 3 1 -1.16
20-24 13 14 0.28 42 47 -0.79 21 25 0.92 16 22 1.55
25-29 30 6 -4.55** 59 50 1.39 35 35 0.00 42 34 -1.35
30-34 17 29 2.97** 34 37 -0.53 40 36 -0.70 43 35 -1.34
35-39 18 33 3.62** 28 37 -1.58 37 34 -0.54 37 31 -1.07
40-44 23 13 -2.15** 23 38 -2.60** 30 34 0.78 35 39 0.73
45-49 18 6 -2.89** 32 21 2.49** 24 14 -2.16** 29 33 0.79
50-54 16 12 -1.02 25 18 1.70 13 14 0.29 22 33 2.45**
55-59 11 15 1.22 10 15 -1.32 8 11 1.08 16 12 -1.03
60-64 8 14 2.14** 12 11 0.31 6 6 0.00 7 10 1.15
65 or greater 5 30 11.25** 11 14 -0.82 3 7 2.32** 6 5 -0.41
Total 404 404 312 313 226 226 256 255
Table 3: (Continued)
Enumeration District 115
Non-worker | Low Occupational Status | Medium Occupational Status | High Occupational Status
Age | 100% Allocated z-statistic | 100% Allocated z-statistic | 100% Allocated z-statistic | 100% Allocated z-statistic
19 or less 108 83 -3.60** 1 3 2.01** 0 0 0.00 0 2 2.00**
20-24 1 6 5.01** 5 4 -0.46 6 7 0.42 2 14 8.53**
25-29 2 5 2.13** 5 12 3.22** 9 16 2.42** 10 29 6.16**
30-34 2 15 9.24** 15 11 -1.13 32 21 -2.23** 32 29 -0.58
35-39 6 13 2.90** 12 13 0.31 19 20 0.25 48 34 -2.30**
40-44 10 9 -0.32 10 13 1.00 20 19 -0.24 22 28 1.35
45-49 13 10 -0.86 19 12 -1.80 15 15 0.00 26 26 0.00
50-54 20 13 -1.65 11 8 -0.96 13 15 0.58 22 22 0.00
55-59 12 12 0.00 7 7 0.00 11 11 0.00 16 11 -1.30
60-64 9 11 0.68 8 6 -0.74 6 7 0.42 14 8 -1.66
65 or greater 12 18 1.79 2 5 2.14** 3 4 0.58 17 6 -2.78**
Total 195 195 95 94 134 135 209 209
All Enumeration Districts
19 or less 112 83% 119 88% 129 96% 131 97%
20-24 116 86% 113 84% 102 76% 92 68%
25-29 102 76% 99 73% 105 78% 98 73%
30-34 100 74% 118 87% 111 82% 117 87%
35-39 100 74% 123 91% 120 89% 124 92%
40-44 117 87% 116 86% 123 91% 118 87%
45-49 122 90% 109 81% 112 83% 122 90%
50-54 118 87% 116 86% 125 93% 111 82%
55-59 122 90% 123 91% 111 82% 124 92%
60-64 116 86% 111 82% 114 84% 118 87%
65 or greater 96 71% 106 79% 106 79% 103 76%
Notes: ** Statistically significant at 5%, based on modified
z-statistic. Totals may not be equivalent due to rounding.
Bottom panel: Number of enumeration districts with allocated
count of age/occupational status category statistically near the
actual
count, as measured by the modified z-statistic. Total number of
enumeration districts in the study area is 135.
4.4 Post-allocation results: Comparison of the joint
distribution of two non-constraining variables
Because occupational status was used as a constraining variable
in the maximum
entropy imputation, the allocated counts for a joint
distribution which includes this
variable might be expected to maintain a high level of
consistency with the 100% count.
To assess the performance of the allocation for the joint
distribution of two non-constraining variables, the cross-tabulation analysis in the
prior section was repeated for
the gender and age of the householder, two benchmark variables
that are not used to
constrain the maximum entropy imputation. These results are
shown in Table 4.
Table 4: Comparison of allocated age and sex distribution to
100% count
distribution
Enumeration District 192
Female Male
Age 100% Allocated z-statistic 100% Allocated z-statistic
19 or less 61 158 14.37** 221 122 -7.59**
20-24 13 30 4.85** 84 73 -1.26
25-29 32 31 -0.19 125 103 -2.11**
30-34 22 22 0.00 115 112 -0.30
35-39 22 31 2.01** 107 96 -1.13
40-44 26 24 -0.42 100 85 -1.59
45-49 22 21 -0.22 70 64 -0.74
50-54 18 25 1.72 51 59 1.15
55-59 11 15 1.23 39 33 -0.98
60-64 10 10 0.00 22 33 2.37**
65 or greater 4 20 8.07** 24 33 1.86
Total 241 387 958 813
Enumeration District 115
19 or less 49 37 -2.16** 60 51 -1.24
20-24 6 7 0.42 8 24 5.70**
25-29 2 5 2.14** 24 57 6.90**
30-34 6 10 1.67 75 66 -1.13
35-39 9 11 0.69 76 68 -1.00
40-44 9 11 0.69 53 59 0.87
45-49 13 11 -0.58 60 51 -1.24
50-54 18 14 -1.01 48 44 -0.61
55-59 10 12 0.66 36 29 -1.21
60-64 7 9 0.78 30 24 -1.13
65 or greater 4 11 3.55** 30 21 -1.69
Total 133 138 500 494
All Enumeration Districts
Female | Male
Age | Number of EDs Allocated Satisfactorily, Percent | Number of EDs Allocated Satisfactorily, Percent
19 or less 112 83% 90 67%
20-24 100 74% 89 66%
25-29 91 67% 81 60%
30-34 101 75% 104 77%
35-39 91 67% 117 87%
40-44 110 81% 106 79%
45-49 111 82% 110 81%
50-54 118 87% 119 88%
55-59 111 82% 113 84%
60-64 113 84% 105 78%
65 or greater 99 73% 97 72%
Notes: ** Statistically significant at 5%, based on modified
z-statistic. Totals may not be equivalent for non-constraining
variables.
Bottom panel: Number of enumeration districts with allocated
count of age/gender category statistically near the actual count,
as
measured by the modified z-statistic. Total number of
enumeration districts in the study area is 135.
Within the two selected EDs, allocation performance appears to
be at least as good
as, and possibly better than, the allocation in the previous
occupation/age distribution.
Once again, the most egregious misallocations occur in the
youngest and oldest age
groups. Unlike in the occupation/age distribution shown above,
in which the column
margins (occupation) were constrained to be equal, there is no
such restriction in this
table. As such, much of the misallocation in the gender/age
distribution in enumeration
district 192 of Table 4 may be the consequence of the
overallocation of female heads of
household to this ED (387 allocated versus 241 actual).
Although it is infeasible to examine the joint distributions of
all variables over
each and every ED in the sample, the information gleaned from
the comparisons of a
few EDs may be useful in restructuring the original optimization
problem. In addition,
the third panels of Tables 3 and 4, which aggregate the joint
distributional errors over
all EDs, may be helpful for a better understanding of spatial
heterogeneity in the
allocation error, which is the focus of the next section.
4.5 Post-allocation results: Geographic heterogeneity in
benchmark variable
allocation errors
The model with five constraints results in only two benchmark
variables (householder
age 0-17 and households with 5+ children) having an error in
margin greater than 5%
and only four benchmark variables (householder age 0-17, farm
households, native-born householder, and single householder) having SAE values
greater than 20%. Maps
of the allocation errors in these poorly performing variables
were created at the scale of
the ED to visually assess whether spatial heterogeneity or local
clustering was present
in the errors.7 Figures 5-9 display maps of the standardized
residuals, by ED, for those
benchmark variables that have high SAEs or high errors in
margin. The focus of these
maps is on the EDs in the denser, central portion of the county,
which comprise most of
the city of Cincinnati. The extant outset maps display the whole
county as a reference.
EDs are shaded according to their allocation error (the residual
divided by the actual ED
count) in the five constraint model, with lighter EDs indicating
lower allocation errors
and darker EDs indicating greater allocation errors.
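The quantity mapped for each ED can be sketched as the residual divided by the actual ED count, expressed as a percentage. The sign convention (allocated minus actual), the treatment of EDs with an actual count of zero, the function name, and the counts are all assumptions for illustration.

```python
def ed_allocation_errors(actual, allocated):
    """Per-ED allocation error in percent: 100 * (allocated - actual) / actual.

    EDs with an actual count of zero have no defined ratio and map to None.
    """
    errors = {}
    for ed, u in actual.items():
        t = allocated[ed]
        errors[ed] = None if u == 0 else 100.0 * (t - u) / u
    return errors


# Hypothetical counts keyed by ED identifier.
actual = {"ED-A": 50, "ED-B": 20, "ED-C": 0}
allocated = {"ED-A": 45.0, "ED-B": 26.0, "ED-C": 1.5}
print(ed_allocation_errors(actual, allocated))
# {'ED-A': -10.0, 'ED-B': 30.0, 'ED-C': None}
```

Small actual counts inflate this ratio, which is one reason sparsely populated EDs can appear as outliers on a residual map.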
Residuals for householders age 0-17 (Figure 5) and households
with 5+ children
(Figure 6) appear to be largest in the south-central portion of
the county, which
encompasses the city of Cincinnati. While single householders
(Figure 7) were also
misallocated to the largest extent in this general locale, large
residuals for single
householders seem to be clustered on the outskirts of the
central city. Perhaps the most
distinct clustering of allocation residuals occurs for the
benchmark variables of native-born householder (Figure 8) and
farm households (Figure 9). There are large errors in
the allocation of native born households in the EDs just north
of the historic central
business district of Cincinnati, while farm households are
highly misallocated in the
majority of the downtown EDs.
7 These maps were based on GIS boundary files downloaded from
the Urban Transition Historical GIS
Project (http://www.s4.brown.edu/utp/). This project is further
described in Logan et al. (2011).
Figure 5: Standardized allocation error (expressed as %) for
householders age
0-17
Spatial heterogeneity in the ED-level allocation errors for a
benchmark variable
may arise due to spatial clustering of the variables used as
constraints or due to very
small population sizes in some EDs, and may be linked to
substantive processes and
ideas. In a general sense, the processes that lead to clustered
misallocations of
households of a particular population attribute in nearby EDs
may manifest as clustered
ED-level allocation errors of this attribute or another that is
closely related. The residual
maps provide visual confirmation of such spatial patterns, and
may be useful in guiding
additional examination of constraining variables that may
improve model performance
(i.e., decrease spatially clustered allocation errors). In this
sense such maps can be
useful investigative tools to better understand the allocation
process and the reasons for
its limited performance in some areas.
Figure 6: Standardized allocation error (expressed as %) for
households with
5+ children
Figure 7: Standardized allocation error (expressed as %) for
single
householders
Figure 8: Standardized allocation error (expressed as %) for
native born
householders
Figure 9: Standardized allocation error (expressed as %) for
farm households
5. Discussion and concluding remarks
The maximum entropy procedure detailed above aims to increase
the utility of Census
microdata in small area estimation by adding geographic detail
to household microdata
records. This spatially enhanced microdata can be used in the
construction of revised
summary tables which cover a wider range of population
characteristics than those
currently available, as well as new joint distributions.
However, there has been limited
authentication of the results obtained from this spatial
allocation model, a model which
may comprise many different specifications, variables, and
geographical contexts. The
purpose of this paper is thus to design and test a validation
procedure, highlighting the
performance of the model under different configurations using
publicly available
Census data from 1880 and drawing conclusions about how to
transfer this framework
to the more contemporary context. The results shown above
suggest that the validation
procedure provides useful statistics, allowing an in-depth
evaluation of the accuracy of
the household allocation model and highlighting some directions
for future work. While
the focus in this paper is on an imputation and allocation
procedure based on the
principle of maximum entropy, nothing in the validation itself
is specific to this spatial
allocation design. As such, a wide variety of different
allocation methods could be
employed and validated with the same data source, and the
results used to compare and
evaluate the performance of the different methods. The 100%
count data from the 1880
Census offers an attractive alternative to the use of synthetic
or simulated microdata, the
creation of which may rely on a host of assumptions regarding
social and residential
processes.
One important conclusion from this assessment is that the
addition of constraining
variables improves model fit not just for the constraining
variables themselves, but also
for variables that are correlated with the constraining
variables (Figure 3). For example,
the addition of the foreign born variable as a constraint
results in a decrease of nearly
50% in the SAE for the native born benchmark variable, with
which it is highly
correlated. This behavior can be leveraged, and the total number
of constraints
minimized, through a careful selection of constraining variables
that share multiple high
correlations with other variables. A second significant
conclusion is that smaller errors
in margin are associated, albeit weakly, with overall better
fitting distributions
(indicated by SAEs). Errors in margin are an easily calculable
fit statistic, and since
their computation requires no knowledge of the actual
distribution of the population
within or across the EDs (or tracts) in the SEA (or PUMA), they
can be computed using
publicly available data. This fact is highly beneficial, as it
would allow for a
preliminary validation of an allocation model with a specific
set of parameters without
any need to access confidential data. However, the number of
variables over which this
relationship was tested was small, some of the benchmark
variables still displayed poor
distributional fit, and the overall association between the
errors in margin and the SAE
was not remarkably strong.
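The contrast between the two fit statistics can be sketched in a few lines of Python. The sketch assumes that the error in margin is the relative difference between the allocated and actual totals for the whole SEA, and that the SAE is the sum of absolute ED-level residuals divided by the actual total; the definitions given in Section 3.2 take precedence over these assumptions, and all counts are invented for illustration.

```python
def error_in_margin(allocated, actual_total):
    """Relative difference between allocated and actual SEA-wide totals
    (assumed definition; needs only the public SEA total)."""
    return (sum(allocated) - actual_total) / actual_total


def standardized_allocation_error(allocated, actual):
    """Sum of absolute ED-level residuals over the actual total
    (assumed definition; needs the ED-level distribution)."""
    if len(allocated) != len(actual):
        raise ValueError("allocated and actual must cover the same EDs")
    total = sum(actual)
    return sum(abs(a - b) for a, b in zip(allocated, actual)) / total


# Hypothetical counts for one benchmark variable in four EDs.
allocated = [120.0, 80.0, 45.0, 55.0]
actual = [100, 90, 50, 60]

eim = error_in_margin(allocated, sum(actual))
sae = standardized_allocation_error(allocated, actual)
print(f"error in margin: {eim:.3f}, SAE: {sae:.3f}")
```

In this invented case the margin is reproduced exactly while the distribution across EDs is not, which is consistent with the weak association noted above: a small error in margin is a useful preliminary check but does not guarantee a small SAE.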
While the intent here is not to develop an optimal model fitting
the 1880 Census
data, it is instructive to consider the overall pattern of data
fit that is being produced by
the maximum entropy imputation and subsequent spatial
allocation, as this model has
been developed for use with, and evaluated using, contemporary
Census data. In
general, the estimates being produced by the model are quite
promising. In its report on
the use of American Community Survey data, the National Research
Council (2007, p.
64) advises that coefficients of variation (CVs) in the range of
10-12% are acceptable
for population estimates. While the SAEs shown above are not CVs
in a strict sense,
they are mathematically comparable. The results from Figure 2
show that nearly half of
the non-constraint benchmark variables used in this study
achieve this goal, with
several others performing only marginally worse. In the context
of contemporary ACS
tract population estimates, which have large variances, the
errors in the estimates from this spatial
allocation do not seem excessive for most of the benchmark
variables surveyed.
Although the allocated counts for most benchmark variables
display high
concordance with the 100% counts, two variables, the number of
householders age 0-17
and the number of farm households, are poorly allocated. The
large allocation errors for
these variables are somewhat surprising since both variables are
highly correlated with
constraint variables included in the model: the minor
householder population with the
group quarters population (ρ=0.53) and the farm household
population with the urban
population (ρ=-0.64). The problem in the allocation of these two
variables, then,
appears to be that both describe relatively small populations.
Of the non-constraining
benchmark variables examined in this paper, these two variables
have by far the
smallest sample counts (N_{AGE 0-17} = 87, N_{FARM} = 219). Although
both the group quarters
variable and the non-white variable also have sample counts in
this range, these
variables are used as constraints; thus, the allocation errors
for these variables are 0.
The inability of the maximum entropy procedure to accurately
allocate variables
describing rare populations is troubling, as estimates for these
variables may be the
most desired; in the contemporary context, variables with small
populations may be
least likely to have Census-produced summary tables. Additional
research is therefore
warranted on whether these variables may be better estimated
through a different post-imputation allocation and how they can be reliably identified
based on model
diagnostics. It is also worth repeating that the allocated
counts described above are not
counts of specific households within each ED; rather, they are
the imputed weights for
all households exhibiting an attribute, aggregated within the
ED. As such, small
allocated counts do not identify the ED in which any particular
household is located.
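The distinction between an allocated count and a tally of located households can be illustrated with a minimal sketch. It assumes, hypothetically, that the imputation yields one weight per sample household per ED; the household identifiers, ED labels, weights, and attribute flags below are all invented.

```python
# weights[h][ed]: hypothetical imputed weight of sample household h in each ED.
weights = {
    "hh1": {"ED1": 0.7, "ED2": 0.3},
    "hh2": {"ED1": 0.2, "ED2": 0.8},
    "hh3": {"ED1": 0.5, "ED2": 0.5},
}
# Hypothetical attribute flag, e.g. whether the household is a farm household.
is_farm = {"hh1": True, "hh2": False, "hh3": True}


def allocated_counts(weights, has_attribute):
    """Sum the weights of households with the attribute within each ED."""
    counts = {}
    for hh, by_ed in weights.items():
        if not has_attribute[hh]:
            continue
        for ed, w in by_ed.items():
            counts[ed] = counts.get(ed, 0.0) + w
    return counts


farm_counts = allocated_counts(weights, is_farm)
print(farm_counts)
```

Each farm household contributes fractional weight to every ED, so the resulting counts summarize where such households are likely to be without assigning any single household to a single ED.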
In addition to overall measures of goodness of fit, the spatial
distributions of model
residuals for individual variables are useful to determine where
a model over- or under-predicts and to identify local clusters of small or large
residuals. While in this study
substantive discussion is not a priority, the results
demonstrate the usefulness of such
maps for researchers who are interested in more detailed
interpretations of residual
distributions with regard to specific variables of interest.
As noted before, this article does not discuss substantive
questions regarding
demographic processes in 1880 due to a desire to focus on the
validation procedure
itself. An important question is how the validation methods
described above will
translate from the 1880 Census to more current Censuses or the
ACS. Nothing in the
validation procedure is specific to the data from 1880 (or to
the chosen geography of
Hamilton County, Ohio), although there are certainly differences
between the 1880
Census and the current context.
The maximum entropy procedure requires constraining variables
that occur within
(and are comparable between) both the public-use microdata file
and the Census-produced summary files. This caused no restriction in using the
1880 data, for which
the “public-use” microdata file and summary files could simply
be constructed from the
100% data. This will not be the case when using current data,
although an examination
of the 2006-2010 ACS public-use microdata and summary files
reveals that they contain
many of the constraining variables used in this analysis. A more
persistent
methodological problem may be the presence of sampling variance
and imputed data in
contemporary Census data. Because the full 1880 census was
available for use in this
analysis, there is no inherent uncertainty in the summary tables
created. Sampling
variance and data imputation in current Census-produced summary
files could lead to
convergence problems in the maximum entropy procedure and may
require model
reparameterization. In the worst-case scenario, some potential
constraining variables
may have to be discarded if their inclusion in the model
repeatedly leads to non-convergence. This indicates an obvious need for
uncertainty-sensitive modeling
techniques that can handle inherent sampling variance in
constraining variables.
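One plausible mechanism behind such convergence problems, offered here as an assumption rather than as a finding of this study, is that noisy constraint totals become mutually inconsistent. The sketch below uses iterative proportional fitting (IPF), which Johnston and Pattie (1993) relate to entropy maximization, as a simple stand-in; it is not the estimator used in this paper, and the seed table and margins are invented.

```python
def ipf(seed, row_targets, col_targets, tol=1e-9, max_iter=1000):
    """Iterative proportional fitting on a 2-D table.

    Returns (table, converged). Convergence means the row sums still match
    their targets immediately after the columns have been fitted.
    """
    table = [row[:] for row in seed]
    for _ in range(max_iter):
        # Scale each row to its target total.
        for i, target in enumerate(row_targets):
            s = sum(table[i])
            table[i] = [v * target / s for v in table[i]]
        # Scale each column to its target total.
        for j, target in enumerate(col_targets):
            s = sum(row[j] for row in table)
            for row in table:
                row[j] *= target / s
        worst = max(abs(sum(row) - t) for row, t in zip(table, row_targets))
        if worst < tol:
            return table, True
    return table, False


seed = [[1.0, 2.0], [3.0, 4.0]]
# Consistent margins: both sets of totals sum to 100.
_, ok = ipf(seed, [60.0, 40.0], [50.0, 50.0])
# Inconsistent margins: column totals sum to 105, as noisy estimates might.
_, noisy_ok = ipf(seed, [60.0, 40.0], [55.0, 50.0])
print(ok, noisy_ok)
```

With consistent margins the fit settles within a few iterations; with inconsistent margins no table can satisfy both sets of totals, so the loop exhausts its iteration budget. This is the kind of behavior that uncertainty-sensitive techniques would need to accommodate.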
Beyond the issue raised above, this study has been designed so
that the historical
data are similar to contemporary data in organizational
structure and geographic scale,
to offer a compelling case for the use of this method on such
data. Prior research has
likewise provided evidence for the application of this method to
the contemporary
context. In Leyk, Nagle, and Buttenfield (2013), tract summary
counts for spatially
allocated microdata from the 2000 Census exhibit strong
correlations with tract counts
from Census-produced summary tables for several variables. The
results from the
present study suggest that further validation of these
contemporary findings should be
pursued at a CRDC.
Based on the insights from this study, some general steps
can be taken prior to undertaking a model validation at a CRDC.
The first is to develop a
set of benchmark variables against which to evaluate the
results. A limited number of
benchmark variables were included in this validation analysis.
It may be desirable to
include additional variables in the full evaluation,
particularly those variables which are
uncorrelated with model constraints or which have small overall
margins, as these
benchmarks exhibited high residuals and SAEs. Next, those
variables available for use
as constraining variables can be determined using a combination
of publicly available
summary tables and PUMS documentation. Note that constraining
variables must be
procurable in both the microdata and the summary tables.
Variables which are likely to
produce the most satisfactory results when used as constraints
may be identified using
bivariate correlations, measures of segregation, and PCA; the
data necessary to run
these identification tests are publicly available in
Census-produced summary files.
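The correlation-based part of this screening can be sketched as follows. The example ranks candidate variables by how many other variables they correlate with strongly, using only tract-level proportions of the kind found in public summary files. The variable names, the values, and the 0.5 cutoff are hypothetical choices for illustration; segregation measures and PCA are not shown.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def rank_candidates(table, threshold=0.5):
    """Count, for each variable, how many others it correlates with
    at |r| >= threshold, and sort in descending order of that count."""
    scores = {}
    for name, values in table.items():
        scores[name] = sum(
            1
            for other, other_values in table.items()
            if other != name and abs(pearson(values, other_values)) >= threshold
        )
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical tract-level proportions from public summary files.
table = {
    "foreign_born": [0.10, 0.30, 0.50, 0.20, 0.40],
    "native_born": [0.90, 0.70, 0.50, 0.80, 0.60],
    "urban": [0.20, 0.60, 0.90, 0.30, 0.80],
    "group_quarters": [0.05, 0.01, 0.04, 0.02, 0.03],
}
ranking = rank_candidates(table)
print(ranking)
```

Variables near the top of such a ranking are candidates for constraints that may improve the fit of several correlated benchmarks at once, while a variable that correlates with nothing else would only be well allocated if used as a constraint itself.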
Following the selection of the constraining variables, the
imputation may be run
using the publicly available PUMS data. The imputed weights can
then be used in the
tract allocation, and the total margins for the allocated data
can be compared to the
actual margins to identify prominent errors and adjust the model
accordingly. For those
benchmark variables for which Census-produced summary tables or
cross-tabulations
are publicly available, SAEs and z-statistics can be computed to
further adjust the
model. Measures of error for benchmark variables and joint
distributions not publicly
available will require evaluation at a CRDC.
5.1 Limitations and future steps
Some potential limitations with regard to the relationship
between the historical and
contemporary data may require further consideration. Relative to
current censuses, the
1880 Census appears to include a less diverse population with
more homogeneous
residential patterns (less segregation), and thus the choice of
constraining variables may
need to be revisited. While the results in this paper indicate
that additional constraining
variables have a beneficial impact on the reproduction of the
correct population
distribution for other non-constraining variables, it is still
unclear what the optimal
number of constraints might be. Additional work with current ACS
data will allow
determination of the point at which additional constraints may
result in model non-convergence or increasing misallocation. The impact of
population size of SEAs and
EDs should also be further examined, to better understand the
effect of population size
on the maximum entropy method applied. This will also provide
some indication of how
the method might be applied to different survey data, such as
the National Health
Interview Survey or the National Health and Nutrition
Examination Survey, which are
reported only for large geographies (i.e., states or regions).
Future research will also
investigate differences in the validation results within rural
and urban settings in more
detail.
6. Acknowledgments
This research is funded by the National Science Foundation:
“Collaborative Research:
Putting People in Their Place: Constructing a Geography for
Census Microdata”,
project BCS-0961598 awarded to University of Colorado - Boulder
and University of
Tennessee - Knoxville. Funding by NSF is gratefully
acknowledged.
References
Assunção, R.M., Schmertmann, C.P., Potter, J.E., and Cavenaghi,
S.M. (2005).
Empirical Bayes estimation of demographic schedules for small
areas.
Demography 42(3): 537-558. doi:10.1353/dem.2005.0022.
Ballas, D., Clarke, G., Dorling, D., Eyre, H., Thomas, B., and
Rossiter, D. (2005).
SimBritain: A spatial microsimulation approach to population
dynamics.
Population, Space and Place 11(1): 13-34.
doi:10.1002/psp.351.
Beckman, R.J., Baggerly, K.A., and McKay, M.D. (1996). Creating
synthetic baseline
populations. Transportation Research Part A 30(6): 415-429.
doi:10.1016/0965-8564(96)00004-3.
Bogue, D.J. (1951). State economic areas. Washington: U.S.
Bureau of the Census.
Demšar, U., Harris, P., Brunsdon, C., Fotheringham, A.S., and
McLoone, S. (2013).
Principal component analysis on spatial data: An overview.
Annals of the
Association of American Geographers 103(1): 106-128.
doi:10.1080/00045608.2012.689236.
Goeken, R., Nguyen, C., Ruggles, S., and Sargent, W. (2003). The
1880 U.S.
population database. Historical Methods 36(1): 27-34.
doi:10.1080/01615440309601212.
Hermes, K. and Poulsen, M. (2012). A review of current methods
to generate synthetic
spatial microdata using reweighting and future directions.
Computers,
Environment and Urban Systems 36(4): 281-290.
doi:10.1016/j.compenvurbsys.2012.03.005.
Johnston, R.J. and Pattie, C.J. (1993). Entropy-maximizing and
the iterative
proportional fitting procedure. Professional Geographer 45(3):
317-322.
doi:10.1111/j.0033-0124.1993.00317.x.
Jolliffe, I.T. (2002). Principal component analysis (2nd
edition). Berlin: Springer
Verlag.
Kaiser, H.F. (1960). The application of electronic computers to
factor analysis.
Educational and Psychological Measurement 20(1): 141-151.
doi:10.1177/001316446002000116.
Leyk, S., Buttenfield, B.P., and Nagle, N. (2013). Modeling
ambiguity in Census
microdata allocations to improve demographic small area
estimates.
Transactions in Geographic Information Science 17(3): 406-425.
doi:10.1111/j.1467-9671.2012.01366.x.
Leyk, S., Nagle, N., and Buttenfield, B.P. (2013). Maximum
entropy dasymetric
modeling for demographic small area estimation. Geographical
Analysis 45(3):
285-306. doi:10.1111/gean.12011.
Logan, J.R., Jindrich, J., Shin, H., and Zhang, W. (2011).
Mapping America in 1880:
The urban transition historical GIS project. Historical Methods:
A Journal of
Quantitative and Interdisciplinary History 44(1): 49-60.
doi:10.1080/01615440.2010.517509.
Malouf, R. (2002). A comparison of algorithms for maximum
entropy parameter
estimation. Proceedings of the sixth conference on natural
language learning
(CoNLL-2002). New Brunswick, NJ: 49-55.
doi:10.3115/1118853.1118871.
Massey, D.S. and Denton, N.A. (1988). The dimensions of
residential segregation.
Social Forces 67(2): 281-315. doi:10.2307/2579183.
Melhuish, T., Blake, M., and Day, S. (2002). An evaluation of
synthetic household
populations for Census Collection Districts created using
optimization
techniques. Australasian Journal of Regional Studies 8(3):
369-387.
Mrozinski, Jr., R.D. and Cromley, R.G. (1999). Singly- and
doubly-constrained
methods of areal interpolation for vector-based GIS.
Transactions in GIS 3(3):
285-301. doi:10.1111/1467-9671.00022.
Nagle, N.N., Buttenfield, B.P., Leyk, S., and Spielman, S.E.
(2013, forthcoming).
Dasymetric modeling and uncertainty. Annals of the Association
of American
Geographers.
Nagle, N.N., Buttenfield, B.P., Leyk, S., and Spielman, S.E.
(2012). An uncertainty-
informed penalized maximum entropy dasymetric model. Presented
at the 7th
International Conference on Geographic Information Science
(GIScience 2012),
Columbus, OH, September 18-21, 2012.
National Research Council (2007). Using the American Community
Survey: Benefits
and challenges. In: Citro, C.F. and Kalton, G. (eds.). Panel on
the Functionality
and Usability of Data from the American Community Survey.
Washington, DC:
The National Academies Press, Committee on National Statistics,
Division of
B