The Analysis of Complex Surveys David Bellhouse
A Typical Complex Design
Stratified two-stage cluster sampling
Strata
• geographical areas
First stage units
• smaller areas within the larger areas
Second stage units
• households
Clusters
• all individuals in the household
Why a Complex Design?
Classical reasons
• better coverage of the entire region of interest (stratification)
• efficient for interviewing: less travel, less costly
A new reason
• to reduce cost, the sample is piggybacked on a sample chosen by a complex design such as the Canadian Labour Force Survey
Problem: estimation and analysis are more complex
Examples of Complex Designs
• Ontario Health Survey
• Canadian Community Health Survey
• Youth in Transition Survey
• Youth Risk Behavior Survey (U.S.)
Ontario Health Survey
• carried out in 1990
• health status of the population was measured
• data were collected relating to the risk factors associated with major causes of morbidity and mortality in Ontario
• 61,239 persons were surveyed by Statistics Canada using a stratified two-stage cluster sample
OHS Sample Selection
• strata: public health units, divided into rural and urban strata
• first stage: enumeration areas defined by the 1986 Census of Canada and selected by pps (probability proportional to size)
• second stage: dwellings selected by SRS (simple random sampling)
• cluster: all persons in the dwelling
Youth in Transition Survey
Reading cohort (15 year-olds): stratified two-stage sampling
• school population stratified by province, language of instruction and enrollment size
• 1,200 schools selected within strata
• eligible students selected within each sampled school
– the initial student sample size for the survey conducted in 2000 was 38,000
Canadian Community Health Survey
Piggybacked on the Canadian Labour Force Survey (LFS)
• LFS design
– stratified by province and economic regions
– geographical areas (usually enumeration areas) within strata chosen with probability proportional to the population size of the area
– dwellings chosen within geographical areas
– all persons in the dwelling interviewed
Youth Risk Behavior Survey
Stratified three-stage cluster sample
• strata are metropolitan statistical areas
• primary sampling units are large counties or groups of smaller counties
• second stage units are schools within counties
• third stage units are classes within schools
• all students within a class are interviewed
Basic Problem in Survey Data Analysis
[Figure: two images separated by a ≠ sign; the images are not recoverable]
Issues
iid (independent and identically distributed) assumption
• the assumption does not hold in complex surveys because of correlations induced by the sampling design or because of the population structure
• blindly applying standard programs to the analysis can lead to incorrect results
Two Simple Examples to Illustrate the Problems Involved in Analyzing “Complex” Samples
• an old Ontario lottery called Lottario that is similar to Lotto 6/49
• a pay equity lawsuit involving a stratified sampling design
Lottery Example
Lottario: old Ontario lottery, a Lotto 6/39
• seven numbers chosen on a draw night: six regular numbers and a bonus number
• winning numbers collected for 167 draws ending in January 1982
• want to test whether each of the 39 numbers (or balls) has the same chance of being chosen
Breakdown of Independence Assumption
• On a draw night, the numbers are chosen by simple random sampling without replacement
– numbers chosen within a draw are not independent
• Between draws the balls are replaced to be drawn again on the next draw
– numbers chosen in different draws are independent
Lottario Draws up to January 1982
[Figure: bar chart of frequency (axis 0 to 50) by ball number (1 to 39)]
Chi-square Test for Equality of Proportions
• Ignoring the lack of independence
– test statistic = 48.674
– p-value = 0.115
• Taking into account the lack of independence
– test statistic = (38/32)*48.674 = 57.8
– p-value = 0.021
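The factor 38/32 matches a finite-population correction (N - 1)/(N - n) with N = 39 balls and n = 7 balls per draw. That reading is an assumption, since the slide gives only the numbers. The following minimal Python sketch recomputes both tests under that assumption, with 38 degrees of freedom, using the closed-form chi-square tail for even degrees of freedom so that only the standard library is needed:

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with even df.

    Closed form: exp(-x/2) * sum over j < df/2 of (x/2)**j / j!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive, even df")
    half = x / 2.0
    term, total = 1.0, 1.0          # j = 0 term
    for j in range(1, df // 2):     # j = 1 .. df/2 - 1
        term *= half / j
        total += term
    return math.exp(-half) * total


N_BALLS = 39          # balls in the machine (from the slide)
BALLS_PER_DRAW = 7    # six regular numbers plus the bonus (from the slide)
DF = N_BALLS - 1      # degrees of freedom for equal proportions
naive_stat = 48.674   # chi-square statistic reported on the slide

# Assumed correction for sampling without replacement within a draw.
correction = (N_BALLS - 1) / (N_BALLS - BALLS_PER_DRAW)
adjusted_stat = correction * naive_stat

p_naive = chi2_sf_even_df(naive_stat, DF)
p_adjusted = chi2_sf_even_df(adjusted_stat, DF)

print(f"naive:    statistic = {naive_stat:.3f}, p-value = {p_naive:.3f}")
print(f"adjusted: statistic = {adjusted_stat:.3f}, p-value = {p_adjusted:.3f}")
```

The computed p-values can be compared with the 0.115 and 0.021 reported on the slide.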
Pay Equity Example
Pay equity survey dispute: Canada Post and PSAC
• two job evaluations on the same set of people (and same set of information) carried out in 1987 and 1993
• rank correlation between the two sets of job values obtained through the evaluations was 0.539
• assumption to obtain a valid estimate of correlation: pairs of observations are iid
Scatterplot of Evaluations
• Rank correlation is 0.539
[Figure: scatterplot of rank in 1993 against rank in 1987, both axes 0 to 200]
A Stratified Design with Distinct Differences Between Strata
• the pay level increases with each pay category (four in number)
• the job value also generally increases with each pay category
• therefore the observations are not iid
Scatterplot by Pay Category
[Figure: scatterplot of rank in 1993 against rank in 1987 (both axes 0 to 200), with points marked by pay category 2, 3, 4 and 5]
Correlations within Level
Correlations within each pay level
• Level 2: -0.293
• Level 3: -0.010
• Level 4: 0.317
• Level 5: 0.496
Only Level 4 is significantly different from 0
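Pooling strata whose means differ can produce a large overall correlation even when there is none within any stratum. The sketch below shows this with made-up points, not the pay equity data, and with the ordinary Pearson correlation rather than the rank correlation used on the slides:

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: four strata (labelled 2 to 5 like the pay levels),
# each a small square of points centred at (10*level, 10*level).
# Within a square the x and y deviations are uncorrelated by construction.
offsets = [(-1, 1), (1, -1), (-1, -1), (1, 1)]
strata = {
    level: [(10 * level + dx, 10 * level + dy) for dx, dy in offsets]
    for level in (2, 3, 4, 5)
}

within = {
    level: pearson([x for x, _ in pts], [y for _, y in pts])
    for level, pts in strata.items()
}
pooled_points = [pt for pts in strata.values() for pt in pts]
pooled = pearson([x for x, _ in pooled_points], [y for _, y in pooled_points])

for level, r in within.items():
    print(f"level {level}: within-stratum correlation = {r:.3f}")
print(f"pooled correlation = {pooled:.3f}")
```

In this toy data the pooled correlation comes entirely from the differences between stratum centres.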
Available Software for Complex Survey Analysis
• commercial packages
– STATA
– SAS
– SPSS
– SUDAAN
• noncommercial package
– R
Typical Survey Data File
stratum  psu  initwt  finalwt  age  race  ethnicty  educ  sex
      7   20       8       10   14     2         2     7    1
      7   20       8       10   13     2         2     7    1
      7   20       8       10   12     2         2     5    2
      7   20       8       10   15     2         2     8    1
      7   20       8       10   14     2         1     7    1
      7   20       8       14   14     1         2     9    1
      7   20       8       14   16     1         2     9    2
     12   21       8       10   17     2         2     9    1
     12   21       8       10   16     2         2     9    2
     12   21       8       10   14     2         9     8    1
     12   21       8       10   16     2         2     9    1
     12   21       8       10   16     2         2    10    1
     12   21       8       10   16     2         2     9    1
     12   21       8       10   18     2         2    11    1
     12   21       8       14   17     1         2    11    1
     12   21       8       14   17     1         2    11    1
     12   21       8       10   16     2         2     9    1
     12   21       8       14   17     1         2    10    1
     12   21       8       10   16     2         2     9    1
     12   21       8       10   13     2         2     7    1
     12   21       8       10   17     2         2    10    1
Survey Weights: Definitions
initial weight
• equal to the inverse of the inclusion probability of the unit
final weight
• initial weight adjusted for nonresponse, poststratification and/or benchmarking
• interpreted as the number of units in the population that the sample unit represents
Interpretation
• the survey weight for a particular sample unit is the number of units in the population that the unit represents
[Figure legend: not sampled, Wt = 2, Wt = 5, Wt = 6, Wt = 7]
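As a small sketch of the definition, the weights 2, 5, 6 and 7 in the legend correspond to inclusion probabilities 1/2, 1/5, 1/6 and 1/7. The response values `y` below are hypothetical and only illustrate how a weighted total is formed:

```python
from fractions import Fraction

# Inclusion probabilities chosen so the weights match the figure legend.
inclusion_probs = [Fraction(1, 2), Fraction(1, 5), Fraction(1, 6), Fraction(1, 7)]

# Initial weight = inverse of the inclusion probability.
initial_weights = [1 / p for p in inclusion_probs]

# Sum of weights = number of population units the sample represents.
represented_units = sum(initial_weights)

# Hypothetical 0/1 responses for the four sampled units.
y = [1, 0, 1, 1]
estimated_total = sum(w * yi for w, yi in zip(initial_weights, y))

print("weights:", [int(w) for w in initial_weights])
print("population units represented:", int(represented_units))
print("estimated population total of y:", int(estimated_total))
```

Exact fractions are used so that 1/(1/6) comes out as exactly 6 rather than a floating-point approximation.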
Some Consequences of Ignoring the Weights: Survey of Youth in Custody
• first U.S. survey of youths confined to long-term, state-operated institutions
• complemented existing Children in Custody censuses
• companion survey to the Surveys of State Prisons
• the data contain information on criminal histories, family situations, drug and alcohol use, and peer group activities
• survey carried out in 1989 using stratified systematic sampling
SYC Design
strata
• type (a): groups of smaller institutions
• type (b): individual larger institutions
sampling units
• strata type (a)
– first stage: institution, selected with probability proportional to size of the institution
– second stage: individual youths in custody
• strata type (b)
– individual youths in custody
• individuals chosen by systematic random sampling
Effect of the Weights
• Example: age distribution, Survey of Youth in Custody
Age      Counts   Sum of Weights
11            1               28
12            9              149
13           53              764
14          167             2143
15          372             3933
16          622             5983
17          634             5189
18          334             2778
19          196             1763
20          122             1164
21           57              567
22           27              273
23           14              150
24           13              128
Totals     2621            25012
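The two age distributions can be recomputed directly from the table. This sketch uses only the counts and sums of weights shown above; the variable names are illustrative:

```python
ages = list(range(11, 25))
counts = [1, 9, 53, 167, 372, 622, 634, 334, 196, 122, 57, 27, 14, 13]
weight_sums = [28, 149, 764, 2143, 3933, 5983, 5189, 2778,
               1763, 1164, 567, 273, 150, 128]

n_sample = sum(counts)          # sample size
n_weighted = sum(weight_sums)   # estimated population size

# Unweighted proportion: share of the sample at each age.
unweighted = [c / n_sample for c in counts]
# Weighted proportion: share of the estimated population at each age.
weighted = [w / n_weighted for w in weight_sums]

print("age  unweighted  weighted")
for age, u, w in zip(ages, unweighted, weighted):
    print(f"{age:3d}  {u:10.3f}  {w:8.3f}")

modal_unweighted = ages[unweighted.index(max(unweighted))]
modal_weighted = ages[weighted.index(max(weighted))]
print("modal age, unweighted:", modal_unweighted)
print("modal age, weighted:  ", modal_weighted)
```

The two sets of proportions differ because the average weight per respondent varies by age.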
Unweighted Histogram
[Figure: histogram titled "Age Distribution of Youth in Custody"; proportion (0 to 0.3) by age (11 to 24)]
Weighted Histogram
[Figure: histogram titled "Age Distribution of Youth in Custody"; proportion (0 to 0.3) by age (11 to 24)]
Weighted versus Unweighted
[Figure: chart titled "Weighted and Unweighted Histograms"; proportion (0 to 0.3) by age (11 to 24), with separate weighted and unweighted series]
General Approach to Analysis with Standard Software
• the software usually handles stratified two-stage cluster samples
• if there are more than two stages of sampling, the later stages are usually ignored in the analysis
Reason
$V \cong \frac{S_1^2}{l} + \frac{S_2^2}{lm} + \frac{S_3^2}{lmn}$
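Reading the formula as V ≅ S1²/l + S2²/(lm) + S3²/(lmn), each later stage contributes a term divided by a larger product of sample sizes. The sketch below assumes that l, m and n are the numbers of units sampled at the first, second and third stages, which the slide does not state, and uses hypothetical values throughout:

```python
def stage_variance_terms(s1_sq, s2_sq, s3_sq, l, m, n):
    """Return the three terms of V ~ S1^2/l + S2^2/(l*m) + S3^2/(l*m*n)."""
    return (s1_sq / l, s2_sq / (l * m), s3_sq / (l * m * n))


# Hypothetical inputs: equal variance components at every stage,
# 50 first-stage units, 10 second-stage units each, 20 third-stage units each.
terms = stage_variance_terms(1.0, 1.0, 1.0, l=50, m=10, n=20)
total = sum(terms)
shares = [t / total for t in terms]

for stage, (term, share) in enumerate(zip(terms, shares), start=1):
    print(f"stage {stage}: term = {term:.4f}, share of V = {share:.1%}")
```

With these made-up inputs the first-stage term accounts for most of the total, which is one way to see why the later stages are often ignored.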
Typical Models used in Analysis
Ordinary Regression
$E(y) = X\beta$
Logistic Regression
$\ln\left(\frac{\pi}{1-\pi}\right) = X\beta$
General Consequences of Using the Sampling Weights but Ignoring the Sampling Design
Inferences are usually based on the quadratic form
$(\hat{\beta} - \beta)^{T} \hat{V}^{-1} (\hat{\beta} - \beta)$
• V is the variance-covariance matrix of the regression parameter estimates
• ignoring the survey design leads to estimates of V that are too small
• therefore estimates of $V^{-1}$ are larger than they should be
• leads to test statistics that are larger than they should be (you find a significant result more often than you should)
• leads to confidence intervals that are narrower than they should be
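The same effect can be seen in the simplest case, the standard error of a mean. The sketch below simulates clustered data (all settings are hypothetical) and compares the iid standard error with one computed from the cluster means. It illustrates the idea only; it is not the estimator used by any particular survey package:

```python
import math
import random
import statistics

random.seed(12345)  # fixed seed so the run is reproducible

N_CLUSTERS = 30     # hypothetical number of sampled clusters
CLUSTER_SIZE = 20   # hypothetical number of units per cluster

# Each unit's value = shared cluster effect + individual noise,
# so units in the same cluster are positively correlated.
clusters = []
for _ in range(N_CLUSTERS):
    cluster_effect = random.gauss(0, 1)
    clusters.append(
        [cluster_effect + random.gauss(0, 1) for _ in range(CLUSTER_SIZE)]
    )

all_values = [v for cluster in clusters for v in cluster]

# Standard error that treats all observations as iid.
naive_se = statistics.stdev(all_values) / math.sqrt(len(all_values))

# Standard error based on variability between cluster means
# (valid here because every cluster has the same size).
cluster_means = [statistics.mean(cluster) for cluster in clusters]
cluster_se = statistics.stdev(cluster_means) / math.sqrt(N_CLUSTERS)

print(f"iid standard error:           {naive_se:.4f}")
print(f"cluster-based standard error: {cluster_se:.4f}")
print(f"ratio: {cluster_se / naive_se:.2f}")
```

When the cluster-based standard error is larger, test statistics built on the iid one are correspondingly inflated.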
Regression Example: Ontario Health Survey
[Figure: DBMI versus BMI (binned); DBMI (15 to 30) plotted against BMI (12 to 42)]
• size of the circle is related to the sum of the survey weights in the estimate
• more data in the BMI range 17 to 29 approximately
Ontario Health Survey
Regress desired body mass index (DBMI) on body mass index (BMI)
                      STATA    Unweighted   Weighted
Intercept  Estimate   10.877   11.196       10.877
           S.E.        0.141    0.064        0.065
Slope      Estimate   0.4958   0.4716       0.4858
           S.E.       0.0058   0.0025       0.0026
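One way to read the table is to compare standard errors across columns. The sketch below assumes that the STATA column is the design-based analysis and that the Weighted column uses the weights but ignores the design; the slide does not label the columns that explicitly. Under that assumption, the squared ratio of standard errors is a rough measure of how much the variance is understated:

```python
# Standard errors copied from the table.
se_design = {"intercept": 0.141, "slope": 0.0058}     # STATA column
se_weighted = {"intercept": 0.065, "slope": 0.0026}   # Weighted column

# Ratio of standard errors and its square (ratio of variances).
se_ratio = {k: se_design[k] / se_weighted[k] for k in se_design}
variance_ratio = {k: r ** 2 for k, r in se_ratio.items()}

for name in se_design:
    print(f"{name}: S.E. ratio = {se_ratio[name]:.2f}, "
          f"variance ratio = {variance_ratio[name]:.2f}")
```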
Youth Risk Behavior Survey
Recall
• sampling design is a stratified three-stage cluster sample
• need only give the strata and first stage unit identifiers to do the analysis with available software
Demo in SPSS
Log into MyVlab
Youth Risk Behavior Survey
Data File in an SPSS sav file
Find the survey design variables in the file
Prepare for Analysis by Specifying the Design
Specifying the design variables
Design options to choose from
Logistic Regression Analysis
Other Approaches
• The estimate of the variance of the regression parameters is obtained using a technique called Taylor linearization
– the cluster identifiers are needed to carry out this procedure
– due to privacy constraints StatCan will provide this information only through an RDC
– you need to apply to get into an RDC
• Alternate approach: the bootstrap
– different approach to the bootstrap for complex surveys than iid data sets
– data file consists of several sets of bootstrap weights
– calculate the estimates for each set of bootstrap weights and look at the variability in the estimates
– can be done using SAS macros
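The bootstrap-weight recipe can be sketched in a few lines. The data, the full-sample weights and the four sets of bootstrap weights below are all hypothetical, and the variance is taken as the average squared deviation of the bootstrap estimates from the full-sample estimate. Some bootstrap variants need an extra adjustment factor, so the survey's own documentation should be followed in practice:

```python
import math


def weighted_mean(values, weights):
    """Weighted mean: sum(w * y) / sum(w)."""
    return sum(w * y for w, y in zip(weights, values)) / sum(weights)


# Hypothetical data file: one response, one full-sample weight,
# and several sets of bootstrap weights per respondent.
y = [1, 2, 3, 4]
full_weights = [10, 10, 10, 10]
bootstrap_weight_sets = [
    [20, 0, 10, 10],
    [0, 20, 10, 10],
    [10, 10, 20, 0],
    [10, 10, 0, 20],
]

# Estimate with the full-sample weights.
estimate = weighted_mean(y, full_weights)

# Repeat the same estimate once per set of bootstrap weights.
bootstrap_estimates = [weighted_mean(y, w) for w in bootstrap_weight_sets]

# Variability of the bootstrap estimates around the full-sample estimate.
variance = sum((b - estimate) ** 2 for b in bootstrap_estimates) / len(
    bootstrap_estimates
)
standard_error = math.sqrt(variance)

print("full-sample estimate:", estimate)
print("bootstrap estimates: ", bootstrap_estimates)
print("bootstrap standard error:", standard_error)
```

The same loop applies to any estimate, such as a regression coefficient: refit once per set of bootstrap weights and summarize the spread.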