The Analysis of Complex Surveys - rdc.uwo.cardc.uwo.ca/events/docs/...slides/2009-10/Bellhouse-ComplexSurveys … · 12 21 8 10 172291 12 21 8 10 162292 12 ... unit identifiers to

The Analysis of Complex Surveys

David Bellhouse

A Typical Complex Design

Stratified two-stage cluster sampling

Strata
• geographical areas

First stage units
• smaller areas within the larger areas

Second stage units
• households

Clusters
• all individuals in the household

Why a Complex Design?

Classical reasons
• better cover of the entire region of interest (stratification)
• efficient for interviewing: less travel, less costly

A new reason
• to reduce cost the sample is piggybacked on a sample chosen by a complex design such as the Canadian Labour Force Survey

Problem: estimation and analysis are more complex

Examples of Complex Designs

• Ontario Health Survey
• Canadian Community Health Survey
• Youth in Transition Survey
• Youth Risk Behavior Survey (U.S.)

Ontario Health Survey

• carried out in 1990
• health status of the population was measured
• data were collected relating to the risk factors associated with major causes of morbidity and mortality in Ontario
• survey of 61,239 persons was carried out in a stratified two-stage cluster sample by Statistics Canada

OHS Sample Selection

• strata: public health units, divided into rural and urban strata
• first stage: enumeration areas defined by the 1986 Census of Canada and selected with probability proportional to size (pps)
• second stage: dwellings selected by simple random sampling (SRS)
• cluster: all persons in the dwelling

Youth in Transition Survey

Reading cohort (15 year-olds): stratified two-stage sampling

• school population stratified by province, language of instruction and enrollment size
• 1,200 schools selected within strata
• eligible students selected within each sampled school
  – the initial student sample size for the survey conducted in 2000 was 38,000

Canadian Community Health Survey

Piggybacked on the Canadian Labour Force Survey (LFS)

• LFS design
  – stratified by province and economic regions
  – geographical areas (usually enumeration areas) within strata chosen with probability proportional to the population size of the area
  – dwellings chosen within geographical areas
  – all persons in the dwelling interviewed

Youth Risk Behavior Survey

Stratified three-stage cluster sample
• strata are metropolitan statistical areas
• primary sampling units are large counties or groups of smaller counties
• second stage units are schools within counties
• third stage units are classes within schools
• all students within a class are interviewed

Basic Problem in Survey Data Analysis

≠

Issues

iid (independent and identically distributed) assumption

• the assumption does not hold in complex surveys because of correlations induced by the sampling design or because of the population structure

• blindly applying standard programs to the analysis can lead to incorrect results

Two Simple Examples to Illustrate the Problems Involved in Analyzing “Complex” Samples

• an old Ontario lottery called Lottario that is similar to Lotto 6/49

• a pay equity lawsuit involving a stratified sampling design

Lottery Example

Lottario: old Ontario lottery, a Lotto 6/39
• seven numbers chosen on a draw night: six regular numbers and a bonus number
• winning numbers collected for 167 draws ending in January 1982
• want to test whether each of the 39 numbers (or balls) has the same chance of being chosen

Breakdown of Independence Assumption

• On a draw night, the numbers are chosen by simple random sampling without replacement
  – numbers chosen within a draw are not independent
• Between draws the balls are replaced to be drawn again on the next draw
  – numbers chosen between draws are independent

Lottario Draws up to January 1982

[Bar chart: frequency (0 to 50) by ball number (1 to 39)]

Chi-square Test for Equality of Proportions

• Ignoring the lack of independence
  – test statistic = 48.674
  – p-value = 0.115
• Taking into account the lack of independence
  – test statistic = (38/32)*48.674 = 57.8
  – p-value = 0.021
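The adjustment above can be checked numerically. The sketch below is only an illustration: it reads the factor 38/32 as (N - 1)/(N - n) with N = 39 balls and n = 7 numbers per draw, and it assumes 38 degrees of freedom (39 categories minus one). Both readings are assumptions rather than statements from the slides. The helper `chi2_sf_even_df` is a hypothetical name; it uses the closed-form chi-square tail probability that is valid for even degrees of freedom.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable (closed form, even df only)."""
    if df % 2 != 0:
        raise ValueError("this closed form only covers even degrees of freedom")
    half = x / 2.0
    term = 1.0   # (x/2)^0 / 0!
    total = 1.0
    for i in range(1, df // 2):
        term *= half / i      # builds (x/2)^i / i! incrementally
        total += term
    return math.exp(-half) * total

N_BALLS = 39         # balls in the drum (from the slides)
N_DRAWN = 7          # six regular numbers plus the bonus number
naive_stat = 48.674  # statistic quoted on the slide
df = N_BALLS - 1     # assumed degrees of freedom

# Assumed reading of the 38/32 factor: (N - 1) / (N - n)
correction = (N_BALLS - 1) / (N_BALLS - N_DRAWN)
adjusted_stat = correction * naive_stat

p_naive = chi2_sf_even_df(naive_stat, df)
p_adjusted = chi2_sf_even_df(adjusted_stat, df)

print(f"naive:    statistic = {naive_stat:.3f}, p = {p_naive:.3f}")
print(f"adjusted: statistic = {adjusted_stat:.1f}, p = {p_adjusted:.3f}")
```

Under these assumptions the printed values should agree, to rounding, with the statistics and p-values quoted above.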

Pay Equity Example

Pay equity survey dispute: Canada Post and PSAC

• two job evaluations on the same set of people (and same set of information) carried out in 1987 and 1993

• rank correlation between the two sets of job values obtained through the evaluations was 0.539

• assumption to obtain a valid estimate of correlation: pairs of observations are iid

Scatterplot of Evaluations

• Rank correlation is 0.539

[Scatterplot: rank in 1993 versus rank in 1987, both axes 0 to 200]

A Stratified Design with Distinct Differences Between Strata

• the pay level increases with each pay category (four in number)

• the job value also generally increases with each pay category

• therefore the observations are not iid

Scatterplot by Pay Category

[Scatterplot: rank in 1993 versus rank in 1987, both axes 0 to 200, points marked by pay category 2, 3, 4, 5]

Correlations within Level

Correlations within each pay level
• Level 2: -0.293
• Level 3: -0.010
• Level 4: 0.317
• Level 5: 0.496

Only Level 4 is significantly different from 0
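A toy illustration of how pooling strata with different levels can manufacture a high overall rank correlation. The data are entirely made up (they are not the Canada Post/PSAC evaluations): within each of four hypothetical pay levels the two ranks move in opposite directions, yet the pooled rank correlation is strongly positive because both variables rise from level to level.

```python
def ranks(values):
    """Ranks 1..n for distinct values (the toy data contain no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def pearson(x, y):
    """Ordinary product-moment correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def spearman(x, y):
    """Rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical job values: each level shifts both evaluations upward,
# but inside a level the two evaluations are perfectly discordant.
strata = {
    level: [(10 * level + a, 10 * level + 2 - a) for a in range(3)]
    for level in (2, 3, 4, 5)
}

within = {
    level: spearman([p[0] for p in pts], [p[1] for p in pts])
    for level, pts in strata.items()
}
pooled_points = [p for pts in strata.values() for p in pts]
pooled = spearman([p[0] for p in pooled_points], [p[1] for p in pooled_points])

print("within-level correlations:", within)
print("pooled correlation:", round(pooled, 3))
```

The between-level differences alone drive the pooled figure, which is why a single pooled correlation is hard to interpret when the observations are not iid.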

Available Software for Complex Survey Analysis

• commercial packages
  – STATA
  – SAS
  – SPSS
  – SUDAAN
• noncommercial package
  – R

Typical Survey Data File

stratum  psu  initwt  finalwt  age  race  ethnicty  educ  sex
      7   20       8       10   14     2         2     7    1
      7   20       8       10   13     2         2     7    1
      7   20       8       10   12     2         2     5    2
      7   20       8       10   15     2         2     8    1
      7   20       8       10   14     2         1     7    1
      7   20       8       14   14     1         2     9    1
      7   20       8       14   16     1         2     9    2
     12   21       8       10   17     2         2     9    1
     12   21       8       10   16     2         2     9    2
     12   21       8       10   14     2         9     8    1
     12   21       8       10   16     2         2     9    1
     12   21       8       10   16     2         2    10    1
     12   21       8       10   16     2         2     9    1
     12   21       8       10   18     2         2    11    1
     12   21       8       14   17     1         2    11    1
     12   21       8       14   17     1         2    11    1
     12   21       8       10   16     2         2     9    1
     12   21       8       14   17     1         2    10    1
     12   21       8       10   16     2         2     9    1
     12   21       8       10   13     2         2     7    1
     12   21       8       10   17     2         2    10    1

Survey Weights: Definitions

initial weight
• equal to the inverse of the inclusion probability of the unit

final weight
• initial weight adjusted for nonresponse, poststratification and/or benchmarking
• interpreted as the number of units in the population that the sample unit represents
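A minimal sketch of the two definitions, using a hypothetical two-stratum design with simple random sampling inside each stratum. The stratum sizes, the response counts and the simple ratio adjustment for nonresponse are all illustrative assumptions; real surveys typically apply more elaborate adjustments.

```python
# Hypothetical design: n units sampled by SRS from N in each stratum,
# of which only `respondents` actually answered.
strata = {
    "urban": {"N": 5000, "n": 50, "respondents": 40},
    "rural": {"N": 1200, "n": 30, "respondents": 24},
}

def initial_weight(pop_size, sample_size):
    """Inverse of the inclusion probability n/N under SRS within a stratum."""
    inclusion_prob = sample_size / pop_size
    return 1.0 / inclusion_prob

def nonresponse_adjusted(weight, sample_size, respondents):
    """One simple (assumed) adjustment: spread nonrespondents' weight over respondents."""
    return weight * sample_size / respondents

initial = {name: initial_weight(s["N"], s["n"]) for name, s in strata.items()}
final = {
    name: nonresponse_adjusted(initial[name], s["n"], s["respondents"])
    for name, s in strata.items()
}

# Each respondent "represents" final[name] population units, so the
# respondents' weights add back up to the population size.
represented = sum(final[name] * s["respondents"] for name, s in strata.items())

print("initial weights:", initial)
print("final weights:  ", final)
print("population represented:", represented)
```

The sum of the final weights recovering the population size is the numerical counterpart of the "number of units represented" interpretation.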

Interpretation

• the survey weight for a particular sample unit is the number of units in the population that the unit represents

[Figure legend: Not sampled, Wt = 2, Wt = 5, Wt = 6, Wt = 7]

Some Consequences of Ignoring the Weights: Survey of Youth in Custody

• first U.S. survey of youths confined to long-term, state-operated institutions

• complemented existing Children in Custody censuses.

• companion survey to the Surveys of State Prisons

• the data contain information on criminal histories, family situations, drug and alcohol use, and peer group activities

• survey carried out in 1989 using stratified systematic sampling

SYC Design

strata
• type (a) groups of smaller institutions
• type (b) individual larger institutions

sampling units
• strata type (a)
  – first stage: institution by probability proportional to size of the institution
  – second stage: individual youths in custody
• strata type (b)
  – individual youths in custody
• individuals chosen by systematic random sampling

Effect of the Weights

• Example: age distribution, Survey of Youth in Custody

Age      Counts   Sum of Weights
11            1               28
12            9              149
13           53              764
14          167             2143
15          372             3933
16          622             5983
17          634             5189
18          334             2778
19          196             1763
20          122             1164
21           57              567
22           27              273
23           14              150
24           13              128
Totals     2621            25012
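The two age distributions compared in the histograms can be recomputed directly from the table above. This is a small sketch: the counts and sums of weights are copied from the table, and the variable names are illustrative.

```python
ages = list(range(11, 25))
counts = [1, 9, 53, 167, 372, 622, 634, 334, 196, 122, 57, 27, 14, 13]
weight_sums = [28, 149, 764, 2143, 3933, 5983, 5189, 2778,
               1763, 1164, 567, 273, 150, 128]

n_sample = sum(counts)            # number of sampled youths
n_population = sum(weight_sums)   # estimated population size

# Unweighted proportions describe the sample; weighted ones estimate the population.
unweighted = [c / n_sample for c in counts]
weighted = [w / n_population for w in weight_sums]

print(f"{'Age':>3} {'Unweighted':>11} {'Weighted':>9}")
for age, u, w in zip(ages, unweighted, weighted):
    print(f"{age:>3} {u:>11.4f} {w:>9.4f}")
```

The largest gap is at age 17, where the unweighted proportion is noticeably higher than the weighted one.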

Unweighted Histogram

[Histogram: Age Distribution of Youth in Custody; proportion (0 to 0.3) by age (11 to 24)]

Weighted Histogram

[Histogram: Age Distribution of Youth in Custody; proportion (0 to 0.3) by age (11 to 24)]

Weighted versus Unweighted

[Histogram: Weighted and Unweighted Histograms; proportion (0 to 0.3) by age (11 to 24), with weighted and unweighted series]

General Approach to Analysis with Standard Software

• the software usually handles stratified two-stage cluster samples

• if there are more than two stages of sampling the latter stages are usually ignored in the analysis

Reason

$$V \cong \frac{S_1^2}{l} + \frac{S_2^2}{lm} + \frac{S_3^2}{lmn}$$
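A numerical sketch of why the later stages can usually be ignored, based on the approximation V ≅ S_1^2/l + S_2^2/(lm) + S_3^2/(lmn). Here l, m and n are read as the numbers of units sampled at the first, second and third stages, and S_1^2, S_2^2, S_3^2 as the corresponding variance components. That reading, the function name and all numbers below are assumptions for illustration only.

```python
def three_stage_variance_terms(s1_sq, s2_sq, s3_sq, l, m, n):
    """Return the three terms of V ~ S1^2/l + S2^2/(l*m) + S3^2/(l*m*n)."""
    first = s1_sq / l
    second = s2_sq / (l * m)
    third = s3_sq / (l * m * n)
    return first, second, third

# Hypothetical values: equal variance components at every stage,
# 20 first-stage units, 5 second-stage units each, 10 third-stage units each.
first, second, third = three_stage_variance_terms(4.0, 4.0, 4.0, l=20, m=5, n=10)
total = first + second + third

for label, term in (("first", first), ("second", second), ("third", third)):
    print(f"{label:>6} stage term: {term:.4f} ({term / total:.1%} of V)")
```

Even with equal variance components, the divisors lm and lmn shrink the later terms quickly, so the first-stage term carries most of the variance in this example.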

Typical Models used in Analysis

Ordinary Regression

$$E(\mathbf{y}) = \mathbf{X}\boldsymbol{\beta}$$

Logistic Regression

$$\ln\left(\frac{\pi}{1-\pi}\right) = \mathbf{X}\boldsymbol{\beta}$$

General Consequences of Using the Sampling Weights but Ignoring the Sampling Design

Inferences are usually based on the quadratic form

$$(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})^{T}\, \hat{\mathbf{V}}^{-1}\, (\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})$$

• V is the variance-covariance matrix of the regression parameter estimates
• ignoring the survey design leads to estimates of V that are too small
• therefore estimates of V^{-1} are larger than they should be
• leads to test statistics that are larger than they should be (you find a significant result more often than you should)
• leads to confidence interval statements that are narrower than they should be
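A small sketch of the quadratic form (β̂ − β)ᵀ V̂⁻¹ (β̂ − β) for two parameters, showing how an understated V̂ inflates the statistic. All numbers are hypothetical, and the factor of 4 by which the "naive" matrix understates the variance is chosen purely for illustration.

```python
def invert_2x2(matrix):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = matrix
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]

def quadratic_form(beta_hat, beta_null, v_hat):
    """Compute (beta_hat - beta_null)^T V^{-1} (beta_hat - beta_null)."""
    diff = [bh - b0 for bh, b0 in zip(beta_hat, beta_null)]
    v_inv = invert_2x2(v_hat)
    return sum(diff[i] * v_inv[i][j] * diff[j]
               for i in range(2) for j in range(2))

beta_hat = [0.5, 0.3]    # hypothetical estimates
beta_null = [0.0, 0.0]   # hypothesized values

# Hypothetical design-based covariance matrix, and a naive one that
# understates every entry by a factor of 4 (illustrative assumption).
v_design = [[0.04, 0.01], [0.01, 0.09]]
v_naive = [[entry / 4 for entry in row] for row in v_design]

w_design = quadratic_form(beta_hat, beta_null, v_design)
w_naive = quadratic_form(beta_hat, beta_null, v_naive)

print(f"statistic with design-based V: {w_design:.2f}")
print(f"statistic with understated V:  {w_naive:.2f}")
```

Shrinking V̂ by a constant factor multiplies the statistic by the same factor, which is how an ignored design turns into spuriously significant results.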

Regression Example: Ontario Health Survey

[Bubble plot: DBMI versus BMI (binned); BMI axis 12 to 42, DBMI axis 15 to 30]

• size of the circle is related to the sum of the survey weights in the estimate
• more data in the BMI range 17 to 29 approximately

Ontario Health Survey

Regress desired body mass index (DBMI) on body mass index (BMI)

                      STATA    Unweighted   Weighted
Intercept  Estimate   10.877       11.196     10.877
           S.E.        0.141        0.064      0.065
Slope      Estimate   0.4958       0.4716     0.4858
           S.E.       0.0058       0.0025     0.0026
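The slope standard errors in the table can be turned into a rough measure of how much the variance is understated. The sketch assumes that the STATA column is the analysis that accounts for both the weights and the design, while the Weighted column uses the weights but ignores the design; the table itself does not say so. The 1.96 multiplier is the usual normal approximation for a 95% interval.

```python
# Slope standard errors copied from the table
slope_se_design = 0.0058     # "STATA" column (assumed design-based)
slope_se_weighted = 0.0026   # "Weighted" column (assumed to ignore the design)

# Ratio of variances: how many times larger the design-based variance is
variance_ratio = (slope_se_design / slope_se_weighted) ** 2

# Half-widths of approximate 95% confidence intervals for the slope
half_width_design = 1.96 * slope_se_design
half_width_weighted = 1.96 * slope_se_weighted

print(f"variance ratio: {variance_ratio:.2f}")
print(f"95% CI half-width, design-based: {half_width_design:.4f}")
print(f"95% CI half-width, weights only: {half_width_weighted:.4f}")
```

Under that reading, the weights-only interval is less than half as wide as the design-based one.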

Youth Risk Behavior Survey

Recall
• sampling design is a stratified three-stage cluster sample
• need only to give the strata and first stage unit identifiers to do the analysis with available software
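One way to see why the strata and first-stage identifiers are enough: a widely used variance estimator for a weighted total works only with weighted totals per first-stage unit (PSU) within each stratum. The sketch below assumes PSUs are treated as if sampled with replacement. The function name and the eight records are hypothetical, and this is not a description of what any particular package does internally.

```python
from collections import defaultdict

def total_and_variance(records):
    """Weighted total and its variance from (stratum, psu, weight, y) records.

    Uses the between-PSU variability within strata:
    v = sum_h n_h/(n_h - 1) * sum_i (z_hi - zbar_h)^2,
    where z_hi is the weighted total of PSU i in stratum h.
    """
    psu_totals = defaultdict(lambda: defaultdict(float))
    for stratum, psu, weight, y in records:
        psu_totals[stratum][psu] += weight * y

    estimate = 0.0
    variance = 0.0
    for stratum, totals in psu_totals.items():
        z = list(totals.values())
        n_h = len(z)
        if n_h < 2:
            raise ValueError(f"stratum {stratum} needs at least two PSUs")
        z_bar = sum(z) / n_h
        estimate += sum(z)
        variance += n_h / (n_h - 1) * sum((v - z_bar) ** 2 for v in z)
    return estimate, variance

# Hypothetical records: (stratum, psu, final weight, 0/1 outcome)
records = [
    (1, "a", 10, 1), (1, "a", 10, 0), (1, "b", 10, 1), (1, "b", 10, 1),
    (2, "c", 14, 0), (2, "c", 14, 1), (2, "d", 14, 0), (2, "d", 14, 0),
]

estimate, variance = total_and_variance(records)
print(f"estimated total = {estimate}, variance = {variance}, "
      f"s.e. = {variance ** 0.5:.2f}")
```

Nothing below the PSU level enters the calculation except through the PSU totals, so identifiers for later sampling stages are not needed.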

Demo in SPSS

Log into MyVlab

Youth Risk Behavior Survey

Data File in an SPSS sav file

Find the survey design variables in the file

Prepare for Analysis by Specifying the Design

Specifying the design variables

Design options to choose from

Logistic Regression Analysis

Other Approaches

• The estimate of the variance of the regression parameters is obtained using a technique called Taylor linearization
  – the cluster identifiers are needed to carry out this procedure
  – due to privacy constraints StatCan will provide this information only through an RDC
  – you need to apply to get into an RDC
• Alternate approach: the bootstrap
  – different approach to the bootstrap for complex surveys than iid data sets
  – data file consists of several sets of bootstrap weights
  – calculate the estimates for each set of bootstrap weights and look at the variability in the estimates
  – can be done using SAS macros
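The bootstrap-weight idea can be sketched in a few lines: recompute the estimate once per set of bootstrap weights and summarize the spread. The data and the four weight sets below are made up (real files carry many more sets), and averaging squared deviations around the full-sample estimate is one common convention assumed here, not something stated on the slides. Python is used purely for illustration; the slides mention SAS macros.

```python
def weighted_mean(values, weights):
    """Weighted mean of values."""
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)

def bootstrap_variance(values, final_weights, bootstrap_weight_sets):
    """Full-sample estimate and its bootstrap variance.

    The variance is the average squared deviation of the replicate
    estimates from the full-sample estimate (assumed convention).
    """
    full = weighted_mean(values, final_weights)
    replicates = [weighted_mean(values, bw) for bw in bootstrap_weight_sets]
    variance = sum((r - full) ** 2 for r in replicates) / len(replicates)
    return full, replicates, variance

# Hypothetical 0/1 outcome with final weights and four bootstrap weight sets
values = [1, 0, 1, 1]
final_weights = [10, 10, 14, 14]
bootstrap_weight_sets = [
    [20, 0, 14, 14],
    [0, 20, 14, 14],
    [10, 10, 28, 0],
    [10, 10, 0, 28],
]

full, replicates, variance = bootstrap_variance(
    values, final_weights, bootstrap_weight_sets
)
print(f"full-sample estimate: {full:.4f}")
print("replicate estimates: ", [round(r, 4) for r in replicates])
print(f"bootstrap s.e.:       {variance ** 0.5:.4f}")
```

Because the design information is already built into the bootstrap weights, the analyst does not need the cluster identifiers to obtain a variance estimate this way.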
