The Analysis of Complex Surveys David Bellhouse

The Analysis of Complex Surveys - rdc.uwo.cardc.uwo.ca/events/docs/...slides/2009-10/Bellhouse-ComplexSurveys … · 12 21 8 10 172291 12 21 8 10 162292 12 ... unit identifiers to

Mar 07, 2018


The Analysis of Complex Surveys

David Bellhouse


A Typical Complex Design

Stratified two-stage cluster sampling

Strata
• geographical areas
First stage units
• smaller areas within the larger areas
Second stage units
• households
Clusters
• all individuals in the household


Why a Complex Design?

Classical reasons
• better cover of the entire region of interest (stratification)
• efficient for interviewing: less travel, less costly
A new reason
• to reduce cost the sample is piggybacked on a sample chosen by a complex design such as the Canadian Labour Force Survey

Problem: estimation and analysis are more complex


Examples of Complex Designs

• Ontario Health Survey
• Canadian Community Health Survey
• Youth in Transition Survey
• Youth Risk Behavior Survey (U.S.)


Ontario Health Survey

• carried out in 1990
• health status of the population was measured
• data were collected relating to the risk factors associated with major causes of morbidity and mortality in Ontario
• survey of 61,239 persons was carried out in a stratified two-stage cluster sample by Statistics Canada


OHS Sample Selection

• strata: public health units – divided into rural and urban strata

• first stage: enumeration areas defined by the 1986 Census of Canada and selected by pps

• second stage: dwellings selected by SRS

• cluster: all persons in the dwelling


Youth in Transition Survey

Reading cohort (15 year-olds): stratified two-stage sampling

• school population stratified by province, language of instruction and enrollment size

• 1,200 schools selected within strata
• eligible students selected within each sampled school
– the initial student sample size for the survey conducted in 2000 was 38,000


Canadian Community Health Survey

Piggybacked on the Canadian Labour Force Survey (LFS)

• LFS design
– stratified by province and economic regions
– geographical areas (usually enumeration areas) within strata chosen with probability proportional to the population size of the area
– dwellings chosen within geographical areas
– all persons in the dwelling interviewed


Youth Risk Behavior Survey

Stratified three-stage cluster sample
• strata are metropolitan statistical areas
• primary sampling units are large counties or groups of smaller counties
• second stage units are schools within counties
• third stage units are classes within schools
• all students within a class are interviewed


Basic Problem in Survey Data Analysis


Issues

iid (independent and identically distributed) assumption

• the assumption does not hold in complex surveys because of correlations induced by the sampling design or because of the population structure

• blindly applying standard programs to the analysis can lead to incorrect results


Two Simple Examples to Illustrate the Problems Involved in Analyzing “Complex” Samples

• an old Ontario lottery called Lottario that is similar to Lotto 6/49

• a pay equity lawsuit involving a stratified sampling design


Lottery Example

Lottario – old Ontario lottery, a Lotto 6/39
• seven numbers chosen on a draw night – six regular numbers and a bonus number
• winning numbers collected for 167 draws ending in January 1982
• want to test whether each of the 39 numbers (or balls) has the same chance of being chosen


Breakdown of Independence Assumption

• On a draw night, the numbers are chosen by simple random sampling without replacement
– numbers chosen within a draw are not independent
• Between draws the balls are replaced to be drawn again on the next draw
– numbers chosen between draws are independent
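
The dependence within a draw can be made concrete with a small calculation. The sketch below is an illustration added to this transcript, not part of the original slides. It treats one draw as a simple random sample of 7 balls from 39 without replacement and computes exactly the covariance and correlation between the indicators that two given balls are chosen. All variable names are assumptions made for the example.

```python
from fractions import Fraction

N_BALLS = 39   # balls in the drum (Lotto 6/39)
N_DRAWN = 7    # six regular numbers plus the bonus number

# Probability that one given ball is chosen on a draw night.
p_one = Fraction(N_DRAWN, N_BALLS)

# Probability that two given balls are both chosen (sampling without replacement).
p_both = Fraction(N_DRAWN * (N_DRAWN - 1), N_BALLS * (N_BALLS - 1))

# Covariance and correlation of the two inclusion indicators.
cov = p_both - p_one * p_one
var = p_one * (1 - p_one)
corr = cov / var

print("P(ball chosen) =", p_one)
print("covariance     =", cov, "=", float(cov))
print("correlation    =", corr)
```

Under independence the covariance would be zero. Here it is negative, because choosing one ball leaves fewer places in the draw for any other ball.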


Lottario Draws up to January 1982

[Figure: bar chart of frequency (vertical axis, 0 to 50) by ball number (horizontal axis, 1 to 39)]


Chi-square Test for Equality of Proportions

• Ignoring the lack of independence
– test statistic = 48.674
– p-value = 0.115
• Taking into account the lack of independence
– test statistic = (38/32)*48.674 = 57.8
– p-value = 0.021
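
The arithmetic on this slide can be checked with a few lines of Python. This is a sketch added to the transcript under two assumptions that the slides do not state: that the reference distribution is chi-square with 38 degrees of freedom (39 balls minus one), and that the factor 38/32 is (N − 1)/(N − n) with N = 39 balls and n = 7 balls per draw. The helper `chi2_upper_tail_even_df` is a name invented for this example; it uses the closed-form tail that holds when the degrees of freedom are even.

```python
import math

def chi2_upper_tail_even_df(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with even df.

    For df = 2k the tail equals exp(-x/2) * sum_{i=0}^{k-1} (x/2)**i / i!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this helper only handles a positive, even df")
    half = x / 2.0
    term = 1.0      # the i = 0 term of the sum
    total = 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total

N_BALLS = 39
N_DRAWN = 7
DF = N_BALLS - 1            # assumed: 38 degrees of freedom for 39 categories
naive_stat = 48.674         # statistic reported on the slide

# Assumed reading of the slide's factor 38/32: (N - 1) / (N - n).
correction = (N_BALLS - 1) / (N_BALLS - N_DRAWN)
adjusted_stat = correction * naive_stat

p_naive = chi2_upper_tail_even_df(naive_stat, DF)
p_adjusted = chi2_upper_tail_even_df(adjusted_stat, DF)

print(f"naive:    statistic = {naive_stat:.3f}, p-value = {p_naive:.3f}")
print(f"adjusted: statistic = {adjusted_stat:.1f}, p-value = {p_adjusted:.3f}")
```

The point of the comparison is that the same data lead to a different conclusion at the 5% level once the dependence within a draw is taken into account.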


Pay Equity Example

Pay equity survey dispute: Canada Post and PSAC

• two job evaluations on the same set of people (and same set of information) carried out in 1987 and 1993

• rank correlation between the two sets of job values obtained through the evaluations was 0.539

• assumption to obtain a valid estimate of correlation: pairs of observations are iid


Scatterplot of Evaluations

• Rank correlation is 0.539

[Figure: scatterplot of rank in 1993 (vertical axis, 0 to 200) versus rank in 1987 (horizontal axis, 0 to 200)]


A Stratified Design with Distinct Differences Between Strata

• the pay level increases with each pay category (four in number)

• the job value also generally increases with each pay category

• therefore the observations are not iid


Scatterplot by Pay Category

[Figure: scatterplot of rank in 1993 versus rank in 1987, with points marked by pay category (2, 3, 4, 5)]


Correlations within Level

Correlations within each pay level
• Level 2: –0.293
• Level 3: –0.010
• Level 4: 0.317
• Level 5: 0.496
Only Level 4 is significantly different from 0
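
The gap between the overall correlation of 0.539 and the weak correlations within pay levels is what pooling strata with different centres tends to produce. The sketch below is an illustration added to this transcript and uses hypothetical data, not the pay equity data. It also uses the ordinary Pearson correlation rather than the rank correlation reported on the slides, purely to keep the code short.

```python
import math

def pearson(xs, ys):
    """Plain Pearson correlation; assumes both lists vary."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: four strata whose centres rise together, while the
# four points inside each stratum are arranged to have zero correlation.
offsets = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
strata = {
    level: [(10 * level + dx, 10 * level + dy) for dx, dy in offsets]
    for level in (2, 3, 4, 5)
}

within = {}
for level, points in strata.items():
    xs, ys = zip(*points)
    within[level] = pearson(xs, ys)

all_x = [x for points in strata.values() for x, _ in points]
all_y = [y for points in strata.values() for _, y in points]
pooled = pearson(all_x, all_y)

print("within-stratum correlations:", within)
print("pooled correlation:", round(pooled, 3))
```

In this toy setting every within-stratum correlation is zero, yet the pooled correlation is close to one, because it mostly measures the differences between strata.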


Available Software for Complex Survey Analysis

• commercial packages
– STATA
– SAS
– SPSS
– SUDAAN
• noncommercial package
– R


Typical Survey Data File

stratum  psu  initwt  finalwt  age  race  ethnicty  educ  sex
      7   20       8       10   14     2         2     7    1
      7   20       8       10   13     2         2     7    1
      7   20       8       10   12     2         2     5    2
      7   20       8       10   15     2         2     8    1
      7   20       8       10   14     2         1     7    1
      7   20       8       14   14     1         2     9    1
      7   20       8       14   16     1         2     9    2
     12   21       8       10   17     2         2     9    1
     12   21       8       10   16     2         2     9    2
     12   21       8       10   14     2         9     8    1
     12   21       8       10   16     2         2     9    1
     12   21       8       10   16     2         2    10    1
     12   21       8       10   16     2         2     9    1
     12   21       8       10   18     2         2    11    1
     12   21       8       14   17     1         2    11    1
     12   21       8       14   17     1         2    11    1
     12   21       8       10   16     2         2     9    1
     12   21       8       14   17     1         2    10    1
     12   21       8       10   16     2         2     9    1
     12   21       8       10   13     2         2     7    1
     12   21       8       10   17     2         2    10    1


Survey Weights: Definitions

initial weight
• equal to the inverse of the inclusion probability of the unit

final weight

• initial weight adjusted for nonresponse, poststratification and/or benchmarking

• interpreted as the number of units in the population that the sample unit represents


Interpretation

• the survey weight for a particular sample unit is the number of units in the population that the unit represents

[Figure legend: Not sampled, Wt = 2, Wt = 5, Wt = 6, Wt = 7]
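
As a small illustration of this interpretation, the sketch below (added to this transcript, with hypothetical inclusion probabilities and measurements) builds initial weights as inverse inclusion probabilities and uses them to estimate a population size, total and mean. Nonresponse and poststratification adjustments are not modelled here.

```python
from fractions import Fraction

# Hypothetical inclusion probabilities for four sampled units.
inclusion_probs = [Fraction(1, 2), Fraction(1, 5), Fraction(1, 6), Fraction(1, 7)]

# Initial weight = inverse of the inclusion probability.
weights = [1 / p for p in inclusion_probs]

# Hypothetical measurements on the sampled units.
y = [3, 1, 4, 2]

# Each sampled unit stands in for `weight` population units, so the
# weights sum to an estimate of the population size, and the weighted
# sum of y estimates the population total of y.
est_population_size = sum(weights)
est_total = sum(w * v for w, v in zip(weights, y))
est_mean = est_total / est_population_size

print("weights:", [int(w) for w in weights])
print("estimated population size:", est_population_size)
print("estimated total of y:", est_total)
print("estimated mean of y:", float(est_mean))
```

Exact fractions are used only so that the weights come out as whole numbers; ordinary floating point would serve equally well in practice.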


Some Consequences of Ignoring the Weights: Survey of Youth in Custody

• first U.S. survey of youths confined to long-term, state-operated institutions

• complemented existing Children in Custody censuses.

• companion survey to the Surveys of State Prisons

• the data contain information on criminal histories, family situations, drug and alcohol use, and peer group activities

• survey carried out in 1989 using stratified systematic sampling


SYC Design

strata
• type (a) groups of smaller institutions
• type (b) individual larger institutions

sampling units
• strata type (a)
– first stage – institution by probability proportional to size of the institution
– second stage – individual youths in custody
• strata type (b)
– individual youths in custody
– individuals chosen by systematic random sampling
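
Systematic random sampling, as mentioned for the individual youths, can be sketched as follows. This is a generic illustration added to this transcript, not the procedure actually used in the Survey of Youth in Custody. The roster, the function name `systematic_sample` and the seed are all hypothetical, and the sketch assumes the list length is a whole multiple of the sample size.

```python
import random

def systematic_sample(population, n, rng):
    """Select n units by systematic sampling with a random start.

    Assumes len(population) is a multiple of n, so the sampling
    interval is a whole number; real surveys need a rule for the
    general case.
    """
    size = len(population)
    if n <= 0 or size % n != 0:
        raise ValueError("this sketch needs len(population) to be a multiple of n")
    interval = size // n
    start = rng.randrange(interval)      # random start in 0..interval-1
    return [population[start + k * interval] for k in range(n)]

# Hypothetical list of 20 youths in one institution, sample of 5.
roster = [f"youth_{i:02d}" for i in range(1, 21)]
sample = systematic_sample(roster, 5, random.Random(2024))
print(sample)
```

Only the starting point is random; once it is chosen, every fourth name on the list is taken.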


Effect of the Weights

• Example: age distribution, Survey of Youth in Custody

Age      Counts   Sum of Weights
11            1               28
12            9              149
13           53              764
14          167             2143
15          372             3933
16          622             5983
17          634             5189
18          334             2778
19          196             1763
20          122             1164
21           57              567
22           27              273
23           14              150
24           13              128
Totals     2621            25012
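
The effect of the weights on the estimated age distribution can be computed directly from the counts and sums of weights in this table. The sketch below is an addition to this transcript; the figures are copied from the table and the variable names are assumptions.

```python
# Age, sample count, and sum of survey weights, copied from the table.
rows = [
    (11, 1, 28), (12, 9, 149), (13, 53, 764), (14, 167, 2143),
    (15, 372, 3933), (16, 622, 5983), (17, 634, 5189), (18, 334, 2778),
    (19, 196, 1763), (20, 122, 1164), (21, 57, 567), (22, 27, 273),
    (23, 14, 150), (24, 13, 128),
]

total_count = sum(count for _, count, _ in rows)
total_weight = sum(weight for _, _, weight in rows)

print(f"{'Age':>3} {'Unweighted':>10} {'Weighted':>8}")
for age, count, weight in rows:
    unweighted = count / total_count      # proportion of the sample
    weighted = weight / total_weight      # estimated population proportion
    print(f"{age:>3} {unweighted:>10.3f} {weighted:>8.3f}")
```

For example, age 17 accounts for 634/2621, about 24.2% of the sample, but 5189/25012, about 20.7% of the weighted total.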


Unweighted Histogram

Age Distribution of Youth in Custody

[Figure: histogram of proportion (vertical axis, 0 to 0.3) by age (horizontal axis, 11 to 24)]


Weighted Histogram

Age Distribution of Youth in Custody

[Figure: histogram of proportion (vertical axis, 0 to 0.3) by age (horizontal axis, 11 to 24)]


Weighted versus Unweighted

Weighted and Unweighted Histograms

[Figure: proportion (vertical axis, 0 to 0.3) by age (horizontal axis, 11 to 24), with separate weighted and unweighted series]


General Approach to Analysis with Standard Software

• the software usually handles stratified two-stage cluster samples

• if there are more than two stages of sampling the latter stages are usually ignored in the analysis

Reason

$V \cong \frac{S_1^2}{l} + \frac{S_2^2}{lm} + \frac{S_3^2}{lmn}$
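
A numerical reading of this approximation, V ≅ S1²/l + S2²/(lm) + S3²/(lmn), may help. The slide does not define its symbols, so the sketch below (added to this transcript) assumes that l, m and n are the sample sizes at the first, second and third stages and that S1², S2² and S3² are the corresponding variance components. All numbers are hypothetical.

```python
def three_stage_variance(s1_sq, s2_sq, s3_sq, l, m, n):
    """Return the three terms of V ~ S1^2/l + S2^2/(l*m) + S3^2/(l*m*n)."""
    return (s1_sq / l, s2_sq / (l * m), s3_sq / (l * m * n))

# Hypothetical values: equal variance components at every stage,
# 30 first-stage units, 10 second-stage units per first-stage unit,
# 20 third-stage units per second-stage unit.
terms = three_stage_variance(1.0, 1.0, 1.0, l=30, m=10, n=20)
total = sum(terms)

for stage, term in enumerate(terms, start=1):
    print(f"stage {stage}: term = {term:.6f}, share of V = {term / total:.1%}")
```

Under these assumed values the first-stage term carries most of the variance, which is consistent with the practice of ignoring the later stages in the analysis.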


Typical Models used in Analysis

Ordinary Regression

$\mathrm{E}(y) = X\beta$

Logistic Regression

$\ln\left(\frac{\pi}{1-\pi}\right) = X\beta$


General Consequences of Using the Sampling Weights but Ignoring the Sampling Design

Inferences are usually based on the quadratic form

$(\hat{\beta} - \beta)^{T} \hat{V}^{-1} (\hat{\beta} - \beta)$

• V is the variance-covariance matrix of the regression parameter estimates
• ignoring the survey design leads to estimates of V that are too small
• therefore estimates of $V^{-1}$ are larger than they should be
• leads to test statistics that are larger than they should be (you find a significant result more often than you should)
• leads to confidence interval statements that are narrower than they should be
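
To see how an underestimated V inflates the quadratic form, the sketch below evaluates it for a two-parameter case with two versions of V. It is an illustration added to this transcript: every number is hypothetical, and the factor of 4 between the two matrices is chosen only for the example.

```python
def quadratic_form_2x2(diff, v):
    """Return diff^T V^{-1} diff for a 2-vector diff and a 2x2 matrix V."""
    (a, b), (c, d) = v
    det = a * d - b * c
    if det == 0:
        raise ValueError("V must be invertible")
    # Inverse of a 2x2 matrix: (1/det) * [[d, -b], [-c, a]].
    inv = ((d / det, -b / det), (-c / det, a / det))
    x, y = diff
    return x * (inv[0][0] * x + inv[0][1] * y) + y * (inv[1][0] * x + inv[1][1] * y)

# Hypothetical numbers, chosen only for illustration.
diff = (0.30, -0.10)                          # beta_hat - beta
v_design = ((0.040, 0.004), (0.004, 0.010))   # design-based estimate of V
v_naive = tuple(tuple(e / 4 for e in row) for row in v_design)  # 4 times too small

stat_design = quadratic_form_2x2(diff, v_design)
stat_naive = quadratic_form_2x2(diff, v_naive)

print(f"design-based statistic:    {stat_design:.2f}")
print(f"design-ignoring statistic: {stat_naive:.2f}")
```

Shrinking V by a factor of 4 multiplies the statistic by 4, so a result that is unremarkable under the design-based variance can look highly significant when the design is ignored.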


Regression Example: Ontario Health Survey

[Figure: DBMI (vertical axis, 15 to 30) versus BMI (horizontal axis, 12 to 42), binned]

• size of the circle is related to the sum of the survey weights in the estimate
• more data in the BMI range 17 to 29 approximately


Ontario Health Survey

Regress desired body mass index (DBMI) on body mass index (BMI)

              STATA    Unweighted   Weighted
Intercept
  Estimate   10.877      11.196      10.877
  S.E.        0.141       0.064       0.065
Slope
  Estimate   0.4958      0.4716      0.4858
  S.E.       0.0058      0.0025      0.0026
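
The standard errors in this table give a rough sense of how much variance is lost by ignoring the design. The sketch below is an addition to this transcript and assumes that the STATA column is the design-based analysis and that the Weighted column uses the weights but ignores the design. The slides do not say this explicitly.

```python
# Standard errors copied from the table: assumed design-based (STATA column)
# versus weighted with the design ignored (Weighted column).
se = {
    "intercept": {"design": 0.141, "weighted_only": 0.065},
    "slope": {"design": 0.0058, "weighted_only": 0.0026},
}

variance_ratios = {}
for name, values in se.items():
    se_ratio = values["design"] / values["weighted_only"]
    variance_ratios[name] = se_ratio ** 2
    print(f"{name}: SE ratio = {se_ratio:.2f}, "
          f"variance ratio = {variance_ratios[name]:.2f}")
```

On that reading, both standard errors are a little more than twice as large in the design-based analysis, so the corresponding variances differ by a factor of nearly five.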


Youth Risk Behavior Survey

Recall
• sampling design is a stratified three-stage cluster sample
• need only to give the strata and first stage unit identifiers to do the analysis with available software

Demo in SPSS


Log into MyVlab


Youth Risk Behavior Survey: Data File in an SPSS sav file


Find the survey design variables in the file


Prepare for Analysis by Specifying the Design


Specifying the design variables


Design options to choose from


Logistic Regression Analysis


Other Approaches

• The estimate of the variance of the regression parameters is obtained using a technique called Taylor linearization
– the cluster identifiers are needed to carry out this procedure
– due to privacy constraints StatCan will provide this information only through an RDC
– you need to apply to get into an RDC
• Alternate approach – the bootstrap
– the bootstrap for complex surveys differs from the bootstrap for iid data sets
– data file consists of several sets of bootstrap weights
– calculate the estimates for each set of bootstrap weights and look at the variability in the estimates
– can be done using SAS macros
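
The bootstrap-weight recipe can be sketched in a few lines of Python. The slides mention SAS macros, so this is an illustration added to this transcript rather than the method they demonstrate. The data, the weights and the four sets of bootstrap weights are hypothetical, and the variance formula (mean squared deviation of the replicate estimates from the full-sample estimate) is one common convention. The documentation for a given survey file may specify another.

```python
def weighted_mean(values, weights):
    """Weighted mean of `values` using survey weights `weights`."""
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)

# Hypothetical toy file: five respondents, one final weight each,
# and four sets of bootstrap weights (real files carry many more sets).
y = [12.0, 15.0, 9.0, 20.0, 14.0]
final_weights = [10.0, 10.0, 14.0, 14.0, 10.0]
bootstrap_weights = [
    [20.0, 0.0, 14.0, 14.0, 10.0],
    [10.0, 10.0, 28.0, 0.0, 10.0],
    [0.0, 20.0, 14.0, 14.0, 10.0],
    [10.0, 10.0, 0.0, 28.0, 10.0],
]

# Point estimate from the final weights.
estimate = weighted_mean(y, final_weights)

# Re-compute the same estimate once per set of bootstrap weights.
replicates = [weighted_mean(y, bw) for bw in bootstrap_weights]

# Variability of the replicate estimates around the point estimate.
variance = sum((r - estimate) ** 2 for r in replicates) / len(replicates)
std_error = variance ** 0.5

print(f"estimate = {estimate:.3f}, bootstrap s.e. = {std_error:.3f}")
```

The same loop works for any estimator, such as a regression coefficient: only `weighted_mean` needs to be replaced by the estimator of interest.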