survey data analysis Fall2014 50th.ppt - UIC

Introduction to Survey Data Analysis

Linda K. Owens, PhD

Assistant Director for Sampling & Analysis

General information

• Please hold questions until the end of the presentation

• Slides available at www.srl.uic.edu/SEMINARS/Fall14Seminars.htm

• Please raise your hand so that I can see that you can hear me

Focus of the webinar

• Define survey data

• Data setup

• Data cleaning/missing data

• Weighting data

What is survey data?

• Data gathered from a sample of individuals

• Sample is random (drawn using probabilistic methods)

• Each member of the population must have a known, nonzero chance of selection

• Chance of selection does not have to be equal for all members

• Issues/techniques covered in this webinar assume random sampling

When analyzing survey data...

1. Understand & evaluate survey design

2. Screen the data

3. Adjust for sample design

1. Understand & evaluate survey

• Conductor of survey

• Sponsor of survey

• Measured variables

• Unit of analysis

• Mode of data collection

• Dates of data collection

1. Understand & evaluate survey

• Geographic coverage

• Respondent eligibility criteria

• Sample design

• Sample size & response rate

Levels of measurement

• Nominal

• Ordinal

• Interval

• Ratio

2. Data screening

• ALWAYS examine raw frequency distributions to:

a) find out-of-range values (outliers)

b) find missing values

c) check frequencies against the codebook

d) for Web surveys, compare data collection output to data analysis output (e.g., SurveyMonkey compared to SPSS)
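
The screening steps above can be sketched in a few lines of Python. This is only a minimal illustration: the variable (`ages`), the data, and the codebook range of 18 to 99 are all assumptions made up for the example, and `None` is assumed to mark a missing value.

```python
from collections import Counter

# Hypothetical raw responses for one variable; None marks a missing value.
ages = [34, 51, 27, None, 999, 45, 17, 51, None, 62]

# Assumed codebook range for this variable (illustrative only).
VALID_MIN, VALID_MAX = 18, 99

def screen(values, lo, hi):
    """Return the raw frequency table, the missing count, and out-of-range values."""
    freq = Counter(values)                       # raw frequency distribution
    n_missing = freq.get(None, 0)                # item-missing count
    out_of_range = sorted(v for v in values
                          if v is not None and not lo <= v <= hi)
    return freq, n_missing, out_of_range

freq, n_missing, out_of_range = screen(ages, VALID_MIN, VALID_MAX)
print("missing:", n_missing)
print("out of range:", out_of_range)
```

The frequency table in `freq` is what would be compared against the codebook, value by value.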

2. Data screening

• Out-of-range values:

  • Delete data

  • Recode values
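
The two options can be contrasted with a short Python sketch. The records and the valid range are hypothetical, and recoding to missing (`None`) is only one possible recode; the slide does not say which recode to use.

```python
# Hypothetical records: (respondent id, age). 999 and 17 fall outside the range.
records = [(1, 34), (2, 999), (3, 45), (4, 17)]
VALID_MIN, VALID_MAX = 18, 99   # assumed codebook range

def in_range(age):
    return VALID_MIN <= age <= VALID_MAX

# Option 1: delete records whose value is out of range.
deleted = [(rid, age) for rid, age in records if in_range(age)]

# Option 2: keep every record but recode the bad value to missing (None).
recoded = [(rid, age if in_range(age) else None) for rid, age in records]
```

Deleting shrinks the sample, while recoding keeps the case available for analyses that do not use the affected variable.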

Missing data:

• Can be unit missing (the whole case) or item missing (individual questions)

• Can reduce effective sample size

• May introduce bias

Reasons for missing data

• Refusals (question sensitivity)

• "Don't know" responses (cognitive problems, memory problems)

• Not applicable (skip patterns)

• Data processing errors

• Questionnaire programming errors

• Design factors

• Attrition in panel studies

Effects of ignoring missing data

• Reduced sample size: loss of statistical power

• Data may no longer be representative: introduces bias

• Difficult to identify effects

Assumptions on missing data

• Missing completely at random (MCAR)

• Missing at random (MAR)

• Ignorable

• Nonignorable

• The type of missing data determines whether it is ignorable or nonignorable

Missing completely at random (MCAR)

• Missingness is independent of every variable.

• Cases with complete data are indistinguishable from cases with missing data.

• Missing cases are a random subsample of the original sample.

Missing at random (MAR)

• The probability that a variable is observed is independent of the true value of that variable, after controlling for one or more other variables.

• Example: Within levels of education, the probability that income is missing is unrelated to income.
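
The MCAR and MAR definitions can be written compactly. The notation below is an assumption added for illustration and does not appear in the slides: Y is the variable subject to missingness (income in the example), X is the fully observed control variable (education), and R = 1 when Y is observed.

```latex
\begin{align*}
\text{MCAR:}\quad & P(R = 1 \mid Y, X) = P(R = 1) \\
\text{MAR:}\quad  & P(R = 1 \mid Y, X) = P(R = 1 \mid X)
\end{align*}
```

Under MCAR, the chance of observing Y depends on nothing; under MAR, it may depend on X but not on Y once X is held fixed.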

Ignorable missing data

• The data are MAR.

• The missing data mechanism is unrelated to the parameters we want to estimate.

Nonignorable missing data

• The pattern of missingness is not MAR (the data are missing not at random).

• Missingness is related to either the data collection process or the questionnaire content.

Methods of handling missing data

• Listwise (casewise) deletion: uses only complete cases

• Pairwise deletion: uses all available cases

• Dummy variable adjustment: missing value indicator method

• Mean substitution: substitute mean value computed from available cases (cf. unconditional or conditional)
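
Three of the methods above can be compared on a toy data set. This is a minimal sketch with made-up data: each row is a hypothetical (income, age) pair, and `None` is assumed to mark a missing value.

```python
from statistics import mean

# Hypothetical data: each row is (income, age); None marks a missing value.
rows = [(40, 30), (None, 40), (80, None), (90, 50)]

# Listwise deletion: keep only rows with no missing values at all.
complete = [r for r in rows if None not in r]
listwise_income_mean = mean(r[0] for r in complete)

# Pairwise (available-case) deletion: each statistic uses every row that has
# the variable it needs, so different statistics can rest on different cases.
pairwise_income_mean = mean(r[0] for r in rows if r[0] is not None)

# Unconditional mean substitution: fill missing income with the
# available-case mean of income.
filled_income = [r[0] if r[0] is not None else pairwise_income_mean
                 for r in rows]
```

Here listwise deletion discards the third respondent's income because that respondent's age is missing, which is why the two means differ.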

• Regression methods: predict value based on regression equation with other variables as predictors

• Hot deck: identify the case most similar to the case with a missing value and impute that case's value
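
A hot deck can be sketched as a nearest-donor lookup. The slides do not specify how "most similar" is measured, so the distance function below (absolute differences on age and education) is an assumption, as are the data and field names.

```python
# Hypothetical respondents: age and education are fully observed,
# income may be missing (None).
people = [
    {"id": 1, "age": 30, "educ": 12, "income": 40},
    {"id": 2, "age": 32, "educ": 12, "income": None},
    {"id": 3, "age": 55, "educ": 16, "income": 90},
    {"id": 4, "age": 58, "educ": 16, "income": None},
]

def distance(a, b):
    """Assumed similarity measure: absolute differences on age and education."""
    return abs(a["age"] - b["age"]) + abs(a["educ"] - b["educ"])

def hot_deck(data):
    """Fill each missing income with the income of the most similar donor."""
    donors = [p for p in data if p["income"] is not None]
    imputed = []
    for p in data:
        if p["income"] is None:
            donor = min(donors, key=lambda d: distance(p, d))
            # Copy the record and record which donor supplied the value.
            p = {**p, "income": donor["income"], "donor": donor["id"]}
        imputed.append(p)
    return imputed

result = hot_deck(people)
```

Keeping the donor id alongside the imputed value makes it possible to flag imputed cases later.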

• Maximum likelihood (ML) methods: use all available data to generate maximum likelihood-based statistics

• Multiple imputation: builds on ML methods to produce multiple data sets, each with imputed values for the missing cases
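
The slide does not describe how results from the multiple data sets are combined. The standard approach is Rubin's rules, sketched below; the estimates and variances are hypothetical numbers standing in for the results of analyzing each imputed data set separately.

```python
from statistics import mean, variance

def pool(estimates, variances):
    """Combine m completed-data results with Rubin's rules."""
    m = len(estimates)
    q_bar = mean(estimates)        # pooled point estimate
    w = mean(variances)            # within-imputation variance
    b = variance(estimates)        # between-imputation variance (m - 1 divisor)
    t = w + (1 + 1 / m) * b        # total variance
    return q_bar, t

# Hypothetical results from m = 3 imputed data sets.
est = [10.0, 12.0, 14.0]
var = [4.0, 4.0, 4.0]
q_bar, total_var = pool(est, var)
```

The between-imputation term is what carries the uncertainty due to the missing values into the final standard error.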

Multiple imputation software

• SAS: user-written IVEware, experimental MI & MIANALYZE

• STATA: user-written ICE

• R: user-written libraries and functions

• SOLAS

• S-PLUS

Methods of handling missing data: Summary

• Listwise deletion assumes MCAR or MAR

• Dummy variable adjustment is biased; don't use

• Conditional mean substitution assumes MCAR within cells

• Hot deck & regression are improvements on mean substitution but...

• Multiple imputation is least biased but computationally intensive

Missing data final point

• All methods of imputation underestimate the true variance:

  • artificial increase in sample size

  • values treated as though obtained by data collection

• Rao (1996). On variance estimation with imputed survey data. Journal of the American Statistical Association, 91, 499-506.

• Fay (1996) in bibliography
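
The variance problem is easy to see with mean substitution, used here only because it is the simplest case. The income values are hypothetical and `None` is assumed to mark a missing value.

```python
from statistics import mean, variance

# Hypothetical incomes; None marks a missing value.
income = [20, 40, None, 60, None, 80]

observed = [x for x in income if x is not None]
fill = mean(observed)
imputed = [x if x is not None else fill for x in income]

var_observed = variance(observed)   # based on the 4 real responses
var_imputed = variance(imputed)     # treats all 6 values as if collected
```

The imputed values add nothing to the sum of squared deviations, yet the divisor grows from 3 to 5, so the estimated variance falls even though no new information was collected.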

Types of survey sample designs

• Simple random sampling

• Systematic sampling

• Complex sample designs

  • stratified designs

  • cluster designs

  • mixed mode designs

Why complex sample designs?

• Increased efficiency

• Decreased costs

Why complex sample designs?

• Statistical software packages that assume SRS underestimate the sampling variance.

• Not accounting for the complex sample design can bias the estimate of the sampling variance, which inflates the Type I error rate.
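
One common way to quantify this for cluster samples is the design effect. The slides do not present it, so the sketch below is an added illustration using the Kish approximation for equal-sized clusters; the cluster size, intraclass correlation, and sample size are hypothetical.

```python
def design_effect(cluster_size, icc):
    """Kish approximation for equal-sized clusters: deff = 1 + (m - 1) * rho."""
    return 1 + (cluster_size - 1) * icc

def effective_sample_size(n, deff):
    """Size of the simple random sample that would give the same variance."""
    return n / deff

# Hypothetical design: 1,000 respondents in clusters of 21,
# with an intraclass correlation of 0.05.
deff = design_effect(21, 0.05)
n_eff = effective_sample_size(1000, deff)
```

With these numbers the design effect is about 2, so software that assumes SRS would report a sampling variance roughly half the size it should be.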

Sample weights

• Used to adjust for differing probabilities of selection.

• In theory, simple random samples are self-weighted.

• In practice, simple random samples are likely to also require adjustments for nonresponse.

Types of sample weights

• Selection weights: adjust for differences in probability of selection.

• Poststratification weights: designed to bring the sample proportions in demographic subgroups into agreement with the population proportions in those subgroups.

• Nonresponse weights: designed to inflate the weights of survey respondents to compensate for nonrespondents with similar characteristics.

• "Blow-up" (expansion) weights: provide estimates for the total population of interest.
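
Three of these weight types can be illustrated on a tiny sample. Everything here is hypothetical: the groups, the selection probabilities, the measured values, and the population shares used for poststratification.

```python
# Hypothetical sample: group, probability of selection, and a measured value.
sample = [
    {"group": "A", "p_select": 0.10, "y": 10},
    {"group": "A", "p_select": 0.10, "y": 20},
    {"group": "B", "p_select": 0.02, "y": 40},
]

# Selection weight: the inverse of the probability of selection.
for r in sample:
    r["w"] = 1 / r["p_select"]

# Expansion: the weights sum to an estimate of the population size.
est_pop = sum(r["w"] for r in sample)

# Weighted versus unweighted mean of y.
weighted_mean = sum(r["w"] * r["y"] for r in sample) / est_pop
unweighted_mean = sum(r["y"] for r in sample) / len(sample)

# Poststratification: scale the weights so the weighted group shares match
# the assumed (hypothetical) population shares.
pop_share = {"A": 0.5, "B": 0.5}
group_w = {g: sum(r["w"] for r in sample if r["group"] == g) for g in pop_share}
for r in sample:
    r["w_ps"] = r["w"] * pop_share[r["group"]] * est_pop / group_w[r["group"]]
```

Group B was sampled at a lower rate, so each of its respondents stands in for more people and pulls the weighted mean above the unweighted one.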

Syntax examples of design-based analysis in STATA, SUDAAN, & SAS

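STATA

* Declare the design: PSU, probability weight, and strata (Stata 9 and later syntax)
svyset psu [pweight=finalwt], strata(strata)

* Design-based linear regression
svy: regress fatintk age male black hispanic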

SUDAAN

proc regress data="c:\nhanes.sav" filetype=spss design=wr;
nest strata psu;
weight finalwt;
subgroup sex race;
levels 2 3;
model fatintk = age sex race;

Syntax examples of design-based analysis in STATA, SUDAAN, & SAS

SAS

proc surveyreg data=nhanes;
strata strata;
cluster psu;
class sex race;
model fatintk = age sex race;
weight finalwt;
run;
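
To show the kind of calculation a design-based procedure performs, here is a minimal Python sketch. It estimates a weighted mean rather than a regression, and it assumes PSUs are sampled with replacement within strata with at least two PSUs per stratum. The function name `svy_mean` and the data are hypothetical, and this is not the code any of the packages above actually run.

```python
from collections import defaultdict
from math import sqrt

def svy_mean(rows):
    """Weighted mean and Taylor-linearized SE for a stratified cluster sample.

    rows: iterable of (stratum, psu, weight, y) tuples.
    """
    rows = list(rows)
    w_sum = sum(w for _, _, w, _ in rows)
    y_bar = sum(w * y for _, _, w, y in rows) / w_sum

    # PSU totals of the linearized variable z = w * (y - y_bar) / w_sum.
    psu_total = defaultdict(float)
    for h, i, w, y in rows:
        psu_total[(h, i)] += w * (y - y_bar) / w_sum

    # Group the PSU totals by stratum.
    by_stratum = defaultdict(list)
    for (h, _), z in psu_total.items():
        by_stratum[h].append(z)

    # Between-PSU variance within each stratum, summed over strata.
    var = 0.0
    for zs in by_stratum.values():
        n_h = len(zs)
        z_mean = sum(zs) / n_h
        var += n_h / (n_h - 1) * sum((z - z_mean) ** 2 for z in zs)
    return y_bar, sqrt(var)

# Hypothetical data: (stratum, psu, weight, y).
data = [
    (1, 1, 10, 2.0), (1, 1, 10, 4.0), (1, 2, 10, 6.0), (1, 2, 10, 8.0),
    (2, 1, 20, 1.0), (2, 2, 20, 3.0),
]
y_bar, se = svy_mean(data)
```

The variance comes from differences between PSU totals within each stratum, not from differences between individual respondents, which is why the strata, PSU, and weight variables all have to be declared.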

In summary, when analyzing survey data...

• Understand & evaluate survey design.

• Screen the data: deal with missing data & outliers.

• If necessary, adjust for study design using weights and appropriate computer software.

Bibliography

• On Web site with slides

• Big names in area of missing data include the following:

  • Roderick Little

  • Donald B. Rubin

  • Paul Allison

  • See also McKnight et al., Missing Data: A Gentle Introduction

• Large government data sets (NSFG, NHANES, NHIS) often include detailed methodological documentation on imputation and weight construction.

Questions?

• Type questions in chat box.

Thank You!

www.srl.uic.edu
