HOW TO DEAL WITH MISSING DATA: INTRODUCTION LI QI UNC CHARLOTTE
Dec 21, 2015

Jasper Barrett
Page 1: HOW TO DEAL WITH MISSING DATA: INTRODUCTION LI QI UNC CHARLOTTE.

HOW TO DEAL WITH MISSING DATA:

INTRODUCTION

LI QI

UNC CHARLOTTE

GENERAL STEPS FOR ANALYSIS WITH MISSING DATA

1. Identify the patterns of, and reasons for, missing values, and recode them correctly

2. Decide on best method of analysis

3. Make an inference about some aspect (parameter) of the distribution of the “full” data when some of the data are missing

STEP 1: UNDERSTAND YOUR DATA

Attrition due to social/natural processes

E.g., school graduation, dropout, death…

Skip pattern in survey

E.g., certain questions are asked only of respondents who indicate they are married

Intentional missingness as part of the data collection process

Respondent refusal/non-response

Observations are not sampled with the same frequency. 

UNDERSTAND YOUR DATA (CONT.)

Are certain groups more likely to have missing values?

Example: Respondents in service occupations less likely to report income

Are certain responses more likely to be missing?

Example: Respondents with high income less likely to report income
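
The first question can be checked directly, because group membership is observed for everyone. A minimal sketch, assuming hypothetical survey records in which None marks an unreported income (the field names and values are made up for illustration):

```python
from collections import defaultdict

def missing_rate_by_group(records, group_key, value_key):
    """Return {group: share of records whose value is None}."""
    counts = defaultdict(lambda: [0, 0])  # group -> [n_missing, n_total]
    for rec in records:
        tally = counts[rec[group_key]]
        tally[0] += rec[value_key] is None
        tally[1] += 1
    return {g: miss / total for g, (miss, total) in counts.items()}

# Hypothetical survey records; None marks an unreported income.
survey = [
    {"occupation": "service", "income": None},
    {"occupation": "service", "income": 28000},
    {"occupation": "service", "income": None},
    {"occupation": "service", "income": 31000},
    {"occupation": "office", "income": 52000},
    {"occupation": "office", "income": 47000},
    {"occupation": "office", "income": None},
    {"occupation": "office", "income": 60000},
]
rates = missing_rate_by_group(survey, "occupation", "income")
print(rates)
```

The second question cannot be answered this way: whether high incomes are the ones going unreported depends on values that were never recorded.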

MISSING DATA MECHANISM

MCAR (missing completely at random): The probability of missingness is independent of the data.

If the data are MCAR, then the complete-case estimator is unbiased and consistent, as our intuition would suggest.

In fact, the observed data alone cannot confirm that the data are MCAR, because the missingness could still depend on the unobserved values (an identifiability problem).

MISSING DATA MECHANISM (CONT.)

MAR (missing at random): The probability of missingness depends only on the observed data.

The MAR assumption allows the missingness indicator δ to be related to the variable Y, but only through the always-observed variables W:

P(δ = 1 | Y, W) = f(W)

Example: Respondents in service occupations less likely to report income

MISSING DATA MECHANISM (CONT.)

NMAR (not missing at random): The probability of missingness may also depend on the unobservable part of the data.

Difficult to deal with
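
The three mechanisms can be compared in a small simulation. This is a sketch under assumed values: the income distributions, the observation probabilities, and all variable names are hypothetical, chosen to mirror the occupation and income examples above.

```python
import random
import statistics

random.seed(0)
n = 20000
# W: always-observed indicator (True = service occupation); Y: income in $1000s.
w = [random.random() < 0.5 for _ in range(n)]
y = [random.gauss(30 if wi else 50, 10) for wi in w]

def complete_case_mean(prob_observed):
    """Mean of Y over the cases that stay observed under the given mechanism."""
    kept = [yi for yi, wi in zip(y, w) if random.random() < prob_observed(yi, wi)]
    return statistics.fmean(kept)

full_mean = statistics.fmean(y)
# MCAR: the chance of being observed ignores the data entirely.
mcar = complete_case_mean(lambda yi, wi: 0.6)
# MAR: the chance of being observed depends only on the observed W.
mar = complete_case_mean(lambda yi, wi: 0.4 if wi else 0.8)
# NMAR: the chance of being observed depends on Y itself.
nmar = complete_case_mean(lambda yi, wi: 0.8 if yi < 40 else 0.4)

print(round(full_mean, 1), round(mcar, 1), round(mar, 1), round(nmar, 1))
```

In this setup the complete-case mean stays close to the full-data mean only under MCAR. Under MAR it drifts upward because service workers are under-represented, and under NMAR it drifts downward because high incomes are under-represented.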

STEP 2: DEAL WITH MISSING DATA

Use what you know about why the data are missing to decide on the analysis strategy that yields the best estimates.

TRADITIONAL APPROACHES

Deletion Methods

Listwise deletion, pairwise deletion

Single imputation methods

Mean/Mode substitution

Regression substitution
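
A minimal sketch of two of these approaches, listwise deletion and mean substitution, on a hypothetical income column (values in $1000s, None for missing):

```python
import statistics

incomes = [28.0, None, 35.0, 41.0, None, 52.0, 60.0, None]  # hypothetical data

# Listwise deletion: analyze only the cases with an observed value.
observed = [v for v in incomes if v is not None]

# Mean substitution: replace every missing value with the observed mean.
fill = statistics.fmean(observed)
mean_filled = [fill if v is None else v for v in incomes]

var_deleted = statistics.variance(observed)
var_filled = statistics.variance(mean_filled)
print(round(var_deleted, 1), round(var_filled, 1))
```

Mean substitution keeps the sample size and the mean but shrinks the variance, since the filled-in values add no spread.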

ADVANCED METHODS

Maximum Likelihood method (ML)

Weighting method (IPW)

Multiple imputation method (MI)

MODEL-BASED METHODS: ML

Identify the set of parameter values that produces the highest log-likelihood.

Advantages: Use both complete cases and incomplete cases; Enjoy the optimality properties afforded to an MLE.

Disadvantages: We need to correctly specify the two parametric models; can be difficult to compute.
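
As a sketch of what is being maximized, suppose (notation assumed here, not taken from the slides) that p(y, w; θ) is the parametric model for the full data and π(y, w; ψ) = P(δ = 1 | Y = y, W = w) is the parametric model for the probability of being observed. The likelihood of the observed data could then be written as:

```latex
L(\theta, \psi) =
  \prod_{i:\,\delta_i = 1} p(y_i, w_i; \theta)\, \pi(y_i, w_i; \psi)
  \times
  \prod_{i:\,\delta_i = 0} \int p(y, w_i; \theta)\, \{1 - \pi(y, w_i; \psi)\}\, dy
```

Complete cases enter through the first product and incomplete cases through the second, which is why both kinds of cases contribute. The integral over the unobserved y is one source of the computational difficulty.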

INVERSE PROBABILITY WEIGHTING

Little and Rubin (1987) proposed this method for missing data problems in surveys.

INVERSE PROBABILITY WEIGHTING

Idea: A subject with a weight of 4 has a probability of observation of π = 0.25 (weight = 1/π = 4). As a result, data from this subject should count once for herself and three times for similar subjects whose data are missing.
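
The weighting idea can be sketched for the simplest target, the mean of Y. Everything below is assumed for illustration: the simulated incomes, the observation probabilities (taken as known, whereas in practice they would be estimated), and the variable names. The weights are normalized by their sum, which is one common variant.

```python
import random
import statistics

random.seed(1)
n = 20000
w = [random.random() < 0.5 for _ in range(n)]          # always observed
y = [random.gauss(30 if wi else 50, 10) for wi in w]   # subject to missingness
pi = [0.25 if wi else 0.8 for wi in w]                 # assumed known P(observed | W)
observed = [random.random() < p for p in pi]

values = [yi for yi, d in zip(y, observed) if d]
weights = [1 / p for p, d in zip(pi, observed) if d]   # weight = 1/pi

# Complete-case mean: ignores the unequal chances of being observed.
cc_mean = statistics.fmean(values)

# IPW mean: a subject observed with probability 0.25 counts four times.
ipw_mean = sum(wt * v for wt, v in zip(weights, values)) / sum(weights)

print(round(statistics.fmean(y), 1), round(cc_mean, 1), round(ipw_mean, 1))
```

Here the complete-case mean overstates average income because the low-income group is observed less often, while the weighted mean lands close to the full-data mean.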

INVERSE PROBABILITY WEIGHTING (CONT.)

Advantages: A full likelihood is not necessary; estimating equations (e.g., GEE) can be used. Can be applied widely across different models.

Disadvantages: If the selection probability model is not correctly specified, the IPW estimator is biased. If the true selection probability is very small, the weights become very large and the estimator can be very unstable.

INVERSE PROBABILITY WEIGHTING (CONT.)

Robins, Rotnitzky and Zhao (1994) discussed the idea of adding an augmentation term to a simple weighted estimating equation.
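
One common way to write an augmented estimator, shown here for the mean of Y as an illustration. The symbols are assumptions introduced for this sketch: π(W) is the selection probability and m(W) is a working model for E(Y | W).

```latex
\hat{\mu}_{\mathrm{AIPW}} =
  \frac{1}{n} \sum_{i=1}^{n}
  \left\{ \frac{\delta_i Y_i}{\pi(W_i)}
        - \frac{\delta_i - \pi(W_i)}{\pi(W_i)}\, m(W_i) \right\}
```

The first term is the simple weighted estimator. The second is the augmentation term, which has mean zero when π(W) is correct and brings in information from the subjects with missing Y through their observed W.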

INVERSE PROBABILITY WEIGHTING (CONT.)

SAS: The GEE and CAUSALTRT Procedures

R: The ipw and CausalGAM packages

SINGLE IMPUTATION

Since some of the Y are missing, a natural strategy is to impute or “estimate” a value for such missing data and then estimate the parameter of interest behaving as if the imputed values were the true values.

SINGLE IMPUTATION (CONT.)

For monotone missing data patterns

Regression Method

Propensity Score Method

For arbitrary missing data patterns

MCMC method

All these options are available in the SAS MI procedure.

MULTIPLE IMPUTATION (MI)

Single imputation does not reflect the uncertainty about the predictions of the unknown missing values, and the resulting estimated variances of the parameter estimates will be biased toward zero.

Multiple imputation does not attempt to estimate each missing value through simulated values but rather to represent a random sample of the missing values.

MI PROCEDURE

Multiple imputation inference involves three distinct phases:

1. The missing data are filled in m times to generate m complete data sets.

2. The m complete data sets are analyzed by using standard procedures.

3. The results from the m complete data sets are combined for the inference.
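
The slides do not say how the results are combined. A standard choice is Rubin's combining rules, sketched below for a single parameter; the function name and the input numbers are hypothetical.

```python
import statistics

def pool_rubin(estimates, variances):
    """Combine m point estimates and their squared standard errors."""
    m = len(estimates)
    q_bar = statistics.fmean(estimates)   # pooled point estimate
    u_bar = statistics.fmean(variances)   # average within-imputation variance
    b = statistics.variance(estimates)    # between-imputation variance
    total = u_bar + (1 + 1 / m) * b       # total variance of the pooled estimate
    return q_bar, total

# Hypothetical results from m = 5 analyses of imputed data sets.
estimates = [10.0, 11.0, 9.0, 10.5, 9.5]
variances = [1.0, 1.2, 0.8, 1.1, 0.9]
pooled, total_var = pool_rubin(estimates, variances)
print(pooled, total_var)
```

The between-imputation term is what a single imputation leaves out: it carries the extra uncertainty that comes from not knowing the missing values.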

MULTIPLE IMPUTATION PROCESS

MULTIPLE IMPUTATION (CONT.)

SAS

PROC MI

R

mi package

TIME SERIES DATA

Idea: aggregation and interpolation

SAS: PROC EXPAND  

http://support.sas.com/rnd/app/examples/ets/missval
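
The interpolation idea can be illustrated with a minimal sketch. It uses plain linear interpolation on a hypothetical monthly series, which is a simplification and not necessarily what PROC EXPAND does; the function name is made up for this example.

```python
def interpolate_linear(series):
    """Fill interior None gaps by linear interpolation between known neighbors."""
    filled = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (series[right] - series[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = series[left] + step * (i - left)
    return filled

monthly = [10.0, None, None, 16.0, 18.0, None, 22.0]  # hypothetical series
filled = interpolate_linear(monthly)
print(filled)
```

Gaps at the very start or end of the series are left as None, because there is no neighbor on one side to interpolate from.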

MISSING SPATIAL DATA

Thank you!