Missing Data and Multiple Imputation By Jon Atwood Collaborator LISA

Jan 18, 2018

Transcript
Page 1: Missing Data and Multiple Imputation By Jon Atwood Collaborator LISA.

Missing Data and Multiple Imputation
By Jon Atwood
Collaborator, LISA

Page 2:

In this course, we will…

• Examine missing data in a general sense; what it is, where it comes from, what types exist, etc.

• Explain the problems of certain common methods for dealing with missing data, such as complete case analysis and single imputation methods

• Study multiple imputation (MI), learning generally how it works

• Apply MI to real data sets using SAS and R

Page 3:

So what is missing data?

• Missing data is information that we want to know, but don’t

• It can come in many forms, from people not answering questions on surveys, to inaccurate recordings of the height of plants that need to be discarded, to canceled runs in a driving experiment due to rain

• We could also consider something we never even thought of to be missing data

Page 4:

The key question is, why is the data missing?

• What mechanism is it that contributes to, or is associated with, the probability of a data point being absent?

• Can it be explained by our observed data or not?

• The answers drastically affect what we can ultimately do to compensate for the missingness

Page 5:

Perhaps the most common method of handling missing data is “Complete Case Analysis”

• Simply delete all cases that have any missing values at all, so you are left only with observations with all variables observed

• Computer software often does this by default when performing analysis (regression, for example)

• This is the simplest way to handle missing data. In some cases, it will work fine

• However, the loss of sample size will lead to larger variance than the original size of your data would suggest

• May bias your sample

Page 6:

And now a closer look…

• We use as an example a data set of body fat percentage in men, and the circumference of various body parts (Penrose et al., 1985)

• Does the circumference of certain body parts predict body fat percentage?

• Here are some significant predictors from a regression model with body fat percentage as the response

Predictor   Estimate   S.E.     P-Value
Age          0.0626    0.0313   0.0463
Neck        -0.4728    0.2294   0.0403
Forearm      0.45315   0.1979   0.0229
Wrist       -1.6181    0.5323   0.0026

Page 7:

In this case, the data is complete, with sample size 252

• But suppose about 5 percent of the participants had missing values? 10 percent? 20 percent?

• What if we performed complete case analysis and removed those who had missing values?

• First let's examine the effect of doing this when the data is MCAR

• I randomly removed cases from the data set, reran the analysis and stored the p-values. I did this 1,000 times, and plotted the 1,000 p-values in boxplots
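The simulation just described can be sketched as follows. This is only an illustration with synthetic data (the Penrose body-fat data set itself is not reproduced here), and it tracks the slope estimates rather than p-values to stay dependency-free.

```python
import random
import statistics

random.seed(1)

# Synthetic stand-in for the body-fat data (assumption: a single
# predictor x with y = 0.5*x + noise, sample size 252 as in the slides).
n = 252
x = [random.gauss(0, 1) for _ in range(n)]
y = [0.5 * xi + random.gauss(0, 1) for xi in x]

def ols_slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    num = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    return num / sum((a - mx) ** 2 for a in xs)

full_slope = ols_slope(x, y)

# Complete case analysis under MCAR: randomly delete 13 cases (~5%),
# re-fit, store the estimate; repeat 1,000 times.
slopes = []
for _ in range(1000):
    keep = random.sample(range(n), n - 13)
    slopes.append(ols_slope([x[i] for i in keep], [y[i] for i in keep]))

# MCAR deletion should leave the estimate unbiased, just noisier.
print(round(full_slope, 3), round(statistics.mean(slopes), 3))
```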

Page 8:

For about 5 percent (n=13) deleted

[Boxplots of the 1,000 p-values for Age, Neck, Forearm, and Wrist]

Page 9:

For about 20 percent (n=50) deleted

[Boxplots of the 1,000 p-values for Age, Neck, Forearm, and Wrist]

Page 10:

We seem to change our conclusions somewhat

• With age and neck, it seems we fail to reject more often than not

• The other two, we still reject most of the time

• This assumes the missing subjects do not differ from the non-missing. If they did differ, it would cause bias

Page 11:

Types Of Missingness

• Missing Completely at Random (MCAR)

• Missing at Random (MAR)

• Missing Not at Random (MNAR) or Not Missing at Random (NMAR)

Page 12:

What Distinguishes Each Type?

• Suppose you’re loitering outside an elementary school one day…

• You then find out that students just received their report cards for the first quarter

• For some reason, you start asking passing students their English grades. Of course, you don't force them to tell you or anything. You also write down their gender and hair color

Page 13:

A data set from this activity might look like this…

Hair Color   Gender   Grade
Red          M        A
Brown        F        A
Black        F        B
Black        M        A
Brown        M
Brown        M
Brown        F
Black        M        B
Black        M        B
Brown        F        A
Black        F
Brown        F        C
Red          M
Red          F        A
Brown        M        A
Black        M        A

• 7 students received As, 3 received Bs, and 1 a C

• No failing!!

• But 5 students did not reveal their grade

Page 14:

To determine the type of missingness, look at what influences the probability of a missing point

Hair Color   Gender   Grade
0            0        0
0            0        0
0            0        0
0            0        0
0            0        1
0            0        1
0            0        1
0            0        0
0            0        0
0            0        0
0            0        1
0            0        0
0            0        1
0            0        0
0            0        0
0            0        0

• Here is the same data set, but the values are replaced with a “0” if the data point is observed and “1” if it is not

• We'll call this the "Missing Matrix." Obviously there are many more possible missing matrices

• The relevant question is, for any one of these data points, what is the probability that the point is equal to “1” ?
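The indicator coding above is purely mechanical; here is a minimal sketch of it, with the grades following the table on the previous slide and None marking a withheld grade:

```python
# Build the "Missing Matrix" R: 0 where a value is observed, 1 where it
# is missing. None marks a withheld grade in the classroom data.
students = [
    ("Red", "M", "A"), ("Brown", "F", "A"), ("Black", "F", "B"),
    ("Black", "M", "A"), ("Brown", "M", None), ("Brown", "M", None),
    ("Brown", "F", None), ("Black", "M", "B"), ("Black", "M", "B"),
    ("Brown", "F", "A"), ("Black", "F", None), ("Brown", "F", "C"),
    ("Red", "M", None), ("Red", "F", "A"), ("Brown", "M", "A"),
    ("Black", "M", "A"),
]

R = [[0 if value is not None else 1 for value in row] for row in students]

# Only the Grade column has missing entries: five 1s in total.
print(sum(cell for row in R for cell in row))
```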

Page 15:

Upcoming Quiz!

• What type of missingness do the grades exhibit?

Page 16:

Missing Completely at Random (MCAR)

• If this probability is not dependent on any of the data, observed or unobserved, then the data is Missing Completely at Random (MCAR)

• To be more precise, suppose that X is the observed data and Y is the unobserved data. Suppose we label our “Missing Matrix” as R.

• Then, if the data are MCAR, P(R|X,Y)=P(R)

Page 17:

Example…

• Suppose you are running an experiment on plants grown in pots, when suddenly you have a nervous breakdown and smash some of the pots

• In your insanity, you will probably not choose the plants to smash in a well-defined pattern, such as by height, age, etc.

• Hence, the missing values generated from your act of madness will likely fall into the MCAR category

Page 18:

Another way to think of MCAR

• Suppose we had to quickly go to the bathroom and do number 2

• In our desperation, we use the data as our toilet paper

• Presumably, some of our data would be smeared with…you know what

• The data smeared can be said to be a random subset of our data

Page 19:

In practice, MCAR is usually not realistic

• A completely random mechanism for generating missingness in your data set just isn’t very realistic

• Usually, missing data is missing for a reason. Maybe older people are less likely to answer web-delivered survey questions, people in longitudinal studies may die before completing the entire study, or companies may be reluctant to reveal financial information

Page 20:

Missing at Random (MAR)

• If the probability of your missing data is dependent on the observed data but not the unobserved data, your missing observations are said to be Missing at Random (MAR)

• Symbolically, P(R|X,Y)=P(R|X), so that the unobserved data does not contribute to the probability of observing our “Missing Matrix.”

• Random is somewhat of a misnomer. MAR means that there is a mechanism that is associated with whether the data is missing, and it has to do with our observed data

Page 21:

Example…

• Usually, missing data is missing for a reason. Maybe older people are less likely to answer web-delivered survey questions, people in longitudinal studies may die before completing the entire study, or companies may be reluctant to reveal financial information

Page 22:

The key point to MAR is…

• We can still model the missing mechanism and compensate for it

• The multiple imputation methods we will be talking about today assume MAR

• For example, if age is known, you can model missingness as a function of age

• Whether missing data is MAR or the next type, Missing Not at Random (MNAR), is not testable. It requires you to understand your data

Page 23:

Missing Not at Random (MNAR)

• The missingness has something to do with the missing value itself

• It has been said that smokers are not as likely to answer the question, “Do you smoke?”

• Said to be nonignorable

• Although there are some proposed ways to handle MNAR data, these are more complicated and are beyond the scope of this class

Page 24:

So, returning to our school example…

• Do you think this missing data is likely MCAR, MAR or MNAR?

Hair Color   Gender   Grade
Red          M        A
Brown        F        A
Black        F        B
Black        M        A
Brown        M
Brown        M
Brown        F
Black        M        B
Black        M        B
Brown        F        A
Black        F
Brown        F        C
Red          M
Red          F        A
Brown        M        A
Black        M        A

Page 25:

Add overall GPA

• Now the data looks like this

• Does this change anything?

Hair Color   GPA    Gender   Grade
Red          3.4    M        A
Brown        3.6    F        A
Black        3.7    F        B
Black        3.9    M        A
Brown        2.5    M
Brown        3.2    M
Brown        3.0    F
Black        2.9    M        B
Black        3.3    M        B
Brown        4.0    F        A
Black        3.65   F
Brown        3.4    F        C
Red          2.2    M
Red          3.8    F        A
Brown        3.8    M        A
Black        3.67   M        A

Page 26:

So what do we do about missing data?

Page 27:

Single Imputation Methods: Impute Once

• Mean Imputation: imputing the average from observed cases for all missing values of a variable

• Hot Deck Imputation: imputing a value from another subject, or “donor,” that is most like the subject in terms of observed variables

• Some others

• All fundamentally impose too much precision. We have uncertainty in what the unobserved values actually are

Page 28:

Multiple Imputation

• Using a single imputation approach does not account for an obvious source of uncertainty

• By imputing only once, we are treating the imputed value as if we observed it when we did not

• In reality, we are uncertain about what the unobserved value would have been

• Multiple Imputation (MI) takes this into account by generating several random values for each missing data point

Page 29:

The General Process

1. A value is randomly drawn for the unobserved data points based on a predetermined model from the observed data

2. Repeat step 1 some number of times, say m, resulting in m imputed data sets

3. Each imputed data set is analyzed separately

4. The separate analyses are pooled together for a unifying analysis that takes into account all the imputed data sets
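A toy sketch of these four steps with a single variable. This is only an illustration: drawing each missing value from a normal centred on the observed mean stands in for a real predictive model, and the "analysis" is just a mean.

```python
import random
import statistics

random.seed(7)

observed = [2.0, 6.0, 5.0]   # observed values of y
n_missing = 2                # two values are missing
m = 100                      # number of imputed data sets

mu = statistics.mean(observed)
sd = statistics.stdev(observed)

estimates = []
for _ in range(m):
    # Step 1: randomly draw the unobserved points from a model
    # based on the observed data (here: normal around the observed mean).
    draws = [random.gauss(mu, sd) for _ in range(n_missing)]
    completed = observed + draws                    # step 2: one imputed set
    estimates.append(statistics.mean(completed))    # step 3: analyze it

# Step 4: pool the m separate analyses.
pooled = statistics.mean(estimates)
print(round(pooled, 2))
```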

Page 30:

To illustrate…

Here's some data:

X    Y
32   2
43   ?
56   6
25   ?
84   5

Page 31:

Oh no, we have two missing values!

Whatever shall we do?!

Page 32:

Let’s Impute Some Data!

X    Y
32   2
43   5.58
56   6
25
84   5

First, we'll use a predictive distribution of the missing values, given the observed values, to make random draws of the missing values and fill them in.

Now we have one imputed data set!

Page 33:

Let's Set That Aside…

X    Y
32   2
43   5.58
56   6
25
84   5

And Do it Again!!!!

X    Y
32   2
43   7.2
56   6
25   1.1
84   5

Page 34:

Set that aside…

Page 35:

Now we have 2 imputed data sets!!!

X    Y
32   2
43   7.2
56   6
25   1.1
84   5

X    Y
32   2
43   5.58
56   6
25
84   5

Page 36:

• Do this m number of times for m imputed data sets

Page 37:

Inference with Multiple Imputation

• Now that we have our imputed data sets, how do we make use of them? (suppose in this case m = 2)

X    Y
32   2
43   7.2
56   6
25   1.1
84   5

X    Y
32   2
43   5.58
56   6
25
84   5

Page 38:

We analyze each separately

X    Y
32   2
43   7.2
56   6
25   1.1
84   5

Slope: 4.932    S.E.: 4.287

X    Y
32   2
43   5.58
56   6
25
84   5

Slope: -0.8245    S.E.: 6.1845

Page 39:

Finally we pool the analyses together

• The pooled slope estimate is the average of the m imputed estimates

• In our example, β1p = (4.932 - 0.8245) * 0.5 = 2.0538

• The pooled slope variance is given by

T = (1/m) Σ Zi + (1 + 1/m) * (1/(m-1)) Σ (β1i - β1p)²

where Zi is the within-imputation variance (squared standard error) of the i-th imputed slope, and β1i is the i-th imputed slope

The pooled variance in this case is (4.287 + 6.1845)/2 + (3/2)*(16.569) = 30.08925

To find the standard error, take the square root, and we get 5.485
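The pooling arithmetic can be checked with a few lines; the two slopes and within-imputation variances are the values from the previous slide.

```python
import math

slopes = [4.932, -0.8245]    # slope estimate from each imputed data set
within = [4.287, 6.1845]     # within-imputation variance of each slope
m = len(slopes)

# Pooled estimate: the average of the m imputed estimates.
pooled_slope = sum(slopes) / m

# Pooled variance: mean within-imputation variance plus an inflated
# between-imputation component (Rubin's rules).
between = sum((b - pooled_slope) ** 2 for b in slopes) / (m - 1)
total_var = sum(within) / m + (1 + 1 / m) * between
pooled_se = math.sqrt(total_var)

print(round(pooled_slope, 4), round(total_var, 3), round(pooled_se, 3))
```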

Page 40:

Predicting the missing data given the observed data

• Bayes’ Theorem...

Page 41:

Imagine, then, that we establish some distribution of the parameters of interest before considering the data, P(θ), where θ is the set of parameters we are trying to estimate. This is called the prior distribution of θ.

Then, we establish a distribution P(Xobs|θ).

We can finally use Bayes' Theorem to establish P(θ|Xobs), make random draws for θ, and use these draws to make predictions of Ymiss.

Page 42:

How many imputations do we need?

• Depends on the size of the data set and the amount of missingness

• Some previous research indicated that about 5 is sufficient for efficiency of the estimates, based on the relative efficiency (1 + λ/m)^(-1), where m is the number of imputations and λ is the fraction of missing information for the term being estimated (Schafer, 1999)

• However, more recent research claims that a good imputation number is actually higher (maybe 40 or more) in order to achieve higher power (Graham et al., 2007)
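To get a feel for the efficiency formula, here are its values at a 30% fraction of missing information (λ = 0.3 is an arbitrary choice for illustration):

```python
# Relative efficiency of MI with m imputations: (1 + lambda/m)^(-1).
def relative_efficiency(m, lam):
    return 1.0 / (1.0 + lam / m)

# With lambda = 0.3, m = 5 already gives ~94% efficiency, which is why
# small m was long considered sufficient; power is another matter.
for m in (3, 5, 10, 40):
    print(m, round(relative_efficiency(m, 0.3), 3))
```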

Page 43:

General Methods for Multiple Imputation

• Regression based

• Chained Equations (MICE) or Fully Conditional Specification (FCS)

• Markov Chain Monte Carlo (MCMC)

Page 44:

• We will look at part of a data set of CEO bonuses, with other predictor variables (sales, advanced degrees, age, etc.)

Page 45:

Regression Approach in SAS

• Uses predictive mean matching, which means that the actual imputed value is one chosen randomly from a set of observed values whose predicted value is close to the predicted value of the missing observation

• Is meant to try and keep imputed values plausible

• Based on the imputation model we build, posterior random draws are made for the regression parameters

• These draws are used to construct the predicted values for the missing observation

Page 46:

What parameters?

• Suppose our imputation model is y = β1x1 + … + βkxk

• A random draw is made from the posterior predictive distribution of the parameters, and we get the randomly drawn parameters β* = (β*1, …, β*k)

• The missing value yi is predicted as β*1x1 + … + β*kxk

• Predictive mean matching is made based on this prediction
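A bare-bones sketch of the matching step. Assumptions for illustration: one predictor, fixed "drawn" parameter values, and a single nearest donor (PROC MI actually uses posterior draws and selects randomly from a small donor set).

```python
# Predictive mean matching: predict every case, then impute each missing
# case with the OBSERVED y of the donor whose prediction is closest.
observed = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]  # (x, y) pairs
missing_x = [2.2, 3.6]                                       # x with y missing

def predict(x, b0=0.0, b1=2.0):
    # b0, b1 stand in for one posterior draw of the regression parameters.
    return b0 + b1 * x

imputed = []
for x in missing_x:
    target = predict(x)
    donor = min(observed, key=lambda xy: abs(predict(xy[0]) - target))
    imputed.append(donor[1])   # the donor's observed value keeps it plausible

print(imputed)
```

Because every imputed value is an actually observed one, the imputations can never be implausible (negative heights, impossible categories), which is the point of the method.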

Page 47:

SAS example

• We will look at part of a data set of CEO bonuses, with other predictor variables (sales, advanced degrees, age, etc.)

• Since we plan to do regression on bonuses, and bonuses may have large variability as they get higher, we will take the log of bonuses before we do the imputation

Page 48:

Here's the code for the entire process:

Imputation Code:

proc mi data=bob1 out=mob seed=123 nimpute=10;
  monotone regpmm(logb=stock sales years mba mastphd age);
  var stock sales years mba mastphd age logb;
run;

Regression Code:

proc reg data=mob outest=mp covout noprint;
  model logb=stock sales years mba mastphd age;
  by _imputation_;
run;

Pooled Analysis Code:

proc mianalyze data=mp;
  modeleffects stock sales years mba mastphd age;
run;

Page 49:

Here is the output:

Variance Information

Parameter   Between      Within     Total
stock       1.2212E-05   3.5E-05    4.9E-05
sales       1.15E-12     1.02E-11   1.15E-11
years       6.388E-06    2.5E-05    3.2E-05
mba         0.001423     0.00794    0.0095
mastphd     0.001299     0.00633    0.00776
age         1.0735E-05   2.5E-05    3.7E-05

Parameter Estimates

Parameter   Estimate   Std Error   95% Confidence Limits   Theta0   t for H0: Parameter=Theta0   Pr > |t|
stock       -0.00556   0.00697     -0.0194, 0.00824        0        -0.80                        0.4267
sales       2.42E-05   3.4E-06     0.00002, 3.1E-05        0        7.16                         <.0001
years       0.017694   0.00562     0.0066, 0.02879         0        3.15                         0.0019
mba         0.014343   0.09749     -0.1774, 0.20612        0        0.15                         0.8831
mastphd     -0.00182   0.08809     -0.1753, 0.17162        0        -0.02                        0.9835
age         0.014896   0.00608     0.00281, 0.02698        0        2.45                         0.0163

Page 50:

Classification Variables

• Suppose that we want to impute a variable that takes one of two values, “male” or “female”, “smoker” or “non smoker”, “dead” or “alive”

• Or what if there are even more categories, such as dislike, like, and love?

• What if they are nominal, like chocolate, vanilla, and strawberry?

• We can hardly use continuous methods in these cases

Page 51:

We can use the “Logistic Regression Method”

• Remember that if p = the probability that y = 1, the logistic regression model can be expressed as

log(p / (1 - p)) = β0 + β1x1 + … + βkxk

• We can make random draws for β*, the estimators of β, from their posterior distribution

• Use those to calculate the estimate p = exp(β*0 + β*1x1 + … + β*kxk) / (1 + exp(β*0 + β*1x1 + … + β*kxk)), and use this to predict y for the missing case
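The prediction step can be sketched in a few lines; the parameter values below are made up purely for illustration.

```python
import math
import random

random.seed(5)

def logistic_p(x, betas, intercept):
    """p = exp(eta) / (1 + exp(eta)) for eta = intercept + sum(beta_i * x_i)."""
    eta = intercept + sum(b * xi for b, xi in zip(betas, x))
    return math.exp(eta) / (1.0 + math.exp(eta))

# One randomly drawn parameter vector beta* (made-up values) applied to
# one missing case whose covariates x are observed.
p = logistic_p([0.5, 1.2], betas=[0.8, -0.4], intercept=-0.1)
y = 1 if random.random() < p else 0   # predict the missing class by a draw

print(round(p, 3), y)
```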

Page 52:

• This method also works for ordinal data

• Can be performed sequentially in SAS on multiple variables one at a time if data is monotone missing, which means that if an observation is missing for one variable, observations are missing for all the rest of the variables for that subject

• Discriminant Function Method can be used for nominal variables

Page 53:

SAS Example

• I took the CEO data set and removed 57 values (no particular reason I chose 57)

• The following code runs the imputation:

proc mi data=bob nimpute=5 seed=231 out=lid;
  class mastphd;
  var age stock sales years mba mastphd;
  monotone logistic (mastphd=years sales stock age mba);
run;

Page 54:

And we get this…

WARNING: The maximum likelihood estimates for the logistic regression with observed observations may not exist for variable MastPHD. The posterior predictive distribution of the parameters used in the imputation process is based on the maximum likelihood estimates in the last maximum likelihood iteration.

Page 55:

The answer lies in the follies of logistic regression, as well as the redundancy of our model

Table of MastPHD by MBA (frequency, percent, row pct, col pct)

MastPHD     MBA=0     MBA=1     Total
0           372       0         372
            50.07     0.00      50.07
            100.00    0.00
            67.64     0.00
1           178       193       371
            23.96     25.98     49.93
            47.98     52.02
            32.36     100.00
Total       550       193       743
            74.02     25.98     100.00

• We have “perfect classification” in that no one without a masters/phd has an mba

• If we have perfect classification like this, then the algorithm that does logistic regression will not converge

• This is something you need to be careful about in general

Page 56:

Now that we’ve removed MBA, here’s the code

Imputation Code:

proc mi data=bob nimpute=5 seed=231 out=lid;
  class mastphd;
  var age stock sales years mastphd;
  monotone logistic (mastphd=years sales stock age);
run;

Logistic Regression Code:

proc logistic data=lid outest=rain covout noprint descending;
  class mastphd;
  model mastphd=age stock sales years;
  by _imputation_;
run;

Pooled Analysis Code:

proc mianalyze data=rain;
  modeleffects age stock sales years;
run;

Page 57:

So here are the results:

Variance Information

Parameter   Between      Within     Total      DF       Relative Increase in Variance   Fraction Missing Information   Relative Efficiency
age         7.2861E-05   0.00015    0.00023    28.185   0.604415                        0.416692                       0.923073
stock       1.3164E-05   0.00042    0.00044    3084.8   0.037355                        0.036634                       0.992727
sales       8.39E-13     5.41E-11   5.51E-11   11990    0.018605                        0.018429                       0.996328
years       1.8885E-05   0.00014    0.00016    204.21   0.162733                        0.148259                       0.971202

Parameter Estimates

Parameter   Estimate    Std Error   95% Confidence Limits   DF       Minimum       Maximum       Theta0   t for H0: Parameter=Theta0   Pr > |t|
age         -0.030592   0.01524     -0.0618, 0.00061        28.185   -0.040456     -0.02281      0        -2.01                        0.0543
stock       -0.087136   0.02095     -0.1282, -0.0461        3084.8   -0.090364     -0.082114     0        -4.16                        <.0001
sales       3.554E-06   7.4E-06     -1E-05, 0.00002         11990    0.000002689   0.000004581   0        0.48                         0.6321
years       0.005274    0.01273     -0.0198, 0.03036        204.21   0.001705      0.010075      0        0.41                         0.679

Page 58:

What about when more than one variable has missing values?

Page 59:

Multiple Imputation by Chained Equations (MICE)

1. Provides initial imputations of missing values

2. For one particular variable, removes them again

3. Builds model based on other variables, and uses posterior predictive distribution to impute random values

4. Does the same thing for another variable; only the imputed values for the first variable remain

5. Completes for all variables, repeats the process many times

6. This makes one imputed data set. The whole procedure is repeated m times for m imputed data sets
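The cycle in steps 1 through 5 can be illustrated with two variables. This is only a caricature: means of the other variable stand in for the regression models and posterior predictive draws that real MICE uses.

```python
import statistics

a = [1.0, 2.0, None, 4.0]   # variable a, one value missing
b = [None, 1.0, 3.0, 5.0]   # variable b, one value missing

def mean_observed(v):
    return statistics.mean(x for x in v if x is not None)

# Step 1: initial fill-in with the observed means.
a_imp = [x if x is not None else mean_observed(a) for x in a]
b_imp = [x if x is not None else mean_observed(b) for x in b]

for _ in range(5):   # step 5: repeat the cycle several times
    # Steps 2-3: remove and re-impute a's missing entry from current b.
    for i, x in enumerate(a):
        if x is None:
            a_imp[i] = statistics.mean(b_imp)   # toy model of a given b
    # Step 4: same for b, keeping the fresh imputations of a.
    for i, x in enumerate(b):
        if x is None:
            b_imp[i] = statistics.mean(a_imp)   # toy model of b given a

# Step 6 would wrap all of the above to produce m imputed data sets.
print([round(v, 3) for v in a_imp], [round(v, 3) for v in b_imp])
```

Note how the imputations for a and b keep feeding each other until they settle down, which is exactly the chained-equations idea.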

Page 60:

• Works well in simulations, handles many types of variables at once

• Can take a lot of time, and theoretical justification is not particularly strong

Page 61:

R Example

• This data set, "nhanes", has age group, body mass index, hypertensive status and serum cholesterol

• Body mass index and serum cholesterol are continuous, while hypertensive status (yes or no) is binary and age group is ordinal

• We will use the package “mice” and the function “mice” to complete the imputation and analysis

Page 62:

Code

• You need to install the ‘mice’ package

library(mice)
nhanes$hyp <- as.factor(nhanes$hyp)
bord <- mice(nhanes, m=40, seed=132, me=c("polr","pmm","logreg","norm"))
complete(bord, 12)
bit <- with(bord, lm(chl ~ age + bmi + hyp))
summary(pool(bit))

Page 63:

Output

                     est          se           t            df         Pr(>|t|)
(Intercept)   -39.104424    88.462185   -0.4420468    9.341691   0.66851235
age            40.287101    18.378020    2.1921350    6.268912   0.06894168
bmi             6.091045     2.610044    2.3336941   11.449700   0.03876241
hyp             5.410891    29.405394    0.1840102    8.038752   0.85856252

Body mass index is a significant predictor of cholesterol, and age nearly is, but hypertensive status is not

Page 64:

Markov Chain Monte Carlo Approach

• Here, the process gives us data estimations via Markov chains

• A Markov chain holds the property that the probability of the next link in the chain depends only on the current link

• Basically, we perform a bunch of steps, and the probability of each step depends only on the previous step

• Eventually, theory holds that under certain conditions, the steps will converge to the distribution that we are trying to estimate, called the stationary distribution

Page 65:

But, there’s a catch…

• This approach assumes multivariate normality

Page 66:

Summary

• Though handling missing data is ultimately just a nuisance necessity and not the point of the analysis, it pays to give it the consideration it is due

• Whether you use multiple imputation, single imputation, or complete case analysis depends on how much missing data you have and how big the sample is

• Having the actual data is still always better

Page 67:

Thank you!