Missing Data and Multiple Imputation By Jon Atwood Collaborator LISA

Jan 18, 2018

Transcript
Page 1: Missing Data and Multiple Imputation By Jon Atwood Collaborator LISA.

Missing Data and Multiple Imputation
By Jon Atwood
Collaborator, LISA

Page 2:

In this course, we will…

• Examine missing data in a general sense; what it is, where it comes from, what types exist, etc.

• Explain the problems of certain common methods for dealing with missing data, such as complete case analysis and single imputation methods

• Study multiple imputation (MI), learning generally how it works

• Apply MI to real data sets using SAS and R

Page 3:

So what is missing data?

• Missing data is information that we want to know, but don’t

• It can come in many forms, from people not answering questions on surveys, to inaccurate recordings of the height of plants that need to be discarded, to canceled runs in a driving experiment due to rain

• We could also consider something we never even thought of to be missing data

Page 4:

The key question is, why is the data missing?

• What mechanism is it that contributes to, or is associated with, the probability of a data point being absent?

• Can it be explained by our observed data or not?

• The answers drastically affect what we can ultimately do to compensate for the missingness

Page 5:

Perhaps the most common method of handling missing data is “Complete Case Analysis”

• Simply delete all cases that have any missing values at all, so you are left only with observations with all variables observed

• Computer software often does this by default when performing analysis (regression, for example)

• This is the simplest way to handle missing data. In some cases, it will work fine

• However, the loss of sample size will lead to larger variance than the original size of your data would suggest

• May bias your sample

Page 6:

And now a closer look…

• We use as an example a data set of body fat percentage in men, and the circumference of various body parts (Penrose et al., 1985)

• Does the circumference of certain body parts predict body fat percentage?

• Here are some significant predictors from a regression model with body fat percentage as the response

Predictor   Estimate   S.E.     P-Value
Age          0.0626    0.0313   0.0463
Neck        -0.4728    0.2294   0.0403
Forearm      0.45315   0.1979   0.0229
Wrist       -1.6181    0.5323   0.0026

Page 7:

In this case, the data is complete, with sample size 252

• But suppose about 5 percent of the participants had missing values? 10 percent? 20 percent?

• What if we performed complete case analysis and removed those who had missing values?

• First let's examine the effect of doing this when the data is MCAR

• I randomly removed cases from the data set, reran the analysis and stored the p-values. I did this 1,000 times, and plotted the 1,000 p-values in boxplots
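The simulation just described can be sketched as follows. This is only an illustration with synthetic data (the Penrose body-fat data set itself is not reproduced here), and it tracks the slope estimates rather than p-values to stay dependency-free.

```python
import random
import statistics

random.seed(1)

# Synthetic stand-in for the body-fat data (assumption: a single
# predictor x with y = 0.5*x + noise, sample size 252 as in the slides).
n = 252
x = [random.gauss(0, 1) for _ in range(n)]
y = [0.5 * xi + random.gauss(0, 1) for xi in x]

def ols_slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    num = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    return num / sum((a - mx) ** 2 for a in xs)

full_slope = ols_slope(x, y)

# Complete case analysis under MCAR: randomly delete 13 cases (~5%),
# re-fit, store the estimate; repeat 1,000 times.
slopes = []
for _ in range(1000):
    keep = random.sample(range(n), n - 13)
    slopes.append(ols_slope([x[i] for i in keep], [y[i] for i in keep]))

# MCAR deletion should leave the estimate unbiased, just noisier.
print(round(full_slope, 3), round(statistics.mean(slopes), 3))
```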

Page 8:

For about 5 percent (n=13) deleted

[Boxplots of the 1,000 p-values for Age, Neck, Forearm, and Wrist]

Page 9:

For about 20 percent (n=50) deleted

[Boxplots of the 1,000 p-values for Age, Neck, Forearm, and Wrist]

Page 10:

We seem to change our conclusions somewhat

• With age and neck, it seems we fail to reject more often than not

• The other two, we still reject most of the time

• This assumes the missing subjects do not differ from the non-missing. If they did differ, it would cause bias

Page 11:

Types Of Missingness

• Missing Completely at Random (MCAR)

• Missing at Random (MAR)

• Missing Not at Random (MNAR) or Not Missing at Random (NMAR)

Page 12:

What Distinguishes Each Type?

• Suppose you’re loitering outside an elementary school one day…

• You then find out that students just received their report cards for the first quarter

• For some reason, you start asking passing students their English grades. Of course, you don't force them to tell you or anything. You also write down their gender and hair color

Page 13:

A data set from this activity might look like this…

Hair Color   Gender   Grade
Red          M        A
Brown        F        A
Black        F        B
Black        M        A
Brown        M
Brown        M
Brown        F
Black        M        B
Black        M        B
Brown        F        A
Black        F
Brown        F        C
Red          M
Red          F        A
Brown        M        A
Black        M        A

• 7 students received As, 3 received Bs, and 1 a C

• No failing!!

• But 5 students did not reveal their grade

Page 14:

To determine the type of missingness, look at what influences the probability of a missing point

Hair Color   Gender   Grade
0            0        0
0            0        0
0            0        0
0            0        0
0            0        1
0            0        1
0            0        1
0            0        0
0            0        0
0            0        0
0            0        1
0            0        0
0            0        1
0            0        0
0            0        0
0            0        0

• Here is the same data set, but the values are replaced with a “0” if the data point is observed and “1” if it is not

• We'll call this the "Missing Matrix." Obviously there are many more possible missing matrices

• The relevant question is, for any one of these data points, what is the probability that the point is equal to “1” ?
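The indicator coding above is purely mechanical; here is a minimal sketch of it, with the grades following the table on the previous slide and None marking a withheld grade:

```python
# Build the "Missing Matrix" R: 0 where a value is observed, 1 where it
# is missing. None marks a withheld grade in the classroom data.
students = [
    ("Red", "M", "A"), ("Brown", "F", "A"), ("Black", "F", "B"),
    ("Black", "M", "A"), ("Brown", "M", None), ("Brown", "M", None),
    ("Brown", "F", None), ("Black", "M", "B"), ("Black", "M", "B"),
    ("Brown", "F", "A"), ("Black", "F", None), ("Brown", "F", "C"),
    ("Red", "M", None), ("Red", "F", "A"), ("Brown", "M", "A"),
    ("Black", "M", "A"),
]

R = [[0 if value is not None else 1 for value in row] for row in students]

# Only the Grade column has missing entries: five 1s in total.
print(sum(cell for row in R for cell in row))
```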

Page 15:

Upcoming Quiz!

• What type of missingness do the grades exhibit?

Page 16:

Missing Completely at Random (MCAR)

• If this probability is not dependent on any of the data, observed or unobserved, then the data is Missing Completely at Random (MCAR)

• To be more precise, suppose that X is the observed data and Y is the unobserved data. Suppose we label our “Missing Matrix” as R.

• Then, if the data are MCAR, P(R|X,Y)=P(R)

Page 17:

Example…

• Suppose you are running an experiment on plants grown in pots, when suddenly you have a nervous breakdown and smash some of the pots

• In your insanity, you will probably not choose the plants to smash in a well-defined pattern, such as by height, age, etc.

• Hence, the missing values generated from your act of madness will likely fall into the MCAR category

Page 18:

Another way to think of MCAR

• Suppose we had to quickly go to the bathroom and do number 2

• In our desperation, we use the data as our toilet paper

• Presumably, some of our data would be smeared with…you know what

• The data smeared can be said to be a random subset of our data

Page 19:

In practice, MCAR is usually not realistic

• A completely random mechanism for generating missingness in your data set just isn’t very realistic

• Usually, missing data is missing for a reason. Maybe older people are less likely to answer web-delivered survey questions, people in longitudinal studies may die before completing the entire study, or companies may be reluctant to reveal financial information

Page 20:

Missing at Random (MAR)

• If the probability of your missing data is dependent on the observed data but not the unobserved data, your missing observations are said to be Missing at Random (MAR)

• Symbolically, P(R|X,Y)=P(R|X), so that the unobserved data does not contribute to the probability of observing our “Missing Matrix.”

• Random is somewhat of a misnomer. MAR means that there is a mechanism that is associated with whether the data is missing, and it has to do with our observed data

Page 21:

Example…

• Usually, missing data is missing for a reason. Maybe older people are less likely to answer web-delivered survey questions, people in longitudinal studies may die before completing the entire study, or companies may be reluctant to reveal financial information

Page 22:

The key point to MAR is…

• We can still model the missing mechanism and compensate for it

• The multiple imputation methods we will be talking about today assume MAR

• For example, if age is known, you can model missingness as a function of age

• Whether missing data is MAR or the next type, Missing Not at Random (MNAR), is not testable. It requires you to understand your data

Page 23:

Missing Not at Random (MNAR)

• The missingness has something to do with the missing value itself

• It has been said that smokers are not as likely to answer the question, “Do you smoke?”

• Said to be nonignorable

• Although there are some proposed ways to handle MNAR data, these are more complicated and are beyond the scope of this class

Page 24:

So, returning to our school example…

• Do you think this missing data is likely MCAR, MAR or MNAR?

Hair Color   Gender   Grade
Red          M        A
Brown        F        A
Black        F        B
Black        M        A
Brown        M
Brown        M
Brown        F
Black        M        B
Black        M        B
Brown        F        A
Black        F
Brown        F        C
Red          M
Red          F        A
Brown        M        A
Black        M        A

Page 25:

Add overall GPA

• Now the data looks like this

• Does this change anything?

Hair Color   GPA    Gender   Grade
Red          3.4    M        A
Brown        3.6    F        A
Black        3.7    F        B
Black        3.9    M        A
Brown        2.5    M
Brown        3.2    M
Brown        3.0    F
Black        2.9    M        B
Black        3.3    M        B
Brown        4.0    F        A
Black        3.65   F
Brown        3.4    F        C
Red          2.2    M
Red          3.8    F        A
Brown        3.8    M        A
Black        3.67   M        A

Page 26:

So what do we do about missing data?

Page 27:

Single Imputation Methods: Impute Once

• Mean Imputation: imputing the average from observed cases for all missing values of a variable

• Hot Deck Imputation: imputing a value from another subject, or “donor,” that is most like the subject in terms of observed variables

• Some others

• All fundamentally impose too much precision. We have uncertainty in what the unobserved values actually are

Page 28:

Multiple Imputation

• Using a single imputation approach does not account for an obvious source of uncertainty

• By imputing only once, we are treating the imputed value as if we observed it when we did not

• In reality, we are uncertain about what the unobserved value would have been

• Multiple Imputation (MI) takes this into account by generating several random values for each missing data point

Page 29:

The General Process

1. A value is randomly drawn for the unobserved data points based on a predetermined model from the observed data

2. Repeat step 1 some number of times, say m, resulting in m imputed data sets

3. Each imputed data set is analyzed separately

4. The separate analyses are pooled together for a unifying analysis that takes into account all the imputed data sets
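A toy sketch of these four steps with a single variable. This is only an illustration: drawing each missing value from a normal centred on the observed mean stands in for a real predictive model, and the "analysis" is just a mean.

```python
import random
import statistics

random.seed(7)

observed = [2.0, 6.0, 5.0]   # observed values of y
n_missing = 2                # two values are missing
m = 100                      # number of imputed data sets

mu = statistics.mean(observed)
sd = statistics.stdev(observed)

estimates = []
for _ in range(m):
    # Step 1: randomly draw the unobserved points from a model
    # based on the observed data (here: normal around the observed mean).
    draws = [random.gauss(mu, sd) for _ in range(n_missing)]
    completed = observed + draws                    # step 2: one imputed set
    estimates.append(statistics.mean(completed))    # step 3: analyze it

# Step 4: pool the m separate analyses.
pooled = statistics.mean(estimates)
print(round(pooled, 2))
```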

Page 30:

To illustrate…

Here's some data:

X    Y
32   2
43   ?
56   6
25   ?
84   5

Page 31:

Oh no, we have two missing values!

Whatever shall we do?!

Page 32:

Let’s Impute Some Data!

X    Y
32   2
43   5.58
56   6
25
84   5

First, we'll use a predictive distribution of the missing values, given the observed values, to make random draws of the missing values and fill them in.

Now we have one imputed data set!

Page 33:

Let's Set That Aside…

X    Y
32   2
43   5.58
56   6
25
84   5

And Do it Again!!!!

X    Y
32   2
43   7.2
56   6
25   1.1
84   5

Page 34:

Set that aside…

Page 35:

Now we have 2 imputed data sets!!!

X    Y
32   2
43   7.2
56   6
25   1.1
84   5

X    Y
32   2
43   5.58
56   6
25
84   5

Page 36:

• Do this m number of times for m imputed data sets

Page 37:

Inference with Multiple Imputation

• Now that we have our imputed data sets, how do we make use of them? (suppose in this case m = 2)

X    Y
32   2
43   7.2
56   6
25   1.1
84   5

X    Y
32   2
43   5.58
56   6
25
84   5

Page 38:

We analyze each separately

X    Y
32   2
43   7.2
56   6
25   1.1
84   5

Slope: 4.932    S.E.: 4.287

X    Y
32   2
43   5.58
56   6
25
84   5

Slope: -0.8245    S.E.: 6.1845

Page 39:

Finally we pool the analyses together

• The pooled slope estimate is the average of the m imputed estimates

• In our example, β1p = (4.932 - 0.8245) * 0.5 = 2.0538

• The pooled slope variance is given by

T = (1/m) Σ Zi + (1 + 1/m) * (1/(m-1)) Σ (β1i - β1p)²

where Zi is the within-imputation variance (squared standard error) of the i-th imputed slope, and β1i is the i-th imputed slope

The pooled variance in this case is (4.287 + 6.1845)/2 + (3/2)*(16.569) = 30.08925

To find the standard error, take the square root, and we get 5.485
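The pooling arithmetic can be checked with a few lines; the two slopes and within-imputation variances are the values from the previous slide.

```python
import math

slopes = [4.932, -0.8245]    # slope estimate from each imputed data set
within = [4.287, 6.1845]     # within-imputation variance of each slope
m = len(slopes)

# Pooled estimate: the average of the m imputed estimates.
pooled_slope = sum(slopes) / m

# Pooled variance: mean within-imputation variance plus an inflated
# between-imputation component (Rubin's rules).
between = sum((b - pooled_slope) ** 2 for b in slopes) / (m - 1)
total_var = sum(within) / m + (1 + 1 / m) * between
pooled_se = math.sqrt(total_var)

print(round(pooled_slope, 4), round(total_var, 3), round(pooled_se, 3))
```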

Page 40:

Predicting the missing data given the observed data

• Bayes’ Theorem...

Page 41:

Imagine, then, that we establish some distribution of the parameters of interest before considering the data, P(θ), where θ is the set of parameters we are trying to estimate. This is called the prior distribution of θ.

Then, we establish a distribution P(Xobs|θ).

We can finally use Bayes' Theorem to establish P(θ|Xobs), make random draws for θ, and use these draws to make predictions of Ymiss.

Page 42:

How many imputations do we need?

• Depends on the size of the data set and the amount of missingness

• Some previous research indicated that about 5 is sufficient for efficiency of the estimates, based on the relative efficiency (1 + λ/m)^(-1), where m is the number of imputations and λ is the fraction of missing information for the term being estimated (Schafer, 1999)

• However, more recent research claims that a good imputation number is actually higher (maybe 40 or more) in order to achieve higher power (Graham et al., 2007)
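To get a feel for the efficiency formula, here are its values at a 30% fraction of missing information (λ = 0.3 is an arbitrary choice for illustration):

```python
# Relative efficiency of MI with m imputations: (1 + lambda/m)^(-1).
def relative_efficiency(m, lam):
    return 1.0 / (1.0 + lam / m)

# With lambda = 0.3, m = 5 already gives ~94% efficiency, which is why
# small m was long considered sufficient; power is another matter.
for m in (3, 5, 10, 40):
    print(m, round(relative_efficiency(m, 0.3), 3))
```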

Page 43:

General Methods for Multiple Imputation

• Regression based

• Chained Equations (MICE) or Fully Conditional Specification (FCS)

• Markov Chain Monte Carlo (MCMC)

Page 44:

• We will look at part of a data set of CEO bonuses, with other predictor variables (sales, advanced degrees, age, etc.)

Page 45:

Regression Approach in SAS

• Uses predictive mean matching, which means that the actual imputed value is one chosen randomly from a set of observed values whose predicted value is close to the predicted value of the missing observation

• Is meant to try and keep imputed values plausible

• Based on the imputation model we build, posterior random draws are made for the regression parameters

• These draws are used to construct the predicted values for the missing observation

Page 46:

What parameters?

• Suppose our imputation model is y = β1x1 + … + βkxk

• A random draw is made from the posterior predictive distribution of the parameters, and we get the randomly drawn parameters β* = (β*1, …, β*k)

• The missing value yi is predicted as β*1x1 + … + β*kxk

• Predictive mean matching is made based on this prediction
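A bare-bones sketch of the matching step. Assumptions for illustration: one predictor, fixed "drawn" parameter values, and a single nearest donor (PROC MI actually uses posterior draws and selects randomly from a small donor set).

```python
# Predictive mean matching: predict every case, then impute each missing
# case with the OBSERVED y of the donor whose prediction is closest.
observed = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]  # (x, y) pairs
missing_x = [2.2, 3.6]                                       # x with y missing

def predict(x, b0=0.0, b1=2.0):
    # b0, b1 stand in for one posterior draw of the regression parameters.
    return b0 + b1 * x

imputed = []
for x in missing_x:
    target = predict(x)
    donor = min(observed, key=lambda xy: abs(predict(xy[0]) - target))
    imputed.append(donor[1])   # the donor's observed value keeps it plausible

print(imputed)
```

Because every imputed value is an actually observed one, the imputations can never be implausible (negative heights, impossible categories), which is the point of the method.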

Page 47:

SAS example

• We will look at part of a data set of CEO bonuses, with other predictor variables (sales, advanced degrees, age, etc.)

• Since we plan to do regression on bonuses, and bonuses may have large variability as they get higher, we will take the log of bonuses before we do the imputation

Page 48:

Here's the code for the entire process:

Imputation Code:

proc mi data=bob1 out=mob seed=123 nimpute=10;
  monotone regpmm(logb=stock sales years mba mastphd age);
  var stock sales years mba mastphd age logb;
run;

Regression Code:

proc reg data=mob outest=mp covout noprint;
  model logb=stock sales years mba mastphd age;
  by _imputation_;
run;

Pooled Analysis Code:

proc mianalyze data=mp;
  modeleffects stock sales years mba mastphd age;
run;

Page 49:

Here is the output:

Variance Information

Parameter   Between      Within     Total
stock       1.2212E-05   3.5E-05    4.9E-05
sales       1.15E-12     1.02E-11   1.15E-11
years       6.388E-06    2.5E-05    3.2E-05
mba         0.001423     0.00794    0.0095
mastphd     0.001299     0.00633    0.00776
age         1.0735E-05   2.5E-05    3.7E-05

Parameter Estimates

Parameter   Estimate   Std Error   95% Confidence Limits   Theta0   t for H0: Parameter=Theta0   Pr > |t|
stock       -0.00556   0.00697     -0.0194, 0.00824        0        -0.80                        0.4267
sales       2.42E-05   3.4E-06     0.00002, 3.1E-05        0        7.16                         <.0001
years       0.017694   0.00562     0.0066, 0.02879         0        3.15                         0.0019
mba         0.014343   0.09749     -0.1774, 0.20612        0        0.15                         0.8831
mastphd     -0.00182   0.08809     -0.1753, 0.17162        0        -0.02                        0.9835
age         0.014896   0.00608     0.00281, 0.02698        0        2.45                         0.0163

Page 50:

Classification Variables

• Suppose that we want to impute a variable that takes one of two values, “male” or “female”, “smoker” or “non smoker”, “dead” or “alive”

• Or what if there are even more categories, such as dislike, like, and love?

• What if they are nominal, like chocolate, vanilla, and strawberry?

• We can hardly use continuous methods in these cases

Page 51:

We can use the “Logistic Regression Method”

• Remember that if p = the probability that y = 1, the logistic regression model can be expressed as

log(p / (1 - p)) = β0 + β1x1 + … + βkxk

• We can make random draws for β*, the estimators of β, from their posterior distribution

• Use those to calculate the estimate p = exp(β*0 + β*1x1 + … + β*kxk) / (1 + exp(β*0 + β*1x1 + … + β*kxk)), and use this to predict y for the missing case
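The prediction step can be sketched in a few lines; the parameter values below are made up purely for illustration.

```python
import math
import random

random.seed(5)

def logistic_p(x, betas, intercept):
    """p = exp(eta) / (1 + exp(eta)) for eta = intercept + sum(beta_i * x_i)."""
    eta = intercept + sum(b * xi for b, xi in zip(betas, x))
    return math.exp(eta) / (1.0 + math.exp(eta))

# One randomly drawn parameter vector beta* (made-up values) applied to
# one missing case whose covariates x are observed.
p = logistic_p([0.5, 1.2], betas=[0.8, -0.4], intercept=-0.1)
y = 1 if random.random() < p else 0   # predict the missing class by a draw

print(round(p, 3), y)
```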

Page 52:

• This method also works for ordinal data

• Can be performed sequentially in SAS on multiple variables one at a time if data is monotone missing, which means that if an observation is missing for one variable, observations are missing for all the rest of the variables for that subject

• Discriminant Function Method can be used for nominal variables

Page 53:

SAS Example

• I took the CEO data set and removed 57 values (no particular reason I chose 57)

• The following code runs the imputation:

proc mi data=bob nimpute=5 seed=231 out=lid;
  class mastphd;
  var age stock sales years mba mastphd;
  monotone logistic (mastphd=years sales stock age mba);
run;

Page 54:

And we get this…

WARNING: The maximum likelihood estimates for the logistic regression with observed observations may not exist for variable MastPHD. The posterior predictive distribution of the parameters used in the imputation process is based on the maximum likelihood estimates in the last maximum likelihood iteration.

Page 55:

The answer lies in the follies of logistic regression, as well as the redundancy of our model

Table of MastPHD by MBA (frequency, percent, row pct, col pct)

MastPHD     MBA=0     MBA=1     Total
0           372       0         372
            50.07     0.00      50.07
            100.00    0.00
            67.64     0.00
1           178       193       371
            23.96     25.98     49.93
            47.98     52.02
            32.36     100.00
Total       550       193       743
            74.02     25.98     100.00

• We have “perfect classification” in that no one without a masters/phd has an mba

• If we have perfect classification like this, then the algorithm that does logistic regression will not converge

• This is something you need to be careful about in general

Page 56:

Now that we’ve removed MBA, here’s the code

Imputation Code:

proc mi data=bob nimpute=5 seed=231 out=lid;
  class mastphd;
  var age stock sales years mastphd;
  monotone logistic (mastphd=years sales stock age);
run;

Logistic Regression Code:

proc logistic data=lid outest=rain covout noprint descending;
  class mastphd;
  model mastphd=age stock sales years;
  by _imputation_;
run;

Pooled Analysis Code:

proc mianalyze data=rain;
  modeleffects age stock sales years;
run;

Page 57:

So here are the results:

Variance Information

Parameter   Between      Within     Total      DF       Relative Increase in Variance   Fraction Missing Information   Relative Efficiency
age         7.2861E-05   0.00015    0.00023    28.185   0.604415                        0.416692                       0.923073
stock       1.3164E-05   0.00042    0.00044    3084.8   0.037355                        0.036634                       0.992727
sales       8.39E-13     5.41E-11   5.51E-11   11990    0.018605                        0.018429                       0.996328
years       1.8885E-05   0.00014    0.00016    204.21   0.162733                        0.148259                       0.971202

Parameter Estimates

Parameter   Estimate    Std Error   95% Confidence Limits   DF       Minimum       Maximum       Theta0   t for H0: Parameter=Theta0   Pr > |t|
age         -0.030592   0.01524     -0.0618, 0.00061        28.185   -0.040456     -0.02281      0        -2.01                        0.0543
stock       -0.087136   0.02095     -0.1282, -0.0461        3084.8   -0.090364     -0.082114     0        -4.16                        <.0001
sales       3.554E-06   7.4E-06     -1E-05, 0.00002         11990    0.000002689   0.000004581   0        0.48                         0.6321
years       0.005274    0.01273     -0.0198, 0.03036        204.21   0.001705      0.010075      0        0.41                         0.679

Page 58:

What about when more than one variable has missing values?

Page 59:

Multiple Imputation by Chained Equations (MICE)

1. Provides initial imputations of missing values

2. For one particular variable, removes them again

3. Builds model based on other variables, and uses posterior predictive distribution to impute random values

4. Does the same thing for another variable; only the imputed values for the first variable remain

5. Completes for all variables, repeats the process many times

6. This makes one imputed data set. The whole procedure is repeated m times for m imputed data sets
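The cycle in steps 1 through 5 can be illustrated with two variables. This is only a caricature: means of the other variable stand in for the regression models and posterior predictive draws that real MICE uses.

```python
import statistics

a = [1.0, 2.0, None, 4.0]   # variable a, one value missing
b = [None, 1.0, 3.0, 5.0]   # variable b, one value missing

def mean_observed(v):
    return statistics.mean(x for x in v if x is not None)

# Step 1: initial fill-in with the observed means.
a_imp = [x if x is not None else mean_observed(a) for x in a]
b_imp = [x if x is not None else mean_observed(b) for x in b]

for _ in range(5):   # step 5: repeat the cycle several times
    # Steps 2-3: remove and re-impute a's missing entry from current b.
    for i, x in enumerate(a):
        if x is None:
            a_imp[i] = statistics.mean(b_imp)   # toy model of a given b
    # Step 4: same for b, keeping the fresh imputations of a.
    for i, x in enumerate(b):
        if x is None:
            b_imp[i] = statistics.mean(a_imp)   # toy model of b given a

# Step 6 would wrap all of the above to produce m imputed data sets.
print([round(v, 3) for v in a_imp], [round(v, 3) for v in b_imp])
```

Note how the imputations for a and b keep feeding each other until they settle down, which is exactly the chained-equations idea.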

Page 60:

• Works well in simulations, handles many types of variables at once

• Can take a lot of time, and theoretical justification is not particularly strong

Page 61:

R Example

• This data set, "nhanes", has age group, body mass index, hypertensive status and serum cholesterol

• Body mass index and serum cholesterol are continuous, while hypertensive status (yes or no) is binary and age group is ordinal

• We will use the package “mice” and the function “mice” to complete the imputation and analysis

Page 62:

Code

• You need to install the ‘mice’ package

library(mice)
nhanes$hyp <- as.factor(nhanes$hyp)
bord <- mice(nhanes, m=40, seed=132, me=c("polr","pmm","logreg","norm"))
complete(bord, 12)
bit <- with(bord, lm(chl ~ age + bmi + hyp))
summary(pool(bit))

Page 63:

Output

                     est          se           t            df         Pr(>|t|)
(Intercept)   -39.104424    88.462185   -0.4420468    9.341691   0.66851235
age            40.287101    18.378020    2.1921350    6.268912   0.06894168
bmi             6.091045     2.610044    2.3336941   11.449700   0.03876241
hyp             5.410891    29.405394    0.1840102    8.038752   0.85856252

Body mass index is a significant predictor of cholesterol, and age nearly is, but hypertensive status is not

Page 64:

Markov Chain Monte Carlo Approach

• Here, the process gives us data estimations via Markov chains

• A Markov chain holds the property that the probability of the next link in the chain depends only on the current link

• Basically, we perform a bunch of steps, and the probability of each step depends only on the previous step

• Eventually, theory holds that under certain conditions, the steps will converge to the distribution that we are trying to estimate, called the stationary distribution

Page 65:

But, there’s a catch…

• This approach assumes multivariate normality

Page 66:

Summary

• Though handling missing data is ultimately just a nuisance necessity and not the point of the analysis, it pays to give it the consideration it is due

• Whether you use multiple imputation, single imputation, or complete case analysis depends on how much missing data you have and how big the sample is

• Having the actual data is still always better

Page 67:

Thank you!