Missing Data: Where has my data gone? Peter T. Donnan Professor of Epidemiology and Biostatistics

Jan 12, 2016


Anabel Parks
Page 1: Missing Data: Where has my data gone?

Peter T. Donnan

Professor of Epidemiology and Biostatistics

Page 2: The Tao of Missingness

“The inside and the outside are one” (Zen philosopher)

“Nothing is more real than nothing” (Samuel Beckett)

Page 3: Overview

• Why missing data matters
• Some useful definitions
• Practical issues
• Methods for imputation

Page 4: Missing data is inevitable!

• Trials or observational studies are set up to obtain complete data from everyone
• Multiple reminders for questionnaire data
• Important to distinguish valid unknown, not applicable, lost to follow-up, etc.
• It’s not missing, it’s unknown!
• Despite investigators’ best efforts, missing data is inevitable
• The key is to minimise loss of data in the first place

Page 5: Why does data go missing?

• Poor trial management, lack of follow-up
• Patients have Adverse Events (AE) and drop out
• Patients fail to attend clinic / fill in questionnaire
• Migrate with no information available (They don’t write, they don’t call!)
• Leave study for no apparent reason

Page 6: Some real examples of reasons for missing data

• “Emergency Christmas shopping” (reason for missed visit, early November)
• “The drugs will interfere with my drinking” (reason for eligible pt saying No to trial)
• “No you can’t come and see me: I’m better” (pt dropping out at V3)
• Changed address and/or phone number rendered pts untraceable (more frequent in the West)
• Two pts co-operated but refused photographs, one on religious grounds (despite giving consent)

Page 7: Does it matter?

• Missing data can seriously damage a study’s credibility
• Two main problems:
  - May introduce bias
  - Reduces power

Page 8: Note that it is even worse in regression

Columns on the slide: ID, BMI, HbA1c, LDL, Chol, HDL. Apart from BMI, the column positions of the blank cells are not preserved in the transcript, so the remaining recorded values are listed in order.

ID   BMI        Other recorded values (HbA1c, LDL, Chol, HDL)
1    35.2       9.1, 5.8, 0.8
2    26.3       7.0, 4.3, 1.1
3    (missing)  (none)
4    28.3       11.3, 5.4, 6.1, 0.7
5    (missing)  8.4, 3.9
6    40.7       10.2, 4.0
7    30.5       9.3, 2.9, 4.1, 1.0
8    26.1       3.5, 5.2

• Pairwise comparisons leave out 38%+
• So two-group comparisons not too bad
• Regression or any other multidimensional analysis leaves out 75% of data (only patients 4 and 7 have every value)
- COMPLETE-CASE ONLY ANALYSIS
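The gap between per-variable loss and complete-case loss can be checked by simple counting. The sketch below is illustrative only: it borrows the slide's values, but the placement of the blanks outside the BMI column is an assumption, and the function names are made up for this example.

```python
# Hypothetical 8-patient dataset; None marks a missing measurement.
# Values follow the slide, but where the blanks fall (apart from BMI) is assumed.
COLUMNS = ["bmi", "hba1c", "ldl", "chol", "hdl"]
rows = [
    [35.2, 9.1, None, 5.8, 0.8],
    [26.3, 7.0, None, 4.3, 1.1],
    [None, None, None, None, None],
    [28.3, 11.3, 5.4, 6.1, 0.7],
    [None, 8.4, None, 3.9, None],
    [40.7, 10.2, None, 4.0, None],
    [30.5, 9.3, 2.9, 4.1, 1.0],
    [26.1, None, 3.5, 5.2, None],
]


def fraction_missing_per_column(data):
    """Share of patients lacking each single variable."""
    n = len(data)
    return [sum(r[j] is None for r in data) / n for j in range(len(data[0]))]


def fraction_incomplete_rows(data):
    """Share of patients dropped when every variable is required at once."""
    return sum(any(v is None for v in r) for r in data) / len(data)


per_column = dict(zip(COLUMNS, fraction_missing_per_column(rows)))
excluded = fraction_incomplete_rows(rows)

print(per_column)
print(f"Excluded by complete-case analysis: {excluded:.0%}")
```

Any single variable loses fewer patients than the joint requirement does, which is why a multivariable model is hit hardest.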

Page 9: Practical Tip 1

• Complete-case analysis is where the missing data problem is ignored
• Patients with missing data are excluded
• This will be obvious from the constructed tables
• The n in the tables reporting the analysis will be less than the N enrolled
• Even worse, the dataset used may differ by outcome as n may change

Page 10: Practical Tip 1

• A useful and informative procedure is to create a table comparing the characteristics of the complete-case dataset and those missing, e.g.

Factor     Complete cases   Missing at 8 weeks
Mean age   32               50
Mean BMI   19               28
% Male     50%              65%
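A comparison table of this kind can be produced by splitting patients on whether the outcome was recorded. The following is a minimal sketch under stated assumptions: the patient records, the field names (`age`, `bmi`, `male`, `outcome_8wk`) and the `summarise` helper are all hypothetical.

```python
from statistics import mean

# Hypothetical baseline records; outcome_8wk is None when the 8-week value is missing.
patients = [
    {"age": 30, "bmi": 18.5, "male": True, "outcome_8wk": 12.0},
    {"age": 34, "bmi": 19.5, "male": False, "outcome_8wk": 15.0},
    {"age": 48, "bmi": 27.0, "male": True, "outcome_8wk": None},
    {"age": 52, "bmi": 29.0, "male": True, "outcome_8wk": None},
    {"age": 50, "bmi": 28.0, "male": False, "outcome_8wk": None},
]


def summarise(group):
    """Baseline characteristics for one group of patients."""
    return {
        "n": len(group),
        "mean_age": mean(p["age"] for p in group),
        "mean_bmi": mean(p["bmi"] for p in group),
        "pct_male": 100 * mean(1 if p["male"] else 0 for p in group),
    }


complete = [p for p in patients if p["outcome_8wk"] is not None]
missing = [p for p in patients if p["outcome_8wk"] is None]
summary = {"complete": summarise(complete), "missing": summarise(missing)}

# One row per factor, one column per group.
for factor in ("n", "mean_age", "mean_bmi", "pct_male"):
    print(f"{factor:10s} {summary['complete'][factor]:8.1f} {summary['missing'][factor]:8.1f}")
```

Large differences between the two columns suggest that those with missing outcomes are not a random subset of those enrolled.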

Page 11: One Solution? – Missing-indicator method

• Code all missing as unknown and include unknown category in regression model (Mea culpa!)
• Advantage that no subject excluded
• Difficult to interpret
• Does not deal with main issue of potential BIAS
• In fact, it will add bias...
• Fudge rather than solution

Page 12: Example: Unknown stage (n=40/476) in Cox PH model for colorectal cancer

Variables in the Equation

            B       SE     Wald      df   Sig.   Exp(B)
age         .022    .007   11.000    1    .001   1.022
sexnum      .031    .122   .064      1    .800   1.031
dukes                      114.441   4    .000
dukes(1)    .183    .443   .170      1    .680   1.200
dukes(2)    .882    .423   4.344     1    .037   2.415
dukes(3)    1.961   .423   21.461    1    .000   7.106
dukes(4)    1.427   .448   10.144    1    .001   4.166
cscore      .033    .020   2.815     1    .093   1.034
hyperco     .369    .136   7.386     1    .007   1.446

Slide annotation: HR Unknown Stage vs. Stage A

N.b. Effects of known stages are now biased

Page 13: Imputation: Another Solution

• Impute missing values and then carry out analysis with complete dataset
• Advantage that no subject excluded
• Many methods of estimating the missing values:
  1. LVCF (LOCF): Last Value Carried Forward
  2. Mean or median value of measurements
  3. Expected value based on regression
  4. Expected value based on E-M algorithm

Page 14: Some notation

• $Y_{\text{obs}}$ – observed data
• $Y_{\text{miss}}$ – missing data
• R – missing data indicator: R = 1 indicates data observed, R = 0 missing
• $\text{Prob}[R = 0 \mid Y_{\text{obs}}]$ – probability of missing data given values of observed data
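The indicator R is easy to derive from a dataset once missing entries are marked. This is a small sketch, assuming missing responses are stored as `None`; the response values and the `response_indicators` name are hypothetical.

```python
# None marks a missing response; the values are hypothetical.
responses = [
    [5.1, 4.8, 4.9, 5.0],
    [6.2, None, 6.0, 5.9],
    [4.4, 4.6, None, None],
]


def response_indicators(rows):
    """Return R with R = 1 where a value is observed and R = 0 where it is missing."""
    return [[0 if v is None else 1 for v in row] for row in rows]


R = response_indicators(responses)
for row in R:
    print(row)
```

Each row of R is that subject's missing-data pattern, which is what the definitions of the missingness mechanisms condition on.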

Page 15: Some very difficult, opaque, but essential definitions (1)

Missing Completely at Random (MCAR)

• Prob(Missing) is independent of both: 1) observed data and 2) unobserved data
• Essentially, observed data is a random sample of full data
• MCAR is what everyone falsely assumes!
• If MCAR is assumed, observed-case or complete-case analysis is valid.
• Observed-case analysis is software default!

Page 16: Representation of R as a stratification factor for responses

Response indicators      Response vector
R1  R2  R3  R4           Y1  Y2  Y3  Y4
1   1   1   1            y   y   y   y
1   0   1   1            y   *   y   y
1   1   0   0            y   y   *   *

For MCAR:

$\text{Prob}[R = 0 \mid Y_{\text{obs}}, Y_{\text{miss}}, X] = \text{Prob}[R = 0 \mid X]$

Page 17: Possible to test for MCAR

Park-Lee* test for MCAR

• Within framework of GEE (Liang and Zeger)
• Define indicator variables for each missing data pattern
• Fit model with indicators as covariates
• Test regression coefficients for indicators; if significant, the missing data mechanism is not MCAR

*Park T and Lee S-Y. A test of missing completely at random for longitudinal data with missing observations. Statist Med 1997; 16: 1859-1871

Page 18: Example of Park-Lee* test for MCAR

Missing data pattern   Wave 1   Wave 2   Wave 3
0                      O        O        O
1                      O        M        O
2                      O        O        M
3                      O        M        M

(O = observed, M = missing)

Fit three indicator variables:

$I_k = 1$ if missing pattern k, $= 0$ otherwise

Covariate   Est/SE
I1          0.65
I2          2.03*
I3          3.51*

For overall test p = 0.0023
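The indicator variables can be built directly from each subject's observed/missing pattern. The sketch below covers only that data-preparation step, with hypothetical subjects and a made-up `pattern_indicators` helper; fitting the GEE model and testing the coefficients are not shown.

```python
# Wave-level status per subject: "O" observed, "M" missing.
# Pattern numbering follows the table above; the subjects are hypothetical.
PATTERNS = {
    ("O", "O", "O"): 0,
    ("O", "M", "O"): 1,
    ("O", "O", "M"): 2,
    ("O", "M", "M"): 3,
}


def pattern_indicators(status):
    """Return (I1, I2, I3) for one subject; pattern 0 is the reference (all zeros)."""
    k = PATTERNS[tuple(status)]
    return tuple(1 if k == j else 0 for j in (1, 2, 3))


subjects = [
    ["O", "O", "O"],
    ["O", "M", "O"],
    ["O", "O", "M"],
    ["O", "M", "M"],
]
indicators = [pattern_indicators(s) for s in subjects]
for status, ind in zip(subjects, indicators):
    print(status, ind)
```

These three columns would then enter the regression as covariates alongside the model's other terms.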

Page 19: Examples: MCAR

• Six Cities Air Pollution Study – children changed schools because of parents, so unrelated to health of children
• In a trial, a practice changed computer system, so missing observations were not related to previously observed or future values

Page 20: Practical Tip 2

• Check data for MCAR (note SPSS carries out Little’s test)
• If assumption seems reasonable, analyse using complete-case only with impunity
• If missing data constitutes < 5%, probably reasonable to assume MCAR
• If not, complete-case analysis is likely to be biased
• N.b. MCAR not that common

Page 21: Another essential definition

Missing At Random (MAR)

• Prob(Missing) is 1) independent of unobserved data but 2) dependent on observed data
• Essentially, observed data is a random sample of full data in each stratum
• MAR is a weaker version of the MCAR assumption
• If MAR is assumed, many methods are possible to impute data using observed data.

Page 22: Missing At Random (MAR)

• Prob(missing) depends on $Y_{\text{obs}}$ but not on $Y_{\text{miss}}$
• $\text{Prob}[R = 0 \mid Y_{\text{obs}}, Y_{\text{miss}}, X] = \text{Prob}[R = 0 \mid Y_{\text{obs}}, X]$
• MCAR is a special case of MAR
• Use fact that missing Y for a person with same age, gender, BP, chol, BMI, etc. will be similar to a person with same characteristics who does have outcome
• Allows imputation methods based on observed data e.g. mean, regression
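A small simulation can make the distinction concrete. It is a sketch under assumptions that are not in the slides: a made-up population in which the outcome depends on sex, a 30% missingness rate under MCAR, and 60% versus 10% missingness for men and women under MAR.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 20000

# Hypothetical population: the outcome depends on sex, a fully observed covariate.
people = []
for _ in range(n):
    male = rng.random() < 0.5
    y = 10.0 + (5.0 if male else 0.0) + rng.gauss(0.0, 1.0)
    people.append((male, y))


def mean(values):
    values = list(values)
    return sum(values) / len(values)


full_mean = mean(y for _, y in people)

# MCAR: every outcome has the same 30% chance of being missing.
mcar_observed = [(m, y) for m, y in people if rng.random() >= 0.3]

# MAR: missingness depends only on observed sex (men 60%, women 10%).
mar_observed = [(m, y) for m, y in people if rng.random() >= (0.6 if m else 0.1)]

mcar_cc = mean(y for _, y in mcar_observed)  # complete-case mean under MCAR
mar_cc = mean(y for _, y in mar_observed)    # complete-case mean under MAR

# Under MAR the observed data are a random sample within each stratum, so
# stratum means weighted by the full-sample sex split recover the overall mean.
p_male = mean(1.0 if m else 0.0 for m, _ in people)
mar_stratified = (
    p_male * mean(y for m, y in mar_observed if m)
    + (1 - p_male) * mean(y for m, y in mar_observed if not m)
)

print(f"full {full_mean:.2f}  MCAR cc {mcar_cc:.2f}  "
      f"MAR cc {mar_cc:.2f}  MAR stratified {mar_stratified:.2f}")
```

In this setup the complete-case mean stays close to the truth under MCAR but is pulled down under MAR, because men, who have higher outcomes here, are under-represented among the observed. Using the observed covariate removes most of that bias.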

Page 23: Examples: MAR

• Six Cities Air Pollution Study – children moved out of area because of non-respiratory problems (e.g. type 1 diabetes)
• Men less likely to attend for follow-up visit, but this is not related to the values of their likely outcomes
• Repeated measures where missingness is not related to the values that would have been obtained

Page 24: Single Imputation

• Most common approach is to use the mean of the observed values to impute the missing ones
• Takes no account of differences related to other factors, e.g. HbA1c
• Takes no account of uncertainty in estimating missing value
• Makes clinicians uneasy!

Columns on the slide: ID, BMI, HbA1c, LDL, Chol, HDL. The two missing BMI values are filled with the mean of the six observed BMI values, 31.2 (shown in brackets). Apart from BMI, the column positions of the blank cells are not preserved in the transcript, so the remaining recorded values are listed in order.

ID   BMI      Other recorded values (HbA1c, LDL, Chol, HDL)
1    35.2     9.1, 5.8, 0.8
2    26.3     7.0, 4.3, 1.1
3    [31.2]   (none)
4    28.3     11.3, 5.4, 6.1, 0.7
5    [31.2]   8.4, 3.9
6    40.7     10.2, 4.0
7    30.5     9.3, 2.9, 4.1, 1.0
8    26.1     3.5, 5.2
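The imputed BMI of 31.2 can be reproduced from the six observed values in the table above. The sketch also shows one side effect of mean imputation, using only the standard library; the variable names are illustrative.

```python
from statistics import mean, variance

# BMI column from the table above; None marks the two missing values.
bmi = [35.2, 26.3, None, 28.3, None, 40.7, 30.5, 26.1]

observed = [v for v in bmi if v is not None]
fill = mean(observed)  # one value used for every missing entry
imputed = [fill if v is None else v for v in bmi]

print(round(fill, 1))
# Filling with the mean adds points with zero deviation, so the sample
# variance of the completed column is smaller than that of the observed values.
print(variance(imputed) < variance(observed))
```

The shrunken variance is one way to see that a single imputed value takes no account of the uncertainty in estimating it.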

Page 25: Single Imputation

• Common method in longitudinal data
• Last Value Carried Forward (LVCF or LOCF)
• Common in RCTs
• Some journals and even FDA endorse
• But statistically unsound unless strong and unrealistic assumptions are met (see LSHTM website)

Columns on the slide: ID, Baseline, 4 weeks, 8 weeks. Values are listed in the order extracted; for IDs 3 and 8 the position of the blank cell is not preserved.

ID   Values (Baseline, 4 weeks, 8 weeks)
1    15, 13, 13
2    29, 32, 32
3    43, 43
4    32, 29, 25
5    19, 36, 26
6    10, 10, 13
7    31, 25, 20
8    19, 18

Carried-forward values overlaid on the slide: 43, 19
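The carry-forward rule itself is only a few lines. This is a minimal sketch with hypothetical patients and a made-up `locf` helper, assuming missed visits are stored as `None`.

```python
def locf(series):
    """Fill each missing visit (None) with the most recent observed value.

    Leading missing values stay None because there is nothing to carry forward.
    """
    filled, last = [], None
    for value in series:
        if value is not None:
            last = value
        filled.append(last)
    return filled


# Hypothetical (baseline, 4 weeks, 8 weeks) scores; None marks a missed visit.
visits = {
    "A": [15, 13, None],
    "B": [43, None, 43],
    "C": [32, 29, 25],
}
completed = {pid: locf(series) for pid, series in visits.items()}
for pid, series in completed.items():
    print(pid, series)
```

The filled value is treated as if the patient had not changed since the last visit, which is the strong assumption the slide warns about.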

Page 26: Examples: Single Imputation

• Last Value Carried Forward (LVCF or LOCF) very common in RCTs
• Adalimumab in severe Crohn’s disease: nearly 50% of patients were lost to follow-up at 52 weeks in one trial and LVCF used (but relapsing-remitting condition!)
• But legitimate use in Bell’s Palsy Trial!
• No disagreement among statisticians that method is unsound

Page 27: Solution is Multiple Imputation!

1. Assumes data MAR
2. Missing data filled in m times
3. The m complete datasets are each analysed by using standard procedures
4. The results for the m complete datasets are combined for inference

The slide shows three example tables with columns ID, Baseline, 4 weeks, 8 weeks. Values are listed in the order extracted; the positions of blank cells in row 3 are not preserved.

Table 1
ID   Values (Baseline, 4 weeks, 8 weeks)
1    15, 13, 12
2    29, 32, 30
3    35, 44
4    32, 29, 25
5    19, 36, 26
Values overlaid on the slide: 36, 19

Table 2
ID   Values (Baseline, 4 weeks, 8 weeks)
1    15, 13, 13
2    29, 32, 32
3    43, 43
4    32, 29, 25
5    19, 36, 26

Table 3
ID   Values (Baseline, 4 weeks, 8 weeks)
1    15, 13, 16
2    29, 32, 25
3    39, 28, 40
4    32, 29, 25
5    19, 36, 26
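Step 4 is commonly carried out with Rubin's combining rules. The slide does not name them, so applying them here is an assumption. The sketch pools m point estimates and their standard errors; the numbers and the `pool` function name are hypothetical.

```python
from statistics import mean, variance


def pool(estimates, variances):
    """Combine m point estimates and their squared standard errors (Rubin's rules)."""
    m = len(estimates)
    q_bar = mean(estimates)        # pooled point estimate
    within = mean(variances)       # average within-imputation variance
    between = variance(estimates)  # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return q_bar, total


# Hypothetical results from m = 3 imputed datasets: an effect estimate and its SE.
estimates = [2.0, 2.4, 2.2]
std_errors = [0.5, 0.5, 0.5]

pooled, total_var = pool(estimates, [se ** 2 for se in std_errors])
print(round(pooled, 3), round(total_var ** 0.5, 3))
```

The pooled standard error is larger than any single-dataset standard error because the between-imputation term carries the uncertainty about the missing values.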

Page 28: Multiple Imputation (MI)

• Process derived by Donald Rubin (1987)
• Replace each missing value with a set of plausible values that also represents the uncertainty about the correct value
• Requires MAR assumption but NOT MCAR
• Many methods of estimating imputed values: 1) regression, 2) propensity score, 3) MCMC

Page 29: Missing Data: Where has my data gone? Peter T. Donnan Professor of Epidemiology and Biostatistics.

Step 1: Multiple Imputation Methods

1) Regression: missing values predicted by a regression model of previous values and covariates

• Fit model Xβ using any variables available (previous values and covariates)

• Repeat if further follow-up results missing

• Extract predicted value and save new dataset with predicted value inserted
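The regression method can be sketched in a few lines of Python. This is a minimal illustration rather than the procedure used by any particular package: the function name `regression_impute` is hypothetical, and adding a random residual draw to each predicted value is an assumption made here so that the m datasets differ from one another. A full implementation would also draw the regression coefficients from their posterior distribution.

```python
import numpy as np

def regression_impute(X, y, m=5, seed=0):
    """Return m completed copies of y (hypothetical helper).

    Missing entries of y (NaN) are replaced by the regression prediction
    from X plus a random residual draw (an assumption of this sketch).
    """
    rng = np.random.default_rng(seed)
    X1 = np.column_stack([np.ones(len(y)), X])   # add intercept column
    obs = ~np.isnan(y)
    # Fit the model X*beta on the cases where y is observed.
    beta, *_ = np.linalg.lstsq(X1[obs], y[obs], rcond=None)
    resid = y[obs] - X1[obs] @ beta
    sigma = resid.std(ddof=X1.shape[1])          # residual standard deviation
    completed = []
    for _ in range(m):
        y_new = y.copy()
        # Predicted value plus noise, inserted only where y was missing.
        noise = rng.normal(0.0, sigma, int((~obs).sum()))
        y_new[~obs] = X1[~obs] @ beta + noise
        completed.append(y_new)
    return completed
```

Here `X` would hold the previous values and covariates and `y` the follow-up measurement with gaps. For several follow-up times the function would be applied to each time point in turn.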

Page 30: Missing Data: Where has my data gone? Peter T. Donnan Professor of Epidemiology and Biostatistics.

Missingness Model

• How do I choose what factors to use in predicting imputed values?

• All factors related to outcome (i.e. all Xs)

• Plus importantly the outcome

• Any other factors possibly related to the reason for being missing

• Better to be overly inclusive and statistical significance not important

Page 31: Missing Data: Where has my data gone? Peter T. Donnan Professor of Epidemiology and Biostatistics.

Multiple Imputation: A Cautionary Tale

• Hippisley-Cox et al, BMJ 2007 developed a risk algorithm for CVD called QRISK

• 70% of cholesterol values were missing and imputed using MI assuming data MAR

• Found NO association between CVD and cholesterol

• Investigation showed they had not used CVD outcome in the imputation model

• When rectified ‘true’ association found!
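A toy simulation can show why leaving the outcome out of the imputation model pulls an association towards zero. Everything below is simulated and hypothetical: it does not use QRISK data, it uses a continuous outcome and a single stochastic imputation for brevity, and the true slope of 0.5 is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(42)
n, true_slope = 20000, 0.5
x = rng.normal(size=n)                      # covariate with missing values
y = true_slope * x + rng.normal(size=n)     # outcome, fully observed
missing = rng.random(n) < 0.7               # 70% of x missing
obs = ~missing

def slope(xc, yc):
    """Least-squares slope of yc regressed on xc."""
    return np.cov(xc, yc)[0, 1] / np.var(xc, ddof=1)

# (a) Imputation model WITHOUT the outcome: draw from the observed x values'
#     fitted normal distribution, ignoring y entirely.
x_a = x.copy()
x_a[missing] = rng.normal(x[obs].mean(), x[obs].std(ddof=1), missing.sum())

# (b) Imputation model WITH the outcome: regress x on y, then add noise.
b1 = slope(y[obs], x[obs])                  # slope of x on y
b0 = x[obs].mean() - b1 * y[obs].mean()
resid_sd = np.std(x[obs] - (b0 + b1 * y[obs]), ddof=2)
x_b = x.copy()
x_b[missing] = b0 + b1 * y[missing] + rng.normal(0.0, resid_sd, missing.sum())

slope_without = slope(x_a, y)
slope_with = slope(x_b, y)
print(f"without outcome: {slope_without:.2f}, with outcome: {slope_with:.2f}")
```

In version (a) the imputed values carry no information about the outcome, so the fitted slope is diluted roughly in proportion to the fraction imputed. Version (b) preserves the relationship because the outcome informs each imputed value.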

Page 32: Missing Data: Where has my data gone? Peter T. Donnan Professor of Epidemiology and Biostatistics.

Step 1: Multiple Imputation Methods

2) Propensity score:

• Create indicator variable R = 0 for missing

• Fit logistic model Xβ of propensity to be missing (R = 0)

• Divide observations by quintiles of propensity score

• Allow random draws (approximate Bayesian bootstrap) of values from observed data in matching quintile to fill in missing data
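The propensity score steps could be sketched as follows. The function names are hypothetical, the logistic model is fitted with a plain Newton-Raphson loop to keep the example dependency-free, and the two-stage resampling is one common reading of the approximate Bayesian bootstrap. It produces a single completed dataset; repeating with different seeds would give m datasets.

```python
import numpy as np

def fit_logistic(X, r, n_iter=25):
    """Logistic regression by Newton-Raphson (hypothetical helper)."""
    X1 = np.column_stack([np.ones(len(r)), X])   # add intercept column
    beta = np.zeros(X1.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X1 @ beta))
        grad = X1.T @ (r - p)
        hess = X1.T @ (X1 * (p * (1.0 - p))[:, None])
        beta += np.linalg.solve(hess, grad)
    return beta, X1

def propensity_impute(X, y, seed=0):
    """Fill NaNs in y with draws from observed values in the same quintile."""
    rng = np.random.default_rng(seed)
    miss = np.isnan(y)
    # Propensity to be missing, modelled from the covariates X.
    beta, X1 = fit_logistic(X, miss.astype(float))
    score = 1.0 / (1.0 + np.exp(-X1 @ beta))
    # Assign each observation to a quintile group 0..4 of the score.
    cuts = np.quantile(score, [0.2, 0.4, 0.6, 0.8])
    group = np.searchsorted(cuts, score)
    y_new = y.copy()
    for g in range(5):
        in_g = group == g
        donors = y[in_g & ~miss]
        n_mis = int((in_g & miss).sum())
        if n_mis == 0 or len(donors) == 0:
            continue                              # nothing to fill or no donors
        # Approximate Bayesian bootstrap: resample the donors with
        # replacement, then draw the imputed values from that resample.
        pool = rng.choice(donors, size=len(donors), replace=True)
        y_new[in_g & miss] = rng.choice(pool, size=n_mis, replace=True)
    return y_new
```

Every imputed value is an actually observed value from a subject with a similar estimated propensity to be missing, so the method never produces values outside the observed range.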

Page 33: Missing Data: Where has my data gone? Peter T. Donnan Professor of Epidemiology and Biostatistics.

Step 1: Multiple Imputation Methods

3) Markov chain Monte Carlo (MCMC):

• Imputation step draws from conditional distribution of Y_miss | Y_obs

• Posterior step simulates posterior mean and covariance matrix

• New estimates used iteratively in imputation step

• Process converges (hopefully)

• Incorporates EM algorithm

Page 34: Missing Data: Where has my data gone? Peter T. Donnan Professor of Epidemiology and Biostatistics.

Step 1: Multiple Imputation

• All available in PROC MI in SAS software, which creates m datasets

• Now available in SPSS v. 17

• Note SPSS carries out Little’s test for MCAR

• S-plus: some functions

• Stata has full set of programs for MI

Page 35: Missing Data: Where has my data gone? Peter T. Donnan Professor of Epidemiology and Biostatistics.

How many (m) datasets do I need?

Too many leads to a data management problem.

Relative efficiency of using a finite number m of imputations is given by

$RE = (1 + \lambda / m)^{-1}$

where λ is the fraction of missing information.

Relative efficiency (RE) by number of imputations m and fraction of missing information λ:

m     λ=10%   20%    30%    50%    70%
3     0.97    0.94   0.91   0.86   0.81
5     0.98    0.96   0.94   0.91   0.88
10    0.99    0.98   0.97   0.95   0.93
20    0.99    0.99   0.98   0.98   0.97
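The relative efficiency formula is easy to evaluate directly, which helps when choosing m for a fraction of missing information not shown on the slide. The function name below is hypothetical; the values it prints should agree with the tabulated ones up to rounding in the second decimal place.

```python
def relative_efficiency(lam, m):
    """RE = (1 + lam / m)^(-1); lam is the fraction of missing information."""
    return 1.0 / (1.0 + lam / m)

# Reproduce the grid of m and lambda values used on the slide.
for m in (3, 5, 10, 20):
    row = [relative_efficiency(lam, m) for lam in (0.1, 0.2, 0.3, 0.5, 0.7)]
    print(m, " ".join(f"{val:.3f}" for val in row))
```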

Page 36: Missing Data: Where has my data gone? Peter T. Donnan Professor of Epidemiology and Biostatistics.

Step 2: Multiple Imputation

• Analyse the now complete datasets in standard way

• T-test, Regression, Survival, Logistic, GLM, Mixed model, etc…

• Creates a set of parameter estimates for each of m datasets

Page 37: Missing Data: Where has my data gone? Peter T. Donnan Professor of Epidemiology and Biostatistics.

Step 3: Multiple Imputation

• Combine results from m datasets

• Standard way is to calculate the mean and variance of the parameter estimates

$\bar{Q} = \frac{1}{m} \sum_{i=1}^{m} Q_i$

Let Ū be the within-imputation variance and B the between-imputation variance; then the total variance T is

$T = \bar{U} + \left(1 + \frac{1}{m}\right) B$
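The combining step is short enough to write out by hand. The sketch below pools m estimates of a single parameter: the pooled estimate is the mean of the m estimates, and the total variance is the mean within-imputation variance plus (1 + 1/m) times the between-imputation variance. The function name is hypothetical, and the use of the sample variance with divisor m - 1 for B follows the usual statement of Rubin's rules rather than anything spelled out on the slide.

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Pool m point estimates and their variances (hypothetical helper)."""
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = len(q)
    q_bar = q.mean()                     # pooled point estimate
    u_bar = u.mean()                     # within-imputation variance
    b = q.var(ddof=1)                    # between-imputation variance
    t = u_bar + (1.0 + 1.0 / m) * b      # total variance
    return q_bar, t
```

For example, `pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])` gives a pooled estimate of 2.0 with total variance 0.5 + (4/3) × 1.0.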

Page 38: Missing Data: Where has my data gone? Peter T. Donnan Professor of Epidemiology and Biostatistics.

Step 3: Multiple Imputation

• Relatively easy, but fortunately SAS has a procedure to implement this called PROC MIANALYZE

• Good documentation

• SPSS now does this step in v.17!

• MI now considered gold standard methodology for drawing valid inferences in the face of missing data (with MAR)

• Still many people wary

Page 39: Missing Data: Where has my data gone? Peter T. Donnan Professor of Epidemiology and Biostatistics.

Alternative Solution: Weighting

• Weight observed data to take account of under-representation of certain response profiles

• Does not involve imputation but assumes MAR

• First proposed in sample survey literature

• Relatively easy as most standard programs allow addition of weighting factor

• Requires weight w_i and then complete case analysis weighted by 1/w_i

Page 40: Missing Data: Where has my data gone? Peter T. Donnan Professor of Epidemiology and Biostatistics.

Alternative Solution: Weighting

• Estimate w_i = Pr[R = 1 | Y_obs, X], the probability of being observed (R = 0 for missing)

• Repeat for multiple time points

• Analyse complete cases weighted by 1/w_i

• Example GEE with MAR

• Intuitively appealing, as observed cases that resemble those with missing data are given more weight
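A minimal sketch of the weighting idea, assuming the usual inverse probability weighting convention: each complete case is weighted by one over its estimated probability of being observed. The function names are hypothetical, the probability model uses only the covariates X, and the target is a simple mean rather than a GEE, purely to keep the example short.

```python
import numpy as np

def fit_logistic(X, r, n_iter=25):
    """Logistic regression by Newton-Raphson (hypothetical helper)."""
    X1 = np.column_stack([np.ones(len(r)), X])   # add intercept column
    beta = np.zeros(X1.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X1 @ beta))
        grad = X1.T @ (r - p)
        hess = X1.T @ (X1 * (p * (1.0 - p))[:, None])
        beta += np.linalg.solve(hess, grad)
    return beta, X1

def ipw_mean(X, y):
    """Weighted complete-case mean of y, with NaN marking missing values."""
    observed = ~np.isnan(y)                      # R = 1 observed, R = 0 missing
    beta, X1 = fit_logistic(X, observed.astype(float))
    p_obs = 1.0 / (1.0 + np.exp(-X1 @ beta))     # estimated Pr(observed | X)
    w = 1.0 / p_obs[observed]                    # inverse probability weights
    return np.sum(w * y[observed]) / np.sum(w)
```

Complete cases that look like the subjects who tend to drop out have a small estimated probability of being observed, so they receive a large weight and stand in for the missing subjects.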

Page 41: Missing Data: Where has my data gone? Peter T. Donnan Professor of Epidemiology and Biostatistics.

Practical Tip 3

• If we assume MAR, method of MI provides means of valid inference

• Comprehensive software in SAS and now SPSS

• Other software incorporate as standard (Stata)

• Consider weighting method as intuitively appealing

Page 42: Missing Data: Where has my data gone? Peter T. Donnan Professor of Epidemiology and Biostatistics.

Another essential definition

Missing Not At Random (MNAR)

• Prob (Missing) is dependent on both: 1) unobserved data and 2) observed data

• Often referred to as nonignorable missing mechanism or informative missingness

• MNAR is completely unverifiable from the data

• Need to assess the sensitivity of results to different plausible explanations

• All standard methods are NOT valid

• Ongoing area of research in statistical methods

Page 43: Missing Data: Where has my data gone? Peter T. Donnan Professor of Epidemiology and Biostatistics.

Examples: MNAR

• QOL missing in those with low quality of life, so missingness is related to what the QOL would have been

• Measurement of weight loss more likely to be missing if weight loss is likely to be low

Page 44: Missing Data: Where has my data gone? Peter T. Donnan Professor of Epidemiology and Biostatistics.

Missing Not At Random (MNAR)

One method uses Structural Equation Modelling (SEM)

• Requires specialist software

• Often referred to as nonignorable missing mechanism or informative missingness

• MNAR is completely unverifiable from the data

• Need to assess the sensitivity of results to different plausible explanations

• All standard methods are NOT valid

• Ongoing area of research in statistical methods

Page 45: Missing Data: Where has my data gone? Peter T. Donnan Professor of Epidemiology and Biostatistics.

Summary

• Consider hierarchy of missing data

• MCAR, MAR, MNAR

• Ideal is to use MI if MAR

• or Weighting methods if MAR

• Tools now in SPSS

• Need to model missingness mechanism jointly with analysis of outcome if MNAR

• Complete case analysis needs to be justified!

• LVCF needs to be justified!

Page 46: Missing Data: Where has my data gone? Peter T. Donnan Professor of Epidemiology and Biostatistics.

Summary

“…it is time to place CC analysis and simple imputation methods, in particular LOCF, in the Museum of Statistical Science…”

Geert Molenberghs, Editorial, JRSS A, 2007: 861-863

Page 47: Missing Data: Where has my data gone? Peter T. Donnan Professor of Epidemiology and Biostatistics.

References

• LSHTM website on missing data, sponsored by ESRC (www.lshtm.ac.uk/missingdata/start.html)

• Donders AR, van der Heijden GJ, Stijnen T, Moons KG. Review: a gentle introduction to imputation of missing values. J Clin Epidemiol 2006; 59: 1087-91.

• Sterne JAC, White IR, Carlin JB, Spratt M, Royston P, Kenward MG, Wood AM, Carpenter JR. Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls. BMJ 2009; 338: b2393.

• Hippisley-Cox J, Coupland C, Vinogradova Y, Robson J, May M, Brindle P. Derivation and validation of QRISK, a new cardiovascular disease risk score for the United Kingdom: prospective open cohort study. BMJ 2007; 335: 136.

• Little, Roderick JA and Rubin, Donald B. (1987). Statistical Analysis with Missing Data. John Wiley and Sons, New York.

Page 48: Missing Data: Where has my data gone? Peter T. Donnan Professor of Epidemiology and Biostatistics.

References

• Dempster AP, Laird NM and Rubin DB. Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society 1977; Ser. B, 39: 1-38.

• Rubin DB. (1987). Multiple imputation for nonresponse in surveys. John Wiley & Sons, New York.

• Yuan YC. Multiple imputation for missing data: concepts and new development. SAS Institute Inc (P267-25).

• Software Documentation for SAS®, S-PLUS® and SPSS®.

• R Development Core Team (2005). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.

Page 49: Missing Data: Where has my data gone? Peter T. Donnan Professor of Epidemiology and Biostatistics.

The Tao of Missingness

“There are known knowns. These are things we know that we know. There are known unknowns. That is to say, there are things that we know we don't know. But there are also unknown unknowns. There are things we don't know we don't know.”

Donald Rumsfeld