Bayesian modeling of nonsampling error Alan M. Zaslavsky Harvard Medical School

Dec 27, 2015

Page 1: Bayesian modeling of nonsampling error Alan M. Zaslavsky Harvard Medical School.

Bayesian modeling of nonsampling error

Alan M. Zaslavsky

Harvard Medical School

Page 2:

General setup for nonsampling error

• Focus on measurement error problem
  – Item responses with error
  – Item or unit nonresponse as a special error response
  – …or nonresponse as part of error for aggregates
• Y = data measured with error
• Y* = latent “true” values (object of inference)
  – Might be observed for part of data (calibration)
• X = covariates
  – Assumed (for presentation) correct and complete
  – Include design information

Page 3:

Objective of inference

• Estimate statistics of “true” values f(Y*)
• Estimate parameters θ of models
  – From likelihood standpoint: inference from L(θ | Y*,X)
  – (Specifically) from Bayesian standpoint, draw from P(θ | Y*,X)
• Both possible if we have draws of Y*
  – Multiple imputation for valid inferences

Page 4:

Two ways to factorize distribution

• Predictive factorization:
  P(Y,Y* | X,φ,φ*) = P(Y* | Y,X,φ*) P(Y | X,φ)
  – Direct prediction of Y* for imputation
• “Scientific” factorization:
  P(Y,Y* | X,ξ,ξ*) = P(Y | Y*,X,ξ) P(Y* | X,ξ*)
  – First factor is observation (measurement error) model
  – Second factor is model for true relationships

Page 5:

More on “scientific” factorization

• Separates two distinct processes
  – Information might be from different sources
  – Possibility of more (or different) generalizability
• Models are more interpretable
  – Incorporate prior information for specification and parameters
  – Easier to assess “congeniality” of models?
    • Compare model for P(Y* | X,ξ*) with model involving θ
  – Simplifications? e.g. P(Y | Y*,X,ξ) = P(Y | Y*,ξ)

Page 6:

Inference with “scientific” factorization

• Computations via Gibbs sampler
  – Imputation of Y* by Bayes’s theorem
  – Complete-data inferences for ξ*
• Inferences of scientific interest (θ)
  – Multiple imputation inference using Y*
  – Direct from model if θ = θ(ξ*)

Page 7:

Possible sources for measurement error model parameters (ξ)

• Calibration study
  – Sample of (Y,Y*) pairs to identify the two parameters
  – For robustness, important to build in adequate flexibility to avoid identifying off unverified model assumptions about P(Y | X,ξ,ξ*)
• Prior studies (also used Bayesianly as prior)
  – Previous calibration model estimates, if measurement process is consistent
  – Synthesis of accumulated survey methodology

Page 8:

Example 1: Correction for underreporting in study of chemotherapy for colorectal cancer

• Provision of guideline-recommended adjuvant chemotherapy a critical issue in quality of care for cancer
• Cancer registries as a source of chemo data
  – Excellent population coverage
  – Underreporting of treatment

Page 9:

California study

• Cancer registry data
  – Statewide coverage
  – About 70,000 cases over 5 years in relevant stages (appropriate for chemotherapy)
• Calibration survey
  – Request medical record data from physicians
  – Limited in time (1+ year) and space (3 of 10 regions)
  – 1956 cases in sample, 1449 (74%) respond

Page 10:

Reporting of adjuvant therapy

• Follow-up survey response rate higher …
  – at HMO-affiliated and high-volume hospitals
  – when chemo reported in original record
• 82% of adjuvant therapy was reported to Registry (among “respondents”)
  – Substantial underestimation if Registry alone used
  – More complete in teaching hospitals, HMO affiliates, high-volume hospitals, younger and rectal cancer patients

Cress et al., Medical Care 2003

Page 11:

Naïve estimation of administration of adjuvant chemotherapy

• Analysis based only on “gold standard” survey + Registry data in sample
• Strong variation by patient characteristics
  – Age (less if older), marital status
  – Race (less if Black, more if Hispanic, Asian)
  – Income (upward gradient with higher income)
• Substantial unexplained hospital-level variation

Ayanian et al., J Clinical Oncology 2003

Page 12:

Limitations of standard analytic approaches

• Survey respondents alone:
  – Small portion of available California data (1449/70,000)
  – Single area of state
  – Unrepresentative due to survey nonresponse
  – Confounding of survey response, reporting, treatment variation (e.g. volume effects)
• Registry data alone:
  – Underreporting of chemotherapy
  – Reporting is nonuniform

Page 13:

Combining Registry and survey data

• Combine
  – power of large Registry data
  – correction for underreporting based on survey
• Simple correction based on:
  P(reported chemo) = P(chemo) P(report | chemo)
  Therefore: P(chemo) = P(reported chemo) / P(report | chemo)

Page 14:

Registry plus simple correction

• In survey:
  P(reported chemo) = 59%
  P(report | chemo) = 82%
  P(chemo) = 59%/82% ≈ 71%
• Outside survey (mostly rest of state):
  P(reported chemo) = 49%
  P(report | chemo) = 82%
  P(chemo) = 49%/82% ≈ 60%

Depends on assumption that reporting is similar in the two areas
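
The simple correction can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function name `corrected_rate` is hypothetical, and the inputs are the rounded percentages from the slide. With those rounded inputs the in-survey ratio comes out near 0.72 rather than 0.71, which presumably reflects rounding in the published figures.

```python
def corrected_rate(reported_rate: float, reporting_completeness: float) -> float:
    """Simple correction: P(chemo) = P(reported chemo) / P(report | chemo)."""
    if not 0.0 < reporting_completeness <= 1.0:
        raise ValueError("reporting completeness must be in (0, 1]")
    return reported_rate / reporting_completeness


# Rounded inputs taken from the slide (82% completeness assumed in both areas).
in_survey = corrected_rate(0.59, 0.82)
outside_survey = corrected_rate(0.49, 0.82)
print(f"in survey: {in_survey:.3f}, outside survey: {outside_survey:.3f}")
```

The same completeness (82%) is plugged into both areas, which is exactly the assumption the slide flags as questionable.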

Page 15:

Model-based methodology (Yucel and Zaslavsky)

• Disaggregated model
  – Take into account individual effects on both chemotherapy and reporting
  – Take into account hospital variation in both chemotherapy and reporting
• Imputation of chemo for individual cases
  – Allow fitting of any desired models
  – Multiple imputation to obtain proper measures of uncertainty with imputed data

Page 16:

Models for reporting and therapy

• Logit or probit regression for therapy (outcome)
  – Patient p has characteristics x_hp: age, sex, race/ethnicity, comorbidity score (Charlson), tumor stage/site, income category
  – Hospital h has characteristics z_h: volume, ACOS-certified registry, teaching
  – Random effect δ_h for hospital h

  logit P(chemo_hp) = β x_hp + γ z_h + δ_h

• Similar model (with or without random effect) for reporting given therapy
  – Random effects for reporting & therapy could be correlated
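
As a minimal sketch of the therapy model, the function below evaluates the inverse logit of a linear predictor made of patient terms, hospital terms, and a hospital random effect. The names `chemo_probability`, `beta`, `gamma`, and `delta_h` are assumptions chosen for illustration, and the coefficient values are invented, not estimates from the study.

```python
import math


def inv_logit(eta: float) -> float:
    """Numerically stable inverse logit."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)


def chemo_probability(x, z, beta, gamma, delta_h):
    """P(chemo) for one patient: inverse logit of beta.x + gamma.z + delta_h.

    x: patient covariates, z: hospital covariates (lists of floats),
    beta, gamma: coefficient lists (hypothetical values),
    delta_h: random effect of the patient's hospital.
    """
    eta = sum(b * xi for b, xi in zip(beta, x))
    eta += sum(g * zi for g, zi in zip(gamma, z))
    eta += delta_h
    return inv_logit(eta)


# Invented example: one patient covariate, one hospital covariate.
print(chemo_probability([1.0], [0.5], [0.4], [-0.2], 0.3))
```

A reporting model of the same shape would take true therapy status as given and use its own coefficients and (possibly correlated) hospital effect.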

Page 17:

Two versions of hierarchical model: (a) single random effect, (b) bivariate RE

[Figure: one diagram per version, each with Reporting and Outcome components and three levels: parameters, latent “true” status, observed status.]

Page 18:

Fitting the model

• Full Bayesian specification
  – Diffuse priors for coefficients, (co)variances
• Fit via Gibbs sampling: alternately
  – Impute true chemo status for non-survey cases
  – Draw random hospital effects
  – Draw “fixed” coefficients and variance components
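
The alternation of imputation and parameter draws can be illustrated on a deliberately stripped-down version of the problem. The sketch below is not the model from the slides: it assumes no covariates and no hospital effects, a single therapy rate `p` and reporting rate `r` shared by survey and non-survey cases, no false-positive reports, and uniform Beta(1, 1) priors. The function name and the example counts are hypothetical.

```python
import random


def gibbs_underreporting(calib, registry, n_iter=1500, burn_in=500, seed=1):
    """Toy Gibbs sampler for p = P(chemo) and r = P(report | chemo).

    calib    = (a, b, c): survey cases with true status known;
               a = chemo and reported, b = chemo but unreported, c = no chemo.
    registry = (n_rep, n_unrep): non-survey cases by reported status only.
    """
    a, b, c = calib
    n_rep, n_unrep = registry
    rng = random.Random(seed)
    p, r = 0.5, 0.5
    p_sum = r_sum = 0.0
    for it in range(n_iter):
        # Step 1: impute true status for unreported non-survey cases.
        q = p * (1 - r) / (p * (1 - r) + (1 - p))  # P(chemo | no report)
        m = sum(rng.random() < q for _ in range(n_unrep))
        # Step 2: draw p given completed data (reported cases are true chemo).
        p = rng.betavariate(1 + a + b + n_rep + m, 1 + c + n_unrep - m)
        # Step 3: draw r given completed data.
        r = rng.betavariate(1 + a + n_rep, 1 + b + m)
        if it >= burn_in:
            p_sum += p
            r_sum += r
    kept = n_iter - burn_in
    return p_sum / kept, r_sum / kept


# Invented counts roughly consistent with p = 0.7 and r = 0.8.
p_mean, r_mean = gibbs_underreporting((560, 140, 300), (280, 220))
print(f"posterior means: p = {p_mean:.3f}, r = {r_mean:.3f}")
```

In the full model each of these steps would be replaced by its regression counterpart, with an additional step drawing the hospital random effects.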

Page 19:

Imputing chemo status (Bayes’s theorem)

• Example: consider individual (not in survey) for whom models give
  – Prior P(chemo) = 70%
  – Prior P(reporting | chemo) = 80%
• If chemo reported, then true chemo = 1
• If chemo not reported:
  – P(no chemo, no report) = 30%
  – P(chemo, no report) = 70% × 20% = 14%
  – P(chemo | no report) = 14%/(14% + 30%) ≈ 32%
  – Impute chemo = 1 with probability 32%
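
The worked example above translates directly into code. This is a sketch under the slide's assumption that a reported chemotherapy is always a true one; the function names are hypothetical.

```python
import random


def prob_chemo_given_no_report(p_chemo: float, p_report: float) -> float:
    """P(chemo | no report) by Bayes's theorem, assuming no false reports."""
    chemo_unreported = p_chemo * (1.0 - p_report)
    no_chemo = 1.0 - p_chemo
    return chemo_unreported / (chemo_unreported + no_chemo)


def impute_chemo(reported: bool, p_chemo: float, p_report: float,
                 rng: random.Random) -> int:
    """Draw one imputed true chemo status (1 or 0) for a non-survey case."""
    if reported:
        return 1
    return int(rng.random() < prob_chemo_given_no_report(p_chemo, p_report))


posterior = prob_chemo_given_no_report(0.70, 0.80)
print(f"P(chemo | no report) = {posterior:.3f}")
```

In the full model `p_chemo` and `p_report` would differ by case, since both depend on patient and hospital characteristics.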

Page 20:

Computing: probit via latent variables

• Probit model: Φ⁻¹(P(Y_hp = 1)) = β x_hp + γ z_h + δ_h
  – Equivalently: Y_hp = 1 ↔ e_hp < β x_hp + γ z_h + δ_h, where e_hp ~ N(0,1) is a normal latent variable (Albert & Chib 1993)
  – Equivalently, Y_hp = 1 ↔ u_hp = β x_hp + γ z_h + δ_h − e_hp > 0
  – Observing Y_hp implies truncated normal posterior for u_hp given higher-level parameters δ_h
• Given a draw of u_hp, higher levels reduce to normal multilevel model with observation u_hp and fixed variance = 1 at bottom level (well-known problem)
    • independent of the discrete data or imputed values
    • direct generalization to correlated bivariate response
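
The truncated-normal step can be sketched with the standard library alone, using inverse-CDF sampling. Here `eta` stands for the linear predictor (fixed effects plus hospital effect), and `draw_latent_u` is a hypothetical name. The clamping of the probability is a numerical guard added for this sketch, so the draw is only reliable for moderate values of `eta`.

```python
import random
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def draw_latent_u(y: int, eta: float, rng: random.Random) -> float:
    """Draw u ~ N(eta, 1) truncated to u > 0 if y == 1, else to u <= 0."""
    cut = _STD_NORMAL.cdf(-eta)      # P(u <= 0) for u ~ N(eta, 1)
    v = rng.random()
    if y == 1:
        prob = cut + v * (1.0 - cut)  # uniform on (cut, 1)
    else:
        prob = v * cut                # uniform on (0, cut)
    prob = min(max(prob, 1e-12), 1.0 - 1e-12)  # keep inv_cdf in its domain
    return eta + _STD_NORMAL.inv_cdf(prob)


rng = random.Random(0)
print(draw_latent_u(1, 0.3, rng), draw_latent_u(0, 0.3, rng))
```

Once every `u` has been drawn, the remaining updates are those of an ordinary normal multilevel model with unit residual variance.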

Page 21:

“Restricted” inference for robustness

• Two kinds of information involved in inference for “reporting” model
  – “Direct” in survey sample (1449 cases): Y | Y*, parameters, X
  – “Indirect” in remaining area (~74,000+ cases): Y | parameters, X (combines outcome & reporting models)
  – Possibly sensitive to model misspecification?
• Ad hoc solution: Restrict likelihood for reporting model to direct data from reporting survey cases
  – Throw away some information from others
  – Greater robustness to slight misspecification?
  – Reparametrize as regression δ(R) | δ(O) & marginal δ(O)

Page 22:

Direct interpretation of fitted model

• Effects broadly similar to those in naïve (sample only) analyses
  – Volume effect on reporting but not on chemo
  – Lower chemo rate outside survey region
• Substantial hospital random effects in both reporting and therapy rates
  – Indication of substantial unexplained variation: a problem (from health services standpoint)!
  – Reporting completeness and therapy rates not (residually) correlated

Page 23:

Using imputations to estimate effect of chemotherapy on survival

• Re-fit model including 2-year survival as predictor of chemotherapy
• Using imputed corrected chemotherapy, fit model with chemotherapy (and other variables) as predictor of survival
  – Correct variances with multiple imputation
  – Missing info ≈70% for chemo, 1-4% for other variables
• Finds significant positive effect (OR=1.26) of chemo on survival
  – [Are the severity controls good enough?]
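
The multiple-imputation variance correction is conventionally done with Rubin's combining rules, sketched below. The slides do not show the formulas, so this is the standard textbook version; the fraction of missing information uses the simple large-sample approximation without a degrees-of-freedom adjustment. The input numbers are invented, not results from the study.

```python
from statistics import mean, variance


def rubin_combine(estimates, variances):
    """Combine m completed-data estimates and their variances.

    Returns (pooled estimate, total variance, approximate fraction
    of missing information).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching lengths")
    q_bar = mean(estimates)              # pooled point estimate
    within = mean(variances)             # average within-imputation variance
    between = variance(estimates)        # between-imputation variance (m - 1)
    total = within + (1.0 + 1.0 / m) * between
    frac_missing = (1.0 + 1.0 / m) * between / total
    return q_bar, total, frac_missing


# Invented log-odds-ratio estimates from three imputed data sets.
q, t, lam = rubin_combine([0.20, 0.25, 0.30], [0.01, 0.01, 0.01])
print(f"estimate = {q:.3f}, variance = {t:.4f}, missing info = {lam:.2f}")
```

A large fraction of missing information, such as the roughly 70% reported for the chemotherapy coefficient, means the between-imputation component dominates the total variance.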

Page 24:

Modeling critical with missing data

• Several kinds of missing data:
  – Unreported chemotherapy
  – Nonresponse to followback (validation) survey
  – Areas excluded from followback survey
• Potential for confounding if unjustifiable MCAR (or insufficiently conditional MAR) assumptions are made
  – MCAR = Missing Completely at Random: missingness independent of everything
  – MAR = Missing at Random: missingness independent of unobserved, conditional on observed

Page 25:

Some counterintuitive results!

(all values in %)                           Hospital Volume
                                        Low   Med   High   All
Survey response rate                     63    73    78     75
Reporting completeness in survey         81    81    92     87
Chemotherapy rates by registry
  Survey respondents                     60    54    66     62
  Survey nonrespondents                  40    44    53     48
  All                                    52    51    63     58
Chemotherapy rates by survey             77    68    72     71
Chemotherapy rates by hybrid method
  Survey respondents                     80    70    74     73
  Survey nonrespondents                  40    44    53     47
  All                                    65    63    69     67
Chemotherapy rates under model           67    71    69     69
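
The “All” rows of the hybrid method appear consistent with a response-rate-weighted mix of the respondent and nonrespondent rates. That reading is an inference from the published numbers, not a definition given in the slides; the sketch below simply checks it against the rounded table entries.

```python
# Rounded table entries (percent), keyed by hospital volume column.
response_rate = {"Low": 63, "Med": 73, "High": 78, "All": 75}
hybrid_respondents = {"Low": 80, "Med": 70, "High": 74, "All": 73}
hybrid_nonrespondents = {"Low": 40, "Med": 44, "High": 53, "All": 47}
hybrid_all_published = {"Low": 65, "Med": 63, "High": 69, "All": 67}


def weighted_mix(resp_rate_pct: float, rate_resp: float, rate_nonresp: float) -> float:
    """Mix two subgroup rates using the survey response rate as the weight."""
    w = resp_rate_pct / 100.0
    return w * rate_resp + (1.0 - w) * rate_nonresp


hybrid_all_computed = {
    col: weighted_mix(response_rate[col], hybrid_respondents[col],
                      hybrid_nonrespondents[col])
    for col in response_rate
}
for col, value in hybrid_all_computed.items():
    print(f"{col}: computed {value:.1f}, published {hybrid_all_published[col]}")
```

Under this reading the nonrespondent rows of the hybrid method are uncorrected registry rates, which would explain why the hybrid totals sit below the survey-based and model-based rates.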

Page 26:

Limitations and potential design improvements

• Major limitation: calibration survey is unrepresentative (in known ways)
  – Only covers some areas (trial implementation)
  – Differences by region in reporting are plausible
  – Can evaluate sensitivity to alternative assumptions
• Could improve design for ongoing studies
  – Sample across entire area
  – Quality improvement for both therapy and reporting

Page 27:

Example 2: Adjustment for measurement bias of 1990 Post-Enumeration Survey

• Post-Enumeration Survey provides estimates of proportional error in Decennial Census estimates
  – Includes whole-household and within-household under- and overenumerations
  – Tabulated for poststrata of individuals defined by household-level (region, urbanicity) and individual-level (age, sex, race/ethnicity) variables

Page 28:

Notation for undercount estimation (Zaslavsky 1993, JASA)

• $k$ = domain index
• $c_k$ = population share of domain $k$
• $y^*_k$ = true census underenumeration rate
• $\hat{y}_k$ = (biased) estimate of $y^*_k$ from survey
• $y_k = E\,\hat{y}_k$ = expectation, $b_k = y_k - y^*_k$ = bias
• $\hat{b}_k$ = unbiased estimate of $b_k$, $E\,\hat{b}_k = b_k$
• Constraints: $\sum_k c_k y^*_k = \sum_k c_k y_k = \sum_k c_k \hat{y}_k = \sum_k c_k b_k = 0$ (sum of errors in shares is 0).
• Sampling variance of $\hat{y}$ = $\mathrm{Var}(\hat{y} \mid y) = V_y$

Page 29:

Components and variance of $\hat{b}_k$

• Sources of bias estimates (total error model)
  – Small calibration studies to estimate process errors (matching, geocoding, fabrications)
  – Model-based estimates of correlation bias
  – Uncertainty about imputation model
• $\mathrm{Var}(\hat{b} - b) = V_b$ includes
  – Sampling variances from calibration studies,
  – Uncertainty across correlation bias models,
  – (Multiple) imputation variance and model uncertainty

Page 30:

A naïve approach and its problems

• Simple bias-corrected estimate is $\hat{y} - \hat{b}$
  – Unbiased estimator of $y^*$
  – Variance is $V_y + V_b$ and $V_b$ is likely to be large
  – Problem for non-Bayesian approaches: if we have very little data to estimate something, must we assume that it could be “anything”?
• Alternative (Bayesian) approach: introduce reasonable prior beliefs
  – Bias terms $b_k$ are a collection centered around 0
  – Characterize variability by variance component
  – Similar argument for undercount terms $y_k$

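Page 31:

Hierarchical model for estimation and bias correction

• “Sampling” model:

\[
\begin{pmatrix} \hat{y} \\ \hat{b} \end{pmatrix}
\sim N\!\left(
\begin{pmatrix} y \\ b \end{pmatrix},
\begin{pmatrix} V_y & 0 \\ 0 & V_b \end{pmatrix}
\right)
\]

  – Not exactly “sampling” since some model uncertainty is included in $V_b$
• “Structural” (Level 2) model:

\[
\begin{pmatrix} y \\ b \end{pmatrix}
\sim N\!\left(
\begin{pmatrix} 0 \\ 0 \end{pmatrix},
\begin{pmatrix}
\sigma_y^2 U & \rho_{yb}\,\sigma_y \sigma_b U \\
\rho_{yb}\,\sigma_y \sigma_b U & \sigma_b^2 U
\end{pmatrix}
\right)
\]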

Page 32:

Hierarchical model for estimation and bias correction

• “Structural” (Level 2) model:

\[
\begin{pmatrix} y \\ b \end{pmatrix}
\sim N\!\left(
\begin{pmatrix} 0 \\ 0 \end{pmatrix},
\begin{pmatrix}
\sigma_y^2 U & \rho_{yb}\,\sigma_y \sigma_b U \\
\rho_{yb}\,\sigma_y \sigma_b U & \sigma_b^2 U
\end{pmatrix}
\right)
\]

  – Undercount and bias terms each drawn from common distribution
  – Proportional covariance structures for each and for correlation of the two
  – Matrix $U$ based on a prior “similarity” of domains (number of common characteristics)
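
To see how the two levels combine, the sketch below works out the posterior mean for a single domain (so the similarity matrix reduces to the scalar 1) with the variance components treated as known. That is a strong simplification: in the slides the variance components have priors and are estimated jointly across domains. All names and numbers here are hypothetical.

```python
def posterior_true_rate(y_hat, b_hat, v_y, v_b, sigma_y, sigma_b, rho):
    """Posterior means of (y* = y - b, y, b) for one domain.

    Sampling model:   (y_hat, b_hat) ~ N((y, b), diag(v_y, v_b))
    Structural model: (y, b) ~ N(0, Sigma), with Sigma built from
                      sigma_y, sigma_b and correlation rho.
    Posterior mean is Sigma (Sigma + V)^-1 applied to (y_hat, b_hat).
    """
    s11, s12, s22 = sigma_y ** 2, rho * sigma_y * sigma_b, sigma_b ** 2
    # A = Sigma + V, inverted by hand since it is 2 x 2.
    a11, a12, a22 = s11 + v_y, s12, s22 + v_b
    det = a11 * a22 - a12 * a12
    w1 = (a22 * y_hat - a12 * b_hat) / det
    w2 = (-a12 * y_hat + a11 * b_hat) / det
    y_post = s11 * w1 + s12 * w2
    b_post = s12 * w1 + s22 * w2
    return y_post - b_post, y_post, b_post


# Invented inputs: a noisy bias estimate is shrunk strongly toward 0.
print(posterior_true_rate(y_hat=2.0, b_hat=1.5, v_y=0.5, v_b=4.0,
                          sigma_y=1.0, sigma_b=1.0, rho=0.0))
```

With very diffuse structural variances the result approaches the naïve corrected estimate (estimate minus estimated bias), while a small bias variance component pulls the bias correction toward zero.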

Page 33:

Priors and inference

• Fairly vague priors for variance components, correlation
  – These represent assessments of degree of variation in bias, undercount and how they relate across domains
  – Key to this inference is existence of collection of domains
• Inference via Gibbs sampler
• Extensive simulations
  – Compare to uniform shrinkage, hypothesis testing approaches, etc.
  – Suggested that full hierarchical Bayes model would outperform competitors

Page 34:

Analyses with 1992 data

• Data combined 3 sources
  – 1990 census
  – Post-Enumeration Survey
  – Various sources of bias component estimates
• Estimates:
  – Substantial differential undercount, $\sigma_y \approx 1.2\%$
  – Substantial differential bias, $\sigma_b \approx 3.2\%$

Page 35:

Refinement: misaligned domains (Zaslavsky 1992, Proc. SRMS)

• Domains for bias estimates might differ from those for $y$
  – e.g. if they combine the main domains
  – Observation is $\hat{b}_0 = X\hat{b}$
• Modifies the sampling model:

\[
\begin{pmatrix} \hat{y} \\ \hat{b}_0 \end{pmatrix}
\sim N\!\left(
\begin{pmatrix} y \\ Xb \end{pmatrix},
\begin{pmatrix} V_y & 0 \\ 0 & X V_b X' \end{pmatrix}
\right)
\]

• Applied to 1992 data:
  – 357 poststrata, 51 poststratum groups, but only 10 evaluation poststrata

Page 36:

Other potential applications

• Domain-level estimates
  – No gold standard data for individuals
  – No individual-level corrections
• Many applications where there are small evaluation samples for a measure
  – Welfare or food stamp payment error
  – Quality evaluations in medical care

Page 37:

Example 3: Imputation of households to correct for enumeration error

• Setting: Census (or survey) of households with errors of enumeration
  – Whole-household errors
  – Within-household errors
  – [Assumption (here) that all errors are omissions]
• Objective: To (multiply) impute corrected rosters
  – Add persons to households
  – Impute additional households

Page 38:

Bayesian imputation strategy (Zaslavsky 2004; Zaslavsky & Rubin 1989 Proc. ARC)

• Based on “scientific” factorization
  – Prevalence model: distribution of households by compositional type (roster of members by poststratum), P(Y*_bk = t | β_k)
    β_k = (latent) parameter of block b
  – Observational model: probability of observed types (with error), P(Y_bk = u | Y*_bk = t, ξ)

Page 39:

Model specifics

• Prevalence models
  – x(t) summarizes characteristics of type t
  – Prevalence proportional to exp(x(t) · β_k) · h(t)
    • h(t) is (nonparametric) general prevalence of type t
• Observational model
    • Loglinear model based on probabilities of omission of individuals
    • Terms for dependence of omissions within household
    • Could be based on (hypothetical) dataset …
    • … and/or calibrated to match aggregate omission rate estimates by poststratum

Page 40:

Imputations

• Draw Y*_bk by Bayes’s theorem
  – Possible values are those types that could “lose” one or more members yielding observed Y_bk
  – Draw from all possible values of t
• Special type for unobserved households
  – Count imputed using SOUP (unbiased) prior
  – True types imputed similar to others
• Gibbs sampler to estimate all parameters
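
The Bayes's-theorem draw of a true household type can be illustrated with a much-reduced toy version. The sketch below makes assumptions that are not in the slides: a household “type” is just its number of members, each member is omitted independently with the same probability (the slides allow dependence within household), and the prevalence table is invented. Function names are hypothetical.

```python
import random
from math import comb


def posterior_true_size(observed: int, prevalence: dict, omit_prob: float) -> dict:
    """Posterior over true household size given the observed size.

    prevalence: true size -> prior weight (toy prevalence model).
    Only sizes >= observed are possible, since all errors are omissions.
    """
    weights = {}
    for size, prior in prevalence.items():
        if size < observed:
            continue
        missed = size - observed
        likelihood = (comb(size, missed) * omit_prob ** missed
                      * (1.0 - omit_prob) ** observed)
        weights[size] = prior * likelihood
    total = sum(weights.values())
    if total == 0.0:
        raise ValueError("observed size is impossible under this prevalence table")
    return {size: w / total for size, w in weights.items()}


def impute_true_size(observed, prevalence, omit_prob, rng: random.Random) -> int:
    """Draw one imputed true size from the posterior."""
    post = posterior_true_size(observed, prevalence, omit_prob)
    sizes = list(post)
    return rng.choices(sizes, weights=[post[s] for s in sizes])[0]


toy_prevalence = {1: 0.3, 2: 0.3, 3: 0.2, 4: 0.2}  # invented
post = posterior_true_size(2, toy_prevalence, omit_prob=0.1)
print(post, impute_true_size(2, toy_prevalence, 0.1, random.Random(0)))
```

Wholly unobserved households need separate treatment, as the slide notes, because an observed size of zero is never recorded.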

Page 41:

General summary of examples

• All are “Bayesian” in drawing corrected values from posterior distributions
  – “Scientific” factorization for interpretability (Examples 1 and 3)
  – “Observations” might have simple (Ex. 1, 2) or complex (Ex. 3) structure
• Bayesian also in
  – Incorporating prior information
  – Pooling across collections of units (“shrinkage”)
  – Hierarchical specification of complex models
  – Probability representation of model uncertainty (Ex. 2)

Page 42:

Program to move forward

• Systematic quantitative meta-analysis of information on nonresponse errors

• Models for various types of nonresponse error

• Think more about how to combine information from data and model uncertainty

• Standard algorithms and software

• Integrate with analyses of nonresponse, item missing data, etc.