
I4C Pooled Dataset: Problems and Solutions Stan Lemeshow & Gary Phillips The Ohio State University

Feb 09, 2016

Page 1

We joined the project at the beginning of 2013.

We noted interesting challenges:

• 6 non-homogeneous cohorts
• How best to identify a suitable control group (10% sample or everyone)?
• If a 10% sample, should we use Prentice weights?
• Should we use stratified analyses or random effects models?
• How to incorporate birth weight in models: dichotomous, polychotomous, or continuous?
• How to deal with the large amount of missing data? Statistical packages drop the entire case when any of the variables is missing.
• How to deal with confounding and effect modification?
• How to overcome, for statistical analyses, the small numbers of cancers in some of the birth weight groups?
• How to assess and incorporate the appropriate scale of continuous covariates?

Page 2

• 6 non-homogeneous cohorts

                                   ALSPAC      CPP         DNBC        JPS         MoBa        TIHS
Recruitment years                  1991-1992   1959-1966   1996-2002   1964-1976   1999-2007   1987-1995
Total live births in cohort        14,062      58,000      96,860      92,408      108,487     10,628
Singleton live births with no
  Down syndrome in pooled dataset  13,664      50,342      8,603 (a)   20,313 (b)  10,497 (a)  9,362

(a) based on a random, representative, 10% sample of the entire cohort
(b) based on neonates with data on gestational age

ALSPAC: Avon Longitudinal Study of Parents and Children (UK)
CPP: The Collaborative Perinatal Project (USA)
DNBC: Danish National Birth Cohort (Denmark)
JPS: Jerusalem Perinatal Study (Israel)
MoBa: Norwegian Mother and Child Cohort Study (Norway)
TIHS: Tasmanian Infant Health Survey (Australia)

Issue: Is it possible to create a set of weights to reflect some overarching population?
• We believe that there really isn’t an overarching population, so use of specially chosen statistical weights is unnecessary.

Page 3

Oversampling

Original plan was to take all cases and 10% of controls.

• Prentice weights have been proposed for handling the oversampling.
• If the plan is to use 10% of controls in all cohorts, should we use Prentice weights or not?

No need to use Prentice weights.

Cox stratified regression using Prentice weights, using 10% of the controls

Stratified Cox regr. -- Breslow method for ties

No. of subjects =         1998                  Number of obs   =      1998
No. of failures =          376
Time at risk    =  18458.09313
                                                LR chi2(3)      =      5.44
Log likelihood  =   -39727.245                  Prob > chi2     =    0.1421

------------------------------------------------------------------------------
          _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
 birthwt_cat |
          0  |       0.79        0.21   -0.89   0.375       0.46        1.34
          2  |       1.22        0.16    1.47   0.142       0.94        1.59
             |
     gestage |       0.95        0.02   -2.13   0.033       0.90        1.00
      weight |       1.00    (offset)
------------------------------------------------------------------------------
                                                           Stratified by study

Cox stratified regression without using Prentice weights, using 10% of the controls

Stratified Cox regr. -- Breslow method for ties

No. of subjects =         1998                  Number of obs   =      1998
No. of failures =          376
Time at risk    =  18458.09313
                                                LR chi2(3)      =      3.79
Log likelihood  =   -2175.7188                  Prob > chi2     =    0.2851

------------------------------------------------------------------------------
          _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
 birthwt_cat |
          0  |       0.81        0.22   -0.79   0.429       0.47        1.37
          2  |       1.17        0.16    1.16   0.244       0.90        1.53
             |
     gestage |       0.96        0.02   -1.80   0.071       0.91        1.00
------------------------------------------------------------------------------
                                                           Stratified by study

• Prentice weights are used to adjust the analysis when the cases are oversampled.
• We found that the estimated hazard ratios from the Cox regression with Prentice weights were essentially the same as those from the regression without them.
• We therefore decided not to use them.
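The columns of the hazard-ratio tables above are linked by simple arithmetic, which makes it easy to sanity-check a row. The sketch below is not from the slides: it assumes the usual delta-method relationship se(HR) = HR × se(b) and a normal-based 95% interval computed on the log-hazard scale. The function name `hr_summary` is made up for illustration, and because the inputs are rounded to two decimals the results match the printed values only approximately.

```python
import math

def hr_summary(hr, se_hr, z_crit=1.96):
    """Recover z and the 95% CI from a hazard ratio and its delta-method SE."""
    b = math.log(hr)        # coefficient on the log-hazard scale
    se_b = se_hr / hr       # assumption: se(HR) = HR * se(b)
    z = b / se_b
    lower = math.exp(b - z_crit * se_b)
    upper = math.exp(b + z_crit * se_b)
    return z, lower, upper

# Row "birthwt_cat 0" from the Prentice-weighted table: HR 0.79, SE 0.21
z, lower, upper = hr_summary(0.79, 0.21)
print(round(z, 2), round(lower, 2), round(upper, 2))
```

Running the same function on the unweighted row (HR 0.81, SE 0.22) gives an interval that overlaps almost entirely with the weighted one, which is the comparison the slide relies on.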

Page 4

Use all observations or only 10% of controls?

No need to use only a 10% sample.

Cox stratified regression without using Prentice weights, using 10% of the controls

Stratified Cox regr. -- Breslow method for ties

No. of subjects =         1998                  Number of obs   =      1998
No. of failures =          376
Time at risk    =  18458.09313
                                                LR chi2(3)      =      3.79
Log likelihood  =   -2175.7188                  Prob > chi2     =    0.2851

------------------------------------------------------------------------------
          _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
 birthwt_cat |
          0  |       0.81        0.22   -0.79   0.429       0.47        1.37
          2  |       1.17        0.16    1.16   0.244       0.90        1.53
             |
     gestage |       0.96        0.02   -1.80   0.071       0.91        1.00
------------------------------------------------------------------------------
                                                           Stratified by study

Cox stratified regression without using Prentice weights, using all observations

Stratified Cox regr. -- Breslow method for ties

No. of subjects =       111430                  Number of obs   =    111430
No. of failures =          376
Time at risk    =  1101297.732
                                                LR chi2(3)      =      4.41
Log likelihood  =   -3511.1489                  Prob > chi2     =    0.2203

------------------------------------------------------------------------------
          _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
 birthwt_cat |
          0  |       0.86        0.23   -0.58   0.565       0.50        1.46
          2  |       1.20        0.16    1.37   0.170       0.92        1.57
             |
     gestage |       0.95        0.03   -1.85   0.064       0.90        1.00
------------------------------------------------------------------------------
                                                           Stratified by study

• Initial analyses used 10% of non-cases vs. all cases from each of the 6 cohorts.
• We thought it made more statistical sense to use all of the observations if the goal of this study was to find the true relationship between birth weight and childhood cancer. Presumably this relationship would be independent of the overarching population.
• Since we are not trying to model a specific population using the 6 cohorts, sampling weights would not be needed.
• Sampling weights are applied in analyses so that the results (prevalence of childhood cancer) mirror an overarching population that the study sample does not reflect.
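For readers unfamiliar with the original design, the following is a minimal sketch of "all cases plus a 10% random sample of non-cases within each cohort". The record layout (`cohort`, `cancer` keys), the seed, and the function name are assumptions made for illustration; the slides do not show how the sampling was actually carried out.

```python
import random

def sample_cases_and_controls(records, fraction=0.10, seed=2013):
    """Keep every case; keep a random `fraction` of non-cases within each cohort."""
    rng = random.Random(seed)
    by_cohort = {}
    for rec in records:
        by_cohort.setdefault(rec["cohort"], []).append(rec)
    sample = []
    for recs in by_cohort.values():
        cases = [r for r in recs if r["cancer"] == 1]
        controls = [r for r in recs if r["cancer"] == 0]
        k = round(fraction * len(controls))
        sample.extend(cases)                    # all cases are retained
        sample.extend(rng.sample(controls, k))  # sampled non-cases
    return sample

# Toy data: two hypothetical cohorts, each with 5 cases and 1,000 non-cases
records = [{"cohort": c, "cancer": int(i < 5)} for c in ("A", "B") for i in range(1005)]
subset = sample_cases_and_controls(records)
print(len(subset), sum(r["cancer"] for r in subset))
```

Using every observation, as the slide recommends, simply skips this step and fits the model to `records` directly.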

Page 5

Stratified analyses vs. random effects models

Cox stratified regression without using Prentice weights, using all observations

Stratified Cox regr. -- Breslow method for ties

No. of subjects =       111430                  Number of obs   =    111430
No. of failures =          376
Time at risk    =  1101297.732
                                                LR chi2(3)      =      4.41
Log likelihood  =   -3511.1489                  Prob > chi2     =    0.2203

------------------------------------------------------------------------------
          _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
 birthwt_cat |
          0  |       0.86        0.23   -0.58   0.565       0.50        1.46
          2  |       1.20        0.16    1.37   0.170       0.92        1.57
             |
     gestage |       0.95        0.03   -1.85   0.064       0.90        1.00
------------------------------------------------------------------------------
                                                           Stratified by study

Random-effects (shared frailty) Cox regression without using Prentice weights, using all observations

Cox regression -- Breslow method for ties       Number of obs      =    111430
         Gamma shared frailty                   Number of groups   =         6
Group variable: study

No. of subjects =       111430                  Obs per group: min =      8571
No. of failures =          376                                 avg =  18571.67
Time at risk    =  1101297.732                                 max =     49334

                                                Wald chi2(3)       =      4.59
Log likelihood  =   -4047.5665                  Prob > chi2        =    0.2042

------------------------------------------------------------------------------
          _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
 birthwt_cat |
          0  |       0.85        0.23   -0.61   0.545       0.50        1.45
          2  |       1.21        0.16    1.42   0.156       0.93        1.58
             |
     gestage |       0.95        0.03   -1.83   0.067       0.90        1.00
-------------+----------------------------------------------------------------
       theta |    1.15993    .5899787
------------------------------------------------------------------------------
Likelihood-ratio test of theta=0: chibar2(01) = 439.68   Prob>=chibar2 = 0.000

Note: standard errors of hazard ratios are conditional on theta.

Sensitivity analysis:
• We compared the results of the stratified Cox regression to a random-effects (shared frailty) Cox regression.
• The hazard ratios were very similar in both regressions.
• The cohort identifier was the stratification variable and was also the random-effects term.

We use the stratified analysis, as it runs much faster on the computer than the random-effects analysis.

Page 6

Switching from a 3-level birth weight to a 2-level birth weight variable

When we analyzed birth weight using a 3-level variable:
• < 2,500 g
• 2,500 to < 4,000 g
• ≥ 4,000 g
the regression was unstable, since the leukemia and ALL outcomes each had only 2 babies born weighing < 2,500 grams.

There were only 4 observations in the leukemia and ALL groups when we looked at the lowest 10 percent of birth weight.

Thus we dichotomized birth weight at:
• 4,000 g
• 3,500 g
• 3,000 g
• the top 10 percent within each of the 6 cohorts

We also analyzed birth weight as a continuous variable.

 Childhood |           Birth weight
    cancer |    < 2500   2500 to <    >= 4000 |     Total
-----------+----------------------------------+----------
     0. No |     8,509      92,397     11,030 |   111,936
    1. Yes |        20         282         75 |       377
-----------+----------------------------------+----------
     Total |     8,529      92,679     11,105 |   112,313

 Childhood |           Birth weight
  leukemia |    < 2500   2500 to <    >= 4000 |     Total
-----------+----------------------------------+----------
     0. No |     8,527      92,592     11,079 |   112,198
    1. Yes |         2          87         26 |       115
-----------+----------------------------------+----------
     Total |     8,529      92,679     11,105 |   112,313

 Childhood |           Birth weight
       ALL |    < 2500   2500 to <    >= 4000 |     Total
-----------+----------------------------------+----------
     0. No |     8,527      92,605     11,083 |   112,215
    1. Yes |         2          74         22 |        98
-----------+----------------------------------+----------
     Total |     8,529      92,679     11,105 |   112,313
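The dichotomizations described on this slide can be sketched as follows. This is only an illustration: the function names are hypothetical, the nearest-rank rule for the within-cohort 90th percentile is an assumption, and so is the choice of "at or above" for fixed cut points and "strictly above" for the percentile cut-off. The slides do not say how ties at a cut point were handled.

```python
def dichotomize_fixed(weights_g, cut_g):
    """1 if birth weight is at or above the cut point (grams), else 0."""
    return [int(w >= cut_g) for w in weights_g]

def dichotomize_top_decile(weights_g, cohorts):
    """1 if birth weight is in the top 10 percent of its own cohort, else 0."""
    cutoffs = {}
    for c in set(cohorts):
        ws = sorted(w for w, k in zip(weights_g, cohorts) if k == c)
        rank = (9 * len(ws) + 9) // 10   # nearest-rank 90th percentile (assumed rule)
        cutoffs[c] = ws[rank - 1]
    return [int(w > cutoffs[k]) for w, k in zip(weights_g, cohorts)]

# Toy data: two hypothetical cohorts with different birth weight distributions
weights = [3000 + 100 * i for i in range(10)] + [2000 + 100 * i for i in range(10)]
cohorts = ["A"] * 10 + ["B"] * 10
high_3500 = dichotomize_fixed(weights, 3500)
top10 = dichotomize_top_decile(weights, cohorts)
print(sum(high_3500), sum(top10))
```

The within-cohort version flags the heaviest babies in every cohort even when a cohort has few births above a fixed threshold, which is one way to avoid empty cells like those in the tables above.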

Page 7

Dealing with a large amount of missing data

• Statistical packages drop the entire case when any of that case’s variables are missing.
• Solution: multiple imputation (MI).
• We used chained MI to impute 20 complete datasets. The choice of 20 imputed datasets was based on our collective experience running MI.
• After generating the 20 datasets, proportional hazards Cox regression was performed separately on each imputation m = 1, … , 20, and the results were pooled into a single multiple-imputation result.
• The “chained” method fills in missing values in multiple variables iteratively by using chained equations, a sequence of univariate imputation methods with fully conditional specification (FCS) of prediction equations. It accommodates arbitrary missing-value patterns. Specifically we used:
  - truncated linear regression for continuous variables (paternal age, maternal height, pregnancy weight change, and pre-pregnancy BMI), where the imputations are limited to lower and upper boundaries set at the minimum and maximum of the non-missing observations of that variable;
  - logistic regression for dichotomous variables (first born and maternal smoking).
• The predictor variables used to impute the missing data were maternal age, gestational age, birth weight, sex of child, and cohort.
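The pooling step is usually done with Rubin's rules: the pooled estimate is the mean of the per-imputation estimates, and the total variance adds the average within-imputation variance to the between-imputation variance inflated by (1 + 1/m). The slides do not state the pooling formula, so the sketch below assumes Rubin's rules applied on the log-hazard scale; the function name and the three example estimates are invented for illustration.

```python
import math

def pool_rubin(estimates, variances):
    """Pool per-imputation coefficients and their variances with Rubin's rules."""
    m = len(estimates)
    q_bar = sum(estimates) / m                                # pooled coefficient
    within = sum(variances) / m                               # mean within-imputation variance
    between = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)
    total = within + (1 + 1 / m) * between                    # total variance
    return q_bar, math.sqrt(total)

# Hypothetical log-hazard-ratio estimates from m = 3 imputed datasets
coef, se = pool_rubin([0.10, 0.20, 0.30], [0.04, 0.04, 0.04])
print(round(coef, 3), round(se, 3), round(math.exp(coef), 3))
```

With m = 20, as on the slide, the inflation factor (1 + 1/m) is 1.05, so the between-imputation component is only slightly enlarged.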

Page 8

Table of missing observations by cohort

Page 9

Multiple Imputation

Summary of continuous variables needed to set the boundaries on the truncated regression used in the MI

        variable |      N      mean       sd      min      max
-----------------+--------------------------------------------------
         pat_age |  96678      29.8      6.6     14.0     71.9
      mat_height |  99173     162.9      7.2    101.6    203.2
  tot_pregwtchge |  95512      11.4      5.6    -76.0     68.0
 mat_prepreg_BMI |  95562      22.9      4.1     11.0     61.0
--------------------------------------------------------------------
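To show what the boundaries do, here is a small sketch that draws values from a normal distribution restricted to [min, max] by rejection sampling. It is a simplification: a real truncated-regression imputation centers each draw on the value predicted from the imputation predictors, whereas this example uses the marginal mean and standard deviation of paternal age from the table purely as stand-ins. The function name is hypothetical.

```python
import random

def draw_truncated_normal(mean, sd, lower, upper, rng, max_tries=10000):
    """Draw from N(mean, sd^2) restricted to [lower, upper] by rejection."""
    for _ in range(max_tries):
        x = rng.gauss(mean, sd)
        if lower <= x <= upper:
            return x
    raise RuntimeError("no draw fell inside the bounds")

rng = random.Random(0)
# Paternal age: mean 29.8, sd 6.6, observed range 14.0 to 71.9 (from the table)
draws = [draw_truncated_normal(29.8, 6.6, 14.0, 71.9, rng) for _ in range(1000)]
print(min(draws) >= 14.0, max(draws) <= 71.9)
```

The bounds guarantee that no imputed value falls outside the range actually observed, for example a paternal age below 14.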

Page 10

Confounding and Effect Modification

• We checked all variables in the dataset to determine whether any of them confounded the relationship between cancer and dichotomous or continuous birth weight.
• A variable is potentially a confounder when its addition to the model containing only birth weight changes the birth weight coefficient by more than 15% in either direction. This has nothing to do with statistical significance.
• Assessment of confounding was made prior to employing multiple imputation.
• We used a set of observations that was common to both regressions (with and without the confounder).
• This produced the following set of possible univariable confounders:
  - gestational age
  - maternal age
  - paternal age
  - maternal height
  - pregnancy weight change
  - pre-pregnancy BMI
  - first born
  - any maternal smoking
• This same set of confounders was produced for all 5 definitions of birth weight.
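The 15% change-in-estimate rule reduces to one line of arithmetic. The sketch below assumes the change is measured relative to the crude (birth-weight-only) coefficient; the function names and the example coefficients are hypothetical and are not taken from the study.

```python
def percent_change(crude, adjusted):
    """Percent change in the birth weight coefficient after adding a covariate."""
    return 100.0 * (adjusted - crude) / crude

def is_potential_confounder(crude, adjusted, threshold=15.0):
    """True when the coefficient moves by more than `threshold` percent either way."""
    return abs(percent_change(crude, adjusted)) > threshold

# Hypothetical coefficients: crude model vs. models with one covariate added
print(is_potential_confounder(0.20, 0.16))   # moves by 20 percent
print(is_potential_confounder(0.20, 0.21))   # moves by 5 percent
```

Both models must be fitted on the same observations, as the slide notes, or the change could reflect a different sample rather than confounding.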

Page 11

Effect Modification

• A covariate that had a statistically significant interaction (p ≤ 0.05) with birth weight was considered to be an effect modifier.
• When we checked, none of the covariates were effect modifiers (all p-values were quite large). This was true for all 5 definitions of birth weight.

Note: Had a variable been an effect modifier, it could not also have been a confounder.

Note: The numbers of observations in the crude and adjusted models are identical.

Table: Outcome is cancer, where β1 is the coefficient for birth weight ≥ 4,000 g and β2 is the coefficient for a 1 kg increase in continuous birth weight.

Page 12

Example of checking for effect modification with categorical birth weight

Stratified Cox regr. -- Breslow method for ties

No. of subjects =       105494                  Number of obs   =    105494
No. of failures =          342
Time at risk    =  1062773.443
                                                LR chi2(3)      =      1.80
Log likelihood  =   -3182.6531                  Prob > chi2     =    0.6157

---------------------------------------------------------------------------------------------
                         _t |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
----------------------------+----------------------------------------------------------------
          1.birthwt_cat4000 |       0.21        0.15    1.35   0.178      -0.09        0.51
              1.mat_smk_any |       0.07        0.14    0.49   0.625      -0.20        0.34
                            |
birthwt_cat4000#mat_smk_any |
                       1 1  |      -0.23        0.36   -0.65   0.515      -0.93        0.47
---------------------------------------------------------------------------------------------

Example of checking for effect modification with continuous birth weight for a 1 kg increase

Stratified Cox regr. -- Breslow method for ties

No. of subjects =       105494                  Number of obs   =    105494
No. of failures =          342
Time at risk    =  1062773.443
                                                LR chi2(3)      =      2.09
Log likelihood  =   -3182.5069                  Prob > chi2     =    0.5541

------------------------------------------------------------------------------------------
                      _t |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------------------+----------------------------------------------------------------
              birthwt_kg |       0.16        0.12    1.36   0.175      -0.07        0.39
           1.mat_smk_any |       0.28        0.72    0.39   0.696      -1.14        1.70
                         |
mat_smk_any#c.birthwt_kg |
                      1  |      -0.07        0.21   -0.34   0.736      -0.48        0.34
------------------------------------------------------------------------------------------
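The interaction rows above are Wald tests: z is the coefficient divided by its standard error, and the two-sided p-value comes from the standard normal distribution. A minimal sketch, with a hypothetical function name, follows. Because the printed coefficient and standard error are rounded to two decimals, the recomputed p-value differs slightly from the 0.515 shown in the table.

```python
import math

def wald_p_value(coef, se):
    """Return z and the two-sided normal p-value for a single coefficient."""
    z = coef / se
    p = math.erfc(abs(z) / math.sqrt(2))   # equals 2 * (1 - Phi(|z|))
    return z, p

# Interaction of birth weight >= 4,000 g with any maternal smoking
z, p = wald_p_value(-0.23, 0.36)
is_effect_modifier = p <= 0.05   # criterion from the Effect Modification slide
print(round(z, 2), round(p, 2), is_effect_modifier)
```

With a p-value this large, maternal smoking does not meet the slide's criterion for an effect modifier.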

Page 13

Assessing the Scale of the Continuous Covariates

• We checked the scale of the continuous covariates and birth weight to ensure that they were linear in the log-hazard. The method of fractional polynomials was used for this purpose.
• All continuous variables except paternal age were determined to be linear, and we did not need to transform any of the others.

From the multivariable fractional polynomials we can see that paternal age is not linear in the log-hazard.

Deviance for model at the end of cycle 1 = 5344.807, 74641 observations

Variable      Model  (vs.)   Deviance   Dev diff.      P       Powers   (vs.)
----------------------------------------------------------------------
mat_height    lin.   FP2     5344.807      0.703    0.872      1        3 3
              Final          5344.807                          1

tot_pregw...  lin.   FP2     5344.807      6.510    0.089      1        1 1
              Final          5344.807                          1

mat_prepr...  lin.   FP2     5344.807      4.215    0.239      1        -2 -2
              Final          5344.807                          1

mat_age       lin.   FP2     5344.807      0.864    0.834      1        0 0
              Final          5344.807                          1

pat_age       lin.   FP2     5357.821     13.014    0.005+     1        3 3
              FP1            5353.602      8.794    0.012+     3
              Final          5344.807                          3 3

[birthwt_cat4000 included with 1 df in model]

gestage       lin.   FP2     5344.807      0.191    0.979      1        -2 -2
              Final          5344.807                          1

Fractional polynomial fitting algorithm converged after 2 cycles.

Note: This is an iterative process that, after 5 or 6 cycles, produces the final results on the next slide.
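The P column for paternal age can be reproduced from the deviances. The sketch below assumes the standard fractional polynomial comparisons, in which the deviance difference for linear vs. FP2 is referred to a chi-square distribution with 3 degrees of freedom and FP1 vs. FP2 to one with 2 degrees of freedom. The helper `chi2_sf` is written only for this example and handles just df = 1, 2, 3, where closed forms exist.

```python
import math

def chi2_sf(x, df):
    """Upper-tail chi-square probability; closed forms for df = 1, 2, 3 only."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2))
    if df == 2:
        return math.exp(-x / 2)
    if df == 3:
        return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)
    raise ValueError("only df = 1, 2, 3 are handled in this sketch")

# Paternal age deviances from the output above
dev_linear, dev_fp1, dev_fp2 = 5357.821, 5353.602, 5344.807
p_lin_vs_fp2 = chi2_sf(dev_linear - dev_fp2, 3)
p_fp1_vs_fp2 = chi2_sf(dev_fp1 - dev_fp2, 2)
print(round(p_lin_vs_fp2, 3), round(p_fp1_vs_fp2, 3))
```

Both p-values fall below 0.05, which is why the output marks them with "+" and keeps the two-term (3, 3) transformation for paternal age.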

Page 14

Final multivariable fractional polynomial model for _t
--------------------------------------------------------------------
    Variable |    -----Initial-----          -----Final-----
             |   df     Select   Alpha    Status    df    Powers
-------------+------------------------------------------------------
birthwt_c... |    1     1.0000   0.0500     in      1     1
     mat_age |    4     1.0000   0.0500     in      1     1
     pat_age |    4     1.0000   0.0500     in      4     3 3
  mat_height |    4     1.0000   0.0500     in      1     1
tot_pregw... |    4     1.0000   0.0500     in      1     1
mat_prepr... |    4     1.0000   0.0500     in      1     1
     gestage |    4     1.0000   0.0500     in      1     1
--------------------------------------------------------------------

Next we ran fractional polynomials, logging all 44 transformations, and noticed that a quadratic transformation (powers 1 and 2, model 40) is very close to the complicated (3,3) transformation.

Model #    Deviance    Power 1    Power 2
    1      5359.042      -2.0        .
    2      5359.114      -1.0        .
    :         :            :         :
   37      5345.886       2.0       0.5
   38      5345.164       3.0       0.5
   39      5346.321       1.0       1.0
   40      5345.497       2.0       1.0
   41      5344.866       3.0       1.0
   42      5344.859       2.0       2.0
   43      5344.404       3.0       2.0
   44      5344.111       3.0       3.0
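The count of 44 follows from the conventional fractional polynomial power set {-2, -1, -0.5, 0, 0.5, 1, 2, 3}: 8 one-term (FP1) models plus 36 two-term (FP2) models. The sketch below enumerates them in an order that reproduces the model numbers shown above. The power set and ordering are assumptions inferred from the listing rather than stated on the slide, and `fp_terms` is a hypothetical helper that uses the usual conventions (power 0 means ln x; a repeated power p gives x^p and x^p · ln x).

```python
import math

POWERS = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0, 3.0]

fp1 = [(p, None) for p in POWERS]                               # models 1-8
fp2 = [(p1, p2) for p2 in POWERS for p1 in POWERS if p1 >= p2]  # models 9-44
models = fp1 + fp2

def fp_terms(x, p1, p2=None):
    """Transformed covariate values for an FP1 or FP2 model (x must be > 0)."""
    def term(p):
        return math.log(x) if p == 0 else x ** p
    first = term(p1)
    if p2 is None:
        return (first,)
    second = term(p2) * math.log(x) if p2 == p1 else term(p2)
    return (first, second)

print(len(models), models[36], models[39], models[43])
```

In this numbering, model 40 is the pair (2, 1), that is, x and x², the quadratic form that the slide compares with model 44, the (3, 3) transformation.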

Page 15

Cox regression using gestational age and linear paternal age, for comparison with the model using gestational age and quadratic paternal age.

Stratified Cox regr. -- Breslow method for ties

No. of subjects =        94661                  Number of obs   =     94661
No. of failures =          348
Time at risk    =  975210.3499
                                                LR chi2(3)      =      6.00
Log likelihood  =   -3215.7098                  Prob > chi2     =    0.1116

---------------------------------------------------------------------------------
             _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
----------------+----------------------------------------------------------------
birthwt_cat4000 |       1.14        0.16    0.91   0.362       0.86        1.50
        gestage |       0.98        0.03   -0.86   0.391       0.93        1.03
        pat_age |       0.98        0.01   -2.18   0.029       0.96        1.00
---------------------------------------------------------------------------------
                                                              Stratified by study

Stratified Cox regr. -- Breslow method for ties

No. of subjects =        94661                  Number of obs   =     94661
No. of failures =          348
Time at risk    =  975210.3499
                                                LR chi2(4)      =      7.32
Log likelihood  =   -3215.0512                  Prob > chi2     =    0.1201

---------------------------------------------------------------------------------
             _t | Haz. Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
----------------+----------------------------------------------------------------
birthwt_cat4000 |       1.13        0.16    0.89   0.375       0.86        1.49
        gestage |       0.98        0.03   -0.86   0.392       0.93        1.03
        pat_age |       1.07        0.08    0.83   0.409       0.91        1.25
       pat_age2 |       1.00        0.00   -1.11   0.269       1.00        1.00
---------------------------------------------------------------------------------
                                                              Stratified by study

Note: The hazard ratios for birth weight dichotomized at 4,000 grams are very similar between the 2 regression models.

In the final models we used x and x² for paternal age, since that is the correct scale.

Page 16

Multivariable Confounding

• Starting with all confounders in the model, we remove covariates one at a time if they no longer confound the coefficient for birth weight or for the other covariates in the model.
• This procedure is repeated until no further confounder can be dropped, because dropping it would change the other coefficients in the model by more than 15% in either direction.
• This method leaves a parsimonious multivariable model.
• We used the MI dataset to check for multivariable confounding.

Table: Model coefficients and percent change (in red) from the model with all 7 variables (blue) for cancer and birth weight dichotomized at 4,000 g. Already removed from the model are maternal smoking and total pregnancy weight change.
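The elimination procedure can be sketched as a loop. This is one possible reading of the steps above, not the authors' code: `fit` is a hypothetical callable that takes a list of variable names and returns a dict of coefficients, percent changes are measured against the starting model with all covariates (an assumption based on the table caption), and at each pass the covariate whose removal disturbs the remaining coefficients least is dropped first. The toy `toy_fit` exists only to make the example runnable.

```python
def backward_eliminate(fit, exposure, covariates, threshold=15.0):
    """Drop covariates one at a time while no remaining coefficient moves > threshold %."""
    reference = fit([exposure] + list(covariates))   # starting model with all covariates
    kept = list(covariates)
    while kept:
        best = None
        for cand in kept:
            reduced_vars = [exposure] + [v for v in kept if v != cand]
            reduced = fit(reduced_vars)
            worst = max(abs(100.0 * (reduced[v] - reference[v]) / reference[v])
                        for v in reduced_vars)
            if worst <= threshold and (best is None or worst < best[1]):
                best = (cand, worst)
        if best is None:          # every remaining covariate is still a confounder
            break
        kept.remove(best[0])
    return kept

def toy_fit(variables):
    """Hypothetical stand-in for a Cox fit: gestage confounds bw, mat_age does not."""
    coefs = {v: 1.0 for v in variables}
    coefs["bw"] = 0.20 if "gestage" in variables else 0.30
    return coefs

final = backward_eliminate(toy_fit, "bw", ["gestage", "mat_age"])
print(final)
```

In practice `fit` would wrap the stratified Cox regression run on the multiply imputed data and return the pooled coefficients.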