Modeling Correlated/Clustered Multinomial Data
Justin Newcomer
Department of Mathematics and Statistics
University of Maryland, Baltimore County
Probability and Statistics Day, April 28, 2007
Joint Research with Professor Nagaraj K. Neerchal, UMBC and Jorge G. Morel, PhD, P&G Pharmaceuticals, Inc.
Motivation
In the analysis of forest pollen, counts of the frequency of occurrence of different kinds of pollen grains are made at various levels of a sediment core
An attempt is then made to reconstruct the past vegetation changes in the area from which the core was taken
The Multinomial Model
Key assumptions:
Each observation can be classified by exactly one of $k$ possible outcomes, with probabilities $\pi_1, \ldots, \pi_k$
All observations are independent of each other
For the vector of counts $T = (T_1, \ldots, T_k)$ from $m$ observations, with $t_1 + t_2 + \cdots + t_k = m$:
$$\Pr(T = t) = \frac{m!}{t_1!\, t_2! \cdots t_k!}\; \pi_1^{t_1} \pi_2^{t_2} \cdots \pi_k^{t_k}$$
In our example, since each pollen count comes from a cluster of 100 pollen grains, the individual observations within a cluster can be expected to be correlated.
The possible correlations are a violation of the multinomial model assumptions!
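As a minimal sketch, the multinomial probability Pr(T = t) = m!/(t_1! ... t_k!) × π_1^{t_1} ... π_k^{t_k} can be evaluated directly. The function name `multinomial_pmf` and the example counts are hypothetical and not taken from the talk.

```python
from math import factorial, prod


def multinomial_pmf(counts, probs):
    """Pr(T = t) for counts t = (t_1, ..., t_k) and probabilities (pi_1, ..., pi_k)."""
    if len(counts) != len(probs):
        raise ValueError("counts and probs must have the same length")
    m = sum(counts)  # total number of observations in the cluster
    coefficient = factorial(m)
    for t in counts:
        coefficient //= factorial(t)  # stays an exact integer at every step
    return coefficient * prod(p ** t for p, t in zip(probs, counts))


# Hypothetical example: 4 grains, 3 pollen types
print(multinomial_pmf([2, 1, 1], [0.5, 0.25, 0.25]))
```

This formula is only valid when the observations are independent, which is the assumption that clustering is expected to break.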
Motivation
Problem Statement
How can we properly model these data and estimate the proportions of pollen grains?
What are the effects of using the wrong model?
Overdispersion (Extra Variation)
Overview
Data exhibit variances larger than those permitted by the multinomial model
Usually caused by a lack of independence or clustering of experimental units
“Overdispersion is not uncommon in practice. In fact, some would maintain that over-dispersion is the norm in practice and nominal dispersion the exception.” McCullagh and Nelder (1989)
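A small simulation can make the extra variation visible. The sketch below assumes clustered counts drawn from a Dirichlet-multinomial distribution, one common way to induce within-cluster correlation. The function name, the proportions and the value of the intra-cluster correlation `rho_sq` are hypothetical choices for illustration only.

```python
import numpy as np


def simulate_dirichlet_multinomial(n_clusters, m, pi, rho_sq, rng):
    """Draw n_clusters count vectors of size m with intra-cluster correlation rho_sq."""
    pi = np.asarray(pi, dtype=float)
    c = (1.0 - rho_sq) / rho_sq  # Dirichlet precision implied by rho_sq
    # Each cluster gets its own probability vector, scattered around pi
    cluster_probs = rng.dirichlet(c * pi, size=n_clusters)
    return np.array([rng.multinomial(m, p) for p in cluster_probs])


rng = np.random.default_rng(2007)
pi = np.array([0.5, 0.3, 0.2])  # hypothetical pollen proportions
m, rho_sq = 100, 0.1            # cluster size and assumed correlation

counts = simulate_dirichlet_multinomial(20000, m, pi, rho_sq, rng)
empirical_var = counts.var(axis=0, ddof=1)
multinomial_var = m * pi * (1 - pi)  # variance permitted by the multinomial model
variance_ratio = empirical_var / multinomial_var
print(variance_ratio)
```

With these settings the ratio should sit well above 1 for every category, close to 1 + 0.1 × 99 = 10.9, whereas independent multinomial data would give a ratio near 1.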
Overdispersion (Extra Variation)
Usually characterized by the first two moments
The quantity $\{1 + \rho^2(m - 1)\}$ is known as the design effect (Kish, 1965).
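For reference, a sketch of the first two moments as they are commonly written in the multinomial extra-variation literature (for example Morel and Nagaraj, 1993). The notation is an assumption: $T$ is the $k$-vector of counts in a cluster of size $m$, $\pi$ is the vector of category probabilities, and $\rho^2$ is the intra-cluster correlation.

$$E(T) = m\pi, \qquad \operatorname{Var}(T) = m\,\{\operatorname{Diag}(\pi) - \pi\pi^{\top}\}\,\{1 + \rho^2(m - 1)\}$$

As a hypothetical illustration, a cluster of $m = 100$ grains with $\rho^2 = 0.1$ gives a design effect of $1 + 0.1 \times 99 = 10.9$, so every variance and covariance is 10.9 times its multinomial value.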
The parameter $\rho^2$ is known as the “intra class” or “intra cluster” correlation
We use $\rho^2$ to denote a positive intra cluster correlation which
Simulate 5,000 datasets from: Finite Mixture
Likelihood model: Finite Mixture
Calculate the parameter estimate and its SE under the FM model. Calculate the determinant of the estimated inverse Fisher information matrix (FIM)
Likelihood model: Dirichlet Multinomial
Calculate the parameter estimate and its SE under the DM model. Calculate the determinant of the estimated inverse FIM
After each simulation, we calculate the average of the determinants from each model
A comparison of these averages gives us insight as to which model may be more efficient
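The data-generating step of such a study can be sketched as follows. This assumes the finite mixture representation of Morel and Nagaraj (1993), T = Y·N + (X | N), with Y ~ Multinomial(1, π), N ~ Binomial(m, ρ) and X | N ~ Multinomial(m − N, π). The function name and parameter values are hypothetical, and the model-fitting step is not shown.

```python
import numpy as np


def simulate_finite_mixture(n_clusters, m, pi, rho, rng):
    """Draw clustered multinomial counts via T = Y*N + (X | N)."""
    pi = np.asarray(pi, dtype=float)
    # N: number of units in each cluster that share one common category
    n_shared = rng.binomial(m, rho, size=n_clusters)
    # Y: the common category of each cluster, drawn with probabilities pi
    shared_category = rng.choice(len(pi), size=n_clusters, p=pi)
    # X | N: the remaining m - N units are independent multinomial draws
    counts = np.array([rng.multinomial(m - n, pi) for n in n_shared])
    counts[np.arange(n_clusters), shared_category] += n_shared
    return counts


rng = np.random.default_rng(428)
pi = np.array([0.5, 0.3, 0.2])  # hypothetical proportions
m, rho = 100, 0.7               # cluster size and assumed rho
fm_counts = simulate_finite_mixture(20000, m, pi, rho, rng)
fm_ratio = fm_counts.var(axis=0, ddof=1) / (m * pi * (1 - pi))
print(fm_counts.mean(axis=0), fm_ratio)
```

Under this representation the category means stay at mπ while the variance ratio approaches the design effect, here 1 + 0.49 × 99 = 49.51.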
Maximum Likelihood Estimation
Simulation Study
The Joint Asymptotic Relative Efficiency (JARE) can be used to summarize the simulation results as it indicates which estimate would have a smaller asymptotic variance
For a vector parameter, JARE is the ratio of the determinants of the asymptotic variance-covariance matrices
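The determinant comparison can be sketched in a few lines. The function names are hypothetical, and the orientation of the ratio (which model goes in the numerator) is an assumption, since the text does not fix it. Some definitions also take a root of this ratio, which is not done here.

```python
import numpy as np


def mean_determinant(cov_matrices):
    """Average determinant over the estimated covariance matrices of one model."""
    return float(np.mean([np.linalg.det(c) for c in cov_matrices]))


def jare(cov_numerator, cov_denominator):
    """Ratio of determinants of two asymptotic variance-covariance matrices.

    Assumed orientation: a value above 1 means the denominator model
    has the smaller generalized variance.
    """
    return float(np.linalg.det(cov_numerator) / np.linalg.det(cov_denominator))


# Hypothetical 2x2 covariance matrices for two competing models
cov_model_a = np.array([[2.0, 0.0], [0.0, 2.0]])
cov_model_b = np.array([[1.0, 0.0], [0.0, 1.0]])
print(jare(cov_model_a, cov_model_b))
print(mean_determinant([cov_model_a, cov_model_b]))
```

The determinant acts as a single-number summary of a covariance matrix (the generalized variance), which is why it can rank models when the parameter is a vector.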
Value of    Simulated Data From
0.7    FM    2.20322    2.28815    2.60584    2.67401
       DM    2.13496    2.19185    3.52726    3.48980
Conclusions
If we observe correlated/clustered multinomial data, use of the naïve multinomial model causes the standard errors to be underestimated, which leads to erroneous inferences and inflated Type-I error rates
If the data truly come from a Finite Mixture distribution, then estimation using this model clearly outperforms the Dirichlet Multinomial in terms of efficiency
If we are unsure of the distribution, the FM model may underestimate the standard errors and the Dirichlet Multinomial model provides a safe alternative
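To illustrate the first conclusion, a minimal sketch comparing the naïve multinomial standard error of a pooled proportion with one inflated by the design effect {1 + ρ²(m − 1)}. It assumes independent clusters of equal size m. The function names and the numbers (20 clusters of 100 grains, proportion 0.3, ρ² = 0.1) are hypothetical.

```python
from math import sqrt


def naive_se(p, n_clusters, m):
    """Standard error of a pooled proportion if all n_clusters * m units were independent."""
    return sqrt(p * (1 - p) / (n_clusters * m))


def adjusted_se(p, n_clusters, m, rho_sq):
    """Standard error after inflating the variance by the design effect."""
    design_effect = 1 + rho_sq * (m - 1)
    return naive_se(p, n_clusters, m) * sqrt(design_effect)


# Hypothetical values, for illustration only
p, n_clusters, m, rho_sq = 0.3, 20, 100, 0.1
print(naive_se(p, n_clusters, m), adjusted_se(p, n_clusters, m, rho_sq))
```

With these numbers the adjusted standard error is about 3.3 times the naïve one, so a confidence interval built from the naïve value would be far too narrow.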
Future Work
Extension to Include Covariates
Covariates can be included and linked to the model parameters through “link” functions as in the Generalized Linear Model (GLM) framework
Obtain the expressions for the efficiency of likelihood models relative to generalized estimating equations (GEE)
Simulation Study
Use simulations to see if gains in efficiency of the likelihood models can be achieved over GEE
Does the inclusion of covariates change our conclusions? Does the choice of link function have an influence?
References
Cox, D.R. and Snell, E.J. (1989) Analysis of Binary Data. 2nd Ed. New York: Chapman and Hall.
Kish, L. (1965) Survey Sampling. New York: John Wiley & Sons.
Liang, K.Y. and Zeger, S.L. (1986) “Longitudinal data analysis using generalized linear models.” Biometrika 73: 13-22.
McCullagh, P. and Nelder, J.A. (1989) Generalized Linear Models. 2nd Ed. London: Chapman and Hall.
Morel, J.G. and Nagaraj, N.K. (1993) “A finite mixture distribution for modelling multinomial extra variation.” Biometrika 80: 363-371.
Mosimann, J.E. (1962) “On the compound multinomial distribution, the multivariate β-distribution, and correlations among proportions.” Biometrika 49: 65-82.
Neerchal, N.K. and Morel, J.G. (1998) “Large cluster results for two parametric multinomial extra variation models.” Journal of the American Statistical Association 93: 1078-1087.
Wedderburn, R.W.M. (1974) “Quasi-likelihood functions, generalized linear models, and the Gauss-Newton method.” Biometrika 61: 439-447.
Zeger, S.L. and Liang, K.Y. (1986) “Longitudinal data analysis for discrete and continuous outcomes.” Biometrics 42: 121-130.