Likelihood-Based Finite Mixture Models for Ordinal Data
Daniel Fernández, Fundació Sant Joan de Déu ⇒ Universitat Politècnica de Catalunya, [email protected]
Seminari del Servei d'Estadística Aplicada & Grup de Recerca Advanced Stochastic Modelling, Universitat Autònoma de Barcelona, Feb 27th, 2020
1. Ordinal Data and Goal

- Note: no covariates available.
- For example, questionnaires to assess the level of depression:
  - n = 13 questions (rows).
  - m = 151 individuals (columns).
  - q = 4 categories: 1 to 4, with higher scores indicating higher levels of depression.
- Goals:
  - Can we group patients/questions together?
  - Which questions or patients tend to be linked with higher values of the ordinal response?
1. Motivation

- Minimal research on clustering methods focusing on ordinal data.
- Most current methods are based on mathematical techniques (e.g. distance-based algorithms) ⇒ neither statistical inference nor model selection.
- Recent work (Fernández et al., 2016): fuzzy biclustering via finite mixture models for ordinal data ⇒ statistical inference and model selection.
1. Motivation

[Figure: Gaussian mixture clustering example. Source: David Sontag, NYU]

- Clusters may overlap.
- Some clusters may be "wider" than others.
- Distances can be deceiving!
- Try a probabilistic model:
  - allows overlaps
  - allows clusters of different size
  - allows a soft/fuzzy clustering
1. Motivation

[Figure: Hard clustering vs. fuzzy clustering]
1. Model-based clustering

- Model-based clustering: the process of clustering via statistical models, typically Finite Mixture Models (FMM).
- Finite mixture models: a way of clustering in order to reduce dimensionality and identify patterns related to the heterogeneity of the data (e.g. rows/columns with a similar effect on the response).

[Figure: mixture density; the red line is what we observe.]

- Our research: model-based clustering for ordinal data, with components within the FMM ⇒ stereotype model.
1. Stereotype Model. Formulation

- Stereotype model (Anderson, J. A., 1984):

  log( P[y_ij = k | x] / P[y_ij = 1 | x] ) = μ_k + (φ_k β′)x,  k = 2, …, q

  These are the q − 1 log-odds of category k versus category 1, with the first category as the baseline.
- β: the predictor parameter for the covariates is assumed to be the same for all categories.
- φ_k: "score" for the response category k.
- Nothing in the stereotype model itself treats the response as ordinal.
- Including an increasing order constraint (Anderson, J. A., 1984):

  0 = φ_1 ≤ φ_2 ≤ ⋯ ≤ φ_q = 1,

  captures the ordinal nature of the outcomes.
- The model has received more attention since Agresti (2010, Ch. 4) discussed it in his book.
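These log-odds determine the full set of category probabilities via a softmax over the linear predictors μ_k + φ_k β′x. A minimal numeric sketch (the parameter values below are hypothetical illustrations, not fitted values from the talk; Python/NumPy is used for illustration, although the software discussed later is in R and STATA):

```python
import numpy as np

def stereotype_probs(mu, phi, beta, x):
    """Response probabilities of the stereotype model for one covariate
    vector x.  mu and phi are length-q vectors with mu[0] = phi[0] = 0
    (baseline category); under the order constraint phi is increasing
    with phi[-1] = 1."""
    eta = mu + phi * (beta @ x)      # linear predictor per category
    p = np.exp(eta - eta.max())      # numerically stabilised softmax
    return p / p.sum()

# Hypothetical parameters for q = 4 categories, one covariate
mu = np.array([0.0, 0.5, 0.2, -0.3])
phi = np.array([0.0, 0.4, 0.8, 1.0])
beta = np.array([1.2])
x = np.array([0.7])
p = stereotype_probs(mu, phi, beta, x)

# The log-odds of category k vs the baseline recover mu_k + phi_k * beta'x
k = 2
print(np.log(p[k] / p[0]), mu[k] + phi[k] * (beta @ x))
```

The two printed values agree, confirming that the softmax form and the baseline-logit form of the model are the same.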
1. Stereotype Model. Scores φ_k Interpretation

Use the fitted score parameters φ_k to determine the spacing among categories.

- If φ_a = φ_b ⇒ the logit is the constant μ_a − μ_b
  ⇒ the covariates x do not distinguish between categories a and b
  ⇒ we could collapse categories a and b in our data.
1. Stereotype Model. Software

- Stereotype model:
  - STATA module SOREG (Lunt, 2001)
  - R package ordinalgmifs (Archer et al., 2014)
  - R package VGAM (Yee, 2008) — not able to add the monotonic constraint on the scores
  - R package clustord (Fernández and Ryan, soon on CRAN)
1. Stereotype Model. Main effects

- Build up β′x considering row and column effects on y_ij (Fernández et al., 2016).
- Main effects model:

  log( P[y_ij = k] / P[y_ij = 1] ) = μ_k + φ_k(α_i + β_j),
  k = 2, …, q,  i = 1, …, n,  j = 1, …, m

- α_i: interpreted as the effect of the rows.
- β_j: interpreted as the effect of the columns.
- Identifiability constraints: Σ_i α_i = Σ_j β_j = 0, μ_1 = 0, and 0 = φ_1 ≤ φ_2 ≤ ⋯ ≤ φ_q = 1.
1. Model-based clustering

- The main effects model has 2q + n + m − 5 independent parameters:

  log( P[y_ij = k] / P[y_ij = 1] ) = μ_k + φ_k(α_i + β_j),
  k = 2, …, q,  i = 1, …, n,  j = 1, …, m

- Avoid α_i + β_j overspecifying the data structure ⇒ clustering via finite mixture models in order to reduce dimensionality (McLachlan, G. and Peel, D., 2000).
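The parameter count can be checked directly: μ_2, …, μ_q give q − 1 free parameters (μ_1 = 0), the scores give q − 2 (φ_1 = 0 and φ_q = 1 are fixed), and the sum-to-zero constraints leave n − 1 row and m − 1 column effects. A quick sketch (the q, n, m values are illustrative):

```python
def npar_main_effects(q, n, m):
    """Independent parameters of the main-effects stereotype model:
    mu_2..mu_q (q-1), free scores phi_2..phi_{q-1} (q-2),
    row effects under sum-to-zero (n-1), column effects (m-1)."""
    return (q - 1) + (q - 2) + (n - 1) + (m - 1)

# Illustrative dimensions: q = 4 categories, n = 10 rows, m = 20 columns
print(npar_main_effects(4, 10, 20))  # 2q + n + m - 5 = 33
```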
1. Model-based clustering - Column clustering

For example, column clustering: we change from the main effects model

log( P[y_ij = k] / P[y_ij = 1] ) = μ_k + φ_k(α_i + β_j),  j = 1, …, m

to

log( P[y_ij = k | j ∈ c] / P[y_ij = 1 | j ∈ c] ) = μ_k + φ_k(α_i + β_c),  c = 1, …, C < m

where β_c is interpreted as the effect of the column cluster c.
1. Model-based clustering. Biclustering

- General formulation of model-based clustering (biclustering):

  log( P[y_ij = k | i ∈ r, j ∈ c] / P[y_ij = 1 | i ∈ r, j ∈ c] ) = μ_k + φ_k(α_r + β_c),  k = 2, …, q

- α_r: interpreted as the effect of the row cluster r.
- β_c: interpreted as the effect of the column cluster c.
- Constraints: α_1 = β_1 = 0 (or Σ_r α_r = Σ_c β_c = 0) and 0 = φ_1 ≤ φ_2 ≤ ⋯ ≤ φ_q = 1.
- The formulation is similar to a latent class model.
- Further, α_r + β_c can be extended to α_r + β_c + γ_rc.
- The model provides a simultaneous fuzzy clustering of the rows and columns.
1. Model-based clustering - Column clustering

The main effects stereotype model has likelihood

L(Ω | {y_ij}) = ∏_{i=1}^{n} ∏_{j=1}^{m} ∏_{k=1}^{q} ( P[y_ij = k] )^{I[y_ij = k]}

and with the column clustering model it turns into

L(Ω | {y_ij}) = ∏_{j=1}^{m} [ Σ_{c=1}^{C} κ_c ∏_{i=1}^{n} ∏_{k=1}^{q} ( P[y_ic = k] )^{I[y_ij = k]} ]

where κ_c is the proportion of columns in column group c.
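The column-clustering likelihood sums over cluster memberships per column. A hedged sketch of evaluating its logarithm (the category probabilities and mixing proportions below are made up for illustration; a real fit would obtain them from the stereotype model parameters):

```python
import numpy as np

def column_mixture_loglik(Y, theta, kappa):
    """Log-likelihood of the column-clustering mixture.
    Y: n x m matrix of responses coded 1..q.
    theta: C x n x q array, theta[c, i, k-1] = P[y_ic = k] for a column
           in cluster c.  kappa: mixing proportions over the C clusters.
    Each column j contributes log sum_c kappa_c prod_i P[y_ic = y_ij]."""
    n, m = Y.shape
    C = len(kappa)
    loglik = 0.0
    for j in range(m):
        # per-cluster log( kappa_c * prod_i theta ), kept on log scale
        logp = np.array([
            np.log(kappa[c]) + sum(np.log(theta[c, i, Y[i, j] - 1])
                                   for i in range(n))
            for c in range(C)
        ])
        # log-sum-exp over clusters for numerical stability
        loglik += np.log(np.exp(logp - logp.max()).sum()) + logp.max()
    return loglik

# Tiny illustrative example: n = 2 rows, q = 2 categories, C = 2 clusters
rng = np.random.default_rng(0)
Y = rng.integers(1, 3, size=(2, 5))
theta = np.array([[[0.8, 0.2], [0.7, 0.3]],
                  [[0.3, 0.7], [0.2, 0.8]]])
kappa = np.array([0.5, 0.5])
ll = column_mixture_loglik(Y, theta, kappa)
print(ll)
```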
1. Model-based clustering

- Problem: missing information.
- We do not know the actual cluster membership of the columns (rows), nor the number of column (row) clusters.
2. Model fitting

- EM algorithm for finding the ML solution for the parameters of models with missing information (the actual unknown cluster membership of each row and column).
- Information criteria (AIC, BIC, …).
- Comprehensive simulation study (4500 scenarios) testing 12 information criteria (Fernández and Arnold, 2016).
2. Model fitting

Table: Information criteria summary table

Criteria | Definition                                               | Proposed for      | Depending on
AIC      | −2ℓ + 2K                                                 | Regression models | Number of parameters
AICc     | AIC + 2K(K+1)/(n−K−1)                                    | Regression models | Number of parameters and sample size
AICu     | AICc + n log(n/(n−K−1))                                  | Regression models | Number of parameters and sample size
CAIC     | −2ℓ + K(1 + log(n))                                      | Regression models | Number of parameters and sample size
BIC      | −2ℓ + K log(n)                                           | Regression models | Number of parameters and sample size
AIC3     | −2ℓ + 3K                                                 | Clustering        | Number of parameters
CLC      | −2ℓ + 2EN(R)                                             | Clustering        | Entropy
NEC(R)   | EN(R) / (ℓ(R) − ℓ(1))                                    | Clustering        | Entropy
ICL-BIC  | BIC + 2EN(R)                                             | Clustering        | Number of parameters, sample size and entropy
AWE      | −2ℓ_c + 2K(3/2 + log(n))                                 | Clustering        | Number of parameters, sample size and entropy
L        | −ℓ − (K/2) Σ log(nπ_R/12) − (R/2) log(n/12) − R(K+1)/2   | Clustering        | Number of parameters, sample size and mixing proportions

Notes: n represents the sample size, K the number of parameters, R the number of clusters, π_R the mixing cluster proportions, ℓ the log-likelihood and EN(·) the entropy function.
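A few of the simpler criteria in the table can be computed directly from a fitted model's maximised log-likelihood. A sketch (the log-likelihood value below is hypothetical, not a result from the talk):

```python
import math

def information_criteria(loglik, K, n):
    """A subset of the criteria from the table.  loglik is the maximised
    log-likelihood, K the number of parameters, n the sample size."""
    aic = -2 * loglik + 2 * K
    aicc = aic + 2 * K * (K + 1) / (n - K - 1)
    bic = -2 * loglik + K * math.log(n)
    aic3 = -2 * loglik + 3 * K
    caic = -2 * loglik + K * (1 + math.log(n))
    return {"AIC": aic, "AICc": aicc, "BIC": bic, "AIC3": aic3, "CAIC": caic}

# Hypothetical fit: log-likelihood -215.8 with K = 5 parameters, n = 151
ic = information_criteria(-215.8, 5, 151)
print(ic)
```

Lower values indicate a better trade-off between fit and complexity; BIC and CAIC penalise extra parameters more heavily than AIC once n > e².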
2. Model fitting. Simulation study

- Simulated data with a known true number of row clusters.
- General results: the percentage of cases in which each criterion determines the true number of row clusters (Fit).

2. Model fitting. One-dimensional Clustering

Table: Top 5. Overall results. One-dimensional clustering (table omitted)

- Comprehensive simulation study (4500 scenarios) testing 12 information criteria (Fernández and Arnold, 2016) ⇒ AIC is the best criterion.
2. Model fitting

- Two possible Bayesian approaches:
  - "Fixed" dimension: Metropolis-Hastings and Gibbs sampler.
  - Variable dimension: Reversible Jump MCMC (RJMCMC; Green, P. J., 1995).
- RJMCMC ⇒ the number of components (the dimension) is itself a parameter.
- Convergence diagnostic: the Castelloe and Zimmerman method.
2. Model fitting packages

- Model-based clustering for ordinal data:
  - R package clustord (Fernández and Ryan, soon on CRAN): https://github.com/vuw-clustering/clustord
- Model-based clustering for mixed-type data:
  - R package clustMD (McParland and Gormley, 2017)
3. Example. Level of depression data set

- Patients admitted for deliberate self-harm at the medical departments of 3 major hospitals in Eastern Norway.
- Questionnaire designed to assess the level of depression.
- 13 questions (rows), 151 patients (columns).
- Ordinal data: 4 categories, from 1 (lower level) to 4 (higher level).
- For instance, "Sadness":

  y_ij = 1  I do not feel sad
         2  I feel sad most of the time
         3  I am sad all the time
         4  I am so sad or unhappy that I can't stand it

- Possible research questions:
  - Can we group patients/questions together?
  - Which questions or patients are similar?
  - Which questions or patients tend to be linked with higher values of the ordinal response?
3. Results. Model Fitting - EM algorithm

Table: Level of Depression. Model Fitting (1/3)

Model          | Predictor            | R | C | npar | AIC    | AICc   | BIC    | ICL-BIC
Null effects   | μ_k + φ_k            | 1 | 1 | 5    | 441.63 | 441.81 | 460.71 | 460.71
Row effects    | μ_k + φ_k α_i        | n | 1 | 16   | 428.81 | 430.52 | 489.89 | 489.89
Column effects | μ_k + φ_k β_j        | 1 | m | 32   | 463.85 | 470.82 | 586.00 | 586.00
Main effects   | μ_k + φ_k(α_i + β_j) | n | m | 43   | 422.54 | 421.50 | 547.67 | 547.67

- Best AIC model: column clustering model with C = 3 groups of patients.
3. Results. Common Visualisation Tools
Figure: Level of Depression: Column Clustering with C=3 patient groups
3. Results. Common Visualisation Tools
Figure: Level of Depression C=3: Distribution in each group
The proportion of individuals in clusters that had at least one episode of DSH (deliberate self-harm, a predictor of suicide; Hawton et al., 2013) within 3 months is: 3.4%, 16%, and 28%.
3. Results. More Visualisation Tools

Use the fitted score parameters φ_k to determine the spacing among categories.

- Levels 3 and 4 are very similar: φ_4 − φ_3 = 1 − 0.852 = 0.148.
3. Results. More Visualisation Tools. Fuzziness

Figure: Contour plot depicting the fuzzy clustering structure with C = 3 patient clusters: the probability that two patients are classified in the same cluster. The left panel is unsorted; in the right panel both axes are sorted by patient cluster.
4. Bayesian Inference

Bayesian Inference Approach

4. Developing RJMCMC. DAG

Figure: Directed acyclic graph: Hierarchical Stereotype Mixture Model. One-dimensional clustering. "TrGeometric" refers to a truncated Geometric distribution.
4. Developing RJMCMC. Split Step

- Split and Merge moves involve α_R and π_R.
- Moves have to be reversible and keep the constraints (Σ_{r=1}^{R} α_r = 0, Σ_{r=1}^{R} π_r = 1).
- Split move:
  1. Draw u_1, u_2 ∼ U(0, 1) and one r ∈ {1, …, R}.
  2. New parameters:

     α_r^(t) = u_1 α_r^(t−1),    α_{r+1}^(t) = (1 − u_1) α_r^(t−1)
     π_r^(t) = u_2 π_r^(t−1),    π_{r+1}^(t) = (1 − u_2) π_r^(t−1)

  3. Increase R by 1.
  4. Relabel r + 1, …, R as r + 2, …, R + 1.
4. Developing RJMCMC. Merge Step

- Split and Merge moves involve α_R and π_R.
- Moves have to be reversible and keep the constraints (Σ_{r=1}^{R} α_r = 0, Σ_{r=1}^{R} π_r = 1).
- Merge move:
  1. Draw one random component r ∈ {1, …, R − 1}.
  2. Select the adjacent component r + 1.
  3. New parameters:

     α_r^(t) = α_r^(t−1) + α_{r+1}^(t−1)
     π_r^(t) = π_r^(t−1) + π_{r+1}^(t−1)

  4. Reduce R by 1.
  5. Relabel r + 2, …, R as r + 1, …, R − 1.
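The bookkeeping of these two moves can be sketched in a few lines: because the two children of a split sum to their parent, Σα = 0 and Σπ = 1 are preserved automatically, and the merge exactly reverses the split. (This sketch covers only the parameter proposals, not the acceptance probability of the RJMCMC sampler; the numeric values are illustrative.)

```python
import numpy as np

def split_move(alpha, pi, r, u1, u2):
    """RJMCMC split proposal: component r becomes components r and r+1.
    The children sum to the parent, so sum(alpha)=0 and sum(pi)=1 hold."""
    alpha = np.insert(alpha, r + 1, (1 - u1) * alpha[r])
    alpha[r] *= u1
    pi = np.insert(pi, r + 1, (1 - u2) * pi[r])
    pi[r] *= u2
    return alpha, pi

def merge_move(alpha, pi, r):
    """RJMCMC merge proposal: components r and r+1 collapse into one
    (the reverse of the split)."""
    alpha[r] += alpha[r + 1]
    pi[r] += pi[r + 1]
    return np.delete(alpha, r + 1), np.delete(pi, r + 1)

alpha = np.array([-0.5, 0.2, 0.3])
pi = np.array([0.3, 0.3, 0.4])
a2, p2 = split_move(alpha.copy(), pi.copy(), r=1, u1=0.6, u2=0.25)
print(a2.sum(), p2.sum())                           # constraints preserved
a3, p3 = merge_move(a2.copy(), p2.copy(), r=1)
print(np.allclose(a3, alpha), np.allclose(p3, pi))  # merge undoes the split
```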
4. Example. Level of depression data set. RJMCMC
Figure: Level of Depression: Dimension (Column) visits
4. Example. Level of depression data set. RJMCMC
Figure: Level of Depression C=3: Distribution in each group
5. Summary. Conclusions

- Clustering rows (columns) of ordinal data allows us to:
  - Describe the data with fewer parameters than current methods.
  - Identify similar rows (i.e. questions) and/or similar columns (i.e. subjects).
  - Find an a posteriori classification.
- Likelihood-based stereotype models ⇒ inference and model comparison.
- The fitted score parameters φ_k give the spacing among the ordinal categories, dictated by the data.
- Data visualisation tools for ordinal clustered data: spaced mosaic plots, fuzziness plots.
- Model fitting ⇒ EM algorithm (AIC), RJMCMC (number of cluster components as a parameter).
References

- Anderson, J. A. (1984). Regression and ordered categorical variables. JRSS Series B, 46(1):1-30.
- Castelloe, J. and Zimmerman, D. (2002). Convergence assessment for RJMCMC samplers. Technical Report 313, SAS Institute, Cary, North Carolina.
- Fernández, D., Pledger, S. and Arnold, R. (2014). Introducing spaced mosaic plots. Research Report Series, ISSN 1174-2011, 14-3, MSOR, VUW.
- Fernández, D., Arnold, R. and Pledger, S. (2016). Mixture-based clustering for the ordered stereotype model. CSDA, 93:46-75.
- Green, P. J. (1995). Reversible jump MCMC computation and Bayesian model determination. Biometrika, 82:711-732.
- McLachlan, G. and Peel, D. (2000). Finite Mixture Models. Wiley Series in Probability and Statistics.
- Pledger, S. and Arnold, R. (2014). Multivariate methods using mixtures: Correspondence analysis, scaling and pattern-detection. CSDA.
- Stephens, M. (2000). Dealing with label switching in mixture models. JRSS Series B, 62:795-809.
Thank you
Thank you for listening!
Extra Slides
1. Stereotype Model. Response Probabilities

The stereotype model is also described in terms of the response probabilities

P[y_ij = k | x] = exp(μ_k + φ_k(β′x)) / Σ_{ℓ=1}^{q} exp(μ_ℓ + φ_ℓ(β′x)),  k = 1, …, q,

where the probability for the baseline category follows from μ_1 = φ_1 = 0:

P[y_ij = 1 | x] = 1 / Σ_{ℓ=1}^{q} exp(μ_ℓ + φ_ℓ(β′x)).
Stereotype model reformulated as adjacent-categories logits

log( P[y_ij = k | x] / P[y_ij = k + 1 | x] ) = (μ_k − μ_{k+1}) + (φ_k − φ_{k+1})β′x = η_k + ϑ_k β′x,  k = 1, …, q − 1,

where
η_k = μ_k − μ_{k+1},  k = 1, …, q − 1,

and the relation between φ_k and ϑ_k is defined by

ϑ_k = φ_k − φ_{k+1},  k = 1, …, q − 1,

so that (using φ_q = 1)

φ_k = 1 + Σ_{t=k}^{q−1} ϑ_t,  k = 1, …, q − 1.

The adjacent-categories logit model is a particular case of the ordered stereotype model in which ϑ_k is constant (i.e., the φ_k are fixed and equally spaced).
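The adjacent-categories identity can be verified numerically: the logit of category k against k + 1 computed from the response probabilities equals (μ_k − μ_{k+1}) + (φ_k − φ_{k+1})β′x. A sketch with hypothetical parameter values:

```python
import numpy as np

# Hypothetical stereotype parameters for q = 4 categories
mu = np.array([0.0, 0.5, 0.2, -0.3])
phi = np.array([0.0, 0.4, 0.8, 1.0])
lp = 0.84  # a hypothetical value of the linear predictor beta'x

# Response probabilities via the softmax form of the model
eta_cat = mu + phi * lp
p = np.exp(eta_cat)
p /= p.sum()

# Adjacent-categories logit for k = 2 (0-based index 1)
k = 1
lhs = np.log(p[k] / p[k + 1])
rhs = (mu[k] - mu[k + 1]) + (phi[k] - phi[k + 1]) * lp
print(np.isclose(lhs, rhs))  # True
```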
Weighted average of the fitted scores

- Fitted response probabilities with the estimated parameters over the R row groups and the q categories:

  P[y_ij = k | i ∈ r] = exp(μ_k + φ_k(α_r + β_j)) / Σ_{ℓ=1}^{q} exp(μ_ℓ + φ_ℓ(α_r + β_j)),
  i = 1, …, n,  j = 1, …, m,  k = 1, …, q,  r = 1, …, R.

- Weighted average over the q categories for each row cluster:

  ȳ_ij^(r) = Σ_{k=1}^{q} k × P[y_ij = k | i ∈ r],  i = 1, …, n,  j = 1, …, m,  r = 1, …, R.

- Weighted average using the fitted conditional probabilities ẑ_ir:

  ȳ_ij = Σ_{r=1}^{R} ẑ_ir × ȳ_ij^(r),  i = 1, …, n,  j = 1, …, m.

- Mean of ȳ_ij over the m columns:

  ȳ_i. = (1/m) Σ_{j=1}^{m} ȳ_ij,  i = 1, …, n.
1. Finite Mixtures with Stereotype Model. Example: EM - Row clustering

Define the unknown group memberships as latent variables Z_ir = I[i ∈ r] (i = 1, …, n, r = 1, …, R), which satisfy Σ_{r=1}^{R} Z_ir = 1 and (Z_i1, …, Z_iR) ∼ Mult(1; π_1, …, π_R).

E-Step: the indicator latent variables fulfil the convenient identity ∏_{r=1}^{R} a_r^{Z_ir} = Σ_{r=1}^{R} a_r Z_ir for any a_r ≠ 0, so the complete-data log-likelihood is

ℓ_c(Ω | {y_ij}, {Z_ir}) = Σ_{i=1}^{n} Σ_{r=1}^{R} Z_ir log(π_r) + Σ_{i=1}^{n} Σ_{j=1}^{m} Σ_{k=1}^{q} Σ_{r=1}^{R} Z_ir I(y_ij = k) log(θ_rjk),

where Ω denotes the parameters, θ_rjk = P[y_ij = k | i ∈ r], and the E-step replaces Z_ir by Ẑ_ir = E[Z_ir | {y_ij}].
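The E-step expectation Ẑ_ir is the posterior probability that row i belongs to cluster r, computed by Bayes' rule from π_r and θ_rjk. A hedged sketch (the θ values below are made up; in a real fit they come from the current stereotype parameter estimates):

```python
import numpy as np

def e_step(Y, theta, pi):
    """Posterior row-cluster memberships Zhat[i, r] = E[Z_ir | data].
    Y: n x m responses coded 1..q.
    theta: R x m x q array, theta[r, j, k-1] = P[y_ij = k | i in r].
    pi: mixing proportions over the R row clusters."""
    n, m = Y.shape
    R = len(pi)
    logw = np.empty((n, R))
    for i in range(n):
        for r in range(R):
            logw[i, r] = np.log(pi[r]) + sum(
                np.log(theta[r, j, Y[i, j] - 1]) for j in range(m))
    # normalise on the log scale for numerical stability
    logw -= logw.max(axis=1, keepdims=True)
    Z = np.exp(logw)
    return Z / Z.sum(axis=1, keepdims=True)

# Tiny illustration: n = 4 rows, m = 6 columns, q = 2, R = 2 clusters
rng = np.random.default_rng(2)
Y = rng.integers(1, 3, size=(4, 6))
theta = np.array([[[0.8, 0.2]] * 6, [[0.2, 0.8]] * 6])
Z = e_step(Y, theta, np.array([0.5, 0.5]))
print(Z.sum(axis=1))  # each row of Zhat sums to 1
```

The M-step would then maximise the expected complete-data log-likelihood with these Ẑ_ir as weights.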