Multivariate Parametric Analysis of Interval Data
Paula Brito
Fac. Economia & LIAAD-INESC TEC, Universidade do Porto
ECI 2015 - Buenos Aires
T3: Symbolic Data Analysis: Taking Variability in Data into Account
Joint work with A. Pedro Duarte Silva and José G. Dias
Most existing methods: non-parametric descriptive approaches
Our goal: parametric inference methodologies → probabilistic models for interval variables
For each unit $s_i$, $Y_j(s_i) = I_{ij} = [l_{ij}, u_{ij}]$ is naturally defined by the lower and upper bounds $l_{ij}$ and $u_{ij}$
For modeling purposes → an equivalent parametrization is preferable:
Represent $Y_j(s_i)$ by
the midpoint $c_{ij} = \frac{l_{ij} + u_{ij}}{2}$
the range $r_{ij} = u_{ij} - l_{ij}$
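The change of parametrization above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `to_midpoint_logrange` and `to_bounds` are hypothetical, and the log of the range is used because the models that follow work with log-ranges.

```python
import math

def to_midpoint_logrange(lower, upper):
    """Map an interval [lower, upper] to (midpoint, log of the range)."""
    if upper <= lower:
        # The log-range is only defined for a strictly positive range.
        raise ValueError("upper bound must exceed lower bound")
    midpoint = (lower + upper) / 2.0
    log_range = math.log(upper - lower)
    return midpoint, log_range

def to_bounds(midpoint, log_range):
    """Inverse map: recover (lower, upper) from (midpoint, log-range)."""
    half_range = math.exp(log_range) / 2.0
    return midpoint - half_range, midpoint + half_range

# Illustrative interval (made-up values, not taken from the data sets).
c, r_star = to_midpoint_logrange(2.0, 6.0)
lo, hi = to_bounds(c, r_star)
```

The two representations carry the same information, so nothing is lost by modelling the pair (midpoint, log-range) instead of the bounds.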
P. Brito ECI - Buenos Aires - July 2015
Outline
Interval Data
Models for interval data
(M)ANOVA
Discriminant Analysis
Model-based Clustering
R Package
Conclusions
Parametric Models for interval data
Gaussian model:
Assume that the joint distribution of the midpoints C and the logs of the ranges R is multivariate Normal:
$R^* = \ln(R), \quad (C, R^*) \sim N_{2p}(\mu, \Sigma)$
$\mu = [\mu_C^t, \mu_{R^*}^t]^t; \quad \Sigma = \begin{pmatrix} \Sigma_{CC} & \Sigma_{CR^*} \\ \Sigma_{R^*C} & \Sigma_{R^*R^*} \end{pmatrix}$
$\mu_C$ and $\mu_{R^*}$ - p-dimensional column vectors of the mean values
$\Sigma_{CC}$, $\Sigma_{CR^*}$, $\Sigma_{R^*C}$ and $\Sigma_{R^*R^*}$ - p × p matrices
Model advantage: straightforward application of classical inference methods
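As an illustration of how classical inference applies directly, the sketch below computes the maximum likelihood estimates of the mean vector and covariance matrix of (C, R*) under the unrestricted Gaussian model and splits the covariance into its four p × p blocks. It assumes NumPy is available; the function name `fit_gaussian_interval_model` and the dictionary keys are hypothetical.

```python
import numpy as np

def fit_gaussian_interval_model(lower, upper):
    """ML estimates of mu and Sigma for (C, R*) from n x p arrays of bounds."""
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    C = (lower + upper) / 2.0            # midpoints, n x p
    R_star = np.log(upper - lower)       # log-ranges, n x p
    Z = np.hstack([C, R_star])           # stacked data, n x 2p
    mu = Z.mean(axis=0)
    centered = Z - mu
    Sigma = centered.T @ centered / Z.shape[0]   # ML estimate (divisor n)
    p = C.shape[1]
    blocks = {
        "CC": Sigma[:p, :p],   # midpoints with midpoints
        "CR": Sigma[:p, p:],   # midpoints with log-ranges
        "RC": Sigma[p:, :p],   # log-ranges with midpoints
        "RR": Sigma[p:, p:],   # log-ranges with log-ranges
    }
    return mu, Sigma, blocks
```

The divisor n (rather than n - 1) is what makes this the maximum likelihood estimate; an unbiased estimate would rescale Sigma by n / (n - 1).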
Intervals' midpoints: location indicators → assuming a joint Normal distribution corresponds to the usual Gaussian assumption
Log transformation of the ranges → to cope with their limited domain
This model implies:
marginal distributions of the midpoints are Normal
marginal distributions of the ranges are Log-Normal
a specific relation between mean, variance and skewness for the ranges
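The relation mentioned in the last item can be made explicit with the standard Log-Normal moments. For a single range $R = e^{R^*}$ with $R^* \sim N(\mu, \sigma^2)$, where $\mu$ and $\sigma^2$ here denote the mean and variance of that one log-range (notation introduced only for this illustration):

```latex
\begin{align*}
E[R] &= \exp\!\left(\mu + \tfrac{\sigma^2}{2}\right), \\
\mathrm{Var}[R] &= \left(e^{\sigma^2} - 1\right)\exp\!\left(2\mu + \sigma^2\right), \\
\mathrm{skewness}[R] &= \left(e^{\sigma^2} + 2\right)\sqrt{e^{\sigma^2} - 1}.
\end{align*}
```

Two parameters determine all three quantities: the squared coefficient of variation is $\mathrm{Var}[R]/E[R]^2 = e^{\sigma^2} - 1$, and the skewness is a function of $\sigma^2$ alone and is always positive.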
More general models that try to alleviate limitations of the multivariate Normal distribution
Skew-Normal model:
Assume that the joint distribution of the midpoints C and the logs of the ranges R is multivariate Skew-Normal:
$(C, R^*) \sim SN_{2p}(\xi, \Omega, \alpha)$
Skew-Normal distribution (Azzalini, 1985):
Generalizes the Gaussian distribution
Introduces an additional shape parameter
Preserves some of its mathematical properties
Alternative parametrization (traditional moments): $SN_{2p}(\mu, \Sigma, \gamma_1)$ (Arellano-Valle & Azzalini, 2008)
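Density of a p-dimensional Skew-Normal distribution:
$f(x; \alpha, \xi, \Omega) = 2\,\phi_p(x - \xi; \Omega)\,\Phi\left(\alpha^t \omega^{-1}(x - \xi)\right), \quad x \in \mathbb{R}^p$
$\xi$ and $\alpha$ are p-dimensional vectors, $\Omega$ is a symmetric p × p positive-definite matrix,
$\omega$ is a diagonal matrix formed by the square roots of the diagonal elements of $\Omega$,
$\phi_p(\cdot\,; \Omega)$ is the density of a p-dimensional Gaussian vector with zero mean and covariance matrix $\Omega$, and $\Phi$ is the distribution function of a univariate standard Gaussian variable (its argument $\alpha^t \omega^{-1}(x - \xi)$ is a scalar)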
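A direct transcription of this density into Python may help when checking parameter values by hand. The sketch assumes NumPy; the function name `skew_normal_logpdf` is hypothetical. Since the argument α^t ω^{-1}(x − ξ) is a scalar, Φ is evaluated as the univariate standard Normal distribution function, computed here through `math.erfc`.

```python
import math
import numpy as np

def skew_normal_logpdf(x, xi, Omega, alpha):
    """Log-density of SN_p(xi, Omega, alpha) at a single point x."""
    x = np.asarray(x, dtype=float)
    xi = np.asarray(xi, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    Omega = np.asarray(Omega, dtype=float)
    p = x.shape[0]
    d = x - xi
    # log phi_p(x - xi; Omega): zero-mean Gaussian density with covariance Omega
    _, logdet = np.linalg.slogdet(Omega)
    quad = d @ np.linalg.solve(Omega, d)
    log_phi = -0.5 * (p * math.log(2.0 * math.pi) + logdet + quad)
    # scalar argument alpha^t omega^{-1} (x - xi), omega = diag(sqrt(diag(Omega)))
    omega_diag = np.sqrt(np.diag(Omega))
    u = float(alpha @ (d / omega_diag))
    # Phi(u) = 0.5 * erfc(-u / sqrt(2)); underflows to 0 for extremely negative u
    Phi = 0.5 * math.erfc(-u / math.sqrt(2.0))
    return math.log(2.0) + log_phi + math.log(Phi)
```

Setting α = 0 gives Φ(0) = 1/2, so the factor 2Φ(·) equals 1 and the expression reduces to the ordinary Gaussian log-density, which is the sense in which the Skew-Normal generalizes the Gaussian.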
However, for interval data:
The midpoint $c_{ij}$ and the range $r_{ij}$ of the value of an interval-valued variable are two quantities related to one and the same variable → they should not be considered separately
So: parametrizations of the global covariance matrix → take into account the link that may exist between midpoints and log-ranges of the same or different variables
Models for interval data
Most general formulation: allow for non-zero correlations among all midpoints and log-ranges; other cases of interest:
The interval variables $Y_j$ are non-correlated, but for each variable, the midpoint may be correlated with its log-range;
Midpoints (respectively, log-ranges) of different variables may be correlated, but no correlation between midpoints and log-ranges is allowed;
Midpoints (respectively, log-ranges) of different variables may be correlated, the midpoint of each variable may be correlated with its log-range, but no correlation between midpoints and log-ranges of different variables is allowed.
Config.  Characterization                                      Σ
1        Non-restricted                                        Non-restricted
2        $C_j$ not correlated with $R^*_\ell$, $\ell \neq j$   $\Sigma_{CR^*} = \Sigma_{R^*C}$ diagonal
3        $Y_j$'s non-correlated                                $\Sigma_{CC}$, $\Sigma_{CR^*} = \Sigma_{R^*C}$, $\Sigma_{R^*R^*}$ all diagonal
4        $C$'s non-correlated with $R^*$'s                     $\Sigma_{CR^*} = \Sigma_{R^*C} = 0$
5        All $C$'s and $R^*$'s are non-correlated              $\Sigma$ diagonal
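The five configurations can be read as patterns of entries of Σ that are allowed to be non-zero. The sketch below encodes each pattern as a boolean 2p × 2p mask and counts the free covariance parameters it leaves. It assumes NumPy; the function names `config_mask` and `n_free_cov_params` are hypothetical and not part of any package described here.

```python
import numpy as np

def config_mask(p, config):
    """Boolean 2p x 2p mask: True where Sigma may be non-zero."""
    diag = np.eye(p, dtype=bool)
    full = np.ones((p, p), dtype=bool)
    zero = np.zeros((p, p), dtype=bool)
    # Each entry: (CC block, CR* block, R*R* block)
    layouts = {
        1: (full, full, full),   # non-restricted
        2: (full, diag, full),   # C_j correlated only with its own R*_j
        3: (diag, diag, diag),   # Y_j's non-correlated
        4: (full, zero, full),   # C's non-correlated with R*'s
        5: (diag, zero, diag),   # Sigma diagonal
    }
    cc, cr, rr = layouts[config]
    return np.block([[cc, cr], [cr.T, rr]])

def n_free_cov_params(p, config):
    """Number of free entries of the symmetric Sigma under a configuration."""
    mask = config_mask(p, config)
    return int(np.triu(mask.astype(int)).sum())

# Free covariance parameters for p = 3 interval variables, configurations 1 to 5.
counts = [n_free_cov_params(3, c) for c in range(1, 6)]
```

For p = 3 the counts drop from 21 (configuration 1) to 6 (configuration 5), which shows how much the restricted configurations reduce the number of parameters to estimate.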
Comparison of means of one or more numerical variables in two or more populations, from which random samples were drawn
Example: compare the mean value of the sales of a given product in different shops.
(M)ANOVA
ANOVA:
Compares the variance within samples (groups), the residual variance, with the variance between samples (groups), the variance explained by the factor.
If the residual variance is very low compared with the explained variance (due to the factor), we may conclude that the mean values of the variable under study are different between groups
MANOVA:
Compares the determinant of the within-groups matrix W with the determinant of the global matrix T
→ Likelihood ratio approach
Each interval-valued variable $Y_j$ is modelled by the pair $(C_j, R^*_j)$ ⇒ analysis of variance of $Y_j$: two-dimensional MANOVA of $(C_j, R^*_j)$
Gaussian and Skew-Normal model:
Maximize the log-likelihood for the null (mean/location vectors equal across groups) and the alternative hypothesis
In all cases, under the null hypothesis, 2 ln λ follows asymptotically a chi-square distribution with n − k degrees of freedom
Simultaneous analysis of all the Y's may be accomplished by a 2p-dimensional MANOVA, following the same procedure
For one interval-valued variable:
Likelihood ratio statistic
$\lambda = \left( \frac{|E_{alt}|}{|E_{null}|} \right)^{n/2}$
$E_{null}$ and $E_{alt}$ are 2 × 2 matrices corresponding to the null (means equal across groups) and alternative (means different) hypotheses
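The statistic can be sketched for the simplest case: the Gaussian model with an unrestricted covariance matrix common to all groups. In this sketch, which assumes NumPy, E_null and E_alt are taken to be the residual sums of squares and cross-products around the overall mean and around the group means; that reading is an assumption, although dividing both by n would leave the determinant ratio unchanged. The function name `manova_lr` is hypothetical.

```python
import numpy as np

def manova_lr(Z, groups):
    """Likelihood ratio for equal mean vectors of (midpoint, log-range).

    Z: n x 2 array; groups: length-n array of group labels.
    Returns (E_null, E_alt, ln lambda).
    """
    Z = np.asarray(Z, dtype=float)
    groups = np.asarray(groups)
    n = Z.shape[0]
    # Null hypothesis: one common mean vector
    d0 = Z - Z.mean(axis=0)
    E_null = d0.T @ d0
    # Alternative hypothesis: one mean vector per group
    E_alt = np.zeros((Z.shape[1], Z.shape[1]))
    for g in np.unique(groups):
        Zg = Z[groups == g]
        dg = Zg - Zg.mean(axis=0)
        E_alt += dg.T @ dg
    _, logdet_alt = np.linalg.slogdet(E_alt)
    _, logdet_null = np.linalg.slogdet(E_null)
    log_lambda = 0.5 * n * (logdet_alt - logdet_null)
    return E_null, E_alt, log_lambda
```

With λ written as the ratio |E_alt| / |E_null| raised to n/2, λ never exceeds 1, so ln λ is non-positive and −2 ln λ is the non-negative quantity that would be compared with a chi-square quantile. The degrees of freedom are deliberately left out of the sketch.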
Simulation study:
When sample sizes are not too small:
Tests have good power
True significance level approaches nominal levels when the constraints assumed for the model are respected
Method assuming data is Normal with configuration 1 (non-restricted) never performs worse than any other method when data is indeed Normal
Skew Normal model requires large samples
(M)ANOVA : Example
Temperatures measured in meteorological stations in northern China.
Data: intervals of observed temperatures (Celsius scale) in each of the four quarters, Q1 to Q4, of the years 1974 to 1988 in 22 stations.

Station        Region   Q1              Q2             Q3              Q4
Beijing-1974   North    [−9.5, 10.6]    [6.5, 29.8]    [12.6, 29.6]    [−10.44, 9.06]
Beijing-1975   North    [−8.6, 12.9]    [7.9, 30.2]    [15.0, 31.6]    [−7.0, 19.2]

The full table comprises n = 22 × 15 = 330 rows and 4 columns.
The 22 meteorological stations belong to 3 different regions in China (North, Northwest, Northeast):
MANOVA performed to assess whether the regions are different
MODEL   2 ln λ   DF   P-VALUE
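Discriminant Analysis
For each configuration, an estimate of the optimum classification rule can be obtained with the corresponding Σ
Direct generalisation of the classical linear and quadratic discriminant classification rules
Linear:
$Y = \arg\max_g \left( \mu_g^t \Sigma^{-1} X - \frac{1}{2} \mu_g^t \Sigma^{-1} \mu_g + \log \pi_g \right)$
Quadratic:
$Y = \arg\max_g \left( -\frac{1}{2} X^t \Sigma_g^{-1} X + \mu_g^t \Sigma_g^{-1} X + \log \pi_g - \frac{1}{2} \left( \log \det \Sigma_g + \mu_g^t \Sigma_g^{-1} \mu_g \right) \right)$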
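Both rules amount to computing one score per group and picking the largest. The sketch below evaluates the two score functions for a single observation X, here the stacked vector of midpoints and log-ranges. It assumes NumPy and that the means, covariance matrices and prior probabilities π_g have already been estimated; the function names are hypothetical.

```python
import numpy as np

def linear_scores(x, means, Sigma, priors):
    """Linear rule: common covariance matrix Sigma for all groups."""
    x = np.asarray(x, dtype=float)
    Sinv_x = np.linalg.solve(Sigma, x)
    scores = []
    for mu, pi in zip(means, priors):
        mu = np.asarray(mu, dtype=float)
        Sinv_mu = np.linalg.solve(Sigma, mu)
        scores.append(mu @ Sinv_x - 0.5 * mu @ Sinv_mu + np.log(pi))
    return np.array(scores)

def quadratic_scores(x, means, Sigmas, priors):
    """Quadratic rule: one covariance matrix Sigma_g per group."""
    x = np.asarray(x, dtype=float)
    scores = []
    for mu, S, pi in zip(means, Sigmas, priors):
        mu = np.asarray(mu, dtype=float)
        Sinv_x = np.linalg.solve(S, x)
        Sinv_mu = np.linalg.solve(S, mu)
        _, logdet = np.linalg.slogdet(S)
        scores.append(-0.5 * x @ Sinv_x + mu @ Sinv_x + np.log(pi)
                      - 0.5 * (logdet + mu @ Sinv_mu))
    return np.array(scores)

def classify(scores):
    """Index of the group with the largest score."""
    return int(np.argmax(scores))
```

Under a restricted configuration the same functions apply unchanged; only the estimate of Σ (or of each Σ_g) passed in would carry the corresponding zero pattern.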
Discriminant Analysis
Skew-Normal model:
Three different alternatives may be considered:
1 the groups differ only in terms of µ;
2 the groups differ in terms of both µ and Σ;
3 the groups differ in terms of µ, Σ and γ1.
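Skew-Normal model:
Considering cases 1) and 3):
Location:
$Y = \arg\max_g \left( \xi_g^t \Omega^{-1} X - \frac{1}{2} \xi_g^t \Omega^{-1} \xi_g + \log \pi_g + \zeta_0\left(\alpha^t \omega^{-1}(X - \xi_g)\right) \right)$
General:
$Y = \arg\max_g \left( -\frac{1}{2} X^t \Omega_g^{-1} X + \xi_g^t \Omega_g^{-1} X + \log \pi_g - \frac{1}{2} \left( \log \det \Omega_g + \xi_g^t \Omega_g^{-1} \xi_g \right) + \zeta_0\left(\alpha_g^t \omega_g^{-1}(X - \xi_g)\right) \right)$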
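The location rule can be sketched in the same score-per-group form as the Gaussian rules. The slides do not define ζ0; the sketch assumes the usual Skew-Normal notation ζ0(u) = log(2Φ(u)), with Φ the univariate standard Normal distribution function, and that assumption is marked in the code. It also assumes NumPy and previously estimated parameters; the function names are hypothetical.

```python
import math
import numpy as np

def zeta0(u):
    """ASSUMED definition: zeta_0(u) = log(2 * Phi(u)) = log(erfc(-u / sqrt(2)))."""
    return math.log(math.erfc(-u / math.sqrt(2.0)))

def sn_location_scores(x, xis, Omega, alpha, priors):
    """Scores of the location rule: groups share Omega and alpha, differ in xi_g."""
    x = np.asarray(x, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    Omega = np.asarray(Omega, dtype=float)
    omega_diag = np.sqrt(np.diag(Omega))      # diagonal of omega
    Oinv_x = np.linalg.solve(Omega, x)
    scores = []
    for xi, pi in zip(xis, priors):
        xi = np.asarray(xi, dtype=float)
        u = float(alpha @ ((x - xi) / omega_diag))
        scores.append(xi @ Oinv_x
                      - 0.5 * xi @ np.linalg.solve(Omega, xi)
                      + math.log(pi)
                      + zeta0(u))
    return np.array(scores)
```

With α = 0 the ζ0 term vanishes for every group and the scores coincide with those of the Gaussian linear rule, so the extra term can be read as the correction that skewness adds to the linear scores.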
Experimental results
Parametric rules generally outperform distance-based ones
Homoscedastic problems: linear discriminant rules perform best
Large training samples and heteroscedastic conditions: quadratic methods are usually superior
Small training samples in heteroscedastic problems: restricted quadratic rules are preferable
even in some cases where the model assumed is not true
Restricted configurations 2 - 5:
provide a natural way of imposing constraints
are effective in reducing expected error rates for heteroscedastic problems with small or moderate training samples
Model-based Clustering
Although the number of observations is relatively low → a restricted though heteroscedastic model has been identified as best fit
The method chose the best parameters for clustering, preferring a heteroscedastic model to a “lighter” homoscedastic one → picking up a restricted configuration for the variance-covariance matrix
Choosing Case 2 as opposed to Case 3 → correlation between the two parts of the interval-variables is considered more important than correlation between different variables
Components can only be separated by considering simultaneously Midpoints and Log-Ranges
Model-Based Clustering Applications
USA meteorological data application
This dataset records temperatures and pluviosity measured in 282 meteorological stations in the USA.
Three interval-valued variables, measured in each station:
Temperature ranges in January
Temperature ranges in July
Annual pluviosity ranges
The lowest BIC value is observed for the unrestricted (Case 1) heteroscedastic solution with 6 natural clusters:
1: Arid Inland West
2: Alaska
3: Southeast
4: Northeast and Midwest
5: Pacific Islands and Puerto Rico
6: Pacific Coast
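Selecting a solution by lowest BIC can be sketched as follows. The sketch assumes the convention BIC = −2 ln L + k ln n, which is consistent with "lowest is best" but is not stated explicitly in the slides; the function names are hypothetical and the candidate values are made up purely for illustration.

```python
import math

def bic(log_likelihood, n_params, n_obs):
    """BIC under the ASSUMED convention -2 ln L + k ln n (lower is better)."""
    return -2.0 * log_likelihood + n_params * math.log(n_obs)

def select_lowest_bic(candidates, n_obs):
    """candidates: name -> (maximized log-likelihood, number of free parameters)."""
    scored = {name: bic(ll, k, n_obs) for name, (ll, k) in candidates.items()}
    best = min(scored, key=scored.get)
    return best, scored

# Hypothetical candidates (illustrative numbers, not results from the data).
candidates = {
    "restricted": (-500.0, 10),
    "unrestricted": (-480.0, 30),
}
best, scored = select_lowest_bic(candidates, n_obs=100)
```

The penalty term k ln n is what lets a restricted configuration win when the gain in likelihood from a richer covariance structure is too small to justify its extra parameters, and lets the unrestricted one win when the gain is large.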
Clusters are differentiated not only by the MidPoint but also by the Log-Range variables
Moreover, clusters display highly different variances
The Alaska cluster presents very high variance for the January MidPoint variable, while the Arid Inland West and Pacific Coast clusters present high variances for the July MidPoint variable
The Alaska cluster has a high variance for the Log-Range of the pluviosity and the Pacific Coast cluster for the Log-Range of the temperature in July
This stark difference illustrates well the need for a heteroscedastic setup for these data
Ward hierarchical clustering, partition into 6 clusters (figure omitted)
Model-Based Clustering : Conclusions
Proposed modelling successfully applied to real data sets of different nature and size
Adopting configurations adapted to interval data proved to be the adequate approach
Important to consider both the information about
position - conveyed by the MidPoints
intrinsic variability - conveyed by the LogRanges
when analysing interval data
Flexibility of the model in identifying heteroscedastic models