Multivariate Parametric Analysis of Interval Data - dc.uba.ar · Multivariate Parametric Analysis of Interval Data Paula Brito Fac. Economia & LIAAD-INESC TEC, Universidade do Porto


Multivariate Parametric Analysis of Interval Data

Paula Brito

Fac. Economia & LIAAD-INESC TEC, Universidade do Porto

ECI 2015 - Buenos Aires

T3: Symbolic Data Analysis: Taking Variability in Data into Account

Joint work with A. Pedro Duarte Silva and Jose G. Dias

P. Brito ECI - Buenos Aires - July 2015

Outline


1 Interval Data

2 Models for interval data

3 (M)ANOVA

4 Discriminant Analysis

5 Model-based Clustering: Model-Based Clustering, Applications

6 R Package

7 Conclusions



Interval data

      Y1            ...   Yj            ...   Yp
s1    [l11, u11]    ...   [l1j, u1j]    ...   [l1p, u1p]
...   ...                 ...                 ...
si    [li1, ui1]    ...   [lij, uij]    ...   [lip, uip]
...   ...                 ...                 ...
sn    [ln1, un1]    ...   [lnj, unj]    ...   [lnp, unp]





Native Interval Data

Temperatures and pluviosity measured in 283 meteorological stations in the USA: temperature ranges in January and July, annual pluviosity range

Station          State   January Temperature   July Temperature   Annual Pluviosity
HUNTSVILLE       AL      [32.3, 52.8]          [69.7, 90.6]       [3.23, 6.10]
ANCHORAGE        AK      [9.3, 22.2]           [51.5, 65.3]       [0.52, 2.93]
NEW YORK (JFK)   NY      [24.7, 38.8]          [66.7, 82.9]       [2.70, 4.13]
...              ...     ...                   ...                ...
SAN JUAN         PR      [70.8, 82.4]          [76.9, 87.4]       [2.14, 6.17]





Example: Price of different car models

[Figure: one price interval per car model, horizontal axis "Price [in 1000 Euro]" from 0 to 400. Models shown: Passat, SkodaOctavia, SkodaFabia, Rover75, Rover25, Porsche, Vectra, Corsa, NissanMicra, MercedesS, MercedesE, MercedesC, MercedesSL, LanciaK, HondaNSK, Focus, Punto, Ferrari, Bmw7, Bmw5, Bmw3, AudiA8, AudiA6, AudiA3, Alfa166, Alfa156, Alfa145]


Parametric models for interval data

Most existing methods: non-parametric descriptive approaches

Our goal: parametric inference methodologies → probabilistic models for interval variables

For each $s_i$, $Y_j(s_i) = I_{ij} = [l_{ij}, u_{ij}]$ is naturally defined by the lower and upper bounds $l_{ij}$ and $u_{ij}$

For modeling purposes → preferable equivalent parametrization:

Represent $Y_j(s_i)$ by

the midpoint $c_{ij} = \frac{l_{ij} + u_{ij}}{2}$

the range $r_{ij} = u_{ij} - l_{ij}$
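A minimal base-R sketch of this reparametrization, using the January temperature intervals from the US weather table as input. The function names `to_mid_logrange` and `to_bounds` are hypothetical helpers, not part of any package; the log of the range is included because it is the quantity the models in the next slides work with.

```r
# Hypothetical helpers (not from MAINT.Data): bounds <-> (midpoint, log-range)
to_mid_logrange <- function(lower, upper) {
  stopifnot(all(upper > lower))            # the log-range needs strictly positive ranges
  list(mid = (lower + upper) / 2,          # c = (l + u) / 2
       logr = log(upper - lower))          # r* = ln(u - l)
}

to_bounds <- function(mid, logr) {
  half <- exp(logr) / 2                    # half of the range
  list(lower = mid - half, upper = mid + half)
}

# January temperature intervals copied from the US weather table
jan_lower <- c(HUNTSVILLE = 32.3, ANCHORAGE = 9.3, NEW_YORK_JFK = 24.7, SAN_JUAN = 70.8)
jan_upper <- c(HUNTSVILLE = 52.8, ANCHORAGE = 22.2, NEW_YORK_JFK = 38.8, SAN_JUAN = 82.4)

cr <- to_mid_logrange(jan_lower, jan_upper)
back <- to_bounds(cr$mid, cr$logr)         # recovers the original bounds
```

The two representations carry the same information: `to_bounds` inverts `to_mid_logrange` up to floating-point error.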





Parametric Models for interval data

Gaussian model:

Assume that the joint distribution of the midpoints C and the logs of the ranges R is multivariate Normal:

$$R^* = \ln(R), \quad (C, R^*) \sim N_{2p}(\mu, \Sigma)$$

$$\mu = [\mu_C^t, \mu_{R^*}^t]^t; \quad \Sigma = \begin{pmatrix} \Sigma_{CC} & \Sigma_{CR^*} \\ \Sigma_{R^*C} & \Sigma_{R^*R^*} \end{pmatrix}$$

$\mu_C$ and $\mu_{R^*}$: p-dimensional column vectors of the mean values

$\Sigma_{CC}$, $\Sigma_{CR^*}$, $\Sigma_{R^*C}$ and $\Sigma_{R^*R^*}$: p × p matrices

Model advantage: straightforward application of classical inference methods





Intervals' midpoints: location indicators → assuming a joint Normal distribution corresponds to the usual Gaussian assumption

Log transformation of the ranges → to cope with their limited domain

This model implies:

marginal distributions of the midpoints are Normal

marginal distributions of the ranges are Log-Normal

a specific relation between mean, variance and skewness for the ranges





More general models that try to alleviate limitations of the multivariate Normal distribution

Skew-Normal model:

Assume that the joint distribution of the midpoints C and the logs of the ranges R is multivariate Skew-Normal:

$$(C, R^*) \sim SN_{2p}(\xi, \Omega, \alpha)$$

Skew-Normal distribution (Azzalini, 1985):

Generalizes the Gaussian distribution

Introducing an additional shape parameter

Preserves some of its mathematical properties

Alternative parametrization (traditional moments): $SN_{2p}(\mu, \Sigma, \gamma_1)$ (Arellano-Valle & Azzalini, 2008)





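Density of a p-dimensional Skew-Normal distribution:

$$f(x; \alpha, \xi, \Omega) = 2\,\phi_p(x - \xi; \Omega)\,\Phi(\alpha^t \omega^{-1}(x - \xi)), \quad x \in \mathbb{R}^p$$

ξ and α are p-dimensional vectors, Ω is a symmetric p × p positive-definite matrix,
ω is the diagonal matrix formed by the square roots of the diagonal elements of Ω,
$\phi_p(\cdot; \Omega)$ is the density of a p-dimensional Gaussian vector with zero mean and covariance matrix Ω, and Φ is the distribution function of a standard univariate Gaussian variable (its argument $\alpha^t \omega^{-1}(x - \xi)$ is a scalar)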
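The density can be evaluated directly in base R. The sketch below is illustrative only: `dmvnorm0` and `dsn_sketch` are hypothetical names, and it assumes the standard Azzalini form in which Φ is the univariate standard Normal distribution function applied to the scalar $\alpha^t \omega^{-1}(x - \xi)$.

```r
# Density of N_p(0, Omega) evaluated at each row of z (z: n x p matrix)
dmvnorm0 <- function(z, Omega) {
  p <- ncol(z)
  R <- chol(Omega)                               # Omega = R^t R
  q <- rowSums((z %*% solve(R))^2)                # quadratic forms z^t Omega^{-1} z
  exp(-0.5 * q - sum(log(diag(R))) - 0.5 * p * log(2 * pi))
}

# Skew-Normal density 2 * phi_p(x - xi; Omega) * Phi(alpha^t omega^{-1} (x - xi))
dsn_sketch <- function(x, xi, Omega, alpha) {
  x <- matrix(x, ncol = length(xi))              # one observation per row
  z <- sweep(x, 2, xi)                           # x - xi
  omega_inv <- 1 / sqrt(diag(Omega))             # diagonal of omega^{-1}
  lin <- drop(sweep(z, 2, omega_inv, "*") %*% alpha)
  2 * dmvnorm0(z, Omega) * pnorm(lin)
}

# With alpha = 0 the Skew-Normal reduces to the Gaussian density
d_gauss <- dsn_sketch(c(1, 2), xi = c(0, 0), Omega = diag(2), alpha = c(0, 0))
# Univariate case with non-unit scale
d_uni <- dsn_sketch(2, xi = 1, Omega = matrix(4), alpha = 3)
```

Setting α = 0 switches the skewing factor off (Φ(0) = 1/2), which is why the Gaussian model is a special case.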





However, for interval data:

Midpoint $c_{ij}$ and range $r_{ij}$ of the value of an interval-valued variable are two quantities related to a single variable → they should not be considered separately

So: parameterizations of the global covariance matrix → take into account the link that may exist between midpoints and log-ranges of the same or different variables





Models for interval data

Most general formulation: allow for non-zero correlations among all midpoints and log-ranges; other cases of interest:

The interval variables $Y_j$ are non-correlated, but for each variable, the midpoint may be correlated with its log-range;

Midpoints (respectively, log-ranges) of different variables may be correlated, but no correlation between midpoints and log-ranges is allowed;

Midpoints (respectively, log-ranges) of different variables may be correlated, the midpoint of each variable may be correlated with its log-range, but no correlation between midpoints and log-ranges of different variables is allowed.





Config.   Characterization                              Σ
1         Non-restricted                                Non-restricted
2         C_j not correlated with R*_ℓ, ℓ ≠ j           Σ_CR* = Σ_R*C diagonal
3         Y_j's non-correlated                          Σ_CC, Σ_CR* = Σ_R*C, Σ_R*R* all diagonal
4         C's non-correlated with R*'s                  Σ_CR* = Σ_R*C = 0
5         All C's and R*'s are non-correlated           Σ diagonal





[Figure: block structure of Σ = (Σ_CC, Σ_CR*; Σ_R*C, Σ_R*R*) under Configurations 1 to 5, with the entries constrained to zero in each restricted configuration marked 0]





Configuration 2 is a particular case of 1

Both 3 and 4 are particular cases of 2

Configuration 5 is a particular case of all the others

In cases 3, 4 and 5, Σ can be written as a block-diagonal matrix

Configuration 4: the matrix Σ is formed by two p × p blocks

Configuration 3: there are p blocks of size 2 × 2

Configuration 5: the 2p blocks are single real elements
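The five covariance configurations can be represented as logical masks marking which entries of Σ are free. The sketch below (hypothetical helper names, variables ordered as all midpoints followed by all log-ranges) checks the nesting relations stated above and counts the free covariance parameters of each configuration.

```r
# TRUE marks an entry of Sigma that is free under the given configuration
config_mask <- function(config, p) {
  I <- diag(p) == 1                       # diagonal pattern
  full <- matrix(TRUE, p, p)
  zero <- matrix(FALSE, p, p)
  b <- switch(as.character(config),
    "1" = list(CC = full, CR = full, RR = full),
    "2" = list(CC = full, CR = I,    RR = full),
    "3" = list(CC = I,    CR = I,    RR = I),
    "4" = list(CC = full, CR = zero, RR = full),
    "5" = list(CC = I,    CR = zero, RR = I),
    stop("config must be between 1 and 5"))
  rbind(cbind(b$CC, b$CR), cbind(t(b$CR), b$RR))
}

# Number of free parameters in a symmetric matrix with the given mask
n_free <- function(mask) sum(mask[upper.tri(mask, diag = TRUE)])

# a is a particular case of b when every free entry of a is also free in b
is_nested <- function(a, b) all(a <= b)

masks <- lapply(1:5, config_mask, p = 4)
free_counts <- sapply(masks, n_free)      # free covariance parameters for p = 4
```

For p = 4 interval variables the counts range from 36 free covariance parameters (Configuration 1) down to 8 (Configuration 5), which is what makes the restricted configurations attractive with small samples.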





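Parametric analysis of interval data: ML estimation

Gaussian model:

For all configurations,

$$\ln L(\mu, \Sigma) = -np \ln(2\pi) - \frac{n}{2} \ln|\Sigma| - \frac{1}{2} \mathrm{tr}(E \Sigma^{-1}) - \frac{n}{2} (\bar{X} - \mu)^t \Sigma^{-1} (\bar{X} - \mu)$$

where $\bar{X}$ is the sample mean vector and $E = \sum_i (x_i - \bar{X})(x_i - \bar{X})^t$ is the matrix of sums of squares and cross-products of the deviations from it

$\Sigma^{-1}$ is symmetric positive definite ⇒ the maximum-likelihood estimate of the mean vector is always $\bar{X}$

Maximization of the likelihood function with respect to Σ reduces to maximizing

$$\ln L(\bar{X}, \Sigma) = \text{constant} - \frac{n}{2} \ln|\Sigma| - \frac{1}{2} \mathrm{tr}(E \Sigma^{-1})$$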





In Configurations 3, 4 and 5, Σ is subject to constraints

In these cases Σ can be written as a block-diagonal matrix, after a possible rearrangement of rows and columns

The maximum can be obtained by separately maximizing with respect to each block of Σ

This result is not valid for Configuration 2, which requires standard numerical procedures
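The block-wise argument can be checked numerically. The following base-R sketch uses simulated data (not from the talk) and Configuration 4 with p = 2; it takes E/n within each block, sets the cross-block entries to zero, and confirms that the joint log-likelihood (up to its additive constant) is the sum of the block log-likelihoods. All names are illustrative.

```r
# Gaussian log-likelihood at mu = sample mean, without the additive constant
gauss_loglik <- function(Sigma, E, n) {
  logdet <- as.numeric(determinant(Sigma, logarithm = TRUE)$modulus)
  -(n / 2) * logdet - 0.5 * sum(diag(E %*% solve(Sigma)))
}

set.seed(1)
n <- 200
p <- 2
X <- matrix(rnorm(n * 2 * p), n, 2 * p)          # simulated columns: C1, C2, R*1, R*2
E <- crossprod(sweep(X, 2, colMeans(X)))         # SSCP matrix of the deviations

# Configuration 4: no covariance between midpoints and log-ranges
ones <- matrix(1, p, p)
zeros <- matrix(0, p, p)
mask4 <- rbind(cbind(ones, zeros), cbind(zeros, ones))
Sigma4 <- (E / n) * mask4                        # block-wise ML estimate

cc <- 1:p
rr <- (p + 1):(2 * p)
ll_joint <- gauss_loglik(Sigma4, E, n)
ll_blocks <- gauss_loglik(Sigma4[cc, cc], E[cc, cc], n) +
  gauss_loglik(Sigma4[rr, rr], E[rr, rr], n)

# Any other admissible matrix gives a lower value, e.g. an inflated variance
Sigma_alt <- Sigma4
Sigma_alt[1, 1] <- 1.2 * Sigma4[1, 1]
ll_alt <- gauss_loglik(Sigma_alt, E, n)
```

Configuration 2 has no such decomposition because its diagonal $\Sigma_{CR^*}$ block ties each midpoint to its own log-range while the other blocks remain full.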





Skew-Normal model:

Log-likelihood of a p-dimensional Skew-Normal distribution:

$$l = \text{constant} - \frac{1}{2} n \ln|\Omega| - \frac{n}{2} \mathrm{tr}(\Omega^{-1} V) + \sum_i \zeta_0(\alpha^t \omega^{-1}(x_i - \xi))$$

where $V = n^{-1} \sum_i (x_i - \xi)(x_i - \xi)^t$ and $\zeta_0(x) = \ln(2\Phi(x))$

Configuration 1 (Azzalini and Capitanio, 1999):

the log-likelihood can be re-parametrized, with $\eta = \omega^{-1}\alpha$, as

$$l = \text{constant} - \frac{1}{2} n \ln|\Omega| - \frac{n}{2} \mathrm{tr}(\Omega^{-1} V) + \sum_i \zeta_0(\eta^t (x_i - \xi))$$

Then, for each ξ and η, the log-likelihood is maximized over Ω by $\Omega = V$





Other configurations:

Let $\theta = (\xi, \Omega, \eta) = \theta(\psi)$ with $\psi = (\mu, \Sigma, \gamma_1)$

We numerically maximize the log-likelihood of θ(ψ), using as arguments the free elements of µ, Σ and γ1, subject to admissibility restrictions






(M)ANOVA : Objectives

Comparison of means of one or more numerical variables in two or more populations, from which random samples were drawn

Example : compare the mean value of the sales of a given productin different shops.





(M)ANOVA

ANOVA:
Compares the variance within samples (groups), the residual variance, with the variance between samples (groups), the variance explained by the factor.

If the residual variance is very low compared with the explained variance (due to the factor), we may conclude that the mean values of the variable under study differ between groups

MANOVA:
Compares the determinant of the within-groups matrix W with the determinant of the global matrix T





(M)ANOVA

→ Likelihood ratio approach

Each interval-valued variable $Y_j$ is modelled by the pair $(C_j, R^*_j)$ ⇒ analysis of variance of $Y_j$: two-dimensional MANOVA of $(C_j, R^*_j)$

Gaussian and Skew-Normal model:

Maximize the log-likelihood for the null (mean/location vectors equal across groups) and the alternative hypothesis

In all cases, under the null hypothesis, $-2\ln\lambda$ asymptotically follows a chi-square distribution whose degrees of freedom equal the number of mean/location parameters constrained by the null: 2(k − 1) for k groups

Simultaneous analysis of all the Y's may be accomplished by a 2p-dimensional MANOVA, following the same procedure, with 2p(k − 1) degrees of freedom





(M)ANOVA

For one interval-variable:

Likelihood ratio statistic

$$\lambda = \left( \frac{|E_{alt}|}{|E_{null}|} \right)^{n/2}$$

$E_{null}$ and $E_{alt}$ are 2 × 2 matrices corresponding to the null (means equal across groups) and alternative (means different) hypotheses
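A minimal base-R sketch of this test for one interval variable under the unrestricted Gaussian model, on simulated data. `lrt_interval` is a hypothetical name. It assumes that $E_{null}$ and $E_{alt}$ are the sums of squares and cross-products of the deviations from the overall mean and from the group means, respectively, and that the degrees of freedom are 2(k − 1), which is the count consistent with DF = 16 reported for 3 regions and 8 components in the example that follows.

```r
lrt_interval <- function(mid, logr, group) {
  X <- cbind(mid, logr)                           # the pair (C_j, R*_j)
  group <- factor(group)
  n <- nrow(X)
  k <- nlevels(group)
  sscp <- function(M) crossprod(sweep(M, 2, colMeans(M)))
  E_null <- sscp(X)                               # deviations from the overall mean
  E_alt <- Reduce(`+`, lapply(split(seq_len(n), group),
                              function(i) sscp(X[i, , drop = FALSE])))
  stat <- n * log(det(E_null) / det(E_alt))       # equals -2 ln(lambda)
  df <- ncol(X) * (k - 1)                         # assumed degrees of freedom
  c(statistic = stat, df = df,
    p.value = pchisq(stat, df, lower.tail = FALSE))
}

# Simulated example: three groups whose midpoints differ in mean
set.seed(42)
g <- rep(1:3, each = 30)
mid <- rnorm(90, mean = c(0, 2, 4)[g])
logr <- rnorm(90)
res <- lrt_interval(mid, logr, g)
```

Since $|E_{null}| \ge |E_{alt}|$, the statistic is never negative; large values indicate that the group means of the midpoint, the log-range, or both differ.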





(M)ANOVA

Simulation study:

When sample sizes are not too small:

Tests have good power

True significance level approaches nominal levels when the constraints assumed for the model are respected

Method assuming data is Normal with configuration 1 (non-restricted) never performs worse than any other method when data is indeed Normal

Skew-Normal model requires large samples





(M)ANOVA : Example

Temperatures measured in meteorological stations in northern China.
Data: intervals of observed temperatures (Celsius scale) in each of the four quarters, Q1 to Q4, of the years 1974 to 1988 in 22 stations.

Station        Region      Q1             Q2            Q3             Q4
Beijing-1974   North       [−9.5, 10.6]   [6.5, 29.8]   [12.6, 29.6]   [−10.44, 9.06]
Beijing-1975   North       [−8.6, 12.9]   [7.9, 30.2]   [15.0, 31.6]   [−7.0, 19.2]
...            ...         ...            ...           ...            ...
ZhangYe-1988   Northwest   [−15.4, 7.2]   [2.3, 26.4]   [8.6, 30.2]    [−12.0, 15.1]

The full table comprises n = 22 × 15 = 330 rows and 4 columns.





(M)ANOVA : Example

The 22 meteorological stations belong to 3 different regions in China (North, Northwest, Northeast): MANOVA performed to assess whether the regions are different

MODEL     −2 ln λ      DF   P-VALUE
NORM 1     480.2475    16   < 1E-10
NORM 2     527.4521    16   < 1E-10
NORM 3     989.9340    16   < 1E-10
NORM 4     529.2541    16   < 1E-10
NORM 5    1057.9210    16   < 1E-10
SkN 1      447.4244    16   < 1E-10
SkN 2      518.1720    16   < 1E-10
SkN 3      974.0240    16   < 1E-10
SkN 4      530.3980    16   < 1E-10
SkN 5     1110.3840    16   < 1E-10






Discriminant Analysis

Gaussian model:

For each configuration, an estimate of the optimum classification rule can be obtained with the corresponding Σ

Direct generalisation of the classical linear and quadratic discriminant classification rules

Linear:

$$Y = \arg\max_g \left( \mu_g^t \Sigma^{-1} X - \frac{1}{2} \mu_g^t \Sigma^{-1} \mu_g + \log \pi_g \right)$$

Quadratic:

$$Y = \arg\max_g \left( -\frac{1}{2} X^t \Sigma_g^{-1} X + \mu_g^t \Sigma_g^{-1} X + \log \pi_g - \frac{1}{2} \left( \log\det\Sigma_g + \mu_g^t \Sigma_g^{-1} \mu_g \right) \right)$$
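The linear rule can be sketched in a few lines of base R. The function name `linear_rule` and the numbers are invented for illustration: two groups, one interval variable described by (midpoint, log-range), a common identity covariance matrix and equal priors. In practice Σ would be the estimate obtained under the chosen configuration.

```r
# mu: G x d matrix of group means (one row per group); Sigma: common covariance
linear_rule <- function(x, mu, Sigma, prior) {
  Sinv <- solve(Sigma)
  scores <- drop(mu %*% Sinv %*% x) -              # mu_g^t Sigma^{-1} x
    0.5 * rowSums((mu %*% Sinv) * mu) +            # -1/2 mu_g^t Sigma^{-1} mu_g
    log(prior)                                     # + log pi_g
  which.max(scores)
}

mu <- rbind(c(0, 1), c(4, 2))     # hypothetical group means (midpoint, log-range)
Sigma <- diag(2)
prior <- c(0.5, 0.5)

g_a <- linear_rule(c(3.5, 1.8), mu, Sigma, prior)  # close to the second group mean
g_b <- linear_rule(c(0.2, 1.1), mu, Sigma, prior)  # close to the first group mean
```

The quadratic rule has the same structure, with a group-specific $\Sigma_g$ and the additional $\log\det\Sigma_g$ term.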







Skew-Normal model:

Three different alternatives may be considered:

1 the groups differ only in terms of µ;

2 the groups differ in terms of both µ and Σ;

3 the groups differ in terms of µ, Σ and γ1.





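Skew-Normal model:
Considering cases 1) and 3):

Location:

$$Y = \arg\max_g \left( \xi_g^t \Omega^{-1} X - \frac{1}{2} \xi_g^t \Omega^{-1} \xi_g + \log \pi_g + \zeta_0(\alpha^t \omega^{-1}(X - \xi_g)) \right)$$

General:

$$Y = \arg\max_g \left( -\frac{1}{2} X^t \Omega_g^{-1} X + \xi_g^t \Omega_g^{-1} X + \log \pi_g - \frac{1}{2} \left( \log\det\Omega_g + \xi_g^t \Omega_g^{-1} \xi_g \right) + \zeta_0(\alpha_g^t \omega_g^{-1}(X - \xi_g)) \right)$$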





Experimental results

Parametric rules generally outperform distance-based ones

Homoscedastic problems: linear discriminant rules perform best

Large training samples and heteroscedastic conditions: quadratic methods are usually superior

Small training samples in heteroscedastic problems: restricted quadratic rules are preferable, even in some cases where the model assumed is not true

Restricted configurations 2 - 5:

provide a natural way of imposing constraints

are effective in reducing expected error rates for heteroscedastic problems with small or moderate training samples







Model-Based Clustering

$$f(x_i; \varphi) = \sum_{\ell=1}^{K} \pi_\ell f_\ell(x_i; \Theta_\ell)$$

Maximum likelihood (ML) parameter estimation → maximization of the log-likelihood function:

$$\ell(\varphi; x) = \sum_{i=1}^{n} \ln f(x_i; \varphi)$$

Expectation-Maximization (EM) algorithm

Trying to avoid local optima → each search of the EM algorithm is replicated from different starting points

Selection of the model and number of components (K) → Bayesian Information Criterion: $BIC = -2\ell(\varphi; x) + d_\varphi \ln(n)$, where $d_\varphi$ is the number of free parameters





Income-debt application

Four interval-valued variables:
Household Income (HI), Debt to Income Ratio (X 100) (DIR), Credit Card Debt (CCD) and Other Debts (OD)

5000 individual observations aggregated on the basis of:
Gender, Age Category, Education Level and Job Category

→ 297 groups described by the intervals of observed values:

Group                                                      HI          DIR           CCD            OD
Male, 8-24, High school degree, Service                    [15, 61]    [0.1, 23.4]   [0.0, 6.57]    [0.02, 7.71]
Male, 35-49, College degree, Sales and Office              [19, 190]   [1.4, 20.4]   [0.04, 16.6]   [0.12, 15.39]
Female, 25-34, Some college, Managerial and Professional   [17, 100]   [0.8, 31.7]   [0.05, 6.57]   [0.09, 7.65]
...                                                        ...         ...           ...            ...




Income-debt data - BIC values ("-" marks values not reported):

       Homoscedastic                              Heteroscedastic
K      Case 1     Case 2     Case 3     Case 4    Case 1    Case 2    Case 3    Case 4
2      9336.91   10312.07   10304.42   11557.28   8533.95   9222.99   9770.38  10641.34
3      9188.08    9985.10   10117.64   10733.92   7836.8    8057.74   9504.75   9890.29
4      9088.40    9546.50    9959.48   10273.03   7699.80   7772.59   9318.83   9534.24
5      9061.25    9406.52    9885.04   10124.77   7522.50   7281.54   9148.90   9341.71
6      8956.95    9366.33    9829.76   10000.48   7564.48   7055.61   9138.57   9174.65
7      8939.25    9171.05    9713.20    9897.91   -         7011.45   9160.22   9051.77
8      8902.65    9050.61    9654.98    9861.26   -         6859.76   9128.54   8977.98
9      8838.82    8992.64    9590.98    9780.71   -         6831.01   9278.78   8925.87
10     8852.52    9054.11    9595.48    9711.42   -         6870.38   9249.67   8924.33
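Model selection by BIC amounts to locating the smallest entry of this table. The base-R sketch below transcribes the reported values (missing entries as `NA`) and finds the minimum; the column labels `Hom_*` and `Het_*` are shorthand introduced here for the homoscedastic and heteroscedastic setups.

```r
K <- 2:10
bic <- cbind(
  Hom_Case1 = c(9336.91, 9188.08, 9088.40, 9061.25, 8956.95, 8939.25, 8902.65, 8838.82, 8852.52),
  Hom_Case2 = c(10312.07, 9985.10, 9546.50, 9406.52, 9366.33, 9171.05, 9050.61, 8992.64, 9054.11),
  Hom_Case3 = c(10304.42, 10117.64, 9959.48, 9885.04, 9829.76, 9713.20, 9654.98, 9590.98, 9595.48),
  Hom_Case4 = c(11557.28, 10733.92, 10273.03, 10124.77, 10000.48, 9897.91, 9861.26, 9780.71, 9711.42),
  Het_Case1 = c(8533.95, 7836.8, 7699.80, 7522.50, 7564.48, NA, NA, NA, NA),
  Het_Case2 = c(9222.99, 8057.74, 7772.59, 7281.54, 7055.61, 7011.45, 6859.76, 6831.01, 6870.38),
  Het_Case3 = c(9770.38, 9504.75, 9318.83, 9148.90, 9138.57, 9160.22, 9128.54, 9278.78, 9249.67),
  Het_Case4 = c(10641.34, 9890.29, 9534.24, 9341.71, 9174.65, 9051.77, 8977.98, 8925.87, 8924.33)
)
rownames(bic) <- K

# Position of the smallest BIC, ignoring the unreported entries
best <- which(bic == min(bic, na.rm = TRUE), arr.ind = TRUE)
best_K <- K[best[1, "row"]]
best_model <- colnames(bic)[best[1, "col"]]
```

The minimum, 6831.01, is reached by the heteroscedastic Case 2 solution with K = 9, which agrees with the nine components C1 to C9 reported for these data.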





Income-debt data - Component Proportions and Mean-Vectors

              C1      C2      C3       C4      C5       C6       C7      C8       C9
Proportions   0.112   0.193   0.149    0.152   0.083    0.028    0.218   0.053    0.013
Hinc-MidP     35.72   34.25   124.31   78.86   138.88   221.78   86.88   141.69   495.38
Hinc-LogR     3.08    3.50    5.33     4.75    5.40     5.95     4.74    4.90     6.87
DIncR-MidP    7.91    12.51   15.10    13.43   13.06    17.64    11.62   11.39    16.16
DIncR-LogR    2.10    2.99    3.28     3.20    3.04     3.41     2.84    2.01     3.30
CCDbt-MidP    0.75    1.480   6.87     3.34    9.68     16.92    3.26    3.617    40.23
CCDbt-LogR    -0.06   0.93    2.57     1.87    2.90     3.45     1.71    1.46     4.29
ODbt-MidP     1.73    2.65    13.06    6.44    13.09    14.72    5.97    8.73     56.73
ODbt-LogR     0.53    1.49    3.22     2.48    3.08     3.34     2.29    2.23     4.68





Component mean-values: to distinguish clusters, both Midpoints and Log-Ranges need to be considered

Estimated variance-covariance matrices clearly different across groups

Noteworthy, though different, correlations between MidPoints and Log-Ranges of the same interval-valued variables

All correlations are positive: the higher the Midpoint, the higher the corresponding intrinsic variability





Labour force survey

Data from Portuguese Labour Force Survey, 1st semester of 2008

1540 cases: people who were unemployed at the time of the survey

Two variables:
Activity Time, in years (AT)
Unemployment Time, in months (UT)

Micro data were gathered on the basis of Gender, Region, Age-Group and Education → 58 sociological groups

Lowest BIC value: solution in 5 components, heteroscedastic setup, Case 2 - independent interval-valued variables





Labour force survey

Unemployment data: Component Proportions, Mean-Vectors and Variances

                        C1        C2      C3        C4        C5
Proportions             0.271     0.206   0.227     0.103     0.191
Mean Values  AT MidP    8.662     3.785   23.237    33.750    26.627
             AT LogR    2.553     1.600   3.197     3.578     2.748
             UT MidP    31.985    7.495   66.990    150.500   18.869
             UT LogR    4.042     2.110   4.849     5.690     3.060
Variances    AT MidP    17.065    4.963   73.153    57.479    78.060
             AT LogR    0.340     0.544   0.119     0.040     0.182
             UT MidP    101.545   7.841   230.649   468.583   54.774
             UT LogR    0.113     0.685   0.057     0.021     0.225





Labour force survey

Although the number of observations is relatively low → a restricted though heteroscedastic model has been identified as best fit

The method chose the best parameters for clustering, preferring a heteroscedastic model to a "lighter" homoscedastic one → picking up a restricted configuration for the variance-covariance matrix

Choosing Case 2 as opposed to Case 3 → correlation between the two parts of the interval-variables is considered more important than correlation between different variables

Components can only be separated by considering Midpoints and Log-Ranges simultaneously





USA meteorological data application

This dataset records temperatures and pluviosity measured in 282 meteorological stations in the USA.

Three interval-valued variables, measured in each station:

Temperature ranges in January

Temperature ranges in July

Annual pluviosity ranges

The lowest BIC value is observed for the unrestricted (Case 1) heteroscedastic solution with 6 natural clusters:

1: Arid Inland West, 2: Alaska, 3: Southeast, 4: Northeast and Midwest, 5: Pacific Islands and Puerto Rico, 6: Pacific Coast





Clusters are differentiated not only by the MidPoint but also by the Log-Range variables

Moreover, clusters display highly different variances

The Alaska cluster presents a very high variance for the January MidPoint variable, while the Arid Inland West and Pacific Coast clusters present high variances for the July MidPoint variable; the Alaska cluster has a high variance for the Log-Range of the pluviosity, and the Pacific Coast cluster for the Log-Range of the temperature in July

This stark difference illustrates well the need for a heteroscedastic setup for these data





Ward hierarchical clustering, partition in 6 clusters: [figure not recovered]


Model-Based Clustering : Conclusions

Proposed modelling successfully applied to real data sets of different nature and size

Adopting configurations adapted to interval data proved to be the adequate approach

Important to consider both the information about

position, conveyed by the MidPoints

intrinsic variability, conveyed by the LogRanges

when analysing interval data

Flexibility of the model in identifying heteroscedastic models






R Package

MAINT.Data - available on CRAN → Gaussian model

Specialized data classes for interval-data

Methods for Maximum Likelihood Estimation

ANOVA and MANOVA

Linear and Quadratic Discriminant Analysis

. . .

Extensions:

Multivariate Skew-Normal distributions

Robust estimation

Linear Regression





Overview

The IData class

The IdtE classes: Single Distributions

The IdtE classes: Homoscedastic Mixtures

The IdtE classes: Heteroscedastic Mixtures

Example: Creating IData Objects

library(MAINT.Data)

# Build an interval-data object from the 8 bound columns of ChinaTemp
ChinaT <- IData(ChinaTemp[1:8], VarNames = c("Q1", "Q2", "Q3", "Q4"))

# Display the first three observations
head(ChinaT, n = 3)

                          Q1                Q2                Q3               Q4
AnQing 1974 [ 0.673, 14.827] [13.435, 28.465] [19.821, 31.179] [2.216,  9.984]
AnQing 1975 [ 2.319, 14.381] [12.829, 28.471] [23.192, 32.308] [1.013, 10.987]
AnQing 1976 [ 0.906, 12.494] [11.795, 28.405] [19.680, 34.120] [2.992, 10.308]





Example: MANOVA tests

ManvChina <- MANOVA(ChinaT, ChinaTemp$GeoReg)
print(ManvChina)

Null Model Log likelihoods:
       NC1        NC2        NC3        NC4        NC5
 -7336.254  -8331.416 -11564.904  -8390.351 -12648.760
Full Model Log likelihoods:
      NC1       NC2       NC3       NC4       NC5
-6209.280 -6820.555 -9049.276 -6857.536 -9450.228
Full Model Akaike Information Criteria:
     NC1      NC2      NC3      NC4      NC5
12586.56 13793.11 18234.55 13851.07 19012.46
Selected Model:
[1] "NC1"

Null Model log-likelihood: -7336.254
Full Model log-likelihood: -6209.28
Qui-squared statistic: 2253.949
degrees of freedom: 40
p-value ≈ 0
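The figures in this printout can be cross-checked with base R alone, without MAINT.Data. The sketch below recomputes the chi-squared statistic as twice the difference of the two log-likelihoods and backs out the parameter count implied by the AIC. The reading of that count as 6 groups × 8 means plus 36 covariances is an inference from the printed numbers, not something stated in the slides.

```r
# Values copied from the printed output for the selected model (NC1)
ll_null <- -7336.254
ll_full <- -6209.28
aic_full <- 12586.56

stat <- 2 * (ll_full - ll_null)                     # likelihood ratio statistic
p_val <- pchisq(stat, df = 40, lower.tail = FALSE)  # df as printed

# AIC = -2 * loglik + 2 * (number of parameters)
n_par <- (aic_full + 2 * ll_full) / 2
```

The recomputed statistic matches the printed 2253.949 up to rounding of the log-likelihoods, and the implied parameter count is 84. With 8 components (4 midpoints and 4 log-ranges), 40 degrees of freedom would correspond to 8 × (6 − 1), again consistent with six groups in `ChinaTemp$GeoReg`.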





Example: Linear Discriminant Analysis

Chinalda <- lda(ManvChina)

PredRes <- predict(Chinalda, ChinaT)

# Estimate error rates by ten-fold cross-validation

CVlda <- DACrossVal(ChinaT, ChinaTemp$GeoReg, TrainAlg = lda, Config = BestModel(ManvChina@H1res), CVrep = 1)






Conclusions

Parametric models specific for interval-valued variables

Model-based multivariate analysis of interval data

(M)ANOVA

Discriminant analysis

Clustering - finite mixture modelling

Models and methods partially implemented in the R package MAINT.Data, available on CRAN

Experimental results show the pertinence and usefulness ofthe proposed approach

