Classical and Robust Symbolic Principal Component Analysis for Interval Data
Margarida Azeitona Sequeira Vilela
Thesis to obtain the Master of Science Degree in Mathematics and Applications
Supervisor: Doctor Maria do Rosário de Oliveira Silva
Examination Committee
Chairperson: Doctor António Manuel Pacheco Pires
Supervisor: Doctor Maria do Rosário de Oliveira Silva
Members of the Committee: Doctor Maria Paula de Pinho de Brito Duarte Silva
December 2015
Resumo

A análise em componentes principais é um dos métodos estatísticos mais populares para analisar dados reais. Por este motivo, tem havido várias propostas para estender esta metodologia para o enquadramento da análise de dados simbólicos, nomeadamente para dados intervalares.

Nesta tese, deduzimos as formulações populacionais para quatro destes algoritmos: Método dos Centros, Método dos Vértices, Complete Information Principal Component Analysis e Symbolic Covariance Principal Component Analysis. Com base nessas formulações teóricas, propomos uma metodologia geral que fornece simplificações, conhecimento adicional e unificação dos métodos discutidos. Adicionalmente, é derivada uma fórmula explícita e simples para a definição dos scores das componentes principais simbólicas, equivalente à representação por Maximum Covering Area Rectangles.

Além disso, a existência de observações atípicas poderia distorcer as componentes principais simbólicas amostrais e os respetivos scores. Para ultrapassar este problema, propomos duas famílias de métodos robustos para análise em componentes principais simbólicas: um baseado em matrizes de covariância robustas e outro baseado em Projection Pursuit. É efetuado um estudo de simulação para avaliar o desempenho desses procedimentos, que nos permite concluir que estes podem acomodar pequenos desvios do modelo central especificado.

Finalmente, para que todas estas novas metodologias propostas sejam facilmente utilizadas na análise de dados reais, desenvolvemos uma aplicação web, utilizando a plataforma Shiny do R. Na nossa aplicação, de forma interativa, é possível analisar, visualizar e comparar os resultados das componentes principais clássicas e robustas, para dados convencionais e dados intervalares. Ilustramos algumas das suas potencialidades com um conjunto de dados das telecomunicações.

Palavras-chave: Análise de dados simbólicos, variáveis intervalares, análise em componentes principais, estatística robusta.
Abstract
Principal component analysis is one of the most popular statistical methods to analyse real data.
Therefore, there have been several proposals to extend this methodology to the symbolic data analysis
framework, in particular to interval-valued data.
In this thesis, we deduce the population formulations of four of these algorithms: Centers Method,
Vertices Method, Complete Information Principal Component Analysis, and Symbolic Covariance
Principal Component Analysis. Based on these theoretical formulations, we propose a general methodology that provides simplifications, additional insight, and a unification of the discussed methods. Additionally, we derive an explicit and straightforward formula to define the symbolic principal component scores, equivalent to the representation by Maximum Covering Area Rectangles.
Furthermore, the existence of atypical observations can distort the sample symbolic principal components and the corresponding scores. To overcome this problem, we propose two families of robust methods for symbolic Principal Component Analysis: one based on robust covariance matrices and another based on Projection Pursuit. A simulation study is conducted to assess the performance of these procedures, allowing us to conclude that they can accommodate small deviations from the specified central model.
Finally, to make all the newly proposed methodologies easy to use in the analysis of real data, we also developed a web application, using the Shiny web application framework for R. In our application it is possible to interactively analyse, visualize, and compare the results of classical and robust principal components, in both the conventional and interval-valued frameworks. We illustrate some of its potentialities with a Telecommunications dataset.
Keywords: Symbolic data analysis, interval-valued variables, principal component analysis,
robust statistics.
Acknowledgments
First and foremost, I would like to thank my supervisor, Professor Rosário Oliveira, for her support,
time, guidance and constructive criticism during all this work. It was a pleasure working with her and
a very enriching experience.
I would also like to thank Professor António Pacheco for some interesting input into this work and
for the financial support provided by CEMAT, namely for attending some conferences.
Finally, special thanks to my close family and friends for all their support and care.
(continuation of Table 3.3: construction of the SPC scores)
[45] Midpoints and Radii PCA — using a reconstruction formula (b)
[27] IPCA — using the inner product interval operator
[59] CIPCA — MCAR
[38] Symbolic Covariance PCA — polytopes
(a) Using the max vertices coordinates. (b) Based on midpoint and radius rotation operators.
The MCAR representation was introduced by Chouakria [11] to obtain the SPC scores of the
CPCA and VPCA methods. According to this proposal, the kth SPC (centered) score obtained by
the method CPCA is given by
SPC^C_k(ξ_i) = SPC^C_{ik}   (3.48)
             = [ PC^C_k(min_i), PC^C_k(max_i) ],   (3.49)
where k = 1, ..., p, i = 1, ..., n, and γ_k is the kth eigenvector of Σ_CC. Moreover, the lower bound is

PC^C_k(min_i) = Σ_{j=1}^p  min_{a_ij ≤ x_ij ≤ b_ij} (x_ij − x̄_j) γ_kj.   (3.50)
The author suggested that the score is an interval whose lower bound is formed by the linear combination of the lower bounds of the original intervals for the positive weights, γ_kj > 0, plus the linear combination of the upper bounds for the negative weights, γ_kj < 0, leading to

PC^C_k(min_i) = Σ_{j: γ_kj > 0} (a_ij − x̄_j) γ_kj + Σ_{j: γ_kj < 0} (b_ij − x̄_j) γ_kj,   (3.51)

where x̄_j = (1/n) Σ_{i=1}^n (a_ij + b_ij)/2.
The upper bound is

PC^C_k(max_i) = Σ_{j=1}^p  max_{a_ij ≤ x_ij ≤ b_ij} (x_ij − x̄_j) γ_kj   (3.52)
             = Σ_{j: γ_kj > 0} (b_ij − x̄_j) γ_kj + Σ_{j: γ_kj < 0} (a_ij − x̄_j) γ_kj.   (3.53)
Moreover, for k = 1, ..., p, the hyper-rectangle formed by the first k SPCs, (SPC^C_{i1}, ..., SPC^C_{ik})^t, is the MCAR k-dimensional representation of the ith object.
Taking into account that a_ij = c_ij − r_ij/2, b_ij = c_ij + r_ij/2, and x̄_j = c̄_j, we can rewrite the limits as

PC^C_k(min_i) = Σ_{j: γ_kj > 0} (c_ij − r_ij/2 − c̄_j) |γ_kj| − Σ_{j: γ_kj < 0} (c_ij + r_ij/2 − c̄_j) |γ_kj|,   (3.54)

PC^C_k(max_i) = Σ_{j: γ_kj > 0} (c_ij + r_ij/2 − c̄_j) |γ_kj| − Σ_{j: γ_kj < 0} (c_ij − r_ij/2 − c̄_j) |γ_kj|.   (3.55)
Then, after some calculations, we can conclude that

SPC^C_{ik} = [ γ_k^t(c_i − c̄) − (1/2) |γ_k|^t r_i ,  γ_k^t(c_i − c̄) + (1/2) |γ_k|^t r_i ],   (3.56)

where γ_k is the kth eigenvector of Σ_CC, k = 1, ..., p, |γ_k| = (|γ_k1|, ..., |γ_kp|)^t, c_i = (c_i1, ..., c_ip)^t, r_i = (r_i1, ..., r_ip)^t, and c̄ = (c̄_1, ..., c̄_p)^t is the sample mean vector of the centers.
Thus, the representation of the ith object in the kth SPC is an interval whose center is the linear combination of the (centered) centers, γ_k^t(c_i − c̄). The range of the new interval is defined by a linear combination of the original ranges with non-negative weights: the weights, in absolute value, are equal to the weights defining the center of the interval, |γ_k|^t.
If we use standardized data, i.e. c*_ij = (c_ij − c̄_j)/√(s^C_jj), then the result can be formulated as follows:

SPC^C_{ik} = [ e_k^t (S^C)^{-1} (c_i − c̄) − (1/2) |e_k|^t (S^C)^{-1} r_i ,  e_k^t (S^C)^{-1} (c_i − c̄) + (1/2) |e_k|^t (S^C)^{-1} r_i ],   (3.57)

where e_k is the kth eigenvector of the correlation matrix of the centers, k = 1, ..., p, |e_k| = (|e_k1|, ..., |e_kp|)^t, c_i = (c_i1, ..., c_ip)^t, r_i = (r_i1, ..., r_ip)^t, c̄ = (c̄_1, ..., c̄_p)^t is the sample mean vector of the centers, and (S^C)^{-1} = Diag(1/√(s^C_11), ..., 1/√(s^C_pp)) is the inverse of the matrix with the standard deviations of the centers in the main diagonal.
Next, we consider the construction of SPC scores for the method VPCA, also based on the MCAR representation. In this case, for a given observation ξ_i, the kth SPC score can be obtained as

SPC^V_k(ξ_i) = [ min_j PC_{ik.j} ,  max_j PC_{ik.j} ],   (3.58)

where PC_{ik.j} is the kth conventional score of the jth vertex of the ith observation. According to this formulation, to obtain the desired score we need to calculate the conventional PC scores of all the vertices associated with a given observation.
For example, in a dataset with p = 2 interval-valued variables, if all the variables are non-degenerate, for a given observation we need to compute the following four scores:

PC_{ik.1} = γ_1k (c_i1 − r_i1/2) + γ_2k (c_i2 − r_i2/2),
PC_{ik.2} = γ_1k (c_i1 − r_i1/2) + γ_2k (c_i2 + r_i2/2),
PC_{ik.3} = γ_1k (c_i1 + r_i1/2) + γ_2k (c_i2 − r_i2/2),
PC_{ik.4} = γ_1k (c_i1 + r_i1/2) + γ_2k (c_i2 + r_i2/2).

Then, we use the minimum and the maximum values as the limits defining SPC^V_{ik} (see (3.58)). For a generic number p of interval-valued variables we have to compute 2^p scores, which for large values of p becomes a very demanding task.
Nevertheless, Douzal-Chouakria et al. [22] showed that the limits of the SPC scores in (3.58) can be obtained in the same way as the limits of the SPC scores for CPCA (see (3.51) and (3.53)). This is a particularly useful result since, together with the population formulation we deduced for the VPCA method, it allows applying the method and representing the SPC scores without having to compute the vertices matrix, thus reducing the complexity of the algorithm.
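The equivalence between the closed-form interval in (3.56) and the brute-force vertex enumeration can be checked numerically. The following Python sketch is a minimal illustration with hypothetical values (the thesis software itself is written in R); `gamma` stands in for an eigenvector computed elsewhere.

```python
from itertools import product

def spc_interval_closed_form(gamma, c, r, c_bar):
    """Closed-form MCAR score interval, cf. (3.56):
    center = gamma^t (c - c_bar), half-width = |gamma|^t r / 2."""
    center = sum(g * (ci - cb) for g, ci, cb in zip(gamma, c, c_bar))
    half = 0.5 * sum(abs(g) * ri for g, ri in zip(gamma, r))
    return (center - half, center + half)

def spc_interval_vertices(gamma, c, r, c_bar):
    """Brute force: project all 2^p vertices and take min/max, cf. (3.58)."""
    scores = []
    for signs in product((-1.0, 1.0), repeat=len(c)):
        scores.append(sum(g * (ci + s * ri / 2 - cb)
                          for g, ci, s, ri, cb in zip(gamma, c, signs, r, c_bar)))
    return (min(scores), max(scores))

# Hypothetical 3-variable observation.
gamma = [0.6, -0.7, 0.39]   # a direction (eigenvector) obtained elsewhere
c = [2.0, 1.0, -0.5]        # centers
r = [1.0, 0.4, 2.0]         # ranges
c_bar = [1.5, 0.8, 0.0]     # sample mean of the centers

assert all(abs(a - b) < 1e-12 for a, b in
           zip(spc_interval_closed_form(gamma, c, r, c_bar),
               spc_interval_vertices(gamma, c, r, c_bar)))
```

The closed form costs O(p) per score, whereas the enumeration costs O(2^p), which is the complexity reduction mentioned above.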
As we saw in Table 3.3, almost all the methods based on the symbolic–conventional–symbolic strategy use the MCARs to construct the SPC scores from the conventional PC scores. So, we state here a more general result we have deduced, using the same procedure as exemplified for the centers.
Theorem 3.2. Let C = (C_1, ..., C_p)^t and R = (R_1, ..., R_p)^t be the random vectors of the centers and the ranges describing a population of p interval-valued variables, such that µ_C = E(C), µ_R = E(R), Var(C) = Σ_CC, and Var(R) = Σ_RR exist.

Let Σ_M be the matrix associated with a given symbolic–conventional method M (M ∈ {CPCA, VPCA, CIPCA, SymCovPCA}), and let (λ_1, γ_1), ..., (λ_p, γ_p) be the eigenvalue–eigenvector pairs of Σ_M, such that λ_1 ≥ λ_2 ≥ ... ≥ λ_p ≥ 0. Then the kth symbolic principal component, according to the MCAR method, is given by

SPC^M_{ik} = [ γ_k^t(C_i − µ_C) − (1/2) |γ_k|^t R_i ,  γ_k^t(C_i − µ_C) + (1/2) |γ_k|^t R_i ],   (3.59)

where γ_k is the kth eigenvector of Σ_M, the covariance matrix associated with method M (vide Table 3.2).
The sample SPCs and respective scores are obtained by considering the sample counterparts of (3.59), leading to:

SPĈ^M_{ik} = [ γ̂_k^t(c_i − c̄) − (1/2) |γ̂_k|^t r_i ,  γ̂_k^t(c_i − c̄) + (1/2) |γ̂_k|^t r_i ],   (3.60)

where γ̂_k is the kth eigenvector of Σ̂_M, the sample covariance matrix associated with method M (vide Table 3.2).
From Theorem 3.2, we can deduce some properties of the SPCs. Let Γ = [γ_1, ..., γ_p] be the (p × p) orthogonal matrix of the eigenvectors of Σ_M. Then the (p × 1) vector of the centers of the SPCs is Γ^t(C − µ_C), and the vector of their ranges is |Γ|^t R, where |Γ| = [|γ_1|, ..., |γ_p|]. Being so, we can calculate the conventional mean vectors and covariance matrices of the new centers and ranges and deduce the following properties.

Properties:

1. E(Γ^t(C − µ_C)) = 0.

2. E(|Γ|^t R) = |Γ|^t µ_R.

3. Var(Γ^t(C − µ_C)) = Λ − δ_M Γ^t D_M Γ, where Λ = Diag{λ_1, ..., λ_p} and

δ_M = 0 for M = CPCA;  1/4 for M = VPCA;  1/12 for M = CIPCA, SymCovPCA;   (3.61)

D_M = Diag{E(R_1^2), ..., E(R_p^2)} for M = VPCA, CIPCA;  Σ_RR + µ_R µ_R^t for M = SymCovPCA.   (3.62)

4. Var(|Γ|^t R) = |Γ|^t Σ_RR |Γ|.

5. Cov(Γ^t(C − µ_C), |Γ|^t R) = Γ^t Σ_CR |Γ|.
Proof. All the demonstrations are straightforward applications of the properties of the mean vector and covariance matrix, available in any Multivariate Analysis book (e.g. [35]), except property 3, which we prove next. We have

Var(Γ^t(C − µ_C)) = Γ^t Σ_CC Γ.

Given that Σ_M = Σ_CC + δ_M D_M, by (3.47), then Σ_CC = Σ_M − δ_M D_M and

Var(Γ^t(C − µ_C)) = Γ^t Σ_M Γ − δ_M Γ^t D_M Γ = Λ − δ_M Γ^t D_M Γ,

since Γ is the matrix of eigenvectors of Σ_M and Λ is the diagonal matrix of the associated eigenvalues of Σ_M.

If we consider that Σ_CR = 0, i.e., C and R are uncorrelated, then Cov(Γ^t(C − µ_C), |Γ|^t R) = 0.
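Property 4 can be checked exactly on a small finite population. The Python sketch below (hypothetical values for the ranges and for |γ_1|) verifies that the population variance of the projected ranges equals |γ_1|^t Σ_RR |γ_1|.

```python
def pop_cov(xs, ys):
    """Population covariance of two equally likely lists of values."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n

# Hypothetical finite population of range vectors R = (R1, R2), equally likely.
R1 = [1.0, 3.0, 2.0, 4.0]
R2 = [2.0, 1.0, 4.0, 3.0]
abs_g1 = [0.8, 0.6]  # |gamma_1| for some hypothetical unit direction gamma_1

# Left-hand side of property 4: Var(|gamma_1|^t R) computed on the projections.
proj = [abs_g1[0] * a + abs_g1[1] * b for a, b in zip(R1, R2)]
lhs = pop_cov(proj, proj)

# Right-hand side: |gamma_1|^t Sigma_RR |gamma_1| assembled entrywise.
s11, s22, s12 = pop_cov(R1, R1), pop_cov(R2, R2), pop_cov(R1, R2)
rhs = abs_g1[0]**2 * s11 + 2 * abs_g1[0] * abs_g1[1] * s12 + abs_g1[1]**2 * s22

assert abs(lhs - rhs) < 1e-12
```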
In what follows, we discuss the disadvantages associated with MCARs. The main one is that the resulting hyper-rectangles include scores of data points that do not belong to the original data; as a consequence, the hyper-rectangles frequently overlap, which complicates the interpretation of the results. In Figure 3.4 (from [38]) we have an example of the possible representations of the SPC scores by the MCAR and by two new approaches suggested in an attempt to overcome its drawbacks.
First, in [34], Irpino and co-authors introduced the construction of convex hulls that are subsequently used to define the Parallel Edges Connected Shape (PECS). This new closed shape (dotted blue line in Figure 3.4) is contained in the MCAR (dashed black line), so it presents less over-fitting than the MCAR, but it still includes points that are not part of the data. Moreover, the convex hull and the PECS are limited to two-dimensional planes and rely on computationally demanding optimization procedures. Additionally, to obtain the PECS it is necessary to define a stopping criterion, because the algorithm does not converge in a finite number of iterations; this choice will therefore influence the PECS obtained.
The other approach was proposed by Le-Rademacher and Billard [38] and is based on polytope theory. These geometric objects are constructed to represent the true structure of interval-valued observations in the SPC space. In particular, the polytopes are obtained by connecting the representations of the vertices in the SPC space. In Figure 3.4, presented in [38], we have an example of a 6-dimensional polytope projected onto the plane of the first two PCs (object in red). We can observe that the polytope is contained in the PECS. Moreover, this new representation does not include points that do not belong to the data. The authors of [38] argue that using projections of polytopes facilitates the visualization of the scores and, consequently, the interpretation of the results.
Figure 3.4: Different representations of SPC scores (Source: [38]).
Chapter 4
Robust Symbolic Principal Component Analysis
Statistical analysis involves a dataset, a model, and several estimation methods. Most of the time,
these procedures only return the desired outcome if the underlying assumptions (for instance, data
follow a multivariate normal distribution, independence and distributional identity, homogeneity of
variances, linearity, etc.) are satisfied. We must be aware that we can get completely absurd results even if only one of the assumptions fails. Estimation methods severely affected by small deviations from the assumptions are often referred to as non-robust.
Besides the effect of violations of the model assumptions, it also became necessary to consider the influence of atypical observations, commonly known as outliers. In most areas of research it is necessary to analyse huge amounts of data, which in itself makes outliers harder to find and their detection process more complicated. For this reason, several procedures emerged in an attempt to deal with this problem; however, they are all based on the same principle: employ a diagnostic method to detect outliers, delete them from the dataset, and recompute the statistical procedure using only the remaining observations. Clearly, this approach has several drawbacks: deciding which observations to delete is subjective, and there is the risk of disposing of regular and thus necessary observations. Moreover, the elimination of an outlier is also controversial, since it may contain information of extreme importance; e.g., the initial measurements of the ozone hole were outliers, but they led to the discovery of these holes in the atmosphere.
In practice, the model assumptions are rarely, if ever, satisfied, and almost all datasets contain some type of deviation from the assumptions, so it was necessary to come up with a better solution. The idea was to develop new approaches that could deal with these deviations while ensuring reasonable efficiency and reliable results. Hence, the concept of “robustness” started to be used in this sense; however, the term robust was only introduced by Box [6] in 1953, and it took a few years for robustness to be recognized as a separate field of statistics (see Huber [29]).
In general, robust statistics covers a family of theories and techniques that yield approximately the same results as classical methods under ideal conditions and are only slightly affected if a small proportion of atypical observations is present. Moreover, robust statistical procedures aim at finding the estimates that best fit the majority of the data, accommodating atypical observations and eventually allowing for the identification of deviating observations.
Multivariate analysis is one of the areas with the strongest need for robust approaches to the well-known classical procedures, and there has been a huge research effort in this direction. Nevertheless, the exponential growth of databases has posed new challenges to robust statistics, since many of the procedures currently available are still not sufficiently optimized. Thus, it is crucial to continue investing in the development of more efficient algorithms that can handle high-dimensional datasets.
As previously discussed, due to its advantages when dealing with large databases, SDA may contribute greatly in this direction. However, outliers are still a largely unexplored topic in the context of SDA. The few studies on this topic are relatively recent and do not address outliers and robust estimation methods in the scope of SPCA, as we intend to do in this thesis. In fact, most of the robust methods proposed are related to linear regression (vide Domingues et al. [20]). As for outlier detection techniques, Domingues [19] proposed some methods based on clustering and residual analysis, and more recently methods based on the Mahalanobis distance were introduced by Filzmoser et al. [25].
Despite all the advantages and potentialities of PCA, we also verify (vide Section 4.1) that its results may be extremely sensitive to the presence of outlying observations, which may inflate the variance measure, leading to misleading directions unable to capture the data structure. In the conventional framework, several robust PCA approaches have been proposed to overcome this drawback. In this chapter we revisit some of those estimation methods and propose robust SPCA methods based on similar concepts.
4.1 Sensitivity of classical SPC methods to atypical observations
We started our study by analysing the sensitivity to outliers of the classical SPCA estimators discussed
in the previous chapter. A deeper analysis was performed by conducting a simulation study, which is
discussed in more detail in Section 4.3. Here our intention is to present a simple example to motivate
and justify the need to develop robust approaches for SPCA estimators.
In Figure 4.1 we present density plots of the values of the first eigenvalue obtained by some classical SPC methods. The datasets were generated from the same central model: the first has no contamination, in the second we contaminated the centers' mean in about 5% of the observations, and in the last one the contamination level is approximately 20%. We only present the results for CPCA, VPCA, CIPCA, and SymCovPCA, because these were the methods we studied in more detail in the previous chapter. Moreover, the theoretical formulations we deduced allow obtaining theoretical values for the eigenvalues of each of these methods, which are represented by the vertical lines in the plots. These theoretical values can be seen as target values, and this sequence of plots allows us to see how each method performs when we incorporate an increasing percentage of outliers in the dataset. As expected, when there is no contamination the kernel densities are located around their respective target values. When we introduce about 5% of anomalous observations in the data, all the kernel densities deviate to the right (the mean of the contaminated observations is higher than the mean under the central model), and if we increase the contamination the deviation from the target becomes even larger. Moreover, these four methods present a similar behaviour in the presence of contamination in the centers.
Figure 4.1: Density plots of the first eigenvalue for data with different levels of contamination: (a) data without contamination; (b) data with 5% of contamination; (c) data with 20% of contamination. Vertical lines represent the theoretical value of the first eigenvalue for each SPC method.
Thus, just like in the conventional case, we can conclude that these classical SPC methods are also severely affected by the presence of atypical observations. This was expected, since these methods were not designed with this concern in mind and the classical PC estimation method (used in the conventional step) is not robust.
4.2 Robust estimation methods
In this section, we present our proposal of robust SPC methods based on two different approaches:
robustification of the covariance matrix and Projection Pursuit (PP).
As in the classical cases, we follow the strategy defined in most of the methods discussed in Chapter 3, symbolic–conventional–symbolic, but in the conventional step the principal components are estimated robustly. So, the symbolic data provided as input will be analysed with conventional robust multivariate techniques, and the results will also be symbolic data.
4.2.1 Robust covariance matrix
Since the establishment of robust statistics, robust estimates were first developed to estimate location or scale parameters, or even regression coefficients.
Several approaches for robust estimation were proposed, among which are the following main
classes:
• M-estimators - Maximum likelihood type estimators;
• R-estimators - Estimators derived from Rank Tests;
• L-estimators - Linear combinations of order statistics.
In conventional analysis, the simplest and most intuitive way to obtain robust PCA is to calculate the principal components using robust estimates of location and covariance instead of their classical versions. In theory, any good robust estimator of the covariance matrix can be used as input to the method. With this in mind, over the years several estimators of the covariance matrix have been used for this purpose.
The first attempts were not successful, since they were based on estimators with a low breakdown point in high dimensions. The concept of breakdown point was introduced by Donoho and Huber [21] and relates to the maximum amount of contamination that an estimator can resist, i.e., it indicates the percentage of data that can be contaminated before the estimator yields arbitrarily aberrant values. So, this indicator should be as high as possible without significantly compromising the estimator's efficiency.
To overcome this drawback, the Minimum Volume Ellipsoid (MVE) estimator and the Minimum Covariance Determinant (MCD) estimator, both proposed by Rousseeuw [51], were applied instead. The MVE became the first popular high breakdown point estimator of location and scatter to be regularly used in practice. However, it was eventually replaced by the fast MCD estimator (vide Rousseeuw and Van Driessen [52]), because the latter can be computed more efficiently.
The basic idea behind the MCD estimator is to consider all possible subsets of size h and find the subset {x_{i_1}, ..., x_{i_h}} whose covariance matrix has the smallest determinant. The estimator of location, T_MCD, is then the mean of these h observations:

T_MCD = (1/h) Σ_{j=1}^h x_{i_j},   (4.1)

and the estimator of scatter, C_MCD, corresponds to the covariance matrix with the smallest determinant, obtained as follows:

C_MCD = c_ccf · c_sscf · (1/(h−1)) Σ_{j=1}^h (x_{i_j} − T_MCD)(x_{i_j} − T_MCD)^t,   (4.2)

where the factors c_ccf (consistency correction factor) and c_sscf (small sample correction factor) are chosen to ensure consistency under the Normal distribution.
The robustness and efficiency of the estimators are determined by h. The highest possible breakdown point is achieved when h ≈ n/2, but in this case the estimator has low efficiency; for higher values of h the estimators have higher efficiency and a lower breakdown point. The most appropriate choice for h is ⌊(n + p + 1)/2⌋, although any value in [⌊(n + p + 1)/2⌋, n] is acceptable.
The major disadvantage of this estimator is that it can only be applied to datasets where the number of observations is larger than the number of variables. Indeed, p > n implies p > h, and the covariance matrix of any h data points will then always be singular, so the solution is not unique.
It is important to note that the computation of the MCD estimator is not an easy task. In fact, for large n or in higher dimensions it is not possible to consider all subsets of h data points in the search for the subset whose covariance matrix has the smallest determinant. To cope with such situations, Rousseeuw and Van Driessen [52] implemented a fast algorithm which finds an approximation of the solution.
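For intuition, the definition of the MCD can be implemented literally for a tiny bivariate dataset. The Python sketch below (made-up data; a naive exact search, not the Fast MCD of [52]) enumerates every h-subset and keeps the one whose covariance matrix has the smallest determinant.

```python
from itertools import combinations
from statistics import mean

def cov2(points):
    """2x2 sample covariance (divisor n-1) of a list of (x, y) points."""
    n = len(points)
    mx = mean(p[0] for p in points)
    my = mean(p[1] for p in points)
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return sxx, syy, sxy

def naive_mcd(points, h):
    """Exact MCD by enumerating all h-subsets (feasible only for tiny n)."""
    best = None
    for subset in combinations(points, h):
        sxx, syy, sxy = cov2(subset)
        det = sxx * syy - sxy * sxy
        if best is None or det < best[0]:
            best = (det, subset)
    det, subset = best
    t = (mean(p[0] for p in subset), mean(p[1] for p in subset))
    return t, det

# Tiny hypothetical dataset with two gross outliers.
data = [(0.0, 0.1), (0.2, -0.1), (-0.1, 0.0), (0.1, 0.2), (-0.2, -0.2),
        (0.0, -0.1), (10.0, 10.0), (11.0, 9.5)]
n, p = len(data), 2
h = (n + p + 1) // 2          # the usual default choice, here 5
t_mcd, det = naive_mcd(data, h)
# The MCD location ignores the two outliers near (10, 10).
assert abs(t_mcd[0]) < 1.0 and abs(t_mcd[1]) < 1.0
```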
In our approaches we use this fast version, which is implemented in the function covMcd of the package robustbase [53]. In total, we propose four approaches to perform SPCA based on a robust covariance matrix. The first approach is based on the straightforward computation of the Fast MCD estimator and can be defined as follows:
Procedure A:
1. Build the matrix of the centers (CPCA) or the matrix of the vertices (VPCA);
2. Compute its robust location and scale estimates using the Fast MCD estimator;
3. Obtain the SPCs based on ΣCPCA or ΣVPCA.
The main drawback of this approach is that it can only be applied to obtain robust versions of CPCA and VPCA, because the original formulation of these methods is based on obtaining a conventional data matrix, to which we can apply the robust covariance estimator. For the CIPCA and SymCovPCA methods, we can extend these procedures thanks to the theoretical results developed in Chapter 3.
In the next approaches we take advantage of the parametric formulation for centers and log-ranges, and of the unified formulation of the SPC methods presented in Chapter 3, where eigenvalues and eigenvectors have to be computed from Σ_M = Σ_CC + δ_M g_M(E(RR^t)). Thus, we assume as a central model that (C, ln(R)) ~ N_2p(µ, Σ), where

µ = [ µ_C ; µ_R* ],   Σ = [ Σ_CC  Σ_CR* ; Σ_R*C  Σ_R*R* ].

Two different versions of this procedure are considered; their difference stems from the fact that the covariances between centers and log-ranges do not play a role in the estimation methods. Thus, in version B, Σ is estimated robustly from the data, while in version C, Σ_CC and Σ_R*R* (and µ_R*) are estimated separately. Version B guarantees that the matrix Σ is positive semi-definite, but version C has the merit of avoiding the useless estimation of Σ_CR*.
The main steps of these procedures are presented next.

1. Build the centers and the log-ranges (if the input data are not already in this format);

2. Procedure B: Compute their robust location (µ) and scale (Σ) estimates using the Fast MCD estimator;
   Procedure C: Compute separately the (µ_C, Σ_CC) and (µ_R*, Σ_R*R*) estimates using the Fast MCD estimator;

3. Obtain E(RR^t), where for each i, j = 1, ..., p:

E(R_i R_j) = exp( µ_R*_i + µ_R*_j + [Σ_R*R*]_{i,j} + (1/2)([Σ_R*R*]_{i,i} + [Σ_R*R*]_{j,j}) );   (4.3)
4. Compute Σ_M, given that

Σ_M = Σ_CC,   M = CPCA;
Σ_M = Σ_CC + (1/4) Diag(E(RR^t)),   M = VPCA;
Σ_M = Σ_CC + (1/12) Diag(E(RR^t)),   M = CIPCA;
Σ_M = Σ_CC + (1/12) E(RR^t),   M = SymCovPCA.

5. Obtain the SPCs based on Σ_M.
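Steps 3 and 4 above can be sketched as follows in Python (the thesis implementation is in R); the robust estimates below are hypothetical placeholders for the output of the Fast MCD step.

```python
import math

def expected_RRt(mu_lr, Sigma_lr):
    """Step 3: E(R R^t) for lognormal ranges, cf. (4.3):
    E(R_i R_j) = exp(mu_i + mu_j + Sigma_ij + (Sigma_ii + Sigma_jj)/2)."""
    p = len(mu_lr)
    return [[math.exp(mu_lr[i] + mu_lr[j] + Sigma_lr[i][j]
                      + 0.5 * (Sigma_lr[i][i] + Sigma_lr[j][j]))
             for j in range(p)] for i in range(p)]

def sigma_M(method, Sigma_CC, ERRt):
    """Step 4: the matrix whose eigenvectors define the SPCs."""
    p = len(Sigma_CC)
    delta = {"CPCA": 0.0, "VPCA": 1/4, "CIPCA": 1/12, "SymCovPCA": 1/12}[method]
    diag_only = method in ("VPCA", "CIPCA")   # these use only Diag(E(RR^t))
    return [[Sigma_CC[i][j]
             + delta * (ERRt[i][j] if (not diag_only or i == j) else 0.0)
             for j in range(p)] for i in range(p)]

# Hypothetical robust estimates for p = 2 (e.g. from Fast MCD).
mu_lr = [0.0, 0.5]                      # location of the log-ranges
Sigma_lr = [[0.04, 0.01], [0.01, 0.09]] # scatter of the log-ranges
Sigma_CC = [[2.0, 0.3], [0.3, 1.0]]     # scatter of the centers

ERRt = expected_RRt(mu_lr, Sigma_lr)
SM = sigma_M("VPCA", Sigma_CC, ERRt)
```

Note that the diagonal of `expected_RRt` reproduces E(R_j^2) = exp(2µ_R*_j + 2[Σ_R*R*]_{j,j}), consistent with (4.4) below.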
For the methods VPCA and CIPCA we do not need all the entries of the covariance matrix of the log-ranges; in fact, we only need its diagonal (the variances). Thus, we also made some experiments considering an additional approach (C2), which cannot be applied to SymCovPCA. The main goal of this attempt was to verify whether it is more efficient to obtain univariate estimates of the log-ranges' location and scale than to estimate the joint covariance matrix. In this approach we start by building the centers and the log-ranges, but in the second step we compute the robust location and scale estimates of the centers using the Fast MCD estimator and then the univariate robust location (µ_R*_j) and scale ([Σ_R*R*]_{j,j}) estimates of the log-ranges using the median and the MAD, respectively.
In this case, we can obtain Σ_M = Σ_CC + δ_M Diag{E(R_1^2), ..., E(R_p^2)}, where

δ_M = 0 for CPCA;  1/4 for VPCA;  1/12 for CIPCA;

and

E(R_j^2) = exp( 2 µ_R*_j + 2 [Σ_R*R*]_{j,j} ).   (4.4)

Finally, we construct the SPCs.
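The univariate step of approach C2 might look like this in Python, a sketch using the usual 1.4826 normal-consistency factor for the MAD (the thesis implementation is in R):

```python
import math
from statistics import median

def univariate_log_range_moments(log_ranges_j):
    """Approach C2, one variable at a time: robust location (median) and
    scale (MAD with the 1.4826 normal-consistency factor) of the log-ranges,
    then E(R_j^2) = exp(2*mu + 2*sigma^2), cf. eq. (4.4)."""
    mu = median(log_ranges_j)
    mad = median(abs(x - mu) for x in log_ranges_j)
    sigma2 = (1.4826 * mad) ** 2
    return mu, sigma2, math.exp(2 * mu + 2 * sigma2)

# Degenerate sanity check: constant ranges R_j = 2 give E(R_j^2) = 4 exactly.
mu, s2, er2 = univariate_log_range_moments([math.log(2.0)] * 5)
assert abs(er2 - 4.0) < 1e-9
```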
It should be noted that for CPCA all approaches lead to the same result, and that before the deduction of the population formulations it was not possible to obtain robust versions of CIPCA and SymCovPCA based on these ideas.
However, it is important to consider certain disadvantages of the proposed approaches. The MCD, like all high breakdown point estimators, is computationally demanding when we need to handle large amounts of data. Moreover, in the estimation of the covariance matrices of the centers and log-ranges we have ignored the configuration format presented in Table 2.6 (vide Filzmoser et al. [25]). A possible solution for this problem would be to replace the MCD estimator by the Trimmed Likelihood Estimator (TLE), introduced in Hadi and Luceño [28] and Neykov et al. [44]. The idea behind this estimator is to use a trimmed version of the complete-data log-likelihood function. This estimator can be applied to each configuration format, leading to robust estimates of µ and Σ that preserve the configuration (vide [25]).
4.2.2 Projection pursuit
In the conventional framework, the disadvantages of robustifying the covariance matrix and then applying the classical estimation method motivated the development of other robust strategies. Several approaches have emerged based on the application of PP principles. This kind of procedure was initially proposed by Friedman and Tukey [26] with the aim of projecting multivariate data onto a lower-dimensional subspace, where the choice of the new subspace is made by maximizing a Projection Index. In [30], Huber proved that PCA is a particular case of PP in which the variance of the projected data is used as the PP index and the maximization is subject to orthogonality constraints.
For a dataset with n observations and p variables, the first principal component is computed by finding the unit vector u which maximizes the variance (S^2) of the projected data:

u_1 = argmax_{‖u‖=1} S^2(u^t x_1, ..., u^t x_n).   (4.5)
Since this method allows sequential estimation of the principal components, the kth component, with 1 < k ≤ p, can be defined similarly to the first, including the condition of being orthogonal (represented by ⊥) to the previous (k − 1) components:

u_k = argmax_{‖u‖=1, u⊥u_1, ..., u⊥u_{k−1}} S^2(u^t x_1, ..., u^t x_n).   (4.6)
Thus, the robust PCA based on PP can be obtained by replacing the variance by a robust estimator,
in (4.6). However, solving this maximization problem is not an easy task and it may be necessary to
rely on approximations.
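As an illustration of the idea behind (4.5), in the bivariate case the maximization can be approximated by scanning unit directions on a grid of angles. The sketch below is an illustrative Python fragment (all names are ours, not the thesis code); with the standard deviation as projection index it recovers the classical first PC.

```python
import numpy as np

def first_pc_pp(X, scale=np.std, n_dirs=3600):
    """Approximate the first PP principal component of bivariate data by
    scanning unit directions u(theta) and maximizing the projection index
    S(u^t x_1, ..., u^t x_n)."""
    best_u, best_s = None, -np.inf
    for theta in np.linspace(0, np.pi, n_dirs, endpoint=False):
        u = np.array([np.cos(theta), np.sin(theta)])
        s = scale(X @ u)              # scale of the projected data
        if s > best_s:
            best_u, best_s = u, s
    return best_u, best_s

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[2, 1.2], [1.2, 1.5]], size=2000)
u1, s1 = first_pc_pp(X - X.mean(axis=0))
```

Replacing `scale` by a robust estimator gives the robust variant of (4.6); a grid of candidate directions is also the idea behind the GRID algorithm discussed below, here in its crudest form.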
The first method to compute this type of robust PCA was introduced by Li and Chen [39]. Nevertheless,
it was difficult to apply and time-consuming, so Croux and Ruiz-Gazen [12] proposed a more
tractable algorithm (the CR algorithm). Later, Hubert et al. [31] and Croux and Ruiz-Gazen [13]
implemented more stable and faster versions of the CR algorithm.
More recently, it was proved that this algorithm is not very precise when the number of variables is
much larger than the number of observations (p ≫ n), and that the estimated eigenvalues corresponding
to the kth PC with k > n/2 are zero for any robust scale measure used, whether or not there are
outliers in the dataset.
To overcome these drawbacks, Croux et al. [14] developed the GRID algorithm. This new approach
uses a search algorithm in the plane on a regular grid to compute an approximation of the PP
estimators for PCA. Furthermore, this algorithm does not suffer from the same problems as the
previous one, being computationally efficient and much more precise according to a simulation study
performed in [14]. In a more recent simulation study, developed by Pascoal et al. [47], PCA GRID
and ROBPCA were considered the best options among the five robust PCA methods discussed, in the
context of outlier detection based on robust PCA.
Unlike the previous method (see Subsection 4.2.1), the PP approach can be applied to datasets
where the number of variables is larger than the number of observations. Moreover, this method
is computationally easier and faster than PCA based on a robust covariance matrix, since the robust
estimation is performed in a lower dimension. Another property that contributes to making this method
faster than other approaches is the fact that the search for directions can be done sequentially, so the
user is not obliged to compute all the PCs.
This method is especially interesting for areas where it is common to consider datasets with p much
larger than n, such as in applications in chemometrics, marketing and biostatistics.
Due to all the advantages listed above, we also decided to propose robust SPC methods based
on PP using the Grid search algorithm. Since this approach requires considering a conventional data
matrix as input, as for approach A (see Subsection 4.2.1), it can only be applied to the methods
CPCA and VPCA. The procedure we propose can be defined as:
1. Build the matrix of the centers (CPCA) or the matrix of the vertices (VPCA);
2. Apply the Grid search algorithm using the MAD or the Qn estimator to detect the direc-
tions with the largest variance;
3. Obtain the SPCs.
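Step 2 replaces the variance by a robust scale estimator. For reference, minimal sketches of the two estimators used (the MAD with consistency factor 1.4826, and a naive O(n²) version of the Qn with factor 2.2219) might look as follows; this is an illustrative Python fragment, not the thesis implementation, which relies on R.

```python
import numpy as np

def mad(x):
    """Median absolute deviation, scaled for consistency at the normal."""
    x = np.asarray(x, dtype=float)
    return 1.4826 * np.median(np.abs(x - np.median(x)))

def qn(x):
    """Rousseeuw-Croux Qn: a constant times an order statistic of the
    pairwise distances |x_i - x_j|, i < j (naive O(n^2) version)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    d = np.abs(x[:, None] - x[None, :])[np.triu_indices(n, k=1)]
    h = n // 2 + 1
    k = h * (h - 1) // 2   # roughly the first quartile of the pairwise distances
    return 2.2219 * np.partition(d, k - 1)[k - 1]
```

Both estimators stay bounded under contamination, unlike the standard deviation, which is what makes the projection index in (4.6) robust.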
There are other robust PCA methods proposed in the context of conventional data analysis. These
methods also have recognized benefits, however, in this thesis, it was not possible to extend them to
the scope of SDA. Nevertheless, we draw the reader's attention to Hubert et al. [32], who proposed
a method, named ROBPCA, to combine the advantages of both PCA based on a robust covariance
matrix (see Subsection 4.2.1) and PCA based on projection pursuit (see Subsection 4.2.2).
Another popular approach to robust PCA, referred to as Spherical Principal Components, was proposed
by Locantore et al. [40]. Simulations by Maronna [41] showed that this approach has a very good
performance, but until now not much is known about its efficiency. Moreover, it is a deterministic and
very fast method which can be computed on collinear data without any additional adaptations.
4.3 Comparative study
We have conducted a simulation study to evaluate the performance of the SPC estimators discussed
in this work, in order to study the impact of outliers in the performance of the classical and robust
methods.
The setup considered in this simulation study is presented below. The SPC estimation methods
compared were:
∗ based on PP: CPCAgridMAD, CPCAgridQn, VPCAgridMAD, VPCAgridQn;
∗ based on the robust covariance: CPCAcovMCD A, VPCAcovMCD (A, B, C, and C2),
CIPCAcovMCD (B, C, and C2) and SymbCovPCA (B and C).
In this simulation we generated interval-valued data by simulating centers and log-ranges following
multivariate Normal models (vide [9]).
Let $(C, \ln(R)) \sim N_{2p}(\mu_{M_k}, \Sigma_{M_k})$, where $R^* = \ln(R)$ and

$$\mu_{M_k} = \begin{bmatrix} \mu_{M_k,C} \\ \mu_{M_k,R^*} \end{bmatrix} \qquad (4.7)$$

$$\Sigma_{M_k} = \begin{bmatrix} \Sigma_{M_k,C} & 0 \\ 0 & \Sigma_{M_k,R^*} \end{bmatrix} \qquad (4.8)$$
Then, we considered the following values for the parameters of the central model (M0):
$$\mu_{M_0,C} = (0, 0)^t, \qquad \mu_{M_0,R^*} = (0, 0)^t,$$

$$\Sigma_{M_0,C} = \begin{bmatrix} 2 & 1.2 \\ 1.2 & 1.5 \end{bmatrix}, \qquad \Sigma_{M_0,R^*} = \begin{bmatrix} 0.4 & 0.14 \\ 0.14 & 0.07 \end{bmatrix},$$
and generated observations from it with probability 1 − ε, where ε is the level of contamination.
Furthermore, with probability ε we generated observations from three types of contaminated mod-
els (M1), namely:
(MmCi) Models with contamination in µC (i = 1, 2, 3, 5): µmCi,C = (2i, 0)^t and µmCi,R∗ = µM0,R∗.

(MmR∗j) Models with contamination in µR∗ (j = 1, 2, 3): µmR∗j,C = µM0,C and µmR∗j,R∗ = (0, 0.5j)^t.

(MmCiR∗j) Models with contamination in both µC and µR∗ (i = 3; j = 1, 2, 3): µmCiR∗j,C = (2i, 0)^t and µmCiR∗j,R∗ = (0, 0.5j)^t.
After having generated a sample of size n = 100, the log-ranges, r∗, are transformed into ranges, r,
and the classical and robust estimation methods, described in Section 3.3 and Section 4.2, were applied.
Even though the contamination model only deals with contamination in the mean of the centers, log-
ranges, or both, contamination in the mean of the log-ranges will also affect the covariance matrix, as
can be seen in (4.3) and (4.4). This fact makes the interpretation of the simulation study harder but,
due to the lack of time, this problem is left for future work.
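The generation scheme above can be sketched as follows. This is an illustrative Python fragment (the thesis code is in R); in particular, the assumption that an interval is recovered from its center c and range r as [c − r/2, c + r/2] and all function names are ours.

```python
import numpy as np

rng = np.random.default_rng(2)

# Central model M0 (centers C and log-ranges R*):
mu_C0, Sigma_C0 = np.zeros(2), np.array([[2.0, 1.2], [1.2, 1.5]])
mu_R0, Sigma_R0 = np.zeros(2), np.array([[0.4, 0.14], [0.14, 0.07]])

def simulate(n, eps, shift_C, shift_R):
    """Each observation is contaminated independently with probability eps;
    contamination shifts the mean of C and/or R* but keeps the covariances."""
    bad = rng.random(n) < eps
    C = rng.multivariate_normal(mu_C0, Sigma_C0, n)
    Rs = rng.multivariate_normal(mu_R0, Sigma_R0, n)
    C[bad] += shift_C             # e.g. (2i, 0)^t for model MmCi
    Rs[bad] += shift_R            # e.g. (0, 0.5j)^t for model MmR*j
    R = np.exp(Rs)                # back-transform log-ranges to ranges
    return C - R / 2, C + R / 2   # lower and upper interval limits

# Model MmC5 (mean of the centers shifted to (10, 0)^t), eps = 0.05
lower, upper = simulate(100, 0.05, np.array([10.0, 0.0]), np.zeros(2))
```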
For each method, we computed the following measures of performance, in a similar manner to what
is done in [59].
Let $\lambda_j$ ($u_j$) be the theoretical eigenvalue (eigenvector) associated with the jth SPC and $\lambda_{j(k)}$ ($u_{j(k)}$)
the estimate of $\lambda_j$ ($u_j$) based on the kth simulated sample, $k = 1, \ldots, m$. Then:

• Absolute Cosine Value (ACV) of the eigenvectors:
$$ACV(u_j) = \frac{1}{m} \sum_{k=1}^{m} \left| \frac{u_{j(k)}^t u_j}{\|u_{j(k)}\| \, \|u_j\|} \right|, \quad \text{where } j = 1, \ldots, p;$$

• Relative Error (RE) of the eigenvalues:
$$RE(\lambda_j) = \frac{1}{m} \sum_{k=1}^{m} \left| \frac{\lambda_{j(k)} - \lambda_j}{\lambda_j} \right|, \quad \text{where } j = 1, \ldots, p;$$

• Mean Squared Error (MSE) of the eigenvalues:
$$MSE(\lambda_j) = \frac{1}{m} \sum_{k=1}^{m} \left( \lambda_{j(k)} - \lambda_j \right)^2, \quad \text{where } j = 1, \ldots, p.$$
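The three measures are straightforward to compute from the m replicates; a minimal Python sketch (function names are ours, for illustration only):

```python
import numpy as np

def acv(u_hats, u):
    """Absolute Cosine Value: mean of |u_hat^t u| / (||u_hat|| ||u||) over replicates."""
    u = np.asarray(u, dtype=float)
    return np.mean([abs(uh @ u) / (np.linalg.norm(uh) * np.linalg.norm(u))
                    for uh in map(np.asarray, u_hats)])

def rel_error(l_hats, l):
    """Relative Error of the eigenvalue estimates."""
    return np.mean(np.abs((np.asarray(l_hats, dtype=float) - l) / l))

def mse(l_hats, l):
    """Mean Squared Error of the eigenvalue estimates."""
    return np.mean((np.asarray(l_hats, dtype=float) - l) ** 2)
```

Note that the ACV ignores the sign of the estimated eigenvector, which is only identified up to a sign change, so good estimates give values close to 1.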
The theoretical eigenvalues and eigenvectors were obtained based on the population formulation
defined for each method in Chapter 3. These theoretical values of the eigenvalues allowed obtaining
the MSE of λj besides RE(λj). Moreover, we obtained kernel density plots of the eigenvalues (vide
Figure 4.2), similar to the ones presented at the beginning of this chapter, but in this case we also
include some of our proposals of robust SPC methods.

Figure 4.2: Density plots of the first eigenvalue obtained for different contamination models. (a) Model: M0, ε = 0; (b) Model: MmC5, ε = 0.05; (c) Model: MmC5, ε = 0.2.

It is important to note that here we represent the four classical SPC methods and, for each of these
methods, we include just one robust method from each type of proposal. In particular, we chose
procedure B to represent the methods based on the estimation of robust covariance matrices, and
the PP method using the MAD estimator, since MAD and Qn results are quite similar.
In Figure 4.2 we have also marked the different theoretical values for the first eigenvalue, which
enables an adequate comparison of the different estimation methods. In fact, it is only legitimate
to compare the estimates associated with each estimation method with the corresponding theoretical
reference, which is only available thanks to the population formulations we introduced in the previous
chapter.
In Figure 4.2a the results are based on data generated from the central model, M0, and, as expected,
the kernel densities for all the methods are centered around the corresponding theoretical eigenvalue.
This also validates, by simulation, the theoretical values obtained.
When we subject the data to an aggressive contamination in the centers (model MmC5), a sharp gap
between the classical and the robust approaches appears immediately, even for just 5% of contamination.
The latter remain relatively close to the theoretical value, in opposition to the classical
ones. So, we can conclude that, for this level of contamination, our robust proposals are performing
as desired, properly accommodating the outliers.
For 20% contamination this situation becomes even more obvious and the robust methods tend
to form two groups: the first includes the approaches based on robust covariances and the second the
methods based on PP. However, it was not expected that the first group of methods could perform
better than the second, since in the conventional case, in general, the PP methods are preferable.
We suspect this is due to the fact that the robust covariance matrices directly take into consideration
the structure of the symbolic covariance matrices. Further investigation on this topic is left for future
work.
To complete the discussion about the estimation of the first eigenvalue, in Figure 4.3 we represent
the MSE of the first eigenvalue obtained for a model with a severe contamination in the mean of the
centers (contamination model MmC3). For this contamination model, the classical SPC estimation
methods present higher values of the MSE, as expected. Once again it was possible to verify that
the approaches based on robust covariances lead to better results than the methods based on PP,
especially when the contamination level is 0.15 or 0.20.

Figure 4.3: MSE of the first eigenvalue obtained for the contamination model MmC3 and different levels of contamination.
Next, in Figure 4.4 we represent the ACV of the first eigenvector obtained for the same contamination
model. Let us start by noting that the ACV is based on the cosine, thus good estimates lead to values
close to 1. The conclusions for these plots are similar to the ones regarding the MSE. Nevertheless, for
the ACV it is possible to verify that the PP methods perform worse than the other robust approaches
but much better than their classical counterparts.

Figure 4.4: ACV of the first eigenvector obtained for the contamination model MmC3 and different levels of contamination.
Chapter 5
Implementation
In the early days of SDA, two European research projects were developed, leading to the creation
of the software SODAS [18]. This free software includes only the basic symbolic procedures and it
stopped being updated with the new symbolic methodologies that have been proposed. In an attempt to
overcome this problem, in recent years several packages have become available in the Comprehensive
R Archive Network (CRAN) (see Table 5.1). R [49] is an open-source software project specially
designed for statistical computing and graphics. Nowadays, much of the research in statistics is done
using R, so it is natural that many recent methods from different areas readily become available in
this software, and SDA is no exception.

Table 5.1: Available packages for SDA.

  Package          Title
  GPCSIV [7]       Generalized Principal Component of Symbolic Interval Variables
  GraphPCA [8]     GraphPCA, Graphical Tools of Histogram PCA
  HistDAWass [33]  Histogram-Valued Data Analysis
  intReg [57]      Interval Regression
  iRegression [43] Regression Methods for Interval-Valued Variables
  ISDA.R [24]      Interval Symbolic Data Analysis for R
  MAINT.Data [54]  Model and Analyse Interval Data
  RSDA [50]        RSDA - R to Symbolic Data Analysis
  smds [55]        Symbolic Multidimensional Scaling
  symbolicDA [23]  Analysis of Symbolic Data

However, just two of these packages (symbolicDA and RSDA) include some SPCA methods for
interval-valued data. Since not all the proposed methods were implemented and the available methods
only allowed obtaining principal components based on the correlation matrix, we decided to implement
the functions ourselves, adapted from these packages and from the supplementary material of [38].
The code for the implemented routines is not included in this thesis because it is quite extensive,
but it can be made available on request. In the future, we expect that some of the most useful functions
presented here will be included in the RSDA package. Instead of contributing to the proliferation of
packages for SDA, we believe it is easier for a user to have one more complete package for SDA than
several others whose functions may overlap.
One of today's challenges is to visualize complex symbolic data. In order to respond to this
challenge in the context of SPCA, we have developed a Shiny application, whose main goal is to
visualize and compare results for descriptive statistics and principal components in the conventional
and symbolic frameworks. This tool gives the opportunity to easily access the statistical results from
several perspectives, providing an easier way to analyse data.
In the first sections of this chapter, we present the implemented functions in more detail, grouped
by type of task, and in the last section we show an application to real data illustrating the potential
of the implemented functions.
5.1 Conversion
A good way to make a statistical methodology popular and familiar to practitioners is to make its
software implementation available, and R, in general, also serves this purpose. The symbolic
data community is aware of this fact and has made several packages available on this topic
(see Table 5.1). Nevertheless, these packages were developed independently and it is difficult to use
functions of two packages consecutively, since each one requires reading and handling data in a specific
format.
format. To overcome this difficulty, that is, to be able to take advantage of several packages in the
same analysis, we designed functions to make conversions between the different representations of
interval-valued data used in these packages.
Different packages adopt different formats for representing symbolic data. Usually micro-data
(see Table 2.3) are not available and the user only has to provide information in an interval-valued
format, as represented in Table 2.4. For example, iRegression [43] and ISDA.R [24] require data
as an (n × 2p) data frame where the rows are the objects and the columns the interval limits: a
minimum (lower limit) and a maximum (upper limit), as represented in Table 5.2.
Table 5.2: Symbolic Min-Max Data Frame.

       Var 1 Min              Var 1 Max              Var 2 Min              Var 2 Max              ···  Var p Min              Var p Max
  1    min_{k_1}(x_{1k_1,1})  max_{k_1}(x_{1k_1,1})  min_{k_1}(x_{1k_1,2})  max_{k_1}(x_{1k_1,2})  ···  min_{k_1}(x_{1k_1,p})  max_{k_1}(x_{1k_1,p})
  2    min_{k_2}(x_{2k_2,1})  max_{k_2}(x_{2k_2,1})  min_{k_2}(x_{2k_2,2})  max_{k_2}(x_{2k_2,2})  ···  min_{k_2}(x_{2k_2,p})  max_{k_2}(x_{2k_2,p})
  ⋮
  n    min_{k_n}(x_{nk_n,1})  max_{k_n}(x_{nk_n,1})  min_{k_n}(x_{nk_n,2})  max_{k_n}(x_{nk_n,2})  ···  min_{k_n}(x_{nk_n,p})  max_{k_n}(x_{nk_n,p})
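As a toy illustration of how micro-data are collapsed into this min-max layout, the following sketch (plain Python; the data values and function name are invented for the example) keeps, for each object and variable, the minimum and maximum over its k_i records:

```python
from collections import defaultdict

# Hypothetical micro-data: (object id, Var1, Var2) measurements (cf. Table 2.3).
micro = [
    (1, 3.1, 0.7), (1, 2.4, 1.1), (1, 2.9, 0.9),
    (2, 5.0, 1.8), (2, 4.2, 1.5),
]

def to_min_max(rows, p=2):
    """Aggregate micro-data into the (n x 2p) min-max layout of Table 5.2:
    for each object i and variable j, keep [min, max] over its k_i records."""
    groups = defaultdict(list)
    for obj, *vals in rows:
        groups[obj].append(vals)
    table = {}
    for obj, recs in groups.items():
        row = []
        for j in range(p):
            col = [r[j] for r in recs]
            row += [min(col), max(col)]   # Var j Min, Var j Max
        table[obj] = row
    return table

mm = to_min_max(micro)
# mm[1] == [2.4, 3.1, 0.7, 1.1]; mm[2] == [4.2, 5.0, 1.5, 1.8]
```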
For RSDA the format required is a Symbolic Data Table (Table 5.3), which is quite similar to the
previous one. Additionally, in this table it is necessary to indicate the type for each symbolic variable.
For example, interval variables must be preceded by a column of “$I” and histogram variables by
“$H”.
Table 5.3: Symbolic Data Table.

       $I   Var 1                  Var 1                  $I   Var 2                  Var 2                  ···  $I   Var p                  Var p
  1    $I   min_{k_1}(x_{1k_1,1})  max_{k_1}(x_{1k_1,1})  $I   min_{k_1}(x_{1k_1,2})  max_{k_1}(x_{1k_1,2})  ···  $I   min_{k_1}(x_{1k_1,p})  max_{k_1}(x_{1k_1,p})
  2    $I   min_{k_2}(x_{2k_2,1})  max_{k_2}(x_{2k_2,1})  $I   min_{k_2}(x_{2k_2,2})  max_{k_2}(x_{2k_2,2})  ···  $I   min_{k_2}(x_{2k_2,p})  max_{k_2}(x_{2k_2,p})
  ⋮
  n    $I   min_{k_n}(x_{nk_n,1})  max_{k_n}(x_{nk_n,1})  $I   min_{k_n}(x_{nk_n,2})  max_{k_n}(x_{nk_n,2})  ···  $I   min_{k_n}(x_{nk_n,p})  max_{k_n}(x_{nk_n,p})
Besides the Symbolic Min-Max Data Frame, the package iRegression also allows for data arranged
in a Symbolic Center-Range Data Frame (Table 5.4). Once again, this is a data frame with n rows and
2p columns, where the first p columns are the interval centers and the last ones the interval ranges.
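Converting between the min-max and the center-range layouts is a simple linear transformation (c = (min + max)/2, r = max − min). A Python sketch, assuming the column order min₁, max₁, …, min_p, max_p of Table 5.2 (the thesis conversion functions themselves are written in R):

```python
import numpy as np

def minmax_to_center_range(M):
    """Convert an (n x 2p) min-max matrix (columns min1, max1, ..., minp, maxp)
    to the center-range layout: first p columns centers, last p columns ranges."""
    lo, hi = M[:, 0::2], M[:, 1::2]
    return np.hstack([(lo + hi) / 2, hi - lo])

def center_range_to_minmax(CR):
    """Inverse conversion: centers c and ranges r give [c - r/2, c + r/2]."""
    p = CR.shape[1] // 2
    c, r = CR[:, :p], CR[:, p:]
    out = np.empty_like(CR)
    out[:, 0::2], out[:, 1::2] = c - r / 2, c + r / 2
    return out
```

The two functions are exact inverses of each other, so chaining them recovers the original data frame.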
Additionally, we considered a Symbolic Center-Log(Range) Data Frame (Table 5.5), which is very
similar to the previous table, but in this case the columns of the interval ranges are replaced