Principal Component Analysis With Missing Data and Outliers
Haifeng Chen
Electrical and Computer Engineering Department
Rutgers University, Piscataway, NJ, 08854

1 Introduction
Principal component analysis (PCA) [10] is a well established technique for dimensionality reduction,
and a chapter on the subject may be found in numerous texts on multivariate analysis. Examples of its
many applications include data compression, image processing, visualisation, exploratory data analysis,
pattern recognition and time series prediction. The popularity of PCA comes from three important prop-
erties. First, it is the optimal (in terms of mean squared error) linear scheme for compressing a set of high
dimensional vectors into a set of lower dimensional vectors and then reconstructing. Second, the model
parameters can be computed directly from the data – for example by diagonalizing the sample covari-
ance. Third, compression and decompression are easy operations to perform given the model parameters
– they require only matrix multiplications.
Despite these attractive features, however, PCA models have several shortcomings. One is that naive
methods for finding the principal component directions have trouble with high dimensional data or large
numbers of data points. Consider attempting to diagonalize the sample covariance matrix of $N$ vectors in a space of $d$ dimensions when $N$ and $d$ are several hundred or several thousand. Difficulties can arise both in the form of computational complexity and also data scarcity. Computing the sample covariance itself is very costly, requiring $O(Nd^2)$ operations. In general it is best to avoid computing the sample covariance explicitly.
Another shortcoming of standard approaches to PCA is that it is not obvious how to deal properly with incomplete data sets, in which some of the points are missing. Currently the incomplete points are either discarded or completed using a variety of interpolation methods. However, such approaches are no longer valid when a significant portion of the measurement matrix is unknown.
Typically, the training data for PCA is pre-processed in some way. But in some realistic problems
where the amount of training data is huge, it becomes impractical to manually verify that all the data is
'good'. In general, training data may contain some errors from the underlying data generation method. We view these error points as "outliers". However, the standard PCA algorithm is based on the assumption that the data have not been spoiled by outliers. In the presence of outliers, a robust version of PCA has to be developed.
To address these drawbacks of standard PCA, many methods have been proposed in the fields of statistics, computer engineering, neural networks, etc. The purpose of this project is to give an overview of those methods and perform some experiments to show how the improved PCA algorithms can deal with missing data and outliers in high dimensional data sets. In Section 2, a brief introduction to standard PCA is presented. To deal with high dimensional data, we describe an EM algorithm to calculate principal components in Section 3.2. Section 4 presents PCA for data sets containing missing points. In Section 5, we give a detailed description of current robust PCA algorithms. Some experimental results are provided in Section 6.
2 Principal component analysis (PCA)
The most common derivation of PCA is in terms of a standardized linear projection which maximizes the variance in the projected space [10]. For a set of observed $d$-dimensional data vectors $\{\mathbf{t}_i\}$, $i = 1, \dots, N$, the $q$ principal axes $\mathbf{w}_j$, $j = 1, \dots, q$, are those orthonormal axes onto which the retained variance under projection is maximal. It can be shown that the vectors $\mathbf{w}_j$ are given by the $q$ dominant eigenvectors (i.e. those with the largest associated eigenvalues $\lambda_j$) of the sample covariance matrix

    $S = \frac{1}{N} \sum_{i=1}^{N} (\mathbf{t}_i - \bar{\mathbf{t}})(\mathbf{t}_i - \bar{\mathbf{t}})^T$,    (1)

where $\bar{\mathbf{t}}$ is the data sample mean, such that

    $S \mathbf{w}_j = \lambda_j \mathbf{w}_j$.    (2)

The $q$ principal components of the observed vector $\mathbf{t}_i$ are given by the vector

    $\mathbf{x}_i = W^T (\mathbf{t}_i - \bar{\mathbf{t}})$,    (3)

where $W = (\mathbf{w}_1, \mathbf{w}_2, \dots, \mathbf{w}_q)$. The variables $x_{ij}$ are then uncorrelated, such that the covariance matrix $\frac{1}{N}\sum_{i=1}^{N} \mathbf{x}_i \mathbf{x}_i^T$ is diagonal with elements $\lambda_j$.

A complementary property of PCA, and the one most closely related to the original discussions of [17], is that, of all orthogonal linear projections (3), the principal component projection minimizes the squared reconstruction error $\sum_{i} \| \mathbf{t}_i - \hat{\mathbf{t}}_i \|^2$, where the optimal linear reconstruction of $\mathbf{t}_i$ is given by

    $\hat{\mathbf{t}}_i = W \mathbf{x}_i + \bar{\mathbf{t}}$.    (4)
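As a concrete illustration of equations (1)-(4), the following sketch (in Python with NumPy; the function names are ours, not part of any standard package) computes the principal axes by diagonalizing the sample covariance and then projects and reconstructs the data:

```python
import numpy as np

def pca_fit(T, q):
    """T: N x d data matrix; returns the mean, the d x q basis W and eigenvalues."""
    t_bar = T.mean(axis=0)                                # sample mean
    D = T - t_bar
    S = D.T @ D / T.shape[0]                              # sample covariance, eq. (1)
    lam, vecs = np.linalg.eigh(S)                         # eigendecomposition, eq. (2)
    order = np.argsort(lam)[::-1][:q]                     # q dominant eigenvectors
    return t_bar, vecs[:, order], lam[order]

def pca_project(T, t_bar, W):
    return (T - t_bar) @ W                                # eq. (3): x = W^T (t - t_bar)

def pca_reconstruct(X, t_bar, W):
    return X @ W.T + t_bar                                # eq. (4): t_hat = W x + t_bar

# toy usage on correlated synthetic data
rng = np.random.default_rng(0)
T = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))
t_bar, W, lam = pca_fit(T, q=2)
X = pca_project(T, t_bar, W)
sq_err = np.sum((T - pca_reconstruct(X, t_bar, W)) ** 2)  # squared reconstruction error
```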
3 EM Algorithm for PCA
In this section, a version of the expectation maximization (EM) algorithm [18] for learning the principal components of a data set is presented. The algorithm does not require computing the sample covariance, and it can deal with high dimensional data more efficiently than traditional PCA. In Section 3.1 a probabilistic model for PCA is given. Based on that model, the EM algorithm is presented in Section 3.2, together with a discussion of its advantages.
3.1 Probabilistic Model of PCA
Principal component analysis can be viewed as a limiting case of a particular class of linear Gaussian models. The goal of such models is to capture the covariance structure of an observed $d$-dimensional variable $\mathbf{t}$ using fewer than the $d(d+1)/2$ free parameters required in a full covariance matrix. Linear Gaussian models do this by assuming that $\mathbf{t}$ was produced as a linear transformation of some $q$-dimensional latent variable $\mathbf{x}$ plus additive Gaussian noise. Denoting the transformation by the $d \times q$ matrix $W$, and the $d$-dimensional noise vector by $\boldsymbol{\epsilon}$ (with covariance matrix $\Psi$), the generative model can be written as

    $\mathbf{t} = W \mathbf{x} + \boldsymbol{\epsilon}$.    (5)

Conventionally, $\mathbf{x} \sim \mathcal{N}(\mathbf{0}, I)$, i.e. the latent variables are defined to be independent and Gaussian with unit variance. By additionally specifying the error, or noise, model to be likewise Gaussian, $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \Psi)$, equation (5) induces a corresponding Gaussian distribution for the observations,

    $\mathbf{t} \sim \mathcal{N}(\mathbf{0}, W W^T + \Psi)$.    (6)

In order to save parameters over the direct covariance representation in $d$-space, it is necessary to choose $q < d$ and also to restrict the covariance structure of the Gaussian noise $\boldsymbol{\epsilon}$ by constraining the matrix $\Psi$. For example, if the shape of the noise distribution is restricted to be axis aligned (its covariance matrix is diagonal) the model is known as factor analysis.

For the case of isotropic noise $\Psi = \sigma^2 I$, equation (5) implies a probability distribution over $\mathbf{t}$-space for a given $\mathbf{x}$ of the form

    $p(\mathbf{t} \mid \mathbf{x}) = (2\pi\sigma^2)^{-d/2} \exp\left\{ -\frac{1}{2\sigma^2} \| \mathbf{t} - W\mathbf{x} \|^2 \right\}$.    (7)
Using Bayes' rule, the posterior distribution of the latent variables $\mathbf{x}$ given the observed $\mathbf{t}$ may be calculated:

    $p(\mathbf{x} \mid \mathbf{t}) = (2\pi)^{-q/2} \, |\sigma^{-2} M|^{1/2} \exp\left\{ -\frac{1}{2} (\mathbf{x} - M^{-1} W^T \mathbf{t})^T (\sigma^{-2} M) (\mathbf{x} - M^{-1} W^T \mathbf{t}) \right\}$,    (8)

where the posterior covariance matrix is given by

    $\sigma^2 M^{-1} = \sigma^2 (\sigma^2 I + W^T W)^{-1}$,    (9)

and $M = \sigma^2 I + W^T W$ is a $q \times q$ matrix.
3.2 EM Algorithm for PCA
Principal component analysis is a limiting case of the linear Gaussian model as the covariance of the noise $\boldsymbol{\epsilon}$ becomes infinitely small and equal in all directions. Mathematically, PCA is obtained by taking the limit $\Psi = \lim_{\sigma^2 \to 0} \sigma^2 I$. This has the effect of making the likelihood of a point dominated solely by the squared distance between it and its reconstruction $W\mathbf{x}$. The directions of the columns of $W$ which minimize this error are known as the principal axes. Inference now reduces to simple least squares projection, and the EM algorithm of [18] for a zero-mean data matrix $T = (\mathbf{t}_1, \dots, \mathbf{t}_N)$ alternates two steps until convergence:

    e-step:  $X = (W^T W)^{-1} W^T T$
    m-step:  $W^{new} = T X^T (X X^T)^{-1}$

where the columns of $X$ hold the current estimates of the latent coordinates $\mathbf{x}_i$.
4 PCA with Missing Data

In many practical applications, some entries of the measurement matrix are unobservable. A typical example is a measurement matrix in which some entries, marked *, are unknown: every * indicates an unobservable face, since there are only six visible faces from each nonsingular view. For such kind of data, principal component analysis with missing data (PCAMD) has to be used. Instead of estimating only $\mathbf{x}$ as the value which minimizes the squared distance between the point and its reconstruction, PCAMD generalizes the e-step to:

    generalized e-step: For each (possibly incomplete) point $\mathbf{t}$, find the unique pair of points $\mathbf{x}^*$ and $\mathbf{t}^*$ (such that $W\mathbf{x}^*$ lies in the current principal subspace and $\mathbf{t}^*$ lies in the subspace defined by the known information about $\mathbf{t}$) which minimize the norm $\| W \mathbf{x}^* - \mathbf{t}^* \|$. Set the corresponding column of $X$ to $\mathbf{x}^*$ and the corresponding column of $T$ to $\mathbf{t}^*$.
If $\mathbf{t}$ is complete, then $\mathbf{t}^* = \mathbf{t}$ and $\mathbf{x}^*$ is found exactly as before. If not, then $\mathbf{x}^*$ and $\mathbf{t}^*$ are the solution to a least squares problem and can be found by, for example, QR factorization of a particular constraint matrix.
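For illustration, one simple way to realize this generalized e-step for a single incomplete point is to solve the least squares problem restricted to the observed rows of $W$; this sketch assumes missingness is given by a boolean mask, and the QR-based formulation mentioned above is an equivalent route to the same solution:

```python
import numpy as np

def generalized_e_step(t, observed, W):
    """t: length-d vector (missing entries arbitrary), observed: boolean mask,
    W: current d x q basis. Returns the pair (x_star, t_star)."""
    Wo = W[observed]                                      # rows of W at the known entries
    x_star, *_ = np.linalg.lstsq(Wo, t[observed], rcond=None)
    t_star = t.copy()
    t_star[~observed] = W[~observed] @ x_star             # complete the unknown entries
    return x_star, t_star
```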
In the above generalized EM algorithm, we still assume the measurements have already been centered. But in the case of missing data, especially when a significant portion of the measurement matrix is unknown, the average of the data may not be a very reliable estimate of the mean. Instead of using the centered data, some methods treat the mean as extra parameters of the optimization, such as Wiberg's method [22] in the next section.
4.1 Wiberg’s Method
Suppose the $d \times n$ measurement matrix $Y$ has rank $r$. If the data is complete, $Y$ can be decomposed as $Y \approx U \Sigma V^T + \boldsymbol{\mu} \mathbf{1}^T$, where $\Sigma$ is an $r \times r$ diagonal matrix, $\boldsymbol{\mu}$ is the maximum likelihood approximation of the mean vector, and $\mathbf{1}^T = (1, 1, \dots, 1)$ is an $n$-tuple with all ones. The solution of this problem is essentially the SVD of the centered (or registered) data matrix $Y - \boldsymbol{\mu} \mathbf{1}^T$.

If the data is incomplete, we have the following minimization problem:

    $\min_{\boldsymbol{\mu}, U, V} \; \frac{1}{2} \sum_{(i,j) \in I} \left( y_{ij} - \mu_i - \mathbf{u}_i^T \mathbf{v}_j \right)^2$,    $I = \{ (i,j) : y_{ij} \text{ is observed}, \; 1 \le i \le d, \; 1 \le j \le n \}$,    (13)

where $\mathbf{u}_i^T$ and $\mathbf{v}_j$ are column vector notations for the $i$-th row of $U$ and the $j$-th column of $V$, respectively.
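A rough sketch of one way to attack (13) is coordinate descent that alternately re-solves for the columns of $V$ and for the rows of $U$ together with the mean entries $\mu_i$; this illustrative scheme is a simple stand-in for Wiberg's actual algorithm [22], and all names below are ours:

```python
import numpy as np

def pca_missing(Y, M, r, n_iter=50, seed=0):
    """Y: d x n measurements, M: boolean mask of observed entries, r: rank."""
    d, n = Y.shape
    rng = np.random.default_rng(seed)
    U, V = rng.normal(size=(d, r)), rng.normal(size=(r, n))
    mu = np.where(M, Y, 0.0).sum(axis=1) / M.sum(axis=1)  # initial per-row mean
    for _ in range(n_iter):
        R = Y - mu[:, None]
        for j in range(n):                                # re-solve each column v_j
            m = M[:, j]
            V[:, j] = np.linalg.lstsq(U[m], R[m, j], rcond=None)[0]
        for i in range(d):                                # re-solve each row u_i and mu_i
            m = M[i]
            A = np.hstack([V[:, m].T, np.ones((m.sum(), 1))])
            sol = np.linalg.lstsq(A, Y[i, m], rcond=None)[0]
            U[i], mu[i] = sol[:r], sol[r]
    return mu, U, V
```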
5 Robust PCA

All the PCA algorithms mentioned before are based on the assumption that the data have not been spoiled by outliers. In practice, real data often contain outliers, and usually they are not easy to separate from the data set. In Section 2, we showed that traditional PCA constructs the rank $q$ subspace approximation to zero-mean training data that is optimal in a least-squares sense. It is commonly known that least squares techniques are not robust in the sense that outlying measurements can arbitrarily skew the solution from the desired solution [11]. Overcoming this drawback of the original PCA is still an active research direction. Several methods have been proposed in the fields of statistics, neural networks, computer engineering, etc., but they all have certain limitations.
5.1 Robust PCA by Robustifying the Covariance Matrix
To cope with outliers, the most commonly used approaches in statistics [4][11][19] replace the standard estimation of the covariance matrix, $S$, with a robust estimator of the covariance matrix, $S^*$. This formulation weights the mean and the outer products which form the covariance matrix. Calculating the eigenvalues and eigenvectors of this robust covariance matrix gives principal components that are robust to sample outliers. The mean and the robust covariance matrix can be calculated as:

    $\boldsymbol{\mu} = \frac{\sum_{i=1}^{N} w_1(d_i^2) \, \mathbf{t}_i}{\sum_{i=1}^{N} w_1(d_i^2)}$,    (22)

    $S^* = \frac{\sum_{i=1}^{N} w_2(d_i^2) \, (\mathbf{t}_i - \boldsymbol{\mu})(\mathbf{t}_i - \boldsymbol{\mu})^T}{\sum_{i=1}^{N} w_2(d_i^2) - 1}$,    (23)

where $w_1(d_i^2)$ and $w_2(d_i^2)$ are scalar weights, which are a function of the Mahalanobis distance

    $d_i^2 = (\mathbf{t}_i - \boldsymbol{\mu})^T S^{*-1} (\mathbf{t}_i - \boldsymbol{\mu})$,    (24)
and $S^*$ is iteratively estimated. Numerous possible weight functions have been proposed (e.g. Huber's weighting coefficients [11], or $w_2(d_i^2) = (w_1(d_i^2))^2$ [4]). These approaches, however, weight entire data samples and are not appropriate for the cases when only a few individual elements are corrupted by outliers. Another related approach would be to robustly estimate each element of the covariance matrix, but this is not guaranteed to result in a positive definite matrix [4].
These methods, based on robust estimation of the full covariance matrix, are computationally impractical for high dimensional data such as images. Note that just computing the covariance matrix requires $O(Nd^2)$ operations. Also, in some practical applications it is difficult to gather sufficient training data to guarantee that the covariance matrix is full rank.
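For data of modest dimension, the reweighting scheme (22)-(24) is straightforward to implement. The sketch below assumes a simple Huber-like weight function $w_1$ (the particular choices in [4][11] differ) and iterates the three equations:

```python
import numpy as np

def robust_mean_cov(T, n_iter=10, c=3.0):
    """T: N x d data; returns a robust mean and covariance via (22)-(24)."""
    mu, C = T.mean(axis=0), np.cov(T.T)                   # classical starting point
    for _ in range(n_iter):
        D = T - mu
        d2 = np.einsum('ij,jk,ik->i', D, np.linalg.inv(C), D)   # eq. (24)
        w1 = np.minimum(1.0, c / np.sqrt(d2 + 1e-12))           # Huber-like weights
        w2 = w1 ** 2                                            # as in [4]
        mu = (w1[:, None] * T).sum(axis=0) / w1.sum()           # eq. (22)
        Dm = T - mu
        C = (w2[:, None] * Dm).T @ Dm / (w2.sum() - 1.0)        # eq. (23)
    return mu, C
```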
5.2 Robust PCA by Projection Pursuit
Li and Chen [12] proposed a solution based on projection pursuit (PP). Dealing with high dimensional
data, PP searches for low dimensional projections that maximize (minimize) an objective function called
projection index. By working in the low dimensional projections, it manages to avoid the difficulty
caused by sparseness of the high-dimensional data.
Principal component analysis is actually a special PP procedure. Let $\mathbf{t}$ be the $d$-dimensional random vector with covariance $\Sigma$, and let $F_{\mathbf{a}^T \mathbf{t}}$ be the distribution function of $\mathbf{a}^T \mathbf{t}$, where $\mathbf{a}$ is a $d$-vector. Denote the eigenvalues of $\Sigma$ by $\lambda_1 \ge \dots \ge \lambda_d$. Recall that the first principal component is the projection of $\mathbf{t}$ onto a certain direction $\mathbf{a}_1$; that is,

    $\sigma(F_{\mathbf{a}_1^T \mathbf{t}}) = \max_{\|\mathbf{a}\| = 1} \sigma(F_{\mathbf{a}^T \mathbf{t}}) = \max_{\|\mathbf{a}\| = 1} (\mathbf{a}^T \Sigma \mathbf{a})^{1/2}$.    (25)

It is well known that $\sigma^2(F_{\mathbf{a}_1^T \mathbf{t}}) = \mathbf{a}_1^T \Sigma \mathbf{a}_1$ is the largest eigenvalue $\lambda_1$ and that $\mathbf{a}_1$ is the associated eigenvector. In the subsequent steps, each new direction is constrained to be orthogonal to all previous directions. For example, the second principal component direction $\mathbf{a}_2$ is determined by

    $\sigma(F_{\mathbf{a}_2^T \mathbf{t}}) = \max_{\|\mathbf{a}\| = 1, \; \mathbf{a} \perp \mathbf{a}_1} \sigma(F_{\mathbf{a}^T \mathbf{t}})$.    (26)
When the measurement matrix contains outliers, [12] replace the standard scale estimator $\sigma(\cdot)$ with a robust scale estimator to deal with the outliers.
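As an illustration of the projection pursuit idea, the following toy sketch searches for the first robust axis by maximizing a robust scale (here the median absolute deviation, standing in for the robust scale estimator of [12]) over candidate directions drawn from the data itself:

```python
import numpy as np

def mad(u):
    """Median absolute deviation: a simple robust scale estimator."""
    return np.median(np.abs(u - np.median(u)))

def first_robust_axis(T):
    """T: N x d (robustly centered) data; returns the candidate direction
    maximizing the robust scale of the projected data."""
    A = T / np.linalg.norm(T, axis=1, keepdims=True)      # candidate unit directions
    scores = [mad(T @ a) for a in A]
    return A[int(np.argmax(scores))]
```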
Ammann proposed a similar idea for robust PCA using projection pursuit [1]. In his approach, the projection pursuit estimation of the eigenvectors of the covariance matrix can be expressed as follows.
Determine the last principal axis $\mathbf{a}_d$ by minimizing

    $\sum_{j=1}^{N} \rho(\mathbf{t}_j^T \mathbf{a}_d)$    (27)

subject to the constraint $\|\mathbf{a}_d\| = 1$, where $\mathbf{t}_j$ denotes the $j$-th measurement vector. Then for $k = d-1, \dots, 1$, determine $\mathbf{a}_k$ to minimize

    $\sum_{j=1}^{N} \rho(\mathbf{t}_j^T \mathbf{a}_k)$    (28)

subject to the constraints $\|\mathbf{a}_k\| = 1$ and $\mathbf{a}_k^T \mathbf{a}_i = 0$, $k+1 \le i \le d$. Here $\rho(\cdot)$ is a robust loss function that bounds the influence of outliers; ordinary eigenvectors are obtained by setting $\rho(u) = u^2$.

5.3 Robust PCA by Self-Organizing Neural Networks
The solution of the standard PCA is obtained after all the data have been collected and the sample covariance matrix $S$ has been calculated, i.e., the approach works in a batch way. When a new sample $\mathbf{t}'$ is added, we have to recalculate the corresponding new covariance matrix

    $S' = \frac{N S + \mathbf{t}' \mathbf{t}'^T}{N + 1}$,    (29)

and then all the computations for solving (2) are repeated by solving

    $S' \mathbf{w}_j = \lambda_j \mathbf{w}_j$.    (30)

Such an approach is not suitable for real applications where data arrive incrementally or in an online way.
The problem can be solved by a number of existing self-organizing rules for PCA [15][16][23]. The commonly used rules are listed as follows:

    $\mathbf{w}(k+1) = \mathbf{w}(k) + \alpha(k) \left( y \, \mathbf{t} - y^2 \, \mathbf{w}(k) \right)$,    (31)

    $\mathbf{w}(k+1) = \mathbf{w}(k) + \alpha(k) \left( y \, \mathbf{t} - y^2 \, \frac{\mathbf{w}(k)}{\mathbf{w}(k)^T \mathbf{w}(k)} \right)$,    (32)

    $\mathbf{w}(k+1) = \mathbf{w}(k) + \alpha(k) \left[ y \, (\mathbf{t} - \hat{\mathbf{t}}) + (y - y') \, \mathbf{t} \right]$,    (33)

where $y = \mathbf{w}(k)^T \mathbf{t}$, $\hat{\mathbf{t}} = y \, \mathbf{w}(k)$, $y' = \mathbf{w}(k)^T \hat{\mathbf{t}}$, and $\alpha(k)$ is the learning rate, which decreases to zero as $k \to \infty$ while satisfying certain conditions, e.g.,

    $\sum_k \alpha(k) = \infty$,  $\sum_k \alpha(k)^2 < \infty$.    (34)

Each of the three rules will converge to the principal component vector $\mathbf{w}$ almost surely under some mild conditions, which are studied in detail in [15][16][23]. By regarding $\mathbf{w}(k)$ as the weight vector (i.e., the vector consisting of synapses) of a linear neuron with output $y = \mathbf{w}(k)^T \mathbf{t}$, all of the three rules can be considered as modifications of the well-known Hebbian rule

    $\mathbf{w}(k+1) = \mathbf{w}(k) + \alpha(k) \, y \, \mathbf{t}$    (35)

for self-organizing the synapses of a neuron.
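For example, rule (31) can be implemented in a few lines. The sketch below (illustrative names, NumPy assumed) processes one sample at a time with a decreasing learning rate $\alpha(k) = \alpha_0 / k$:

```python
import numpy as np

def oja_first_axis(samples, alpha0=0.1):
    """samples: iterable of d-vectors; returns an estimate of the first axis."""
    w = None
    for k, t in enumerate(samples, start=1):
        if w is None:
            w = t / np.linalg.norm(t)                     # initialize from the first sample
        y = w @ t                                         # neuron output y = w^T t
        w = w + (alpha0 / k) * (y * t - (y ** 2) * w)     # rule (31), alpha(k) -> 0
    return w / np.linalg.norm(w)
```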
From the view of statistical physics, all these rules (31)(32)(33) are connected to certain energy functions. For example, the rule (33) is an adaptive rule for minimizing the energy function

    $E(W) = \sum_{i=1}^{N} \varepsilon(\mathbf{e}_i)$,    (36)

where $\mathbf{x}_i = W^T \mathbf{t}_i$ are the linear coefficients obtained by projecting the training data onto the principal subspace, $\mathbf{e}_i = \mathbf{t}_i - W W^T \mathbf{t}_i$ is the reconstruction error vector, and $\varepsilon(\mathbf{e}_i) = \mathbf{e}_i^T \mathbf{e}_i$ is the reconstruction error of $\mathbf{t}_i$.

In case of outliers, Xu and Yuille [13] have proposed an algorithm that generalizes the energy function (36) by introducing additional binary variables that are zero when a data sample is considered an outlier:

    $E(W, Z) = \sum_{i=1}^{N} \left[ z_i \, \varepsilon(\mathbf{e}_i) + \eta \, (1 - z_i) \right]$,    (37)

where each $z_i$ in $Z = \{ z_1, z_2, \dots, z_N \}$ is a binary random variable. If $z_i = 1$ the sample $\mathbf{t}_i$ is taken into consideration; otherwise it is equivalent to discarding $\mathbf{t}_i$ as an outlier. The second term in (37) is a penalty term, or prior, that discourages the trivial solution where all $z_i$ are zero. Given $W$, if the reconstruction error $\varepsilon(\mathbf{e}_i)$ is smaller than the threshold $\eta$, then the algorithm prefers to set $z_i = 1$, considering the sample $\mathbf{t}_i$ an inlier, and $z_i = 0$ if it is greater than or equal to $\eta$. Minimization of (37)
involves a combination of discrete and continuous optimization problems, and Xu and Yuille [13] derive a mean field approximation to the problem which, after marginalizing the binary variables, can be solved by minimizing

    $E_{MF}(W) = -\sum_{i=1}^{N} \frac{1}{\beta} \log\left( 1 + e^{-\beta (\varepsilon(\mathbf{e}_i) - \eta)} \right)$,    (38)

where the summand is a function related to robust statistical estimators [2]. The parameter $\beta$ can be varied as an annealing parameter in an attempt to avoid local minima.
Based on such a reformulation of the energy function, we can obtain the corresponding robust versions of the adaptive self-organizing rules (31)(32)(33). For example, the rule (32) changes into

    $\mathbf{w}(k+1) = \mathbf{w}(k) + \alpha(k) \, \frac{1}{1 + e^{\beta (\varepsilon(\mathbf{e}_i) - \eta)}} \left( y \, \mathbf{t} - y^2 \, \frac{\mathbf{w}(k)}{\mathbf{w}(k)^T \mathbf{w}(k)} \right)$,    (39)

and the rule (33) changes into

    $\mathbf{w}(k+1) = \mathbf{w}(k) + \alpha(k) \, \frac{1}{1 + e^{\beta (\varepsilon(\mathbf{e}_i) - \eta)}} \left[ y \, (\mathbf{t} - \hat{\mathbf{t}}) + (y - y') \, \mathbf{t} \right]$.    (40)
Finally, the converged vector $\mathbf{w}$ is taken as the resulting principal component vector, which avoids the effects of the outliers. In addition, a byproduct can be easily obtained by

    $z_i = \begin{cases} 1, & \varepsilon(\mathbf{e}_i) < \eta \\ 0, & \text{otherwise}, \end{cases}$    (41)
which indicates whether $\mathbf{t}_i$ is an outlier ($z_i = 0$) or not ($z_i = 1$).
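The following sketch illustrates how a single update of the robust rule (39) could be implemented: the ordinary update of rule (32) is scaled by the soft weight $1/(1 + e^{\beta(\varepsilon(\mathbf{e}_i) - \eta)})$, so samples with large reconstruction error contribute little. Here $\beta$ and $\eta$ are assumed fixed rather than annealed, and a single neuron (one axis) is used:

```python
import numpy as np

def robust_oja_step(w, t, alpha, beta=5.0, eta=1.0):
    """One update of the gated rule (39) for a single sample t."""
    y = w @ t                                             # output y = w^T t
    e = t - y * w / (w @ w)                               # reconstruction error vector
    gate = 1.0 / (1.0 + np.exp(beta * (e @ e - eta)))     # soft outlier weight
    return w + alpha * gate * (y * t - (y ** 2) * w / (w @ w))
```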
5.4 Robust PCA by Weighted SVD

The approach of robust PCA by neural networks is of limited application in some practical problems, as it rejects entire data measurements as outliers. In some applications, outliers typically correspond to small groups of points in the measurement vector, and we seek a method that is robust to this type of outlier yet does not reject the good points in the data samples. Gabriel and Zamir [8] give a partial solution. They propose a weighted Singular Value Decomposition (SVD) technique that can be used to construct the principal subspace. In their approach, they minimize

    $E(W, X) = \sum_{i=1}^{N} \sum_{j=1}^{d} m_{ji} \left( t_{ji} - \tilde{\mathbf{w}}_j^T \mathbf{x}_i \right)^2$,    (42)
where $\tilde{\mathbf{w}}_j$ is a column vector containing the elements of the $j$-th row of $W$. This effectively puts a weight $m_{ji}$ on every point in the training data. In related work, Greenacre [9] gives a partial solution to the problem of factorizing matrices with known weighting data by introducing the Generalized Singular Value Decomposition (GSVD). This approach applies when the known weights in (42) are separable; that is, one weight for each row and one for each column: $m_{ji} = m_j m_i$. The basic idea is to first whiten the data using the weights, perform SVD, and then un-whiten the bases, as sketched below. The benefit of this approach is that it takes advantage of efficient implementations of the SVD algorithm. The disadvantages are that the weights must somehow already be known and that individual point outliers are not allowed.
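For separable weights, the whiten/SVD/un-whiten recipe can be sketched as follows (illustrative names; the per-row and per-column weights are assumed strictly positive):

```python
import numpy as np

def weighted_rank_q(Y, m_row, m_col, q):
    """Y: d x n matrix; m_row, m_col: positive per-row / per-column weights."""
    dr, dc = np.sqrt(m_row), np.sqrt(m_col)
    Yw = dr[:, None] * Y * dc[None, :]                    # whiten with the weights
    U, s, Vt = np.linalg.svd(Yw, full_matrices=False)
    Yq = (U[:, :q] * s[:q]) @ Vt[:q]                      # best rank-q fit, weighted norm
    return Yq / dr[:, None] / dc[None, :]                 # un-whiten the approximation
```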
In the general robust case, where the weights are unknown and there may be a different weight at every point of every training sample, there is no such solution that leverages the SVD [8][9], and one must solve the minimization problem with "criss-cross regressions", which involve iteratively computing dyadic (rank 1) fits using weighted least squares. The approach alternates between solving for $\tilde{\mathbf{w}}_j$ or $\mathbf{x}_i$ while the other is fixed; this is similar to the EM approach we discussed before, but without a probabilistic interpretation. In this spirit, Gabriel and Odoroff [8] note how the quadratic formulation in (36) is not robust to outliers and propose making the rank 1 fitting process in (42) robust. They propose a number of methods to make the criss-cross regressions robust, but they apply the approach to very low dimensional data, and their optimization methods do not scale well to very high dimensional data such as images. In related work, Croux and Filzmoser [5] use a similar idea to construct a robust matrix factorization based on a weighted $L_1$ norm.
5.5 Torre and Black’s Algorithm
In the computer vision field, PCA is a popular technique for parameterizing shape, appearance, and
motion [3][20][14]. Learned PCA representations have proven useful for solving problems such as face
and object recognition, tracking, detection, and background modeling [20][14]. Typically, the training
data for PCA is pre-processed in some way (e.g. faces are aligned [14]) or is generated by some other
vision algorithm (e.g. optical flow is computed from training data [3]). As automated learning methods
are applied to more realistic problems, and the amount of training data increases, it becomes impractical
to manually verify that all the data is good. In general, training data may contain undesirable artifacts
due to occlusion (e.g. a hand in front of a face), illumination (e.g. specular reflections), image noise
(e.g. from scanning archival data), or errors from the underlying data generation method (e.g. incorrect
optical flow vectors). We view these artifacts as statistical outliers.
Due to the high dimensionality of the image data, we cannot rely on the calculation of a robust covariance matrix to get the principal components, and the projection based approach also suffers from high computational cost. The approach of Xu and Yuille described in the previous section suffers from three main problems. First, a single bad pixel value can make an image lie far enough from the subspace that the entire sample is treated as an outlier (i.e. $z_i = 0$) and has no influence on the estimate of $W$. Second, Xu and Yuille use a least squares projection of the data $\mathbf{t}_i$ for computing the distance to the subspace; that is, the coefficients that reconstruct the data $\mathbf{t}_i$ are $\mathbf{x}_i = W^T \mathbf{t}_i$. These reconstruction coefficients can be arbitrarily biased by an outlier. Finally, a binary outlier process is used which either completely rejects or includes a sample.
To make robust PCA work efficiently for image data, Torre and Black [7] proposed a more general analog outlier process that has computational advantages and provides a connection to robust statistical estimation. Their algorithm alternately updates the mean, the basis, the coefficients and the outlier process, proceeding in the same manner until convergence. It is worth noting that there are several possible ways to update the parameters more efficiently, rather than using a closed form solution.
6 Experimental Results
Experiments were performed to test some of the algorithms discussed in the previous sections. In Section 6.1, we use 2 dimensional and 40 dimensional data separately to show the efficiency of the EM algorithm. In Section 6.2, we use 40 dimensional data in which 20% of the points are missing and show how Wiberg's method works on such incomplete data. The same kind of 40 dimensional data are used in the experiment of Section 6.3, but some of them are corrupted by outliers. For such data, we compare the results of robust PCA with those of standard PCA. Another experiment with real images is also provided in that section.
6.1 Testing the EM algorithm
First we use 2D synthetic data with a Gaussian distribution to test the EM algorithm introduced in Section 3.2 (Figure 1). The data and the initial principal axis are shown in Figure 1(a). The first and second iterations of the principal axis are shown in Figure 1(b)(c). Comparing with the results of standard PCA
Figure 1: The EM based PCA for 2D data. (a) The data and initial value of the principal axis; (b) The first iteration; (c) The second iteration; (d) The data and the principal axis by standard PCA.
(Figure 1(d)), we find that the EM algorithm converges to the correct solution in only two steps, which is very efficient.
In the second example, 10 data vectors were used for the PCA algorithm. Each vector contains 40
dimensional data which were sampled from one shifted sinusoid curve. The whole data set is plotted
in Figure 2(a), in which each sinusoid curve is related to one data vector. Figure 2(b)(c) show the
results of standard PCA. The two principal axes found by standard PCA are shown in Figure 2(b). The
reconstructed signals by those two principal components are shown in Figure 2(c). Figure 2(d)(e) give the
principal axes and reconstructed signals found in the first iteration of the EM algorithm. Figure 2(f)(g)
show the principal axes and reconstructed signals found in the fourth iteration of the EM algorithm.
6.2 PCA with missing data
Here we also use a set of vectors formed by 10 shifted harmonic sinusoid functions. By randomly removing 20% of the data points, we obtain the data set shown in Figure 3(a). Obviously, standard PCA cannot deal with such data because some of the pixels are unknown. We use Wiberg's algorithm to extract the two principal axes and reconstruct the data with those two principal axes. Figure 3(b)(c) show the results in the third iteration of the algorithm, and Figure 3(d)(e) show the results in the fifth iteration. Note that the functions representing the estimated principal axes get smoother after every iteration. Finally, in the seventh iteration, we obtain very smooth principal axes and a perfect reconstruction of the input vectors, which are shown in Figure 3(f)(g).
6.3 PCA with outliers
Several robust PCA methods were described in Section 5; here we use Torre and Black's algorithm introduced in Section 5.5 to show that robust PCA performs better than traditional PCA in the presence of outliers.
In the first experiment, we still use the data sampled from sinusoid functions, but 10% of the elements are contaminated with outliers (Figure 4(a)). Figure 4(b)(c) depict the two principal axes and the reconstructed signals by standard PCA. Figure 4(d)(e) depict the two principal axes and the reconstructed signals by robust PCA after 30 iterations. Obviously the robust PCA gives a much more reliable reconstruction than standard PCA.
In the second experiment, we use a collection of images gathered from a static camera over the day (from 'http://web.salleurl.edu/~ftorre/') as the training set of PCA. There are changes in the illumination of the static background, and 45% of the images contain people in different locations. Our purpose is to build a model of the background by using PCA. We treat the people in the images as outliers and use PCA to extract the background model. The left column of Figure 5 shows examples of the training images. The middle column shows the reconstruction of each of the illustrated training images using standard PCA, and the right column shows the reconstruction obtained with the same number of robust PCA basis vectors. We find that the robust PCA is able to capture the illumination changes while ignoring the people. Once we obtain the desired background model which accounts for illumination variation, we can use it in applications such as person detection and tracking.
References
[1] L. P. Ammann. Robust singular value decompositions: A new approach to projection pursuit. J. of
Amer. Stat. Assoc., 88(422):505–514, 1993.
[2] M. J. Black and A. Rangarajan. On the unification of line processes, outlier rejection, and robust statistics with applications in early vision. International J. of Computer Vision, 19(1):57–91, 1996.
18
[3] M. J. Black, Y. Yacoob, A. Jepson, and D. J. Fleet. Learning parameterized models of image motion. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition, San Juan, Puerto Rico, volume I, pages 561–567, 1997.
[4] N. A. Campbell. Robust procedures in multivariate analysis I : Robust covariance estimation.
Applied Statistics, 29(3):231–237, 1980.
[5] C. Croux and P. Filzmoser. Robust factorization of a data matrix. In Proc. Computational Statistics (COMPSTAT), pages 245–249, 1998.
[6] Y. Dodge. Analysis of Experiments with Missing Data. Wiley, 1985.
[7] F. de la Torre and M. J. Black. Robust principal component analysis for computer vision. In 8th International Conference on Computer Vision, volume I, pages 362–369, Vancouver, Canada, July 2001.
[8] K. R. Gabriel and S. Zamir. Lower rank approximation of matrices by least squares with any choice of weights. Technometrics, 21:489–498, 1979.
[9] M. J. Greenacre. Theory and Applications of Correspondence Analysis. Academic Press : London,
1984.
[10] H. Hotelling. Analysis of a complex of statistical variables into principal components. Journal of
Educational Psychology, 24:417–441, 1933.
[11] P. J. Huber. Robust Statistics. New York:Wiley, first edition, 1981.
[12] G. Li and Z. Chen. Projection-pursuit approach to robust dispersion matrices and principal compo-
nents: Primary theory and monte carlo. J. of Amer. Stat. Assoc., 80(391):759–766, 1985.
[13] L. Xu and A. L. Yuille. Robust principal component analysis by self-organizing rules based on statistical physics approach. IEEE Trans. Neural Networks, 6(1):131–143, 1995.
[15] E. Oja. A simplified neuron model as a principal component analyzer. J. Math. Biol., 16:267–273,
1982.
19
[16] E. Oja and J. Karhunen. On stochastic approximation of eigenvectors and eigenvalues of the ex-
pectation of a random matrix. J. Math. Anal. Appl., 106:69–84, 1985.
[17] K. Pearson. On lines and planes of closest fit to systems of points in space. The London, Edinburgh and Dublin Philosophical Magazine and Journal of Science, 6:559–572, 1901.
[18] S. Roweis. EM algorithms for PCA and SPCA. In Neural Information Processing Systems, pages 626–632, 1997.
[19] F. H. Ruymgaart. A robust principal component analysis. Journal of Multivariate Analysis, 11:485–497, 1981.
[20] T. F. Cootes, G. J. Edwards, and C. J. Taylor. Active appearance models. In Proc. European Conf. on Computer Vision, volume I, pages 484–498, 1998.
[21] M. E. Tipping and C. M. Bishop. Probabilistic principal component analysis. Technical Report
NCRG/97/010, Microsoft Research, September 1999.
[22] T. Wiberg. Computation of principal components when data is missing. In Proc. Second Symp.
Computational Statistics, pages 229–236, 1976.
[23] L. Xu. Least mean square error reconstruction for self-organizing neural nets. Neural Networks,
6:627–648, 1993.
Figure 2: The EM based PCA for 40 dimensional data. (a) Input data; (b) Two principal axes found by the standard PCA; (c) The reconstructed signals by the standard PCA; (d) Two principal axes found in the first iteration by EM based PCA; (e) The reconstructed signals in the first iteration; (f) Two principal axes found in the fourth iteration; (g) The reconstructed signals in the fourth iteration.
Figure 3: PCA for the incomplete data set. (a) Input data, some pixels are missing; (b) Two principal axes found in the third iteration; (c) The reconstructed signals in the third iteration; (d) Two principal axes found in the fifth iteration; (e) The reconstructed signals in the fifth iteration; (f) Two principal axes found in the seventh iteration; (g) The reconstructed signals in the seventh iteration.
Figure 4: Robust PCA. (a) Input data; (b) Two principal axes found by standard PCA; (c) The reconstructed signals by standard PCA; (d) Two principal axes found by robust PCA; (e) The reconstructed signals by robust PCA.
Figure 5: Robust PCA for the image data. (a) Some of the original data; (b) PCA reconstruction; (c) Robust PCA reconstruction.