An optimization criterion for generalized
discriminant analysis on undersampled problems
Jieping Ye, Ravi Janardan, Cheong Hee Park, and Haesun Park
Abstract
An optimization criterion is presented for discriminant analysis. The criterion extends the optimization criteria
of the classical Linear Discriminant Analysis (LDA) through the use of the pseudo-inverse when the scatter
matrices are singular. It is applicable regardless of the relative sizes of the data dimension and sample size,
overcoming a limitation of classical LDA. The optimization problem can be solved analytically by applying the
Generalized Singular Value Decomposition (GSVD) technique. The pseudo-inverse has been suggested and used
for undersampled problems in the past, where the data dimension exceeds the number of data points. The criterion
proposed in this paper provides a theoretical justification for this procedure.
An approximation algorithm for the GSVD-based approach is also presented. It reduces the computational
complexity by finding sub-clusters of each cluster, and uses their centroids to capture the structure of each cluster.
This reduced problem yields much smaller matrices to which the GSVD can be applied efficiently. Experiments
on text data, with up to 7000 dimensions, show that the approximation algorithm produces results that are close
to those produced by the exact algorithm.
Index Terms
Classification, clustering, dimension reduction, generalized singular value decomposition, linear discriminant
analysis, text mining.
J. Ye, R. Janardan, C. Park, and H. Park are with the Department of Computer Science and Engineering, University of Minnesota–Twin Cities, Minneapolis, MN 55455, U.S.A. Email: {jieping, janardan, chpark, hpark}@cs.umn.edu. Corresponding author: Jieping Ye; Tel.: 612-626-7504; Fax: 612-625-0572.
$\ge \sum_{i=1}^{q} \beta_i^2/\alpha_i^2$. The first inequality follows, since $G_2 G_2^T$ is positive semi-definite, and the equality holds if $G_2 = 0$. The second inequality follows from Eq. (14) and Lemma 3.4, and the equality holds if and only if

$$G_1 = (\Lambda, \; 0)\, P \in \mathbb{R}^{q \times \ell} \qquad (16)$$

for some orthogonal matrix $P \in \mathbb{R}^{\ell \times \ell}$ and some diagonal matrix $\Lambda \in \mathbb{R}^{q \times q}$ with positive diagonal entries. In particular, the equality holds if both $\Lambda$ and $P$ are identity matrices.
By Eq. (14) and Eq. (15), we have

$$F_1(G) = \operatorname{trace}\big((G^T S_b G)^{+}\,(G^T S_w G)\big) \;\ge\; \sum_{i=1}^{q} \frac{\beta_i^2}{\alpha_i^2}. \qquad \blacksquare$$
Theorem 3.1 gives a lower bound on $F_1$ when $q$ is fixed. From the arguments above, all the inequalities become equalities if $G_1$ has the form in Eq. (16) and $G_2 = 0$. A simple choice is to set $\Lambda$ and $P$ in Eq. (16) to be identity matrices, and this choice is used in our implementation later. The result is summarized in the following corollary.
Corollary 3.1: Let $G$ and $X$ be defined as in Theorem 3.1 and let $q = \operatorname{rank}(G^T H_b)$. Then the equality in Eq. (13) holds if the partition $X^{-1} G = (G_1^T, G_2^T, G_3^T)^T$ satisfies $G_1 = (I_q, \; 0)$ and $G_{23} = 0$, where $G_{23} = (G_2^T, G_3^T)^T$.
Theorem 3.1 does not mention the connection between $\ell$, the row dimension of the transformation matrix $G^T$, and $q$. We can choose a transformation matrix $G^T$ that has a large row dimension $\ell$ and still satisfies the condition stated in Corollary 3.1. However, we are more interested in a lower-dimensional representation of the original data that keeps the same information as the original data.

The following corollary says that the smallest possible value for $\ell$ is $q$, and, more importantly, that we can find a transformation $G^T$ that has row dimension $q$ and satisfies the condition stated in Corollary 3.1.

Corollary 3.2: For every $q \le k - 1$, there exists a transformation $G^T \in \mathbb{R}^{q \times m}$ such that the equality in Eq. (13) holds, i.e., the minimum value for $F_1$ is attained. Furthermore, for any transformation $(G')^T \in \mathbb{R}^{\ell \times m}$ such that the assumption in Theorem 3.1 holds, we have $\ell \ge q$.
Proof: Construct $G^T \in \mathbb{R}^{q \times m}$ such that the partition

$$X^{-1} G = (G_1^T, G_2^T, G_3^T)^T$$

satisfies $G_1 = I_q$, $G_2 = 0$, and $G_3 = 0$, where $I_q \in \mathbb{R}^{q \times q}$ is an identity matrix. Hence $G = X_q$, where $X_q \in \mathbb{R}^{m \times q}$ contains the first $q$ columns of the matrix $X$. By Corollary 3.1, the equality in Eq. (13) holds under the above transformation $G^T$. This completes the proof of the first part.

For any transformation $(G')^T \in \mathbb{R}^{\ell \times m}$ such that $\operatorname{rank}\big((G')^T H_b\big) = q$, it is clear that $q \le \operatorname{rank}\big((G')^T\big) \le \ell$. Hence $\ell \ge q$. $\blacksquare$
Theorem 3.1 shows that the minimum value of the objective function $F_1$ depends on the rank of the matrix $G^T H_b$. As shown in Lemma 3.1, the rank $s$ of the matrix $H_b$, which is readily available from the GSVD, denotes the degree of linear independence of the $k$ centroids in the original data, while the rank $q$ of the matrix $G^T H_b$ denotes, again by Lemma 3.1, the degree of linear independence of the $k$ centroids in the reduced space. By Corollary 3.2, if we fix the degree of linear independence of the centroids in the reduced space, i.e., if $q = \operatorname{rank}(G^T H_b)$ is fixed, then we can always find a transformation matrix $G^T \in \mathbb{R}^{q \times m}$ with row dimension $q$ such that the minimum value for $F_1$ is attained.
Next we consider the case when $q$ varies. From the result in Theorem 3.1, the minimum value of the objective function $F_1$ is smaller for smaller $q$. However, as the value of $q$ decreases (note that $q$ is always no larger than $k - 1$), the degree of linear independence of the $k$ centroids in the reduced space becomes much lower than in the original high-dimensional space, which may lead to a loss of information from the original data. Hence, we choose $q = \operatorname{rank}(H_b)$, its maximum possible value, in our implementation. One nice property of this choice is stated in the following proposition:
Proposition 3.1: If $q = \operatorname{rank}(G^T H_b)$ equals $\operatorname{rank}(H_b)$, then $\operatorname{rank}(C) - 1 \le \operatorname{rank}(G^T C) \le \operatorname{rank}(C)$, where $C$ is the matrix whose columns are the $k$ centroids.

Proof: The proof follows directly from Lemma 3.1. $\blacksquare$
Proposition 3.1 implies that choosing $q = \operatorname{rank}(H_b)$ keeps the degree of linear independence of the centroids in the reduced space the same as, or one less than, that in the original space. With the above choice, the reduced dimension under the transformation $G^T$ is $\operatorname{rank}(H_b)$. We use $q^* = \operatorname{rank}(H_b)$ to denote the optimal reduced dimension for our generalized discriminant analysis (also called the exact algorithm) throughout the rest of the paper. In this case, we can choose the transformation matrix $G^T$ as in Corollary 3.2; that is, $G = X_{q^*}$, where $X_{q^*}$ contains the first $q^*$ columns of the matrix $X$.
Algorithm 1: Exact algorithm
1. Form the matrices $H_b$ and $H_w$ as in Eq. (2).
2. Compute the SVD of $K = (H_b, H_w)^T$ as $K = P \begin{pmatrix} R & 0 \\ 0 & 0 \end{pmatrix} Q^T$, where $P$ and $Q$ are orthogonal and $R$ is diagonal.
3. $t \leftarrow \operatorname{rank}(K)$; $q^* \leftarrow \operatorname{rank}(H_b)$.
4. Compute the SVD of $P(1{:}k, 1{:}t)$, the sub-matrix consisting of the first $k$ rows and the first $t$ columns of the matrix $P$ from Step 2, as $P(1{:}k, 1{:}t) = U \Sigma_b W^T$, where $U$ and $W$ are orthogonal and $\Sigma_b$ is diagonal.
5. $X \leftarrow Q \begin{pmatrix} R^{-1} W & 0 \\ 0 & I \end{pmatrix}$.
6. $G \leftarrow X_{q^*}$, where $X_{q^*}$ contains the first $q^*$ columns of the matrix $X$.

The pseudo-code for our main algorithm is shown in Algorithm 1, where Lines 2–5 compute the GSVD of the matrix pair $(H_b^T, H_w^T)$ to obtain the matrix $X$ as in Eq. (6).
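To make the steps concrete, the following NumPy sketch implements Algorithm 1 under our assumptions: the function name exact_lda_gsvd is ours, the rank tolerance is an implementation choice, and $H_b$ and $H_w$ are formed from the usual weighted-centroid and within-cluster definitions, since Eq. (2) is not reproduced here.

import numpy as np

def exact_lda_gsvd(A, labels, tol=1e-10):
    # A: m x n term-document matrix; labels: length-n cluster assignment.
    # Returns G = X_{q*} (m x q*) and the full GSVD factor X.
    m, n = A.shape
    c = A.mean(axis=1)                                  # global centroid
    # Step 1: form H_b and H_w (assumed definitions; Eq. (2) is not shown here).
    classes = np.unique(labels)
    Hb = np.column_stack([np.sqrt(np.sum(labels == j)) *
                          (A[:, labels == j].mean(axis=1) - c)
                          for j in classes])            # m x k
    Hw = np.column_stack([A[:, labels == j] -
                          A[:, labels == j].mean(axis=1, keepdims=True)
                          for j in classes])            # m x n
    k = Hb.shape[1]
    # Step 2: SVD of K = (H_b, H_w)^T.
    K = np.vstack([Hb.T, Hw.T])                         # (k + n) x m
    P, svals, QT = np.linalg.svd(K)
    # Step 3: t = rank(K), q* = rank(H_b).
    t = int(np.sum(svals > tol * svals[0]))
    q_star = np.linalg.matrix_rank(Hb)
    # Step 4: SVD of the k x t upper-left block of P gives the orthogonal W.
    _, _, WT = np.linalg.svd(P[:k, :t])
    # Step 5: X = Q [[R^{-1} W, 0], [0, I]].
    X = np.empty((m, m))
    Q, R = QT.T, svals[:t]
    X[:, :t] = Q[:, :t] @ (WT.T / R[:, None])           # Q_1 (R^{-1} W)
    X[:, t:] = Q[:, t:]
    # Step 6: G is the first q* columns of X.
    return X[:, :q_star], X

The sketch works on dense NumPy arrays for clarity; for the text matrices of Section V one would typically keep the data sparse until the SVD step.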
In principle, it is not necessary to choose the reduced dimension to be exactly $q^*$. Experiments show good performance for values of the reduced dimension close to $q^*$. However, the results also show that the performance can be very poor for small values of the reduced dimension, probably due to information loss during dimension reduction.
D. Relation between the exact algorithm and the pseudo-inverse based LDA

As discussed in the Introduction, the Pseudo Fisher Linear Discriminant (PFLDA) method [8], [10], [23], [26], [27] applies the pseudo-inverse to the scatter matrices. Its solution can be obtained by solving the eigenvalue problem on the matrix $S_t^{+} S_b$. The exact algorithm proposed in this paper is based on the criterion $F_1$, as defined in Eq. (5), which extends the criterion of classical LDA by replacing the inverse with the pseudo-inverse. Both PFLDA and the exact algorithm apply the pseudo-inverse to deal with the singularity of the scatter matrices on undersampled problems. A natural question is: what is the relation between the two methods? We show in this section that the solution from the exact algorithm also solves the eigenvalue problem on the matrix $S_t^{+} S_b$; that is, the two methods are essentially equivalent. The significance of this equivalence result is that the criterion proposed in this paper provides a theoretical justification for the pseudo-inverse based LDA, which was used in an ad hoc manner in the past.
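To make the comparison concrete, the two formulations can be set side by side; this is our paraphrase, using the notation above and assuming Eq. (5) has the pseudo-inverse trace form described in the Introduction:

$$\text{exact algorithm: } \min_{G} \; F_1(G) = \operatorname{trace}\big((G^T S_b G)^{+}\,(G^T S_w G)\big), \qquad \text{PFLDA: } S_t^{+} S_b\, x = \lambda x, \ \ \lambda \neq 0.$$

Theorem 3.2 below shows that the columns of the optimal $G$ for the left-hand problem are eigenvectors of the right-hand problem, which is the claimed equivalence.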
The following lemma is essential to show the relation between the exact algorithm and the PFLDA.
Lemma 3.5: Let $X$ be the matrix obtained by computing the GSVD of the matrix pair $(H_b^T, H_w^T)$ (see Line 5 of Algorithm 1). Denote

$$D = \begin{pmatrix} I_t & 0 \\ 0 & 0 \end{pmatrix} \in \mathbb{R}^{m \times m}.$$

Then $S_t^{+} = X D X^T$, where $S_t$ is the total scatter matrix.

Proof: By Eq. (10), we have $X^T S_t X = D$, i.e., $S_t = X^{-T} D X^{-1}$. From Line 5 of Algorithm 1, $X$ has the form

$$X = Q \begin{pmatrix} R^{-1} W & 0 \\ 0 & I \end{pmatrix},$$

where $Q$ and $W$ are orthogonal and $R$ is diagonal. Then

$$S_t = X^{-T} D X^{-1} = Q \begin{pmatrix} R W & 0 \\ 0 & I \end{pmatrix} D \begin{pmatrix} W^T R & 0 \\ 0 & I \end{pmatrix} Q^T = Q \begin{pmatrix} R^2 & 0 \\ 0 & 0 \end{pmatrix} Q^T.$$

It follows from the definition of the pseudo-inverse that

$$S_t^{+} = Q \begin{pmatrix} R^{-2} & 0 \\ 0 & 0 \end{pmatrix} Q^T.$$

It is easy to check that

$$X D X^T = Q \begin{pmatrix} R^{-2} & 0 \\ 0 & 0 \end{pmatrix} Q^T.$$

Hence $S_t^{+} = X D X^T$. $\blacksquare$
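The identity can also be checked numerically. The following self-contained NumPy snippet, which is ours and not from the paper, builds a matrix $X$ of the form produced by Line 5 of Algorithm 1 from random orthogonal factors and verifies that pinv(S_t) equals X D X^T:

import numpy as np

rng = np.random.default_rng(0)
m, t = 8, 5
Q, _ = np.linalg.qr(rng.standard_normal((m, m)))   # random orthogonal Q
W, _ = np.linalg.qr(rng.standard_normal((t, t)))   # random orthogonal W
r = rng.uniform(0.5, 2.0, t)                       # diag(R), positive entries

X = np.zeros((m, m))
X[:, :t] = Q[:, :t] @ (W / r[:, None])             # Q_1 (R^{-1} W)
X[:, t:] = Q[:, t:]

D = np.diag(np.r_[np.ones(t), np.zeros(m - t)])    # D = diag(I_t, 0)
St = Q[:, :t] @ np.diag(r**2) @ Q[:, :t].T         # S_t = Q diag(R^2, 0) Q^T

assert np.allclose(np.linalg.pinv(St), X @ D @ X.T)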
Theorem 3.2: Let $X$ be the matrix as in Lemma 3.5. Then the first $q^*$ columns of $X$ solve the following eigenvalue problem: $S_t^{+} S_b\, x = \lambda x$, for $\lambda \neq 0$.

Proof: From Eq. (7), we have $X^T S_b X = D_b$, where $D_b = \operatorname{diag}(\alpha_1^2, \ldots, \alpha_t^2, 0, \ldots, 0)$.
Hence the total complexity of our approximation algorithm is $O(mnk_0 + m(k_0 k)^2)$. Since $k_0$ is usually chosen to be a small number and $k$ is much smaller than $n$, the complexity simplifies to $O(mn)$.

Table II lists the complexity of the different methods discussed above. The K-NN algorithm is used for querying the data.
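Since Section IV itself is not reproduced in this excerpt, the following scikit-learn/NumPy sketch only follows the description above: each cluster is replaced by the centroids of its $k_0$ sub-clusters found by K-means, and the exact algorithm is applied to the resulting small centroid matrix. The function name approx_lda_gsvd, the reuse of the exact_lda_gsvd sketch from Section III, and the choice to carry the cluster label of each centroid are our assumptions; each cluster is assumed to contain at least $k_0$ documents.

import numpy as np
from sklearn.cluster import KMeans

def approx_lda_gsvd(A, labels, k0=3, seed=0):
    # A: m x n data matrix; labels: length-n cluster assignment.
    cols, lab = [], []
    for j in np.unique(labels):
        Aj = A[:, labels == j]                        # documents of cluster j
        km = KMeans(n_clusters=k0, init="random",     # random initial centroids
                    n_init=1, random_state=seed).fit(Aj.T)
        cols.append(km.cluster_centers_.T)            # m x k0 sub-cluster centroids
        lab.extend([j] * k0)
    C = np.hstack(cols)                               # m x (k * k0) reduced problem
    return exact_lda_gsvd(C, np.array(lab))           # GSVD on much smaller matrices

The GSVD in the last line now operates on a matrix with only $k \cdot k_0$ columns instead of $n$, which is the source of the complexity reduction discussed above.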
TABLE III
SUMMARY OF DATASETS USED FOR EVALUATION

                            Dataset 1   Dataset 2       Dataset 3
  Source                    TREC        Reuters-21578   Reuters-21578
  number of documents (n)   210         320             490
  number of terms (m)       7454        2887            3759
  number of clusters (k)    7           4               5
V. EXPERIMENTAL RESULTS
A. Datasets
In the following experiments, we use three different datasets¹, summarized in Table III. For all datasets, we use a stop-list to remove common words, and the words are stemmed using Porter's suffix-stripping algorithm [22]. Moreover, any term that occurs in fewer than two documents was eliminated, as in [32]. Dataset 1 is derived from the TREC-5, TREC-6, and TREC-7 collections [30]. It consists of 210 documents in a space of dimension 7454, with 7 clusters. Each cluster has 30 documents. Datasets 2 and 3 are
from Reuters-21578 text categorization test collection Distribution 1.0 [19]. Dataset 2 contains 4 clusters,
with each containing 80 elements; the dimension of the document set is 2887. Dataset 3 has 5 clusters,
each with 98 elements; the dimension of the document set is 3759.
For all the examples, we use the tf-idf weighting scheme [24], [32] for encoding the document collection
with a term-document matrix.
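For concreteness, the preprocessing just described can be sketched as follows, with scikit-learn and NLTK as stand-ins; the paper's exact stop-list and tf-idf variant [24], [32] may differ from these defaults:

import re
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer = PorterStemmer()

def stem_tokens(doc):
    # lowercase, tokenize on letters, then apply Porter's suffix-stripping
    return [stemmer.stem(tok) for tok in re.findall(r"[a-z]+", doc.lower())]

vectorizer = TfidfVectorizer(
    tokenizer=stem_tokens,       # custom tokenizer with stemming
    stop_words="english",        # a generic stop-list
    min_df=2,                    # drop terms occurring in < 2 documents
)
# docs: list of raw document strings;
# A = vectorizer.fit_transform(docs).T gives the (terms x documents) matrix.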
B. Experimental methodology
To evaluate the methods proposed in this paper, we compared them with three other dimension reduction methods, LSI, RLDA, and two-stage LSI+LDA, on the three datasets in Table III. The K-Nearest Neighbor algorithm [7] (for K = 1, 7, 15) was applied to evaluate the quality of the different dimension reduction algorithms, as in [13]. For each method, we applied 10-fold cross-validation to compute the misclassification rate and the standard deviation. For LSI and two-stage LSI+LDA, the results depend on the intermediate reduced dimension d chosen by LSI; for each of LSI and LSI+LDA we report results for a choice of d that produced good overall results in our experiments. Similarly, RLDA depends on the choice of its regularization parameter λ. We ran RLDA with different choices of λ and report results for a value that produced good overall results on the three datasets. We also applied K-NN on the original high-dimensional space without the dimension reduction step, which we named "FULL" in the following experiments.

¹These datasets are available at www.cs.umn.edu/~jieping/Research.html.
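As an illustration of this protocol, here is a short scikit-learn sketch; the function name knn_error and the shuffling seed are our choices, and Y stands for the reduced representation (e.g., Y = (G^T A)^T with G from the exact or approximation algorithm):

import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def knn_error(Y, labels, K, folds=10, seed=0):
    # Y: n x d reduced data (rows = documents); returns mean error and std
    # over a 10-fold cross-validation of K-NN classification.
    cv = KFold(n_splits=folds, shuffle=True, random_state=seed)
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=K), Y, labels, cv=cv)
    return 1.0 - acc.mean(), acc.std()

# for K in (1, 7, 15):
#     err, sd = knn_error(Y, labels, K)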
The clustering by K-Means in the approximation algorithm is sensitive to the choice of the initial centroids. To mitigate this, we ran the algorithm ten times, with the initial centroids for each run generated randomly. The final result is the average over the ten different runs.
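A sketch of this protocol, reusing the approx_lda_gsvd and knn_error sketches from earlier (the seeding of the K-means initialization below is our stand-in for the paper's randomly generated initial centroids):

import numpy as np

def averaged_error(A, labels, k0=3, K=7, runs=10):
    # Average the cross-validated K-NN error of the approximation algorithm
    # over `runs` different random K-means initializations.
    errs = []
    for seed in range(runs):
        G, _ = approx_lda_gsvd(A, labels, k0=k0, seed=seed)
        err, _ = knn_error((G.T @ A).T, labels, K)
        errs.append(err)
    return float(np.mean(errs))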
C. Results
Fig. 1. Effect of the value of $k_0$ on the approximation algorithm using Datasets 1–3. The x-axis denotes the value of $k_0$ (from 2 to 30); the y-axis denotes the misclassification rate.
1) Effect of the value of $k_0$ on the approximation algorithm: In Section IV, we use $k_0$ to denote the number of sub-clusters per cluster in the approximation algorithm. As mentioned in Section IV, our approximation algorithm works well for small values of $k_0$. We tested our approximation algorithm on Datasets 1–3 for different values of $k_0$, ranging from 2 to 30, and computed the misclassification rates; K-nearest-neighbor classification was used. As seen from Figure 1, the misclassification rates did not fluctuate very much within this range. For efficiency, we would like $k_0$ to be small. Therefore, in our experiments, we chose a small value of $k_0$, as the approximation algorithm performed very well on all three datasets in the vicinity of such values. Indeed, as shown in Figure 1, the performance of the approximation algorithm is generally insensitive to the choice of $k_0$.

2) Comparison of misclassification rates: In the following experiment, we evaluated our exact and
approximation algorithms and compared them with competing algorithms (LSI, RLDA, LSI+LDA) based
on the misclassification rates, using the three datasets in Table III.
Fig. 2. Performance of different dimension reduction methods on Dataset 1. The x-axis shows the value of K in K-Nearest Neighbors (1, 7, 15); the y-axis shows the misclassification rate for FULL, LSI, RLDA, LSI+LDA, EXACT, and APPR.
Fig. 3. Performance of different dimension reduction methods on Dataset 2. The x-axis shows the value of K in K-Nearest Neighbors (1, 7, 15); the y-axis shows the misclassification rate for FULL, LSI, RLDA, LSI+LDA, EXACT, and APPR.
The results for Datasets 1–3 are summarized in Figures 2–4, respectively, where the x-axis shows the three different choices of K (K = 1, 7, 15) used in K-NN for classification, and the y-axis shows the misclassification rate. "EXACT" and "APPR" refer to the exact and approximation algorithms, respectively. We also report the results on the original space without dimension reduction, denoted as "FULL" in Figures 2–4. The corresponding standard deviations of the different methods on Datasets 1–3 are summarized in Table IV. For each dataset, the results for the different numbers, K = 1, 7, 15, of neighbors used in K-NN are reported.
Fig. 4. Performance of different dimension reduction methods on Dataset 3. The x-axis shows the value of K in K-Nearest Neighbors (1, 7, 15); the y-axis shows the misclassification rate for FULL, LSI, RLDA, LSI+LDA, EXACT, and APPR.
TABLE IV
STANDARD DEVIATIONS OF DIFFERENT METHODS ON DATASETS 1–3, EXPRESSED AS A PERCENTAGE