Nonlinear Feature Extraction based on Centroids and Kernel Functions

Cheong Hee Park and Haesun Park
Dept. of Computer Science and Engineering, University of Minnesota, Minneapolis, MN 55455

December 19, 2002; revised June 10, 2003

This work was supported in part by the National Science Foundation grants CCR-9901992 and CCR-0204109. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation (NSF).

Abstract

A nonlinear feature extraction method is presented which can reduce the data dimension down to the number of clusters, providing dramatic savings in computational costs. The dimension reducing nonlinear transformation is obtained by implicitly mapping the input data into a feature space using a kernel function, and then finding a linear mapping based on an orthonormal basis of centroids in the feature space that maximally separates the between-cluster relationship. The experimental results demonstrate that our method is capable of extracting nonlinear features effectively, so that competitive classification performance can be obtained with linear classifiers in the dimension reduced space.

Keywords: cluster structure, dimension reduction, kernel functions, Kernel Orthogonal Centroid (KOC) method, linear discriminant analysis, nonlinear feature extraction, pattern classification, support vector machines

1 Introduction

Dimension reduction in data analysis is an important preprocessing step for speeding up the main tasks and reducing the effect of noise. Nowadays, as the amount of data grows larger, extracting the right features is not only a useful preprocessing step but becomes necessary for efficient and effective processing, especially of high dimensional data. Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) are two of the most commonly used dimension reduction methods. These methods search for optimal directions for the projection of input data onto a lower dimensional space [1, 2, 3]. While PCA finds the direction along which the data scatter is greatest, LDA searches for the direction which maximizes the
between-cluster scatter and minimizes the within-cluster scatter. However, these methods have limitations for data which are not linearly separable, since it is difficult to capture a nonlinear relationship with a linear mapping. In order to overcome such a limitation, nonlinear extensions of these methods have been proposed [4, 5, 6].
One way to obtain a nonlinear extension is to lift the input space to a higher dimensional feature space by a nonlinear feature mapping and then to find a linear dimension reduction in the feature space. It is well known that kernel functions allow such nonlinear extensions without explicitly forming a nonlinear mapping or a feature space, as long as the problem formulation involves only the inner products between the data points and never the data points themselves [7, 8, 9]. The remarkable success of support vector machine learning is an example of the effective use of kernel functions [10, 11, 12, 13]. The kernel Principal Component Analysis (kernel PCA) [14] and the generalized Discriminant Analysis [15, 16, 17, 18] have recently been introduced as nonlinear generalizations of PCA and LDA by kernel functions, respectively, and some interesting experimental results have been presented. However, PCA and LDA require solutions from the singular value decomposition and the generalized eigenvalue problem, respectively. In general, these decompositions are expensive to compute when the data dimension is high. In addition, the dimension reduction from PCA does not reflect the clustered structure in the data well.
The Orthogonal Centroid method [19] is a linear dimension reduction method based on an orthonormal basis of the cluster centroids, and the cluster structure of the given data is preserved in the reduced space obtained with this transformation. However, when the data is not linearly separable, structure preserving dimension reduction may not be optimal for a classification problem. By considering that nonlinear mappings by kernel functions can transform the original data to a linearly separable structure in a higher dimensional space, we aim to obtain both computational efficiency and classification capability.
In this paper, we apply the centroid-based orthogonal transformation, the Orthogonal Centroid algorithm, to the data transformed by a kernel-based nonlinear mapping and show that it can extract nonlinear features effectively, thus reducing the data dimension down to the number of clusters and saving computational costs. In Section 2, we briefly review the Orthogonal Centroid method, which is a dimension reduction method based on an orthonormal basis for the centroids. In Section 3, we derive the new Kernel Orthogonal Centroid method, extending the Orthogonal Centroid method using kernel functions to handle nonlinear feature extraction, and analyze the computational complexity of our new method. Our experimental results presented in Section 4 demonstrate that the new nonlinear Orthogonal Centroid method is capable of extracting nonlinear features effectively, so that competitive classification performance can be obtained with linear classifiers after nonlinear dimension reduction. In addition, it is shown that once we obtain a lower dimensional representation, a linear soft margin Support Vector Machine (SVM) is able to achieve high classification accuracy with a much smaller number of support vectors, thus reducing prediction costs as well.
2 Orthogonal Centroid Method
Given a vector space representation $A = [a_1, a_2, \ldots, a_n] \in \mathbb{R}^{m \times n}$ of a data set of $n$ vectors in an $m$-dimensional space, dimension reduction by linear transformation is to find $G^T \in \mathbb{R}^{p \times m}$ that maps a vector $x \in \mathbb{R}^{m \times 1}$ to a vector $\hat{x} \in \mathbb{R}^{p \times 1}$ for $p \ll m$,
$$G^T : x \in \mathbb{R}^{m \times 1} \;\rightarrow\; \hat{x} = G^T x \in \mathbb{R}^{p \times 1}. \qquad (1)$$
In particular, we seek a dimension reducing transformation $G^T$ with which the cluster structure existing in the given data $A$ is preserved in the reduced dimensional space. Eqn. (1) can be rephrased as finding a rank reducing approximation of $A$ such that
$$\min_{B, Y} \| A - B Y \|_F \quad \text{where } B \in \mathbb{R}^{m \times p} \text{ and } Y \in \mathbb{R}^{p \times n}. \qquad (2)$$
For simplicity, we assume that the data matrix $A$ is clustered into $k$ clusters as
$$A = [A_1, A_2, \ldots, A_k] \quad \text{where} \quad A_i \in \mathbb{R}^{m \times n_i} \quad \text{and} \quad \sum_{i=1}^{k} n_i = n. \qquad (3)$$
Let $N_i$ denote the set of column indices that belong to the cluster $i$. The centroid $c_i$ of each cluster $A_i$ is the average of the columns in $A_i$, i.e.,
$$c_i = \frac{1}{n_i} A_i e_i \quad \text{where} \quad e_i = [1, \ldots, 1]^T \in \mathbb{R}^{n_i \times 1}, \qquad (4)$$
and the global centroid $c$ is defined as
$$c = \frac{1}{n} \sum_{j=1}^{n} a_j. \qquad (5)$$
The centroid of each cluster is the vector that minimizes the sum of squared distances to the vectors within the cluster. That is, the centroid vector $c_i$ gives the smallest distance in the Frobenius norm between the matrix $A_i$ and any rank one approximation $x e_i^T$:
$$\| A_i - c_i e_i^T \|_F^2 = \sum_{j \in N_i} \| a_j - c_i \|_2^2 = \min_{x \in \mathbb{R}^{m \times 1}} \sum_{j \in N_i} \| a_j - x \|_2^2 = \min_{x \in \mathbb{R}^{m \times 1}} \| A_i - x e_i^T \|_F^2. \qquad (6)$$
Taking the centroids as representatives of the clusters, we find an orthonormal basis of the space spanned by the centroids by computing an orthogonal decomposition
$$C = Q \begin{bmatrix} R \\ 0 \end{bmatrix} = [\, Q_k \;\; \tilde{Q} \,] \begin{bmatrix} R \\ 0 \end{bmatrix} = Q_k R \qquad (7)$$
of the centroid matrix $C = [c_1, \ldots, c_k] \in \mathbb{R}^{m \times k}$, where $Q \in \mathbb{R}^{m \times m}$ is an orthogonal matrix, $Q_k \in \mathbb{R}^{m \times k}$ consists of its first $k$ columns, and $R \in \mathbb{R}^{k \times k}$ is upper triangular.

Algorithm 1 Orthogonal Centroid method
Given a data matrix $A \in \mathbb{R}^{m \times n}$ with $k$ clusters and a data point $x \in \mathbb{R}^{m \times 1}$, it computes a matrix $Q_k \in \mathbb{R}^{m \times k}$ and gives a $k$-dimensional representation $\hat{x} = Q_k^T x \in \mathbb{R}^{k \times 1}$.
1. Compute the centroid $c_i$ of the $i$th cluster for $1 \le i \le k$.
2. Set the centroid matrix $C = [c_1, c_2, \ldots, c_k]$.
3. Compute an orthogonal decomposition of $C$, which is $C = Q_k R$.
4. $\hat{x} = Q_k^T x$ gives a $k$-dimensional representation of $x$.
The linear transformation obtained in the Orthogonal Centroid method solves a trace optimization problem, providing a link between the methods of linear discriminant analysis and those based on centroids [20]. Dimension reduction by Linear Discriminant Analysis searches for a linear transformation which maximizes the between-cluster scatter and minimizes the within-cluster scatter. The between-cluster scatter matrix is defined as
$$S_b = \sum_{i=1}^{k} \sum_{j \in N_i} (c_i - c)(c_i - c)^T = \sum_{i=1}^{k} n_i (c_i - c)(c_i - c)^T \qquad (8)$$
and
$$\operatorname{trace}(S_b) = \sum_{i=1}^{k} \sum_{j \in N_i} (c_i - c)^T (c_i - c) = \sum_{i=1}^{k} \sum_{j \in N_i} \| c_i - c \|_2^2. \qquad (9)$$
Let us consider a criterion that involves only the between-cluster scatter matrix, i.e., to find a dimension reducing transformation $G^T \in \mathbb{R}^{p \times m}$ such that the columns of $G$ are orthonormal and $\operatorname{trace}(G^T S_b G)$ is maximized. Note that $\operatorname{rank}(S_b)$ cannot exceed $k - 1$. Accordingly,
$$\operatorname{trace}(S_b) = \lambda_1 + \cdots + \lambda_{k-1}, \qquad (10)$$
where the $\lambda_i$'s, $1 \le i \le k - 1$, are the $k - 1$ largest eigenvalues of $S_b$. Denoting the corresponding eigenvectors as $u_i$'s, for any $p \ge k - 1$ and $U_p = [u_1, \ldots, u_p]$ we have
$$\operatorname{trace}(U_p^T S_b U_p) = \lambda_1 + \cdots + \lambda_p = \lambda_1 + \cdots + \lambda_{k-1}. \qquad (11)$$
In addition, for any $G \in \mathbb{R}^{m \times p}$ which has orthonormal columns,
$$\operatorname{trace}(G^T S_b G) \le \operatorname{trace}(S_b). \qquad (12)$$
Hence $\operatorname{trace}(G^T S_b G)$ is maximized when $G$ is chosen as $U_p$ for any $p \ge k - 1$, and
$$\operatorname{trace}(U_p^T S_b U_p) = \operatorname{trace}(S_b) \qquad (13)$$
according to Eqns. (10) and (11). By the orthogonal decomposition (7), for any scalars $\gamma_i \in \mathbb{R}$, $1 \le i \le k$, we have
$$\Big\| \sum_{i=1}^{k} \gamma_i c_i \Big\|_2^2 = \Big\| Q^T \Big( \sum_{i=1}^{k} \gamma_i c_i \Big) \Big\|_2^2 = \left\| \begin{bmatrix} Q_k^T \\ \tilde{Q}^T \end{bmatrix} \sum_{i=1}^{k} \gamma_i c_i \right\|_2^2 = \left\| \begin{bmatrix} Q_k^T \sum_{i=1}^{k} \gamma_i c_i \\ 0 \end{bmatrix} \right\|_2^2 = \Big\| Q_k^T \sum_{i=1}^{k} \gamma_i c_i \Big\|_2^2. \qquad (14)$$
Eqn. (14) implies that the transformation $Q_k^T$ preserves the length of any vector in the subspace spanned by the centroid vectors. Hence the between-cluster relationship represented by the centroids is preserved in the transformed space, and all the other items are projected onto the subspace spanned by the centroids, maintaining their relative distance relationships with the centroids. Therefore
$$\operatorname{trace}(S_b) = \sum_{i=1}^{k} \sum_{j \in N_i} \| c_i - c \|_2^2 = \sum_{i=1}^{k} \sum_{j \in N_i} \| Q_k^T c_i - Q_k^T c \|_2^2 = \operatorname{trace}(Q_k^T S_b Q_k). \qquad (15)$$
By computing an orthogonal decomposition of the centroid matrix we obtain a solution that maximizes $\operatorname{trace}(G^T S_b G)$ over all matrices $G \in \mathbb{R}^{m \times k}$ with orthonormal columns. Therefore, instead of computing the $k - 1$ leading eigenvectors of $S_b$, which give the solution of $\max_{G^T G = I_k} \operatorname{trace}(G^T S_b G)$, we simply need to compute $Q_k$, which is much less costly.
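For concreteness, the following is a minimal NumPy sketch of Algorithm 1 together with a numerical check of Eqn. (15); the data here are randomly generated for illustration only, and the function and variable names are ours, not the paper's.

import numpy as np

def orthogonal_centroid(A, labels):
    # A: m x n data matrix (columns are data points); labels: length-n cluster ids.
    # Returns Q_k (m x k), an orthonormal basis of the centroid space (Eqn. (7)).
    clusters = np.unique(labels)
    C = np.column_stack([A[:, labels == c].mean(axis=1) for c in clusters])  # Eqn. (4)
    Q_k, _ = np.linalg.qr(C)                                                 # Eqn. (7)
    return Q_k

rng = np.random.default_rng(0)
A = rng.standard_normal((10, 30)) + np.repeat(np.arange(3), 10) * 3.0  # 3 shifted clusters
labels = np.repeat([0, 1, 2], 10)
Q_k = orthogonal_centroid(A, labels)
A_reduced = Q_k.T @ A            # k-dimensional representation of every data point

# Check Eqn. (15): trace(S_b) equals trace(Q_k^T S_b Q_k).
c_global = A.mean(axis=1)
S_b = np.zeros((A.shape[0], A.shape[0]))
for c in np.unique(labels):
    d = (A[:, labels == c].mean(axis=1) - c_global)[:, None]
    S_b += (labels == c).sum() * (d @ d.T)                               # Eqn. (8)
print(np.trace(S_b), np.trace(Q_k.T @ S_b @ Q_k))                       # agree up to rounding error

Projecting with Q_k^T thus preserves the between-cluster scatter exactly while reducing each data vector from m to k components.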
3 Kernel Orthogonal Centroid Method
Although a linear hyperplane is a natural choice as a boundary to separate clusters, it has limitations for nonlinearly structured data. To overcome this limitation we map the input data to a feature space (possibly an infinite dimensional space) through a nonlinear feature mapping
$$\Phi : x \in \mathbb{R}^{m \times 1} \;\rightarrow\; \Phi(x) \in \mathcal{F} \subseteq \mathbb{R}^{f \times 1}, \qquad (16)$$
which transforms the input data into a linearly separable structure. Without knowing the feature mapping $\Phi$ or the feature space $\mathcal{F}$ explicitly, we can work in the feature space $\mathcal{F}$ through kernel functions, as long as the problem formulation depends only on the inner products between data points in $\mathcal{F}$ and not on the data points themselves. For any kernel function $\kappa$ satisfying Mercer's condition, there exists a reproducing kernel Hilbert space and a feature map $\Phi$ such that $\kappa(x, y) = \langle \Phi(x), \Phi(y) \rangle$. As in Eqn. (7), the centroid matrix $C$ of $\Phi(A) = [\Phi(a_1), \ldots, \Phi(a_n)]$ in the feature space has an orthogonal decomposition
$$C = Q_k R, \qquad (20)$$
where $Q_k$ has orthonormal columns and $R$ is a nonsingular upper triangular matrix [23]. We apply the Orthogonal Centroid algorithm to $\Phi(A)$ to reduce the data dimension to $k$, the number of clusters in the input data. Then for any data point $x \in \mathbb{R}^{m \times 1}$, the dimension reduced representation of $x$ in a $k$-dimensional space is given by $Q_k^T \Phi(x)$.

We now show how we can calculate $Q_k^T \Phi(x)$ without knowing $\Phi$ explicitly, i.e., without knowing $C$ explicitly. The centroid matrix $C$ in the feature space is
$$C = \Big[ \frac{1}{n_1} \sum_{i \in N_1} \Phi(a_i), \; \ldots, \; \frac{1}{n_k} \sum_{i \in N_k} \Phi(a_i) \Big] \in \mathbb{R}^{f \times k}. \qquad (21)$$
Hence
$$C^T C = E^T K E, \qquad (22)$$
where $K \in \mathbb{R}^{n \times n}$ is the kernel matrix with
$$K(i, j) = \kappa(a_i, a_j) = \langle \Phi(a_i), \Phi(a_j) \rangle \quad \text{for } 1 \le i, j \le n \qquad (23)$$
and
$$E^T = \begin{bmatrix} \frac{1}{n_1} \cdots \frac{1}{n_1} & 0 \cdots 0 & \cdots & 0 \cdots 0 \\ 0 \cdots 0 & \frac{1}{n_2} \cdots \frac{1}{n_2} & \cdots & 0 \cdots 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 \cdots 0 & 0 \cdots 0 & \cdots & \frac{1}{n_k} \cdots \frac{1}{n_k} \end{bmatrix} \in \mathbb{R}^{k \times n}. \qquad (24)$$
Since the kernel matrix $K$ is symmetric positive definite and the matrix $E$ has linearly independent columns, $C^T C$ is also symmetric positive definite. The Cholesky decomposition of $C^T C$ gives a nonsingular upper triangular matrix $R$ such that
$$C^T C = R^T R. \qquad (25)$$
Algorithm 2 Kernel Orthogonal Centroid method
Given a data matrix $A \in \mathbb{R}^{m \times n}$ with $k$ clusters and index sets $N_i$, $i = 1, \ldots, k$, which denote the sets of the column indices of the data in cluster $i$, and a kernel function $\kappa$, this algorithm computes the nonlinear dimension reduced representation $\hat{x} = Q_k^T \Phi(x) \in \mathbb{R}^{k \times 1}$ for any input vector $x \in \mathbb{R}^{m \times 1}$.
1. Formulate the kernel matrix $K$ based on the kernel function $\kappa$ as $K(i, j) = \kappa(a_i, a_j)$, $1 \le i, j \le n$.
2. Compute $C^T C = E^T K E$, where $E(i, j) = 1/n_j$ if $i \in N_j$ and $0$ otherwise, $1 \le i \le n$, $1 \le j \le k$.
3. Compute the Cholesky factor $R$ of $C^T C$: $C^T C = R^T R$.
4. The solution $\hat{x}$ of the linear system
$$R^T \hat{x} = \Big[ \frac{1}{n_1} \sum_{i \in N_1} \kappa(a_i, x), \; \ldots, \; \frac{1}{n_k} \sum_{i \in N_k} \kappa(a_i, x) \Big]^T$$
gives the $k$-dimensional representation of $x$.
Since
$$Q_k = C R^{-1} \qquad (26)$$
from (20), we have
$$Q_k^T \Phi(x) = (R^{-1})^T C^T \Phi(x) = (R^{-1})^T \Big[ \frac{1}{n_1} \sum_{i \in N_1} \kappa(a_i, x), \; \ldots, \; \frac{1}{n_k} \sum_{i \in N_k} \kappa(a_i, x) \Big]^T. \qquad (27)$$
Due to the assumption that the kernel function $\kappa$ is symmetric positive definite, the matrix $C^T C$ is symmetric positive definite and accordingly the centroid matrix $C$ has linearly independent columns. We summarize our algorithm, the Kernel Orthogonal Centroid (KOC) method, in Algorithm 2.
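As an illustration of Algorithm 2, the following NumPy/SciPy sketch builds the kernel matrix K and the grouping matrix E, forms C^T C = E^T K E, and maps a new point by solving the triangular system of Eqn. (28). The helper names (koc_fit, koc_transform) and the choice of SciPy routines are ours, not the paper's.

import numpy as np
from scipy.linalg import cholesky, solve_triangular

def gaussian_kernel(x, y, sigma=1.0):
    # Gaussian kernel used in the paper: kappa(x, y) = exp(-||x - y||^2 / sigma)
    return np.exp(-np.sum((x - y) ** 2) / sigma)

def koc_fit(A, labels, kernel):
    # A: m x n data matrix (columns are data points), labels: length-n cluster ids.
    n = A.shape[1]
    clusters = np.unique(labels)
    # Step 1: kernel matrix K(i, j) = kappa(a_i, a_j)                 (Eqn. (23))
    K = np.array([[kernel(A[:, i], A[:, j]) for j in range(n)] for i in range(n)])
    # Step 2: E(i, j) = 1/n_j if data point i is in cluster j, else 0 (Eqn. (24))
    E = np.zeros((n, len(clusters)))
    for j, c in enumerate(clusters):
        idx = labels == c
        E[idx, j] = 1.0 / idx.sum()
    CtC = E.T @ K @ E                                                 # Eqn. (22)
    # Step 3: Cholesky factor R with C^T C = R^T R                    (Eqn. (25))
    R = cholesky(CtC, lower=False)
    return (A, E, R, kernel)

def koc_transform(model, x):
    A, E, R, kernel = model
    kx = np.array([kernel(A[:, i], x) for i in range(A.shape[1])])
    rhs = E.T @ kx          # (1/n_j) * sum_{i in N_j} kappa(a_i, x) for each cluster j
    # Step 4: solve R^T x_hat = rhs                                    (Eqn. (28))
    return solve_triangular(R, rhs, trans='T')

The returned vector has k components, one per cluster, and equals Q_k^T Phi(x) of Eqn. (27) without ever forming Phi or Q_k explicitly.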
We now briefly discuss the computational complexity of the KOC algorithm, where one flop (floating point operation) represents roughly what is required to do one addition (or subtraction) and one multiplication (or division) [23]. We do not include the cost for evaluating the kernel functions $\kappa(a_i, a_j)$ and $\kappa(a_i, x)$, since this is required in any kernel-based method and the cost depends on the specific kernel function. In Algorithm 2, the computation of $C^T C = E^T K E$ takes $O(n^2)$ flops, since the kernel matrix $K$ is $n \times n$ and $E$ has only one nonzero entry per row. The Cholesky decomposition of $C^T C$ for obtaining the upper triangular matrix $R$ in (25) takes $O(k^3)$ flops, since $C^T C$ is $k \times k$, where $k$ is the number of clusters. Once we obtain the upper triangular matrix $R$, the lower dimensional representation $\hat{x} = Q_k^T \Phi(x)$ of a specific input $x$ can be computed without computing $R^{-1}$, but by solving the linear system
$$R^T \hat{x} = \Big[ \frac{1}{n_1} \sum_{i \in N_1} \kappa(a_i, x), \; \ldots, \; \frac{1}{n_k} \sum_{i \in N_k} \kappa(a_i, x) \Big]^T, \qquad (28)$$
which requires $O(k^2 + n)$ flops. Typically the number of clusters $k$ is much smaller than the total number of training samples $n$. Therefore, the complexity of nonlinear dimension reduction by the Kernel Orthogonal Centroid method presented in Algorithm 2 is $O(n^2)$. In contrast, kernel-based LDA or PCA needs to handle an eigenvalue problem of size $n \times n$, where $n$ is the number of training samples, which is more expensive to compute [14, 15, 16]. Therefore, the Kernel Orthogonal Centroid method is an efficient dimension reduction method that can reflect the nonlinear cluster relation in the reduced dimensional representation.
Alternatively, the dimension reduced representation $Q_k^T \Phi(x)$ given by the KOC method can be derived as follows. Represent the centroid matrix $C$ in the feature space, given by Eqn. (21), as
$$C = [\tilde{c}_1, \ldots, \tilde{c}_k] = \Big[ \sum_{i=1}^{n} \beta_i^1 \Phi(a_i), \; \ldots, \; \sum_{i=1}^{n} \beta_i^k \Phi(a_i) \Big], \qquad (29)$$
where $\beta_i^j$ is $1/n_j$ if $a_i$ belongs to the cluster $j$, and $\beta_i^j$ is $0$ otherwise. Now, consider the orthogonal decomposition
$$C = Q_k R \qquad (30)$$
of $C$. Since the columns of $Q_k$ can be represented as linear combinations of the columns of $C$, they in turn can be expressed as linear combinations of the vectors $\Phi(a_i)$, $i = 1, \ldots, n$, as
$$Q_k = [\tilde{q}_1, \ldots, \tilde{q}_k] = \Big[ \sum_{i=1}^{n} \alpha_i^1 \Phi(a_i), \; \ldots, \; \sum_{i=1}^{n} \alpha_i^k \Phi(a_i) \Big]. \qquad (31)$$
In order to compute $Q_k^T \Phi(x)$, we first show how to find the coefficients $\alpha_i^j$'s from the given $\beta_i^j$'s, where
$$C = \Big[ \sum_{i=1}^{n} \beta_i^1 \Phi(a_i), \; \ldots, \; \sum_{i=1}^{n} \beta_i^k \Phi(a_i) \Big] = \Big[ \sum_{i=1}^{n} \alpha_i^1 \Phi(a_i), \; \ldots, \; \sum_{i=1}^{n} \alpha_i^k \Phi(a_i) \Big] \begin{bmatrix} \tilde{r}_{11} & \cdots & \tilde{r}_{1k} \\ & \ddots & \vdots \\ 0 & & \tilde{r}_{kk} \end{bmatrix}. \qquad (32)$$
The inner products between the centroid vectors can be computed through the kernel matrix $K$ as
$$\langle \tilde{c}_t, \tilde{c}_s \rangle = \Big( \sum_{i=1}^{n} \beta_i^t \Phi(a_i) \Big)^T \Big( \sum_{i=1}^{n} \beta_i^s \Phi(a_i) \Big) = (\beta^t)^T K \beta^s, \qquad (33)$$
where $\beta^t = [\beta_1^t, \ldots, \beta_n^t]^T$. In addition, the vectors
$$\hat{q}_t = \frac{\tilde{c}_t}{\sqrt{\langle \tilde{c}_t, \tilde{c}_t \rangle}} \quad \text{and} \quad \hat{q}_s = \tilde{c}_s - \frac{\langle \tilde{c}_t, \tilde{c}_s \rangle}{\langle \tilde{c}_t, \tilde{c}_t \rangle} \tilde{c}_t, \qquad 1 \le t \le s \le k, \qquad (34)$$
are orthogonal vectors such that
$$\operatorname{span}\{ \hat{q}_t, \hat{q}_s \} = \operatorname{span}\{ \tilde{c}_t, \tilde{c}_s \}. \qquad (35)$$
From Eqns. (33) and (34), we can represent $\hat{q}_t$ and $\hat{q}_s$ as linear combinations of $\Phi(a_i)$, $i = 1, \ldots, n$. Based on these observations, we can apply the modified Gram-Schmidt method [23] to the centroid matrix $C$ to compute an orthonormal basis of the centroids, even though we only have an implicit representation of the centroids in the feature space. Once the orthonormal basis $Q_k$ is obtained, i.e., the coefficients $\alpha_i^t$'s of $\tilde{q}_t = \sum_{i=1}^{n} \alpha_i^t \Phi(a_i)$, $1 \le t \le k$, are found, then the reduced dimensional representation $Q_k^T \Phi(x)$ can be computed from
$$Q_k^T \Phi(x) = \Big[ \sum_{i=1}^{n} \alpha_i^1 \kappa(a_i, x), \; \ldots, \; \sum_{i=1}^{n} \alpha_i^k \kappa(a_i, x) \Big]^T. \qquad (36)$$

Algorithm 3 Kernel Orthogonal Centroid method (via modified Gram-Schmidt)
Given a data matrix $A \in \mathbb{R}^{m \times n}$ with $k$ clusters and a kernel function $\kappa$, this method computes the nonlinear dimension reduced representation $\hat{x} = Q_k^T \Phi(x) \in \mathbb{R}^{k \times 1}$ for any input vector $x \in \mathbb{R}^{m \times 1}$.
1. Define $\beta_i^j = 1/n_j$ if $a_i$ belongs to the cluster $j$, and $\beta_i^j = 0$ otherwise, for $1 \le i \le n$, $1 \le j \le k$.
2. Compute an orthogonal decomposition $C = Q_k R$ of the centroid matrix $C$ as in Eqn. (32) by the modified Gram-Schmidt procedure:
for $t = 1, \ldots, k$
  $\tilde{r}_{tt} = \sqrt{\langle \tilde{c}_t, \tilde{c}_t \rangle} = \sqrt{(\beta^t)^T K \beta^t}$
  $\alpha^t = \beta^t / \tilde{r}_{tt}$
  for $s = t + 1, \ldots, k$
    $\tilde{r}_{ts} = \langle \tilde{q}_t, \tilde{c}_s \rangle = (\alpha^t)^T K \beta^s$
    $\beta^s = \beta^s - \alpha^t \tilde{r}_{ts}$
  end
end
3. $Q_k^T \Phi(x) = \Big[ \sum_{i=1}^{n} \alpha_i^1 \kappa(a_i, x), \; \ldots, \; \sum_{i=1}^{n} \alpha_i^k \kappa(a_i, x) \Big]^T$.
This approach is summarized in Algorithm 3. Algorithm 3, the Kernel Orthogonal Centroid method via the modified Gram-Schmidt procedure, requires $O(n^2 + k^2 n)$ flops for the orthogonal decomposition of the centroid matrix $C$ and $O(kn)$ flops for obtaining the reduced dimensional representation $Q_k^T \Phi(x)$ of an input vector $x \in \mathbb{R}^{m \times 1}$. Hence the total complexity of Algorithm 3 is slightly higher than that of Algorithm 2. However, the approach of finding the parameters $\alpha^t$ from the parameters $\beta^t$ can be applied in other contexts of kernel-based feature extraction where a direct derivation of the kernel-based method as in Algorithm 2 is not possible. In the next section, we present numerical test results that compare the effectiveness of our proposed method to other existing methods. We also visualize the effects of various kernels in our algorithm.
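The next sketch follows Algorithm 3 instead: it runs the modified Gram-Schmidt recurrence of Step 2 directly on the beta coefficients, using only the kernel matrix K. The names are again illustrative, and for the same kernel matrix and cluster labels its output coincides (up to rounding) with that of the Algorithm 2 sketch above.

import numpy as np

def koc_mgs_coefficients(K, labels):
    # Returns alpha (n x k): column t holds the coefficients of q_t in
    # q_t = sum_i alpha[i, t] * Phi(a_i)                          (Eqn. (31))
    n = K.shape[0]
    clusters = np.unique(labels)
    k = len(clusters)
    B = np.zeros((n, k))                        # beta coefficients of Eqn. (29)
    for j, c in enumerate(clusters):
        idx = labels == c
        B[idx, j] = 1.0 / idx.sum()
    alpha = np.zeros((n, k))
    for t in range(k):
        r_tt = np.sqrt(B[:, t] @ K @ B[:, t])   # ||c_t|| through the kernel (Eqn. (33))
        alpha[:, t] = B[:, t] / r_tt
        for s in range(t + 1, k):
            r_ts = alpha[:, t] @ K @ B[:, s]    # <q_t, c_s> through the kernel
            B[:, s] -= alpha[:, t] * r_ts       # orthogonalize the remaining centroids
    return alpha

def koc_mgs_transform(alpha, kx):
    # kx[i] = kappa(a_i, x); returns Q_k^T Phi(x) as in Eqn. (36)
    return alpha.T @ kx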
4 Computational Test Results
The Kernel Orthogonal Centroid method has been implemented in C on the IBM SP at the University of Minnesota Supercomputing Institute in order to investigate its computational performance. The prediction accuracy in classification of test data, whose dimension was reduced to the number of clusters by our KOC method, was compared to that of other existing linear and nonlinear feature extraction methods. We used data sets available in the public domain as well as some artificial data we generated. In addition, input data with cluster structure are visualized in the 3-dimensional space after dimension reduction by our proposed method to illustrate the quality of the represented cluster structure. In the process, we also illustrate the effect of various kernel functions. We used two of the most commonly used kernels in our KOC method: polynomial kernels
$$\kappa(x, y) = (x \cdot y + 1)^d, \quad d \in \mathbb{N},$$
and Gaussian kernels
$$\kappa(x, y) = \exp(-\| x - y \|^2 / \sigma), \quad \sigma \in \mathbb{R}.$$
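Written out as code (a small illustration only; the parameter names d and sigma follow the formulas above), the two kernels are:

import numpy as np

def polynomial_kernel(x, y, d=3):
    return (np.dot(x, y) + 1.0) ** d

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.linalg.norm(x - y) ** 2 / sigma)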
4.1 Visualization

The purpose of our first test is to illustrate how our method produces a lower dimensional representation that separates the data items which belong to different classes. We present the results from the Iris plants data of Fisher [26], as well as from an artificial data set that we generated, where the data points in three clusters are not separable in the original space.

For the Iris data, the given data set has 150 data points in a 4-dimensional space and is clustered into 3 classes. One class is linearly separable from the other two classes, but the latter two classes are not linearly separable from each other. Figure 1 shows the data points reduced to a 3-dimensional space by various dimension reduction methods. The leftmost figure in Figure 1 is obtained by an optimal rank 3 approximation of the
data set from its singular value decomposition, which is one of the most commonly used techniques for dimension reduction [23]. The figure shows that after the dimension reduction by a rank 3 approximation from the SVD, two of the three classes are still not well separated. The second and the third figures in Figure 1 are obtained by our KOC method with the Gaussian kernel, where $\sigma = 1$ and $\sigma = 0.01$, respectively. They show that our Kernel Orthogonal Centroid method combined with the Gaussian kernel function with $\sigma = 0.01$ gives a 3-dimensional representation of the Iris data in which all three clusters are well separated and far apart from one another.
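As a usage illustration of the KOC sketch from Section 3 (the koc_fit and koc_transform functions defined there are assumed to be available), the Iris experiment above can be reproduced along the following lines; scikit-learn is used here only to load the data and is not part of the paper's implementation.

import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
A = iris.data.T                                    # 4 x 150, columns are data points
labels = iris.target                               # 3 classes

gaussian = lambda x, y: np.exp(-np.sum((x - y) ** 2) / 0.01)   # Gaussian kernel, sigma = 0.01
model = koc_fit(A, labels, gaussian)
X3 = np.column_stack([koc_transform(model, A[:, i]) for i in range(A.shape[1])])
print(X3.shape)                                    # (3, 150): one 3-d point per flower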
The artificial data we generated has three classes. Each class consists of 200 data points uniformly distributed in a box-shaped region with height 1.4, width 4 and length 18.5. The three classes intersect each other as shown in the first figure of Figure 2, for a total of 600 given data points. Different kernel functions were applied to obtain the nonlinear representation of these given data points. In this test, the dimension of the original data set is in fact not reduced, since it was given in the 3-dimensional space and, after applying the KOC method, the final dimension is also 3, which is the number of clusters. The second figure shows the new data representation with a polynomial kernel of degree 4. The third figure is produced using the Gaussian kernel $\kappa(x, y) = \exp(-\| x - y \|^2 / \sigma)$ with $\sigma = 5$.
4.2 Performance in Classification
In our second test, the purpose was to evaluate the effectiveness of the dimension reduction from our KOC method in classification. For this purpose, we compared the accuracy of binary classification when the dimension of the data items is reduced by our KOC method with the results of the kernel Fisher discriminant (KFD) method of Mika et al. [15]. The test results presented in this section are for binary classification, for comparison with KFD, which can handle two-class cases only. For more details on the test data generation and results, see [15], where the authors presented the kernel Fisher Discriminant (KFD) method for the binary-class case with substantial test results comparing their method to other classifiers.
Linear Discriminant Analysis optimizes various criterion functions which involve the between-cluster, within-cluster or mixed-cluster scatter matrices [2]. Many of the commonly used criteria involve the inverse of the between-cluster scatter matrix $S_b$ defined in (8) or the within-cluster scatter matrix $S_w$, which is defined as
$$S_w = \sum_{i=1}^{k} \sum_{j \in N_i} (a_j - c_i)(a_j - c_i)^T, \qquad (37)$$
requiring one of these scatter matrices to be nonsingular. However, in many applications these scatter matrices are either singular or ill-conditioned. One common situation where both scatter matrices become singular is when the number of data points is smaller than the dimension of the data space. Numerous methods have been proposed to overcome this difficulty, including the regularization method [24]. The generalized discriminant analysis called LDA/GSVD, which is based on the generalized singular value decomposition, works well regardless of the singularity of the within-cluster scatter matrix (see [25]). In the KFD analysis, Mika et al. used regularization parameters to make the within-cluster scatter matrix nonsingular.
The Fisher discriminant criterion requires the solution of an eigenvalue problem, which is expensive to compute. In order to improve the computational efficiency of KFD, several methods have been proposed, which include the KFD based on a quadratic optimization problem using regularization operators or a sparse greedy approximation [27, 28, 29]. In general, quadratic optimization problems are as costly as eigenvalue problems. A major advantage of our KOC method is that its computational cost is substantially lower, requiring the computation of a Cholesky factor [23] and the solution of a linear system where the problem size is the same as the number of clusters. The computational savings come from the fact that the within-cluster scatter matrix is not involved in the optimal dimension reduction criterion.
In Table 1, we present the results on seven data sets which Mika et al. used in their tests¹ [30]. Data sets which are not already clustered, or which have more than two clusters, were reorganized so that the results have only two classes. Each data set has 100 pairs of training and test data items which were generated from one pool of data items. For each data set, the average accuracy is calculated by running these 100 cases. Parameters for the best candidates for the kernel function and the SVM are determined based on a 5-fold cross-validation using the first five training sets. We repeat their results in the first five columns of Table 1, which show the prediction accuracies in percentage (%) from the RBF classifier (RBF), AdaBoost (AB), regularized AdaBoost, SVM and KFD. For more details, see [15].
The results shown in the column for KOC are obtained from linear soft margin SVM classification using the software SVM^light [31] after dimension reduction by KOC. The test results with the polynomial kernel of degree 3 and the Gaussian kernel with an optimal $\sigma$ value for each data set are presented in Table 1. The results show that our method obtained accuracy comparable to the other methods in all the tests we performed. Using our KOC algorithm, we were able to achieve substantial computational savings, not only due to the lower computational complexity of our algorithm, but also from using a linear SVM. Since no kernel function (or only the identity kernel) is involved in the classification process by a linear SVM, the weight vector $w$ in the representation of the optimal separating hyperplane $f(x) = w^T x + b$ can be computed explicitly, saving substantial computation time in the testing stage. In addition, due to the dimension reduction, kernel function values are computed between much shorter vectors.
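To make the point about the testing stage concrete, the snippet below shows how the weight vector w and bias b of a linear SVM can be read off explicitly; it uses scikit-learn's LinearSVC as a stand-in for the SVM^light software mentioned in the text, so the attribute names (coef_, intercept_) are scikit-learn's, not the paper's.

import numpy as np
from sklearn.svm import LinearSVC

def explicit_hyperplane(X_koc, y):
    # X_koc: n x k matrix of KOC features (rows are data points), y: binary labels
    svm = LinearSVC().fit(X_koc, y)
    w, b = svm.coef_.ravel(), svm.intercept_[0]
    return w, b          # f(x) = w @ x + b; no kernel evaluations are needed at test time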
Another phenomenon we observed in all these tests is that after the dimension reduction by KOC, the linear soft margin SVM requires significantly fewer training data points as support vectors, compared to the soft margin SVM with the kernel function applied to the original input data. More details are given below.
In order to apply the SVMs to a three-class problem, we used the method where, after a binary classification of $C_1$ vs. not $C_1$ ($C_1 / \neg C_1$) is determined, data classified not to be in the class $C_1$ is further classified to be in $C_2$ or $C_3$ ($C_2 / C_3$). There are three different ways to organize the binary classifiers for a three-class problem, depending on which classifier $C_i / \neg C_i$, $i = 1, 2, 3$, is considered in the first step. One may run all three cases to achieve better prediction accuracy. We present the results obtained from $C_1 / \neg C_1$ and $C_2 / C_3$, since all three ways produced comparable results in our tests.

¹The breast cancer data set was obtained from the University Medical Center, Inst. of Oncology, Ljubljana, Yugoslavia. Thanks to M. Zwitter and M. Soklic for the data.
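A sketch of this two-stage scheme is given below, again with scikit-learn's LinearSVC standing in for the linear soft margin SVM used in our experiments; the KOC feature matrix X_koc (rows are data points) and labels y taking values in {1, 2, 3} are assumptions of the example, not fixed by the paper.

import numpy as np
from sklearn.svm import LinearSVC

def fit_cascade(X_koc, y):
    stage1 = LinearSVC().fit(X_koc, y == 1)         # C_1 vs. not C_1
    rest = y != 1
    stage2 = LinearSVC().fit(X_koc[rest], y[rest])  # C_2 vs. C_3
    return stage1, stage2

def predict_cascade(stage1, stage2, X):
    pred = np.full(len(X), 1)                       # default: class C_1
    not_c1 = ~stage1.predict(X).astype(bool)        # points predicted not to be in C_1
    if not_c1.any():
        pred[not_c1] = stage2.predict(X[not_c1])    # refine them into C_2 or C_3
    return pred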
In Figure 3, the prediction accuracy and the number of support vectors are shown when the nonlinear soft margin SVM is applied in the original dimension and when the linear soft margin SVM is applied in the reduced dimension obtained from our KOC algorithm. In both cases, Gaussian kernels with various $\sigma$ values were used. While the best prediction accuracy is similar in both cases, it is interesting to note that the number of support vectors is much smaller in the case of the linear soft margin SVM with data in the reduced space. In addition, the performance and the number of support vectors are less sensitive to the value of $\sigma$ after dimension reduction by the KOC algorithm.
The test results confirm that the KOC algorithm is an effective method for extracting important nonlinear features. Once the best features are extracted, the computation of finding the optimal separating hyperplane and the classification of new data become much more efficient. An added benefit we observed in all our tests is that after the kernel-based nonlinear feature extraction by the KOC algorithm, another use of the kernel function in the SVM is not necessary. Hence the simple linear SVM can be used effectively, achieving further efficiency in computation. Another merit of the KOC method is that after its dramatic dimension reduction, the comparison between vectors in the classification stage by any similarity measure, such as the Euclidean distance ($L_2$ norm) or the cosine, becomes much more efficient, since we now compare vectors with $k$ components each, rather than $m$ components each.
5 Conclusion
We have presented a new method for nonlinear feature extraction called the Kernel Orthogonal Centroid (KOC) method. The KOC method reduces the dimension of the input data down to the number of clusters. The dimension reducing nonlinear transformation is a composite of two mappings: the first implicitly maps the data into a feature space by using a kernel function, and the second mapping, with orthonormal vectors in the feature space, is found so that the data items belonging to different clusters are maximally separated. One of the major advantages of our KOC method is its computational efficiency compared to other kernel-based methods such as kernel PCA [14], KFD [15, 27, 29] and GDA [16]. The efficiency compared to other nonlinear feature extraction methods utilizing discriminant analysis is achieved by considering only the between-cluster scatter relationship and by developing an algorithm which achieves this purpose by finding an orthonormal basis of the centroids, which is far cheaper than computing eigenvectors.
While linear and nonlinear Discriminant Analysis consider the minimization of the within-cluster as well as the maximization of the between-cluster distances, they have some disadvantages. The nonsingularity of the scatter matrices must be appropriately taken care of. The singular value decomposition required in computing the solution demands high computational complexity and memory. Making an optimal balance between within-cluster and between-cluster scatter can also make model selection for kernel functions difficult, since it is impossible to maximize the between-cluster scatter and minimize the within-cluster scatter at the same time.
Our KOC method achieved prediction accuracy comparable to other existing classification methods in our tests. In addition, when it is used with the SVM, in all our tests the linear SVM performed as well as the nonlinear SVM and with far fewer support vectors, further reducing the computational costs in the test stage.
Acknowledgements
The authors would like to thank the University of Minnesota Supercomputing Institute (MSI) for providing the computing facilities. We also would like to thank Dr. S. Mika for valuable information.
References
[1] R.O. Duda, P.E. Hart, and D.G. Stork. Pattern Classification. Wiley-Interscience, New York, 2001.
[2] K. Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press, second edition, 1990.
[3] I.T. Jolliffe. Principal Component Analysis. Springer-Verlag, New York, 1986.
[4] M.A. Kramer. Nonlinear principal component analysis using autoassociative neural networks. AIChE Journal, 37(2):233-243, 1991.
[5] K.I. Diamantaras and S.Y. Kung. Principal Component Neural Networks: Theory and Applications. Wiley-Interscience, New York, 1996.
[19] H. Park, M. Jeon, and J.B. Rosen. Lower dimensional representation of text data based on centroids and least squares. BIT Numerical Mathematics, 43(2):1-22, 2003.
[20] P. Howland and H. Park. Cluster-preserving dimension reduction methods for efficient classification of text data. In A Comprehensive Survey of Text Mining, Springer-Verlag, pp. 3-23, 2003.
[21] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, 2000.
[24] J.H. Friedman. Regularized discriminant analysis. Journal of the American Statistical Association, 84(405):165-175, 1989.
[25] P. Howland, M. Jeon, and H. Park. Structure preserving dimension reduction for clustered text data based on the generalized singular value decomposition. SIAM Journal on Matrix Analysis and Applications, 25(1):165-179, 2003.
[28] S. Mika, G. Ratsch, and K.-R. Muller. A mathematical programming approach to the kernel Fisher algorithm. Advances in Neural Information Processing Systems, 13, 2001.
[29] S. Mika, A.J. Smola, and B. Scholkopf. An improved training algorithm for kernel Fisher discriminants. In Proceedings of AISTATS, Morgan Kaufmann, pages 98-104, 2001.
About the Author - Cheong Hee Park received her Ph.D. in Mathematics from Yonsei University, Seoul, Korea, in 1998. She received the M.S. degree in Computer Science in 2002 and is currently a Ph.D. student at the Department of Computer Science and Engineering, University of Minnesota. Her research interests include pattern recognition, dimension reduction and machine learning.

About the Author - Haesun Park received her B.S. degree in Mathematics from Seoul National University, Seoul, Korea, in 1981 with summa cum laude and the university president's medal as the top graduate of the university. She received her M.S. and Ph.D. degrees in Computer Science from Cornell University, Ithaca, NY, in 1985 and 1987, respectively. She has been on the faculty of the Department of Computer Science and Engineering, University of Minnesota, Twin Cities, since 1987, where she is currently a professor. Her current research interests include numerical linear algebra, pattern recognition, information retrieval, data mining, and bioinformatics. She served on the editorial board of the SIAM Journal on Scientific Computing, Society for Industrial and Applied Mathematics, from 1993 to 1999. Currently, she is on the editorial boards of Mathematics of Computation, American Mathematical Society; BIT Numerical Mathematics; and Computational Statistics and Data Analysis, International Association for Statistical Computing, including a special issue on numerical linear algebra and statistics. She has recently served on the committees of several meetings, including the program committees for the text mining workshop at the SIAM Conference on Data Mining for the past several years.
Figure 1: Iris data represented in a 3-dimensional space. The first figure is obtained from a rank 3 approximation by the SVD. The others are produced by the Kernel Orthogonal Centroid method with the Gaussian kernel $\kappa(x, y) = \exp(-\| x - y \|^2 / \sigma)$, where $\sigma = 1$ (the second) and $\sigma = 0.01$ (the third). Using the Gaussian kernel with $\sigma = 0.01$, our method obtained a complete separation of the three classes.
Figure 2: The first figure shows the training data with 3 clusters in a 3-dimensional space. The second figure is generated by the Kernel Orthogonal Centroid method with a polynomial kernel of degree 4. The third figure is from the KOC algorithm using the Gaussian kernel with width $\sigma = 5$.
Table 1: Prediction accuracies. The first part (RBF to KFD) is from [15]: classification accuracy from a single RBF classifier (RBF), AdaBoost (AB), regularized AdaBoost, SVM and KFD. The last two columns are from the Kernel Orthogonal Centroid method using Gaussian kernels (optimal $\sigma$ values shown) and a polynomial kernel of degree 3. For each test, the best prediction accuracy is shown in boldface.
[Figure 3 consists of two panels plotted against $\sigma$ (the width in Gaussian kernels), with the vertical axes in percent (%): the left panel shows the Prediction Accuracy and the right panel the Number of Support Vectors.]
Figure 3: Classification results on the artificial data using a soft margin SVM. The left graph shows the prediction accuracy in the full input space by an SVM with a Gaussian kernel (dashed line), and that in the reduced dimensional space obtained by our KOC method with a Gaussian kernel and a linear SVM (solid line). The right graph compares the number of support vectors generated in the training process. While the best accuracy is similar in both cases, the overall number of support vectors is much smaller when the reduced dimensional representation is used with a linear SVM.