Nonlinear Feature Extraction based on Centroids and Kernel Functions

Cheonghee Park and Haesun Park
Dept. of Computer Science and Engineering
University of Minnesota, Minneapolis, MN 55455

December 19, 2002

Abstract

A nonlinear feature extraction method is presented which can reduce the data dimension down to the number of clusters, providing dramatic savings in computational costs. The dimension reducing nonlinear transformation is obtained by implicitly mapping the input data into a feature space using a kernel function, and then finding a linear mapping based on an orthonormal basis of centroids in the feature space that maximally separates the between-cluster relationship. The experimental results demonstrate that our method is capable of extracting nonlinear features effectively, so that competitive classification performance can be obtained with linear classifiers in the dimension reduced space.

Keywords. cluster structure, dimension reduction, kernel functions, Kernel Orthogonal Centroid (KOC) method, linear discriminant analysis, nonlinear feature extraction, pattern classification, support vector machines

Correspondence should be addressed to Prof. Haesun Park ([email protected]). This work was supported in part by the National Science Foundation grants CCR-9901992 and CCR-0204109. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation (NSF).

1 Introduction

Dimension reduction in data analysis is an important preprocessing step for speeding up the main tasks and reducing the effect of noise. Nowadays, as the amount of data grows larger, extracting
the right features is not only a useful preprocessing step but becomes necessary for efficient and effective processing, especially for high dimensional data. The Principal Component Analysis (PCA) and the Linear Discriminant Analysis (LDA) are two of the most commonly used dimension reduction methods. These methods search optimal directions for the projection of input data onto a lower dimensional space [1, 2, 3]. While the PCA finds the direction along which the data scatter is greatest, the LDA searches the direction which maximizes the between-cluster scatter and minimizes the within-cluster scatter. However, these methods have a limitation for data which are not linearly separable, since it is difficult to capture a nonlinear relationship with a linear mapping. In order to overcome such a limitation, nonlinear extensions of these methods have been proposed [4, 5, 6].
One way for a nonlinear extension is to lift the input space to a higher dimensional feature space by a nonlinear feature mapping and then to find a linear dimension reduction in the feature space. It is well known that kernel functions allow such nonlinear extensions without explicitly forming a nonlinear mapping or a feature space, as long as the problem formulation involves only the inner products between the data points and never the data points themselves [7, 8, 9]. The remarkable success of support vector machine learning is an example of the effective use of kernel functions [10, 11, 12, 13]. The kernel Principal Component Analysis (kernel PCA) [14] and the generalized Discriminant Analysis [15, 16, 17, 18] have recently been introduced as nonlinear generalizations of the PCA and the LDA by kernel functions, respectively, and some interesting experimental results are presented. However, the PCA and the LDA require solutions from the singular value decomposition and generalized eigenvalue problem, respectively. In general, these decompositions are expensive to compute when the training data set is large, and especially when the problem dimension becomes higher due to the mapping to a feature space. In addition, the dimension reduction from the PCA does not reflect the clustered structure in the data well [19].
The centroid of each cluster minimizes the sum of the squared distances to vectors within the cluster, and it yields a rank one approximation of the cluster [19]. In the Orthogonal Centroid method [19] the centroids are taken as representatives of each cluster and the vectors of the input space are transformed by an orthonormal basis of the space spanned by the centroids. This method provides a dimension reducing linear transformation preserving the clustering structure in the given data. The relationship between any data points and centroids measured by the $L_2$ norm or cosine in the full dimensional space is completely preserved in the reduced space obtained with this transformation [19, 20]. Also it is shown that this method maximizes the between-cluster scatter over all the transformations with orthonormal vectors [21, 22].
In this paper, we apply the centroid-based orthogonal transformation, the Orthogonal Centroid algorithm, to the data transformed by a kernel-based nonlinear mapping and show that it can extract nonlinear features effectively, thus reducing the data dimension down to the number of clusters and saving the relative computational cost. In Section 2, we briefly review the Orthogonal Centroid method, which is a dimension reduction method based on an orthonormal basis for the centroids. In Section 3, we derive the new Kernel Orthogonal Centroid method, extending the Orthogonal Centroid method using kernel functions to handle nonlinear feature extraction, and analyze the computational complexity of our new method. Our experimental results presented in Section 4 demonstrate that the new nonlinear Orthogonal Centroid method is capable of extracting nonlinear features effectively, so that competitive classification performance can be obtained with linear classifiers after nonlinear dimension reduction. In addition, it is shown that once we obtain a lower dimensional representation, a linear soft margin Support Vector Machine (SVM) is able to achieve high classification accuracy with a much smaller number of support vectors, thus reducing prediction costs as well.
2 Orthogonal Centroid Method
Given a vector space representation $A = [a_1, a_2, \ldots, a_n] \in \mathbb{R}^{m \times n}$ of a data set of $n$ vectors in an $m$-dimensional space, dimension reduction by linear transformation is to find $G^T \in \mathbb{R}^{l \times m}$ that maps a vector $x$ to a vector $\hat{x}$ for some $l < m$:

$$G^T: x \in \mathbb{R}^{m \times 1} \rightarrow \hat{x} \in \mathbb{R}^{l \times 1}, \quad \text{s.t.} \quad G^T x = \hat{x}. \qquad (1)$$

In particular, we seek a dimension reducing transformation $G^T$ with which the cluster structure existing in the given data $A$ is preserved in the reduced dimensional space. Eqn. (1) can be rephrased as finding a rank reducing approximation of $A$ such that

$$\min_{B, Y} \| A - B Y \|_F, \quad \text{where } B \in \mathbb{R}^{m \times l} \text{ and } Y \in \mathbb{R}^{l \times n}. \qquad (2)$$
For simplicity, we assume that the data matrix $A$ is clustered into $k$ clusters as

$$A = [A_1, A_2, \ldots, A_k], \quad \text{where} \quad A_i \in \mathbb{R}^{m \times n_i} \quad \text{and} \quad \sum_{i=1}^{k} n_i = n. \qquad (3)$$
Let $N_i$ denote the set of column indices that belong to the cluster $i$. The centroid $c_i$ of each cluster $A_i$ is the average of the columns of $A_i$, i.e., $c_i = \frac{1}{n_i} \sum_{j \in N_i} a_j$.

Algorithm 1 Orthogonal Centroid method
Given a data matrix $A \in \mathbb{R}^{m \times n}$ with $k$ clusters and a data point $x \in \mathbb{R}^{m \times 1}$, this algorithm computes a matrix $Q_k \in \mathbb{R}^{m \times k}$ and gives a $k$-dimensional representation $\hat{x} = Q_k^T x \in \mathbb{R}^{k \times 1}$.
1. Compute the centroid $c_i$ of the $i$th cluster for $1 \le i \le k$.
2. Set the centroid matrix $C = [c_1, c_2, \ldots, c_k]$.
3. Compute an orthogonal decomposition of $C$, which is $C = Q_k R$.
4. $\hat{x} = Q_k^T x$ gives a $k$-dimensional representation of $x$.
The centroid of each cluster is the vector that minimizes the sum of squared distances to vectors within the cluster. That is, the centroid vector $c_i$ gives the smallest distance in the Frobenius norm between the matrix $A_i$ and a rank one approximation $x e^T$, where $e = (1, \ldots, 1)^T \in \mathbb{R}^{n_i \times 1}$:

$$\| A_i - c_i e^T \|_F^2 = \sum_{j \in N_i} \| a_j - c_i \|_2^2 = \min_{x \in \mathbb{R}^{m \times 1}} \sum_{j \in N_i} \| a_j - x \|_2^2 = \min_{x \in \mathbb{R}^{m \times 1}} \| A_i - x e^T \|_F^2. \qquad (6)$$
Taking the centroids as representatives of the clusters, we find an orthonormal basis of the space spanned by the centroids by computing an orthogonal decomposition

$$C = Q \begin{pmatrix} R \\ 0 \end{pmatrix} \qquad (7)$$

of the centroid matrix $C = [c_1, \ldots, c_k] \in \mathbb{R}^{m \times k}$, where $Q = [q_1, q_2, \ldots, q_m] \in \mathbb{R}^{m \times m}$ is an orthogonal matrix with orthonormal columns and $R \in \mathbb{R}^{k \times k}$ is an upper triangular matrix. Taking the first $k$ columns of $Q$, we obtain $C = Q_k R$ with

$$Q_k = [q_1, q_2, \ldots, q_k], \qquad (8)$$

where the columns of $Q_k$ form an orthonormal basis for $\mathrm{range}(C)$, the space spanned by the columns of $C$, when the columns of $C$ are linearly independent. The algorithm can easily be modified when the columns of $C$ are not linearly independent. The matrix $Q_k^T$ gives a dimension reducing linear transformation preserving the clustering structure in the sense that the relationship between any data item and a centroid measured using the $L_2$ norm or cosine is completely preserved in the reduced dimensional space [19, 20]. This method is called the Orthogonal Centroid method and is summarized in Algorithm 1. Moreover, as we show in Section 3, this assumption of linear independence is no longer needed in our new Kernel Orthogonal Centroid method.
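To make the procedure concrete, the following is a minimal sketch of Algorithm 1 in Python with NumPy; the use of NumPy and all function and variable names are our own illustration and are not taken from the authors' implementation.

```python
import numpy as np

def orthogonal_centroid(A, labels):
    """Orthogonal Centroid method (Algorithm 1), sketched.

    A      : m x n data matrix, one data point per column
    labels : length-n array of cluster indices 0, ..., k-1
    Returns Q_k (m x k), an orthonormal basis of the space spanned by the centroids.
    """
    k = int(labels.max()) + 1
    # Steps 1-2: centroid matrix C = [c_1, ..., c_k]
    C = np.column_stack([A[:, labels == i].mean(axis=1) for i in range(k)])
    # Step 3: orthogonal decomposition C = Q_k R (reduced QR factorization)
    Q_k, R = np.linalg.qr(C)
    return Q_k

# Step 4: the k-dimensional representation of a data point x is Q_k^T x.
rng = np.random.default_rng(0)
A = rng.standard_normal((10, 30))          # 30 points in a 10-dimensional space
labels = np.repeat(np.arange(3), 10)       # 3 clusters of 10 points each
Q_k = orthogonal_centroid(A, labels)
x_hat = Q_k.T @ A[:, 0]                    # 3-dimensional representation of the first point
```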
It is shown that the linear transformation obtained in the Orthogonal Centroid method solves a trace optimization problem, providing a link between the methods of linear discriminant analysis and those based on centroids [22]. Dimension reduction by the Linear Discriminant Analysis searches for a linear transformation which maximizes the between-cluster scatter and minimizes the within-cluster scatter. The between-cluster scatter matrix is defined as

$$S_b = \sum_{i=1}^{k} \sum_{j \in N_i} (c_i - c)(c_i - c)^T = \sum_{i=1}^{k} n_i (c_i - c)(c_i - c)^T \qquad (9)$$

and

$$\operatorname{trace}(S_b) = \sum_{i=1}^{k} \sum_{j \in N_i} (c_i - c)^T (c_i - c) = \sum_{i=1}^{k} \sum_{j \in N_i} \| c_i - c \|_2^2, \qquad (10)$$

where $c$ denotes the global centroid of the data set. Let us consider a criterion that involves only the between-cluster scatter matrix, i.e., to find a dimension reducing transformation $G^T \in \mathbb{R}^{l \times m}$ such that the columns of $G$ are orthonormal and $\operatorname{trace}(G^T S_b G)$ is maximized; it is shown in [22] that the transformation $Q_k$ obtained by the Orthogonal Centroid method solves this optimization problem.
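As a small numerical illustration of this property (our own sketch on synthetic data, assuming NumPy; not part of the original experiments), one can check that the Orthogonal Centroid transformation $Q_k$ preserves the trace of the between-cluster scatter, i.e., $\operatorname{trace}(Q_k^T S_b Q_k) = \operatorname{trace}(S_b)$, which is the claimed maximum over transformations with orthonormal columns [22].

```python
import numpy as np

rng = np.random.default_rng(1)
m, k, n_i = 20, 3, 15
# Synthetic clustered data: each cluster is shifted by a random offset; points are columns.
A = np.column_stack([rng.standard_normal((m, n_i)) + 5.0 * rng.standard_normal((m, 1))
                     for _ in range(k)])
labels = np.repeat(np.arange(k), n_i)

c = A.mean(axis=1)                                         # global centroid
C = np.column_stack([A[:, labels == i].mean(axis=1) for i in range(k)])

# Between-cluster scatter S_b = sum_i n_i (c_i - c)(c_i - c)^T, Eqn. (9)
S_b = sum((labels == i).sum() * np.outer(C[:, i] - c, C[:, i] - c) for i in range(k))

Q_k, _ = np.linalg.qr(C)                                   # Orthogonal Centroid transformation
print(np.trace(S_b))                                       # trace(S_b), Eqn. (10)
print(np.trace(Q_k.T @ S_b @ Q_k))                         # equal up to rounding error
```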
3 Kernel Orthogonal Centroid Method

Next we show how the Orthogonal Centroid algorithm can be combined with a kernel function $\kappa(x, y) = \langle \Phi(x), \Phi(y) \rangle$, where $\Phi$ denotes the nonlinear mapping into the feature space, to produce a nonlinear dimension reduction method which does not require the feature mapping $\Phi$ or the feature space explicitly. The orthogonal decomposition $C_\Phi = P_k R$ of the centroid matrix $C_\Phi$ in the feature space can be obtained through the Cholesky factorization $C_\Phi^T C_\Phi = R^T R$, where $R$ is a nonsingular upper triangular matrix [25]. We apply the Orthogonal Centroid algorithm to $\Phi(A)$ to reduce the data dimension to $k$, the number of clusters in the input data. Then for any data point $x \in \mathbb{R}^{m \times 1}$, the dimension reduced representation of $x$ in a $k$-dimensional space will be given by $P_k^T \Phi(x)$.

We now show how we can calculate $P_k^T \Phi(x)$ without knowing $\Phi$ explicitly, i.e., without knowing $C_\Phi$ explicitly. The centroid matrix $C_\Phi$ in the feature space is

$$C_\Phi = \left[ \frac{1}{n_1} \sum_{j \in N_1} \Phi(a_j), \; \ldots, \; \frac{1}{n_k} \sum_{j \in N_k} \Phi(a_j) \right]. \qquad (24)$$

Due to the assumption that the kernel function $\kappa$ is symmetric positive definite, the matrix $C_\Phi^T C_\Phi$ is symmetric positive definite and accordingly the centroid matrix $C_\Phi$ has linearly independent columns. We summarize our algorithm in Algorithm 2, the Kernel Orthogonal Centroid (KOC) method.
We now briefly discuss the computational complexity of the KOC algorithm, where one flop (floating point operation) represents roughly what is required to do one addition (or subtraction) and one multiplication (or division) [26]. We did not include the cost for evaluating the kernel function values $\kappa(a_i, a_j)$ and $\kappa(a_i, x)$, since this is required in any kernel-based method and the cost is common to all of them. The Cholesky decomposition of $C_\Phi^T C_\Phi$ for obtaining the upper triangular matrix $R$ in (28) takes $O(k^3)$ flops, since $C_\Phi^T C_\Phi$ is $k \times k$ where $k$ is the number of clusters. Once we obtain the upper triangular matrix $R$, the lower dimensional representation $\hat{x} = P_k^T \Phi(x)$ of a specific input $x$ can be computed without computing $R^{-1}$, but from solving a linear system

$$R^T \hat{x} = \left[ \frac{1}{n_1} \sum_{j \in N_1} \kappa(a_j, x), \; \ldots, \; \frac{1}{n_k} \sum_{j \in N_k} \kappa(a_j, x) \right]^T, \qquad (31)$$

which requires $O(k^2 + n)$ flops. Typically the number of clusters $k$ is much smaller than the total number of training samples $n$. Therefore, the complexity of the nonlinear dimension reduction by the Kernel Orthogonal Centroid method presented in Algorithm 2 is $O(n)$. However, the kernel-based LDA or PCA needs to handle an eigenvalue problem of size $n \times n$, where $n$ is the number of training samples, which is more expensive to compute [14, 15, 16]. Therefore, the Kernel Orthogonal Centroid method is an efficient dimension reduction method that can reflect the nonlinear cluster relation in the reduced dimensional representation.
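For illustration, the following sketch follows this Cholesky route: it forms $C_\Phi^T C_\Phi$ from kernel values, factors it, and then solves the triangular system (31) for a new point. It is our own sketch, assuming NumPy and SciPy are available; the naming and the kernel passed in are illustrative, not the authors' code.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def koc_cholesky(A, labels, kernel):
    """Kernel Orthogonal Centroid reduction via the Cholesky factor R (a sketch).

    A      : m x n training data, one point per column
    labels : length-n array of cluster indices 0, ..., k-1
    kernel : function kappa(x, y) acting on column vectors
    Returns a function mapping a new point x to its k-dimensional representation.
    """
    n = A.shape[1]
    k = int(labels.max()) + 1
    # n x n kernel (Gram) matrix K[i, j] = kappa(a_i, a_j)
    K = np.array([[kernel(A[:, i], A[:, j]) for j in range(n)] for i in range(n)])
    # Theta[j, i] = 1/n_i if a_j belongs to cluster i, else 0, so that C_Phi = Phi(A) Theta
    Theta = np.zeros((n, k))
    for i in range(k):
        idx = labels == i
        Theta[idx, i] = 1.0 / idx.sum()
    M = Theta.T @ K @ Theta                    # M = C_Phi^T C_Phi, a k x k matrix
    R = cholesky(M, lower=False)               # M = R^T R, R upper triangular

    def reduce_point(x):
        kx = np.array([kernel(A[:, j], x) for j in range(n)])
        rhs = Theta.T @ kx                     # right-hand side of Eqn. (31)
        return solve_triangular(R, rhs, trans='T')   # solve R^T x_hat = rhs
    return reduce_point
```

A Gaussian kernel of width sigma could, for instance, be supplied as `kernel=lambda x, y: np.exp(-np.sum((x - y)**2) / sigma**2)`; the exact parameterization used in the paper's experiments is not specified here.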
Based on these observations, we can apply the modified Gram-Schmidt method [25] to the centroid matrix $C_\Phi$ to compute an orthonormal basis of the centroids, even though we only have an implicit representation of the centroids in the feature space. Once the orthonormal basis is represented implicitly through the coefficient vectors $w_s$, the nonlinear reduced representation of any data point can be computed using kernel function values only, as in step 3 of Algorithm 2.
Algorithm 2 Kernel Orthogonal Centroid (KOC) method
Given a data matrix $A = [a_1, \ldots, a_n] \in \mathbb{R}^{m \times n}$ with $k$ clusters and a kernel function $\kappa$, this method computes the nonlinear dimension reduced representation $\hat{x} = P_k^T \Phi(x) \in \mathbb{R}^{k \times 1}$ for any input vector $x \in \mathbb{R}^{m \times 1}$.

1. Define $\theta_{ji} = \begin{cases} 1/n_i & \text{if } a_j \text{ belongs to the cluster } i \\ 0 & \text{otherwise} \end{cases}$ for $1 \le j \le n$, $1 \le i \le k$, and let $\theta_i$ denote the $i$th column of $\Theta = [\theta_{ji}] \in \mathbb{R}^{n \times k}$.

2. Compute an orthogonal decomposition $C_\Phi = P_k R$ of the centroid matrix $C_\Phi = \Phi(A)\Theta$, as in Eqn. (35), by the modified Gram-Schmidt, using the kernel matrix $K = [\kappa(a_i, a_j)] \in \mathbb{R}^{n \times n}$:

for $s = 1, \ldots, k$
    $r_{ss} = \sqrt{\theta_s^T K \theta_s}$
    $w_s = \theta_s / r_{ss}$
    for $t = s+1, \ldots, k$
        $r_{st} = w_s^T K \theta_t$
        $\theta_t = \theta_t - w_s\, r_{st}$
    end
end

3. $P_k^T \Phi(x) = \left[ \sum_{i=1}^{n} w_1(i)\, \kappa(a_i, x), \; \ldots, \; \sum_{i=1}^{n} w_k(i)\, \kappa(a_i, x) \right]^T$, where $w_j(i)$ denotes the $i$th component of $w_j$.
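A compact sketch of this kernel-based modified Gram-Schmidt step in Python/NumPy (again our own illustrative naming, not the authors' C implementation) is given below; it takes the kernel matrix and the matrix $\Theta$ from step 1 as inputs.

```python
import numpy as np

def koc_gram_schmidt(K, Theta):
    """Implicit orthonormalization of the feature-space centroids (Algorithm 2, step 2).

    K     : n x n kernel matrix, K[i, j] = kappa(a_i, a_j)
    Theta : n x k matrix with Theta[j, i] = 1/n_i if a_j belongs to cluster i, else 0
    Returns W (n x k) such that P_k = Phi(A) @ W has orthonormal columns.
    """
    n, k = Theta.shape
    Theta = Theta.copy()
    W = np.zeros((n, k))
    for s in range(k):
        r_ss = np.sqrt(Theta[:, s] @ K @ Theta[:, s])
        W[:, s] = Theta[:, s] / r_ss
        for t in range(s + 1, k):
            r_st = W[:, s] @ K @ Theta[:, t]
            Theta[:, t] -= W[:, s] * r_st
    return W

def koc_reduce(W, kx):
    """Step 3: k-dimensional representation of a point x, given kx[i] = kappa(a_i, x)."""
    return W.T @ kx
```

Note that only kernel values appear; the feature mapping $\Phi$ is never formed explicitly.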
This way of computing the nonlinear features from the parameters $w_s$ can be applied in other contexts of kernel-based feature extraction where a direct derivation of the kernel-based method as in Algorithm 2 is not possible. We have applied a similar approach in developing nonlinear discriminant analysis based on the generalized singular value decomposition, which works successfully regardless of nonsingularity of the within-cluster scatter matrix [27]. More discussions about the optimization criteria used in LDA, including the within-cluster scatter matrix, are given in the next section.
In the Kernel Orthogonal Centroid method, the choice of kernel function will influence the results, as in any other kernel-based method. However, a general guideline for an optimal choice of the kernel is difficult to obtain. In the next section, we present the numerical test results that compare the effectiveness of our proposed method to other existing methods. We also visualize the effects of various kernels in our algorithm.
4 Computational Test Results
The Kernel Orthogonal Centroid method has been implemented in C on the IBM SP at the University of Minnesota Supercomputing Institute in order to investigate its computational performance. The prediction accuracy of classification of the test data, whose dimension was reduced to the number of clusters by our KOC method, was compared to other existing linear and nonlinear feature extraction methods.
Our experimental results illustrate that when the Orthogonal Centroid method is combined with a nonlinear mapping, as in the KOC algorithm, with an appropriate kernel function, the linear separability of the data is increased in the reduced dimensional space. This is due to the nonlinear dimension reduction achieved by the orthonormal basis of the centroids in the feature space, which maximizes the between-cluster scatter.
In the Iris data, the given data set has 150 data points in a 4-dimensional space and is clustered into 3 classes. One class is linearly separable from the other two classes, but the latter two classes are not linearly separable. Figure 1 shows the data points which are reduced to a 3-dimensional space by various dimension reduction methods. The leftmost figure in Figure 1 is obtained by an optimal rank 3 approximation of the data set from its singular value decomposition, which is one of the most commonly used techniques for dimension reduction [25]. The figure shows that after the dimension reduction by a rank 3 approximation from the SVD, two of the three classes are still not quite separated. The second and the third figures in Figure 1 are obtained by our KOC method with the Gaussian kernel where $\sigma = 1$ and $0.01$, respectively. They show that our Kernel Orthogonal Centroid method combined with the Gaussian kernel function with an appropriate value of $\sigma$ clearly separates all three classes in the reduced space.
The artificial data we generated has three classes. Each class consists of 200 data points uniformly distributed in a cubic region with height 1.4, width 4 and length 18.5. The three classes intersect each other as shown in the top left figure of Figure 2, for a total of 600 given data points. Different kernel functions were applied to obtain the nonlinear representation of these given data points. In this test, the dimension of the original data set is in fact not reduced, since it is already equal to the number of clusters; the purpose is rather to visualize the effect of the nonlinear transformation obtained with various kernels. As with the Iris data, with the proper kernel function, the three clusters are well separated. It is interesting to note that the within-cluster relationship also became tighter, although the dimension reduction criterion involves only the between-cluster relationship.
4.2 Performance in Classification
In our second test, the purpose was to compare the effectiveness of dimension reduction from our KOC method in classification. For this purpose, we compared the accuracy of binary classification results where the dimensions of the data items are reduced by our KOC method as well as by the kernel Fisher discriminant (KFD) method of Mika et al. [15]. The test results presented in this section are for binary classifications, for comparison to KFD, which can handle two-class cases only. For more details on the test data generation and results, see [15], where the authors presented the kernel Fisher Discriminant (KFD) method for the binary-class case with substantial test results comparing their method to other classifiers.
The Linear Discriminant Analysis optimizes various criteria functions which involve between-cluster, within-cluster or mixed-cluster scatter matrices [2]. Many of the commonly used criteria involve the inverse of the within-cluster scatter matrix $S_w$, which is defined as

$$S_w = \sum_{i=1}^{k} \sum_{j \in N_i} (a_j - c_i)(a_j - c_i)^T, \qquad (40)$$

requiring this within-cluster scatter matrix $S_w$ to be nonsingular. However, in many applications the matrix $S_w$ is either singular or ill-conditioned. One common situation when $S_w$ becomes singular is when the number of data points is smaller than the dimension of the space where each data item resides. Numerous methods have been proposed to overcome this difficulty, including the regularization method [29]. A method Howland et al. recently developed, called LDA/GSVD, which is based on the generalized singular value decomposition, works well regardless of the singularity of the within-cluster scatter matrix (see [21]). In the KFD analysis, Mika et al. used regularization parameters to make the within-cluster scatter matrix nonsingular.
The Fisher discriminant criterion requires the solution of an eigenvalue problem, which is expensive to compute. In order to improve the computational efficiency of KFD, several methods have been proposed, which include the KFD based on a quadratic optimization problem using regularization operators or a sparse greedy approximation [30, 31, 32]. In general, quadratic optimization problems are as costly as the eigenvalue problems. A major advantage of our KOC method is that its computational cost is substantially lower, requiring computation of a Cholesky factor and the solution of a linear system where the problem size is the same as the number of clusters. The computational savings come from the fact that the within-cluster scatter matrix is not involved in the optimal dimension reduction criterion [22].
In Table 1, we present the implementation results on seven data sets which Mika et al. have used in their tests¹ [33]. The data sets which are not already clustered or with more than two clusters were reorganized so that the results have only two classes. Each data set has 100 pairs of training and test data items which were generated from one pool of data items. For each data set, the average accuracy is calculated by running these 100 cases. Parameters for the best candidate for the kernel function and SVM are determined based on a 5-fold cross-validation using the first five training sets. We repeat their results in the first five columns of Table 1, which show the prediction accuracies in percentage (%) from the RBF classifier (RBF), AdaBoost (AB), regularized AdaBoost, SVM and KFD. For more details, see [15].

¹The breast cancer data set was obtained from the University Medical Center, Institute of Oncology, Ljubljana, Yugoslavia. Thanks to M. Zwitter and M. Soklic for the data.
The results shown in the column for KOC are obtained from the linear soft margin SVM classification using the software SVM$^{light}$ [34] after dimension reduction by KOC. The test results with the polynomial kernel of degree 3 and the Gaussian kernel with an optimal $\sigma$ value for each data set are presented in Table 1. The results show that our method obtained comparable accuracy to the other methods in all the tests we performed. Using our KOC algorithm, we were able to achieve substantial computational savings not only due to the lower computational complexity of our algorithm, but also from using a linear SVM. Since no kernel function (or the identity kernel function) is involved in the classification process by a linear SVM, the parameter $w$ in the representation of the optimal separating hyperplane $f(x) = w^T x + b$ can be computed explicitly, saving substantial computation time in the testing stage. In addition, due to the dimension reduction, kernel function values are computed between much shorter vectors.
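As an illustration of this pipeline (a hedged sketch only: the original experiments used SVM$^{light}$ [34], whereas here we assume scikit-learn's SVC with a linear kernel and fabricate small placeholder data so the snippet runs), the reduced $k$-dimensional vectors can be fed directly to a linear soft margin SVM, and the hyperplane parameters are available explicitly:

```python
import numpy as np
from sklearn.svm import SVC

# Placeholder k-dimensional representations, standing in for KOC-reduced data.
rng = np.random.default_rng(0)
X_red = np.vstack([rng.normal(0.0, 1.0, (20, 3)), rng.normal(3.0, 1.0, (20, 3))])
y = np.array([0] * 20 + [1] * 20)

clf = SVC(kernel='linear', C=1.0)     # linear soft margin SVM; no kernel trick needed
clf.fit(X_red, y)
w = clf.coef_[0]                      # explicit normal vector of the separating hyperplane
b = clf.intercept_[0]                 # f(x) = w^T x + b
print(clf.n_support_)                 # number of support vectors per class
```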
Another phenomenon we observed in all these tests is that after the dimension reduction by KOC, the linear soft margin SVM requires significantly fewer training data points as support vectors, compared to the soft margin SVM with the kernel function applied to the original input data. More details can be found in the next section.
4.3 Performance of the Support Vector Machines
Using the same artificial data that we used in Section 4.1, we now compare the performance of classification of the soft margin SVMs using the data generated from our KOC, as well as using the original data. This time, 600 more test data points are generated in addition to the 600 training data points generated for the earlier test in Section 4.1. The test data are generated following the same rules as the training data, but independently from the training data.
In order to apply the SVMs to a three-class problem, we used the method where, after a binary classification of $C_1$ vs. not $C_1$ ($C_1 / \neg C_1$) is determined, data classified not to be in the class $C_1$ are further classified to be in $C_2$ or $C_3$ ($C_2 / C_3$). There are three different ways to organize the binary classifiers for a three-class problem, depending on which classifier $C_i / \neg C_i$, $i = 1, 2, 3$, is considered in the first step. One may run all three cases to achieve better prediction accuracy. For more details, see [35]. We present the results obtained from $C_1 / \neg C_1$ and $C_2 / C_3$, since all three ways produced comparable results in our tests.
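A sketch of this two-stage scheme with generic binary classifiers (our own illustration, assuming scikit-learn's SVC; any soft margin SVM implementation could play the role of the binary classifiers) is:

```python
import numpy as np
from sklearn.svm import SVC

def train_two_stage(X, y, first=0):
    """Three-class classification via two binary classifiers:
    stage 1 separates class `first` from the rest, stage 2 separates the remaining two."""
    rest = [c for c in np.unique(y) if c != first]
    clf1 = SVC(kernel='linear').fit(X, (y == first).astype(int))                 # C_first vs. not C_first
    mask = y != first
    clf2 = SVC(kernel='linear').fit(X[mask], (y[mask] == rest[0]).astype(int))   # remaining pair

    def predict(X_new):
        out = np.empty(len(X_new), dtype=y.dtype)
        s1 = clf1.predict(X_new).astype(bool)
        out[s1] = first
        if (~s1).any():
            s2 = clf2.predict(X_new[~s1]).astype(bool)
            out[~s1] = np.where(s2, rest[0], rest[1])
        return out
    return predict

# Usage: predictor = train_two_stage(X_reduced, labels, first=0); predictor(X_test_reduced)
```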
In Figure 3, the prediction accuracy and the number of support vectors are shown when the nonlinear soft margin SVM is applied in the original dimension and the linear soft margin SVM is applied in the reduced dimension obtained from our KOC algorithm. In both cases, Gaussian kernels with various $\sigma$ values were used. While the best prediction accuracy among the various $\sigma$ values is similar in both cases, it is interesting to note that the number of support vectors is much smaller in the case of the linear soft margin SVM with data in the reduced space. In addition, the performance and the number of support vectors are less sensitive to the value of $\sigma$ when the reduced representation is used, which suggests that the KOC algorithm has already extracted the essential nonlinear features. Once the best features are extracted, the computation of finding the optimal separating hyperplane and the classification of new data become much more efficient. An added benefit we observed in all our tests is that after the kernel-based nonlinear feature extraction by the KOC algorithm, another use of the kernel function in the SVM is not necessary. Hence the simple linear SVM can be effectively used, achieving further efficiency in computation. Another merit of the KOC method is that after its dramatic dimension reduction, the comparison between vectors in the classification stage by any similarity measure such as the Euclidean distance ($L_2$ norm) or cosine becomes much more efficient, since we now compare vectors with $k$ components each, rather than $m$ components each.
5 Conclusion
We have presented a new method for nonlinear feature extraction called the Kernel Orthogonal Centroid (KOC) method. The KOC method reduces the dimension of the input data down to the number of clusters. The dimension reducing nonlinear transformation is a composite of two mappings; the first implicitly maps the data into a feature space by using a kernel function, and the second mapping with orthonormal vectors in the feature space is found so that the data items belonging to different clusters are maximally separated. One of the major advantages of our KOC method is its computational efficiency, compared to other kernel-based methods such as kernel PCA [14], KFD [15, 30, 32] and GDA [16]. The efficiency compared to other nonlinear feature extraction methods utilizing discriminant analysis is achieved by considering only the between-cluster scatter relationship and by developing an algorithm which achieves this purpose from finding an orthonormal basis of the centroids, which is far cheaper than computing eigenvectors.

The experimental results illustrate that the KOC algorithm achieves an effective lower dimensional representation of input data which are not linearly separable, when combined with the right kernel function. With the proposed feature extraction method, we were able to achieve comparable or better prediction accuracy than other existing classification methods in our tests. In addition, when it is used with the SVM, in all our tests the linear SVM performed as well with far fewer support vectors, further reducing the computational costs in the test stage.
Acknowledgements
The authors would like to thank the University of Minnesota Supercomputing Institute (MSI) for providing the computing facilities. We also would like to thank Dr. S. Mika for valuable information.
References

[2] K. Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press, second edition, 1990.

[3] I.T. Jolliffe. Principal Component Analysis. Springer-Verlag, New York, 1986.

[4] M.A. Kramer. Nonlinear principal component analysis using autoassociative neural networks. AIChE Journal, 37(2):233–243, 1991.

[5] K.I. Diamantaras and S.Y. Kung. Principal Component Neural Networks: Theory and Applications. Wiley-Interscience, New York, 1996.

[6] T. Hastie, R. Tibshirani, and A. Buja. Flexible discriminant analysis by optimal scoring. Journal of the American Statistical Association, 89:1255–1270, 1994.

[7] M.A. Aizerman, E.M. Braverman, and L.I. Rozonoer. Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 25:821–837, 1964.

[19] H. Park, M. Jeon, and J.B. Rosen. Lower dimensional representation of text data based on centroids and least squares, 2001. Submitted to BIT.

[20] M. Jeon, H. Park, and J.B. Rosen. Dimensional reduction based on centroids and least squares for efficient processing of text data. In Proceedings of the First SIAM International Workshop on Text Mining, Chicago, IL, 2001.

[21] P. Howland, M. Jeon, and H. Park. Cluster structure preserving dimension reduction based on the generalized singular value decomposition. SIAM Journal on Matrix Analysis and Applications, to appear.

[22] P. Howland and H. Park. Cluster-preserving dimension reduction methods for efficient classification of text data. In A Comprehensive Survey of Text Mining, Springer-Verlag, to appear, 2002.

[23] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, 2000.

[24] C.J.C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121–167, 1998.

[29] J.H. Friedman. Regularized discriminant analysis. Journal of the American Statistical Association, 84(405):165–175, 1989.

[30] S. Mika, G. Ratsch, J. Weston, B. Scholkopf, A.J. Smola, and K.-R. Muller. Invariant feature extraction and classification in kernel spaces. Advances in Neural Information Processing Systems, 12:526–532, 2000.

[31] S. Mika, G. Ratsch, and K.-R. Muller. A mathematical programming approach to the kernel Fisher algorithm. Advances in Neural Information Processing Systems, 13, 2001.

[32] S. Mika, A.J. Smola, and B. Scholkopf. An improved training algorithm for kernel Fisher discriminants. In Proceedings of AISTATS, Morgan Kaufmann, pages 98–104, 2001.

[35] H. Kim and H. Park. Protein secondary structure prediction by support vector machines and position-specific scoring matrices. Submitted for publication, Oct. 2002.
Figure 2: The top left figure is the training data with 3 clusters in a 3-dimensional space. The top right figure is generated by the Kernel Orthogonal Centroid method with a polynomial kernel of degree 4. The bottom left and bottom right figures are from the KOC algorithm using Gaussian kernels with width 5 and 0.05, respectively.
Table 1: The prediction accuracies are shown. The first part (RBF to KFD) is from [15]: classification accuracy from a single RBF classifier (RBF), AdaBoost (AB), regularized AdaBoost, SVM and KFD. The last two columns are from the Kernel Orthogonal Centroid method using Gaussian kernels (optimal σ values shown) and a polynomial kernel of degree 3. For each test, the best prediction accuracy result is shown in boldface.
Figure 3: Classification results on the artificial data using a soft margin SVM. The left graph shows the prediction accuracy in the full input space by an SVM with a Gaussian kernel (dashed line), and that in the reduced dimensional space obtained by our KOC method with a Gaussian kernel and a linear SVM (solid line). The right graph compares the number of support vectors generated in the training process. While the best accuracy is similar in both cases, the overall number of support vectors is much smaller when the reduced dimensional representation is used in a linear SVM.