Tissue Classification with Gene Expression Profiles

Amir Ben-Dor
Agilent Laboratories, 3500 Deer Creek Road, MS 25U-5, Palo Alto, CA 94304
and The University of Washington, Seattle
[email protected]

Laurakay Bruhn
Agilent Laboratories, 3500 Deer Creek Road, MS 25U-7, Palo Alto, CA 94304
[email protected]

Nir Friedman
School of Computer Science & Engineering, Hebrew University, Jerusalem, 91904, Israel
[email protected]

Iftach Nachman
Center for Computational Neuroscience and School of Computer Science & Engineering, Hebrew University, Jerusalem, 91904, Israel
[email protected]

Michèl Schummer
University of Washington, Box 357730, Seattle, WA 98195
[email protected]

Zohar Yakhini
Agilent Laboratories, MATAM Bdg 30, Haifa 31905, Israel
[email protected]

May 3, 2000

Abstract

Constantly improving gene expression profiling technologies are expected to provide understanding and insight into cancer related cellular processes. Gene expression data is also

A preliminary version of this work appeared in Proceedings of the Fourth Annual International Conference on Computational Molecular Biology, 2000. Part of this work was done while Amir Ben-Dor was at the University of Washington, Seattle, with support from the Program for Mathematics and Molecular Biology (PMMB). Nir Friedman and Iftach Nachman were supported through the generosity of the Michael Sacher Trust and an Israeli Science Foundation equipment grant. Part of this work was done while Zohar Yakhini was visiting the Computer Science Department at the Technion. Contact author.
expected to significantly aid in the development of efficient cancer diagnosis and classification platforms. In this work we examine three sets of gene expression data measured across sets of tumor and normal clinical samples: the first set consists of 2,000 genes, measured in 62 epithelial colon samples (Alon et al. 1999); the second consists of about 100,000 clones, measured in 32 ovarian samples (an unpublished extension of the data set described in (Schummer et al. 1999)); the third set consists of about 7,100 genes, measured in 72 bone marrow and peripheral blood samples (Golub et al. 1999). We examine the use of scoring methods that measure how well individual gene expression levels separate tissue types (e.g., tumors from normals). These are then coupled with high dimensional classification methods to assess the classification power of complete expression profiles. We present results of leave-one-out cross validation (LOOCV) experiments on the three data sets, employing a nearest neighbor classifier, SVM (Cortes & Vapnik 1995), AdaBoost (Freund & Schapire 1997), and a novel clustering based classification technique. As tumor samples can differ from normal samples in their cell-type composition, we also perform LOOCV experiments using appropriately modified sets of genes, attempting to eliminate the resulting bias.
We demonstrate success rates of at least 90% in tumor vs. normal classification, using sets of selected genes, both with and without cellular contamination related members. These results are insensitive to the exact selection mechanism, over a certain range.
1 Introduction
The process by which the approximately 100,000 genes encoded by the human genome are expressed as proteins involves two steps. DNA sequences are initially transcribed into mRNA sequences. These mRNA sequences in turn are translated into the amino acid sequences of the proteins that perform various cellular functions. A crucial aspect of proper cell function is the regulation of gene expression, so that different cell types express different subsets of genes. Measuring mRNA levels can provide a detailed molecular view of the subset of genes expressed in different cell types under different conditions. Recently developed array-based methods enable simultaneous measurement of the expression levels of thousands of genes. These measurements are made by quantitating the hybridization (detected, for example, by fluorescence) of cellular mRNA to an array of defined cDNA or oligonucleotide probes immobilized on a solid substrate. Array methodologies have led to a tremendous acceleration in the rate at which gene expression pattern information is accumulated (DeRisi et al. 1997, Khan et al. 1998, Lockhart et al. 1996, Wen et al. 1998). Measuring gene expression levels under different conditions is important for expanding our understanding of gene function, how various gene products interact, and how experimental treatments can affect cellular function.
Another important purpose of gene expression studies is to improve understanding of cellular responses to drug treatment. Expression profiling assays performed before, during, and after treatment are aimed at identifying drug responsive genes, indications of treatment outcomes, and potential drug targets (Clarke et al. 1999). More generally, complete profiles can be considered as a potential basis for classification of treatment progression or other trends in the evolution of the treated cells.
Data obtained from cancer related gene expression studies typically consists of expression level measurements of thousands of genes. This complexity calls for data analysis methodologies that will efficiently aid in extracting relevant biological information. Previous gene expression analysis work emphasizes clustering techniques, which aim at partitioning the set of genes into subsets that are expressed similarly across different conditions. Indeed, clustering has been demonstrated to identify functionally related families of genes (Ben-Dor et al. 1999, DeRisi et al. 1997, Chu et al. 1998, Eisen et al. 1998, Iyer et al. 1999, Wen et al. 1998). Similarly, clustering methods can be used to divide a set of cell samples into clusters based on their expression profiles. In (Alon et al. 1999) this approach was applied to a set of colon samples, which was divided into two groups, one containing mostly tumor samples and the other containing mostly normal tissue samples.
Clustering methods, however, do not use any tissue annotation (e.g., tumor vs. normal) in the partitioning step. This information is only used to assess the success of the method. Such methods are often referred to as unsupervised. In contrast, supervised methods attempt to predict the classification of new tissues, based on their gene expression profiles, after training on examples that have been classified by an external "supervisor".
The purpose of this work is to rigorously assess the potential of classification approaches based on gene expression data. We present a novel clustering based classification methodology, and apply it together with two other recently developed classification approaches, Boosting (Schapire 1990, Freund & Schapire 1997) and Support Vector Machines (Cortes & Vapnik 1995, Vapnik 1999), to three data sets. These sets involve corresponding tissue samples from tumor and normal biopsies. The first is a data set of colon cancer (Alon et al. 1999), the second is a data set of ovarian cancer (an extension of the data set reported in (Schummer et al. 1999)), and the third is a data set of leukemia (Golub et al. 1999). We use established statistical tools, such as leave one out cross validation (LOOCV), to evaluate the predictive power of these methods on the data sets.
One of the major challenges of gene expression data is the large number of genes in the data sets. For example, one of our data sets includes almost 100,000 clones. Many of these clones are not relevant to the distinction between tumor and normal samples and introduce noise into the classification process. Moreover, for diagnostic purposes it is important to find small sets of genes that are sufficiently informative to distinguish between cells of different types. To this end we suggest a simple combinatorial error rate score for each gene, and use this score to select informative genes. As we show, selecting relatively small subsets of genes can drastically improve the performance. Moreover, this selection process also isolates genes that are potentially intimately related to the tumor makeup and the pathomechanism.
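As a concrete illustration of such a per-gene error rate score, the sketch below scores a gene by the smallest number of training samples misclassified by any single threshold rule on its expression values. The function name and details are ours and serve only as a hedged sketch, not necessarily the exact score used in this work.

```python
def gene_error_score(values, labels):
    # Smallest number of misclassifications achievable by thresholding
    # this single gene's expression values (labels are +1 / -1).
    pairs = sorted(zip(values, labels))
    # candidate cuts: below all values, plus mid-way points between consecutive values
    cuts = [float('-inf')] + [(a[0] + b[0]) / 2 for a, b in zip(pairs, pairs[1:])]
    best = len(pairs)
    for cut in cuts:
        for sign in (1, -1):          # try both orientations of the rule
            err = sum(1 for v, y in pairs if (sign if v > cut else -sign) != y)
            best = min(best, err)
    return best
```

Genes with a low score separate the two classes well on their own and are natural candidates for a small informative gene set.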
To realistically assess the performance of such methods one needs to address the issue of sample contamination. Tumor and normal samples may dramatically differ in terms of their cell-type composition. For example, in the colon cancer data (Alon et al. 1999), the authors observed that the normal colon biopsies also included smooth muscle tissue from the colon walls. As a result, smooth muscle related genes showed high expression levels in the normal samples compared to the tumor samples. This artifact, if consistent, could contribute to success in classification. To eliminate this effect we remove the muscle specific genes and observe the effect on the success
rate of the process.

The rest of the paper is organized as follows. In Section 2, we describe the principal classification methods we use in this study. These include two state of the art methods from machine learning, and a novel approach based on the clustering algorithm of (Ben-Dor et al. 1999). In Section 3, we describe the three data sets and the LOOCV evaluation method, and evaluate the classification methods on the three data sets. In Section 4 we address the problem of gene selection. We propose a simple method for selecting informative genes and evaluate the effect of gene selection on the classification methods. In Section 5, we examine the effect of sample contamination on possible classification. We conclude in Section 6 with a discussion of related work and future directions.
2 Classification Methods
In this section, we describe the main classification methods that we will be using in this paper. We start by formally defining the classification problem. Assume that we are given a training set S, consisting of pairs (x_i, y_i), for i = 1, …, m. Each sample x_i is a vector in Rⁿ that describes the expression values of n genes/clones. The label y_i associated with x_i is either +1 or −1 (for simplicity, we will discuss two-label classification problems). A classification algorithm is a function f that depends on two arguments, the training set S and a query x ∈ Rⁿ, and returns a predicted label ŷ = f_S(x). We also allow for no classification to occur if x is either close to none of the classes or too borderline for a decision to be taken. Formally, this is realized by allowing ŷ to be +1, −1, or "?", the latter representing an unclassified query. Good classification procedures predict labels that typically match the "true" label of the query. For a precise definition of this notion in the absence of the unclassified option, assume that there is some (unknown) joint distribution D(x, y) of expression patterns and labels. The error of a classification function f_S(·) is defined as Pr_D[f_S(x) ≠ y]. Of course, since we do not have access to D, we cannot precisely evaluate this term and use estimators instead. When "unclassified" is accepted as a possible output, one needs to consider the costs/penalties of the various outcomes in analyzing the value of a classification method. For a comprehensive discussion of classification problems see (Bishop 1995, Duda & Hart 1973, Ripley 1996).
2.1 Nearest Neighbor Classification

Formally, let

  P(x, y) = Σ_i (x_i − x̄)(y_i − ȳ) / √( Σ_i (x_i − x̄)² · Σ_i (y_i − ȳ)² )

be the Pearson correlation between two vectors of expression levels. Given a new vector x, the
nearest neighbor classification procedure searches for the vector x_i in the training data that maximizes P(x, x_i), and returns y_i, the label of x_i.

This simple non-parametric classification method does not take any global properties of the training set into consideration. However, it is surprisingly effective in many types of classification problems. We use it in our analysis as a strawman, to which we compare the more sophisticated classification approaches.
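The nearest neighbor rule above is straightforward to implement. The following sketch (function names are ours) uses the Pearson correlation between expression profiles as the similarity measure, as in the text:

```python
import math

def pearson(x, y):
    # Pearson correlation between two expression vectors of equal length
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

def nearest_neighbor_label(query, samples, labels):
    # return the label of the training sample most correlated with the query
    best = max(range(len(samples)), key=lambda i: pearson(query, samples[i]))
    return labels[best]
```

Note that no training phase is needed; all work happens at query time, which is feasible here because the sample sets are small.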
2.2 Using Clustering for Classification
Recall that clustering algorithms, when applied to expression patterns, attempt to partition the set of elements into clusters of patterns, so that all the patterns within a cluster are similar to each other, and different from patterns in other clusters. This suggests that if the labeling of patterns is correlated with the patterns, then unsupervised clustering of the data (labels not taken into account) would cluster patterns with the same label together and separate patterns with different labels. Indeed, such a result is reported by Alon et al. (1999) in their analysis of colon cancer. Their study (which we describe in more detail in Section 3) involves gene expression patterns from colon samples that include both tumors and normal tissues. Applying a hierarchical clustering procedure to the data, Alon et al. observe that the topmost division in the dendrogram divides samples into two groups, one predominantly tumor, and the other predominantly normal. This suggests that for some types of classification problems, such as tumor vs. normal, clustering can distinguish between labels. Following this intuition, we build a clustering based classifier. We first describe the underlying clustering algorithm and then present the classifier.
2.2.1 The clustering algorithm
The CAST algorithm, implemented in the BioClust analysis software package (Ben-Dor et al. 1999), takes as input a threshold parameter t, which controls the granularity of the resulting cluster structure, and a similarity measure between the tissues.¹ We say that a tissue v has high similarity to a set of tissues C if the average similarity between v and the tissues in C is at least t. Otherwise, we say that v has low similarity to C. CAST constructs the clusters one at a time, and halts when all tissues are assigned to clusters. Intuitively, the algorithm alternates between adding high similarity tissues to C and removing low similarity tissues from it. Eventually, all the tissues in C have high similarity to C, while all the tissues outside of C have low similarity to C. At this stage the cluster C is closed, and a new cluster is started (see (Ben-Dor et al. 1999) for a complete description of the algorithm).
Clearly, the threshold value t has a great effect on the resulting cluster structure. As t increases, the clusters formed are smaller. In the extreme case, if t is high enough, each tissue forms a singleton cluster. Similarly, as t decreases, the clusters tend to get larger. If t is low enough, all tissues are assigned to the same cluster.
¹ In this work we use the Pearson correlation between gene expression profiles as the similarity measure. However, any similarity measure can be used.
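The add/remove loop described above can be sketched as follows. This is a simplified illustration (for example, the seed element is chosen arbitrarily here); the actual CAST implementation in BioClust differs in details:

```python
def cast(sim, t):
    # Simplified CAST sketch: sim is an n x n similarity matrix, t the threshold.
    n = len(sim)
    unassigned = set(range(n))
    clusters = []

    def affinity(i, cluster):
        # average similarity of tissue i to the cluster members
        return sum(sim[i][j] for j in cluster) / len(cluster)

    while unassigned:
        seed = min(unassigned)            # open a new cluster with some tissue
        unassigned.remove(seed)
        cluster = [seed]
        changed = True
        while changed:
            changed = False
            # ADD step: bring in the highest-affinity outside tissue if it has high similarity
            if unassigned:
                best = max(unassigned, key=lambda i: affinity(i, cluster))
                if affinity(best, cluster) >= t:
                    cluster.append(best)
                    unassigned.remove(best)
                    changed = True
            # REMOVE step: expel the lowest-affinity member if it has low similarity
            worst = min(cluster, key=lambda i: affinity(i, cluster))
            if len(cluster) > 1 and affinity(worst, cluster) < t:
                cluster.remove(worst)
                unassigned.add(worst)
                changed = True
        clusters.append(sorted(cluster))  # cluster is stable: close it
    return clusters
```

On a similarity matrix with two clearly separated groups and an intermediate threshold, the loop recovers the two groups; raising t toward 1 fragments them into singletons, as described above.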
2.2.2 Clustering based classification
As described above, the threshold parameter t determines the cohesiveness and the number of the resulting clusters. A similar situation occurs in other clustering algorithms. For example, in hierarchical clustering algorithms (e.g., (Alon et al. 1999, Eisen et al. 1998)), the cutoff "level" of the tree controls the number of clusters. In any clustering algorithm, it is clear that attempting to partition the data into exactly two clusters will not always be the optimal choice for predicting labels. For example, if the tumor class consists of several types of tumors, then the most noticeable division into two clusters might separate "extreme" tumors from the milder ones and the normal tissues, and only a further division will separate the normals from the milder tumors.
For the purpose of determining the right parameter to be used in clustering data that contains some labeled samples, we propose a measure of cluster structure compatibility with a given label assignment. The intuition is simple: on the one hand, we want clusters to be uniformly labeled and therefore penalize pairs of samples that are within the same cluster but have different labels; on the other hand, we do not want to create unnecessary partitions and therefore penalize pairs of samples that have the same label but are not within the same cluster.
Formally, we define the compatibility score of a cluster structure with the training set as the sum of two terms. The first is the number of tissue pairs (u, v) such that u and v have the same label and are assigned to the same cluster. The second term is the number of pairs (u, v) that have different labels and are assigned to different clusters. This score is also called the matching coefficient in the literature (Everitt 1993). To handle label assignments defined only on a subset of the data, we restrict the comparison to count pairs of examples for which labels are assigned (the matching coefficient for a submatrix is computed).
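The matching coefficient can be computed directly from its definition. In this sketch (names ours), labels and cluster assignments are given as parallel lists:

```python
from itertools import combinations

def matching_coefficient(labels, cluster_of):
    # Count sample pairs that "agree": same label and same cluster,
    # or different label and different cluster.
    score = 0
    for i, j in combinations(range(len(labels)), 2):
        same_label = labels[i] == labels[j]
        same_cluster = cluster_of[i] == cluster_of[j]
        if same_label == same_cluster:
            score += 1
    return score
```

A perfectly label-consistent clustering of n samples scores the full n(n−1)/2 pairs; restricting the loop to labeled samples gives the submatrix variant mentioned above.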
Using this notion, we can optimize, using a binary search, the choice of clustering parameters to find the most compatible clustering. That is: we consider different threshold values t; use CAST to cluster the tissues; measure the compatibility c(t) of the resulting cluster structure with the given label assignment; and finally, choose the clustering that has maximal c(t). Thus, although the clustering algorithm is unsupervised, in the sense that it does not take the labels into account, we use a supervised procedure for choosing the clustering threshold. We also emphasize that this general idea can be applied to any parameter dependent clustering method, and is not restricted to our particular choice.
To classify a query sample we cluster the training data together with the query, maximizing compatibility with the labeling of the training data. We then examine the labels of all elements of the cluster the query belongs to and use a simple majority rule to determine the unknown label. The intuition is that the query's label should agree with the prevailing label in its cluster. Various majority rules, taking into account statistical confidence, can be used. When confidence is too low the query is labeled as unclassified. The stringency of this test determines the strictness of our classification rule. In the current experiment we use the most liberal rule, i.e., a query is unclassified only if there is an equal number of elements of each label in its cluster. The choice of majority rule depends on the cost of non-classification vs. the cost of misclassification.
2.3 Large Margin Classifiers

The literature of supervised learning discusses a large number of methods that learn decision surfaces. These methods can be described by two aspects. First, the class of surfaces from which one is selected. This question is often closely related to the representation of the learned surface. Examples include linear separation (which we discuss in more detail below), decision-tree representations, and two-layer artificial neural networks. Second, the learning rule that is being used. For example, one of the simplest learning rules attempts to minimize the number of errors on the training set.
Application of direct methods to our domain can suffer from a serious problem. In gene expression data we expect n, the number of measured genes, to be significantly larger than m, the number of samples. Thus, due to the large number of dimensions, there are many simple decision surfaces that can separate the positive examples from the negative ones. This means that counting the number of training set errors is not restrictive enough to distinguish good decision surfaces from bad ones (in terms of their performance on examples not in the training set).
In this paper, we use two methods that have received much recent attention in the machine learning literature. Both methods attempt to follow the intuition that classification of examples depends not only on the region they are in, but also on a notion of margin: how close they are to the decision surface. Classification of examples with small margins is not as confident as classification of examples with large margins. (Given slightly different training data, the estimated decision surface moves a bit, thus changing the classification of points which are close to it.) This reasoning suggests that we should select a decision surface that classifies all the training examples correctly with a large margin. Following the same argument, given the learned decision surface and an unlabeled sample x, we can set a threshold on the margin of x for classification. If x is closer to the surface than the allowed threshold, we mark x as unclassified. Again, the threshold will depend on the relative costs of the different outcomes.
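The margin-based reject rule described above is a thin wrapper around any classifier that outputs a signed, margin-like score. A minimal sketch (names and threshold value are ours):

```python
def classify_with_reject(score, threshold):
    # score: a signed margin-like value from a learned classifier, e.g. the
    # signed distance from an SVM hyperplane or the weighted vote of an ensemble.
    if abs(score) < threshold:
        return None          # unclassified: too close to the decision surface
    return 1 if score > 0 else -1
```

Setting threshold to 0 recovers the plain classifier; raising it trades coverage for confidence, matching the cost analysis mentioned in the text.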
The basic intuition of large margin classifiers is developed in quite different manners in the following two approaches.
2.3.1 Support Vector Machines
Support vector machines (SVM) were developed in (Cortes & Vapnik 1995, Vapnik 1999). A tutorial on SVMs can be found in (Burges 1998). The intuition for support vector machines is best understood in the example of linear decision rules. A linear decision rule can be represented by a hyperplane in Rⁿ such that all examples on one side of the hyperplane are labeled positive and all the examples on the other side are labeled negative. Of course, in sufficiently high-dimensional data we can find many linear decision rules that separate the examples. Thus, we want to find a hyperplane that is as far away as possible from all the examples. More precisely, we want to find a hyperplane that separates the positive examples from the negative ones, and also maximizes the minimum distance of the closest points to the hyperplane.
We now make this intuition more concrete. A linear decision rule can be represented by a hyperplane in Rⁿ such that all examples on one side of the hyperplane are labeled positive and all the examples on the other side are labeled negative. Such a rule can be represented by a vector w ∈ Rⁿ and a scalar b that together specify the hyperplane w · x − b = 0. Classification
for a new example x is performed by computing sign(w · x − b). Recall that |x · w − b| / ‖w‖ is the distance from x to the hyperplane w · x − b = 0. Thus, if all points in the training data satisfy

  y_i (x_i · w − b) ≥ 1    (1)

then all points are correctly classified, and all of them have a distance of at least 1/‖w‖ from the hyperplane. We can find the hyperplane that maximizes the margin by solving the following quadratic program:

  Minimize ‖w‖²
  Subject to y_i (x_i · w − b) ≥ 1 for i = 1, …, m.
Such quadratic programs can be solved in the dual form. The dual form is posed in terms of auxiliary variables α_i. The solution has the property that

  w = Σ_i α_i y_i x_i

and thus, we can classify a new example x by evaluating

  sign( Σ_i α_i y_i ⟨x_i, x⟩ − b )    (2)
In practice, there is a range of optimization methods that can be used for solving the dual optimization problem. See Burges (1998) for more details.
The SVM dual optimization problem and its solution have several attractive properties. Most importantly, only a subset of the training examples determines the position of the hyperplane. Intuitively, these are exactly those samples that are at distance 1/‖w‖ from the hyperplane. It turns out that the dual problem solution assigns α_i = 0 to all examples that are not "supporting" the hyperplane. Thus, we only need to store the support vectors, the x_i for which α_i > 0. (Hence the name of the technique.)
It is clear that linear hyperplanes are a restricted form of decision surfaces. One method of learning more expressive separating surfaces is to project the training examples (and later on queries) into a higher-dimensional space, and learn a linear separator in that space. For example, if our training examples are in R², we can project each input value x to a vector whose coordinates are products of the coordinates of x.
In general, we can fix an arbitrary projection φ : Rⁿ → R^N to a higher dimensional space, and get more expressive decision surfaces. However, this seems to require us to solve a harder SVM optimization problem, since we now deal with a higher dimensional space.
Somewhat surprisingly, it turns out that the dual form of the quadratic optimization problem involves only inner products of vectors (i.e., expressions of the form ⟨φ(x_i), φ(x_j)⟩). In other words, the vectors do not appear outside the scope of an inner product operation. Similarly, the classification rule (2) only examines vectors inside the inner product operation. Thus, if we want to consider any projection φ : Rⁿ → R^N, we can find an optimal separating hyperplane in the projected space by solving the quadratic problem with the inner products ⟨φ(x_i), φ(x_j)⟩. Thus, if we can compute these inner products, the cost of the quadratic optimization problem does not increase with the number of dimensions in the projected space.
Moreover, for many projections there are kernel functions that compute the result of the inner product. A kernel function K for a projection φ satisfies K(x, y) = ⟨φ(x), φ(y)⟩. Given a legal kernel function, we can use it without knowing or explicitly computing the actual mapping φ. For many projections, the kernel function can be computed in time that is linear in n, regardless of the dimension N. For example, kernel functions of the form K(x, y) = (⟨x, y⟩ + 1)^d compute dot products for a projection from Rⁿ to a space in which each coordinate is a polynomial of degree d in the original coordinates. Note that in this example, a linear decision surface in the projected space corresponds to a polynomial manifold of degree d in the original space.
To summarize, if we want to learn expressive decision surfaces, we can choose a kernel function, and use it instead of the inner product in the execution of the SVM optimization. This is equivalent to learning a linear hyperplane in the projected space.
In this work we consider two kernel functions:

• The linear kernel K₁(x, y) = ⟨x, y⟩.
• The quadratic kernel K₂(x, y) = (⟨x, y⟩ + 1)².

The rationale for using these simple kernels is that since our input space is high dimensional, we can hope to find a simple separation rule in that space. We therefore test the linear separator, and the next order separator as a comparison, to check if higher order kernels can yield better results.
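The claim that the quadratic kernel computes an inner product in a projected space can be checked directly. For 2-dimensional inputs, one projection realizing K₂ is φ(x) = (1, √2·x₁, √2·x₂, x₁², √2·x₁x₂, x₂²); this feature map is a standard construction, not taken from the text. The sketch below verifies K₂(x, y) = ⟨φ(x), φ(y)⟩ numerically:

```python
import math

def k_linear(x, y):
    # K1(x, y) = <x, y>
    return sum(a * b for a, b in zip(x, y))

def k_quad(x, y):
    # K2(x, y) = (<x, y> + 1)^2
    return (k_linear(x, y) + 1) ** 2

def phi(x):
    # explicit quadratic feature map for 2-D inputs whose inner product equals K2
    x1, x2 = x
    r2 = math.sqrt(2)
    return (1.0, r2 * x1, r2 * x2, x1 * x1, r2 * x1 * x2, x2 * x2)
```

Evaluating K₂ costs one dot product in R², while the explicit map lives in R⁶; for degree d and n genes the gap grows to a space of dimension O(n^d), which is why the kernel trick matters here.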
Note that the quadratic kernel is strictly more expressive than the linear one: any decision surface that can be represented with K₁(·, ·) can also be represented with K₂(·, ·). Nonetheless, it is not obvious that the more expressive representation will always perform better. Given a larger set of decision surfaces to choose from, this procedure is more susceptible to overfitting, i.e., learning a decision surface that performs well on the training data but performs badly on test data.
2.3.2 Boosting
Boosting was initially developed as a method for constructing good classifiers by repeated calls to "weak" learning procedures (Freund & Schapire 1997, Schapire 1990). The assumption is that we have access to a "weak learner" that, given a training set S, constructs a function h_S(x). The learner is weak in the sense that its training set error is only slightly better than that of a random guess: formally, we assume that h_S(x) classifies at least a fraction 1/2 + γ of the input space correctly, for some γ > 0.
Boosting first invokes the weak learner on the original training set to get a classifier h₁(x). Then, we can find the examples in S that are classified incorrectly by h₁. We want to force the learning algorithm to give these examples special attention. This is done by constructing a new training data set in which these examples are given more weight. Boosting then invokes the weak learner on the reweighted training set and obtains a new classifier. Examples are then reweighted again, and the process is iterated. Thus, boosting adaptively reweights training examples to focus on the "hard" ones.² In this paper, we use the AdaBoost algorithm of Freund and Schapire (1997). This algorithm is described in Figure 1.
In practice boosting is an efficient learning procedure that usually makes a small number of errors on test sets. The theoretical understanding of this phenomenon uses a notion of margin that is quite similar to the one defined for SVMs. Recall that the boosting classification is made by combining the weighted "votes" of many classifiers. Define the margin of example x_i to be

  margin_i = y_i Σ_t α_t h_t(x_i) / Σ_t α_t

By definition, we have that if margin_i > 0, then H(x_i) = y_i, and thus x_i is classified correctly. However, if margin_i is close to 0, then this classification is "barely" made. On the other hand, if margin_i is close to 1, then a large majority of the classifiers make the right prediction on x_i. The analysis of Schapire et al. (1998) and Mason et al. (1999) shows that the generalization error of boosting (and other voting schemes) depends on the distribution of margins of training examples. Schapire et al. also show that repeated iterations of AdaBoost continually increase the smallest margin of training examples. This is contrasted with other voting schemes that do not necessarily increase the margins of the training set examples.
3 Evaluation
In the previous section we discussed several approaches for classification. In this section we examine their performance on experimental data.

² More precisely, boosting distorts the distribution of the input samples. For some weak learners, like the stump classifier, this can be simulated by simply reweighting the samples.
³ Note that for each gene, we need to consider only m rules, since the gene takes at most m different values in the training data. Thus, we can limit our attention to mid-way points between consecutive values attained by the g'th gene in the training data.
Input:
• A data set of m labeled examples {(x₁, y₁), …, (x_m, y_m)}.
• A weak learning algorithm L.

Initialize the distribution over the data set: D₁(i) = 1/m for i = 1, …, m.

Iterate: for t = 1, 2, …, T:
• Call L with distribution D_t; get back a hypothesis h_t.
• Calculate the error of h_t: ε_t = Σ_{i=1..m} D_t(i) · 1[y_i ≠ h_t(x_i)].
• Set α_t = ½ ln((1 − ε_t) / ε_t).
• Set the new distribution to be D_{t+1}(i) ∝ D_t(i) · exp(−α_t y_i h_t(x_i)), normalized so that D_{t+1} sums to 1.

Output: The final hypothesis H(x) = sign( Σ_{t=1..T} α_t h_t(x) ).

Figure 1: The generic AdaBoost algorithm. In our setting the weak learner L is a procedure that searches for a decision stump that has the smallest (weighted) error on the training data (with weights defined by D_t).³ The final output of AdaBoost in this case is a weighted combination of decision stumps.
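Figure 1 translates directly into code. The sketch below (names are ours) uses a brute-force decision stump search as the weak learner, matching the setting described in the caption; it is a simplified illustration, not the authors' implementation:

```python
import math

def stump_learn(xs, ys, w):
    # Weak learner: the single-gene threshold rule ("decision stump")
    # with the smallest weighted error on the training data.
    best = None
    for g in range(len(xs[0])):
        vals = sorted(set(x[g] for x in xs))
        # mid-way points between consecutive values attained by gene g
        cuts = [(a + b) / 2 for a, b in zip(vals, vals[1:])] or [vals[0]]
        for th in cuts:
            for sign in (1, -1):
                err = sum(wi for x, y, wi in zip(xs, ys, w)
                          if (sign if x[g] > th else -sign) != y)
                if best is None or err < best[0]:
                    best = (err, g, th, sign)
    return best

def adaboost(xs, ys, rounds=10):
    # AdaBoost as in Figure 1, with decision stumps as the weak hypotheses.
    m = len(xs)
    w = [1.0 / m] * m                      # D_1(i) = 1/m
    hyps = []
    for _ in range(rounds):
        err, g, th, sign = stump_learn(xs, ys, w)
        err = max(err, 1e-10)              # guard against log(0)
        if err >= 0.5:
            break                          # hypothesis is no better than random
        alpha = 0.5 * math.log((1 - err) / err)
        hyps.append((alpha, g, th, sign))
        # D_{t+1}(i) ~ D_t(i) * exp(-alpha * y_i * h_t(x_i)), then normalize
        w = [wi * math.exp(-alpha * y * (sign if x[g] > th else -sign))
             for x, y, wi in zip(xs, ys, w)]
        z = sum(w)
        w = [wi / z for wi in w]
    return hyps

def predict(hyps, x):
    # H(x) = sign of the alpha-weighted vote of the stumps
    vote = sum(a * (s if x[g] > th else -s) for a, g, th, s in hyps)
    return 1 if vote >= 0 else -1
```

The stump search is O(n·m) candidate rules per round, which is feasible even for thousands of genes because only mid-way points between observed values need to be tried.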
3.1 Data Sets
Descriptions of the three data sets studied follow. The first two data sets involve comparing tumor and normal samples of the same tissue, while the third data set involves samples from two variants of the same disease.
Colon cancer data set. This data set is a collection of expression measurements from colon biopsy samples reported by Alon et al. (1999). The data set consists of 62 samples of colon epithelial cells. These samples were collected from colon-cancer patients. The "tumor" biopsies were collected from tumors, and the "normal" biopsies were collected from healthy parts of the colons of the same patients. The final assignments of the status of biopsy samples were made by pathological examination.
Ovarian cancer data set. This data set is a collection of expression measurements from 32 samples: 15 ovary biopsies of ovarian carcinomas, 13 biopsies of normal ovaries, and 4 samples of other tissues. Thus, the data set consists of 28 samples labeled as tumor or normal. Gene expression levels in these 32 samples were measured using a membrane-based array with radioactive probes. The array consisted of cDNAs representing approximately 100,000 clones from ovarian clone libraries. For some of the samples, there are two or three repeated hybridizations for error assessment. In these cases, we collapsed the repeated experiments into one experiment, represented by the average level of expression.
Leukemia data set. This data set is a collection of expression measurements reported by Golub et al. (1999). The data set contains 72 samples. These samples are divided into two variants of leukemia: 25 samples of acute myeloid leukemia (AML) and 47 samples of acute lymphoblastic leukemia (ALL). The gene expression measurements were taken from 63 bone marrow samples and 9 peripheral blood samples. Gene expression levels in these 72 samples were measured using high density oligonucleotide microarrays. The expression levels of 7129 genes are reported. The data, 72 samples over 7129 genes, is available at http://www.genome.wi.mit.edu/MPR.
3.2 Estimating Prediction Errors
When evaluating the prediction accuracy of the classification methods we described above, it is important not to use the training error. Most classification methods will perform well on examples they have seen during training. To get a realistic estimate of the performance of a classifier, we must test it on examples that did not appear in the training set. Unfortunately, since we have a small number of examples, we cannot remove a sizeable portion of the training set and use it for testing.
A common method to test accuracy in such situations is cross-validation. To apply this method, we partition the data into k sets of samples, C₁, …, C_k (typically, these will be of roughly the same size). Then, we construct a data set S_i = S − C_i, and test the accuracy of f_{S_i}(·) on the samples in C_i. Having done this for all 1 ≤ i ≤ k, we estimate the accuracy of the method by averaging the accuracies over the k tests. A natural choice is k = m. In this case, every trial removes a single sample and trains on the rest. This method is known as leave one out cross validation (LOOCV). Other common choices are smaller values of k, such as 5 or 10.
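The LOOCV estimator is a short loop. In this sketch (names ours), classify is any function that trains on the given samples and labels a query; a toy 1-nearest-neighbor rule on scalar inputs is used purely for illustration:

```python
def loocv_accuracy(classify, xs, ys):
    # Leave-one-out cross-validation: hold out each sample in turn,
    # train the classifier on the rest, and test on the held-out sample.
    correct = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        if classify(train_x, train_y, xs[i]) == ys[i]:
            correct += 1
    return correct / len(xs)

def one_nn(train_x, train_y, query):
    # toy 1-nearest-neighbor classifier on scalar inputs, for illustration only
    j = min(range(len(train_x)), key=lambda k: abs(train_x[k] - query))
    return train_y[j]
```

Crucially, any supervised step, including gene selection, must be repeated inside each held-out trial; otherwise information about the test sample leaks into training and the estimate is optimistic.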
Table 1 lists the accuracy estimates for the different methods applied to the three data sets. As we can see, the clustering approach performs significantly better than the other approaches on the colon cancer data set, but not so on the ovarian data set. AdaBoost performs better than the other methods on the leukemia and ovarian data sets. We can also see that the quadratic SVM does not perform as well as the linear SVM, probably because it overfits the training data. The same phenomenon occurs in AdaBoost, where the classifiers are slightly more accurate after 100 iterations than after 10,000 iterations.
3.3 ROC Curves
Estimates of classification accuracy give only partial insight into the performance of a method. In our evaluation, we treated all errors as having equal penalty. In many applications, however, errors have asymmetric weights. For a general discussion of risk and loss considerations in classification see, e.g., (Ripley 1996). To set terminology for our particular case, we distinguish false positive errors (normal tissues classified as tumor) and false negative errors (tumor tissues classified as normal). In diagnostic applications, false negative errors can be detrimental, while false positives may be tolerated (since additional tests will be performed on the patient).
To deal with asymmetric weights for errors, we introduce a confidence parameter, γ. In clustering approaches, the modified procedure labels a query sample as tumor if the cluster containing it has at least a fraction γ of tumors. In a similar manner, we can introduce confidence parameters for the SVM and boosting approaches by changing the threshold margin needed for positive classification.
ROC curves are used to evaluate the "power" of a classification method for different asymmetric weights (see, for example, (Swets 1988)). An ROC curve plots the tradeoff between the two types of errors as the confidence parameter varies. Each point on the two-dimensional curve corresponds to a particular value of the confidence parameter. The (x, y) coordinates of a point represent the fractions of negative and positive samples, respectively, that are classified as positive with this particular confidence parameter. The extreme ends of the curve are the most strict and most permissive confidence values: with the strictest confidence value nothing is classified as positive, putting (0, 0) on the curve; with the most permissive confidence value everything is classified as positive, putting (1, 1) on the curve. The path between these two extremes shows how flexible the procedure is with respect to trading off error rates. The best-case scenario is that the path goes through the point (0, 1). This implies that for some confidence parameter, all positives are classified as positive, and all negatives are classified as negative. That is, the procedure can be made very strict with respect
Table 1: Summary of classification performance of the different methods on the three data sets. The table shows the percent of samples that were correctly classified, incorrectly classified, and unclassified by each method in the LOOCV evaluation. Unclassified labels for the margin-based classifiers (SVM and AdaBoost) were decided by a fixed threshold on the classification margin.
Figure 2: ROC curves for methods applied to the colon cancer data set. The x-axis shows the percentage of negative examples classified as positive, and the y-axis shows the percentage of positive examples classified as positive. Each point along the curve corresponds to the percentages achieved at a particular confidence threshold value by the corresponding classification method. Error estimates are based on LOOCV trials.
to false positive error, with no false negative price to pay. ROC curves with large areas underneath mean that high false positive stringencies can be obtained without much of a false negative price.
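Tracing such a curve amounts to sweeping a confidence threshold over a classifier's real-valued scores and recording the two error rates at each setting. A minimal sketch follows; the function name is hypothetical, and ties between scores are not handled specially.

```python
def roc_points(scores, labels, pos='+'):
    """Sweep a confidence threshold over real-valued classifier scores and
    return (false-positive rate, true-positive rate) pairs, from the
    strictest threshold, point (0, 0), to the most permissive, point (1, 1)."""
    P = sum(1 for l in labels if l == pos)   # number of positive samples
    N = len(labels) - P                      # number of negative samples
    # Process samples from the highest score downward; each step lowers the
    # threshold just enough to classify one more sample as positive.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    points, tp, fp = [(0.0, 0.0)], 0, 0
    for i in order:
        if labels[i] == pos:
            tp += 1
        else:
            fp += 1
        points.append((fp / N, tp / P))
    return points
```

A curve passing through (0, 1) means some threshold separates the classes perfectly, matching the best-case scenario described above.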
In Figure 2 we plot the ROC curves for clustering, SVM, and boosting on the colon cancer data set. As we can see, there is no clear domination among the methods. (The only exception is SVM with quadratic kernel, which is consistently worse than the other methods.) The clustering procedure is dominant in the region where misclassification errors of both types are roughly of the same importance. However, SVM with linear kernel and boosting are preferable in regions of highly asymmetric error cost (both ends of the spectrum). This may be due to the fact that the matching coefficient score (see Section 2.2), which determines the cluster granularity, treats both types of errors as having equal costs.
4 Gene Selection
It is clear that the expression levels of many of the genes that are measured in our data sets are irrelevant to the distinction between tumor and normal tissues. Taking such genes into account during classification increases the dimensionality of the classification problem, presents computational difficulties, and introduces unnecessary noise into the process. Another issue with a large number of genes is the interpretability of the results. If the "signal" that allows our methods to distinguish tumor from normal tissues is encoded in the expression levels of a few genes, then we might be able to understand the biological significance of these genes. Moreover, a major goal for diagnostic research is to develop diagnostic procedures based on inexpensive microarrays that have enough probes to detect diseases. Thus, it is crucial to recognize whether a small number of
genes can suffice for good classification.
The problem of feature selection has received a thorough treatment in pattern recognition and machine learning. The gene expression data sets are problematic in that they contain a large number of genes (features), and thus methods that search over subsets of features can be prohibitively expensive. Moreover, these data sets contain only a small number of samples, so the detection of irrelevant genes can suffer from statistical instabilities.
4.1 The TNoM Score
To address these issues, we utilize measures of "relevance" of each gene. In particular, we focus on a quantity we call the threshold number of misclassifications, or TNoM score, of a gene. The intuition is that an informative gene has quite different values in the two classes (normal and tumor), and thus we should be able to separate these by a threshold value. Formally, we seek the best decision stump for that gene (as defined in Section 2.3.2), and then count the classification errors this decision stump makes on the training examples.
Recall that we describe a decision stump rule by two parameters, a and t. The predicted class is simply sign(a(x − t)). The number of errors made by a decision stump is defined as

Err(a, t | g, l) = |{ i : l_i ≠ sign(a(x_i[g] − t)) }|,

where x_i[g] is the expression value of gene g in the i'th sample and l_i is the label of the i'th sample. The TNoM score of a gene is simply defined as

TNoM(g, l) = min_{a,t} Err(a, t | g, l),
the number of errors made by the best decision stump. The intuition is that this number reflects the quality of decisions made based on the expression levels of this gene.
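The minimization over a and t can be done directly by sorting the samples by the gene's expression value and trying every threshold position in both orientations. The sketch below uses hypothetical names and, for simplicity, ignores ties between expression values.

```python
def tnom(values, labels):
    """TNoM score: the minimum number of training errors achievable by a
    decision stump sign(a*(x - t)) on a single gene's expression values.
    `labels` uses '-' and '+'; ties in `values` are ignored for simplicity."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    best = n
    # A threshold splits the sorted samples into a prefix (below t) and a
    # suffix (above t); try both orientations of a for every split point.
    for split in range(n + 1):
        left = [labels[order[i]] for i in range(split)]
        right = [labels[order[i]] for i in range(split, n)]
        # Orientation a > 0: predict '-' below the threshold, '+' above it.
        err1 = left.count('+') + right.count('-')
        # Orientation a < 0: the symmetric rule.
        err2 = left.count('-') + right.count('+')
        best = min(best, err1, err2)
    return best
```

A perfectly separated gene scores 0, and the score can never exceed the size of the smaller class (the empty split already achieves that).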
Note that the expressiveness of decision stump rules is severely limited. Thus, more expressive queries about the gene's expression level (e.g., whether it is within a particular range) might be more useful for predicting the label of the tissue. However, if we are interested in genes with significant over-expression or under-expression in one of the labels, then such a threshold should separate the two groups.
4.2 Evaluating the Significance of a Gene
An immediate question to ask is whether genes with low TNoM scores are indeed indicative of the classification of expression. In other words, we want to test the statistical significance of the scores of the best-scoring genes in our data set. One way to evaluate the significance of such results is to test them against random data. More explicitly: we want to estimate the probability of a gene scoring better than some fixed level s in randomly labeled data. This number is the p-value corresponding to the given level s. Genes with very low p-values are very rare in random data, and their relevance to the studied phenomenon is therefore likely to have biological, mechanistic, or protocol reasons. Genes with low p-values for which the latter two options can be ruled out are interesting subjects for further investigation and are expected to give deeper insight into the studied phenomena.
Let ªJ"#��$�¬~Ìóò ð ? Í denoteall vectorswith Ðõô " ô entriesand � ô $ ô entries(the normal/cancerse-manticis onepossibleinterpretation).Let Z bea vectorof labels. Also let ¢ bea vectorof geneexpressionvalues.TheTNoM scoreis a functionthattakes ¢ and Z andreturnsthescoreof ¢ withrespectto labeling Z .
We want to compute the p-value of a score on a particular gene. We assume that the vector of gene expression values g is fixed, and consider random label assignments. Let u_{n,m} be a random vector drawn uniformly over {−,+}^{(n,m)}. The p-value of a score s is then

p-Val(s | g, n, m) = Prob(TNoM(g, u_{n,m}) ≤ s).   (3)
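Before developing the exact computation, note that the probability in Equation (3) can be approximated by straightforward Monte Carlo sampling over random labelings. The sketch below uses hypothetical names; `stump_errors` recomputes the TNoM score of a gene under a given labeling.

```python
import random

def stump_errors(values, labels):
    # Minimum decision-stump errors (the TNoM score) for one gene.
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    lab = [labels[i] for i in order]
    best = n
    for split in range(n + 1):
        plus_left = lab[:split].count('+')
        minus_right = lab[split:].count('-')
        e1 = plus_left + minus_right                     # a > 0 orientation
        e2 = (split - plus_left) + (n - split - minus_right)  # a < 0
        best = min(best, e1, e2)
    return best

def tnom_pvalue(values, n_minus, n_plus, s, trials=2000, seed=0):
    """Monte Carlo estimate of Prob(TNoM(g, u) <= s) over uniformly random
    labelings with n_minus '-' entries and n_plus '+' entries (cf. Eq. 3)."""
    rng = random.Random(seed)
    base = ['-'] * n_minus + ['+'] * n_plus
    hits = 0
    for _ in range(trials):
        rng.shuffle(base)  # a uniform random labeling of the fixed gene
        if stump_errors(values, base) <= s:
            hits += 1
    return hits / trials
```

Such sampling only estimates the tail probability; the dynamic program developed next computes it exactly, which matters for the extremely small p-values of the best-scoring genes.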
Definition 4.1 Let v ∈ {−,+}^{(n,m)}. Define:

s^{−+}(v) = min_{0 ≤ i ≤ n+m} [ (# of '+'s in v_1, …, v_i) + (# of '−'s in v_{i+1}, …, v_{n+m}) ],

and symmetrically:

s^{+−}(v) = min_{0 ≤ i ≤ n+m} [ (# of '−'s in v_1, …, v_i) + (# of '+'s in v_{i+1}, …, v_{n+m}) ].

Intuitively, s^{−+}(v) is the score of the labeling v when we only examine decision stumps in which values above the threshold are labeled '+' and values below the threshold are labeled '−'. (That is, the a coefficient is positive.) Similarly, s^{+−}(v) is the score of the labeling when we impose the symmetric constraint.
Suppose we are now interested in computing all TNoM score distributions for the spaces {−,+}^{(n,m)}, where n ranges from 0 to the total number of '−' labels and m from 0 to the total number of '+' labels. We define an array B as follows:

B(n, m, s, t) = |{ v ∈ {−,+}^{(n,m)} : s^{−+}(v) = s and s^{+−}(v) = t }|.

Contributions to B(n, m, s, t) come from vectors in {−,+}^{(n−1,m)} by concatenating a '−' and from vectors in {−,+}^{(n,m−1)} by concatenating a '+'. Proposition 4.2 indicates the size of each such contribution in the various cases, and we obtain the following recursion formula:
Initial conditions for this recursive calculation are trivially set. To obtain the explicit distribution, and thus the p-values, we use the following formula:

Prob(TNoM(u_{n,m}) = s) = [ B(n, m, s, s) + 2 Σ_{t>s} B(n, m, s, t) ] / C(n+m, m).

4.3 Informative Genes in Cancer Data
Consider a set of actual labeled gene expression data, such as the ones we described above. It is beneficial to give some quantitative score to the abundance of highly informative genes with respect to the given labeling. Figure 3 depicts a comparison between the expected number of genes scoring better than a given threshold and the actual number found in the data. As we can see, the number of highly informative genes is well above the expected number according to the null hypothesis.
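The null distribution underlying such expected counts can, for small n and m, also be obtained by brute-force enumeration of all labelings. The sketch below is a small-case sanity check, not the paper's dynamic program, and its names are hypothetical.

```python
from itertools import combinations
from math import comb

def exact_tnom_distribution(n_minus, n_plus):
    """Exact distribution of the TNoM score over all labelings with
    n_minus '-' entries and n_plus '+' entries, by enumerating every
    labeling (feasible only when n_minus + n_plus is small)."""
    n = n_minus + n_plus
    counts = {}
    for plus_pos in combinations(range(n), n_plus):
        v = ['-'] * n
        for i in plus_pos:
            v[i] = '+'
        # TNoM of this labeling: best split point, either orientation.
        best = n
        for split in range(n + 1):
            plus_left = v[:split].count('+')
            minus_right = v[split:].count('-')
            e1 = plus_left + minus_right
            e2 = (split - plus_left) + (n - split - minus_right)
            best = min(best, e1, e2)
        counts[best] = counts.get(best, 0) + 1
    total = comb(n, n_plus)  # number of labelings, C(n+m, m)
    return {s: c / total for s, c in counts.items()}
```

Multiplying the resulting tail probabilities by the number of genes on the array gives the expected counts plotted under the null hypothesis.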
Figure 3: Comparison of the number of significant genes in actual data sets to the expected number under the null hypothesis (random labels). The x-axis denotes p-value and the y-axis the number of genes. The expected number of genes with TNoM score better than a given s is Prob(TNoM(u_{n,m}) ≤ s) × (# of genes). Graphs (a) and (b) show results from the Colon data set. Graphs (c) and (d) show results from the Leukemia data set. The graphs on the left, (a) and (c), show the whole significance range, and the graphs on the right, (b) and (d), show the tail of the distribution (p-values are in log-scale).
different cancers, among them lung, breast, oropharyngeal, bladder, endometrial, ovarian, and colorectal carcinoma), ferritin H (ovarian cancer), collagen 1A1 (ovarian cancer, osteosarcoma, cervical carcinoma), and GAPDH (cancers of lung, cervix, and prostate). In addition, 2 clones with no homology to a known gene are found in this selection. Given the high number of cancer-related genes in the top 137, it is likely that these novel genes exhibit a similar cancer-related behavior. We conducted expression validation for GAPDH, SLPI, HE4, and keratin 18, which confirmed the elevated expression in some ovarian carcinomas compared to normal ovarian tissues.
It then selects all genes that have a smaller or equal error score on the training data. Alternatively,
a p-value approach can be taken: all genes with scores which are very rare in random data are selected.
To evaluate performance with gene selection, we have to be careful to jointly evaluate both stages of the process: gene selection and classification. Thus, in each cross-validation trial, gene selection is applied based on the training examples in that trial. Note that since the training examples are different in different cross-validation trials, we expect the number of selected genes to depend on the trial.
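The per-trial selection described above can be sketched as follows. All names are hypothetical; `score_gene` stands for any per-gene relevance score (such as the TNoM score), and `train` for any of the classifiers discussed. The point of the sketch is that selection is redone inside every trial, so the held-out sample never influences which genes are kept.

```python
def loocv_with_selection(expr, labels, score_gene, threshold, train):
    """LOOCV where gene selection (keep genes whose score on the TRAINING
    samples is <= threshold) is repeated inside every trial.
    `expr` is a list of samples, each a list of per-gene expression values."""
    n, n_genes = len(expr), len(expr[0])
    correct = 0
    for held in range(n):
        tr = [i for i in range(n) if i != held]
        # Score every gene using the training samples only.
        keep = [g for g in range(n_genes)
                if score_gene([expr[i][g] for i in tr],
                              [labels[i] for i in tr]) <= threshold]
        clf = train([[expr[i][g] for g in keep] for i in tr],
                    [labels[i] for i in tr])
        pred = clf.predict([expr[held][g] for g in keep])
        if pred == labels[held]:
            correct += 1
    return correct / n
```

Selecting genes on the full data first and cross-validating afterwards would leak the held-out label into the selection step and optimistically bias the accuracy estimate.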
Figure 4 describes the performance of some of the methods we discussed above when we vary the stringency of the selection process.
In the colon data set, gene selection leads to mixed results. Some methods, such as clustering, perform slightly worse with fewer genes, while others, such as SVM, perform better with a smaller set of genes. On the other hand, in the ovarian data set, gene selection leads to impressive improvement in all methods. All methods perform well in the region between threshold 3 (avg. 173 clones) and 6 (avg. 4375 clones). Note that both boosting and SVM perform well even with fewer clones. In the leukemia data set, gene selection slightly improved the performance of AdaBoost (which performed well with all the genes), and significantly improved the performance of other methods (between TNoM score thresholds of 11 to 13).
Figure 5 shows ROC curves for the clustering approach, boosting, and quadratic SVM with a threshold of 3 (linear SVM has a curve similar to quadratic SVM, and thus was not plotted). As we can see, although all methods have roughly the same accuracy with this subset of genes, their ROC profiles are strikingly different. These curves clearly show that the clustering approach makes false positive errors, while all the other approaches make false negative errors.
It is instructive to compare the TNoM scores of genes to other methods for selecting genes. In particular, we note that the AdaBoost procedure is effectively a gene selection method. In each iteration the AdaBoost procedure selects a stump classifier that examines the expression value of a single gene. Thus, we can evaluate the importance of a gene in the AdaBoost classification by the weight assigned to decision stumps that query the value of that gene. Figure 6 shows a comparison of gene AdaBoost weights to TNoM scores. As we can see, the highest-weight genes have low TNoM scores. In the leukemia data set we also see that there is a correlation between gene weight and the TNoM score. Note that in this data set, AdaBoost is very effective without additional gene selection. On the other hand, in the colon data set, AdaBoost's performance is improved when we remove genes with high TNoM scores. We note that in general the AdaBoost procedure does not use all of the genes with low TNoM scores. This suggests that there is a significant overlap in the information that these genes convey about the classification.
5 Sample Contamination
Cancer classification based on array-based gene expression profiling may be complicated by the fact that clinical samples, e.g. tumor vs. normal, will likely contain a mixture of different cell types. In addition, the genomic instability inherent in tumor samples may lead to a large degree of random fluctuations in gene expression patterns. Although both the biological and genetic variability in tumor samples have the potential to lead to confusing and difficult-to-interpret expression profiles, gene expression profiling does allow us to efficiently distinguish tumor and normal samples, as
Figure 4: Classification performance, as it depends on the threshold used for selecting genes. The x-axis shows the TNoM score threshold used and the base-10 logarithm of the associated p-value. The results are based on performing LOOCV for the whole process of selection and classification, as explained in the text. For each method, the solid bar represents the fraction of the training data that was misclassified. The thin line extensions represent the fraction which was unclassified.
Figure 5: ROC curves for three methods that are applied to the ovarian data set with the TNoM score threshold set to 3.

we have seen in the previous sections. However, the presence of different cell types within and between samples could lead to identification of genes that strongly affect cluster formation but which may have little to do with the process being studied, in this case tumorigenesis. For example, in the case of the colon cancer data set presented above, a large number of muscle-specific genes were identified as being characteristic of normal colon samples both in our clustering results and in the results of Alon et al. (1999). This is most likely due to a higher degree of smooth muscle contamination in the normal versus tumor samples.
This raises the concern that our classification may be biased by the presence of muscle-specific genes. To test this hypothesis, we attempted to construct data sets that avoid genes that are suspected of introducing bias. We listed the top 200 error-score-ranking genes in the colon cancer data set, and identified muscle-specific genes. These include (J02854) myosin regulatory light chain 2, smooth muscle isoform (human); (T60155) actin, aortic smooth muscle (human); and (X12369) tropomyosin alpha chain, smooth muscle (human), which are designated as smooth-muscle-specific by Alon et al.'s analysis, and (M63391) desmin (human), complete cds; (D31885) muscle-specific EST (human); and (X7429) alpha 7B integrin (human), which are suspected to be expressed in smooth muscle based on literature searches.
An additional form of "contamination" is due to the high metabolic rate of the tumors. This results in high expression values for ribosomal genes. Although such high expression levels can be indicative of tumors, such a finding does not necessarily provide novel biological insight into the process, nor does it provide a diagnostic tool, since ribosomal activity is present in virtually all tissues. Thus, we also identified ribosomal genes in the top 200 scoring genes.
Figure 7 shows the performance of the clustering approach on three data sets: the full 2000-gene data set, a data set without muscle-specific genes, and a data set without both muscle-specific
Figure 6: Comparison of the weight assigned to genes in the AdaBoost classification (without gene selection) and the TNoM score, for the Colon (left) and Leukemia (right) data sets. Each point in the scatter plot corresponds to a gene. The x-axis denotes the TNoM score of the gene, and the y-axis the weight associated with all the decision stumps that query the gene's expression value in the classifier learned by AdaBoost.
and ribosomal genes. As the learning curves show, the removal of genes affects the results only in cases using the smallest sets of genes. From an error score threshold of 10 (avg. 9.1 genes) and higher, there is no significant change in performance for the procedure. Thus, although muscle-specific genes can be highly indicative, the classification procedure performs well even without relying on these genes.
Although the muscle contamination did not necessarily alter the ability of this gene set to be used to classify tumor vs. normal samples in this case, it will continue to be important to account for possible effects of tissue contamination on clustering and classification results. Experimental designs that include gene expression profiles of tissue and/or cell culture samples representative of types of tissue contaminants known to be isolated along with different types of tumor samples (for example, see Perou et al. (1999)) can be utilized to help distinguish contaminant gene expression profiles from those actually associated with specific types of tumor cells.
6 Conclusions
In this paper we examined the question of tissue classification based on expression data. Our contribution is four-fold. First, we introduced a new cluster-based approach for classification. This approach builds on clustering algorithms that are suitable for gene expression data. Second, we performed a rigorous evaluation of this method, and of known methods from the machine learning literature. These include large-margin classification methods (SVM and AdaBoost) and the nearest-neighbor method. Third, we highlighted the issue of sample contamination and estimated the sensitivity of our approach to sample variability. Differences in tissue biopsies could theoretically affect the quality of any given classification method. Studying this issue, we observed no
Figure 7: Curves showing the predictive performance of clustering methods on the original Alon et al. data set, and data sets where muscle-specific and ribosomal genes were removed. All estimates are based on LOOCV evaluation. These results show that even without the obvious contaminations, our methods are successful in reliably predicting tissue type.
significant contaminating tissue bias in the colon cancer data set. Finally, we investigated the issue of gene selection in expression data. As our results for the ovarian data set show, a large number of clones can have a negative impact on predictive performance. We showed that a fairly simple selection procedure can lead to significant improvements in prediction accuracy. In addition, we derived an efficient dynamic programming method for computing exact p-values for a gene's TNoM score.
The work reported here is closely related to two recent papers. First, Golub et al. (1999) (see also (Slonim et al. 2000)) examined a scoring rule to select informative genes and performed LOOCV experiments to test a voting-based classification approach. Although their score for gene selection and their classification method are different than ours, their main conclusions are quite similar in that they get good classification accuracy with a relatively small number of genes. Our results on the same data set (leukemia) are comparable or better. These results emphasize the conclusion that the two leukemia phenotypes (ALL and AML) are well separated in expression data.
Second, Brown et al. (1999) use support vector machines in the context of gene expression data. In contrast to our approach, they attempt to classify the genes rather than the samples. Thus, they deal with the dual classification problem. The characteristics of their classification problem are quite different: many examples (i.e., thousands of genes), and few attributes (i.e., expression in different samples). We note that some of the approaches we used in this work (e.g., clustering-based classification) might be applicable to this dual classification problem as well.
As noted above, the gene selection process we explored in this paper is quite simplistic. In particular, it was based on scoring single genes for relevance. Thus, the process might select several genes that convey the same information, and might ignore genes that add independent information. We are currently studying more direct approaches to the selection of informative sets of genes. Identifying sets of genes that give rise to efficient learned classifiers might reveal previously unknown disease-related genes and guide further biological research.
References

Ben-Dor, A., Shamir, R. & Yakhini, Z. (1999), 'Clustering gene expression patterns', Journal of Computational Biology 6, 281–297.
Bishop, C. M. (1995), Neural Networks for Pattern Recognition, Oxford University Press, Oxford, U.K.
Brown, M., Grundy, W., Lin, D., Cristianini, N., Sugnet, C., Furey, T., Ares Jr., M. & Haussler, D. (1999), Knowledge-based analysis of microarray gene expression data using support vector machines, Technical Report UCSC-CRL-99-09, U.C. Santa Cruz.
Burges, C. J. C. (1998), 'A tutorial on Support Vector Machines for pattern recognition', Data Mining and Knowledge Discovery 2, 121–167.
Chu, S., DeRisi, J., Eisen, M., Mulholland, J., Botstein, D., Brown, P. & Herskowitz, I. (1998), 'The transcriptional program of sporulation in budding yeast', Science 282, 699–705.
Clarke, P. A., George, M., Cunningham, D., Swift, I. & Workman, P. (1999), Analysis of tumor gene expression following chemotherapeutic treatment of patients with bowel cancer, in 'Proc. Nature Genetics Microarray Meeting 99', Scottsdale, Arizona, p. 39.
Khan, J., Simon, R., Bittner, M., Chen, Y., Leighton, S. B., Pohida, T., Smith, P. D., Jiang, Y., Gooden, G. C., Trent, J. M. & Meltzer, P. S. (1998), 'Gene expression profiling of alveolar rhabdomyosarcoma with cDNA microarrays', Cancer Research.
Kohavi, R. (1995), A study of cross-validation and bootstrap for accuracy estimation and model selection, in 'Proc. Fourteenth International Joint Conference on Artificial Intelligence (IJCAI '95)', Morgan Kaufmann, San Francisco, Calif., pp. 1137–1143.
Perou, C. M., Jeffrey, S. S., van de Rijn, M., Rees, C. A., Eisen, M. B., Ross, D. T., Pergamenschikov, A., Williams, C. F., Zhu, S. X., Lee, J. C. F., Lashkari, D., Shalon, D., Brown, P. O. & Botstein, D. (1999), 'Distinctive gene expression patterns in human mammary epithelial cells and breast cancers', Proc. Nat. Acad. Sci. USA 96, 9212–9217.
Ripley, B. D. (1996), Pattern Recognition and Neural Networks, Cambridge University Press.
Schapire, R. E. (1990), 'The strength of weak learnability', Machine Learning 5, 197–227.
Schapire, R. E., Freund, Y., Bartlett, P. & Lee, W. S. (1998), 'Boosting the margin: A new explanation for the effectiveness of voting methods', Annals of Statistics 26, 1651–1686.
Schiedeck, T., Christoph, S., Duchrow, M. & Bruch, H. (1998), 'Detection of hl6-mRNA: new possibilities in serologic tumor diagnosis of colorectal carcinomas', Zentralbl Chir 123(2), 159–162.
Schummer, M., Ng, W., Bumgarner, R., Nelson, P., Schummer, B., Hassell, L., Baldwin, L. R., Karlan, B. & Hood, L. (1999), 'Comparative hybridization of an array of 21,500 ovarian cDNAs for the discovery of genes overexpressed in ovarian carcinomas', Gene 238, 375–385.
Slonim, D. K., Tamayo, P., Mesirov, J. P., Golub, T. R. & Lander, E. S. (2000), Class prediction and discovery using gene expression data, in 'Fourth Annual International Conference on Computational Molecular Biology', pp. 263–272.
Swets, J. (1988), 'Measuring the accuracy of diagnostic systems', Science 240, 1285–1293.