-
Tanagra
21/01/09 1/26
1 Introduction
Computing a Gain Chart. Comparing the computation time of data mining tools on a large dataset under Linux.
The gain chart is an alternative to the confusion matrix for the evaluation of a classifier. Its name varies from tool to tool (e.g. lift curve, lift chart, cumulative gain chart, etc.). The main idea is to build a scatter plot where the X coordinate is the percentage of the population, ranked by decreasing score, and the Y coordinate is the percentage of the examples with the positive value of the class attribute found in that part of the population. The gain chart is used mainly in the marketing domain, where we want to detect potential customers, but it can be used in other situations.
The construction of the gain chart is already outlined in a previous tutorial (see http://dataminingtutorials.blogspot.com/2008/11/liftcurvecoilchallenge2000.html). In this tutorial, we extend the description to other data mining tools (Knime, RapidMiner, Weka and Orange). The second original feature of this tutorial is that we run the experiment under Linux (French version of Ubuntu 8.10; see http://dataminingtutorials.blogspot.com/2009/01/tanagraunderlinux.html for the installation and the use of Tanagra under Linux). The third is that we handle a large dataset with 2,000,000 examples and 41 variables. It is interesting to study the behavior of these tools in this configuration, especially because our computer is not really powerful (Celeron, 2.53 GHz, 1 GB RAM).
We adopt the same approach for each tool. First, we define the sequence of operations and the settings on a sample of 2,000 examples. Then, we change the data source and process the whole dataset. We measure the computation time and the memory occupation. We note that some tools failed to complete the analysis on the complete dataset.
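In this tutorial the figures are read from the system monitor. For readers who want to repeat the measurement from a script, the idea can be sketched as follows. This is only an illustration under stated assumptions: the helper name measure is hypothetical, and tracemalloc only sees memory allocated by the Python interpreter, so it is not equivalent to the process-level figures reported later.

```python
import time
import tracemalloc


def measure(step, *args, **kwargs):
    """Run one processing step; return (result, elapsed seconds, peak bytes).

    `measure` is a hypothetical helper, not part of any tool in this tutorial.
    """
    tracemalloc.start()
    start = time.perf_counter()
    result = step(*args, **kwargs)
    elapsed = time.perf_counter() - start
    # get_traced_memory() returns (current, peak) in bytes.
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak


# Example: time a trivial step standing in for "load the dataset".
result, elapsed, peak = measure(sum, range(1000))
print(f"result={result}, time={elapsed:.6f} s, peak={peak} bytes")
```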
For the learning method, we use a linear discriminant analysis with a variable selection process in Tanagra. This approach is not available in the other tools, so there we use a Naive Bayes method, which is also a linear classifier.
2 Dataset
We use a modified version of the KDD CUP 99 dataset. The aim of the analysis is the detection of network intrusions (http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html). We want to detect a normal connection (binary problem: "normal network connection or not"). More precisely, we ask the following question: "if we assign a score to the examples and rank them according to this score, what proportion of the positive (normal) connections is detected if we select only 20 percent of all the connections?"
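The ranking question above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by any of the tools compared here. The function names gain_chart and gain_at, the toy scores, and the "attack" label for non-normal connections are assumptions made for the example.

```python
def gain_chart(scores, labels, positive="normal"):
    """Return the gain chart as (fraction of population, fraction of positives)."""
    # Rank the examples from the highest score to the lowest.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total = len(ranked)
    total_pos = sum(1 for _, label in ranked if label == positive)
    points = [(0.0, 0.0)]
    found = 0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == positive:
            found += 1
        points.append((rank / total, found / total_pos))
    return points


def gain_at(points, fraction):
    """Share of the positives captured in the top `fraction` of the ranking."""
    # The curve never decreases, so the largest y with x <= fraction is the answer.
    return max(y for x, y in points if x <= fraction + 1e-12)


# Toy data (made up for the illustration): 10 connections, 3 of them normal.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = ["normal", "normal", "attack", "normal", "attack",
          "attack", "attack", "attack", "attack", "attack"]
points = gain_chart(scores, labels)
print(gain_at(points, 0.20))  # share of normal connections in the top 20%
```

On the toy data, the top 20% of the ranking holds two of the three normal connections, so the function returns two thirds.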
We have two data files. The first contains 2,000,000 examples (full_gain_chart.arff). It is used to compare the ability of the tools to handle a large dataset. The second is a sample with 2,000 examples (sample_gain_chart.arff). It is used for specifying the whole sequence of operations. These files are bundled in one archive: http://eric.univlyon2.fr/~ricco/tanagra/fichiers/dataset_gain_chart.zip
3 Tanagra
3.1 Defining the diagram
Creating a diagram and importing the dataset. After launching Tanagra, we click on the FILE / NEW menu in order to create a diagram and import the dataset. In the dialog settings, we select the data file SAMPLE_GAIN_CHART.ARFF (Weka file format). We set the name of the diagram to GAIN_CHART.TDM.
41 variables and 2,000 cases are loaded.
Partitioning the dataset. We want to use 10 percent of the dataset for the learning phase, and the remainder for the test phase. 10 percent corresponds to 200 examples on our sample. It seems very limited. But we must remember that this first analysis is only a preparation phase. In the next step, we apply the analysis on the whole dataset. So the true learning set size is 10 percent of 2 million, i.e. 200,000 examples. It seems enough to create a reliable classifier.
In order to define the train and test sets, we use the SAMPLING component (INSTANCE SELECTION tab). We click on the PARAMETERS menu. We set PROPORTION SIZE to 10%.
When we click on the contextual VIEW menu, Tanagra indicates that 200 examples are now selected.
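The partition can be mimicked outside Tanagra with a simple random draw. The sketch below is an assumption about how such a split could be written, not a description of the SAMPLING component; the function name split_indices and the seed parameter are hypothetical.

```python
import random


def split_indices(n_examples, train_fraction=0.10, seed=0):
    """Draw a random learning set; the unselected examples form the test set."""
    rng = random.Random(seed)  # fixed seed so that the split is reproducible
    n_train = round(n_examples * train_fraction)
    train = set(rng.sample(range(n_examples), n_train))
    test = [i for i in range(n_examples) if i not in train]
    return sorted(train), test


train, test = split_indices(2000)
print(len(train), len(test))  # 200 learning examples, 1800 test examples
```

With n_examples = 2,000,000 the same call would give 200,000 learning examples and 1,800,000 test examples.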
Defining the variable types. We insert the DEFINE STATUS component (using the shortcut in the toolbar) in order to define the TARGET attribute (CLASSE) and the INPUT attributes (all the others).
Automatic feature selection. Some of the INPUT attributes are irrelevant or redundant. We insert the STEPDISC component (FEATURE SELECTION tab) in order to detect the relevant subset of predictive variables. Setting the size of the subset is very hard. Statistical cut-off values are not really useful here: because we apply the process on a large dataset, all the variables seem significant.
So the simplest way is the trial and error approach. We observe the decrease of the Wilks' lambda statistic when we add a new predictive variable. The good choice seems to be SUBSET SIZE = 6. When we add a new variable after this step, the decrease of the lambda is comparatively weak. This corresponds approximately to the elbow of the curve describing the relationship between the number of predictive variables and the Wilks' lambda. We set the parameters accordingly.
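The elbow reasoning can be written down as a simple stopping rule: stop adding variables as soon as the relative decrease of the lambda becomes small. The sketch below is one possible formalization, assumed for illustration only. The function name, the 5% threshold and the lambda values are made up; they are not the values produced by STEPDISC in this tutorial.

```python
def choose_subset_size(lambdas, min_relative_drop=0.05):
    """lambdas[k] is the Wilks' lambda obtained with k + 1 selected variables.

    Returns the number of variables to keep: the first size after which
    adding one more variable lowers the lambda by less than the threshold.
    """
    for k in range(1, len(lambdas)):
        drop = (lambdas[k - 1] - lambdas[k]) / lambdas[k - 1]
        if drop < min_relative_drop:
            return k
    return len(lambdas)


# Hypothetical sequence of lambda values (NOT taken from the tutorial).
toy_lambdas = [0.50, 0.30, 0.20, 0.15, 0.12, 0.10, 0.099, 0.0985]
print(choose_subset_size(toy_lambdas))
```

In practice the threshold is itself a judgment call, which is why the tutorial relies on a visual inspection of the curve.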
Learning phase. We add the LINEAR DISCRIMINANT ANALYSIS component (SPV LEARNING tab). We obtain the following results.
Computing the score column. We now compute, for each example, the score of being a positive one (the score value is not really a probability, but it allows us to rank the examples in the same way as the conditional probability P[CLASS = positive / INPUT attributes]). It is computed on the whole dataset, i.e. both the train and the test set.
We add the SCORING component (SCORING tab). We set the following parameters, i.e. the "positive" value of the class attribute is "normal". A new column SCORE_1 is added to the dataset.
Creating the gain chart. In order to create the GAIN CHART, we must define the TARGET attribute (CLASSE) and the SCORE attributes (INPUT). We use the DEFINE STATUS component from the toolbar. We note that we can set several SCORE columns; this option is useful when we want to compare various score columns (e.g. when we want to compare the efficiency of various learning algorithms which compute the score).
Then, we add the LIFT CURVE component (SCORING tab). The settings correspond to the computation of the curve on the "normal" value of the class attribute, and on the unselected examples, i.e. the test set.
We obtain the following chart.
In the HTML tab, we have the details of the results. We see that among the first 20 percent of the population, we have a true positive rate of 97.80%, i.e. 97.80% of the positive examples of the dataset.
3.2 Running the diagram on the complete dataset
This first step allows us to define the sequence of data analysis operations. Now, we want to apply the same framework on the complete dataset. We create the classifier on a learning set with 200,000 instances, and evaluate its performance, using a gain chart, on a test set with 1,800,000 examples.
With Tanagra, the simplest way is to save the diagram (FILE / SAVE) and close the application. Then, we open the diagram file GAIN_CHART.TDM in a text editor. We replace the data source reference: we set the name of the complete dataset, i.e. full_gain_chart.arff.
We launch Tanagra again. We open the modified diagram gain_chart.tdm (FILE / OPEN menu). The new dataset is automatically loaded. Tanagra is ready to execute each node of the diagram.
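The manual edit in the text editor can also be scripted. The sketch below makes no assumption about the internal layout of the .tdm file: it only replaces every occurrence of the old file name with the new one, byte for byte, and refuses to write anything if the old name is not found. The function name switch_data_source is hypothetical, and the exact spelling (case, path) of the file name inside the diagram has to be checked by hand first.

```python
from pathlib import Path


def switch_data_source(diagram_path, old_name, new_name):
    """Replace the data file name inside a saved diagram; return the hit count."""
    path = Path(diagram_path)
    data = path.read_bytes()  # bytes: encoding and line endings stay untouched
    old, new = old_name.encode("ascii"), new_name.encode("ascii")
    count = data.count(old)
    if count == 0:
        raise ValueError(f"{old_name!r} not found in {diagram_path}")
    path.write_bytes(data.replace(old, new))
    return count


# Intended use (run it on a copy of the diagram first):
# switch_data_source("gain_chart.tdm", "sample_gain_chart.arff", "full_gain_chart.arff")
```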
The processing time is 180 seconds (3 minutes).
We execute all the operations with the DIAGRAM / EXECUTE menu. We get the table associated with the gain chart. Among the first 20% of individuals (ranked according to the score value), we find 98.68% of the positives (the true positive rate is 98.68%).
Here is the gain chart.
We consider below the computation time. We see mainly that the memory occupation of Tanagra increased only slightly with the calculations. It is mainly the data loaded into memory which determine the memory occupation.
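A back-of-the-envelope calculation gives an order of magnitude for the raw data alone. The sketch below assumes that each of the 2,000,000 x 41 values is stored in a fixed number of bytes; the widths of 4 and 8 bytes are assumptions chosen for illustration, not statements about the internals of any tool, and the helper name raw_size_mb is hypothetical.

```python
def raw_size_mb(n_rows, n_columns, bytes_per_value):
    """Size of a dense table in megabytes (1 MB = 1,000,000 bytes)."""
    return n_rows * n_columns * bytes_per_value / 1_000_000


for width in (4, 8):  # assumed storage widths, in bytes per value
    print(width, "bytes per value:", raw_size_mb(2_000_000, 41, width), "MB")
```

Under these assumptions the table alone occupies a few hundred megabytes, before any working structure is allocated.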
4 Knime
4.1 Defining the workflow on the sample
The installation and the use of Knime (http://www.knime.org/, version 2.0.0) are very easy under Ubuntu. We create a new workflow (FILE / NEW). We ask for a Knime Project.
As we said above, the Linear Discriminant method is not available. So we use the Naive Bayes Classifier. It corresponds to a linear classifier. We expect to obtain similar results. Unfortunately, Knime does not offer a feature selection process suited to this approach. Thus, we evaluate the classifier computed on all the predictive variables.
We create the following workflow:
We describe below the settings of some components.
ARFF READER: The dialog box appears with the CONFIGURE menu. We set the file name, i.e. sample_gain_chart.arff. Another very important parameter is in the GENERAL NODE SETTINGS tab. If the dataset is too large (more than 100,000 examples according to the documentation), it can be swapped to disk. It is the only software of this tutorial which does not load the entire dataset in memory during the processing. From this point of view, it is similar to the commercial tools that can handle very large databases. It is certainly an advantage. We see the consequences of this technical choice when we have to handle the complete database.
For the PARTITIONING node, we select 10% of the examples.
For the NAIVE BAYES PREDICTOR node, we ask the component to produce the score column for the values of the class attribute. As with Tanagra, we need this column for ranking the examples during the construction of the gain chart.
Last, for the LIFT CHART, we specify the class attribute (CLASSE) and the positive value of the class attribute (NORMAL). The width of the interval (INTERVAL WIDTH) for the construction of the table is 10% of the examples.
When we click on the EXECUTE AND OPEN VIEW contextual menu, we obtain the following gain chart (CUMULATIVE GAIN CHART tab).
4.2 Calculations on the complete database
We are ready to work on the complete database. We open again the dialog settings of the ARFF READER node (CONFIGURE menu). We select the full_gain_chart.arff data file. We can execute the workflow. We obtain the following gain chart.
The results are very similar to those obtained on the sample (2,000 examples).
The computation time, however, is not really the same. It is very high compared to that of Tanagra. We study in detail below the computation time of each step.
About the memory occupation, the system monitor shows that Knime uses "only" 193 MB (Figure 1). It is clearly lower than Tanagra; it is also lower than all the other tools we analyze in this tutorial.
We can clearly see the advantages and disadvantages of swapping data to disk: the memory is under control, but the computation time increases. If we are in the final production phase of the predictive model, it is not a problem. If we are in the exploratory phase, where we use trial and error in order to define the best strategy, a prohibitive computation time is discouraging.
In this configuration, the only solution is to work on a sample. In our study, we note that the results obtained on 2,000 cases are very close to the results obtained on 2,000,000 examples. Of course, it is an unusual situation. The underlying concept linking the class attribute and the predictive attributes is very simple. But it is obvious that, if we can determine the sufficient sample size for the problem, this strategy allows us to explore the various solutions in depth without a prohibitive computation time.
Figure 1: Memory occupation of Knime during the processing
5 RapidMiner
The installation and the use of RapidMiner (http://rapidi.com/content/blogcategory/38/69/, Community Edition, version 4.3) are very simple under Ubuntu. We add a shortcut on the desktop which automatically runs the Java Virtual Machine with the rapidminer.jar file.
5.1 Working on the sample
As with the other tools, we set the sequence of operations on the sample (2,000 examples). There is little documentation on the software. However, it comes with many predefined examples. We can often quickly find an example of analysis which is similar to ours.
For the gain chart, we take the example file sample/06_Visualization/16_LiftChar.xlm. We adapt the sequence of operations to our dataset.
I do not really understand everything. The IORetriever and IOStorer operators seem quite mysterious. Anyway, the important thing is that RapidMiner provides the desired results. And it does.
We see in this chart that there are 1,800 examples in the test set, of which 324 are positive (CLASSE = NORMAL). Among the first 20% of the dataset (360 examples) ranked according to the score value, there are 298 positive examples, i.e. the true positive rate is 298/324 ≈ 92%.
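The arithmetic behind this reading can be checked step by step. The snippet below only re-derives the figures quoted above; the variable names are chosen for the illustration.

```python
test_size = 1800      # examples in the test set
positives = 324       # test examples with CLASSE = NORMAL
top_fraction = 0.20   # share of the ranked examples that we select
hits = 298            # positives found among the selected examples

selected = round(test_size * top_fraction)   # size of the selection
true_positive_rate = hits / positives        # share of all positives captured
share_in_selection = hits / selected         # share of positives in the selection

print(selected)
print(f"{true_positive_rate:.2%}")
print(f"{share_in_selection:.2%}")
```

Since the selection (360 examples) is larger than the number of positives (324), a perfect ranking would capture all of them; the observed value shows how close the classifier gets on the sample.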
We use the following settings for this analysis:
SIMPLE VALIDATION: SPLIT_RATIO = 0.1;
LIFT PARETO CHART: NUMBER OF BINS = 10
We use the Naive Bayes classifier for RapidMiner because the linear discriminant analysis is not available.
5.2 Calculations on the complete database
To run the calculations on the complete database, we modify the settings of the ARFF EXAMPLE SOURCE component. Then, we click on the PLAY button in the toolbar.
The calculation failed. The database was loaded in 17 minutes; the memory occupation had increased to 700 MB when the crash occurred. The message sent by the system is the following:
6 Weka
Weka is a well-known machine learning software (http://www.cs.waikato.ac.nz/ml/weka/, version 3.6.0). It runs under a JRE (Java Runtime Environment). The installation and the use are easy.
There are various ways to use Weka. We choose the KNOWLEDGE FLOW mode in this tutorial. It is very similar to Knime.
6.1 Working on the sample
As with the other tools, we first work on the sample in order to define the entire sequence of operations.
We see the diagram below. We deliberately separate the components. Thus we see, in blue, the type of information transmitted from one icon to another. It is very important: it determines the behavior of the components.
We describe here some nodes and settings:
ARFF LOADER loads the data file sample_gain_chart.arff;
CLASS ASSIGNER specifies the class attribute classe, all the other variables are the descriptors;
CLASS VALUE PICKER defines the positive value of the class attribute, i.e. CLASSE = NORMAL;
TRAIN TEST SPLIT MAKER splits the dataset, we select 10% for the train set;
NAIVE BAYES is the learning method, we connect to this component both the TRAINING SET and the TEST SET connections;
CLASSIFIER PERFORMANCE EVALUATOR evaluates the classifier performance;
MODEL PERFORMANCE CHART creates various charts for the classifier evaluation on the test set.
To start the calculations, we click on the START LOADING menu of the ARFF LOADER component. The results are available by clicking on the SHOW CHART menu of the MODEL PERFORMANCE CHART component.
Weka offers a clever tool for visualizing charts. We can interactively select the X coordinates and the Y coordinates. Unfortunately, although we can choose the right values for the Y coordinates, we cannot select the percentage of the population for the X coordinates. We use the false positive rate below instead, which gives the ROC curve.
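The two charts share the same Y coordinate (the true positive rate), so a ROC curve can be converted into a gain chart when the proportion of positives in the test set is known. The number of selected examples is the sum of the true positives and the false positives, which gives x = p * TPR + (1 - p) * FPR, where p is the proportion of positives. The sketch below applies this identity; the function name roc_to_gain is hypothetical, and the check reuses the counts reported for the RapidMiner sample run (1,800 test examples, 324 positives, 298 hits among 360 selected).

```python
def roc_to_gain(roc_points, prevalence):
    """Convert (FPR, TPR) points into (fraction of population, TPR) points.

    `prevalence` is the proportion of positive examples in the test set.
    """
    return [
        (prevalence * tpr + (1.0 - prevalence) * fpr, tpr)
        for fpr, tpr in roc_points
    ]


# Counts quoted in the RapidMiner section of this tutorial.
positives, negatives = 324, 1800 - 324
true_pos, false_pos = 298, 360 - 298
point = (false_pos / negatives, true_pos / positives)

x, y = roc_to_gain([point], prevalence=positives / 1800)[0]
print(round(x, 4), round(y, 4))  # x is the selected share of the population
```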
6.2 Calculations on the complete database
To process the complete database, we configure the ARFF LOADER. We select the full_gain_chart.arff data file. Then we click on the START LOADING menu.
After 20 minutes, the calculation is still not completed, and Weka disappears suddenly. We tried to modify the settings of the JRE (Java Runtime Environment), e.g. setting -Xmx1024m for the heap size, but to no avail.
Weka under Windows has the same problem in the same circumstances. We cannot handle a database with 41 variables and 2,000,000 examples.
7 Conclusion
Of course, maybe I have not used the optimal settings for the various tools. But they are free and the dataset is available online: anyone can run the same experiments and try other settings in order to improve the efficiency of the processing, both in computation time and in memory occupation (e.g. the settings of the JRE, the Java Runtime Environment).
Computation time and memory occupation are reported in the following table.
Tool                                        Tanagra   Knime          Weka                RapidMiner
Load the dataset                            3 min     6 min 40 sec   > 20 min (error)    17 min
Learning phase + Creating the gain chart    25 sec    30 min         -                   error
Memory occupation (MB)                      342 MB    193 MB         > 700 MB (error)    700 MB (error)
The first result which draws our attention is the speed of Tanagra. Although it runs through Wine under Linux, Tanagra is still very fast. The comparison is especially valid for loading the data. For the learning phase, because we do not use the same approach, the computation times are not really comparable.
This result seems curious because we made the same comparison (loading a large dataset) under Windows (http://dataminingtutorials.blogspot.com/2008/11/decisiontreeandlargedataset.html). The computation times were similar on that occasion. Perhaps it is the JRE under Linux which is not efficient?
The second important result of this comparison is the excellent memory management of Knime. Of course, we use a very simple method (Naive Bayes Classifier). However, we note that the memory occupation remains very stable during the whole process. Obviously, Knime can handle a very large database without any problem.
On the other hand, the computation time of Knime is clearly higher. In some circumstances, e.g. during the step where we try to define the optimal settings using a trial and error approach, it can be a drawback.
Finally, to be quite comprehensive, an important tool is missing from this comparison. It is Orange, which I like a lot because it is very powerful and user friendly (http://www.ailab.si/orange/). Unfortunately, even though I carefully followed the description of the installation process, I could not install (compile) Orange under Ubuntu 8.10 (http://www.ailab.si/orange/downloadslinux.asp#orange1.0ubuntu).
The Windows version works without any problems. We show below the result of the process under the Windows version. We do not give the computation time and the memory occupation here because the systems are not comparable.