A Comparative Study of Data Cleaning Tools

DOI: 10.4018/IJDWM.2019100103

International Journal of Data Warehousing and Mining, Volume 15 • Issue 4 • October-December 2019

Copyright © 2019, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited.


Samson Oni, University of Maryland Baltimore County, USA

Zhiyuan Chen, University of Maryland Baltimore County, USA

Susan Hoban, University of Maryland, Baltimore County, USA

Onimi Jademi, University of Maryland, Baltimore County, USA

ABSTRACT

In the information era, data is crucial in decision making. Most data sets contain impurities that need to be weeded out before any meaningful decision can be made from the data. Hence, data cleaning is essential and often takes more than 80 percent of the time and resources of the data analyst. Adequate tools and techniques must be used for data cleaning. There exist a lot of data cleaning tools, but it is unclear how to choose among them in various situations. This research aims at helping researchers and organizations choose the right tools for data cleaning. This article conducts a comparative study of four commonly used data cleaning tools on two real data sets and answers the research question of which tool will be useful in different scenarios.

KEYWORDS

Big Data, Data Cleaning, Data Cleansing, Data Fusion, Data Quality, Data Wrangler, Dirty Data, Open Refine

INTRODUCTION

Data is constantly being produced in every sector. However, data is produced in many forms, with various levels of quality, and some data may have poor quality. Data cleaning, sometimes called data scrubbing or data cleansing, is the detection and removal of errors and inconsistency from data with the aim of improving data quality. In Big Data processing, data cleaning is a critical and important step prior to data processing and maintenance (Müller & Freytag, 2005). Data cleaning is important to both data from a single source and data from multiple sources. Data cleaning is an essential step for the data fusion process, which is the process of merging data from multiple sources (Haghighat, Abdel-Mottaleb, & Alhalabi, 2016). Fusing poor quality data from various sources together will cause more issues afterwards. Therefore, adequate cleaning of data from various sources before integration will have significant impact on the outcome of data fusion.

Cleaning data requires identifying incorrect, invalid or duplicate entries. The quality of data is determined by the degree to which the data in question meets specific needs, which in any case will be higher as the data becomes cleaner (Kandel, Paepcke, Hellerstein, & Heer, 2011). Validity, completeness, accuracy and precision are the measures of data quality (Kandel et al., 2011). The importance of accurate and correct data for the fusion/ETL process cannot be overemphasized.



Data analysts also spend a great deal of time and resources trying to fix data quality problems. Dasu and Johnson (Dasu & Johnson, 2003) emphasized the rule of thumb that more than eighty percent (80%) of the time on a data analysis project is spent on cleaning and preprocessing.

Although there are many data cleaning tools, they often have distinctive features. They also require distinct levels of skill to use and have different costs and learning curves. Determining the best tools for any given cleaning task depends on many factors. However, in practice, users are often not experts on data cleaning tools and technologies, so there is great need to provide some guidance on how to choose data cleaning tools.

The objective of this paper is to analyze four popular data cleaning tools and determine which tools are appropriate for various scenarios. This paper compares the features of these tools and their performance on cleaning the same data set. Two data sets were used for this experiment. The results may help users choose appropriate data cleaning tools.

This paper makes the following contributions:

• It compares the performance of four data cleaning tools on two real-world data sets. The metrics include their features, required platforms and skill level, time of completion, ease of implementation/usage, etc.

• It proposes a guideline for choosing data cleaning tools.

The rest of the paper is organized as follows. A background study is presented first, followed by an overview of various aspects of data cleaning. The methodology section describes the methodology used for the comparison study. The results section describes the results of the study. The discussion and conclusion section presents the guidelines for choosing data cleaning tools and concludes the paper.

LITERATURE REVIEW

There has been a lot of work on data cleaning. The work can be roughly divided into two categories: work on methods to address specific data quality issues and work on more general tools or frameworks that can address multiple data quality issues.

Work on specific data quality issues: Lee et al. (Lee, Lu, Ling, & Ko, 1999) presented several techniques to preprocess records before sorting them so that potentially matching records will be brought close together. Using these techniques, they implemented a data cleaning system that can detect and remove duplicate records.

Various methods of handling missing data were discussed by Luján-Mora (Martinez-Mosquera et al., 2017). The authors proposed algorithms used in an analysis of an incomplete data set. The authors proposed multiple imputation methods, including regression imputation (filling in missing data with values predicted by a regression model) and single hot deck imputation (replacing the missing values with those obtained from similar objects from the same experiments).

General tools or frameworks: Martinez-Mosquera et al. (Martinez-Mosquera, Luján-Mora, López, & Santos, 2017) looked at modeling data cleaning for Big Data analysis based on previous research on modeling ETL processes using the Unified Modeling Language (UML). They presented two use cases, one for modeling the data cleaning process for web logs and the other for modeling the cleaning process for security logs.

Galhardas et al. (Galhardas, Florescu, Shasha, & Simon, 2000) developed a data cleaning framework called AJAX. Their approach separates physical and logical levels of data cleaning. The logical level supports the design of the data cleaning workflow and the physical level implements the data cleaning workflow. This framework transforms existing data from one or more data collections to a target schema while eliminating duplicates.



Luján-Mora et al. (Martinez-Mosquera et al., 2017) proposed a technique for data cleaning that can be used for checking data quality issues on security logs. They used predefined rules to combine log data and scan for issues. Detected issues are then corrected before the data set is analyzed. However, their work focused on a specific data type: security logs.

Kandel et al. (Kandel et al., 2011) described Data Wrangler as a tool for the interactive cleaning of data using visual specifications of data transformation scripts. They explained that Data Wrangler combines direct manipulation of visualized data with automatic inference of relevant transformations, which in turn cleanse the data.

Karrar and Ali (Karrar & Ali, 2016) conducted a comparative analysis of SQL Server and Winpure tools using academic and weather data sets. They analyzed two data cleaning tools, while our approach used four tools that are more commonly used in the industry for data cleaning. Porwal & Vora (Porwal & Vora, 2013) also carried out a comparative analysis of two data cleaning algorithms: the Alliance rule algorithm and the Hadclean algorithm. However, they did not compare more full-fledged data cleaning tools. They also did not consider other factors such as usability of the tools.

In a nutshell, there has been a lot of research on the need for data cleaning as well as on data cleaning techniques, tools, and frameworks. However, there is not much work on comparing and choosing data cleaning tools and the criteria to consider. This paper conducts a comparative study on four commonly used data cleaning tools.

OVERVIEW OF DATA CLEANING

This section describes types of data, data quality issues considered in this paper, general steps of data cleaning for a single source, and general steps of data cleaning for multiple sources.

Types of Data

With respect to categorization of data, there are three types of data: structured, semi-structured and unstructured data.

1. Structured data: This type of data has a high degree of organization and adheres to a predefined data model. One example is data in a relational database.

2. Semi-structured data: This type of data does not fit into a relational database but has some form of organization for easy analysis. An example is XML data.

3. Unstructured data: This data type is not organized, nor does it have a predefined model. Unstructured data is not a good fit for a relational database. Examples are text, PDF files and images.

Irrespective of the type of data, poor data quality will lead to poor analysis and decisions.

Data Quality Issues

Data quality issues can come in various forms, ranging from duplicate data, missing data and errors (like spelling student as stdent) to inconsistent formats. Below are several types of data quality issues considered in this paper.

1. Misspelled data: For example, a column has ‘student’, another has ‘stdent’, which is misspelled.

2. Duplicate data/records (Table 1): This is when the same information is entered or duplicated in a data set or database.

3. Irrelevant data: This is the data in the data set that is not relevant to the work. This kind of data needs to be removed.

4. Mixed ranges: Sometimes data is measured in ranges, e.g., salary, age. Ranges need to be represented consistently and appropriately in the data.



5. Mixed numerical scales: This issue arises when different numerical scales are used within the data set. For example, one million can be represented as 1m while one billion can be represented as 1B, but a computer may not easily interpret this.

6. Multiple representation: Representing the same piece of information in different forms that have the same meaning can cause problems within a data set. For example, multiple representations may be used for the country United States (i.e., U.S.A, US, United States, United States of America). All these representations mean the same, but using a mixture of several representations for the same information within a data set will cause trouble for analysis.

7. Wrong date format: Different date formats are used in data today, but the mixture of several date formats in one data set can be troublesome. Examples of different formats are 2/12/2018, February 12, 2018, and 2-12-2018. The three dates mean the same, but their presentation differs. Another example of date inconsistency is the mixture of American (MM/DD/YYYY) and European (DD/MM/YYYY) formats. In American format the 12th of February 2018 is written as 2/12/2018, while Europeans represent the same date as 12/2/2018, starting with the day.
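Two of these issues, multiple representations and mixed date formats, lend themselves to a short illustration. The following is a minimal Python sketch and is not taken from the study: the alias table, the list of accepted date formats and the function names are assumptions made for illustration, and reading ambiguous dates month-first (American style) is a choice that has to be made deliberately for each data set.

```python
from datetime import datetime

# Hypothetical alias table; keys are compared after removing dots and case.
COUNTRY_ALIASES = {
    "usa": "United States",
    "us": "United States",
    "united states": "United States",
    "united states of america": "United States",
}


def normalize_country(raw):
    """Map known spelling variants of a country to one canonical name."""
    key = raw.replace(".", "").strip().lower()
    # Unknown values are returned unchanged (apart from trimming).
    return COUNTRY_ALIASES.get(key, raw.strip())


# Candidate formats are tried in order. Month-first is an assumption here.
DATE_FORMATS = ("%m/%d/%Y", "%m-%d-%Y", "%B %d, %Y")


def parse_date(raw):
    """Parse a date written in any of the accepted formats."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {raw!r}")
```

Under these assumptions, normalize_country maps "U.S.A", "US" and "United States of America" to one value, and parse_date returns the same date for 2/12/2018, 2-12-2018 and February 12, 2018. A data set that mixes American and European order cannot be repaired by format guessing alone; the source of each record has to be known.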

General Steps of Data Cleaning for a Single Source

There are several phases involved in data cleaning for a single data source.

• Detect errors and inconsistencies in the data that need to be removed.

• Verify that the error is really an error, not a special feature of the data set (Rahm & Do, 2000). This often requires human interaction.

• Extract erroneous records to a new temporary table.

• Perform cleaning operations on the data in that temporary table.

Most of these processes are already built into different data cleaning tools, as doing this manually will cost a lot of time and resources.
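The four phases can be sketched in a few lines of Python. This is an illustrative sketch only: the rows are modeled on the misspelling and duplicate examples used in this paper, while the reference list of countries and the reviewer-approved corrections are hypothetical stand-ins for the human verification step.

```python
# Rows modeled on the paper's single-source example (one misspelling, one duplicate).
records = [
    {"CampusId": "VH609042", "First Name": "Samson", "Last Name": "James", "Country": "Nigria", "Sex": "M"},
    {"CampusId": "XV503267", "First Name": "Jane", "Last Name": "Mark", "Country": "India", "Sex": "F"},
    {"CampusId": "XV503267", "First Name": "Jane", "Last Name": "Mark", "Country": "India", "Sex": "F"},
]

KNOWN_COUNTRIES = {"Nigeria", "India"}   # hypothetical reference list
REVIEWED_FIXES = {"Nigria": "Nigeria"}   # hypothetical corrections confirmed by a person

# Phase 1: detect values that do not appear in the reference list.
suspect_ids = [i for i, row in enumerate(records) if row["Country"] not in KNOWN_COUNTRIES]

# Phases 2 and 3: keep only suspects a reviewer confirmed as errors,
# and copy them into a temporary table so the original rows stay untouched.
temp_table = {
    i: dict(records[i]) for i in suspect_ids if records[i]["Country"] in REVIEWED_FIXES
}

# Phase 4: clean inside the temporary table, then combine with the untouched rows.
for row in temp_table.values():
    row["Country"] = REVIEWED_FIXES[row["Country"]]
cleaned = [temp_table.get(i, row) for i, row in enumerate(records)]

# Exact duplicates are dropped while keeping the first occurrence.
deduplicated = []
seen = set()
for row in cleaned:
    key = tuple(sorted(row.items()))
    if key not in seen:
        seen.add(key)
        deduplicated.append(row)
```

Working on copies in a temporary table means a wrong correction can be discarded without having altered the source rows.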

General Steps of Data Cleaning for Multiple Sources

In multiple data sources, each data source may contain dirty data. In addition, data from one source may contradict or overlap with data from other sources. The process of merging these data is also known as data fusion.

Table 2 depicts a single data source that has a few data quality issues including misspelling (Nigeria was spelled as Nigria) and duplicated data.

Table 3 and Table 4 show two data sources that need to be integrated. Each data source may have some data quality issues (e.g., Nigeria is misspelled in source 2). Some of these issues can be addressed in the individual source, but others can only be addressed during and after integration.

Table 5 shows integrated data in which cleaning was done in the individual sources alone. Issues like misspelling are addressed, but redundancy and overlapping are not. For example, we have columns for Name, first name and last name, and columns for sex and gender. Table 6 shows clean integrated data where these issues are addressed.

Table 1. An example of a duplicate record

No | First Name | Last Name | Age | Sex | Phone No
1 | Samson | James | 19 | M | 202-298-2014
2 | Samson | James | 19 | M | 202-298-2014



Aside from the issue of data overlapping associated with multiple sources, naming and structural conflicts may occur (Batini, Lenzerini, & Navathe, 1986; Parent & Spaccapietra, 1998). Naming conflicts arise when different names are used for the same objects across sources, or when the same name is used for different objects across sources. Meanwhile, structural conflicts occur when different representations of the same object arise in different data sources.

Table 2. Data quality issues in a single data source

CampusId | First Name | Last Name | Country | Sex
VH609042 | Samson | James | Nigria | M
XV503267 | Jane | Mark | India | F
XV503267 | Jane | Mark | India | F

Table 3. Data in source 1

CampusId | Name | Address | Sex | Date of Birth
VH609042 | Samson James | 100 IRC 21222, MD | 1 | 12-01-1989
XV503267 | Jane Mark | 123 Ocean street, 21223, MD | 0 | 02-01-1988

Table 4. Data in source 2

CampusId | First Name | Last Name | Country | Gender | Course
VH609042 | Samson | James | Nigria | M | 10
XV503267 | Jane | Mark | | F | 10

Table 5. Integrated data with data cleaning in individual sources only

CampusId | Name | Address | Date of Birth | Sex | First Name | Last Name | Country | Gender | Courses
VH609042 | Samson James | 100 IRC 21222, MD | 12-01-1989 | 1 | Samson | James | Nigeria | M | 10
XV503267 | Jane Mark | 123 Ocean street, 21223, MD | 02-01-1988 | 0 | F | Mark | India | F | 10

Table 6. Clean integrated data

CampusId | FirstName | LastName | Sex | Address | Courses | Date of Birth | Country
VH609042 | Samson | James | M | 100 IRC 21222, MD | 10 | 12-01-1989 | Nigeria
XV503267 | Jane | Mark | F | 123 Ocean street, 21223, MD | 10 | 02-01-1988 | India



So the general steps to clean data from multiple sources include: 1) clean data at each source; 2) integrate the data; 3) address data quality issues in the integrated data.
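A small Python sketch of these three steps, using rows modeled on the two-source example in this section, might look as follows. It is only an illustration under stated assumptions: the decoding of Sex codes (1 as M, 0 as F), the spelling-fix table and all function names are hypothetical, and a country that neither source supplies is left empty rather than guessed.

```python
source1 = [
    {"CampusId": "VH609042", "Name": "Samson James", "Address": "100 IRC 21222, MD",
     "Sex": "1", "Date of Birth": "12-01-1989"},
    {"CampusId": "XV503267", "Name": "Jane Mark", "Address": "123 Ocean street, 21223, MD",
     "Sex": "0", "Date of Birth": "02-01-1988"},
]
source2 = [
    {"CampusId": "VH609042", "First Name": "Samson", "Last Name": "James",
     "Country": "Nigria", "Gender": "M", "Course": "10"},
    {"CampusId": "XV503267", "First Name": "Jane", "Last Name": "Mark",
     "Country": "", "Gender": "F", "Course": "10"},
]

SEX_CODES = {"1": "M", "0": "F"}          # assumed meaning of the numeric codes
SPELLING_FIXES = {"Nigria": "Nigeria"}    # hypothetical correction table


def clean_source1(row):
    """Step 1 for source 1: decode the numeric Sex column."""
    out = dict(row)
    out["Sex"] = SEX_CODES.get(row["Sex"], row["Sex"])
    return out


def clean_source2(row):
    """Step 1 for source 2: fix known misspellings of the country."""
    out = dict(row)
    out["Country"] = SPELLING_FIXES.get(row["Country"], row["Country"])
    return out


def integrate(rows1, rows2):
    """Steps 2 and 3: join on CampusId, then resolve overlapping columns."""
    by_id = {row["CampusId"]: row for row in rows2}
    merged = []
    for r1 in rows1:
        r2 = by_id.get(r1["CampusId"], {})
        # Sex and Gender describe the same attribute; keep one, but check agreement.
        if r2.get("Gender", r1["Sex"]) != r1["Sex"]:
            raise ValueError(f"conflicting sex/gender for {r1['CampusId']}")
        merged.append({
            "CampusId": r1["CampusId"],
            # Name is redundant with First Name and Last Name, so it is dropped.
            "FirstName": r2.get("First Name", ""),
            "LastName": r2.get("Last Name", ""),
            "Sex": r1["Sex"],
            "Address": r1["Address"],
            "Courses": r2.get("Course", ""),
            "Date of Birth": r1["Date of Birth"],
            "Country": r2.get("Country", ""),
        })
    return merged


integrated = integrate(
    [clean_source1(r) for r in source1],
    [clean_source2(r) for r in source2],
)
```

Raising an error on a Sex/Gender disagreement, instead of silently preferring one source, reflects the point that some conflicts only become visible during integration and need a decision from the user.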

METHODOLOGY

This section describes the methodology of the comparative study, including the data sets, the four data cleaning tools, and the data cleaning tasks.

Data Sets

In this work, we used a data set on atmospheric and climate research from the U.S. Department of Energy website (www.arm.gov) and a data set about universities (universityData) extracted from Wikipedia. The data from the U.S. Department of Energy website is the Atmospheric Radiation Measurement (ARM) user facility data collected through scientific experiments and routine operations. The observations were made every half an hour. The University data set gives an overview of different universities: when they were established, the number of faculty, staff and students currently enrolled, as well as the total endowment amount each university currently possesses. The information included in the data set is explained in Table 7.

The University data set has 10 variables (p = 10), contains over 75,000 records (n = 75,043) and is saved as CSV. The ARM data set has 15 variables (p = 15), contains over 12,000 records (n = 12,762) and is saved as CSV. Table 8 shows the columns in the University data.

Data Quality Issues in Data Sets

Figure 1 and Figure 2 show the screenshots of these two data sets, respectively. The two data sets used for experiments are very messy and have several data quality issues:

Table 7. Properties of data sets

File Name | No. of Records | No. of Fields | Missing Values | Duplicate Records
UniversityData | 75,000 | 10 | 7.89% | 32.7%
ARMData | 12,762 | 15 | 27.6% | 0%

Table 8. Columns of the University Data

Description of Variable | Variable Name in Dataset
Name of University | University
The monetary amount of endowment the school has | Endowment
The total number of faculty employed by school | NumFaculty
Number of Doctoral | NumDoctoral
Country where the school exists | Country
The total number of staff members in the school | NumStaff
The year the school was established | Established
Number of Postgraduate students | NumPostgrad
Number of Undergraduate students | NumUndergrad
Total number of all students enrolled | NumStudents



Figure 1. Screenshot of UniversityData.csv opened with Excel

Figure 2. Screenshot of ARM data opened with Excel



• Inconsistent date values. The University data contains different date formats, while in the ARM data the column for date has both date and time, which have to be separated.

• Inconsistency in abbreviations and terms used, e.g., to indicate United States of America (USA) as country, some records use terms like US, USA and United States. These might be considered as different countries without data cleaning.

• Mixture of numerical and text values.

• Missing information: The University data set contains several NA values, which is not unusual for any form of data set but might be problematic when carrying out data analysis on the data set. The ARM data also has many missing records.

• Values in the University data set are separated by an inconsistent number of double quotes, while those in the ARM data are separated by a space, which doesn’t meet the comma requirement of CSV.

• Duplicate records: The data set is supposed to have only one entry for each university instance. However, as seen in the data, some universities have several entries, with all data sometimes being the same and sometimes having variations, e.g., Lamar University has 33 entries, but there are variations in the values of the last variable, with some showing 13773, 14388 and 14522.

• Outlier records: The ARM data set has some values in some columns far beyond the normal range. That can be problematic while carrying out analysis.

• Missing records in a sequence: The ARM data set was collected at half-hour intervals. There are many missing records for certain times.
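Two of the ARM issues listed here, a combined date and time column and gaps in a half-hourly sequence, can be illustrated with a short Python sketch. The timestamps, their format and the function name are hypothetical; the actual ARM files may be laid out differently.

```python
from datetime import datetime, timedelta


def missing_timestamps(observed, step=timedelta(minutes=30)):
    """Return the expected timestamps that are absent from `observed`."""
    present = set(observed)
    if not present:
        return []
    current, last = min(present), max(present)
    missing = []
    # Walk the expected grid from the first to the last observation.
    while current <= last:
        if current not in present:
            missing.append(current)
        current += step
    return missing


# Hypothetical readings with the 01:00 and 02:00 records missing.
raw = ["2018-01-01 00:00", "2018-01-01 00:30", "2018-01-01 01:30", "2018-01-01 02:30"]
observed = [datetime.strptime(text, "%Y-%m-%d %H:%M") for text in raw]

gaps = missing_timestamps(observed)

# Separate each combined value into a date part and a time part.
split = [
    (stamp.date().isoformat(), stamp.time().isoformat(timespec="minutes"))
    for stamp in observed
]
```

The sketch only reports which records are missing. How to fill them (imputation, inference from neighboring records, or leaving them empty) remains a decision for the analyst.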

The purpose of this study is to use several data cleaning tools on the same data sets in order to compare them. Through this study, it is anticipated that we will gain better insight into how these different tools work and into the strengths and weaknesses of each tool in data cleaning, as well as come up with valuable suggestions and discussions about the future of data cleaning techniques and tools.

Tools Used

For this study, we used four different data cleaning tools, namely OpenRefine, R, Python and Data Wrangler. These tools are among the most popular tools used for data cleaning in the real world. OpenRefine, R and Python are open source, which makes them easily accessible for use. Data Wrangler is a commercial tool but has a community version which does a good job of data cleaning. The tools used are described below:

• OpenRefine: OpenRefine (Verborgh & De Wilde, 2013) is a web-based, stand-alone, open source application for data cleanup and transformation to other formats. It operates on rows of data that have cells under columns, which is very similar to relational tables. This tool cleans, reshapes and edits batch, unstructured and messy data. It was formerly known as Google Refine and was called Freebase Gridworks before that. Operations in OpenRefine include faceting (allowing users to narrow down results through several different dimensions), clustering, and reconciling, which all help in the data cleaning process. It also analyzes the data through filtering, faceting and converting the data into a more structured form.

OpenRefine is a standalone application that has a web interface. It is not hosted on the web but can be downloaded and run on the local machine. In other words, it is a desktop application that opens in a browser as a local web server.

Transformation expressions can be written in General Refine Expression Language (GREL), Jython (i.e., Python) and Clojure. Since it is an open source project, its code can be reused in other projects.



OpenRefine carries out cleaning tasks through filtering and faceting, and then converts the data into a more structured format.

• Data Wrangler: Data Wrangler (Kandel et al., 2011) is a Stanford University project that helps analysts clean and prepare diverse, messy data quickly and accurately. It is an interactive tool for data cleaning. Data Wrangler can work with data in two ways. Users can simply paste the data into its web interface, or they can use the web interface to export the operations as Python code and process arbitrary amounts of data. The web interface uses JavaScript and therefore has some performance issues and only supports up to 1000 rows, but users can use it to configure Data Wrangler on a subset of the data and then apply the configuration on the whole data set. The most recent version of this tool is called Trifacta Wrangler.

For the experiment we imported our data into Data Wrangler and the application began to automatically organize and structure our data set. This tool contains strong machine learning algorithms that help suggest common cleaning steps, transformations and aggregations. Data Wrangler allows a mixture of numerical and text values.

• Python: Python is another tool that can be used for data cleaning. It has several modules that can be used to carry out cleaning. One powerful module in Python that is used for data cleaning is pandas (Python data analysis toolkit). This module is for data analysis in general, of which data cleaning is a part. Another module in Python that can be useful when carrying out cleaning is the NumPy module. This module is used for scientific computing with Python. It has a powerful N-dimensional array object that is useful for large data sets.

• R: R is a programming language used for statistical computation (John et al., 2016). It has been widely used for data analysis. R has a set of tools that are designed to clean data effectively and comprehensively. The R environment has the capacity to read data in several formats and process these files.

When cleaning data using R, the following simple steps can be taken, for which R provides great resources:

1. Reading data: R provides adequate resources for reading data from practically any format into a data frame.

2. Exploratory analysis: After reading the data, users often conduct an initial exploration of the data frame.

3. Exploratory analysis in visual form: During cleaning it is useful to visualize data at each stage. R provides adequate visualization tools. Three powerful visualizations that can be useful during data cleaning are the box plot, histogram and scatter plot.
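The read, explore and visualize cycle described for R has a close counterpart in Python. The sketch below is a Python analogue, not R code and not the scripts used in the study. It uses only the standard library, the sample readings are invented, and instead of drawing a box plot it computes the whisker limits a conventional box plot would use (1.5 times the interquartile range beyond the quartiles).

```python
import csv
import io
import statistics

# Hypothetical half-hourly readings; "NA" marks a missing value.
SAMPLE = """timestamp,temperature
2018-01-01 00:00,20.1
2018-01-01 00:30,19.8
2018-01-01 01:00,20.5
2018-01-01 01:30,NA
2018-01-01 02:00,21.0
2018-01-01 02:30,20.3
2018-01-01 03:00,95.0
2018-01-01 03:30,19.9
2018-01-01 04:00,20.7
"""

MISSING = ("", "NA")

# Step 1: read the data into a list of row dictionaries.
rows = list(csv.DictReader(io.StringIO(SAMPLE)))

# Step 2: explore it by counting missing values and summarizing the rest.
missing_count = sum(1 for row in rows if row["temperature"] in MISSING)
values = [float(row["temperature"]) for row in rows if row["temperature"] not in MISSING]
median = statistics.median(values)

# Step 3: compute the limits a box plot would draw and list points outside them.
q1, _, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in values if v < lower or v > upper]
```

In this invented sample the reading of 95.0 falls outside the limits and would appear as an isolated point on a box plot. Whether such a value is an error or a genuine extreme is still a judgment for the analyst.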

The Data Cleaning Tasks

In the experiment the following data cleaning tasks were conducted.

• Dealing with typographical errors or multiple representations:
  ◦ Cleaning up inconsistent spelling of terms (e.g., “USA”, “U.S.A”, “U.S.”, etc.).
  ◦ Converting values that are text descriptions of numeric values (e.g., $123 million) to actual numeric values (e.g., 123000000) which are usable for analysis.
  ◦ Extracting and cleaning values for dates.

• Identifying which rows of a specific column contain a search term.

• Removing duplicate data.

• Separating date and time.

• Handling outliers.

• Handling missing records in a sequence. Here, after using the tool to discover the missing records in a sequence, the user decides how to replace missing values, either by imputation, inference from other records or another method decided by the user.

• Exporting cleaned data to several formats.

• Handling missing fields, duplicate records, inconsistent formats.

• Batch editing of rows and columns.
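Several of these tasks are small enough to sketch directly. The Python functions below illustrate three of them: converting a text description of an amount to a number, finding rows that contain a search term, and removing exact duplicates. The function names, the accepted magnitude words and the sample rows (including the university name) are hypothetical and are not the scripts used in the experiment.

```python
import re

MAGNITUDES = {"thousand": 10**3, "million": 10**6, "billion": 10**9}
AMOUNT_PATTERN = re.compile(
    r"\s*\$?\s*(\d[\d,]*(?:\.\d+)?)\s*(thousand|million|billion)?\s*",
    flags=re.IGNORECASE,
)


def endowment_to_number(text):
    """Turn a description such as '$123 million' into an integer amount."""
    match = AMOUNT_PATTERN.fullmatch(text)
    if match is None:
        return None  # e.g. 'NA'; left for the missing-value step
    amount = float(match.group(1).replace(",", ""))
    word = (match.group(2) or "").lower()
    return round(amount * MAGNITUDES.get(word, 1))


def rows_containing(rows, column, term):
    """Return the rows whose `column` contains `term`, ignoring case."""
    return [row for row in rows if term.lower() in row[column].lower()]


def drop_exact_duplicates(rows):
    """Keep the first occurrence of each fully identical row."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Invented sample rows for illustration.
rows = [
    {"University": "Example University", "Endowment": "$123 million", "NumStudents": "14388"},
    {"University": "Example University", "Endowment": "$123 million", "NumStudents": "14388"},
    {"University": "Example University", "Endowment": "$123 million", "NumStudents": "13773"},
    {"University": "Sample College", "Endowment": "NA", "NumStudents": "5000"},
]
unique_rows = drop_exact_duplicates(rows)
```

Note that exact-duplicate removal leaves the two "Example University" rows that differ in NumStudents. Deciding which of such near-duplicates to keep requires a rule supplied by the user.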

Two users with advanced programming skills finished these data cleaning tasks using the four tools. For each data cleaning task, users applied a data cleaning tool to fix the data quality issues listed in the task. They manually checked the results and repeated the cleaning task until no more related quality issues could be found in the data. The order of applying each tool for each task was randomized to avoid bias introduced by the order.

When comparing the four tools, we focused on the following criteria: key features, platform, scalability, skill level needed, time of completion and ease of implementation.

RESULTS

For each tool we describe its key features, platform, skill level needed, time of completion, ease of implementation, advantages and disadvantages, and accuracy.

Key Features

OpenRefine: It has the following key features:

• Importing data from various data sources, with support for the following formats: CSV, TSV, .xls, .xlsx, JSON, XML, RDF as XML and Google document. Figure 9 shows a screenshot of importing the university data using OpenRefine.

• Facets and filters: OpenRefine allows users to use facets and filters to filter data into subsets for easy usage. This can be done for number, text and date columns. For example, for the University data, if a user facets data on the gender column we will get 2 in female and 1 in male. If the user selects female, then it will show the two rows with female.

• Support for expressions that can be used to create new data from existing data or transform existing data.

• Reconciliation: Reconciliation matches text names or values in the columns to database identifiers in various database ID spaces. It helps resolve inconsistent spelling issues. For example, US, USA and United States can be matched to United States of America. Reconciliation can be done by calling web services or a database API.

• Exporting data: Data can be exported into tab-separated values (TSV), comma-separated values (CSV), Excel and HTML table.

• Undo/redo: Undo gives the user the flexibility to rectify mistakes. Redo enables the user to repeat a step.
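To make the ideas behind faceting and clustering concrete, the following Python sketch mimics them outside the tool. It is a simplified illustration and not OpenRefine's implementation: a text facet is approximated by counting distinct values, and clustering is approximated by a reduced form of key-collision ("fingerprint") matching. The function names and sample values are assumptions.

```python
import string
from collections import Counter, defaultdict


def text_facet(values):
    """Count how often each distinct value occurs, as a text facet would show."""
    return Counter(values)


def fingerprint(value):
    """Build a key that ignores case, punctuation, spacing and word order."""
    cleaned = value.strip().lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(cleaned.split())))


def cluster(values):
    """Group values whose fingerprints collide; singletons are omitted."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return [sorted(group) for group in groups.values() if len(group) > 1]


countries = ["USA", "U.S.A", "usa ", "United States", "States United", "India"]
facet = text_facet(countries)
clusters = cluster(countries)
```

Key collision groups "USA", "U.S.A" and "usa " together, but it cannot tell that "USA" and "United States" name the same country. That kind of match needs a lookup against a reference source, which is the role reconciliation plays.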

Data Wrangler: It has the following key features:

• Data Wrangler supports the following six user interactions while using the tool for cleaning.
  ◦ Select columns
  ◦ Select rows
  ◦ Select text within a cell
  ◦ Edit data within the table
  ◦ Click bars in the data quality meter
  ◦ Assign data types, column names and semantic roles.

• Data Wrangler has a suggestion engine that suggests next data cleaning steps.

• Data Wrangler supports automated script generation.

• Data Wrangler allows the user to have step-by-step interaction with data.

• Data Wrangler supports CSV, JSON and TDE data formats.

Python: It has the following key features:

• Features to visualize and explore data.

• Writing customizable code for specific data cleaning tasks.

• Easy integration with other tools or products. Python can call programs written in other languages. Python code can be called in other languages as well.

R: R has the following key features:

• R has many functions that can be used for data cleaning.

• R has good visualization libraries.

• Writing customizable code for specific data cleaning tasks.

Platform and Needed Skill Level

OpenRefine is a web-based application and is therefore platform independent. It can run on Windows, Linux and Mac. It requires basic to intermediate skill level.

Data Wrangler runs on Windows and Mac. It requires basic skill level.

Both Python and R run on all platforms, including Linux, Windows and Mac. They both require advanced skill level, because users have to know how to program.

Time of Completion

Figure 3 depicts the average completion time of cleaning using all four tools on the University data and the ARM data. The users have high skill level and are familiar with both R and Python.

Using Data Wrangler has the fastest completion time, followed by using OpenRefine. This is expected, because both tools are highly interactive. Data Wrangler can also suggest data cleaning steps, so it leads to even faster completion time. Using R and Python took much longer because both require customized programming. Using R took less time than using Python, because R has a lot of data analysis functions that are suitable for data cleaning. The relative order of the tools is the same for both data sets.

Ease of Implementation

There is no standard sequence of steps in cleaning data. Sometimes it depends on the specific issues contained in the data, while other times it depends on the user’s approach. Due to this fact, we were not able to do a quantitative analysis. However, we gathered feedback from users of these tools. Some users expressed how they felt using OpenRefine and Data Wrangler for data cleaning based on the interactive user interface. Others discussed how they could use Python and R in customized ways. Based on the feedback we got and on our usage of these tools for our experiment, we assigned a scale of 1-3 to the ease of implementation of these tools.

We classified the ease of implementation/usage of these tools on a scale of one to three, with 3 as the easiest to use.

1. Scale 1: High human dependence, low interactivity, little automation and requiring advanced technical skill. This scale means that the user must know what he or she is doing, and the tool gives no suggestions or hints. The effectiveness of the tool depends on the user’s knowledge and skills. In addition, the user needs to have advanced technical skills to be able to successfully complete tasks with the tool.

2. Scale 2: High human dependence, high interactivity, some automation and requiring basic to intermediate technical skill. Here the user is expected to know what exactly in the data he or she wants to clean, but the tool interactively helps the user carry out the task. Basic technical skills are needed for basic cleaning, but intermediate technical skills may be needed for complex tasks.

3. Scale 3: Tools in this category are highly interactive, have little to no human dependence, and suggest cleaning steps to the user. A user with no experience or technical skills can use such a tool to achieve the data cleaning tasks.

Figure 4 shows the ease of implementation scale for each tool. OpenRefine has a scale of 2 because this tool is highly interactive and only requires basic to intermediate skills. However, users still need to specify all steps in the data cleaning process, so it has high human dependence and some automation.

Data Wrangler has a scale of 3 because it is highly interactive and only requires basic skills. In addition, it suggests data cleaning steps to users, so it has low human dependence and is highly automated.

Python and R both have a scale of 1 because they have high human dependence, low interactivity, little automation and require advanced technical skill.

Other aspects: We looked at several other aspects, including the possibility of being embedded in other tools/programs, user interface, mass edits (editing multiple cells at the same time), approach, and compatibility with Big Data.

Both OpenRefine and Data Wrangler are stand-alone and cannot be embedded. R and Python can be embedded in other programs.

Both OpenRefine and Data Wrangler have a graphical user interface. R and Python do not. All of them support mass editing, but R and Python require some coding to do that.

Figure 3. Time of completion in minutes using four data cleaning tools for University Data and ARM Data cleaning



In terms of data cleaning approach, OpenRefine supports simple tasks with a simple click, but for more complex tasks, users need to use the expression language. For Data Wrangler, users only need to click, and the system also suggests data cleaning steps. For R and Python, users have to manually write scripts.

OpenRefine can only support cleaning 5000 records, so it does not directly support cleaning big data. The other three tools can handle big data.

Advantages and Disadvantages

OpenRefine has the following advantages:

• Since this tool is a desktop application without the need to connect to the internet, the data set is relatively safe and is harder to tamper with.

• Users can use its facet feature to filter the data into subsets.

• It has powerful features to transform data.

• It provides a simple data summarization platform.

OpenRefine has the following disadvantages:

• Google removed support for this tool, and some of its features are redundant.

• The UI is not user friendly; several features are not easy to find.

• OpenRefine is not suitable for processing large data sets due to the 5000-record limit.

• It assumes that data is organized in tabular form, which is not always true.

Data Wrangler has the following advantages:

Figure 4. Ease of implementation of the tools



• Data Wrangler has two views: the Grid view and the Column view.

• This tool also supports data visualization at every step of data cleaning.

• It supports mass editing.

• It uses natural language descriptions of transformations.

• It recommends cleaning steps for uploaded data.

Overall, we found Data Wrangler the most user friendly out of the four tools.

Data Wrangler has the following disadvantages:

• It consumes a lot of memory.

• Only limited features are available in the free version.

Python has the following advantages:

• Users can customize their solution to fit their needs.

• This tool is easy to integrate into other applications.

Python has the following disadvantages:

• It requires advanced programming skills.

• The learning curve is steep, as it requires the user to learn how to use many modules in Python.

• It is not time effective due to the steep learning curve.

• This method can be complex and difficult to implement.

• Users must have previous knowledge of what steps to take in the cleaning process.

R has the following advantages:

• It is suitable when the data is mainly used for statistical analysis (e.g., sales records).

• It is very easy to visualize data at each stage of cleaning.

R has the following disadvantages:

• It is not a good option for integrating into projects in domains other than data science. Other projects might make use of other programming languages that R doesn’t integrate well with.

• It is not time effective due to the steep learning curve.

• This method can be complex and difficult to implement.

• Users must have previous knowledge of what steps to take in the cleaning process.

Accuracy

We were not able to quantify the accuracy of these tools, because users have to go through multiple iterations for each tool, and once some issues are fixed in one iteration, the tool may find more issues that will be fixed in the next iteration. In the experiments, we observed OpenRefine and Data Wrangler to have high accuracy when detecting specific data quality issues (e.g., missing values). But some manual work is needed to fix the found issues (e.g., the user can decide to remove missing values or assign some values).

For R and Python, the accuracy depends entirely on the user’s skill level, that is, on whether the user can write good programs to detect those issues and solve them.



In the end, most data quality issues were addressed with each tool.

Summary of Comparison

Table 9 summarizes the comparison of these four tools. Following comparison criteria used by Porwal & Vora (Porwal & Vora, 2013) and Karrar and Ali (Karrar & Ali, 2016), we came up with the following metrics: import format, performance time, skill level, platform, ease of implementation, key features, output format, accuracy, possibility of being embedded in other tools/programs, user interface, mass edit, approach, compatibility with big data and their disadvantages.

Table 9. Comparison of OpenRefine, Data Wrangler, Python and R

| Criteria | OpenRefine | Wrangler | Python | R |
|---|---|---|---|---|
| Import format | CSV, TSV, Excel (XLS/XLSX), JSON, XML, RDF | Excel (XLS/XLSX), CSV, TEXT | All | All |
| Performance time | Depends on data size and format | Depends on user choice and data size | Depends on user programming skills and level of artifact in data | Depends on user programming skills and level of artifact in data |
| Key features | Facets and filters, support for expression language, reconciliation | User interactions, suggestion engine, automated script generation | Customizable by user, integrates with other tools, great visualization library | Customizable by user, great visualization library |
| Skill level | Basic to Intermediate | Basic | Advanced | Advanced |
| Platform | All platforms | Windows, Mac | All platforms | All platforms |
| Accuracy | High accuracy when detecting specific data quality issues | High accuracy when detecting specific data quality issues | Depends on the user's skill level | Depends on the user's skill level |
| Ease of implementation | 2 | 3 | 1 | 1 |
| Output format | TSV, CSV, Excel and HTML table | CSV, JSON, TDE | User may customize to any format | User may customize to any format |
| Possibility to be embedded | No, standalone but code is available | No, standalone | Yes | Yes |
| Graphic User Interface | Yes | Yes | No | No |
| Edit Multiple Values | Supports mass edit | Supports mass edit and it is easy | Supported but requires complicated coding | Supported but requires coding |
| Approach | Simple tasks carried out with a click, but complex tasks require expression language | Simple click; also suggests cleaning features to the user | Need to write scripts | Need to write scripts |
| Compatible with Big Data | No (suitable for only 5000 records) | Yes | Yes | Yes |
| Drawbacks | Google stopped support; advanced features require technical skills | Memory consumption is high; cost implication | Requires good knowledge of programming | Requires knowledge of programming and statistics |
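As an illustration of the "Edit Multiple Values" row, the sketch below shows one way a mass edit could be scripted in Python: several spellings of the same value are mapped to one canonical form in a single pass. The mapping, the sample values, and the function names (`normalize_key`, `mass_edit`) are hypothetical. This is not how OpenRefine or Data Wrangler implement the feature internally.

```python
# Hypothetical mapping from normalized spellings to a canonical value.
CANONICAL = {
    "md": "Maryland",
    "maryland": "Maryland",
    "m.d.": "Maryland",
    "va": "Virginia",
    "virginia": "Virginia",
}


def normalize_key(value):
    """Collapse whitespace and lowercase so near-identical spellings match."""
    return " ".join(value.split()).lower()


def mass_edit(values, mapping):
    """Replace every value whose normalized form is in `mapping`.

    Returns the edited list and the number of cells that changed.
    Values with no mapping entry are kept as they are.
    """
    edited = []
    changed = 0
    for value in values:
        new_value = mapping.get(normalize_key(value), value)
        if new_value != value:
            changed += 1
        edited.append(new_value)
    return edited, changed


states = ["MD", " maryland ", "Maryland", "VA", "Virginia", "Delaware"]
cleaned, n_changed = mass_edit(states, CANONICAL)
```

In the GUI tools the equivalent operation is a few clicks on a facet or a suggested transform. In a script the user has to build and maintain the mapping, which is the extra coding effort the table refers to.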



CONCLUSION

The problem of 'dirty' data costs institutions large amounts of money every year. More than 80% of a data analyst's time and resources are often spent preparing and cleaning data. This paper conducted a comparative study of four commonly used tools for data cleaning. The results show that OpenRefine, an open source tool formerly supported by Google, is a useful tool with several merits: it runs locally, which keeps user data more secure, and it offers a graphical interface and a mass edit feature. However, OpenRefine requires experience and expertise to use its advanced features. OpenRefine also works better for small datasets.

Data Wrangler has the advantage of being a standalone tool. It is very efficient for big data and has a unique visualization feature at each step that gives the user an opportunity to preview changes graphically before committing them. It can also recommend data cleaning steps. Overall, it is the easiest to use. However, the free version has limited functionality.

Python and R have the advantage of letting the user customize the data in any way s/he wants, and they can be embedded into other tools. Python and R offer similar cleaning features, but Python has many modules that support different aspects of cleaning and make it possible to use the cleaned data for other analysis. Python and R, however, require strong programming skills, which a user may not have. Python and R also take a lot of time to carry out cleaning, as each step along the way must be implemented manually.

In conclusion, Data Wrangler is a good starting point for novice users, as many data analysts prefer not to spend too much time cleaning data because they must work on the functionality or usage of the data. It is a good fit for users who do not mind paying for a cleaning tool. For users seeking an open source tool, OpenRefine is a good option. For data engineers who have time and adequate skills, Python or R is a good option.

One possible future work is to take each tool and look at how it can help us in the integration of data from diverse sources.



REFERENCES

Batini, C., Lenzerini, M., & Navathe, S. B. (1986). A comparative analysis of methodologies for database schema integration. ACM Computing Surveys, 18(4), 323–364. doi:10.1145/27633.27634

Castanedo, F. (2013). A review of data fusion techniques. The Scientific World Journal. PMID:24288502

Dasu, T., & Johnson, T. (2003). Exploratory data mining and data cleaning (Vol. 479). John Wiley & Sons. doi:10.1002/0471448354

Galhardas, H., Florescu, D., Shasha, D., & Simon, E. (2000). AJAX: an extensible data cleaning tool.

Haghighat, M., Abdel-Mottaleb, M., & Alhalabi, W. (2016). Discriminant Correlation Analysis: Real-Time Feature Level Fusion for Multimodal Biometric Recognition. IEEE Transactions on Information Forensics and Security, 11(9), 1984–1996. doi:10.1109/TIFS.2016.2569061

John, F., & Allison, L. (2016). R and the Journal of Statistical Software. Journal of Statistical Software, 73(2).

Kandel, S., Paepcke, A., Hellerstein, J., & Heer, J. (2011). Wrangler: Interactive visual specification of data transformation scripts. Paper presented at the Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.

Karrar, A. E., & Ali, M. M. (2016). Comparative Analysis of Data Cleaning Tools Using SQL Server and Winpure Tool. International Journal of Computer Applications in Technology, 3(7), 371–377.

Kumar, S., & Nadeem, M. (2008). Extraction, Transformation, Loading (ETL) and Data Cleaning Problems. Journal of Independent Studies and Research on Computing, 6(1).

Lee, M. L., Lu, H., Ling, T. W., & Ko, Y. T. (1999). Cleansing data for mining and warehousing. Paper presented at the 10th International Conference on Database and Expert Systems Applications.

Martinez-Mosquera, D., Luján-Mora, S., López, G., & Santos, L. (2017). Data Cleaning Technique for Security Logs Based on Fellegi-Sunter Theory. Paper presented at the SIGSAND-EuroSymposium, Gdansk, Poland.

Müller, H., & Freytag, J.-C. (2005). Problems, methods, and challenges in comprehensive data cleansing.

Parent, C., & Spaccapietra, S. (1998). Issues and approaches of database integration. Communications of the ACM, 41(5es), 166–178. doi:10.1145/276404.276408

Patel, S. (2012). Requirement to cleanse DATA in ETL process and Why is data cleansing in Business Application? International Journal of Engineering Research and Applications, 2(3).

Porwal, S., & Vora, D. (2013). A Comparative Analysis of Data Cleaning Approaches to Dirty Data. International Journal of Computers and Applications, 62(17).

Rahm, E., & Do, H. H. (2000). Data cleaning: Problems and current approaches. IEEE Data Eng. Bull., 23(4), 3–13.

Vassiliadis, P., Simitsis, A., & Skiadopoulos, S. (2002, November). Conceptual modeling for ETL processes. In Proceedings of the 5th ACM international workshop on Data Warehousing and OLAP (pp. 14-21). ACM.

Verborgh, R., & De Wilde, M. (2013). Using OpenRefine. Packt Publishing Ltd.




Samson Oni is a PhD student in Information Systems at the University of Maryland, Baltimore County (UMBC). He obtained his master's degree in computer science from the University of Maryland, Baltimore County. He worked as a Research Assistant at the Imaging Research Center, UMBC. His previous work includes positions as a technical intern for the Joint Centre for Earth Systems (NASA-JCET) at UMBC and as a full-stack developer for the Department of Education at UMBC. His research focus is in cyber security and data science, and he has carried out several projects in these domains. Currently, he is a research assistant in Information Systems at UMBC, where he is working on semantic web, blockchain, and cybersecurity-related projects. More information can be found at http://www.samdwise.com

Zhiyuan Chen is an Associate Professor in the Department of Information Systems at the University of Maryland, Baltimore County. He received a PhD degree in Computer Science from Cornell University in August 2002. He has more than 10 years of extensive research experience in data privacy, privacy preserving data mining, database management, data science, and cyber security. His main research focus is on algorithms that preserve the privacy of data while allowing accurate analysis of the data. He has published over 40 papers in peer-reviewed journals and publications, over 20 of which are in the area of privacy and security. More information can be found at https://userpages.umbc.edu/~zhchen/

Susan Hoban worked with NASA for over two decades, first as a scientist studying comets and the interstellar medium, then as a STEM Educator. Dr. Hoban develops curriculum for professional development of educators for classroom use and informal education venues. Dr. Hoban specializes in integrating hands-on activities with data collection and analysis to develop the habits-of-mind of STEM. Curriculum modules include, but are not limited to rocketry, environmental education, astronomy & astrobiology, computer modeling, STEM music, and robotics for learners of all ages. Dr. Hoban is currently also working on using analytics for cyber security.

Onimi Jademi is a PhD candidate in the Department of Information Systems at the University of Maryland, Baltimore County (UMBC). Her research focuses on natural language processing and machine learning and their applications, especially in the healthcare domain. She has experience with high-quality qualitative and quantitative research methods.
