Page 1
Detecting Data Errors: Where are we and what needs to be done?*
Presentation By: Sitong Che and Srikar Pyda
Written By: Ziawasch Abedjan, Xu Chu, Dong Deng, Raul Castro Fernandez, Ihab F. Ilyas, Mourad Ouzzani, Paolo Papotti, Michael Stonebraker, Nan Tang
{abedjan,ddong,raulcf,stonebraker}@csail.mit.edu {x4chu,ilyas}@uwaterloo.ca {mouzzani,ntang}@qf.org.qa [email protected]
Page 2
Introduction
• A multitude of data-cleaning tools exist to detect and potentially repair errors
• It's better to think of data-cleaning solutions as being tailored to detecting particular categories of errors rather than detecting all potential errors
• Data cleaning is important for enterprises because data-centric approaches are becoming critical for innovation in business and science
• Different types of errors often exist in the same data-set
  • Requires cleaning with multiple tools in order to detect and repair the variety of nuances in the errors
Page 3
Overview: Pragmatic Questions
Are these tools robust enough to capture most errors in real-world datasets?
What is the best strategy to holistically run multiple tools to optimize the detection effort?
Page 4
Data Cleaning Solution Categories
• Rule-based detection algorithms: Users specify a collection of data-cleaning rules and the tool finds any violations within the data-set
  • NADEEF
    • "not null" constraints
    • Multi-attribute functional dependencies (FDs)
    • User-defined functions
• Pattern enforcement and transformation tools: Pattern enforcement tools discover both semantic and syntactic patterns in the data and use them to detect errors. Transformation tools can be used to change the data representation and expose additional patterns within the data-set.
  • Syntactic: OPENREFINE and TRIFACTA
  • Semantic: KATARA
• Quantitative error detection: These algorithms expose outliers and other statistical glitches within the data.
• Record linkage and de-duplication algorithms: De-duplication tools detect duplicate data records which refer to the same entity. Conflicting values can then be found → further indicating errors.
  • TAMR
• Are these categories sufficient? Do they overlap?
  • The authors admit that their categorization does not perfectly partition errors
  • The authors are attempting to categorize the detectable errors found in real-world data-sets
Page 5
Data-CleaningChallenges
• Synthetic data and errors: Most cleaning algorithms are evaluated on data, either synthetic or real-world, with synthetically injected errors
  • There is both a lack of real data-sets with appropriate ground truth and a lack of a widely accepted benchmark of data-cleaning quality
• Combination of error types and tools: Real-world data often has multiple kinds of errors
  • An error can often be found by a multitude of tools
  • Conflicting duplicate records and integrity constraints
  • Overlap??
• Human involvement: Enterprises require budgets to facilitate human power, so having an ideal ordering for the application of data-cleaning tools is key to minimizing human intervention
  • Verify detected errors
  • Specify cleaning rules
  • Provide feedback for machine learning
Page 6
Overview: Methodology
• Collection of real-world data with either full or partial ground truth
  • Represents the kinds of dirty data found in practice
  • The authors can easily judge performance because the ground truth is known
• Interested in automatic error discovery as opposed to automatic repair, because auto-repair is not pragmatic in practice
• Report results in terms of precision and recall against the ground truth
  • Upper-bound recall: estimate of the maximum recall of a tool if it had been configured by an oracle
    • A "perfect configuration" of data-cleaning rules to enable optimal error detection
    • Use ground truth to estimate upper-bound recall: classify the remaining undetected errors by type
    • Any error whose type can be cleaned by a tool should be counted towards its recall
Page 7
Evaluation Questions
• What is the precision and recall of each tool?
• How prevalent are the errors which each data-cleaning tool is designed to detect?
• How many errors in the datasets are detectable by applying all tools combined?
• Since human-in-the-loop is a well-accepted paradigm, how many false positives are there? These cause a drain on the human effort budget and can cause a cleaning effort to fail.
• Is there a strategy to minimize human effort by leveraging the interaction among tools?
Page 8
Main Findings
• Conclusion 1: There is no single dominant tool, because data-cleaning algorithms are generally tailored towards particular types of errors
  • A holistic "composite" strategy must be used because each data-cleaning tool is individually designed to detect selective genres of errors
• Conclusion 2: Assessing the overlap of errors detected by the multitude of data-cleaning tools helps order their application to minimize false positives (and thus user engagement)
  • The ordering strategy must be specific to the data-set because of variance in structural properties and patterns
• Conclusion 3: The percentage of errors that can be found by the combined, ordered application of all tools is significantly less than 100%
  • Additional errors are discussed later in the experiments; researchers need to develop new ways of finding these "unknown categories" of data errors (ones which can be spotted by humans but not by the current cleaning tools)
Page 9
Data Errors and Data Sets
• Data error: Given a dataset, a data error is a cell value which differs from its given ground truth
• Outlier errors: Cell values which deviate from the distribution over the range of values in a column of a table
• Duplicate errors: Distinct database entries/records which refer to the same real-world entity
  • If the two entries' attribute values do not match, that could indicate an error
• Rule violation errors: Cell values that violate any kind of integrity constraint
  • Not Null & uniqueness constraints
• Pattern violation errors: Values that violate syntactic and semantic constraints
  • Alignment, formatting, misspelling, and semantic data types
Page 10
Data Sets Overview
• The four error types are generally prevalent across all data-sets
  • The Animal data-set does not have outliers
  • MIT VPF and BlackOak are the only data-sets with duplicates
• The ratio of erroneous cells in each data-set ranges from 0.1% to 34%
• Structural properties:
  • Rayyan Bib has the highest percentage of errors while Animal has the lowest
  • Merck has the greatest number of attributes
  • BlackOak has the greatest number of entries
Page 11
MIT-VPF
• The MIT Office of the Vice President for Finance's (VPF) procurement database, which contains information about vendors and individuals that supply MIT with products and supplies
• Structural details:
  • Execute purchase order: a new entry with details about the contracting party is added to the vendor master data-set whenever MIT buys a product
    • Identification information (name, address, phone number, business codes)
  • The ongoing process of adding entries creates a unique problem of duplicates and other data errors (theory)
    • Inconsistent formatting (address, phone number, company names)
    • Contact information may change over time
• Ground truth: Employees of VPF manually curated a random sample of 13,603 records (half of the data-set) and marked erroneous fields (empirics)
  • Address and company names: missing street numbers, wrong capitalization, and attribute values in the wrong column
Page 12
Merck
• The Merck dataset describes IT services and software systems within the company that are partially managed by third parties; used to optimize the downsizing of services
• Structural details: Each system is characterized by location, number of end users, and level of technical support
  • Greatest number of attributes (68), but very sparse
• Ground truth: Merck provided the custom cleaning script they used to produce a cleaned version of the data-set
  • Applies various data transformations that normalize columns and allow for uniform value representation
  • The authors utilized the script to formulate rules and transformations for the cleaning tools
    • There are many hidden function calls that are implicitly invoked and change data values
Page 13
Animal
• Animal data-set provided by scientists at UC Berkeley about the effects of firewood cutting on small terrestrial vertebrates
• Structural properties:
  • Each entry contains information about the time and location of capture of an animal, in addition to its properties: tag number, sex, weight, species, and age group
  • Each record was initially transcribed on paper and then manually entered into spreadsheets (data from 1993-2012)
• Ground truth: The scientists manually cleaned the data-set and identified several hundred erroneous cells
  • Errors:
    • Shifted fields
    • Wrong numeric values
Page 14
Rayyan-Bib
• Rayyan is a system built at QCRI to assist scientists in the production of systematic reviews
  • Literature reviews which identify and synthesize all research evidence related to a nuanced research question
• Structural properties:
  • Users consolidate search results into long lists of references to studies which they feel are relevant to answering the question
    • Searching multiple databases using multiple queries
    • Users can manually manipulate citations, so the data is prone to error
  • Entries have many attributes: article_title, journal_title, journal_abbreviation, etc.
• Ground truth: The authors manually checked a sample of 1,000 references from Rayyan's database
  • Many missing values and inconsistencies in the data
    • journal_title and journal_abbreviation are often switched
    • Author names are sometimes found in journal_title
Page 15
BlackOak
• BlackOak Analytics is a company which provides entity-resolution solutions
• Structural properties: Provided an anonymized address dataset and a dirty version which they use for evaluation
  • Ground truth is given because it's a synthetic data-set
  • Errors are randomly distributed
• Errors:
  • Spelling of values
  • Formatting of values
  • Completeness
  • Field separation
• The authors included this data-set specifically to analyze the difference in error-detection performance between real-world and synthetic datasets
Page 16
Data Cleaning Tools
• Selected data-cleaning tools which cover all four error types
  • Multiple tools sometimes focus on different subtypes of a given error type
• Iterative fine-tuning process for each tool
  • Compare detected errors with ground truth in order to adjust the tool's configuration or rules and improve performance
  • Detectable errors are counted towards the recall upper bound
Page 17
Strategy 1: Outlier Detection
• Detect data values which do not follow the statistical distribution of the overall data
• Tool 1: dBoost
  • Unique: Decomposes run-on data types (e.g. date) into their constituent pieces (month, year, day)
    • Attributes which are wrapped in more complex data can be individually analyzed for outliers
  • Histograms create a distribution of the data without any a priori assumption by counting the occurrences of unique data values
  • Gaussian and GMM assume that each value was drawn from a normal distribution with a given mean and standard deviation, or from a multivariate Gaussian distribution, respectively
• Optimal parameter configuration:
  • Number of bins & their widths for histograms
  • Mean & standard deviation for Gaussian and GMM
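The histogram and Gaussian strategies can be sketched in a few lines. This is an illustrative simplification, not dBoost's actual implementation; the function names, thresholds, and data are invented for the example.

```python
import statistics

def gaussian_outliers(values, n_stdev=2.0):
    # Flag values more than n_stdev standard deviations from the mean;
    # a simplified sketch of the Gaussian strategy, not dBoost's code.
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return [v for v in values if abs(v - mean) > n_stdev * stdev]

def histogram_outliers(values, min_fraction=0.2):
    # Flag values whose frequency falls below a fraction of the column size,
    # mimicking the histogram strategy (no a priori distribution assumed).
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    threshold = min_fraction * len(values)
    return [v for v, c in counts.items() if c < threshold]

col = [10, 11, 10, 11, 10, 11, 10, 9999]
print(gaussian_outliers(col))    # [9999]
print(histogram_outliers(col))   # [9999]
```

Note the parameter sensitivity the slide mentions: with a larger n_stdev the extreme value itself inflates the standard deviation enough to mask it, which is exactly why the authors tune these parameters per data-set.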
Page 18
Strategy 2: Rule-based Error Detection
• Rely on data-quality rules to detect errors; rules are expressed using integrity constraints
  • Functional dependencies
  • Denial constraints
• Violation: A collection of cells that do not conform to a given integrity constraint
  • At least one cell involved in the violation must be changed to resolve it
• Tool 2: DC-Clean
  • Focuses on denial constraints
  • The authors design a collection of DCs to capture the semantics of the data
    • "If there are two captures of the same animal indicated by the same tag number, then the first capture must be marked as original"
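As a rough sketch (not DC-Clean's actual API; the predicate, records, and function names are invented), a denial constraint can be treated as a predicate over pairs of tuples that must never hold, with violations collected by a pairwise scan:

```python
def dc_violations(rows, predicate):
    # Return all pairs of rows that jointly violate the denial constraint.
    violations = []
    for i, r1 in enumerate(rows):
        for r2 in rows[i + 1:]:
            if predicate(r1, r2):
                violations.append((r1, r2))
    return violations

# Toy data in the spirit of the Animal data-set rule quoted above: two
# captures sharing a tag number where the earlier one is not marked
# "original" constitute a violation.
captures = [
    {"tag": 17, "date": "1994-05-01", "status": "original"},
    {"tag": 17, "date": "1995-06-12", "status": "recapture"},
    {"tag": 23, "date": "1994-05-01", "status": "recapture"},
    {"tag": 23, "date": "1996-03-09", "status": "recapture"},
]

def same_tag_first_not_original(r1, r2):
    first, later = sorted((r1, r2), key=lambda r: r["date"])
    return first["tag"] == later["tag"] and first["status"] != "original"

print(dc_violations(captures, same_tag_first_not_original))
```

Here only the two tag-23 captures violate the rule, and either cell could be the erroneous one, which is why a violation identifies a set of suspect cells rather than a single error.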
Page 19
Strategy 3: Pattern-based Detection
• Tool 3: OPENREFINE: Open-source data wrangling tool
• Tool 4: TRIFACTA: Community version of a commercial data wrangling tool
• OPENREFINE and TRIFACTA focus on syntactic patterns: they provide exploration techniques to discover data inconsistencies
• Tool 5: KATARA: Semantic pattern discovery and detection tool
  • Focuses on semantic patterns matched against a knowledge base
• ETL (Extract, Transform, Load) tools: pull data out of one database and place it in another
  • Tool 6: KNIME
  • Tool 7: PENTAHO
Page 20
Tool 3: OPENREFINE
• OPENREFINE is an open-source wrangling tool that can digest data in multiple formats and facilitates data exploration
• Faceting operation: Lets users look at different kinds of aggregated data; resembles a grouping operation
  • The user specifies one column for faceting and OPENREFINE generates a widget that shows all distinct values & their number of occurrences
• Filtering operation:
  • The user can specify an expression on multiple columns and OPENREFINE generates the widget based on values of the expression
  • The user can then select one or more values in the widget and OPENREFINE filters out rows which do not contain the selected values
• Data cleaning uses an editing operation
  • Edits one cell at a time
  • If you edit a text facet, all cells consistent with that facet will update
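The faceting operation is essentially a distinct-value count over one column; a minimal stand-in (not OPENREFINE's real interface, and the rows are invented) looks like:

```python
from collections import Counter

def text_facet(rows, column):
    # Each distinct value with its occurrence count, so rare variants
    # such as misspellings surface immediately.
    return Counter(row[column] for row in rows)

rows = [
    {"city": "Boston"}, {"city": "Boston"},
    {"city": "Bston"},  {"city": "Cambridge"},
]
facet = text_facet(rows, "city")
print(facet.most_common())  # "Bston" stands out with a count of 1
```

The edit-by-facet behaviour described above corresponds to rewriting every row whose cell matches the selected facet value in one step.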
Page 21
Tool 4: TRIFACTA
• TRIFACTA is the commercial descendant of Data Wrangler: it predicts and applies various syntactic data transformations for data preparation and cleaning
  • Can apply business & standardization rules through available transformation scripts
• Applies a frequency analysis to each column to identify the most and least frequent values
  • Shows attribute values that deviate strongly from the value distribution of the specific attribute
  • Maps each attribute to its most prominent data type and identifies values that do not match
Page 22
Tool 5: KATARA
• KATARA relies on external knowledge bases, such as Yago, to detect & correct errors which violate a semantic pattern
• Identifies the type of a column and the relationship between two columns in the data-set using a knowledge base
  • The type of column A in a table might correspond to Country in the knowledge base Yago, & the relationship between columns A and B might correspond to the predicate HasCapital in Yago
• Based on the discovered types and relationships, KATARA validates values using the knowledge base and human experts
  • Example: A value of "California" in column A will be marked as an error because it is not a country in Yago
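The validation step can be illustrated with a toy knowledge base. The dictionaries below are hand-made stand-ins for Yago, and the function is invented for this sketch, not KATARA's real interface:

```python
# Stand-in knowledge base: a type (Country) and a binary predicate (HasCapital).
COUNTRIES = {"France", "Italy", "Japan"}
HAS_CAPITAL = {"France": "Paris", "Italy": "Rome", "Japan": "Tokyo"}

def katara_errors(rows):
    # Once column A is typed as Country and the A->B relationship as
    # HasCapital, every row is checked against the knowledge base.
    errors = []
    for a, b in rows:
        if a not in COUNTRIES:
            errors.append((a, "not a Country in the KB"))
        elif HAS_CAPITAL[a] != b:
            errors.append((b, f"not the capital of {a}"))
    return errors

rows = [("France", "Paris"), ("California", "Sacramento"), ("Italy", "Milan")]
print(katara_errors(rows))
```

In the real system, values absent from the knowledge base are handed to human experts rather than being rejected outright, since the KB itself is incomplete.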
Page 23
Tool 6: PENTAHO
• PENTAHO provides a graphical interface where data wrangling can be implemented as a directed graph of ETL (Extract, Transform, Load) operations
  • Any data manipulation or rule validation can be added as a node in the ETL pipelines
• Executes multiple ETL workflows to clean/curate data, BUT rules/procedures must be specified by the user
• Provides routines for string transformation and single-column constraint validation
Page 24
Tool 7: KNIME
• KNIME focuses on workflow authoring and encapsulates data-processing tasks, such as curation and machine-learning-based functionality, in composable nodes
• Although KNIME executes multiple ETL workflows, similar to PENTAHO, to clean/curate data, rules/procedures must be specified by the user
  • Users must know exactly what kinds of rules and patterns need to be verified
• Unlike OPENREFINE & TRIFACTA, PENTAHO and KNIME do not provide ways to automatically display outliers or detect type mismatches
Page 25
Strategy 4: Duplicate Detection
• If two records refer to the same real-world entity but have differing attribute values, there is a strong chance one of the two values for each respective record is an error
• Tool 8: TAMR (commercial descendant of the Data Tamer system)
  • TAMR is a tool with industrial-strength data-integration algorithms for record linkage and schema mapping
  • Premised on machine learning models that learn duplicate features
    • Expert sourcing
    • Similarity metrics
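A bare-bones version of similarity-based duplicate detection (nothing like TAMR's learned models; the records, field names, and threshold are invented for illustration) might look like:

```python
import difflib

def likely_duplicates(records, threshold=0.85):
    # Pairs of records whose case-normalized name similarity exceeds the
    # threshold are proposed as duplicates; illustrative only.
    pairs = []
    for i, r1 in enumerate(records):
        for r2 in records[i + 1:]:
            sim = difflib.SequenceMatcher(
                None, r1["name"].lower(), r2["name"].lower()).ratio()
            if sim >= threshold:
                pairs.append((r1, r2, round(sim, 2)))
    return pairs

records = [
    {"name": "Acme Corp", "phone": "617-555-0100"},
    {"name": "ACME Corp.", "phone": "617-555-0199"},
    {"name": "Widgets Inc", "phone": "617-555-0111"},
]
for r1, r2, sim in likely_duplicates(records):
    # A matched pair whose phone numbers differ hints at an error in one.
    print(r1["name"], "~", r2["name"], "conflicting phone:", r1["phone"] != r2["phone"])
```

The conflicting phone numbers in the matched pair illustrate the slide's point: once duplicates are linked, any disagreeing attribute value is a candidate error.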
Page 26
Combination of Multiple Tools
• Problem: How does a user properly combine multiple independent data-cleaning tools?
• Option 1: Run all tools and apply a union or min-k strategy
• Option 2: Have users manually check a sample of detected errors, which can be used to guide the prioritization of data-cleaning operations
Page 27
Option 1: Union All and Min-k
• Union all
  • Takes the union of the errors emitted by all tools
• Min-k
  • Keeps those errors detected by at least k tools, while excluding those detected by fewer than k tools
  • No need to keep cleaning the data-set with new techniques if the maximum performance for error detection has been reached
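Treating each tool's output as a set of flagged cells, both strategies reduce to simple set operations (the cell identifiers below are made up):

```python
from collections import Counter

def union_all(tool_outputs):
    # Union of the errors emitted by all tools.
    result = set()
    for cells in tool_outputs:
        result |= cells
    return result

def min_k(tool_outputs, k):
    # Keep only cells flagged by at least k tools; min-1 equals union all.
    votes = Counter(cell for cells in tool_outputs for cell in cells)
    return {cell for cell, n in votes.items() if n >= k}

outputs = [{"r1.a", "r2.b"}, {"r2.b", "r3.c"}, {"r2.b"}]
print(sorted(union_all(outputs)))  # ['r1.a', 'r2.b', 'r3.c']
print(sorted(min_k(outputs, 2)))   # ['r2.b']
```

The example already shows the trade-off discussed later: union all maximizes recall, while raising k keeps only the consensus cells and trades recall for precision.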
Page 28
Ordering Based on Precision
• Problems with Option 1 (exhaustive union):
  • Expensive, because it requires a massive amount of human effort to validate a large number of candidate errors
    • BlackOak data-set: A user would have to verify 982,000 cells identified as possibly erroneous to discover 382,928 actual errors
  • Results from tools with poor error-detection performance on a particular data-set should not be evaluated
• Alternative: A sampling-based method to select the order in which data-cleaning strategies are applied to the data-set
Page 29
Ordering Based on Precision
• Cost model: Although the performance of a tool can be measured by precision and recall in detecting errors, precision is the better proxy for a data-cleaning tool's error-detection performance
  • Recall can only be computed if all of the errors in the data are known (full ground truth); this is nearly impossible when we execute error-detection strategies on new data-sets
  • Precision is easy to estimate
• Assume C is the cost of having a human check a detected error and V is the value of identifying a real error
  • Value must be higher than cost!
  • P * V > (P + N) * C, where P is the number of correctly detected errors and N is the number of erroneously detected errors (false positives)
  • Rearranging: P / (P + N) > C / V
  • Set the threshold: σ = C / V
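Plugging made-up numbers into the inequality shows how σ works as a go/no-go test (the function and values are invented for illustration):

```python
def worth_running(precision, cost_per_check, value_per_error):
    # Run a tool only if its estimated precision exceeds sigma = C / V,
    # which is the rearranged form of P * V > (P + N) * C.
    sigma = cost_per_check / value_per_error
    return precision > sigma

# If checking a candidate costs 1 unit and a confirmed error is worth 5,
# sigma = 0.2, so any tool estimated above 20% precision pays for itself.
print(worth_running(0.35, cost_per_check=1, value_per_error=5))  # True
print(worth_running(0.15, cost_per_check=1, value_per_error=5))  # False
```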
Page 30
Ordering Based on Precision
• Any tool with a precision below σ should not run, because the cost of checking is greater than the value of accurately identifying a data error
  • The ratio is domain dependent (unknown in most cases); it is natural to have large V values for highly valuable data
  • If V is very large, all data-cleaning tools will be considered because of the correspondingly small threshold value, which boosts recall
  • If C is high and dominates the ratio, we save cost by only validating tools that are very precise; this trades off recall
• The authors estimate the precision of each data-cleaning tool on a given data-set by checking a random sample of its detected errors
• Why not run all the tools with a precision higher than the threshold and evaluate the union of all their detected error sets??
  • Tools are not independent, and sets of detected errors may (and often do) overlap
  • Some tools may have extremely high precision, but all of the errors they detect may be covered by tools that have even higher precision
Page 31
Ordering Based on Precision
• Maximum-entropy-based order selection: Following the maximum entropy principle, the authors design an algorithm which assesses the estimated precision of each data-cleaning tool
  • Estimates the overlap between tool results
  • Picking the tool with the highest precision (the fraction of its detections that are true errors) reduces entropy the most, because high entropy corresponds to uncertainty
Page 32
Ordering Based on Precision: Algorithm
1. Run all data-cleaning tools on the entire data-set and return the detected errors
2. Estimate the precision of each tool by verifying a random sample of its detected errors with a human expert
3. Pick the tool with the highest estimated precision among all data-cleaning tools not yet considered (to maximize entropy reduction) & verify those of its detected errors on the complete data-set which have not been verified before
4. Since errors validated in Step 3 may have been detected by other tools, update the precision of the other tools and go to Step 3 to pick the next tool, as long as a data-cleaning strategy exists with an estimated precision > σ

Empirics: Regardless of each tool's individual performance, the proposed order reduces the cost of manual verification with only a marginal reduction in recall.
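The four steps can be sketched as a greedy loop. In the paper a human verifies a 5% sample of each tool's output; purely for this sketch, the ground truth stands in for the human, and all names and data are invented:

```python
def order_by_precision(tool_outputs, ground_truth, sigma=0.2):
    # Greedy ordering: repeatedly pick the tool whose not-yet-verified
    # detections have the highest precision, verify them, re-estimate the
    # rest, and stop once no remaining tool clears the threshold sigma.
    remaining = {name: set(cells) for name, cells in tool_outputs.items()}
    verified = set()
    order = []
    while remaining:
        def precision(cells):
            fresh = cells - verified  # only unverified detections count
            if not fresh:
                return 0.0
            return len(fresh & ground_truth) / len(fresh)
        best = max(remaining, key=lambda n: precision(remaining[n]))
        if precision(remaining[best]) <= sigma:
            break
        order.append(best)
        verified |= remaining.pop(best)
    return order

tools = {"dBoost": {"c1", "c2", "c9"}, "DC-Clean": {"c1", "c3"}, "noisy": {"c7", "c8"}}
truth = {"c1", "c2", "c3"}
print(order_by_precision(tools, truth))  # ['DC-Clean', 'dBoost']
```

Note how the "noisy" tool is dropped: after the overlapping cells are verified, its remaining detections fall below σ, which is exactly the cost saving the empirics line describes.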
Page 33
Evaluation Metrics
• D: dataset
• G: perfectly clean version of the dataset
• E = diff(G, D): the set of actual errors
• T(D): the set of cells marked as errors by tool T
• Precision: P = |T(D) ∩ E| / |T(D)|
• Recall: R = |T(D) ∩ E| / |E|
• Aggregated F-measure: F = 2(R * P) / (R + P)
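The three metrics compute directly from the sets defined above; the cell identifiers below are invented for illustration:

```python
def precision_recall_f(detected, true_errors):
    # detected = T(D), true_errors = E; cells are (row, column) pairs here.
    tp = len(detected & true_errors)
    p = tp / len(detected) if detected else 0.0
    r = tp / len(true_errors) if true_errors else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

E = {(1, "name"), (2, "phone"), (5, "city")}   # E = diff(G, D)
T = {(1, "name"), (2, "phone"), (3, "zip")}    # cells flagged by tool T
p, r, f = precision_recall_f(T, E)
print(round(p, 2), round(r, 2), round(f, 2))   # 0.67 0.67 0.67
```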
Page 34
Usage of Tools
• dBoost: applied three algorithms (Gaussian, histogram, GMM) with the parameters that make F highest
• DC-Clean: existing rules + manually constructed FD rules based on obvious n-to-1 relationships
• OPENREFINE: facet mechanism + formatting and single-column rules
• TRIFACTA: outlier detection and type verification + formatting and single-column rules
• KATARA: manually constructed & existing knowledge bases
• PENTAHO & KNIME: model each transformation and validation routine as a workflow node in the ETL process
• TAMR: iterate training until the precision and recall become stable
Page 36
Data quality rules defined on each dataset
Page 37
User Involvement
• Set rules
• Perform data exploration using OPENREFINE and TRIFACTA
• Validate the detected errors
• Go through the remaining errors and try to categorize them
Page 38
Individual Effectiveness
Page 39
Individual Effectiveness
• dBoost: useless for Animal; good for BlackOak
• DC-Clean: good for Animal and Merck; bad for MIT VPF
• OPENREFINE: bad for Animal; top 2 for the others
• TRIFACTA: bad recall for Animal; top 2 for the others
• KATARA: good for BlackOak; bad for MIT VPF
• PENTAHO & KNIME: good in general
• TAMR: found all duplicates for MIT VPF, and most of the duplicates for BlackOak
Page 42
Tool Combination Effectiveness
• Union All: High recall but low precision (lots of false positives)
Page 43
Min-K
• Requires at least k algorithms to agree on an error
• (k = 1) is equivalent to union all
• As k increases, precision increases and recall decreases
• Main problem: how to pick k
Page 44
Ordering Based on Benefit and User Validation
• Randomly sample 5% of the detected errors for each tool and compare them with ground truth for precision estimation
• Run tools in precision order (dynamically update the precision estimates and drop tools that do not pass the threshold)
• Baseline: simple union
• Threshold: σ ∈ 0.1-0.5 (for precision)
• As the threshold increases, precision increases and false positives decrease significantly, while true positives decrease only a little, causing recall to decrease slightly
Page 45
Ordering Strategy results
Page 46
Recall Upper-bound
• Extra rules were found by manually going through the remaining errors
Page 47
Domain Specific Tools
• For MIT VPF and BlackOak: ADDRESSCLEANER
• Applied on a 1,000-record sample
• Found 2 & 13 new errors; recall: 0.93 → 0.95 and 0.999 → 0.999, respectively
Page 48
Enrichment
• Manually add more attributes to the original dataset (only those that did not introduce additional duplicate rows)
• DC-Clean & TAMR benefited from the added attributes
Page 49
Conclusion
• There is no single dominant tool for the various datasets and diversified types of errors. Single tools achieved on average 47% precision and 36% recall, showing that a combination of tools is needed to cover all the errors.
• Picking the right order in which to apply the tools can improve precision and help reduce the cost of validation by humans.
• Domain-specific tools can achieve high precision and recall compared to general-purpose tools, achieving on average 71% precision and 64% recall, but are limited to certain domains.
• Rule-based systems and duplicate detection benefited from data enrichment. In our experiments, we achieved an improvement of up to 10% more precision and 7% more recall.