Detecting Data Errors: Where are we and what needs to be done?*
Presentation By: Sitong Che and Srikar Pyda
Written By: Ziawasch Abedjan, Xu Chu, Dong Deng, Raul Castro Fernandez, Ihab F. Ilyas, Mourad Ouzzani, Paolo Papotti, Michael Stonebraker, Nan Tang
{abedjan, ddong, raulcf, stonebraker}@csail.mit.edu {x4chu, ilyas}@uwaterloo.ca {mouzzani, ntang}@qf.org.qa [email protected]


Jun 08, 2020




Introduction

• A multitude of data-cleaning tools exist to detect and potentially repair errors
• It's better to think of data-cleaning solutions as being tailored to detecting particular categories of errors rather than detecting all potential errors
• Data cleaning is important for enterprises because data-centric approaches are becoming critical for innovation in business and science
• Different types of errors often exist in the same data-set
  • Detecting and repairing this variety of errors requires cleaning with multiple tools


Overview: Pragmatic Questions

Are these tools robust enough to capture most errors in real-world data sets?

What is the best strategy to holistically run multiple tools to optimize the detection effort?


Data Cleaning Solution Categories

• Rule-based detection algorithms: Users can specify a collection of data-cleaning rules and the tool will find any violations within the data-set
  • NADEEF
  • "not null" constraints
  • Multi-attribute functional dependencies (FDs)
  • User-defined functions
• Pattern enforcement and transformation tools: Pattern enforcement tools discover both semantic and syntactic patterns in the data and use them to detect errors. Transformation tools can be used to change the data representation and expose additional patterns within the data-set.
  • Syntactic: OPENREFINE and TRIFACTA
  • Semantic: KATARA
• Quantitative error detection: These algorithms expose outliers and other statistical glitches within the data.
• Record linkage and de-duplication algorithms: De-duplication tools detect duplicate data records which refer to the same entity. Conflicting values between the duplicates can then be found, further indicating an error.
  • TAMR
• Are these categories sufficient? Overlap?
  • The authors admit that their categorization does not perfectly partition errors
  • The authors are attempting to categorize detectable errors from real-world data-sets


Data-Cleaning Challenges

• Synthetic data and errors: Most cleaning algorithms are evaluated on data, either synthetic or real-world, with synthetically injected errors
  • There is a lack of both real data-sets with appropriate ground truth and a widely accepted benchmark of data-cleaning quality
• Combination of error types and tools: Real-world data often has multiple kinds of errors.
  • An error can often be found by more than one tool
  • Conflicting duplicate records and integrity constraints
  • Overlap??
• Human involvement: Enterprises need a budget for human effort, so an ideal ordering for the application of data-cleaning tools is key to minimizing human intervention. Humans are needed to:
  • Verify detected errors
  • Specify cleaning rules
  • Provide feedback for machine learning


Overview: Methodology

• Collection of real-world data with either full or partial ground truth
  • Represents the kinds of dirty data found in practice
  • The authors can easily judge performance because the ground truth is known
• Interested in automatic error discovery as opposed to automatic repair, because automatic repair is rarely practical.
• Report results in terms of precision and recall with respect to the ground truth.
• Upper-bound recall: an estimate of the maximum recall of a tool if it had been configured by an oracle
  • "Perfect configuration" of data-cleaning rules to enable optimal error detection
  • Use ground truth to estimate upper-bound recall: classify by type the remaining errors that are not detected
  • Any error whose type can be cleaned by a tool should be counted towards its recall.


Evaluation Questions

• What is the precision and recall of each tool? How prevalent are the errors which the data-cleaning tool is designed to detect?
• How many errors in the data sets are detectable by applying all tools combined?
• Since human-in-the-loop is a well accepted paradigm, how many false positives are there? These drain the human effort budget and can cause a cleaning effort to fail.
• Is there a strategy to minimize human effort by leveraging the interaction among tools?


Main Findings

• Conclusion 1: There is no single dominant tool because data-cleaning algorithms are generally tailored towards particular types of errors
  • A holistic "composite" strategy must be used because each data-cleaning tool is individually designed to detect selective genres of errors
• Conclusion 2: Assessing the overlap of the errors detected by the different data-cleaning tools makes it possible to order their application so as to minimize false positives (and hence user engagement)
  • The ordering strategy must be specific to the data-set because of variance in structural properties and patterns
• Conclusion 3: The percentage of errors that can be found by the combined ordered application of all tools is significantly less than 100%.
  • Additional errors are discussed later in the experiments. Researchers need to develop new ways of finding these "unknown categories" of data errors (ones which can be spotted by humans but not by the current cleaning tools)


Data Errors and Data Sets

• Data error: Given a data set, a data error is a cell value which is different from its given ground truth
• Outlier errors: Cell values which deviate from the distribution of values in a column of a table.
• Duplicate errors: Distinct database entries/records which refer to the same real-world entity.
  • If the two entries' attribute values do not match, that could indicate an error.
• Rule violation errors: Cell values that violate any kind of integrity constraint
  • Not Null & Uniqueness constraints
• Pattern violation errors: Values that violate syntactic and semantic constraints
  • Alignment, formatting, misspelling, and semantic data types


Data Sets Overview

• The four error types are generally prevalent across all data-sets
  • The Animal data-set does not have outliers
  • MIT VPF and BLACKOAK are the only data-sets with duplicates
• The ratio of erroneous cells in each data-set ranges from 0.1% to 34%
• Structural properties:
  • Rayyan Bib has the highest percentage of errors while Animal has the lowest
  • Merck has the greatest number of attributes
  • BlackOak has the greatest number of entries


MIT-VPF

• MIT Office of the Vice President for Finance's (VPF) procurement database, which contains information about vendors and individuals that supply MIT with products and supplies
• Structural details:
  • Executing a purchase order: whenever MIT buys a product, a new entry with details about the contracting party is added to the vendor master data-set
    • Identification information (name, address, phone number, business codes)
  • The ongoing process of adding entries creates a unique problem of duplicates and other data errors (theory)
    • Inconsistent formatting (address, phone number, company names)
    • Contact information may change over time
• Ground truth: Employees of VPF manually curated a random sample of 13,603 records (half of the data-set) and marked erroneous fields (empirics)
  • Address and company names: missing street numbers, wrong capitalization, and attribute values in the wrong column


Merck

• The Merck data-set describes IT services and software systems within the company that are partially managed by third parties; it is used for optimization, such as downsizing services
• Structural details: Each system is characterized by location, number of end users, and level of technical support
  • Greatest number of attributes (68), but very sparse
• Ground truth: Merck provided the custom cleaning script they used to produce a cleaned version of the data-set
  • Applies various data transformations that normalize columns and allow for uniform value representation
  • The authors utilized the script to formulate rules and transformations for cleaning tools
  • There are many hidden function calls that are implicitly invoked and change data values


Animal

• Animal data-set provided by scientists at UC Berkeley about the effects of firewood cutting on small terrestrial vertebrates
• Structural properties:
  • Each entry contains information about the time and location of capture of an animal, in addition to its properties: tag number, sex, weight, species, and age group
  • Each record was first transcribed on paper and then manually entered into spreadsheets (data from 1993 to 2012)
• Ground truth: The scientists manually cleaned the data-set and identified several hundred erroneous cells
• Errors:
  • Shifted fields
  • Wrong numeric values


Rayyan-Bib

• Rayyan is a system built at QCRI to assist scientists in the production of systematic reviews
  • Literature reviews which identify and synthesize all research evidence related to a nuanced research question
• Structural properties:
  • Users consolidate search results into long lists of references to studies which they feel are relevant to answering the question
    • Searching multiple databases using multiple queries
    • Users can manually manipulate citations, so the data is prone to error
  • Entries have a lot of attributes: article_title, journal_title, journal_abbreviation, etc.
• Ground truth: The authors manually checked a sample of 1,000 references from Rayyan's database
  • Many missing values and inconsistencies in the data
    • journal_title and journal_abbreviation are often switched
    • Author names are sometimes found in journal_title


BlackOak

• BlackOak Analytics is a company which provides entity resolution solutions
• Structural properties: Provided an anonymized address data-set and a dirty version which they use for evaluation
  • Ground truth is given because it's a synthetic data-set
  • Errors are randomly distributed
• Errors:
  • Spelling of values
  • Formatting of values
  • Completeness
  • Field separation
• The authors included this data-set specifically to analyze the difference in error detection performance between real-world and synthetic data-sets


Data Cleaning Tools

• Selected data-cleaning tools which together cover all four error types
  • Multiple tools sometimes focus on different subtypes of a given error type
• Iterative fine-tuning process for each tool
  • Compare detected errors with ground truth and adjust the tool configuration or rules to improve performance
  • Detectable errors are counted towards the recall upper bound


Strategy 1: Outlier Detection

• Detect data values which do not follow the statistical distribution of the overall data
• Tool 1: DBOOST
  • Unique: Decomposes run-on data types (date) into their constituent pieces (m, y, d)
    • Attributes which are wrapped in more complex data can be analyzed individually for outliers
  • Histograms create a distribution of the data without any a priori assumption by counting the occurrences of unique data values
  • Gaussian and GMM assume that each value was drawn from a normal distribution with a given mean and standard deviation, or from a multivariate Gaussian distribution, respectively.
  • Optimal parameter configuration:
    • Number of bins & their widths for histograms
    • Mean & standard deviation for Gaussian and GMM
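The Gaussian and histogram ideas on this slide can be sketched in a few lines. This is only an illustration under stated assumptions, not DBOOST's implementation or API: the function names, the `n_sigma` cutoff, and the `min_share` cutoff are hypothetical, and the mean and standard deviation are simply estimated from the column itself.

```python
from collections import Counter
from statistics import mean, pstdev

def gaussian_outliers(values, n_sigma=3.0):
    """Indices of values more than n_sigma standard deviations from the mean.

    Assumes the column is roughly normal; n_sigma is a hypothetical knob.
    """
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing can deviate
    return [i for i, v in enumerate(values) if abs(v - mu) > n_sigma * sigma]

def histogram_outliers(values, min_share=0.05):
    """Indices of values whose distinct-value count is a tiny share of the column.

    No distributional assumption: it only counts occurrences of each value.
    """
    counts = Counter(values)
    total = len(values)
    return [i for i, v in enumerate(values) if counts[v] / total < min_share]

weights = [20, 21, 19, 22, 20, 21, 500]   # made-up weights, one implausible
sexes = ["M", "F"] * 10 + ["X"]           # made-up categorical column
print(gaussian_outliers(weights, n_sigma=2.0))
print(histogram_outliers(sexes))
```

On a column this small a single extreme value inflates the standard deviation, which is why the example lowers `n_sigma` to 2; this sensitivity is one reason the parameter configuration matters.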


Strategy 2: Rule-based Error Detection

• Rely on data-quality rules to detect errors; the rules are expressed using integrity constraints
  • Functional dependencies
  • Denial constraints
  • Violation: a collection of cells that do not conform to a given integrity constraint
  • At least one cell involved in a violation must be changed to resolve it
• Tool 2: DC-Clean
  • Focuses on denial constraints
  • The authors design a collection of DCs to capture the semantics of the data
    • "if there are two captures of the same animal indicated by the same tag number, then the first capture must be marked as original"
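To make the notion of a violation concrete, here is a minimal sketch of checking one functional dependency over a list of rows. It is not DC-Clean's algorithm; the function name `fd_violations`, the row representation (dicts), and the zip/city example data are assumptions made for illustration.

```python
from collections import defaultdict

def fd_violations(rows, lhs, rhs):
    """Return the cells (row index, attribute) involved in violations of lhs -> rhs.

    Rows that agree on every lhs attribute but disagree on rhs form a violation;
    all of their lhs and rhs cells are reported, since any one of them may be wrong.
    """
    groups = defaultdict(list)
    for idx, row in enumerate(rows):
        groups[tuple(row[a] for a in lhs)].append(idx)

    violations = set()
    for idxs in groups.values():
        if len({rows[i][rhs] for i in idxs}) > 1:
            for i in idxs:
                violations.update((i, a) for a in (*lhs, rhs))
    return violations

rows = [
    {"zip": "02139", "city": "Cambridge"},
    {"zip": "02139", "city": "Boston"},     # conflicts with row 0 under zip -> city
    {"zip": "10001", "city": "New York"},
]
print(sorted(fd_violations(rows, ["zip"], "city")))
```

The check reports every cell in the conflicting group rather than guessing the culprit, which matches the point above that at least one involved cell must change.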


Strategy 3: Pattern-based Detection

• Tool 3: OPENREFINE: open-source data wrangling tool
• Tool 4: TRIFACTA: community version of a commercial data wrangling tool
  • OPENREFINE and TRIFACTA focus on syntactic patterns: they provide exploration techniques to discover data inconsistencies
• Tool 5: KATARA: semantic pattern discovery and detection tool
  • Focuses on semantic patterns matched against a knowledge base
• ETL (Extract, Transform, Load) tools: pull data out of one database and place it in another
  • Tool 6: PENTAHO
  • Tool 7: KNIME


Tool 3: OPENREFINE

• OPENREFINE is an open-source wrangling tool that can digest data in multiple formats and facilitates data exploration
• Faceting operation: lets users look at different kinds of aggregated data; resembles a grouping operation
  • The user specifies one column for faceting and OPENREFINE generates a widget that shows all distinct values & their number of occurrences
• Filtering operation:
  • The user can specify an expression on multiple columns and OPENREFINE generates the widget based on the values of the expression
  • The user can then select one or more values in the widget and OPENREFINE filters out rows which do not contain the selected values
• Data cleaning uses an editing operation
  • Edits one cell at a time
  • If you edit a text facet, all cells consistent with that facet will update
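The facet, filter, and edit operations described above behave roughly like a group-by count, a row filter, and a bulk replace. The sketch below mimics them in plain Python; it does not use or represent OPENREFINE's own interface, and the function names and sample rows are hypothetical.

```python
from collections import Counter

def text_facet(rows, column):
    """Distinct values of one column with their number of occurrences."""
    return Counter(row[column] for row in rows)

def filter_by_facet(rows, column, selected):
    """Keep only rows whose value in `column` is one of the selected facet values."""
    return [row for row in rows if row[column] in selected]

def edit_facet_value(rows, column, old, new):
    """Bulk edit: every cell equal to `old` in `column` becomes `new`."""
    return [{**row, column: new} if row[column] == old else row for row in rows]

rows = [
    {"species": "deer mouse"},
    {"species": "deer mouse"},
    {"species": "Deer Mouse"},   # inconsistent capitalization stands out as a rare facet
    {"species": "vole"},
]
print(text_facet(rows, "species"))
fixed = edit_facet_value(rows, "species", "Deer Mouse", "deer mouse")
print(text_facet(fixed, "species"))
```

Rare facet values are only candidates: a human still decides whether "Deer Mouse" is a formatting error or a legitimately different value.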


Tool 4: TRIFACTA

• TRIFACTA is the commercial descendant of Data Wrangler: it predicts and applies various syntactic data transformations for data preparation and cleaning.
  • Can apply business & standardization rules through available transformation scripts
• Applies a frequency analysis to each column to identify the most and least frequent values
• Shows attribute values that deviate strongly from the value distribution in the specific attribute
• Maps each attribute to its most prominent data type and identifies values that do not match
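The last bullet, mapping a column to its most prominent type and flagging the rest, can be illustrated as follows. This is a simplified sketch, not TRIFACTA's type system: it assumes all cells are strings and only distinguishes three hypothetical type labels ("int", "float", "str").

```python
from collections import Counter

def infer_type(value):
    """Classify a string cell as 'int', 'float', or 'str' by trying to parse it."""
    for name, cast in (("int", int), ("float", float)):
        try:
            cast(value)
            return name
        except ValueError:
            pass
    return "str"

def type_mismatches(values):
    """Return (dominant type, indices of cells whose type differs from it)."""
    types = [infer_type(v) for v in values]
    dominant = Counter(types).most_common(1)[0][0]
    return dominant, [i for i, t in enumerate(types) if t != dominant]

end_users = ["12", "7", "n/a", "30"]   # made-up column with one non-numeric cell
print(type_mismatches(end_users))
```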


Tool 5: KATARA

• KATARA relies on external knowledge bases, such as Yago, to detect & correct errors which violate a semantic pattern
• Identifies the type of a column and the relationship between two columns in the data-set using a knowledge base
  • The type of column A in a table might correspond to Country in the knowledge base Yago & the relationship between columns A and B might correspond to the predicate HasCapital in Yago
• Based on the discovered types and relationships, KATARA validates values using the knowledge base and human experts
  • Example: A value of "California" in column A will be marked as an error because it is not a country in Yago
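A toy version of the Country / HasCapital example may help. The two small sets below stand in for a knowledge base and are entirely hypothetical; nothing here reflects Yago's actual contents or any KATARA API, and the pattern (column A is a country, A HasCapital B) is assumed to have been discovered already.

```python
# Hypothetical miniature knowledge base: a type and one binary predicate.
COUNTRIES = {"France", "Italy", "Japan"}
HAS_CAPITAL = {("France", "Paris"), ("Italy", "Rome"), ("Japan", "Tokyo")}

def semantic_candidates(rows):
    """Flag cells that contradict the assumed pattern A:Country, A HasCapital B.

    If A is not a known country, cell A is flagged. If A is a country but the
    pair is not in HAS_CAPITAL, both cells are flagged as candidates, because
    the knowledge base alone cannot tell which one is wrong (a human decides).
    """
    flagged = set()
    for i, (a, b) in enumerate(rows):
        if a not in COUNTRIES:
            flagged.add((i, "A"))
        elif (a, b) not in HAS_CAPITAL:
            flagged.update({(i, "A"), (i, "B")})
    return flagged

rows = [("France", "Paris"), ("California", "Sacramento"), ("Italy", "Milan")]
print(sorted(semantic_candidates(rows)))
```

A value missing from the knowledge base is not necessarily wrong, which is why the slide pairs the knowledge base with human experts.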


Tool 6: PENTAHO

• PENTAHO provides a graphical interface where data wrangling can be implemented as a directed graph of ETL (Extract, Transform, Load) operations
• Any data manipulation or rule validation can be added as a node in the ETL pipeline
• Executes multiple ETL workflows to clean/curate data, BUT rules/procedures must be specified by the user
• Provides routines for string transformation and single-column constraint validation


Tool 7: KNIME

• KNIME focuses on workflow authoring and on encapsulating data processing tasks, such as curation and machine-learning-based functionality, in composable nodes
• Although KNIME, similar to PENTAHO, executes multiple ETL workflows to clean/curate data, rules/procedures must be specified by the user
  • Users must know exactly what kinds of rules and patterns need to be verified
• Unlike OPENREFINE & TRIFACTA, PENTAHO and KNIME do not provide ways to automatically display outliers and detect type mismatches


Strategy 4: Duplicate Detection

• If two records refer to the same real-world entity but have differing attribute values, there is a strong chance that one of the two values in each differing attribute is an error
• Tool 8: TAMR (commercial descendant of the Data Tamer system)
  • TAMR is a tool with industrial-strength data integration algorithms for record linkage and schema mapping
  • Premised on machine learning models that learn duplicate features
    • Expert sourcing
    • Similarity metrics
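The first bullet can be illustrated with a plain similarity metric. This sketch is not how TAMR works (TAMR learns a model with expert feedback); it swaps in a fixed string-similarity threshold from Python's standard `difflib`, and the function names, the `threshold` value, and the vendor records are assumptions.

```python
from difflib import SequenceMatcher
from itertools import combinations

def similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def conflicting_duplicates(records, key, threshold=0.85):
    """Pairs of records whose `key` fields look like the same entity.

    For each likely-duplicate pair, also list the other attributes on which
    the two records disagree; those cells are candidate errors.
    """
    conflicts = []
    for (i, r1), (j, r2) in combinations(enumerate(records), 2):
        if similarity(r1[key], r2[key]) >= threshold:
            differing = [a for a in r1 if a != key and r1[a] != r2.get(a)]
            conflicts.append((i, j, differing))
    return conflicts

vendors = [
    {"name": "Acme Corp.", "phone": "617-555-0100"},
    {"name": "ACME Corp", "phone": "617-555-0199"},   # same vendor, different phone
    {"name": "Globex", "phone": "212-555-0123"},
]
print(conflicting_duplicates(vendors, "name"))
```

Comparing all pairs is quadratic in the number of records, so this only makes sense on small samples.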


Combination of Multiple Tools

• Problem: How does a user properly combine multiple independent data-cleaning tools?
  • Option 1: Run all tools and apply a union or min-k strategy
  • Option 2: Have users manually check a sample of detected errors, which can be used to guide the prioritization of data-cleaning operations


Option 1: Union All and Min-k

• Union all
  • Takes the union of the errors emitted by all tools
• Min-k
  • Keeps the errors detected by at least k tools and excludes those detected by fewer than k tools
  • No need to keep cleaning the data-set with new techniques if the maximum performance for error detection has been reached
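Both combination strategies reduce to counting how many tools flag each cell. The sketch below assumes each tool's output is a set of (row, column) cells; the tool names and cells in the example are made up.

```python
from collections import Counter

def min_k(detections, k):
    """Cells flagged by at least k of the tools.

    `detections` maps a tool name to the set of cells that tool marked as errors.
    """
    votes = Counter(cell for cells in detections.values() for cell in set(cells))
    return {cell for cell, n in votes.items() if n >= k}

def union_all(detections):
    """Union of all tools' outputs, i.e. min-k with k = 1."""
    return min_k(detections, 1)

detections = {
    "outliers": {(0, "weight"), (3, "weight")},
    "rules": {(0, "weight"), (1, "tag")},
    "patterns": {(0, "weight"), (1, "tag"), (2, "sex")},
}
print(len(union_all(detections)))   # every cell flagged by any tool
print(min_k(detections, 2))         # only cells two or more tools agree on
```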


Ordering Based on Precision

• Problems with Option 1 (exhaustive union)
  • Expensive because it requires massive amounts of human effort to validate a large number of candidate errors
  • BlackOak data-set: A user would have to verify the 982,000 cells identified as possibly erroneous to discover 382,928 actual errors.
  • Results from tools that perform poorly at error detection on this particular data-set should not be evaluated
• Alternative: a sampling-based method to select the order in which data-cleaning strategies are applied to the data-set


Ordering Based on Precision

• Cost model: Although the performance of a tool can be measured by precision and recall in detecting errors, precision is a better proxy for a data-cleaning tool's error detection performance
  • Recall can only be computed if all of the errors in the data are known (full ground truth); this is nearly impossible when we execute error detection strategies on new data-sets
  • Precision is easy to estimate
• Assume C is the cost of having a human check a detected error and V is the value of identifying a real error
  • Value must be higher than cost!
  • P*V > (P+N)*C, where P is the number of correctly detected errors and N is the number of erroneously detected errors (false positives)
  • P/(P+N) > C/V
  • Set threshold: σ = C/V
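As a worked example of the threshold, the sketch below plugs in the BlackOak union-all figures quoted in these slides (982,000 flagged cells, 382,928 real errors). The cost and value numbers are purely hypothetical, since the slides do not give C or V; the function names are illustrative.

```python
def precision_threshold(cost, value):
    """sigma = C / V: the minimum precision at which checking pays off."""
    return cost / value

def worth_validating(true_pos, false_pos, cost, value):
    """True when P*V > (P+N)*C, written as precision > sigma."""
    precision = true_pos / (true_pos + false_pos)
    return precision > precision_threshold(cost, value)

flagged, actual = 982_000, 382_928   # BlackOak union-all numbers from the slides
print(round(actual / flagged, 2))    # precision of the exhaustive union: 0.39

# Hypothetical costs: a check costs 1 unit; a found error is worth 5 or 2 units.
print(worth_validating(actual, flagged - actual, cost=1, value=5))  # sigma = 0.2
print(worth_validating(actual, flagged - actual, cost=1, value=2))  # sigma = 0.5
```

With V five times C the union clears the threshold; with V only twice C it does not, which is the trade-off discussed on the next slide.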


Ordering Based on Precision

• Any tool with a precision below σ should not be run because the cost of checking is greater than the value of accurately identifying a data error
• The ratio is domain dependent (unknown in most cases); it is natural to have large V values for highly valuable data
  • If V is very large, the threshold is correspondingly small and all data-cleaning tools will be considered, which boosts recall
  • If the value of C is high and dominates the ratio, we save cost by validating only the output of tools that are very precise, at the expense of recall
• The authors estimate the precision of a data-cleaning tool on a given data-set by checking a random sample of its detected errors.
• Why not run all the tools with a precision higher than the threshold and evaluate the union of all their detected error sets??
  • Tools are not independent, and their sets of detected errors may, and often do, overlap
  • Some tools may have an extremely high precision, but all of the errors they detect may be covered by tools that have even higher precision values


Ordering Based on Precision

• Maximum entropy-based order selection: Following the Maximum Entropy principle, the authors design an algorithm which assesses the estimated precision of each data-cleaning tool
  • Estimates overlap between tool results
  • Picking the tool with the highest precision (the fraction of its detected errors that are true errors) reduces entropy the most, because high entropy refers to uncertainty


Ordering Based on Precision: Algorithm

1. Run all data-cleaning tools on the entire data-set and return detected errors
2. Estimate precision for each tool by verifying a random sample of its detected errors with a human expert
3. Pick the tool with the highest estimated precision among all data-cleaning tools not yet considered (this maximizes the entropy reduction) and verify those of its detected errors on the complete data-set which have not been verified before
4. Since errors validated in Step 3 may also have been detected by other tools, update the precision of the other tools and go back to Step 3 to pick the next tool, as long as a tool exists with an estimated precision > σ

Empirics: Regardless of each tool's individual performance, the proposed order reduces the cost of manual verification with only a marginal reduction of recall.
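The four steps can be sketched as a small loop. This is a simplified reading of the slide, not the authors' implementation: `is_error` is a hypothetical callback standing in for the human expert, the precision estimate is simply the fraction of true errors among a tool's detections checked so far, and `sample_frac` and `seed` are illustrative parameters.

```python
import random

def ordered_validation(detections, is_error, sigma, sample_frac=0.05, seed=0):
    """Validate tool outputs in order of estimated precision.

    detections: tool name -> set of flagged cells (Step 1 is assumed done).
    Returns (order of tools used, confirmed errors, number of human checks).
    """
    rng = random.Random(seed)
    verdicts = {}  # cell -> True/False, each cell is checked by the human once

    def check(cell):
        if cell not in verdicts:
            verdicts[cell] = is_error(cell)
        return verdicts[cell]

    # Step 2: verify a random sample of each tool's detections.
    for cells in detections.values():
        pool = sorted(cells)
        size = min(len(pool), max(1, round(sample_frac * len(pool))))
        for cell in rng.sample(pool, size):
            check(cell)

    def estimated_precision(tool):
        seen = [verdicts[c] for c in detections[tool] if c in verdicts]
        return sum(seen) / len(seen) if seen else 0.0

    order, remaining = [], set(detections)
    while remaining:
        # Step 3: pick the most precise remaining tool (name breaks ties).
        best = max(remaining, key=lambda t: (estimated_precision(t), t))
        if estimated_precision(best) <= sigma:
            break  # no remaining tool clears the threshold
        for cell in detections[best]:
            check(cell)
        order.append(best)
        remaining.remove(best)
        # Step 4: verdicts shared with other tools update their estimates
        # automatically on the next call to estimated_precision.

    confirmed = {cell for cell, ok in verdicts.items() if ok}
    return order, confirmed, len(verdicts)
```

A usage example with two made-up tools, one always right and one always wrong: only the first is fully validated, and the second costs a single sampled check.

```python
truth = {(i, "a") for i in range(10)}
tools = {"good": set(truth), "bad": {(i, "b") for i in range(10)}}
print(ordered_validation(tools, lambda cell: cell in truth, sigma=0.5))
```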


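Evaluation Metrics

• D: dirty data-set
• G: purely cleaned data-set (ground truth)
• E: diff(G, D), the set of erroneous cells
• T(D): the set of cells marked as errors by tool T
• Precision: P = |T(D) ∩ E| / |T(D)|
• Recall: R = |T(D) ∩ E| / |E|
• Aggregated F: 2(R*P)/(R+P)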
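These definitions translate directly into set operations. In the sketch below, tables are lists of dict rows and a cell is a (row index, column) pair; both representations and the function names are assumptions for illustration, and the tiny tables are made up.

```python
def diff_cells(clean, dirty):
    """E = diff(G, D): cells whose dirty value differs from the ground truth."""
    return {
        (r, col)
        for r, (g, d) in enumerate(zip(clean, dirty))
        for col in g
        if g[col] != d[col]
    }

def evaluate(detected, errors):
    """Precision, recall and F of a tool's output T(D) against E."""
    tp = len(detected & errors)
    precision = tp / len(detected) if detected else 0.0
    recall = tp / len(errors) if errors else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

G = [{"name": "Ann", "age": 30}, {"name": "Bob", "age": 41}]
D = [{"name": "Ann", "age": 3}, {"name": "bob", "age": 41}]
E = diff_cells(G, D)                 # two erroneous cells
T = {(0, "age"), (1, "age")}         # a tool that gets one right and one wrong
print(evaluate(T, E))
```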


Usage of Tools

• DBOOST: applied three algorithms (Gaussian, histogram, GMM), with the parameters that make F highest.
• DC-Clean: existing rules + manually constructed FD rules based on obvious n-to-1 relationships
• OpenRefine: facet mechanism + formatting and single-column rules
• TRIFACTA: outlier detection and type verification + formatting and single-column rules
• KATARA: manually constructed & existing knowledge bases
• PENTAHO & KNIME: model each transformation and validation routine as a workflow node in the ETL process
• TAMR: iterate training until the precision and recall become stable


Data quality rules defined on each dataset


User Involvement

• Set rules
• Perform data exploration using OpenRefine and TRIFACTA
• Validate the detected errors
• Go through the remaining errors and try to categorize them


Individual Effectiveness


• DBOOST: useless for Animal, good for BlackOak
• DC-Clean: good for Animal and Merck, bad for MIT VPF
• OpenRefine: bad for Animal, top 2 for others
• TRIFACTA: bad recall for Animal, top 2 for others
• KATARA: good for BlackOak, bad for MIT VPF
• PENTAHO & KNIME: good in general
• TAMR: found all duplicates for MIT VPF and most of the duplicates for BlackOak


Tool Combination Effectiveness

• Union All: High recall but low precision (lots of FP)


Min-K

• Requires that at least k algorithms agree on an error
• (k=1) == union all
• As k increases, precision increases and recall decreases
• Main problem: how to pick k


Ordering based on Benefit and User Validation

• Randomly sample 5% of the detected errors for each tool and compare them with ground truth for precision estimation.
• Run tools in precision order (dynamically update the precision estimates and drop tools that do not pass the threshold)
• Baseline: simple union
• Threshold: σ from 0.1 to 0.5 (for precision)
• As the threshold increases, precision increases: FP decrease significantly while TP decrease a little, causing recall to decrease a little


Ordering Strategy results


Recall Upper-bound

• extra rules found by manually going through remaining errors


Domain Specific Tools

• For MIT VPF and BlackOak: ADDRESSCLEANER
• Applied on a sample of 1,000 records
• Found 2 & 13 new errors. Recall: 0.93-0.95; 0.999-0.999


Enrichment

• Manually add more attributes to the original dataset (only those that did not introduce additional duplicate rows)
• DC-Clean & TAMR


Conclusion

• There is no single dominant tool for the various data sets and diversified types of errors. Single tools achieved on average 47% precision and 36% recall, showing that a combination of tools is needed to cover all the errors.
• Picking the right order in applying the tools can improve the precision and help reduce the cost of validation by humans.
• Domain-specific tools can achieve high precision and recall compared to general-purpose tools, achieving on average 71% precision and 64% recall, but are limited to certain domains
• Rule-based systems and duplicate detection benefited from data enrichment. In our experiments, we achieved an improvement of up to 10% more precision and 7% more recall