Top Banner
FRIL Fine-grained Record Integration and Linkage Tool Tutorial V 3.2 08/07/2009 Author: Pawel Jurczyk Copyright: Emory University, Math&CS Department, 2009 Document history: Date Version Date Version 12/22/2008 1.0 08/07/2009 3.2 12/29/2008 2.0 06/29/2009 3.0 07/02/2009 3.1
43

FRIL - SourceForgefril.sourceforge.net/FRIL-Tutorial-3.2.pdf · 4 General architecture Before we get started with FRIL, it is important to understand the architecture of the

Sep 23, 2018

Download

Documents

dangmien
Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
Page 1: FRIL - SourceForgefril.sourceforge.net/FRIL-Tutorial-3.2.pdf · 4 General architecture Before we get started with FRIL, it is important to understand the architecture of the

FRILFine­grainedRecordIntegrationandLinkageTool

TutorialV3.2

08/07/2009

Author:PawelJurczyk

Copyright:EmoryUniversity,Math&CSDepartment,2009

Documenthistory:

Date Version Date Version12/22/2008 1.0 08/07/2009 3.212/29/2008 2.0 06/29/2009 3.0 07/02/2009 3.1

Page 2: FRIL - SourceForgefril.sourceforge.net/FRIL-Tutorial-3.2.pdf · 4 General architecture Before we get started with FRIL, it is important to understand the architecture of the

2

Tableofcontents

Tableofcontents ............................................................................................................................................................ 2

Introduction..................................................................................................................................................................... 3

Installation ....................................................................................................................................................................... 3Generalarchitecture..................................................................................................................................................... 4

Mainwindow ................................................................................................................................................................... 4Configurationofdatasources.................................................................................................................................... 7

Configurationoflinkage ............................................................................................................................................18

Resultsavers/Resultsdeduplication ....................................................................................................................24Runningthelinkage,linkageconfidence .............................................................................................................26

Howconfidenceleveliscalculated? ......................................................................................................................28Distancemetrics...........................................................................................................................................................30

Datadeduplication ......................................................................................................................................................38

References ......................................................................................................................................................................43

Page 3: FRIL - SourceForgefril.sourceforge.net/FRIL-Tutorial-3.2.pdf · 4 General architecture Before we get started with FRIL, it is important to understand the architecture of the

3

IntroductionThisdocumentprovidesatutorialfortoolcalledFRIL(Fine‐grainedRecordsIntegrationandLinkageTool).FRILisaJava‐basedtoolthatincorporatesarichcollectionofrecorddistancemetrics,searchmethods,andanalysistools.Alongitsworkflow,FRILprovidesarichsetofuser‐tunableparametersaugmentedwithgraphicvisualizationtoolstoassistusersinunderstandingtheeffectsofparameterchoices.InthistutorialthereaderwillobtainknowledgeaboutallthecomponentsavailableinFRIL.

InstallationFRILcanbedownloadedfromitshomepage.Thefileavailablefordownloadisazippedarchivethatcontainsallthefilesrequiredforrunningthesoftwareandexampledatafilesusedinthistutorial.Ifyouhavenotobtainedthesoftwaresofar,doitnow:

1. Gotohttp://www.mathcs.emory.edu/Research/Area/datainfo/FRIL2. GotoDownloadsection3. Downloadthemostrecentbinarydistribution(orsourcedistributionifyouprefer)4. Unzipdownloadedarchive5. OpentheFRILfolder(itshouldbecreatedafterunzippingthefile)6. Runthesoftware.Usefril.bat(Windows)/fril.sh(Linux/Unix)/fril.sh.command(Mac)depending

onoperatingsystemyouuse.7. Ifthesoftwaredoesnotstart,makesureyouhavejava1.6installed.FRILisdevelopedinjavaand

requiresthisframeworktorun.Youcanobtainthelatestjavaframeworkfromhttp://java.com.ForfurtherinformationyoumightalsowanttorefertothefileREADME.txtthatislocatedinaFRILdirectory.

Page 4: FRIL - SourceForgefril.sourceforge.net/FRIL-Tutorial-3.2.pdf · 4 General architecture Before we get started with FRIL, it is important to understand the architecture of the

4

GeneralarchitectureBeforewegetstartedwithFRIL,itisimportanttounderstandthearchitectureofthesystem.TheworkflowofFRILisshowninFigure1.Theuserspecifiestheinitialinputfiles.Eachrunrequirestheusertospecifythesearchmethod,thedistancefunctionintheattributecomparisonmoduleandthedecisionmodel.OutputconsistsofsetsM,U,Pandvarioussummarystatistics.SetsUandPmaybefedbackintoFRILwithadifferentsetofparameters.

Figure1:ArchitectureofFRIL

InthistutorialweguideuserthroughallthecomponentsofFRIL,presentinghowtoconfigure,debugandrunefficientandaccuraterecordlinkage.

MainwindowOnceyoustartthesoftware,themainwindowopens.ThescreenshotofapplicationispresentedinFigure2.Notethatthereare5majorcomponentsintheapplication:

• Configurationofthefirstdatasource• Configurationoftheseconddatasource• Configurationofthelinkageprocess• Configurationoftheresultssaver• Linkageprocesspanel

TheFRILmainwindowcontainsthefollowingspecificcomponents(thenumberscorrespondtonumbersinlabelsinFigure2):

1. Configurationofthefirstdatasourcebutton2. Viewofnotjoineddatainthefirstdatasourcebutton3. Indicatorsoffiltering/deduplicationinthefirstdatasource4. Configurationoftheseconddatasourcebutton5. Viewofnotjoineddataintheseconddatasourcebutton6. Indicatorsoffiltering/deduplicationintheseconddatasource

Page 5: FRIL - SourceForgefril.sourceforge.net/FRIL-Tutorial-3.2.pdf · 4 General architecture Before we get started with FRIL, it is important to understand the architecture of the

5

7. Configurationoflinkageprocessbutton8. Linkageprocessstartbutton9. Configurationofresultsavers/resultdeduplicationbutton10. Viewoflinkedrecordsbutton11. Indicatorofdeduplicationoflinkageresults12. IndicatorofFRILmode(linkage/deduplication)

ToconfigurethelinkageusingFRILyouneedtogothroughallthemajorcomponents(configuredatasources,linkageandresultsavers).Onceyoucompletethatyoucanstartthelinkageprocessusingthestartlinkagebutton.Intheremainingpartofthistutorialwewillgoovereachofthestepsabovetolinksampledatafiles.

FRILhastwomodes:linkageanddeduplication.ThelinkagemodeisthedefaultoneandispresentedinFigure1.ThededuplicationmodecanbeenabledintheModemenu.PleaseseesectionDatadeduplicationformoredetails.

IfyouhavealreadystartedFRILbefore,youmightbeaskedifyouwanttoopenyourpreviousconfigurationwhenFRILstarts.FRILremembersitsconfigurationandattemptstorestoreituponrestart.

Ifyouarenotsurehowtoobtainorstartthesoftware,gotoInstallationsection.

WorkingwithsavedconfigurationWhenyouworkwithFRIL,saveyourconfigurationoften.Inthatcaseyouwillbesafeifanythingunexpectedhappens(e.g.,FRILhangsorcrashes–yes,itmighthappen,althoughweareworkingtomakethesoftwarestable).Tosaveyourconfiguration,clickon“File‐>Saveas”or“File‐>Save”.Later,whenyouopenFRILagain,youwillhaveanoptionofloadingthesavedconfiguration.Youcanalsoloadconfigurationatanytime.Todoso,click“File‐>Open”.

AutosaveoptionFRILwasrecentlyupdatedtoincludeanautosaveoption(thisoptionisenabledbydefault,youcanchangeitintheToolsmenuitem).Theautosaveoptionautomaticallysavestheconfigurationasyouworkwiththetool.Ifyouclosetheapplicationimproperlyorifitbreaks,youcanbeaskedwhetheryouwanttoloadanautosavedconfigurationwhenyourestartFRIL.

Page 6: FRIL - SourceForgefril.sourceforge.net/FRIL-Tutorial-3.2.pdf · 4 General architecture Before we get started with FRIL, it is important to understand the architecture of the

6

Figure2:MainwindowofFRIL

Page 7: FRIL - SourceForgefril.sourceforge.net/FRIL-Tutorial-3.2.pdf · 4 General architecture Before we get started with FRIL, it is important to understand the architecture of the

7

ConfigurationofdatasourcesDatasourcesprovidedatathatwillbelinkedinFRIL.Youhaveanoptionofthefollowingdatasources:

1. CSVfileComma‐separatedvaluesfile.Thisisatextfilewheredataisseparatedbycomma,semicolonorotherseparatorcharacter.TheconfigurationofCSVdatasourcerequiresspecificationofaninputfileandacolumnseparator.

2. ExcelfileFilesavedinMicrosoftExcelformat.Note:FRILworkswithExcelfilessavedinOffice1997‐2003fileformat.SupportforOffice2007isonitsway.TheconfigurationofExceldatasourcerequiresspecificationofafilenameandasheetname.Bothparametersarerequired.

3. Fixedwidthcolumnsinputfile(Textfile)Plaintextdatafile.Eachrowinthefileisaseparaterecord,andeveryattributeshouldhavefixedlength.Requiresaschemafile(oradatamodelfile)thatdescribesattributesavailableinthefileandlengthofthoseattributes.Thedatamodelfilenameisoptional,andifnotprovidedFRILwilllookforafilewithextension“.fields”.Forinstance,ifyouuseaninputfiledata.txt,andyoudonotspecifythedatamodelfile,FRILwillexpectthefiledata.fieldstobelocatedinthesamedirectoryasthefiledata.txt.The“.fields”filecouldlookasfollows:

Theexampleabovedefinesatextfilewith3attributes,name,ageandaddress.Eachlineinthetextfileisexpectedtohave83characters(20+3+60).Thefirst20charactersineachlinearedecodedasaname,next3charactersasanageandnext60charactersasanaddress.

4. DatabaseThedatawillbepulledfromdatabase.RequiresJDBClibraries.Currently,thelibrarieshavetobedownloadedmanually,butinthefuturethisprocesscanbeautomated.Note:JDBCgivesalotofoptionsfordatasources.Usingthistechnology,youcanaccessforinstanceMSAccessdatabase,MSSQL,MySQL,PostgreSQL,Oracleandmanymoredatabases.

Let’snowadddatatoFRIL.Followthestepsoutlinedbelowtofullyconfigurethefirstdatasource.Wewillworkwiththesampledatathatshouldbelocatedinthedata‐sampledirectorylocatedinyourFRILhomedirectory.First,clickontheconfigurationofupperdatasourceinmainwindow(clickonabuttonnexttothelabelwithnumber1inFigure2).Awizardfordatasourceshouldappear.InDatasourcetypechooseaCSVfileoption.Namethedatasource“FirstFile”andclickon“…”buttonontherightfromthe“Filename”textfield.FilechoosershouldopeninyourFRILdirectory.Ifnot,navigatethere.Doubleclickondata‐sampledirectoryandclickon“generated‐data‐error.csv”.Click“Open”.YoushouldbebackintheDatasourceconfigurationwizard.In“Columnseparator”optionchooseComma.YourwindowshouldlookliketheonepresentedinFigure3.NowyoushouldbereadytoclickNext.Whenyoudothat,thenextstepinconfigurationofdatasourceappears.Youshouldnowseeafewboxeswiththefollowingnamesonthem:first_name,last_name,DOB,height,weight.

IfyouchooseawrongColumnseparator,FRILwillnotbeabletoreadyourdatainaproperway.Youwilllikelyseeonlyoneboxinthesecondstepofthedatasourceconfigurationthatreadssomethinglike:“first_name,last_na…“.

name20age3address60

Page 8: FRIL - SourceForgefril.sourceforge.net/FRIL-Tutorial-3.2.pdf · 4 General architecture Before we get started with FRIL, it is important to understand the architecture of the

8

Figure3:Firststepofdatasourceconfiguration

Tryclickingon“Showfieldssummary”buttonlocatedatthetopwhenyouareinthesecondstepofadatasourceconfiguration.Youwillseethedatasummarywindow.Trytomoveyourmouseoveroneofthe“See”buttons–afewexamplesofdatavaluesshouldbepresented.Thisoptioncanbeusefulwhenexploringthedata.

ThenextstepinconfigurationistochoosecolumnsfromtheinputfilethatyouwillactuallyuseinFRIL.Yourinputfilecouldpotentiallyhavehundredsofdatacolumns,andyoumightreallyneedonlyfewofthemintheapplication.Bychoosingthecolumnsyoureallyneedyoumakethetooltoworkmuchfaster.Inourcase,however,ourdatafilehasonly5columns,andallthecolumnswillbeusedinthelinkage.Let’snowaddattributesthatwillbeusedlaterbyFRIL.Followthesesteps:

1. Wedonotwanttoworkwithfirstandlastnameseparately,butwewanttotreatthosetwoattributesasone(orasname).Clickontheboxthatreadsfirst_name.ChooseUseconverter‐>Mergeconverter.Anewconfigurationwindowshouldappear.Asthe“Outattributename”type“name”.ChooseSpacefortheoption“Joinfieldsusing”.Inthe“Joinedcolumns”clicklast_name@FirstFile.YourwindowshouldlookliketheoneinscreenshotfromFigure5.Wonderingwhatyouweredoing?Clickonmagnifyingglassicon.Youshouldseeadynamicanalysiswindowthatshowsthreecolumns(Figure6).Youseethatyouhavetwoinputcolumns(first_nameandlast_name)andoneoutputcolumn(name).Nowyouseewhatmergeconverterdoes–ittakestwo(ormore)attributesfrominputfileandcreatesanewattribute.Thevalueofthenewattributeiscalculatedbymergingvaluesfromfewcolumns.

Beforeyouclosetheconverterconfiguration,trytochangetheoption“Joinfieldsusing”or“Joinedcolumns”whilekeepingthedynamicanalysiswindowopen.Theanalysiswindowisupdatedeverytimeyoumakeanychangeintheconfigurationofconverters.Makesure,however,toselectvaluesofparametersasdescribedinstep1beforeyouclickOK.

2. Nowyoushouldbereadytoclosetheconfigurationofconverter.ClickOK.3. Clickon“Showfieldssummary”buttonandmoveyourmouseover“See…”buttonfortheattribute

DOB.YouwillseethatthevaluesaredatescodedasMMDDYYYY.Wewillnowchangethe4‐digityearformatinto2‐digitformat.OurgoalistoproducevaluesthathaveformatMMDDYY.

Page 9: FRIL - SourceForgefril.sourceforge.net/FRIL-Tutorial-3.2.pdf · 4 General architecture Before we get started with FRIL, it is important to understand the architecture of the

9

4. Closethesummarywindowandclickonboxthatreads“DOB”andselectUseconverter‐>Trimconverter.

5. Intheconfigurationofconverter,asthe“Outputattributename”type“Date‐MMDD”.Selectthecheckboxnextto“Modifyendofstring”.UsetheoptionforCutandsetnumberofcutcharactersto4.MakesurethatyourconfigurationisconsistentwithFigure7.

6. Clickonthemagnifyingglassbutton.Theconverteryoujustcreatedleavesfirst4charactersofeveryDOBvalue.ThosecharactersareMMDDpartofthedate.ClosedynamicanalysiswindowwhenyouaredoneandclickOKintheconverterconfigurationwindow.

7. Clickagainontheboxthatreads“DOB”,chooseUseconverter‐>Trimconverter.8. Thistimespecify“Date‐YY”asan“Outputattributename”,selectthecheckboxnextto“Modifyend

ofstring”andselecttheradiobuttonnextto“Leave”.Type2intheenabledtextfield(matchyourconfigurationwithFigure8).Clickagainonthemagnifyingglassbutton.Thisconverterleavesonlytwolastdigitsofyearfromanyinputdate.Applythechanges(closedynamicanalysisandclickOK).

9. WehaveseparateattributesforMMDDpartofdateandYYpartofdate.Thenextstepistomakeoneattributefromthosetwocomponents.Clickonboxthatreads“Date‐MMDD”andchooseUseconverter‐>Mergeconverter.MatchtheconfigurationwithFigure9.MakesuretoselectNonefortheoption“Joinfieldsusing”.Asusually,clickonmagnifyingglasstoseetheresultsandonceyouaredoneapplythechanges.

10. Clickonboxthatreads“height”.Clickonoption“Addtooutmodel”.Followthesamestepsfor“weight”.

11. YourdatasourceconfigurationshouldlooksimilarlytotheonepresentedinFigure10.Ifnot,adjustyourconfigurationaccordingly.

Tryclickingoneachoftheboxesyouseeinyourconfigurationwindow.Youwillseethateverytypeofboxhasdifferentoptions.E.g.,byclickingonanorangecolorbox(converter),youcanmodifyitssettings.Ontheotherhand,byclickingonabluebox(outputcolumn),youcanrenamethecolumnorremoveit.

Observethatyoucanbuildasequenceofconverterstoachievemorecomplextasksindatamanipulation.

12. Click“Next”tofollowtothelaststepofdatasourceconfiguration(presentedinFigure4).Inthelaststepacceptthedefaultconfigurationandclick“Finish”

Thelaststepofdatasourceconfigurationisadatapreprocessingstepandrequiresabitofexplanation.Youcanconfigureheretwomajortasksforadatasource:deduplicationandfiltering.Datadeduplicationisataskofremovingduplicaterecordsfrominputdata(pleaseseesectionDatadeduplicationformoredetails).Filteringisataskofspecifyingwhichrecordsshouldbeusedinthelinkageprocess.Notethatifyoudonotuseanyfilter,alltherecordsprovidedbygivendatasourcewillbeused.Ontheotherhand,ifyouspecifyanyfilter,onlytherecordsthatsatisfythefilterexpressionwillbeconsidered.Otherrecordswillnotbetakenintoaccount.Thefilterexpressioncanbeanybooleanexpressionthatinvolvesattributesfromadatasource.AssumingthatyourdatasourceprovidesattributesNAME,AGEandZIP_CODE,thefollowingtablepresentcorrectfilterexpressions:AGE=32 OnlyrecordsthathaveAGEequalto32willbe

considered.AGE<=32andZIP_CODE=“30030”andNAME=“SMITH”

OnlyrecordsthathaveNAMEequalto“SMITH”andZIP_CODEequalto30030andAGElessorequal32willbeconsidered.

Page 10: FRIL - SourceForgefril.sourceforge.net/FRIL-Tutorial-3.2.pdf · 4 General architecture Before we get started with FRIL, it is important to understand the architecture of the

10

ZIP_CODE=“30030”orAGE>64 OnlyrecordsthathaveZIP_CODEequalto30030orAGEgreaterthan64willbeconsidered.

Notethatthefilterexpressioneditorhassometoolsthatcanhelpyouwithaproperfiltercondition.The“Addfilterexpression”buttonopensawindowthathelpstoaddanewexpressiontothefilter,the“Clearfilter”buttonclearsthefilter(emptyfilterisequivalenttonofilter),andthe“Validatefilter”buttoncheckswhethertheselectedfilterexpressioniscorrect.Inaneventoffilterexpressionbeingincorrect,theerrormessagegivesanexplanationoferror.

Filterexpressionscaninvolveanarbitrarynumberofconditions.Youcanuse“and”and“or”asbooleanoperatorsandparenthesis(“(“,“)”)todefinetheorderoftheoperations.Bothoperators(“and”and“or”)canappearinanexpressionatthesametime,e.g.,thefollowingfilterexpressioniscorrect:NAME=“JOHN”or(AGE=33andZIP_CODE=“30030”)

Notethatanypreprocessingofadatasourcehappensbeforetheactuallinkage.Itmeansthatifyoudecidetodeduplicateadatasource,thedatasourcewillbededuplicatedbeforeprovidingdataforthelinkageprocess.Thesameruleappliestofiltering(recordswillbefilteredbeforeprovidinganydataforthelinkage).Thisfactmayhaveanimpactontimeittakestocompletetheoveralllinkageprocess:deduplicationmayincreasethistimeasitaddstocomputationoverhead,whilstfilteringwillreduceitmostoftimesasitlimitsanumberofrecordsinthelinkagephase.

Figure4:Configurationofdatasourcepreprocessing

Page 11: FRIL - SourceForgefril.sourceforge.net/FRIL-Tutorial-3.2.pdf · 4 General architecture Before we get started with FRIL, it is important to understand the architecture of the

11

Figure5:Converterconfiguration

Figure6:Dynamicanalysiswindow

Page 12: FRIL - SourceForgefril.sourceforge.net/FRIL-Tutorial-3.2.pdf · 4 General architecture Before we get started with FRIL, it is important to understand the architecture of the

12

Figure7:TrimconverterextractingMMDDfromdate

Figure8:TrimconverterextractingYYfromdate

Page 13: FRIL - SourceForgefril.sourceforge.net/FRIL-Tutorial-3.2.pdf · 4 General architecture Before we get started with FRIL, it is important to understand the architecture of the

13

Figure9:Mergeconvertermergingdate

Figure10:Completeddatasourceconfiguration

Nowyoushouldbedonewiththeconfigurationofthefirstdatasource.Savetheconfiguration.Usethesamestepstoconfiguretheseconddatasource.Clickonconfigurationoflowerdatasourceinthemainwindow(clickonbuttonnexttolabelwithnumber2inFigure2).ChooseCSVfileandnamethesource“SecondFile”choosethefile“generated‐data.csv”asaninputfile(locatedinthesamedirectoryasthefileusedbefore)andcommaasaseparator.Thisfileprovides4attributes:patient_name,DOB,height,weight.Makesuretoanalyzethedatausing“Showfieldssummary”button.Donotuseanyconverters;justaddallthecolumnstoOutmodel(tryoption“Addallcolumnstooutmodel”).Don’tforgettosaveyourconfigurationwhenyoucompletethistask.

Page 14: FRIL - SourceForgefril.sourceforge.net/FRIL-Tutorial-3.2.pdf · 4 General architecture Before we get started with FRIL, it is important to understand the architecture of the

14

Thinkwhywechosetousemergeconverterforfirstandlastnameandwhywedidallthethingswiththedateswhenthefirstdatasourcewasconfigured.Thesecondfileprovidesjustnamesofpatients,withnodistinctionbetweenfirstandlastnames.Also,theformatofdateinthesecondfileisMMDDYY.Byapplyingtheconvertersasdescribedaboveweresolvedschemadiscrepancybetweenthetwoinputfilesweareusing.

Atthispointyoushouldhaveafullconfigurationofbothdatasources.Youcannowmovetoaconfigurationofthelinkageprocess.Beforewedothat,however,let’sgooveralltheconvertersthatareavailableinthesystemasyoumightfindthosequiteusefulinyourprojects.

MergeconverterSampleusage:mergefirstnameandlastnamestoredinseparateattributesintooneattribute.Thisconverterisusedtomergetwoormoreattributesintooneattribute.Wehavealreadyseenhowtouseitbefore.

SplitconverterSampleusage:splitnameintofirstnameandlastname.Thisconverterdoestheoppositetothemergeconverter.Youcansplitoneattributeintotwoormoreattributes.Totryit,opentheconfigurationoftheseconddatasource(theoneyouconfiguredbyyourself).Gotothesecondstepoftheconfiguration(clickNextinthefirststep).Clickontheboxthatreadspatient_name(eitherofthetwo).ChooseUseconverter‐>Splitconverter.Youseethatyoucanconfigurehowmanyofattributeswillbegeneratedfromtheinputattribute(defaultvalueis2).Youmightalsorenametheattributesandchoosethesplitcharacter.MaketheconfigurationofthisconverterconsistentwithFigure11andpressthemagnifyingglassbutton.Youshouldbeabletoseewhatthisconverterdoes.Remember,youcanalwaysplaywithattributeswhiledynamicanalysiswindowisopentoseewhatyourconverterisactuallydoing.TryclickingOK.NowyouseehowtheSplitconverterisrepresentedintheconfigurationwindow.Click“Cancel”onthewizardtopreventsavingthechangesyoujustmade.

Page 15: FRIL - SourceForgefril.sourceforge.net/FRIL-Tutorial-3.2.pdf · 4 General architecture Before we get started with FRIL, it is important to understand the architecture of the

15

Figure11:Samplesplitconverterconfiguration

TrimconverterSampleusage:cutfirsttwodigitsfromyearsothatyouhaveYYformatinsteadofYYYY.Thisconverterisusedtotrimsomecharactersfromattributevalues.Wehavealreadyseenhowtouseitbefore.

AppendconverterSampleusage:appendstring“inch”toeveryvalueofheightattribute,sothatyourattributehasaformof“XXXinch”whereXXXaresomenumbers.Thisconvertercanappendagivenstringtothefrontorendoftheattributevalue.Theinputvaluesaretreatedasstrings,andyoucanappendtofrontand/orendofthestring.Totryit,opentheconfigurationofthefirstdatasource(theoneweconfiguredtogether).Gotothesecondstepoftheconfiguration(clickNextinthefirststep).ClickontheboxthatreadsDOB.ChooseUseconverter‐>Appendconverter.MaketheconfigurationofthisconverterconsistentwithwhatyouseeinFigure12andpressthemagnifyingglassbutton.Youshouldbeabletoseewhatthisconverterdoes.Closethedynamicanalysiswindowwhenyouaredoneandclick“Cancel”intheconfigurationofAppendconverter.

Page 16: FRIL - SourceForgefril.sourceforge.net/FRIL-Tutorial-3.2.pdf · 4 General architecture Before we get started with FRIL, it is important to understand the architecture of the

16

Figure12:Sampleappendconverterconfiguration

ReplacestringconverterSampleusage:Replaceabbreviationsinstreetnameswithfullnames(like“Dr.”‐>“Drive”or“Ave.”‐>“Avenue”).Thisconvertercanreplacespecifiedvaluesingiveninputstringwithsomeothervalues.Totryit,opentheconfigurationofthefirstdatasource(theoneweconfiguredtogether).Gotosecondstepoftheconfiguration(clickNextinthefirststep).ClickontheboxthatreadsDOB.ChooseUseconverter‐>Replacestringconverter.MaketheconfigurationofthisconverterconsistentwithFigure13(youwillneedtouse“+”buttonthreetimestoconfigurethereplacementrules,notealsothatyoucanedittheconfigurationbypressingthebuttonbetween“+”and“‐“)andpressthemagnifyingglassbutton.Youshouldbeabletoseewhatthisconverterdoes.Closethedynamicanalysiswindowwhenyouaredoneandclick“Cancel”intheconfigurationofAppendconverter.

Replacestringconverteriscasesensitive.Thismeansthatifyourinputvalueis“10FirstST”andtheonlyreplacementruleis“St”‐>”Street”,the“ST.”intheinputvaluewillnotbereplaced.Tohaveitreplaced,youwillneedtoaddsecondreplacementrulethatreads“ST”‐>”Street”.

Replacestringconverteristricky.Itsreplacementrulesareregularexpressions.Forinstance,“.”doesnotmatchdot,butanycharacter.Areplacementrule“A.B”‐>”CD”willnotreplaceonlystrings“A.B”with“CD”butanycharacterinform“AxB”,wherexisanycharacteras“.”inregularexpressionmeansanycharacter.Tocodetheruleaboveproperlyyouwouldneedtousethefollowingsyntax:“A\.B”‐>”CD”(“\”tellstotreat“.”asadot,notasanycharacter).Formoreinformationonregularexpressionsrefertohttp://www.regular‐expressions.info/reference.html.

Page 17: FRIL - SourceForgefril.sourceforge.net/FRIL-Tutorial-3.2.pdf · 4 General architecture Before we get started with FRIL, it is important to understand the architecture of the

17

Figure13:Replacestringconverter

Page 18: FRIL - SourceForgefril.sourceforge.net/FRIL-Tutorial-3.2.pdf · 4 General architecture Before we get started with FRIL, it is important to understand the architecture of the

18

ConfigurationoflinkageOnceyouaredonewiththeconfigurationofthedatasources,youcanmovetoconfigurationofthelinkageprocess.Tostartthelinkageconfiguration,clickonthebuttonnexttothelabelwithnumber3inFigure2.Youshouldseethelinkagewizard.AcceptthedefaultconfigurationforDatastratificationandclick“Next”.NowyouareabouttotellFRILhowyouwanttoidentifymatchesintheinputdatafiles.

Theconfigurationoflinkagestronglydependsontheconfigurationofdatasources.Ifyouhavenotconfigureddatasourcessofar,gobacktotheprevioussection.Otherwiseyouwillnotbeabletofollowthestepsoutlinedinthissection.

ThemostimportantpartinthelinkageconfigurationisconfigurationofsocalledJoincondition.InJoinconditionyouneedtospecifywhichattributesfromthefirstdatasourcewillbecomparedwithwhichattributesfromtheseconddatasources.Basingonthesecomparisons,FRILwillcalculatematchscoreforeachofpairofrecordsthataretested,andwilldecidewhetherthepairisamatchorpotentialmatch.Toconfigureyourfirstjoincondition,followthesesteps:

1. Click“+”locatedinthepanel“Joincondition”2. Youshouldseeanewwindow.In“Leftcolumn”[email protected]“Rightcolumn”click

[email protected]. Nowyouneedtospecifyhowtheattributeschosenaboveshouldbecompared.Youknowthatthe

attributesarenames,andnamestendtohavemisspellingsorerrors.Probablysomecomparisonmethodthattakesthatintoaccountwilldothebestinthiscase.Let’sthenchoose“Editdistance”asa“Distancemetric”.

4. Nowconfigurationpanelforthedistancemetricshouldbeupdated.Detaileddescriptionofthepanelcanbefoundbelow(seesectionDistancemetrics).Atthemoment,let’sjustacceptthedefaultconfigurationoftheeditdistanceconfiguration.

5. Settheweightvalueto40.MakesureyouconfigurationmatchesFigure14.ClickOK.

Theconfigurationofdistancemetric(Figure14)containstheparametercalled“Emptyvaluescore”.ThisparameterallowssettingthebehaviorofFRILwhenatleastoneofinputvaluesisempty.Recallthateachdistancemetricreturnsvaluebetween0.0and1.0,andthevaluerepresentshowsimilararetheinputvalues.Ifanyoftheinputvalues,however,isempty,thesimilaritybetweenthemisdefinedusingthevalueofparameter“Emptyvaluescore”insteadofusingthedistancemetric.Forinstance,ifyousetthisparameterto0.6,thenthescoreformatchinganemptyvaluewithanyothervaluewillalwaysbe0.6.

6. Windowcloses,andnowyoushouldseeonerowinJoincondition.ObservealsotheCurrentsumofweights–thevalueshouldbe40anditshouldbered.ThereasonisthatallweightsinFRILcurrentlyhavetoaddupto100.

7. Click“+”againandnowchooseDate@[email protected]’suseagaintheEditdistance(despitethisisadate).Changethe“Approvelevel”ineditdistanceconfigurationto0.1and“Disapprovelevel”to0.3(seeFigure15).Setweightoftheconditionto30.ClickOK.

8. Click“+”againandnowchooseheight@[email protected]’susetheNumericdistance.

9. Inthenumericdistanceconfiguration,selectoption“Range(percentage)”andinsert10intobothactivetextfields(seeFigure16).Wearesayingthatheightswillbetreatedasnumbers,andthetolerancebetweenvaluescanbeupto10%.

10. Specify15asaweightandclickOK.

Page 19: FRIL - SourceForgefril.sourceforge.net/FRIL-Tutorial-3.2.pdf · 4 General architecture Before we get started with FRIL, it is important to understand the architecture of the

19

11. Repeatsteps7‐10forattributesweight@[email protected]. Changethevalueofacceptancelevelto71.13. Click“+”inthe“Outputcolumns”panel.Selectalltheavailablecolumns(left‐clickonfirstoptionin

thelist,thenpressShiftandwiththeshiftpressedleft‐clickonlastoptioninthelist).ClickOK.

Youcanarrangecolumnsinthe“Outputcolumns”asyoulike.Clickonanyiteminthelistandusearrowsontherighttochangetheorderofattributes.Forinstance,itmightbeusefultohavecolumnsthatarebeingcomparedinjoincondition(likenameandpatirnt_name)nexttoeachother.Also,youdonotneedtoaddallthecolumnsto“Outputcolumns”.Forinstance,Date‐MMDDandDate‐YYareonlytemporaryattributesandprobablycouldberemovedfromthelist.

14. YourresultingconfigurationshouldlooksimilartooneinFigure17.

Theconfigurationoflinkage(Figure17)hasabutton“Manualreview”.Usethisoptionifyouwanttomanuallyapproveorrejectcertainlinkages.Onceyouclickonthisbutton,youcanenabletheprocessofmanualreview,andsetthelevelsformanuallyreviewedlinkages.Specifically,youwillseeawindowsimilartotheonepresentedinFigure18.Notethatyoucanslidetheredlinesothatthemanualreviewlevelwillbeadjustedtoyourneeds.

Oncethemanualreviewprocessisenabled,itwillbesignalizedbythegreen“bulb”belowthe“Manualreview”.

15. Click“Next”.16. Inthelaststepofjoinconfigurationchoosethe“Sortedneighborhoodmethod”asajoinmethod

type.17. Intheconfigurationofthisjoinmethod,setwindowsizeto8andmakesurethatthesortorderis

asfollows:name,Date,height,weight(seeFigure19).Ifnecessary,clickonentriesinsortorderandusearrowsontherighttosettheappropriateorder.Checkbothcheckboxesatthebottom(summarycreation).

18. MakesureyourconfigurationmatchesFigure19.Ifso,clickFinishinthelinkagewizard.19. Savetheconfiguration(File‐>Save).

Thearrowsontherightthatallowyoutochangethesortorderinsortedneighborhoodmethodwillbedisabledifyoudonothaveanyrowintheordertableselected.

Formoredetailsondistancemetricsseesection“Distancemetrics”.Descriptionofdifferentjoinmethodsisoutsidethescopeofthistutorial.Fordetailsonthatrefertothepublication“FRIL:AToolforComparativeRecordLinkage”.LinktothepublicationcanbefoundonFRILhomepageinthesectionDocumentation/Publications.

ThesummaryfornotjoineddatamaynotbeavailableforallthesearchmethodsthatareavailableinFRIL.Forinstance,ifyouchooseNestedloopjoinasasearchmethod,onlyonesummarybuttonisavailable.Thisiscausedbydetailsofimplementationofeachofsearchmethods.Simplyput,forsomecasesitisextremelyhardtoimplementefficientlythisfeature(e.g.,withoutextendingsignificantlyatimeofthelinkageorCPU/memoryconsumption).

Atthispointyouhavecompleteaconfigurationofdatasourcesandthelinkageprocess.Youarereadytoproceedtotheconfigurationofresultssavers.

Page 20: FRIL - SourceForgefril.sourceforge.net/FRIL-Tutorial-3.2.pdf · 4 General architecture Before we get started with FRIL, it is important to understand the architecture of the

20

Figure14:Conditionusedtocomparenames

Page 21: FRIL - SourceForgefril.sourceforge.net/FRIL-Tutorial-3.2.pdf · 4 General architecture Before we get started with FRIL, it is important to understand the architecture of the

21

Figure15:Editdistanceconfigurationfordates

Figure16:Numericdistanceconfiguration

Page 22: FRIL - SourceForgefril.sourceforge.net/FRIL-Tutorial-3.2.pdf · 4 General architecture Before we get started with FRIL, it is important to understand the architecture of the

22

Figure17:Completedsecondstepoflinkageconfiguration

Figure18:Configurationofmanualreview

Page 23: FRIL - SourceForgefril.sourceforge.net/FRIL-Tutorial-3.2.pdf · 4 General architecture Before we get started with FRIL, it is important to understand the architecture of the

23

Figure19:Laststepoflinkageconfiguration

Page 24: FRIL - SourceForgefril.sourceforge.net/FRIL-Tutorial-3.2.pdf · 4 General architecture Before we get started with FRIL, it is important to understand the architecture of the

24

Resultsavers/ResultsdeduplicationConfigurationofresultssaversistheeasiestpartoftheconfiguration.Allyouneedtodoistospecifywheretosaveyourdata.Clickonthebuttonnexttothelabelwithnumber9inFigure2.Youshouldseetheresultssaverswizard(seeFigure20).Followthesteps:

1. Clickcreatenewbutton(“+”).2. Anewwindowshouldappear.AsaresultsavertypechooseCSVfiledatasaver.3. Asanoutputfileleave“results.csv”.4. ClickOK.5. Leavethesettingsofresultsdeduplicationfornow(seedetailsbelow)andpressFinish.6. Savethecurrentconfiguration.

Figure20:Resultsaver/Resultsdeduplicationconfiguration

That’sit,youaredonewiththeconfigurationoflinkage.Beforewemoveontothenextsection,however,wewilltakeacloserlookattheoptionofdeduplicationoflinkageresults.

Thefirstquestionweneedtoansweriswhydoyouneedtheresultsdeduplicationatall.It’sallbecauseofduplicatesyoucangetwhenyoulinkyourfiles.Imagineasituationwhereyouhavetworecordsinthefirstfile,XandY.Therecords(let’sassumethedatacontainsinformationaboutpatients)canhavesuchattributesasfirstandlastnames,DOB,heightandweight.Ourtworecords,XandYcouldactuallycontain

Page 25: FRIL - SourceForgefril.sourceforge.net/FRIL-Tutorial-3.2.pdf · 4 General architecture Before we get started with FRIL, it is important to understand the architecture of the

25

informationabouttwins.Insuchacase,youcanassumethesamelastnames,thesameDOB,similarheightandweight.Whataboutthefirstnames?Well,theycanalsobeverysimilar,asparentssometimesliketogivetwinssimilarnames.Now,let’sassumethatyouarelinkingthisfileagainstsomeotherfile.FRILcanfindthatbothrecordsXandYmatchtothesamerecord,sayZ,fromthesecondfile.ThereasonisthattheXandYareverysimilarandthustheycanmeetthelinkageconditionandacceptancelevelyouspecified.Intheconsequence,wehavetwolinkages:X‐ZandY‐Z.Asyousee,theZrecordislinkedtwice,andthuswecallthissituationaduplicateinresults.

FRILgivesaneasy‐to‐useoptiontosolvetheproblemdescribedabove.Whatyoucandoistoenableresultsdeduplication.Ifyoudoso,FRILwilllookforsituationsasaboveandwillattempttochooseoneofthelinkages,rejectingthesecondonebeforesavingtheresults.Ifthedecisioncannotbemadeautomatically,thetoolcanaskyouwhattodo.

Toconfiguretheresultsdeduplicationyouhavetochooseoneofthethreeoptionsyouhavein“Resultsdeduplication”inFigure20.Theoptionsyouhavespecifythetypesofduplicatesyoumightwanttolookfor:

1. Youcanrequirethateveryrecordfromthefirstdatasourcewillbelinkedwithexactlyonerecordfromtheothersource.Inthiscase,however,youcanhaveonerecordfromtheotherdatasourcetobelinkedwithmultiplerecordsfromthefirstsource.

2. Youcanrequirethateveryrecordfromtheseconddatasourcewillbelinkedwithexactlyonerecordfromthefirstsource.Inthiscase,however,youcanhaveonerecordfromthefirstdatasourcetobelinkedwithmultiplerecordsfromtheothersource.

3. Youcanrequireboth,e.g.,exactlyonerecordfromthefirstsourcetobelinkedwithexactlyonerecordfromthesecondsource,andexactlyonerecordfromthesecondsourcetobelinkedwithexactlyonerecordfromthefirstsource.

Ifyouhavesomeexperiencewithdatabasesystems,youshouldseethattheoptionsinresultsdeduplicationrelatetotherelationtypesyoucanhavebetweentablesinarelationaldatabase:many‐to‐one,one‐to‐manyandone‐to‐onerelations.

Insomecasesyoumightnotwanttoremoveallduplicates,butyoumightwanttouseoneofthefirsttwooptionsyouhave.Considertwofiles,oneprovidingdataaboutchildrenandtheotherprovidingdataaboutmothers.Itisperfectlyfinetohaveonemotherlinkedwithtwoormorechildren.However,itisnotacceptabletohaveonechildtobelinkedwithtwoormoremothers.

Thesecondoptionthatyouhaveavailableintheconfigurationofresultsdeduplicationis“Deduplicationconflicts”.ThisoptionaddressestheduplicatesthatcannotbedecidedautomaticallybyFRIL.NotethatFRILwillchoosethebestlinkagefromduplicatesbasedonalinkagescore.E.g.,ifyouhavetwoduplicatesasdiscussedabove,X‐ZandY‐Z,andifyourconfigurationprohibitsthistypeofduplicates,thenthededuplicationwillchoosethelinkagewiththehighermatchscore.However,ifthetwolinkageshavethesamescore,anautomaticlegitimatedecisioncannotbemadebyFRIL.Inthiscase,youcanconfigureFRILtodooneofthefollowing:

1. Donothing–thisoptionwillleavebothduplicatesintheresults

2. Chooseonelinkagerandomly–thisoptionwillchooserandomlyoneandrejecttheotherlinkagesbeingduplicates.Onlythechosenlinkagewillbeincludedintheresults.

Page 26: FRIL - SourceForgefril.sourceforge.net/FRIL-Tutorial-3.2.pdf · 4 General architecture Before we get started with FRIL, it is important to understand the architecture of the

26

3. Askyouaboutthedecision–inthiscaseFRILwillpopupawindowinwhichyouwillbeabletodecidewhichlinkagesarefalsepositivesandwhicharetruepositives.

Runningthelinkage,linkageconfidenceTorunthelinkage,clickonabuttonnexttothelabelwithnumber8inFigure2.Youshouldseealinkageprogresswindow(seeFigure21).Notethatyoucanclickonthe“arrow”or“chart”buttons.Tryit:

1. Arrowbuttonopensalinkagedebuggingwindow.Asyouopenit,youcanseewhatiscomparedwithwhatinthelinkageprocess.Youcananalyzetheprocessandseeifyoumadeanyerrorsintheconfiguration.Ifapairofrecordsisjoined,itappearsgreen.Ifitisnotjoined,itappearsred.

2. Chartbuttonopensahistogramofthelinkageresults.

Whenlinkagedebuggingisenabled,thelinkageprocessismuchslowerthanwhendebuggingisdisabled.Makesuretodisableitonceyoufeeleverythingrunswell.

Figure21:Linkageprogresswindow

Oncethelinkageprocesscompletes,youshouldhaveinformationonnumberofrecordsthathavebeenlinked(inourcase2972).Onceyouseethatnotification,clickOK.Notethatyoucouldeasilyrerunthelinkagebypressingagainthestartbutton.InyourfilebrowsernavigatetothedirectoryonyourdiskwhereyouhaveFRIL.Nowyoushouldseethreenewfilesinthatdirectory:

1. results.csvThisfilecontainsresultsofyourlinkage.

Nameofresultsfilecouldbedifferentifyouchosedifferentconfigurationofresultsaver(stepnumber3atthebeginningofthissection).

2. minus‐FirstFile.csvListofrecordsfromtheFirstFilesourcethathavenotbeenjoined.

3. minus‐SecondFile.csvListofrecordsfromtheSecondFilesourcethathavenotbeenjoined.

Page 27: FRIL - SourceForgefril.sourceforge.net/FRIL-Tutorial-3.2.pdf · 4 General architecture Before we get started with FRIL, it is important to understand the architecture of the

27

Nowlet’sopentheresults.csvfile.UseExceloranyotherspreadsheetapplicationlikeCalcfromOpenOffice.Youcanseethatyourresultsfilecontainslinkages.ThelastcolumnintheresultsfileiscalledConfidence.ThisnumbertellsyouhowconfidentFRILwasaboutgivenlinkage(valuebetween0and100).Sortontheconfidencesothatrecordsareinascendingorder.Youshouldseethattherewasonelinkagewithconfidence75.Observethatthislinkagehasdifferencesindatesandheight.Inspectsomelinkageswithotherconfidencelevels.MovetothenextsectiontoseedetailsonhowFRILcalculatesthoseconfidencescores.

Alternatively,youcanchooseaFRIL‐integratedresultsviewertoseethelinkageresultsandnotjoineddatarecords.Touseit,clickononeofthebuttonswithlabels2,5or10inFigure2.Pleasenotethattheminusviewercanbedisabledifyoudidnotconfigurethelinkageprocesstocreateasummaryofnotjoineddata.TheFRIL‐integratedrecordsviewerispresentedinFigure22.

Figure22:FRIL­integratedrecordsviewer

Themaincomponentsoftherecordsviewerincludeatoolbar(topofthewindow),contentsandastatusbar(bottomofthewindow).Themajorelementsare(thenumberscorrespondtothenumbersinlabelsinFigure22):

1. SavebuttonThisbuttonopensadialogthatallowstosaverecordsvisibleintheviewertoafile.Youcaneither

Page 28: FRIL - SourceForgefril.sourceforge.net/FRIL-Tutorial-3.2.pdf · 4 General architecture Before we get started with FRIL, it is important to understand the architecture of the

28

savealltherecordsreturnedbythecurrentquery,oronlyrecordsvisibleoncurrentpage.Notethattherecordsviewerdividesrecordsintopages,witheachpageshowing200recordsbydefault.

2. FilterbuttonThisbuttonenablesfilteringofthevisiblerecords.Theconfigurationoffilteringisthesameastheconfigurationoffilteringforadatasourcediscussedinprevioussections.

3. SortbuttonThisbuttonenablesrecordssorting.Youcansortrecordsbyanyattribute(s).Inthecaseoflinkages,youcanalsoenablesortingbasedonconfidenceofthelinkages.

4. DetailsbuttonThisbuttonopensadetailswindow.Thedetailswindowpresentsvaluesofalltheattributesofrecordsintheview.Toseethedetails,youhavetomoveyourmousepointeroverarecordyouwanttoseethedetailsof.

5. PreferencesbuttonThisbuttonopensapreferenceswindow.Youcanconfiguresuchpreferencesasvisibleattributesorcolors.

6. Statusoffilter/sortThiscomponentgivesquickinformationaboutsort/filterbeingenabledordisabled.Ifsortand/orfilteris(are)enabled,thecolorofthecorrespondingdotturnstogreen.

7. Forward/backwardbuttonsandcurrentpagenumberThebuttonsinthiscomponentallowtomovebetweenthepagesofrecords.

Notethatyoucanopenalinkagesviewwindowandnotjoinedrecordsviewwindow(s)atthesametime.Thismighthelpinyouranalysisofthelinkageresults.

Howconfidenceleveliscalculated?CalculationofaconfidencescoreinFRILissolelybasedonconfigurationyouprovided.RefertoFigure17.Youalreadyknowthatthesumofallweightsinjoinconditionhastobeequalto100.Observethatwehave4sub‐conditions:fornames,dates,heightandweight.Theweightsforeachofsub‐conditionsare40,30,15and15,respectively.InFRIL,eachofdistancefunctionsreturnsvaluebetween0and1.Tocomputethetotalscoreforcomparedpair,FRILmultipliestheweightofeachsub‐conditionbythevaluereturnedbychosendistancefunction.Then,thosenumbersareaddedtogethertocalculateatotalscoreforgivenpairofrecords.IfthetotalscoreisequalorgreaterthantheApprovelevel(refertoFigure17),thepairofrecordsisconsideredtobeamatchandissavedintheresultsfile.Tofindmoreinformationontheprocessofscorecalculation,pleaserefertopublications[1,2].

Togiveanexample,let’sconsiderwehavethefollowingrecordfromsourceFirstFile:

Name=JohnSmith,Date=101182,Height=80,Weight=180

Let’sconsideranotherrecordfromsourceSecondFile:

Patient_name=JohnSmith,DOB=101182,Height=85,Weight=160

Page 29: FRIL - SourceForgefril.sourceforge.net/FRIL-Tutorial-3.2.pdf · 4 General architecture Before we get started with FRIL, it is important to understand the architecture of the

29

Now,let’sassumeagainthatconfigurationofFRIListhesameastheonepresentedinFigure17.Thenamesanddatesarecomparedusinganeditdistance,andasthosevaluesarethesame,theeditdistancereturns1.Valuesofheightinbothrecordsaredifferent.However,tocomparetheheightsweareusingaNumericdistancewithparameters+/‐10%(refertoFigure16).Inthatcase,theallowablerangeforheightisfrom(80–10%of80)to(80+10%of80).Thisgivesarangeof72‐88.ThevalueofheightinthecomparedrecordfromtheSecondFilefallswithintherangeabove,andthescorereturnedbythenumericdistanceforheightwillbe(88‐85)/(88‐80)=3/8=0.375.Asimilarsituationismetforweight:anumericdistancewith+/‐10%isusedforcomparison.Thistime,however,allowablerangeis162‐198(180+/‐10%),andintheconsequenceofcomparingthevalue180with160thedistancefunctionreturns0.Thetotalmatchscoreofthosetworecordswillthusbe(rememberthevaluesofourweights):1*40+1*30+0.375*15+0*15=75.6≈76

Thescoresyouaregettingcanbeinfluencedbyvalueoftheparametercalled“Emptyvaluescore”ifyourdatacontainsemptyvalues.

Thenextsectiongivesmoredetailsonalltheavailabledistancemetrics.

Page 30: FRIL - SourceForgefril.sourceforge.net/FRIL-Tutorial-3.2.pdf · 4 General architecture Before we get started with FRIL, it is important to understand the architecture of the

30

DistancemetricsDistancemetricsareoneofthemostimportantcomponentsavailableinFRIL.Byspecifyingdistance

metricbetweentwoattributes,youareprovidinginformationonhowtheattributesshouldbetreated(asstrings,numbers,dates)andwhatareallowablediscrepanciesbetweenvaluesoftheattributes.Rememberthateachdistancefunctionmapssimilaritybetweenvaluesofattributesontovalueintherange[0…1],where0means“arenotsimilarintermsofthisdistancefunction”and1means“areexactlythesameintermsofthisdistancefunction”.Inthereminderofthissectionwewilldescribeinmoredetailsthe6coredistancefunctions.

Equalfieldsbooleandistance

Inputvaluesaretreatedasrawstrings.Thisdistancefunctionreturnsonlytwovalues,0or1.0isreturnedifcomparedvaluesaredifferent,1ifareexactlythesame.Noadditionalconfigurationisrequired.

Examples:

Firstvalue Secondvalue Score“ABC” “ABC” 1“ABC” “ABB” 0“1.0” “1” 0“1.0” “1.0” 1“10/12/08” “10/12/2008” 0“10/12/2008” “10/12/2008” 1

Editdistance

Inputvaluesaretreatedasrawstrings.Thedistancetestshowmanyoperationsneedtobeappliedtothefirststringsothatitisconvertedtothesecondstring.Possibleoperationsincludedeletingacharacter,insertingacharacterorswitchingtwocharactersthatarenexttoeachother.Youneedtospecifyapprovelevelanddisapprovelevelwhenthisdistancemetricisused(pleaseseeFigure23).Approveanddisapprovelevelintheuserinterfacearealsoboundtothesmallplotwithaccept/rejectvalues.Onceyouchangeeitherofthevaluesinthetextfields,theplotisupdated,andonceyoudragthelineontheplot,thevaluesintextboxesarealsoupdated.Bothparametersthatneedtobeconfiguredcanhavevaluesfromrange[0…1].Thescoreofeditdistancefunctioniscalculatedusingthefollowingformula(strAisthefirststring,strBisthesecondstring,e(strA,strB)istheeditdistancebetweentwostrings,aisanapprovelevelanddisadisapprovelevel):

score(strA,strB) =

0,if e(strA,strB) > d *max(length(strA),length(strB))1,if e(strA,strB) < a*max(length(strA),length(strB))d *max(length(strA),length(strB)) − e(strA,strB)

(d − a) *max(length(strA),length(strB)), otherwise

Page 31: FRIL - SourceForgefril.sourceforge.net/FRIL-Tutorial-3.2.pdf · 4 General architecture Before we get started with FRIL, it is important to understand the architecture of the

31

Figure23:Editdistanceconfiguration

Examplesofscorevaluewhenaneditdistanceisused:

Firstvalue Secondvalue e(strA,strB) Approvelevel

Disapprovelevel

Score

“A” “A” 0 0.1 0.3 1“A” “B” 1 0.1 0.3 0“ADAM” “ADAMS” 1 0.1 0.3 0.5“JACOBDOBBS” “JAKOBHOBBS” 2 0.1 0.3 0.59“JASPERCISNEROS” “ADENCISNEROS” 4 0.1 0.3 0.17“BREANNAROBISON” “BRENNAROBINSON” 2 0.1 0.3 0.83

Don’tforgetthatyoucanusethemagnifyingglasstoseehowcurrentconfigurationofeditdistancecalculatesscores.Withtheanalysiswindowopen,youcanchangetheparametersandtheresultspresentedinthedynamicanalysiswindowwillbeupdated.

Jaro­Winklerdistance

Inputvaluesaretreatedasrawstrings.ThisdistanceusesamethodproposedbyWinkler.Itissuitableforshorterstrings,suchasnames.

Thevalueofthisdistancemetriciscalculatedasfollows:

score(s1,s2) =max(100,d j + lp(1− d j ))

wheredjisthevalueofJaro‐Winklerdistancebetweenstringss1ands2(seebelow),listhelengthofcommonprefixbetweenstringsandpisascalingfactor.ThelandpareparametersthatcanbesetinFRIL‐thesystemusesscaleptogivemorefavorableratingstostringsthatmatchfromthebeginningforasetprefixlengthl.Thedefaultvalueforpis0.1andthevalueforlshouldnotexceed4.

TheJaro‐Winklerdistanceisdefinedas:

Page 32: FRIL - SourceForgefril.sourceforge.net/FRIL-Tutorial-3.2.pdf · 4 General architecture Before we get started with FRIL, it is important to understand the architecture of the

32

d j =13( m| s1 |

+m| s2 |

+m − tm

)

wheremisanumberofmatchingcharacters,andtisthenumberoftranspositions.

Twocharactersfroms1ands2arematching,iftheyarenotfartherthan

max(| s1 |,| s2 |)2

−1positions

apart.Thenumberofmatching(butdifferentsequenceorder)charactersdividedby2definesthenumberoftranspositions.

Examples(p=0.1andl=4):

s1 s2 Matchingstringfroms1

Matchingstringfroms2

m t dj score(s1,s2)

“CRATE” “TRACE” “RATE” “RACE” 3 0 0.867 0.867“DWAYNE” “DUANE” “DANE” “DANE” 4 0 0.223 0.84

Q­gramsdistance

Inputvaluesaretreatedasrawstrings.Thedistanceoperatesonso‐calledq‐grams,substringsfrominputstringsoflengthq.Bothinputstringsarefirstdividedintoq‐grams.Thisoperationproducestwosetsofso‐calledgrams.Next,acountofitemsthatdonotappearintheintersectionofthosetwosetsiscalculated.Theresultingnumberhasanimpactonthescore.Configurationofthisdistancerequiresspecificationofq,approvelevelanddisapprovelevel(seeFigure24).Approveanddisapprovelevelvalueshaveasimilarmeaningasintheeditdistance.Thefinalscoreiscalculatedasfollows(strAisthefirststring,strBisthesecondstring,gramsreturnsq‐gramsofgivenstring,diffreturnsacountofitemsthatdonotappearintheintersectionofsets,aanddareapproveanddisapprovelevels,respectively):

score(strA,strB) =

1, if diff (grams(strA),grams(strB)) < round(a* (| grams(strA) | + | grams(strB) |))0, if diff (grams(strA),grams(strB)) > round(d * (| grams(strA) | + | grams(strB) |))

1− diff (grams(strA),grams(strB))− round(a* (| grams(strA) | + | grams(strB) |))round(d * (| grams(strA) | + | grams(strB) |)) − round(a* (| grams(strA) | + | grams(strB) |))

otherwise

Page 33: FRIL - SourceForgefril.sourceforge.net/FRIL-Tutorial-3.2.pdf · 4 General architecture Before we get started with FRIL, it is important to understand the architecture of the

33

Figure24:Q­gramsdistanceconfiguration

Examples(allexamplesareforq=3):

Firstvalue Secondvalue Firstvaluegrams Secondvaluegrams

Approve/Disapprovelevel

Score

“WILLIAMYOUNG”

“WILLIAMYOUNG”

WIL,ILL,LLI,LIA,IAM,AM,MY,YO,YOU,OUN,UNG,NG,G

WIL,ILL,LLI,LIA,IAM,AM,MY,YO,YOU,OUN,UNG,NG,G

0.3/0.1 1

“DONOVANBLACK”

“DONOVANPACK”

DON,ONO,NOV,OVA,VAN,AN,NB,BL,BLA,LAC,ACK,CK,K

DON,ONO,NOV,OVA,VAN,AN,NP,PA,PAC,ACK,CK,K

0.3/0.1 0.2

“AJYHUDDLESTON”

“JAYHUDDLESTON”

AJY,JY,YH,HU,HUD,UDD,DDL,DLE,LES,EST,STO,TON,ON,N

JAY,AY,YH,HU,HUD,UDD,DDL,DLE,LES,EST,STO,TON,ON,N

0.3/0.1 0.8

Notethattheq‐gramsdistanceworkswithsets,andorderofgramsdoesnotmatter.Thismeansthatthedistancewillgiveveryhighscoretothepairsofstringslike“MICHAELADAMS”and“ADAMSMICHAEL”.Therefore,thisdistanceminimizeserrorsduetoswitching,forinstance,first,secondandlastnames.

Don’tforgetthatyoucanusemagnifyingglasstoseehowcurrentconfigurationofeditdistancecalculatesscores.Withtheanalysiswindowopen,youcanchangetheparametersandtheresultspresentedindynamicanalysiswindowwillbeupdated.

Page 34: FRIL - SourceForgefril.sourceforge.net/FRIL-Tutorial-3.2.pdf · 4 General architecture Before we get started with FRIL, it is important to understand the architecture of the

34

Soundex

Inputvaluesaretreatedasrawstrings.ImplementstheSoundexalgorithmtocalculateascorebetweentwostrings(usuallynames).Requiresthefollowingparameters:soundexlength,approveanddisapprovelevels(seeFigure25).Soundexlengthprovideslengthofsoundexcodethatwillbecomputed.Editdistanceisusedtocomparethesoundexcodesoftwostrings,andapprove/disapprovelevelsaretheparametersofthiseditdistance.Notethatdefaultvaluesof0forapproveanddisapprovelevelsmeanthatonlysoundexcodesthatareexactlythesamewillgetanon‐zeroscore.FormoreinformationonSoundexseehttp://en.wikipedia.org/wiki/Soundex.

Figure25:ConfigurationofSoundexdistance

Numericdistance

Inputvaluesaretreatedasnumbers(realvalues).Thisdistanceallowsuserstospecifyarangeofvaluesthatwillhaveanon‐zeromatchscore.Twooptionsforspecifyingtherangeareavailable(seeFigure26):

1. RangeasfixedvalueRangeisspecifiedusingfixedvalue.Forinstance,onecansaythatarangewillhaveradiusof2.Then,ifaninputvalueis10,therangewillbe[8…12].

2. RangeaspercentageofvalueRangeasapercentageofinputvalue.Forinstance,onecansaythattherangewillhavearadiusof30%ofinputvalue.Then,ifaninputvaluewas10,theresultingrangewouldbe[7…13].

Notethatyoucanspecifydifferentvaluesforleftandrightsideoftherange.

Besidesthespecificationofarange,usercanspecifywhetherthescoreshouldbecalculatedusinglinearapproximation.Iflinearapproximationisdisabled,thefinalscoreiscalculatedusingthefollowingformula(lowerBoundandupperBoundreturntheboundsoftherange):

score(v1,v2) =1, if v2∈ [lowerBound(v1)...upperBound(v1)]0 otherwise

Page 35: FRIL - SourceForgefril.sourceforge.net/FRIL-Tutorial-3.2.pdf · 4 General architecture Before we get started with FRIL, it is important to understand the architecture of the

35

Ifthelinearapproximationisenabled,thefollowingformulaisused:

score(v1,v2) =

upperBound(v1) − v2upperBound(v1) − v1

if v2∈ [v1...upperBound(v1)]

v2 − lowerBound(v1)v1− lowerBound(v1)

if v2∈ [lowerBound(v1)...v1)

0 otherwise

Figure26:Configurationofnumericdistance

Examples:

Firstvalue Secondvalue Linearapproximation

Lowerbound Upperbound Score

10 9 On 6(value‐4) 12(value+2) 0.7510 9 Off 6(value‐4) 12(value+2) 110 11 On 6(value‐4) 12(value+2) 0.510 11 Off 6(value‐4) 12(value+2) 110 12 On 6(value‐4) 12(value+2) 050 48 On 40(value‐20%) 60(value+20%) 0.850 54 On 40(value‐20%) 60(value+20%) 0.6

Datedistance

Inputvaluesaretreatedasdates.ThisdistanceworksverysimilartotheNumericdistance.Insteadofworkingwithnumbers,however,ittreatsinputvaluesasdates.Whenrangesarecomputed,numberofyears,days,hours,minutesorsecondsbetweendatescanbespecified.Optionoflinearapproximationisavailable(thisoptionworksinthesamewayasforthenumericdistance).Inaddition,userhastospecifyaformatofinputdates(seeFigure27).Notethatthedatedistancewillsuggestanappropriateformat.Theformatcanusethefollowingcharacters:

Character Meaning Examplevalue

Page 36: FRIL - SourceForgefril.sourceforge.net/FRIL-Tutorial-3.2.pdf · 4 General architecture Before we get started with FRIL, it is important to understand the architecture of the

36

y Year 1996;96

M Monthinyear July;Jul;07

w Weekinyear 27

W Weekinmonth 2

D Dayinyear 189

d Dayinmonth 10

F Dayofweekinmonth 2

E Dayinweek Tuesday;Tue

a Am/pmmarker PM

H Hourinday(0‐23) 0

k Hourinday(1‐24) 24

K Hourinam/pm(0‐11) 0

h Hourinam/pm(1‐12) 12

m Minuteinhour 30

s Secondinminute 55

S Millisecond 978

z Timezone PacificStandardTime;PST;GMT‐08:00

Z Timezone ‐0800

Sometimesyoumightnotwanttouseeditdistanceforattributesthataredates(likewedidinthistutorial).Considercaseswhenhumanscopydatesfrompaperformstoacomputer.Digits1and7lookfamiliar,sothedate10/22/2007couldbetypedinas10/22/2001.Inthosecases,datedistancewillnotperformthebest.Probablybetterchoiceisaneditdistancethatcancapturesucherrors.

Page 37: FRIL - SourceForgefril.sourceforge.net/FRIL-Tutorial-3.2.pdf · 4 General architecture Before we get started with FRIL, it is important to understand the architecture of the

37

Figure27:Configurationofdatedistance

Ifformatofdateisnotspecifiedcorrectly,thedatedistancewillnotworkinpredictableway.Usedynamicanalysiswindowtomakesurethattheconfigurationisappropriate.

Page 38: FRIL - SourceForgefril.sourceforge.net/FRIL-Tutorial-3.2.pdf · 4 General architecture Before we get started with FRIL, it is important to understand the architecture of the

38

DatadeduplicationDatadeduplicationallowsidentifyingduplicatesingivendatasource.Aduplicateisdefinedastwoormorerecordsthatsatisfyduplicateconditionconfiguredbyauser.TherearetwoaspectsofthedatadeduplicationinFRIL:

1. DeduplicationofdatasourcebeforelinkageDeduplicationbeforelinkageallowsonetoconfigurededuplicationrulesinthelinkageconfiguration.Inthatcase,beforetheactuallinkageisperformed,FRILwilldeduplicateselecteddatasources.Thededuplicationistransparentforuser,andisanintegralpartofthelinkageprocess.Theconfigurationofthededuplicationofdatasourcebeforelinkageisperformedduringthelaststepoftheconfigurationofadatasource.

2. DeduplicationasaseparateprocessFRILcandeduplicatedatasourcewithoutperforminganylinkage.Inthatcase,ausercanchooseadeduplicationmodewhereonlyoneinputfileisrequired(toswitchbetweenmodesuseModemenu).Oncetheconfigurationiscomplete,ausercanrundeduplication.Asaresult,FRILwillsavededuplicateddataand,ifenabledduringtheconfiguration,afilecontainingidentifiedduplicateswillbegenerated.AscreenshotofthededuplicationmodeispresentedinFigure28.Notethattheconfigurationofdeduplicationmodeisquitesimilartotheconfigurationoflinkage,anditrequiresdatasourceconfiguration,deduplicationconfigurationandresultsaverconfiguration.Thedetailsofeachstepoftheconfigurationarediscussedbelow.

Theconfigurationofdeduplicationinbothcasesdescribedaboveisidenticalandusesthesameuserinterface.Toconfigurethededuplicationinadatasourcebeforelinkage,enabledatasourcededuplicationinthelaststepofwizardandpressthePreferencesbutton.Toconfigurededuplicationinthededuplicationmode,press“Configure”buttoninthededuplicationboxinthescreenshotpresentedinFigure28.

Page 39: FRIL - SourceForgefril.sourceforge.net/FRIL-Tutorial-3.2.pdf · 4 General architecture Before we get started with FRIL, it is important to understand the architecture of the

39

Figure28:DeduplicationmodeinFRIL

Theconfigurationofdeduplicationmode,asmentionedbefore,hasthreemajorsteps:aconfigurationofdatasource,aconfigurationofdeduplicationprocessandaconfigurationofresultsaver.Belowwedescribedetailsofeachofthesesteps:

1. DatasourceconfigurationDatasourceconfigurationinthedatadeduplicationmodefollowsthesamestructureasadatasourceconfigurationforlinkage.Thesameapproachanduserinterfaceareused.Inthelaststepoftheconfiguration,userscanchoosewhethertofiltertheinputdataandwhethertogenerateasummaryofaduplicatedata.Notethatyoucannotdisablethededuplicationofadatasource,asthiswouldcontradictwiththepurposeofadeduplicationmode.AscreenshotpresentingthelaststepofdatasourceconfigurationindeduplicationmodeispresentedinFigure29.

2. DeduplicationconfigurationInthisstepyouneedtoconfigurearulethatwillidentifyduplicates.Theconfigurationhastwosteps:aconfigurationofcomparedattributesandaconfigurationofblockingattributeusedfordeduplication.AscreenshotofthefirstofconfigurationstepsispresentedinFigure30.Asyoucansee,similarlyasforthelinkageconfiguration,youneedtoidentifyaconditionthatwillbeusedforduplicatesidentification.Theconditioncanuseanyattributesprovidedbydatasourceand

Page 40: FRIL - SourceForgefril.sourceforge.net/FRIL-Tutorial-3.2.pdf · 4 General architecture Before we get started with FRIL, it is important to understand the architecture of the

40

selectedasavailableinFRIL.Notethat,unlikeforthelinkageconfiguration,youarechoosingonlyattributesfromonedatasourceasonlyonedatasourceisavailable.Whenaddingnewattributestothecondition,youcanchooseanyoftheavailabledistancemetrics(seethesectionDistancemetricsforadetaileddescriptionofthedistancemetricsavailableinFRIL).Onceyouaredonewiththefirststepoflinkageconfiguration,youcanproceedtothenextone(Figure31).Inthesecondstepyouneedtoconfigureablockingvariable.NotethatFRILwilllookforduplicatesonlyamongtherecordsthatendupinthesameblock.Therefore,youneedtothinkwhatisthebestwaytoblockrecordsinordertoidentifyasmanyduplicatesaspossible.Somegoodideasforblockingvariableare:soundexcodeoflastname,birthyear(orbirthdate)orZIPcode.

Theavailableoptionsforblockingmethoddependondistancemetricusedforchosenblockingattribute.IfEqualbooleandistance,DatedistanceorNumericdistancemetricisused,suchanattributewillonlybeallowedtobuildblocksthathavethesamevalueofattribute.IfEditdistance,Jaro‐Winklerdistance,Q‐gramsdistance,SoundexorStreetaddressdistanceisused,onlytheothertwooptionswillbeenabled.

3. ResultsaverThisisthelaststepofdeduplicationconfigurationinadeduplicationmode.Notethatthisstepisnotavailableindeduplicationofdatasourcebeforelinkage,asthededuplicateddataisnotsavedintoanyfile,butratherusedasaninputofthelinkageprocess.Inthisstepallyouneedtodoistochooseafilenamewherethededuplicatedresultswillbesaved.

Figure29:Laststepofdatasourceconfiguration(Deduplicationmode)

Page 41: FRIL - SourceForgefril.sourceforge.net/FRIL-Tutorial-3.2.pdf · 4 General architecture Before we get started with FRIL, it is important to understand the architecture of the

41

Soundexcodeforblockingvariableisavailableonlyforattributesthatarecomparedusingastringdistancemetric(e.g.,editdistanceorsoundexdistance).

Usealwaysanappropriatedistancemetricforgivenfieldtype.Ifyouarecomparingnumericvalues(likeweightorheight),useanumericdistance.Ifyouarecomparingnames,useoneofthestringdistancemetrics.Thiswillallowyoutofindmoreduplicates,forinstancethosecontainingmisspellingsorerrorsinmeasurements.

Howtheduplicatesareidentifiedgiventhededuplicationcondition?ItisveryimportanttounderstandwhatrecordsareconsideredduplicatesbyFRIL.Onceyouprovideyourduplicatesidentificationcondition,tworecordsareconsideredduplicatesifthesumofproductsofdistancemetricvalueforcomparedattributebetweengiventworecordsmultipliedbyweightfortheattributeisgreaterthantheduplicateacceptancelevel.Notethisisverysimilartohowthelinkagesareidentified(seesectionHowconfidenceleveliscalculated?formoredetails).Inthededuplicationyouonlyhave,however,datafromonedatasource.

Emptyvaluescorehasthesameimpactforfindingduplicatesasforfindinglinkages.Refertosection“Howconfidenceleveliscalculated?”formoredetails.

Figure30:Firststepofdeduplicationconfiguration

Page 42: FRIL - SourceForgefril.sourceforge.net/FRIL-Tutorial-3.2.pdf · 4 General architecture Before we get started with FRIL, it is important to understand the architecture of the

42

Figure31:Deduplicationconfiguration­step2

Figure32:Secondstepofdeduplicationconfiguration

Page 43: FRIL - SourceForgefril.sourceforge.net/FRIL-Tutorial-3.2.pdf · 4 General architecture Before we get started with FRIL, it is important to understand the architecture of the

43

References1. P.Jurczyk,J.J.Lu,L.Xiong,J.D.CraganandA.Correa,“FRIL:AToolforComparativeRecord

Linkage”,AmericanMedicalInformaticsAssociation(AMIA)2008AnnualSymposium2. P.Jurczyk,J.J.Lu,L.Xiong,J.D.CraganandA.Correa,“Fine‐grainedrecordintegrationandlinkage

tool”,Birthdefectsresearch,PartA,Clinicalandmolecularteratology,2008Nov;82(11):822‐9.