Linking data records using probabilistic techniques

Linking data records using probabilistic techniques · Mainframe $1,455 / $1,190 Determ & Prob Automatch (Integrity) Generalized Record Linkage System (GRLS) LinkPro •Links: same

Aug 31, 2018


nguyenkhue
Transcript
Page 1

Linking data records using probabilistic techniques

Page 2

Overview

• The choice between deterministic and probabilistic linkage methods

• Demonstration of probabilistic linkage software - Links

• Conclusions regarding probabilistic and deterministic methods

Page 3

Linkage Beginnings

The Massachusetts Maternal Mortality and Morbidity Project

[Diagram: Hospital Discharge (Mothers); Birth Certificates; Hospital Discharge (Children)]

Page 4

Deterministic Linkage Challenges

• Which are the best variables to link with?

• What is an objective way to decide matched vs. unmatched?

• When do we say “enough is enough”?

Page 5

The MA Pregnancy to Early Life Longitudinal Project (PELL)

[Diagram: Birth Certificate/Fetal Death; Administrative Data: Hospital Discharge, Observational Stay, Emergency Department]

Page 6

The MA Pregnancy to Early Life Longitudinal Project (PELL)

[Diagram: Birth Certificate/Fetal Death; Programmatic Data: MA Healthy Start, Early Intervention, WIC]

Page 7

The MA Pregnancy to Early Life Longitudinal Project (PELL)

[Diagram: Birth Certificate/Fetal Death; Vital and Health Statistics Data: Death Data, Birth Defects]

Page 8

PELL Data Set

• Used for many different analyses

- Program Review

- Surveillance

- Research

How do we create a linked data set with flexibility?

Page 9

New Challenges

• Reduce amount of time spent on each linkage

• Use one linkage algorithm for multiple years of data

• Deal with matched and unmatched in a consistent and objective way

• But how?

Page 10

Page 11

Probabilistic Record Linkage

• Uses probabilities to determine whether a pair of records refers to the same individual

• Calculates weights to quantify the likelihood that a pair of records is a true match

• Probabilistic weights may be either non-specific or value specific

Page 12

General (Non-Specific) Weights

• Agreement on a specific variable

• Example:

- Agreement on date of birth receives a higher weight than agreement on sex

- Disagreement on sex receives a higher penalty than disagreement on date of birth

Page 13

Value Specific Weights

• Agreement on a specific value of the variable being compared

• Example: Comparing initials using value specific weights

- Agreement on initial Z receives higher weight than match on initial S

- Disagreement on initial S receives higher penalty than disagreement on Z
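
As a rough illustration of the agreement side of value specific weights, the sketch below derives a weight for each initial from how common it is in a file. The initial counts and the helper name `value_specific_agreement_weight` are made up for illustration, and the formula log2(1 / relative frequency) is one common simplification (it assumes true matches nearly always agree), not necessarily what LinkPro or Links computes.

```python
import math
from collections import Counter

def value_specific_agreement_weight(value, values):
    """Agreement weight for one value: log2(1 / relative frequency).

    Assumption: the chance that two unrelated records agree on `value`
    is roughly its relative frequency in the file.
    """
    counts = Counter(values)
    rel_freq = counts[value] / len(values)
    return math.log2(1.0 / rel_freq)

# Hypothetical first initials: S is common, Z is rare.
initials = ["S"] * 16 + ["Z"] * 1 + ["M"] * 15
weight_s = value_specific_agreement_weight("S", initials)
weight_z = value_specific_agreement_weight("Z", initials)
print(f"S: {weight_s:.2f}  Z: {weight_z:.2f}")
```

With these invented counts, agreement on the rare initial Z earns a larger weight than agreement on the common initial S.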

Page 14

Benefit of Weights

• Weights objectively reflect our confidence in a match

• Individual choice in cutting off low weights
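
The cutoff idea can be sketched as follows. The pair identifiers, total weights, and the cutoff of 10.0 are hypothetical, and `apply_cutoff` is an illustrative helper rather than a function from any linkage package.

```python
def apply_cutoff(scored_pairs, cutoff):
    """Split (pair_id, total_weight) tuples into accepted and rejected lists."""
    accepted = [p for p in scored_pairs if p[1] >= cutoff]
    rejected = [p for p in scored_pairs if p[1] < cutoff]
    return accepted, rejected

# Hypothetical total weights for four candidate pairs.
scored_pairs = [("a1-b7", 18.4), ("a2-b3", 11.0), ("a3-b9", 4.2), ("a4-b1", -2.5)]
accepted, rejected = apply_cutoff(scored_pairs, cutoff=10.0)
print(accepted)
print(rejected)
```

Raising or lowering the cutoff trades missed links against false links; the choice is left to the analyst.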

Page 15

Probabilistic Linkage Methods

• Some SAS programmers write their own probabilistic code

• Software packages

- Very expensive

- Difficult to use

- Some applications are available as freeware or shareware

Page 16

Page 17

Choosing Probabilistic Software

Product | OS | Initial $ | Yearly $ | Link Type | Description | Audience
Automatch (Integrity) | Windows | $100,000 | ??? | Probabilistic | GUI | Marketing
Generalized Record Linkage System (GRLS) | UNIX | $18,800 | 10% | Probabilistic | ORACLE | Health care
LinkPro | Windows/Mainframe | $1,455 / $1,190 | None | Determ & Prob | SAS | Health care

• Links: same as LinkPro but freeware

• FEBRL: also freeware, open source

Page 18

LinkPro Features

• Inexpensive

• Easy to use

• Created for health care record linkage

• Both deterministic and probabilistic linkage

Page 19

LinkPro Features

• Capacity to recognize and accommodate duplicate records

• Supports full and partial/conditional comparisons

Page 20

LinkPro Features

• Runs on any mainframe, mini, workstation or PC with SAS 6.06 or higher

• Free technical support

Page 21

How LinkPro Works

• Automatically calculates and applies non-specific probabilistic weights

• Weights estimate the likelihood that a pair of records from separate files corresponds to the same individual

Page 22

How Weights Are Calculated

Weights are built on the log2 of the odds (frequency ratio) calculated for each variable

weight = log2( OUTCOME freq in LINKED pairs / OUTCOME freq in UNLINKABLE pairs )
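
A minimal sketch of the formula above, applied to agreement and disagreement outcomes for two variables. All four frequency pairs are invented for illustration (they are not PELL estimates), and `outcome_weight` is a hypothetical helper name.

```python
import math

def outcome_weight(freq_in_linked, freq_in_unlinkable):
    """log2 of the frequency ratio for one comparison outcome."""
    return math.log2(freq_in_linked / freq_in_unlinkable)

# Hypothetical outcome frequencies: (share among linked pairs, share among unlinkable pairs).
sex_agree = outcome_weight(0.98, 0.50)       # sex agrees by chance about half the time
dob_agree = outcome_weight(0.95, 0.003)      # date of birth rarely agrees by chance
sex_disagree = outcome_weight(0.02, 0.50)
dob_disagree = outcome_weight(0.05, 0.997)

for name, w in [("sex agree", sex_agree), ("dob agree", dob_agree),
                ("sex disagree", sex_disagree), ("dob disagree", dob_disagree)]:
    print(f"{name}: {w:+.2f}")
```

Under these assumed frequencies, agreement on date of birth outweighs agreement on sex, and disagreement on sex is penalized more than disagreement on date of birth, which is the pattern described for general weights.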

Page 23

Page 24

LinkPro Statements and Syntax

_LINKPRO DATA1= SAS-data-set

DATA2= SAS-data-set <options> ;

_VAR variable-list;

_BY variable-list;

_RUN;

Page 25

_LINKPRO <options>

_LINKPRO DATA1= SAS-data-set

DATA2= SAS-data-set <options> ;

_VAR variable-list;

_BY variable-list;

_RUN;

Page 26

_LINKPRO <options>

MIN = number

• Minimum number of variables that must agree in _VAR statement
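
The effect of a minimum-agreement rule can be sketched outside SAS as below. The record fields are hypothetical, and treating a missing value as "not agreeing" is an assumption; the slides do not say how LinkPro handles missing values.

```python
def count_agreements(rec1, rec2, variables):
    """Count linkage variables on which both records hold the same non-missing value."""
    return sum(
        1 for v in variables
        if rec1.get(v) is not None and rec1.get(v) == rec2.get(v)
    )

def meets_minimum(rec1, rec2, variables, minimum):
    """True when at least `minimum` of the listed variables agree."""
    return count_agreements(rec1, rec2, variables) >= minimum

# Hypothetical pair of records.
birth = {"surname": "LEE", "given": "ANNA", "birthyr": 1998, "zip": "02139"}
discharge = {"surname": "LEE", "given": "ANN", "birthyr": 1998, "zip": None}
linkage_vars = ["surname", "given", "birthyr", "zip"]

n_agree = count_agreements(birth, discharge, linkage_vars)
print(n_agree, meets_minimum(birth, discharge, linkage_vars, minimum=3))
```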

Page 27

_LINKPRO <options>

USEMEM

• Stores input data in memory for faster execution

SIMPLE

• Deterministic linkage only

Page 28

_LINKPRO <options>

PAIRS = SAS-data-set

• Creates data set containing all ‘linkable’ pairs (potential links)

RESOLVE = 1xN or Nx1

• Allows one-to-many matches

Page 29

_LINKPRO <options>

DEBUG

• Prints all SAS statements and messages for diagnosing problems

Page 30

Page 31

_VAR statement

_LINKPRO DATA1= SAS-data-set

DATA2= SAS-data-set <options> ;

_VAR variable-list;

_BY variable-list;

_RUN;

Page 32

_VAR statement

• Lists all variables used in linkage

• Numeric or character variables

Page 33

_VAR statement

Partial/Conditional Comparisons

• VAR1|VAR2

• VAR1 is compared for agreement; if there is no match, VAR2 is compared

• 3 possible outcomes and weights
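
One way to picture the three outcomes of a VAR1|VAR2 comparison is the sketch below. The outcome labels, field names (`given`, `initial`), and example records are all hypothetical; only the "try VAR1, then fall back to VAR2" order comes from the slide.

```python
def conditional_compare(rec1, rec2, var1, var2):
    """Return one of three outcomes for a VAR1|VAR2 comparison."""
    if rec1[var1] == rec2[var1]:
        return "agree_var1"          # full agreement on the first variable
    if rec1[var2] == rec2[var2]:
        return "agree_var2_only"     # partial agreement via the fallback variable
    return "disagree_both"           # neither variable agrees

# Hypothetical records: full given name with first initial as the fallback.
a = {"given": "KATHERINE", "initial": "K"}
b = {"given": "KATHERINE", "initial": "K"}
c = {"given": "KATE", "initial": "K"}
d = {"given": "MARY", "initial": "M"}

outcomes = [conditional_compare(a, other, "given", "initial") for other in (b, c, d)]
print(outcomes)
```

Each outcome would then receive its own weight, with partial agreement typically falling between full agreement and disagreement.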

Page 34

Page 35

_BY statement

_LINKPRO DATA1= SAS-data-set

DATA2= SAS-data-set <options> ;

_VAR variable-list;

_BY variable-list;

_RUN;

Page 36

_BY statement

• Optional statement

• Variable(s) that must match exactly

• Speeds up linkage
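
Requiring exact agreement on some variables speeds things up because only records sharing those values ever get compared (often called blocking). The sketch below illustrates the idea in general terms; the record layout and the helper `candidate_pairs` are hypothetical, and this is not a description of LinkPro's internals.

```python
from collections import defaultdict

def candidate_pairs(file1, file2, by_vars):
    """Yield only the record pairs that agree exactly on every BY variable."""
    blocks = defaultdict(list)
    for rec2 in file2:
        blocks[tuple(rec2[v] for v in by_vars)].append(rec2)
    for rec1 in file1:
        key = tuple(rec1[v] for v in by_vars)
        for rec2 in blocks.get(key, []):
            yield rec1, rec2

# Hypothetical input files.
file1 = [
    {"id": "a1", "sex": "F", "birthyr": 1998},
    {"id": "a2", "sex": "M", "birthyr": 1999},
]
file2 = [
    {"id": "b1", "sex": "F", "birthyr": 1998},
    {"id": "b2", "sex": "F", "birthyr": 1999},
    {"id": "b3", "sex": "M", "birthyr": 1999},
]

pairs = [(r1["id"], r2["id"]) for r1, r2 in candidate_pairs(file1, file2, ["sex", "birthyr"])]
print(pairs, "instead of", len(file1) * len(file2), "comparisons")
```

The trade-off is that an error in a BY variable prevents the true pair from ever being compared.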

Page 37

_LPX and _WTX statements

_LINKPRO DATA1= SAS-data-set

DATA2= SAS-data-set <options> ;

_VAR variable-list;

_BY variable-list;

_LPX 'SAS-statement(s)';

_WTX 'SAS-statement(s)';

_RUN;

Page 38

_LPX Statement

• Optional statement

• Inserts SAS statements into the data step that generates linkable pairs

_LPX 'if given1^=given2

and given1>given2 then

_matched = _matched+1;';

Page 39

_WTX Statement

• Optional statement

• Inserts SAS statements into the data step that calculates probabilistic weights

_WTX 'if abs(birthyr1-

birthyr2)<=3 then

_wgt=_wgt+1;';
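
For readers who do not use SAS, the adjustment in the _WTX example above amounts to the following logic. This is a plain-Python paraphrase; the function name and the default `tolerance` and `bonus` parameters are illustrative and not part of LinkPro.

```python
def adjust_weight(weight, birthyr1, birthyr2, tolerance=3, bonus=1.0):
    """Add a bonus to the pair weight when birth years differ by at most `tolerance`."""
    if abs(birthyr1 - birthyr2) <= tolerance:
        return weight + bonus
    return weight

close = adjust_weight(5.0, 1998, 2000)   # years differ by 2: bonus applied
far = adjust_weight(5.0, 1990, 2000)     # years differ by 10: unchanged
print(close, far)
```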

Page 40

LinkPro Output Files

1. _LKD: Linked records

2. _TIE: Links that could not be resolved

3. _DAT1: Unlinked records from the first data set

4. _DAT2: Unlinked records from the second data set
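
The four-way split of results can be pictured with the simplified sketch below. It resolves candidates from the first file's point of view only, and it assumes tied records are kept out of the unlinked lists; how LinkPro actually resolves links and populates these files is not stated on the slide. All identifiers and weights are invented.

```python
def partition_links(scored_pairs, ids1, ids2):
    """Split scored (id1, id2, weight) candidates into linked, tied, and unlinked groups."""
    candidates = {}
    for id1, id2, weight in scored_pairs:
        candidates.setdefault(id1, []).append((weight, id2))

    linked, ties = [], []
    for id1, cands in candidates.items():
        top = max(weight for weight, _ in cands)
        winners = sorted(id2 for weight, id2 in cands if weight == top)
        if len(winners) == 1:
            linked.append((id1, winners[0]))   # one clear best candidate
        else:
            ties.append((id1, winners))        # several candidates share the top weight

    seen1 = {id1 for id1, _ in linked} | {id1 for id1, _ in ties}
    seen2 = {id2 for _, id2 in linked} | {id2 for _, ws in ties for id2 in ws}
    unlinked1 = sorted(set(ids1) - seen1)
    unlinked2 = sorted(set(ids2) - seen2)
    return linked, ties, unlinked1, unlinked2

# Hypothetical candidate pairs with total weights.
scored = [("a1", "b1", 15.0), ("a1", "b2", 9.0), ("a2", "b3", 12.0), ("a2", "b4", 12.0)]
linked, ties, unlinked1, unlinked2 = partition_links(
    scored, ["a1", "a2", "a3"], ["b1", "b2", "b3", "b4", "b5"]
)
print(linked, ties, unlinked1, unlinked2)
```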

Page 41

Page 42

LinkPro Versus Deterministic

• Replicated original BC-HD link using LinkPro

• Found over 99% agreement in resulting links from two linkage methods
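
The slides do not say how agreement between the two methods was measured. One plausible measure, sketched here purely as an assumption, is the share of links found by both methods out of all links found by either; the link identifiers below are invented.

```python
def link_agreement(links_a, links_b):
    """Share of links common to both methods out of all links found by either."""
    a, b = set(links_a), set(links_b)
    if not a and not b:
        return 1.0  # two empty link sets agree trivially
    return len(a & b) / len(a | b)

# Hypothetical (birth record id, discharge record id) links from each method.
deterministic = [(1, 101), (2, 102), (3, 103), (4, 104)]
probabilistic = [(1, 101), (2, 102), (3, 103), (5, 105)]

agreement = link_agreement(deterministic, probabilistic)
print(f"{agreement:.0%}")
```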

Page 43

Benefits Of Probabilistic Linkage

• Routinized linkage process

• Provided objective way to deal with matched and unmatched data

• Reduced amount of code and time spent on linking data

• Able to inspect tied records or print out those with lowest 5-10% weight

Page 44

Abandon Deterministic Altogether?

Definitely NOT

Choice of deterministic or probabilistic methods depends on:

- Type of project

- Data

Page 45

Conclusions

• Deterministic vs. Probabilistic

- Depends on your situation and goals

• Probabilistic linkage software can be affordable and easy to use!

Page 46

More Information on LinkPro

http://members.shaw.ca/andre.wajda/linkpro.html

André Wajda

[email protected]@shaw.ca

Page 47

More Information on Links

Randy Walld

[email protected]

Page 48