The NY Citywide Immunization Registry’s MEDD De-Duplication Project
Andrew Borthwick, PhD§; Vikki Papadouka, PhD, MPH*; Deborah Walker, PhD*
*New York City Department of Health
§ChoiceMaker Technologies, Inc.
Adapted from a presentation at the 34th National Immunization Conference, Washington, DC, July 7, 2000
“don’t know” and require human review
Thresholds dividing the merge / don’t know / don’t merge cases are set by the user
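The three-way split described above can be sketched as a pair of user-set thresholds applied to the merge probability. This is a minimal illustration, not MEDD's code: the function name `decide` and the default cut-offs of 0.98 and 0.02 are placeholders chosen for the example.

```python
def decide(p_merge, merge_threshold=0.98, no_merge_threshold=0.02):
    """Map a merge probability to one of three outcomes.

    Both thresholds are user-set; the defaults here are illustrative only.
    """
    if p_merge >= merge_threshold:
        return "merge"
    if p_merge <= no_merge_threshold:
        return "don't merge"
    # Anything between the two thresholds goes to a human reviewer.
    return "don't know"


print(decide(0.995))  # confident duplicate
print(decide(0.539))  # falls between the thresholds
print(decide(0.004))  # confident non-duplicate
```

Moving the thresholds closer together sends fewer pairs to human review at the cost of more automatic errors.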
ChoiceMaker Technologies
New York Citywide Immunization Registry: The MEDD De-duplication Project
Maximum Entropy Modeling
MEDD uses “Maximum Entropy Modeling”
  A new statistical decision-making technique
  Learns the human judgment process by training from examples
  Has been used in sentence parsing, computer vision, financial modeling, and proper-name identification
  Has achieved state-of-the-art results on these problems
Maximum Entropy Modeling: Features
Maximum Entropy uses “Features”
  Feature = a function which looks at specific fields in the pair of records to make a “merge” or “don’t merge” decision
  MEDD has many different features, each of which is assigned a “weight” during training
Sample MEDD Features
Mother’s Birthday
  Match of Mom’s B’day predicts “Merge”
  Mismatch of Mom’s B’day predicts “No-Merge”
  Neither feature fires if Mom’s B’day wasn’t filled in on both records
    We have no evidence in this case
Many other features
  Child’s birthday
  Child’s first and last name
  Medicaid Number
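A feature of the kind listed above can be written as a small function over a pair of records. The sketch below assumes records are dictionaries and uses a hypothetical field name `mother_dob`; the registry's actual schema is not given in these slides.

```python
def mothers_birthday_feature(rec_a, rec_b):
    """Return the outcome this feature predicts, or None if it does not fire."""
    a = rec_a.get("mother_dob")
    b = rec_b.get("mother_dob")
    if not a or not b:
        # The field is missing on at least one record: no evidence either way.
        return None
    return "merge" if a == b else "no-merge"


print(mothers_birthday_feature({"mother_dob": "12/04/76"}, {"mother_dob": "12/04/76"}))
print(mothers_birthday_feature({"mother_dob": "12/04/76"}, {"mother_dob": "03/11/75"}))
print(mothers_birthday_feature({"mother_dob": "12/04/76"}, {}))
```

Returning `None` rather than a prediction keeps a blank field from counting as a mismatch.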
Training the System
Inputs: record pairs hand-marked with merge/no-merge decisions; a set of features
Process: Maximum Entropy Parameter Estimator
Output: a weight for each feature
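To make the training step concrete, here is a small sketch that turns hand-marked pairs and a feature set into one weight per feature. The slides do not say which estimation algorithm MEDD uses, so this example substitutes plain gradient ascent on the log-likelihood; the function name, feature names, step count, and learning rate are all assumptions made for illustration.

```python
import math


def train(pairs, feature_outcomes, steps=200, lr=0.1):
    """pairs: list of (fired_feature_names, label), label is 'merge' or 'no-merge'.
    feature_outcomes: maps each feature name to the outcome it predicts.
    Returns a multiplicative weight per feature (exp of the learned parameter).
    """
    lam = {f: 0.0 for f in feature_outcomes}
    for _ in range(steps):
        grad = {f: 0.0 for f in lam}
        for fired, label in pairs:
            # Sum the parameters of the fired features for each outcome.
            score = {"merge": 0.0, "no-merge": 0.0}
            for f in fired:
                score[feature_outcomes[f]] += lam[f]
            z = math.exp(score["merge"]) + math.exp(score["no-merge"])
            for f in fired:
                outcome = feature_outcomes[f]
                p = math.exp(score[outcome]) / z
                # Observed count minus expected count for this feature.
                grad[f] += (1.0 if label == outcome else 0.0) - p
        for f in lam:
            lam[f] += lr * grad[f]
    return {f: math.exp(v) for f, v in lam.items()}


# Hypothetical toy training set.
outcomes = {"dob_match": "merge", "name_match": "merge", "name_mismatch": "no-merge"}
data = [
    (["dob_match", "name_match"], "merge"),
    (["name_match"], "no-merge"),
    (["dob_match"], "merge"),
    (["name_mismatch"], "no-merge"),
]
print(train(data, outcomes))
```

A feature that agrees with the human decisions ends up with a weight above 1, and one that is often wrong is pushed below 1.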
Probability Computation
For a pair of records, MEDD computes the probability that the pair should be merged as:
$$P(\text{merge}) = \frac{\text{Merge}}{\text{Merge} + \text{NoMerge}}$$
Merge = product of weights of all features predicting “merge” for the pair
NoMerge = product of weights of all features predicting “no merge” for the pair
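The probability computation can be sketched in a few lines. The weights in the usage example are made up for illustration and are not taken from a trained MEDD model; `merge_probability` is a hypothetical name.

```python
import math


def merge_probability(merge_weights, no_merge_weights):
    """Merge / (Merge + NoMerge), where each term is a product of feature weights."""
    merge = math.prod(merge_weights)        # product is 1 when no feature fired
    no_merge = math.prod(no_merge_weights)
    return merge / (merge + no_merge)


# Two features predict "merge", one predicts "no merge" (hypothetical weights).
p = merge_probability([2.0, 3.0], [1.5])
print(round(p, 3))
```

When no feature fires on either side, both products are 1 and the probability is 0.5, which matches the idea that missing fields provide no evidence.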
| Field Name | Record 1 / Record 2 | Feature | Weight | Prediction |
|---|---|---|---|---|
| Last name | Smith / Smith | Match | 1.153 | Merge |
| First name | Emily / Emely | No-match | 1.350 | No-merge |
| | | Soundex | 4.708 | Merge |
| DOB | [04/28/97] / [04/28/97] | Match | 1.138 | Merge |
| Multiple birth | N / N | | | |
| Mom’s Maiden Name | CRUZ (one record only) | | | |
| Mother’s DOB | 12/04/76 (one record only) | | | |
| Street | 4528 3rd Ave / 4528 3rd Ave | Match | 4.342 | Merge |

Predicts “Merge” with 53.9% confidence (Human review)
Sophisticated MEDD Features: Name Frequency
“Rodriguez” is 9 times more common than “Walker” in NYC
Fewer than 3 kids per year are born with the names “Borthwick” and “Papadouka”
Hence we build features categorizing names as “very common”, “somewhat common”, “very rare”, etc.
Given that we have a name match, the fact that the names are very common is a feature predicting “don’t merge”
A match between rare names is a feature predicting “merge”
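A name-frequency feature along these lines could look like the sketch below. The category names come from the slide, but the numeric cut-offs and the births-per-year counts are invented for the example, and the function names are hypothetical.

```python
def frequency_category(births_per_year):
    """Bucket a surname by how often it occurs (cut-offs are illustrative)."""
    if births_per_year < 3:
        return "very rare"
    if births_per_year >= 500:
        return "very common"
    return "somewhat common"


def name_frequency_feature(last_a, last_b, counts):
    """Fire only on a name match; the prediction depends on how common the name is."""
    if last_a.lower() != last_b.lower():
        return None
    category = frequency_category(counts.get(last_a.lower(), 0))
    if category == "very common":
        return "no-merge"
    if category == "very rare":
        return "merge"
    return None  # no feature is described for the middle bucket


# Hypothetical yearly birth counts per surname.
counts = {"rodriguez": 900, "walker": 100, "borthwick": 2}
print(name_frequency_feature("Rodriguez", "Rodriguez", counts))
print(name_frequency_feature("Borthwick", "Borthwick", counts))
```

The common-name feature does not cancel the ordinary name-match feature; it only discounts how much that match should count.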
Sophisticated MEDD Features: Partial Name Match
Soundex: A phonetic representation of names
  Connor = Conor = Conner = CNR
  When the Soundex representation of two names matches, a feature fires predicting “merge”
Edit Distance: Features firing based on two names having an edit distance of 1
  Borthwich, Borthwick, Bortwick
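Both partial-match tests are easy to sketch. The code below uses standard American Soundex, which encodes the three spellings above as "C560" (the slide's "CNR" is shorthand for the same idea), and a textbook Levenshtein edit distance. Whether MEDD uses exactly these variants is not stated in the slides.

```python
_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name):
    """Standard four-character American Soundex code."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = _CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code is kept
    return (result + "000")[:4]


def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


print(soundex("Connor"), soundex("Conor"), soundex("Conner"))
print(edit_distance("Borthwich", "Borthwick"), edit_distance("Borthwick", "Bortwick"))
```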
Special Situation Features
Every database has its quirks
  HMO XYZ always sends its data to the CIR with Day of Birth = “1”
  Birthday = July 1, 1998 not July 15, 1998
We have a special feature:
  If Provider = “HMO XYZ” AND Day of Birth = 1 AND dates differ only on day of birth, THEN predict merge
We plan to allow users to define these types of features themselves
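The special feature above translates almost directly into code. This sketch assumes each record is a dictionary with hypothetical `provider` and `dob` fields, and it reads the rule as applying when either record of the pair came from HMO XYZ with the placeholder day.

```python
from datetime import date


def hmo_xyz_day_feature(rec_a, rec_b):
    """Predict merge when the only date difference is HMO XYZ's placeholder day."""
    dob_a, dob_b = rec_a["dob"], rec_b["dob"]
    same_month_year = (dob_a.year, dob_a.month) == (dob_b.year, dob_b.month)
    differs_only_on_day = same_month_year and dob_a.day != dob_b.day
    has_placeholder = any(
        r["provider"] == "HMO XYZ" and r["dob"].day == 1 for r in (rec_a, rec_b)
    )
    return "merge" if has_placeholder and differs_only_on_day else None


a = {"provider": "HMO XYZ", "dob": date(1998, 7, 1)}
b = {"provider": "Clinic A", "dob": date(1998, 7, 15)}
print(hmo_xyz_day_feature(a, b))
```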
Test Procedure
MEDD tested on c. 3,000 pairs under NYC DOH supervision
Pairs were carefully hand-scored by NYC DOH as Merge/Don’t Merge
ChoiceMaker never saw the test data
MEDD Evaluation Results

| Requested Accuracy | % of Records Needing Human Review |
|---|---|
| 1% False Positive, 1% False Negative | 1.4% |
| 0.5% False Positive, 0.5% False Negative | 2.6% |
| 0.3% False Positive, 0.3% False Negative | 3.2% |

Even with double-checking, human error rate is no better than 0.3%
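The trade-off in these results can be explored on any hand-scored test set by counting errors and review volume for a given pair of thresholds. The sketch below is one way to do that; it assumes error rates are measured as a fraction of all test pairs, which the slides do not specify, and the sample data is invented.

```python
def evaluate(scored_pairs, merge_threshold, no_merge_threshold):
    """scored_pairs: list of (merge_probability, is_duplicate) from a hand-scored set."""
    total = len(scored_pairs)
    review = false_pos = false_neg = 0
    for p, is_duplicate in scored_pairs:
        if p >= merge_threshold:
            if not is_duplicate:
                false_pos += 1  # merged automatically, but humans said no
        elif p <= no_merge_threshold:
            if is_duplicate:
                false_neg += 1  # kept apart automatically, but humans said merge
        else:
            review += 1  # left for human review
    return {
        "review_rate": review / total,
        "false_positive_rate": false_pos / total,
        "false_negative_rate": false_neg / total,
    }


# Invented scores for illustration only.
sample = [(0.99, True), (0.995, False), (0.5, True), (0.01, False), (0.005, True)]
print(evaluate(sample, merge_threshold=0.98, no_merge_threshold=0.02))
```

Running this over a range of thresholds shows how much extra review a tighter accuracy target costs.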
Summary: What MEDD Offers
Can be trained on just 3,000 record pairs
Judges nearly 1,000 record-pairs per second
Achieves very high accuracy by finding the optimal weighting of the different clues (“features”) indicating merge/don’t merge
Says “merge”, “don’t merge”, or “I don’t know”
Can be rigorously tested
Registry management can make informed judgments regarding the effort vs. accuracy trade-off
The 5 Stages of the De-duplication Process
1. “Blocking”: Identify list of possible duplicates (SmartSearch)
2. “Decision-Making”: Identify a definitive list of duplicate records (MEDD)
3. Human Review of
   a. Records marked as “don’t know” by MEDD
   b. Records held by special filters (twins, scanty records, etc.)
4. Linkage: Link records that belong to the same child together (if A=B and B=C then A=C)
5. Update the CIR
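The linkage stage (if A=B and B=C then A=C) amounts to grouping records into connected components. A union-find structure is a common way to do this; the sketch below is illustrative and is not taken from the CIR's implementation.

```python
def link_records(record_ids, merged_pairs):
    """Group records so that any chain of merge decisions lands in one group."""
    parent = {r: r for r in record_ids}

    def find(x):
        # Follow parent pointers to the group representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in merged_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for r in record_ids:
        groups.setdefault(find(r), set()).add(r)
    return list(groups.values())


# A=B and B=C, so A, B, and C end up as one child; D stays on its own.
print(link_records(["A", "B", "C", "D"], [("A", "B"), ("B", "C")]))
```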
Project Avalanche
Project Avalanche: A project by which we systematically de-duplicate the whole CIR by comparing every record to every record meeting certain criteria
Project Avalanche I
Used strict blocking criteria for finding possible duplicates to be passed on to MEDD, such as:
  Exact match on DOB + Medical Record, or
  Exact match on Medicaid number, or
  First name + gender + DOB + last name = maiden name (and vice versa), or
  Last name + First name + DOB
Used 98% as the cut-off for automatic merging
Hand-reviewed records produced by the filters
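The blocking criteria listed above can be sketched as a single predicate over a pair of records. Field names such as `medical_record` and `mother_maiden_name` are hypothetical, and the third rule reflects one reading of "last name = maiden name (and vice versa)": one record's last name equals the mother's maiden name on the other.

```python
def _same(a, b, *fields):
    """True when every field is filled in on both records and equal."""
    return all(a.get(f) and a.get(f) == b.get(f) for f in fields)


def _last_equals_maiden(a, b):
    return bool(a.get("last_name")) and a.get("last_name") == b.get("mother_maiden_name")


def is_possible_duplicate(a, b):
    """Strict blocking: pass the pair on for scoring if any criterion holds."""
    if _same(a, b, "dob", "medical_record"):
        return True
    if _same(a, b, "medicaid_number"):
        return True
    if _same(a, b, "first_name", "gender", "dob") and (
        _last_equals_maiden(a, b) or _last_equals_maiden(b, a)
    ):
        return True
    return _same(a, b, "last_name", "first_name", "dob")


r1 = {"first_name": "Emily", "last_name": "Smith", "dob": "1997-04-28", "gender": "F"}
r2 = {"first_name": "Emily", "last_name": "Cruz", "dob": "1997-04-28", "gender": "F",
      "mother_maiden_name": "Smith"}
print(is_possible_duplicate(r1, r2))
```

Treating a blank field as a non-match keeps two records with empty Medicaid numbers from being paired on that criterion alone.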