Probabilistic Deduplication, Data Linkage and Geocoding

Peter Christen

Data Mining Group, Australian National University

in collaboration with

Centre for Epidemiology and Research, New South Wales Department of Health

Contact: [email protected]

Project web page: http://datamining.anu.edu.au/linkage.html

Funded by the ANU, the NSW Department of Health, the Australian Research Council (ARC),

and the Australian Partnership for Advanced Computing (APAC)

Peter Christen, June 2005

Outline

Data cleaning and standardisation

Data linkage and probabilistic linkage

Privacy and ethics

Febrl overview and open source tools

Probabilistic data cleaning and standardisation

Blocking and indexing

Record pair classification

Parallelisation in Febrl

Data set generation

Geocoding


Data cleaning and standardisation (1)

Real world data is often dirty

Missing values, inconsistencies

Typographical and other errors

Different coding schemes / formats

Out-of-date data

Names and addresses are especially prone to data entry errors

Cleaned and standardised data is needed for

Loading into databases and data warehouses

Data mining and other data analysis studies

Data linkage and data integration


Data cleaning and standardisation (2)

[Figure: example raw input record 'Doc. peter Miller 42Main Rd. App. 3a 2600 Canberra A.C.T. 29/4/1986' segmented into output fields: Name (title: doctor, given name: peter, surname: miller), Street (wayfare no.: 42, wayfare name: main, wayfare type: road, unit type: apartment, unit no.: 3a), Locality (locality name: canberra, territory: act, postcode: 2600) and Date of Birth (day: 29, month: 4, year: 1986)]

Remove unwanted characters and words

Expand abbreviations and correct misspellings

Segment data into well defined output fields


Data linkage

Data (or record) linkage is the task of linking together information from one or more data sources representing the same entity

Data linkage is also called database matching, data integration, data scrubbing, or ETL (extraction, transformation and loading)

Three records: which of them represent the same person?

1. Dr Smith, Peter; 42 Miller Street 2602 O’Connor

2. Pete Smith; 42 Miller St 2600 Canberra A.C.T.

3. P. Smithers, 24 Mill Street 2600 Canberra ACT


Data linkage techniques

Deterministic or exact linkage

A unique identifier is needed, which is of high quality

(precise, robust, stable over time, highly available)

For example Medicare, ABN or Tax file number

(are they really unique, stable, trustworthy?)

Probabilistic linkage (Fellegi & Sunter, 1969)

Apply linkage using available (personal) information

(for example names, addresses, dates of birth, etc)

Other techniques (rule-based, fuzzy approach, information retrieval, AI)


Probabilistic linkage

Computer assisted data linkage goes back as far as the 1950s (based on ad-hoc heuristic methods)

Basic ideas of probabilistic linkage were introduced by Newcombe & Kennedy (1962)

Theoretical foundation by Fellegi & Sunter (1969)

Using matching weights based on frequency ratios

(global or value specific ratios)

Compute matching weights for all fields used in linkage

Summation of matching weights is used to designate a pair of records as a link, possible link or non-link


Approximate string comparison

Account for partial agreement between strings

Return score between 0.0 (totally different) and 1.0 (exactly the same)

Examples:

String_1 String_2 Winkler Bigram Edit-Dist

tanya tonya 0.880 0.500 0.800

dwayne duane 0.840 0.222 0.667

sean susan 0.805 0.286 0.600

jon john 0.933 0.400 0.750

itman smith 0.567 0.250 0.000

1st ist 0.778 0.500 0.667

peter ole 0.511 0.000 0.200
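
The Winkler comparator is more involved, but the bigram and edit-distance scores in this table can be reproduced with a short Python sketch like the one below (an illustration only, not Febrl's own comparison functions): the bigram score is the Dice coefficient over character bigrams, and the edit-distance score is one minus the Levenshtein distance divided by the length of the longer string.

def bigrams(s):
    # All overlapping two-character substrings of a string.
    return [s[i:i + 2] for i in range(len(s) - 1)]

def bigram_similarity(s1, s2):
    # Dice coefficient: twice the number of shared bigrams divided by
    # the total number of bigrams in both strings.
    b1, b2 = bigrams(s1), bigrams(s2)
    total = len(b1) + len(b2)
    if total == 0:
        return 0.0
    rest = list(b2)
    common = 0
    for b in b1:
        if b in rest:
            common += 1
            rest.remove(b)
    return 2.0 * common / total

def edit_distance_similarity(s1, s2):
    # Levenshtein distance by dynamic programming, scaled into [0, 1].
    n, m = len(s1), len(s2)
    d = [[max(i, j) if i == 0 or j == 0 else 0 for j in range(m + 1)]
         for i in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return 1.0 - float(d[n][m]) / max(n, m)

print(bigram_similarity('tanya', 'tonya'),         # 0.5
      edit_distance_similarity('tanya', 'tonya'))  # 0.8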


Phonetic name encoding

Bringing together spelling variations of the same name

Examples:

Name Soundex NYSIIS Double-Metaphone

stephen s315 staf stfn

steve s310 staf stf

gail g400 gal kl

gayle g400 gal kl

christine c623 chra krst

christina c623 chra krst

kristina k623 cras krst
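
As an illustration of the idea (a simplified sketch, not Febrl's encode module, which also provides NYSIIS and Double-Metaphone), a basic Soundex encoder can be written in a few lines:

SOUNDEX_CODES = {}
for letters, digit in (('bfpv', '1'), ('cgjkqsxz', '2'), ('dt', '3'),
                       ('l', '4'), ('mn', '5'), ('r', '6')):
    for ch in letters:
        SOUNDEX_CODES[ch] = digit

def soundex(name):
    # Simplified Soundex: keep the first letter, encode the remaining
    # consonants as digits, skip vowels and y, collapse runs of the same
    # code (h and w do not break such a run), then pad or cut to length 4.
    name = name.lower()
    codes = []
    prev = SOUNDEX_CODES.get(name[0], '')
    for ch in name[1:]:
        code = SOUNDEX_CODES.get(ch, '')
        if code and code != prev:
            codes.append(code)
        if ch not in 'hw':
            prev = code
    return (name[0] + ''.join(codes) + '000')[:4]

print(soundex('stephen'), soundex('steve'), soundex('christine'))
# -> s315 s310 c623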


Linkage example: Month of birth

Assume two data sets with an error in the field month of birth

Probability that two linked records (that represent the same person) have the same month value: the L (linked) agreement probability

Probability that two linked records do not have the same month value: the L disagreement probability

Probability that two (randomly picked) unlinked records have the same month value: the U (unlinked) agreement probability

Probability that two unlinked records do not have the same month value: the U disagreement probability

Agreement weight: log2(L agreement / U agreement)

Disagreement weight: log2(L disagreement / U disagreement)
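
In code, the two weights follow directly from these four probabilities, for example (a minimal sketch, not Febrl code; the probability values below are illustrative placeholders, not the numbers used on this slide):

from math import log

def field_weights(l_agree, u_agree):
    # Fellegi & Sunter style weights (log base 2): the agreement weight uses
    # the ratio of the agreement probabilities, the disagreement weight the
    # ratio of the disagreement probabilities.
    w_agree = log(l_agree / u_agree, 2)
    w_disagree = log((1.0 - l_agree) / (1.0 - u_agree), 2)
    return w_agree, w_disagree

# Hypothetical values: linked records agree on month of birth with
# probability 0.95, unlinked records agree by chance with probability 1/12.
print(field_weights(0.95, 1.0 / 12.0))   # approximately (3.5, -4.2)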


Value specific frequencies

Example: Surnames

Assume the frequency of Smith is higher than Dijkstra

(NSW Whitepages: 25,425 Smith, only 3 Dijkstra)

Two records with surname Dijkstra are more likely to be the same person than two records with surname Smith

The matching weights need to be adjusted

Difficulty: How to get value specific frequencies that are characteristic for a given data set

Earlier linkages done on same or similar data

Information from external data sets (e.g. Australian Whitepages)
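
One simple way to picture the adjustment (a sketch under the assumption that the unlinked agreement probability of a surname value is approximated by its relative frequency; not Febrl's exact method):

from math import log

def value_specific_agreement_weight(value, freq_table, m_prob=0.95):
    # Approximate the unlinked (U) agreement probability of a surname value
    # by its relative frequency; rare values then get larger weights.
    u_prob = freq_table[value] / float(sum(freq_table.values()))
    return log(m_prob / u_prob, 2)

# Hypothetical frequency table fragment (the linked agreement probability
# m_prob = 0.95 is also a made-up placeholder).
surname_freqs = {'smith': 25425, 'dijkstra': 3, 'miller': 8000}
print(value_specific_agreement_weight('smith', surname_freqs))     # smaller weight
print(value_specific_agreement_weight('dijkstra', surname_freqs))  # much larger weight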


Final linkage decision

The final weight is the sum of the weights of all fields

Record pairs with a weight above an upper threshold are designated as a link

Record pairs with a weight below a lower threshold are designated as a non-link

Record pairs with a weight between the two thresholds are designated as a possible link (a sketch of this decision rule follows the figure below)

[Figure: histogram of total matching weights (roughly -5 to 20) with lower and upper thresholds marked; many more record pairs have lower weights]
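
The decision rule itself is only a few lines of code (an illustrative sketch; the threshold values are arbitrary placeholders):

def classify(total_weight, lower=0.0, upper=10.0):
    # Compare the summed matching weight of a record pair against the
    # lower and upper thresholds.
    if total_weight > upper:
        return 'link'
    if total_weight < lower:
        return 'non-link'
    return 'possible link'

print(classify(15.2), classify(-3.0), classify(4.5))
# -> link non-link possible link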


Applications and usage

Applications of data linkage

Remove duplicates in a data set (internal linkage)

Merge new records into a larger master data set

Create patient or customer oriented statistics

Compile data for longitudinal (over time) studies

Clean data sets for data analysis and mining projects

Geocode data

Widespread use of data linkage

Census statistics

Business mailing lists

Health and biomedical research (epidemiology)

Fraud and crime detection

Privacy and ethics

For some applications, personal information is not of interest and is removed from the linked data set (for example epidemiology, census statistics, data mining)

In other areas, the linked information is the aim (for example business mailing lists, crime and fraud detection, data surveillance)

Personal privacy and ethics are most important

Privacy Act, 1988

National Statement on Ethical Conduct in Research Involving Humans, 1999


Febrl – Freely extensible biomedical record linkage

Commercial software for data linkage is often expensive and cumbersome to use

Project aims

Allow linkage of larger data sets (high-performance and parallel computing techniques)

Reduce the amount of human resources needed

(improve linkage quality by using machine learning)

Reduce costs (free open source software)

Modules for data cleaning and standardisation, data linkage, deduplication and geocoding

Free, open source: https://sourceforge.net/projects/febrl/


Open source software tools

Scripting language Python (www.python.org)

Easy and rapid prototype software development

Object-oriented and cross-platform (Unix, Win, Mac)

Can handle large data sets in a stable and efficient way

Many external modules, easy to extend

Large user community

Parallel libraries MPI and OpenMP

Widespread use in high-performance computing (quasi standards)

Portability and availability

Parallel Python extensions: PyRO and PyPar


Probabilistic data cleaning and standardisation

Three step approach in Febrl

1. Cleaning

– Based on look-up tables and correction lists

– Remove unwanted characters and words

– Correct various misspellings and abbreviations

2. Tagging

– Split input into a list of words, numbers and separators

– Assign one or more tags to each element of this list

(using look-up tables and some hard-coded rules)

3. Segmenting

– Use either rules or a hidden Markov model (HMM)

to assign list elements to output fields


Step 1: Cleaning

Assume the input component is one string (either name or address – dates are processed differently)

Convert all letters into lower case

Use correction lists which contain pairs of (original, replacement) strings

An empty replacement string results in removing the original string

Correction lists are stored in text files and can be modified by the user

Different correction lists for names and addresses
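
A minimal sketch of this cleaning step (the correction list entries below are made-up examples; in Febrl they are read from the user-modifiable text files mentioned above):

def clean(value, corrections):
    # Lower-case the input, apply each (original, replacement) pair, and
    # normalise whitespace; an empty replacement removes the original string.
    value = value.lower()
    for original, replacement in corrections:
        value = value.replace(original, replacement)
    return ' '.join(value.split())

# Hypothetical name correction list.
name_corrections = [('doc.', 'dr'), ('known as', ''), ('"', '')]
print(clean('Doc. peter Paul MILLER', name_corrections))
# -> 'dr peter paul miller'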


Step 2: Tagging

Cleaned strings are split at whitespace boundaries into lists of words, numbers, characters, etc.

Using look-up tables and some hard-coded rules, each element is tagged with one or more tags

Example:

Uncleaned input string: “Doc. peter Paul MILLER”

Cleaned string: “dr peter paul miller”

Word and tag lists:

[‘dr’, ‘peter’, ‘paul’, ‘miller’]

[‘TI’, ‘GM/SN’, ‘GM’, ‘SN’]
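
The tagging step can be sketched roughly as follows (the look-up tables are tiny made-up fragments; Febrl's tables are much larger and it also applies hard-coded rules for numbers and separators):

# Hypothetical look-up tables mapping known words to tags.
TITLES = {'dr', 'mr', 'mrs', 'ms'}
GIVENNAMES = {'peter', 'paul', 'john'}
SURNAMES = {'peter', 'miller', 'smith'}

def tag_words(cleaned_string):
    # Assign one or more tags to each word; unknown words get the tag 'UN'.
    words = cleaned_string.split()
    tags = []
    for word in words:
        word_tags = []
        if word in TITLES:
            word_tags.append('TI')
        if word in GIVENNAMES:
            word_tags.append('GM')
        if word in SURNAMES:
            word_tags.append('SN')
        tags.append('/'.join(word_tags) if word_tags else 'UN')
    return words, tags

print(tag_words('dr peter paul miller'))
# -> (['dr', 'peter', 'paul', 'miller'], ['TI', 'GM/SN', 'GM', 'SN'])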


Step 3: Segmenting

Using the tag list, assign elements in the word list to the appropriate output fields

Rules based approach (e.g. AutoStan)

Example: “if an element has tag ‘TI’ then assign the corresponding word to the ‘Title’ output field”

Hard to develop and maintain rules

Different sets of rules needed for different data sets

Hidden Markov model (HMM) approach

A machine learning technique (supervised learning)

Training data is needed to build HMMs


Hidden Markov model (HMM)

[Figure: example HMM for names with states Start, Title, Givenname, Middlename, Surname and End, connected by transition probabilities]

An HMM is a probabilistic finite state machine

Made of a set of states and transition probabilities between these states

In each state an observation symbol is emitted with a certain probability distribution

In our approach, the observation symbols are tags and the states correspond to the output fields


HMM probability matrices

[Figure: the same example HMM for names as on the previous slide, with its transition probabilities]

Observation probabilities per state (columns are states):

Observation   Start   Title   Givenname   Middlename   Surname   End

TI – 96% 1% 1% 1% –

GM – 1% 35% 33% 15% –

GF – 1% 35% 27% 14% –

SN – 1% 9% 14% 45% –

UN – 1% 20% 25% 25% –


HMM data segmentation

[Figure: the same example HMM for names as on the previous slides]

For an observation sequence we are interested in the most likely path through a given HMM (in our case an observation sequence is a tag list)

The Viterbi algorithm is used for this task (a dynamic programming approach)

Smoothing is applied to account for unseen data (assign small probabilities to unseen observation symbols)
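
A compact Viterbi sketch for this setting is shown below; the states, transition and observation probabilities are made-up fragments in the spirit of the example HMM, not Febrl's trained matrices, and the End state is omitted for brevity. Unseen tags receive a tiny smoothed probability.

def viterbi(tags, states, trans, emit):
    # Track, for every state, the most probable path that explains the
    # tags seen so far, then return the overall best path.
    best = {s: (trans['Start'].get(s, 0.0) * emit[s].get(tags[0], 1e-10), [s])
            for s in states}
    for t in tags[1:]:
        new_best = {}
        for s in states:
            new_best[s] = max(
                ((best[p][0] * trans[p].get(s, 0.0) * emit[s].get(t, 1e-10),
                  best[p][1] + [s]) for p in states),
                key=lambda item: item[0])
        best = new_best
    return max(best.values(), key=lambda item: item[0])

states = ['Title', 'Givenname', 'Middlename', 'Surname']
trans = {'Start': {'Title': 0.65, 'Givenname': 0.30, 'Surname': 0.05},
         'Title': {'Givenname': 0.85, 'Surname': 0.15},
         'Givenname': {'Middlename': 0.25, 'Surname': 0.75},
         'Middlename': {'Surname': 1.0},
         'Surname': {}}
emit = {'Title': {'TI': 0.96, 'GM/SN': 0.01},
        'Givenname': {'GM': 0.35, 'GM/SN': 0.30},
        'Middlename': {'GM': 0.33, 'GM/SN': 0.25},
        'Surname': {'SN': 0.45, 'GM/SN': 0.15}}

prob, path = viterbi(['TI', 'GM/SN', 'GM', 'SN'], states, trans, emit)
print(path)   # -> ['Title', 'Givenname', 'Middlename', 'Surname']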


HMM segmentation example

[Figure: the same example HMM for names as on the previous slides]

Input word and tag list

[‘dr’, ‘peter’, ‘paul’, ‘miller’]

[‘TI’, ‘GM/SN’, ‘GM’, ‘SN’ ]

Two example paths through the HMM

1: Start -> Title (TI) -> Givenname (GM) -> Middlename (GM) -> Surname (SN) -> End

2: Start -> Title (TI) -> Surname (SN) -> Givenname (GM) -> Surname (SN) -> End


Address HMM standardisation example

[Figure: example HMM for addresses with states Start, Wayfare Number, Wayfare Name, Wayfare Type, Locality Name, Postcode, Territory and End, connected by transition probabilities]

1. Raw input: ’73 Miller St, NORTH SYDENY 2060’

Cleaned into: ’73 miller street north sydney 2060’

2. Word and tag lists:

[’73’, ’miller’, ’street’, ’north_sydney’, ’2060’]

[’NU’, ’UN’, ’WT’, ’LN’, ’PC’]

3. Example path through HMM: Start -> Wayfare Number (NU) -> Wayfare Name (UN) -> Wayfare Type (WT) -> Locality Name (LN) -> Postcode (PC) -> End


HMM training (1)

Both transition and observation probabilities need to be trained using training data (maximum likelihood estimates (MLE) are derived by accumulating frequency counts for transitions and observations)

Training data consists of records, each being a sequence of tag:hmm_state pairs

Example (2 training records):

# ‘42 / 131 miller place manly 2095 new_south_wales’

NU:unnu,SL:sla,NU:wfnu,UN:wfna1,WT:wfty,LN:loc1,PC:pc,TR:ter1

# ‘2 richard street lewisham 2049 new_south_wales’

NU:wfnu,UN:wfna1,WT:wfty,LN:loc1,PC:pc,TR:ter1
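
Deriving the MLE probabilities from such training records amounts to counting and normalising, roughly as in this sketch (an illustration, not Febrl's trainer; no smoothing is applied here):

from collections import defaultdict

def train_hmm(training_records):
    # Accumulate frequency counts for state transitions and for the tags
    # observed in each state, then normalise them into probabilities.
    trans_counts = defaultdict(lambda: defaultdict(int))
    emit_counts = defaultdict(lambda: defaultdict(int))
    for record in training_records:
        prev_state = 'Start'
        for pair in record.split(','):
            tag, state = pair.split(':')
            trans_counts[prev_state][state] += 1
            emit_counts[state][tag] += 1
            prev_state = state
        trans_counts[prev_state]['End'] += 1
    normalise = lambda c: {k: v / float(sum(c.values())) for k, v in c.items()}
    return ({s: normalise(c) for s, c in trans_counts.items()},
            {s: normalise(c) for s, c in emit_counts.items()})

records = ['NU:unnu,SL:sla,NU:wfnu,UN:wfna1,WT:wfty,LN:loc1,PC:pc,TR:ter1',
           'NU:wfnu,UN:wfna1,WT:wfty,LN:loc1,PC:pc,TR:ter1']
transitions, observations = train_hmm(records)
print(transitions['Start'])   # -> {'unnu': 0.5, 'wfnu': 0.5}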


HMM training (2)

A bootstrapping approach is applied for semi-automatic training

1. Manually edit a small number of training records and train a first rough HMM

2. Use this first HMM to segment and tag a larger number of training records

3. Manually check a second set of training records, then train an improved HMM

Only a few person days are needed to get an HMM that results in an accurate standardisation (instead of weeks or even months to develop rules)


Address standardisation results

Various NSW Health data sets

HMM1 trained on 1,450 Death Certificate records

HMM2 contains HMM1 plus 1,000 Midwives Data Collection training records

HMM3 is HMM2 plus 60 unusual training records

AutoStan rules (for ISC) developed over years

Test Data Set (1,000 records each)    HMM1    HMM2    HMM3    AutoStan

Death Certificates                    95.7%   96.8%   97.6%   91.5%

Inpatient Statistics Collection       95.7%   95.9%   97.4%   95.3%


Name standardisation results

NSW Midwives Data Collection (1990 - 2000) (around 963,000 records, no medical information)

10-fold cross-validation study with 10,000 random records (9,000 training and 1,000 test records)

Both Febrl rule based and HMM data cleaning and standardisation

Rules were better because most names were simple

(not much structure to learn for HMM)

Method   Min     Max     Average   StdDev

HMM      83.1%   97.0%   92.0%     4.7%

Rules    97.1%   99.7%   98.2%     0.7%


Blocking and indexing

Number of possible links equals the product of the sizes of the two data sets to be linked (two databases with 1,000,000 and 5,000,000 records will result in 1,000,000 x 5,000,000 = 5 x 10^12 = 5 trillion record pair comparisons)

Performance bottleneck is the (expensive) comparison of field values (similarity measures) between record pairs

Blocking / indexing / filtering techniques are used to reduce the large number of record comparisons (for example, only compare records which have the same postcode value)
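
The basic idea can be sketched in a few lines (a simplification of Febrl's indexing; here the postcode field is used as the blocking key and the field names are made up):

from collections import defaultdict

def block_by_key(records, key):
    # Group records by their blocking key value (for example the postcode).
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec[key]].append(rec)
    return blocks

def candidate_pairs(blocks_a, blocks_b):
    # Only records that fall into the same block form candidate pairs.
    for key, recs_a in blocks_a.items():
        for rec_a in recs_a:
            for rec_b in blocks_b.get(key, []):
                yield rec_a, rec_b

data_a = [{'name': 'peter smith', 'postcode': '2600'},
          {'name': 'paul miller', 'postcode': '2602'}]
data_b = [{'name': 'pete smith', 'postcode': '2600'}]
pairs = list(candidate_pairs(block_by_key(data_a, 'postcode'),
                             block_by_key(data_b, 'postcode')))
print(len(pairs))   # 1 candidate pair instead of 2 full comparisons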


Field comparison functions in Febrl

Exact string

Truncated string (only consider beginning of strings)

Approximate string (using Winkler, Edit dist, Bigram etc.)

Encoded string (using Soundex, NYSIIS, etc.)

Keying difference (allow a number of different characters)

Numeric percentage (allowing percentage tolerance)

Numeric absolute (allow absolute tolerance)

Date (allow day tolerance)

Age (allow percentage tolerance)

Time (allow minute tolerance)

Distance (allow kilometre tolerance)

Record pair classification

For each record pair compared, a vector containing matching weights is calculated

Example:

Record A: [’dr’, ’peter’, ’paul’, ’miller’]

Record B: [’mr’, ’john’, ’’, ’miller’]

Matching weights: [0.2, -3.2, 0.0, 2.4 ]

Matching weights are used to classify record pairs as links, non-links, or possible links

Fellegi & Sunter classifier simply sums all the weights, then uses two thresholds to classify

Improved classifiers are possible(for example using machine learning techniques)
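
Putting the comparison and classification steps together might look roughly like this (the field list, comparison functions, weights and thresholds are all illustrative assumptions, not Febrl's configuration; an approximate comparator would return a similarity between 0 and 1 instead of an exact 0/1):

def compare_pair(rec_a, rec_b, comparators):
    # One weight per compared field, interpolated between the field's
    # disagreement and agreement weight according to the similarity score.
    weights = []
    for field, compare, w_agree, w_disagree in comparators:
        sim = compare(rec_a.get(field, ''), rec_b.get(field, ''))
        weights.append(w_disagree + sim * (w_agree - w_disagree))
    return weights

exact = lambda a, b: 1.0 if a == b else 0.0
# Hypothetical comparator set-up: (field, function, agreement, disagreement).
comparators = [('givenname', exact, 3.5, -4.2),
               ('surname', exact, 4.0, -5.0),
               ('postcode', exact, 2.5, -3.0)]

rec_a = {'givenname': 'peter', 'surname': 'smith', 'postcode': '2600'}
rec_b = {'givenname': 'pete', 'surname': 'smith', 'postcode': '2600'}
vector = compare_pair(rec_a, rec_b, comparators)
print(vector, sum(vector))   # the summed weight feeds the threshold decision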


Parallelisation

Implemented transparently to the user

Currently using MPI via Python module PyPar

Use of super-computing centres is problematic (privacy). Alternative: in-house office clusters

Some initial performance results (on Sun SMP)

[Figure: speedup plots for 1 to 4 processors and data sets of 20,000, 100,000 and 200,000 records; Step 1 (loading and indexing) and Step 2 (record pair comparison and classification)]


Data set generation

Difficult to acquire data for testing and evaluation (as data linkage deals with names and addresses)

Also, linkage status is often not known (hard to evaluate and test new algorithms)

Febrl contains a data set generator

Uses frequency tables for given- and surnames, street names and types, suburbs, postcodes, etc.

Duplicate records are created via random introduction of modifications (like insert/delete/transpose characters, swap field values, delete values, etc.)

Also uses lists of known misspellings


Data set generation – Example

Data set with 4 original and 6 duplicate records

REC_ID, ADDRESS1, ADDRESS2, SUBURB

rec-0-org, wylly place, pine ret vill, taree

rec-0-dup-0, wyllyplace, pine ret vill, taree

rec-0-dup-1, pine ret vill, wylly place, taree

rec-0-dup-2, wylly place, pine ret vill, tared

rec-0-dup-3, wylly parade, pine ret vill, taree

rec-1-org, stuart street, hartford, menton

rec-2-org, griffiths street, myross, kilda

rec-2-dup-0, griffith sstreet, myross, kilda

rec-2-dup-1, griffith street, mycross, kilda

rec-3-org, ellenborough place, kalkite homestead, sydney

Each record is given a unique identifier, which allows the evaluation of accuracy and error rates for data linkage
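
The random modifications can be pictured with a toy sketch like the one below (not the Febrl generator itself, which also uses frequency tables and misspelling lists as described above):

import random

def make_duplicate(record, num_mods=1, seed=None):
    # Create a duplicate by randomly inserting, deleting or transposing
    # characters in randomly chosen field values.
    rnd = random.Random(seed)
    dup = dict(record)
    for _ in range(num_mods):
        field = rnd.choice(sorted(dup.keys()))
        value = dup[field]
        if len(value) < 2:
            continue
        pos = rnd.randrange(len(value) - 1)
        mod = rnd.choice(['insert', 'delete', 'transpose'])
        if mod == 'insert':
            value = value[:pos] + rnd.choice('abcdefghijklmnopqrstuvwxyz') + value[pos:]
        elif mod == 'delete':
            value = value[:pos] + value[pos + 1:]
        else:
            value = value[:pos] + value[pos + 1] + value[pos] + value[pos + 2:]
        dup[field] = value
    return dup

original = {'ADDRESS1': 'wylly place', 'ADDRESS2': 'pine ret vill', 'SUBURB': 'taree'}
print(make_duplicate(original, seed=1))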


Geocoding

The process of matching addresses with geographic locations (longitude and latitude)

It is estimated that 80% to 90% of governmental and business data contain address information (US Federal Geographic Data Committee)

Geocoding tasks

Pre-process the geocoded reference data (cleaning, standardisation and indexing)

Clean and standardise the user addresses

(Approximate) matching of user addresses with the reference data


Geocoding techniques

[Figure: numbered properties along a street, illustrating street centreline based versus property parcel centre based geocoding]

Street centreline based (many commercial systems)

Property parcel centre based (our approach)

A recent study found substantial differences (especially in rural areas) (Cayo and Talbot, Int. Journal of Health Geographics, 2003)


Geocoded national address file

G-NAF: Available since early 2004 (PSMA, http://www.g-naf.com.au/)

Source data from 13 organisations (around 32 million source records)

Processed into 22 normalised database tables

[Figure: G-NAF schema showing Address Detail, Address Alias, Address Site, Address Site Geocode, Street, Street Locality Geocode, Street Locality Alias Geocode, Locality and Locality Alias tables and their 1:n relationships]


Febrl geocoding system

[Figure: Febrl geocoding system: the Process-GNAF module cleans and standardises the G-NAF data files (together with Australia Post and GIS data) and builds inverted index data files; the geocoding module cleans and standardises user data files or Web interface input (via a Web server module) and passes them to the Febrl geocode match engine, which produces geocoded user data]

Only NSW G-NAF data available (around 4 million address, 58,000 street and 5,000 locality records)

Additional Australia Post and GIS data used


Additional data files

Use external Australia Post postcode and suburb look-up tables for correcting and imputing (e.g. if a suburb has a unique postcode this value can be imputed if missing, or corrected if wrong)

Use boundary files for postcodes and suburbs to build neighbouring region lists

Idea: People often record a neighbouring suburb or postcode if it has a higher perceived social status

Create lists for direct and indirect neighbours (neighbouring levels 1 and 2)


Febrl geocoding match engine

Uses cleaned and standardised user address(es) and G-NAF inverted index data

Fuzzy rule based approach (a simplified sketch follows the steps below)

1. Find street match set (street name, type and number)

2. Find postcode and locality match set (with no, then direct, then indirect neighbour levels)

3. Intersect postcode and locality sets with street match set (if no match, increase neighbour level and go back to 2.)

4. Refine with unit, property, and building match sets

5. Retrieve corresponding location (or locations)

6. Return location and match status (address, street or locality level match; none, one or many matches)
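
A heavily simplified sketch of steps 1 to 3 as set intersections over inverted index look-ups (the index layout, field names and toy data below are assumptions for illustration, not the actual G-NAF structures used by Febrl):

def match_address(addr, street_index, locality_index, neighbours):
    # street_index maps (street name, street type) and locality_index maps a
    # locality name to sets of reference address identifiers.
    street_set = street_index.get((addr['street_name'], addr['street_type']), set())
    localities = {addr['locality']}
    for level in (0, 1, 2):               # no, direct, then indirect neighbours
        locality_set = set()
        for loc in localities:
            locality_set |= locality_index.get(loc, set())
        matches = street_set & locality_set
        if matches:
            return matches, 'street level match (neighbour level %d)' % level
        # Expand the locality set to the next neighbouring level and retry.
        localities |= {n for loc in list(localities) for n in neighbours.get(loc, [])}
    return set(), 'no street and locality match'

street_index = {('miller', 'street'): {'gnaf-1', 'gnaf-2'}}
locality_index = {'north sydney': {'gnaf-1'}, 'sydney': {'gnaf-2'}}
neighbours = {'north sydney': ['sydney']}
print(match_address({'street_name': 'miller', 'street_type': 'street',
                     'locality': 'north sydney'},
                    street_index, locality_index, neighbours))
# -> ({'gnaf-1'}, 'street level match (neighbour level 0)')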


Geocoding examples

Red dots: Febrl geocoding (G-NAF based)

Blue dots: Street centreline based geocoding


Outlook

Several research areas

Improving probabilistic data standardisation

New and improved blocking / indexing methods

Apply machine learning techniques for record pair classification

Improve performance (scalability and parallelism)

Project web page: http://datamining.anu.edu.au/linkage.html

Febrl is an ideal experimental platform to develop, implement and evaluate new data standardisation and data linkage algorithms and techniques


Contributions / Acknowledgements

Dr Tim Churches (New South Wales Health Department, Centre for Epidemiology and Research)

Dr Markus Hegland (ANU Mathematical Sciences Institute)

Dr Lee Taylor (New South Wales Health Department, Centre for Epidemiology and Research)

Ms Kim Lim (New South Wales Health Department, Centre for Epidemiology and Research)

Mr Alan Willmore (New South Wales Health Department, Centre for Epidemiology and Research)

Mr Karl Goiser (ANU Computer Science PhD student)

Mr Puthick Hok (ANU Computer Science honours student, 2004)

Mr Justin Zhu (ANU Computer Science honours student, 2002)

Mr David Horgan (ANU Computer Science summer student, 2003/2004)
