
An Open-source Similar-name Finder

Jun 21, 2015


Dallan Quass

An Open-source Similar-name Finder presented by Dallan Quass at RootsTech 2012

An improvement on Soundex
Transcript
Page 1: An Open-source Similar-name Finder

An Open-source Similar-name Finder

Dallan Quass [email protected]

Page 2: An Open-source Similar-name Finder

What's the problem?

Page 3: An Open-source Similar-name Finder

People can't spell unusual names

Maybe a piece of mail addressed to Solverg Quast?

Solverg Quast
5934 Phoenix Ave.
Shoreview, MN 55126

Johnston Bros.
1256 Bristol St.
Mapleton, MN 55126

Should be: Solveig Quass

Page 4: An Open-source Similar-name Finder

People use nicknames

John

Johnny

Jack

Page 5: An Open-source Similar-name Finder

Transcribers make typos

Jhon

Page 6: An Open-source Similar-name Finder

Most of our ancestors didn't know how to read or write

signature

Page 7: An Open-source Similar-name Finder

What does it matter?

Page 8: An Open-source Similar-name Finder

How do you find records?

Johnny Snith

John Smith

Page 9: An Open-source Similar-name Finder

How do you match people?

John Smith

Johnny Smithe

Page 10: An Open-source Similar-name Finder

Not a new problem

Page 11: An Open-source Similar-name Finder

Lots of solutions

Soundex

NYSIIS

Double Metaphone

Refined Soundex

Daitch-Mokotoff

Caverphone

Levenshtein

Jaro-Winkler

Monge-Elkan

Needleman-Wunsch

Smith-Waterman

Page 12: An Open-source Similar-name Finder

No Bullseye

Page 13: An Open-source Similar-name Finder

Why is this so hard?

Page 14: An Open-source Similar-name Finder

How similar are two names?

We’re neighbors

John

Jonny

Joe

I don’t know those guys

Page 15: An Open-source Similar-name Finder

First approach: Coders

Soundex

NYSIIS

Double Metaphone

Refined Soundex

Daitch-Mokotoff

Caverphone

General approach

Combine repeated letters

Remove vowels (except maybe for leading)

Unite similar-sounding letters
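The three steps above can be sketched as a toy coder. This is not Soundex, NYSIIS, or any of the coders named on the slide; the letter groups in `SIMILAR`, the set of dropped letters, and the name `simple_code` are assumptions made purely for illustration.

```python
# Hypothetical letter groups, loosely modelled on Soundex's consonant classes.
SIMILAR = {
    **dict.fromkeys("bfpv", "b"),
    **dict.fromkeys("cgjkqsxz", "c"),
    **dict.fromkeys("dt", "d"),
    **dict.fromkeys("mn", "m"),
}
# Vowels plus the weak letters h, w, y (an assumption; real coders differ here).
DROPPED = set("aeiouyhw")


def collapse(letters):
    """Combine runs of the same letter into a single letter."""
    out = []
    for ch in letters:
        if not out or ch != out[-1]:
            out.append(ch)
    return out


def simple_code(name: str) -> str:
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    # Step 1: combine repeated letters.
    letters = collapse(letters)
    # Step 2: remove vowels, except a leading one.
    letters = [letters[0]] + [ch for ch in letters[1:] if ch not in DROPPED]
    # Step 3: unite similar-sounding letters, then collapse any new repeats.
    letters = collapse([SIMILAR.get(ch, ch) for ch in letters])
    return "".join(letters)


for name in ["Jim", "John", "Jane", "Johan", "Johannes"]:
    print(name, simple_code(name))
```

Under these toy rules Jim, John, Jane, and Johan all receive the same code while Johannes gets a different one, which shows how coarse a coder's grouping can be.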

Page 16: An Open-source Similar-name Finder

First approach: Coders

Jim

John

Jane

Johan

Johannes

Page 17: An Open-source Similar-name Finder

Second approach: Distance functions

Levenshtein

Jaro-Winkler

Monge-Elkan

Needleman-Wunsch

Smith-Waterman

General approach

Align sequences of letters

Score based upon the number of matches, transpositions, differences

Monge-Elkan considers similar-sounding letters
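As a concrete instance of the alignment idea, here is a minimal sketch of plain Levenshtein distance. It counts insertions, deletions, and substitutions only; transpositions and similar-sounding letters, which some of the other functions listed handle, are not modelled. The `similarity` helper and its normalization by the longer name's length are assumptions added for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-letter insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # match (0) or substitute (1)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Scale the distance to 0..1, where 1 means identical (illustrative choice)."""
    longest = max(len(a), len(b)) or 1
    return 1 - levenshtein(a.lower(), b.lower()) / longest


for other in ["Jim", "Jane", "Johan", "Johannes"]:
    print("John", other, levenshtein("john", other.lower()))
```

Each comparison fills a grid of size len(a) × len(b), and a query has to be compared against every candidate name, so the cost grows with the size of the name list.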

Page 18: An Open-source Similar-name Finder

Second approach: Distance functions

Jim

John

Jane

Johan

Johannes

Better results, but

Doesn't scale well

Page 19: An Open-source Similar-name Finder

Can we do better?

Page 20: An Open-source Similar-name Finder

Warning: Machine learning ahead!

Page 21: An Open-source Similar-name Finder

Thank you Ancestry!

Ancestry.com paid someone to label 100,000 pairs of names

Name pairs were drawn from actual matching records at Ancestry

Labeled name pairs have been made freely available

Page 22: An Open-source Similar-name Finder

A closer look at Levenshtein

Jon vs. John: -1

John vs. Bohn: -1

Page 23: An Open-source Similar-name Finder

Maximize your expectations

Expectation Maximization Algorithm

Expectation step: calculate the expected value of a function

Maximization step: find parameters that maximize the expected value

Iterate until convergence

Weighted Edit Distance

Jon vs. John: low cost

John vs. Bohn: high cost
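A weighted edit distance can be sketched as follows. In the talk the costs are learned from labeled name pairs by Expectation Maximization; that learning step is not shown here. The values in `INDEL` and `SUB`, and the function name itself, are made-up illustrations chosen so that dropping an "h" is cheap and swapping "J" for "B" is expensive.

```python
def weighted_edit_distance(a, b, sub_cost, indel_cost,
                           default_sub=1.0, default_indel=1.0):
    """Edit distance where each letter operation has its own cost."""
    a, b = a.lower(), b.lower()
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = d[i - 1][0] + indel_cost.get(a[i - 1], default_indel)
    for j in range(1, cols):
        d[0][j] = d[0][j - 1] + indel_cost.get(b[j - 1], default_indel)
    for i in range(1, rows):
        for j in range(1, cols):
            x, y = a[i - 1], b[j - 1]
            if x == y:
                sub = 0.0
            else:
                # Substitution costs are treated as symmetric in this sketch.
                sub = sub_cost.get((x, y), sub_cost.get((y, x), default_sub))
            d[i][j] = min(
                d[i - 1][j] + indel_cost.get(x, default_indel),  # delete x
                d[i][j - 1] + indel_cost.get(y, default_indel),  # insert y
                d[i - 1][j - 1] + sub,                           # match/substitute
            )
    return d[-1][-1]


# Hypothetical costs; a real system would learn them from labeled pairs.
INDEL = {"h": 0.2}        # inserting or deleting "h" is cheap
SUB = {("j", "b"): 1.5}   # replacing "j" with "b" is expensive

print(weighted_edit_distance("Jon", "John", SUB, INDEL))
print(weighted_edit_distance("John", "Bohn", SUB, INDEL))
```

With unit costs both pairs are one edit apart; with these weights Jon/John scores far lower than John/Bohn.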

Page 24: An Open-source Similar-name Finder

Learn to classify

Positive and negative examples

Features

Coders

Distance functions

Weighted edit distance

Learn weights

several algorithms to choose from

Results in a vector

Threshold separates matches from non-matches
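The recipe on this slide can be sketched with a perceptron, one of the simplest algorithms that learns a weight vector from positive and negative examples. The talk does not say which learning algorithm was used, so the perceptron, the two toy features (`skeleton` as a stand-in coder, plain Levenshtein as a stand-in distance function), and the five training pairs are all assumptions for illustration.

```python
def levenshtein(a, b):
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


def skeleton(name):
    """Toy coder: first letter plus the remaining consonants."""
    return name[0] + "".join(ch for ch in name[1:] if ch not in "aeiouyhw")


def features(a, b):
    a, b = a.lower(), b.lower()
    same_code = 1.0 if skeleton(a) == skeleton(b) else 0.0
    edit_similarity = 1.0 - levenshtein(a, b) / max(len(a), len(b))
    return [1.0, same_code, edit_similarity]  # the leading 1.0 is a bias term


def score(weights, x):
    return sum(w * xi for w, xi in zip(weights, x))


def train_perceptron(examples, max_epochs=1000):
    """Learn one weight per feature; stop once an epoch makes no mistakes."""
    weights = [0.0, 0.0, 0.0]
    for _ in range(max_epochs):
        mistakes = 0
        for (a, b), label in examples:
            x = features(a, b)
            target = 1 if label else -1
            if target * score(weights, x) <= 0:  # misclassified: nudge weights
                weights = [w + target * xi for w, xi in zip(weights, x)]
                mistakes += 1
        if mistakes == 0:
            break
    return weights


def is_match(weights, a, b, threshold=0.0):
    """The threshold separates matches from non-matches."""
    return score(weights, features(a, b)) > threshold


# Hypothetical labeled pairs (True = same name, False = different names).
EXAMPLES = [
    (("John", "Jon"), True),
    (("Smith", "Smyth"), True),
    (("John", "Jane"), False),
    (("Smith", "Jones"), False),
    (("John", "Bohn"), False),
]
weights = train_perceptron(EXAMPLES)
print(weights)
```

Neither feature separates these pairs on its own: John/Jane share a code and John/Bohn are one edit apart. The learned weights combine both signals.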

Page 25: An Open-source Similar-name Finder

Wait, isn't this just another distance function?

Distance functions don't scale, right?

Page 26: An Open-source Similar-name Finder

Right

Page 27: An Open-source Similar-name Finder

Back to the basics

x     f(x)
-5    -1
-3    4.5
0     7
2     3.5
4     2

Page 28: An Open-source Similar-name Finder

Long tail

Page 29: An Open-source Similar-name Finder

Long tail

200,000 Surnames

70,000 Given names

≤ 1/5,000,000 names

Page 30: An Open-source Similar-name Finder

Long tail

Use distance function with table here

Use coder here

Page 31: An Open-source Similar-name Finder

Result: Table initialized by a function

Dallan: Dalana Daleen Dalen Dalin … Talan Tallon

Ryan: Aaran Aran Arrin … Rian Riana ...
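The idea of a table initialized by a function, with a coder as the fallback for long-tail names, can be sketched as below. Plain Levenshtein with a fixed cutoff stands in for the learned scoring function, and `skeleton` stands in for the coder; both, along with the tiny name lists, are assumptions for illustration and not the project's actual data or thresholds.

```python
def levenshtein(a, b):
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


def skeleton(name):
    """Toy coder: first letter plus the remaining consonants."""
    return name[0] + "".join(ch for ch in name[1:] if ch not in "aeiouyhw")


def build_table(common_names, max_distance=2):
    """Precompute variants for common names: pay the pairwise cost once."""
    return {
        name: sorted(other for other in common_names
                     if other != name
                     and levenshtein(name, other) <= max_distance)
        for name in common_names
    }


def build_code_index(all_names):
    """Map each code to the names that share it."""
    index = {}
    for name in all_names:
        index.setdefault(skeleton(name), []).append(name)
    return index


def similar_names(name, table, code_index):
    name = name.lower()
    if name in table:  # common name: look up the precomputed row
        return table[name]
    # Rare name: fall back to names that share the same code.
    return [n for n in code_index.get(skeleton(name), []) if n != name]


COMMON = ["dallan", "dalen", "dalin", "ryan", "rian"]  # toy "head" of the curve
RARE = ["solveig", "solvieg"]                          # toy "long tail"
table = build_table(COMMON)
code_index = build_code_index(COMMON + RARE)

print(similar_names("Dallan", table, code_index))
print(similar_names("Solvieg", table, code_index))
```

Because the result is an ordinary table, rows can be edited by hand afterwards, which a pure function does not allow.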

Page 32: An Open-source Similar-name Finder

A nice thing about tables...

Dallan: Dalana Daleen Dalen Dalin … Talan Tallon

Ryan: Aaran Aran Arrin … Rian Riana ...

Page 33: An Open-source Similar-name Finder

Add to the table

Nicknames

BehindTheName.com

The New American Dictionary of Baby Names by Leslie Dunkling and William Gosling

A Dictionary of Surnames by Patrick Hanks and Flavia Hodges

WeRelate community

Page 34: An Open-source Similar-name Finder

Thank you BehindTheName.com!

Fascinating family trees for given names

Page 35: An Open-source Similar-name Finder

Result

Given names

              Precision (%)  Recall (%)
Soundex       97             65
Our approach  97             74

28% decrease in false negatives

Surnames

              Precision (%)  Recall (%)
Soundex       89             68
Our approach  89             77

28% decrease in false negatives
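The false-negative figure can be derived from the recall numbers, assuming the false-negative rate is simply 100 minus recall (in percent). The function name is hypothetical; the inputs are the recall values from the slide.

```python
def false_negative_reduction(recall_before: int, recall_after: int) -> float:
    """Relative drop in false negatives, given recall in percent."""
    fn_before = 100 - recall_before  # share of true matches that were missed
    fn_after = 100 - recall_after
    return (fn_before - fn_after) / fn_before


surnames = false_negative_reduction(68, 77)      # 32% missed -> 23% missed
given_names = false_negative_reduction(65, 74)   # 35% missed -> 26% missed
print(round(surnames * 100), round(given_names * 100))
```

Under that assumption the surname figures give roughly 28%, matching the slide, and the given-name figures give roughly 26%.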

Page 36: An Open-source Similar-name Finder

Who is using it?

Page 37: An Open-source Similar-name Finder

WeRelate.org

Page 38: An Open-source Similar-name Finder

Continuous improvement

Page 39: An Open-source Similar-name Finder

Continuous improvement

Page 40: An Open-source Similar-name Finder

Community oversight

Page 41: An Open-source Similar-name Finder

How do I use it?

Source code and table available on GitHub: http://github.com/DallanQ/Names

Search Normalize Index Search

Score Eval Service

Page 42: An Open-source Similar-name Finder

Roadmap

Jan 2011 Open-source project created

Jan 2011 Implemented at WeRelate

Feb 2011 Announce at RootsTech

Continued improvements

Page 43: An Open-source Similar-name Finder

Future work

Page 44: An Open-source Similar-name Finder

Future work

Reduce the number of name variants to look up

Look up multiple codes: Refined Soundex?

Cluster names: Mahout?

Remove “chaff” variants from common names

Page 45: An Open-source Similar-name Finder

Conclusion

Images appearing on these slides are copyrighted by the contributors to http://commons.wikimedia.org and are used under license

Thank you Ancestry.com and BehindTheName.com!!!

Identifying name variants is hard

But getting it right is pretty important

names are at the core of genealogical research

Open source algorithm is now freely available

http://github.com/DallanQ/Names

28% reduction in false negatives compared to Soundex

continuous improvement

Hopefully others will benefit from this effort

goal is to improve genealogical searches across the Web

Page 46: An Open-source Similar-name Finder