December 2007 NLP: Conflation Algorithms 1
Natural Language Processing
Conflation Algorithms
Acknowledgements
• John Repici (2002) http://www.creativyst.com/Doc/Articles/SoundEx1/SoundEx1.htm
• Porter, M.F., 1980, An algorithm for suffix stripping, reprinted in Sparck Jones, Karen, and Peter Willet, 1997, Readings in Information Retrieval, San Francisco: Morgan Kaufmann, ISBN 1-55860-454-4. [Vince has a copy of this]
• Jurafsky & Martin appendix B pp 833-836.
Conflation
COMPUTES, COMPUTE, COMPUTATION, COMPUTABILITY, COMPUTING, COMPUTER → COMPUT
Word Conflation Algorithms
• Morphological analysis versus conflation
• Notion of word class is application dependent
– Genealogy: Phonetic similarity
– Information Retrieval: Semantic similarity
• Soundex
• Porter
Problems with Names
• Names can be misspelt: Rossner
• Same name can be spelt in different ways: Kirkop; Chircop
• Same name appears differently in different cultures: Tchaikovsky; Chaicowski
• To solve this problem, we need phonetically oriented algorithms which can find similar sounding terms and names.
• Just such a family of algorithms exists; they are called Soundexes, after the first patented version.
The Soundex Algorithm
• A Soundex algorithm takes a word as input and produces a character string which identifies a set of words that are (roughly) phonetically alike.
• It is very handy for searching large databases.
• Originally developed in 1918 by Margaret K. Odell and Robert C. Russell of the US Bureau of Archives, to simplify census-taking.
Soundex Algorithm 1
The Soundex Algorithm uses the following steps to encode a word:
1. The first character of the word is retained as the first character of the Soundex code.
2. The following letters are discarded: a,e,i,o,u,h,w, and y.
3. Remaining consonants are given a code number.
4. If consonants having the same code number appear consecutively, the number will only be coded once (e.g. "B233" becomes "B23").
Code Numbers
b, p, f, and v 1
c, s, k, g, j, q, x, z 2
d, t 3
l 4
m,n 5
r 6
Soundex Algorithm: Example
The Soundex Algorithm uses the following steps to encode a word, here ROSNER:
1. The first character of the word is retained as the first character of the Soundex code. [R]
2. The following letters are discarded: a, e, i, o, u, h, w, and y. [RSNR]
3. Remaining consonants are given a code number. [R256]
4. If consonants having the same code number appear consecutively, the number will only be coded once (e.g. "B233" becomes "B23"). [R256]
Soundex Algorithm 2
• The resulting code is modified so that it becomes exactly four characters long:
– If it is shorter than 4 characters, zeroes are added to the end (e.g. "B2" becomes "B200").
– If it is longer than 4 characters, the code is truncated (e.g. "B2435" becomes "B243").
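The encoding steps and the code table can be put together in a short Python sketch. It follows only the steps listed in these slides; the names CODES and soundex are illustrative assumptions, and other published Soundex variants treat the first letter and the separating effect of vowels differently.

```python
# Code table from the "Code Numbers" slide, expanded to one entry per letter.
CODES = {}
for letters, digit in [("bpfv", "1"), ("cskgjqxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")]:
    for letter in letters:
        CODES[letter] = digit


def soundex(word):
    """Encode a word with the slide version of Soundex (a sketch, not a standard)."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    # Step 1: keep the first character as the first character of the code.
    first = letters[0].upper()
    # Step 2: discard a, e, i, o, u, h, w and y from the rest of the word.
    kept = [c for c in letters[1:] if c not in "aeiouhwy"]
    # Step 3: give each remaining consonant its code number.
    digits = [CODES[c] for c in kept if c in CODES]
    # Step 4: code consecutive identical numbers only once.
    collapsed = []
    for d in digits:
        if not collapsed or collapsed[-1] != d:
            collapsed.append(d)
    # Final adjustment: pad with zeroes or truncate to exactly four characters.
    return (first + "".join(collapsed) + "000")[:4]


for name in ["Rosner", "Rossner", "Kirkop", "Chircop"]:
    print(name, soundex(name))
```

With this sketch Rosner and Rossner receive the same code, whereas Kirkop and Chircop differ in their retained first letter and so receive different codes.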
Uses for the Soundex Code
• Airline reservations - The soundex code for a passenger's surname is often recorded to avoid confusion when trying to pronounce it.
• U.S. Census - As is noted above, the U.S. Census Department was a frequent user of the Soundex algorithm while trying to compile a listing of families around the turn of the century.
• Genealogy - In genealogy, the Soundex code is most often used to avoid obstacles when dealing with names that might have alternate spellings.
Improvements
• Preprocessing before applying the basic algorithm, e.g. identification of
– DG with G
– GH with H
– GN with N (not 'ng')
– KN with N
– PH with F
• Question: where to stop?
• Question: how to evaluate?
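The preprocessing idea above could be sketched as a list of string rewrites applied before the basic algorithm. The names DIGRAPH_REWRITES and preprocess are hypothetical, and the list contains only the five identifications given on this slide.

```python
# (pattern, replacement) pairs taken from the list above.
DIGRAPH_REWRITES = [("dg", "g"), ("gh", "h"), ("gn", "n"), ("kn", "n"), ("ph", "f")]


def preprocess(name):
    """Rewrite letter pairs to a single letter before phonetic encoding."""
    text = name.lower()
    # Rewrites run in list order, so an earlier rewrite can feed a later one.
    for pattern, replacement in DIGRAPH_REWRITES:
        text = text.replace(pattern, replacement)
    return text


for name in ["Phillip", "Knight", "sign", "sing"]:
    print(name, preprocess(name))
```

Because "gn" is matched literally, a word containing "ng" such as sing is left unchanged.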
IR Applications
• Information Retrieval:
Query → IR System → Relevant Documents
• “Bag of Terms” document model
• What is a single term?
Why Stemming is Necessary
• Frequently we get collections of words of the following kind in the same document
compute, computer, computing, computation, computability ….
• Performance of an IR system will be improved if all of these terms are conflated.
– Fewer terms to worry about
– More accurate statistics
Issues
• Is a dictionary available?
– Stems
– Affixes
• Motivation: linguistic credibility or engineering performance?
• When to remove an affix versus when to leave it alone
• Porter (1980): W1 and W2 should be conflated if there appears to be no difference between the statements "this document is about W1/W2"
relate/relativity vs. radioactive/radioactivity
Consonants and Vowels
• A consonant is a letter other than a, e, i, o, u, and other than y preceded by a consonant. So y is a vowel in sky but a consonant in toy.
• If a letter is not a consonant it is a vowel.
• A sequence of consonants (cc..c) or vowels (vv..v) will be represented by C or V respectively.
• For example the word troubles maps to C V C V C
• Any word or part of a word, therefore, has one of the following forms:
CVCV…C
CVCV…V
VCVC…C
VCVC…V
Measure
• All the above patterns can be replaced by the following regular expression
[C] (VC)^m [V]
where square brackets mark optional parts and (VC)^m means VC repeated m times.
• m is called the measure of any word or word part.
• m=0: tr, ee, tree, y, by
• m=1: trouble, oats, trees, ivy
• m=2: troubles, private
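The consonant/vowel mapping and the measure can be sketched in Python as follows. The function names is_consonant, cv_pattern and measure are illustrative assumptions; the treatment of y follows the definition given on the previous slide, with a word-initial y counted as a consonant.

```python
VOWELS = "aeiou"


def is_consonant(word, i):
    """True if word[i] is a consonant in the sense defined above (lowercase input)."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        # y counts as a vowel only when the preceding letter is a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def cv_pattern(word):
    """Collapse runs of consonants and vowels into single C and V symbols."""
    word = word.lower()
    symbols = []
    for i in range(len(word)):
        symbol = "C" if is_consonant(word, i) else "V"
        if not symbols or symbols[-1] != symbol:
            symbols.append(symbol)
    return "".join(symbols)


def measure(word):
    """Number of VC pairs in the collapsed pattern, i.e. m in [C](VC)^m[V]."""
    return cv_pattern(word).count("VC")


for w in ["tr", "tree", "by", "trouble", "ivy", "troubles", "private"]:
    print(w, cv_pattern(w), measure(w))
```

Because the collapsed pattern strictly alternates between C and V, counting occurrences of "VC" is enough to obtain m.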
Rules
• Rules for removing a suffix are given in the form
(condition) S1 → S2
• i.e. if a word ends with suffix S1, and the stem before S1 satisfies the condition, then S1 is replaced with S2. Example:
(m > 1) EMENT → ε
• Here S2 is empty, so the suffix is simply removed: enlargement → enlarg
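A rule of this form can be sketched as a function that takes the suffix S1, the replacement S2 and a condition on the stem. The names apply_rule and measure are illustrative assumptions; measure is a compact helper that counts vowel-to-consonant transitions, treating y as a vowel only after a consonant.

```python
def measure(stem):
    """Count VC pairs in the stem (the measure m)."""
    m = 0
    prev_is_consonant = None  # None until the first letter has been seen
    for ch in stem.lower():
        is_cons = ch not in "aeiou" and not (ch == "y" and prev_is_consonant is True)
        if is_cons and prev_is_consonant is False:
            m += 1  # a vowel run has just been closed by a consonant
        prev_is_consonant = is_cons
    return m


def apply_rule(word, s1, s2, condition=lambda stem: True):
    """Apply '(condition) S1 -> S2': replace suffix s1 by s2 if the stem qualifies."""
    if word.endswith(s1):
        stem = word[:len(word) - len(s1)]
        if condition(stem):
            return stem + s2
    return word


# (m > 1) EMENT -> empty string
print(apply_rule("enlargement", "ement", "", lambda stem: measure(stem) > 1))
print(apply_rule("cement", "ement", "", lambda stem: measure(stem) > 1))
```

In the second call the stem c has measure 0, so the condition fails and the word is returned unchanged.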
Conditions
• *S - stem ends with s
• *Z - stem ends with z
• *T - stem ends with t
• *v* - stem contains a vowel
• *d - stem ends with a double consonant
• *o - stem ends cvc, where the second c is not w, x or y, e.g. -wil, -hop
• In conditions, Boolean operators are possible, e.g. (m>1 and (*S or *T))
• Sets of rules are applied in 7 steps. Within each step, the rule matching the longest suffix applies.
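The stem conditions listed above can be sketched as small Python predicates. All function names here are illustrative assumptions, and stems are expected in lowercase.

```python
def is_consonant(stem, i):
    """True if stem[i] is a consonant; y is a vowel only after a consonant."""
    ch = stem[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(stem, i - 1)
    return True


def ends_with(stem, letter):
    """*S, *Z, *T: the stem ends with the given letter."""
    return stem.endswith(letter)


def contains_vowel(stem):
    """*v*: the stem contains a vowel."""
    return any(not is_consonant(stem, i) for i in range(len(stem)))


def ends_double_consonant(stem):
    """*d: the stem ends with a double consonant."""
    return (len(stem) >= 2 and stem[-1] == stem[-2]
            and is_consonant(stem, len(stem) - 1))


def ends_cvc(stem):
    """*o: the stem ends consonant-vowel-consonant, the last not being w, x or y."""
    n = len(stem)
    return (n >= 3
            and is_consonant(stem, n - 3)
            and not is_consonant(stem, n - 2)
            and is_consonant(stem, n - 1)
            and stem[-1] not in "wxy")


# Boolean operators in conditions map directly onto Python's and / or / not.
def s_or_t(stem):
    return ends_with(stem, "s") or ends_with(stem, "t")


print(contains_vowel("plaster"), ends_double_consonant("hopp"), ends_cvc("hop"))
```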
Organisation
• Step 1: Plurals and Third Person Singular Verbs (-s)
• Step 2: Verbal Past Tense and Progressive (-ed, -ing)
• Step 3: Y to I, Noun Inflections (fly/flies)
• Steps 4 and 5: Derivational Morphology, Multiple Suffixes (visualisation → visualise)
• Step 6: Derivational Morphology, Single Suffixes
• Step 7: Cleanup
Step 1: Plural Nouns and 3rd Person Singular Verbs
Condition   Rewrite      Example
(none)      SSES → SS    caresses → caress
(none)      IES → I      ponies → poni
(none)      SS → SS      caress → caress
(none)      S → ε        cats → cat
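Step 1 has no conditions, so it makes a compact illustration of longest-suffix matching. The following is a minimal sketch; the names STEP1_RULES and step1 are illustrative assumptions.

```python
# (S1, S2) pairs from the Step 1 table; the empty string stands for "remove".
STEP1_RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]


def step1(word):
    """Apply the Step 1 rule whose suffix S1 is the longest match."""
    for s1, s2 in sorted(STEP1_RULES, key=lambda rule: len(rule[0]), reverse=True):
        if word.endswith(s1):
            return word[:len(word) - len(s1)] + s2
    return word


for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, step1(w))
```

Trying the longest suffix first is what keeps caress intact: SS → SS matches before S → ε can strip the final s.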
Step 2a: Verbal Past Tense and Progressive Forms
Condition   Rewrite      Example
(m > 0)     EED → EE     feed → feed; agreed → agree
(*v*)       ED → ε       plastered → plaster; bled → bled
(*v*)       ING → ε      killing → kill; sing → sing
Step 2b: Cleanup
Applied only if the 2nd or 3rd rule of Step 2a succeeds.
Condition                        Rewrite           Example
(none)                           AT → ATE          generat → generate
(none)                           BL → BLE          troubl → trouble
(none)                           IZ → IZE          capsiz → capsize
(*d and not (*L or *S or *Z))    → single letter   hopp → hop; hiss → hiss
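Steps 2a and 2b can be sketched together, since the cleanup only runs after the ED or ING rule has fired. This sketch covers only the rules tabulated on these two slides; the function names are illustrative assumptions.

```python
def is_consonant(word, i):
    """True if word[i] is a consonant; y is a vowel only after a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Count vowel-to-consonant transitions, i.e. the measure m."""
    return sum(1 for i in range(1, len(stem))
               if is_consonant(stem, i) and not is_consonant(stem, i - 1))


def contains_vowel(stem):
    return any(not is_consonant(stem, i) for i in range(len(stem)))


def cleanup(stem):
    """Step 2b: restore a final e, or reduce a double consonant."""
    if stem.endswith(("at", "bl", "iz")):
        return stem + "e"
    if (len(stem) >= 2 and stem[-1] == stem[-2]
            and is_consonant(stem, len(stem) - 1)
            and stem[-1] not in "lsz"):
        return stem[:-1]
    return stem


def step2(word):
    """Step 2a, followed by the Step 2b cleanup when ED or ING was removed."""
    word = word.lower()
    if word.endswith("eed"):
        stem = word[:-3]
        return stem + "ee" if measure(stem) > 0 else word
    for suffix in ("ed", "ing"):
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            return cleanup(stem) if contains_vowel(stem) else word
    return word


for w in ["feed", "agreed", "plastered", "bled", "killing", "sing", "hopping"]:
    print(w, step2(w))
```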
Step 3: Y to I
Condition   Rewrite   Example
(*v*)       Y → I     happy → happi; cry → cry
STEP 4: Derivational Morphology 1 – Multiple Suffixes (excerpt)
Condition   Rewrite          Example
(m > 0)     ATIONAL → ATE    relational → relate
(m > 0)     TIONAL → TION    conditional → condition
(m > 0)     ENCI → ENCE      valenci → valence
(m > 0)     ABLI → ABLE      comfortabli → comfortable
(m > 0)     OUSLI → OUS      analogousli → analogous
(m > 0)     IZATION → IZE    vectorization → vectorize
(m > 0)     IZER → IZE       digitizer → digitize
(m > 0)     ATION → ATE      generation → generate
(m > 0)     ATOR → ATE       operator → operate
(m > 0)     ALISM → AL       formalism → formal
(m > 0)     IVENESS → IVE    pensiveness → pensive
(m > 0)     FULNESS → FUL    hopefulness → hopeful
(m > 0)     OUSNESS → OUS    callousness → callous
(m > 0)     ALITI → AL       formaliti → formal
(m > 0)     BILITI → BLE     possibiliti → possible
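Because every rule in this step shares the condition (m > 0), the step lends itself to a table-driven sketch. The names STEP4_RULES, measure and step4 are illustrative assumptions, and the rule table holds only rules excerpted above, not the full step.

```python
# Suffix rewrites S1 -> S2; every rule carries the condition (m > 0).
STEP4_RULES = {
    "ational": "ate", "tional": "tion", "enci": "ence", "abli": "able",
    "ousli": "ous", "ization": "ize", "ation": "ate", "ator": "ate",
    "alism": "al", "iveness": "ive", "fulness": "ful", "ousness": "ous",
    "aliti": "al", "biliti": "ble",
}


def measure(stem):
    """Count VC pairs, treating y as a vowel only after a consonant."""
    m = 0
    prev_is_consonant = None  # None until the first letter has been seen
    for ch in stem:
        is_cons = ch not in "aeiou" and not (ch == "y" and prev_is_consonant is True)
        if is_cons and prev_is_consonant is False:
            m += 1
        prev_is_consonant = is_cons
    return m


def step4(word):
    """Apply the rule with the longest matching suffix, if its stem has m > 0."""
    for s1 in sorted(STEP4_RULES, key=len, reverse=True):
        if word.endswith(s1):
            stem = word[:len(word) - len(s1)]
            if measure(stem) > 0:
                return stem + STEP4_RULES[s1]
            return word  # the longest match failed its condition: stop here
    return word


for w in ["relational", "conditional", "rational", "hopefulness"]:
    print(w, step4(w))
```

Only the longest matching suffix is considered: rational matches ATIONAL, its stem r has measure 0, and the word is left alone rather than falling through to a shorter rule.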
Step 6: Derivational Morphology III: Single Suffixes
Condition                  Rewrite      Example
(m > 1)                    AL → ε       revival → reviv
(m > 1)                    ANCE → ε     allowance → allow
(m > 1)                    ENCE → ε     inference → infer
(m > 1)                    ER → ε       airliner → airlin
(m > 1)                    IC → ε       gyroscopic → gyroscop
(m > 1)                    ABLE → ε     adjustable → adjust
(m > 1)                    ANT → ε      irritant → irrit
(m > 1)                    EMENT → ε    replacement → replac
(m > 1)                    MENT → ε     adjustment → adjust
(m > 1)                    ENT → ε      dependent → depend
(m > 1 and (*S or *T))     ION → ε      adoption → adopt
(m > 1)                    OUS → ε      homologous → homolog
(m > 1)                    ISM → ε      formalism → formal
(m > 1)                    ATE → ε      activate → activ
(m > 1)                    ITI → ε      angulariti → angular
Porter Example
• INPUT: in the first focus area, integrated projects shall help develop, principally, common open platforms for software and services supporting a distributed information and decision systems for risk and crisis management
Porter Output
Original Word Stemmed Word
first first
focus focu
area area
integrated integr
projects project
help help
develop develop
principally princip
common common
open open
platforms platform
software softwar
services servic
supporting support
distributed distribut
information inform
decision decis
systems system
risk risk
crisis crisi
management manag
Summary
• Conflation serves different purposes
• Generally, motivation is to achieve an engineering goal rather than linguistic fidelity.
• This can cause errors in the bag of words model.
• Soundex and Porter are very well established and easily available.