The CLUES database: automated search for cognate forms
Australian Linguistics Society Conference, Canberra
4 December 2011
Mark Planigale (Mark Planigale Research & Consultancy)
Tonya Stebbins (RCLT, La Trobe University)
Introduction
• Overview of the design of the CLUES database, being developed as a tool to aid the search for correlates across multiple datasets
• Linguistic model underlying the database
• Explore key issues in developing the methodology
• Show examples of output from the database
Because the design of CLUES is relatively generic, it is potentially applicable to a wide range of languages, and to tasks other than correlate detection.
Context
What is CLUES?
• “Correlate Linking and User-defined Evaluation System”
• Database designed to simultaneously handle lexical data from multiple languages; uses add-on modules for comparative functions
• Primary purpose: identify correlates across two or more languages
• Correlate: a pair of lexemes which are similar in phonetic form and/or meaning
• The linguist assesses which of the identified correlates are cognates, and which are similar for some other reason (borrowing, universal tendencies, accidental similarity)
• Allows the user to adjust the criteria used to evaluate the degree of correlation between lexemes
• Can store, filter and organise the results of comparisons
Computational methods in historical linguistics
• Lexicostatistics
• Typological comparison
• Phylogenetics
• Phoneme inventory comparison
• Modelling effects of sound change rules
• Correlate search > CLUES
A few examples
• Lowe & Mazaudon 1994 – ‘Reconstruction Engine’ (models the operation of proposed sound change rules as a means of checking hypotheses)
• Nakhleh et al. 2005 – Indo-European, phylogenetic
• Holman et al. 2008 – Automated Similarity Judgment Program: 4350 languages; 40 lexical items (edit distance); 85 most stable grammatical (typological) features from the WALS database
• Austronesian Basic Vocabulary Database: 874 mostly Austronesian languages, each represented by around 210 words (http://language.psy.auckland.ac.nz/austronesian/; the project had a phylogenetic focus and did some manual comparative work in preparing the data)
• Greenhill & Gray 2009 – Austronesian, phylogenetic
• Dunn, Burenhult et al. 2011 – Aslian
• Proto-Tai'o'Matic (“merges, searches, and extends several wordlists and proposed reconstructions of proto-Tai and Southwestern Tai”; http://crcl.th.net/index.html?main=http%3A//crcl.th.net/crcl/assoc.htm)
Broad vs. deep approaches to automated lexical comparison

‘Broad and shallow’
• Language sample: relatively large
• Vocabulary sample: constrained, based on a standardised wordlist (e.g. Swadesh 200, 100 or 40)
• Purpose: establish (hypothesised) genetic relationships
• Method: lexicostatistics; phylogenetics
• Typical metrics: phonetic (e.g. edit distance); typological (shared grammatical features); maximum likelihood

‘Narrow and deep’
• Language sample: relatively small
• Vocabulary sample: all available lexical data for selected languages
• Purpose: linguistic and/or cultural reconstruction; modelling language contact and semantic shift
• Method: comparative method with fuzzy matching
• Typical metrics: phonetic (e.g. edit distance); semantic; grammatical
CLUES comparisons can be constrained to core vocabulary (using the wordlist feature); however, it is intended for use within a ‘narrow and deep’ approach.
Design of CLUES
CLUES: Desiderata
Accuracy
• Results agree with human expert judgment
• Minimisation of false positives and negatives
Validity
• Computed similarity level does measure degree of correlation
• Computed similarity level varies directly with cognacy
Reliability
• Like results for like comparison pairs
• Like results for a single comparison pair on repetition
Generalisability
• System performs accurately on new (‘unseen’) data as well as on the data that the similarity metrics were ‘trained’ on
Efficiency
• Comparisons are performed fast enough to be useful
Lexical model (partial)
[Entity-relationship diagram: Lexeme (with part of speech and temporal information) linked to Written form and to Sense (Gloss, Semantic domain); related entities include Wordlist item, Source, Orthography, Language and Phone, with one-to-many and many-to-many cardinalities.]
Three dimensions of lexical similarity
Dimension of comparison: data fields currently available
• Phonetic / phonological (phonetic form of lexeme): written form (mapped to phonetic content)
• Semantic (meaning of lexeme): semantic domain; gloss
• Grammatical (grammatical features of lexeme): word class
In the context of correlate detection, grammatical features may be of interest as a ‘dis-similarising’ feature for lexemes that are highly correlated on form and meaning.
What affects the results?
(Factors ordered from more to less controllable.)
Selection and evaluation of metrics
• Choice of appropriate formal (quantifiable) criteria for similarity
• Impact: validity of results; generalisability of system
Inconsistent representations
• Systematic differences in the representations used for different data sets within the corpus
• Impact: validity of results
Noise
• Random fluctuations within the data that obscure the true value of individual data items, but do not change the underlying nature of the distribution
• Impact: reliability of data; reliability of results
CLUES: Managing representational issues
• Automated generation of phonetic form(s) from written form(s)
• Where required, manual standardisation to common lexicographic conventions
• Manual assignment to a common ontology (semantic domain set)
• Automated mapping onto a shared common set of grammatical features, values and terms
Calculating similarity
Similarity scores are computed at three levels:
• Base scores: written form similarity; semantic domain similarity; gloss similarity; word class similarity
• Subtotals: base scores are combined (with weights w1–w4) into a form subtotal, a meaning subtotal and a grammar subtotal
• Total: the subtotals are combined (with weights w5–w7) into an overall score
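Assuming each subtotal and the overall score are weighted means of their inputs (weights need not sum to 1), the scheme can be sketched as follows, using the Ura/Mali ‘sun’ figures from example 4a; function names are illustrative, not from the CLUES implementation.

```python
# Sketch of the two-level weighted scoring described above.

def weighted_mean(pairs):
    """Weighted mean of (score, weight) pairs."""
    total = sum(w for _, w in pairs)
    return sum(s * w for s, w in pairs) / total if total else 0.0

# Base scores for Ura ɣunǝga vs. Mali kunēngga 'sun' (example 4a):
form = weighted_mean([(0.896, 1.0)])               # written form, weight w1
meaning = weighted_mean([(1.0, 0.5), (1.0, 0.5)])  # gloss w2, semantic domain w3
grammar = weighted_mean([(1.0, 1.0)])              # word class, weight w4

# Subtotals combined with weights w5 = 0.45, w6 = 0.45, w7 = 0.1:
overall = weighted_mean([(form, 0.45), (meaning, 0.45), (grammar, 0.1)])
# overall ≈ 0.953, matching the table
```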
Ura ɣunǝga vs. Mali kunēngga ‘sun’

4a.                 Lexeme 1           Lexeme 2            Base score  Weight  Subtotal  Weight  Overall
Written form(s)     ɣunǝga [ɣunǝga]    kunēngga [ɣunǝŋga]  0.896       1.0       0.896   0.45
Gloss(es)           sun                sun                 1.0         0.5    }  1.0     0.45  } 0.953
Semantic domain(s)  A3                 A3                  1.0         0.5    }
Wordclass           N                  N                   1.0         1.0       1.0     0.1
Sulka kolkha ‘sun’ vs. Mali dulka ‘stone’

4b.                 Lexeme 1           Lexeme 2        Base score  Weight  Subtotal  Weight  Overall
Written form(s)     kolkha [kolkha]    dulka [dulka]   0.828       1.0       0.828   0.45
Gloss(es)           sun                stone           0.0         0.5    }  0.167   0.45  } 0.548
Semantic domain(s)  A3                 A5              0.333       0.5    }
Wordclass           N                  N               1.0         1.0       1.0     0.1
4c.                 Lexeme 1           Lexeme 2        Base score  Weight  Subtotal  Weight  Overall
Written form(s)     kolkha [kolkha]    dulka [dulka]   0.828       1.0       0.828   0.7
Gloss(es)           sun                stone           0.0         0.5    }  0.0     0.2   } 0.68
Semantic domain(s)  A3                 A5              0.333       0.0    }
Wordclass           N                  N               1.0         1.0       1.0     0.1
Sample results: across domains

5a. Small set of lexical data from 7 languages; overall similarity scores; the matrix is symmetrical. Rows and columns are the same nine lexemes, in the order of the rows below.

(tau) N J1 kabarak 'blood' 1 0.309 0.2905 0.657 0.2995 0.5435 0.278 0.2435 0.541
(ura) N A3 ɣunǝga 'sun' 0.309 1 0.34725 0.2665 0.948 0.312 0.3515 0.2445 0.325
(sul) N A5 kre 'stone' 0.2905 0.34725 1 0.2615 0.33875 0.3395 0.2825 0.294 0.2745
(sul) N J1 ka ptaik 'skin' 0.657 0.2665 0.2615 1 0.2895 0.5275 0.2835 0.226 0.587
(mal) N A3 kunēngga 'sun' 0.2995 0.948 0.33875 0.2895 1 0.289 0.3025 0.22 0.3495
(ura) N J1 slǝp 'bone' 0.5435 0.312 0.3395 0.5275 0.289 1 0.326 0.3815 0.8905
(qaq) N T1 ltigi 'fire' 0.278 0.3515 0.2825 0.2835 0.3025 0.326 1 0.6945 0.371
(mal) V T1 lēt 'light a fire' 0.2435 0.2445 0.294 0.226 0.22 0.3815 0.6945 1 0.307
(mal) N J1 slēpki 'bone' 0.541 0.325 0.2745 0.587 0.3495 0.8905 0.371 0.307 1
Sample results: within a domain

5b. Overall similarity scores. Rows and columns are the same eight lexemes, in the order of the rows below.

(qaq) N A5 dul 'stone' 1 1 0.875 0.6945 0.7355 0.759 0.739 0.425
(ura) N A5 dul 'stone' 1 1 0.875 0.6945 0.7355 0.759 0.739 0.425
(mal) N A5 dulka 'stone' 0.875 0.875 1 0.776 0.79 0.7205 0.7355 0.426
(tau) N A5 aaletpala 'stone' 0.6945 0.6945 0.776 1 0.7375 0.727 0.73 0.3815
(sul) N A5 kre 'stone' 0.7355 0.7355 0.79 0.7375 1 0.7785 0.798 0.3075
(kua) N A5 vat 'stone' 0.759 0.759 0.7205 0.727 0.7785 1 0.9805 0.3095
(sia) N A5 fat 'stone' 0.739 0.739 0.7355 0.73 0.798 0.9805 1 0.298
(kua) N M1 dududul 'fighting stone' 0.425 0.425 0.426 0.3815 0.3075 0.3095 0.298 1
Sample results: within a domain

5c. Form similarity scores only. Rows and columns are the same eight lexemes, in the order of the rows below.

(qaq) N A5 dul 'stone' 1 1 0.75 0.389 0.471 0.518 0.478 0.6
(ura) N A5 dul 'stone' 1 1 0.75 0.389 0.471 0.518 0.478 0.6
(mal) N A5 dulka 'stone' 0.75 0.75 1 0.552 0.58 0.441 0.471 0.602
(tau) N A5 aaletpala 'stone' 0.389 0.389 0.552 1 0.475 0.454 0.46 0.513
(sul) N A5 kre 'stone' 0.471 0.471 0.58 0.475 1 0.557 0.596 0.365
(kua) N A5 vat 'stone' 0.518 0.518 0.441 0.454 0.557 1 0.961 0.369
(sia) N A5 fat 'stone' 0.478 0.478 0.471 0.46 0.596 0.961 1 0.346
(kua) N M1 dududul 'fighting stone' 0.6 0.6 0.602 0.513 0.365 0.369 0.346 1
Metrics
• A wide variety of metrics can be implemented and ‘plugged into’ the comparison strategy
• Metrics return a real value in the range [0.0, 1.0] representing the level of similarity of the items being compared
• User can control which set of metrics is used
• Multiple comparison strategies can be applied to the same data set, and their results stored and compared
• The metrics discussed here are those used to produce the sample results
• General principle: “best match” – prefer false positives to false negatives
Phonetic form similarity metric
“Edit distance with phone substitution probability matrix”
• f1, f2 := phonetic forms being compared (lists of phones, generated automatically from written forms or transcribed manually)
• Apply the edit distance algorithm to f1 and f2 with the following costs:
  – Deletion cost = 1.0 (constant)
  – Insertion cost = 1.0 (constant)
  – Substitution cost = 2 x (1 - sp), where sp is phone similarity; substitution cost falls in the range [0.0, 2.0]
• dmin := minimum edit distance for f1 and f2
• dmax := maximum possible edit distance for f1 and f2 (sum of the lengths of f1 and f2)
• Similarity = 1 - (dmin / dmax)
Finds a maximal unbounded alignment of the two forms. Can also be understood as detecting the contribution of each form to a putative combined form.
Examples:
• mbias vs. biaska: dmin = 3, dmax = 11; Similarity = 1 - (3/11) = 0.727; combined form mbiaska
• vat vs. fat: dmin = 0.236, dmax = 6; Similarity = 1 - (0.236/6) = 0.96; combined form {v,f}at
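A minimal sketch of this metric, assuming phones are compared as strings and unlisted phone pairs default to similarity 0.0. The `phone_sim` dictionary stands in for the phone similarity matrix; the v/f value 0.882 is back-calculated from the slide’s dmin = 0.236 (0.236 = 2 x (1 - 0.882)), not taken from CLUES.

```python
def phonetic_similarity(f1, f2, phone_sim):
    """Weighted edit distance between two phone lists, normalised to [0, 1]."""
    n, m = len(f1), len(f2)
    if n + m == 0:
        return 1.0
    # d[i][j] = minimum cost of aligning f1[:i] with f2[:j]
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = float(i)          # i deletions at cost 1.0 each
    for j in range(1, m + 1):
        d[0][j] = float(j)          # j insertions at cost 1.0 each
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            a, b = f1[i - 1], f2[j - 1]
            sp = 1.0 if a == b else phone_sim.get((a, b), 0.0)
            d[i][j] = min(d[i - 1][j] + 1.0,                   # deletion
                          d[i][j - 1] + 1.0,                   # insertion
                          d[i - 1][j - 1] + 2.0 * (1.0 - sp))  # substitution
    dmin, dmax = d[n][m], float(n + m)
    return 1.0 - dmin / dmax

# mbias vs. biaska: delete m, insert k and a -> dmin = 3, dmax = 11
sim1 = phonetic_similarity(list("mbias"), list("biaska"), {})      # 0.727...
# vat vs. fat with sp(v, f) = 0.882 -> dmin = 0.236, dmax = 6
sim2 = phonetic_similarity(list("vat"), list("fat"),
                           {("v", "f"): 0.882, ("f", "v"): 0.882})  # 0.96...
```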
Phone similarity metric
• Phone similarity sp for a pair of phones is a real number in the range [0, 1], drawn from a phone similarity matrix
• The matrix is calculated automatically on the basis of a weighted sum of similarities between the phonetic features of the two phones
• Examples of phonetic features include nasality (universal), frontness (vowels), place of articulation (consonants)
• Each phonetic feature has a set of possible values and a similarity matrix for these values; the similarity matrix is user-editable
• Feature similarity matrices should reflect the probability of various paths of diachronic change
• Possible to under-specify feature values for phones
• Similarity of a phone with itself will always be 1.0
• ‘Default’ similarities can be overridden for particular phones (universal) and/or phonemes (language pair-specific)
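The weighted-sum construction might be sketched as follows. The feature set, weights and per-feature similarity tables here are illustrative assumptions, not CLUES’s actual values.

```python
# Sketch: phone similarity as a weighted sum of per-feature similarities.
# Feature inventory, weights and similarity tables are illustrative only.

FEATURE_WEIGHTS = {"nasality": 0.2, "place": 0.5, "voicing": 0.3}

FEATURE_SIM = {
    "nasality": {("oral", "oral"): 1.0, ("nasal", "nasal"): 1.0,
                 ("oral", "nasal"): 0.3, ("nasal", "oral"): 0.3},
    "place": {("labial", "labial"): 1.0, ("labiodental", "labiodental"): 1.0,
              ("labial", "labiodental"): 0.9, ("labiodental", "labial"): 0.9},
    "voicing": {("voiced", "voiced"): 1.0, ("voiceless", "voiceless"): 1.0,
                ("voiced", "voiceless"): 0.7, ("voiceless", "voiced"): 0.7},
}

def phone_similarity(p1, p2):
    """Weighted sum of feature-value similarities for two phone feature dicts."""
    total = sum(FEATURE_WEIGHTS.values())
    score = sum(w * FEATURE_SIM[feat][(p1[feat], p2[feat])]
                for feat, w in FEATURE_WEIGHTS.items())
    return score / total

v = {"nasality": "oral", "place": "labiodental", "voicing": "voiced"}
f = {"nasality": "oral", "place": "labiodental", "voicing": "voiceless"}
phone_similarity(v, v)  # identical phones -> 1.0
phone_similarity(v, f)  # differs only in voicing
```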
Semantic domain similarity metric
“Depth of deepest subsumer as a proportion of maximum local depth of the semantic domain tree”
• n1, n2 := the semantic domains being compared (nodes in the semantic domain tree)
• S := ‘subsumer’: the deepest node in the semantic domain tree that subsumes both n1 and n2
• ds := depth of S in the tree (path length from the root node to S)
• dm := maximum local depth of the tree (length of the longest path from the root node to a descendant of n1 or n2)
• Similarity = ds / dm
See also Li et al. (2003).
Examples (for a tree with root A, children B and C, nodes D and E under B, and F under D):
• F vs. F = 1.0
• D vs. E = 0.333
• B vs. C = 0.0
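A sketch of this metric over a small tree matching the example values (root A with children B and C; D and E under B; F under D). The exact shape of the slide’s example tree, and the reading of “maximum local depth” as the longest root path to a descendant of n1 or n2, are assumptions consistent with the stated scores.

```python
# Illustrative semantic domain tree (an assumption matching the examples).
CHILDREN = {"A": ["B", "C"], "B": ["D", "E"], "D": ["F"]}
PARENT = {c: p for p, cs in CHILDREN.items() for c in cs}

def ancestors(n):
    """Path from n up to the root, including n itself."""
    out = [n]
    while n in PARENT:
        n = PARENT[n]
        out.append(n)
    return out

def depth(n):
    return len(ancestors(n)) - 1

def max_subtree_depth(n):
    """Depth of the deepest descendant of n (including n itself)."""
    return max([depth(n)] + [max_subtree_depth(c) for c in CHILDREN.get(n, [])])

def domain_similarity(n1, n2):
    a1 = set(ancestors(n1))
    subsumer = next(a for a in ancestors(n2) if a in a1)  # deepest common node
    ds = depth(subsumer)
    dm = max(max_subtree_depth(n1), max_subtree_depth(n2))
    return ds / dm if dm else 1.0

domain_similarity("F", "F")  # -> 1.0
domain_similarity("D", "E")  # subsumer B at depth 1, dm = 3 -> 1/3
domain_similarity("B", "C")  # subsumer is the root -> 0.0
```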
Gloss similarity metric
Crude sentence comparison metric: “proportion of tokens in common”
• g1, g2 := the glosses being compared
• r1, r2 := reduced glosses (after removal of stop words, e.g. a, the, of)
• len1, len2 := lengths of r1, r2 (number of tokens)
• L := max(len1, len2)
• If L = 0, Similarity = 1.0; else C := count of common tokens (tokens that appear in both r1 and r2) and Similarity = C / L
This metric needs refinement.
Examples:
• ‘house’ vs. ‘house’ = 1.0
• ‘house’ vs. ‘a house’ = 1.0
• ‘house’ vs. ‘raised sleeping house’ = 0.333
• ‘house’ vs. ‘hut’ = 0.0
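The gloss metric is simple enough to sketch directly; the stop-word list here is an illustrative assumption.

```python
# Sketch of the "proportion of tokens in common" gloss metric.
STOP_WORDS = {"a", "an", "the", "of", "to"}  # illustrative stop-word list

def gloss_similarity(g1, g2):
    """C / L over stop-word-reduced glosses; 1.0 if both reduce to nothing."""
    r1 = [t for t in g1.lower().split() if t not in STOP_WORDS]
    r2 = [t for t in g2.lower().split() if t not in STOP_WORDS]
    L = max(len(r1), len(r2))
    if L == 0:
        return 1.0
    common = len(set(r1) & set(r2))  # count of tokens appearing in both
    return common / L

gloss_similarity("house", "a house")               # -> 1.0
gloss_similarity("house", "raised sleeping house") # 1/3, i.e. 0.333...
gloss_similarity("house", "hut")                   # -> 0.0
```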
Conclusion

Possible extensions; unresolved questions
• Extensions: find borrowings; detect duplicate lexicographic entries; orthographic conversion; ...
• Analytical questions: How should tone be represented and incorporated within phonetic comparison? Should the phonetic feature system be multi-valued or binary? At what level should segmentation occur (comparison at phone, phone sequence or phoneme level)? The edit distance metric may be improved by privileging uninterrupted identical sequences.
• Semantic matching: more sophisticated approaches could use taxonomies (e.g. WordNet, with some way to map lexemes onto concepts) or compositional semantics (primitives).
• Performance: since comparison is parameterised, it may be possible to use genetic algorithms to optimise performance. A quantitative way to evaluate the performance of the system is needed.
• Relation to theory: How much theory is embedded in the instrument? What effect does this have on results?
• Inter-operability between databases is a key issue in the ultimate usability of the tool.
Acknowledgements
Thanks to Christina Eira, Claire Bowern, Beth Evans, Sander Adelaar, Friedel Frowein, Sheena Van Der Mark and Nicolas Tournadre for their comments and suggestions on this project.
References
Bakker, Dik, André Müller, Viveka Velupillai, Søren Wichmann, Cecil H. Brown, Pamela Brown, Dmitry Egorov, Robert Mailhammer, Anthony Grant, and Eric W. Holman. 2009. Adding typology to lexicostatistics: a combined approach to language classification. Linguistic Typology 13: 167-179.
Holman, Eric W., Søren Wichmann, Cecil H. Brown, Viveka Velupillai, André Müller, and Dik Bakker. 2008. Explorations in automated language classification. Folia Linguistica 42.2: 331-354.
Li, Yuhua, Zuhair Bandar, and David McLean. 2003. An approach for measuring semantic similarity using multiple information sources. IEEE Transactions on Knowledge and Data Engineering 15.4: 871-882.
Lowe, John Brandon, and Martine Mazaudon. 1994. The reconstruction engine: a computer implementation of the comparative method. Computational Linguistics (Special Issue on Computational Phonology) 20.3: 381-417.
Nakhleh, Luay, Don Ringe, and Tandy Warnow. 2005. Perfect phylogenetic networks: A new methodology for reconstructing the evolutionary history of natural languages. Language 81: 382-420.