The CLUES database: automated search for cognate forms
Australian Linguistics Society Conference, Canberra
4 December 2011
Mark Planigale (Mark Planigale Research & Consultancy)
Tonya Stebbins (RCLT, La Trobe University)
Introduction
• Overview of the design of the CLUES database, being developed as a tool to aid the search for correlates across multiple datasets
• Linguistic model underlying the database
• Explore key issues in developing the methodology
• Show examples of output from the database
Because the design of CLUES is relatively generic, it is potentially applicable to a wide range of languages, and to tasks other than correlate detection.
Context
What is CLUES?
• “Correlate Linking and User-defined Evaluation System”
• Database designed to simultaneously handle lexical data from multiple languages; uses add-on modules for comparative functions
• Primary purpose: identify correlates across two or more languages
• Correlate: a pair of lexemes which are similar in phonetic form and/or meaning
• The linguist assesses which of the identified correlates are cognates, and which are similar for some other reason (borrowing, universal tendencies, accidental similarity)
• Allows the user to adjust the criteria used to evaluate the degree of correlation between lexemes
• Can store, filter and organise the results of comparisons
Computational methods in historical linguistics
• Lexicostatistics
• Typological comparison
• Phylogenetics
• Phoneme inventory comparison
• Modelling effects of sound change rules
• Correlate search > CLUES
A few examples
• Lowe & Mazaudon 1994 – ‘Reconstruction Engine’ (models the operation of proposed sound change rules as a means of checking hypotheses)
• Nakhleh et al. 2005 – Indo-European, phylogenetic
• Holman et al. 2008 – Automated Similarity Judgment Program: 4350 languages; 40 lexical items (edit distance); 85 most stable grammatical (typological) features from the WALS database
• Austronesian Basic Vocabulary Database: 874 mostly Austronesian languages, each represented by around 210 words (http://language.psy.auckland.ac.nz/austronesian/; the project had a phylogenetic focus and did some manual comparative work in preparing the data)
• Greenhill & Gray 2009 – Austronesian, phylogenetic
• Dunn, Burenhult et al. 2011 – Aslian
• Proto-Tai'o'Matic (“merges, searches, and extends several wordlists and proposed reconstructions of proto-Tai and Southwestern Tai”; http://crcl.th.net/index.html?main=http%3A//crcl.th.net/crcl/assoc.htm)
Broad vs. deep approaches to automated lexical comparison

‘Broad and shallow’
• Language sample: relatively large
• Vocabulary sample: constrained, based on a standardised wordlist (e.g. Swadesh 200, 100 or 40)
• Purpose: establish (hypothesised) genetic relationships
• Method: lexicostatistics; phylogenetics
• Typical metrics: phonetic (e.g. edit distance); typological (shared grammatical features); maximum likelihood

‘Narrow and deep’
• Language sample: relatively small
• Vocabulary sample: all available lexical data for selected languages
• Purpose: linguistic and/or cultural reconstruction; modelling language contact and semantic shift
• Method: comparative method with fuzzy matching
• Typical metrics: phonetic (e.g. edit distance); semantic; grammatical
CLUES comparisons can be constrained to core vocabulary (using the wordlist feature); however, it is intended for use within a ‘narrow and deep’ approach.
Design of CLUES
CLUES: Desiderata
Accuracy
• Results agree with human expert judgment
• Minimisation of false positives and negatives
Validity
• Computed similarity level does measure degree of correlation
• Computed similarity level varies directly with cognacy
Reliability
• Like results for like comparison pairs
• Like results for a single comparison pair on repetition
Generalisability
• System performs accurately on new (‘unseen’) data as well as on the data that the similarity metrics were ‘trained’ on
Efficiency
• Comparisons are performed fast enough to be useful
Lexical model (partial)
[Entity-relationship diagram: Lexeme (with part of speech and temporal information) linked to Written form and to Sense (Gloss, Semantic domain); related entities include Wordlist item, Source, Orthography, Language and Phone, with one-to-many and many-to-many cardinalities.]
Three dimensions of lexical similarity
Dimension of comparison: data fields currently available
• Phonetic / phonological (phonetic form of lexeme): written form (mapped to phonetic content)
• Semantic (meaning of lexeme): semantic domain; gloss
• Grammatical (grammatical features of lexeme): word class
In the context of correlate detection, grammatical features may be of interest as a ‘dis-similarising’ feature for lexemes that are highly correlated on form and meaning.
What affects the results?
(Factors ordered from more to less controllable.)
Selection and evaluation of metrics
• Choice of appropriate formal (quantifiable) criteria for similarity
• Impact: validity of results; generalisability of system
Inconsistent representations
• Systematic differences in the representations used for different data sets within the corpus
• Impact: validity of results
Noise
• Random fluctuations within the data that obscure the true value of individual data items, but do not change the underlying nature of the distribution
• Impact: reliability of data; reliability of results
CLUES: Managing representational issues
• Automated generation of phonetic form(s) from written form(s)
• Where required, manual standardisation to common lexicographic conventions
• Manual assignment to a common ontology (semantic domain set)
• Automated mapping onto a shared common set of grammatical features, values and terms
Calculating similarity
Similarity scores are computed at three levels:
• Base scores: written form similarity; semantic domain similarity; gloss similarity; word class similarity
• Subtotals: base scores are combined (with weights w1–w4) into a form subtotal, a meaning subtotal and a grammar subtotal
• Total: the subtotals are combined (with weights w5–w7) into an overall score
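Assuming each subtotal and the overall score are weighted means of their inputs (weights need not sum to 1), the scheme can be sketched as follows, using the Ura/Mali ‘sun’ figures from example 4a; function names are illustrative, not from the CLUES implementation.

```python
# Sketch of the two-level weighted scoring described above.

def weighted_mean(pairs):
    """Weighted mean of (score, weight) pairs."""
    total = sum(w for _, w in pairs)
    return sum(s * w for s, w in pairs) / total if total else 0.0

# Base scores for Ura ɣunǝga vs. Mali kunēngga 'sun' (example 4a):
form = weighted_mean([(0.896, 1.0)])               # written form, weight w1
meaning = weighted_mean([(1.0, 0.5), (1.0, 0.5)])  # gloss w2, semantic domain w3
grammar = weighted_mean([(1.0, 1.0)])              # word class, weight w4

# Subtotals combined with weights w5 = 0.45, w6 = 0.45, w7 = 0.1:
overall = weighted_mean([(form, 0.45), (meaning, 0.45), (grammar, 0.1)])
# overall ≈ 0.953, matching the table
```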
Ura ɣunǝga vs. Mali kunēngga ‘sun’

4a.                 Lexeme 1           Lexeme 2            Base score  Weight  Subtotal  Weight  Overall
Written form(s)     ɣunǝga [ɣunǝga]    kunēngga [ɣunǝŋga]  0.896       1.0       0.896   0.45
Gloss(es)           sun                sun                 1.0         0.5    }  1.0     0.45  } 0.953
Semantic domain(s)  A3                 A3                  1.0         0.5    }
Wordclass           N                  N                   1.0         1.0       1.0     0.1
Sulka kolkha ‘sun’ vs. Mali dulka ‘stone’

4b.                 Lexeme 1           Lexeme 2        Base score  Weight  Subtotal  Weight  Overall
Written form(s)     kolkha [kolkha]    dulka [dulka]   0.828       1.0       0.828   0.45
Gloss(es)           sun                stone           0.0         0.5    }  0.167   0.45  } 0.548
Semantic domain(s)  A3                 A5              0.333       0.5    }
Wordclass           N                  N               1.0         1.0       1.0     0.1
4c.                 Lexeme 1           Lexeme 2        Base score  Weight  Subtotal  Weight  Overall
Written form(s)     kolkha [kolkha]    dulka [dulka]   0.828       1.0       0.828   0.7
Gloss(es)           sun                stone           0.0         0.5    }  0.0     0.2   } 0.68
Semantic domain(s)  A3                 A5              0.333       0.0    }
Wordclass           N                  N               1.0         1.0       1.0     0.1
Sample results: across domains

5a. Small set of lexical data from 7 languages; overall similarity scores; the matrix is symmetrical. Rows and columns are the same nine lexemes, in the order of the rows below.

(tau) N J1 kabarak 'blood' 1 0.309 0.2905 0.657 0.2995 0.5435 0.278 0.2435 0.541
(ura) N A3 ɣunǝga 'sun' 0.309 1 0.34725 0.2665 0.948 0.312 0.3515 0.2445 0.325
(sul) N A5 kre 'stone' 0.2905 0.34725 1 0.2615 0.33875 0.3395 0.2825 0.294 0.2745
(sul) N J1 ka ptaik 'skin' 0.657 0.2665 0.2615 1 0.2895 0.5275 0.2835 0.226 0.587
(mal) N A3 kunēngga 'sun' 0.2995 0.948 0.33875 0.2895 1 0.289 0.3025 0.22 0.3495
(ura) N J1 slǝp 'bone' 0.5435 0.312 0.3395 0.5275 0.289 1 0.326 0.3815 0.8905
(qaq) N T1 ltigi 'fire' 0.278 0.3515 0.2825 0.2835 0.3025 0.326 1 0.6945 0.371
(mal) V T1 lēt 'light a fire' 0.2435 0.2445 0.294 0.226 0.22 0.3815 0.6945 1 0.307
(mal) N J1 slēpki 'bone' 0.541 0.325 0.2745 0.587 0.3495 0.8905 0.371 0.307 1
Sample results: within a domain

5b. Overall similarity scores. Rows and columns are the same eight lexemes, in the order of the rows below.

(qaq) N A5 dul 'stone' 1 1 0.875 0.6945 0.7355 0.759 0.739 0.425
(ura) N A5 dul 'stone' 1 1 0.875 0.6945 0.7355 0.759 0.739 0.425
(mal) N A5 dulka 'stone' 0.875 0.875 1 0.776 0.79 0.7205 0.7355 0.426
(tau) N A5 aaletpala 'stone' 0.6945 0.6945 0.776 1 0.7375 0.727 0.73 0.3815
(sul) N A5 kre 'stone' 0.7355 0.7355 0.79 0.7375 1 0.7785 0.798 0.3075
(kua) N A5 vat 'stone' 0.759 0.759 0.7205 0.727 0.7785 1 0.9805 0.3095
(sia) N A5 fat 'stone' 0.739 0.739 0.7355 0.73 0.798 0.9805 1 0.298
(kua) N M1 dududul 'fighting stone' 0.425 0.425 0.426 0.3815 0.3075 0.3095 0.298 1
Sample results: within a domain

5c. Form similarity scores only. Rows and columns are the same eight lexemes, in the order of the rows below.

(qaq) N A5 dul 'stone' 1 1 0.75 0.389 0.471 0.518 0.478 0.6
(ura) N A5 dul 'stone' 1 1 0.75 0.389 0.471 0.518 0.478 0.6
(mal) N A5 dulka 'stone' 0.75 0.75 1 0.552 0.58 0.441 0.471 0.602
(tau) N A5 aaletpala 'stone' 0.389 0.389 0.552 1 0.475 0.454 0.46 0.513
(sul) N A5 kre 'stone' 0.471 0.471 0.58 0.475 1 0.557 0.596 0.365
(kua) N A5 vat 'stone' 0.518 0.518 0.441 0.454 0.557 1 0.961 0.369
(sia) N A5 fat 'stone' 0.478 0.478 0.471 0.46 0.596 0.961 1 0.346
(kua) N M1 dududul 'fighting stone' 0.6 0.6 0.602 0.513 0.365 0.369 0.346 1
Metrics
• A wide variety of metrics can be implemented and ‘plugged into’ the comparison strategy
• Metrics return a real value in the range [0.0, 1.0] representing the level of similarity of the items being compared
• User can control which set of metrics is used
• Multiple comparison strategies can be applied to the same data set, and their results stored and compared
• The metrics discussed here are those used to produce the sample results
• General principle: “best match” – prefer false positives to false negatives
Phonetic form similarity metric
“Edit distance with phone substitution probability matrix”
• f1, f2 := phonetic forms being compared (lists of phones, generated automatically from written forms or transcribed manually)
• Apply the edit distance algorithm to f1 and f2 with the following costs:
  – Deletion cost = 1.0 (constant)
  – Insertion cost = 1.0 (constant)
  – Substitution cost = 2 x (1 - sp), where sp is phone similarity; substitution cost falls in the range [0.0, 2.0]
• dmin := minimum edit distance for f1 and f2
• dmax := maximum possible edit distance for f1 and f2 (sum of the lengths of f1 and f2)
• Similarity = 1 - (dmin / dmax)
Finds a maximal unbounded alignment of the two forms. Can also be understood as detecting the contribution of each form to a putative combined form.
Examples:
• mbias vs. biaska: dmin = 3, dmax = 11; Similarity = 1 - (3/11) = 0.727; combined form mbiaska
• vat vs. fat: dmin = 0.236, dmax = 6; Similarity = 1 - (0.236/6) = 0.96; combined form {v,f}at
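A minimal sketch of this metric, assuming phones are compared as strings and unlisted phone pairs default to similarity 0.0. The `phone_sim` dictionary stands in for the phone similarity matrix; the v/f value 0.882 is back-calculated from the slide’s dmin = 0.236 (0.236 = 2 x (1 - 0.882)), not taken from CLUES.

```python
def phonetic_similarity(f1, f2, phone_sim):
    """Weighted edit distance between two phone lists, normalised to [0, 1]."""
    n, m = len(f1), len(f2)
    if n + m == 0:
        return 1.0
    # d[i][j] = minimum cost of aligning f1[:i] with f2[:j]
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = float(i)          # i deletions at cost 1.0 each
    for j in range(1, m + 1):
        d[0][j] = float(j)          # j insertions at cost 1.0 each
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            a, b = f1[i - 1], f2[j - 1]
            sp = 1.0 if a == b else phone_sim.get((a, b), 0.0)
            d[i][j] = min(d[i - 1][j] + 1.0,                   # deletion
                          d[i][j - 1] + 1.0,                   # insertion
                          d[i - 1][j - 1] + 2.0 * (1.0 - sp))  # substitution
    dmin, dmax = d[n][m], float(n + m)
    return 1.0 - dmin / dmax

# mbias vs. biaska: delete m, insert k and a -> dmin = 3, dmax = 11
sim1 = phonetic_similarity(list("mbias"), list("biaska"), {})      # 0.727...
# vat vs. fat with sp(v, f) = 0.882 -> dmin = 0.236, dmax = 6
sim2 = phonetic_similarity(list("vat"), list("fat"),
                           {("v", "f"): 0.882, ("f", "v"): 0.882})  # 0.96...
```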
Phone similarity metric
• Phone similarity sp for a pair of phones is a real number in the range [0, 1], drawn from a phone similarity matrix
• The matrix is calculated automatically on the basis of a weighted sum of similarities between the phonetic features of the two phones
• Examples of phonetic features include nasality (universal), frontness (vowels), place of articulation (consonants)
• Each phonetic feature has a set of possible values and a similarity matrix for these values; the similarity matrix is user-editable
• Feature similarity matrices should reflect the probability of various paths of diachronic change
• Possible to under-specify feature values for phones
• Similarity of a phone with itself will always be 1.0
• ‘Default’ similarities can be overridden for particular phones (universal) and/or phonemes (language pair-specific)
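The weighted-sum construction might be sketched as follows. The feature set, weights and per-feature similarity tables here are illustrative assumptions, not CLUES’s actual values.

```python
# Sketch: phone similarity as a weighted sum of per-feature similarities.
# Feature inventory, weights and similarity tables are illustrative only.

FEATURE_WEIGHTS = {"nasality": 0.2, "place": 0.5, "voicing": 0.3}

FEATURE_SIM = {
    "nasality": {("oral", "oral"): 1.0, ("nasal", "nasal"): 1.0,
                 ("oral", "nasal"): 0.3, ("nasal", "oral"): 0.3},
    "place": {("labial", "labial"): 1.0, ("labiodental", "labiodental"): 1.0,
              ("labial", "labiodental"): 0.9, ("labiodental", "labial"): 0.9},
    "voicing": {("voiced", "voiced"): 1.0, ("voiceless", "voiceless"): 1.0,
                ("voiced", "voiceless"): 0.7, ("voiceless", "voiced"): 0.7},
}

def phone_similarity(p1, p2):
    """Weighted sum of feature-value similarities for two phone feature dicts."""
    total = sum(FEATURE_WEIGHTS.values())
    score = sum(w * FEATURE_SIM[feat][(p1[feat], p2[feat])]
                for feat, w in FEATURE_WEIGHTS.items())
    return score / total

v = {"nasality": "oral", "place": "labiodental", "voicing": "voiced"}
f = {"nasality": "oral", "place": "labiodental", "voicing": "voiceless"}
phone_similarity(v, v)  # identical phones -> 1.0
phone_similarity(v, f)  # differs only in voicing
```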
Semantic domain similarity metric
“Depth of deepest subsumer as a proportion of maximum local depth of the semantic domain tree”
• n1, n2 := the semantic domains being compared (nodes in the semantic domain tree)
• S := ‘subsumer’: the deepest node in the semantic domain tree that subsumes both n1 and n2
• ds := depth of S in the tree (path length from the root node to S)
• dm := maximum local depth of the tree (length of the longest path from the root node to a descendant of n1 or n2)
• Similarity = ds / dm
See also Li et al. (2003).
Examples (for a tree with root A, children B and C, nodes D and E under B, and F under D):
• F vs. F = 1.0
• D vs. E = 0.333
• B vs. C = 0.0
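A sketch of this metric over a small tree matching the example values (root A with children B and C; D and E under B; F under D). The exact shape of the slide’s example tree, and the reading of “maximum local depth” as the longest root path to a descendant of n1 or n2, are assumptions consistent with the stated scores.

```python
# Illustrative semantic domain tree (an assumption matching the examples).
CHILDREN = {"A": ["B", "C"], "B": ["D", "E"], "D": ["F"]}
PARENT = {c: p for p, cs in CHILDREN.items() for c in cs}

def ancestors(n):
    """Path from n up to the root, including n itself."""
    out = [n]
    while n in PARENT:
        n = PARENT[n]
        out.append(n)
    return out

def depth(n):
    return len(ancestors(n)) - 1

def max_subtree_depth(n):
    """Depth of the deepest descendant of n (including n itself)."""
    return max([depth(n)] + [max_subtree_depth(c) for c in CHILDREN.get(n, [])])

def domain_similarity(n1, n2):
    a1 = set(ancestors(n1))
    subsumer = next(a for a in ancestors(n2) if a in a1)  # deepest common node
    ds = depth(subsumer)
    dm = max(max_subtree_depth(n1), max_subtree_depth(n2))
    return ds / dm if dm else 1.0

domain_similarity("F", "F")  # -> 1.0
domain_similarity("D", "E")  # subsumer B at depth 1, dm = 3 -> 1/3
domain_similarity("B", "C")  # subsumer is the root -> 0.0
```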
Gloss similarity metric
Crude sentence comparison metric: “proportion of tokens in common”
• g1, g2 := the glosses being compared
• r1, r2 := reduced glosses (after removal of stop words, e.g. a, the, of)
• len1, len2 := lengths of r1, r2 (number of tokens)
• L := max(len1, len2)
• If L = 0, Similarity = 1.0; else C := count of common tokens (tokens that appear in both r1 and r2) and Similarity = C / L
This metric needs refinement.
Examples:
• ‘house’ vs. ‘house’ = 1.0
• ‘house’ vs. ‘a house’ = 1.0
• ‘house’ vs. ‘raised sleeping house’ = 0.333
• ‘house’ vs. ‘hut’ = 0.0
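The gloss metric is simple enough to sketch directly; the stop-word list here is an illustrative assumption.

```python
# Sketch of the "proportion of tokens in common" gloss metric.
STOP_WORDS = {"a", "an", "the", "of", "to"}  # illustrative stop-word list

def gloss_similarity(g1, g2):
    """C / L over stop-word-reduced glosses; 1.0 if both reduce to nothing."""
    r1 = [t for t in g1.lower().split() if t not in STOP_WORDS]
    r2 = [t for t in g2.lower().split() if t not in STOP_WORDS]
    L = max(len(r1), len(r2))
    if L == 0:
        return 1.0
    common = len(set(r1) & set(r2))  # count of tokens appearing in both
    return common / L

gloss_similarity("house", "a house")               # -> 1.0
gloss_similarity("house", "raised sleeping house") # 1/3, i.e. 0.333...
gloss_similarity("house", "hut")                   # -> 0.0
```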
Conclusion

Possible extensions; unresolved questions
• Extensions: find borrowings; detect duplicate lexicographic entries; orthographic conversion; ...
• Analytical questions: How should tone be represented and incorporated within phonetic comparison? Should the phonetic feature system be multi-valued or binary? At what level should segmentation occur (comparison at phone, phone sequence or phoneme level)? The edit distance metric may be improved by privileging uninterrupted identical sequences.
• Semantic matching: more sophisticated approaches could use taxonomies (e.g. WordNet, with some way to map lexemes onto concepts) or compositional semantics (primitives).
• Performance: since comparison is parameterised, it may be possible to use genetic algorithms to optimise performance. A quantitative way to evaluate the performance of the system is needed.
• Relation to theory: How much theory is embedded in the instrument? What effect does this have on results?
• Inter-operability between databases is a key issue in the ultimate usability of the tool.
Acknowledgements
Thanks to Christina Eira, Claire Bowern, Beth Evans, Sander Adelaar, Friedel Frowein, Sheena Van Der Mark and Nicolas Tournadre for their comments and suggestions on this project.
References
Bakker, Dik, André Müller, Viveka Velupillai, Søren Wichmann, Cecil H. Brown, Pamela Brown, Dmitry Egorov, Robert Mailhammer, Anthony Grant, and Eric W. Holman. 2009. Adding typology to lexicostatistics: a combined approach to language classification. Linguistic Typology 13: 167-179.
Holman, Eric W., Søren Wichmann, Cecil H. Brown, Viveka Velupillai, André Müller, and Dik Bakker. 2008. Explorations in automated language classification. Folia Linguistica 42.2: 331-354.
Li, Yuhua, Zuhair Bandar, and David McLean. 2003. An approach for measuring semantic similarity using multiple information sources. IEEE Transactions on Knowledge and Data Engineering 15.4: 871-882.
Lowe, John Brandon, and Martine Mazaudon. 1994. The reconstruction engine: a computer implementation of the comparative method. Computational Linguistics (Special Issue on Computational Phonology) 20.3: 381-417.
Nakhleh, Luay, Don Ringe, and Tandy Warnow. 2005. Perfect phylogenetic networks: A new methodology for reconstructing the evolutionary history of natural languages. Language 81: 382-420.