How Geography Influences Language.
John Nerbonne j.nerbonne@rug
Center for Language and Cognition, University of Groningen
Linguistisches Forschungskolloquium, Universität Zürich, 14 Apr. 2011
Outline: Motivation, Aggregating Signals, Dutch Pronunciation, Geographic Projections, Geographic Musings
To determine the aggregate distance between dialects:
We determine the distance between each dialect pair for every single linguistic element (in a sample, e.g. a dialect atlas)
Perhaps just same (0) vs. different (1)... but we’ve developed more sensitive measures (below)
We sum these distances over every element (hundreds of them)
Immediate result: place × place table of dialect differences
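The aggregation step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three places, the three atlas items and their variants are invented toy data, and `item_distance` and `aggregate_distances` are hypothetical names, not part of any published tool.

```python
from itertools import combinations

# Hypothetical toy atlas: for each place, the variant recorded for each item.
atlas = {
    "A": ["huis", "melk", "twee"],
    "B": ["huus", "melk", "twee"],
    "C": ["huus", "molk", "twie"],
}

def item_distance(x, y):
    """Simplest per-item measure: 0 if the variants are the same, 1 if not."""
    return 0 if x == y else 1

def aggregate_distances(atlas, dist=item_distance):
    """Sum per-item distances for every pair of places into a symmetric table."""
    places = sorted(atlas)
    table = {p: {q: 0 for q in places} for p in places}
    for p, q in combinations(places, 2):
        total = sum(dist(x, y) for x, y in zip(atlas[p], atlas[q]))
        table[p][q] = table[q][p] = total
    return table

table = aggregate_distances(atlas)
for place in sorted(table):
    print(place, [table[place][other] for other in sorted(table)])
```

Passing a different `dist` function (for instance a string edit distance) swaps in a more sensitive per-item measure without changing the aggregation.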
Aside: more sensitive pronunciation distance measure
Levenshtein distance enables analysis of phonetic transcriptions without manual alignment
This moves us from categorical to numerical analysis of data.
One of the most successful methods to determine sequence distance (Levenshtein, 1964)
Also applied to biological molecules, software engineering, ...
Levenshtein distance: minimum number of insertions, deletions and substitutions to transform one string into the other
Syllabicity constraint added: vowels never substitute for consonants
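A minimal sketch of this measure, assuming unit costs and treating each character as one phonetic segment. The vowel inventory `VOWELS` is a small hypothetical set chosen for illustration, and forbidding vowel/consonant substitutions is implemented here by giving them infinite cost, so that the algorithm falls back on a deletion plus an insertion.

```python
# Assumption: a small illustrative vowel inventory, not a full phonetic alphabet.
VOWELS = set("aeiouy")

def is_vowel(segment):
    return segment in VOWELS

def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b,
    where a vowel may never be substituted for a consonant (or vice versa)."""
    prev = list(range(len(b) + 1))          # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        cur = [i]                           # deleting the first i segments of a
        for j, y in enumerate(b, start=1):
            if x == y:
                sub = 0
            elif is_vowel(x) == is_vowel(y):
                sub = 1                     # vowel/vowel or consonant/consonant
            else:
                sub = float("inf")          # syllabicity constraint
            cur.append(min(prev[j] + 1,     # deletion
                           cur[j - 1] + 1,  # insertion
                           prev[j - 1] + sub))
        prev = cur
    return prev[-1]

print(levenshtein("melk", "molk"))  # one vowel substitution
print(levenshtein("a", "t"))        # substitution forbidden: delete + insert
```

Without the constraint the second pair would cost 1; with it the cost is 2, which is the intended effect of keeping vowels and consonants apart in the alignments.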
Based on Dutch pronunciation data from the Goeman-Taeldeman-Van Reenen-Project (GTRP; Goeman and Taeldeman, 1996)
We use 562 words for 424 varieties in the Netherlands
Wieling, Heeringa & Nerbonne (2007) An Aggregate Analysis of Pronunciation in the Goeman-Taeldeman-van Reenen-Project Data. In: Taal en Tongval 59(1), 84-116
Calculating Levenshtein distances yields interesting sound correspondences contained in the alignments (more on that later)
Note that a 100-word comparison already yields about 500 sound correspondences
Obtain the distances between each of the ≈ 90,000 pairs of varieties
n.b. each pair involves 500 × 5² segment comparisons, ≈ 1.1 × 10⁹ segment comparisons in total
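The order of magnitude can be checked with a little arithmetic. The sketch below reads the per-pair figure as about 500 words with 5 × 5 segment comparisons per word pair; the average word length of five segments is an assumption made to reproduce the quoted total, not a figure stated in the data description.

```python
varieties = 424            # varieties in the Dutch data set
words = 500                # rounded down from the 562 words used
segments_per_word = 5      # assumption: about five segments per word

pairs = varieties * (varieties - 1) // 2       # unordered pairs of varieties
per_pair = words * segments_per_word ** 2      # segment comparisons per pair
total = pairs * per_pair

print(pairs)      # 89676, i.e. roughly 90,000
print(per_pair)   # 12500
print(total)      # 1120950000, i.e. roughly 1.1e9
```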
Organize these in a 400 × 400 table
Seek groups (dialect areas) or continuum-like relations, e.g. by applying clustering or multi-dimensional scaling, respectively
Note that no attention has been paid to geography thus far!
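As an illustration of the grouping step, here is a small average-linkage clustering over a place × place table. It is a sketch under assumptions: the four places and their distances are invented, `average_linkage` is a hypothetical helper, and average linkage is only one of several clustering schemes one might choose. Multi-dimensional scaling is not shown.

```python
def average_linkage(table, n_groups):
    """Repeatedly merge the two closest clusters until n_groups remain."""
    clusters = [[place] for place in sorted(table)]

    def between(c1, c2):
        # Average pairwise distance between the members of two clusters.
        return sum(table[p][q] for p in c1 for q in c2) / (len(c1) * len(c2))

    while len(clusters) > n_groups:
        candidates = [(between(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters))]
        _, i, j = min(candidates)
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical aggregate distances between four places.
table = {
    "A": {"A": 0, "B": 1, "C": 5, "D": 6},
    "B": {"A": 1, "B": 0, "C": 5, "D": 5},
    "C": {"A": 5, "B": 5, "C": 0, "D": 2},
    "D": {"A": 6, "B": 5, "C": 2, "D": 0},
}

groups = average_linkage(table, n_groups=2)
print(groups)
```

Only the linguistic distances enter the procedure; place coordinates are never consulted, which is the point of the remark about geography.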
Large body of dialectometric work—positive aspects
Dutch, German, American English, Norwegian, Swedish, Afrikaans, Sardinian, Tuscan, Catalan, Sino-Tibetan, Chinese, Bulgarian, Bantu, Central Asian (Turkic & Indo-Iranian), ...
Development of consistency measure (Cronbach’s α) indicating whether data set is sufficiently large
Novel reflection, work on validation aimed at assessing degree of detection of SIGNALS OF PROVENANCE
Gooskens & Heeringa (2004) Perceptive Evaluation of Levenshtein Dialect Distance Measurements using Norwegian Dialect Data. Language Variation and Change 16(3), 189-207.
Criticisms of dialectometry, esp. Levenshtein-based work
Measure is too insensitive, 0/1 segment differences
Too little attention to phonetic/phonological conditioning
Too reliant on transcription: what about acoustics?
Where is the sociolinguistics? Isn’t variationist linguistics mostly about sociolinguistics?
“Distance-based” methods yield too little insight into the linguistic basis of differences (concrete differences lost in the aggregate sums)
The hint is that it may be all smoke & mirrors
So what? Isn’t this all just confirming what we knew earlier?
... progress on all fronts, but presentation would take too long—question and discussion period for those interested
Regression design
Dependent variable: varietal distance, as measured by aggregate categorical distance or Levenshtein distance
Independent variable: geographical distance, regarded as an operationalization of the chance of social contact
Statistical cautions:
1 correlations involving averages are inflated, but we’re interested in entire varieties (dialects)
2 distances are not independent, so significance may be inflated: use Mantel tests
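The second caution can be made concrete with a permutation-based Mantel test. The sketch below is a simplified, one-sided version under assumptions: the five sites on a line and the square-root relation between linguistic and geographic distance are invented toy data, and `pearson`, `upper` and `mantel` are hypothetical helper names.

```python
import random

def pearson(xs, ys):
    """Pearson correlation of two equally long lists of numbers."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def upper(matrix, order):
    """Upper-triangle entries of a square matrix, with sites taken in `order`."""
    n = len(order)
    return [matrix[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]

def mantel(ling, geo, permutations=999, seed=0):
    """Correlate two distance matrices; the p-value comes from relabelling the
    sites of one matrix, which respects the dependence among distances."""
    sites = list(range(len(ling)))
    geo_vals = upper(geo, sites)
    observed = pearson(upper(ling, sites), geo_vals)
    rng = random.Random(seed)
    hits = 0
    for _ in range(permutations):
        perm = sites[:]
        rng.shuffle(perm)
        if pearson(upper(ling, perm), geo_vals) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)

# Hypothetical sites on a line; linguistic distance grows sublinearly with distance.
positions = [0, 1, 2, 4, 7]
geo = [[abs(a - b) for b in positions] for a in positions]
ling = [[d ** 0.5 for d in row] for row in geo]

r, p = mantel(ling, geo)
print(round(r, 3), p)
```

Whole sites are permuted rather than individual distances, so every permuted matrix is still a valid distance matrix over the same places.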
Séguy (1971) La relation entre la distance spatiale et la distance lexicale. Revue de Linguistique Romane 35(138), 335-357:
Aggregate variation increases sublinearly with respect to geography
Geography accounts for 22-38% of aggregate linguistic variation.
General (sublinear) characterization of relation between geographical distance and linguistic differences
Like population geneticists’ “isolation by distance” (Wright, 1943; Malécot, 1955)
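One way to see what a sublinear relation means for the regression is to compare a fit on raw distance with a fit on transformed distance. The sketch below assumes a logarithmic transform, which is only one possible sublinear form, and uses invented distances constructed to be exactly logarithmic; `r_squared` is a hypothetical helper.

```python
import math

def r_squared(xs, ys):
    """Share of the variance in ys explained by a least-squares line on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy ** 2 / (sxx * syy)

# Hypothetical data: geographic distances in km and linguistic distances
# constructed (not measured) to grow with the logarithm of distance.
geo_km = [5, 10, 20, 40, 80, 160, 320]
ling = [0.5 + 0.1 * math.log(d) for d in geo_km]

r2_linear = r_squared(geo_km, ling)
r2_log = r_squared([math.log(d) for d in geo_km], ling)
print(round(r2_linear, 3), round(r2_log, 3))
```

On data of this shape the raw-distance fit explains visibly less variance than the log-distance fit, which is why the choice of transform matters when reporting how much geography accounts for.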
We add to the regression design variables indicating whether two varieties belong to the same or different dialect areas.
Results: Dialect areas contribute substantially to the explanation of aggregate linguistic distance. r² increases from about 32% (based only on geographic distance) to about 47% (based on geographic distance and areal differences).
John Nerbonne (submitted, 2010) How much does Geography Influence Language Variation? Auer et al. (eds.) Proc. of the Freiburg (FRIAS) language and space workshops. Mouton de Gruyter: Berlin.
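The effect of adding an areal predictor can be illustrated with the standard formula for r² with two predictors, computed from the three pairwise correlations. This is a sketch on invented numbers, not a reanalysis: the eight variety pairs, their distances and the 0/1 indicator `different_area` are hypothetical, and raw distance is used instead of a sublinear transform to keep the example short.

```python
def pearson(xs, ys):
    """Pearson correlation of two equally long lists of numbers."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def r2_two_predictors(y, x1, x2):
    """r-squared of a least-squares regression of y on x1 and x2 (with intercept)."""
    r_y1, r_y2, r_12 = pearson(y, x1), pearson(y, x2), pearson(x1, x2)
    return (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)

# Hypothetical variety pairs: distance in km, whether the two varieties lie in
# different dialect areas (1) or the same area (0), and linguistic distance.
geo_km = [10, 20, 30, 40, 50, 60, 70, 80]
different_area = [0, 0, 1, 0, 1, 1, 0, 1]
ling = [0.20, 0.27, 0.45, 0.33, 0.52, 0.55, 0.42, 0.60]

r2_geo = pearson(ling, geo_km) ** 2
r2_both = r2_two_predictors(ling, geo_km, different_area)
print(round(r2_geo, 3), round(r2_both, 3))
```

Because the areal indicator is itself correlated with distance, the gain from adding it is smaller than its own r² taken alone would suggest; the formula accounts for that overlap through the correlation between the two predictors.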
Simon N. Wood (2006) Generalized Additive Models: An Introduction with R
Allows regression using combination of predictors, e.g. longitude and latitude
A more sophisticated notion of geography (than simple distance)
But I do not understand all the math!
Pure distance models explain 22%-38% of aggregate linguistic variation.
Areal distinctions are somewhat collinear, but nonetheless add substantially to simple models, perhaps as much as 50% in relative terms (moving from 30% to 45%, for example).
Naturally, there is also subdialectal variation (social, sexual, individual), but few systematic data collections.
Emerging questions:
What is the linguistic structure of the dialect differences we find?
Do typological constraints play a (confounding) role?
Can we tease apart geographical and historical explanations, and how?