

How Geography Influences Language.

John Nerbonne
j.nerbonne@rug

Center for Language and Cognition, University of Groningen

Linguistisches Forschungskolloquium, Universität Zürich

14 Apr. 2011



Overview

Old problems in dialectology
  Massive variation
  Counterindicating signals

Aggregating signals (dialectometry)
  Levenshtein distance

Examining the influence of geography
  Understanding Seguy’s curve (distance)
  Incorporating areas

Proceeding more geographically (ideas)



One old problem in dialectology

Pronunciations are very variable: 87 different pronunciations of ich in the PAD

[List of the 87 attested phonetic transcriptions of ich; the phonetic symbols are not recoverable from the extracted text.]



A second old problem in dialectology

We receive noisy signals of provenance.

Map panels:
  front/low V in Haus
  [p] (dark) vs. [pf]
  [t] vs. [ts]
  [k] vs. [x(ç)]

“non-overlapping isoglosses”



Isoglosses seldom overlap

Map panels:
  aggregate 2nd shift
  [ʃ] (dark) vs. s (non-initially)
  [z] (dark) vs. [s] (initially)
  N d/t (dark) vs. deletion



Isoglosses seldom overlap, more

Map panels:
  apical [r] (dark) vs. uvular [ʀ]
  final [n] drop (dark) vs. retention
  medial [t] vs. s
  init. lenited /g/



Why dialectometry?

Strengthen geographic signals by aggregating
Solve problems of earlier dialectology
  Non-overlapping distributions
  Selection of features too arbitrary
  “Atomism” (Coseriu), idiosyncratic words (Bloomfield)
Introduce replicable procedures
  Following Seguy, Goebl, Schiltz, Kretzschmar, Shackleton, ...
Seeking law-like relations in linguistic variation
  Sublinear distributions of linguistic variation vs. geography



Calculating dialect distances

To determine the aggregate distance between dialects:
We determine the distance between each dialect pair for every single linguistic element (in the sample, e.g. a dialect atlas)
  Perhaps just same (0) vs. different (1)
  ... but we’ve developed more sensitive measures (below)
We sum these distances for every element (hundreds of them)
Immediate result: place × place table of dialect differences
  Seguy (1971), Goebl (1980s and on), many others
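
The aggregation step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy `atlas` (three invented sites, three items) and the helper names `item_distance` and `aggregate_distances` are hypothetical, and the item measure is the simplest same (0) vs. different (1) comparison.

```python
from itertools import combinations

# Toy atlas: site -> {item: recorded form}. Sites and forms are invented.
atlas = {
    "A": {"milk": "mElk", "house": "hus", "I": "Ik"},
    "B": {"milk": "mEl@k", "house": "hus", "I": "Ik"},
    "C": {"milk": "mO@lk@", "house": "hYs", "I": "Ix"},
}

def item_distance(form_a, form_b):
    """Categorical measure: 0 if the two forms are identical, else 1."""
    return 0 if form_a == form_b else 1

def aggregate_distances(atlas):
    """Sum the item distances over all shared items for every pair of sites."""
    table = {a: {b: 0 for b in atlas} for a in atlas}
    for a, b in combinations(atlas, 2):
        shared = atlas[a].keys() & atlas[b].keys()
        total = sum(item_distance(atlas[a][i], atlas[b][i]) for i in shared)
        table[a][b] = table[b][a] = total   # the table is symmetric
    return table

table = aggregate_distances(atlas)
for site, row in table.items():
    print(site, row)
```

Swapping `item_distance` for a more sensitive measure leaves the aggregation itself unchanged.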



Aside: more sensitive pronunciation distance measure

Levenshtein distance enables analysis of phonetic transcriptions without manual alignment
  A move from categorical to numerical analysis of data
One of the most successful methods to determine sequence distance (Levenshtein, 1964)
  Also used for biological molecules, software engineering, ...
Levenshtein distance: minimum number of insertions, deletions and substitutions to transform one string into the other
Syllabicity constraint added: vowels never substitute for consonants



Example of the Levenshtein distance

mO@lk@   delete @     1
mOlk@    subst. O/E   1
mElk@    delete @     1
mElk     insert @     1
mEl@k
         total        4

Alignment:
m  O  @  l     k  @
m  E     l  @  k
   1  1     1     1
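
The example can be checked with a short dynamic-programming sketch. It assumes unit costs for every operation and a hand-picked vowel set (`VOWELS`) for this ASCII transcription; both are illustrative choices rather than the weights used in the actual study. The syllabicity constraint is modelled by giving vowel/consonant substitutions infinite cost.

```python
VOWELS = set("aeiouyAEIOUY@")  # assumed vowel symbols for this toy notation

def is_vowel(segment):
    return segment in VOWELS

def levenshtein(source, target):
    """Minimum number of insertions, deletions and substitutions,
    with substitutions between a vowel and a consonant disallowed."""
    inf = float("inf")
    rows, cols = len(source) + 1, len(target) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = i                      # delete everything in source
    for j in range(1, cols):
        dist[0][j] = j                      # insert everything in target
    for i in range(1, rows):
        for j in range(1, cols):
            s, t = source[i - 1], target[j - 1]
            if s == t:
                sub = 0
            elif is_vowel(s) == is_vowel(t):
                sub = 1                     # vowel/vowel or consonant/consonant
            else:
                sub = inf                   # syllabicity constraint
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + sub)  # substitution or match
    return dist[-1][-1]

print(levenshtein("mO@lk@", "mEl@k"))  # 4, as in the alignment above
```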



Example

Based on Dutch pronunciation data from the Goeman-Taeldeman-Van Reenen-Project data (GTRP; Goeman and Taeldeman, 1996)

We use 562 words for 424 varieties in the Netherlands
  Wieling, Heeringa & Nerbonne (2007) An Aggregate Analysis of Pronunciation in the Goeman-Taeldeman-van Reenen-Project Data. In: Taal en Tongval 59(1), 84-116

Calculating Levenshtein distances yields interesting sound correspondences contained in the alignments (more on that later)
  Note that a 100-word comparison already yields about 500 sound correspondences



Distribution of sites



Analytical steps

Obtain the distances between each of the ≈ 90,000 pairs of varieties
  n.b. this involves 500 × 5² segment comparisons per pair, ≈ 1.1 × 10⁹ segment comparisons in total
Organize these in a 400 × 400 table
Seek groups (dialect areas) or continuum-like relations, e.g. by applying clustering or multi-dimensional scaling, respectively

Note that no attention has been paid to geography thus far!
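
The counts quoted in these steps can be reproduced with a little arithmetic. The sketch assumes that the rounded figures mean about 500 words per pair of varieties and about 5 segments per word, so that one word alignment compares roughly 5 × 5 segments; that reading is an assumption.

```python
from math import comb

varieties = 424                  # sites in the sample
pairs = comb(varieties, 2)       # unordered pairs of varieties
words = 500                      # rounded number of words compared per pair
segments_per_word = 5            # assumed average word length in segments

per_pair = words * segments_per_word ** 2   # segment comparisons per pair
total = pairs * per_pair

print(pairs)             # 89676, i.e. about 90,000
print(per_pair)          # 12500
print(f"{total:.2e}")    # 1.12e+09
```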



Multi-Dimensional Scaling

[Figure: MDS plot with labelled groups: Frisian; Frisian cities, Het Bildt; Westerkwartier; Stellingwerf; Low Saxon; Central Gelderland; Dutch Low Franconian; Flemish Low Franconian.]



MDS dimensions → colors, projected to map
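
One simple way to turn MDS dimensions into map colors is to rescale the first three dimensions to the 0 to 255 range and read them as red, green and blue. The sketch below assumes exactly that scheme; the function name `mds_to_rgb` and the coordinates are invented for illustration, and the color mapping actually used for the maps may differ.

```python
def mds_to_rgb(coords):
    """Rescale each of three MDS dimensions to 0..255 and read them as R, G, B."""
    dims = list(zip(*coords))                            # one tuple per dimension
    lows = [min(d) for d in dims]
    spans = [(max(d) - min(d)) or 1.0 for d in dims]     # guard against zero range
    return [
        tuple(round(255 * (value - low) / span)
              for value, low, span in zip(point, lows, spans))
        for point in coords
    ]

# Invented three-dimensional MDS coordinates for three sites.
coords = [(-0.8, 0.1, 0.3), (0.2, -0.4, 0.3), (0.7, 0.6, -0.5)]
colors = mds_to_rgb(coords)
for rgb in colors:
    print("#%02x%02x%02x" % rgb)
```

Sites that are close in MDS space receive similar colors, so gradual transitions on the map indicate a dialect continuum and sharp color changes indicate borders.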



Interpolated, interpreted maps



Noisy Clustering

[Figure: dendrogram of German sites (from Bonn and Köln to Leuth and Wemb) obtained by noisy clustering; the numbers on the branches range from 52 to 100.]

Seeks groups in data, enabling comparison to older dialectology, which sought areas
Only bootstrap (or noisy) clustering, to avoid instability
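
The idea behind noisy clustering can be sketched as follows: cluster repeatedly, each time after adding a small random amount to every distance, and report how often each group recurs. The sketch assumes average-linkage clustering and uniform noise, and all names and distances are invented; the noise model and clustering algorithm used in practice may differ.

```python
import random
from collections import Counter

def cluster(dist, sites):
    """Average-linkage clustering; returns every group formed by a merge."""
    groups = [frozenset([s]) for s in sites]
    found = []

    def avg(g, h):
        return sum(dist[a][b] for a in g for b in h) / (len(g) * len(h))

    while len(groups) > 1:
        candidates = [(avg(g, h), i, j)
                      for i, g in enumerate(groups)
                      for j, h in enumerate(groups) if i < j]
        _, i, j = min(candidates)                 # closest pair of groups
        merged = groups[i] | groups[j]
        groups = [g for k, g in enumerate(groups) if k not in (i, j)] + [merged]
        found.append(merged)
    return found

def noisy_clustering(dist, sites, runs=100, noise=0.2, seed=0):
    """Percentage of runs in which each group appears after adding noise."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(runs):
        noisy = {a: {a: 0.0} for a in sites}
        for i, a in enumerate(sites):
            for b in sites[i + 1:]:
                noisy[a][b] = noisy[b][a] = dist[a][b] + rng.uniform(0, noise)
        counts.update(cluster(noisy, sites))
    return {group: 100 * n // runs for group, n in counts.items()}

# Invented distances: A/B and C/D are tight pairs, E sits between them.
sites = ["A", "B", "C", "D", "E"]
base = {("A", "B"): 0.1, ("C", "D"): 0.1, ("A", "C"): 0.9, ("A", "D"): 0.9,
        ("B", "C"): 0.9, ("B", "D"): 0.9, ("A", "E"): 0.5, ("B", "E"): 0.5,
        ("C", "E"): 0.55, ("D", "E"): 0.55}
dist = {a: {b: 0.0 for b in sites} for a in sites}
for (a, b), d in base.items():
    dist[a][b] = dist[b][a] = d

support = noisy_clustering(dist, sites)
for group, pct in sorted(support.items(), key=lambda kv: -kv[1]):
    print(sorted(group), pct)
```

Stable groups such as A/B recur in every run, while the attachment of E varies from run to run and therefore receives lower support.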



Projecting groups to geography

[Figure: map of the Netherlands and Flanders with the cluster groups projected onto the sites; labelled towns include Den Burg, Leeuwarden, Groningen, Haarlem, Amersfoort, Brugge, Gent, Mechelen, Venlo and Kerkrade.]



Large body of dialectometric work—positive aspects

Dutch, German, American English, Norwegian, Swedish, Afrikaans, Sardinian, Tuscan, Catalan, Sino-Tibetan, Chinese, Bulgarian, Bantu, Central Asian (Turkic & Indo-Iranian), ...
Development of a consistency measure (Cronbach’s α) indicating whether a data set is sufficiently large
Novel reflection, work on validation aimed at assessing the degree of detection of SIGNALS OF PROVENANCE
  Gooskens & Heeringa (2004) Perceptive Evaluation of Levenshtein Dialect Distance Measurements using Norwegian Dialect Data. Language Variation and Change 16(3), 189-207.



Criticisms of dialectometry, esp. Levenshtein-based work

Measure is too insensitive, 0/1 segment differences
Too little attention to phonetic/phonological conditioning
Too reliant on transcription: what about acoustics?
Where is the sociolinguistics? Isn’t variationist linguistics mostly about sociolinguistics?
“Distance-based” methods yield too little insight into the linguistic basis of differences (concrete differences lost in the aggregate sums)
  The hint is that it may be all smoke & mirrors
So what? Isn’t this all just confirming what we knew earlier?

... progress on all fronts, but presentation would take too long; question and discussion period for those interested



The Influence of Geography

Regression design
  Dependent variable: varietal distance, as measured by aggregate categorical distance or Levenshtein distance
  Independent variable: geographical distance, regarded as an operationalization of the chance of social contact
Statistical cautions:
  1. Correlations involving averages are inflated, but we’re interested in the entire varieties (dialects)
  2. Distances are not independent, so significance may be inflated: use Mantel tests
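
A Mantel test assesses the correlation between two distance matrices by permuting the site labels of one matrix, so that each site's row and column move together. The sketch below is a minimal version with invented data (six sites on a line, linguistic distance growing with the logarithm of geographic distance); the function names and the number of permutations are illustrative assumptions.

```python
import random
from math import log
from statistics import mean

def pearson(xs, ys):
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def upper(matrix, order):
    """Upper-triangle entries of a square matrix, with sites taken in `order`."""
    n = len(order)
    return [matrix[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]

def mantel(ling, geo, permutations=999, seed=0):
    """Observed correlation and one-sided permutation p-value."""
    rng = random.Random(seed)
    identity = list(range(len(ling)))
    geo_vector = upper(geo, identity)
    observed = pearson(upper(ling, identity), geo_vector)
    hits = 0
    for _ in range(permutations):
        order = identity[:]
        rng.shuffle(order)                       # relabel the sites
        if pearson(upper(ling, order), geo_vector) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)

# Invented data: six sites on a line, sublinear linguistic distances.
positions = [0, 1, 2, 4, 7, 11]
geo = [[abs(a - b) for b in positions] for a in positions]
ling = [[log(1 + d) for d in row] for row in geo]

observed, p_value = mantel(ling, geo)
print(f"r = {observed:.2f}, p = {p_value:.3f}")
```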



Inspiration: Jean Seguy

Seguy (1971) La relation entre la distance spatiale et la distance lexicale. Revue de Linguistique Romane 35(138), 335-357:
Aggregate variation increases sublinearly with respect to geography

[Figure: Seguy’s plot of lexical distance against spatial distance, labelled COURSE MOYENNE, with the fitted curve $Y = 36\sqrt{\log(x + 1)}$.]



Sublinear spread is general

[Figure: six scatter plots of linguistic distance against geographic distance, one panel each for Bantu, Bulgaria, Germany, LAMSAS / Lowman, The Netherlands and Norway.]



Aside: Trudgill’s “Gravity hypothesis”

[Illustration: Sun, Venus, Earth with Moon, Mars with Phobos and Deimos.]

According to Trudgill (1972) diffusion follows an inverse square law, with the consequence that linguistic distance should likewise increase with the square of the distance. Population size plays the role of mass.
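
The inverse-square idea can be made concrete with a deliberately simplified sketch: influence is proportional to the product of the two populations divided by the squared distance. The function name and the numbers are invented, and further terms that a full gravity model may include are left out.

```python
def gravity_influence(pop_a, pop_b, distance_km):
    """Simplified inverse-square 'attraction' between two settlements."""
    return pop_a * pop_b / distance_km ** 2

near = gravity_influence(100_000, 50_000, 10)
far = gravity_influence(100_000, 50_000, 20)
print(near / far)   # 4.0: doubling the distance divides the influence by four
```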



Trudgill’s “Gravity hypothesis”

Sublinear aggregate relation incompatible with a quadratic influence

J. Nerbonne (to appear) Measuring the Diffusion of Linguistic Change. Phil. Transactions of the Royal Society B



How much does distance influence language?

Area           Corr.(l,geo)   r2
Gabon Bantu    0.47           0.22
Bulgaria       0.49           0.24
Germany        0.57           0.32
Eastern U.S.   0.51           0.26
Netherlands    0.62           0.38
Norway         0.41           0.16

Norwegian ling. dist. correlates better w. travel time in 1900 (r = 0.54)
Gooskens (2005) Dialectologia et Geolinguistica 13.
  Very primitive geography!



Geography influence on language

Geography accounts for 22-38% of aggregate linguistic variation.
General (sublinear) characterization of the relation between geographical distance and linguistic differences
Like population geneticists’ “isolation by distance” (Wright, 1943; Malecot, 1955)
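
The share of variation accounted for is the squared correlation, and a sublinear relation shows up as a better fit against the logarithm of distance than against distance itself. The sketch below illustrates both points with invented numbers; it does not use data from any of the studies cited here.

```python
from math import log
from statistics import mean

def pearson(xs, ys):
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Invented (geographic km, linguistic distance) pairs that flatten with distance.
geo = [10, 25, 50, 100, 200, 300, 400, 500]
ling = [0.021, 0.034, 0.040, 0.049, 0.053, 0.060, 0.061, 0.064]

r_linear = pearson(geo, ling)
r_log = pearson([log(x) for x in geo], ling)

print(f"linear: r = {r_linear:.2f}, r2 = {r_linear ** 2:.2f}")
print(f"log:    r = {r_log:.2f}, r2 = {r_log ** 2:.2f}")
```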



Wrede’s (1926-56) German Dialect Areas

[Figure: map of Wrede’s German dialect areas with the sampled sites labelled, from Aachen to Zellingen.]



Influence of Dialect Areas?

We add to the regression design variables indicating whether two varieties belong to the same or different dialect areas.
Results: Dialect areas contribute substantially to the explanation of aggregate linguistic distance. r2 increases from about 32% (based only on geographic distance) to about 47% (based on geographic distance and areal differences).

John Nerbonne (submitted, 2010) How much does Geography Influence Language Variation? Auer et al. (eds.) Proc. of the Freiburg (FRIAS) language and space workshops. Mouton de Gruyter: Berlin.
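
Adding an areal predictor to the regression can be sketched with ordinary least squares. Everything in the example is an assumption made for illustration: the data are invented, the areal variable is a single 0/1 indicator (same vs. different area), geographic distance is entered as a logarithm, and the helper names `solve` and `r_squared` are hypothetical. It does not reproduce the figures reported above.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def r_squared(predictors, y):
    """R^2 of an ordinary least-squares fit with an intercept."""
    rows = [[1.0] + list(p) for p in predictors]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)                       # normal equations
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

# Invented pairs of varieties: (log geographic distance, 1 if in different areas).
pairs = [(2.3, 0), (3.2, 0), (3.9, 1), (4.6, 0),
         (5.3, 1), (5.7, 1), (6.0, 0), (6.2, 1)]
ling = [0.020, 0.031, 0.052, 0.045, 0.066, 0.071, 0.058, 0.078]

r2_geo = r_squared([(g,) for g, _ in pairs], ling)
r2_both = r_squared(pairs, ling)
print(f"distance only: {r2_geo:.2f}; distance + area: {r2_both:.2f}")
```

Because the pairwise distances are not independent, significance for such a model would still need a permutation-based test rather than the usual regression statistics.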



Generalized Additive Models

Simon N. Wood (2006) Generalized Additive Models: An Introduction with R
Allows regression using a combination of predictors, e.g. longitude and latitude
A more sophisticated notion of geography (than simple distance)
But I do not understand all the math!



LingDist ∼ S(long.lat)

[Figure: contour plot of the fitted smooth over longitude (3.5 to 7.0) and latitude (51.0 to 53.5), with contour levels from −0.2 to 0.15.]



How could Geographers help?

Alternative perspectives on geographic explanation

  Do you have other means of exploring geographic explanations?

Tools
  Is there good, easy-to-use simulation software, e.g. to examine questions of diffusion?

Access to potentially explanatory demographic data
  Age
  Educational level
  “Locality”: percentage born and raised within k kilometers
  Population density



How does geography influence language variation?

Pure distance models explain 22% to 38% of aggregate linguistic variation.
Areal distinctions are somewhat collinear, but nonetheless add substantially to simple models, perhaps as much as 50% (moving 30% to 45%, for example).
Naturally, there is also subdialectal variation (social, sexual, individual), but few systematic data collections.
Emerging questions:
  What is the linguistic structure of the dialect differences we find?
  Do typological constraints play a (confounding) role?
  Can we tease apart geographical and historical explanations, and how?

Try Gabmap! www.gabmap.nl



Questions?

Thank You!

