
Aggregate vs. Feature-Based Perspectives on Dialect Geography

John Nerbonne

Center for Language and Cognition, University of Groningen

Language in Space: Geographic Perspectives on Language Diversity and Diachrony

NSF Workshop, LSA Linguistics Institute, University of Colorado, Boulder, 23-24 July 2011


Groningen dialectology team!

Charlotte Gooskens, Peter Houtzagers, Hermann Niebaum, Wilbert Heeringa, Jelena Prokic, Therese Leinonen, Martijn Wieling, Marco Spruit, Peter Kleiweg, Christine Siedle, Jens Moberg, ...
Sebastian Kurschner, Alexandra Lenz, Bob Shackleton, Renee van Bezooijen, ...
Bob de Jonge, Agnes de Bie, Cornelius Hasselblatt, ...
Simonetta Montemagni, Franz Manni, Petja Osenova, Esteve Valls, Lucija Simicic, Kristel Uiboaed, Boudewijn van den Berg



Overview

Old problems in dialectology
  Massive variation
  Counterindicating signals
Aggregating signals (dialectometry)
  Levenshtein distance
Dialectological law enabled by aggregate view
  Seguy’s curve
Features, “ranking isoglosses” (Chambers & Trudgill, p. 97)



One old problem in dialectology

Pronunciations are very variable: 87 different pronunciations of ich in the PAD

[List of the 87 attested transcriptions of ich; the phonetic symbols and diacritics did not survive extraction.]



A second old problem in dialectology

We receive noisy signals of provenance.

Maps:
  front/low V in Haus
  [p] (dark) vs. [pf]
  [t] vs. [ts]
  [k] vs. [x(c)]

“non-overlapping isoglosses”



Isoglosses seldom overlap

Maps, first row:
  aggregate 2nd shift
  [S] (dark) vs. s (non-initially)
  [z] (dark) vs. [s] (initially)
  N d/t (dark)
Maps, second row:
  apical [r] (dark) vs. uvular [ö]
  final [n] drop (dark) vs. retention
  medial [t] vs. s
  init. lenited /g/



Why dialectometry?

Strengthen geographic signals by aggregating
Solve problems of earlier dialectology
  Non-overlapping distributions
  Selection of features too arbitrary
  “Atomism” (Coseriu), idiosyncratic words (Bloomfield)
Introduce replicable procedures
  Following Seguy, Goebl, Schiltz, Kretzschmar, Shackleton, ...
  Seeking law-like relations in linguistic variation



Calculating dialect distances

To determine the aggregate distance between dialects:
  We determine the distance between each dialect pair for every single linguistic element (in the sample, e.g. a dialect atlas)
    Perhaps just same (0) vs. different (1)
    ... but we’ve developed more sensitive measures (below)
  We sum these distances for every element (hundreds of them)
  Immediate result: place × place table of dialect differences
Seguy (1971), Goebl (1980s and on), many others
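
A minimal sketch of the categorical (0/1) variant of this procedure, assuming a toy atlas stored as a dictionary. The site names, the responses and the function name `aggregate_distances` are all hypothetical illustrations, not material from the talk or from any atlas.

```python
from itertools import combinations

def aggregate_distances(atlas):
    """atlas maps each site to its list of responses, one per atlas item.
    Returns a place x place table as a dict keyed by (site, site)."""
    table = {(site, site): 0 for site in atlas}
    for a, b in combinations(sorted(atlas), 2):
        # Per-item distance: 0 if both sites give the same response, else 1.
        per_item = [0 if x == y else 1 for x, y in zip(atlas[a], atlas[b])]
        # The aggregate distance is the sum over all items.
        table[(a, b)] = table[(b, a)] = sum(per_item)
    return table

# Hypothetical toy atlas: three sites, four items.
toy_atlas = {
    "A": ["ich", "haus", "apfel", "zeit"],
    "B": ["ik", "huus", "appel", "tid"],
    "C": ["ich", "haus", "appel", "zeit"],
}
table = aggregate_distances(toy_atlas)
print(table[("A", "B")], table[("A", "C")], table[("B", "C")])
```

Replacing the 0/1 comparison with a graded string distance gives the more sensitive measure mentioned above; the summation step stays the same.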



Dialectometric “feature ranking”

Chambers & Trudgill (1998) ask for a ranking of features (and isoglosses) in order to identify dialect boundaries.
Implicit “feature ranking” in dialectometry: a feature that’s instantiated n times in dialect atlas material is weighted n times more heavily than one that appears once.
  Lexical items uniformly weighted
  Phonetic segment distances weighted in proportion to their frequency in the word list
Note that Goebl has also experimented with “inverse frequency” weighting of responses.



Aside: more sensitive pronunciation distance measure

Levenshtein distance enables analysis of phonetic transcriptions without manual alignment
  Move from categorical to numerical analysis of data.
One of the most successful methods to determine sequence distance (Levenshtein, 1964)
  biological molecules, software engineering, ...
Levenshtein distance: minimum number of insertions, deletions and substitutions to transform one string into the other
Syllabicity constraint added: vowels never substitute for consonants



Example of the Levenshtein distance

mO@lk@   delete @     1
mOlk@    subst. O/E   1
mElk@    delete @     1
mElk     insert @     1
mEl@k
         total        4

Alignment:
m  O  @  l     k  @
m  E     l  @  k
   1  1     1     1
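
The example above can be reproduced with a short dynamic-programming sketch. The vowel inventory `VOWELS` and the function names are assumptions made for illustration, and every operation costs 1. This is the plain 0/1 measure with the syllabicity constraint, not the more refined segment distances used in the Groningen software.

```python
# Assumption: a rough vowel inventory for ASCII (X-SAMPA-like) transcriptions.
VOWELS = set("aeiouyAEIOUY@")

def is_vowel(segment):
    return segment in VOWELS

def levenshtein(a, b, syllabic=True):
    """Minimum number of insertions, deletions and substitutions turning a into b.
    With syllabic=True a vowel may never be substituted for a consonant."""
    inf = float("inf")
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            if x == y:
                sub = 0
            elif syllabic and is_vowel(x) != is_vowel(y):
                sub = inf  # vowel/consonant substitution is forbidden
            else:
                sub = 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + sub))   # match or substitute
        prev = curr
    return prev[-1]

print(levenshtein("mO@lk@", "mEl@k"))  # 4, as in the alignment above
```

Forbidding a substitution is equivalent here to pricing it as a deletion plus an insertion, so the constraint can only raise the distance, never lower it.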



Example

Based on Dutch pronunciation data from the Goeman-Taeldeman-Van Reenen-Project data (GTRP; Goeman and Taeldeman, 1996)

We use 562 words for 424 varieties in the Netherlands

Wieling, Heeringa & Nerbonne (2007) An Aggregate Analysis of Pronunciation in the Goeman-Taeldeman-van Reenen-Project Data. In: Taal en Tongval 59(1), 84-116

Calculating Levenshtein distances yields interesting sound correspondences contained in the alignments (more on that later)

Note that a 100-word comparison already yields about 500 sound correspondences



Distribution of sites



Analytical steps

Obtain the distances between each of the ≈ 90,000 pairs of varieties
  n.b. each pair involves about 500 × 5² segment comparisons, so ≈ 1.1 × 10⁹ segment comparisons in total
Organize these in a 400 × 400 table
Seek groups (dialect areas) or continuum-like relations, e.g. by applying clustering or multi-dimensional scaling, respectively
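
The counts in the first step can be checked with a few lines of arithmetic. The sketch reads the slide's figure as 500 words times 5² segment comparisons per word pair; the average word length of five segments is an assumption drawn from that reading, not a number stated elsewhere in the talk.

```python
n_varieties = 424        # varieties in the GTRP sample used here
approx_words = 500       # rounded word count used on the slide (562 in the data)
approx_segments = 5      # assumed average word length in segments

# Unordered pairs of varieties: n * (n - 1) / 2.
n_pairs = n_varieties * (n_varieties - 1) // 2

# Levenshtein compares every segment of one word with every segment of the other.
per_pair = approx_words * approx_segments ** 2

total = n_pairs * per_pair
print(n_pairs, per_pair, total)  # 89676 12500 1120950000
```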



Multi-Dimensional Scaling

[Figure: MDS plot of the varieties; labelled groups: Frisian; Frisian cities, Het Bildt; Westerkwartier; Stellingwerf; Low Saxon; Central Gelderland; Dutch Low Franconian; Flemish Low Franconian]



MDS dimensions → colors, projected to map



Noisy Clustering

[Figure: dendrogram of German sites (Bonn, Köln, Aachen, Nürnberg, Jever, Wangerooge and many others); branch labels range from 52 to 100]

Seeks groups in data, enabling comparison to older dialectology, which sought areas
Only bootstrap (or noisy) clustering, to avoid instability



Projecting groups to geography

[Figure: map of the Dutch-speaking area with the groups projected onto it; labelled sites include Den Burg, Schiermonnikoog, Leeuwarden, Groningen, Haarlem, Delft, Urk, Amersfoort, Assen, Emmen, Brugge, Middelburg, Gent, Mechelen, Venlo and Kerkrade, among others]



Large body of dialectometric work—positive aspects

Dutch, German, American English, Norwegian, Swedish, Afrikaans, Sardinian, Tuscan, Catalan, Bulgarian, Croatian, Estonian, Sino-Tibetan, Chinese, Central Asian (Turkic & Indo-Iranian), ...
Development of consistency measure (Cronbach’s α) indicating whether data set is sufficiently large
Novel reflection, work on validation aimed at assessing degree of detection of SIGNALS OF PROVENANCE
  Gooskens & Heeringa (2004) Perceptive Evaluation of Levenshtein Dialect Distance Measurements using Norwegian Dialect Data. Language Variation and Change 16(3), 189-207.



Criticisms of dialectometry, esp. Levenshtein-based work

Measure is too insensitive, 0/1 segment differences
Too little attention to phonetic/phonological conditioning
Too reliant on transcription: what about acoustics?
Where is the sociolinguistics? Isn’t variationist linguistics mostly about sociolinguistics?
“Distance-based” methods yield too little insight into the linguistic basis of differences (concrete differences lost in the aggregate sums)
  The hint is that it may be all smoke & mirrors
So what? Isn’t this all just confirming what we knew earlier?

... progress on all fronts, but presentation would take too long; question and discussion period for those interested



The Influence of Geography

Regression design
Dependent variable: varietal distance, as measured by aggregate categorical distance or Levenshtein distance
Independent variable: geographical distance, regarded as an operationalization of the chance of social contact
Statistical cautions:
  1 correlations involving averages are inflated, but we’re interested in the entire varieties (dialects)
  2 distances are not independent, so significance may be inflated: Mantel tests
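
A minimal sketch of a Mantel test for the second caution, using only the standard library. The site positions and the square-root relation between geographic and linguistic distance are invented toy data, and the function names are illustrative; the talk does not specify an implementation.

```python
import random
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def mantel(ling, geo, permutations=999, seed=0):
    """ling, geo: symmetric square matrices over the same sites.
    Returns (observed r, one-sided permutation p-value)."""
    n = len(ling)
    pairs = list(combinations(range(n), 2))
    geo_vec = [geo[i][j] for i, j in pairs]
    observed = pearson([ling[i][j] for i, j in pairs], geo_vec)
    rng = random.Random(seed)
    hits = 0
    for _ in range(permutations):
        perm = list(range(n))
        rng.shuffle(perm)
        # Relabel the sites of one matrix: rows and columns move together,
        # which respects the dependence among distances sharing a site.
        r = pearson([ling[perm[i]][perm[j]] for i, j in pairs], geo_vec)
        if r >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)

# Hypothetical toy data: eight sites on a line, linguistic distance sublinear in geography.
positions = [0, 1, 2, 4, 7, 11, 16, 22]
geo = [[abs(a - b) for b in positions] for a in positions]
ling = [[sqrt(d) for d in row] for row in geo]
r_obs, p_value = mantel(ling, geo)
print(round(r_obs, 3), p_value)
```

The design point is that whole sites are permuted rather than individual distances, so the null distribution reflects the non-independence of the distance pairs.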



Inspiration: Jean Seguy

Seguy (1971) La relation entre la distance spatiale et la distance lexicale. Revue de Linguistique Romane 35(138), 335-357:
Aggregate variation increases sublinearly with respect to geography

[Figure: Seguy’s “course moyenne” curve; the legend, partly lost in extraction, reads approximately Y = 36 \sqrt{\log(x + 1)}; horizontal axis ticks run up to 110]



Sublinear spread is general

[Figure: six plots of linguistic distance against geographic distance, one per data set: Bantu, Bulgaria, Germany, LAMSAS / Lowman, The Netherlands, Norway]



Aside: Trudgill’s “Gravity hypothesis”

[Figure: Sun, Venus, Earth, Moon, Mars, Phobos, Deimos]

According to Trudgill (1972), diffusion follows an inverse square law, with the consequence that linguistic distance should likewise increase with the square of the distance. Population size plays the role of mass.



Trudgill’s “Gravity hypothesis”

Sublinear aggregate relation incompatible with a quadratic influence (on individual features)

J. Nerbonne (2010) Measuring the Diffusion of Linguistic Change. Phil. Transactions of the Royal Society B: Biological Sciences 365.



How much does distance influence language?

Area            Corr.(l,geo)   r²
Gabon Bantu     0.47           0.22
Bulgaria        0.49           0.24
Germany         0.57           0.32
Eastern U.S.    0.51           0.26
Netherlands     0.62           0.38
Norway          0.41           0.16

Norwegian ling. dist. correlates better w. travel time in 1900 (r = 0.54)
  Gooskens (2005) Dialectologia et Geolinguistica 13.

Adding areas increases explained variance 50% (forthcoming in a Freiburg volume)



Geographic influence on language

Geography (distance plus areas) accounts for 33-57% of aggregate linguistic variation.
General, sublinear, characterization of relation between geographical distance and linguistic differences
Like population geneticists’ “isolation by distance” (Wright, 1943; Malecot, 1955)



Features? (assuming aggregate analysis)

Argumentum ad auctoritatem
  Groningen software supports free search (with measures of “importance”)
Post-hoc “feature mining”
  We can look for words that correlate with significant dimensions of MDS solutions (of aggregate analyses).
Bipartite spectral graph partitioning (like two-dimensional factor analysis).
  Begin with matrix of varieties × features
  Cluster varieties and features simultaneously.
Mixed models
  Include feature choice (words) as random-effect factor in regression model.



“Importance” of feature wrt area

Representative(f,a) ≈ relative frequency of f among sites in a
Distinctive(f,a) ≈ proportion of occurrences of f in a as opposed to outside a
Importance(f,a) is average of representativeness and distinctiveness
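
A minimal sketch of these three measures, following only the approximate wording above. The set-based representation, the site labels and the function name `importance` are assumptions; the measure in the Groningen software may be normalized differently.

```python
def importance(feature_sites, area_sites):
    """feature_sites: set of sites where feature f occurs.
    area_sites: set of sites belonging to area a.
    Returns (representativeness, distinctiveness, importance)."""
    inside = feature_sites & area_sites
    # Share of the area's sites that show the feature.
    representativeness = len(inside) / len(area_sites)
    # Share of the feature's occurrences that fall inside the area.
    distinctiveness = len(inside) / len(feature_sites)
    return (representativeness, distinctiveness,
            (representativeness + distinctiveness) / 2)

# Hypothetical example: an area of four sites, a feature attested at six sites.
area = {"s1", "s2", "s3", "s4"}
feature = {"s1", "s2", "s3", "s8", "s9", "s10"}
print(importance(feature, area))  # (0.75, 0.5, 0.625)
```

A feature scores high only if it is both common inside the area and rare outside it, which is why the two components are averaged rather than used alone.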



MDS-based feature-mining



Co-clustering bi-partite spectral graph

[Figure: bipartite graph of varieties and features; numeric labels range from -0.34 to 0.34]

Details during discussion if wanted.



“Mixed models”: modeling each word

LD = 0.00 + 0.01 WF − 0.005 PS + 0.004 PA    (general model)
LD = −0.01 + 0.01 WF + 0.010 PS + 0.004 PA   (word: bier)
LD = 0.20 + 0.01 WF − 0.008 PS + 0.004 PA    (word: zijn)

Ongoing work by Martijn Wieling (submitted)
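
One way to read the three equations is as a general model plus per-word adjustments to the intercept and slopes. The sketch below computes those adjustments from the coefficients shown; treating the differences as by-word random-effect adjustments is an interpretation, and the dictionary layout and function names are hypothetical. The predictors WF, PS and PA are used only as labels, since the slide does not define them.

```python
# Coefficients copied from the three equations above.
GENERAL = {"intercept": 0.00, "WF": 0.01, "PS": -0.005, "PA": 0.004}
WORDS = {
    "bier": {"intercept": -0.01, "WF": 0.01, "PS": 0.010, "PA": 0.004},
    "zijn": {"intercept": 0.20, "WF": 0.01, "PS": -0.008, "PA": 0.004},
}

def adjustments(word):
    """Per-word deviation of each coefficient from the general model."""
    return {k: round(WORDS[word][k] - GENERAL[k], 3) for k in GENERAL}

def predict(coeffs, wf, ps, pa):
    """Linguistic distance predicted by one set of coefficients."""
    return (coeffs["intercept"] + coeffs["WF"] * wf
            + coeffs["PS"] * ps + coeffs["PA"] * pa)

for word in WORDS:
    print(word, adjustments(word))
```

On this reading, bier and zijn differ from the general model only in the intercept and the PS slope; the WF and PA slopes are shared.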


A caution: dialect continua

Old vs. young speakers in Sweden (SveDia, Therese Leinonen, 2010)

“Feature ranking” could create spurious dialect areas, even where scientific consensus sees continua.



Features in aggregate analysis

Aggregate perspective enables identification & formulation of general law: distance models explain 22%-38% of aggregate linguistic variation.
  Areal distinctions a bit collinear, but add (≈ 50%).
Features naturally ranked in dialectometric view, either as uniform, or as reflected in item sample / lexicon
Several means of identifying and ranking features
Emerging questions:
  What is the linguistic structure of the dialect differences we find?
  Do typological constraints play a (confounding) role?
  Can we tease apart geographical and historical explanations, and how?

Try Gabmap! www.gabmap.nl



Questions?

Thank You!

