The idea Corpus Data Mining Experiment Modelling

Rules and Exceptions in Language Dynamics:

a Quantitative Investigation

Martina Pugliese

Sapienza Università di Roma, Dipartimento di Fisica

Joint work with Prof. V. Loreto, C. Cuskley, C. Castellano, F. Colaiori, F. Tria

Final Seminar for the PhD program

Roma, 29th October 2014

Human Languages: Rules versus Exceptions

The past tense as the object of investigation

Human languages are structured into syntactic rules, which present exceptions in the form of irregularities.

The past tense formation is a typical example: one rule (the regular -ed form) and many irregular forms.

play → played

swim → swam

sneak → snuck? sneaked?

"I saw him run after a gilded butterfly: and when he caught it, he let it go again; and after it again; and over and over he comes, and again; catched it again (...)"

— W. Shakespeare, Coriolanus

Outline of the work

Tackling the problem from different points of view

Succinct summary of the literature in the field:

Regularization is the expected phenomenon

Irregularization is typically considered irrelevant

Frequency plays the leading role: low-frequency verbs are more prone to regularize

The sociolinguistic environment of speakers influences the regularity of verbs

We will explore the past tense problem with three parallel approaches:

1. Data Mining on a Linguistic Corpus

2. Experiment on novel verbal forms

3. Agent-based modelling of competing inflections

(Ir)Regularity of verbs: a corpus perspective

The Corpus of Historical American English (CoHA)

CoHA: 400 · 10^6 written words in the period 1810-2009

The data we used:

verbs in 1830-1989: 16 decades, diachronic perspective

dataset confined to the size of the first decade (≈ 2.1 · 10^6 tokens)

threshold on frequency to define regularity

Core verbs and extended vocabulary analyses

I: irregularity proportion (irregular past tokens / total past tokens)

f: frequency of lemma

Root: set of lemmas sharing a verb root (e.g. go, forego, undergo, ...)

Core: verbs existing in every decade

Extended: entire vocabulary
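
As a minimal sketch of the definition of I above, the snippet below computes the irregularity proportion for each (decade, lemma) pair from token counts. The record layout, the function name `irregularity_proportion` and the toy counts are assumptions made for illustration; they are not taken from CoHA or from the original analysis.

```python
from collections import defaultdict


def irregularity_proportion(records):
    """Return {(decade, lemma): I} from (decade, lemma, n_irregular, n_regular) records.

    Hypothetical input layout: each record carries counts of irregular and
    regular past tense tokens observed for one lemma in one decade.
    """
    counts = defaultdict(lambda: [0, 0])  # [irregular tokens, total tokens]
    for decade, lemma, n_irregular, n_regular in records:
        counts[(decade, lemma)][0] += n_irregular
        counts[(decade, lemma)][1] += n_irregular + n_regular
    # I = irregular past tokens / total past tokens; pairs with no past
    # tokens are skipped because I is undefined for them.
    return {key: irr / tot for key, (irr, tot) in counts.items() if tot > 0}


# Made-up counts, for illustration only.
toy = [(1830, "sneak", 1, 3), (1830, "swim", 4, 0), (1980, "sneak", 2, 2)]
result = irregularity_proportion(toy)
```

Keeping the total in the same accumulator makes it straightforward to apply a minimum-count threshold before trusting I, in the spirit of the frequency threshold mentioned on the slide.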

Language as an open system

The number of types changes in time

Types are counted as undefined, all, mostly I and mostly R, with the threshold on I at 0.5

[Figure: number of types (verbs and roots) per decade, 1830-1980, in four panels: A, all types; B, mostly R types; C, mostly IR types; D, undefined types.]

Overall increase in vocabulary size

Mostly R and undefined types increase in number

Mostly I types are constant in number

The situation of the core roots in the last decade

An evolving picture of two opposite behaviours

Roots are stable IR (IR in every decade), stable R (R in every decade) or active (0 < I < 1 in at least one decade, threshold at 1%)

[Figure: core roots in the (f, I) plane, with f on a logarithmic axis from 10^-6 to 1 and I from 0 to 1.]

Color coding: average I across time

Arrows: trajectory from first to last defined occurrence

Purple curves: binned values

Verbs are in a dynamic state: some regularize, some irregularize, some are stable
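
The three-way split of core roots can be sketched as follows. This is only one possible reading of the slide: it assumes that the 1% threshold means a decade counts as mixed when 0.01 < I < 0.99, and the function name `classify_root` and the extra "unclassified" label are hypothetical.

```python
def classify_root(i_values, eps=0.01):
    """Classify a root from its per-decade irregularity proportions.

    Assumption: eps is the 1% threshold, so a decade is mixed when
    eps < I < 1 - eps.
    """
    if any(eps < i < 1 - eps for i in i_values):
        return "active"        # 0 < I < 1 in at least one decade
    if all(i >= 1 - eps for i in i_values):
        return "stable IR"     # irregular in every decade
    if all(i <= eps for i in i_values):
        return "stable R"      # regular in every decade
    # Roots that jump between R and IR without a mixed decade are not
    # covered by the slide's definitions; they are flagged separately here.
    return "unclassified"
```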

How do core roots navigate the plane (f, I)?

The active roots pattern in a cloud

$d = \sqrt{(\delta f)^2 + (\delta I)^2}$, with $\delta f = \Delta f / \Delta t$ and $\delta I = \Delta I / \Delta t$

[Figure: d versus f on logarithmic axes (d from 10^-8 to 10^-1, f from 10^-6 to 1), with a color scale for I from 0 to 1.]

f: frequency averaged over time

Bigger points: active roots, color-coded with average I across time

Smaller dots: the stable roots
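
The displacement rate d defined above can be sketched in a few lines. The slide does not state how the differences are taken; this example assumes they run from the first to the last defined occurrence of a root and that Δt is measured in decades. The function name `displacement_rate` and the numbers in the usage line are illustrative.

```python
import math


def displacement_rate(f_first, f_last, i_first, i_last, dt):
    """Return d = sqrt((Δf/Δt)^2 + (ΔI/Δt)^2) for one root.

    Assumption: the differences are taken between the first and last
    defined occurrences, and dt is the elapsed time between them.
    """
    delta_f = (f_last - f_first) / dt
    delta_i = (i_last - i_first) / dt
    return math.hypot(delta_f, delta_i)


# Illustrative values only: I drops from 0.9 to 0.1 over 15 decades.
d_example = displacement_rate(1e-4, 2e-4, 0.9, 0.1, 15)
```

Since f spans several orders of magnitude while I lies in [0, 1], d is dominated by the change in I for all but the most frequent roots.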

Active roots and the variation in I

Confirmation of two opposite and balanced forces

∆I > 0: irregularization; ∆I < 0: regularization

f: frequency averaged over time

Decreasing trend with increasing f

The numbers in the two subplots are comparable

[Figure: two panels on logarithmic axes, ∆I versus f for ∆I > 0 and −∆I versus f for ∆I < 0.]

Phonological classification of roots

Classes as phonological attractors

Regularization: broad application of the dominant rule; the individual fails in retrieving the irregular form (morphological level)

What drives irregularization?

What is the source of activity in dynamic verbs?

What keeps the number of irregular types constant?

Phonological classification of roots:

Roots with I > 0 are classified according to the phonological change from infinitive to past tense (e.g., sing-sang and ring-rang)

Size of class is proportional to the number of members

Frequency of class is the sum of the frequencies of its members

Classes may work as attractors, conditioning dynamics
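
The two class-level quantities above (number of members and summed frequency) can be sketched as a simple aggregation. The input layout, the function name `summarize_classes`, the class labels and the frequencies are all made up for illustration; how roots are actually assigned to phonological classes is not shown here.

```python
from collections import defaultdict


def summarize_classes(roots):
    """Aggregate (root, class_label, frequency) records per class.

    Returns {class_label: {"size": number of members, "f_sum": summed frequency}}.
    """
    summary = defaultdict(lambda: {"size": 0, "f_sum": 0.0})
    for _root, label, freq in roots:
        summary[label]["size"] += 1
        summary[label]["f_sum"] += freq
    return dict(summary)


# Toy frequencies, not corpus values; the class is labelled by one member.
toy_roots = [("sing", "sing", 0.25), ("ring", "sing", 0.5), ("hide", "hide", 0.125)]
classes = summarize_classes(toy_roots)
```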

Dynamics of classes: a clarified picture

The evolution of four pivotal classes

Bubbles are classes, grey points at the bottom are regular roots

[Figure: classes in the (fsum, I) plane, with fsum on a logarithmic axis and I from 0 to 1; four smaller panels show the burn, dwell, hide and sing classes.]

Points are less scattered, window is narrower

Is irregularity backed by a cognitive process?

How do individuals choose the past tense ending in the first place?

If I do not know or recall a verb, which ending do I choose?

Does this change depending on whether I am a native speaker?

How does the experiment work

Providing the past tense of non-existing verbs (non-verbs)

Non-verbs are built using the phonological distance from existing verbs and are categorized as Regular, Irregular and Duplicate

[Flowchart: START; information part, asking the user about languages ("Is English your first language? Do you speak other languages?"); END.]

The irregular rates divided by stimulus category

Non-native and native outcomes are different

[Figure: irregular rate (0 to 0.7) by non-verb category (Regular, Duplicate, Irregular), for natives and non-natives.]

How many, and which, irregular responses do we get?

Different choices for natives and non-natives

The irregular responses can be divided into 7 linguistic categories

[Figure: number of users versus number of irregular responses (0 to 16), for natives, non-natives and all users.]

[Figure: frequency of each response category (Reg., VC Lev., Other, VC+d, Fin+t/d, VC+t, Ruck.) for CoHA, all users, natives and non-natives.]

Native speakers weigh the regular category more than non-natives

A three-states model of competing inflections

The rules of the model

Before          After
s       h       s    h
I       I       I    I
R       R       R    R
I       R       I    M
R       I       R    M
I       M       I    I
R       M       R    R
M(I)    I       I    I
M(R)    I       M    M
M(I)    R       M    M
M(R)    R       R    R
M(I)    M       I    I
M(R)    M       R    R

s: speaker; h: hearer; M(I), M(R): mixed speaker uttering the I or R form

Three possible inflections: R (regular), I (irregular), M (mixed, both endings possible)

At each time step, s and h interact over a lemma whose frequency is f

When h does not have the uttered inflection, he appends it

When h has the uttered inflection, both s and h delete the other one

At each time step, a randomly selected agent is replaced with probability r with one who only has R inflections
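
The rules above can be sketched as a small simulation for a single lemma. Several choices here are assumptions that the slide does not state: f is treated as the probability that an interaction over the lemma happens at a given step, a mixed speaker utters either form with equal probability, and the default parameter values and the names `interact` and `simulate` are purely illustrative.

```python
import random


def interact(speaker, hearer, rng):
    """Apply one speaker-hearer interaction; states are 'R', 'I' or 'M'."""
    # Assumption: a mixed speaker utters I or R with equal probability.
    uttered = speaker if speaker != "M" else rng.choice("IR")
    if hearer == uttered or hearer == "M":
        # The hearer has the uttered form: both delete the other one.
        return uttered, uttered
    # The hearer lacks the uttered form: he appends it; the speaker is unchanged.
    return speaker, "M"


def simulate(n_agents=200, f=0.5, r=0.005, rho_i0=0.8, steps=100_000, seed=0):
    """Return the final fractions of agents in states I, R and M."""
    rng = random.Random(seed)
    n_irregular = round(rho_i0 * n_agents)
    agents = ["I"] * n_irregular + ["R"] * (n_agents - n_irregular)
    for _ in range(steps):
        if rng.random() < f:  # assumed: lemma is used with probability f
            s, h = rng.sample(range(n_agents), 2)
            agents[s], agents[h] = interact(agents[s], agents[h], rng)
        if rng.random() < r:  # replacement by an agent with only R
            agents[rng.randrange(n_agents)] = "R"
    return {state: agents.count(state) / n_agents for state in "IRM"}
```

Calling `simulate` over a range of r and f values is one way to explore numerically how the final fraction of I agents depends on the ratio of r to f.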

The model exhibits a transition

Analytical and numerical solutions

$n = r/f$

$\rho_X$ represents the fraction of individuals in the X inflection

The model is analytically solvable

[Figure: two panels showing $\rho_I$ and $\rho_R$ as functions of n (0 to 0.08), with curves labelled $\rho_I^{(1)}$, $\rho_I^{(2)}$, $\rho_I^{(3)}$ and initial conditions $\rho_I(0) = 0.8$, $0.5$, $0.3$.]

High-frequency verbs tend to stay I, low-frequency verbs tend to become R

Conclusions and Perspectives

What the work shows in a nutshell

Frequency alone does not necessarily predict the fate of verbs

Vocabulary lemmas change in a complex way as a result of several factors

Activity in irregularity proportion is mostly located in an intermediate frequency window

Phonological classification clarifies the existence of clusters of verbs behaving in a similar way

Native and non-native speakers tend to show opposing regularity preferences

Modelling sheds light on the stationary states of lemmas and uncovers a transition

Thanks for bearing with me!
