Natural Language Processing Diachronics Dan Klein – UC Berkeley Includes joint work with Alex Bouchard-Cote, Tom Griffiths, and David Hall

Natural Language Processing · 2020. 4. 28. · Answers [ca 2000] Appendix Probi[ca 300] Synchronic (Comparative) Evidence Key idea: changes occur uniformly across the lexicon. The

Mar 16, 2021


Natural Language Processing

Diachronics

Dan Klein – UC Berkeley

Includes joint work with Alex Bouchard-Cote, Tom Griffiths, and David Hall


The Task


Lexical Reconstruction

Latin: focus

French   Spanish   Italian   Portuguese
feu      fuego     fuoco     fogo


Tree of Languages

§ We assume the phylogeny is known
  § Much work in biology, e.g. work by Warnow, Felsenstein, Steele…
  § Also in linguistics, e.g. Warnow et al., Gray and Atkinson…

http://andromeda.rutgers.edu/~jlynch/language.html


Evolution through Sound Changes

Latin: camera /kamera/
French: chambre /ʃambʁ/

Deletion: /e/, /a/
Change: /k/ → /tʃ/ → /ʃ/
Insertion: /b/

Eng. camera from Latin, “camera obscura”

Eng. chamber from Old Fr. before the initial /t/ dropped


Changes are Systematic

camera /kamera/ → camra /kamra/ (e → _)

numerus /numerus/ → numrus /numrus/ (e → _)


Changes are Contextual

camera /kamera/ → camra /kamra/

e → _ / after stress

e → _


Changes Have Structure

camra /kamra/ → cambra /kambra/

_ → b / m_r

_ → b

_ → [stop x] / [nasal x]_r
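
The rules on the last three slides can be read as context-sensitive rewrite rules applied one after another. The following is a minimal sketch under stated assumptions, not the formalism used in the talk: stress is marked with a hypothetical apostrophe before the stressed vowel, "after stress" is taken to mean the first e following the stressed syllable, and the nasal-to-stop table covers only m and n.

```python
import re

# Hypothetical notation: an apostrophe marks the stressed vowel, e.g. k'amera.
VOWELS = "aeiou"
# Assumed pairing of each nasal with the voiced stop at the same place of articulation.
HOMORGANIC_STOP = {"m": "b", "n": "d"}

def delete_post_tonic_e(word):
    """e -> _ / after stress: drop the first e that follows the stressed syllable."""
    return re.sub(rf"('[{VOWELS}][^{VOWELS}]*)e", r"\1", word, count=1)

def insert_stop(word):
    """_ -> [stop x] / [nasal x]_r: insert the matching stop between a nasal and r."""
    return re.sub(
        r"([mn])(?=r)",
        lambda m: m.group(1) + HOMORGANIC_STOP[m.group(1)],
        word,
    )

def derive(word, rules):
    """Apply the rules in order and return every intermediate form."""
    forms = [word]
    for rule in rules:
        forms.append(rule(forms[-1]))
    return forms

print(derive("k'amera", [delete_post_tonic_e, insert_stop]))
# prints ["k'amera", "k'amra", "k'ambra"]
```

Because the second rule is stated over classes rather than over the single pair m/b, the same function also handles n_r contexts without a new rule.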


Changes are Systematic

English Great Vowel Shift (Simplified!)

[Diagram: vowel chart with nodes e, i, a, ai]

“time” = teem → “time” = taim


English Great Vowel Shift


Diachronic Evidence

Appendix Probi [ca 300]: tonitru non tonotru

Yahoo! Answers [ca 2000]: tonight not tonite


Synchronic (Comparative) Evidence

Key idea: changes occur uniformly across the lexicon


The Data


The Data

§ Data sets
  § Small: Romance
    § French, Italian, Portuguese, Spanish
    § 2344 words
    § Complete cognate sets
    § Target: (Vulgar) Latin
  § Large: Austronesian
    § 637 languages
    § 140K words
    § Incomplete cognate sets
    § Target: Proto-Austronesian

[Figure: tree with leaves FR, IT, PT, ES]


Austronesian


Austronesian Examples

From the Austronesian Basic Vocabulary Database


The Model


Simple Model: Single Characters

[Figure: tree with a single character (C or G) at each node]

[cf. Felsenstein 81]
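
For the single-character case, the likelihood of the observed leaves can be computed by summing out the internal nodes bottom-up, in the style of Felsenstein's pruning recursion. The sketch below is illustrative only: the two-letter alphabet follows the slide, but the symmetric mutation probability `mu`, the uniform root prior, and the example tree are assumptions.

```python
ALPHABET = ("C", "G")

def transition(parent, child, mu=0.1):
    """Assumed symmetric mutation model: the character changes with probability mu."""
    return 1.0 - mu if parent == child else mu

def partial_likelihood(node, mu=0.1):
    """Return {state: P(observed leaves below node | node = state)}.

    A node is either a leaf (a one-character string) or a tuple of child nodes.
    """
    if isinstance(node, str):
        return {s: 1.0 if s == node else 0.0 for s in ALPHABET}
    result = {s: 1.0 for s in ALPHABET}
    for child in node:
        below = partial_likelihood(child, mu)
        for s in ALPHABET:
            # Sum over the child's possible states, weighted by the branch transition.
            result[s] *= sum(transition(s, t, mu) * below[t] for t in ALPHABET)
    return result

def tree_likelihood(tree, mu=0.1):
    """Marginal likelihood of the leaves, assuming a uniform prior at the root."""
    root = partial_likelihood(tree, mu)
    return sum(root[s] / len(ALPHABET) for s in ALPHABET)

# Hypothetical tree: the root has one leaf child and one internal child with two leaves.
print(tree_likelihood(("C", ("G", "G"))))
```

The recursion visits each node once, so the cost is linear in the size of the tree for a fixed alphabet. String-valued nodes, as in the full model, do not admit such a small state space.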


Changes are Systematic

[Figure: the same tree shown for two cognate sets]

Root /fokus/; other nodes /fweɣo/, /fogo/, /fogo/, /fwɔko/

Root /kentrum/; other nodes /sentro/, /sentro/, /sentro/, /tʃɛntro/


Parameters are Branch-Specific

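[Figure: tree with nodes LA, IB, IT, ES, PT and one parameter vector per branch]

LA: focus /fokus/
IB: /fogo/
IT: fuoco /fwɔko/
ES: fuego /fweɣo/
PT: fogo /fogo/

Branch parameters: θ_IB, θ_IT, θ_ES, θ_PT

[Bouchard-Cote, Griffiths, Klein, 07]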


Edits are Contextual, Structured

/fokus/ → /fwɔko/ (branch parameters θ_IT)

[Figure: with # and f as context, o is rewritten as w ɔ]


Inference


Learning: Objective

[Figure: tree over the forms /fokus/, /fweɣo/, /fogo/, /fogo/, /fwɔko/, with variables labeled z and w]


Learning: EM

§ M-Step
  § Find parameters which fit (expected) sound change counts
  § Easy: gradient ascent on theta
§ E-Step
  § Find (expected) change counts given parameters
  § Hard: variables are string-valued

[Figure: two copies of the tree over /fokus/, /fweɣo/, /fogo/, /fogo/, /fwɔko/]
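
To make the M-step concrete, here is a minimal sketch of gradient ascent on theta for a single context, assuming a plain softmax over three hypothetical edit outcomes. The outcome names, the expected counts, the learning rate, and the absence of features or regularization are all illustrative assumptions; the parameterization used in the talk is not given on the slide.

```python
import math

def softmax(theta):
    """Turn unnormalized scores into a probability distribution."""
    m = max(theta.values())
    exps = {k: math.exp(v - m) for k, v in theta.items()}
    z = sum(exps.values())
    return {k: v / z for k, v in exps.items()}

def m_step(expected_counts, steps=500, lr=1.0):
    """Gradient ascent on sum_c count[c] * log p_theta(c), with p_theta = softmax(theta)."""
    total = sum(expected_counts.values())
    theta = {k: 0.0 for k in expected_counts}
    for _ in range(steps):
        p = softmax(theta)
        for k in theta:
            # Gradient of the expected log-likelihood, scaled by 1 / total:
            # (empirical share of outcome k) - (model probability of outcome k).
            theta[k] += lr * (expected_counts[k] / total - p[k])
    return softmax(theta)

# Hypothetical expected counts for the fate of /o/ on one branch, as an E-step might return.
counts = {"o>o": 7.0, "o>wɔ": 2.0, "o>_": 1.0}
print(m_step(counts))
```

For this unstructured softmax the optimum is simply the normalized counts, so the loop is only a stand-in for the case where theta is shared across contexts through features and no closed form exists.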


Computing Expectations

‘grass’

Standard approach, e.g. [Holmes 2001]: Gibbs sampling each sequence

[Holmes 01, Bouchard-Cote, Griffiths, Klein 07]


A Gibbs Sampler

‘grass’
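
The idea of resampling one node at a time, conditioned on its neighbours in the tree, can be shown on a toy problem. The sketch below is not the sampler from the cited work: it assumes each node holds a single character rather than a string, a symmetric mutation probability `mu`, and a fixed hypothetical tree shape (the root has children leaf 0 and an inner node; the inner node has children leaf 1 and leaf 2).

```python
import random

ALPHABET = ("C", "G")

def transition(parent, child, mu=0.1):
    """Assumed symmetric mutation model."""
    return 1.0 - mu if parent == child else mu

def conditional(parent, children, mu=0.1):
    """P(node = s | parent, children) for each s; parent is None at the root."""
    weights = {}
    for s in ALPHABET:
        w = 1.0 / len(ALPHABET) if parent is None else transition(parent, s, mu)
        for c in children:
            w *= transition(s, c, mu)
        weights[s] = w
    z = sum(weights.values())
    return {s: w / z for s, w in weights.items()}

def gibbs(leaves, sweeps=2000, mu=0.1, seed=0):
    """Estimate P(root = s | leaves) by alternately resampling the root and the inner node."""
    rng = random.Random(seed)
    root, inner = "C", "C"  # arbitrary initial state
    counts = {s: 0 for s in ALPHABET}
    for _ in range(sweeps):
        p = conditional(None, [leaves[0], inner], mu)
        root = rng.choices(ALPHABET, weights=[p[s] for s in ALPHABET])[0]
        p = conditional(root, [leaves[1], leaves[2]], mu)
        inner = rng.choices(ALPHABET, weights=[p[s] for s in ALPHABET])[0]
        counts[root] += 1
    return {s: counts[s] / sweeps for s in ALPHABET}

print(gibbs(("C", "G", "G")))
```

With string-valued nodes each conditional draw is itself a dynamic program over alignments, which is where the cost and the mixing problems come from.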


Getting Stuck

How could we jump to a state where the liquids /r/ and /l/ have a common ancestor?


Efficient Sampling: Vertical Slices

Single Sequence Resampling

Ancestry Resampling

[Bouchard-Cote, Griffiths, Klein, 08]


Results


Results: Romance


Learned Rules / Mutations


Results: Austronesian


Examples: Austronesian

[Bouchard-Cote, Hall, Griffiths, Klein, 13]


Result: More Languages Help

[Plot: mean edit distance against number of modern languages used]

Distance from Blust [1993] Reconstructions
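
The evaluation metric on this slide, mean edit distance between automatic and reference reconstructions, can be sketched as follows. This assumes plain Levenshtein distance with unit costs and treats every Unicode code point as one segment; the slide does not say which variant was used, and the word pairs are only illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance with unit insertion, deletion, and substitution costs."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,             # delete x
                cur[j - 1] + 1,          # insert y
                prev[j - 1] + (x != y),  # substitute, free if the segments match
            ))
        prev = cur
    return prev[-1]

def mean_edit_distance(pairs):
    """Average distance over (reconstruction, reference) pairs."""
    return sum(edit_distance(r, g) for r, g in pairs) / len(pairs)

# Illustrative pairs only; these are not reconstructions from the talk.
print(mean_edit_distance([("kamera", "kambra"), ("fokus", "fogo")]))
```

Multi-character IPA segments such as tʃ would count as two symbols here, so a real evaluation would tokenize into phonemes first.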


Visualization: Learned Universals

*The model did not have features encoding natural classes


Regularity and Functional Load

In a language, some pairs of sounds are more contrastive than others (higher functional load)

Example: English p/d versus t/th

High Load: p/d: pot/dot, pin/din, dress/press, pew/dew, ...

Low Load: th/t: thin/tin
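
One crude way to see the contrast between the two pairs is to count minimal pairs in a lexicon. The sketch below is only a proxy and is not the functional load measure of [King, 67]; the toy lexicon and its spelling-based segmentation are assumptions made for illustration.

```python
def minimal_pairs(lexicon, a, b):
    """Return word pairs that differ only by having sound a where the other has sound b."""
    words = set(lexicon)
    pairs = set()
    for w in words:
        for i, seg in enumerate(w):
            if seg == a:
                other = w[:i] + (b,) + w[i + 1:]
                if other in words:
                    pairs.add((w, other))
    return pairs

# Toy lexicon: each word is a tuple of segments, so "th" counts as one sound.
LEXICON = [tuple(w.split()) for w in [
    "p o t", "d o t", "p i n", "d i n", "p r e s s", "d r e s s",
    "p e w", "d e w", "th i n", "t i n",
]]

print(len(minimal_pairs(LEXICON, "p", "d")))   # p/d contrast
print(len(minimal_pairs(LEXICON, "th", "t")))  # th/t contrast
```

On this toy lexicon the p/d contrast separates four word pairs and th/t separates one, mirroring the slide's examples. A real lexicon and frequency weighting would be needed before drawing any conclusion.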


Functional Load: Timeline

1955: Functional Load Hypothesis (FLH): Sound changes are less frequent when they merge phonemes with high functional load [Martinet, 55]

1967: Previous research within linguistics: “FLH does not seem to be supported by the data” [King, 67] (Based on 4 languages as noted by [Hockett, 67; Surendran et al., 06])

Our approach: we reexamined the question with two orders of magnitude more data [Bouchard-Cote, Hall, Griffiths, Klein, 13]


Regularity and Functional Load

Functional load as computed by [King, 67]

Data: only 4 languages from the Austronesian data

[Plot axis label: Merger posterior probability]

Each dot is a sound change identified by the system


Regularity and Functional Load

Data: all 637 languages from the Austronesian data

Functional load as computed by [King, 67]

[Plot axis label: Merger posterior probability]


Extensions


Cognate Detection

/fweɣo/ /fogo/ /fwɔko/

/berβo/ /vɛrbo/ /vɛrbo/

/tʃɛntro/ /sentro/ /sɛntro/

p ‘fire’

[Hall and Klein, 11]


Grammar Induction

[Bar chart, axis ticks 0 to 70. Languages: Dutch, Danish, Swedish, Spanish, Portuguese, Slovene, Chinese, English. Labels: WG, NG, RM, G, IE, GL]

Avg rel gain: 29%

[Berg-Kirkpatrick and Klein, 07]


Language Diversity

Why are the languages of the world so similar?

Universal grammar answer: Hardware constraints

Common source answer: Not much time has passed

[Rafferty, Griffiths, and Klein, 09]