
Revisiting the quantitative phylogeny of the Uralic languages

Kaj Syrjänen, Terhi Honkola, Jadranka Rota, Unni-Päivä Leino, Outi Vesakoski

Contextualizing historical lexicology, 15.5.2017, University of Helsinki

Map: Geographical Database of the Uralic languages by BEDLAN & J. Ylikoski

Background

Earlier quantitative language phylogenies shown: Chang et al. 2015, Grollemund et al. 2015, Bouckaert et al. 2012, Syrjänen et al. 2013

Diagram: Mathematical model + PARAMETERS → Phylogeny (language phylogeny)
PARAMETERS:
- Topology
- Branch length
- Other parameters
- Basic rules of sequence evolution

Introduction

• Genetic data is used in biology to make phylogenies
• Not all genetic material is similar
- Heterogeneous rate of change
- e.g. coding regions (genes) vs. non-coding regions
→ Should not be analysed together

CC Attribution 4.0 License, http://cnx.org/contents/5CvTdmJL@4.4:xiQtvh_M@3/Structure-and-Function-of Cell#OSC_Microbio_10_04_noncodDNA

Introduction

• Phylogenetic partitioning is the solution
- Takes into account heterogeneous patterns of evolution
- Different parameters for different parts of the genome (figure labels: Parameter set 1, Parameter set 2)
- Common in biological phylogenetic analyses!


Background

Diagram: Mathematical model + PARAMETERS → (Language) phylogeny
PARAMETERS:
- Topology
- Branch length
- Basic rules of sequence evolution
- Rate heterogeneity (set 1: coding)
- Rate heterogeneity (set 2: non-coding)

Introduction

• Partitioning vs. no partitioning
• May produce trees which differ in (Kainer & Lanfear
- Branch support
- Topology
- Branch length
→ In other words, almost anything

Introduction

• Language data is used to make quantitative phylogenies
• Bayesian model-based methods
• Parsimony methods
• Distance-based methods
- Glottochronology
• Not all linguistic material is similar
- Heterogeneous rate of change

Introduction

• Two notable points of linguistic variation:
- Rate of lexical replacement varies
1) from one language to the next
= Languages change at varying rates
Solved in Bayesian analyses by using evolutionary models
2) from one meaning to the next
= e.g. parts of speech change at varying rates (Pagel et al. 2007)

Introduction

• Two notable points of linguistic variation:
- Rate of lexical replacement varies
1) from one language to the next
= Languages change at varying rates
2) from one meaning to the next
= e.g. parts of speech change at varying rates
a) Solved by allowing rate variation along a gamma distribution
b) Another option could be data partitioning

2) from one meaning to the next

a) Rate variation along a gamma distribution
Used in language phylogenies by e.g.
- Grollemund et al. 2015
- Chang et al. 2015
- Bouckaert et al. 2012

b) Data partitioning
- manual vs. algorithmic
- e.g. with the TIGER algorithm
Not used earlier to make language phylogenies

By Gamma_distribution_pdf.png: MarkSweep and Cburnett; derivative work: Autopilot (talk) - Gamma_distribution_pdf.png, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=10734916
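Rate variation along a gamma distribution is commonly implemented by approximating the distribution with a small number of equally probable rate categories. The sketch below illustrates that idea only. The shape value 0.5 is an arbitrary assumption, the function name is hypothetical, and the category means are approximated by sampling rather than by the exact quantile computation a phylogenetics package would use.

```python
import random


def discrete_gamma_rates(alpha, n_categories, n_draws=100_000, seed=1):
    """Approximate mean rates of equal-probability gamma categories.

    Draws come from a gamma distribution with shape alpha and scale
    1/alpha, so the expected rate is 1. The sorted draws are cut into
    n_categories equally sized chunks and each chunk is averaged.
    Any leftover draws (when n_draws is not divisible) are ignored.
    """
    rng = random.Random(seed)
    draws = sorted(rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_draws))
    size = n_draws // n_categories
    return [
        sum(draws[k * size:(k + 1) * size]) / size
        for k in range(n_categories)
    ]


# alpha = 0.5 is an illustrative value, not one taken from the slides.
rates_4 = discrete_gamma_rates(0.5, 4)
rates_10 = discrete_gamma_rates(0.5, 10)
print([round(r, 3) for r in rates_4])
```

Each character is then evaluated under every category rate, so slow and fast characters can coexist in one unpartitioned analysis. Adding categories gives a finer approximation of the same distribution rather than a separate parameter set per category.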

Test data partitioning of lexical data to see if this can solve the rate heterogeneity problem

1. Test TIGER as an algorithmic partitioning method
2. Assess different approaches
- manual partitioning (e.g. basic vs. less basic)
- algorithmic partitioning (TIGER)
- unpartitioned, with rate variation along a gamma distribution
3. Compare Uralic phylogenies made with different approaches

Materials

• Lexical data of 26 Uralic languages
• Three basic meaning lists
- Swadesh 100, Swadesh 200, Leipzig-Jakarta
- 226 meanings in total
• One "less basic" meaning list
- WOLD 401-500
• Basic + "less basic" = 313 meanings in total

Method: TIGER

• Algorithmic partitioning schemes using TIGER (Cummins & McInerney 2011)
• TIGER = Tree Independent Generation of Evolutionary Rates
• Calculates stability values (TIGER rates) from aligned phylogenetic data
• TIGER rates are relative measurements ranging between 0 and 1, with 1 being the most stable
• Can also produce partitioning schemes for phylogenetic analysis tools based on TIGER rates

Photo: Clicker training a tiger in Odense Zoo. Photo by OV

Method: How TIGER rates are calculated

Detailed explanation for mathematically oriented readers: Cummins & McInerney 2011

1. identify set partitions (= identical characters on each column)
2. calculate pairwise partition agreement (pa) scores between the set partitions of all the aligned characters
3. calculate TIGER rate as the averaged partition agreement score of a character to all the other characters

Figure labels: 1. set partitions; 2. pa(A,B)
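The three steps can be sketched in a few lines of Python. This is a minimal illustration, not the TIGER software itself. It assumes no missing data, and it assumes that pa(A,B) is the fraction of the sets in B's partition that fit entirely inside some set of A's partition, which is one reading of the published description. All function names and the toy columns are hypothetical.

```python
def set_partition(column):
    """Step 1: group taxon indices that share the same character state."""
    groups = {}
    for taxon, state in enumerate(column):
        groups.setdefault(state, set()).add(taxon)
    return list(groups.values())


def partition_agreement(part_a, part_b):
    """Step 2: share of the sets in part_b contained in some set of part_a."""
    hits = sum(1 for block in part_b if any(block <= other for other in part_a))
    return hits / len(part_b)


def tiger_rates(columns):
    """Step 3: average each character's agreement with all other characters."""
    parts = [set_partition(col) for col in columns]
    rates = []
    for i, part_i in enumerate(parts):
        scores = [
            partition_agreement(part_i, part_j)
            for j, part_j in enumerate(parts)
            if j != i
        ]
        rates.append(sum(scores) / len(scores))
    return rates


# Toy alignment: four taxa, four characters (one string per column).
toy_columns = ["1111", "1122", "1212", "1123"]
toy_rates = tiger_rates(toy_columns)
print([round(r, 3) for r in toy_rates])
```

Under this reading a character that is identical in every taxon always scores 1, which matches the slides' statement that a rate of 1 means stable.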

Method: Coding language data for TIGER input

Option 1: Binary
• Same input as used in phylogenetic analysis
• Each column gets its own TIGER rate
- meanings get broken up

Option 2: Multistate
• Produces meaning-specific TIGER rates
• Needs additional conversion steps
• The better option

Figure: under each option the characters are grouped as high-stability, mid-stability and low-stability
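The difference between the two codings can be shown with a small conversion sketch. The data here are invented for illustration: the language labels L1 to L4 and the cognate classes a, b, c are placeholders, not actual Uralic cognate judgements, and the function name is hypothetical.

```python
def multistate_to_binary(multistate, languages):
    """Turn one multistate character per meaning into one binary
    column per (meaning, cognate class) pair."""
    columns = []                          # labels of the binary columns
    rows = {lang: [] for lang in languages}
    for meaning, states in multistate.items():
        for cognate_class in sorted(set(states.values())):
            columns.append((meaning, cognate_class))
            for lang in languages:
                rows[lang].append(1 if states[lang] == cognate_class else 0)
    return columns, rows


languages = ["L1", "L2", "L3", "L4"]
# One multistate character per meaning (placeholder cognate classes).
multistate = {
    "eye": {"L1": "a", "L2": "a", "L3": "a", "L4": "a"},
    "rope": {"L1": "a", "L2": "b", "L3": "b", "L4": "c"},
}

columns, rows = multistate_to_binary(multistate, languages)
for lang in languages:
    print(lang, rows[lang])
```

In the multistate form each meaning is a single character and would receive a single rate. In the binary form "rope" is spread over three columns, each of which would receive its own rate, which is how a meaning ends up broken across partitions.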

Distribution of TIGER rates

Figure: meanings plotted by TIGER rate (stability), x-axis from 0.56 to 1.0

Most stable meanings: eye 1.0, name 1.0, two 1.0, we 1.0, fish 0.99, not 0.985
Least stable meanings: roast 0.575, dust 0.574, throw 0.573, rope 0.572, squeeze 0.571, shut 0.57, soon 0.569

Results 1. Sanity check for lexical TIGER rates

TIGER rates vs basic and less basic vocabulary
Figure: histogram of TIGER rate (x-axis, 0.6 to 1.0) against frequency; legend: Basic vocabulary, Less basic vocabulary

TIGER rate vs rate of lexical replacement (Pagel et al. 2007)
Figure: scatter plot with axis label Log(TIGER rate)
Spearman: -0.71
Kendall: -0.52
p-value < 2.2e-16 with both

Partitioning based on TIGER rates
Figure: histogram of TIGER rates (x-axis, 0.56 to 1.0) against N
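A partitioning scheme can be derived from the rates by cutting the rate range into bins. The sketch below uses the example rates shown on the distribution slide. Equal-width bins are an assumption made here for illustration, since the slides do not state the binning rule that the TIGER software applies, and the function name is hypothetical.

```python
def bin_by_rate(rates, n_bins):
    """Split meanings into n_bins equal-width bins over the observed
    rate range. Bin 0 holds the most stable meanings."""
    lo, hi = min(rates.values()), max(rates.values())
    width = (hi - lo) / n_bins
    bins = [[] for _ in range(n_bins)]
    for meaning, rate in rates.items():
        # Clamp so the lowest rate falls into the last bin.
        index = min(int((hi - rate) / width), n_bins - 1) if width > 0 else 0
        bins[index].append(meaning)
    return bins


# Example rates taken from the distribution slide.
example_rates = {
    "eye": 1.0, "name": 1.0, "two": 1.0, "we": 1.0, "fish": 0.99,
    "not": 0.985, "roast": 0.575, "dust": 0.574, "throw": 0.573,
    "rope": 0.572, "squeeze": 0.571, "shut": 0.57, "soon": 0.569,
}

two_partitions = bin_by_rate(example_rates, 2)
for number, members in enumerate(two_partitions, start=1):
    print(f"partition {number}: {', '.join(members)}")
```

Each resulting partition would then receive its own parameter set in the phylogenetic analysis.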

Results 2. Which partition to use? (226 meanings)

1) MrBayes analysis for different partitioning schemes (manual partitions, algorithmic partitions, gamma model)
2) Marginal likelihood estimates used to calculate model support (Bayes Factors) for a given partitioning as opposed to an unpartitioned analysis

Partitioning scheme      Marginal likelihood   Bayes Factor
unpartitioned            -13124.21             -
semantic category (5)    -13079.11             45.10
word lists (2)           -13087.28             36.93
by meaning (226)         -12689.32             434.89
tiger-2-partitions       -12933.21             191.00
gamma, 4 categories      -12920.03             204.18
gamma, 10 categories     -12919.64
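The Bayes Factor column can be reproduced from the marginal likelihoods. The sketch below assumes the reported marginal likelihoods are on a natural-log scale, in which case each Bayes Factor on the slide equals the scheme's value minus the unpartitioned value, for example -13079.11 - (-13124.21) = 45.10. The function name is hypothetical.

```python
# Marginal likelihood estimates as reported on the slide.
marginal_lnl = {
    "unpartitioned": -13124.21,
    "semantic category (5)": -13079.11,
    "word lists (2)": -13087.28,
    "by meaning (226)": -12689.32,
    "tiger-2-partitions": -12933.21,
    "gamma, 4 categories": -12920.03,
    "gamma, 10 categories": -12919.64,
}


def log_bayes_factors(lnl, baseline="unpartitioned"):
    """Difference in log marginal likelihood relative to the baseline."""
    base = lnl[baseline]
    return {
        name: value - base
        for name, value in lnl.items()
        if name != baseline
    }


bf = log_bayes_factors(marginal_lnl)
# List the schemes from strongest to weakest support.
for name, value in sorted(bf.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:24s} {value:8.2f}")
```

Some authors report twice this difference instead, so the scale should be checked before comparing these values with thresholds from the literature.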

Results 3. Uralic family tree with partitioned data

Tree figure: Multistate TIGER partitioning: 10 partitions, 226 meanings
Tree figure: Wordlist-based partitioning: LJ and non-LJ; 226 meanings

Results 3. Uralic family tree with gamma model

Tree figure: 10 gamma categories, 226 meanings

Results 3. Uralic family tree without partitioning

Tree figure: 226 meanings
Tree figure: "Incorrect" TIGER partitioning: 4 partitions, 226 meanings
*tiger-4-partitions (binary): marginal likelihood -13033.91, Bayes Factor 90.30

Discussion 1: Sanity of TIGER rates

• Rates generally seem to make sense, and work with partitioning
• A metric usable in the same way as e.g. WOLD's stability metrics or rate of lexical replacement (Pagel et al. 2007)?
• Needs further validation across language families / with different data

Discussion 2: Assessing the approaches

• Partitioning/gamma distribution always improved model support

Ranking of the approaches:
                         Bayes Factors   Phylogeny
Manual partitions        3. / 2.         (2.)
Algorithmic partitions   1.              (1. / 3.)
Gamma model              2.              (1. / 2.)

Large differences in BF, small differences in phylogenies

• In algorithmic partitioning a higher number of partitions is better than a low one
• BUT be careful of:
- Overparametrization (too small partitions)
- How the data is divided (meanings intact or not)
• Gamma distribution
- Works reasonably well
- Potential problem: meanings not kept intact

Discussion 3: Partitioned data and Uralic family

• Minor differences (with this data)
• Big picture in earlier results did not change
• How about other language families?

Take home message

Mathematical models
• Are flexible
• Develop continuously

Language phylogeny
• Great example of how linguists and model developers can improve the models when working together

Technical development is gradual!

Image: 1896 Telephone (Sweden). (Wikipedia)
120 years…

Thank you!

Acknowledgments
kielievoluutio.uta.fi
