Linked Data in Language Typology
Digital Humanities Research Seminar
University of Helsinki
February 1, 2018
Kaius Sinnemäki
General Linguistics, University of Helsinki
Today’s talk
• My background
• What is language typology?
• Rich data in typology
• Linked data possibilities in typology
• A case study
My background
MA 2004, general linguistics
• Corpus linguistics
  – Unix-based semi-automatic detection of deep embedding
• Syntactic analysis of different genres
  – Incl. stream-of-consciousness
PhD 2011, general linguistics
• Language complexity, a typological viewpoint.
• Testing domain:
  – Case marking, agreement, linear order.
• Statistics and R.
[Figure: Case marking, Agreement, Word order]
Recent & current projects
• Postdoc, 2013- (Collegium & Academy of Finland)
  – How grammatical categories of the noun (e.g. case, definiteness, gender) interact with each other.
  – How linguistic structures adapt to sociolinguistic context.
  – Combining typological and experimental evidence.
• Digital Humanities at UH (w/ Mikko Tolonen)
  – Conferences in 2014-2015; Fennica project 2015.
• Sacred in Secular Societies (w/ Janne Saarikivi) 2018
  – How religious concepts have been transformed into secular contexts and preserved there.
What is language typology?
1950s: Chomsky and Greenberg
• Linguistics dominated by structuralism from the late 19th century to the mid-20th century. Emphasis on variation:
  “Languages can differ from each other without limit and in unpredictable ways.” (Martin Joos 1957: 96)
• Chomsky: language is a separate module in the brain.
  → Universal grammar: all languages fundamentally the same.
  → Data from English. (Rationalism, cognitive science).
• Greenberg: is cross-linguistic diversity constrained?
  → Language universals (esp. on word order).
  → Data from many languages. (Empirical, anthropology).
• Language typology = the worldwide comparison of languages to describe and explain differences and similarities across languages.
  – Major research questions: to what extent linguistic structures interact 1) among themselves or 2) with cognitive and cultural patterns (Bickel 2007).
• Well known for word order correlations:
  – The order of noun (N) and genitive (Gen) correlates with the order of object (O) and verb (V):
    • NGen + VO (book of John + eat an apple)
    • GenN + OV (John’s book + an apple eat)
Cross-linguistic comparison
• How are linguistic structures compared typologically?
• Key question: are there universal categories?
  – If yes → universal ontological system feasible.
    • Cf. General Ontology for Linguistic Description (GOLD).
  – If no → comparison has to be based on researchers’ tools.
    • No right/wrong, just better/worse definitions for comparison.
• Pooling/appending data from different sources
  – Are the definitions for a particular structure comparable?
  – If not, the only possibility is to analyse new data.
Rich data / big data in typology
“Big data” in cross-linguistic research
• Big data in linguistics: text corpora (e.g., Language Bank).
• Language typology is about language comparison.
  – How much is there to compare in languages?
  – A finite description of the grammar of even one language is impossible (Moscoso del Prado Martín 2010).
• Currently about 7000 languages spoken (Hammarström et al. 2017).
→ There should be many reasons for “big data” research in language typology.
Computer-assisted linguistic and cultural research
• Open access databases since 2005; linked data possibilities since 2008 (World Atlas of Language Structures).
• New databases are available that contain linguistic and cultural data on languages and societies all over the world.
• These enable new research questions to be approached with new computer-assisted methods.
CLLD
• CLLD = Cross-Linguistic Linked Data (clld.org)
  – Hosts several large cross-linguistic datasets.
  – All openly available, repositories on GitHub.
  – Including (visit https://github.com/clld):
    • The World Atlas of Language Structures.
    • The Atlas of Pidgin and Creole Language Structures.
    • The World Loanword Database.
    • The South American Indigenous Language Structures.
    • Glottolog: catalogue of all languages, families and dialects (including bibliographic information).
  – Data largely in database format, not many texts.
D-PLACE
• Cultural, linguistic, environmental and geographic data on 1400+ societies (https://d-place.org).
  – Society = “represents a group of people in a particular locality, who often share a language and cultural identity.”
• Cultural descriptions tagged with date and ethnographic sources. Ethnographies largely based on pre-1950s work.
  – Ethnographic Atlas (Murdock 1962-1971). Human Relations Area Files.
  – Data from pre-1950s ethnographies.
• Also phylogenetic trees → phylogenetic comparative methods applicable.
• Clone from https://github.com/D-PLACE/dplace-data
A silly exercise using D-PLACE
• Is there any relationship between:
  – Cultural trait “presence of trance”
  – Ecological factor “rain constancy”
  – (coursework w/ Hilde Schneemann, Andrea Bender and Mary Walworth)
• Phylogenetic computational methods:
  – Ancestral state reconstruction
  – Correlated changes.
• Result?
  – Negative coefficient, but non-significant (p = .063)
Data sources
• Typologists’ data sources are reference grammars.
  – Analysis “by hand”, seldom computer-assisted.
  – Time-consuming.
  – Are observations “languages” or “constructions”?
• Samples around 200-300 languages.
  – WALS: data on 2600+ languages, 190 structures. Gaps (Dryer & Haspelmath 2013).
  – Not big from a statistical perspective, but “big” in comparison to the history of language typology.
• Compare with corpus linguistics:
  – Datapoints counted in tens of thousands or more.
  – About 125 000 hits for the verb oleskella ‘stay, dwell’ in the Finnish Korp corpus.
Linked data in typology
What to link?
• Usually languages. Problem: many alternative names.
  – Tenharim (Tupian) has 35 names (AUTOTYP).
  – Different databases name languages differently.
  – See e.g. discussion on http://dlc.hypotheses.org/623
• Solution: standard language identifiers
  – Problem:
    • Different databases/catalogues use different identifier systems.
    • Ethnologue (ISO 639-3), Glottolog, WALS, AUTOTYP.
  – Not a real problem, but there are still many-to-many mappings.
Tenharim, alternative names
• abahyba, Caripuna, Cauaiua, Cauhib, Cawahib
• Diahoi, Diahói, Diahui, Diarroi, Diarrui, Djahui
• Jahoi, Jahui, Jauareta-Tapiia, Jiahui, Juma, Yuma
• Kagwahiv, Kagwahív, Kagwahiva
• Karipuna, Karipuná, Kawahib, Kawaib
• Paranawat, Parintintim, Parintintin, Parintintín
• Pawaté-Wirafed
• Tenharem, Tenharim, Tenharím, Tenharin
• Tenharín, Tukumanfed, Uru-eu-uau-uau.
• Conceptual work to be done:
  – What is a language in a database/catalogue?
  – Doculect, languoid, glossonym
  – Need to formalize the notion “language”
    • See Good & Cysouw (2013).
    • Also discussion on Diversity Linguistics Comment.
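
The alternative-names problem above can be illustrated with a small lookup table that maps spelling variants to one identifier. This is only a minimal sketch: the identifier `lang-0001` is a hypothetical placeholder (a real workflow would use an actual Glottocode or ISO 639-3 code), the function names are invented for illustration, and the normalization rule (strip diacritics, case, and punctuation) is an assumption that would not resolve genuinely different names such as Kagwahiv vs. Parintintin without a curated list.

```python
import unicodedata


def normalize_name(name: str) -> str:
    """Strip diacritics, case, spaces and hyphens from a language name."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return "".join(ch for ch in no_marks.casefold() if ch.isalnum())


# Hypothetical identifier, used here only for illustration.
LANGUOID_ID = "lang-0001"

# A few of the alternative names listed for Tenharim on the slide above.
ALTERNATIVE_NAMES = [
    "Tenharim", "Tenharím", "Tenharin", "Tenharín",
    "Parintintin", "Parintintín", "Kagwahiv", "Kagwahív",
    "Uru-eu-uau-uau",
]


def build_index(names, identifier):
    """Map every normalized name variant to the same identifier."""
    return {normalize_name(name): identifier for name in names}


NAME_INDEX = build_index(ALTERNATIVE_NAMES, LANGUOID_ID)


def lookup(name):
    """Return the identifier for a name variant, or None if unknown."""
    return NAME_INDEX.get(normalize_name(name))
```

Accent and case variants collapse to one key, so `lookup("TENHARÍM")` and `lookup("Tenharim")` resolve to the same identifier, while an unlisted name returns `None` and has to be handled by hand.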
A case study
Creoles vs. regular languages
• Languages are transmitted in different social conditions. Usually faithful transmission, with some restructuring.
• Some languages are under heavy language contact.
  – Break in normal transmission.
  – Restructuring, structural simplification / complexification.
  → Pidgins, jargons, creoles, mixed languages.
• Creoles share many features → a creole typological profile?
  – Used for arguing about language evolution.
• Questions:
  – Do creoles differ from regular languages? (Bakker et al. 2011)
  – Do the contributing languages (mostly Indo-European) differ from other languages of the world? (Cysouw 2009; Blasi et al. 2017)
[Figures: Cysouw 2009]
• Some of the often-cited examples of the creole profile deal with word order and argument marking:
  – Creoles have SVO and very little morphological marking (e.g. no case).

    A    boi  lobi  a    umapikin.
    DET  boy  love  DET  girl
    S         V     O
    ‘The boy loves the girl.’ (Sranan; Winford & Plag 2013)

• But: this correlation between SVO and no case marking (or morphological marking) also occurs in regular languages.
• Is the correlation stronger in creoles than in regular languages?
  – If yes → evidence for a creole profile. If not, then not.
• Data on the case marking of 687 regular languages available in AUTOTYP (Bickel et al. 2017).
• Data on the word order of 1377 regular languages available in WALS (Dryer 2013).
• Data on the word order and case marking of 55 creoles available in APiCS (Michaelis et al. 2013).
• AUTOTYP metadata files:
  – AUTOTYP’s own language identifier (integer)
  – ISO 639-3 code for each language
  – Glottocode (Glottolog) for each language
• WALS and APiCS metadata files:
  – WALS code for each language (for WALS languages)
  – ISO 639-3 code for each language
  – Glottocode (Glottolog) for each language
→ Should be straightforward to merge or join in R.
• But no: several many-to-many mappings.
  – Not all language identifiers match one-to-one.
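
One way to see where a join on identifiers will go wrong is to check, before merging, which codes do not map one-to-one. The following is a minimal sketch in Python rather than R, using only the standard library. All codes in `PAIRS` are invented for illustration, and the function names are hypothetical; they are not part of any of the databases mentioned.

```python
from collections import defaultdict

# Hypothetical metadata rows: (ISO 639-3 code, Glottocode). All codes are invented.
PAIRS = [
    ("aaa", "aaaa1111"),
    ("bbb", "bbbb1111"),
    ("bbb", "bbbb2222"),  # one ISO code linked to two Glottocodes
    ("ccc", "cccc1111"),
    ("ddd", "cccc1111"),  # two ISO codes linked to one Glottocode
]


def ambiguous_mappings(pairs):
    """Return identifiers that map to more than one partner, in each direction."""
    iso_to_glotto = defaultdict(set)
    glotto_to_iso = defaultdict(set)
    for iso, glotto in pairs:
        iso_to_glotto[iso].add(glotto)
        glotto_to_iso[glotto].add(iso)
    iso_ambiguous = {k: sorted(v) for k, v in iso_to_glotto.items() if len(v) > 1}
    glotto_ambiguous = {k: sorted(v) for k, v in glotto_to_iso.items() if len(v) > 1}
    return iso_ambiguous, glotto_ambiguous


def safe_pairs(pairs):
    """Keep only the pairs whose identifiers match one-to-one."""
    iso_ambiguous, glotto_ambiguous = ambiguous_mappings(pairs)
    return [
        (iso, glotto)
        for iso, glotto in pairs
        if iso not in iso_ambiguous and glotto not in glotto_ambiguous
    ]


ISO_AMBIGUOUS, GLOTTO_AMBIGUOUS = ambiguous_mappings(PAIRS)
CLEAN_PAIRS = safe_pairs(PAIRS)
```

The ambiguous identifiers are the ones that need a researcher's decision; only the remaining pairs can be joined mechanically.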
AUTOTYP vs. WALS
• Observations (lines) in AUTOTYP are constructions in languages.
• Observations (lines) in WALS are languages.
→ Many researcher-based choices and lots of cleaning are necessary before merging is possible.
→ BUT: once the cleaning script is ready, linking should work automatically. Currently work in progress but almost there.
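
The mismatch in observation units can be sketched as a two-step cleaning script: first collapse construction-level rows to one row per language, then join on a shared identifier. This is a minimal illustration, not the cleaning script from the talk. The rows, codes, and field names are hypothetical, and the collapsing rule (a language counts as having case marking if any of its constructions does) is just one possible researcher-based choice.

```python
# Hypothetical construction-level rows (one row per construction in a language).
CONSTRUCTIONS = [
    {"glottocode": "aaaa1111", "construction": "noun", "case_marking": True},
    {"glottocode": "aaaa1111", "construction": "pronoun", "case_marking": False},
    {"glottocode": "bbbb1111", "construction": "noun", "case_marking": False},
]

# Hypothetical language-level rows (one value per language).
WORD_ORDER = {"aaaa1111": "SOV", "bbbb1111": "SVO", "cccc1111": "SVO"}


def collapse_to_languages(rows):
    """Assumed rule: a language has case marking if any construction has it."""
    result = {}
    for row in rows:
        code = row["glottocode"]
        result[code] = result.get(code, False) or row["case_marking"]
    return result


def join_on_glottocode(case_by_language, order_by_language):
    """Inner join: keep only languages present in both tables."""
    shared = sorted(case_by_language.keys() & order_by_language.keys())
    return [
        {
            "glottocode": code,
            "case_marking": case_by_language[code],
            "word_order": order_by_language[code],
        }
        for code in shared
    ]


MERGED = join_on_glottocode(collapse_to_languages(CONSTRUCTIONS), WORD_ORDER)
```

A different collapsing rule (for example, requiring case marking on nouns specifically) would change the merged values, which is why such choices need to be documented alongside the script.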
And the preliminary result… (Sinnemäki 2017)
Generalized mixed effects modeling
- Data: 55 creoles
- logit estimate: -4.6 ± 1.7; p < .001 ***
→ Correlation between word order and case marking.
- Data: 333 non-creoles
- logit estimate: -8.6 ± 3.8; p < .0001 ***
→ Correlation between word order and case marking.
- Interaction case x word_order x lg_type
- estimate: -2.1 ± 1.6; p = .24 (not significant)
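
The logit estimates above express the association between word order and case marking on the log-odds scale. As a much simplified illustration of that scale, the sketch below computes a log odds ratio from a 2x2 table. It is not the mixed effects model from the talk: it has no random effects, the counts are hypothetical, and the function name and the 0.5 continuity correction (Haldane-Anscombe) are choices made only for this example.

```python
import math


def log_odds_ratio(a, b, c, d, correction=0.5):
    """Log odds ratio and Wald standard error for the 2x2 table [[a, b], [c, d]].

    The correction is added to every cell so that zero counts do not break the log.
    """
    a, b, c, d = (x + correction for x in (a, b, c, d))
    estimate = math.log((a * d) / (b * c))
    std_error = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return estimate, std_error


# Hypothetical counts. Rows: SVO, non-SVO. Columns: case marking, no case marking.
ESTIMATE, STD_ERROR = log_odds_ratio(2, 40, 30, 10)
```

A negative estimate here means that case marking is less likely in SVO languages than in non-SVO languages, which is the direction reported on the slides.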
Conclusions
• More typological data openly accessible
• Universal ontological systems not necessarily feasible for a typologist
• Linking data between datasets possible but requires time-consuming cleaning
• The available datasets enable old questions to be answered in new ways with computational methods.
Thank you!
References
Bakker, P. et al. 2011. Creoles are typologically distinct from non-creoles. Journal of Pidgin and Creole Languages 26(1): 5-42.
Bickel, B. 2007. Typology in the 21st century: Major current developments. Linguistic Typology 11(1): 239-251.
Bickel, B. et al. 2017. The AUTOTYP typological databases. Version 0.1.0. https://github.com/autotyp/autotyp-data.
Blasi, D. et al. 2017. Grammars are robustly transmitted even during the emergence of creole languages. Nature Human Behaviour 1(10): 723-729.
Cysouw, M. 2009. APiCS, WALS, and the creole typological profile (if any). Presentation at the 1st APiCS Conference, 5-8 November 2009, Leipzig.
Dryer, M. 2013. Order of subject, object and verb. In M. Dryer & M. Haspelmath (eds.).
Dryer, M. & M. Haspelmath (eds.) 2013. The World Atlas of Language Structures Online. Leipzig: MPI for Evolutionary Anthropology. http://wals.info.
Good, J. & M. Cysouw 2013. Languoid, Doculect, and Glossonym: Formalizing the Notion ‘Language’. Language Documentation and Conservation 7: 331-359.
Hammarström, H. et al. 2017. Glottolog 3.0. Jena: MPI for the Science of Human History.
Joos, M. (ed.) 1957. Readings in Linguistics: The Development of Descriptive Linguistics in America Since 1925. Washington: American Council of Learned Societies.
Michaelis, S. et al. (eds.) 2013. Atlas of Pidgin and Creole Language Structures Online. Leipzig: MPI for Evolutionary Anthropology.
Moscoso del Prado Martín, F. 2010. The effective complexity of language: English requires at least an infinite grammar. Ms. http://www.moscosodelprado.net.
Simons, G. F. & C. D. Fennig (eds.) 2017. Ethnologue: Languages of the World, 20th edn. Dallas, TX: SIL International. http://www.ethnologue.com.
Sinnemäki, K. 2017. How useful are creoles in language evolution research? Evaluating cross-linguistic universals of word order and argument marking. Invited talk at the 30th Annual CUNY Conference on Human Sentence Processing, April 1, 2017, Massachusetts Institute of Technology.
Winford, D. & I. Plag 2013. Sranan structure dataset. In S. Michaelis et al. (eds.).