Russian National Corpus today: overview and perspectives http://ruscorpora.ru Vladimir A. Plungian (Moscow)
Dec 21, 2015
Outline
• RNC: current state of the art
• Search possibilities and special properties
• Applications
• Further development
The goals of RNC
• Supporting all kinds of linguistic research, both descriptive and theoretical, synchronic and diachronic
– Lexicographic and morphosyntactic studies
• Observing language change, especially small and gradual
– Discourse and sociolinguistic studies
• Assisting teaching and learning Russian
Corpus as ideology
• One of the first corpora designed by linguists and for linguists
• Towards a usage-based linguistic model, inter alia:
– From prescriptive to descriptive attitudes
– From binary to gradual grammaticality judgments
– From single-system view to synchronic and diachronic variation
– From egalitarian to quantitative approach to linguistic form
The RNC project (Russian Academy of Sciences)
• Started in 2003 (preparatory studies since 2001)
• Available on the internet since April 2004
• Main participants: Vinogradov Institute for Russian Language (Moscow, RAS); Institute for Linguistic Research (St. Petersburg, RAS); Lomonosov Moscow State University; Voronezh State University
• Technically supported by Yandex
Yandex®: the biggest Russian internet resource, with one of the most powerful and innovative search engines
Composition and structure: the main corpus
1. Morphologically annotated texts in written and spoken Standard Russian of XVIII-XXI cent.
– Late Modern Russian (written texts from the 2nd half of XX century up to the present day): 100 mln
– The modern newspapers corpus: 100 mln
– The corpus of oral texts (the same period): 6 mln
– Early Modern Russian (written XVIII, XIX and early XX-century texts): 80 mln
– The corpus of Russian poetry: 4 mln
– The corpus for accentological studies [oral + poetry]
Composition and structure: minor sub-corpora
2. The parallel corpora
1. English-Russian [10 mln]
2. German-Russian [2 mln]
3. Ukrainian-Russian [1 mln]
4. Polish-Russian [1 mln, in preparation]
3. The small corpus of dialect texts: 0.2 mln
4. The small syntactically annotated corpus: 0.5 mln
5. The small learner’s corpus: 7 mln
The main corpus
• Circa 300 mln tokens
• All types of written texts
– fiction (both prose and drama), poetry, memoirs, newspaper accounts and reviews, advertisements, texts on education, engineering, science, philosophy, religion, business, law, as well as private texts not intended for publication (diaries, private correspondence, etc.)
• Spontaneous oral texts, public performances, movie transcripts
Annotation in RNC
Major types:
• meta-textual annotation
• morphological annotation
• accentual annotation
• semantic annotation
+
• poetic annotation (metrics, strophics, rhyme types, etc.)
Meta-textual annotation
• Primary text descriptors: author (name, sex, age), title, creation date, size (number of words)
• For fiction: genre (e.g. humour, fantasy), text type (e.g. novel, essay), time and place described (e.g. Soviet Union, 1930s)
• For non-fiction: functional sphere (e.g. religion, law), text type (e.g. report, advertisement), subject (e.g. sports, science)
• All meta-textual parameters are searchable
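As a minimal sketch of what searchable meta-textual parameters could look like in practice, the Python below builds a sub-corpus by filtering a toy list of text records on descriptor values. The `TextMeta` class, its field names and the sample records are hypothetical illustrations, not the RNC's actual schema or data.

```python
from dataclasses import dataclass

@dataclass
class TextMeta:
    # Hypothetical record; field names loosely follow the descriptors listed above.
    title: str
    author_sex: str
    created: int        # creation year
    fiction: bool
    text_type: str      # e.g. "novel", "report"
    sphere: str = ""    # functional sphere, used for non-fiction only

def select(texts, **criteria):
    """Return the texts whose attributes equal every given criterion."""
    return [t for t in texts
            if all(getattr(t, key) == value for key, value in criteria.items())]

# Invented sample records, for illustration only.
SAMPLE = [
    TextMeta("Text A", "f", 1995, True, "novel"),
    TextMeta("Text B", "m", 2003, False, "report", "law"),
    TextMeta("Text C", "f", 2010, False, "advertisement", "business"),
]

subcorpus = select(SAMPLE, fiction=False, author_sex="f")
print([t.title for t in subcorpus])
```

Any combination of descriptors can be passed as keyword arguments, which mirrors the idea that every parameter is available as a search criterion.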
Morphological annotation
• Automatic parsing (without disambiguation)
• Manual disambiguation and accentuation in a relatively small sub-section (ca 7 mln tokens)
• Morphological information: part of speech, inflectional categories, non-standard forms (distorted or anomalous)
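Parsing without disambiguation means that a word form may keep several competing analyses. The sketch below shows one way to model this in Python, assuming a token that stores a list of analyses and a query that succeeds if any one of them fits. The class names and tag labels (`V`, `S`, `gen`, etc.) are hypothetical, not the RNC's own tag set. The example form стали can be read as a past plural of the verb стать or as a genitive singular of the noun сталь 'steel' (among other case readings omitted here).

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Analysis:
    lemma: str
    pos: str                        # part of speech (hypothetical tag names)
    gram: frozenset = frozenset()   # inflectional categories

@dataclass
class Token:
    form: str
    analyses: list = field(default_factory=list)

    def matches(self, pos=None, gram=()):
        """True if at least one analysis has the given POS and all given tags."""
        return any((pos is None or a.pos == pos) and set(gram) <= a.gram
                   for a in self.analyses)

    def is_ambiguous(self):
        """True if the analyses disagree on lemma or part of speech."""
        return len({(a.lemma, a.pos) for a in self.analyses}) > 1

stali = Token("стали", [
    Analysis("стать", "V", frozenset({"past", "pl"})),
    Analysis("сталь", "S", frozenset({"gen", "sg"})),
])
print(stali.matches(pos="V"), stali.matches(pos="S", gram=["gen"]), stali.is_ambiguous())
```

With this design, a search over undisambiguated text returns every token for which the requested reading is possible, so some hits will be false positives; the manually disambiguated sub-section avoids that at the cost of size.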
Semantic annotation
• Lexicon-based annotation
• Specific sets of values for different lexical classes:
– verbs, adjectives, adverbs, numerals, pronouns, predicate nouns, non-predicate nouns, proper nouns (names, surnames and patronymics)
Semantic annotation: values
• Include primarily taxonomic parameters (e.g. ‘motion’, ‘speech’, ‘colour’, ‘instrument’, ‘person’, etc.), as well as:
– Mereology (sets & parts ~ wholes)
– Some derivational features (diminutives, augmentatives, attenuatives, semelfactives, etc.)
Searching on a semantic basis: an example
Construction of the type <в ночь> с четверга на пятницу ≈ ‘Thursday night’
Query:
preposition С + noun, GRAMM: ‘genitive’, SEM: ‘span of time’ + preposition НА + noun, GRAMM: ‘accusative’, SEM: ‘span of time’
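The query above can be sketched as a sequence of per-token constraints matched against consecutive annotated tokens. The Python below is an illustration only: the dictionary layout, the tag names (`PR`, `S`, `gen`, `acc`) and the semantic label `t:time:span` are assumptions made for this sketch, not the RNC's query syntax, and the sentence is annotated by hand.

```python
def fits(token, constraint):
    """A token fits when every constrained field equals, or contains, the required value."""
    for key, required in constraint.items():
        value = token.get(key)
        if isinstance(required, set):
            # Set-valued constraints: all required tags must be present.
            if not required <= set(value or ()):
                return False
        elif value != required:
            return False
    return True

def find(tokens, pattern):
    """Return start indices where consecutive tokens fit the pattern."""
    n = len(pattern)
    return [i for i in range(len(tokens) - n + 1)
            if all(fits(tokens[i + j], pattern[j]) for j in range(n))]

TIME = {"t:time:span"}   # hypothetical semantic tag for 'span of time'
PATTERN = [
    {"lemma": "с", "pos": "PR"},
    {"pos": "S", "gram": {"gen"}, "sem": TIME},
    {"lemma": "на", "pos": "PR"},
    {"pos": "S", "gram": {"acc"}, "sem": TIME},
]

# Hand-annotated toy sentence: в ночь с четверга на пятницу
SENTENCE = [
    {"form": "в", "lemma": "в", "pos": "PR"},
    {"form": "ночь", "lemma": "ночь", "pos": "S", "gram": {"acc", "sg"}, "sem": TIME},
    {"form": "с", "lemma": "с", "pos": "PR"},
    {"form": "четверга", "lemma": "четверг", "pos": "S", "gram": {"gen", "sg"}, "sem": TIME},
    {"form": "на", "lemma": "на", "pos": "PR"},
    {"form": "пятницу", "lemma": "пятница", "pos": "S", "gram": {"acc", "sg"}, "sem": TIME},
]

hits = find(SENTENCE, PATTERN)
print(hits)  # start index of each match
```

Because the nouns are constrained by semantic class rather than by lemma, the same pattern would also match other pairs of time-span nouns, which is the point of searching on a semantic basis.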
Applications
• Linguistic research
• Including non-linguist students’ research activities
• Education materials
• Reference tool for non-experts
Applications: research
Actual language usage (as opposed to grammars)
Short-term grammatical changes
Including evolution of word meanings and usage
Applications: students’ activities
Getting young people interested in language as a phenomenon: from small toy research projects to full-fledged investigations.
Not necessarily linguistic students!
Applications: educational materials
Russian linguistic education is traditionally oriented towards classical literature and based on a fixed set of examples that wander from one textbook to another.
Result: negative attitudes towards Russian language courses among younger people.
Applications: educational materials
The Corpus provides instruments and resources to switch to (a) usage-based and (b) domain-specific linguistic training.
Applications: reference tool
The Corpus provides quick answers to many expert and non-expert questions. It is especially convenient for simple lexical queries, such as word history.
When (first) and in what sense was the word used?
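A word-history question of this kind reduces to finding the earliest dated example for each sense. The Python below is a minimal sketch, assuming search hits have already been exported as (year, sense) pairs; the function name, the sense labels and the years are invented for illustration and do not come from the Corpus.

```python
def first_attestations(examples):
    """Map each sense label to the earliest year it is attested in `examples`.

    `examples` is an iterable of (year, sense) pairs.
    """
    earliest = {}
    for year, sense in examples:
        if sense not in earliest or year < earliest[sense]:
            earliest[sense] = year
    return earliest

# Invented hits for a hypothetical word.
HITS = [
    (1912, "sense A"),
    (1874, "sense A"),
    (1998, "sense B"),
    (1931, "sense A"),
    (2005, "sense B"),
]

result = first_attestations(HITS)
print(result)
```

The earliest year found is only the first attestation within the texts searched, so it gives an upper bound on when the word or sense actually entered the language.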