Russian National Corpus today: overview and perspectives Vladimir A. Plungian (Moscow)

Russian National Corpus today: overview and perspectives

http://ruscorpora.ru

Vladimir A. Plungian

(Moscow)

Outline

• RNC: current state of the art

• Search possibilities and special properties

• Applications

• Further development

The goals of RNC

• Supporting all kinds of linguistic research, both descriptive and theoretical, synchronic and diachronic

– Lexicographic and morphosyntactic studies

• Observing language change, especially small and gradual change

– Discourse and sociolinguistic studies

• Assisting the teaching and learning of Russian

Corpus as ideology

• One of the first corpora designed by linguists and for linguists

• Towards a usage-based linguistic model, inter alia:

– From prescriptive to descriptive attitudes

– From binary to gradual grammaticality judgments

– From a single-system view to synchronic and diachronic variation

– From an egalitarian to a quantitative approach to linguistic form

The RNC project (Russian Academy of Sciences)

• Started in 2003 (preparatory studies since 2001)

• Available on the internet since April 2004

• Main participants: Vinogradov Institute for Russian Language (Moscow, RAS); Institute for Linguistic Research (St. Petersburg, RAS); Lomonosov Moscow State University; Voronezh State University

• Technically supported by Yandex

Yandex®: the biggest Russian internet resource, with one of the most powerful and innovative search engines

Composition and structure: the main corpus

1. Morphologically annotated texts in written and spoken Standard Russian of the 18th-21st centuries

– Late Modern Russian (written texts from the second half of the 20th century up to the present day): 100 mln

– The modern newspapers corpus: 100 mln

– The corpus of oral texts (the same period): 6 mln

– Early Modern Russian (written texts of the 18th, 19th and early 20th centuries): 80 mln

– The corpus of Russian poetry: 4 mln

– The corpus for accentological studies [oral + poetry]
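As a quick arithmetic check on the figures listed for the main corpus, the sub-corpus sizes can be added up. The following is only a minimal sketch: the dictionary keys are informal labels invented here (not official RNC identifiers), and the accentological corpus is left out because the slide gives no size for it.

```python
# Sizes in millions of tokens, copied from the list above.
# Keys are informal labels made up for this sketch, not RNC identifiers.
MAIN_CORPUS_MLN = {
    "late_modern_written": 100,
    "modern_newspapers": 100,
    "oral": 6,
    "early_modern_written": 80,
    "poetry": 4,
}


def total_mln(sizes):
    """Return the combined size of all listed sub-corpora, in millions."""
    return sum(sizes.values())


print(total_mln(MAIN_CORPUS_MLN))  # 290
```

The total of 290 mln is in line with the "circa 300 mln tokens" quoted for the main corpus.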

Composition and structure: minor sub-corpora

2. The parallel corpora

1. English-Russian [10 mln]

2. German-Russian [2 mln]

3. Ukrainian-Russian [1 mln]

4. Polish-Russian [1 mln, in preparation]

3. The small corpus of dialect texts: 0.2 mln

4. The small syntactically annotated corpus: 0.5 mln

5. The small learner’s corpus: 7 mln

The main corpus

• Circa 300 mln tokens

• All types of written texts

– fiction (both prose and drama), poetry, memoirs, newspaper accounts and reviews, advertisements, texts on education, engineering, science, philosophy, religion, business, law, as well as private texts not intended for publication (diaries, private correspondence, etc.)

• Spontaneous oral texts, public performances, movie transcripts

Annotation in RNC

Major types:

• meta-textual annotation

• morphological annotation

• accentual annotation

• semantic annotation

+

• poetic annotation (metrics, strophics, rhyme types, etc.)

Meta-textual annotation

• Primary text descriptors: author (name, sex, age), title, creation date, size (number of words)

• For fiction: genre (e.g. humour, fantasy), text type (e.g. novel, essay), time and place described (e.g. Soviet Union, 1930s)

• For non-fiction: functional sphere (e.g. religion, law), text type (e.g. report, advertisement), subject (e.g. sports, science)

• All meta-textual parameters are searchable
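Because every meta-textual parameter is searchable, a user can in effect carve out a custom sub-corpus by filtering on these descriptors. The sketch below illustrates the idea only: the `TextMeta` record, its field names, the `select` helpers and the three sample entries are all hypothetical placeholders, not the RNC's actual schema or data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TextMeta:
    """Hypothetical record for the descriptors listed above."""
    author: str
    author_sex: str
    title: str
    created: int                  # creation year
    words: int                    # size in words
    sphere: str                   # "fiction" or a non-fiction functional sphere
    text_type: str                # e.g. "novel", "report"
    genre: Optional[str] = None   # fiction only, e.g. "humour"


def select(texts, **criteria):
    """Return the texts whose descriptors equal every given criterion."""
    return [t for t in texts
            if all(getattr(t, key) == value for key, value in criteria.items())]


def select_by_period(texts, start, end):
    """Return the texts created between two years, inclusive."""
    return [t for t in texts if start <= t.created <= end]


# Placeholder entries, invented purely for illustration.
TEXTS = [
    TextMeta("Author A", "f", "Text 1", 1935, 52000, "fiction", "novel", "humour"),
    TextMeta("Author B", "m", "Text 2", 1998, 1200, "law", "report"),
    TextMeta("Author C", "f", "Text 3", 2005, 800, "fiction", "essay", "fantasy"),
]

subcorpus = select(TEXTS, sphere="fiction", author_sex="f")
print([t.title for t in subcorpus])  # ['Text 1', 'Text 3']
```

Exact-match filtering is the simplest possible design; a real interface would presumably also need ranges, multiple values per field, and partial matches.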

Morphological annotation

• Automatic parsing (without disambiguation)

• Manual disambiguation and accentuation in a relatively small sub-section (ca 7 mln tokens)

• Morphological information: part of speech, inflectional categories, non-standard forms (distorted or anomalous)
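The practical difference between automatic parsing without disambiguation and the manually disambiguated section can be pictured as follows: an undisambiguated token carries every candidate analysis, so a grammatical query matches if any one of them fits. This is a minimal sketch under that assumption; the class names and the tag labels ("V", "S", "past", "gen", ...) are illustrative, not the RNC tag set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Analysis:
    lemma: str
    pos: str
    gramm: frozenset


@dataclass
class Token:
    form: str
    analyses: list  # every candidate analysis; more than one means ambiguity


# "стали" is either the past plural of the verb стать 'become' or a case form
# of the noun сталь 'steel' (only one of the noun readings is listed here).
stali = Token("стали", [
    Analysis("стать", "V", frozenset({"past", "pl"})),
    Analysis("сталь", "S", frozenset({"gen", "sg"})),
])


def matches(token, pos=None, gramm=()):
    """True if at least one candidate analysis satisfies the constraints."""
    return any(
        (pos is None or a.pos == pos) and set(gramm) <= a.gramm
        for a in token.analyses
    )


def disambiguate(token, lemma):
    """Mimic manual disambiguation: keep only analyses with the chosen lemma."""
    return Token(token.form, [a for a in token.analyses if a.lemma == lemma])


print(matches(stali, pos="S", gramm=["gen"]))                         # True
print(matches(disambiguate(stali, "стать"), pos="S", gramm=["gen"]))  # False
```

In the undisambiguated part of a corpus, a search for genitive nouns would therefore also return verbal uses of such a form, which is why a disambiguated section is valuable for precise grammatical queries.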

Semantic annotation

• Lexicon-based annotation

• Specific sets of values for different lexical classes:

– verbs, adjectives, adverbs, numerals, pronouns, predicate nouns, non-predicate nouns, proper nouns (names, surnames and patronymics)

Semantic annotation: values

• Include primarily taxonomic parameters (e.g. ‘motion’, ‘speech’, ‘colour’, ‘instrument’, ‘person’, etc.), as well as:

– Mereology (sets & parts ~ wholes)

– Some derivational features (diminutives, augmentatives, attenuatives, semelfactives, etc.)

Searching on a semantic basis: an example

Construction of the type <в ночь> с четверга на пятницу ≈ ‘Thursday night’

Query:

preposition С + noun, GRAMM: ‘genitive’, SEM: ‘span of time’ + preposition НА + noun, GRAMM: ‘accusative’, SEM: ‘span of time’
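The query above can be read as a sequence of four slots, each constraining part of speech, lemma, grammatical features and semantic class. The sketch below shows one way such a query might be matched against annotated tokens. It assumes that "+" means strict adjacency and that annotations are stored as plain sets of labels; the `Token` and `Slot` classes and the tag names are hypothetical, not the RNC's internal representation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Token:
    form: str
    lemma: str
    pos: str
    gramm: frozenset = frozenset()
    sem: frozenset = frozenset()


@dataclass(frozen=True)
class Slot:
    pos: str
    lemma: Optional[str] = None
    gramm: frozenset = frozenset()
    sem: frozenset = frozenset()

    def accepts(self, tok):
        """A token fits if it has the right POS, lemma and all required tags."""
        return (tok.pos == self.pos
                and (self.lemma is None or tok.lemma == self.lemma)
                and self.gramm <= tok.gramm
                and self.sem <= tok.sem)


TIME = frozenset({"span of time"})

QUERY = [
    Slot("PREP", lemma="с"),
    Slot("NOUN", gramm=frozenset({"genitive"}), sem=TIME),
    Slot("PREP", lemma="на"),
    Slot("NOUN", gramm=frozenset({"accusative"}), sem=TIME),
]


def find(tokens, query):
    """Return start indices where consecutive tokens satisfy consecutive slots."""
    n = len(query)
    return [i for i in range(len(tokens) - n + 1)
            if all(slot.accepts(tok) for slot, tok in zip(query, tokens[i:i + n]))]


# в ночь с четверга на пятницу
PHRASE = [
    Token("в", "в", "PREP"),
    Token("ночь", "ночь", "NOUN", frozenset({"accusative"}), TIME),
    Token("с", "с", "PREP"),
    Token("четверга", "четверг", "NOUN", frozenset({"genitive"}), TIME),
    Token("на", "на", "PREP"),
    Token("пятницу", "пятница", "NOUN", frozenset({"accusative"}), TIME),
]

print(find(PHRASE, QUERY))  # [2]
```

The match starts at index 2, i.e. at с, so the optional <в ночь> part falls outside the pattern, as in the query on the slide. Because the nouns are constrained by semantic class rather than by lemma, the same query would also retrieve other pairs of time-span nouns in this construction.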

Syntactic corpus: sample search

Applications

• Linguistic research

• Including non-linguist students’ research activities

• Education materials

• Reference tool for non-experts

Applications: research

Actual language usage (as opposed to grammars)

Short-term grammatical changes

Including evolution of word meanings and usage

Applications: students’ activities

Getting young people interested in language as a phenomenon: from small toy studies to full-fledged investigations.

Not necessarily linguistics students!

Applications: educational materials

Russian linguistic education is traditionally oriented towards classical literature and based on a fixed set of examples passed from one textbook to the next.

This fosters negative attitudes towards Russian courses among younger people.

Applications: educational materials

The Corpus provides instruments and resources to switch to (a) usage-based and (b) domain-specific linguistic training.

Applications: reference tool

The Corpus provides quick answers to many expert and non-expert questions. It is especially convenient for simple lexical queries, such as a word's history.

When was the word first used, and in what sense?
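A word-history question of this kind reduces to sorting search hits by the creation date stored in the meta-textual annotation. The following is a minimal sketch of that step: the `Hit` record and both helper functions are hypothetical, and the sample hits are invented placeholders rather than real corpus results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    year: int      # creation date of the source text
    title: str
    context: str   # the sentence or snippet containing the word


def first_attestation(hits):
    """Return the earliest hit, or None if the word was not found."""
    return min(hits, key=lambda h: h.year) if hits else None


def hits_per_decade(hits):
    """Count hits per decade to give a rough picture of usage over time."""
    counts = {}
    for h in hits:
        decade = h.year // 10 * 10
        counts[decade] = counts.get(decade, 0) + 1
    return dict(sorted(counts.items()))


# Placeholder hits, invented purely for illustration.
HITS = [
    Hit(1957, "Text 1", "context 1"),
    Hit(1923, "Text 2", "context 2"),
    Hit(1961, "Text 3", "context 3"),
    Hit(1925, "Text 4", "context 4"),
]

print(first_attestation(HITS).year)  # 1923
print(hits_per_decade(HITS))         # {1920: 2, 1950: 1, 1960: 1}
```

The earliest hit in a corpus is only the earliest attestation within that corpus, and the sense in which the word was used still has to be judged by reading the returned context.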

Further development

• Oral and poetic texts

• Multi-media corpus (annotated movies)

• Full derivational annotation (searching for derivational parameters)

• Improving statistics and frequency modules

• Emphasis on parallel corpora

• Slavic parallel corpora?
