Russian National Corpus today: overview and perspectives http://ruscorpora.ru Vladimir A. Plungian (Moscow)

Dec 21, 2015

Page 1

Russian National Corpus today:

overview and perspectives

http://ruscorpora.ru

Vladimir A. Plungian

(Moscow)

Page 2

Outline

• RNC: current state of the art

• Search possibilities and special properties

• Applications

• Further development

Page 3

The goals of RNC

• Supporting all kinds of linguistic research, both descriptive and theoretical, synchronic and diachronic

– Lexicographic and morphosyntactic studies

• Observing language change, especially small and gradual

– Discourse and sociolinguistic studies

• Assisting teaching and learning Russian

Page 4

Corpus as ideology

• One of the first corpora designed by linguists and for linguists

• Towards a usage-based linguistic model, inter alia:

– From prescriptive to descriptive attitudes

– From binary to gradual grammaticality judgments

– From single-system view to synchronic and diachronic variation

– From egalitarian to quantitative approach to linguistic form

Page 5

The RNC project (Russian Academy of Sciences)

• Started in 2003 (preparatory studies since 2001)

• Available on the internet since April 2004

• Main participants: Vinogradov Institute for Russian Language (Moscow, RAS); Institute for Linguistic Research (St. Petersburg, RAS); Lomonosov Moscow State University; Voronezh State University

• Technically supported by Yandex® (the biggest Russian internet resource, with one of the most powerful and innovative search engines)

Page 6

Composition and structure: the main corpus

1. Morphologically annotated texts in written and spoken Standard Russian of the XVIII-XXI centuries

– Late Modern Russian (written texts from the 2nd half of the XX century up to the present day): 100 mln

– The modern newspapers corpus: 100 mln

– The corpus of oral texts (the same period): 6 mln

– Early Modern Russian (written XVIII, XIX and early XX-century texts): 80 mln

– The corpus of Russian poetry: 4 mln

– The corpus for accentological studies [oral + poetry]
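As a quick arithmetic check, the sizes listed on this slide can be added up. The sketch below is only an illustration: the dictionary name and labels are my own, and the accentological corpus is left out because the slide gives no size for it and describes it as built from the oral and poetry corpora.

```python
# Sub-corpus sizes in millions of tokens, copied from the slide.
# The accentological corpus is omitted (no size given; it is described
# as a combination of the oral and poetry corpora).
SIZES_MLN = {
    "Late Modern Russian": 100,
    "Modern newspapers": 100,
    "Oral texts": 6,
    "Early Modern Russian": 80,
    "Russian poetry": 4,
}

total_mln = sum(SIZES_MLN.values())

# Share of each sub-corpus in the listed total.
shares = {name: size / total_mln for name, size in SIZES_MLN.items()}

print(f"Listed total: {total_mln} mln tokens")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The listed figures add up to 290 mln tokens, with the two 100 mln components together accounting for a little over two thirds of that.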

Page 7

Composition and structure: minor sub-corpora

2. The parallel corpora

  1. English-Russian [10 mln]

  2. German-Russian [2 mln]

  3. Ukrainian-Russian [1 mln]

  4. Polish-Russian [1 mln, in preparation]

3. The small corpus of dialect texts: 0.2 mln

4. The small syntactically annotated corpus: 0.5 mln

5. The small learner’s corpus: 7 mln

Page 8

The main corpus

• Circa 300 mln tokens

• All types of written texts

– fiction (both prose and drama), poetry, memoirs, newspaper accounts and reviews, advertisements, texts on education, engineering, science, philosophy, religion, business, law, as well as texts for private use not intended for publication (diaries, private correspondence, etc.)

• Spontaneous oral texts, public performances, movie transcripts

Page 9

Annotation in RNC

Major types:

• meta-textual annotation

• morphological annotation

• accentual annotation

• semantic annotation

• plus poetic annotation (metrics, strophics, rhyme types, etc.)

Page 10

Meta-textual annotation

• Primary text descriptors: author (name, sex, age), title, creation date, size (number of words)

• For fiction: genre (e.g. humour, fantasy), text type (e.g. novel, essay), time and place described (e.g. Soviet Union, 1930s)

• For non-fiction: functional sphere (e.g. religion, law), text type (e.g. report, advertisement), subject (e.g. sports, science)

• All meta-textual parameters are searchable
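To make the idea of searchable meta-textual parameters concrete, here is a minimal sketch of one way such descriptors could be stored and filtered. It is not the RNC's actual data model: the class `TextMeta`, its field names, the helper `select`, and the sample records are all hypothetical and invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TextMeta:
    """Hypothetical record holding the descriptors named on the slide."""
    author: str
    author_sex: str
    title: str
    created: int                    # creation year
    words: int                      # size in words
    sphere: str                     # "fiction" or a functional sphere
    text_type: str                  # e.g. "novel", "report"
    genre: Optional[str] = None     # fiction only
    subject: Optional[str] = None   # non-fiction only


def select(texts, **criteria):
    """Return the texts whose metadata equal every given criterion."""
    return [
        t for t in texts
        if all(getattr(t, field) == value for field, value in criteria.items())
    ]


# Invented sample records, not real corpus entries.
texts = [
    TextMeta("Author A", "f", "Text 1", 1998, 52000, "fiction", "novel", genre="humour"),
    TextMeta("Author B", "m", "Text 2", 2003, 1200, "law", "report", subject="science"),
    TextMeta("Author C", "m", "Text 3", 1975, 8000, "fiction", "essay", genre="fantasy"),
]

humorous_fiction = select(texts, sphere="fiction", genre="humour")
print([t.title for t in humorous_fiction])
```

Because every descriptor is an ordinary field, any combination of them can serve as a filter, which is the property the slide stresses.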

Page 11

Morphological annotation

• Automatic parsing (without disambiguation)

• Manual disambiguation and accentuation in a relatively small sub-section (ca 7 mln tokens)

• Morphological information: part of speech, inflectional categories, non-standard forms (distorted or anomalous)
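The practical effect of parsing without disambiguation can be shown with a small sketch: an ambiguous word form carries several analyses at once, and a query matches if any one of them fits. The classes `Analysis` and `Token`, the tag names, and the function `matches` are assumptions made for this illustration, not the RNC's tag set or software. The example form стали is ambiguous in Russian between a past plural form of the verb стать 'become' and, among other readings, the genitive singular of the noun сталь 'steel'.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Analysis:
    lemma: str
    pos: str              # part of speech (hypothetical tag names)
    features: frozenset   # inflectional categories


@dataclass
class Token:
    form: str
    analyses: list        # every analysis the parser proposes


def matches(token, pos=None, features=()):
    """True if at least one analysis of the token satisfies the query."""
    wanted = set(features)
    return any(
        (pos is None or a.pos == pos) and wanted <= a.features
        for a in token.analyses
    )


# Two of the possible readings of the form "стали".
stali = Token(
    form="стали",
    analyses=[
        Analysis("стать", "V", frozenset({"past", "pl"})),
        Analysis("сталь", "N", frozenset({"gen", "sg"})),
    ],
)

print(matches(stali, pos="V"))                      # verb reading is found
print(matches(stali, pos="N", features=["gen"]))    # noun reading is found too
print(matches(stali, pos="N", features=["acc"]))    # not among the listed readings
```

In a non-disambiguated text both a verb query and a noun query return this token, so the search results include some false hits; in the manually disambiguated sub-section each token would keep only the reading that fits its context.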

Page 12

Semantic annotation

• Lexicon-based annotation

• Specific sets of values for different lexical classes:

– verbs, adjectives, adverbs, numerals, pronouns, predicate nouns, non-predicate nouns, proper nouns (names, surnames and patronymics)

Page 13

Semantic annotation: values

• Include primarily taxonomic parameters (e.g. ‘motion’, ‘speech’, ‘colour’, ‘instrument’, ‘person’, etc.), as well as:

– Mereology (sets & parts ~ wholes)

– Some derivational features (diminutives, augmentatives, attenuatives, semelfactives, etc.)

Page 14

Searching on a semantic basis: an example

Construction of the type <в ночь> с четверга на пятницу ≈ ‘Thursday night’ (literally ‘<in the night> from Thursday to Friday’)

Query:

preposition С + noun, GRAMM: ‘genitive’, SEM: ‘span of time’ + preposition НА + noun, GRAMM: ‘accusative’, SEM: ‘span of time’
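The query above combines lexical, grammatical and semantic constraints on four adjacent words. The following sketch shows how such a pattern could be matched against an annotated sentence; it is an illustration only, not the RNC search engine. The dictionary keys (`lemma`, `pos`, `case`, `sem`), the tag values `PREP` and `NOUN`, and the functions `token_matches` and `find` are assumptions; only the value ‘span of time’ comes from the slide.

```python
# The query as a sequence of per-token constraints.
PATTERN = [
    {"lemma": "с", "pos": "PREP"},
    {"pos": "NOUN", "case": "gen", "sem": "span of time"},
    {"lemma": "на", "pos": "PREP"},
    {"pos": "NOUN", "case": "acc", "sem": "span of time"},
]


def token_matches(token, constraint):
    """A token fits if it agrees with every attribute in the constraint."""
    return all(token.get(key) == value for key, value in constraint.items())


def find(tokens, pattern):
    """Return start positions where the pattern matches adjacent tokens."""
    n = len(pattern)
    return [
        i for i in range(len(tokens) - n + 1)
        if all(token_matches(tokens[i + j], pattern[j]) for j in range(n))
    ]


# Hand-annotated toy sentence: "в ночь с четверга на пятницу".
sentence = [
    {"form": "в", "lemma": "в", "pos": "PREP"},
    {"form": "ночь", "lemma": "ночь", "pos": "NOUN", "case": "acc", "sem": "span of time"},
    {"form": "с", "lemma": "с", "pos": "PREP"},
    {"form": "четверга", "lemma": "четверг", "pos": "NOUN", "case": "gen", "sem": "span of time"},
    {"form": "на", "lemma": "на", "pos": "PREP"},
    {"form": "пятницу", "lemma": "пятница", "pos": "NOUN", "case": "acc", "sem": "span of time"},
]

for start in find(sentence, PATTERN):
    print(" ".join(t["form"] for t in sentence[start:start + len(PATTERN)]))
```

Because the nouns are constrained by semantic class rather than by lemma, the same pattern would also pick up other pairs of time spans in this construction, not only days of the week.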

Page 15

Syntactic corpus: sample search

Page 16

Syntactic corpus: sample search

Page 17

Applications

• Linguistic research

• Including non-linguist students’ research activities

• Education materials

• Reference tool for non-experts

Page 18

Applications: research

Actual language usage (as opposed to grammars)

Short-term grammatical changes

Including evolution of word meanings and usage

Page 19

Applications: students’ activities

Getting young people interested in language as a phenomenon: from small toy research projects to full-fledged investigations.

Not necessarily students of linguistics!

Page 20

Applications: educational materials

Russian linguistic education is traditionally oriented towards classical literature and based on a fixed set of examples that migrate from one textbook to another.

Result: negative attitudes towards Russian courses among younger people

Page 21

Applications: educational materials

The Corpus provides instruments and resources to switch to (a) usage-based and (b) domain-specific linguistic training.

Page 22

Applications: reference tool

The Corpus provides quick answers to many expert and non-expert questions. It is especially convenient for simple lexical queries, such as the history of a word.

When (first) and in what sense was the word used?
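A word-history lookup of this kind amounts to finding the earliest dated text in which the word occurs. The sketch below illustrates the idea on invented data; the function `first_attestation`, the document fields, and the sample texts and years are all hypothetical, and the whitespace tokenization is a simplification that ignores punctuation and inflected forms.

```python
def first_attestation(documents, word):
    """Return (year, title) of the earliest document containing the word,
    or None if the word does not occur at all."""
    dated = [
        (doc["year"], doc["title"])
        for doc in documents
        if word in doc["text"].lower().split()
    ]
    return min(dated) if dated else None


# Invented toy documents; the years and texts are placeholders.
documents = [
    {"title": "Text A", "year": 1850, "text": "это слово уже было"},
    {"title": "Text B", "year": 1801, "text": "первое слово здесь"},
    {"title": "Text C", "year": 1790, "text": "другой текст"},
]

print(first_attestation(documents, "слово"))
print(first_attestation(documents, "корпус"))
```

Answering the second half of the question, in what sense the word was used, still requires reading the retrieved contexts; the date filter only narrows down which contexts to read first.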

Page 23

Further development

• Oral and poetic texts

• Multi-media corpus (annotated movies)

• Full derivational annotation (searching for derivational parameters)

• Improving statistics and frequency modules

• Emphasis on parallel corpora

• Slavic parallel corpora?