Corpus lexicography in Russia: recent trends and perspectives
Maria Khokhlova
St.Petersburg State University, Philological Faculty
[email protected]
Frequency Dictionary of Russian (L.N. Zasorina, 1977)
The text database contained about 1 million units.
Compiling it raised a number of well-known issues: representativeness, tokenization, lemmatization...
It was therefore the earliest computer corpus of Russian.
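To make the tokenization and lemmatization issues concrete, here is a minimal Python sketch of how a lemma frequency list can be counted from raw text. It is an illustration only, not the procedure used for the 1977 dictionary: the `LEMMAS` table, the `tokenize` rule and the sample sentence are all hypothetical, and a real project would use a full morphological analyser instead of a hand-written lookup.

```python
import re
from collections import Counter

# Hypothetical toy lemma table (assumption for illustration only);
# a real dictionary project would rely on a full morphological analyser.
LEMMAS = {
    "книги": "книга",
    "книгу": "книга",
    "читал": "читать",
    "читает": "читать",
}


def tokenize(text):
    """Split text into lowercase word tokens.

    A token is a run of Cyrillic or Latin letters, optionally joined
    by internal hyphens (so "кто-то" stays one token).
    """
    return re.findall(r"[а-яёa-z]+(?:-[а-яёa-z]+)*", text.lower())


def lemma_frequencies(text):
    """Count lemma frequencies; unknown word forms are counted as they are."""
    return Counter(LEMMAS.get(token, token) for token in tokenize(text))


sample = "Он читал книги. Она читает книгу."
freq = lemma_frequencies(sample)

# Print the frequency list, most frequent lemmas first.
for lemma, count in freq.most_common():
    print(lemma, count)
```

Even this toy version shows why the choices matter: whether hyphenated forms count as one unit, and which word forms are grouped under one lemma, directly change the resulting frequencies.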
Prehistory of Russian Corpus Linguistics
«Computer Fund of the Russian Language»
Idea: Acad. Andrey Yershov
Andrey Petrovich Yershov (1931-1988)
Jeršov A.P. "On methodology of constructing dialogue systems: the phenomenon of business prose" (1978)
The idea was formulated as follows: "Any progress in the field of constructing models and algorithms will remain a purely academic exercise, unless a most important problem of creating a Computer fund of the Russian language is solved. We hope that creation of such a Computer fund by linguists, qualified for the task, will precede construction of large systems for application purposes. This would minimize labour costs and simultaneously would protect the Russian language from arbitrary and incompetent intervention."
Russian Corpora (1)
The Uppsala Russian Corpus (1960s), the earliest corpus
The Tübingen Russian Corpus (Universität Tübingen, 1999-2004, under the guidance of T. Berger)
The HANCO corpus (Helsinki Annotated Corpus), Helsinki University, Slavic and Baltic Languages Department (2001-2004, A. Mustajoki, M. Kopotev). It is a small teaching corpus with morphological and syntactic annotation.
Russian Corpora (2)
Three big corpora of Russian:
The National Corpus of Russian Language (NCRL, about 364 million words) (http://ruscorpora.ru)
Corpora at the University of Leeds created by S. Sharoff (about 2000 million words) (http://corpus.leeds.ac.uk/ruscorpora.html)
A corpus of Russian fiction by the Automatic Text Processing (AOT) initiative team, 680 million words (http://aot.ru)
Russian National Corpus (1)
Over 364 million words
Based on Yandex Search: search by exact form(s); lexico-grammatical search.
See www.yandex.ru (Advanced Search) and www.ruscorpora.ru (Search in the Corpus).
The Fundamental Digital Library of Russian Literature and Folklore
FEB-web accumulates information in text, audio, visual, and other forms on 11th-20th-century Russian literature, Russian folklore, and the history of Russian literary scholarship and folklore studies.
Conference “Corpus Linguistics”
2002, 2004, 2006, 2008, 2011, 2013 (late June)
Saint-Petersburg
St.Petersburg State University,