NooJ International Conference, Komotini, May 2010
Portability of Armenian Corpus by NooJ
Anaid Donabedian & Victoria Khurshudian
Institut National des Langues et Civilisations Orientales (INALCO), Paris
Armenian: preliminaries
an Indo-European language
right-branching
of an accusative type
typically with an SOV structure and
dominantly with an agglutinative morphology
Historical Armenia
Republic of Armenia
Periodization
prealphabetical
alphabetical (405 A.D. to the present):
1. Old Armenian or Grabar (V-XI);
2. Middle Armenian (XII-XVI);
3. Modern Armenian (XVII – up to present)
Western (based on Constantinople dialect): dialects…
Eastern (based on Ararat dialect): dialects…
Objective
Provide data compatibility and portability between NooJ and
the Eastern Armenian National Corpus (EANC) platform
What is the Eastern Armenian National Corpus?
www.eanc.net
Corpus Technologies
Michael Daniel, Victoria Khurshudian, Dmitri Levonian,
Vladimir Plungian, Alexey Polyakov, Sergey Rubakov
Source texts
PARSER
Annotated texts
Annotation algorithm
Grammatical dictionary
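The components listed above can be pictured as a dictionary-lookup pipeline. The following is only a minimal sketch under stated assumptions: the names `Analysis`, `GRAMMATICAL_DICTIONARY` and `annotate`, the transliterated toy entries, and the tag labels are all hypothetical and do not reproduce the actual EANC parser or its data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Analysis:
    lexeme: str        # dictionary form
    grams: frozenset   # grammatical labels (hypothetical tag names)
    gloss: str         # English translation


# Toy grammatical dictionary: wordform -> possible analyses.
# Entries and tags are illustrative only, not EANC data.
GRAMMATICAL_DICTIONARY = {
    "tun": [Analysis("tun", frozenset({"N", "nom", "sg"}), "house")],
    "tan": [
        Analysis("tun", frozenset({"N", "gen", "sg"}), "house"),
        Analysis("tun", frozenset({"N", "dat", "sg"}), "house"),
    ],
}


def annotate(source_text):
    """Parser step: attach every dictionary analysis to each token."""
    annotated = []
    for wordform in source_text.split():
        # Unknown wordforms keep an empty analysis list instead of failing.
        analyses = GRAMMATICAL_DICTIONARY.get(wordform.lower(), [])
        annotated.append((wordform, analyses))
    return annotated
```

In this sketch, homonymous forms simply keep all of their analyses, and unknown wordforms get an empty list.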
EANC History
Moscow, Russia
March 2006: Project Launch
July 2007: 1st Release
May 2008: 2nd Release
March 2009: 3rd Release
Eastern Armenian National Corpus (EANC) is:
about 110 million tokens
morphological and other markup
English translations for frequent tokens
covers Standard Eastern Armenian (SEA) from the mid-19th century to the present
both written and oral discourse
full-text view for over 100 Armenian classic titles
open internet access
Written Discourse
over 106 mln. tokens
510 authors (1841-2009)
1039 fiction texts (including 206 translated texts)
7858 press issues
non-fiction (scientific and other) texts
Oral Discourse (3.5 mln. tokens)
Spontaneous discourse
Polylogues
Task-oriented discourse
TV show transcripts
Movies …
☼ The entire EANC oral corpus was recorded and transcribed by the project.
EANC Functionality
Search Functionality
Token queries
Context queries
Subcorpus selection
Simple token queries:
• lexeme search
• wordform search
• gram search
• translation search
• lexeme + gram search
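The simple query types listed above can be sketched as filters over annotated tokens. This is an illustrative sketch only: `Token`, `search_tokens`, the simplified transliterations and the tag labels are assumptions made for the example, not the EANC query interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    wordform: str
    lexeme: str
    grams: frozenset   # hypothetical grammatical labels
    translation: str


def search_tokens(tokens, lexeme=None, wordform=None, grams=(), translation=None):
    """Return the tokens that match every criterion that was supplied."""
    required = set(grams)
    hits = []
    for token in tokens:
        if lexeme is not None and token.lexeme != lexeme:
            continue
        if wordform is not None and token.wordform != wordform:
            continue
        if not required <= token.grams:
            continue
        if translation is not None and token.translation != translation:
            continue
        hits.append(token)
    return hits


# Toy data, not taken from the corpus.
SAMPLE = [
    Token("tun", "tun", frozenset({"N", "nom", "sg"}), "house"),
    Token("tan", "tun", frozenset({"N", "dat", "sg"}), "house"),
    Token("girk", "girk", frozenset({"N", "nom", "sg"}), "book"),
]
```

A combined lexeme + gram query then corresponds to passing both arguments, e.g. `search_tokens(SAMPLE, lexeme="tun", grams={"dat"})`.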
Advanced options for token queries:
case-sensitivity
punctuation marks
position in the sentence
wildcard (*)
logical operators (e.g. 'or': |)
negated features
grammatical/lexical homonymy inclusion/exclusion
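Some of the advanced options above (case-sensitivity, the `*` wildcard, the `|` alternative and negated features) can be illustrated as follows. This is a sketch under assumptions: the function names and the way negation is passed are hypothetical, and Python's `fnmatch` is used here as a stand-in matcher, not as a description of how EANC evaluates patterns.

```python
from fnmatch import fnmatchcase


def wordform_matches(wordform, pattern, case_sensitive=False):
    """Match a wordform against '|'-separated alternatives with '*' wildcards.

    Note: fnmatch also interprets '?' and '[...]', which the slide does not mention.
    """
    if not case_sensitive:
        wordform, pattern = wordform.lower(), pattern.lower()
    return any(fnmatchcase(wordform, alt.strip()) for alt in pattern.split("|"))


def grams_match(grams, required=(), negated=()):
    """True if all required labels are present and no negated label is present."""
    grams = set(grams)
    return set(required) <= grams and grams.isdisjoint(negated)
```

For example, `wordform_matches("tun", "girk*|tun")` accepts either alternative, and `grams_match(grams, required={"N"}, negated={"pl"})` keeps nouns that are not plural.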
Subcorpus selection by:
time
author(s) / title(s)
genres
types of texts (translated vs. original)
superposition of any of the above
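Subcorpus selection by superposed criteria amounts to a logical AND over text metadata. The sketch below assumes a hypothetical `TextMeta` record and `select_subcorpus` function with placeholder titles and authors; the real EANC metadata schema may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextMeta:
    title: str
    author: str
    year: int
    genre: str
    translated: bool


def select_subcorpus(texts, years=None, authors=None, genres=None, translated=None):
    """Keep texts that satisfy every filter that was supplied (logical AND)."""
    selected = []
    for text in texts:
        if years is not None and not (years[0] <= text.year <= years[1]):
            continue
        if authors is not None and text.author not in authors:
            continue
        if genres is not None and text.genre not in genres:
            continue
        if translated is not None and text.translated != translated:
            continue
        selected.append(text)
    return selected


# Placeholder metadata, not actual corpus records.
TEXTS = [
    TextMeta("Text A", "Author 1", 1890, "fiction", False),
    TextMeta("Text B", "Author 2", 1975, "press", False),
    TextMeta("Text C", "Author 1", 2001, "fiction", True),
]
```

Omitting a filter leaves that dimension unrestricted, so `select_subcorpus(TEXTS)` returns the whole collection.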
Display options
context expanding
‘sort by’ (time, lexeme, wordform etc.)
Latin transliteration
glossed display
KWIC (keyword in context)
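As an illustration of the KWIC display, each hit can be shown with a fixed window of context on either side. This is a minimal sketch; the function name `kwic` and the `width` parameter are assumptions, and the actual EANC display logic is not documented here.

```python
def kwic(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for each hit."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            # Clamp the left window at the start of the text.
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines
```

Expanding the context, as offered among the display options, corresponds to calling the function again with a larger `width`.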
Transliterated samples:
Glossed samples:
KWIC samples:
Main Current Tasks:
Make NooJ-based Western Armenian morphological annotation compatible with the EANC grammatical dictionary structure
Make the EANC and NooJ Western Armenian platforms mutually portable
Achieve full mutual coverage of NooJ and EANC capabilities (e.g. syntactic annotation in NooJ)
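One way to picture the compatibility task is a correspondence table between the two tag inventories. The sketch below assumes that a NooJ-side analysis arrives as a `+`-separated feature string; the table `NOOJ_TO_EANC`, its entries and the function `convert_analysis` are hypothetical and reproduce neither the real NooJ Western Armenian tagset nor the EANC one.

```python
# Hypothetical correspondence table: NooJ-style feature -> EANC-style label.
# Neither column reproduces the real tag inventories.
NOOJ_TO_EANC = {
    "N": "N",
    "s": "sg",
    "p": "pl",
    "Nom": "nom",
    "Dat": "dat",
}


def convert_analysis(nooj_features):
    """Split a '+'-separated feature string and map each feature.

    Returns (mapped labels, unmapped features) so gaps in the table stay visible.
    """
    mapped, unmapped = [], []
    for feature in nooj_features.split("+"):
        if feature in NOOJ_TO_EANC:
            mapped.append(NOOJ_TO_EANC[feature])
        else:
            unmapped.append(feature)
    return mapped, unmapped
```

Returning the unmapped features separately is a deliberate choice in this sketch: features with no counterpart on the other platform are reported rather than silently dropped.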