Language Resources and Tools for the Creation of a Bulgarian Treebank Kiril Simov, Petya Osenova, Sia Kolkovska, Elisaveta Balabanova, Dimitar Doikoff BulTreeBank Project LML, Bulgarian Academy of Sciences (www.bultreebank.org) Workshop on Balkan Language Resources and Tools 2003 21 November 2003 Thessaloniki, Greece
Language Resources and Tools for the Creation of a Bulgarian Treebank
Kiril Simov, Petya Osenova,
Sia Kolkovska, Elisaveta Balabanova, Dimitar Doikoff
BulTreeBank Project
LML, Bulgarian Academy of Sciences
(www.bultreebank.org)
Workshop on Balkan Language Resources and Tools 2003
21 November 2003 Thessaloniki, Greece
Plan of the talk
• Preliminary Notes
• BulTreeBank Language Resources and Tools
• The integration architecture of the resources and tools
• Conclusion and Future work
Financial Support
BulTreeBank is a joint project between
Seminar für Sprachwissenschaft, Eberhard-Karls-Universität, Tübingen, Germany
and
Linguistic Modelling Laboratory, Bulgarian Academy of Sciences, Sofia, Bulgaria
The project is funded by the Volkswagen-Stiftung, Germany
Expected Results
• A set of Bulgarian sentences marked-up with detailed syntactic information
• A core set of sentences designated inside the treebank
• A linguistically interpreted text archive for Bulgarian
• A reliable partial grammar for automatic parsing of phrases in Bulgarian
• Software modules for compiling, manipulating and exploring the language resources
Preliminary notes (1)
We rely on two prerequisites during the process of our treebank creation:
– integration of the pre-processing components
– an adequate annotation scheme
Preliminary notes (2)
Integration is performed with the help of the following techniques:
– Looking-forward strategy
• Adaptive mechanism
• Additive mechanism
– Looking-backward strategy
– Creation of a gold standard
Language Resources
• Text archive
• Morphological dictionary
• Gazetteers
• Valence dictionary
• Semantic dictionary
• Treebank
The BulTreeBank Text Archive
• A collection of linguistically interpreted texts from different genres (target size: 100 million words)
• About 72 million running words are converted into XML documents, marked up in conformance with the TEI guidelines
• 10 million running words are morphologically analyzed
• Over 1 000 000 words are morphosyntactically disambiguated by hand
The morphological dictionary
• Published as a book – Popov, Simov and Vidinska, 1998
• It covers the grammatical information of about 100 000 lexemes (1 600 000 word forms) and serves as a basis for the morphological analyzer
• The problem of unknown words: open classes (names, abbreviations) and derivational models (diminutives, etc.)
The Gazetteers
• Gazetteers of names consisting of 15 000 words – Bulgarian and foreign person names, locations from the whole world, organizations, and others
• Gazetteers of the most frequent abbreviations consisting of 1500 acronyms and graphical abbreviations
• Gazetteers of 300 most frequent introductory expressions and parentheticals. This is considered to be a step towards a basic list of collocations
The Valence Dictionary
• It consists of 1000 verbs and their valence frames
• The frames of the most frequent verbs are compared to the corpus data and repaired if necessary (new frames added, old ones deleted or made more fine-grained)
• The semantic restrictions over the arguments are extracted and matched against the SIMPLE ontology (recall the Semantic Dictionary)
Lexical Entry of the Valence Dictionary
Verb, its transitivity and aspect
Meaning
I. Frame (the arguments that the verb requires)
II. Morphology of the verb's arguments
S(ubject)=N,PerPron
III. Semantics of the arguments
S(ubject) is a person
IV. Examples of the verb's usage
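The entry layout above can be pictured as a small record type. The following is only a minimal sketch, not the project's actual storage format: the class names, field names and the placeholder values in the sample entry are assumptions made for illustration. Only the slot structure (verb, transitivity, aspect, meaning, frame, morphology, semantics, examples) and the S(ubject)=N,PerPron / "person" example come from the slide.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Argument:
    """One argument slot of a frame (hypothetical structure)."""
    role: str              # e.g. "S" for S(ubject)
    morphology: List[str]  # allowed categories, e.g. ["N", "PerPron"]
    semantics: str         # semantic restriction, e.g. "person"


@dataclass
class ValenceEntry:
    """A lexical entry following the slots listed on the slide."""
    verb: str
    transitivity: str
    aspect: str
    meaning: str
    frame: List[Argument] = field(default_factory=list)
    examples: List[str] = field(default_factory=list)

    def allows(self, role: str, category: str) -> bool:
        """True if some argument with this role admits the category."""
        return any(a.role == role and category in a.morphology
                   for a in self.frame)


# Placeholder values: not taken from the real dictionary.
entry = ValenceEntry(
    verb="VERB",
    transitivity="intransitive",
    aspect="imperfective",
    meaning="placeholder gloss",
    frame=[Argument(role="S", morphology=["N", "PerPron"], semantics="person")],
)

print(entry.allows("S", "PerPron"))
print(entry.allows("S", "Adv"))
```

Keeping morphology and semantics on the argument rather than on the verb mirrors sections II and III of the entry, where each restriction is stated per argument.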
The Semantic Dictionary
• Classification of the most frequent nouns with respect to the ontological hierarchy of SIMPLE without specifying the synonymic relations between them (3 000 nouns)
• The proper names from the gazetteers are also mapped to the ontological hierarchy of SIMPLE
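Mapping nouns to an ontological hierarchy makes it possible to test semantic restrictions such as "S(ubject) is a person" mechanically. The sketch below illustrates the idea only: the class names, the parent links and the sample nouns are hypothetical and are not taken from the SIMPLE ontology or from the project's dictionary.

```python
# Hypothetical fragment of a class hierarchy (child -> parent).
PARENT = {
    "Human": "Entity",
    "Location": "Entity",
    "Entity": None,
}

# Hypothetical noun-to-class mapping; proper names from a gazetteer
# can be mapped in the same way as common nouns.
NOUN_CLASS = {
    "teacher": "Human",
    "Sofia": "Location",
}


def is_a(cls, ancestor):
    """Walk up the hierarchy and report whether cls is under ancestor."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = PARENT.get(cls)
    return False


def satisfies(noun, restriction):
    """True if the noun is classified and its class meets the restriction."""
    cls = NOUN_CLASS.get(noun)
    return cls is not None and is_a(cls, restriction)


print(satisfies("teacher", "Human"))
print(satisfies("Sofia", "Human"))
```

Unclassified nouns fail every restriction in this sketch; a real system might instead leave such cases undecided.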
The Treebank
• Core set of sentences (1 500 sentences): extracted mainly from Bulgarian grammars and processed manually, hence of the highest quality
• Treebank (6 000 sentences): extracted mainly from the corpus and pre-processed automatically before being treated manually
Core set of sentences: Example of a Pragmatic Adjunct
A Corpus Sentence: an example of dependents realisation
The Tools
• Morphological analyzer
• Disambiguator(s)
• Partial grammars
– sentence splitter
– named-entity recognition module
– chunkers
Morphological Analyzer
• Assigns all possible analyses to the tokens
• Implemented in CLaRK System as a regular grammar
• Works together with the ‘token classification’ strategy and with the gazetteers
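The behaviour described above (every token receives all of its possible analyses, with token classification and gazetteers filling in where the dictionary has nothing to say) can be sketched as follows. This is a plain dictionary lookup, not the regular grammar used in the CLaRK System; the lexicon entries, tag names and token categories are invented for the illustration.

```python
# Hypothetical toy resources; the tags are illustrative only.
LEXICON = {
    "walks": ["NOUN-pl", "VERB-3sg"],
    "the": ["DET"],
}
GAZETTEER = {
    "Sofia": ["NAME-location"],
}


def classify_token(token):
    """Coarse token classification applied before dictionary lookup."""
    if token.isdigit():
        return "number"
    if token.isalpha():
        return "word"
    return "other"


def analyze(token):
    """Return every analysis available for the token, without choosing."""
    category = classify_token(token)
    if category != "word":
        return [category.upper()]
    analyses = LEXICON.get(token.lower(), []) + GAZETTEER.get(token, [])
    return analyses if analyses else ["UNKNOWN"]


for tok in ["The", "walks", "Sofia", "42", "xyzzy"]:
    print(tok, analyze(tok))
```

The analyzer deliberately keeps ambiguous tokens ambiguous; choosing among the analyses is left to a later disambiguation step.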
Disambiguator(s)
• Rule-based disambiguator: a preliminary version of a rule-based morpho-syntactic disambiguator, encoded as a set of constraints within the CLaRK System; it reaches 80 % coverage
• Neural-network-based disambiguator (Simov and Osenova 2001): its accuracy is 95.25 % for part-of-speech and 93.17 % for complete morpho-syntactic disambiguation
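Constraint-based disambiguation can be illustrated with a toy example in which each constraint prunes analyses that are impossible in context. The sketch below is an assumption-laden illustration, not the CLaRK constraint set: the single rule, the tag names and the sample sentence are invented, and "coverage" is read here as the share of tokens left with exactly one analysis, which is only one possible interpretation of the figure on the slide.

```python
def no_verb_after_determiner(prev_tags, tags):
    """Hypothetical constraint: drop verb readings right after a determiner."""
    if prev_tags == ["DET"]:
        return [t for t in tags if not t.startswith("VERB")]
    return tags


def apply_constraints(sentence, constraints):
    """Prune analyses left to right; never remove a token's last analysis."""
    result = [(tok, list(tags)) for tok, tags in sentence]
    for i, (_, tags) in enumerate(result):
        prev_tags = result[i - 1][1] if i > 0 else []
        for constraint in constraints:
            kept = constraint(prev_tags, tags)
            if kept:  # an empty result means the rule is not applied
                tags[:] = kept
    return result


def coverage(result):
    """Share of tokens that end up with exactly one analysis."""
    resolved = sum(1 for _, tags in result if len(tags) == 1)
    return resolved / len(result)


sentence = [
    ("the", ["DET"]),
    ("walks", ["NOUN-pl", "VERB-3sg"]),
    ("end", ["NOUN-sg", "VERB-inf"]),
]
result = apply_constraints(sentence, [no_verb_after_determiner])
print(result)
print(coverage(result))
```

Tokens that no rule can resolve stay ambiguous, which is why a rule set of this kind reports coverage rather than accuracy alone.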
After the MorphoSyntactic Analysis and Disambiguation