Verb Valency Frame Extraction Using Morphological and Syntactic Features of Croatian
Krešimir Šojat, Željko Agić, Marko Tadić
Department of Linguistics, Department of Information Sciences
Faculty of Humanities and Social Sciences, University of Zagreb
{ksojat, zagic, marko.tadic}@ffzg.hr
FASSBL 7 Conference, Dubrovnik, Croatia
2010-10-05
Overview
What?
  extraction and semi-automatic construction of verb valency frames
How?
  rule-based extraction procedure run on the Croatian dependency treebank
  manual assignment of tectogrammatical functors
  inference of rules for assigning functors to unseen text
Why?
  creation of treebank-based verb valency lexicon
  enhancement and enrichment of existing resources
Valency frames
valency frame extraction means detecting all possible environments of a particular verb as found in the treebank
this approach aims at fast construction of valency frames
extraction is automatic; no frame elements are added manually by human annotators
such an automatically acquired verb valency lexicon can serve as a basis for further enrichment and enhancement of manually constructed resources, whether existing or built from scratch
The treebank
Croatian Dependency Treebank (HOBS)
  follows the guidelines of the Prague Dependency Treebank
  texts taken from the Croatia Weekly 100 kw sub-corpus of the Croatian National Corpus (HNK)
    XCES-encoded up to the word level
    sentence-delimited, tokenized, manually lemmatized and MSD-tagged
    serves as the morphological layer of the treebank
  annotated on the syntactic layer
    approximately 2,700 sentences, 67,000 tokens
    manually assigned syntactic functions
    ca. 1,300 sentences double-checked and used in this experiment
The treebank
HR: Unija je već dogovorila neke mjere kako bi pomogla Hrvatskoj.
EN: The Union has already arranged some measures in order to help Croatia.
Extraction algorithm
the algorithm aims at extraction of verb valency frame instances for each verb in the treebank sample, it
Extraction algorithm
the first version retrieved predicates only and was then expanded to retrieve all the verbs in the treebank sample
the algorithm was adapted to retrieve any verb found in the dependency structure, regardless of its analytical function and position within the dependency tree
the adaptation raises the recall of the algorithm while maintaining its precision, since the simple set of descending rules is left unchanged
i.e. it retrieves as many verbs as possible given the limited size of the treebank sample used in the experiment
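The step from "predicates only" to "all verbs" described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Token` structure, the simplified tags, and the toy annotation of the example sentence are all assumptions made for this sketch and are not copied from the treebank. Selecting main verbs by the tag prefix `Vm` is likewise a choice made here; the slides do not say how auxiliaries are handled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    idx: int    # 1-based position in the sentence
    form: str   # word form
    lemma: str
    msd: str    # simplified morphosyntactic tag (assumption)
    head: int   # idx of the governing token, 0 = technical root
    afun: str   # analytical function


# Hand-made toy annotation for illustration; NOT copied from the treebank.
SENTENCE = [
    Token(1, "Unija", "unija", "Nc", 4, "Sb"),
    Token(2, "je", "biti", "Va", 4, "AuxV"),
    Token(3, "već", "već", "R", 4, "Adv"),
    Token(4, "dogovorila", "dogovoriti", "Vm", 0, "Pred"),
    Token(5, "neke", "neki", "P", 6, "Atr"),
    Token(6, "mjere", "mjera", "Nc", 4, "Obj"),
    Token(7, "kako", "kako", "C", 4, "AuxC"),
    Token(8, "bi", "biti", "Va", 9, "AuxV"),
    Token(9, "pomogla", "pomoći", "Vm", 7, "Adv"),
    Token(10, "Hrvatskoj", "Hrvatska", "Np", 9, "Obj"),
    Token(11, ".", ".", "Z", 0, "AuxK"),
]


def predicates_only(tokens):
    """First version: keep only tokens annotated as predicates."""
    return [t for t in tokens if t.afun == "Pred"]


def all_main_verbs(tokens):
    """Expanded version: keep every main verb, whatever its analytical function."""
    return [t for t in tokens if t.msd.startswith("Vm")]


print([t.form for t in predicates_only(SENTENCE)])  # ['dogovorila']
print([t.form for t in all_main_verbs(SENTENCE)])   # ['dogovorila', 'pomogla']
```

In this toy sentence the second verb is found only by the tag-based test, because its analytical function is not `Pred`; that is the recall gain the adaptation aims for.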
Extraction algorithm
the verb “imati” (“to have”, Vmn) is annotated as object (Obj)
Extraction algorithm
thus, the number of frames extracted from each sentence corresponds to the number of verbs in it:
  one frame for the main clause that captures the whole syntactic structure of the sentence
  frames extracted from dependent clauses
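One way to picture "one frame instance per verb" is to take, for every verb, the subtree that hangs below it: the subtree of the main verb then spans the whole sentence, and the subtrees of embedded verbs cover their dependent clauses. The sketch below assumes this reading; the `Token` structure, the simplified tags, and the toy annotation are hypothetical and not taken from the treebank or from the authors' code.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    idx: int    # 1-based position in the sentence
    form: str
    lemma: str
    msd: str    # simplified morphosyntactic tag (assumption)
    head: int   # idx of the governing token, 0 = technical root
    afun: str   # analytical function


# Hand-made toy annotation for illustration; NOT copied from the treebank.
SENTENCE = [
    Token(1, "Unija", "unija", "Nc", 4, "Sb"),
    Token(2, "je", "biti", "Va", 4, "AuxV"),
    Token(3, "već", "već", "R", 4, "Adv"),
    Token(4, "dogovorila", "dogovoriti", "Vm", 0, "Pred"),
    Token(5, "neke", "neki", "P", 6, "Atr"),
    Token(6, "mjere", "mjera", "Nc", 4, "Obj"),
    Token(7, "kako", "kako", "C", 4, "AuxC"),
    Token(8, "bi", "biti", "Va", 9, "AuxV"),
    Token(9, "pomogla", "pomoći", "Vm", 7, "Adv"),
    Token(10, "Hrvatskoj", "Hrvatska", "Np", 9, "Obj"),
    Token(11, ".", ".", "Z", 0, "AuxK"),
]


def children_index(tokens):
    """Map each head idx to the list of tokens it governs."""
    kids = defaultdict(list)
    for t in tokens:
        kids[t.head].append(t)
    return kids


def subtree(verb, kids):
    """All tokens below `verb` in the dependency tree, in surface order."""
    found, stack = [], list(kids[verb.idx])
    while stack:
        node = stack.pop()
        found.append(node)
        stack.extend(kids[node.idx])
    return sorted(found, key=lambda t: t.idx)


def extract_frames(tokens):
    """Return one (lemma, [(form, afun), ...]) pair per main verb."""
    kids = children_index(tokens)
    verbs = [t for t in tokens if t.msd.startswith("Vm")]
    return [(v.lemma, [(d.form, d.afun) for d in subtree(v, kids)]) for v in verbs]


for lemma, elements in extract_frames(SENTENCE):
    print(lemma, elements)
```

For the toy sentence this yields two frame instances, one per main verb: the instance for "dogovoriti" contains every other word of the sentence, and the instance for "pomoći" contains only the material of the dependent clause.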
Conclusions
in this experiment we have designed and implemented one possible approach
  to semi-automatic extraction of a valency frame lexicon for Croatian verbs
  to the refinement of existing lexicons by using the Croatian Dependency Treebank as an underlying resource
we have automatically extracted 2930 verb valency frame instances and annotated 936 frames, which yields two results:
  the distribution of valency frames for each of the encountered verbs
  the distribution of analytical functions and morphosyntactic tags for each of the tectogrammatical functors
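The two distributions named above amount to simple counting over annotated frame instances. The following sketch shows one possible way to compute them; the instance format, the verbs, and all counts are invented for illustration and do not reproduce the figures from the experiment.

```python
from collections import Counter, defaultdict

# Hypothetical annotated frame instances, invented for illustration:
# (verb lemma, [(functor, analytical function, MSD tag), ...])
INSTANCES = [
    ("dati", [("ACT", "Sb", "Ncfsn"), ("PAT", "Obj", "Ncfsa"), ("ADDR", "Obj", "Ncmsd")]),
    ("dati", [("ACT", "Sb", "Ncmsn"), ("PAT", "Obj", "Ncfsa")]),
    ("dati", [("ACT", "Sb", "Ncfsn"), ("PAT", "Obj", "Ncfsa")]),
    ("vidjeti", [("ACT", "Sb", "Ncfsn"), ("PAT", "Obj", "Ncmsa")]),
]


def frame_distribution(instances):
    """For each verb, count how often each functor sequence occurs."""
    dist = defaultdict(Counter)
    for lemma, elements in instances:
        signature = tuple(functor for functor, _, _ in elements)
        dist[lemma][signature] += 1
    return dist


def functor_distribution(instances):
    """For each functor, count the (analytical function, MSD) pairs realizing it."""
    dist = defaultdict(Counter)
    for _, elements in instances:
        for functor, afun, msd in elements:
            dist[functor][(afun, msd)] += 1
    return dist


frames = frame_distribution(INSTANCES)
functors = functor_distribution(INSTANCES)
print(frames["dati"].most_common())
print(functors["PAT"].most_common())
```

Keeping the per-verb and per-functor tables separate mirrors the two uses mentioned in the slides: the first table describes the verbs, the second describes how each functor is realized on the surface.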
Future work
the first result enables the enrichment of existing valency lexicons, such as CROVALLEX
the second result enables the implementation of a rule-based system for automatic assignment of tectogrammatical functors to morphosyntactically tagged and dependency-parsed unseen text
this procedure for automatic detection of valency frames will also be used in several other projects dealing with factored SMT (e.g. ACCURAT)
regarding dependency parsing of Croatian with the Croatian Dependency Treebank, we shall pursue several research directions in order to increase overall parsing accuracy
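As an illustration of the rule-based functor assignment mentioned above, the sketch below derives the simplest conceivable rules: for each observed pair of analytical function and MSD tag, assign the functor seen most often with it. This is only one naive possibility under stated assumptions, not the system the authors plan to build; the observations and the function names are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical observations, invented for illustration:
# (analytical function, MSD tag, manually assigned functor)
OBSERVED = [
    ("Sb", "Ncfsn", "ACT"),
    ("Sb", "Ncfsn", "ACT"),
    ("Obj", "Ncfsa", "PAT"),
    ("Obj", "Ncfsa", "PAT"),
    ("Obj", "Ncfsa", "EFF"),
    ("Obj", "Ncmsd", "ADDR"),
]


def infer_rules(observed):
    """Map each (afun, msd) pair to its most frequent functor."""
    counts = defaultdict(Counter)
    for afun, msd, functor in observed:
        counts[(afun, msd)][functor] += 1
    return {key: c.most_common(1)[0][0] for key, c in counts.items()}


def assign_functor(afun, msd, rules):
    """Return the functor predicted for an unseen token, or None if no rule applies."""
    return rules.get((afun, msd))


rules = infer_rules(OBSERVED)
print(assign_functor("Obj", "Ncfsa", rules))  # PAT
print(assign_functor("Adv", "R", rules))      # None
```

Returning `None` for unseen combinations keeps the sketch honest about coverage: with a small treebank sample, many combinations will have no rule at all.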
Thank you for your attention.
The research within the project ACCURAT leading to these results has received funding from the
European Union Seventh Framework Programme (FP7/2007-2013), grant agreement