metaDictionary – Towards a Generic e-Infrastructure for Detecting Variance in Language by Exploiting Dictionary Information
Dietmar Seipel and Werner Wegstein
University of Würzburg, Computer Science / Digital Humanities
ISGC 2011 – Taipei, 23.03.2011
1 Variance in Language and Genome
    The metaDictionary
    Network Analysis of Morpheme Decompositions
2 Annotating Digitized Print Dictionaries
    Annotation in TEI
The call sequence(*, form:[type:determiner]) generates a sequence of zero or more form elements.
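The "zero or more" behaviour of such a call can be illustrated with a small Python sketch. This is not the authors' DCG extension: the names `sequence_star`, `parse_determiner` and the `DETERMINERS` lexicon are hypothetical, and the loop is greedy, without the backtracking a real DCG would offer.

```python
import xml.etree.ElementTree as ET

# Hypothetical lexicon, assumed only for this illustration.
DETERMINERS = {"der", "die", "das"}

def parse_determiner(tokens):
    """Return (element, remaining tokens) if the first token is a determiner, else None."""
    if tokens and tokens[0] in DETERMINERS:
        elem = ET.Element("form", {"type": "determiner"})
        elem.text = tokens[0]
        return elem, tokens[1:]
    return None

def sequence_star(parse_item, tokens):
    """Apply parse_item zero or more times; return (elements, remaining tokens)."""
    elements = []
    while True:
        result = parse_item(tokens)
        if result is None:
            return elements, tokens
        elem, tokens = result
        elements.append(elem)

elements, rest = sequence_star(parse_determiner, ["der", "die", "Haus"])
xml_out = [ET.tostring(e, encoding="unicode") for e in elements]
```

If the input starts with no determiner, the result is an empty list of elements and the tokens are left untouched, which is the "zero" case.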
Grammar-Based Parsing
Techniques from Computer Science
Grammars
- higher precision compared to regular expressions and statistical parsers
- we use a DCG (definite clause grammar) extension, which is even more compact and directly generates XML

XML is a common data format for modelling, managing, and exchanging semi-structured data.

There exist powerful query, transformation and update languages for XML.
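As a small illustration of querying such XML, the following Python sketch uses the limited XPath support of the standard library. The entry is a made-up example that borrows TEI dictionary element names (entry, form, orth, gramGrp, pos); it is not taken from the dictionaries discussed in the talk.

```python
import xml.etree.ElementTree as ET

# Hypothetical dictionary entry, assumed only for this illustration.
entry_xml = """
<entry>
  <form type="lemma"><orth>Haus</orth></form>
  <form type="determiner"><orth>das</orth></form>
  <gramGrp><pos>noun</pos></gramGrp>
</entry>
""".strip()

entry = ET.fromstring(entry_xml)

# Select the orthographic forms of all form elements typed as determiner.
determiners = [o.text for o in entry.findall("./form[@type='determiner']/orth")]

# Read the part of speech from the grammatical group.
pos = entry.findtext("./gramGrp/pos")
```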
Advantages
- compact, rapidly programmable
- clear, less error-prone
- flexibly extensible
Annotating Morpheme Decompositions
... based on the Whole Word Morphology
- extension by alignment methods
- morpheme decomposition:
morpheme term: ((craft + s) + man) + ship
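One possible way to hold such a morpheme term in a program is as a nested binary structure. The Python sketch below is an assumption made for illustration (nested pairs with the hypothetical helpers `morphemes` and `show`), not the representation used by the authors.

```python
# A term is either a morpheme (str) or a pair (left, right) standing for left + right.
term = ((("craft", "s"), "man"), "ship")

def morphemes(t):
    """List the morphemes of a term from left to right."""
    if isinstance(t, str):
        return [t]
    left, right = t
    return morphemes(left) + morphemes(right)

def show(t, top=True):
    """Render a term in the slide's notation; only nested compositions get parentheses."""
    if isinstance(t, str):
        return t
    left, right = t
    inner = f"{show(left, False)} + {show(right, False)}"
    return inner if top else f"({inner})"

word = "".join(morphemes(term))
rendered = show(term)
```

The nesting records the order of composition: craft and s are combined first, then man is attached, then ship.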
System Architecture
For decomposing and annotating the large number of entries of a dictionary (which can exceed 100,000), one needs
- linguistic knowledge and
- suitable tools from computer science:
  - morpheme decomposer,
  - suitable, compact knowledge representation,
  - inference methods,
  - graphical user interface.

Fine-grained annotated dictionaries are the basis for the decomposition.
System Architecture
[Diagram: system architecture. Component labels: OWL, Term Notation, Annotation Rules, Morpheme Analyses, Visualisation, Protégé, Morfessor]
Annotation Rules
With the annotation rule (in logic)

    has_word_class(X, noun) :-
        mc(X, A, B),
        has_word_class(A, noun),
        has_text_form(B, [ship, ...]).

the partially annotated term

    ((craft*bm + s*ge) + man)*noun + ship

can be further annotated to

    (((craft*bm + s*ge) + man)*noun + ship)*noun
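The effect of this rule can be mimicked in a short Python sketch. It is a rough analogue of the logic rule, not the authors' implementation, and it rests on assumptions: mc(X, A, B) is read as "X is composed of A and B", has_text_form(B, ...) is read as "the text of B is in the given list", and only ship is used because the rest of that list is elided on the slide. The class names `Comp` and `Ann` and all helper functions are hypothetical.

```python
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class Comp:
    """Composition A + B (assumed reading of mc(X, A, B))."""
    left: "Term"
    right: "Term"

@dataclass(frozen=True)
class Ann:
    """Annotated term T*label, e.g. craft*bm."""
    term: "Term"
    label: str

Term = Union[str, Comp, Ann]

# Only "ship" is known from the slide; the remaining list entries are elided there.
NOUN_SUFFIXES = {"ship"}

def text_form(t):
    """Concatenate the morphemes of a term, ignoring annotations."""
    if isinstance(t, str):
        return t
    if isinstance(t, Ann):
        return text_form(t.term)
    return text_form(t.left) + text_form(t.right)

def annotate_noun(x):
    """If x = A + B, A is annotated noun and B is a listed suffix, return x*noun."""
    if (isinstance(x, Comp) and isinstance(x.left, Ann)
            and x.left.label == "noun" and text_form(x.right) in NOUN_SUFFIXES):
        return Ann(x, "noun")
    return x

def show(t, top=True):
    """Render a term in the slide's notation."""
    if isinstance(t, str):
        return t
    if isinstance(t, Ann):
        return f"{show(t.term, False)}*{t.label}"
    inner = f"{show(t.left, False)} + {show(t.right, False)}"
    return inner if top else f"({inner})"

partial = Comp(Ann(Comp(Comp(Ann("craft", "bm"), Ann("s", "ge")), "man"), "noun"), "ship")
before = show(partial)
after = show(annotate_noun(partial))
```

Unlike the logic rule, this sketch only checks an annotation that is already present on A. A rule engine would also derive the word class of A by applying further rules.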
The Morpheme Annotation Tool
Conclusions
The metaDictionary forms the core part of a generic e-infrastructure:
- derived from analysis of a network of dictionaries
- annotated morpheme decompositions yield a more precise alignment for the metaDictionary

The next step will be to test the data using text corpora:
- basic morphemes
- combinations of basic morphemes

Culturomics (Michel et al., Science 2011): 52% of the English lexicon – the majority of the words used in English books – consists of lexical dark matter undocumented in standard references.