Prim(j)ena MULTEXT-East standarda i normi TEI u izradi paralelnih korpusa
Applikation des MULTEXT-East und der TEI-Normen bei der Erstellung von Parallelkorpora
Application of MULTEXT-East and TEI in the compilation of parallel corpora
Tomaž Erjavec
Department of Knowledge Technologies
Jožef Stefan Institute, Ljubljana
[email protected], http://nl.ijs.si/et/
Why standards (for digital language resources)?
public documentation (+ software)
(semi)automated validation
application independent
platform independent
do not become obsolescent (as fast)
However:
– demand time to understand and use them
– there are (too) many and not all are accepted
– they are not perfectly tuned to application (overhead)
BKS symposium, April 2007
TEI: the Text Encoding Initiative
TEI Guidelines are a vocabulary to describe text for scholarly purposes
They consist of:
– XML schemas
– documentation
P3 (1994), P4 (2002), P5 (0.9, 2007)
being developed by the TEI Consortium
large user base, web site, mailing list, tutorials, yearly meetings
increasingly popular for digital libraries, text-critical editions, …, to a certain extent for corpora
zabijene u nedra da izbegne ljuti vetar, hitro zamače u staklenu kapiju stambene zgrade <hi rend="it">Pobeda</hi>, no nedovoljno hitro da bi sprećio jednu spiralu oštre prašine da uđe zajedno s njim.</s>
</p>
…
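The structural markup visible in the excerpt above (<p> for paragraphs, <s> for sentences, <hi rend="it"> for italics) can be read with any standard XML parser. The following is a minimal sketch using Python's standard library. The sample fragment, its English wording, the `body` wrapper and the `id` values are assumptions made for illustration and are not copied from the corpus.

```python
import xml.etree.ElementTree as ET

# Hypothetical TEI-style fragment (not taken from the corpus):
# <p> = paragraph, <s> = sentence, <hi rend="it"> = italic highlighting.
SAMPLE = """
<body>
  <p id="p1">
    <s id="p1.s1">He entered the building <hi rend="it">Pobeda</hi> quickly.</s>
    <s id="p1.s2">The wind followed him.</s>
  </p>
</body>
"""


def sentences(xml_text):
    """Return (sentence id, plain text) pairs, flattening inline markup."""
    root = ET.fromstring(xml_text.strip())
    result = []
    for s in root.iter("s"):
        # itertext() also yields the text inside nested elements such as <hi>
        text = " ".join("".join(s.itertext()).split())
        result.append((s.get("id"), text))
    return result


def highlighted(xml_text, rend="it"):
    """Return the text of all <hi> elements with the given rend value."""
    root = ET.fromstring(xml_text.strip())
    return [hi.text for hi in root.iter("hi") if hi.get("rend") == rend]


if __name__ == "__main__":
    for sid, text in sentences(SAMPLE):
        print(sid, text)
    print(highlighted(SAMPLE))
```

Because the sentence text is collected with `itertext()`, inline elements such as <hi> do not break a sentence apart, while the highlighting itself remains queryable.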
MULTEXT-East
MULTEXT-East: EU Project (1995-1997), Multilingual Texts and Corpora for Eastern and Central European Languages
Based on the results of EU MULTEXT (~West)
To produce a harmonised BLARK for six
History of MULTEXT-East resources
First release 1998 on CD-ROM:
already extended with new languages
Resources since 1998 available on the Web:
Slavic:
– Russian (East Slavic)
– Czech (West Slavic)
– Slovene (South West Slavic)
– Resian (Slovene dialect)
– Croatian (South West Slavic) -- Marko Tadić
– Serbian (South West Slavic) -- C. Krstev, D. Vitas
– Bulgarian (South East Slavic)
In progress:
– Macedonian
– Persian
The MULTEXT morphosyntactic trinity
1. MULTEXT-East morphosyntactic specifications
Based on EAGLES / MULTEXT
Define PoS, their attributes and values
The specs are a document containing:
– introduction
– common tables
– language-particular sections
Written in LaTeX, converted to PDF & HTML
Derived XML/TEI encoding as feature structures
In Version 4, the specifications are to be fully in TEI/XML
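Since the specifications define parts of speech together with their attributes and values, a compact tag can be expanded mechanically into a feature structure. The sketch below assumes a positional tag in which the first character gives the category, each following position encodes one attribute, and "-" marks an attribute that does not apply. The small noun table, `SPECS` and `decode_msd` are illustrative names and a simplified subset, not the actual specification tables.

```python
# Illustrative subset of a positional tagset in the style of the
# morphosyntactic specifications; the attribute inventory is simplified
# and must be checked against the real tables before any serious use.
NOUN_ATTRIBUTES = [
    ("Type", {"c": "common", "p": "proper"}),
    ("Gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("Number", {"s": "singular", "d": "dual", "p": "plural"}),
    ("Case", {"n": "nominative", "g": "genitive", "d": "dative",
              "a": "accusative", "l": "locative", "i": "instrumental"}),
]

# Category code -> (category name, ordered attribute definitions)
SPECS = {"N": ("Noun", NOUN_ATTRIBUTES)}


def decode_msd(msd):
    """Expand a positional tag such as 'Ncmsn' into a feature dictionary."""
    if not msd or msd[0] not in SPECS:
        raise ValueError(f"unknown category in tag: {msd!r}")
    category, attributes = SPECS[msd[0]]
    codes = msd[1:]
    if len(codes) > len(attributes):
        raise ValueError(f"too many positions in tag: {msd!r}")
    features = {"PoS": category}
    for (name, values), code in zip(attributes, codes):
        if code == "-":
            # attribute not applicable or left unspecified
            continue
        if code not in values:
            raise ValueError(f"invalid value {code!r} for {name} in {msd!r}")
        features[name] = values[code]
    return features


if __name__ == "__main__":
    print(decode_msd("Ncmsn"))
```

Keeping the tables as data rather than code mirrors the idea of the common and language-particular tables: supporting another category or language means adding a table, not changing the decoder.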
Example common table
Example language specific table
~ all word-forms of ca. 15,000 lemmas
Lexical entry is composed of three fields:
– the word-form: the inflected form of the word
– the lemma: the base-form of the word
– the morphosyntactic description (MSD)
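A three-field entry of this kind is straightforward to load into a look-up table. The sketch below assumes one entry per line with tab-separated fields, and "=" standing for a lemma that is identical to the word-form. Both conventions, the Slovene sample entries and the names `parse_entry` and `load_lexicon` are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical sample entries: word-form <TAB> lemma <TAB> MSD.
# "=" is assumed to mean "lemma is the same as the word-form".
SAMPLE_LEXICON = (
    "miza\t=\tNcfsn\n"
    "mize\tmiza\tNcfsg\n"
    "mize\tmiza\tNcfpn\n"
)


def parse_entry(line):
    """Split one lexicon line into (word-form, lemma, MSD)."""
    wordform, lemma, msd = line.rstrip("\n").split("\t")
    if lemma == "=":
        lemma = wordform
    return wordform, lemma, msd


def load_lexicon(text):
    """Map each word-form to the list of its (lemma, MSD) readings."""
    index = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        wordform, lemma, msd = parse_entry(line)
        index[wordform].append((lemma, msd))
    return dict(index)


if __name__ == "__main__":
    lexicon = load_lexicon(SAMPLE_LEXICON)
    # An ambiguous word-form has more than one reading
    print(lexicon["mize"])
```

Indexing by word-form makes morphological ambiguity explicit: a form with several readings is exactly the case a tagger trained on such a resource has to resolve.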
3. The “1984” corpus
Languages: En, Ro, Sl, Cs, Et, Hu, Sr, (Bg, Ru, (Mk, Hr, Tr, …))
Structurally annotated
Sentence aligned with English
Words annotated with lemma and MSD
Encoded in TEI P4 (XML)
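Sentence alignment with English can be thought of as a list of links, each pairing zero or more English sentence identifiers with zero or more identifiers in the translation. The sketch below assumes a simple "left ; right" notation with space-separated identifiers. The notation, the identifier scheme and the names `parse_link` and `alignment_profile` are hypothetical and chosen only to illustrate the idea.

```python
from collections import Counter

# Hypothetical alignment links: English sentence ids to the left of ";",
# ids of the aligned translation sentences to the right.
LINKS = [
    "Oen.1.1.1.1 ; Osl.1.1.1.1",              # one sentence to one
    "Oen.1.1.1.2 ; Osl.1.1.1.2 Osl.1.1.1.3",  # one sentence split in two
    "Oen.1.1.2.1 ; ",                         # sentence left untranslated
]


def parse_link(link):
    """Return (source ids, target ids) for one alignment link."""
    left, right = link.split(";")
    return left.split(), right.split()


def alignment_profile(links):
    """Count how often each alignment type (1-1, 1-2, ...) occurs."""
    profile = Counter()
    for link in links:
        source, target = parse_link(link)
        profile[f"{len(source)}-{len(target)}"] += 1
    return dict(profile)


if __name__ == "__main__":
    print(alignment_profile(LINKS))
```

A profile of this kind is a quick sanity check on a parallel corpus: a large share of alignments other than 1-1 often points to segmentation differences between the two texts.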
Utility of MULTEXT-East LRs
Specifications became, for some, the “national” standard
Training/testing dataset for HLT development: PoS taggers, lemmatizers, lexicon extractors, ILP
A base dataset for further annotation and experiments:
– Word-sense disambiguation
– WordNet development and evaluation
– Syntactic parser induction
Teaching aid in HLT courses
~ 100 registered users
As a BLARK “best practice” for new languages:
Corpora using TEI+MULTEXT-East
Reference corpus of Slovene: FIDA (100Mw), FIDA+ (600Mw) (+ other Sl. corpora)
Croatian National Corpus: HNK (100Mw)
Various Romanian corpora, …
En-Sl parallel annotated corpus: SVEZ-IJS (10Mw)
Conclusions
TEI provides a rich and flexible infrastructure to encode parallel corpora: meta-data, corpus and document structure, alignment, linguistic analysis
MULTEXT-East provides a harmonised and common infrastructure for word-level morphosyntactic descriptions
Both have already been used for a number of corpora