Corpus-building projects in Institute of linguistics, Philosophical Faculty, University of Zagreb
Marko Tadić ([email protected], www.hnk.ffzg.hr/mt)
Philosophical Faculty, University of Zagreb, Croatia (www.ffzg.hr/zzl/zzl-home.htm)
Tübingen, 1999-12-01
Lecture plan
Institute of linguistics
Croatian frequency dictionary — Corpus of Croatian literary language (Moguš corpus)
Croatian national corpus (HNK)
Croatian participation in ELAN
Croatian-English parallel corpus
Croatian-Slovenian parallel corpus
Institute of linguistics, Philosophical Faculty, Univ. of Zagreb
founded 1960.
concentrating point for linguistic projects at the Philosophical Faculty
first usage of computers in Croatian language research:
– Željko Bujas (1967) Concordance of Gundulić’s Osman, Austin, TX
– Rudolf Filipović (1968): contrastive projects
– Milan Moguš (1970) Computational analysis of older Croatian literature texts
– Željko Bujas (1972-1975) English-Croatian lexicographical corpus
– Milan Moguš (1976-1996) Corpus of the contemporary Croatian literary language — one-million corpus (1M)
Croatian Frequency Dictionary 1
Milan Moguš, Maja Bratanić, Marko Tadić: Hrvatski čestotni rječnik, Školska knjiga – Zavod za lingvistiku Filozofskoga fakulteta Sveučilišta u Zagrebu, Zagreb 1999.
compiled on the basis of the one-million corpus of the Croatian literary language (1M)
collecting of the 1M corpus started in 1976
– appropriate size for that time (Brown corpus 1967)
– in 1976 the first one-million corpus of any Slavic language?
1M corpus today
– inadequate for serious lexicographic work
– useful for studies contrasting pre-1990 and post-1990 Croatian lexica
Croatian Frequency Dictionary 2
1M corpus structure:
– 5 subcorpora, roughly 200.000 tokens each
Croatian Frequency Dictionary 3
1M corpus facts:
– time-span: 1935-1978
– size:
• started with 1.007.748 tokens
• after removal of foreign-language elements: 994.049 tokens
• after removal of proper names: 952.327 tokens
– sample size for drama, prose, poetry
• 10 kW chosen from among 5 kW, 10 kW and 20 kW by testing the type-gain
• resulted in better dispersion: 20 different authors instead of 10
– sample size for newspapers
• whole daily issue
• 1 national and 4 regional daily newspapers
• 4 different months in 1975
– sample size for schoolbooks
• all subjects from the graduating class of secondary schools in Croatia in academic year 1977/78
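The slides do not say how the type-gain was tested. One plausible reading is to count how many new word types each increase in sample size adds. The sketch below assumes that reading; the function name and the toy token stream are invented for illustration.

```python
def type_gain(tokens, sizes):
    """For each sample size, count distinct types in the first `size` tokens
    and the gain over the previous (smaller) sample size."""
    rows = []
    previous_types = 0
    for size in sorted(sizes):
        types = len(set(tokens[:size]))
        rows.append((size, types, types - previous_types))
        previous_types = types
    return rows


# Toy token stream (invented); a real test would compare 5, 10 and 20 kW samples.
toy_tokens = "a b a c b d a e a b c a".split()
toy_rows = type_gain(toy_tokens, [4, 8, 12])
for size, types, gain in toy_rows:
    print(f"{size:>3} tokens: {types} types (+{gain})")
```

When the gain flattens out, a larger sample from the same author adds few new types, which would favour more authors with shorter samples.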
Croatian Frequency Dictionary 4
1M corpus processing
– primary processing
• corpus marking: pre-SGML time, proprietary
• word-lists (alphabetical and frequency)
• concordances
– lemmatization
• not automatic but semi-manual
– 1M corpus as a database
• retrievable by lemma
• gives a short concordance co-text
• not yet accessible via WWW
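The primary processing steps (word-lists and concordances) can be illustrated with a minimal sketch. It is not the project's proprietary software: the tokenizer, the function names and the sample sentence are assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def word_lists(tokens):
    """Return (alphabetical, by_frequency) lists of (word, count) pairs."""
    counts = Counter(tokens)
    # Plain code-point order; proper Croatian collation would need a locale-aware key.
    alphabetical = sorted(counts.items())
    by_frequency = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return alphabetical, by_frequency


def concordance(tokens, keyword, width=2):
    """Return (left, keyword, right) triples with `width` tokens of co-text."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


# Invented sample sentence, used only to exercise the functions.
sample_tokens = tokenize("Ruka ruku mije, a ruka je ruka.")
alphabetical, by_frequency = word_lists(sample_tokens)
kwic = concordance(sample_tokens, "ruka")
for left, key, right in kwic:
    print(f"{left:>10} [{key}] {right}")
```

The sketch works on word forms only; retrieval by lemma, as in the 1M database, would additionally need the lemmatized corpus.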
Croatian Frequency Dictionary 5
working screen of the lemmatization program
Croatian Frequency Dictionary 6
1M corpus after lemmatization
Croatian Frequency Dictionary 7
a 20-year “dinosaur” project
– survived several generations of hardware
– multiple conversions from different data-formats
– war conditions
– etc.
dictionary size: 38.573 headwords
headwords = lemmas
3 parts
– frequency dictionary of lemmas
– alphabetical dictionary of lemmas
– alphabetical dictionary of lemmas with tokens and their frequencies
Croatian Frequency Dictionary 8
frequency dictionary: structure of lexicographic article
lemma   POS  meaning   standard  rank  absolute frq  relative frq  subcorpora
ruka    f                        55    1599          0.1597        DNPSU
već     adv                      56    1559          0.1557        DNPSU
zemlja  f                        57    1512          0.1510        DNPSU
njegov  pro                      58    1477          0.1475        DNPSU
hiža    f              *         540   30            0.0030        D-P--
kp      abb  kilopond            541   29            0.0029        ----U
lukav   adj                      541   29            0.0029        D-PS-
alphabetic dictionary with tokens: structure of lexicographic article
lemma   POS  meaning   standard  rank  absolute frq  subcorpora
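A lexicographic article of the frequency dictionary can be modelled as a small record. The sketch assumes that the subcorpora field carries one letter for each of the five subcorpora in which the lemma occurs and "-" where it does not; the slides do not expand the letters, so the code does not either. The class and method names are illustrative.

```python
from dataclasses import dataclass


@dataclass
class FrequencyEntry:
    lemma: str
    pos: str
    rank: int
    absolute: int      # absolute frequency
    relative: float    # relative frequency as printed in the dictionary
    subcorpora: str    # five flags, e.g. "D-P--" (assumed: "-" means absent)

    def dispersion(self) -> int:
        """Number of subcorpora in which the lemma is attested."""
        return sum(1 for flag in self.subcorpora if flag != "-")


# Three rows copied from the example article above.
entries = [
    FrequencyEntry("ruka", "f", 55, 1599, 0.1597, "DNPSU"),
    FrequencyEntry("hiža", "f", 540, 30, 0.0030, "D-P--"),
    FrequencyEntry("kp", "abb", 541, 29, 0.0029, "----U"),
]
for entry in entries:
    print(entry.lemma, entry.dispersion())
```

Under this reading, a lemma such as kp, attested in a single subcorpus, is easy to separate from evenly dispersed lemmas such as ruka.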
HNK 5: corpus on www
http://www.hnk.ffzg.hr
Testing V 1.0: 1998-12-05
– 30m: 3 mW
– HETA:
• Collected works by Ivan Gundulić
• all works written in Croatian by Marko Marulić
Testing V 1.1: 1999-02-14 & 1999-07-20
– 30m: 7,67 mW
– HETA: 2,9 mW from CD-ROM: Classics of Croatian literature, Naklada Bulaja, Zagreb, 1999
now only the testing V 1.1 (approx. 10 mW) of the corpus is accessible via WWW
– text format: quasi HTML, no XML
– no POS marking and retrieval
text collecting and marking going on, as well as software testing
HNK 6: hardware/software
platform
– NT instead of UNIX
– all software (commercial, shareware, custom-made) runs on Windows
input text formats
– WWW: HTML, XML
– DTP: RTF, DOC, QXD, WP, TXT etc.
conversion
– 2XML: custom-made software
• input: HTML, RTF
• output: XML, no header
• two-step conversion by user-defined scripts
• enables high level of automation
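The slides do not describe what the two steps of 2XML do. The sketch below only illustrates the general idea of a script-driven, two-step conversion: each "script" is an ordered list of search-and-replace rules, step 1 strips layout markup and step 2 maps the remaining HTML elements to corpus elements. All rules and names are hypothetical, and regex-based conversion is fragile on real-world HTML.

```python
import re

# Step 1 (hypothetical script): remove purely typographic HTML markup.
STEP_1 = [
    (r"(?i)<br\s*/?>", " "),
    (r"(?i)</?(font|b|i)\b[^>]*>", ""),
]

# Step 2 (hypothetical script): map structural HTML elements to corpus elements.
STEP_2 = [
    (r"(?i)<h1[^>]*>(.*?)</h1>", r"<HEAD>\1</HEAD>"),
    (r"(?i)<p[^>]*>(.*?)</p>", r"<P>\1</P>"),
]


def apply_script(text, script):
    """Apply the (pattern, replacement) rules of one script in order."""
    for pattern, replacement in script:
        text = re.sub(pattern, replacement, text, flags=re.DOTALL)
    return text


def html_to_xml(html):
    """Run the two scripts one after the other."""
    return apply_script(apply_script(html, STEP_1), STEP_2)


converted = html_to_xml("<h1>Title</h1><p>First <b>bold</b> line.<br>Second line.</p>")
print(converted)
```

Keeping the rules in data rather than in code is what would let a user define new scripts for each text source without touching the converter itself.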
HNK 7: corpus format
corpus format
– XML, we are testing the TEIXML dtd, Nancy (1999-07)
<BODY>
<DIV0 type="article">
<HEAD type="nn">U GORICI SVETOJANSKOJ ODRŽAN 12. FESTIVAL PJEVAČA AMATERA</HEAD>
<HEAD type="na">Ivana osvojila županijski Sanremo</HEAD>
<HEAD type="pn">* Od 20 natjecatelja žiri je najboljom proglasio Ivanu Erdeljac s pjesmom "Crazy", druga je Antonija Mikita s pjesmom "To", a treće je mjesto osvojila Ksenija Cvetetić</HEAD>
<FIGURE>Publici su se najviše svidjeli Marija Šalić i Petar Puhijera</FIGURE>
<P>Pod medijskim pokroviteljstvom "Večernjeg lista" i Radio Jaske, a uz pomoć DIR "Rubinić" kao generalnog te još sedamdesetak drugih sponzora, u petak i u subotu u Gorici Svetojanskoj pokraj Jastrebarskog održan je 12. festival pjevača amatera.</P>
<P>Prve festivalske večeri, na kojoj su nastupila 22 izvođača do 15 godina, prvu nagradu stručnog žirija odnijela je Petra Batelja iz Rastoka pokraj Jaske za pjesmu "To malo ljubavi". Druga nagrada pripala je Nikolini Oslaković iz Gornje Reke za pjesmu "Neka mi ne svane", a treća Mariji Jurini iz Desinca za pjesmu "Ginem". Publika je najboljom ocijenila svetojansku grupu "Mrvice" s pjesmom "Mrvica", dok je drugu nagradu dodijelila Natali Rajnović iz Jaske za pjesmu "Don"t ever cry", a treću Aniti Oslaković iz Desinca za pjesmu "Malo fali". Za najboljeg debitanta prve večeri proglašena je Irena Kišan iz Zdenčine s pjesmom "Izdali me".</P>
<P>Druga večer - s dvadeset starijih izvođača iz Jaske, Karlovca, Bjelovara, Zagreba i Velike Gorice - bila je osobito napeta, jer je za razliku od lani ponudila vrlo kvalitetne izvođače i interpretacije pa nije bilo lako odabrati najbolje.</P>
<P>Nakon poduže stanke tijekom koje su izbrojani glasovi - a koju su publici kratili gost večeri Ivo Pattiera te sastav "Santa Anna" i solistica Goga Čopić - proglašeni su ovogodišnji pobjednici. Prema ocjeni stručnog žirija, prvu nagradu i zlatnu plaketu "Večernjaka" dobila je Karlovčanka Ivana Erdeljac za vrlo dobro otpjevanu pjesmu "Crazy". Druga nagrada pripala je Antoniji Mikiti iz Velike Gorice za pjesmu "To", a treća Kseniji Cvetetić iz Petrovine za pjesmu "Neka mi ne svane".</P>
<P>Publika je najviše glasova dodijelila svetojansko-zagrebačkom duetu Mariji Šalić i Petru Puhijeri za interpretaciju pjesme "Ima li nade za nas", pa je i njima pripala "Večernjakova" zlatna plaketa. Na drugo mjesto publika je svrstala "Svetojanske tamburaše" koji su nastupili s pjesmom "Dobro jutro", a na treće Zagrepčanku Marijanu Parilac i pjesmu "Idi i ne budi ljude".</P>
<P>Najboljom debitanticom završne večeri proglašena je Zagrepčanka Marina Posilović s pjesmom "Piši, piši mi", a nagradu za najbolji scenski nastup dobio je sastav iz Petrovine "Prigorje de lajt" s pjesmom "Oj suseda, suseda". Čini se da su ovogodišnje nagrade - a bilo ih je doista mnogo, od sedmodnevnog boravka u Opatiji, umjetničke slike, bicikla i kazetofona do satova i poklon-bonova - završile u pravim rukama. Oni koji ih nisu dobili, a možda su ih također zaslužili, neka se ovaj put utješe pljeskom publike, a dogodine će imati novu priliku. Jer, tradicija Svetojanskog festivala - svojevrsnog Sanrema zagrebačke županije - nastavlja se.</P>
</DIV0>
</BODY>
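A fragment in this format can be read with a standard XML parser. The sketch uses Python's xml.etree.ElementTree on a shortened copy of the article above: the two headlines are copied from the sample, the paragraph texts are placeholders, and the closing tags are added so that the fragment is well-formed. The meaning of the HEAD type values is not given on the slide, so the code treats them as opaque keys.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<BODY>
<DIV0 type="article">
<HEAD type="nn">U GORICI SVETOJANSKOJ ODRŽAN 12. FESTIVAL PJEVAČA AMATERA</HEAD>
<HEAD type="na">Ivana osvojila županijski Sanremo</HEAD>
<P>First paragraph (placeholder).</P>
<P>Second paragraph (placeholder).</P>
</DIV0>
</BODY>"""


def read_articles(xml_text):
    """Collect headlines (keyed by their type attribute) and paragraphs per DIV0."""
    body = ET.fromstring(xml_text)
    articles = []
    for div in body.iter("DIV0"):
        articles.append({
            "type": div.get("type"),
            "heads": {head.get("type"): head.text for head in div.findall("HEAD")},
            "paragraphs": [p.text for p in div.findall("P")],
        })
    return articles


articles = read_articles(SAMPLE)
print(articles[0]["heads"]["na"], len(articles[0]["paragraphs"]))
```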
30m: 2000
– finalization of 30m corpus text-collecting
– XML tagging completed
– WWW retrieval front-end finished
– copyright issues cleared
HETA: 2000
– continuous filling with whole series of periodicals
• Narodne novine (approx. 10 mW)
• Večernji list (approx. 9 mW)
• Vjesnik (approx. 6 mW)
• Vijenac (approx. 10 mW)
• etc.
HNK 11: future steps
size of reference corpora today: 100 mW or planned for 100 mW
– BNC (British National Corpus)
– CNC (Czech National Corpus)
– PNC (Polish National Corpus, Łódź; text-typological copy of BNC)
– FIDA (Slovene corpus)
– HNC (Hungarian National Corpus)
=> after or in 2000: upsizing of 30m to 100m
after or in 2000: morphosyntactic descriptions (MSD)
– POS tagging
– lemmatization (MULTEXT-East tagset)
after 2001: syntactic description (parsing)
– tree-bank
after 2001: spoken language subcorpus (if possible)
HNK 12: www demo
home-page
aims page
30m & HETA corpus pages
30m corpus retrieval page
30m sources list
30m concordance (tajk%)
30m concordance wider co-text
HETA, Classics retrieval page
HETA, Classics source list
Croatian participation in ELAN project
ELAN (European Language Activity Network)
project jointly coordinated by PAROLE and TELRI associations
duration: 1998-01-01 until 1999-12-31
Institute of linguistics as an institutional member of TELRI association invited to participate in language resources collecting
project also got support from Croatian Ministry of Science and Technology (MZT 130729)
HR-ELAN 1
Croatian participation consisted of a 2 mW Croatian monolingual corpus
part of HNK: Večernji list, March-April 1999
test-bed for text-conversion and corpus encoding processes
– step 1. HTML to XML
– step 2. XML to SGML (!?)
final format: SGML, CES compatible
Croatian-English parallel corpus
multilingual language research
– lexicography
– contrastive linguistics
– MT
– ...
parallel corpora are of essential importance
today: role of English as lingua communis
– common language pairs:
• en : Lx
• Lx : en
in October we started a not-(yet)-financed Croatian-English parallel corpus
HR-EN parallel corpora 1
1st English-Croatian pairing
– Rudolf Filipović, 1968-1971
– Yugoslav Serbo-Croatian–English Contrastive Project (the only possible name under communist authorities)
– Brown corpus cut in half (505.822 tokens)
– preserving original genre balance
– morphosyntactically marked
– translated
– concordance with morphosyntactic categories as keywords
– bilingual sentence database
1st usage of computers in contrastive linguistics ever?
tapes with data are still archived in the Institute of linguistics in Zagreb, but there is no computer system left which could read them
project publications: Contrastive Studies, New Contrastive Studies, Chapters in Contrastive Linguistics
HR-EN parallel corpora 2
2nd Croatian-English pair
– translation of Plato’s Republic, TELRI CD-ROM, 1998, ISBN 3-922641-46-6
– hr-en not the only pair: collection of 22 languages
– rather small: 130 kW
– properly aligned
HR-EN parallel corpora 3
3rd hr-en parallel corpus
– aims to test:
• text conversion procedures
• corpus organization
• alignment and encoding
that will be used later in parallel corpora projects such as the Croatian-Slovene parallel corpus
HR-EN parallel corpus 5: corpus collecting 2
source: Croatia Weekly
– publisher: Croatian Institute for Culture and Information (HIKZ)
– started January 1998
– like USA Today: different domains
• politics, economy and finance, tourism, ecology, culture, art, events, sports
– 12 pages, A3
– prepared in Croatian, then translated by a professional translation office
availability
– No. 94 is being prepared now
– access to all texts in electronic form in both languages except for first 5 issues
HR-EN parallel corpus 6: corpus collecting 3
size
– average issue: 15.000 tokens hr, 18.000 tokens en
– approx.: 1.300.000 tokens hr, 1.600.000 tokens en
“methodological disturbance”: should we use it?
– the biggest weekly newspaper Nacional
• important source of hr-texts for Croatian National Corpus
• started with English translations of approx. 10% of Croatian issue on their Web-page
– Ministry of science and technology of the Republic of Croatia
• description of all finished scientific projects
• on WWW
• Croatian and English
HR-EN parallel corpus 7: corpus collecting 4
Sentence marking
– script in Search&Replace shareware by Funduc SW
– </S><S> inserted after punctuation followed by a capital letter
– filtered for known exceptions: Mr., Mrs., Miss., dr., St. etc.
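The sentence-marking rule can be sketched in a few lines. This is not the original Search&Replace script, only an approximation of the stated rule: a boundary is placed after ".", "!" or "?" when whitespace and a capital letter follow, unless the word before the boundary is one of the listed exceptions. Which punctuation marks and capital letters the real script covered is an assumption.

```python
import re

# Abbreviations after which a full stop does not end a sentence (list from the slide).
EXCEPTIONS = {"Mr.", "Mrs.", "Miss.", "dr.", "St."}

# Sentence-final punctuation, whitespace, then a capital letter (incl. Croatian ones).
BOUNDARY = re.compile(r"([.!?])\s+(?=[A-ZČĆĐŠŽ])")


def mark_sentences(paragraph):
    """Wrap each detected sentence of a paragraph in <S>...</S>."""
    sentences = []
    start = 0
    for match in BOUNDARY.finditer(paragraph):
        end = match.end(1)  # position just after the punctuation mark
        last_word = paragraph[start:end].split()[-1]
        if last_word in EXCEPTIONS:
            continue  # known abbreviation, not a sentence end
        sentences.append(paragraph[start:end])
        start = match.end()  # skip the whitespace between sentences
    sentences.append(paragraph[start:])
    return "<S>" + "</S><S>".join(sentences) + "</S>"


marked = mark_sentences("Mr. Smith arrived. He met dr. Horvat. They talked.")
print(marked)
```

The exception list has to grow with the material; any abbreviation missing from it produces a false boundary that must then be corrected by hand.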
HR-EN parallel corpus 8: alignment
testing stage
demo of Atril’s DéjàVu translation memory database V2.3.82
aligning module
– works fine for 1:1 alignments
– handwork for 2:1, 3:1, 1:2, 1:3
export to TMX format
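For orientation, a TMX file stores each aligned pair as a translation unit (tu) holding one variant (tuv) per language. The sketch below builds such a file with Python's standard library. It is not DéjàVu's export: the header values, the TMX version and the example sentence pairs are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree serializes this qualified name as the attribute xml:lang.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_tmx(pairs, src_lang="hr", tgt_lang="en"):
    """Return a TMX document (as text) with one <tu> per aligned pair."""
    tmx = ET.Element("tmx", {"version": "1.4"})
    ET.SubElement(tmx, "header", {
        "creationtool": "alignment-sketch",  # hypothetical tool name
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "plain-text",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for src_text, tgt_text in pairs:
        tu = ET.SubElement(body, "tu")
        for lang, text in ((src_lang, src_text), (tgt_lang, tgt_text)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")


# Invented sentence pairs, not taken from the corpus.
tmx_text = build_tmx([("Dobar dan.", "Good afternoon."), ("Hvala.", "Thank you.")])
print(tmx_text)
```

In this layout a 2:1 alignment would have to be stored by putting both source sentences into a single seg, which is one reason the storage question is raised again on the later slide.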
HR-EN parallel corpus 9: alignment 2
HR-EN parallel corpus 10: alignment 3
HR-EN parallel corpus 11: alignment 4
encoding problem: how to store alignments?
several ways to do it now:
– CES with pointers to IDs in a 3rd file
– translation memory (Translation Units as aligned pairs)
• since we are in XML => PLUG project dtd (Tiedemann 1998)
• si-en parallel corpus (Erjavec 1999): SGML, modified TEI <BODY> to have TU. But all upper- and lower-level encoding (<DIVs>, <HEADs>, <HI rend=“”>) is lost. Is there a way to retain it?
– TEIXML dtd, Nancy, July 1999. Interpretation of TEI dtd? Prefers the TEI and CES standard way of storing links in a 3rd file
Is the SGML/XML decision really a problem for us? To the same <BODY> element we can attach different headers, convert character entities and have SGML instead of XML
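The "links in a 3rd file" option can be pictured as follows: both texts keep their own markup and sentence IDs, and the alignment is only a list of ID groups. The sketch is a simplified illustration, not the CES or TEI link syntax; the ID scheme and the sentences are invented.

```python
# Sentence tables as they might be extracted from the two marked-up texts.
hr_sentences = {"hr.1": "Prva rečenica.", "hr.2": "Druga.", "hr.3": "Treća."}
en_sentences = {"en.1": "First sentence.", "en.2": "Second and third."}

# Stand-off alignment: each link groups source IDs with target IDs (1:1, 2:1, ...).
links = [
    (["hr.1"], ["en.1"]),
    (["hr.2", "hr.3"], ["en.2"]),
]


def resolve_links(links, source, target):
    """Turn ID-based links into pairs of aligned text; unknown IDs raise KeyError."""
    pairs = []
    for source_ids, target_ids in links:
        source_text = " ".join(source[i] for i in source_ids)
        target_text = " ".join(target[i] for i in target_ids)
        pairs.append((source_text, target_text))
    return pairs


aligned = resolve_links(links, hr_sentences, en_sentences)
for hr_text, en_text in aligned:
    print(hr_text, "|", en_text)
```

Because the links only point at IDs, the upper- and lower-level encoding of each text stays untouched, which is the property lost when the BODY is rebuilt around translation units.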
HR-EN parallel corpus 12: preliminary statistics
already for <S> aligning it seems that we will have a lot of handwork
discrepancy between the number of <S> and <W> in hr and en
issue   element   hr      en      % increase
CW010   <P>       195     195
        <S>       729     796     9.2
        <W>       15483   18176   17.4
CW011   <P>       178     178
        <S>       675     754     11.7
        <W>       14853   17602   18.5
are there any data for other Slavic languages?
<W> alignment is not on the schedule yet
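The "% increase" column is the growth of the English count relative to the Croatian one, (en - hr) / hr × 100. A short check with the counts from the table (the variable and function names are illustrative):

```python
# (hr, en) counts per element, copied from the preliminary statistics table.
stats = {
    "CW010": {"<P>": (195, 195), "<S>": (729, 796), "<W>": (15483, 18176)},
    "CW011": {"<P>": (178, 178), "<S>": (675, 754), "<W>": (14853, 17602)},
}


def percent_increase(hr, en):
    """Increase of the English count over the Croatian count, in percent."""
    return round((en - hr) / hr * 100, 1)


increase = {
    issue: {element: percent_increase(hr, en) for element, (hr, en) in rows.items()}
    for issue, rows in stats.items()
}
for issue, rows in increase.items():
    print(issue, rows)
```

The paragraph counts match exactly, so the paragraph can serve as the anchor unit; the discrepancy appears only at sentence and word level.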
Croatian-Slovene parallel corpus
approved in July by both Ministries of science as one of 17 bilateral scientific projects in humanities
effectively launched in the last week of October
partners: Philosophical Faculties in Zagreb and Ljubljana
duration: 2 years
2 meetings were held, one in Ljubljana and the other in Zagreb
size and structure of the corpus were defined
finding electronically available translations
– the primary task for both project partners at the moment