Corpus-building projects in the Institute of linguistics, Philosophical Faculty, University of Zagreb

Marko Tadić ([email protected], www.hnk.ffzg.hr/mt)

Philosophical Faculty, University of Zagreb, Croatia (www.ffzg.hr/zzl/zzl-home.htm)

Tübingen, 1999-12-01

Lecture plan

Institute of linguistics

Croatian frequency dictionary — Corpus of Croatian literary language (Moguš corpus)

Croatian national corpus (HNK)

Croatian participation in ELAN

Croatian-English parallel corpus

Croatian-Slovenian parallel corpus

Institute of linguistics, Philosophical Faculty, Univ. of Zagreb

founded 1960.

concentrating point for linguistic projects at the Philosophical Faculty

first usage of computers in Croatian language research:

– Željko Bujas (1967) Concordance of Gundulić’s Osman, Austin, TX

– Rudolf Filipović (1968): contrastive projects

– Milan Moguš (1970) Computational analysis of older Croatian literature texts

– Željko Bujas (1972-1975) English-Croatian lexicographical corpus

– Milan Moguš (1976-1996) Corpus of the contemporary Croatian literary language — one-million corpus (1M)

Croatian Frequency Dictionary 1

Milan Moguš, Maja Bratanić, Marko Tadić: Hrvatski čestotni rječnik, Školska knjiga – Zavod za lingvistiku Filozofskoga fakulteta Sveučilišta u Zagrebu, Zagreb 1999.

compiled on the basis of the one-million corpus of the Croatian literary language (1M)

1M corpus collecting started 1976

– appropriate size for that time (Brown corpus 1967)

– in 1976 first one-million corpus of all Slavic languages?

1M corpus today

– inadequate for serious lexicographic work

– useful for studies contrasting pre-1990 and post-1990 Croatian lexica


1M corpus structure:

– 5 subcorpora, roughly 200.000 tokens each
• Drama: 20 samples, 10.000 tokens each
• Newspaper: 8 samples, 25.000 tokens each
• Prose: 20 samples, 10.000 tokens each
• Poetry: 20 samples, 10.000 tokens each
• Schoolbooks: 58 samples, 3450 tokens each

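[Chart: tokens per subcorpus. Values: 195052, 203208, 202005, 201667, 205816. Labels: Novine (newspaper), Drama (drama), Udžbenici (schoolbooks), Stihovi (poetry), Proza (prose). The pairing of values with labels is not recoverable here.]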
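A quick arithmetic check of the sampling plan, as a minimal Python sketch. All figures come from the slides; the variable and function names are illustrative only. The five chart values add up to the starting size of 1.007.748 tokens, while the nominal plan gives slightly less.

```python
# Sampling plan from the slide: subcorpus -> (number of samples, tokens per sample).
SAMPLING_PLAN = {
    "Drama": (20, 10_000),
    "Newspaper": (8, 25_000),
    "Prose": (20, 10_000),
    "Poetry": (20, 10_000),
    "Schoolbooks": (58, 3_450),
}

# Token counts printed on the chart; which value belongs to which
# subcorpus is not recoverable, so only the total is used here.
CHART_VALUES = [195_052, 203_208, 202_005, 201_667, 205_816]


def nominal_sizes(plan):
    """Return the planned size of each subcorpus (samples x sample size)."""
    return {name: count * size for name, (count, size) in plan.items()}


sizes = nominal_sizes(SAMPLING_PLAN)
nominal_total = sum(sizes.values())   # planned corpus size
actual_total = sum(CHART_VALUES)      # collected corpus size

print(sizes)
print("nominal:", nominal_total, "actual:", actual_total)
```

The difference between the two totals is what one would expect if samples were not cut exactly at the planned token count, but the slides do not say so explicitly.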


1M corpus facts:

– time-span: 1935-1978

– size:
• started with 1.007.748 tokens
• after removal of foreign-language elements: 994.049 tokens
• after removal of proper names: 952.327 tokens

– sample size for drama, prose, poetry
• 10 kW selected between 5 kW, 10 kW and 20 kW by testing the type-gain
• resulted in better dispersion among 20 different authors instead of 10

– sample size for newspaper
• whole daily issue
• 1 national and 4 regional daily newspapers
• 4 different months in 1975

– sample size for schoolbooks
• all subjects from the graduating class of secondary schools in Croatia in the academic year 1977/78
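The slides do not describe how the type-gain test was run. One plausible reading, sketched here in Python under that assumption, is to count how many previously unseen types each additional block of tokens contributes; a sample size is large enough once the gain per block flattens out. The function name, the lower-casing and the toy data are illustrative, not taken from the project.

```python
def type_gain(tokens, step):
    """Return (tokens_seen, new_types) for each successive block of `step` tokens."""
    seen = set()
    gains = []
    for start in range(0, len(tokens), step):
        chunk = tokens[start:start + step]
        before = len(seen)
        # Case is folded so that "Ruka" and "ruka" count as one type (an assumption).
        seen.update(token.lower() for token in chunk)
        gains.append((start + len(chunk), len(seen) - before))
    return gains


# Toy sample: each block of 4 tokens adds fewer new types than the previous one.
sample = "a b a c b d a e a b f a".split()
gains = type_gain(sample, 4)
print(gains)
```

Run on real samples of 5, 10 and 20 kW, the same loop would show where additional text stops paying off in new vocabulary.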


1M corpus processing

– primary processing
• corpus marking, pre-SGML time, proprietary
• word-lists (alphabetical and frequency)
• concordances

– lemmatization
• not automatic but
• semi-manual

– 1M corpus as a database
• retrievable by lemma
• gives a short concordance co-text
• not yet accessible via WWW
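The primary processing steps (word-lists and concordances) can be illustrated with a short Python sketch. This is not the project's proprietary software; the function names and the toy token list are assumptions made for illustration.

```python
from collections import Counter


def word_lists(tokens):
    """Return (alphabetical, by_frequency) lists of (token, count) pairs."""
    freq = Counter(tokens)
    # Note: plain sorted() orders by code point, not by Croatian collation.
    alphabetical = sorted(freq.items())
    by_frequency = sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return alphabetical, by_frequency


def kwic(tokens, keyword, width=3):
    """Return keyword-in-context lines as (left co-text, keyword, right co-text)."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


tokens = "ruka ruku mije a zemlja je već tu i ruka je tu".split()
alphabetical, by_frequency = word_lists(tokens)
print(by_frequency[:3])
print(kwic(tokens, "ruka", width=2))
```

Retrieval by lemma, as in the database version, would need an extra token-to-lemma mapping produced by the semi-manual lemmatization step.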


working screen of the lemmatization program


1M corpus after lemmatization


during 20 years, “dinosaur” project

– survived several generations of hardware

– multiple conversions from different data-formats

– war conditions

– etc.

dictionary size: 38.573 headwords

headwords = lemmas

3 parts

– frequency dictionary of lemmas

– alphabetical dictionary of lemmas

– alphabetical dictionary of lemmas with tokens and their frequencies


frequency dictionary: structure of lexicographic article

                                       frequency
lemma   POS  meaning   standard  rank  absolute  relative  subcorpora
ruka    f                        55    1599      0.1597    DNPSU
već     adv                      56    1559      0.1557    DNPSU
zemlja  f                        57    1512      0.1510    DNPSU
njegov  pro                      58    1477      0.1475    DNPSU
hiža    f              *         540   30        0.0030    D-P--
kp      abb  kilopond            541   29        0.0029    ----U
lukav   adj                      541   29        0.0029    D-PS-
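The last column records in which of the five subcorpora a lemma occurs. A minimal Python sketch of decoding it follows. It assumes that the letters stand for the Croatian subcorpus names shown on the structure chart (Drama, Novine, Proza, Stihovi, Udžbenici) and that a hyphen marks absence; the slides do not spell this out.

```python
# Assumed meaning of each position in the five-character code.
ORDER = "DNPSU"
SUBCORPORA = {
    "D": "Drama",
    "N": "Newspaper",    # Novine
    "P": "Prose",        # Proza
    "S": "Poetry",       # Stihovi
    "U": "Schoolbooks",  # Udžbenici
}


def decode_subcorpora(code):
    """Turn a code such as 'D-P--' into the list of subcorpora it marks as present."""
    if len(code) != len(ORDER):
        raise ValueError(f"expected {len(ORDER)} characters, got {code!r}")
    present = []
    for expected, actual in zip(ORDER, code):
        if actual == expected:
            present.append(SUBCORPORA[expected])
        elif actual != "-":
            raise ValueError(f"unexpected character {actual!r} in {code!r}")
    return present


print(decode_subcorpora("DNPSU"))
print(decode_subcorpora("D-P--"))
print(decode_subcorpora("----U"))
```

Read this way, kp (kilopond) occurs only in schoolbooks, while hiža is confined to drama and prose.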

alphabetic dictionary with tokens: structure of lexicographic article

lemma     POS  meaning  standard  absolute frq  rank  subcorpora
oteti     v                       34            539   DNPS-
  (ote 2, oteje* 2, otela 2, otele 1, oteli 1, otelo 1, oteo 8, oteti 10, oteto 1, otme 3, otmem 1, otmi 1, otmu 1)
oteti se  vr                      14            559   DNPS-
  (ote 2, otelo 1, oteo 2, oteti 2, otme 6, otmem 1, otmu 1)
preko     pre                     523           174   DNPSU
  (preko 509, prek* 8, priko* 6)
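The token lists in parentheses can be parsed mechanically, which also shows that a lemma's absolute frequency is the sum of the frequencies of its tokens. The Python sketch below is illustrative; it assumes that the asterisk marks a non-standard form, which the "standard" column suggests but the slides do not state.

```python
import re

ITEM_RE = re.compile(r"\s*(\S+?)(\*?)\s+(\d+)\s*")


def parse_token_list(text):
    """Parse '(ote 2, oteje* 2)' into [(form, frequency, is_starred), ...]."""
    inner = text.strip().strip("()")
    entries = []
    for item in inner.split(","):
        match = ITEM_RE.fullmatch(item)
        if match is None:
            raise ValueError(f"cannot parse item {item!r}")
        form, star, count = match.groups()
        entries.append((form, int(count), star == "*"))
    return entries


def total_frequency(text):
    """Sum the token frequencies of one dictionary article."""
    return sum(count for _, count, _ in parse_token_list(text))


OTETI = ("(ote 2, oteje* 2, otela 2, otele 1, oteli 1, otelo 1, oteo 8, "
         "oteti 10, oteto 1, otme 3, otmem 1, otmi 1, otmu 1)")
PREKO = "(preko 509, prek* 8, priko* 6)"

print(total_frequency(OTETI), total_frequency(PREKO))
print([form for form, _, starred in parse_token_list(PREKO) if starred])
```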

Croatian National Corpus (HNK) 1

project of the Ministry of Science and Technology of the Republic of Croatia 130718, Computational processing of Croatian language

theoretical foundations (www.hnk.ffzg.hr/cilj)

– Tadić (1996) Računalna obradba hrvatskoga i nacionalni korpus, Suvremena lingvistika 41-42, 603-612

– Tadić (1998) Raspon, opseg i sastav korpusa suvremenoga hrvatskoga jezika, Filologija 30-31, 337-347

need for the corpus of national importance

a tentative solution for its composition

the size, time-span and structure were elaborated

accessibility via WWW service was suggested

HNK 2: structure

30m: 30-million Corpus of Contemporary Croatian

– texts from 1990 until today

– different domains and genres

– representativeness for contemporary Croatian standard

HETA: Croatian Electronic Text Archive (Hrvatski Elektronski Tekstovni Arhiv)

– whole texts older than 1990

– whole texts of complete periodicals after 1990 which would unbalance the representativeness of 30m

HNK 3: text collecting

HNK text collecting started: November 1998

the cheapest and quickest way: WWW

– until today: more than 80 mW from .hr domain

– daily WWW token-gain: approx. 180.000 tokens

DTP sources

– more than 100 publications (now approx. 9 mW)

– domains:
• fiction, medicine, agronomy, law, literature theory and criticism, economy, philosophy, philology…
• lack of texts from natural sciences
• strongly imbalanced toward the humanities

typing/scanning text is not expected

HNK 4: www text availability

daily newspapers

– Vjesnik, Večernji list, Slobodna Dalmacija, Jutarnji list (in preparation for 1999-12)
• WWW version covers about 60% of paper edition

weekly/biweekly newspapers

– Hrvatsko slovo, Međimurje, Varaždinske vijesti
• 60% coverage of paper edition

– Nacional, Vijenac
• 100% coverage of paper edition

monthly/bimonthly magazines

– Bug, PCChip, Vidi, Kontura, AutoBlic, Teen…
• small coverage of paper edition

HNK 5: corpus on www

http://www.hnk.ffzg.hr

Testing V 1.0: 1998-12-05

– 30m: 3 mW

– HETA:
• Collected works by Ivan Gundulić
• all works written in Croatian by Marko Marulić

Testing V 1.1: 1999-02-14 & 1999-07-20

– 30m: 7,67 mW

– HETA: 2,9 mW from CD-ROM: Classics of Croatian literature, Naklada Bulaja, Zagreb, 1999

now only the testing V 1.1 (approx. 10 mW) of the corpus is accessible via WWW

– text format: quasi HTML, no XML

– no POS marking and retrieval

text collecting and marking going on, as well as software testing

HNK 6: hardware/software

platform

– NT instead of UNIX

– all software (commercial, shareware, custom-made) runs on Windows

input text formats

– WWW: HTML, XML

– DTP: RTF, DOC, QXD, WP, TXT etc.

conversion

– 2XML: custom-made software
• input: HTML, RTF
• output: XML, no header
• two-step conversion by user-defined scripts
• enables high level of automation
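The internals of 2XML are not described in the slides. As a rough illustration of what a two-step conversion driven by user-defined scripts could look like, here is a Python sketch in which each script is an ordered list of search-and-replace rules. The rule sets, the <P> element and the function names are hypothetical; only the HEAD, DIV0 and BODY element names are taken from the corpus sample.

```python
import re

# Hypothetical step 1 script: map the HTML structure we care about to XML elements.
STEP1_RULES = [
    (r"(?is)<head\b.*?</head>", ""),                               # drop the HTML header block
    (r"(?is)<h1\b[^>]*>(.*?)</h1>", r'<HEAD type="na">\1</HEAD>'),  # main headline
    (r"(?is)<p\b[^>]*>(.*?)</p>", r"<P>\1</P>"),                    # paragraphs
    (r"(?i)<br\s*/?>", " "),                                       # line breaks become spaces
]

# Hypothetical step 2 script: remove every tag step 1 did not produce, tidy whitespace.
STEP2_RULES = [
    (r"<(?!/?(?:HEAD|P)\b)[^>]+>", ""),
    (r"[ \t]+", " "),
]


def apply_rules(text, rules):
    """Apply (pattern, replacement) pairs in order."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text


def html_to_xml(html):
    """Run both steps and wrap the result in a header-less body."""
    step1 = apply_rules(html, STEP1_RULES)
    step2 = apply_rules(step1, STEP2_RULES).strip()
    return '<BODY><DIV0 type="article">' + step2 + "</DIV0></BODY>"


page = ("<html><head><title>x</title></head><body><h1>Naslov</h1>"
        "<p>Prvi <b>odlomak</b>.<br>Drugi red.</p></body></html>")
converted = html_to_xml(page)
print(converted)
```

Keeping the rules in data rather than in code is what would make a high level of automation possible: a new source only needs a new rule list. Character entities and malformed HTML are not handled in this sketch.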

HNK 7: corpus format

corpus format

– XML, we are testing TEIXML dtd, Nancy (1999-07)

<BODY>

<DIV0 type="article">

<HEAD type="nn">U GORICI SVETOJANSKOJ ODRŽAN 12. FESTIVAL PJEVAČA AMATERA</HEAD>

<HEAD type="na">Ivana osvojila županijski Sanremo</HEAD>

<HEAD type="pn">* Od 20 natjecatelja žiri je najboljom proglasio Ivanu Erdeljac s pjesmom "Crazy", druga je Antonija Mikita s pjesmom "To", a treće je mjesto osvojila Ksenija Cvetetić</HEAD>

<FIGURE>Publici su se najviše svidjeli Marija Šalić i Petar Puhijera</FIGURE>

Pod medijskim pokroviteljstvom "Večernjeg lista" i Radio Jaske, a uz pomoć DIR "Rubinić" kao generalnog te još sedamdesetak drugih sponzora, u petak i u subotu u Gorici Svetojanskoj pokraj Jastrebarskog održan je 12. festival pjevača amatera.

Prve festivalske večeri, na kojoj su nastupila 22 izvođača do 15 godina, prvu nagradu stručnog žirija odnijela je Petra Batelja iz Rastoka pokraj Jaske za pjesmu "To malo ljubavi". Druga nagrada pripala je Nikolini Oslaković iz Gornje Reke za pjesmu "Neka mi ne svane", a treća Mariji Jurini iz Desinca za pjesmu "Ginem". Publika je najboljom ocijenila svetojansku grupu "Mrvice" s pjesmom "Mrvica", dok je drugu nagradu dodijelila Natali Rajnović iz Jaske za pjesmu "Don"t ever cry", a treću Aniti Oslaković iz Desinca za pjesmu "Malo fali". Za najboljeg debitanta prve večeri proglašena je Irena Kišan iz Zdenčine s pjesmom "Izdali me".

Druga večer - s dvadeset starijih izvođača iz Jaske, Karlovca, Bjelovara, Zagreba i Velike Gorice - bila je osobito napeta, jer je za razliku od lani ponudila vrlo kvalitetne izvođače i interpretacije pa nije bilo lako odabrati najbolje.

Nakon poduže stanke tijekom koje su izbrojani glasovi - a koju su publici kratili gost večeri Ivo Pattiera te sastav "Santa Anna" i solistica Goga Čopić - proglašeni su ovogodišnji pobjednici. Prema ocjeni stručnog žirija, prvu nagradu i zlatnu plaketu "Večernjaka" dobila je Karlovčanka Ivana Erdeljac za vrlo dobro otpjevanu pjesmu "Crazy". Druga nagrada pripala je Antoniji Mikiti iz Velike Gorice za pjesmu "To", a treća Kseniji Cvetetić iz Petrovine za pjesmu "Neka mi ne svane".

Publika je najviše glasova dodijelila svetojansko-zagrebačkom duetu Mariji Šalić i Petru Puhijeri za interpretaciju pjesme "Ima li nade za nas", pa je i njima pripala "Večernjakova" zlatna plaketa. Na drugo mjesto publika je svrstala "Svetojanske tamburaše" koji su nastupili s pjesmom "Dobro jutro", a na treće Zagrepčanku Marijanu Parilac i pjesmu "Idi i ne budi ljude".

Najboljom debitanticom završne večeri proglašena je Zagrepčanka Marina Posilović s pjesmom "Piši, piši mi", a nagradu za najbolji scenski nastup dobio je sastav iz Petrovine "Prigorje de lajt" s pjesmom "Oj suseda, suseda". Čini se da su ovogodišnje nagrade - a bilo ih je doista mnogo, od sedmodnevnog boravka u Opatiji, umjetničke slike, bicikla i kazetofona do satova i poklon-bonova - završile u pravim rukama. Oni koji ih nisu dobili, a možda su ih također zaslužili, neka se ovaj put utješe pljeskom publike, a dogodine će imati novu priliku. Jer, tradicija Svetojanskog festivala - svojevrsnog Sanrema zagrebačke županije - nastavlja se.

<BYLINE>N. Godrijan-Videc</BYLINE>

</DIV0>

</BODY>

HNK 8: corpus format 2

tokenization

– TOKENIZER: custom-made software
• input: XML
• output 1: tabbed file for database input
• output 2: tokenized XML

output 1: tabbed file (token, text ID, offset, type)

<BODY>                  vl990301gr01  1    X
<DIV0 type="article">   vl990301gr01  7    X
<HEAD type="nn">        vl990301gr01  28   X
U                       vl990301gr01  44   R
GORICI                  vl990301gr01  46   R
SVETOJANSKOJ            vl990301gr01  53   R
ODRŽAN                  vl990301gr01  66   R
12                      vl990301gr01  78   B
.                       vl990301gr01  80   I
FESTIVAL                vl990301gr01  82   R
PJEVAČA                 vl990301gr01  91   R
AMATERA                 vl990301gr01  104  R
</HEAD>                 vl990301gr01  111  X
<HEAD type="na">        vl990301gr01  118  X
Ivana                   vl990301gr01  134  R
osvojila                vl990301gr01  140  R
županijski              vl990301gr01  149  R
Sanremo                 vl990301gr01  165  R
</HEAD>                 vl990301gr01  172  X
<HEAD type="pn">        vl990301gr01  179  X
*                       vl990301gr01  195  I
Od                      vl990301gr01  197  R
20                      vl990301gr01  200  B
natjecatelja            vl990301gr01  203  R
žiri                    vl990301gr01  216  R
je                      vl990301gr01  226  R
najboljom               vl990301gr01  229  R
proglasio               vl990301gr01  239  R
Ivanu                   vl990301gr01  249  R
Erdeljac                vl990301gr01  255  R
s                       vl990301gr01  264  R
pjesmom                 vl990301gr01  266  R
"                       vl990301gr01  275  I
Crazy                   vl990301gr01  276  R
"                       vl990301gr01  281  I
,                       vl990301gr01  282  I
druga                   vl990301gr01  284  R
je                      vl990301gr01  290  R
Antonija                vl990301gr01  293  R
Mikita                  vl990301gr01  302  R
s                       vl990301gr01  309  R
pjesmom                 vl990301gr01  311  R
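The tabbed output pairs each token with a text ID, a character offset and a type code. The Python sketch below shows one way such rows could be produced; it is not the TOKENIZER program itself. It assumes 1-based character offsets and reads the type codes as X = markup, R = word, B = number, I = punctuation, which fits the sample but is not stated in the slides.

```python
import re

# A tag, a run of digits, a run of letters, or a single other non-space character.
TOKEN_RE = re.compile(r"<[^>]+>|\d+|[^\W\d_]+|[^\w\s]")


def classify(token):
    """Assign the assumed type code to one token."""
    if token.startswith("<") and token.endswith(">"):
        return "X"  # markup
    if token.isdigit():
        return "B"  # number
    if token.isalpha():
        return "R"  # word
    return "I"      # punctuation and other symbols


def tokenize(xml_text, text_id):
    """Return rows of (token, text_id, 1-based offset, type code)."""
    return [
        (m.group(0), text_id, m.start() + 1, classify(m.group(0)))
        for m in TOKEN_RE.finditer(xml_text)
    ]


def to_tabbed(rows):
    """Serialize rows as tab-separated lines for database import."""
    return "\n".join("\t".join(str(field) for field in row) for row in rows)


source = '<BODY><DIV0 type="article"><HEAD type="nn">U GORICI SVETOJANSKOJ'
rows = tokenize(source, "vl990301gr01")
print(to_tabbed(rows))
```

For this ASCII-only prefix the offsets agree with the sample. Later offsets in the sample appear to count characters such as Ž as six positions, which would be consistent with numeric character entities in the source file; this sketch does not model that.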

HNK 9: corpus format 3

output 2: tokenized XML

<BODY>
<DIV0 type="article">
<HEAD type="nn">
<W type="R">U</W> <W type="R">GORICI</W> <W type="R">SVETOJANSKOJ</W> <W type="R">ODRŽAN</W> <W type="B">12</W> <W type="I">.</W> <W type="R">FESTIVAL</W> <W type="R">PJEVAČA</W> <W type="R">AMATERA</W>
</HEAD>
<HEAD type="na">
<W type="R">Ivana</W> <W type="R">osvojila</W> <W type="R">županijski</W> <W type="R">Sanremo</W>
</HEAD>
<HEAD type="pn">
<W type="I">*</W> <W type="R">Od</W> <W type="B">20</W> <W type="R">natjecatelja</W> <W type="R">žiri</W> <W type="R">je</W> <W type="R">najboljom</W> <W type="R">proglasio</W> <W type="R">Ivanu</W> <W type="R">Erdeljac</W> <W type="R">s</W> <W type="R">pjesmom</W> <W type="I">"</W> <W type="R">Crazy</W> <W type="I">"</W> <W type="I">,</W> <W type="R">druga</W> <W type="R">je</W> <W type="R">Antonija</W> <W type="R">Mikita</W> <W type="R">s</W> <W type="R">pjesmom</W> <W type="I">"</W> <W type="R">To</W> <W type="I">"</W> <W type="I">,</W> <W type="R">a</W> <W type="R">treće</W> <W type="R">je</W> <W type="R">mjesto</W> <W type="R">osvojila</W> <W type="R">Ksenija</W> <W type="R">Cvetetić</W>
</HEAD>
<FIGURE>
<W type="R">Publici</W> <W type="R">su</W> <W type="R">se</W> <W type="R">najviše</W> <W type="R">svidjeli</W> <W type="R">Marija</W> <W type="R">Šalić</W> <W type="R">i</W> <W type="R">Petar</W> <W type="R">Puhijera</W>
</FIGURE>
<W type="R">Pod</W> <W type="R">medijskim</W> <W type="R">pokroviteljstvom</W> <W type="I">"</W> <W type="R">Večernjeg</W> <W type="R">lista</W> <W type="I">"</W> <W type="R">i</W> <W type="R">Radio</W> <W type="R">Jaske</W> <W type="I">,</W> <W type="R">a</W> <W type="R">uz</W> <W type="R">pomoć</W> <W type="R">DIR</W> <W type="I">"</W> <W type="R">Rubinić</W> <W type="I">"</W> <W type="R">kao</W> <W type="R">generalnog</W> <W type="R">te</W> <W type="R">još</W> <W type="R">sedamdesetak</W> <W type="R">drugih</W> <W type="R">sponzora</W> <W type="I">,</W> <W type="R">u</W> <W type="R">petak</W> <W type="R">i</W> <W type="R">u</W> <W type="R">subotu</W> <W type="R">u</W> <W type="R">Gorici</W> <W type="R">Svetojanskoj</W> <W type="R">pokraj</W>

HNK 10: deadlines

30m: 2000

– finalization of 30m corpus text-collecting

– XML tagging completed

– WWW retrieval front-end finished

– copyright issues cleared

HETA: 2000

– continuous filling with whole series of periodicals
• Narodne novine (approx. 10 mW)
• Večernji list (approx. 9 mW)
• Vjesnik (approx. 6 mW)
• Vijenac (approx. 10 mW)
• etc.

HNK 11: future steps

size of reference corpora today: 100 mW or planned for 100 mW

– BNC (British National Corpus)

– CNC (Czech National Corpus)

– PNC (Polish National Corpus, Łódź; text-typological copy of BNC)

– FIDA (Slovene corpus)

– HNC (Hungarian National Corpus)

=> after or in 2000: upsizing of 30m to 100m

after or in 2000: morphosyntactic descriptions (MSD)

– POS tagging

– lemmatization (MULTEXT-East tagset)

after 2001: syntactic description (parsing)

– tree-bank

after 2001: spoken language subcorpus (if possible)

HNK 12: www demo

home-page

aims page

30m & HETA corpus pages

30m corpus retrieval page

30m sources list

30m concordance (tajk%)

30m concordance wider co-text

HETA, Classics retrieval page

HETA, Classics source list

Croatian participation in ELAN project

ELAN (European Language Activity Network)

project jointly coordinated by PAROLE and TELRI associations

duration: 1998-01-01 until 1999-12-31

Institute of linguistics as an institutional member of TELRI association invited to participate in language resources collecting

project also got support from Croatian Ministry of Science and Technology (MZT 130729)

HR-ELAN 1

Croatian participation consisted of 2 mW Croatian monolingual corpus

part of HNK: Večernji list, March-April 1999

test-bed for text-conversion and corpus encoding processes

– step 1. HTML to XML

– step 2. XML to SGML (!?)

final format: SGML, CES compatible

Croatian-English parallel corpus

multilingual language research

– lexicography

– contrastive linguistics

– MT

– ...

parallel corpora = essential importance

today's role of English as lingua communis

– common language pairs: en : Lx, Lx : en

in October we started a not-(yet)-financed Croatian-English parallel corpus

HR-EN parallel corpora 1

1st English-Croatian pairing

– Rudolf Filipović, 1968-1971

– Yugoslav Serbo-Croatian–English Contrastive Project (the only possible name under communist authorities)

– Brown corpus cut in half (505.822 tokens)

– preserving original genre balance

– morphosyntactically marked

– translated

– concordance with morphosyntactic categories as keywords

– bilingual sentence database

1st usage of computers in contrastive linguistics ever?

tapes with data still archived in Institute of linguistics in Zagreb but no computer system which could read them

project publications: Contrastive Studies, New Contrastive Studies, Chapters in Contrastive Linguistics


2nd Croatian-English pair

– translation of Plato’s Republic, TELRI CD-ROM, 1998, ISBN 3-922641-46-6

– hr-en not the only pair: collection of 22 languages

– rather small: 130 kW

– properly aligned


3rd hr-en parallel corpus

– aims to test:
• text conversion procedures
• corpus organization
• alignment and encoding

that will be used later in parallel corpora projects such as the Croatian-Slovene parallel corpus

HR-EN parallel corpus 5: corpus collecting 2

source: Croatia Weekly

– publisher: Croatian Institute for Culture and Information (HIKZ)

– started January 1998

– like USA Today: different domains
• politics, economy and finance, tourism, ecology, culture, art, events, sports

– 12 pages, A3

– prepared in Croatian then translated by professional translating office

availability

– No. 94 is being prepared now

– access to all texts in electronic form in both languages except for first 5 issues


size

– average issue: 15.000 tokens hr, 18.000 tokens en

– approx. total: 1.300.000 tokens hr, 1.600.000 tokens en
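The approximate totals follow from the per-issue averages and the number of available issues. The Python sketch below assumes that issues 6 to 93 are available (the first 5 are missing and No. 94 is still in preparation); that range is an inference from the availability notes, not a figure given in the slides.

```python
def estimate_tokens(first_issue, last_issue, tokens_per_issue):
    """Estimate corpus size from an inclusive issue range and an average issue size."""
    issues = last_issue - first_issue + 1
    return issues * tokens_per_issue


# Assumed range: issues 6..93 inclusive, i.e. 88 issues.
hr_estimate = estimate_tokens(6, 93, 15_000)
en_estimate = estimate_tokens(6, 93, 18_000)
print(hr_estimate, en_estimate)
```

Both estimates round to the approximate totals quoted for the hr and en sides.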

“methodological disturbance”: should we use it?

– the biggest weekly newspaper Nacional
• important source of hr-texts for Croatian National Corpus
• started with English translations of approx. 10% of Croatian issue on their Web-page

– Ministry of science and technology of the Republic of Croatia
• description of all finished scientific projects
• on WWW
• Croatian and English


Sentence marking

– script in Search&Replace shareware by Funduc SW

– </S><S> after punctuation followed by capital letter

– filtered for known exceptions: Mr., Mrs., Miss., dr., St. etc.
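The sentence-marking rule can be sketched in Python as a regular-expression substitution. This is a stand-in for the Search&Replace script, which is not shown in the slides; the pattern, the function name and the exact exception handling are assumptions, and the exception list contains only the abbreviations named above.

```python
import re

# Abbreviations after which no sentence boundary is inserted.
EXCEPTIONS = {"Mr.", "Mrs.", "Miss.", "dr.", "St."}

# A word ending in sentence punctuation, then whitespace, then a capital letter.
BOUNDARY_RE = re.compile(r"(\S+[.!?])(\s+)(?=[A-ZČĆĐŠŽ])")


def mark_sentences(text):
    """Wrap text in <S>...</S> and insert </S><S> at detected boundaries."""
    def replace(match):
        word, space = match.group(1), match.group(2)
        if word in EXCEPTIONS:
            return match.group(0)  # known abbreviation: leave untouched
        return word + "</S><S>" + space

    return "<S>" + BOUNDARY_RE.sub(replace, text.strip()) + "</S>"


sample = "He met Mr. Brown. They left! Was dr. Horvat there? Yes."
marked = mark_sentences(sample)
print(marked)
```

The exception match is case-sensitive, so a sentence-initial "Dr." would still trigger a split. Cases like that are why the output needs checking by hand.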

HR-EN parallel corpus 8: alignment

testing stage

demo of Atril’s DéjàVu translation memory database V2.3.82

aligning module

– works fine for 1:1 alignments

– handwork for 2:1, 3:1, 1:2, 1:3

export to TMX format

HR-EN parallel corpus 9: alignment 2



encoding problem: How to store alignments?

several ways to do it now:

– CES with pointers to IDs in 3rd file

– translation memory (Translation Units as aligned pairs)
• since we are in XML => PLUG project dtd (Tiedemann 1998)
• si-en parallel corpus (Erjavec 1999): SGML, modified TEI <BODY> to have TU. But all upper and lower level encoding (<DIVs>, <HEADs>, <HI rend="">) is lost. Is there a way to retain it?

– TEIXML dtd, Nancy, July 1999. Interpretation of TEI dtd? Prefers the TEI and CES standard way of storing links in 3rd file

Is the SGML/XML decision really a problem to us? To the same <BODY> element we can attach different headers, convert character entities and have SGML instead of XML
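Storing links in a third file keeps both texts untouched, so their <DIV>, <HEAD> and other markup survives alignment. The Python sketch below builds such a stand-off link file and reads it back. The element and attribute names (linkGrp, link, xtargets, targType, domains) follow the CES alignment conventions as recalled here and should be checked against the DTD; the sentence IDs and document names are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_links(pairs, hr_doc, en_doc):
    """Build a link group from (hr_ids, en_ids) pairs; each side may hold several IDs."""
    group = ET.Element("linkGrp", {"targType": "s", "domains": f"{hr_doc} {en_doc}"})
    for hr_ids, en_ids in pairs:
        targets = " ".join(hr_ids) + " ; " + " ".join(en_ids)
        ET.SubElement(group, "link", {"xtargets": targets})
    return group


def read_links(group):
    """Recover the (hr_ids, en_ids) pairs from a link group."""
    pairs = []
    for link in group.findall("link"):
        hr_side, en_side = link.get("xtargets").split(";")
        pairs.append((hr_side.split(), en_side.split()))
    return pairs


# One 1:1 alignment and one 1:2 alignment, with made-up sentence IDs.
alignment = [(["hr.s1"], ["en.s1"]), (["hr.s2"], ["en.s2", "en.s3"])]
link_group = build_links(alignment, "cw010.hr", "cw010.en")
print(ET.tostring(link_group, encoding="unicode"))
```

A translation-memory file would instead copy the sentence text into each unit, which is where the surrounding markup gets lost.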

HR-EN parallel corpus 12: preliminary statistics

already for <S> alignment it seems that we will have a lot of manual work

discrepancy between number of <S> and <W> in hr and en

         hr      en      % increase
CW010    195     195
<S>      729     796     9.2
<W>      15483   18176   17.4
CW011    178     178
<S>      675     754     11.7
<W>      14853   17602   18.5
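The percentage column is the relative increase of the English count over the Croatian one. A small Python check with the figures from the table (the function name is illustrative):

```python
def pct_increase(hr, en):
    """Relative increase from the hr count to the en count, in percent, one decimal."""
    return round((en - hr) / hr * 100, 1)


# (hr, en) counts from the table, per issue and element.
COUNTS = {
    ("CW010", "S"): (729, 796),
    ("CW010", "W"): (15483, 18176),
    ("CW011", "S"): (675, 754),
    ("CW011", "W"): (14853, 17602),
}

increases = {key: pct_increase(hr, en) for key, (hr, en) in COUNTS.items()}
print(increases)
```

In both issues the word count grows about twice as fast as the sentence count, so English sentences are also longer on average, not just more numerous.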

are there any data for other Slavic languages?

<W> alignment is not on the schedule yet

Croatian-Slovene parallel corpus

approved in July by both Ministries of science as one of 17 bilateral scientific projects in humanities

effectively launched in the last week of October

partners: Philosophical Faculties in Zagreb and Ljubljana

duration: 2 years

2 meetings were held, one in Ljubljana and the other in Zagreb

size and structure of the corpus were defined

finding electronically available translations

– the primary task for both project partners at the moment

