Workshop on Computational Linguistics
providing Paths to Transdisciplinary Research

Book of Abstracts

October 17, 2019

Dedicated to Ulrich Heid.

Contents

1 Preface
2 Program
3 Reusability of Lexical Resources
4 Corpora for Machine Translation
5 DIN and a database
6 Collocations and dictionaries
7 Lexicographic Projects
8 NLP for Southern African Languages
9 The T&O project and its impacts
10 Variation in LSP
11 Social reading and writing media
12 Digital Humanities

1 Preface

Considering that the first attempts to process language computationally (in the context of machine translation) now date back more than seventy years, we believe it is time to examine at least some aspects of language technology and its transdisciplinary ambitions in a workshop. Situated between linguistics and computer science, language technology has significantly influenced both disciplines over the past decades, and one may claim that linguistics and computer science have enriched each other across this transdisciplinary bridge.

Language technology has long since ceased to be limited to its four traditional 'wheels', morphology, syntax, semantics and pragmatics, with which it analyses and generates texts for linguistic purposes. Under the influence of computer science, statistical tools and, nowadays, neural networks have come into everyday use and have entered the curricula of the degree programmes offered in the field.

In recent decades, moreover, the transdisciplinary reach of language technology has extended to many other disciplines that also work with texts. Not only has language technology long played an important role in (specialised) lexicography, for example in identifying lexical meanings in text collections; in the field of Digital Humanities it now also opens up traditional disciplines such as literary studies, the social sciences, history and political science.

Without the forward drive of university research, language technology would certainly not be the broad discipline it is today. Precisely here, Ulrich Heid has set much in motion and achieved a great deal. At the workshop Computational Linguistics providing Paths to Transdisciplinary Research, held on October 17, 2019 on the occasion of his sixtieth birthday, we consider language technology from various perspectives related to his research.

Page 7: Workshop on ComputationalLinguistics providingPathsto ... · Contents 1 Preface 1 2 Program 6 3 Reusability of Lexical Resources 8 4 Corpora for Machine Translation 10 5 DIN and a

1. PREFACE 3

We begin with the contribution Reusability of Lexical Resources in Eurotra, in which Pius ten Hacken and Folker Caroli describe the importance of lexical resources for machine translation in the 1980s. Kurt Eberle follows with his contribution Balancing Corpora for Machine Translation, relating to the SFB 732 project Incremental Specification in Context from the 2000s. Kerstin Jung (née Eckart) then reports on the same project, now from the perspective of process documentation, a perspective that has been taken into account only in recent years and yet is highly relevant, and which she investigated in her dissertation, supervised by Mr Heid. Jung sheds light on the work on the corresponding ISO standards, without whose data and interface descriptions an inter- or transdisciplinary processing of data would hardly be possible.

With the talk by Rufus Gouws we change perspective again and turn to the area of collocations, on which Mr Heid has been working for many years. In lexicography, knowledge of collocations is essential, and without language technology (or corpus linguistics) they can be investigated only to a limited extent. The name Ulrich Heid is equally indispensable for the processing of the African languages of South Africa: Theo Bothma, together with Danie J. Prinsloo, reports on various projects and language technology software that were developed in collaboration with him.

While the talk by Bothma and Prinsloo concentrates on lexicographic projects for South African languages, Sonja Bosch and Gertrud Faaß will speak about further language technology resources and tools, many of which would not have come into being in the last decade without Mr Heid's involvement and advice. Mr Heid was and is also very active with regard to specialised languages: Michael Dorna of Robert Bosch GmbH and Anna Hätty of the University of Stuttgart report in their talk T&O - Terminology Extraction and Ontology Development on a terminological cooperation project under Mr Heid's aegis. Following this, the talk Specialised Lexicography and Term Variation by Theresa Kruse and Laura Giacomini illuminates the Hildesheim years so far, describing qualification theses on specialised lexicography and term variation in the fields of mathematics and technology.

Gerhard Lauer addresses the transdisciplinary share of language technology in the Digital Humanities: in Social Reading and Writing Media he summarizes how posts from social media can be examined in real-time analyses. Johannes Schäfer, Anna Moskvina and Fritz Kliche remain in the field of Digital Humanities. They present how, for bibliographic purposes, the contents of scientific publications are to be captured and indexed with language technology methods; another piece of work deals with the detection of inappropriate content such as hate speech in social media.

The staff of the Computational Linguistics group of the Institut für Informationswissenschaft und Sprachtechnologie at the Universität Hildesheim: Gertrud Faaß, Laura Giacomini, Max Kisselew, Fritz Kliche, Theresa Kruse, Anna Moskvina and Johannes Schäfer

June 2019

2 Program

10:00 Arrival, Registration with coffee on offer
11:00 Welcome: Wolfgang-Uwe Friedrich, President, University of Hildesheim
11:20 Session 1: European Machine Translation Project(s)
      Welcome: Folker Caroli (Hildesheim)
11:25-11:45 Pius ten Hacken (Innsbruck) and Folker Caroli (Hildesheim)
11:45-11:50 Session 1 round-up (Pius ten Hacken)
11:50-11:55 Session 2: Working on SFB 732 "Incremental Specification in Context" at University of Stuttgart
      Welcome: Kurt Eberle (Heidelberg)
11:55-12:15 Kurt Eberle (Heidelberg)
12:15-12:35 Kerstin Jung (Stuttgart)
12:35-12:40 Session 2 round-up (Kerstin Jung)
12:40-12:45 Session 3a: Advances in Theoretical Lexicography
      Welcome: Theo Bothma (Pretoria, South Africa)
12:45-13:05 Rufus Gouws (Stellenbosch University, South Africa)
13:05 Lunch Break (buffet)
14:15 Session 3b: Lexicographic Projects & African Languages Issues
14:15-14:35 Theo Bothma (with support of Danie J. Prinsloo, Pretoria, South Africa)
14:35-14:55 Sonja Bosch (UNISA, Pretoria, South Africa) and Gertrud Faaß (Hildesheim)
14:55-15:00 Session 3 round-up (Rufus Gouws)
15:00-15:05 Session 4: Terminology and Terminological Projects
      Welcome: Laura Giacomini (Hildesheim)
15:05-15:25 Michael Dorna (Bosch CR, Renningen) and Anna Hätty (Stuttgart)
15:25-15:45 Laura Giacomini and Theresa Kruse (Hildesheim)
15:45-15:50 Session 4 round-up (Laura Giacomini)
15:50-16:05 Coffee Break
16:05-16:10 Session 5: Digital Humanities
      Welcome: Caroline Sporleder (Göttingen)
16:10-16:30 Gerhard Lauer (Basel)
16:30-16:50 Anna Moskvina, Johannes Schäfer, Fritz Kliche (Hildesheim)
16:50-16:55 Session 5 round-up (Caroline Sporleder)
16:55-17:15 Round-up: Final Discussion and Summary (Ulrich Heid)
17:15 End of the workshop
17:15-18:30 University Reception: Faculty 3 and guests
18:45 Dinner in local restaurant: Os - Das Marktrestaurant (Markt 7, 31134 Hildesheim)

3 Reusability of Lexical Resources in Eurotra

– Pius ten Hacken, Universität Innsbruck; Folker Caroli, Universität Hildesheim

This talk focuses on the early period of research activities by Uli Heid. It describes the main context of natural language processing research in Europe at this time, the Eurotra Project. Eurotra was the machine translation (MT) project of the European Community in the 1980s. It was a project unique in its design and size, aiming to develop a truly multilingual transfer system for the nine EC languages at the time. Its implementation and organization reflected an approach to MT that was mainly interested in theoretical questions of linguistic modelling. Most research was devoted to syntactic and semantic issues. Only towards the end of the project did the so-called lexical bottleneck become a prominent issue. It was discovered that, to operate properly, the system would need large monolingual and bilingual dictionaries, whose development would require significant resources. The Dictionary Task Force (DTF) was set up to plan and monitor this work.

Thus, at the end of the main Eurotra project in 1990, one of the follow-up projects was a tender named "Reusability of Lexical Resources". The main contractor for this tender was Ulrich Heid. The objective of this project was to produce an overview of the state of the art in the reusability of lexical resources. It consisted of two components, one producing an inventory of existing digital lexicographic resources and one aiming to set up a standard for the encoding of lexical information. The outcomes of the project led to two approaches to building dictionaries for natural language processing systems: on the one hand, the extraction of lexical information from existing machine-readable resources intended for human use; on the other, the production of lexical resources by computer-based exploitation of large text corpora. This second orientation has become one of the main research fields of Uli Heid.

4 Balancing Corpora for Machine Translation

– Kurt Eberle, University of Heidelberg

Text analysis and translation have made great progress over the last couple of years through the addition of neural networks to the paradigms of machine learning. Both text classification and machine translation have improved significantly. However, a prerequisite is that very large corpora are available for training, which is not the case for every language, language pair or domain. What is even worse is that large data does not necessarily prevent learning from producing biased results. In machine translation, for instance, it could be shown that good neural systems created from large data, though producing remarkably fluent and linguistically rich results, nevertheless show many adequacy errors for texts with a high percentage of technical terms and unusual selections from the range of readings of words, phrases and sentences. To remedy this, it is helpful to train systems on corpora that are not only large but also balanced enough to represent an optimum of the use cases one is interested in, and that avoid unwanted biasing towards specific meanings and translations.

In the special research area SFB 732 on Incremental Specification in Context, and in particular in its project on linguistic tools, led by Uli Heid, we developed devices for extracting and labeling references for various phenomena from text. Optionally, the tools allowed the integration of syntactic and semantic analyses of the texts, and the use of semantic and pragmatic predictions deducible from discourse representation theory (DRT) and its model theory, in order to obtain highly filtered and reliably labeled data.

These devices have been continuously developed further. The talk reports on the current integration of knowledge from distributional semantics into the extraction and labeling tools, with the aim of obtaining better information about the distribution of readings and of compiling more balanced corpora as a basis for classifiers. In the approach we report on, the results of the classifiers are in turn evaluated by the labeling tools, such that the quality of both the extraction and labeling tool and the classifier is optimized in a bootstrapping approach.
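The bootstrapping idea can be illustrated with a toy sketch: a classifier trained on labeled examples proposes labels for new text, and a stand-in for the rule-based labeling tool validates the proposals before they are fed back as training data. All names and the miniature "tools" here are invented for illustration; they are not the project's actual software.

```python
# Toy sketch of the bootstrapping loop between classifier and labeling
# tool (all names invented). The "classifier" predicts a reading from
# the final character of a word; the "labeling tool" stands in for the
# rule-based extraction/labeling software and validates proposals.
from collections import Counter, defaultdict

def label_tool(word):
    """Stand-in for the rule-based labeling tool (illustrative rule)."""
    return "event" if word.endswith("ung") else "object"

def train(labeled):
    """Toy classifier: majority label per final character."""
    by_suffix = defaultdict(Counter)
    for word, label in labeled:
        by_suffix[word[-1]][label] += 1
    return {s: c.most_common(1)[0][0] for s, c in by_suffix.items()}

def bootstrap(seed, unlabeled, rounds=3):
    labeled = list(seed)
    for _ in range(rounds):
        model = train(labeled)
        remaining = []
        for word in unlabeled:
            guess = model.get(word[-1])
            # the labeling tool evaluates the classifier's proposal in turn
            if guess is not None and guess == label_tool(word):
                labeled.append((word, guess))   # feed back as training data
            else:
                remaining.append(word)
        unlabeled = remaining
    return labeled

seed = [("Lesung", "event"), ("Haus", "object")]
labeled = bootstrap(seed, ["Prüfung", "Baum"])
# "Prüfung" is labeled "event"; "Baum" stays unlabeled (no evidence)
```

In a real setting the classifier would of course be a trained statistical or neural model and the validation step the project's labeling tool; the loop structure, in which each component checks the other's output, is the point of the sketch.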

5 Representing underspecification – DIN and a database

– Kerstin Jung, University of Stuttgart

Inspecting lexical ambiguities, such as the ambiguity between object, event and state readings of German -ung nominalizations, involves the inspection of many examples. In project B3 of SFB 732, in addition to corpus-based examples, a set of analyses from several tools also had to be explored and stored.

Since most linguistic representation tools are based on completely specified analyses and do not provide means to natively store and compare several analyses from one layer (e.g. syntax) for the same example, the relational database B3DB emerged from the project. Starting from a detailed set of tables for texts, paragraphs and sentences, the database developed into a generic structure with two layers: a macro layer for process description and a micro layer for a graph representation of analyses, including their degree of (under)specification. The design of the micro layer is in turn closely connected to the development of the annotation representation standards by ISO's TC 37/SC 4, the subcommittee Language resource management of the technical committee Language and terminology. This subcommittee is mirrored by the Arbeitsausschuss Sprachressourcen of DIN's Normenausschuss Terminologie. With its roots both in application and in standardization, the B3DB is one of the outcomes of project SFB 732 B3, and with its focus on joint representation, process metadata and comparative exploration it has been applied in further projects, such as the combination of information status and prosody.
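As an illustration of the micro-layer idea, the following sketch stores several analyses of the same example on the same layer side by side, so they can be compared rather than overwritten. The field names and the toy graph encoding are assumptions for illustration, not the actual B3DB schema.

```python
# Hypothetical sketch of a micro-layer record in the spirit of B3DB:
# several graph-shaped analyses of the same example can be stored per
# layer and compared. Field names and the edge encoding are invented.
from dataclasses import dataclass, field

@dataclass
class Analysis:
    tool: str               # which tool produced the analysis
    layer: str              # e.g. "syntax", "semantics"
    edges: frozenset        # graph edges: (head, relation, dependent)
    underspecified: bool    # does the analysis leave readings open?

@dataclass
class Example:
    sentence: str
    analyses: list = field(default_factory=list)

    def on_layer(self, layer):
        """All stored analyses of one layer, side by side for comparison."""
        return [a for a in self.analyses if a.layer == layer]

ex = Example("Die Prüfung dauerte lange.")
ex.analyses.append(Analysis("parserA", "syntax",
                            frozenset({("dauerte", "subj", "Prüfung")}), False))
ex.analyses.append(Analysis("parserB", "syntax",
                            frozenset({("dauerte", "sb", "Prüfung")}), False))
# both syntax analyses are kept, rather than one overwriting the other
```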

6 Collocations and dictionaries – then, now and in future

– Rufus Gouws, Stellenbosch University

The past four decades have witnessed an active discussion of the presentation and treatment of collocations in dictionaries. These discussions have been characterized by significant differences, a number of suggested changes and improvements, a focus on dictionaries in the transition from the printed to the online medium, and a variety of approaches regarding the status of collocations as items in dictionary articles. A constant factor in these deliberations has been the name of one of the major discussion partners, namely Ulrich Heid.

The appropriate choice of collocations often distinguishes mother-tongue speakers from speakers using a second or third language. As information tools directed at the real needs of real users, dictionaries need to take cognizance of problems regarding the choice and proper use of collocations. This applies to both monolingual and bilingual dictionaries, and to dictionaries dealing with language for general purposes as well as those directed at languages for special purposes. Attempts to ensure an optimal treatment of collocations in dictionaries remain a real challenge that demands innovative commitment on the level of both metalexicography and lexicographic practice. The planning of the treatment, however, should be preceded by a planning of the selection and presentation of collocations. In his in-depth research over many years, Ulrich Heid has committed himself to finding solutions that can meet this challenge. His work on the linguistic nature of collocations and their treatment in printed dictionaries was complemented by research into the possibilities that the transition to language technology, within the framework of digital humanities, could offer the (meta)lexicographer.

Collocations are at the center of this talk. The talk takes a contemplative look at the development of (meta)lexicographic approaches to collocations and the role Ulrich Heid has played in this regard. The contemplative look is supplemented by a transformative look that focuses on future possibilities and gives a re-appraisal of the status of collocations as items in dictionary articles. The need to elevate collocations from mere microstructural items addressed at the lemma sign to treatment units that function as the address of a variety of item types is discussed. With reference to the online environment, attention is drawn to the accommodation of collocations in specific search zones and to the introduction of a rapid access structure that can guide a user directly to collocations without using the lemma sign of the relevant article as guiding element.

7 Lexicographic Projects

– Theo Bothma, University of Pretoria

The paper will give a brief overview of various end-user guidance devices within the framework of the Function Theory of Lexicography for text production, text reception and the cognitive function, on the basis of the taxonomy the authors developed in conjunction with Ulrich Heid, based primarily on their earlier collaboration. Two devices for text production and text reception will be discussed in more detail, viz.:

1. For text production

(a) The Sepedi copulative decision tree

(b) The Sepedi writing assistant

The text production tools can be made accessible to end-users in different ways:

(a) Incorporated into an e-dictionary
(b) Incorporated into a word processor
(c) As a stand-alone tool

The Sepedi writing assistant makes use of data from the dictionary database, and the dictionary as such is not consulted by the end-user; the dictionary is therefore 'hidden' from the end-user. Such a database requires specific data and a specific structure; some issues around this will be discussed. End-users who would like more detailed information about any step in the above examples have, at each node, the option to drill down to more detailed explanations.

2. For text reception

(a) Amazon Kindle – linking texts to e-dictionaries and other information sources

(b) Texts linked to e-dictionaries and/or Google via the operating system or the browser, e.g. iOS, Android, and Chrome/Firefox on Windows

For text reception, the e-dictionary does not necessarily contain all the information the end-user may require to satisfy their information need (or an expanded, new information need), and the end-user may require access to additional information sources, e.g. Wikipedia, Google, and other Google tools.

To access more information (for the specific or expanded information need, to learn more about the specific aspect, or even simply out of inquisitiveness), end-users have the option to drill down, to filter information, to link to additional sources, etc. This expanded access fulfils the role of the cognitive function, i.e. the end-user can learn more about the topic at hand. All the additional information is available on demand, and the end-user is never exposed to an overload of information. The e-dictionary then becomes an information tool on a continuum of information tools. Examples will be provided of how such tools can be integrated into the information-seeking process.

We will then argue that these developments may have an effect on dictionary usage, as well as on e-lexicography in general, specifically from an end-user perspective, and may also have an effect on dictionary construction.

8 NLP Resources and Applications for the Southern African Languages: Zulu and Northern Sotho

– Sonja E. Bosch, University of South Africa; Gertrud Faaß, University of Hildesheim

While Natural Language Processing (NLP) in the so-called developed world moves towards artificial intelligence solutions such as neural networks that utilize billions of tokens of language samples to implement chatbots or other systems able to communicate with humans, NLP for the Southern African languages is lagging significantly behind. One of the reasons for this is that universities in Southern African countries do not normally offer curricula for computational linguistics studies; hence there are relatively few researchers involved in the development of NLP applications. Looking at Europe, the NLP research community does not seem very interested in relatively small African languages, while the European linguists' and African studies community seems to have missed the Digital Humanities boat completely. Progress was made in 2006, when the South African Department of Arts and Culture established an HLT Unit responsible for driving a new HLT strategy, which included supporting the research community with funds for collecting reusable resources and for developing appropriate NLP applications.

Some researchers in South Africa, however, had been collecting language data over a period of time. For instance, Danie J. Prinsloo (University of Pretoria, UP) began compiling, inter alia, Northern Sotho and Zulu corpora of the Southern African languages in the 1990s. One of the first NLP applications was developed by Sonja Bosch and Laurette Pretorius (both University of South Africa, UNISA) from the year 2000 on, resulting in an FST morphological analyser for Zulu.

In 2005 an opportunity arose to combine the expertise of European computational linguists and South African Bantu linguists, when Danie J. Prinsloo started an initiative to establish a research group together with Ulrich Heid in order to develop a Northern Sotho tagger. Subsequently, Ulrich Heid added Gertrud Faaß to the team. From 2006 on, Sonja Bosch (UNISA) and Elsabé Taljard (UP) joined, and since then not only a tagger but a number of applications have been collaboratively developed and relatively big corpora have been collected, laying a solid foundation for further NLP.

This talk summarizes corpora and NLP applications for the two Bantu languages Zulu and Northern Sotho that are freely available to the research community today. For a better overview, we will include resources produced by others, such as Uwe Quasthoff and his team at the University of Leipzig and the SADiLaR (South African Centre for Digital Language Resources) community in South Africa. We will also introduce our current work. The talk serves as an invitation to the NLP community to join in working towards NLP implementations for these Bantu languages.

9 DIY beyond hammer and nails: The T&O project and its impacts

– Michael Dorna, Bosch; Anna Hätty, University of Stuttgart

We report on the achievements and follow-up activities of the cooperation project "T&O - Terminology Extraction and Ontology Development". The project between the Institute for Natural Language Processing (IMS), University of Stuttgart, and Robert Bosch GmbH, Corporate Research, was led by Ulrich Heid during the years 2015 to 2017. The objective of the project was to automatically extract and structure terms in the DIY ('do-it-yourself') domain. The challenges started with a heterogeneous text basis with diverse registers and text sources (e.g. encyclopedias, handbooks, marketing texts, manuals and user-written instructions). The goal was to support different people's information needs, and we were faced with differing understanding and knowledge, e.g. of a DIY expert versus a layperson. As a consequence, we aimed for a broad characterization of terminology and for the creation of an ontology as a means to structure the non-standardized terminology. We developed a term extraction pipeline, starting from data preprocessing steps (normalization, automated linguistic annotation, lemma correction) and leading to term candidate pattern search and term ranking with termhood measures. Compound splitting, term variant detection, co-reference resolution and relation extraction were further applied as a basis for the ontology construction. We finally evaluated the term extractor's performance on annotated gold-standard data. The outcomes of the project were a range of publications, and it led to a dissertation in the field. The follow-up activities explored related research questions: among others, an intuitive lay understanding of terminology was empirically investigated, and models for the automatic detection of terminological meaning shifts were proposed. We extended the problem to several other domains, such as cooking, hunting, chess and automotive.
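One step of the pipeline above, ranking term candidates by a termhood measure, can be sketched as follows. The measure used here, a smoothed ratio of relative frequencies in a domain corpus versus a general corpus (often called "weirdness"), is a common textbook choice and not necessarily the measure used in T&O; the counts are toy data.

```python
# Sketch of term ranking with a simple termhood measure: the smoothed
# ratio of a candidate's relative frequency in a domain corpus to its
# relative frequency in a general reference corpus.
from collections import Counter

def termhood(candidate, domain, general):
    """Relative-frequency ratio; add-one smoothing avoids division by zero."""
    d_rel = domain[candidate] / sum(domain.values())
    g_rel = (general[candidate] + 1) / sum(general.values())
    return d_rel / g_rel

domain = Counter({"drill": 30, "screw": 25, "the": 300})    # toy DIY corpus
general = Counter({"drill": 2, "screw": 3, "the": 5000})    # toy general corpus

ranked = sorted(domain, key=lambda c: termhood(c, domain, general),
                reverse=True)
# domain-specific words rank above general-language words like "the"
```

Real pipelines would apply this ranking to the candidates found by pattern search over lemmatized, annotated text, rather than to raw token counts.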

10 Specialised Lexicography and Term Variation

– Laura Giacomini, University of Hildesheim and University of Heidelberg; Theresa Kruse, University of Hildesheim

In our talk we introduce two lexicographic projects dealing with terminology variation in specialised language. The two projects explore the way in which synonymous variants belonging to the languages of technology and of mathematics can be formally classified, annotated, extracted from corpora, and presented in LSP e-dictionaries.

The first project, carried out by Laura Giacomini, focuses on an ontology- and frame-based approach to terminology, which provides deep insights into salient semantic features of technical language, enables detailed corpus analysis and processing, and supports term variant extraction. At the core of the project is also a novel variation typology, accounting for orthographic, morphological, and syntactic alternations. A range of methods for automatically extracting different variant types from specialised corpora has been developed over recent years in close cooperation with the computational linguistics team led by Ulrich Heid at Hildesheim University. Term variation is, of course, one of the key microstructural items of the technical e-dictionary modelled in the framework of this project and intended for semi-expert users.

The aim of the second project, by Theresa Kruse, is to create an electronic dictionary for the mathematical domain of graph theory, in cooperation with the Institute for Mathematics and Applied Informatics at Hildesheim University. The idea is to apply a pattern-based approach in which the structure of mathematical definitions is used to extract the terminology and to directly build an electronic dictionary with a related ontology. The dictionary is intended to help students improve their usage of terminology.
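The pattern-based idea can be sketched with a few invented regular expressions over definitional sentences: mathematical definitions follow recurring surface patterns ("… is called X if …") that simple patterns can capture. The expressions below are illustrative assumptions, not the project's actual patterns, and real definitions would of course require far richer ones.

```python
# Illustrative sketch of pattern-based term extraction from mathematical
# definitions. The patterns are invented examples.
import re

PATTERNS = [
    re.compile(r"is called (?:an? )?([\w-]+)"),   # "... is called bipartite ..."
    re.compile(r"\bA ([\w-]+) is a graph"),       # "A tree is a graph ..."
]

def extract_terms(definitions):
    """Collect every candidate term matched by any definition pattern."""
    terms = set()
    for sentence in definitions:
        for pattern in PATTERNS:
            terms.update(m.group(1) for m in pattern.finditer(sentence))
    return terms

definitions = [
    "A graph G is called bipartite if its vertices admit a 2-colouring.",
    "A tree is a graph that is connected and acyclic.",
]
terms = extract_terms(definitions)   # {"bipartite", "tree"}
```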

11 Goodread Goethe. Exploring social reading and writing media

– Gerhard Lauer, Universität Basel

On social platforms like "Wattpad" or "Goodreads", millions of mostly young people share their stories: fan fiction and teen fiction as well as classics, surrounded by many commentaries. While many people fear that the young generation reads less and has lost the ability for deep reading, a closer look into the reading world on social platforms draws a different picture. My talk introduces the possibilities of real-time data research in reading science. New kinds of big data have to be handled, computer-based methods have to be scaled up and adapted to the new research area, and theories must be developed to understand the wealth of data the new research encounters.

12 Digital Humanities

– Anna Moskvina, Johannes Schäfer, Fritz Kliche, University of Hildesheim

This talk presents current research projects at the University of Hildesheim where various methods for inspecting the content of text data are developed. We apply our systems for term extraction, topic modelling and offensive language detection to a specialized corpus.

We gather both the text content and bibliographic data from digital scientific publications. From these, we select a subcorpus consisting of the publications with Ulrich Heid as author or co-author. We apply a computational linguistic toolchain to analyze this data set by means of (a) term extraction, comparing the author-specific corpus to a general language corpus and applying a ranking by statistical methods. In a second part we discuss our approach of (b) topic modeling to analyze the fields of research in Heid's publications, attempting to gain insight into their degree of transdisciplinarity in comparison to rather isolated areas of research. Furthermore, we apply a third system to this corpus, also currently being developed at the University of Hildesheim, meant to (c) detect offensive language in social media posts. We discuss the process metadata required for the creation of this corpus (and its analysis) and present an overview of the results of our three systems.

Acknowledgements

We cordially thank the university management of the Universität Hildesheim for the generous financial support of this workshop.

Imprint

Universität Hildesheim
Institut für Informationswissenschaft und Sprachtechnologie
Arbeitsgruppe Computerlinguistik

Edited by Theresa Kruse and Fritz Kliche

Hildesheim, June 2019