
Dealing with Lexicon Acquired from Comparable Corpora: post-edition and exchange

Jun 11, 2015


Estelle Delpech

Material presented at the TKE (Terminology and Knowledge Engineering) Conference 2010, Dublin, Ireland.
Download paper at http://hal.archives-ouvertes.fr/hal-00544403
Institutions: Laboratoire d'Informatique de Nantes Atlantique (LINA), Lingua et Machina.
Transcript
Page 1: Dealing with Lexicon Acquired from Comparable Corpora: post-edition and exchange

Dealing with Lexicon Acquired from Comparable Corpora

Post-edition and Exchange

Estelle Delpech, Lingua et Machina
Béatrice Daille, U. de Nantes - LINA

1/23

Page 2: Dealing with Lexicon Acquired from Comparable Corpora: post-edition and exchange

Working w/ lexicon acquired from comparable corpora

I. Terminology acquisition from comparable corpora : quick overview

II. A tool for terminology post-edition

III. Data exchange : a TBX variant for automatically acquired lexicons

IV. Future work


Page 3: Dealing with Lexicon Acquired from Comparable Corpora: post-edition and exchange

Part I

Terminology Acquisition from Comparable Corpora


Page 4: Dealing with Lexicon Acquired from Comparable Corpora: post-edition and exchange

Terminology acquisition from comparable corpora

Comparable corpora:

“Two corpora, in two languages l1 and l2 respectively, are said to be ‘comparable’ if there exists a substantial part of the vocabulary of the corpus in language l1 whose translation can be found in the corpus in language l2.”

(my translation of [Déjean and Gaussier, 2002])

Advantages:
- Availability
- Real usage

Page 5: Dealing with Lexicon Acquired from Comparable Corpora: post-edition and exchange

Terminology acquisition from comparable corpora

Terminology extraction: a contextual analysis
- Compare the contexts of source and target terms
- If the contexts are similar, there is a good chance that the source and target terms are translations of each other, e.g.:

mastectomy: reconstruction, prophylactic, treat, undergo, removal

mastectomie: reconstruction, prophylactique, traiter, subir, ablation
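The contextual comparison on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' system: the seed dictionary, the co-occurrence counts, the extra candidate "radiographie", and the choice of cosine similarity are all invented for the example.

```python
import math

# Hypothetical seed dictionary (English -> French) used to carry
# source-language context words over into the target language.
SEED_DICT = {
    "reconstruction": "reconstruction",
    "prophylactic": "prophylactique",
    "treat": "traiter",
    "undergo": "subir",
    "removal": "ablation",
}


def translate_context(context, seed_dict):
    """Map source context counts into the target language; unknown words are dropped."""
    translated = {}
    for word, count in context.items():
        if word in seed_dict:
            target = seed_dict[word]
            translated[target] = translated.get(target, 0) + count
    return translated


def cosine(u, v):
    """Cosine similarity between two sparse count vectors stored as dicts."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_candidates(source_context, target_contexts, seed_dict):
    """Return (score, target term) pairs, best candidate first."""
    translated = translate_context(source_context, seed_dict)
    scored = [(cosine(translated, ctx), term) for term, ctx in target_contexts.items()]
    return sorted(scored, reverse=True)


# Toy co-occurrence counts, invented for illustration.
mastectomy = {"reconstruction": 3, "prophylactic": 2, "treat": 1, "undergo": 2, "removal": 1}
french_contexts = {
    "mastectomie": {"reconstruction": 3, "prophylactique": 2, "traiter": 1, "subir": 2, "ablation": 1},
    "radiographie": {"image": 4, "subir": 1},
}

ranking = rank_candidates(mastectomy, french_contexts, SEED_DICT)
```

With these toy counts, "mastectomie" is ranked first because its context vector matches the translated context of "mastectomy", while "radiographie" shares only one context word.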

Page 6: Dealing with Lexicon Acquired from Comparable Corpora: post-edition and exchange

Terminology acquisition from comparable corpora

Results
- Not as good as acquisition from parallel corpora!
- Fung (1997): 30% accuracy on the Top20 candidates
- Morin et al. (2004): translation is usually ranked 34th for complex terms

Outputs one-to-many alignments, e.g. for "mastectomy":
0.92 ablation
0.89 mastectomie
0.48 opération

Evaluation: precision on the Top N best alignments
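Precision on the Top N best alignments can be computed as sketched below. The scores are those shown on the slide; the reference translation set and the function name `precision_at_n` are assumptions made for the example, and the exact evaluation protocol used by the authors is not given in the slides.

```python
def precision_at_n(alignments, reference, n):
    """Share of source terms with a reference translation among their n best candidates."""
    if not alignments:
        return 0.0
    hits = 0
    for source, candidates in alignments.items():
        # Sort candidates by decreasing score and keep the n best target terms.
        best = sorted(candidates, key=lambda pair: pair[1], reverse=True)[:n]
        top_terms = [term for term, _score in best]
        if any(term in reference.get(source, set()) for term in top_terms):
            hits += 1
    return hits / len(alignments)


# One-to-many alignment taken from the slide.
alignments = {
    "mastectomy": [("ablation", 0.92), ("mastectomie", 0.89), ("opération", 0.48)],
}
# Hypothetical gold standard for the example.
reference = {"mastectomy": {"mastectomie"}}

top1 = precision_at_n(alignments, reference, 1)
top2 = precision_at_n(alignments, reference, 2)
```

Here the expected translation is only the second-best candidate, so it counts as a miss at N = 1 and as a hit at N = 2. This is why a post-edition step that lets a user inspect several candidates is useful.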

Page 7: Dealing with Lexicon Acquired from Comparable Corpora: post-edition and exchange

Part II

A Tool for Post-edition


Page 8: Dealing with Lexicon Acquired from Comparable Corpora: post-edition and exchange

A tool for post-edition

Existing tools:
- iView (Merkel and Foo, 2007)
- ArayaTermExtractor (Waldhör, 2006)
- Xerox Terminology Suite ®

Our needs:
- Deal with one-to-many alignments
- Non-aligned contexts
- Allow non-binary annotation
- Display useful information to help find the right candidate in the corpus

Page 9: Dealing with Lexicon Acquired from Comparable Corpora: post-edition and exchange

“Useful” information

→ Knowledge that helps capture the in vivo behavior of terms

→ Text-driven, term-oriented approach

Useful information:
- Variants
- Collocations
- Distributional neighbors
- Contexts

→ To be harvested during the term extraction / alignment process

Page 10: Dealing with Lexicon Acquired from Comparable Corpora: post-edition and exchange

Useful information : example

Mastectomy
risk-reducing ~, simple ~
Tumorectomy, Lumpectomy, Oophorectomy
...patient may choose to have risk-reducing bilateral mastectomy if they have a strong family history of breast cancer...

Mastectomie
~ préventive, ~ simple
Tumorectomie, Ablation, Opération
...la mastectomie préventive pourrait supprimer la grande majorité du risque de développer un cancer...
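One way to hold this per-term information is a small record type, sketched below. The class name `TermRecord` and its field names are hypothetical. Filing the "~" patterns under collocations and the three related terms under distributional neighbors is also an assumption, since the slide does not label its rows.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TermRecord:
    """Text-driven information about one term, harvested during extraction/alignment."""
    term: str
    language: str
    variants: List[str] = field(default_factory=list)
    collocations: List[str] = field(default_factory=list)  # "~" stands for the term
    neighbors: List[str] = field(default_factory=list)     # distributional neighbors
    contexts: List[str] = field(default_factory=list)

    def expanded_collocations(self) -> List[str]:
        """Replace the "~" placeholder with the term itself."""
        return [pattern.replace("~", self.term) for pattern in self.collocations]


en = TermRecord(
    term="mastectomy",
    language="en",
    collocations=["risk-reducing ~", "simple ~"],
    neighbors=["tumorectomy", "lumpectomy", "oophorectomy"],
    contexts=["...patient may choose to have risk-reducing bilateral mastectomy..."],
)
fr = TermRecord(
    term="mastectomie",
    language="fr",
    collocations=["~ préventive", "~ simple"],
    neighbors=["tumorectomie", "ablation", "opération"],
    contexts=["...la mastectomie préventive pourrait supprimer la grande majorité du risque..."],
)

en_phrases = en.expanded_collocations()
fr_phrases = fr.expanded_collocations()
```

Keeping the source and target records side by side is what lets a post-editor compare a candidate pair at a glance, for instance "risk-reducing mastectomy" against "mastectomie préventive".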

Page 11: Dealing with Lexicon Acquired from Comparable Corpora: post-edition and exchange

Post-edition interface

http://80.82.238.151/Metricc/InterfaceValidation (user “test”, no password)

Page 12: Dealing with Lexicon Acquired from Comparable Corpora: post-edition and exchange

Part III

Data Exchange: a TBX variant for automatically acquired lexicons

Page 13: Dealing with Lexicon Acquired from Comparable Corpora: post-edition and exchange

Quick introduction to TBX (1)

TBX: TermBase eXchange
- Open, XML-based standard for exchanging structured terminological data
- Approved as an international standard by LISA and ISO (ISO 30042)
- Maps to the TMF data model
- Subset of MARTIF
- Designed for various use cases
- Customizable

Page 14: Dealing with Lexicon Acquired from Comparable Corpora: post-edition and exchange

Quick introduction to TBX (2)

2 components:
- Structure: core structure based on the TMF metamodel
- Content: formalism to express data categories and their constraints

[Figure, adapted from ISO 30042:2008, Fig. 4, p. 30: the core DTD/schema defines the form and an XCS file defines the content. The core with the default XCS gives default TBX; the core with XCS1 ... XCSn gives TBX variants 1 ... n.]

Page 15: Dealing with Lexicon Acquired from Comparable Corpora: post-edition and exchange

Quick introduction to TBX (3)

[Figure, taken from ISO 30042:2008, Fig. 1, p. 9. Data categories labeled in the figure: responsibility, respPerson, termType, usageNote, corpusTrace, reliabilityCode, partOfSpeech.]

Form defined in DTD; content defined in XCS

Page 16: Dealing with Lexicon Acquired from Comparable Corpora: post-edition and exchange

TBX variant for lexicon acquired from comparable corpora

Default TBX data categories:
- termType: entryTerm, variant
- externalCrossReference, usageNote
- partOfSpeech, frequency, reliabilityCode...
- transactionType, responsibility

+ Customized data categories:
- occurrences, occurrenceCount
- relatedTerm
- termDefinition, definitionRelevance
- ntigReference
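To give a rough idea of how a default and a customized data category might sit together in a term entry, here is a sketch built with Python's standard `xml.etree.ElementTree`. The element nesting follows the usual TBX layout (termEntry, langSet, ntig, termGrp, term). Carrying the customized occurrenceCount category in a `termNote` is an assumption, since the actual markup of the variant is not shown in this transcript, and the identifier and count are invented values.

```python
import xml.etree.ElementTree as ET

# Clark notation for the xml:lang attribute; ElementTree serializes it as xml:lang.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_term_entry(entry_id, lang, term, part_of_speech, occurrence_count):
    """Build a TBX-like termEntry holding one term with two data categories."""
    entry = ET.Element("termEntry", {"id": entry_id})
    lang_set = ET.SubElement(entry, "langSet", {XML_LANG: lang})
    ntig = ET.SubElement(lang_set, "ntig")
    term_grp = ET.SubElement(ntig, "termGrp")
    ET.SubElement(term_grp, "term").text = term
    # Default data category.
    ET.SubElement(term_grp, "termNote", {"type": "partOfSpeech"}).text = part_of_speech
    # Customized data category; its placement here is hypothetical.
    ET.SubElement(term_grp, "termNote", {"type": "occurrenceCount"}).text = str(occurrence_count)
    return entry


# Invented identifier and count, for illustration only.
entry = build_term_entry("entry-1", "en", "mastectomy", "noun", 42)
xml_text = ET.tostring(entry, encoding="unicode")
```

In a real TBX variant, the customized categories would also have to be declared in the XCS file so that the content can be validated against them.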

Page 17: Dealing with Lexicon Acquired from Comparable Corpora: post-edition and exchange

TBX variant : A term entry


Page 18: Dealing with Lexicon Acquired from Comparable Corpora: post-edition and exchange

TBX variant : 1-to-n alignments


Page 19: Dealing with Lexicon Acquired from Comparable Corpora: post-edition and exchange

TBX variant : approved alignment


Page 20: Dealing with Lexicon Acquired from Comparable Corpora: post-edition and exchange

Feedback on TBX

TBX is made for stable terminologies with little uncertainty about the status of translations, not for machine-generated lexicons of “candidate translations”:
- difficult to separate a term and its properties from its alignments
- no data category specific to automatically estimated reliability

It is difficult to make text-driven, term-oriented knowledge fit into a concept-oriented format:
- no definition category that would apply to a single term rather than to the whole concept

Page 21: Dealing with Lexicon Acquired from Comparable Corpora: post-edition and exchange

Conclusion

Future work


Page 22: Dealing with Lexicon Acquired from Comparable Corpora: post-edition and exchange

Future work

- Integration of the prototype in Libellex
  - TBX import / export
  - editing of linguistic properties
- User testing (ergonomics)
- Evaluation of added value for translation
- Explore new ways of:
  - aligning terms
  - selecting contexts

Page 23: Dealing with Lexicon Acquired from Comparable Corpora: post-edition and exchange

References

Post-edition prototype online : http://80.82.238.151/Metricc/InterfaceValidation/ (user “test”, no password)

Metricc project : http://www.metricc.com/

Lingua et Machina : http://www.lingua-et-machina.com/

Comparable corpora : Déjean, H., Gaussier, É. (2002) : “Une nouvelle approche à l'extraction de lexiques bilingues à partir de corpus comparables”, In Lexicometrica, Alignement Lexical dans les corpus multilingues, pp.1-22.

ArayaTermExtractor : http://www.heartsome.de

Xerox Terminology Suite : http://www.temis.com/

iView : Nyström, M., Merkel, M., Ahrenberg, L., Zweigenbaum, P., Petersson, H. and Åhlfeldt, H. (2006) : “Creating a medical English-Swedish dictionary using interactive word alignment”, In BMC Medical Informatics and Decision Making, 6:35.

TMF : ISO 16642 - Terminological markup framework

TBX : ISO 30042 - Systems to manage terminology, knowledge and content -- TermBase eXchange (TBX)

Data categories : ISO 12620 - Terminology and other language and content resources -- Specification of data categories and management of a Data Category Registry for language resources