August 20, 2011 Stéphane HUET and Philippe LANGLAIS NLPCS 2011 - Copenhagen Identifying the Translations of Idiomatic Expressions using TransSearch

August 20, 2011

Stéphane HUET and Philippe LANGLAIS

NLPCS 2011 - Copenhagen

Identifying the Translations of Idiomatic Expressions using TransSearch


Idiomatic Expressions

• Oxford Companion to the English Language
  – Idioms are expressions of a given language, whose sense is not predictable from the meanings and arrangement of their elements
  – To fight like cat and dog
  – It rains cats and dogs


The problem of idiomatic expressions

• Numerous in most languages
• Have idiosyncratic meanings that disturb
  – Non-native speakers
  – NLP
• In Machine Translation (MT)
  – Group multi-word expressions before the alignment process [Lambert and Banchs 05]
  – Add a new feature encoding the fact that a phrase is a multi-word expression [Carpuat and Diab 10]


Idiomatic expressions and MT



Objectives of the study

• The ability of the bilingual concordancer TSRali to retrieve the translations of idiomatic expressions

• Practical issues in querying such a system


Outline

• Introduction
• TransSearch
• Experimental setup
• Evaluations
• Conclusion


TransSearch

• Available on the Web since 1996
• Developed by the Université de Montréal
• Subscribed to by many professional translators in Canada
• 7.2 M queries over 6 years
• Exploits an English-French translation memory
• Incorporates word alignment technology


User interface

1. Retrieve sentence pairs
2. Spot translations
3. Identify the list of translations


Alignment and translation

• Word-based alignment (IBM)
    This is in keeping with that strategy .
    La présente mesure est conforme à cette stratégie .

• Translation spotting
  – Constrained to contiguous word alignment
    La présente mesure est conforme à cette stratégie .
    This is in keeping with that strategy .
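
The contiguity constraint can be illustrated with a minimal sketch. It assumes word-alignment links are already available as (source index, target index) pairs; the links below are written by hand for the slide's sentence pair and the function name `spot_translation` is hypothetical. This is not TransSearch's actual spotting algorithm, only one simple way to turn word links into a contiguous target span.

```python
def spot_translation(src_tokens, tgt_tokens, links, query):
    """Return the contiguous target span covering every word linked to the query.

    links: iterable of (source_index, target_index) word-alignment pairs.
    Returns None if the query is absent or none of its words is aligned.
    """
    n = len(query)
    # Locate the first occurrence of the query in the source sentence.
    start = next(
        (i for i in range(len(src_tokens) - n + 1) if src_tokens[i:i + n] == query),
        None,
    )
    if start is None:
        return None
    query_positions = set(range(start, start + n))
    # Collect target positions linked to any query word.
    tgt_positions = [t for s, t in links if s in query_positions]
    if not tgt_positions:
        return None
    # Contiguity: take everything between the leftmost and rightmost linked word.
    lo, hi = min(tgt_positions), max(tgt_positions)
    return tgt_tokens[lo:hi + 1]


src = "This is in keeping with that strategy .".split()
tgt = "La présente mesure est conforme à cette stratégie .".split()
# Hand-written, hypothetical alignment links (source index, target index).
links = [(0, 0), (0, 1), (0, 2), (1, 3), (2, 4), (3, 4), (4, 5), (5, 6), (6, 7), (7, 8)]

spotted = spot_translation(src, tgt, links, ["in", "keeping", "with"])
print(" ".join(spotted))
```

Taking the full span between the leftmost and rightmost linked words is the simplest way to guarantee a contiguous answer; a real system would also have to handle unaligned words and noisy links.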


Post-processing steps

• Objective: to have relevant and informative translations in the top list

• Filtering of bad translations
  – Supervised classifier
  – Features: alignment probabilities, POS tags

• Merging of similar translations
  – Inflectional forms of the same canonical words
      conforme à / conforme aux
  – Differences limited to grammatical words or punctuation
      à l'encontre de / à l'encontre
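
The merging step can be sketched as grouping translations under a normalized key. The lemma table and the list of ignored words below are tiny hypothetical stand-ins chosen to cover the two examples on this slide; the resources and rules actually used by TransSearch are not described here.

```python
from collections import defaultdict

# Hypothetical resources, for illustration only.
LEMMAS = {"aux": "à"}        # inflected form -> canonical form
IGNORED = {"de", ",", "."}   # grammatical words and punctuation to ignore


def merge_key(translation):
    """Normalize a translation: lowercase, drop ignored tokens, map to canonical forms."""
    tokens = translation.lower().split()
    return tuple(LEMMAS.get(tok, tok) for tok in tokens if tok not in IGNORED)


def merge_similar(translations):
    """Group translations sharing the same normalized key, keeping first-seen order."""
    groups = defaultdict(list)
    for translation in translations:
        groups[merge_key(translation)].append(translation)
    return list(groups.values())


variants = ["conforme à", "conforme aux", "à l'encontre de", "à l'encontre"]
merged = merge_similar(variants)
for group in merged:
    print(" / ".join(group))
```

Each group could then be shown as a single entry in the list of suggested translations, which leaves room in the top of the list for genuinely different translations.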


Types of queries

• Verbatim queries: “normal” queries
  – is still in its infancy

• Ellipses: for discontinuous expressions
  – is .. in its infancy

• Dictionary queries: for morphological expansions
  – be+ still in its+ infancy

• Bilingual queries: to check translations
  – En: is still in its infancy
    Fr: en est encore à ses premiers balbutiements
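
To make the `..` and `+` operators concrete, here is a rough sketch that compiles a monolingual query into a regular expression. Several points are assumptions rather than documented TransSearch behaviour: the inflection table `INFLECTIONS` is a hand-made stand-in for a morphological dictionary, `its+` is assumed to expand to other possessive determiners, and `..` is assumed to allow up to `max_gap` intervening words.

```python
import re

# Hypothetical inflection table standing in for a real morphological dictionary.
INFLECTIONS = {
    "be": ["be", "am", "is", "are", "was", "were", "been", "being"],
    "its": ["its", "his", "her", "their"],
}


def compile_query(query, max_gap=3):
    """Compile a query using '..' (ellipsis) and 'word+' (expansion) into a regex."""
    pattern = ""
    gap_pending = False
    for tok in query.split():
        if tok == "..":
            gap_pending = True
            continue
        if tok.endswith("+"):
            base = tok[:-1]
            forms = INFLECTIONS.get(base, [base])
            piece = "(?:" + "|".join(re.escape(f) for f in forms) + ")"
        else:
            piece = re.escape(tok)
        if pattern:
            pattern += r"\s+"
            if gap_pending:
                # Assumption: an ellipsis tolerates 0 to max_gap extra words.
                pattern += r"(?:\S+\s+){0,%d}" % max_gap
        pattern += piece
        gap_pending = False
    return re.compile(r"\b" + pattern + r"\b", re.IGNORECASE)


verbatim = compile_query("is still in its infancy")
ellipsis = compile_query("is .. in its infancy")
dictionary = compile_query("be+ still in its+ infancy")

print(bool(verbatim.search("This technology is still in its infancy .")))
print(bool(ellipsis.search("This field is really in its infancy .")))
print(bool(dictionary.search("These projects are still in their infancy .")))
```

A bilingual query would apply one such pattern to each side of a sentence pair and keep only the pairs where both match.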



Resources

• Translation memory
  – Canadian Hansards (1986-2007)
  – 8.3 M sentence pairs

• Idiom lexicon
  – French-English phrase book
  – 1,467 expressions
  – Some entries with 2 or 3 translations


Types of idiomatic expressions

• 2% are expressed in informal language
  – She's well-upholstered.
  – Il roule des mécaniques.

• 99% are used in the context of a sentence
  – It's fantastic to bop till you drop.

• 80% are verbal phrases used in their inflected forms
  – I slept like a log.

• 20% are fixed expressions
  – When there's a will, there's a way


Manual preprocessing

• Annotation of words judged as extra information
  – They put the new salesman through his paces.

• Type of extra information words
  – Modal verbs: can, must
  – Semi-modal verbs: am going to, are likely to
  – Catenative verbs: want to, keep
  – Adverbial phrases: in Italy, when he heard the news
  – Noun phrases: this poet, his latest book


Number of queries found in the TM

Query type                                Bilingual    EN     FR
Verbatim queries                                 36   136    248
+ manual removal of extra words                  91   302    410
+ removal of extra pronoun                      106   381    509
+ verb lemmatization                            210   624    650
+ pronoun and determiner lemmatization          238   700    705

Example queries at each step:

• Verbatim queries
  – EN: I have no axe to grind
  – FR: Je ne prêche pas pour ma paroisse
• + manual removal of extra words
  – EN: I have .. axe to grind
  – FR: Je .. prêche .. pour ma paroisse
• + removal of extra pronoun
  – EN: have .. axe to grind
  – FR: prêche .. pour ma paroisse
• + verb lemmatization
  – EN: have+ .. axe to grind
  – FR: prêcher+ .. pour sa paroisse
• + pronoun and determiner lemmatization
  – EN: have+ .. axe to grind
  – FR: prêcher+ .. pour sa+ paroisse
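
The successive relaxations (extra words replaced by `..`, subject pronoun dropped, then verbs and determiners lemmatized with `+`) can be sketched as a small query generator. The `Token` annotation scheme is hypothetical: in the study the extra words were marked by hand, and nothing here reflects the authors' actual tooling. The lemma `sa` for `ma` simply follows the canonical form used in the example queries.

```python
from dataclasses import dataclass


@dataclass
class Token:
    form: str
    lemma: str
    kind: str = "other"    # "verb", "det" or "other" (hypothetical tag set)
    extra: bool = False    # manually annotated as extra information
    subject: bool = False  # subject pronoun of the expression


def render(tokens, drop_extra=False, drop_subject=False, lemmatize=()):
    """Render one query variant from annotated tokens."""
    out = []
    for tok in tokens:
        if drop_subject and tok.subject:
            continue
        if drop_extra and tok.extra:
            # Replace a run of extra words by one '..', never at the query start.
            if out and out[-1] != "..":
                out.append("..")
            continue
        out.append(tok.lemma + "+" if tok.kind in lemmatize else tok.form)
    while out and out[-1] == "..":
        out.pop()  # a trailing ellipsis carries no constraint
    return " ".join(out)


def relaxations(tokens):
    """Return the five query variants, from verbatim to fully lemmatized."""
    return [
        render(tokens),
        render(tokens, drop_extra=True),
        render(tokens, drop_extra=True, drop_subject=True),
        render(tokens, drop_extra=True, drop_subject=True, lemmatize=("verb",)),
        render(tokens, drop_extra=True, drop_subject=True, lemmatize=("verb", "det")),
    ]


english = [
    Token("I", "I", subject=True),
    Token("have", "have", kind="verb"),
    Token("no", "no", extra=True),
    Token("axe", "axe"),
    Token("to", "to"),
    Token("grind", "grind"),
]
french = [
    Token("Je", "je", subject=True),
    Token("ne", "ne", extra=True),
    Token("prêche", "prêcher", kind="verb"),
    Token("pas", "pas", extra=True),
    Token("pour", "pour"),
    Token("ma", "sa", kind="det"),
    Token("paroisse", "paroisse"),
]

en_queries = relaxations(english)
fr_queries = relaxations(french)
for query in en_queries:
    print(query)
print(fr_queries[-1])
```

Each variant is less constrained than the previous one, which is consistent with the number of queries found in the TM growing at every step.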



Evaluation using the phrase book

• 700 English queries found in the TM
  – 36 sentence pairs per query
  – 13 suggested translations

• 705 French queries found in the TM
  – 32 sentence pairs per query
  – 15 suggested translations

• Evaluation restricted to the 238 entries whose English and French sides occur in the same sentence pair


Recall measured using the phrase book

• For many queries, TransSearch displays relevant translations absent from the reference
  – est nébuleux displayed after the reference être dans un état second for to be in a daze
  – 34 correct translations displayed for to be around the corner

Rank               1      3      5     all
English queries   41.6   59.2   65.1   74.8
French queries    41.6   54.6   62.6   76.5
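
One plausible reading of recall at a given rank is the share of queries for which a reference translation appears among the top k suggestions. The sketch below implements that reading; the exact definition used in the study is an assumption, and the ranked lists are toy data (the second query's suggestions are invented for illustration).

```python
def recall_at_k(results, references, k=None):
    """Share of queries with at least one reference translation in the top k.

    results:    query -> ranked list of suggested translations
    references: query -> set of reference translations (phrase book)
    k=None means the whole list is considered ("all").
    """
    hits = 0
    for query, suggestions in results.items():
        top = suggestions if k is None else suggestions[:k]
        if any(s in references[query] for s in top):
            hits += 1
    return hits / len(results)


# Toy data: the ranked suggestions are illustrative, not actual system output.
results = {
    "to be in a daze": ["être dans un état second", "est nébuleux"],
    "is still in its infancy": [
        "en est à ses débuts",
        "en est encore à ses premiers balbutiements",
    ],
}
references = {
    "to be in a daze": {"être dans un état second"},
    "is still in its infancy": {"en est encore à ses premiers balbutiements"},
}

r1 = recall_at_k(results, references, k=1)
r_all = recall_at_k(results, references)
print(r1, r_all)
```

Under this definition a suggestion such as est nébuleux counts for nothing even when it is acceptable, which is why recall against a phrase book understates what the user actually sees.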


Manual evaluation

• 100 French queries
• 5 annotators, each judging 50 queries
• 3 labels: “correct”, “wrong”, “partial”
• Low inter-annotator agreement (Fleiss' kappa = 0.25)

Q: manger à tous les rateliers                                                  J1        J2        J3
slurps at everyone's trough                                                     correct   correct   correct
double-dipper                                                                   partial   correct   partial
them pot lickers and accusing them of being at the trough and pork barelling    wrong     partial   wrong
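
For reference, Fleiss' kappa can be computed as in the sketch below, which assumes every item is judged by the same number of annotators. The input is only the three judged translations from the example table, so the result (4/13, about 0.31) illustrates the formula and is not the 0.25 reported for the full evaluation.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list of labels (one per rater)."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    label_totals = Counter()
    agreement_sum = 0.0
    for item in ratings:
        counts = Counter(item)
        label_totals.update(counts)
        # Proportion of agreeing rater pairs for this item.
        agreement_sum += (sum(c * c for c in counts.values()) - n_raters) / (
            n_raters * (n_raters - 1)
        )
    observed = agreement_sum / n_items
    # Agreement expected by chance from the overall label distribution.
    expected = sum((c / (n_items * n_raters)) ** 2 for c in label_totals.values())
    return (observed - expected) / (1 - expected)


# Labels given by J1, J2, J3 to the three translations in the example table.
ratings = [
    ["correct", "correct", "correct"],
    ["partial", "correct", "partial"],
    ["wrong", "partial", "wrong"],
]
kappa = fleiss_kappa(ratings)
print(round(kappa, 3))
```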


• Average rank of the 1st translation labeled as correct by 1 annotator: 1.4

• For 97/100 queries, a correct translation is displayed

• Distribution of labels: correct 42%, partial 22%, wrong 36%


Conclusion

• 50% of the idioms of a phrase book found in the TM of TransSearch

• Users should use morphological (+) and proximity (..) operators for idioms

• Only 36% of the displayed translations were clearly wrong


Thank you for your attention