CLUE-Aligner
An Alignment Tool to Annotate Pairs of Paraphrastic and Translation Units
LREC - Portorož, May 2th 2016
ANABELA BARREIRO, INESC-ID
FRANCISCO RAPOSO, INESC-ID / UTL
TIAGO LUÍS, VOICEINTERACTION
Introduction
Alignment
• Set of correspondences or relationships between linguistic units that are semantico-syntactically related
  – Paraphrases (found within the same language = monolingual)
    • EN: to make a distinction between | EN: to distinguish between
  – Translations (found in different languages = bilingual)
    • EN: to keep it simple | PT: simplificar
Alignment task
• NLP task that consists of identifying translation or paraphrastic relationships among linguistic units (words, MWU or expressions) in sentence pairs that have been identified as paraphrases or translations of each other
Sure and Possible Alignments
• Sure alignments correspond to expressions/translations that satisfy the criteria for optimum/full equivalence
  – They are reciprocal: the expression can be translated from the source to the target language and vice versa
  – Optimum equivalence refers to the highest level of translation equivalence on both linguistic and extra-linguistic levels (Bayar, 2007)
  – venture capital markets | mercados de capital de risco (S)
• Possible alignments correspond to expressions/translations that satisfy the criteria for approximate equivalence
  – They do not meet all of the requirements for absolute equivalence, and they are not reciprocal with respect to the source/target language
  – began | a vu le jour (P), literally "has seen the day"
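The sure/possible distinction can be sketched as a small data structure. This is an illustrative sketch only, not CLUE-Aligner's internal format: the `Link` class, its field names and the helper functions are hypothetical, and the token positions are assigned by hand for the two examples on this slide.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Link:
    """Hypothetical alignment link between a source unit and a target unit."""
    src: FrozenSet[int]  # token positions in the source sentence
    tgt: FrozenSet[int]  # token positions in the target sentence
    kind: str            # "S" = sure, "P" = possible


def unit_text(tokens: List[str], positions: FrozenSet[int]) -> str:
    """Join the tokens at the given positions in sentence order."""
    return " ".join(tokens[i] for i in sorted(positions))


def describe(link: Link, src_tokens: List[str], tgt_tokens: List[str]) -> str:
    """Format a link in the 'source | target (S/P)' notation used on the slide."""
    left = unit_text(src_tokens, link.src)
    right = unit_text(tgt_tokens, link.tgt)
    return f"{left} | {right} ({link.kind})"


# Sure alignment: full equivalence, reciprocal.
en = "venture capital markets".split()
pt = "mercados de capital de risco".split()
sure = Link(frozenset({0, 1, 2}), frozenset({0, 1, 2, 3, 4}), "S")

# Possible alignment: approximate equivalence, not reciprocal.
en2 = ["began"]
fr = "a vu le jour".split()
possible = Link(frozenset({0}), frozenset({0, 1, 2, 3}), "P")

print(describe(sure, en, pt))
print(describe(possible, en2, fr))
```

Storing positions rather than strings keeps each link tied to its sentence pair, so the same unit can carry different labels in different contexts.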
Background
• Supervised learning uses high-quality alignments, hand-made by linguists (Blunsom & Cohn, 2006; Ambati et al., 2010)
  – Supervised methods take into consideration context, syntax and other grammatical and semantic information
• Guidelines for manual alignment:
  – English–French - Blinker project (Melamed, 1998)
  – Czech–English (Kruijff-Korbayová et al., 2006; Bojar & Prokopová, 2006)
  – Spanish–English (Lambert et al., 2005)
  – Paraphrase alignment guidelines (Callison-Burch et al., 2008)
Current Shortcomings
1. Lack of multilingual datasets
  – Publicly available alignments are mostly bilingual, with the exception of 6 multilingual sets (Graça et al., 2008)
2. Lack of linguistically motivated alignment guidelines
  – Previously proposed guidelines cover cross-linguistic phenomena superficially, excluding important alignment challenges presented by discontiguous MWU (DMWU) and other non-adjacent linguistic phenomena or syntactic discontinuity (e.g., extraposition, topicalization)
3. Lack of tools
  – Tools are inefficient with DMWU and phrasal expressions that are complex to align and require representation as non-contiguous block alignments
Related Alignment Tools
– Alpaco - Blinker project (Rassier & Pedersen, 2003)
– ICA - Interactive Clue Aligner (Tiedemann, 2003; 2004; 2011)
  *The "clue alignment approach" is based mainly on word-level alignment clues. Our approach is based on manual alignments of cross-language MWU and phrasal expressions, which allows representing semantically equivalent non-adjacent structures, such as DMWU, in translation and paraphrasing
– Yawat (Germann, 2008)
– SWIFT (Gilmanov et al., 2014)
– among others
CLUE-Aligner
• Interactive web alignment tool inspired by Linear-B (Callison-Burch & Bannard, 2004; Callison-Burch, 2007)
• Allows the block alignment of contiguous MWU and DMWU
• Uses a matrix visualization and a coloring scheme that help distinguish between sure and possible alignments
• Allows storage of pairs of paraphrastic units, with indication of the place of insertions, represented by "[ ]"
  – I urge [ ] to | Exorto [ ] a
  – This feature is valuable in the construction of translation rules or grammars and syntactic parsers that use those paraphrastic pairs, for which precision is important
  – It is also important in ML to help learn constituents
CLUE* = Cross-Language Unit Elicitation
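The "[ ]" insertion notation can be illustrated with a short sketch that renders a discontiguous unit from the token positions selected in each sentence. The function name `render_unit`, the example sentences and the chosen positions are assumptions made for illustration; this is not the tool's actual code or storage format.

```python
from typing import Iterable, List


def render_unit(tokens: List[str], positions: Iterable[int]) -> str:
    """Join the selected tokens, writing "[ ]" wherever positions are skipped."""
    out = []
    prev = None
    for i in sorted(positions):
        # A jump of more than one position means material was inserted here.
        if prev is not None and i > prev + 1:
            out.append("[ ]")
        out.append(tokens[i])
        prev = i
    return " ".join(out)


# Hypothetical sentence pair, chosen so that the units match the slide's example.
en = "I urge the Commission to act".split()
pt = "Exorto a Comissão a agir".split()

# Positions of the discontiguous unit in each sentence (assigned by hand).
pair = f"{render_unit(en, {0, 1, 4})} | {render_unit(pt, {0, 3})}"
print(pair)
```

A contiguous selection produces no placeholder, so the same function covers both contiguous MWU and DMWU.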
CLUE-Aligner Interface
Single Word Alignments and Block Alignments
[Screenshot of the alignment matrix; callouts: "insertion" (twice), "pre-processing of contracted forms", "still / ainda"]
• Black cells represent full/optimal semantic correspondence
• Grey cells represent approximate semantic correspondence
• Light orange cell groups represent unaligned P-insertions
• Dark orange cell groups represent unaligned S-insertions
Discontiguous Multiwords and Insertions
• Light green cells / cell groups represent aligned P-insertions
• Dark green cells / cell groups represent aligned S-insertions
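A plain-text approximation of the matrix view, as a rough sketch: rows are source tokens, columns are target tokens, "#" stands in for a black (sure) cell and "+" for a grey (possible) cell. The orientation, the symbols, the function name `render_matrix` and the sentence pair are all assumptions for illustration; the insertion colors are not modelled.

```python
from typing import List, Set, Tuple

Cell = Tuple[int, int]  # (source position, target position)


def render_matrix(src: List[str], tgt: List[str],
                  sure: Set[Cell], possible: Set[Cell]) -> List[str]:
    """Return one text row per source token; columns follow the target tokens."""
    rows = []
    for i, word in enumerate(src):
        cells = []
        for j in range(len(tgt)):
            if (i, j) in sure:
                cells.append("#")   # black cell: full/optimal correspondence
            elif (i, j) in possible:
                cells.append("+")   # grey cell: approximate correspondence
            else:
                cells.append(".")   # unaligned cell
        rows.append(f"{word:<6}" + " ".join(cells))
    return rows


# Hypothetical sentence pair built around the "began | a vu le jour" example.
src = ["it", "began"]
tgt = ["il", "a", "vu", "le", "jour"]
sure = {(0, 0)}
possible = {(1, 1), (1, 2), (1, 3), (1, 4)}

for row in render_matrix(src, tgt, sure, possible):
    print(row)
```

The block alignment of "began" with the four-token unit "a vu le jour" appears as a run of "+" cells in a single row.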
Alignment Guidelines
• Inspired by the Logos Model (Scott, 2003; Barreiro et al., 2011), which relies on deep semantico-syntactic analysis to translate contiguous MWU and DMWU, often mistranslated by MT systems
  – an approach that has proven successful in commercial MT systems
    • to draw a distinction between
    • to bring [INSERTION] to a conclusion
    • I would urge the European Commission to bring the process of adopting the directive on additional pensions to a conclusion
• Supported by the Lexicon-Grammar theoretical framework and transformational grammar (Gross, 1968; 1975)
• The alignment of the translation pairs of units resulted in a gold collection, made achievable by CLUE-Aligner
CLUE-Aligner
• Allows visualization of automatic phrase alignments and can be used for correcting inaccurate alignments
  – Can load previously (and possibly automatically) generated alignments (segments) for the parallel sentences
• Allows alignment of smaller individual words or MWU inside DMWU
• Useful in human and machine translation evaluation
• Future development plans include automatic alignment
  – Alignments containing pairs of paraphrastic or translation units can be used to train ML systems
• Developed under the scope of the eSPERTo project
  https://esperto.l2f.inesc-id.pt/esperto/aligner/index.pl?
Use of Paraphrastic Units in eSPERTo
the man who is American
the man from America
the man with American nationality
…
The American man
https://esperto.l2f.inesc-id.pt/esperto/esperto/demo.pl
Conclusions and Future Work
• Linguistically based alignments extracted from quality corpora:
  – Contribute to increased precision and recall in SMT systems, with subsequent improvement of translation quality
  – Are a valuable asset for applications that require monolingual paraphrases
• We moved forward by creating a tool that handles non-adjacent structures, allowing the alignment of DMWU and phrasal expressions to improve translation applications
• Planned improvements to CLUE-Aligner include:
  – Feeding it with existing translation or paraphrastic knowledge previously aligned or generated with a linguistic processing tool
  – Enhancing it to automatically align and extract large amounts of alignment pairs to be applied to paraphrasing and MT case studies
Thank you!
Acknowledgements
This research work was supported by Fundação para a Ciência e Tecnologia (FCT), under project eSPERTo EXPL/MHC-LIN/2260/2013, UID/CEC/50021/2013, and post-doctoral grant SFRH/BPD/91446/2012