Semantics in Statistical Machine Translation, Jan Odijk, MA-Rotation Lecture, Utrecht, March 10, 2011

Semantics in Statistical Machine Translation

Jan Odijk

MA-Rotation Lecture

Utrecht, March 10, 2011


Overview

• Machine Translation (MT)
• Rule-based MT
• Statistical MT
• Hybrid MT


MT: What is it?

• Input: text in source language
• Output: text in target language that is a translation of the input text


MT: What is it?

[Diagram: three routes from input to output. Top: Interlingua. Middle: Analyzed input → transfer → Analyzed output. Bottom: Input → direct translation → Output.]


MT: System Types

• Direct:
  – Earliest systems (1950s)
    • Direct word-to-word translation
  – Recent statistical MT systems
• Transfer
  – Almost all research and commercial systems up to 1990
• Interlingual


MT: System Types

• Interlingual
  – A few research systems in the 1980s
    • Rosetta (Philips), based on Montague Grammar
      – Semantic derivation trees of attuned grammars
    • Distributed Language Translation (DLT, BSO)
      – (enriched) Esperanto
    • Sometimes logical representations
• Hybrid Interlingual/Transfer
  – Transfer for lexicons; IL for rules


Rule-Based Systems

• Most systems
  – explicit source language grammar
  – parser yields analysis of source language input
  – transfer component turns it into target language structure
  – no explicit grammar of target language (except morphology)


Rule-Based Systems

• Some systems (Eurotra)
  – explicit source and target language grammar
    • sometimes reversible
  – parser yields analysis of source language input
  – transfer component turns it into target language structure
  – generation of translation by target language grammar


Rule-Based Systems

• Some systems (Rosetta, DLT)
  – explicit source and target language grammar
    • in some cases reversible
  – parser yields interlingual representation
  – generation of translation by target language grammar from interlingual representation


MT: Is it difficult?

• FAHQT: Fully Automatic High Quality Translation
  – Fully Automatic: no human intervention
  – High Quality: close or equal to human translation

• Even acceptable quality is difficult to achieve


MT: Problems

• Ambiguity
  – Real
    • Cannot be resolved by grammar
    • Is much higher than a human can imagine!
    • Requires world knowledge modeling or statistics
  – Temporary
    • Is resolved by the grammar but requires large computational resources
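
To give a rough sense of why real ambiguity can be "much higher than a human can imagine": if every binary bracketing of a word string counted as a possible analysis, the number of analyses would grow with the Catalan numbers. The sketch below illustrates only that simplifying assumption and is not a claim about any particular grammar; the function name is made up for this example.

```python
from math import comb

def count_binary_bracketings(n_words: int) -> int:
    """Number of distinct binary trees over n_words leaves (Catalan number C_{n-1})."""
    if n_words < 1:
        raise ValueError("need at least one word")
    k = n_words - 1
    return comb(2 * k, k) // (k + 1)

# Growth of the number of bracketings with sentence length
for n in (5, 10, 20):
    print(n, count_binary_bracketings(n))
```

Under this assumption a 10-word string already has 4862 bracketings, and a 20-word string more than 1.7 billion.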


MT: Problems

• Computational Complexity
  – Most rule-based systems have a context-free base (O(n^3)) plus extensions (O(?))
  – Require large computational resources
  – Require large memory resources
  – Sentences of length > 20 are hardly processable
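
The cubic cost of a context-free base can be seen in a CKY-style recognizer, where three nested loops run over span length, start position, and split point. The following is a minimal sketch assuming a grammar in Chomsky normal form; the toy lexicon and rules are hypothetical and only serve the illustration.

```python
from itertools import product

def cky_recognize(words, lexicon, rules, start="S"):
    """Return True if `words` can be derived from `start`.

    lexicon: dict word -> set of categories
    rules:   dict (left_child, right_child) -> set of parent categories
    """
    n = len(words)
    if n == 0:
        return False
    # chart[i][j] holds the categories spanning words[i:j]
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        chart[i][i + 1] = set(lexicon.get(word, ()))
    for span in range(2, n + 1):              # loop 1: span length
        for i in range(n - span + 1):         # loop 2: start position
            j = i + span
            for k in range(i + 1, j):         # loop 3: split point
                for left, right in product(chart[i][k], chart[k][j]):
                    chart[i][j] |= rules.get((left, right), set())
    return start in chart[0][n]

# Toy grammar (hypothetical, for illustration only)
LEXICON = {"he": {"NP"}, "wears": {"V"}, "a": {"Det"}, "suit": {"N"}}
RULES = {("NP", "VP"): {"S"}, ("V", "NP"): {"VP"}, ("Det", "N"): {"NP"}}

print(cky_recognize("he wears a suit".split(), LEXICON, RULES))
```

The extensions that rule-based systems add on top of such a base (the "O(?)" above) are not modeled here.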


MT: Problems

• Complexity of language
  – Many different construction types
  – All interacting with each other
  – Full coverage is hard to achieve; systems often fall back on robustness measures
  – For many constructions the proper analysis is not known
  – Theoretical linguistics is not going to help because of its focus on explanatory adequacy


MT: Problems

• Divergences between languages
  – Lexical categorial:
    • zich ergeren vs. (be) annoyed (Verb-Adj)
    • hij zwemt graag vs. he likes to swim
  – Phrasal categorial:
    • I expect her to leave
      – ik verwacht dat zij vertrekt
    • She is likely to come
      – het is waarschijnlijk dat zij komt


Conflational Divergences:

• prepositional complements
  – houden van vs. love
• existential er vs. Ø
  – er passeerde een auto vs. a car passed
• verbal particles
  – blow (something) up vs. volar


Conflational Divergences:

• reflexive verbs
  – zich scheren vs. shave
• composed vs. simple tense forms
  – he will do it vs. lo hará
• split negatives vs. composed negatives
  – he does not see anyone vs. hij ziet niemand


Functional Divergences:

• I like these apples
  – me gustan estas manzanas
• se venden manzanas aquí
  – hier verkoopt men appels
• er werd door de toeschouwers gejuicht
  – the spectators were cheering


Divergences: MWEs

• semi-fixed MWEs
  – nuclear power plant vs. kerncentrale
• flexible idioms
  – de plaat poetsen vs. bolt
  – de pijp uit gaan vs. to kick the bucket


Divergences: MWEs

• semi-idioms (collocations)
  – zware shag vs. strong tobacco
• semi-idioms (support verbs)
  – aandacht besteden aan vs. pay attention to


MT: Why is it so difficult?

• Language Competence vs. Language Use
  – Earlier research systems implemented an idealized reality
  – But not the language use that really occurs
  – In some cases
    • focus on theoretically interesting difficult constructions (that do occur in reality)
    • But other constructions are more important to deal with in practical systems


MT: Why is it so difficult?

• Large and rich lexicons
  – Existing human-oriented dictionaries are not suited as such
  – All information must be available in a formalized way
  – Much more information is needed than in a traditional dictionary


MT: Why is it so difficult?

• Multi-word Expressions (MWEs)
  – Are in current dictionaries only in a very informal way
  – No standards on how to represent them lexically
  – Many different types requiring different treatment in the grammar
  – Huge numbers!!
  – Domain and company-specific terminology are often MWEs


MT: Why is it so difficult?

• All systems must make approximations:
  – Ignore certain ambiguities to begin with
  – Use only a limited amount of relevant information
  – Cut off analysis when there are too many alternatives


Statistical MT

• Statistical MT
• Derives MT-system automatically
  – From statistics taken from
    • Aligned parallel corpora (→ translation model)
    • Monolingual target language corpora (→ language model)
• Being worked on since the early 1990s
• Paradigm originates in speech recognition (which in turn builds on noisy channel models)
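
In the usual noisy channel formulation, the system picks the target sentence e that maximizes P(e) · P(f | e) for a source sentence f, where P(e) comes from the language model and P(f | e) from the translation model. The sketch below shows only that decision rule: the probabilities are invented, the candidate list is given by hand, and the names `TRANSLATION_MODEL`, `LANGUAGE_MODEL` and `decode` are hypothetical.

```python
import math

# Hypothetical toy probabilities, invented for illustration only.
# Translation model P(f | e): how likely source sentence f is, given target e.
TRANSLATION_MODEL = {
    ("hij draagt een pak", "he wears a suit"): 0.4,
    ("hij draagt een pak", "he carries a package"): 0.5,
}
# Language model P(e): how likely the target sentence is on its own.
LANGUAGE_MODEL = {
    "he wears a suit": 0.02,
    "he carries a package": 0.01,
}

def decode(source, candidates):
    """Return the candidate e maximising log P(e) + log P(source | e)."""
    def score(e):
        tm = TRANSLATION_MODEL.get((source, e), 0.0)
        lm = LANGUAGE_MODEL.get(e, 0.0)
        if tm == 0.0 or lm == 0.0:
            return float("-inf")
        return math.log(lm) + math.log(tm)
    return max(candidates, key=score)

best = decode("hij draagt een pak", ["he wears a suit", "he carries a package"])
print(best)
```

With these made-up numbers the language model outweighs the slightly higher translation model score of the second candidate. A real system estimates both models from corpora and searches over a very large candidate space rather than a fixed list.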


MT: Can we make it possible?

• Plus:
  – No or very limited grammar development
  – Includes language and world knowledge automatically (but implicitly)
  – Based on actually occurring data
  – Currently many experimental and commercial systems
• Minus:
  – Requires large aligned parallel corpora
  – Clearly has problems with longer span dependencies


Statistical MT

• Google Translate (statistical MT)
• Hij draagt een pak. √ He wears a suit.
• Hij draagt schoenen. √ He wears shoes.
• Hij draagt bruine schoenen en een pak.
  – √ He wears a suit and brown shoes. (!!)
• Hij draagt het pakket. √ He carries the package.
• Hij heeft een pak aan. *He has a suit.
• Voert uw bedrijf sloten uit?
  – *Does your company locks out?


Hybrid MT:

• Can we somehow combine the strengths of rule-based approaches and the statistical approaches
  – And avoid their disadvantages?
• Active research area
  – Several projects


Hybrid MT

• Euromatrix esp. “the Euromatrix”
  – Lists data and tools for European language pairs
  – Goals
    • Translation systems for all pairs of EU languages
    • Organization, analysis and interpretation of a competitive annual international evaluation of machine translation
    • The provision of open source machine translation technology including research tools, software and data
    • A systematically compiled and constantly updated detailed survey of the state of MT technology for all EU language pairs
    • Efficient inclusion of linguistic knowledge into statistical machine translation
    • The development and testing of hybrid architectures for the integration of rule-based and statistical approaches
• Successor project EuromatrixPlus


Hybrid MT

• PACO-MT 2008-2011
• Investigates hybrid approach to MT
  – Rule-based and statistical
  – Uses existing parser for source language analysis
  – Uses statistical n-gram language models for generation
  – Uses statistical approach to transfer
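
As an illustration of how an n-gram language model can help in generation, the sketch below trains a bigram model with add-one smoothing on a tiny invented corpus and uses it to rank two candidate word orders. It is a generic example under those assumptions, not PACO-MT's implementation; all names and the corpus are hypothetical.

```python
import math
from collections import Counter

def train_bigram_model(sentences):
    """Count unigram contexts and bigrams over sentences padded with <s> and </s>."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens[:-1])              # contexts only
        bigrams.update(zip(tokens, tokens[1:]))
    vocab = {w for s in sentences for w in s.split()} | {"</s>"}
    return unigrams, bigrams, len(vocab)

def log_prob(sentence, model):
    """Add-one smoothed bigram log-probability of a sentence."""
    unigrams, bigrams, vocab_size = model
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    return sum(
        math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size))
        for prev, cur in zip(tokens, tokens[1:])
    )

# Tiny invented corpus, for illustration only
corpus = ["he wears a suit", "he wears brown shoes", "she wears a suit"]
model = train_bigram_model(corpus)

# Rank two candidate word orders for the same bag of words
candidates = ["he wears a suit", "he a suit wears"]
best = max(candidates, key=lambda s: log_prob(s, model))
print(best)
```

The model prefers the word order whose adjacent word pairs it has seen in the corpus, which is the kind of local fluency evidence an n-gram model contributes during generation.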


Hybrid MT

• META-NET 2010-2013 (EU-funding)
  – Building a community with shared vision and strategic research agenda
  – Building META-SHARE, an open resource exchange facility
  – Building bridges to neighbouring technology fields
    • Bringing more Semantics into Translation
    • Optimising the Division of Labour in Hybrid MT
    • Exploiting the Context for Translation
    • Empirical Base for Machine Translation


Hybrid MT

• Bringing more Semantics into Translation
  – Charles University Prague (Jan Hajic)
  – FBK-Irst, Trento (Marcello Federico)
  – UiL-OTS, Utrecht (Christer Samuelsson)
• currently orienting ourselves and trying to determine a concrete topic for investigation


Hybrid MT: Semantics

• Possible Topics:
  – lexical semantics and their resources / Word Sense Disambiguation
  – knowledge representations
  – multiword expressions
  – Syntactic and semantic dependencies / Semantic Role Labeling
  – Discourse structure
  – Co-reference resolution
  – Recognizing Textual Entailment and MT Evaluation


Semantics Resources

• CoNLL 2009 Shared Task on syntactic and semantic dependencies
  – training and development data
  – evaluation data
• Penn Discourse TreeBank


Hybrid MT

• Tools:
• SRL and Semantic Parsing: SWIRL, ASSERT, SENNA, C&C (all for Eng), tools developed at Lund University (for Eng and Chn)


Semantics Resources

• Tools:
• Co-Reference and Anaphora Resolution:
  – BART (Eng)
  – COREA (Dut)
• NER:
  – BIOS (Eng)
