Page 1

Page 2

“Searching to Translate”, and “Translating to Search”: When Information Retrieval Meets Machine Translation

Ferhan Ture
Dissertation defense, May 24th, 2013

Department of Computer Science, University of Maryland at College Park

Page 3

Motivation

• Fact 1: People want to access information
  e.g., web pages, videos, restaurants, products, …

• Fact 2: Lots of data out there… but also lots of noise, redundancy, different languages

• Goal: Find ways to efficiently and effectively
  - Search complex, noisy data
  - Deliver content in appropriate form

3

(Illustration: multi-lingual text → user’s native language; forum posts → clustered summaries)

Page 4

Information Retrieval

4

Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers). In our work, we assume that the material is a collection of documents written in natural language, and the information need is provided in the form of a query, ranging from a few words to an entire document. A typical approach in IR is to represent each document as a vector of weighted terms, where a term usually means either a word or its stem. A pre-determined list of stop words (e.g., ``the'', ``an'', ``my'') may be removed from the set of terms, since they have been found to create noise in the search process. Documents are scored, relative to the query, usually by scoring each query term independently and aggregating these term-document scores.

After stopword removal and stemming, the same passage becomes:

retriev ir find materi (usual document unstructur natur (usual text satisfi need larg collect (usual store comput work assum materi collect document written natur languag need form queri rang word entir document typic approach ir repres document vector weight term term mean word stem pre-determin list word .g. `` '' `` '' `` '' may remov set term found creat nois search process document score relat queri score queri term independ aggreg term-docu score

and its weighted term vector:

queri:11.69, ir:11.39, vector:7.93, document:7.09, nois:6.92, stem:6.56, score:5.68, weight:5.67, word:5.46, materi:5.42, search:5.06, term:5.03, text:4.87, comput:4.73, need:4.61, collect:4.48, natur:4.36, languag:4.12, find:3.92, repres:3.58
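A minimal sketch of this preprocessing and weighting (the stop word list, the crude stemmer, and the tf-idf formula below are illustrative stand-ins, not the dissertation's exact choices):

    import math
    import re
    from collections import Counter

    STOP_WORDS = {"the", "an", "my", "a", "of", "in", "is", "that", "on", "to", "and"}

    def stem(word):
        # Crude suffix stripping as a stand-in for a real stemmer (e.g., Porter).
        for suffix in ("ation", "ing", "ed", "es", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[: -len(suffix)]
        return word

    def terms(text):
        tokens = re.findall(r"[a-z]+", text.lower())
        return [stem(t) for t in tokens if t not in STOP_WORDS]

    def tfidf_vector(doc, collection):
        # Weighted term vector for one document: tf * idf over the collection.
        tf = Counter(terms(doc))
        n = len(collection)
        df = Counter(t for d in collection for t in set(terms(d)))
        return {t: count * math.log(n / df[t]) for t, count in tf.items()}

    if __name__ == "__main__":
        docs = ["Information retrieval is finding material that satisfies an information need.",
                "Machine translation translates text written in a source language."]
        print(tfidf_vector(docs[0], docs))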

Page 5

Cross-Language Information Retrieval

5

Information Retrieval (IR) bzw. Informationsrückgewinnung, gelegentlich ungenau Informationsbeschaffung, ist ein Fachgebiet, das sich mit computergestütztem Suchen nach komplexen Inhalten (also z. B. keine Einzelwörter) beschäftigt und in die Bereiche Informationswissenschaft, Informatik und Computerlinguistik fällt. Wie aus der Wortbedeutung von retrieval (deutsch Abruf, Wiederherstellung) hervorgeht, sind komplexe Texte oder Bilddaten, die in großen Datenbanken gespeichert werden, für Außenstehende zunächst nicht zugänglich oder abrufbar. Beim Information Retrieval geht es darum bestehende Informationen aufzufinden, nicht neue Strukturen zu entdecken (wie beim Knowledge Discovery in Databases, zu dem das Data Mining und Text Mining gehören).

(English: Information Retrieval (IR), occasionally and imprecisely called information acquisition, is a field concerned with computer-supported search for complex content (i.e., not single words), falling within information science, computer science, and computational linguistics. As the meaning of the word retrieval suggests, complex texts or image data stored in large databases are initially not accessible or retrievable to outsiders. Information retrieval is about finding existing information, not discovering new structures (as in Knowledge Discovery in Databases, which includes data mining and text mining).)

Page 6

Machine Translation

6

Maschinelle Übersetzung (MT) ist, um Text in einer Ausgangssprache in entsprechenden Text in der Zielsprache geschrieben übersetzen.

Machine translation (MT) is to translate text written in a source language into corresponding text in a target language.

Page 7

Motivation

• Fact 1: People want to access information
  e.g., web pages, videos, restaurants, products, …

• Fact 2: Lots of data out there… but also lots of noise, redundancy, different languages

• Goal: Find ways to efficiently and effectively
  - Search complex, noisy data
  - Deliver content in appropriate form

7

(Illustration: searching multi-lingual text → cross-language IR; delivering content in the user’s native language → MT)

Page 8

Outline

• Introduction

• Searching to Translate (IR → MT)
  - Cross-Lingual Pairwise Document Similarity
  - Extracting Parallel Text From Comparable Corpora

• Translating to Search (MT → IR)
  - Context-Sensitive Query Translation

• Conclusions

8

(Ture et al., SIGIR’11)
(Ture and Lin, NAACL’12)
(Ture et al., SIGIR’12), (Ture et al., COLING’12), (Ture and Lin, SIGIR’13)

Page 9

Extracting Parallel Text from the Web

9

Pipeline: the source collection F and the target collection E are each preprocessed into doc vectors (doc vectors F, doc vectors E) and signatures (signatures F, signatures E).
  Phase 1: signature generation followed by the sliding window algorithm yields cross-lingual document pairs.
  Phase 2: candidate generation over those document pairs yields candidate sentence pairs, and a 2-step parallel text classifier turns them into aligned bilingual sentence pairs (F-E parallel text).

Page 10

Pairwise Similarity

• Pairwise similarity: finding similar pairs of documents in a large collection

• Challenges:
  • quadratic search space
  • measuring similarity effectively and efficiently

• Focus on recall and scalability

10

Page 11

Locality-Sensitive Hashing

Pipeline: Ne English articles → Preprocess → Ne English document vectors (e.g., <nobel=0.324, prize=0.227, book=0.01, …>) → Signature generation → Ne signatures (e.g., [0111000010...]) → Sliding window algorithm → Similar article pairs

Page 12

Locality-Sensitive Hashing (Ravichandran et al., 2005)

• LSH(vector) = signature
  - faster similarity computation, s.t. similarity(vector pair) ≈ similarity(signature pair)
  - e.g., ~20 times faster than computing (cosine) similarity from the vectors, with similarity error ≈ 0.03

• Sliding window algorithm
  - approximate similarity search based on LSH
  - linear run-time

12
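To make the signature idea concrete, here is a minimal sketch (assuming random-hyperplane LSH in the spirit of Ravichandran et al.; the dimensionality, bit count, and vectors are made up) of how the Hamming distance between bit signatures approximates cosine similarity between the underlying vectors:

    import numpy as np

    def make_hyperplanes(num_bits, dim, seed=0):
        # One random hyperplane per signature bit.
        rng = np.random.default_rng(seed)
        return rng.standard_normal((num_bits, dim))

    def signature(vec, planes):
        # D-bit signature: sign of the projection onto each hyperplane.
        return (planes @ vec) > 0

    def cosine(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    def estimated_cosine(sig_a, sig_b):
        # Hamming distance -> angle -> cosine (the LSH approximation).
        hamming = np.count_nonzero(sig_a != sig_b)
        return float(np.cos(np.pi * hamming / len(sig_a)))

    if __name__ == "__main__":
        dim, num_bits = 1000, 256              # illustrative sizes
        planes = make_hyperplanes(num_bits, dim)
        rng = np.random.default_rng(1)
        a = rng.random(dim)
        b = a + 0.3 * rng.random(dim)          # a similar vector
        print("exact cosine   :", round(cosine(a, b), 3))
        print("estimated (LSH):", round(estimated_cosine(signature(a, planes), signature(b, planes)), 3))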

Page 13

Sliding window algorithm: Generating tables

Map: each signature (e.g., (1, 11011011101), (2, 01110000101), (3, 10101010000)) is permuted under each of Q random bit permutations p1 … pQ, producing list1 … listQ.
Reduce: each list is sorted by the permuted signature, producing table1 … tableQ.

Page 14

Sliding window algorithm: Detecting similar pairs

14

Map: within each sorted table (table1 … tableQ), a window of fixed size slides over consecutive permuted signatures, and each pair inside the window is checked by Hamming distance.

Page 15

Sliding window algorithm: Example

Parameters: # bits = 11, # tables = 2, window size = 2.

Signatures: (1, 11011011101), (2, 01110000101), (3, 10101010000)

Map (permutations p1, p2):
  list1: (<1,11111101010>,1), (<1,10011000110>,2), (<1,01100100100>,3)
  list2: (<2,11111001011>,1), (<2,00101001110>,2), (<2,10010000101>,3)

Reduce (sorted):
  table1: (<1,01100100100>,3), (<1,10011000110>,2), (<1,11111101010>,1)
  table2: (<2,00101001110>,2), (<2,10010000101>,3), (<2,11111001011>,1)

Window comparisons:
  table1: Distance(3,2) = 7 ✗, Distance(2,1) = 5 ✓
  table2: Distance(2,3) = 7 ✗, Distance(3,1) = 6 ✓
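Below is a minimal single-machine sketch of the same procedure (the dissertation's version runs as MapReduce jobs; the table count, window size, and Hamming threshold are illustrative parameters):

    import random
    from itertools import combinations

    def sliding_window_pairs(signatures, num_tables=2, window=2, max_hamming=6, seed=0):
        # signatures: dict id -> bit string. Returns candidate id pairs whose permuted
        # signatures fall into the same window and are within max_hamming of each other.
        rng = random.Random(seed)
        num_bits = len(next(iter(signatures.values())))
        candidates = set()
        for _ in range(num_tables):
            perm = list(range(num_bits))
            rng.shuffle(perm)                              # one random bit permutation per table
            permuted = {doc_id: "".join(sig[i] for i in perm)
                        for doc_id, sig in signatures.items()}
            table = sorted(permuted, key=permuted.get)     # sort ids by permuted signature
            for start in range(len(table) - window + 1):
                for a, b in combinations(table[start:start + window], 2):
                    dist = sum(x != y for x, y in zip(signatures[a], signatures[b]))
                    if dist <= max_hamming:
                        candidates.add(tuple(sorted((a, b))))
        return candidates

    if __name__ == "__main__":
        sigs = {1: "11011011101", 2: "01110000101", 3: "10101010000"}   # from the slide
        print(sliding_window_pairs(sigs))    # pairs (1,2) and (1,3) can pass; (2,3) is too far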

Page 16

Cross-lingual Pairwise Similarity

16

Two ways to compare a German document (Doc A) with an English document (Doc B, with doc vector vB):
  MT: translate Doc A from German into English with MT, then build its English doc vector vA.
  CLIR: build the German doc vector vA, then translate the vector itself into English with CLIR translation probabilities.
Either way, the translated vA is compared against vB.

Page 17

MT vs. CLIR for Pairwise Similarity

17

(Distributions of similarity scores for positive and negative document pairs, under CLIR and MT: similarity values are low overall, but positives and negatives are clearly separated.)

MT is slightly better than CLIR, but 600 times slower!

Page 18

Locality-Sensitive Hashing for Pairwise Similarity

Pipeline (as before): Ne English articles → Preprocess → Ne English document vectors (e.g., <nobel=0.324, prize=0.227, book=0.01, …>) → Signature generation → Ne signatures (e.g., [0111000010...]) → Sliding window algorithm → Similar article pairs

Page 19

Locality-Sensitive Hashing for Cross-Lingual Pairwise Similarity

Pipeline: Nf German articles are translated with CLIR, the Ne English articles are preprocessed as before, and together they give Ne+Nf English document vectors (e.g., <nobel=0.324, prize=0.227, book=0.01, …>) → Signature generation → signatures (e.g., [0111000010...]) → Sliding window algorithm → Similar article pairs

Page 20

Evaluation

• Experiments with De/Es/Cs/Ar/Zh/Tr to En Wikipedia

• Collection: 3.44m En + 1.47m De Wikipedia articles

• Task: For each German Wikipedia article, find {all English articles s.t. cosine similarity > 0.30}

• Parameters: # bits (D) = 1000, # tables (Q) = 100-1500, window size (B) = 100-2000

20

Page 21

Scalability

21

Page 22

Evaluation

22

Two sources of error. Three ways of producing similar article pairs are compared:
  - Brute-force approach over document vectors → ground truth.
  - Brute-force approach over signatures → upper bound (error from representing vectors as signatures).
  - Signature generation + sliding window algorithm → algorithm output (additional error from the approximate search).

Page 23

Evaluation

23

  95% recall at 39% of the cost
  99% recall at 70% of the cost
  95% recall at 40% of the cost
  99% recall at 62% of the cost
  100% recall: no savings = no free lunch!

Page 24

Outline

• Introduction

• Searching to Translate (IR → MT)
  - Cross-Lingual Pairwise Document Similarity
  - Extracting Parallel Text From Comparable Corpora

• Translating to Search (MT → IR)
  - Context-Sensitive Query Translation

• Conclusions

24

(Ture et al., SIGIR’11)
(Ture and Lin, NAACL’12)
(Ture et al., SIGIR’12), (Ture et al., COLING’12), (Ture and Lin, SIGIR’13)

Page 25

Phase 2: Extracting Parallel Text

25

Approach:
  1. Generate candidate sentence pairs from each document pair.
  2. Classify each candidate as ‘parallel’ or ‘not parallel’.

Challenge: tens of millions of document pairs ≈ hundreds of billions of sentence pairs.

Solution: 2-step classification approach
  1. a simple classifier efficiently filters out irrelevant pairs
  2. a complex classifier effectively classifies the remaining pairs

Page 26

Parallel Text (Bitext) Classifier

26

• cosine similarity of the two sentences
• sentence length ratio: the ratio of the lengths of the two sentences
• word translation ratio: ratio of words in the source (target) sentence with a translation in the target (source) sentence
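As an illustration only (not the dissertation's actual classifier), a sketch of these three features for one candidate sentence pair, assuming the source sentence vector has already been projected into the target-language term space and that a simple source-to-target dictionary is available:

    import math

    def cosine_similarity(vec_a, vec_b):
        # Cosine between two sparse term-weight vectors (dicts).
        dot = sum(w * vec_b.get(t, 0.0) for t, w in vec_a.items())
        norm = math.sqrt(sum(w * w for w in vec_a.values())) * \
               math.sqrt(sum(w * w for w in vec_b.values()))
        return dot / norm if norm else 0.0

    def bitext_features(src_tokens, tgt_tokens, src_vec, tgt_vec, dictionary):
        # dictionary: source word -> set of known target translations.
        # (The slide's word translation ratio is computed in both directions;
        #  only source-to-target is shown here.)
        tgt_set = set(tgt_tokens)
        covered = sum(1 for w in src_tokens if dictionary.get(w, set()) & tgt_set)
        return {
            "cosine": cosine_similarity(src_vec, tgt_vec),
            "length_ratio": len(src_tokens) / max(len(tgt_tokens), 1),
            "translation_ratio": covered / max(len(src_tokens), 1),
        }

    if __name__ == "__main__":
        dictionary = {"congé": {"leave"}, "maternité": {"maternity", "maternal"}}
        print(bitext_features(
            ["congé", "de", "maternité"], ["maternity", "leave"],
            {"leave": 0.5, "maternity": 0.5}, {"maternity": 0.6, "leave": 0.4},
            dictionary))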

Page 27

Bitext Extraction Algorithm

27

MAP: for each cross-lingual document pair, sentence detection + tf-idf produces sentences and sentence vectors for the source and target document; their cartesian product (×) gives candidate sentence pairs.
REDUCE: simple classification filters the candidates into bitext S1, and complex classification refines them into bitext S2.

Running times: candidate generation 2.4 hours, shuffle & sort 1.3 hours, simple classification 4.1 hours, complex classification 0.5 hours (pair counts annotated on the slide: 400 billion, 214 billion, 132 billion).

Page 28

Extracting Bitext from Wikipedia

Size                      English   German     Spanish    Chinese     Arabic      Czech      Turkish
Documents                 4.0m      1.42m      0.99m      0.59m       0.25m       0.26m      0.23m
Similar doc pairs         -         35.9m      51.5m      14.8m       5.4m        9.1m       17.1m
Sentences                 ~90m      42.3m      19.9m      5.5m        2.6m        5.1m       3.5m
Candidate sentence pairs  -         530b       356b       62b         48b         101b       142b
S1                        -         292m       178m       63m         7m          203m       69m
S2                        -         0.2-3.3m   0.9-3.3m   50k-290k    130-320k    0.5-1.6m   8-250k
Baseline training data    -         2.1m       2.1m       303k        3.4m        0.78m      53k
Dev/Test set              -         WMT-11/12  WMT-11/12  NIST-06/08  NIST-06/08  WMT-11/12  held-out
Baseline BLEU             -         24.50      33.44      25.38       63.15       23.11      27.22

Page 29

Evaluation on MT

Page 30

Evaluation on MT

Page 31

Conclusions (Part I)

31

• Summary
  - Scalable approach to extract parallel text from a comparable corpus
  - Improvements over state-of-the-art MT baseline
  - General algorithm applicable to any data format

• Future work
  - Domain adaptation
  - Experimenting with larger web collections

Page 32

Outline

• Introduction

• Searching to Translate (IR → MT)
  - Cross-Lingual Pairwise Document Similarity
  - Extracting Parallel Text From Comparable Corpora

• Translating to Search (MT → IR)
  - Context-Sensitive Query Translation

• Conclusions

32

(Ture et al., SIGIR’11)
(Ture and Lin, NAACL’12)
(Ture et al., SIGIR’12), (Ture et al., COLING’12), (Ture and Lin, SIGIR’13)

Page 33

Cross-Language Information Retrieval

• Information Retrieval (IR): Given an information need (query), find relevant material ((ranked) documents).

• Cross-language IR (CLIR): query and documents in different languages

  • “Why does China want to import technology to build Maglev Railway?” ➡ relevant information in Chinese documents
  • “Maternal Leave in Europe” ➡ relevant information in French, Spanish, German, etc.

33

Page 34

Machine Translation for CLIR

34

A statistical MT system, given the query “maternal leave in Europe”:
  - a token aligner learns token alignments from a sentence-aligned parallel corpus, yielding token translation probabilities;
  - a grammar extractor builds the translation grammar from those alignments;
  - the decoder, using the translation grammar and a language model, produces the n best translations and the 1-best translation “congé de maternité en Europe”.

Page 35

Token-based CLIR

• Token translation formula (token-based probabilities)

35

Example aligned sentence pairs:
  … most leave their children in …            … la plupart laisse leurs enfants …
  … aim of extending maternity leave to …     … l’objectif de l’extension des congé de maternité à …
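The formula itself did not survive the transcript. As a sketch of the usual token-based estimate (an assumption about what the slide shows, not a verbatim reconstruction), the probability of a target token f given a query token e is taken from alignment counts over the sentence-aligned parallel corpus:

    Pr_token(f | e) = count(e aligned to f) / Σ_f' count(e aligned to f')

so “leave”, for example, receives one fixed distribution over laisser, congé, quitter, partir, independent of the rest of the query.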

Page 36

Token-based CLIR

36

Query: Maternal leave in Europe — token-based translations of “leave”:
  1. laisser (Eng. forget) 49%
  2. congé (Eng. time off) 17%
  3. quitter (Eng. quit) 9%
  4. partir (Eng. disappear) 7%

Page 37

Document Retrieval

• How to score a document, given a query?

37

Query q1 = “maternal leave in Europe”, with one query term translated as [maternité : 0.74, maternel : 0.26]. For each document d1 in the collection, the score draws on tf(maternité), tf(maternel), df(maternité), df(maternel), …
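To make the scoring concrete, here is a minimal sketch (an illustration, not the dissertation's exact ranking function) in which the tf and df of an English query term are probability-weighted combinations of the statistics of its translations, feeding a simple tf-idf style weight:

    import math

    def translated_tf_df(translations, tf, df):
        # translations: target term -> Pr(target term | query term)
        # tf: target term -> frequency in the document; df: target term -> document frequency.
        tf_e = sum(p * tf.get(f, 0) for f, p in translations.items())
        df_e = sum(p * df.get(f, 0) for f, p in translations.items())
        return tf_e, df_e

    def score_term(tf_e, df_e, num_docs):
        # A simple tf-idf style weight from the translated statistics.
        if tf_e == 0 or df_e == 0:
            return 0.0
        return tf_e * math.log(num_docs / df_e)

    if __name__ == "__main__":
        translations = {"maternité": 0.74, "maternel": 0.26}      # from the slide
        tf = {"maternité": 3, "maternel": 1}                       # illustrative document
        df = {"maternité": 120, "maternel": 400}                   # illustrative collection stats
        tf_e, df_e = translated_tf_df(translations, tf, df)
        print(round(score_term(tf_e, df_e, num_docs=177452), 2))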

Page 38

Token-based CLIR

38

Maternal leave in Europe — translations of “leave”:
  1. laisser (Eng. forget) 49%
  2. congé (Eng. time off) 17%
  3. quitter (Eng. quit) 9%
  4. partir (Eng. disappear) 7%

Page 39

Token-based CLIR

39

Maternal leave in Europe — translations of “leave”:
  1. laisser (Eng. forget) 49%
  2. congé (Eng. time off) 17%
  3. quitter (Eng. quit) 9%
  4. partir (Eng. disappear) 7%

Page 40

Context-Sensitive CLIR

40

This talk: MT for context-sensitive CLIR

Maternal leave in Europe — translations of “leave”, without context vs. with context:
  1. laisser (Eng. forget) 49% → 12%
  2. congé (Eng. time off) 17% → 70%
  3. quitter (Eng. quit) 9% → 6%
  4. partir (Eng. disappear) 7% → 5%

Page 41

Previous approach: Token-based CLIR

41

Previous approach: MT as a black box. Our approach: Looking inside the box.

(The statistical MT pipeline for the query “maternal leave in Europe”: token aligner → token alignments and token translation probabilities from the sentence-aligned parallel corpus; grammar extractor → translation grammar; decoder + language model → n best derivations and the 1-best translation “congé de maternité en Europe”.)

Page 42

MT for Context-Sensitive CLIR

42

(The same statistical MT pipeline, showing the intermediate products the approach draws on: token alignments and token translation probabilities from the token aligner, the translation grammar from the grammar extractor, and the n best translations from the decoder, in addition to the 1-best translation “congé de maternité en Europe”.)

Page 43

CLIR from translation grammar

• Token translation formula (grammar-based probabilities)

43

Synchronous Context-Free Grammar (SCFG) [Chiang, 2007] rules extracted for the query, with weights:

  S → [X : X], 1.0
  X → [X1 leave in europe : congé de X1 en europe], 0.9
  X → [maternal : maternité], 0.9
  X → [X1 leave : congé de X1], 0.74
  X → [leave : congé], 0.17
  X → [leave : laisser], 0.49
  ...

A synchronous hierarchical derivation pairs the source “maternal leave in Europe” with the target “congé de maternité en Europe”.

Page 44

MT for Context-Sensitive CLIR

44

(Same MT pipeline diagram as above.)

Page 45

MT for Context-Sensitive CLIR

45

(Same MT pipeline diagram as above.)

Page 46

CLIR from n-best derivations

• Token translation formula (translation-based probabilities)

46

The decoder’s n best derivations of the query are used directly:
  t(1): { best derivation, score 0.8 } — e.g., “maternal leave in Europe” ↔ “congé de maternité en Europe”
  t(2): { second-best derivation, score 0.11 }
  ...
  t(k): { kth best derivation, score(t(k)|s) }

Page 47

MT for Context-Sensitive CLIR

47

(Summary diagram with axes “context sensitivity” and “ambiguity preserved”: from the MT pipeline over sentence-aligned bitext, token alignments give Prtoken (token-based), the translation grammar gives PrSCFG (grammar-based), the n best derivations give Prnbest (translation-based), and the 1-best translation gives 1-best MT.)

Page 48

Combining Evidence

• For best results, we compute an interpolated probability distribution:

48

  Prtoken:  leave → laisser 0.14, congé 0.70, quitter 0.06, …
  PrSCFG:   leave → laisser 0.72, congé 0.10, quitter 0.09, …
  Prnbest:  leave → laisser 0.09, congé 0.90, quitter 0.11, …

  Interpolation weights: 35% / 40% / 25%

  Printerp: leave → laisser 0.33, congé 0.54, quitter 0.08, …
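A minimal sketch of the interpolation (the mixture weights below are illustrative; they are one setting consistent with the example numbers on this slide, and in practice the weights are tuned):

    def interpolate(distributions, weights):
        # Linear interpolation of translation probability distributions for one query term.
        assert abs(sum(weights) - 1.0) < 1e-9
        combined = {}
        for dist, w in zip(distributions, weights):
            for target, prob in dist.items():
                combined[target] = combined.get(target, 0.0) + w * prob
        return combined

    if __name__ == "__main__":
        pr_token = {"laisser": 0.14, "congé": 0.70, "quitter": 0.06}
        pr_scfg  = {"laisser": 0.72, "congé": 0.10, "quitter": 0.09}
        pr_nbest = {"laisser": 0.09, "congé": 0.90, "quitter": 0.11}
        mixed = interpolate([pr_token, pr_scfg, pr_nbest], [0.40, 0.35, 0.25])
        print({k: round(v, 2) for k, v in mixed.items()})   # laisser 0.33, congé 0.54, quitter 0.08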

Page 49

Combining Evidence

• For best results, we compute an interpolated probability distribution:

49

  Prtoken:  leave → laisser 0.14, congé 0.70, quitter 0.06, …
  PrSCFG:   leave → laisser 0.72, congé 0.10, quitter 0.09, …
  Prnbest:  leave → laisser 0.09, congé 0.90, quitter 0.11, …

  Interpolation weights: 100% / 0% / 0% (all weight on a single model)

  Printerp: leave → laisser 0.72, congé 0.10, quitter 0.09, … (reduces to that model’s distribution)

Page 50

Combining Evidence

50

• For best results, we compute an interpolated probability distribution.

Page 51

Experiments

• Three tasks:
  1. TREC 2002 English-Arabic CLIR task: 50 English queries and 383,872 Arabic documents
  2. NTCIR-8 English-Chinese ACLIA task: 73 English queries and 388,859 Chinese documents
  3. CLEF 2006 English-French CLIR task: 50 English queries and 177,452 French documents

• Implementation
  - cdec MT system [Dyer et al., 2010]
  - Hiero-style grammars, GIZA++ for token alignments

51

Page 52

Comparison of Models: English-Arabic TREC 2002, English-French CLEF 2006, English-Chinese NTCIR-8

52

(Result charts comparing Token-based, Grammar-based, Translation-based (10-best), 1-best MT, and Best interpolation.)

Page 53

53

Comparison of Models: Overview

Page 54

Comparison of Models

54

(Bar chart of Mean Average Precision (MAP), from 0.00 to 0.30, for English-Chinese, English-Arabic, and English-French, comparing Token-based, Grammar-based, Translation-based, 1-best MT, and Interpolated models.)

Interpolated significantly better than token-based and 1-best in all three cases.

Page 55

Conclusions (Part II)

• Summary
  - A novel framework for context-sensitive and ambiguity-preserving CLIR
  - Interpolation of proposed models works best
  - Significant improvements in MAP for three tasks

• Future work
  - Robust parameter optimization
  - Document vs. query translation with MT

55

Page 56

Contributions

(Overview diagram: the MT pipeline trains an MT translation model from a baseline bitext; token-based CLIR gives a CLIR translation model; bitext extraction over comparable corpora produces extracted bitext that is added (+) to the baseline bitext.)

Page 57

Contributions

(Same diagram: adding the extracted bitext from comparable corpora to the baseline bitext yields higher BLEU for 5 language pairs.)

Page 58

Contributions

(Same diagram: deriving the CLIR translation model from the MT translation model gives context-sensitive CLIR.)

Page 59

Contributions

(Same diagram: context-sensitive CLIR yields higher MAP for 3 language pairs.)

Page 60

Contributions

(Same diagram, combining both threads: bitext extraction gives higher BLEU for 5 language pairs, and context-sensitive CLIR gives higher MAP for 3 language pairs.)

Page 61

Contributions

(Same diagram, closing the loop: the improved CLIR translation model feeds back into bitext extraction, producing more bitext and higher BLEU after an additional iteration.)

Page 62

Contributions

• LSH-based MapReduce approach to pairwise similarity
• Exploration of parameter space for sliding window algorithm
• MapReduce algorithm to generate candidate sentence pairs
• 2-step classification approach to bitext extraction
  → Bitext from Wikipedia: improvement over state-of-the-art MT
• Set of techniques for context-sensitive CLIR using MT
  → Combination of evidence works best
• Framework for better integration of MT and IR
• Bootstrapping approach to show feasibility
• All code and data as part of Ivory project (www.ivory.cc)

62

Page 63

Thank you!