CROSS-LANGUAGE INFORMATION RETRIEVAL AND BEYOND
Jian-Yun Nie
University of Montreal
http://www.iro.umontreal.ca/~nie
1
Outline
• What are the problems in CLIR?
• Recall: General approaches to IR
• The approaches to CLIR proposed in the literature
• Their effectiveness
• Remaining problems
• Applications
2
Problem of CLIR
• Cross-language IR (CLIR)
  • Use a query in one language (e.g. English) to retrieve documents in another language (e.g. Chinese)
• Multilingual IR (MLIR)
  • Use a query in one language to retrieve documents in several languages
3
History
• In the 1970s, first papers on CLIR
• TREC-3 (1994) Spanish (monolingual): El Norte Newspaper SP 1-25
• TREC-4 (1995) Spanish (monolingual): El Norte Newspaper SP 26-50
• TREC-5 (1996) Spanish (monolingual): El Norte newspaper and Agence France Presse SP 51-75
Chinese (monolingual): Xinhua News agency, People’s Daily CH 1-28
• TREC-6 (1997) Chinese (monolingual): the same documents as TREC-5 CH 29-54
  CLIR: English: Associated Press CL 1-25; French, German: Schweizerische Depeschenagentur (SDA)
• TREC-7 (1998) CLIR: English, French, German, Italian (SDA) CL 26-53 + German: Neue Zürcher Zeitung (NZZ)
• TREC-8 (1999) CLIR (English, French, German, Italian): as in TREC-7 CL 54-81
• TREC-9 (2000) English-Chinese: Chinese newswire articles from Hong Kong CH 55-79
• TREC 2001 English-Arabic: Arabic newswire from Agence France Presse 1-25
• TREC 2002 English-Arabic: Arabic newswire from Agence France Presse 26-75
4
History
• NTCIR (Japan, NII) (1999-)
• Asian languages (CJK) + English
Approach 1: Machine Translation (MT)
• Seems to be the ideal tool for CLIR and MLIR (if the translation quality is high)
• Query in F → MT → Translation in E → retrieval over Documents in E
• Typical effectiveness: 80-100% of the monolingual effectiveness
• Problems:
  • Quality
  • Availability
38
Query translation using MT
• Query translation: to make the query comparable to the documents
• Similarities with MT:
  • Similar translation problems
  • Similar methods can be used
• A good MT → good CLIR effectiveness
43
Differences between query translation and MT
• Short queries (2-3 words): HD video recording
• Flexible syntax: video HD recording, recording HD, …
• Goal: help find relevant documents, not to make the translated query readable by users
• Less strict translation: the "translation" can be by related words (same/related topics) → query expansion effect
• Not limited to one translation (e.g. potato → pomme de terre / patate, 土豆 / 马铃薯)
• Not only translation: weight = correctness of translation + utility for IR
  • E.g. organic food: translation → 有机食品 (organic food); utility for IR → 绿色食品 (green food)
44
Translation problems
• 西点: western dessert / West Point
45

Translation problems
• (sign photo: "Caution: slippery steps")
46

Problems of MT for IR
• (sign photo: "Exit")
47

Translation problems
• (sign photo, translated as "art of the state")
48-49
Translation problems
Examples (Systran vs. Google):
1. drug traffic
   Systran: trafic de stupéfiants / 毒品交易
   Google: trafic de stupéfiants / 毒品贩运
2. drug insurance
   Systran: assurance de drogue / 药物保险
   Google: d'assurance médicaments / 药物保险
3. drug research
   Systran: recherche de drogue / 药物研究
   Google: la recherche sur les drogues / 药物研究
50

Examples (continued)
4. drug for treatment of Friedreich's ataxia
   Systran: drogue pour le traitement de l'ataxie de Friedreich / Friedreich的不整齐的治疗的药物
   Google: médicament pour le traitement de l'Ataxie de Friedreich / 药物治疗弗里德的共济失调
5. drug control
   Systran: commande de drogue / 药物管制
   Google: contrôle des drogues / 药物管制
6. drug production
   Systran: production de drogue / 药物生产
   Google: la production de drogues / 药物生产
51
Approach 2: Using bilingual dictionaries
• Unavailability of high-quality MT systems for many language pairs
• MT systems can often be used as a black box that is difficult to adapt to IR task
• MT produces one best translation
• IR needs translation alternatives
• Bilingual dictionary:
• An inexpensive alternative
• Usually available
52
Approach 2: Using bilingual dictionaries
• General form of dict. (e.g. Freedict):
  access: attaque, accéder, intelligence, entrée, accès
academic: étudiant, académique
branch: filiale, succursale, spécialité, branche
data: données, matériau, data
• LDC English-Chinese:
  • AIDS /艾滋病/爱滋病/
• data /材料/资料/事实/数据/基准/
• prevention /阻碍/防止/妨碍/预防/预防法/
• problem /问题/难题/疑问/习题/作图题/将军/课题/困难/难/题是/
• structure /构造/构成/结构/组织/化学构造/石理/纹路/构造物/建筑物/建造/物/
53
Basic methods
• Use all the translation terms
  • data /材料/资料/事实/数据/基准/
  • structure /构造/构成/结构/组织/化学构造/石理/纹路/构造物/建筑物/建造/物/
  • Introduces noise
  • Implicitly, a term with more translations is assigned higher importance
• Use the first (or the most frequent) translation
  • Does not cover other translation alternatives
  • Not always an appropriate choice
• Generally: 50-60% of monolingual IR effectiveness
• Problems of dictionaries:
  • Coverage (unknown words, unknown translations)
  • [Xu and Weischedel 2005] tested the impact of dictionary coverage on CLIR (En-Ch): effectiveness increases up to about 10,000 entries
54
Translate the query as a whole
• Phrase translation [Ballesteros and Croft, 1996, 1997]
  • base de données: database
  • pomme de terre: potato
• First translate phrases, then the remaining words
• Best global translation for the whole query:
  1. Candidates: for each query word, determine all the possible translations (through a dictionary)
  2. Selection: select the set of translation words that produces the highest cohesion
55
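The selection step above can be sketched as follows; the candidate translations and co-occurrence counts are invented toy data, not taken from any real dictionary or corpus:

```python
from itertools import product

# Toy co-occurrence counts between candidate translation words
# (in practice collected from a target-language corpus). Keys are sorted.
COOC = {
    ("acces", "base_de_donnees"): 250,
    ("attaque", "base_de_donnees"): 3,
    ("acces", "fondation"): 12,
    ("attaque", "fondation"): 8,
}

def cohesion(pair):
    """Symmetric co-occurrence score of two translation candidates."""
    a, b = sorted(pair)
    return COOC.get((a, b), 0)

def best_translation(candidates):
    """candidates: one list of candidate translations per query word.
    Return the combination with the highest total pairwise cohesion."""
    best, best_score = None, -1
    for combo in product(*candidates):
        score = sum(cohesion((combo[i], combo[j]))
                    for i in range(len(combo))
                    for j in range(i + 1, len(combo)))
        if score > best_score:
            best, best_score = combo, score
    return best

# "database access": choose among the dictionary translations of each word
print(best_translation([["base_de_donnees", "fondation"], ["acces", "attaque"]]))
# → ('base_de_donnees', 'acces')
```

The exhaustive search over combinations is exponential in query length; this is acceptable here only because queries are short (2-3 words).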
Cohesion
• Cohesion ~ frequency of two translation words occurring together
• t(tj|si) is estimated from a parallel training corpus, aligned into parallel sentences
• IBM models 1, 2, 3, …
• Process:
  • Input = two sets of parallel texts
• Sentence alignment A: Sk ⇔ Tl (bitext)
• Initial probability assignment: t(tj|si,A)
• Expectation Maximization (EM): t(tj|si ,A)
• Final result: t(tj|si) = t(tj|si ,A)
62
Integrating translation in language model (Kraaij et al. 2003)
• The problem of CLIR:
• Query translation (QT)
• Document translation (DT)
Score(D,Q) = Σ_{t_i ∈ V} P(t_i|M_Q) log P(t_i|M_D)

QT: P(t_i|M_Qs) = Σ_{s_j ∈ V_s} P(t_i|s_j, M_Qs^ML) P(s_j|M_Qs^ML)
              ≈ Σ_{s_j ∈ V_s} t(t_i|s_j) P(s_j|M_Qs^ML)

DT: P(s_i|M_Dt) = Σ_{t_j ∈ V_t} P(s_i|t_j, M_Dt) P(t_j|M_Dt)
              ≈ Σ_{t_j ∈ V_t} t(s_i|t_j) P(t_j|M_Dt)
63
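A minimal sketch of the query-translation scoring above; the translation table and all probabilities are invented toy values, not the models of Kraaij et al.:

```python
import math

# Toy translation probabilities t(target | source) — invented for illustration
T = {"drug": {"drogue": 0.6, "medicament": 0.4},
     "traffic": {"trafic": 0.9, "circulation": 0.1}}

def query_model(source_terms):
    """P(t_i|M_Qs) ≈ Σ_j t(t_i|s_j) P(s_j|M_Qs), with a uniform source model."""
    p_s = 1.0 / len(source_terms)
    model = {}
    for s in source_terms:
        for tgt, p_t in T.get(s, {}).items():
            model[tgt] = model.get(tgt, 0.0) + p_t * p_s
    return model

def score(doc_model, q_model, floor=1e-9):
    """Score(D,Q) = Σ_i P(t_i|M_Q) log P(t_i|M_D); floor for unseen terms."""
    return sum(p_q * math.log(doc_model.get(tgt, floor))
               for tgt, p_q in q_model.items())

qm = query_model(["drug", "traffic"])
d1 = {"trafic": 0.3, "drogue": 0.2, "police": 0.5}   # on-topic document model
d2 = {"circulation": 0.1, "route": 0.9}              # off-topic document model
print(score(d1, qm) > score(d2, qm))   # → True
```

A real system would use smoothed document models rather than a fixed floor probability; the floor is only a stand-in to keep the sketch short.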
Results (CLEF 2000-2002)
Run EN-FR FR-EN EN-IT IT-EN
Mono 0.4233 0.4705 0.4542 0.4705
MT 0.3478 0.4043 0.3060 0.3249
QT 0.3878 0.4194 0.3519 0.3678
DT 0.3909 0.4073 0.3728 0.3547
- Translation model (IBM 1) trained on a web collection
- TM can outperform MT (Systran)
64
Details on translation model training on a parallel corpus
• Sentence alignment
• Align a sentence in the source language to its translation(s) in the target language
• Translation model
• Extract translation relationships
• Various models (assumptions)
65
Sentence alignment
• Assumptions:
  • The order of sentences in two parallel texts is similar
• A sentence and its translation have similar length (length-based alignment, e.g. Gale & Church)
• A translation contains some “known” translation words, or cognates (e.g. Simard et al. 93)
66
Length-based alignment by dynamic programming, where D(i,j) is the cost of aligning the first i source sentences with the first j target sentences:

D(i,j) = min {
  D(i,   j-1) + d1,   (pattern 0-1)
  D(i-1, j  ) + d2,   (pattern 1-0)
  D(i-1, j-1) + d3,   (pattern 1-1)
  D(i-1, j-2) + d4,   (pattern 1-2)
  D(i-2, j-1) + d5,   (pattern 2-1)
  D(i-2, j-2) + d6    (pattern 2-2)
}

where d_k is the distance (cost) for the corresponding alignment pattern (0-1, 1-1, …).
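The dynamic program can be sketched as follows; the pattern penalties and the absolute length-difference cost are simplified stand-ins for Gale & Church's probabilistic length-ratio cost:

```python
# Alignment patterns (di source sentences matched to dj target sentences)
PATTERNS = [(0, 1), (1, 0), (1, 1), (1, 2), (2, 1), (2, 2)]
PENALTY = {(0, 1): 5, (1, 0): 5, (1, 1): 0, (1, 2): 2, (2, 1): 2, (2, 2): 3}

def align(src_lens, tgt_lens):
    """D(i,j) = min over patterns of D(i-di, j-dj) + d(di,dj) + length cost.
    Inputs are sentence lengths; returns the sequence of patterns used."""
    INF = float("inf")
    n, m = len(src_lens), len(tgt_lens)
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    back = {}
    D[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            for di, dj in PATTERNS:
                if i - di < 0 or j - dj < 0:
                    continue
                cost = (D[i - di][j - dj] + PENALTY[(di, dj)]
                        + abs(sum(src_lens[i - di:i]) - sum(tgt_lens[j - dj:j])))
                if cost < D[i][j]:
                    D[i][j], back[(i, j)] = cost, (di, dj)
    beads, i, j = [], n, m          # recover the pattern sequence
    while (i, j) != (0, 0):
        di, dj = back[(i, j)]
        beads.append((di, dj))
        i, j = i - di, j - dj
    return beads[::-1]

# Sentence lengths in characters: the second source sentence was split in two
print(align([20, 42], [21, 20, 22]))   # → [(1, 1), (1, 2)]
```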
Example of aligned sentences (Canadian Hansards)

2-2: "L'intelligence artificielle / Débat" ⇔ "Artificial intelligence / A debate"
2-1: "Depuis 35 ans, les spécialistes d'intelligence artificielle cherchent à construire des machines pensantes. / Leurs avancées et leurs insuccès alternent curieusement." ⇔ "Attempts to produce thinking machines have met during the past 35 years with a curious mix of progress and failure."
0-1: (no French sentence) ⇔ "Two further points are important."
1-1: "Les symboles et les programmes sont des notions purement abstraites." ⇔ "First, symbols and programs are purely abstract notions."

67
TM training: initial probability assignment t(t_j|s_i, A)

Aligned sentence pair:
  French: "même un cardinal n'est pas à l'abri des cartels de la drogue ."
  English: "even a cardinal is not safe from drug cartels ."
(initially, every source word is linked to every target word)

68
TM training: application of EM: t(t_j|s_i, A)

Same sentence pair ("même un cardinal n'est pas à l'abri des cartels de la drogue ." ⇔ "even a cardinal is not safe from drug cartels ."); after EM, the probability mass concentrates on the plausible word translations.

69
69
IBM models (Brown et al.)
• IBM 1: does not consider positional information and sentence length
• IBM 2: considers sentence length and word position
• IBM 3, 4, 5: fertility in translation
• For CLIR, IBM 1 seems to correspond to the current (bag-of-words) approaches to IR.
70
Word alignment for one sentence pair
Source sentence in training: e = e_1, …, e_l (+ NULL)
Target sentence in training: f = f_1, …, f_m
Only consider alignments in which each target word f_j (at position j) is aligned to exactly one source word (at position a_j)
The set of all possible word alignments: A(e,f)
71
General formula

P(F|E) = P(f|e) = Σ_{a ∈ A(e,f)} P(f,a|e)

with a = (a_1, …, a_m), a_j ∈ [0, l] for all j ∈ [1, m], l = |e|, m = |f|

P(f,a|e) = P(m|e) Π_{j=1..m} P(a_j | a_1^{j-1}, f_1^{j-1}, m, e) × P(f_j | a_1^j, f_1^{j-1}, m, e)

where:
  P(m|e): probability that e is translated into a sentence of length m
  P(a_j | a_1^{j-1}, f_1^{j-1}, m, e): probability that the j-th target word is aligned with the a_j-th source word
  P(f_j | a_1^j, f_1^{j-1}, m, e): probability of producing the word f_j at position j

72
Example

a = (1, 2, 4, 3)
(alignment diagram: NULL it is automatically translated ⇔ c' est traduit automatiquement)

p(c'est traduit automatiquement, a | it is automatically translated) =
  P(m=4 | e = it is automatically translated)
  × Pa(a_1=1 | m=4, e) × Pt(f_1=c' | a_1=1, m=4, e)
  × Pa(a_2=2 | a_1=1, f_1=c', m=4, e) × Pt(f_2=est | a_1^2=(1,2), f_1=c', m=4, e)
  × Pa(a_3=4 | a_1^2=(1,2), f_1^2=c'est, m=4, e) × Pt(f_3=traduit | a_1^3=(1,2,4), f_1^2=c'est, m=4, e)
  × Pa(a_4=3 | a_1^3=(1,2,4), f_1^3=c'est traduit, m=4, e) × Pt(f_4=automatiquement | a_1^4=(1,2,4,3), f_1^3=c'est traduit, m=4, e)

(an application of P(f,a|e) = P(m|e) Π_{j=1..m} P(a_j|a_1^{j-1}, f_1^{j-1}, m, e) P(f_j|a_1^j, f_1^{j-1}, m, e))

73
IBM model 1
• Simplifications:
  P(m|e) = ε   (any length generation is equally probable: a constant)
  P(a_j | a_1^{j-1}, f_1^{j-1}, m, e) = p(a_j|l, m) = 1/(l+1)   (position alignment is uniformly distributed)
  P(f_j | a_1^j, f_1^{j-1}, m, e) = p(f_j|e_{a_j}) = t(f_j|e_{a_j})   (context-independent word translation)
• The model becomes (for one sentence alignment a):

  P(f,a|e) = ε Π_{j=1..m} (1/(l+1)) t(f_j|e_{a_j}) = ε/(l+1)^m × Π_{j=1..m} t(f_j|e_{a_j})

74
Example (Model 1)

a = (1, 2, 4, 3)
(alignment diagram: NULL it is automatically translated ⇔ c' est traduit automatiquement)

Assume:
  t(c'|it) = 0.2
  t(est|is) = 0.7
  t(traduit|translated) = 0.45
  t(automatiquement|automatically) = 0.8

p(c'est traduit automatiquement, a | it is automatically translated)
  = ε/5^4 × 0.2 × 0.7 × 0.45 × 0.8 = 8.064 × 10^-5 ε

75
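Setting ε = 1, the Model 1 computation above can be checked directly (a toy script; the t(·|·) values are the assumed ones from the example):

```python
# P(f,a|e) = ε/(l+1)^m × Π_j t(f_j | e_{a_j}), with the assumed probabilities
t = {("c'", "it"): 0.2, ("est", "is"): 0.7,
     ("traduit", "translated"): 0.45, ("automatiquement", "automatically"): 0.8}

e = ["it", "is", "automatically", "translated"]   # l = 4 (+ NULL → l+1 = 5)
f = ["c'", "est", "traduit", "automatiquement"]   # m = 4
a = [1, 2, 4, 3]                                  # a_j: source position of f_j

epsilon, l, m = 1.0, len(e), len(f)
p = epsilon / (l + 1) ** m
for j, aj in enumerate(a):
    p *= t[(f[j], e[aj - 1])]                     # t(f_j | e_{a_j})
print(round(p, 10))   # → 8.064e-05
```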
Sum up all the alignments

P(f|e) = Σ_{a_1=0..l} … Σ_{a_m=0..l} ε/(l+1)^m Π_{j=1..m} t(f_j|e_{a_j})
       = ε/(l+1)^m × Π_{j=1..m} Σ_{i=0..l} t(f_j|e_i)

• Problem: we want to optimize t(f_j|e_i) so as to maximize the likelihood of the given sentence alignments
• Solution: using EM

76
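The factorization above, which reduces the exponential sum over alignments to a product of sums, can be verified numerically on a toy lexicon (all words and probabilities invented for illustration):

```python
import math
from itertools import product

# Σ_a ε/(l+1)^m Π_j t(f_j|e_{a_j})  =  ε/(l+1)^m Π_j Σ_i t(f_j|e_i)
e = ["NULL", "it", "is", "translated"]   # l+1 = 4 source positions (incl. NULL)
f = ["c'", "est", "traduit"]             # m = 3 target words
t = {("c'", "it"): 0.2, ("c'", "NULL"): 0.05,
     ("est", "is"): 0.7, ("traduit", "translated"): 0.45}
def tt(fj, ei):
    return t.get((fj, ei), 0.01)         # small default for unseen pairs

eps, L, m = 1.0, len(e), len(f)

# Left-hand side: brute-force sum over all (l+1)^m alignments
brute = sum(eps / L ** m * math.prod(tt(f[j], e[a[j]]) for j in range(m))
            for a in product(range(L), repeat=m))

# Right-hand side: the factorized form, computed in O(l·m)
fact = eps / L ** m * math.prod(sum(tt(fj, ei) for ei in e) for fj in f)

print(abs(brute - fact) < 1e-12)   # → True
```

This factorization is what makes exact EM training of IBM Model 1 tractable.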
Parameter estimation
1. Choose an initial value for t(f|e) (f, e are words)
2. Compute the expected count of the word alignment e-f in each sentence pair (f^(s), e^(s)) (E-step)
3. Maximization (M-step)
4. Loop on 2-3
77

E-step (the two δ-sums count the occurrences of f in f^(s) and of e in e^(s)):

c(f|e; f^(s), e^(s)) = t(f|e) / (Σ_{i=0..l} t(f|e_i)) × Σ_{j=1..m} δ(f, f_j) × Σ_{i=0..l} δ(e, e_i)

M-step (re-estimation, with λ_e a normalization factor):

t(f|e) = λ_e^{-1} Σ_{s=1..S} c(f|e; f^(s), e^(s))
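The E-step/M-step above amount to the following minimal IBM Model 1 trainer (toy bitext; a sketch, not the training code actually used in the experiments):

```python
from collections import defaultdict

def train_ibm1(bitext, iterations=10):
    """EM training of IBM Model 1 word-translation probabilities t(f|e).
    bitext: list of (source_words, target_words) pairs; NULL is added."""
    f_vocab = {w for _, fs in bitext for w in fs}
    t = defaultdict(lambda: 1.0 / len(f_vocab))   # 1. uniform initial t(f|e)
    for _ in range(iterations):
        count = defaultdict(float)                # expected counts c(f|e; ...)
        total = defaultdict(float)                # normalizers λ_e
        for es, fs in bitext:                     # 2. E-step
            src = ["NULL"] + es
            for fw in fs:
                z = sum(t[(fw, ew)] for ew in src)    # Σ_i t(f|e_i)
                for ew in src:
                    c = t[(fw, ew)] / z
                    count[(fw, ew)] += c
                    total[ew] += c
        for (fw, ew), c in count.items():         # 3. M-step: t(f|e) = c / λ_e
            t[(fw, ew)] = c / total[ew]
    return t

bitext = [(["the", "house"], ["la", "maison"]),
          (["the", "book"], ["le", "livre"]),
          (["a", "house"], ["une", "maison"])]
t = train_ibm1(bitext, iterations=20)
print(t[("maison", "house")] > t[("maison", "the")])   # → True
```

After a few iterations the probability mass concentrates on the co-occurring pairs, exactly the behaviour illustrated in the "même un cardinal…" example above.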
Problem of parallel texts
• Only a few large parallel corpora
• e.g. Canadian Hansards, EU parliament, Hong Kong Hansards, UN
documents, …
• Many languages are not covered
• Is it possible to extract parallel texts from the Web?
  • The anchor text of each pointer identifies a language
  • Then the two pages referenced are "parallel"
80
(figure: a referring page with links "French" and "English" pointing to a French text and an English text)
PTMiner (Nie & Chen 99)
• Candidate Site Selection
By sending queries to AltaVista, find the Web sites that may contain parallel text.
• File Name Fetching
  For each site, fetch all the file names that are indexed by search engines; use a host crawler to thoroughly retrieve file names from each site.
• Pair Scanning
  From the file names fetched, scan for pairs that satisfy common naming rules.
81
Candidate Sites Searching
82
• Assumption: a candidate site contains at least one such Web page referencing another language
• Take advantage of existing search engines (AltaVista)
File Name Fetching
• Initial set of files (seeds) from a candidate site:
host:www.info.gov.hk
• Breadth-first exploration from the seeds to discover other documents from the sites
83
Pair Scanning
84
• Naming examples:
  index.html vs. index_f.html
  /english/index.html vs. /french/index.html
• General idea: parallel Web pages = similar URLs differing only by a tag identifying a language
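A minimal sketch of pair scanning under such naming rules; the tag lists and URLs are illustrative, not PTMiner's actual rule set:

```python
import re

# URLs differing only by a language tag are candidate parallel pages
EN_TAGS, FR_TAGS = ["en", "e", "english"], ["fr", "f", "french"]

def normalise(url, tags):
    """Replace a language tag (delimited by / . _ -) with a placeholder."""
    pattern = "(?<=[/._-])(%s)(?=[/._-])" % "|".join(tags)
    return re.sub(pattern, "*", url)

def scan_pairs(urls):
    """Return (English URL, French URL) pairs with matching normalised names."""
    index, pairs = {}, []
    for url in urls:
        for tags, lang, other in ((EN_TAGS, "en", "fr"), (FR_TAGS, "fr", "en")):
            key = normalise(url, tags)
            if "*" not in key:
                continue                      # no language tag on this side
            if (key, other) in index:
                pairs.append((index[(key, other)], url))
            index[(key, lang)] = url
    return pairs

print(scan_pairs(["/site/index_e.html", "/site/index_f.html", "/site/about.html"]))
# → [('/site/index_e.html', '/site/index_f.html')]
```

Candidate pairs found this way still need verification (e.g. comparable lengths, successful sentence alignment) before being used as training data.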
Mining Results (several years ago)
85
• French-English
• Exploration of 30% of 5,474 candidate sites
• 14,198 pairs of parallel pages
• 135 MB French texts and 118 MB English texts
• Chinese-English
• 196 candidate sites
• 14,820 pairs of parallel pages
• 117.2 MB Chinese texts and 136.5 MB English texts
• Several other languages I-E, G-E, D-E, …
CLIR results: F-E

              F-E (TREC-6)    F-E (TREC-7)    E-F (TREC-6)    E-F (TREC-7)
Monolingual   0.2865          0.3202          0.3686          0.2764
Hansard TM    0.2166 (74.8%)  0.3124 (97.6%)  0.2501 (67.9%)  0.2587 (93.6%)
Web TM        0.2389 (82.5%)  0.3146 (98.3%)  0.2504 (67.9%)  0.2289 (82.8%)

86
• Web TM comparable to Hansard TM
• Parallel texts for CLIR can tolerate more noise
Alternative methods
• using parallel texts for pseudo-relevance feedback [Yang et al. 99]
• Given a query in F
• Find relevant documents in the parallel corpus
• Extract keywords from their parallel documents, and consider them as a query translation
(diagram: Query in F → relevant documents in F (from the parallel corpus) → corresponding documents in E → words in E)
87
Alternative methods – LSI [Dumais et al. 97]
• Monolingual LSI:
  • Create a latent semantic space (using SVD)
  • Each dimension represents a combination of initial dimensions (terms, documents)
  • Comparison of document and query in the new space
• Bilingual LSI:
  • Create a latent semantic space for both languages on a parallel corpus
  • Concatenate two parallel texts together
  • Convert terms in both languages into the semantic space
• Problems:
  • The dimensions of the latent space are determined to minimize some representational error, which may be different from the translational error
  • Coverage of terms by the parallel corpus
  • Complexity of creating the semantic space
• Effectiveness: usually lower than using a translation model
88
Explicit Semantic Analysis (ESA) [Gabrilovich et al. 07, Spitkovsky et al. 12]
• Assume that each Wikipedia article corresponds to one explicit representation dimension
• A term → association with an article (tf*idf or conditional probability) based on text body or anchor text
• Document ranking: compare the document and query representations in the ESA space
• CLIR: use cross-lingual links between Wikipedia articles
• Possible problems:
  • Completeness of the ESA space
  • Representation granularity
89
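A minimal ESA sketch with an invented term-article association table (the article names and weights are made up); for CLIR, the article dimensions would be mapped across languages via Wikipedia's cross-lingual links:

```python
import math

# term -> {article: weight}, e.g. tf*idf of the term in the article body
ASSOC = {
    "drug":    {"Pharmacology": 2.0, "Drug_cartel": 1.5},
    "traffic": {"Drug_cartel": 1.8, "Road_traffic": 2.2},
    "road":    {"Road_traffic": 2.5},
}

def esa_vector(terms):
    """Project a bag of words into the explicit article space."""
    v = {}
    for term in terms:
        for article, w in ASSOC.get(term, {}).items():
            v[article] = v.get(article, 0.0) + w
    return v

def cosine(u, v):
    dot = sum(u.get(k, 0.0) * w for k, w in v.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

q = esa_vector(["drug", "traffic"])
d1 = esa_vector(["drug"])      # shares the Pharmacology / Drug_cartel dimensions
d2 = esa_vector(["road"])      # only Road_traffic
print(cosine(q, d1) > cosine(q, d2))   # → True
```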
Summary on using document-level term relations
• Pseudo-RF, LSI, ESA
• Assumption: Terms occurring in the same text (or parallel text) are related
• (Translation) relations between terms are coarse-grained
• Related to the same topics
• Query “translation” by topic-related terms
• Usually lower retrieval effectiveness than explicit translation relations
• May be suitable for comparable texts
• Do not exploit fully parallel texts
90
Using a comparable corpus
• Comparable: articles about the same topic
  • E.g. news articles on the same day about an event
• Impossible to train a translation model
• Estimate cross-lingual similarity (less precise than translation)
  • Similar methods to co-occurrence analysis: conditional probability, mutual information, …
• Less effective than using a parallel corpus
• To be used only when there is no parallel corpus, or to complement a dictionary or parallel corpus
  • Helpful to further expand the translation produced by a dictionary, MT or TM
91
Other problems – unknown words
• Proper names ('Vladimir Ivanov' in Chinese?)
• New technical terms ('Latent Semantic Analysis' in Chinese?)
• Possible solutions:
  • Transliteration
  • Mining the web
92
Transliteration (between languages of different alphabets)
• Translate a name phonetically:
  • Generate the pronunciation of the name
  • Transform the sounds into the target-language sounds
  • Generate the characters that represent the sounds

English name:            Frances Taylor
English phonemes:        F R AE N S IH S | T EY L ER
Chinese phonemes:        f u l ang x i s i | t ai l e  (ε for inserted/deleted sounds)
Chinese Pinyin:          fu lang xi si | tai le
Chinese transliteration: 弗 朗 西 丝 | 泰 勒
93
Mining the web (1)
• A site is referred to by several pages with different anchor texts in different languages
• Anchor texts as parallel texts
• Useful for the translation of organizations (故宫博物馆-National Museum)
Example: http://www.yahoo.com is referenced by anchor texts in different languages, e.g.: 雅虎搜索引擎, Yahoo!搜索引擎, 雅虎, 美国雅虎, Yahoo!モバイル, Yahooの検索エンジン, 美國雅虎, Yahoo search engine, 雅虎WWW站, Yahoo!
94
Mining the web (2)
• Some "monolingual" texts may contain translations:
  • 现在网上最热的词条，就是这个"Barack Obama"，巴拉克·欧巴马（巴拉克·奥巴马）。
  • 这就诞生了潜语义索引（Latent Semantic Indexing）…
  • Les machines à vecteurs de support ou séparateurs à vaste marge (en anglais Support Vector Machine, SVM) sont …
• Procedure:
  • Use templates to identify potential candidates:
    Source-name (Target-name)
    Source-name, Target-name
    …
  • Segmentation (what segment may be appropriate?)
  • Statistical analysis
• May be used to complete an existing dictionary
95
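The "Source-name (Target-name)" template can be sketched with a regular expression (a toy regex, not a production extractor); note that the extracted Chinese side still needs segmentation, which is exactly the problem noted above:

```python
import re

# CJK run followed by a Latin phrase in (full-width or ASCII) parentheses
TEMPLATE = re.compile(r"([\u4e00-\u9fff]+)\s*[（(]\s*([A-Za-z][A-Za-z0-9 .·'-]*)\s*[)）]")

def extract_candidates(text):
    """Return (source, target) translation candidates found by the template."""
    return TEMPLATE.findall(text)

text = "这就诞生了潜语义索引（Latent Semantic Indexing）…"
print(extract_candidates(text))
# → [('这就诞生了潜语义索引', 'Latent Semantic Indexing')]
# The candidate source string must still be segmented to isolate 潜语义索引.
```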
Mining the Web (3)
• Wikipedia – a rich source for translations
• Links between articles in different languages
• Translation of concepts (Wiki titles) and named entities (e.g. proper names) through cross-language links
• Linked articles as comparable texts → term similarity
96
Some other improvement means
• Pre- and post-translation expansion
  • Query expansion before the translation, using a source-language collection
  • Query expansion after the translation, using a target-language collection (traditional PRF)
• Fuzzy matching between similar languages
  • information - información - informazione (~cognates)
  • Matching n-grams (e.g. 4-grams)
  • Transformation using rules (konvektio -> convection)
• Combining translations obtained with different tools
  • Several MT systems, several dictionaries
  • MT + TM + dictionary
  • Parallel texts + comparable texts
  • …
97
Structured query
• Traditional method: all the translation terms in a bag of words
  • data /材料/资料/事实/数据/基准/
  • structure /构造/构成/结构/组织/化学构造/石理/纹路/构造物/建筑物/建造/物/
• Consider all translations of the same source word as synonyms
  • #syn(材料 资料 事实 数据 基准) #syn(构造 构成 …)
  • Sum up the occurrences of all the synonyms in a document (vs. summing the log probabilities without #syn)
• Pirkola: structured query > bag-of-words query
• Probabilistic structured query (#wsyn): add weights to synonyms
98
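A minimal sketch of Pirkola's #syn idea (toy dictionary and document; a real implementation would sit inside the retrieval model's tf computation rather than use raw counts):

```python
# All translations of one source word form a synonym class whose occurrences
# are summed, so a word with many translations is not over-weighted.
SYN = {
    "data":      ["材料", "资料", "事实", "数据", "基准"],
    "structure": ["构造", "构成", "结构", "组织"],
}

def syn_tf(doc_tokens, translations):
    """#syn: total frequency of any member of the synonym class."""
    return sum(doc_tokens.count(t) for t in translations)

def score(doc_tokens, query_words):
    """Rank by summed per-class frequencies (a simple tf stand-in)."""
    return sum(syn_tf(doc_tokens, SYN[w]) for w in query_words)

doc = ["数据", "结构", "与", "算法"]
print(score(doc, ["data", "structure"]))   # → 2
```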
Current state
• Effectiveness of CLIR:
  • Between European languages: ~90-100% of monolingual
  • Between European and Asian languages: ~80-100%
  • A usable quality
• One usually needs a translation of the retrieved documents
• The use of CLIR by Web users is still limited / tools for CLIR are limited
  • To be increased (integration into the main search engines)
99
Remaining problems
• Current approaches:
  • CLIR = translation + monolingual IR
  • E.g. using MT as a black box + monolingual IR
• The resources and tools are usually developed for MT, not for CLIR
  • MT: create a readable sentence
  • CLIR: retrieve relevant documents
• Problems of translation selection
  • MT: select one best translation → CLIR needs multiple translations
• Phrases in MT: consecutive words
  • But dependent words do not always form a phrase
  • "Mixing drug cocktails for mental illness is still more art than science"
  • → Take into account more flexible dependencies (even proximity)
  • How to train a translation model in such a context?
• These are problems not only in CLIR, but also in general IR.
100
The future?
• CLIR ≠ translation + monolingual IR
• Translation as a step in CLIR
  • Translation for IR (not for human readers)
• Select effective search terms (not only good translation terms)
• Some similarities with query expansion
• Can use similar approaches to query expansion
• Combine multiple translation possibilities
• Result diversification
• Document translation: Query-dependent?
• (More in the talk on SMT)
101
Beyond
• Current CLIR approaches can succeed in crossing the language barrier
  • MT, dictionary, STM
• Can one use CLIR methods for monolingual (general) IR?
  • Basic idea
  • IBM 1 model
  • Phrase translation model and some adaptations
102
What have we learnt from CLIR?
• The original query is not always the best expression of information need.
• We can generate better or alternative query expressions
• Query expansion, reformulation, rewriting
• General IR
(diagram: information need → query ↔ document; other query expressions?)
103
Basic idea for general IR
• Assumption: lexical gap between queries and documents
  • Queries are written by searchers, documents by their authors, using different vocabularies
Some references
• Jian-Yun Nie, Cross-language information retrieval, Morgan-Claypool, 2010. (a survey book on CLIR)
• Adriani, M. van Rijsbergen, C.J., (2000). Phrase identification in cross-language information retrieval. In Proceedings of RIAO, pp. 520-528. (Using phrase in CLIR)
• Ballesteros, L. and Croft, W. (1997). Phrasal translation and query expansion
techniques for cross-language information retrieval. In Proceedings of SIGIR Conf. pp. 84-91. (Using phrases in CLIR)
• Berger, A. and Lafferty, J. (1999). Information retrieval as statistical translation. In Proceedings of SIGIR Conf., pp. 222-229. (Translation model for monolingual IR)
• Brown, P., Della Pietra, S., Della Pietra, V., and Mercer, R. (1993). The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2), pp. 263-311. (IBM translation models)
• Chen, A., Jiang, H., and Gey, F. (2000). Combining multiple sources for short query translation in Chinese-English cross-language information retrieval. In Proceedings of the Fifth International Workshop on Information Retrieval with Asian Languages (IRAL), pp. 17-23. (Combining translation resources)
• Cui, H., Wen, J-R., Nie, J-Y. and Ma, W-Y. 2003. Query expansion by mining user
log. IEEE Trans on Knowledge and Data Engineering. Vol. 15, No. 4. pp. 1-11. (Query expansion based on query logs)
• V. Dang and W.B. Croft. Query Reformulation Using Anchor Text. In Proc. of WSDM, pages 41-50, 2010. (simulating query logs by anchor texts)
• S. Dumais, T. Letsche, M. Littman, and T. Landauer, “Automatic cross-language retrieval using latent semantic indexing”, in AAAI Symposium on Cross Language Text and Speech Retrieval. 1997, American Association for Artificial Intelligence. (Using LSI for CLIR)
• J. Gao, J.Y. Nie, E. Xun, J. Zhang, M. Zhou, C. Huang, Improving Query Translation for CLIR using Statistical Models, 24th ACM-SIGIR, New Orleans, Sept. 2001, pp. 96-104. (selection of translation terms)
• Gao, J., He, X., and Nie, J-Y. 2010. Clickthrough-based translation models for web search: from word models to phrase models. In CIKM, pp. 1139-1148. (translation
models for monolingual IR)
• E. Gabrilovich and S. Markovitch. 2007. Computing semantic relatedness using Wikipedia-based Explicit Semantic Analysis. In IJCAI (ESA)
• Gale, W. A., Church, K. W. 1993. A Program for Aligning Sentences in Bilingual Corpora. Computational Linguistics, 19(1): 75-102. (length-based sentence alignment)
• Grefenstette, G. (1999). The World Wide Web as a resource for example-based machine translation tasks, In Proc. ASLIB translating and the computer 21 conference. (selection of translation terms)
• Jin, Rong, Hauptmann, A.G., and Zhai, CX. (2002). Title Language Model for Information Retrieval. In Proceedings of SIGIR Conf., pp. 42-48 (title language model)
• Kraaij, W., Nie, J.Y., and Simard, M. (2003). Embedding Web-Based Statistical Translation Models in Cross-Language Information Retrieval.
Computational Linguistics, 29(3): 381-420. (CLIR using language modeling and Web-mined parallel corpora)
• Liu, Y., Jin, R. and Chai, Joyce Y. (2005). A maximum coherence model for dictionary-based cross-language information retrieval. In Proceedings of SIGIR Conf., pp. 536-543. (translation selection)
• J. Scott McCarley, Should we Translate the Documents or the Queries in Cross-language Information Retrieval? 1999, ACL, pp. 208-214. (Document translation vs. query translation)
• Nie, J.Y., Simard, M., Isabelle, P., Durand, R. (1999) Cross-Language Information Retrieval based on Parallel Texts and Automatic Mining of Parallel Texts in the Web, In Proceedings of SIGIR Conf., pp. 74-81 (CLIR using Web-mined parallel pages)
• Pirkola, A. (1998) The effects of query structure and dictionary setups in dictionary-based cross-language information retrieval.” In Proceedings of SIGIR Conf., pp. 55-63. (Structured query translation)
• Pirkola, A., Toivonen, J., Keskustalo, H., Visala, K., Järvelin, K. (2003). Fuzzy translation of cross-lingual spelling variants. In Proceedings of SIGIR Conf., pp. 345-352. (fuzzy match)
• Douglas W. Oard and Paul Hackett, “Document translation for the cross-language text retrieval at the university of Maryland”, in the Sixth Text
REtrieval Conference (TREC-6). (Document translation vs. Query translation)
• Riezler, S., and Liu, Y. 2010. Query rewriting using monolingual statistical machine translation. Computational Linguistics, 36(3): 569-582. (SMT for query rewriting)
• Seo, H.-C., Kim, S.-B., Rim, H.-C. and Myaeng S.-H., (2005) Improving query translation in English-Korean cross-language information retrieval, Information Processing and Management, 41: 507-522. (translation selection)
• Sheridan, P. and Ballerini, J. P. (1996). Experiments in multilingual information
retrieval using the SPIDER system. In Proceedings of SIGIR Conf., pp. 58-65. (CLIR using comparable corpora)
• Valentin I. Spitkovsky and Angel X. Chang. 2012. A Cross-Lingual Dictionary for English Wikipedia Concepts. In Proceedings of the Eighth International
Conference on Language Resources and Evaluation (LREC 2012) (ESA for CLIR)
• Wen, J., Nie, J-Y., and Zhang, H. 2002. Query clustering using user logs. ACM TOIS, 20(1): 59-81. (Query clustering based on query logs)
• Yiming Yang, Jaime G. Carbonell, Ralf D. Brown, and Robert E. Frederking,
“Translingual information retrieval: Learning from bilingual corpora”, Artificial Intelligence, vol.103, pp. 323–345, 1998. (Pseudo-RF for CLIR)