CROSS-LANGUAGE INFORMATION RETRIEVAL AND BEYOND
Jian-Yun Nie
University of Montreal
http://www.iro.umontreal.ca/~nie
1
Outline
• What are the problems in CLIR?
• Recall: General approaches to IR
• The approaches to CLIR proposed in the literature
• Their effectiveness
• Remaining problems
• Applications
2
Problem of CLIR
• Cross-language IR (CLIR)
  • Use a query in one language (e.g. English) to retrieve documents in another language (e.g. Chinese)
• Multilingual IR (MLIR)
  • Use a query in one language to retrieve documents in several languages
3
History
• In the 1970s, first papers on CLIR
• TREC-3 (1994) Spanish (monolingual): El Norte Newspaper SP 1-25
• TREC-4 (1995) Spanish (monolingual): El Norte Newspaper SP 26-50
• TREC-5 (1996) Spanish (monolingual): El Norte newspaper and Agence France Presse SP 51-75
Chinese (monolingual): Xinhua News agency, People’s Daily CH 1-28
• TREC-6 (1997) Chinese (monolingual): the same documents as TREC-5 CH 29-54
  CLIR: English: Associated Press CL 1-25; French, German: Schweizerische Depeschenagentur (SDA)
• TREC-7 (1998) CLIR: English, French, German, Italian (SDA) CL 26-53 + German: Neue Zürcher Zeitung (NZZ)
• TREC-8 (1999) CLIR (English, French, German, Italian): as in TREC-7 CL 54-81
• TREC-9 (2000) English-Chinese: Chinese newswire articles from Hong Kong CH 55-79
• TREC 2001 English-Arabic: Arabic newswire from Agence France Presse 1-25
• TREC 2002 English-Arabic: Arabic newswire from Agence France Presse 26-75
4
History
• NTCIR (Japan, NII) (1999-)
• Asian languages (CJK) + English
Approach 1: Machine Translation (MT)
• Seems to be the ideal tool for CLIR and MLIR (if the translation quality is high)
• Query in F → MT → Translation in E → retrieval over Documents in E
• Typical effectiveness: 80-100% of the monolingual effectiveness
• Problems:
  • Quality
  • Availability
38
Query translation using MT
• Query translation: to make the query comparable to the documents
• Similarities with MT:
  • Similar translation problems
  • Similar methods can be used
• A good MT → good CLIR effectiveness
43
Differences between query translation and MT
• Short queries (2-3 words): HD video recording
• Flexible syntax: video HD recording, recording HD, …
• Goal: help find relevant documents, not to make the translated query readable by users
• Less strict translation: the "translation" can be by related words (same/related topics) → query expansion effect
• Not limited to one translation (e.g. potato → pomme de terre / patate, 土豆 / 马铃薯)
• Not only translation: weight = correctness of translation + utility for IR
  • E.g. organic food: translation → 有机食品 (organic food); utility for IR → 绿色食品 (green food)
44
Translation problems
• 西点: western dessert / West Point
45

Translation problems
• (sign photo: "Caution: slippery steps")
46

Problems of MT for IR
• (sign photo: "Exit")
47

Translation problems
• (sign photo, translated as "art of the state")
48-49
Translation problems
Examples (Systran vs. Google):
1. drug traffic
   Systran: trafic de stupéfiants / 毒品交易
   Google: trafic de stupéfiants / 毒品贩运
2. drug insurance
   Systran: assurance de drogue / 药物保险
   Google: d'assurance médicaments / 药物保险
3. drug research
   Systran: recherche de drogue / 药物研究
   Google: la recherche sur les drogues / 药物研究
50

Examples (continued)
4. drug for treatment of Friedreich's ataxia
   Systran: drogue pour le traitement de l'ataxie de Friedreich / Friedreich的不整齐的治疗的药物
   Google: médicament pour le traitement de l'Ataxie de Friedreich / 药物治疗弗里德的共济失调
5. drug control
   Systran: commande de drogue / 药物管制
   Google: contrôle des drogues / 药物管制
6. drug production
   Systran: production de drogue / 药物生产
   Google: la production de drogues / 药物生产
51
Approach 2: Using bilingual dictionaries
• Unavailability of high-quality MT systems for many language pairs
• MT systems can often be used as a black box that is difficult to adapt to IR task
• MT produces one best translation
• IR needs translation alternatives
• Bilingual dictionary:
• An inexpensive alternative
• Usually available
52
Approach 2: Using bilingual dictionaries
• General form of dict. (e.g. Freedict):
  access: attaque, accéder, intelligence, entrée, accès
academic: étudiant, académique
branch: filiale, succursale, spécialité, branche
data: données, matériau, data
• LDC English-Chinese:
  • AIDS /艾滋病/爱滋病/
• data /材料/资料/事实/数据/基准/
• prevention /阻碍/防止/妨碍/预防/预防法/
• problem /问题/难题/疑问/习题/作图题/将军/课题/困难/难/题是/
• structure /构造/构成/结构/组织/化学构造/石理/纹路/构造物/建筑物/建造/物/
53
Basic methods
• Use all the translation terms
  • data /材料/资料/事实/数据/基准/
  • structure /构造/构成/结构/组织/化学构造/石理/纹路/构造物/建筑物/建造/物/
  • Introduces noise
  • Implicitly, a term with more translations is assigned higher importance
• Use the first (or the most frequent) translation
  • Does not cover other translation alternatives
  • Not always an appropriate choice
• Generally: 50-60% of monolingual IR effectiveness
• Problems of dictionaries:
  • Coverage (unknown words, unknown translations)
  • [Xu and Weischedel 2005] tested the impact of dictionary coverage on CLIR (En-Ch): effectiveness increases up to about 10,000 entries
54
Translate the query as a whole
• Phrase translation [Ballesteros and Croft, 1996, 1997]
  • base de données: database
  • pomme de terre: potato
• First translate phrases, then the remaining words
• Best global translation for the whole query:
  1. Candidates: for each query word, determine all the possible translations (through a dictionary)
  2. Selection: select the set of translation words that produces the highest cohesion
55
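The selection step above can be sketched as follows; the candidate translations and co-occurrence counts are invented toy data, not taken from any real dictionary or corpus:

```python
from itertools import product

# Toy co-occurrence counts between candidate translation words
# (in practice collected from a target-language corpus). Keys are sorted.
COOC = {
    ("acces", "base_de_donnees"): 250,
    ("attaque", "base_de_donnees"): 3,
    ("acces", "fondation"): 12,
    ("attaque", "fondation"): 8,
}

def cohesion(pair):
    """Symmetric co-occurrence score of two translation candidates."""
    a, b = sorted(pair)
    return COOC.get((a, b), 0)

def best_translation(candidates):
    """candidates: one list of candidate translations per query word.
    Return the combination with the highest total pairwise cohesion."""
    best, best_score = None, -1
    for combo in product(*candidates):
        score = sum(cohesion((combo[i], combo[j]))
                    for i in range(len(combo))
                    for j in range(i + 1, len(combo)))
        if score > best_score:
            best, best_score = combo, score
    return best

# "database access": choose among the dictionary translations of each word
print(best_translation([["base_de_donnees", "fondation"], ["acces", "attaque"]]))
# → ('base_de_donnees', 'acces')
```

The exhaustive search over combinations is exponential in query length; this is acceptable here only because queries are short (2-3 words).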
Cohesion
• Cohesion ~ frequency of two translation words occurring together
• t(tj|si) is estimated from a parallel training corpus, aligned into parallel sentences
• IBM models 1, 2, 3, …
• Process:
  • Input = two sets of parallel texts
• Sentence alignment A: Sk ⇔ Tl (bitext)
• Initial probability assignment: t(tj|si,A)
• Expectation Maximization (EM): t(tj|si ,A)
• Final result: t(tj|si) = t(tj|si ,A)
62
Integrating translation in language model (Kraaij et al. 2003)
• The problem of CLIR:
• Query translation (QT)
• Document translation (DT)
Score(D,Q) = Σ_{t_i ∈ V} P(t_i|M_Q) log P(t_i|M_D)

QT: P(t_i|M_Qs) = Σ_{s_j ∈ V_s} P(t_i|s_j, M_Qs^ML) P(s_j|M_Qs^ML)
              ≈ Σ_{s_j ∈ V_s} t(t_i|s_j) P(s_j|M_Qs^ML)

DT: P(s_i|M_Dt) = Σ_{t_j ∈ V_t} P(s_i|t_j, M_Dt) P(t_j|M_Dt)
              ≈ Σ_{t_j ∈ V_t} t(s_i|t_j) P(t_j|M_Dt)
63
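A minimal sketch of the query-translation scoring above; the translation table and all probabilities are invented toy values, not the models of Kraaij et al.:

```python
import math

# Toy translation probabilities t(target | source) — invented for illustration
T = {"drug": {"drogue": 0.6, "medicament": 0.4},
     "traffic": {"trafic": 0.9, "circulation": 0.1}}

def query_model(source_terms):
    """P(t_i|M_Qs) ≈ Σ_j t(t_i|s_j) P(s_j|M_Qs), with a uniform source model."""
    p_s = 1.0 / len(source_terms)
    model = {}
    for s in source_terms:
        for tgt, p_t in T.get(s, {}).items():
            model[tgt] = model.get(tgt, 0.0) + p_t * p_s
    return model

def score(doc_model, q_model, floor=1e-9):
    """Score(D,Q) = Σ_i P(t_i|M_Q) log P(t_i|M_D); floor for unseen terms."""
    return sum(p_q * math.log(doc_model.get(tgt, floor))
               for tgt, p_q in q_model.items())

qm = query_model(["drug", "traffic"])
d1 = {"trafic": 0.3, "drogue": 0.2, "police": 0.5}   # on-topic document model
d2 = {"circulation": 0.1, "route": 0.9}              # off-topic document model
print(score(d1, qm) > score(d2, qm))   # → True
```

A real system would use smoothed document models rather than a fixed floor probability; the floor is only a stand-in to keep the sketch short.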
Results (CLEF 2000-2002)
Run EN-FR FR-EN EN-IT IT-EN
Mono 0.4233 0.4705 0.4542 0.4705
MT 0.3478 0.4043 0.3060 0.3249
QT 0.3878 0.4194 0.3519 0.3678
DT 0.3909 0.4073 0.3728 0.3547
- Translation model (IBM 1) trained on a web collection
- TM can outperform MT (Systran)
64
Details on translation model training on a parallel corpus
• Sentence alignment
• Align a sentence in the source language to its translation(s) in the target language
• Translation model
• Extract translation relationships
• Various models (assumptions)
65
Sentence alignment
• Assumptions:
  • The order of sentences in two parallel texts is similar
• A sentence and its translation have similar length (length-based alignment, e.g. Gale & Church)
• A translation contains some “known” translation words, or cognates (e.g. Simard et al. 93)
66
Length-based alignment by dynamic programming, where D(i,j) is the cost of aligning the first i source sentences with the first j target sentences:

D(i,j) = min {
  D(i,   j-1) + d1,   (pattern 0-1)
  D(i-1, j  ) + d2,   (pattern 1-0)
  D(i-1, j-1) + d3,   (pattern 1-1)
  D(i-1, j-2) + d4,   (pattern 1-2)
  D(i-2, j-1) + d5,   (pattern 2-1)
  D(i-2, j-2) + d6    (pattern 2-2)
}

where d_k is the distance (cost) for the corresponding alignment pattern (0-1, 1-1, …).
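The dynamic program can be sketched as follows; the pattern penalties and the absolute length-difference cost are simplified stand-ins for Gale & Church's probabilistic length-ratio cost:

```python
# Alignment patterns (di source sentences matched to dj target sentences)
PATTERNS = [(0, 1), (1, 0), (1, 1), (1, 2), (2, 1), (2, 2)]
PENALTY = {(0, 1): 5, (1, 0): 5, (1, 1): 0, (1, 2): 2, (2, 1): 2, (2, 2): 3}

def align(src_lens, tgt_lens):
    """D(i,j) = min over patterns of D(i-di, j-dj) + d(di,dj) + length cost.
    Inputs are sentence lengths; returns the sequence of patterns used."""
    INF = float("inf")
    n, m = len(src_lens), len(tgt_lens)
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    back = {}
    D[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            for di, dj in PATTERNS:
                if i - di < 0 or j - dj < 0:
                    continue
                cost = (D[i - di][j - dj] + PENALTY[(di, dj)]
                        + abs(sum(src_lens[i - di:i]) - sum(tgt_lens[j - dj:j])))
                if cost < D[i][j]:
                    D[i][j], back[(i, j)] = cost, (di, dj)
    beads, i, j = [], n, m          # recover the pattern sequence
    while (i, j) != (0, 0):
        di, dj = back[(i, j)]
        beads.append((di, dj))
        i, j = i - di, j - dj
    return beads[::-1]

# Sentence lengths in characters: the second source sentence was split in two
print(align([20, 42], [21, 20, 22]))   # → [(1, 1), (1, 2)]
```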
Example of aligned sentences (Canadian Hansards)

2-2: "L'intelligence artificielle / Débat" ⇔ "Artificial intelligence / A debate"
2-1: "Depuis 35 ans, les spécialistes d'intelligence artificielle cherchent à construire des machines pensantes. / Leurs avancées et leurs insuccès alternent curieusement." ⇔ "Attempts to produce thinking machines have met during the past 35 years with a curious mix of progress and failure."
0-1: (no French sentence) ⇔ "Two further points are important."
1-1: "Les symboles et les programmes sont des notions purement abstraites." ⇔ "First, symbols and programs are purely abstract notions."

67
TM training: initial probability assignment t(t_j|s_i, A)

Aligned sentence pair:
  French: "même un cardinal n'est pas à l'abri des cartels de la drogue ."
  English: "even a cardinal is not safe from drug cartels ."
(initially, every source word is linked to every target word)

68
TM training: application of EM: t(t_j|s_i, A)

Same sentence pair ("même un cardinal n'est pas à l'abri des cartels de la drogue ." ⇔ "even a cardinal is not safe from drug cartels ."); after EM, the probability mass concentrates on the plausible word translations.

69
69
IBM models (Brown et al.)
• IBM 1: does not consider positional information and sentence length
• IBM 2: considers sentence length and word position
• IBM 3, 4, 5: fertility in translation
• For CLIR, IBM 1 seems to correspond to the current (bag-of-words) approaches to IR.
70
Word alignment for one sentence pair
Source sentence in training: e = e_1, …, e_l (+ NULL)
Target sentence in training: f = f_1, …, f_m
Only consider alignments in which each target word f_j (at position j) is aligned to exactly one source word (at position a_j)
The set of all possible word alignments: A(e,f)
71
General formula

P(F|E) = P(f|e) = Σ_{a ∈ A(e,f)} P(f,a|e)

with a = (a_1, …, a_m), a_j ∈ [0, l] for all j ∈ [1, m], l = |e|, m = |f|

P(f,a|e) = P(m|e) Π_{j=1..m} P(a_j | a_1^{j-1}, f_1^{j-1}, m, e) × P(f_j | a_1^j, f_1^{j-1}, m, e)

where:
  P(m|e): probability that e is translated into a sentence of length m
  P(a_j | a_1^{j-1}, f_1^{j-1}, m, e): probability that the j-th target word is aligned with the a_j-th source word
  P(f_j | a_1^j, f_1^{j-1}, m, e): probability of producing the word f_j at position j

72
Example

a = (1, 2, 4, 3)
(alignment diagram: NULL it is automatically translated ⇔ c' est traduit automatiquement)

p(c'est traduit automatiquement, a | it is automatically translated) =
  P(m=4 | e = it is automatically translated)
  × Pa(a_1=1 | m=4, e) × Pt(f_1=c' | a_1=1, m=4, e)
  × Pa(a_2=2 | a_1=1, f_1=c', m=4, e) × Pt(f_2=est | a_1^2=(1,2), f_1=c', m=4, e)
  × Pa(a_3=4 | a_1^2=(1,2), f_1^2=c'est, m=4, e) × Pt(f_3=traduit | a_1^3=(1,2,4), f_1^2=c'est, m=4, e)
  × Pa(a_4=3 | a_1^3=(1,2,4), f_1^3=c'est traduit, m=4, e) × Pt(f_4=automatiquement | a_1^4=(1,2,4,3), f_1^3=c'est traduit, m=4, e)

(an application of P(f,a|e) = P(m|e) Π_{j=1..m} P(a_j|a_1^{j-1}, f_1^{j-1}, m, e) P(f_j|a_1^j, f_1^{j-1}, m, e))

73
IBM model 1
• Simplifications:
  P(m|e) = ε   (any length generation is equally probable: a constant)
  P(a_j | a_1^{j-1}, f_1^{j-1}, m, e) = p(a_j|l, m) = 1/(l+1)   (position alignment is uniformly distributed)
  P(f_j | a_1^j, f_1^{j-1}, m, e) = p(f_j|e_{a_j}) = t(f_j|e_{a_j})   (context-independent word translation)
• The model becomes (for one sentence alignment a):

  P(f,a|e) = ε Π_{j=1..m} (1/(l+1)) t(f_j|e_{a_j}) = ε/(l+1)^m × Π_{j=1..m} t(f_j|e_{a_j})

74
Example (Model 1)

a = (1, 2, 4, 3)
(alignment diagram: NULL it is automatically translated ⇔ c' est traduit automatiquement)

Assume:
  t(c'|it) = 0.2
  t(est|is) = 0.7
  t(traduit|translated) = 0.45
  t(automatiquement|automatically) = 0.8

p(c'est traduit automatiquement, a | it is automatically translated)
  = ε/5^4 × 0.2 × 0.7 × 0.45 × 0.8 = 8.064 × 10^-5 ε

75
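Setting ε = 1, the Model 1 computation above can be checked directly (a toy script; the t(·|·) values are the assumed ones from the example):

```python
# P(f,a|e) = ε/(l+1)^m × Π_j t(f_j | e_{a_j}), with the assumed probabilities
t = {("c'", "it"): 0.2, ("est", "is"): 0.7,
     ("traduit", "translated"): 0.45, ("automatiquement", "automatically"): 0.8}

e = ["it", "is", "automatically", "translated"]   # l = 4 (+ NULL → l+1 = 5)
f = ["c'", "est", "traduit", "automatiquement"]   # m = 4
a = [1, 2, 4, 3]                                  # a_j: source position of f_j

epsilon, l, m = 1.0, len(e), len(f)
p = epsilon / (l + 1) ** m
for j, aj in enumerate(a):
    p *= t[(f[j], e[aj - 1])]                     # t(f_j | e_{a_j})
print(round(p, 10))   # → 8.064e-05
```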
Sum up all the alignments

P(f|e) = Σ_{a_1=0..l} … Σ_{a_m=0..l} ε/(l+1)^m Π_{j=1..m} t(f_j|e_{a_j})
       = ε/(l+1)^m × Π_{j=1..m} Σ_{i=0..l} t(f_j|e_i)

• Problem: we want to optimize t(f_j|e_i) so as to maximize the likelihood of the given sentence alignments
• Solution: using EM

76
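The factorization above, which reduces the exponential sum over alignments to a product of sums, can be verified numerically on a toy lexicon (all words and probabilities invented for illustration):

```python
import math
from itertools import product

# Σ_a ε/(l+1)^m Π_j t(f_j|e_{a_j})  =  ε/(l+1)^m Π_j Σ_i t(f_j|e_i)
e = ["NULL", "it", "is", "translated"]   # l+1 = 4 source positions (incl. NULL)
f = ["c'", "est", "traduit"]             # m = 3 target words
t = {("c'", "it"): 0.2, ("c'", "NULL"): 0.05,
     ("est", "is"): 0.7, ("traduit", "translated"): 0.45}
def tt(fj, ei):
    return t.get((fj, ei), 0.01)         # small default for unseen pairs

eps, L, m = 1.0, len(e), len(f)

# Left-hand side: brute-force sum over all (l+1)^m alignments
brute = sum(eps / L ** m * math.prod(tt(f[j], e[a[j]]) for j in range(m))
            for a in product(range(L), repeat=m))

# Right-hand side: the factorized form, computed in O(l·m)
fact = eps / L ** m * math.prod(sum(tt(fj, ei) for ei in e) for fj in f)

print(abs(brute - fact) < 1e-12)   # → True
```

This factorization is what makes exact EM training of IBM Model 1 tractable.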
Parameter estimation
1. Choose an initial value for t(f|e) (f, e are words)
2. Compute the expected count of the word alignment e-f in each sentence pair (f^(s), e^(s)) (E-step)
3. Maximization (M-step)
4. Loop on 2-3
77

E-step (the two δ-sums count the occurrences of f in f^(s) and of e in e^(s)):

c(f|e; f^(s), e^(s)) = t(f|e) / (Σ_{i=0..l} t(f|e_i)) × Σ_{j=1..m} δ(f, f_j) × Σ_{i=0..l} δ(e, e_i)

M-step (re-estimation, with λ_e a normalization factor):

t(f|e) = λ_e^{-1} Σ_{s=1..S} c(f|e; f^(s), e^(s))
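The E-step/M-step above amount to the following minimal IBM Model 1 trainer (toy bitext; a sketch, not the training code actually used in the experiments):

```python
from collections import defaultdict

def train_ibm1(bitext, iterations=10):
    """EM training of IBM Model 1 word-translation probabilities t(f|e).
    bitext: list of (source_words, target_words) pairs; NULL is added."""
    f_vocab = {w for _, fs in bitext for w in fs}
    t = defaultdict(lambda: 1.0 / len(f_vocab))   # 1. uniform initial t(f|e)
    for _ in range(iterations):
        count = defaultdict(float)                # expected counts c(f|e; ...)
        total = defaultdict(float)                # normalizers λ_e
        for es, fs in bitext:                     # 2. E-step
            src = ["NULL"] + es
            for fw in fs:
                z = sum(t[(fw, ew)] for ew in src)    # Σ_i t(f|e_i)
                for ew in src:
                    c = t[(fw, ew)] / z
                    count[(fw, ew)] += c
                    total[ew] += c
        for (fw, ew), c in count.items():         # 3. M-step: t(f|e) = c / λ_e
            t[(fw, ew)] = c / total[ew]
    return t

bitext = [(["the", "house"], ["la", "maison"]),
          (["the", "book"], ["le", "livre"]),
          (["a", "house"], ["une", "maison"])]
t = train_ibm1(bitext, iterations=20)
print(t[("maison", "house")] > t[("maison", "the")])   # → True
```

After a few iterations the probability mass concentrates on the co-occurring pairs, exactly the behaviour illustrated in the "même un cardinal…" example above.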
Problem of parallel texts
• Only a few large parallel corpora
• e.g. Canadian Hansards, EU parliament, Hong Kong Hansards, UN
documents, …
• Many languages are not covered
• Is it possible to extract parallel texts from the Web?
  • The anchor text of each pointer identifies a language
  • Then the two pages referenced are "parallel"
80
(figure: a referring page with links "French" and "English" pointing to a French text and an English text)
PTMiner (Nie & Chen 99)
• Candidate Site Selection
By sending queries to AltaVista, find the Web sites that may contain parallel text.
• File Name Fetching
  For each site, fetch all the file names that are indexed by search engines; use a host crawler to thoroughly retrieve file names from each site.
• Pair Scanning
  From the file names fetched, scan for pairs that satisfy common naming rules.
81
Candidate Sites Searching
82
• Assumption: a candidate site contains at least one such Web page referencing another language
• Take advantage of existing search engines (AltaVista)
File Name Fetching
• Initial set of files (seeds) from a candidate site:
host:www.info.gov.hk
• Breadth-first exploration from the seeds to discover other documents from the sites
83
Pair Scanning
84
• Naming examples:
  index.html vs. index_f.html
  /english/index.html vs. /french/index.html
• General idea: parallel Web pages = similar URLs differing only by a tag identifying a language
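A minimal sketch of pair scanning under such naming rules; the tag lists and URLs are illustrative, not PTMiner's actual rule set:

```python
import re

# URLs differing only by a language tag are candidate parallel pages
EN_TAGS, FR_TAGS = ["en", "e", "english"], ["fr", "f", "french"]

def normalise(url, tags):
    """Replace a language tag (delimited by / . _ -) with a placeholder."""
    pattern = "(?<=[/._-])(%s)(?=[/._-])" % "|".join(tags)
    return re.sub(pattern, "*", url)

def scan_pairs(urls):
    """Return (English URL, French URL) pairs with matching normalised names."""
    index, pairs = {}, []
    for url in urls:
        for tags, lang, other in ((EN_TAGS, "en", "fr"), (FR_TAGS, "fr", "en")):
            key = normalise(url, tags)
            if "*" not in key:
                continue                      # no language tag on this side
            if (key, other) in index:
                pairs.append((index[(key, other)], url))
            index[(key, lang)] = url
    return pairs

print(scan_pairs(["/site/index_e.html", "/site/index_f.html", "/site/about.html"]))
# → [('/site/index_e.html', '/site/index_f.html')]
```

Candidate pairs found this way still need verification (e.g. comparable lengths, successful sentence alignment) before being used as training data.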
Mining Results (several years ago)
85
• French-English
• Exploration of 30% of 5,474 candidate sites
• 14,198 pairs of parallel pages
• 135 MB French texts and 118 MB English texts
• Chinese-English
• 196 candidate sites
• 14,820 pairs of parallel pages
• 117.2 MB Chinese texts and 136.5 MB English texts
• Several other languages I-E, G-E, D-E, …
CLIR results: F-E

              F-E (TREC-6)    F-E (TREC-7)    E-F (TREC-6)    E-F (TREC-7)
Monolingual   0.2865          0.3202          0.3686          0.2764
Hansard TM    0.2166 (74.8%)  0.3124 (97.6%)  0.2501 (67.9%)  0.2587 (93.6%)
Web TM        0.2389 (82.5%)  0.3146 (98.3%)  0.2504 (67.9%)  0.2289 (82.8%)

86
• Web TM comparable to Hansard TM
• Parallel texts for CLIR can tolerate more noise
Alternative methods
• using parallel texts for pseudo-relevance feedback [Yang et al. 99]
• Given a query in F
• Find relevant documents in the parallel corpus
• Extract keywords from their parallel documents, and consider them as a query translation
(diagram: Query in F → relevant documents in F (from the parallel corpus) → corresponding documents in E → words in E)
87
Alternative methods – LSI [Dumais et al. 97]
• Monolingual LSI:
  • Create a latent semantic space (using SVD)
  • Each dimension represents a combination of initial dimensions (terms, documents)
  • Comparison of document and query in the new space
• Bilingual LSI:
  • Create a latent semantic space for both languages on a parallel corpus
  • Concatenate two parallel texts together
  • Convert terms in both languages into the semantic space
• Problems:
  • The dimensions of the latent space are determined to minimize some representational error, which may be different from the translational error
  • Coverage of terms by the parallel corpus
  • Complexity of creating the semantic space
• Effectiveness: usually lower than using a translation model
88
Explicit Semantic Analysis (ESA) [Gabrilovich et al. 07, Spitkovsky et al. 12]
• Assume that each Wikipedia article corresponds to one explicit representation dimension
• A term → association with an article (tf*idf or conditional probability) based on text body or anchor text
• Document ranking: compare the document and query representations in the ESA space
• CLIR: use cross-lingual links between Wikipedia articles
• Possible problems:
  • Completeness of the ESA space
  • Representation granularity
89
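A minimal ESA sketch with an invented term-article association table (the article names and weights are made up); for CLIR, the article dimensions would be mapped across languages via Wikipedia's cross-lingual links:

```python
import math

# term -> {article: weight}, e.g. tf*idf of the term in the article body
ASSOC = {
    "drug":    {"Pharmacology": 2.0, "Drug_cartel": 1.5},
    "traffic": {"Drug_cartel": 1.8, "Road_traffic": 2.2},
    "road":    {"Road_traffic": 2.5},
}

def esa_vector(terms):
    """Project a bag of words into the explicit article space."""
    v = {}
    for term in terms:
        for article, w in ASSOC.get(term, {}).items():
            v[article] = v.get(article, 0.0) + w
    return v

def cosine(u, v):
    dot = sum(u.get(k, 0.0) * w for k, w in v.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

q = esa_vector(["drug", "traffic"])
d1 = esa_vector(["drug"])      # shares the Pharmacology / Drug_cartel dimensions
d2 = esa_vector(["road"])      # only Road_traffic
print(cosine(q, d1) > cosine(q, d2))   # → True
```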
Summary on using document-level term relations
• Pseudo-RF, LSI, ESA
• Assumption: Terms occurring in the same text (or parallel text) are related
• (Translation) relations between terms are coarse-grained
• Related to the same topics
• Query “translation” by topic-related terms
• Usually lower retrieval effectiveness than explicit translation relations
• May be suitable for comparable texts
• Do not exploit fully parallel texts
90
Using a comparable corpus
• Comparable: articles about the same topic
  • E.g. news articles on the same day about an event
• Impossible to train a translation model
• Estimate cross-lingual similarity (less precise than translation)
  • Similar methods to co-occurrence analysis: conditional probability, mutual information, …
• Less effective than using a parallel corpus
• To be used only when there is no parallel corpus, or to complement a dictionary or parallel corpus
  • Helpful to further expand the translation produced by a dictionary, MT or TM
91
Other problems – unknown words
• Proper names ('Vladimir Ivanov' in Chinese?)
• New technical terms ('Latent Semantic Analysis' in Chinese?)
• Possible solutions:
  • Transliteration
  • Mining the web
92
Transliteration (between languages of different alphabets)
• Translate a name phonetically:
  • Generate the pronunciation of the name
  • Transform the sounds into the target-language sounds
  • Generate the characters that represent the sounds

English name:            Frances Taylor
English phonemes:        F R AE N S IH S | T EY L ER
Chinese phonemes:        f u l ang x i s i | t ai l e  (ε for inserted/deleted sounds)
Chinese Pinyin:          fu lang xi si | tai le
Chinese transliteration: 弗 朗 西 丝 | 泰 勒
93
Mining the web (1)
• A site is referred to by several pages with different anchor texts in different languages
• Anchor texts as parallel texts
• Useful for the translation of organizations (故宫博物馆-National Museum)
Example: http://www.yahoo.com is referenced by anchor texts in different languages, e.g.: 雅虎搜索引擎, Yahoo!搜索引擎, 雅虎, 美国雅虎, Yahoo!モバイル, Yahooの検索エンジン, 美國雅虎, Yahoo search engine, 雅虎WWW站, Yahoo!
94
Mining the web (2)
• Some "monolingual" texts may contain translations:
  • 现在网上最热的词条，就是这个"Barack Obama"，巴拉克·欧巴马（巴拉克·奥巴马）。
  • 这就诞生了潜语义索引（Latent Semantic Indexing）…
  • Les machines à vecteurs de support ou séparateurs à vaste marge (en anglais Support Vector Machine, SVM) sont …
• Procedure:
  • Use templates to identify potential candidates:
    Source-name (Target-name)
    Source-name, Target-name
    …
  • Segmentation (what segment may be appropriate?)
  • Statistical analysis
• May be used to complete an existing dictionary
95
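The "Source-name (Target-name)" template can be sketched with a regular expression (a toy regex, not a production extractor); note that the extracted Chinese side still needs segmentation, which is exactly the problem noted above:

```python
import re

# CJK run followed by a Latin phrase in (full-width or ASCII) parentheses
TEMPLATE = re.compile(r"([\u4e00-\u9fff]+)\s*[（(]\s*([A-Za-z][A-Za-z0-9 .·'-]*)\s*[)）]")

def extract_candidates(text):
    """Return (source, target) translation candidates found by the template."""
    return TEMPLATE.findall(text)

text = "这就诞生了潜语义索引（Latent Semantic Indexing）…"
print(extract_candidates(text))
# → [('这就诞生了潜语义索引', 'Latent Semantic Indexing')]
# The candidate source string must still be segmented to isolate 潜语义索引.
```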
Mining the Web (3)
• Wikipedia – a rich source for translations
• Links between articles in different languages
• Translation of concepts (Wiki titles) and named entities (e.g. proper names) through cross-language links
• Linked articles as comparable texts → term similarity
96
Some other improvement means
• Pre- and post-translation expansion
  • Query expansion before the translation, using a source-language collection
  • Query expansion after the translation, using a target-language collection (traditional PRF)
• Fuzzy matching between similar languages
  • information - información - informazione (~cognates)
  • Matching n-grams (e.g. 4-grams)
  • Transformation using rules (konvektio -> convection)
• Combining translations obtained with different tools
  • Several MT systems, several dictionaries
  • MT + TM + dictionary
  • Parallel texts + comparable texts
  • …
97
Structured query
• Traditional method: all the translation terms in a bag of words
  • data /材料/资料/事实/数据/基准/
  • structure /构造/构成/结构/组织/化学构造/石理/纹路/构造物/建筑物/建造/物/
• Consider all translations of the same source word as synonyms
  • #syn(材料 资料 事实 数据 基准) #syn(构造 构成 …)
  • Sum up the occurrences of all the synonyms in a document (vs. summing the log probabilities without #syn)
• Pirkola: structured query > bag-of-words query
• Probabilistic structured query (#wsyn): add weights to synonyms
98
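A minimal sketch of Pirkola's #syn idea (toy dictionary and document; a real implementation would sit inside the retrieval model's tf computation rather than use raw counts):

```python
# All translations of one source word form a synonym class whose occurrences
# are summed, so a word with many translations is not over-weighted.
SYN = {
    "data":      ["材料", "资料", "事实", "数据", "基准"],
    "structure": ["构造", "构成", "结构", "组织"],
}

def syn_tf(doc_tokens, translations):
    """#syn: total frequency of any member of the synonym class."""
    return sum(doc_tokens.count(t) for t in translations)

def score(doc_tokens, query_words):
    """Rank by summed per-class frequencies (a simple tf stand-in)."""
    return sum(syn_tf(doc_tokens, SYN[w]) for w in query_words)

doc = ["数据", "结构", "与", "算法"]
print(score(doc, ["data", "structure"]))   # → 2
```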
Current state
• Effectiveness of CLIR:
  • Between European languages: ~90-100% of monolingual
  • Between European and Asian languages: ~80-100%
  • A usable quality
• One usually needs a translation of the retrieved documents
• The use of CLIR by Web users is still limited / tools for CLIR are limited
  • To be increased (integration into the main search engines)
99
Remaining problems
• Current approaches:
  • CLIR = translation + monolingual IR
  • E.g. using MT as a black box + monolingual IR
• The resources and tools are usually developed for MT, not for CLIR
  • MT: create a readable sentence
  • CLIR: retrieve relevant documents
• Problems of translation selection
  • MT: select one best translation → CLIR needs multiple translations
• Phrases in MT: consecutive words
  • But dependent words do not always form a phrase
  • "Mixing drug cocktails for mental illness is still more art than science"
  • → Take into account more flexible dependencies (even proximity)
  • How to train a translation model in such a context?
• These are problems not only in CLIR, but also in general IR.
100
The future?
• CLIR ≠ translation + monolingual IR
• Translation as a step in CLIR
  • Translation for IR (not for human readers)
• Select effective search terms (not only good translation terms)
• Some similarities with query expansion
• Can use similar approaches to query expansion
• Combine multiple translation possibilities
• Result diversification
• Document translation: Query-dependent?
• (More in the talk on SMT)
101
Beyond
• Current CLIR approaches can succeed in crossing the language barrier
  • MT, dictionary, STM
• Can one use CLIR methods for monolingual (general) IR?
  • Basic idea
  • IBM 1 model
  • Phrase translation model and some adaptations
102
What have we learnt from CLIR?
• The original query is not always the best expression of information need.
• We can generate better or alternative query expressions
• Query expansion, reformulation, rewriting
• General IR
(diagram: information need → query ↔ document; other query expressions?)
103
Basic idea for general IR
• Assumption: lexical gap between queries and documents
  • Queries are written by searchers, documents by their authors, using different vocabularies
Some references
• Jian-Yun Nie, Cross-language information retrieval, Morgan-Claypool, 2010. (a survey book on CLIR)
• Adriani, M. van Rijsbergen, C.J., (2000). Phrase identification in cross-language information retrieval. In Proceedings of RIAO, pp. 520-528. (Using phrase in CLIR)
• Ballesteros, L. and Croft, W. (1997). Phrasal translation and query expansion
techniques for cross-language information retrieval. In Proceedings of SIGIR Conf. pp. 84-91. (Using phrases in CLIR)
• Berger, A. and Lafferty, J. (1999). Information retrieval as statistical translation. In Proceedings of SIGIR Conf., pp. 222-229. (Translation model for monolingual IR)
• Brown, P., Della Pietra, S., Della Pietra, V., and Mercer, R. (1993). The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2), pp. 263-311. (IBM translation models)
• Chen, A., Jiang, H., and Gey, F. (2000). Combining multiple sources for short query translation in Chinese-English cross-language information retrieval. In Proceedings of the Fifth International Workshop on Information Retrieval with Asian Languages (IRAL), pp. 17-23. (Combining translation resources)
• Cui, H., Wen, J-R., Nie, J-Y. and Ma, W-Y. 2003. Query expansion by mining user
log. IEEE Trans on Knowledge and Data Engineering. Vol. 15, No. 4. pp. 1-11. (Query expansion based on query logs)
• V. Dang and W.B. Croft. Query Reformulation Using Anchor Text. In Proc. of WSDM, pages 41-50, 2010. (simulating query logs by anchor texts)
• S. Dumais, T. Letsche, M. Littman, and T. Landauer, “Automatic cross-language retrieval using latent semantic indexing”, in AAAI Symposium on Cross Language Text and Speech Retrieval. 1997, American Association for Artificial Intelligence. (Using LSI for CLIR)
• J. Gao, J.Y. Nie, E. Xun, J. Zhang, M. Zhou, C. Huang, Improving Query Translation for CLIR using Statistical Models, 24th ACM-SIGIR, New Orleans, Sept. 2001, pp. 96-104. (selection of translation terms)
• Gao, J., He, X., and Nie, J-Y. 2010. Clickthrough-based translation models for web search: from word models to phrase models. In CIKM, pp. 1139-1148. (translation
models for monolingual IR)
• E. Gabrilovich and S. Markovitch. 2007. Computing semantic relatedness using Wikipedia-based Explicit Semantic Analysis. In IJCAI (ESA)
• Gale, W. A., Church, K. W. 1993. A Program for Aligning Sentences in Bilingual Corpora. Computational Linguistics, 19(1): 75-102. (length-based sentence alignment)
• Grefenstette, G. (1999). The World Wide Web as a resource for example-based machine translation tasks, In Proc. ASLIB translating and the computer 21 conference. (selection of translation terms)
• Jin, Rong, Hauptmann, A.G., and Zhai, CX. (2002). Title Language Model for Information Retrieval. In Proceedings of SIGIR Conf., pp. 42-48 (title language model)
• Kraaij, W., Nie, J.Y., and Simard, M. (2003). Embedding Web-Based Statistical Translation Models in Cross-Language Information Retrieval.
Computational Linguistics, 29(3): 381-420. (CLIR using language modeling and Web-mined parallel corpora)
• Liu, Y., Jin, R. and Chai, Joyce Y. (2005). A maximum coherence model for dictionary-based cross-language information retrieval. In Proceedings of SIGIR Conf., pp. 536-543. (translation selection)
• J. Scott McCarley, Should we Translate the Documents or the Queries in Cross-language Information Retrieval? 1999, ACL, pp. 208-214. (Document translation vs. query translation)
• Nie, J.Y., Simard, M., Isabelle, P., Durand, R. (1999) Cross-Language Information Retrieval based on Parallel Texts and Automatic Mining of Parallel Texts in the Web, In Proceedings of SIGIR Conf., pp. 74-81 (CLIR using Web-mined parallel pages)
• Pirkola, A. (1998) The effects of query structure and dictionary setups in dictionary-based cross-language information retrieval.” In Proceedings of SIGIR Conf., pp. 55-63. (Structured query translation)
• Pirkola, A., Toivonen, J., Keskustalo, H., Visala, K., Järvelin, K. (2003). Fuzzy translation of cross-lingual spelling variants. In Proceedings of SIGIR Conf., pp. 345-352. (fuzzy match)
• Douglas W. Oard and Paul Hackett, “Document translation for the cross-language text retrieval at the university of Maryland”, in the Sixth Text
REtrieval Conference (TREC-6). (Document translation vs. Query translation)
• Riezler, S., and Liu, Y. 2010. Query rewriting using monolingual statistical machine translation. Computational Linguistics, 36(3): 569-582. (SMT for query rewriting)
• Seo, H.-C., Kim, S.-B., Rim, H.-C. and Myaeng S.-H., (2005) Improving query translation in English-Korean cross-language information retrieval, Information Processing and Management, 41: 507-522. (translation selection)
• Sheridan, P. and Ballerini, J. P. (1996). Experiments in multilingual information
retrieval using the SPIDER system. In Proceedings of SIGIR Conf., pp. 58-65. (CLIR using comparable corpora)
• Valentin I. Spitkovsky and Angel X. Chang. 2012. A Cross-Lingual Dictionary for English Wikipedia Concepts. In Proceedings of the Eighth International
Conference on Language Resources and Evaluation (LREC 2012) (ESA for CLIR)
• Wen, J., Nie, J-Y., and Zhang, H. 2002. Query clustering using user logs. ACM TOIS, 20(1): 59-81. (Query clustering based on query logs)
• Yiming Yang, Jaime G. Carbonell, Ralf D. Brown, and Robert E. Frederking,
“Translingual information retrieval: Learning from bilingual corpora”, Artificial Intelligence, vol.103, pp. 323–345, 1998. (Pseudo-RF for CLIR)