April 7, 2006 Natural Language Processing/Language Technology for the Web

Cross-Language Information Retrieval (CLIR)

Ananthakrishnan R Computer Science & Engg., IIT Bombay

(anand@cse)


Cross-Language Information Retrieval (CLIR)

“A subfield of information retrieval dealing with retrieving information written in a language different from the language of the user's query.”

E.g., Using Hindi queries to retrieve English documents

Also called multi-lingual, cross-lingual, or trans-lingual IR.


Why CLIR?

E.g., On the web, we have:

• Documents in different languages
• Multilingual documents
• Images with captions in different languages

A single query should retrieve all such resources.


Approaches to CLIR

• Query Translation (most efficient; commonly used)
   Knowledge-based: Dictionary/Thesaurus-based
   Corpus-based: Pseudo-Relevance Feedback (PRF)

• Document Translation (infeasible for large collections)
   Knowledge-based: MT (rule-based)
   Corpus-based: MT (EBMT/StatMT)

• Intermediate Representation
   Knowledge-based: UNL (AgroExplorer)
   Corpus-based: Latent Semantic Indexing

The most effective approaches are hybrid: a combination of knowledge-based and corpus-based methods.


Dictionary-based Query Translation

Hindi query: आयरलैंड शांति वार्ता
→ Hindi-English dictionaries
→ English query: Ireland peace talks
→ search → Collection

• phrase identification
• words to be transliterated


The problem with dictionary-based CLIR -- ambiguity

अंतरिक्षीय घटना
   अंतरिक्षीय: cosmic, outer-space
   घटना: incident, event, occurrence, lessen, subside, decrease, lower, diminish, ebb, decline, reduce

जाली धन
   जाली: lattice, mesh, net, wire_netting, meshed_fabric, counterfeit, forged, false, fabricated, small_net, network, gauze, grating, sieve
   धन: money, riches, wealth, appositive, property

आयरलैंड शांति वार्ता
   आयरलैंड: Ireland
   शांति: peace, calm, tranquility, silence, quietude
   वार्ता: conversation, talk, negotiation, tale


… filtering/disambiguation is required after query translation.


Disambiguation using co-occurrence statistics

Hypothesis: correct translations of query terms will co-occur and incorrect translations will tend not to co-occur


Problem with counting co-occurrences: data sparsity

freq(Marathi Shallow Parsing CRFs)
freq(Marathi Shallow Structuring CRFs)
freq(Marathi Shallow Analyzing CRFs)

… are all zero.

How do we choose between parsing, structuring, and analyzing?


Pair-wise co-occurrence

अंतरिक्षीय घटना
   अंतरिक्षीय: cosmic, outer-space
   घटना: incident, event, occurrence, lessen, subside, decrease, lower, diminish, ebb, decline, reduce

freq(cosmic incident)          70800
freq(cosmic event)            269000
freq(cosmic lessen)             7130
freq(cosmic subside)            3120
freq(outer-space incident)     26100
freq(outer-space event)       104000
freq(outer-space lessen)        2600
freq(outer-space subside)        980


Shallow Parsing, Structuring or Analyzing?

shallow parsing          166000
shallow structuring      180000
shallow analyzing       1230000

CRFs parsing                540
CRFs structuring            125
CRFs analyzing              765

Marathi parsing           17100
Marathi structuring         511
Marathi analyzing         12200

“shallow parsing”         40700
“shallow structuring”        11
“shallow analyzing”           2

collocation?

But,

analyzing              74100000
parsing                40400000
structuring            17400000
shallow                33300000


Ranking senses using co-occurrence statistics

Use co-occurrence scores to calculate similarity between two words: sim(x, y)

• Point-wise mutual information (PMI)
• Dice coefficient
• PMI-IR

\[ \text{PMI-IR}(x, y) = \log \frac{\text{hits}(x \text{ AND } y)}{\text{hits}(x)\,\text{hits}(y)} \]
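As a minimal sketch of the PMI-IR score, the function below takes three hit counts and returns the log ratio. The counts are the ones quoted on the earlier slide for "shallow" paired with each candidate word; the names `pmi_ir`, `HITS`, and `PAIR_HITS`, and the choice to return negative infinity for a zero count, are assumptions made for illustration.

```python
import math

def pmi_ir(hits_x_and_y: int, hits_x: int, hits_y: int) -> float:
    """PMI-IR(x, y) = log( hits(x AND y) / (hits(x) * hits(y)) )."""
    if min(hits_x_and_y, hits_x, hits_y) <= 0:
        # Sketch-level choice: treat a zero count as "no evidence".
        return float("-inf")
    return math.log(hits_x_and_y / (hits_x * hits_y))

# Single-word hit counts quoted on the slides.
HITS = {"shallow": 33_300_000, "parsing": 40_400_000,
        "structuring": 17_400_000, "analyzing": 74_100_000}
# Hit counts for "shallow" together with each candidate (unquoted queries).
PAIR_HITS = {"parsing": 166_000, "structuring": 180_000, "analyzing": 1_230_000}

scores = {w: pmi_ir(PAIR_HITS[w], HITS["shallow"], HITS[w]) for w in PAIR_HITS}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

With these particular counts the pair (shallow, analyzing) scores highest on its own, so a single pair is not enough to select "parsing"; evidence from the other query terms has to be combined as well.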


Disambiguation algorithm

User's query: \( q^s = \{ q^s_1, q^s_2, \ldots, q^s_m \} \)

For each \( q^s_i \), the set of translations: \( S_i = \{ w^t_{i,j} \} \)

1. \( sim(w^t_{i,j}, S_{i'}) = \sum_{w^t_{i',l} \in S_{i'}} sim(w^t_{i,j}, w^t_{i',l}) \)

2. \( score(w^t_{i,j}) = \sum_{i' \neq i} sim(w^t_{i,j}, S_{i'}) \)

3. \( q^t_i = \arg\max_{w^t_{i,j}} score(w^t_{i,j}) \)

Translated query: \( q^t = \{ q^t_1, q^t_2, \ldots, q^t_m \} \)
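The three steps of the disambiguation algorithm can be sketched as follows. The function `disambiguate` accepts any similarity function. As a stand-in, `raw_count_sim` uses the raw pair-wise counts quoted on the "Pair-wise co-occurrence" slide, because the single-word hit counts needed for PMI-IR are not given for these words. That substitution, the function names, and the reduced candidate list are assumptions for illustration.

```python
from typing import Callable, List

# Pair-wise counts quoted on the "Pair-wise co-occurrence" slide.
PAIR_FREQ = {
    ("cosmic", "incident"): 70800, ("cosmic", "event"): 269000,
    ("cosmic", "lessen"): 7130, ("cosmic", "subside"): 3120,
    ("outer-space", "incident"): 26100, ("outer-space", "event"): 104000,
    ("outer-space", "lessen"): 2600, ("outer-space", "subside"): 980,
}

def raw_count_sim(x: str, y: str) -> float:
    """Assumption: raw pair counts stand in for sim(x, y); order-insensitive."""
    return float(PAIR_FREQ.get((x, y), PAIR_FREQ.get((y, x), 0)))

def disambiguate(translation_sets: List[List[str]],
                 sim: Callable[[str, str], float]) -> List[str]:
    """Pick, for each source term, the translation with the highest score."""
    chosen = []
    for i, candidates in enumerate(translation_sets):
        def score(w: str) -> float:
            # Steps 1 and 2: sum sim(w, w') over every translation w'
            # of every *other* source term.
            return sum(sim(w, other)
                       for k, other_set in enumerate(translation_sets) if k != i
                       for other in other_set)
        # Step 3: keep the best-scoring candidate for source term i.
        chosen.append(max(candidates, key=score))
    return chosen

translated = disambiguate(
    [["cosmic", "outer-space"], ["incident", "event", "lessen", "subside"]],
    raw_count_sim,
)
print(translated)
```

Only candidates with quoted counts are included; swapping `raw_count_sim` for a PMI-IR or Dice function would leave `disambiguate` unchanged.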


Example

अंतरिक्षीय घटना
   अंतरिक्षीय: cosmic, outer-space
   घटना: incident, event, lessen, subside, decrease, lower, diminish, ebb, decline, reduce

score(cosmic) = PMI-IR(cosmic, incident) + PMI-IR(cosmic, event) + PMI-IR(cosmic, lessen) + PMI-IR(cosmic, subside) + …


Disambiguation algorithm: sample outputs

आयरलैंड शांति वार्ता → Ireland peace talks

अंतरिक्षीय घटना → cosmic events

जाली धन → net money (?)


Results on TREC8 (disks 4 and 5)

• English topics (401-450) manually translated to Hindi
• Assumption: relevance judgments for English topics hold for the translated queries

Results (all TF-IDF):

Technique                    MAP
Monolingual                  23
All-translations             16
PMI-based disambiguation     20.5
Manual filtering             21.5


Pseudo-Relevance Feedback for CLIR


(User) Relevance Feedback (mono-lingual)

1. Retrieve documents using the user’s query

2. The user marks relevant documents

3. Choose the top N terms from these documents
   • Top terms: IDF is one option for scoring

4. Add these N terms to the user’s query to form a new query

5. Use this new query to retrieve a new set of documents


Pseudo-Relevance Feedback (PRF) (mono-lingual)

1. Retrieve documents using the user’s query

2. Assume that the top M documents retrieved are relevant

3. Choose the top N terms from these M documents

4. Add these N terms to the user’s query to form a new query

5. Use this new query to retrieve a new set of documents
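The five PRF steps can be sketched on a toy collection as below. Everything here is illustrative: the `rank` function is a bare term-count ranker standing in for a real retrieval engine, terms are scored by TF-IDF (one possible choice), and the documents in `DOCS` are invented.

```python
import math
from collections import Counter
from typing import Iterable, List

def rank(query: Iterable[str], docs: List[List[str]]) -> List[int]:
    """Toy ranker: order documents by how often they contain query terms."""
    terms = set(query)
    scored = [(sum(doc.count(t) for t in terms), i) for i, doc in enumerate(docs)]
    return [i for s, i in sorted(scored, key=lambda p: (-p[0], p[1])) if s > 0]

def top_terms(doc_ids: List[int], docs: List[List[str]], n: int,
              exclude: Iterable[str] = ()) -> List[str]:
    """Top n terms of the feedback documents by TF-IDF (ties broken by name)."""
    skip = set(exclude)
    df = Counter(t for doc in docs for t in set(doc))
    tf = Counter(t for i in doc_ids for t in docs[i])
    weight = {t: tf[t] * math.log(len(docs) / df[t]) for t in tf if t not in skip}
    return sorted(weight, key=lambda t: (-weight[t], t))[:n]

def expand_query(query: List[str], docs: List[List[str]], m: int, n: int) -> List[str]:
    feedback = rank(query, docs)[:m]                    # steps 1-2: top M assumed relevant
    new_terms = top_terms(feedback, docs, n, query)     # step 3: top N terms
    return list(query) + new_terms                      # step 4: new query

# Hypothetical toy collection (tokenised documents).
DOCS = [
    ["ireland", "peace", "talks", "belfast"],
    ["peace", "talks", "belfast", "agreement"],
    ["cricket", "match", "score"],
    ["cricket", "world", "cup"],
    ["football", "league", "table"],
]

expanded = expand_query(["ireland", "peace"], DOCS, m=2, n=2)
final_ranking = rank(expanded, DOCS)                    # step 5: retrieve again
print(expanded, final_ranking)
```

Terms already in the query are excluded from expansion so that the N added terms are all new.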


PRF for CLIR: Corpus-based Query Translation

Uses a parallel corpus of documents:

Hindi collection H     English collection E
H1                     E1
H2                     E2
.                      .
.                      .
.                      .
Hm                     Em


PRF for CLIR

1. Retrieve documents in H using the user’s query

2. Assume that the top M documents retrieved are relevant

3. Select the M documents in E that are aligned to the top M retrieved documents

4. Choose the top N terms from these documents

5. These N terms are the translated query

6. Use this query to retrieve from the target collection (which is in the same language as E)
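A minimal sketch of steps 1 to 5 on a toy parallel corpus is shown below. The ranker is a plain term-count stand-in for a real retrieval engine, terms are scored with TF-IDF as one possible choice, and the romanised Hindi tokens and all documents are invented for illustration.

```python
import math
from collections import Counter
from typing import Iterable, List

def rank(query: Iterable[str], docs: List[List[str]]) -> List[int]:
    """Toy ranker: order documents by how often they contain query terms."""
    terms = set(query)
    scored = [(sum(doc.count(t) for t in terms), i) for i, doc in enumerate(docs)]
    return [i for s, i in sorted(scored, key=lambda p: (-p[0], p[1])) if s > 0]

def translate_query(query: List[str], hindi_docs: List[List[str]],
                    english_docs: List[List[str]], m: int, n: int) -> List[str]:
    """Translate a Hindi query via the aligned English documents."""
    top = rank(query, hindi_docs)[:m]           # steps 1-2: top M in H assumed relevant
    # Step 3: document i in E is aligned with document i in H.
    df = Counter(t for doc in english_docs for t in set(doc))
    tf = Counter(t for i in top for t in english_docs[i])
    weight = {t: tf[t] * math.log(len(english_docs) / df[t]) for t in tf}
    # Steps 4-5: the top N English terms form the translated query.
    return sorted(weight, key=lambda t: (-weight[t], t))[:n]

# Hypothetical parallel corpus: H[i] and E[i] are translations of each other.
H = [["ayarlaind", "shanti", "varta"], ["ayarlaind", "shanti", "samjhauta"],
     ["kriket", "maich"], ["futbol", "lig"], ["mausam", "barish"]]
E = [["ireland", "peace", "talks"], ["ireland", "peace", "agreement"],
     ["cricket", "match"], ["football", "league"], ["weather", "rain"]]

english_query = translate_query(["ayarlaind", "shanti"], H, E, m=2, n=2)
print(english_query)
```

Step 6 would pass `english_query` to whatever retrieval system indexes the target collection.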


Cross-Lingual Relevance Models - Estimate relevance models using a parallel corpus


Ranking with Relevance Models

Relevance model or query model \( R \) (the distribution encodes the information need):

• \( P(w|R) \): probability of word occurrence in a relevant document
• \( P(w|D) \): probability of word occurrence in the candidate document

Ranking function (relative entropy or KL divergence):

\[ KL(D \,\|\, R) = \sum_{w} P(w|D) \, \log \frac{P(w|D)}{P(w|R)} \]
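To illustrate the ranking function, the sketch below computes KL(D || R) for a few toy document models and sorts them so that the smallest divergence comes first. The distributions are invented, and returning infinity when the relevance model assigns zero probability to a word the document uses is a sketch-level choice; a smoothed model would avoid that case.

```python
import math
from typing import Dict, List, Tuple

def kl_divergence(doc_model: Dict[str, float], rel_model: Dict[str, float]) -> float:
    """KL(D || R) = sum_w P(w|D) * log( P(w|D) / P(w|R) )."""
    total = 0.0
    for w, p_d in doc_model.items():
        if p_d == 0.0:
            continue                      # 0 * log(0 / x) is taken as 0
        p_r = rel_model.get(w, 0.0)
        if p_r == 0.0:
            return math.inf               # unsmoothed R cannot explain w
        total += p_d * math.log(p_d / p_r)
    return total

def rank_by_kl(doc_models: Dict[str, Dict[str, float]],
               rel_model: Dict[str, float]) -> List[Tuple[str, float]]:
    """Documents closest to the relevance model (lowest KL) come first."""
    scored = [(name, kl_divergence(m, rel_model)) for name, m in doc_models.items()]
    return sorted(scored, key=lambda pair: pair[1])

# Hypothetical models over a two-word vocabulary.
R = {"peace": 0.5, "talks": 0.5}
DOC_MODELS = {
    "d1": {"peace": 0.9, "talks": 0.1},
    "d2": {"peace": 0.5, "talks": 0.5},
}

ranking = rank_by_kl(DOC_MODELS, R)
print([name for name, _ in ranking])
```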


Estimating Mono-Lingual Relevance Models

\[ P(w|R) \approx P(w|Q) = P(w \mid h_1 h_2 \ldots h_m) = \frac{P(w, h_1 h_2 \ldots h_m)}{P(h_1 h_2 \ldots h_m)} \]

\[ P(w, h_1 h_2 \ldots h_m) = \sum_{M} P(M) \, P(w|M) \prod_{i=1}^{m} P(h_i|M) \]
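The estimate can be sketched in a few lines. The function below sums over a small set of document models M, assumes a uniform prior P(M) unless one is supplied, and obtains the denominator P(h_1 ... h_m) by summing the joint over the vocabulary, which is valid when each model sums to one over that vocabulary. The models and names are invented for illustration.

```python
import math
from typing import Dict, List, Optional

def relevance_model(query: List[str],
                    doc_models: Dict[str, Dict[str, float]],
                    priors: Optional[Dict[str, float]] = None) -> Dict[str, float]:
    """Estimate P(w | h_1 ... h_m) from a set of document models."""
    names = list(doc_models)
    if priors is None:
        # Assumption: uniform prior P(M) over the models.
        priors = {name: 1.0 / len(names) for name in names}
    vocab = {w for model in doc_models.values() for w in model}

    joint = {}
    for w in vocab:
        # P(w, h_1..h_m) = sum_M P(M) * P(w|M) * prod_i P(h_i|M)
        joint[w] = sum(
            priors[name]
            * doc_models[name].get(w, 0.0)
            * math.prod(doc_models[name].get(h, 0.0) for h in query)
            for name in names
        )

    evidence = sum(joint.values())        # equals P(h_1..h_m)
    if evidence == 0.0:
        raise ValueError("no model assigns non-zero probability to the query")
    return {w: p / evidence for w, p in joint.items()}

# Hypothetical unsmoothed document models.
MODELS = {
    "M1": {"peace": 0.5, "talks": 0.5},
    "M2": {"talks": 0.5, "cricket": 0.5},
}

p_w_given_q = relevance_model(["peace"], MODELS)
print(p_w_given_q)
```

Only M1 contains the query word here, so all probability mass comes from M1 and "cricket" receives zero; smoothing the models would spread some mass to it.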


Estimating Cross-Lingual Relevance Models

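\[ P(w, h_1 h_2 \ldots h_m) = \sum_{\{M_H, M_E\}} P(\{M_H, M_E\}) \, P(w|M_E) \prod_{i=1}^{m} P(h_i|M_H) \]

\[ P(w|M_X) = \lambda \, \frac{freq_{w,X}}{\sum_{v} freq_{v,X}} + (1 - \lambda) \, P(w) \]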
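A minimal sketch of the cross-lingual estimate follows. Each aligned pair of documents supplies a Hindi model and an English model, both smoothed by interpolating the document's relative frequency with a background probability. The uniform prior over pairs, the interpolation weight of 0.5, the romanised Hindi tokens, and the toy corpus are all assumptions for illustration.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

def smoothed_prob(w: str, doc: List[str], background: Dict[str, float],
                  lam: float) -> float:
    """P(w|M_X) = lam * freq(w, X) / sum_v freq(v, X) + (1 - lam) * P(w)."""
    freq = Counter(doc)
    return lam * freq[w] / len(doc) + (1.0 - lam) * background.get(w, 0.0)

def cross_lingual_relevance_model(hindi_query: List[str],
                                  pairs: List[Tuple[List[str], List[str]]],
                                  bg_h: Dict[str, float],
                                  bg_e: Dict[str, float],
                                  lam: float = 0.5) -> Dict[str, float]:
    """Estimate P(w | h_1..h_m) for English words w from aligned pairs."""
    prior = 1.0 / len(pairs)              # assumption: uniform P({M_H, M_E})
    joint = {}
    for w in bg_e:                        # English vocabulary = background keys
        joint[w] = sum(
            prior
            * smoothed_prob(w, e_doc, bg_e, lam)
            * math.prod(smoothed_prob(h, h_doc, bg_h, lam) for h in hindi_query)
            for h_doc, e_doc in pairs
        )
    evidence = sum(joint.values())
    if evidence == 0.0:
        raise ValueError("query has zero probability under every pair")
    return {w: p / evidence for w, p in joint.items()}

# Hypothetical parallel corpus and uniform background models.
PAIRS = [(["shanti", "varta"], ["peace", "talks"]),
         (["kriket", "maich"], ["cricket", "match"])]
BG_H = {w: 0.25 for w in ["shanti", "varta", "kriket", "maich"]}
BG_E = {w: 0.25 for w in ["peace", "talks", "cricket", "match"]}

model = cross_lingual_relevance_model(["shanti"], PAIRS, BG_H, BG_E)
print(model)
```

The Hindi query never needs to be translated word by word: English words gain probability simply by appearing in documents aligned with Hindi documents that match the query.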


CLIR Evaluation – TREC (Text REtrieval Conference)

TREC CLIR track (2001 and 2002)

• Retrieval of Arabic language newswire documents from topics in English
• 383,872 Arabic documents (896 MB) with SGML markup
• 50 topics
• Use of provided resources (stemmers, bilingual dictionaries, MT systems, parallel corpora) is encouraged to minimize variability

http://trec.nist.gov/


CLIR Evaluation – CLEF (Cross Language Evaluation Forum)

• Major CLIR evaluation forum
• Tracks include
   • Multilingual retrieval on news collections (topics will be provided in many languages including Hindi)
   • Multiple language Question Answering
   • ImageCLEF
   • Cross Language Speech Retrieval
   • WebCLEF

http://www.clef-campaign.org/


Summary

• CLIR techniques
   • Query Translation-based
   • Document Translation-based
   • Intermediate Representation-based
• Query translation using dictionaries, followed by disambiguation, is a simple and effective technique for CLIR
• PRF uses a parallel corpus for query translation
• Parallel corpora can also be used to estimate cross-lingual relevance models
• CLEF and TREC: important CLIR evaluation conferences




Thank You