Dec 14, 2015

SINAI-GIR

A Multilingual Geographical IR System

University of Jaén (Spain)

José Manuel Perea Ortega

CLEF 2008, 18 September, Aarhus (Denmark)

Computer Science Department

Introduction

• Preliminary work of SINAI in GeoCLEF:
  – 2006: query expansion using gazetteers and a thesaurus [García-Vega et al., 2007]
  – 2007: filtering documents based on manual rules [Perea-Ortega et al., 2007]
• GeoCLEF 2008:
  – Filtering documents using new manual rules and new approaches (query reformulation, keyword and hyponym extraction, query geo-expansion)

GeoCLEF 2008, Aarhus

SINAI-GIR System overview

[Figure: block diagram of the system. Components: Translator, Query Analyzer, Collection Preprocessing subsystem, IR Subsystem, Validator. Data shown: multilingual query, English query (Q), queries Q1, Q2 and Q3, English collection, GeoNames, keywords and geo-information extracted, documents retrieved, final re-ranked documents retrieved.]

SINAI-GIR System overview: Translator

• Translates the queries from other languages into English
• Uses SINTRAM (SINai TRAnslation Module) [García-Cumbreras et al., 2007]
• SINTRAM works with different online machine translators

SINAI-GIR System overview: Collection Preprocessing subsystem

• Preprocessing: stemming, stopword removal, POS tagging
• The toponyms are extracted (NER)
• Two indexes are generated:
  – Locations
  – Keywords
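
The two-index idea described above can be illustrated with a minimal sketch. It is not the SINAI-GIR implementation: the tiny gazetteer and stopword list are hypothetical stand-ins for a real NER tool and GeoNames, the function names are made up for illustration, and stemming and POS tagging are left out.

```python
import re
from collections import defaultdict

# Hypothetical stand-ins for a real NER tool / gazetteer and stopword list.
TOY_GAZETTEER = {"aarhus", "denmark", "madrid", "spain"}
STOPWORDS = {"the", "a", "in", "of", "and", "is", "was"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_indexes(docs):
    """Return (locations_index, keywords_index); each maps a term to a set of doc ids."""
    locations = defaultdict(set)
    keywords = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            if token in TOY_GAZETTEER:
                locations[token].add(doc_id)   # toponym goes to the locations index
            elif token not in STOPWORDS:
                keywords[token].add(doc_id)    # remaining content word goes to the keywords index
    return dict(locations), dict(keywords)


docs = {
    "d1": "The conference was in Aarhus, Denmark",
    "d2": "A university in Madrid and a conference in Spain",
}
locations_index, keywords_index = build_indexes(docs)
```

Keeping toponyms apart from content words lets a later stage check geographical constraints without touching the keyword index.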

SINAI-GIR System overview: Query Analyzer

• Query preprocessing: stemming, stopword removal, removal of irrelevant information
• The toponyms are extracted (NER)
• Spatial relations finder based on manual rules
• Query reformulation based on POS tagging and the query parsing subtask
• Geo-expansion using a gazetteer
• Keywords/hyponyms detection
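
As a rough illustration of a rule-based spatial relation finder followed by gazetteer geo-expansion, consider the sketch below. The rule table, the relation labels, the gazetteer fragment and the function name are all hypothetical; the slides do not list the actual rules, and the expansion here ignores the relation type for simplicity.

```python
import re

# Hypothetical manual rules: surface pattern -> relation label.
# More specific patterns come first so they win over the generic "in".
SPATIAL_RULES = [
    (r"\bin the north of\b", "north_of"),
    (r"\bnear\b", "near"),
    (r"\bin\b", "in"),
]

# Hypothetical gazetteer fragment: place -> places it contains.
TOY_GAZETTEER = {
    "spain": ["madrid", "jaen"],
    "denmark": ["aarhus", "copenhagen"],
}


def analyze_query(query):
    """Find the first (relation, place) pair and expand the place via the gazetteer."""
    text = query.lower()
    for pattern, relation in SPATIAL_RULES:
        # The word right after the spatial expression is the candidate toponym.
        for match in re.finditer(pattern + r"\s+(\w+)", text):
            place = match.group(1)
            if place in TOY_GAZETTEER:
                return {
                    "relation": relation,
                    "place": place,
                    "expansion": TOY_GAZETTEER[place],
                }
    return {"relation": None, "place": None, "expansion": []}


result = analyze_query("Wine regions in the north of Spain")
```

The expanded place names could then be appended to the query text, which is one plausible reading of "geo-expansion".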

SINAI-GIR System overview: IR Subsystem

• Lemur as the indexing and search engine
• Okapi with pseudo-relevance feedback (PRF) as the weighting function
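
For readers unfamiliar with Okapi weighting, here is a minimal sketch of a BM25-style scoring function. It is a textbook variant, not Lemur's implementation: the parameter defaults `k1=1.2` and `b=0.75`, the smoothed (non-negative) IDF and the function name are assumptions, and the PRF step is not shown.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query with a BM25-style formula.

    docs maps doc id -> list of tokens. k1 and b are assumed default values.
    """
    n_docs = len(docs)
    avg_len = sum(len(tokens) for tokens in docs.values()) / n_docs

    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))

    scores = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            # Smoothed IDF variant that never goes negative.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[doc_id] = score
    return scores


docs = {
    "d1": ["flood", "denmark", "flood"],
    "d2": ["wine", "spain"],
    "d3": ["flood", "spain"],
}
scores = bm25_scores(["flood", "denmark"], docs)
```

In this toy collection `d1` scores highest because it contains both query terms, while `d2` scores zero.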

SINAI-GIR System overview: Validator

• Filters the list of documents retrieved by the IR subsystem, applying different manual rules and using the geographical data detected in the query
• Re-ranks the documents using predefined weights for each rule and the keywords/hyponyms detected in the query
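
A filter-then-re-rank step of this kind might look like the sketch below. The single filtering rule (keep a document only if it mentions a location detected in the query), the weights `place_weight` and `keyword_weight`, and the function name are hypothetical; the slides do not give the actual rules or weight values.

```python
def validate(retrieved, query_places, query_keywords,
             place_weight=0.5, keyword_weight=0.1):
    """Filter and re-rank retrieved documents.

    retrieved: list of (doc_id, ir_score, doc_places, doc_terms) tuples,
    where doc_places and doc_terms are sets. The weights are assumed values.
    """
    reranked = []
    for doc_id, ir_score, doc_places, doc_terms in retrieved:
        matched_places = query_places & doc_places
        if not matched_places:
            continue  # hypothetical filter rule: no matching location, drop the document
        bonus = (place_weight * len(matched_places)
                 + keyword_weight * len(query_keywords & doc_terms))
        reranked.append((doc_id, ir_score + bonus))
    # Highest adjusted score first.
    reranked.sort(key=lambda item: item[1], reverse=True)
    return reranked


retrieved = [
    ("d1", 2.0, {"spain"}, {"wine"}),
    ("d2", 1.8, {"spain", "madrid"}, {"wine", "region"}),
    ("d3", 3.0, {"denmark"}, {"wine"}),
]
final = validate(retrieved, {"spain", "madrid"}, {"wine", "region"})
```

Here `d3` is dropped despite its high IR score because it mentions none of the query locations, and `d2` overtakes `d1` thanks to its extra location and keyword matches.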

Experiments description

• SINAI participated in the monolingual and bilingual tasks with a total of 15 experiments:
  – MONO-EN: 9 experiments
  – BILI-X2EN: 6 experiments
• Combining the content of the topic labels: TD (title and description) or TDN (title, description and narrative)
• Baseline: Q1 without applying any filtering or re-ranking process
• Other experiments:
  – Filtering and re-ranking of the fused list of the documents retrieved by Q1, Q2 and Q3
  – Using keywords and/or hyponyms in the re-ranking process

MONO-EN results

• Best result: baseline (no filtering and no re-ranking)
• In some filtering experiments the use of keywords improves the results
• Best results using only the TD topic labels

BILI-X2EN results

• Best result: baseline (no filtering and no re-ranking) with Portuguese topics
• Best results using only the TD topic labels

Conclusions

• The baseline experiment seems to work well because we include the geo-information in the retrieval process

• The filtering of documents does not seem to work well because we include the geo-information in the query and we are re-ranking documents that may not be relevant with respect to their content

• The use of keywords for re-ranking the retrieved documents could be worthwhile, because in some experiments it improves on the results obtained without them

• Query reformulation could also be worthwhile, because for some topics it retrieves valid documents that the default query does not retrieve

TextMESS at GeoCLEF 2008

• Spanish TextMESS project (Intelligent, Interactive and Multilingual Text Mining based on Human Language Technologies): joint participation by the Polytechnic University of Valencia and the University of Jaén (SINAI)

• Method employed: merging algorithm based on a fuzzy Borda voting scheme, taking as input the two document lists returned by both systems

• Second best result in the monolingual English task
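
To make the merging idea concrete, the sketch below implements one common formulation of a fuzzy Borda count, which may differ from the one used by TextMESS. Each system acts as a voter; for every pair of documents it expresses a preference intensity r_ij = w_i / (w_i + w_j) from its retrieval weights, and a document collects only the intensities above 0.5. Treating a document missing from a list as having weight 0 is an assumption of this sketch, as is the function name.

```python
def fuzzy_borda(weight_lists):
    """Merge several result lists with a fuzzy Borda count.

    weight_lists: one dict per system, mapping doc id -> positive retrieval weight.
    Documents absent from a list are treated as weight 0 (an assumption).
    Returns (doc_id, score) pairs sorted by descending score, ties broken by doc id.
    """
    candidates = set().union(*weight_lists)
    totals = {doc: 0.0 for doc in candidates}
    for weights in weight_lists:
        for doc_i in candidates:
            w_i = weights.get(doc_i, 0.0)
            for doc_j in candidates:
                if doc_i == doc_j:
                    continue
                w_j = weights.get(doc_j, 0.0)
                if w_i + w_j == 0:
                    continue  # this system has no opinion on the pair
                r_ij = w_i / (w_i + w_j)  # preference intensity of doc_i over doc_j
                if r_ij > 0.5:
                    totals[doc_i] += r_ij
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


system_a = {"d1": 0.9, "d2": 0.6, "d3": 0.3}
system_b = {"d2": 0.8, "d3": 0.7, "d4": 0.2}
merged = fuzzy_borda([system_a, system_b])
```

Unlike a classic Borda count, which only uses rank positions, this variant lets a large score gap between two documents count for more than a small one.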

Thank you

sinai.ujaen.es

References

– García-Vega, Manuel and García-Cumbreras, Miguel A. and Ureña-López, L. Alfonso and Perea-Ortega, José M. GEOUJA System. The first participation of the University of Jaén at GEOCLEF 2006. In LNCS, volume 4730, pages 913-917. Springer-Verlag, 2007.

– Perea-Ortega, José M. and García-Cumbreras, Miguel A. and García-Vega, Manuel and Montejo-Ráez, Arturo. GEOUJA System. University of Jaén at GEOCLEF 2007. In Proceedings of the Cross Language Evaluation Forum (CLEF 2007), page 52, 2007.

– García-Cumbreras, Miguel A. and Ureña-López, L. Alfonso and Martínez-Santiago, Fernando and Perea-Ortega, José M. BRUJA System. The University of Jaén at the Spanish task of QA@CLEF 2006. In LNCS, volume 4730, pages 328-338. Springer-Verlag, 2007.

http://sinai.ujaen.es