Search in Transliterated Space Shared Task Proposal, FIRE 2012 Monojit Choudhury Microsoft Research Lab India


Feb 24, 2016

Transcript
Page 1: Search in Transliterated Space

Search in Transliterated Space

Shared Task Proposal, FIRE 2012

Monojit Choudhury, Microsoft Research Lab India

Page 2: Search in Transliterated Space

A Transliterated World Wide Web

Song Lyrics

Page 3: Search in Transliterated Space

A Transliterated World Wide Web

Reviews and Forums

Page 4: Search in Transliterated Space

A Transliterated World Wide Web

Facebook and Twitter

Page 5: Search in Transliterated Space

A Transliterated World Wide Web

And a lot more

Page 6: Search in Transliterated Space

Beyond Indic languages

Many languages that use non-Roman script:

Arabic (Saudi Arabia, UAE, Egypt, Morocco, …)
Persian
Indian sub-continental languages (IL & Dzongkha, Nepalese, Sinhala)
Thai, Vietnamese
Cyrillic (Russian, Ukrainian)
Chinese, Japanese, Korean (rare)

Page 7: Search in Transliterated Space

Aspects of Transliterated Text

Code Mixing

Transliteration

Errors, Contraction

Page 8: Search in Transliterated Space

IR Scenario - I

Mono-script Monolingual IR in transliterated space

Query: thandee hava yeh chandni suhanee

Results: Only Roman transliterated documents

Challenge: Spelling variations, e.g. tandee hawa ye chandny soohaany
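One rough way to picture the spelling-variation challenge is to map each Roman word to a spelling-invariant key before matching. The sketch below is only an illustration: the rewrite rules (dropping an aspiration "h", w to v, collapsing long vowels) and the function names are assumptions chosen to cover the example on this slide, not a method proposed in the talk.

```python
import re


def normalize_word(word: str) -> str:
    """Map a Roman-transliterated word to a rough spelling-invariant key.

    The rules are hypothetical and tuned only to the slide's example.
    """
    w = word.lower()
    w = re.sub(r"(?<=[bcdgjkpst])h", "", w)  # drop "h" used as an aspiration marker
    w = re.sub(r"h$", "", w)                 # drop a word-final "h" (yeh -> ye)
    w = w.replace("w", "v")                  # hawa -> hava
    w = re.sub(r"y$", "i", w)                # chandny -> chandni
    for long_vowel, short in (("ee", "i"), ("oo", "u"), ("aa", "a")):
        w = w.replace(long_vowel, short)     # collapse long-vowel spellings
    return w


def normalize_query(query: str) -> str:
    """Normalize every whitespace-separated token of a query."""
    return " ".join(normalize_word(tok) for tok in query.split())


a = normalize_query("thandee hava yeh chandni suhanee")
b = normalize_query("tandee hawa ye chandny soohaany")
print(a)
print(b)
print(a == b)
```

Hand-written rules like these over-merge unrelated words, so a real system would more likely learn the variation from word pairs or use a fuzzy match.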

Page 9: Search in Transliterated Space

IR Scenario - II

Cross-script and Multi-script Monolingual IR in transliterated space

Query: thandee hava yeh chandni OR ठंडी हवा ये चाँदनी

Results: Both Roman transliterated or in native script

Challenge: Transliteration
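A minimal sketch of cross-script matching, assuming a native-to-Roman word-pair table is available: both the query and the documents are mapped into one script before comparison. The tiny table, the document list, and the function names below are hypothetical and exist only for illustration; the Roman spellings are taken from the query on this slide.

```python
import unicodedata

# Hypothetical word-pair table (Devanagari -> Roman), for illustration only.
_PAIRS = {"ठंडी": "thandee", "हवा": "hava", "ये": "yeh", "चाँदनी": "chandni"}
NATIVE_TO_ROMAN = {unicodedata.normalize("NFC", k): v for k, v in _PAIRS.items()}


def to_roman_tokens(text: str) -> list[str]:
    """Map each token to Roman script; tokens not in the table are lowercased."""
    tokens = unicodedata.normalize("NFC", text).split()
    return [NATIVE_TO_ROMAN.get(tok, tok.lower()) for tok in tokens]


def search(query: str, documents: list[str]) -> list[str]:
    """Return documents containing every query token, regardless of script."""
    wanted = set(to_roman_tokens(query))
    return [doc for doc in documents if wanted <= set(to_roman_tokens(doc))]


# Toy collection with one Roman, one Devanagari, and one unrelated document.
DOCS = [
    "thandee hava yeh chandni suhanee",
    "ठंडी हवा ये चाँदनी",
    "some other lyric",
]

hits_roman = search("thandee hava yeh chandni", DOCS)
hits_native = search("ठंडी हवा ये चाँदनी", DOCS)
print(hits_roman == hits_native)
```

A lookup table only covers words it has seen, which is why transliteration itself is named as the challenge here.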

Page 10: Search in Transliterated Space

IR Scenario - III

Cross-script and Cross-lingual IR

Query: death of mareech and subahoo

Document: Hindi (Transliterated and Devanagari) and English documents

Page 11: Search in Transliterated Space

Shared Task on Retrieval

Mono-script Monolingual IR
Transliterated query in Roman
Transliterated documents in Roman

Cross-script Monolingual IR
Transliterated query in Roman
Transliterated documents in native script

Multi-script Monolingual IR
Query in Roman or native script
Documents in Roman and native scripts

Page 12: Search in Transliterated Space

Shared Sub-Tasks

Language identification of transliterated queries, documents, code-mixed text

kooda kazhikkan oru urgan split pea soup undaki
ML ML ML ML EN EN EN ML

Transliteration

Forward: കഴിക്കാന്‍ → kazhikkan

Backward: kazhikkan → കഴിക്കാന്‍
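The two sub-tasks can be pictured with a small sketch. It assumes a dictionary-lookup baseline: a token is tagged EN if it appears in an English word list and ML otherwise, and transliteration is a lookup in a word-pair table. The word list, the pair table, and all names are hypothetical toy data built from the example on this slide, not resources or methods from the shared task.

```python
# Hypothetical toy English lexicon; a real one would hold many thousands of words.
ENGLISH_WORDS = {"split", "pea", "soup"}


def tag_tokens(text: str, default_lang: str = "ML") -> list[tuple[str, str]]:
    """Tag each token EN if it is in the English lexicon, else the default language."""
    return [
        (tok, "EN" if tok.lower() in ENGLISH_WORDS else default_lang)
        for tok in text.split()
    ]


# Hypothetical pair table: (native script, Roman), using the pair from the slide.
PAIRS = [("കഴിക്കാന്‍", "kazhikkan")]
FORWARD = dict(PAIRS)                                   # native -> Roman
BACKWARD = {roman: native for native, roman in PAIRS}   # Roman -> native

tagged = tag_tokens("kooda kazhikkan oru urgan split pea soup undaki")
tags = [lang for _, lang in tagged]
print(" ".join(tags))
print(FORWARD[PAIRS[0][0]])
```

A lexicon lookup fails on words shared by both languages and on unseen spellings, which is part of what makes the sub-task non-trivial.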

Page 13: Search in Transliterated Space

Available Data

20000 word pairs each in Bengali, Telugu, and Hindi (labeled with language tags)

35000 unique Hindi-Roman word pairs obtained from aligning Bollywood song lyrics

More data is under preparation from Facebook, covering a mixture of various languages.

Looking for partners to extend!

Page 14: Search in Transliterated Space

Available Data

Currently we have 500 relevance-judged query-URL pairs for Bollywood song lyrics

Looking for partners to extend it to other (Indian) Languages

Other domains?

Page 15: Search in Transliterated Space

Thank you!

Page 16: Search in Transliterated Space

Other resources

Lexicons
Pronunciation lexicons
G2P for some languages
Stemmers and morphological analyzers

Anything else?

Page 17: Search in Transliterated Space

Concluding Remarks

We have built a multi-script Bollywood song search and are working on transliteration and code-mixing

These are just some initial ideas that came up from our experiences

If you are interested please let me know