Search in Transliterated Space Shared Task Proposal, FIRE 2012 Monojit Choudhury Microsoft Research Lab India
Dec 24, 2015
A Transliterated World Wide Web
Song Lyrics
Reviews and Forums
Facebook and Twitter
And a lot more
Beyond Indic languages
Many languages that use non-Roman script:
Arabic (Saudi Arabia, UAE, Egypt, Morocco,…)
Persian
Indian sub-continental languages (IL & Dzongkha, Nepalese, Sinhala)
Thai, Vietnamese
Cyrillic (Russian, Ukrainian)
Chinese, Japanese, Korean (rare)
Aspects of Transliterated Text
Code Mixing
Transliteration
Errors, Contraction
IR Scenario - I
Mono-script Monolingual IR in transliterated space
Query: thandee hava yeh chandni suhanee
Results: Only Roman transliterated documents
Challenge: Spelling variations, e.g. tandee hawa ye chandny soohaany
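One way to picture the spelling-variation challenge is a normalization key that maps variant spellings to the same string before matching. The sketch below is only an illustration: the rewrite rules are hypothetical, hand-picked to cover the example query on this slide, and are not a method proposed in the talk.

```python
import re

# Hypothetical rewrite rules, chosen only to cover the slide's example.
# A real system would need language-specific rules or a learned model.
_RULES = [
    (r"th", "t"),              # thandee -> tandee
    (r"w", "v"),               # hawa -> hava
    (r"ee", "i"),              # long vowels written as doubled letters
    (r"oo", "u"),
    (r"aa", "a"),
    (r"y$", "i"),              # chandny -> chandni
    (r"(?<=[aeiou])h$", ""),   # yeh -> ye
    (r"(.)\1+", r"\1"),        # collapse any remaining repeated letters
]

def spelling_key(word):
    """Map a Roman-transliterated word to a coarse spelling key."""
    key = word.lower()
    for pattern, replacement in _RULES:
        key = re.sub(pattern, replacement, key)
    return key

def query_key(query):
    """Apply spelling_key to every whitespace-separated token."""
    return tuple(spelling_key(token) for token in query.split())

same = query_key("thandee hava yeh chandni suhanee") == query_key(
    "tandee hawa ye chandny soohaany"
)
print(same)
```

Keys like these could be indexed alongside the original tokens, so that a query and a document match when their keys agree even if the surface spellings differ.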
IR Scenario - II
Cross-script and Multi-script Monolingual IR in transliterated space
Query: thandee hava yeh chandni OR ठंडी हवा ये चाँदनी
Results: Documents in Roman transliteration or in native script
Challenge: Transliteration
IR Scenario - III
Cross-script and Cross-lingual IR
Query: death of mareech and subahoo
Documents: Hindi (transliterated and Devanagari) and English documents
Shared Task on Retrieval
Mono-script Monolingual IR: transliterated query in Roman; transliterated documents in Roman
Cross-script Monolingual IR: transliterated query in Roman; documents in native script
Multi-script Monolingual IR: query in Roman or native script; documents in Roman and native scripts
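Since the multi-script setting accepts queries in either script, a system would first need to tell which script a query uses. A minimal sketch, assuming the first word of each character's Unicode name is a good enough script label; the function names `scripts_in` and `query_setting` are made up for this illustration.

```python
import unicodedata

def scripts_in(text):
    """Collect a rough script label for every letter in the text.

    The label is the first word of the Unicode character name,
    e.g. 'LATIN' or 'DEVANAGARI'. This is a heuristic, not a full
    implementation of Unicode script properties.
    """
    scripts = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])
    return scripts

def query_setting(query):
    """Classify a query as 'roman', 'native', 'mixed', or 'unknown'."""
    scripts = scripts_in(query)
    if not scripts:
        return "unknown"
    if scripts == {"LATIN"}:
        return "roman"
    if "LATIN" not in scripts:
        return "native"
    return "mixed"

print(query_setting("thandee hava yeh chandni"))
print(query_setting("ठंडी हवा ये चाँदनी"))
```

A query that mixes both scripts comes back as "mixed", which a retrieval system could handle by searching both the Roman and the native-script index.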
Shared Sub-Tasks
Language identification of transliterated queries, documents, code-mixed text
kooda/ML kazhikkan/ML oru/ML urgan/ML split/EN pea/EN soup/EN undaki/ML
Transliteration
Forward: കഴിക്കാന് → kazhikkan
Backward: kazhikkan → കഴിക്കാന്
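The language identification sub-task can be illustrated with a deliberately naive word-level tagger. The sketch assumes a two-language setting (English and Malayalam) and a tiny hypothetical English word list; a realistic system would need a large lexicon or a trained classifier, and nothing here is the task organizers' method.

```python
# Hypothetical, tiny English word list used only for this illustration.
ENGLISH_WORDS = {"split", "pea", "soup"}

def tag_languages(sentence, english_words=ENGLISH_WORDS, other_tag="ML"):
    """Tag each token EN if it is in the English list, else other_tag.

    Assumes exactly two languages are mixed, so every non-English
    token is attributed to the other language.
    """
    return [
        (token, "EN" if token.lower() in english_words else other_tag)
        for token in sentence.split()
    ]

tagged = tag_languages("kooda kazhikkan oru urgan split pea soup undaki")
print(" ".join(f"{word}/{tag}" for word, tag in tagged))
```

The obvious weakness is ambiguity: a Roman-transliterated word that happens to be spelled like an English word would be mis-tagged, which is one reason the task is harder than a lexicon lookup.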
Available Data
20000 word pairs each in Bengali, Telugu, and Hindi (labeled with language tags)
35000 unique Hindi-Roman word pairs obtained from aligning Bollywood song lyrics
More data is under preparation from Facebook, covering a mixture of various languages.
Looking for partners to extend!
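Word pairs like these could be loaded into a two-way lookup that serves as a simple baseline for transliteration in either direction. The sketch below is an assumption-laden illustration: the pairs are made up from the example query earlier in the deck, not taken from the actual data sets, and the names `forward` (native script to Roman) and `backward` (Roman to native script) are labels chosen for this example.

```python
from collections import defaultdict

def build_lookup(pairs):
    """Build native->Roman and Roman->native maps from (native, roman) pairs.

    Each side maps to a set, because one native word can have several
    Roman spellings and one Roman spelling can be ambiguous.
    """
    forward = defaultdict(set)   # native script -> Roman spellings
    backward = defaultdict(set)  # Roman spelling -> native-script words
    for native, roman in pairs:
        roman = roman.lower()
        forward[native].add(roman)
        backward[roman].add(native)
    return forward, backward

# Hypothetical pairs for illustration only.
PAIRS = [
    ("ठंडी", "thandee"),
    ("ठंडी", "tandee"),
    ("हवा", "hava"),
    ("हवा", "hawa"),
]

forward, backward = build_lookup(PAIRS)
print(sorted(forward["ठंडी"]))
print(backward["hawa"])
```

A lookup of this kind only covers words seen in the pair list; unseen words would still need a character-level transliteration model.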
Available Data
Currently we have 500 query-URL pairs with relevance judgments for Bollywood song lyrics
Looking for partners to extend it to other (Indian) Languages
Other domains?
Thank you!
Other resources
Lexicons
Pronunciation lexicons
G2P for some languages
Stemmers and morphological analyzers
Anything else?
Concluding Remarks
We have built a multi-script Bollywood song search and are working on transliteration and code-mixing
These are just some initial ideas that came up from our experiences
If you are interested, please let me know