Overview of RISOT: Retrieval of Indic Script OCR’d Text

Utpal Garain, Indian Statistical Institute, Kolkata
Tamaltaru Pal, Indian Statistical Institute, Kolkata
Jiaul Paik, Indian Statistical Institute, Kolkata
Kripa Ghosh, Indian Statistical Institute, Kolkata
David Doermann, University of Maryland, College Park, USA
Douglas W. Oard, University of Maryland, College Park, USA
o Evaluate retrieval of automatically recognized text from machine-printed text
o Goals
   - Support experimentation with retrieval from printed documents
   - Evaluate IR effectiveness for retrieval based on Indic script OCR
   - Provide a venue where IR and OCR researchers can work together
Task
o Bengali newspaper articles
   - About half the FIRE 2008/2010 collection
   - 62,875 documents
      o Text
      o Rendered image
      o OCR’d text
   - 66 topics
RISOT 2011
o Two teams participated
o Techniques
   - OCR error modeling
   - Query-time stemming
o Best absolute OCR results came from stemming + error modeling
   - 83% of the TEXT MAP for TD queries
o Best same-team relative MAP
   - 90% of TEXT
   - 88% for P@10
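The slides name OCR error modeling as a technique but do not say how the participating teams implemented it. As a purely illustrative sketch, one common approach is to expand each query term at query time with the forms an OCR engine might plausibly have produced. The confusion table, the function names, and the one-substitution limit below are all assumptions made for this example, not details from RISOT.

```python
# Hypothetical confusion table: intended character -> strings an OCR engine
# might output instead. These pairs are invented for illustration only.
CONFUSIONS = {"l": ["1", "i"], "o": ["0"], "m": ["rn"]}


def ocr_variants(term, confusions=CONFUSIONS):
    """Return the term plus every variant with exactly one substitution."""
    variants = {term}
    for i, ch in enumerate(term):
        for wrong in confusions.get(ch, []):
            variants.add(term[:i] + wrong + term[i + 1:])
    return variants


def expand_query(terms, confusions=CONFUSIONS):
    """Map each query term to the sorted forms to match against OCR'd text."""
    return {t: sorted(ocr_variants(t, confusions)) for t in terms}


if __name__ == "__main__":
    for term, forms in expand_query(["model"]).items():
        print(term, "->", forms)
```

A real error model for Bengali OCR would presumably be estimated from aligned clean and OCR'd text and would weight each variant by its likelihood; this sketch treats all variants equally.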
RISOT 2011
o N-gram statistics were used
o Stemming beats words or n-grams
o Statistically significant improvement over words for T and TD; Clean and OCR; w/ and w/o error model
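To make the comparison of term types concrete, here is a minimal sketch of how character n-grams could be generated as indexing terms in place of whole words. The slides do not state the n-gram length or how n-grams were used, so the function names and the default of `n=4` are assumptions for illustration.

```python
def char_ngrams(word, n=4):
    """Return overlapping character n-grams; short words are kept whole."""
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def index_terms(text, n=4):
    """Split text on whitespace and replace each word by its n-grams."""
    terms = []
    for word in text.split():
        terms.extend(char_ngrams(word, n))
    return terms


if __name__ == "__main__":
    print(char_ngrams("retrieval"))
    print(index_terms("ocr text retrieval"))
```

This version slices by Unicode code point. Bengali script uses combining vowel signs, so a practical implementation would likely need to slice on grapheme clusters rather than individual code points.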
Further experiments on RISOT 2011 Data
Run       Q   Doc    Term  Model  MAP     MAP%  P@5     P@10    Rprec
TD-C-S    TD  Clean  Stem  -      0.4229  -     0.4413  0.3554  0.3940
TD-O-S-M  TD  OCR    Stem  Multi  0.3619  86%   0.3973  0.3207  0.3379
TD-O-S-E  TD  OCR    Stem  One    0.3521  83%   0.3858  0.3008  0.3294
TD-O-S    TD  OCR    Stem  -      0.2915  69%   0.3109  0.2489  0.2832
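For readers less familiar with the column headings, the sketch below shows the standard definitions of P@k, average precision (averaged over topics to give MAP), and R-precision for a single ranked list. The document IDs and relevance judgments are made up for illustration, and the slides do not say which evaluation tool produced the reported figures.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def average_precision(ranked, relevant):
    """Sum of precision at each relevant hit, divided by the number of relevant docs."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def r_precision(ranked, relevant):
    """Precision at rank R, where R is the number of relevant documents."""
    r = len(relevant)
    return precision_at_k(ranked, relevant, r) if r else 0.0


def mean_average_precision(topics):
    """topics: list of (ranked, relevant) pairs, one per topic."""
    return sum(average_precision(r, rel) for r, rel in topics) / len(topics)


if __name__ == "__main__":
    ranked = ["d3", "d1", "d7", "d2", "d9"]   # hypothetical system output
    relevant = {"d1", "d2", "d5"}             # hypothetical judgments
    print(precision_at_k(ranked, relevant, 5))
    print(average_precision(ranked, relevant))
    print(r_precision(ranked, relevant))
```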
Run       Q   Doc    Term  Model  MAP     MAP%  P@5     P@10    Rprec
TD-C-W    TD  Clean  Word  -      0.3449  82%   0.3826  0.3152  0.3250
TD-O-W-M  TD  OCR    Word  Multi  0.3434  81%   0.3577  0.2962  0.3131
TD-O-W-E  TD  OCR    Word  One    0.3251  77%   0.3388  0.2694  0.3068
TD-O-W    TD  OCR    Word  -      0.2293  54%   0.2717  0.2217  0.2336
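The MAP% column appears to express each run's MAP as a percentage of the clean-text stemmed run, TD-C-S (MAP 0.4229). The short check below recomputes it from the MAP values in the tables; the dictionary and function names are illustrative only.

```python
# MAP values copied from the two tables above.
MAP = {
    "TD-C-S": 0.4229,
    "TD-O-S-M": 0.3619,
    "TD-O-S-E": 0.3521,
    "TD-O-S": 0.2915,
    "TD-C-W": 0.3449,
    "TD-O-W-M": 0.3434,
    "TD-O-W-E": 0.3251,
    "TD-O-W": 0.2293,
}


def relative_map(run, baseline="TD-C-S"):
    """Return a run's MAP as a whole-number percentage of the baseline MAP."""
    return round(100 * MAP[run] / MAP[baseline])


if __name__ == "__main__":
    for run in MAP:
        if run != "TD-C-S":
            print(f"{run:9s} {relative_map(run)}%")
```

Every recomputed value matches the reported column, which supports reading MAP% as relative to TD-C-S in both tables, including the word-based runs.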