
Improving Accessibility of Archived Raster Dictionaries of Complex Script Languages

Aug 06, 2015


Sawood Alam
Transcript
Page 1: Improving Accessibility of Archived Raster Dictionaries of Complex Script Languages

Improving Accessibility of Archived Raster Dictionaries of Complex Script Languages

Sawood Alam
Computer Science Department, Old Dominion University, Norfolk, Virginia - 23529

Fateh ud din B Mehmood
National University of Sciences and Technology, Islamabad, Pakistan

Michael L. Nelson
Computer Science Department, Old Dominion University, Norfolk, Virginia - 23529

Page 2: Improving Accessibility of Archived Raster Dictionaries of Complex Script Languages

The Time Travel

Page 3: Improving Accessibility of Archived Raster Dictionaries of Complex Script Languages

OK Google, Define Dictionary
a book or electronic resource that lists the words of a language (typically in alphabetical order) and gives their meaning, or gives the equivalent words in a different language, often also providing information about pronunciation, origin, and usage.

Page 4: Improving Accessibility of Archived Raster Dictionaries of Complex Script Languages

Dictionaries Are Different
Read: random access
Write: maintain sort order
The most compact mode to preserve a language
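The read/write asymmetry on this slide can be sketched with a sorted list: a lookup jumps into the list by binary search, while an insert has to find the one position that keeps the order intact. This is only an illustration using Python's standard bisect module; the word list and the function names lookup and add are made up for the example.

```python
import bisect

# Hypothetical headword list, kept sorted like a printed dictionary.
headwords = ["apple", "banana", "cherry", "mango"]

def lookup(words, word):
    """Read: binary search instead of a linear scan. Returns the index or -1."""
    i = bisect.bisect_left(words, word)
    return i if i < len(words) and words[i] == word else -1

def add(words, word):
    """Write: insert at the position that preserves the sort order."""
    if lookup(words, word) == -1:
        bisect.insort(words, word)

add(headwords, "grape")
print(headwords)
print(lookup(headwords, "grape"))
```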

Page 5: Improving Accessibility of Archived Raster Dictionaries of Complex Script Languages

Problem: English Dictionary

Johnson's English dictionary

Page 6: Improving Accessibility of Archived Raster Dictionaries of Complex Script Languages

Problem: Urdu Dictionary

Farhang-e-Asifiyah

Page 7: Improving Accessibility of Archived Raster Dictionaries of Complex Script Languages

Related Work

Page 8: Improving Accessibility of Archived Raster Dictionaries of Complex Script Languages

Unicode Collation
Ordered assembly of written information
Unicode values != natural collation
Arabic script: U+0600 to U+06FF
Out-of-order alphabets in derived languages
Common Locale Data Repository (CLDR)
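The point that Unicode values differ from natural collation can be seen with the first six letters of the Urdu alphabet. The letters pe (U+067E) and tte (U+0679) were added for languages derived from the Arabic script, so their code points fall after the basic Arabic letters. The sketch below is an illustration only: the names URDU_ORDER and urdu_key are invented here, the table covers a six-letter subset, and a real system would more likely rely on CLDR collation rules than on a hand-written table.

```python
# Letters are written as escapes so the example does not depend on font support.
ALEF, BE, PE, TE, TTE, SE = "\u0627", "\u0628", "\u067e", "\u062a", "\u0679", "\u062b"

# Conventional Urdu alphabetical order for these letters (illustrative subset).
URDU_ORDER = [ALEF, BE, PE, TE, TTE, SE]
RANK = {ch: i for i, ch in enumerate(URDU_ORDER)}

def urdu_key(word):
    """Sort key: rank of each letter in URDU_ORDER; unknown characters sort last."""
    return [(RANK.get(ch, len(RANK)), ch) for ch in word]

letters = [SE, TTE, TE, PE, BE, ALEF]
by_code_point = sorted(letters)              # plain code point comparison
by_alphabet = sorted(letters, key=urdu_key)  # tailored alphabetical order

print([f"U+{ord(ch):04X}" for ch in by_code_point])
print([f"U+{ord(ch):04X}" for ch in by_alphabet])
```

Sorting by code point places pe and tte after se, while the tailored key restores the alphabetical order.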

Page 9: Improving Accessibility of Archived Raster Dictionaries of Complex Script Languages

Collation Discrepancies
Compound letters
Diacritical marks
Half letters
Prefixes
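One of the discrepancies listed here, diacritical marks, can be sketched as follows. The slide does not say how any particular dictionary treats diacritics; the assumption in this example is that marks are ignored when matching a headword. The function name strip_marks is invented for the illustration, and it deliberately skips Unicode normalization so that precomposed letters such as alef with madda are left untouched.

```python
import unicodedata

def strip_marks(text):
    """Drop combining marks (characters with a nonzero combining class)."""
    return "".join(ch for ch in text if unicodedata.combining(ch) == 0)

# The same Urdu word with and without the zer mark (U+0650) on its first letter.
with_zer = "\u06a9\u0650\u062a\u0627\u0628"
plain = "\u06a9\u062a\u0627\u0628"

print(with_zer == plain)               # raw strings differ
print(strip_marks(with_zer) == plain)  # equal once marks are ignored
```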

Page 10: Improving Accessibility of Archived Raster Dictionaries of Complex Script Languages

Nested Ordering
Root word sorting (Arabic)
    Morphological derivation
    Derived word simplification
Radicals and strokes (Chinese)

Page 11: Improving Accessibility of Archived Raster Dictionaries of Complex Script Languages

Indexing: Ordered Pages

Page 12: Improving Accessibility of Archived Raster Dictionaries of Complex Script Languages

Indexing: Sparse Index
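The slide gives only its title, so the following is a sketch under a stated assumption: that a sparse index records the first headword of only some pages, and a lookup narrows the search to the page range between two indexed entries. The index contents, the page count, and the name candidate_pages are all hypothetical.

```python
import bisect

# Hypothetical sparse index: (first headword on page, page number), few pages only.
SPARSE_INDEX = [("aardvark", 1), ("dog", 40), ("kite", 95), ("quill", 150)]
LAST_PAGE = 180  # assumed total number of scanned pages

def candidate_pages(word, index=SPARSE_INDEX, last_page=LAST_PAGE):
    """Return a conservative, inclusive page range that can contain `word`."""
    keys = [key for key, _ in index]
    i = bisect.bisect_right(keys, word) - 1
    if i < 0:
        i = 0  # word sorts before the first indexed headword
    start = index[i][1]
    end = index[i + 1][1] if i + 1 < len(index) else last_page
    return start, end

print(candidate_pages("horse"))
print(candidate_pages("zebra"))
```

The reader still has to browse the returned range by hand, which is the trade-off for indexing only a few pages; the example uses plain string comparison, so a script such as Urdu would also need a collation-aware key.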

Page 13: Improving Accessibility of Archived Raster Dictionaries of Complex Script Languages

Indexing: Full Index

Page 14: Improving Accessibility of Archived Raster Dictionaries of Complex Script Languages

Indexing: Location Index

Page 15: Improving Accessibility of Archived Raster Dictionaries of Complex Script Languages

Indexing State Transition

Page 16: Improving Accessibility of Archived Raster Dictionaries of Complex Script Languages

Annotation

Page 17: Improving Accessibility of Archived Raster Dictionaries of Complex Script Languages

Digitization

Page 18: Improving Accessibility of Archived Raster Dictionaries of Complex Script Languages

Dictionary Explorer
Multilingual Multi-dictionary Lookup
Searching and Exploring
Annotation and digitization
User Contribution and Feedback
Open Source => GitHub:/urduweb/DictionaryExplorer

Page 19: Improving Accessibility of Archived Raster Dictionaries of Complex Script Languages

Dictionary Explorer: English


Page 20: Improving Accessibility of Archived Raster Dictionaries of Complex Script Languages

Dictionary Explorer: Urdu


Page 21: Improving Accessibility of Archived Raster Dictionaries of Complex Script Languages

Indexing Time

Dictionary                  Pages   Index    Mode                 Time
English to Urdu             180     Sparse   Manual and Script    10 minutes
Monolingual Urdu            2,500   Sparse   Manual               2 hours
Monolingual Classic Urdu    3,200   Full*    Crowdsource**        60 days

* 75,000 words, phrases, proverbs, and idioms
** 13 contributors

Page 22: Improving Accessibility of Archived Raster Dictionaries of Complex Script Languages

Prefix Permutations

Page 23: Improving Accessibility of Archived Raster Dictionaries of Complex Script Languages

Prefix: One

Page 24: Improving Accessibility of Archived Raster Dictionaries of Complex Script Languages

Prefix: Two

Page 25: Improving Accessibility of Archived Raster Dictionaries of Complex Script Languages

Prefix: Three

Page 26: Improving Accessibility of Archived Raster Dictionaries of Complex Script Languages

Prefix: Four

Page 27: Improving Accessibility of Archived Raster Dictionaries of Complex Script Languages

Prefix: Five

Page 28: Improving Accessibility of Archived Raster Dictionaries of Complex Script Languages

Prefix: Six
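The prefix slides carry only titles, so what they measure is not recoverable from the transcript. One plausible reading, offered here purely as an assumption, is that headwords are grouped by their first one to six characters to see how many distinct prefixes exist and how many words share the most common one. The word list and the name prefix_stats are invented for this sketch.

```python
from collections import Counter

# Hypothetical headword list; a real evaluation would use a full lexicon.
WORDS = ["dictate", "diction", "dictionary", "dictum", "did", "die", "diet", "dig"]

def prefix_stats(words, max_len=6):
    """For each prefix length: (number of distinct prefixes, largest group size)."""
    stats = {}
    for n in range(1, max_len + 1):
        # Words shorter than n characters have no prefix of that length.
        groups = Counter(w[:n] for w in words if len(w) >= n)
        stats[n] = (len(groups), max(groups.values(), default=0))
    return stats

for length, (distinct, largest) in prefix_stats(WORDS).items():
    print(length, distinct, largest)
```

A large group under a short prefix corresponds to a lookup that returns many candidates, which longer prefixes reduce.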

Page 29: Improving Accessibility of Archived Raster Dictionaries of Complex Script Languages

Conclusions and Future Work
Identified issues
    Too many matches
    Lack of fielded searching
    Lack of OCR support
    No input method assistance
Collation challenges
Accessibility levels: Ordered Pages, Sparse, Full, and Location indexes, annotation, and digitization
Implemented a multi-lingual multi-dictionary explorer
Effort and prefix evaluation
In future: elastic index and automatic region estimate
GitHub:/urduweb/DictionaryExplorer