Urdu Character Set and Collating Sequence
Sarmad Hussain
مرکزتحقیقاتِ اردو (Center for Research in Urdu Language Processing)
FAST National University of Computer and Emerging Sciences
Purpose of Presentation
► Indicate the "state of affairs": character set, collating sequence
► Show what has been done regarding the standardization
► Identify what needs to be done
Sources
► Data from four dictionaries of Urdu
1. فیروزاللغات جامع (Feroz-ul-Lughat Jame), Ferozsons, Lahore (FLJ)
2. Standard Twentieth Century Dictionary: Urdu to English, Educational Publishing House, New Delhi, India (STCD)
3. فرہنگِ تلفظ (Farhang-e-Talaffuz), National Language Authority, Islamabad (FT)
4. جدید اردو لغت (Jadeed Urdu Lughat), National Language Authority, Islamabad (JUL)
► Control characters (0 – 31, 127)
► Reserved control space (128 – 159, 255)
► Reserved expansion space (177 – 191, 200 – 207, 240 – 253)
► Vendor area (208 – 239)
► Toggle character (254)
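The ranges listed above can be checked mechanically. The following is a minimal sketch, assuming they describe an 8-bit code table (values 0 to 255); the heading of this slide did not survive, so the region names and the function name `classify_code` are illustrative only. Values that the list does not mention are reported as "unlisted" rather than guessed.

```python
# Region boundaries copied from the list above; the names are illustrative.
REGIONS = [
    ("control", [(0, 31), (127, 127)]),
    ("reserved control", [(128, 159), (255, 255)]),
    ("reserved expansion", [(177, 191), (200, 207), (240, 253)]),
    ("vendor", [(208, 239)]),
    ("toggle", [(254, 254)]),
]


def classify_code(code: int) -> str:
    """Return the name of the region that an 8-bit code value falls in."""
    if not 0 <= code <= 255:
        raise ValueError("expected an 8-bit code value (0-255)")
    for name, ranges in REGIONS:
        if any(low <= code <= high for low, high in ranges):
            return name
    # Assumption: anything not named in the list is left for ordinary characters.
    return "unlisted"
```

For example, `classify_code(254)` falls in the toggle region and `classify_code(210)` in the vendor area.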
Conclusions: Standard Urdu Character Set
► No general agreement on Urdu Character Set by dictionary publishers
► Standard Character Set defined by National Language Authority (NLA): not well-publicized, not widely adopted
► GoP standard for computing, UZT 1.01, implements the NLA-defined character and symbol set
► Will soon be fully represented in Unicode/ISO 10646
Urdu Collating Sequence: State of Affairs
► FT, JUL: ا، آ، ب، بھ، پ، پھ، ت، تھ، ٹ، ٹھ، ث، ج، جھ، چ، چھ، ح، خ، د، دھ، ڈ
► FLJ, FT & JUL: بی، بی بی، بے، بیابان
► STCD: بی، بے، بیابان، بی بی
► Middle "yay" predicament: ے or ی, e.g. بیکار (بے کار) and ٹیلیوژن (ٹیلی وژن)
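The FT/JUL sequence above places each aspirated form (بھ, پھ, ...) directly after its base letter, as a letter in its own right. A minimal sketch of what that implies for sorting follows, assuming a word is first split into letters with each aspirated digraph kept as one unit. Only the part of the sequence shown above (up to ڈ) is used; the names `letters` and `primary_key`, and the rule that letters outside this partial sequence sort after it by code point, are assumptions made for illustration.

```python
import unicodedata

DO_CHASHMI_HE = "\u06be"  # ھ, the second half of an aspirated digraph

# Base letters in the order read from the sequence above (it stops at ڈ).
BASES = ["ا", "آ", "ب", "پ", "ت", "ٹ", "ث", "ج", "چ", "ح", "خ", "د", "ڈ"]
# Bases whose aspirated form follows them in that sequence.
ASPIRATED = {"ب", "پ", "ت", "ٹ", "ج", "چ", "د"}

SEQUENCE = []
for base in BASES:
    SEQUENCE.append(base)
    if base in ASPIRATED:
        SEQUENCE.append(base + DO_CHASHMI_HE)
RANK = {letter: index for index, letter in enumerate(SEQUENCE)}


def letters(word):
    """Split a word into letters, keeping a known aspirated digraph as one unit."""
    word = unicodedata.normalize("NFC", word)
    units = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if len(pair) == 2 and pair[1] == DO_CHASHMI_HE and pair in RANK:
            units.append(pair)
            i += 2
        else:
            units.append(word[i])
            i += 1
    return units


def primary_key(word):
    """Sort key based on the partial sequence.

    Assumption: letters outside the partial sequence sort after it, by code point.
    """
    return [(0, RANK[u]) if u in RANK else (1, ord(u)) for u in letters(word)]
```

With this key, a word beginning with ب followed by ہ sorts before one beginning with بھ, whereas a plain code point comparison puts them the other way round because ھ (U+06BE) precedes ہ (U+06C1).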
Role of Aerab in Sorting
► Aerab ignored in the first (primary) pass of sorting an Urdu string
بِہار (= بہار)، بَہانہ، بِہاءی (= بہاءی)
► However, aerab are relevant in the second pass, when the first pass gives an exact match
بَن، بِن، بُن
سَن، سِن، سُن
► No effect at primary level sorting: مَوسی، اعلا، مُوسی، اعلان، اعلم، اعلی
► No minimal pairs found, so secondary level involvement could not be determined
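The two-pass treatment of zabar, zer and pesh described on this slide can be sketched as a two-part sort key. This is an illustration under stated assumptions, not the presenter's implementation: the base letters are compared by raw code point as a stand-in for a real primary order, the secondary order zabar < zer < pesh is inferred from the order of the minimal pairs on the slide, and the name `two_pass_key` is hypothetical.

```python
ZABAR, ZER, PESH = "\u064e", "\u0650", "\u064f"

# Assumption: secondary order follows the order of the minimal pairs on the slide.
AERAB_RANK = {ZABAR: 1, ZER: 2, PESH: 3}


def two_pass_key(word):
    """Return (primary, secondary): letters without aerab, then aerab per letter."""
    base = []    # first pass: the word with aerab removed
    marks = []   # second pass: one rank per base letter, 0 when it carries no aerab
    for ch in word:
        if ch in AERAB_RANK:
            if marks:
                marks[-1] = AERAB_RANK[ch]  # attach the mark to the preceding letter
        else:
            base.append(ch)
            marks.append(0)
    # Placeholder primary order: raw code points of the remaining letters.
    return ("".join(base), marks)
```

Python compares the tuple element by element, so the aerab ranks are consulted only when the base strings are identical, which is the "exact match in the first pass" condition above. Sorting with `key=two_pass_key` keeps بِہار ahead of بَہانہ (the aerab are ignored and ر precedes ن) and orders بَن, بِن, بُن as on the slide.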
Consonantal Aerab - Hamza
► Ignored at primary level
► Minimal pairs not found to determine secondary level effect
مرٲت، مراتب، مرام، مرآت
What Needs to be Done: Urdu
► If required, revisit and revise the Urdu character set
► Extensive work on sorting done at linguistic level by NLA and UDB. Need to standardize it and publicize it
► Need to develop this at the computational level: build a Collation Element Table to generate sort keys. Need to standardize it and publicize it
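To make the last point concrete, here is a minimal sketch of how a Collation Element Table can drive sort key generation, in the general style of the Unicode Collation Algorithm: each character maps to per-level weights, a weight of 0 means "ignorable at this level", and the key is the primary weights, a 0 separator, then the secondary weights. The table `CET` below is hypothetical and tiny; its weights are invented for illustration and are not taken from any NLA, UZT or Unicode table.

```python
# Hypothetical collation element table: character -> (primary, secondary).
# Base letters get a common secondary weight of 1; aerab are primary-ignorable.
CET = {
    "\u0627": (1, 1),  # ا
    "\u0628": (2, 1),  # ب
    "\u0646": (3, 1),  # ن
    "\u064e": (0, 2),  # zabar
    "\u0650": (0, 3),  # zer
    "\u064f": (0, 4),  # pesh
}


def sort_key(text, table=CET):
    """Build a two-level sort key: primary weights, a 0 separator, secondary weights."""
    elements = [table[ch] for ch in text]  # raises KeyError for unlisted characters
    primary = [p for p, _ in elements if p != 0]
    secondary = [s for _, s in elements if s != 0]
    return tuple(primary) + (0,) + tuple(secondary)
```

Because the key is a flat tuple of integers, it can be stored alongside a record and compared without consulting the table again, which is the practical reason for generating sort keys rather than comparing strings pairwise.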
What Needs to be Done: Other Languages of Pakistan
► Need to work towards standardization of character set and collating sequence
► Need to do gap analysis of character sets with Unicode/ISO 10646 for international standardization
► Need to develop Collation Element Tables for these languages for sorting