پاکستانی زبانوں کی ترتیب تہجی اور متعلقہ مسائل
سرمد حسین (Sarmad Hussain)
Collation Sequences and Related Issues for Pakistani Languages
Center For Research in Urdu Language Processing
National University of Computer and Emerging Sciences
Purpose of Presentation
► Briefly discuss character sets
ق ک کھ گ گھ ل لھ م مھ ں ںھ ن نھ و ہ ء ی ے
FLJ, NL
آ ا ب پ ت ٹ ث ج چ ح خ د ڈ ذ ر ڑ ز ژ س ش ص ض ط ظ ع غ ف ق ک گ ل م ں ن و ہ ھ ء ی ے
UHE, FA, STCD
ا ب پ ت ٹ ث ج چ ح خ د ڈ ذ ر ڑ ز ژ س ش ص ض ط ظ ع غ ف ق ک گ ل م ن و ہ ھ ء ی ے
Conclusions: Urdu Character Set
► No general agreement on the Urdu character set among dictionary publishers
► The standard character set defined by the National Language Authority and the Urdu Dictionary Board is
    • not traditional
    • not well-publicized
    • not completely adopted
► The GoP standard for computing, UZT 1.01, implements the NLA-defined character and symbol set
► UZT 1.01 will soon be fully represented in Unicode/ISO/IEC 10646
► Control characters (0–31, 127)
► Reserved control space (128–159, 255)
► Reserved expansion space (177–191, 200–207, 240–253)
► Vendor area (208–239)
► Toggle character (254)
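The code-space layout listed above can be sketched as a small lookup. This is only an illustrative sketch: the numeric ranges are the ones given on the slide, while the region labels, the function name `uzt_region`, and the `"unlisted"` fallback for code points the slide does not mention are assumptions made here.

```python
# Ranges as listed on the slide; the labels are illustrative names.
UZT_REGIONS = [
    ("control", [(0, 31), (127, 127)]),
    ("reserved control", [(128, 159), (255, 255)]),
    ("reserved expansion", [(177, 191), (200, 207), (240, 253)]),
    ("vendor", [(208, 239)]),
    ("toggle", [(254, 254)]),
]


def uzt_region(code: int) -> str:
    """Return the region of a one-byte UZT code point (hypothetical helper)."""
    if not 0 <= code <= 255:
        raise ValueError("UZT code points are assumed to fit in one byte")
    for name, ranges in UZT_REGIONS:
        for low, high in ranges:
            if low <= code <= high:
                return name
    # The slide does not say what the remaining positions hold.
    return "unlisted"


print(uzt_region(254))  # toggle
print(uzt_region(210))  # vendor
```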
Urdu Collation Sequence
► How do the following figure in?
    • Basic Letters
    • Other Letters
    • Basic Aerab
    • Other Aerab
    • Others
► Arguments should be consistent and simple
Character vs. Phoneme
► Character = written content = letters
► Phoneme = linguistic content
► In the word “phone”:
    • 5 characters = p h o n e
    • 3 phonemes = f o n
• stylistic variation of ا ا
• adds a character to single alif
• not a character in the pure sense
► STCD, UHE, FA, NL
ا  آاب  آاپ  اب  ایوان
ٶ أ Status
► Not a character in ANY dictionary, including dictionaries by
    • National Language Authority
    • Urdu Dictionary Board
► Has same bearing on collation sequences as ا ء ، و ء
► Included in UZT 1.01 as per terms of reference given by NLA
► May be made by combination of ء followed by و ، ا
► Should be taken out of UZT 1.01 in its next version
► Like ں, which is a vowel modifier, ھ is a consonant modifier and DOES NOT add any “phonemic content”
► ھ, as with ں, is not a phoneme
► Written adjacent to ہ; lighter goes up, so ھ would come before ہ
ب = C
بھ = C
بہ = C V C
بھ، پھ، ۔۔۔ Status as “Character”
► Urdu Dictionary Board and National Language Authority assert that these are phonemes and that the character combination should therefore be made a character
► If character combinations which are phonemes are to be promoted to characters then, to be consistent, the following combinations should also be made characters: یں، وں، اں
► However, it is common in languages for character combinations to represent phonemes: p + h = f (in English), so پ + ھ = پھ (in Urdu)
► Even if it is not a phoneme, ں may remain a character, like ھ
► بھ، پھ، ۔۔۔ are not characters but character combinations
ة Status as “Character”
► Not a character in ANY dictionary, including dictionaries by
    • National Language Authority
► Middle ے or ی predicament
    • بے کار = بیکار
    • ٹیلی وژن = ٹیلیوژن
ے ی Variation
► Like ا، و، ی the character ے is a vowel (phoneme)
► Unlike ں, ے is not a vowel modifier, because ے behaves differently from ں:
    • ے replaces ی: بی → بے
    • ں adds onto ا: ما → ماں
► ے is placed at the end of the alphabet (based on traditional collation)
► ے is collated as “heavier” than ی at ligature endings but “equal to” ی ligature-medially
Role of Aerab in Sorting
► Aerab are ignored in the first (primary) pass of sorting an Urdu string
    • only characters are considered
► However, aerab are relevant in the second pass, when the first pass gives an exact match
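The two passes described above can be sketched as a two-part sort key: letters only at the primary level, and aerab consulted only when the letters match exactly. This is a minimal sketch under stated assumptions, not a standardized implementation. The alphabet order in `ALPHABET`, the relative order of the marks in `AERAB`, and the names `sort_key`, `LETTER_RANK` and `AERAB_RANK` are all illustrative choices made here.

```python
# Assumed alphabet order, for illustration only; dictionaries disagree on it.
ALPHABET = "آابپتٹثجچحخدڈذرڑزژسشصضطظعغفقکگلمںنوہھءیے"
# Assumed relative order of aerab: zabar, zer, pesh, jazm, tashdid, khari zabar.
AERAB = "\u064E\u0650\u064F\u0652\u0651\u0670"

LETTER_RANK = {ch: i for i, ch in enumerate(ALPHABET)}
AERAB_RANK = {ch: i for i, ch in enumerate(AERAB)}


def sort_key(word):
    """Return (primary, secondary); Python compares the parts in that order."""
    primary = []    # one rank per letter, aerab ignored
    secondary = []  # one tuple of aerab ranks per letter
    for ch in word:
        if ch in AERAB_RANK:
            if secondary:  # attach the mark to the preceding letter
                secondary[-1] += (AERAB_RANK[ch],)
        else:
            # Letters outside the assumed alphabet sort after all known ones.
            primary.append(LETTER_RANK.get(ch, len(ALPHABET)))
            secondary.append(())
    return (primary, secondary)


words = [
    "\u0639\u0650\u0644\u0645",        # ain + zer, lam, meem
    "\u0639\u0627\u0644\u0645",        # ain, alif, lam, meem (no aerab)
    "\u0639\u064E\u0644\u064E\u0645",  # ain + zabar, lam + zabar, meem
    "\u0639\u0644\u0645",              # ain, lam, meem (no aerab)
]
ordered = sorted(words, key=sort_key)
```

The three spellings built on ain, lam, meem tie at the primary level and are separated only by their aerab, with the unmarked form first. The word containing alif is placed by its letters alone.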
► No effect at primary level sorting
    وسیم اعلا وسیم اعلان اعلم اعلی
► No minimal pairs found on secondary level so involvement could not be determined
► Hex 41 (UZT) and Hex 200B (Unicode)
► Ignored at primary level and secondary level
    • ٹیلیوژن ، ٹیلی وژن
    • ٹیلیفون ، ٹیلی فون
    • بے کار ، بیکار
► But given each pair, which word first? Tertiary level decision
► lighter goes up!
► single word without break comes first?
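One way to picture the tertiary decision is a key that drops the break character (U+200B) for the main comparison and uses it only to split ties. The sketch below assumes the answer to the question above is "yes", i.e. the unbroken word comes first. It also uses plain code point order as a stand-in for the real primary comparison and skips the secondary (aerab) level. The names `break_key`, `TELE`, `PHONE` and `VISION` are hypothetical.

```python
ZWSP = "\u200B"  # the break character, Hex 200B in Unicode

# Word pieces used to build the example pairs.
TELE = "ٹیلی"
PHONE = "فون"
VISION = "وژن"


def break_key(word):
    """Ignore breaks for the main comparison; use them only to split ties."""
    base = word.replace(ZWSP, "")  # break ignored at primary level
    breaks = word.count(ZWSP)      # tertiary: fewer breaks is "lighter"
    return (base, breaks)


words = [
    TELE + ZWSP + VISION,
    TELE + PHONE,
    TELE + VISION,
    TELE + ZWSP + PHONE,
]
ordered = sorted(words, key=break_key)
```

Each pair stays together because the break does not affect the main comparison, and within a pair the form without a break is placed first.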
What Needs to be Done for Urdu
► Debate and standardize
    • Character Set
► Develop computational model to implement sorting
    • Culturally acceptable
    • Collation Element Table to generate sort keys
► Standardize and publicize this computational model for Urdu sorting
What Needs to be Done
► Take national standards to international forums: Unicode/ISO
► Complete similar work for all other local languages of Pakistan
    • Character set
    • Script
    • Collating Sequence
Relevant National and Provincial Government Organizations
► National
    • Urdu and Regional Languages’ Software Development Forum (URLSDF), Ministry of Science and Technology (MoST), Islamabad
    • National Language Authority (NLA), Islamabad (Urdu)
    • Pakistan Standards and Quality Control Authority (PSQCA),