Urdu Character Set and Collating Sequence
Sarmad Hussain, مرکز تحقیقاتِ اردو, Center for Research in Urdu Language Processing, FAST National University of Computer and Emerging Sciences

Dec 13, 2015

Edmund Floyd
Page 1: Urdu Character Set and Collating Sequence

Sarmad Hussain

مرکز تحقیقاتِ اردو
Center for Research in Urdu Language Processing

FAST National University of Computer and Emerging Sciences

Page 2: Purpose of Presentation

► Indicate the “state of affairs”
  Character set
  Collating sequence
► Show what has been done regarding standardization
► Identify what needs to be done

Page 3: Sources

► Data from four dictionaries of Urdu
  1. فیروز اللغات جامع (Feroz-ul-Lughat Jame), Ferozsons, Lahore (FLJ)
  2. Standard Twentieth Century Dictionary: Urdu to English, Educational Publishing House, New Delhi, India (STCD)
  3. فرہنگِ تلفظ (Farhang-e-Talaffuz), Muqtadra Qaumi Zaban (National Language Authority), Islamabad (FT)
  4. جدید اردو لغت (Jadeed Urdu Lughat), Muqtadra Qaumi Zaban (National Language Authority), Islamabad (JUL)

Page 4: Character Set

► Alphabet
► Harakat (Aerab)
► Other Symbols

Page 5: “Typical” Alphabet

آ ا ب پ ت ٹ ث ج چ ح خ
د ڈ ذ ر ڑ ز ژ س ش ص ض ط ظ ع غ ف ق ک گ ل م ن و ہ ء ی ے

Source: اردو قاعدہ (Urdu Qaida), Ferozsons, Lahore

Page 6: “Familiar” Harakaat (Aerab)

Each mark is shown on the letter د, except noon ghunna.

Zabar: دَ
Zer: دِ
Pesh: دُ
Jazm: دْ
Khari zabar: دٰ
Khari zer: دٖ
Ulta pesh: دٗ
Do zabar: دً
Do zer: دٍ
Do pesh: دٌ
Tashdeed: دّ
Noon ghunna: shown on ن

Page 7: “Common” Other Symbols

Numbers: 0 ۰, 1 ۱, 2 ۲, 3 ۳, 4 ۴, 5 ۵, 6 ۶, 7 ۷, 8 ۸, 9 ۹
Punctuation: ؟ ؛ ٬ -
Honorifics: [glyphs not preserved in the transcript]
Other Symbols: [glyphs not preserved in the transcript]

Page 8: Urdu Alphabet: State of Affairs

FT, JUL:
ا آ ب بھ پ پھ ت تھ ٹ ٹھ ث ج جھ چ چھ ح خ د دھ ڈ ڈھ ذ ر رھ ڑ ڑھ ز ژ س ش ص ض ط ظ ع غ ف ق ک کھ گ گھ ل لھ م مھ ں ںھ ن نھ و وھ ہ ء ی ے

FLJ, STCD:
آ ا ب پ ت ٹ ث ج چ ح خ د ڈ ذ ر ڑ ز ژ س ش ص ض ط ظ ع غ ف ق ک گ ل م ں ن و ہ ھ ء ی ے

Page 9: Current GoP (Government of Pakistan) Standard: UZT 1.01

Page 10: Logical Sections of UZT 1.01

► Alphabet (80 – 122)
► Aerab/diacritics/harakat (66 – 79, 123 – 126)
► Other characters
  Punctuation and arithmetic symbols (32 – 47, 58 – 65)
  Digits (48 – 57)
  Special symbols (160 – 176, 192 – 199)
  Miscellaneous
► Control characters (0 – 31, 127)
► Reserved control space (128 – 159, 255)
► Reserved expansion space (177 – 191, 200 – 207, 240 – 253)
► Vendor area (208 – 239)
► Toggle character (254)
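
The ranges listed for UZT 1.01 can be checked mechanically: together they should account for every value of an 8-bit code. The following is a minimal Python sketch under that reading. The table `UZT_SECTIONS` and the helper `uzt_section` are illustrative names, not part of the standard, and the section labels are simply copied from the slide.

```python
# Code ranges of the logical sections of UZT 1.01, as listed on the slide.
# The dictionary and function names are hypothetical, chosen for illustration.
UZT_SECTIONS = {
    "alphabet": [(80, 122)],
    "aerab": [(66, 79), (123, 126)],
    "punctuation and arithmetic symbols": [(32, 47), (58, 65)],
    "digits": [(48, 57)],
    "special symbols": [(160, 176), (192, 199)],
    "control characters": [(0, 31), (127, 127)],
    "reserved control space": [(128, 159), (255, 255)],
    "reserved expansion space": [(177, 191), (200, 207), (240, 253)],
    "vendor area": [(208, 239)],
    "toggle character": [(254, 254)],
}


def uzt_section(code: int) -> str:
    """Return the logical section that an 8-bit UZT 1.01 code falls in."""
    if not 0 <= code <= 255:
        raise ValueError(f"not an 8-bit code: {code}")
    for name, ranges in UZT_SECTIONS.items():
        if any(low <= code <= high for low, high in ranges):
            return name
    raise LookupError(f"code {code} is not covered by any listed range")


if __name__ == "__main__":
    for code in (48, 66, 80, 127, 254):
        print(code, uzt_section(code))
```

Looping over all 256 values confirms that the listed ranges leave no gaps and do not overlap.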

Page 11: Conclusions: Standard Urdu Character Set

► No general agreement on the Urdu character set among dictionary publishers
► Standard character set defined by the National Language Authority (NLA)
  not well-publicized
  not widely adopted
► The GoP computing standard, UZT 1.01, implements the NLA-defined character and symbol set
► Will soon be fully represented in Unicode/ISO 10646

Page 12: Urdu Collating Sequence: State of Affairs

FT, JUL:
ا آ ب بھ پ پھ ت تھ ٹ ٹھ ث ج جھ چ چھ ح خ د دھ ڈ ڈھ ذ ر رھ ڑ ڑھ ز ژ س ش ص ض ط ظ ع غ ف ق ک کھ گ گھ ل لھ م مھ ں ںھ ن نھ و وھ ہ ء ی ے

FLJ:
آ ا ب پ ت ٹ ث ج چ ح خ د ڈ ذ ر ڑ ز ژ س ش ص ض ط ظ ع غ ف ق ک گ ل م
followed by ں ن و ہ ھ ء ی ے [relative order of these final letters is garbled in the transcript]

STCD:
آ ا ب پ ت ٹ ث ج چ ح خ د ڈ ذ ر ڑ ز ژ س ش ص ض ط ظ ع غ ف ق ک گ ل م
followed by ں ن و ہ ھ ء ی ے [relative order of these final letters is garbled in the transcript]

Page 13: آ ا Variation

► STCD and FLJ
  آب، آپ، اب، ایوان
► FT and JUL
  اب، ایوان، آب، آپ

Page 14: ں ن Variation

► FLJ, FT & STCD
  ماں، مان
► JUL
  مان، ماں

Page 15: ھ ہ Variation

► FLJ
  باپ، بہن، بہنگی، بھابی، بھنگی، بیٹا
► STCD
  باپ، بھابی، بہن، بھنگی، بہنگی، بیٹا
► FT & JUL
  باپ، بہن، بہنگی، بیٹا، بھابی، بھنگی

بانو، بانھ، بانی

Page 16: ی ے Variation

► FLJ, FT & JUL
  بی، بی بی، بے، بیابان
► STCD
  بی، بے، بیابان، بی بی
► Middle “yay” predicament: ے or ی?
  بیکار = بے + کار
  ٹیلیوژن = ٹیلی + وژن

Page 17: Role of Aerab in Sorting

► Aerab are ignored in the first (primary) pass of sorting an Urdu string
  بِہار (= بہار)، بَہانہ، بِہائی (= بہائی)
► However, aerab are relevant in the second pass, when the first pass gives an exact match
  بَن، بِن، بُن
  سَن، سِن، سُن
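
The two-pass behaviour described on this slide can be sketched as a sort key with two components: the letters alone, then the aerab. This is a minimal Python illustration, not the method of any of the four dictionaries. The names `AERAB_WEIGHT` and `sort_key` are hypothetical, Unicode code-point order stands in for a real primary letter order, the zabar, zer, pesh weighting follows the order that the deck attributes to FT, FLJ and JUL, and giving an unmarked letter weight 0 is an assumption made here.

```python
# Secondary weights for the three short-vowel aerab (an assumed weighting:
# zabar < zer < pesh).
AERAB_WEIGHT = {
    "\u064E": 1,  # zabar
    "\u0650": 2,  # zer
    "\u064F": 3,  # pesh
}


def sort_key(word: str):
    """Build a (primary, secondary) key; aerab only matter on a primary tie."""
    primary = []    # one entry per base letter, aerab skipped
    secondary = []  # weight of the aerab attached to each base letter
    for ch in word:
        if ch in AERAB_WEIGHT:
            if secondary:  # attach the mark to the preceding letter
                secondary[-1] = AERAB_WEIGHT[ch]
        else:
            primary.append(ord(ch))  # stand-in for a real letter weight
            secondary.append(0)      # assumption: no mark weighs least
    return tuple(primary), tuple(secondary)


ban, bin_, bun = "\u0628\u064E\u0646", "\u0628\u0650\u0646", "\u0628\u064F\u0646"
san, sin, sun = "\u0633\u064E\u0646", "\u0633\u0650\u0646", "\u0633\u064F\u0646"

if __name__ == "__main__":
    for word in sorted([sun, bun, sin, bin_, san, ban], key=sort_key):
        print(word)
```

Because Python compares tuples element by element, the secondary component is consulted only when two words have identical base letters, which is the "exact match" condition of the second pass.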

Page 18: Vocalic Aerab - Zabar, Zer, Pesh

► FT, FLJ, JUL
  بَن، بِن، بُن
  بَیر، بِیر، بیر
► STCD
  بَن، بُن، بِن
  سَن، سِن، سُن
  بِیر، بیر

Page 19: Vocalic Aerab – Khari Zabar

► No effect at primary level sorting
  اعلا، اعلان، اعلم، اعلیٰ
  مَوسی، مُوسیٰ
► No minimal pairs found, so involvement at the secondary level could not be determined

Page 20: Consonantal Aerab - Hamza

► Ignored at primary level
► Minimal pairs not found to determine secondary level effect
  مرا، مرٲت، مراتب، مرام، مرآت
  باوا، باٶٹا، باون

Page 21: Consonantal Aerab - Tashdeed

► Ignored at primary level
► Affects secondary level sorting
  “heavier than null”
► Interacts with vocalic aerab
  بَرانا، برّانا، بَرایا
  بدی، بدّی، بدّیا
  بدو، بدُّو، بدّیا
  (all examples from FT)

Page 22: Ligature-Break (Half Space)

► Ignored at primary level and secondary level
  ٹیلی وژن، ٹیلیوژن
  ٹیلی فون، ٹیلیفون
  بے کار، بیکار
► But given each pair, which word comes first?
  Tertiary level decision
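
A tertiary tie-break of this kind can be sketched by removing the ligature-break character before comparison and keeping its presence as a last key component. The Python below is an illustration under stated assumptions: the half space is taken to be encoded as U+200C (zero width non-joiner), code-point order stands in for the primary letter order, no aerab are present (so the secondary level is omitted), and placing the joined spelling first is an arbitrary choice, since the slide leaves that question open. The names `ZWNJ` and `sort_key` are hypothetical.

```python
ZWNJ = "\u200c"  # assumed encoding of the ligature break (half space)


def sort_key(word: str):
    """Compare without ligature breaks first; use them only to break ties."""
    base = word.replace(ZWNJ, "")  # what the primary comparison sees
    breaks = word.count(ZWNJ)      # tertiary: spellings with breaks sort later
    return base, breaks


joined = "ٹیلی" + "وژن"
broken = "ٹیلی" + ZWNJ + "وژن"
ordered = sorted([broken, joined], key=sort_key)

if __name__ == "__main__":
    print([word.count(ZWNJ) for word in ordered])
```

Reversing the tie-break, so that the broken spelling comes first, only requires negating `breaks`.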

Page 23: Word-Break (Normal Space)

► Ignored at primary level?
► American Heritage Dictionary (2nd Collegiate ed.)
  black art
  black bear
  blackberry
  black box
  blacken
  Black Death
  black gold
► Space ignored at primary level

Page 24: Word-Break (Normal Space) - II

► FLJ
  1. بانگ
  2. بانگِ درا
  3. بانگ دینا
  If sorting were done at word breaks, the order would be 1, 3, 2
  So sorting ignores word breaks

Page 25: Conclusions: Urdu Collating Sequence

► Multi-level complex problem
► Pre-processing
  Contractions (ب + ھ → بھ)
► Primary Level
  Characters
► Secondary Level
  Vocalic aerab
  Consonantal aerab
  Interaction of vocalic and consonantal aerab
  Others (?)
► Tertiary Level
  Ligature break
  Others (?)
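
The pre-processing and primary-level steps can be sketched together: first merge a consonant and a following ھ into one collation unit, then weight each unit by its position in a collating sequence. The Python below is a minimal illustration, not the NLA's or any dictionary's algorithm. It uses the FT/JUL sequence from the collating-sequence slide, assumes input words carry no aerab, and places any unit missing from the sequence after all known units. The names `FT_JUL_SEQUENCE`, `WEIGHT`, `collation_units` and `primary_key` are hypothetical.

```python
# FT/JUL collating sequence, one collation unit per space-separated item.
FT_JUL_SEQUENCE = (
    "ا آ ب بھ پ پھ ت تھ ٹ ٹھ ث ج جھ چ چھ ح خ "
    "د دھ ڈ ڈھ ذ ر رھ ڑ ڑھ ز ژ س ش ص ض ط ظ ع غ "
    "ف ق ک کھ گ گھ ل لھ م مھ ں ںھ ن نھ و وھ ہ ء ی ے"
).split()

# Primary weight of each unit is its 1-based position in the sequence.
WEIGHT = {unit: rank for rank, unit in enumerate(FT_JUL_SEQUENCE, start=1)}
DO_CHASHMI_HE = "\u06BE"  # the letter ھ


def collation_units(word: str):
    """Pre-processing: merge a letter and a following ھ into one unit."""
    units = []
    for ch in word:
        if ch == DO_CHASHMI_HE and units and units[-1] + ch in WEIGHT:
            units[-1] += ch  # contraction, e.g. ب + ھ becomes بھ
        else:
            units.append(ch)
    return units


def primary_key(word: str):
    """Primary level: one weight per collation unit."""
    unknown = len(WEIGHT) + 1  # assumption: unlisted units sort last
    return tuple(WEIGHT.get(unit, unknown) for unit in collation_units(word))


words = ["بھنگی", "بیٹا", "باپ", "بھابی", "بہنگی", "بہن"]

if __name__ == "__main__":
    for word in sorted(words, key=primary_key):
        print(word)
```

With these weights the six words come out as باپ، بہن، بہنگی، بیٹا، بھابی، بھنگی, which is the order the deck reports for FT and JUL. Swapping in a different sequence, or skipping the contraction step, is how the other dictionaries' orders would be approximated.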

Page 26: What Needs to be Done: Urdu

► If required, revisit and revise the Urdu character set
► Extensive work on sorting has been done at the linguistic level by NLA and UDB. Need to
  Standardize it
  Publicize it
► Need to develop it at the computational level, building a Collation Element Table to generate sort keys
  Standardize it
  Publicize it

Page 27: What Needs to be Done: Other Languages of Pakistan

► Need to work towards standardization of
  Character set
  Collating sequence
► Need to do gap analysis of character sets with Unicode/ISO 10646 for international standardization
► Need to develop Collation Element Tables for these languages for sorting

Page 28: Thank you

Questions?