Nurturing Living Languages © C-DAC. Mahesh D. Kulkarni, Group Coordinator, C-DAC GIST. 4th August 2006. Venue: Hotel Radisson, Noida. WELCOME

Mar 26, 2015


Hayden Pearson
Transcript
  • Slide 1

Nurturing Living Languages, C-DAC
Mahesh D. Kulkarni, Group Coordinator, C-DAC GIST
4th August 2006
Venue: Hotel Radisson, Noida
WELCOME

  • Slide 2

Indian Language Domain Name Registration: Issues and Solutions

  • Slide 3

Background

Social and economic growth is catalyzed by the presence of the Internet.
Development of the Internet has been mainly in English.
Domain names use only the 26 unaccented Latin letters, the 10 digits (0-9), the hyphen and the dot.
For the proliferation and preservation of heritage and culture, and for content creation in multiple languages, it is essential to have domain names in multilingual scripts.

  • Slide 4

Background

User enters IDN: www. (non-ASCII characters)
Application (such as a browser) converts it to ASCII Compatible Encoding (ACE): www.xn--3b7vcv67.com
Registry entry: xn--3b7vcv67.com (ASCII characters)
Other ACE labels shown: xn--e2br9czb, xn--m1be

  • Slide 5

Overview

India has the largest linguistic diversity in the world: 4 major language families, at least 35 different languages and around 2000 dialects.
Languages belong to the Indo-Aryan (ca. 74%), Dravidian (ca. 24%), Austro-Asiatic (Munda) (ca. 1.2%) or Tibeto-Burman (ca. 0.6%) families. Some of the languages of the Himalayas are still unclassified.
India has 22 scheduled languages, and English continues to be the associate (additional) official language.
The following scripts will be most needed: Assamese, Bangla, Devanagari, Gujarati, Gurmukhi, Kannada, Malayalam, Oriya, Tamil, Telugu, Urdu.

  • Slide 6

One script :: many languages

Devanagari: Hindi, Marathi, Konkani, Rajasthani, Sindhi, Nepali, Dogri, Santhali, etc.
Thus the Devanagari code page can support all languages using that script.
Solution: Though the contents would reveal the language used, it would be ideal if a special attribute code indicating the language were inserted.
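
The conversion described in Slide 4, from the name a user types to the ACE form stored in the registry, can be sketched with Python's standard library. This is only an illustration: the built-in "idna" codec implements the older IDNA 2003 rules (NamePrep plus Punycode), and the example name is a hypothetical one chosen here, not one taken from the slides.

```python
def to_ace(domain: str) -> str:
    """Return the ASCII Compatible Encoding (ACE) of a Unicode domain name."""
    # The standard-library "idna" codec works label by label:
    # ASCII labels pass through unchanged, other labels are run through
    # NamePrep and Punycode and then prefixed with "xn--".
    return domain.encode("idna").decode("ascii")


def from_ace(domain: str) -> str:
    """Return the Unicode form of an ACE domain name."""
    return domain.encode("ascii").decode("idna")


# Hypothetical example name (an assumption, not from the slides).
ace_name = to_ace("www.भारत.com")      # what the browser would send
unicode_name = from_ace(ace_name)       # what the user would see again
```

The round trip shows why the registry only ever needs to store ASCII: the application layer converts in both directions.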
  • Slide 7

One language :: many scripts

Konkani is written in Roman, Devanagari, Malayalam and Kannada.
Sindhi is written in Gurmukhi (Punjabi), Arabi (Perso-Arabic), Devanagari, Gujarati and also Roman.
Sindhi has adopted the Perso-Arabic script for representing the language. In the case of Konkani, Devanagari is used as the official script. Hence it is proposed that the same formula be used when attributing IDNs.
However, nothing stops a client from wanting his IDN in all the scripts, and this can be catered for efficiently by providing a broad-based transliteration facility which would transliterate a name from one Indian script to another. Thus a Konkani domain name in Devanagari could be transliterated into Kannada, Malayalam and Roman.
Solution: The best solution to this is by way of linguistic or political consensus.

  • Slide 8

The solution: A tool for transliteration from one Indian script to another can be easily deployed. The transliterated data could be presented to the client, who could verify the transliteration and see whether it meets his approval; if so, the IDN could be registered in all possible scripts.

  • Slide 9

Alternate mechanism

ACE, i.e. ASCII Compatible Encoding, is intimately tied to NamePrep (RFC 3491) and Punycode (RFC 3492) as well as to StringPrep (RFC 3454). ACE prepares an IDN string to be passed to Punycode for storage, where it is stored as 7-bit data.
We would like to make a case for the use of ISCII 91 as a parallel code for Brahmi-based scripts. ISCII deploys the same encoding for all Brahmi-based scripts.
The advantage of this is obvious: storage in ISCII would allow an IDN to be transliterated on the fly into any Indic script. This would ensure, at the Punycode level itself, that a name allotted in one script is automatically allotted in the other scripts to the same owner, thereby doing away with name squatting, which would otherwise be a regular feature of IDN allocation in Indic scripts.

  • Slide 10

1. IDN & THE PROBLEM OF ALLOTTING NAMES

The IDN server which will attribute the domain names is to be automated, and hence it is of vital interest that a mechanism of checks and counter-checks be set up to ensure the highest level of security.
Two major issues are at stake. These issues are mainly specific to Indian scripts and the complex nature of their visual rendering.

  • Slide 11

PROBLEM 1: DOUBLETS

The first is the need to ensure that doublets are avoided. Doublets are IDNs which are nearly alike, either as homophones or as close homographs.
Thus spelling Maharashtra in three different ways (the three spellings are not preserved in the transcript) can lead to identity confusion. Since all three spellings are different, the server would accept all of them as valid IDNs, whereas the original client would not want his IDN to be misused.

  • Slide 12

PROBLEM 2: SECURITY ISSUES

More serious is the willful use of such tactics to perpetrate fraud, by misleading a user into believing that he has logged on to a bona fide site and thus persuading him to divulge information such as his credit card number.
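
The on-the-fly transliteration proposed in Slides 7 to 9 relies on all Brahmi-based scripts sharing one code layout. As a rough sketch, the same idea can be shown with Unicode instead of ISCII (a substitution made here for illustration, not the mechanism the slides propose): the Unicode blocks for these scripts follow the ISCII ordering, so a letter can be moved to another script by shifting its code point by the distance between the two blocks. The function name, the table and the error handling are assumptions of this sketch.

```python
import unicodedata

# First code point of the 128-position Unicode block of each script.
BLOCK_START = {
    "devanagari": 0x0900, "bengali": 0x0980, "gurmukhi": 0x0A00,
    "gujarati": 0x0A80, "oriya": 0x0B00, "tamil": 0x0B80,
    "telugu": 0x0C00, "kannada": 0x0C80, "malayalam": 0x0D00,
}
BLOCK_SIZE = 0x80


def transliterate(name: str, source: str, target: str) -> str:
    """Shift every character of the source script into the target block."""
    src, dst = BLOCK_START[source], BLOCK_START[target]
    out = []
    for ch in name:
        offset = ord(ch) - src
        if 0 <= offset < BLOCK_SIZE:
            candidate = chr(dst + offset)
            # Not every position is filled in every script
            # (Tamil, for instance, has no letter at the "kha" position).
            if unicodedata.name(candidate, None) is None:
                raise ValueError(
                    f"U+{ord(ch):04X} has no counterpart in {target}")
            out.append(candidate)
        else:
            out.append(ch)  # dots, digits, Latin letters stay as they are
    return "".join(out)


# "Goa" written in Devanagari, moved to the Kannada block.
kannada_name = transliterate("\u0917\u094b\u0935\u093e", "devanagari", "kannada")
```

A pure offset shift is only a first approximation. A real tool would need per-script rules for the missing letters, which is one reason Slide 8 has the client verify the result before registration.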
  • Slide 13

UNDERLYING THESE PROBLEMS AND ISSUES ARE THREE MAJOR POTENTIAL SECURITY HOLES:
1. HOMOPHONES AND HOMOGRAPHS
2. SPELLING VARIANTS
3. SPELLING ERRORS
Each of these will be studied in relation to its pertinence to ensuring maximal security.

  • Slide 14

Homophones and Homographs

These are aural and visual look-alikes and, given the phonetic nature of Indian scripts, are a potential source of confusion. A typology of these has been established:
VISUAL LOOK-ALIKES
AURAL LOOK-ALIKES

  • Slide 15

Visual Look-Alikes-1: TWO LIGATURES HAVING PRACTICALLY THE SAME FORM

Devanagari: The first ligature is a half da + full dha; the second is a half dha followed by a full da. To an average reader of Hindi, the two forms look practically alike and lead to confusion.
A similar situation arises in the case of Gujarati: the first is ka+la, the second is ka+halanta+la.

  • Slide 16

Visual Look-Alikes-2: AMBIGUITIES ARISING OUT OF POSSIBLE UNICODE VARIANTS

This can best be seen in the case of nukta characters. These can be generated in two different ways (the example pairs are not preserved in the transcript). In each pair, the first is a single character, whereas the second is made up of two characters: the consonant followed by the dot, or nukta, character. To the naked eye the two look alike, whereas for the machine they would be two different IDNs.

  • Slide 17

Visual Look-Alikes-3: SIMILAR LOOKING CHARACTERS WITHIN THE SAME CODE PAGE

Within a code page, two characters can look practically alike and create ambiguity. This is especially the case when the font enabled on the client machine is not of high quality; given the size of the characters (normally 10 point), this can lead to confusion.
Some examples are given below for Devanagari (the example glyphs are not preserved in the transcript).

  • Slide 18

Visual Look-Alikes-4: IDENTICAL CHARACTERS IN UNICODE

This is the case with an Urdu and Sindhi glyph. Character U+06A9 is the letter /keheh/ in Urdu, whereas the same symbol in Sindhi represents /kheheh/. Since both fall within the same code page, aural disambiguation is impossible except by recourse to the language used.

  • Slide 19

Aural Look-Alikes: Homophones

Indian languages being phonetic in nature, aural representation is a major issue. The problems arise mainly from the fact that Indian languages are generally typed as they are spoken. Very often they arise out of spelling variants and/or the user's ignorance of the correct spelling of the word.
A large number of sub-types of problems can emerge from such homophonic representations.

  • Slide 20

Aural Look-Alikes: Homophones-1

Confusion between the two nasal modifiers (wherever such nasal modifiers exist): Hindi, Gujarati (examples not preserved).
Confusion between two or more similar-sounding consonants (normally dental vs. retroflex, sibilants and laterals): Marathi, Gujarati (examples not preserved).
Confusion arising out of short and long vowels: Tamil, Gujarati, Hindi (examples not preserved).

  • Slide 21

Aural Look-Alikes: Homophones-2

Absence or presence of a halanta. This is a source of errors even among educated speakers of the language. Proper names tend to be written at times with and at times without the halanta. Thus the name Shirke in Marathi can be written in two ways, of which the first is correct and the second is not normatively valid but could be accepted (examples not preserved).
Confusion arising out of the use of the rakar + u matra instead of the vowel form (examples not preserved).
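
The nukta ambiguity of Slide 16, where one visible letter can be stored either as a single character or as consonant plus nukta, is the kind of doublet that Unicode normalization is designed to remove. The sketch below is an illustration only; the function names are made up here, and it says nothing about visually similar but genuinely different letters (Slides 15 and 17), which normalization does not touch.

```python
import unicodedata


def canonical_form(label: str) -> str:
    """Map canonically equivalent spellings to one code point sequence."""
    return unicodedata.normalize("NFC", label)


def is_same_name(a: str, b: str) -> bool:
    """True when two labels differ at most in their Unicode encoding."""
    return canonical_form(a) == canonical_form(b)


# Devanagari QA as one precomposed character ...
precomposed = "\u0958"
# ... and as KA followed by the combining nukta sign.
decomposed = "\u0915\u093c"

raw_equal = precomposed == decomposed                 # compares stored code points
normalized_equal = is_same_name(precomposed, decomposed)
```

NamePrep (RFC 3491) already applies a Unicode normalization step (NFKC) before Punycode, so a registry that follows it compares the two spellings as one name.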
  • Slide 22

Aural Look-Alikes: Homophones-3

A remote source of error would be the use of the Visarga, or vowel lengthener, to modify an IDN. The Visarga is mainly used in Sanskrit and very rarely in the neo-Indo-Aryan languages. However, an IDN with or without the Visarga could create ambiguity.

  • Slide 23

Aural Look-Alikes: Homophones-4

Insertion of a zero-width character (ZWJ/ZWNJ) within the name string (examples not preserved): the first has no non-joiner, the second has a non-joiner. Visually both look alike and can lead to confusion.

  • Slide 24

SUB-TYPE II: SPELLING VARIANTS

This is best seen in the case of Hindi, where a nasal modifier can substitute for the corresponding half nasal consonant. The word Hindi itself can be written in either of two ways (examples not preserved). Obviously two IDNs based on these spelling variants should not be allowed; they must be resolved to the same norm.
A similar situation exists in Marathi in the use of the timba vs. the /e/ vowel modifier. The first is used in colloquial Marathi in special environments, whereas the second is the literary form. A filter which normalizes the two would have to be written.
Other languages and scripts display similar patterns.

  • Slide 25

More examples (not preserved in the transcript)

  • Slide 26

SUB-TYPE III: SPELLING ERRORS

These, whether conscious or unconscious, could create homographic doublets and need to be detected in order to ensure that the client does not have a spurious IDN competing with his real IDN. Misspellings of words and inversions can all lead to IDN doublets.
A good example is words in Hindi which have Urdu roots and which admit spellings both without the halanta (Urdu norm) and with the halanta (Hindi aural norm).

  • Slide 27

2. PROPOSED RECOMMENDATIONS

  • Slide 28

Proposed Recommendations

An action plan has been proposed for ensuring maximum security in the allotment of IDNs in Indian scripts. It takes the shape of recommendations arising out of discussions. The recommendations are both specific and generic in nature.

  • Slide 29

Proposed Recommendations: GENERIC STRATEGIES-1

Creation of levels. Four levels are provided:
Level 1: Highest security
Level 2: Government bodies and institutions (banks, insurance, healthcare, etc.)
Level 3: Corporates and NGOs
Level 4: All other users

  • Slide 30

Proposed Recommendations: GENERIC STRATEGIES-2

The implementation should be tested in TESTBED mode, and IDNs should be allotted in a phased manner.
Level 1 (highest security) and Level 2 (government bodies and institutions) should be permitted to register in the test bed mode. This will also have the advantage of automatically blocking all attempts by spoofers and hackers to squat on such names. Level 1 and Level 2 names should be automatically denied to other users.
At this stage the automated software for providing variants based on visual and homophonic identities should be set in place.

  • Slide 31

Proposed Recommendations: GENERIC STRATEGIES-2 (continued)

Subsequently Level 3, i.e. corporates and NGOs, should be allowed to register. The software, which will generate all possible variants of their names as per the rules of the language, can propose these variants to them. If they so desire, they can register all these variants or leave them open, after being explicitly warned that leaving them open could lead to spoofing.
Level 4 can be integrated at the end.
Phased allotment of IDNs will to a large extent eradicate spoofing and phishing and ensure maximal security.

  • Slide 32

Proposed Recommendations: SPECIFIC ISSUES

1. Two script code pages should not be mixed.
2. As far as possible, numbers (digits) should not be used, unless they acquire a linguistic value, such as 365, 24/7, etc. Domain names are not like mail applications, where the name can be followed by a digit.
3. Punctuation marks should be avoided as far as possible. These can also result in confusion, as in the case of the eyelash repha in Marathi (example not preserved).
4. Although under ideal circumstances correct spelling would be the norm, the first instance of a name registered, even if it is incorrect, would be deemed registered, and all further variants generated by the software, including the correct one, would be reserved or permitted as per the wish of the sanctioning authority.

  • Slide 33

Proposed Recommendations: SPECIFIC ISSUES-2

5. The whole process is to be automated by means of software which will ensure to the highest degree that the security holes are not breached. Given that there would be a large number of applications, and that manual processing would not be possible or, if possible, would result in inordinate delays, automation is a prerequisite.

  • Slide 34

Action Plan-1

Identification of potential zones: potential zones for ensuring security were identified. These are:
Creation of variant lists
List of potential spelling variants
List of potential zones of error in terms of misspellings which are not trapped by the variants list

  • Slide 35

Explanatory documents and templates for each of the desired data sets were provided by CDAC GIST to those concerned. The templates gave examples for each type of requirement, as in the sample template below (not preserved in the transcript).

  • Slide 36

Report-1

CDAC, Pune has been entrusted with the creation of data for three languages: Hindi, Marathi and Urdu.
As per agreement, expert committees for all three languages have been appointed, the experts being professors and experts working in the publishing industry, since these have the linguistic skills and know-how to investigate and create the required data.
A translation of the three-letter extensions of the names has also been provided. To ensure across-the-board intelligibility, this is in Sanskrit.
In the slides that follow, samples of the quantum of work accomplished in each of the languages will be detailed.

  • Slide 37

Report-2

Translation of IDN extensions, a sample (the translations are not preserved in the transcript): 1) EDU 2) GOV 3) IN 4) COM 5) ORG 6) MIL 7) RES 8) AC 9) TRAVEL 10) MOBI 11) NET 12) INT 13) MED 14) AGRI

  • Slide 38

Report-1: Marathi

In the case of Marathi, a committee headed by Shri Phadake, who has books on shuddha-lekhan to his credit, has been appointed. Work has commenced in all three areas:
Variants list
Spelling variants
Erroneous spellings
A large number of rules have been generated, as has data on spelling variants and misspellings.

  • Slide 39

Report-1: Marathi: Sample image of variants list

  • Slide 40

Report-1: Marathi: Sample image of variants list

  • Slide 41

Report-1: Marathi: Sample image of multiple spellings and misspellings

  • Slide 42

Report-2: Hindi

A similar exercise has been carried out for Hindi. Sample files are provided below. Over 100 different rule variants have been identified.

  • Slide 43

Report-2: Hindi

Spelling variants and misspellings for Hindi: over 300 collected at present.

  • Slide 44

Report-3: Urdu

Under the able guidance of Prof Yunus Fahmi, spelling variants, misspellings and variant lists are being created.
Some sample files for the variant list and spelling variants are appended.

  • Slide 45

Report-3: Urdu

Urdu spelling variants (over 280 in number)

  • Slide 46

Report-3: Urdu

Urdu spelling variants in PASCII (over 280 in number)

  • Slide 47

List of Official languages of India

Language | Official language of | Family | Script
Assamese | Assam | Indo-Aryan | Bangla (modified)
Bengali | Tripura and West Bengal | Indo-Aryan | Bangla
Bodo | Assam | Tibeto-Burman | Devanagari, Bangla (modified)
Dogri | Jammu and Kashmir | Indo-Aryan | Devanagari, Perso-Arabic
Gujarati | Dadra and Nagar Haveli, Daman and Diu, and Gujarat | Indo-Aryan | Gujarati
Hindi | Andaman and Nicobar Islands, Bihar, Chandigarh, Chhattisgarh, Delhi, Haryana, Himachal Pradesh, Jharkhand, Madhya Pradesh, Rajasthan, Uttar Pradesh and Uttaranchal | Indo-Aryan | Devanagari

  • Slide 48

List of Official languages of India

Language | Official language of | Family | Script
Kannada | Karnataka | Dravidian | Kannada
Kashmiri | Kashmir | Indo-Aryan | Perso-Arabic, Devanagari
Konkani | Goa | Indo-Aryan | Devanagari, Roman, Malayalam, Kannada
Maithili | Bihar | Indo-Aryan | Devanagari
Malayalam | Kerala and Lakshadweep | Dravidian | Malayalam
Manipuri | Manipur | Tibeto-Burman | Bangla, Meetei-Mayek
Marathi | Maharashtra | Indo-Aryan | Devanagari
Nepali | Sikkim | Indo-Aryan | Devanagari

  • Slide 49

List of Official languages of India

Language | Official language of | Family | Script
Oriya | Orissa | Indo-Aryan | Oriya
Punjabi | Punjab | Indo-Aryan | Gurumukhi, Shahmukhi
Sanskrit |  | Indo-Aryan | Devanagari
Santali |  | Munda | Devanagari, Ol Chiki
Sindhi |  | Indo-Aryan | Perso-Arabic, Devanagari, Gujarati, Roman
Tamil | Tamil Nadu and Pondicherry | Dravidian | Tamil
Telugu | Andhra Pradesh | Dravidian | Telugu
Urdu | Jammu and Kashmir | Indo-Aryan | Perso-Arabic

  • Slide 50

THANK YOU
Nurturing living languages
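
Slides 23 and 24 ask for a filter that resolves zero-width insertions and nasal spelling variants "to the same norm". The sketch below shows one possible shape for such a filter for Devanagari. Everything in it is an assumption made for illustration: the function name, the choice of the anusvara as the norm, and the rule table, which covers only the five class nasals and is not the rule set produced by the expert committees.

```python
import re

ZERO_WIDTH = {"\u200c", "\u200d"}   # ZWNJ and ZWJ
HALANTA = "\u094d"
ANUSVARA = "\u0902"

# Each class nasal and the stop consonants of its own class (varga).
NASAL_CLASSES = {
    "\u0919": "\u0915\u0916\u0917\u0918",   # nga before ka, kha, ga, gha
    "\u091e": "\u091a\u091b\u091c\u091d",   # nya before ca, cha, ja, jha
    "\u0923": "\u091f\u0920\u0921\u0922",   # nna before tta, ttha, dda, ddha
    "\u0928": "\u0924\u0925\u0926\u0927",   # na before ta, tha, da, dha
    "\u092e": "\u092a\u092b\u092c\u092d",   # ma before pa, pha, ba, bha
}


def comparison_key(label: str) -> str:
    """Fold a Devanagari label to a key used only for doublet detection."""
    # Step 1: drop invisible joiners, which change nothing the user can see.
    key = "".join(ch for ch in label if ch not in ZERO_WIDTH)
    # Step 2: rewrite "half nasal + stop of the same class" as "anusvara + stop".
    for nasal, stops in NASAL_CLASSES.items():
        key = re.sub(f"{nasal}{HALANTA}(?=[{stops}])", ANUSVARA, key)
    return key


# The two spellings of "Hindi" mentioned in Slide 24.
with_half_na = "\u0939\u093f\u0928\u094d\u0926\u0940"
with_anusvara = "\u0939\u093f\u0902\u0926\u0940"
is_doublet = comparison_key(with_half_na) == comparison_key(with_anusvara)
```

The key is meant for comparing a new application against names already registered, not for display: two labels with the same key would be treated as variants of one name, in line with recommendation 4 of Slide 32. Stripping joiners is a simplification, since in some scripts a joiner changes the rendered form.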