Building digital libraries in Indian languages: case studies with Hindi and Kannada B.S. Shivaram Trainee (2001-2002) National Center for Science Information

Dec 30, 2015

Fay Reeves
Page 1:

Building digital libraries in Indian languages: case studies with Hindi and Kannada

B.S. Shivaram
Trainee (2001-2002)

National Center for Science Information

Page 2:

Table of Contents

• Introduction to Multilingual Digital Libraries
• Different Character Sets and Encodings
• Statement of the problem
  • Objectives
  • Need for the project
• Methodology
• Implementation
• System description
• Observations
• Limitations
• Conclusion
• Future developments

Page 3:

Multilingual Digital Library

• Library

• Digital library

• Monolingual digital library

• Multilingual digital library

Page 4:

Definition of MDL

According to Ana M. B. Pavani

“A multilingual digital library is a digital library that has all functions implemented simultaneously in as many languages as desired and whose search & retrieve functions are language independent”.

Page 5:

Terms related to multilingualism

• i18n (internationalization)

• Localization

• Multilingual digital library

• Multilingual documents (ಕನ್ನಡ, हिन्दी, મંગલ)

• Cross-language Retrieval

Page 6:

Issues of MDL

• Multiple language recognition, manipulation and display.

• Multilingual or cross-language search and retrieval

Page 7:

Character set and Encodings

• Character set (charset): a collection of characters, as a human reader would understand them.

Ex: ಅ, ಆ, ಇ, ಈ, and so on belong to the Kannada charset

अ, आ, इ, ई, and so on belong to the Hindi (Devanagari) charset

A, B, C, D, and so on belong to the Latin charset used for English

• Character encoding: a way of storing characters on a computer as bits.
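As a minimal sketch of the charset/encoding distinction above (not part of the original slides), the following Python snippet stores one character from each example charset as bits under three encodings. The helper name `show_bits` is hypothetical.

```python
# Illustrative only: the same character becomes a different bit pattern
# depending on the encoding that is chosen.

def show_bits(char: str, encoding: str) -> str:
    """Return the encoded bytes of `char` as space-separated 8-bit groups."""
    return " ".join(f"{byte:08b}" for byte in char.encode(encoding))

# One character from each charset mentioned on the slide.
samples = {
    "Latin A (U+0041)": "A",
    "Hindi A (U+0905)": "\u0905",
    "Kannada A (U+0C85)": "\u0c85",
}

for label, char in samples.items():
    for encoding in ("utf-8", "utf-16-be", "utf-32-be"):
        print(f"{label:20} {encoding:10} {show_bits(char, encoding)}")
```

The character is the same in every row for a given label; only the stored bits differ.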

Page 8:

Different character sets and encodings

• ASCII
• ISO-8859 series
• Windows series
• User defined
• ISO 10646
  • UTF-8
  • UTF-16
  • UTF-32

Page 9:

Unicode

• Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.
• Developed by the Unicode Consortium
• There are many versions; 3.2.0 is the current one
• Accommodates more than 65,000 characters
• Synchronized with the corresponding versions of ISO 10646

Page 10:

Unicode

• Standards incorporated under Unicode
  • ISO 6937, ISO 8859 series
  • ISCII, KS C 5601, JIS X 0208, JIS X 0212, GB 2312, CNS 11643, etc.
• Scripts and characters
  • European alphabetic scripts
  • Middle Eastern right-to-left scripts
  • Scripts of Asia
  • Indian scripts: Devanagari, Bengali, Gurmukhi, Oriya, Tamil, Telugu, Kannada, Malayalam
  • Punctuation marks, diacritics, mathematical symbols, technical symbols, arrows, dingbats, etc.

Page 11:

Assigning Character Codes

• A unique number, called a code point, is assigned to each code element.
• Code points are written as hexadecimal numbers with the prefix "U+". For example, U+0041 is the code point of the character "A".
• Unicode groups characters together by script in code blocks.
• Code blocks vary in size, depending on the size of the script.
• Code elements are grouped logically throughout the range of code points, called the codespace.
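The code point and code block ideas above can be sketched in Python as follows. This is an illustration only: the helper names `code_point` and `block_of` are hypothetical, and the three block ranges are copied from the published Unicode code charts rather than from the slides.

```python
import unicodedata

def code_point(char: str) -> str:
    """Format a character's code point in the conventional U+XXXX notation."""
    return f"U+{ord(char):04X}"

# A small excerpt of the codespace: (first, last) code point of each block.
BLOCKS = {
    "Basic Latin": (0x0000, 0x007F),
    "Devanagari": (0x0900, 0x097F),
    "Kannada": (0x0C80, 0x0CFF),
}

def block_of(char: str) -> str:
    """Return the name of the listed block that contains `char`, or 'other'."""
    value = ord(char)
    for name, (first, last) in BLOCKS.items():
        if first <= value <= last:
            return name
    return "other"

for char in ("A", "\u0905", "\u0c85"):
    print(code_point(char), unicodedata.name(char), "->", block_of(char))
```

Because each script occupies a contiguous block, a simple range check is enough to tell which script a character belongs to.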

Page 12:

Text handling

• Computer text handling involves processing and encoding.
• The Unicode Standard directly addresses only encoding; processing is carried out by software.
• It does not define glyph images (the visual shapes of characters); display software retrieves the glyphs.
• The Unicode Standard does not specify the size, shape, or orientation of on-screen characters.
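To illustrate the split between encoding and display, the sketch below (an added example, not from the slides) lists the code points stored for the Devanagari word for "Hindi". The text holds only a sequence of code points in logical order; how they are drawn is left to the font and the display software.

```python
import unicodedata

# The word "Hindi" in Devanagari, written as explicit code points.
word = "\u0939\u093f\u0928\u094d\u0926\u0940"

# Print each stored code element with its official Unicode name.
for char in word:
    print(f"U+{ord(char):04X}", unicodedata.name(char))

names = [unicodedata.name(char) for char in word]

# The vowel sign I is stored AFTER the letter HA, although a font draws it
# to the left of that letter. The stored order is logical, not visual.
print(names[0], "is followed in storage by", names[1])
```

On screen the word appears as a few joined shapes, but the stored text is six separate code points.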

Page 13:

Objectives

• To assess the suitability of the Greenstone Digital Library software (GSDL) for developing digital library collections in Indian languages (Hindi and Kannada)

• To create search and browse interfaces for GSDL in Hindi and Kannada

Page 14:

Need

• Immeasurable amount of literature in many languages
• E-publishing in Indian languages
• E-governance in India
• E-learning
• Digital libraries for the rural population

Page 15:

Greenstone Digital Library Software

• Open source
• Developed by the Computer Science Department, University of Waikato, New Zealand
• http://greenstone.org
• Can handle different file formats
• Works on different platforms
• Supports many languages through Unicode

Page 16:

Multilingual support

• Interface part

• Content part

Page 17:

Methodology

Software
• Windows XP operating system
• GSDL
• Macromedia Fireworks
• Nudi
• Baraha
• Internet Explorer 6.0

Hardware
• Pentium III with 128 MB RAM

Page 18:

Hindi and Kannada Interface

• Separate .dm files were created for each language

_textimagehome_ {Home Page}

_textimagehome_ [l=kn]{कि सु&#2330 }

Creating tabs for Hindi & Kannada

Hindi Tabs
• Macromedia Fireworks
• Baraha transliteration software

Kannada Tabs
• Macromedia Fireworks
• Nudi transliteration software
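The translated macro line on this slide contains the fragment &#2330, which looks like a decimal numeric character reference (2330 is U+091A, a Devanagari letter). Assuming the translated strings were stored that way, which the slides do not confirm, the conversion between Indian-script text and such references could be sketched in Python as below. The function names are hypothetical, and nothing here is Greenstone code.

```python
import html

def to_char_refs(text: str) -> str:
    """Replace each non-ASCII character with a decimal reference like &#2330;."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")

def from_char_refs(text: str) -> str:
    """Turn numeric character references back into the characters they name."""
    return html.unescape(text)

# The word "Hindi" in Devanagari, used here as a sample interface label.
label = "\u0939\u093f\u0928\u094d\u0926\u0940"
encoded = to_char_refs(label)

print(encoded)                            # ASCII-only form, safe in a Latin-encoded file
print(from_char_refs(encoded) == label)   # the round trip restores the original text
```

Plain ASCII strings such as "Home Page" pass through `to_char_refs` unchanged, so the same routine could be applied to every entry of a macro file.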

Page 19:

Collection building

हिन्दी काव्यालय (Hindi Kaavyaalaya): downloaded from http://manaskriti.com/kaavyaalaya/

ಉದಯವಾಣಿ ಸಂಗ್ರಹಣೆ (Udayavani collection): downloaded from http://udayavani.com

हिन्दी Unicode collection (Hindi Unicode collection)

ಕನ್ನಡ ಯುನಿಕೋಡ್ ಸಂಗ್ರಹಣೆ (Kannada Unicode collection)

Page 20:

System description

हिन्दी काव्यालय / ಉದಯವಾಣಿ ಸಂಗ್ರಹಣೆ (Hindi Kaavyaalaya / Udayavani collection):
• Fonts in the font folder: Susha / Shree-Kan-0850
• Language interface: Hindi / Kannada
• Preference encoding: Latin-based
• Browser encoding: Latin-based or user-defined

Hindi/Kannada Unicode collection:
• Fonts in the font folder: Mangal (Hindi) / Tunga (Kannada)
• Language interface: Hindi / Kannada
• Preference encoding: UTF-8
• Browser encoding: UTF-8
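Both configurations above pair a preference encoding with a matching browser encoding. As a minimal sketch of why the two must agree (an added illustration, not taken from the slides), the Python snippet below decodes the same UTF-8 bytes once correctly and once as a Latin-based encoding.

```python
# The word "Hindi" in Devanagari, stored as UTF-8 bytes by a Unicode collection.
text = "\u0939\u093f\u0928\u094d\u0926\u0940"
stored = text.encode("utf-8")

# Reader configured for UTF-8: the original text comes back.
as_utf8 = stored.decode("utf-8")

# Reader configured for a Latin-based encoding: every byte is treated as a
# separate Latin character, so the Devanagari word turns into unreadable text.
as_latin1 = stored.decode("latin-1")

print(as_utf8 == text)                           # True
print(len(text), len(stored), len(as_latin1))    # 6 18 18
```

The bytes on disk are identical in both cases; only the encoding assumed by the reader changes what is displayed.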

Page 21:

Observations

• Interfaces can be provided in many languages.
• Collections can be built in many languages, including in encodings other than Unicode.
• The non-Unicode collections have only the browse feature.
• Titles of the non-Unicode collections were in English.
• Unicode collections have both search and browse features.
• All collections can be accessed over the network.

cont…

Page 22:

Observations

• Uses the MG (Managing Gigabytes) compression technique.
• Can browse lists of authors, lists of titles, lists of dates, and so on.
• Can handle very large collections.
• New data can be added to an existing collection at any time.
• Open-source software; anybody can develop it further, and it can be adapted to local requirements.

Page 23:

Limitations

• Fails to display Unicode HTML files in Hindi/Kannada.
• It does not support truncated searching for Indian scripts.
• The case differences option cannot be disabled in the preferences page.
• At present, the search feature works only on Windows XP.
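Truncated searching means matching every indexed word that begins with a given stem, usually written with a trailing "*". The Python sketch below shows the behaviour that the second limitation refers to, applied to Devanagari words. It is an added illustration under stated assumptions, not Greenstone's implementation: the function name, the "*" convention, and the sample vocabulary are all hypothetical.

```python
import unicodedata

def normalize(term: str) -> str:
    """Apply NFC so that equivalent code point sequences compare equal."""
    return unicodedata.normalize("NFC", term)

def truncated_search(query: str, vocabulary: list) -> list:
    """Return words matching `query`; a trailing '*' means 'any ending'."""
    query = normalize(query)
    if query.endswith("*"):
        stem = query[:-1]
        return [word for word in vocabulary if normalize(word).startswith(stem)]
    return [word for word in vocabulary if normalize(word) == query]

# Hypothetical index terms, written as explicit code points.
BOOK = "\u092a\u0941\u0938\u094d\u0924\u0915"            # Hindi word for "book"
LIBRARY = BOOK + "\u093e\u0932\u092f"                    # Hindi word for "library"
QUESTION = "\u092a\u094d\u0930\u0936\u094d\u0928"        # Hindi word for "question"
vocabulary = [BOOK, LIBRARY, QUESTION]

print(truncated_search(BOOK + "*", vocabulary))  # both "book" and "library"
print(truncated_search(BOOK, vocabulary))        # exact match only
```

Prefix matching on code points is only a first approximation for Indian scripts; proper handling of word forms would need the stemming work listed under future developments.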

Page 24:

Conclusion

Multilingual digital libraries will be ubiquitous in the future and will provide the basis for a very broad set of distributed activities, including computer-supported cooperative work and distance learning. Developing countries like India, where many languages are in use, could utilize comprehensive software such as Greenstone, which, being open-source software, is readily extensible to meet the needs of multilingualism.

Page 25:

Future developments

• It can be extended to other Indian languages that Unicode supports.
• The display problem with HTML files can be solved for Indian languages by creating model mappings in the UTF-8 charset.
• Collections can be tested with different file formats, such as PDF, RTF, and e-mail, for other Indian languages.
• It can be tested with other operating systems, such as UNIX and Linux, and with browsers such as Netscape and Opera, to assess their compatibility.
• Stemming algorithms for Indian languages can be developed and incorporated into GSDL.

Page 26:
Page 27:

Any Q’s

ಪ್ರಶ್ನೆಗಳಿವೆಯೆ?

कोई प्रश्न?

Page 28:

Thank you

ವಂದನೆಗಳು

धन्यवाद