Status and Challenges of Local Language Computing and BRAC University’s Initiative

Naushad UzZaman

Research Programmer

Center for Research on Bangla Language Processing

BRAC University

D. Net’s 5th Anniversary Seminar Series: Youth and ICTs: ICT and Localization

29th January, 2006

N. UzZaman, BRAC University

ICT and Localization, 29/1/06

Outline

• Statistics of Bangla language speakers

• Localization and local language computing

• BRAC University’s Initiative

• Local and Regional Initiatives

Statistics of Bangla language speakers

• Spoken by 245 million people

• 7th most widely spoken language

• Spoken mainly in Bangladesh and the Indian state of West Bengal

• More than 144 million speakers are in Bangladesh

Why localization?

• The masses can harness the power of information

• National Interest: digital divide, governance, language preservation, …

Localization

• Internationalized software in local languages

• Few groups are working actively
– Ankur, Ekushey, D.Net (content development)

• Active projects
– Linux, Mozilla, Open Office

Larger picture

• Good start, but a long way to go!

• Local language computing: advanced applications
– Optical character recognition
– Machine translation
– Speech synthesis
– Speech recognition
– Dialog systems

Challenges

• Language Resources
– Fonts
– Lexicon (word list)
– Corpus (collection of texts)
– Tag the lexicon and corpus

Challenges for the next few years!

• Language processing research
– Document authoring (desktop, web (blogs, forums, emails), etc.)
– Morphological analyzer
– Speech processing
– Information Retrieval (web searching, name searching, spelling checker)
– OCR (Optical Character Recognition)
– Syntactic analysis (can be used in MT)
– Machine Translation
– And many more…

Status of Bangla Computing

• Scattered work done, very little unification

• Scarcity of free and open-source software

• Little or no attention paid to computational linguistics, the backbone

• Many individuals are working; the result is a few good publications at ICCIT, IUB’s ICCPB, and other conferences

BRAC University’s Initiative

• Research Lab (Center for Research on Bangla Language Processing)
– 9 full-time research staff (6 with CS backgrounds, 3 with linguistics backgrounds)
– Seed funding from the PAN Localization project of IDRC
– Students working part-time and doing internships
– Software and documents are all OPEN SOURCE

• Academics
– Course on Natural Language Processing
– Student projects and theses on NLP

Status of BU Research lab’s work

• Publications
– ICCIT 2004: 3 (Morphology 2, spelling checker)
– BU Journal: 1 (Morphological parsing)
– IASTED CI: 1 (Name searching)
– IEEE NLP KE 05: 1 (Spelling checker)
– ICCIT 2005: 1 (Morphology)
– Undergraduate Thesis: 3 (Phonetic encoding, OCR, Bangla text input in mobile)
– Total: 10

• 4 more research papers submitted

• Ongoing theses: 4

Status of BU Research lab’s work

• Invited talks:
– University of Toronto CS Seminar
– Stanford University NLP group (May 2005)
– IDRC Partners Conference in Cambodia (June 2005)
– IJCNLP 2005, Jeju Island, Korea (October 2005)

Language Resources

• Fonts: Good open-source fonts available

• Lexicon:
– List of 80+ thousand words; expected to be 110 thousand in the next release
– Tagging and annotation are underway. Significant and large project

• Corpus:
– Yet to begin

Language processing research

• Document authoring
– Editor, Banglapad:
  • Open source, platform independent, rich text editor (supports Bangla spell checking, export to HTML)
  • Status: Version 1, Release candidate 1
  • http://sourceforge.net/projects/banglapad
– Transliteration, pata:
  • Type phonetically in English and you will get a similar-sounding dictionary word
  • Desktop application: http://sourceforge.net/projects/pata; Status: Complete
  • Web-based transliteration: Status: Expected by June 2006
– Community network tools:
  • Set of tools for community networking (blogs, forums, etc.) in Bangla
  • Not only content authoring but also web services such as a spelling checker
  • Status: Expected by early 2007
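The phonetic typing idea behind the transliteration tool can be illustrated with a small sketch. This is not pata's actual algorithm, which the slides do not describe; the mini-lexicon, the romanized spellings, and the key rules below are all hypothetical and chosen only to show how differently typed inputs can land on the same dictionary word.

```python
import re

# Hypothetical mini-lexicon: one romanized spelling -> Bangla dictionary word.
LEXICON = {
    "ami": "আমি",
    "bangla": "বাংলা",
    "bhasha": "ভাষা",
    "boi": "বই",
}


def phonetic_key(roman):
    """Reduce a romanized word to a rough sound-alike key (toy rules, assumed)."""
    key = roman.lower()
    for src, dst in (("sh", "s"), ("bh", "b"), ("v", "b"),
                     ("aa", "a"), ("ee", "i"), ("oo", "u")):
        key = key.replace(src, dst)
    # Collapse any remaining doubled letters, e.g. "ll" -> "l".
    return re.sub(r"(.)\1+", r"\1", key)


# Index the lexicon by key so similar-sounding inputs reach the same entry.
INDEX = {}
for roman_form, bangla_word in LEXICON.items():
    INDEX.setdefault(phonetic_key(roman_form), []).append(bangla_word)


def transliterate(roman):
    """Return the dictionary words whose key matches the typed input."""
    return INDEX.get(phonetic_key(roman), [])


if __name__ == "__main__":
    for typed in ("vasha", "baangla", "aami", "xyz"):
        print(typed, "->", transliterate(typed))
```

Indexing the lexicon by key keeps each lookup to a single dictionary access. A real system would need a far richer set of rules and a way to rank several candidates that share a key.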

Language processing research

• Morphology:
– Verb morphology is reasonably complete
– Noun morphology is somewhat usable, but much more needs to be done
– Statistical methods for dealing with Bangla compound words and blends are being worked on

• Grapheme To Phoneme (G2P):
– Digital pronunciation dictionary
– Useful step for speech processing
– Status: Expected by June 2006

Language processing research

• Speech Processing
– Text-to-speech:
  • Voice for Festival
  • Status: First demo expected by May 2006
– Automatic Speech Recognition:
  • Limited vocabulary segmented speech recognition
  • Status: First demo expected by August 2006

Language processing research

• Information Retrieval:
– Spelling checker:
  • Gives phonetic suggestions and ranks them phonetically
  • http://sourceforge.net/projects/puspaspeller/
  • Integrated with other text editors, Banglapad
  • Status: Complete
– Searching
  • Phonetic web searching for Bangla
  • Input can be English or Bangla
  • Status: Expected by June 2006
– Name searching
  • Can be used in hospitals, institutes, census, etc.
  • Status: Expected by October 2006
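Phonetic ranking of spelling suggestions can be sketched roughly as follows. The slides do not specify the encoding the lab's spelling checker uses, so the merge table here is an illustrative assumption: it folds a handful of Bangla letters that are commonly pronounced alike into one code, then ranks lexicon words by edit distance between codes. The tiny lexicon is also hypothetical.

```python
# Assumed, illustrative merge table: letters that commonly sound alike
# are folded into a single representative before comparison.
MERGE = {
    "ষ": "শ", "স": "শ",   # sibilants
    "ণ": "ন",             # nasals
    "ী": "ি",             # long/short i vowel signs
    "ূ": "ু",             # long/short u vowel signs
}

# Hypothetical mini-lexicon of correctly spelled words.
LEXICON = ["বাংলা", "বাণী", "নদী", "ভাষা"]


def phonetic_code(word):
    """Map each character to its sound-alike representative."""
    return "".join(MERGE.get(ch, ch) for ch in word)


def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def suggest(word, lexicon=LEXICON, limit=3):
    """Rank by phonetic-code distance first, raw spelling distance second."""
    code = phonetic_code(word)
    ranked = sorted(
        lexicon,
        key=lambda w: (edit_distance(code, phonetic_code(w)),
                       edit_distance(word, w)),
    )
    return ranked[:limit]


if __name__ == "__main__":
    print(suggest("বানি"))  # misspelling that sounds like "বাণী"
    print(suggest("নদি"))   # misspelling that sounds like "নদী"
```

Comparing codes rather than raw spellings means a misspelling that sounds identical to a dictionary word gets distance zero and rises to the top, even when several characters differ on the page. The same idea carries over to name searching, where spelling variants of one name should match.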

Language processing research

• Pattern recognition/image processing/document processing:
– Document skew correction: Bangla document skew corrector based on Radon transform. Complete.
– Segmentation:
  • Bangla line segmentation: Complete
  • Bangla word segmentation: Complete
  • Bangla character segmentation: Work in progress. The large number of combinations (consonant clusters and the non-spacing marks) complicates this task. This is omnifont, so it must work with any typeface.
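As a rough illustration of Radon-transform-based skew estimation, the sketch below projects the ink pixels of a page along a set of candidate angles and picks the angle whose projection is most sharply peaked. The slides give no details of the lab's implementation, so the scoring rule (sum of squared bin counts), the angle range, the step size, and the synthetic test page are all assumptions.

```python
import math


def projection_score(pixels, angle_deg):
    """One Radon projection: bin ink pixels along lines tilted by angle_deg.

    The score (sum of squared bin counts) is largest when the ink piles up
    in few bins, which happens when the angle matches the text lines.
    """
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    bins = {}
    for x, y in pixels:
        # Offset of the pixel from a line through the origin tilted by theta.
        r = round(y * cos_t - x * sin_t)
        bins[r] = bins.get(r, 0) + 1
    return sum(count * count for count in bins.values())


def estimate_skew(pixels, max_angle=5.0, step=0.5):
    """Return the candidate angle (degrees) with the most peaked projection."""
    steps = int(round(2 * max_angle / step))
    candidates = [-max_angle + i * step for i in range(steps + 1)]
    return max(candidates, key=lambda a: projection_score(pixels, a))


def synthetic_page(skew_deg, lines=5, width=200, spacing=20):
    """Hypothetical test page: each text line is a one-pixel-thick stroke."""
    slope = math.tan(math.radians(skew_deg))
    return [(x, round(k * spacing + x * slope))
            for k in range(lines)
            for x in range(width)]


if __name__ == "__main__":
    page = synthetic_page(3.0)
    print("estimated skew:", estimate_skew(page))
```

Once the angle is known, rotating the page by its negative straightens the text lines, which is what makes the later line and word segmentation steps simpler. Real scans would need binarization first and usually a finer angle step.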

Language processing research

– Pattern recognition:
  • Neural net based recognizer: Fairly complete for the basic alphabet and a subset of the consonant clusters. The non-spacing marks pose a significant challenge.
  • Hidden Markov Model (HMM) based recognizer: Just started; first implementation expected in May 2006.

• Syntax:
– Very preliminary work on Bangla syntax using the Lexical Functional Grammar (LFG) formalism
– Also a parallel effort using the Head-driven Phrase Structure Grammar (HPSG) formalism

Local and Regional Initiatives

IDRC Pan Localization Network (PanL10n)

• Phase I 2004-2006: 7 country collaboration
1. BRAC University, Bangladesh
2. Department of IT, Bhutan
3. National ICT Development Agency, Cambodia
4. Science Tech and Environment Agency, Laos
5. Madan Puraskar Pustakalaya, Nepal
6. University of Colombo School of Computing, Sri Lanka
7. Afghanistan

• Phase II proposed for 2007-2010

Local and Regional Initiatives

IDRC Pan Localization Network Phase II (2007-2010):

• Further development of user-end local language technology

• Development of user-end training for using the local language technology

• Conducting this training

• Local language content development

• Measuring effects of using local language technology

D.Net’s Initiative

Summary

• Local language computing

• Significant challenges, from language resources to human resources

• 30+ years of work for English and Western languages; just beginning for Bangla

• Include students from CS and linguistics

• OPEN SOURCE is a must for knowledge sharing!

• Other universities should also come forward