DEVELOPING AND MANAGING RESOURCE SCARCE LANGUAGES: THE SOUTH AFRICAN CASE JUSTUS C ROUX IMS STUTTGART 13.07.2015

DEVELOPING AND MANAGING RESOURCE SCARCE LANGUAGES: THE SOUTH AFRICAN CASE

JUSTUS C ROUX

IMS STUTTGART

13.07.2015

OUTLINE

• The concept of resource scarce languages

• Overview of the language situation in South Africa

• Lack of language resources and high level support for development of resources

• Co-ordination of activities in resource development and management

• The demand for localised language services over digital devices and related opportunities

Resource scarce languages

“Under-resourced languages are generally described as languages that suffer from a chronic lack of available resources, from human, financial, and time resources to linguistic ones (language data and language technology), and often also experience the fragmentation of efforts in resource development.”

(Language Resources and Evaluation (LRE) Journal Special Issue Call, August 2014).

Resource scarce languages (2)

"This situation is exacerbated by the realization that as technology progresses and the demand for localised languages services over digital devices increases, the divide between adequately- and under-resourced languages keeps widening."

(Language Resources and Evaluation (LRE) Journal Special Issue Call, August 2014).

Issues are

• A chronic lack of available resources, from human, financial, and time resources to linguistic ones

• Fragmentation of efforts in resource development

• As technology progresses the demand for localised languages services over digital devices increases

• But first, consider the language situation in South Africa

Language Situation in South Africa

Home language (n = 52 mil speakers)

11 Official languages

Zulu 22%
Xhosa 18%
Afrikaans 16%
N Sotho 10%
English 9%
Tswana 7%
Swati 3%
Tsonga 4%
Venda 2%
S Sotho 7%
Ndebele 2%

The official African languages grouped

Nguni group (45%)
• isiZulu
• isiXhosa
• Siswati
• isiNdebele

Sotho group (24%)
• Northern Sotho / Sepedi
• Southern Sotho / Sesotho
• Western Sotho / Setswana

Tshivenda / Xitsonga group (4%)
• Tshivenda
• Xitsonga

Cross border languages: Mozambique, Zimbabwe, Swaziland, Lesotho, Botswana
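The group percentages can be checked against the home-language shares listed on the language-situation slide. The following is a minimal sketch; the dictionary names and the `group_totals` helper are illustrative and not part of the talk. With the rounded shares, the Nguni and Sotho sums reproduce the 45% and 24% shown, while Tshivenda and Xitsonga add to 6% rather than the 4% shown.

```python
# Home-language shares (percent) as listed on the language-situation slide.
HOME_LANGUAGE_SHARE = {
    "Zulu": 22, "Xhosa": 18, "Afrikaans": 16, "N Sotho": 10, "English": 9,
    "Tswana": 7, "Swati": 3, "Tsonga": 4, "Venda": 2, "S Sotho": 7, "Ndebele": 2,
}

# Grouping of the official African languages as given on the grouping slide.
GROUPS = {
    "Nguni": ["Zulu", "Xhosa", "Swati", "Ndebele"],
    "Sotho": ["N Sotho", "S Sotho", "Tswana"],
    "Tshivenda / Xitsonga": ["Venda", "Tsonga"],
}


def group_totals(shares, groups):
    """Sum the percentage shares of the member languages of each group."""
    return {
        group: sum(shares[language] for language in members)
        for group, members in groups.items()
    }


totals = group_totals(HOME_LANGUAGE_SHARE, GROUPS)
for group, percent in totals.items():
    print(f"{group}: {percent}%")
```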

Similarities at different levels within groups

Sotho group - disjunctive spelling – lexical items
• Ke tla bolela Sepedi. I will speak Sepedi.
• Ke tla bua Setswana. I will speak Setswana.
• Ke tla bua Sesotho. I will speak Sesotho.

Nguni group - conjunctive spelling – lexical items
• Ngizokhuluma isiZulu. I will speak isiZulu.
• Ndizothetha isiXhosa. I will speak isiXhosa.

Implications for NLP
• Grammatical structures across language groups the same
• Regular spelling: Grapheme to phoneme conversion – direct
• Tone languages – specific implications and challenges for TTS systems
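To illustrate what "direct" grapheme to phoneme conversion can mean for a regular orthography, here is a minimal sketch of a greedy longest-match converter. The rule table is a small, simplified assumption made for this example (a handful of digraphs with IPA-like symbols), not a verified phoneme inventory for any of the languages, and the function name `g2p` is hypothetical.

```python
# Toy rule table (assumption): a few digraphs mapped to IPA-like symbols.
# Letters without a rule are passed through unchanged.
RULES = {
    "kh": "kʰ",
    "th": "tʰ",
    "ph": "pʰ",
    "sh": "ʃ",
    "hl": "ɬ",
}
MAX_GRAPHEME_LEN = max(len(grapheme) for grapheme in RULES)


def g2p(word):
    """Convert a word to a list of symbols by greedy longest-match lookup."""
    word = word.lower()
    phones = []
    i = 0
    while i < len(word):
        # Try the longest grapheme first so "kh" wins over "k" followed by "h".
        for size in range(MAX_GRAPHEME_LEN, 0, -1):
            chunk = word[i:i + size]
            if chunk in RULES:
                phones.append(RULES[chunk])
                i += size
                break
        else:
            # No rule matched: fall back to the letter itself.
            phones.append(word[i])
            i += 1
    return phones


print(g2p("khuluma"))
```

The sketch works from spelling alone and therefore says nothing about tone, which the slide lists as a separate challenge for TTS systems.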

Afrikaans and its Germanic roots

• English: My hand is in warm water.
• Afrikaans: My hand is in warm water.
• Dutch: Mijn hand is in warm water.
• German: Meine Hand ist in warmem Wasser.
• Danish: Min hånd er i varmt vand.
• Norwegian: Min hånd er i varmt vann.
• Swedish: Min hand är i varmt vatten.

Implications
• Bootstrapping Afrikaans systems from e.g. Dutch.
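The surface overlap between the example sentences can be made concrete with a character-level similarity score. This is a minimal sketch using `difflib.SequenceMatcher` from the Python standard library; the `similarity` helper and the choice of metric are assumptions made for this example, and a single cognate-heavy sentence is not a measure of how close two languages are in general.

```python
from difflib import SequenceMatcher

# Example sentences from the slide, compared against the Afrikaans sentence.
AFRIKAANS = "My hand is in warm water."
SENTENCES = {
    "English": "My hand is in warm water.",
    "Dutch": "Mijn hand is in warm water.",
    "German": "Meine Hand ist in warmem Wasser.",
    "Danish": "Min hånd er i varmt vand.",
    "Norwegian": "Min hånd er i varmt vann.",
    "Swedish": "Min hand är i varmt vatten.",
}


def similarity(a, b):
    """Character-level similarity in [0, 1], ignoring letter case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


scores = {lang: similarity(AFRIKAANS, text) for lang, text in SENTENCES.items()}
for lang, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{lang}: {score:.2f}")
```

On this sentence the English version scores 1.0 because it is spelled identically, and Dutch scores higher than German, which is in line with the slide's suggestion of Dutch as a starting point for bootstrapping.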

ISSUE #1

Chronic lack of available (digital) resources, from human, financial, and time resources to linguistic ones

• Digital resources for previously marginalised languages extremely limited: newspapers, periodicals, relatively low presence on the Web

• Lack of language expertise – no tradition of Computational Linguistics – limited number of students in local languages – only North-West University with degree courses in Language technologies ("Linguists are still needed" – Ed Grefenstette)

• Growing expertise in Computer Science and Signal processing with focus on natural languages in most of the larger universities.

• Financial support mainly ad-hoc from private sources

• Various initiatives for text and speech data collections over a number of decades – mainly for linguistic / phonetic research at academic institutions – difficult to share resources

• Continued academic pressure (on grounds of the constitution) on government for support of research and development of Language Technologies - not to marginalise the indigenous languages again

• Large data acquisition projects sponsored by national government since 1999 – Part of National Language Plan (RSA and India are the only countries with an official policy regarding LT development).

• Ministerial Panel: HLT Strategy for South Africa (2002)
  • Focus on digital resources: text & speech (SA official languages)

• 2008: Human Language Technology Expert Panel (HLTEP) established
  • commissions HLT application projects annually with governmental funds
  • these projects invariably create digital resources
  • obvious that it was necessary to create a central depository for all newly created language resources

• Ongoing major projects since 2000 in text and speech domains
  • Refer to RMA resources to be discussed

ISSUE #2 Fragmentation of efforts in resource development

• Various language projects across the country generating text and speech resources for different purposes – availability of the data (?)

• Resources from projects commissioned by the HLTEP (i.e. funded by taxpayers' money) needed to be deposited in a central place

• 2012: The National Department of Arts and Culture (DAC) established the Resource Management Agency (RMA) at the North-West University (Potchefstroom) under the auspices of the Centre for Text Technology (CTexT) as a 3 year project. (www.rma.nwu.ac.za)

[Slide image: RMA website, http://www.rma.nwu.ac.za]

[Slide image: newsletter]

Contents of the RMA

LANGUAGE
• Afrikaans (31)
• English (30)
• isiNdebele (20)
• isiXhosa (23)
• isiZulu (27)
• Sesotho sa Leboa (Sepedi) (22)
• Setswana (20)
• Sesotho (Southern Sotho) (22)
• Siswati (20)
• Tshivenda (20)
• Xitsonga (24)
• Dutch (4)
• Yoruba (3)

PROJECT
• Autshumato (18)
• Lwazi (36)
• NCHLT Text (43)
• NCHLT Speech (13)
• African Speech Technology (15)

DATABASE TYPE
• Monolingual Speech Corpora: Annotated (22)
• Multilingual Text Corpora: Aligned (3)
• Monolingual Text Corpora: Annotated (1)

RESOURCE TYPES
• Data
• Modules
• Applications
• Tools / Platforms

FROM RMA TO NATIONAL CENTRE FOR DIGITAL LANGUAGE RESOURCES (NCDLR)

• RMA: status 3-4 year project (2012 – 2015) (Dept of Arts & Culture)
• Untenable as development of resources is ongoing (living archive)

• National Department of Science and Technology (DST) (2014):
  • International panel to determine a new South African Research Infrastructure Roadmap (SARIR)
  • Presentations made to include language (Humanities) and technology in a Roadmap dominated by natural science, medicine, engineering, earth sciences etc.

• June 2015: The National Centre for Digital Language Resources approved – long term funding (Press statement of DST to follow soon)

National Centre for Digital Language Resources

• University of Pretoria, Department of African Languages
• CSIR Meraka Institute (Human Language Technologies Research Group)
• North-West University, Centre for Text Technology (CTexT)
• University of South Africa, Department of African Languages
• ICELDA partnership

NATIONAL CENTRE FOR DIGITAL LANGUAGE RESOURCES

Functions

• Single point of entry for information on SA language resources (portal)
• Free open access for academic research
• Licensed access for commercial applications
• Includes RMA resources

• Systematic digitisation of scientifically valuable language resources – historical nature (Scientific committee)

isiZulu isiXhosa Setswana Sesotho Afrikaans Lang n

Dialects X (X) X X

Child language ? ? ? (X) (X) ?

Urban slang

Natural discourse

X = available (X) = limited data ? = uncertain, should be acquired.

• Systematic digitisation of different registers/modes of language resources by the Centre, as well as by academics/public as open call funded projects

• Combine these projects with MA / PhD studies with data to be deposited at Centre

• Resource centre for studies in the domain of Digital Humanities

ISSUE #3 Demand for localised language services over digital devices increases

Available

• At text level

• Spelling checkers for all SA languages – CTexT (Microsoft) http://www.nwu.CTexT.ac.za

• Machine translation – government documents – CTexT (Autshumato IMT) http://www.autshumato.sourceforge.net

• On-line translations: e.g. www.Translate.org, www.Freelang.net and various other software programs ranging from word lists to communication phrases

• At speech/text level (interactive telephone based systems) (Major projects)

• African Speech Technology: Hotel reservation system in 5 languages (prototype) www.lrec-conf.org/proceedings/lrec2004/summaries/445.htm

• LWAZI I and II: Various community based applications www.meraka.org.za/lwazi/

Why do we need to speed up localised language services?

There is a demand for a wide array of language based communication systems:
• Interactive multilingual voice systems as information systems
• Interactive text-to-speech systems
• Literacy training in different languages
• Language specific reading support for the blind
• Machine translation systems for public use
• Speech-to-speech communication systems with various language pairs
• Etc.

• There are specific research and business opportunities – consider the following

Mobile telephone penetration, selected countries
Source: http://www.itu.int/ITU-D/ict/statistics/explorer/index.html

Mobile cellular subscriptions (million)
Japan 149
Nigeria 127
Germany 100
South Africa 76
Korea (Rep) 55
France 36

Mobile cellular subscriptions per 100 inhabitants
South Africa 146
Germany 121
Japan 117
Korea (Rep) 111
France 98
Nigeria 73
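The two tables are linked by simple arithmetic: subscriptions per 100 inhabitants equals total subscriptions divided by population, times 100. As a minimal sketch, the population implied by the two figures on the slide can be recovered by inverting that relation. The variable and function names are illustrative, and the results are only what the slide's own numbers imply, not independent population estimates.

```python
# Figures as shown on the slide.
SUBSCRIPTIONS_MILLION = {
    "Japan": 149, "Nigeria": 127, "Germany": 100,
    "South Africa": 76, "Korea (Rep)": 55, "France": 36,
}
PER_100_INHABITANTS = {
    "South Africa": 146, "Germany": 121, "Japan": 117,
    "Korea (Rep)": 111, "France": 98, "Nigeria": 73,
}


def implied_population_million(country):
    """Population (millions) implied by total subscriptions and the per-100 rate."""
    return SUBSCRIPTIONS_MILLION[country] / PER_100_INHABITANTS[country] * 100


for country in SUBSCRIPTIONS_MILLION:
    print(f"{country}: {implied_population_million(country):.1f} million")
```

For South Africa this gives roughly 52 million, which agrees with the n = 52 mil speakers given on the language-situation slide. A rate above 100 means there are more subscriptions than inhabitants. A result far from a country's known population would be a reason to re-check the transcribed figure.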

Conclusion

• Challenges for the development and management of different types of language resources and applicable tools
  • Academic considerations: insights into language structures and use
  • Commercial considerations: providing multilingual applications for a growing market, specifically in the African context

• In order to meet these challenges it is necessary to develop and update language resources not only on a case-by-case basis, but also systematically in a coordinated manner over as long a period as possible.

• This is what we are attempting to do in the South African context.

Thank you for listening.