
Encoding Diversity for All the World’s Languages

The Script Encoding Initiative (Universal Scripts Project)

Michael Everson, Evertype

Westport, Co. Mayo, Ireland

Bamako, Mali • 6 May 2005

1. Current State of the Unicode Standard

• Unicode 4.1 defines over 97,000 characters

1. Current State of the Unicode Standard: New Script Additions

Unicode 4.1 (31 March 2005): Buginese, Coptic, Glagolitic, New Tai Lue, Nuskhuri (extends Georgian), Syloti Nagri, Tifinagh, Kharoshthi, Old Persian Cuneiform

For Unicode 5.0 (2006): N’Ko, Balinese, Phags-pa, Phoenician, Cuneiform
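
As an illustrative aside, whether a given software environment knows about the scripts listed for Unicode 5.0 can be checked against the character database it ships with. The sketch below uses Python's standard `unicodedata` module. The sample code points (one per script) and the helper names are our own choices for illustration, not part of the talk, and the result depends on the Unicode version bundled with the interpreter.

```python
import unicodedata

# One sample code point per script, chosen for illustration only (assumption).
SAMPLES = {
    "N'Ko": 0x07CA,
    "Balinese": 0x1B05,
    "Phags-pa": 0xA840,
    "Phoenician": 0x10900,
    "Cuneiform": 0x12000,
}


def character_name(code_point):
    """Return the Unicode name, or None if this database has no such character."""
    return unicodedata.name(chr(code_point), None)


def report(samples):
    """Map each script label to the name of its sample character (or None)."""
    return {script: character_name(cp) for script, cp in samples.items()}


if __name__ == "__main__":
    print("Unicode database version:", unicodedata.unidata_version)
    for script, name in report(SAMPLES).items():
        print(f"{script}: {name or 'not in this database'}")
```

A `None` result would mean the script is unknown to that environment, which was the situation for all five scripts when this talk was given.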

1. Current State of the Unicode Standard

• Unicode 4.1 defines over 97,000 characters

• Unicode covers over 50 scripts (often used for languages with over 5 million speakers)

• Unicode enables millions of users worldwide to view web pages, send e-mails, converse in chat-rooms, and share text documents in their native script

• Unicode is widely supported by current fonts and operating systems, but…

Over 80 scripts are missing!

Missing Modern Minority Scripts

India, Nepal, Bangladesh:

• Chakma • Lepcha • Methei/Manipuri • Newari • Ol Chiki • Saurashtra • Sorang Sompeng • Varang Kshiti

China:

• Lanna • Naxi Geba • Naxi Tomba • Pollard

Africa:

• Bamum • Bassa • Mende • Vai

Southeast Asia (excluding China):

• Batak • Cham • Javanese • Kayah Li • Pahawh Hmong • Rejang • Sundanese • Viet Thai

Over 80 scripts are missing!

Missing Historic Scripts

• Ahom • Alpine • Aramaic • Avestan • Aztec Pictograms • Balti • Brahmi • Büthakukye • Byblos • Carian • Chalukya • Chola • Cypro-Minoan • Egyptian Hieroglyphs • Elbasan • Elymaic

• Grantha • Hatran • Iberian • Indus Valley • Jurchin • Kaithi • Kawi • Khotanese • Kitan Large Script • Kitan Small Script • Landa • Linear A • Luwian

• Lycian • Lydian • Mandaic • Manichaean • Mayan Hieroglyphs • Meroitic • Modi • Nabataean • North Arabic • Numidian • Old Hungarian • Old Permic • Orkhon • Pahlavi

• Palmyrene • Proto-Elamite • Pyu • Rongorongo • Samaritan • Satavahana • Sharada • Siddham • South Arabian • Soyombo • Takri • Tangut Ideograms • Uighur • Vedic accents

2. The Unicode approval process

• The approval process takes 2 to 5 years

• Only after a script has been formally approved can fonts be created and software support added

New characters must be approved by two standards bodies:

1. Unicode Technical Committee

2. ISO/IEC JTC1/SC2 and its Working Group 2 (composed of 32 voting national body representatives)

So… who represents scholars, educators, and minority communities?

http://linguistics.berkeley.edu/sei

3. Three Case Studies (Modern Scripts)

Case 1: Balinese

• Used for the Balinese language, an Austronesian language with 3.8 million speakers

• Used in many traditional literary and cultural works (ritual choruses, dramatic recitations)

• Considered by some Balinese to be “endangered”

• Taught in primary and secondary schools as a mandatory subject, about 2 hours a week

Building signs in Bali

Street signs in Bali

Balinese School Texts

Case 1: Balinese

• UNESCO’s Initiative B@bel approved funding in January 2005 for work on Balinese

• Balinese was approved at the ISO meeting in China in late January 2005

• Likely to be approved (next week) at the Unicode Technical Committee, and will appear in Unicode 5.0

Case 2: N’Ko

• Used to write a number of Manden languages, comprising 18 to 20 million speakers

• Devised in the late 1940s by Solomana Kante of Guinea to be used for the Manden languages of West Africa

• Has a vigorous and active user community

Bookstore in Guinea

N’Ko school in Kankan, Guinea

Case 2: N’Ko

• UNESCO’s Initiative B@bel approved funding in 2004 for work on N’Ko

• N’Ko was approved at the WG2 meeting in Toronto in June 2004

• N’Ko was also approved by the UTC in June 2004

• N’Ko is a complex script. The user community still needs help to develop fonts and provide necessary locale information so that N’Ko can be implemented on computers. Encoding is just the first step.
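
One reason encoding is only the first step for N’Ko is that the script runs right to left and uses combining tone marks, so software has to apply bidirectional layout and mark positioning on top of the code points. As a minimal sketch, Python's standard `unicodedata` module can show the character properties a rendering engine has to honour. The two sample characters and the `describe` helper are our own assumptions for illustration, and the output depends on the Unicode version bundled with the interpreter.

```python
import unicodedata

LETTER = "\u07CA"     # an N'Ko letter (sample chosen for illustration)
TONE_MARK = "\u07EB"  # an N'Ko combining tone mark (sample chosen for illustration)


def describe(ch):
    """Collect the properties that drive text layout for one character."""
    return {
        "name": unicodedata.name(ch, "<unnamed>"),
        "bidi": unicodedata.bidirectional(ch),    # "R" means strong right-to-left
        "category": unicodedata.category(ch),     # "Mn" means nonspacing mark
        "combining": unicodedata.combining(ch),   # nonzero for combining marks
    }


if __name__ == "__main__":
    for ch in (LETTER, TONE_MARK):
        print(f"U+{ord(ch):04X}", describe(ch))
```

Fonts, keyboards, and locale data still have to be built on top of these properties before the script is usable in practice.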

Case 3: Vai

• Some 105,000 Vai people live mainly in Liberia

• The Vai script was invented around 1833, and in 1962 authorities in Monrovia published a “Standard Vai Syllabary”

• Schoolbooks are available in Vai script

• Work to encode Vai began in April 2005

3. Features of Modern Minority Scripts

• The official government language uses a different script than that of the minority

    In Indonesia, Latin script is used for Bahasa Indonesia

    In Guinea & Mali, Latin script is used for French

    In India, Devanagari and other scripts are used for Hindi and other official languages

• User groups tend to be active in promoting their own scripts

• Primary goals are to improve literacy and to further pride in their own culture

4. Examples of Historic Scripts

Case 1: Egyptian Hieroglyphs

• Egyptian is an important script that remains unencoded

• Work on encoding this script will be broken into two phases, the first part being “the Gardiner set” of about 900 characters. There are thousands more.

Case 2: Avestan

• Dates to approximately 600 BCE

• Used as a liturgical language by more than 100,000 Zoroastrians worldwide

• Unicode proposal draft has received input from the Iranian Academy of Sciences

Case 3: Hieroglyphic Luwian

• Used from the second millennium to the first millennium BCE

• Found on stone monuments in Anatolia (Turkey) and northern Syria

• Indo-European language

4. Features of Historic Scripts

• Studied by scholars and students, as well as others

• Documents in these scripts form the basis of our cultural and literary heritage

• Many courses have been cut in university budgets; development of online course material could help

5. Why are scripts missing?

• Modern script users are minorities and tend to be less affluent than users of majority scripts

• It is difficult for users to attend international standardization meetings

• There is no large consumer base and so minority scripts are not of much interest to corporations

• Governments may be reluctant to help such groups

• Historic scripts have no “large consumer base”

• It is difficult to get universities to support this work because they do not fully understand the problem

• Scripts (both historic and modern) are often less well-known

• Additional research is needed to encode such scripts properly

• In the past, most work has been done by volunteers; proposals have appeared sporadically

• As more scripts are encoded, the focus is increasingly on implementation, maintenance of the standard, and locale data collection, with less of a focus on encoding. This leaves behind those groups whose script is not yet in Unicode, because they don’t represent an economically viable market for computer companies.

• Unicode already covers all of the “economically interesting” scripts for computer companies. Some people have estimated that within 5 years the implementation work for the major scripts will be complete and the locale data will have been collected.

• While many of the computer companies will continue to be involved in maintaining the standard, they will not be active in encoding new scripts, and it will become more difficult to pass new encodings through committees.

• The Unicode Consortium will still be active: it enjoys tremendous support in the industry and national bodies, and is expanding. Maintenance of the standard represents a long-term commitment by all of those involved.

• But in order to get the remaining scripts encoded, users of those scripts need to participate and help fund the project.

• Computer companies are not going to do that.

6. What it means when a script is encoded

• Ethnic pride and identity are promoted

• Literacy efforts can be encouraged

• The study of historic scripts is kept alive

• Communication between and amongst members of the community is promoted

• Encoding allows communication in times of emergency (disease, war, natural disaster) with people throughout the world

www.ethnomed.org

• Encoding a script can permit the creation of health materials in local languages

6. What it means when a script is NOT encoded

• Communication for those whose script is outside Unicode will be difficult

• Implementations for unencoded scripts will be more costly to make interoperable with major platforms and software

• Knowledge of the various scripts of the world will be incomplete

7. Solution: The Script Encoding Initiative

http://linguistics.berkeley.edu/sei

7. The role of The Script Encoding Initiative

• To work with users on script proposals

• If needed (and it is always needed), to raise money for script proposals to be written and free fonts to be created

• To work collaboratively with other groups (such as SIL) to ensure there is no duplication of effort

• To seek experts to review proposals

• To participate at standards meetings on behalf of minority groups and scholars

• To explain the role and importance of Unicode to scholars, to users, and to the general public

7. The Script Encoding Initiative: Current Progress

• SEI has so far helped approximately 12 scripts through the standards process

• Work continues actively on 8 scripts, and over 47 await review and expert input

• SEI has received funds from UNESCO (Initiative B@bel) and the U.S. National Endowment for the Humanities

http://linguistics.berkeley.edu/sei/alpha-script-list.html

7. The Script Encoding Initiative: Plans for the future

• To get adequate, stable funding

• To continue to work on proposals

• To promote the need to encode the missing scripts into Unicode

Conclusion

Finishing the job of encoding the world’s minority scripts and historic scripts is a task that will build the infrastructure for world-wide computerization and literacy

We need assistance from government, from NGOs, from UN organizations, and from the private sector in order to be able to accomplish this

Unicode website: http://www.unicode.org

Script Encoding Initiative: http://linguistics.berkeley.edu/sei

Everson Typography: http://www.evertype.com