02 c a306-phillips_langtags

1

Language Tags and Locale Identifiers

A Status Report

2

Presenter and Agenda

Addison Phillips

Internationalization Architect, Yahoo!

Co-Editor, Language Tag Registry Update (LTRU) Working Group (RFC 3066bis, draft-matching)

Language tags

Locale identifiers

Addison Phillips is the co-editor of the recent Language Tag registry RFC and its associated matching draft. This presentation details the history of language tags and locale identifiers on the Internet, with a focus on the recent changes and updates to RFC 3066 and efforts to create standardized locales and locale identifiers for the Internet.

3

Languages? Locales?

What’s a language tag?

What the #@&%$ is a locale?

Why do identifiers matter?

If the Internet is anything, it is a means of communication. While there are many forms of communication, language and textual information in particular loom large in computer systems.

The identification of human “natural language”, as a result, is important, since users expect their computer systems to interact with textual data in useful ways (be it searching, relaying, checking, formatting, or otherwise processing it). Alas, defining what a language is and what constitutes the difference between various forms of language is a complex problem. And, for computer systems, there is another kind of beast: the “locale”, which is even more difficult to grasp. What are these things? How do we identify them? Why do language and locale identifiers matter?

4

Language Tags

Enable presentation, selection, and negotiation of content

Defined by BCP 47
– Widely used! XML, HTML, RSS, MIME, SOAP, SMTP, LDAP, CSS, XSL, CCXML, Java, C#, ASP, Perl…
– Well understood (?)

Natural language and especially written (that is, textual) information are a key and fundamental part of most computer systems. When computer systems were mostly isolated and not interconnected, they mostly dealt with a single language at a time and could be tuned to deal with the particular idiosyncrasies of that language. But the Internet (and other networking technologies) has changed that. Now textual data may be stored, processed, or viewed in many different contexts and many different languages simultaneously. And increasingly the boundaries between “computer” and the world at large are becoming blurred: your “computer” today might equally be your TV, your telephone, your game player, your music player, your PDA, or your automobile. The digital content delivered to your “computer” is more important than the form factor the computer itself takes. As text, speech, and other content associated with language become pervasive and networked together, the selection, identification, and correct processing of the language become critical.

Most people seem to believe that they have a relatively good grasp of languages and, thus, of language identification. If you ask your mother-in-law what language the folks in Germany or France speak, for example, she probably will have a ready answer. But the more one delves into languages and language identification, the more complex the problem seems to become.

The standard for language identification on the Internet is something called “BCP 47”. It is widely used: the list above is a small fraction of the formats and technologies that implement it. What, never heard of “BCP 47”? BCP 47 is the official designation for the language tagging specification of the IETF. BCP stands for “best current practice”. The most recent document to be BCP 47 is (or, by the time you read this, “was”) RFC 3066, which was preceded by RFC 1766. You’re probably more familiar with the RFC numbers than the BCP number.

5

Locale Identifiers

Different ideas:
– Accept-Locale vs. Accept-Language
– URIs/URNs, etc.
– CLDR/LDML

And Requirements:
– Operating environments and harmonization
– App Servers
– Web Services

New Solution? Cost of Adoption:
– UTF-8 to the browser: 8 long years

Locale identifiers, by contrast, are somewhat more difficult to grasp. Your mother-in-law (unless she’s a software engineer) probably has no idea what a locale is.

One definition of a locale is:

“a data structure or concept used by programmers to identify a particular collection of cultural, regional, or linguistic preferences.”

Locales are tied to specific programming languages or operating environments. What they do and how they are identified are unique and usually proprietary.

There is a relationship of sorts between language and locale: most locale identifiers include a language identifier. So if locale identifiers need to be exchanged on the Internet, as in Web services or between different application servers, how would these identifiers be defined?

There are different ideas for how this might happen. One question is cost of adoption: new headers, identifiers, or data structures might take a long time to reach “critical mass” and become useful, while adaptation or co-option of existing structures might introduce problems for existing applications.

6

In the Beginning

Received Wisdom from the Dark Ages

Locales:
– japanese, french, german, C
– ENU, FRA, JPN
– ja_JP.PCK
– AMERICAN_AMERICA.WE8ISO8859P1

Languages…
… looked a lot like locales (and vice versa)

In the beginning, there was very little difference between language and locale in computer systems. Locale identifiers (some historical examples are shown above) usually included some kind of language identification.

When the Internet became accessible to mere mortals in the early 1990s, language identification became an immediate concern. The Internet made content easy to exchange across boundaries and borders in ways that closed networks like CompuServe never could master. Identifying languages was necessary for applications such as email and HTTP, so Harald Alvestrand worked to create the first version of BCP 47, known as RFC 1766, to address the problem.

These language tags became widely adopted, as we’ve noted. Locale identifiers were not created for the Internet, though, because of a lack of distributed applications.

“Now, hold on!” you might say. “I’ve used distributed applications for years now: I’ve got client-server and I’ve bought books from Amazon or stocks from my broker or airline tickets on-line. What do you mean ‘there’s a lack of distributed applications’?!?”

It is true that client-server architectures and Web applications are now quite commonplace. However, these are not truly distributed applications. In a Web application, for example, there is a host where all the logic is stored. This host and its associated programming language or operating environment completely encapsulate the overall locale model. Client-server architectures are similar: the client and server each have specific technology choices associated with them and the business logic lives in one or the other (and usually in the server).

Truly distributed applications are the province of integration (EAI, B2B), Web services, and the idea of Service Oriented Architectures (SOA). You only need a shared concept of locale when your logic is being hosted in discrete chunks on multiple systems and when you cannot count on the systems using the same technology!

Web apps are usually hosted in a single container or are written by people who have chosen a particular technology. The locale model associated with that technology becomes the locale model of the Web application. The whole point of Web services, by contrast, is to hide this technology decision.

7

Locales and Language Tags meet

Conversations in Prague…
– Language tags are being locale identifiers anyway…
– Not going to need a big new thing…
– Just a few things to fix…

… we can do this really fast

In 2002, Mark Davis and I attended the Internationalization and Unicode Conference in Prague (so you can see that it pays to attend these events!), where I had a paper about locale identifiers. The basic problem was that language tags were widely distributed, and, since they looked an awful lot like POSIX locale identifiers, most Web application platforms were actually using them as locale identifiers already by mapping language tags to their local equivalent. Mark was working on the CLDR project and was concerned about problems involving script identification (especially for compatibility with Microsoft’s .NET Culture identifiers). It seemed that a few small fixes to BCP 47 (to allow some script subtags) and some documentation (“how to get a locale out of a language tag”) might solve several problems all at once.

8

BCP 47 Basic Structure

Alphanumeric (ASCII only) subtags

Up to eight characters long

Separated by hyphens

Case not important (i.e. zh = ZH = zH = Zh)

1*8alphanum *[ "-" 1*8alphanum ]

The basic structure of language tags has been remarkably stable. Language tags are ASCII strings consisting of subtags separated by hyphens (and not underscores). The subtags may consist of either (ASCII) letters or digits.

There exist suggested capitalization rules for some of the underlying standards used by language tags, but these do not apply to language tags and have no meaning in a language tag context. Language tags are case insensitive.

At the bottom of the slide is the original “ABNF” which describes the language tag grammar.
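The structural rules above can be checked mechanically. The following is a minimal Python sketch, assuming only the simplified grammar shown on the slide; the names `is_well_formed` and `same_tag` are illustrative and do not come from any RFC or library. It checks the shape of a tag only, not whether its subtags mean anything.

```python
import re

# Simplified grammar from the slide: 1*8alphanum *[ "-" 1*8alphanum ]
_TAG_SHAPE = re.compile(r"[A-Za-z0-9]{1,8}(?:-[A-Za-z0-9]{1,8})*")


def is_well_formed(tag):
    """Return True if tag is ASCII alphanumeric subtags (1-8 chars) joined by hyphens."""
    return _TAG_SHAPE.fullmatch(tag) is not None


def same_tag(left, right):
    """Compare two tags ignoring case, since case carries no meaning in a tag."""
    return left.lower() == right.lower()
```

With this sketch, `is_well_formed("zh-TW")` holds, while the POSIX-style `zh_TW` is rejected because of the underscore.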

9

RFC 1766

zh-TW
– zh: ISO 639-1 (alpha2)
– TW: ISO 3166 (alpha2)

i-klingon
– Registered value

RFC 1766 defined language tags in two distinct ways.

All language tags took the form of a sequence of subtags composed of the ASCII letters and digits separated by the hyphen character. The subtags could be, at most, eight characters long. RFC 1766 said that:

• If the first subtag consisted of two letters, it was a language code from the ISO 639-1 standard.

• If there was a second subtag (additional subtags are optional) and it consisted of two letters, it was a region code from the ISO 3166 standard.

Otherwise, the interpretation of the tag (and its subtags) was defined by a registry maintained by IANA. If users needed a specific language tag, they could write to a mailing list ([email protected]) and request a registration be created. Here is one such tag, for the Klingon language.
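The two interpretation rules can be written down as a short Python sketch. It follows the description in these notes, not the full text of RFC 1766; the function name `interpret_rfc1766` and the shape of the returned dictionary are assumptions made for illustration.

```python
def _is_two_letters(subtag):
    """True for exactly two ASCII letters."""
    return len(subtag) == 2 and subtag.isascii() and subtag.isalpha()


def interpret_rfc1766(tag):
    """Classify a tag using the two rules described in the notes."""
    subtags = tag.split("-")
    # Rule 1: a two-letter first subtag is an ISO 639-1 language code.
    if not _is_two_letters(subtags[0]):
        # Anything else is defined only by its IANA registration.
        return {"kind": "registered", "tag": tag.lower()}
    result = {"kind": "iso", "language": subtags[0].lower(), "region": None}
    if len(subtags) > 1:
        if _is_two_letters(subtags[1]):
            # Rule 2: a two-letter second subtag is an ISO 3166 region code.
            result["region"] = subtags[1].upper()
        else:
            return {"kind": "registered", "tag": tag.lower()}
    return result
```

Under these assumptions `zh-TW` yields language `zh` with region `TW`, while `i-klingon` is reported as a registered value.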

10

RFC 3066

sco-GB
– sco: ISO 639-2 (alpha 3 codes)

But use alpha 2 codes when they exist:
– eng-GB (X: not valid, use en-GB)

RFC 3066 expanded on RFC 1766, making a few minor additions and cleaning up a few problems that arose.

The main change was the addition of ISO 639-2 codes for languages. The ISO 639-1 codes are two letters long and there are, necessarily, a limited number of these (about 650 total, given that some letters are reserved). Since there are at least several thousand languages that exist in modern times, this isn’t sufficient to encode the world’s languages. ISO 639-2 assigns three-letter codes, which allows for many more potential codes. This allows all of the languages to be represented by one code or another.

RFC 3066 also mandated that if an ISO 639-1 code exists for a language, then that code must be used (and not the ISO 639-2 code). This prevents languages from being encoded using different tags. Thus the tag “eng-GB” is not legal, even though “eng” is a valid ISO 639-2 code: tags must use the “en” code for English.

The IANA language tag registry remained the same as during the RFC 1766 era: a collection of isolated registrations.

(‘sco’ is the code for ‘Scots’)
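The “use the two-letter code when one exists” rule amounts to a table lookup on the first subtag. The Python sketch below assumes a tiny hand-written table, `ALPHA3_TO_ALPHA2`, holding four well-known pairs purely for illustration; a real implementation would need the complete ISO 639 code lists. The function name `prefer_alpha2` is likewise hypothetical.

```python
# Illustrative subset only: a real table would cover every ISO 639-2 code
# that has an ISO 639-1 equivalent.
ALPHA3_TO_ALPHA2 = {
    "eng": "en",
    "fra": "fr",
    "deu": "de",
    "zho": "zh",
}


def prefer_alpha2(tag):
    """Rewrite the primary language subtag to its two-letter form if one is known."""
    subtags = tag.split("-")
    primary = subtags[0].lower()
    # Codes without a two-letter equivalent (such as "sco") pass through unchanged.
    subtags[0] = ALPHA3_TO_ALPHA2.get(primary, primary)
    return "-".join(subtags)
```

Here `prefer_alpha2("eng-GB")` gives `en-GB`, while `sco-GB` is left alone because Scots has no two-letter code.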

11

Problems

Script Variation:
– zh-Hant/zh-Hans
– (sr-Cyrl/sr-Latn, az-Arab/az-Latn/az-Cyrl, etc.)

Obsolescence of registrations:
– art-lojban (now jbo), i-klingon (now tlh)

Instability in underlying standards:
– sr-CS (CS used to be Czechoslovakia…)

A variety of problems were associated with language tags, despite their success. The one Mark and I were primarily interested in was the problem of script variation. Most languages are customarily written in a single script. They may be transcribed in another script, but most native speakers and most content in that language use a single script.

A few languages are written equally (or at least “commonly”) in more than one script. Some of the languages are undergoing transitions (Cyrillic script was imposed on several languages during the Soviet era, for example), while others are just naturally written in more than one script. For example, Serbian can be written in either Cyrillic or Latin script. Both traditions are historical to the language, not artificially imposed.

The most notable example of script variation is in Chinese, where the traditional form of the script is used in some Chinese speaking regions (Taiwan, Hong Kong) while the simplified form of the script is used in others (the PRC, Singapore). These variations do not follow spoken variation in the language (Hong Kong, for example, speaks Cantonese while Taiwan speaks Mandarin)… which leads to vocabulary and other variations with the writing systems in question. And identifying “Traditional Chinese” using a region has other cultural sensitivity problems…

Another problem was the relative ease of registration for language tags compared to the action of the various ISO maintenance and registration bodies. Many of the registered tags were later deprecated due to standards action.

A last problem I’ll mention here was instability in ISO 3166 (the region codes). Codes in ISO 3166 are changing all the time, which is not a surprise, given that countries are changing name, boundaries, and organization with some regularity. Alas, ISO 3166 doesn’t just remove old codes: they sometimes give them to a new country or region. So the language tag that today means “Serbian for Serbia and Montenegro” would have meant “Serbian for Czechoslovakia” a couple of decades ago.

12

And More Problems

Lack of scripts

Little support for registered values in software

Reassignment of values by ISO 3166

Lack of consistent tag formation (Chinese dialects?)

Standards not readily available, bad references

Bad implementation assumptions
– 1*8alphanum *[ "-" 1*8alphanum ]
– 2*3ALPHA [ "-" 2ALPHA ]

Many registrations to cover small variations
– 8 German registrations to cover two variations

There were a few other problems, which I’ve listed here…

13

LTRU and “draft-registry”

Defines a generative syntax
– machine readable
– future proof, extensible

Defines a single source
– Stable subtags, no conflicts
– Machine readable

Defines when to use subtags
– (sometimes)

So Mark and I started writing Internet-Drafts. Eventually, a Working Group was formed at the IETF called the Language Tag Registry Update or LTRU working group.

Out of this working group comes a new RFC, which is the new BCP 47. As I write this the RFC has not yet been assigned a number, so it is called RFC 3066bis informally. It changes language tags in a number of interesting ways, while maintaining full compatibility with all existing tags.

14

RFC 3066bis and LTRU

sl-Latn-IT-rozaj-x-mine
– sl: ISO 639-1/2 (alpha2/3)
– Latn: ISO 15924 script codes (alpha 4)
– IT: ISO 3166 (alpha2) or UN M.49
– rozaj: Registered variants (any number)
– x-mine: Private Use and Extension

Here is an illustration of a new-style language tag.

15

More Examples

es-419 (Spanish for Americas)

en-US (English for USA)

de-CH-1996 (Old tags are all valid)

sl-rozaj-nedis (Multiple variants)

zh-t-wadegile (Extensions)

Here are some more examples of language tags showing some of the interesting variations.

es-419 makes use of the UN M.49 region codes to describe a language for a larger area than a country.

de-CH-1996 was registered in the old IANA Language Tag Registry. It is still a valid tag.

sl-rozaj-nedis is probably not a good tag choice, but illustrates that you can have more than a single variant in a well-formed tag. In this case, both -rozaj and -nedis are dialects of Slovenian (sl), but -nedis doesn’t include sl-rozaj in its registered list of prefixes, so this tag is probably meaningless.

zh-t-wadegile is a hypothetical tag: if there were an extension for transliterations and if it were assigned the letter ‘t’, then one valid subtag might be ‘wadegile’.*

* Several well-informed people have cast doubt on the idea of a transliteration extension, not to mention the “wadegile” example shown.

16

Benefits

Subtag registry in one place: one source.

Subtags identified by length/content

Extensible

Compatible with RFC 3066 tags

Stable: subtags are forever

There are several benefits to switching over to RFC 3066bis.

For the first time there is a single, authoritative source for subtags. It contains date versioning information, as well as information on the formation of useful tags. Instead of having to hunt through various versions of ISO 639, ISO 3166, ISO 15924, UN M.49 and the IANA registry, there is one source.

It is machine readable and the entries are dated. There is even a mechanism for canonicalizing tags as they evolve.

Inside a language tag, the subtags can be identified by length and content. Parsers do not have to have a copy of the registry to extract most of the information in a tag.

There are several extension mechanisms. In particular, private use subtags can be used in otherwise public tags.

The tags are all backwards compatible with RFC 3066. Any new tag would have been valid to register under previous versions of BCP 47. And all of the old tags are forwards compatible (although a few are only compatible via fiat).

Finally: tags and subtags are stable. Forever.
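To make “identified by length and content” concrete, here is a rough Python sketch of a registry-free parser. It assumes the subtag shapes shown on the earlier slides (two or three letters for language, four letters for script, two letters or three digits for region, longer alphanumerics for variants, single characters introducing extensions, and `x` introducing private use). The function name and the result layout are illustrative assumptions; grandfathered tags such as i-klingon and other corner cases are deliberately not handled.

```python
def _letters(subtag, length):
    """True if subtag is exactly `length` ASCII letters."""
    return len(subtag) == length and subtag.isascii() and subtag.isalpha()


def _is_variant(subtag):
    """Variants: 5-8 alphanumerics, or 4 characters starting with a digit."""
    if not (subtag.isascii() and subtag.isalnum()):
        return False
    return 5 <= len(subtag) <= 8 or (len(subtag) == 4 and subtag[0].isdigit())


def parse_language_tag(tag):
    """Split a tag into fields using only the length and content of each subtag."""
    rest = tag.split("-")
    parsed = {"language": None, "script": None, "region": None,
              "variants": [], "extensions": {}, "private_use": []}
    if rest and (_letters(rest[0], 2) or _letters(rest[0], 3)):
        parsed["language"] = rest.pop(0).lower()
    elif not rest or rest[0].lower() != "x":
        raise ValueError("unsupported or malformed tag: " + tag)
    if rest and _letters(rest[0], 4):
        parsed["script"] = rest.pop(0).title()
    if rest and (_letters(rest[0], 2)
                 or (len(rest[0]) == 3 and rest[0].isascii() and rest[0].isdigit())):
        parsed["region"] = rest.pop(0).upper()
    while rest and _is_variant(rest[0]):
        parsed["variants"].append(rest.pop(0).lower())
    while rest:
        singleton = rest.pop(0).lower()
        if len(singleton) != 1:
            raise ValueError("unexpected subtag: " + singleton)
        if singleton == "x":
            # Everything after "x" is private use.
            parsed["private_use"] = [subtag.lower() for subtag in rest]
            break
        values = []
        while rest and len(rest[0]) > 1:
            values.append(rest.pop(0).lower())
        parsed["extensions"][singleton] = values
    return parsed
```

Applied to sl-Latn-IT-rozaj-x-mine, this sketch separates the language, script, region, variant, and private use parts without consulting any registry, which is the point the slide makes.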

17

Problems

Matching
– Does “en-US” match “en-Latn-US”?

Tag Choices
– Users have more to choose from.

Implementations
– More to do, more to think about
– (easier to parse, process, support the good stuff)

The creation of the new format does create a few problems for users and implementers, though.

In particular, there are now more choices for how to form the generative language tags. Matching of tags is a particular issue we’ll cover in more depth in a second.

Users have more choices available, so implementations and guidelines are going to be necessary to help people decide what’s best for them.

Software implementations will have to do several things. Of course, they’ll have to be modified to be either well-formed or validating processors. The good news here is that the tag syntax is more deterministic and thus more amenable to parsing. And there is a data source that can easily be incorporated into code. The bad news is that some badly-written implementations are going to break and that developers need to go back and evaluate their software.

18

Tag Matching

Uses “Language Ranges” to select sets of content according to the language tag

Four Schemes
– Basic Filtering
– Extended Filtering
– Scored Filtering
– Lookup

The remaining work of LTRU relates to matching and selecting content based on language tags. This has some impact on implementations, which need to guide users in selection of the most appropriate tags.

Tag matching depends on language ranges, which are identifiers that people use to specify what they are looking for or wish to match. Ranges select sets of tags. The current version of the Internet-Draft on matching describes four types of matching in two categories (filtering and lookup).

19

Filtering

Ranges specify the least specific item
– “en” matches “en”, “en-US”, “en-Brai”, “en-boont”

Basic matching uses plain prefixes

Extended matching can match “inside bits”
– “en-*-US”

Filtering is one type of matching. In filtering, the range specifies the least specific item that constitutes a match. For example, if I specify a range of “de-CH”, all content in the matching set must include the language “de” (German) and the region “CH” (Switzerland) in its tags.

• “Basic filtering” is strict prefix matching. That is, range “de-CH” matches tags “de-CH” and “de-CH-1996” but not “de-Brai-CH”, “de”, or “de-Latn-CH-1996”.

• In “extended filtering”, ranges can match missing elements. Thus “de-*-CH” would match all of the foregoing examples except “de”.
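Both filtering schemes can be sketched in a few lines of Python. The basic version is plain prefix matching on whole subtags. The extended version below is one plausible reading, in which a wildcard lets the matcher skip over intervening subtags; the matching draft itself should be consulted for the normative algorithm. The function names are illustrative, and the rule that skipping stops at a single-character subtag is an assumption of this sketch.

```python
def basic_filter_match(language_range, tag):
    """Basic filtering: the range must equal the tag or be a whole-subtag prefix of it."""
    language_range = language_range.lower()
    tag = tag.lower()
    return tag == language_range or tag.startswith(language_range + "-")


def extended_filter_match(language_range, tag):
    """Extended filtering: '*' in the range may stand for missing subtags."""
    range_subtags = language_range.lower().split("-")
    tag_subtags = tag.lower().split("-")
    # The first subtags must agree unless the range starts with a wildcard.
    if range_subtags[0] != "*" and range_subtags[0] != tag_subtags[0]:
        return False
    tag_index = 1
    for wanted in range_subtags[1:]:
        if wanted == "*":
            continue
        # Scan forward through the tag until the wanted subtag is found.
        while True:
            if tag_index >= len(tag_subtags):
                return False
            candidate = tag_subtags[tag_index]
            tag_index += 1
            if candidate == wanted:
                break
            # Assumption: never skip past an extension or private use singleton.
            if len(candidate) == 1:
                return False
    return True
```

With these definitions, the range “de-CH” accepts “de-CH” and “de-CH-1996” under basic filtering, and “de-*-CH” accepts every example in the notes except “de” under extended filtering.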

20

Scored Filtering

Assigns a “weight” or “score” to each match

Result set is ordered by match quality

Postulated by John Cowan

Scored filtering, which was first postulated by John Cowan, assigns a weight or score to each potential range-to-tag match. Unlike the other two forms of filtering, scored filtering results in an ordered set of matching tags. This might be useful with search results, for example.

21

Lookup

Range specifies the most specific tag in a match.
– “en-US” matches “en” and “en-US” but not “en-US-boont”

Mirrors the locale fallback mechanism and many language negotiation schemes.

The other form of matching is called lookup. In lookup, the user specifies the most specific tag that represents a match. The lookup algorithm is for use in cases where the user wants exactly one item returned for each request. Loading software resources is a typical example of this kind of matching.

(Demo of all matching types)
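Lookup behaves like locale fallback: start with the full range and drop subtags from the right until something available matches. A minimal Python sketch follows, assuming the set of available tags is passed in as a list. The name `lookup`, the `default` parameter, and the extra step of dropping a dangling single-character subtag are assumptions of this sketch, not details taken from the draft.

```python
def lookup(language_range, available_tags, default=None):
    """Return the single best available tag for the range, or default if none fits."""
    # Index the available tags by lowercase form so comparison ignores case.
    available = {tag.lower(): tag for tag in available_tags}
    subtags = language_range.lower().split("-")
    while subtags:
        candidate = "-".join(subtags)
        if candidate in available:
            return available[candidate]
        # Fall back by removing the last subtag.
        subtags.pop()
        # Assumption: do not leave a lone singleton such as "x" at the end.
        while subtags and len(subtags[-1]) == 1:
            subtags.pop()
    return default
```

For example, with only “en” and “en-US-boont” available, a request for “en-US” falls back to “en”; the more specific “en-US-boont” is never returned.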

22

What Do I Do (Content Author)?

Not much.
– Existing tags are all still valid: tagging is mostly unchanged.
– Resist temptation to (ab)use the private use subtags.

Unless your language has script variations:
– Tag content with the appropriate script subtag(s)

Script subtags only apply to a small number of languages: “zh”, “sr”, “uz”, “az”, “mn”, and a very small number of others.

23

What Do I Do (Programmer)?

Check code for compliance with 3066bis
– Decide on well-formed or validating
– Implement suppress-script
– Change to using the registry
– Bother infrastructure folks (Java, MS, Mozilla, etc.) to implement the standard

24

What Do I Do (End-User)?

Check and update your language ranges.

Tag content wisely.

25

LTRU Milestone Dates

(Done) RFC 3066bis
– Registry went live in December 2005

Produce “Matching” RFC
– Draft-04 available

(Anticipated) Produce RFC 3066ter
– This includes ISO 639-3 support, extended language subtags, and possibly ISO 639-6

26

Things to Read

Registry Draft
– http://www.inter-locale.com
– http://www.ietf.org/internet-drafts/draft-ietf-ltru-registry-12.txt

Matching Draft
– http://www.inter-locale.com

LTRU Mailing List
– https://www1.ietf.org/mailman/listinfo/ltru

27

Things to Do (languages)

Get involved in LTRU

Get involved in W3C I18N Core WG!

Write implementations

Work on adoption of 3066bis: understand the impact

Then get involved with Locale identifiers…

28

Back to Locales…

IUC 20 Round Table

Suzanne Topping’s Multilingual Article

Tex Texin and the Locales list…

So we’ve done a deep dive into Language Tags, whereas my point of entry was locale identifiers. What’s going on with locales?

Back at IUC20 (see, it pays to go to these events!) there was a round-table in which there was a discussion of problems confronting the Web. Language tags and locale identifiers were one of the key topics discussed at this round table, apparently. I say “apparently”, because I left the conference before the round table. I read about the results on the W3C website and in an article by Suzanne Topping in Multilingual magazine. What I read there surprised and dismayed me.

A few weeks later, I found that others in the community were working on locales or, rather, on rubbishing locales. Tex Texin started a list (now defunct) for discussing the problem.

I got involved in thinking about the problem.

29

Locale Identifiers and Web Services

Fundamentally, my interest stemmed from the fact that I was working on Web services. Web services are supposed to define a platform-agnostic way to expose logic or functionality in a distributed fashion. By using XML and HTTP, it was hoped that Web services could provide a standards-oriented way to accomplish what CORBA or EAI vendors had been providing in a proprietary fashion previously.

The problem I was grappling with was: “how do you internationalize a Web service?”

Web services have all the same requirements any distributed system has: they have messages, data, text, and potentially cultural, regional, or other issues in them. In our programming environments we have a ready solution for addressing these problems. These often hinge on the locale. And the locale hinges on the user’s preferences in the matter.

We have standard language identifiers. We don’t have standard locale anything. What to do?

There were (and are) three schools of thought.

On the one hand are the identifier folks (such as myself) who think that if we had a general locale-and/or-international-preferences-ID-mechanism, each vendor would implement it in a manner consistent with their existing language/platform and everything would work pretty well.

On the other hand are the locale definition folks (such as Mark Davis) who think that if we all agreed to use the same locale data and locale data structures, then we could exchange identifiers and get the same results because everything is the same.

On the left foot are the folks who think locales are just a bad idea and ought to be placed in the nearest landfill or entombed in concrete, Chernobyl-style.

30

W3C and Unicode

W3C
– Identifiers and cross-over with language tags
– Web services
– XML, HTML

Unicode Consortium
– LDML
– CLDR
– Standards for content

Two standards organizations that are working in the area of locales and locale identifiers are the W3C (Internationalization Core Working Group) and the Unicode Consortium (the Common Locale Data Repository project).

The W3C is, of course, directly concerned with the use and implementation of language tags in document formats and technologies. In addition, the need for locale identification for Web services is a specific work item for the I18n working group.

The Unicode folks are working to build a standardized, comprehensive set of locale data.

31

“Language Tags and Locale Identifiers” SPEC

First Working Draft coming soon
– URIs?
– Simple tags?

The W3C is currently working on a pair of specifications (W3C-ese for “standards track documents”). The first is called “Language Tags and Locale Identifiers”, which, as its name says, has to do with actually creating locale identifiers, as well as providing implementation guidelines for RFC 3066bis and draft-matching.

There are questions about how a locale identifier should be structured. Several ideas are currently floating around. For example, URIs might be used. Or 3066bis tags might be “extended” in some way.

32

WS-I18N SPEC

First Working Draft now available:
– http://www.w3.org/TR/ws-i18n

The second spec that the W3C is working on is the WS-I18N spec, or “Web Services Internationalization”. This spec relies on the preceding document for locale identifiers and describes how to use locales with Web services technologies. Previous work by the W3C I18N WG in this area includes requirements and usage scenarios.

33

Ideas?
