What’s New in Globalization? Mark Davis President & Cofounder The Unicode Consortium

Mar 26, 2015


Gabriel Reyes
Page 1

What’s New in Globalization?

Mark Davis, President & Cofounder

The Unicode Consortium

Page 2

The Unicode Standard, Version 5.0

“Hard copy versions of the Unicode Standard have been among the most crucial and most heavily used reference books in my personal library for years.”

— Donald E. Knuth

“For more than a decade, Unicode has been a foundation for many Microsoft products and technologies; Unicode Standard Version 5.0 will help us deliver important new benefits to users.”

— Bill Gates

“The path W3C follows to making text on the Web truly global is Unicode.”

— Sir Tim Berners-Lee, KBE

“Without Unicode, Java wouldn't be Java, and the Internet would have a harder time connecting the people of the world.”

— James Gosling

Page 3

The Unicode Standard, Version 5.0

Obsoletes previous versions

Basis for Microsoft's Vista; in upgrade plans for Google, Yahoo!, and ICU, to name but a few.

Hundreds of pages of new information; thousands of revised pages; all Unicode Standard Annexes

Systematic framework for improved text processing

Improvements to the Unicode Encoding Model for UTF-8, …

Rigorous stability of case folding and identifiers

Improved interoperability and backward compatibility

Enabling additional new ways to optimize code

Page 4

U5.0 Unicode Character Database

Unicode: far more than a list of characters

Properties: key to how characters function

Changes in 5.0

Scripts: Unassigned code points → Zzzz

Casing Stability: Upper → folded

BIDI: Consistent Bidi_Mirrored

Now Normative: kIICore

Line Break: SE Asian → Complex_Context

New Properties: Normative_Name_Alias, Deprecated, 3 Unihan provisional properties

General 99,089

Private Use 137,468

Surrogate 2,048

Noncharacter 66

Reserved 875,441
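As a quick arithmetic check on the chart figures, the five allocation types should together cover the whole Unicode codespace, U+0000 through U+10FFFF. The sketch below assumes the split digits on the slide read as General 99,089 and Reserved 875,441; the dictionary name and layout are illustrative only.

```python
# Code point counts by allocation type, as read from the Unicode 5.0 slide.
# Assumption: "General99,08 / 9" is 99,089 and "Reserved875,44 / 1" is 875,441.
allocation = {
    "General": 99_089,
    "Private Use": 137_468,
    "Surrogate": 2_048,
    "Noncharacter": 66,
    "Reserved": 875_441,
}

total = sum(allocation.values())

# The codespace runs from U+0000 to U+10FFFF inclusive.
codespace = 0x10FFFF + 1

print(total, codespace, total == codespace)
```

The two totals agree, which supports that reading of the digits.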

Page 5

U5.0 Conformance

Stable Case-Folded ≈ Upper → Lower

Much clearer encoding / property model

Stable Approved Named Character Sequences

Bengali, Gurmukhi, Tamil changes

Combining grapheme joiner clarified

Disunification of Diacritics
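To illustrate why case folding is treated separately from lowercasing, here is a minimal sketch of caseless comparison. It relies on Python's built-in `str.casefold`, which follows the Unicode version bundled with the interpreter rather than 5.0 specifically; the helper name `caseless_equal` is an assumption made for this example.

```python
def caseless_equal(a: str, b: str) -> bool:
    """Compare two strings after Unicode case folding."""
    return a.casefold() == b.casefold()


# Lowercasing alone keeps the sharp s, so these two spellings do not match.
lower_match = "Straße".lower() == "STRASSE".lower()

# Case folding maps the sharp s to "ss", so the spellings compare equal.
fold_match = caseless_equal("Straße", "STRASSE")

print(lower_match, fold_match)
```

A stable folding matters because folded strings are often stored as keys; if the folding changed between versions, previously stored keys would stop matching.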

Page 6

5.0 Annexes: Core

UAX #9: Bidirectional Algorithm

Tightened conformance requirements

UAX #15: Unicode Normalization Forms

New Stream-Safe Text Format

Appendix of characters requiring special handling

Expanded info on stability guarantees

Additional detailed figures, guidelines

UAX #31: Identifier and Pattern Syntax

Added profiles & information on usage
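For readers unfamiliar with the normalization forms in UAX #15, the following sketch shows the composed and decomposed spellings of one character being brought to a common form. It uses Python's standard `unicodedata` module, whose data version depends on the interpreter; the variable names are illustrative.

```python
import unicodedata

composed = "\u00e9"      # é as a single code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

# The two spellings look the same but compare unequal as raw strings.
raw_equal = composed == decomposed

# NFC composes, NFD decomposes; after either, the spellings agree.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)

print(raw_equal, nfc == composed, nfd == decomposed)
```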

Page 7

U5.0 Annexes: Boundaries

UAX #14: Line Breaking Properties

Rules modified to improve behavior

Now Normative (conformance clauses reorganized)

UAX #29: Text Boundaries

Edge cases improved

Tailorings for text boundaries now in Unicode CLDR

Format of the rules changed to ease implementation

Additional guidelines on regex, identifiers,…

Page 8

U5.0 Characters by Script

Phags Pa

Phoenician

Devanagari

Hebrew

Greek

Kannada

Nko

Common

Latin

Inherited

Cyrillic

Cuneiform

Balinese

Page 9

Unicode Character Timeline

[Chart: logarithmic axis from 1 to 1,000,000; versions 2.0.0, 2.1.2, 3.0.0, 3.1.0, 3.2.0, 4.0.0, 4.1.0, 5.0.0; series: Letter, Symbol, Mark, Number, Punctuation, Control/Format, Separator]

Page 10

Unicode Guide for Programmers

Adjunct to Standard

Concise Guide for Software Globalization

Crucial Concepts

Key “Gotchas”: Recognize and Avoid

Details on

Encoding & conversions:

UTF-8, 16, 32 & BOM

Using character properties

Text Operations
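As one illustration of the encoding topics listed above, this sketch encodes the same three characters in UTF-8, UTF-16, and UTF-32 and shows how a UTF-8 byte order mark (BOM) is handled on decoding. It uses only Python's standard codecs; the sample string is an arbitrary choice for the example.

```python
import codecs

# 'A', 'é', and U+10900 PHOENICIAN LETTER ALF (a supplementary-plane character).
text = "A\u00e9\U00010900"

utf8 = text.encode("utf-8")        # 1 + 2 + 4 bytes
utf16 = text.encode("utf-16-le")   # 2 + 2 + 4 bytes (surrogate pair for U+10900)
utf32 = text.encode("utf-32-le")   # 4 bytes per code point

# A UTF-8 BOM is the three bytes EF BB BF at the start of the data.
with_bom = codecs.BOM_UTF8 + utf8

# Plain "utf-8" keeps the BOM as U+FEFF; "utf-8-sig" strips it.
kept = with_bom.decode("utf-8")
stripped = with_bom.decode("utf-8-sig")

print(len(utf8), len(utf16), len(utf32))
print(kept.startswith("\ufeff"), stripped == text)
```

The "gotcha" here is that a BOM left in place becomes an invisible leading character, which can break string comparisons and parsers.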

Page 11

Unicode Common Locale Data Repository: CLDR

Key locale data for world languages

Most extensive standard repository of locale data

XML format

Δευτέρα, 05 Σεπτεμβρίου 2005

Montag, 5. September 2005

¥ 1,234.57

1 234,57 руб.

Arabic – arabski
Bulgarian – bułgarski
Czech – czeski
…

Africa – 非洲
Central America – 中美洲
Eastern Africa – 东非
Northern Africa – 北非
…

AED – د.إ.
BHD – د.ب.
DZD – د.ج.
EGP – ج.م.
EUR – €
…

Z < Å

Page 12

Unicode CLDR 1.4

121 languages and 142 territories – 360 locales in all

25% more locale data; over 17,000 new/modified items

Repository separated into language vs locale data

Language-specific segmentation (word/line breaks…)

Transliterations (eg Ελληνικά ↔ Ellēniká)

Data for lenient date/time formatting and parsing

Programmer asks for “numeric day” + “abbreviated month”

Best format pattern returned, eg “dd.MMM”

+ Quarters in dates (eg 2006Q1)

BCP 47 compatibility + extensions

Page 13

BCP 47 Language Tags

Usage: HTTP, HTML, XML; CLDR Locale IDs…

RFC 4646; Obsoletes RFCs 1766, 3066

Addresses problems in RFC 3066

ISO standards: stability / accessibility / ambiguity

Parseability, Extensibility; Registration speed

Identification of script (where necessary): Traditional Chinese (zh-Hant), Serbian in Latin (sr-Latn), Azerbaijani in Cyrillic (az-Cyrl), etc.
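The parseability point can be illustrated with a small sketch that splits a tag into language, script, and region purely by subtag length and position. This is a simplified assumption for illustration: it handles only the `language[-script][-region]` shape, not the full RFC 4646 grammar (variants, extensions, private use, grandfathered tags), and the function name `parse_simple_tag` is hypothetical.

```python
import re

# Simplified pattern: 2-3 letter language, optional 4-letter script,
# optional region (2 letters or 3 digits). Not the full RFC 4646 grammar.
_SIMPLE_TAG = re.compile(
    r"^(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?$"
)


def parse_simple_tag(tag):
    """Return (language, script, region) with conventional casing."""
    match = _SIMPLE_TAG.match(tag)
    if match is None:
        raise ValueError(f"not a simple language tag: {tag!r}")
    parts = match.groupdict()
    language = parts["language"].lower()
    script = parts["script"].title() if parts["script"] else None
    region = parts["region"].upper() if parts["region"] else None
    return language, script, region


for example in ("zh-Hant", "sr-Latn", "az-Cyrl", "en-US"):
    print(example, parse_simple_tag(example))
```

Because subtag types are distinguishable by length alone, a parser can recognize the script in `sr-Latn` without consulting a registry.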

Page 14

Unicode Security

Examples: Visual Confusables: “paypal.com” with Cyrillic ‘a’…

Non visual problems: buffer overflows, non-shortest form,…

UTR #36: Unicode Security Considerations

Guidelines & Recommendations

UTS #39: Unicode Security Mechanisms

Algorithms & Data

Limitations on Repertoire

Testing for Confusables
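The "paypal.com" example can be made concrete with a rough mixed-script check. This is a heuristic sketch only, not the UTS #39 algorithm or its confusables data: Python's standard library does not expose the Script property, so the code assumes the first word of each letter's character name (LATIN, CYRILLIC, ...) is a usable stand-in. The function names are hypothetical.

```python
import unicodedata


def rough_scripts(text):
    """Collect a rough script label for each letter in the text.

    Assumption: the first word of the Unicode character name
    approximates the script; this is not the real Script property.
    """
    scripts = set()
    for ch in text:
        if ch.isalpha():
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def looks_mixed_script(text):
    """Flag text whose letters appear to come from more than one script."""
    return len(rough_scripts(text)) > 1


genuine = "paypal.com"
spoofed = "p\u0430ypal.com"  # second letter is CYRILLIC SMALL LETTER A

print(rough_scripts(genuine), looks_mixed_script(genuine))
print(rough_scripts(spoofed), looks_mixed_script(spoofed))
```

The two strings render almost identically yet differ in one code point, which is why detection has to work on character properties rather than appearance.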

Page 15

Internationalized Domain Names

One instance of broad problem

Many RFCs use Nameprep – limited to Unicode 3.2

Unicode recommendations

Narrow the repertoire: exclude symbols, punctuation

Expand the coverage: currently only Unicode 3.2.

IETF idn-nextsteps published

Some positive developments, but misreads Unicode, needs more work
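The Unicode 3.2 limitation can be observed directly in Python, whose standard `stringprep` module implements the RFC 3454 tables that Nameprep builds on and is pinned to the Unicode 3.2 database. The sketch below checks a character that the Page 8 chart lists as new in 5.0 (an Nko digit); the choice of U+07C0 as the sample is an assumption made for this example.

```python
import stringprep
import unicodedata

# U+07C0 NKO DIGIT ZERO: Nko is among the scripts listed as added in Unicode 5.0.
nko_zero = "\u07c0"

# In the interpreter's current database this is an assigned decimal digit.
current_category = unicodedata.category(nko_zero)

# stringprep table A.1 lists code points unassigned in Unicode 3.2,
# so a profile such as Nameprep treats this character as unassigned.
unassigned_in_3_2 = stringprep.in_table_a1(nko_zero)

# The frozen 3.2 database that stringprep relies on is also exposed directly.
frozen_version = unicodedata.ucd_3_2_0.unidata_version

print(current_category, unassigned_in_3_2, frozen_version)
```

Any script encoded after 3.2 is in the same position, which is the coverage gap the recommendation to expand refers to.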

Page 17

Ideographic Variation Database

U+82A6 ashi: multiple forms

The first occurrence – any glyph

Second occurrence is in the name of the town Ashiya – customarily displayed with form #4

Registration for variants

Page 18

Ideographic Variation Database

Variation Selector

Identifies a restriction on the appearance of a character

Character + Variation Selector = Variation Sequence

Han ideographs

Impossible to build a single collection for everyone: requirements from scholars, governments and publishers…

Instead, registration of multiple independent collections

Unicode Ideographic Variation Database

A given variation sequence is used in at most one collection

Makes interchange of variation sequences reliable.

Registration, not Assessment
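The "Character + Variation Selector = Variation Sequence" rule can be sketched in a few lines. The example pairs U+82A6 with VARIATION SELECTOR-17 (U+E0100) purely for illustration; which selector a registered collection assigns to a particular glyph form, such as form #4 for Ashiya, is not stated here, so that pairing is an assumption. The helper names are hypothetical.

```python
BASE = "\u82a6"            # the ideograph "ashi"
VS17 = "\U000E0100"        # VARIATION SELECTOR-17 (illustrative choice only)


def is_variation_selector(ch):
    """True for code points in the two variation selector ranges."""
    cp = ord(ch)
    return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF


def strip_variation_selectors(text):
    """Drop variation selectors, leaving the unrestricted base characters."""
    return "".join(ch for ch in text if not is_variation_selector(ch))


# A variation sequence is simply the base character followed by the selector.
sequence = BASE + VS17

print(len(sequence), strip_variation_selectors(sequence) == BASE)
```

A system that does not support the sequence can ignore the selector and still display the base character with any glyph, which is what makes interchange degrade gracefully.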

Page 19

ICU 3.6

Mature, portable C/C++/Java int’l libraries

Unicode 5.0, UCA 5.0, CLDR 1.4

ICU4C

Charset Detection

Improved: Time Zones, Thai word break, UText (64 bit), Performance, Data Management,…

ICU4J

Globalization Preferences

Flexible date/time formats*, Charset conversion*

Page 20

Near-Term Issues

Unicode 5.0.1, Unicode 5.1

CLDR / BCP 47bis

LDAP

Collation Registry

IANA Charset Registry

Page 21

Unicode 5.1 - possibilities

Characters

CJK Unified Ideographs Extension C

Minority Scripts: Cham and Lanna

Malayalam chillu

Properties/Behavior

Normalization process for stable strings

Page 22

CLDR 1.5 / BCP 47bis

CLDR 1.5

Data Submission Starting November

New structures / data

BCP 47

Adding ~7,000 (!) new language subtags

Possibly other changes…

Page 23

LDAP

Now has definitive comparison

(good)

Stuck at Unicode 3.2

(bad)

http://www.ietf.org/rfc/rfc4518.txt

Page 24

Collation Registry

Nearing approval

Adds ability to register comparisons

Workable for basic cases

http://www.ietf.org/internet-drafts/draft-newman-i18n-comparator-14.txt

Page 25

IANA Charset registry

Currently limited usefulness

Ill-defined

Missing mapping tables

Incomplete

Inaccurate

Regime Change

Hope for future improvements!

Page 26

What’s New in Globalization?

Mark Davis, President & Cofounder

The Unicode Consortium