Unicode and Legacy Representations of Emoji, IUC 36. David Yonge-Mallo, i18n Engineer, Google. Oct. 24, 2012. ver. 2012-10-23 14:00
Unicode and Legacy Representations of Emoji (IUC 36)

May 14, 2015

A talk given at IUC (Internationalization & Unicode Conference) 36 on Oct. 24, 2012. The talk discusses the history of emoji, their inclusion in the Unicode standard, and legacy handling of obsolete encodings.
Transcript
Page 1: Unicode and Legacy Representations of Emoji (IUC 36)

Unicode and Legacy Representations of Emoji
IUC 36

David Yonge-Mallo, i18n Engineer, Google
Oct. 24, 2012

ver. 2012-10-23 14:00

Page 2: Unicode and Legacy Representations of Emoji (IUC 36)

"Bit rot"
09:15-10:00 KEYNOTE PRESENTATION

"Bit Rot" – A Disaster Waiting to Happen

Presenter: Dr. Vinton G. Cerf, Vice President and Chief Internet Evangelist, Google

Dr. Cerf will discuss the problem of curating digital content on the order of centuries. Unicode has a role to play although there are very complex issues relating to format and structure of digital objects, interpretation of content, intellectual property management, perhaps even patents and other legal framework questions. The problems are both technical and legal.

Page 3: Unicode and Legacy Representations of Emoji (IUC 36)

Outline
● A brief history of emoji
● Encoding: Shift JIS and Unicode
● Mapping and unification
● Emoji in Unicode 6
● Problems:
  ○ variation selectors
  ○ regional indicators
  ○ counting
● Best practices

Page 7: Unicode and Legacy Representations of Emoji (IUC 36)

Emoji down the ages
What if you were tasked with preserving the following texts to be passed down for posterity?

awesome! :-)

yay! ☺

i know how much you hiking

Page 8: Unicode and Legacy Representations of Emoji (IUC 36)

What is an emoji (絵文字)?

絵文字 = picture (絵) + character/letter (文字)

What are they?
● pictures (representational)
● includes facial expressions (smileys)
  ○ but not restricted to them
● stored and transmitted as encoded characters
  ○ used in email and SMS

History:
● popularised on Japanese mobile devices
● extension of Japanese character sets
● carrier-specific standards

Page 9: Unicode and Legacy Representations of Emoji (IUC 36)

"Early" history in Japan
Three major cell phone operators supported emoji:
● NTT DoCoMo
● au/EZweb by KDDI
● SoftBank

Problems:
● each operator had its own set of emoji
● they were encoded differently
● no interoperability between them

Page 10: Unicode and Legacy Representations of Emoji (IUC 36)

Examples of emoji

Above: DoCoMo emoji palette

Right: DoCoMo Foma P902i, c. 2005

Page 11: Unicode and Legacy Representations of Emoji (IUC 36)

Examples of emoji
Subset of KDDI emojis:

Subset of SoftBank emojis:

Page 12: Unicode and Legacy Representations of Emoji (IUC 36)

Number of supported emoji

Source: Emoji in Unicode, IUC 33

Page 14: Unicode and Legacy Representations of Emoji (IUC 36)

Encoding - Shift JIS
This is one of the most popular encodings for Japanese.

The "JIS" part refers to Japanese Industrial Standards. ISO-2022-JP is also known as the "JIS" encoding.

The "shift" part comes from how the double-byte characters are encoded.

0x00 - 0x7F : matches ASCII (except for 2 characters)
0x81 - 0x9F : first byte of a double-byte character
0xA1 - 0xDF : half-width katakana
0xE0 - 0xEF : first byte of a double-byte character
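
As a rough illustration of the byte ranges above, the following Python sketch splits a Shift JIS byte string into characters using only the four ranges from the slide. The function names are made up for this example, and the table is the slide's simplified one: lead bytes outside it (for instance the 0xF8 that begins DoCoMo's emoji range) are simply reported as "other" and treated as single bytes.

```python
from typing import List


def classify_byte(b: int) -> str:
    """Classify a byte by the simplified Shift JIS ranges listed on the slide."""
    if b <= 0x7F:
        return "ascii"      # matches ASCII (except for 2 characters)
    if 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xEF:
        return "lead"       # first byte of a double-byte character
    if 0xA1 <= b <= 0xDF:
        return "katakana"   # single-byte half-width katakana
    return "other"          # not covered by the slide's table


def split_shift_jis(data: bytes) -> List[bytes]:
    """Split Shift JIS bytes into per-character byte strings."""
    chars = []
    i = 0
    while i < len(data):
        # A lead byte "shifts" the decoder into consuming one more byte.
        width = 2 if classify_byte(data[i]) == "lead" else 1
        chars.append(data[i:i + width])
        i += width
    return chars


# ASCII letter, kanji, half-width katakana (U+FF71)
sample = "A絵\uff71".encode("shift_jis")
print([c.hex() for c in split_shift_jis(sample)])
```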

Page 15: Unicode and Legacy Representations of Emoji (IUC 36)

Encoding - Shift JIS

Source: modified from Wikipedia

Page 16: Unicode and Legacy Representations of Emoji (IUC 36)

Encoding - Unicode PUA
Unicode has a number of private use areas (PUAs).

PUA range in the Basic Multilingual Plane (BMP): 0xE000 - 0xF8FF

Supplementary PUA-A: 0xF0000 - 0xFFFFD

Supplementary PUA-B: 0x100000 - 0x10FFFD
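
A small Python sketch of a private-use check based on these ranges may make them easier to apply. The helper name `is_private_use` is an assumption of this example. Both supplementary ranges are taken to end at ...FFFD, because the last two code points of every plane are noncharacters rather than private-use characters.

```python
import unicodedata

# Private use ranges as (first, last) code points, inclusive.
PUA_RANGES = (
    (0xE000, 0xF8FF),       # PUA in the BMP
    (0xF0000, 0xFFFFD),     # Supplementary PUA-A (plane 15)
    (0x100000, 0x10FFFD),   # Supplementary PUA-B (plane 16)
)


def is_private_use(code_point: int) -> bool:
    """Return True if the code point lies in one of the private use areas."""
    return any(lo <= code_point <= hi for lo, hi in PUA_RANGES)


for cp in (0xE63E, 0xFE000, 0x263A):
    # Cross-check against the Unicode general category "Co" (private use).
    print(f"U+{cp:04X}", is_private_use(cp), unicodedata.category(chr(cp)))
```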

Page 17: Unicode and Legacy Representations of Emoji (IUC 36)

Encoding is carrier-specific
Each carrier used different values to encode emoji. For example...

NTT DoCoMo:
● Shift JIS: 0xF89F - 0xF9FC
● Unicode: 0xE63E - 0xE757 (BMP PUA)
● JIS points for e-mail

... and similarly for the other two carriers.

Page 19: Unicode and Legacy Representations of Emoji (IUC 36)

Mojibake (文字化け)

Mojibake is what happens when encoded text is displayed using the wrong encoding.

Sent:

Displayed:
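
The sent and displayed images are not preserved in this transcript, so here is a minimal Python sketch of the same effect. It assumes, purely for illustration, that the sender uses Shift JIS and the receiver wrongly assumes Latin-1; any mismatched pair of encodings would do.

```python
original = "文字化け"

# The sender encodes the text as Shift JIS ...
wire_bytes = original.encode("shift_jis")

# ... but the receiver wrongly assumes Latin-1, producing mojibake.
displayed = wire_bytes.decode("latin-1")
print(repr(displayed))

# The bytes themselves are intact, so decoding them with the
# correct encoding recovers the original text.
recovered = displayed.encode("latin-1").decode("shift_jis")
print(recovered)
```

The garbling here is reversible only because Latin-1 maps every byte to some character; with other wrong guesses, bytes can be dropped or replaced and the text is lost.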

Page 21: Unicode and Legacy Representations of Emoji (IUC 36)

Carrier-to-carrier mapping

Source: SoftBank

Table columns: SoftBank, Disney, au by KDDI, DoCoMo

Page 22: Unicode and Legacy Representations of Emoji (IUC 36)

Emoji support spreads...
Emoji began to be supported in web mail and other devices:
● Yahoo! Japan Web Mail (2006)
● Gmail (2008)
● iPhone 2.2 (2008)
● Android apps (2009)

Page 23: Unicode and Legacy Representations of Emoji (IUC 36)

Google emoji
Provides a unified representation of the three emoji sets:
● union of all the emoji characters
● cross-mapping
  ○ combine same character
  ○ a few dozen: existing Unicode
● about 700 new characters
  ○ using PUA
  ○ outside BMP (U+FExxx)

Idea:
● support legacy systems by converting between other encodings and Unicode

Diagram labels: SoftBank, KDDI, DoCoMo

Page 24: Unicode and Legacy Representations of Emoji (IUC 36)

Google PUA mapping table

Page 25: Unicode and Legacy Representations of Emoji (IUC 36)

Converting at boundaries

Convert to/from Unicode

Gmail (Google PUA)

SoftBank

KDDI

DoCoMo

Page 26: Unicode and Legacy Representations of Emoji (IUC 36)

Emoji in Gmail
Uses mapping table to convert between PUA and carrier encoding.

Display emoji using images. In some places, "[?]" is displayed.

Right: mobile Gmail on iPhone

Below: desktop Gmail compose window
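
The boundary conversion and the "[?]" fallback described on the last two slides could look roughly like the following Python sketch. Everything here is hypothetical: the function names, the carrier keys, and above all the table entries, which are illustrative placeholders and must not be read as real mapping data. The sketch also works on the carriers' Unicode PUA values rather than on their Shift JIS bytes.

```python
from typing import Dict, Tuple

# Illustrative placeholder entries only:
# (carrier, carrier PUA code point) -> unified PUA code point.
TO_UNIFIED: Dict[Tuple[str, int], int] = {
    ("docomo", 0xE63E): 0xFE000,
    ("kddi", 0xE488): 0xFE000,
    ("softbank", 0xE04A): 0xFE000,
}

# Reverse direction: (carrier, unified code point) -> carrier code point.
FROM_UNIFIED: Dict[Tuple[str, int], int] = {
    (carrier, unified): code for (carrier, code), unified in TO_UNIFIED.items()
}

FALLBACK = "[?]"


def to_unified(carrier: str, text: str) -> str:
    """Inbound boundary: replace carrier code points with unified ones."""
    return "".join(
        chr(TO_UNIFIED.get((carrier, ord(ch)), ord(ch))) for ch in text
    )


def from_unified(carrier: str, text: str) -> str:
    """Outbound boundary: map back, masking emoji the carrier lacks."""
    out = []
    for ch in text:
        cp = ord(ch)
        if (carrier, cp) in FROM_UNIFIED:
            out.append(chr(FROM_UNIFIED[(carrier, cp)]))
        elif 0xFE000 <= cp <= 0xFEFFF:
            out.append(FALLBACK)   # unified emoji with no carrier equivalent
        else:
            out.append(ch)         # ordinary text passes through unchanged
    return "".join(out)


unified = to_unified("docomo", "hi " + chr(0xE63E))
print([hex(ord(ch)) for ch in from_unified("softbank", unified)])
```

Note that the fallback replaces one character with three, which changes the length of the text.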

Page 28: Unicode and Legacy Representations of Emoji (IUC 36)

Making it official
In 2007, the Unicode Technical Committee agreed to encode most of the emoji characters, for the purpose of interoperability between systems.

Unicode proposals (joint effort by Google and Apple), 2009:
● N3582 "Proposal for Encoding Emoji Symbols"
● N3583 "Emoji Symbols Proposed for New Encoding"

Authors:
● Markus Scherer, Mark Davis, Kat Momoi, Darick Tong (Google)
● Yasuo Kida, Peter Edberg (Apple)

Page 29: Unicode and Legacy Representations of Emoji (IUC 36)

The Proposal

Source: N3583 "Emoji Symbols Proposed for New Encoding"

Page 30: Unicode and Legacy Representations of Emoji (IUC 36)

Emoji in Unicode 6
Goal:
● Encode superset of emoji in Unicode, allowing for roundtrip and fallback mappings

Restrictions:
● Source separation rule (strict rule)
● Reuse existing Unicode symbols
● Separate generic symbols
● Abstract characters (no specific colours or animation)
● Unify semantically identical symbols, but: disunify visually similar but semantically different symbols
● Unify Unicode with least-marked most-common symbol

Source: Unicode Technical Committee Subcommittee on Encoding of Symbols

Page 31: Unicode and Legacy Representations of Emoji (IUC 36)

Proposal accepted
In 2010, the new emoji were accepted into Unicode 6.

These consisted of:
● 625 emoji new 1:1 to Unicode 6
● 103 emoji unified 1:1 with existing characters
● 11 keycaps represented as [0-9#] followed by 'keycap'
● 10 new 'flag' emojis represented as sequences
● 65 emoji logos were not added

In addition, Unicode 6 added many other symbols which are similar in nature to emoji, such as playing cards, plants, and transportation symbols.
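
To make the keycap representation concrete, here is a minimal Python sketch that builds the eleven sequences. It assumes the 'keycap' mentioned above is U+20E3 COMBINING ENCLOSING KEYCAP; the helper name `keycap` is made up for this example.

```python
COMBINING_ENCLOSING_KEYCAP = "\u20E3"
KEYCAP_BASES = "0123456789#"


def keycap(base: str) -> str:
    """Return the keycap sequence for a single digit or '#'."""
    if len(base) != 1 or base not in KEYCAP_BASES:
        raise ValueError(f"not a keycap base: {base!r}")
    return base + COMBINING_ENCLOSING_KEYCAP


KEYCAPS = [keycap(c) for c in KEYCAP_BASES]

# Each keycap looks like one symbol but is stored as two code points.
print(len(KEYCAPS), [len(k) for k in KEYCAPS])
```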

Page 32: Unicode and Legacy Representations of Emoji (IUC 36)

Unified and new emoji

Unified emoji: New emoji:

Page 34: Unicode and Legacy Representations of Emoji (IUC 36)

New problems introduced
Since Gmail was already using the unified PUA, it looks like all that needs to be done to bring it up to spec is to replace the PUA code points with the official ones...

Not so fast -- it's not that simple!

Recall that one of the goals in creating the proposal was:
● Reuse existing Unicode symbols

Also, the new emoji include:
● keycaps and flags represented by sequences of characters

What could possibly go wrong?

Page 35: Unicode and Legacy Representations of Emoji (IUC 36)

Can you spot the problems?

Page 36: Unicode and Legacy Representations of Emoji (IUC 36)

Variation selectors

Source: Unicode Standardized Variants
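
The table from this slide is not preserved in the transcript. As a hedged illustration of why variation selectors cause trouble, the Python sketch below uses U+263A WHITE SMILING FACE with U+FE0E (text style) and U+FE0F (emoji style). The helper `strip_variation_selectors` is an assumption of this example, and it handles only these two selectors.

```python
TEXT_STYLE = "\uFE0E"    # VARIATION SELECTOR-15: request text presentation
EMOJI_STYLE = "\uFE0F"   # VARIATION SELECTOR-16: request emoji presentation


def strip_variation_selectors(text: str) -> str:
    """Drop VS15/VS16 so strings can be compared regardless of style."""
    return text.replace(TEXT_STYLE, "").replace(EMOJI_STYLE, "")


plain = "\u263A"
as_text = plain + TEXT_STYLE
as_emoji = plain + EMOJI_STYLE

# The selectors are invisible, yet they add to the length ...
print(len(plain), len(as_text), len(as_emoji))   # 1 2 2

# ... and make visually similar strings compare unequal.
print(as_text == as_emoji)                       # False
print(strip_variation_selectors(as_text) == strip_variation_selectors(as_emoji))  # True
```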

Page 37: Unicode and Legacy Representations of Emoji (IUC 36)

Regional Indicator symbols
The combined carrier emoji contained ten national flags (PRC, Germany, Spain, France, UK, Italy, Japan, Korea, Russia, USA).

US proposal (Google and Apple):
● encode as "emoji compatibility symbols"

Germany/Ireland counter-proposal:
● encode 256 characters for ISO 3166 country codes

Compromise:
● encode twenty-six "regional indicator symbols" (A-Z)
● spell out the two-letter country codes

Page 38: Unicode and Legacy Representations of Emoji (IUC 36)

Possible ambiguity

We have "regional indicators" 🇦 to 🇿.

But what if the middle of a string looked like this?

... ...

Is this ... ...

or ... ...?

What about CN/NC, KRUS/RUSK, BB...BBFRUSBB...?
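
The ambiguity can be reproduced in a few lines of Python. In this sketch the helper names are hypothetical, input is assumed to be plain A-Z letters, and pairing is done greedily from a chosen starting offset, which is only one possible policy.

```python
from typing import List

RI_BASE = 0x1F1E6   # REGIONAL INDICATOR SYMBOL LETTER A


def to_indicators(letters: str) -> str:
    """Spell A-Z letters with regional indicator symbols."""
    return "".join(chr(RI_BASE + ord(c) - ord("A")) for c in letters.upper())


def to_letters(indicators: str) -> str:
    """Inverse of to_indicators."""
    return "".join(chr(ord(c) - RI_BASE + ord("A")) for c in indicators)


def pair_up(run: str, start: int = 0) -> List[str]:
    """Greedily pair indicators from `start`, returning country codes."""
    letters = to_letters(run[start:])
    return [letters[i:i + 2] for i in range(0, len(letters), 2)]


run = to_indicators("KRUS")
print(pair_up(run))      # ['KR', 'US']
print(pair_up(run, 1))   # ['RU', 'S']
```

The same run yields Korea and USA when read from its beginning, but Russia plus a stray indicator when processing starts one symbol later, so code that cuts, truncates, or moves a cursor through such text needs to know where the run began.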

Page 39: Unicode and Legacy Representations of Emoji (IUC 36)

Be careful how you count!
Counting the wrong thing is a major source of bugs:
● Java's String.length() lies about Unicode supplementary code points (UCS-2 vs. UTF-16), use String.codePointCount() instead
● masking with "[?]" changes the length
● changing encoding changes the length

The above problems existed prior to Unicode 6. But now:
● variation selectors are invisible
● some emoji are represented by sequences (of supplementary code points)
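
The different counts can be compared side by side with a short Python sketch. Python strings are sequences of code points, so the UTF-16 code unit count (what Java's String.length() reports) is derived here by encoding; the function names and sample strings are choices made for this example.

```python
def utf16_units(text: str) -> int:
    """Number of UTF-16 code units, as Java's String.length() would report."""
    return len(text.encode("utf-16-le")) // 2


def code_points(text: str) -> int:
    """Number of Unicode code points."""
    return len(text)


# Every sample below is a single symbol from the user's point of view.
samples = {
    "smiley (BMP)": "\u263A",
    "smiley + variation selector": "\u263A\uFE0F",
    "keycap 1": "1\u20E3",
    "supplementary emoji": "\U0001F600",
    "flag (two regional indicators)": "\U0001F1FA\U0001F1F8",
}

for name, s in samples.items():
    print(f"{name}: {utf16_units(s)} UTF-16 units, {code_points(s)} code points")
```

Neither count matches what the user sees: the flag is one symbol, two code points, and four UTF-16 code units.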

Page 41: Unicode and Legacy Representations of Emoji (IUC 36)

Best practices
Strive for the following goals:
● use Unicode encoding rather than Shift JIS or other encodings
● use official Unicode code points instead of PUA
● choose wisely whether to use text or image
● convert to/from Unicode at boundaries
● be aware that Unicode has emoji-like symbols beyond the Japanese carrier sets, and conversion to the carrier Shift JIS encodings may not be possible for these
● follow Postel's principle
  ○ "be liberal in what you accept, but conservative in what you send"
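
One way to read Postel's principle for emoji is sketched below in Python: accept both legacy PUA code points and official ones on input, but never send private-use code points out. The function names are hypothetical and the single table entry is an illustrative placeholder, not real mapping data.

```python
import unicodedata

# Illustrative placeholder entry only:
# legacy unified PUA code point -> official Unicode code point.
PUA_TO_OFFICIAL = {0xFE000: 0x2600}

FALLBACK = "[?]"


def accept(text: str) -> str:
    """Liberal input: upgrade known legacy PUA code points, keep the rest."""
    return "".join(chr(PUA_TO_OFFICIAL.get(ord(ch), ord(ch))) for ch in text)


def send(text: str) -> str:
    """Conservative output: mask anything that is still private use."""
    return "".join(
        FALLBACK if unicodedata.category(ch) == "Co" else ch for ch in text
    )


incoming = "sunny " + chr(0xFE000) + " or " + chr(0x2600)
normalized = accept(incoming)
print(send(normalized) == normalized)   # True
```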

Page 42: Unicode and Legacy Representations of Emoji (IUC 36)

The End

Thank you!
Q & A