Unicode and Character Sets

Jul 03, 2015


renchenyu
Transcript
Page 1: Unicode and character sets

Unicode and Character Sets

Page 2: Unicode and character sets

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

- Joel Spolsky

Co-founder of Stack Overflow

Author of "More Joel on Software"

Page 3: Unicode and character sets

To a person: A

To a computer: 0100 0001

Page 4: Unicode and character sets

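ASCII: codes 0~127 (printable characters 32~126), 7 bits, usually stored in an 8-bit octet

ISO-8859-1, ISO-8859-2, ISO-8859-3, ... up to ISO-8859-16: 8-bit charsets that keep ASCII in 0~127 and each assign 128~255 differently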

In ISO-8859-1, 0xC0 is À

In ISO-8859-7, 0xC0 is ΐ

The same octet has different meanings in different charsets!!
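The point above can be checked with Perl's core Encode module. This is a minimal sketch, not part of the original slides; the variable names are chosen here for illustration only.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The single octet 0xC0, with no charset attached to it yet.
my $octet = "\xC0";

# Decode the same octet under two different ISO-8859 parts.
my $latin1 = decode('ISO-8859-1', $octet);
my $greek  = decode('ISO-8859-7', $octet);

printf "ISO-8859-1: U+%04X\n", ord($latin1);   # U+00C0, LATIN CAPITAL LETTER A WITH GRAVE
printf "ISO-8859-7: U+%04X\n", ord($greek);    # U+0390, GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS
```

The octet never changes; only the charset used to interpret it does.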

Page 5: Unicode and character sets

Unicode

Not an encoding: it numbers characters, but says nothing about how they are stored as octets

Goal: assign a code point to every character in every writing system

A -> U+0041

http://www.unicode.org/charts/
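As a small illustration of the character-to-code-point mapping, Perl's built-in ord and chr convert between the two. The helper name u_plus is hypothetical, introduced only for this sketch.

```perl
use strict;
use warnings;

# Format the code point of a single character in the usual U+XXXX notation.
sub u_plus {
    my ($char) = @_;
    return sprintf 'U+%04X', ord($char);
}

# chr() builds a character from a code point; ord() goes the other way.
print u_plus('A'), "\n";            # U+0041
print u_plus(chr(0x795E)), "\n";    # U+795E
```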

Page 6: Unicode and character sets

How to use Unicode in computer?

Page 7: Unicode and character sets

UCS-2 (and its successor UTF-16)

PROS:

1. Maps code points U+0000~U+FFFF directly to 2 octets

CONS:

1. Incompatible with ASCII

2. Wastes memory when code point <= U+007F

3. UCS-2 cannot represent code points > U+FFFF (UTF-16 can, by using a surrogate pair of 4 octets)

A -> U+0041 -> 0x00 0x41 (big-endian)
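A short sketch of the 2-octet mapping, and of how UTF-16 reaches beyond U+FFFF, using the core Encode module. The helper hex_octets and the choice of U+1F600 as a sample code point are assumptions made for this example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Render an octet string as space-separated hex pairs.
sub hex_octets {
    my ($octets) = @_;
    return join ' ', map { sprintf '%02X', $_ } unpack('C*', $octets);
}

# A code point in the 16-bit range maps straight to two octets.
print hex_octets(encode('UTF-16BE', 'A')), "\n";           # 00 41

# A code point above U+FFFF needs a surrogate pair (four octets).
print hex_octets(encode('UTF-16BE', "\x{1F600}")), "\n";   # D8 3D DE 00
```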

Page 8: Unicode and character sets

UCS-4 (UTF-32)

PROS:

1. Maps every code point (U+0000~U+10FFFF) directly to 4 octets

CONS:

1. Incompatible with ASCII

2. Wastes a lot of memory: 4 octets for every character

A -> U+0041 -> 0x00 0x00 0x00 0x41 (big-endian)
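To make the memory cost concrete, the following sketch counts the octets needed for the same three characters in three encodings. The sample text "神马?" is borrowed from later slides; the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Three characters: U+795E, U+9A6C and an ASCII question mark.
my $text = "\x{795E}\x{9A6C}?";

my %size;
for my $encoding ('UTF-8', 'UTF-16BE', 'UTF-32BE') {
    $size{$encoding} = length(encode($encoding, $text));
    printf "%-8s %2d octets\n", $encoding, $size{$encoding};
}
# UTF-8 needs 7 octets, UTF-16BE 6, UTF-32BE 12.
```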

Page 9: Unicode and character sets

UTF-8

Code point range    Octets
0000 ~ 007F         0xxxxxxx
0080 ~ 07FF         110xxxxx 10xxxxxx
0800 ~ FFFF         1110xxxx 10xxxxxx 10xxxxxx
10000 ~ 10FFFF      11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

A => U+0041 => 1000001 => 01000001 => 0x41

神 => U+795E => 0111 100101 011110 => 11100111 10100101 10011110 => 0xE7 0xA5 0x9E
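The table can be turned into code almost line by line. The sketch below is a hand-written encoder for illustration only; the names utf8_octets and hex_list are hypothetical, and it does not reject surrogates or code points above U+10FFFF as a real encoder should. In practice Encode::encode('UTF-8', ...) does this work.

```perl
use strict;
use warnings;

# Encode one code point into a list of UTF-8 octet values, following the table.
sub utf8_octets {
    my ($cp) = @_;
    return ($cp) if $cp <= 0x7F;
    return (0xC0 | ($cp >> 6), 0x80 | ($cp & 0x3F)) if $cp <= 0x7FF;
    return (0xE0 | ($cp >> 12),
            0x80 | (($cp >> 6) & 0x3F),
            0x80 | ($cp & 0x3F)) if $cp <= 0xFFFF;
    return (0xF0 | ($cp >> 18),
            0x80 | (($cp >> 12) & 0x3F),
            0x80 | (($cp >> 6) & 0x3F),
            0x80 | ($cp & 0x3F));
}

# Render a list of octet values as 0xNN strings.
sub hex_list {
    return join ' ', map { sprintf '0x%02X', $_ } @_;
}

print hex_list(utf8_octets(0x41)), "\n";     # 0x41
print hex_list(utf8_octets(0x795E)), "\n";   # 0xE7 0xA5 0x9E
```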

Page 10: Unicode and character sets

UTF-8

PROS:

1. Compatible with ASCII

2. Can map every code point to octets

CONS:

1. The algorithm is a little more complicated, and characters have variable width

Page 11: Unicode and character sets

It does not make sense to have a string without knowing what encoding it uses.

- Joel Spolsky

Programs communicate with each other through octet streams

A sends B: E7 A5 9E E9 A9 AC 3F

A must also tell B that the octets are encoded as UTF-8.

Only then can B understand that the received message is “神马?”
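What goes wrong when B guesses the charset can be sketched with the same seven octets. This example assumes B guesses ISO-8859-1; the helper code_points is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The seven octets that A sent.
my $octets = "\xE7\xA5\x9E\xE9\xA9\xAC\x3F";

# List the code points of a character string in U+XXXX notation.
sub code_points {
    my ($chars) = @_;
    return join ' ', map { sprintf 'U+%04X', ord($_) } split //, $chars;
}

my $as_utf8   = decode('UTF-8', $octets);        # what A meant
my $as_latin1 = decode('ISO-8859-1', $octets);   # what a wrong guess yields

printf "UTF-8:      %d characters: %s\n", length($as_utf8),   code_points($as_utf8);
printf "ISO-8859-1: %d characters: %s\n", length($as_latin1), code_points($as_latin1);
# UTF-8 gives 3 characters (U+795E U+9A6C U+003F); ISO-8859-1 gives 7.
```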

Page 12: Unicode and character sets

Charsets in Perl

Page 13: Unicode and character sets

Two ways to get a string in Perl

1. Literal string

2. From I/O

Literal string: its octets depend on the encoding of your source file

# encoding UTF-8
my $a1 = "神马?";
my $a2 = "\xE7\xA5\x9E\xE9\xA9\xAC\x3F";

From I/O:

my $a3 = <FH>;

Either way, Perl sees a string of 7 octets.

Is it ISO-8859-1 or UTF-8? Perl cannot tell.

Page 14: Unicode and character sets

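By default, Perl treats it as just a sequence of octets

# encoding UTF-8
my $a1 = "神马?";
print length($a1);    # output is 7

How to make Perl treat it as a sequence of characters? Use one of:

# encoding UTF-8
use Encode;
my $a1 = "神马?";

$a1 = Encode::decode_utf8($a1);        # option 1: returns the decoded string
# $a1 = Encode::decode("utf8", $a1);   # option 2: same, naming the encoding
# Encode::_utf8_on($a1);               # option 3: flips the flag in place, no validation

print length($a1);    # output is 3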

Page 15: Unicode and character sets

What has happened inside?

1. Decode the sequence of octets into code points, as UTF-8 (or another charset)

2. Encode the code points into Perl's internal format (utf8)

3. Turn the string's UTF8 flag ON

4. Because the UTF8 flag is on, Perl treats the string as a sequence of characters
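The effect of these steps can be observed from the outside with length and the built-in utf8::is_utf8. This is a minimal sketch; it uses escaped octets instead of a literal so that it does not depend on how the source file is saved.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Seven octets, as they would arrive from a UTF-8 source file or from I/O.
my $octets = "\xE7\xA5\x9E\xE9\xA9\xAC\x3F";
printf "before decode: length=%d, UTF8 flag=%s\n",
    length($octets), utf8::is_utf8($octets) ? 'on' : 'off';

# decode() returns a new character string; $octets itself is left unchanged.
my $chars = decode('UTF-8', $octets);
printf "after decode:  length=%d, UTF8 flag=%s\n",
    length($chars), utf8::is_utf8($chars) ? 'on' : 'off';
# before: length 7, flag off; after: length 3, flag on
```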

UTF-8? utf8? UTF8?

Page 16: Unicode and character sets

UTF-8

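The standard encoding, designed by Ken Thompson and Rob Pike

utf8

Perl's internal encoding

A lax superset of UTF-8: it also accepts sequences that strict UTF-8 forbids, such as surrogates

UTF8

The name of the flag that indicates whether Perl should treat a string as a sequence of characters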
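The Encode module keeps the first two names apart as well. As a small sketch, find_encoding reports which implementation each name resolves to; the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# "UTF-8" (with the hyphen) selects the strict, standard encoding.
my $strict = find_encoding('UTF-8')->name;

# "utf8" (no hyphen) selects Perl's lax internal variant.
my $lax = find_encoding('utf8')->name;

print "UTF-8 resolves to: $strict\n";   # utf-8-strict
print "utf8  resolves to: $lax\n";      # utf8
```

When exchanging data with other programs, the strict name is usually the safer choice.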

Page 17: Unicode and character sets

More Examples

Page 18: Unicode and character sets

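# encoding UTF-8
use Devel::Peek;
use Encode;

# Dump() writes to STDERR itself, so no print is needed
Dump("神"); Dump("\xE7\xA5\x9E");                              # octets
Dump("\x{795E}"); Dump(Encode::decode_utf8("\xE7\xA5\x9E"));   # characters
Dump("\x{795E}" . "神");                                       # characters . octets

Octets (UTF8 flag off):

FLAGS = <PADMY,POK,Ppok>
PV = 0x16189d8 "\347\245\236"\0

Characters (UTF8 flag on):

FLAGS = <PADMY,POK,Ppok,UTF8>
PV = 0x2e7478 "\347\245\236"\0 [UTF8 "\x{795e}"]

Characters concatenated with octets:

FLAGS = <PADMY,POK,Ppok,UTF8>
PV = 0x2e74d8 "\347\245\236\303\247\302\245\302\236"\0 [UTF8 "\x{795e}\x{e7}\x{a5}\x{9e}"]

Each octet of "神" was upgraded as if it were an ISO-8859-1 character, so the result is double-encoded:

\x{e7} = 11100111

\303\247 = 11000011 10100111 (the utf8 encoding of \x{e7})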
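The double-encoding trap, and the way around it, can be reproduced without Devel::Peek. The sketch below uses escaped octets so it does not depend on the source file encoding; the helper code_points is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $octets = "\xE7\xA5\x9E";   # UTF-8 octets, UTF8 flag off
my $chars  = "\x{795E}";       # one character, UTF8 flag on

# List the code points of a string in U+XXXX notation.
sub code_points {
    my ($string) = @_;
    return join ' ', map { sprintf 'U+%04X', ord($_) } split //, $string;
}

# Mixing the two: Perl upgrades each octet as an ISO-8859-1 character.
my $mixed = $chars . $octets;

# Decoding first keeps everything in character form.
my $clean = $chars . decode('UTF-8', $octets);

print "mixed: ", code_points($mixed), "\n";   # U+795E U+00E7 U+00A5 U+009E
print "clean: ", code_points($clean), "\n";   # U+795E U+795E
```

A common rule that follows from this: decode octets as soon as they enter the program, and encode only when they leave.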

Page 19: Unicode and character sets

Convert “神” from UTF-8 to GBK

神 = E7 A5 9E (UTF-8 encoded), UTF8 flag = off

  | decode
  v

神 = U+795E (Unicode), stored as E7 A5 9E (utf8 encoded), UTF8 flag = on

  | encode
  v

神 = C9 F1 (GBK encoded), UTF8 flag = off
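The conversion path above, sketched with the core Encode module. The variable names are illustrative, and the encoding name 'gbk' relies on Encode's bundled Chinese encodings being installed, which is the case for a standard Perl.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Step 0: UTF-8 octets as received (UTF8 flag off).
my $utf8_octets = "\xE7\xA5\x9E";

# Step 1: decode into a character string (UTF8 flag on).
my $chars = decode('UTF-8', $utf8_octets);

# Step 2: encode the characters into GBK octets (UTF8 flag off again).
my $gbk_octets = encode('gbk', $chars);

printf "code point: U+%04X\n", ord($chars);                  # U+795E
printf "GBK octets: %s\n", uc unpack('H*', $gbk_octets);     # C9F1
```

Encode::from_to($octets, 'UTF-8', 'gbk') performs both steps in place on an octet string when the intermediate characters are not needed.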

Page 20: Unicode and character sets

Charsets in MySQL

Page 21: Unicode and character sets

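Server -> database -> table -> column (each level inherits its default charset from the previous one)

CREATE TABLE XXX
……
……
……
DEFAULT CHARSET = utf8

MySQL spells the charset utf8 (at most 3 octets per character) or utf8mb4 (full UTF-8), not UTF-8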

Page 22: Unicode and character sets

SET NAMES X

is equivalent to:

SET CHARACTER_SET_CLIENT = X;
SET CHARACTER_SET_CONNECTION = X;
SET CHARACTER_SET_RESULTS = X;

Page 23: Unicode and character sets

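Two clients talk to one MySQL server (data stored as UTF-8, Connection_charset = shiftJIS)

Shell (UTF-8):
  Client_charset = UTF-8
  request:  UTF-8 -> shiftJIS (connection), then shiftJIS -> UTF-8 (storage)
  Results_charset = UTF-8
  response: UTF-8 <- UTF-8

Perl (euc-jp):
  Client_charset = euc-jp
  request:  euc-jp -> shiftJIS (connection), then shiftJIS -> UTF-8 (storage)
  Results_charset = euc-jp
  response: euc-jp <- UTF-8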
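The chain of conversions for the Perl client can be imitated without a database. The sketch below only simulates what the server is described as doing, using the core Encode module; it does not talk to MySQL. The sample text (日本語, written as escaped code points) and the variable names are assumptions for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Sample text U+65E5 U+672C U+8A9E, as the euc-jp octets the client would send.
my $text          = "\x{65E5}\x{672C}\x{8A9E}";
my $client_octets = encode('euc-jp', $text);

# Client_charset -> Connection_charset: euc-jp -> shiftJIS
my $connection_octets = encode('shiftjis', decode('euc-jp', $client_octets));

# Connection_charset -> storage: shiftJIS -> UTF-8
my $stored_octets = encode('UTF-8', decode('shiftjis', $connection_octets));

# Storage -> Results_charset: UTF-8 -> euc-jp
my $result_octets = encode('euc-jp', decode('UTF-8', $stored_octets));

printf "stored as %d UTF-8 octets\n", length($stored_octets);
print $result_octets eq $client_octets ? "round trip ok\n" : "data changed\n";
```

The round trip is lossless here only because every character in the sample exists in all three charsets; a character missing from the connection charset would be lost on the way in.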

Page 24: Unicode and character sets

Q & A

Page 25: Unicode and character sets

Thank you!