
Post on 22-Jun-2020


Han Unification for Chinese/Japanese/Korean

Wanxing Wang
July 17, 2018

University of Waterloo CS 846 - Advanced Topics in Electronic Publishing

Agenda

• Han Characters
• Han Unification
• CJK Unified Ideographs in Unicode Table
• Further Discussion

Han Characters

• Western vs. East Asian words: phonograms vs. ideograms
• Example:

bright /brīt/ 明

moon /mo͞on/

sun /sən/

Han Characters

• Phonograms: easy to represent with a small set of symbols
• Ideograms: cannot be generated automatically; each one must be “hardcoded” when encoding
• Record number: 48027 ideographs

Variants of Han Characters

• Adapted to non-Chinese cultures:
  • Hanzi in Chinese, kanji in Japanese and hanja in Korean

• Examples:
  • 切手:
    • Chinese: “to cut hand”
    • Japanese: “stamp”
  • 中国 in Japanese:
    • China
    • A district in central-west Honshū

Variants of Han Characters

• Ideograph simplification
  • Traditional Chinese
    • Hong Kong, Macao, Taiwan and overseas Chinese communities
  • Simplified Chinese
    • By the Chinese government during the 20th century
    • Mainland China and Singapore
  • Simplified Japanese
    • By the Japanese government after the Second World War

Variants of Han Characters

• Traditional and Simplified ideographs

Variants of Han Characters

• Ideograph variants in the same character set

Variants of Han Characters

• Variants of character glyphs
  • A wide variation in the glyphs used in different countries and for different applications.

Variants of Han Characters

Han Unification

• Unicode attempts to unify all ideographs from the many CJK national character set standards into a single set of ideographs
• Goal:
  • Provide coverage for the major CJK character set standards
• Benefits:
  • A much larger repertoire of characters than found in other CJK character set standards.
  • Compatibility with the characters in existing CJK character set standards.

Three Dimensional Conceptual Model

• X-axis: semantic (meaning, function)
• Y-axis: abstract form (general form)
• Z-axis: actual shape (instantiated, typeface form)

• Only Z-axis differences were merged or unified in Unicode.

Three Dimensional Conceptual Model

Unification Rules

• Source Separation Rule: If two ideographs are distinct in a primary source standard, then they are not unified.
  • Round-trip rule
  • Z-axis variant

Unification Rules

• Noncognate Rule: If two ideographs are unrelated in historical derivation (noncognate characters), then they are not unified.
  • Noncognate
  • Cognate

Abstract Shape

• Y-axis: abstract shape
• Ideographic Component Structure

Abstract Shape

• Ideograph Features
  • Number of components
  • Relative positions of components in each complete ideograph
  • Structure of a corresponding component
  • Treatment in a source character set
  • Radical contained in a component

• If one or more of these features are different between the ideographs compared, the ideographs are considered to have different abstract shapes.

Unification Rules

• Any two ideographs that possess the same abstract shape are then unified provided that their unification is not disallowed by either the Source Separation Rule or the Noncognate Rule.
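The decision order described by the three rules can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Ideograph` record, its field names, and the sample values are hypothetical stand-ins, not real Unicode or Unihan data, and the Source Separation Rule is modelled simply as "both candidates are separately encoded in the same primary source standard".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ideograph:
    # All fields are hypothetical labels used for illustration only.
    name: str
    abstract_shape: str            # Y-axis identity of the character
    cognate_group: str             # historical derivation it belongs to
    separately_encoded_in: frozenset = frozenset()  # primary source standards


def can_unify(a: Ideograph, b: Ideograph) -> bool:
    # Source Separation Rule: distinct in a shared primary source standard.
    if a.separately_encoded_in & b.separately_encoded_in:
        return False
    # Noncognate Rule: unrelated historical derivation.
    if a.cognate_group != b.cognate_group:
        return False
    # Otherwise unify only when the abstract shapes match.
    return a.abstract_shape == b.abstract_shape


# Made-up sample candidates.
glyph_a = Ideograph("glyph-a", "shape-1", "root-1", frozenset({"STD-X"}))
glyph_b = Ideograph("glyph-b", "shape-1", "root-1", frozenset({"STD-X"}))
glyph_c = Ideograph("glyph-c", "shape-1", "root-1", frozenset({"STD-Y"}))
glyph_d = Ideograph("glyph-d", "shape-1", "root-2", frozenset({"STD-Z"}))

print(can_unify(glyph_a, glyph_b))
print(can_unify(glyph_a, glyph_c))
print(can_unify(glyph_a, glyph_d))
```

Checking the two blocking rules first mirrors the wording above: a shared abstract shape is necessary for unification, but either rule can still veto it.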

Examples

• Ideographs not unified

Examples

• Ideographs unified

Unicode Ideographs, Radicals and Strokes

• Han Unification reduced 121000 ideographs to 20902
• Arranged by radical, followed by the number of additional strokes.
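The radical-then-strokes ordering amounts to sorting on a two-part key. The sketch below assumes a small hand-written table of (radical number, additional strokes) pairs for five characters; the table and the helper name `radical_stroke_key` are illustrative and do not come from the slides.

```python
# (Kangxi radical number, additional strokes), written in by hand
# for illustration; a real tool would read these from a character database.
RADICAL_STROKES = {
    "明": (72, 4),
    "一": (1, 0),
    "月": (74, 0),
    "丁": (1, 1),
    "日": (72, 0),
}


def radical_stroke_key(ch: str) -> tuple:
    """Sort key: radical first, then number of additional strokes."""
    return RADICAL_STROKES[ch]


ordered = sorted(RADICAL_STROKES, key=radical_stroke_key)
print("".join(ordered))
```

For these five characters the radical-stroke order coincides with their code point order, which is what an arrangement by radical and additional strokes would lead one to expect.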

Han Ideograph Arrangement

• The arrangement of the Unicode Han characters is based on the positions of characters as they are listed in four major dictionaries:
  • The KangXi Zidian: chosen as primary
    • It contains most of the source characters
    • Commonly used throughout East Asia

CJK Unified Ideograph URO

• Unicode’s original block of 20902 ideographs is referred to as the Unified Repertoire and Ordering (URO).
• Range: U+4E00 ~ U+9FFF
  • Original: U+4E00 ~ U+9FA5
  • Version 5.0: U+9FA6 ~ U+9FBB
  • Recently: U+9FBC ~ U+9FCB
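As a quick arithmetic check, the original range U+4E00 ~ U+9FA5 does contain exactly 20902 code points. The snippet below is a minimal sketch; the constant and function names are chosen here for illustration.

```python
# Bounds of the original URO as given on the slide.
URO_FIRST = 0x4E00
URO_LAST = 0x9FA5

# Inclusive range, so add one.
uro_size = URO_LAST - URO_FIRST + 1


def in_original_uro(ch: str) -> bool:
    """True if the character lies in the original 20902-character URO."""
    return URO_FIRST <= ord(ch) <= URO_LAST


print(uro_size)
print(in_original_uro("明"), in_original_uro("A"))
```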

CJK Unified Ideographs Extension A-F

• Extension A: U+3400 ~ U+4DBF
  • The last large repertoire of ideographs to be added to Unicode’s BMP.
• Extension B: U+20000 ~ U+2A6DF
  • In Plane 2
• Extension C: U+2A700 ~ U+2B73F
• Extension D: U+2B740 ~ U+2B81F
• Extension E: U+2B820 ~ U+2CEAF
• Extension F: U+2CEB0 ~ U+2EBEF
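The block ranges listed in this deck can be turned into a small lookup table. The following is a sketch only: the table holds just the ranges quoted on these slides (later Unicode versions define further blocks), and `cjk_block` and `plane` are helper names invented for this example.

```python
# Block ranges as quoted on the slides (inclusive bounds).
CJK_BLOCKS = [
    ("URO", 0x4E00, 0x9FFF),
    ("Extension A", 0x3400, 0x4DBF),
    ("Extension B", 0x20000, 0x2A6DF),
    ("Extension C", 0x2A700, 0x2B73F),
    ("Extension D", 0x2B740, 0x2B81F),
    ("Extension E", 0x2B820, 0x2CEAF),
    ("Extension F", 0x2CEB0, 0x2EBEF),
]


def cjk_block(ch: str):
    """Return the block name for a character, or None if it is not listed."""
    cp = ord(ch)
    for name, first, last in CJK_BLOCKS:
        if first <= cp <= last:
            return name
    return None


def plane(ch: str) -> int:
    """Each Unicode plane spans 0x10000 code points."""
    return ord(ch) >> 16


for ch in ["明", "\u3400", "\U00020000", "A"]:
    print(f"U+{ord(ch):04X}", cjk_block(ch), "plane", plane(ch))
```

The plane computation shows why Extension A is the last extension that fits in the BMP (Plane 0), while Extension B and later sit in Plane 2.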

CJK Compatibility Ideographs

• U+F900 ~ U+FAFF
• A Unicode block created to contain Han characters that were encoded in multiple locations in other established character encodings.
• Its purpose is to retain round-trip compatibility between Unicode and those encodings.
• Includes a few regular ideographs that do not have duplicates.

CJK Compatibility Ideographs

• A process called Normalization can be applied to CJK Compatibility Ideographs; it converts them into their canonical equivalents.
• For some locales and for some code points, applying Normalization effectively removes distinctions.
• Example:
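One way to observe this effect is with Python's standard `unicodedata` module. The sketch below picks U+F900 as a sample compatibility ideograph and U+FA0E as one of the regular ideographs in the block that has no duplicate; both code points are chosen here for illustration and are not taken from the slides.

```python
import unicodedata

# A compatibility ideograph with a canonical equivalent in the URO.
compat = "\uF900"
normalized = unicodedata.normalize("NFC", compat)
print(f"U+{ord(compat):04X} -> U+{ord(normalized):04X}")

# A regular ideograph that merely lives in the compatibility block:
# it has no canonical equivalent, so normalization leaves it alone.
regular = "\uFA0E"
unchanged = unicodedata.normalize("NFC", regular)
print(f"U+{ord(regular):04X} -> U+{ord(unchanged):04X}")
```

After normalization the first character can no longer be told apart from its unified counterpart, which is the loss of distinction mentioned above.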

Kangxi Radicals, CJK Radicals Supplement and CJK Strokes

• Kangxi Radicals: U+2F00 ~ U+2FD5
  • Includes characters that represent the complete set of 214 classical radicals as used by the vast majority of ideograph dictionaries.
• CJK Radicals Supplement: U+2E80 ~ U+2EF3
  • This collection of radical variants appears to be somewhat ad hoc.
• CJK Strokes: U+31C0 ~ U+31CF, U+31D0 ~ U+31E3
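The Kangxi Radicals range can be checked against the figure of 214 radicals, and the standard `unicodedata` module shows how a radical character relates to the ordinary ideograph of the same shape. This is a minimal sketch; the variable names are illustrative, and the use of NFKC here is an addition for demonstration rather than something the slides discuss.

```python
import unicodedata

KANGXI_FIRST = 0x2F00
KANGXI_LAST = 0x2FD5

# Inclusive range, so add one.
kangxi_count = KANGXI_LAST - KANGXI_FIRST + 1
print(kangxi_count)

# The first radical and the unified ideograph it maps to under
# compatibility normalization (NFKC).
radical_one = chr(KANGXI_FIRST)
radical_name = unicodedata.name(radical_one)
ideograph = unicodedata.normalize("NFKC", radical_one)
print(radical_name, f"-> U+{ord(ideograph):04X}")
```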

Further Discussion

• Language tags and Han Unification
  • A common misunderstanding: Han characters cannot be rendered properly without language information.
  • Plain text remains legible in the absence of these specifications.

Further Discussion

• What if the ideographs are not enough?
  • GETA MARK: U+3013 〓
  • IDEOGRAPHIC VARIATION INDICATOR: U+303E 〾
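A typical use of the geta mark is as a visible placeholder for an ideograph that cannot be represented. The sketch below assumes a hypothetical set of available characters and a helper named `substitute_missing`; neither comes from the slides.

```python
import unicodedata

GETA_MARK = "\u3013"


def substitute_missing(text: str, available: set) -> str:
    """Replace every character outside `available` with the geta mark."""
    return "".join(ch if ch in available else GETA_MARK for ch in text)


# Hypothetical repertoire that lacks the second character.
available_chars = {"明"}
result = substitute_missing("明月", available_chars)

print(unicodedata.name(GETA_MARK))
print(result)
```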

Questions

Thanks for listening!
