Internationalization in Ruby 2.4
http://www.sw.it.aoyama.ac.jp/2016/pub/IUC40-Ruby2.4/

40th Internationalization and Unicode Conference
Santa Clara, California, U.S.A., November 3, 2016

Martin J. DÜRST
Aoyama Gakuin University

© 2016 Martin J. Dürst, Aoyama Gakuin University

Abstract

Ruby is a purely object-oriented scripting language which is easy to learn for beginners and highly appreciated by experts for its productivity and depth. This presentation discusses the progress of adding internationalization functionality to Ruby for the version 2.4 release expected towards the end of 2016. One focus of the talk will be the currently ongoing implementation of locale-aware case conversion.

Since Ruby 1.9, Ruby has a pervasive if somewhat unique framework for character encoding, allowing different applications to choose different internationalization models. In practice, Ruby is most often and most conveniently used with UTF-8.

Support for internationalization facilities beyond character encoding has been available via various external libraries. As a result, applications may use conflicting and confusing ways to invoke internationalization functionality. To use case conversion as an example, up to version 2.3, Ruby comes with built-in methods for upcasing and downcasing strings, but these only work on ASCII. Our implementation extends this to the whole Unicode range for version 2.4, and efficiently reuses data already available for case-sensitive matching in regular expressions.

We study the interface of internationalization functions/methods in a wide range of programming languages and Ruby libraries. Based on this study, we propose to extend the current built-in Ruby methods, e.g. for case conversion, with additional parameters to allow language-dependent, purpose-based, and explicitly specified functionality, in a true Ruby way. Both the design as well as the implementation of the new functionality for Ruby 2.4 will be described.

This presentation is intended for users and potential users of the programming language Ruby, and people interested in internationalization of programming languages and libraries in general.

For Best Viewing


These slides have been created in HTML, for projection with Opera (≤12.17 Windows/Mac/Linux). Use F11 to switch to projection mode and back. Texts in gray, like this one, are comments/notes which do not appear on the slides. Please note that depending on the browser and OS you use, some rare characters or special character combinations may not display as intended, but e.g. as empty boxes, question marks, or apart rather than composed.

Introduction

Introductions

Audience:

Programming experience?
Programming with Ruby/Rails?
Internationalization/Globalization experience?
Unicode knowledge?

Speaker:

From Switzerland, living in Japan
Long-term Unicode/W3C/IUC involvement
Ruby committer since 2007, mainly contributing

Encoding conversion (String#encode, Ruby 1.9)
Unicode normalization (String#unicode_normalize, Ruby 2.2)
Non-ASCII case conversion (String#upcase, ..., Ruby 2.4)
Unicode version updates (Unicode 9.0 for Ruby 2.4)

Overview

Introduction
Ruby Basics
New in Ruby 2.4: Non-ASCII Case Conversion
Implementation Details
Lessons Learned and Future Work

Ruby Basics

Ruby

Created by Yukihiro Matsumoto (Matz; since 1993)
Easy for beginners, deep for experts
Object-oriented throughout, but not obtrusive
Extremely flexible
Particularly strong for (internal) DSLs and metaprogramming
Used for Ruby on Rails Web Framework

Ruby Implementations

MRI (Matz's Ruby Implementation), aka C-Ruby, available on many platforms (download for Windows)
JRuby: Ruby on the JVM
RubyMotion: Ruby for iOS, Android, and macOS
Opal: Ruby to JavaScript compiler
Rubinius: Ruby (mostly) in Ruby
A lot more ...

This tutorial is about MRI/C-Ruby, the reference implementation

Basic Ruby

3.times { puts 'Hello Ruby!' }

Hello Ruby!
Hello Ruby!
Hello Ruby!

Everything is an object
Methods can take blocks ({ ... } or do ... end)
Unobtrusive syntax (no need for semicolons, ...)

Conventions Used in This Talk

Code is mostly green, monospace
puts 'Hello Ruby!'

Variable parts are orange
puts "some string"

Encoding is indicated with a subscript (shown in this transcript as a suffix)
'Юに코δ'UTF-8, 'ユニコード'SJIS

Results are indicated with "⇒"
1 + 1 ⇒ 2

Frequent Example: Юに코δ

Ю: Cyrillic uppercase YU
に: Hiragana NI
코: Hangul KO
δ: Greek delta

Up and Running

Install Ruby
Open a UTF-8 based console

Easy on Mac and Linux
On Windows: Cygwin Terminal, PuTTY, ..., or command prompt with chcp 65001

Start irb (Interactive Ruby)
Type in Ruby commands

String Basics

Strings are sequences of characters (codepoints):
"Юに코δ".length ⇒ 4
We can get a byte count with:
"Юに코δ".bytesize ⇒ 10
They are instances of class String:
"Юに코δ".class ⇒ String
Characters are strings of length 1:
"Юに코δ"[0] ⇒ "Ю"; "Юに코δ"[0].length ⇒ 1

Using the same class for both strings and characters avoids the distinction between characters and strings of length 1. This matches Ruby's "big classes" policy. It also leaves the door open for 'characters' other than single codepoints. Strings are not Arrays, but where it makes sense, operations work the same for both classes. This is called duck typing.
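The facts on this slide can be checked directly in irb. The following is a minimal sketch, assuming a UTF-8 source file; the helper name `describe_string` is hypothetical and not part of Ruby.

```ruby
# Hypothetical helper (not part of Ruby) collecting the basic facts
# about a string that the slide lists.
def describe_string(str)
  {
    length:   str.length,    # number of characters (codepoints)
    bytesize: str.bytesize,  # number of bytes in the string's encoding
    first:    str[0],        # a "character" is itself a String of length 1
    klass:    str[0].class
  }
end

info = describe_string("Юに코δ")
puts info[:length]    # 4 characters
puts info[:bytesize]  # 10 bytes in UTF-8 (2 + 3 + 3 + 2)
puts info[:klass]     # String
```

Returning a Hash keeps the sketch easy to inspect; in everyday code one would simply call `length`, `bytesize`, and `[]` directly.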

Encoding Basics

Each String has an encoding
Strings with different encodings can't be mixed

'Юに코δ'UTF-8 + 'Юに코δ'UTF-16 ⇒ Encoding::CompatibilityError

Trying to combine strings with different encodings, as here with concatenation (+), leads to an exception. There are some exceptions (sic!) to this rule that we will look at later. The reasoning for the error here is that transcoding should not happen without the programmer being aware of it.

'Dürst'ISO-8859-1 == 'Dürst'ISO-8859-2 ⇒ false

Trying to compare two character-by-character identical strings in different encodings will produce false, even if these strings are, as in the above example, also byte-for-byte identical. Again, the reason for the result is that encoding mismatches should be detected early. In addition, a simple byte-for-byte comparison could produce false positives.

except if their content is ASCII-only (bytes)

'abc'ISO-8859-1 == 'abc'Shift_JIS ⇒ true
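The three rules above can be reproduced with `String#encode`. This is a small sketch assuming a UTF-8 source file; `try_concat` is a hypothetical helper introduced only for illustration.

```ruby
# Hypothetical helper: reports whether Ruby agrees to concatenate
# two strings, or refuses because their encodings cannot be mixed.
def try_concat(a, b)
  a + b
  :ok
rescue Encoding::CompatibilityError
  :incompatible
end

utf8  = "Юに코δ"
utf16 = utf8.encode("UTF-16LE")
p try_concat(utf8, utf16)    # :incompatible

# Same characters, same bytes, different encodings: not equal.
p "Dürst".encode("ISO-8859-1") == "Dürst".encode("ISO-8859-2")  # false

# ASCII-only content compares equal across ASCII-compatible encodings.
p "abc".encode("ISO-8859-1") == "abc".encode("Shift_JIS")       # true
```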

Just use Unicode, just use UTF-8

Ruby Likes UTF-8

Default for source encoding (since Ruby 2.0)
(no need for the # encoding: UTF-8 pragma)
Encoding of strings with \u escapes is always UTF-8

"abc\u03B4" ⇒ 'abcδ'UTF-8

Use -U option if not in a UTF-8 context: ruby -U myscript.rb
Processing of UTF-8 is optimized where possible
Used out of the box by Ruby on Rails
Transcoding available on input/output
The only (internal) encoding in Ruby 3.0 or 4.0 (speculation!)

Ruby Versions

Ruby ≤1.8: RIP (Strings as byte sequences)
Ruby 1.9 and later (Strings as character sequences)
Ruby 2.0: UTF-8 default source encoding
Ruby 2.2: Unicode normalization added (2014)
Ruby 2.3: Newest published version
Ruby 2.4: Release planned for Christmas 2016, non-ASCII case conversion

Ruby Versions and Unicode Versions

Year (y)    Ruby version (VRuby)          Unicode version (VUnicode)
            published around Christmas    published in Summer
2014        2.2                           7.0.0
2015        2.3                           8.0.0
2016        2.4                           9.0.0

A note about Ruby versions and Unicode versions: The Ruby core team is very conservative (in my view too conservative) in introducing new Unicode versions as bug fixes. Update to new Unicode versions therefore only happens for new Ruby versions.

RbConfig::CONFIG["UNICODE_VERSION"] ⇒ '9.0.0'

VUnicode = y - 2007

VRuby = 1.5 + VUnicode · 0.1

VUnicode = VRuby · 10 - 15

Don't extrapolate too far!

New in Ruby 2.4: Non-ASCII Case Conversion

Case Conversion Functions in Ruby

'Unicode Everywhere'.upcase ⇒ 'UNICODE EVERYWHERE'

'Unicode Everywhere'.downcase ⇒ 'unicode everywhere'

'Unicode Everywhere'.capitalize ⇒ 'Unicode everywhere'

'Unicode Everywhere'.swapcase ⇒ 'uNICODE eVERYWHERE'

Case Conversion in Ruby 2.3

'Résumé ĭñŧėřŋãţijňőńæłĩżàťïōņ'.upcase ⇒ 'RéSUMé ĭñŧėřŋãţijňőńæłĩżàťïōņ'

Case Conversions NOT in Ruby 2.3

'Résumé ĭñŧėřŋãţijňőńæłĩżàťïōņ'.upcase ⇒ 'RÉSUMÉ ĬÑŦĖŘŊÃŢIJŇŐŃÆŁĨŻÀŤÏŌŅ'

Case Conversion up to and including Ruby 2.3 is ASCII-only!

But in Ruby 2.4!
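To see which behaviour a given interpreter has, a feature check along the following lines may help. This is a sketch; `unicode_case_conversion?` is a hypothetical name, not a Ruby API.

```ruby
# Hypothetical feature check: true on Ruby 2.4 and later, where case
# conversion covers all of Unicode; false on Ruby 2.3 and earlier,
# where only ASCII letters are converted.
def unicode_case_conversion?
  "é".upcase == "É"
end

puts unicode_case_conversion?

word = "Résumé"
puts word.upcase      # Ruby 2.4+: RÉSUMÉ (Ruby 2.3: RéSUMé)
puts word.downcase    # résumé
puts word.swapcase    # Ruby 2.4+: rÉSUMÉ
puts "élan vital".capitalize  # Ruby 2.4+: Élan vital
```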

Case Conversion Around the World

Many more Latin letters than just A-Z
Other scripts:

Cyrillic, Greek
Coptic, Armenian [, Georgian]
Cherokee, Deseret, Osage
Old Hungarian, Warang Citi, Glagolitic, Adlam

More minority scripts may introduce case distinction from surrounding majority scripts

Case Distinction History

Originally: Style difference, depending on medium

Upper case for stone inscriptions (SPQR)
Lower case for wax tablets, ...?

Functional distinction since ~15th century

Modern Case Usage

(details vary by language)

ALL UPPER CASE

EMPHASIS
Acronyms, abbreviations (DRY, SQL)

First letter upper case

Start of sentence
Words in titles
Proper nouns/adjectives (Kyoto, Japanese)
Nouns
Honorifics

Lower case: everything else

German:
der Gefangene floh - the prisoner fled, but
der gefangene Floh - the captive flea

Isn't ASCII-only Case Conversion Enough?

Already in other languages (Python, Perl, Java, ...)
Already in Ruby (Regexp: //i)
Algorithms and data are available from the Unicode Consortium
It's a good idea in general

But: Backwards Compatibility?

Idea: Option for new functionality
'Résumé'.upcase ⇒ 'RéSUMé'
'Résumé'.upcase :unicode ⇒ 'RÉSUMÉ'
Matz felt the option was not necessary
Lots of data is ASCII-only
For non-ASCII data, you hopefully used a gem (which you can now eliminate)
Check early: grep your code base for upcase and friends
Test early (preview 2 of Ruby 2.4)

Backwards Compatibility Problems

Explicit ASCII-only case conversion

E.g. DNS servers (but you used Encoding::ASCII_8BIT there anyway?!)

Exact matches after conversion

1. Allowed non-ASCII in userids (e.g. Соколов)
2. Downcased with Ruby 2.3 to help users (which left Соколов unchanged in the DB)
3. Used exact match
4. In Ruby 2.4, the downcased input соколов will not match the stored Соколов anymore

Localization: See Turkic, Lithuanian special cases

Backwards Compatibility: :ascii Option

Use if you find a case where you really don't want to convert non-ASCII characters

'Résumé ĭñŧėřŋãţijňőńæłĩżàťïōņ'.upcase ⇒ 'RÉSUMÉ ĬÑŦĖŘŊÃŢIJŇŐŃÆŁĨŻÀŤÏŌŅ'

'Résumé ĭñŧėřŋãţijňőńæłĩżàťïōņ'.upcase :ascii ⇒ 'RéSUMé ĭñŧėřŋãţijňőńæłĩżàťïōņ'
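The userid scenario above, and how the :ascii option restores the old behaviour, might look as follows. This is a sketch only: the constant and both method names are hypothetical, and a real application would look the value up in its database.

```ruby
# A user ID as stored by an application running on Ruby 2.3, where
# downcase left the Cyrillic capital letter untouched.
STORED_USER_ID = "Соколов"

# Hypothetical lookup using the Ruby 2.4 default (full Unicode).
def matches_with_unicode_downcase?(input)
  input.downcase == STORED_USER_ID
end

# Hypothetical lookup keeping the Ruby 2.3 behaviour via :ascii.
def matches_with_ascii_downcase?(input)
  input.downcase(:ascii) == STORED_USER_ID
end

p matches_with_unicode_downcase?("Соколов")  # false on Ruby 2.4+
p matches_with_ascii_downcase?("Соколов")    # true
```

Whether to keep :ascii permanently or to re-normalize the stored data once is an application-level decision that this sketch does not address.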

Implementation Choices

Use a library?

Pure Ruby:
UnicodeUtils
ActiveSupport::Multibyte
TwitterCLDR

C extensions:
ICU as a gem: icu, ffi-icu

Different interface if used directly
Not efficient if in pure Ruby
Data duplication

Integrate ICU?

ICU and Ruby both have their own low-level idea of strings

Write new code?

That's what we ended up doing

Where to Get the Data From?

Data and other specifications available from the Unicode Consortium:

UnicodeData.txt

CaseFolding.txt

SpecialCasing.txt

Special Cases: Not 1-to-1

Number of characters not preserved
'ß'.upcase ⇒ 'SS' (German sz/sharp s)
'ﬃ'.upcase ⇒ "FFI" (ffi ligature)
Not necessarily reversible
'ß'.upcase.downcase ⇒ 'ss' (not 'ß')
'σ'.upcase ⇒ 'Σ' (Greek sigma)
'ς'.upcase ⇒ 'Σ' (Greek final sigma)
'ς'.upcase.downcase ⇒ 'σ' (not 'ς')
Implemented!
'Σ'.downcase should be context-dependent
Not yet implemented!
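The length changes and the lost round trip can be observed on Ruby 2.4 or later. A brief sketch; `round_trips?` is a hypothetical helper.

```ruby
# Hypothetical helper: does upcasing and then downcasing restore
# the original string?
def round_trips?(str)
  str.upcase.downcase == str
end

puts "ß".upcase           # SS  (one character becomes two)
puts "ﬃ".upcase           # FFI (one ligature becomes three letters)
puts "ß".upcase.downcase  # ss  (the sharp s is not restored)
puts "ς".upcase.downcase  # σ   (the final sigma is not restored)

p round_trips?("ß")       # false
p round_trips?("abc")     # true
```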

Special Case: Simple Case Mapping

Defined by Unicode
Excludes mappings that change string length
Feels outdated

Not implemented!

Special Case: Turkic

Usual:
'i'.upcase ⇒ 'I'
'I'.downcase ⇒ 'i'

Turkish, Azerbaijani, and related languages when written in Latin script:
'i'.upcase ⇒ 'İ' (uppercase I with dot)
'İ'.downcase ⇒ 'i'
'ı'.upcase ⇒ 'I' (i without dot)
'I'.downcase ⇒ 'ı'

Implemented!
'Türkiye'.upcase :turkic ⇒ 'TÜRKİYE'
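With method-call parentheses, the :turkic option from the slide is used as sketched below (Ruby 2.4 or later assumed).

```ruby
# Default mapping versus the Turkic mapping selected with :turkic.
puts "i".upcase                 # I
puts "i".upcase(:turkic)        # İ (U+0130, capital I with dot above)
puts "I".downcase               # i
puts "I".downcase(:turkic)      # ı (U+0131, dotless i)
puts "Türkiye".upcase(:turkic)  # TÜRKİYE
```

Because the option is passed per call, an application has to know from elsewhere (for example from user settings) that the text is Turkish or Azerbaijani; the string itself carries no language information.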

Special Case: Lithuanian

Usual:

'Í'.downcase ⇒ 'í' (accent replaces dot)

Lithuanian:

'Í'.downcase :lithuanian ⇒ 'i̇́' (accent above visible dot; may not show because of technology limits)

Not yet implemented!

Special Case: Case Folding

Case mapping:

Change from one form to another
upcase/downcase/capitalize/swapcase

Case folding:

Eliminate case-related differences
For comparison, sorting
In general same as downcase
But: ß → ss, ﬃ → ffi, ς → σ
Upcase for Cherokee

Implemented! with :fold option on downcase

'ß'.downcase :fold ⇒ 'ss'
'ﬃ'.downcase :fold ⇒ 'ffi'
'ς'.downcase :fold ⇒ 'σ'
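Since folding exists for comparison, a caseless equality check is a natural use. The helper name `same_ignoring_case?` below is hypothetical; the sketch assumes Ruby 2.4 or later.

```ruby
# Hypothetical helper: compare two strings after case folding.
def same_ignoring_case?(a, b)
  a.downcase(:fold) == b.downcase(:fold)
end

p same_ignoring_case?("Straße", "STRASSE")  # true: ß folds to ss
p same_ignoring_case?("ΣΟΦΟΣ", "σοφος")     # true: ς folds to σ
p "Straße".downcase == "STRASSE".downcase   # false: plain downcase keeps ß
```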

Special Case: Titlecase

Some characters have three case forms:

Upper case: DŽ (Croatian/Serbian)
Lower case: dž
Title case: Dž

Important for capitalize
'džungla'.capitalize ⇒ 'Džungla' (not 'DŽungla')

Implemented!
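The three forms are single codepoints, which is easier to see with \u escapes than with glyphs. A minimal sketch, assuming Ruby 2.4 or later; the constant names are hypothetical.

```ruby
# The three case forms of the Croatian/Serbian digraph, each a single
# codepoint (constant names are for illustration only).
DZ_UPPER = "\u01C4"  # DŽ, both letters capital
DZ_TITLE = "\u01C5"  # Dž, only the first letter capital
DZ_LOWER = "\u01C6"  # dž, both letters small

word = DZ_LOWER + "ungla"
capitalized = word.capitalize

p capitalized == DZ_TITLE + "ungla"  # true: titlecase form is used
p capitalized == DZ_UPPER + "ungla"  # false: not the uppercase form
p capitalized.length                 # 6: still one codepoint for the digraph
```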

More Special Cases

Contextual processing, e.g. for i with combining dots (part of Unicode algorithm definition)
German uppercase ß (not part of Unicode algorithm definition)
others, ...
Not implemented (yet?)

Implementation

12 Methods to Implement

String (functional)    String (destructive)    Symbol
upcase                 upcase!                 upcase
downcase               downcase!               downcase
capitalize             capitalize!             capitalize
swapcase               swapcase!               swapcase

Not dealt with: String#casecmp
Why: Includes sorting

Internally, a Single Function

Flags to indicate operation needed (in file include/ruby/oniguruma.h):

#define ONIGENC_CASE_UPCASE    (1<<13) /* uppercase mapping */
#define ONIGENC_CASE_DOWNCASE  (1<<14) /* lowercase mapping */
#define ONIGENC_CASE_TITLECASE (1<<15) /* titlecase mapping */

Usage to indicate operation type:

upcase: ONIGENC_CASE_UPCASE (upcasing needed)

downcase: ONIGENC_CASE_DOWNCASE (downcasing needed)

capitalize: ONIGENC_CASE_TITLECASE | ONIGENC_CASE_UPCASE (changed to ONIGENC_CASE_DOWNCASE after first character)

swapcase: ONIGENC_CASE_UPCASE | ONIGENC_CASE_DOWNCASE (both upcasing and downcasing needed)

Option Handling

Flags also used for options:

:fold (for case folding; only on downcase)
:turkic
:lithuanian (not yet implemented)
:ascii

Corresponding flags:

#define ONIGENC_CASE_FOLD               (1<<19) /* has/needs case folding */
#define ONIGENC_CASE_FOLD_TURKISH_AZERI (1<<20) /* Turkic */
#define ONIGENC_CASE_FOLD_LITHUANIAN    (1<<21) /* Lithuanian */
#define ONIGENC_CASE_ASCII_ONLY         (1<<22) /* limited to ASCII */
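How one flag word can carry both the operation and the options may be easier to follow in a small Ruby model. This is an illustrative sketch only: the bit values are the ones quoted above, but the shortened constant names, the two lookup tables, and `case_flags` are hypothetical, and the actual C code may combine the bits differently (in particular for :fold).

```ruby
# Bit values as quoted from include/ruby/oniguruma.h; names shortened.
CASE_UPCASE             = 1 << 13
CASE_DOWNCASE           = 1 << 14
CASE_TITLECASE          = 1 << 15
CASE_FOLD               = 1 << 19
CASE_FOLD_TURKISH_AZERI = 1 << 20
CASE_FOLD_LITHUANIAN    = 1 << 21
CASE_ASCII_ONLY         = 1 << 22

# Hypothetical tables mapping Ruby-level names to flag bits.
OPERATION_FLAGS = {
  upcase:     CASE_UPCASE,
  downcase:   CASE_DOWNCASE,
  capitalize: CASE_TITLECASE | CASE_UPCASE,
  swapcase:   CASE_UPCASE | CASE_DOWNCASE
}.freeze

OPTION_FLAGS = {
  fold:       CASE_FOLD,
  turkic:     CASE_FOLD_TURKISH_AZERI,
  lithuanian: CASE_FOLD_LITHUANIAN,
  ascii:      CASE_ASCII_ONLY
}.freeze

# Hypothetical model: build the flag word for an operation plus options.
def case_flags(operation, *options)
  flags = OPERATION_FLAGS.fetch(operation)
  options.each do |option|
    if option == :fold && operation != :downcase
      raise ArgumentError, ":fold is only allowed on downcase"
    end
    flags |= OPTION_FLAGS.fetch(option)
  end
  flags
end

printf("%#x\n", case_flags(:capitalize))        # 0xa000
printf("%#x\n", case_flags(:upcase, :turkic))   # 0x102000
```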

String Expansion

Handles string expansion (e.g. "ﬃ".upcase ⇒ "FFI")

Common to all casing operations

Linked list of buffers (b1→b2→b3→...)
Repeatedly calls encoding-specific primitive to fill as much as possible of next buffer
For buffer bx, allocates bytes_to_still_be_converted · x + 20 bytes
Example: We need a 3rd buffer, and need to convert 5 more bytes, so we allocate length(b3) = 5 · 3 + 20 = 35 bytes
Until no new buffer is needed
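The buffer sizing rule is plain arithmetic and can be written down directly. The method name `buffer_size` is hypothetical; the actual allocation happens in C.

```ruby
# Size in bytes of the x-th buffer (counting from 1) when
# bytes_remaining source bytes still have to be converted.
def buffer_size(buffer_index, bytes_remaining)
  bytes_remaining * buffer_index + 20
end

puts buffer_size(3, 5)   # 35, the example from the slide
puts buffer_size(1, 100) # 120: first buffer for a 100-byte string
```

The multiplier grows with each additional buffer, so later buffers get more generous the more often the earlier estimate turned out to be too small, while the constant 20 gives even very short strings some headroom.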

Handling Encodings: The Ruby Way

Each encoding is implemented by a series of primitives
Work like methods (polymorphism), but implemented in C
Total of 13 primitives per encoding
Example primitives:

Length of character at current byte position
Advance byte position by one character
Codepoint of character at current byte position
Insert codepoint x at current byte position

[1] 松本行弘, 縄手雅彦. スクリプト言語 Ruby の拡張可能な多言語テキスト処理の実装. 情報処理学会論文誌. 2005 Nov 15;46(11):2633-42. / Yukihiro Matsumoto and Masahiko Nawate: Multilingual Text Manipulation Method for Ruby Language. Journal of Information Processing (JIP); 2005 Nov 15; Vol. 46, No. 11, pp. 2633-42. (in Japanese)

Implementation Choice: UTF-8 only or Primitives

Matz would have been fine with

Full Unicode case conversion for UTF-8
ASCII-only for all other encodings

Actually used primitives to obtain

A more complete implementation
Experience about pros/cons of using primitives

Implementation Choice: New or Reused Primitive

3 primitives are used for case folding with regular expressions (//i)

mbc_case_fold
apply_all_case_fold
get_case_fold_codes_by_str

Found no good way to reuse any of these

New primitive

But found a lot of reusable data

The case_map Primitive

Input/output parameters:

OnigCaseFoldType flags
Start of source

Input parameters:

End of source
Start of destination
End of destination
Encoding (to call other primitives)

Output parameters:

Byte count of conversion result (negative for errors)

Most complex 'primitive', although not by much

Implementations of case_map Primitive

Examples:

"Résumé"UTF-8.upcase calls onigenc_unicode_case_map in enc/unicode.c (most complex case), as defined with OnigEncodingDefine in enc/utf_8.c
"Résumé"UTF-16LE.upcase calls onigenc_unicode_case_map in enc/unicode.c, as defined with OnigEncodingDefine in enc/utf_16le.c
"Résumé"ISO-8859-1.upcase calls case_map in enc/iso_8859_1.c (simple case, good starting point for primitive for new encoding), as defined with OnigEncodingDefine in the same file

The Primitive of Primitives: onigenc_unicode_case_map

Works for UTF-8, UTF-16[BE|LE], UTF-32[BE|LE]
140 lines long 'monster function'
Same structure as simpler primitives:

Big while loop, one source character at a time
Carefully updating ONIGENC_CASE_MODIFIED flag
Deal with special cases 'by hand'
Reuse existing data where possible

~30 if/else if/else
Lots of |/& with flag bits
2 gotos
gperf-created hash lookups:
onigenc_unicode_fold_lookup
onigenc_unicode_unfold1_lookup

More case_map Primitives

Students (sophomores/juniors/seniors) at Aoyama Gakuin University

ISO-8859-2: Yushiro Ishii (石井 優史朗)
ISO-8859-3: Kanon Shindo (新藤 海音)
ISO-8859-4: Kotaro Yoshida (吉田 孝太郎)
ISO-8859-5: Masaru Onodera (小野寺 俊)
ISO-8859-7: Kosuke Kurihara (栗原 光祐)
ISO-8859-9: Kazuki Iijima (飯島 一貴)
ISO-8859-10: Toya Hosokawa (細川 登陽)
ISO-8859-13: Takuya Miyamoto (宮本 拓弥)
ISO-8859-14: Yutaro Tada (多田 悠太朗)
ISO-8859-15: Maho Harada (原田 真帆)
ISO-8859-16: Satoshi Kayama (香山 智志)
Windows-1250, -1257: Sho Koike (小池 翔)
Windows-1251: Shunsuke Sato (佐藤 駿介)
Windows-1252: Serina Tai (田井 芹奈)
Windows-1253: Takumi Koyama (小山 拓美)

So What about Shift_JIS and Friends?

For East Asian encodings (Shift_JIS, EUC-JP, GB2312, EUC-KR, Big-5, EUC-TW, ...)

data could be shared between //i and case mapping

but case folding for //i only works for ASCII

None of the main Japanese committers thought this was needed anymore

Talk to me if you need it

Reusing Case Folding Data

Onig[uruma|mo] has data for case folding
Folding is very close to downcase
There is also unfolding (why?), which is close to upcase
That's almost all we need

Folding Data: Before and After

in enc/unicode/9.0.0/casefold.h

/* before */
{0x0041, {1, {0x0061}}}, /* A → a */
{0x00df, {2, {0x0073, 0x0073}}}, /* ß → ss */
{0x01c4, {1, {0x01c6}}}, /* DŽ → dž */
{0x01c5, {1, {0x01c6}}}, /* Dž → dž */
{0xab73, {1, {0x13a3}}}, /* ꭳ → Ꭳ (Cherokee) */

/* after */
{0x0041, {1|F|D, {0x0061}}}, /* A → a */
{0x00df, {2|F|ST|SU|I(1), {0x0073, 0x0073}}}, /* ß → ss */
{0x01c4, {1|F|D|ST|I(8), {0x01c6}}}, /* DŽ → dž */
{0x01c5, {1|F|D|IT|SU|I(9), {0x01c6}}}, /* Dž → dž */
{0xab73, {1|F|U, {0x13a3}}}, /* ꭳ → Ꭳ (Cherokee) */

Folding Data: Flags

(squeezed into an int where only 2 bits were used)

see enc/unicode.c

/* data is available here */
/* (flags are the same as for options) */
#define U ONIGENC_CASE_UPCASE
#define D ONIGENC_CASE_DOWNCASE
#define F ONIGENC_CASE_FOLD
/* data is in special additional array */
#define ST ONIGENC_CASE_TITLECASE
#define SU ONIGENC_CASE_UP_SPECIAL
#define SL ONIGENC_CASE_DOWN_SPECIAL
#define IT ONIGENC_CASE_IS_TITLECASE
/* index into special array (size: around 420 words only) */
#define I(n) OnigSpecialIndexEncode(n)

Small Implementation Detail

(or my attempt at using the Takahashi method)

upcase: seems useful

downcase: seems useful

capitalize: seems useful

swapcase: Who would use swapcase?

Nobody?
Well, I did, when testing swapcase!

Why swapcase?

Python has it ?! (Matz)

To revert accidental Caps Lock output ?! (on Unicode list)

Implementing swapcase

must be easy
UPPER ⇒ upper
lower ⇒ LOWER

But what about titlecase?

Dz, Dž, Lj, Nj

ᾼ, ᾈ, ᾉ, ᾊ, ᾋ, ᾌ, ᾍ, ᾎ, ᾏ
ῌ, ᾘ, ᾙ, ᾚ, ᾛ, ᾜ, ᾝ, ᾞ, ᾟ
ῼ, ᾨ, ᾩ, ᾪ, ᾫ, ᾬ, ᾭ, ᾮ, ᾯ

Choice 1: leave as is

"DžunGLA".swapcase ⇒ "DžUNgla"

preferred by Unicode Consortium (never ever need any new standardization)

preserves reversibility (X.swapcase.swapcase == X)

Choice 2: upcase

"DžunGLA".swapcase ⇒ "DŽUNgla"

Choice 3: downcase

"DžunGLA".swapcase ⇒ "džUNgla"

Choice 4: swap

"DžunGLA".swapcase ⇒ "dŽUNgla"

proposed by Nobuyoshi Nakada

Implemented: swap

"DžunGLA".swapcase ⇒ "dŽUNgla"

useless?, but 'correct'
additional effort for implementation
additional effort for testing

Commit Date

April 1st, 2016 (April Fools' Day)
Japan Time 20:58:33, same date in most timezones

please draw your own conclusions

Testing

Test-Driven Development

Write small example test
Verify that it doesn't work
Implement
Enjoy that it works
Rinse and repeat

Files:
test/ruby/enc/test_case_options.rb
test/ruby/enc/test_case_mapping.rb

Data-Driven Testing

Test

every character (except for ranges in UnicodeData.txt)
of every encoding
for all option combinations
for (almost) all methods

Data provided by Unicode
Identical to data used for implementation ?!

Files:
test/ruby/enc/test_case_comprehensive.rb

413 tests, 2'212'391 assertions, 0 failures, 0 errors, 0 skips
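The data-driven idea can be illustrated at a very small scale. The following sketch is not the code in test_case_comprehensive.rb; it embeds four lines in UnicodeData.txt format instead of reading the file, checks only the simple uppercase and lowercase fields, and all names in it are hypothetical.

```ruby
# Four sample lines in UnicodeData.txt format (15 semicolon-separated
# fields; field 12 is the simple uppercase mapping, field 13 the
# simple lowercase mapping).
SAMPLE_UNICODE_DATA = <<~DATA
  0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
  0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
  0394;GREEK CAPITAL LETTER DELTA;Lu;0;L;;;;;N;;;;03B4;
  03B4;GREEK SMALL LETTER DELTA;Ll;0;L;;;;;N;;;0394;;0394
DATA

# Converts a hexadecimal codepoint such as "03B4" to a UTF-8 string.
def codepoint_to_s(hex_code)
  [hex_code.hex].pack("U")
end

# Returns descriptions of all mappings where String#upcase or
# String#downcase disagrees with the data; empty if all agree.
def case_mapping_failures(unicode_data)
  failures = []
  unicode_data.each_line do |line|
    fields = line.chomp.split(";", -1)  # -1 keeps trailing empty fields
    next if fields.length < 15
    char = codepoint_to_s(fields[0])
    upper, lower = fields[12], fields[13]
    unless upper.empty? || char.upcase == codepoint_to_s(upper)
      failures << "#{fields[0]} upcase"
    end
    unless lower.empty? || char.downcase == codepoint_to_s(lower)
      failures << "#{fields[0]} downcase"
    end
  end
  failures
end

p case_mapping_failures(SAMPLE_UNICODE_DATA)
```

A complete test would also have to consult SpecialCasing.txt, because the simple mappings checked here differ from the full mappings for characters such as ß.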

Continuous Integration

Commit early, commit often

Advice (and scolding) from hardcore Ruby hackers
Keep code reasonably clean, and motivation high
More commits → higher chance to attend Ruby Kaigi for free
But: Don't want to affect Ruby build or execution

Solution:
Make use of new functionality dependent on special option
Used :lithuanian (because last to be actually implemented)
Test with option protection
Remove option protection

Future: Ideas, Problems, Questions

In No Particular Order

Character properties
Locale-aware formatting
What to do with encodings?

Character Properties

Unicode provides a wide range of character properties
Most available in Regexp
Does this string contain a Hiragana character?
'Юに코δ' =~ /\p{Hiragana}/
What script is 'Ю'?
Sorry, impossible!
Currently looking at this with a student, hopefully for Ruby ~2.5:

Use less memory
Faster
More properties
More ways to use
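Until a direct way to ask for a character's script exists, one possible workaround is to try a list of script properties one after the other. The sketch below is such a workaround, not a planned Ruby API; `CANDIDATE_SCRIPTS` and `guess_script` are hypothetical, and the list covers only a handful of scripts.

```ruby
# Does the string contain a Hiragana character? (=~ returns the index
# of the first match, or nil.)
p "Юに코δ" =~ /\p{Hiragana}/   # 1

# Hypothetical, deliberately short list of scripts to try.
CANDIDATE_SCRIPTS = %w[Latin Cyrillic Greek Hiragana Katakana Hangul Han].freeze

# Hypothetical helper: returns the first candidate script whose
# \p{...} property matches the character, or nil if none does.
def guess_script(char)
  CANDIDATE_SCRIPTS.find do |name|
    Regexp.new("\\p{#{name}}").match?(char)
  end
end

p "Юに코δ".each_char.map { |c| guess_script(c) }
# ["Cyrillic", "Hiragana", "Hangul", "Greek"]
```

Compiling one regular expression per candidate and per character is slow, which is part of why built-in support would be preferable.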

Locale-Aware Formatting

What I want:

loc = Locale.new 'de-CH' (German as used in Switzerland)

1.2345678E5.to_s ⇒ "123456.78"

1.2345678E5.to_s(loc) ⇒ "123'456,78"
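Ruby has no built-in Locale class, so the wished-for call above does not run today. As a stopgap, the separators can be passed in by hand; the following sketch does that with a hypothetical `format_number` helper whose defaults reproduce the output shown on the slide. It looks up no locale data and assumes at least one fraction digit.

```ruby
# Hypothetical helper: formats a number with explicit grouping and
# decimal separators instead of a locale object.
def format_number(value, group: "'", decimal: ",", digits: 2)
  sign = value < 0 ? "-" : ""
  integer, fraction = format("%.#{digits}f", value.abs).split(".")
  # Group the integer part in threes, counting from the right.
  grouped = integer.reverse.scan(/\d{1,3}/).join(group).reverse
  "#{sign}#{grouped}#{decimal}#{fraction}"
end

puts 1.2345678E5.to_s                                     # 123456.78
puts format_number(1.2345678E5)                           # 123'456,78
puts format_number(1.2345678E5, group: ",", decimal: ".") # 123,456.78
```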

Well, Just use a Library

Internationalization support in libraries:

Pure Ruby:
UnicodeUtils
ActiveSupport::Multibyte
TwitterCLDR

C extensions:
ICU as a gem: icu, ffi-icu

Example: Unicode Normalization

UnicodeUtils

UnicodeUtils.nfkc string

ActiveSupport::Multibyte

ActiveSupport::Multibyte::Chars.new(string).normalize :kc

TwitterCLDR

TwitterCldr::Normalization::NFKC.normalize string

Native (since Ruby 2.2)

string.unicode_normalize :nfkc
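The native method needs no gem. A short sketch of typical uses follows; `canonically_equal?` is a hypothetical helper, while `unicode_normalize` is the built-in method named above.

```ruby
# Hypothetical helper: are two strings the same after canonical
# normalization (NFC is the default form of unicode_normalize)?
def canonically_equal?(a, b)
  a.unicode_normalize == b.unicode_normalize
end

decomposed  = "e\u0301"  # e followed by a combining acute accent
precomposed = "\u00E9"   # é as a single codepoint

p decomposed == precomposed                    # false: different codepoints
p canonically_equal?(decomposed, precomposed)  # true
puts "ﬁ".unicode_normalize(:nfkc)              # fi (ligature replaced by two letters)
puts precomposed.unicode_normalize(:nfd).length  # 2
```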

Libraries avoid monkey patching

not Ruby-like (using a library does not feel like Ruby)

Locales and Case Mappings

Possible solution:

loc = Locale.new 'tr'
'Türkiye'.upcase loc ⇒ 'TÜRKİYE'

Encodings: Less is More?

We discovered flaky support for current encodings (//i case folding: all encodings not at end of test/ruby/enc/test_regex_casefold.rb)
The world is moving to Unicode
Matz wants to move to UTF-8, slowly but steadily
Do we let other encodings die slowly?
Or get rid of them in a single step (Ruby 3.0?)

Acknowledgments

Kimihito Matsui (松井 仁人) and many other students for help with research and implementations
Yui Naruse (成瀬 ゆい), Nobuyoshi Nakada (中田 伸悦) and many other Ruby committers for help and support
Matz (まつもと ゆきひろ) for Ruby, a programmer's best friend
Amaya, Opera 12.17, and coderay for slide production and display
The IME Pad for easy character input

Conclusions

Full Unicode case mapping (mostly) implemented
Options for backward compatibility, special conventions, case folding
Space efficient implementation by reusing Regexp data
Available in Ruby trunk now, please test!

More internationalization work needed
Tell me what you want most

References

More information about case conversion implementation internals:
http://www.sw.it.aoyama.ac.jp/2016/pub/RubyKaigi/
(video at http://rubykaigi.org/2016/presentations/duerst.html)

Q & A

Send questions and comments to Martin Dürst, or open a bug report or feature request for Ruby

The latest version of this presentation is available at:

http://www.sw.it.aoyama.ac.jp/2016/pub/IUC40-Ruby2.4/