Unicode Support in ICU for Java
Doug Felt
Globalization Center of Competency, San Jose, CA
Dec 30, 2015
Overview
• What is ICU4J?
• ICU and the JDK, a brief history
• Benefits and tradeoffs of ICU4J
• Features of ICU4J
• Performance of ICU4J
• Using ICU4J
• Conclusion and References
What is ICU4J?
• Internationalization library
– Sister project of ICU (C/C++)
– Open-source, non-viral license
– Sponsored by IBM
• Unicode Standard compliant, up-to-date
• 100% Pure Java
• Enhances and extends JDK functionality
• Over five years of continuous development
ICU and Java, a History
• Started with Java 1.1 internationalization
– Much code contributed by IBM/Taligent
– IBM provided support, bug fixes, enhancements
• Became an open-source project in 2000
– ICU4C code started with a port from Java
• Continued contributions to Java since then
– TextLayout, OpenType layout, Normalization
Collaboration with Java Teams
• We continue to work with the Java internationalization and Graphics2D teams
• We participate in Java expert groups (e.g. JSR 204, Supplementary Support)
• Differences
– perspectives (conformance, features versus size)
– processes (open source versus corporate/JSR)
– timetable (twice a year versus every two years)
Benefits
• Fully implements current standards
– Unicode collation, normalization, break iteration
– Updated more frequently than Java
• Full CLDR data
• Improved performance
• Open source, open license, customizable
• Compatible with ICU C/C++ libraries and data
• Runs on JDK 1.4
– Get supplementary support without moving to 1.5
Tradeoffs
• Not built-in, unlike Java i18n support
• Some API differences
– But generally a superset of the Java API
– Some differences unavoidable due to class restrictions
– Rule syntax differs to varying degrees
• Data differences
– ICU4J uses its own CLDR data, not the JVM’s data
• Size
– Can trim ICU4J, but it will always be larger than 0K
Features of ICU4J
• Collation
• Normalization
• Break Iteration
• UnicodeSet and Transforms
• Character Properties
• Locale data
• Other
– Calendars, Formatters, IDNA, StringPrep, IMEs
Collation
• Full UCA (Unicode Collation Algorithm)
– Java does not implement UCA collation
• Locale data
– Over 60 tailorings for locale-specific collation
– Variants: Pinyin, stroke, traditional, etc.
• Performance, compared with Java
– sorting: 2 to 20 times faster
– sort key generation: 1.5 to 4 times faster
– sort key length: 2/3 to 1/4 the length of Java sort keys
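To make the sort key idea above concrete, here is a minimal sketch of sorting by precomputed keys. It is written against the JDK's java.text classes so that it runs without the ICU4J jar; ICU4J also has Collator and CollationKey classes in com.ibm.icu.text, but treating them as drop-in replacements here is an assumption, and the class and method names below are hypothetical.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

// Hypothetical demo class: sorts strings by precomputed collation keys.
class SortKeyDemo {

    /** Returns a sorted copy of words, comparing precomputed sort keys. */
    static String[] sortWithKeys(String[] words, Locale locale) {
        Collator collator = Collator.getInstance(locale);

        // Generate one sort key per string, once, up front.
        CollationKey[] keys = new CollationKey[words.length];
        for (int i = 0; i < words.length; i++) {
            keys[i] = collator.getCollationKey(words[i]);
        }

        // CollationKey is Comparable, so the sort compares keys
        // instead of re-running the collation rules on every comparison.
        Arrays.sort(keys);

        String[] sorted = new String[keys.length];
        for (int i = 0; i < keys.length; i++) {
            sorted[i] = keys[i].getSourceString();
        }
        return sorted;
    }

    public static void main(String[] args) {
        String[] words = {"banana", "Cherry", "apple"};
        // prints [apple, banana, Cherry]
        System.out.println(Arrays.toString(sortWithKeys(words, Locale.ENGLISH)));
    }
}
```

Sort keys pay off when the same strings are compared many times, which is why both key generation speed and key length appear in the figures above.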
Normalization
• Java does not provide normalization APIs
– Java uses ICU’s implementation internally
– Useful for searching, string equivalence, simplifying processing of text
• Full implementation of Unicode standard
– NFC, NFD, NFKC, NFKD
– Also provides FCD ‘quick check’ for optimization
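The string equivalence use case can be illustrated with a short sketch. So that it runs without the ICU4J jar, it uses java.text.Normalizer, a public class that JDKs from Java 6 onward provide (later than the JDKs these slides discuss). ICU4J's own normalization classes live in com.ibm.icu.text and their method signatures may differ. The class and method names below are hypothetical.

```java
import java.text.Normalizer;

// Hypothetical demo class: compares strings for canonical equivalence.
class NormalizationDemo {

    /** True if a and b are equal after both are normalized to NFC. */
    static boolean canonicallyEqual(String a, String b) {
        String na = Normalizer.normalize(a, Normalizer.Form.NFC);
        String nb = Normalizer.normalize(b, Normalizer.Form.NFC);
        return na.equals(nb);
    }

    public static void main(String[] args) {
        String precomposed = "\u00e9";  // e with acute as one code point
        String decomposed = "e\u0301";  // e followed by combining acute

        // The raw strings differ, but they are canonically equivalent.
        System.out.println(precomposed.equals(decomposed));            // prints false
        System.out.println(canonicallyEqual(precomposed, decomposed)); // prints true
    }
}
```

Normalizing both sides before comparing is what makes searching and matching insensitive to how an accented character happens to be encoded.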
Break Iteration
• Fully conforms to Unicode specifications
– supplementary characters, Hangul
• Tags
– e.g., “what kind of word was this”
• Title case iteration
• Rule-based, dictionary-based for Thai
Unicode Set and Transforms
• UnicodeSet
– collections of characters based on properties
– logical set operations, flexible
– “[[:mark:]&[\u0600-\u067f]]”
• Transliterator
– general transformations, with chaining and editing
– converts between scripts, e.g. Greek/Latin, Devanagari/Gujarati
– rule-based, rules for common conversions supplied
• UScriptRun
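The set pattern quoted above reads as an intersection: characters that are marks and that also fall in the range U+0600 to U+067F. The sketch below spells out that membership test by hand using only java.lang.Character, so it runs without ICU4J. It is an approximation of what the pattern means, not ICU4J's UnicodeSet implementation, and the class and method names are hypothetical.

```java
// Hypothetical demo class: hand-written equivalent of intersecting
// "is a mark" with the code point range U+0600..U+067F.
class MarkSetDemo {

    /** True for the three mark categories: Mn, Mc and Me. */
    static boolean isMark(int cp) {
        int type = Character.getType(cp);
        return type == Character.NON_SPACING_MARK
            || type == Character.COMBINING_SPACING_MARK
            || type == Character.ENCLOSING_MARK;
    }

    /** True if cp lies in the range U+0600..U+067F, inclusive. */
    static boolean inRange(int cp) {
        return cp >= 0x0600 && cp <= 0x067F;
    }

    /** Intersection of the two sets above. */
    static boolean inSet(int cp) {
        return isMark(cp) && inRange(cp);
    }

    public static void main(String[] args) {
        System.out.println(inSet(0x064B)); // ARABIC FATHATAN, a mark in range: true
        System.out.println(inSet(0x0627)); // ARABIC LETTER ALEF, a letter: false
        System.out.println(inSet(0x0301)); // COMBINING ACUTE, a mark out of range: false
    }
}
```

A set object lets the same test be written as one pattern string and combined with other sets, instead of being coded by hand for each case.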
Character Properties
• All Unicode character properties
– over 80; Java provides access to about 10
• All defined code points
• Current with latest Unicode release
– ICU4J 3.0 uses Unicode 4.0.1 data
• Fast access to character data
Locale Data
• Standard data, included with ICU4J
– CLDR (Common Locale Data Repository)
– Ensures same data is available everywhere
– Can share resource data with ICU4C applications
• More locales, more kinds of data
– ~230 locales, compared to ~130 for Java
– Can modularize to include only the data you need
• RFC3066bis support (language_script_region)
– e.g., zh_Hans, zh_Hant
– keywords (orthogonal variants)
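As a rough illustration of the language_script_region structure, the sketch below splits an identifier into its three parts. It uses java.util.Locale.forLanguageTag, which exists in JDK 7 and later and expects hyphen-separated tags rather than the underscore form shown above. In ICU4J the corresponding class is ULocale; that mapping is an assumption here, since this sketch runs on the JDK alone. The class and method names are hypothetical.

```java
import java.util.Locale;

// Hypothetical demo class: splits a language tag into its subtags.
class LocaleTagDemo {

    /** Returns "language|script|region"; missing parts are empty. */
    static String describe(String languageTag) {
        Locale loc = Locale.forLanguageTag(languageTag);
        return loc.getLanguage() + "|" + loc.getScript() + "|" + loc.getCountry();
    }

    public static void main(String[] args) {
        System.out.println(describe("zh-Hant-TW")); // prints zh|Hant|TW
        System.out.println(describe("zh-Hans"));    // prints zh|Hans|
    }
}
```

The script subtag is what distinguishes zh_Hans from zh_Hant when the region alone does not determine the writing system.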
Performance of ICU4J
• Instantiation times are comparable
– Common instantiate and reuse model
– ICU4J and Java both use caches to limit impact
• Collation performance faster
– faster sorting, smaller sort keys
• Performance is difficult to measure
– JVM makes a difference
– ICU4J performs well in spot tests
– Use a scenario that matters to you to test
Property Data Timings
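JVM            ICU4J        Java         (J-I)/I
Sun 1.4.1      89 ns/op     101 ns/op     13%
Sun 1.5.0b2    117 ns/op    102 ns/op    -13%
IBM 1.4.1      50 ns/op     66 ns/op      32%
Nanoseconds per operation for character property access (getType, toLowerCase, getDirectionality) on three JVMs. (J-I)/I is the Java time minus the ICU4J time, relative to the ICU4J time, so a positive value means ICU4J was faster.
Test machine: 1.13 GHz PIII, Win2K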
Sizes of ICU4J
• Full jar file: 2,700K
• Modular builds for common subsets
– normalizer: 420K
– collator: 1,400K
– calendar: 1,300K
– break iterator: 1,300K
– basic properties: 500K
– full properties: 1,200K
– formatting: 2,200K
– transforms: 1,500K
Using ICU4J
• Jar file, just add to class path
– Or roll into your distribution, it’s Open Source!
– Modular builds help you to trim ICU4J’s code
– Data can be trimmed to further reduce size
• Parallel APIs
– APIs on parallel classes are generally a superset
– Change import (one line change) or change class name
– Some differences unavoidable (our supplementary support for Java 1.4 can’t add API to String)
Code Examples (1)
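import com.ibm.icu.text.BreakIterator;

// Visit every word boundary in text (a String defined elsewhere).
BreakIterator b = BreakIterator.getWordInstance();
b.setText(text);
for (int pos = b.first();
     pos != BreakIterator.DONE;
     pos = b.next()) {
    doSomething(pos); // pos is the offset of a boundary in text
}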
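The loop above leaves text and doSomething undefined. A complete version might look like the following sketch, which collects the boundary offsets into a list. It imports java.text.BreakIterator so that it runs on a plain JDK. Going by the "Parallel APIs" slide, swapping that import for the ICU4J one should select ICU4J's implementation, though that swap is not exercised here. The class and method names are hypothetical.

```java
import java.text.BreakIterator; // JDK class
// import com.ibm.icu.text.BreakIterator; // ICU4J parallel class (assumed swap, needs the ICU4J jar)
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class: lists the word boundary offsets in a string.
class WordBoundaryDemo {

    /** Returns every word boundary offset in text, in ascending order. */
    static List<Integer> boundaries(String text) {
        // An explicit locale keeps the result independent of the
        // machine's default locale.
        BreakIterator b = BreakIterator.getWordInstance(Locale.ENGLISH);
        b.setText(text);

        List<Integer> result = new ArrayList<Integer>();
        for (int pos = b.first(); pos != BreakIterator.DONE; pos = b.next()) {
            result.add(pos);
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [0, 5, 6, 11]
        System.out.println(boundaries("Hello world"));
    }
}
```

Note that the iterator reports boundaries, not words: the space between the two words is bracketed by offsets 5 and 6 like any other segment.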
Code Examples (2)
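import com.ibm.icu.lang.UCharacter;

// Inside a boolean method: report whether text contains an unpaired surrogate.
int cp, pos = 0;
while (pos < text.length()) {
    cp = UCharacter.codePointAt(text, pos);
    // A valid surrogate pair is read as one supplementary code point,
    // so a code point of type SURROGATE must be unpaired.
    if (UCharacter.getType(cp) == UCharacter.SURROGATE) return true;
    pos += UCharacter.charCount(cp); // advance by one or two chars
}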
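The fragment above returns from an enclosing method that the slide does not show. A minimal complete version might look like the following. It is a sketch that substitutes the java.lang.Character methods of the same names (available from JDK 1.5) for the UCharacter calls, so that it runs without the ICU4J jar. The class and method names are hypothetical.

```java
// Hypothetical demo class: scans a string by code point and reports
// whether it contains a surrogate that is not part of a valid pair.
class SurrogateScan {

    static boolean containsUnpairedSurrogate(String text) {
        int pos = 0;
        while (pos < text.length()) {
            // Reads a full code point; a valid pair comes back as one
            // supplementary code point, a lone surrogate as itself.
            int cp = Character.codePointAt(text, pos);
            if (Character.getType(cp) == Character.SURROGATE) {
                return true;
            }
            pos += Character.charCount(cp); // 1 for BMP, 2 for supplementary
        }
        return false; // reached the end without finding one
    }

    public static void main(String[] args) {
        System.out.println(containsUnpairedSurrogate("a\uD83D\uDE00b")); // valid pair: false
        System.out.println(containsUnpairedSurrogate("a\uD800b"));       // lone high surrogate: true
    }
}
```

Advancing by charCount rather than by one is what keeps the scan from splitting a valid pair and misreporting its second half.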
Code Examples (3)
import com.ibm.icu.util.ULocale;
import com.ibm.icu.text.Collator;
import java.util.Arrays;

// Sort with Spanish traditional collation, selected by a locale keyword.
ULocale ulocale = new ULocale("es_ES@collation=traditional");
Collator col = Collator.getInstance(ulocale);
String[] list = ...
Arrays.sort(list, col);
Conclusion
• ICU4J is not for you if
– you have tight size constraints
– you require the Java runtime behavior
• ICU4J is for you if
– you need full compliance with current standards
– you need current or additional locale and property data
– you need customizability
– you need features missing from Java (normalization)
– you need additional performance
References
• ICU4J
– http://oss.software.ibm.com/icu4j/
• Java
– http://java.sun.com/
– http://www.ibm.com/java/
• Unicode, CLDR
– http://www.unicode.org/
– http://www.unicode.org/cldr/