Beyond Text Representation: Building on Unicode to Implement a Multilingual Text Analysis Framework
Thomas Hampp – IBM Germany Content Management Development
Mar 27, 2015
Thomas Hampp IBM
18th International Unicode Conference
Basic Text Analysis Tasks
- Code page conversion and text representation
- Segmentation (tokens, sentences, paragraphs)
- Morphological analysis / dictionary lookup
- Compound word decomposition
- Spell checking / spell aid
- …
Advanced Text Analysis Tasks
- Summarization
- Categorization / clustering
- Extraction of names, terms or relations
- Information extraction
- Parsing
All tasks should be provided for all languages
A Library for Text Analysis
- The same text analysis tasks are needed in different multilingual contexts/systems
- The same software library should be used in all contexts/systems to perform the analysis
- The library should work in a language-neutral way
- The text analysis tasks required for a given context/system should be an input parameter for the library
Two Problems and One Solution
The realization of such a library faces two kinds of challenges:
A. Implementing the actual language-specific analysis tasks
B. Encapsulating the language-specific processing by representing input and output in a language-neutral fashion
Unicode plays a major role in solving problem B
A Software Design for a Text Analysis Library
- Single API towards the application
- Separate but combinable language-specific processing modules
- Central representation system for linguistic information
- Centralized flow of control driven by linguistic analysis targets
Text Analysis Framework (TAF) - High Level Design
[Figure: components shown are the TAF Application, the Application API, the TAF Engine, the Plugin Control API, the TAF Plugins (Tokenization, Dictionary Lookup, Term- / Name Identifier, ...), and the TAF Information Store (Document Representation, Annotation Structure, Table Structure).]
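The design points and components above can be sketched in a few lines. This is only an illustration under assumptions: the names `Store`, `Plugin`, `Engine`, `provides` and `requires` are hypothetical and are not the TAF API, and the two plugins are toys. The sketch shows one way control flow can be driven by analysis targets: the application names the targets it wants, and the engine works out which plugins have to run and in which order.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple


@dataclass
class Store:
    """Central store: the text plus typed (begin, end) spans, end exclusive."""
    text: str
    annotations: Dict[str, List[Tuple[int, int]]] = field(default_factory=dict)


@dataclass
class Plugin:
    provides: str                 # annotation type this plugin produces
    requires: List[str]           # annotation types it needs as input
    run: Callable[[Store], None]  # reads from and writes to the store


class Engine:
    """Single entry point; runs only the plugins needed for the targets."""

    def __init__(self, plugins: List[Plugin]):
        self._by_target = {p.provides: p for p in plugins}

    def analyze(self, text: str, targets: List[str]) -> Store:
        store = Store(text)
        done: Set[str] = set()

        def ensure(target: str) -> None:
            if target in done:
                return
            plugin = self._by_target[target]
            for dep in plugin.requires:   # run prerequisites first
                ensure(dep)
            plugin.run(store)
            done.add(target)

        for target in targets:
            ensure(target)
        return store


def tokenize(store: Store) -> None:
    # Toy whitespace tokenizer standing in for a language-specific module.
    spans, pos = [], 0
    for word in store.text.split():
        begin = store.text.index(word, pos)
        pos = begin + len(word)
        spans.append((begin, pos))
    store.annotations["Token"] = spans


def sentences(store: Store) -> None:
    # Toy rule: a single sentence from the first to the last token.
    tokens = store.annotations["Token"]
    store.annotations["Sentence"] = (
        [(tokens[0][0], tokens[-1][1])] if tokens else []
    )


engine = Engine([Plugin("Token", [], tokenize),
                 Plugin("Sentence", ["Token"], sentences)])
# The application only asks for sentences; tokenization runs implicitly.
result = engine.analyze("IBM software is great!", ["Sentence"])
```

The application never calls a plugin directly, which is what keeps the language-specific modules exchangeable behind a single API.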
Text Analysis Framework (TAF) - Document Buffer & Annotation Structure
[Figure: the document buffer holds the text "IBM software is great! ..." with character positions numbered 1, 2, 3, ... Typed annotations, each with its own attributes, are layered over the buffer: Document [Doc-Attributes], Paragraph [Para-Attributes], Sentence [Sen-Attributes], Term [Term-Attributes], and several Token [Token-Attributes] annotations.]
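A document buffer with layered annotations can be modelled as typed spans that point into one shared text. The sketch below is an assumption-laden illustration, not the TAF data model: the field names are hypothetical, offsets are 0-based and half-open (the slide numbers positions from 1), and the spans chosen for each annotation, especially the Term span, are made up for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Annotation:
    type: str     # e.g. "Token", "Sentence", "Document"
    begin: int    # offset of the first character (0-based)
    end: int      # offset one past the last character
    attributes: Dict[str, str] = field(default_factory=dict)


buffer = "IBM software is great!"

# Illustrative spans only; the slide does not state the exact extents.
annotations: List[Annotation] = [
    Annotation("Document", 0, 22),
    Annotation("Sentence", 0, 22),
    Annotation("Term", 0, 12, {"note": "hypothetical attribute"}),
    Annotation("Token", 0, 3),
    Annotation("Token", 4, 12),
    Annotation("Token", 13, 15),
    Annotation("Token", 16, 21),
    Annotation("Token", 21, 22),
]


def covered_text(a: Annotation) -> str:
    """The text is never copied into an annotation, only referenced."""
    return buffer[a.begin:a.end]


def within(outer: Annotation, type_name: str) -> List[Annotation]:
    """All annotations of one type that lie inside another annotation."""
    return [a for a in annotations
            if a.type == type_name
            and outer.begin <= a.begin and a.end <= outer.end]


term = annotations[2]
term_tokens = [covered_text(a) for a in within(term, "Token")]
```

Because every annotation refers to the same buffer by offset, results from different analysis modules can be related to each other without the modules knowing about one another.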
Implementation
- Implemented as a C++ DLL / shared library
- Provides an extensive object-oriented API for applications and plugins
- Uses Unicode (ICU based) for all text content
- Ported to 9 platforms (therefore no platform-dependent solutions are acceptable)
- Strong focus on performance because of its use in search/indexing
- Supports 30+ languages and 90+ code pages
Enter Unicode
- Used as the internal character representation format (character set)
- Converters from/to over 90 external code pages had to be written/integrated
- A decision had to be made on the Unicode encoding format: we chose UTF-16
The Pros of UTF-16
- We started out without knowledge of surrogate issues
- False assumption: fixed-length encoding
- Good balance between size and straightforward representation
- Efficient interoperability with Windows, Java, XML4C APIs etc.
The Cons of UTF-16
- Not a fixed-length encoding because of surrogates
- Cannot be passed to legacy functions (C library, OS APIs)
- Character classification functions have to work on pointers for surrogates
- Wastes some space with western languages
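The surrogate issue is easy to see with a character outside the Basic Multilingual Plane. The following sketch uses only the Python standard library; the helper names `utf16_units` and `code_points` are hypothetical. It shows that one character can occupy two 16-bit code units, which is why code that classifies characters has to look at a position in the buffer rather than at a single unit.

```python
import struct
from typing import Iterator, List


def utf16_units(text: str) -> List[int]:
    """Return the UTF-16 code units of a Python string."""
    raw = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))


def code_points(units: List[int]) -> Iterator[int]:
    """Decode code units to code points, joining surrogate pairs."""
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        if is_high and i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
            low = units[i + 1]
            yield 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
            i += 2   # the character consumed two code units
        else:
            yield unit   # BMP character (or an unpaired surrogate, passed through)
            i += 1


text = "a\U0001D11E"           # 'a' followed by U+1D11E MUSICAL SYMBOL G CLEF
units = utf16_units(text)      # three code units for two characters
decoded = list(code_points(units))
```

Any loop that advances one code unit at a time would split the second character in half, which is the false fixed-length assumption named on the previous slide.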
ANSI C/C++ Compatibility
- ANSI C++ does define a type wchar_t for “wide” character representation (and a matching wide string class wstring)
- Unfortunately, the size and encoding of wchar_t are not standardized
- So we combined the ANSI C++ basic_string template class with the Unicode character data type from ICU to create a C++ and Unicode conformant string class
Impact Beyond Character Representation
- Tokenization
- Finite state processing
- Dictionary formats
- “Environmental” issues
- Development tools support
Impact: Tokenization
- Tokenization needs access to character properties
- Most, but not all, of the relevant properties are provided by the Unicode Character Database
- For application-defined properties there is no longer a fast and simple 256-character property lookup
- That approach is limited to western scripts
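The two kinds of property lookup can be sketched as follows. This is a minimal illustration and not the framework's tokenizer: standard properties come from Python's `unicodedata` module, and the application-defined property is kept in a sparse table, here a hypothetical `JOINER` dictionary keyed by code point, because a 256-entry array cannot cover the Unicode range.

```python
import unicodedata
from typing import Dict, List

# Hypothetical application-defined property: characters that may join two
# word parts into one token. Stored sparsely, keyed by code point.
JOINER: Dict[int, bool] = {ord("-"): True, ord("'"): True, 0x2019: True}


def is_word_char(ch: str) -> bool:
    # General categories L* (letters), M* (marks) and N* (numbers).
    return unicodedata.category(ch)[0] in "LMN"


def tokenize(text: str) -> List[str]:
    tokens: List[str] = []
    current = ""
    for i, ch in enumerate(text):
        # A joiner only counts inside a word: something before, a word
        # character directly after.
        joins = bool(JOINER.get(ord(ch), False) and current
                     and i + 1 < len(text) and is_word_char(text[i + 1]))
        if is_word_char(ch) or joins:
            current += ch
        else:
            if current:
                tokens.append(current)
                current = ""
            if not ch.isspace():
                tokens.append(ch)   # punctuation and symbols stand alone
    if current:
        tokens.append(current)
    return tokens


sample = tokenize("Gr\u00f6\u00dfe: e-mail, ok!")
```

The same loop handles Latin, Greek or Cyrillic input unchanged, since the letter test relies on Unicode categories rather than on a script-specific table.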
Impact: Finite State Processing
- Finite state character processing in C usually works with transition tables encoded as arrays
- This is easy to implement and very fast in execution
- To cover the full range of all Unicode characters, more sophisticated transition tables are required
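One common way to keep array-based transition tables usable with a large character set is to map each character to a small character class first and index the table by class instead of by code point. The slide does not say which technique the framework uses, so the sketch below is only one possible approach, with hypothetical state and class names.

```python
import unicodedata
from typing import Optional

# Character classes: the table width is the number of classes, not the
# number of characters in the character set.
LETTER, DIGIT, OTHER = 0, 1, 2

# States of a toy recognizer for identifiers and numbers.
START, IDENT, NUMBER, REJECT = 0, 1, 2, 3

TRANSITIONS = [
    # LETTER  DIGIT   OTHER
    [IDENT,  NUMBER, REJECT],   # START
    [IDENT,  IDENT,  REJECT],   # IDENT
    [REJECT, NUMBER, REJECT],   # NUMBER
    [REJECT, REJECT, REJECT],   # REJECT
]

ACCEPTING = {IDENT: "identifier", NUMBER: "number"}


def char_class(ch: str) -> int:
    category = unicodedata.category(ch)
    if category[0] == "L":
        return LETTER
    if category == "Nd":        # decimal digits of any script
        return DIGIT
    return OTHER


def classify(text: str) -> Optional[str]:
    state = START
    for ch in text:
        state = TRANSITIONS[state][char_class(ch)]
    return ACCEPTING.get(state)
```

For example, `classify("gr\u00f6\u00dfe2")` yields `"identifier"` and a string of Arabic-Indic digits yields `"number"`, with a table that is only three columns wide. The cost is one extra lookup per character to find its class.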
Impact: Dictionaries
- Dictionaries tend to be large
- As much of them as possible has to be loaded in memory for performance reasons
- For multilingual (server) applications multiple dictionaries will be in memory
- Therefore dictionary size matters a lot
- Doubling dictionary size might not be a viable option
Impact: “Environmental” Issues
- There is always a residue of single-byte string data (from message catalogs, the command line, library calls etc.) which sometimes has to be mixed with Unicode string data
- Interfaces for console, messages, logs etc. are mostly single-byte
- Configuration files should be platform-neutral, easily editable and support the full Unicode character set
Impact: Development Tools Support
- Only specialized editors can handle Unicode text
- Most debuggers don’t display Unicode
- Source code string constants are hard to maintain
- Message catalog compilers on some platforms are not Unicode enabled
A Word About Unicode Normalization Forms
- For reasons of efficient interoperability a fixed Unicode normalization form had to be specified
- Early normalization is performance critical
- Since round-trip convertibility was not a design goal, Normalization Form KC (NFKC, compatibility decomposition followed by canonical composition) was chosen
- Normalization and code page conversion can and should be done in one step
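From the caller's point of view, conversion and normalization in one step can look like the sketch below. The function name `to_internal` is hypothetical, and the sketch simply chains Python's decoder and `unicodedata.normalize`; a performance-oriented implementation could instead fold the normalization into the converter tables, which the slide hints at but does not describe.

```python
import unicodedata


def to_internal(raw: bytes, code_page: str) -> str:
    """Decode bytes from an external code page and normalize to NFKC."""
    return unicodedata.normalize("NFKC", raw.decode(code_page))


# In Windows code page 1252, 0xA0 is NO-BREAK SPACE and 0xB2 is
# SUPERSCRIPT TWO. NFKC maps both to their plain compatibility
# equivalents, so the distinction is lost on purpose.
raw = b"10\xa0m\xb2"
internal = to_internal(raw, "cp1252")
round_trip = internal.encode("cp1252")
```

Here `internal` is the plain string `"10 m2"`, and encoding it back does not reproduce the original bytes. That is the trade-off stated above: a single, predictable internal form in exchange for round-trip convertibility.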
Benefits of Unicode Use
- No more code page troubles within the boundaries of the application
- Very often algorithms can be established for groups of languages
- Multilanguage document collections and even mixed-language documents are no problem to represent
- Easy and efficient Java (JNI) integration
Summing Up: Building on Unicode…
- …solves only the basic character representation problem for multilingual text analysis
- …sets a solid foundation for a multilingual system
- …enables algorithms to be reused for groups of languages
- …can have impact on the system far beyond the character representation level
- …has been worth the trouble