Beyond Text Representation Building on Unicode to Implement a Multilingual Text Analysis Framework Thomas Hampp – IBM Germany Content Management Development

Mar 27, 2015

Beyond Text Representation

Building on Unicode to Implement a Multilingual Text Analysis Framework

Thomas Hampp – IBM Germany Content Management Development

Thomas Hampp IBM

18th International Unicode Conference

Basic Text Analysis Tasks
- Code page conversion and text representation
- Segmentation (tokens, sentences, paragraphs)
- Morphological analysis / dictionary lookup
- Compound word decomposition
- Spell checking / spell aid
- …

Advanced Text Analysis Tasks
- Summarization
- Categorization / clustering
- Extraction of names, terms or relations
- Information extraction
- Parsing
All tasks should be provided for all languages

A Library for Text Analysis
- The same text analysis tasks are needed in different multilingual contexts/systems
- The same software library should be used in all contexts/systems to perform the analysis
- The library should work in a language-neutral way
- The text analysis tasks required for a given context/system should be an input parameter for the library

Two Problems and One Solution
- The realization of such a library faces two kinds of challenges:
  A. Implementing the actual language-specific analysis tasks
  B. Encapsulating the language-specific processing by representing input and output in a language-neutral fashion
- Unicode plays a major role in solving problem B

A Software Design for a Text Analysis Library
- Single API towards the application
- Separate but combinable language-specific processing modules
- Central representation system for linguistic information
- Centralized flow of control driven by linguistic analysis targets
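
The idea of control flow driven by analysis targets can be pictured with the minimal sketch below. It is not TAF's actual API: the names `Plugin`, `Engine` and `plan` are hypothetical, and the sketch assumes only that each plugin declares the target it produces and the targets it needs first.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical plugin description (not the TAF API): the analysis target a
// plugin produces and the targets that must already be available.
struct Plugin {
    std::string produces;
    std::vector<std::string> needs;
};

// Hypothetical engine: the application only names the targets it wants and
// the engine works out which plugins have to run, prerequisites first.
class Engine {
public:
    void add(const Plugin& p) { plugins_[p.produces] = p; }

    // Returns the targets in the order in which they have to be produced.
    std::vector<std::string> plan(const std::vector<std::string>& targets) const {
        std::vector<std::string> order;
        std::set<std::string> seen;
        for (const auto& t : targets) visit(t, seen, order);
        return order;
    }

private:
    void visit(const std::string& target, std::set<std::string>& seen,
               std::vector<std::string>& order) const {
        if (seen.count(target)) return;
        seen.insert(target);  // marked before recursing so cycles terminate
        auto it = plugins_.find(target);
        if (it == plugins_.end()) return;  // no plugin offers this target
        for (const auto& dep : it->second.needs) visit(dep, seen, order);
        order.push_back(target);
    }

    std::map<std::string, Plugin> plugins_;
};
```

With plugins registered for `tokens`, `lemmas` (needing `tokens`) and `terms` (needing `lemmas`), a request for `terms` alone yields the plan tokens, lemmas, terms. The application never has to name the intermediate steps.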

Text Analysis Framework (TAF) - High Level Design
[Diagram. Recoverable labels: TAF Application; Application API; TAFEngine; Plugin Control API; TAF Plugins (Tokenization, Dictionary Lookup, Term / Name Identifier, ...); TAF Information Store (Document Representation, Annotation Structure, Table Structure)]

Text Analysis Framework (TAF) - Document Buffer & Annotation Structure
[Diagram. The document buffer holds the text "IBM software is great! ..." with character positions 1, 2, 3, ... 22, ... Typed annotations (Document, Paragraph, Sentence, Term, Token), each with its own attributes, cover ranges of the buffer.]
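
A document buffer with typed annotations over character ranges could look roughly like the sketch below. The names `Annotation`, `Document`, `annotate`, `covered` and `within` are hypothetical and not taken from TAF. Attributes are left out, and offsets are 0-based and half-open, whereas the slide numbers positions from 1.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical annotation: a typed range [begin, end) over the buffer.
// The per-type attributes shown on the slide are omitted for brevity.
struct Annotation {
    std::string type;
    std::size_t begin;
    std::size_t end;
};

// Hypothetical document: one UTF-16 buffer plus annotations that only
// refer to it by offset, so the text itself is never copied or modified.
struct Document {
    std::u16string text;
    std::vector<Annotation> annotations;

    void annotate(const std::string& type, std::size_t begin, std::size_t end) {
        annotations.push_back(Annotation{type, begin, end});
    }

    // The text covered by one annotation.
    std::u16string covered(const Annotation& a) const {
        return text.substr(a.begin, a.end - a.begin);
    }

    // All annotations of one type that lie completely inside [begin, end).
    std::vector<Annotation> within(const std::string& type,
                                   std::size_t begin, std::size_t end) const {
        std::vector<Annotation> result;
        for (const Annotation& a : annotations) {
            if (a.type == type && a.begin >= begin && a.end <= end) {
                result.push_back(a);
            }
        }
        return result;
    }
};
```

Because annotations of different types can overlap freely, a tokenizer, a sentence splitter and a term identifier can each add their results without knowing about one another.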

Implementation
- Implemented as a C++ DLL / shared library
- Provides an extensive object-oriented API for applications and plugins
- Uses Unicode (ICU-based) for all text content
- Ported to 9 platforms (therefore no platform-dependent solutions are acceptable)
- Strong focus on performance because of its use in search/indexing
- Supports 30+ languages and 90+ code pages

Enter Unicode
- Used as the internal character representation format (character set)
- Converters from/to over 90 external code pages had to be written/integrated
- A decision had to be made on the Unicode encoding format: we chose UTF-16

The Pros of UTF-16
- We started out without knowledge of surrogate issues
- False assumption: fixed-length encoding
- Good balance between size and straightforward representation
- Efficient interoperability with Windows, Java, XML4C APIs etc.

The Cons of UTF-16
- Not a fixed-length encoding because of surrogates
- Cannot be passed to legacy functions (C library, OS APIs)
- Character classification functions have to work on pointers into the string rather than on single 16-bit values, so that surrogate pairs can be handled
- Wastes some space with Western languages
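
The surrogate issue can be made concrete with a short sketch. The helper names are illustrative only and `std::u16string` stands in for whatever string type a system uses. The point is that a function reading "one character" needs a position in the string, because a character may occupy two 16-bit code units.

```cpp
#include <cstddef>
#include <string>

// High (lead) surrogates are U+D800..U+DBFF, low (trail) surrogates
// are U+DC00..U+DFFF.
inline bool is_high_surrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
inline bool is_low_surrogate(char16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// Decodes the code point that starts at s[i] and advances i past it.
// Precondition: i < s.size(). An unpaired surrogate is returned unchanged
// as a single unit; a real system would need a policy for that case.
char32_t next_code_point(const std::u16string& s, std::size_t& i) {
    char16_t hi = s[i++];
    if (is_high_surrogate(hi) && i < s.size() && is_low_surrogate(s[i])) {
        char16_t lo = s[i++];
        return 0x10000 + ((static_cast<char32_t>(hi) - 0xD800) << 10) +
               (static_cast<char32_t>(lo) - 0xDC00);
    }
    return hi;
}

// Counts characters (code points), not 16-bit code units.
std::size_t count_code_points(const std::u16string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++n) next_code_point(s, i);
    return n;
}
```

For the string `u"a\U0001D11E"` (the letter a followed by the musical G clef), `size()` reports three code units while `count_code_points` reports two characters.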

ANSI C/C++ Compatibility
- ANSI C++ does define a type wchar_t for “wide” character representation (and a matching wide string class wstring)
- Unfortunately, the size and encoding of wchar_t are not standardized
- So we combined the ANSI C++ basic_string template class with the Unicode character data type from ICU to create a C++- and Unicode-conformant string class
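
In today's C++ the same combination can be approximated without ICU. The sketch below is not the original class: it assumes that `char16_t` (standardized in C++11, after this talk) can stand in for ICU's 16-bit code unit type, and the aliases `UniChar` and `UniString` and the helper `starts_with` are made-up names.

```cpp
#include <string>

// Assumption: char16_t stands in for the 16-bit ICU character type.
// Unlike wchar_t, which is 16 bits on Windows and typically 32 bits on
// Unix-like systems, it is intended for UTF-16 code units on every platform.
using UniChar = char16_t;

// basic_string instantiated with that character type gives a string class
// that offers the whole standard string interface (find, substr, +, ...).
using UniString = std::basic_string<UniChar>;

// Example of ordinary basic_string members working on the Unicode string.
bool starts_with(const UniString& text, const UniString& prefix) {
    return text.size() >= prefix.size() &&
           text.compare(0, prefix.size(), prefix) == 0;
}
```

The benefit of this design is that generic code written against `basic_string` keeps working, while the code unit size stays the same on all platforms.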

Impact Beyond Character Representation
- Tokenization
- Finite state processing
- Dictionary formats
- “Environmental” issues
- Development tools support

Impact: Tokenization
- Tokenization needs access to character properties
- Most, but not all, of the relevant properties are provided by the Unicode character database
- For application-defined properties there is no longer a fast and simple 256-character property lookup
- Approach limited to Western scripts
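
One common way to get back a fast lookup for application-defined properties over the 16-bit range is a two-stage table. The sketch below illustrates that general technique and is not a description of what TAF does. The property names and the class `PropertyTable` are hypothetical.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical application-defined properties.
enum Property : std::uint8_t { kNone = 0, kWordChar = 1, kSeparator = 2 };

// Two-stage table: the high byte of a code unit selects a block, the low
// byte selects the entry inside it. All untouched ranges share block 0,
// so memory grows only with the number of 256-character ranges in use.
class PropertyTable {
public:
    PropertyTable() : blocks_(1) {}  // block 0 is the shared all-kNone block

    void set(char16_t c, std::uint8_t props) {
        std::uint16_t& slot = index_[c >> 8];
        if (slot == 0) {             // first write into this range
            blocks_.emplace_back();  // new zero-filled block
            slot = static_cast<std::uint16_t>(blocks_.size() - 1);
        }
        blocks_[slot][c & 0xFF] = props;
    }

    // Lookup is two array accesses, with no branching and no searching.
    std::uint8_t get(char16_t c) const {
        return blocks_[index_[c >> 8]][c & 0xFF];
    }

    std::size_t block_count() const { return blocks_.size(); }

private:
    std::array<std::uint16_t, 256> index_{};
    std::vector<std::array<std::uint8_t, 256>> blocks_;
};
```

A flat table over all 65,536 code units would also work but costs 64 KB per property set. The two-stage form stays close to the old 256-entry table in both speed and size when only a few scripts are populated. Characters outside the Basic Multilingual Plane would need an extra stage or a separate path.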

Impact: Finite State Processing
- Finite state character processing in C usually works with transition tables encoded as arrays
- This is easy to implement and very fast in execution
- To cover the full range of Unicode characters, more sophisticated transition tables are required
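
One possible form of a "more sophisticated" table is to map each character to a small character class first and index the transition table by class instead of by character. The sketch below illustrates only that idea and is not the approach the talk used. The classifier is a toy that knows ASCII and basic Cyrillic letters, and all names are hypothetical.

```cpp
#include <cstdint>
#include <string>

enum CharClass : std::uint8_t { kLetter, kDigit, kOther, kNumClasses };
enum State : std::uint8_t { kStart, kInWord, kReject, kNumStates };

// Toy classifier (assumption): a real system would use a complete
// character property table here instead of a few hard-coded ranges.
CharClass classify(char16_t c) {
    if (c >= u'0' && c <= u'9') return kDigit;
    if ((c >= u'a' && c <= u'z') || (c >= u'A' && c <= u'Z')) return kLetter;
    if (c >= 0x0410 && c <= 0x044F) return kLetter;  // Cyrillic А..я
    return kOther;
}

// The table has kNumStates x kNumClasses entries instead of
// kNumStates x 65536, yet still drives the automaton by array lookup.
static const State kTransitions[kNumStates][kNumClasses] = {
    /* kStart  */ {kInWord, kReject, kReject},
    /* kInWord */ {kInWord, kInWord, kReject},
    /* kReject */ {kReject, kReject, kReject},
};

// Accepts a letter followed by any number of letters or digits.
bool is_word(const std::u16string& s) {
    State state = kStart;
    for (char16_t c : s) state = kTransitions[state][classify(c)];
    return state == kInWord;
}
```

The automaton itself is unchanged from the single-byte case. Only the indexing step gains one indirection.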

Impact: Dictionaries
- Dictionaries tend to be large
- As much of them as possible has to be loaded into memory for performance reasons
- For multilingual (server) applications, multiple dictionaries will be in memory
- Therefore dictionary size matters a great deal
- Doubling dictionary size might not be a viable option

Impact: “Environmental” Issues
- There is always a residue of single-byte string data (from message catalogs, the command line, library calls etc.) which sometimes has to be mixed with Unicode string data
- Interfaces for console, messages, logs etc. are mostly single-byte
- Configuration files should be platform-neutral, easily editable and support the full Unicode character set
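
Two small helpers of the kind such mixing tends to require are sketched below, as an illustration only. The function names are hypothetical, the single-byte side is assumed to be plain ASCII, and the `\uXXXX` escape form for logs is one arbitrary choice among several.

```cpp
#include <string>

// Widens single-byte text (assumed to be ASCII, e.g. from a message
// catalog) to UTF-16. Bytes outside ASCII cannot be interpreted without
// knowing the code page, so they become U+FFFD REPLACEMENT CHARACTER.
std::u16string from_ascii(const std::string& bytes) {
    std::u16string out;
    out.reserve(bytes.size());
    for (unsigned char b : bytes) {
        out.push_back(b < 0x80 ? static_cast<char16_t>(b)
                               : static_cast<char16_t>(0xFFFD));
    }
    return out;
}

// Renders UTF-16 text for a single-byte console or log: ASCII is passed
// through, every other code unit is written as a \uXXXX escape.
std::string to_ascii_escaped(const std::u16string& text) {
    static const char kHex[] = "0123456789ABCDEF";
    std::string out;
    for (char16_t c : text) {
        if (c < 0x80) {
            out.push_back(static_cast<char>(c));
        } else {
            out += "\\u";
            for (int shift = 12; shift >= 0; shift -= 4) {
                out.push_back(kHex[(c >> shift) & 0xF]);
            }
        }
    }
    return out;
}
```

For example, an ASCII label combined with the Unicode token "Größe" is logged as `Token: Gr\u00F6\u00DFe`, which stays readable and unambiguous on any single-byte interface.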

Impact: Development Tools Support
- Only specialized editors can handle Unicode text
- Most debuggers don’t display Unicode
- Source code string constants are hard to maintain
- Message catalog compilers on some platforms are not Unicode-enabled

A Word About Unicode Normalization Forms
- For reasons of efficient interoperability, a fixed Unicode normalization form had to be specified
- Early normalization is performance-critical
- Since round-trip convertibility was not a design goal, Normalization Form KC (NFKC: compatibility decomposition followed by canonical composition) was chosen
- Normalization and code page conversion can and should be done in one step
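
Doing both in one step can be pictured as a converter whose mapping already yields the normalized form. The sketch below is a toy for ISO-8859-1 only and makes no claim about how the TAF converters work. The function name is hypothetical, and it deliberately handles just an illustrative subset of the Latin-1 characters that NFKC changes (several others, such as ¼, ¾, ¹, ª and º, are not covered).

```cpp
#include <string>

// Converts ISO-8859-1 bytes to UTF-16 and applies NFKC compatibility
// mappings in the same pass. Illustrative subset only: a complete
// converter would derive the full per-code-page mapping from Unicode data.
std::u16string latin1_to_nfkc_utf16(const std::string& bytes) {
    std::u16string out;
    out.reserve(bytes.size());
    for (unsigned char b : bytes) {
        switch (b) {
            case 0xA0: out += u' '; break;        // NO-BREAK SPACE -> SPACE
            case 0xB2: out += u'2'; break;        // SUPERSCRIPT TWO -> 2
            case 0xB3: out += u'3'; break;        // SUPERSCRIPT THREE -> 3
            case 0xB5: out += u'\u03BC'; break;   // MICRO SIGN -> GREEK SMALL MU
            case 0xBD:                            // ONE HALF -> 1, FRACTION SLASH, 2
                out += u"1\u2044" u"2";
                break;
            default:
                // Every other Latin-1 byte equals its Unicode code point.
                out += static_cast<char16_t>(b);
        }
    }
    return out;
}
```

Since the text has to be touched once for conversion anyway, folding the normalization into the same table avoids a second pass over every document.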

Benefits of Unicode Use
- No more code page troubles within the boundaries of the application
- Very often algorithms can be established for groups of languages
- Multilanguage document collections and even mixed-language documents are no problem to represent
- Easy and efficient Java (JNI) integration

Summing Up: Building on Unicode…
- …solves only the basic character representation problem for multilingual text analysis
- …sets a solid foundation for a multilingual system
- …enables algorithms to be reused for groups of languages
- …can have impact on the system far beyond the character representation level
- …has been worth the trouble