
UNICODE AND MULTILINGUAL COMPUTING

A Technical Seminar Report submitted to the Faculty of Computer Science and Engineering

Geethanjali College of Engineering & Technology

(Cheeryal(V), Keesara(M), R.R. Dist., Hyderabad-A.P.)

Accredited by NBA

(Affiliated to J.N.T.U.H, Approved by AICTE, New Delhi)

In partial fulfillment of the requirement for the award of degree of

BACHELOR OF TECHNOLOGY IN

COMPUTER SCIENCE AND ENGINEERING

Under the esteemed guidance of

Mr. P. Srinivas, M.Tech, (Ph.D), Sr. Associate Professor

By

M.MEDHA 09R11A0594

Department of Computer Science & Engineering


Year : 2012-2013

Geethanjali College of Engineering

& Technology

(Affiliated to J.N.T.U.H, Approved by AICTE, NEW DELHI.)

Accredited by NBA

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

Date:

CERTIFICATE

This is to certify that the Technical Seminar report on "Unicode and Multilingual Computing" is a bonafide work done by M. MEDHA (09R11A0594) in partial fulfillment of the requirement for the award of the degree of Bachelor of Technology in Computer Science and Engineering from J.N.T.U.H, Hyderabad, during the year 2012-2013.

Technical Seminar Co-ordinator                          HOD-CSE
(Mr. P. Srinivas)                                       (Prof. Dr. P.V.S. Srinivas)
Sr. Associate Professor


ABSTRACT

Today's global economy demands global computing solutions. Instant communications across continents--and computer platforms--characterize a business world at work 24 hours a day, 7 days a week. The widespread use of the Internet and e-commerce continues to create new international challenges.

More and more, users are demanding a computing environment that suits their own linguistic and cultural needs. They want applications and file formats they can share around the world, interfaces in their own language, and local time and date displays. Essentially, users want to write and speak at the keyboard the way they write and speak in the office.

Fundamentally, computers just deal with numbers. They store letters and other characters by assigning a number to each one. Even for a single language like English, no single encoding was adequate for all the letters, punctuation, and technical symbols in common use. These encoding systems also conflict with one another: two encodings can use the same number for two different characters, or use different numbers for the same character. Any given computer (especially a server) needs to support many different encodings, and whenever data is passed between encodings or platforms it runs the risk of corruption. Unicode addresses this by providing a unique number for every character.


Contents

Chapters                                                              Page No.

1.  Introduction ........................................................... 5
2.  The Present Context .................................................... 7
3.  Urbanization pattern in India ......................................... 11
4.  Importance of integrating urban development and land use control ..... 12
5.  Importance of network and modal integration ........................... 15
6.  Urbanization – Inevitable and Desirables ............................... 18
7.  Urban Transport and City efficiency ................................... 19
8.  How Administrative lessons relate to urban transport .................. 24
9.  Current urban transport scenario in India ............................. 28
10. Road safety in India .................................................. 30
11. Environmental impact of Urban Transport ............................... 34
12. Urban Rail Transit Architecture ....................................... 43
13. Conclusion ............................................................ 47
14. Future Scope .......................................................... 48
15. List of Abbreviations ................................................. 49
16. References ............................................................ 50

Unicode

In most writing systems, keyboard input is converted into character codes, stored in memory, and converted to glyphs in a particular font for display and printing. The collection of characters and character codes forms a codeset. Different codesets are used to represent the characters of different languages.

A character code in one codeset, however, does not necessarily represent the same character in another codeset. For example, the character code 0xB1 is the plus-minus sign (±) in Latin-1 (ISO 8859-1 codeset), capital BE in Cyrillic (ISO 8859-5 codeset), and does not represent anything in Arabic (ISO 8859-6 codeset) or Traditional Chinese (CJK unified ideographs).
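A short Python sketch (added here purely as an illustration) makes the conflict concrete by decoding the single byte 0xB1 under three of the codesets just mentioned:

raw = b"\xb1"

print(raw.decode("iso8859-1"))   # plus-minus sign (±) in Latin-1
print(raw.decode("iso8859-5"))   # Cyrillic capital BE (Б)
try:
    raw.decode("iso8859-6")      # 0xB1 is unassigned in the Arabic codeset
except UnicodeDecodeError:
    print("0xB1 does not represent anything in ISO 8859-6")

# Unicode, by contrast, gives each of these characters its own code point:
print(hex(ord("±")), hex(ord("Б")))   # 0xb1 0x411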

In Unicode, every character, ideograph, and symbol has a unique character code, eliminating any confusion between character codes of different codesets. In Unicode, multiple codesets need not be defined. Unicode represents characters from most of the world's languages as well as publishing characters, mathematical and technical symbols, and punctuation characters. This universal representation for text data has been further enhanced and extended in the latest release of Unicode: The Unicode Standard, Version 3.0.


PROCEDURE

Sun Microsystems defines the following levels at which an application can support a customer's international needs:

• Internationalization
• Localization

Software internationalization is the process of designing and implementing software to transparently manage different linguistic and cultural conventions without additional modification. The same binary copy of an application should run on any localized version of the Solaris operating environment, without requiring source code changes or recompilation.

Software localization is the process of adding language translation (including text messages, icons, buttons, and so on), cultural data, and components (such as input methods and spell checkers) to a product to meet regional market requirements.

The Solaris operating environment is an example of a product that supports both internationalization and localization. The Solaris operating environment is a single internationalized binary that is localized into various languages (for example, French, Japanese, and Chinese) to support the language and cultural conventions of each language.


Supporting the Unicode Standard

Unicode (Universal Codeset) is a universal character encoding scheme developed and promoted by the Unicode Consortium, a non-profit organization which includes Sun Microsystems. The Unicode standard encompasses most alphabetic, ideographic, and symbolic characters.

Using one universal codeset enables applications to support text from multiple scripts in the same documents without elaborate tagging. However, applications must treat Unicode like any other codeset--applying codeset independence to Unicode as well.

Unicode locales are called the same way and function the same way as all other locales in the Solaris operating environment. These locales provide the extra benefits that the Unicode codeset brings to the work environment, including the ability to create text in multiple scripts without having to switch locales. Sun Microsystems provides the same level of Unicode locale support for both 32-bit and 64-bit Solaris environments.


Benefits of Unicode

Support for Unicode provides many benefits to application developers, including:

• Global source and binary.
• Support for mixed-script computing environments.
• Improved cross-platform data interoperability through a common codeset.
• Space-efficient encoding scheme for data storage.
• Reduced time-to-market for localized products.
• Expanded market access.

Developers can use Unicode to create global applications. Users can exchange data more freely using one flat codeset without elaborate code conversions to comprehend characters.

In the Solaris operating environment internationalization framework, Unicode is "just another codeset." By adopting a codeset-independent design, applications can handle different codesets without extensive code rework to support specific languages.


Unicode Coded Representations

In recent years, the Unicode Consortium and other related organizations have developed different formats to represent and store the Unicode codeset. To represent characters from all major languages in multibyte format, ISO/IEC International Standard 10646-1 (commonly referred to as 10646) has defined the Universal Multiple-Octet Coded Character Set (UCS) format. The character forms contained in the 10646 specification are:

• Universal Coded Character Set-2 (UCS-2), also known as the Basic Multilingual Plane (BMP)--characters are encoded in two bytes on a single plane.

• Universal Coded Character Set-4 (UCS-4)--characters are encoded in four bytes on multiple planes and multiple groups.

• UCS Transformation Format, 16-bit form (UTF-16)--an extended variant of UCS-2 with characters encoded in 2 or 4 bytes.

• UCS Transformation Format, 8-bit form (UTF-8)--a transformation format with characters encoded in 1 to 6 bytes.

UCS-2 defines a 64K coding space, or BMP, to represent character codes in a two-octet row and cell format. The row and cell octets designate the cell location of a particular character code within a 256 by 256 (00-FF) plane.
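The row/cell split is simply the high and low octet of the code point. The following small Python illustration (added for clarity, not part of the original text) shows it for one BMP character:

cp = ord("Б")                    # U+0411, a character in the BMP
row, cell = cp >> 8, cp & 0xFF   # high octet = row, low octet = cell
print("row {:02X}, cell {:02X}".format(row, cell))   # row 04, cell 11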


UCS-4 defines a four-octet coding space divided into four units: group, plane, row, and cell. The row and cell octets designate the cell location of a particular character code within a plane. The plane octet designates the plane number (00-FF), and the group octet the group number (00-7F) to which the plane belongs. In total, the coding space comprises 128 groups of 256 planes each.

Figure 2-1 UCS-2 and UCS-4 coding schemes
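The practical difference between these coded representations is how many bytes a given character occupies. The Python sketch below (an added illustration; UTF-32 stands in here for UCS-4) compares a Latin letter, a CJK ideograph, and a supplementary-plane character:

for ch in ("A", "电", "\U00010348"):        # Latin, CJK, Gothic letter hwair
    print("U+{:04X}: UTF-8 {} bytes, UTF-16 {} bytes, UTF-32 {} bytes".format(
        ord(ch),
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-be")),
        len(ch.encode("utf-32-be"))))
# U+0041: 1, 2, 4    U+7535: 3, 2, 4    U+10348: 4, 4 (surrogate pair), 4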

 Unicode in the Solaris 8 Operating Environment

The support of Unicode, Version 3.0 in the Solaris 8 Operating Environment's Unicode locales has provided an enhanced framework for developing multiscript applications. Properly internationalized applications require no changes to support the Unicode locales. All internationalized CUI and GUI utilities and commands in the Solaris operating environment are available in Unicode locales without modification.

All Unicode locales in the Solaris operating environment are based on the UTF-8 format. Each locale includes a base language in the UTF-8 codeset and regional data related to the base language and its cultural conventions (such as local formatting rules, text messages, help messages, and other related files). Each locale also supports several other scripts for input, display, code conversion, and printing.


Unicode UTF-8 en_US.UTF-8 Locale

en_US.UTF-8 is the flagship Unicode locale in the Solaris operating environment. The en_US.UTF-8 locale is an American English-based locale with multiscript processing support for characters in many different languages. New and enhanced features of all Unicode locales include support for the Unicode 3.0 character set, correct rendition of complex text layout scripts, native Asian input methods, more MIME character sets in dtmail, various new iconv code conversions, and an enhanced PostScript print filter.

All Unicode locales in the Solaris operating environment support multiple scripts. Thirteen input modes are available: English/European, Cyrillic, Greek, Arabic, Hebrew, Thai, Unicode Hex, Unicode Octal, Table lookup, Japanese, Korean, Simplified Chinese, and Traditional Chinese.

Users can input characters from any combination of scripts and the entire Unicode coding space.


Language               Code
Cyrillic               cc
Greek                  gg
Thai                   tt
Arabic                 ar
Hebrew                 hh
Unicode Hex            uh
Unicode Octal          uo
Lookup                 ll
Japanese               ja
Korean                 ko
Simplified Chinese     sc
Traditional Chinese    tc
English/European       Control+Space


Table 3-1 UTF-8 Input Mode two-letter codes

Figure 3-2 UTF-8 Input Mode selection

To input text from a Lookup table, select the Lookup input mode. A lookup table with all input modes and various symbol and technical codesets appears, as shown in Figure 3-2.

The Table lookup input mode is the easiest way for non-native speakers to input characters in a foreign language--a lookup window displays characters from a selected script, as shown for the Asian input mode in Figure 3-3.

The Arabic, Hebrew, and Thai input modes provide full complex text layout features, including right-to-left display and context-sensitive character rendering. The Unicode octal and hexadecimal code input modes generate Unicode characters from their octal and hexadecimal equivalents, respectively.

The Japanese, Korean, Simplified Chinese, and Traditional Chinese input modes provide full native Asian input.


Figure 3-3 UTF-8 Table Lookup


Figure 3-4 Asian input mode

The Unicode locales can use the enhanced mp(1) printing filter to print text files. mp(1) prints flat text files written in UTF-8 using various Solaris system and printer resident fonts (such as bitmap, Type1, TrueType) depending on the script. The output is standard PostScript. For more information, refer to the mp(1) man page.

The Unicode locales support various MIME character sets in dtmail, including various Latin, Greek, Cyrillic, Thai, and Asian character sets. Some example character sets are: ISO-8859-1 ~ 10, 13, 14, 15, UTF-8, UTF-7, UTF-16, UTF-16BE, UTF-16LE, Shift_JIS, ISO-2022-JP, EUC-KR, ISO-2022-KR, TIS-620, Big5, GB2312, KOI8-R, KOI8-U, and ISO-2022-CN. With this support, users can send and receive email messages encoded in MIME character sets from almost any region in the world. dtmail automatically decodes e-mail by recognizing the MIME character set and content transfer encoding in the message; the sender specifies the MIME character set appropriate for the recipient's mail user agent.
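How a message advertises its character set and content transfer encoding can be seen with any MIME library. The rough Python sketch below (Python's email module standing in for dtmail; the addresses are placeholders) builds a UTF-8 message:

from email.message import EmailMessage

msg = EmailMessage()
msg["From"] = "sender@example.com"        # placeholder addresses
msg["To"] = "recipient@example.com"
msg["Subject"] = "Multilingual greeting"
msg.set_content("Hello / Здравствуйте / こんにちは",
                charset="utf-8", cte="quoted-printable")

# The serialized message carries "Content-Type: text/plain; charset=utf-8"
# and "Content-Transfer-Encoding: quoted-printable", which is what a mail
# user agent such as dtmail uses to decode the body.
print(msg.as_string())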


Figure 3-5 Multiple character sets in dtmail


Codeset Conversion

The Solaris operating environment's Unicode locales support enhanced code conversion among the major codesets of several countries. Figure 3-6 shows the codeset conversions between UTF-8 and many other codesets.

Figure 3-6 Unicode codeset conversions

Codesets can be converted using the sdtconvtool utility or the iconv(1) command. sdtconvtool detects available iconv code conversions and presents them in an easy-to-use format.
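For readers without access to Solaris, the kind of conversion iconv(1) performs (for example, something like iconv -f eucJP -t UTF-8 file) can be sketched in a few lines of Python; the file names below are hypothetical:

# Read bytes in the legacy codeset and rewrite them as UTF-8.
with open("report.euc", "rb") as src:            # hypothetical input file
    text = src.read().decode("euc_jp")           # interpret legacy EUC-JP bytes
with open("report.utf8", "wb") as dst:           # hypothetical output file
    dst.write(text.encode("utf-8"))              # re-encode as UTF-8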


Figure 3-7 sdtconvtool for converting between codesets

Users can also add their own code conversions and use them in iconv(3) functions, the iconv(1) command-line utility, and sdtconvtool(1). For more information on user-extensible, user-defined code conversions, refer to the geniconvtbl(1) and geniconvtbl(4) man pages.

Developers can use iconv(3) to access the same functionality. This includes conversions to and from UTF-8 and many ISO-standard codesets, including UCS-2, UCS-4, UTF-7, UTF-16, KOI8-R, Japanese EUC, Korean EUC, Simplified Chinese EUC, Traditional Chinese EUC, GBK, PCK (Shift JIS), BIG5, Johab, ISO-2022-JP, ISO-2022-KR, and ISO-2022-CN.


ISSUES IN USING UNICODE

To properly internationalize an application, use the following guidelines:

• Avoid direct access to Unicode encoding details. (This is a task of the platform's internationalization framework.)

• Use the POSIX model for multibyte and wide-character interfaces.

• Only call APIs that the internationalization framework provides for language- and culture-specific operations. All POSIX, X11, Motif, and CDE interfaces are available to Unicode locales.

• Remain codeset independent (see the sketch below).
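As a minimal illustration of codeset independence (in Python rather than the C/POSIX interfaces the guidelines refer to), the sketch below asks the locale for its preferred encoding instead of hardcoding one; the file name is hypothetical:

import locale

# Ask the current locale for its codeset instead of assuming one.
enc = locale.getpreferredencoding()

# Decode at the I/O boundary, then work with abstract characters, not bytes.
with open("notes.txt", "r", encoding=enc) as f:   # "notes.txt" is hypothetical
    text = f.read()

print(len(text), "characters")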


Unicode-based multilingual form development

Translation

The first step in constructing a Unicode-based multilingual Web page is fairly self-evident: the material must be translated into the desired target languages by persons knowledgeable in those languages. At some point in the future, automatic translation or global translation (formerly known as "machine translation" or "MT") may be sophisticated enough to do a large part of that job, but it is not quite ready at this time. Additionally, although great strides have been made in this area in recent years, it is hard to imagine a time when human review of automatically translated text will not be necessary.

Localization

Localization is the next important concept in understanding multilingual computing. Web localization can be defined as simply the act of making a Web site linguistically and culturally appropriate to a local audience. An "accurate" translation may not be enough -- a translated text must be "localized" for the target audience viewing the Web page. We could use Spanish as an example: a Web page might be accurately translated into Spanish, but then the question could be, which Spanish? Peruvian Spanish, for example, is not the same as Mexican Spanish. Therefore, Peruvians reading a Mexican Spanish Web site might be able to understand almost all of it, but certain nuances or turns of phrase might be unfamiliar to them. For the most part, multilingual sites are not yet sophisticated enough or targeted enough to deal with such differentiations, but that will certainly change in the future. In other words, localization will become more and more significant as it helps direct the growth and acceptance of the Internet in ever more broad and diverse cultural settings.

Multiple languages on the same Web page

For the purposes of this tutorial, however, localization is not an issue: the English used is clearly American (not British), the Russian text used is very straightforward, and although Chinese has many different spoken dialects, any Chinese reader could understand and respond to the survey question as presented. Our main goal here is to simply construct the initial building blocks of a multilingual Web site, where it is possible to display multiple languages on the same Web page together. Once the key concepts involved in getting a Unicode-based multilingual Web page up and running are understood, localization and more advanced aspects of Web design can be addressed by the developer.

Expert review

Expert review is rather self-explanatory. After a Web page has been translated, globalized for consistency and content, and localized for individual target language impact, it should undergo expert review. Does the site hang together overall? Is there a consistent message for content and tone between the languages? This examination for consistency between languages might also be viewed as the process of "internationalizing" the site. These are all issues that could fall under the category of expert review.


Page markup

After translation, localization and expert review, we can proceed to working out the Unicode equivalents and the actual page markup. Besides the three languages -- English, Russian and Chinese -- used for the survey question in this tutorial, we have also added random characters in Japanese, Hebrew, Hindi, and Tibetan to demonstrate the amazing variety of Unicode-based characters in a multilingual site.

Sample multilingual survey question

We begin to construct our Unicode-based multilingual Web page example with a sample survey question: "Do you want to buy a new computer? Yes___ No___" translated (except for English, of course!) into our two other target languages. The result is displayed in Figure 1.

Unicodization

Once this procedure is complete, we need to transfer these language texts into their Unicode equivalents. We could call this step "Unicodization" (I don't know if this term has been coined yet, but if not, it needs to be). It is not necessary, of course, to translate the English characters of our example into their Unicode equivalents, since ASCII characters occupy the same code points in Unicode and would be displayed properly in any case. However, we do so anyway in order to demonstrate how the process works overall (as well as for consistency).


Transferring Unicode characters into their hexadecimal equivalents

Although Unicode does work with decimal numbers, hexadecimal numbers are the standard. The Unicode characters are transferred into their hexadecimal equivalents. The characters (with the underlying hexadecimal equivalents) are then placed by the software or by hand into whatever markup document is being prepared. The key point is that, whether the user or developer sees them or not, the hexadecimal equivalents are the foundation of the process, and can then be manipulated as needed for various other programming purposes.

Random characters in other languages

In the previous example we used three languages. However, Unicode allows us to use a large number of languages -- at least in short sentences or segments -- on the same Web page.

Adding Unicode hexadecimal numbers to the page markup

The next step is transferring these Unicode hexadecimal numbers into markup language for the Web page which will be built around them. To do this, we add the symbols &#x to the front of the number with a semicolon (;) placed at the end. For example, the Chinese characters for the word "computer" (电脑) are designated as follows in hexadecimal form: &#x7535; &#x8111;.
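As an added illustration (the tutorial itself works these values out by hand), a few lines of Python can derive the hexadecimal references for any translated string:

def to_ncr(text):
    """Replace every non-ASCII character with its &#x...; hexadecimal reference."""
    return "".join("&#x{:X};".format(ord(ch)) if ord(ch) > 0x7F else ch
                   for ch in text)

print(to_ncr("电脑"))                                  # -> &#x7535;&#x8111;
print(to_ncr("Do you want to buy a new computer?"))    # ASCII passes through unchanged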

Using XML tags


XML tags provide a basis for organization and for building complexity into multilingual documents. For the purpose of this tutorial, we simply use a single survey question as the basis of our Unicode-based multilingual document. However, as multilingual e-commerce develops, documents can easily become extremely complex. XML can provide an excellent mechanism for managing this complexity.

Multilingual standard for XML

There is, however, a problem: at present, there is not a consistent multilingual XML standard. As Yves Savourel points out in an article in the October/November 2000 edition of MultiLingual Computing and Technology magazine, "a standard markup method is needed for working with multilingual documents" ("XML Technologies and the Localization Process," #35, Volume 11, Issue 7, p. 62). Savourel's comment could eventually emerge as a major understatement! Just as Unicode itself is bringing standardization to the characters of the world's languages and symbols, a standard XML language for multilingual documents will become crucial to the smooth development of multilingual e-commerce.

Form development: More complex scripts and XML tags needed

For the purposes of our simple survey, the lack of a multilingual XML standard is not a problem. A more complex multilingual survey might be devised that would have numerous questions and would be sorted and tabulated by XML according to language, type of response, region of the world, or other factors. The user would first find his language, then work his way down through a series of questions, with answers and responses being sent back and forth to a CGI script. For now, we will merely use a simple Perl CGI script. In a more complex multilingual Web page, numerous layers of scripts and responses might be utilized. Defining how those scripts are used and interact -- and using XML in that process -- is at the heart of building effective multilingual Web sites using Unicode. That is a more advanced topic that builds on what we have presented here.
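A minimal sketch of such a responder is shown below; it is only a Python analogue of the tutorial's Perl CGI script, and the form field name "answer" is a hypothetical choice:

#!/usr/bin/env python3
import os
import sys
from urllib.parse import parse_qs

sys.stdout.reconfigure(encoding="utf-8")               # reply in UTF-8

# Read the submitted survey answer from the query string.
query = parse_qs(os.environ.get("QUERY_STRING", ""))
answer = query.get("answer", ["no reply"])[0]

print("Content-Type: text/html; charset=utf-8")
print()
print("<html><body><p>Thank you / Спасибо / 谢谢: %s</p></body></html>" % answer)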

Display of HTML

Now we're ready to view our basic survey question. (Readers who have not yet installed a Unicode font may not see all of the characters rendered correctly.)


UNICODE SUPPORT IN SAS 9.1

In 9.1, SAS customers in many regions around the world will use the DBCS extensions in order to support global data (multilingual data which can only be represented in the Unicode character set). With the SAS Unicode server, it is now possible to write a SAS application which processes Japanese data, German data, Polish data, and more, all in the same session. A single server can deliver multilingual data to users around the world.

The following six scenarios for using the SAS Unicode server are discussed below.

1. Populating a Unicode database.

2. Using SAS/SHARE® as a Unicode data server.

3. Using thin-client applications with the Unicode data server.

4. Using SAS/IntrNet® as a Unicode compute server.

5. Using SAS® AppDev Studio as a Unicode compute server.

6. Generating Unicode HTML and PDF output using the SAS Output Display System (ODS).

The SAS Unicode server is designed to run on ASCII based machines. It can be run as a data or compute server or as a batch program.

RESTRICTIONS

There are a few restrictions to the SAS Unicode server.


1. The SAS Display Manager is not supported and, if used, will not display data correctly.

2. Enterprise Guide® cannot access a SAS Unicode server.

3. You cannot run a SAS Unicode server on MVS (OS/390) or OpenVMS.

4. Fullscreen capability for UTF8 session encoding using national characters is not supported. Therefore, products that rely on fullscreen capability are not supported. This includes SAS/EIS, SAS/Warehouse Administrator, SAS/ASSIST, SAS/LAB, and Enterprise Miner V3.0 and earlier. FSEDIT, INSIGHT, and other Frame-based products are not supported.

5. Multilingual characters are not supported with SAS/GRAPH fonts, SAS/GRAPH ActiveX, or the SAS/GRAPH Java applets under UTF8.

6. OLEDB local data providers do not fully support multi-lingual data.

7. SAS/Access Engines to Oracle, ODBC, and OLEDB fully support the UTF8 encoding in SAS 9.1, but other Access engines do not.

8. OLEDB Local data providers and OLEDB IOM data providers do not support multi-lingual data.

9. The UTF8 server running on Windows does not support national characters for pathnames, such as an external file name or the directory name of a SAS dataset.

STARTING AND USING A SAS UNICODE SERVER

To start a SAS Unicode server you must do two things:


1. Install SAS 9.1 or later, with DBCS extensions.

2. Specify ENCODING UTF8 when you start SAS, such as: sas -encoding UTF8

POPULATING A UNICODE DATABASE

The first step in converting an existing database to Unicode or in setting up a new Unicode based system will be to convert all of your data from its legacy encoding to the UTF8 encoding. Once the data is in a Unicode database, there will not be any loss of data when it is read by a Unicode server.

USING SAS/SHARE AS A UNICODE DATA SERVER

SAS/SHARE is a product that enables multiple users to access data from a central server. To convert your existing SAS/SHARE server to a SAS Unicode server you must specify the -ENCODING UTF8 config option.

USING JDBC WITH A UNICODE DATA SERVER

The SAS system is continuously increasing support for industry standard data access protocols such as JDBC. The JDBC interface is a data access interface for Java applications. Java supports Unicode string data and therefore, it would be very natural for the SAS Unicode server to function as the data server for Java.

USING SAS/INTRNET® AS A COMPUTE SERVER

The SAS system is often used as a compute server from a non-SAS client. This is another natural fit for the SAS Unicode server.

USING SAS® APPDEV STUDIO AS A COMPUTE SERVER


SAS® AppDev Studio enables Java programmers to run programs on a SAS server. The programs that run on the server are either SCL programs running with Jconnect or remote objects executed through SAS Integration Technologies.

GENERATING UNICODE OUTPUT USING ODS

A SAS Unicode server can be used in a batch program to produce ODS output with an encoding of UTF8. At the time of this writing, the following ODS output formats support -encoding UTF8:

• HTML

• XML

• PDF

UNICODE PROCESSING IN THE SAS SYSTEM

There are several Unicode-related features of SAS 9. These features are available for SAS sessions running legacy encodings as well as SAS sessions running with a UTF8 encoding.


• Unicode ENCODING= values for FILENAME and ODS statements.

• Unicode FORMATS and INFORMATS.

• NL formats for displaying currency and date formats matching the user’s locale.

Conclusion


Thus, by using Unicode for multiple languages, the corruption of data is reduced. Moreover, the conversion process is easy and less time-consuming. Using Unicode, information can be passed globally in any language; this also gives us more security for our data, and web pages can be developed on the same basis.

REFERENCES


1. Tony Graham, A Guide to the Unicode Standard and Its Use. M&T Press/IDG Books Worldwide.

2. MultiLingual Computing & Technology, published by MultiLingual Computing, Inc., 319 North First Avenue, Sandpoint, ID 83864.

3. SAS 9.1 National Language Support (NLS) Reference. SAS Institute Inc., Cary, NC.

4. "Base SAS Software." SAS OnlineDoc, Version 9.1, 2003. CD-ROM. SAS Institute Inc., Cary, NC.

5. Cross-Environment Data Access (CEDA). "Base SAS Software." SAS OnlineDoc, Version 9, 2003. CD-ROM. SAS Institute Inc., Cary, NC.

6. Cross-Environment Data Access (CEDA). SAS Institute Inc., Cary, NC. Available at: http://support.sas.com/rnd/migration/planning/files/ceda.html.

7. Character Variable Padding (CVP). "Base SAS Software." SAS OnlineDoc, Version 9.1, 2003. CD-ROM. SAS Institute Inc., Cary, NC.

8. Encoding. "National Language Support (NLS) Reference." SAS OnlineDoc, Version 9.1, 2003. CD-ROM. SAS Institute Inc., Cary, NC.

9. Michel Rodriguez, "Character Encodings in XML and Perl," XML.com, April 2000. (An excellent article with examples of some of the issues involved in using Perl, XML and Unicode.)