Dec 22, 2015

McGraw-Hill/Irwin © 2004 by The McGraw-Hill Companies, Inc. All rights reserved.

Using XML Parsers and Unicode

Ellen Pearlman

Eileen Mullin

Programming the Web Using XML

Learning Objectives

1. Understanding what an XML parser does

2. Working with the basic Microsoft parser

3. Differentiating between valid documents in different parsers and the way they define error statements

4. Learning about Unicode and UTF-8, UTF-16 and UTF-32

5. Investigating different character sets and typefaces for Unicode

Introduction

• A parser is a grammar and syntax checker for markup languages and programming languages.

• A validating parser compares an XML document against the grammar in its DTD. This process, called validation, ensures there are no mistakes that could confuse the XML applications that access your content.

• If a document follows the rules listed in its DTD, it is said to be valid. If the document has markup errors that contradict the rules of the DTD, it is labeled invalid.

What is Unicode?

• The Unicode Consortium was founded to foster a character encoding that encompasses all major scripts in the world.

• As of Unicode 3.0, a little under 50,000 different characters were encoded, within the 65,536 code points that 16 bits can address. Already almost a third of the encoded characters are Han Chinese ideographs.

• More scripts are on the way, so Unicode is no longer limited to 16 bits: its code space extends to U+10FFFF, and the UTF-32 encoding form uses 32 bits per character.

• Because XML uses Unicode as its character set, text from any character set that maps into Unicode can appear in an XML document.

Parsers

• When XML parsers first became commonly used, they consisted of basic text editors like Microsoft's Notepad and WordPad and Apple's SimpleText, and not much else. These basic text editors could not support Unicode.

• Now parsers are divided into three categories: basic text editors, graphical text editors, and integrated development environments.

How Parsers Work

• In general, a parser looks for certain specifics, like the beginning of an XML declaration, <?xml version=, or even parentheses (), a percent sign %, and so on.

• Just as we look for a period (.) to end a sentence, a parser looks for pre-established XML grammatical conventions to know that a statement is correctly formed.
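
As a rough illustration of this grammar check, the Python sketch below uses the standard library's xml.etree.ElementTree to test whether a string is well-formed. ElementTree is a non-validating parser chosen here only for illustration; it is not one of the tools these slides use, and the helper name is_well_formed and both sample strings are assumptions.

```python
import xml.etree.ElementTree as ET


def is_well_formed(document: str) -> bool:
    """Return True if the parser accepts the document's grammar."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True


# Every start tag has a matching end tag, so the parser accepts this.
good = "<play><title>Titus Andronicus</title></play>"

# The </title> end tag is missing, so the parser rejects this.
bad = "<play><title>Titus Andronicus</play>"

print(is_well_formed(good))
print(is_well_formed(bad))
```

The check covers well-formedness only; it does not compare the document against a DTD.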

Differences Between an XML Parser and an HTML Parser

• With HTML, a pre-set standard already tells the Web browser how to render the information visually.

• An XML editor or parser has no predetermined definition of your documents’ element and attribute names. It knows only the basic rules for what is valid and invalid, and it looks at the document as pure character strings.

The Basic Microsoft Parser

• MSXML, Microsoft’s basic XML parser, is a good, free parser that is embedded into the Internet Explorer browser.

• MSXML is a graphical text editor. It can be referred to as a WYSIWYG (What You See Is What You Get) editor. That means that there are no implied statements, and everything is displayed on the screen.

Titus Andronicus coded in XML

Titus Andronicus: Play.dtd

Play.dtd With </title> End Tag Missing

The Error Message Produced by the Missing </title> End Tag

Line 5223, Referred to by the Error Message

Showing play.dtd, Line 5223 Now Complete

Creating Your Own Valid Document: validatortest.xml document

validatortest.xml Document in IE

validatortest.xml Document in Netscape

A Word About Errors

• Most parsers classify problems in XML in one of two ways: as errors or as fatal errors.

– An error is a violation of the rules in whatever specification the code is being checked against (e.g. XSLT, plain XML). The parser points out the error and continues processing.

– A fatal error stops the parser from checking the rest of the code. A document that is not well-formed always produces a fatal error.
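
The sketch below shows, under stated assumptions, what a fatal error looks like to a program. It uses Python's standard xml.etree.ElementTree, which is non-validating and therefore reports only fatal (well-formedness) errors, never the recoverable kind. The three-line sample document is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical three-line document with the </title> end tag missing.
document = "<play>\n<title>Titus Andronicus\n</play>"

error_line = None
try:
    ET.fromstring(document)
except ET.ParseError as error:
    # Parsing stops at the first well-formedness violation; no tree is built.
    # error.position is a (line, column) pair locating the problem.
    error_line, error_column = error.position
    print(f"fatal error on line {error_line}: {error}")
```

The reported line is where the parser noticed the mismatch, which can be later in the file than the place where the end tag was actually left out.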

Using XML Spy

• XML Spy can be thought of as an IDE because it has not only a text and code editor but also a compiler, a debugger and an intuitive GUI.

• With XML Spy, a developer can build a sophisticated project.

• There are two basic views: the Text view, which resembles any text editor, and the Enhanced Grid View, which shows more of the schema of the document.

Altova XML Spy Home Page

Initial Code Listing: validatortest.xml

<?xml version="1.0" encoding="UTF-8"?>
<!-- This is good to use as a test -->
<!DOCTYPE scribble [
  <!ELEMENT scribble (first, second, third, fourth)>
  <!ELEMENT first (#PCDATA)>
  <!ELEMENT second (#PCDATA)>
  <!ELEMENT third (#PCDATA)>
  <!ELEMENT forth (#PCDATA)>
]>
<scribble>
  <first>Our first line</first>
  <second>Our second line</second>
  <third>Our third line</third>
  <fourth>Our fourth line</fourth>
</scribble>

Invalid Validatortest.xml in XML Spy Program

Viewing Validatortest.xml in IE

Corrected Version: validatortest.xml

<!-- This is good to use as a test -->
<!DOCTYPE scribble [
  <!ELEMENT scribble (first, second, third, fourth)>
  <!ELEMENT first (#PCDATA)>
  <!ELEMENT second (#PCDATA)>
  <!ELEMENT third (#PCDATA)>
  <!ELEMENT fourth (#PCDATA)>
]>
<scribble>
  <first>Our first line</first>
  <second>Our second line</second>
  <third>Our third line</third>
  <fourth>Our fourth line</fourth>
</scribble>
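
The difference between the initial and corrected listings is a single declaration: the initial DTD declares forth while the document uses fourth. The Python sketch below makes that mismatch visible. It is not a DTD validator (the standard library does not include one); it only compares the element names declared with <!ELEMENT ...> against the element names actually used, which is an assumption-laden shortcut. The function name undeclared_elements is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

INITIAL = """
<!DOCTYPE scribble [
  <!ELEMENT scribble (first, second, third, fourth)>
  <!ELEMENT first (#PCDATA)>
  <!ELEMENT second (#PCDATA)>
  <!ELEMENT third (#PCDATA)>
  <!ELEMENT forth (#PCDATA)>
]>
<scribble>
  <first>Our first line</first>
  <second>Our second line</second>
  <third>Our third line</third>
  <fourth>Our fourth line</fourth>
</scribble>
"""

# The corrected listing differs only in the spelling of the last declaration.
CORRECTED = INITIAL.replace("<!ELEMENT forth", "<!ELEMENT fourth")


def undeclared_elements(document: str) -> set[str]:
    """Return element names used in the document but never declared."""
    declared = set(re.findall(r"<!ELEMENT\s+([\w.:-]+)", document))
    root = ET.fromstring(document)
    used = {element.tag for element in root.iter()}
    return used - declared


print(undeclared_elements(INITIAL))
print(undeclared_elements(CORRECTED))
```

Both versions are well-formed, which is why a non-validating parser reads either one without complaint; only a check against the DTD exposes the misspelling.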

Other XML Editors: Viewing validatortest.xml in XML Edit Pro

The Development of a Global Standard: Introducing ASCII

• ASCII is a subset of other character sets that contain 256 characters.

• ASCII is a 7-bit coding system with a limited range of 128 characters. To increase that range, an 8-bit coding system, Latin-1 (ISO 8859-1), was developed, which encodes 256 characters.

• It became the character set of choice for the Internet, e-mail, gopher, and ftp sites. However, it did not cover the characters of non-Latin-based languages.
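
The limits of the 7-bit and 8-bit sets can be checked directly. The sketch below uses Python's built-in ascii and latin-1 codecs; the helper name can_encode and the sample words are assumptions made for illustration.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of text exists in the given character set."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# Unaccented English letters fit in 7-bit ASCII.
print(can_encode("cafe", "ascii"))

# U+00E9 (e with acute accent) is outside ASCII but inside 8-bit Latin-1.
print(can_encode("caf\u00e9", "ascii"))
print(can_encode("caf\u00e9", "latin-1"))

# Cyrillic letters are outside Latin-1 as well.
print(can_encode("\u043c\u0438\u0440", "latin-1"))
```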

The Development of a Global Standard: Unicode

• In 1983, to expand the range of permissible characters, ISO 10646 was developed; it used 32 bits and could encode about 4 billion different characters. However, the codes became too big and consumed too much of the bandwidth they flowed through.

• Unicode, begun in 1987 by engineers at Xerox and Apple and maintained since 1991 by the Unicode Consortium, halved the code size to 16 bits, making it a workable solution because it could handle more characters than the 8-bit sets while using less bandwidth than a 32-bit code.

The Adoption of Unicode

• Unicode provides a unique number for every character, no matter what platform, program or language it is used with.

• Every major vendor and standards body has adopted the standard, along with operating systems, browsers and a host of other products.

• Another standard, ISO 10646-1:1993, is also used on the Web and, for all practical purposes, Unicode has become a subset of that ISO standard.
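
The "unique number" is the character's code point. As a small illustration, the Python sketch below prints the code point and official Unicode name of four characters from different scripts, using the built-in ord function and the standard unicodedata module. The choice of sample characters is an assumption.

```python
import unicodedata

# One character each from the Latin, Latin-1 Supplement, Cherokee and Han ranges,
# written as escapes so the listing does not depend on the reader's fonts.
samples = ["A", "\u00f1", "\u13a0", "\u4e2d"]

for character in samples:
    code_point = ord(character)            # the character's unique number
    name = unicodedata.name(character)     # its official Unicode name
    print(f"U+{code_point:04X}  {name}")
```

The number is the same on every platform; only the bytes used to store it and the glyph used to draw it vary.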

Unicode-Enabled Operating Systems

• Below is a list of operating systems that are Unicode-enabled:

– Apple Mac OS 9.2, Mac OS X 10.1, Mac OS X Server, ATSUI

– Bell Labs Plan 9

– Compaq's Tru64 UNIX, OpenVMS

– GNU/Linux with glibc 2.2.2 or newer - FAQ support

– IBM AIX, AS/400, OS/2

– Inferno by Vita Nuova

– Java

– Microsoft Windows CE, Windows NT, Windows 2000, and Windows XP

– SCO UnixWare 7.1.0

– Sun Solaris

– Symbian Platform

XML:LANG Attribute

• One of the most important attributes used in combination with XML and Unicode is the xml:lang attribute. It is the only attribute to use a language code.

• This attribute tells XML software which language the content of an element is written in, so that the software can process that content accordingly.

• An example, coded in an XML statement, would be:

<spanishtext xml:lang="es">Hola amigo</spanishtext>
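
A program can read the language code back out of a parsed document. The sketch below does so with Python's standard xml.etree.ElementTree, which is an illustrative choice rather than a tool from these slides. The attribute value is quoted, as XML requires, and ElementTree exposes the reserved xml: prefix under its full namespace URI.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the reserved xml: prefix to this namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

element = ET.fromstring('<spanishtext xml:lang="es">Hola amigo</spanishtext>')

language = element.get(XML_LANG)   # the declared language code
print(language, element.text)
```

What the application does with the code (spell checking, hyphenation, font selection) is up to the application; the parser only reports it.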

Pull-down Menu Structure in XML Spy to Add Elements and Attributes

Unicode for Cherokee language

UTF-8 and Beyond

• UTF, which stands for Universal Character Set Transformation Format, allows Unicode to be broken into 8-, 16- or even 32-bit values that are used in e-mail and on the Internet. An XML document names its encoding in the XML declaration: <?xml version="1.0" encoding="UTF-8"?>

• Unicode encodes text by the script used (e.g. Latin, Cyrillic), not by the language used, an important distinction that avoids unnecessary duplication of letters.
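
To see how the three transformation formats differ, the Python sketch below counts the bytes each one needs for a single character. The function name byte_lengths and the sample characters are assumptions; the big-endian codec variants are used only so that no byte order mark is counted.

```python
def byte_lengths(character: str) -> dict[str, int]:
    """Return how many bytes one character occupies in each UTF encoding form."""
    # The -be variants fix the byte order so that no byte order mark is added.
    return {
        encoding: len(character.encode(encoding))
        for encoding in ("utf-8", "utf-16-be", "utf-32-be")
    }


# Latin A, Cherokee A (U+13A0), and a character beyond the first 65,536 code points.
for character in ["A", "\u13a0", "\U0001F600"]:
    print(f"U+{ord(character):04X}", byte_lengths(character))
```

UTF-8 grows from one to four bytes as the code point rises, UTF-16 uses two bytes or a four-byte surrogate pair, and UTF-32 always uses four.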

Character Sets and Typeface

• Character sets do not refer to display formats, colors or typefaces. Unicode characters become visible to the user through a rendering process that maps characters to glyphs.

• A glyph is the specific shape of a given character as it is displayed. The character "A" is really an abstract "A"; the glyph actually displayed depends on the typeface, so the same character may appear as a plain, italic or decorative "A".

• Many things affect this rendering process, such as the operating system, language settings, keyboard and display software, word processing software, the type rasterizer, and input and output hardware.

Character Sets and Typeface (2)

• In ASCII there is a one-to-one correlation between the character, the glyph and the character set. That means ASCII strips a character down to its raw form and renders it as basic text, which to most of us resembles plain Courier.

• This is not true for Unicode, which can render beautiful scripts. Different standards bodies have been set up to make sure languages and scripts coordinate.

The End