-
The TEI Consortium
The Association for Computers and the Humanities (ACH);
The Association for Computational Linguistics (ACL);
The Association for Literary and Linguistic Computing (ALLC)
TEI P4
Guidelines for Electronic Text Encoding and Interchange
XML-compatible edition
edited by C M Sperberg-McQueen and Lou Burnard
XML conversion by Syd Bauman, Lou Burnard, Steven DeRose, and
Sebastian Rahtz
Oxford, Providence, Charlottesville, Bergen
June 2004
-
Guidelines for Electronic Text Encoding and Interchange
© 1990, 1992, 1993, 1994 ACH, ACL, ALLC.
© 2002 TEI Consortium
-
Published for the TEI Consortium by the Humanities Computing
Unit, University of Oxford
ISBN 0-952-33013-X
-
Contents
I Introduction
1 About These Guidelines
1.1 Structure and Notational Conventions of this Document
1.1.1 Structure
1.1.2 Notational Conventions
1.2 Underlying Principles and Intended Use
1.2.1 Design Principles of the TEI Scheme
1.2.2 Intended Use
1.2.2.1 Use in Text Capture and Text Creation
1.2.2.2 Use for Interchange
1.2.2.3 Use for Local Processing
1.3 Historical Background
1.3.1 Origin and Development of the TEI
1.3.2 Future Developments
2 A Gentle Introduction to XML
2.1 What’s special about XML?
2.1.1 Descriptive markup
2.1.2 Types of document
2.1.3 Data independence
2.2 Textual structure
2.3 XML structures
2.3.1 Elements
2.3.2 Content models: an example
2.4 Validating a document’s structure
2.4.1 An example DTD
2.4.2 Generic identifier
2.4.3 Content model
2.4.4 Occurrence indicators
2.4.5 Connectors
2.4.6 Model groups
2.5 Complicating the issue
2.6 Attributes
2.6.1 Declaring attributes
2.6.2 Attribute names
2.6.3 Attribute values
2.6.4 Default value
2.6.5 ID and IDREF attributes
2.7 Entities
2.7.1 Entity declarations
2.7.2 Entity references
2.7.3 Character references
2.7.4 Unparsed entities and Notations
2.7.5 Parameter entities
2.8 Marked sections
2.8.1 CDATA marked section
2.8.2 Conditional marked section
2.9 Other components of an XML document
2.9.1 Processing instructions
2.9.2 Namespaces
2.10 Putting it all together
2.10.1 SGML and XML declarations
2.10.2 The DOCTYPE declaration
2.10.3 The Document Instance
2.10.4 Ancillary Files
3 Structure of the TEI Document Type Definition
3.1 Main and Auxiliary DTDs
3.2 Core, Base, and Additional Tag Sets
3.2.1 The Core Tag Sets
3.2.2 The Base Tag Sets
3.2.3 The Additional Tag Sets
3.2.4 User-Defined Tag Sets
3.3 Invocation of the TEI DTD
3.4 Combining TEI Base Tag Sets
3.5 Global Attributes
3.6 The TEI2.DTD File
3.6.1 Structure of the TEI2.DTD File
3.6.2 Embedding Local Modifications
3.6.3 Embedding the Core Tag Sets
3.6.4 Embedding the Base Tag Set
3.6.5 Embedding the Additional Tag Sets
3.7 Element Classes
3.7.1 Classes Which Share Attributes
3.7.2 Classes Used in Content Models
3.7.3 The TEICLAS2.ENT File
3.7.4 Low-Level Element Classes
3.7.5 High-Level Element Classes
3.7.6 Elements Marked for Text Type
3.7.7 Standard Content Models
3.7.8 Components in Mixed and General Bases
3.7.9 Miscellaneous Content-Model Classes
3.8 Other Parameter Entities in TEI DTDs
3.8.1 Inclusion and Exclusion of Elements
3.8.2 Parameter Entities for Element Generic Identifiers
3.8.3 Parameter Entities for TEI Keywords
3.8.4 Generation of an XML DTD
3.8.5 Declaration of TEI keywords
II Core Tags and General Rules
4 Languages and Character Sets
4.1 A simple character encoding model
4.1.1 Some definitions
4.1.2 Characters and glyphs
4.1.3 Characters and their encoding
4.1.4 Character semantics
4.1.5 Characters from the Private Usage Area
4.2 Entry and display of characters
4.2.1 Character input and entity references
4.2.2 Transliteration schemes
4.3 Code shifting
4.4 The Writing System Declaration
5 The TEI Header
5.1 Organization of the TEI Header
5.1.1 The TEI Header and Its Components
5.1.2 Types of Content in the TEI Header
5.2 The File Description
5.2.1 The Title Statement
5.2.2 The Edition Statement
5.2.3 Type and Extent of File
5.2.4 Publication, Distribution, etc.
5.2.5 The Series Statement
5.2.6 The Notes Statement
5.2.7 The Source Description
5.2.8 Computer Files Derived from Other Computer Files
5.2.9 Computer Files Composed of Transcribed Speech
5.3 The Encoding Description
5.3.1 The Project Description
5.3.2 The Sampling Declaration
5.3.3 The Editorial Practices Declaration
5.3.4 The Tagging Declaration
5.3.5 The Reference System Declaration
5.3.5.1 Prose Method
5.3.5.2 Stepwise Method
5.3.5.3 Milestone Method
5.3.6 The Classification Declaration
5.3.7 The Feature System Declaration
5.3.8 The Metrical Declaration Element
5.3.9 The Variant-Encoding Method Element
5.4 The Profile Description
5.4.1 Creation
5.4.2 Language Usage
5.4.3 The Text Classification
5.5 The Revision Description
5.6 Minimal and Recommended Headers
5.7 Note for Library Cataloguers
6 Elements Available in All TEI Documents
6.1 Paragraphs
6.2 Treatment of Punctuation
6.3 Highlighting and Quotation
6.3.1 What Is Highlighting?
6.3.2 Emphasis, Foreign Words, and Unusual Language
6.3.2.1 Foreign Words or Expressions
6.3.2.2 Emphatic Words and Phrases
6.3.2.3 Other Linguistically Distinct Material
6.3.3 Quotation
6.3.4 Terms, Glosses, and Cited Words
6.3.5 Some Further Examples
6.4 Names, Numbers, Dates, Abbreviations, and Addresses
6.4.1 Referring Strings
6.4.2 Addresses
6.4.3 Numbers and Measures
6.4.4 Dates and Times
6.4.5 Abbreviations and Their Expansions
6.5 Simple Editorial Changes
6.5.1 Correction of Apparent Errors
6.5.2 Regularization and Normalization
6.5.3 Additions, Deletions, and Omissions
6.6 Simple Links and Cross References
6.7 Lists
6.8 Notes, Annotation, and Indexing
6.8.1 Notes and Simple Annotation
6.8.2 Index Entries
6.9 Reference Systems
6.9.1 Using the ID and N Attributes
6.9.2 Creating New Reference Systems
6.9.3 Milestone Tags
6.9.4 Declaring Reference Systems
6.10 Bibliographic Citations and References
6.10.1 Elements of Bibliographic References
6.10.2 Components of Bibliographic References
6.10.2.1 Analytic, Monographic, and Series Levels
6.10.2.2 Authors, Titles, and Editors
6.10.2.3 Imprint, Pagination, and Other Details
6.10.2.4 Series Information
6.10.2.5 Notes and Other Additional Information
6.10.2.6 Order of Components within References
6.10.3 Bibliographic Pointers
6.10.4 Relationship to Other Bibliographic Schemes
6.11 Passages of Verse or Drama
6.11.1 Core Tags for Verse
6.11.2 Core Tags for Drama
6.12 Overview of the Core Tag Set
7 Default Text Structure
7.1 Divisions of the Body
7.1.1 Un-numbered Divisions
7.1.2 Numbered Divisions
7.1.3 Numbered or Un-numbered?
7.1.4 Partial and Composite Divisions
7.2 Elements Common to All Divisions
7.2.1 Headings and Trailers
7.2.2 Openers and Closers
7.2.3 Arguments and Epigraphs
7.2.4 Content of Textual Divisions
7.3 Groups of Texts
7.4 Front Matter
7.5 Title Pages
7.6 Back Matter
7.7 DTD Fragment for Default Text Structure
III Base Tag Sets
8 Base Tag Set for Prose
9 Base Tag Set for Verse
9.1 Structure of the Base Tag Set for Verse
9.2 Structural Divisions of Verse Texts
9.3 Components of the Verse Line
9.4 Rhyme and Metrical Analysis
9.4.1 Sample Metrical Analyses
9.4.2 Segment-Level versus Line-level Tagging
9.4.3 Metrical Analysis of Stanzaic Verse
9.5 Rhyme
9.6 Encoding Procedures For Other Verse Features
10 Base Tag Set for Drama
10.1 Front and Back Matter
10.1.1 The Set Element
10.1.2 Prologues and Epilogues
10.1.3 Records of Performances
10.1.4 Cast Lists
10.2 The Body of a Performance Text
10.2.1 Major Structural Divisions
10.2.2 Speeches and Speakers
10.2.3 Stage Directions
10.2.4 Speech Contents
10.2.5 Embedded Structures
10.2.6 Simultaneous Action
10.3 Other Types of Performance Text
10.3.1 Technical Information
11 Transcriptions of Speech
11.1 General Considerations and Overview
11.1.1 Divisions
11.2 Elements Unique to Spoken Texts
11.2.1 Utterances
11.2.2 Pause
11.2.3 Vocal, Kinesic, Event
11.2.4 Writing
11.2.5 Temporal Information
11.2.6 Shifts
11.2.7 Formal Definition
11.3 Elements Defined Elsewhere
11.3.1 Segmentation
11.3.2 Synchronization and Overlap
11.3.3 Regularization of Word Forms
11.3.4 Prosody
11.3.5 Speech Management
11.3.6 Analytic Coding
12 Print Dictionaries
12.1 Dictionary Body and Overall Structure
12.2 The Structure of Dictionary Entries
12.2.1 Hierarchical Levels
12.2.2 Groups and Constituents
12.3 Top-level Constituents of Entries
12.3.1 Information on Written and Spoken Forms
12.3.2 Grammatical Information
12.3.3 Sense Information
12.3.3.1 Definitions
12.3.3.2 Translation Equivalents
12.3.4 Etymological Information
12.3.5 Other Information
12.3.5.1 Examples
12.3.5.2 Usage Information and Other Labels
12.3.5.3 Cross References to Other Entries
12.3.5.4 Notes within Entries
12.3.6 Related Entries
12.4 Headword and Pronunciation References
12.5 Typographic and Lexical Information in Dictionary Data
12.5.1 Editorial View
12.5.2 Lexical View
12.5.3 Retaining Both Views
12.5.3.1 Using Attribute Values to Capture Alternate Views
12.5.3.2 Recording Original Locations of Transposed Elements
12.5.4 Attributes for Dictionary Elements
12.6 Unstructured Entries
13 Terminological Databases
13.1 The Terminological Entry
13.2 Tags for Terminological Data
13.3 Basic Structure of the Terminological Entry
13.3.1 Nested Term Entries
13.3.2 Flat Term Entries Using Rules of Adjacency
13.3.3 Flat Term Entries Using Group and Depend Attributes
13.3.4 References between Term Entries
13.4 Overall Structure of Terminological Documents
13.4.1 DTD Fragment for Nested Style
13.4.2 DTD Fragment for Flat Style
13.5 Additional Examples of Term Entries
13.5.1 Example Term Entry from ISO 472
13.5.2 The Example Treated as a Single Term Entry in Nested Form
13.5.3 The Example Treated as Two Separate Term Entries in Nested Form
13.5.4 The Example Treated as a Flat Term Entry Using Adjacency Rules
13.5.5 The Example Treated as a Flat Term Entry Not Using Adjacency Rules
IV Additional Tag Sets
14 Linking, Segmentation, and Alignment
14.1 Pointers
14.1.1 Pointers and Links
14.1.2 Using Pointers and Links
14.1.3 Groups of Links
14.1.4 Intermediate Pointers
14.2 Extended Pointers
14.2.1 Extended Pointer Elements
14.2.2 Extended Pointer Syntax
14.2.2.1 Location Ladders
14.2.2.2 Location Terms
14.2.2.3 The ROOT Keyword
14.2.2.4 The HERE Keyword
14.2.2.5 The ID Keyword
14.2.2.6 The REF Keyword
14.2.2.7 The CHILD Keyword
14.2.2.8 The DESCENDANT Keyword
14.2.2.9 The ANCESTOR Keyword
14.2.2.10 The PREVIOUS Keyword
14.2.2.11 The NEXT Keyword
14.2.2.12 The PRECEDING Keyword
14.2.2.13 The FOLLOWING Keyword
14.2.2.14 The PATTERN Keyword
14.2.2.15 The TOKEN Keyword
14.2.2.16 The STR Keyword
14.2.2.17 The SPACE Keyword
14.2.2.18 The FOREIGN Keyword
14.2.2.19 The HYQ Keyword
14.2.2.20 The DITTO Keyword
14.2.3 Using Extended Pointers
14.2.4 Representation of HTML links in TEI
14.3 Blocks, Segments and Anchors
14.4 Correspondence and Alignment
14.4.1 Correspondence
14.4.2 Alignment of Parallel Texts
14.4.3 A Three-way Alignment
14.5 Synchronization
14.5.1 Aligning Synchronous Events
14.5.2 Placing Synchronous Events in Time
14.6 Identical Elements and Virtual Copies
14.7 Aggregation
14.8 Alternation
14.9 Connecting Analytic and Textual Markup
15 Simple Analytic Mechanisms
15.1 Linguistic Segment Categories
15.2 Global Attributes for Simple Analyses
15.3 Spans and Interpretations
15.4 Linguistic Annotation
16 Feature Structures
16.1 Introduction
16.2 Elementary Feature Structures: Features with Binary Values
16.3 Feature, Feature-Structure and Feature-Value Libraries
16.4 Symbolic, Numeric, Measurement, Rate and String Values
16.5 Structured Values
16.6 Singleton, Set, Bag and List Collections of Values
16.7 Alternative Features and Feature Values
16.8 Boolean, Default and Uncertain Values
16.9 Indirect Specification of Values Using the rel Attribute
16.9.1 The Not-Equals Relation
16.9.2 Other Inequality Relations
16.9.3 Subsumption and Non-subsumption Relations
16.9.4 Relations Holding with Sets, Bags, and Lists
16.9.5 Varieties of Subsumption and Non-subsumption
16.10 Two Illustrations
17 Certainty and Responsibility
17.1 Levels of Certainty
17.1.1 Using Notes to Record Uncertainty
17.1.2 Structured Indications of Uncertainty
17.2 Attribution of Responsibility
18 Transcription of Primary Sources
18.1 Altered, Corrected, and Erroneous Texts
18.1.1 Use of Core Tags for Transcriptional Work
18.1.2 Abbreviation and Expansion
18.1.3 Correction and Conjecture
18.1.4 Additions and Deletions
18.1.5 Substitutions
18.1.6 Cancellation of Deletions and Other Markings
18.1.7 Text Omitted from or Supplied in the Transcription
18.2 Non-Linguistic Phenomena in the Source
18.2.1 Document Hands
18.2.2 Hand, Responsibility, and Certainty Attributes
18.2.3 Damage, Illegibility, and Supplied Text
18.2.4 Use of the Gap, Del, Damage, Unclear and Supplied Tags in Combination
18.2.5 Space
18.2.6 Lines
18.3 Headers, Footers, and Similar Matter
18.4 Other Primary Source Features not Covered in These Guidelines
19 Critical Apparatus
19.1 The Apparatus Entry, Readings, and Witnesses
19.1.1 The Apparatus Entry
19.1.2 Readings
19.1.3 Indicating Subvariation in Apparatus Entries
19.1.4 Witness Information
19.1.4.1 Witness Detail Information
19.1.4.2 Witness Information in the Source
19.1.4.3 The Witness List
19.1.5 Fragmentary Witnesses
19.2 Linking the Apparatus to the Text
19.2.1 The Location-referenced Method
19.2.2 The Double End-Point Attachment Method
19.2.3 The Parallel Segmentation Method
19.3 Using Apparatus Elements in Transcriptions
20 Names and Dates
20.1 Personal Names
20.2 Place Names
20.2.1 Geo-political Place Names
20.2.2 Geographic Names
20.2.3 Relative Place Names
20.3 Organization names
20.4 Dates and Time
20.4.1 Absolute Dates and Times
20.4.2 Relative Dates and Times
21 Graphs, Networks, and Trees
21.1 Graphs and Digraphs
21.1.1 Transition Networks
21.1.2 Family Trees
21.1.3 Historical Interpretation
21.2 Trees
21.3 Another Tree Notation
22 Tables, Formulae, and Graphics
22.1 Tables
22.1.1 The TEI Table DTD
22.1.2 Other Table DTDs
22.2 Formulae and Mathematical Expressions
22.3 Specific Elements for Graphic Images
22.4 Overview of Basic Graphics Concepts
22.5 Graphic Image Formats
22.5.1 Vector Graphic Formats
22.5.2 Raster Graphic Formats
22.5.3 Photographic and Motion Video Formats
23 Language Corpora
23.1 Varieties of Composite Text
23.2 Contextual Information
23.2.1 The Text Description
23.2.2 The Participants Description
23.2.3 The Setting Description
23.3 Associating Contextual Information with a Text
23.3.1 Combining Corpus and Text Headers
23.3.2 Declarable Elements
23.3.3 Summary
23.4 Linguistic Annotation of Corpora
23.4.1 Levels of Analysis
23.5 Recommendations for the Encoding of Large Corpora
V Auxiliary Document Types
24 The Independent Header
24.1 Definition and Principles for Encoders
24.2 Required and Recommended Tags
24.3 Header Elements and their Relationship to the MARC Record
24.4 MARC Fields for the File Description
24.5 MARC Fields for the Encoding Description
24.6 MARC Fields for the Profile Description
24.7 MARC fields for the Revision Description
24.8 Structure of the DTD for Independent Headers
25 Writing System Declaration
25.1 Overall Structure of Writing System Declaration
25.2 Identifying the Language
25.3 Describing the Writing System
25.4 Documenting the Character Set and Its Encoding
25.4.1 Base Components of the WSD
25.4.2 Exceptions in the WSD
25.4.3 Documenting Coded Character Sets and Entity Sets
25.4.4 Documenting Transliteration Schemes
25.5 Notes in the WSD
25.6 Linkage between WSD and Main Document
25.7 Predefined TEI WSDs
25.8 Details of WSD Semantics
25.8.1 WSD Semantics: General Principles
25.8.2 Semantics of WSD Base Components
25.8.3 Multiple Base Components
25.8.4 Semantics of Exceptions
25.8.4.1 Case 1: replacement
25.8.4.2 Case 2: merger
25.8.4.3 Case 3: expansion
25.8.5 Merger of Form and Character Elements
26 Feature System Declaration
26.1 Linking a TEI Text to Feature System Declarations
26.2 The Overall Structure of a Feature System Declaration
26.3 Feature Declarations
26.4 Feature Structure Constraints
26.5 A Complete Example
27 Tag Set Documentation
27.1 The TagDoc Documentation Element
27.1.1 The AttList Documentation Element
27.2 Element Classes
27.3 Entity Documentation
VI Technical Topics
28 Conformance
28.1 Definitions of Terms
28.1.1 TEI-Conformant Document
28.1.2 TEI Local Processing Format
28.1.3 TEI Interchange Format
28.1.4 TEI Packed Interchange format
28.1.5 TEI Recommended Practice
28.1.6 TEI Abstract Model
28.2 Modifications to TEI SGML Declaration
28.3 Modifications to TEI Document Type Declarations
28.4 TEI Processing Model
28.4.1 Document Capture and Reclamation
28.4.2 Local Storage Format and Application Software
28.4.3 Enrichment and Other Processing
28.4.4 Data Export
28.4.5 Data Import
28.4.6 TEI Conformance in the Processing Model
28.5 Aspects of Conformance and Document Description
28.5.1 Character Sets
28.5.2 SGML Declaration
28.5.3 Document Type Declaration
28.5.4 Tag Usage and Feature Marking
28.5.5 Non-SGML, non-XML Markup
29 Modifying and Customizing the TEI DTD
29.1 Kinds of Modification
29.1.1 Suppressing Elements
29.1.2 Renaming Elements
29.1.3 Class Extension
29.1.4 New content models
29.2 Documenting the Modifications
29.3 TEI Lite: an example Customization
30 Rules for Interchange
30.1 Negotiated Interchange
30.2 Some Simple Examples
30.3 Non-Negotiated Interchange
30.4 Notes for Implementors
31 Multiple Hierarchies
31.1 Concurrent Markup of Multiple Hierarchies
31.2 Boundary Marking with Milestone Elements
31.3 Fragmentation of Elements
31.4 Reconstitution of Virtual Elements
31.5 Multiple Encodings of the Same Information
31.6 Concurrent Markup for Pages and Lines
32 Algorithm for Recognizing Canonical References
VII Alphabetical Reference Lists of Classes, Entities, and Elements
33 Element Classes
34 Entities
35 Elements
VIII Reference Material
36 Obtaining the TEI DTD
37 Obtaining TEI WSDs
38 Sample Tag Set Documentation
38.1 Tag Documentation for the TEI p Element
38.2 Tag Documentation for the TEI head Element
38.3 Tag Documentation for the TEI div Element
38.4 Class Documentation for the TEI Divn Class
39 Formal Grammar for the TEI-Interchange-Format Subset of SGML
39.1 Notation
39.2 Grammar for SGML Document (Overview)
39.3 Grammar for SGML Declaration
39.4 Grammar for DTD
39.5 Grammar for Document Instance
39.6 Common Syntactic Constructs
39.7 Lexical Scanner
39.8 Differences from ISO 8879
Appendix A Bibliography
Appendix B Index
Appendix C Prefatory Notes
Appendix C1. Introductory Note (November 2001)
Appendix C2. Introductory Note (June 2001)
Appendix C3. Introductory Note (May 1999)
Appendix C3.1. Typographic corrections made
Appendix C3.2. Specific changes in the DTD
Appendix C3.3. Outstanding errors
Appendix C4. Preface (April 1994)
Appendix C5. Acknowledgments
Appendix C5.1. TEI Working Committees (1990-1993)
Appendix C5.2. Advisory Board
Appendix C5.3. Steering Committee Membership
Appendix D Colophon
Revision History: June 2004
© Text Encoding Initiative Consortium
In memoriam
Donald E. Walker
22 November 1928 – 26 November 1993
Introductory Note (March 2002)
The primary goal of this revision has been to make available a new and
corrected version of the TEI Guidelines which:
• is expressed in XML and conforms to a TEI-conformant XML DTD;
• generates a set of DTD fragments that can be combined together to
form either SGML or XML document type definitions;
• corrects blatant errors, typographical mishaps, and other egregious
editorial oversights;
• can be processed and maintained using readily available XML tools
instead of the special-purpose ad hoc software originally used for
TEI P3.
A second major design goal of this revision has been to ensure that the
DTD fragments generated would not break existing documents: in other
words, that any document conforming to the original TEI P3 SGML DTD
would also conform to the new XML version of it. Although full
backwards compatibility cannot be guaranteed, we believe our
implementation is consistent with that goal.
In most respects, the TEI Guidelines have stood the test of time
remarkably well. The present edition makes no substantial attempt to
rewrite those few parts of them which have now been rendered obsolete
by changes since their first publication, though an indication is given
in the text of where such rewriting is now considered necessary.
Neither does the present version attempt to address any of the many
possible new areas of digital activity in which the TEI approach to
standardization may have something to offer. Both these tasks require
the existence of an informed and active TEI Council to direct and
validate such extension and maintenance work, in response to the
changing needs and priorities of the TEI user community.
Two exceptions to the above principles may be cited: firstly, the
chapter which originally provided a ‘Gentle Introduction’ to SGML has
been completely rewritten to provide a similarly gentle introduction to
XML; secondly, the chapter on character sets has been completely
revised in light of the close connexion between Unicode and XML. The
editors gratefully acknowledge the assistance of the ad hoc workgroup
chaired by Christian Wittern, which undertook to provide expert advice
and correction at very short notice, in the latter task.
The preparation of this new version relied extensively on preliminary
work carried out by the former North American editor of the TEI
Guidelines, C.M. Sperberg-McQueen. In a TEI working paper written in
1999 [1] he sketched out a precise blueprint for the conversion of the
TEI from SGML to XML, which we have implemented, with only slight
modification.
The Editors would also like to express thanks to the team of volunteers
from the TEI community who helped us with the task of proofreading the
first draft during the summer of 2001; and to Sebastian Rahtz of Oxford
University Computing Services, without whose skill and enthusiasm this
new edition would not have been possible.
A substantial proportion of the work of preparing this new edition was
funded with the assistance of a grant from the US National Endowment
for the Humanities, whose continued support of the TEI has also been
crucial to the effort of setting up the TEI Consortium.
Finally, we would like to thank all our colleagues on the interim
management board of the TEI Consortium, in particular its Chairman John
Unsworth, for their continued support of the TEI’s work, and their
willingness to devote effort to the difficult task of overseeing its
transition to a new organizational infrastructure.
Summary details of the changes made in the present and previous
editions are given in their Prefatory Notes, all of which are now
reproduced in an Appendix to the present edition: see Appendix C
Prefatory Notes.
Lou Burnard and Syd Bauman (TEI Editors) Oxford and Providence,
March 2002.
1 TEI ED W69, available from the TEI website at
http://www.tei-c.org/Vault/ED/edw69.htm.
I: Introduction
June 2004 1 TEI Consortium
1 About These Guidelines
These Guidelines have been developed by the Text Encoding Initiative
(TEI); see 1.3 Historical Background. They are addressed to anyone who
works with any text in electronic form.
They provide means of representing those features of a text which need
to be identified explicitly in order to facilitate processing of the
text by computer programs. In particular, they specify a set of markers
(or tags) which may be inserted in the electronic representation of the
text, in order to mark the text structure and other textual features of
interest. Without such explicit markers, many important features remain
difficult to locate by mechanical means such as computer programs, and
thus difficult to process effectively. The process of inserting such
explicit markers for implicit textual features is often called
‘markup’, ‘encoding’, or ‘tagging’, and the term encoding scheme or
markup language denotes the rules which govern the use of markup in a
set of encodings.
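As a minimal sketch of why explicit markers matter, the Python fragment
below parses a small encoded sentence and retrieves every marked name by
mechanical means. The sample sentence, the element names used in it (p,
name, emph), and the helper find_marked are illustrative assumptions made
for this sketch; they are not recommendations taken from these
Guidelines.

```python
import xml.etree.ElementTree as ET

# Illustrative fragment only: the sentence and its tags are assumptions
# made for this sketch.
ENCODED = (
    "<p>The editors thank <name>Christian Wittern</name> and "
    "<name>Sebastian Rahtz</name> for their <emph>expert</emph> advice.</p>"
)


def find_marked(xml_text, tag):
    """Return the text content of every element called `tag`."""
    root = ET.fromstring(xml_text)
    return ["".join(element.itertext()) for element in root.iter(tag)]


# With the names tagged, locating them is a simple lookup; without the
# tags, a program would have to guess which words are names.
names = find_marked(ENCODED, "name")
```

Here names holds the two tagged strings in document order. The same
sentence without tags would give a program nothing reliable to search
for.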
The Guidelines formulated in this document are intended for use in
interchange between individuals and research groups using different
programs and computer systems over a broad range of applications.
Since they contain an inventory of the features most often found useful
for text processing, the Guidelines also provide help to those creating
texts in electronic form. They can also be used for the local storage
of text which is to be processed with multiple software packages
requiring different input formats.
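The idea of one stored text feeding several programs can be sketched as
follows. This is a hedged illustration only: the sample sentence, its
element names (p, title, date), and the two output shapes are
assumptions chosen for the sketch, standing in for whatever input
formats real software packages might require.

```python
import xml.etree.ElementTree as ET

# Hypothetical stored text; the tags are assumptions for illustration.
SOURCE = "<p>She read <title>Emma</title> in <date>1816</date>.</p>"


def to_plain_text(xml_text):
    """Strip the markup, as a tool wanting running text might require."""
    return "".join(ET.fromstring(xml_text).itertext())


def to_field_list(xml_text):
    """List (tag, text) pairs, as an indexing tool might require."""
    root = ET.fromstring(xml_text)
    return [(child.tag, "".join(child.itertext())) for child in root]


plain = to_plain_text(SOURCE)
fields = to_field_list(SOURCE)
```

Both outputs are derived from the same source, so the encoded text needs
to be stored only once.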
The Guidelines apply to texts in any natural language, of any date, in
any literary genre or text type, without restriction on form or
content. They treat both continuous materials (‘running text’) and
discontinuous materials such as dictionaries and linguistic corpora.
Though principally directed to the needs of the scholarly research
community, the Guidelines are not restricted to esoteric academic
applications. They should also be useful for librarians who maintain
and document electronic materials, as well as for publishers and others
creating or distributing electronic texts. Although they focus on
problems of representing in electronic form texts which already exist
in traditional media, these Guidelines should also be useful for the
creation of electronic texts. They are adequate to, but not limited by,
existing practices.
The rules and recommendations made in these Guidelines are designed to
enable the creation of documents that conform to either the Standard
Generalized Markup Language (SGML, defined by ISO 8879) or the
Extensible Markup Language (XML, defined by the World Wide Web
Consortium’s XML Recommendation). XML is a subset of SGML, and the
modifications to these Guidelines to support XML are designed to
maximize compatibility with both specifications. For more information
on markup languages see chapter 2 A Gentle Introduction to XML.
These Guidelines also make reference to character encoding standards
such as ISO 646, ISO 10646 and Unicode. ISO 646 defines a standard
seven-bit character set in terms of which recommendations on
character-level interchange are formulated; this is the most portable
character set for broad interchange, but requires indirect encoding of
many characters. Unicode provides a much larger character set
appropriate for international use, and all XML implementations must
support it; however, it is not, as of this writing, so widely portable
as ISO 646.
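A short sketch may make ‘indirect encoding’ concrete. It assumes that
ASCII can stand in for the ISO 646 seven-bit set and that XML numeric
character references serve as the indirect form; both are assumptions of
this sketch, as are the helper names to_seven_bit and from_seven_bit.

```python
import html


def to_seven_bit(text):
    """Encode text using only seven-bit characters.

    Characters outside ASCII (assumed here to approximate ISO 646) are
    replaced by numeric character references such as &#233;.
    """
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


def from_seven_bit(encoded):
    """Resolve character references back into the characters they name.

    Note: html.unescape also resolves named references such as &amp;,
    so this simple round trip suits only text without a literal '&'.
    """
    return html.unescape(encoded)


portable = to_seven_bit("caf\u00e9")    # e-acute becomes a reference
restored = from_seven_bit(portable)     # and is recovered on decoding
```

The portable form survives transfer through seven-bit channels, at the
cost of being harder to read and longer than the original.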
This document provides the authoritative statement of the requirements
and usage of the TEI encoding scheme. Although it includes numerous
small examples, it must be stressed that it is intended as a reference
manual and that readers unfamiliar with SGML, XML, or text markup in
general will find it difficult to learn the encoding scheme by reading
this document alone.
This document will be complemented by a series of tutorials in text
encoding (document TEI U1 et seq.) and a case book of extended examples
with discussion of the rationale for various markup choices (TEI T1).
[2] Readers seeking an introduction to text markup and the use of the
TEI encoding scheme in a specific area should consult an appropriate
tutorial; those already familiar with the scheme and interested in
seeing examples of its application should consult the case book.
The remainder of this chapter comprises three sections. The first gives
an overview of the structure and notational conventions used throughout
the document. The second enumerates the design principles underlying
the TEI scheme and the application environments in which it may be
found useful. Finally, the third section gives a brief account of the
origins and development of the Text Encoding Initiative itself.
2 TEI documents bear identifying numbers which indicate the provenance
of the document (here simply “TEI”, in other cases the TEI work group
number, e.g. “TEI AI5”), the type of document (here “U” and “T”,
meaning users’ guide or users’ manual and sample text(s)), and a
sequential number. The TEI document number of the document in hand is
TEI P4 (for TEI public proposal number 4).
1.1 Structure and Notational Conventions of this Document
1.1.1 Structure
Part I provides some relevant background information about the
Guidelines themselves (in this chapter); a brief technical review of
markup languages (chapter 2 A Gentle Introduction to XML); and a
description of how the TEI document type definition (DTD) is organized
(chapter 3 Structure of the TEI Document Type Definition).
Part II provides a systematic treatment of issues common to all text
types: character representation (chapter 4 Languages and Character
Sets); in-file documentation of the text (chapter 5 The TEI Header);
tags for text features found in all sorts of text: lists, notes,
emphasis, quotations, cross-references, technical terms, names, dates,
numbers, etc. (chapter 6 Elements Available in All TEI Documents); and
a definition for the default structure of all TEI documents (chapter 7
Default Text Structure).
Part III documents various base tag sets: these include specialized
tags for prose, for verse, for drama and other performance materials,
for spoken materials, as well as for letters and memoranda, printed
dictionaries, and terminological data. Additional sections discuss
user-defined and mixed base tag sets. An instance of the TEI DTD must
use one and only one base tag set, unless one of the ‘mixed’ bases is
used.
Part IV documents various additional tag sets, which may be included or
excluded, as appropriate. Topics covered include a variety of
approaches to the analysis and interpretation of texts, and include
representations for hypertextual links and other non-hierarchic
structures, as well as specialized tags for the encoding of critical
editions and language corpora.
Part V defines certain specialized auxiliary document types, used to
encode information about the way that texts have been encoded,
specifically: the TEI header regarded as a distinct document; the TEI
Writing System Declaration; the Feature System declaration; and the Tag
Set Documentation.
Part VI contains a number of technical discussions of a more specialist
interest. Topics covered include the notion of formal conformance to
the TEI Guidelines; the controlled user-modification of the TEI DTD;
practical aspects of the use of TEI markup both in local processing and
in interchange; and the relationship of TEI markup to other markup
standards.
Part VII consists of an alphabetical reference list of all
elements and element classes defined in the TEI encoding scheme.
Part VIII provides further reference material: specifically, a
description of how to obtain current versions of the full TEI DTDs
and the set of standard Writing System Declarations, a sample
Feature System Declaration for basic grammatical annotation, sample
tag documentation, and a formal grammar for the subset of SGML used
in the TEI interchange format. No formal subset has been defined
for XML, since XML itself is a subset appropriate to these
Guidelines.
In the back matter, a bibliography lists works cited in the text
of the Guidelines. A mechanically generated index is also provided,
which can serve, it is hoped, as a finding aid for the use of the
Guidelines.
1.1.2 Notational Conventions
This section describes the typographic and stylistic conventions
used throughout this document. The use of many terms and concepts
which have not yet been defined is unavoidable in this section. All
such terms and concepts will be explained in later chapters of
Part I.
When SGML or XML elements are mentioned in the text, they take
the form <name>, where “name” is the generic identifier of the
element. Sample tags mentioned in the text are displayed in the form
<tag>. References to attributes take the form attname, where
“attname” is the name of the attribute. Where the elements and
attributes thus mentioned are part of the TEI encoding scheme, they
are included in the index.
These Guidelines distinguish encoding practices and elements as
required, recommended, or optional. The phrases “must”, “is required
to”, etc., mark practices and tags which are required for TEI
conformance. The phrases “should”, “it is recommended that”, “it
is preferable to ...”, etc., are used in describing practices which
are recommended but not required for TEI conformance. Modal verbs
like “may”, “might”, etc., mark practices which are strictly
optional. Qualifying phrases like “if desired”, “where appropriate”,
or “under some circumstances” are used when some tag or practice
described may be desirable or acceptable under some circumstances
and not under others.
In the reference section in Part VII, elements and their
attributes are all classed as one of:
• required: unconditionally required in a TEI-conformant document
• mandatory when applicable: required under the appropriate
conditions; may be omitted if not applicable
• recommended: recommended unless there are good reasons, in the
given circumstances, against it
• recommended when applicable: recommended under some circumstances
(which should be clear from context)
• optional: strictly optional
This reference section includes cross-references to the chapter
or section of the main text within which each element is discussed.
Most sections of the main text in which elements are defined begin
with a descriptive list of the elements concerned in the following
format:
<element> short description of the element marked by <element>.
Where appropriate this is followed by a list of significant
non-global attributes for the element as follows:
  attribute description of the attribute’s meaning or usage,
  optionally followed by a list of suggested or legal values:
    value1 meaning of value1
    value2 meaning of value2
Not all attributes are always included in these lists; those
which are shared with other elements in a class are usually listed
separately, and those of relatively specialized interest are
usually listed only in the reference section. The values of the
attribute are introduced with one of the following formulaic
phrases:
‘Legal values include:’ The attribute cannot take values other
than those given. Other values will cause parsing errors. (This is
used relatively rarely in these Guidelines.)
‘Suggested values include:’ The values listed constitute a set
which should suffice for most purposes, and they should be used
where appropriate. Developers of TEI-aware software should ensure
that their software can process these values appropriately. In some
cases, however, it is conceivable that other values might be
necessary, so the declaration for the attribute does not restrict
legal values to those given. TEI-aware software should have
reasonable fallback processing for values not in the list.
‘Sample values include:’ The attribute can take any value; those
listed are provided simply as examples of the kind of value
possible.
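The three formulaic phrases imply different handling by TEI-aware
software. The following is a minimal sketch of that distinction, not
part of the Guidelines: the names ValueList and check_value, and the
result strings, are hypothetical and chosen only for illustration.

```python
from enum import Enum


class ValueList(Enum):
    """The three formulaic phrases that introduce attribute values."""
    LEGAL = "Legal values include:"
    SUGGESTED = "Suggested values include:"
    SAMPLE = "Sample values include:"


def check_value(kind, listed, value):
    """Return 'ok', 'fallback', or 'error' for an attribute value.

    kind   -- which phrase introduced the list of values
    listed -- the set of values given in the list
    value  -- the value actually found in a document
    """
    if value in listed:
        return "ok"
    if kind is ValueList.LEGAL:
        # Closed list: any other value would cause a parsing error.
        return "error"
    if kind is ValueList.SUGGESTED:
        # Open list: unlisted values are allowed, but software
        # should apply some reasonable fallback processing.
        return "fallback"
    # Sample list: the attribute can take any value.
    return "ok"


print(check_value(ValueList.LEGAL, {"yes", "no"}, "maybe"))
print(check_value(ValueList.SUGGESTED, {"verse", "prose"}, "drama"))
```

Only the ‘Legal values’ case is enforced by a parser; the other two
are conventions which software is expected to honour.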
Each list of elements is followed by some discussion of its
semantics and usage, followed by one or more examples, taken
wherever possible from real texts, and presented in the following
format:
This paragraph contains an italicized phrase
All the examples are (or should be) legal SGML or XML, but
because they are fragmentary they may not be parseable without
additional context. They also frequently make liberal use of white
space to exhibit the logical structure of the encoding more clearly.
Although this does not affect the validity of the examples, some
users will prefer not to follow it in practice, since not all
processors will ignore the extra white space. Except where otherwise
noted, examples do not use minimization not permitted by XML, though
SGML users may wish to exercise SGML’s options to:
• use empty end-tags (of the form </>) to close the most recently
opened element
• omit end-tags where they may legally be omitted (the TEI DTDs do
not permit omission of any start-tags)
Attribute values are given indifferently in single quotes or
double quotes. Unquoted attribute values are not permitted in XML,
and so are not used except where otherwise noted, for example to
emphasize a comparison between SGML and XML.
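The difference between SGML minimization and XML syntax can be
observed with any XML parser. The following sketch uses the parser in
the Python standard library; the helper name is_well_formed and the
sample fragments are illustrative assumptions, not examples taken
from the Guidelines.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment: str) -> bool:
    """Return True if the fragment parses as well-formed XML."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


# Both quoting styles are acceptable to an XML parser.
quoted_double = '<p rend="italic">text</p>'
quoted_single = "<p rend='italic'>text</p>"

# SGML-style minimization, none of which is permitted in XML:
unquoted_value = "<p rend=italic>text</p>"            # unquoted attribute value
empty_end_tag = '<p rend="italic">text</>'            # empty end-tag
omitted_end_tag = "<list><item>one<item>two</list>"   # omitted end-tags

for fragment in (quoted_double, quoted_single, unquoted_value,
                 empty_end_tag, omitted_end_tag):
    print(is_well_formed(fragment), fragment)
```

The check concerns well-formedness only; it says nothing about
validity against a TEI DTD.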
After the examples and usage notes, each section typically
concludes with a DTD fragment containing the formal declarations for
the elements described. Each DTD fragment is given a heading, and
may contain element and attribute list declarations, entity
declarations, parameter entity references, comments, and references
to DTD fragments in other sections. The DTD fragments of a single
chapter almost invariably belong to the same DTD file, the structure
of which is typically described (with references to the included
fragments) in one of the first or last sections of the chapter.
The DTD fragments are identical to the DTDs distributed with
these Guidelines, with the following exceptions:
• In the text, the DTD fragments appear in an order dictated by
organization of this document; the actual DTD files may re-order the
material slightly. This is indicated in the text by references from
one DTD fragment to another.
• The DTD fragments in the text show the generic identifiers of
all elements using the standard English names assigned in this
document; the actual DTD files use parameter entities for all
generic identifiers, so that elements can be conveniently
renamed, as described in chapter 29 Modifying and Customizing the
TEI DTD.
• The actual DTD files include conditional marked sections
surrounding the element and attribute list declaration for each
element, to ensure that elements can conveniently be suppressed or
redefined, as described in chapter 29 Modifying and Customizing the
TEI DTD. The fragments in the text suppress the marked-section-open
and marked-section-close markup.
Note that, in both text and DTD, the omissibility indicators
which must appear within an SGML declaration (but which are illegal
in XML) are always given in parameterized form, as in the following
examples. This is to enable a single source to support both XML and
SGML versions of the DTDs, as further discussed in section 3.8.4
Generation of an XML DTD.
What appears in the text, therefore, as:
will appear thus in the actual DTD file:
]]>
For further discussion, see chapter 3 Structure of the TEI
Document Type Definition, or chapter 29 Modifying and Customizing
the TEI DTD.
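The idea of a single parameterized source serving both SGML and XML
can be sketched as a simple text substitution. This is an
illustration only, not the mechanism actually used to build the TEI
DTDs: the entity names om.RR and om.RO, their replacement values, the
sample declaration, and the function expand are all assumptions made
for the example.

```python
import re

# Assumed values for the parameterized omissibility indicators:
# SGML needs explicit indicators, XML needs none at all.
OMISSIBILITY = {
    "sgml": {"om.RR": "- -", "om.RO": "- O"},
    "xml": {"om.RR": "", "om.RO": ""},
}


def expand(declaration: str, target: str) -> str:
    """Expand known parameter entity references for 'sgml' or 'xml'."""
    values = OMISSIBILITY[target]

    def replace(match):
        name = match.group(1)
        # Leave references we do not know about untouched.
        return values.get(name, match.group(0))

    expanded = re.sub(r"%([A-Za-z][\w.]*);", replace, declaration)
    # Collapse the doubled space left behind by an empty expansion.
    return re.sub(r"[ \t]+", " ", expanded)


source = "<!ELEMENT note %om.RR; (#PCDATA)>"
print(expand(source, "sgml"))
print(expand(source, "xml"))
```

A real DTD processor resolves parameter entities itself; the sketch
only shows why one declaration can yield two syntactically different
results.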
1.2 Underlying Principles and Intended Use
1.2.1 Design Principles of the TEI Scheme
The planning conference held at Vassar College in November 1987
(see section 1.3 Historical Background) agreed on a number of
principles concerning the basic design goals of the Text Encoding
Initiative. These principles are expounded in various documents of
the TEI (notably TEI ED P1 and TEI ED P2) and the interested reader
is directed to those documents for further discussion.
Because of its roots in the humanistic research community, the
TEI scheme is driven by its original goal of serving the needs of
research, and is therefore committed to providing a maximum of
comprehensibility, flexibility, and extensibility. More specific
design goals of the TEI have been that the Guidelines should:
• provide a standard format for data interchange
• provide guidance for encoding of texts in this format
• support the encoding of all kinds of features of all kinds of
texts studied by researchers
• be application independent
This has led to a number of important design decisions, such
as:
• the choice of SGML, XML, ISO 646, and Unicode
• the provision of a large predefined tag set
• a distinction between required, recommended, and optional
encoding practices
• encodings for different views of text
• alternative encodings for the same text features
• mechanisms for user-defined extensions to the scheme
These goals and principles are expounded in more detail
below.
The goals of creating a common interchange format which is
application independent require the definition of a specific markup
syntax as well as the definition of a large predefined tag set. The
syntax of the recommendations made in this document conforms to the
international standard ISO 8879, which defines the Standard
Generalized Markup Language, and to the World Wide Web Consortium’s
XML Recommendation, which defines the Extensible Markup Language.
Full document type declarations are provided for the scheme
described in these Guidelines; they are constructed so that they
can be easily converted to either language. Reference is also made
to ISO 646, which defines a standard seven-bit character set; and to
Unicode, which defines a larger character set supporting most
modern languages.
The goal of providing guidance for text encoding requires that
recommendations be made as to what textual features should be
recorded in various situations. This mandate is fulfilled by the
explicit specification, in the reference section for each tag, that
the tag is required, mandatory when applicable but otherwise
omissible, recommended generally, recommended when applicable but
not always applicable, or optional.
However, the TEI Guidelines make (with relatively rare
exceptions) no suggestions or restrictions as to the relative
importance of textual features. The philosophy of the Guidelines is
“if you want to encode this feature, do it this way”, but very few
features are mandatory.
The Guidelines have been written largely with a focus on text
capture (i.e. the representation in electronic form of an already
existing copy text in another medium) rather than text creation
(where no such copy text exists). Hence the frequent use of terms
like ‘transcription’, ‘original’, ‘copy text’, etc. However, the
Guidelines should be equally applicable to text creation, and the
two terms text creation and text capture are often used
interchangeably.
Concerning text capture the TEI Guidelines do not specify a
particular approach to the problem of fidelity to the source text
and recoverability of the original; such a choice is the
responsibility of the text encoder. The current version of these
Guidelines, however, provides a more fully elaborated set of tags
for markup of rhetorical, linguistic, and simple typographic
characteristics of the text than for detailed markup of page layout
or for fine distinctions among type fonts or manuscript hands.
In these Guidelines, no hard and fast distinction is drawn
between ‘objective’ and ‘subjective’ information or between
‘representation’ and ‘interpretation’. These distinctions, though
widely made and often useful in narrow, well-defined contexts, are
perhaps best interpreted as distinctions between issues on
which there is a scholarly consensus and issues where no such
consensus exists. Such consensus has been, and no doubt will be,
subject to change. The TEI Guidelines do not make suggestions or
restrictions as to which of these features should be encoded. The
use of the terms descriptive and interpretive about different types
of encoding in the Guidelines is not intended to support any
particular view on these theoretical issues, but reflects a purely
practical division of responsibility between the two committees
called Committee on Text Representation and Committee on Text
Interpretation and Analysis.
In general, the accuracy and the reliability of the encoding and
the appropriateness of the interpretation are for the individual
user of the text to determine. The Guidelines provide a means of
documenting the encoding in such a way that a user of the text can
know the reasoning behind that encoding, and the general
interpretive decisions on which it is based. It is strongly
recommended that the TEI header be used to give an account of these
aspects of the encoding. The TEI header is described in chapter 5
The TEI Header.
In many situations more than one view of a text is needed. No
absolute recommendation to embody one specific view of text can
apply to all texts and all approaches to them. The syntaxes of SGML
and XML ensure that some encodings can be ignored for some purposes.
To enable encoding multiple views, these Guidelines not only treat a
variety of text features, but sometimes provide several alternative
encodings for what appear to be identical textual phenomena. These
Guidelines therefore offer the possibility of encoding many
different views of the text, simultaneously if necessary.
However, the Guidelines are built on the assumption that there
is a common core of textual features shared by virtually all texts
and virtually all serious work on texts. This core set of tags is
defined in chapter 6 Elements Available in All TEI Documents. Beyond
this core, many different elements can be encoded.
In brief, the TEI Guidelines define a general-purpose encoding
scheme which makes it possible to encode different views of text,
possibly intended for different applications, serving the majority
of scholarly purposes of text studies in the humanities. However, no
predefined encoding scheme can serve all research purposes.
Therefore, the TEI also provides means of modifying and extending
the encoding scheme defined by the Guidelines (see chapter 29
Modifying and Customizing the TEI DTD).
1.2.2 Intended Use
We envisage three primary functions for these Guidelines:
• guidance for individual or local practice in text creation and
data capture;
• support of data interchange;
• support of application-independent local processing.
These three functions are so thoroughly interwoven in practice
that it is hardly possible to address any one without addressing the
others. However, the distinction provides a useful framework for
discussing the possible role of the Guidelines in work with
electronic texts.
1.2.2.1 Use in Text Capture and Text Creation
The description of textual features found in the chapters which
follow should provide a useful checklist from which scholars
planning to create electronic texts should select the subset of
features suitable for their project.
Problems specific to text creation or text ‘capture’ have not
been considered explicitly in this document. For purposes of the TEI
interchange format and for use of markup languages, it does not
matter how a text is created or captured: it can be typed by hand,
scanned from a printed book or typescript, read from a typesetter’s
tape, or acquired from another researcher who may have used another
markup scheme (or no explicit markup at all).
We include here only some general points which are often raised
about markup and the process of data capture.
XML, and even SGML, can appear distressingly verbose,
particularly when (as in these Guidelines) the names of tags and
attributes are chosen for clarity and not for brevity. Editor
macros and keyboard shorthands can allow a typist to enter
frequently used tags with single keystrokes. Special-purpose
software may be purchased which scans word-processor or scanner
data and inserts tags. Markup-aware software can help with
maintaining the hierarchical structure of the document, and display
the document with visual formatting rather than raw tags.
The techniques described in chapter 29 Modifying and Customizing
the TEI DTD may be used to give shorter names to the tags being used
most often. It should also be noted that the examples in this text
are chosen to exhibit the markup compactly, and thus have denser
markup than will be typical in many texts.
The SGML standard provides ways of abbreviating, omitting, or
otherwise minimizing the amount of markup which need be explicitly
provided in a text. They are all forbidden in the TEI interchange
format because their use complicates processing; this does not
however preclude their use in local processing, where this is felt
appropriate or desirable. The XML Working Group followed this
guideline as well, and XML prohibits essentially the same
minimization practices as these Guidelines do.
1.2.2.2 Use for Interchange
When the TEI Guidelines are used for interchange, it is expected
that researchers using other encoding schemes in their work will
translate outgoing data from such schemes into the scheme described
by these Guidelines, and similarly translate incoming data from the
scheme described here into those used internally. For such
translations to be carried out without loss of information, the
scheme proposed here must be as expressive (in a formal sense) as
any encoding scheme now known to be in wide use for textual
research. To ensure that this is the case, a set of extension
techniques is provided (see chapter 29 Modifying and Customizing the
TEI DTD) which makes possible the addition of extra tags, the
renaming of existing tags, and certain kinds of redefinition.
Although the intention is to minimize the need for recourse to such
extensions, they may be used to accommodate the encoding of new or
unanticipated textual features.
To translate between any pair of encoding schemes implies:
1. identifying the sets of textual features distinguished by the
two schemes;
2. determining where the two sets of features correspond;
3. creating a suitable set of mappings.
For example, to translate from encoding scheme X into the TEI
scheme:
1. Make a list of all the textual features distinguished in X.
2. Identify the corresponding feature in the TEI scheme. There are
three possibilities for each feature:
  i. the feature exists in both X and the TEI scheme;
  ii. X has a feature which is absent from the TEI scheme;
  iii. X has a feature which corresponds with more than one
  feature in the TEI scheme.
The first case is unproblematic. The second requires an
extension to the TEI scheme, as described in chapter 29 Modifying
and Customizing the TEI DTD. The third requires that a consistent
choice be made. The algorithm used to make that choice should be
documented in the TEI header.
3. Using the table of equivalences so generated, a simple
translation can be carried out between X and the TEI.
The ease with which this translation can be carried out will of
course depend on the clarity and explicitness with which scheme X
represents the features it encodes.
Translating from the TEI into scheme X follows the same pattern,
except that if a TEI feature has no equivalent in X, and X cannot be
extended, information must be lost in translation.
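The procedure for translating from a scheme X into the TEI scheme
can be sketched as follows. The feature names of scheme X, the
equivalences, and the rule for resolving multiple correspondences
(take the first candidate) are hypothetical; a real project would
choose its own rule and document it in the TEI header.

```python
def classify_features(x_features, equivalences):
    """Sort the features of scheme X into the three cases.

    equivalences maps an X feature to a list of candidate TEI
    elements; a missing or empty entry means no TEI equivalent.
    """
    direct, needs_extension, needs_choice = {}, [], {}
    for feature in x_features:
        candidates = equivalences.get(feature, [])
        if len(candidates) == 1:
            direct[feature] = candidates[0]      # case i
        elif not candidates:
            needs_extension.append(feature)      # case ii
        else:
            needs_choice[feature] = candidates   # case iii
    return direct, needs_extension, needs_choice


def build_table(direct, needs_choice, choose):
    """Build the table of equivalences, applying one consistent rule."""
    table = dict(direct)
    for feature, candidates in needs_choice.items():
        table[feature] = choose(feature, candidates)
    return table


# Hypothetical scheme X with three features.
x_features = ["para", "ital", "marginal-gloss"]
equivalences = {"para": ["p"], "ital": ["emph", "hi", "foreign"]}

direct, needs_extension, needs_choice = classify_features(
    x_features, equivalences)
table = build_table(direct, needs_choice,
                    lambda feature, candidates: candidates[0])
print(table)
print("extension required for:", needs_extension)
```

Features reported as requiring an extension are exactly those for
which information would be lost if the target scheme could not be
extended.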
Similar procedures may be followed where the TEI scheme is to be
used as an interlanguage for interchange among several different
sites or applications, although the degree of TEI-conformance may
vary.
In the simplest case, where two sites or individuals exchanging
texts know each other and know or can inquire what equipment the
other is using, these Guidelines serve primarily as documentation
for a file format, which can be referred to without actually being
transmitted together with the file. In the
general case, where sender and recipient cannot communicate such
information, a stricter degree of TEI conformance will be required
for loss-free interchange.
The rules defining such strict conformance to the Guidelines are
given in some detail in chapter 28 Conformance. The interchange
format defined there requires that an electronic text:
1. adhere to the SGML declaration defined in these Guidelines
(when using SGML), or to the XML syntax rules (which imply a
particular SGML declaration). These constructs are further
discussed in chapter 2 A Gentle Introduction to XML.
2. conform to the document type declarations defined in these
Guidelines, unless modified or extended as described in chapter 29
Modifying and Customizing the TEI DTD. These constructs are further
discussed in chapter 2 A Gentle Introduction to XML. [3]
3. provide external documentation as described in chapter 27 Tag
Set Documentation for all elements not defined in these Guidelines,
specifying a formal name (generic identifier) and a corresponding
full natural-language name, describing its meaning and usage,
specifying its legal content and also any attributes it may use.
4. adhere to the requirements of the TEI header in providing
bibliographic identification of the text and description of the
encoding practices used (as described in chapter 5 The TEI
Header).
Note that the interchange format makes no formal restriction on
the character set to be used in interchange, as this will depend on
the medium of interchange and the local character sets in use by
sender and receiver. For further information, refer to chapter 30
Rules for Interchange.
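Some of the interchange requirements can be checked mechanically
before a text is sent. The sketch below covers only two of them, in
a deliberately partial way: that the text obeys XML syntax rules and
that it carries a TEI header containing a file description. It does
not validate against the TEI DTD, and the function name preflight
and the sample documents are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def preflight(document: str) -> list:
    """Return a list of problems found; an empty list means none."""
    try:
        root = ET.fromstring(document)
    except ET.ParseError as err:
        # The text does not even obey the XML syntax rules.
        return [f"not well-formed XML: {err}"]
    problems = []
    header = root.find("teiHeader")
    if header is None:
        problems.append("no teiHeader child on the root element")
    elif header.find("fileDesc") is None:
        problems.append("teiHeader has no fileDesc")
    return problems


with_header = (
    "<TEI.2><teiHeader><fileDesc/></teiHeader>"
    "<text><body><p>Sample.</p></body></text></TEI.2>"
)
without_header = "<TEI.2><text><body><p>Sample.</p></body></text></TEI.2>"

print(preflight(with_header))
print(preflight(without_header))
```

Conformance to the document type declarations themselves requires a
validating parser and the DTD files, which this sketch does not
attempt.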
1.2.2.3 Use for Local Processing
Machine-readable text can be manipulated in many ways; some users:
• edit texts (e.g. word processors, syntax-directed editors)
• edit, display, and link texts in hypertext systems
• format and print texts using desktop publishing systems, or
batch-oriented formatting programs
• load texts into free-text retrieval databases or conventional
databases
• unload texts from databases as search results or for export to
other software
• search texts for words or phrases
• perform content analysis on texts
• collate texts for critical editions
• scan texts for automatic indexing or similar purposes
• parse texts linguistically
• analyze texts stylistically
• scan verse texts metrically
• link text and images
These applications cover a wide range of likely uses but are by
no means exhaustive. The aim has been to make the TEI Guidelines
useful for encoding the same texts for different purposes. We have
avoided anything which would restrict the use of the text for other
applications. We have also tried not to omit anything essential to
any single application.
[3] These Guidelines do not provide any other schema (XML Schema,
RELAX NG, etc.) corresponding to the DTDs, although such may be
provided at a later time.
1.3 Historical Background
The Text Encoding Initiative grew out of a planning conference
sponsored by the Association for Computers and the Humanities (ACH)
and funded by the U.S. National Endowment for the Humanities (NEH),
which was held at Vassar College in November 1987. At this
conference some thirty representatives of text archives, scholarly
societies, and research projects met to discuss the feasibility of a
standard encoding scheme and to make recommendations for its scope,
structure, content, and drafting. During the conference, the
Association for Computational Linguistics and the Association for
Literary and Linguistic Computing agreed to join ACH as sponsors of
a project to develop the Guidelines. The outcome of the conference
was this set of principles, which determined the further course of
the project.
1. The guidelines are intended to provide a standard format for
data interchange in humanities research.
2. The guidelines are also intended to suggest principles for
the encoding of texts in the same format.
3. The guidelines should
  i. define a recommended syntax for the format,
  ii. define a metalanguage for the description of text-encoding
  schemes,
  iii. describe the new format and representative existing schemes
  both in that metalanguage and in prose.
4. The guidelines should propose sets of coding conventions
suited for various applications.
5. The guidelines should include a minimal set of conventions for
encoding new texts in the format.
6. The guidelines are to be drafted by committees on
  i. text documentation
  ii. text representation
  iii. text interpretation and analysis
  iv. metalanguage definition and description of existing and
  proposed schemes,
coordinated by a steering committee of representatives of the
principal sponsoring organizations.
7. Compatibility with existing standards will be maintained as
far as possible.
8. A number of large text archives have agreed in principle to
support the guidelines in their function as an interchange format,
and have (since the publication of the prior edition) actually
done so. We continue to encourage funding agencies to support
development of tools to facilitate this interchange.
9. Conversion of existing machine-readable texts to the new
format involves the translation of their conventions into the syntax
of the new format. No requirements will be made for the addition of
information not already coded in the texts.
In the course of the work, some of these goals assumed greater,
some lesser importance; some proved easier, some harder to achieve.
The document in hand does define a standard form for the interchange
of textual material, and adumbrate principles for the creation of
new electronic texts. The only metalanguage used, however, is that
common to XML and SGML, and no formal definitions are given for
other encoding schemes. These Guidelines do define a minimal set of
conventions for text encoding (i.e. those elements classed as
recommended or required), though few researchers will be satisfied
to encode only what is required or recommended here, since the set
of required and recommended elements is rather small. This document
does not, however, define (at least not explicitly) “sets of coding
conventions suited for various applications”, since consensus on
suitable conventions for different applications proved elusive;
this remains a goal for future work.
1.3.1 Origin and Development of the TEI
The Text Encoding Initiative proper began in June 1988 with funding
from the NEH, soon followed by further funding from the Commission
of the European Communities, the Andrew W. Mellon Foundation, and
the Social Science and Humanities Research Council of Canada. Four
working committees, composed of distinguished scholars and
researchers from both Europe and North America, were named to deal
with problems of text documentation (resulting largely in chapter 5
The TEI Header), text representation, text analysis and
interpretation (together responsible for most of what has become
parts II, III, and IV), and metalanguage and syntax issues (largely
responsible for part VI).
A first draft version (1.0) of the Guidelines was distributed in
July 1990 under the title Guidelines for the Encoding and
Interchange of Machine-Readable Texts, with the TEI document number
TEI P1. With minor changes and corrections, this version was
reprinted as version 1.1 in November 1990.
Extensive public comment and further work on areas not covered in
version 1 resulted in the drafting of a revised version, TEI P2,
distribution of which began in April 1992. This version includes
substantial amounts of new material, resulting from work carried out
by several specialist working groups, set up in 1990 and 1991 to
propose extensions and revisions to the text of P1. The overall
organization, both of the draft itself and of the scheme it
describes, was entirely revised and reorganized in response to
public comment on the first draft.
In June 1993, the Advisory Board of the Text Encoding Initiative
met to review the current state of the Guidelines, and recommended
the formal publication of the work done to that time. That version
of the TEI Guidelines, TEI P3, represents a further revision of all
chapters published under the document number TEI P2, and the
addition of further chapters. Although subject to revision and
amendment on the basis of practical experience and public
discussion, that version of the Guidelines was published in May of
1994 without the label ‘draft’, and marks the conclusion of the
initial development work.
In February of 1998 the World Wide Web Consortium issued a final
Recommendation for the Extensible Markup Language, XML. XML was
developed as a far simpler subset of SGML, for many of the same
reasons as the TEI interchange subset, and taking a very similar
approach. Several TEI participants contributed heavily to the
development of XML, most notably XML’s senior co-editor C. M.
Sperberg-McQueen, who until recently served as the North American
co-editor for these Guidelines.
Following the ratification of XML and its rapid adoption, many
projects found need for an updated version of these Guidelines which
supported XML unambiguously. For example, because SGML element names
are normally case-insensitive while XML ones are not, a decision had
to be made on the normative case for TEI element names in XML. The
TEI editors, with abundant assistance from others who have developed
and used TEI, developed an update plan, and made tentative decisions
on relevant syntactic issues. With the formation of the TEI
Consortium in 2001, and with generous funding from the National
Endowment for the Humanities, a formal update was undertaken. The
goals of this update were to revise both the text and the DTDs of
the scheme in a way compatible with the use of either SGML or XML.
The present edition is the first public draft of that update; the
present editors hope that it maintains the quality and usefulness of
P3, and solicit comments, suggestions, and other input wherever it
does not.
1.3.2 Future Developments
Work on areas still not satisfactorily covered in this manual will
continue, and resulting recommendations will be issued as
supplements to the published Guidelines. Work is expected to
continue in at least the following areas:
• linguistic description and grammatical annotation
• historical analysis and interpretation
• base tag sets for further document types
• manuscript analysis and physical description of text
The encoding recommended by this document may be used without
fear that future versions of the TEI scheme will be inconsistent
with it in fundamental ways. The TEI will be sensitive, in revising
these Guidelines, to the possible problems which revision might pose
for those who are already using this version of the Guidelines.
Wherever consistent with the long-term goals of the project,
consistency with this version will be preserved in future
revisions.
2 A Gentle Introduction to XML
As originally published in previous editions of the Guidelines, this
chapter provided a gentle introduction to ‘just enough’ SGML for
anyone to understand how the TEI used that standard. Since then,
the Gentle Guide seems to have taken on a life of its own
independent of the Guidelines, having been widely distributed (and
flatteringly imitated) on the web. In revising it for the present
draft, the editors have therefore felt free to reduce considerably
its discussion of SGML-specific matters, in favour of a simple
presentation of how the TEI uses XML.
The encoding scheme defined by these Guidelines may be formulated
either as an application of the ISO Standard Generalized Markup
Language (SGML)4 or of the more recently developed W3C Extensible
Markup Language (XML)5. Both SGML and XML are widely used for the
definition of device-independent, system-independent methods of
storing and processing texts in electronic form; XML being in fact a
simplification or derivation of SGML. In the present chapter we
introduce informally the basic concepts underlying such markup
languages and attempt to explain to the reader encountering them for
the first time how they are actually used in the TEI scheme. Except
where the two are explicitly distinguished, references to XML in what
follows may be understood to apply equally well to the TEI usage of
SGML. For a more technical account of TEI practice see chapter 28
Conformance; for a more technical description of the subset of SGML
used by the TEI encoding scheme, see chapter 39 Formal Grammar for
the TEI-Interchange-Format Subset of SGML.
XML is an extensible markup language used for the description of
marked-up electronic text. More exactly, XML is a metalanguage, that
is, a means of formally describing a language, in this case, a markup
language. Historically, the word markup has been used to describe
annotation or other marks within a text intended to instruct a
compositor or typist how a particular passage should be printed or
laid out. Examples include wavy underlining to indicate boldface,
special symbols for passages to be omitted or printed in a particular
font and so forth. As the formatting and printing of texts was
automated, the term was extended to cover all sorts of special codes
inserted into electronic texts to govern formatting, printing, or
other processing.
Generalizing from that sense, we define markup, or (synonymously)
encoding, as any means of making explicit an interpretation of a
text. Of course, all printed texts are implicitly encoded (or marked
up) in this sense: punctuation marks, use of capitalization,
disposition of letters around the page, even the spaces between
words, might be regarded as a kind of markup, the function of which
is to help the human reader determine where one word ends and another
begins, or how to identify gross structural features such as headings
or simple syntactic units such as dependent clauses or sentences.
Encoding a text for computer processing is in principle, like
transcribing a manuscript from scriptio continua,6 a process of
making explicit what is conjectural or implicit, a process of
directing the user as to how the content of the text should be (or
has been) interpreted.
By markup language we mean a set of markup conventions used together
for encoding texts. A markup language must specify what markup is
allowed, what markup is required, how markup is to be distinguished
from text, and what the markup means. XML provides the means for
doing the first three; documentation such as these Guidelines is
required for the last.
The present chapter attempts to give an informal introduction to
those parts of XML of which a proper understanding is necessary to
make best use of these Guidelines. The interested reader should also
consult one or more of the dozens of excellent introductory textbooks
or web sites now available on the subject.
4 International Organization for Standardization, ISO 8879:
Information processing – Text and office systems – Standard
Generalized Markup Language (SGML), ([Geneva]: ISO, 1986).
5 World Wide Web Consortium: Extensible Markup Language (XML) 1.0,
available from http://www.w3.org/TR/REC-xml
6 In the “continuous writing” characteristic of manuscripts from the
early classical period, words are written continuously with no
intervening spaces or punctuation.
2.1 What’s special about XML?
Three characteristics of XML seem to us to make it unlike other
markup languages:
• its emphasis on descriptive rather than procedural markup;
• its document type concept;
• its independence of any one hardware or software system.
These three aspects are discussed briefly below, and then in more
depth in sections 2.3 XML structures and 2.7 Entities.
The markup language with which XML is most frequently compared,
however, is HTML, the language in which web pages had always been
written until XML began to replace it. Compared with HTML, XML has
some other important characteristics:
• XML is extensible: it does not contain a fixed set of tags
• XML documents must be well-formed according to a defined syntax,
and may be formally validated
• XML focuses on the meaning of data, not its presentation
2.1.1 Descriptive markup
In a descriptive markup system, the markup codes used do little more
than categorize parts of a document. Markup codes such as <para> or
\end{list} simply identify a portion of a document and assert of it
that “the following item is a paragraph,” or “this is the end of the
most recently begun list,” etc. By contrast, a procedural markup
system defines what processing is to be carried out at particular
points in a document: “call procedure PARA with parameters 1, b and x
here” or “move the left margin 2 quads left, move the right margin 2
quads right, skip down one line, and go to the new left margin,” etc.
In XML, the instructions needed to process a document for some
particular purpose (for example, to format it) are sharply
distinguished from the descriptive markup which occurs within the
document. They are collected outside the document in separate
procedures or programs, and are usually expressed in a distinct
document called a stylesheet, though it may do much more than simply
define the rendition or visual appearance of a document.7
With descriptive instead of procedural markup the same document can
readily be processed in many different ways, using only those parts
of it which are considered relevant. For example, a content analysis
program might disregard entirely the footnotes embedded in an
annotated text, while a formatting program might extract and collect
them all together for printing at the end of each chapter. Different
kinds of processing can be carried out with the same part of a file.
For example, one program might extract names of persons and places
from a document to create an index or database, while another,
operating on the same text, but using a different stylesheet, might
print names of persons and places in a distinctive typeface.
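
The point about processing one descriptively marked-up text in several ways can be illustrated with a minimal sketch. It uses Python's standard xml.etree.ElementTree module; the element names person and place, the sample sentence, and both function names are assumptions made for illustration and are not TEI markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample text: the element names are illustrative, not TEI.
SAMPLE = (
    "<p><person>Rosalind</person> travels from "
    "<place>the court</place> to <place>Arden</place> with "
    "<person>Celia</person>.</p>"
)


def build_index(xml_text):
    """Collect the names of persons and places, grouped by element type."""
    root = ET.fromstring(xml_text)
    index = {"person": [], "place": []}
    for element in root.iter():
        if element.tag in index:
            index[element.tag].append(element.text)
    return index


def render_with_emphasis(xml_text):
    """Return the running text with every tagged name wrapped in asterisks."""
    root = ET.fromstring(xml_text)
    parts = [root.text or ""]
    for child in root:
        parts.append("*" + (child.text or "") + "*")
        parts.append(child.tail or "")
    return "".join(parts)


print(build_index(SAMPLE))
print(render_with_emphasis(SAMPLE))
```

Neither function changes the document: the first ignores everything except the tagged names, while the second keeps all the text and uses the same tags only to decide what to emphasize.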
2.1.2 Types of document
A second key aspect of XML is its notion of a document type:
documents are regarded as having types, just as other objects
processed by computers do. The type of a document is formally defined
by its constituent parts and their structure. The definition of a
‘report’, for example, might be that it consisted of a ‘title’ and
possibly an ‘author’, followed by an ‘abstract’ and a sequence of one
or more ‘paragraphs’. Anything lacking a title, according to this
formal definition, would not formally be a report, and neither would
a sequence of paragraphs followed by an abstract, whatever other
report-like characteristics these might have for the human reader.
If documents are of known types, a special purpose program (called a
parser), once provided with an unambiguous definition of a document’s
type, can check that any document claiming to be of that type does in
fact conform to the specification. A parser can check that all and
only elements specified for a particular document type are present,
that they are combined in appropriate ways, correctly ordered and so
forth. More significantly, different documents of the same type can
be processed in a uniform way. Programs can be written which take
advantage of the knowledge encapsulated in the document structure
information, and which can thus behave in a more ‘intelligent’
fashion.
7 We do not here discuss in any detail the ways that a style sheet
can be used or defined, nor do we discuss the increasingly popular
W3C Stylesheet Languages. See http://www.w3.org/TR/xsl for the
Extensible Stylesheet Language (XSL), and http://www.w3.org/TR/xslt
for the XSL Transformations (XSLT) Language.
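
The ‘report’ definition given above can be turned into a small executable check. This is only a sketch under stated assumptions: the element names (report, title, author, abstract, paragraph) are invented from the prose, and the regular expression merely stands in for the kind of formal definition a real parser would be given.

```python
import re
import xml.etree.ElementTree as ET

# Assumed content model, paraphrased from the prose: a title, an optional
# author, an abstract, then one or more paragraphs, in that order.
REPORT_MODEL = re.compile(r"title (author )?abstract (paragraph )+")


def is_report(xml_text):
    """Return True if the root is <report> and its children fit the model."""
    root = ET.fromstring(xml_text)
    if root.tag != "report":
        return False
    # Spell out the sequence of child element names, e.g. "title abstract ".
    child_names = "".join(child.tag + " " for child in root)
    return REPORT_MODEL.fullmatch(child_names) is not None


good = ("<report><title>T</title><abstract>A</abstract>"
        "<paragraph>P</paragraph></report>")
no_title = "<report><abstract>A</abstract><paragraph>P</paragraph></report>"
wrong_order = ("<report><title>T</title><paragraph>P</paragraph>"
               "<abstract>A</abstract></report>")

for label, text in [("good", good), ("no title", no_title),
                    ("wrong order", wrong_order)]:
    print(label, is_report(text))
```

The two rejected samples correspond to the two cases named in the text: a document lacking a title, and paragraphs followed by an abstract.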
2.1.3 Data independence
A basic design goal of XML is to ensure that documents encoded
according to its provisions can move from one hardware and software
environment to another without loss of information. The two features
discussed so far both address this requirement at an abstract level;
the third feature addresses it at the level of the strings of data
characters of which documents are composed. All XML documents,
whatever language or writing system they employ, use the same
underlying character encoding (that is, the same method of
representing the graphic forms making up a particular writing system
as binary data).8 This encoding is defined by an international
standard,9 which is implemented by a universal character set
maintained by an industry group called the Unicode Consortium, and
known as Unicode;10 this provides a standardised way of representing
any of the thousands of discrete symbols making up the world’s
writing systems, past and present.
For technical and historical reasons which need not concern us, it is
often necessary to translate texts encoded as Unicode into some
smaller or less general encoding scheme. XML uses a general purpose
string substitution mechanism for this purpose, inherited from SGML
(which predates the availability of Unicode). In simple terms, this
mechanism allows for the indirect representation of arbitrary parts
of a document (be they single characters, character strings, or whole
files) within it. One obvious application for this mechanism is to
ensure consistency of nomenclature; another, more significant one, is
to counter the notorious inability of different computer systems to
understand each other’s character sets, or of any one system to
provide all the graphic characters needed for a particular
application. The strings defined by this string-substitution
mechanism are called entities and they are discussed below in section
2.7 Entities.
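
As a brief illustration of string substitution, the sketch below parses a document typed entirely in ASCII that nevertheless contains a non-ASCII character and a named abbreviation. It relies on Python's standard xml.etree.ElementTree module, which expands entities declared in the internal subset; the entity name tei and the sample sentence are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# The source below uses only ASCII characters. "&tei;" is a hypothetical
# entity declared in the internal subset; "&#xEF;" and "&#239;" are
# hexadecimal and decimal character references to the same character.
DOCUMENT = """<!DOCTYPE p [
  <!ENTITY tei "Text Encoding Initiative">
]>
<p>The &tei; publishes these Guidelines; na&#xEF;ve and na&#239;ve are the same word.</p>"""

paragraph = ET.fromstring(DOCUMENT)

# By the time the application sees the text, both kinds of reference
# have been replaced by the strings they stand for.
print(paragraph.text)
```

The entity keeps the name consistent wherever it is used, and the character references let a system limited to a small character set carry a character it cannot represent directly.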
2.2 Textual structure
A text is not an undifferentiated sequence of words, much less of
bytes. For different purposes, it may be divided into many different
units, of different types or sizes. A prose text such as this one
might be divided into sections, chapters, paragraphs, and sentences.
A verse text might be divided into cantos, stanzas, and lines. Once
printed, sequences of prose and verse might be divided into volumes,
gatherings, and pages.
Structural units of this kind are most often used to identify
specific locations or reference points within a text (“the third
sentence of the second paragraph in chapter ten”; “canto 10, line
1234”; “page 412,” etc.) but they may also be used to subdivide a
text into meaningful fragments for analytic purposes (“is the average
sentence length of section 2 different from that of section 5?” “how
many paragraphs separate each occurrence of the word ‘nature’?” “how
many pages?”). Other structural units are more clearly analytic, in
that they characterize a section of a text. A dramatic text might
regard each speech by a different character as a unit of one kind,
and stage directions or pieces of action as units of another kind.
Such an analysis is less useful for locating parts of the text (“the
93rd speech by Horatio in Act 2”) than for facilitating comparisons
between the words used by one character and those of another, or
those used by the same character at different points of the play.
In a prose text one might similarly wish to regard as units of
different types passages in direct or indirect speech, passages
employing different stylistic registers (narrative, polemic,
commentary, argument, etc.), passages of different authorship and so
forth. And for certain types of analysis (most notably textual
criticism) the physical appearance of one particular printed or
manuscript source may be of importance: paradoxically, one may wish
to use descriptive markup to describe presentational features such as
typeface, line breaks, use of whitespace and so forth.
These textual structures overlap with each other in complex and
unpredictable ways. Particularly when dealing with texts as
instantiated by paper technology, the reader needs to be aware of
both the physical organization of the book and the logical structure
of the work it contains. Many great works (Sterne’s Tristram Shandy
for example) cannot be fully appreciated without an awareness of the
interplay between narrative units (such as chapters or paragraphs)
and page divisions. For many types of research, it is the interplay
between different levels of analysis which is crucial: the extent to
which syntactic structure and narrative structure mesh, or fail to
mesh, for example, or the extent to which phonological structures
reflect morphology.
8 See Extensible Markup Language (XML) 1.0, Section 2.2 Characters.
9 ISO/IEC 10646-1993 Information Technology — Universal
Multiple-Octet Coded Character Set (UCS)
10 See http://www.unicode.org/
2.3 XML structures
This section describes the simple and consistent mechanism for the
markup or identification of textual structure provided by XML. It
also describes the methods XML provides for the expression of rules
defining how units of textual structure can meaningfully be combined
in a text.
2.3.1 Elements
The technical term used in XML for a textual unit, viewed as a
structural component, is element. Different types of elements are
given different names, but XML provides no way of expressing the
meaning of a particular type of element, other than its relationship
to other element types. That is, all one can say about an element
called (for instance) <blort> is that instances of it may (or may
not) occur within elements of type <farble>, and that it may (or may
not) be decomposed into elements of type <blortette>. It should be
stressed that XML is entirely unconcerned with the semantics of
textual elements: these are application dependent. It is up to the
creators of XML vocabularies (such as these Guidelines) to choose
intelligible names for the elements they identify and to define their
proper use in text markup. That is the chief purpose of documents
such as the TEI Guidelines. From the need to choose element names
indicative of function comes the technical term for the name of an
element type, which is generic identifier, or GI.
Within a marked up text (a document instance), each element must be
explicitly marked or tagged in some way. This is done by inserting a
tag at the beginning of the element (a start-tag) and another at its
end (an end-tag).11 The start- and end-tag pair are used to bracket
off the element occurrences within the running text, in rather the
same way as different types of parentheses or quotation marks are
used in conventional punctuation. For example, a quotation element in
a text might be tagged as follows:
    ... Rosalind's remarks <quote>This is the silliest stuff
    that ere I heard of!</quote> clearly indicate ...
As this example shows, a start-tag takes the form <quote>, where the
opening angle bracket indicates the start of the start-tag, “quote”
is the generic identifier of the element which is being delimited,
and the closing angle bracket indicates the end of a tag. An end-tag
takes an identical form, except that the opening angle bracket is
followed by a solidus (slash) character, so that the corresponding
end-tag is </quote>.12
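
The sketch below shows how a parser reports the parts just described: the generic identifier, the content bracketed by the tags, and the running text that follows the end-tag. It uses Python's standard xml.etree.ElementTree module. The enclosing p element is an assumption added only so that the fragment has a single outermost element, and the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# The quotation example, wrapped in a hypothetical <p> element so that
# the fragment can be parsed on its own.
FRAGMENT = (
    "<p>... Rosalind's remarks <quote>This is the silliest stuff "
    "that ere I heard of!</quote> clearly indicate ...</p>"
)

root = ET.fromstring(FRAGMENT)
quote = root.find("quote")

generic_identifier = quote.tag   # the name given in the start-tag
content = quote.text             # the characters between the two tags
following_text = quote.tail      # the running text after the end-tag

print(generic_identifier)
print(content)
print(following_text)
```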
2.3.2 Content models: an example
An element may be empty, that is, it may have no content at all, or
it may contain just a sequence of characters with no other elements.
More usually, however, elements of one type will be embedded
(contained entirely) within elements of a different type.
To illustrate this, we will consider a very simple structural model.
Let us assume that we wish to identify within an anthology only
poems, their titles, and the stanzas and lines of which they are
composed. In XML terms, our document type is the anthology, and it
consists of a series of poems. Each poem has embedded within it one
element, a title, and several occurrences of another, a stanza, each
stanza having embedded within it a number of line elements. Fully
marked up, a text conforming to this model might appear as
follows:13
    <anthology>
      <poem><title>The SICK ROSE</title>
        <stanza>
          <line>O Rose thou art sick.</line>
          <line>The invisible worm,</line>
          <line>That flies in the night</line>
          <line>In the howling storm:</line>
        </stanza>
        <stanza>
          <line>Has found out thy bed</line>
          <line>Of crimson joy:</line>
          <line>And his dark secret love</line>
          <line>Does thy life destroy.</line>
        </stanza>
      </poem>
      <!-- more poems go here -->
    </anthology>
11 In SGML (but not in XML) the name and the content model may be
separated by an additional part of the declaration which specifies
‘omission rules’ for the element concerned. These rules state whether
or not start- and end-tags must be present for every occurrence of
the element concerned: as noted above, such tag omission is not
permitted in XML, and is not permitted in the TEI Interchange format.
12 Because the opening angle bracket has this special function in an
XML document, special steps must be taken to use that character for
other purposes (for example, as the mathematical less-than operator);
see further 2.7.2 Entity references; in SGML (but not XML) different
characters may be defined for use as any of the delimiting characters
(the angle brackets, exclamation mark and solidus).
13 The example is taken from William Blake’s Songs of innocence and
experience (1794). The markup is designed for illustrative purposes
and is not TEI-conformant.
It should be stressed that this example does not use the same names
as are proposed for corresponding elements elsewhere in these
Guidelines: the above is not a valid TEI document. It will however
serve as an introduction to the basic notions of XML. Whitespace and
line breaks have been added to the example for the sake of visual
clarity only; they have no particular significance in the XML
encoding itself. Also, the line
    <!-- more poems go here -->
is an XML comment and is not treated as part of the text.
As it stands, the above example is what is known as a well-formed XML
document: to achieve this status, an XML document must obey the
following simple rules:
• there should be a single element (start- and end-tag pair) which
encloses the whole document: this is known as the root element
(<anthology> in our case);
• each element should be completely contained by the root element, or
by an element which is so contained; elements may not partially
overlap one another;
• the tags marking the start and end of each element must always be
present.14
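
The three rules above can be tried out against a parser. The following sketch uses Python's standard xml.etree.ElementTree module, which raises ParseError on input that is not well-formed; the function name and the four sample strings are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False otherwise."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# One sample that obeys the rules, and one breaking each rule in turn.
single_root = "<poem><title>The SICK ROSE</title></poem>"
two_roots = "<poem>one</poem><poem>two</poem>"
overlapping = "<stanza><line>O Rose</stanza></line>"
missing_end_tag = "<stanza><line>O Rose thou art sick.</stanza>"

for label, text in [
    ("single root", single_root),
    ("two roots", two_roots),
    ("overlapping elements", overlapping),
    ("missing end-tag", missing_end_tag),
]:
    print(label, is_well_formed(text))
```

Only the first sample is accepted. Note that this check says nothing about whether the element names make sense together, only that the tags nest properly.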
An XML document which is well-formed can be processed in a number of
useful ways. A simple indexing program could extract only the
relevant text elements in order to make a list of titles, first
lines, or words used in the poem text; a simple formatting program
could insert blank lines between stanzas, perhaps indenting the first
line of each, or inserting a stanza number. Different parts of each
poem could be typeset in different ways. A more ambitious analytic
program could relate the use of punctuation marks to stanzaic and
metrical divisions.15 Scholars wishing to see the implications of
changing the stanza or line divisions chosen by the editor of this
poem can do so simply by altering the position of the tags. And of
course, the text as presented above can be transported from one
computer to another and processed by any program (or person) capable
of making sense of the tags embedded within it with no need for the
sort of transformations and translations needed to move word
processor files around.
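
Two of the programs imagined above, a first-line index and a simple formatter that numbers stanzas and separates them with blank lines, might be sketched as follows. The code uses Python's standard xml.etree.ElementTree module and repeats the anthology markup so that it can run on its own; the function names and the bracketed stanza-number style are assumptions.

```python
import xml.etree.ElementTree as ET

ANTHOLOGY = """<anthology>
  <poem><title>The SICK ROSE</title>
    <stanza>
      <line>O Rose thou art sick.</line>
      <line>The invisible worm,</line>
      <line>That flies in the night</line>
      <line>In the howling storm:</line>
    </stanza>
    <stanza>
      <line>Has found out thy bed</line>
      <line>Of crimson joy:</line>
      <line>And his dark secret love</line>
      <line>Does thy life destroy.</line>
    </stanza>
  </poem>
</anthology>"""


def first_line_index(xml_text):
    """Map each poem's title to the first line of its first stanza."""
    root = ET.fromstring(xml_text)
    return {
        poem.findtext("title"): poem.findtext("stanza/line")
        for poem in root.iter("poem")
    }


def format_poem(poem):
    """Number each stanza and put a blank line between stanzas."""
    blocks = []
    for number, stanza in enumerate(poem.findall("stanza"), start=1):
        lines = [line.text for line in stanza.findall("line")]
        blocks.append("\n".join([f"[{number}]"] + lines))
    return "\n\n".join(blocks)


print(first_line_index(ANTHOLOGY))
print(format_poem(ET.fromstring(ANTHOLOGY).find("poem")))
```

Both functions read the same unchanged text; moving a stanza end-tag in the source would change the formatter's output without any change to the program.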
However, well-formedness alone is not enough for the full range of
what might be useful in marking up a document. It might well be
useful if, in the process of preparing our digital anthology, a
computer system could check some basic rules about how stanzas,
lines, and titles can sensibly co-occur in a document. It would be
even more useful if the system could check that stanzas are always
labelled <stanza> and not occasionally <canto> or <Stanza>. An XML
document in which such rules have been checked is technically known
as a valid document, and the ability to perform such validation is
one of the key advantages of using XML. To carry this out, some way
of formally stating the criteria for successful validation is
necessary: in XML this formal statement may be provided by an
additional document known as a document type declaration (DTD) or by
an XML schema.16
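
To give a feel for what such checking adds beyond well-formedness, here is a hand-written sketch in Python. It is not a DTD or schema validator: the table of permitted child elements is an assumption standing in for the formal statement described above, and it ignores ordering and repetition entirely. It uses only the standard xml.etree.ElementTree module.

```python
import xml.etree.ElementTree as ET

# Assumed rules: which element names may appear directly inside which.
ALLOWED_CHILDREN = {
    "anthology": {"poem"},
    "poem": {"title", "stanza"},
    "stanza": {"line"},
    "title": set(),
    "line": set(),
}


def structure_errors(xml_text):
    """Return a message for every element that appears where it may not."""
    errors = []
    root = ET.fromstring(xml_text)
    if root.tag != "anthology":
        errors.append(f"root element is <{root.tag}>, expected <anthology>")
    for parent in root.iter():
        allowed = ALLOWED_CHILDREN.get(parent.tag, set())
        for child in parent:
            if child.tag not in allowed:
                errors.append(
                    f"<{child.tag}> not allowed inside <{parent.tag}>"
                )
    return errors


consistent = ("<anthology><poem><title>T</title>"
              "<stanza><line>L</line></stanza></poem></anthology>")
# Well-formed, but one stanza is tagged with a differently cased name.
inconsistent = ("<anthology><poem><title>T</title>"
                "<Stanza><line>L</line></Stanza></poem></anthology>")

print(structure_errors(consistent))
print(structure_errors(inconsistent))
```

Both samples are well-formed, so a parser alone accepts them; only the rule table reveals that the second uses a name the anthology was never meant to contain.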
14 This is not strictly true for empty elements, for which start- and
end-tags can be combined, as further discussed below.
15 Note that this simple example has not addressed the problem of
marking elements such as sentences explicitly; the implications of
this are discussed below in section 2.5 Complicating the issue.
16 The DTD language described in the remainder of this section is
neither the only way of representing such criteria, nor the most
powerful. One important alternative is provided by another W3C
Recommendation: the XML Schema language
(http://www.w3.org/XML/Schema); another is provided by the OASIS
Committee’s specification for Relax NG
(http://www.oasis-open.org/committees/relax-ng/). It is highly
probable that future releases of these Guidelines will use such a
language, in preference to, or as well as, a DTD.
2.4 Validating a document’s structure
Rules such as those informally stated above are the first stage in
the creation of a formal specification for the structure of an XML
document, or document type declaration, usually abbreviated to DTD.
In creating a DTD, the document designer may be as lax or as
restrictive as the occasion warrants. A balance must be struck
between the convenience of following simple rules and the complexity
of handling real texts. This is particularly the c