The use of SGML and XML at the Publications Office
Dr. Holger Bagola
Dir A – Cell “Methods and Development – Formats”
Jan 14, 2016
Table of contents
• Historical overview
• Formex
• Other areas of XML usage
• Conclusion
Historical overview
• Tasks of the Publications Office
• Archiving of legislative publications
• First steps in SGML
• Migration to XML
• Basic advantage: availability of tools
Formex (1)
• Basic principles
  – XML Schema instead of DTD
  – One single schema
  – Number of root elements: 12 instead of 30
  – Number of elements: about 350 instead of 1200
  – Distinction between semantic and physical markup
Formex (2)
ARTICLE (TI.ARTICLE, (PARAG+ | ALINEA+))
TI.ARTICLE (#PCDATA)
PARAG (NO.PARAG, ALINEA+)
NO.PARAG (#PCDATA)
ALINEA ((#PCDATA | NOTE | HT | FT)* | (P | LIST | TABLE)+)
. . .
Blue: semantic markup
Red: physical markup
Formex (3)
• Table model
  – Analysis of CALS, HTML, Formex v. 3
  – Choice:
    • Model close to HTML (top-down approach, nested tables)
    • Maintenance of semantic information such as in Formex v. 3
Formex (4)
• Footnotes
  – Distinction between notes in text and notes in tables, for readability and production simplicity
  – Insertion of text notes into the surrounding text
  – ID/IDREF to signal identical footnotes
  – Numbering is a matter of presentation
  – Table notes assembled at the top of the table
Formex (5)
• Quotations
  – Structured quotations vs. ‘#PCDATA’ quotations
  – Elements signaling start and end of a quotation (quotation marks)
  – Element with the function of a container for structured quotations
Formex (6)
Example:
Article 2
In article 1(2) of regulation (EC) 1234/94 the word ‘car’ is replaced by ‘bus’.
Article 6 of the same regulation is replaced by the following text:
‘Article 6
This is the new text of article 6.’
Formex (7)
Example:
<ARTICLE IDENTIFIER="002">
  <TI.ARTICLE>Article 2</TI.ARTICLE>
  <ALINEA>In article 1(2) of regulation (EC) 1234/94 the word <QUOT.START ID="QS0001" REF.END="QE0001" CODE="2018"/>car<QUOT.END ID="QE0001" REF.START="QS0001" CODE="2019"/> is replaced by <QUOT.START ID="QS0002" REF.END="QE0002" CODE="2018"/>bus<QUOT.END ID="QE0002" REF.START="QS0002" CODE="2019"/>.</ALINEA>
  <ALINEA>
    <P>Article 6 of the same regulation is replaced by the following text:</P>
    <QUOT.S>
      <ARTICLE IDENTIFIER="006">
        <TI.ARTICLE><QUOT.START ID="QS0003" REF.END="QE0003" CODE="2018"/>Article 6</TI.ARTICLE>
        <ALINEA>This is the new text of article 6.<QUOT.END ID="QE0003" REF.START="QS0003" CODE="2019"/></ALINEA>
      </ARTICLE>
    </QUOT.S>
  </ALINEA>
</ARTICLE>
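Because each QUOT.START points to its QUOT.END (and vice versa) through ID and REF attributes, the pairing can be checked mechanically. The following is a minimal sketch in Python, assuming only the element and attribute names shown in the example; the function name and the shortened sample are illustrative and not part of the Formex tooling.

```python
import xml.etree.ElementTree as ET

# Shortened sample based on the example slide.
SAMPLE = """<ARTICLE IDENTIFIER="002">
  <TI.ARTICLE>Article 2</TI.ARTICLE>
  <ALINEA>the word <QUOT.START ID="QS0001" REF.END="QE0001" CODE="2018"/>car<QUOT.END ID="QE0001" REF.START="QS0001" CODE="2019"/> is replaced.</ALINEA>
</ARTICLE>"""


def unmatched_quotations(xml_text):
    """Return the sorted IDs of quotation markers whose cross-reference is broken."""
    root = ET.fromstring(xml_text)
    # Map each marker ID to the ID it claims as its partner.
    starts = {e.get("ID"): e.get("REF.END") for e in root.iter("QUOT.START")}
    ends = {e.get("ID"): e.get("REF.START") for e in root.iter("QUOT.END")}
    problems = []
    for start_id, ref_end in starts.items():
        # The referenced end marker must exist and point back to this start.
        if ends.get(ref_end) != start_id:
            problems.append(start_id)
    for end_id, ref_start in ends.items():
        # The referenced start marker must exist and point back to this end.
        if starts.get(ref_start) != end_id:
            problems.append(end_id)
    return sorted(problems)
```

An empty result means every start marker has a matching end marker, including pairs that span several elements, as in the structured quotation of Article 6.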
Formex (8)
• Splitting large documents
  – Fragmentation by defining inclusions for the main document
  – Secondary instances referencing the inclusions by means of the XML entity mechanism
  – Inclusions are not necessarily valid XML instances on their own
Formex (9)
main.xml
<?xml version="1.0"?>
<doc>
  <ti>title</ti>
  <chap no="1">
    <incl ref="frag-1.frg"/>
  </chap>
</doc>
frag-1.frg
<text>…</text>
<text>…</text>
container.xml
<?xml version="1.0"?>
<!DOCTYPE frag [
  <!ENTITY cnt SYSTEM "frag-1.frg">
]>
<frag>&cnt;</frag>
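The fragment file above has two top-level elements, so it cannot be parsed as a document by itself; the container supplies the single root. The sketch below illustrates that idea in Python. It is an assumption-laden simplification: it wraps the fragment text in a root element by string concatenation instead of using the external entity mechanism shown in container.xml, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def parse_fragment(fragment_text, wrapper="frag"):
    """Parse a multi-rooted fragment by giving it a single container element.

    This mimics the role of container.xml with plain string wrapping;
    it does not resolve external entities.
    """
    wrapped = f"<{wrapper}>{fragment_text}</{wrapper}>"
    return ET.fromstring(wrapped)


# Fragment with two sibling elements, as in frag-1.frg.
root = parse_fragment("<text>first</text><text>second</text>")
print([child.tag for child in root])
```

Parsing the same fragment text without the wrapper raises `xml.etree.ElementTree.ParseError`, which is why the inclusions need a secondary instance before they can be processed with standard XML tools.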
Formex (10)
• Character set
  – OJ publications in 20 (21) languages
  – Different alphabets
  – International character set definition: Unicode (UTF-8)
  – Definition of allowed character ranges
  – Special font ‘EU-Albertina’
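A definition of allowed character ranges amounts to a whitelist of Unicode code points that every delivered text is checked against. A minimal sketch in Python follows; the ranges listed are illustrative assumptions only and are not the ranges defined in the Formex specification.

```python
# Hypothetical whitelist of inclusive code point ranges; the real Formex
# ranges are defined in the specification and are not reproduced here.
ALLOWED_RANGES = [
    (0x0020, 0x007E),  # Basic Latin, printable characters
    (0x00A0, 0x017F),  # Latin-1 Supplement and Latin Extended-A
    (0x0370, 0x03FF),  # Greek
    (0x2018, 0x2019),  # single quotation marks
]


def disallowed_characters(text):
    """Return (offset, character) pairs for characters outside ALLOWED_RANGES."""
    return [
        (offset, ch)
        for offset, ch in enumerate(text)
        if not any(low <= ord(ch) <= high for low, high in ALLOWED_RANGES)
    ]
```

A check of this kind reports the position of each offending character, which makes it easy to trace problems back to a specific language version.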
Formex (11)
• Meta-data
  – OJ publications are composed of different levels:
    • Publication
    • Document
    • ‘Contents’
  – Meta-data separated according to these levels
Formex (12)
[Diagram: levels of an OJ publication]
Publication: meta-data concerning the publication; structure of the publication with references to documents
Document (two shown): meta-data for the document; references to components
Contents: main part (001), Annex 1 (001.001), Annex 2 (001.002), main part (002)
ProCat
Formex (13)
• Meta-data (continued)
  – Extraction of meta-data by means of automatic processes (pre-notices)
  – Extension of pre-notices by juridical analysis
  – Availability of notices in ProCat for other productions (Celex) and projects
Formex (14)
• Final remark on Formex specifications
  – Only a few complete production chains from the author to the printer
  – Focus on publication of the Official Journal
Formex (15)
• Validation of Formex deliveries
  – In-depth validation necessary
  – Automatic procedures
  – Manual procedures
Formex (16)
• Validation of Formex deliveries (continued)
  – Automatic procedures
    • Control of filename conventions
    • Parsing of various components
    • Control of completeness
    • Execution of additional validation rules
    • Comparison of contents between Formex and PDF
Report (XML instance)
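The first two automatic checks and the XML report can be pictured with a short Python sketch. Everything specific in it is an assumption made for illustration: the filename pattern, the report element names (`report`, `file`, `error`) and the function name do not come from the Formex validation tools.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical filename convention, used only for illustration.
FILENAME_PATTERN = re.compile(r"^[A-Z0-9_]+\.xml$")


def validate_delivery(files):
    """Check a delivery and return the findings as a <report> element.

    `files` maps each filename to its XML text.
    """
    report = ET.Element("report")
    for name, content in sorted(files.items()):
        entry = ET.SubElement(report, "file", name=name)
        # Check 1: control of the filename convention.
        if not FILENAME_PATTERN.match(name):
            ET.SubElement(entry, "error", check="filename").text = "unexpected filename"
        # Check 2: the component must parse as well-formed XML.
        try:
            ET.fromstring(content)
        except ET.ParseError as exc:
            ET.SubElement(entry, "error", check="parsing").text = str(exc)
    # Record the total number of findings on the report root.
    report.set("errors", str(len(report.findall(".//error"))))
    return report
```

Calling `ET.tostring(validate_delivery(files), encoding="unicode")` serializes the report, so the result of the automatic step is itself an XML instance that later steps can read.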
Formex (17)
• Validation of Formex deliveries (continued)
  – Manual procedures
    • Verification of the report generated by the automatic validation procedure
    • Control of the use of Formex specifications in all language versions
Report (XML instance) = basis for archiving or rejection
Formex (18)
• Conversion of Formex v. 3 into Formex v. 4
  – Conversion of character set (ISO 2020 to UTF-8)
  – Transformation of SGML instances into well-formed XML instances
  – Extraction of tables and conversion into an intermediate model
  – Generation of meta-data levels
  – Conversion of old elements and generation of new elements
  – Validation of the results
Formex (19)
• Specifications:
http://formex.publications.eu.int/
Other areas of XML usage (1)
• Index of OJ publications
  – Biannual issues
  – Monthly issues
  – Extraction from Celex/ProCat
  – Transformation into PDF by means of XSLT and XSL FO (biannual version only)
Other areas of XML usage (2)
• Consolidation of legal documents
  – Mainly based on Formex
  – Additional administrative data in XML
  – Relations between historical levels
    • Description of the composition of a given historical level
    • Concordance of information on numbering schemes (articles, …) for each level
Other areas of XML usage (3)
• Conversion to RTF
  – Compatibility with other EU services
  – Input in SGML or XML
  – Results with LegisWrite templates
Other areas of XML usage (4)
[Diagram: conversion chain to RTF]
SGML instance (Formex v. 3) → Character conversion → Transformation into well-formed XML → Transformation into internal XML format → Transformation into RTF (LegisWrite) → Output in RTF (LegisWrite)
XML instance (Formex v. 4): alternative input to the chain
Other areas of XML usage (5)
• Production of the EU budget
  – Creation and maintenance of a common central repository (XML)
  – Markup of modified elements during the decision process in working language
  – Translation only of parts modified
  – Update of repository after publication
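Marking modified elements in the repository makes it possible to select exactly the parts that need translation. The sketch below shows the principle in Python; the element and attribute names (`budget`, `line`, `id`, `modified`) and the sample content are hypothetical and do not reflect the actual budget repository format.

```python
import xml.etree.ElementTree as ET

# Hypothetical repository extract; names and content are illustrative only.
REPOSITORY = """<budget>
  <line id="B1" modified="yes">Updated remark on staff expenditure</line>
  <line id="B2">Unchanged remark</line>
  <line id="B3" modified="yes">New remark on buildings</line>
</budget>"""


def parts_to_translate(xml_text):
    """Return {id: text} for the elements marked as modified."""
    root = ET.fromstring(xml_text)
    return {
        element.get("id"): element.text
        for element in root.iter("line")
        if element.get("modified") == "yes"
    }
```

Unmarked elements are skipped, so their existing translations in the repository can be reused unchanged.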
Other areas of XML usage (6)
[Diagram: production of the EU budget. Actors and components: Budget services, Translation service, Publications Office, Budget XML repository, Printer, Formex archive. Phases: pre-printing, post-printing]
Other areas of XML usage (7)
• ‘Secondary legislation’
  – Publication of legislation in force in ‘new’ languages
  – XML production on the basis of the Formex archive
  – Transformation of translated input
  – Transformation of the Formex instance from SGML into XML
  – Merging of XML instances
Other areas of XML usage (8)
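[Diagram: production of ‘secondary legislation’]
Inputs: Word document, Formex archive
Processing steps shown: Conversion into XML, Extraction of text, Conversion into XML, Extraction of skeleton, Merging skeleton & text, Simplify structure
Outputs: Publication, ProCat, Celex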
Other areas of XML usage (9)
• European document repository
  – TIFF of publications
  – PDF of publications
  – Formex instances of OJ publications
  – Exchange of information by XML messages
Other areas of XML usage (10)
• Publication of calls for tender (OJ-S)
  – Input in different electronic formats
  – Harmonization in XML
  – Updating database TED
  – Production of CD-ROM version
Conclusion
• Difficult start with SGML
• Successful use of XML as well as of other standards such as XSLT/XPath, XSL FO
• Powerful possibilities of re-use of XML instances
• How to profit from our experiences?
Proposal for technical solution
• An example: a regulation in the European legislative context and a ‘Verordnung’ in German legislation
• Evident structural differences
• Evident common structural objects
Differences and common objects (1)
• EU regulation
  – Title
  – Preamble
    • Citations
    • Recitals
  – Enacting terms
    • Articles
      – Article header
        » Numbering
      – Paragraphs or alineas
• German regulation
  – Title
  – Preamble
    • Paragraphs
  – Enacting terms
    • Articles
      – Article header
        » Numbering + text
      – Alineas
Differences and common objects (2)
• EU regulation (continued)
  – Final
    • Applicability
    • Signature
• German regulation (continued)
  – Final
    • Signature
Differences and common objects (3)
• Preamble
  – European model
    PREAMBLE (PREAMBLE.INIT, CITATION+, RECITAL+, PREAMBLE.FINAL)
    PREAMBLE.INIT (P)
    CITATION (P)
    RECITAL (NP)
    PREAMBLE.FINAL (P)
  – German model
    PREAMBLE (P)
Differences and common objects (4)
• Article
  – European model
    ARTICLE (ARTICLE.HEADER, (PARAG+ | ALINEA+))
    ARTICLE.HEADER (#PCDATA)
    PARAG (NO.PARAG, ALINEA+)
    ALINEA (P | LIST)+
  – German model
    ARTICLE (ARTICLE.HEADER, (PARAG+ | ALINEA+))
    ARTICLE.HEADER (NP)
    NP (NO.P, TXT)
    PARAG (NO.PARAG, ALINEA+)
    ALINEA (P | LIST)+
Differences and common objects (5)
• Final
  – European model
    FINAL (APPLICABILITY, SIGNATURE)
    APPLICABILITY (P)
    SIGNATURE (PL.DATE, SIGNATORY)
    PL.DATE (P)
    SIGNATORY (P+)
  – German model
    FINAL (SIGNATURE)
    SIGNATURE (PL.DATE, SIGNATORY)
    PL.DATE (P)
    SIGNATORY (P+)
Differences and common objects (6)
Specific models for European regulation
Specific models for German regulation
Common models for European and German regulation
Differences and common objects (7)
• Common grammar fragment
<!ELEMENT ALINEA (P | LIST)+ >
<!ELEMENT ARTICLE (ARTICLE.HEADER, (ALINEA+ | PARAG+)) >
<!ELEMENT ENACTING.TERMS (ARTICLE+) >
<!ELEMENT ITEM (NP, (P | LIST)) >
<!ELEMENT NO.P (#PCDATA) >
<!ELEMENT NOTE (P+) >
<!ATTLIST NOTE NOTE.ID ID #REQUIRED >
<!ELEMENT NP (NO.P, TXT) >
<!ELEMENT P (#PCDATA | NOTE)* >
<!ELEMENT PARAG (PARAG.NO, ALINEA+) >
<!ELEMENT PARAG.NO (#PCDATA) >
<!ELEMENT PL.DATE (P+) >
<!ELEMENT REGULATION (TITLE, PREAMBLE, ENACTING.TERMS, FINAL) >
<!ATTLIST REGULATION CTRY (DE | EU-EN) #REQUIRED >
<!ELEMENT SIGNATORY (P+) >
<!ELEMENT SIGNATURE (PL.DATE, SIGNATORY) >
<!ELEMENT TITLE (P+) >
<!ELEMENT TXT (#PCDATA | LIST | NOTE)* >
Differences and common objects (8)
• Specific grammar for EU regulation
<!ENTITY % common SYSTEM "regulation-common.dtd">
%common;
<!ELEMENT APPLICABILITY (P) >
<!ELEMENT ARTICLE.HEADER (P) >
<!ELEMENT CITATION (P) >
<!ELEMENT FINAL (APPLICABILITY, SIGNATURE) >
<!ELEMENT PREAMBLE (PREAMBLE.INIT, CITATION+, RECITAL.INIT?, RECITAL+, PREAMBLE.FINAL) >
<!ELEMENT PREAMBLE.FINAL (P) >
<!ELEMENT PREAMBLE.INIT (P) >
<!ELEMENT RECITAL (P | NP) >
<!ELEMENT RECITAL.INIT (P) >
Differences and common objects (9)
• Specific grammar for German regulation
<!ENTITY % common SYSTEM "regulation-common.dtd">
%common;
<!ELEMENT ARTICLE.HEADER (NP) >
<!ELEMENT FINAL (SIGNATURE) >
<!ELEMENT PREAMBLE (P+) >
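The split into one common grammar plus a small specific grammar per jurisdiction can be mirrored in a few lines of Python. This is a sketch for illustration only: it models a subset of the content models from the slides as plain strings, and the dictionary and function names are hypothetical. The overlap check reflects the XML rule that an element type must not be declared more than once.

```python
# Subset of the common content models, copied from the common grammar fragment.
COMMON = {
    "REGULATION": "(TITLE, PREAMBLE, ENACTING.TERMS, FINAL)",
    "SIGNATURE": "(PL.DATE, SIGNATORY)",
    "SIGNATORY": "(P+)",
}

# Subset of the jurisdiction-specific content models.
SPECIFIC = {
    "EU": {
        "ARTICLE.HEADER": "(P)",
        "FINAL": "(APPLICABILITY, SIGNATURE)",
        "APPLICABILITY": "(P)",
    },
    "DE": {
        "ARTICLE.HEADER": "(NP)",
        "FINAL": "(SIGNATURE)",
        "PREAMBLE": "(P+)",
    },
}


def grammar_for(variant):
    """Combine the common models with those of one variant ('EU' or 'DE')."""
    specific = SPECIFIC[variant]
    overlap = COMMON.keys() & specific.keys()
    if overlap:
        # An element type may be declared only once in a DTD.
        raise ValueError(f"declared twice: {sorted(overlap)}")
    return {**COMMON, **specific}


def to_dtd(models):
    """Render content models as element declarations, sorted by name."""
    return "\n".join(
        f"<!ELEMENT {name} {model} >" for name, model in sorted(models.items())
    )
```

For example, `to_dtd(grammar_for("DE"))` yields the common declarations together with the German FINAL, PREAMBLE and ARTICLE.HEADER models, while the shared SIGNATURE model is identical in both variants.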
Final remarks
• Possible objects:
  – Metadata on document level
  – Metadata on archiving level (research aspects)
  – Common models for complex objects: tables, quotations, etc.