The use of SGML and XML at the Publications Office Dr. Holger Bagola Dir A – Cell “Methods and Development — Formats” [email protected]

Jan 14, 2016

Table of contents

• Historical overview
• Formex
• Other areas of XML usage
• Conclusion

Historical overview

• Tasks of the Publications Office
• Archiving of legislative publications
• First steps in SGML
• Migration to XML
• Basic advantage: availability of tools

Formex (1)

• Basic principles
  – XML Schema instead of DTD
  – One single schema
  – Number of root elements: 12 instead of 30
  – Number of elements: about 350 instead of 1200
  – Distinction between semantic and physical markup

Formex (2)

ARTICLE (TI.ARTICLE, (PARAG+ | ALINEA+))
TI.ARTICLE (#PCDATA)
PARAG (NO.PARAG, ALINEA+)
NO.PARAG (#PCDATA)
ALINEA ((#PCDATA | NOTE | HT | FT)* | (P | LIST | TABLE)+)

. . .

Blue: semantic markup

Red: physical markup

Formex (3)

• Table model
  – Analysis of CALS, HTML, Formex v. 3
  – Choice:
    • Model close to HTML (top-down approach, nested tables)
    • Maintenance of semantic information such as in Formex v. 3

Formex (4)

• Footnotes
  – Distinction between notes in text and tables for readability and production simplicity
  – Insertion of text notes into the surrounding text
  – ID/IDREF to signal identical footnotes
  – Numbering is an object of presentation
  – Table notes assembled at the top of the table
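
As a rough illustration of the footnote rules above, the following Python sketch assigns footnote numbers only at presentation time, so that a repeated footnote reuses the number of the note it points to. The NOTE.REF attribute and the sample paragraph are assumptions made for this example; the slides only show NOTE with a NOTE.ID attribute, and Formex itself may model repeated notes differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the second NOTE repeats the first one via NOTE.REF.
SAMPLE = (
    '<P>First claim<NOTE NOTE.ID="N1"><P>OJ L 1.</P></NOTE>, '
    'second claim<NOTE NOTE.REF="N1"/>, '
    'third claim<NOTE NOTE.ID="N2"><P>OJ L 2.</P></NOTE>.</P>'
)


def number_notes(root):
    """Return the display number of each NOTE in document order.

    Numbers are not stored in the instance; they are derived here,
    and identical notes (same ID/REF key) share one number.
    """
    numbers = {}
    sequence = []
    for note in root.iter("NOTE"):
        key = note.get("NOTE.ID") or note.get("NOTE.REF")
        if key not in numbers:
            numbers[key] = len(numbers) + 1
        sequence.append(numbers[key])
    return sequence


if __name__ == "__main__":
    print(number_notes(ET.fromstring(SAMPLE)))
```

Keeping the numbers out of the stored instance means a note can be added or removed without renumbering the archive copy.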

Formex (5)

• Quotations
  – Structured quotations vs. ‘#PCDATA’ quotations
  – Elements signaling start and end of a quotation (quotation marks)
  – Element with function of a container for structured quotations

Formex (6)

Example:

Article 2

In article 1(2) of regulation (EC) 1234/94 the word ‘car’ is replaced by ‘bus’.

Article 6 of the same regulation is replaced by the following text:

‘Article 6

This is the new text of article 6.’

Formex (7)

Example:

<ARTICLE IDENTIFIER="002">
  <TI.ARTICLE>Article 2</TI.ARTICLE>
  <ALINEA>In article 1(2) of regulation (EC) 1234/94 the word <QUOT.START ID="QS0001" REF.END="QE0001" CODE="2018"/>car<QUOT.END ID="QE0001" REF.START="QS0001" CODE="2019"/> is replaced by <QUOT.START ID="QS0002" REF.END="QE0002" CODE="2018"/>bus<QUOT.END ID="QE0002" REF.START="QS0002" CODE="2019"/>.</ALINEA>
  <ALINEA>
    <P>Article 6 of the same regulation is replaced by the following text:</P>
    <QUOT.S>
      <ARTICLE IDENTIFIER="006">
        <TI.ARTICLE><QUOT.START ID="QS0003" REF.END="QE0003" CODE="2018"/>Article 6</TI.ARTICLE>
        <ALINEA>This is the new text of article 6.<QUOT.END ID="QE0003" REF.START="QS0003" CODE="2019"/></ALINEA>
      </ARTICLE>
    </QUOT.S>
  </ALINEA>
</ARTICLE>
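
The start and end markers in the example reference each other through ID, REF.END and REF.START. A minimal Python sketch of how such pairs could be checked and rendered follows. It assumes that CODE holds a hexadecimal Unicode code point (2018 and 2019 being the single quotation marks); the function names are invented for this illustration and are not part of Formex.

```python
import xml.etree.ElementTree as ET

# Shortened version of the Article 2 example, using straight quotes.
SAMPLE = (
    '<ARTICLE IDENTIFIER="002"><TI.ARTICLE>Article 2</TI.ARTICLE>'
    '<ALINEA>the word <QUOT.START ID="QS0001" REF.END="QE0001" CODE="2018"/>car'
    '<QUOT.END ID="QE0001" REF.START="QS0001" CODE="2019"/> is replaced by '
    '<QUOT.START ID="QS0002" REF.END="QE0002" CODE="2018"/>bus'
    '<QUOT.END ID="QE0002" REF.START="QS0002" CODE="2019"/>.</ALINEA></ARTICLE>'
)


def unpaired_quotes(root):
    """Return IDs of quotation markers whose partner is missing or does not point back."""
    starts = {e.get("ID"): e.get("REF.END") for e in root.iter("QUOT.START")}
    ends = {e.get("ID"): e.get("REF.START") for e in root.iter("QUOT.END")}
    bad = [sid for sid, ref in starts.items() if ends.get(ref) != sid]
    bad += [eid for eid, ref in ends.items() if starts.get(ref) != eid]
    return sorted(bad)


def render_quotes(alinea):
    """Flatten an ALINEA to text, replacing each marker by the character named in CODE."""
    parts = [alinea.text or ""]
    for child in alinea:
        if child.tag in ("QUOT.START", "QUOT.END"):
            parts.append(chr(int(child.get("CODE"), 16)))
        else:
            parts.append("".join(child.itertext()))
        parts.append(child.tail or "")
    return "".join(parts)


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE)
    print(unpaired_quotes(root))
    print(render_quotes(root.find("ALINEA")))
```

Because the markers are empty elements, a quotation can open in one element (the article title) and close in another (the last alinea) without breaking the element hierarchy, which is what the structured quotation in the example relies on.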

Formex (8)

• Splitting large documents
  – Fragmentation by definition of inclusions for the main document
  – Secondary instances referencing the inclusions by means of the XML entity mechanism
  – Inclusions may not necessarily be valid XML instances

Formex (9)

main.xml

<?xml version="1.0"?>
<doc>
  <ti>title</ti>
  <chap no="1">
    <incl ref="frag-1.frg"/>
  </chap>
</doc>

frag-1.frg

<text>…</text>
<text>…</text>

container.xml

<?xml version="1.0"?>
<!DOCTYPE frag [
  <!ENTITY cnt SYSTEM "frag-1.frg">
]>
<frag>&cnt;</frag>
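
The following Python sketch illustrates why the fragment needs a container: a file holding several sibling elements is not a well-formed XML document on its own, but becomes one once wrapped in a single root. For simplicity the sketch inlines the fragment text directly instead of resolving a SYSTEM entity, and the fragment content is invented for the example.

```python
import xml.etree.ElementTree as ET

# Stands in for the content of frag-1.frg: two sibling elements, no single root.
FRAGMENT = "<text>first</text><text>second</text>"


def is_well_formed(xml_text):
    """Return True if xml_text parses as a complete XML document."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


def wrap_fragment(fragment, root="frag"):
    """Parse a fragment by placing it inside a container element.

    This mimics what container.xml achieves through the entity reference.
    """
    return ET.fromstring(f"<{root}>{fragment}</{root}>")


if __name__ == "__main__":
    print(is_well_formed(FRAGMENT))
    print([t.text for t in wrap_fragment(FRAGMENT)])
```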

Formex (10)

• Character set
  – OJ publications in 20 (21) languages
  – Different alphabets
  – International character set definition: Unicode (UTF-8)
  – Definition of allowed character ranges
  – Special font ‘EU-Albertina’
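
A check against allowed character ranges could look roughly like the Python sketch below. The ranges listed are purely illustrative assumptions and are not the ranges defined in the Formex specification.

```python
# Hypothetical allowed ranges (inclusive code points); the real list is
# defined by the Formex specification and is not reproduced here.
ALLOWED_RANGES = [
    (0x0020, 0x007E),  # Basic Latin, printable
    (0x00A0, 0x024F),  # Latin-1 Supplement, Latin Extended-A and -B
    (0x0370, 0x03FF),  # Greek and Coptic
    (0x2010, 0x2027),  # General punctuation: dashes, quotation marks
]


def disallowed_characters(text):
    """Return (position, code point) for every character outside the allowed ranges."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if not any(lo <= ord(ch) <= hi for lo, hi in ALLOWED_RANGES)
    ]


if __name__ == "__main__":
    print(disallowed_characters("a\u4e2d"))
```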

Formex (11)

• Meta-data
  – OJ publications are composed of different levels:
    • Publication
    • Document
    • ‘Contents’
  – Meta-data separated according to these levels

Formex (12)

[Diagram: levels of an OJ publication]

• Publication: meta-data concerning the publication; structure of the publication with references to documents
• Document: meta-data for document; references to components
  – Contents, main part (001)
  – Contents, Annex 1 (001.001)
  – Contents, Annex 2 (001.002)
• Document: meta-data for document; references to components
  – Contents, main part (002)
• ProCat

Formex (13)

• Meta-data (continued)
  – Extraction of meta-data by means of automatic processes (pre-notices)
  – Extension of pre-notices by juridical analysis
  – Availability of notices in ProCat for other productions (Celex) and projects

Formex (14)

• Final remark on Formex specifications
  – Only few complete production chains from the author to the printer
  – Concentration on publication of Official Journal

Formex (15)

• Validation of Formex deliveries
  – In-depth validation necessary
  – Automatic procedures
  – Manual procedures

Formex (16)

• Validation of Formex deliveries (continued)
  – Automatic procedures
    • Control of filename conventions
    • Parsing of various components
    • Control of completeness
    • Execution of additional validation rules
    • Comparison of contents between Formex and PDF

Report (XML instance)
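
To make the idea of an automatic procedure that produces an XML report more concrete, here is a minimal Python sketch covering only two of the checks listed above (filename convention and parsing). The filename pattern and the report vocabulary (report, file, error, status, check) are hypothetical and do not reflect the Office's actual tooling.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical naming convention: lowercase letters, digits, "_", "." and "-".
FILENAME_PATTERN = re.compile(r"^[a-z0-9_.-]+\.xml$")


def validate_delivery(files):
    """Check a delivery given as {filename: xml_text} and return a report element."""
    report = ET.Element("report")
    for name in sorted(files):
        entry = ET.SubElement(report, "file", name=name)
        if not FILENAME_PATTERN.match(name):
            ET.SubElement(entry, "error", check="filename").text = "name violates convention"
        try:
            ET.fromstring(files[name])
        except ET.ParseError as exc:
            ET.SubElement(entry, "error", check="parsing").text = str(exc)
        # A file is rejected as soon as one check has recorded an error.
        entry.set("status", "rejected" if len(entry) else "ok")
    return report


if __name__ == "__main__":
    delivery = {"good.xml": "<doc/>", "Bad Name.xml": "<doc>"}
    print(ET.tostring(validate_delivery(delivery), encoding="unicode"))
```

Producing the report as an XML instance lets the later manual step and the archiving decision work from the same machine-readable record.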

Formex (17)

• Validation of Formex deliveries (continued)
  – Manual procedures
    • Verification of the report generated by the automatic validation procedure
    • Control of the use of Formex specifications in all language versions

Report (XML instance) = basis for archiving or rejection

Formex (18)

• Conversion of Formex v. 3 into Formex v. 4
  – Conversion of character set (ISO 2022 to UTF-8)
  – Transformation of SGML instances into well-formed XML instances
  – Extraction of tables and conversion into an intermediate model
  – Generation of meta-data levels
  – Conversion of old elements and generation of new elements
  – Validation of the results

Formex (19)

• Specifications:

http://formex.publications.eu.int/

Other areas of XML usage (1)

• Index of OJ publications
  – Biannual issues
  – Monthly issues
  – Extraction from Celex/ProCat
  – Transformation into PDF by means of XSLT and XSL FO (biannual version only)

Other areas of XML usage (2)

• Consolidation of legal documents
  – Mainly based on Formex
  – Additional administrative data in XML
  – Relations between historical levels
    • Description of the composition of a given historical level
    • Concordance of information on numbering schemes (articles, …) for each level

Other areas of XML usage (3)

• Conversion to RTF
  – Compatibility with other EU services
  – Input in SGML or XML
  – Results with LegisWrite templates

Other areas of XML usage (4)

[Diagram: conversion chain from Formex to RTF]

• Inputs: SGML instance (Formex v. 3); XML instance (Formex v. 4)
• Steps: Character conversion; Transformation into well-formed XML; Transformation into internal XML format; Transformation into RTF (LegisWrite)
• Output in RTF (LegisWrite)

Other areas of XML usage (5)

• Production of the EU budget
  – Creation and maintenance of a common central repository (XML)
  – Markup of modified elements during the decision process in working language
  – Translation only of parts modified
  – Update of repository after publication

Other areas of XML usage (6)

[Diagram: budget production workflow]

• Actors: Budget services, Translation service, Publications Office, Printer
• Data stores: Budget XML repository, Formex archive
• Phases: pre-printing, post-printing

Other areas of XML usage (7)

• ‘Secondary legislation’
  – Publication of legislation in force in ‘new’ languages
  – XML production on basis of Formex archive
  – Transformation of translated input
  – Transformation of SGML into XML of Formex instance
  – Merging of XML instances

Other areas of XML usage (8)

[Diagram: production chain for ‘secondary legislation’]

• Word document: Conversion into XML, Extraction of text
• Formex archive: Conversion into XML, Extraction of skeleton
• Merging skeleton & text
• Simplify structure
• Publication
• ProCat, Celex

Other areas of XML usage (9)

• European document repository
  – TIFF of publications
  – PDF of publications
  – Formex instances of OJ publications
  – Exchange of information by XML messages

Other areas of XML usage (10)

• Publication of calls for tender (OJ-S)
  – Input in different electronic formats
  – Harmonization in XML
  – Updating database TED
  – Production of CD-ROM version

Conclusion

• Difficult start with SGML

• Successful use of XML as well as of other standards such as XSLT/XPath, XSL FO

• Powerful possibilities of re-use of XML instances

• How to profit from our experiences?

Proposal for technical solution

• An example: a regulation in the European legislative context and a ‘Verordnung’ in German legislation

• Evident structural differences

• Evident common structural objects

Differences and common objects (1)

• EU regulation
  – Title
  – Preamble
    • Citations
    • Recitals
  – Enacting terms
    • Articles
      – Article header
        » Numbering
      – Paragraphs or alineas

• German regulation
  – Title
  – Preamble
    • Paragraphs
  – Enacting terms
    • Articles
      – Article header
        » Numbering + text
      – Alineas

Differences and common objects (2)

• EU regulation (continued)
  – Final
    • Applicability
    • Signature

• German regulation (continued)
  – Final
    • Signature

Differences and common objects (3)

• preamble
  – European model
    PREAMBLE (PREAMBLE.INIT, CITATION+, RECITAL+, PREAMBLE.FINAL)
    PREAMBLE.INIT (P)
    CITATION (P)
    RECITAL (NP)
    PREAMBLE.FINAL (P)
  – German model
    PREAMBLE (P)

Differences and common objects (4)

• article
  – European model
    ARTICLE (ARTICLE.HEADER, (PARAG+ | ALINEA+))
    ARTICLE.HEADER (#PCDATA)
    PARAG (NO.PARAG, ALINEA+)
    ALINEA (P | LIST)+
  – German model
    ARTICLE (ARTICLE.HEADER, (PARAG+ | ALINEA+))
    ARTICLE.HEADER (NP)
    NP (NO.P, TXT)
    PARAG (NO.PARAG, ALINEA+)
    ALINEA (P | LIST)+

Differences and common objects (5)

• final
  – European model
    FINAL (APPLICABILITY, SIGNATURE)
    APPLICABILITY (P)
    SIGNATURE (PL.DATE, SIGNATORY)
    PL.DATE (P)
    SIGNATORY (P+)
  – German model
    FINAL (SIGNATURE)
    SIGNATURE (PL.DATE, SIGNATORY)
    PL.DATE (P)
    SIGNATORY (P+)

Differences and common objects (6)

Specific models for European regulation

Specific models for German regulation

Common models for European and German regulation

Differences and common objects (7)

• Common grammar fragment

<!ELEMENT ALINEA (P | LIST)+ >
<!ELEMENT ARTICLE (ARTICLE.HEADER, (ALINEA+ | PARAG+)) >
<!ELEMENT ENACTING.TERMS (ARTICLE+) >
<!ELEMENT ITEM (NP, (P | LIST)) >
<!ELEMENT NO.P (#PCDATA) >
<!ELEMENT NOTE (P+) >
<!ATTLIST NOTE NOTE.ID ID #REQUIRED >
<!ELEMENT NP (NO.P, TXT) >
<!ELEMENT P (#PCDATA | NOTE)* >
<!ELEMENT PARAG (PARAG.NO, ALINEA+) >
<!ELEMENT PARAG.NO (#PCDATA) >
<!ELEMENT PL.DATE (P+) >
<!ELEMENT REGULATION (TITLE, PREAMBLE, ENACTING.TERMS, FINAL) >
<!ATTLIST REGULATION CTRY (DE | EU-EN) #REQUIRED >
<!ELEMENT SIGNATORY (P+) >
<!ELEMENT SIGNATURE (PL.DATE, SIGNATORY) >
<!ELEMENT TITLE (P+) >
<!ELEMENT TXT (#PCDATA | LIST | NOTE)* >

Differences and common objects (8)

• Specific grammar for EU regulation

<!ENTITY % common SYSTEM "regulation-common.dtd">
%common;
<!ELEMENT APPLICABILITY (P) >
<!ELEMENT ARTICLE.HEADER (P) >
<!ELEMENT CITATION (P) >
<!ELEMENT FINAL (APPLICABILITY, SIGNATURE) >
<!ELEMENT PREAMBLE (PREAMBLE.INIT, CITATION+, RECITAL.INIT?, RECITAL+, PREAMBLE.FINAL) >
<!ELEMENT PREAMBLE.FINAL (P) >
<!ELEMENT PREAMBLE.INIT (P) >
<!ELEMENT RECITAL (P | NP) >
<!ELEMENT RECITAL.INIT (P) >

Differences and common objects (9)

• Specific grammar for German regulation

<!ENTITY % common SYSTEM "regulation-common.dtd">
%common;
<!ELEMENT ARTICLE.HEADER (NP) >
<!ELEMENT FINAL (SIGNATURE) >
<!ELEMENT PREAMBLE (P+) >
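
The layering of a common grammar with small country-specific additions can be mimicked in a few lines of Python. The sketch below is only an approximation under stated assumptions: it encodes a handful of the content models as regular expressions over the sequence of child element names, ignores mixed content and attributes, and is not a substitute for a validating parser. The dictionary names COMMON, EU and DE and the function violations are invented for the example.

```python
import re
import xml.etree.ElementTree as ET

# Each model is a regex over child names, every name followed by one space.
COMMON = {
    "REGULATION": r"TITLE PREAMBLE ENACTING\.TERMS FINAL ",
    "SIGNATURE": r"PL\.DATE SIGNATORY ",
}
# Specific grammars extend the common one, like "%common;" in the DTDs.
EU = {**COMMON, "FINAL": r"APPLICABILITY SIGNATURE ", "ARTICLE.HEADER": r"P "}
DE = {**COMMON, "FINAL": r"SIGNATURE ", "ARTICLE.HEADER": r"NP ", "PREAMBLE": r"(P )+"}


def violations(root, models):
    """Return the names of elements whose children do not match their model."""
    bad = []
    for elem in root.iter():
        model = models.get(elem.tag)
        if model is None:
            continue  # no model recorded for this element in the sketch
        children = "".join(child.tag + " " for child in elem)
        if not re.fullmatch(model, children):
            bad.append(elem.tag)
    return bad


if __name__ == "__main__":
    final = ET.fromstring("<FINAL><SIGNATURE><PL.DATE/><SIGNATORY/></SIGNATURE></FINAL>")
    print(violations(final, DE), violations(final, EU))
```

The same FINAL element is accepted by the German model and rejected by the European one, while the shared SIGNATURE model is defined once and reused by both.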

Final remarks

• Possible objects:
  – Metadata on document level
  – Metadata on archiving level (research aspects)
  – Common models for complex objects: tables, quotations, etc.