Manfred Thaller, Universität zu* Köln, [email protected]. Characterisation. Digital Preservation Planning: Principles, Examples and the Future with Planets. July 29th, 2008. (* University at, NOT of Cologne)
Page 1:

Manfred Thaller

Universität zu* Köln

[email protected]

Characterisation

Digital Preservation Planning: Principles, Examples and the Future with Planets.

July 29th, 2008

* University at, NOT of Cologne

Page 2:

I – What is (in) a format?

Page 3:

An image

Page 4:

An image

6 rows, 5 columns

Page 5:

An image

5 rows, 6 columns

Page 6:

An image

1 1 1 1 1
1 0 0 0 1
1 1 0 1 1
1 1 0 1 1
1 1 0 1 1
1 1 1 1 1

1 == blue, 0 == red

Page 7:

An image

1 1 1 1 1
1 0 0 0 1
1 1 0 1 1
1 1 0 1 1
1 1 0 1 1
1 1 1 1 1

1 == green, 0 == yellow

Page 8:

An image

1 1 1 1 1
1 0 0 0 1
1 1 0 1 1
1 1 0 1 1
1 1 0 1 1
1 1 1 1 1

Store: 1,1,1,1,1,1,0,0,0,1,1,1,0,1,1,1,1,0,1,1,1,1,0,1,1,1,1,1,1,1

Uncompressed

Page 9:

An image

1 1 1 1 1
1 0 0 0 1
1 1 0 1 1
1 1 0 1 1
1 1 0 1 1
1 1 1 1 1

Store: 6,1,3,0,3,1,1,0,4,1,1,0,4,1,1,0,7,1

(Compressed) Run Length Encoded
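The run-length encoding on this slide can be reproduced with a few lines of Python. This is a minimal sketch, assuming the slide's convention of storing each run as a count followed by the value; the function names `rle_encode` and `rle_decode` are chosen here for illustration.

```python
from itertools import groupby

# The 30 pixel values of the example image, row by row (the uncompressed "Store" line).
PIXELS = [1, 1, 1, 1, 1,
          1, 0, 0, 0, 1,
          1, 1, 0, 1, 1,
          1, 1, 0, 1, 1,
          1, 1, 0, 1, 1,
          1, 1, 1, 1, 1]


def rle_encode(values):
    """Return a flat list count, value, count, value, ... for consecutive runs."""
    encoded = []
    for value, run in groupby(values):
        encoded.extend([len(list(run)), value])
    return encoded


def rle_decode(encoded):
    """Expand a flat count, value list back into the original sequence."""
    decoded = []
    for count, value in zip(encoded[0::2], encoded[1::2]):
        decoded.extend([value] * count)
    return decoded


ENCODED = rle_encode(PIXELS)
print(ENCODED)
```

Decoding `ENCODED` gives back the 30 original values, so for this image the 18 stored numbers carry the same information as the 30 uncompressed ones.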

Page 10:

An image

dimensions

photometric interpretation

compression
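The first two of these properties can be illustrated with the example image from the previous slides: the same 30 stored values give different pictures depending on the dimensions and on how 0 and 1 are mapped to colours. The sketch below is only an illustration; `to_rows` and `colourise` are names invented here.

```python
# The 30 stored values of the example image.
STORED = [1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1,
          1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1]


def to_rows(values, rows, cols):
    """Interpret a flat list of values as a grid of rows x cols."""
    if rows * cols != len(values):
        raise ValueError("dimensions do not match the number of values")
    return [values[r * cols:(r + 1) * cols] for r in range(rows)]


def colourise(grid, palette):
    """Replace every stored value by the colour the palette assigns to it."""
    return [[palette[value] for value in row] for row in grid]


as_6x5 = to_rows(STORED, 6, 5)   # 6 rows, 5 columns
as_5x6 = to_rows(STORED, 5, 6)   # 5 rows, 6 columns: same data, different image

blue_red = colourise(as_6x5, {1: "blue", 0: "red"})
green_yellow = colourise(as_6x5, {1: "green", 0: "yellow"})

for row in as_6x5:
    print(*row)
```

Without the dimensions and the colour mapping, the stored numbers alone do not determine the image.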

Page 11:

An image

<basic information>

<rendering information>

<storage information>

<data>

Page 12:

File format

<basic information>: What to do?

<rendering information>: How to do it?

<storage information>: How to move it from persistent to deployed form?

<data>: What to deploy?

Page 14:

File format

<basic information>: Mandatory

<rendering information>: Useful

<storage information>: Historical

<data>: Mandatory

Page 15:

File format

A deterministic specification of how the properties of a digital object can be reversibly converted into a linear bytestream (bitstream).
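To make the definition concrete, here is a sketch of a toy file format for the example image. The layout (magic string `TOY1`, field order, field sizes) is entirely hypothetical and invented for this illustration; it only shows the four parts named earlier (basic, rendering and storage information, then data) being written to a bytestream and read back without loss.

```python
import struct

# Hypothetical toy layout, invented for illustration only:
#   4 bytes   magic "TOY1"
#   2 x u16   rows, cols                 (basic information)
#   2 x 3 B   RGB colour for 0 and 1     (rendering information)
#   1 byte    0 = uncompressed           (storage information)
#   N bytes   one byte per pixel         (data)
HEADER = struct.Struct(">4sHH3s3sB")


def write_toy(rows, cols, colour0, colour1, pixels):
    """Serialise the image properties and pixel values into bytes."""
    if len(pixels) != rows * cols:
        raise ValueError("pixel count does not match dimensions")
    header = HEADER.pack(b"TOY1", rows, cols, bytes(colour0), bytes(colour1), 0)
    return header + bytes(pixels)


def read_toy(blob):
    """Recover the properties and pixel values from the bytestream."""
    magic, rows, cols, c0, c1, storage = HEADER.unpack_from(blob)
    if magic != b"TOY1":
        raise ValueError("not a TOY1 file")
    if storage != 0:
        raise ValueError("unknown storage method")
    pixels = list(blob[HEADER.size:HEADER.size + rows * cols])
    return rows, cols, tuple(c0), tuple(c1), pixels


PIXELS = [1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1,
          1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1]
blob = write_toy(6, 5, (255, 0, 0), (0, 0, 255), PIXELS)
restored = read_toy(blob)
```

The round trip `read_toy(write_toy(...))` returning the original values is what "reversibly" means in the definition.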

Page 16:

II – Why would we want to know?

Page 17:

III – Which format to choose?

Page 18:

Recommended formats: text

High confidence:
• Plain text (encoding: ISO8859-1 - 9, UTF-8, UTF-16 with BOM)
• XML (includes XSD/XSL/XHTML, etc.; with included or accessible schema and character encoding explicitly specified)
• PDF/A-1 (ISO 19005-1)

Medium confidence:
• Cascading Style Sheets (*.css)
• DTD (*.dtd)
• PDF (*.pdf) (embedded fonts)
• Rich Text Format 1.x (*.rtf)
• HTML 4.x (include a DOCTYPE declaration)
• SGML (*.sgml)
• Open Office (*.sxw/*.odt)
• Office Open XML (*.docx)

Low confidence:
• PDF (*.pdf) (encrypted)
• Microsoft Word (*.doc)
• WordPerfect (*.wpd)
• DVI (*.dvi)
• All other text formats not listed here

http://www.fcla.edu/digitalArchive/pdfs/recFormats.pdf

Page 19:

Recommended formats: bitmap / raster image

High confidence:
• TIFF (uncompressed)
• PNG (*.png)

Medium confidence:
• BMP (*.bmp)
• JPEG/JFIF (*.jpg)
• JPEG2000 (prefer lossless or uncompressed) (*.jp2)
• TIFF (compressed)
• GIF (*.gif)

Low confidence:
• MrSID (*.sid)
• TIFF (in Planar format)
• FlashPix (*.fpx)
• PhotoShop (*.psd)
• All other raster image formats not listed here

http://www.fcla.edu/digitalArchive/pdfs/recFormats.pdf

Page 20:

Recommended formats: vector graphics

High confidence:
• SVG 1.1 (no Java binding) (*.svg)

Medium confidence:
• Computer Graphic Metafile (CGM, WebCGM) (*.cgm)

Low confidence:
• Encapsulated Postscript (EPS)
• Macromedia Flash (*.swf)
• All other vector image formats not listed here

http://www.fcla.edu/digitalArchive/pdfs/recFormats.pdf

Page 21:

Recommended formats: audio

High confidence:
• AIFF (PCM) (*.aif, *.aiff)
• WAV (PCM) (*.wav)

Medium confidence:
• SUN Audio (uncompressed) (*.au)
• Standard MIDI (*.mid, *.midi)
• Ogg Vorbis (*.ogg)
• Free Lossless Audio Codec (*.flac)
• Advanced Audio Coding (*.mp4, *.m4a, *.aac)
• MP3 (MPEG-1/2, Layer 3) (*.mp3)

Low confidence:
• AIFC (compressed) (*.aifc)
• NeXT SND (*.snd)
• RealNetworks 'Real Audio' (*.ra, *.rm, *.ram)
• Windows Media Audio (*.wma)
• WAV (compressed) (*.wav)
• All other audio formats not listed here

http://www.fcla.edu/digitalArchive/pdfs/recFormats.pdf

Page 22:

Recommended formats: video

High confidence:
• Motion JPEG 2000 (ISO/IEC 15444-4) (*.mj2)
• AVI (uncompressed) (*.avi)
• QuickTime Movie (uncompressed) (*.mov)
• Motion JPEG (*.avi, *.mov)

Medium confidence:
• Ogg Theora (*.ogg)
• MPEG-1, MPEG-2 (*.mpg, *.mpeg)
• MPEG-4 (*.mp4)

Low confidence:
• AVI (compressed) (*.avi)
• QuickTime Movie (compressed) (*.mov)
• RealNetworks 'Real Video' (*.rv)
• Windows Media Video (*.wmv)
• All other video formats not listed here

http://www.fcla.edu/digitalArchive/pdfs/recFormats.pdf

Page 23:

Recommended formats: “data base”

High confidence:
• Delimited Text (*.txt, *.csv)
• SQL DDL

Medium confidence:
• DBF (*.dbf)
• OpenOffice (*.sxc/*.ods)
• Office Open XML (*.xlsx)

Low confidence:
• Excel (*.xls)
• All other spreadsheet / database formats not listed here

http://www.fcla.edu/digitalArchive/pdfs/recFormats.pdf

Page 24:

Recommended formats: 3D (“virtual reality”)

High confidence:
• X3D (*.x3d)

Medium confidence:
• VRML (*.wrl, *.vrml)
• U3D (Universal 3D file format)

Low confidence:
• All other virtual reality formats not listed here

http://www.fcla.edu/digitalArchive/pdfs/recFormats.pdf

Page 25:

Doctoral thesis on robustness of file formats:

Volker Heydegger, University at Cologne.

[email protected]

Page 26:

IV – How do we identify a format?

Page 27:

What kind of file is this?

Two ways to identify a file:

(a) By extension.

"Each file ending with *.doc is a MS Word document"

Page 28:

What kind of file is this?

Two ways to identify a file:

(b) By internal characteristics ("magic number", "signature").

A TIFF file begins with …
Bytes 0-1: The byte order used within the file. Legal values are: "II" (4949.H) / "MM" (4D4D.H)
Bytes 2-3: An arbitrary but carefully chosen number (42) that further identifies the file as a TIFF file.
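The TIFF signature described on this slide can be checked in a few lines. This is a minimal sketch that looks only at the four bytes quoted above and nothing else in the file; the function name `looks_like_tiff` is chosen here for illustration.

```python
import struct


def looks_like_tiff(header: bytes) -> bool:
    """Check bytes 0-1 (byte order) and bytes 2-3 (the number 42)."""
    if len(header) < 4:
        return False
    byte_order = header[:2]
    if byte_order == b"II":      # 4949.H: little-endian
        fmt = "<H"
    elif byte_order == b"MM":    # 4D4D.H: big-endian
        fmt = ">H"
    else:
        return False
    (magic,) = struct.unpack(fmt, header[2:4])
    return magic == 42


def file_looks_like_tiff(path) -> bool:
    """Read the first four bytes of a file and test them."""
    with open(path, "rb") as handle:
        return looks_like_tiff(handle.read(4))
```

Unlike the extension check, this test keeps working when a file has been renamed, because the number 42 has to be read in the byte order that the first two bytes announce.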

Page 29:

File format registries - URLs

PRONOM: http://www.nationalarchives.gov.uk/pronom/ (does not only rely on extensions)

Global Digital Format Registry: http://hul.harvard.edu/gdfr (predominantly project description)

FileExt: http://filext.com (predominantly links to software)

Page 30:

V – What’s a file characteristic, then?

Page 31:

Technical metadata

A high proportion of the preservation metadata will be in narrative format and will require manual entry by Library staff. A significant subset of the data however, relating to technical file characteristics, can be automatically extracted from the digital object by reading the file header details. This successful extraction of preservation metadata has been proved in a previous National Library proof of concept project. The automated capture of this information will significantly reduce the amount of manual data entry required from Library staff.

file characteristics.

Page 32:

Why automate?

1 million objects: use one second for each.

== 16666.7 minutes == 277.8 hours

== 11.57 working days of a computer

== 34.7 8-hour days for a Human

== 7 working weeks
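The figures for the one-second case can be recomputed as follows. The sketch assumes that a computer works 24 hours a day and a human 8 hours a day; the five-day working week and the function name `effort` are assumptions made here to reproduce the last line.

```python
def effort(objects, seconds_each):
    """Convert a per-object processing time into total minutes, hours, days and weeks."""
    total_seconds = objects * seconds_each
    hours = total_seconds / 3600
    return {
        "minutes": total_seconds / 60,
        "hours": hours,
        "machine_days": hours / 24,       # a computer runs around the clock
        "human_days": hours / 8,          # 8-hour working days
        "working_weeks": hours / 8 / 5,   # assumes 5 working days per week
    }


one_second = effort(1_000_000, 1)
for name, value in one_second.items():
    print(f"{name}: {value:.2f}")
```

The same function can be called with `seconds_each=300` to explore the case of five minutes per object.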

Page 33:

Why automate?

1 million objects: use five minutes for each.

== 83 333.3 hours

== 10 416.7 8-hour days for a Human

== way too much for anything

Page 34:

Formats in PLANETS: File characteristics

Based on two formal languages:

(1) eXtensible Characterisation Extraction Language (= XCEL)

(2) eXtensible Characterisation Description Language (= XCDL)

Page 35:

The comparator

[Diagram: a Migrator converts a tiff into a png. The Extractor applies the tiff XCEL and the png XCEL to produce a tiff XCDL and a png XCDL. The Comparator compares the two XCDLs; the example result shown is 93%.]
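The idea of comparing two extracted descriptions can be sketched as below. This is not the Planets comparator and does not use real XCDL, which is an XML language: the property names, the dictionaries and the scoring rule (share of identical properties) are all assumptions made purely for illustration.

```python
_MISSING = object()  # marks a property that one side does not have at all


def compare_properties(source, target):
    """Return (percentage of matching properties, names of differing properties)."""
    names = sorted(set(source) | set(target))
    if not names:
        return 100.0, []
    mismatches = [
        name for name in names
        if source.get(name, _MISSING) != target.get(name, _MISSING)
    ]
    score = 100.0 * (len(names) - len(mismatches)) / len(names)
    return score, mismatches


# Hypothetical properties extracted before and after a tiff -> png migration.
tiff_props = {"imageWidth": 5, "imageHeight": 6, "bitsPerSample": 1, "compression": "none"}
png_props = {"imageWidth": 5, "imageHeight": 6, "bitsPerSample": 1, "compression": "deflate"}

score, mismatches = compare_properties(tiff_props, png_props)
print(f"{score:.0f}% of properties preserved; differing: {mismatches}")
```

A real comparison would also have to decide which differences matter: a changed compression method is expected after a migration, while a changed image width is not.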

Page 36:

The comparator

[Diagram: Extractor, appropriate XCELs, Comparator, C-Set]

Page 37:

Why data?

► Photoshop ►

► Photoshop ►

Becomes discoverable only from the actual data …

Page 38:

VI – What is not in a file format?

Page 39:

Testfile in Word 2007

Page 40:

Testfile in Word 2003 (2007)

Page 41:

Testfile in Open Office ODT

Page 42:

Testfile in PDF

Page 43:

Measuring the pages …

Cut out page from rendering surface.

Scale to common dimensions: 371 +/- 1 x 521 +/- 1

Measure:
1. The leftmost and lowest completely black pixel in the letter “A” starting the first line of the main text.
2. The leftmost and highest completely black pixel in the letter “E” starting the first line of the text in the footnote.
3. The geometrical centre of the period at the end of the main sentence.
4. The geometrical centre of the period at the end of the footnote text.

Page 44:

Measuring Word 2003

(i) = 45 / 134;
(ii) = 57 / 470;
(iii) = 215 / 322;
(iv) = 254 / 483

Page 45:

Measuring Word 2007

(i) = 45 / 134;
(ii) = 57 / 470;
(iii) = 215 / 322;
(iv) = 254 / 483

Page 46:

Open Office ODT

(i) = 44 / 132;
(ii) = 52 / 469;
(iii) = 214 / 320;
(iv) = 247 / 482

Page 47:

PDF

(i) = 45 / 130;
(ii) = 59 / 467;
(iii) = 215 / 317;
(iv) = 254 / 480

Page 48:

Summary I

The comparison of the four renderings of the example pages described above seems to indicate clearly that a migration from the Word family of formats to PDF is a better way to preserve the content of the document than a migration to the Open Office format.
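One way to put a number on this comparison is to measure how far each of the four points moves relative to the Word 2003 rendering. The slides do not say which metric was used; the mean Euclidean distance below is an assumption made for this sketch, and the coordinates are copied from the four measurement slides.

```python
from math import hypot

# Measured points (i) to (iv) for each rendering, as x / y pairs.
POINTS = {
    "Word 2003": [(45, 134), (57, 470), (215, 322), (254, 483)],
    "Word 2007": [(45, 134), (57, 470), (215, 322), (254, 483)],
    "Open Office ODT": [(44, 132), (52, 469), (214, 320), (247, 482)],
    "PDF": [(45, 130), (59, 467), (215, 317), (254, 480)],
}


def mean_displacement(reference, candidate):
    """Average Euclidean distance between corresponding points, in pixels."""
    distances = [
        hypot(cx - rx, cy - ry)
        for (rx, ry), (cx, cy) in zip(reference, candidate)
    ]
    return sum(distances) / len(distances)


reference = POINTS["Word 2003"]
displacement = {
    name: mean_displacement(reference, points) for name, points in POINTS.items()
}
for name, value in displacement.items():
    print(f"{name}: {value:.2f}")
```

Under this metric Word 2007 is identical to Word 2003 and PDF stays somewhat closer to the Word rendering than ODT does; a different metric, for example one that weights horizontal and vertical shifts differently, could change the size of that gap.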

Page 49:

Measuring Word 2003

Relationship tagged explicitly.

Text / footnote separation clear.

Rendering / layout not (totally) predictable.

Footnote indicator unpredictable.

Page 50:

Measuring Word 2007

Relationship tagged explicitly.

Text / footnote separation extremely clear.

Rendering / layout pretty predictable.

Footnote indicator not predictable.

Page 51:

Open Office ODT

Relationship tagged explicitly.

Text / footnote separation extremely clear.

Rendering / layout a little bit predictable.

Footnote indicator predictable.

Page 52:

PDF

Relationship expressed by layout.

Text / footnote separation missing.

Rendering / layout very much predictable.

Footnote indicator predictable.

Page 53:

Summary II

The comparison of the four internal structures of the example pages described above seems to indicate clearly that a migration from the Word family of formats to PDF is a worse way to preserve the content of the document than a migration to the Open Office format.

Page 54:

Small technical note

Do not forget that the whole movement started by SGML, carried into the WWW by HTML, transferred to content by the TEI, and which started XML as a basic empowering technology ...

... assumes that rendering is NOT particularly relevant.

Page 55:

Proposal

<significantPoints>
  <point x="45" y="134" />
  <point x="57" y="470" />
  <point x="215" y="322" />
  <point x="254" y="483" />
</significantPoints>
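A fragment like the one proposed here could be written and read back with Python's standard library. This is only a sketch: the element and attribute names are taken from the slide, while the function names and the choice of `xml.etree.ElementTree` are assumptions made for illustration. Note that attribute values need straight quotes for an XML parser to accept them.

```python
import xml.etree.ElementTree as ET


def significant_points_xml(points):
    """Serialise a list of (x, y) pairs as a <significantPoints> element."""
    root = ET.Element("significantPoints")
    for x, y in points:
        ET.SubElement(root, "point", x=str(x), y=str(y))
    return ET.tostring(root, encoding="unicode")


def parse_significant_points(xml_text):
    """Read the (x, y) pairs back out of a <significantPoints> element."""
    root = ET.fromstring(xml_text)
    return [(int(p.get("x")), int(p.get("y"))) for p in root.findall("point")]


WORD_POINTS = [(45, 134), (57, 470), (215, 322), (254, 483)]
xml_text = significant_points_xml(WORD_POINTS)
print(xml_text)
```

Stored next to a document, such points would let a later rendering be checked against the original layout without keeping a full page image.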