eXtensible Characterisation Languages (XCL) Manfred Thaller, (University at Cologne) DPP meeting, Glasgow, Nov. 23rd 2006

eXtensible Characterisation Languages (XCL)

Manfred Thaller, (University at Cologne)

DPP meeting, Glasgow, Nov. 23rd 2006

M. Thaller DPP meeting, Glasgow, Nov. 23rd 2006

Vision:

Questions …

1. Is all information contained within oldFormat also contained within newFormat?

2. Is all information, which is relevant for the usage of the information, within oldFormat also contained within newFormat?

3. Is the conversion process a(oldFormat, newFormat) better than b(oldFormat, newFormat), i.e. does it preserve more of the information contained within oldFormat?

Building Block I: XCEL

A language, which allows a program to read "any file specification" based on a

==> "eXtensible Characterisation Extraction Language"

Formulate the human-readable specifications of TIFF, RTF, WAV … in a language which a general-purpose program can read.

General enough that any existing format specification can be expressed in it (LaTeX, MAX, VRML …).

XCEL – Structuring Elements

range

item

subitem

<startposition><length>

item

symbol

property

XCEL – Structuring Elements

<startposition><length>

Byte offsets: 1000, 1248

Truly binary files: Most sound, image formats

Binary addressable files: PDF, Max
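As a rough illustration of addressing by `<startposition><length>` with byte offsets, the sketch below reads named items from a binary block using only a start position and a length. The table layout, the item names and the sample bytes are hypothetical; this is not XCEL syntax, only a stand-in for the idea that a general-purpose program is driven by a declarative description.

```python
import struct

# Hypothetical stand-in for a format description: each item is
# (name, start position, length, struct format). Not XCEL syntax.
HEADER_SPEC = [
    ("magic", 0, 2, "2s"),
    ("width", 2, 2, "<H"),
    ("height", 4, 2, "<H"),
]


def extract_items(data: bytes, spec):
    """Read every item named in spec from data via <start><length>."""
    result = {}
    for name, start, length, fmt in spec:
        chunk = data[start:start + length]
        if len(chunk) != length:
            raise ValueError(f"item {name!r} lies outside the file")
        (result[name],) = struct.unpack(fmt, chunk)
    return result


# Invented sample block: 2 magic bytes, then two little-endian 16-bit values.
sample = b"XY" + struct.pack("<HH", 640, 480)
print(extract_items(sample, HEADER_SPEC))
```

The reader itself knows nothing about the format; swapping in a different table is enough to read a different layout.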

XCEL – Structuring Elements

<startposition><length>

Procedures: p(begin, trigger) q(trigger, filter, implication)

Encoded / mark up files: RTF, TeX, SVG, VRML …

XCEL – Structuring Elements

<startposition><length>

Procedures: p(current_Position, "<someTag>").q("</someTag>", pair("<[a-zA-Z0-9]*>", "</&>"), implyBy("</someOtherTag>"))

Encoded / mark up files: RTF, TeX, SVG, VRML …
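One possible reading of the p/q procedures, sketched under assumptions: p locates the start of a range at a trigger, q locates its end at a closing trigger, the pair filter skips nested balanced tag pairs, and implyBy lets the closing tag of an enclosing element end the range. The function `find_range` and its parameters are hypothetical and are not taken from the XCEL definition.

```python
import re

# Matches simple tags such as <b> or </b>; attributes are ignored in this sketch.
TAG = re.compile(r"<(/?)([a-zA-Z0-9]+)>")


def find_range(text, begin, tag, implied_by=()):
    """Return (start, length) of the content of the first <tag> at or after begin.

    Nested balanced pairs are skipped; a closing tag listed in implied_by
    also ends the range. Returns None if no range is found.
    """
    open_tag = f"<{tag}>"
    pos = text.find(open_tag, begin)
    if pos < 0:
        return None
    start = pos + len(open_tag)
    open_pairs = []  # names of nested tags opened inside the range
    for match in TAG.finditer(text, start):
        is_close, name = match.group(1) == "/", match.group(2)
        if not is_close:
            open_pairs.append(name)
        elif open_pairs and open_pairs[-1] == name:
            open_pairs.pop()  # a nested pair is complete: skip it
        elif not open_pairs and (name == tag or name in implied_by):
            return start, match.start() - start
    return None


doc = "<doc><someTag>a <b>bold</b> <someTag>inner</someTag> z</someTag></doc>"
start, length = find_range(doc, 0, "someTag")
print(doc[start:start + length])
```

The result is again a start position and a length, so ranges in markup files can be handled like byte-offset ranges once they have been located.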

Building Block II: XCDL

A language, which allows a program to describe "any file content" using a

==> "eXtensible Characterisation Definition Language"

Formulate the content of any file in an abstract language, which captures the complete information contained in it.

General enough that any existing content can be expressed in it.

XCDL: Basic Architecture

1. Sequences of bytes

2. With properties applicable to subsequences

XCDL: Basic Architecture

Ashes to Ashes once more

<data id="1"> {\rtf1\ansi\ansicpg1252\deff0\deflang1031{\fonttbl{\f0\fswiss\fcharset0 Arial;}}\viewkind4\uc1\pard\f0\fs20 \b Ashes\b0 to \b Ashes\b0 once \b more\b0.\par} </data>

XCDL: Basic Architecture

<normData id="1" type="text"> Ashes to Ashes once more. </normData>

<property id="5" source="raw" cat="descr">
  <name>boldFace</name>
  <valueSet id="1">
    <rawVal>Ashes</rawVal>
    <dataRef ind="normSpecific">
      <ref id="1" start="0" end="4"/>
      <ref id="1" start="9" end="13"/>
    </dataRef>
  </valueSet>
  <valueSet id="2">
    <rawVal>more</rawVal>
    <dataRef ind="normSpecific">
      <ref id="1" start="20" end="23"/>
    </dataRef>
  </valueSet>
</property>
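To make the role of the `ref` elements concrete, the sketch below resolves each `start`/`end` pair against the normalised text. It assumes that `end` is an inclusive, zero-based character index (which matches the values on the slide), drops the padding spaces around the normData text, and wraps the fragment in a hypothetical `<xcdl>` root element so that it parses as one document.

```python
import xml.etree.ElementTree as ET

# The slide's fragment, wrapped in an assumed <xcdl> root element.
XCDL_FRAGMENT = """
<xcdl>
  <normData id="1" type="text">Ashes to Ashes once more.</normData>
  <property id="5" source="raw" cat="descr">
    <name>boldFace</name>
    <valueSet id="1">
      <rawVal>Ashes</rawVal>
      <dataRef ind="normSpecific">
        <ref id="1" start="0" end="4"/>
        <ref id="1" start="9" end="13"/>
      </dataRef>
    </valueSet>
    <valueSet id="2">
      <rawVal>more</rawVal>
      <dataRef ind="normSpecific">
        <ref id="1" start="20" end="23"/>
      </dataRef>
    </valueSet>
  </property>
</xcdl>
"""


def resolve_refs(xml_text):
    """Return (property name, start, end, referenced text) for every ref."""
    root = ET.fromstring(xml_text)
    norm = {node.get("id"): node.text for node in root.iter("normData")}
    resolved = []
    for prop in root.iter("property"):
        name = prop.findtext("name")
        for ref in prop.iter("ref"):
            start, end = int(ref.get("start")), int(ref.get("end"))
            # end is treated as inclusive, hence the + 1
            resolved.append((name, start, end, norm[ref.get("id")][start:end + 1]))
    return resolved


for row in resolve_refs(XCDL_FRAGMENT):
    print(row)
```

Each resolved substring equals the `rawVal` of its value set, which is a simple consistency check on the description.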

XCDL: Basic Architecture

Assumption 1: A file format is a set of rules which formalize all knowledge needed to process the binary information contained within a distinct and complete block of binary information, traditionally called a file.

XCDL: Basic Architecture

Assumption 2: The extensible characterisation extraction language is designed to be able to express all such rules within a given file format. The extensible characterisation definition language is designed to be able to describe all the information contained within a file the format of which is described by a valid XCEL description.

XCDL: Basic Architecture

Assumption 3: A specific XCEL description is not required to express all the rules within a specific file format. An XCDL derived from such a partial XCEL will, therefore, potentially also contain only part of the information of a file encoded in that format.

Even when the XCEL describes a format completely, an extractor is not required to extract all characteristics of a file.

Some characteristics are only important for processing: the compression method is no longer important once decompression has succeeded.

Building Block III: Metrics

Starting in month 13.

However ...

Metrics: Basic Assumptions

Currently a bottom-up approach:

Observe characteristics occurring within files …

… and build name libraries from them.

{"color depth", "# of planes"} => colorDepth
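A name library of this kind can be pictured as a lookup from the names observed in files to one canonical name. The sketch below is only an illustration: the dictionary layout and the case-insensitive matching are assumptions, and only the colorDepth entry comes from the slide.

```python
# Canonical name -> names observed in files (entry taken from the slide).
NAME_LIBRARY = {
    "colorDepth": {"color depth", "# of planes"},
}


def build_lookup(library):
    """Invert the library so each observed name points at its canonical name."""
    lookup = {}
    for canonical, observed_names in library.items():
        for raw in observed_names:
            lookup[raw.strip().lower()] = canonical
    return lookup


LOOKUP = build_lookup(NAME_LIBRARY)


def canonical_name(raw):
    """Return the canonical name for an observed one, or None if unknown."""
    return LOOKUP.get(raw.strip().lower())


print(canonical_name("Color Depth"))
print(canonical_name("# of planes"))
```

Names that are not yet in the library come back as None, which marks them as candidates for the next round of bottom-up observation.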

Metrics: Basic Assumptions

Later, a parallel top-down approach:

Create file characteristics ontology …

… and link it to the name libraries.

"width" in image file != "width" in text file.

Metrics: Example I

Percentage of bytes in a binary stream which are preserved within a range of +/- 5 of the original value.

(Images: Would scarcely be observable on screen.)

E.g. relevant when a colorspace appropriate for printing is transformed into a colorspace optimized for screen.
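A minimal sketch of this metric, assuming both streams have the same length and are compared position by position (the slide does not say how streams of different length would be aligned). The function name and the sample values are invented for illustration.

```python
def preserved_within(original: bytes, migrated: bytes, tolerance: int = 5) -> float:
    """Percentage of bytes whose value differs by at most `tolerance`."""
    if len(original) != len(migrated):
        raise ValueError("streams must have equal length for a bytewise comparison")
    if not original:
        return 100.0
    kept = sum(abs(a - b) <= tolerance for a, b in zip(original, migrated))
    return 100.0 * kept / len(original)


# Invented sample values: differences of 2, 6, 0 and 5.
before = bytes([10, 20, 30, 200])
after = bytes([12, 26, 30, 195])
print(preserved_within(before, after))  # 3 of 4 bytes are within +/- 5
```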

Metrics: Example II

Degree to which the font applied recreates the original typesetting characteristics.

(Texts: Derived metric from a comparison of font metrics.)

Metrics: Problem

The problem is not so much the individual metrics but the summation rules.

An image migration step preserves 98 % of the image bytes within +/- 1 %.

It also preserves 5 of 20 (= 25 %) boolean properties (creator, scanning equipment …).

Quality of the migration: (0.98 + 0.25) / 2 = 0.615?

Metrics: Problem

Possible solution: Weight engineering metrics by "arbitrary" weights derived from PP.

An image migration step preserves 98 % of the image bytes within +/- 1 %.

It also preserves 5 of 20 (= 25 %) boolean properties (creator, scanning equipment …).

Quality of the migration: (0.98*w1 + 0.25*w2) / 2 =
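The weighted combination can be sketched as follows. The function follows the slide's formula literally, dividing by the number of metrics rather than by the sum of the weights. The weight values in the second call are invented for illustration and are not derived from any preservation plan.

```python
def migration_quality(scores, weights=None):
    """Combine per-metric scores (0..1) as sum(score * weight) / number of metrics."""
    if weights is None:
        weights = [1.0] * len(scores)  # unweighted case from the previous slide
    if len(weights) != len(scores):
        raise ValueError("one weight per score is required")
    if not scores:
        raise ValueError("at least one score is required")
    return sum(s * w for s, w in zip(scores, weights)) / len(scores)


scores = [0.98, 0.25]  # byte preservation, boolean properties
print(round(migration_quality(scores), 4))              # unweighted mean
print(round(migration_quality(scores, [1.5, 0.5]), 4))  # hypothetical weights
```

With equal weights of 1 this reduces to the plain average, so the weights are the only place where the relative importance of image bytes and boolean properties enters the result.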

Thank you!
