Applied Linguistics, Sprach- und Literaturwissenschaften, Fachbereich 10

A brief introduction to the GeM annotation schema for complex document layout

John Bateman, Renate Henschel, Judy Delin

talk given by: Guowen Yang

Taipei, September 2002

Overview of Talk

Orientation:

– describing the approach to annotation of multimodal documents developed in the GeM project

• What is the GeM project?

– goals, methods, requirements

• Summary of annotation problems raised

• Annotation solutions adopted

The GeM project (‘Genre and Multimodality’)

– supported by the British Economic and Social Research Council (ESRC)

– Cooperation:

• University of Stirling

• University of Bremen

• Enterprise Information Design Unit

– Goal: to put the description of multimodal page-based documents on a sound empirical footing

The problem of data

– there is now much theorizing about how multimodal documents work

– but the empirical basis of this theorizing is often less than strong

Basic GeM hypotheses

– Documents belonging to different kinds of ‘genres’ will exhibit different kinds of multimodal patterns, just as text sorts exhibit different lexicogrammatical patterns

– It should be possible to map out these patterns for different genres

– There should be a regular relationship between genre-type and the patterns found

Requirement

– An annotated corpus needed to be constructed containing the extra information that we know/expect to be most useful in establishing descriptions of multimodal documents

– The extra information is then to serve as the basis for generalizations about genre

The problem of data selection and description

– what kinds of documents are we talking about?

– what kinds of annotation do we need?

– what kinds of documents are we talking about?

Any page-based medium which combines information from a variety of modalities in order to get its message across

Initial genres selected for the GeM corpus

– field guides (birds)

– instruction manuals (telephones)

– print newspapers

– electronic web-based versions of newspapers

Field guides

Instruction manuals

Print newspapers

Web-based newspapers

Motivations for selections

– all contain combinations of graphical, textual, photographic material

– all use the layout of these elements in complex ways

– for all the documents taken we were able to obtain feedback and discussion from their designers

Some Relations to Natural Language Processing

– it is our belief that we can approach the design and function of these documents using established linguistic techniques

– the ‘unit of analysis’ is scaled-up from the sentence or the text to the page (at least)

– given a formal specification of the motivation and realization of such documents, we can consider their automatic generation

– what kinds of annotation do we need?

The GeM annotation layers

• Content structure

• Rhetorical structure

• Layout structure

• Navigation structure

• Linguistic structure

[Figure: the five annotation layers shown again, with the added labels ‘genre’ and ‘form’]

Practical information required

– the GeM model also takes seriously the notion that the concrete, practical conditions of production (technology, material, time available, etc.) all contribute substantially to the properties of a genre

• Canvas constraints

• Production constraints

• Consumption constraints

Requirements

– From the GeM perspective, a page-based multimodal document requires analysis at (at least) these levels, taking into account the sources of constraint identified.

– Only then do we have enough information to consider:

• motivation of design

• critique of design and communicative effectiveness

• repurposing

• automatic generation

Pointers

– The assumptions made, and the particular layers of analysis adopted, are motivated and introduced at length in:

• Delin/Bateman/Allen: Information Design Journal

• Delin/Bateman: Document Design

– Details on the website http://purl.org/net/gem

Overview of Talk

Orientation:

– describing the approach to annotation of multimodal documents developed in the GeM project

• What is the GeM project?

– goals, methods, requirements

• Summary of annotation problems raised

• Annotation solutions adopted

Summary of annotation problems raised

– form of annotation to select

– criteria for recognising units

– multiple non-isomorphic intersecting hierarchies

– non-linear information

– complex query requirements

Annotation solutions adopted

– form of annotation to select

TEI: Text Encoding Initiative

CES: Corpus Encoding Standard

XCES: XML version of CES

GeM annotation scheme

Annotation solutions adopted

– criteria for recognising units

• basic vocabulary of the page: images, signs, sentences, numbers, ...

• layout units: hierarchy determined visually and by considering the degree to which elements ‘belong together’

• rhetorical structure: traditional analysis according to Mann & Thompson’s Rhetorical Structure Theory (RST)

• navigation units: elements pointing elsewhere in the document

Annotation solutions adopted

– multiple non-isomorphic intersecting hierarchies

• stand-off annotation...

XML stand-off annotation for encoding the GeM layers

• a single annotated file of ‘base’ elements

• several ‘stand-off’ layers of annotation

• a Document Type Definition (DTD) for each layer of annotation

• each annotation layer corresponds to a GeM analysis layer

GeM layers: the base file

<unit id="u-21.5">---------------</unit>
<unit id="u-21.6" src="gannet.jpg" alt="gannet-photo"/>
<unit id="u-21.7"> Huge (90cm) unmistakable seabird. </unit>
<unit id="u-21.8"> Watch for white, cigar-shaped body and long straight, slender, black-tipped wings. </unit>
<unit id="u-21.9"> In summer, yellow head of adult inconspicuous. </unit>
<unit id="u-21.10"> Plunges spectacularly for fish. </unit>
<unit id="u-21.11"> Sexes similar. </unit>

Basic ‘vocabulary’ of the page, segmented and numbered.

Actual ordering and positioning on the page irrelevant at this stage.

Predominantly ‘flat’ structure.
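
As a rough illustration of how flat the base file is, the units above can be read into a simple id-to-content table. This is only a sketch: the `<base>` wrapper element and the function name `load_base_units` are assumptions made so that the fragment parses on its own, and they are not taken from the GeM DTDs.

```python
import xml.etree.ElementTree as ET

# Base units copied from the slide. The <base> wrapper is an assumption,
# added only so that the fragment forms a single well-formed XML document.
BASE_XML = """
<base>
  <unit id="u-21.5">---------------</unit>
  <unit id="u-21.6" src="gannet.jpg" alt="gannet-photo"/>
  <unit id="u-21.7"> Huge (90cm) unmistakable seabird. </unit>
  <unit id="u-21.8"> Watch for white, cigar-shaped body and long straight, slender, black-tipped wings. </unit>
  <unit id="u-21.9"> In summer, yellow head of adult inconspicuous. </unit>
  <unit id="u-21.10"> Plunges spectacularly for fish. </unit>
  <unit id="u-21.11"> Sexes similar. </unit>
</base>
"""


def load_base_units(xml_text):
    """Map each base unit id to its text, or to its src attribute for image units."""
    root = ET.fromstring(xml_text)
    units = {}
    for unit in root.iter("unit"):
        text = (unit.text or "").strip()
        units[unit.get("id")] = text if text else unit.get("src", "")
    return units


base_units = load_base_units(BASE_XML)
for unit_id, content in base_units.items():
    print(unit_id, "->", content)
```

No ordering or position information is read here, which matches the point that layout is irrelevant at this stage.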

Distribution of information across layers

[Figure labels: base units; Layout; Semantic Content; RST segments; navigational elements; layout units]

Example: Derivation of layout structure

• Working visually from the page, decompose the objects on the page in terms of their visual unity

• Transform the page decomposition into a hierarchical structure

• Specify presentation information for units: e.g., font size, type, colour, image type, resolution, etc.

Complete layout structure

• provides a place for assigning specific information about the layout units

• contents given by collections of the base units of the page

GeM layers: layout units (1)

Layout unit content is defined by cross-references (xrefs) to base units

The text content shown here is not formally used and may be omitted

<layout-unit id="lay-flegg-text" xref="u-21.7 u-21.8 u-21.9 u-21.10 u-21.11">
  Huge (90cm) unmistakable seabird. Watch for white, cigar-shaped body and long straight, slender, black-tipped wings. In summer, yellow head of adult inconspicuous. Plunges spectacularly for fish. Sexes similar.
</layout-unit>
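
A minimal sketch of how such a stand-off reference might be resolved: the layout unit carries only ids, and its text is recovered by looking those ids up in the base file. The `<base>` wrapper and the helper name `resolve_layout_unit` are assumptions for illustration and are not part of the GeM scheme.

```python
import xml.etree.ElementTree as ET

# Base units from the base-file slide; the <base> wrapper is an assumption.
BASE_XML = """
<base>
  <unit id="u-21.7"> Huge (90cm) unmistakable seabird. </unit>
  <unit id="u-21.8"> Watch for white, cigar-shaped body and long straight, slender, black-tipped wings. </unit>
  <unit id="u-21.9"> In summer, yellow head of adult inconspicuous. </unit>
  <unit id="u-21.10"> Plunges spectacularly for fish. </unit>
  <unit id="u-21.11"> Sexes similar. </unit>
</base>
"""

# The layout unit from the slide, with its optional text content omitted.
LAYOUT_UNIT_XML = (
    '<layout-unit id="lay-flegg-text" '
    'xref="u-21.7 u-21.8 u-21.9 u-21.10 u-21.11"/>'
)


def resolve_layout_unit(layout_unit, base_index):
    """Join the text of every base unit named in the layout unit's xref list."""
    unit_ids = layout_unit.get("xref", "").split()
    return " ".join(base_index[unit_id] for unit_id in unit_ids)


# Index the base units by id, then follow the cross-references.
base_index = {
    unit.get("id"): (unit.text or "").strip()
    for unit in ET.fromstring(BASE_XML).iter("unit")
}
layout_unit = ET.fromstring(LAYOUT_UNIT_XML)
resolved_text = resolve_layout_unit(layout_unit, base_index)
print(resolved_text)
```

Because the text lives only in the base file, other layers can point at the same units without copying or re-nesting them.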

GeM layers: layout units (2)

Layout units carry the typographical details shared by the unit and its children

Layout units are again identified via cross-references

Typographical information is modelled on CSS and XSL-FO

<text xref="lay-21.12 lay-21.14 lay-21.16 lay-21.18 lay-21.20"
      font-family="sans-serif" font-size="10" font-style="normal"
      font-weight="bold" case="mixed" justification="right" color="black"/>
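
One way to use such a `<text>` element is to turn it into a lookup from layout-unit id to typographical properties. The sketch below assumes that every attribute other than `xref` is a typographical property; the function name `typography_index` is hypothetical.

```python
import xml.etree.ElementTree as ET

# The <text> element from the slide.
TEXT_XML = (
    '<text xref="lay-21.12 lay-21.14 lay-21.16 lay-21.18 lay-21.20" '
    'font-family="sans-serif" font-size="10" font-style="normal" '
    'font-weight="bold" case="mixed" justification="right" color="black"/>'
)


def typography_index(text_elements):
    """Map each cross-referenced layout-unit id to the properties it shares."""
    index = {}
    for element in text_elements:
        # Assumption: everything except xref describes typography.
        properties = {k: v for k, v in element.attrib.items() if k != "xref"}
        for layout_id in element.get("xref", "").split():
            index[layout_id] = properties
    return index


typography = typography_index([ET.fromstring(TEXT_XML)])
print(typography["lay-21.14"]["font-weight"])
```

All five referenced units share one property set, which reflects the idea that typographical details are stated once for the units they have in common.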

GeM layers: layout units (3)

<layout-root id="page-21">
  <layout-leaf xref="header-21"/>
  <layout-chunk id="body-21">
    <layout-leaf xref="lay-21.2"/>
    <layout-leaf xref="lay-21.3"/>
  </layout-chunk>
  <layout-leaf xref="page-no-21"/>
</layout-root>

Layout structure is recursive

page-21
  header-21
  body-21
    lay-21.2
    lay-21.3
  page-no-21
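
Because the layout structure is recursive, a short recursive walk is enough to recover the tree shown on the slide. This is a sketch only; the function name `outline` is hypothetical, and it assumes each node is labelled by its `id` or, for leaves, by its `xref`.

```python
import xml.etree.ElementTree as ET

# Layout structure copied from the slide.
LAYOUT_XML = """
<layout-root id="page-21">
  <layout-leaf xref="header-21"/>
  <layout-chunk id="body-21">
    <layout-leaf xref="lay-21.2"/>
    <layout-leaf xref="lay-21.3"/>
  </layout-chunk>
  <layout-leaf xref="page-no-21"/>
</layout-root>
"""


def outline(node, depth=0):
    """Return one indented line per node, labelled by its id or its xref."""
    label = node.get("id") or node.get("xref")
    lines = ["  " * depth + label]
    for child in node:
        lines.extend(outline(child, depth + 1))
    return lines


tree_lines = outline(ET.fromstring(LAYOUT_XML))
print("\n".join(tree_lines))
```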

Annotation solutions adopted

– non-linear information

• positioning of layout units within a page is specified two-dimensionally with respect to a generalized page model

• the page model decomposes the page area into a hierarchy of grids

• specifying the grid for a page is part of the annotation task.

Example: Derivation of layout structure

• Working visually from the page, decompose the objects on the page in terms of their visual unity

• Transform the page decomposition into a hierarchical structure

• Specify presentation information for units: e.g., font size, type, colour, image type, resolution, etc.

• Inspect the page for any local or global grid structure

• Relate layout units to grid positions

Complete layout structure + page model

• each sub-tree can additionally be assigned to a position in a hierarchically ordered page grid

GeM layers: area model

Layout units are related to identified elements of a hierarchical grid specified in the area model

<area-root id="page-frame" cols="1" rows="3" hspacing="100" vspacing="10 85 5" height="16cm" width="14cm">
  <sub-area id="body-frame" location="row-2" cols="2" rows="1" hspacing="50 50" vspacing="100"/>
</area-root>

<layout-root id="page-21">
  <layout-leaf xref="header-21" location="row-1" area-ref="page-frame"/>
  ...
</layout-root>

[Figure: page grid with dimension labels 14cm and 16cm and row proportions 10%, 85% and 5%]
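
To make the grid arithmetic concrete, the sketch below turns the area model into absolute sizes. It rests on two assumptions that the slide suggests but does not state: that `hspacing` and `vspacing` are percentages of the enclosing area, and that a sub-area fills the whole cell named by its `location`. The helper names `parse_cm` and `split_length` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Area model copied from the slide.
AREA_XML = """
<area-root id="page-frame" cols="1" rows="3" hspacing="100" vspacing="10 85 5" height="16cm" width="14cm">
  <sub-area id="body-frame" location="row-2" cols="2" rows="1" hspacing="50 50" vspacing="100"/>
</area-root>
"""


def parse_cm(value):
    """Convert a length such as '16cm' to a float number of centimetres."""
    if not value.endswith("cm"):
        raise ValueError(f"expected a length in cm, got {value!r}")
    return float(value[:-2])


def split_length(total_cm, percentages):
    """Divide a length according to a space-separated list of percentages."""
    return [total_cm * float(p) / 100 for p in percentages.split()]


page = ET.fromstring(AREA_XML)
page_width = parse_cm(page.get("width"))
page_height = parse_cm(page.get("height"))
row_heights = split_length(page_height, page.get("vspacing"))

# Assumption: the sub-area occupies the full row named in its location.
body = page.find("sub-area")
row_number = int(body.get("location").split("-")[1])  # "row-2" gives 2
body_height = row_heights[row_number - 1]
body_col_widths = split_length(page_width, body.get("hspacing"))

print("row heights (cm):", row_heights)
print("body-frame height (cm):", body_height)
print("body-frame column widths (cm):", body_col_widths)
```

Under these assumptions a layout unit placed at `location="row-1"` of `page-frame` would occupy the top 10% of the page height.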

Annotation solutions adopted

– complex query requirements

• XPath queries using standard tools
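
As an illustration of the kind of query meant here, the sketch below runs three path expressions over the layout structure from the earlier slide. It uses Python's `xml.etree.ElementTree`, which supports only a limited subset of XPath; the queries themselves are examples chosen for this sketch and are not taken from the GeM project.

```python
import xml.etree.ElementTree as ET

# Layout structure from the layout units (3) slide.
LAYOUT_XML = """
<layout-root id="page-21">
  <layout-leaf xref="header-21"/>
  <layout-chunk id="body-21">
    <layout-leaf xref="lay-21.2"/>
    <layout-leaf xref="lay-21.3"/>
  </layout-chunk>
  <layout-leaf xref="page-no-21"/>
</layout-root>
"""

root = ET.fromstring(LAYOUT_XML)

# Leaves nested directly inside any layout chunk.
leaves_in_chunks = [
    leaf.get("xref") for leaf in root.findall(".//layout-chunk/layout-leaf")
]

# Leaves attached directly to the page root.
top_level_leaves = [leaf.get("xref") for leaf in root.findall("./layout-leaf")]

# Existence test using an attribute predicate.
has_header = root.find(".//layout-leaf[@xref='header-21']") is not None

print(leaves_in_chunks)
print(top_level_leaves)
print(has_header)
```

Queries that need full XPath, for example ones combining several layers, would call for a more complete XPath engine than the one assumed here.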

Conclusions

• The annotation scheme allows detailed annotation of complex page-based documents

• Regularities can be sought using complex XPath queries

• The system is open-ended and extensible without any redefinition of existing resources

Ongoing Work

• Further collection and ongoing annotation of corpus

– http://purl.org/net/gem

• Use of results for criticism of document design and for exploring the relation between layout and rhetorical structure

– Delin/Bateman: Document Design, 2002

• Use of XPath queries within sequences of extensible style sheet transformations for automatic document generation

– Henschel/Bateman/Delin: KONVENS 2002, Saarbrücken

Future Work

• Extension of annotation schemes

– violations of the grid area model are currently handled by relative offsets; a more flexible approach is needed

– non-rectilinear grids for more complex design

– consideration of dynamic elements, animation, etc.

• Extension of genres considered

– advertisements

– scientific documents

• Extension of languages considered

Thank you!