Tutorial on Standoff Markup as used in: HCRC Map Task Corpus MATE/NITE Workbench Amy Isard HCRC Language Technology Group University of Edinburgh

Dec 16, 2015

Page 1

Tutorial on Standoff Markup
as used in:

HCRC Map Task Corpus
MATE/NITE Workbench

Amy Isard
HCRC Language Technology Group
University of Edinburgh

Page 2

Standoff Annotation

• Don’t keep all your data in one big document

• One document for each annotation level (with its own DTD)

• Links between documents

Page 3

LTG link syntax (1)

• an element can point to one or more contiguous elements in the same or a different document

• each element is identified by a unique ID

• a link is shown as an attribute on an element

• default attributes in the DTD tell a program that this is a link

Page 4

LTG link syntax (2)

• attributes describing a link whose target will be embedded in the original element in the output document

    href     CDATA #IMPLIED
    xml:link CDATA #FIXED "simple"
    show     CDATA #FIXED "embed"
    actuate  CDATA #FIXED "auto"

Page 5

Standoff Example (1): Words XML

<!DOCTYPE words SYSTEM "words.dtd">
<words>
  <word id="w1">turn</word>
  <word id="w2">right</word>
  <word id="w3">for</word>
  <word id="w4">three</word>
  <word id="w5">centimetres</word>
  <word id="w6">okay</word>
</words>

Page 6

Standoff Example (2): Moves XML

<!DOCTYPE moves SYSTEM "moves.dtd">
<moves>
  <move type="instruct" speaker="spk1" id="m1" href="words.xml#id(w1)..id(w5)"/>
  <move type="align" speaker="spk1" id="m2" href="words.xml#id(w6)"/>
  …
</moves>
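The href values in this example follow the pattern document#id(first)..id(last), with the ..id(last) part left out when a single element is meant. A minimal Python sketch of splitting such a value into its parts is given here; the function name parse_ltg_href and the regular expression are illustrative assumptions and are not part of the LTXML tools.

```python
import re

# Hypothetical pattern for LTG-style links such as
# "words.xml#id(w1)..id(w5)" or "words.xml#id(w6)".
_HREF = re.compile(
    r"^(?P<doc>[^#]*)#id\((?P<start>[^)]+)\)(?:\.\.id\((?P<end>[^)]+)\))?$"
)

def parse_ltg_href(href):
    """Split an href into (document, first id, last id)."""
    match = _HREF.match(href)
    if match is None:
        raise ValueError(f"not an LTG-style link: {href!r}")
    start = match.group("start")
    # A link without "..id(...)" points at a single element.
    end = match.group("end") or start
    return match.group("doc"), start, end

print(parse_ltg_href("words.xml#id(w1)..id(w5)"))
print(parse_ltg_href("words.xml#id(w6)"))
```

For the two moves above this gives the range w1 to w5 for m1 and the single element w6 for m2.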

Page 7

Standoff Example (3): Moves and Words XML

<!DOCTYPE words SYSTEM "words.dtd">
<words>
  <word id="w1">turn</word>
  <word id="w2">right</word>
  <word id="w3">for</word>
  <word id="w4">three</word>
  <word id="w5">centimetres</word>
  <word id="w6">okay</word>
</words>

<!DOCTYPE moves SYSTEM "moves.dtd">
<moves>
  <move type="instruct" speaker="spk1" id="m1" href="words.xml#id(w1)..id(w5)"/>
  <move type="align" speaker="spk1" id="m2" href="words.xml#id(w6)"/>
  …
</moves>
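How the two documents fit together can be sketched with Python's standard library. This is only an illustration under stated assumptions: both documents are held as strings with the DOCTYPE lines and the elided "…" content left out, every link is assumed to point into the words document, and words_for_moves is a hypothetical helper name.

```python
import re
import xml.etree.ElementTree as ET

WORDS_XML = """<words>
  <word id="w1">turn</word>
  <word id="w2">right</word>
  <word id="w3">for</word>
  <word id="w4">three</word>
  <word id="w5">centimetres</word>
  <word id="w6">okay</word>
</words>"""

MOVES_XML = """<moves>
  <move type="instruct" speaker="spk1" id="m1" href="words.xml#id(w1)..id(w5)"/>
  <move type="align" speaker="spk1" id="m2" href="words.xml#id(w6)"/>
</moves>"""

# Captures the first id and, if present, the last id of a link.
LINK = re.compile(r"#id\(([^)]+)\)(?:\.\.id\(([^)]+)\))?$")

def words_for_moves(moves_xml, words_xml):
    """Map each move id to the text of the words its link covers."""
    words = list(ET.fromstring(words_xml))
    position = {w.get("id"): i for i, w in enumerate(words)}
    result = {}
    for move in ET.fromstring(moves_xml).iter("move"):
        start, end = LINK.search(move.get("href")).groups()
        first = position[start]
        last = position[end or start]  # single-element link: last == first
        result[move.get("id")] = [w.text for w in words[first:last + 1]]
    return result

print(words_for_moves(MOVES_XML, WORDS_XML))
```

The words document is never modified; the moves level only refers to it by id.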

Page 8

Advantages of Standoff Annotation

• It is possible to have levels of annotation which have crossing branches (not normally possible in XML)

• New levels of annotation can be added without disturbing existing ones

• Editing one level of annotation has minimal knock-on effects on others

• People can work on different levels at the same time without worrying about creating different versions
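The point about crossing branches can be made concrete with two annotation levels that point into the same word sequence. In the Python sketch below the spans are invented for illustration and are not taken from the corpus; crosses is a hypothetical helper that reports when two spans overlap without one containing the other, which is the case a single XML tree cannot express.

```python
def crosses(a, b):
    """True if inclusive spans a and b overlap but neither contains the other."""
    (a_start, a_end), (b_start, b_end) = a, b
    overlap = a_start <= b_end and b_start <= a_end
    a_in_b = b_start <= a_start and a_end <= b_end
    b_in_a = a_start <= b_start and b_end <= a_end
    return overlap and not (a_in_b or b_in_a)

# Hypothetical spans over word positions (illustrative values only).
move = (1, 5)        # a move covering words 1 to 5
disfluency = (4, 7)  # a disfluency covering words 4 to 7

print(crosses(move, disfluency))  # overlapping, not nested
print(crosses(move, (2, 3)))      # nested inside the move
```

With standoff links, each level simply stores its own span, so the crossing pair causes no conflict.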

Page 9

Example Map Task Annotation Structure

[Figure: example annotation of a dialogue between two speakers (S1, S2) on four levels: Dialogue Games (Game instruct), Dialogue Moves (instruct, ack, align), Disfluencies (a disfluency with reparandum and repair) and Words (fragments include "turn right for", "three centimetres okay", "three or four centimetres okay", "right").]

Page 10

HCRC Map Task XML Corpus Architecture

[Figure: corpus architecture diagram with the annotation levels Gaze, Timed Units, Tokens, Tagged Words, Automatic Syntax, Moves, Games, Transactions, Disfluencies, Landmark References and Other Speaker’s Words.]

Page 11

Tools and Software

• LTXML tools: www.ltg.ed.ac.uk/software

• MATE workbench (NITE): mate.nis.sdu.dk (nite.nis.sdu.dk)

• Map Task XML: www.hcrc.ed.ac.uk/maptask

Page 12

knit

• Part of the LTXML toolkit

• Allows you to “expand” links according to how they have been defined in the DTD (e.g. replace or embed)

• Command line program, can be used in pipelines

Page 13

Standoff Example (3): Moves and Words XML

<!DOCTYPE words SYSTEM "words.dtd">
<words>
  <word id="w1">turn</word>
  <word id="w2">right</word>
  <word id="w3">for</word>
  <word id="w4">three</word>
  <word id="w5">centimetres</word>
  <word id="w6">okay</word>
</words>

<!DOCTYPE moves SYSTEM "moves.dtd">
<moves>
  <move type="instruct" speaker="spk1" id="m1" href="words.xml#id(w1)..id(w5)"/>
  <move type="align" speaker="spk1" id="m2" href="words.xml#id(w6)"/>
  …
</moves>

Page 14

Standoff Example (4): Moves XML with embed links

<!DOCTYPE moves SYSTEM "moves.dtd">
<moves>
  <move type="instruct" speaker="spk1" id="m1" href="words.xml#id(w1)..id(w5)">
    <word id="w1">turn</word>
    <word id="w2">right</word>
    <word id="w3">for</word>
    <word id="w4">three</word>
    <word id="w5">centimetres</word>
  </move>
  <move type="align" speaker="spk1" id="m2" href="words.xml#id(w6)">
    <word id="w6">okay</word>
  </move>
  …
</moves>

Page 15

Standoff Example (4): Moves XML with replace links

<!DOCTYPE moves SYSTEM "moves.dtd">
<moves>
  <word id="w1">turn</word>
  <word id="w2">right</word>
  <word id="w3">for</word>
  <word id="w4">three</word>
  <word id="w5">centimetres</word>
  <word id="w6">okay</word>
  …
</moves>
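The difference between the embed and replace expansions can be sketched in Python. This is not the knit program and it does not read the DTD: the expand function and its mode argument are hypothetical stand-ins for the two behaviours, and links are resolved against an in-memory copy of the words document with the elided "…" content left out.

```python
import re
import xml.etree.ElementTree as ET

WORDS = ET.fromstring(
    '<words><word id="w1">turn</word><word id="w2">right</word>'
    '<word id="w3">for</word><word id="w4">three</word>'
    '<word id="w5">centimetres</word><word id="w6">okay</word></words>'
)
MOVES_XML = (
    '<moves>'
    '<move type="instruct" speaker="spk1" id="m1" href="words.xml#id(w1)..id(w5)"/>'
    '<move type="align" speaker="spk1" id="m2" href="words.xml#id(w6)"/>'
    '</moves>'
)
LINK = re.compile(r"#id\(([^)]+)\)(?:\.\.id\(([^)]+)\))?$")

def targets(href):
    """Return copies of the word elements an href points to."""
    start, end = LINK.search(href).groups()
    words = list(WORDS)
    ids = [w.get("id") for w in words]
    first, last = ids.index(start), ids.index(end or start)
    return [ET.fromstring(ET.tostring(w)) for w in words[first:last + 1]]

def expand(moves_xml, mode):
    """'embed' nests the targets in each move; 'replace' swaps the move for them."""
    source = ET.fromstring(moves_xml)
    out = ET.Element(source.tag)
    for move in source:
        linked = targets(move.get("href"))
        if mode == "embed":
            kept = ET.SubElement(out, move.tag, move.attrib)
            kept.extend(linked)
        elif mode == "replace":
            out.extend(linked)
        else:
            raise ValueError(f"unknown mode: {mode!r}")
    return out

print(ET.tostring(expand(MOVES_XML, "embed"), encoding="unicode"))
print(ET.tostring(expand(MOVES_XML, "replace"), encoding="unicode"))
```

In embed mode the move elements survive and gain word children; in replace mode only the word elements remain under the root.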

Page 16

Working with knit

• Use knit on one XML document to work with one hierarchical view of the data

• To work across hierarchies, knit several views and navigate using the structures plus the unique ids of elements

Page 17

Stylesheets

• stylesheet: template rules
  – pattern which specifies which tree it applies to
  – pattern which specifies which tree it should output

• stylesheet processor
  – reads XML document and stylesheet
  – carries out the instructions in the stylesheet
  – outputs a new XML document or …

Page 18

Template Matching

• XPath is a language for addressing parts of an XML document, and is used by XSLT in the match attribute of a template, e.g. <template match="sentence"> matches any sentence element.

• A stylesheet processor goes through the XML document matching elements to templates and carries out the instructions in the template.

Page 19

Standard Stylesheet Example

<template match="dial">
  <table> <apply-templates/> </table>
</template>

<template match="move">
  <tr> <apply-templates/> </tr>
</template>

<template match="word">
  <td> <apply-templates/> </td>
</template>
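The effect of these three templates can be approximated in plain Python. The sketch below is not an XSLT processor: the RULES table and the transform function are illustrative assumptions that mimic matching on the element name and recursing where apply-templates appears, and the input document with a dial root is invented for the example.

```python
import xml.etree.ElementTree as ET

# Maps a matched element name to the element its template outputs.
RULES = {"dial": "table", "move": "tr", "word": "td"}

def transform(node):
    """Apply the rule for this element, then recurse into its children."""
    out = ET.Element(RULES[node.tag])
    # Copy text content through (whitespace trimmed for simplicity).
    if node.text and node.text.strip():
        out.text = node.text.strip()
    # Counterpart of <apply-templates/>: process the children in order.
    for child in node:
        out.append(transform(child))
    return out

DIAL = ET.fromstring(
    '<dial><move><word>turn</word><word>right</word></move>'
    '<move><word>okay</word></move></dial>'
)
print(ET.tostring(transform(DIAL), encoding="unicode"))
```

Each move becomes a table row and each word a table cell, which is the display the stylesheet describes.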

Page 20

The MATE Workbench

• For display, querying, and especially annotation of XML corpora

• Flexible user-defined user interfaces

• Uses stylesheets to create Java display objects which have defined user interface behaviours

• In MATE internal data representation, elements with link pointers are viewed as parent elements

Page 21

MATE query language

• Easy to write queries over more than one hierarchy

• In MATE query language you define variables by element type and then relationships between them

• ($a ^ $b) means that element $a is a parent of element $b, either in the same document, or via a link.

Page 22

MATE example query

• Find all words which are in a move whose label is “instruct” and which are part of a disfluency

($w word)($m move)($d disfluency);
($m ^ $w) and ($m label ~ instruct) and ($d ^ $w)
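To make the meaning of the query concrete, it can be mimicked in Python over a toy data set. Nothing below uses the MATE workbench: the dictionaries, the ids and the dominated-word sets are invented, the ^ relation is reduced to set membership, and the ~ comparison is assumed here to be simple equality.

```python
# Hypothetical data: each move or disfluency lists the ids of the words it dominates.
MOVES = [
    {"id": "m1", "label": "instruct", "words": {"w1", "w2", "w3", "w4", "w5"}},
    {"id": "m2", "label": "align", "words": {"w6"}},
]
DISFLUENCIES = [
    {"id": "d1", "words": {"w4", "w5"}},
]
WORD_IDS = ["w1", "w2", "w3", "w4", "w5", "w6"]

def instruct_words_in_disfluency(word_ids, moves, disfluencies):
    """Words $w with ($m ^ $w), $m labelled instruct, and ($d ^ $w)."""
    hits = []
    for w in word_ids:
        in_instruct = any(
            m["label"] == "instruct" and w in m["words"] for m in moves
        )
        in_disfluency = any(w in d["words"] for d in disfluencies)
        if in_instruct and in_disfluency:
            hits.append(w)
    return hits

print(instruct_words_in_disfluency(WORD_IDS, MOVES, DISFLUENCIES))
```

The move and the disfluency belong to different hierarchies, yet both conditions are checked against the same word ids.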

Page 23

Conclusions

• Standoff markup is not just theoretically a good idea

• Map Task standoff annotations in place for 5 years, used regularly

• Accessible to linguists with modest technical backgrounds