LIRICS Deliverable D4.4
Multilingual test suites for semantically annotated data
Project reference number e-Content-22236-LIRICS
Project acronym LIRICS
Project full title Linguistic Infrastructure for Interoperable
Resource and Systems
Project contact point Laurent Romary, INRIA-Loria 615, rue du
jardin botanique BP101. 54602 Villers lès Nancy (France)
[email protected]
Project web site http://lirics.loria.fr
EC project officer Erwin Valentini
Document title Multilingual test suites for semantically
annotated data
Deliverable ID D4.4
Document type Report
Dissemination level Public
Contractual date of delivery M30
Actual date of delivery 30th June 2007
Status & version Final, version 3.0
Work package, task & deliverable responsible UtiL
Author(s) & affiliation(s) Harry Bunt, Olga Petukhova, and
Amanda Schiffrin, UtiL
Additional contributor(s)
Keywords Annotation, Semantic Representation, Test suites
Document evolution
Version Date
1.0 26th June 2007
2.0 28th August 2007
3.0 13th September 2007
1 Introduction
This document forms the last in the series of deliverables for
Work Package 4 in the LIRICS project. The first, D4.1 (Bunt and
Schiffrin, 2006), introduced the methodological factors which
should be taken into consideration when isolating appropriate
semantic concepts for representation; the second, D4.2 (Schiffrin
and Bunt, 2007), discussed some of the problems encountered in identifying commonalities in alternative approaches, suggested ways in which these problems might be solved, and presented a preliminary set of data categories for semantic annotation. The third, D4.3 (Schiffrin and Bunt, 2007), presented the final stage in the evolution of the preliminary data categories after these had been extensively discussed, applied and tested. This final document, D4.4 (Petukhova, Schiffrin and Bunt, 2007), is intended as a companion guide to the set of test suites and annotation guidelines which will be made available online.
We will present here the methods used in the production of the
annotation, as well as concrete examples of each concept, in pseudo
XML format for three different areas of semantic annotation:
Dialogue Act, Semantic Role and Reference annotation. Temporal
annotation was originally one of the areas under consideration, but since it has meanwhile been nominated by ISO as an area of particular interest in a new ISO work item, it was dropped in LIRICS: it would have been redundant to duplicate work that is being carried out elsewhere.
The rest of the document will be divided into the following
sections:
(Section 2) Dialogue Act Annotation.
(Section 3) Semantic Role Annotation.
(Section 4) Reference Annotation.
(Section 5) Concluding remarks.
(Appendices) - Extended examples of XML annotations;
- Guidelines for annotating dialogue acts, semantic roles, and reference relations.
Each of the sections concerning a specific semantic area of
annotation will be further divided into the following subsections
of information:
• Description of the annotation task.
• Description of the corpora used, with references and
statistics.
• Description of the annotation tool used; rationale behind the
choice of annotation tool; screenshots.
• Example XML annotation for each data category.
• Summary and discussion of the issues arising from the
annotation task, including occurrence figures.
This deliverable constitutes a report on the final state of play
at the 30-month stage of the project.
2 Dialogue Act Annotation
2.1 Annotation Task
The dialogue act annotation task involved two main activities:
- identification of the boundaries of functional segments with
at least one communicative function (segmentation task);
- assigning dialogue act tags (possibly multiple tags) to the
identified segments in multiple dimensions (classification
task).
Each dialogue was annotated by at least two different trained annotators, with the aim of estimating inter-annotator agreement and constructing a so-called ‘gold-standard’ annotation.
Annotators were provided with Annotation Guidelines for dialogue act annotation (see Appendix I.A).
2.2 Corpora
Dialogue act annotation was performed for three languages:
English, Dutch, and Italian.
For English selected dialogues from two dialogue corpora were
annotated: TRAINS 1 (5 dialogues; 349 utterances) and MapTask2 (2
dialogues; 386 utterances). Dialogues from both corpora are
two-agent human-human dialogues. TRAINS dialogues are
information-seeking dialogues where an information office assistant
is supposed to help a client in choosing the optimal transport
train connection. MapTask dialogues are so-called instructional dialogues, in which one participant plays the role of an instruction giver who navigates another participant, the instruction follower, through a route on a map. For both corpora
orthographic transcriptions for each individual speaker, including
word-level timings, were used.
For Dutch selected dialogues from two dialogue corpora were
annotated: DIAMOND3 (one extended dialogue; 301 utterances) and
Schiphol (Amsterdam Airport) Information Office (6 dialogues; 202
utterances). Dialogues from both corpora are two-agent human-human
dialogues. DIAMOND dialogues have an assistance-seeking nature with
one participant playing the role of an instructor explaining to the
user how to configure and operate a fax-machine. Schiphol
Information Office dialogues are information-seeking dialogues in which an assistant is asked to provide a client with information about airport activities and facilities (e.g. timetables, security, etc.). The original DIAMOND dialogue is pre-segmented per
dialogue utterance for each speaker with indication of utterance
start and end time. The original Schiphol dialogues are
pre-segmented per speaker turn without authentic turn timings.
For Italian 6 selected dialogues (393 utterances) from the SITAL
corpus were annotated. All dialogues are two-agent human-human
information-seeking dialogues. The SITAL corpus contains dialogues
between a travel agency operator and a person seeking travel information or wishing to book a ticket, a hotel room or a flight.
1 For more information about the TRAINS corpus please visit
http://www.cs.rochester.edu/research/speech/trains.html
2 Detailed information about the MapTask project can be found at
http://www.hcrc.ed.ac.uk/maptask/
3 See Geertzen et al. 2004
2.3 Annotation Tool
For the dialogue act annotation the ANVIL tool was used
(http://www.dfki.de/~kipp/ANVIL). The tool allows the
multidimensional segmentation of dialogue units into functional
segments and their annotation (labelling) in multiple dimensions
simultaneously. ANVIL also allows the annotator to mark up discontinuous segments and to re-segment pre-segmented dialogue units: some dialogues were presented in pre-segmented form, either per turn, as in the Dutch Schiphol Information Office corpus, or per utterance, as in the Dutch DIAMOND corpus, and using ANVIL annotators were able to cut such larger units into smaller functional segments.
Figure 1 shows the annotator’s interface of the ANVIL tool and
how it organizes the annotation work.
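To make the notion of a functional segment annotated in multiple dimensions more concrete, the following Python sketch represents one segment from the Dutch Schiphol corpus together with its communicative functions in three dimensions (cf. the /turnGive/ example in section 2.4). The representation and field names are illustrative assumptions only; the actual annotations are stored as ANVIL XML files (see Appendix I.B).

# Illustrative sketch only: a simplified in-memory view of one functional
# segment carrying communicative functions in several dimensions at once.
from dataclasses import dataclass, field

@dataclass
class FunctionalSegment:
    speaker: str
    start: float      # start time in seconds
    end: float        # end time in seconds
    text: str
    functions: dict = field(default_factory=dict)  # dimension -> communicative function

segment = FunctionalSegment(
    speaker="A1",
    start=0.0,
    end=0.50049,
    text="Schiphol Inlichtingen",
    functions={
        "Contact Management": "ContactIndication",
        "Turn Management": "TurnGive",
        "Social Obligation Management": "initialSelfIntroduction",
    },
)
print(segment.functions["Turn Management"])   # -> TurnGive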
2.4 Examples
In this section we illustrate the data categories defined for
dialogue acts with examples from the annotated corpora. Some of
these examples are also shown in the XML-representation extracted
from the original ANVIL-files in Appendix I.B.
/setQuestion/ Language English
Corpus TRAINS
Example Speaker Start time End time Utterance DA-label B1
58.05684 60.52592 How far is it from Avon
to Bath? Task: SetQuestion
/propositionalQuestion/ Language English
Corpus Map Task
Example Speaker Start time End time Utterance DA-label A1
15.41509 16.4828 Have you got a
graveyard in the middle? Task: PropositionalQuestion
/alternativesQuestion/ Language English
Corpus TRAINS
Example Speaker Start time End time Utterance DA-label A1
60.02543 62.99501 Do you wanna take the
boxcars with you or do you want to leave them in Elmira?
Task: AlternativesQuestion
/checkQuestion/ Language English
Corpus Map Task
Example Speaker Start time End time Utterance DA-label B1
22.68888 24.25708 Due south and then back
again? AutoFeedback: CheckQuestion
/indirectSetQuestion/ Language English
Corpus TRAINS
Example Speaker Start time End time Utterance DA-label B1
128.15881 129.46008 Then from Dansville to
Corning? (full form is: How far is it from Dansville to
Corning?)
Task: IndirectSetQuestion
/indirectPropositionalQuestion/ Language English
Corpus TRAINS
Example Speaker Start time End time Utterance DA-label
B1 67.29922 72.53768 I was wondering if we could actually pick up those two boxcars which are in Bath? Task: IndirectPropositionalQuestion
/indirectAlternativesQuestion/ Language Italian
Corpus SITAL
Example Speaker Start time End time Utterance DA-label A1
136.96742 138.73582 o viaggiamo la
mattina oppure il e pomeriggio VOC puff
Task: indirectAlternativesQuestion
/inform/ Language English
Corpus TRAINS
Example Speaker Start time End time Utterance DA-label A1
38.97149 38.97149 We can unload any
amount of cargo onto a train in one hour
Task: Inform
/agreement/ Language English
Corpus TRAINS
Example Speaker Start time End time Utterance DA-label B1
57.22269 58.95772 It’s five hours now that
we are back in Corning Task: Inform
A1 59.29138 59.55831 Yes Task: agreement
/disagreement/ Language English
Corpus Map Task
Example Speaker Start time End time Utterance DA-label A1 124.4
126.84 That’s a very
unnatural motion to
Task: disagreement
/correction/ Language English
Corpus Map Task
Example Speaker Start time End time Utterance DA-label
B1 139.50325 140.53759 Just a straight line along Partner Communication Management: Completion; Turn Management: TurnGrab
A1 141.50521 142.60628 No, this is a curve Task: Correction
/setAnswer/ Language English
Corpus TRAINS
Example Speaker Start time End time Utterance DA-label B1
15.68202 16.4828 What do we have here? Task: SetQuestion B2 16.4828
18.95189 We have three tankers
available in Corning Task: SetAnswer
/propositionalAnswer/ Language English
Corpus TRAINS
Example Speaker Start time End time Utterance DA-label B1
39.43861 39.43861 Will either one take me
any quicker? Task: PropositionalQuestion
A1 41.95354 42.17462 No Task: PropositionalAnswer
/confirm/ Language English
Corpus Map Task
Example Speaker Start time End time Utterance DA-label B1
43.10887 44.17658 Slightly northeast AutoFeedback:
CheckQuestion A1 44.74381 45.64469 Yeah, very slightly
AlloFeedback:
Confirm
/disconfirm/ Language English
Corpus TRAINS
Example Speaker Start time End time Utterance DA-label A1 50.049
52.18442 If you are trying to fill
how many tankers three?
Task: CheckQuestion
B1 52.78501 53.85272 No, just one transport Task: Disconfirm
/instruct/ Language English
Corpus Map Task
Example Speaker Start time End time Utterance DA-label A1
72.43759 75.50726 You should be avoiding
that by quite a distance Task: Instruct
/suggest/ Language Dutch
Corpus DIAMOND
Example Speaker Start time End time Utterance DA-label A1
566.98844 571.42612 maar ik denk dat ik
anders misschien opdracht 3 alvast daarbij mee moet nemen en dat
ik op het einde terugga
Discourse Structuring: Suggest
/request/ Language English
Corpus TRAINS
Example Speaker Start time End time Utterance DA-label B1
167.69753 169.73283 So, did you wanna
repeat that plan? AlloFeedback: Request
/acceptRequest/ Language English
Corpus TRAINS
Example Speaker Start time End time Utterance DA-label B1
167.69753 169.73283 So, did you wanna
repeat that plan? AlloFeedback: Request
A1 170.43353 170.7672 okay AutoFeedback: AcceptRequest
/declineRequest/ Language Dutch
Corpus Schiphol
Example Speaker Start time End time Utterance DA-label A1
44.97737 48.97835 kunt u nog eens
zeggen van waar naar waar u wilt reizen
Task: request
B1 49.18148 50.049 Nee Task: declineRequest
/promise/
Language Dutch
Corpus Schiphol
Example Speaker Start time End time Utterance DA-label A1
29.16188 30.1295 Ik zal hem om laten
roepen Task: Promise
/offer/ Language English
Corpus TRAINS
Example Speaker Start time End time Utterance DA-label A1
0.43376 1.23454 Can I help you? Task: Offer
/acceptOffer/
Language English
Corpus TRAINS
Example Speaker Start time End time Utterance DA-label A1
0.43376 1.23454 Can I help you? Task: Offer B1 1.6683 1.96859 yeah
Task: AcceptOffer
/declineOffer/
Language Dutch
Corpus Schiphol
Example Speaker Start time End time Utterance DA-label A1
16.57476 18.34549 Wilt u nog een andere
verbinding weten
Task: Offer
B1 18.34549 19.16448 nee Task: declineOffer
/positiveAutoFeedback/ Language English
Corpus Map Task
Example Speaker Start time End time Utterance DA-label A1
254.18219 256.2175 And above that
there’s an east lake Task: Inform
B1 256.68463 257.21848 Oh, right AutoFeedback:
positiveAutoFeedback
/negativeAutoFeedback/ Language Italian
Corpus SITAL
Example Speaker Start time End time Utterance DA-label
A1 37.13636 38.00387 mi scusi, mi ha detto alle dieci o alle
dodici e cinquantacinque
Auto Feedback: negativeAutoFeedback
Language English
Corpus TRAINS
Example Speaker Start time End time Utterance DA-label
A1 29.067310 30.471672 I'm sorry what was the next question
Auto Feedback: negativeAutoFeedback
/feedbackElicitation/ Language English
Corpus Map Task
Example Speaker Start time End time Utterance DA-label A1 503.76
504.72 Does that make sense Allo Feedback:
feedbackElicitation
/negativeAlloFeedback/ Language Dutch
Corpus DIAMOND
Example Speaker Start time End time Utterance DA-label B1
741.19232 750.86846 je kunt als
alternatief voor het faxnummer gewoon intypen met de
cijfertoetsen kun je dus ook naamtoetsen gebruiken... of verkorte
kiescodes
Task: Instruct
A1 753.30418 754.10497 maar dat wil ik niet AlloFeedback:
negativeAlloFeedback
/positiveAlloFeedback/ Language English
Corpus TRAINS
Example Speaker Start time End time Utterance DA-label
A1 170.7672 187.31673 engine E one goes to Dansville picks up three boxcars goes to Corning is loaded with oranges and goes to Bath engine E two picks takes the two boxcars at Elmira to Corning where they're loaded with oranges and then takes them to Bath AutoFeedback: Inform
B1 187.88396 188.11751 Okay AlloFeedback: positiveAlloFeedback
/turnKeep/ Language English
Corpus Map Task
Example Speaker Start time End time Utterance DA-label A1
122.88698 123.05381 So Turn Management:
TurnTake/TurnKeep
/turnGive/ Language Dutch
Corpus Schiphol
Example Speaker Start time End time Utterance DA-label A1 0
0.50049 Schiphol Inlichtingen Contact
Management: ContactIndication; Turn Management: TurnGive; Social
Obligation Management: initialSelfIntroduction
/turnAccept/ Language Dutch
Corpus Schiphol
Example Speaker Start time End time Utterance DA-label
A1 0 0.50049 Schiphol Inlichtingen Contact Management: ContactIndication; Turn Management: TurnGive; Social Obligation Management: initialSelfIntroduction
B1 0.50049 0.56722 Ja Contact Management: ContactIndication; Turn Management: TurnAccept
/turnTake/ Language English
Corpus Map Task
Example Speaker Start time End time Utterance DA-label
A1 122.88698 123.05381 So Turn Management: TurnTake/TurnKeep
/turnGrab/ Language English
Corpus Map Task
Example Speaker Start time End time Utterance DA-label A1
148.41197 148.41197 Which we’re going to
pass on the south Task: Instruct; Turn Management: TurnGrab
/turnRelease/
Language
Corpus
Example In two-agent dialogues, turn transitions from one participant to another differ from those in, for example, multi-party interaction, and are as a rule smoother. One participant normally either gives the turn to his/her partner or keeps the turn. When the speaker wants to give the partner the opportunity to take the turn, he/she typically does so implicitly, by simply stopping talking.
/stalling/ Language English
Corpus TRAINS
Example Speaker Start time End time Utterance DA-label A1
81.44641 87.28546 Then six a.m. at
Dansville , nine a.m. at Avon
Task: Inform
A2 84.8831 86.11765 um … Time Management: stalling; Turn
Management: TurnKeep
/pausing/
Language English
Corpus TRAINS
Example Speaker Start time End time Utterance DA-label B1
94.39241 94.75944 Wait a second Time Management:
pausing; Turn Management: TurnKeep
/completion/
Language English
Corpus TRAINS
Example Speaker Start time End time Utterance DA-label A1
55.98815 59.8586 That’ll get there at four
into Corning Task: Inform
B1 59.09119 60.92632 And load up Partner Communication
Management: completion
/correctMisspeaking / Language English
Corpus TRAINS
Example Speaker Start time End time Utterance DA-label A1
42.80858 44.97737 second engine E3 is
going to uhh city H to pick up the bananas, back to A, dro…
Task: Inform
B1 44.97737 45.97835 H to pick up the oranges
Partner Communication Management: correctMisspeaking
/signalSpeakingError/ Language English
Corpus Map Task
Example Speaker Start time End time Utterance DA-label B1
93.49153 93.82519 Oh oh Own Communication
Management: signalSpeakingError
/selfCorrection/ Language English
Corpus TRAINS
Example Speaker Start time End time Utterance DA-label A1
58.05684 60.52592 How far is it from
Avon to Bath? Task: SetQuestion
A2 60.65939 61.22661 to Corning Own Communication Management:
SelfCorrection
/contactIndication/ Language English
Corpus TRAINS
Example Speaker Start time End time Utterance DA-label A1
0.33366 0.36703 Hi Contact
Management: contactIndication; Social Obligation Management:
initialGreeting
/contactCheck/ Language Dutch
Corpus DIAMOND
Example Speaker Start time End time Utterance DA-label A1
1108.92004 1109.57996 eh jeroen? Contact
Management: contactCheck
/interactionStructuring/ Language English
Corpus Map Task
Example Speaker Start time End time Utterance DA-label A1
0.33366 1.36801 Starting off Discourse Structure
Management: interactionStructuring
Language Dutch
Corpus DIAMOND
Example Speaker Start time End time Utterance DA-label A1
306.48999 307.94 he Jeroen ik heb nog
een vraag Discourse Structure Management:
interactionStructuring
/initialGreeting/ Language English
Corpus TRAINS
Example Speaker Start time End time Utterance DA-label A1
0.33366 0.36703 Hi Contact
Management: contactIndication; Social Obligation Management:
initialGreeting
Language Italian
Corpus SITAL
Example Speaker Start time End time Utterance DA-label A1
0.36703 0.66732 buongiorno Social Obligation
Management: initialGreeting
/returnGreeting/
Language English
Corpus TRAINS
Example Speaker Start time End time Utterance DA-label A1
0.33097 0.37163 Hi Contact
Management: contactIndication; Social Obligation Management:
initialGreeting
B1 1.99187 2.09128 Hi Social Obligation Management:
returnGreeting
/initialSelfIntroduction/ Language Dutch
Corpus Schiphol
Example Speaker Start time End time Utterance DA-label A1 0.07
1.00098 Inlichtingen Schiphol Contact
Management: contactIndication; Turn Management: turnGive; Social
Obligation Management: initialSelfIntroduction;
/returnSelfIntroduction/ Language Dutch
Corpus Schiphol
Example Speaker Start time End time Utterance DA-label A1 0.07
1.00098 Inlichtingen Schiphol Contact
Management: contactIndication; Turn Management: turnGive;
Social Obligation Management: initialSelfIntroduction
B1 1.10108 2.03533 met mevrouw van der Wilde
Social Obligation Management: returnSelfIntroduction
/initialGoodbye/ Language Dutch
Corpus Schiphol
Example Speaker Start time End time Utterance DA-label B1
15.48182 16.04905 Goedemiddag Social Obligation
Management: initialGoodbye
/returnGoodbye/ Language Dutch
Corpus Schiphol
Example Speaker Start time End time Utterance DA-label B1
15.48182 16.04905 Goedemiddag Social Obligation
Management: initialGoodbye
A1 16.04905 16.31597 Dag Social Obligation Management:
returnGoodbye
/apology/ Language Italian
Corpus SITAL
Example Speaker Start time End time Utterance DA-label A1
37.13636 38.00387 mi scusi Social Obligation
Management: apology
/acceptApology/ Language English
Corpus TRAINS
Example Speaker Start time End time Utterance DA-label
B1 365.620000 367.804207 to Avon I'm + sorry Social Obligation
Management: apology
A1 365.446201 366.193304 +okay Social Obligation Management:
acceptApology
/thanking/ Language Dutch
Corpus Schiphol
Example Speaker Start time End time Utterance DA-label B1
14.14718 14.98133 hartelijk bedankt Social Obligation
Management: thanking
/acceptThanking/ Language Dutch
Corpus Schiphol
Example Speaker Start time End time Utterance DA-label
B1 14.14718 14.98133 hartelijk bedankt Social Obligation
Management: thanking
A1 14.98133 15.48182 Tot uw dienst Social Obligation Management:
acceptThanking
Data categories coverage: tag occurrences in the annotated test
suites
The following table gives an overview of the tag occurrences and the coverage of the data categories (percentages between brackets) in the annotated test suites for each data category and each language.
Data category English Dutch Italian
/setQuestion/ 22 (2.99%) 9 (1.79%) 16 (4.07%)
/propositionalQuestion/ 18 (2.45%) 21 (4.17%) 19 (4.8%)
/alternativesQuestion/ 1 (0.13%) 1 (0.2%) 3 (0.8%)
/checkQuestion/ 35 (4.76%) 30 (5.96%) 8 (2.04%)
/indirectSetQuestion/ 2 (0.27%) 4 (0.8%) 12 (3.1%)
/indirectPropositionalQuestion/ 1 (0.13%) 5 (1%) 5 (1.3%)
/indirectAlternativesQuestion/ 0 1 (0.2%) 2 (0.5%)
/inform/ 122 (16.6%) 89 (17.7%) 56 (14.25%)
/agreement/ 2 (0.27%) 3 (0.6%) 30 (7.6%)
/disagreement/ 1 (0.13%) 0 0
/correction/ 1 (0.13%) 4 (0.8%) 0
/setAnswer/ 26 (3.54%) 12 (2.39%) 33 (8.4%)
/propositionalAnswer/ 12 (1.63%) 21 (4.17%) 17 (4.33%)
/confirm/ 29 (3.95%) 26 (5.17%) 14 (3.6%)
/disconfirm/ 1 (0.13%) 1 (0.2%) 1 (0.3%)
/instruct/ 56 (7.62%) 17 (3.4%) 0
/suggest/ 0 1 (0.2%) 0
/request/ 1 (0.13%) 2 (0.4%) 9 (2.3%)
/acceptRequest/ 1 (0.13%) 1 (0.2%) 2 (0.5%)
/declineRequest/ 0 1 (0.2%) 0
/promise/ 0 3 (0.6%) 2 (0.5%)
/offer/ 5 (0.68%) 3 (0.6%) 6 (1.5%)
/acceptOffer/ 5 (0.68%) 2 (0.4%) 1 (0.3%)
/declineOffer/ 0 1 (0.2%) 3 (0.8%)
/positiveAutoFeedback/ 174 (23.67%) 96 (19.1%) 73 (18.6%)
/positiveAlloFeedback/ 2 (0.27%) 1 (0.2%) 19 (4.8%)
/negativeAutoFeedback/ 2 (0.27%) 0 1 (0.3%)
/negativeAlloFeedback/ 0 1 (0.2%) 2 (0.5%)
/feedbackElicitation/ 1 (0.13%) 0 0
/turnAccept/ 2 (0.27%) 7 (1.4%) 0
/turnGive/ 3 (0.41%) 13 (2.6%) 0
/turnGrab/ 17 (2.31%) 11 (2.2%) 0
/turnKeep/ 138 (18.78%) 87 (17.3%) 5 (1.3%)
/turnRelease/ 0 0 3 (0.8%)
/turnTake/ 69 (9.39%) 43 (8.5%) 46 (11.7%)
/stalling/ 74 (10.07%) 72 (14.3%) 27 (6.9%)
/pausing/ 5 (0.68%) 9 (1.79%) 11 (2.8%)
/completion/ 4 (0.54%) 0 0
/correctMisspeaking / 1 (0.13%) 0 0
/signalSpeakingError/ 4 (0.54%) 0 0
/selfCorrection/ 20 (2.72%) 6 (1.2%) 3 (0.8%)
/contactIndication/ 5 (0.68%) 12 (2.39%) 7 (1.8%)
/contactCheck/ 1 (0.13%) 2 (0.4%) 4 (1.02%)
/interactionStructuring/ 12 (1.63%) 9 (1.79%) 4 (1.02%)
/initialGreeting/ 5 (0.68%) 5 (1%) 6 (1.5%)
/returnGreeting/ 1 (0.13%) 1 (0.2%) 8 (2.04%)
/initialSelfIntroduction/ 0 6 (1.2%) 6 (1.5%)
/returnSelfIntroduction/ 0 4 (0.8%) 4 (1.02%)
/initialGoodbye/ 0 6 (1.2%) 4 (1.02%)
/returnGoodbye/ 0 5 (1%) 2 (0.5%)
/apology/ 0 0 1 (0.3%)
/acceptApology/ 1 (0.13%) 0 0
/thanking/ 1 (0.13%) 7 (1.4%) 6 (1.5%)
/acceptThanking/ 0 5 (1%) 0
Table 1: Tag occurrences and data category coverage (in %)
in the tested corpora for each language
From Table 1 we may observe that all the data categories defined
for the communicative functions of dialogue acts occurred in the
test suites for at least one of the languages. No utterance was
labeled by the annotators as UNCODED.
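The coverage percentages in Table 1 (and in the corresponding tables for semantic roles and reference below) amount to simple relative frequencies. As a minimal sketch, assuming the tags assigned in one language's test suite are available as a flat list, they could be computed as follows; the function name and toy data are illustrative only.

# Count each data category and express it as a percentage of all tags.
from collections import Counter

def coverage(tags):
    counts = Counter(tags)
    total = sum(counts.values())
    return {cat: (n, round(100.0 * n / total, 2)) for cat, n in counts.items()}

# Hypothetical miniature tag list, for illustration only:
print(coverage(["inform", "inform", "setQuestion", "turnTake"]))
# -> {'inform': (2, 50.0), 'setQuestion': (1, 25.0), 'turnTake': (1, 25.0)}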
Inter-annotator agreement and discussion of issues arising
For the purpose of qualitative evaluation of the proposed data
categories for the dialogue act annotation the inter-annotator
agreement was calculated using the standard kappa statistic (see
Cohen, 1960, Carletta, 1996). This measure is given by:
κ = (P(A) – P(E)) / (1 – P(E))
where P(A) is the proportion of times that the k annotators agree and P(E) is the proportion of agreement expected if the annotators were to agree by chance. The agreement was measured on both
annotation tasks (segmentation and classification). The analysis
was made on 2 Map Task dialogues (386 utterances) and 3 TRAINS
dialogues (187 utterances) for English and 1 DIAMOND dialogue (301
utterances) and 5 Schiphol dialogues (152 utterances) for Dutch.
Each utterance was tagged by two trained annotators independently.
For the classification task the agreement was calculated for each
data category in isolation and for the annotation task in general.
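As a minimal illustration of the kappa computation for two annotators, assuming two equally long lists of labels assigned to the same segments (see Geertzen & Bunt, 2006, cited below, for more elaborate weighted variants relevant to multidimensional annotation):

# Sketch following kappa = (P(A) - P(E)) / (1 - P(E)) (Cohen, 1960).
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # P(A): observed proportion of identically labelled items.
    p_a = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # P(E): agreement expected by chance, from the marginal distributions.
    dist_a, dist_b = Counter(labels_a), Counter(labels_b)
    p_e = sum((dist_a[l] / n) * (dist_b[l] / n) for l in set(dist_a) | set(dist_b))
    return (p_a - p_e) / (1 - p_e)

# Toy example with three segments:
print(cohen_kappa(["Inform", "SetQuestion", "Inform"],
                  ["Inform", "CheckQuestion", "Inform"]))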
The results for the segmentation task are as follows:
Corpus P(A) P(E) Kappa
Map Task 0.99 0.96 0.74
TRAINS 1.00 1.00 nav
DIAMOND 1.00 1.00 nav
Schiphol 0.99 0.94 0.83
Table 2: Inter-annotator agreement (kappa statistic) on the segmentation task.
According to the scale proposed by Rietveld and van Hout (1993)
these kappa values reflect substantial and almost perfect agreement
between the annotators on the segmentation task. According to the
scale proposed by Landis and Koch (1977), these values show good
and high agreement (scores are higher than 0.6).
Table 3 presents the agreement statistics on the dialogue act
classification task after segmentation. The table shows
near-perfect agreement between the annotators, both for all separate classes
of dialogue act functions and for the corpus data as a whole
(evaluation according to Rietveld & van Hout, 1993). (See also
Geertzen & Bunt, 2006 for more complex but also more accurate
forms of weighted agreement calculation.)
Data category Kappa
Information Seeking functions 0.983
Information Providing functions 0.989
Action Discussion functions 0.994
Auto-Feedback functions 0.994
Allo-Feedback functions 1.00
Turn final functions 0.958
Turn initial functions 0.954
Time Management functions 0.971
Partner Communication Management functions 1.00
Own Communication Management functions 0.929
Contact Management functions 0.956
Discourse Structuring functions 0.982
Social Obligation Management functions 0.938
Whole corpus of annotated data 0.93
Table 3: Inter-annotator agreement on the classification task,
measured with kappa statistic.
Further analysis of the confusion matrix that can be constructed
from the data (see Table 4 below) shows in which cases the
annotators experienced some difficulties in reaching agreement. The
following cases can be identified:
- CheckQuestions vs Inform: CheckQuestions and Informs often
have the same surface structures and observable features, such as
word order, declarative intonation and pitch contour.
Discrimination between these two communicative functions often
requires knowledge of the context, in particular of the dialogue
history and of the distribution of information among the dialogue
participants (for instance, whether a participant is an expert on
the content of the dialogue act).
- Inform vs SetAnswer: SetAnswers can be confused with Informs
if an annotator only takes the dialogue history into account,
namely noticing that the previous utterance was a Question. Replies
to Questions are often Answers, but this is not always the case.
For example:
A1 Ik zou graag willen weten wanneer het toestel met het
vluchtnummer KL
678 of dat morgenochtend of zaterdagochtend aankomt. (I would
like to know when the plane with flight number KL 678 whether it arrives tomorrow morning or Saturday morning)
B1 Moment hoor. (Just a moment)
B2 Hij komt morgen en zaterdag. (It arrives tomorrow and on
Saturday)
The utterance (B2) is not a SetAnswer but rather an Inform,
telling the other participant that the presupposition of the AlternativesQuestion was false.
- Turn Management function vs no Turn Management function: see
Table 4. The table shows that annotators failed to reach 100%
agreement on assigning Turn Management functions (this applies to
all functions except TurnGrab), where one annotator decided not to assign any Turn Management function. This was invariably caused
by a lack of evidence in the communicative behavior of dialogue
participants in the form of observable features which would reflect
such functions. Such features may be linguistic cues (e.g. ‘uhm’)
and intonation properties (e.g. pauses, rising intonation, word
lengthening, etc.).
Label none TurnGive TurnKeep TurnAccept TurnGrab TurnTake
none 1677
TurnGive 10 26
TurnKeep 5 190
TurnAccept 3 5
TurnGrab 21
TurnTake 7 90
Table 4: Confusion matrix for Turn Management function
assignment
Difficult cases for annotators were so-called backchannels,
signals by which a participant may indicate his understanding of
what is said without necessarily accepting or agreeing with what is
said. The producer of a backchannel does not take the turn and does
not wish to interfere with what the partner is saying, nor does he
wish to show the intention to interrupt and obtain the turn, but he
wants to show an active listening attitude and/or encourage the
partner to continue. Such phenomena often occur in hesitation phases, when one of the dialogue partners signals difficulty in completing his/her utterance by pausing, stalling, or producing other vocal signs such as heavy breathing or puffing.
For example:
A1 and then we're going to turn east
A2 turn ... VOC_inbreath
B1 mmhmm
A3 not ... straight east ... slightly sort of northeast
Whether a Feedback act has a Turn Giving function or not is
sometimes difficult to decide for a human annotator, because the
differences between Turn Giving vs no Turn Giving can be very
subtle (voicing, pitch contour, energy, initial pauses, etc.),
making the annotator’s decision rather subjective.
A similar scenario was observed for assigning Turn Keep
functions (did hesitation take place) and Turn Take functions (did the speaker perform a separate act to that effect). Here also, prosodic rather
than lexical cues indicate the speaker’s intentions to manage this
aspect of the interaction.
The other observed disagreements were accidental in nature and
can be disregarded.
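Confusion matrices such as Table 4 (and Tables 7 and 8 in section 3) can be derived directly from the paired annotations. A minimal sketch, assuming two aligned label lists with "none" standing for no function assigned, is given below; the toy data are illustrative only.

# Count, for every pair of labels, how often annotator 1 chose the first
# while annotator 2 chose the second for the same segment.
from collections import defaultdict

def confusion_matrix(labels_a, labels_b):
    matrix = defaultdict(int)
    for a, b in zip(labels_a, labels_b):
        matrix[(a, b)] += 1
    return dict(matrix)

print(confusion_matrix(["TurnKeep", "none", "TurnGive"],
                       ["TurnKeep", "TurnGive", "TurnGive"]))
# -> {('TurnKeep', 'TurnKeep'): 1, ('none', 'TurnGive'): 1, ('TurnGive', 'TurnGive'): 1}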
3 Semantic Role Annotation
3.1 Annotation Task
We define a semantic role as the type of relation that a participant has to some real or imagined situation; the semantic role annotation task therefore involved two main activities:
• Identification and labeling of markables: expressions that
represent the entities involved in semantic role relations.
Markables come in two varieties:
• anchors, which correspond to one of three situation (or
‘eventuality’) types: events, states and facts (every semantic role
must be ‘anchored’ to a situation of one of these types). Anchors
are realised mainly by verbs but sometimes also by nouns.
• situation participants. These are realised mainly by nouns, noun
phrases and pronouns (ignoring event coreference, temporal
coreference, etc.).
• Identification and labeling of links: semantic role relations between participant and anchor markables.
Annotators were provided with Annotation Guidelines for semantic
role annotation (see Appendix II.A).
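As an illustration of the two kinds of markables and the links between them, the following Python sketch encodes the /theme/ example from section 3.4 ('One man wrapped several diamonds in the knot of his tie'). The classes and identifiers are illustrative assumptions only; the actual annotations were made with GATE (see section 3.3).

# Markables are either anchors (events, states, facts) or participants;
# links tie a participant to an anchor under a LIRICS semantic role.
from dataclasses import dataclass

@dataclass
class Markable:
    mid: str     # markable id, e.g. "e1" for an anchor
    span: str    # text span covered by the markable
    kind: str    # "anchor" or "participant"

@dataclass
class RoleLink:
    participant: str   # participant markable id
    anchor: str        # anchor markable id
    role: str          # LIRICS semantic role label

anchor = Markable("e1", "wrapped", "anchor")
participants = [Markable("p1", "One man", "participant"),
                Markable("p2", "several diamonds", "participant"),
                Markable("p3", "in the knot of his tie", "participant")]
links = [RoleLink("p1", "e1", "Agent"),
         RoleLink("p2", "e1", "Theme"),
         RoleLink("p3", "e1", "Final_Location")]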
In order to have a reasonable coverage of different types of
semantic roles, it was decided that a minimum of 500 sentences per
language should be annotated. Test suites with semantic role
annotations were constructed for four languages: English, Dutch,
Italian, and Spanish.
For Dutch and English all test suite material was annotated
independently by at least three different annotators, in order to
investigate the usability of the tagset in terms of inter-annotator
agreement.
3.2 Corpora
For English FrameNet and PropBank data was used. We selected
three unbroken FrameNet texts (120 sentences) and separate
sentences (83 sentences). PropBank data consists of isolated
sentences (355 sentences).
For Dutch 15 unbroken texts were selected from news articles,
with a total of 260 sentences.
News articles were also selected to construct Italian test
suites (101 sentences). All files were taken from the Italian
Treebank corpus.
For Spanish, the LIRICS test suite consists of 189 sentences
taken from the Spanish FrameNet corpus.
3.3 Annotation Tool
The annotations were made using the GATE annotation tool from the University of Sheffield4. GATE provides annotators with a graphical interface for indicating which pieces of text denote relevant concepts (the ‘markables’). For the LIRICS annotation task two types of annotation label were added to GATE: SemanticAnchor and SemanticRole (an updated gate.jar file was provided by UtiL).
Figure 2 shows the GATE interface for annotators and how it organizes an annotator's activities.
4 See: http://gate.ac.uk for further details and
http://gate.ac.uk/documentation.html for documentation.
3.4 Examples from the annotated test suites
This section contains examples from the annotated corpora, which
illustrate the data categories defined for semantic roles.
/agent/
Language Spanish
Corpus Spanish FrameNet
Example [El partido popular Agent,e1] ha planteadoe1 [las
elecciones Theme,e1] [para llegar al Gobierno de la Nacion Reason,
e1].
Language English
Corpus FrameNet
Example [Libya Agent,e1&e2] has showne1 [interest Theme,e1]
in and takene2 [steps Theme, e2] [to acquiree3 [weapons of mass
destruction (WMD) Theme, e3] and their delivery systems Purpose,
e1&e2].
/partner/
Language English
Corpus FrameNet
Example [On 19 December 2003 Time, e1], [Libyan leader col.
Muammar Gadhafi Agent,e1] [publicly Manner, e1] confirmede1 [his
commitmente2 [to disclosee3 and dismantlee4 [WMD programs Patient,
e3&e4] [in his country Location, e3&e4] Theme,e1] Purpose,
e2] [following [a nine-month period Duration, e5] of negotiationse5
[with US and UK authorities Partner,e5] Reason, e1].
/cause/
Language English
Corpus FrameNet
Example [Signing the protocol Cause, e1] would ensuree1 [IAEA
Beneficiary, e1] [oversight over Libya's nuclear transition from
weapons creation to peaceful purposes Reason, e1].
/instrument/
Language English
Corpus FrameNet
Example [In 2003 Time, e1], Libya admittede1 [its previous
intentions to acquiree2 [equipment Theme, e2; Instrument, e4]
needede3 [to producee4 [biological weapons (BW) Result, e4]
Purpose, e3] Theme, e1]
/patient/
Language English
Corpus PropBank
Example [White women Agent, e1&e2] servee1 [tea and coffee
Theme, e1] , and then washe2 [the cups and saucers Patient, e2]
[afterwards Time, e2] .
/pivot/
Language English
Corpus PropBank
Example [Vicar Marshall Agent, e1; Pivot, e2] admitse1 [to mixed
feelingse2 [about this issue Theme, e2] Theme, e1].
/theme/
Language Spanish
Corpus Spanish FrameNet
Example [China Pivot, e1] no representae1 [un peligro militar
Theme, e1].
Language English
Example [One man Agent, e1] wrappede1 [several diamonds Theme,
e1] [in the knot of his tie Final_Location, e1].
/beneficiary/
Language English
Corpus PropBank
Example [U.S. Trust Agent, e1] [recently Time, e1] introducede1
[certain mutual-fund products Theme,e1] , which allowe2 [it
Beneficiary, e2] [to servee3 [customers Beneficiary, e3] Purpose,
e2].
/source/
Language English
Corpus PropBank
Example [Eaton Beneficiary, e1] earnede1 [from continuing
operations Source, e1]
/goal/
Language English
Corpus PropBank
Example [The executive Agent, e1] recallse1 [[Mr. Corry Agent,
e2] whisperinge2 [to him and others Goal, e2] Theme, e1].
/result/
Language English
Corpus PropBank
Example [Within the past two months Duration, e1] [a bomb
Patient, e1; Cause, e2] explodede1 [in the offices of the El
Espectador in Bogota Location, e1], [destroyinge2 [a major part of
its installations and equipment Patient, e2] Result, e1]
/reason/
Language English
Corpus PropBank
Example [Elisa Hollis Agent, e1] launchede1 [a diaper service
Result, e1] [last year Time, e1] [because [State College , Pa.
Pivot, e2] didn't havee2 [one Theme, e2] Reason, e1].
/purpose/
Language English
Corpus PropBank
Example [Two steps Theme, s1] ares1 [necessary Attribute, s1]
[to translatee1 [this idea Patient, e1] [into action Result, e1]
Purpose, s1]
/time/
Language English
Corpus PropBank
Example [Right now Time, e1] [[about a dozen Amount, e1]
laboratories Agent, e1&e2] , [in the U.S. , Canada and Britain
Location, e1] , are racinge1 [to unmaske2 [other suspected
tumor-suppressing genes Theme, e2] Purpose, e1].
/manner/
Language English
Corpus PropBank
Example [These rate indications Theme, s1] ares1 n't [directly
Manner, s1] comparables1.
/medium/
Language English
Corpus PropBank
Example [They Pivot, s1; Agent, e1] coulds1 seee1 [the 23 pairs
of chromosomes Theme, e1] [in the cells Location, e1] [under a
microscope Medium, e1].
/means/
Language English
Corpus FrameNet
Example [Sears Agent, e1] blanketede1 [the airwaves Patient, e1]
[with ads about its new pricing strategy Means, e1]
/setting/
Language English
Corpus FrameNet
Example [A number of medical and agricultural research centers
Pivot, s1; Instrument, e1] hads1 [the potential Attribute, s1] to
be usede1 [in BW research Setting, e1].
Here comee1 [the ringers Agent, e1&e2] [from above
Initial_Location, e1], makinge2 [a very obvious exit Theme, e2]
[while [the congregation Pivot, s1] iss1 [at prayer Setting, s1]
Time, e1&e2]
/location/
Language English
Corpus FrameNet
Example [Here Location, s1] iss1 [an example Theme, s1].
[They Patient, e1] aren't acceptede1 [everywhere Location,
e1]
[The stairs Theme, s1] are locateds1 [next to the altar
Location,s1]
/initialLocation/
Language English
Corpus FrameNet
Example Here comee1 [the ringers Agent, e1&e2] [from above
Initial_Location, e1], makinge2 [a very obvious exit Theme, e2]
[while [the congregation Pivot, s1] iss1 [at prayer Setting, s1]
Time, e1&e2]
/finalLocation/
Language English
Corpus PropBank
Example [One man Agent, e1] wrappede1 [several diamonds Theme,
e1] [in the knot of his tie Final_Location, e1].
/path/
Language English
Corpus PropBank
Example [Father McKenna Agent, e1] movese1 [through the house
Path, e1] [praying in Latin Manner, e1]
/distance/
Language English
Corpus FrameNet
Example [Libya Agent, e1] pledgede1 [to eliminatee2 [[ballistic
missiles Pivot, s1] capables1 of travelinge3 [more than 300km
Distance, e3] Patient, e2] Theme, e1].
/amount/
Language English
Corpus PropBank
Example [The ruble Theme, s1] iss1 n't worths1 [much Amount,
s1].
/attribute/
Language English
Corpus
Example [A number of medical and agricultural research centers
Pivot, s1; Instrument, e1] hads1 [the potential Attribute, s1] to
be usede1 [in BW research Setting, e1].
/frequency/
Language English
Corpus PropBank
Example [President Zia of Pakistan Agent, e1] [repeatedly
Frequency, e1] statede1 [that [fresh Soviet troops Patient, e2]
were being insertede2 [into Afghanistan Final_Location, e2] Theme,
e1]
3.5 Data category coverage
Table 5 gives an overview of the tag occurrences and the
coverage of the data categories by the test suites (percentages
between brackets) for each defined data category and language.
Data category English Dutch Italian Spanish
Total number of objects 1795 1326 454 1356
/agent/ 311 (17.3%) 186 (14%) 60 (13.2%) 258 (19%)
/partner/ 5(0.3%) 9 (0.7%) 2 (0.4%) 3 (0.2%)
/cause/ 39 (2.2%) 33 (2.5%) 2 (0.4%) 43 (3.2%)
/instrument/ 10 (0.56%) 7 (0.5%) 7 (1.5%) 4 (0.3%)
/patient/ 186 (10.4%) 137 (10.3%) 51 (11.2%) 119 (8.8%)
/pivot/ 104 (5.8%) 85 (6.4%) 51 (11.2%) 154 (11.4%)
/theme/ 501 (27.9%) 331 (25%) 117 (25.6%) 315 (23.2%)
/beneficiary/ 40 (2.02%) 19 (1.4%) 7 (1.5%) 63 (4.7%)
/source/ 16 (0.9%) 31 (2.3%) 7 (1.5%) 2 (0.1%)
/goal/ 18 (1%) 13 (1%) 13 (2.9%) 5 (0.4%)
/result/ 66 (3.7%) 54 (4.1%) 14 (3.1%) 24 (1.8%)
/reason/ 36 (2%) 14 (1.1%) 9 (2%) 43 (3.2%)
/purpose/ 49 (2.7%) 18 (1.4%) 7 (1.5%) 24 (1.8%)
/time/ 135 (7.5%) 106 (8%) 13 (2.9%) 65 (4.8%)
/manner/ 39 (2.2%) 27 (2%) 18 (4%) 44 (3.2%)
/medium/ 4 (0.2%) 1 (0.1%) 2 (0.4%) 8 (0.6%)
/means/ 8 (0.4%) 6 (0.5%) 0 2 (0.1%)
/setting/ 47 (2.6%) 48 (3.6%) 16 (3.5%) 28 (2.1%)
/location/ 41 (2.3%) 66 (5%) 24 (5.3%) 34 (2.5%)
/initial_location/ 2 (0.1%) 1 (0.1%) 2 (0.4%) 5 (0.4%)
/final_location/ 6 (0.3%) 10 (0.8%) 7 (1.5%) 43 (3.2%)
/path/ 20 (1.1%) 9 (0.7%) 0 0
/distance/ 1 (0.06%) 0 1 (0.2%) 0
/amount/ 27 (1.5%) 19 (1.4%) 11 (2.4%) 17 (1.3%)
/attribute/ 72 (4%) 88 (6.6%) 6 (1.3%) 45 (3.3%)
/frequency/ 12 (0.7%) 8 (0.6%) 0 9 (0.7%)
unclassified 0 0 6 (1.3%) 0
Table 5: Tag occurrences and data categories coverage (in %)
in the tested corpora for each language in isolation
3.6 Inter-annotator agreement and discussion of issues
arising
Three annotators annotated the test suites for English and Dutch independently. The annotators were students of linguistics and native speakers of Dutch, and their level of English was evaluated as proficient. The annotators had no previous annotation experience; they received one afternoon of training in
annotation using LIRICS data categories and the annotation tool.
They also received a short (7 pages) document with annotation
guidelines (see Appendix II.A). This allowed an evaluation of the
usability of the LIRICS data categories for semantic role
annotation by determining the agreement among the annotators. This
was done in the usual way by calculating the standard kappa
statistic (see Cohen, 1960, Carletta, 1996, and above, section
2.5).
The obtained Kappa scores were evaluated according to Rietveld
& van Hout (1993) and interpreted as all annotators having
reached substantial agreement on all annotation tasks (scores
between 0.61 and 0.8), except for one annotator pair (A2&amp;A3)
whose agreement on labelling Dutch anchors and semantic roles was
moderate (less than 0.61). The results are shown in Table 6 for
each pair of annotators.
Annotators’ pairs A1&A2 A1&A3 A2&A3
Anchors (English) 0.66 0.66 0.61
Semantic roles (English) 0.64 0.68 0.62
Anchors (Dutch) 0.73 0.77 0.54
Semantic roles (Dutch) 0.6 0.65 0.56
Table 6: Inter-annotator agreement on the two labeling tasks
based on kappa statistic
for English and Dutch corpus data
A closer look at the confusion matrices for both corpora (see
Tables 7 and 8 below) shows the disagreement cases. We found that
the following data categories were a source of confusion for
annotators:
• The role of adjectives in descriptions of states or facts by
means of constructions like Copula + Adjective was not labelled
consistently, as in the following example (where (b) is the correct
annotation):
a. Roses are red
b. Roses are red
• Theme vs Patient: these roles have one distinguishing property: a Theme is distinguished from a Patient by whether the participant is affected by the event or not; if it is not, then it is a Theme; if it is, then it is a Patient. Sometimes, however, it was difficult for annotators to decide whether the participant is affected/changed by the event or not, for example:
An ancient stone church stands amid the fields, the sound of
bells cascading from its tower, calling the faithful to
Evensong.
Any question .. is answered by reading this book about sticky
fingers and sweaty scammers .
Individuals close to the situation believe Ford officials will
seek a meeting this week with Sir John to outline their proposal
for a full bid.
Jayark, New York, distributes and rents audio-visual equipment
and prints promotional ads for retailers.
• Theme vs Pivot: Theme is distinguished from Pivot by whether
it is a participant that has the most central role or not; if it is
not, then it is a Theme; if it is, then it is a Pivot. Again, this
can be difficult for annotators to decide.
U.S. officials say they aren't satisfied (Annotator1 labelled as
Theme; other two – as Pivot of the state ‘to be satisfied’)
They may be offshoots of the intifadah, the Palestinian
rebellion in the occupied territories, which the U.S. doesn't
classify as terrorism.
• Theme vs Result: Theme is distinguished from Result by whether
it is a participant that exists independently of the event or not;
if it is, then it is a Theme; if not, then it is a Result.
Together with the 3.6 million shares currently controlled by
management, subsidiaries and directors, the completed tender offer
would give Sea Containers a controlling stake.
Delegates from 91 nations endorsed a ban on world ivory trade in
an attempt to rescue the endangered elephant from extinction
(potential Result).
These are the last words Abbie Hoffman ever uttered.
• Location vs Setting: Setting is distinguished from Location by whether the expression defines a physical location or not; if it does not, then it is a Setting; if it does, then it is a Location. Some cases, however, can be ambiguous, for example:
It hopes to speak to students at theological colleges about the
joys of bell ringing.
They settle back into their traditional role of making tea at
meetings.
• Beneficiary vs Goal: Goal is distinguished from Beneficiary by
whether it is a participant that is clearly advantaged or
disadvantaged by the event; if it is, then it is a Beneficiary; if
not, then it may be a Goal. For example:
Libya employed Iranian-supplied mustard gas bombs against Chad,
its southern neighbour, in 1987.
When their changes are completed, and after they have worked up
a sweat, ringers often skip off to the local pub, leaving worship
for others below.
• It was pointed out by annotators that Pivot seems to be a rather general, abstract role which subsumes more fine-grained distinctions such as the experiencer of psychological events/states, the theme of some states like “owning”, etc. On the other hand, there are examples like ‘John has a dog’, where ‘John’ obviously plays a more central role than ‘a dog’, and labelling both participants as Themes would be unsatisfactory. This was the main reason for introducing the Pivot role.
• It was suggested that it would be more efficient to organize
the roles into a taxonomy, exploiting, for instance, semantic
features like [+/- agentivity] and similar, so that in their
application to real texts, annotators can be presented with
different levels of granularity and perform a case-by-case decision
without being forced to choose a highly specific role, e.g.
Location (general role) and Initial_Location, Final_Location, Path
as sub-roles.
• The issue also arose whether separate roles like Initial_Time, Final_Time and Duration should be defined, since these overlap with temporal information. These data categories are defined for the domain of temporal annotation, and as semantic roles they seem to be superfluous. On the other hand, if someone were interested only in semantic roles, the proposed set of tags should be complete and also cover temporal roles. The same could be said about spatial roles.
Agent Amount Attribute Beneficiary Cause Duration F_location I_location Frequency Goal I_time Instrument Location Manner Means Medium Partner Path Patient Pivot Purpose Reason Result Setting Source Theme Time
Agent 503
Amount 40
Attribute 41
Benefic 62
Cause 22 53
Duration 26
F_locatio 10
I_locatio 18
Frequen 23
Goal 1 23
I_time 16
Instrum 7
Locatio 63
Manner 20 66
Means 2 9
Medium 2 4
Partner 9
Path 31
Patient 19 4 5 5 1 2 207
Pivot 40 3 3 145
Purpose 1 1 62
Reason 4 78
Result 9 80
Setting 1 15 3 2 1 1 1 44
Source 1 1 20
Theme 28 4 46 3 7 1 5 1 3 1 103 59 4 3 16 644
time 1 1 196
Table 7: Confusion matrix for semantic roles for English
corpus
Agent Amount Attribute Beneficiary Cause Duration F_location I_location Frequency Goal I_time Instrument Location Manner Means Medium Partner Path Patient Pivot Purpose Reason Result Setting Source Theme Time
Agent 133
Amount 1 9
Attribute 1
Benefic 1 7
Cause 8 7
Duration 1 8
F_locatio 9
I_locatio 2
Frequen 6
Goal 11 16
I_time 1 9
Instrum 1 1 1
Locatio 56
Manner 1 11 1 1 15
Means 1 2
Medium 1 1
Partner 1 2 9
Path 2 2 7
Patient 5 13 1 6 1 27
Pivot 39 68
Purpose 9
Reason 1 3 9
Result 2 1 1 13 15
Setting 1 3 1 8 4 29
Source 2 1 2 1 1 3 22
Theme 23 1 14 1 1 1 1 4 4 90 37 11 12 173
time 90
Table 8: Confusion matrix for semantic roles for Dutch
corpus
4 Reference Annotation
4.1 Annotation Task
The reference annotation tasks involved two main activities:
• Identification and labeling of markables: referential
expressions realised by nouns, noun phrases and pronouns (ignoring
event coreference, temporal coreference, etc.).
• Identification and labeling of links: referential relations
between markables.
Annotators were provided with Annotation Guidelines (see
Appendix III.A).
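Analogously to the semantic role sketch in section 3.1, the following minimal Python sketch encodes referential markables and one link for the /acronymy/ example given in section 4.4; the classes and identifiers are illustrative assumptions only, the actual annotations being PALinkA XML files (see section 4.3 and Appendix III.B).

# Referential markables and a typed link between them.
from dataclasses import dataclass

@dataclass
class ReferentialMarkable:
    mid: str
    span: str

@dataclass
class ReferenceLink:
    source: str    # id of the referring markable
    target: str    # id of the antecedent markable
    relation: str  # e.g. "objectalIdentity", "acronymy", "partOf"

markables = [ReferentialMarkable("m1", "weapons of mass destruction"),
             ReferentialMarkable("m2", "WMD")]
links = [ReferenceLink("m2", "m1", "acronymy")]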
4.2 Corpora
The annotations were performed on corpus material for four
languages, English, Dutch, Italian, and German:
• For English 177 sentences were selected from the FrameNet
corpus 5 . In their annotation with respect to referential
relations, 375 markables and 233 links were identified. In
addition, 142 sentences were selected from the MUC-6 891102-0148
corpus. In the annotation of these sentences 331 markables and 221
links were identified and labeled.
• For Dutch 274 sentences from news articles were selected for
reference annotation. Annotators identified and labeled 494
markables and 327 coreferential links.
• For Italian 137 sentences from Italian newspaper articles were
annotated, where 736 markables and 265 coreferential links were
identified and labeled.
• The German test suite consisted of 232 sentences from
newspaper articles (Handelsblatt, financial news), in which 98 markables and 175 coreferential links were identified.
4.3 Annotation Tool
The annotations were performed using the PALinkA annotation
tool6, an XML-based tool that was originally designed for the
purpose of referential relation annotation. This tool has
considerable advantages:
• It is language/platform/task-independent. The specifications
relevant to the annotation task were provided in an external file
(see Appendix III.B);
• It allows easy identification and labeling of all markables
and links, with point and click actions, and has the possibility to
perform undo/redo/delete operations;
• It has a user-friendly interface.
5 See http://framenet.icsi.berkeley.edu/ for more
information.
6 Visit the Palinka site http://clg.wlv.ac.uk/projects/PALinkA/
for more information and downloads.
Figure 3 shows the PALinkA annotation interface and the
organization of the annotation work.
Figure 3: Screenshot of annotation work using PALinkA
4.4 Examples
In this section we provide some examples from the test suites,
to illustrate each of the data categories for reference annotation
that were defined in LIRICS. The same examples are shown in the
XML-format extracted from the PALinkA files in Appendix III.B.
/synonymy/ Language English
Corpus FrameNet
Example There is a significant amount of open-source literature
concerning Libya 's acquisition and use of [chemical weapons ]( [CW
]) ; [it ] is well documented [that [Libya ]employed
Iranian-supplied mustard gas bombs against [Chad ], [[its ]southern
neighbor ], in 1987 ].
/hyponymy/
Language English
Corpus FrameNet
Example Housing is scarce and [public services ]-- [the court
system , schools , mail service , telephone network and the
highways ]-- are in disgraceful condition
/acronymy/ Language English
Corpus FrameNet
Example [Libya ]has shown interest in and taken steps to acquire
[weapons of mass destruction ]( [WMD ])
/compatibility/
Language English
Corpus FrameNet
Example In 2003 , [Libya ]admitted [its ]previous intentions to
acquire equipment needed to produce [biological weapons ]( [BW ]) .
In October and December 2003 , Libyan officials took US and UK
experts to a number of [medical and agricultural research centers
][that ]had the potential to be used in [BW ]research . [The
country ]acceded to the biological and toxin weapons convention on
19 January 1982 . There are allegations that the alleged [chemical
weapon ]( [CW ]) plants at Rabta and Tarhunah could contain [BW
]research facilities as well .
/meronymy/
Language Dutch
Corpus
Example [Alexandra Polier ], [die ]eerder werkte als redacteur
op het kantoor in NewYork van het Amerikaanse persagentschap
Associated Press , is op dit moment in Nairobi op bezoek bij de
ouders van [[haar ]verloofde ], [Yaron Schwartzman ], een Israëliër
die in Kenia opgroeide .
/metonymy/
Language Italian
Corpus
Example E [ la Bimex ][ si ] era affrettata ad aprire [ le [ sue
] porte ] sia [[ alla commissione ministeriale ] che [[ ai
carabinieri ][ del nucleo antisofisticazioni ]] e [ all' Usl ]] :
[[ ispezioni ] e [ controlli ]] avevano trovato tutto regolare .
> , dice tranquillo [ Ugo De Bei ] .
/partOf/ Language English
Corpus FrameNet
Example [Libya ]'s motivation to acquire [WMD ], and [ballistic
missiles ]in particular , appears in part to be a response to
Israel 's clandestine nuclear program and a desire to become a more
active player in Middle Eastern and African politics [The others
]here today live elsewhere . [They ]belong to a group of [15
ringers ]--
/subsetOf/
Language English
Corpus MUC-6 891102-0148
Example " [The group ]says standardized achievement test scores
are greatly inflated because [teachers ]often " teach the test " as
Mrs. Yeargin did , although [most ]are never caught .
/memberOf /
Language English
Corpus MUC-6 891102-0148
Example [Friends of Education ]rates [South Carolina ]one of
[the worst seven states ]in [its ]study on academic cheating .
/abstract/
Language English
Corpus MUC-6 891102-0148
Example [Mrs. Yeargin ]was fired and prosecuted under an unusual
South Carolina law that makes [it ]a crime [to breach test security
]
/concrete/
Language English
Corpus MUC-6 891102-0148
Example [she ]spotted [a student ]looking at [crib sheets ]
/animate/ Language English
Corpus MUC-6 891102-0148
Example And most disturbing , [it ]is [[educators ], not
students ], [who ]are blamed for much of the wrongdoing .
/inanimate/
Language English
Corpus MUC-6 891102-0148
Example [She ]had seen cheating before , but [these notes ]were
uncanny .
/alienable/ Language English
Corpus MUC-6 891102-0148
Example And sales of [test-coaching booklets ]for classroom
instruction are booming .
/inalienable/
Language English
Corpus FrameNet
Example Libya then invited the [IAEA ]to verify the elimination
of nuclear weapon related activities .
/naturalGender/
Language English
Corpus FrameNet
Example [Mr. Gonzalez ]is not quite a closet supply-side
revolutionary , however
Language English
Corpus MUC-6 891102-0148
Example [Cathryn Rice ]could hardly believe [her ]eyes .
/cardinality/ Language English
Corpus MUC-6 891102-0148
Example Standing on a shaded hill in a run-down area of [this
old textile city ], [the school ]has educated many of [South
Carolina ]'s best and brightest , including [[the state ]'s last
two governors ], [Nobel Prize winning physicist ][Charles Townes
]and [actress ][Joanne Woodward ].
/collective/
Language English
Corpus MUC-6 891102-0148
Example South Carolina 's reforms were designed for schools like
[[Greenville ]High School ]. And [South Carolina ]says [it ]is
getting results .
/nonCollective/
Language English
Corpus MUC-6 891102-0148
Example [There ]may be [others ]doing what [she ]did .
/countable/ Language English
Corpus MUC-6 891102-0148
Example [The school-board hearing ]at [which ][she ]was
dismissed was crowded
with [students , teachers and parents ][who ]came to testify on
[her ]behalf .
/nonCountable/
Language English
Corpus MUC-6 891102-0148
Example Says [[the organization ]'s founder ], [John Cannell ],
prosecuting Mrs. Yeargin is " a way for [administrators ]to protect
[themselves ]and look like [they ]take [cheating ]seriously , when
in fact [they ]do n't take [it ]seriously at all . "
/definiteIdentifiableTerm/
Language English
Corpus FrameNet
Example A strong challenge from [the far left ], [the communist
coalition Izquierda Unida ], failed to topple [him ].
/genericTerm/
Language English
Corpus FrameNet
Example [Unemployment ]still is officially recorded at 16.5 % ,
the highest rate in Europe , although actual [joblessness ]may be
lower .
/indefiniteTerm/
Language English
Corpus FrameNet
Example [The far left ]had [some good issues ] even if [it ]did
not have good programs for dealing with [them ].
/nonSpecificTerm/
Language English
Corpus FrameNet
Example The result is a generation of [young people ][whose
]ignorance and intellectual incompetence is matched only by [their
]good opinion of [themselves ].
/specificTerm/
Language English
Corpus FrameNet
Example [These beliefs ] so dominate our educational
establishment , our media , our politicians , and even our parents
that [it ]seems almost blasphemous [to challenge [them ]].
Data category coverage: tag occurrences in the annotated
corpora
The annotation results were evaluated quantitatively with respect to the frequencies of the LIRICS data categories. The following table gives an overview of tag occurrences and data category coverage (percentages given in brackets) in the
annotated corpora for each data category and for each language.
Data category English Dutch Italian German
/synonymy/ 4 (0.9%) 18 (5.5%) 15(5.7%) 15(8.6%)
/hyponymy/ 3 (0.7%) 0 9(3.4%) 7(4%)
/acronymy/ 9 (2%) 5 (1.5%) 7(2.6%) 3(1.7%)
/compatibility/ 37 (8%) 0 23(8.7%) 0
/meronymy/ 0 2 (0.6%) 3(1.1%) 0
/metonymy/ 0 0 7(2.6%) 0
LINGUISTIC="NA" (not applicable) 401 (88.4%) 271 (82.9%)
192(72.5%) 138(78.9%)
LINGUISTIC="unclassified" 0 31(9.5%) 9(3.4%) 12(6.8%)
/objectalIdentity/ 429 (94.5%) 300 (91.7%) 225(84.9%) 117(66.9%)
/partOf/ 3 (0.7%) 12 (3.7%) 6(2.3%) 0
/subsetOf/ 4 (0.9%) 9 (2.8%) 7(2.6%) 5(2.8%)
/memberOf / 18 (3.9%) 3 (0.9%) 25(9.4%) 24(13.7%)
OBJECT="NA" 0 2 (0.6%) 0 29(16.6%)
OBJECT=”unclassified” 0 1(0.3%) 2(0.8%) 0
___________________________ __________ ________ ________
_________
/abstract/ 134 (19%) 108 (21.9%) 209(28.4%) 16(16.3%)
/concrete/ 572(81%) 386(78.1%) 495(67.3%) 79(80.6%)
ABSTRACTNESS="unclassified" 0 0 32(4.3%) 3(3.1%)
/animate/ 420(59.5%) 259(52.4%) 144(19.6%) 40(40.8%)
/inanimate/ 286(40.5%) 235(47.6%) 564(76.6%) 58(59.2%)
ANIMACY="unclassified" 0 0 28(3.8%) 0
/alienable/ 686(97.2%) 391(79.1%) 29(4%) 49(50%)
/inalienable/ 20(2.8%) 103(20.9%) 12(1.6%) 1(1%)
ALIENABILITY="unclassified" 0 0 695(94.4%) 48(49%)
/naturalGender/ 94(male)+ 183(female)
(39.2%)
120(male) +
18(female) (27.9%)
323(male) + 197(female)
(70.7%)
22(male) + 3(female) (25.5%)
GENDER=”NA” 429(60.8%) 345(69.8%) 21(2.8%) 57(58.2%)
GENDER="unclassified" 0 11(2.3%) 195(26.5%) 16 (16.3%)
/cardinality/ 11(1.6%) 33(6.7%) 59(8%) Not annotated
CARDINALITY="NA" 695(98.4%) 461(93.3%) 677(92%) Not
annotated
/collective/ 245(34.7%) 294(59.5%) 69(9.4%) 21(21.4%)
/nonCollective/ 295(41.8%) 172(34.8%) 7(0.9%) 74(75.5%)
-
47
COLLECTIVENESS="NA" 166(23.5%) 28(5.7%) 651(88.5%) 3(3.1%)
COLLECTIVENESS="unclassified" 0 0 9(1.2%) 0
/countable/ 545(77.2%) 395(80%) 330(44.8%) 65(66.4%)
/nonCountable/ 66(9.3%) 69(14%) 108(14.7%) 7(7.1%)
COUNTABILITY="NA" 95(13.5%) 30(6%) 271(36.8%) 19(19.4%)
COUNTABILITY=”unclassified” 0 0 27(3.7%) 7(7.1%)
/definiteIdentifiableTerm/ 329(46.6%) 255(51.6%) 559(76%)
8(8.2%)
/genericTerm/ 74(10.5%) 52(10.5%) 47(6.4%) 15(15.3%)
/indefiniteTerm/ 1(0.1%) 11(2.3%) 124(16.8%) 15(15.3%)
/nonSpecificTerm/ 79(11.2%) 19(3.8%) 0 1(1%)
/specificTerm/ 219(31%) 128(25.9%) 0 9(9.2%)
DEFINITENESS="unclassified" 2(0.3%) 1(0.2%) 5(0.7%) 0
DEFINITENESS="NA" 2(0.3%) 28(5.7%) 1(0.1%) 50(51%)
Table 9: Tag occurrences and data categories coverage (in %) in
the tested corpora for each language in isolation
It may be observed from Table 9 that all the LIRICS data
categories were covered by the test suites. The percentages given
in brackets indicate that their frequencies are comparable for the
various corpora, except for the following differences:
• the category GENDER was ‘unclassified’ for Italian in 26.5% of
the cases, and for German in 16.3%. Annotators noticed that the
application of /naturalGender/ is not clear, in particular for
Italian and German. With the exception of human beings, most
objects and concepts have a gender due to their grammatical
classification. Moreover, a high percentage of ‘not applicable’
values was observed for all languages (English: 60.8%; Dutch:
69.8%; German: 58.2%). Annotators noticed that NPs such as "controller" and "Menschen" are difficult to tag, since they can refer to either natural gender.
• there are some discrepancies in labelling the ‘DEFINITENESS’
category across languages. In the annotators’ opinion, the
definitions of the definiteness data categories need to be
tightened (see the suggestions of the Italian LIRICS partners in
Section 5).
• there are high proportions of unclassified ‘ALIENABILITY’ for
Italian and German. German annotators noticed that it is difficult
to select an appropriate value when dealing with proper names,
since the value "NA" is lacking.
• ‘CARDINALITY’ for German was not annotated.
• a high percentage of cases was labelled with the ‘not applicable’ value for ‘COLLECTIVENESS’ (annotators provided no comments on this point; these have been requested).
4.5 Discussion
For a qualitative evaluation of the performed annotation work,
each annotator (project partner) was asked to comment on the
following three points:
1. the definition of the annotation task
2. the definitions of data categories
3. the use of the annotation tool
With respect to the annotation task, as defined above, it was noticed at an early stage that the main purpose of this LIRICS task is to illustrate the use of the data categories that were defined, which means that it would not be necessary to identify all possible markables as described in the Annotation Guidelines (all NPs and embedded structures), since not all of them enter into coreferential relations. Therefore, it was agreed with all project partners that the identification of markables and links can be performed in parallel; in other words, that only those NPs would be marked up which participate in coreferential relations.
With respect to the definitions of data categories the following
issues were discussed:
• the category ‘GENDER’: not applicable in a significant number of cases, e.g. ‘people’, ‘a student’, since these refer to both genders. The Italian partners noticed that most objects and concepts have a gender due to their grammatical classification. Since /naturalGender/ is a semantic rather than a grammatical notion, it does not apply to many types of referent, and such referents should be labelled as GENDER=‘not applicable’.
• the category ‘DEFINITENESS’: annotators proposed to consider ‘DEFINITENESS’ as a grammatical category. It was suggested to change the /definiteIdentifiableTerm/ category into /definiteTerm/, in parallel with /indefiniteTerm/. These two categories refer to the surface form of an NP in a document. Since these categories are purely syntactic in nature, it was agreed to leave /definiteTerm/ and /indefiniteTerm/ completely out of consideration.
On the other hand, it was proposed to introduce a new category, /specificity/, whose values would be /specificTerm/, /nonSpecificTerm/ and /genericTerm/. The definition of /genericTerm/ should be reformulated along the lines of those for /specificTerm/ and /nonSpecificTerm/, i.e. the speaker refers to any member of a class as a representative of that class. We would then have cases like the following (a sketch of a possible annotation is given after the examples):
• The lions are noble beasts. GENERIC/DEFINITE
• I want to meet a Norwegian. GENERIC/INDEFINITE
• I want to meet the Norwegian. SPECIFIC/DEFINITE
• I saw a Norwegian. NONSPECIFIC/INDEFINITE
• She met a Norwegian. NONSPECIFIC/INDEFINITE
• I met a guy, he is a Norwegian. SPECIFIC/INDEFINITE
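Purely for illustration, the fragment below sketches how the last example might be annotated under this proposal, in the same assumed pseudo-XML style as the earlier sketch; SPECIFICITY is the proposed new attribute, not part of the current data category set, and the element and attribute names are stand-ins chosen here for readability (definiteness would be left to the syntactic level).
  <!-- hypothetical sketch: SPECIFICITY is the proposed new category -->
  <MARKABLE id="m1" words="a guy" SPECIFICITY="specificTerm"/>
  <MARKABLE id="m2" words="he" SPECIFICITY="specificTerm"/>
  <!-- anaphoric link: both markables refer to the same individual -->
  <LINK source="m2" target="m1" OBJECT="objectalIdentity"/>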
It was suggested that the /hyperonym/ linguistic relation should be reintroduced as a data category, although it was considered redundant at an early stage of the annotation enterprise. The reason lies in the way coreference annotation is conducted, i.e. from the second (determined) NP back towards the probable anchor (usually an indefinite NP); it is unnatural to mark the relation the other way round.
These suggestions have all been taken into account in the final version of the proposed set of LIRICS semantic data categories, as documented in the final version of Deliverable D4.3.
As a final comment, annotators noticed that the number of attributes and values is extremely high for obtaining coherent tagging, even for a single annotator: for each MARKABLE, an annotator has to keep seven categories in mind, each of which is subdivided into at least four values.
The following issues were pointed out with respect to the annotators’ experiences in using the PALinkA tool. A disadvantage when using the tool to perform annotations according to the LIRICS specifications was the difficulty of assigning values to categories in one step. (Comment from one annotator: ‘It is so much clicking and thinking and clicking... and clicking again!’) This increased the annotation time considerably. Instead of a separate window with a drop-down menu for each category, one window with a list of all categories and a drop-down list of values for each of them would be preferred.
5 Concluding remarks
In this report we have documented Task 4.3 in Work Package 4:
Test suites for semantic content annotation and representation. The
task was defined to involve at least four European languages;
indeed, it has involved English, Dutch, Italian, German and
Spanish, and it has been carried out for the intended domains of
semantic information: semantic roles, dialogue acts, and
coreference relations. The data categories that were defined for
these domains, and that are documented in Deliverable D4.3, have
been marginally updated as a result of the test suite construction
and annotation effort, and have been endorsed by the Thematic Domain Group 3, Semantic Content, of division TC 37/SC 4 of the International Organization for Standardization (ISO).
For semantic role annotation, the state of the art in
computational linguistics is such that there are widely diverging
views on what may constitute a useful set of semantic roles, with
the FrameNet and PropBank initiatives as two opposite extremes. We
have proposed a set of data categories that corresponds roughly to
the upper levels of the FrameNet hierarchy, but with a more
strictly semantic orientation. In view of these circumstances, we have carried out an investigation into the usability of the proposed set of descriptors by having material in English and Dutch (partly taken from FrameNet and PropBank data) annotated independently by three annotators. This is reported in section 3.5 of this document. It turns out that even previously untrained annotators, with no specific background in the area, were able to reach substantial agreement on the use of the LIRICS data categories. This is a welcome and very encouraging result. Outside of and after the LIRICS project, this will be investigated further, also by systematically relating LIRICS annotations to FrameNet and PropBank annotations, and will be reported at conferences and in the literature on semantic annotation.
For coreference annotation the situation is rather different.
The computational linguistics community is less divided in this
area, and the LIRICS data categories for reference annotation build
on several related efforts in reference annotation. This part of
the annotation work presented relatively little difficulty and did
not warrant a separate investigation into the usability of the
proposed data categories. However, annotators were asked to comment
on a number of aspects of their work, and this has resulted in some
suggestions for improving the set of data categories for reference,
which have been taken into account in the final proposal of this
set, as documented in Deliverable D4.3.
For dialogue act annotation the state of the art is such that
different annotation schemes use a number of common core
descriptors, but vary widely in the number of additional tags, as
well as in their granularity, their naming, and the strictness of
their definitions. The LIRICS proposal for this domain is based on
taking the common core of a range of existing approaches and
extending this core in a principled way, with the help of a
formalized notion of ‘multidimensionality’ in dialogue act
annotation, which has been around informally in this domain for
some time. The usability of the LIRICS tagset was evaluated by having two experienced annotators independently annotate the test suites for English and Dutch. The results, described in section 2.5 of this report, show near-perfect annotator agreement.
6 Bibliography
Carletta, J. (1996). Assessing agreement on classification tasks: The kappa statistic. Computational Linguistics, 22(2):249-254.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20:37-46.
Geertzen, J., Y. Girard, and R. Morante (2004). The DIAMOND project. Poster at the 8th Workshop on the Semantics and Pragmatics of Dialogue (CATALOG 2004), Barcelona, Spain, May 2004.
Geertzen, J. and H. Bunt (2006). Measuring annotator agreement in a complex, hierarchical dialogue act scheme. In Proceedings of the 7th Workshop on Discourse and Dialogue (SIGdial 2006), Sydney, Australia, July 2006, pp. 126-133.
Landis, J. and G. Koch (1977). A one-way components of variance model for categorical data. Biometrics, 33:671-679.
Rietveld, T. and R. van Hout (1993). Statistical techniques for the study of language and language behavior. Berlin: Mouton de Gruyter, p. 219.
Appendix I.A Annotation Guidelines for Dialogue Acts
Dialogue act annotation is about indicating the kind of intention that the speaker had: what was he or she trying to achieve? This is what agents participating in a dialogue are trying to establish.
1. First and most important guideline: “Do as the Addressee would do!”
When assigning annotation tags to a dialogue utterance, put
yourself in the position of the participant at whom the utterance
was addressed, and imagine that you try to understand what the
speaker is trying to do. Why does (s)he say what (s)he says? What
are the purposes of the utterance? What assumptions does the
speaker express about the addressee? Answering such questions
should guide you in deciding which annotation tags to assign,
regardless of how exactly the speaker has expressed himself. Use
all the information that you could have if you were the actual
addressee, and like the addressee, try to interpret the speaker’s
communicative behaviour as best as you can.
2. Second and equally important guideline: “Think functionally, not formally!”
The linguistic form of an utterance often provides vital clues
for choosing an annotation, but such clues may also be misleading;
in making your choice of annotation tags you should of course use
the linguistic clues to your advantage, but don’t let them fool you
- the true question is not what the speaker says but what he
means.
For example, WH-QUESTIONS are questions where the speaker wants
to know which elements of a certain domain have a certain property.
In English, such questions often contain a word beginning with “wh”, such as which as in Which books did you read on your holidays? or where in Where do your parents live? But in other languages this is not the case; moreover, even in English not all sentences of this form express a WH-QUESTION: Why don’t you go ahead, for instance, is typically a SUGGESTION rather than a question.
Similarly, YN-QUESTIONS are questions where the speaker wants to know whether a certain statement is true or false. Such questions typically have the form of an interrogative sentence, such as Is The Hague the capital of the Netherlands? or Do you like peanut butter? But not all sentences of this form express a YN-QUESTION; for example, Do you know what time it is? functions most often as an INDIRECT WH-QUESTION (What time is it?), Would you like some coffee? is an OFFER, and Shall we go? is a SUGGESTION.
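As a reminder that the annotation should follow the function rather than the surface form, the fragment below sketches what this might look like in pseudo-XML; the element name and the speaker and function attributes are assumptions made here for readability, while the communicative function labels are the ones discussed above.
  <!-- illustrative only: tag by function, not by surface form -->
  <utterance id="u1" speaker="A" function="SUGGESTION">
    Why don't you go ahead
  </utterance>
  <utterance id="u2" speaker="A" function="OFFER">
    Would you like some coffee?
  </utterance>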
3. Another important general guideline is: “Be specific!”
Among the communicative functions that you can choose from,
there are differences in specificity, corresponding with their
relative positions in hierarchical subsystems. For instance, a
CHECK is more specific than a YES/NO-QUESTION, in that it
additionally carries the expectation that the answer will be
positive. Similarly, a CONFIRMATION is more specific than a YES/NO-ANSWER, in that it additionally carries the speaker’s assumption that the addressee expects the answer to be positive.
In general, try all the time to be as specific as you can. But
if you’re in serious doubt about specific functions, then simply
use a less specific function tag that covers the more specific
functions.
4. On indirect speech acts: “Code indirect speech acts just like direct ones.”
Standard speech act theory regards indirect speech acts, such as
indirect questions, as just an indirect form of the same
illocutionary acts. By contrast, the DIT++ taxonomy incorporates
the idea that indirect dialogue acts signal subtly different
packages of beliefs and intentions than direct ones. For example,
the direct question What time is it? carries the assumption that
the addressee knows what time it is, whereas the indirect question
Do you know what time it is? does not carry that assumption (it
does at least not express that assumption; in fact it questions
it).
5. On implicit functions: “Do not code implicit communicative functions that can be deduced from functions that you have already assigned.”
Implicit communicative functions occur in particular for
positive feedback.
For example, someone answering a question may be assumed to
(believe to) have understood the question. So any time you annotate
an utterance as an ANSWER (of some sort), you might consider
annotating it also as providing positive feedback on the
interpretation of the question that is answered. Don’t! It would be
redundant.
Notice also that the definition of a positive (auto-) feedback
act concerning interpretation stipulates that the speaker wants the
addressee to know that he (the speaker) has understood the question. A speaker who answers a question does not so much want to tell the addressee that his question was understood; that is just a side-effect of giving an answer, one that no speaker can avoid.
Similarly for reacting to an offer, a request, a suggestion,
etc.
6. Guidelines for the annotation of feedback functions.
Negative feedback, where the speaker wants to indicate that there was a problem in processing a dialogue utterance, is always explicit and as such mostly easy to annotate.
6.1 Implicit and explicit positive feedback.
Positive feedback is sometimes given explicitly, and very often
implicitly.
Examples of explicit positive auto-feedback are the following
utterances by B, where he repeats part of the question by A:
A: What time does the KLM flight from Jakarta on Friday, October
13 arrive?
B: The KLM flight from Jakarta on Friday, October 13 has
scheduled arrival time 08.50.
B: The flight from Jakarta on Friday has scheduled arrival time
08.50.
B: The KLM flight from Jakarta on October 13 has scheduled
arrival time 08.50.
B: The flight from Jakarta on October 13 has scheduled arrival time 08.50.
In such cases, the utterance by B should be annotated as having,
besides the general-purpose function WH-ANSWER in the Task/Domain
dimension, also a function in the Auto-Feedback dimension (see
below).
By contrast, the short answer “At 08.50” would carry only implicit feedback information, and should therefore, following Guideline 5, not be coded in the Auto-Feedback dimension.
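To make the contrast concrete, the sketch below shows, in the same assumed pseudo-XML notation as before, how the full repeat and the short answer would differ; the attribute names task and autoFeedback are illustrative stand-ins for the Task/Domain and Auto-Feedback dimensions, not the normative format.
  <!-- full repeat: WH-ANSWER plus explicit positive auto-feedback -->
  <utterance id="u2" speaker="B" task="WH-ANSWER" autoFeedback="POSITIVE">
    The KLM flight from Jakarta on Friday, October 13 has scheduled arrival time 08.50.
  </utterance>
  <!-- short answer: feedback is only implicit, so no Auto-Feedback tag (Guideline 5) -->
  <utterance id="u3" speaker="B" task="WH-ANSWER">
    At 08.50.
  </utterance>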
6.2 Levels of feedback.
The DIT++ taxonomy distinguishes 5 levels of feedback:
1. participant A pays attention to participant B’s
utterance.
2. A perceives B’s utterance, i.e. A recognizes the words and
nonverbal elements in B’s contribution.
3. A understands B’s utterance, i.e. A assigns an interpretation
to B’s utterance, including what A believes B is trying to achieve
with this utterance (what are his goals and associated beliefs
about the task/domain and about A).
4. A evaluates B’s utterance, i.e. A decides whether the beliefs about B that characterize his understanding of B’s utterance can be added to A’s model of the dialogue context, updating his context model without arriving at inconsistencies.
5. A ’executes’ B’s utterance, i.e. A performs actions which are
appropriate for achieving a goal that he had identified and added
to his context model. (For instance, executing a request
is to perform the requested action; executing an answer is to
add the content of the answer to one’s information; executing a
question is to look for the information that was asked for.)
There are certain relations between these levels: in order to execute a dialogue act one must have evaluated it positively (“accepted” it); this is only possible if one has (or believes to have) understood the corresponding utterance; which presupposes that one perceived the utterance in the first place, which, finally, requires paying attention to what is said. So, for instance, positive auto-feedback about the acceptance of the addressee’s previous utterance implies positive feedback at the “lower” levels of understanding, perception, and attention.
For positive feedback functions a higher-level function is more
specific than the lower-level functions. (Remember that a function
is more specific if it implies other functions.)
For negative feedback the reverse holds: when a speaker signals the impossibility of perceiving an utterance, he implies the impossibility of interpreting, evaluating and executing it. So negative feedback at a lower level implies negative feedback at higher levels.
Since, following Guideline 3, you should always be as specific
as possible, you should observe the following guideline for
annotating feedback functions:
Guideline 6: When assigning a feedback function, in the case of positive feedback choose the highest level of feedback that you feel to be appropriate, and in the case of negative feedback choose the lowest level.
While this guideline instructs you to be as specific as
possible, sometimes you’ll be in serious doubt. You may for
instance find yourself in a situation where you have no clue
whether a feedback signal (such as OK) should be interpreted at the
level of interpretation or that of evaluation. In such a case you
should use the less specific of the two, since the more specific level would mean that you “read” more into this utterance than you can justify.
In practice, it is often difficult to decide the level of
feedback that should be chosen. One of the reasons for this is that
the same verbal and nonverbal expressions may be used at most of
the levels (with a tendency to signal feedback (positively or
negatively) with more emphasis as higher levels of processing are
involved). It may happen that you encounter a feedback signal and
you have no clue at all at which level you should interpret that
signal. In this situation the annotation scheme allows you to use
the labels POSITIVE and NEGATIVE, which leave the level of feedback
unspecified.
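For instance, under the same assumed notation as in the earlier sketches, a bare “OK” whose feedback level cannot be determined could be tagged with the level left open:
  <!-- level of positive auto-feedback deliberately left unspecified -->
  <utterance id="u7" speaker="A" autoFeedback="POSITIVE">
    OK
  </utterance>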
7. Guidelines for the annotation of Interaction Management
functions
7.1 Turn Management.
General guideline:
“Code Turn Management functions only when these are not just implied.”
In a spoken dialogue, the participants take turns to speak.
(Their nonverbal behaviour is not organised in turns; both
participants use facial expressions and gestures more or less all
the time.) A turn, that is a stretch of speech by one of the
participants, in general consists of smaller parts that have a
meaning as a dialogue act; these parts we call “utterances”. Turn Management acts are the actions that participants perform in order to manage this aspect of the interaction. These acts are subdivided into acts for taking the turn (utterance-initial acts) and those for keeping the turn or giving it away (utterance-final acts).
Usually only the first utterance in a turn has an utterance-initial
function and only the last an utterance-final one. The non-final
utterances in a turn do not have an utterance-final function,
except when the speaker signals (typically in the form of a rising
intonation at the end of the utterance) that the utterance is not
going to be the last one in the turn, i.e. that he wants to continue. In that case the utterance has a TURN KEEP function. Except for the
first one, the utterances in the turn do not have an
utterance-initial function; the speaker does not have to perform a
separate act in order to continue; all he has to do is to continue
speaking.
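A minimal sketch, again with the assumed notation used above (the turnManagement attribute, the utterance ids and the dialogue content are illustrative; task functions are omitted for brevity): a non-final utterance with rising intonation carries TURN KEEP, while a mid-turn utterance without such a signal carries no Turn Management function at all.
  <!-- non-final utterance, rising intonation: speaker signals he wants to keep the turn -->
  <utterance id="u4" speaker="B" turnManagement="TURN KEEP">
    there are two flights on Friday ...
  </utterance>
  <!-- another non-initial, non-final utterance in the same turn: no Turn Management tag -->
  <utterance id="u5" speaker="B">
    one in the morning and one in the evening
  </utterance>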
When a speaker accepts a turn that the addressee has assigned to
him through a TURN ASSIGN act, the utterance should be annotated as
having the utterance-initial function TURN
ACCEPT only when the speaker performs