Page 1: Multimodal corpora and speech technology

Multimodal corpora and speech technology

Kristiina Jokinen

University of Art and Design Helsinki

[email protected]

Page 2: Multimodal corpora and speech technology

22 August 2002, NordTalk NorFA Course "Using Spoken Language Corpora"

Metaphors for Human-computer interaction

Computer as a tool
– Passive and transparent
– Supports the human goals, human control

Computer as an agent
– Intelligent software mediating interaction between the human user and an application
– Models of beliefs, desires, intentions (BDI)
– Complex interaction: cooperation, negotiation; multimodal communication

Page 3: Multimodal corpora and speech technology

Research at UIAH

Interact:
– Cooperation with Finnish universities, IT companies, Association of the Deaf, Arla Institute
– Finnish dialogue system
– Rich interaction situation
– Adaptive machine learning techniques
– Agent-based architecture
– www.mlab.uiah.fi/interact/

DUMAS:
– EU IST-project (SICS, UIAH, UTA, UMIST, Etex, Conexor, Timehouse)
– User modelling for AthosMail (interactive email application)
– Reinforcement learning and dialogue strategies
– www.sics.se/~dumas/

Page 4: Multimodal corpora and speech technology

Multimodal Museum Interfaces

Marjo Mäenpää, Antti Raike
Study projects
New ways of relating art that is both visually interesting and accessible in terms of contents:
– virtual human (avatar) that interactively guides the user through the exhibition using both spoken and sign language
Design for all: accessibility to virtual visitors on museum web sites

Page 5: Multimodal corpora and speech technology

MUMIN Network

NorFA network on MUltiModal INterfaces
Support for contacts, cooperation, education, and research on multimodal interactive systems
MUMIN PhD course in Tampere, 18-22 November (lectures and hands-on exercises on eye-tracking, speech interfaces, electromagnetogram, virtual world)
More information and application forms: http://www.cst.dk/mumin

Page 6: Multimodal corpora and speech technology

Content of the lecture

Definitions and terminology
Why multimodality
Projects and tools
Multimodal annotations
Conclusions and references

Page 7: Multimodal corpora and speech technology

Definitions and Terminology

Page 8: Multimodal corpora and speech technology

What is multi-modality?
Mark Maybury: Dagstuhl seminar 2001

Page 9: Multimodal corpora and speech technology

Human-computer interaction
Gibbon et al. (2000) Handbook of Multimodal and Spoken Dialogue Systems

Control: manipulation and coordination of information

Perception: transforming sensory information to higher level representations

Page 10: Multimodal corpora and speech technology

Terminology

Maybury and Wahlster (1998)
Medium = material object used for presenting or saving information; physical carriers (sounds, movements, NL)
Code = system of symbols used for communication
Modality = senses employed to process incoming information (vision, audition, olfaction, touch, taste) => perception
– vs. communication system, consisting of a code expressed through a certain medium => HCI

Page 11: Multimodal corpora and speech technology

ISLE/NIMM definitions

Medium = physical channel for information encoding: visual, audio, gestures
Modality = particular way of encoding information in some medium

Page 12: Multimodal corpora and speech technology

EAGLES definitions

Multimodal systems represent and manipulate information from different human communication channels at multiple levels of abstraction
Multimedia systems offer more than one device for user input to the system and for system feedback to the user, e.g. microphone, speaker, keyboard, mouse, touch screen, camera
– do not generate abstract concepts automatically
– do not transform the information
Multimodal (audio-visual) speech systems utilise the same multiple channels as human communication by integrating non-verbal cues (facial expression, eye/gaze, and lip movements) with ASR and SS

Page 13: Multimodal corpora and speech technology

Why multimodality?

Page 14: Multimodal corpora and speech technology

Why multimodal research

Next generation interface design will be more conversational in style
– Flexible use of input modes depending on the setting: speech, gesture, pen, etc.
– Broader range of users: ordinary citizens, children, the elderly, users with special needs
Human communication research
– CA, psychologists
– esp. nonverbal behaviour and speech
Animated interface agents

Page 15: Multimodal corpora and speech technology

Advantages of MM interfaces

Redundant and/or complementary modalities can increase interpretation accuracy
– E.g. combine ASR and lipreading in noisy environments
Different modalities, different benefits
– Object references easier by pointing than by speaking
– Commands easier to speak than to choose from a menu using a pointing device
– Multimedia output more expressive than single-medium output
New applications
– Some tasks cumbersome or impossible in a single modality
– E.g. interactive TV

Page 16: Multimodal corpora and speech technology

Advantages (cont.)

Freedom of choice
– users differ in their modality preferences
– users have different needs (Design for All)
Naturalness
– Transfer the habits and strategies learned in human-human communication to human-computer interaction
Adaptation to different environmental settings or evolving environments
– switch from one modality to another depending on external conditions (noise, light...)

Page 17: Multimodal corpora and speech technology

"Disadvantages"

Coordination and combination of modalities
– cognitive overload of the user by stimulation with too many media
Collection of data more expensive
– more complex technical setup
– increased amount of data to be collected
– interdisciplinary know-how
"Natural" remains a rather vague term

Page 18: Multimodal corpora and speech technology

Projects and tools

Page 19: Multimodal corpora and speech technology

EAGLES/ISLE initiatives

EAGLES = Expert Advisory Group on Language Engineering Standards
– Gibbon et al. (1997) Handbook on Standards and Resources for Spoken Language Systems
– Gibbon et al. (2000) Handbook of Multimodal and Spoken Dialogue Systems. Resources, Terminology, and Product Evaluation
ISLE/NIMM = International Standards for Language Engineering / Natural Interaction and Multi-Modality
– discuss annotation schemes specifically for the fields of natural interaction and multi-modal research and development
– develop guidelines for such schemes

Page 20: Multimodal corpora and speech technology

NITE

Dybkjaer et al. 2001
Workbench for multilevel and multimodal annotations
General-purpose tools: stylesheets determine the look and functionality of the user's tool
Continues on the basis of the MATE project
http://nite.nis.sdu.dk/

Page 21: Multimodal corpora and speech technology

MPI Projects

Max Planck Institute for Psycholinguistics (MPI) in Nijmegen
Develop tools for the analysis of multimedia (esp. audiovisual) corpora
Support the scientific exploitation by linguists, anthropologists, psychologists and other researchers
CAVA (Computer Assisted Video Analysis)
EUDICO (European Distributed Corpora)
– platform-independent
– support various storage formats
– support distributed operation via the internet

Page 22: Multimodal corpora and speech technology

ATLAS/Annotation Graphs

Framework to represent complex annotations on signals of arbitrary dimensionality
Abstraction over the diversity of linguistic annotations, expanding on Annotation Graphs
http://www.nist.gov/speech/atlas/

Page 23: Multimodal corpora and speech technology

TalkBank

Five-year interdisciplinary research project funded by NSF
Carnegie Mellon University and the University of Pennsylvania
Developing a number of tools and standards
Study human and animal communication
– Animal Communication
– Classroom Discourse
– Linguistic Exploration
– Gesture and Sign
– Text and Discourse
CHILDES database is viewed as a subset of TalkBank
http://www.talkbank.org/

Page 24: Multimodal corpora and speech technology

Annotation Tools

Anvil (Michael Kipp): speech and gesture
AGTK (Bird and Liberman): speech
MMAX (Mueller and Strube): speech, gesture
Multitool (GSMLC Platform for Multimodal Spoken Language Corpora): video
– http://www.ling.gu.se/gsmlc/

Page 25: Multimodal corpora and speech technology

Some statistics of the tools

Dybkjaer et al. (2002): ISLE/NIMM Survey of Existing tools, standards and user needs
Speech is the key modality: 9/10
Gesture: 7/10
Facial expression: 3/10

Page 26: Multimodal corpora and speech technology

Annotation Graphs

Bird et al. 2000
Formal framework for representing linguistic annotations
Abstracts away from file formats, coding schemes and user interfaces, providing a logical layer for annotation systems
AGTK (Annotation Graph Toolkit): nodes encode time points, edges annotation labels
http://agtk.sourceforge.net/
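
The node-and-edge view is easy to make concrete. Below is a minimal Python sketch of the annotation-graph data model (time-anchored nodes, typed and labelled edges); the names are invented for illustration only, and this is not the AGTK API.

```python
# A minimal sketch of the annotation-graph data model described above:
# nodes are time-anchored points, edges carry a layer/type and a label.
# Invented names for illustration; not the AGTK API.

from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    id: str
    time: float          # anchor in seconds

@dataclass(frozen=True)
class Edge:
    start: Node
    end: Node
    layer: str           # e.g. "word", "dialogue-act"
    label: str

# "It happened yesterday" as word edges over shared, time-anchored nodes
n0, n1, n2, n3 = [Node(f"n{i}", t) for i, t in enumerate([0.00, 0.21, 0.80, 1.45])]
graph = [
    Edge(n0, n1, "word", "It"),
    Edge(n1, n2, "word", "happened"),
    Edge(n2, n3, "word", "yesterday"),
    # a second annotation layer can span the same nodes
    Edge(n0, n3, "dialogue-act", "statement"),
]

words = [e.label for e in graph if e.layer == "word"]
print(words)   # ['It', 'happened', 'yesterday']
```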

Page 27: Multimodal corpora and speech technology

AGTK: Discourse Annotation Tool

Page 28: Multimodal corpora and speech technology

Anvil - Annotation of Video and Language Data

Michael Kipp (2001)
Java-based annotation tool for video files
Encoding of nonverbal behaviour (e.g. gesture)
Imports annotations of speech-related phenomena (e.g. dialogue acts) on multiple layers, tracks
Track definitions according to a specific annotation scheme in Anvil's generic track configuration
All data storage and exchange is in XML

Page 29: Multimodal corpora and speech technology

Anvil – screen shot

Page 30: Multimodal corpora and speech technology

Multimodal Annotation

Page 31: Multimodal corpora and speech technology

Multi-media corpora

Contain multi-media information where various independent streams such as speech, gesture, facial expression and eye movements are annotated and linked
Hugely complex due to complicated time relationships between the annotations
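
The time relationships are where the complexity comes from: each stream is annotated independently, so linking them means reasoning about interval overlap. A minimal Python illustration, with interval data invented for the example:

```python
# Linking independently annotated streams by time overlap.
# Interval layout and labels are invented for this illustration.

def overlaps(a_start, a_end, b_start, b_end):
    """True if the two half-open time intervals [start, end) intersect."""
    return a_start < b_end and b_start < a_end

words    = [("It", 0.00, 0.21), ("happened", 0.21, 0.80), ("yesterday", 0.80, 1.45)]
gestures = [("pointing", 0.70, 1.50)]

# which words does each gesture co-occur with?
for g_label, g_start, g_end in gestures:
    linked = [w for w, w_start, w_end in words if overlaps(w_start, w_end, g_start, g_end)]
    print(g_label, "->", linked)      # pointing -> ['happened', 'yesterday']
```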

Page 32: Multimodal corpora and speech technology

Annotation Challenges

Better understanding of natural communication modalities: human speech, gaze, gestures, facial expressions => how do different modalities support input disambiguation
Behavioural issues: automaticity of human communication modes
Multiparty communication
Technical challenges:
– Synchronisation
– Error handling
– Multimodal platforms, toolkits, architectures

Page 33: Multimodal corpora and speech technology

Annotation Issues

Phenomena
– What is investigated: sounds, words, dialogue acts, coreference, new information, correction, feedback
Theory
– How to label, what categories
Representation
– Markup

It happened yesterday   (orthographic representation)
<w>It</w> <w>happened</w> <w>yesterday</w>   (XML representation)
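
The step from the orthographic to the XML representation can be sketched with Python's standard library; the wrapping <utterance> element is added here only to keep the output well-formed and is not part of the slide's example.

```python
# From orthographic text to <w>-tagged XML, mirroring the example above.

import xml.etree.ElementTree as ET

utterance = "It happened yesterday"
root = ET.Element("utterance")        # wrapper added for well-formedness
for token in utterance.split():
    w = ET.SubElement(root, "w")
    w.text = token

print(ET.tostring(root, encoding="unicode"))
# <utterance><w>It</w><w>happened</w><w>yesterday</w></utterance>
```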

Page 34: Multimodal corpora and speech technology

XML representations

XML = eXtensible Markup Language
Becoming a standard for data representation

<word>happen</word>   vs.   <word base="happen">

Distinction between elements and attributes (see the sketch after this slide):
– <word> <base>happen</base> <pos>verb</pos> </word>
– <word base="happen"> <pos>verb</pos> </word>
– <word base="happen" pos="verb">

XSL Stylesheet Language
XSLT: a language to convert XML documents into another document in any form
Does not support:
– typed/grammar specification of attribute values
– inference models for element values shared by more than one element
– applicability restrictions for attributes that are mutually exclusive
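
The element-versus-attribute choice above, written out with Python's standard library (a sketch; both encodings carry the same information, the choice mainly affects querying and validation):

```python
# Element vs. attribute encodings of the same word annotation.

import xml.etree.ElementTree as ET

# children as elements
w1 = ET.Element("word")
ET.SubElement(w1, "base").text = "happen"
ET.SubElement(w1, "pos").text = "verb"

# the same information as attributes
w2 = ET.Element("word", base="happen", pos="verb")

print(ET.tostring(w1, encoding="unicode"))  # <word><base>happen</base><pos>verb</pos></word>
print(ET.tostring(w2, encoding="unicode"))  # <word base="happen" pos="verb" />
```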

Page 35: Multimodal corpora and speech technology

Speech annotation
Gibbon et al. (2000) Handbook of Multimodal and Spoken Dialogue Systems

Page 36: Multimodal corpora and speech technology

Spoken Dialogue Annotations

Dialogue Acts (Communicative Acts)
– GCSL: acceptance, acknowledgement, agreement, answer, confirmation, question, request, etc.
– Interact: (see the tag table on the next slide)
Feedback
– structure, position, function
Turn management
– overlap (give attention, affirmation, reminder, excuse, hesitation, disagreement, lack of hearing)
– opening/closing an activity

Page 37: Multimodal corpora and speech technology

Interact tags (Jokinen et al. 2001)

Dialogue Act        Freq    %     Example
statement            527   23.5   Eiköhän se löydy sitten. (I'm sure I'll find it.)
acknowledgement      389   17.4   Joo. (Ok, right.)
question             237   10.6   Ja kauanko sinne on ajoaika? (And how long does it take to get there?)
answer               213    9.5   Se tulee noin 15 minuuttia tohon Oulunkylään. (It'll take about 15 minutes to Oulunkylä.)
confirmation         162    7.2   Suunnilleen joo. (Approximately, yes.)
opening              158    7.0   Mä oon X X hei. (Hello, my name is X X.)
check                123    5.5   Eli kuudelt lähtee ensimmäiset. (So the first ones depart at 6 o'clock.)
thanking             112    5.0   Kiitoksia paljon. (Thanks a lot.)
repetition           107    4.8   Kaheksan kolkyt kolme. (At 8.33 a.m.)
ending               100    4.5   Hei. (Bye.)
call_to_continue      45    2.0   Joo-o. (Uh-huh.)
wait                  23    1.0   Katsotaan, hetkinen vaan. (Let's see, just a minute.)
correction            19    0.8   Ei vaan se on edellinen se Uintikeskuksen pysäkki. (No, the Uintikeskus stop is the previous one.)
completion            10    0.4   ...kymmentä joo. (...ten, right.)
request_to_repeat     10    0.4   Anteeks mitä? (Sorry?)
sigh                   6    0.2   Voi kauhee. (Oh dear.)

Page 38: Multimodal corpora and speech technology

Non-linguistic Vocalizations

CHRISTINE corpus
– Simple descriptions: belch, clearsThroat, cough, crying, giggle, humming, laugh, laughing, moan, onTelephone, panting, raspberry, scream, screaming, sigh, singing, sneeze, sniff, whistling, yawn
– More complex descriptions: imitates woman's voice, imitating a sexy woman's voice, imitating Chinese voice, imitating drunken voice, imitating man's voice, imitating posh voice, mimicking police siren, mimicking Birmingham accent, mimicking Donald Duck, mimicking stupid man's voice, mimicking, speaking in French, spelling, whingeing, face-slapping noise, drowning noises, imitates sound of something being unscrewed and popped off, imitates vomiting, makes drunken sounds and a pretend belch, makes running noises, sharp intake of breath, click
– Non-vocal events: loud music and conversation, banging noise, break in recording, car starts up, cat noises, children shouting, dog barks, poor quality recording, traffic noise, loud music is on, microphone too far away, mouth full, telephone rings, beep, clapping, tapping on computer, television

Page 39: Multimodal corpora and speech technology

Gesture Annotation 1

Different types:
– iconic, pointing, emblematic
Different functions:
– make speech understanding easier
– make speech production easier
– add semantic and discourse-level information

Page 40: Multimodal corpora and speech technology

Gesture Annotation 2

What to annotate (a hypothetical record layout is sketched after this list):
– Time
– Movement encoding
– Body parts involved (head, hand, fingers)
– Static vs dynamic components
– Direction, path shape, hand orientation
– Location w.r.t. body
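
A hypothetical Python record covering the fields in the list above; the field names and value sets are illustrative assumptions, not a published gesture-coding scheme.

```python
# Illustrative gesture-annotation record; names and values are assumptions.

from dataclasses import dataclass

@dataclass
class GestureAnnotation:
    start: float                       # time in seconds
    end: float
    body_parts: list                   # e.g. ["hand", "fingers"]
    dynamic: bool                      # static vs dynamic component
    movement: str = ""                 # movement encoding, e.g. "single stroke", "repeated"
    direction: str = ""                # e.g. "backwards", "transversal", "circular"
    path_shape: str = ""
    hand_orientation: str = ""
    location: str = ""                 # location w.r.t. body, e.g. "in front of torso"

g = GestureAnnotation(
    start=12.4, end=13.1,
    body_parts=["hand", "index finger"],
    dynamic=True, movement="single stroke",
    direction="forward", location="in front of torso",
)
print(g)
```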

Page 41: Multimodal corpora and speech technology

LIMSI Coding Schema for MM Dialogues (Car Driver & Co-pilot)

general: v stands for verbal, g stands for gesture, c stands for human copilot, p stands for human pilot, / and \ stand for begin and end of gesture, % stands for a comment written by the encoder, [ and ] are used for defining successive segments of the itinerary ({ and } code subparts of such segments)
time: <timecode-begin / timecode-end>
body part: te=tête (head), ma=main (hand), mo=menton (chin), ms=mains (both hands)
fingers: ix=index (first finger), mj=majeur (middle finger), an=annulaire (ring finger), au=auriculaire (little finger), po=pouce (thumb)
gaze: oc=short glance on the map, ol=long glance on the map
shape of the body part: td=tendu (tense), sp=souple (loose), cr=crochet (hook)
global movement: mv=mouvement ample (wide movement), r=mouvements répétés (repeated movement), ( )=statique (static)
direction of movement: ar=arrière (backwards), tr=transversal (side), ci=circular
meaning of gesture: ds=designation, ca=designation on the map, dr=direction, dc=description, pc=position

Page 42: Multimodal corpora and speech technology

LIMSI Coding Schema

Example:
v(p): et maintenant? (and now?)
v(c): on va, non /là-bas je/ pense, tout droit (we go, no, /over there I/ think, straight ahead)
g(c): ixtddr
graphic(copilot): index finger, tense, direction
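
A hypothetical decoder for the gesture code in the example ("ixtddr"), using the legends from the previous slide; this is a sketch only, not LIMSI's actual tooling.

```python
# Decode a two-letter-token gesture code such as "ixtddr" using the legends
# listed on the previous slide. Illustrative sketch, not LIMSI's tooling.

FINGERS  = {"ix": "index (first finger)", "mj": "middle finger", "an": "ring finger",
            "au": "little finger", "po": "thumb"}
SHAPES   = {"td": "tense", "sp": "loose", "cr": "hook"}
MEANINGS = {"ds": "designation", "ca": "designation on the map", "dr": "direction",
            "dc": "description", "pc": "position"}

def decode(code):
    """Split a code like 'ixtddr' into two-letter tokens and gloss each one."""
    tokens = [code[i:i + 2] for i in range(0, len(code), 2)]
    glossary = {**FINGERS, **SHAPES, **MEANINGS}
    return [glossary.get(tok, f"?{tok}") for tok in tokens]

print(decode("ixtddr"))   # ['index (first finger)', 'tense', 'direction']
```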

Page 43: Multimodal corpora and speech technology

Gesture Coding Schemas 1
Dybkjaer et al. (2002) ISLE/NIMM Survey on MM tools and resources

Page 44: Multimodal corpora and speech technology

Gesture Coding Schemas 2
Dybkjaer et al. (2002) ISLE/NIMM Survey on MM tools and resources

Page 45: Multimodal corpora and speech technology

Facial Action Coding (FACS)

P. Ekman & W. Friesen (1976)
Describes visible facial movements
Anatomically based
Action Unit (AU): action produced by one muscle or group of related muscles
Any expression described as a set of AUs
46 AUs defined
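
Since an expression is just a set of AUs, expressions can be represented and compared as sets. A minimal Python sketch: the AU glosses follow the standard FACS labels, and the example combination is chosen to match the brow-raising AUs on the next slide.

```python
# Expressions as sets of FACS Action Units. The glosses follow the standard
# FACS labels; the example combination is illustrative.

AU_GLOSS = {
    1: "inner brow raiser",
    2: "outer brow raiser",
    4: "brow lowerer",
}

raised_brows = frozenset({1, 2})     # both brows raised (cf. next slide)
lowered_brows = frozenset({4})

def describe(expression):
    return " + ".join(f"AU{au} ({AU_GLOSS.get(au, 'unlabelled')})"
                      for au in sorted(expression))

print(describe(raised_brows))   # AU1 (inner brow raiser) + AU2 (outer brow raiser)
```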

Page 46: Multimodal corpora and speech technology

AUs for raising eye-brows
Dybkjaer et al. (2002) ISLE/NIMM Survey of Annotation Schemes and Identification of Best Practice

Page 47: Multimodal corpora and speech technology

Alphabet of the eyes

I. Poggi, N. Pezzato, C. Pelachaud
Gaze annotation
– eyebrow movements, eyelid openness, wrinkles, eye direction, eye reddening, humidity
E.g. eyebrows:
– right/left: internal: up / down; central: up / down; external: up / down

Page 48: Multimodal corpora and speech technology

Conclusions

Need for corpora annotated with multimodal information
Much to do in coding MM information in all forms, at the relevant level of detail, cross-level & cross-modality
No general coding schemas
– coding schemas exist for different aspects of facial expression, task-dependent gestures etc.
– no cross-modality coding schemas
Lack of theoretical formalisation
– how the face expresses cognitive properties
– how gestures are used (except for sign language)
– how they are coordinated with speech
No general annotation tools

Page 49: Multimodal corpora and speech technology

References

Bernsen, N. O., Dybkjær, L. and Kolodnytsky, M. The NITE Workbench - A Tool for Annotation of Natural Interactivity and Multimodal Data. Proceedings of the Third International Conference on Language Resources and Evaluation (LREC'2002), Las Palmas, May 2002.
Bird, S. and M. Liberman. A formal framework for linguistic annotation. Speech Communication, 33(1-2):23-60, 2001.
Dybkjaer et al. (2002). ISLE/NIMM reports. http://isle.nis.sdu.dk/reports/wp11/
Gibbon, D., Mertins, I. and R. Moore (eds.) Handbook of Multimodal and Spoken Dialogue Systems. Resources, Terminology, and Product Evaluation. Kluwer, 2000.
Granström, B. (ed.) Multimodality in Language and Speech Systems. Dordrecht: Kluwer, 2002.
Kipp, M. Anvil - A Generic Annotation Tool for Multimodal Dialogue. Proceedings of Eurospeech 2001, pp. 1367-1370, Aalborg, September 2001.
Maybury, M. T. and W. Wahlster (1998). Readings in Intelligent User Interfaces. San Francisco, CA: Morgan Kaufmann.
Müller, C. and M. Strube. MMAX: A tool for the annotation of multi-modal corpora. Proceedings of the 2nd IJCAI Workshop on Knowledge and Reasoning in Practical Dialogue Systems, Seattle, Washington, pp. 45-50, 2001.
Wahlster, W. (ed.) Dagstuhl seminar on Multimodality. http://www.dfki.de/~wahlster/Dagstuhl_Multi_Modality/