Page 1: Multimodal corpora and speech technology

Multimodal corpora and speech technology

Kristiina Jokinen

University of Art and Design Helsinki

[email protected]

Page 2: Multimodal corpora and speech technology

22 August 2002, NordTalk NorFA Course "Using Spoken Language Corpora"

Metaphors for Human-computer interaction

Computer as a tool
– Passive and transparent
– Supports the human goals, human control

Computer as an agent
– Intelligent software mediating interaction between the human user and an application
– Models of beliefs, desires, intentions (BDI)
– Complex interaction: cooperation, negotiation; multimodal communication

Page 3: Multimodal corpora and speech technology

Research at UIAH

Interact:
– Cooperation with Finnish universities, IT companies, Association of the Deaf, Arla Institute
– Finnish dialogue system
– Rich interaction situation
– Adaptive machine learning techniques
– Agent-based architecture
– www.mlab.uiah.fi/interact/

DUMAS:
– EU IST-project (SICS, UIAH, UTA, UMIST, Etex, Conexor, Timehouse)
– User modelling for AthosMail (interactive email application)
– Reinforcement learning and dialogue strategies
– www.sics.se/~dumas/

Page 4: Multimodal corpora and speech technology

Multimodal Museum Interfaces

Marjo Mäenpää, Antti Raike
Study projects
New ways of relating art that is both visually interesting and accessible in terms of contents:
– virtual human (avatar) that interactively guides the user through the exhibition using both spoken and sign language
Design for all: accessibility to virtual visitors on museum web sites

Page 5: Multimodal corpora and speech technology

MUMIN Network

NorFA network on MUltiModal INterfaces
Support for contacts, cooperation, education, and research on multimodal interactive systems
MUMIN PhD course in Tampere, 18-22 November (lectures and hands-on exercises on eye-tracking, speech interfaces, electromagnetogram, virtual world)
More information and application forms: http://www.cst.dk/mumin

Page 6: Multimodal corpora and speech technology

Content of the lecture

Definitions and terminology
Why multimodality
Projects and tools
Multimodal annotations
Conclusions and references

Page 7: Multimodal corpora and speech technology

Definitions and Terminology

Page 8: Multimodal corpora and speech technology

What is multi-modality?
Mark Maybury: Dagstuhl seminar 2001

Page 9: Multimodal corpora and speech technology

Human-computer interaction
Gibbon et al. (2000) Handbook of Multimodal and Spoken Dialogue Systems

Control: manipulation and coordination of information

Perception: transforming sensory information to higher level representations

Page 10: Multimodal corpora and speech technology

Terminology

Maybury and Wahlster (1998)
Medium = material object used for presenting or saving information; physical carriers (sounds, movements, NL)
Code = system of symbols used for communication
Modality = senses employed to process incoming information (vision, audition, olfaction, touch, taste) => perception
– vs. communication system, consisting of a code expressed through a certain medium => HCI

Page 11: Multimodal corpora and speech technology

ISLE/NIMM definitions

Medium = physical channel for information encoding: visual, audio, gestures
Modality = particular way of encoding information in some medium

Page 12: Multimodal corpora and speech technology

EAGLES definitions

Multimodal systems represent and manipulate information from different human communication channels at multiple levels of abstraction
Multimedia systems offer more than one device for user input to the system and for system feedback to the user, e.g. microphone, speaker, keyboard, mouse, touch screen, camera
– do not generate abstract concepts automatically
– do not transform the information
Multimodal (audio-visual) speech systems utilise the same multiple channels as human communication by integrating non-verbal cues (facial expression, eye/gaze, and lip movements) with ASR and SS

Page 13: Multimodal corpora and speech technology

Why multimodality?

Page 14: Multimodal corpora and speech technology

Why multimodal research

Next generation interface design will be more conversational in style
– Flexible use of input modes depending on the setting: speech, gesture, pen, etc.
– Broader range of users: ordinary citizens, children, the elderly, users with special needs
Human communication research
– CA, psychologists
– esp. nonverbal behaviour and speech
Animated interface agents

Page 15: Multimodal corpora and speech technology

Advantages of MM interfaces

Redundant and/or complementary modalities can increase interpretation accuracy
– E.g. combine ASR and lipreading in noisy environments
Different modalities, different benefits
– Object references easier by pointing than by speaking
– Commands easier to speak than to choose from a menu using a pointing device
– Multimedia output more expressive than single-medium output
New applications
– Some tasks cumbersome or impossible in a single modality
– E.g. interactive TV

Page 16: Multimodal corpora and speech technology

Advantages (cont.)

Freedom of choice
– users differ in their modality preferences
– users have different needs (Design for All)
Naturalness
– Transfer the habits and strategies learned in human-human communication to human-computer interaction
Adaptation to different environmental settings or evolving environments
– switch from one modality to another depending on external conditions (noise, light...)

Page 17: Multimodal corpora and speech technology

"Disadvantages"

Coordination and combination of modalities
– cognitive overload of the user by stimulation with too many media
Collection of data more expensive
– more complex technical setup
– increased amount of data to be collected
– interdisciplinary know-how
"Natural" remains a rather vague term

Page 18: Multimodal corpora and speech technology

Projects and tools

Page 19: Multimodal corpora and speech technology

EAGLES/ISLE initiatives

EAGLES = Expert Advisory Group on Language Engineering Standards
– Gibbon et al. (1997) Handbook on Standards and Resources for Spoken Language Systems
– Gibbon et al. (2000) Handbook of Multimodal and Spoken Dialogue Systems. Resources, Terminology, and Product Evaluation
ISLE/NIMM = International Standards for Language Engineering / Natural Interaction and Multi-Modality
– discuss annotation schemes specifically for the fields of natural interaction and multi-modal research and development
– develop guidelines for such schemes

Page 20: Multimodal corpora and speech technology

NITE

Dybkjaer et al. 2001
Workbench for multilevel and multimodal annotations
General-purpose tools: stylesheets determine the look and functionality of the user's tool
Continues on the basis of the MATE project
http://nite.nis.sdu.dk/

Page 21: Multimodal corpora and speech technology

MPI Projects

Max Planck Institute for Psycholinguistics (MPI) in Nijmegen
Develop tools for the analysis of multimedia (esp. audiovisual) corpora
Support the scientific exploitation by linguists, anthropologists, psychologists and other researchers
CAVA (Computer Assisted Video Analysis)
EUDICO (European Distributed Corpora)
– platform-independent
– support various storage formats
– support distributed operation via the internet

Page 22: Multimodal corpora and speech technology

ATLAS/Annotation Graphs

Framework to represent complex annotations on signals of arbitrary dimensionality
Abstraction over the diversity of linguistic annotations, expanding on Annotation Graphs
http://www.nist.gov/speech/atlas/

Page 23: Multimodal corpora and speech technology

TalkBank

Five-year interdisciplinary research project funded by NSF
Carnegie Mellon University and the University of Pennsylvania
Developing a number of tools and standards
Study human and animal communication
– Animal Communication
– Classroom Discourse
– Linguistic Exploration
– Gesture and Sign
– Text and Discourse
CHILDES database is viewed as a subset of TalkBank
http://www.talkbank.org/

Page 24: Multimodal corpora and speech technology

Annotation Tools

Anvil (Michael Kipp): speech and gesture
AGTK (Bird and Liberman): speech
MMAX (Mueller and Strube): speech, gesture
Multitool (GSMLC Platform for Multimodal Spoken Language Corpora): video
– http://www.ling.gu.se/gsmlc/

Page 25: Multimodal corpora and speech technology

Some statistics of the tools

Dybkjaer et al. (2002): ISLE/NIMM Survey of Existing tools, standards and user needs
Speech is the key modality: 9/10
Gesture: 7/10
Facial expression: 3/10

Page 26: Multimodal corpora and speech technology

Annotation Graphs

Bird et al. 2000
Formal framework for representing linguistic annotations
Abstracts away from file formats, coding schemes and user interfaces, providing a logical layer for annotation systems
AGTK (Annotation Graph Toolkit): nodes encode time points, edges annotation labels
http://agtk.sourceforge.net/
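
The node-and-edge view is easy to make concrete. Below is a minimal Python sketch of the annotation-graph data model (time-anchored nodes, typed and labelled edges); the names are invented for illustration only, and this is not the AGTK API.

```python
# A minimal sketch of the annotation-graph data model described above:
# nodes are time-anchored points, edges carry a layer/type and a label.
# Invented names for illustration; not the AGTK API.

from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    id: str
    time: float          # anchor in seconds

@dataclass(frozen=True)
class Edge:
    start: Node
    end: Node
    layer: str           # e.g. "word", "dialogue-act"
    label: str

# "It happened yesterday" as word edges over shared, time-anchored nodes
n0, n1, n2, n3 = [Node(f"n{i}", t) for i, t in enumerate([0.00, 0.21, 0.80, 1.45])]
graph = [
    Edge(n0, n1, "word", "It"),
    Edge(n1, n2, "word", "happened"),
    Edge(n2, n3, "word", "yesterday"),
    # a second annotation layer can span the same nodes
    Edge(n0, n3, "dialogue-act", "statement"),
]

words = [e.label for e in graph if e.layer == "word"]
print(words)   # ['It', 'happened', 'yesterday']
```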

Page 27: Multimodal corpora and speech technology

AGTK: Discourse Annotation Tool

Page 28: Multimodal corpora and speech technology

Anvil - Annotation of Video and Language Data

Michael Kipp (2001)
Java-based annotation tool for video files
Encoding of nonverbal behaviour (e.g. gesture)
Imports annotations of speech-related phenomena (e.g. dialogue acts) on multiple layers, tracks
Track definitions according to a specific annotation scheme in Anvil's generic track configuration
All data storage and exchange is in XML

Page 29: Multimodal corpora and speech technology

Anvil – screen shot

Page 30: Multimodal corpora and speech technology

Multimodal Annotation

Page 31: Multimodal corpora and speech technology

Multi-media corpora

Contain multi-media information where various independent streams such as speech, gesture, facial expression and eye movements are annotated and linked
Hugely complex due to complicated time relationships between the annotations
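
The time relationships are where the complexity comes from: each stream is annotated independently, so linking them means reasoning about interval overlap. A minimal Python illustration, with interval data invented for the example:

```python
# Linking independently annotated streams by time overlap.
# Interval layout and labels are invented for this illustration.

def overlaps(a_start, a_end, b_start, b_end):
    """True if the two half-open time intervals [start, end) intersect."""
    return a_start < b_end and b_start < a_end

words    = [("It", 0.00, 0.21), ("happened", 0.21, 0.80), ("yesterday", 0.80, 1.45)]
gestures = [("pointing", 0.70, 1.50)]

# which words does each gesture co-occur with?
for g_label, g_start, g_end in gestures:
    linked = [w for w, w_start, w_end in words if overlaps(w_start, w_end, g_start, g_end)]
    print(g_label, "->", linked)      # pointing -> ['happened', 'yesterday']
```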

Page 32: Multimodal corpora and speech technology

Annotation Challenges

Better understanding of natural communication modalities: human speech, gaze, gestures, facial expressions => how do different modalities support input disambiguation
Behavioural issues: automaticity of human communication modes
Multiparty communication
Technical challenges:
– Synchronisation
– Error handling
– Multimodal platforms, toolkits, architectures

Page 33: Multimodal corpora and speech technology

Annotation Issues

Phenomena
– What is investigated: sounds, words, dialogue acts, coreference, new information, correction, feedback
Theory
– How to label, what categories
Representation
– Markup

It happened yesterday   (orthographic representation)
<w>It</w> <w>happened</w> <w>yesterday</w>   (XML representation)
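
The step from the orthographic to the XML representation can be sketched with Python's standard library; the wrapping <utterance> element is added here only to keep the output well-formed and is not part of the slide's example.

```python
# From orthographic text to <w>-tagged XML, mirroring the example above.

import xml.etree.ElementTree as ET

utterance = "It happened yesterday"
root = ET.Element("utterance")        # wrapper added for well-formedness
for token in utterance.split():
    w = ET.SubElement(root, "w")
    w.text = token

print(ET.tostring(root, encoding="unicode"))
# <utterance><w>It</w><w>happened</w><w>yesterday</w></utterance>
```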

Page 34: Multimodal corpora and speech technology

XML representations

XML = eXtensible Markup Language
Becoming a standard for data representation

<word>happen</word>   vs.   <word base="happen">

Distinction between elements and attributes (see the sketch after this slide):
– <word> <base>happen</base> <pos>verb</pos> </word>
– <word base="happen"> <pos>verb</pos> </word>
– <word base="happen" pos="verb">

XSL Stylesheet Language
XSLT: a language to convert XML documents into another document in any form
Does not support:
– typed/grammar specification of attribute values
– inference models for element values shared by more than one element
– applicability restrictions for attributes that are mutually exclusive
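
The element-versus-attribute choice above, written out with Python's standard library (a sketch; both encodings carry the same information, the choice mainly affects querying and validation):

```python
# Element vs. attribute encodings of the same word annotation.

import xml.etree.ElementTree as ET

# children as elements
w1 = ET.Element("word")
ET.SubElement(w1, "base").text = "happen"
ET.SubElement(w1, "pos").text = "verb"

# the same information as attributes
w2 = ET.Element("word", base="happen", pos="verb")

print(ET.tostring(w1, encoding="unicode"))  # <word><base>happen</base><pos>verb</pos></word>
print(ET.tostring(w2, encoding="unicode"))  # <word base="happen" pos="verb" />
```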

Page 35: Multimodal corpora and speech technology

Speech annotation
Gibbon et al. (2000) Handbook of Multimodal and Spoken Dialogue Systems

Page 36: Multimodal corpora and speech technology

Spoken Dialogue Annotations

Dialogue Acts (Communicative Acts)
– GCSL: acceptance, acknowledgement, agreement, answer, confirmation, question, request, etc.
– Interact: (see the tag table on the next slide)
Feedback
– structure, position, function
Turn management
– overlap (give attention, affirmation, reminder, excuse, hesitation, disagreement, lack of hearing)
– opening/closing an activity

Page 37: Multimodal corpora and speech technology

Interact tags (Jokinen et al. 2001)

Dialogue Act        Freq    %     Example
statement            527   23.5   Eiköhän se löydy sitten. (I'm sure I'll find it.)
acknowledgement      389   17.4   Joo. (Ok, right.)
question             237   10.6   Ja kauanko sinne on ajoaika? (And how long does it take to get there?)
answer               213    9.5   Se tulee noin 15 minuuttia tohon Oulunkylään. (It'll take about 15 minutes to Oulunkylä.)
confirmation         162    7.2   Suunnilleen joo. (Approximately, yes.)
opening              158    7.0   Mä oon X X hei. (Hello, my name is X X.)
check                123    5.5   Eli kuudelt lähtee ensimmäiset. (So the first ones depart at 6 o'clock.)
thanking             112    5.0   Kiitoksia paljon. (Thanks a lot.)
repetition           107    4.8   Kaheksan kolkyt kolme. (At 8.33 a.m.)
ending               100    4.5   Hei. (Bye.)
call_to_continue      45    2.0   Joo-o. (Uh-huh.)
wait                  23    1.0   Katsotaan, hetkinen vaan. (Let's see, just a minute.)
correction            19    0.8   Ei vaan se on edellinen se Uintikeskuksen pysäkki. (No, the Uintikeskus stop is the previous one.)
completion            10    0.4   ...kymmentä joo. (...ten, right.)
request_to_repeat     10    0.4   Anteeks mitä? (Sorry?)
sigh                   6    0.2   Voi kauhee. (Oh dear.)

Page 38: Multimodal corpora and speech technology

Non-linguistic Vocalizations

CHRISTINE corpus
– Simple descriptions: belch, clearsThroat, cough, crying, giggle, humming, laugh, laughing, moan, onTelephone, panting, raspberry, scream, screaming, sigh, singing, sneeze, sniff, whistling, yawn
– More complex descriptions: imitates woman's voice, imitating a sexy woman's voice, imitating Chinese voice, imitating drunken voice, imitating man's voice, imitating posh voice, mimicking police siren, mimicking Birmingham accent, mimicking Donald Duck, mimicking stupid man's voice, mimicking, speaking in French, spelling, whingeing, face-slapping noise, drowning noises, imitates sound of something being unscrewed and popped off, imitates vomiting, makes drunken sounds and a pretend belch, makes running noises, sharp intake of breath, click
– Non-vocal events: loud music and conversation, banging noise, break in recording, car starts up, cat noises, children shouting, dog barks, poor quality recording, traffic noise, loud music is on, microphone too far away, mouth full, telephone rings, beep, clapping, tapping on computer, television

Page 39: Multimodal corpora and speech technology

Gesture Annotation 1

Different types:
– iconic, pointing, emblematic
Different functions:
– make speech understanding easier
– make speech production easier
– add semantic and discourse-level information

Page 40: Multimodal corpora and speech technology

Gesture Annotation 2

What to annotate (a hypothetical record layout is sketched after this list):
– Time
– Movement encoding
– Body parts involved (head, hand, fingers)
– Static vs dynamic components
– Direction, path shape, hand orientation
– Location w.r.t. body
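
A hypothetical Python record covering the fields in the list above; the field names and value sets are illustrative assumptions, not a published gesture-coding scheme.

```python
# Illustrative gesture-annotation record; names and values are assumptions.

from dataclasses import dataclass

@dataclass
class GestureAnnotation:
    start: float                       # time in seconds
    end: float
    body_parts: list                   # e.g. ["hand", "fingers"]
    dynamic: bool                      # static vs dynamic component
    movement: str = ""                 # movement encoding, e.g. "single stroke", "repeated"
    direction: str = ""                # e.g. "backwards", "transversal", "circular"
    path_shape: str = ""
    hand_orientation: str = ""
    location: str = ""                 # location w.r.t. body, e.g. "in front of torso"

g = GestureAnnotation(
    start=12.4, end=13.1,
    body_parts=["hand", "index finger"],
    dynamic=True, movement="single stroke",
    direction="forward", location="in front of torso",
)
print(g)
```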

Page 41: Multimodal corpora and speech technology

LIMSI Coding Schema for MM Dialogues (Car Driver & Co-pilot)

general: v stands for verbal, g stands for gesture, c stands for human copilot, p stands for human pilot, / and \ stand for begin and end of gesture, % stands for a comment written by the encoder, [ and ] are used for defining successive segments of the itinerary ({ and } code subparts of such segments)
time: <timecode-begin / timecode-end>
body part: te=tête (head), ma=main (hand), mo=menton (chin), ms=mains (both hands)
fingers: ix=index (first finger), mj=majeur (middle finger), an=annulaire (ring finger), au=auriculaire (little finger), po=pouce (thumb)
gaze: oc=short glance on the map, ol=long glance on the map
shape of the body part: td=tendu (tense), sp=souple (loose), cr=crochet (hook)
global movement: mv=mouvement ample (wide movement), r=mouvements répétés (repeated movement), ( )=statique (static)
direction of movement: ar=arrière (backwards), tr=transversal (side), ci=circular
meaning of gesture: ds=designation, ca=designation on the map, dr=direction, dc=description, pc=position

Page 42: Multimodal corpora and speech technology

LIMSI Coding Schema

Example:
v(p): et maintenant? (and now?)
v(c): on va, non /là-bas je/ pense, tout droit (we go, no, /over there I/ think, straight ahead)
g(c): ixtddr
graphic(copilot): index finger, tense, direction
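
A hypothetical decoder for the gesture code in the example ("ixtddr"), using the legends from the previous slide; this is a sketch only, not LIMSI's actual tooling.

```python
# Decode a two-letter-token gesture code such as "ixtddr" using the legends
# listed on the previous slide. Illustrative sketch, not LIMSI's tooling.

FINGERS  = {"ix": "index (first finger)", "mj": "middle finger", "an": "ring finger",
            "au": "little finger", "po": "thumb"}
SHAPES   = {"td": "tense", "sp": "loose", "cr": "hook"}
MEANINGS = {"ds": "designation", "ca": "designation on the map", "dr": "direction",
            "dc": "description", "pc": "position"}

def decode(code):
    """Split a code like 'ixtddr' into two-letter tokens and gloss each one."""
    tokens = [code[i:i + 2] for i in range(0, len(code), 2)]
    glossary = {**FINGERS, **SHAPES, **MEANINGS}
    return [glossary.get(tok, f"?{tok}") for tok in tokens]

print(decode("ixtddr"))   # ['index (first finger)', 'tense', 'direction']
```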

Page 43: Multimodal corpora and speech technology

Gesture Coding Schemas 1
Dybkjaer et al. (2002) ISLE/NIMM Survey on MM tools and resources

Page 44: Multimodal corpora and speech technology

Gesture Coding Schemas 2
Dybkjaer et al. (2002) ISLE/NIMM Survey on MM tools and resources

Page 45: Multimodal corpora and speech technology

Facial Action Coding (FACS)

P. Ekman & W. Friesen (1976)
Describes visible facial movements
Anatomically based
Action Unit (AU): action produced by one muscle or group of related muscles
Any expression described as a set of AUs
46 AUs defined
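
Since an expression is just a set of AUs, expressions can be represented and compared as sets. A minimal Python sketch: the AU glosses follow the standard FACS labels, and the example combination is chosen to match the brow-raising AUs on the next slide.

```python
# Expressions as sets of FACS Action Units. The glosses follow the standard
# FACS labels; the example combination is illustrative.

AU_GLOSS = {
    1: "inner brow raiser",
    2: "outer brow raiser",
    4: "brow lowerer",
}

raised_brows = frozenset({1, 2})     # both brows raised (cf. next slide)
lowered_brows = frozenset({4})

def describe(expression):
    return " + ".join(f"AU{au} ({AU_GLOSS.get(au, 'unlabelled')})"
                      for au in sorted(expression))

print(describe(raised_brows))   # AU1 (inner brow raiser) + AU2 (outer brow raiser)
```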

Page 46: Multimodal corpora and speech technology

AUs for raising eye-brows
Dybkjaer et al. (2002) ISLE/NIMM Survey of Annotation Schemes and Identification of Best Practice

Page 47: Multimodal corpora and speech technology

Alphabet of the eyes

I. Poggi, N. Pezzato, C. Pelachaud
Gaze annotation
– eyebrow movements, eyelid openness, wrinkles, eye direction, eye reddening, humidity
E.g. eyebrows:
– right/left: internal: up / down; central: up / down; external: up / down

Page 48: Multimodal corpora and speech technology

Conclusions

Need for corpora annotated with multimodal information
Much to do in coding MM information in all forms, at the relevant level of detail, cross-level & cross-modality
No general coding schemas
– coding schemas exist for different aspects of facial expression, task-dependent gestures etc.
– no cross-modality coding schemas
Lack of theoretical formalisation
– how the face expresses cognitive properties
– how gestures are used (except for sign language)
– how they are coordinated with speech
No general annotation tools

Page 49: Multimodal corpora and speech technology

References

Bernsen, N. O., Dybkjær, L. and Kolodnytsky, M. The NITE Workbench - A Tool for Annotation of Natural Interactivity and Multimodal Data. Proceedings of the Third International Conference on Language Resources and Evaluation (LREC'2002), Las Palmas, May 2002.
Bird, S. and M. Liberman. A formal framework for linguistic annotation. Speech Communication, 33(1-2):23-60, 2001.
Dybkjaer et al. (2002). ISLE/NIMM reports. http://isle.nis.sdu.dk/reports/wp11/
Gibbon, D., Mertins, I. and R. Moore (eds.) Handbook of Multimodal and Spoken Dialogue Systems. Resources, Terminology, and Product Evaluation. Kluwer, 2000.
Granström, B. (ed.) Multimodality in Language and Speech Systems. Dordrecht: Kluwer, 2002.
Kipp, M. Anvil - A Generic Annotation Tool for Multimodal Dialogue. Proceedings of Eurospeech 2001, pp. 1367-1370, Aalborg, September 2001.
Maybury, M. T. and W. Wahlster (1998). Readings in Intelligent User Interfaces. San Francisco, CA: Morgan Kaufmann.
Müller, C. and M. Strube. MMAX: A tool for the annotation of multi-modal corpora. Proceedings of the 2nd IJCAI Workshop on Knowledge and Reasoning in Practical Dialogue Systems, Seattle, Washington, pp. 45-50, 2001.
Wahlster, W. (ed.) Dagstuhl seminar on Multimodality. http://www.dfki.de/~wahlster/Dagstuhl_Multi_Modality/