Multi-Modal Dialogue in Personal Navigation Systems
Arthur Chan
Dec 22, 2015
Introduction
The term “multi-modal”: a general description of an application that can be operated in multiple input/output modes.
E.g. input: voice, pen, gesture, facial expression; output: voice, graphical output.
[Also see the supplementary slides on Alex and Arthur’s discussion of the definition.]
Multi-modal Dialogue (MMD) in Personal Navigation Systems
Motivation of this presentation: navigation systems give MMD an interesting scenario, a case for why MMD is useful.
Structure of this presentation: three system papers
AT&T MATCH: speech and pen input with pen gestures
Speechworks Walking Directions System: speech and stylus input
Univ. of Saarland REAL: speech and pen input; both GPS and a magnetic tracker were used
Multi-modal Language Processing for Mobile Information Access
Overall Function
A working city guide and navigation system with easy access to restaurant and subway information.
Runs on a Fujitsu pen computer. Users are free to give speech commands or draw on the display with a stylus.
Types of Inputs
Speech input: “show cheap italian restaurants in chelsea”
Simultaneous speech and pen input: circle an area and say “show cheap italian restaurants in neighborhood” at the same time.
Functionalities also include reviewing subway routes.
Input Overview
Speech input: uses the AT&T Watson speech recognition engine.
Pen input (electronic ink): allows the use of pen gestures; the pen input can be complex.
Special aggregation techniques are used for these gestures.
Inputs are combined using lattice combination, illustrated by the toy sketch below.
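A toy sketch of the combination idea, reducing each modality to a scored n-best list. MATCH itself combines full lattices with finite-state methods (see the supplementary slides); all hypotheses and scores below are made up.

# Toy combination of speech and gesture hypotheses by joint score.
# Real MATCH composes full lattices; this is only an n-best illustration.
speech_nbest = [("show cheap italian restaurants here", -3.2),
                ("show cheap italian restaurant sphere", -7.9)]
gesture_nbest = [("area:selection", -1.1),
                 ("area:location", -1.5)]

# Joint score of a (speech, gesture) pair: sum of the log-scores.
combined = [((s, g), s_score + g_score)
            for s, s_score in speech_nbest
            for g, g_score in gesture_nbest]
best, score = max(combined, key=lambda item: item[1])
print(best, score)  # best joint hypothesis and its score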
Pen Gesture and Speech Input
For example:
U: “How do I get to this place?” <user circles one of the restaurants displayed on the map>
S: “Where do you want to go from?”
U: “25th St & 3rd Avenue” <user writes 25th St & 3rd Avenue>
<System computes the shortest route>
Summary
Interesting aspects of the system: it illustrates a real-life scenario where multi-modal inputs can be used.
Design issue: how should different inputs be used together?
Algorithmic issue: how should different inputs be combined?
Multi-modal Spoken Dialog with Wireless Devices
Overview
Work by Speechworks, jointly conducted by speech recognition and user interface folks. Two distinct elements:
Speech recognition: in an embedded domain, which speech recognition paradigm should be used? Embedded, network, or distributed speech recognition?
User interface: how to “situationalize” the application?
Overall Function
Walking directions application: assumes a user walking in an unknown city, using a Compaq iPAQ 3765 PocketPC. Users can:
select a city and start/end addresses, display a map, control the display, display directions, and display interactive directions as a list of steps.
Accepts speech input and stylus input, but not pen gestures.
Choice of speech recognition paradigm
Embedded speech recognition: only simple commands can be used due to computation limits.
Network speech recognition: bandwidth is required, and sometimes the network gets cut off.
Distributed speech recognition: the client takes care of the front-end, while the server takes care of decoding. (Issue: higher code complexity.) A sketch of the split appears below.
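A minimal sketch of the distributed paradigm's client/server split. The per-frame log-energy feature and the stub decoder are illustrative stand-ins for a real MFCC front-end and a real server-side decoder.

# Minimal sketch of the distributed speech recognition split: the client
# computes compact features, the server runs the heavyweight decoder.
import numpy as np

def client_front_end(samples, frame_len=160):
    """Client side: turn raw audio into frame-level features."""
    n_frames = len(samples) // frame_len
    frames = samples[: n_frames * frame_len].reshape(n_frames, frame_len)
    # Stand-in feature: log energy per frame (real DSR front-ends send MFCCs).
    return np.log(np.sum(frames ** 2, axis=1) + 1e-10)

def server_decode(features):
    """Server side: decode the received features (stub for a real decoder)."""
    return "walking directions requested" if features.size else ""

audio = np.random.randn(16000)        # 1 second of fake 16 kHz audio
features = client_front_end(audio)    # small payload sent over the network
print(server_decode(features))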
User Interface
Situationalization: potential scenarios
Sitting at a desk
Getting out of a cab, building, or subway and preparing to walk somewhere
Walking somewhere with hands free
Walking somewhere carrying things
Driving somewhere in heavy traffic
Driving somewhere in light traffic
Being the passenger in a car
Being in a highly noisy environment
Their conclusion
The balance of audio and visual information can be reduced to 4 complementary components:
Single-modal: 1. visual mode; 2. audio mode
Multi-modal: 3. visual dominant; 4. audio dominant
A glance at the UI
Summary
Interesting aspects: a great discussion of
how speech recognition can be used in an embedded domain
how users would use the dialogue application
Multi-modal Dialog in a Mobile Pedestrian Navigation System
Overview
Pedestrian navigation system with two components:
IRREAL: indoor navigation system; uses a magnetic tracker
ARREAL: outdoor navigation system; uses GPS
Speech Input/Output
Speech input: HTK; IBM ViaVoice Embedded and Logox were being evaluated.
Speech output: Festival.
Visual output
Both 2D and 3D spatialization are supported.
Interesting aspects
The system is tailored for elderly people.
Speaker clustering: used to improve the recognition rate for elderly people.
Model selection: choose between two acoustic models based on likelihood, elderly models and normal adult models, as in the sketch below.
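A minimal sketch of likelihood-based model selection. The scoring functions and values are hypothetical stand-ins for acoustic models trained on elderly and normal adult speech.

# Minimal sketch of likelihood-based model selection between elderly and
# normal adult acoustic models. Scores are fixed stand-in log-likelihoods.
def select_model(features, models):
    """Score the utterance under each model and keep the most likely."""
    scores = {name: score_fn(features) for name, score_fn in models.items()}
    return max(scores, key=scores.get)

models = {
    "elderly": lambda feats: -120.5,       # log-likelihood under elderly model
    "normal_adult": lambda feats: -140.2,  # log-likelihood under adult model
}
print(select_model(None, models))  # -> "elderly"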
Conclusion
Aspects of multi-modal dialogue: What kinds of inputs should be used? How can speech and other inputs be combined and interact? How would users use the system? How should the system respond to the users?
Supplements on the Definition of Multi-modal Dialogue & How MATCH Combines Multi-modal Inputs
Definition of Multi-modal Dialog
In the slide “Introduction”, Arthur’s definition of a multi-modal application:
a general description of an application that can be operated in multiple input/output modes.
Alex’s comment: “So how about the laptop? Will you consider it as a multi-modal application?”
I am stunned! Alex makes some sense!
The laptop example shows that we expect a “multi-modal application” to allow, in some way, two different modes to operate simultaneously.
So, although a laptop allows both mouse input and keyboard input, it doesn’t fit what people call a multi-modal application.
A further refinement
It is still important to consider a multi-modal application as a generalization of a single-modal application.
This allows thinking about how to deal with situations where a particular mode fails.
How could multi-modal inputs be combined?
How is speech input? Simple click-to-speak input is used; the output is a speech lattice.
How are pen gestures input?
Pen strokes can contain lines and arrows, handwritten words, or selections of entities on the screen.
A standard template-based algorithm is used, which also extracts arrow heads and marks.
Recognition covers 285 words:
attributes of restaurants (“cheap”, “chinese”) and zones or points of interest (“soho”, “empire”)
plus 10 basic gesture marks: lines, arrows, areas, points and the question mark.
Input is broken into a lattice of strokes.
Pen Input Representation
Representation: FORM MEANING (NUMBER TYPE) SEM
FORM: physical form of the gesture, e.g. area, point, line, arrow
MEANING: meaning of the form, e.g. an “area” could be loc(ation) or sel(ection)
NUMBER: number of entities in the selection, e.g. 1, 2, 3 or many
TYPE: the type of entities, e.g. res(taurant) and theater
SEM: placeholder for the specific contents of a gesture, e.g. the points that make up an area, or identifiers of an object
A direct encoding as a data structure is sketched below.
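The representation can be encoded directly as a small data structure; the field names follow the slide, and the example values are hypothetical.

# Direct encoding of the gesture representation FORM MEANING (NUMBER TYPE) SEM.
from dataclasses import dataclass
from typing import List, Union

@dataclass
class GestureSymbol:
    form: str                # physical form: "area", "point", "line", "arrow"
    meaning: str             # e.g. an "area" can be "loc" or "sel"
    number: Union[int, str]  # entities in the selection: 1, 2, 3 or "many"
    type: str                # entity type: "res" (restaurant), "theater", ...
    sem: List[str]           # specific contents, e.g. object identifiers

# A circle around two restaurants, read as a selection gesture:
print(GestureSymbol(form="area", meaning="sel", number=2, type="res",
                    sem=["id1", "id2"]))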
Example:
A first area gesture and a second area gesture (gesture lattice figure).
Example (cont.)
The first gesture is either a location (path 0->1->2->3->7) or the restaurant (path 0->1->2->4->5->6->7).
The second gesture is either a location (path 8->9->10->16) or two restaurants (path 8->9->11->12->13->16).
The aggregate numerical expression from gestures 1 and 2 takes the path through 14->15.
Example (cont.)
The user says: “show Chinese restaurants in this and this neighborhood” (two locations are specified).
Example (cont.)
The user says: “Tell me about this place and these places” (two restaurants are specified).
Example (cont.)
Not covered here: if users say “these three restaurants”, the program needs to aggregate the two gestures together.
This is covered by “Deixis and Conjunction in Multimodal Systems” by Michael Johnston.
In brief: gestures are combined, forming new paths in the lattice, as in the toy sketch below.
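A toy sketch of the aggregation idea, assuming simple selection records rather than Johnston's actual lattice paths.

# Toy sketch of aggregating two selection gestures so that "these three
# restaurants" can match their combined contents. Illustration only.
from dataclasses import dataclass
from typing import List

@dataclass
class Selection:
    number: int     # how many entities this gesture selected
    type: str       # entity type, e.g. "res" for restaurant
    ids: List[str]  # identifiers of the selected entities

def aggregate(g1, g2):
    """Combine two selections of the same type into one larger selection."""
    assert g1.type == g2.type, "can only aggregate gestures of the same type"
    return Selection(g1.number + g2.number, g1.type, g1.ids + g2.ids)

# One restaurant circled, then two more: together, "these three restaurants".
print(aggregate(Selection(1, "res", ["id1"]),
                Selection(2, "res", ["id2", "id3"])))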
How are multi-modal inputs integrated?
Issues:
1. Timing of inputs
2. How are inputs processed? (FSTs)
3. Multi-modal grammars
Details can be found in “Finite-state multimodal parsing and understanding” and “Tight-coupling of Multimodal Language Processing with Speech Recognition”.
Timing of Inputs
MATCH takes the speech and gesture lattices and creates a meaning lattice.
A time-out system is used. When the user hits the click-to-speak button and a speech result arrives: if inking is in progress, MATCH waits a short time-out for the gesture lattice; otherwise MATCH treats the input as unimodal.
The same applies to the gesture lattice. A sketch of this logic follows.
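A minimal sketch of the time-out logic, assuming a hypothetical time-out value and queue-based delivery of the gesture lattice.

# Minimal sketch of the time-out integration logic described above.
import queue

GESTURE_TIMEOUT_S = 1.0  # hypothetical short time-out

def integrate(speech_lattice, inking_in_progress, gesture_queue):
    """Decide between unimodal and multimodal interpretation."""
    if not inking_in_progress:
        return ("unimodal", speech_lattice)  # speech-only input
    try:
        # Inking is in progress: wait briefly for the gesture lattice.
        gesture_lattice = gesture_queue.get(timeout=GESTURE_TIMEOUT_S)
        return ("multimodal", speech_lattice, gesture_lattice)
    except queue.Empty:
        # The gesture lattice never arrived: fall back to speech alone.
        return ("unimodal", speech_lattice)

q = queue.Queue()
q.put("gesture-lattice")
print(integrate("speech-lattice", inking_in_progress=True, gesture_queue=q))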
FST processing of multi-modal inputs
Multi-modal integration is modeled by a 3-tape finite-state device over:
the speech and gesture streams (word and gesture symbols)
their combined meaning (meaning symbols)
The device takes speech and gesture as inputs and creates the meaning output.
It is simulated by two transducers: G:W, which aligns speech and gesture, and G_W:M, which takes the composite alphabet of speech and gesture symbols as input and outputs meaning. The speech and gesture inputs are first composed with G:W; the result is then composed with G_W:M. A toy illustration follows.
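A toy illustration of the two-transducer cascade, reducing each transducer to a finite set of (input, output) pairs. The symbols are invented for illustration and are not MATCH's actual alphabets.

# compose() mirrors FST composition on finite sets of string pairs.
def compose(a, b):
    """Relation composition: A: X->Y with B: Y->Z yields X->Z."""
    return {(x, z) for (x, y1) in a for (y2, z) in b if y1 == y2}

inputs = {("Garea this", "Garea this")}       # aligned gesture+speech input
g_w    = {("Garea this", "Garea_this")}       # G:W pairs gesture with speech
g_w_m  = {("Garea_this", "<loc>id1</loc>")}   # G_W:M maps composite to meaning

step1   = compose(inputs, g_w)   # compose input with G:W
meaning = compose(step1, g_w_m)  # then compose the result with G_W:M
print(meaning)  # {('Garea this', '<loc>id1</loc>')}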
Multi-modal Grammar
The input word and gesture streams generate an XML representation of the meaning (eps = epsilon).
The output would look like:
<cmd> <phone> <restaurant> [id1] </restaurant> </phone> </cmd>
A sketch of one such rule follows.
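A hedged sketch of a grammar rule emitting that XML; the rule format and terminals are simplified inventions, not MATCH's actual multimodal grammar.

# Sketch of one multimodal grammar rule emitting the XML meaning above.
def apply_rule(words, gesture_ids):
    """Map 'phone ...' speech plus a restaurant selection gesture to XML."""
    if words[:1] == ["phone"] and gesture_ids:
        ids = " ".join(f"[{g}]" for g in gesture_ids)
        return f"<cmd> <phone> <restaurant> {ids} </restaurant> </phone> </cmd>"
    return None  # rule does not apply (eps, epsilon, would emit nothing)

print(apply_rule(["phone", "this", "restaurant"], ["id1"]))
# -> <cmd> <phone> <restaurant> [id1] </restaurant> </phone> </cmd>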