Multi-Modal Dialogue in Personal Navigation Systems
Arthur Chan
Dec 22, 2015
Introduction
The term “multi-modal”: a general description of an application that can be operated in multiple input/output modes.
E.g. input: voice, pen, gesture, facial expression; output: voice, graphical output.
[Also see the supplementary slides on Alex and Arthur’s discussion of the definition.]
Multi-modal Dialogue (MMD) in Personal Navigation Systems
Motivation of this presentation: navigation systems give MMD an interesting scenario, a case for why MMD is useful.
Structure of this presentation: three system papers
AT&T MATCH: speech and pen input with pen gestures
Speechworks Walking Directions System: speech and stylus input
Univ. of Saarland REAL: speech and pen input; both GPS and a magnetic tracker were used
Multi-modal Language Processing for Mobile Information Access
Overall Function
A working city guide and navigation system with easy access to restaurant and subway information.
Runs on a Fujitsu pen computer. Users are free to give speech commands or draw on the display with a stylus.
Types of Inputs
Speech input: “show cheap italian restaurants in chelsea”
Simultaneous speech and pen input: circle an area and say “show cheap italian restaurants in neighborhood” at the same time.
Functionalities also include reviewing subway routes.
Input Overview
Speech input: uses the AT&T Watson speech recognition engine.
Pen input (electronic ink): allows the use of pen gestures; the pen input can be complex.
Special aggregation techniques are used for these gestures.
Inputs are combined using lattice combination, illustrated by the toy sketch below.
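A toy sketch of the combination idea, reducing each modality to a scored n-best list. MATCH itself combines full lattices with finite-state methods (see the supplementary slides); all hypotheses and scores below are made up.

# Toy combination of speech and gesture hypotheses by joint score.
# Real MATCH composes full lattices; this is only an n-best illustration.
speech_nbest = [("show cheap italian restaurants here", -3.2),
                ("show cheap italian restaurant sphere", -7.9)]
gesture_nbest = [("area:selection", -1.1),
                 ("area:location", -1.5)]

# Joint score of a (speech, gesture) pair: sum of the log-scores.
combined = [((s, g), s_score + g_score)
            for s, s_score in speech_nbest
            for g, g_score in gesture_nbest]
best, score = max(combined, key=lambda item: item[1])
print(best, score)  # best joint hypothesis and its score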
Pen Gesture and Speech Input
For example:
U: “How do I get to this place?” <user circles one of the restaurants displayed on the map>
S: “Where do you want to go from?”
U: “25th St & 3rd Avenue” <user writes 25th St & 3rd Avenue>
<System computes the shortest route>
Summary
Interesting aspects of the system: it illustrates a real-life scenario where multi-modal inputs can be used.
Design issue: how should different inputs be used together?
Algorithmic issue: how should different inputs be combined?
Multi-modal Spoken Dialog with Wireless Devices
Overview
Work by Speechworks, jointly conducted by speech recognition and user interface folks. Two distinct elements:
Speech recognition: in an embedded domain, which speech recognition paradigm should be used? Embedded, network, or distributed speech recognition?
User interface: how to “situationalize” the application?
Overall Function
Walking directions application: assumes a user walking in an unknown city, using a Compaq iPAQ 3765 PocketPC. Users can:
select a city and start/end addresses, display a map, control the display, display directions, and display interactive directions as a list of steps.
Accepts speech input and stylus input, but not pen gestures.
Choice of speech recognition paradigm
Embedded speech recognition: only simple commands can be used due to computation limits.
Network speech recognition: bandwidth is required, and sometimes the network gets cut off.
Distributed speech recognition: the client takes care of the front-end, while the server takes care of decoding. (Issue: higher code complexity.) A sketch of the split appears below.
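A minimal sketch of the distributed paradigm's client/server split. The per-frame log-energy feature and the stub decoder are illustrative stand-ins for a real MFCC front-end and a real server-side decoder.

# Minimal sketch of the distributed speech recognition split: the client
# computes compact features, the server runs the heavyweight decoder.
import numpy as np

def client_front_end(samples, frame_len=160):
    """Client side: turn raw audio into frame-level features."""
    n_frames = len(samples) // frame_len
    frames = samples[: n_frames * frame_len].reshape(n_frames, frame_len)
    # Stand-in feature: log energy per frame (real DSR front-ends send MFCCs).
    return np.log(np.sum(frames ** 2, axis=1) + 1e-10)

def server_decode(features):
    """Server side: decode the received features (stub for a real decoder)."""
    return "walking directions requested" if features.size else ""

audio = np.random.randn(16000)        # 1 second of fake 16 kHz audio
features = client_front_end(audio)    # small payload sent over the network
print(server_decode(features))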
User Interface
Situationalization: potential scenarios
Sitting at a desk
Getting out of a cab, building, or subway and preparing to walk somewhere
Walking somewhere with hands free
Walking somewhere carrying things
Driving somewhere in heavy traffic
Driving somewhere in light traffic
Being the passenger in a car
Being in a highly noisy environment
Their conclusion
The balance of audio and visual information can be reduced to 4 complementary components:
Single-modal: 1. visual mode; 2. audio mode
Multi-modal: 3. visual dominant; 4. audio dominant
A glance at the UI
Summary
Interesting aspects: a great discussion of
how speech recognition can be used in an embedded domain
how users would use the dialogue application
Multi-modal Dialog in a Mobile Pedestrian Navigation System
Overview
Pedestrian navigation system with two components:
IRREAL: indoor navigation system; uses a magnetic tracker
ARREAL: outdoor navigation system; uses GPS
Speech Input/Output
Speech input: HTK; IBM ViaVoice Embedded and Logox were being evaluated.
Speech output: Festival.
Visual output
Both 2D and 3D spatialization are supported.
Interesting aspects
The system is tailored for elderly people.
Speaker clustering: used to improve the recognition rate for elderly people.
Model selection: choose between two acoustic models based on likelihood, elderly models and normal adult models, as in the sketch below.
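A minimal sketch of likelihood-based model selection. The scoring functions and values are hypothetical stand-ins for acoustic models trained on elderly and normal adult speech.

# Minimal sketch of likelihood-based model selection between elderly and
# normal adult acoustic models. Scores are fixed stand-in log-likelihoods.
def select_model(features, models):
    """Score the utterance under each model and keep the most likely."""
    scores = {name: score_fn(features) for name, score_fn in models.items()}
    return max(scores, key=scores.get)

models = {
    "elderly": lambda feats: -120.5,       # log-likelihood under elderly model
    "normal_adult": lambda feats: -140.2,  # log-likelihood under adult model
}
print(select_model(None, models))  # -> "elderly"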
Conclusion
Aspects of multi-modal dialogue: What kinds of inputs should be used? How can speech and other inputs be combined and interact? How would users use the system? How should the system respond to the users?
Supplements on the Definition of Multi-modal Dialogue & How MATCH Combines Multi-modal Inputs
Definition of Multi-modal Dialog
In the slide “Introduction”, Arthur’s definition of a multi-modal application:
a general description of an application that can be operated in multiple input/output modes.
Alex’s comment: “So how about the laptop? Will you consider it as a multi-modal application?”
I am stunned! Alex makes some sense!
The laptop example shows that we expect a “multi-modal application” to allow, in some way, two different modes to operate simultaneously.
So, although a laptop allows both mouse input and keyboard input, it doesn’t fit what people call a multi-modal application.
A further refinement
It is still important to consider a multi-modal application as a generalization of a single-modal application.
This allows thinking about how to deal with situations where a particular mode fails.
How could multi-modal inputs be combined?
How is speech input? Simple click-to-speak input is used; the output is a speech lattice.
How are pen gestures input?
Pen strokes can contain lines and arrows, handwritten words, or selections of entities on the screen.
A standard template-based algorithm is used, which also extracts arrow heads and marks.
Recognition covers 285 words:
attributes of restaurants (“cheap”, “chinese”) and zones or points of interest (“soho”, “empire”)
plus 10 basic gesture marks: lines, arrows, areas, points and the question mark.
Input is broken into a lattice of strokes.
Pen Input Representation
Representation: FORM MEANING (NUMBER TYPE) SEM
FORM: physical form of the gesture, e.g. area, point, line, arrow
MEANING: meaning of the form, e.g. an “area” could be loc(ation) or sel(ection)
NUMBER: number of entities in the selection, e.g. 1, 2, 3 or many
TYPE: the type of entities, e.g. res(taurant) and theater
SEM: placeholder for the specific contents of a gesture, e.g. the points that make up an area, or identifiers of an object
A direct encoding as a data structure is sketched below.
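The representation can be encoded directly as a small data structure; the field names follow the slide, and the example values are hypothetical.

# Direct encoding of the gesture representation FORM MEANING (NUMBER TYPE) SEM.
from dataclasses import dataclass
from typing import List, Union

@dataclass
class GestureSymbol:
    form: str                # physical form: "area", "point", "line", "arrow"
    meaning: str             # e.g. an "area" can be "loc" or "sel"
    number: Union[int, str]  # entities in the selection: 1, 2, 3 or "many"
    type: str                # entity type: "res" (restaurant), "theater", ...
    sem: List[str]           # specific contents, e.g. object identifiers

# A circle around two restaurants, read as a selection gesture:
print(GestureSymbol(form="area", meaning="sel", number=2, type="res",
                    sem=["id1", "id2"]))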
Example:
A first area gesture and a second area gesture (gesture lattice figure).
Example (cont.)
The first gesture is either a location (path 0->1->2->3->7) or the restaurant (path 0->1->2->4->5->6->7).
The second gesture is either a location (path 8->9->10->16) or two restaurants (path 8->9->11->12->13->16).
The aggregate numerical expression from gestures 1 and 2 takes the path through 14->15.
Example (cont.)
The user says: “show Chinese restaurants in this and this neighborhood” (two locations are specified).
Example (cont.)
The user says: “Tell me about this place and these places” (two restaurants are specified).
Example (cont.)
Not covered here: if users say “these three restaurants”, the program needs to aggregate the two gestures together.
This is covered by “Deixis and Conjunction in Multimodal Systems” by Michael Johnston.
In brief: gestures are combined, forming new paths in the lattice, as in the toy sketch below.
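A toy sketch of the aggregation idea, assuming simple selection records rather than Johnston's actual lattice paths.

# Toy sketch of aggregating two selection gestures so that "these three
# restaurants" can match their combined contents. Illustration only.
from dataclasses import dataclass
from typing import List

@dataclass
class Selection:
    number: int     # how many entities this gesture selected
    type: str       # entity type, e.g. "res" for restaurant
    ids: List[str]  # identifiers of the selected entities

def aggregate(g1, g2):
    """Combine two selections of the same type into one larger selection."""
    assert g1.type == g2.type, "can only aggregate gestures of the same type"
    return Selection(g1.number + g2.number, g1.type, g1.ids + g2.ids)

# One restaurant circled, then two more: together, "these three restaurants".
print(aggregate(Selection(1, "res", ["id1"]),
                Selection(2, "res", ["id2", "id3"])))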
How are multi-modal inputs integrated?
Issues:
1. Timing of inputs
2. How are inputs processed? (FSTs)
3. Multi-modal grammars
Details can be found in “Finite-state multimodal parsing and understanding” and “Tight-coupling of Multimodal Language Processing with Speech Recognition”.
Timing of Inputs
MATCH takes the speech and gesture lattices and creates a meaning lattice.
A time-out system is used. When the user hits the click-to-speak button and a speech result arrives: if inking is in progress, MATCH waits a short time-out for the gesture lattice; otherwise MATCH treats the input as unimodal.
The same applies to the gesture lattice. A sketch of this logic follows.
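A minimal sketch of the time-out logic, assuming a hypothetical time-out value and queue-based delivery of the gesture lattice.

# Minimal sketch of the time-out integration logic described above.
import queue

GESTURE_TIMEOUT_S = 1.0  # hypothetical short time-out

def integrate(speech_lattice, inking_in_progress, gesture_queue):
    """Decide between unimodal and multimodal interpretation."""
    if not inking_in_progress:
        return ("unimodal", speech_lattice)  # speech-only input
    try:
        # Inking is in progress: wait briefly for the gesture lattice.
        gesture_lattice = gesture_queue.get(timeout=GESTURE_TIMEOUT_S)
        return ("multimodal", speech_lattice, gesture_lattice)
    except queue.Empty:
        # The gesture lattice never arrived: fall back to speech alone.
        return ("unimodal", speech_lattice)

q = queue.Queue()
q.put("gesture-lattice")
print(integrate("speech-lattice", inking_in_progress=True, gesture_queue=q))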
FST processing of multi-modal inputs
Multi-modal integration is modeled by a 3-tape finite-state device over:
the speech and gesture streams (word and gesture symbols)
their combined meaning (meaning symbols)
The device takes speech and gesture as inputs and creates the meaning output.
It is simulated by two transducers: G:W, which aligns speech and gesture, and G_W:M, which takes the composite alphabet of speech and gesture symbols as input and outputs meaning. The speech and gesture inputs are first composed with G:W; the result is then composed with G_W:M. A toy illustration follows.
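A toy illustration of the two-transducer cascade, reducing each transducer to a finite set of (input, output) pairs. The symbols are invented for illustration and are not MATCH's actual alphabets.

# compose() mirrors FST composition on finite sets of string pairs.
def compose(a, b):
    """Relation composition: A: X->Y with B: Y->Z yields X->Z."""
    return {(x, z) for (x, y1) in a for (y2, z) in b if y1 == y2}

inputs = {("Garea this", "Garea this")}       # aligned gesture+speech input
g_w    = {("Garea this", "Garea_this")}       # G:W pairs gesture with speech
g_w_m  = {("Garea_this", "<loc>id1</loc>")}   # G_W:M maps composite to meaning

step1   = compose(inputs, g_w)   # compose input with G:W
meaning = compose(step1, g_w_m)  # then compose the result with G_W:M
print(meaning)  # {('Garea this', '<loc>id1</loc>')}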
Multi-modal Grammar
The input word and gesture streams generate an XML representation of the meaning (eps = epsilon).
The output would look like:
<cmd> <phone> <restaurant> [id1] </restaurant> </phone> </cmd>
A sketch of one such rule follows.
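A hedged sketch of a grammar rule emitting that XML; the rule format and terminals are simplified inventions, not MATCH's actual multimodal grammar.

# Sketch of one multimodal grammar rule emitting the XML meaning above.
def apply_rule(words, gesture_ids):
    """Map 'phone ...' speech plus a restaurant selection gesture to XML."""
    if words[:1] == ["phone"] and gesture_ids:
        ids = " ".join(f"[{g}]" for g in gesture_ids)
        return f"<cmd> <phone> <restaurant> {ids} </restaurant> </phone> </cmd>"
    return None  # rule does not apply (eps, epsilon, would emit nothing)

print(apply_rule(["phone", "this", "restaurant"], ["id1"]))
# -> <cmd> <phone> <restaurant> [id1] </restaurant> </phone> </cmd>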