Multi-Modal Dialogue in Personal Navigation Systems
Arthur Chan

Transcript
Page 1:

Multi-Modal Dialogue in Personal Navigation Systems

Arthur Chan

Page 2:

Introduction

The term “multi-modal”: a general description of an application that can be operated in multiple input/output modes.

E.g. input: voice, pen, gesture, facial expression; output: voice, graphical output.

[Also see the supplementary slides on Alex and Arthur’s discussion of the definition.]

Page 3:

Multi-modal Dialogue (MMD) in Personal Navigation Systems

Motivation of this presentation: navigation systems give MMD an interesting scenario, a case for why MMD is useful.

Structure of this presentation: 3 system papers

AT&T MATCH • speech and pen input, with pen gestures

Speechworks Walking Directions System • speech and stylus input

Univ. of Saarland REAL • speech and pen input • both GPS and a magnetic tracker were used

Page 4:

Multi-modal Language Processing for Mobile Information Access

Page 5:

Overall Function

A working city guide and navigation system, with easy access to restaurant and subway information.

Runs on a Fujitsu pen computer. Users are free to

give speech commands • draw on the display with a stylus

Page 6:

Types of Inputs

Speech input: “show cheap italian restaurants in chelsea”

Simultaneous speech and pen input: circle an area and say “show cheap italian restaurants in neighborhood” at the same time.

Functionalities also include reviewing subway routes.

Page 7:

Input Overview

Speech input: uses the AT&T Watson speech recognition engine.

Pen input (electronic ink): allows the use of pen gestures, which can be complex pen input.

Special aggregation techniques are used for these gestures.

Inputs are combined using lattice combination.

Page 8:

Pen Gesture and Speech Input

For example:

U: “How do I get to this place?” <user circles one of the restaurants displayed on the map>

S: “Where do you want to go from?”

U: “25th St & 3rd Avenue” <user writes 25th St & 3rd Avenue>

<System computes the shortest route>

Page 9:

Summary

Interesting aspects of the system: it illustrates a real-life scenario where multi-modal inputs can be used.

Design issue: how should different inputs be used together?

Algorithmic issue: how should different inputs be combined?

Page 10:

Multi-modal Spoken Dialog with Wireless Devices

Page 11:

Overview

Work by Speechworks, conducted jointly by speech recognition and user interface folks.

Two distinct elements:

Speech recognition • In an embedded domain, which speech recognition paradigm should be used? Embedded speech recognition? Network speech recognition? Distributed speech recognition?

User interface • How to “situationalize” the application?

Page 12:

Overall Function

Walking Directions Application: assumes a user walking in an unknown city, on a Compaq iPAQ 3765 PocketPC. Users can

select a city and start/end addresses • display a map • control the display • display directions • display interactive directions in the form of a list of steps

Accepts speech input and stylus input, but not pen gestures.

Page 13:

Choice of speech recognition paradigm

Embedded speech recognition: only simple commands can be used, due to computation limits.

Network speech recognition: bandwidth is required, and the network connection is sometimes cut off.

Distributed speech recognition: the client takes care of the front-end, and the server takes care of decoding. <Issue: higher complexity of the code.>
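As a concrete illustration of the distributed split, here is a minimal sketch: the client computes compact front-end features locally and only those cross the network. The toy log-energy feature, the JSON payload, and the dummy decoder are my assumptions, not the Speechworks implementation.

```python
# Minimal sketch of distributed speech recognition: client front-end, server decoding.
import json
import numpy as np

def client_front_end(samples: np.ndarray, frame_size: int = 400, hop: int = 160) -> str:
    """Client side: frame the audio and compute a cheap per-frame feature
    (log-energy here, as a stand-in for a real MFCC front-end)."""
    frames = [samples[i:i + frame_size]
              for i in range(0, len(samples) - frame_size, hop)]
    feats = [float(np.log(np.sum(f ** 2) + 1e-10)) for f in frames]
    return json.dumps({"features": feats})      # compact payload sent over the network

def server_decode(payload: str) -> str:
    """Server side: receive features and run the expensive decoder.
    A real system would run a full recognizer; here we return a dummy hypothesis."""
    feats = json.loads(payload)["features"]
    return "show the map" if feats else ""

if __name__ == "__main__":
    audio = np.random.randn(16000)              # one second of fake 16 kHz audio
    print(server_decode(client_front_end(audio)))
```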

Page 14:

User Interface

Situationalization: potential scenarios

Sitting at a desk • Getting out of a cab, building, or subway and preparing to walk somewhere • Walking somewhere with hands free • Walking somewhere carrying things • Driving somewhere in heavy traffic • Driving somewhere in light traffic • Being a passenger in a car • Being in a highly noisy environment

Page 15:

Their conclusion

The balance of audio and visual information can be reduced to 4 complementary components:

Single-modal • 1. Visual mode • 2. Audio mode

Multi-modal • 3. Visual dominant • 4. Audio dominant

Page 16:

A glance at the UI

Page 17:

Summary

Interesting aspects: a great discussion on

how speech recognition could be used in an embedded domain

how the user would use the dialogue application

Page 18:

Multi-modal Dialog in a Mobile Pedestrian Navigation System

Page 19:

Overview

Pedestrian Navigation System with two components:

IRREAL: indoor navigation system • uses a magnetic tracker

ARREAL: outdoor navigation system • uses GPS

Page 20:

Speech Input/Output

Speech input: HTK / IBM ViaVoice embedded; Logox was being evaluated.

Speech output: Festival

Page 21:

Visual output

Both 2D and 3D spatialization supported

Page 22:

Interesting aspects

Tailoring the system for elderly people

Speaker clustering to improve the recognition rate for elderly people

Model selection: choose between two models based on likelihood • elderly models • normal adult models
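A minimal sketch of the likelihood-based selection idea, assuming a toy single-Gaussian score as a stand-in for real acoustic models; the model parameters and function names below are invented, not the REAL system's code.

```python
# Score an utterance's feature frames against both speaker-group models and
# decode with whichever model gives the higher likelihood.
import numpy as np

def gaussian_loglik(x: np.ndarray, mean: float, std: float) -> float:
    """Total log-likelihood of the frames under a single-Gaussian 'model'."""
    return float(-0.5 * np.sum(((x - mean) / std) ** 2 + np.log(2 * np.pi * std ** 2)))

def select_model(frames: np.ndarray) -> str:
    # Hypothetical parameters (mean, std) for the two speaker groups.
    models = {"elderly": (0.8, 1.2), "adult": (0.0, 1.0)}
    scores = {name: gaussian_loglik(frames, m, s) for name, (m, s) in models.items()}
    return max(scores, key=scores.get)      # pick the better-fitting model

if __name__ == "__main__":
    frames = np.random.randn(100) + 0.7      # fake feature frames from one utterance
    print("Selected model:", select_model(frames))
```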

Page 23:

Conclusion

Aspects of multi-modal dialogue:

What kinds of inputs should be used? How can speech and other inputs be combined and interact? How would users use the system? How should the system respond to the users?

Page 24:

Supplements on the Definition of Multi-modal Dialogue & How MATCH Combines Multi-modal Inputs

Page 25:

Definition of Multi-modal Dialog

In the slide “Introduction”, Arthur’s definition of a multi-modal application:

A general description of an application that can be operated in multiple input/output modes.

Alex’s comment: “So how about the laptop? Will you consider it as a multi-modal application?”

Page 26:

I am stunned! Alex makes some sense!

The laptop example shows that we expect a “multi-modal application” to allow, in some way, two different modes to operate simultaneously.

So, although a laptop allows both mouse input and keyboard input, it doesn’t fit what people call a multi-modal application.

Page 27:

A further refinement

It is still important to consider a multi-modal application as a generalization of a single-modal application

This allows thinking about how to deal with situations where a particular mode fails.

Page 28:

How can multi-modal inputs be combined?

How is speech input? Simple click-to-speak input is used. The output is a speech lattice.

Page 29:

How are pen gestures input?

Strokes can contain lines and arrows, handwritten words, or selections of entities on the screen.

A standard template-based algorithm is used, which also extracts arrowheads and marks.

Recognition covers 285 words:

attributes of the restaurants, e.g. “cheap”, “chinese” • zones or points of interest, e.g. “soho”, “empire”

10 basic gesture marks: lines, arrows, areas, points and question marks.

Input is broken into a lattice of strokes.

Page 30:

Pen Input Representation

FORM MEANING (NUMBER TYPE) SEM

FORM: physical form of the gesture, e.g. area, point, line, arrow

MEANING: meaning of the form, e.g. “area” could be loc(ation) or sel(ection)

NUMBER: number of entities in the selection, e.g. 1, 2, 3 or many

TYPE: the type of entities, e.g. res(taurant) and theater

SEM: placeholder for the specific contents of a gesture, e.g. the points that make up an area, or identifiers of an object
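To make the representation concrete, here is a small sketch of one gesture symbol as a plain Python dataclass; the field values in the two examples are invented for illustration, not taken from MATCH's implementation.

```python
# One gesture symbol from the lattice, mirroring the slide's
# FORM MEANING (NUMBER TYPE) SEM structure.
from dataclasses import dataclass
from typing import List

@dataclass
class GestureSymbol:
    form: str          # physical form: "area", "point", "line", "arrow"
    meaning: str       # interpretation of the form: "loc" (location) or "sel" (selection)
    number: str        # entities covered: "1", "2", "3" or "many"
    entity_type: str   # type of entities: "res" (restaurant), "theater", ...
    sem: List[str]     # specific content: points of an area, or object identifiers

# A circled restaurant read as a selection of one restaurant entity:
circled_selection = GestureSymbol("area", "sel", "1", "res", ["id1"])

# The same circle read instead as a plain location constraint on the map:
circled_location = GestureSymbol("area", "loc", "1", "res", ["(x1,y1)", "(x2,y2)", "..."])
```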

Page 31:

Example:

First Area Gesture

Second Area Gesture

Page 32:

Example (cont.)

First gesture: either a location (0->1->2->3->7) or the restaurant (0->1->2->4->5->6->7)

Second gesture: either a location (8->9->10->16) or two restaurants (8->9->11->12->13->16)

Aggregate numerical expression from gestures 1 and 2 -> 14->15

Page 33:

Example (cont.)

User says: “show Chinese restaurant in this and this neighborhood” (two locations are specified)

Page 34:

Example (cont.)

User says: “Tell me about this place and these places” (two restaurants are specified)

Page 35:

Example (cont.)

Not covered here: if users say “these three restaurants”, the program needs to aggregate two gestures together.

This is covered by “Deixis and Conjunction in Multimodal Systems” by Michael Johnston.

In brief: gestures are combined, forming new paths in the lattice.

Page 36:

How are multi-modal inputs integrated?

Issues:

1. Timing of inputs

2. How are inputs processed? (FST) • Details can be found in “Finite-state multimodal parsing and understanding” and “Tight-coupling of Multimodal Language Processing with Speech Recognition”

3. Multi-modal grammars

Page 37:

Timing of Inputs

MATCH takes the speech and gesture lattices and creates a meaning lattice.

A time-out system is used. When the user hits the click-to-speak button and the speech result arrives, if inking is in progress MATCH waits for the gesture lattice within a short time-out; otherwise MATCH treats the input as unimodal.

The same applies, symmetrically, when the gesture lattice arrives first.
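A rough sketch of that time-out logic as I read it from this slide; the integrate()/interpret_unimodal() helpers and the one-second window are assumptions, not MATCH's actual code.

```python
# When one modality's lattice arrives, wait briefly for the other; if it never
# shows up within the time-out, fall back to a unimodal interpretation.
import queue

GESTURE_TIMEOUT_S = 1.0        # short wait for the companion gesture lattice

def on_speech_result(speech_lattice, gesture_queue: queue.Queue, inking_in_progress: bool):
    """Called when the recognizer returns a speech lattice."""
    if inking_in_progress:
        try:
            gesture_lattice = gesture_queue.get(timeout=GESTURE_TIMEOUT_S)
            return integrate(speech_lattice, gesture_lattice)   # multimodal meaning
        except queue.Empty:
            pass                                                # gesture never arrived
    return interpret_unimodal(speech_lattice)                   # speech-only meaning

# Placeholder implementations so the sketch runs end to end.
def integrate(speech, gesture):
    return ("multimodal", speech, gesture)

def interpret_unimodal(speech):
    return ("unimodal", speech)

if __name__ == "__main__":
    q = queue.Queue()
    q.put("gesture lattice: area sel 1 res id1")
    print(on_speech_result("speech lattice: how do I get to this place", q, True))
```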

Page 38:

FST processing of multi-modal inputs

Multi-modal integration is modeled by a 3-tape finite-state device over

the speech stream (word symbols) • the gesture stream (gesture symbols) • their combined meaning (meaning symbols)

The device takes speech and gesture as inputs and creates the meaning output.

It is simulated by two transducers:

G:W — aligns speech and gesture

G_W:M — takes the composite alphabet of speech and gesture symbols as input and outputs meaning

The speech and gesture inputs are first composed with G:W; the result, G_W, is then composed with G_W:M.
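Here is a toy illustration of that two-step composition, treating each transducer as a finite relation over whole symbol sequences. The real MATCH system works over weighted lattices with FST toolkits; every symbol sequence below is invented.

```python
# Toy relational composition standing in for FST composition.
def compose(r1, r2):
    """Pairs (a, c) such that (a, b) is in r1 and (b, c) is in r2."""
    return {(a, c) for (a, b) in r1 for (b2, c) in r2 if b == b2}

# G:W — relates a gesture-symbol sequence to an aligned word sequence,
# yielding the combined gesture_word tape.
G_W = {
    (("area", "sel", "1", "res", "id1"),
     ("how", "do", "i", "get", "to", "this", "place",
      "area", "sel", "1", "res", "id1")),
}

# G_W:M — relates the combined gesture/word tape to meaning symbols.
GW_M = {
    (("how", "do", "i", "get", "to", "this", "place",
      "area", "sel", "1", "res", "id1"),
     ("<route>", "<dest>", "id1", "</dest>", "</route>")),
}

if __name__ == "__main__":
    # First composition aligns gesture with speech; the second produces meaning.
    for gesture, meaning in compose(G_W, GW_M):
        print(" ".join(gesture), "=>", " ".join(meaning))
```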

Page 39:

Multi-modal Grammar

Input word and gesture streams generate an XML representation of meaning (eps : epsilon).

Output would look like:

<cmd> <phone> <restaurant> [id1] </restaurant> </phone> </cmd>
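As a hedged sketch of how such a grammar can pair the two input streams with meaning symbols, each terminal below is a word:gesture:meaning triple, where eps consumes nothing from that stream. The single rule is invented for illustration and far simpler than MATCH's grammar.

```python
# Run one word:gesture:meaning rule over the two input streams and emit the XML meaning.
EPS = "eps"

# Each entry: (word consumed, gesture symbol consumed, meaning symbols emitted)
RULE = [
    ("phone",      EPS,   ["<cmd>", "<phone>"]),
    ("for",        EPS,   []),
    ("this",       "sel", []),
    (EPS,          "id1", ["<restaurant>", "id1", "</restaurant>"]),
    ("restaurant", EPS,   []),
    (EPS,          EPS,   ["</phone>", "</cmd>"]),
]

def run(words, gestures, rule):
    """Consume the word and gesture streams against the rule, emitting meaning."""
    words, gestures, meaning = list(words), list(gestures), []
    for w, g, m in rule:
        if w != EPS:
            assert words.pop(0) == w       # word tape must match
        if g != EPS:
            assert gestures.pop(0) == g    # gesture tape must match
        meaning.extend(m)
    return " ".join(meaning)

if __name__ == "__main__":
    print(run(["phone", "for", "this", "restaurant"], ["sel", "id1"], RULE))
    # -> <cmd> <phone> <restaurant> id1 </restaurant> </phone> </cmd>
```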

Page 40:

Multi-modal Grammar