Center for Human Computer Communication, Department of Computer Science, OGI

Designing Robust Multimodal Systems for Diverse Users and Mobile Environments

Sharon Oviatt
[email protected]; http://www.cse.ogi.edu/CHCC/
Introduction to Perceptive Multimodal Interfaces
Designing Robust Multimodal Systems for Diverse Users and Mobile Environments
• Radical departure from GUIs in basic features, interface design & architectural underpinnings
• Rapid development in 1990s of bimodal systems
• New fusion & language processing techniques
• Diversification of mode combinations & applications
• More general & robust hybrid architectures
Advantages of Multimodal Interfaces
• Flexibility & expressive power
• Support for users’ preferred interaction style
• Accommodate more users, tasks, environments
• Improved error handling & robustness
• Support for new forms of computing, including mobile & pervasive interfaces
• Permit multifunctional & tailored mobile interfaces, adapted to user, task & environment
The Challenge of Robustness: Unimodal Speech Technology’s Achilles’ Heel
• Recognition errors currently limit commercialization of speech technology, especially for:
– Spontaneous interactive speech
– Diverse speakers & speaking styles (e.g., accented)
– Speech in natural field environments (e.g., mobile)
• 20-50% drop in accuracy typical for real-world usage conditions
Improved Error Handling in Flexible Multimodal Interfaces
• Users can avoid errors through mode selection
• Users’ multimodal language is simplified, which reduces complexity of NLP & avoids errors
• Users mode switch after system errors, which undercuts error spirals & facilitates recovery
• Multimodal architectures potentially can support “mutual disambiguation” of input signals
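The slides do not spell out how mutual disambiguation works inside the architecture, but it is commonly described as joint rescoring of the recognizers' n-best lists: a lower-ranked hypothesis in one mode can be "pulled up" when it is the only one semantically compatible with the other mode. A minimal sketch of that idea, with all hypotheses, scores, and the compatibility test purely hypothetical:

```python
# Hypothetical sketch of mutual disambiguation via joint n-best rescoring.
# Each recognizer returns an n-best list of (hypothesis, score); fusion keeps
# only cross-modal pairs that unify semantically, so a low-ranked but
# compatible hypothesis can be "pulled up" over a higher-ranked one.

def mutually_disambiguate(speech_nbest, gesture_nbest, compatible):
    """Return the best (speech, gesture) pair whose meanings unify."""
    best_pair, best_score = None, float("-inf")
    for s_hyp, s_score in speech_nbest:
        for g_hyp, g_score in gesture_nbest:
            if compatible(s_hyp, g_hyp):      # semantic unification test
                joint = s_score + g_score     # combine modality scores
                if joint > best_score:
                    best_pair, best_score = (s_hyp, g_hyp), joint
    return best_pair

# Toy example: the top speech hypothesis "pan" is incompatible with the
# gesture's top hypothesis, so fusion pulls up "zone" instead.
speech = [("pan", -1.0), ("zone", -1.4)]
gesture = [("area", -0.9), ("line", -2.0)]
ok = lambda s, g: (s, g) in {("pan", "line"), ("zone", "area")}
print(mutually_disambiguate(speech, gesture, ok))  # ('zone', 'area')
```

In this toy run the speech recognizer alone would have output "pan", but the jointly best compatible pair is ("zone", "area"), illustrating the pull-up effect the slides describe.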
Example of Mutual Disambiguation: QuickSet Interface during Multimodal “PAN”
Study 1- Research Method
• Research design— completely-crossed factorial with between-subjects factors:
(1) Speaker status (accented, native)
(2) Gender
• Corpus of 2,000 multimodal commands processed by QuickSet
Videotape
Multimodal system processing for accented and mobile users
Study 1- Results
• 1 in 8 multimodal commands succeeded due to mutual disambiguation (MD) of input signals
• MD levels significantly higher for accented speakers than native ones— 15% vs 8.5% of utterances
• Ratio of speech to total signal pull-ups differed for users— .65 accented vs .35 native
• Results replicated across signal & parse-level MD
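The pull-up ratio in the third bullet suggests these statistics come from per-command logs of which modality's top hypothesis had to be overridden. A toy tally of such logs, with record fields entirely hypothetical, just to make the two reported quantities concrete:

```python
# Hypothetical per-command log records: whether the final multimodal
# interpretation was correct, and which input signal's top-ranked hypothesis
# was overridden (a "pull-up"), if any, to reach it.
commands = [
    {"correct": True,  "pulled_up": "speech"},
    {"correct": True,  "pulled_up": None},
    {"correct": True,  "pulled_up": "gesture"},
    {"correct": False, "pulled_up": None},
]

# MD rate: fraction of commands that succeeded only via mutual disambiguation.
md = [c for c in commands if c["correct"] and c["pulled_up"]]
md_rate = len(md) / len(commands)

# Speech share of all pull-ups (cf. the .65 vs .35 ratio in the slide).
speech_share = sum(c["pulled_up"] == "speech" for c in md) / len(md)
print(md_rate, speech_share)  # 0.5 0.5
```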
Table 1—Mutual Disambiguation Rates for Native versus Accented Speakers
Table 2- Recognition Rate Differentials between Native and Accented Speakers for Speech, Gesture and Multimodal Commands
Study 1- Results (cont.)
Compared to traditional speech processing, spoken language processed within a multimodal architecture yielded:
41.3% reduction in total speech error rate
No gender or practice effects found in MD rates
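The 41.3% figure is a relative reduction, i.e. (baseline error − multimodal error) / baseline error. A one-line sanity check of that arithmetic, with the two error rates chosen purely for illustration (the slide reports only the relative figure):

```python
def relative_error_reduction(baseline_err, fused_err):
    """Relative reduction in error rate, e.g. 0.413 == 41.3%."""
    return (baseline_err - fused_err) / baseline_err

# Hypothetical rates chosen only to illustrate the arithmetic: a 25%
# unimodal speech error rate cut to ~14.7% is a 41.3% relative reduction.
print(round(relative_error_reduction(0.250, 0.1468), 3))  # 0.413
```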
Study 2- Research Method
• QuickSet testing with same 100 map-based tasks
• Main study:
– 16 users with high-end mic (close-talking, noise-canceling)
– Research design completely-crossed factorial:
(1) Usage Context- Stationary vs Mobile (within subjects)
(2) Gender
• Replication:
– 6 users with low-end mic (built-in, no noise cancellation)
– Compared stationary vs mobile
Study 2- Research Analyses
• Corpus of 2,600 multimodal commands
• Signal amplitude, background noise & SNR estimated for each command
• Mutual disambiguation & multimodal system recognition rates analyzed in relation to dynamic signal data
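The slides do not say how SNR was estimated per command; a common approach is to compare the RMS amplitude of the command segment against that of the surrounding background noise. A sketch under that assumption (sample windows hypothetical):

```python
import math

def rms(samples):
    """Root-mean-square amplitude of a sample window."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))

def snr_db(signal_samples, noise_samples):
    """Signal-to-noise ratio in dB, estimated from RMS amplitudes."""
    return 20 * math.log10(rms(signal_samples) / rms(noise_samples))

# Toy example: a signal 10x the noise amplitude is 20 dB SNR.
print(round(snr_db([0.5, -0.5, 0.5, -0.5],
                   [0.05, -0.05, 0.05, -0.05])))  # 20
```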
Mobile user with hand-held system & close-talking headset in moderately noisy environment (40-60 dB noise)
Mobile research infrastructure, with user instrumentation and researcher field station
Study 2- Results
• 1 in 7 multimodal commands succeeded due to mutual disambiguation of input signals
• MD levels significantly higher during mobile than stationary system use— 16% vs 9.5% of utterances
• Results replicated across signal and parse-level MD
Table 3- Mutual Disambiguation Rates during Stationary and Mobile System Use
Table 4- Recognition Rate Differentials during Stationary and Mobile System Use for Speech, Gesture and Multimodal Commands
Study 2- Results (cont.)
Compared to traditional speech processing, spoken language processed within a multimodal architecture yielded:
19-35% reduction in total speech error rate (for noise-canceling & built-in mics, respectively)
No gender effects found in MD
Conclusions
• Multimodal architectures can support mutual disambiguation & improved robustness over unimodal processing
• Error rate reduction can be substantial— 20-40%
• Multimodal systems can reduce or close the recognition rate gap for challenging users (accented speakers) & usage contexts (mobile)
• Error-prone recognition technologies can be stabilized within a multimodal architecture, which functions more reliably in real-world contexts