© 2007 Ken Rehor. All Rights Reserved. 1 Introduction to VoiceXML and Voice Web Architecture Ken Rehor
Introduction to VoiceXml and Voice Web Architecture

May 19, 2015

Thien Nguyen

Transcript
Page 1: Introduction to VoiceXml and Voice Web Architecture

© 2007 Ken Rehor. All Rights Reserved. 1

Introduction to VoiceXML and Voice Web Architecture

Ken Rehor

Page 2: Introduction to VoiceXml and Voice Web Architecture

Session Overview

• Voice Web Architecture
  – Components of a Voice Web Application
• Voice Standards
  – W3C Speech Interface Framework
• VoiceXML
  – Language features
  – Execution model: Form Interpretation Algorithm (FIA)
• Application Design Techniques
  – Static vs. dynamic VoiceXML
  – Performance Considerations
• CCXML, VoiceXML and VoIP
• Application Deployment Models
• New Technologies
  – Speaker Biometrics, Video, Multimodal, VoiceXML 3.0

Page 3: Introduction to VoiceXml and Voice Web Architecture

Simplifying Voice Services programming

• Web-based architecture for interactive speech services
  – Exploit web technologies to simplify voice service creation and deployment
  – Enable consolidation of voice and web services
  – Separate service logic from user interaction
• High-level programming languages
  – Control speech and telephony resources in a uniform manner
  – Shield application programmers from implementation details
    • No need to know ASR, TTS, telephony APIs
  – Create portable applications
    • Run on enterprise system or in telephone network
    • Run on a variety of platforms, ASR agnostic

Page 4: Introduction to VoiceXml and Voice Web Architecture

Voice Web Application Architecture

Page 5: Introduction to VoiceXml and Voice Web Architecture

Key Ideas

• Standard/Common high-level language
  – Designed for the task
• Leverage open, known technology
  – Web protocols, servers, networks, development tools, expertise
• Phone number mapped to URL
  – Phone number associated with URL of voice service

Page 6: Introduction to VoiceXml and Voice Web Architecture

Voice / Web Application Architecture

[Diagram: A web browser and a VoiceXML browser both reach the application (web) server over HTTP, across the Internet or an intranet. Any phone reaches the VoiceXML browser through the PSTN or VoIP. The application server provides application logic, content and data, transaction processing, and the database interface. The web browser receives <html> plus images, audio files, and scripts; the VoiceXML browser receives <vxml> plus grammars (<grxml>), audio files (.wav), and scripts.]

Page 7: Introduction to VoiceXml and Voice Web Architecture

Voice Application Architecture and Components

[Diagram: A caller on the PSTN says "Customer service, please…" and hears "Welcome to Acme products". Components shown for the VoiceXML platform: VoiceXML interpreter, middleware, ASR, TTS, Audio, DTMF, Telephony, and OA&M. The platform fetches <vxml>, <grxml>, and .wav resources from a web server over HTTP, across the Internet or an intranet.]

Page 8: Introduction to VoiceXml and Voice Web Architecture

Application Backend Architecture

[Diagram: The application (web) server delivers <vxml>, grammars, audio files, and scripts over HTTP, across the Internet or an intranet. It provides application logic, content and data, transaction processing, and the database interface. Backend systems shown, reached over an intranet or the Internet: database (content), transaction server, web service.]

Page 9: Introduction to VoiceXml and Voice Web Architecture

Components of a Voice Solution

• Traditional phone, VoIP phone, mobile phone, or multimodal device
• Telephone network
  – Circuit-switched PSTN or packet-switched VoIP
  – Connects caller's telephone with Telephony Server
• Voice User Interface
  – Dialog structure / flow
  – Prompts: what the application says to the user
  – Speech grammars: what the user can say
• Application logic that executes on an application server
  – Web "back-end"
  – Database, or database interface
• VoiceXML Server that executes dialogs
  – Controls resources such as ASR, SIV, TTS, etc.
• Data network to connect application server and VoiceXML server

Page 10: Introduction to VoiceXml and Voice Web Architecture

Inbound or Outbound calls

• VoiceXML application works the same for inbound and outbound calls
  – Additional call progress detection generally required for outbound
• Simple protocol for initiating outbound calls
  – No firm standards, but most vendors follow similar techniques
  – HTTP, Web Services, etc.

Page 11: Introduction to VoiceXml and Voice Web Architecture

Standards

Page 12: Introduction to VoiceXml and Voice Web Architecture

Value of Open Standards

• Non-proprietary interfaces between components
• Allow choice of best components for the task
• User interface languages
  – W3C Speech Interface Framework: VoiceXML, SRGS, SSML, SI
  – W3C: HTML, XHTML, SMIL, X+V
  – OMA: WAP
• Communication protocols
  – W3C: CCXML for 3rd-party telephony call control
  – W3C: HTTP, HTTPS, SOAP, WSDL
  – IETF: SIP, MRCP, MSCP
  – 3GPP: IMS
  – ITU: T1, ISDN

Page 13: Introduction to VoiceXml and Voice Web Architecture

Visual vs. Voice markup

Web app UI
• HTML
  – Structure
  – Layout
  – Input declaration
  – Transitions
• Images
• Audio files / streams
• Video
• Text
• Scripts

Voice Web app UI
• VoiceXML
  – Structure
  – Dialog flow
  – Input declaration
  – Transitions
• Audio files
• Video, Images
• Text (for TTS)
• Scripts

Page 14: Introduction to VoiceXml and Voice Web Architecture

Protocols

Web applications
• HTTP, HTTPS
• RTP
• SOAP
• WSDL
• …

Voice Web applications
• HTTP, HTTPS
• RTP
• SOAP
• WSDL
• SIP
• …

Page 15: Introduction to VoiceXml and Voice Web Architecture

Voice Standards Activities

• Speech Interface Framework
• Network protocols
  – SIP, MRCP v2, etc.
• Platform Certification, Developer Certification, Speaker Biometrics, Architecture, Tools

Page 16: Introduction to VoiceXml and Voice Web Architecture

Voice Application Standards

[Diagram: standards used at each interface of a voice application deployment. Recoverable labels:
  – Caller and phone network to VoIP gateway: T1 / E1, ISDN, SS7
  – Media path: SIP, RTP, RFC 2833 (DTMF); audio as G.711, WAV, .au, mp3, etc.
  – CCXML browser running a CCXML call control application; VoiceXML browser running a VoiceXML application (VoiceXML 2.0, VoiceXML 2.1, ECMAScript 262)
  – Telephony control interface: SIP, etc.
  – Dialog control interface: SIP, MSCP, etc.
  – Media control interface (conference/media server, media mixer/server): SIP Netann, MSCML, MOML / MSML, MSCP, DMSP, MGCP, etc.
  – MRCP client to ASR, TTS, and SIV servers: MRCP v1, MRCP v2, carrying GRXML and SSML
  – Documents fetched over HTTP / HTTPS: CCXML, VXML, GRXML, SSML, scripts, audio; SOAP
  – Legend: ** standards in progress **]

Page 17: Introduction to VoiceXml and Voice Web Architecture

W3C Speech Interface Framework

Page 18: Introduction to VoiceXml and Voice Web Architecture

Voice Application Components

• Dialog
  – Flow control of the inputs, outputs, next steps
• Input grammars
  – Control input constraints for DTMF and speech recognition
• Output formatting
  – Pronunciation, timing, sequencing

Page 19: Introduction to VoiceXml and Voice Web Architecture

W3C Speech Interface Framework

• VoiceXML

• SRGS

• SSML

• Semantic Interpretation

• Pronunciation Lexicon

• Call Control

For more information, see: W3C Voice Browser Working Group, http://www.w3.org/Voice/

Page 20: Introduction to VoiceXml and Voice Web Architecture

Voice User Interface - Dialog

• W3C VoiceXML 2.0
  – W3C Recommendation March 2004
  – Widely implemented
    • Approximately 4 dozen platforms
    • Many service providers worldwide
  – VoiceXML Forum certification program
    • Nearly two dozen certified platforms, more coming
• W3C VoiceXML 2.1
  – Candidate Recommendation Sept 2006
  – Test suite under development; Certification Program to follow
  – Many platform vendors are implementing
• W3C VoiceXML 3.0
  – Early stages of development
  – SCXML: state chart markup language designed as a controller for V3 and CCXML 2.0 ("Working Draft" Jan 2006)

Page 21: Introduction to VoiceXml and Voice Web Architecture

User Interaction – Input / Output Control

• Input grammars: W3C SRGS 1.0
  – W3C Recommendation
  – Widely implemented
• Output formatting: W3C SSML 1.0
  – W3C Recommendation
  – Widely implemented, yet minor real support (most TTS engines ignore the SSML instructions)
• Semantic Interpretation for Speech Recognition: W3C SISR 1.0
  – Nearing Candidate Recommendation
  – Implementation gaining acceptance

Page 22: Introduction to VoiceXml and Voice Web Architecture

W3C Speech Interface Framework
Semantic Interpretation

Page 23: Introduction to VoiceXml and Voice Web Architecture

W3C Speech Recognition Grammar Specification

• Markup language to control input constraints
  – Finite-state speech recognition
  – DTMF recognition
• Two variations
  – XML (GRXML)
  – ABNF
• Version 1.0: W3C Recommendation, March 2004
• Implemented and supported by numerous vendors

Page 24: Introduction to VoiceXml and Voice Web Architecture

GRXML ASR example

<grammar type="application/srgs+xml" root="r2" version="1.0">
  <rule id="r2" scope="public">
    <one-of>
      <item>coffee</item>
      <item>tea</item>
      <item>milk</item>
      <item>nothing</item>
    </one-of>
  </rule>
</grammar>

Page 25: Introduction to VoiceXml and Voice Web Architecture

GRXML DTMF example

<?xml version="1.0"?>
<grammar mode="dtmf" version="1.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.w3.org/2001/06/grammar
                             http://www.w3.org/TR/speech-grammar/grammar.xsd"
         xmlns="http://www.w3.org/2001/06/grammar">

  <rule id="digit">
    <one-of>
      <item> 0 </item>
      <item> 1 </item>
      <item> 2 </item>
      <item> 3 </item>
      <item> 4 </item>
      <item> 5 </item>
      <item> 6 </item>
      <item> 7 </item>
      <item> 8 </item>
      <item> 9 </item>
    </one-of>
  </rule>

  <rule id="pin" scope="public">
    <one-of>
      <item>
        <item repeat="4"><ruleref uri="#digit"/></item>
        #
      </item>
    </one-of>
  </rule>

</grammar>

Page 26: Introduction to VoiceXml and Voice Web Architecture

W3C Speech Synthesis Markup Language

• Markup language to control spoken and audio output
• Version 1.0: W3C Recommendation, Sept 2004
• Implemented and supported by numerous vendors
• Version 1.1: under development
  – Adds support for tonal languages
  – First public Working Draft published January 2007

Page 27: Introduction to VoiceXml and Voice Web Architecture

SSML Functions

• Audio output
  – <audio>
• Text-to-Speech output
  – Contained within SSML constructs
• Pronunciation controls
  – <say-as>
    • Interpret-as
    • Format
    • Detail
  – <emphasis>
• Timing
  – <break>

Page 28: Introduction to VoiceXml and Voice Web Architecture

SSML Functions (cont'd)

• Spoken language
  – xml:lang
• Prosody and Style: voice control
  – Voice
  – Gender
  – Age
  – Name
• Prosody
  – <prosody>
    • Pitch
    • Contour
    • Range
    • Rate
    • Duration
    • Volume

Page 29: Introduction to VoiceXml and Voice Web Architecture

SSML Functions (cont'd)

• Sentence structure
  – <p>
  – <s>
• Modify text
  – <phoneme>: specify pronunciation
  – <sub>: substitute text
• Location identification
  – <mark>
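
The SSML functions listed above can be combined in one document. The fragment below is a minimal sketch, not taken from the original slides: the file name hold_music.wav, the mark name, and the spoken text are hypothetical, and the interpret-as value "characters" is an assumption, since SSML 1.0 does not standardize say-as values and platform support varies.

```xml
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  <!-- Sentence structure: <p> and <s> -->
  <p>
    <!-- <sub>: speak the alias instead of the written text -->
    <s>Welcome to <sub alias="Acme Products Incorporated">Acme</sub>.</s>
    <!-- <say-as>: pronunciation control (value is platform dependent) -->
    <s>Your order code is <say-as interpret-as="characters">A73Q</say-as>.</s>
  </p>

  <!-- Timing: a half-second pause -->
  <break time="500ms"/>

  <!-- Voice selection and prosody control -->
  <voice gender="female">
    <prosody rate="slow" volume="loud">
      Please <emphasis level="strong">do not</emphasis> hang up.
    </prosody>
  </voice>

  <!-- Location marker the platform can report back -->
  <mark name="after_warning"/>

  <!-- Prerecorded audio with fallback text -->
  <audio src="hold_music.wav">Please hold.</audio>
</speak>
```

As the earlier slide notes, many TTS engines honor only part of this markup, so it is worth testing each element on the target platform.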

Page 30: Introduction to VoiceXml and Voice Web Architecture

VoiceXML 2.x

Page 31: Introduction to VoiceXml and Voice Web Architecture

VoiceXML Scope

• Human-machine interaction provided by voice response systems:
  – Output
    • play audio files
    • produce synthesized speech
  – Input
    • record spoken input
    • recognize spoken input
    • collect character input
  – Control flow
  – Telephony
    • transfer a user to another destination, such as a live agent
    • disconnect a user

Page 32: Introduction to VoiceXml and Voice Web Architecture

VoiceXML Goals

• Separate user interaction from service logic
  – Creates new possible business models
    • Service developer can be separate from telephony platform provider
• Enable service portability across implementation platforms
  – Assume common set of platform capabilities
  – Provide common language for:
    • Content providers, Tool providers, Platform providers
• Safely handle shared network-based applications
  – deterministic behavior
• Easy to build common types of applications
• Features to build complex types of applications
• Shield application authors from low-level platform-specific details
  – Promotes portability, ease of service creation

Page 33: Introduction to VoiceXml and Voice Web Architecture

VoiceXML 2.0 Basic Functions

• Input
  – <field>, <menu>: recognition
  – <record>: audio recording
• Output
  – <prompt>: container for TTS or prerecorded audio
  – <audio>: prerecorded audio
• Control Flow
  – <if>, <else>, <elseif>: basic conditional logic
  – <script>: complex scripts using ECMAScript
  – <goto>: transition to a new document
  – <submit>: submit data to a web application
• Telephony
  – <disconnect>
  – <transfer>
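
A short sketch of how several of these basic elements fit together, assuming a VoiceXML 2.0 platform. This example is not from the original slides: the form ids and the save_message.jsp target are hypothetical names chosen for illustration.

```xml
<?xml version="1.0"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">

  <!-- Input: <menu> recognizes one of its choices -->
  <menu id="main">
    <prompt>Say message to leave a message, or goodbye to hang up.</prompt>
    <choice next="#leave_message">message</choice>
    <choice next="#bye">goodbye</choice>
  </menu>

  <form id="leave_message">
    <!-- Input: <record> captures caller audio -->
    <record name="msg" beep="true" maxtime="30s" finalsilence="3s">
      <prompt>Please record your message after the beep.</prompt>
    </record>

    <!-- Control flow: msg$.duration is the recording length in milliseconds -->
    <block>
      <if cond="msg$.duration &lt; 1000">
        <prompt>That message was too short to save.</prompt>
        <goto next="#bye"/>
      <else/>
        <!-- Hypothetical server-side script that stores the recording -->
        <submit next="save_message.jsp" method="post"
                enctype="multipart/form-data" namelist="msg"/>
      </if>
    </block>
  </form>

  <!-- Telephony: end the call -->
  <form id="bye">
    <block>
      <prompt>Goodbye.</prompt>
      <disconnect/>
    </block>
  </form>
</vxml>
```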

Page 34: Introduction to VoiceXml and Voice Web Architecture

VoiceXML Execution Model

• Form Interpretation Algorithm <form>
• Execution is synchronous (mostly)
  – Disconnect events are handled (somewhat) asynchronously
• Audio is queued
  – Played only when encountering a waiting state
• Processing is always in one of two states:
  – Waiting for input in an input item
    • such as <field>, <record>, or <transfer>
  – Transitioning between input items in response to an input
• Event-driven
  – <catch>, <throw>: generalized event mechanism
  – <nomatch>, <noinput>: short-hand user-input event handling
  – <error>: short-hand error event handling
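
The event mechanism above can be sketched as follows. This is an illustrative fragment rather than code from the slides: the event name com.example.too_many_retries is a hypothetical application-defined event, and main_menu.grxml is assumed to exist.

```xml
<?xml version="1.0"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">

  <!-- Generalized handler: runs when the caller hangs up -->
  <catch event="connection.disconnect.hangup">
    <log>Caller hung up.</log>
    <exit/>
  </catch>

  <!-- Handler for an application-defined event (hypothetical name) -->
  <catch event="com.example.too_many_retries">
    <prompt>Please call again later. Goodbye.</prompt>
    <exit/>
  </catch>

  <form id="ask_dept">
    <field name="dept">
      <prompt>Sales, repair, or order status?</prompt>
      <grammar src="main_menu.grxml" type="application/srgs+xml"/>

      <!-- Short-hand handler, selected on the third noinput event -->
      <noinput count="3">
        <throw event="com.example.too_many_retries"/>
      </noinput>

      <!-- Short-hand handler for the first and second noinput events -->
      <noinput>
        <prompt>Sorry, I didn't hear you.</prompt>
        <reprompt/>
      </noinput>
    </field>
  </form>
</vxml>
```

The count attribute lets the dialog escalate after repeated failures instead of repeating the same message indefinitely.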

Page 35: Introduction to VoiceXml and Voice Web Architecture

Key Points

• Architecture leverages all things "internet"
  – Languages, protocols, servers, developers, etc.
• Separation of concerns
  – Application logic / database vs. telephony / speech resources
  – Enables new business models
    • Voice ASP
    • Prepackaged applications
• URL (application) associated with phone number
  – Calling party or Called party
  – Share resources among many applications (Voice ASP)
• High-level languages, specific to domain / task
  – Simplify development and maintenance

Page 36: Introduction to VoiceXml and Voice Web Architecture

VoiceXML <form> and <field>

• <form>
  – Dialog container
  – "Form Interpretation Algorithm" (FIA) specifies default behavior
• <field>
  – Collect input from caller
  – <grammar> specifies input 'constraints'
• <prompt>
  – Container for <audio> and text

Page 37: Introduction to VoiceXml and Voice Web Architecture

Example: main.vxml
Note: Code simplified for demonstration purposes…

<?xml version="1.0"?>
<vxml version="2.0">

  <form>

    <field name="main_menu">
      <prompt>
        <audio src="welcome.wav">
          Welcome to Acme. You can choose sales, repair, or order status.
        </audio>
      </prompt>
      <grammar src="main_menu.grxml"/>
    </field>

    <block>
      <submit next="http://acme.com/route... " method="get"/>
    </block>

  </form>
</vxml>

Page 38: Introduction to VoiceXml and Voice Web Architecture

User Input - Grammars

• Grammars can be speech or DTMF (touchtone)
  – Both types can be active simultaneously
• Specified by SRGS
  – XML grammars are normative (aka GRXML)
  – ABNF grammars are more concise but more complex to author
• Grammars may be specified inline or sourced externally
• External grammars are referenced by URI
• Multiple grammars may be active simultaneously
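
A minimal sketch of the points above, assuming a VoiceXML 2.0 platform: one field with an external speech grammar and an inline DTMF grammar active at the same time. The prompt wording and key assignments are hypothetical.

```xml
<field name="dept">
  <prompt>
    For sales, say sales or press 1. For repair, say repair or press 2.
  </prompt>

  <!-- External speech grammar, referenced by URI -->
  <grammar src="main_menu.grxml" type="application/srgs+xml"/>

  <!-- Inline DTMF grammar, active alongside the speech grammar -->
  <grammar mode="dtmf" version="1.0" root="key">
    <rule id="key" scope="public">
      <one-of>
        <item>1</item>
        <item>2</item>
      </one-of>
    </rule>
  </grammar>
</field>
```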

Page 39: Introduction to VoiceXml and Voice Web Architecture

Grammars can get very complicated: there are many ways to say the same thing…

• Sales
  – I'd like to place an order
  – I need to talk to a salesman
• Repair
  – repair department
  – service
  – service department
  – customer service
• Order status
  – where's my order?
  – track my order
  – track my shipment
  – where the hell is my stuff?
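
One way to keep such variation manageable is to map every phrasing to a single value with semantic interpretation tags. The grammar below is a sketch under assumptions: it uses the tag syntax of SISR 1.0 (tag-format "semantics/1.0"), which was still being finalized when these slides were written, so older platforms may expect a different tag format. Only a few of the phrases are included.

```xml
<?xml version="1.0"?>
<grammar xmlns="http://www.w3.org/2001/06/grammar" version="1.0"
         xml:lang="en-US" mode="voice" root="dept"
         tag-format="semantics/1.0">

  <rule id="dept" scope="public">
    <one-of>
      <!-- Every sales phrasing returns the value "sales" -->
      <item>
        <one-of>
          <item>sales</item>
          <item>I'd like to place an order</item>
          <item>I need to talk to a salesman</item>
        </one-of>
        <tag>out = "sales";</tag>
      </item>

      <!-- Every repair phrasing returns the value "repair" -->
      <item>
        <one-of>
          <item>repair</item>
          <item>repair department</item>
          <item>service department</item>
        </one-of>
        <tag>out = "repair";</tag>
      </item>

      <!-- Every order-status phrasing returns "order_status" -->
      <item>
        <one-of>
          <item>order status</item>
          <item>where's my order</item>
          <item>track my order</item>
        </one-of>
        <tag>out = "order_status";</tag>
      </item>
    </one-of>
  </rule>
</grammar>
```

The application then branches on three values regardless of how many phrasings the grammar accepts.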

Page 40: Introduction to VoiceXml and Voice Web Architecture

Basic GRXML grammar example: main_menu.grxml

<grammar …xml:lang="en-US" version="1.0">

  <rule id="dept" scope="public">
    <one-of>
      <item>sales</item>
      <item>repair</item>
      <item>order status</item>
    </one-of>
  </rule>

</grammar>

Page 41: Introduction to VoiceXml and Voice Web Architecture

VoiceXML example – next step: sales.vxml

<form>

  <field name="sales_menu">
    <prompt>
      <audio src="sales_menu.wav">
        You've reached Acme's sales department.
        To place an order, say sales.
        To speak to an associate, say I'd like to speak to someone.
      </audio>
    </prompt>
    <grammar src="sales_menu.grxml"/>
  </field>

  <block>
    <submit next="http://acme.com/... " method="get"/>
  </block>

</form>

Page 42: Introduction to VoiceXml and Voice Web Architecture

VoiceXML example with error handling: newmain.vxml

<form>

  <field name="main_menu">
    <prompt>
      <audio src="welcome.wav">
        Welcome to Acme. You can choose sales, repair, or order status.
      </audio>
    </prompt>
    <grammar src="main_menu.grxml"/>
  </field>

  <noinput> You must say something. </noinput>

  <block>
    <submit next="http://acme.com/route... " method="get"/>
  </block>

</form>

Page 43: Introduction to VoiceXml and Voice Web Architecture

VoiceXML example with error handling: newmain.vxml

<form>

  <field name="main_menu">
    <prompt>
      <audio src="welcome.wav">
        Welcome to Acme. You can choose sales, repair, or order status.
      </audio>
    </prompt>
    <grammar src="main_menu.grxml"/>
  </field>

  <noinput> You must say something. </noinput>
  <nomatch> I didn't understand you. Please try again. </nomatch>

  <block>
    <submit next="http://acme.com/route... " method="get"/>
  </block>

</form>

Page 44: Introduction to VoiceXml and Voice Web Architecture

VoiceXML example with error handling: newmain.vxml

<form>

  <field name="main_menu">
    <prompt>
      <audio src="welcome.wav">
        Welcome to Acme. You can choose sales, repair, or order status.
      </audio>
    </prompt>
    <grammar src="main_menu.grxml"/>
  </field>

  <help> You can say sales, repair, or order status. </help>
  <noinput> You must say something. </noinput>
  <nomatch> I didn't understand you. Please try again. </nomatch>

  <block>
    <submit next="http://acme.com/route... " method="get"/>
  </block>

</form>

Page 45: Introduction to VoiceXml and Voice Web Architecture

Set platform features via <property>

• Input modes: type of input from a caller
  – DTMF-only: <property name="inputmodes" value="dtmf"/>
  – Voice-only: <property name="inputmodes" value="voice"/>
  – Both: <property name="inputmodes" value="dtmf voice"/>
• Timeouts
  – <property name="timeout" value="1450ms"/>
  – <property name="termtimeout" value="2500ms"/>
  – ...
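
Properties can be set at several scopes, with inner settings overriding outer ones. The sketch below is an illustration, not code from the slides: it sets document-wide defaults and overrides the input mode for one field. The built-in grammar type digits?length=4 is an assumption about platform support, since built-in types are optional in VoiceXML 2.0.

```xml
<?xml version="1.0"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">

  <!-- Document-level defaults -->
  <property name="inputmodes" value="dtmf voice"/>
  <property name="timeout" value="5s"/>

  <form id="get_pin">
    <field name="pin" type="digits?length=4">
      <!-- Field-level override: keypad entry only for the PIN -->
      <property name="inputmodes" value="dtmf"/>
      <property name="interdigittimeout" value="3s"/>
      <prompt>Please enter your four digit PIN.</prompt>
    </field>
  </form>
</vxml>
```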

Page 46: Introduction to VoiceXml and Voice Web Architecture

Call processing: <transfer>

• Blind
  – Go somewhere but don't return
• Bridge
  – Add on another party, resume execution when done talking

Page 47: Introduction to VoiceXml and Voice Web Architecture

Call processing: <transfer>

• Blind transfer

<form id="xfer">

  <block>
    <prompt> Calling Riley. Please wait. </prompt>
  </block>

  <transfer name="mycall" dest="tel:+1-555-123-4567">
  </transfer>

</form>

Page 48: Introduction to VoiceXml and Voice Web Architecture

Call processing: <transfer>

• Bridge transfer

<form id="xfer">
  <block>
    <prompt> Calling Riley. Please wait. </prompt>
  </block>

  <transfer name="mycall" dest="tel:+1-555-123-4567" bridge="true">
  </transfer>
</form>

Page 49: Introduction to VoiceXml and Voice Web Architecture

Call processing: <transfer>

• Bridge transfer with cancel feature

<form id="xfer">
  <block>
    <prompt> Calling Riley. Please wait. </prompt>
  </block>

  <transfer name="mycall" dest="tel:+1-555-123-4567" bridge="true">
    <prompt> Say cancel at any time to disconnect this call. </prompt>
    <grammar src="cancel.grxml" type="application/srgs+xml"/>
  </transfer>
</form>

Page 50: Introduction to VoiceXml and Voice Web Architecture

Call processing: <transfer>

<form id="xfer">
  <block>
    <prompt> Calling Riley. Please wait. </prompt>
  </block>

  <transfer name="mycall" dest="tel:+1-555-123-4567" bridge="true">
    <prompt> Say cancel at any time to disconnect this call. </prompt>
    <grammar src="cancel.grxml" type="application/srgs+xml"/>

    <filled>
      <assign name="mydur" expr="mycall$.duration"/>
      <if cond="mycall == 'busy'">
        <prompt> Riley's line is busy. Try again later. </prompt>
      <elseif cond="mycall == 'noanswer'"/>
        <prompt> Riley didn't answer the phone. Please call back another time. </prompt>
      </if>
    </filled>

  </transfer>
</form>

Page 51: Introduction to VoiceXml and Voice Web Architecture

Call processing: <transfer>

<form id="xfer">
  <block>
    <prompt> Calling Riley. Please wait. </prompt>
  </block>

  <transfer name="mycall" dest="tel:+1-555-123-4567" bridge="true"
            transferaudio="music.wav" connecttimeout="60s">
    <prompt> Say cancel at any time to disconnect this call. </prompt>
    <grammar src="cancel.grxml" type="application/srgs+xml"/>

    <filled>
      <assign name="mydur" expr="mycall$.duration"/>
      <if cond="mycall == 'busy'">
        <prompt> Riley's line is busy. Try back later. </prompt>
      <elseif cond="mycall == 'noanswer'"/>
        <prompt> Riley didn't answer the phone. Please call back another time. </prompt>
      </if>
    </filled>

  </transfer>
</form>

Page 52: Introduction to VoiceXml and Voice Web Architecture

Call processing: <transfer>

Page 53: Introduction to VoiceXml and Voice Web Architecture

New Features in VoiceXML 2.1

• Dynamically referencing grammars and scripts
  – <grammar srcexpr="…"> <script srcexpr="…">
• Detect barge-in during prompt playback: enhance SSML 1.0 <mark>
  – Add nameexpr attribute
  – Add markname and marktime to application.lastresult$ object
• Fetch (XML) data without transition: <data>
  – Uses read-only subset of DOM
• Dynamically concatenate prompts: <foreach>
  – Iterate through ECMAScript array and execute content
• Record user's utterance while attempting ASR
  – recordutterance property
  – Add shadow variables: recording, recordingsize, recordingduration
• Send data upon disconnect
  – <disconnect namelist="…">
• Additional <transfer> types
  – <transfer type="…" …/>
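
Several of these VoiceXML 2.1 features can be seen in one small document. This is a sketch under assumptions, not code from the slides: the variable names, the drinks_en-US.grxml naming scheme, and the prompt text are hypothetical.

```xml
<?xml version="1.0"?>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
  <var name="lang" expr="'en-US'"/>
  <var name="specials" expr="['coffee', 'tea', 'milk']"/>

  <form id="order">
    <!-- Record the caller's utterance while recognition is attempted -->
    <property name="recordutterance" value="true"/>

    <field name="drink">
      <!-- srcexpr: the grammar URI is computed at runtime -->
      <grammar srcexpr="'drinks_' + lang + '.grxml'"
               type="application/srgs+xml"/>
      <prompt>
        Today we have
        <!-- foreach: build the prompt from an ECMAScript array -->
        <foreach item="name" array="specials">
          <value expr="name"/> <break time="300ms"/>
        </foreach>
        What would you like?
      </prompt>
      <filled>
        <!-- Shadow variable added by recordutterance -->
        <log>Utterance length in ms:
          <value expr="application.lastresult$.recordingduration"/>
        </log>
      </filled>
    </field>

    <block>
      <!-- namelist: hand data back to the platform on disconnect -->
      <disconnect namelist="drink"/>
    </block>
  </form>
</vxml>
```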

Page 54: Introduction to VoiceXml and Voice Web Architecture

Dynamic Applications

Page 55: Introduction to VoiceXml and Voice Web Architecture

VoiceXML Application Structure

• Static
  – User experience is the same for everyone
    • Information doesn't change frequently
    • No customization per user, time of day, etc.
    • Pages are created once and used many times
• Dynamic
  – User experience is customized by:
    • User: e.g. my.yahoo.com, amazon.com (especially once you log in)
    • Situation: e.g. travel specials on expedia.com
  – Data driven, e.g. inventory system, airline reservations
  – Generated by a program at runtime
    • JSP, ASP
    • App servers such as BEA, IBM Websphere, Oracle 9iAS

Page 56: Introduction to VoiceXml and Voice Web Architecture

VoiceXML 2.1 and AJAX

• VoiceXML + ECMAScript + <data> + XML
• <data> element allows retrieval of arbitrary XML data without document transition
• Static VoiceXML document can fetch user-specific data at runtime
• Decouple presentation layer from business logic
• Performance improvements due to:
  – Cache-able VoiceXML
  – No need to generate entirely new pages for each dialog when only the content is new
  – Less network traffic
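
The pattern above can be sketched as a static document that pulls caller-specific XML at runtime. This example is illustrative only: order_status.xml stands in for a hypothetical server resource that returns a document containing a <status> element, and the order id is hard-coded for brevity.

```xml
<?xml version="1.0"?>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
  <form id="order_status">
    <var name="order_id" expr="'12345'"/>

    <block>
      <!-- Fetch XML without leaving this (cacheable) document -->
      <data name="status_doc" src="order_status.xml" namelist="order_id"/>

      <!-- Read the first <status> element through the read-only DOM -->
      <var name="status"
           expr="status_doc.documentElement.getElementsByTagName('status').item(0).firstChild.data"/>

      <prompt>Your order status is <value expr="status"/>.</prompt>
    </block>
  </form>
</vxml>
```

Only the small XML payload changes per caller, so the VoiceXML page itself can be served statically and cached by the platform.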

Page 57: Introduction to VoiceXml and Voice Web Architecture

Dynamic Application Considerations

Execution of VoiceXML is running a program on your server…

• Must guarantee quality of dynamically-generated VoiceXML documents and ASR grammars
  – Catch parse errors, execution errors
  – What does the caller hear if there is an error?
    • not "Could not parse VoiceXML document"
• Runtime performance
  – Parse and interpretation time of large documents
  – Inefficient scripts and speech grammars
• Security implications
  – Exploit a bug in a particular implementation? Make free phone calls?
  – Could there be a VoiceXML virus? Will all platforms protect against them?

Careful application design, testing, and monitoring are essential
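
One hedged way to control what the caller hears on an error is a catch-all handler, typically placed in the application root document so that it applies to every dialog. The sketch below assumes a VoiceXML 2.0 platform; the file name sorry.wav and the wording are hypothetical.

```xml
<?xml version="1.0"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">

  <!-- Catches fetch or parse failures and script errors.
       Event names match by prefix, so error.badfetch also
       covers events such as error.badfetch.http.404. -->
  <catch event="error.badfetch error.semantic">
    <!-- Record details for the operations team, not the caller -->
    <log>Application error: <value expr="_event"/>
         <value expr="_message"/></log>
    <prompt>
      <audio src="sorry.wav">
        We're sorry, we are having technical difficulties.
        Please call back later.
      </audio>
    </prompt>
    <exit/>
  </catch>

</vxml>
```

If the root document itself cannot be fetched or parsed, the platform's default handling still applies, so monitoring remains necessary.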

Page 58: Introduction to VoiceXml and Voice Web Architecture

Dynamic Application Considerations

• A mix of different simultaneous applications means variable platform load and execution profile
  – Parse time of VoiceXML document
  – Fetching VoiceXML documents, grammars, audio from remote web servers
  – Load Balancing
  – How to protect platform from harmful application? (intentional or otherwise?)
    • Max size of document
    • Max size of grammar
    • Complexity measurement of document or grammar (statically checked before execution?)

Platforms, networks, and applications must be carefully engineered

Page 59: Introduction to VoiceXml and Voice Web Architecture

Performance Considerations

Page 60: Introduction to VoiceXml and Voice Web Architecture

Load Balancing for Performance and Reliability

• CPU/memory utilization
  – Grammar compilation
  – ASR load
  – TTS load
• Telephony Network
  – Channel balancing
  – Dead channel
• Incoming/Outgoing channel assignment / mix

Page 61: Introduction to VoiceXml and Voice Web Architecture

Performance: Caching

• Fetched documents, grammars, audio files, streams
• Local or distributed cache?
• Effects of prefetching
• Where to cache generated grammars?
  – Per system
  – In-network
• Use external grammar compilation server?
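
VoiceXML 2.0 exposes some of these caching and prefetching choices to the application through properties and per-element attributes. The values below are illustrative assumptions, not recommendations; actual cache behavior also depends on the HTTP headers sent by the web server and on the platform.

```xml
<?xml version="1.0"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">

  <!-- Maximum acceptable age of cached copies, in seconds -->
  <property name="audiomaxage" value="86400"/>
  <property name="grammarmaxage" value="3600"/>
  <property name="documentmaxage" value="0"/>

  <!-- prefetch: may fetch when the page loads; safe: fetch only when needed -->
  <property name="audiofetchhint" value="prefetch"/>
  <property name="grammarfetchhint" value="safe"/>

  <form>
    <field name="dept">
      <prompt>
        <!-- Per-element override of the document-wide audio setting -->
        <audio src="welcome.wav" maxage="600">Welcome to Acme.</audio>
      </prompt>
      <grammar src="main_menu.grxml" type="application/srgs+xml"/>
    </field>
  </form>
</vxml>
```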

Page 62: Introduction to VoiceXml and Voice Web Architecture

Application Management

Page 63: Introduction to VoiceXml and Voice Web Architecture

Application Monitoring and Maintenance

• Runtime logs
  – Web / application server
  – Voice server
  – Call Detail Reporting
• Utterance recordings and logs
  – Useful for grammar and dialog tuning
  – Security of recordings may be an issue
  – Disk space: full-call recordings may be prohibitively large

Usage data must be continually monitored to improve user experience

Page 64: Introduction to VoiceXml and Voice Web Architecture

Operations, Administration, Maintenance, Provisioning

• System Monitoring
  – Interfacing to existing Telco OSSs
  – Web-based for ISP environment
• Provisioning
  – Application, Customer
    • DN-URI mapping
  – Telephony
    • Call origination/transfer
    • Max call timeout
    • Max number of concurrent calls
  – Platform-specific VoiceXML features
    • ECMAScript allowed?
    • Telephony control allowed?
    • Max grammar size

Page 65: Introduction to VoiceXml and Voice Web Architecture

Billing

Logging and Charging for usage of resources

• "platform time"
  – Usage of server resources
• Toll Free usage
  – It's toll free, not free
• Transferred calls
  – Inbound minutes
  – Outbound minutes
  – Network features, e.g. Network Redirect
• Outbound calls

Accurate billing information is a critical factor in application cost or profitability

Page 66: Introduction to VoiceXml and Voice Web Architecture

Application Deployment Models

Build-your-own network vs. Outsourcing

Page 67: Introduction to VoiceXml and Voice Web Architecture

Build vs. Outsource? Deployment Options Enable a Variety of Business Models

• Completely in-house
  – Maintain complete control for security
  – Development and deployment systems can be identical
• Outsourced VoiceXML/Telephony
  – Large-scale distributed networks without major capital investment
  – Grow quickly and incrementally
• Completely outsourced hosting
  – All components and systems managed by 3rd party
• Packaged software
  – VoiceXML application integrated with existing apps

Page 68: Introduction to VoiceXml and Voice Web Architecture

Completely In-House

• Local control of all systems

• Voice server, app server, database can be on local network

• Development and deployment systems can be identical

• Physical security: in-house team “owns” it

• Failover, reliability, scalability must be locally managed

• Redundant power, networks, etc. are required

Page 69: Introduction to VoiceXml and Voice Web Architecture

VoiceXML On-premises Deployment using TDM or VoIP carrier connection

[Diagram: The PSTN connects to a co-location facility over TDM (DS3, multiple PRI, etc.) or a VoIP "pipe". Components shown: VoIP gateway, PBX, etc.; Cisco IPCC; VoiceXML browsers; ASR servers; web applications; database.]

Page 70: Introduction to VoiceXml and Voice Web Architecture

Outsourced VoiceXML / Telephony

• Telephony and VoiceXML servers outsourced to "Voice Service Provider" (VSP)
• Application remains in your data center(s)
  – Geographically distributed
  – May be dedicated to specific customers
• Many carrier-grade vendors to choose from

Page 71: Introduction to VoiceXml and Voice Web Architecture

Outsourced VoiceXML / Telephony

[Diagram: The PSTN connects to the Voice Service Provider, a carrier-grade outsourcing facility, which connects over the Internet to the customer's co-location facility. Components shown: VoIP gateway, VoiceXML browsers, ASR servers, Cisco IPCC, web applications, database.]

• Architecture is identical to in-house deployment
• Secure IP connection used between facilities

Page 72: Introduction to VoiceXml and Voice Web Architecture

Advantages of Outsourcing to a VSP

• Choice of many vendors: one for all customers, or choose the best one for each customer

• Add capacity by adding multiple vendors

• No capital investment

• Pay-as-you-go pricing models

• Failover, reliability, scalability simplified

• Physical security of equipment and networks managed by VSP

• VPN or dedicated data connection to your backend systems

Page 73: Introduction to VoiceXml and Voice Web Architecture

Distribute Load to Multiple VSPs

[Diagram: The PSTN feeds several VSP sites, each shown with VoiceXML browsers, ASR servers, and a Cisco IPCC. The sites connect over the Internet to the customer co-location facility, which holds the web applications and database.]

Multiple co-lo facilities can be deployed for geographic redundancy and enhanced capacity.


Completely Outsourced

• Deploy hardware & software systems at customer-managed co-location facilities

• Deploy complete systems at co-location facilities managed by 3rd party

• Deploy pre-packaged VoiceXML application integrated with customer's call center (managed by customer)


Combination of In-house and Outsourced

Several ways to balance resources

• Primary in-house, with overflow or failover to a VSP

– Local control of resources

– Overflow to VSP during peak usage

– Backup for failover / disaster recovery

• In-house development, with primary deployment via VSP

– In-house development and trials

– “Push to the network” when ready to deploy


CCXML, VoiceXML, and VoIP

3rd-Party Call Control


Inbound call using TDM connections

[Diagram labels: caller; PSTN; VoiceXML server]

• 1st-party call control: VoiceXML server handles call routing/setup/answer


Inbound call using VoIP (SIP and RTP)

[Diagram labels: customer; PSTN; VoIP gateway; VoiceXML server. Message sequence: 1. INVITE, 2. RTP]

• 1st-party call control: VoIP gateway routes call to VoiceXML server, which handles call routing/setup/answer


Why VoIP?

• Flexible network topology

• Simplified integration of voice dialog resources

• Vendor independence for network elements

• Separation of concerns: voice dialog resources vs. call control


Inbound Call using 3rd Party Call Control

[Diagram labels: caller; PSTN; VoIP gateway; call routing application; VoiceXML server. Message sequence: 1. INVITE, 2. INVITE, 3. RTP]

• 3rd party application handles call routing/setup/answer


Outbound call using 3rd Party Call Control

[Diagram labels: caller; PSTN; VoIP gateway; outbound calling application; VoiceXML server. Message sequence: 1. INVITE, 2. INVITE, 3. RTP]

• 3rd party application handles outbound call initiation/setup/routing

• “Attaches” VoiceXML dialog to connection


What is CCXML?

• XML-based language that manages the connections and resources used in phone calls

• Designed for 3rd-party call control applications

• Allows easy integration with back-end web applications, much like VoiceXML’s model

• Uses the finite state machine model

– Event handlers move from one state to the next using markup tags

• CCXML provides commands to run a “dialog” on a call leg
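
As a rough illustration of the event-driven state machine model described above, a minimal CCXML document for an inbound call might look like the sketch below. Element and event names follow the W3C CCXML 1.0 specification as later published; the state names and the dialog URL hello.vxml are hypothetical and chosen only for this example.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<ccxml version="1.0" xmlns="http://www.w3.org/2002/09/ccxml">
  <!-- Holds the current state of the state machine; state names are illustrative -->
  <var name="currentstate" expr="'init'"/>

  <eventprocessor statevariable="currentstate">
    <!-- Incoming call is ringing: answer it -->
    <transition state="init" event="connection.alerting">
      <accept/>
    </transition>

    <!-- Call answered: move to the next state and run a dialog on the call leg -->
    <transition state="init" event="connection.connected">
      <assign name="currentstate" expr="'talking'"/>
      <dialogstart src="'hello.vxml'"/>
    </transition>

    <!-- Dialog finished: end the session -->
    <transition state="talking" event="dialog.exit">
      <exit/>
    </transition>

    <!-- Caller hung up in any state: end the session -->
    <transition event="connection.disconnected">
      <exit/>
    </transition>
  </eventprocessor>
</ccxml>
```

Each transition is an event handler; the one without a state attribute applies in every state, which is how asynchronous events such as a hang-up can be caught at any time.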


Why is CCXML Needed?

• VoiceXML was designed primarily for voice dialogs

– 1st-party call control: <disconnect> and several predefined common <transfer> types

• Connection management requires full asynchronous event handling

– Connection/telephony events can occur any time during a call and must be handled

– VoiceXML specifically limits asynchronous events to simplify the execution and programming model

• 1st-party call control can be useful but has limited flexibility

– VoiceXML 2.1 <transfer> adds "consultation" feature for network redirect
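
To make the 1st-party model concrete, here is a minimal sketch of a VoiceXML 2.1 consultation transfer. The destination number, variable name, and prompts are placeholders, not taken from the slides.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
  <form id="agent_transfer">
    <!-- type="consultation" is the VoiceXML 2.1 addition; the tel: number is a placeholder -->
    <transfer name="result" dest="tel:+15555550100" type="consultation" connecttimeout="20s">
      <prompt>Please hold while we connect you to an agent.</prompt>
      <filled>
        <!-- Reached only if the transfer did not complete; result holds the outcome -->
        <if cond="result == 'busy'">
          <prompt>The line is busy. Please try again later.</prompt>
        <elseif cond="result == 'noanswer'"/>
          <prompt>No one answered. Please try again later.</prompt>
        </if>
      </filled>
    </transfer>
  </form>
</vxml>
```

The dialog can only request the transfer and react to its outcome afterwards; it has no way to handle telephony events that arrive while the attempt is in progress, which is the flexibility limit noted above.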


CCXML System Architecture

[Diagram labels: caller; PSTN; telephony interface; media; CCXML server; dialog server; conference server; telephony control interface; dialog control interface; telephony web application (CCXML over HTTP); voice web application (VXML over HTTP)]


CCXML features

• Telephony channel control: voice paths and signaling

– <createcall>, <accept>, <disconnect>, <reject>, <redirect>

• Media control: Conference Bridges and Mixers

– <join>, <unjoin>, <createconference>, <destroyconference>

• Dialog control: Add a VoiceXML (or other dialog) resource to a connection

– <dialogstart>, <dialogprepare>, <dialogterminate>
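
The sketch below combines telephony channel control and dialog control for the outbound, 3rd-party call control case: the CCXML session places a call and then attaches a VoiceXML dialog to the connection. It assumes the element and event names of the W3C CCXML 1.0 specification as later published; the destination number and reminder.vxml are hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<ccxml version="1.0" xmlns="http://www.w3.org/2002/09/ccxml">
  <!-- Receives the identifier of the outbound connection -->
  <var name="out_conn"/>

  <eventprocessor>
    <!-- Session started: place the outbound call (destination is a placeholder) -->
    <transition event="ccxml.loaded">
      <createcall dest="'tel:+15555550100'" connectionid="out_conn"/>
    </transition>

    <!-- Callee answered: attach a VoiceXML dialog to the connection -->
    <transition event="connection.connected">
      <dialogstart src="'reminder.vxml'" connectionid="out_conn"/>
    </transition>

    <!-- Dialog finished: hang up -->
    <transition event="dialog.exit">
      <disconnect connectionid="out_conn"/>
    </transition>

    <!-- Call could not be set up, or has ended: end the session -->
    <transition event="connection.failed">
      <exit/>
    </transition>
    <transition event="connection.disconnected">
      <exit/>
    </transition>
  </eventprocessor>
</ccxml>
```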


Integration of CCXML and VoiceXML

• Dialogs are created using <dialogstart>

– You pass the URL of the document that you want to run

• Dialogs can be ended using <dialogterminate>

– This allows CCXML to end a dialog based on an external event, such as someone calling you on a second line

• Dialogs can return data back to the CCXML platform

– In VoiceXML use <exit namelist="a b c"/>

– This is exposed in the CCXML dialog.exit event
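
The data-return path can be sketched with two small documents. The VoiceXML side declares three variables and returns them with <exit>; the values and the file name collect.vxml are hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
  <!-- Variables named in the namelist must be declared before <exit> runs -->
  <var name="a" expr="'1'"/>
  <var name="b" expr="'2'"/>
  <var name="c" expr="'3'"/>
  <form>
    <block>
      <exit namelist="a b c"/>
    </block>
  </form>
</vxml>
```

On the CCXML side, assuming the event model of the W3C CCXML 1.0 specification as later published, the returned variables appear as properties of the values object on the dialog.exit event:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<ccxml version="1.0" xmlns="http://www.w3.org/2002/09/ccxml">
  <eventprocessor>
    <transition event="connection.alerting">
      <accept/>
    </transition>

    <!-- Start the VoiceXML document shown above on the answered call -->
    <transition event="connection.connected">
      <dialogstart src="'collect.vxml'"/>
    </transition>

    <!-- event$ is the event being processed; values carries the <exit> namelist -->
    <transition event="dialog.exit">
      <log expr="'Dialog returned a=' + event$.values.a"/>
      <exit/>
    </transition>
  </eventprocessor>
</ccxml>
```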


W3C CCXML 1.0 status

• Nearing "Candidate Recommendation" status

– Language complete

– Test suite under development

– Certification Program under consideration

• Growing support throughout the world

• Several open source projects underway

– See http://www.sourceforge.net



Next-Generation Technologies

• Speaker Biometrics-based authentication

– Speaker Identification

– Speaker Verification

• Video IVR: VoiceXML augmented with video

– Early stages of commercial deployment now

– Simple extension to standard platforms

– Straightforward step towards full multimodal

• Multimodal

– Multiple input modalities: speech recognition, keypad, handwriting, biometrics (voice, fingerprint, iris, etc.), geolocation, motion

– Multiple output modalities: graphics, audio (speech, TTS, music, polyphonic tones)


Speaker Biometrics


Why Speaker Biometrics?

• Identify an individual for remote transactions

• Text / DTMF PINs are inadequate

– Easily compromised

– Easily forgotten

– Do not identify an individual

• US Federal Regulations

– FFIEC guidelines for financial services


Speaker Identification and Verification (SIV)

• Authentication

– The process of confirming one or more identities.

• Speaker Identification (one-to-many)

– Authentication with multiple identity claims.

• Speaker Verification (one-to-one)

– Authentication with a single identity claim.


Types of SIV

• Text independent

– SIV technology that can operate on any freeform or structured spoken input.

• Text dependent

– SIV technology (usually verification technology) that requires the voice input of one or more specific, previously enrolled passwords or pass phrases.

• Text prompted

– SIV technology (usually verification) that randomly selects words and/or phrases and prompts the speaker to repeat them. Also called challenge-response.


Fundamental Phases of SIV

• Enrollment

– Capture one or more user utterances to ‘train’ the system

• Verification

– Capture one or more user utterances to make an identity claim

• Adaptation & Scoring

– Judge how close the user’s verification utterance is to the enrolled utterance

– Refine the existing enrolled utterance with information from the verification utterance


Video and Multimodal


“Video” VoiceXML

• Video extensions to VoiceXML

– 3G Wireless

– VoIP phones

• VoiceXML is just a dialog language

– Initially only for voice input/output

• Example

– Videomail is a dialog application very similar to voicemail

• Video and audio are somewhat analogous

– VoiceXML can be ‘hacked’ to handle video now:

• <audio src="foo.au"/> could “play” a video file via <audio src="foo.mpeg4"/>

– VoiceXML 3.0 might add a new language feature

• e.g. <video src="foo.avi"> or <media src="foo.mpeg4">


“Video” VoiceXML Deployment and Standardization

• Simple extension to standard platforms

– Easy integration with current platforms

– Doesn’t “break” existing functionality

– Well aligned with “VoiceXML model”

• Early stages of commercial deployment

– Several vendors have deployed large-scale commercial systems

• Step towards full multimodal


Multimodal Applications

• W3C Multimodal Interaction Working Group

– Defining new standards based on extensive industry experience

• IBM / Motorola / Opera X+V 1.2

– Early stages of commercial deployment

– Freely available from Opera http://dev.opera.com/articles/voice/

For more information, see:

W3C Multimodal Interaction Working Group http://www.w3.org/2002/mmi



VoiceXML 3.0

• Modularization

– Cleanly separate functions to enable integration with other modalities

– Enables code reuse

• New media processing

– Video

– Voice processing

– Navigation

– Speaker biometrics

• Separation of data, control flow and presentation

– Control flow embodied in new language: SCXML

• Clean data model
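
To give a feel for control flow expressed in SCXML, here is a small state chart for a collect-and-confirm dialog. The syntax follows the SCXML specification as later published by the W3C; the state and event names are hypothetical, and how VoiceXML 3.0 would bind prompts and recognition to these states is not shown and should not be inferred from this sketch.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<scxml xmlns="http://www.w3.org/2005/07/scxml" version="1.0" initial="greeting">
  <!-- Control flow only: presentation and data are assumed to live elsewhere -->
  <state id="greeting">
    <transition event="prompt.done" target="collect"/>
  </state>

  <state id="collect">
    <transition event="input.filled" target="confirm"/>
    <!-- Unrecognized input: ask again -->
    <transition event="input.nomatch" target="collect"/>
  </state>

  <state id="confirm">
    <transition event="confirm.yes" target="done"/>
    <transition event="confirm.no" target="collect"/>
  </state>

  <final id="done"/>
</scxml>
```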


References

• W3C Voice Browser Working Group http://www.w3.org/voice

– VoiceXML 2.0 Recommendation

• http://www.w3.org/TR/voicexml20/

– VoiceXML 2.1 Working Draft

• http://www.w3.org/TR/voicexml21/

– Semantic Interpretation Working Draft

• http://www.w3.org/TR/semantic-interpretation/

– SRGS 1.0 Recommendation

• http://www.w3.org/TR/speech-grammar/

– SSML

• 1.0 Recommendation http://www.w3.org/TR/speech-synthesis/

• 1.1 Working Draft http://www.w3.org/TR/speech-synthesis11/

– CCXML 1.0

• http://www.w3.org/TR/ccxml/

– SCXML

• http://www.w3.org/TR/scxml/

• IETF http://www.ietf.org


Ken Rehor

http://www.kenrehor.com

VoiceXML Forum Co-founder and past-Chair

Chair, VoiceXML Forum Conformance Committee

Co-Chair, VoiceXML Forum Speaker Biometrics Committee

W3C

Co-editor: VoiceXML 1.0, 2.0, 2.1, 3.0

Co-editor: CCXML 1.0