
Department of Linguistics and Philology
Språkteknologiprogrammet (Language Technology Programme)
Master’s thesis in Computational Linguistics

10th June 2005

A Speech-Driven Automatic Receptionist

Written in VoiceXML

Katarina Matzon

Supervisors:
Beata Megyesi, Uppsala University
Tobias Ohman, Voxway AB

Abstract

This thesis describes the implementation of a speech-driven receptionist for Voxway AB. The receptionist was designed to be used by smaller Swedish companies. It answers calls coming into the company and directs each call to an employee based on speech input from the caller. It also handles unrecognized names and unanswered phone calls. It was programmed in VoiceXML and ColdFusion. A database was designed and implemented to store the data needed to make the receptionist dynamic and to log call statistics. The telephony application was evaluated by test users and a user survey. A website (programmed in HTML and ColdFusion) was designed to administer the telephony application and to allow companies to customize the application as well as view statistics about their usage of it.

Contents

Abstract
Contents
List of Figures
List of Tables
Acknowledgements

1 Introduction
1.1 Purpose
1.2 Outline

2 Dialogue Systems
2.1 Speech Recognition
2.2 Dialogue Management
2.2.1 Design Methods
2.2.2 Human Communication
2.2.3 Design of Dialogue
2.3 Generator
2.4 VoiceXML
2.4.1 ColdFusion

3 Programming the Receptionist
3.1 Static Receptionist
3.1.1 Design of Dialogue
3.1.2 Basic Code
3.1.3 Building Grammars for Use
3.1.4 Integrating Error Handling in the Code
3.2 Integrating Dynamics
3.2.1 Building the Database
3.2.2 Using ColdFusion to Integrate Dynamics
3.2.3 Organizing the Code for Dynamics
3.2.4 Dynamic Queries and Output
3.2.5 Dynamic Grammars
3.2.6 Dynamic Prompts
3.2.7 Implementing Statistical Element

4 Evaluation
4.1 Evaluation Method
4.2 Testing
4.2.1 Test Users
4.2.2 Evaluation of Results

5 Designing the Web Interface

6 Concluding Remarks
6.1 Future Improvements

A Database

Bibliography

List of Figures

2.1 The three modules of a dialogue system
2.2 The relationship between SGML, HTML, XML and VoiceXML
2.3 A simple VoiceXML example
2.4 The seven subsystems of VoiceXML

3.1 Stages of Development of Receptionist
3.2 Example Dialogue
3.3 Receptionist Application’s chain of events
3.4 Example of different types of VoiceXML grammars
3.5 Example of error handling in a dialogue
3.6 Static event handling for an unanswered call
3.7 Example of a possible conversation
3.8 Query to find company name and ID
3.9 Example of ColdFusion output
3.10 Dynamic Grammar
3.11 Dynamic Prompt

4.1 Task example

5.1 Home Page
5.2 Employee List
5.3 Blank form for new employees

List of Tables

4.1 User Satisfaction Survey with Average Scores
4.2 User Satisfaction Scores

Acknowledgements

I would like to thank the people without whom this paper would not be what it is today. Thank you to both my supervisors, Beata Megyesi and Tobias Ohman. Thank you, Bea, for your encouragement and advice in writing this thesis, and thank you, Tobias, for all your encouragement and help with the programming of the receptionist. I would like to thank Botond Pakucs at KTH for contributing advice on the evaluation of dialogue systems. I would also like to thank my friend Jens Bergqvist for helping me record incredible sound for the receptionist so that it sounds more professional. Thank you to all my friends and family who have supported me this semester and were always around to talk when I needed a break. And lastly, I especially want to thank my boyfriend, Johan, for being such an incredible support and help throughout this process. Thank you for being my bollplank!


1 Introduction

Natural language processing, a field that combines linguistics and computer science, is growing every day. Computers increasingly interpret human language in everyday settings. One of the branches of computational linguistics is speech technology, where computers ‘understand’ and output speech. More and more companies are using speech technology. If you call the Swedish railway company, you will be speaking to a computer to book your tickets, and if you call the post office in the United States, you will be speaking to a computer to find the postal code you need.

Soon enough we may not need to type on keyboards, because it will be standard to talk to our home computers. People are already speaking to their mini-computers. For example, when a person calls a friend on a mobile phone, they can just say the friend’s name and the call is connected (Dobler, 2000). Similarly, a car’s navigation system recites directions for the driver to follow to the next destination (Wikipedia). These features improve our lives at home and at work.

One branch of speech technology is spoken dialogue systems. Spoken dialogue systems use speech technology to enable humans and computers to interact by means of human speech. Here both aspects of speech technology, speech recognition and speech synthesis, combine to interact with humans in the form of a dialogue. In the Merriam-Webster English Online Dictionary, dialogue is defined as follows:

Dialogue a conversation between two or more persons; also : a similar exchange between a person and something else (as a computer) b : an exchange of ideas and opinions c : a discussion between representatives of parties to a conflict that is aimed at resolution

A spoken dialogue system can then be defined as a system designed to perform a spoken conversation between a person and a computer. One area where these systems are increasingly popular is the telephone industry. At the end of the 1990s, telephone companies wanted to develop a common language to voice-enable the web, in other words, to build dialogue systems that work over the web and over the telephone. The result of this discussion was VoiceXML (Voice Extensible Markup Language) (W3C, 2003). VoiceXML made it much simpler for companies to build web-enabled applications that include speech over the telephone and expanded the possibilities for voice applications.

1.1 Purpose

The purpose of this thesis is to develop a speech-driven receptionist for Voxway AB. Voxway AB is a company specializing in developing and hosting IVR (Interactive Voice Response) applications with speech technology. This task involves developing an automatic receptionist for small companies where the goal is to form a comfortable and efficient dialogue between the caller and the automated service.

The dialogue system is programmed with VoiceXML. The receptionist is designed to expect the name or position of a person at the company. If the requested person can be reached at several numbers, the application asks which number it should connect to (mobile, home, work). Once the system knows the correct number, it connects the call. It also handles problems such as unrecognized names, busy signals, and unavailability.

Besides the dialogue aspect, the application involves designing a database and web interface that can be accessed by each company in order to customize the application to their needs. Each company has its own application content that is stored in a database and accessed by using the telephone number that receives the call as a key. The information in the database is managed through the website, which is designed to allow different companies to log in with a password and enter the information for each employee that the ‘receptionist’ needs in order to connect a call.

The website also allows companies to see statistics about the calls coming into the company and the calls transferred within the company.

1.2 Outline

This paper describes the implementation of an automatic receptionist. Chapter two gives a background on dialogue systems in order to prepare the reader for chapter three, which discusses the implementation of the receptionist from the static receptionist to the dynamic receptionist. The next chapter describes the evaluation of the implemented receptionist. The chapter following the evaluation describes the website design and implementation. The paper ends with concluding remarks and suggestions for future improvements.


2 Dialogue Systems

Spoken dialogue systems are systems built to handle human-computer interaction in the form of speech. A system normally consists of different modules that handle different aspects of the dialogue. A simple system consists of three modules: a speech recognizer, a dialogue manager and an output generator, as seen in Figure 2.1 (Gustafson, 2002).

Figure 2.1: The three modules of a dialogue system (Speech Recognizer, Dialogue Manager, Output Generator)

The first module is the automatic speech recognizer, which converts the input speech into text that the computer can parse. Once the text is parsed, it is sent to the dialogue manager, which decides how the system should react to the input. Often, the reaction is to send output to the output component, or generator. The output component consists of recorded prompts or text-to-speech (TTS), which converts a given output into speech to be recited to the user.
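As a minimal sketch of this flow, the three modules can be modelled as three functions chained together. The sketch is written in Python rather than in the languages used for the receptionist, and every name in it (DialogueSystem, fake_recognizer, simple_manager, fake_generator) is a hypothetical stand-in; no real speech engine is involved, and audio is simulated with plain bytes.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class DialogueSystem:
    """Three-module pipeline: recognizer -> dialogue manager -> output generator."""
    recognize: Callable[[bytes], str]   # speech (audio) -> text
    decide: Callable[[str], str]        # text -> system reaction, as text
    generate: Callable[[str], bytes]    # reaction -> audio (prompt or TTS)

    def handle_turn(self, audio_in: bytes) -> bytes:
        text = self.recognize(audio_in)
        reaction = self.decide(text)
        return self.generate(reaction)


# Hypothetical stand-ins so the sketch runs without any speech engine.
def fake_recognizer(audio: bytes) -> str:
    return audio.decode("utf-8").lower()


def simple_manager(text: str) -> str:
    if "timetable" in text:
        return "Which station are you leaving from?"
    return "Sorry, I did not understand."


def fake_generator(reaction: str) -> bytes:
    return reaction.encode("utf-8")


system = DialogueSystem(fake_recognizer, simple_manager, fake_generator)
```

Keeping the modules behind three separate functions mirrors the point made in this chapter that each component can be designed differently depending on the goal of the system.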

Together these components form a dialogue system. This system can then accept input as speech, parse this input, decide how to handle the input, and send output via the generator. This is how a general dialogue system works, but systems are designed with different goals in mind, and each component in the system will be formed differently depending on the goal. For example, the CU Communicator is an interactive dialogue system for travel information over the phone (Pellom and Ward, 2000). In comparison, a system with an entirely different goal is August, a multimodal dialogue system which was used to interact with people at the cultural center in Stockholm (Gustafson et al., 1999). Since dialogue systems can differ so greatly, they are divided into three categories.

The first is the task-oriented dialogue. This kind of dialogue has well-defined goals and is usually simple. Examples include simple question-and-answer systems such as the CU Communicator mentioned above. Another example is a system that gives train times over the telephone, such as the Philips automatic train timetable information system (Aust et al., 1995). The second type is the explorative dialogue, where the goals are not as well-defined; instead, the goals are to acquire knowledge about complex tasks or to browse information (Gustafson, 2002). An example would be an information browsing system such as AdApt, which allows users to find information about available apartments in the Stockholm area (Gustafson et al., 2000). Although there is a goal in the interaction, it is not easily defined. With AdApt, the goal may be to find an apartment to buy or simply to browse available apartments out of curiosity. The third type of dialogue is context-oriented. These dialogues are focused on the actual dialogue situation. The primary goal for the user in this interaction is to be entertained (Gustafson, 2002). This dialogue is based on the system, its location, or its surroundings. An example of this would be a museum guide system that talks about the exhibition it is stationed in, such as August, the system described earlier (Gustafson et al., 1999). August has no goal other than conversing.

Today, task-oriented dialogue systems are the most common, mostly because it is easy to measure the errors and effectiveness of such systems when the goals are so clear (Gustafson, 2002). The other two types are possible, however, and would greatly expand the possibilities of dialogue systems.

Each of the components of a dialogue system is examined in more depth below.

2.1 Speech Recognition

Automatic speech recognition (ASR) is the task of converting speech to text that can then be parsed by the computer. Determining what type of recognizer to build is one of the first steps. Many types of recognizers exist. One distinction is based on whether the system has prior knowledge about the user’s speech characteristics or not. Speaker-dependent (SD) systems are designed to understand speakers who have previously trained the system, and speaker-independent (SI) systems are trained to respond to a large group of people where training for each individual would be impossible (O’Shaughnessy, 2000). SD systems exist, for example, in mobile phones, where the speech recognizer recognizes only its owner’s way of pronouncing a name in the phone book. SI systems are much harder to make successful considering the large variations in speech that need to be taken into account.

Inter-speaker variability is the difference in speech between individuals. These differences include dialects, emotion in speech, sex of the speaker, and age of the speaker. For example, the accent of a person from the south of Sweden is very different from the accent of a person from the north of Sweden. An SI recognizer needs to account for these differences in order to understand a broader range of people. Besides these differences, even the emotion in a voice differs between speakers. For example, the level of excitement in a voice will be different depending on the speaker. All of these differences and more need to be considered when building an SI system.

Besides inter-speaker variability, intra-speaker variability exists. Intra-speaker variability is the variability of speech within one person. One person is unlikely to utter exactly the same thing more than once. The combination of intonation, pauses and emphasis is difficult to repeat exactly. This affects both SI and SD systems. A speech recognizer needs to be broad enough to handle these subtle differences in speech and to recognize the words that are spoken, but narrow enough that it does not confuse similar words.


Besides the aspects of speech, nonspeech aspects are important to consider as well. Background noise is a major factor in recognition: a recognizer will have more difficulty recognizing a person sitting in a crowded restaurant than a person in an empty room because of all the noise in the background. Channel distortion also needs to be considered. If a person is interacting with a system via a telephone, the connection can worsen recognition because of bandwidth limitations in the telephone network. Mobile phone connections can be bad, and if a person calls from overseas the connection can be affected, making it more difficult for the recognizer to understand the caller. The perfect conditions for a speech recognizer are one person in a silent room interacting with the computer without a medium such as a telephone. These conditions are, of course, not that common.

Once speech is recognized and the actual text is extracted, the computer parses the input in one of a couple of ways. Each speech recognizer is equipped with a linguistic component that parses the text before it is sent to the dialogue manager. The simplest parser is a static grammar, meaning that the parser has an unchanging grammar against which the input is matched to find the best match. Several candidate matches can be similar to one another, so the system can produce a list ordered from the most similar match to the least similar. In more complex recognizers, a lexicon or corpus with a much larger number of words interacts with a grammar to parse the meaning of the input (Gustafson, 2002). This allows for more possibilities when it is impossible to know exactly what inputs will be entered. A more complex linguistic component allows for a more robust system.
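The idea of matching input against a static grammar and listing candidates from most to least similar can be sketched as follows. This is only an illustration under stated assumptions: the grammar entries are invented, and plain string similarity from Python's standard difflib module stands in for the acoustic and linguistic scores a real recognizer would use.

```python
from difflib import SequenceMatcher

# Hypothetical static grammar: the fixed set of phrases the parser accepts.
GRAMMAR = ["anna andersson", "anders andersson", "sales", "support"]


def n_best(recognized_text: str, grammar: list[str], n: int = 3) -> list[tuple[str, float]]:
    """Rank grammar entries from most to least similar to the recognized text."""
    scored = [
        (phrase, SequenceMatcher(None, recognized_text.lower(), phrase).ratio())
        for phrase in grammar
    ]
    # Highest similarity first, so index 0 is the best match.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]
```

Calling n_best("anna anderson", GRAMMAR) returns a ranked list whose first entry is the closest grammar phrase. A dialogue manager could accept that entry outright or ask the caller to confirm when the top scores are close together.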

Once speech is recognized and parsed so that the system can interpret it, it is sent to the next component, the dialogue manager.

2.2 Dialogue Management

The dialogue manager is the backbone of a dialogue system. Once a text is parsed by the recognizer, the dialogue manager has to decide what to do with the input it has received. There are several different aspects to consider in the design of the dialogue manager so that it can handle input correctly and a successful dialogue can be programmed. The first and most basic is which method of design the designer chooses.

2.2.1 Design Methods

A few different ways to design a dialogue system exist: design by inspiration, design by observation and design by simulation (Gustafson, 2002). Designing by inspiration is when a designer decides how to design the dialogue without consulting any external party. This is a bit risky, since one person cannot think of all the possibilities in a conversation and the design relies solely on the linguistic competence of the designer (Gustafson, 2002). It can be considered an option in simple systems where the purpose is for the user to reach a goal. There it works because the user can be trained in how to reach the goal, and the dialogue system can then be considered a success. In more complex systems, it will most likely not give a good result. Designing by observation is when the designer observes communication between humans in the situation the system is meant to handle and tries to incorporate aspects of that communication into the system. Lastly, design by simulation (the Wizard-of-Oz technique) is when some or all parts of a system are simulated so that different aspects of the dialogue can be tested (Gustafson, 2002). This is quite a useful strategy, since the collected dialogues are more realistic when a human speaks to a simulated interface than when a human speaks to another human. The type of system and the possibilities the designer has will decide which design strategy is best suited for the dialogue system.

Once a design method is chosen, it is important to consider certain principles that exist in human communication.

2.2.2 Human Communication

In order for a successful dialogue to be designed, the designer needs to observe human dialogue and account for the unwritten rules that exist in human conversation. Only by following these rules and principles will the designer be able to design a dialogue system that people find as natural as speaking to a human. These principles and rules are discussed below.

Certain assumptions exist when humans communicate in order for a conversation to be satisfactory to all parties. Principles have been studied and defined so that communication can be more easily studied. Grice (1975) described four well-known maxims that govern conversation; when they are not followed, a conversation can be considered unsatisfactory. These four maxims are listed below.

• Quality. This means that in a conversation a person should always be sincere. People expect to hear the truth and will therefore be surprised if this maxim is not followed.

• Quantity. This means a person should say neither too little nor too much. If a person does not say enough, it could lead to confusion, and the same could happen if they say too much.

• Relevance. What a person says should always be relevant to the conversation. If a person starts speaking of something unrelated to the current subject, it will confuse the listeners.

• Manner. This means avoiding ambiguity. A speaker should be clear and to the point; otherwise it can lead to confusion.

All of these maxims need to be upheld in a dialogue system if the user is to feel comfortable with the conversation.

Besides the underlying principles of conversation, the structure of the conversation is important to follow. Conversations between humans are structured in turn construction units (TCUs). Each speech act by each partner is considered a TCU, and these TCUs are surrounded by turn relevance places (TRPs) (Norrby, 1996). For example, if one person directs a question to another person, that is considered a TCU. The answer the other person gives is another TCU, and the time in between the question and the answer is a TRP. TRPs are extremely important because they signal when another party can take a turn. TRPs are the natural place to take a turn when participating in a conversation. They can be signalled by a longer pause, by the intonation at the end of a TCU, and by other signals that humans perceive automatically. It is important for the dialogue system to determine whether a pause is a TRP or not; otherwise a conversation can be frustrating for the user.

These TRPs can be easier to find if the role of initiative in the dialogue is clear. The person who starts a dialogue has the initiative. The initiative can switch between the different parties as the conversation moves along, to keep it going forward. A conversation is considered single-initiative if one party always takes the initiative (Gustafson, 2002). For example, the Danish flight ticket reservation system is a mainly system-directed, task-oriented dialogue (Bernsen et al., 1997). Mixed initiative is when either party can take the initiative (Gustafson, 2002). This can be seen in a system where the user can prompt the system for an answer to a question and the system can do the same with the user. An example of such a system is the Waxholm system, which gives boat information for the Stockholm archipelago and was designed to allow user initiative as well as system initiative (Carlson et al., 1995).

These assumptions and underlying rules of conversation need to be taken into consideration when designing a dialogue manager. Otherwise the dialogue will most likely be unpleasant for the human user. The next step is programming the actual dialogue.

2.2.3 Design of Dialogue

Once the design method is decided and conversation principles are considered, the designer is ready to program the type of dialogue the manager will understand and interpret.

To help in the design process, the designer can gather examples of dialogues to base the design on or, if this is not possible, use scenarios (Gustafson, 2002). With scenarios, the designer considers all the different types of dialogues that can occur with the system in order to form a successful design. Scenarios are very helpful in that they take the system through as many different dialogues as possible.

With the help of the gathered examples or scenarios, a dialogue is designed. The dialogue manager can then be programmed to interact with human users in the limited way that the system was designed to. But in order for the system to reach a greater scope of information, the dialogue manager may interact with a database. A database stores all the information that could be relevant to the dialogue. For example, in a train booking system, where people call to book tickets, the dialogue manager must interact with the database in order to find information about the relevant trains. The contents of the database may also determine what output is acceptable. Once the dialogue manager has processed the input, the appropriate output is sent to the next component, the output generator.
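The interaction between a dialogue manager and a database can be sketched with a small timetable lookup. This is an illustrative assumption rather than any system described here: the table name, column names, stations and times are all invented, and Python's built-in sqlite3 module stands in for whatever database a real system would use.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory timetable; table and column names are hypothetical."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE departures (origin TEXT, destination TEXT, time TEXT)")
    conn.executemany(
        "INSERT INTO departures VALUES (?, ?, ?)",
        [
            ("Uppsala", "Stockholm", "08:10"),
            ("Uppsala", "Stockholm", "09:40"),
            ("Uppsala", "Gävle", "08:25"),
        ],
    )
    return conn


def answer(conn: sqlite3.Connection, origin: str, destination: str) -> str:
    """Dialogue-manager step: query the database, then choose the output text."""
    rows = conn.execute(
        "SELECT time FROM departures WHERE origin = ? AND destination = ? ORDER BY time",
        (origin, destination),
    ).fetchall()
    if not rows:
        # The database content decides which outputs are acceptable.
        return f"I found no trains from {origin} to {destination}."
    times = ", ".join(time for (time,) in rows)
    return f"Trains from {origin} to {destination} leave at {times}."
```

The returned string is what would be handed on to the output generator, either to be spoken by TTS or mapped to recorded prompts.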

2.3 Generator

Output can be generated in a few ways in a dialogue system. One way is through recorded prompts that are played back to the user. Another way is through a TTS system.

Recorded prompts can be used for messages that are always played in every dialogue. They are chosen because a real human voice can be considered more pleasing to listeners than a computer-generated voice.


TTS is used when the output cannot be foreseen. TTS does not sound as natural as a human voice, and therefore recorded prompts are sometimes preferred, but in many systems the output is often unique, which makes TTS extremely powerful. TTS systems generally synthesize speech from text using linguistic processing and by concatenating small speech units. They convert input text into speech waveforms using algorithms and previously coded speech data (O’Shaughnessy, 2000). Speech synthesizers can be characterized by the size of the speech units they concatenate and by the method used to synthesize the speech (O’Shaughnessy, 2000). Large speech units produce high-quality speech but require a lot of memory, while efficient coding reduces memory but also reduces speech quality. Most commercial synthesizers have been based on word or phone concatenation (O’Shaughnessy, 2000).

Two commercial applications exist for speech synthesizers: voice-response systems, which handle input text of limited vocabulary and syntax, and TTS systems, which accept all input text (O’Shaughnessy, 2000). TTS systems construct speech from text using small speech units and much linguistic processing, whereas voice-response systems simply concatenate speech from the large units the system has stored. TTS systems are the systems that are of interest for most spoken dialogue systems.

Several different methods of synthesis exist for TTS systems, including formant synthesis, articulatory synthesis, linear predictive coding synthesis, and waveform synthesis. The highest-quality synthesized speech uses waveform coders and large memories (O’Shaughnessy, 2000). These synthesizers can be considered quite advanced for certain systems. Two other types of synthesizers are terminal-analog synthesizers and articulatory synthesizers (O’Shaughnessy, 2000). With articulatory synthesis, the sound is created by modelling the actual vocal tract shapes and movements. In terminal-analog synthesis, only the acoustic results of speech are modelled, without taking the vocal tract into account. The choice of synthesizer is greatly influenced by the size of the vocabulary. For example, a system that requires a synthesizer that can produce unlimited text will generally be of lower quality than a system that has limited output.

The generator is the last of the three components of a dialogue system. The next section discusses one way to implement a dialogue system, which is the implementation used in this thesis. For more on speech synthesis or speech recognition, refer to O’Shaughnessy (2000). For more information on dialogue systems, refer to Gustafson (2002).

2.4 VoiceXML

VoiceXML (Voice Extensible Markup Language) is a powerful markup language that descends from SGML (Standard Generalized Markup Language). VoiceXML has two older siblings, HTML and XML, which were developed as children of SGML (see Figure 2.2). Whereas HTML is considered a single SGML application, XML is a metalanguage just as SGML is. A metalanguage is a language that is used to define other languages (Abbott, 2002). All the descendants of SGML are markup languages, which means that information content is stored with tags that describe the meaning of the information content (Abbott, 2002). XML was developed to generalize the success of HTML and also to allow for a broader user base than SGML by taking away some of the complexities of its mother language (Abbott, 2002). VoiceXML, which is itself written in XML, can be considered a young sibling to HTML.

Figure 2.2: The relationship between SGML, HTML, XML and VoiceXML

Although it is a sibling, it interacts differently with its users than HTML does: in VoiceXML applications the user speaks to the computer, whereas with HTML the user communicates visually with the computer using a mouse or keyboard (Abbott, 2002). VoiceXML was developed after discussion between telephone companies to develop a common language to voice-enable the web. The first version was released in August 1999. A simple example is seen in Figure 2.3. The output of running this example would be a TTS rendering of the text ‘Hello World!’.

<?xml version="1.0"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <!-- A form with a single block; its text is spoken to the caller via TTS -->
  <form>
    <block>Hello World!</block>
  </form>
</vxml>

Figure 2.3: A simple VoiceXML example

VoiceXML can be seen as a complete dialogue system for telephony applications where the designer simply has to program the dialogue manager and build grammars for the system. This can be seen in the seven subsystems, which are listed below and illustrated in Figure 2.4.

Network Interface: Allows communication with a web server over HTTP.

VoiceXML Interpreter: Software that can be considered the dialogue manager. This is where the programming and construction of the dialogue takes place.

TTS: As discussed above, translates text to speech.

Audio: Allows audio prompts to be played or recorded.

Speech Recognition: As discussed above, translates user utterances into text. VoiceXML uses speaker-independent speech recognition where the interactions are structured dialogues in which the user is limited to a finite vocabulary.

DTMF (dual tone multi-frequency): Translates keypad input into characters.

Telephony Interface: Enables communication with telephone networks.

Figure 2.4: The seven subsystems of VoiceXML (Abbott, 2002)
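To make the DTMF subsystem more concrete, the translation from keypad tones to characters can be sketched as a table lookup. The frequency pairs below are the standard DTMF keypad layout; the function names are hypothetical, and the sketch assumes that tone detection has already produced exact frequency pairs, which a real telephony platform would have to estimate from audio.

```python
# Standard DTMF keypad layout: each key is a (low, high) frequency pair in Hz.
LOW_HZ = (697, 770, 852, 941)
HIGH_HZ = (1209, 1336, 1477, 1633)
KEYS = ("123A", "456B", "789C", "*0#D")


def dtmf_to_char(low_hz: int, high_hz: int) -> str:
    """Translate one detected tone pair into the keypad character it encodes."""
    try:
        row = LOW_HZ.index(low_hz)
        col = HIGH_HZ.index(high_hz)
    except ValueError:
        raise ValueError(f"not a DTMF tone pair: {low_hz} Hz + {high_hz} Hz") from None
    return KEYS[row][col]


def decode(tone_pairs: list[tuple[int, int]]) -> str:
    """Translate a sequence of tone pairs into the string of keys pressed."""
    return "".join(dtmf_to_char(low, high) for low, high in tone_pairs)
```

In a VoiceXML application this translation happens inside the platform, so the dialogue code only ever sees the resulting characters.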

By putting together speech recognition, speech synthesis, XML and the web in this one powerful language, VoiceXML is able to extend the reach of the web, since it allows the web to be accessed from anywhere. It makes the web easier to use, especially for people with disabilities such as blindness or illiteracy. In addition, it increases the options for human-computer interfaces, since it is an inexpensive option compared to other voice applications (Abbott, 2002). VoiceXML has taken expensive, high-end speech technology and combined it with a markup language to make speech technology available even for low-end systems.

VoiceXML works by interpreting between the user and the web server. The VoiceXML code lies on a server and is accessed over the web or through a telephone number. The code is processed and is then able to form a dialogue with the caller. Although this is powerful in and of itself, it is not very exciting: like a static web page, the results never change. To make it dynamic, it can be integrated with a web application server, which allows it to connect to a database. One such application server is ColdFusion.

2.4.1 ColdFusion

ColdFusion was created in 1995 to bring dynamic content to the internet (Danesh and Motlagh, 2000). ColdFusion interprets requests coming from the web and connects to the database to retrieve the necessary information. For example, a website that contains many articles uses an application server such as ColdFusion to fetch the articles from a database; otherwise, each article would need its own webpage. This is what makes the web dynamic. When ColdFusion is integrated with VoiceXML, telephony applications become dynamic in the same way. ColdFusion is responsible for moving information to and from the database just as it does for regular webpages, but in a voice application its output is interpreted by the VoiceXML gateway. ColdFusion code can be embedded directly in VoiceXML applications, which makes it simple to use and easy to learn. Simple SQL statements retrieve the necessary information from the database, and this information is then processed further by the VoiceXML code.


3 Programming the Receptionist

The receptionist is programmed using VoiceXML and ColdFusion. Since the other parts of a dialogue system are included in the VoiceXML system (see section 2.4), the focus of the implementation is on the design and implementation of the program code. The design of the receptionist proceeds in several stages of development (as seen in Figure 3.1).

Figure 3.1: Stages of Development of Receptionist, in order: Static Code, Event Handlers, Database Design, Dynamic Code, Statistics

The first stage involves designing a static receptionist with no dynamic information, to make sure that the program can run with hard-coded information. The next step integrates event handlers that handle misrecognitions and other events. Once these two pieces are working, a database is developed that allows the information used by the receptionist to be dynamic. After the database is done, the static receptionist is reprogrammed to include ColdFusion Markup Language (CFML), which enables communication with the database. Once the dynamics are in place, I add statistical elements that are important for administrative purposes, such as call length, the time the call started, the phone number that the user called from, and the number the user called. After this, a website is designed that allows companies to submit, change, or delete information in the database. Each of these developments is discussed below.


3.1 Static Receptionist

3.1.1 Design of Dialogue

Before the receptionist is programmed, the dialogue is designed. Since it is a simple dialogue, it is designed by inspiration and some observation of receptionist situations. The dialogue needs to uphold Grice’s four maxims as discussed above, make the turn relevance places (TRPs) obvious to the caller, and keep the system’s utterances simple so that users model their own utterances on the system’s. The best approach is to be direct and to the point in as few words as possible. The dialogue is designed to be single-initiative, where the system always directs the caller. More experienced users can nevertheless barge in, interrupting the computer while it is speaking, which makes the dialogue more efficient. An example dialogue can be seen in Figure 3.2.

(1) Computer: Välkommen till företaget. Vem vill du prata med?
Caller: Anna Matzon.
Computer: Vill du prata med kundservice Anna Matzon?
Caller: Ja.
Computer: Vill du bli kopplad till jobbtelefon, mobilen eller hemtelefon?
Caller: Jobbtelefon.
Computer: Varsågod. Snälla vänta medans jag kopplar samtalet.
(samtalet kopplas)

(2) Translated into English
Computer: Welcome to the company! Who would you like to speak to?
Caller: Anna Matzon.
Computer: Would you like to speak to customer service Anna Matzon?
Caller: Yes.
Computer: Would you like to be connected to work, mobile, or home phone?
Caller: Work phone.
Computer: One moment. Please wait while I transfer your call.
(call transfers)

Figure 3.2: Example Dialogue

In this conversation, quality is upheld since the conversation contains no false statement and the system is therefore sincere. Quantity is also upheld since the questions are simple but informative, so that the user knows what response is necessary. The conversation upholds the relevance maxim since all the questions asked by the system relate to the goal of connecting the caller to a callee. Since the questions are unambiguous, the manner maxim is also upheld. In this way, all four maxims are satisfied. Since the system mostly asks questions, the TRPs are also clear to the user, because the end of a question is an obvious TRP. The user is placed in a single-initiative situation since the questions are always directed to the user, and the user should not feel a need to ask questions in return.

The goal of the receptionist is not to have a long conversation, but to connect the caller to a callee as simply and quickly as possible. This dialogue succeeds in that respect while upholding the rules of human conversation. The implementation of this design is discussed below.

3.1.2 Basic Code

The static receptionist, where all values are hard-coded, is programmed solely in VoiceXML. In the static version, the program code consists of one document that is followed linearly to connect the caller to a fixed destination. This chain of events can be seen in Figure 3.3.

Figure 3.3: The receptionist application’s chain of events: Caller, CalleeName, ConfirmCallee, CalleeNumber, TransferCall, Callee

In the first part of the code, speech synthesis is used to ask who the caller would like to speak to. The caller’s response has to be part of the active grammar in order to be accepted. Grammars are discussed further below.

If the user gives a response recognized by the system, the system asks the caller to confirm the recognized person. If the caller rejects the recognized person, the code starts again from the beginning. If the person is confirmed, a speech synthesis prompt asks the caller which telephone number she would like to be connected to. This response is also constrained by a grammar. In the static version, the computer asks about home, work, and mobile phone for every person, since no database exists to say whether an employee has more than one number. Once the number type is retrieved, the code moves on to the transfer section, where the call is transferred to the chosen phone number. If the number is busy, the caller is told to call back later, and a similar message is played if no one answers. After the call has been transferred and has returned, the system plays a short closing message before the call disconnects.
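The chain of events can be modelled outside VoiceXML as a simple loop. The following Python sketch is only an illustration of the control flow described above, not the thesis’s VoiceXML document; the grammar contents, the function name `run_static_call`, and the placeholder number are assumptions made for the example.

```python
# Hypothetical hard-coded data standing in for the static VoiceXML document.
NAME_GRAMMAR = {"anna matzon"}
NUMBER_GRAMMAR = {"jobbtelefon", "mobilen", "hemtelefon"}
STATIC_NUMBER = "<hard-coded number>"  # placeholder; the real number is not given


def run_static_call(answers):
    """Follow the chain of events using a scripted sequence of caller answers.

    Returns the list of steps visited and the number the call is sent to.
    """
    answers = iter(answers)
    trace = []
    while True:
        trace.append("CalleeName")
        if next(answers).lower() not in NAME_GRAMMAR:
            continue  # not in the grammar: ask for the name again
        trace.append("ConfirmCallee")
        if next(answers).lower() != "ja":
            continue  # caller rejected the name: start from the beginning
        trace.append("CalleeNumber")
        while next(answers).lower() not in NUMBER_GRAMMAR:
            pass  # re-prompt until one of the listed number types is given
        trace.append("TransferCall")
        return trace, STATIC_NUMBER
```

Whatever number type the caller chooses, the sketch returns the same number, which mirrors the limitation of the static version.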

In the static code, however, the telephone number is always the same since it is hard-coded. The static code is therefore of little use except as a base to build on. How this static code turns into useful dynamic code is discussed later in this chapter, but first grammars and event handlers are discussed.

3.1.3 Building Grammars for Use

In building the grammar for the receptionist, the goal is to keep the accepted responses short and simple, so that the dialogue is efficient and the speech recognizer can work easily with short phrases. As discussed earlier, VoiceXML is built up of seven subsystems, one of which is the speech recognizer. For the recognizer to recognize user input, it needs to be told what the accepted responses are so that it can try to match them with the user input. This is done with grammars. A grammar can be built in several ways in VoiceXML: as a simple list of options, as an inline grammar placed where it is used, or as an external grammar placed in another document. Examples are found in Figure 3.4. For the static code, an external grammar is used for both grammars. The first grammar contains all the acceptable names a user can ask for (name grammar), and the second contains the different types of telephone numbers they can be connected to (number grammar).

<option value="rod">rod</option>
<option value="bla">bla</option>
<option value="gron">gron</option>

<rule id="number" scope="public">
  <one-of>
    <item>jobbet</item>
    <item>mobilen</item>
    <item>hemma</item>
  </one-of>
</rule>

Figure 3.4: Example of different types of VoiceXML grammars

As seen in Figure 3.4, the external grammar is identical to the inline grammar; the only difference is that an external grammar is placed in another document instead of in the code. Both are composed of rules that are defined by listing the possibilities. The option list is a bit different since there are no rules; instead, a field has a set of options that defines the grammar. An external grammar is chosen for both grammars in the static code since it is neater and does not clutter the code. Since the grammar is external, the rules can also be more extensive.

Since these grammars are what the speech recognizer tries to match to the user input, the text is written as say-as text, which is similar to orthographic transcription. For example, Matzon is written matson since the z is pronounced as an s when spoken. Although the name is written as it sounds, it is not phonetically transcribed.
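The role of a grammar can be pictured as a lookup from accepted phrases to a result. The Python sketch below is a deliberately simplified model: a real recognizer matches audio against the grammar rather than comparing strings, and the entries and the function name `recognize` are assumptions made for illustration.

```python
# Hypothetical grammar: keys are say-as spellings, values are the written name.
NAME_GRAMMAR = {
    "anna matson": "Anna Matzon",
    "anna": "Anna Matzon",
    "kundservice": "Anna Matzon",
}


def recognize(utterance, grammar):
    """Return the matched employee, or None to model a nomatch event."""
    # Normalise case and whitespace before looking the phrase up.
    key = " ".join(utterance.lower().split())
    return grammar.get(key)
```

Anything outside the listed phrases yields `None`, which corresponds to the input that the event handlers have to deal with.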

Once the grammars are implemented, the system recognizes an accepted name and connects the caller to the static phone number. But what happens with input that is not included in the grammar? Event handling is discussed in the next section.


3.1.4 Integrating Error Handling in the Code

Error handling is necessary in order to handle exceptions in a way that is pleasing to the user. Errors introduced by imperfect recognition are a large problem facing dialogue systems (Choularton, 2004). Two general approaches exist to tackle this problem: error avoidance and error handling (Choularton, 2004). VoiceXML has built-in error handling for certain exceptions such as nomatch and noinput. A nomatch occurs when a person’s response does not match any item in the specified grammars, whereas a noinput occurs when the user gives no audible response. By default, VoiceXML handles both with a simple TTS error message and then re-prompts the user for a response. This is potentially frustrating, since the user hears the same error message every time they give an unacceptable response. It is important that exceptions are handled differently depending on how many times the user has given an unacceptable response; since the system should sound natural, repeating the same question again and again is not desirable.

According to Shin et al. (2002), users who are met with an error tend to rephrase or repeat their response. This behavior can be modelled in dialogue systems to manage the dialogue when errors are introduced (Choularton, 2004). The user is therefore first prompted to repeat their answer, and the second time they are given more specific instructions to rephrase their response. For example, the message after a first unrecognized response differs from the message after a third. This follows the most common way of handling errors, even if it is not the most desirable, since the information in the user’s first response is discarded (Gorrell, 2003). An example conversation with error handling is seen in Figure 3.5.

(3) Computer: Välkommen till företaget. Vem vill du prata med?
Caller: ehm, jag vet inte.
Computer: Jag är ledsen. Jag förstod inte. Vem vill du prata med?
Caller: ehm, jag vet inte.
Computer: Jag känner inte igen det namnet. Du kan säga namnet eller funktionen av personen du vill prata med.
Caller: Jag kommer inte ihåg.
Computer: Tyvärr så förstod jag inte. Jag kopplar dig till kundtjänst.

(4) Translated to English
Computer: Welcome to the company. Who do you want to speak to?
Caller: Ummm, I don’t know.
Computer: I’m sorry, I did not understand you. Who would you like to speak to?
Caller: Umm, I don’t know.
Computer: I don’t recognize that name. You can say the name or position of the person you would like to speak to.
Caller: I don’t remember.
Computer: Unfortunately I did not understand. I will connect you to customer service.

Figure 3.5: Example of error handling in a dialogue


Users prefer strategies that take longer but produce fewer errors and corrections (Hirschberg et al., 2000). As seen in the example above, if the system fails to recognize an accepted answer three times in a row, it connects the caller to customer service, where they can get help. This is a simple way of handling errors in which general help is given to the user after three attempts (Gorrell, 2003). I choose three attempts since this gives the caller three opportunities to reach the desired person, each time with slightly more specific instructions. If they are still unsuccessful after the third attempt, there is obviously a problem. More advanced error handling techniques exist that take many aspects of the conversation into consideration, as seen in Higgins, a dialogue system for investigating error handling techniques (Carlson et al., 2004).
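The escalation over three attempts can be summarised as a function of the number of consecutive unrecognized answers. The Python sketch below only models that decision; the prompt texts are taken from the English translation in Figure 3.5, while the function name `handle_nomatch` and the action labels are assumptions made for illustration.

```python
# Prompts for the first and second unrecognized answer (from Figure 3.5).
PROMPTS = {
    1: "I'm sorry, I did not understand you. Who would you like to speak to?",
    2: "I don't recognize that name. You can say the name or position "
       "of the person you would like to speak to.",
}
FALLBACK = ("Unfortunately I did not understand. "
            "I will connect you to customer service.")
MAX_ATTEMPTS = 3


def handle_nomatch(count):
    """Return (prompt, action) for the count-th consecutive unrecognized answer."""
    if count < 1:
        raise ValueError("count starts at 1")
    if count >= MAX_ATTEMPTS:
        # Third failure: stop asking and hand the caller over to a person.
        return FALLBACK, "transfer_to_customer_service"
    return PROMPTS[count], "reprompt"
```

Keeping the prompts in a table indexed by attempt makes it easy to give each attempt slightly more specific instructions.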

I have not implemented unique error handling for the number grammar, where the user can respond with one of three options (mobile, home, or work phone), since the options are listed for the user in the question. It is unnecessary since the error handling would simply re-prompt the user.

The number grammar and the name grammar are the only two grammars where error handling for the user response is necessary. Error handling is also necessary for events pertaining to the phone call. For example, error handling is necessary if the call is transferred to a number that is busy or has no answer. This is handled in the static version by simply stating that the person is busy or is not answering and thanking the caller for their call, as seen in Figure 3.6. Once the dynamics are built in, the user is given the option of trying another number or another person.

(5) Computer: Anna Matzon svarar inte. Tack för samtalet, prova gärna igen senare.
Computer: Anna Matzon is not answering. Thank you for your call, please try again later.

Figure 3.6: Static event handling for an unanswered call

To summarize, the static code is written in VoiceXML: a person calls in, asks for a person who is in the grammar, states the type of number they want to call, and is connected to a static number. If their responses are unacceptable, special event handlers take over, and if the number is busy or unanswered, the caller is informed. This code is clearly not very powerful. Its strength comes when the code becomes dynamic, and for that it needs a database to hold all the necessary information.

3.2 Integrating Dynamics

The first step in integrating dynamics into the static code is building a functional database. Once the database works, ColdFusion can be integrated with VoiceXML to connect the database to the program.

3.2.1 Building the Database

An efficient database is necessary to build an acceptable system. Without a working database the system is not functional, which is why the database design is so important and central to the entire system. The database can be viewed in Appendix A. It consists of five tables, which are listed below.

• Company

• Employee

• Tilltal

• InCall

• TransferCall

The Company table holds information about each company. Each company has a unique ID, which is used to separate the information in the other tables between companies. The Employee table holds information about each individual employee, including their telephone numbers and position at the company. Each employee has a unique ID, which identifies the employee in the Tilltal table as well. The Tilltal table (Swedish tilltal, “form of address”) is the source of the grammar for all the names: each name that can be used to reach a person is registered here with that employee’s ID. The last two tables, InCall and TransferCall, hold information about the calls for administrative purposes. To test that these tables are efficient and functional, scenarios that can occur with a caller are designed, and how these events affect the database is tested. A few scenarios are described below.
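To make the relationships between the five tables concrete, the following Python sketch creates a comparable schema in an in-memory SQLite database. It is not the design in Appendix A: only the table names and a few columns (id, Name, BaseNumber, CompanyID, EmployeeID) are taken from the text and figures of this chapter, and every other column name is an assumption made for illustration.

```python
import sqlite3

# Hypothetical column layout; see Appendix A for the actual design.
SCHEMA = """
CREATE TABLE Company (
    id INTEGER PRIMARY KEY,
    Name TEXT NOT NULL,
    BaseNumber TEXT NOT NULL UNIQUE      -- number callers dial to reach the company
);
CREATE TABLE Employee (
    id INTEGER PRIMARY KEY,
    CompanyID INTEGER NOT NULL REFERENCES Company(id),
    FullName TEXT NOT NULL,
    Position TEXT,
    JobPhone TEXT, MobilePhone TEXT, HomePhone TEXT,
    NameFile TEXT                        -- filename of the recorded name prompt
);
CREATE TABLE Tilltal (
    id INTEGER PRIMARY KEY,
    CompanyID INTEGER NOT NULL REFERENCES Company(id),
    EmployeeID INTEGER NOT NULL REFERENCES Employee(id),
    Name TEXT NOT NULL                   -- one way of addressing the employee
);
CREATE TABLE InCall (
    id INTEGER PRIMARY KEY,
    CompanyID INTEGER REFERENCES Company(id),
    CallerNumber TEXT, CalledNumber TEXT,
    StartTime TEXT, Duration INTEGER
);
CREATE TABLE TransferCall (
    id INTEGER PRIMARY KEY,
    InCallID INTEGER REFERENCES InCall(id),
    EmployeeID INTEGER REFERENCES Employee(id),
    Number TEXT, Duration INTEGER
);
"""


def create_database():
    """Create the five tables in a fresh in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn
```

Because several Tilltal rows can point to the same employee, one person can be reached by first name, full name, or position.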

All the scenarios begin with a caller calling a certain telephone number, which identifies the company in the database. Knowing which company it is, the system finds the appropriate welcome message and plays it to the caller. After the welcome message, the system asks who the caller wants to speak to. The caller then responds with a name (in our example, Anna).

The system then searches the Tilltal table, using the company’s ID from above, for an entry with the name Anna. When it finds an entry, it joins it to the Employee table through the employee ID, finds the filename of the recording of Anna’s full name, and asks the caller if he wants to speak to Anna Matzon. If the answer is yes, the caller is connected to one of the telephone numbers in the Employee table. If the answer is no, the system starts from the beginning, but this time eliminates the employee Anna Matzon as one of the options by excluding her employee ID from the search. In this way the system can search through the names in the Tilltal table to find a different result.

One variation of the above scenario is when a caller wants to speak to a function rather than a person, for example sales or customer service. If the caller asks for customer service, the computer finds the employee who has customer service as her position. The problem comes when the computer wants to confirm the callee with the caller: if the computer says only the callee’s name, the caller has no idea whether it is correct. An example of this can be seen in Figure 3.7. A simple solution is for the confirmation to state the callee’s position along with their full name, so that a caller who does not know the callee’s name still knows they are being connected to the correct person.
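The lookup with exclusion and the confirmation that names the position can be sketched together. The Python code below uses SQLite instead of the ColdFusion queries of the thesis; the column names, the helper names, and the second employee (“Anna Exempel”) are hypothetical and exist only to show what happens after a rejected confirmation.

```python
import sqlite3


def find_callee(conn, company_id, spoken_name, rejected_ids=()):
    """Find an employee for a spoken name, skipping employees already rejected."""
    sql = (
        "SELECT e.id, e.FullName, e.Position "
        "FROM Tilltal t JOIN Employee e ON t.EmployeeID = e.id "
        "WHERE t.CompanyID = ? AND t.Name = ?"
    )
    params = [company_id, spoken_name]
    if rejected_ids:
        marks = ", ".join("?" for _ in rejected_ids)
        sql += f" AND e.id NOT IN ({marks})"
        params.extend(rejected_ids)
    return conn.execute(sql + " ORDER BY e.id LIMIT 1", params).fetchone()


def confirmation_prompt(row):
    """State the position together with the full name, as in Figure 3.2."""
    _, full_name, position = row
    return f"Vill du prata med {position} {full_name}?"


def demo_database():
    """Small hypothetical data set with two employees called Anna."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Employee (id INTEGER PRIMARY KEY, FullName TEXT, Position TEXT);
        CREATE TABLE Tilltal (CompanyID INTEGER, EmployeeID INTEGER, Name TEXT);
        INSERT INTO Employee VALUES (1, 'Anna Matzon', 'kundservice');
        INSERT INTO Employee VALUES (2, 'Anna Exempel', 'ekonomiansvarig');
        INSERT INTO Tilltal VALUES (1, 1, 'anna');
        INSERT INTO Tilltal VALUES (1, 2, 'anna');
        INSERT INTO Tilltal VALUES (1, 1, 'kundservice');
    """)
    return conn
```

With this data, asking for “anna” first yields Anna Matzon; passing her ID in `rejected_ids` yields the other employee instead.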

(6) Computer: Välkommen till företaget. Vem vill du prata med?
Caller: Kundservice.
Computer: Vill du prata med Anna Matzon?
Caller: Jag vet inte, jag antar det.
Computer: Jag är ledsen. Jag förstod inte. Vill du prata med Anna Matzon?
Caller: OK.
Computer: Vill du bli kopplad till jobbtelefon, mobilen, eller hemtelefon?

(7) Translation in English
Computer: Welcome to the company. Who do you want to speak to?
Caller: Customer service.
Computer: Would you like to speak to Anna Matzon?
Caller: Umm, I don’t know, I guess.
Computer: I’m sorry, I did not understand. Would you like to speak to Anna Matzon?
Caller: Ok.
Computer: Would you like to be connected to work phone, mobile phone, or home phone?

Figure 3.7: Example of a possible conversation

The next scenario is how the database should handle calls that are not connected. A first thought is that, for calls that are busy or unanswered and are not automatically connected to voicemail, the system could have a message system of its own. On further reflection, however, the complications of a messaging system outweigh the benefits. Since most employees are assumed to have voicemail already, it would be very complicated work for something that in most cases already exists. Therefore, in the few cases where voicemail does not pick up a busy or unanswered call, the caller is asked whether they would like to try another number or another person.

After running through all the above scenarios and several others, the database seems to be functional and effective for the receptionist’s goals.

The database’s interaction with the program can be seen in the ColdFusion queries that are discussed below.

3.2.2 Using ColdFusion to Integrate Dynamics

Dynamics are integrated by writing SQL queries that pull information from the database at run time instead of relying on hard-coded values. This information is then made available to the VoiceXML code using CFOUTPUT, or VoiceXML information is converted to ColdFusion variables to be used in new queries.

The process works as follows. When a call comes in, ColdFusion queries the database to find out which company the call is meant for. Once this is known, the information is used to fetch the appropriate welcome message and the first question.

When the caller says who they want to speak to, the response is stored in a VoiceXML variable. This variable is converted to a ColdFusion session variable, which is used to query the database for the full name of the callee.

This result is then used to confirm that it is indeed the correct employee. If it is, it is sent to the next document, where the caller is asked which telephone number they would like to call. The options are dynamic, depending on which numbers the employee has in the database. Once the caller chooses a number, the number is sent to the next document and used to transfer the caller.

If the transfer is unsuccessful in any of the ways discussed above for the static code, the caller is sent to another document where they are asked whether they would like to try another number, if the employee has more than one, or otherwise whether they would like to try another person. If they would, the whole process starts again, with the person they have just tried to reach removed from the grammar.

3.2.3 Organizing the Code for Dynamics

For the dynamics to work properly, the static code needs to be separated into several documents, for two reasons: each part of the dialogue gets its own document, and, since ColdFusion runs before VoiceXML on each page, VoiceXML variables must be sent to a new document before ColdFusion can use them. The documents are divided as follows.

• index.cfm - the initial variables for both ColdFusion and VoiceXML are set.

• initial.cfm - the first welcome message is played and the first queries are run to extract necessary information from the database.

• person.cfm - the first question is asked to the caller, who they’d like to speak to.

• confirm.cfm - this document confirms the person that the system recognized from the variable sent from person.cfm.

• number.cfm - the number to call is extracted by a new question to the caller.

• xfer.cfm - the caller is transferred to the number from the previous page.

• reconnect.cfm - redirects the caller if necessary.

With these separate documents, it is easy to pass information to and retrieve information from the database. The few VoiceXML variables that need to be converted into ColdFusion variables are passed between documents and converted on arrival. The first are the caller’s number and the number they called. These are sent from index.cfm to initial.cfm, where they are inserted into the database. Another VoiceXML variable is the name that the caller asked for. This name is sent from person.cfm to confirm.cfm, where it is used in a query to retrieve the full name of the person for confirmation. The final VoiceXML variable that needs to be converted is the number that the caller wants to be connected to. Once this is retrieved in number.cfm, it is sent to xfer.cfm.
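The hand-over of a variable from one document to the next can be pictured as an ordinary HTTP request with parameters. The Python sketch below assumes that the variables travel as query-string parameters, which the text does not state explicitly; the function names and the parameter name `callee_name` are likewise assumptions made for illustration.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def submit_url(next_document, variables):
    """Build the request that carries dialogue variables to the next document."""
    return f"{next_document}?{urlencode(variables)}"


def received_variables(url):
    """Decode the parameters as the server side of the next document sees them."""
    query = urlsplit(url).query
    return {name: values[0] for name, values in parse_qs(query).items()}
```

For example, `submit_url("confirm.cfm", {"callee_name": "anna matson"})` gives `confirm.cfm?callee_name=anna+matson`, and `received_variables` turns that back into the original value on the receiving page, where it could be copied into a session variable.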

Apart from the above conversions from VoiceXML variables to ColdFusion variables, the ColdFusion code is simply embedded in the VoiceXML code so that VoiceXML can use the information from the database. This includes dynamic queries, grammars, and prompts, which are discussed below.

3.2.4 Dynamic Queries and Output

CFML is a tag-based language whose syntax is very similar to that of VoiceXML and HTML. ColdFusion is the medium between the database and VoiceXML. To make the static code dynamic, queries that retrieve information from the database need to be included. The most important pieces of the code are these queries and their output. Queries can be placed anywhere appropriate in the document, between CFQUERY tags. They are written in SQL and pull the necessary information from the database. They can include session variables that have already been set, but they cannot include VoiceXML variables. An example can be seen in Figure 3.8, where the query finds the ID and name of the company whose telephone number equals the session variable bnumber. These results are then stored in session variables, since they are referenced throughout the code.

<CFQUERY NAME="company_id" DATASOURCE="telefonist">
  SELECT c.id as comp_id, c.name as comp_name
  FROM company c
  WHERE c.BaseNumber='#session.bnumber#'
  AND c.Password='foretag'
</CFQUERY>
<CFSET session.comp_id = company_id.comp_id>
<CFSET session.comp_name = company_id.comp_name>

Figure 3.8: Query to find company name and ID
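The same lookup can be expressed outside ColdFusion. The Python sketch below uses SQLite with a bound parameter in place of the inline `#session.bnumber#` substitution, which is a different technique from the one in Figure 3.8; the demo row, the number "0000", and the function names are hypothetical.

```python
import sqlite3


def demo_connection():
    """In-memory stand-in for the company table, with one hypothetical row."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Company (id INTEGER PRIMARY KEY, Name TEXT, BaseNumber TEXT)"
    )
    conn.execute("INSERT INTO Company VALUES (1, 'Exempelföretaget', '0000')")
    return conn


def load_company_session(conn, base_number):
    """Look up the company for the dialled number and return session values."""
    row = conn.execute(
        "SELECT id, Name FROM Company WHERE BaseNumber = ?", (base_number,)
    ).fetchone()
    if row is None:
        return None  # unknown number: no company to greet the caller for
    return {"comp_id": row[0], "comp_name": row[1]}
```

The returned dictionary plays the role of the session variables comp_id and comp_name that later queries refer to.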

Results from queries such as the one in Figure 3.8 cannot be used in the VoiceXML code without the CFOUTPUT tag. This tag allows ColdFusion variables to be placed in VoiceXML code, as seen in the example in Figure 3.9.

<assign name="transfer_number" expr="<CFOUTPUT>#query_phonenumber.jobphone#</CFOUTPUT>"/>

Figure 3.9: Example of ColdFusion output

This example is an assign element in VoiceXML, which assigns the variable transfer_number the value of expr, a CFOUTPUT expression that evaluates to a telephone number. The system evaluates the CFOUTPUT expression, which references a query result, and the value of the query result is then placed in the VoiceXML variable transfer_number.

The opposite is done as well where ColdFusion variables are assigned the valueof a VoiceXML variable. But, in this direction, there is an extra step. Here, it isnecessary to send the variable to a new document and then place it in a CFSETstatement.

CFOUTPUT and CFQUERY tags are used to integrate dynamics into the static code. Two of the important dynamic parts are the dynamic grammars and dynamic prompts, which are discussed below.


3.2.5 Dynamic Grammars

Dynamic grammars hold query results from the database and are used in the same way as static grammars, except that they change depending on the information in the database. Grammars become dynamic by using ColdFusion. The code for the name grammar is seen in Figure 3.10.

<CFQUERY NAME="possible_names" DATASOURCE="telefonist">
  SELECT t.Name as possiblename
  FROM Tilltal t, company c, employee e
  WHERE c.id=#session.comp_id#
  AND t.CompanyID=c.id
  AND t.employeeID=e.id
  AND NOT e.id = #session.first_person_Id#
</CFQUERY>

<grammar mode="voice" version="1.0">
  <rule id="fullname" scope="public">
    <one-of>
      <CFOUTPUT QUERY="possible_names">
        <item>#possiblename#</item>
      </CFOUTPUT>
    </one-of>
  </rule>
</grammar>

Figure 3.10: Dynamic Grammar

Instead of each possible name being hard-coded into the grammar as in the static code, ColdFusion pulls out each name that satisfies the query. The result is the grammar that VoiceXML uses to match the user input.

An interesting aspect of this grammar is the condition in the query that the employee ID cannot be the same as the session variable first_person_Id. If the recognizer recognizes the wrong name the first time around, that name is therefore no longer an option when the caller is directed back to the first question. Otherwise, the query pulls out all the names in the Tilltal table where the company ID equals the session variable comp_id and the ID of the employee matches the employeeID of the Tilltal table.
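What the CFOUTPUT loop in Figure 3.10 produces can be reproduced with any tool that generates XML. The Python sketch below builds the same grammar structure from a list of names using the standard library; it stands in for the ColdFusion template and is not the thesis’s code, and the function name `build_name_grammar` is an assumption.

```python
import xml.etree.ElementTree as ET


def build_name_grammar(names):
    """Return grammar markup with one <item> per name from the query result."""
    grammar = ET.Element("grammar", {"mode": "voice", "version": "1.0"})
    rule = ET.SubElement(grammar, "rule", {"id": "fullname", "scope": "public"})
    one_of = ET.SubElement(rule, "one-of")
    for name in names:
        # Assigning .text lets the library escape characters such as & and <.
        ET.SubElement(one_of, "item").text = name
    return ET.tostring(grammar, encoding="unicode")
```

Passing the names that remain after the rejected employee has been excluded gives the grammar for the second attempt.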

The number grammar is kept static since there are only three options. The question to the caller is made dynamic, however, by asking only about the numbers that are available for that employee. If there is only one number for an employee, the call is transferred immediately.
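The decision about what to ask can be written as a small function over the numbers stored for an employee. The Python sketch below is an illustration under assumed names: the dictionary keys, the function name `number_step`, and the action labels are not taken from the thesis’s code.

```python
def number_step(phones):
    """Decide the next step from an employee's stored telephone numbers.

    `phones` maps a number type to a number, or to None when it is missing.
    """
    available = [(kind, number) for kind, number in phones.items() if number]
    if not available:
        return ("no_number", None)
    if len(available) == 1:
        # Only one number exists, so there is nothing to ask the caller.
        return ("transfer", available[0][1])
    # Several numbers: ask only about the types that actually exist.
    return ("ask", [kind for kind, _ in available])
```

An employee with only a work number is transferred directly, while an employee with work and mobile numbers triggers a question that mentions just those two.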


3.2.6 Dynamic Prompts

For the receptionist to sound natural, the speech synthesis prompts used in the static receptionist are replaced with audio prompts. These audio prompts are recorded individually for each company and stored in that company’s catalogue (directory). The prompts need to be dynamic as well, so that the code references the appropriate recordings. This is done by placing a CFOUTPUT tag in the reference to the audio prompt, as seen in the example in Figure 3.11. Here the tag outputs the company’s name, so that the system looks in the appropriate catalogue for the prompts.

<audio src="http://path/<CFOUTPUT>#session.comp_name#</CFOUTPUT>/welcome1.wav"/>

Figure 3.11: Dynamic Prompt

One prompt is individual to each employee: their name prompt. This prompt is referenced by the filename stored in the employee table in the database, which points to a file in the appropriate company’s catalogue. In this way the prompts are different for each company and can be recorded individually.
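Building the address of a prompt amounts to joining a base address, the company’s catalogue, and a filename. The Python sketch below shows this under assumptions: the base address is passed in as a parameter because the real path is not given, and the function name `prompt_url` and the example values are hypothetical.

```python
from urllib.parse import quote


def prompt_url(base_url, company_name, filename):
    """Build the address of a recorded prompt inside a company's catalogue."""
    # quote() percent-encodes spaces and other characters not allowed in a path.
    return f"{base_url.rstrip('/')}/{quote(company_name)}/{quote(filename)}"
```

For instance, `prompt_url("http://prompts.example/", "Voxway AB", "welcome1.wav")` gives `http://prompts.example/Voxway%20AB/welcome1.wav`; an employee’s name prompt would be addressed the same way using the filename from the employee table.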

3.2.7 Implementing Statistical Element

Once the main part of the code is functional, the next step is to add statistical elements that are important for administrative purposes, such as the start time, the length of the phone call, and the telephone numbers involved in the call. This is accomplished with insert statements at the beginning of the code that store all the information available at the start of the phone call, such as the start time and the telephone numbers. Then, at the end of the code, an update statement adds the duration of the call to the other information. This information is important so that the company that uses the system can see call logs and analyze telephone costs.

To get accurate times, I use SQL and ColdFusion time functions. The SQL time is more accurate since it is taken when the statement itself is run, whereas the ColdFusion time is taken before the actual VoiceXML code is processed and can therefore be slightly inaccurate. The start time is not a problem, since the call starts at the moment the code starts running. The end time, which is needed to calculate the duration of the call, is more difficult; here an SQL timer is more accurate.

Information is also logged about the transfer call. Here the duration of the call is much easier to obtain, since a built-in VoiceXML shadow variable records the duration of the call. All the information is therefore simply inserted into the database when the transfer ends.
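The insert-then-update pattern for the incoming call can be sketched as follows. The Python code uses SQLite and explicit timestamps instead of the SQL and ColdFusion time functions of the thesis; the column names and function names are assumptions made for illustration.

```python
import sqlite3
from datetime import datetime


def open_log():
    """Create an in-memory stand-in for the InCall table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE InCall (id INTEGER PRIMARY KEY, CallerNumber TEXT, "
        "CalledNumber TEXT, StartTime TEXT, Duration INTEGER)"
    )
    return conn


def log_call_start(conn, caller, called, start):
    """Insert everything known when the call begins; return the row id."""
    cursor = conn.execute(
        "INSERT INTO InCall (CallerNumber, CalledNumber, StartTime) VALUES (?, ?, ?)",
        (caller, called, start.isoformat()),
    )
    return cursor.lastrowid


def log_call_end(conn, call_id, end):
    """Update the row with the duration in seconds once the call is over."""
    (start_text,) = conn.execute(
        "SELECT StartTime FROM InCall WHERE id = ?", (call_id,)
    ).fetchone()
    duration = int((end - datetime.fromisoformat(start_text)).total_seconds())
    conn.execute("UPDATE InCall SET Duration = ? WHERE id = ?", (duration, call_id))
    return duration
```

Storing the start time at the beginning means the row exists even if the call ends abnormally before the update is run.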


4 Evaluation

4.1 Evaluation Method

To evaluate the receptionist, a part of the method proposed by Walker et al. (1997), called PARAdigm for DIalogue System Evaluation (PARADISE), is used. PARADISE is a method for evaluating spoken language systems that assumes the system’s main objective is to maximize user satisfaction. The PARADISE framework derives a performance result for a dialogue system as a weighted linear combination of task-based success and dialogue costs (Walker et al., 1997). To obtain an appropriate performance rating, many different measures are used, such as user turns, help requests, and recognizer rejects. These measures are weighted and combined, resulting in a performance measure for the dialogue system.
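The weighted linear combination can be written out as a formula. The notation below follows the PARADISE formulation of Walker et al. (1997) as I understand it and is not used elsewhere in this thesis: κ is the task success measure, the c_i are the dialogue cost measures (such as user turns or recognizer rejects), α and the w_i are weights, and N is a normalization function that puts the measures on a common scale.

```latex
\[
\text{Performance} \;=\; \alpha \cdot \mathcal{N}(\kappa) \;-\; \sum_{i=1}^{n} w_i \cdot \mathcal{N}(c_i)
\]
```

In this evaluation only the task success side and the user satisfaction survey are used, so the cost terms are not estimated.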

Task completion and user satisfaction, which are included among their measures, are the sole contributors to this evaluation. To include the other measures, such as user turns and help requests, the conversations would have had to be recorded in a controlled environment such as a studio. I did not have access to such an environment and was therefore only able to measure task success and, through a survey, user satisfaction. User satisfaction has been used to indicate the usability of a dialogue agent, where two factors are relevant: task success and dialogue costs (Walker et al., 1997). Since user satisfaction is central to the success of a dialogue system, I believe that by using a part of the PARADISE strategy I can still measure the success of the dialogue system to a certain extent without considering the other factors in the dialogue (such as user turns or recognizer rejects). This is done with the survey from Walker et al. (1999), which users fill in after being given a task to complete. The survey can be viewed in Table 4.1. The scores on the survey are totalled into a user satisfaction score that is expressed as a percentage. The users are also able to write comments on different aspects of the system, and these comments are accounted for below.

(8) Call phone number. You want to speak to the head of economics at the company. You do not know the name of the person, only the position.
Ring telefonnummer. Du vill prata med ekonomiansvarig på företaget. Du kan inte namnet på personen bara den positionen.

Figure 4.1: Task example


Table 4.1: User Satisfaction Survey with Average Scores

Question  Average Score (%)

1. Var systemet lätt att förstå i detta samtal? (Was the system easy to understand?)  88%
2. Förstod systemet vad du sa? (Did the system understand what you said?)  76%
3. Var det lätt att nå fram till personen du sökte? (Was it easy to reach the person you asked for?)  73%
4. Var takten av samtalet bra för detta samtal? (Was the pace of the conversation appropriate?)  88%
5. Visste du vad du kunde säga i varje steg i dialogen? (Did you know what you could say at each step in the conversation?)  80%
6. Hur ofta var systemet sakta att svara i detta samtal? (How often was the system slow in responding?)  73%
7. Fungerade systemet som du förväntade dig i detta samtal? (Did the system behave as you expected?)  84%

Total Average Score  80%

4.2 Testing

Each user is given a task to complete. An example task is listed in Figure 4.1. After the task is completed, the user fills in a survey consisting of seven questions, each rated on a scale of 1–5, with 1 being poor and 5 being great. This survey can be viewed in Table 4.1. Four people have the task of reaching a person at the company knowing only their position and not their name. Three people have to reach a person knowing their name, and two other people try to reach people who do not exist at the company. The company consists of five people, each with at least four references, such as their first name, their full name, and their position. Each person at the company has one to three telephone numbers where they can be reached.
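The conversion from seven ratings to a percentage can be sketched as follows. This is a minimal illustration in Python, not code from the thesis. It assumes that the percentage is the total of the ratings divided by the maximum possible total (7 × 5 = 35); the text does not state the normalization, although reported scores such as 86% and 51% appear consistent with totals of 30 and 18 out of 35. The function name and the example ratings are hypothetical.

```python
QUESTIONS = 7                    # number of survey questions (Table 4.1)
MIN_RATING, MAX_RATING = 1, 5    # rating scale used in the survey


def satisfaction_percent(ratings):
    """Total the ratings and express them as a percentage of the maximum total.

    Assumption: the maximum total (7 * 5 = 35) is the denominator.
    """
    if len(ratings) != QUESTIONS:
        raise ValueError(f"expected {QUESTIONS} ratings, got {len(ratings)}")
    if any(not MIN_RATING <= r <= MAX_RATING for r in ratings):
        raise ValueError("each rating must be between 1 and 5")
    return round(100 * sum(ratings) / (QUESTIONS * MAX_RATING))


# Hypothetical ratings for one user; they total 30 out of 35.
example_ratings = [5, 4, 4, 5, 4, 4, 4]
print(satisfaction_percent(example_ratings))
```

Under this assumption a user who gives the lowest rating on every question still scores 20%, so the scale does not start at zero.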

4.2.1 Test Users

The test users are nine adults aged between 25 and 35 years. Six males and three females participate in the evaluation, and they have varying degrees of experience with this type of system. A description of the users can be seen in Table 4.2. Four males and one female among the test users have studied for a computer degree and therefore have high computer knowledge, whereas the others are not as experienced with computers. Two of the five people who have studied computers have studied computational linguistics and are therefore experienced with this type of system. Some of the users have been exposed to this type of system before; most often they have used the Swedish Railways train information system over the phone. Others have no previous experience with this type of dialogue system.

4.2.2 Evaluation of Results

Of the seven people whose task is to reach a person at the company, six are successful. The two people who try to reach people who do not work at the company reach customer service, as does the one person who does not reach the person he is trying to reach. The user satisfaction ratings can be seen in Table 4.2.

The most satisfied users are those whose task is to reach a person at the company knowing their name. The users who are given the task of trying to find a person who does not work at the company comment that they did not know that their person did not work at the company and that they could have been connected to customer service more quickly. One thinks the error handling is good, and the other thinks it is OK but would have liked to have been connected to customer service more quickly. Three people would have liked more instruction at the beginning of the phone call as to what the caller can say. Three people think that the only slow part of the conversation is when the call is being transferred, while the others do not comment on any delay. Everybody except one person thinks the pace of the conversation is good; that person thinks it is too fast.

Table 4.2: User Satisfaction Scores

User Satisfaction  User Description

Full Name
86%  Female. Low computer knowledge. Little previous experience with dialogue systems.
91%  Male. High computer knowledge. Much previous experience with dialogue systems.
89%  Female. High computer knowledge. Some previous experience with dialogue systems.

Position
86%  Male. High computer knowledge. Much previous experience with dialogue systems.
86%  Male. High computer knowledge. Little previous experience with dialogue systems.
51%  Male. Low computer knowledge. No previous experience with dialogue systems.
86%  Male. Low computer knowledge. No previous experience with dialogue systems.

Non-Existent Name
86%  Male. High computer knowledge. Little previous experience with dialogue systems.
77%  Female. Low computer knowledge. Little previous experience with dialogue systems.

The ratings on the different survey questions vary greatly between users, although the overall rating for the majority is between 86% and 91%. The average scores on the individual questions can be seen in Table 4.1. Most questions receive a score of 4 or 5 from almost all of the users. The question regarding the pace of the conversation and the question regarding what the caller could say at each step in the dialogue receive lower scores from a few people. The two callers who try to reach a person who does not exist give lower ratings to the questions regarding whether the system understood them and whether it was easy to reach the person they were calling.

One user begins by stating a whole sentence, instead of a short answer, when he is asked who he wants to speak to. This causes him to receive an error message. This user does not rephrase his request and thus quickly becomes frustrated with the system. This results in the low score of 51%. This user does not have much computer knowledge and has no previous experience with this type of system (see Table 4.2).

The user satisfaction is quite high, and seven of the nine test users rate the system between 86% and 91%. The average score for the survey is 80%, which reflects the few low scores from some of the test users, but individually the majority of the users rate the system above average.


5 Designing the Web Interface

The website is designed as a complement to the receptionist, as a way for companies to change, add, or delete information in the database as well as to see call statistics for the calls made to and transferred within the company. The website is built simply, using HTML and ColdFusion, with function in mind. ColdFusion is used to query the database for the information needed on the website in the same way as it is for the receptionist application. Each page is simple, with links at the top of the page for navigating to the other pages on the site. The user logs onto the site using the company’s phone number as the user ID and the password specific to that company.

The information on the website is divided into three categories: statistics, a homepage, and employee information. The statistics that the company is interested in seeing are the information about the calls coming into the company and the calls being transferred within the company. This information is logged in the database in the InCall and TransferCall tables, as discussed in the previous chapter. The statistics page is composed of simple queries that return all of the rows from these two tables. The query results are then output in an HTML table using CFOUTPUT. This can be quite a long list to go through; therefore, a summary is given on the homepage.
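The statistics page logic (select every row, then emit one HTML table row per record) can be sketched as follows. The site itself is written in ColdFusion with CFOUTPUT; this sketch uses Python and an in-memory SQLite database purely for illustration. The column names and the demo rows are hypothetical, since the actual schema of the InCall and TransferCall tables is not reproduced in this chapter.

```python
import sqlite3
from html import escape


def rows_to_html_table(cursor):
    """Render every row of an executed SELECT as an HTML table."""
    headers = [col[0] for col in cursor.description]
    parts = ["<table>"]
    parts.append("<tr>" + "".join(f"<th>{escape(h)}</th>" for h in headers) + "</tr>")
    for row in cursor:
        cells = "".join(f"<td>{escape(str(value))}</td>" for value in row)
        parts.append("<tr>" + cells + "</tr>")
    parts.append("</table>")
    return "\n".join(parts)


def build_demo_db():
    """Create a throwaway database; columns and rows are hypothetical."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE InCall (CallID INTEGER PRIMARY KEY, StartTime TEXT, LengthSeconds INTEGER)"
    )
    conn.executemany(
        "INSERT INTO InCall VALUES (?, ?, ?)",
        [(1, "2005-05-02 09:15", 42), (2, "2005-05-02 10:03", 67)],
    )
    return conn


conn = build_demo_db()
html_table = rows_to_html_table(conn.execute("SELECT * FROM InCall"))
print(html_table)
```

The same function could be called a second time with a query on TransferCall to produce the second list on the page.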

The homepage is the first page the company comes to when it logs on. Here there is a summary of the call statistics (the number of calls to the company and the number of calls transferred). The statistics on the homepage are produced with a query similar to the one on the statistics page, but instead of listing all the rows, the number of rows returned is counted. This gives the number of calls into the company for the InCall table and the number of transferred calls for the TransferCall table. These numbers are output on the homepage. The homepage is seen in Figure 5.1. On this page, a link below the statistics leads to the statistics page, discussed above, if the user wants more information about the call statistics. Below that link there is also a link to the employee information page.

Figure 5.1: Home Page
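The homepage summary can be sketched in the same illustrative style. Note one substitution: the thesis counts the rows returned by the query, whereas this sketch asks the database for the count with SQL `COUNT(*)`; both give the same number. The table columns and demo rows are hypothetical.

```python
import sqlite3

LOG_TABLES = ("InCall", "TransferCall")


def count_rows(conn, table):
    """Return the number of logged rows in one of the two call-log tables."""
    if table not in LOG_TABLES:
        # Table names cannot be bound as SQL parameters, so whitelist them.
        raise ValueError(f"unknown table: {table}")
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


# Throwaway demo database with hypothetical columns and rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE InCall (CallID INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE TransferCall (TransferID INTEGER PRIMARY KEY, CallID INTEGER)")
conn.executemany("INSERT INTO InCall VALUES (?)", [(1,), (2,), (3,)])
conn.executemany("INSERT INTO TransferCall VALUES (?, ?)", [(1, 1), (2, 3)])

summary = {
    "calls_in": count_rows(conn, "InCall"),
    "calls_transferred": count_rows(conn, "TransferCall"),
}
print(summary)
```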


The third category of information is employee information. The first employee page (see Figure 5.2) simply outputs a list of the employees (their names, telephone numbers, and positions) using a query of the Employee table. Links are placed next to each employee with the options to change or delete that employee.

Figure 5.2: Employee List

If the user wants to add a new employee, the link at the end of the list takes them to a new page with a blank form (see Figure 5.3), which they fill in with the employee’s information, including the different names by which the person can be referred to. An insert statement then inserts the new employee’s information into the correct table.

Figure 5.3: Blank form for new employees
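Adding an employee could look roughly like the following sketch, written in Python with SQLite rather than the ColdFusion used on the site. It assumes that the employee record goes into the Employee table and the names a caller may say go into the Tilltal table; all column names, the single Phones column, and the demo values are hypothetical.

```python
import sqlite3


def create_schema(conn):
    """Hypothetical minimal schema for the two tables named in the text."""
    conn.execute(
        "CREATE TABLE Employee (EmployeeID INTEGER PRIMARY KEY, Name TEXT, Position TEXT, Phones TEXT)"
    )
    conn.execute("CREATE TABLE Tilltal (EmployeeID INTEGER, SpokenName TEXT)")


def add_employee(conn, name, position, phone_numbers, spoken_names):
    """Insert one employee and every name by which a caller may refer to them."""
    with conn:  # commits on success, rolls back if an insert fails
        cur = conn.execute(
            "INSERT INTO Employee (Name, Position, Phones) VALUES (?, ?, ?)",
            (name, position, ", ".join(phone_numbers)),
        )
        employee_id = cur.lastrowid
        conn.executemany(
            "INSERT INTO Tilltal (EmployeeID, SpokenName) VALUES (?, ?)",
            [(employee_id, spoken) for spoken in spoken_names],
        )
    return employee_id


conn = sqlite3.connect(":memory:")
create_schema(conn)
new_id = add_employee(
    conn,
    name="Test Person",                      # placeholder form input
    position="ekonomiansvarig",
    phone_numbers=["000-000 00 01"],
    spoken_names=["Test", "Test Person", "ekonomiansvarig"],
)
print(new_id)
```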

If they want to change information about an employee, they are taken to a similar form that is filled in with the current information. Here they can change any of the information, including all the different names listed in the Tilltal table. An update statement is used for the Employee table, where the new value of each field replaces the old one. For the search names, the Tilltal table needs to be updated, and this is done most easily by using a delete statement to delete all of the current names for that employee and then an insert statement that inserts all the new names. This avoids having to compare each name with the names already in the Tilltal table, which would be necessary with an update statement in order to ensure that two entries do not hold the same information.
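The delete-then-insert approach for the search names can be sketched as follows, again in Python with SQLite for illustration only. The column names and demo values are hypothetical. Wrapping both statements in one transaction is an addition of this sketch, so that a failed insert does not leave the employee without any names.

```python
import sqlite3


def replace_spoken_names(conn, employee_id, new_names):
    """Replace all search names for one employee: delete the old set, insert the new."""
    with conn:  # one transaction: both statements succeed or neither does
        conn.execute("DELETE FROM Tilltal WHERE EmployeeID = ?", (employee_id,))
        conn.executemany(
            "INSERT INTO Tilltal (EmployeeID, SpokenName) VALUES (?, ?)",
            [(employee_id, name) for name in new_names],
        )


# Throwaway demo data; columns and values are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Tilltal (EmployeeID INTEGER, SpokenName TEXT)")
conn.executemany(
    "INSERT INTO Tilltal VALUES (?, ?)",
    [(1, "Test"), (1, "Test Person"), (2, "Other Person")],
)

replace_spoken_names(conn, 1, ["Test", "Test Person", "VD"])
current = [row[0] for row in conn.execute(
    "SELECT SpokenName FROM Tilltal WHERE EmployeeID = 1 ORDER BY SpokenName"
)]
print(current)
```

No name-by-name comparison is needed: whatever the form contains after editing becomes the complete set of names for that employee.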

If the user wants to delete an employee, they are taken to an intermediate step that asks them to confirm that they really want to delete the employee. This is to ensure that employees are not deleted by accident, since the delete function is irreversible. If they confirm that they want to delete the employee, a delete statement is used first on the Employee table and then on the Tilltal table to remove all the information about the employee. The user is then returned to the employee list with that employee removed. If they do not want to delete the employee, they are returned to an unchanged employee list.
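A sketch of the confirmed delete is given below, in Python with SQLite for illustration. The `confirmed` flag stands in for the intermediate confirmation page, and the column names and demo rows are hypothetical.

```python
import sqlite3


def delete_employee(conn, employee_id, confirmed):
    """Delete an employee and their search names, but only after confirmation.

    Returns True if the employee was deleted, False if the user declined.
    """
    if not confirmed:
        return False  # the employee list is left unchanged
    with conn:
        # Same order as described in the text: Employee first, then Tilltal.
        conn.execute("DELETE FROM Employee WHERE EmployeeID = ?", (employee_id,))
        conn.execute("DELETE FROM Tilltal WHERE EmployeeID = ?", (employee_id,))
    return True


# Throwaway demo data; columns and values are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (EmployeeID INTEGER PRIMARY KEY, Name TEXT)")
conn.execute("CREATE TABLE Tilltal (EmployeeID INTEGER, SpokenName TEXT)")
conn.executemany("INSERT INTO Employee VALUES (?, ?)", [(1, "Test Person"), (2, "Other Person")])
conn.executemany("INSERT INTO Tilltal VALUES (?, ?)", [(1, "Test"), (2, "Other")])

declined = delete_employee(conn, 1, confirmed=False)
deleted = delete_employee(conn, 1, confirmed=True)
remaining = [row[0] for row in conn.execute("SELECT Name FROM Employee")]
print(declined, deleted, remaining)
```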


6 Concluding Remarks

In this thesis I have implemented an automatic speech-driven receptionist for Swedish companies using VoiceXML and ColdFusion. The receptionist is designed to expect the name or position of an employee at the company. If the employee can be reached at several numbers, the application asks which number it should connect to, based on the numbers in the database. It handles unrecognized names with special event handling so that the dialogue is more pleasing to the caller. If the employee is busy or doesn’t answer, the caller is asked whether they would like to try another number, if the employee has more than one, and otherwise whether they would like to try another person. VoiceXML is used for the telephony application, where speech recognition is used to direct the caller to their desired destination. Audio prompts pre-recorded by a human speaker are used instead of speech synthesis because they are considered more pleasing to the user.

In order for the receptionist to run dynamically, a database is designed that stores the application content for each company. This information is retrieved using ColdFusion. Statistical elements that are necessary for administrative purposes, such as call length and the time the call started, are also inserted into the database by the program.

A website is designed to manage the database information, where companies are able to log in, view call statistics, and change, add, or delete employee information. This site is programmed in HTML and ColdFusion.

The telephony application is tested by nine users using a user survey combined with a task, as suggested by the PARADISE framework. Seven of the nine test users have a satisfaction score of 86% or higher.

6.1 Future Improvements

The receptionist designed and implemented in this thesis is well suited to smaller companies, where calls are easily directed and transferred to one of the employees. The evaluation indicates that users find it satisfying, and it is designed to be easy for the company to manage through the website.

Some suggestions from the evaluation could be implemented in a future version. To make error handling clearer, the system could give a list of the people who work at the company, with their positions, if the caller is unsuccessful on the first attempt. This could be desirable for smaller companies, where listing names would not be time-consuming, and in this way a caller would immediately know whether the person they wish to reach works at the company. On the other hand, it would be too tedious for a large company. Error handling could also be improved by implementing more advanced techniques, such as using information from the unmatched input to understand what the caller is trying to say, instead of discarding the first response if it is not matched, which is what is done here.

The receptionist has the function of directing a caller to a desired destination. This idea could be expanded to include a messaging service where employees could leave messages in the database for potential callers. For example, an employee could leave a message saying that they are at lunch between 12 and 13. This would be convenient for callers, who would not need to try several numbers if they knew when they would be able to reach the callee.

To make the receptionist system even more useful, another system could be added where employees keep their contacts. In this way, the receptionist would not only take incoming calls but would also be able to connect outgoing calls. This would save time for the employees, since they would no longer have to look up telephone numbers. They could simply state the name of the person they wish to call, and the receptionist would then transfer the call.

A receptionist for smaller companies is implemented in this thesis. For large companies, the grammars would become quite large and the risk of overlapping names would be greater. For smaller companies, the simple system-initiated dialogue is efficient as well as effective, as can be seen in the evaluation scores. If a larger company would like to use the system, it would be a good idea to make the dialogue mixed-initiative so that the user could more easily state the purpose of their phone call.

If a larger company would like to use the system, it would also be a good idea to implement a synonym builder to handle all the different positions and names at the company. Consider, for example, what would happen if a company added a new position such as CEO (in Swedish, VD). Each time this happened, someone would have to insert all the synonyms manually, which is time-consuming for the person at the company who is updating the employee information. It is unlikely that they will be able to sit and reflect on the different ways of referring to one position. In the VD example, the company may refer to that position only as VD, but a caller may ask for verkställande direktör or chefen, which are synonyms of VD. At the same time, it is not feasible for the programmer to update the database manually whenever a new position is added. Ideally, one would want an automatic synonym builder that updates the Tilltal table for each new position. In this version, the system only handles the position that is included in the Tilltal table and not synonyms of it. Of course, a company may choose to insert many different variants of a position into the Tilltal table, which is not as large a job for smaller companies.
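What such a synonym builder might do at its simplest is sketched below. This is not part of the implemented system. The lookup table is a hypothetical, hand-written stand-in holding only the VD example from the text; a real builder would need a proper lexical resource for Swedish job titles, and the function name is illustrative.

```python
# Hypothetical synonym table; only the VD example from the text is included.
POSITION_SYNONYMS = {
    "vd": ["verkställande direktör", "chefen"],
}


def expand_position(position, synonyms=POSITION_SYNONYMS):
    """Return the position followed by any known synonyms, without duplicates."""
    base = position.strip()
    variants = [base]
    seen = {base.lower()}
    for synonym in synonyms.get(base.lower(), []):
        if synonym.lower() not in seen:
            variants.append(synonym)
            seen.add(synonym.lower())
    return variants


# Each returned variant would become one row in the Tilltal table.
print(expand_position("VD"))
print(expand_position("ekonomiansvarig"))
```

A position with no entry in the table is returned unchanged, which corresponds to the behaviour of the current version.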

In conclusion, the receptionist is designed for smaller companies to receive and transfer phone calls that come into the company. Some of the above improvements could be added to the receptionist system to make it even more useful in a company environment.


A Database


Bibliography

Kenneth R. Abbott. Voice Enabling Web Applications: VoiceXML and Beyond. Apress, 2002.

H. Aust, M. Oerder, F. Seide, and V. Steinbiss. The Philips automatic train timetable information system. Speech Communication, 17:249–262, 1995.

Niels Ole Bernsen, Hans Dybkjær, and Laila Dybkjær. Designing Interactive Speech Systems: From First Ideas to User Testing. Springer-Verlag New York, Inc., 1997.

R. Carlson, J. Edlund, and G. Skantze. Higgins - a spoken dialogue system for investigating error handling techniques. In Proceedings of ICSLP, 2004.

R. Carlson, S. Hunnicutt, and J. Gustafson. Dialogue management in the Waxholm system. In Proceedings of Spoken Dialogue Systems, 1995.

Stephen Choularton. Handling speech recognition errors in spoken dialogue systems. Technical report, Center for Language Technology, Macquarie University, 2004. Workshop Paper ACL-04.

A. Danesh and K. Motlagh. Mastering ColdFusion 4.5. Sybex, 2000.

Stefan Dobler. Speech recognition technology for mobile phones. Ericsson Review, no. 3, 2000.

Genevieve Gorrell. Recognition error handling in spoken dialogue systems. In Proceedings of the 2nd International Conference on Mobile and Ubiquitous Multimedia, Linköping Electronic Conference Proceedings (www), 2003. http://www.ep.liu.se/ecp/011/012.

H.P. Grice. Logic and conversation, 1975.

J. Gustafson, L. Bell, J. Boye, J. Edlund, J. Beskow, R. Carlson, B. Granström, D. House, and M. Wirén. AdApt - a multimodal conversational dialogue system in an apartment domain. In Proceedings of ICSLP ’00, volume 2, pages 134–137, 2000.

Joakim Gustafson. Developing Multimodal Spoken Dialogue Systems: Empirical Studies of Spoken Human-Computer Interaction. PhD thesis, Kungliga Tekniska Högskolan, 2002.

Joakim Gustafson, Magnus Lundeberg, and Johan Liljencrants. Experiences from the development of August - a multi-modal spoken dialogue system. http://www.speech.kth.se/august/ids99augexp.html, 1999.


Julia Hirschberg, Marc Swerts, and Diane J. Litman. Corrections in spoken dialogue systems. In Proceedings of the 6th International Conference on Spoken Language Processing (ICSLP-2000), Beijing, China, October 2000.

Catrin Norrby. Samtalsanalys. Studentlitteratur, 1996.

Douglas O’Shaughnessy. Speech Communications. IEEE Press, 2000.

Bryan Pellom and Wayne Ward. The CU Communicator: An architecture for dialogue systems. In Proceedings of the 6th International Conference on Spoken Language Processing (ICSLP-2000), 2000.

J. Shin, S. Narayanan, L. Gerber, A. Kazemzadeh, and D. Byrd. Analysis of user behavior under error conditions in spoken dialog. ICSLP, Sep 2002.

Voice Extensible Markup Language (VoiceXML) Version 2.0. W3C, 2003. http://www.w3.org/TR/voicexml20.

Marilyn Walker, Diane Litman, and Candace Kamm. Evaluating spoken language systems. In Proceedings of the American Voice Input/Output Society (AVIOS), May 1999.

Marilyn Walker, Diane Litman, Candace Kamm, and Alicia Abella. PARADISE: A framework for evaluating spoken dialogue agents. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics. ACL 97, 1997.

Wikipedia. Global positioning system. http://en.wikipedia.org/wiki/Gps.
