User Interfaces: Their Goals, Practices, and
Challenges
by
A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF
THE REQUIREMENTS FOR THE DEGREE OF
MASTER OF SCIENCE
(Computer Science)
(Vancouver)
© Yelim Kim, 2020
The following individuals certify that they have read, and
recommend to
the Faculty of Graduate and Postdoctoral Studies for acceptance,
the thesis
entitled:
Practices, and Challenges
submitted by Yelim Kim in partial fulfillment of the
requirements
for the degree of Master of Science in Computer Science
Examining Committee:
(VUIs) are becoming ubiquitous through devices that feature voice
assistants
such as Apple’s Siri and Amazon Alexa. Naturalness is often
considered to
be central to conversational VUI designs as it is associated with
numerous
benefits such as reducing cognitive load and increasing
accessibility. The lit-
erature offers several definitions for naturalness, and existing
conversational
VUI design guidelines provide different suggestions for delivering
a natural
experience to users. However, these suggestions are hardly
comprehensive
and often fragmented. A precise characterization of naturalness is
necessary
for identifying VUI designers’ needs and supporting their design
practices. To
this end, we interviewed 20 VUI designers, asking what naturalness
means
to them, how they incorporate the concept in their design practice,
and
what challenges they face in doing so. Through inductive and
deductive the-
matic analysis, we identify 12 characteristics describing
naturalness in VUIs
and classify these characteristics into three groups, which are
‘Fundamental’,
‘Transactional’ and ‘Social’ depending on the purpose each
characteristic
serves. Then we describe how designers pursue these characteristics
under
different categories in their practices depending on the contexts
of their VUIs
(e.g., target users, application purpose). We identify 10
challenges that de-
signers are currently encountering in designing natural VUIs. Our
designers
reported experiencing the most challenges when creating naturally
sounding
dialogues, and they required better tools and guidelines. We
conclude with
implications for developing better tools and guidelines for
designing natural
VUIs.
Providing natural conversation experience is often considered
central to de-
signing conversational Voice User Interfaces (VUIs), as it is
expected to bring
out numerous benefits such as lower cognitive load, lower learning
curve, and
higher accessibility. Despite its noted importance, naturalness is
ill-defined.
There are also no comprehensive standard resources for helping
designers to
pursue naturalness. In order to provide support for VUI designers
in the
future, it is critical to understand how they currently perceive
and pursue
naturalness in their design practices. Hence, we interviewed 20 VUI
designers
to understand their notion of a natural conversational VUI and
their practices
and challenges of pursuing it. In this thesis, we present 12
characteristics of
naturalness and classify these characteristics into 3 groups. We
also identify
10 challenges that our designers are currently encountering and
conclude with
implications for developing better tools and guidelines for
designing natural
conversational VUIs.
Preface
This thesis was written based on the study approved by the UBC
Behavioural
Research Ethics Board (certificate number H18-01732). This thesis
extends
a conference paper that is currently under review for publication.
As the first
author of the submitted paper, I designed and conducted the
semi-structured
interviews and analyzed the data under the supervision of Dr.
Dongwook
Yoon and Dr. Joanna McGrenere. More specifically, my two
supervisors
helped me formulate research questions, design the study and
analyze the
collected data. The submitted paper was written with great help
from the
two supervisors as well as the help from Mohi Reza, another
co-author of
the paper. Mohi Reza, an MSc student, provided a great amount of
help in
writing the introduction and related work section of the submitted
paper as
well as providing great insight for shaping findings and
contributions. Mohi
Reza also provided English writing assistance for the submitted
paper.
2.3 Characterizing Conversations
2.5 Human-likeness in Embodied Agents
2.6 Tools and Guidelines for VUI Design
5 Discussion and Implications for Design
5.1 Discussion
5.1.2 Positioning Naturalness of VUI in the Literature
5.1.3 Contrasting Naturalness of Transactional vs. Social Agents
6 Conclusion and Future Work
Bibliography
Appendix
A.2 The Recruitment Message Posted Through SNS
A.3 The Consent Form
A.5 The Semi-structured Interview Script
A.6 More Descriptions on the User Task
A.7 The Post User-task Survey
A.8 The Data Analysis Process
4.1 The twelve characteristics of naturalness that designers deem important
4.2 The 10 challenges that designers are currently encountering in designing natural VUIs
Acknowledgements
Firstly, I would like to express my sincere gratitude to my two
supervisors,
Dongwook Yoon and Joanna McGrenere. They were always generous
with
their time whenever I needed their help. Before I entered graduate
school, I
did not have much exposure to the Human-computer Interaction (HCI)
field
and research environment. My two great supervisors were always so
patient
and generous with my progress and encouraged me to explore and make
my
own decisions for my project. I really thank them for their
extensive help
throughout the whole project. My two supervisors always amazed me
with
their great professionalism and their endless passion for HCI
research, and
they set an example for me to follow. Secondly, I would like to thank
Mohi Reza, an MSc student who helped me in writing a paper we
submitted to a top-tier computing conference. He dedicated two weeks
to helping with my project. I really enjoyed working with him and
learned a lot from him. Thirdly, I would like to express my
appreciation to Karon MacLean for agreeing to be the second reader of
my thesis and being so generous with her time. I
also want to thank her for her insightful guidance and help for the
other
project that I worked on with her student, Soheil Kianzad. Lastly,
I would
like to thank the students in Multimodal User eXperience lab for
providing
insightful feedback on my study and for offering much valuable
advice on
graduate life.
Chapter 1
Introduction
In this chapter, we first introduce an overview of the problem
space. Then,
we motivate and illustrate the contributions of our study, and
outline the
overall structure of the thesis.
1.1 Problem Definition
With substantial industrial interest, conversational Voice User
Interfaces
(VUIs) are becoming ubiquitous, with the plethora of everyday
gadgets, from
smartphones to home control systems, that feature voice assistants
(e.g., Ap-
ple’s Siri, Amazon Alexa and Google Home Assistant). Conversational
VUI
systems are one of the two general types of VUI systems [1]. In a
conversa-
tional VUI system, users perceive the voice agents as conversation
partners
and accomplish their goals by having conversations with the agents
[1]. While
in a command-based VUI system, which is the other general type of
VUI sys-
tem, users are expected to learn and use the appropriate voice
commands to
accomplish their goals [2]. Hereafter, we use the term ‘VUI’ to
refer to ‘con-
versational VUI’, and we use the term ‘voice agent’ to refer to
‘conversational
agent’ [3].
At the heart of desired properties of VUIs is naturalness.
According to
prior work and industrial design guidelines, enabling users to
accomplish their
goals by having natural conversations with voice agents brings out
numerous
benefits such as lowering cognitive load [4, 5], lowering learning
curve [5],
and increasing accessibility [5, 6]. Hence, multiple VUI design
textbooks and guidelines recommend that designers make VUIs that
provide natural conversational experiences to users [7, 5, 8].
However, VUIs have only recently become popular, and designing a
conversational voice user interface can be difficult because
standard, comprehensive design guidelines are lacking. As Robert and
Raphael noted in their book, “conversational interfaces are at the
stage that web interfaces were in 1996: the technologies are in the
hands of masses, but mature design standards have not yet emerged
around them.”
In fact, currently available design resources suggest design
approaches toward naturalness, but their characterization of the term
is fragmented and hardly comprehensive. Multiple resources recommend
different practices to
practices to
designers to make VUIs sound more natural [9, 10], feature natural
dialogues
[11, 12, 13], or offer natural interactions [14, 15]. However, the
term, natural-
ness, is an ill-defined construct lacking precision and clarity
[16]. Therefore,
the field is lacking a comprehensive and substantive
characterization of nat-
uralness in VUIs despite its advertised importance.
Bridging this conceptual gap is a critical step towards providing
compre-
hensive guidance to designers who strive to create natural VUI
experiences.
The literature in communications and social science informs the
characterization of naturalness in human dialogues. It suggests that
people have different expectations and concepts of violations in
interpersonal communication, depending on a class of situational
factors, such as who is talking and the relationship between
interlocutors [17, 18]. Given that modern voice
modern voice
assistants are often situated in complex and dynamic social
settings [19], it
is possible that conversational characteristics of a VUI considered
natural in one setting are not perceived the same way in another
setting (e.g., an extremely human-like voice agent can be considered
deceptively anthropomorphic and
uncanny [20]). The extent to which specific characteristics of
naturalness
apply in different conversational settings remains an open
question.
Within the broader discourse on Natural User Interface (NUI)
design,
the preliminary conceptions of naturalness offered have remained
abstract
and generic. Some have described naturalness as a property that
refers to
how the users “interact with and feel about a product” [21]. Others
have
used it to describe devices that “adapt to our needs and
preferences”, and
enable people to use technology in “whatever way is most
comfortable and
natural” [22]. Such broad characterizations make sense at a
conceptual level.
However, the extent to which they can be applied in the domain of
VUI
design remains uncertain.
While there are some existing VUI guidelines, they too are “too
high-level
and not easy to operationalise” [23]. To offer designers proper
guidelines and
tools, we need to seek answers to questions on how designers
characterize nat-
uralness, how they align such characteristics to their varying
design goals,
and what challenges they face in this pursuit of naturalness. Doing
so will
enable researchers to create conceptual and technical tools that
support nat-
uralness for VUI designers.
As a first step towards characterizing naturalness, we conducted
semi-
structured interviews with 20 VUI designers to understand how they
define
naturalness, what design process they use to enhance naturalness,
and the
challenges they face. Through reflexive thematic analysis [24], our
study
revealed a comprehensive set of characteristics, which we mapped into
different categories according to the aspect of naturalness each
characteristic contributes to: achieving the basic skills required
for fluent verbal communication, providing social interactions, or
helping users with their tasks. Some of the characteristics mirror
those found in the human-to-human conversation literature, but
interestingly designers also identified
to-human conversation literature, but interestingly designers also
identified
characteristics that are “beyond-human”, which reflect the
machine-specific
characteristics that outperform people such as superior memory
capacity and
processing power. Our VUI designers also described significant
challenges in
achieving natural interaction related to a lack of adequate design
tools and
guidelines and in balancing the different characteristics based on
the role of the voice agent (e.g., a social companion or a personal
assistant).
1.2 Contributions
In this study, we recruited VUI designers, a group that was not
explored
previously in the HCI literature to our knowledge, and ran an
empirical
study to uncover their perceptions of naturalness and their current
practices
of pursuing it. Our work contributes the following:
1. We identified a set of 12 characteristics for naturalness as
perceived
by designers and categorized them based on different aspects of
nat-
uralness each characteristic contributes to: Fundamental, Social
and
Transactional.
2. We identified and characterized the 10 challenges that hinder
designers
from creating natural VUIs.
3. We proposed design implications for the tools and design
guidelines to
support designers in creating natural VUIs based on our findings
from
the interviews.
1.3 Overview
In Chapter 2, we present relevant previous work. In Chapter 3, we
describe
our study design and analysis methodologies. Then, Chapter 4
introduces
our study findings, and Chapter 5 discusses our reflections on the
findings
and presents design implications that we created based on our
reflections
and insights. Finally, Chapter 6 provides the conclusion of the
thesis and
suggestions for future work.
Chapter 2
Related Work
We set the stage by first reviewing the existing body of literature
on VUIs.
Then, we look at the broad manner in which naturalness is currently
con-
ceived and employed, and the ways in which people have
characterized con-
versations. We then focus on the rich body of work on
anthropomorphism in
conversational agents and beyond, a topic that is of particular
importance to
our discussion on naturalness. Finally, we look at orthogonal, yet
important
concerns, and review existing tools and guidelines for VUIs.
2.1 VUI Literature in HCI
Researchers have been investigating ways to support speech
interactions since as early as the 1950s. With rapid advancements in
Natural Language Understanding (NLU), we transitioned from
rudimentary speech-recognition-based systems such as Audrey [25] and
Harpy [26] in the 1950s and 1970s, to task-oriented systems like
SpeechActs [27] in the 1990s, and on to the sophisticated
conversational agents that we now have.
More recently, several studies from the HCI community have been
investigating how voice assistants impact users [19, 28, 29, 30]. In
these studies,
these studies,
various issues have been explored, including how VUIs fit into
everyday set-
5
tings [19], how users perceive social and functional roles in
conversation [28],
and the disparity between high user expectations and low system
capability
[31].
A common thread between many of these studies is that they take
into
account the perspective of the users. As Wigdor [21] put it,
naturalness is a powerful word because it elicits a range of ideas in
those who hear it. In this study, we take the path less trodden and
examine what designers think.
2.2 Existing Definitions of Naturalness
Given that naturalness is a construct, we see several angles from
which ex-
isting studies define and use the term.
As a descriptor for human-likeness: Naturalness is often seen as a
“mimicry
of the real world” [21]. In the context of speech, the human is the
natural
entity of concern, and hence, behavioral realism, i.e. creating
VUIs that act
like real humans, has become a focus. We can trace the attribution of
anthropomorphic traits to computers back to a seminal paper by Turing
[32] on whether machines can think. In that paper, he assumes the
“best strategy” to answer this question is to seek answers from
machines that would be “naturally given by man”. The pervasive
influence of such thought can be seen in existing definitions of
naturalness in the VUI literature: [33, 34, 35] all treat naturalness
in this light, as a pursuit of human-likeness.
As a distinguishing term between the novel and traditional modes of
in-
put: The term is also used to contrast interfaces that leverage
newer input
modalities such as speech and gestures, with more classical modes of
input, namely, graphical and command-line interfaces [36]. In this
definition,
definition,
the term is in essence an umbrella descriptor of countless systems
involv-
ing multi-touch [37, 38], hand-gestures [39, 40], speech [41, 42],
and beyond
[43, 44].
As interfaces that are unnoticeable to the user: Another usage
draws from
Mark Weiser’s notion of transparency introduced in his seminal
article on
ubiquitous computing [45]. In this formulation, naturalness is a
descriptor
for technologies that “vanish into the background” by leveraging
natural
human capabilities [46, 47].
As an external property: In this conception, the term does not
refer to the
device itself, but rather the experience of using it, i.e. the
focus is on what
users do and how they feel when using the device [48]. The
characteristics that we present in this thesis can also be viewed
from such an angle, i.e., designers form and utilize characteristics
not because they make the VUI itself more natural, but because they
make the experience of using it more natural.
The existing usage of the term has drawn heavy criticism from some.
Hansen and Dalsgaard [16] describe naturalness as “unnuanced and
marred by imprecision”, and find the non-neutral nature of the term
to be problematic. In their view, the term has been misused to
conflate “novel and unfamiliar” products with “positive
associations”, akin to marketing propaganda.
Norman [49] contests the distinction between natural and
non-natural
systems and notes that there is nothing inherently more natural
about newer
modalities over traditional input methods. With speech, for
example, he
notes that utterances still have to be learned.
2.3 Characterizing Conversations
Existing literature [50, 51, 52] has demarcated different forms of
human conversations based on purpose. Clark et al. [28] take these
forms and classify them into two broad categories: social and
transactional. In the former category, they note that the aim is to
establish and maintain long-term relationships, whereas in the
latter, the focus is on completing tasks. In our
study, we ground some of the characteristics that designers mention
on the
basis of these two categories.
Written Language
Previous studies from linguistics have identified the differences
between spo-
ken and written languages [53, 54, 55]. Researchers found that
people use
more complex words for writing compared to when they are speaking
[53,
54, 56, 57]. Bennett reported that passive sentences are used more
frequently in written texts [58]. Also, more complex syntactic
structures are used in written language than in spoken language [56].
In our study, we analyzed our interview data based on these previous
works to find out which specific aspects of spoken dialogue VUI
designers find challenging to mimic when they are writing VUI
dialogues.
2.5 Human-likeness in Embodied Agents
A rich body of studies explores issues revolving around
human-likeness in embodied agents and our relationships with them.
They investigate a plethora of concerns, such as ways to transfer
human qualities onto machines [59, 60], ways to maintain trust
between users and computers [61, 62, 63], modeling human-computer
relationships [64, 65], designing for different user groups such as
older adults [66, 67] and children [68, 69], and stereotypes [70, 71].
Nass et al. conducted a series of studies on how people respond to
voice assistants. Results from these studies suggest that people
apply existing social norms to their interactions with voice
assistants [72]. The “similarity attraction hypothesis” posits that
people prefer interacting with computers that exhibit a personality
similar to their own [73], and that cheerful voice agents can be
undesirable to sad users.
In our study, designers reflect on issues that echo the literature
by con-
sidering factors such as personality, trust, bias, and demographics
in their
VUI design practice.
2.6 Tools and Guidelines for VUI Design
Many large vendors of commercial voice assistants provide their own
separate
guidelines for designers [74, 12, 75]. These guidelines offer
design advice
tailored to developing applications for a specific platform. With
regard to platform-independent options, some preliminary effort has
been undertaken in the form of principles [76], models [77] and
design tools [78] for VUIs. More specifically, Ross et al. provided a
set of design principles for VUI applications that take the role of a
faithful servant [76], while Myers et al. analyzed and modelled
users’ behaviour patterns in interactions with unfamiliar VUIs [77].
Lastly, Klemmer et al. introduced a tool for intuitive and fast VUI
prototyping [78]. Our study adds design implications for tools and
design guidelines that help VUI designers create natural VUI
experiences.
Chapter 3
Methods
To understand how VUI designers perceive naturalness in their
design prac-
tices, we conducted semi-structured interviews with 20 VUI
designers. We
designed the interview questions with a constructivist
epistemological stance,
viewing the interview as a collaborative meaning-making process
between the
interviewer and the interviewee [79]. This chapter describes the
design of our study and the methodologies we used for collecting and
analyzing our data.
3.1 Participants
We recruited 20 VUI designers (7 female, 13 male) using purposeful
sampling.
To draw findings from a varied set of perspectives, we interviewed
both am-
ateur (N = 7) and professional (N = 13) VUI designers. We recruited
the
participants through flyers (Appendix A.1) and study invitation
messages
on social network services such as Facebook and LinkedIn (Appendix
A.2).
Participants’ ages ranged from 17 to 73 (M = 34.3, Median = 30.5,
SD =
14.7). The nationalities of our participants were as follows: 4
American, 1
Belgian, 1 Brazilian, 5 Canadian, 1 Dutch, 1 German, 5 Indian, 1
Italian,
and 1 Mexican.
In this study, we define professional VUI designers as people
working
full-time on designing VUI applications regardless of their actual
job titles
(e.g., VUI designer, Voice UX Manager, UX manager, CEO).
Participants’ length of professional VUI design experience ranged
from 9 months to 20 years (M = 4 years and 2 months, Median = 2
years, SD = 6 years). Our participants worked in companies that
ranged considerably in size, from 3-employee startups to large
corporations with over 5,000 employees. Most of the professional VUI
designers we recruited (8 out of 13) were working for relatively
small companies (from 2 to 49 employees), and 2 participants were
working for medium-size companies (from 50 to 999 employees).
In addition to the information stated above, we collected the
participants’ highest level of education (2 with a high school
diploma, 2 with a technical training certificate, 9 with a bachelor’s
degree, 5 with a master’s degree, and 2 with a doctoral degree), and
their familiarity with Speech Synthesis Markup Language (SSML)
(Figure 3.1). We collected information about SSML familiarity because
SSML is the only available standard method to modulate synthesized
voices across different platforms.
Figure 3.1: The distribution of the participants’ familiarity with
SSML (N=20)
About half of the participants considered themselves unfamiliar with
SSML, while about the same number considered themselves familiar with
it.
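For readers unfamiliar with SSML, the following is a minimal sketch of
what such markup looks like and how it could be generated. It is an
illustration only and was not part of the study materials. It uses the
speak, prosody and break elements defined in the W3C SSML
specification; the function name build_ssml, the attribute values and
the sample prompt are assumptions made for this example, and support
for individual attributes varies between voice platforms.

```python
import xml.etree.ElementTree as ET


def build_ssml(greeting: str, detail: str) -> str:
    """Return an SSML string that slows the greeting and pauses before the detail.

    Hypothetical helper for illustration; not taken from the thesis.
    """
    # <speak> is the root element of an SSML document.
    speak = ET.Element("speak")

    # <prosody> modulates the rate and pitch of the enclosed text.
    prosody = ET.SubElement(speak, "prosody", rate="slow", pitch="+2st")
    prosody.text = greeting

    # <break> inserts a pause; the text after the pause is stored as its tail.
    pause = ET.SubElement(speak, "break", time="500ms")
    pause.tail = detail

    return ET.tostring(speak, encoding="unicode")


if __name__ == "__main__":
    print(build_ssml("Good morning.", "Your first meeting is at nine."))
```

Building the markup with an XML library rather than string
concatenation keeps the output well-formed, which matters because
speech platforms typically reject malformed SSML.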
All of our participants had previously designed at least one
conversational
voice user interface, including voice applications for Amazon’s and
Google’s
smart speakers, humanized Interactive Voice Response (IVR) systems,
and
VUIs for a virtual nurse, a smart home appliance and a companion
robot.
3.2 Interviews
For each participant, we conducted one session of a semi-structured
interview
based on the interview script in Appendix A.5. The duration of
interviews varied from 30 minutes to about an hour, depending on the
participants’ availability. All of the interviews were conducted by
the thesis author. We arranged online interviews for the participants
(17 out of 20) who could not come to the University of British
Columbia for the interview.
Prior to each interview session, the participants were asked to fill
out a survey asking about their demographic data, familiarity with
SSML and previous VUI design experiences (Appendix A.4). In this
survey, we also asked about their most memorable VUI projects to
contextualize our research questions based on their vivid memories.
The data collected through the pre-interview surveys were analyzed
using descriptive statistics to ensure the diversity of the
participant demographics (i.e., gender and VUI design professional
level).
The interview questions can be broadly divided into four sections.
The
goal of the first section was to understand the participants’
general VUI de-
sign practices and their previous VUI experiences in depth. In this
part, we
requested them to describe their design practices for the two most
memo-
rable VUI projects that they reported in the pre-interview survey.
For the
second section, we sought to understand the participants’
conceptions of a
natural VUI. To achieve this goal, we requested them to provide
their own
definitions of a natural VUI. Then, we asked them how important it is
for them to create a natural VUI and, if it is important, what
benefits they expect to gain by doing so. The goal of the third
section was to understand the participants’ design practices for
creating natural VUIs and the
stand the participants’ design practices for creating natural VUIs
and the
challenges our participants are currently facing in creating
natural VUIs.
Therefore, we asked the participants what particular design steps
they take
for creating more natural VUIs, and asked them what the most
challenging
aspects of carrying out those steps are. The last section of the
questions
was to understand how useful the current design guidelines and
tools are for
designing natural VUIs. In this part, we asked the participants
about what
tools or design guidelines they are currently using and how helpful
they are
for designing natural VUIs.
3.3 User Tasks
Of the 20 participants, 15 had time to do the user task after the
interview. In this user task, participants were asked to write
relatively short VUI dialogues in two ways: by typing on a keyboard,
and by using a voice typing tool that we created. However, one
participant, aged 73, dropped out in the middle of the user task
because he felt tired of typing. The duration of each user task was
about 15 minutes. This user task was designed to accomplish two
goals. The first goal was to understand the participants’ VUI
dialogue writing procedures and their perceptions of the current
synthesized voices. The second goal was to explore the possibilities
of using voice typing instead of keyboard input for creating natural
VUIs. Please note that the data collected from the user tasks did not
directly inform our study results because the collected data lacked
sufficient amount and richness. However, to provide a more
transparent understanding of our study, we lay out the whole
procedure of the user task in Appendix A.6.
3.4 Procedure
Prior to the interview, each participant was asked to fill out the
pre-interview
survey (Appendix A.4) mentioned above. Before each interview
session, an
email containing the consent form (Appendix A.3) was sent to each
partici-
pant, so that our participants could provide their consent by
replying to the
emails.
Before recording the interview sessions, the participants were
informed
that the recording would start. After asking the interview
questions, the
participants who were able to spend more time carried out the user
task.
After they finished the user task, they were asked to fill out the
post-task
survey (Appendix A.7) to provide their feedback on the task and the
current
design guidelines. We provided a link for the survey to each
participant.
Lastly, we asked participants if there were any concerns or
questions re-
garding this study. We addressed their questions if there were any,
and if
there were no further questions, we informed them that the
interview was
finished, and thanked them for their help in this study. At the end
of each
interview session, each participant received $15/hour for their
participation.
The payment was made electronically through PayPal or Interac
e-Transfer. Some of the participants declined payment and expressed
their desire to help the study for free.
3.5 Data Analysis
All 20 interviews were transcribed before being analyzed. We used
Braun
and Clarke’s approach for reflexive thematic analysis [24] for
analyzing the
interview data. Their approach was particularly suited for our
study be-
cause of its theoretical flexibility and rigour. The three members
of the
research team, including the thesis author and her two supervisors,
had a
one-hour weekly meeting where they developed the themes over the
course
of several months. Instead of seeking an objective truth, we took a
crystallization approach [80] and developed a deeper understanding of
the data by sharing our interpretations of the data during each
meeting. To facilitate productive discussion, we organized the themes
in several ways concurrently, using post-its, ‘Miro’
(https://miro.com/), an online collaborative whiteboard platform, and
‘airtable’ (https://airtable.com/), an online collaborative
spreadsheet application (Appendix A.8). All members of the team coded
the interview data using the open coding methodology [81], “where the
text is read reflectively to identify relevant categories.” The other
two members of the team went through at least three interview
transcripts, and the thesis author went through them all. We took
both inductive and deductive approaches to coding the data and
developed a set of coherent themes that form the basis of our
findings. Our deductive approach to coding was derived from previous
work on “the classification of human conversation” [52, 50, 51].
3.6 Apparatus
For the participants who were not able to visit the University of
British Columbia for an in-person interview, we used ‘Zoom’, an
online communication software (https://zoom.us/), to conduct the
interviews and record their audio. For the in-person interviews,
‘Easy Voice Recorder’, an Android application for audio recording
(https://shorturl.at/mwzHK), was used to record the audio of the
interview. The interviewer also wrote interview notes on paper.
For the user task, the participants were asked to write VUI dialogues
using Google Docs. Then, the interviewer used a software tool named
‘TTS reader’, which we built for playing the sound of VUI dialogues
(https://github.com/usetheimpe/ttsReader). TTS reader used Amazon
Polly voices (https://aws.amazon.com/polly/) to read the VUI
dialogues that
tiny.cc/sqjxjz), was used to let the participants add voice
comments on
the dialogues that they wrote.
Results
In this section, we first describe how our VUI designers characterize
naturalness, followed by the challenges that they face in creating a
natural VUI. We also describe how each challenge relates to the
characteristics of naturalness defined by our designers.
4.1 Designers Characterize Naturalness
To contextualize our findings, we briefly summarize the most
memorable
projects described by our 20 participants. In total, we collected
data about
38 VUI projects. Only one was a chat-oriented dialogue system (e.g.,
[82, 83]), in which a participant built a conversational agent for
reducing elders’ loneliness; the rest were grouped as task-oriented
systems according to the definition provided by [84]. Among the types
of
applications
mentioned during the interviews, there were 27 Intelligent Personal
Assistant
(IPA) systems for smart home speakers (23 Amazon Alexa, 4 Google
Home),
8 Interactive Voice Response (IVR) phone systems, 1 voice agent
system for
a smart air-conditioner, 1 voice agent system for a mobile
application, and 1
voice agent system for a humanoid robot.
When asked to provide their definitions of a natural VUI, our
participants
responded in terms of the characteristics of human conversation
that they
consider important for creating a natural VUI. Our thematic analysis
later revealed three categories for classifying these characteristics
of human conversation: (1) those that are fundamental to any good
conversation, (2) those that promote good social interactions, and
(3) those that help users accomplish their tasks. Perhaps not
surprisingly, the
latter two
categories echo classifications for human-to-human conversations in
existing
literature [52, 50, 51], labeled as “social conversation” and
“transactional
conversation”. To be consistent with that literature, we adopted
those la-
bels.
We found that instead of pursuing all the characteristics under the
three
categories, our designers selectively choose a particular set of
characteris-
tics depending on the context of their VUI applications and their
design
purposes. We also found that pursuing multiple characteristics at
once can
create conflicts.
The three categories we use in this study can help readers
conceptualize
the 12 characteristics that we found in this study. They can also
serve as
a lens to understand why designers pursue these characteristics and
how
these characteristics can often conflict with each other. The
following section
provides detailed descriptions of each characteristic.
4.1.1 Fundamental Conversation Characteristics
Among the conversation characteristics mentioned by our
participants, there
was a set of fundamental verbal communication characteristics that
a natu-
ral VUI should have, regardless of whether the aim is to support
social con-
versations or transactional conversations. The participants
consider a VUI
that does not achieve these elements as unnatural. For example,
synthesized
voices that do not show appropriate prosodies (e.g., having a
monotonous
intonation) were frequently referred to as “robotic” (P1, P4, P6)
or being a
“machine”. (P18)
Fundamental Characteristics
§ Sound like a human speech.
§ Understand and use variations in human language.
§ Use appropriate prosody and intonation.
§ Collaboratively repair conversation breakdowns.
Social Characteristics
§ Express sympathy and empathy.
§ Express interest to users.
§ Be interesting, charming and lovable.
Transactional Characteristics
§ Proactively help users.
§ Present a task-appropriate persona.
§ Be capable to handle a wide range of topics in the task domain.
§ Deliver information with machine-like speed and accuracy.*
§ Maintain user profiles to deliver personalized services.*
*beyond-human aspects
Figure 4.1: The twelve characteristics of naturalness that designers
deem important
Sound Like a Human Speech
Six participants (P2, P5, P6, P7, P13, P15) mentioned that
utterances of
a natural VUI should have characteristics of spoken language as
opposed
to written text. For example, people tend to use more abstract
words and
complex sentence structures when writing [85, 86]. Accordingly, a
natural VUI should use simple words: “When you’re writing it down,
and you say
it out
loud, sometimes you realize that whatever you’ve written down is
way too
long or has way too many big words.” (P6) However, our participants
said
that the simple words do not mean less formal words such as slang
or vulgar
expressions, but rather more typical words that the persona of the
VUI would
use. P4 mentioned that he avoided using words that are “too casual”
for his
virtual doctor application because “people would take it more
seriously if
they felt that it was a natural doctor.”
A natural VUI should also incorporate filler words [87], breathing,
and
pauses:“You need to introduce those pauses...conversation bits,
like, ‘Um’,
‘Like’, ‘You know’...it makes the conversation more natural.” (P13)
The par-
ticipants mentioned that the patterns for breathing and pausing
should give
the impression of a ‘mind’ in the VUI, and make the user feel more
like they
are talking to a human:“Pauses before something, like a joke...we
need to
create an anticipation for the final jokes.” (P15)
Understand and Use Variations in Human Language
Human language is immensely flexible, and we can express the same
request
in countless ways. Thirteen participants (P1, P2, P5, P6, P7, P9,
P10, P11,
P16, P17, P18, P19, P20) mentioned that a natural VUI should
understand
various synonymous expressions spoken by users, such as:“‘Increase
the vol-
ume.’, ‘Turn up the volume.’...at the end of the day, you just want
[it] to
increase the volume.” (P16) The participants considered VUIs that
heavily
restrict what users can say to be unnatural and of diminished
value:“...if you
are instructing people to speak in a certain way, that’s not how, I
feel, voice
In addition to understanding varied expressions, 4 participants
(P6, P7,
P17, P20) mentioned that a natural VUI should be able to respond
using
a varied set of expressions to avoid sounding repetitive. “...we
don’t always
say, ‘Good choice!’, we say, ‘Great choice!’, or we say,
‘Awesome!’” (P6)
When users repeat the same input, a natural VUI should respond to
it using
different expressions:“So if you [the user] go through the
application more
than once, the structure [of the dialogue] is similar, but you will
always hear
different sentences.” (P7)
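The idea of answering the same input with different wording can be illustrated with a small sketch. It is not taken from any participant's application; the class name `VariedResponder` is hypothetical, and the phrasings are the ones P6 mentions. The sketch assumes that avoiding the phrasing used on the previous turn is enough to avoid sounding repetitive.

```python
import random


class VariedResponder:
    """Pick a phrasing for a prompt while avoiding an immediate repeat."""

    def __init__(self, variants, rng=None):
        if len(variants) < 2:
            raise ValueError("need at least two variants to vary the response")
        self.variants = list(variants)
        self.rng = rng or random.Random()
        self.last = None

    def next(self):
        # Exclude the phrasing used on the previous turn, then choose at random.
        candidates = [v for v in self.variants if v != self.last]
        self.last = self.rng.choice(candidates)
        return self.last


# Phrasings quoted by P6; the dialogue structure stays the same across turns.
praise = VariedResponder(["Good choice!", "Great choice!", "Awesome!"])
replies = [praise.next() for _ in range(5)]
```

Only consecutive repeats are prevented here; a designer who wants longer-range variety could track more than one previous phrasing.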
The participants mentioned that variation of expressions in human
lan-
guage should be considered within the context of the target user.
For ex-
ample, factors such as different age groups and even individual
differences
need to be considered:“...we found out like hanging up the house
phone, and
dial the phone clockwise. These are all words that older people
use, while we
don’t anymore.” (P8) Age aside, the same expression can mean very
different
things when coming from different individuals:“If I say ‘it’s hot’,
then it’s
different from you say ‘it’s hot’, right?” (P5)
However, there are certain use cases where language variation is
undesirable. This characteristic is less important when the primary
purpose of a VUI application is to hold a transactional conversation
that helps users complete their tasks, and the target users are not
the general public but members of certain professions, such as police
officers or firefighters. Such users are often trained to use special
keywords and follow a fixed workflow for faster and more efficient
communication. Hence, to help them effectively, the application
should stick to a fixed vocabulary: “So one of the main users of this
type of application will be police or fire ambulance
[drivers]...They’re used to a very rigid command set. So they’re
always saying things in the same way.” (P2)
Use Appropriate Prosody and Intonation
Prosody “refers to the intonation contour, stress pattern, and
tempo of an ut-
terance, the acoustic correlates of which are pitch, amplitude, and
duration.”
[88]. Eleven participants (P2, P3, P4, P5, P6, P8, P9, P11, P12,
P13, P18)
said that a natural VUI should present messages clearly with the
appropriate
prosody:“For me, it’s important to put the right intonation in some
parts of
the text to make it clear.” (P3) However, they highlighted that the
appro-
priate prosody can differ for age groups. Many participants
reported that
Amazon Alexa, a popular commercial voice assistant, is “way too
fast” (P11)
by default and that prosodies should be modified for
seniors:“...for seniors,
you may want to slow down the speed and potentially increase the
volume
or put an emphasis on the words.” (P2) Since no available voice was
customized for seniors, P11 had to manually insert a break after each
sentence: “...it’s like,
“...it’s like,
‘Okay, here’s your calendar!’, another break. ‘Today you have...’,
a slight
pause, like point two, point three-second pause.” (P11)
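The manual adjustment P11 describes can be approximated with standard SSML elements. The following is a minimal sketch, not P11's actual implementation: the function name `to_senior_friendly_ssml` and the sample sentences are hypothetical, the 0.3-second pause follows the range P11 mentions, and the sketch assumes the target platform accepts the standard `<break>` and `<prosody>` elements.

```python
from xml.sax.saxutils import escape


def to_senior_friendly_ssml(sentences, pause_seconds=0.3, rate="slow"):
    """Wrap sentences in SSML with a slower rate and a pause after each one."""
    pause_ms = round(pause_seconds * 1000)
    parts = []
    for sentence in sentences:
        # Escape characters such as & and < so the markup stays well-formed.
        parts.append(escape(sentence))
        parts.append(f'<break time="{pause_ms}ms"/>')
    body = " ".join(parts)
    return f'<speak><prosody rate="{rate}">{body}</prosody></speak>'


# Illustrative sentences only; the second one is invented for this example.
ssml = to_senior_friendly_ssml(
    ["Okay, here's your calendar!", "Today you have two appointments."]
)
```

Generating the breaks programmatically avoids typing a tag after every sentence, although it applies the same pause everywhere and does not replace listening to the result.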
Collaboratively Repair Conversation Breakdowns
During verbal communication, we often encounter small conversation
break-
downs when people do not respond in a timely way or do not
understand
what each other said. Four participants (P3, P6, P7, P8) mentioned
that a natural VUI should resolve these kinds of conversation
breakdowns in a way similar to how humans collaboratively resolve
them, by asking each other: “Like
in this conversation, how often did we already say, ‘I don’t
understand you’
or ‘What do you mean?’ or ‘Can you explain more?’. That’s already
for
humans that way...and the robots and the user interfaces have to
learn from
humans.” (P7)
A VUI is considered very unnatural and machine-like if it repeats the
same information when the conversation breaks down: “...if you
don’t
understand something sometimes, [and if] the system just keeps
repeating the
same information [to get the response from you], just like a
robot.” (P3)
4.1.2 Social Conversation Characteristics
While the main focus of our participants was on task-oriented
applications as
mentioned in section 4.1, they also emphasized the importance of
providing
positive social interactions to users. Humans have social
conversations to
build a positive relationship with each other [89]. In the VUI
context, de-
signers incorporate the social conversation characteristics for
providing har-
monious and positive interactions, a more realistic feeling of
conversation,
and a feeling of being heard.
Express Sympathy and Empathy
Ten participants (P2, P3, P4, P5, P6, P8, P9, P11, P12, P13)
mentioned
the importance of providing empathetic responses to users’
sentiments to
maintain harmonious interactions:“...if I know your favorite team
won, I’d
have a happy voice. If I know your favorite team lost, I’d have a
sad voice.”
(P2)
Most of the participants’ elaborations on this part were focused on
show-
ing sympathy when users experience negative sentiments. The
participants
try to make the voice assistants console the users and present
empathetic
voices when users feel negative:“If they respond negatively, the
Alexa re-
sponds, ‘Oh, I’m sorry to hear that.’” (P4)
Beyond being sympathetic, the participants even actively try to
soothe
users’ feelings in situations when they feel heightened emotions
such as
anger:“You have a calm reassuring voice when they’re upset because
there’s
a traffic.” (P9)
To find out whether users feel negative, the participants rely on user
responses, users’ personal information (e.g., their favorite sports
teams), and the location of the conversation (e.g., a hospital). No
participant reported having used a sentiment analysis approach, and
one designer specifically mentioned that adopting such an approach
would require too much of a time commitment: “I don’t have time to
know the APIs that can do sentiment detection.” (P11)
If there is no way to detect users’ real-time sentiments, our
designers choose to use a “flat voice” (P3) so that a happy-sounding
voice agent does not upset users who are currently feeling down, as
suggested in [72]: “You
have to control the tone of voice, because you can’t sound very
enthusiastic,
things like that, because you never know the situation of the
person on the
other side.” (P3)
Express Interest to Users
Four participants (P4, P6, P11, P12) said that they incorporate
greetings,
compliments, and words that express interest to the user. These
words make
the conversations appear “real-ish” (P11), and make users feel
important:“I
think the benefit of providing this type of responses, instead of
just blank ones,
is that it actually helps the person feel like their responses
actually got heard.”
(P4)
P11 and P6 mentioned that VUIs can even make users feel as if they
have personal connections with the applications by providing daily
greetings or feedback on users’ actions: “I’ll say, ‘See you
tomorrow!’. Little
snippets of
humanity.” (P11) “We could just say the recipe steps and all of
that, and not
have to ask questions like, ‘How’s the spice?’ and all of that, but
if we do,
then there’s some kind of personal connection.” (P6)
Be Interesting, Charming and Lovable
Social conversations include humour and gossip which fulfill
hedonic values
[90, 91] that transactional conversations do not contain. In order
to bring
more user engagement for task-oriented applications, 4 participants
(P1, P7,
P13, P17) reported trying to write more interesting dialogues and
to create
a charming persona:“Interactive means using some good words.
Something
which sounds interesting to the user.” (P17)
The importance of being entertaining was emphasized, especially
when
the target users are children. “So, when it’s a kid’s application,
you respond
back in a very funny way. You use, terms like ‘Okie Dokie’.”
(P13)
P7, who created an Alexa application for resolving the conflicts
between
children, mentioned that the persona does not need to be loving and
nice to
be charming. It can be sarcastic and funny, instead:“She’s not
loving and
caring, but she’s maybe a little sarcastic. She makes fun of what
they say and
I would say she’s lovable, not loving.” (P7)
When the system fails to accomplish what the user asked for,
design-
ers mentioned that a persona of a VUI that presents socially
preferable be-
haviours can abate negative emotions from users:“...if I had the
sense that
it understood its own limitation as opposed to telling me it can
not do some-
thing...I’ll be more flexible with it.” (P1)
4.1.3 Transactional Conversation Characteristics
As mentioned in section 4.1, all participants except one grounded
their answers in their design experiences with task-oriented
applications. For transactional conversations, our designers want to
achieve naturalness by leveraging the way users already rely on verbal
interaction with others to get things done. P9’s example is especially
illustrative: “To me, the greatest
benefit [of
voice interaction] is that it’s an interface that someone will
naturally know
how to use and won’t have to learn how to use a new [command]. Man,
I
know people, especially kids, are great at using phones and all the
things they
learn, but I really like how we can do [many things] with voice.
One example
I like to give is that our internet was down, and we had to reboot
the router
or whatever, and I was wondering if it worked, so I just said
‘Alexa, are you
working?’ and it says ‘Yes, I’m working’ before I even thought
about it.”
(P9)
designers considered machine-like speed and memory as
characteristics that
a natural VUI should have for transactional conversations, which we
label
as “Beyond-human” aspects. This was surprising given that existing
notions
of naturalness are primarily based on being human-like. In other
words,
the designers’ expectations of naturalness in VUIs extend beyond
providing
realistic conversation experiences, and include machine-specific
benefits:“So
I guess a natural agent would be close to having a real
conversation with
someone, but with all the added benefits of an actual application.”
(P12)
Proactively Help Users
Eleven participants (P2, P3, P8, P9, P10, P12, P13, P15, P16, P18,
P20)
mentioned that a natural VUI should be efficient, and proactively
“detect or
even ask for the things that [the user] needs.” (P12)
P18 said a VUI should not wait for a command. If someone says
they
have a problem, a natural response for humans is to ask if the
person needs
help, even when not explicitly asked:“From a linguistic
perspective, ‘Could
you help me with my software?’ is a yes-no question. ‘I have a
problem
with my software.’ is not even a question yet. So for ‘I have a
problem’,
bots need to be more proactive and ask a question, ‘Could I help
you with
the software?’” (P18) Hence, a natural VUI should understand the
meaning
behind the statement and take action proactively to help the
users.
Our participants did not measure the efficiency of a VUI strictly by
the total time taken for the task. Instead, they weighed the quality
of the results that users obtain against the number of conversation
turns needed to finish: “...not necessarily as short of a time as
possible.
But, something
that makes sense and it is value-driven for me as a user. So that
means if
I’m engaging in multiple [conversation] turns just to get more, I
guess, more
valuable information on my end, I’m okay with that.” (P12)
Related, a natural VUI should avoid overloading users with excess
infor-
mation, and instead, it should minimize the number of conversation
turns:“...you
do not become a robot who can keep going on and on and on and on
about
all this information...You should not overload the user with a lot
of informa-
tion. You should try to cut down as many decisions for the user as
possible.”
(P13) The number of conversation turns can be minimized by
proactively
“asking them [users] less and less and assuming more...” (P13) To
ask fewer
questions, a natural VUI should actively make decisions based on
contextual
information:“But if the user tells me the zip code correctly, I
don’t ask him for
city and state, I use some libraries to find the name of the city
and state...We
need to have a record of the entire conversation from top to
bottom.” (P13)
Even though minimizing the number of questions is important, if
the
consequence of failing the task is considerable, a natural VUI
should ask the
user to confirm:“...so if I say things like ‘You wanted your
checking account.
Is that correct?’ and I say ‘No, I want my savings account’ then
that to me,
that confirm and correct [strategy] is a very important part in
making it more
conversational.” (P10)
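The confirm-and-correct strategy that P10 describes can be sketched as follows. The set of high-stakes actions, the function names, and the reply handling are all hypothetical; the sketch assumes that an explicit correction from the user ("No, I want my savings account") does not need a second confirmation.

```python
# Hypothetical actions whose failure would be costly enough to confirm first.
HIGH_STAKES_ACTIONS = {"transfer_funds", "close_account"}


def respond(action, account, confirmed=False):
    """Ask for confirmation only when the action is high-stakes."""
    if action in HIGH_STAKES_ACTIONS and not confirmed:
        return f"You wanted your {account} account. Is that correct?"
    return f"Okay, using your {account} account."


def handle_confirmation(action, account, reply, corrected_account=None):
    """Proceed on 'yes', accept an explicit correction, otherwise re-ask."""
    if reply == "yes":
        return respond(action, account, confirmed=True)
    if corrected_account:
        return respond(action, corrected_account, confirmed=True)
    return "Which account would you like to use?"
```

Low-stakes actions skip the confirmation turn entirely, which keeps the number of questions small where a mistake is cheap to undo.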
Present a Task-appropriate Persona
Four participants (P3, P4, P5, P13) said that a natural VUI
application should present a persona appropriate to its task. The
tone of voice should match the application’s purpose to increase user
trust and elicit more useful responses. For example, P5 mentioned
that for financial applications, the voice agent should sound serious
to make the application feel more reliable: “So when you’re creating
these prompts, every company has a different
different
tone of voice...Like do you want the machine to be quirky? Do you
want a
machine to be very serious? If you’re talking about your wealth
management,
you don’t want to have a fun guy. It has to be serious.” (P5)
As another example, P4 designed an application for collecting
elders’
health status. He tried to make the voice agent sound like a real
doctor as
much as possible. This was done to ensure that users take the task
more
seriously and report their status correctly. “...as if someone was
visiting
their doctor and asking the questions...it was better than making
it seems like
you were having a conversation with a friend, because it was kind
of a serious
topic dealing with...people would take it more seriously if they
felt that it was
a natural doctor, something like that.” (P4)
Be Capable to Handle a Wide Range of Topics in the Task
Domain
Nine participants (P2, P5, P6, P7, P9, P10, P11, P16, P17)
mentioned that
a natural VUI should not only be able to respond to the questions
directly
related to its task but also be able to handle a wide range of
topics within
the domain of its task:“I would think it [a natural VUI] would need
to handle
anything that is specific to that institution, right? If I call
Bank of America
and ask about my Bank of America go-card, you know you need to
understand
me.” (P10)
A natural VUI should also be able to handle changes in
conversation
topics as long as the topic belongs within the task domain. “Let’s
say I
want to book a table for three people, and I said [to a waiter] ‘I
want to book
a table.’ and [a waiter asked] ‘For how many people?’ and what if I
say,
‘What do you have outdoor seating?’ I didn’t answer the question. I
didn’t
say like six people or two people, because I need to know this
other piece of
information, but I also didn’t say like ‘How tall is Barack
Obama?’” (P9)
When a user brings up a topic that is beyond the task domain, a
natural VUI should still continue the conversation and remind the user
of the tasks it can help with: “...if a person says ‘I want to order a
pizza’,
order a pizza’,
and your skill [Amazon VUI application] has no idea what that
is...give them
a helpful prompt saying ‘This is the senior housing voice
assistant. I can help
you with finding when the next bus is, or finding when the next
garbage day
is, or this or this.” (P2)
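P2's fallback prompt can be sketched as a handler that answers unsupported requests by restating what the application can do. The intent names and the `handle_intent` function are hypothetical; the wording of the prompt follows P2's quote.

```python
# Hypothetical mapping from supported intents to a spoken description.
SUPPORTED_TASKS = {
    "next_bus": "finding when the next bus is",
    "garbage_day": "finding when the next garbage day is",
}


def handle_intent(intent, assistant_name="the senior housing voice assistant"):
    """Help with a supported task, or remind the user of the task domain."""
    if intent in SUPPORTED_TASKS:
        return f"Sure, I can help with {SUPPORTED_TASKS[intent]}."
    # Out-of-domain request: keep the conversation going with a helpful prompt.
    options = ", or ".join(SUPPORTED_TASKS.values())
    return f"This is {assistant_name}. I can help you with {options}."
```

Because the prompt is built from the same table that defines the supported tasks, it stays accurate when tasks are added or removed.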
To make users aware of the boundaries of the serviceable topic domain,
the designers recommended preemptively providing context that helps
users understand what they can do with the application: “A lot of
people
make a
mistake in the design by saying ‘Welcome to Toyota. How can I help
you?’
And it’s like you’re going to fail right there because that’s so
open-ended.
No one will have an idea of what they can or can’t say. They will
probably
fail. So you have to be really clear...like ‘Welcome to Toyota’s
repair center!
Would you like to schedule an appointment?’” (P9)
However, when the target users are children, designers should expect
them to ask many questions outside of the task domain: “‘What’s
your
favorite color, Alexa?’ and they [children] would like to shift the
conversation
or just like go to a totally different topic.” (P7)
Beyond-human Aspect #1:
Deliver Information With Machine-Like Speed and Accuracy
Our designers mentioned that, to accomplish its tasks efficiently, a
natural VUI should incorporate machine-specific attributes such as
high processing power, and selectively mimic certain parts of human
conversation instead of pursuing every aspect of a natural human
conversation: “This
machine will be able to talk to us as if it was a human of
course...more effi-
cient, of course, more...you know you have a couple of these rules,
but that’s
how I personally define natural.” (P5)
P12 mentioned that a natural VUI should attain the human-level ability
to maintain conversational context while delivering accurate
information very quickly: “So it’s just super-fast processing times,
processing times,
being able to deliver information while maintaining conversational
context.”
(P12)
Our participants described natural human speech as often being
indirect and inefficient, and suggested that these aspects of human
conversation should be left out when designing a natural VUI.
“There are so much more words like in the human version of asking,
but
it sounds human. You know, it’s not as direct or not as efficient,
but there
is a kind of like a personality behind it, I guess?” (P1)
“Oh, no less conversational, because you don’t want...something
that you’re
using every day. You don’t want to have that be chatty and friendly
right?
You want to get your work done. so you know concentrating on being
effi-
cient and giving them the information and exactly the way that they
want
it.” (P10)
Beyond-human Aspect #2:
Maintain User Profiles to Deliver Personalized Services
Human memory is volatile in contrast to machine memory. Designers
mentioned that a natural VUI should store a vast amount of users’
personal information, such as personal histories or family
relationships, and combine this information to provide a customized
experience for users.
“We customize all the knowledge of the user.” (P8)
“We personalize things and make things fit each user. Suppose you
have
an allergy or specific dietary requirements, then we could filter
out all of those
recipes and only suggest you the recipes that fit your needs.”
(P6)
However, designers are, of course, aware that storing a huge amount
of
information comes with concerns about privacy. So the importance of
trans-
parency on what data is stored was highlighted:“You need to be
transparent
about the collected and stored data.” (P8)
4.2 Designers Experience Challenges
In order to inform where and how we should invest our efforts for
future
technological and theoretical advancement, we asked our
participants what
makes designing for a natural VUI the most challenging. We grouped
the
challenges they described and ordered them in a list by the number
of par-
ticipants who mentioned the issue. To contextualize their
challenges, we also
asked them to describe their design practices.
We found that the designers largely follow a user-centered design
process
that includes three phases, namely a user research phase, a
high-level design
phase, and a testing phase each described next. This matches with
the VUI
design process laid out by Google [92]. During the user research
phase, they
conduct user research to determine requirements for their
applications, collect
the user utterances, and create the personas for their
applications. This
phase involves multiple user observations and interviews. In the
high-level
design phase, designers create high-level designs such as sample
dialogues and
flowcharts of dialogues for their applications. Designers
interactively develop
high-level designs through collaborations with other designers.
During this
phase, while writing the dialogues, designers create the audio
outputs of the
dialogues and modify the audio and dialogues more or less in
parallel. In the
testing phase, designers create prototypes and conduct user tests to
validate their high-level designs, and iteratively develop real VUI
applications through multiple rounds of user testing.
Here we present the most prevalent challenges, as reported by our
par-
ticipants (Figure 4.2), in designing a natural VUI. For each
challenge, we
illustrate how it is related to the twelve characteristics of
naturalness de-
scribed in Figure 4.1.
1. Synthesized Voice Fails to Convey Nuances and Emotion
The same sentence can convey different meanings depending on the
way one
narrates the text. Paralanguage, such as intonation, pauses,
volume, and
prosody, are essential components in expressing subtle nuances and
emotions
in speech. During the high-level phase, to make a VUI sound
natural, the
designers want to have control over the way the speech synthesizer
will nar-
rate their dialogues to the users. However, 9 participants (P1, P4,
P5, P7,
P8, P10, P13, P18, P20) reported that current speech synthesis
technology
is lacking the expressivity to interpret the intended meaning of
the dialogue
text and convey it to the user. They felt that even the best speech
synthesizer still sounds like “just a robo-voice” (P4) or like “just
putting the sounds together” (P18) rather than “really meaning it [the
script].” (P18) They think that voice synthesis technology still has
a large gap to bridge, saying “there’s a long way to go for it to
become very expressive.” (P7)
Primary 10 Challenges in Designing a Natural VUI
1. Synthesized voice fails to convey nuances and emotion.
2. SSML is time-consuming to use while not producing the desired
results.
3. Existing VUI guidelines lack concrete and useful recommendations
on how to design for naturalness.
4. Writing for spoken language is difficult.
5. Reconciling between “social” and “transactional” is hard.
6. Conveying messages clearly is difficult due to the limitation of
synthesized voice.
7. Handling various spoken inputs from the users is difficult.
8. Impossible to capture all the possible situations.
9. Difficult to capture the users’ emotions.
10. Difficult to understand users’ perceived naturalness.
Figure 4.2: The 10 challenges that designers are currently
encountering in designing natural VUIs
Hiring voice actors who can narrate the script in a natural tone
and flow of
a “real voice” was reported by many participants (P3, P6, P7, P8,
P9, P13)
as a common solution to make a VUI sound natural. However,
recording
audio is considered to be significantly limited in flexibility and
scalability.
When there is a need to change the narration, editing audio of a
recorded
speech is more laborious than re-synthesizing audio from an edited
text:“...if
we discover during research there are more words, then we have to
hire that
actor again to speak those words again. So it was not practical at
all.” (P8)
Moreover, using pre-baked recordings was not a scalable solution to
gen-
erate spoken utterances of modern VUIs that are required to handle
a wide
variety of data and conversation context:“Of course, when you are
using a
real voice actor, it’s impossible because I should record every
single street
name that we have in Brazil.” (P3)
This challenge is more severe for non-English languages. “Also the
way
she [Amazon Alexa] is speaking for us in German it sounds very
ironic and
sarcastic, her tone of the voice.” (P18) “[The] Dutch language is
not so de-
veloped yet for Google...English words, sometimes Dutch people use
English
words, but if Google is [speaking] in Dutch, then it’s sometimes
complicated
[not correct].” (P8)
Due to the limited expressivity of the synthesized voices, for the
applica-
tions that need to convey human-like emotions, our designers
reported that
they often had no other choice than using their own voices to
record the au-
dios. For example, P1 found that the currently available
synthesized voices
were not good enough to express nuanced emotions that he desired to
express
for his storytelling application:“...there are some subtleties that
I couldn’t get
Alexa to feel nostalgic about, you know, there is no command like
nostalgia
about the house party that you first met this guy that you are
still in love with
at, you know?” (P1)
Interestingly, for applications that do not need to express emotions,
the importance of an expressive voice was not highlighted, and an
expressive voice was even discouraged. Being humorous is often
considered a relatively human-specific behaviour. P1 said that using
Alexa’s “robotic” voice to tell a joke creates irony and makes the
situation funnier. Hence, he used Alexa’s voice for his VUI
application in which Alexa asks about the users’ feces every day:
“Alexa asks a question that only a human would ask and
it’s like
a very human written skill [Amazon VUI application] that sounds
like very
robotic and because you’re talking about poop because it’s like a
joke...There’s
something I think funnier about it.” (P1)
Relation with the characteristics of naturalness: The difficulty in
expressing nuanced emotions inhibits the designers from achieving the
social characteristics of VUIs listed in Figure 4.1. P8 found that
the synthesized voice used by IBM Watson was not able to produce
natural laughing sounds and posed a risk of misrepresenting itself
with unintended negative social expressions: “The robot can not laugh,
because if the robot laughs and you just say, ‘ha, ha, ha’ it sounds
sarcastic...older adults actually feels like Alice [the robot] is
laughing at them. So that’s bad.” (P8)
2. SSML is Time-Consuming to Use While Not Producing the
Desired Results
Designers can emphasize parts of dialogues or modify prosodic features
of the audio, such as pitch, volume, and speed, by using Speech
Synthesis Markup Language (SSML) during the high-level design phase.
Tech giants such as Google, Amazon and IBM continuously develop and
support their own sets of SSML tags. However, many of our
participants (P2, P3, P7, P9, P10, P11, P13, P15, P16) pointed out
that writing and editing SSML tags is “time digging” (P13), while it
frequently fails to yield the desired result: “I haven’t
been very successful at doing this with SSML. It makes very minor
changes,
but it doesn’t come close to what it would be if you use a voice
actor, for
example.” (P7)
It was difficult to make the whole sentence flow naturally, and the result
felt “still too mechanical” (P5), even after fine-tuning speech timings by
meticulously entering numerical values (e.g., 0.5 seconds of whispering): “I
think it’s not
very natural, like another 0.5-second break here, another somewhat
slower
here, all those things.” (P15)
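To illustrate the kind of manual fine-tuning the participants described, the following is a minimal Python sketch that assembles an SSML prompt from the standard `<speak>`, `<break>`, and `<prosody>` elements. The helper names (`with_break`, `with_prosody`, `build_prompt`) and the prompt wording are our own assumptions for illustration and were not taken from any participant's project.

```python
from xml.sax.saxutils import escape


def with_break(seconds: float) -> str:
    """Return an SSML <break> element for a pause of the given length."""
    return f'<break time="{int(round(seconds * 1000))}ms"/>'


def with_prosody(text: str, rate: str = "medium", pitch: str = "medium",
                 volume: str = "medium") -> str:
    """Wrap text in an SSML <prosody> element; each attribute is set separately."""
    return (f'<prosody rate="{rate}" pitch="{pitch}" volume="{volume}">'
            f'{escape(text)}</prosody>')


def build_prompt() -> str:
    """Assemble one hypothetical prompt with hand-entered timing values."""
    parts = [
        escape("Welcome back."),
        with_break(0.5),                       # a 0.5-second pause
        with_prosody("Take your time.", rate="slow"),
        with_break(0.5),                       # another 0.5-second pause
        escape("Which one would you like?"),
    ]
    return "<speak>" + "".join(parts) + "</speak>"


if __name__ == "__main__":
    print(build_prompt())
```

Every pause and every rate change in this sketch is a separate, hand-entered value, which is consistent with the participants' reports that each adjustment requires another edit and another listening pass.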
The poor design of SSML authoring interfaces, which resulted in
SSML
requiring too much time to use, was another point of consternation.
Most of
our designers were using a simple text editor or generic XML
mark-up tools
for writing SSML tags. Hence, they had to (re)write and (re)listen
to the
whole sentence or paragraph even when only making a small change to
their
dialogues: “I mean even just in the best-case scenario, let’s say
you listen to
a prompt, you decided that you wanted to change one thing by using
SSML.
You change that thing. You listen to it again. That’s your
best-case scenario,
and right there you just spent, I don’t know, couple minutes maybe,
and if
you have a hundred prompts to do, it’s just not worth it for the
small benefit
you’ll get.” (P9) Also, it was hard to evaluate when the SSML tag
reached
the optimal level of expressiveness. Hence, our designers often
spent a lot of
time iteratively modifying SSML tags without knowing when to
stop:“Hard
to stop, like, I’m not satisfied with what I got there, so I just
keep on changing
something here and there.” (P15)
Our designers had a hard time using SSML when trying to imbue a
narration with emotional expressivity. Expressing a subjective
experience of emotion requires holistic control of all prosodic features at
the same time. However, SSML only offers separate control of one prosodic
element at a time: “The technology should be mature a little bit for
us to have
an SSML tag that is empathetic. It has a very subjective nature.
The thing
is that it’s not very objective. The objective things are being
loud, silent...but
this is completely different.” (P13)
Due to these limits of the current SSML, most of the participants had
abandoned SSML except for making simple changes to the audio, such
as inserting breaks, slowing the speed, and correcting the pronunciation
of mispronounced words. “In the SSML, I spent some time a while
ago, like
there is a prosody tag like that. I just tried and tried and,
usually it just didn’t
do what I wanted it to do.” (P9) “The prosody tag can be really
difficult to
deal with right?...I haven’t played with SSML for a couple [of]
years.” (P10)
“It’s been a while since I played with SSML. Well...[the SSML tag
for putting
a] break I use it all the time still.” (P7)
If the application targets multiple platforms (e.g., Google Home and
Amazon Alexa), using SSML takes even more time. Even for the same tag,
the resulting voices can differ across platforms, and the same set of SSML
tags might not be available on every platform. Designers therefore have to
test their SSML tags on each platform: “Different speech synthesizers are going
are going
to have different packages, so I want to be able to play with the
SSML before
I decide on how this is going to work.” (P10)
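As a rough sketch of this per-platform checking, the Python snippet below lists which tags in a prompt are missing from a given platform's supported set. The platform names and the `SUPPORTED_TAGS` table are hypothetical placeholders, not the actual tag sets of any vendor, and the function name `unsupported_tags` is our own.

```python
import xml.etree.ElementTree as ET

# Hypothetical support table: platform names and tag sets are illustrative
# assumptions, not the documented capabilities of any real platform.
SUPPORTED_TAGS = {
    "platform_a": {"speak", "break", "prosody", "emphasis"},
    "platform_b": {"speak", "break", "prosody"},
}

# Hypothetical prompt used only for this example.
PROMPT = ('<speak>Your order is <emphasis level="strong">ready</emphasis>.'
          '<break time="500ms"/>Anything else?</speak>')


def unsupported_tags(ssml: str, platform: str) -> set:
    """Return the tags used in the SSML string that the platform lacks."""
    root = ET.fromstring(ssml)
    used = {element.tag for element in root.iter()}  # includes the root tag
    return used - SUPPORTED_TAGS[platform]


if __name__ == "__main__":
    for name in SUPPORTED_TAGS:
        print(name, sorted(unsupported_tags(PROMPT, name)))
```

A check of this kind only reveals missing tags. It cannot reveal that the same tag sounds different on two synthesizers, which still requires listening to the output on each platform.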
Relation with the characteristics of naturalness: Being able
to
produce an empathetic voice is essential for having positive social
interactions
with users, as stated in Figure 4.1. Even though SSML is the only
available way to modify synthesized voices, our designers reported that
its limitations in creating a truly empathetic voice and the significant
amount of time it requires are obstacles to achieving the social
characteristics.
3. Existing VUI Guidelines Lack Concrete and Useful Recommen-
dations on How to Design for Naturalness
In terms of writing VUI dialogues during the high-level design
phase, 7 participants (P2, P5, P6, P10, P11, P19, and P20) mentioned 3 problems
with
with
the existing VUI guidelines.
First, they found the existing VUI guidelines easy to dismiss as
cliche.
More specifically, our designers mentioned that some of the
existing guide-
lines are “somewhat common sense in terms of avoiding using
technical lan-
guage, try and make it casual and simple”, and easy to let go:“I
feel like
it’s kind of obvious and you know that when you’re creating
something like a
voice skill...I probably read it once, and I just left it.”
(P6)
Secondly, our designers found that the existing design guidelines do not
apply to certain VUIs, depending on the context of the project: “At
the same
the same
time, I think every company will have its own set of these [Design
guide-
lines]...I mean, some apps are made to comfort people and make them
feel
less alone, and those guidelines are completely irrelevant, so it
does depend
on the context.” (P5)
Lastly, they pointed out that the guidelines are lacking useful
linguistic
insights:“I feel like linguistic principles are a little more
difficult to come
by for voice interaction designers.” (P2) Linguists have been
discovering
hidden patterns of natural human conversations that people use
without
realizing it. For creating VUIs that enable natural conversations
with users,
VUI designers should know how to use these linguistic insights on
natural
conversations as well as how to incorporate these insights with
their design
strategies for better user experiences. However, our designers
reported that
there is a “disconnection” between the linguistic insights and
their design
strategies.
“So there’s a lot you can do with language, with pragmatics [a
sub-field
of linguistics] and this is not the rocket science. I mean, we have
a lot of
research about pragmatics already, since the 70s, since the 80s.
Well, they’re
explored in research well enough, but they’re not connected well
enough with
the IT department.” (P18)
The designers who found the existing design guidelines useful reported
that whenever they design VUI applications, they need to make an
intentional effort to return to the guidelines to refresh their memory of
them: “So for example, I need to work on confirmations. Let me go to
refresh
my memory on how to do confirmation style...I don’t have to like
constantly
go back to them [the design guidelines], but I certainly do go back
in and look
[at them]” (P9), or they did not apply them because of time
constraints: “So there
are principles that I work with, but that’s just going to have to
come to me
at the moment. I’m not going to go back.” (P11)
Relation with the characteristics of naturalness: This
challenge
hinders designers from creating a VUI that provides a great user
experience
while achieving the fundamental characteristic of a natural VUI,
‘Sound like
a human is speaking’. Our designers requested more specific design
guidelines that provide actionable advice on how to connect these
linguistic components with their design intentions.
“I have no idea on how to use this construct behind these phrases,
and
how to break it down to use in my favorite design process...The
contrac-
tion, phonetics, and ‘Umm’...I think understanding how meaning can
change
depending on how things are structured and organized in the
dialogue” [is
important]. (P20)
When the participants write dialogues during the high-level design
phase, they often find it hard to write for spoken language. When people
talk, they adopt spoken language characteristics such as filler words,
personal pronouns and colloquial words [53, 54, 55]. However, they do
so subconsciously. Hence, when our designers are writing,
they often forget to incorporate these characteristics.
“What I want to say is when you just started [as a VUI designer],
it’s
difficult because you have to really understand that the text
you’re writing
is not something to be read. It’s something to be spoken. It’s for
a spoken
interface, so you have to really try to imagine yourself in that
situation and,
like I said, just write like you are writing a screenplay or
something like that.”
(P3)
Designers write the dialogues by typing on a keyboard instead of
speaking them, and our designers reported that it is often hard to detect
unnaturalness
unnaturalness
of the dialogue just by reading it:“Challenges? So the platform
limitation is
definitely a challenge. A lot of times, the conversation sounds
good on paper,
but you really have to just say it.” (P12)
Since we subconsciously use the characteristics of spoken language
when
conversing with others, P10 mentioned that the unnaturalness caused
by
writing can be hard to detect, and many people treat this problem as
insignificant and hence put no effort into fixing it: “I
think the hard thing about these kinds of interfaces is that people
feel like
because they can talk, because they speak English, so they can
write one of
these interfaces.” (P10)
From our survey on the existing VUI design guidelines, we found
that mul-
tiple guidelines already exist for supporting designers in writing
for spoken
dialogues [12, 13]. These guidelines tell VUI designers about what
character-
istics of the spoken language they should incorporate in their VUI
dialogues.
Since people use these characteristics unconsciously, if designers
do not re-
turn to guidelines and verify their dialogues manually, they often
forget to
apply the rules from design guidelines when they are writing their
dialogues.
For example, even though there is a design guideline asking
designers to
avoid putting too much information in one line [93], designers
often make
mistakes:“The frequent mistake is people are giving a big amount of
the texts
and yeah just read it.” (P18)
This makes designing a natural VUI challenging. However, P3, a
profes-
sional VUI designer, mentioned that this could be overcome as
designers gain
more experience:“At first, it is difficult, because, like I said,
normally when
you’re writing, you’re writing for someone to read it, not to speak
it...[It]
took something like maybe three months or so to really get used to
this way
of writing, the practice of writing dialogue.” (P3)
Relation with the characteristics of naturalness: This
difficulty
hinders designers from creating VUI dialogues that sound like
spoken dialogues, and it thus runs counter to the fundamental
characteristic of a natural
VUI, ‘Sound like a human is speaking’.
5. Reconciling Between “Social” and “Transactional” Is Hard
Our designers tried to embrace the characteristics of social conversations
to provide more realistic and interactive conversation experiences to their
users. However, since most of them were designing task-oriented
applications, they found that the goals of task-oriented applications often
conflict with the desire to embrace social conversation characteristics.
Five designers (P5, P11, P15, P17, P20) mentioned their desire to
add
more components of social conversations to their VUI designs to
make them
more realistic. However, in an attempt to do so, they found that
the dialogue
gets longer and it conflicts with the task-oriented goal to
complete the tasks
efficiently:“So obviously I wanted to write the dialogues that felt
[like a] human
[is speaking and] didn’t feel robotic, but I soon realized that
things are more
complicated. The more you want to add personality to things, then
the longer
becomes your dialogue.” (P20)
So, our designers felt that the two categories of the naturalness
character-
istics often don’t get along with one another:“...efficient, but it
has to come
up as like friendly [and] conversational.” (P5) “Challenges, I told
you earlier,
challenges are keeping it simple, yet interactive. It should sound
familiar. It
should sound friendly. It should not go out of the voice, so like
that.” (P17)
Relation with the characteristics of naturalness: This often
con-
fused our designers when they were writing the dialogues and often
made
them give up incorporating social characteristics. “But again we’re
still
thinking, we’re still in the process of, ‘Should we actually put in
those lit-
tle sentences [for having social interactions] or not?’” (P6) “I
would prefer,
right now, to focus more on helping people achieve their goals and
move on
with their lives more than a kind of having these artificial
entities talking to
6. Conveying Messages Clearly Is Difficult Due to the
Limitation
of Synthesized Voice
Our designers (P4, P8, P9, P12, P14) are facing a challenge in
conveying
precise meanings through synthesized voices. This is because
synthesized
voices often mispronounce certain types of words and sentences, and
are not
able to produce proper intonations and tones.
Our designers elaborated on three specific problematic cases of
mispronunciation. Synthesized voices often fail to: (1) put proper breaks
when pronouncing long sentences: “...some words just kind of got mashed
together, like
into one whole long word, and it [the long sentence] sounded weird.”
(P4), (2)
pronounce contractions clearly: “That’s what it’s supposed to say,
‘What’ll
‘What’ll
it be?’, and then when it [Amazon Alexa] actually says it, it will
be like,
‘Whatill it be? ’” (P12), and (3) pronounce proper nouns such as
names of
cryptocurrency and people. “...she can’t pronounce ethereum [a
cryptocur-
rency] and mmm that’s a popular one.” (P14), “Names of people,
Google says
it differently.” (P8)
Our participants also elaborated on two problematic cases of inadequate
tones and intonations: (1) sentences ending with a question mark, and
(2) non-lexical words. P9 mentioned the difficulty of producing natural
sounds for
sounds for
interrogative sentences ending with a question mark. “It’s so hard
sometimes
to get the text to speech to ask a question in a way that makes
sound natu-
ral...so I changed the question mark to a period, and then it said
‘Which one
would you like?’, which is more how a person would actually say it,
because a
lot of times when we ask a question, we’re not doing a rising
intonation.” P8
reported that synthesized voices do not produce proper tones for
non-lexical
words (i.e., words that do not have a defined meaning) such as laughter.
Her
design intention was to make her robot laugh with a happy tone, but
it had
a sarcastic tone instead:“The robot can not laugh, because if the
robot laughs,
Relation with the characteristics of naturalness: The
limitations
of speech synthesis systems (i.e., mispronouncing words and
sentences and
not being able to produce proper intonations) hinder designers not
only from
attaining the fundamental characteristics of a natural VUI (‘Sound
like a
human is speaking’ and ‘Use appropriate prosody and intonation’)
but also
from achieving a social characteristic of it (‘Express sympathy and
empa-
thy’). This is because non-lexical words are essential in human
conversations
for social interactions. People feel mutual understanding and
compassion
towards each other when they laugh together. Hence, when a VUI is not
able to produce natural laughing sounds, it is more challenging to provide
harmonious interactions with users.
7. Handling Various Spoken Inputs From the Users Is Difficult
Four participants (P2, P6, P9, P11) mentioned that the current NLU
engine
is still not good enough to understand various expressions of our
language,
and this requires VUI designers to provide possible synonyms and
expressions
when they are writing dialogues for training the NLU engine:“Yeah,
I still
need to interview them [the VUI users]. I will, just because I
don’t think
the natural language engine is good enough to be able to figure out
all the
different ways you can ask for a bus.”(P2) However, they often
found that
the collected synonyms and expressions do not cover all the
possible ones:“I
don’t necessarily know what the users are going to say back. So
I’ll make
up something that you might say but that’s certainly not the same
as a user
who’s never used it [the VUI application] before.”(P9)
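The coverage gap described above can be sketched with a small Python example. It deliberately uses exact matching after normalization, which is a simplifying assumption: a real NLU engine generalizes beyond its training phrases. The intent name `GetNextBus`, the phrases, and the function names are all hypothetical.

```python
import re
from typing import List, Optional

# Hypothetical training phrases a designer might write for one intent.
TRAINING_PHRASES = {
    "GetNextBus": [
        "when is the next bus",
        "next bus time",
        "when does the bus come",
    ],
}


def normalize(utterance: str) -> str:
    """Lower-case, drop punctuation, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9\s]", "", utterance.lower())
    return " ".join(cleaned.split())


# Reverse lookup from normalized phrase to intent name.
PHRASE_TO_INTENT = {
    normalize(phrase): intent
    for intent, phrases in TRAINING_PHRASES.items()
    for phrase in phrases
}


def match_intent(utterance: str) -> Optional[str]:
    """Return the intent whose training phrase matches exactly, else None."""
    return PHRASE_TO_INTENT.get(normalize(utterance))


def uncovered(utterances: List[str]) -> List[str]:
    """Return the utterances that no designer-written phrase covers."""
    return [u for u in utterances if match_intent(u) is None]


if __name__ == "__main__":
    print(uncovered(["Next bus time", "is the 99 coming soon"]))
```

In this sketch, any phrasing the designer did not anticipate falls through unmatched, which mirrors the designers' concern that phrases they make up themselves differ from what first-time users actually say.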
Designers pointed out that understanding human vocabularies can
be
especially challenging due to their personal aspect. The same word
can be used
can be used
to express different meanings based on the conversational context
(e.g., age
and individual differences). Hence, designers mentioned that user
utterances
should be understood in the personalized context:“You need to
personalize
the VUI to a user...You can not talk with Alexa if you’re older
because Alexa
won’t understand you.”