Transcript
  • End-to-end conversational agents: what's missing?

    Marco Baroni, Facebook Artificial Intelligence Research

    Let's Discuss Workshop at NIPS 2016

  • With many thanks to...

    • Angeliki Lazaridou, Alex Peysakhovich • Gemma Boleda • Jason Weston, Marc'Aurelio Ranzato, Sumit Chopra, Antoine Bordes • Tomas Mikolov, Armand Joulin, Germán Kruszewski • Raquel Fernandez • Aurelie Herbelot • ...

  • Why is conversation so easy? Simon Garrod¹ and Martin J. Pickering²

    ¹University of Glasgow, Department of Psychology, Glasgow, UK, G12 8QT; ²University of Edinburgh, Department of Psychology, Edinburgh, UK, EH8 9JZ

    Traditional accounts of language processing suggest that monologue – presenting and listening to speeches – should be more straightforward than dialogue – holding a conversation. This is clearly not the case. We argue that conversation is easy because of an interactive processing mechanism that leads to the alignment of linguistic representations between partners. Interactive alignment occurs via automatic alignment channels that are functionally similar to the automatic links between perception and behaviour (the so-called perception–behaviour expressway) proposed in recent accounts of social interaction. We conclude that humans are 'designed' for dialogue rather than monologue.

    Whereas many people find it difficult to present a speech or even listen to one, we are all very good at talking to each other. This might seem a rather obvious and banal observation, but from a cognitive point of view the apparent ease of conversation is paradoxical. The range and complexity of the information that is required in monologue (preparing and listening to speeches) is much less than is required in dialogue (holding a conversation). In this article we suggest that dialogue processing is easy because it takes advantage of a processing mechanism that we call 'interactive alignment'. We argue that interactive alignment is automatic and reflects the fact that humans are designed for dialogue rather than monologue. We show how research in social cognition points to other similar automatic alignment mechanisms.

    Problems posed by dialogue
    There are several reasons why language processing should be difficult in dialogue. Take speaking. First, there is the problem that conversational utterances tend to be elliptical and fragmentary. Assuming, as most accounts of language processing do, that complete utterances are 'basic' (because all information is included in them), then ellipsis should present difficulty. Second, there is the problem of opportunistic planning. Because you cannot predict how the conversation will unfold (your addressee might suddenly ask you an unexpected question that you have to answer), you cannot plan what you are going to say far in advance. Instead, you have to do it on the spot. Third, there is the problem of making what you say appropriate to the addressee. The appropriateness of referring to someone as 'my next-door neighbour Bill', Bill, or just him depends on how much information you share with your addressee at that point in the conversation. Does she know who Bill might be? Does she know more than one Bill? Is it obvious to both of you that there is only one male person who is relevant? Similarly, in listening, you have to guess the missing information in elliptical and fragmentary utterances, and also have to make sure that you interpret what the speaker says in the way he intends.

    If this were not enough, conversation presents a whole range of interface problems. These include deciding when it is socially appropriate to speak, being ready to come in at just the right moment (on average you start speaking about 0.5 s before your partner finishes [1]), planning what you are going to say while still listening to your partner, and, in multi-party conversations, deciding who to address. To do this, you have to keep task-switching (one moment speaking, the next moment listening). Yet, we know that in general multi-tasking and task switching are really challenging [2]. Try writing a letter while listening to someone talking to you!

    So why is conversation easy?
    Part of the explanation is that conversation is a joint activity [3]. Interlocutors (conversational partners) work together to establish a joint understanding of what they are talking about. Clearly, having a common goal goes some way towards solving the problem of opportunistic planning, because it makes your partner's contributions more predictable (see Box 1). However, having a common goal does not in itself solve many of the problems of speaking and listening alluded to above. For instance, it does not ensure that your contributions will be appropriate for your addressee, alleviate the problems of dealing with fragmentary and elliptical utterances, or prevent interface problems.

    One aspect of joint action that is important concerns what we call 'alignment'. To come to a common understanding, interlocutors need to align their situation models, which are multi-dimensional representations containing information about space, time, causality, intentionality and currently relevant individuals [4–6]. The success of conversations depends considerably on the extent to which the interlocutors represent the same elements within their situation models (e.g. they should refer to the same individual when using the same name). Notice that even if interlocutors are arguing with each other or are lying, they have to understand each other, so presumably alignment is not limited to cases where interlocutors are in agreement.

    But how do interlocutors achieve alignment of situation models? We argue that they do not do this by explicit negotiation. Nor do they model and dynamically update every aspect of their interlocutors' mental states. Instead, ...
    Corresponding author: Simon Garrod ([email protected]).


  • The conversational agent pipeline


    POMDP-based Statistical Spoken Dialogue Systems: a Review

    Steve Young, Fellow, IEEE, Milica Gašić, Member, IEEE, Blaise Thomson, Member, IEEE, and Jason D Williams, Member, IEEE

    (Invited Paper)

    Abstract—Statistical dialogue systems are motivated by the need for a data-driven framework that reduces the cost of laboriously hand-crafting complex dialogue managers and that provides robustness against the errors created by speech recognisers operating in noisy environments. By including an explicit Bayesian model of uncertainty and by optimising the policy via a reward-driven process, partially observable Markov decision processes (POMDPs) provide such a framework. However, exact model representation and optimisation is computationally intractable. Hence, the practical application of POMDP-based systems requires efficient algorithms and carefully constructed approximations. This review article provides an overview of the current state of the art in the development of POMDP-based spoken dialogue systems.

    Index Terms—Spoken dialogue systems, POMDP, reinforcement learning, belief monitoring, policy optimisation.

    I. INTRODUCTION

    SPOKEN dialogue systems (SDS) allow users to interact with a wide variety of information systems using speech as the primary, and often the only, communication medium [1], [2], [3]. Traditionally, SDS have been mostly deployed in call centre applications where the system can reduce the need for a human operator and thereby reduce costs. More recently, the use of speech interfaces in mobile phones has become common with developments such as Apple's "Siri" and Nuance's "Dragon Go!" demonstrating the value of integrating natural, conversational speech interactions into mobile products, applications, and services.

    The principal elements of a conventional SDS are shown in Fig. 1.¹ At each turn t, a spoken language understanding (SLU) component converts each spoken input into an abstract semantic representation called a user dialogue act u_t. The system updates its internal state s_t and determines the next system act via a decision rule a_t = π(s_t), also known as a policy. The system act a_t is then converted back into speech via a natural language generation (NLG) component. The state s_t consists of the variables needed to track the progress of the dialogue and the attribute values (often called slots) that determine the user's requirements. In conventional systems, the policy is usually defined by a flow chart with nodes representing states and actions, and arcs representing user inputs [5], [6].

    S. Young, M. Gašić and B. Thomson are with the Dept of Engineering, Trumpington Street, Cambridge University, Cambridge, UK, e-mail: {sjy, mg436, brmt2}@eng.cam.ac.uk

    JD Williams is with Microsoft Research, Redmond, WA, USA, e-mail: [email protected]

    ¹Multimodal dialogue is beyond the scope of this paper, but it should be noted that the POMDP framework can be extended to handle multimodal input and output [4]. Depending on the application, both the input and output may include a variety of modalities including gestures, visual displays, haptic feedback, etc. Of course, this could result in larger state spaces, and synchronisation issues would need to be addressed.

    Despite steady progress over the last few decades in speech recognition technology, the process of converting conversational speech into words still incurs word error rates in the range 15% to 30% in many real world operating environments such as in public spaces and in motor cars [7], [8]. Systems which interpret and respond to spoken commands must therefore implement dialogue strategies that account for the unreliability of the input and provide error checking and recovery mechanisms. As a consequence, conventional deterministic flowchart-based systems are expensive to build and often fragile in operation.

    [Fig. 1 block diagram: input speech → Spoken Language Understanding (SLU) → u_t → State Estimator → s_t → Policy → a_t → Natural Language Generation (NLG) → system response; the State Estimator and Policy together form the Dialogue Manager.]

    Fig. 1. Components of a finite state-based spoken dialogue system. At each turn the input speech is converted to an abstract representation of the user's intent u_t, the dialogue state s_t is updated and a deterministic decision rule called a policy maps the state into an action a_t in response.
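
    To make the turn structure above concrete, here is a minimal Python sketch of one turn of such a conventional pipeline: a toy SLU maps the utterance to a user act u_t, a hand-crafted state update and flow-chart-style policy choose a_t = π(s_t), and template NLG renders the reply. All function names and rules are invented for illustration; this is not code from Young et al. 2013.

    # Hypothetical sketch of one turn of a conventional rule-based SDS pipeline.
    def slu(utterance: str) -> dict:
        """Toy spoken language understanding: text -> abstract user act u_t."""
        if "indian" in utterance.lower():
            return {"act": "inform", "slot": "food", "value": "indian"}
        if "book" in utterance.lower():
            return {"act": "request_booking"}
        return {"act": "unknown"}

    def update_state(state: dict, user_act: dict) -> dict:
        """Deterministic state tracker: fill slots mentioned by the user."""
        if user_act["act"] == "inform":
            state[user_act["slot"]] = user_act["value"]
        return state

    def policy(state: dict) -> dict:
        """Flow-chart-style policy: a_t = pi(s_t)."""
        if "food" not in state:
            return {"act": "request", "slot": "food"}
        return {"act": "offer", "restaurant": "Shahi Tandoor", "food": state["food"]}

    def nlg(system_act: dict) -> str:
        """Template-based natural language generation."""
        if system_act["act"] == "request":
            return f"What kind of {system_act['slot']}?"
        return f"How about {system_act['restaurant']}? They serve {system_act['food']} food."

    # Two dialogue turns: u_t -> s_t -> a_t -> system response
    state = {}
    state = update_state(state, slu("I'd like to book a restaurant for tonight"))
    print(nlg(policy(state)))          # -> "What kind of food?"
    state = update_state(state, slu("Indian, perhaps?"))
    print(nlg(policy(state)))          # -> offers a restaurant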

    During the last few years, a new approach to dialogue management has emerged based on the mathematical framework of partially observable Markov decision processes (POMDPs²) [9], [10], [11]. This approach assumes that dialogue evolves as a Markov process, i.e., starting in some initial state s_0, each subsequent state is modelled by a transition probability: p(s_t | s_{t-1}, a_{t-1}). The state s_t is not directly observable, reflecting the uncertainty in the interpretation of user utterances; instead, at each turn, the system regards the output of the SLU as a noisy observation o_t of the user input with probability p(o_t | s_t) (see Fig 2). The transition and observation probability functions are represented by a suitable stochastic model, called here the dialogue model M. The decision as to which action to take at each turn is determined by a second stochastic

    ²Pronounced "pom dee pees".

    Young et al. 2013
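
    Because s_t is hidden, a POMDP dialogue manager maintains a belief b(s) and updates it each turn as b'(s') ∝ p(o | s') Σ_s p(s' | s, a) b(s). A tiny numeric sketch of that update in Python (the two-state model and all probabilities below are invented for illustration, not taken from the review):

    # Toy belief update for a two-state POMDP dialogue manager:
    # b'(s') ∝ p(o | s') * sum_s p(s' | s, a) * b(s)
    states = ["wants_food", "wants_phone"]
    belief = {"wants_food": 0.5, "wants_phone": 0.5}       # initial belief b(s)

    # Transition model p(s' | s, a) for the last system action a (held fixed here).
    transition = {
        "wants_food":  {"wants_food": 0.9, "wants_phone": 0.1},
        "wants_phone": {"wants_food": 0.2, "wants_phone": 0.8},
    }

    # Observation model p(o | s'): likelihood of the noisy SLU output in each state.
    obs_likelihood = {"wants_food": 0.3, "wants_phone": 0.7}

    # Belief update followed by normalisation.
    new_belief = {}
    for s_next in states:
        predicted = sum(transition[s][s_next] * belief[s] for s in states)
        new_belief[s_next] = obs_likelihood[s_next] * predicted
    norm = sum(new_belief.values())
    new_belief = {s: p / norm for s, p in new_belief.items()}

    print(new_belief)   # belief mass shifts toward "wants_phone"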

  • From pipelines to end-to-end machine translation

    Anthes 2010; Sutskever et al. 2014

    sequence of words representing the answer. It is therefore clear that a domain-independent method that learns to map sequences to sequences would be useful.

    Sequences pose a challenge for DNNs because they require that the dimensionality of the inputs and outputs is known and fixed. In this paper, we show that a straightforward application of the Long Short-Term Memory (LSTM) architecture [16] can solve general sequence to sequence problems. The idea is to use one LSTM to read the input sequence, one timestep at a time, to obtain a large fixed-dimensional vector representation, and then to use another LSTM to extract the output sequence from that vector (fig. 1). The second LSTM is essentially a recurrent neural network language model [28, 23, 30] except that it is conditioned on the input sequence. The LSTM's ability to successfully learn on data with long range temporal dependencies makes it a natural choice for this application due to the considerable time lag between the inputs and their corresponding outputs (fig. 1).
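
    The encoder-decoder recipe just described can be sketched in a few lines of PyTorch. This is an illustration of the general idea, not the paper's implementation; the vocabulary size, dimensions and toy batch are invented:

    # Minimal two-LSTM encoder-decoder sketch (illustration only).
    import torch
    import torch.nn as nn

    VOCAB, EMB, HID = 1000, 64, 128

    class Seq2Seq(nn.Module):
        def __init__(self):
            super().__init__()
            self.embed = nn.Embedding(VOCAB, EMB)
            self.encoder = nn.LSTM(EMB, HID, batch_first=True)   # reads the source
            self.decoder = nn.LSTM(EMB, HID, batch_first=True)   # conditioned language model
            self.out = nn.Linear(HID, VOCAB)

        def forward(self, src_ids, tgt_ids):
            # Encode the (possibly reversed) source into a fixed-size state.
            _, state = self.encoder(self.embed(src_ids))
            # Decode the target one step at a time, initialised with that state.
            dec_hidden, _ = self.decoder(self.embed(tgt_ids), state)
            return self.out(dec_hidden)          # logits over the vocabulary

    model = Seq2Seq()
    src = torch.randint(0, VOCAB, (2, 5))        # toy batch: 2 sources of length 5
    tgt = torch.randint(0, VOCAB, (2, 6))        # 2 targets of length 6
    print(model(src, tgt).shape)                 # torch.Size([2, 6, 1000])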

    There have been a number of related attempts to address the general sequence to sequence learning problem with neural networks. Our approach is closely related to Kalchbrenner and Blunsom [18] who were the first to map the entire input sentence to a vector, and is related to Cho et al. [5] although the latter was used only for rescoring hypotheses produced by a phrase-based system. Graves [10] introduced a novel differentiable attention mechanism that allows neural networks to focus on different parts of their input, and an elegant variant of this idea was successfully applied to machine translation by Bahdanau et al. [2]. The Connectionist Sequence Classification is another popular technique for mapping sequences to sequences with neural networks, but it assumes a monotonic alignment between the inputs and the outputs [11].

    Figure 1: Our model reads an input sentence "ABC" and produces "WXYZ" as the output sentence. The model stops making predictions after outputting the end-of-sentence token. Note that the LSTM reads the input sentence in reverse, because doing so introduces many short term dependencies in the data that make the optimization problem much easier.

    The main result of this work is the following. On the WMT'14 English to French translation task, we obtained a BLEU score of 34.81 by directly extracting translations from an ensemble of 5 deep LSTMs (with 384M parameters and 8,000 dimensional state each) using a simple left-to-right beam-search decoder. This is by far the best result achieved by direct translation with large neural networks. For comparison, the BLEU score of an SMT baseline on this dataset is 33.30 [29]. The 34.81 BLEU score was achieved by an LSTM with a vocabulary of 80k words, so the score was penalized whenever the reference translation contained a word not covered by these 80k. This result shows that a relatively unoptimized small-vocabulary neural network architecture which has much room for improvement outperforms a phrase-based SMT system.

    Finally, we used the LSTM to rescore the publicly available 1000-best lists of the SMT baseline on the same task [29]. By doing so, we obtained a BLEU score of 36.5, which improves the baseline by 3.2 BLEU points and is close to the previous best published result on this task (which is 37.0 [9]).

    Surprisingly, the LSTM did not suffer on very long sentences, despite the recent experience of other researchers with related architectures [26]. We were able to do well on long sentences because we reversed the order of words in the source sentence but not the target sentences in the training and test set. By doing so, we introduced many short term dependencies that made the optimization problem much simpler (see sec. 2 and 3.3). As a result, SGD could learn LSTMs that had no trouble with long sentences. The simple trick of reversing the words in the source sentence is one of the key technical contributions of this work.

    A useful property of the LSTM is that it learns to map an input sentence of variable length into a fixed-dimensional vector representation. Given that translations tend to be paraphrases of the source sentences, the translation objective encourages the LSTM to find sentence representations that capture their meaning, as sentences with similar meanings are close to each other while different


  • Training end-to-end machine translation

    Io sono il dottore di cui in questa novella si parla talvolta con parole poco lusinghiere.

    I am the doctor occasionally mentioned in this story, in unflattering terms.

    predict

  • End-to-end conversational agents

    A Neural Conversational Model

    used for neural machine translation and achieves improvements on the English-French and English-German translation tasks from the WMT'14 dataset (Luong et al., 2014; Jean et al., 2014). It has also been used for other tasks such as parsing (Vinyals et al., 2014a) and image captioning (Vinyals et al., 2014b). Since it is well known that vanilla RNNs suffer from vanishing gradients, most researchers use variants of Long Short Term Memory (LSTM) recurrent neural networks (Hochreiter & Schmidhuber, 1997).

    Our work is also inspired by the recent success of neural language modeling (Bengio et al., 2003; Mikolov et al., 2010; Mikolov, 2012), which shows that recurrent neural networks are rather effective models for natural language. More recently, work by Sordoni et al. (Sordoni et al., 2015) and Shang et al. (Shang et al., 2015) used recurrent neural networks to model dialogue in short conversations (trained on Twitter-style chats).

    Building bots and conversational agents has been pursued by many researchers over the last decades, and it is out of the scope of this paper to provide an exhaustive list of references. However, most of these systems require a rather complicated processing pipeline of many stages (Lester et al., 2004; Will, 2007; Jurafsky & Martin, 2009). Our work differs from conventional systems by proposing an end-to-end approach to the problem which lacks domain knowledge. It could, in principle, be combined with other systems to re-score a short-list of candidate responses, but our work is based on producing answers given by a probabilistic model trained to maximize the probability of the answer given some context.

    3. Model
    Our approach makes use of the sequence-to-sequence (seq2seq) framework described in (Sutskever et al., 2014). The model is based on a recurrent neural network which reads the input sequence one token at a time, and predicts the output sequence, also one token at a time. During training, the true output sequence is given to the model, so learning can be done by backpropagation. The model is trained to maximize the cross entropy of the correct sequence given its context. During inference, given that the true output sequence is not observed, we simply feed the predicted output token as input to predict the next output. This is a "greedy" inference approach. A less greedy approach would be to use beam search, and feed several candidates at the previous step to the next step. The predicted sequence can be selected based on the probability of the sequence.
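
    The greedy inference loop described here is easy to state in code. Below is a self-contained toy sketch: toy_step is a hand-coded, hypothetical stand-in for a trained model's next-token distribution, and greedy_decode feeds each predicted token back in until the end-of-sequence token appears:

    # Greedy decoding: feed the model's own previous prediction back in as input.
    EOS = "<eos>"

    def toy_step(context, partial_reply):
        """Toy next-token distribution (hypothetical stand-in for a trained model)."""
        canned = ["i", "am", "a", "doctor", EOS]
        next_tok = canned[len(partial_reply)] if len(partial_reply) < len(canned) else EOS
        return {next_tok: 1.0}

    def greedy_decode(step_fn, context, max_len=20):
        reply = []
        for _ in range(max_len):
            probs = step_fn(context, reply)
            token = max(probs, key=probs.get)   # greedy: most probable next token
            if token == EOS:
                break
            reply.append(token)                 # predicted token becomes the next input
        return reply

    print(greedy_decode(toy_step, ["what", "do", "you", "do", "?"]))
    # -> ['i', 'am', 'a', 'doctor']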

    Concretely, suppose that we observe a conversation with two turns: the first person utters "ABC", and the second person replies "WXYZ". We can use a recurrent neural network,

    Figure 1. Using the seq2seq framework for modeling conversations.

    and train it to map "ABC" to "WXYZ" as shown in Figure 1 above. The hidden state of the model when it receives the end-of-sequence symbol can be viewed as the thought vector because it stores the information of the sentence, or thought, "ABC".

    The strength of this model lies in its simplicity and generality. We can use this model for machine translation, question/answering, and conversations without major changes in the architecture. Applying this technique to conversation modeling is also straightforward: the input sequence can be the concatenation of what has been conversed so far (the context), and the output sequence is the reply.

    Unlike easier tasks like translation, however, a model like sequence-to-sequence will not be able to successfully "solve" the problem of modeling dialogue due to several obvious simplifications: the objective function being optimized does not capture the actual objective achieved through human communication, which is typically longer term and based on exchange of information rather than next step prediction. The lack of a model to ensure consistency and general world knowledge is another obvious limitation of a purely unsupervised model.

    4. Datasets
    In our experiments we used two datasets: a closed-domain IT helpdesk troubleshooting dataset and an open-domain movie transcript dataset. The details of the two datasets are as follows.

    4.1. IT Helpdesk Troubleshooting dataset

    In our first set of experiments, we used a dataset which was extracted from an IT helpdesk troubleshooting chat service. In this service, customers face computer related issues, and a specialist helps them by conversing and walking through a solution. Typical interactions (or threads) are 400 words long, and turn taking is clearly signaled. Our training set contains 30M tokens, and 3M tokens were used as validation. Some amount of clean up was performed, such as removing common names, numbers, and full URLs.

    Vinyals and Le 2015

  • Training end-to-end conversational agents

    A: Shall we call Alan Turner?

    B: Alan Turner?

    A: Yeah.

    B: I haven't got his number have you got it?

    http://bncweb.lancs.ac.uk/

  • Training end-to-end conversational agents

    A: Shall we call Alan Turner?

    B: Alan Turner?

    A: Yeah.

    B: I haven't got his number have you got it?

    predict

  • End-to-end conversational agents

    H: i am seeing an error related to vpn

    M: what is the error message that you are getting when connecting to vpn using network connect?

    H: connection refused or something like that

    M: try this, go to page and enter your name and password

    H: done

    M: try logging in with and check if it's working fine now

    H: yes, now it works!

    Vinyals and Le 2015

  • End-to-end conversational agents

    H: what do you think about messi? M: he's a great player.

    H: what do you think about cleopatra? M: oh, she's very regal.

    H: what do you think about england during the reign of elizabeth? M: it was a great place.

    Vinyals and Le 2015

  • End-to-end conversational agents

    H: what is your job? M: i'm a lawyer. H: what do you do? M: i'm a doctor.

    Vinyals and Le 2015

  • End-to-end conversational agents

    H: what are you doing? M: nothing.

    H: what is your name? M: i don't know, sir.

    H: how old are you? M: i'm not sure.

    Li et al. 2016a

  • Conversation as a joint project (Clark 1996)

    • (Extra-linguistic) context: a conversation will take place in a certain state of the world

    • Common ground: the participants must coordinate on a (dynamic) set of "mutual knowledge, mutual beliefs, and mutual assumptions"

    • Purpose: we engage in a conversation in order to achieve something

  • Context

    A: Shall we call Alan Turner?

    B: Alan Turner?

    A: Yeah.

    [B knows Alan Turner's number]

    B: I haven't got his number have you got it?

  • Context

    A: Shall we call Alan Turner?

    B: Alan Turner?

    A: Yeah.

    [Alan Turner's number is 333454443]

    B: His number is 911

  • Accessing extra-linguistic context

    • Teach a machine to query a database (Bordes and Weston 2016):

    A: Shall we call Alan Turner?
    B: Alan Turner?
    A: Yeah.
    B: db_search(Alan_Turner, Has_Number, *)
    DB: 333454443
    B: Alan's number is 333454443
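
    A toy Python sketch of this idea, mimicking the db_search turn above: the agent emits an API-call string, the call is executed against a knowledge base, and the result is verbalised back into the dialogue. The knowledge base, parser and decision rule are invented for illustration and are not Bordes and Weston's code:

    # Toy agent that issues a database call mid-conversation (illustration only).
    KB = {("Alan_Turner", "Has_Number"): "333454443"}   # invented toy knowledge base

    def execute_api_call(call: str) -> str:
        """Parse 'db_search(Entity, Relation, *)' and look it up in the KB."""
        inner = call[len("db_search("):-1]               # strip the wrapper
        entity, relation, _ = [x.strip() for x in inner.split(",")]
        return KB.get((entity, relation), "no_result")

    def agent_turn(history):
        """If the agent decides it needs a fact, it outputs an API call first."""
        if any("number" in turn for turn in history):
            call = "db_search(Alan_Turner, Has_Number, *)"
            result = execute_api_call(call)              # DB: 333454443
            return f"Alan's number is {result}"
        return "Alan Turner?"

    history = ["A: Shall we call Alan Turner?", "B: Alan Turner?", "A: Yeah.",
               "B: I haven't got his number have you got it?"]
    print(agent_turn(history))   # -> "Alan's number is 333454443"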

  • Accessing extra-linguistic context

    Visual Dialog

    Abhishek Das1, Satwik Kottur2, Khushi Gupta2*, Avi Singh3*, Deshraj Yadav1, José M.F. Moura2, Devi Parikh4, Dhruv Batra4

    1Virginia Tech, 2Carnegie Mellon University, 3UC Berkeley, 4Georgia Institute of Technology

    Abstract
    We introduce the task of Visual Dialog, which requires an AI agent to hold a meaningful dialog with humans in natural, conversational language about visual content. Specifically, given an image, a dialog history, and a question about the image, the agent has to ground the question in image, infer context from history, and answer the question accurately. Visual Dialog is disentangled enough from a specific downstream task so as to serve as a general test of machine intelligence, while being grounded in vision enough to allow objective evaluation of individual responses and benchmark progress. We develop a novel two-person chat data-collection protocol to curate a large-scale Visual Dialog dataset (VisDial). Data collection is underway and on completion, VisDial will contain 1 dialog with 10 question-answer pairs on all ~200k images from COCO, with a total of 2M dialog question-answer pairs. We introduce a family of neural encoder-decoder models for Visual Dialog with 3 encoders – Late Fusion, Hierarchical Recurrent Encoder and Memory Network – and 2 decoders (generative and discriminative), which outperform a number of sophisticated baselines. We propose a retrieval-based evaluation protocol for Visual Dialog where the AI agent is asked to sort a set of candidate answers and is evaluated on metrics such as mean reciprocal rank of the human response. We quantify the gap between machine and human performance on the Visual Dialog task via human studies. Our dataset, code, and trained models will be released publicly. Putting it all together, we demonstrate the first 'visual chatbot'!
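
    The retrieval-based protocol mentioned in the abstract boils down to ranking candidate answers and recording the reciprocal rank of the human answer; averaging over questions gives mean reciprocal rank (MRR). A tiny sketch with made-up candidates (not the VisDial evaluation code):

    # Mean reciprocal rank (MRR) for retrieval-style dialog evaluation.
    def mean_reciprocal_rank(ranked_candidates, gold_answers):
        total = 0.0
        for candidates, gold in zip(ranked_candidates, gold_answers):
            rank = candidates.index(gold) + 1      # 1-based position of the human answer
            total += 1.0 / rank
        return total / len(gold_answers)

    ranked = [
        ["two", "three", "i can't tell"],          # question 1: gold answer ranked 1st
        ["no", "yes", "maybe"],                    # question 2: gold answer ranked 2nd
    ]
    gold = ["two", "yes"]
    print(mean_reciprocal_rank(ranked, gold))      # (1/1 + 1/2) / 2 = 0.75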

    1. Introduction
    We are witnessing unprecedented advances in computer vision (CV) and artificial intelligence (AI) – from 'low-level' AI tasks such as image classification [14], scene recognition [55], object detection [27] – to 'high-level' AI tasks such as learning to play Atari video games [35] and Go [47], answering reading comprehension questions by understanding short stories [15, 57], and even answering questions about images [4, 32, 41] and videos [49, 50]!
    What lies next for AI? We believe that the next generation of visual intelligence systems will need to possess the ability to hold a meaningful dialog with humans in natural language about visual content. Applications include:
    • Aiding visually impaired users in understanding their surroundings [5] or social media content [58] (AI: 'John just uploaded a picture from his vacation in Hawaii', Human: 'Great, is he at the beach?', AI: 'No, on a mountain').
    • Aiding analysts in making decisions based on large quantities of surveillance data (Human: 'Did anyone enter this room last week?', AI: 'Yes, 27 instances logged on camera', Human: 'Were any of them carrying a black bag?').
    • Interacting with an AI assistant (Human: 'Alexa – can you see the baby in the baby monitor?', AI: 'Yes, I can', Human: 'Is he sleeping or playing?').
    • Robotics applications (e.g. search and rescue missions) where the operator may be 'situationally blind' and oper-

    *Work done while KG and AS were interns at Virginia Tech.

    Figure 1: We introduce a new AI task – Visual Dialog, where an AI agent must hold a dialog with a human about visual content. We introduce a large-scale dataset (VisDial), an evaluation protocol, and novel encoder-decoder models for this task.


    arXiv:1611.08669v1 [cs.CV] 26 Nov 2016

    Das et al. 2016

  • Accessing extra-linguistic context: problems and solutions

    • Subject-elicited data: small • Scripted interactions: limited, artificial

    • Mix large-scale purely-text-based training and small/artificial real-world-based training (how?) • Clever ways to collect grounded data (games?) • Grounded data in the wild (movies?)

  • How to use contextual information

    Contextual vs generic

    • What's the color of cows? • What's the color of this cow?

    Precise vs vague

    • Where does Alan Turner live?

    • 34 Washington Street / 35 Lincoln Avenue / In the US

    • In the US / In the United States / In Winesburg, Ohio / 34 Washington Street

  • Conversation as a joint project (Clark 1996)

    • (Extra-linguistic) context: a conversation will take place in a certain state of the world

    • Common ground: the participants must coordinate on a (dynamic) set of "mutual knowledge, mutual beliefs, and mutual assumptions"

    • Purpose: we engage in a conversation in order to achieve something

  • Common ground (Clark and Brennan 1991)

    A: Shall we call Alan Turner?

    B: Alan Turner?

    A: Yeah.

    B: I haven't got his number have you got it?

  • Common ground (Clark and Brennan 1991)

    A: Shall we call Alan Turner?

    B: Alan Turner?

    A: Yeah.

    B: I haven't got his number have you got it?

  • Common ground

    Leslie: Gotta see them next time.
    Leslie: See, I was a bit shocked, you know
    Leslie: Mike said he went to Chessington, was it?
    Rob: Yeah.
    Leslie: Haha!
    Leslie: Why didn't he do that when we were here?

    http://bncweb.lancs.ac.uk/

  • Common ground in end2end systems

    H: i am seeing an error related to vpn

    M: what is the error message that you are getting when connecting to vpn using network connect?

    H: connection refused or something like that

    M: try this, go to page and enter your name and password

    H: done

    M: try logging in with and check if it's working fine now

    H: yes, now it works!

    Vinyals and Le 2015

  • Common ground in end2end systems

    H: what is your job? M: i'm a lawyer. H: what do you do? M: i'm a doctor.

    Vinyals and Le 2015

  • Common ground and conversation partners

    Rob: Let me introduce you to my friend Mike...
    Machine: Nice to meet you!
    ...
    Machine: Mike said he went to Chessington, was it?
    Rob: Yeah.

    Machine: I know this person, Mike, who went to Chessington.
    Alice: Yeah

  • Personas (Li et al. 2016b)

    [Figure 1 diagram: a seq2seq model whose decoder combines word embeddings (50k) with speaker embeddings (70k); for the source "where do you live", the target "in england . EOS" is generated conditioned on the speaker embedding Rob_712, which lies near other speakers (e.g. skinnyoflynny2, D_Gomes25, TheCharlieZ) in embedding space, just as related words (england, london, u.s.; monday, tuesday) cluster in the word-embedding space.]

    Figure 1: Illustrative example of the Speaker Model introduced in this work. Speaker IDs close in embedding space tend to respond in the same manner. These speaker embeddings are learned jointly with word embeddings and all other parameters of the neural model via backpropagation. In this example, say Rob is a speaker clustered with people who often mention England in the training data; then the generation of the token 'england' at time t = 2 would be much more likely than that of 'u.s.'. A non-persona model would prefer generating 'in the u.s.' if 'u.s.' is more represented in the training data across all speakers.

    4.3 Speaker-Addressee Model
    A natural extension of the Speaker Model is a model that is sensitive to speaker-addressee interaction patterns within the conversation. Indeed, speaking style, register, and content do not vary only with the identity of the speaker, but also with that of the addressee. For example, in scripts for the TV series Friends used in some of our experiments, the character Ross often talks differently to his sister Monica than to Rachel, with whom he is engaged in an on-again off-again relationship throughout the series.

    The proposed Speaker-Addressee Model operates as follows: We wish to predict how speaker i would respond to a message produced by speaker j. Similarly to the Speaker model, we associate each speaker with a K-dimensional speaker-level representation, namely v_i for user i and v_j for user j. We obtain an interactive representation V_{i,j} ∈ R^{K×1} by linearly combining user vectors v_i and v_j in an attempt to model the interactive style of user i towards user j,

        V_{i,j} = tanh(W_1 · v_i + W_2 · v_j)    (7)

    where W_1, W_2 ∈ R^{K×K}. V_{i,j} is then linearly incorporated into LSTM models at each step in the target:

        [i_t; f_t; o_t; l_t] = [σ; σ; σ; tanh] · (W · [h_{t-1}; e_t^s; V_{i,j}])    (8)
        c_t = f_t · c_{t-1} + i_t · l_t    (9)
        h_t^s = o_t · tanh(c_t)    (10)

    V_{i,j} depends on both speaker and addressee and the same speaker will thus respond differently to a message from different interlocutors. One potential issue with Speaker-Addressee modelling is the difficulty involved in collecting a large-scale training dataset in which each speaker is involved in conversation with a wide variety of people. Like the Speaker Model, however, the Speaker-Addressee Model derives generalization capabilities from speaker embeddings. Even if the two speakers at test time (i and j) were never involved in the same conversation in the training data, two speakers i′ and j′ who are respectively close in embeddings may have been, and this can help modelling how i should respond to j.
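
    A compact PyTorch-style sketch of equations (7)-(10): the two speaker embeddings are combined into V_{i,j} and appended to the decoder input at every target step. Dimensions and module names are invented, and nn.LSTMCell is used because concatenating V_{i,j} to the input yields the same gate structure as eq. (8); this follows the formulas above rather than the authors' released code:

    # Sketch of one Speaker-Addressee decoder step following eqs. (7)-(10).
    import torch
    import torch.nn as nn

    K, EMB, HID, N_SPEAKERS = 32, 64, 128, 100

    class SpeakerAddresseeDecoderStep(nn.Module):
        def __init__(self):
            super().__init__()
            self.speaker_emb = nn.Embedding(N_SPEAKERS, K)   # v_i, v_j
            self.W1 = nn.Linear(K, K, bias=False)
            self.W2 = nn.Linear(K, K, bias=False)
            self.cell = nn.LSTMCell(EMB + K, HID)            # gates from [e_t; V_ij; h_{t-1}]

        def interaction(self, i, j):
            v_i, v_j = self.speaker_emb(i), self.speaker_emb(j)
            return torch.tanh(self.W1(v_i) + self.W2(v_j))   # eq. (7)

        def forward(self, e_t, v_ij, h_prev, c_prev):
            x = torch.cat([e_t, v_ij], dim=-1)
            h_t, c_t = self.cell(x, (h_prev, c_prev))        # eqs. (8)-(10)
            return h_t, c_t

    step = SpeakerAddresseeDecoderStep()
    speaker_i = torch.tensor([3])      # responder
    speaker_j = torch.tensor([17])     # addressee
    v_ij = step.interaction(speaker_i, speaker_j)
    h = c = torch.zeros(1, HID)
    e_t = torch.randn(1, EMB)          # word embedding of the current target token
    h, c = step(e_t, v_ij, h, c)
    print(h.shape)                     # torch.Size([1, 128])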

    4.4 Decoding and Reranking
    For decoding, the N-best lists are generated using the decoder with beam size B = 200. We set a maximum length of 20 for the generated candidates. Decoding operates as follows: At each time step, we first examine all B × B possible next-word candidates, and add all hypotheses ending with an EOS token to the N-best list. We then preserve the top-B unfinished hypotheses and move to the next word position.

    To deal with the issue that SEQ2SEQ models tend to generate generic and commonplace responses such as I don't know, we follow Li et al. (2016) by reranking the generated N-best list using
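
    The N-best beam decoding just described can be sketched as follows. The scorer is a hand-coded toy stand-in for a trained model, and the beam size is tiny here rather than B = 200; this is an illustration, not the authors' decoder:

    # Beam-style N-best decoding: expand the top-B unfinished hypotheses each step,
    # move finished ones (ending in EOS) to the N-best list, keep the best B unfinished.
    import math

    EOS = "<eos>"
    VOCAB = ["i", "am", "a", "doctor", "lawyer", EOS]

    def toy_logprob(prefix, token):
        """Hypothetical next-token log-probability favouring one canned reply."""
        preferred = {0: "i", 1: "am", 2: "a", 3: "doctor", 4: EOS}
        if preferred.get(len(prefix)) == token:
            return math.log(0.6)
        return math.log(0.05)

    def beam_decode(B=3, max_len=10):
        beams = [([], 0.0)]                      # (tokens so far, cumulative log-prob)
        n_best = []
        for _ in range(max_len):
            candidates = []
            for tokens, score in beams:          # expand every unfinished hypothesis
                for tok in VOCAB:
                    candidates.append((tokens + [tok], score + toy_logprob(tokens, tok)))
            candidates.sort(key=lambda c: c[1], reverse=True)
            beams = []
            for tokens, score in candidates:
                if tokens[-1] == EOS:
                    n_best.append((tokens[:-1], score))   # finished: move to N-best list
                elif len(beams) < B:
                    beams.append((tokens, score))         # keep the top-B unfinished
        return sorted(n_best, key=lambda c: c[1], reverse=True)

    print(beam_decode()[0][0])   # -> ['i', 'am', 'a', 'doctor']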

  • Negotiating common ground

    Alan: In fact it was the Society's policy now that we won't put animals to sleep unless there is an extreme cause for that...

    James: Why is it that in the north of England, particularly in this region of the north of England, we seem to be worse than anywhere else?

    Alan: I don't think you are. You know, since I've been down in Horsham, I have found that the northeast is not alone. There are many other areas in the country where animals are seriously abused. The northeast isn't the only place.

    http://bncweb.lancs.ac.uk/

  • Updating common ground

    A: Shall we call Alan Turner?
    B: Alan Turner?
    A: Yeah.
    B: I haven't got his number have you got it?

    ...
    A: Shall we call our... his name isn't Alan Turner.
    B: Well what's his name?
    A: His name is... Richard... Thorpe, which is why am I looking under T in my book...

  • Conversation as a joint project (Clark 1996)

    • (Extra-linguistic) context: a conversation will take place in a certain state of the world

    • Common ground: the participants must coordinate on a (dynamic) set of "mutual knowledge, mutual beliefs, and mutual assumptions"

    • Purpose: we engage in a conversation in order to achieve something

  • The purpose of conversations

    Task-oriented conversations

    A: I'd like to book a restaurant for tonight.
    B: What kind of food?
    A: Indian, perhaps?
    B: How about Shahi Tandoor on George Street?
    A: That sounds great, thank you!

    Chitchat?

    A: Shall we call Alan Turner?
    B: Alan Turner?
    A: Yeah.
    B: I haven't got his number have you got it?

  • The purpose of conversations

    A: Shall we call Alan Turner?
    B: Alan Turner?
    A: Yeah.
    B: I haven't got his number have you got it?

  • The purpose of conversations

    B: Before the end of the show, we've got ten minutes left...

    A: Shall we call Alan Turner?
    B: Alan Turner?
    A: Yeah.
    B: I haven't got his number have you got it?

    A: No... hey, have you seen they're filming in Harewood this week?

  • Conversational success is inherently rewarding

    Brain-to-brain coupling: a mechanism for creating and sharing a social world
    Uri Hasson1,2, Asif A. Ghazanfar1,2, Bruno Galantucci3,4, Simon Garrod5,6 and Christian Keysers7,8

    1 Neuroscience Institute, Princeton University, Princeton, NJ 08540, USA; 2 Department of Psychology, Princeton University, Princeton, NJ 08540, USA; 3 Department of Psychology, Yeshiva University, New York, NY 10033, USA; 4 Haskins Laboratories, New Haven, CT 06511, USA; 5 Institute of Neuroscience and Psychology, University of Glasgow, Glasgow G12 8QB, UK; 6 Department of Psychology, University of Western Australia, Perth, WA 6009, Australia; 7 Netherlands Institute for Neuroscience (NIN-KNAW), 1105 BA Amsterdam, The Netherlands; 8 Department of Neuroscience, University Medical Center Groningen, University of Groningen, 9700 RB Groningen, The Netherlands

    Cognition materializes in an interpersonal space. The emergence of complex behaviors requires the coordination of actions among individuals according to a shared set of rules. Despite the central role of other individuals in shaping one's mind, most cognitive studies focus on processes that occur within a single individual. We call for a shift from a single-brain to a multi-brain frame of reference. We argue that in many cases the neural processes in one brain are coupled to the neural processes in another brain via the transmission of a signal through the environment. Brain-to-brain coupling constrains and shapes the actions of each individual in a social network, leading to complex joint behaviors that could not have emerged in isolation.

    Why two (or more) brains are better than one
    Although the scope of cognitive neuroscience research is vast and rich, the experimental paradigms used are primarily concerned with studying the neural mechanisms of one individual's behavioral processes. Typical experiments isolate humans or animals from their natural environments by placing them in a sealed room where interactions occur solely with a computerized program. This egocentric framework is reminiscent of the Ptolemaic geocentric frame of reference for the solar system. From the early days of civilization, stars were not thought to have any influence on the geophysical processes on Earth. The present understanding of gravity, orbits and the tides came about only after the Copernican revolution, which brought about the realization that the Earth is just another element in a complex, interacting system of planets. Along the same lines, we argue here that the dominant focus on single individuals in cognitive neuroscience paradigms obscures the forces that operate between brains to shape behavior.

    Verbal communication is an excellent example to illustrate the role that other individuals play in one's cognitive processes. As Wittgenstein argued, the meaning of a word is defined by its use [1]. The word's correct use, however, can vary across eras, cultures and contexts. Thus, the appropriate use of a word is grounded in a set of interrelated norms shared by a community of speakers. To master a language, one has to learn the correct uses of words by interacting with other members of the community. Such interactions fundamentally shape the way individuals think and act in the world [2,3]. This is by no means limited to language. Several other nonverbal social and cognitive skills, such as courting, dancing or tool manipulation, require the collaboration of multiple agents that coordinate their behavior according to a shared set of rules and customs. With so many cognitive faculties emerging from interpersonal space, a complete understanding of the cognitive processes within a single individual's brain cannot be achieved without examining and understanding the interactions among individuals [4]. In this article, we call for a shift from a single-brain to a multi-brain frame of reference.

    Brain-to-brain coupling
    The premise of brain-to-brain coupling is that the perceptual system of one brain can be coupled to the motor system of another. This binding mechanism builds on a more rudimentary ability of brains to be coupled to the physical world (stimulus-to-brain coupling, Figure 1a). Different objects in the environment emit different forms of energy (mechanical, chemical, electromagnetic), and receptors convert these signals into electrical impulses that the brain can use to infer information about the state of the world and generate appropriate behaviors. Furthermore, organisms are not passive receivers of sensory input but rather actively move their sensory receptor surfaces (hands, eyes, tongues, etc.) to sample information from the environment [5,6]. Thus, stimulus-to-brain coupling is fundamental to the ability to retrieve information about the world to guide actions.

    Brain-to-brain coupling also relies on stimulus-to-brain coupling as a vehicle for conveying information. However,


    Corresponding author: Hasson, U. ([email protected]). Keywords: action–perception; mother–infant; coupled oscillations; social neuroscience; joint action; speaker–listener neural coupling.


  • Learning and transferring linguistic cues of success

    See also Li et al. 2016c

    Explicit reward

    A: I'd like to book a restaurant for tonight.
    B: What kind of food?
    A: Indian, perhaps?
    B: How about Shahi Tandoor on George Street?
    A: That sounds great, thank you!  [+1]

    Implicit reward

    A: Shall we call Alan Turner?
    B: Alan Turner?
    A: Yeah.
    B: Here's his number: 333454443
    A: Great, thank you!
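
    One way to read this slide in code: a task-oriented system gets an explicit +1 when the task succeeds, while for chitchat a learner would have to infer reward from linguistic cues of success such as "Great, thank you!". A toy Python sketch of scoring dialogues that way (the cue lists and values are invented for illustration and are not a method from the cited papers):

    # Toy reward assignment: explicit task reward when available, otherwise an
    # implicit reward inferred from linguistic cues of success in the last turns.
    SUCCESS_CUES = ["great, thank you", "thanks", "now it works", "that sounds great"]
    FAILURE_CUES = ["that's not what i asked", "forget it", "this is useless"]

    def implicit_reward(dialogue):
        """Score a dialogue from the cues in its final user turns."""
        tail = " ".join(dialogue[-2:]).lower()
        if any(cue in tail for cue in SUCCESS_CUES):
            return 1.0
        if any(cue in tail for cue in FAILURE_CUES):
            return -1.0
        return 0.0

    def reward(dialogue, task_outcome=None):
        """Prefer an explicit task signal; fall back to linguistic cues."""
        if task_outcome is not None:
            return float(task_outcome)          # e.g. booking confirmed -> +1
        return implicit_reward(dialogue)

    chitchat = ["A: Shall we call Alan Turner?", "B: Alan Turner?", "A: Yeah.",
                "B: Here's his number: 333454443", "A: Great, thank you!"]
    print(reward(chitchat))                      # -> 1.0 (implicit cue)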

  • Conversation as a joint project (Clark 1996)

    • (Extra-linguistic) context: a conversation will take place in a certain state of the world

    • Common ground: the participants must coordinate on a (dynamic) set of "mutual knowledge, mutual beliefs, and mutual assumptions"

    • Purpose: we engage in a conversation in order to achieve something

  • A: Shall we call Alan Turner?

    B: Well, this is the last slide, we ran out of time again!

  • Anthes. 2010. Automated translation of Indian languages. Comm ACM.
    • Bordes and Weston. 2016. Learning end-to-end goal-oriented dialog. arXiv.
    • Clark. 1996. Using language. CUP.
    • Clark and Brennan. 1991. Grounding in communication. In Perspectives on socially shared cognition.
    • Das et al. 2016. Visual dialog. arXiv.
    • Garrod and Pickering. 2004. Why is conversation so easy? TICS.
    • Hasson et al. 2012. Brain-to-brain coupling: A mechanism for creating and sharing a social world. Trends Cog Sci.
    • Li et al. 2016a. A diversity-promoting objective function for neural conversation models. NAACL 2016.
    • Li et al. 2016b. A persona-based neural conversation model. ACL 2016.
    • Li et al. 2016c. Deep reinforcement learning for dialogue generation. EMNLP 2016.
    • Sutskever et al. 2014. Sequence to sequence learning with neural networks. NIPS.
    • Vinyals and Le. 2015. A neural conversational model. ICML DLW.
    • Young et al. 2013. POMDP-based statistical spoken dialogue systems: A review. Proc IEEE.