
Proceedings of the SIGDIAL 2018 Conference, pages 350–359, Melbourne, Australia, 12-14 July 2018. © 2018 Association for Computational Linguistics


Adversarial Learning of Task-Oriented Neural Dialog Models

Bing Liu
Carnegie Mellon University
Electrical and Computer Engineering
[email protected]

Ian Lane
Carnegie Mellon University
Electrical and Computer Engineering
Language Technologies Institute
[email protected]

Abstract

In this work, we propose an adversarial learning method for reward estimation in reinforcement learning (RL) based task-oriented dialog models. Most current RL based task-oriented dialog systems require access to a reward signal from either user feedback or user ratings. Such user ratings, however, may not always be consistent or available in practice. Furthermore, online dialog policy learning with RL typically requires a large number of queries to users, suffering from the sample efficiency problem. To address these challenges, we propose an adversarial learning method to learn dialog rewards directly from dialog samples. Such rewards are further used to optimize the dialog policy with policy gradient based RL. In evaluations in a restaurant search domain, we show that the proposed adversarial dialog learning method achieves an improved dialog success rate compared to strong baseline methods. We further discuss the covariate shift problem in online adversarial dialog learning and show how we can address it with partial access to user feedback.

1 Introduction

Task-oriented dialog systems are designed to assist users in completing daily tasks, such as making reservations and providing customer support. Compared to chit-chat systems that are usually modeled with single-turn context-response pairs (Li et al., 2016; Serban et al., 2016), task-oriented dialog systems (Young et al., 2013; Williams et al., 2017) involve retrieving information from external resources and reasoning over multiple dialog turns. This makes it especially important for a system to be able to learn interactively from users.

Recent efforts on task-oriented dialog systems focus on learning dialog models in a data-driven manner from human-human or human-machine conversations. Williams et al. (2017) designed a hybrid supervised and reinforcement learning end-to-end dialog agent. Dhingra et al. (2017) proposed an RL based model for information access that can learn online via user interactions. Such systems assume the model has access to a reward signal at the end of a dialog, either in the form of binary user feedback or a continuous user score. A challenge with such learning systems is that user feedback may be inconsistent (Su et al., 2016) and may not always be available in practice. Furthermore, online dialog policy learning with RL usually suffers from a sample efficiency issue (Su et al., 2017), requiring an agent to make a large number of feedback queries to users.

To reduce the high demand for user feedback in online policy learning, solutions have been proposed to design or to learn a reward function that can generate a reward approximating user feedback. Designing a good reward function is not easy (Walker et al., 1997) as it typically requires strong domain knowledge. El Asri et al. (2014) proposed a learning based reward function that is trained with task completion transfer learning. Su et al. (2016) proposed an online active learning method for reward estimation using Gaussian process classification. These methods still require annotations of dialog ratings by users, and thus may also suffer from the rating consistency and learning efficiency issues.

To address the above challenges, we investigate the effectiveness of learning dialog rewards directly from dialog samples. Inspired by the success of adversarial training in computer vision (Denton et al., 2015) and natural language generation (Li et al., 2017a), we propose an adversarial learning method for task-oriented dialog systems. We jointly train two models: a generator that interacts with the environment to produce task-oriented dialogs, and a discriminator that marks a dialog sample as being successful or not. The generator is a neural network based task-oriented dialog agent. The environment that the dialog agent interacts with is the user. The quality of a dialog produced by the agent and the user is measured by the likelihood that it fools the discriminator into believing that the dialog is a successful one conducted by a human agent. We treat dialog agent optimization as a reinforcement learning problem. The output from the discriminator serves as a reward to the dialog agent, pushing it towards completing a task in a way that is indistinguishable from how a human agent completes it.

In this work, we discuss how the adversarial learning reward function compares to designed reward functions in learning a good dialog policy. Our experimental results in a restaurant search domain show that dialog agents optimized with the proposed adversarial learning method achieve a higher task success rate compared to strong baseline methods. We discuss the impact of the size of the annotated dialog sample set on the effectiveness of dialog adversarial learning. We further discuss the covariate shift issue in interactive adversarial learning and show how we can address it with partial access to user feedback.

2 Related Work

Task-Oriented Dialog Learning Popular approaches to learning task-oriented dialog systems include modeling the task as a partially observable Markov Decision Process (POMDP) (Young et al., 2013). Reinforcement learning can be applied in the POMDP framework to learn dialog policy online by interacting with users (Gasic et al., 2013). Recent efforts have been made in designing end-to-end solutions (Williams and Zweig, 2016; Liu and Lane, 2017a; Li et al., 2017b; Liu et al., 2018) for task-oriented dialogs. Wen et al. (2017) designed a supervised training end-to-end neural dialog model with modularly connected components. Bordes and Weston (2017) proposed a neural dialog model using end-to-end memory networks. These models are trained offline using fixed dialog corpora, and thus it is unknown how well the model performance generalizes to online user interactions. Williams et al. (2017) proposed a hybrid code network for task-oriented dialog that can be trained with supervised and reinforcement learning. Dhingra et al. (2017) proposed a reinforcement learning dialog agent for information access. Such models are trained against rule-based user simulators. A dialog reward from the user simulator is expected at the end of each turn or each dialog.

Dialog Reward Modeling Dialog reward estimation is an essential step for policy optimization in task-oriented dialogs. Walker et al. (1997) proposed the PARADISE framework, in which user satisfaction is estimated from a number of dialog features such as the number of turns and the elapsed time. Yang et al. (2012) proposed a collaborative filtering based method for estimating user satisfaction in dialogs. Su et al. (2015) studied using convolutional neural networks to rate dialog success. Su et al. (2016) further proposed an online active learning method based on Gaussian processes for dialog reward learning. These methods still require various levels of annotation of dialog ratings by users, either offline or online. On the other side of the spectrum, Paek and Pieraccini (2008) proposed inferring dialog rewards directly from dialog corpora with inverse reinforcement learning (IRL) (Ng et al., 2000). However, most IRL algorithms are very expensive to run, requiring reinforcement learning in an inner loop. This hinders IRL based dialog reward estimation methods from scaling to complex dialog scenarios.

Adversarial Networks Generative adversarial networks (GANs) (Goodfellow et al., 2014) have recently been successfully applied in computer vision and natural language generation (Li et al., 2017a). The network training process is framed as a game in which a generator is trained to produce samples that fool a discriminator, while the discriminator is trained to distinguish samples produced by the generator from real ones. The generator and the discriminator are jointly trained until convergence. GANs were first applied in image generation and have recently been used in language tasks. Li et al. (2017a) proposed conducting adversarial learning for response generation in open-domain dialogs. Yang et al. (2017) proposed using adversarial learning in neural machine translation. The use of adversarial learning in task-oriented dialogs has not been well studied. Peng et al. (2018) recently explored using an adversarial loss as an extra critic in addition to the main reward function based on task completion. This method still requires prior knowledge of a user's goal, which can be hard to collect in practice, to define task completion. Our proposed method addresses this challenge by using the adversarial reward as the only source of reward for policy optimization.

3 Adversarial Learning for Task-Oriented Dialogs

In this section, we describe the proposed adversarial learning method for policy optimization in task-oriented neural dialog models. Our objective is to learn a dialog agent (i.e. the generator, G) that is able to effectively communicate with a user over a multi-turn conversation to complete a task. This can be framed as a sequential decision making problem, in which the agent generates the best action to take at every dialog turn given the dialog context. The action can be in the form of either a dialog act (Henderson et al., 2013) or a natural language utterance. We work at the dialog act level in this work. Let U_k and A_k represent the user input and the agent outputs (i.e. the agent act a_k and the slot-value predictions) at turn k. Given the current user input U_k, the agent estimates the user's goal and selects the best action a_k to take, conditioned on the dialog history.

In addition, we want to learn a reward function (i.e. the discriminator, D) that is able to provide guidance to the agent for learning a better policy. We expect the reward function to give a higher reward to the agent if the conversation it had with the user is closer to how a human agent completes the task. The output of the reward function is the probability of a given dialog being successfully completed. We train the reward function by forcing it to distinguish successful dialogs from dialogs conducted by the machine agent. At the same time, we also update the dialog agent parameters with policy gradient based reinforcement learning using the reward produced by the reward function. We keep updating the dialog agent and the reward function until the discriminator can no longer distinguish dialogs from a human agent and from a machine agent. In the subsequent sections, we describe in detail the design of our dialog agent and reward function, and the proposed adversarial dialog learning method.

Figure 1: Design of the task-oriented neural dialog agent. (At turn k, the user input encoding U_k and the previous system outputs A_{k-1} update the LSTM dialog state s_k; together with the slot value probabilities v_k and the query results encoding E_k, the state feeds a policy network that produces the system action a_k.)

3.1 Neural Dialog Agent

The generator is a neural network based task-oriented dialog agent. The model architecture is shown in Figure 1. The agent uses an LSTM recurrent neural network to model the sequence of turns in a dialog. At each turn of a dialog, the agent takes the best system action conditioned on the current dialog state. A continuous-form dialog state is maintained in the LSTM state s_k. At each dialog turn k, the user input U_k and the previous system output A_{k-1} are first encoded into continuous representations. The user input can be either in the form of a dialog act or a natural language utterance. We use dialog act form user inputs in our experiments. The dialog act representation is obtained by concatenating the embeddings of the act and the slot-value pairs. If natural language input is used, we can encode the sequence of words using a bidirectional RNN and take the concatenation of the last forward and backward states as the utterance representation, similar to (Yang et al., 2016) and (Liu and Lane, 2017a). With the user input U_k and agent input A_{k-1}, the dialog state s_k is updated from the previous state s_{k-1} by:

s_k = \mathrm{LSTM}_G(s_{k-1}, [U_k, A_{k-1}])    (1)

Belief Tracking Belief tracking maintains the state of a conversation, such as a user's goals, by accumulating evidence along the sequence of dialog turns. A user's goal is represented by a list of slot-value pairs. The belief tracker updates its estimation of the user's goal by maintaining a probability distribution P(l^m_k) over candidate values for each tracked goal slot type m ∈ M. With the current dialog state s_k, the probability over candidate values for each tracked goal slot is calculated by:

P(l^m_k \mid U_{\le k}, A_{<k}) = \mathrm{SlotDist}_m(s_k)    (2)

where SlotDist_m is a single hidden layer MLP with softmax activation over slot type m ∈ M.

Dialog Policy We model the agent's policy with a deep neural network. Following the policy, the agent selects the next action in response to the user's input based on the current dialog state. In addition, information retrieved from external resources may also affect the agent's next action. Therefore, the inputs to our policy module are the current dialog state s_k, the probability distribution over estimated user goal slot values v_k, and the encoding of the information retrieved from external sources E_k. Instead of encoding the actual query results, we encode a summary of the retrieved items (i.e. the count and availability of the returned items). Based on these inputs, the policy network produces a probability distribution over the next system actions:

P(a_k \mid U_{\le k}, A_{<k}, E_{\le k}) = \mathrm{PolicyNet}(s_k, v_k, E_k)    (3)

where PolicyNet is a single hidden layer MLP with softmax activation over all system actions.
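To make the agent architecture concrete, the sketch below shows one possible PyTorch-style implementation of Eqs. 1-3. The paper does not release code or name a framework, so the class name, layer choices, and tensor shapes here are illustrative assumptions only, not the authors' implementation.

# Hypothetical PyTorch sketch of the generator in Section 3.1 (Eqs. 1-3).
import torch
import torch.nn as nn


class NeuralDialogAgent(nn.Module):
    def __init__(self, input_dim, state_dim, num_actions, slot_value_counts,
                 kb_dim, policy_hidden=100):
        super().__init__()
        # Eq. 1: LSTM that carries the continuous dialog state s_k.
        self.state_lstm = nn.LSTMCell(input_dim, state_dim)
        # Eq. 2: one single-hidden-layer head (SlotDist_m) per tracked slot type m.
        self.slot_heads = nn.ModuleList(
            [nn.Linear(state_dim, n_values) for n_values in slot_value_counts])
        # Eq. 3: PolicyNet, a single-hidden-layer MLP over system actions.
        self.policy_net = nn.Sequential(
            nn.Linear(state_dim + sum(slot_value_counts) + kb_dim, policy_hidden),
            nn.Tanh(),
            nn.Linear(policy_hidden, num_actions))

    def step(self, user_enc, prev_sys_enc, kb_enc, state=None):
        # s_k = LSTM_G(s_{k-1}, [U_k, A_{k-1}])              (Eq. 1)
        h, c = self.state_lstm(torch.cat([user_enc, prev_sys_enc], dim=-1), state)
        # P(l_k^m | U_{<=k}, A_{<k}) = SlotDist_m(s_k)        (Eq. 2)
        slot_probs = [torch.softmax(head(h), dim=-1) for head in self.slot_heads]
        v_k = torch.cat(slot_probs, dim=-1)
        # P(a_k | .) = PolicyNet(s_k, v_k, E_k)               (Eq. 3)
        action_probs = torch.softmax(
            self.policy_net(torch.cat([h, v_k, kb_enc], dim=-1)), dim=-1)
        return action_probs, slot_probs, (h, c)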

3.2 Dialog Reward Estimator

The discriminator model is a binary classifier that takes in a dialog with a sequence of turns and outputs a label indicating whether the dialog is a successful one or not. The logistic function returns the probability of the input dialog being successful. The discriminator model design is shown in Figure 2. We use a bidirectional LSTM to encode the sequence of turns. At each dialog turn k, the input to the discriminator model is the concatenation of (1) the encoding of the user input U_k, (2) the encoding of the query result summary E_k, and (3) the encoding of the agent output A_k. The discriminator LSTM output at each step k, h_k, is the concatenation of the forward LSTM output \overrightarrow{h_k} and the backward LSTM output \overleftarrow{h_k}: h_k = [\overrightarrow{h_k}, \overleftarrow{h_k}].

After obtaining the discriminator LSTM state outputs {h_1, ..., h_K}, we experiment with four different methods of combining these state outputs to generate the final dialog representation d for the binary classifier:

BiLSTM-last Produce the final dialog representation d by concatenating the last LSTM state outputs from the forward and backward directions: d = [\overrightarrow{h_K}, \overleftarrow{h_1}].

Figure 2: Design of the dialog reward estimator: bidirectional LSTM with max pooling. (Turn-level inputs [U_k, E_k, A_k] for k = 1, ..., K are encoded by a bidirectional LSTM; pooling over the LSTM outputs yields the dialog representation d, from which D(d) is computed.)

BiLSTM-max Max-pooling. Produce the final dialog representation d by selecting the maximum value over each dimension of the LSTM state outputs.

BiLSTM-avg Average-pooling. Produce the final dialog representation d by taking the average value over each dimension of the LSTM state outputs.

BiLSTM-attn Attention-pooling. Produce the final dialog representation d by taking a weighted sum of the LSTM state outputs, with weights calculated by an attention mechanism:

d = \sum_{k=1}^{K} \alpha_k h_k    (4)

and

\alpha_k = \frac{\exp(e_k)}{\sum_{t=1}^{K} \exp(e_t)}, \quad e_k = g(h_k)    (5)

where g is a feed-forward neural network with a single output node. Finally, the discriminator produces a value indicating the likelihood of the input dialog being a successful one:

D(d) = \sigma(W_o d + b_o)    (6)

where W_o and b_o are the weights and bias in the discriminator output layer, and \sigma is the logistic function.
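The sketch below is a possible PyTorch-style rendering of this reward estimator (Figure 2, Eqs. 4-6), showing the four pooling options. The class and parameter names are assumptions for illustration; only the structure follows the description above.

# Hypothetical PyTorch sketch of the dialog reward estimator in Section 3.2.
import torch
import torch.nn as nn


class DialogRewardEstimator(nn.Module):
    def __init__(self, turn_dim, hidden_dim, pooling="max"):
        super().__init__()
        self.bilstm = nn.LSTM(turn_dim, hidden_dim, bidirectional=True,
                              batch_first=True)
        self.attn = nn.Linear(2 * hidden_dim, 1)   # g(.) in Eq. 5
        self.out = nn.Linear(2 * hidden_dim, 1)    # W_o, b_o in Eq. 6
        self.pooling = pooling

    def forward(self, turns):
        # turns: [batch, K, turn_dim], each turn encodes [U_k; E_k; A_k].
        h, _ = self.bilstm(turns)                  # h_k = [fwd_k; bwd_k]
        if self.pooling == "last":                 # BiLSTM-last: d = [fwd_K; bwd_1]
            hidden = h.size(-1) // 2
            d = torch.cat([h[:, -1, :hidden], h[:, 0, hidden:]], dim=-1)
        elif self.pooling == "max":                # BiLSTM-max
            d, _ = h.max(dim=1)
        elif self.pooling == "avg":                # BiLSTM-avg
            d = h.mean(dim=1)
        else:                                      # BiLSTM-attn, Eqs. 4-5
            alpha = torch.softmax(self.attn(h), dim=1)
            d = (alpha * h).sum(dim=1)
        # D(d) = sigma(W_o d + b_o)                 (Eq. 6)
        return torch.sigmoid(self.out(d)).squeeze(-1)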

3.3 Adversarial Learning with Policy Gradient

Once we obtain a dialog sample initiated by the agent and a dialog reward from the reward function, we optimize the dialog agent using REINFORCE (Williams, 1992) with the given reward.


The reward D(d) is only received at the end of a dialog, i.e. r_K = D(d). We discount this final reward with a discount factor \gamma \in [0, 1) to assign a reward R_k to each dialog turn. The objective function can thus be written as

J_k(\theta_G) = \mathbb{E}_{\theta_G}[R_k] = \mathbb{E}_{\theta_G}\left[\sum_{t=k}^{K} \gamma^{t-k} r_t - V(s_k)\right],

with r_k = D(d) for k = K and r_k = 0 for k < K. V(s_k) is the state value function, which serves as a baseline value. The state value function is a feed-forward neural network with a single-node value output. We optimize the generator parameters \theta_G to maximize J_k(\theta_G). With the likelihood ratio gradient estimator, the gradient of J_k(\theta_G) can be derived as:

\nabla_{\theta_G} J_k(\theta_G) = \nabla_{\theta_G} \mathbb{E}_{\theta_G}[R_k]
  = \sum_{a_k \in A} G(a_k \mid \cdot)\, \nabla_{\theta_G} \log G(a_k \mid \cdot)\, R_k
  = \mathbb{E}_{\theta_G}[\nabla_{\theta_G} \log G(a_k \mid \cdot)\, R_k]    (7)

where G(a_k \mid \cdot) = G(a_k \mid s_k, v_k, E_k; \theta_G). The expression above gives us an unbiased gradient estimator. We sample the agent action a_k following a softmax policy at each dialog turn and compute the policy gradient. At the same time, we update the discriminator parameters \theta_D to maximize the probability of assigning the correct labels to the successful dialogs from human demonstrations and the dialogs conducted by the machine agent:

\nabla_{\theta_D} \left[ \mathbb{E}_{d \sim \theta_{\mathrm{demo}}}[\log(D(d))] + \mathbb{E}_{d \sim \theta_G}[\log(1 - D(d))] \right]    (8)

We continue to update both the dialog agent and the reward function via dialog simulation or real user interaction until convergence. The overall training procedure is summarized in Algorithm 1.

Algorithm 1 Adversarial Learning for Task-Oriented Dialog

1: Required: dialog corpus S_demo, user simulator U, generator G, discriminator D
2: Pretrain a dialog agent (i.e. the generator) G on the dialog corpus S_demo with MLE
3: Simulate dialogs S_simu between U and G
4: Sample successful dialogs S(+) and random dialogs S(-) from {S_demo, S_simu}
5: Pretrain a reward function (i.e. the discriminator) D with S(+) and S(-)    (Eq. 8)
6: for number of training iterations do
7:     for G-steps do
8:         Simulate dialogs S_b between U and G
9:         Compute reward r for each dialog in S_b with D    (Eq. 6)
10:        Update G with reward r    (Eq. 7)
11:    end for
12:    for D-steps do
13:        Sample dialogs S(b+) from S(+)
14:        Update D with S(b+) and S_b (with S_b as negative examples)    (Eq. 8)
15:    end for
16: end for
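A rough sketch of one training iteration of Algorithm 1 (Eqs. 7-8) is given below. It assumes a hypothetical helper simulate_dialog() that runs one dialog against the user simulator and returns per-turn action log-probabilities, per-turn states, and an encoded dialog for the discriminator; the value baseline network and its own training are also outside the paper's description and are only indicated in comments.

# Hypothetical sketch of the G-step and D-step updates in Algorithm 1.
import torch
import torch.nn.functional as F


def generator_step(agent, value_net, discriminator, simulator,
                   optimizer, gamma=0.95, batch_size=25):
    optimizer.zero_grad()
    loss = 0.0
    for _ in range(batch_size):
        log_probs, states, dialog_enc = simulate_dialog(agent, simulator)
        reward = discriminator(dialog_enc).detach()       # r_K = D(d), Eq. 6
        K = len(log_probs)
        for k in range(K):
            R_k = (gamma ** (K - 1 - k)) * reward         # discounted final reward
            # Advantage R_k - V(s_k); value_net is trained separately
            # (e.g. by regression to the returns), omitted here.
            advantage = R_k - value_net(states[k]).detach()
            loss = loss - log_probs[k] * advantage        # REINFORCE, Eq. 7
    (loss / batch_size).backward()
    optimizer.step()


def discriminator_step(discriminator, pos_dialogs, neg_dialogs, optimizer):
    optimizer.zero_grad()
    pos_scores = discriminator(pos_dialogs)               # human / successful dialogs
    neg_scores = discriminator(neg_dialogs)               # dialogs from the agent
    # Maximize log D(d+) + log(1 - D(d-))                  (Eq. 8)
    loss = F.binary_cross_entropy(pos_scores, torch.ones_like(pos_scores)) + \
           F.binary_cross_entropy(neg_scores, torch.zeros_like(neg_scores))
    loss.backward()
    optimizer.step()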

4 Experiments

4.1 Dataset

We use data from the second Dialog State Tracking Challenge (DSTC2) (Henderson et al., 2014) in the restaurant search domain for our model training and evaluation. We add entity information to each dialog sample in the original DSTC2 dataset. This makes entity information a part of the model training process, enabling the agent to handle entities during interactive evaluation with users. Different from the agent action definition used in DSTC2, actions in our system are produced by concatenating the act and slot types in the original dialog act output (e.g. "confirm(food = italian)" maps to "confirm food"). The slot values are captured in the belief tracking outputs. Table 1 shows the statistics of the dataset used in our experiments.

# of train/dev/test dialogs      1612 / 506 / 1117
# of dialog turns on average     7.88
# of slot value options:
    Area                         5
    Food                         91
    Price range                  3

Table 1: Statistics of the DSTC2 dataset.
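As a concrete illustration of the action-set construction described above (where, e.g., "confirm(food = italian)" maps to the action "confirm food", with the slot value left to belief tracking), a minimal helper might look as follows. This helper and the choice of separator are hypothetical, not part of the released system.

# Hypothetical helper: build a system action label from an act and its slot types,
# dropping slot values (which are handled by the belief tracker).
def to_system_action(act, slots):
    """e.g. to_system_action("confirm", {"food": "italian"}) -> "confirm_food"."""
    return "_".join([act] + sorted(slots))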

4.2 Training Settings

We use a user simulator for our interactive training and evaluation with adversarial learning. Instead of using a rule-based user simulator as in much prior work (Zhao and Eskenazi, 2016; Peng et al., 2017), in our study we use a model-based simulator trained on the DSTC2 dataset. We follow the design and training procedures of (Liu and Lane, 2017b) in building the model-based simulator. The stochastic policy used in the simulator introduces additional diversity in user behavior during dialog simulation.

Before performing interactive adversarial learning with RL, we pretrain the dialog agent and the discriminative reward function with offline supervised learning on the DSTC2 dataset. We find this helpful in enabling the adversarial policy learning to start from a good initialization. The dialog agent is pretrained to minimize the cross-entropy losses on agent action and slot value predictions. Once we obtain a supervised training dialog agent, we simulate dialogs between the agent and the user simulator. These simulated dialogs, together with the dialogs in the DSTC2 dataset, are then used to pretrain the discriminative reward function. We sample 500 successful dialogs as positive examples and 500 random dialogs as negative examples in pretraining the discriminator. During dialog simulation, a dialog is marked as successful if the agent's belief tracking outputs fully match the informable (Henderson et al., 2013) user goal slot values, and all user requested slots are fulfilled. This is the same evaluation criterion as used in (Wen et al., 2017) and (Liu and Lane, 2017b). It is important to note that such a dialog success signal is usually not available during real user interactions, unless we explicitly ask users to provide such feedback.

During supervised pretraining, we use an LSTM with a state size of 150 for the dialog agent. The hidden layer size of the policy network MLP is set to 100. For the discriminator model, a state size of 200 is used for the bidirectional LSTM. We perform mini-batch training with a batch size of 32 using the Adam optimization method (Kingma and Ba, 2014) with an initial learning rate of 1e-3. Dropout (p = 0.5) is applied during model training to prevent the model from over-fitting. The gradient clipping threshold is set to 5.

During interactive learning with adversarial RL, we set the maximum allowed number of dialog turns to 20. A simulation is forced to terminate after 20 dialog turns. We update the model with every mini-batch of 25 samples. Dialog rewards are calculated by the discriminative reward function. The reward discount factor γ is set to 0.95. These rewards are used to update the agent model via policy gradient. At the same time, this mini-batch of simulated dialogs is used as negative examples to update the discriminator.
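For reference, the training settings reported above can be collected into a single configuration object. The key names below are ours; the values are the ones stated in this section and in the pretraining description.

# Reported training settings from Section 4.2, gathered into one dict (key names assumed).
TRAINING_CONFIG = {
    # supervised pretraining
    "agent_lstm_state_size": 150,
    "policy_mlp_hidden_size": 100,
    "discriminator_bilstm_state_size": 200,
    "batch_size": 32,
    "optimizer": "Adam",
    "learning_rate": 1e-3,
    "dropout": 0.5,
    "gradient_clip": 5.0,
    # discriminator pretraining data
    "pretrain_positive_dialogs": 500,
    "pretrain_negative_dialogs": 500,
    # interactive adversarial RL
    "max_dialog_turns": 20,
    "rl_batch_size": 25,
    "reward_discount_gamma": 0.95,
}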

4.3 Results and Analysis

In this section, we present and discuss our empirical evaluation results. We first compare dialog agents trained using the proposed adversarial reward to those using a human designed reward and an oracle reward. We then discuss the impact of discriminator model design and model pretraining on the adversarial learning performance. Last but not least, we discuss the potential issue of covariate shift during interactive adversarial learning and show how we address it with partial access to user feedback.

4.3.1 Comparison to Other Reward Types

We first compare the performance of a dialog agent using the adversarial reward to those using the designed reward and the oracle reward, in terms of dialog success rate. Designed reward refers to a reward function that is designed by humans with domain knowledge. In our experiment, based on the dialog success criteria defined in section 4.2, we design the following reward function for RL policy learning:

• +1 for each informable slot that is correctly estimated by the agent at the end of a dialog.

• If ALL informable slots are tracked correctly, +1 for each requestable slot successfully handled by the agent.

In addition to the comparison to the human designed reward, we further compare to the case of using an oracle reward during agent policy optimization. Using the oracle reward refers to having access to the final dialog success status. We apply a reward of +1 for a successful dialog and a reward of 0 for a failed dialog. The performance of the agent using the oracle reward serves as an upper bound for those using other types of reward. For the adversarial reward curve, we use BiLSTM-max as the discriminator model. During RL training, we normalize the rewards produced by the different reward functions.
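A minimal sketch of the two hand-specified reward signals compared here is given below. The function signatures are hypothetical; the scoring rules follow the two bullet points and the oracle definition above, and the reward normalization mentioned above is omitted.

# Sketch of the designed and oracle rewards from Section 4.3.1 (signatures assumed).
def designed_reward(informable_correct, num_informable, requestable_handled):
    """+1 per correctly estimated informable slot; if all informable slots are
    correct, +1 more per successfully handled requestable slot."""
    reward = informable_correct
    if informable_correct == num_informable:
        reward += requestable_handled
    return float(reward)


def oracle_reward(dialog_success):
    """+1 for a successful dialog, 0 for a failed one."""
    return 1.0 if dialog_success else 0.0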

Figure 3: RL policy optimization performance comparing adversarial reward, designed reward, and oracle reward.

Figure 3 shows the RL learning curves for models trained using the different reward functions. The dialog success rate at each evaluation point is calculated by averaging over the success status of 1000 dialog simulations at that point. The pretrain baseline in the figure refers to the supervised pretraining model. This model does not get updated during interactive learning, and thus the curve stays flat during the RL training cycle. As shown in these curves, all three types of reward functions lead to an improved dialog success rate along the RL training process. The agent trained with the designed reward falls behind the agent trained with the oracle reward by a large margin. This shows that a reward designed with domain knowledge may not fully align with the final evaluation criteria. Designing a reward function that provides an agent enough supervision signal and also aligns well with the final system objective is not a trivial task (Popov et al., 2017). In practice, it is often difficult to specify exactly what we expect an agent to do, and we usually end up with simple and imperfect measures. In our experiment, the agent using the adversarial reward achieves a 7.4% improvement in dialog success rate at the end of 6000 interactive dialog learning episodes, outperforming the agent using the designed reward (4.2%). This shows the advantage of performing adversarial training in learning directly from expert demonstrations and in addressing the challenge of designing a good reward function. Another important observation from our experiments is that RL agents trained with the adversarial reward, although they achieve higher performance in the end, suffer from larger variance and instability in model performance during the RL training process, compared to agents using the human designed reward. This is because during RL training the agent interfaces with a moving target, rather than a fixed objective measure as in the case of the designed reward or the oracle reward. The model performance only gradually stabilizes when both the dialog agent and the reward function are close to convergence.

4.3.2 Impact of Discriminator Model Design

We study the impact of different discriminator model designs on the adversarial learning performance. We compare the four pooling methods described in section 3.2 for producing the final dialog representation. Table 2 shows the offline evaluation results on 1000 simulated test dialog samples. Among the four pooling methods, max-pooling over the bidirectional LSTM outputs achieves the best classification accuracy in our experiment. Max-pooling also assigns the highest probability to successful dialogs in the test set compared to the other pooling methods. The attention-pooling based LSTM model achieves the lowest performance across all three offline evaluation metrics in our study. This is probably due to the limited number of training samples used in pretraining the discriminator. Learning good attentions usually requires more data samples, and the model may thus overfit the small training set. We observe similar trends during interactive learning evaluation: the attention-based discriminator leads to divergence of policy optimization more often than the other three pooling methods. The max-pooling discriminator gives the most stable performance during interactive RL training.

Model          Prediction Accuracy   Success Prob.   Fail Prob.
BiLSTM-last    0.674                 0.580           0.275
BiLSTM-max     0.706                 0.588           0.272
BiLSTM-avg     0.688                 0.561           0.268
BiLSTM-attn    0.652                 0.541           0.285

Table 2: Performance of different discriminator model designs, on prediction accuracy and the probabilities assigned to successful and failed dialogs.

4.3.3 Impact of Annotated Dialogs for Discriminator Training

Annotating dialogs for model training requires additional human effort. We investigate the impact of the number of annotated dialog samples on discriminator model training. The amount of annotated dialogs required for learning a good discriminator depends mainly on the complexity of a task. Given the rather simple nature of the slot filling based DSTC2 restaurant search task, we experiment with annotating 100 to 1000 discriminator training samples. We use the BiLSTM-max discriminator model in these experiments. The adversarial RL training curves with different levels of discriminator training samples are shown in Figure 4. As these results illustrate, with 100 annotated dialogs as positive samples for discriminator training, the discriminator is not able to produce dialog rewards that are useful in learning a good policy. Learning with 250 positive samples does not lead to concrete improvement in dialog success rate either. With a growing number of annotated samples, the dialog agent becomes more likely to learn a better policy, resulting in a higher dialog success rate at the end of the interactive learning sessions.

Figure 4: Impact of discriminator training sample size on RL dialog learning performance.

4.3.4 Partial Access to User Feedback

Figure 5: Addressing covariate shift in online adversarial dialog learning with partial access to user feedback.

A potential issue with RL based interactive adversarial learning is the covariate shift (Ross and Bagnell, 2010; Ho and Ermon, 2016) problem. The positive examples for discriminator training are generated before the interactive learning cycle, based on the supervised pretraining dialog policy. During interactive RL training, the agent's policy keeps getting updated. The newly generated dialog samples based on the updated policy may be equally good compared to the initial set of positive dialogs, but they may look very different. In this case, the discriminator is likely to give these dialogs low rewards, as the patterns presented in these dialogs differ from what the discriminator was initially trained on. With these negative rewards, the agent will thus be discouraged from producing such types of successful dialogs in the future. To address this covariate shift issue, we apply a DAgger (Ross et al., 2011) style imitation learning method to the dialog adversarial learning. We assume that during interactive learning with users, we can occasionally receive feedback from users indicating the quality of the conversation they had with the agent. We then add those dialogs with good feedback to the pool of positive dialog samples used in discriminator model training. With this, the discriminator can learn to assign high rewards to such good dialogs in the future. In our empirical evaluation, we experiment with the agent receiving positive feedback 10% and 20% of the time during its interaction with users. The experimental results are shown in Figure 5. As illustrated by these curves, the proposed DAgger style learning method effectively improves dialog adversarial learning with RL, leading to a higher dialog success rate.
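A minimal sketch of this DAgger-style update of the discriminator's positive pool is given below. The helper name and the way feedback availability is modeled (a fixed probability per dialog) are assumptions for illustration, not details from the paper.

# Hypothetical sketch: add user-approved dialogs to the positive pool (Section 4.3.4).
import random


def update_positive_pool(dialog, positive_pool, get_user_feedback,
                         feedback_rate=0.1):
    """With probability `feedback_rate` (10% or 20% in the experiments above) the
    user reveals whether the dialog was good; good dialogs join the positive
    examples used to retrain the discriminator."""
    if random.random() < feedback_rate:
        if get_user_feedback(dialog):      # user judged the dialog as successful
            positive_pool.append(dialog)
    return positive_pool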

5 Conclusions

In this work, we investigate the effectiveness of applying adversarial training in learning task-oriented dialog models. The proposed method is an attempt towards addressing the rating consistency and learning efficiency issues in online dialog policy learning with user feedback. We show that with a limited number of annotated dialogs, the proposed adversarial learning method can effectively learn a reward function and use it to guide policy optimization with policy gradient based reinforcement learning. In an experiment in a restaurant search domain, we show that the proposed adversarial learning method achieves a higher dialog success rate compared to baseline methods using other forms of reward. We further discuss the covariate shift issue during interactive adversarial learning and show how we can address it with partial access to user feedback.


References

Antoine Bordes and Jason Weston. 2017. Learning end-to-end goal-oriented dialog. In International Conference on Learning Representations.

Emily L Denton, Soumith Chintala, Rob Fergus, et al. 2015. Deep generative image models using a Laplacian pyramid of adversarial networks. In Advances in Neural Information Processing Systems, pages 1486–1494.

Bhuwan Dhingra, Lihong Li, Xiujun Li, Jianfeng Gao, Yun-Nung Chen, Faisal Ahmed, and Li Deng. 2017. Towards end-to-end reinforcement learning of dialogue agents for information access. In Proceedings of ACL.

Layla El Asri, Romain Laroche, and Olivier Pietquin. 2014. Task completion transfer learning for reward inference. In Proceedings of MLIS.

Milica Gasic, Catherine Breslin, Matthew Henderson, Dongho Kim, Martin Szummer, Blaise Thomson, Pirros Tsiakoulis, and Steve Young. 2013. On-line policy optimisation of Bayesian spoken dialogue systems via human interaction. In ICASSP, pages 8367–8371. IEEE.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680.

Matthew Henderson, Blaise Thomson, and Jason Williams. 2013. Dialog state tracking challenge 2 & 3.

Matthew Henderson, Blaise Thomson, and Jason Williams. 2014. The second dialog state tracking challenge. In SIGDIAL.

Jonathan Ho and Stefano Ermon. 2016. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pages 4565–4573.

Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. In ICLR.

Jiwei Li, Michel Galley, Chris Brockett, Georgios P Spithourakis, Jianfeng Gao, and Bill Dolan. 2016. A persona-based neural conversation model. In Proceedings of ACL.

Jiwei Li, Will Monroe, Tianlin Shi, Alan Ritter, and Dan Jurafsky. 2017a. Adversarial learning for neural dialogue generation. In Proceedings of ACL.

Xuijun Li, Yun-Nung Chen, Lihong Li, and Jianfeng Gao. 2017b. End-to-end task-completion neural dialogue systems. arXiv preprint arXiv:1703.01008.

Bing Liu and Ian Lane. 2017a. An end-to-end trainable neural network model with belief tracking for task-oriented dialog. In Interspeech.

Bing Liu and Ian Lane. 2017b. Iterative policy learning in end-to-end trainable task-oriented neural dialog models. In Proceedings of the 2017 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).

Bing Liu, Gokhan Tur, Dilek Hakkani-Tur, Pararth Shah, and Larry Heck. 2018. Dialogue learning with human teaching and feedback in end-to-end trainable task-oriented dialogue systems. In NAACL.

Andrew Y Ng, Stuart J Russell, et al. 2000. Algorithms for inverse reinforcement learning. In ICML, pages 663–670.

Tim Paek and Roberto Pieraccini. 2008. Automating spoken dialogue management design using machine learning: An industry perspective. Speech Communication, 50(8-9):716–729.

Baolin Peng, Xiujun Li, Jianfeng Gao, Jingjing Liu, Yun-Nung Chen, and Kam-Fai Wong. 2018. Adversarial advantage actor-critic model for task-completion dialogue policy learning. In ICASSP.

Baolin Peng, Xiujun Li, Lihong Li, Jianfeng Gao, Asli Celikyilmaz, Sungjin Lee, and Kam-Fai Wong. 2017. Composite task-completion dialogue policy learning via hierarchical deep reinforcement learning. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2231–2240.

Ivaylo Popov, Nicolas Heess, Timothy Lillicrap, Roland Hafner, Gabriel Barth-Maron, Matej Vecerik, Thomas Lampe, Yuval Tassa, Tom Erez, and Martin Riedmiller. 2017. Data-efficient deep reinforcement learning for dexterous manipulation. arXiv preprint arXiv:1704.03073.

Stephane Ross and Drew Bagnell. 2010. Efficient reductions for imitation learning. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 661–668.

Stephane Ross, Geoffrey Gordon, and Drew Bagnell. 2011. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 627–635.

Iulian V Serban, Alessandro Sordoni, Yoshua Bengio, Aaron Courville, and Joelle Pineau. 2016. Building end-to-end dialogue systems using generative hierarchical neural network models. In Proceedings of the 30th AAAI Conference on Artificial Intelligence (AAAI-16).

Pei-Hao Su, Paweł Budzianowski, Stefan Ultes, Milica Gasic, and Steve Young. 2017. Sample-efficient actor-critic reinforcement learning with supervised data for dialogue management. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pages 147–157, Saarbrücken, Germany. Association for Computational Linguistics.


Pei-Hao Su, Milica Gasic, Nikola Mrksic, Lina Rojas-Barahona, Stefan Ultes, David Vandyke, Tsung-Hsien Wen, and Steve Young. 2016. On-line active reward learning for policy optimisation in spoken dialogue systems. In Proceedings of ACL.

Pei-Hao Su, David Vandyke, Milica Gasic, Dongho Kim, Nikola Mrksic, Tsung-Hsien Wen, and Steve Young. 2015. Learning from real users: Rating dialogue success with neural networks for reinforcement learning in spoken dialogue systems. In Interspeech.

Marilyn A Walker, Diane J Litman, Candace A Kamm, and Alicia Abella. 1997. PARADISE: A framework for evaluating spoken dialogue agents. In Proceedings of the Eighth Conference of the European Chapter of the Association for Computational Linguistics, pages 271–280. Association for Computational Linguistics.

Tsung-Hsien Wen, David Vandyke, Nikola Mrksic, Milica Gasic, Lina M. Rojas-Barahona, Pei-Hao Su, Stefan Ultes, and Steve Young. 2017. A network-based end-to-end trainable task-oriented dialogue system. In Proceedings of EACL.

Jason D Williams, Kavosh Asadi, and Geoffrey Zweig. 2017. Hybrid code networks: Practical and efficient end-to-end dialog control with supervised and reinforcement learning. In ACL.

Jason D Williams and Geoffrey Zweig. 2016. End-to-end LSTM-based dialog control optimized with supervised and reinforcement learning. arXiv preprint arXiv:1606.01269.

Ronald J Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256.

Zhaojun Yang, Gina-Anne Levow, and Helen Meng. 2012. Predicting user satisfaction in spoken dialog system evaluation with collaborative filtering. IEEE Journal of Selected Topics in Signal Processing, 6(8):971–981.

Zhen Yang, Wei Chen, Feng Wang, and Bo Xu. 2017. Improving neural machine translation with conditional sequence generative adversarial nets. arXiv preprint arXiv:1703.04887.

Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1480–1489.

Steve Young, Milica Gasic, Blaise Thomson, and Jason D Williams. 2013. POMDP-based statistical spoken dialog systems: A review. Proceedings of the IEEE, 101(5):1160–1179.

Tiancheng Zhao and Maxine Eskenazi. 2016. Towards end-to-end learning for dialog state tracking and management using deep reinforcement learning. In SIGDIAL.