
Proceedings of Machine Learning Research - Sequence to Better Sequence…proceedings.mlr.press/v70/mueller17a/mueller17a.pdf · 2018-10-24 · Sequence to Better Sequence: Continuous


Sequence to Better Sequence: Continuous Revision of Combinatorial Structures

Jonas Mueller 1 David Gifford 1 Tommi Jaakkola 1

Abstract

We present a model that, after learning on observations of (sequence, outcome) pairs, can be efficiently used to revise a new sequence in order to improve its associated outcome. Our framework requires neither example improvements, nor additional evaluation of outcomes for proposed revisions. To avoid combinatorial search over sequence elements, we specify a generative model with continuous latent factors, which is learned via joint approximate inference using a recurrent variational autoencoder (VAE) and an outcome-predicting neural network module. Under this model, gradient methods can be used to efficiently optimize the continuous latent factors with respect to inferred outcomes. By appropriately constraining this optimization and using the VAE decoder to generate a revised sequence, we ensure the revision is fundamentally similar to the original sequence, is associated with better outcomes, and looks natural. These desiderata are proven to hold with high probability under our approach, which is empirically demonstrated for revising natural language sentences.

Introduction

The success of recurrent neural network (RNN) models in complex tasks like machine translation and audio synthesis has inspired immense interest in learning from sequence data (Eck & Schmidhuber, 2002; Graves, 2013; Sutskever et al., 2014; Karpathy, 2015). Comprised of elements $s_t \in \mathcal{S}$, which are typically symbols from a discrete vocabulary, a sequence $x = (s_1, \dots, s_T) \in \mathcal{X}$ has length $T$, which can vary between instances. Sentences are a popular example of such data, where each $s_j$ is a word from the language. In many domains, only a tiny fraction of $\mathcal{X}$ (the set of possible sequences over a given vocabulary) represents sequences likely to be found in nature (i.e., those which appear realistic). For example: a random sequence of words will almost never form a coherent sentence that reads naturally, and a random amino-acid sequence is highly unlikely to specify a biologically active protein.

1 MIT Computer Science & Artificial Intelligence Laboratory. Correspondence to: J. Mueller <[email protected]>.

Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).

In this work, we consider applications where each sequence $x$ is associated with a corresponding outcome $y \in \mathbb{R}$. For example: a news article title or Twitter post can be associated with the number of shares it subsequently received online, or the amino-acid sequence of a synthetic protein can be associated with its clinical efficacy. We operate under the standard supervised learning setting, assuming availability of a dataset $\mathcal{D}_n = \{(x_i, y_i)\}_{i=1}^{n} \overset{\text{iid}}{\sim} p_{XY}$ of sequence-outcome pairs. The marginal distribution $p_X$ is assumed as a generative model of the natural sequences, and may be concentrated in a small subspace of $\mathcal{X}$. Throughout this paper, $p$ denotes both density and distribution functions, depending on the referenced variable.

After fitting models to $\mathcal{D}_n$, we are presented a new sequence $x_0 \in \mathcal{X}$ (with unknown outcome), and our goal is to quickly identify a revised version that is expected to have a superior outcome. Formally, we seek the revised sequence:

$$x^* = \arg\max_{x \in \mathcal{C}_{x_0}} \mathbb{E}[Y \mid X = x] \quad (1)$$

Here, we want the set $\mathcal{C}_{x_0}$ of feasible revisions to ensure that $x^*$ remains natural and is merely a minor revision of $x_0$. Under a generative modeling perspective, these two goals are formalized as the following desiderata: $p_X(x^*)$ is not too small, and $x^*$ and $x_0$ share similar underlying latent characteristics. When revising a sentence, for example, it is imperative that the revision reads naturally (has reasonable likelihood under the distribution of realistic sentences) and retains the semantics of the original.
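As a toy illustration of the objective in (1), consider a feasible set small enough to enumerate: the revision problem is then just an argmax of the expected outcome over candidates. Everything below (the candidate sentences, the scores, the `best_revision` helper) is invented for illustration; in practice neither the feasible set nor $\mathbb{E}[Y \mid X = x]$ is known, which is the difficulty the rest of the paper addresses.

```python
# Hypothetical sketch: pick the feasible revision with the highest expected
# outcome E[Y | X = x]. Candidates and scores are entirely made up.
def best_revision(feasible_set, expected_outcome):
    """Solve x* = argmax_{x in C_{x0}} E[Y | X = x] by enumeration."""
    return max(feasible_set, key=expected_outcome)

# Toy feasible set of revisions of "good movie tonight" with made-up scores.
scores = {"good movie tonight": 1.0,
          "great movie tonight": 2.5,
          "fine movie tonight": 0.7}
revised = best_revision(list(scores), scores.get)
```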

This optimization is difficult because the constraint set and objective may be highly complex and are both unknown (they must be learned from data). For many types of sequences, such as sentences, standard distance measures applied directly in the space of $\mathcal{X}$ or $\mathcal{S}$ (e.g., Levenshtein distance or TF-IDF similarity) are inadequate to capture meaningful similarities, even though these can be faithfully reflected by a simple metric over an appropriately learned space of continuous latent factors (Mueller & Thyagarajan, 2016). In this work, we introduce a generative-modeling framework which transforms (1) into a simpler differentiable optimization by leveraging continuous-valued latent representations learned using neural networks. After the generative model has been fit, our proposed procedure can efficiently revise any new sequence in a manner that satisfies the aforementioned desiderata (with high probability).

Related Work

Unlike imitation learning, our setting does not require availability of improved versions of a particular sequence. This prevents direct application of a sequence-to-sequence model (Sutskever et al., 2014). Similar to our approach, Gomez-Bombarelli et al. (2016) also utilize latent autoencoder representations in order to propose novel chemical structures via Bayesian optimization. However, unlike sequential bandit/reinforcement-learning settings, our learner sees no outcomes outside of the training data, neither for the new sequence it is asked to revise, nor for any of its proposed revisions of said sequence (Mueller et al., 2017). Our methods only require an easily assembled dataset of sequence-outcome pairs and are thus widely applicable.

Combinatorial structures are often optimized via complex search heuristics such as genetic programming (Zaefferer et al., 2014). However, search relies on evaluating isolated changes in each iteration, whereas good revisions of a sequence are often made over a larger context (e.g., altering a phrase in a sentence). From the vast number of possibilities, such revisions are unlikely to be found by search procedures, and it is generally observed that such methods are outperformed by gradient-based optimization in high-dimensional continuous settings. Unlike combinatorial search, our framework leverages gradients in order to efficiently find good revisions at test time. Simonyan et al. (2014) and Nguyen et al. (2015) also proposed gradient-based optimization of inputs with respect to neural predictions, but work in this vein has been focused on conditional generation (rather than revision) and is primarily restricted to the continuous image domain (Nguyen et al., 2016).

Methods

To identify good revisions, we first map our stochastic combinatorial optimization problem into a continuous space where the objective and constraints exhibit a simpler form. We assume the data are generated by the probabilistic graphical model in Figure 1A. Here, latent factors $Z \in \mathbb{R}^d$ specify a (continuous) configuration of the generative process for $X, Y$ (both sequences and outcomes), and we adopt the prior $p_Z = \mathcal{N}(0, I)$. Relationships between these variables are summarized by the maps $F, E, D$, which we parameterize using three neural networks $\mathcal{F}, \mathcal{E}, \mathcal{D}$ trained to enable efficient approximate inference under this model. The first step of our framework is to fit this model to $\mathcal{D}_n$

Figure 1. (A) Assumed graphical model (shaded nodes indicate observed variables; dashed arrows are learned neural network mappings). (B) Procedure for revising a given $x_0$ to produce $x^*$ with superior expected outcome.

by learning the parameters of these inference networks: the encoder $\mathcal{E}$, the decoder $\mathcal{D}$, and the outcome-predictor $\mathcal{F}$. A good model that facilitates high-quality revision under our framework will possess the following properties: (1) $Y$ can efficiently be inferred from $Z$, and this relationship obeys a smooth functional form; (2) the map $D$ produces a realistic sequence $x$ given any $z$ with reasonable prior probability; (3) the distribution of natural sequences is geometrically simple in the latent $Z$-space. We explicitly encourage (1) by choosing $\mathcal{F}$ as a fairly simple feedforward network, (2) by defining $D$ as the most-likely $x$ given $z$, and (3) by endowing $Z$ with our simple $\mathcal{N}(0, I)$ prior.

Another characteristic desired of our $Z$-representations is that they encode meaningful sequence features, such that two fundamentally similar sequences are likely to have been generated from neighboring $z$-values. Applied to image data, VAE models similar to ours have been found to learn latent representations that disentangle salient characteristics such as scale, rotation, and other independent visual concepts (Higgins et al., 2016). The latent representations of recurrent architectures trained on text (similar to the models used here) have also been shown to encode meaningful semantics, with a strong correlation between distances in the latent space and human-judged similarity between texts (Mueller & Thyagarajan, 2016). By exploiting such simplified geometry, a basic shift in the latent vector space may be able to produce higher-quality revisions than attempts to directly manipulate the combinatorial space of sequence elements.

After fitting a model with these desirable qualities, our strategy to revise a given sequence $x_0 \in \mathcal{X}$ is outlined in Figure 1B. First, we compute its latent representation $z_0 = E(x_0)$ using a trained encoding map. As the latent representations $z$ are continuous, we can employ efficient gradient-based optimization to find a nearby local optimum $z^*$ of $F(z)$ (within a simple constraint set around $z_0$ defined later on). To $z^*$, we subsequently apply a simple decoding map $D$ (defined with respect to our learned model) in order to obtain our revised sequence $x^*$. Under our assumed model, the optimization in latent representation-space attempts to identify a generative configuration which produces large values of $Y$ (as inferred via $F$). The subsequent decoding step seeks the most likely sequence produced by the optimized setting of the latent factors.

Variational Autoencoder

For approximate inference in the $X, Z$ relationship, we leverage the variational autoencoder (VAE) model of Kingma & Welling (2014). In our VAE, a generative model of sequences is specified by our prior over the latent values $z$ combined with a likelihood function $p_D(x \mid z)$, which our decoder network $\mathcal{D}$ outputs in order to evaluate the likelihood of any sequence $x$ given $z \in \mathbb{R}^d$. Given any sequence $x$, our encoder network $\mathcal{E}$ outputs a variational approximation $q_E(z \mid x)$ of the true posterior over the latent values $p(z \mid x) \propto p_D(x \mid z)\, p_Z(z)$. As advocated by Kingma & Welling (2014) and Bowman et al. (2016), we employ the variational family $q_E(z \mid x) = \mathcal{N}(\mu_{z|x}, \Sigma_{z|x})$ with diagonal covariance. Our revision methodology employs the encoding procedure $E(x) = \mu_{z|x}$, which maps a sequence to the maximum a posteriori (MAP) configuration of the latent values $z$ (as estimated by the encoder network $\mathcal{E}$).

The parameters of $\mathcal{E}, \mathcal{D}$ are learned using stochastic variational inference to maximize a lower bound on the marginal likelihood of each observation in the training data:

$$\log p_X(x) \ge -\big[\mathcal{L}_{\text{rec}}(x) + \mathcal{L}_{\text{pri}}(x)\big] \quad (2)$$
$$\mathcal{L}_{\text{rec}}(x) = -\mathbb{E}_{q_E(z \mid x)}\big[\log p_D(x \mid z)\big]$$
$$\mathcal{L}_{\text{pri}}(x) = \mathrm{KL}\big(q_E(z \mid x) \,\|\, p_Z\big)$$

Defining $\sigma_{z|x} = \mathrm{diag}(\Sigma_{z|x})$, the prior-enforcing Kullback-Leibler divergence has a differentiable closed-form expression when $q_E, p_Z$ are diagonal Gaussian distributions. The reconstruction term $\mathcal{L}_{\text{rec}}$ (i.e., the negative log-likelihood under the decoder model) is efficiently approximated using just one Monte-Carlo sample $z \sim q_E(z \mid x)$. To optimize the variational lower bound over our data $\mathcal{D}_n$ with respect to the parameters of the neural networks $\mathcal{E}, \mathcal{D}$, we use stochastic gradients of (2) obtained via backpropagation and the reparameterization trick of Kingma & Welling (2014).
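For concreteness, the two terms of (2) can be estimated as follows for a diagonal-Gaussian posterior and standard-normal prior. This is a generic VAE sketch of our own (not the authors' released code), with the hypothetical `log_p_x_given_z` standing in for the decoder likelihood $p_D(x \mid z)$:

```python
# Illustrative sketch of the two ELBO terms in Eq. (2) for
# q_E(z|x) = N(mu, diag(sigma^2)) and prior p_Z = N(0, I).
import numpy as np

def kl_diag_gaussian_to_std_normal(mu, sigma):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) )."""
    return 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - 2.0 * np.log(sigma))

def neg_elbo(mu, sigma, log_p_x_given_z, rng):
    """L_rec + L_pri estimated with one Monte-Carlo sample (reparameterized)."""
    eps = rng.standard_normal(mu.shape)
    z = mu + sigma * eps                 # reparameterization trick
    l_rec = -log_p_x_given_z(z)          # one-sample reconstruction term
    l_pri = kl_diag_gaussian_to_std_normal(mu, sigma)
    return l_rec + l_pri
```

When the posterior exactly matches the prior ($\mu = 0$, $\sigma = 1$) the KL term vanishes, which is the sanity check one would expect from the closed form.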

Throughout, our encoder/decoder models $\mathcal{E}, \mathcal{D}$ are recurrent neural networks (RNNs). RNNs adapt standard feedforward neural networks for sequence data $x = (s_1, \dots, s_T)$, where at each time-step $t \in \{1, \dots, T\}$, a fixed-size hidden-state vector $h_t \in \mathbb{R}^d$ is updated based on the next element in the input sequence. To produce the approximate posterior for a given $x$, our encoder network $\mathcal{E}$ appends the following additional layers to the final RNN hidden state (parameterized by $W_\mu, W_\sigma, W_v, b_\mu, b_\sigma, b_v$):

$$\mu_{z|x} = W_\mu h_T + b_\mu \in \mathbb{R}^d$$
$$\sigma_{z|x} = \exp(-|W_\sigma v + b_\sigma|), \quad v = \mathrm{ReLU}(W_v h_T + b_v) \quad (3)$$
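A minimal sketch of the output head in (3) (with hypothetical weight shapes standing in for learned parameters) makes the variance restriction explicit: the $\exp(-|\cdot|)$ parameterization keeps every entry of $\sigma_{z|x}$ in $(0, 1]$:

```python
# Sketch of the encoder output layers in Eq. (3). h_T is the final RNN hidden
# state; all weights here are illustrative stand-ins for learned parameters.
import numpy as np

def encoder_head(h_T, W_mu, b_mu, W_v, b_v, W_sig, b_sig):
    mu = W_mu @ h_T + b_mu                       # posterior mean mu_{z|x}
    v = np.maximum(W_v @ h_T + b_v, 0.0)         # v = ReLU(W_v h_T + b_v)
    sigma = np.exp(-np.abs(W_sig @ v + b_sig))   # forced into (0, 1]
    return mu, sigma
```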

The (squared) elements of $\sigma_{z|x} \in \mathbb{R}^d$ form the diagonal of our approximate-posterior covariance $\Sigma_{z|x}$. Since $\mathcal{L}_{\text{pri}}$ is minimized at $\sigma_{z|x} = \vec{1}$ and $\mathcal{L}_{\text{rec}}$ is likely to worsen with additional variance in encodings (as our posterior approximation is unimodal), we simply do not consider $\sigma_{z|x}$ values that exceed 1 in our variational family. This restriction results in more stable training and also encourages the encoder and decoder to co-evolve such that the true posterior is likely closer to unimodal with variance $\le 1$.

To evaluate the likelihood of a sequence, RNN $\mathcal{D}$ computes not only its hidden state $h_t$, but also the additional output:

$$\pi_t = \mathrm{softmax}(W_\pi h_t + b_\pi) \quad (4)$$

At each position $t$, $\pi_t$ estimates $p(s_t \mid s_1, \dots, s_{t-1})$ by relying on $h_t$ to summarize the sequence history. By the factorization $p(s_1, \dots, s_T) = \prod_{t=1}^{T} p(s_t \mid s_{t-1}, \dots, s_1)$, we have $p_D(x \mid z) = \prod_{t=1}^{T} \pi_t[s_t]$, which is calculated by specifying an initial hidden state $h_0 = z$ and feeding $x = (s_1, \dots, s_T)$ into $\mathcal{D}$. From a given latent configuration $z$, our revisions are produced by decoding a sequence via the most-likely observation, which we denote as the map:

$$D(z) = \arg\max_{x \in \mathcal{X}} p_D(x \mid z) \quad (5)$$

While the most-likely decoding in (5) is itself a combinatorial problem, beam search can exploit the sequential factorization of $p(x \mid z)$ to efficiently find a good approximate solution (Wiseman & Rush, 2016; Sutskever et al., 2014). For $x^* = D(z) \in \mathcal{X}$, this decoding strategy seeks to ensure neither $p_X(x^*)$ nor $p(z \mid x^*)$ is too small.
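A minimal beam-search sketch for approximating (5) is shown below. It is generic, not the authors' implementation: the hypothetical `step_log_probs(prefix)` stands in for the decoder RNN's per-position distribution $\pi_t$, which in the actual model would be conditioned on $z$ through $h_0 = z$:

```python
# Minimal beam-search sketch for the decoding map D(z) in Eq. (5).
import math

def beam_search(step_log_probs, vocab, max_len, beam_width=3, eos="</s>"):
    beams = [((), 0.0)]                       # (prefix, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix and prefix[-1] == eos:  # finished hypotheses carry over
                candidates.append((prefix, score))
                continue
            logps = step_log_probs(prefix)
            for s in vocab:
                candidates.append((prefix + (s,), score + logps[s]))
        beams = sorted(candidates, key=lambda c: -c[1])[:beam_width]
    return max(beams, key=lambda c: c[1])[0]
```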

Compositional Prediction of Outcomes

In addition to the VAE component, we fit a compositional outcome-prediction model which uses a standard feedforward neural network $\mathcal{F}$ to implement the map $F : \mathbb{R}^d \to \mathbb{R}$. It is assumed that $F(z) = \mathbb{E}[Y \mid Z = z]$ under our generative model. Rather than integrating over $Z$ to compute $\mathbb{E}[Y \mid X = x] = \int F(z)\, q_E(z \mid x)\, dz$, we employ the first-order Taylor approximation $F(E(x))$, where the approximation error shrinks the more closely $F$ resembles an affine transformation. To ensure this approximate-inference step accurately estimates the conditional expectation, we jointly train $\mathcal{E}$ and $\mathcal{F}$ with the loss:

$$\mathcal{L}_{\text{mse}}(x, y) = \big[y - F(E(x))\big]^2 \quad (6)$$

If the architecture of networks $\mathcal{E}, \mathcal{F}$ is specified with sufficient capacity to capture the underlying conditional relationship, then we should have $F(E(x)) \approx \mathbb{E}[Y \mid X = x]$ after properly learning the network parameters from a sufficiently large dataset (even if $F$ is a nonlinear map).
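The Taylor-approximation claim can be sanity-checked numerically: for an affine $F$, a Monte-Carlo estimate of $\int F(z)\, q_E(z \mid x)\, dz$ matches the plug-in value $F(\mu_{z|x}) = F(E(x))$. This check is our own illustration (the affine $F$ below is an arbitrary stand-in for the predictor):

```python
# Monte-Carlo estimate of E_{q_E(z|x)}[F(z)] for a diagonal-Gaussian posterior.
# For affine F this equals F(mu) exactly, so the plug-in F(E(x)) is error-free.
import numpy as np

def mc_expected_outcome(F, mu, sigma, n=50000, seed=0):
    rng = np.random.default_rng(seed)
    z = mu + sigma * rng.standard_normal((n, len(mu)))
    return float(np.mean([F(zi) for zi in z]))
```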


Enforcing Invariance

In theory, it is possible that some dimensions of $z$ pertain solely to the outcome $y$ and do not have any effect on the decoded sequence $D(z)$. Happening to learn this sort of latent representation would be troubling, since subsequent optimization of the inferred $y$ with respect to $z$ might not actually lead to a superior revised sequence. To mitigate this issue, we carefully ensure the dimensionality $d$ of our latent $Z$ does not significantly exceed the bottleneck capacity needed to produce accurate outcome predictions and VAE reconstructions (Gupta et al., 2016). We explicitly suppress this undesirable scenario by adding the following loss to guide training of our neural networks:

$$\mathcal{L}_{\text{inv}} = \mathbb{E}_{z \sim p_Z}\big[F(z) - F(E(D(z)))\big]^2 \quad (7)$$

When optimizing neural network parameters with respect to this loss, we treat the parameters of $\mathcal{D}$ and the left-hand $F(z)$ term as fixed, solely backpropagating Monte-Carlo estimated gradients into $\mathcal{E}, \mathcal{F}$. Driving $\mathcal{L}_{\text{inv}}$ toward 0 ensures our outcome predictions remain invariant to variation introduced by the encoding-decoding process (and this term also serves as a practical regularizer to enforce additional smoothness in our learned functions).
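The expectation in (7) is estimated by Monte-Carlo sampling from the prior. A small sketch (the `F`, `E`, `D` arguments below are arbitrary stand-ins for the trained networks; in training, gradients would flow only into $\mathcal{E}, \mathcal{F}$ with $\mathcal{D}$ and the left-hand $F(z)$ held fixed):

```python
# Monte-Carlo sketch of the invariance loss in Eq. (7): penalize drift in the
# predicted outcome across one encode-decode round trip.
import numpy as np

def l_inv(F, E, D, num_samples=64, dim=2, seed=0):
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((num_samples, dim))    # z ~ p_Z = N(0, I)
    drift = np.array([F(zi) - F(E(D(zi))) for zi in z])
    return float(np.mean(drift**2))
```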

Joint Training

The parameters of all components of this model ($q_E$, $p_D$, and $F$) are learned jointly in an end-to-end fashion. Training is done via stochastic gradient descent applied to minimize the following objective over the examples in $\mathcal{D}_n$:

$$\mathcal{L}(x, y) = \mathcal{L}_{\text{rec}} + \lambda_{\text{pri}} \mathcal{L}_{\text{pri}} + \frac{\lambda_{\text{mse}}}{\sigma_Y^2} \mathcal{L}_{\text{mse}} + \frac{\lambda_{\text{inv}}}{\sigma_Y^2} \mathcal{L}_{\text{inv}} \quad (8)$$

where $\sigma_Y^2$ denotes the (empirical) variance of the outcomes, and the $\lambda \ge 0$ are constants chosen to balance the relative weight of each goal so that the overall framework produces maximally useful revisions. By setting $\lambda_{\text{mse}} = \lambda_{\text{inv}} = 0$ at first, we can optionally leverage a separate large corpus of unlabeled examples to initially train only the VAE component of our architecture, as in the unsupervised pre-training strategy used successfully by Kiros et al. (2015) and Erhan et al. (2010).
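The combined objective (8) is a straightforward weighted sum; a sketch follows (the particular $\lambda$ values used in the test are placeholders, not values reported in the paper):

```python
# Sketch of the joint training objective in Eq. (8). sigma2_Y is the empirical
# outcome variance; the lambda weights are hyperparameters to be tuned.
def joint_loss(l_rec, l_pri, l_mse, l_inv, lam_pri, lam_mse, lam_inv, sigma2_Y):
    return (l_rec + lam_pri * l_pri
            + (lam_mse / sigma2_Y) * l_mse
            + (lam_inv / sigma2_Y) * l_inv)
```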

In practice, we found the following training strategy to work well, in which numerous mini-batch stochastic gradient updates (typically 10-30 epochs) are applied within every one of these steps:

Step 1: Begin with $\lambda_{\text{inv}} = \lambda_{\text{pri}} = 0$, so $\mathcal{L}_{\text{rec}}$ and $\mathcal{L}_{\text{mse}}$ are the only training objectives. We found that regardless of the precise value specified for $\lambda_{\text{mse}}$, both $\mathcal{L}_{\text{rec}}$ and $\mathcal{L}_{\text{mse}}$ were often driven to their lowest possible values during this joint optimization (verified by training individually against each objective).

Step 2: Grow $\lambda_{\text{pri}}$ from 0 to 1 following the sigmoid annealing schedule proposed by Bowman et al. (2016), which is needed to ensure the variational sequence-to-sequence model does not simply ignore the encodings $z$ (note that the formal variational lower bound is attained at $\lambda_{\text{pri}} = 1$).

Step 3: Gradually increase $\lambda_{\text{inv}}$ linearly until $\mathcal{L}_{\text{inv}}$ becomes small on average across our Monte-Carlo samples $z \sim p_Z$. Here, $p_D$ is treated as constant with respect to $\mathcal{L}_{\text{inv}}$, and each mini-batch used in stochastic gradient descent is chosen to contain the same number of Monte-Carlo samples for estimating $\mathcal{L}_{\text{inv}}$ as (sequence, outcome) pairs.
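The weight schedules in Steps 2-3 might look as follows; the rate, midpoint, and ramp constants below are illustrative choices of ours, not settings reported by the authors:

```python
# Sketch of the weight schedules in training Steps 2-3: a sigmoid ramp for
# lambda_pri (Bowman et al., 2016) and a linear ramp for lambda_inv.
import math

def lambda_pri(step, midpoint=1000, rate=0.005):
    """Sigmoid annealing from ~0 up to 1, centered at `midpoint`."""
    return 1.0 / (1.0 + math.exp(-rate * (step - midpoint)))

def lambda_inv(step, start=2000, ramp=1000, max_val=1.0):
    """Linear ramp from 0 to max_val, beginning at `start`."""
    return max_val * min(max(step - start, 0) / ramp, 1.0)
```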

Proposing Revisions

While the aforementioned training procedure is computationally intensive, once learned, our neural networks can be leveraged for efficient inference. Given a user-specified constant $\alpha > 0$ and a to-be-revised sequence $x_0$, we propose the revision $x^*$ output by the following procedure.

REVISE Algorithm
Input: sequence $x_0 \in \mathcal{X}$, constant $\alpha \in \big(0, |2\pi\Sigma_{z|x_0}|^{-1/2}\big)$
Output: revised sequence $x^* \in \mathcal{X}$
1) Use $\mathcal{E}$ to compute $q_E(z \mid x_0)$
2) Define $\mathcal{C}_{x_0} = \{z \in \mathbb{R}^d : q_E(z \mid x_0) \ge \alpha\}$
3) Find $z^* = \arg\max_{z \in \mathcal{C}_{x_0}} F(z)$ (gradient ascent)
4) Return $x^* = D(z^*)$ (beam search)
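An end-to-end numeric sketch of steps 2-3 on a toy two-dimensional latent space is given below. This is our own simplification: it keeps iterates inside the ellipsoid with a projection step (the paper itself prefers the log-barrier penalty it describes next), `Sigma_diag` plays the role of the encoder's posterior covariance diagonal, `grad_F` is a stand-in for the predictor's gradient, and decoding (step 4) is elided:

```python
# Toy sketch of REVISE steps 2-3: constrained gradient ascent over the latent
# ellipsoid C_{x0}. All inputs are illustrative stand-ins.
import numpy as np

def revise_latent(z0, Sigma_diag, grad_F, radius2, lr=0.1, steps=200):
    """Maximize F over {z : (z - z0)^T Sigma^{-1} (z - z0) <= radius2}."""
    z = z0.copy()
    for _ in range(steps):
        z = z + lr * grad_F(z)                # gradient ascent on F
        d = z - z0
        m = float(d @ (d / Sigma_diag))       # squared Mahalanobis distance
        if m > radius2:                       # left the ellipsoid: project back
            z = z0 + d * np.sqrt(radius2 / m)
    return z
```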

Intuitively, the level-set constraint $\mathcal{C}_{x_0} \subset \mathbb{R}^d$ ensures that $z^*$, the latent configuration from which we decode $x^*$, is likely similar to the latent characteristics responsible for the generation of $x_0$. Assuming $x_0$ and $x^*$ share similar latent factors implies these sequences are fundamentally similar according to the generative model. Note that $z^* = E(x_0)$ is always a feasible solution of the latent-factor optimization over $z \in \mathcal{C}_{x_0}$ (for any allowed value of $\alpha$). Furthermore, this constrained optimization is easy under our Gaussian approximate posterior, since $\mathcal{C}_{x_0}$ forms a simple ellipsoid centered around $E(x_0)$.
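The equivalence between the level set and the ellipsoid can be checked numerically: for a diagonal Gaussian, $q_E(z \mid x_0) \ge \alpha$ holds exactly when the Mahalanobis quadratic form is at most the constant $K = -2\log[(2\pi)^{d/2}|\Sigma_{z|x_0}|^{1/2}\alpha]$ that reappears in (9). This check is our own illustration:

```python
# Numeric check that {z : q_E(z|x0) >= alpha} coincides with the ellipsoid
# {z : (z - mu)^T Sigma^{-1} (z - mu) <= K} for a diagonal Gaussian.
import numpy as np

def gauss_density(z, mu, sig2):
    d = len(mu)
    norm = (2.0 * np.pi) ** (d / 2) * np.sqrt(np.prod(sig2))
    return np.exp(-0.5 * (z - mu) @ ((z - mu) / sig2)) / norm

def in_level_set(z, mu, sig2, alpha):
    return bool(gauss_density(z, mu, sig2) >= alpha)

def in_ellipsoid(z, mu, sig2, alpha):
    d = len(mu)
    K = -2.0 * np.log((2.0 * np.pi) ** (d / 2) * np.sqrt(np.prod(sig2)) * alpha)
    return bool((z - mu) @ ((z - mu) / sig2) <= K)
```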

To find $z^*$ in Step 3 of the REVISE procedure, we use gradient ascent initialized at $z = E(x_0)$, which can quickly reach a local maximum if $F$ is parameterized by a simple feedforward network. Starting the search at $E(x_0)$ makes most sense for unimodal posterior approximations like our Gaussian $q_E$. To ensure all iterates remain in the feasible region $\mathcal{C}_{x_0}$, we instead take gradient steps with respect to a penalized objective $F(z) + \mu \cdot J(z)$, where:

$$J(z) = \log\Big[K - (z - E(x_0))^T \Sigma_{z|x_0}^{-1} (z - E(x_0))\Big]$$
$$K = -2 \log\big[(2\pi)^{d/2} |\Sigma_{z|x_0}|^{1/2} \alpha\big] \quad (9)$$

and $0 < \mu \ll 1$ is gradually decreased toward 0 to ensure the optimization can approach the boundary of $\mathcal{C}_{x_0}$. In terms of resulting revision quality, we found this log-barrier method outperformed other standard first-order techniques for constrained optimization, such as the projected gradient and Frank-Wolfe algorithms.
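A toy sketch of the penalized ascent with the barrier (9) follows. The step sizes, annealing schedule, and the simple step-rejection rule are our own illustrative choices, not the authors' settings:

```python
# Toy sketch of log-barrier ascent around Eq. (9): maximize F(z) + mu * J(z)
# while shrinking mu toward 0, so iterates can approach the boundary of C_{x0}.
import numpy as np

def barrier_ascent(z0, Sigma_diag, grad_F, K, lr=0.05, outer=6, inner=100, mu0=1.0):
    z = z0.copy()
    mu = mu0
    for _ in range(outer):
        for _ in range(inner):
            d = (z - z0) / Sigma_diag
            q = K - float((z - z0) @ d)            # barrier argument, > 0 while feasible
            grad_J = -2.0 * d / q                  # gradient of log(K - quadratic form)
            z_new = z + lr * (grad_F(z) + mu * grad_J)
            dn = (z_new - z0) / Sigma_diag
            if K - float((z_new - z0) @ dn) <= 0:  # reject steps that leave C_{x0}
                break
            z = z_new
        mu *= 0.1                                  # anneal barrier weight toward 0
    return z
```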

In principle, our revision method can operate on the latent representations of a traditional deterministic autoencoder for sequences, such as the seq2seq models of Sutskever et al. (2014) and Cho et al. (2014). However, the VAE offers numerous practical advantages, some of which are highlighted by Bowman et al. (2016) in the context of generating more-coherent sentences. The posterior uncertainty of the VAE encourages the network to smoothly spread the training examples across the support of the latent distribution. In contrast, central regions of the latent space under a traditional autoencoder can contain holes (to which no examples are mapped), and it is not straightforward to avoid these in our optimization of $z^*$. Furthermore, we introduce an adaptive variant of our decoder in §S1 which is designed to avoid poor revisions in cases where the initial sequence is already not reconstructed properly: $D(E(x_0)) \ne x_0$.

Theoretical Properties of Revision

Here, we theoretically characterize properties of revisions obtained via our REVISE procedure (all proofs are relegated to §S3 in the Supplementary Material). Our results imply that in an ideal setting where our neural network inference approximations are exact, the revisions proposed by our method are guaranteed to satisfy our previously stated desiderata: $x^*$ is associated with an expected outcome increase, $x^*$ appears natural (has nontrivial probability under $p_X$ whenever $x_0$ is a natural sequence), and $x^*$ is likely to share similar latent characteristics with $x_0$ (since $x^*$ is the most likely observation generated from $z^*$, and $q_E(z^* \mid x_0) \ge \alpha$ by design). Although exact approximations are unrealistic in practice, our theory precisely quantifies the expected degradation in the quality of proposed revisions that accompanies a decline in either the accuracy of our approximate inference techniques or the marginal likelihood of the original sequence to revise.

Theorems 1 and 2 below ensure that for an initial sequence $x_0$ drawn from the natural distribution, the likelihood under $p_X$ of the revised sequence $x^*$ output by our REVISE procedure has a lower bound determined by the user parameter $\alpha$ and the probability of the original sequence $p_X(x_0)$. Thus, when revising a sequence $x_0$ which looks natural (has substantial probability under $p_X$), our procedure is highly likely to produce a revised sequence $x^*$ which also looks natural. The strength of this guarantee can be precisely controlled by choosing $\alpha$ appropriately large in applications where this property is critical.

In each high-probability statement, our bounds assume the initial to-be-revised sequence $x_0$ stems from the natural distribution $p_X$, and each result holds for any fixed constant $\delta > 0$. We first introduce the following assumptions:

(A1) For $\delta > 0, \alpha > 0$, there exists $0 < \beta \le 1$ such that:
i. With probability $\ge 1 - \delta/2$ (over $x \sim p_X$): $p(z \mid x) \ge \beta \cdot q_E(z \mid x)$ whenever $q_E(z \mid x) \ge \alpha$
ii. $\Pr(Z \notin B_{R/2}(0)) \ge \beta \cdot \Pr(\tilde{Z} \notin B_{R/2}(0))$

where $Z \sim \mathcal{N}(0, I)$, and $\tilde{Z} \sim q_Z$, the average encoding distribution defined by Hoffman & Johnson (2016) as:

$$q_Z(z) = \mathbb{E}_{x \sim p_X}\big[q_E(z \mid x)\big] \quad (10)$$

$B_R(0) = \{z \in \mathbb{R}^d : \|z\| \le R\}$ denotes the Euclidean ball centered around 0, with radius $R$ defined here as:

$$R = \max\{R_1, R_2\} \quad (11)$$

with $R_1 = \sqrt{-8 \log\big[\alpha \cdot (2\pi)^{d/2}\big]}$, $\;R_2 = \max\{\tilde{R}_2, 2\}$, $\;\tilde{R}_2 = \sqrt{8 - \tfrac{1}{4d} \log\big(\tfrac{\beta\delta}{8}\big)}$

(A2) There exists $\eta > 0$ (depending on $\delta$) such that with probability $\ge 1 - \delta/2$ (over $x_0 \sim p_X$): $p(z^* \mid x^*) \le \eta$

This means the latent posterior is bounded at $x^*, z^*$ (as defined in REVISE), where both depend upon the initial to-be-revised sequence $x_0$.

Theorem 1. For any $\delta > 0$, (A1) and (A2) imply:

$$p_X(x^*) \ge \frac{\alpha\beta}{\eta} \cdot p_X(x_0)$$

with probability $\ge 1 - \delta$ (over $x_0 \sim p_X$).

Condition (A1) forms a generalization of absolute continuity, and is required since little can be guaranteed about our inference procedures if the variational posterior is too inaccurate. Equality holds in (A1) with probability 1 if the variational distributions $q_E$ exactly represent the true posterior ($\beta \to 1$ as the variational approximations become more accurate over the measure $p_X$). In practice, minimization of the reverse KL divergence ($\mathcal{L}_{\text{pri}}$) used in our VAE formulation ensures that $q_E(z \mid x)$ is small wherever the true posterior $p(z \mid x)$ takes small values (Blei et al., 2017).

While the bound in Theorem 1 has a particularly simple form, this result hinges on assumption (A2). One can show, for example, that the inequality in (A2) is satisfied if the posteriors $p(z \mid x^*)$ are Lipschitz continuous functions of $z$ at $z^*$ (sharing one Lipschitz constant over all possible $x^*$). In general however, (A2) heavily depends on both the data distribution $p_X$ and the decoder model $p_D$. Therefore, we provide a similar lower bound guarantee on the likelihood of our revision $x^*$ under $p_X$, which instead relies only on the weaker assumption (A3) below.

(A3) There exists $L > 0$ such that for each $x \in \mathcal{X}$: $p_D(x \mid z)$ is an $L$-Lipschitz function of $z$ over $B_{R+1}(0)$.


Here, $L$ depends on $\delta$ (through $R$), and we assume $L \ge 1$ without loss of generality. (A3) is guaranteed to hold in the setting where we only consider sequences of finite length $\le T$. This is because the probability output by our decoder model, $p_D(x \mid z)$, is differentiable with bounded gradients over all $z \in B_R(0)$ under any sequence-to-sequence RNN architecture which can be properly trained using gradient methods. Since $B_{R+1}(0) \subset \mathbb{R}^d$ is a closed, bounded set, $p_D(x \mid z)$ must be Lipschitz continuous over this set, for a given value of $x$. We can simply define $L$ to be the largest Lipschitz constant over the $|\mathcal{S}|^T$ possible choices of $x \in \mathcal{X}$ (where $|\mathcal{S}|$ is the size of the vocabulary). In the theorem below, the user-specified constant $\alpha > 0$ is defined in REVISE, and $L, \beta, R$ all depend on $\delta$.

Theorem 2. For any $\delta > 0$, if (A1) and (A3) hold, then

with probability • 1 ´ � (over x0 „ pX

):

pX

px˚q • Ce´R

Ld

¨“� ¨ ↵ ¨ p

X

px0q‰d`1

where constant C “ ⇡d{2

�pd

2 ` 1q ¨ pd ` 1qdpd ` 2qd`1
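To get a feel for how quickly the Theorem 2 guarantee weakens with the latent dimension, the constant C and the full bound can be evaluated directly (a small numerical aid for reading the theorem, not part of the method itself):

```python
import math

def theorem2_constant(d):
    """C = pi^(d/2) / (Gamma(d/2 + 1) * (d+1)^d * (d+2)^(d+1)),
    the dimension-dependent constant in the Theorem 2 lower bound."""
    return math.pi ** (d / 2) / (
        math.gamma(d / 2 + 1) * (d + 1) ** d * (d + 2) ** (d + 1))

def theorem2_lower_bound(p_x0, alpha, gamma, R, L, d):
    """Lower bound on p_X(x*): (C e^{-R} / L^d) * [gamma * alpha * p_X(x0)]^(d+1)."""
    return (theorem2_constant(d) * math.exp(-R) / L ** d
            * (gamma * alpha * p_x0) ** (d + 1))
```

For d = 1 the constant equals exactly 1/9, and it decays super-exponentially in d, which is consistent with the later observation (Figure S1A) that these theoretical lower bounds are overly stringent in practice.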

Our final result, Theorem 3, ensures that our optimization of z* with respect to F is tied to the expected outcomes at x* = D(z*), so that large improvements in the optimization objective F(z*) − F(E(x0)) imply that our revision procedure likely produces large expected improvements in the outcome E[Y | X = x*] − E[Y | X = x0]. For this result, we make the following assumptions:

(A4) For any δ > 0, there exists λ > 0 such that Pr(X ∈ K) ≥ 1 − δ/2, where we define:

K = {x ∈ X : x0 = x ⟹ p_X(x*) ≥ λ}   (12)

as the subset of sequences whose improved versions produced by our REVISE procedure remain natural with likelihood ≥ λ. Note that either Theorem 1 or 2 (with the corresponding assumptions) ensures that one can suitably define λ such that (A4) is satisfied (by considering a sufficiently large finite subset of X).

(A5) For any ρ > 0, there exists ε_mse > 0 such that Pr(X ∈ E_mse) > 1 − ρ, where we define:

E_mse = {x ∈ X : |F(E(x)) − E[Y | X = x]| ≤ ε_mse}   (13)

(A6) For any δ > 0, there exists ε_inv > 0 such that:

|F(z) − F(E(D(z)))| ≤ ε_inv for all z ∈ B_R(0) ⊂ R^d

where R is defined in (11) and depends on δ.

Here, ε_mse and ε_inv quantify the approximation error of our neural networks for predicting expected outcomes and ensuring encoding-decoding invariance with respect to F.

Standard learning theory implies both ε_mse and ε_inv will be driven toward 0 if we use neural networks with sufficient capacity to substantially reduce L_mse and L_inv over a large training set.

Theorem 3. For any δ > 0, if conditions (A1), (A4), (A5), and (A6) hold, then with probability ≥ 1 − δ − ρ:

Δ_{z*} − ε ≤ F(z*) − F(E(x0)) ≤ Δ_{z*} + ε   (14)

where Δ_{z*} = E[Y | X = x*] − E[Y | X = x0] and ε = ε_inv + 2·ε_mse.

Here, λ and ε_inv are defined in terms of δ as specified in (A4) and (A6), and ε_mse is defined in terms of ρ as specified in (A5).

Experiments

All of our RNNs employ the Gated Recurrent Unit (GRU) of Cho et al. (2014), which contains a simple gating mechanism to effectively learn long-range dependencies across a sequence. Throughout, F is a simple feedforward network with 1 hidden layer and tanh activations (note that the popular ReLU activation is inappropriate for F since it has zero gradient over half its domain). Decoding with respect to p_D is done entirely greedily (i.e., a beam search of size 1) to demonstrate that our approach is not reliant on search heuristics. §S2 contains additional details for each analysis.
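Greedy decoding keeps only the single highest-probability token at each step. A minimal sketch of this loop, where `decoder_step` is a hypothetical stand-in for one step of the GRU decoder (not the paper's implementation):

```python
import numpy as np

def greedy_decode(decoder_step, init_state, start_id, eos_id, max_len=50):
    """Beam search of size 1: repeatedly feed back the argmax token
    until the end-of-sequence token is produced (or max_len is hit)."""
    state, tok, out = init_state, start_id, []
    for _ in range(max_len):
        logits, state = decoder_step(tok, state)  # one decoder step
        tok = int(np.argmax(logits))              # keep only the best token
        if tok == eos_id:
            break
        out.append(tok)
    return out

# Toy decoder that deterministically emits tokens 2, 3, then EOS (id 0):
def toy_step(tok, t):
    target = [2, 3, 0][min(t, 2)]
    logits = np.full(4, -10.0)
    logits[target] = 10.0
    return logits, t + 1

print(greedy_decode(toy_step, 0, start_id=1, eos_id=0))  # [2, 3]
```

A beam search of size k would instead track the k best partial sequences at each step; collapsing to k = 1 isolates the quality of the latent representation from the strength of the search heuristic.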

Simulation Study

To study our methods in a setting where all aspects of performance can be quantified, we construct a natural distribution p_X over sequences of lengths 10-20 whose elements stem from the vocabulary S = {A, B, ..., I, J}. Each sequence is generated via the probabilistic grammar of Table S1. For each sequence, the associated outcome y is simply the number of times A appears in the sequence (a completely deterministic relationship). Since A often follows C and is almost always followed by B under p_X, a procedure to generate natural revisions cannot simply insert/substitute A symbols at random positions.
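Because the simulated outcome is deterministic, both the outcome and the rescaled improvement Δ_Y used below can be computed exactly (a sketch; sigma_y would be estimated from the training outcomes):

```python
def outcome(seq):
    """Simulated outcome: the number of 'A' symbols in the sequence."""
    return sum(1 for s in seq if s == 'A')

def rescaled_improvement(x_star, x0, sigma_y):
    """Delta_Y(x*) = (E[Y | X=x*] - E[Y | X=x0]) / sigma_y; here the outcome
    is deterministic, so the conditional expectations are exact counts."""
    return (outcome(x_star) - outcome(x0)) / sigma_y

print(outcome("CABAB"))                             # 2
print(rescaled_improvement("CABAB", "CBBAB", 2.0))  # 0.5
```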

Table 1 compares various methods for proposing revisions. Letting σ_Y denote the standard deviation of outcomes in D_n, we evaluate each proposed x* using a rescaled version of the actual underlying outcome-improvement: Δ_Y(x*) = σ_Y^{−1}·(E[Y | X = x*] − E[Y | X = x0]). Except where sample size is explicitly listed, all models were trained using n = 10,000 (sequence, outcome) pairs sampled from the generative grammar. Wherever appropriate, the different methods all make use of the same neural network components with latent dimension d = 128. Other than α, all hyperparameters of each revision method described below were chosen so that, over 1000 revisions, the Levenshtein (edit) distance d(x*, x0) ≈ 3.3 on average.
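The Levenshtein distance used to quantify edit size can be computed with the standard dynamic program (a self-contained sketch):

```python
def levenshtein(a, b):
    """Minimum number of single-symbol insertions, deletions, and
    substitutions needed to turn sequence a into sequence b."""
    prev = list(range(len(b) + 1))          # distances from a[:0] to prefixes of b
    for i, ca in enumerate(a, 1):
        curr = [i]                          # distance from a[:i] to the empty prefix
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion from a
                            curr[j - 1] + 1,             # insertion into a
                            prev[j - 1] + (ca != cb)))   # substitution (or match)
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```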


Model              | Δ_Y(x*)    | −log p_X(x*) | d(x*, x0)
-------------------|------------|--------------|----------
log α = −10000     | 0.51 ±0.55 | 29.0 ±9.3    | 3.3 ±3.4
n = 1000           | 0.15 ±0.44 | 32.0 ±9.4    | 2.8 ±3.4
n = 100            | 0.02 ±0.30 | 37.0 ±9.7    | 4.2 ±4.0
-------------------|------------|--------------|----------
log α = −1         | 0.20 ±0.39 | 28.2 ±7.6    | 1.4 ±2.2
ADAPTIVE           | 0.47 ±0.49 | 28.8 ±9.0    | 3.1 ±3.4
λ_inv = λ_pri = 0  | 0.05 ±0.68 | 30.4 ±8.4    | 3.3 ±3.5
SEARCH             | 0.45 ±0.51 | 29.0 ±9.4    | 3.2 ±1.4

Table 1. Results for revisions x* produced by different methods in our simulation study (averaged over the same test set of 1000 starting sequences x0 ∼ p_X, with ±1 standard deviation shown and the best results in bold).

All three results above the line in Table 1 are based on the full model described in our joint training procedure, with new sequences proposed via our REVISE algorithm (using the setting log α = −10000). In the latter two results, this model was only trained on a smaller subset of the data. We also generated revisions via this same procedure with the more conservative choice log α = −1. ADAPTIVE denotes the same approach (with log α = −10000), this time using the adaptive decoding D_{x0} introduced in §S1, which is intended to slightly bias revisions toward x0. The model with λ_inv = λ_pri = 0 is a similar method using a deterministic sequence-to-sequence autoencoder rather than our probabilistic VAE formulation (no variational posterior approximation or invariance-enforcing), where the latent encodings are still jointly trained to predict outcomes via F. Under this model, a revision is proposed by starting at E(x0) in the latent space, taking 1000 (unconstrained) gradient steps with respect to F, and finally applying D to the resulting z.
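The unconstrained latent-space ascent used by this ablation can be sketched on a toy objective (the quadratic F below is purely illustrative; the ablation would follow gradients of the learned outcome network instead):

```python
import numpy as np

def latent_ascent(grad_F, z0, steps=1000, lr=0.01):
    """Plain gradient ascent in the latent space. Note: nothing here
    constrains z to stay in regions the decoder has been trained on."""
    z = np.array(z0, dtype=float)
    for _ in range(steps):
        z += lr * grad_F(z)
    return z

# Toy outcome surrogate F(z) = -||z - target||^2, with gradient 2 * (target - z):
target = np.array([1.0, -2.0])
z_star = latent_ascent(lambda z: 2 * (target - z), np.zeros(2))
print(np.round(z_star, 3))  # converges toward [ 1. -2.]
```

Without the probabilistic constraint of REVISE, this ascent can drift into low-posterior regions of the latent space, which is consistent with the weaker results of the λ_inv = λ_pri = 0 ablation in Table 1.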

The above methods form an ablation study of the various components in our framework. SEARCH is a different, combinatorial approach where we randomly generate 100 revisions by performing 4 random edits in x0 (each individual edit is randomly selected as one of: substitution, insertion, deletion, or no change). In this approach, we separately learn a language-model RNN L on our training sequences (Mikolov et al., 2010). Sharing the same GRU architecture as our decoder model, L directly estimates the likelihood of any given sequence under p_X. Of the randomly generated revisions, we only retain those sequences x for which L(x) ≥ (1/|S|)·L(x0) (in this case, those which are not estimated to be more than 10 times less likely than the original sequence x0 under p_X). Finally, we score each remaining candidate (including x0) using the outcome-prediction model F(E(x)), and the best is chosen as x*.

Model              | Δ_Y(x*)    | Δ_L(x*)   | d(x*, x0)
-------------------|------------|-----------|----------
log α = −10000     | 0.52 ±0.77 | -8.8 ±6.5 | 2.6 ±3.3
log α = −1         | 0.31 ±0.50 | -7.6 ±5.8 | 1.7 ±2.6
ADAPTIVE           | 0.52 ±0.72 | -8.7 ±6.4 | 2.5 ±3.3
λ_inv = λ_pri = 0  | 0.22 ±1.03 | -10.2 ±7.0 | 3.3 ±3.4
SEARCH             | 0.19 ±0.56 | -7.7 ±4.2 | 3.0 ±1.2

Table 2. Results for revised beer-review sentences x* produced by different methods (average ± standard deviation reported over the same held-out set of 1000 initial sentences x0). The third column employs the definition Δ_L(x*) = log L(x*) − log L(x0).

Table 1 shows that our probabilistic VAE formulation outperforms the alternative approaches, both in terms of the outcome-improvement achieved and in ensuring revisions follow p_X. For comparison, −log p_X(x0) had an average value of 26.8 (over these 1000 starting sequences), and changing one randomly-selected symbol in each sequence to A results in an average negative log-probability of 32.8. Thus, all of our revision methods clearly account for p_X to some degree. We find that all components used in our REVISE procedure are useful in achieving superior revisions. While individual standard deviations seem large, nearly all average differences in Δ_Y or −log p_X values produced by the different methods are statistically significant considering they are averaged over 1000 revisions.

From Supplementary Figure S1, it is clear that α controls how conservative the changes proposed by our REVISE procedure tend to be, in terms of both −log p_X(x*) and the edit distance d(x0, x*). The red curve in Figure S1A suggests that our theoretical lower bounds for p_X(x*) are overly stringent in practice (although only the average case is depicted in the figure). The relationship between log p_X(x0) and log p_X(x*) (see Figure S1B) is best fit by a line of slope 1.2, indicating that the linear dependence on p_X(x0) in the Theorem 1 bound for p_X(x*) is reasonably accurate. Figure S1C shows that the magnitude of changes in the latent space (arising from z-optimization during our REVISE procedure) exhibits only a weak correlation with the edit distance between the resulting revision and the original sequence. This implies that a fixed shift in different directions in the latent space can produce drastically different degrees of change in the sequence space. To ensure a high-quality revision, it is thus crucial to carefully treat the (variational) posterior landscape when performing manipulations of Z.

Improving Sentence Positivity

Next, we apply our model to ∼1M reviews from BeerAdvocate (McAuley et al., 2012). Each beer review is parsed into separate sentences, and each sentence is treated as an individual sequence of words. In order to evaluate methods using an outcome that can be obtained for any proposed revision, we choose y ∈ [0, 1] as the VADER sentiment compound score of a given sentence (Hutto & Gilbert,


Model              | Sentence                    | Δ_Y(x*) | Δ_L(x*) | d(x*, x0)
-------------------|-----------------------------|---------|---------|----------
x0                 | this smells pretty bad.     | -       | -       | -
log α = −10000     | smells pretty delightful!   | +2.8    | -0.5    | 3
ADAPTIVE           | smells pretty delightful!   | +2.8    | -0.5    | 3
log α = −1         | i liked this smells pretty. | +2.5    | -2.8    | 3
λ_inv = λ_pri = 0  | pretty this smells bad!     | -0.2    | -3.1    | 3
SEARCH             | wow this smells pretty bad. | +1.9    | -4.6    | 1

Table 3. Example of a held-out beer review x0 (in bold) revised to improve the VADER sentiment. Underneath the original sentence, we show the revision produced by each different method along with the true (rescaled) outcome improvement Δ_Y, the change in estimated marginal likelihood Δ_L, and the edit distance d(x*, x0). Table S2 contains additional examples.

# Steps | Decoded Sentence
--------|------------------------------------
x0      | where are you, henry??
100     | where are you, henry??
1000    | where are you, royal??
5000    | where art thou now?
10000   | which cannot come, you of thee?
x*      | where art thou, keeper??
--------|------------------------------------
x0      | you are both the same size.
100     | you are both the same.
1000    | you are both wretched.
5000    | you are both the king.
10000   | you are both these are very.
x*      | you are both wretched men.

Table 4. Decoding from latent Z configurations encountered at the indicated number of (unconstrained) gradient steps from E(x0), for the model trained to distinguish sentences from Shakespeare vs. contemporary authors. Shown first and last are x0 and the x* returned by our REVISE procedure (constrained with log α = −10000). Table S3 contains additional examples.

2014). VADER is a complex rule-based sentiment analysis tool which jointly estimates the polarity and intensity of English text; larger VADER scores correspond, with high fidelity, to text that humans find more positive.

We applied all aforementioned approaches to produce revisions for a held-out set of 1000 test sentences. As the p_X underlying these sentences is unknown, we report estimates thereof obtained from an RNN language model L learned on the sentences in D_n. Table 2 demonstrates that our VAE approach achieves the greatest outcome-improvement. Moreover, Tables 3 and S2 show that our probabilistically-constrained VAE revision approach produces much more coherent sentences than the other strategies.

Revising Modern Text in the Language of Shakespeare

For our final application, we assemble a dataset of ∼100K short sentences which are either from Shakespeare or from a more contemporary source (details in §S2.3). In this training data, each sentence is labeled with outcome y = 0.9 if it was authored by Shakespeare and y = 0.1 otherwise (these values are chosen to avoid the flat regions of the sigmoid output layer used in network F). When applied in this domain, our REVISE procedure thus attempts to alter a sentence so that the author is increasingly expected to be Shakespeare rather than a more contemporary source.
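The rationale for the 0.9/0.1 targets is easy to check numerically: the sigmoid's derivative σ(t)·(1 − σ(t)) is still appreciable at outputs of 0.9 or 0.1, but essentially vanishes for targets pushed toward 0 or 1 (a small illustrative check, not code from the paper):

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def sigmoid_grad_at_output(y):
    """d(sigmoid)/dt evaluated at the point where sigmoid(t) = y,
    which equals y * (1 - y)."""
    return y * (1.0 - y)

print(sigmoid_grad_at_output(0.9))    # 0.09: still a useful learning signal
print(sigmoid_grad_at_output(0.999))  # ~0.001: nearly flat, gradients vanish
```

Keeping targets away from the saturated ends of the sigmoid thus keeps gradients flowing through F during both training and the subsequent latent-space optimization.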

Tables 4 and S3 show revisions (of held-out sentences) proposed by our REVISE procedure with adaptive decoding (see §S1), together with sentences generated by applying the adaptive decoder at various points along an unconstrained gradient-ascent path in the latent Z space (following gradients of F). Since the data lack similar versions of a sentence written in both contemporary and Shakespearean language, this revision task is an ambitious application of our ideas. Without observing a continuous spectrum of outcomes or leveraging specially-designed style-transfer features (Gatys et al., 2016), our REVISE procedure has to alter the underlying semantics in order to nontrivially increase the expected outcome of the revised sentence under F. Nevertheless, we find that many of the revised sentences look realistic and resemble text written by Shakespeare. Furthermore, these examples demonstrate how the probabilistic constraint in our REVISE optimization prevents the revision-generating latent Z configurations from straying into regions where decodings begin to look very unnatural.

Discussion

This paper presents an efficient method for optimizing discrete sequences when both the objective and constraints are stochastically estimated. Leveraging a latent-variable generative model, our procedure does not require any examples of revisions in order to propose natural-looking sequences with improved outcomes. These characteristics are proven to hold with high probability in a theoretical analysis of VAE behavior under our controlled latent-variable manipulations. However, ensuring semantic similarity in text revisions remains difficult for this approach, and might be improved via superior VAE models or by utilizing additional similarity labels to shape the latent geometry.


References

Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. Variational inference: A review for statisticians. Journal of the American Statistical Association, 2017.

Bowman, S. R., Vilnis, L., Vinyals, O., Dai, A. M., Jozefowicz, R., and Bengio, S. Generating sentences from a continuous space. Conference on Computational Natural Language Learning, 2016.

Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. Learning phrase representations using RNN encoder-decoder for statistical machine translation. Empirical Methods in Natural Language Processing, 2014.

Eck, D. and Schmidhuber, J. A first look at music composition using LSTM recurrent neural networks. IDSIA Technical Report, 2002.

Erhan, D., Bengio, Y., Courville, A., Manzagol, P., Vincent, P., and Bengio, S. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11:625-660, 2010.

Gatys, L. A., Ecker, A. S., and Bethge, M. Image style transfer using convolutional neural networks. Computer Vision and Pattern Recognition, 2016.

Gomez-Bombarelli, R., Duvenaud, D., Hernandez-Lobato, J. M., Aguilera-Iparraguirre, J., Hirzel, T., Adams, R. P., and Aspuru-Guzik, A. Automatic chemical design using a data-driven continuous representation of molecules. arXiv:1610.02415, 2016.

Graves, A. Generating sequences with recurrent neural networks. arXiv:1308.0850, 2013.

Gupta, P., Banchs, R. E., and Rosso, P. Squeezing bottlenecks: Exploring the limits of autoencoder semantic representation capabilities. Neurocomputing, 175:1001-1008, 2016.

Higgins, I., Matthey, L., Glorot, X., Pal, A., Uria, B., Blundell, C., Mohamed, S., and Lerchner, A. Early visual concept learning with unsupervised deep learning. arXiv:1606.05579, 2016.

Hoffman, M. D. and Johnson, M. J. ELBO surgery: yet another way to carve up the variational evidence lower bound. NIPS Workshop on Advances in Approximate Bayesian Inference, 2016.

Hutto, C. J. and Gilbert, E. VADER: A parsimonious rule-based model for sentiment analysis of social media text. Eighth International Conference on Weblogs and Social Media, 2014.

Karpathy, A. The unreasonable effectiveness of recurrent neural networks. Andrej Karpathy blog, 2015. URL karpathy.github.io.

Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. International Conference on Learning Representations, 2014.

Kiros, R., Zhu, Y., Salakhutdinov, R., Zemel, R. S., Torralba, A., Urtasun, R., and Fidler, S. Skip-thought vectors. Advances in Neural Information Processing Systems, 2015.

McAuley, J., Leskovec, J., and Jurafsky, D. Learning attitudes and attributes from multi-aspect reviews. IEEE International Conference on Data Mining, 2012.

Mikolov, T., Karafiat, M., Burget, L., Cernocky, J., and Khudanpur, S. Recurrent neural network based language model. Interspeech, 2010.

Mueller, J. and Thyagarajan, A. Siamese recurrent architectures for learning sentence similarity. Proc. AAAI Conference on Artificial Intelligence, 2016.

Mueller, J., Reshef, D. N., Du, G., and Jaakkola, T. Learning optimal interventions. Artificial Intelligence and Statistics, 2017.

Nguyen, A., Yosinski, J., and Clune, J. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. Computer Vision and Pattern Recognition, 2015.

Nguyen, A., Dosovitskiy, A., Yosinski, J., Brox, T., and Clune, J. Synthesizing the preferred inputs for neurons in neural networks via deep generator networks. Advances in Neural Information Processing Systems, 2016.

Simonyan, K., Vedaldi, A., and Zisserman, A. Deep inside convolutional networks: Visualising image classification models and saliency maps. ICLR Workshop Proceedings, 2014.

Sutskever, I., Vinyals, O., and Le, Q. V. Sequence to sequence learning with neural networks. Advances in Neural Information Processing Systems, 2014.

Wiseman, S. and Rush, A. M. Sequence-to-sequence learning as beam-search optimization. Empirical Methods in Natural Language Processing, 2016.

Zaefferer, M., Stork, J., Friese, M., Fischbach, A., Naujoks, B., and Bartz-Beielstein, T. Efficient global optimization for combinatorial problems. Genetic and Evolutionary Computation Conference, 2014.