Back to the Future: Unsupervised Backprop-based Decoding for Counterfactual and Abductive Commonsense Reasoning

Lianhui Qin†‡  Vered Shwartz†‡  Peter West†‡  Chandra Bhagavatula‡  Jena D. Hwang‡  Ronan Le Bras‡  Antoine Bosselut†‡  Yejin Choi†‡
†Paul G. Allen School of Computer Science & Engineering, University of Washington
‡Allen Institute for Artificial Intelligence
{lianhuiq, pawest, antoineb, yejin}@cs.washington.edu  {vered, chandrab, jenah, ronanlb}@allenai.org
Abstract
Abductive and counterfactual reasoning, core abilities of everyday human cognition, require reasoning about what might have happened at time t, while conditioning on multiple contexts from the relative past and future. However, simultaneous incorporation of past and future contexts using generative language models (LMs) can be challenging, as they are trained either to condition only on the past context or to perform narrowly scoped text-infilling.

In this paper, we propose DELOREAN, a new unsupervised decoding algorithm that can flexibly incorporate both the past and future contexts using only off-the-shelf, left-to-right language models and no supervision. The key intuition of our algorithm is incorporating the future through back-propagation, during which we only update the internal representation of the output while fixing the model parameters. By alternating between forward and backward propagation, DELOREAN can decode the output representation that reflects both the left and right contexts. We demonstrate that our approach is general and applicable to two nonmonotonic reasoning tasks: abductive text generation and counterfactual story revision, where DELOREAN outperforms a range of unsupervised and some supervised methods, based on automatic and human evaluation.1
1 Introduction

Everyday causal reasoning requires reasoning about the likely explanations to partially observable past and future (abductive reasoning (Peirce, 1960)) and reasoning about the alternative future based on a counterfactual past (counterfactual reasoning).

1 Code is available at https://github.com/qkaren/unsup_gen_for_cms_reasoning
Figure 1: DELOREAN, our proposed method, with generated reasoning results. Top: the goal in abductive reasoning is to generate a hypothesis (Y) of what happened between the observed past (X) and future (Z) contexts. Bottom: in counterfactual reasoning, given a story context altered by a counterfactual condition, X, and the original ending Z, the goal is to generate a new ending Y which is coherent with X while remaining similar to Z. The story from TIMETRAVEL (Qin et al., 2019a) consists of five sentences. Our approach alternates forward (left-to-right) and backward (right-to-left) passes that iteratively refine the generated texts w.r.t. context from each side.
Such nonmonotonic reasoning requires inferring plausible but potentially defeasible conclusions from incomplete or hypothetical observations (Reiter, 1988). While humans are remarkably good at this type of causal reasoning, developing AI systems capable of nonmonotonic reasoning for
a wide range of situations describable in natural language has been a major open research question.
More concretely, with abductive reasoning, the goal is to find the most plausible explanation for incomplete observations (Peirce, 1960). In the top part of Figure 1, given the first observation that Ray is "making his daughter a swing" and the later observation that he "ran to [her] to make sure she was okay," we can hypothesize that she somehow got hurt by the swing.
In contrast, counterfactual reasoning concerns the causal changes to future events given a change in the past condition (i.e., the "counterfactual condition"; Goodman, 1947). For example, the bottom part of Figure 1 shows the original five-sentence story (S1, ..., S5) and an alternative counterfactual condition given in S′2: instead of being a generic "Halloween party", the new counterfactual condition is that it is going to be a "Game of Thrones themed party"! Given these, the problem we want to solve is to update the future events (S′3, ..., S′5), so that instead of "Zeke dressed up as a skeleton", we have "Zeke dressed up like a Stark".2
Recently, two tasks and corresponding benchmarks have been introduced to tackle language-based nonmonotonic reasoning: the ART dataset for abductive NLG (Bhagavatula et al., 2019), and the TIMETRAVEL dataset for counterfactual story rewriting (Qin et al., 2019a). Both tasks are framed as conditional generation, with multiple contexts to condition on. The currently dominant paradigm for conditional text generation tasks is fine-tuning pre-trained language models (LMs), such as GPT-2 (Radford et al., 2019a), on large-scale training data for supervision. However, despite the large number of training examples, supervised approaches still perform considerably worse than humans and are subject to developing superficial strategies such as repeating the observations as-is or memorizing prevalent surface patterns specific to the dataset (Qin et al., 2019a). Furthermore, having to require large-scale training data for each domain and task would be utterly inefficient for broad-coverage nonmonotonic reasoning in language.
In this paper, we investigate an alternative path toward language-based nonmonotonic reasoning using pre-trained language models as-is. Intuitively, both abductive and counterfactual reasoning
2 "Lannister" in S′3 and "Stark" in S′4 and S′5 refer to character names in the TV show "Game of Thrones." All the output text shown in Figure 1 is the actual system output from DELOREAN.
require learning coherent patterns in narrative, which should already be available in large-scale pretrained language models. However, the key challenge is that most generative language models are trained to condition only on the left context, or to perform narrowly scoped text-infilling.
This paper presents DELOREAN: DEcoding for nonmonotonic LOgical REAsoNing, an unsupervised decoding algorithm that only assumes off-the-shelf left-to-right language models with no supervision. The key intuition of our algorithm is incorporating the future through back-propagation, during which we only update the internal representation of the output while fixing the model parameters. More specifically, DELOREAN alternates between forward and backward passes, where the forward pass performs left-to-right inference given the left context (roughly maximizing P(Y|X) in Figure 1), while the backward pass instills the right constraint through right-to-left backpropagation with a task-specific loss (roughly maximizing P(Z|XY)). The forward and backward outputs are mixed into a single vector, from which tokens are sampled to generate the desired output. To choose the best output across iterations, we employ an unsupervised ranking step based on BERT's next-sentence prediction task to measure coherence (Devlin et al., 2018).
On both tasks, DELOREAN outperforms all other unsupervised methods in terms of both automatic metrics and human evaluation, demonstrating that nonmonotonic reasoning through conditional decoding is a promising research direction. Moreover, outputs produced by our model are judged as more coherent than those from the supervised models. In sum, our study shows that backpropagation-based decoding may enable additional future applications of unsupervised generation and reasoning.
2 Background

Most NLP benchmarks have focused on reasoning about information that is entailed from the premise. For instance, natural language inference (NLI; Bowman et al., 2015) focuses primarily on whether a hypothesis is entailed from a given premise, which means the information stated in the hypothesis is a subset of the information provided in the premise. However, it has been noted that human reasoning often goes the other way: hypotheses often contain new information that was not available in the premise, but is plausibly true (though possibly defeasible with new additional context) (Johnson-Laird, 2006; Mercier and Sperber, 2017).
Figure 2: Illustration of the DELOREAN decoding procedure, using abductive reasoning as an example. At initialization (upper-left box), the language model (LM) initializes the logits Ỹ = {ỹ_1, . . . , ỹ_N} of the hypothesis by reading the past context X and generating a continuation with regular decoding. At each forward-backward iteration, we compute the task-specific loss L of the logits based on the future constraint Z (red box). The backward pass then performs back-propagation and produces the backward logits Ỹ^b = {ỹ^b_1, . . . , ỹ^b_N}. In the subsequent forward pass, for each step n, we compute the forward logits ỹ^f_n conditioning on the preceding logits ỹ_{1:n−1}, and then mix them with the respective backward logits ỹ^b_n to produce the new logits ỹ_n at step n.
This type of reasoning corresponds to nonmonotonic reasoning (Kraus et al., 1990), as it contradicts the monotonicity property according to which valid arguments cannot be made invalid by adding premises. We study two tasks of that nature: abductive reasoning (§2.1) and counterfactual reasoning (§2.2).
2.1 Abductive Reasoning

Abductive reasoning aims at finding the most likely explanation for partial observations (Peirce, 1960). It has a central role in the human ability to "read between the lines," and is crucial for language acquisition (Andersen, 1973), understanding sentences in discourse (Hobbs et al., 1993), and many more. Despite its importance, however, relatively little focus has been given to it in NLP research.
Recently, Bhagavatula et al. (2019) proposed the abductive reasoning task. Given two observations, the goal is to determine the most likely explanation of what happened in between. The dataset introduced for the task, ART, consists of 20k observations derived from the first and last sentences of stories in the ROCStories dataset (Mostafazadeh et al., 2016a). We focus on the abductive NLG setup introduced in the paper, which is framed as a conditional generation task where a plausible explanation for the observations must be generated using language. The authors reported the performance of several pre-trained LM-based baselines and showed the promise and limitations of such approaches.
2.2 Counterfactual Reasoning

Counterfactual reasoning aims at inferring alternative past events that could have happened given a certain change in conditions (Goodman, 1947; Starr, 2019). While counterfactual reasoning plays an important role in AI systems (Isard, 1974; Ginsberg, 1986),
it requires causal reasoning abilities, which are arguably absent from current association-based AI (Pearl and Mackenzie, 2018). While there has been work on counterfactual reasoning in NLP, including recognizing counterfactuals in text (Son et al., 2017) and improving the performance of NLP tasks using counterfactual learning (Lawrence et al., 2017; Lawrence and Riezler, 2018), it remains a major research challenge.
Recently, Qin et al. (2019a) introduced the task of counterfactual story generation. Given a five-sentence original story and an alternative context in which the second sentence of the story was altered by a counterfactual, the task is to generate a new 3-sentence story ending that addresses the alternative beginning while minimally editing the original ending. The associated TIMETRAVEL dataset is based on fictional narratives from ROCStories, for which counterfactual contexts and alternative endings are crowdsourced, yielding 29,849 problem instances. Qin et al. (2019a) report several baseline performances, and find that models based on pre-trained LMs produce output that recognizes the counterfactual, but generate endings that deviate considerably from the original storyline. In contrast, in the supervised setup, models optimize the easier of the two goals and generate endings that are overly similar to the original endings.
3 The DELOREAN Approach

Humans make inferences based on available information and refine them when new information arrives. Since currently available pre-trained LMs generate text by sequentially predicting the next token from left to right, they are incapable of conditioning on future constraints. Therefore, we propose DELOREAN, an unsupervised backprop-based decoding algorithm, which is summarized in Algorithm 1, illustrated in Figure 2, and detailed below. DELOREAN intermittently refines the predictions to cohere with either the context or the constraints (Section 3.1). The candidate generations are then ranked by coherence (Section 3.2).
3.1 Decoding Strategy

Given a context text X, the goal is to generate a continuation text Y = (y_1, . . . , y_N), such that Y satisfies certain constraints according to the reasoning task, usually defined based on another context Z (see Figure 1; we discuss the task-specific constraints in the respective task sections).
Algorithm 1: DELOREAN Decoding
Input: Pre-trained language model (LM), context X, future constraint Z
 1: Initialize logits Ỹ^(0)
 2: Initialize Ys, the list of candidate generations
 3: for t ← 1 to T do
 4:   // Backward pass
 5:   for n ← N to 1 do
 6:     Compute backward logits ỹ^b_n, Eq. (1)
 7:   end for
 8:   // Forward pass
 9:   for n ← 1 to N do
10:     Compute forward logits ỹ^f_n, Eq. (2)
11:     Mix forward and backward logits, Eq. (3)
12:   end for
13:   Sample candidate Y from logits Ỹ and add to Ys
14: end for
15: Rank Ys by coherence
Output: The most coherent generated text Y from Ys
The proposed approach interleaves two procedures, namely forward and backward, that produce and iteratively refine the generation for a predefined number of iterations T. In particular, the forward pass ensures the generated text is a fluent continuation of the context X, while the backward pass informs the model about the constraint and steers the generation to satisfy it.
As detailed below, the backward pass uses gradient descent to update the generation Y. However, Y is discrete text that is not differentiable. Instead, throughout the algorithm, we maintain a soft representation of the sequence Ỹ = (ỹ_1, . . . , ỹ_N), where ỹ_n ∈ R^V represents the logits of the n-th token and V is the vocabulary size. After the logits are refined over multiple iterations of the forward and backward passes, we generate discrete text at each step by sampling from y_n ∼ softmax(ỹ_n / τ), where τ > 0 is the temperature.
We start by initializing the logits before the first iteration, Ỹ^(0) = (ỹ^(0)_1, . . . , ỹ^(0)_N), by feeding the context X into the LM and greedily decoding N continuation tokens.
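To make the soft representation concrete, the following is a minimal sketch (in PyTorch with Hugging Face transformers, not the authors' released code) of how the logits Ỹ could be initialized by greedy decoding and later sampled back into discrete text; the model choice (gpt2-medium as a stand-in for GPT2-345M), helper names, and tensor shapes are illustrative assumptions.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# gpt2-medium is assumed here as a stand-in for the GPT2-345M model used in the paper
tokenizer = GPT2Tokenizer.from_pretrained("gpt2-medium")
lm = GPT2LMHeadModel.from_pretrained("gpt2-medium").eval()

def init_logits(context_ids: torch.Tensor, n_tokens: int) -> torch.Tensor:
    """Greedily decode N continuation tokens of the context and keep the
    per-step logits as the initial soft representation Y~(0), shape (N, V)."""
    logits_list, ids = [], context_ids
    with torch.no_grad():
        for _ in range(n_tokens):
            step_logits = lm(ids).logits[:, -1, :]                # (1, V)
            logits_list.append(step_logits.squeeze(0))
            next_id = step_logits.argmax(dim=-1, keepdim=True)    # greedy token
            ids = torch.cat([ids, next_id], dim=1)
    return torch.stack(logits_list)

def sample_text(y_tilde: torch.Tensor, temperature: float = 1.0) -> str:
    """Discretize: sample y_n ~ softmax(y~_n / tau) at each step."""
    probs = torch.softmax(y_tilde / temperature, dim=-1)
    token_ids = torch.multinomial(probs, num_samples=1).squeeze(-1)
    return tokenizer.decode(token_ids.tolist())
```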
Backward The backward pass uses gradient backpropagation to update the generation with respect to the constraint. Specifically, we express the task-specific constraint as a loss function L(X, Ỹ^(t−1), Z) that evaluates how well the generation Y (approximated with the soft representation Ỹ) obeys the constraint (see the subsequent sections for concrete instantiations of the loss). The goal of this pass is thus to minimize the loss w.r.t. the generation.
Specifically, at iteration t, for each step n in the generation, we update its logits with:

ỹ_n^{(t),b} = ỹ_n^{(t−1)} − λ · ∇_{ỹ_n} L(X, Ỹ^{(t−1)}, Z),    (1)

where ∇_{ỹ_n} L(X, Ỹ^{(t−1)}, Z) is the gradient of the constraint-informed loss L w.r.t. the n-th logits, and λ ∈ R is the step size. In practice, we may repeat the gradient updates multiple times in a single pass.
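A minimal sketch of this update (Eq. 1), assuming `constraint_loss` is a differentiable function of the logits that closes over X and Z; the helper name and the use of `torch.autograd.grad` are implementation assumptions, not a description of the released code.

```python
def backward_pass(y_tilde, constraint_loss, step_size=1.0, n_updates=1):
    """Produce backward logits Y~^b by gradient descent on the constraint loss,
    keeping the LM parameters fixed (only the output logits are updated)."""
    y_b = y_tilde.detach().clone().requires_grad_(True)
    for _ in range(n_updates):
        loss = constraint_loss(y_b)                 # L(X, Y~, Z), a scalar
        grad, = torch.autograd.grad(loss, y_b)      # gradient w.r.t. the logits
        y_b = (y_b - step_size * grad).detach().requires_grad_(True)
    return y_b.detach()
```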
Forward The forward pass ensures that Y is fluent and coherent with the preceding context X. At iteration t, for a particular step n, we compute the forward logits with the LM:

ỹ_n^{(t),f} = LM(X, Ỹ_{1:n−1}^{(t)}).    (2)

We then mix the n-th-step forward and backward logits to get the final logits of iteration t:

ỹ_n^{(t)} = γ · ỹ_n^{(t),f} + (1 − γ) · ỹ_n^{(t),b},    (3)

where 0 < γ < 1 is the mixing weight. The resulting logits ỹ_n^{(t)} are then fed to the LM to compute the forward logits at the (n+1)-th step (Eq. 2). This way, information from the backward pass is integrated into the left-to-right generation process to produce text that is informed by the constraint.
We pre-define the number of tokens N required by the backward pass, but we allow the forward pass to generate more than N tokens if those are needed to obtain complete sentences. In that case, we set the logits of the extra tokens to the forward logits, without mixing: ỹ_n^{(t)} = ỹ_n^{(t),f} for n > N. We then prune any trailing tokens in the sampled text to get complete sentences.
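The forward pass must condition the LM on soft (logit-valued) prefixes. One way to do this, sketched below under the assumption that a softmax-weighted average of GPT-2's input embeddings is an acceptable stand-in for a discrete token, is to feed `inputs_embeds` and mix each step's forward logits with the backward logits (Eqs. 2-3). This is an illustrative reading, not the authors' exact implementation.

```python
def soft_embed(logits, temperature=1.0):
    """Expected input embedding of a soft token: softmax(y~/tau) @ E (assumption)."""
    emb_matrix = lm.get_input_embeddings().weight               # (V, d)
    return torch.softmax(logits / temperature, dim=-1) @ emb_matrix

def forward_pass(context_ids, y_backward, gamma=0.9, temperature=1.0):
    """One forward sweep: compute y~^f_n from the LM (Eq. 2), then
    gamma-mix it with the backward logits y~^b_n (Eq. 3)."""
    ctx_emb = lm.get_input_embeddings()(context_ids)             # (1, Lx, d)
    mixed = []                                                   # holds y~_1 ... y~_{n-1}
    for n in range(y_backward.size(0)):
        if mixed:
            prefix_emb = soft_embed(torch.stack(mixed), temperature).unsqueeze(0)
        else:
            prefix_emb = ctx_emb[:, :0, :]                       # empty prefix at n = 1
        inputs = torch.cat([ctx_emb, prefix_emb], dim=1)
        with torch.no_grad():
            f_logits = lm(inputs_embeds=inputs).logits[0, -1, :]   # forward logits, Eq. (2)
        mixed.append(gamma * f_logits + (1.0 - gamma) * y_backward[n])  # Eq. (3)
    return torch.stack(mixed)                                    # refined Y~ for iteration t
```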
3.2 Ranking

The output of the decoding step is a list of candidate generations, one per iteration: Ys = {Y^(t) | t = 1, ..., T}. We further use an unsupervised approach to rank and pick the best sample as the final output. Specifically, we take advantage of the BERT model, which was pre-trained with a next-sentence prediction (NSP) objective. Given two sentences A and B, we use NSP to compute the likelihood of B following A as a proxy for coherence:

c(A, B) = BERT_NSP(A, B),    (4)

where c(·, ·) denotes the coherence score. This score is used to evaluate the quality of a given candidate continuation Y by measuring (1) its compatibility as the subsequent text of the context X, (2) the internal consistency of Y if it consists of multiple sentences, and (3) the compatibility of Y with its right-side text when applicable.
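A minimal sketch of the coherence score c(A, B) in Eq. (4), assuming BERT's next-sentence-prediction head as exposed by Hugging Face transformers (where logit index 0 corresponds to "B follows A"); the exact scoring variant used by the authors may differ.

```python
from transformers import BertForNextSentencePrediction, BertTokenizer

nsp_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
nsp_model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased").eval()

def coherence(a: str, b: str) -> float:
    """c(A, B): probability under BERT's NSP head that B is a continuation of A."""
    inputs = nsp_tokenizer(a, b, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = nsp_model(**inputs).logits      # (1, 2); index 0 = "is next sentence"
    return torch.softmax(logits, dim=-1)[0, 0].item()
```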
Model                 BLEU-4  ROUGE-L  BERT
Supervised
  Sup                  32.82    25.60  49.38
  +COMET-Emb           33.97    26.06  49.71
Unsupervised
  Zero-ShotX           18.30    14.99  39.36
  Zero-ShotZX          15.90    14.23  40.03
  Zero-ShotX-Ranked    19.24    16.76  41.58
  Zero-ShotZX-Ranked   20.13    17.25  41.93
  DELOREAN             22.60    18.94  42.86
Human                  53.56    30.40  53.30

Table 1: Automatic evaluation results on the abductive task, using the test set of ART.
4 Task 1: Abductive Reasoning

Each instance in the ART dataset consists of two observations O1, O2 and a hypothesis H that explains the two observations. These inputs naturally map to X, Z and Y in our framework. Formally, the abductive generation task aims to maximize P(Y | X, Z), i.e., models must consider both left and right contexts (X and Z) jointly.
4.1 Task Setup

Constraints We maximize Z given XỸ by defining the loss function as the cross-entropy loss of generating Z given XỸ with the LM:3

L(X, Ỹ, Z) := − Σ_{n=1}^{N_Z} log P_LM(z_n | X, Ỹ, Z_{1:n−1}),    (5)

where P_LM(a_j | a_{1:j−1}) is the likelihood of generating token a_j given the preceding text a_{1:j−1}.
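Under the same soft-embedding assumption as in Section 3.1, the abductive loss of Eq. (5) can be sketched as the summed cross-entropy the LM assigns to the tokens of Z when conditioned on X and the soft hypothesis Ỹ; names and shapes are again illustrative, not taken from the released code.

```python
import torch.nn.functional as F

def abductive_loss(x_ids, y_tilde, z_ids, temperature=1.0):
    """-sum_n log P_LM(z_n | X, Y~, Z_1:n-1); differentiable w.r.t. y_tilde."""
    embed = lm.get_input_embeddings()
    x_emb = embed(x_ids)                                    # (1, Lx, d)
    y_emb = soft_embed(y_tilde, temperature).unsqueeze(0)   # (1, N, d), keeps gradients
    z_emb = embed(z_ids)                                    # (1, Lz, d)
    logits = lm(inputs_embeds=torch.cat([x_emb, y_emb, z_emb], dim=1)).logits
    lz = z_ids.size(1)
    pred = logits[0, -lz - 1:-1, :]                         # positions predicting z_1..z_Lz
    return F.cross_entropy(pred, z_ids[0], reduction="sum")
```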
Ranking We rank candidates by the overall coherence after inserting Y in between X and Z:

ranking_score(Y) = c(XY, Z) + c(X, YZ).    (6)
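With the coherence helper sketched in Section 3.2, Eq. (6) reduces to a one-liner (plain string concatenation is used here as an illustrative stand-in for inserting Y between X and Z):

```python
def abductive_ranking_score(x: str, y: str, z: str) -> float:
    """Eq. (6): coherence of Z after XY, plus coherence of YZ after X."""
    return coherence(x + " " + y, z) + coherence(x, y + " " + z)
```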
Hyperparameters We use GPT2-345M (Radford et al., 2019b) as the pre-trained LM for all models. We use the ART development set to select hyperparameters. We use greedy decoding for our method and top-k decoding (Fan et al., 2018) (k = 40, τ = 0.7) for our baselines. Other hyperparameters are outlined in Appendix A.1.
4.2 Experimental Setup

Baselines We compare our method against baselines from Bhagavatula et al. (2019).

3 Note that this is applied to each prefix of Ỹ, although some of them are not complete sentences.
Figure 3: Examples of generated hypotheses on three abductive reasoning cases. Given observations O1 and O2, DELOREAN generates a hypothesis explaining the observations.
The unsupervised baselines use a pre-trained GPT-2 model to generate Y given a prompt text: either the observation X alone (Zero-ShotX) or Z〈e〉X (Zero-ShotZX), where 〈e〉 denotes a special end-of-text token. The supervised method (Sup) follows the same input format as Zero-ShotZX, but fine-tunes GPT-2 on the ART training set. Finally, our knowledge-informed baseline (+COMET-Emb) further augments the representation of Sup with knowledge from COMET (Bosselut et al., 2019).

To separately study the contribution of our decoding strategy and ranking component, we also report the performance of ranking the baseline outputs. Specifically, we let each baseline generate 20 candidates and rank them by coherence (Eq. 6).4
4.3 Results

Automatic Evaluation We report the same metrics as Bhagavatula et al. (2019): BLEU-4 (Papineni et al., 2002), ROUGE-L (Lin, 2004), and BERTSCORE (Zhang et al., 2019) (with the bert-base-uncased model). The results in Table 1 show that DELOREAN performs best among the unsupervised systems across all metrics. We also note that our ranking step improves both the performance of our model and that of the zero-shot baselines.
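For reference, BERTSCORE with the same bert-base-uncased backbone can be computed with the public bert_score package as sketched below; this is only a convenience illustration and may differ from the evaluation scripts actually used.

```python
from bert_score import score as bert_score

def bertscore_f1(candidates, references):
    """Mean BERTScore F1 between generated hypotheses and reference hypotheses."""
    _, _, f1 = bert_score(candidates, references, model_type="bert-base-uncased")
    return f1.mean().item()
```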
Human Evaluation We conduct two sets of human evaluations on 100 test examples using crowdworkers from Amazon Mechanical Turk. In the scoring setting, presented in Table 2, workers were presented with a pair of observations (X and Z) and a generated hypothesis Y, and asked to rate the coherence of the hypothesis with respect to the observation X (X-Y), the observation Z (Y-Z), and both (X-Y-Z), on a 4-point Likert scale.

4 We tried ablating the ranking component from our method in preliminary experiments, and found that ranking is essential to obtaining good performance. By adding ranking to our baselines, we assess the contribution of our decoding strategy.
Model                 X-Y    Y-Z    X-Y-Z
Supervised
  Sup                 0.510  0.375  0.314
  +COMET-Emb          0.466  0.342  0.286
Unsupervised
  Zero-ShotZX         0.233  0.103  0.108
  Zero-ShotX-Ranked   0.478  0.208  0.195
  Zero-ShotZX-Ranked  0.474  0.238  0.236
  DELOREAN            0.522  0.325  0.297
Human                 0.879  0.823  0.783

Table 2: Human calibration results on the test set of ART. All scores are normalized to [0, 1].
Overall - Human Judges Preferred
Our model        Neutral  Comparator
DELOREAN   21%   43%      36%  Sup
DELOREAN   25%   44%      31%  +COMET-Emb
DELOREAN   23%   62%      15%  Zero-ShotX-Ranked
DELOREAN   27%   50%      23%  Zero-ShotZX-Ranked
DELOREAN    3%   11%      86%  Human

Table 3: Human pairwise comparison results on the test set of ART, between DELOREAN and each of the baselines, by jointly considering all 3 criteria from Table 2. "Neutral" means "equally good/bad".
In the pairwise comparison setting, presented in Table 3, workers were presented with the outputs from a pair of systems (DELOREAN and a baseline) and asked to choose the better output in terms of the same coherence criteria. Each example was labeled by 3 workers.5

In both evaluation setups, our method substantially outperforms the unsupervised baselines, achieving a relative improvement of 36%-215% with respect to Y-Z coherence. Our method also outperforms the supervised methods with respect to X-Y coherence (Table 2), and achieves competitive performance in the pairwise comparison (Table 3).

5 The average inter-rater agreement measured by Fleiss' κ = 0.44 ("moderate agreement") (Fleiss, 1971).
Model                          BLEU   ROUGE  BERT
Supervised + Discriminative
  Sup+Disc                     75.71  72.72  62.39
Unsupervised + Discriminative
  Recon+CF                     75.92  70.93  62.49
Unsupervised
  FT                            4.06  24.09  62.55
  FT+CF                         4.02  24.35  62.63
Pretrained-only
  Zero-ShotS1S′2                1.74  21.41  59.31
  Zero-ShotS1S′2-Ranked         2.26  25.81  60.07
  DELOREAN                     21.35  40.73  63.36
Human                          64.93  67.64  61.87

Table 4: Automatic evaluation results of counterfactual story rewriting, on the test set of TIMETRAVEL.
Coherence - Human Judges Preferred
Our model        Neutral  Comparator
DELOREAN   25%   58%      17%  Sup+Disc
DELOREAN   23%   70%       7%  Recon+CF
DELOREAN   22%   48%      30%  FT
DELOREAN   18%   60%      22%  Zero-ShotS1S′2
DELOREAN   27%   42%      31%  Zero-ShotS1S′2-Ranked
DELOREAN   10%   29%      61%  Human

Min-Edits - Human Judges Preferred
Our model        Neutral  Comparator
DELOREAN    4%   17%      79%  Sup+Disc
DELOREAN    1%   14%      85%  Recon+CF
DELOREAN   21%   76%       3%  FT
DELOREAN   28%   71%       1%  Zero-ShotS1S′2
DELOREAN   37%   56%       7%  Zero-ShotS1S′2-Ranked
M+Sup       8%   22%      70%  Human

Table 5: Human pairwise comparison results on the counterfactual task, between our best model and each baseline, with respect to coherence and min-edits.
Again, the ranking component contributes to increasing performance for the zero-shot baselines. Finally, the large performance gap between the methods and human-written explanations stresses the difficulty of this reasoning task and warrants future research.

Qualitative Analysis Figure 3 presents two example outputs produced by DELOREAN. We can see that our approach generates reasonable hypotheses by taking into account both the past and future contexts. For instance, in the first example, the future observation (O2) "car was totaled" indicates that Ray had a car accident, which is correctly captured in the generated hypothesis "car is hit by a car".
Figure 4: Human calibration results for counterfactual generation in terms of the weighted harmonic mean of coherence and min-edit, H_β = (1 + β²) · coherence · min_edit / (β² · coherence + min_edit), as a function of the scaling factor β. Low β values assign more weight to coherence, and high β values emphasize min-edit.
5 Task 2: Counterfactual Reasoning

Given an original story ending Z of a story context X_ori, and a counterfactual condition X that changes X_ori to invalidate Z (see Fig. 1), the task is to generate a new story ending Y that minimally edits the original ending Z to regain coherence with the counterfactual condition X (Qin et al., 2019a).
5.1 Task Setup

Constraints The constraint we enforce is that Y is close to Z (i.e., minimal edits). We impose this constraint by minimizing their KL divergence:

L(X, Ỹ, Z) := KL(Z ‖ softmax(Ỹ/τ)),    (7)

where, with a slight abuse of notation, Z is the one-hot distribution of the tokens in the original ending. That is, we encourage the generated logits to recover the original ending.
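Because Z is one-hot, the KL term in Eq. (7) reduces to the negative log-likelihood of the original-ending tokens under the soft generation. A minimal sketch, assuming Ỹ and the original-ending token ids are aligned to the same length N:

```python
def counterfactual_loss(y_tilde, z_ids, temperature=1.0):
    """KL(Z || softmax(Y~/tau)) with one-hot Z = -sum_n log softmax(y~_n/tau)[z_n]."""
    log_probs = torch.log_softmax(y_tilde / temperature, dim=-1)   # (N, V)
    return -log_probs[torch.arange(z_ids.size(0)), z_ids].sum()
```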
Ranking We rank the candidates based on both their coherence with the context and the internal coherence between the multiple sentences of each candidate (the rewritten ending consists of 3 sentences). More concretely, given a candidate Y, we compute the aggregated coherence score:

ranking_score(Y) = c(X, Y) + Σ_{s=1}^{S−1} c(Y[s], Y[s+1]),    (8)

where each candidate has S sentences (here, S = 3) and Y[s] denotes the s-th sentence.
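Reusing the coherence helper from Section 3.2, the ranking rule of Eq. (8) can be sketched as follows (the candidate is assumed to be pre-split into its S sentences):

```python
def counterfactual_ranking_score(x: str, y_sentences: list) -> float:
    """Eq. (8): context-to-ending coherence plus adjacent-sentence coherence."""
    score = coherence(x, " ".join(y_sentences))
    for s in range(len(y_sentences) - 1):
        score += coherence(y_sentences[s], y_sentences[s + 1])
    return score
```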
Figure 5: Examples of generated story endings on three counterfactual reasoning cases. Given a story context, a counterfactual condition, and an original ending, DELOREAN generates a rewritten ending which is coherent with the counterfactual condition and is similar to the original ending.
Hyperparameters We largely follow the same settings as in the abductive reasoning task, but tune hyperparameters on the TIMETRAVEL development set. Deviations from these settings are outlined in Appendix A.2.
5.2 Experimental Setup

Baselines We compare our method with baselines from Qin et al. (2019a). The zero-shot baseline uses the pre-trained GPT-2 model to generate Y as a continuation of the counterfactual condition X. It is the most apt comparison to our method, which also does not require additional supervision. We also experiment with two baselines that fine-tune GPT-2 on the original story X_ori Z to fit the model to the story domain, either with an LM objective (FT) or a tailored conditional objective that encourages minimal edits of Z (Recon+CF).6 Finally, we report the performance of a supervised baseline (Sup), in which GPT-2 is fine-tuned to produce the gold Y from X_ori Z and X.
5.3 Results

Automatic Evaluation Following Qin et al. (2019a), we report BERTSCORE (Zhang et al., 2019), which was shown to best correlate with human judges' notion of counterfactual coherence, and BLEU-4 and ROUGE-L, which better measure minimum edits. We find that the discriminative baselines achieve the highest degree of plot fidelity. Meanwhile, DELOREAN achieves the highest BERTSCORE for counterfactual coherence.

6 See Qin et al. (2019a) for more details.
Human Evaluation We repeat the human evaluation setup from Section 4.3. Presented with the original story, the counterfactual condition X, and the generated ending Y, workers were asked to judge (1) the coherence of Y with respect to X; and (2) to what extent the generated ending minimally edits the original ending.7 In order to judge both criteria, we report the weighted harmonic mean Hβ of these scores across a range of weights β (Figure 4).
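The weighted harmonic mean from Figure 4, written out for clarity (the values in the usage line are the reported aggregate scores, used only to illustrate the computation):

```python
def h_beta(coherence_score: float, min_edit_score: float, beta: float) -> float:
    """H_beta = (1 + beta^2) * coherence * min_edit / (beta^2 * coherence + min_edit).
    Small beta emphasizes coherence; large beta emphasizes minimal edits."""
    num = (1 + beta ** 2) * coherence_score * min_edit_score
    den = beta ** 2 * coherence_score + min_edit_score
    return num / den

# e.g., DELOREAN's reported coherence 1.66 and min-edit 1.54 give H_1 of roughly 1.60
print(round(h_beta(1.66, 1.54, beta=1.0), 2))
```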
Our results show that DELOREAN is the only model that maintains a consistent balance between coherence (1.66) and minimal edits (1.54). While the ranking-augmented zero-shot model produces the most coherent endings (coherence = 1.8), it deviates from the original ending. As β is increased (i.e., increasing the importance of minimal edits), its weighted performance drops considerably, indicating it cannot generate new endings that follow the original plot of the story (min-edit = 1.25). Conversely, Recon+CF generates stories that are faithful to the original endings, but are far less coherent with the counterfactual condition (coherence = 1.23). Through human annotation, we found that Recon+CF copies the original ending word-for-word in 84% of cases.

7 Fair inter-rater agreement with Fleiss' κ = 0.34.
The pairwise comparison results in Table 5 parallel these observations. DELOREAN significantly outperforms the discriminative approaches (Recon+CF and Sup+Disc) in coherence, while falling short of the zero-shot re-ranked baselines. In minimal edits, this pattern is flipped, with our approach outperforming the zero-shot baselines considerably and losing to the discriminative baselines.

Qualitative Analysis Figure 5 provides two example results for counterfactual story rewriting by DELOREAN. The approach successfully captures the causal relations between events and properly rewrites the endings with minimal edits. For instance, in the first example, given the counterfactual condition that "Tara ordered a shirt online" (as opposed to the original "went to the mall"), the rewritten ending is about a shirt being "sent" to Tara (as opposed to the original "browsed from stores"). The last sentence of the original ending, "She looked forward to wearing it", is correctly preserved, as it is coherent with the counterfactual condition.
6 Related Work

Unsupervised text generation. Unsupervised approaches are often applied to problems that copy information from a source text into the decoded text. Unsupervised paraphrasing requires repeating this information (Miao et al., 2019; Bao et al., 2019), as does translation, but with a bilingual transformation (Artetxe et al., 2017; Lample et al., 2018). In summarization there is an additional task to select a subset of the original text (Baziotis et al., 2019; Schumann et al., 2020; West et al., 2019). In cases where information is mostly copied from the original, auto-encoding objectives can ensure the correct information is captured (Bao et al., 2019; Baziotis et al., 2019; Artetxe et al., 2017). This work tackles problems where generation is more open-ended. Rather than reproducing information from the prompt, generations should agree with and expand on it, making autoencoding less applicable.
Controllable language generation. Earlier approaches for controllable generation involved preserving the content of text while changing it along discrete dimensions, such as theme, sentiment, or style (Koncel-Kedziorski et al., 2016; Hu et al., 2017; Ficler and Goldberg, 2017; Shen et al., 2017; Lample et al., 2019). Recent works such as Grover (Zellers et al., 2019) and the CTRL model (Keskar et al., 2019) used these ideas to augment transformer language models that can condition on structured metadata such as source, domain, etc. The Plug & Play model (PPLM; Dathathri et al., 2019) controls topic and sentiment in an approach similar to ours that involves forward and backward passes to update token distributions. However, PPLM relies on trained attribute discriminators for supervision, while our method is unsupervised. While these models are restricted to specific dimensions, often with pre-defined values, our model can adjust to any open-ended textual constraint. Perhaps the most similar works in that respect are "text infilling" models, which, however, operate in a narrower setting, filling only a relatively short text span (Devlin et al., 2018; Zhu et al., 2019; Donahue et al., 2020), and are more restrictive due to their reliance on an extra right-to-left language model (Sun et al., 2017) or a pre-specified generation length (Zeldes et al., 2020, which is not publicly available).
Reasoning about narratives. A prominent resource from recent years is the ROCStories corpus (Mostafazadeh et al., 2016b), consisting of 98K crowdsourced 5-sentence everyday life stories. It was used for the story cloze task, whose goal was to predict the story ending from its first 4 sentences, but it gained popularity and became the basis of additional benchmarks (Rashkin et al., 2018). Additional related work includes "script knowledge", i.e., learning about prototypical series of events (Schank and Abelson, 1977; Chambers and Jurafsky, 2008; Pichotta and Mooney, 2014), temporal commonsense (Granroth-Wilding and Clark, 2016; Li et al., 2018), and modeling pre- and post-conditions of events (Roemmele et al., 2011; Sap et al., 2019; Bosselut et al., 2019). Qin et al. (2019b) studied conversation modeling that reads and connects the dots of events in related documents. Finally, a recent line of work explores counterfactual questions in reading comprehension (Huang et al., 2019; Tandon et al., 2019), but instantiates the problem of counterfactual reasoning as a multiple-choice task.
7 Conclusion

We presented DELOREAN, an unsupervised LM-based approach to generate text conditioned on past context as well as future constraints, through forward and backward passes considering each condition. We demonstrated its effectiveness for abductive and counterfactual reasoning, on which it performed substantially better than unsupervised baselines. Our method is general and can be easily adapted to other generative reasoning tasks.
Acknowledgements

We thank the anonymous reviewers and colleagues at UW NLP and AI2 for many helpful comments. This research was supported in part by DARPA CwC through ARO (W911NF15-1-0543), the DARPA MCS program through NIWC Pacific (N66001-19-2-4031), and the Allen Institute for AI.
References

Henning Andersen. 1973. Abductive and deductive change. Language, pages 765–793.

Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho. 2017. Unsupervised neural machine translation. arXiv preprint arXiv:1710.11041.

Yu Bao, Hao Zhou, Shujian Huang, Lei Li, Lili Mou, Olga Vechtomova, Xinyu Dai, and Jiajun Chen. 2019. Generating sentences from disentangled syntactic and semantic spaces. arXiv preprint arXiv:1907.05789.

Christos Baziotis, Ion Androutsopoulos, Ioannis Konstas, and Alexandros Potamianos. 2019. Seq3: Differentiable sequence-to-sequence-to-sequence autoencoder for unsupervised abstractive sentence compression. In NAACL-HLT.

Chandra Bhagavatula, Ronan Le Bras, Chaitanya Malaviya, Keisuke Sakaguchi, Ari Holtzman, Hannah Rashkin, Doug Downey, Wen-tau Yih, and Yejin Choi. 2019. Abductive commonsense reasoning. In International Conference on Learning Representations.

Antoine Bosselut, Hannah Rashkin, Maarten Sap, Chaitanya Malaviya, Asli Celikyilmaz, and Yejin Choi. 2019. COMET: Commonsense transformers for automatic knowledge graph construction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4762–4779, Florence, Italy. Association for Computational Linguistics.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. arXiv preprint arXiv:1508.05326.

Nathanael Chambers and Dan Jurafsky. 2008. Unsupervised learning of narrative event chains. In Proceedings of ACL-08: HLT, pages 789–797.

Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino, Jason Yosinski, and Rosanne Liu. 2019. Plug and play language models: A simple approach to controlled text generation. arXiv preprint arXiv:1912.02164.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Chris Donahue, Mina Lee, and Percy Liang. 2020. Enabling language models to fill in the blanks. In ACL.

Angela Fan, Mike Lewis, and Yann Dauphin. 2018. Hierarchical neural story generation. In ACL.

Jessica Ficler and Yoav Goldberg. 2017. Controlling linguistic style aspects in neural language generation. In Proceedings of the Workshop on Stylistic Variation, pages 94–104.

Joseph L. Fleiss. 1971. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5):378.

Matthew L. Ginsberg. 1986. Counterfactuals. Artificial Intelligence, 30(1):35–79.

Nelson Goodman. 1947. The problem of counterfactual conditionals. The Journal of Philosophy, 44(5):113–128.

Mark Granroth-Wilding and Stephen Clark. 2016. What happens next? Event prediction using a compositional neural network model. In Thirtieth AAAI Conference on Artificial Intelligence.

Jerry R. Hobbs, Mark E. Stickel, Douglas E. Appelt, and Paul Martin. 1993. Interpretation as abduction. Artificial Intelligence, 63(1-2):69–142.

Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P. Xing. 2017. Toward controlled generation of text. In ICML.

Lifu Huang, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2019. Cosmos QA: Machine reading comprehension with contextual commonsense reasoning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2391–2401.

Steve D. Isard. 1974. What would you have done if...? Theoretical Linguistics, 1(1-3):233–256.

Philip Nicholas Johnson-Laird. 2006. How We Reason. Oxford University Press, USA.

Nitish Shirish Keskar, Bryan McCann, Lav R. Varshney, Caiming Xiong, and Richard Socher. 2019. CTRL: A conditional transformer language model for controllable generation. arXiv preprint arXiv:1909.05858.

Rik Koncel-Kedziorski, Ioannis Konstas, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2016. A theme-rewriting approach for generating algebra word problems. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1617–1628.
Sarit Kraus, Daniel Lehmann, and Menachem Magidor. 1990. Nonmonotonic reasoning, preferential models and cumulative logics. Artificial Intelligence, 44(1-2):167–207.

Guillaume Lample, Myle Ott, Alexis Conneau, Ludovic Denoyer, and Marc'Aurelio Ranzato. 2018. Phrase-based & neural unsupervised machine translation. arXiv preprint arXiv:1804.07755.

Guillaume Lample, Sandeep Subramanian, Eric Smith, Ludovic Denoyer, Marc'Aurelio Ranzato, and Y-Lan Boureau. 2019. Multiple-attribute text rewriting. In ICLR.

Carolin Lawrence and Stefan Riezler. 2018. Improving a neural semantic parser by counterfactual learning from human bandit feedback. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1820–1830.

Carolin Lawrence, Artem Sokolov, and Stefan Riezler. 2017. Counterfactual learning from bandit feedback under deterministic logging: A case study in statistical machine translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2566–2576.

Zhongyang Li, Xiao Ding, and Ting Liu. 2018. Constructing narrative event evolutionary graph for script event prediction. In IJCAI.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out.

Hugo Mercier and Dan Sperber. 2017. The Enigma of Reason. Harvard University Press.

Ning Miao, Hao Zhou, Lili Mou, Rui Yan, and Lei Li. 2019. CGMH: Constrained sentence generation by Metropolis-Hastings sampling. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6834–6842.

Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, and James Allen. 2016a. A corpus and evaluation framework for deeper understanding of commonsense stories. arXiv preprint arXiv:1604.01696.

Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, and James F. Allen. 2016b. A corpus and cloze evaluation for deeper understanding of commonsense stories. In HLT-NAACL.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In ACL, pages 311–318.

Judea Pearl and Dana Mackenzie. 2018. The Book of Why: The New Science of Cause and Effect. Basic Books.

Charles Sanders Peirce. 1960. Collected Papers of Charles Sanders Peirce, volume 2. Harvard University Press.

Karl Pichotta and Raymond Mooney. 2014. Statistical script learning with multi-argument events. In EACL, pages 220–229.

Lianhui Qin, Antoine Bosselut, Ari Holtzman, Chandra Bhagavatula, Elizabeth Clark, and Yejin Choi. 2019a. Counterfactual story reasoning and generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5046–5056.

Lianhui Qin, Michel Galley, Chris Brockett, Xiaodong Liu, Xiang Gao, Bill Dolan, Yejin Choi, and Jianfeng Gao. 2019b. Conversing by reading: Contentful neural conversation with on-demand machine reading. In ACL, pages 5427–5436.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019a. Language models are unsupervised multitask learners.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019b. Language models are unsupervised multitask learners. OpenAI Blog, 1:8.

Hannah Rashkin, Antoine Bosselut, Maarten Sap, Kevin Knight, and Yejin Choi. 2018. Modeling naive psychology of characters in simple commonsense stories. arXiv preprint arXiv:1805.06533.

Raymond Reiter. 1988. Nonmonotonic reasoning. In Exploring Artificial Intelligence, pages 439–481. Elsevier.

Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S. Gordon. 2011. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In 2011 AAAI Spring Symposium Series.

Maarten Sap, Ronan Le Bras, Emily Allaway, Chandra Bhagavatula, Nicholas Lourie, Hannah Rashkin, Brendan Roof, Noah A. Smith, and Yejin Choi. 2019. ATOMIC: An atlas of machine commonsense for if-then reasoning. In AAAI.

Roger C. Schank and Robert P. Abelson. 1977. Scripts, Plans, Goals and Understanding: An Inquiry into Human Knowledge Structures.

Raphael Schumann, Lili Mou, Yao Lu, Olga Vechtomova, and Katja Markert. 2020. Discrete optimization for unsupervised sentence summarization with word-level extraction. arXiv preprint arXiv:2005.01791.

Tianxiao Shen, Tao Lei, Regina Barzilay, and Tommi Jaakkola. 2017. Style transfer from non-parallel text by cross-alignment. In Advances in Neural Information Processing Systems, pages 6830–6841.
Youngseo Son, Anneke Buffone, Joe Raso, Allegra Larche, Anthony Janocko, Kevin Zembroski, H. Andrew Schwartz, and Lyle Ungar. 2017. Recognizing counterfactual thinking in social media texts. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 654–658.

William Starr. 2019. Counterfactuals. In Edward N. Zalta, editor, The Stanford Encyclopedia of Philosophy, fall 2019 edition. Metaphysics Research Lab, Stanford University.

Qing Sun, Stefan Lee, and Dhruv Batra. 2017. Bidirectional beam search: Forward-backward inference in neural sequence models for fill-in-the-blank image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6961–6969.

Niket Tandon, Bhavana Dalvi Mishra, Keisuke Sakaguchi, Antoine Bosselut, and Peter Clark. 2019. WIQA: A dataset for "what if..." reasoning over procedural text. In EMNLP.

Peter West, Ari Holtzman, Jan Buys, and Yejin Choi. 2019. BottleSum: Unsupervised and self-supervised sentence summarization using the information bottleneck principle. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3743–3752.

Yoel Zeldes, Dan Padnos, and Barak Peleg. 2020. Haim-1.5 - the next generation. https://www.ai21.com/haim-1point5.

Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, and Yejin Choi. 2019. Defending against neural fake news. In Advances in Neural Information Processing Systems, pages 9051–9062.

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2019. BERTScore: Evaluating text generation with BERT. CoRR, abs/1904.09675.

Wanrong Zhu, Zhiting Hu, and Eric Xing. 2019. Text infilling. arXiv preprint arXiv:1901.00158.