Explaining NLP Models via Minimal Contrastive Editing (MICE)

Alexis Ross† Ana Marasović†♦ Matthew E. Peters†

†Allen Institute for Artificial Intelligence, Seattle, WA, USA
♦Paul G. Allen School of Computer Science and Engineering, University of Washington

{alexisr,anam,matthewp}@allenai.org

Abstract

Humans have been shown to give contrastive explanations, which explain why an observed event happened rather than some other counterfactual event (the contrast case). Despite the influential role that contrastivity plays in how humans explain, this property is largely missing from current methods for explaining NLP models. We present MINIMAL CONTRASTIVE EDITING (MICE), a method for producing contrastive explanations of model predictions in the form of edits to inputs that change model outputs to the contrast case. Our experiments across three tasks—binary sentiment classification, topic classification, and multiple-choice question answering—show that MICE is able to produce edits that are not only contrastive, but also minimal and fluent, consistent with human contrastive edits. We demonstrate how MICE edits can be used for two use cases in NLP system development—debugging incorrect model outputs and uncovering dataset artifacts—and thereby illustrate that producing contrastive explanations is a promising research direction for model interpretability.

1 Introduction

Cognitive science and philosophy research has shown that human explanations are contrastive (Miller, 2019): People explain why an observed event happened rather than some counterfactual event called the contrast case. This contrast case plays a key role in modulating what explanations are given. Consider Figure 1. When we seek an explanation of the model’s prediction “by train,” we seek it not in absolute terms, but in contrast to another possible prediction (i.e. “on foot”). Additionally, we tailor our explanation to this contrast case. For instance, we might explain why the prediction is “by train” and not “on foot” by saying that the writer discusses meeting Ann at the train station instead of at Ann’s home on foot; such information is captured by the edit (bolded red) that results in the new model prediction “on foot.” For a different contrast prediction, such as “by car,” we would provide a different explanation. In this work, we propose to give contrastive explanations of model predictions in the form of targeted minimal edits, as shown in Figure 1, that cause the model to change its original prediction to the contrast prediction.

Figure 1: An example MICE edit for a multiple-choice question from the RACE dataset. MICE generates contrastive explanations in the form of edits to inputs that change model predictions to target (contrast) predictions. The edit (bolded in red) is minimal and fluent, and it changes the model’s prediction from “by train” to the contrast prediction “on foot” (highlighted in gray).

Given the key role that contrastivity plays in human explanations, making model explanations contrastive could make them more user-centered and thus more useful for their intended purposes, such as debugging and exposing dataset biases (Ribera and Lapedriza, 2019)—purposes which require that humans work with explanations (Alvarez-Melis et al., 2019). However, many currently popular instance-based explanation methods produce highlights—segments of input that support a prediction (Zaidan et al., 2007; Lei et al., 2016; Chang et al., 2019; Bastings et al., 2019; Yu et al., 2019; DeYoung et al., 2020; Jain et al., 2020; Belinkov and Glass, 2019)—that can be derived through gradients (Simonyan et al., 2014; Smilkov et al., 2017; Sundararajan et al., 2017), approximations with simpler models (Ribeiro et al., 2016), or attention (Wiegreffe and Pinter, 2019; Sun and Marasović, 2021). These methods are not contrastive, as they leave the contrast case undetermined; they do not tell us what would have to be different for a model to have predicted a particular contrast label.¹

As an alternative approach to NLP model explanation, we introduce MINIMAL CONTRASTIVE EDITING (MICE)—a two-stage approach to generating contrastive explanations in the form of targeted minimal edits (as shown in Figure 1). Given an input, a fixed PREDICTOR model, and a contrast prediction, MICE generates edits to the input that change the PREDICTOR’s output from the original prediction to the contrast prediction. We formally define our edits and describe our approach in §2.

We design MICE to produce edits with properties motivated by human contrastive explanations. First, we desire edits to be minimal, altering only small portions of input, a property which has been argued to make explanations more intelligible (Alvarez-Melis et al., 2019; Miller, 2019). Second, MICE edits should be fluent, resulting in text natural for the domain and ensuring that any changes in model predictions are not driven by inputs falling out of the distribution of naturally occurring text. Our experiments (§3) on three English-language datasets, IMDB, NEWSGROUPS, and RACE, validate that MICE edits are indeed contrastive, minimal, and fluent.

We also analyze the quality of MICE edits (§4) and show how they may be used for two use cases in NLP system development. First, we show that MICE edits are comparable in size and fluency to human edits on the IMDB dataset. Next, we illustrate how MICE edits can facilitate debugging individual model predictions. Finally, we show how MICE edits can be used to uncover dataset artifacts learned by a powerful PREDICTOR model.²

¹ Free-text rationales (Narang et al., 2020) can be contrastive if human justifications are collected by asking “why... instead of...”, which is not the case with current benchmarks (Camburu et al., 2018; Rajani et al., 2019; Zellers et al., 2019).

² Our code and trained EDITOR models are publicly available at https://github.com/allenai/mice.

2 MICE: Minimal Contrastive Editing

This section describes our proposed method, MINIMAL CONTRASTIVE EDITING, or MICE, for explaining NLP models with contrastive edits.

2.1 MICE Edits as Contrastive Explanations

Contrastive explanations are answers to questions of the form Why p and not q? They explain why the observed event p happened instead of another event q, called the contrast case.³ A long line of research in the cognitive sciences and philosophy has found that human explanations are contrastive (Van Fraassen, 1980; Lipton, 1990; Miller, 2019). Human contrastive explanations have several hallmark characteristics. First, they cite contrastive features: features that result in the contrast case when they are changed in a particular way (Chin-Parker and Cantelon, 2017). Second, they are minimal in the sense that they rarely cite the entire causal chain of a particular event, but select just a few relevant causes (Hilton, 2017). In this work, we argue that a minimal edit to a model input that causes the model output to change to the contrast case has both these properties and can function as an effective contrastive explanation. We first give an illustration of contrastive explanations humans might give and then show how minimal contrastive edits offer analogous contrastive information.

As an example, suppose we want to explain why the answer to the question “Q: Where can you find a clean pillow case that is not in use?” is “A: the drawer.”⁴ If someone asks why the answer is not “C1: on the bed,” we might explain: “E1: Because only the drawer stores pillow cases that are not in use.” However, E1 would not be an explanation of why the answer is not “C2: in the laundry hamper,” since both drawers and laundry hampers store pillow cases that are not in use. For contrast case C2, we might instead explain: “E2: Because only laundry hampers store pillow cases that are not clean.” We cite different parts of the original question depending on the contrast case.

In this work, we propose to offer contrastive explanations in the form of minimal edits that result in the contrast case as model output. Such edits are effective contrastive explanations because, by construction, they highlight contrastive features. For example, a contrastive edit of the original question for contrast case C1 would delete “that is not in use,” yielding “Where can you find a clean pillow case?”; the information provided by this edit—that it is whether or not the pillow case is in use that determines whether the answer is “the drawer” or “on the bed”—is analogous to the information provided by E1. Similarly, a contrastive edit for contrast case C2 that changes “clean” to “dirty,” yielding “Where can you find a dirty pillow case that is not in use?”, provides analogous information to E2.

³ Related work also calls it the foil (Miller, 2019).

⁴ Inspired by an example in Talmor et al. (2019): Question: “Where would you store a pillow case that is not in use?” Choices: “drawer, kitchen cupboard, bedding store, england.”

Figure 2: An overview of MICE, our two-stage approach to generating edits. In Stage 1 (§2.3), we train the EDITOR to make edits targeting specific predictions from the PREDICTOR. In Stage 2 (§2.4), we make contrastive edits with the EDITOR model from Stage 1 such that the PREDICTOR changes its output to the contrast prediction.

2.2 Overview of MICE

We define a contrastive edit to be a modification of an input instance that causes a PREDICTOR model (whose behavior is being explained) to change its output from its original prediction for the unedited input to a given target (contrast) prediction. Formally, for textual inputs, given a fixed PREDICTOR f, input x = (x_1, x_2, ..., x_N) of N tokens, original prediction f(x) = y_p, and contrast prediction y_c ≠ y_p, a contrastive edit is a mapping e : (x_1, ..., x_N) → (x'_1, ..., x'_M) such that f(e(x)) = y_c.

We propose MICE, a two-stage approach to generating contrastive edits, illustrated in Figure 2. In Stage 1, we prepare a highly-contextualized EDITOR model to associate edits with given end-task labels (i.e., labels for the task of the PREDICTOR) such that the contrast label y_c is not ignored in MICE’s second stage. Intuitively, we do this by masking the spans of text that are “important” for the given target label (as measured by the PREDICTOR’s gradients) and training our EDITOR to reconstruct these spans of text given the masked text and target label as input. In Stage 2 of MICE, we generate contrastive edits e(x) using the EDITOR model from Stage 1. Specifically, we generate candidate edits e(x) by masking different percentages of x and giving masked inputs with the prepended contrast label y_c to the EDITOR; we use binary search to find optimal masking percentages and beam search to keep track of the candidate edits that result in the highest probability of the contrast label p(y_c | e(x)) given by the PREDICTOR.
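To make the definition concrete, the following sketch spells out the interface it implies. The type aliases and function names here are ours, chosen for illustration; they are not taken from the released allenai/mice code.

```python
from typing import Callable, List

# Minimal interface sketch for the definition above (names are ours,
# not those of the released allenai/mice code).
Predictor = Callable[[str], str]          # f: input text -> predicted label
Editor = Callable[[str, str], List[str]]  # (masked text, target label) -> infill candidates

def is_contrastive_edit(f: Predictor, x: str, edited_x: str, y_c: str) -> bool:
    """e(x) is a contrastive edit iff it flips f's output to the contrast label y_c."""
    return f(x) != y_c and f(edited_x) == y_c
```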

2.3 Stage 1: Fine-tuning the EDITOR

In Stage 1 of MICE, we fine-tune the EDITOR to infill masked spans of text in a targeted manner. Specifically, we fine-tune a pretrained model to infill masked spans given masked text and a target end-task label as input. In this work, we use the TEXT-TO-TEXT TRANSFER TRANSFORMER (T5) model (Raffel et al., 2020) as our pretrained EDITOR, but any model suitable for span infilling can in principle be the EDITOR in MICE. The addition of the target label allows the highly-contextualized EDITOR to condition its predictions on both the masked context and the given target label such that the contrast label is not ignored in Stage 2. What to use as target labels during Stage 1 depends on who the end-users of MICE are. The end-user could be: (1) a model developer who has access to the labeled data used to train the PREDICTOR, or (2) lay-users, domain experts, or other developers without access to the labeled data. In the former case, we could use the gold labels as targets, and in the latter case, we could use the labels predicted by the PREDICTOR. Therefore, during fine-tuning, we experiment with using both gold labels and original predictions y_p of our PREDICTOR model as target labels. To provide target labels, we prepend them to inputs to the EDITOR. For more information about how these inputs are formatted, see Appendix B. Results in Table 2 show that fine-tuning with target labels results in better edits than fine-tuning without them.
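The exact template lives in Appendix B, which is not reproduced in this transcript, so the string layout below is our assumption; it is shown only to make the label-prepending step concrete.

```python
# Hypothetical template: the paper prepends the target label to the masked
# input (the exact format is given in Appendix B, not shown here, so this
# string layout is an assumption for illustration).
def format_editor_input(target_label: str, masked_text: str) -> str:
    return f"label: {target_label}. input: {masked_text}"

print(format_editor_input("positive",
                          "A <extra_id_0> movie with <extra_id_1> acting."))
# -> label: positive. input: A <extra_id_0> movie with <extra_id_1> acting.
```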

The above procedure allows our EDITOR to condition its infilled spans on both the context and the target label. But this still leaves open the question of where to mask our text. Intuitively, we want to mask the tokens that contribute most to the PREDICTOR’s predictions, since these are the tokens that are most strongly associated with the target label. We propose to use gradient attribution (Simonyan et al., 2014) to choose tokens to mask. For each instance, we take the gradient of the predicted logit for the target label with respect to the embedding layers of f and take the ℓ1 norm across the embedding dimension. We then mask the n_1% of tokens with the highest gradient norms. We replace consecutive tokens (i.e., spans) with sentinel tokens, following Raffel et al. (2020). Results in Table 1 show that gradient-based masking outperforms random masking.
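As a sketch of this masking step, assuming a HuggingFace-style classifier that exposes get_input_embeddings() and a .logits output (this is our reconstruction of the description above, not the released implementation):

```python
import torch

def gradient_mask_indices(model, input_ids, target_label_idx, mask_frac):
    """Pick the mask_frac fraction of tokens with the largest l1 gradient norm.

    A reconstruction of the gradient attribution described in Section 2.3,
    assuming a HuggingFace-style classifier; details may differ from the
    released allenai/mice code.
    """
    embeds = model.get_input_embeddings()(input_ids)       # (1, N, d)
    embeds.retain_grad()                                   # keep grad on a non-leaf tensor
    logits = model(inputs_embeds=embeds).logits            # (1, num_labels)
    logits[0, target_label_idx].backward()                 # gradient of the target logit
    grad_norms = embeds.grad.abs().sum(dim=-1).squeeze(0)  # l1 norm over embedding dim
    n_mask = max(1, int(mask_frac * grad_norms.numel()))
    return set(torch.topk(grad_norms, n_mask).indices.tolist())

def mask_with_sentinels(tokens, mask_idx):
    """Replace each consecutive run of masked tokens with one T5 sentinel token."""
    out, sentinel, i = [], 0, 0
    while i < len(tokens):
        if i in mask_idx:
            out.append(f"<extra_id_{sentinel}>")
            sentinel += 1
            while i in mask_idx:     # collapse the whole run into one sentinel
                i += 1
        else:
            out.append(tokens[i])
            i += 1
    return " ".join(out)
```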

2.4 Stage 2: Making Edits with the EDITOR

In the second stage of our approach, we use our fine-tuned EDITOR to make edits using beam search (Reddy, 1977). In each round of edits, we mask consecutive spans of n_2% of tokens in the original input, prepend the contrast prediction to the masked input, and feed the resulting masked instance to the EDITOR; the EDITOR then generates m edits. The masking procedure during this stage is gradient-based, as in Stage 1.

In one round of edits, we conduct a binary search with s levels over values of n_2 between n_2 = 0% and n_2 = 55% to efficiently find a value of n_2 that is large enough to result in the contrast prediction while also modifying only minimal parts of the input. After each round of edits, we get f’s predictions on the edited inputs, order them by contrast prediction probability, and update the beam to store the top b edited instances. As soon as an edit e* = e(x) is found that results in the contrast prediction, i.e., f(e*) = y_c, we stop the search procedure and return this edit. For generation, we use a combination of top-k (Fan et al., 2018) and top-p (nucleus) sampling (Holtzman et al., 2020).⁵

⁵ We use this combination because we observed in preliminary experiments that it led to good results.
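The search loop can be sketched as follows, with the hyperparameters reported in §3.1 as defaults (b = 3, s = 4, m = 15, 3 rounds). Here propose_edits stands in for the gradient-based masking plus EDITOR sampling described above, and returning the flip with the highest contrast probability is a simplification (when several flips are found, the paper reports the most minimal one).

```python
from typing import Callable, List, Optional

def mice_stage2(
    predict: Callable[[str], str],                      # f's label for a text
    p_contrast: Callable[[str], float],                 # f's probability of y_c
    propose_edits: Callable[[str, float], List[str]],   # (text, n2) -> m sampled edits
    original: str,
    contrast_label: str,
    beam_width: int = 3,      # b
    search_levels: int = 4,   # s
    max_rounds: int = 3,
) -> Optional[str]:
    """Sketch of MICE Stage 2 (our reconstruction of Section 2.4)."""
    beam = [original]
    for _ in range(max_rounds):
        candidates: List[str] = []
        for text in beam:
            lo, hi = 0.0, 0.55                # binary search over masking fraction n2
            for _ in range(search_levels):
                n2 = (lo + hi) / 2
                edits = propose_edits(text, n2)
                candidates.extend(edits)
                if any(predict(e) == contrast_label for e in edits):
                    hi = n2                   # a flip exists: try masking less
                else:
                    lo = n2                   # no flip yet: mask more of the input
        flips = [e for e in candidates if predict(e) == contrast_label]
        if flips:                             # stop as soon as a flip is found
            return max(flips, key=p_contrast)
        # otherwise keep the b edits with the highest contrast-label probability
        beam = sorted(candidates, key=p_contrast, reverse=True)[:beam_width]
    return None
```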

3 Evaluation

This section presents empirical findings that MICE produces minimal and fluent contrastive edits.

3.1 Experimental Setup

Tasks We evaluate MICE on three English-language datasets: IMDB, a binary sentiment classification task (Maas et al., 2011); a 6-class version of the 20 NEWSGROUPS topic classification task (Lang, 1995); and RACE, a multiple-choice question-answering task (Lai et al., 2017).⁶

PREDICTORS MICE can be used to make contrastive edits for any differentiable PREDICTOR model, i.e., any end-to-end neural model. In this paper, for each task, we train a PREDICTOR model f built on ROBERTA-LARGE (Liu et al., 2019), and fix it during evaluation. The test accuracies of our PREDICTORS are 95.9%, 85.3%, and 84% for IMDB, NEWSGROUPS, and RACE, respectively. For training details, see Appendix A.1.

EDITORS Our EDITORS build on the base version of T5. For fine-tuning our EDITORS (Stage 1), we use the original training data used to train PREDICTORS. We randomly split the data 75%/25% for fine-tuning/validation and fine-tune until the validation loss stops decreasing (for a max of 10 epochs) with n_1% of tokens masked, where n_1 is a randomly chosen value in [20, 55]. For more details, see Appendix A.2. In Stage 2, for each instance, we set the label with the second-highest predicted probability as the contrast prediction. We set beam width b = 3, consider s = 4 search levels during binary search over n_2 in each edit round, and run our search for a max of 3 edit rounds. For each n_2, we sample m = 15 generations from our fine-tuned EDITORS with p = 0.95, k = 30.⁷

Metrics We evaluate MICE on the test sets of the three datasets. The RACE and NEWSGROUPS test sets contain 4,934 and 7,307 instances, respectively.⁸ For IMDB, we randomly sample 5K of the 25K instances in the test set for evaluation because of the computational demands of evaluation.⁹

⁶ We create this 6-class version by mapping the 20 existing subcategories to their respective larger categories—i.e., “talk.politics.guns” and “talk.religion.misc” → “talk.” We do this in order to make the label space smaller. The resulting classes are: alt, comp, misc, rec, sci, and talk.

⁷ We tune these hyperparameters on a 50-instance subset of the IMDB validation set prior to evaluation. We note that for larger values of n_2, the generations produced by the T5 EDITORS sometimes degenerate; see Appendix C for details.

⁸ For the NEWSGROUPS test set, there are 7,307 instances remaining after filtering out empty strings.
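The 6-class mapping in footnote 6 can be sketched as taking the prefix before the first dot. This rule reproduces the stated examples, but the paper does not spell out how subcategories outside the six listed classes (e.g., “soc.religion.christian”) were folded in, so treat it as an approximation.

```python
def to_coarse_label(newsgroup: str) -> str:
    # Collapse a 20 NEWSGROUPS subcategory to its top-level category, per
    # footnote 6 ("talk.politics.guns" -> "talk"). How categories outside the
    # six listed classes were handled is not specified in the paper, so this
    # prefix rule is only an approximation.
    return newsgroup.split(".")[0]

assert to_coarse_label("talk.religion.misc") == "talk"
assert to_coarse_label("rec.sport.hockey") == "rec"
```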


MICE VARIANT    IMDB                             NEWSGROUPS                       RACE
                Flip Rate↑  Minim.↓  Fluen.≈1    Flip Rate↑  Minim.↓  Fluen.≈1    Flip Rate↑  Minim.↓  Fluen.≈1
*PRED + GRAD    1.000       0.173    0.981       0.992       0.261    0.968       0.915       0.331    0.981
*GOLD + GRAD    1.000       0.185    0.979       0.992       0.271    0.966       0.945       0.335    0.979
PRED + RAND     1.000       0.257    0.958       0.968       0.378    0.928       0.799       0.440    0.953
GOLD + RAND     1.000       0.302    0.952       0.965       0.370    0.929       0.801       0.440    0.955
NO-FINETUNE     0.995       0.360    0.960       0.941       0.418    0.938       –           –        –

Table 1: Efficacy of the MICE procedure. We evaluate MICE edits on three metrics (described in §3.1): flip rate, minimality, and fluency. We report mean values for minimality and fluency. * marks full MICE variants; others explore ablations. We experiment with the PREDICTOR’s predictions (PRED) and gold labels (GOLD) as target labels during Stage 1. Across datasets, our GRAD MICE procedure achieves a high flip rate with small and fluent edits.


For each dataset, we measure the following three properties: (1) flip rate: the proportion of instances for which an edit results in the contrast label; (2) minimality: the “size” of the edit as measured by the word-level Levenshtein distance between the original and edited input, which is the minimum number of deletions, insertions, or substitutions required to transform one into the other. We report a normalized version of this metric with a range from 0 to 1—the Levenshtein distance divided by the number of words in the original input; (3) fluency: a measure of how similarly distributed the edited output is to the original data. We evaluate fluency by comparing masked language modeling loss on both the original and edited inputs using a pretrained model. Specifically, given the original N-length sequence, we create N copies, each with a different token replaced by a mask token, following Salazar et al. (2020). We then take a pretrained T5-BASE model and compute the average loss across these N copies. We compute this loss value for both the original input and the edited input and report their ratio—i.e., edited / original. We aim for a value of 1.0, which indicates equivalent losses for the original and edited texts. When MICE finds multiple edits, we report metrics for the edit with the smallest value for minimality.
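A sketch of how these two quantities can be computed, assuming a HuggingFace T5-BASE checkpoint for the fluency score; the paper’s exact implementation may differ in tokenization details.

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

def word_levenshtein(a, b):
    """Minimum number of word deletions, insertions, or substitutions."""
    dp = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, wb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,         # delete wa
                                     dp[j - 1] + 1,     # insert wb
                                     prev + (wa != wb)) # substitute
    return dp[-1]

def minimality(original, edited):
    """Word-level Levenshtein distance, normalized by the original length."""
    a, b = original.split(), edited.split()
    return word_levenshtein(a, b) / len(a)

def masked_lm_loss(text, model, tokenizer):
    """Average T5 loss over N copies of `text`, each with one token masked
    (per-token masking following Salazar et al., 2020)."""
    ids = tokenizer(text, return_tensors="pt").input_ids[0]
    sentinel = tokenizer.convert_tokens_to_ids("<extra_id_0>")
    losses = []
    for i in range(len(ids) - 1):                       # skip the trailing </s>
        masked = ids.clone()
        masked[i] = sentinel
        labels = torch.tensor([[sentinel, int(ids[i]), tokenizer.eos_token_id]])
        with torch.no_grad():
            losses.append(model(input_ids=masked.unsqueeze(0), labels=labels).loss.item())
    return sum(losses) / len(losses)

def fluency(original, edited, model, tokenizer):
    """Ratio of edited to original masked LM loss (a value of 1.0 is ideal)."""
    return masked_lm_loss(edited, model, tokenizer) / masked_lm_loss(original, model, tokenizer)

model = T5ForConditionalGeneration.from_pretrained("t5-base").eval()
tokenizer = T5Tokenizer.from_pretrained("t5-base")
```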

3.2 Results

Results are shown in Table 1. Our proposed GRAD MICE procedure (upper part of Table 1) achieves a high flip rate across all three tasks. This is the outcome regardless of whether predicted target labels (first row, 91.5–100% flip rate) or gold target labels (second row, 94.5–100% flip rate) are used for fine-tuning in Stage 1. We observe a slight improvement from using the gold labels for the RACE PREDICTOR, which may be explained by the fact that it is less accurate (with a training accuracy of 89.9%) than the IMDB and NEWSGROUPS classifiers.

⁹ A single contrastive edit is expensive and takes an average of ≈15 seconds per IMDB instance (≈230 tokens). Calculating the fluency metric adds an additional average of ≈16.5 seconds per IMDB instance. For more details, see Section 5.

MICE achieves a high flip rate while its edits remain small and result in fluent text. In particular, MICE on average changes 17.3–33.1% of the original tokens when predicted labels are used in Stage 1 and 18.5–33.5% with gold labels. Fluency is close to 1.0, indicating no notable change in masked language modeling loss after the edit—i.e., edits fall in the distribution of the original data. We achieve the best results across metrics on the IMDB dataset, as expected since IMDB is a binary classification task with a small label space. These results demonstrate that MICE presents a promising research direction for the generation of contrastive explanations; however, there is still room for improvement, especially for more challenging tasks such as RACE.

In the rest of this section, we provide results from several ablation experiments.

Fine-tuning vs. No Fine-tuning We investigate the effect of fine-tuning (Stage 1) with a baseline that skips Stage 1 altogether. For this NO-FINETUNE baseline variant of MICE, we use the vanilla pretrained T5-BASE as our EDITOR. As shown in Table 1, the NO-FINETUNE variant underperforms all other (two-stage) variants of MICE for the IMDB and NEWSGROUPS datasets.¹⁰ Fine-tuning particularly improves the minimality of edits, while leaving the flip rate high. We hypothesize that this effect is due to the effectiveness of Stage 2 of MICE at finding contrastive edits: Because we iteratively generate many candidate edits using beam search, we are likely to find a prediction-flipping edit. Fine-tuning allows us to find such an edit at a lower masking percentage.

¹⁰ We leave RACE out from our evaluation with the NO-FINETUNE baseline because we observe that the pretrained T5 model does not generate text formatted as span infills; we hypothesize that this model has not been trained to generate infills for masked inputs formatted as multiple-choice inputs.


IMDB
Stage 1     Stage 2     Flip Rate↑   Minim.↓   Fluen.≈1
No Label    No Label    0.994        0.369     0.966
No Label    Label       0.997        0.362     0.967
Label       No Label    0.999        0.327     0.968
Label       Label       1.000        0.173     0.981

Table 2: Effect of using target end-task labels during the two stages of PRED+GRAD MICE on the IMDB dataset. When end-task labels are provided, they are original PREDICTOR labels during Stage 1 and contrast labels during Stage 2. Using end-task labels during both Stage 1 (EDITOR fine-tuning) and Stage 2 (making edits) of MICE outperforms all other conditions.


Gradient vs. Random Masking We study the impact of using gradient-based masking in Stage 1 of the MICE procedure with a RAND variant, which masks spans of randomly chosen tokens. As shown in the middle part of Table 1, gradient-based masking outperforms random masking when using both predicted and gold labels, across all three tasks and metrics, suggesting that the gradient-based attribution used to mask text during Stage 1 of MICE is an important part of the procedure. The differences are especially notable for RACE, which is the most challenging task according to our metrics.

Targeted vs. Un-targeted Infilling We investigate the effect of using target labels in both stages of MICE by experimenting with removing target labels during Stage 1 (EDITOR fine-tuning) and Stage 2 (making edits). As shown in Table 2, we observe that giving target labels to our EDITORS during both stages of MICE improves edit quality. Fine-tuning EDITORS without labels in Stage 1 (“No Label”) leads to worse flip rate, minimality, and fluency than does fine-tuning EDITORS with labels (“Label”). Minimality is particularly affected, and we hypothesize that using target end-task labels in both stages provides signal that allows the EDITOR in Stage 2 to generate prediction-flipping edits at lower masking percentages.

4 Analysis of Edits

In this section, we compare MICE edits with human contrastive edits. Then, we turn to a key motivation for this work: the potential for contrastive explanations to assist in NLP system development. We show how MICE edits can be used to debug incorrect predictions and uncover dataset artifacts.

4.1 Comparison with Human Edits

We ask whether the contrastive edits produced by MICE are minimal and fluent in a meaningful sense. In particular, we compare these two metrics for MICE edits and human contrastive edits. We work with the IMDB contrast set created by Gardner et al. (2020), which consists of original test inputs and human-edited inputs that cause a change in true label. We report metrics on the subset of this contrast set for which the human-edited inputs result in a change in model prediction for our IMDB PREDICTOR; this subset consists of 76 instances. The flip rate of MICE edits on this subset is 100%. The mean minimality values of human and MICE edits are 0.149 (human) and 0.179 (MICE), and the mean fluency values are 1.01 (human) and 0.949 (MICE). The similarity of these values suggests that MICE edits are comparable to human contrastive edits along these dimensions.

We also ask to what extent human edits overlap with MICE edits. For each input, we compute the overlap between the original tokens changed by humans and the original tokens edited by MICE. The mean number of overlapping tokens, normalized by the number of original tokens edited by humans, is 0.298. Thus, while there is some overlap between MICE and human contrastive edits, they generally change different parts of the text.¹¹ This analysis suggests that there may exist multiple informative contrastive edits for a single input. Future work can investigate and compare the different kinds of insight that can be obtained through human and model-driven contrastive edits.
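For concreteness, the overlap statistic can be computed as below; representing each edit by the set of original-token indices it changes is our assumption about the bookkeeping, not a detail the paper specifies.

```python
def edit_overlap(human_changed: set, mice_changed: set) -> float:
    """Tokens changed by both edits, normalized by the number of original
    tokens the human changed (the statistic reported above as 0.298)."""
    return len(human_changed & mice_changed) / len(human_changed)
```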

4.2 Use Case 1: Debugging Incorrect Outputs

Here, we illustrate how MICE edits can be used to debug incorrect model outputs. Consider the RACE input in Table 3, for which the RACE PREDICTOR gives an incorrect prediction. In this case, a model developer may want to understand why the model got the answer wrong. This setting naturally gives rise to a contrastive question, i.e., Why did the model predict the wrong choice (“twice”) instead of the correct one (“only once”)?

¹¹ MICE edits explain PREDICTORS’ behavior and therefore need not be similar to human edits, which are designed to change gold labels.


IMDB

Original pred y_p = positive (true label)    Contrast pred y_c = negative

An interesting pairing of stories, this little flick manages to bring together seemingly different characters and story lines all in the backdrop of WWII and succeeds in tying them together without losing the audience. I was impressed by the depth portrayed by the different characters and also by how much I really felt I understood them and their motivations, even though the time spent on the development of each character was very limited. The outstanding acting abilities of the individuals involved with this picture are easily noted. A fun, stylized movie with a slew of comic moments and a bunch more head shaking events. [7/10 → 4/10]

RACE

Question: Mark went up in George’s plane ___. (a) twice (b) only once (true label) (c) several times (d) once or twice

Original pred y_p = (a) twice    Contrast pred y_c = (b) only once

When George was thirty-five, he bought a small plane and learned to fly it. He soon became very good and made his plane do all kinds of tricks. George had a friend, whose name was Mark. One day George offered to take Mark up in his plane. Mark thought, “I’ve traveled in a big plane several times, but I’ve never been in a small one, so I’ll go.” They went up, and George flew around for half an hour and did all kinds of tricks in the air. When they came down again, Mark was glad to be back safely, and he said to his friend in a shaking voice, “Well, George, thank you very much for those two [trips → tricks] in your plane.” George was very surprised and said, “Two [trips? → tricks.]” “Yes. That’s my first and my last time, George,” [answered → said] Mark.

Table 3: Examples of edits produced by MICE, shown as [original text → edited text]. y_p is the PREDICTOR’s original prediction, and y_c the contrast prediction. True labels for original inputs are marked “(true label)”.


The MICE edit shown offers insight into this question: Firstly, it highlights which part of the paragraph has an influence on the model prediction—the last few sentences. Secondly, it reveals that a source of confusion is Mark’s joke about having traveled in George’s plane twice, as changing Mark’s dialogue from talking about a “first and... last” trip to a single trip results in a correct model prediction.

MICE edits can also be used to debug model capabilities by offering hypotheses about “bugs” present in models: For instance, the edit in Table 3 might prompt a developer to investigate whether this PREDICTOR lacks non-literal language understanding capabilities. In the next section, we show how insight from individual MICE edits can be used to uncover a bug in the form of a dataset-level artifact learned by a model. In Appendix D, we further analyze the debugging utility of MICE edits with a PREDICTOR designed to contain a bug.

4.3 Use Case 2: Uncovering Dataset Artifacts

Manual inspection of some edits for IMDB suggests that the IMDB PREDICTOR has learned to rely heavily on numerical ratings. For instance, in the IMDB example in Table 3, the MICE edit results in a negative prediction from the PREDICTOR even though the edited text is overwhelmingly positive. We test this hypothesis by investigating whether numerical tokens are more likely to be edited by MICE.

             y_c = positive                y_c = negative
Removed      Inserted        Removed      Inserted
4/10         excellent       10/10        awful
ridiculous   enjoy           8/10         disappointed
horrible     amazing         7/10         1
4            entertaining    9            4
predictable  10              enjoyable    annoying

Table 4: Top 5 IMDB tokens edited by MICE at a higher rate than expected given their original frequency (§4.3). Results are separated by contrast prediction.


We analyze the edits produced by MICE (GOLD + GRAD) described in §3.1. We limit our analysis to a subset of the 5K instances for which the edit produced by MICE has a minimality value of ≤ 0.05, as we are interested in finding simple artifacts driving the predictions of the IMDB PREDICTOR; this subset has 902 instances. We compute three metrics for each unique token, i.e., type t:

    p(t) = #occurrences(t) / #all_tokens,
    p_r(t) = #removals(t) / #all_removals,
    p_i(t) = #insertions(t) / #all_insertions,

and report the tokens with the highest values for the ratios p_r(t)/p(t) and p_i(t)/p(t). Intuitively, these tokens are removed/inserted at a higher rate than expected given the frequency with which they appear in the original IMDB inputs. We exclude tokens that occur <10 times from our analysis.

Results from this analysis are shown in Table 4. In line with our hypothesis, we observe a bias towards removing low numerical ratings and inserting high ratings when the contrast prediction y_c is positive, and vice versa when y_c is negative. In other words, in the presence of a numerical score, the PREDICTOR may ignore the content of the review and base its prediction solely on the score (as in the IMDB example in Table 3).
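A sketch of this token-level analysis (our reconstruction of the three ratios above; the input token lists are pooled over the analyzed edits):

```python
from collections import Counter
from typing import Iterable, List, Tuple

def artifact_ratios(original_tokens: Iterable[str],
                    removed_tokens: Iterable[str],
                    inserted_tokens: Iterable[str],
                    min_count: int = 10,
                    top_n: int = 5) -> Tuple[List[str], List[str]]:
    """Tokens removed/inserted by MICE at a higher rate than their base
    frequency predicts: the ratios p_r(t)/p(t) and p_i(t)/p(t) above."""
    p = Counter(original_tokens)    # token counts over original inputs
    pr = Counter(removed_tokens)    # counts over removed tokens
    pi = Counter(inserted_tokens)   # counts over inserted tokens
    n, nr, ni = sum(p.values()), sum(pr.values()), sum(pi.values())
    # Exclude rare types (<min_count occurrences in the original inputs).
    removal_ratio = {t: (pr[t] / nr) / (p[t] / n)
                     for t in pr if p[t] >= min_count}
    insertion_ratio = {t: (pi[t] / ni) / (p[t] / n)
                       for t in pi if p[t] >= min_count}

    def top(r):  # types with the highest over-editing ratio
        return sorted(r, key=r.get, reverse=True)[:top_n]

    return top(removal_ratio), top(insertion_ratio)
```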

5 Discussion

In this section, we reflect on MICE’s shortcomings. Foremost, MICE is computationally expensive. Stage 1 requires fine-tuning a large pretrained generation model as the EDITOR. More significantly, Stage 2 requires multiple rounds of forward and backward passes to find a minimal edit: Each edit round in Stage 2 requires b × s × m decoded sequences with the EDITOR, as well as b × s × m forward passes and b backward passes with the PREDICTOR (with b = 1 in the first edit round), where b is the beam width, s is the number of search levels in binary search over the masking percentages, and m is the number of generations sampled for each masking percentage. Our experiments required 180 forward passes, 180 decoded sequences, and 3 backward passes for edit rounds after the first.

While efficient search for targeted edits is an open challenge in other fields of machine learning (Russell, 2019; Dandl et al., 2020), this problem is even more challenging for language data, as the space of possible perturbations is much larger than for tabular data. An important future direction is to develop more efficient methods of finding edits.

This shortcoming prevents us from finding edits that are minimal in a precise sense. In particular, we may be interested in a constrained notion of minimality that defines an edit e(x) as minimal if there exists no subset of e(x) that results in the contrast prediction. Future work might consider creating methods to produce edits with this property.

6 Related Work

The problem of generating minimal contrastive edits, also called counterfactual explanations (Wachter et al., 2017),¹² has previously been explored for tabular data (Karimi et al., 2020) and images (Hendricks et al., 2018; Goyal et al., 2019; Looveren and Klaise, 2019) but less for language. Recent work explores the use of minimal edits changing true labels for evaluation (Gardner et al., 2020) and data augmentation (Kaushik et al., 2020; Teney et al., 2020), whereas we focus on minimal edits changing model predictions for explanation.

¹² Formally, methods for producing targeted counterfactual explanations solve the same task as MICE. However, not all contrastive explanations are counterfactual explanations; contrastive explanations can take forms beyond contrastive edits, such as free-text rationales (Liang et al., 2020) or highlights (Jacovi and Goldberg, 2020). In this paper, we choose to refer to MICE edits as “contrastive” rather than “counterfactual” because we seek to argue for the utility of contrastive explanations of model predictions more broadly; we present MICE as one method for producing contrastive explanations of a particular form and hope future work will explore different forms of contrastive explanations.

Contrastive Explanations within NLP There exist limited methods for automatically generating contrastive explanations of NLP models. Jacovi and Goldberg (2020) define contrastive highlights, which are determined by the inclusion of contrastive features; in contrast, our contrastive edits specify how to edit (vs. whether to include) features and can insert new text.¹³ Li et al. (2020a) generate counterfactuals using linguistically-informed transformations (LIT), and Yang et al. (2020) generate counterfactuals for binary financial text classification using grammatically plausible single-word edits (REP-SCD). Because both methods rely on manually curated, task-specific rules, they cannot be easily extended to tasks without predefined label spaces, such as RACE.¹⁴ Most recently, Jacovi et al. (2021) propose a method for producing contrastive explanations in the form of latent representations; in contrast, MICE edits are made at the textual level and are therefore more interpretable.

¹³ See Appendix D for a longer discussion about the advantage of inserting new text in explanations, which MICE edits can do but methods that attribute feature importance (i.e., highlights) cannot.

¹⁴ LIT relies on hand-crafted transformations for NLI tasks based on linguistic knowledge, and REP-SCD makes antonym-based edits using manually curated, domain-specific lexicons for each label.

This work also has ties to the literature on causal explanation (Pearl, 2009). Recent work within NLP derives causal explanations of models through counterfactual interventions (Feder et al., 2021; Vig et al., 2020). The focus of our work is the largely unexplored task of creating targeted interventions for language data; however, the question of how to derive causal relationships from such interventions remains an interesting direction for future work.

Counterfactuals Beyond Explanations Concurrent work by Madaan et al. (2021) applies controlled text generation methods to generate targeted counterfactuals and explores their use as test cases and augmented examples in the context of classification. Another concurrent work by Wu et al. (2021) presents POLYJUICE, a general-purpose, un-targeted counterfactual generator. Very recent work by Sha et al. (2021), introduced after the submission of MICE, proposes a method for targeted contrastive editing for Q&A that selects answer-related tokens, masks them, and generates new tokens. Our work differs from these works in our novel framework for efficiently finding minimal edits (MICE Stage 2) and our use of edits as explanations.

Connection to Adversarial Examples Adversarial examples are minimally edited inputs that cause models to incorrectly change their predictions despite no change in true label (Jia and Liang, 2017; Ebrahimi et al., 2018; Pal and Tople, 2020). Recent methods for generating adversarial examples also preserve fluency (Zhang et al., 2019; Li et al., 2020b; Song et al., 2020);¹⁵ however, adversarial examples are designed to find erroneous changes in model outputs, whereas contrastive edits place no such constraint on model correctness. Thus, current approaches to generating adversarial examples, which can exploit semantics-preserving operations (Ribeiro et al., 2018) such as paraphrasing (Iyyer et al., 2018) or word replacement (Alzantot et al., 2018; Ren et al., 2019; Garg and Ramakrishnan, 2020), cannot be used to generate contrastive edits.

Connection to Style Transfer The goal of style transfer is to generate minimal edits to inputs that result in a target style (sentiment, formality, etc.) (Fu et al., 2018; Li et al., 2018; Goyal et al., 2020). Most existing approaches train an encoder to learn a style-agnostic latent representation of inputs and train attribute-specific decoders to generate text reflecting the content of inputs but exhibiting a different target attribute (Fu et al., 2018; Li et al., 2018; Goyal et al., 2020). Recent works by Wu et al. (2019) and Malmi et al. (2020) adopt two-stage approaches that first identify where to make edits and then make them using pretrained language models. Such approaches can only be applied to generate contrastive edits for classification tasks with well-defined “styles,” which excludes more complex tasks such as question answering.

¹⁵ Song et al. (2020) propose a method to produce fluent semantic collisions, which they call the “inverse” of adversarial examples.

7 Conclusion

We argue that contrastive edits, which change the output of a PREDICTOR to a given contrast prediction, are effective explanations of neural NLP models. We propose MINIMAL CONTRASTIVE EDITING (MICE), a method for generating such edits. We introduce evaluation criteria for contrastive edits that are motivated by human contrastive explanations—minimality and fluency—and show that MICE edits for the IMDB, NEWSGROUPS, and RACE datasets are contrastive, fluent, and minimal. Through qualitative analysis of MICE edits, we show that they have utility for robust and reliable NLP system development.

8 Broader Impact Statement

MICE is intended to aid the interpretation of NLP models. As a model-agnostic explanation method, it has the potential to impact NLP system development across a wide range of models and tasks. In particular, MICE edits can benefit NLP model developers by facilitating debugging and exposing dataset artifacts, as discussed in §4. As a consequence, they can also benefit downstream users of NLP models by facilitating access to less biased and more robust systems.

While the focus of our work is on interpreting NLP models, there are potential misuses of MICE that involve other applications. Firstly, malicious actors might employ MICE to generate adversarial examples; for instance, they may aim to generate hate speech that is minimally edited such that it fools a toxic language classifier. Secondly, naively applying MICE for data augmentation could plausibly lead to less robust and more biased models: Because MICE edits are intended to expose issues in models, straightforwardly using them as additional training examples could reinforce existing artifacts and biases present in data. To mitigate this risk, we encourage researchers exploring data augmentation to carefully think about how to select and label edited instances.

We also encourage researchers to develop more efficient methods of generating minimal contrastive edits. As discussed in §5, a limitation of MICE is its computational demand. Therefore, we recommend that future work focus on creating methods that require less compute.


References

David Alvarez-Melis, Hal Daumé III, Jennifer Wortman Vaughan, and Hanna Wallach. 2019. Weight of evidence as a basis for human-oriented explanations. In Workshop on Human-Centric Machine Learning at the 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Moustafa Alzantot, Yash Sharma, Ahmed Elgohary, Bo-Jhang Ho, Mani Srivastava, and Kai-Wei Chang. 2018. Generating natural language adversarial examples. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2890–2896, Brussels, Belgium. Association for Computational Linguistics.

Jasmijn Bastings, Wilker Aziz, and Ivan Titov. 2019. Interpretable neural predictions with differentiable binary variables. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2963–2977, Florence, Italy. Association for Computational Linguistics.

Yonatan Belinkov and James Glass. 2019. Analysis methods in neural language processing: A survey. Transactions of the Association for Computational Linguistics, 7:49–72.

Oana-Maria Camburu, Tim Rocktäschel, Thomas Lukasiewicz, and Phil Blunsom. 2018. e-SNLI: Natural language inference with natural language explanations. In Advances in Neural Information Processing Systems, volume 31, pages 9539–9549. Curran Associates, Inc.

Shiyu Chang, Yang Zhang, Mo Yu, and Tommi Jaakkola. 2019. A game theoretic approach to class-wise selective rationalization. In Advances in Neural Information Processing Systems, volume 32, pages 10055–10065. Curran Associates, Inc.

Seth Chin-Parker and Julie A. Cantelon. 2017. Contrastive constraints guide explanation-based category learning. Cognitive Science, 41(6):1645–1655.

Susanne Dandl, Christoph Molnar, Martin Binder, and Bernd Bischl. 2020. Multi-objective counterfactual explanations. In Parallel Problem Solving from Nature – PPSN XVI, pages 448–469, Cham. Springer International Publishing.

Jay DeYoung, Sarthak Jain, Nazneen Fatema Rajani, Eric Lehman, Caiming Xiong, Richard Socher, and Byron C. Wallace. 2020. ERASER: A benchmark to evaluate rationalized NLP models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4443–4458. Association for Computational Linguistics.

Javid Ebrahimi, Anyi Rao, Daniel Lowd, and Dejing Dou. 2018. HotFlip: White-box adversarial examples for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 31–36, Melbourne, Australia. Association for Computational Linguistics.

Angela Fan, Mike Lewis, and Yann Dauphin. 2018. Hierarchical neural story generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 889–898, Melbourne, Australia. Association for Computational Linguistics.

Amir Feder, Nadav Oved, Uri Shalit, and Roi Reichart. 2021. CausaLM: Causal model explanation through counterfactual language models. Computational Linguistics, pages 1–54.

Zhenxin Fu, Xiaoye Tan, Nanyun Peng, Dongyan Zhao, and Rui Yan. 2018. Style transfer in text: Exploration and evaluation. In AAAI Conference on Artificial Intelligence.

Matt Gardner, Yoav Artzi, Victoria Basmov, Jonathan Berant, Ben Bogin, Sihao Chen, Pradeep Dasigi, Dheeru Dua, Yanai Elazar, Ananth Gottumukkala, Nitish Gupta, Hannaneh Hajishirzi, Gabriel Ilharco, Daniel Khashabi, Kevin Lin, Jiangming Liu, Nelson F. Liu, Phoebe Mulcaire, Qiang Ning, Sameer Singh, Noah A. Smith, Sanjay Subramanian, Reut Tsarfaty, Eric Wallace, Ally Zhang, and Ben Zhou. 2020. Evaluating models’ local decision boundaries via contrast sets. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1307–1323. Association for Computational Linguistics.

Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew Peters, Michael Schmitz, and Luke S. Zettlemoyer. 2017. AllenNLP: A deep semantic natural language processing platform.

Siddhant Garg and Goutham Ramakrishnan. 2020. BAE: BERT-based adversarial examples for text classification. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6174–6181. Association for Computational Linguistics.

Navita Goyal, Balaji Vasan Srinivasan, N. Anandhavelu, and Abhilasha Sancheti. 2020. Multi-dimensional style transfer for partially annotated data using language models as discriminators. ArXiv, arXiv:2010.11578.

Yash Goyal, Ziyan Wu, Jan Ernst, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. Counterfactual visual explanations. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 2376–2384, Long Beach, California, USA. PMLR.

Lisa Anne Hendricks, Ronghang Hu, Trevor Darrell, and Zeynep Akata. 2018. Grounding visual explanations. In Computer Vision – ECCV 2018, pages 269–286, Cham. Springer International Publishing.


Denis Hilton. 2017. Social Attribution and Explanation. Oxford University Press.

Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2020. The curious case of neural text degeneration. In International Conference on Learning Representations.

Mohit Iyyer, John Wieting, Kevin Gimpel, and Luke Zettlemoyer. 2018. Adversarial example generation with syntactically controlled paraphrase networks. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1875–1885, New Orleans, Louisiana. Association for Computational Linguistics.

Alon Jacovi and Yoav Goldberg. 2020. Aligning faithful interpretations with their social attribution. ArXiv, arXiv:2006.01067.

Alon Jacovi, Swabha Swayamdipta, Shauli Ravfogel, Yanai Elazar, Yejin Choi, and Yoav Goldberg. 2021. Contrastive explanations for model interpretability. ArXiv, arXiv:2103.01378.

Sarthak Jain, Sarah Wiegreffe, Yuval Pinter, and Byron C. Wallace. 2020. Learning to faithfully rationalize by construction. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4459–4473. Association for Computational Linguistics.

Robin Jia and Percy Liang. 2017. Adversarial examples for evaluating reading comprehension systems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2021–2031, Copenhagen, Denmark. Association for Computational Linguistics.

Amir-Hossein Karimi, Gilles Barthe, Borja Balle, and Isabel Valera. 2020. Model-agnostic counterfactual explanations for consequential decisions. In Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS).

Divyansh Kaushik, Eduard Hovy, and Zachary Lipton. 2020. Learning the difference that makes a difference with counterfactually-augmented data. In International Conference on Learning Representations.

Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. RACE: Large-scale ReAding comprehension dataset from examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 785–794, Copenhagen, Denmark. Association for Computational Linguistics.

Ken Lang. 1995. NewsWeeder: Learning to filter netnews. In Proceedings of the Twelfth International Conference on Machine Learning, pages 331–339.

Tao Lei, Regina Barzilay, and Tommi Jaakkola. 2016. Rationalizing neural predictions. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 107–117, Austin, Texas. Association for Computational Linguistics.

Chuanrong Li, Lin Shengshuo, Zeyu Liu, Xinyi Wu, Xuhui Zhou, and Shane Steinert-Threlkeld. 2020a. Linguistically-informed transformations (LIT): A method for automatically generating contrast sets. In Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, pages 126–135, Online. Association for Computational Linguistics.

Juncen Li, Robin Jia, He He, and Percy Liang. 2018. Delete, retrieve, generate: a simple approach to sentiment and style transfer. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1865–1874, New Orleans, Louisiana. Association for Computational Linguistics.

Linyang Li, Ruotian Ma, Qipeng Guo, Xiangyang Xue, and Xipeng Qiu. 2020b. BERT-ATTACK: Adversarial attack against BERT using BERT. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6193–6202. Association for Computational Linguistics.

Weixin Liang, James Zou, and Zhou Yu. 2020. ALICE: Active learning with contrastive natural language explanations. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4380–4391, Online. Association for Computational Linguistics.

Peter Lipton. 1990. Contrastive explanation. Royal Institute of Philosophy Supplement, 27:247–266.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. ArXiv, arXiv:1907.11692.

Arnaud Van Looveren and Janis Klaise. 2019. Interpretable counterfactual explanations guided by prototypes. ArXiv, arXiv:1907.02584.

Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA. Association for Computational Linguistics.

Nishtha Madaan, Inkit Padhi, Naveen Panwar, and Diptikalyan Saha. 2021. Generate your counterfactuals: Towards controlled counterfactual generation for text. In Proceedings of the AAAI Conference on Artificial Intelligence.


Eric Malmi, Aliaksei Severyn, and Sascha Rothe. 2020. Unsupervised text style transfer with padded masked language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8671–8680. Association for Computational Linguistics.

Tim Miller. 2019. Explanation in artificial intelligence: Insights from the social sciences. Artificial Intelligence, 267:1–38.

Sharan Narang, Colin Raffel, Katherine Lee, Adam Roberts, Noah Fiedel, and Karishma Malkan. 2020. WT5?! Training text-to-text models to explain their predictions. ArXiv, arXiv:2004.14546.

B. Pal and S. Tople. 2020. To transfer or not to transfer: Misclassification attacks against transfer learned text classifiers. ArXiv, arXiv:2001.02438.

Judea Pearl. 2009. Causality: Models, Reasoning and Inference, 2nd edition. Cambridge University Press, USA.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.

Nazneen Fatema Rajani, Bryan McCann, Caiming Xiong, and Richard Socher. 2019. Explain yourself! Leveraging language models for commonsense reasoning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4932–4942, Florence, Italy. Association for Computational Linguistics.

D. Raj Reddy. 1977. Speech understanding systems: A summary of results of the five-year research effort.

Shuhuai Ren, Yihe Deng, Kun He, and Wanxiang Che. 2019. Generating natural language adversarial examples through probability weighted word saliency. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1085–1097, Florence, Italy. Association for Computational Linguistics.

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. “Why should I trust you?”: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

Marco Tulio Ribeiro, Sameer Singh, and CarlosGuestrin. 2018. Semantically equivalent adversar-ial rules for debugging NLP models. In Proceedingsof the 56th Annual Meeting of the Association forComputational Linguistics (Volume 1: Long Papers),pages 856–865, Melbourne, Australia. Associationfor Computational Linguistics.

Mireia Ribera and Àgata Lapedriza. 2019. Can We Do Better Explanations? A Proposal of User-Centered Explainable AI. In ACM IUI Workshop.

Chris Russell. 2019. Efficient search for diverse coherent explanations. In Proceedings of the Conference on Fairness, Accountability, and Transparency.

Julian Salazar, Davis Liang, Toan Q. Nguyen, and Katrin Kirchhoff. 2020. Masked language model scoring. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2699–2712. Association for Computational Linguistics.

Lei Sha, Patrick Hohenecker, and Thomas Lukasiewicz. 2021. Controlling text edition by changing answers of specific questions. arXiv:2105.11018.

Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. 2014. Deep inside convolutional networks: Visualising image classification models and saliency maps. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Workshop Track Proceedings.

Jacob Sippy, Gagan Bansal, and Daniel S. Weld. 2020. Data staining: A method for comparing faithfulness of explainers. In 2020 ICML Workshop on Human Interpretability in Machine Learning (WHI 2020).

D. Smilkov, Nikhil Thorat, Been Kim, F. Viégas, and M. Wattenberg. 2017. SmoothGrad: Removing noise by adding noise. In ICML Workshop on Visualization for Deep Learning.

Congzheng Song, Alexander Rush, and Vitaly Shmatikov. 2020. Adversarial semantic collisions. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4198–4210, Online. Association for Computational Linguistics.

Kaiser Sun and Ana Marasović. 2021. Effective attention sheds light on interpretability. In Findings of ACL.

Mukund Sundararajan, Ankur Taly, and Qiqi Yan. 2017. Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 3319–3328. JMLR.org.

Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. 2019. CommonsenseQA: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4149–4158, Minneapolis, Minnesota. Association for Computational Linguistics.

Damien Teney, Ehsan Abbasnejad, and A. V. D. Hengel. 2020. Learning what makes a difference from counterfactual examples and gradient supervision. In Proceedings of the European Conference on Computer Vision (ECCV).


Bas C. Van Fraassen. 1980. The scientific image. Oxford University Press.

Jesse Vig, Sebastian Gehrmann, Yonatan Belinkov, Sharon Qian, Daniel Nevo, Yaron Singer, and Stuart Shieber. 2020. Investigating gender bias in language models using causal mediation analysis. In Advances in Neural Information Processing Systems, volume 33, pages 12388–12401. Curran Associates, Inc.

S. Wachter, Brent D. Mittelstadt, and Chris Russell. 2017. Counterfactual explanations without opening the black box: Automated decisions and the GDPR. European Economics: Microeconomics & Industrial Organization eJournal.

Sarah Wiegreffe and Yuval Pinter. 2019. Attention is not not explanation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 11–20, Hong Kong, China. Association for Computational Linguistics.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.

Tongshuang Wu, Marco Túlio Ribeiro, J. Heer, and Daniel S. Weld. 2021. Polyjuice: Automated, general-purpose counterfactual generation. arXiv:2101.00288.

Xing Wu, Tao Zhang, Liangjun Zang, Jizhong Han, and Songlin Hu. 2019. Mask and infill: Applying masked language model for sentiment transfer. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pages 5271–5277. International Joint Conferences on Artificial Intelligence Organization.

Linyi Yang, Eoin Kenny, Tin Lok James Ng, Yi Yang, Barry Smyth, and Ruihai Dong. 2020. Generating plausible counterfactual explanations for deep transformers in financial text classification. In Proceedings of the 28th International Conference on Computational Linguistics, pages 6150–6160, Barcelona, Spain (Online). International Committee on Computational Linguistics.

Mo Yu, Shiyu Chang, Yang Zhang, and Tommi Jaakkola. 2019. Rethinking cooperative rationalization: Introspective extraction and complement control. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4094–4103, Hong Kong, China. Association for Computational Linguistics.

Omar Zaidan, Jason Eisner, and Christine Piatko. 2007. Using “annotator rationales” to improve machine learning for text categorization. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference, pages 260–267, Rochester, New York. Association for Computational Linguistics.

Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. From Recognition to Cognition: Visual Commonsense Reasoning. In CVPR.

Huangzhao Zhang, Hao Zhou, Ning Miao, and Lei Li. 2019. Generating fluent adversarial examples for natural languages. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5564–5569, Florence, Italy. Association for Computational Linguistics.


A Training Details

A.1 PREDICTOR Models

For all datasets, f is initialized as a ROBERTA-LARGE model with a linear layer and a maximum sequence length of 512 tokens. We train with AllenNLP (Gardner et al., 2017). For IMDB and NEWSGROUPS, we fine-tune f for 5 epochs with batch size 8 using Adam with an initial learning rate of 2e−05, weight decay 0.1, and a slanted triangular learning rate scheduler with cut_frac 0.06. For RACE, we fine-tune f for 3 epochs with batch size 4 and 16 gradient accumulation steps using Adam with learning rate 1e−05, ε = 1e−08, and a linear learning rate scheduler with 100 warm-up steps, and we fix f after the epoch with the lowest validation loss.
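For concreteness, the RACE optimizer and scheduler described above might be set up as follows with PyTorch and Hugging Face transformers; this is an illustrative sketch rather than our exact AllenNLP configuration, and steps_per_epoch is a placeholder.

import torch
from transformers import RobertaForMultipleChoice, get_linear_schedule_with_warmup

# RoBERTa-large with a classification head (a stand-in for our AllenNLP model).
model = RobertaForMultipleChoice.from_pretrained("roberta-large")
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5, eps=1e-8)

# Linear learning rate scheduler with 100 warm-up steps.
num_epochs, steps_per_epoch = 3, 1000  # steps_per_epoch is illustrative
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=100,
    num_training_steps=num_epochs * steps_per_epoch,
)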

A.2 EDITOR Models

We use the transformers implementation (Wolf et al., 2020) of the base T5 for our EDITORS. We use Adam with a learning rate of 1e−4. For IMDB EDITORS, we use batch size 4 for all variants. For NEWSGROUPS, we use batch size 4 for fine-tuning with predictor labels and batch size 8 for fine-tuning with gold labels. For RACE, we use batch size 4 for fine-tuning with predictor labels and batch size 6 for fine-tuning with gold labels.
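A single EDITOR fine-tuning step might look as follows; the toy example and its serialization (cf. Appendix B) are illustrative, not our exact training loop.

import torch
from transformers import T5ForConditionalGeneration, T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-base")
editor = T5ForConditionalGeneration.from_pretrained("t5-base")
optimizer = torch.optim.Adam(editor.parameters(), lr=1e-4)

# Masked input with the target label prepended, paired with the spans
# the EDITOR should learn to in-fill.
source = "label: positive input: This movie was <extra_id_0>."
target = "<extra_id_0> wonderful <extra_id_1>"

inputs = tokenizer(source, return_tensors="pt")
labels = tokenizer(target, return_tensors="pt").input_ids

loss = editor(**inputs, labels=labels).loss  # standard T5 span-infilling loss
loss.backward()
optimizer.step()
optimizer.zero_grad()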

B Data Processing

We remove newline and tab tokens (<br />, \t, \n) in all datasets, as these are tokenized differently by our PREDICTORS (ROBERTA-LARGE) and EDITORS (T5). For NEWSGROUPS, we also remove headers, footers, and quotes.
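A minimal sketch of this cleaning step (the whitespace collapsing at the end is an assumption):

import re

def clean_text(text: str) -> str:
    # Strip HTML line breaks and tab/newline characters, which our
    # PREDICTORS and EDITORS tokenize differently, then collapse any
    # repeated whitespace left behind.
    text = text.replace("<br />", " ").replace("\t", " ").replace("\n", " ")
    return re.sub(r"\s+", " ", text).strip()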

Inputs to EDITORS For IMDB and NEWSGROUPS EDITORS, we simply prepend target labels to the masked original inputs. For RACE, we give the question, context, all answer options, and the correct choice as input to the RACE EDITOR. We only mask the context. See Table 5 for examples.
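The serializations in Table 5 could be produced by helpers along these lines; the field names follow the table, but the exact formatting is an assumption.

def format_classification_input(target_label: str, masked_text: str) -> str:
    # IMDB / NEWSGROUPS: target label prepended to the masked input.
    return f"label: {target_label} input: {masked_text}"

def format_race_input(question: str, answer: str, masked_context: str,
                      choices: list) -> str:
    # RACE: question, target answer, masked context, and all options.
    choice_str = " ".join(f"choice{i}: {c}" for i, c in enumerate(choices))
    return (f"question: {question} answer: {answer} "
            f"context: {masked_context} {choice_str}")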

C T5 generation for large n2

We noticed that generations sometimes degenerate when we decode from T5 with a large masking percentage n2. For example, sentinel tokens are sometimes generated out of consecutive order. We attribute this to the large difference between the masking percentages we use (up to 55%) and the masking percentage used during T5 pretraining (15%).

Specifically, we observed that generations tend to degenerate after the 28th sentinel token. Thus, we heuristically reduce the number of sentinel tokens by combining neighboring sentinel tokens that are separated by 1-2 tokens into one sentinel token.
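One way to implement this merging heuristic is sketched below; the token handling and max_gap parameter are illustrative, not our exact code.

def merge_close_sentinels(tokens, max_gap=2):
    # Collapse neighboring sentinel tokens separated by at most `max_gap`
    # ordinary tokens into a single sentinel (the short gap is absorbed
    # into the mask), renumbering sentinels consecutively.
    merged, i = [], 0
    while i < len(tokens):
        if tokens[i].startswith("<extra_id_"):
            j = i + 1
            while True:
                k = j
                while k < len(tokens) and not tokens[k].startswith("<extra_id_"):
                    k += 1
                if k < len(tokens) and (k - j) <= max_gap:
                    j = k + 1  # absorb the short gap and the next sentinel
                else:
                    break
            n_emitted = sum(t.startswith("<extra_id_") for t in merged)
            merged.append(f"<extra_id_{n_emitted}>")
            i = j
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# e.g., ["<extra_id_0>", "w1", "<extra_id_1>", "w2", "w3", "w4", "<extra_id_2>"]
# ->    ["<extra_id_0>", "w2", "w3", "w4", "<extra_id_1>"]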

When the output degenerates, we do the following: in-fill the mask tokens with the "good" parts of the generation (i.e., parts with correctly ordered sentinel tokens) and replace the remaining mask tokens with the original text; get the contrast label probabilities from f for these intermediate in-filled candidates; of these, take the m′ = 3 candidates with the highest probabilities and use them as input to generate m/m′ new candidates.16

D Using MICE Edits to Debug a "Buggy" PREDICTOR: A Case Study

In §4, we illustrate how MICE edits can be used to debug both individual predictions and natural dataset artifacts learned by a model. Here, we further explore the utility of MICE edits in debugging through Data Staining (Sippy et al., 2020): We design a "buggy" PREDICTOR and evaluate whether MICE edits can recover the bug.

We create a buggy RACE PREDICTOR by introducing an artifact into the RACE train set. This artifact is the presence of the phrase "It is interesting to note that" in front of the correct answer choice. We introduce this artifact as follows: We filter the RACE train data to contain instances for which the correct answer choice is contained by some sentence17 and the overlapping sentence does not have a higher degree of n-gram overlap with some other (incorrect) choice. After filtering, 11,188 of 87,866 train instances remain. We then prepend "It is interesting to note that" to the overlapping sentence to design a correlation between the location of this phrase and the correct answer choice; our goal is to encourage a PREDICTOR to learn to predict the multiple choice option closest to this buggy phrase as the correct answer. If there are multiple overlapping sentences, we choose the one with the most overlap with the answer choice. We randomly sample from this filtered subset such that 10% of the train data contains this artifact. Our buggy RACE PREDICTOR is trained on this modified data using the same set-up from §A.1, except that we use a batch size of 2 and 32 gradient accumulation steps.
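The staining step itself is simple; a sketch follows, assuming sentence splitting and the 4-gram overlap filter from footnote 17 have already been applied and answer_idx marks the overlapping sentence.

BUGGY_PHRASE = "It is interesting to note that "

def stain_passage(sentences, answer_idx):
    # Prepend the artifact phrase to the sentence that overlaps the
    # correct answer choice (lowercasing its first character), which
    # creates a spurious positional cue for the PREDICTOR.
    stained = list(sentences)
    s = stained[answer_idx]
    stained[answer_idx] = BUGGY_PHRASE + s[:1].lower() + s[1:]
    return " ".join(stained)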

16 If one of the partially-infilled candidates results in the contrast label, we return this as the edited input.

17 A sentence "contains" the correct answer choice if the answer has at least a 4-gram overlap with the sentence.


NEWSGROUPS

Original Input: Michael, you sent your inquiry to the bmw mailing list, but the sw replaces your return addr with the list addr so I can’t reply or manually add you. please see my post re the list or contact me directly.

Input to EDITOR: label: misc. input: <extra_id_0>, you sent your <extra_id_1> to the <extra_id_2>, but the <extra_id_3> your return <extra_id_4> with the list <extra_id_5> so I can’t <extra_id_6> or <extra_id_7> add you. please see my post re the list or contact me directly.

RACE

Original Input: article: The best way of learning a language is by using it. The best way of learning English is using English as much as possible. Sometimes you will get your words mixed up and people wont understand. Sometimes people will say things too quickly and you cant understand them. But if you keep your sense of humor( ), you can always have a good laugh at the mistakes you make. Dont be unhappy if the people seem to laugh at your mistakes. Its much better for people to laugh at your mistake than to be angry because they dont know what you are saying. The most important rule for learning English is "Dont be afraid of making mistakes. Everyone makes mistakes." question: In learning English, you should _. choices: speak as quickly as possible., laugh as much as you can., use it as often as you can., write more than you read.

Input to EDITOR: question: In learning English, you should _. answer: choice1: laugh as much as you can. context: The <extra_id_0> <extra_id_1>. Sometimes you will get your words <extra_id_2> <extra_id_3> <extra_id_4> have a good laugh at the mistakes you make. Don’t be unhappy if the people seem to laugh at your mistakes. It’s much better for people to laugh at your mistake than to be angry because they don’t know what you are saying. The most important rule for learning English is "Don’t be afraid of making mistakes. Everyone makes <extra_id_5>." choice0: speak as quickly as possible. choice1: laugh as much as you can. choice2: use it as often as you can. choice3: write more than you read.

Table 5: Examples of input formats to our EDITORS. The input to the NEWSGROUPS EDITOR has target label "misc."

Question: _ of Xiao Baiyou’s children went to Pecking University. (a) One (b) Two (c) Three (d) All

Original pred yp = (d) All    Contrast pred yc = (c) Three

Just as "Tiger Mom" leaves, here comes the "Wolf Daddy" called Xiao Baiyou. He believes he’s the best parent in theworld. Some days ago, Xiao Baiyou’s latest book about how to be a successful parent came out. He is pretty strict withhis four children. Sometimes he even beat them. But the children don’t hate their daddy at all. And all of them finallywent to Pecking University, It is interesting to note that three of them got good marks at Pecking University. Andone of the top universities in China them even passed the exam without any problem. So Xiao proudly tells othersabout his education idea that children need strict rules. In his microblog, he said, "Come on, want your children to enterPeking University without rules? You must be joking." And, "Leave your children more money, and strict rules at thesame time."But the "Wolf Daddy" way was soon questioned by other parents. Some say that Xiao Baiyou just want to befamous by doing so. The "Wolf Daddy" Xiao Baiyou is a 47-year-old Guangdong businessman who deals in luxury goodsin Hong Kong. Unlike many other parents who usually have one child, Xiao has four children. Two of them were born inHong Kong and two in the US. Some people on the Internet think the reason why his children were able to enter PekingUniversity is because the exam is much easier taken from Hong Kong.

Table 6: A MICE edit for a prediction made by the "buggy" RACE PREDICTOR (described in §D). Insertions are bolded in red. Deletions are struck through. The true label for the original input is underlined.

The test accuracies of our original and buggy RACE PREDICTORS are both 84%, and so we cannot use this measure to select the better classifier. We ask whether MICE edits can be used for this purpose. One such edit is shown in Table 6. We observe that the signal from the edit, which contains both the manual artifact "It is interesting to note that" and the contrast prediction "three," is enough to overpower the signal from the explicit assertion that "All" is the correct answer ("And all of them finally went to Pecking University") such that the PREDICTOR’s prediction changes to "Three." This edit thus provides evidence that some heuristic may have been learned by the predictor. Considering multiple MICE edits can validate such a hypothesis: We find that 17.2% of the edits produced by MICE reflect this bug (i.e., contain the phrase "interesting to note that"); in other words, they do uncover the manually inserted bug.

Furthermore, MICE edits are able to uncover the artifact because they can insert new text. For instance, in the edit in Table 6, the buggy phrase "It is interesting to note that" is not part of the original input. Applying saliency-based explanation methods, such as gradient attribution, to the buggy PREDICTOR’s prediction would not reveal the PREDICTOR’s reliance on the manual artifact, as the buggy phrase is not already present in the text. This difference highlights a key advantage of MICE over existing instance-based explanation methods that attribute feature importance, which can only cite text already present in original inputs.


IMDB

Original pred yp = negative Contrast pred yc = positive

With a catchy title like the Butcher of Plainfield this Ed Gein variation and Kane Hodder playing him will no doubt fly off the shelves for a couple of weeks. Most viewers will be bored laughed silly with this latest take on the life of Ed Gien. The movie focuses on Ed’s rampage and gives us a (few) glimpses into his Psycosis and dwelling in Plainfeild. Its these scenes that give the movie a much needed jolt. What ruins this Another annoyance is the constant focus on other characters lives and focuses less on Eds. Big mistake here. Kane Hodder is a strange choice to play Gein, but He does pull it off quite well, and deserves more acting credits than he gets these days. Prascilla Barnes and Micahel Barryman also show up. 3/10 9/10

Original pred yp = positive Contrast pred yc = negative

I have just sat through this film again and can only wonder if we will see the likes kind of films like this anymore? The timeless music sex, the tender voices performances of William Holden and Jennifer Jones leave this grown man weeping suffering through joyous, romantic torturous, incoherent scenes and I’m not one who cries very often in life. Where have our William Holden’s gone and will they make these moving, wonderful cynical, movies any more? It’s sad to have to realize that they probably won’t but don’t think about it, just try to block that out of your mind. Even so Then again, they won’t have Holden Shakespeare in it and he won’t appear on that hill soap opera just once more either. You can only enjoy safely skip this film and watch it again.

Original pred yp = positive Contrast pred yc = negative

This little flick is reminiscent of several other movies, but manages to keep its own style & mood. "Troll Trusty" & "Don’t Be Afraid of the Dark" come to mind. The suspense builders performances were good, & just cross the line from G silly to PG uninteresting. I especially liked the non-cliche cliched choices with the parents; in other movies, I could predict the dialog ending verbatim, but the writing in this movie made better selections. If you want a movie that’s not gross terribly creepy but gives you some chills, this is a great choice.

Table 7: Examples of edits produced by MICE for inputs from the IMDB dataset. Insertions are bolded in red. Deletions are struck through. yp is the PREDICTOR’s original prediction, and yc the contrast prediction. True labels for original inputs are underlined.

NEWSGROUPS

Original pred yp = talk Contrast pred yc = sci

Would someone be kind enought to document the exact nature of the evidence against the BD NRA’s without reference to hearsay or newsreports. I would also like to know more about their past record etc. but again based on solid not media reports. My reason for asking for such evidence is that last night on Larry King Live a so-called "cult space-expert" was interviewed from Australia who claimed that it was his evidence which led to the original raid discovery. This admission, if true, raises the nasty possibility that the Government acted in good faith, which I believe they did, on faulty evidence. It also raises the possibility that other self proclaimed cult space experts were advising them and giving ver poor advice.

Original pred yp = rec Contrast pred yc = soc

I am planning a weekend in Chicago next month for my first live-and-in-person Cubs game Christian immersion (!!!) I would appreciate any advice from locals or used-to-be locals on where to stay, what to see, where to dine, etc. E-mail replies are fine... Thanks in advance! Teresa

Original pred yp = rec Contrast pred yc = alt

Minor point: Shea Stadium (David: D.): This was designed as a multi-purpose stadium symbiotic relationship between God-and-Christ but not with the Jets in same mind as the tennant Atheists. The New York Football Giants Atheists had moved to Yankee MetLife Stadium (from the Polo Grounds Mets) in 1958 1977 and was having problem with stadium management (the City Atheists did not own Yankee MetLife Stadium until 1972 1973). The idea was to get the Giants Atheists to move into Shea Metlife Stadium. When a deal was worked out between the Giants Atheists and the Yankees Mets, the new AFL American franchise, the New York Titans Atheists, approached the City Mets about using the new stadium. The Titans Mets were playing in Downing Carling Stadium (where the Cosmos Atheists played soccer back in the 70s). Because Shea Stadium was tied into the World’s Fair anyway, the city thought it would be a novel idea to promote the new franchise and the World’s Fair (like they were doing with the Mets). So the deal was worked out. I’m under the impression that when Murph says it, he means it! As a regular goer to Shea, it is not a bad place since they’ve cleaned and renovated the place. Remember, this is its 30th Year!

Table 8: Examples of edits produced by MICE for inputs from the NEWSGROUPS dataset. Insertions are bolded in red. Deletions are struck through. yp is the PREDICTOR’s original prediction, and yc the contrast prediction. True labels for original inputs are underlined.


RACE

Question: How can the thieves get the information of the credit card? (a) The customers give them the information. (b) The thieves steal the information from Web sites. (c) The customers sell the information to them. (d) The thieves buy the information from credit-card firms.

Original pred yp = (a) Contrast pred yc = (b)

The Internet has led to a huge increase in credit-card fraud. Your card information could even be for sale in an illegal web site. Web sites offering cheap goods and services should be regarded with care. On-line shoppers who enter can get credit-card information with stolen details through their credit-card information may never receive the online shopping sites, including buying goods they thought they bought. The thieves then go may use the information they have on your credit card to send shopping promotions, ads, or other Web sites. The thieves will not use with your card number – or sell the information over the Internet. Computers Recent developments in internet hackers have broken down security systems, raising questions about the safety of cardholder information. Several months ago, 25,000 customers of CD Universe, an on-line music retailer, were not lucky. Their names, addresses and credit-card numbers were posted on a Web site after the retailer refused to pay US $157,828 to get back the information. Credit-card firms are now fighting against on-line fraud. Mastercard is working on plans for Web-only credit card, with a lower credit limit. The card could be used only for shopping on-line purchases. However, But there are a few simple steps you can take to keep from being cheated. Ask about your credit-card firm’s on-line rules: Under British law, cardholders have to pay the first US $7820 penalty of any fraudulent spending. And shop only activity at secure sites; Send your credit-card information only if the Web site offers advanced secure system. If the security is in place, a letter will appear in the bottom right-hand corner of your screen. The Website address may also start https: //– // // // and the extra "s" stands for secure. If in doubt, Never give your credit-card information over the telephone. Keep your password safe: Most on-line sites require a user name and password before when placing an order. Treat your passwords with care.

Question: If you want to be a football player, you should __. (a) buy a good football (b) play football (c) watch others play football (d) put your football away

Original pred yp = (b) Contrast pred yc = (a)

We are all learning English, but how can we learn English well? A student can know a lot about English, but maybe he can’t speak English. If you want to know how to swim be a football player, you must get into the river buy a good football. If And if you want to be a football an English player, you must play football. So, you see. You can learn English only by using it. You must listen to your teacher in class. You must read your lessons every day. You must speak English to your classmates and also you must write something sometimes. Then one day, you may find your English very good.

Question: This story most probably took place __. (a) at the beginning of the term (b) in the middle of the term (c) at the end of the term (d) at the beginning of the school year

Original pred yp = (c) Contrast pred yc = (b)

A teacher stood was giving new classes to students in front the middle of his history this term. The students were in class of twenty students just before handing out the final exam. His students by now. They sat quietly and waited for him to speak. "It’s been a pleasure teaching you this term my last chance," he said to them. The class started to cry. They cried for a long time. Finally, the teacher got up. He looked them in surprise. Then he asked them to leave. They "You’ve all worked very hard, so I have a pleasant surprise for you. Everyone who chooses not to take the final exam will get a ’B’ for the course." Most of the students jumped out of their seats. They thanked the teacher happily, and walked out of the classroom. Only a few students stayed. The teacher looked at them. "This is your last chance," he said. "Does anyone else want to leave?" All the students there stayed in their seats and took out their pencils. The teacher smiled. "Congratulations," he said. "I’m glad to see you believe in yourselves. You all get A on well."

Table 9: Examples of edits produced by MICE for inputs from the RACE dataset. Insertions are bolded in red. Deletions are struck through. yp is the PREDICTOR’s original prediction, and yc the contrast prediction. True labels for original inputs are underlined.