
DynaSent: A Dynamic Benchmark for Sentiment Analysis

Christopher Potts∗
Stanford University
[email protected]

Zhengxuan Wu∗
Stanford University
[email protected]

Atticus Geiger
Stanford University
[email protected]

Douwe Kiela
Facebook AI Research
[email protected]

Abstract

We introduce DynaSent (‘Dynamic Sentiment’), a new English-language benchmark task for ternary (positive/negative/neutral) sentiment analysis. DynaSent combines naturally occurring sentences with sentences created using the open-source Dynabench Platform, which facilitates human-and-model-in-the-loop dataset creation. DynaSent has a total of 121,634 sentences, each validated by five crowdworkers, and its development and test splits are designed to produce chance performance for even the best models we have been able to develop; when future models solve this task, we will use them to create DynaSent version 2, continuing the dynamic evolution of this benchmark. Here, we report on the dataset creation effort, focusing on the steps we took to increase quality and reduce artifacts. We also present evidence that DynaSent’s Neutral category is more coherent than the comparable category in other benchmarks, and we motivate training models from scratch for each round over successive fine-tuning.

1 Introduction

Sentiment analysis is an early success story for NLP, in both a technical and an industrial sense. It has, however, entered into a more challenging phase for research and technology development: while present-day models achieve outstanding results on all available benchmark tasks, they still fall short when deployed as part of real-world systems (Burn-Murdoch, 2013; Grimes, 2014, 2017; Gossett, 2020) and display a range of clear shortcomings (Kiritchenko and Mohammad, 2018; Hanwen Shen et al., 2018; Wallace et al., 2019; Tsai et al., 2019; Jin et al., 2019; Zhang et al., 2020).

In this paper, we seek to address the gap between benchmark results and actual utility by introducing version 1 of the DynaSent dataset for English-language ternary (positive/negative/neutral) sentiment analysis.1

∗Equal contribution.

[Figure 1 flowchart: Model 0 (RoBERTa fine-tuned on sentiment benchmarks) → Model 0 used to find challenging naturally occurring sentences → human validation → Round 1 Dataset → Model 1 (RoBERTa fine-tuned on sentiment benchmarks + Round 1 Dataset) → Dynabench used to crowdsource sentences that fool Model 1 → human validation → Round 2 Dataset.]

Figure 1: The DynaSent dataset creation process. The human validation task is the same for both rounds; five responses are obtained for each sentence. On Dynabench, we explore conditions with and without prompt sentences that workers can edit to achieve their goal.

DynaSent is intended to be a dynamic benchmark that expands in response to new models, new modeling goals, and new adversarial attacks. We present the first two rounds here and motivate some specific data collection and modeling choices, and we propose that, when future models solve these rounds, we use those models to create additional DynaSent rounds. This is an instance of “the ‘moving post’ dynamic target” for NLP that Nie et al. (2020) envision.

Figure 1 summarizes our method, which incorporates both naturally occurring sentences and sentences created by crowdworkers with the goal of fooling a top-performing sentiment model. The starting point is Model 0, which is trained on standard sentiment benchmarks and used to find challenging sentences in existing data. These sentences are fed into a human validation task, leading to the Round 1 Dataset.

1 https://github.com/cgpotts/dynasent


Next, we train Model 1 on Round 1 data in addition to publicly available datasets. In Round 2, this model runs live on the Dynabench Platform for human-and-model-in-the-loop dataset creation;2 crowdworkers try to construct examples that fool Model 1. These examples are human-validated, which results in the Round 2 Dataset. Taken together, Rounds 1 and 2 have 121,634 sentences, each with five human validation labels. Thus, with only two rounds collected, DynaSent is already a substantial new resource for sentiment analysis.

In addition to contributing DynaSent, we seek to address a pressing concern for any dataset collection method in which workers are asked to construct original sentences: human creativity has intrinsic limits. Individual workers will happen upon specific strategies and repeat them, and this will lead to dataset artifacts. These artifacts will certainly reduce the value of the dataset, and they are likely to perpetuate and amplify social biases.

We explore two methods for mitigating these dangers. First, by harvesting naturally occurring examples for Round 1, we tap into a wider population than we can via crowdsourcing, and we bring in sentences that were created for naturalistic reasons (de Vries et al., 2020), rather than the more artificial goals present during crowdsourcing. Second, for the Dynabench cases created in Round 2, we employ a ‘Prompt’ setting, in which crowdworkers are asked to modify a naturally occurring example rather than writing one from scratch. We compare these sentences with those created without a prompt, and we find that the prompt-derived sentences are more like naturally occurring sentences in length and lexical diversity. Of course, fundamental sources of bias remain – we seek to identify these in the Datasheet (Gebru et al., 2018) distributed with our dataset – but we argue that these steps help, and can inform crowdsourcing efforts in general.

As noted above, DynaSent presently uses the labels Positive, Negative, and Neutral. This is a minimal expansion of the usual binary (Positive/Negative) sentiment task, but a crucial one, as it avoids the false presupposition that all texts convey binary sentiment. We chose this version of the problem to show that even basic sentiment analysis poses substantial challenges for our field. We find that the Neutral category is especially difficult.

2 https://dynabench.org/

While it is common to synthesize such a category from middle-scale product and service reviews, we use an independent validation of the Stanford Sentiment Treebank (Socher et al., 2013) dev set to argue that this tends to blur neutrality together with mixed sentiment and uncertain sentiment (Section 5.2). DynaSent can help tease these phenomena apart, since it already has a large number of Neutral examples and a large number of examples displaying substantial variation in validation. Finally, we argue that the variable nature of the Neutral category is an obstacle to fine-tuning (Section 5.3), which favors our strategy of training models from scratch for each round.

2 Related Work

Sentiment analysis was one of the first natural language understanding tasks to be revolutionized by data-driven methods. Rather than trying to survey the field (see Pang and Lee 2008; Liu 2012; Grimes 2014), we focus on the benchmark tasks that have emerged in this space, and then seek to situate these benchmarks with respect to challenge (adversarial) datasets and crowdsourcing methods.

2.1 Sentiment Benchmarks

The majority of sentiment datasets are derived from customer reviews of products and services. This is an appealing source of data, since such texts are accessible and abundant in many languages and regions of the world, and they tend to come with their own author-provided labels (star ratings). On the other hand, over-reliance on such texts is likely also limiting progress; DynaSent begins moving away from such texts, though it remains rooted in this domain.

Pang and Lee (2004) released a collection of 2,000 movie reviews with binary sentiment labels. This dataset became one of the first benchmark tasks in the field, but it is less used today due to its small size and issues with its train/test split that make it artificially easy (Maas et al., 2011).

A year later, Pang and Lee (2005) released a collection of 11,855 sentences extracted from movie reviews from the website Rotten Tomatoes. These sentences form the basis for the Stanford Sentiment Treebank (SST; Socher et al. 2013), which provides labels (on a five-point scale) for all the phrases in the parse trees of the sentences in this dataset.

Maas et al. (2011) released a sample of 25K labeled movie reviews from IMDB, drawn from 1–4-star reviews (negative) and 7–10-star reviews (positive). This remains one of the largest benchmark tasks for sentiment; whereas much larger datasets have been introduced into the literature at various times (Jindal and Liu, 2008; Ni et al., 2019), many have since been removed from public distribution (McAuley et al., 2012; Zhang et al., 2015). A welcome exception is the Yelp Academic Dataset,3 which was first released in 2010 and which has been expanded a number of times, so that it now has over 8M review texts with extensive metadata.

Not all sentiment benchmarks are based in review texts. The MPQA Opinion Corpus of Wiebe et al. (2005) contains news articles labeled at the phrase-level by experts for a wide variety of subjective states; it presents an exciting vision for how sentiment analysis might become more multidimensional. SemEval 2016 and 2017 (Nakov et al., 2016; Rosenthal et al., 2017) offered Twitter-based sentiment datasets that continue to be used widely despite their comparatively small size (≈10K examples). And of course there are numerous additional datasets for specific languages, domains, and emotional dimensions; Google’s Dataset Search currently reports over 100 datasets for sentiment, and the website Papers with Code gives statistics for dozens of sentiment tasks.4

Finally, sentiment lexicons have long played a central role in the field. They are used as resources for data-driven models, and they often form the basis for simple hand-crafted decision rules. The Harvard General Inquirer (Stone and Hunt, 1963) is an early and still widely used multidimensional sentiment lexicon. Other important examples are OpinionFinder (Wilson et al., 2005), LIWC (Pennebaker et al., 2007), MPQA, Opinion Lexicon (Liu, 2011), SentiWordNet (Baccianella et al., 2010), and the valence and arousal ratings of Warriner et al. (2013).

2.2 Challenge and Adversarial Datasets

Challenge and adversarial datasets have risen to prominence over the last few years in response to the sense that benchmark results are overstating the quality of the models we are developing (Linzen, 2020). The general idea traces to Winograd (1972), who proposed minimally contrasting examples meant to evaluate the ability of a model to capture specific cognitive, linguistic, and social phenomena (see also Levesque 2013).

3 https://www.yelp.com/dataset
4 https://paperswithcode.com/task/sentiment-analysis

This guiding idea is present in many recent papers that seek to determine whether models have met specific learning targets (Alzantot et al., 2018; Glockner et al., 2018; Naik et al., 2018; Nie et al., 2019). Related efforts reveal that models are exploiting relatively superficial properties of the data that are easily exploited by adversaries (Jia and Liang, 2017; Kaushik and Lipton, 2018; Zhang et al., 2020), as well as social biases in the data they were trained on (Kiritchenko and Mohammad, 2018; Rudinger et al., 2017, 2018; Sap et al., 2019; Schuster et al., 2019).

For the most part, challenge and adversarial datasets are meant to be used primarily for evaluation (though Liu et al. (2019a) show that even small amounts of training on them can be fruitful in some scenarios). However, there are existing adversarial datasets that are large enough to support full-scale training efforts (Zellers et al., 2018, 2019; Chen et al., 2019; Dua et al., 2019; Bartolo et al., 2020). DynaSent falls into this class; it has large train sets that can support from-scratch training as well as fine-tuning. Our approach is closest to, and directly inspired by, the Adversarial NLI (ANLI) project, which is reported on by Nie et al. (2020) and which continues on Dynabench. In ANLI, human annotators construct new examples that fool a top-performing model but make sense to other human annotators. This is an iterative process that allows the annotation project itself to organically find phenomena that fool current models. The resulting dataset has, by far, the largest gap between estimated human performance and model accuracy of any benchmark in the field right now. We hope DynaSent follows a similar pattern, and that its naturally occurring sentences and prompt-derived sentences bring beneficial diversity.

2.3 Crowdsourcing Methods

Within NLP, Snow et al. (2008) helped establish crowdsourcing as a viable method for collecting data for at least some core language tasks. Since then, it has become the dominant mode for dataset creation throughout all of AI, and the scientific study of these methods has in turn grown rapidly. For our purposes, a few core findings from research into crowdsourcing are centrally important.

First, crowdworkers are not fully representative of the general population: they are, for example, more educated, more male, more technologically enabled (Hube et al., 2019).


             SST-3            Yelp              Amazon
             Dev     Test     Dev      Test     Dev       Test

Positive     444     909      9,577    10,423   130,631   129,369
Negative     428     912      10,222   9,778    129,108   130,892
Neutral      228     389      5,201    4,799    65,261    64,739

Total        1,100   2,210    25,000   25,000   325,000   325,000

Table 1: External assessment datasets used for Model 0 and Model 1. SST-3 is the ternary version of SST as described in Table 2a but with only the sentence-level phrases included. For the Yelp and Amazon datasets, we split the test datasets of Zhang et al. 2015 in half by line number to create a dev/test split.

As a result, datasets created using these methods will tend to inherit these biases, which will then find their way into our models. DynaSent’s naturally occurring sentences and prompt sentences can help, but we acknowledge that those texts come from people who write online reviews, which is also a special group.

Second, any crowdsourcing project will reach only a small population of workers, and this will bring its own biases. The workers will tend to adopt similar cognitive patterns and strategies that might not be representative even of the specialized population they belong to (Gadiraju et al., 2017). This seems to be an underlying cause of many of the artifacts that have been identified in prominent NLU benchmarks (Poliak et al., 2018; Gururangan et al., 2018; Tsuchiya, 2018; Belinkov et al., 2019).

Third, as with all work, quality varies across workers and examples, which raises the question of how best to infer individual labels from response distributions. Dawid and Skene (1979) is an early contribution to this problem leveraging Expectation Maximization (Dempster et al., 1977). Much subsequent work has pursued similar strategies; for a full review, see Zheng et al. 2017. Our corpus release uses the true majority (3/5 labels) as the gold label where such a majority exists, leaving examples unlabeled otherwise, but we include the full response distributions and make use of those distributions when training Model 1. For additional details, see Section 3.3.
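To make the labeling rule concrete, here is a minimal sketch of majority-label inference from five validation responses; the function name and label strings are ours for illustration, not taken from the released code.

```python
from collections import Counter
from typing import Optional, Sequence

def majority_label(responses: Sequence[str], threshold: int = 3) -> Optional[str]:
    """Return the label chosen by at least `threshold` of the responses, else None."""
    label, count = Counter(responses).most_common(1)[0]
    return label if count >= threshold else None

# A 4/5 majority yields a gold label; a 2/2/1 split leaves the example unlabeled.
print(majority_label(["negative", "negative", "negative", "negative", "positive"]))  # negative
print(majority_label(["positive", "positive", "neutral", "neutral", "mixed"]))       # None
```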

3 Round 1: Naturally Occurring Sentences

We now begin to describe our method for constructing DynaSent (Figure 1). We begin with an initial model, Model 0, and use it to harvest challenging examples from an existing corpus. These examples are human-validated and incorporated into the training of a second model, Model 1. This model is then loaded into Dynabench; Round 2 of our data collection (Section 4) involves crowdworkers seeking to create examples that fool this model.

3.1 Model 0

Our Model 0 begins with the RoBERTa-base parameters (Liu et al., 2019b) and adds a three-way sentiment classifier head. The model was trained on a number of publicly-available datasets, as summarized in Table 2a. The Customer Reviews (Hu and Liu, 2004) and IMDB (Maas et al., 2011) datasets have only binary labels. The other datasets have five star-rating categories. We bin these ratings by taking the lowest two ratings to be negative, the middle rating to be neutral, and the highest two ratings to be positive. The Yelp and Amazon datasets are those used in Zhang et al. 2015; the first is derived from an earlier version of the Yelp Academic Dataset, and the second is derived from the dataset used by McAuley et al. (2012). SST-3 is the ternary version of the SST; we train on the phrase-level version of the dataset (and always evaluate only on its sentence-level labels). For additional details on how this model was optimized, see Appendix A.
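The star-rating binning used for the five-category datasets is simple; a minimal sketch, assuming integer star ratings from 1 to 5 (the function name is ours):

```python
def bin_star_rating(stars: int) -> str:
    """Map a 1-5 star rating to a ternary label: lowest two ratings are negative,
    the middle rating is neutral, and the highest two ratings are positive."""
    if stars <= 2:
        return "negative"
    if stars == 3:
        return "neutral"
    return "positive"

assert [bin_star_rating(s) for s in range(1, 6)] == [
    "negative", "negative", "neutral", "positive", "positive"]
```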

We evaluate this and subsequent models on three datasets (Table 1): SST-3 dev and test, and the assessment portion of the Yelp and Amazon datasets from Zhang et al. 2015. For Yelp and Amazon, the original distribution contained only (very large) test files. We split them in half (by line number) to create dev and test splits.

In Table 2b, we summarize our Model 0 assessments on these datasets. Across the board, our model does extremely well on the Positive and Negative categories, and less well on Neutral. We trace this to the fact that the Neutral categories for all these corpora were derived from three-star reviews.


             CR      IMDB     SST-3     Yelp      Amazon

Positive     2,405   12,500   42,672    260,000   1,200,000
Negative     1,366   12,500   34,944    260,000   1,200,000
Neutral      0       0        81,658    130,000   600,000

Total        3,771   25,000   159,274   650,000   3,000,000

(a) Model 0 training data. CR is the Customer Reviews dataset from Hu and Liu 2004, IMDB is from Maas et al. 2011, SST-3 is the phrase-level ternary SST (labels 0–1 = Neg; 2 = Neu; 3–4 = Pos), and Yelp and Amazon are from Zhang et al. 2015 (1–2-star reviews = Neg; 3-star = Neutral; 4–5-star = Pos).

             SST-3          Yelp           Amazon         Round 1        Round 2
             Dev    Test    Dev    Test    Dev    Test    Dev    Test    Dev    Test

Positive     85.1   89.0    88.3   90.5    89.1   89.4    33.3   33.3    58.4   63.0
Negative     84.1   84.1    88.8   89.1    86.6   86.6    33.3   33.3    61.0   63.1
Neutral      45.4   43.5    58.2   59.4    53.9   53.7    33.3   33.3    38.4   44.3

Macro avg    71.5   72.2    78.4   79.7    76.5   76.6    33.3   33.3    52.6   56.8

(b) Model 0 performance (F1 scores) on external assessment datasets (Table 1). We also report on our Round 1 dataset (Section 3.4), where performance is at chance by construction, and we report on our Round 2 dataset (Section 4) to further quantify the challenging nature of that dataset.

Table 2: Model 0 summary.

Such three-star reviews actually mix a lot of different phenomena: neutrality, mixed sentiment, and (in the case of the reader judgments in SST) uncertainty about the author’s intentions. We return to this issue in Section 5.2, arguing that DynaSent marks progress on creating a more coherent Neutral category.

Finally, Table 2b includes results for our Round 1 dataset, as we are defining it. Performance is at-chance across the board by construction (see Section 3.4 below). We include these columns to help with tracking the progress we make with Model 1. We also report performance of this model on our Round 2 dataset (described below in Section 4), again to help with tracking progress and understanding the two rounds.

3.2 Harvesting Sentences

Our first round of data collection focused on finding naturally occurring sentences that would challenge our Model 0. To do this, we harvested sentences from the Yelp Academic Dataset, using the version of the dataset that contains 8,021,122 reviews.

The sampling process was designed so that 50% of the sentences fell into two groups: those that occurred in 1-star reviews but were predicted by Model 0 to be Positive, and those that occurred in 5-star reviews but were predicted by Model 0 to be Negative. The intuition here is that these would likely be examples that fooled our model. Of course, negative reviews can (and often do) contain positive sentences, and vice-versa. This motivates the validation stage that we describe next.
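A sketch of the sampling criterion for this portion of the data, assuming each candidate sentence record carries the star rating of its source review and a cached Model 0 prediction (the field names and the generator are illustrative, not from the released code):

```python
from typing import Dict, Iterable, Iterator

def challenging_candidates(sentences: Iterable[Dict]) -> Iterator[Dict]:
    """Yield sentences whose Model 0 prediction conflicts with the review's rating:
    1-star reviews predicted Positive, or 5-star reviews predicted Negative."""
    for s in sentences:
        stars, pred = s["review_stars"], s["model0_prediction"]
        if (stars == 1 and pred == "positive") or (stars == 5 and pred == "negative"):
            yield s
```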

3.3 Validation

Our validation task was conducted on Amazon Mechanical Turk. Workers were shown ten sentences and asked to label them according to the following four categories, which we give here with the glosses that were given to workers in instructions and in the annotation interface:

Positive: The sentence conveys information about the author’s positive evaluative sentiment.

Negative: The sentence conveys information about the author’s negative evaluative sentiment.

No sentiment: The sentence does not convey anything about the author’s positive or negative sentiment.

Mixed sentiment: The sentence conveys a mix of positive and negative sentiment with no clear overall sentiment.

We henceforth refer to the ‘No sentiment’ category as ‘Neutral’ to remain succinct and align with standard usage in the literature.

For this round, 1,978 workers participated in the validation process.


Sentence                                              Model 0   Responses

Good food nasty attitude by hostesses .               neg       mix, mix, mix, neg, neg
Not much of a cocktail menu that I saw.               neg       neg, neg, neg, neg, neg
I scheduled the work for 3 weeks later.               neg       neu, neu, neu, neu, pos
I was very mistaken, it was much more!                neg       neg, pos, pos, pos, pos

It is a gimmick, but when in Rome, I get it.          neu       mix, mix, mix, neu, neu
Probably a little pricey for lunch.                   neu       mix, neg, neg, neg, neg
But this is strictly just my opinion.                 neu       neu, neu, neu, neu, pos
The price was okay, not too pricey.                   neu       mix, neu, pos, pos, pos

The only downside was service was a little slow.      pos       mix, mix, mix, neg, neg
However there is a 2 hr seating time limit.           pos       mix, neg, neg, neg, neu
With Alex, I never got that feeling.                  pos       neu, neu, neu, neu, pos
Its ran very well by management.                      pos       pos, pos, pos, pos, pos

Table 3: Round 1 train set examples, randomly selected from each combination of Model 0 prediction and majority label, but limited to examples with 30–50 characters. Appendix E provides fully randomly selected examples.

In the final version of the corpus, each sentence is validated by five different workers. To obtain these ratings, we employed an iterative strategy. Sentences were uploaded in batches of 3–5K and, after each round, we measured each worker’s rate of agreement with the majority. We then removed from the potential pool those workers who disagreed more than 80% of the time with their co-annotators, using a method of ‘unqualifying’ workers that does not involve rejecting their work or blocking them (Turk, 2017). We then obtained additional labels for examples that those ‘unqualified’ workers annotated. Thus, many examples received more than five responses over the course of validation. The final version of DynaSent keeps only the responses from the highest-rated workers. This led to a substantial increase in dataset quality by removing a lot of labels that seemed to us to be randomly assigned. Appendix B describes the process in more detail, and our Datasheet enumerates the known unwanted biases that this process can introduce.

3.4 Round 1 Dataset

The resulting Round 1 dataset is summarized in Table 4. Table 3 provides train-set examples for every combination of Model 0 prediction and majority validation label. The examples were randomly sampled with a restriction to 30–50 characters. Appendix E provides examples that were truly randomly selected.

Because each sentence has five ratings, there aretwo perspectives we can take on the dataset:

             Dist       Majority Label
             Train      Train    Dev     Test

Positive     130,045    21,391   1,200   1,200
Negative     86,486     14,021   1,200   1,200
Neutral      215,935    45,076   1,200   1,200
Mixed        39,829     3,900    0       0
No Majority  –          10,071   0       0

Total        472,295    94,459   3,600   3,600

Table 4: Round 1 Dataset. ‘Dist’ refers to the individual responses (five per example). We omit the comparable Dev and Test numbers to save space. The Majority Label is the one chosen by at least three of the five workers, if there is such a label. For Majority Label Train, the Positive, Negative, and Neutral categories contain a total of 80,488 examples.

Distributional Labels  We can repeat each example with each of its labels. For instance, the first sentence in Table 3 would be repeated three times with ‘Mixed’ as the label and twice with ‘Negative’. For many classifier models, this reduces to labeling each example with its probability distribution over the labels. This is an appealing approach to creating training data, since it allows us to make use of all the examples,5 even those that do not have a majority label, and it allows us to make maximal use of the labeling information. In our experiments, we found that training on the distributional labels consistently led to slightly better models.

5 For ‘Mixed’ labels, we create two copies of the example, one labeled ‘Positive’, the other ‘Negative’.
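A minimal sketch of the distributional-label conversion just described, expanding ‘Mixed’ responses into one Positive and one Negative copy (our reading of the footnote) and then normalizing; the function name and label strings are illustrative, not from the released code.

```python
from collections import Counter
from typing import Dict, Sequence

CLASSIFIER_LABELS = ("positive", "negative", "neutral")

def response_distribution(responses: Sequence[str]) -> Dict[str, float]:
    """Turn five worker responses into a soft label over the classifier's classes."""
    expanded = []
    for r in responses:
        if r == "mixed":
            # A 'Mixed' response becomes one Positive and one Negative copy.
            expanded.extend(["positive", "negative"])
        else:
            expanded.append(r)
    counts = Counter(expanded)
    total = len(expanded)
    return {label: counts[label] / total for label in CLASSIFIER_LABELS}

# First sentence of Table 3: three 'mixed' and two 'negative' responses.
print(response_distribution(["mixed", "mixed", "mixed", "negative", "negative"]))
# {'positive': 0.375, 'negative': 0.625, 'neutral': 0.0}
```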


             CR      IMDB     SST-3     Yelp     Amazon    Round 1

Positive     2,405   12,500   128,016   29,841   133,411   339,748
Negative     1,366   12,500   104,832   30,086   133,267   252,630
Neutral      0       0        244,974   30,073   133,322   431,870

Total        3,771   25,000   477,822   90,000   400,000   1,024,248

(a) Model 1 training data. CR and IMDB are unchanged from Table 2a. SST-3 is processed the same way as before, but we repeat the dataset 3 times to give it more weight. For Yelp and Amazon, we include only 1-star, 3-star, and 5-star reviews, and we subsample from those categories, with the goal of down-weighting them overall and removing ambiguous reviews. Round 1 here uses distributional labels and is copied twice.

             SST-3          Yelp           Amazon         Round 1        Round 2
             Dev    Test    Dev    Test    Dev    Test    Dev    Test    Dev    Test

Positive     84.6   88.6    80.0   83.1    83.3   83.3    81.0   80.4    33.3   33.3
Negative     82.7   84.4    79.5   79.6    78.7   78.8    80.5   80.2    33.3   33.3
Neutral      40.0   45.2    56.7   56.6    55.5   55.4    83.1   83.5    33.3   33.3

Macro avg    69.1   72.7    72.1   73.1    72.5   72.5    81.5   81.4    33.3   33.3

(b) Model 1 performance (F1 scores) on external assessment datasets (Table 1), as well as our Round 1 and Round 2 datasets. Chance performance for this model on Round 2 is by design (Section 4.5).

Table 5: Model 1 summary.

Majority Label  We can take a more traditional route and infer a label based on the distribution of labels. In Table 4, we show the labels inferred by assuming that an example has a label just in case at least three of the five annotators chose that label. This is a conservative approach that creates a fairly large ‘No Majority’ category. More sophisticated approaches might allow us to make fuller use of the examples and account for biases relating to annotator quality and example complexity (see Section 2.3). We set these options aside for now because our validation process placed more weight on the best workers we could recruit (Section 3.3).

The Majority Label splits given by Table 4 are designed to ensure five properties: (1) the classes are balanced, (2) Model 0 performs at chance, (3) the review-level rating associated with the sentence has no predictive value, (4) at least four of the five workers agreed, and (5) the majority label is Positive, Negative, or Neutral. (This excludes examples that received a Mixed majority and examples without a majority label at all.)

Over the entire round, 47% of cases are such that the validation majority label is Positive, Negative, or Neutral and Model 0 predicted a different label.

3.5 Estimating Human Performance

Table 6 provides a conservative estimate of human F1 in order to have a quantity that is comparable to our model assessment metrics. To do this, we randomize the responses for each example to create five synthetic annotators, and we calculate the precision, recall, and F1 scores for each of these annotators with respect to the gold label. We average those scores. This is a conservative estimate of human performance, since it heavily weights the single annotator who disagreed for the cases with 4/5 majorities. We can balance this against the fact that 614 workers (out of 1,280) never disagreed with the majority label (see Appendix B for the full distribution of agreement rates). However, it seems reasonable to say that a model has solved the round if it achieves comparable scores to our aggregate F1 – a helpful signal to start a new round.
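One way to compute such an estimate, assuming each example stores its gold label and its five responses (sklearn supplies the per-annotator scores; the data layout, field names, and fixed label set are our assumptions, not the paper's exact code):

```python
import random
from sklearn.metrics import precision_recall_fscore_support

LABELS = ["positive", "negative", "neutral"]

def estimate_human_macro_f1(examples, num_annotators=5, seed=0):
    """Average macro F1 of synthetic annotators against the gold labels.

    Each example's responses are shuffled, the i-th response is assigned to
    synthetic annotator i, and each annotator is scored against the gold labels.
    """
    rng = random.Random(seed)
    gold = [ex["gold_label"] for ex in examples]
    shuffled = [rng.sample(ex["responses"], len(ex["responses"])) for ex in examples]
    scores = []
    for i in range(num_annotators):
        preds = [resp[i] for resp in shuffled]
        _, _, f1, _ = precision_recall_fscore_support(
            gold, preds, labels=LABELS, average="macro", zero_division=0)
        scores.append(f1)
    return sum(scores) / len(scores)
```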

4 Round 2: Dynabench

In Round 2, we leverage the Dynabench platform to begin creating a new dynamic benchmark for sentiment analysis. Dynabench is an open-source application that runs in a browser and facilitates dataset collection efforts in which workers seek to construct examples that fool a model (or ensemble of models) but make sense to other humans.

4.1 Model 1

Model 1 was created using the same general methods as for Model 0 (Section 3.1): we begin with RoBERTa parameters and add a three-way sentiment classifier head.


            Dev    Test

Positive    88.1   87.8
Negative    89.2   89.3
Neutral     86.6   86.9

Macro avg   88.0   88.0

Table 6: Estimates of human performance (F1 scores) on the Round 1 dataset. The estimates come from comparing random synthesized human annotators against the gold labels using the response distributions in the dataset. The Fleiss Kappas for the dev and test sets are 0.616 and 0.615, respectively. We offer F1s as a way of tracking model performance to determine when the round is “solved” and a new round of data should be collected. However, we note that 614 of our 1,280 workers never disagreed with the gold label.

The differences between the two models lie in the data they were trained on.

Table 5a summarizes the training data for Model 1. In general, it uses the same datasets as we used for Model 0, but with a few crucial changes. First, we subsample the large Yelp and Amazon datasets to ensure that they do not dominate the dataset, and we include only 1-star, 3-star, and 5-star reviews to try to reduce the number of ambiguous examples. Second, we upsample SST-3 by a factor of 3 and our own dataset by a factor of 2, using the distributional labels for our dataset (Section 3.4). This gives roughly equal weight, by example, to our dataset as to all the others combined. This makes sense given our general goal of doing well on our dataset and, especially, of shifting the nature of the Neutral category to something more semantically coherent than what the other corpora provide.

Table 5b summarizes the performance of our model on the same evaluation sets as are reported in Table 2b for Model 0. Overall, we see a small performance drop on the external datasets, but a huge jump in performance on our dataset (Round 1). While it is unfortunate to see a decline in performance on the external datasets, this is expected if we are shifting the label distribution with our new dataset – it might be an inevitable consequence of hill-climbing in our intended direction.

4.2 Dynabench Interface

Appendix C provides the Dynabench interface we created for DynaSent as well as the complete instructions and training items given to workers. The essence of the task is that the worker chooses a label y to target and then seeks to write an example that the model (currently, Model 1) assigns a label other than y but that other humans would label y. Workers can try repeatedly to fool the model, and they get feedback on the model’s predictions as a guide for how to fool it.
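The worker-model interaction can be summarized in a few lines. In the sketch below, `model_predict` stands in for a call to Model 1 and `max_attempts` is illustrative; an attempt “fools” the model when the predicted label differs from the worker's target, and whether the example ultimately keeps that label is decided later by the validators.

```python
def attempt_to_fool(sentences, target_label, model_predict, max_attempts=10):
    """Simulate a worker's session: return the first sentence the model mislabels,
    along with the model's prediction, or None if no attempt succeeds."""
    for attempt, sentence in enumerate(sentences[:max_attempts], start=1):
        prediction = model_predict(sentence)
        if prediction != target_label:
            return {"sentence": sentence, "target": target_label,
                    "model_prediction": prediction, "attempt": attempt}
    return None
```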

4.3 Methods

We consider two conditions:

Prompt  Workers are shown a sentence and given the opportunity to modify it as part of achieving their goal. Prompts are sampled from parts of the Yelp Academic Dataset not used for Round 1.

No Prompt  Workers wrote sentences from scratch, with no guidance beyond their goal of fooling the model.

We piloted both versions and compared the results. Our analyses are summarized in Section 5.1. The findings led us to drop the No Prompt condition and use the Prompt condition exclusively, as it clearly leads to examples that are more naturalistic and linguistically diverse.

For Round 2, our intention was for each prompt to be used only once, but prompts were repeated in a small number of cases. We have ensured that our dev and test sets contain only sentences derived from unique prompts (Section 4.5).

4.4 Validation

We used the identical validation process as described in Section 3.3, getting five responses for each example as before. This again opens up the possibility of using label distributions or inferring individual labels. 395 workers participated in the validation process for Round 2. See Appendix B for additional details.

4.5 Round 2 Dataset

Table 7 summarizes our Round 2 dataset. Overall, workers’ success rate in fooling Model 1 is about 19%, which is much lower than the comparable value for Round 1 (47%). There seem to be at least three central reasons for this. First, Model 1 is hard to fool, so many workers reach the maximum number of attempts. We retain the examples they enter, as many of them are interesting in their own right.


             Dist      Majority Label
             Train     Train    Dev    Test

Positive     32,551    6,038    240    240
Negative     24,994    4,579    240    240
Neutral      16,365    2,448    240    240
Mixed        18,765    3,334    0      0
No Majority  –         2,136    0      0

Total        92,675    18,535   720    720

Table 7: Round 2 splits using the framework described in Section 3.4 and the criteria specified in Section 4.5.

Second, some workers seem to get confused about the true goal and enter sentences that the model in fact handles correctly. Some non-trivial rate of confusion here seems inevitable given the cognitive demands of the task, but we have taken steps to improve the interface to minimize this factor. Third, a common strategy is to create examples with mixed sentiment; the model does not predict this label, but it is chosen at a high rate in validation.

Despite these complicating factors, we can construct splits that meet our core goals: (1) Model 1 performs at chance on the dev and test sets, and (2) the dev and test sets contain only examples where the majority label was chosen by at least four of the five workers. In addition, (3) our dev and test sets contain only examples from the Prompt condition (the No Prompt cases are in the train set, and flagged as such), and (4) all the dev and test sentences are derived from unique prompts to avoid leakage between train and assessment sets and reduce unwanted correlations within the assessment sets. Table 7 summarizes these splits.

Table 8 provides train examples from Round 2 sampled using the same criteria we used for Table 3: the examples are randomly chosen to show every combination of model prediction and majority label, with the restriction that the examples be 30–50 characters long. (Appendix E gives fully randomly selected examples.)

4.6 Estimating Human Performance

Table 9 provides estimates of human F1 for Round 2 using the same methods as described in Section 3.5 and given in the corresponding table for Round 1 (Table 6). We again emphasize that these are very conservative estimates. We once again had a large percentage of workers (116 of 244) who never disagreed with the gold label on the examples they rated, suggesting that human performance can approach perfection. Nonetheless, the estimates we give here seem useful for helping us decide whether to continue hill-climbing on this round or begin creating new rounds.

5 Discussion

We now address a range of issues that our methods raise but that we have so far deferred in the interest of succinctly reporting on the methods themselves.

5.1 The Role of Prompts

As discussed in Section 4, we explored two methods for collecting original sentences on Dynabench: with and without a prompt sentence that workers could edit to achieve their goal. We did small pilot rounds in each condition and assessed the results. This led us to use the Prompt condition exclusively. This section explains our reasoning more fully.

First, we note that workers did in fact make use of the prompts. In Figure 2a, we plot the Levenshtein edit distance between the prompts provided to annotators and the examples the annotators produced, normalized by the length of the prompt or the example, whichever is longer. There is a roughly bimodal distribution in this plot, where the peak on the right represents examples generated by the annotator tweaking the prompt slightly and the peak on the left represents examples where they deviated significantly from the prompt. Essentially no examples fall at the extreme ends (literal reuse of the prompt; complete disregard for the prompt).
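The normalization just described can be computed as follows; the Levenshtein routine is a standard dynamic-programming implementation (any library version would do), and dividing by the longer of the two strings follows the description above.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def normalized_edit_distance(prompt: str, example: str) -> float:
    """Edit distance normalized by the length of the longer string."""
    longest = max(len(prompt), len(example))
    return levenshtein(prompt, example) / longest if longest else 0.0
```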

Second, we observe that examples generated in the Prompt condition are generally longer than those in the No Prompt condition, and more like our Round 1 examples. Figure 2b summarizes for string lengths; the picture is essentially the same for tokenized word counts. In addition, the Prompt examples have a more diverse vocabulary overall. Figure 2c provides evidence for this: we sampled 100 examples from each condition 500 times, sampled five words from each example, and calculated the vocabulary size (unique token count) for each sample. (These measures are intended to control for the known correlation between token counts and vocabulary sizes; Baayen 2001.) The Prompt-condition vocabularies are much larger, and again more similar to our Round 1 examples.
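A sketch of the resampling procedure behind Figure 2c, with a whitespace tokenizer standing in for whatever tokenization was actually used and the default parameters set to the values stated above:

```python
import random

def vocab_size_samples(examples, n_examples=100, n_words=5, n_samples=500, seed=0):
    """For each of `n_samples` rounds, sample `n_examples` sentences with
    replacement, sample up to `n_words` tokens from each, and record the
    number of unique tokens in the combined sample."""
    rng = random.Random(seed)
    sizes = []
    for _ in range(n_samples):
        vocab = set()
        for sentence in rng.choices(examples, k=n_examples):
            tokens = sentence.split()  # stand-in tokenizer
            if tokens:
                vocab.update(rng.sample(tokens, min(n_words, len(tokens))))
        sizes.append(len(vocab))
    return sizes
```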

Third, a qualitative analysis further substantiates the above picture. For example, many workers realized that they could fool the model by attributing a sentiment to another group and then denying it,


Sentence                                              Model 1   Responses

The place was somewhat good and not well              neg       mix, mix, mix, mix, neg
I bought a new car and met with an accident.           neg       neg, neg, neg, neg, neg
The retail store is closed for now at least.           neg       neu, neu, neu, neu, neu
Prices are basically like garage sale prices.          neg       neg, neu, pos, pos, pos

That book was good. I need to get rid of it.           neu       mix, mix, mix, neg, pos
I REALLY wanted to like this place                     neu       mix, neg, neg, neg, pos
But I’m going to leave my money for the next vet.      neu       neg, neu, neu, neu, neu
once upon a time the model made a super decision.      neu       pos, pos, pos, pos, pos

I cook my caribbean food and it was okay               pos       mix, mix, mix, pos, pos
This concept is really cool in name only.              pos       mix, neg, neg, neg, neu
Wow, it’d be super cool if you could join us           pos       neu, neu, neu, neu, pos
Knife cut thru it like butter! It was great.           pos       pos, pos, pos, pos, pos

Table 8: Round 2 train set examples, randomly selected from each combination of Model 1 prediction and majority label, but limited to examples with 30–50 characters. Appendix E provides fully randomly selected examples.

            Dev    Test

Positive    91.0   90.9
Negative    91.2   91.0
Neutral     88.9   88.2

Macro avg   90.4   90.0

Table 9: Estimates of human performance (F1 scores) on the Round 2 dataset using the procedure described in Section 3.5. The Fleiss Kappas for the dev and test sets are 0.681 and 0.667, respectively. The F1s are conservative estimates; 116 of our 244 workers never disagreed with the gold label for this round.

as in “They said it would be great, but they were wrong”. As a result, there are dozens of examples in the No Prompt condition that employ this strategy. Individual workers hit upon more idiosyncratic strategies and repeatedly used them. This is just the sort of behavior that we know can create persistent dataset artifacts. For this reason, we include No Prompt examples in the training data only, and we make it easy to identify them in case one wants to handle them specially.

5.2 The Neutral Category

For both Model 0 and Model 1, there is consistently a large gap between performance on the Neutral category and performance on the other categories, but only for the external datasets we use for evaluation. For our dataset, performance across all three categories is fairly consistent. We hypothesized that this traces to semantic diversity in the Neutral categories for these external datasets. In review corpora, three-star reviews can signal neutrality, but they are also likely to signal mixed sentiment or uncertain overall assessments. Similarly, where the ratings are assigned by readers, as in the SST, it seems likely that the middle of the scale will also be used to register mixed and uncertain sentiment, along with a real lack of sentiment.

To further support this hypothesis, we ran the SST dev set through our validation pipeline. This leads to a completely relabeled dataset with five ratings for each example and a richer array of categories. Table 10 compares these new labels with the labels in SST-3. Overall, the two label sets are closely aligned for Positive and Negative. However, the SST-3 Neutral category has a large percentage of cases falling into Mixed and No Majority, and it is overall by far the least well aligned with our labeling of any of the categories. Appendix D gives a random sample of cases where the two label sets differ with regard to the Neutral category. It also provides all seven cases of sentiment confusion. We think these comparisons favor our labels over SST’s original labels.

5.3 Fine-Tuning

Our Model 1 was trained from scratch (beginning with RoBERTa parameters). An appealing alternative would be to begin with Model 0 and fine-tune it on our Round 1 data. This would be more efficient, and it might naturally lead to the Round 1 data receiving the desired overall weight relative to the other datasets. Unfortunately, our attempts to fine-tune in this way led to worse models, and the problems generally traced to very low performance on the Neutral category.


Figure 2: The ‘Prompt’ and ‘No Prompt’ conditions. (a) Normalized edit distances between the prompt and the example. (b) String lengths; the picture is essentially the same for tokenized word counts. (c) Vocabulary sizes in samples of 100 examples (500 samples with replacement).

                       SST-3
             Positive   Negative   Neutral

Positive     367        2          64
Negative     5          359        57
Neutral      23         8          44
Mixed        34         35         39
No Majority  15         24         25

Table 10: Comparison of the SST-3 labels (dev set) with labels derived from our separate validation of this dataset. Columns are the original SST-3 labels; rows are the labels from our validation.


To study the effect of our dataset on Model 1 performance, we employ the “fine-tuning by inoculation” method of Liu et al. (2019a). We first divide our Round 1 train set into small subsets via random sampling. Then, we fine-tune our Model 0 using these subsets of Round 1 train with non-distributional labels. We early-stop our fine-tuning process if performance on the Round 0 dev set of Model 0 (SST-3 dev) has not improved for five epochs. Lastly, we measure model performance with Round 1 dev (SST-3 dev plus Round 1 dev) and our external evaluation sets (Table 1).
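A high-level sketch of this procedure; `fine_tune_with_early_stopping` and `evaluate_f1` are hypothetical stand-ins for the actual training and evaluation code, and the subset sizes are illustrative.

```python
import random

def inoculation_by_finetuning(model0, round1_train, dev_sets,
                              fine_tune_with_early_stopping, evaluate_f1,
                              subset_sizes=(500, 1000, 2000, 5000, 10000), seed=0):
    """Fine-tune Model 0 on increasingly large random samples of Round 1 train
    (majority labels) and track F1 on each dev set (Liu et al. 2019a)."""
    rng = random.Random(seed)
    results = {}
    for n in subset_sizes:
        subset = rng.sample(round1_train, n)
        model = fine_tune_with_early_stopping(model0, subset)  # early-stopped on SST-3 dev
        results[n] = {name: evaluate_f1(model, dev) for name, dev in dev_sets.items()}
    return results
```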

Figure 3 presents F1 scores for our three class labels using this method. We observe that model performance on Round 1 dev increases for all three labels given more training examples. The F1 scores for the Positive and Negative classes remain high, but they begin to drop slightly with larger samples. The F1 scores on SST-3 dev show larger perturbations. However, the most striking trends are for the Neutral category, where the F1 score on Round 1 dev increases steadily while the F1 scores on the three original development sets for Model 0 decrease drastically. This is the general pattern that Liu et al. (2019a) associate with dataset artifacts or label distribution shifts.

Our current hypothesis is that the pattern we observe can be attributed, at least in large part, to label shift – specifically, to the difference between our Neutral category and the other Neutral categories, as discussed in the preceding section. Our strategy of training from scratch seems less susceptible to these issues, though the label shift is still arguably a factor in the relatively poor performance we see on this category with our external validation sets.

6 Conclusion

We presented DynaSent as the first stage in what we hope is an ongoing effort to create a dynamic benchmark for sentiment analysis that responds to scientific advances and concerns relating to real-world deployments of sentiment analysis models. DynaSent contains 121,634 examples from two different sources: naturally occurring sentences and sentences created on the Dynabench Platform by workers who were actively trying to fool a strong sentiment model. All the sentences in the dataset are multiply-validated by crowdworkers in separate tasks. In addition, we argued for the use of prompt sentences on Dynabench to help workers avoid creative ruts and reduce the rate of biases and artifacts in the resulting dataset. We hope that the next step for DynaSent is that the community responds with models that solve both rounds of our task. That will in turn be our cue to launch another round of data collection to fool those models and push the field of sentiment forward by another step.


Figure 3: Inoculation by fine-tuning results with different numbers of fine-tuning examples. (a–c): F1 scores on different development sets for three categories: Positive, Negative, and Neutral.

Acknowledgements

Our thanks to the developers of the Dynabench Platform, and special thanks to our Amazon Mechanical Turk workers for their essential contributions to this project. This research is supported in part by faculty research grants from Facebook and Google.

References

Moustafa Alzantot, Yash Sharma, Ahmed Elgohary, Bo-Jhang Ho, Mani Srivastava, and Kai-Wei Chang. 2018. Generating natural language adversarial examples. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2890–2896, Brussels, Belgium. Association for Computational Linguistics.

R. Harald Baayen. 2001. Word Frequency Distributions. Kluwer Academic Publishers, Dordrecht.

Stefano Baccianella, Andrea Esuli, and Fabrizio Sebastiani. 2010. SentiWordNet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining. In Proceedings of the Seventh Conference on International Language Resources and Evaluation, pages 2200–2204. European Language Resources Association.

Max Bartolo, Alastair Roberts, Johannes Welbl, Sebastian Riedel, and Pontus Stenetorp. 2020. Beat the AI: Investigating adversarial human annotation for reading comprehension. Transactions of the Association for Computational Linguistics, 8:662–678.

Yonatan Belinkov, Adam Poliak, Stuart Shieber, Benjamin Van Durme, and Alexander Rush. 2019. Don’t take the premise for granted: Mitigating artifacts in natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 877–891, Florence, Italy. Association for Computational Linguistics.

John Burn-Murdoch. 2013. Social media analytics: Are we nearly there yet? The Guardian.

Michael Chen, Mike D’Arcy, Alisa Liu, Jared Fernandez, and Doug Downey. 2019. CODAH: An adversarially-authored question answering dataset for common sense. In Proceedings of the 3rd Workshop on Evaluating Vector Space Representations for NLP, pages 63–69, Minneapolis, USA. Association for Computational Linguistics.

A. P. Dawid and A. M. Skene. 1979. Maximum likelihood estimation of observer error-rates using the EM algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics), 28(1):20–28.

Harm de Vries, Dzmitry Bahdanau, and Christopher D. Manning. 2020. Towards ecologically valid research on language user interfaces. arXiv preprint arXiv:2007.14435.

Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1):1–22.

Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2019. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2368–2378, Minneapolis, Minnesota. Association for Computational Linguistics.

Ujwal Gadiraju, Besnik Fetahu, Ricardo Kawase, Patrick Siehndel, and Stefan Dietze. 2017. Using worker self-assessments for competence-based pre-selection in crowdsourcing microtasks. ACM Transactions on Computer–Human Interaction, 24(4).

Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. 2018. Datasheets for datasets. arXiv preprint arXiv:1803.09010.


Max Glockner, Vered Shwartz, and Yoav Goldberg. 2018. Breaking NLI systems with sentences that require simple lexical inferences. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 650–655, Melbourne, Australia. Association for Computational Linguistics.

Stephen Gossett. 2020. Emotion AI has great promise (when used responsibly). Built In Blog.

Seth Grimes. 2014. Text analytics 2014: User perspectives on solutions and providers. Technical report, Alta Plana.

Seth Grimes. 2017. Data frontiers: Subjectivity, sentiment, and sense. Brandwatch.

Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel Bowman, and Noah A. Smith. 2018. Annotation artifacts in natural language inference data. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 107–112, New Orleans, Louisiana. Association for Computational Linguistics.

Judy Hanwen Shen, Lauren Fratamico, Iyad Rahwan, and Alexander M. Rush. 2018. Darling or babygirl? Investigating stylistic bias in sentiment analysis. In Fairness, Accountability, and Transparency in Machine Learning.

Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 168–177. ACL.

Christoph Hube, Besnik Fetahu, and Ujwal Gadiraju. 2019. Understanding and mitigating worker biases in the crowdsourced collection of subjective judgments. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, CHI ’19, pages 1–12. Association for Computing Machinery.

Robin Jia and Percy Liang. 2017. Adversarial examples for evaluating reading comprehension systems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2021–2031, Copenhagen, Denmark. Association for Computational Linguistics.

Di Jin, Zhijing Jin, Joey Tianyi Zhou, and Peter Szolovits. 2019. Is BERT really robust? Natural language attack on text classification and entailment. arXiv preprint arXiv:1907.11932, 2.

Nitin Jindal and Bing Liu. 2008. Opinion spam and analysis. In Proceedings of the 2008 International Conference on Web Search and Data Mining, WSDM ’08, pages 219–230, New York, NY, USA. Association for Computing Machinery.

Divyansh Kaushik and Zachary C. Lipton. 2018. How much reading does reading comprehension require? A critical investigation of popular benchmarks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 5010–5015, Brussels, Belgium. Association for Computational Linguistics.

Svetlana Kiritchenko and Saif Mohammad. 2018. Examining gender and race bias in two hundred sentiment analysis systems. In Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics, pages 43–53, New Orleans, Louisiana. Association for Computational Linguistics.

Hector J. Levesque. 2013. On our best behaviour. In Proceedings of the Twenty-third International Conference on Artificial Intelligence, Beijing.

Tal Linzen. 2020. How can we accelerate progress towards human-like linguistic generalization? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5210–5217, Online. Association for Computational Linguistics.

Bing Liu. 2011. Opinion lexicon. Department of Computer Science, University of Illinois at Chicago.

Bing Liu. 2012. Sentiment Analysis and Opinion Mining. Morgan & Claypool.

Nelson F. Liu, Roy Schwartz, and Noah A. Smith. 2019a. Inoculation by fine-tuning: A method for analyzing challenge datasets. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2171–2179, Minneapolis, Minnesota. Association for Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019b. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA. Association for Computational Linguistics.

Julian McAuley, Jure Leskovec, and Dan Jurafsky. 2012. Learning attitudes and attributes from multi-aspect reviews. In 12th International Conference on Data Mining, pages 1020–1025, Washington, D.C. IEEE Computer Society.

Aakanksha Naik, Abhilasha Ravichander, Norman Sadeh, Carolyn Rose, and Graham Neubig. 2018. Stress test evaluation for natural language inference. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2340–2353, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Preslav Nakov, Alan Ritter, Sara Rosenthal, Fabrizio Sebastiani, and Veselin Stoyanov. 2016. SemEval-2016 task 4: Sentiment analysis in Twitter. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 1–18, San Diego, California. Association for Computational Linguistics.

Jianmo Ni, Jiacheng Li, and Julian McAuley. 2019. Justifying recommendations using distantly-labeled reviews and fine-grained aspects. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 188–197, Hong Kong, China. Association for Computational Linguistics.

Yixin Nie, Yicheng Wang, and Mohit Bansal. 2019. Analyzing compositionality-sensitivity of NLI models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6867–6874.

Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, and Douwe Kiela. 2020. Adversarial NLI: A new benchmark for natural language understanding. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4885–4901, Online. Association for Computational Linguistics.

Bo Pang and Lillian Lee. 2004. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04), pages 271–278, Barcelona, Spain.

Bo Pang and Lillian Lee. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 115–124, Ann Arbor, Michigan. Association for Computational Linguistics.

Bo Pang and Lillian Lee. 2008. Opinion mining and sentiment analysis. Foundations and Trends in Information Retrieval, 2(1):1–135.

James W. Pennebaker, Roger J. Booth, and Martha E. Francis. 2007. Linguistic Inquiry and Word Count: LIWC. Austin, TX: LIWC.net.

Adam Poliak, Jason Naradowsky, Aparajita Haldar, Rachel Rudinger, and Benjamin Van Durme. 2018. Hypothesis only baselines in natural language inference. In Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics, pages 180–191, New Orleans, Louisiana. Association for Computational Linguistics.

Sara Rosenthal, Noura Farra, and Preslav Nakov. 2017. SemEval-2017 task 4: Sentiment analysis in Twitter. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 502–518, Vancouver, Canada. Association for Computational Linguistics.

Rachel Rudinger, Chandler May, and Benjamin Van Durme. 2017. Social bias in elicited natural language inferences. In Proceedings of the First ACL Workshop on Ethics in Natural Language Processing, pages 74–79, Valencia, Spain. Association for Computational Linguistics.

Rachel Rudinger, Jason Naradowsky, Brian Leonard, and Benjamin Van Durme. 2018. Gender bias in coreference resolution. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 8–14, New Orleans, Louisiana. Association for Computational Linguistics.

Maarten Sap, Dallas Card, Saadia Gabriel, Yejin Choi, and Noah A. Smith. 2019. The risk of racial bias in hate speech detection. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1668–1678, Florence, Italy. Association for Computational Linguistics.

Tal Schuster, Darsh Shah, Yun Jie Serene Yeo, Daniel Roberto Filizzola Ortiz, Enrico Santus, and Regina Barzilay. 2019. Towards debiasing fact verification models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3419–3425, Hong Kong, China. Association for Computational Linguistics.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.

Rion Snow, Brendan O'Connor, Daniel Jurafsky, and Andrew Ng. 2008. Cheap and fast – but is it good? Evaluating non-expert annotations for natural language tasks. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 254–263, Honolulu, Hawaii. Association for Computational Linguistics.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, Seattle, Washington, USA. Association for Computational Linguistics.


Philip J. Stone and Earl B. Hunt. 1963. A computer approach to content analysis: Studies using the General Inquirer system. In Proceedings of the May 21–23, 1963, Spring Joint Computer Conference, pages 241–256.

Yi-Ting Tsai, Min-Chu Yang, and Han-Yu Chen. 2019. Adversarial attack on sentiment classification. In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 233–240, Florence, Italy. Association for Computational Linguistics.

Masatoshi Tsuchiya. 2018. Performance impact caused by hidden bias of training data for recognizing textual entailment. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).

Amazon Mechanical Turk. 2017. Tutorial: Best practices for managing workers in follow-up surveys or longitudinal studies. Amazon Mechanical Turk Blog.

Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. 2019. Universal adversarial triggers for attacking and analyzing NLP. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2153–2162, Hong Kong, China. Association for Computational Linguistics.

Amy Beth Warriner, Victor Kuperman, and Marc Brysbaert. 2013. Norms of valence, arousal, and dominance for 13,915 English lemmas. Behavior Research Methods, 45(4):1191–1207.

Janyce Wiebe, Theresa Wilson, and Claire Cardie. 2005. Annotating expressions of opinions and emotions in language. Language Resources and Evaluation, 39(2–3):165–210.

Theresa Wilson, Janyce Wiebe, and Paul Hoffmann. 2005. Recognizing contextual polarity in phrase-level sentiment analysis. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pages 347–354, Vancouver, British Columbia, Canada. Association for Computational Linguistics.

Terry Winograd. 1972. Understanding natural language. Cognitive Psychology, 3(1):1–191.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.

Rowan Zellers, Yonatan Bisk, Roy Schwartz, and Yejin Choi. 2018. SWAG: A large-scale adversarial dataset for grounded commonsense inference. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 93–104, Brussels, Belgium. Association for Computational Linguistics.

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. HellaSwag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800, Florence, Italy. Association for Computational Linguistics.

Wei Emma Zhang, Quan Z. Sheng, Ahoud Alhazmi, and Chenliang Li. 2020. Adversarial attacks on deep-learning models in natural language processing: A survey. ACM Transactions on Intelligent Systems and Technology, 11(3).

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 649–657. Curran Associates, Inc.

Yudian Zheng, Guoliang Li, Yuanbing Li, Caihua Shan, and Reynold Cheng. 2017. Truth inference in crowdsourcing: Is the problem solved? Proceedings of the VLDB Endowment, 10(5):541–552.


Appendix

A Model 0

To train our Model 0, we import weights from the pretrained RoBERTa-base model.6 As in the original RoBERTa-base model (Liu et al., 2019b), our model has 12 heads and 12 layers, with hidden layer size 768. The model uses byte-pair encoding for tokenization (Sennrich et al., 2016), with a maximum sequence length of 128. The initial learning rate is 2e−5 for all trainable parameters, with a batch size of 8 per device (GPU). We fine-tuned for 3 epochs with a dropout probability of 0.1 for both attention weights and hidden states. To foster reproducibility, our training pipeline is adapted from the Hugging Face library (Wolf et al., 2019).7 We used 6 GeForce RTX 2080 Ti GPUs, each with 11GB of memory, and training took about 15 hours to finish.
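For concreteness, the following is a minimal sketch of this configuration expressed with the Hugging Face Trainer API. It is illustrative rather than our exact pipeline: it loads the Hugging Face roberta-base checkpoint rather than the fairseq archive in footnote 6, and the dataset-loading step is left as a hypothetical placeholder.

# Sketch of the Model 0 fine-tuning setup described above, using the
# Hugging Face transformers Trainer API. Hyperparameters mirror the
# values reported in this appendix; the dataset loader is hypothetical.
from transformers import (
    RobertaForSequenceClassification,
    RobertaTokenizerFast,
    Trainer,
    TrainingArguments,
)

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
model = RobertaForSequenceClassification.from_pretrained(
    "roberta-base",
    num_labels=3,                      # positive / negative / neutral
    attention_probs_dropout_prob=0.1,  # dropout on attention weights
    hidden_dropout_prob=0.1,           # dropout on hidden states
)

def encode(batch):
    # Byte-pair-encoded inputs, truncated/padded to 128 tokens.
    return tokenizer(
        batch["sentence"], truncation=True, max_length=128, padding="max_length"
    )

args = TrainingArguments(
    output_dir="model0_roberta",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    num_train_epochs=3,
)

# train_dataset = load_sentiment_benchmarks().map(encode, batched=True)  # hypothetical helper
# Trainer(model=model, args=args, train_dataset=train_dataset).train()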

B Additional Details on Validation

B.1 Validation Interface

Figure 4 shows the interface for the validation task used for both Round 1 and Round 2. The top provides the instructions, and then one item is shown. The full task had ten items per Human Intelligence Task (HIT). Workers were paid US$0.25 per HIT, and all workers were paid for all their work, regardless of whether we retained their labels.

B.2 Worker Selection

Examples were uploaded to Amazon's Mechanical Turk in batches of 3–5K examples. After each round, we assessed workers by the percentage of examples they labeled for which they agreed with the majority. For example, a worker who selects Negative where three of the other workers chose Positive disagrees with the majority for that example. If a worker disagreed with the majority more than 80% of the time, we removed that worker from the annotator pool and revalidated the examples they labeled. This process was repeated iteratively over the course of the entire validation process for Round 1. Thus, many examples received more than 5 labels; we keep only those given by the top-ranked workers according to agreement with the majority. We observed that this iterative process led to substantial improvements in the validation labels, according to our own intuitions.

6 https://dl.fbaipublicfiles.com/fairseq/models/roberta.base.tar.gz

7 https://github.com/huggingface/transformers

To remove workers from our pool, we used a method of 'unqualifying', as described in Amazon Mechanical Turk (2017). This method does no reputational damage to workers and is often used in situations where the requester must limit responses to one per worker (e.g., surveys). We do not know precisely why some workers tend to disagree with the majority; the reasons are likely diverse. Possible causes include inattentiveness, poor reading comprehension, a lack of understanding of the task, and a genuinely different perspective on what examples convey. While we think our method mainly increased label quality, we recognize that it can introduce unwanted biases. We acknowledge this in our Datasheet, which is distributed with the dataset.
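As a rough sketch (not our actual code), the worker-filtering heuristic above can be expressed as follows. The triple-based input format and the strict-majority tie-breaking are assumptions made for the illustration.

# Hypothetical sketch of the worker-filtering heuristic: score each worker by
# how often they agree with the per-example majority label, and flag workers
# whose disagreement rate exceeds a threshold (0.8 in the text above).
from collections import Counter, defaultdict

def flag_low_agreement_workers(labels, max_disagreement=0.8):
    """labels: iterable of (example_id, worker_id, label) triples (assumed format)."""
    by_example = defaultdict(list)
    for example_id, worker_id, label in labels:
        by_example[example_id].append((worker_id, label))

    agreements, totals = Counter(), Counter()
    for responses in by_example.values():
        counts = Counter(label for _, label in responses)
        majority_label, majority_count = counts.most_common(1)[0]
        if majority_count <= len(responses) / 2:
            continue  # no strict majority; tie handling is an assumption here
        for worker_id, label in responses:
            totals[worker_id] += 1
            agreements[worker_id] += int(label == majority_label)

    # Workers who disagree with the majority more than the threshold allows.
    return {w for w in totals if 1 - agreements[w] / totals[w] > max_disagreement}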

B.3 Worker Distribution

Figure 5 shows the distribution of workers for the validation task for both rounds. In the final version of Round 1, the median number of examples per worker was 45 and the mode was 11. For Round 2, the median was 20 and the mode was 1.

B.4 Worker Agreement with Gold Labels

Figure 6 summarizes the rates at which individual workers agree with the gold label. Across the dev and test sets for both rounds, substantial numbers of workers agreed with the gold label on all of the cases they labeled, and more than half were above 95% for this agreement rate for both rounds.

C Additional Details on Dynabench Task

C.1 Interface for the Prompt Condition

Figure 7 shows an example of the Dynabench interface in the Prompt condition.

C.2 Instructions

Figure 8 provides the complete instructions for the Dynabench task, and Table 11 provides the list of comprehension questions we required workers to answer correctly before starting.

C.3 Data Collection Pipeline

For each task, a worker has ten attempts in total to find an example that fools the model. A worker can claim their payment immediately after submitting a single fooling example or after running out of attempts. On average, a worker takes two attempts per task before generating an example that they claim fools the model. Workers are paid US$0.30 per task.

A confirmation step is required if the model predicts incorrectly: we explicitly ask workers to confirm that the examples they come up with are truly fooling examples. Figure 7 shows this step in action.

To incentivize workers, we pay a bonus of US$0.30 for each example that our separate validation phase confirms as truly fooling.

We temporarily bar a worker from doing our task if they fail to answer all of the onboarding questions in Table 11 correctly within five attempts. We also temporarily bar a worker if they consistently fail to come up with examples that our validation confirms as truly fooling.

A worker must meet the following qualifications before accepting our tasks. First, a worker must reside in the U.S. and speak English. Second, a worker must have completed at least 1,000 tasks on Amazon Mechanical Turk with an approval rating of 98%. Lastly, a worker must not be in any of our pools of temporarily barred workers.
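Purely for illustration, these qualification requirements can be encoded as a simple check; the code below is not the Mephisto or MTurk API, and all field names are hypothetical.

# Illustrative sketch of the worker qualifications listed above.
from dataclasses import dataclass

@dataclass
class WorkerProfile:
    country: str          # e.g., "US"
    speaks_english: bool
    approved_tasks: int   # lifetime completed/approved MTurk tasks
    approval_rate: float  # e.g., 0.98
    temporarily_barred: bool

def is_qualified(w: WorkerProfile) -> bool:
    # A worker must satisfy all of the requirements described in the text.
    return (
        w.country == "US"
        and w.speaks_english
        and w.approved_tasks >= 1000
        and w.approval_rate >= 0.98
        and not w.temporarily_barred
    )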

We adapt the open-source software package Mephisto as our data collection tool for Amazon Mechanical Turk.8

D SST-3 Validation Examples

Table 12 provides selected examples comparing the SST-3 labels with our revalidation of that dataset's labels.

E Randomly Selected Corpus Examples

Table 13 provides 10 truly randomly selected examples from each round's train set.

8https://github.com/facebookresearch/Mephisto


Figure 4: Validation interface.

Figure 5: Worker distribution for the validation task. (a) Round 1; (b) Round 2.


Figure 6: Rates at which individual workers agree with the majority label. The y-axis gives, for each worker, the total number of examples for which they chose the majority label divided by the total number of cases they labeled overall. (a) Round 1; (b) Round 2.

Figure 7: Dynabench interface.


Figure 8: Dynabench instructions.


Q1. You are asked to write down one sentence at each round. (False / True)

Q2. You will not know whether you fool the model after you submit your sentence. (False / True)

Q3. You may need to wait after you submit your result. (False / True)

Q4. The goal is to find an example that the model gets right but that another person would get wrong. (False / True)

Q5. You will be modifying a provided prompt to generate your own sentence. (False / True)

Q6. Here is an example where a powerful model predicts positive sentiment: "The restaurant is near my place, but the food there is not good at all." (False / True)

Q7. Suppose the goal is to write a positive sentence that fools the model into predicting another label. The sentence provided is "That was awful!", and the model predicts negative. Does this sentence meet the goal of the task? (Yes / No)

Q8. Suppose the goal is to write a negative sentence that fools the model into predicting another label. The sentence provided is "They did a great job of making me feel unwanted", and the model predicts positive. Does this sentence meet the goal of the task? (Yes / No)

Table 11: The list of comprehension questions we asked workers to answer correctly, with a maximum of 5 retries, before starting the task; the answer options for each question are shown in parentheses.


(a) All examples for which the SST-3 label is Negative and our majority label is Positive.

Sentence | SST-3 | Responses
Moretti 's compelling anatomy of grief and the difficult process of adapting to loss. | neg | neu, pos, pos, pos, pos
Nothing is sacred in this gut-buster. | neg | neg, neg, pos, pos, pos

(b) All examples for which the SST-3 label is Positive and our majority label is Negative.

Sentence | SST-3 | Responses
... routine , harmless diversion and little else. | pos | mix, mix, neg, neg, neg
Hilariously inept and ridiculous. | pos | mix, neg, neg, neg, neg
Reign of Fire looks as if it was made without much thought – and is best watched that way. | pos | mix, neg, neg, neg, neg
So much facile technique, such cute ideas, so little movie. | pos | mix, mix, neg, neg, neg
While there 's something intrinsically funny about Sir Anthony Hopkins saying 'get in the car, bitch,' this Jerry Bruckheimer production has little else to offer | pos | mix, neg, neg, neg, neg

(c) A random selection of 10 examples for which the SST-3 label is Neutral and our validation label is not.

Sentence | SST-3 | Responses
Returning aggressively to his formula of dimwitted comedy and even dimmer characters, Sandler, who also executive produces, has made a film that makes previous vehicles look smart and sassy. | neu | neg, neg, neg, neg, neg
should be seen at the very least for its spasms of absurdist humor. | neu | pos, pos, pos, pos, pos
A workshop mentality prevails. | neu | neu, neu, neu, neu, neu
Van Wilder brings a whole new meaning to the phrase ' comedy gag . ' | neu | mix, neu, pos, pos, pos
' They ' begins and ends with scenes so terrifying I'm still stunned. | neu | neu, neu, pos, pos, pos
Barely gets off the ground. | neu | neg, neg, neg, neg, neg
As a tolerable diversion, the film suffices; a Triumph, however, it is not. | neu | mix, mix, mix, mix, neg
Christina Ricci comedy about sympathy, hypocrisy and love is a misfire. | neu | neg, neg, neg, neg, neg
Jacquot's rendering of Puccini's tale of devotion and double-cross is more than just a filmed opera. | neu | neg, neu, pos, pos, pos
Candid Camera on methamphetamines. | neu | neg, neg, neg, neu, pos

Table 12: Comparisons between the SST-3 labels and our new validation labels. Each row gives the sentence, its SST-3 label, and our five validation responses.


(a) Randomly sampled Round 1 train cases.

Sentence | Model 1 | Responses
We so wanted to have a new steak house restaurant. | pos | neu, neu, neu, neu, pos
As a foodie, I can surely taste the difference. | pos | neu, neu, neu, pos, pos
There was however some nice dinner table chairs that I liked a lot for $35 a piece and for the quality and style this was a very nice price for them. | pos | pos, pos, pos, pos, pos
The waitress helped me pick from the traditional menu and I ended up with chilli chicken. | pos | neu, neu, pos, pos, pos
I have had lashes in the past and have used some of those Living Social deals. | pos | neu, neu, neu, neu, pos
Lots of trash cans. | neu | mix, neu, neu, neu, neu
They were out the next day after my call to do the inspection and same for the treatment. | pos | mix, neu, neu, neu, pos
When we walked in no one was there to sit us, I waited for a minute and then decided just to take a seat. | neg | mix, neg, neg, neg, neg
Driver was amazing! | pos | pos, pos, pos, pos, pos
We tried:\n\nChampagne On Deck - Smooth and easy to drink. | pos | neu, neu, pos, pos, pos

(b) Randomly sampled Round 2 train cases.

Sentence | Model 1 | Responses
The menu was appealing. | pos | pos, pos, pos, pos, pos
Our food took forever to get home. | neg | neg, neg, neg, neg, neu
Our table was ready after 90 minutes and usually I'd be mad but the scenery was nice so it was an okay time. | pos | mix, mix, mix, mix, pos
I left feeling like an initiator in a world of books that I never even knew existed. | pos | mix, neu, pos, pos, pos
I decided to go with josh wilcox, only as a last resort | neu | mix, neg, neu, neu, neu
: ] Standard ask if the food is a great happening! | neu | neu, neu, neu, neu, pos
The car was really beautiful. | pos | neu, pos, pos, pos, pos
Food is a beast of its own, low-quality ingredients, typically undercooked, and a thin, very simple menu. | neg | neg, neg, neg, neg, neu
I want to share a horrible experience here. | neg | neg, neg, neg, neg, neg
I tried a new place. The entrees were subpar to say the least. | neg | neg, neg, neg, neg, neg

Table 13: Randomly sampled Round 1 and Round 2 train cases. Each row gives the sentence, Model 1's prediction, and our five validation responses.