arXiv:2010.02458v1 [cs.LG] 6 Oct 2020
Identifying Spurious Correlations for Robust Text Classification

Zhao Wang
Illinois Institute of Technology
[email protected]

Aron Culotta
Tulane University

[email protected]

Abstract

The predictions of text classifiers are often driven by spurious correlations – e.g., the term Spielberg correlates with positively reviewed movies, even though the term itself does not semantically convey a positive sentiment. In this paper, we propose a method to distinguish spurious and genuine correlations in text classification. We treat this as a supervised classification problem, using features derived from treatment effect estimators to distinguish spurious correlations from "genuine" ones. Due to the generic nature of these features and their small dimensionality, we find that the approach works well even with limited training examples, and that it is possible to transport the word classifier to new domains. Experiments on four datasets (sentiment classification and toxicity detection) suggest that using this approach to inform feature selection also leads to more robust classification, as measured by improved worst-case accuracy on the samples affected by spurious correlations.

1 Introduction

Text classifiers often rely on spurious correlations. For example, consider sentiment classification of movie reviews. The term Spielberg may be correlated with the positive class because many of director Steven Spielberg's movies have positive reviews. However, the term itself does not indicate a positive review. In other words, the term Spielberg does not cause the review to be positive. Similarly, consider the problem of toxicity classification of online comments. Terms indicative of certain ethnic groups may be associated with the toxic class because those groups are often victims of harassment, not because those terms are toxic themselves.

Oftentimes, such spurious correlations do not harm prediction accuracy because the same correlations exist in both training and testing data (under the common assumption of i.i.d. sampling). However, they can still be problematic for several reasons. For example, under dataset shift (Quionero-Candela et al., 2009), the testing distribution differs from the training distribution. E.g., if Steven Spielberg makes a new, bad movie, the sentiment classifier may incorrectly classify the reviews as positive because they contain the term Spielberg. Additionally, if the spurious correlations indicate demographic attributes, then the classifier may suffer from issues of algorithmic fairness (Kleinberg et al., 2018). For example, the toxicity classifier may unfairly over-predict the toxic class for comments discussing certain demographic groups. Finally, in settings where classifiers must explain their decisions to humans, such spurious correlations can reduce trust in autonomous systems (Guidotti et al., 2018).

In this paper, we propose a method to distinguish spurious correlations, like Spielberg, from genuine correlations, like wonderful, which more reliably indicate the class label. Our approach is to treat this as a separate classification task, using features drawn from treatment effect estimation approaches that isolate the impact each word has on the class label, while controlling for the context in which it appears.

We conduct classification experiments with four datasets and two tasks (sentiment classification and toxicity detection), focusing on the problem of short text classification (i.e., single sentences or tweets). We find that with a small number of labeled word examples (200-300), we can fit a classifier to distinguish spurious and genuine correlations with moderate to high accuracy (.66-.82 area under the ROC curve), even when tested on the terms most strongly correlated with the class label. In addition, due to the generic nature of the features, we can train a word classifier on one domain and transfer it to another domain without much loss in accuracy.



Finally, we apply the word classifier to inform feature selection for the original classification task (e.g., sentiment classification and toxicity detection). Following recent work on distributionally robust classification (Sagawa et al., 2020a), we measure worst-case accuracy by considering samples of data most affected by spurious correlations. We find that removing terms in the order of their predicted probability of being spurious correlations can result in more robust classification with respect to this worst-case accuracy.

2 Problem and Motivation

We consider binary classification of short documents, e.g., sentences or tweets. Each sentence is a sequence of words s = ⟨w1 . . . wk⟩ with a corresponding binary label y ∈ {−1, 1}. To classify a sentence s, it is first transformed into a feature vector x via a feature function g : s ↦ x. Then, the feature vector is assigned a label by a classification function f : (x; θ) ↦ {−1, 1}, with model parameters θ. Parameters θ are typically estimated from a set of i.i.d. labeled examples D = {(s1, y1) . . . (sn, yn)} by minimizing some loss function L: θ∗ ← argmin_θ L(D, θ).

To illustrate the problem addressed in this paper, we will first consider the simple approach of a bag-of-words logistic regression classifier. In this setting, the feature function g(s) simply maps a document to a word count vector x = {x1 . . . xV }, for vocabulary size V, and the classification function is the logistic function f(x; θ) = 1 / (1 + e^−⟨x,θ⟩). After estimating parameters θ on labeled data D, we can then examine the coefficients corresponding to each word in the vocabulary to see which words are most important in the model.
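To make this baseline concrete, below is a minimal sketch (ours, not the authors' code) of a bag-of-words logistic regression classifier and coefficient inspection using scikit-learn; `sentences` and `labels` are assumed to be loaded elsewhere, e.g., from the IMDB sentence-polarity data.

```python
# Minimal sketch of the bag-of-words logistic regression baseline described
# above. `sentences` is a list of strings and `labels` a list of +1/-1 labels.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

def top_coefficients(sentences, labels, k=10):
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(sentences)           # word-count features g(s)
    clf = LogisticRegression(max_iter=1000).fit(X, labels)
    vocab = np.array(vectorizer.get_feature_names_out())
    order = np.argsort(clf.coef_[0])
    # Words with the most negative and most positive coefficients theta_w.
    negative = list(zip(vocab[order[:k]], clf.coef_[0][order[:k]]))
    positive = list(zip(vocab[order[-k:]], clf.coef_[0][order[-k:]]))
    return negative, positive
```

Inspecting the two returned lists is what produces candidate words like those plotted in Figure 1.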

In Figure 1, we show eight words with high-magnitude coefficients for a classifier fit on a dataset of movie reviews (Pang and Lee, 2005), where class 1 means positive sentiment and −1 means negative sentiment. We will return shortly to the meaning of the x-axis; for now, let us consider the y-axis, which is the estimated coefficient θw for each word. Of the four words strongly correlated with the positive class (θw > 0), two seem genuine (enjoyable, masterpiece), while two seem spurious (animated, spielberg). (Steven Spielberg is a very successful American director and producer.) Similarly, of the words correlated with the negative class, two seem genuine (boring, failure) and two seem spurious (heavy, seagal). (Steven Seagal is an American actor mostly known for martial-arts movies.) Furthermore, in some cases, the spurious term actually has a coefficient of larger magnitude than the genuine term (e.g., seagal versus failure).

Figure 1: Motivating example of spurious and genuine correlations in a sentiment classification task. (Scatter plot of each word's logistic regression coefficient θ, y-axis, against its average similarity with matched counterfactual examples, x-axis; the plotted words are enjoyable and masterpiece (genuine positive), boring and failure (genuine negative), animated and spielberg (spurious positive), and heavy and seagal (spurious negative).)

Our goal in this paper is to distinguish between spurious and genuine correlations. Without wading into long-standing debates over the nature of causality (Aldrich et al., 1995), we simplify the distinction between genuine and spurious correlations as a dichotomous decision: the discovered relationship between word w and label y is genuine if, all else being equal, one would expect w to be a determining factor in assigning a label to a sentence. We use human annotators to make this distinction for training and evaluating models.

In this light, our problem is related to prior work on active learning with rationales (Zaidan et al., 2007; Sharma et al., 2015) and interactive feature selection (Raghavan et al., 2005). However, our goal is not solely to improve prediction accuracy, but also to improve robustness across different groups affected by these spurious correlations.

3 Methods

Our definition of genuine correlation given above fits well within the counterfactual framework of causal inference (Winship and Morgan, 1999). If the word w in s were replaced with some other word w′, how likely is it that the label y would change? Since conducting randomized control trials to answer this counterfactual for many terms and sentences is infeasible, we instead resort to matching methods, commonly used to estimate average treatment effects from observational data (Imbens, 2004; King and Nielsen, 2019). The intuition is as follows: if w is a reliable piece of evidence to determine the label of s, we should be able to find a very similar sentence s′ that (i) does not contain w, and (ii) has the opposite label of s. While this is not a necessary condition (there may not be a good match in a limited training set), in the experiments below we find this to be a fairly precise approach to identify genuine correlations.

Paul (2017) proposed a similar formulation, using propensity score matching to estimate the treatment effect for each term, then performing feature selection based on these estimates. Beyond recent critiques of propensity scores (King and Nielsen, 2019), any matching approach will create matches of varying quality, making it difficult to distinguish between spurious and genuine correlations. Returning to Figure 1, the x-axis shows the average quality of the counterfactual match for each term, where a larger value means that the linguistic context of the counterfactual sentence is very similar to the original sentence. (These are computed by cosine similarity of sentence embeddings, described in §3.2.) Even though these terms have very similar average treatment effect estimates, the quality of the match seems to be a viable signal of whether the term is spurious or genuine.

More generally, building on prior work that treats causal inference as a classification problem (Lopez-Paz et al., 2015), we can derive a number of features from the components of the treatment effect estimates (enumerated in §3.3), and from these fit a classification model to determine whether a word should be labeled as spurious or genuine. This word classifier can then be used in a number of ways to improve the document classifier (e.g., to inform feature selection, to place priors on word coefficients, etc.).

To build the word classifier, we assume a human has annotated a small number of terms as spurious or genuine, which we can use as training data. While this places an additional cost on annotation, the nature of the features reduces this burden: there are not very many features in the word classifier, and they are mostly generic and domain-independent. As a result, in the experiments below, we find that useful word classifiers can be built from a small number of labeled terms (200-300). Furthermore, and perhaps more importantly, we find that the word classifier can be transported to new domains with little loss in accuracy. This suggests that one can label words once in one domain, fit a word classifier, and apply it in new domains without annotating additional words.

3.1 Overview of approach

The main stages of our approach are as follows:

1. Given training data D = {(s1, y1) . . . (sn, yn)} for the primary classification task, fit an initial classifier f(x; θ).

2. Extract from f(x; θ) the words W = {w1 . . . wm} that are most strongly associated with each class according to the initial classifier. E.g., for logistic regression, we may extract the words with the highest magnitude coefficients for each class. For more complex models, other transparency algorithms may be used (Martens and Provost, 2014).

3. For each word, compute features that indicate its likelihood to be spurious or genuine (§3.3).

4. Fit a word classifier h(w; λ) on a human-annotated subset of W.

5. Apply h(w; λ) on the remaining words to estimate the probability that they are spurious.

After the final step, one may use the posterior probabilities in several ways to improve classification, e.g., to sort terms for feature selection, to place priors on word coefficients, or to set attention weights in neural networks. In this paper, we focus on feature selection, leaving other options for future work.

Additionally, we experiment with domain adaptation, where h(w; λ) is fit on one domain and applied to another domain for feature selection, without requiring additional labeled words from that domain.

3.2 Matching

Most of the features for the word classifier are inspired by matching approaches from causal inference (Stuart, 2010). The idea is to match sentences containing different words in similar contexts so that we can isolate the effect that one particular word choice has on the class label.

For a word w and a sentence s containing this word, we let s[w] be the sentence s with word w removed. The goal of matching is to find some other context s′[w′] such that w ∉ s′ and s[w] is semantically similar to s′[w′]. We use a best-match approach, finding the closest match s∗ ← argmax_{s′} sim(s[w], s′[w′]). With this best match, we can compute the average treatment effect (ATE) of word w in N sentences:

τ_w = (1/N) Σ_{s : w ∈ s} (y_s − y_{s∗})     (1)


it's refreshing to see a movie that (1)            |  it's rare to see a movie that (-1)
cast has a lot of fun with the material (1)        |  comedy with a lot of unfunny (-1)
smoothly under the direction of spielberg (1)      |  it works under the direction of kevin (1)
refreshingly different slice of asian cinema (1)   |  an interesting slice of history (1)
charting the rise of hip-hop culture in general (1)|  hip-hop has a history, and it's a metaphor (1)

Table 1: Examples of matched contexts from the IMDB dataset; each row pairs an original context with its matched counterfactual (class label in parentheses).

Thus, a term w will have a large value of τw if (i) it often appears in the positive class, and (ii) very similar sentences where w is swapped with w′ have negative labels.

In our experiments, to improve the quality of matches, we limit contexts to the five previous and five subsequent words of w, then represent the context by concatenating the last four layers of a pre-trained BERT model (as recommended by the original BERT paper) (Devlin et al., 2018). We use the cosine similarity of context embeddings as a measure of semantic similarity.
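The sketch below illustrates this matching step under the assumptions stated above (a five-word window on each side, concatenation of the last four BERT layers, cosine similarity, and the ATE of Eq. (1)); the data structures and function names are ours, not the authors' released code.

```python
# Sketch of the matching step. `occurrences` is assumed to be a list of
# (tokens, position, label) triples for sentences containing the word of
# interest; `candidates` holds the same triples for sentences that do NOT
# contain it (the candidate word w' sits at `position`).
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
bert.eval()

def context_embedding(tokens, position, window=5):
    """Embed the context of tokens[position] with the word itself removed:
    up to `window` words on each side, concatenating the last four BERT
    layers and mean-pooling over tokens."""
    context = tokens[max(0, position - window):position] + tokens[position + 1:position + 1 + window]
    inputs = tokenizer(" ".join(context), return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = bert(**inputs).hidden_states        # tuple: embeddings + 12 layers
    layers = torch.cat(hidden[-4:], dim=-1)          # (1, seq_len, 4*768)
    return layers.mean(dim=1).squeeze(0)             # mean-pool over tokens

def best_match(tokens, position, candidates):
    """Return (similarity, label) of the candidate context most similar to
    the given occurrence's context."""
    e = context_embedding(tokens, position)
    sims = [(torch.cosine_similarity(e, context_embedding(t, p), dim=0).item(), y)
            for t, p, y in candidates]
    return max(sims)

def average_treatment_effect(occurrences, candidates):
    """Eq. (1): mean difference between each sentence's label and the label
    of its best counterfactual match."""
    diffs = []
    for tokens, position, y in occurrences:
        _, y_star = best_match(tokens, position, candidates)
        diffs.append(y - y_star)
    return sum(diffs) / len(diffs)
```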

Take one example from Table 1: "it's refreshing to see a movie that (1)" is matched with "it's rare to see a movie that (-1)". The words refreshing and rare appear in similar contexts, but adding refreshing to this context makes the sentence positive, while adding rare to this context makes it negative. If most of the pairwise matches show that adding refreshing is more positive than adding other substitution words, then refreshing is very likely to be a genuine positive word.

On the contrary, if adding other substitution words to similar contexts does not change the label, then w is likely to be a spuriously correlated word. Take another example from Table 1: "smoothly under the direction of spielberg (1)" is matched with "it works under the direction of kevin (1)". Here, spielberg and kevin appear in similar contexts, and substituting spielberg with kevin does not make any difference in the label. If most pairwise matches show that substituting spielberg with other words does not change the label, then spielberg is very likely to be a spurious positive word.

3.3 Features for Word Classification

While the matching approach above is a traditional way to estimate the causal effect of a word w given observational data, there are many well-known limitations to matching approaches (King et al., 2011).

A primary difficulty is that high-quality matches may not exist in the data, leading to biased estimates. Inspired by supervised learning approaches to causal inference (Lopez-Paz et al., 2015), rather than directly use the ATE to distinguish between spurious and genuine correlations, we instead compute a number of features to summarize information about the matching process. In addition to the ATE itself, we calculate the following features (a sketch of the resulting feature vector follows the list):

• The average context similarity of every match for word w.

• The context similarity of the top-5 closest matches.

• The maximum and standard deviation of the similarity scores.

• The context similarity of the closest positive and negative sentences.

• The weighted average treatment effect, where Eq. (1) is weighted by the similarity between s and s∗.

• The ATE restricted to the top-5 most similar matches for sentences containing w.

• The word's coefficient from the initial sentence classifier.

• Finally, to capture subtle semantic differences between the original and matched sentences, we compute features such as the average embedding difference from all matches, the top-3 most different dimensions from the average embedding, and the maximum value along each dimension.
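As a rough illustration, a feature vector of this kind might be assembled from a word's match records as in the sketch below; the record layout, feature names, and the choice of matched label for the "closest positive/negative" features are our assumptions, and the embedding-difference features are omitted for brevity.

```python
# Sketch of building a word-level feature dict from match records.
# `matches` is assumed to be a list of (similarity, y, y_star) tuples, one per
# sentence containing the word; `coef` is the word's coefficient from the
# initial document classifier.
import numpy as np

def word_features(matches, coef, k=5):
    sims = np.array([m[0] for m in matches])
    effects = np.array([m[1] - m[2] for m in matches])   # y - y* per match
    top = np.argsort(-sims)[:k]                          # k closest matches
    pos = sims[[i for i, m in enumerate(matches) if m[2] == 1]]
    neg = sims[[i for i, m in enumerate(matches) if m[2] == -1]]
    return {
        "ate": effects.mean(),
        "avg_sim": sims.mean(),
        "top_k_sim": sims[top].mean(),
        "max_sim": sims.max(),
        "std_sim": sims.std(),
        "closest_pos_sim": pos.max() if len(pos) else 0.0,
        "closest_neg_sim": neg.max() if len(neg) else 0.0,
        "weighted_ate": (sims * effects).sum() / sims.sum(),
        "top_k_ate": effects[top].mean(),
        "coef": coef,
    }
```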

3.4 Measuring the Impact of Spurious Correlations on Classification

After we train the word classifier to identify spurious and genuine words, we are further interested in exploring how spurious correlations affect classification performance on test data. As discussed in §1, measuring robustness can be difficult when data are sampled i.i.d. because the same spurious correlations exist in the training and testing data. Thus, we would not expect accuracy to necessarily improve on a random sample when spurious words are removed. Instead, we are interested in measuring the robustness of the classifier, where robustness is with respect to which subgroup of data is being considered.

Motivated by (Sagawa et al., 2020a), we divide the test data into two groups and explore the model performance on each. The first group, called the minority group, contains sentences in which the spurious correlation is expected to mislead the classifier. From our running example, that would be a negative sentiment sentence containing spielberg, or a positive sentiment sentence containing seagal. Analogously, the majority group contains examples in which the spurious correlation helps the classifier (e.g., positive sentiment documents containing spielberg). In §4.4, we conduct experiments to see how removing terms that are predicted to be spurious affects accuracy on the majority and minority groups.
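A sketch of this group construction is shown below; it assumes whitespace tokenization and a hypothetical `spurious_words` mapping from each spurious word to the class it is correlated with (+1 or -1), neither of which is specified in the paper.

```python
# Split test sentences into majority/minority groups with respect to a set of
# spurious words. A sentence goes to the minority group when a spurious word's
# correlated class disagrees with the sentence label, and to the majority
# group when it agrees; unaffected sentences are skipped.
def split_groups(sentences, labels, spurious_words):
    majority, minority = [], []
    for s, y in zip(sentences, labels):
        hits = [w for w in spurious_words if w in s.split()]
        if not hits:
            continue                                   # sentence not affected
        if any(spurious_words[w] != y for w in hits):
            minority.append((s, y))                    # correlation misleads
        else:
            majority.append((s, y))                    # correlation helps
    return majority, minority
```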

4 Experiments

4.1 Data

We experiment with four datasets for two binary classification tasks: sentiment classification and toxicity detection.¹

• IMDB movie reviews: movie review sentences labeled with their overall sentiment polarity (positive or negative) (Pang and Lee, 2005) (version 1.0).

• Kindle reviews: product reviews from the Amazon Kindle Store with ratings ranging from 1-5 (He and McAuley, 2016). We first fit a sentiment classifier on this dataset to identify keywords, and then split each review into single sentences and assign each sentence the same rating as the original review. We select sentences that contain sentiment keywords, remove sentences that have fewer than 5 or more than 40 words, and finally label the remaining sentences rated {4,5} as positive and sentences rated {1,2} as negative (a preprocessing sketch follows this list). To justify the validity of sentence labels inherited from the original documents, we randomly sampled 500 sentences (containing keywords) and manually checked their labels. The inherited labels were correct for 484 sentences (i.e., 96.8% accuracy).

• Toxic comment: a dataset of comments from Wikipedia's talk page (Wulczyn et al., 2017).² Comments are labeled by human raters for toxic behavior (e.g., comments that are rude, disrespectful, offensive, or otherwise likely to make someone leave a discussion). Each comment was shown to up to 10 annotators, and the fraction of human raters who believed the comment is toxic serves as the final toxicity score, which ranges from 0.0 to 1.0. We follow the same processing steps as for the Kindle reviews dataset: split comments into sentences, select sentences containing toxic keywords (learned from a toxicity classifier), and limit sentence length. We label sentences with toxicity scores ≥ 0.7 as toxic and ≤ 0.5 as non-toxic.

¹ Code and data available at: https://github.com/tapilab/emnlp-2020-spurious

² https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification

                 #docs     #top words   #matched sentences
IMDB             10,662    366          8,882
Kindle           20,232    270          24,882
Toxic comment    15,216    329          8,414
Toxic tweet      6,774     341          9,224

Table 2: Corpus summary.


• Toxic tweet: tweets collected through the Twitter Streaming API by matching toxic keywords from HateBase and labeled as toxic or non-toxic by human raters (Bahar et al., 2020).

All datasets are sampled to have an equal class balance. The basic dataset information is summarized in Table 2.
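As referenced in the Kindle bullet above, the sentence-level preprocessing shared by the Kindle and Toxic comment datasets could look roughly like the sketch below; `reviews`, `keywords`, and the use of NLTK's sentence splitter (with its punkt data assumed to be installed) are our assumptions, not details given in the paper. The Toxic comment dataset follows analogous steps with the toxicity-score thresholds described above.

```python
# Sketch of the Kindle preprocessing: split reviews into sentences, keep
# sentences with a sentiment keyword and 5-40 words, and inherit a binary
# label from the review's star rating.
import nltk

def make_sentence_dataset(reviews, keywords):
    """reviews: list of (text, rating) pairs; keywords: set of sentiment terms
    learned from an initial review-level classifier."""
    data = []
    for text, rating in reviews:
        if rating == 3:                                # skip neutral reviews
            continue
        label = 1 if rating >= 4 else -1               # {4,5} -> +1, {1,2} -> -1
        for sent in nltk.sent_tokenize(text):
            words = sent.split()
            if 5 <= len(words) <= 40 and keywords.intersection(words):
                data.append((sent, label))
    return data
```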

4.2 Creating Matched Sentences

We first obtain pairwise matched sentences for words of interest. In this work, we focus on words that have relatively strong correlations with each class. We therefore fit a logistic regression classifier for each dataset and select the top features by placing a threshold on coefficient magnitude (i.e., words with high positive or negative coefficients). For IMDB movie reviews, Kindle reviews, and Toxic comments, we use a coefficient threshold of 1.0; for Toxic tweet, we use a threshold of 0.7 (to generate a comparable number of candidate words).
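A minimal sketch of this candidate-word selection is given below; `vectorizer` and `clf` are assumed to be the bag-of-words model from the sketch in §2, and the thresholds follow the values quoted above.

```python
# Keep vocabulary items whose logistic regression coefficient magnitude
# exceeds a threshold (1.0 for IMDB/Kindle/Toxic comment, 0.7 for Toxic tweet).
import numpy as np

def top_words(vectorizer, clf, threshold=1.0):
    vocab = np.array(vectorizer.get_feature_names_out())
    coefs = clf.coef_[0]
    keep = np.abs(coefs) >= threshold
    return dict(zip(vocab[keep], coefs[keep]))   # word -> coefficient
```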

We find matched sentences for each word following the method in §3.2. Table 1 shows five examples of pairwise matches. The total number of matched sentences for each dataset is shown in Table 2.

4.3 Word Classification

The goal of word classification is to distinguish between spurious words and genuine words. We first manually label a small set of words as spurious or genuine (Table 3). For sentiment classification, we consider both positive and negative words. For toxicity classification, we only consider toxic words. We had two annotators annotate each term; agreement was generally high for this task (e.g., 96% raw agreement), with the main discrepancies arising from knowledge of slang and abbreviations.


We represent each word with the numerical features calculated from matched sentences (§3.3), standardized to have zero mean and unit variance. Finally, we apply a logistic regression model as the binary word classifier. We explore the word classifier's performance both within the same domain and under domain adaptation.
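The word classifier itself could be sketched as the small pipeline below, which vectorizes per-word feature dictionaries (as produced in §3.3), standardizes them, and fits logistic regression; the helper name and the use of scikit-learn's DictVectorizer are our assumptions.

```python
# Sketch of the binary word classifier described above.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def fit_word_classifier(feature_dicts, labels):
    """feature_dicts: list of per-word feature dicts; labels: 1 = spurious, 0 = genuine."""
    model = make_pipeline(DictVectorizer(sparse=False),
                          StandardScaler(),
                          LogisticRegression(max_iter=1000))
    return model.fit(feature_dicts, labels)

# model.predict_proba(new_feature_dicts)[:, 1] then gives each remaining
# word's estimated probability of being spurious.
```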

Same domain: We apply 10-fold cross-validation to estimate the word classifier's accuracy within the same domain. In practice, the idea is that one would label a set of words, fit a classifier, and then apply it to the remaining words.

Domain adaptation: To reduce the word annotation burden, we are interested in understanding whether a word classifier trained on one domain can be applied in another. Thus, we measure cross-domain accuracy, e.g., by fitting the word classifier on the IMDB dataset and evaluating it on the Kindle dataset.
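A sketch of both evaluation protocols follows; the feature matrices X and 0/1 labels y are assumed to be prepared as in §3.3, and the helper names are ours.

```python
# Within-domain (10-fold cross-validated AUC) and cross-domain evaluation of
# the word classifier.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score

def same_domain_auc(X, y):
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X, y, cv=10, scoring="roc_auc").mean()

def cross_domain_auc(X_src, y_src, X_tgt, y_tgt):
    clf = LogisticRegression(max_iter=1000).fit(X_src, y_src)
    return roc_auc_score(y_tgt, clf.predict_proba(X_tgt)[:, 1])
```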

4.4 Feature Selection Based on Spurious Correlation

We compare several strategies for feature selection in the initial document classification tasks.

According to the word classifier, each word is assigned a probability of being spurious, which we use to sort terms for feature selection. That is, words deemed most likely to be spurious are removed first. As a comparison, we experiment with the following strategies for ranking words in the order in which they are removed.

Oracle This is the gold standard. We treat the manually labeled spurious words as equally important and sort them in random order. This gold standard ensures that the removed features are definitely spurious.

Sentiment lexicon We create a sentiment lexicon by combining sentiment words from (Wilson et al., 2005) and (Liu, 2012). It contains 2,724 positive words and 5,078 negative words. We select words that appear in the sentiment lexicon as informative genuine features and fit a baseline classifier with these features. This method is complementary to the oracle method above.

Random This is a baseline that sorts the top words (which could be spurious or genuine) in random order and removes them in that order.

Same domain prediction We sort words in descending order of their probability of being spurious, according to the word classifier trained on the same domain (using cross-validation).

Domain adaptation prediction This is the same sorting process as the previous strategy, except that the probability comes from domain adaptation, where the word classifier is trained on a different dataset. We consider domain transfer between the IMDB and Kindle datasets, and between the Toxic comment and Toxic tweet datasets.

In the document classification task, we sample majority and minority groups by selecting an equal number of sentences for each top word to ensure a fair comparison during feature selection. We check feature selection performance for each group by gradually removing spurious words in the order given by each strategy described above. As a final comparison, we also implement the method suggested in Sagawa et al. (2020b), which reduces the effect of spurious correlations in the training data. To do so, we sample the majority and minority groups from the training data and down-sample the majority group to the same size as the minority group. We then fit the document classifier on the new training data and evaluate its performance on the test set. Note that this method assumes knowledge of which features are spurious. Our approach can be seen as a way to first estimate which features are spurious and then adjust the classifier accordingly.
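The evaluation loop implied by this section could be sketched as follows: remove ranked words in increasing batches, refit the document classifier without them, and record performance per group. The `fit_and_auc` callback is hypothetical; it stands in for re-vectorizing the training sentences with the removed words excluded from the vocabulary and returning the score on a given test group.

```python
# Sketch of the feature-removal evaluation curve.
def removal_curve(ranked_words, train, groups, fit_and_auc, step=10):
    """ranked_words: words sorted by removal priority (most spurious first).
    groups: dict of name -> (sentences, labels) for majority/minority/all."""
    curves = {name: [] for name in groups}
    for n_removed in range(0, len(ranked_words) + 1, step):
        removed = set(ranked_words[:n_removed])
        for name, test in groups.items():
            curves[name].append(fit_and_auc(train, test, removed))
    return curves
```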

5 Results and Discussion

In this section, we show results for identifying spurious correlations and then analyze the effect of removing spurious correlations in different cases.

5.1 Word Classification

Table 3 shows the ROC AUC scores for word classifier performance. To place these numbers in context, recall that the words being classified were specifically selected because of their strong correlation with the class labels. For example, some spurious positive words appear in 20 positive documents and only a few negative documents. Despite the challenging nature of this task, Table 3 shows that the word classifier performs well at classifying spurious and genuine words, with AUC scores ranging from 0.657 to 0.823. Furthermore, the domain adaptation results indicate limited degradation in accuracy, and occasionally improvements in accuracy. The exception is the Toxic tweet dataset, where the score is 6% worse for domain adaptation. We suspect that this is caused by the low-quality text in the Toxic tweet dataset (this is the only dataset in which the text consists of tweets rather than well-formed sentences).


Figure 2: Example of spurious and genuine words predicted by the word classifier trained on words from Kindle reviews and applied to words from IMDB reviews.

                    IMDB      Kindle    Toxic     Toxic
                    reviews   reviews   comment   tweet
#spurious           90        119       40        72
#genuine            174       100       73        45
same domain         0.776     0.657     0.823     0.686
domain adaptation   0.741     0.699     0.726     0.744

Table 3: Word classifier performance (AUC score).

Fig. 2 shows an example of the domain adaptation results. We observe that culture, spielberg, russian, and cinema are correctly predicted to have high probabilities of being spurious, while refreshing, heartbreaking, wonderful, and fun are correctly predicted to have relatively lower probabilities of being spurious. We also observe that the predictions for unique and ages do not agree with the human labels. We show the top-5 spurious and genuine words predicted for each dataset in Table 4. Error analysis suggests that misclassifications are often due to small sample sizes – some genuine words simply do not appear often enough to find good matches. In future work, we will investigate how data size influences accuracy.

Examining the top coefficients in the word classifier, we find that features related to the match quality tend to be highly correlated with genuine words (e.g., the context similarity of close matches, the ATE calculated from the close matches). In contrast, features calculated from the embedding differences of close matches have relatively smaller coefficients.³ For example, in the word classifier trained on the IMDB dataset, the average match similarity score has a coefficient of 1.3, and the ATE feature has a coefficient of 0.8. These results suggest that the quality of close matches is viable evidence of genuine features, and that combining traditional ATE estimates with features derived from the matching procedure can provide stronger signals for distinguishing spurious and genuine correlations.

³ Detailed feature coefficients and analysis of feature importance are available in the code.

Figure 3: Feature selection for sentiment classification. (AUC vs. number of features removed, for the IMDB and Kindle datasets; panels show the Majority, Minority, and All groups; curves compare the oracle, feature_selection, domain_transfer, and random orderings.)


5.2 Feature Selection by Removing Spurious Correlations

We apply the different feature selection strategies from §4.4 and test performance on the majority set, the minority set, and the union of the majority and minority sets (denoted "All").

Fig. 3 shows feature selection performance on IMDB movie reviews and Kindle reviews. The starting point in each plot shows the performance when no features are removed. The horizontal line in between shows the performance of the method suggested in Sagawa et al. (2020b).

For the majority group, because the spurious correlations learned during model training agree with the sentence labels, the model performs well on this group, and removing spurious features hurts performance (i.e., about a 20% drop in AUC score on both datasets). On the contrary, the spurious correlations do not hold in the minority group. Thus, the model does not perform well at the starting point, when no spurious features have been removed, and performance increases as we gradually remove spurious features. After removing enough spurious features, the model performance stabilizes.

For IMDB reviews, removing spurious features improves performance by up to 20% AUC for the minority group, and feature selection based on predictions from the word classifier outperforms random ordering substantially.


IMDB Kindle Toxic comment Toxic tweetspurious genuine spurious genuine spurious genuine spurious genuineunintentional refreshing boy omg intelligence idiot edkrassen cuntrussian horrible issues definitely parasites stupid hi twatbenigni uninspired benefits draw sucking idiots pathetic retardanimated strength teaches returned mongering stupidity side pussypulls exhilarating girl halfway lifetime moron example assvisceral refreshing finds omg mongering stupid aint cuntmike rare mother highly lunatics idiot between twatunintentional horrible girl returned slaughter idiots wet retardstrange ingenious us down narrative idiotic side faggotintelligent sly humans enjoyed brothers stupidity rather pussy

Table 4: Top 5 spurious and genuine words predicted by the in-domain word classifier (first five rows) and cross-domain classifier (last five rows). Words with strike-through are incorrectly classified.

For Kindle, removing spurious features improves accuracy by up to 30% AUC for the minority group. Interestingly, domain adaptation actually appears to outperform the within-domain results, which is in line with the word classifier performance shown in Table 3 (i.e., domain adaptation outperforms within-domain AUC by 4.2% for the Kindle word classifier). The results on "All" show the trade-off between the performance on the majority group and the minority group. If removing spurious features hurts the majority group more than it helps the minority group, then performance on the "All" set decreases, and vice versa. In our experiments, the majority group has more samples than the minority group, so the final performance on the "All" set gradually decreases as spurious features are removed.

We also perform feature selection on the Toxic comment and Toxic tweet datasets, where we only focus on toxic features. As shown in Fig. 4, for the minority set, removing spurious features improves performance by up to 20% accuracy for Toxic comment and 30% accuracy for Toxic tweet. Compared with the sentiment datasets, the toxic datasets have fewer spurious words to remove, because we only care about spurious toxic features and not about non-toxic features, whereas in sentiment classification the spurious words come from both the positive and negative classes. In addition, the Toxic tweet dataset is noisy, with low-quality text. As a result, the feature selection methods perform differently on the toxic datasets compared with the sentiment datasets.

Additionally, the baseline method of using the sentiment lexicon provides limited benefit (performance scores for the different datasets are: IMDB, 0.776; Kindle, 0.636; Toxic comment, 0.592; Toxic tweet, 0.881), which is about 0.05 to 0.2 lower than the performance of the proposed feature selection methods.

Figure 4: Feature selection for toxicity classification. Test sets are selected with respect to toxic features, so there is only one class in each set; we therefore show accuracy on the y-axis, against the number of features removed on the x-axis. (Rows: Toxic comment, Toxic tweet; panels: Majority, Minority, All; curves: oracle, feature_selection, domain_transfer, random.)

The reasons are: (i) the sentiment lexicon misses some genuine words that are specific to each dataset (e.g., 'typo' is a negative word when used in Kindle book reviews but is missing from the sentiment lexicon); (ii) the same word might convey different sentiments depending on the context, e.g., 'joke' is positive in "He is humorous and always tells funny jokes", but negative in "This movie is a joke"; and (iii) in the toxicity classification task, there is no direct relation between toxicity and sentiment: a toxic word can be positive and a non-toxic word can be negative (e.g., 'unhappy'). Instead of relying on a sentiment lexicon, this paper aims to provide a method that automatically identifies genuine features specific to each dataset, and this method can generalize to tasks beyond sentiment classification.


6 Related Work

Wood-Doughty et al. (2018) and Keith et al. (2020) provide good overviews of the growing line of research combining causal inference and text classification. Two of the most closely related works, mentioned previously, are Sagawa et al. (2020b) and Paul (2017).

Sagawa et al. (2020b) investigate how spurious correlations arise in classifiers due to overparameterization. They compare overparameterized models with underparameterized models and show that overparameterization hurts worst-group error, where the spurious correlation does not hold. They run simulation experiments with core features encoding the actual label and spurious features encoding spurious attributes. Results show that the relative sizes of the majority and minority groups, as well as the informativeness of the spurious features, modulate the effect of overparameterization. While Sagawa et al. (2020b) assume it is known ahead of time which features are spurious, here we instead try to predict that in a supervised learning setting.

Paul (2017) proposes to perform feature selection for text classification via causal inference. He adapts the idea of propensity score matching to document classification and identifies causal features from matched samples. Results show meaningful word features and interpretable causal associations. Our primary contributions beyond this prior work are (i) to use features of the matching process to better identify spurious terms using supervised learning, and (ii) to analyze effects in terms of majority and minority groups. Indeed, we find that using the treatment effect estimates alone for the word classifier results in worse accuracy than combining them with the additional features.

Recently, Kaushik et al. (2020) show the prevalence of spurious correlations in machine learning by having humans make minimal edits to change the class label of a document. Doing so reveals large drops in accuracy due to the model's overdependence on spurious correlations.

Another line of work investigates how confounds can lead to spurious correlations in text classification (Elazar and Goldberg, 2018; Landeiro and Culotta, 2018; Pryzant et al., 2018; Garg et al., 2019). These methods typically require the confounding variables to be identified beforehand (though Kumar et al. (2019) is an exception).

A final line of work views spurious correlations as the result of an adversarial, data poisoning attack (Chen et al., 2017; Dai et al., 2019). The idea is that an attacker injects spurious correlations into the training data so as to control the model's predictions on new data. While most of this research focuses on the nature of the attack models, future work may be able to combine the approaches in this paper to defend against such attacks.

7 Conclusion

We have proposed a supervised classification method to distinguish spurious and genuine correlations in text classification. Using features derived from matched samples, we find that this word classifier achieves moderate to high accuracy even when tested on strongly correlated terms. Additionally, due to the generic nature of the features, we find that this classifier does not suffer much degradation in accuracy when trained on one dataset and applied to another. Finally, we use this word classifier to inform feature selection for document classification tasks. Results show that removing words in the order of their predicted probability of being spurious results in more robust classification with respect to worst-case accuracy.

Acknowledgments

This research was funded in part by the National Science Foundation under grants #IIS-1526674 and #IIS-1618244.

References

John Aldrich et al. 1995. Correlations genuine and spurious in Pearson and Yule. Statistical Science.

Radfar Bahar, Shivaram Karthik, and Aron Culotta. 2020. Characterizing variation in toxic language by social context. In ICWSM 2020.

Xinyun Chen, Chang Liu, Bo Li, Kimberly Lu, and Dawn Song. 2017. Targeted backdoor attacks on deep learning systems using data poisoning. arXiv preprint arXiv:1712.05526.

Jiazhu Dai, Chuanshuai Chen, and Yike Guo. 2019. A backdoor attack against LSTM-based text classification systems. CoRR, abs/1905.12457.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.

Yanai Elazar and Yoav Goldberg. 2018. Adversarial removal of demographic attributes from text data. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.

Sahaj Garg, Vincent Perot, Nicole Limtiaco, Ankur Taly, Ed H. Chi, and Alex Beutel. 2019. Counterfactual fairness in text classification through robustness. In Proceedings of the 2019 AAAI/ACM Conference.

Riccardo Guidotti, Anna Monreale, Salvatore Ruggieri, Franco Turini, Fosca Giannotti, and Dino Pedreschi. 2018. A survey of methods for explaining black box models. ACM Computing Surveys (CSUR).

Ruining He and Julian McAuley. 2016. Ups and downs. In Proceedings of the 25th International Conference on World Wide Web (WWW '16).

Guido W. Imbens. 2004. Nonparametric estimation of average treatment effects under exogeneity: A review. Review of Economics and Statistics.

Divyansh Kaushik, Eduard Hovy, and Zachary C. Lipton. 2020. Learning the difference that makes a difference with counterfactually-augmented data. In ICLR.

Katherine A. Keith, David Jensen, and Brendan O'Connor. 2020. Text and causal inference: A review of using text to remove confounding from causal estimates. arXiv preprint arXiv:2005.00649.

Gary King and Richard Nielsen. 2019. Why propensity scores should not be used for matching. Political Analysis, 27(4):435–454.

Gary King, Richard Nielsen, Carter Coberley, James E. Pope, and Aaron Wells. 2011. Comparative effectiveness of matching methods for causal inference.

Jon Kleinberg, Jens Ludwig, Sendhil Mullainathan, and Ashesh Rambachan. 2018. Algorithmic fairness. In AEA Papers and Proceedings, volume 108.

Sachin Kumar, Shuly Wintner, Noah A. Smith, and Yulia Tsvetkov. 2019. Topics to avoid: Demoting latent confounds in text classification. In EMNLP and the 9th International Joint Conference on Natural Language Processing.

Virgile Landeiro and Aron Culotta. 2018. Robust text classification under confounding shift. Journal of Artificial Intelligence Research, 63:391–419.

Bing Liu. 2012. Sentiment Analysis and Opinion Mining. Morgan & Claypool Publishers.

David Lopez-Paz, Krikamol Muandet, Bernhard Scholkopf, and Iliya Tolstikhin. 2015. Towards a learning theory of cause-effect inference. In International Conference on Machine Learning.

David Martens and Foster Provost. 2014. Explaining data-driven document classifications. MIS Quarterly.

Bo Pang and Lillian Lee. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics.

Michael J. Paul. 2017. Feature selection as causal inference: Experiments with text classification. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017).

Reid Pryzant, Kelly Shen, Dan Jurafsky, and Stefan Wagner. 2018. Deconfounded lexicon induction for interpretable social science. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1.

Joaquin Quionero-Candela, Masashi Sugiyama, Anton Schwaighofer, and Neil D. Lawrence. 2009. Dataset Shift in Machine Learning. The MIT Press.

Hema Raghavan, Omid Madani, and Rosie Jones. 2005. Interactive feature selection. In IJCAI, volume 5.

Shiori Sagawa, Pang Wei Koh, Tatsunori B. Hashimoto, and Percy Liang. 2020a. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. In ICLR.

Shiori Sagawa, Aditi Raghunathan, Pang Wei Koh, and Percy Liang. 2020b. An investigation of why overparameterization exacerbates spurious correlations. arXiv preprint arXiv:2005.04345.

Manali Sharma, Di Zhuang, and Mustafa Bilgic. 2015. Active learning with rationales for text classification. In Proceedings of the 2015 Conference of the North American Chapter of the ACL.

Elizabeth A. Stuart. 2010. Matching methods for causal inference: A review and a look forward. Statistical Science, 25(1):1.

Theresa Wilson, Janyce Wiebe, and Paul Hoffmann. 2005. Recognizing contextual polarity in phrase-level sentiment analysis. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing. ACL.

Christopher Winship and Stephen L. Morgan. 1999. The estimation of causal effects from observational data. Annual Review of Sociology, 25(1):659–706.

Zach Wood-Doughty, Ilya Shpitser, and Mark Dredze. 2018. Challenges of using text classifiers for causal inference. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, volume 2018, page 4586. NIH Public Access.

Ellery Wulczyn, Nithum Thain, and Lucas Dixon. 2017. Ex machina: Personal attacks seen at scale. In WWW '17. International World Wide Web Conferences Steering Committee.

Omar Zaidan, Jason Eisner, and Christine Piatko. 2007. Using "annotator rationales" to improve machine learning for text categorization. In Human Language Technologies 2007: The Conference of the North American Chapter of the ACL.