When did you become so smart, oh wise one?! Sarcasm Explanation in Multi-modal Multi-party Dialogues

Anonymous ACL submission

Abstract

Indirect speech such as sarcasm achieves a constellation of discourse goals in human communication. While the indirectness of figurative language warrants speakers to achieve certain pragmatic goals, it is challenging for AI agents to comprehend such idiosyncrasies of human communication. Though sarcasm identification has been a well-explored topic in dialogue analysis, for conversational systems to truly grasp a conversation's innate meaning and generate appropriate responses, simply detecting sarcasm is not enough; it is vital to explain its underlying sarcastic connotation to capture its true essence. In this work, we study the discourse structure of sarcastic conversations and propose a novel task – Sarcasm Explanation in Dialogue (SED). Set in a multimodal and code-mixed setting, the task aims to generate natural language explanations of satirical conversations. To this end, we curate WITS, a new dataset to support our task. We propose MAF (Modality Aware Fusion), a multimodal context-aware attention and global information fusion module, to capture multimodality and use it to benchmark WITS. The proposed attention module surpasses the traditional multimodal fusion baselines and reports the best performance on almost all metrics. Lastly, we carry out detailed analyses, both quantitative and qualitative.

1 Introduction

The use of figurative language serves many communicative purposes and is a regular feature of both oral and written communication (Roberts and Kreuz, 1994). Predominantly used to induce humour, criticism, or mockery (Colston, 1997), paradoxical language is also used in concurrence with hyperbole to show surprise (Colston and Keller, 1998) as well as to highlight the disparity between expectations and reality (Ivanko and Pexman, 2003). While the use and comprehension of sarcasm is a cognitively taxing process (Olkoniemi et al., 2016), psychological evidence advocates that it positively correlates with the receiver's theory of mind (ToM) (Wellman, 2014), i.e., the capability to interpret and understand another person's state of mind. Thus, for NLP systems to emulate such anthropomorphic intelligent behavior, they must not only be potent enough to identify sarcasm but also possess the ability to comprehend it in its entirety. To this end, moving forward from sarcasm identification, we propose the novel task of Sarcasm Explanation in Dialogue (SED).

Figure 1: Sarcasm Explanation in Dialogues (SED). Given a sarcastic dialogue, the aim is to generate a natural language explanation for the sarcasm in it. Blue text represents the English translation for the text.

For dialogue agents, understanding sarcasm is even more crucial, as there is a need to normalize its sarcastic undertone and deliver appropriate responses. Conversations interspersed with sarcastic statements often use contrastive language to convey the opposite of what is being said. In a real-world setting, understanding sarcasm goes beyond negating a dialogue's language and involves the acute comprehension of audio-visual cues. Additionally, due to the presence of essential temporal, contextual, and speaker-dependent information, sarcasm understanding in conversation manifests as a challenging problem. Consequently, many studies in the domain of dialogue systems have investigated sarcasm from textual, multimodal, and conversational standpoints (Ghosh et al., 2018; Castro et al., 2019; Oraby et al., 2017; Bedi et al., 2021). However, barring some exceptions (Mishra et al., 2019; Dubey et al., 2019; Chakrabarty et al., 2020), research on figurative language has focused predominantly on its identification rather than its comprehension and normalization. This paper addresses this gap by attempting to generate natural language explanations of satirical dialogues.

To illustrate the proposed problem statement, we show an example in Figure 1. It contains a dyadic conversation of four utterances ⟨u1, u2, u3, u4⟩, where the last utterance (u4) is a sarcastic remark. Note that in this example, although the opposite of what is being said is "I don't have to think about it," that is not what the speaker means; thus, it reinforces our hypothesis that sarcasm explanation goes beyond simply negating the dialogue's language. The discourse is also accompanied by ancillary audio-visual markers of satire such as an ironic intonation of the pitch, a blank face, or a roll of the eyes. Thus, conglomerating the conversation history, multimodal signals, and speaker information, SED aims to generate a coherent and cohesive natural language explanation associated with these sarcastic dialogues.

For the task at hand, we extend MASAC (Bedi et al., 2021) – a sarcasm detection dataset for code-mixed conversations – by augmenting it with natural language explanations for each sarcastic utterance. We name the dataset WITS¹. The dataset contains a compilation of sarcastic dialogues from a popular Indian TV show. Along with the textual transcripts of the conversations, the dataset also contains multimodal signals of audio and video.

We experiment with unimodal as well as multimodal models to benchmark WITS. Text, being the driving force of the explanations, is given primary importance, and thus we compare a number of established text-to-text systems on WITS. To incorporate multimodal information, we posit a unique fusion scheme of Multimodal Context-Aware Attention (MCA2). Inspired by Yang et al. (2019), MCA2 facilitates deep semantic interaction between the multimodal signals and textual representations by conditioning the key and value vectors with audio-visual information and then performing dot-product attention with these modified vectors. The resulting audio- and video-informed textual representations are then combined using the Global Information Fusion (GIF) mechanism. The gating mechanism of GIF allows for the selective inclusion of information relevant to the satirical language and also prohibits any multimodal noise from seeping into the model. We further propose the MAF (Modality Aware Fusion) module, where the aforementioned mechanisms are introduced into Generative Pre-trained Language Models (GPLMs) as adapter modules. Our fusion strategy outperforms the text-based baselines and the traditional multimodal fusion schemes in terms of multiple text-generation metrics. Finally, we conduct a comprehensive quantitative and qualitative analysis of the generated explanations.

¹ WITS: "Why Is This Sarcastic"

In a nutshell, our contributions are four-fold:
• We propose Sarcasm Explanation in Dialogue (SED), a novel task aiming to generate a natural language explanation for a given sarcastic dialogue to elucidate the intended irony.
• We extend an existing sarcastic dialogue dataset to curate WITS, a novel dataset containing human-annotated gold-standard explanations.
• We benchmark our dataset using the MAF-TAVB and MAF-TAVM variants of BART and mBART, respectively, which incorporate the audio-visual cues using a unique context-aware attention mechanism.
• We carry out extensive quantitative and qualitative analyses along with human evaluation to assess the quality of the generated explanations.

Reproducibility: The source codes and the dataset, along with the execution instructions, are uploaded with the manuscript.

2 Related Work

Sarcasm and Text: Joshi et al. (2017) presented a well-compiled survey on computational sarcasm where the authors expanded on the relevant datasets, trends, and issues for automatic sarcasm identification. Early work in sarcasm detection dealt with standalone text inputs like tweets and reviews (Kreuz and Caucci, 2007; Tsur et al., 2010; Joshi et al., 2015; Peled and Reichart, 2017). These initial works mostly focused on the use of linguistic and lexical features to spot the markers of sarcasm (Kreuz and Caucci, 2007; Tsur et al., 2010). More recently, attention-based architectures have been proposed to harness the inter- and intra-sentence relationships in texts for efficient sarcasm identification (Tay et al., 2018; Xiong et al., 2019; Srivastava et al., 2020). Analysis of figurative language has also been extensively explored in the conversational AI setting. Ghosh et al. (2017) utilised attention-based RNNs to identify sarcasm in the presence of context. Two separate LSTMs with attention were trained for the two inputs (sentence and context), and their hidden representations were combined during the prediction.

The study of sarcasm identification has also spanned beyond the English language. Bharti et al. (2017) collected a Hindi corpus of 2000 sarcastic tweets and employed rule-based approaches to detect sarcasm. Swami et al. (2018) curated a dataset of 5000 satirical Hindi-English code-mixed tweets and used n-gram feature vectors with various ML models for sarcasm detection. Other notable studies include the Arabic (Abu Farha and Magdy, 2020), Spanish (Ortega-Bueno et al., 2019), and Italian (Cignarella et al., 2018) languages.

Sarcasm and Multimodality: In the conversational setting, MUStARD, a multimodal, multi-speaker dataset compiled by Castro et al. (2019), is considered the benchmark for multimodal sarcasm identification. Chauhan et al. (2020) leveraged the intrinsic interdependency between emotions and sarcasm and devised a multi-task framework for multimodal sarcasm detection. Currently, Hasan et al. (2021) perform the best on this dataset with their humour-knowledge-enriched transformer model. Recently, Bedi et al. (2021) proposed a code-mixed multi-party dialogue dataset, MASAC, for sarcasm and humor detection. In the bimodal setting, sarcasm identification with tweets containing images has also been well explored (Cai et al., 2019; Xu et al., 2020; Pan et al., 2020).

Beyond Sarcasm Identification: While studies in computational sarcasm have predominantly focused on sarcasm identification, some forays have been made into other domains of figurative language analysis. Dubey et al. (2019) initiated the work of converting sarcastic utterances into their non-sarcastic interpretations using deep learning. In another direction, Mishra et al. (2019) devised a modular unsupervised technique for sarcasm generation by introducing context incongruity through fact removal and incongruous phrase insertion. Following this, Chakrabarty et al. (2020) proposed a retrieve-and-edit-based unsupervised framework for sarcasm generation. The proposed model leverages valence reversal and semantic incongruity to generate sarcastic sentences from their non-sarcastic counterparts.

A lot of effort has gone into detecting sarcasm, but very little, if any, effort has gone into explaining the irony. This paper attempts to fill the gap by proposing a new problem definition and a supporting dataset.

# Dialogues: 2240 | # Utterances: 9080 | # English utterances: 101 | # Hindi utterances: 1453
# Code-mixed utterances: 7526 | Avg. utterances/dialogue: 4.05 | Avg. speakers/dialogue: 2.35 | Avg. words/utterance: 14.39
Avg. words/dialogue: 58.33 | Vocabulary size: 10380 | English vocabulary size: 2477 | Hindi vocabulary size: 7903

Table 1: Statistics of dialogues present in WITS.

3 Dataset

Situational comedies, or 'sitcoms', vividly depict human behaviour and mannerisms in everyday real-life settings. Consequently, the NLP research community has successfully used such data for sarcasm identification (Castro et al., 2019; Bedi et al., 2021). However, as there is no current dataset tailored for the proposed task, we curate a new dataset named WITS, where we augment the already existing MASAC dataset (Bedi et al., 2021) with explanations for our task. MASAC is a multimodal, multi-party, code-mixed dialogue dataset compiled from the popular Indian TV show 'Sarabhai v/s Sarabhai'². We manually analyzed the data and cleaned it for our task. While the original dataset contained 45 episodes of the TV series, we add 10 more episodes along with their transcriptions and audio-visual boundaries. Subsequently, we select the sarcastic utterances from this augmented dataset and manually define the utterances to be included in the dialogue context for each of them. Finally, we are left with 2240 sarcastic utterances, with the number of contextual utterances ranging from 2 to 27. Each of these instances is manually annotated with a corresponding natural language explanation interpreting its sarcasm.

² https://www.imdb.com/title/tt1518542/

Figure 2: Distribution of attributes in WITS: (a) utterance length distribution, (b) speaker distribution, (c) source-target pair distribution, (d) sarcasm source distribution, (e) sarcasm target distribution, and (f) explanation length distribution. The number of utterances in a dialogue lies between 2 and 27, and the maximum number of speakers in a dialogue is 6. The speaker 'Maya' is the most common sarcasm source, while the speaker 'Monisha' is the most prominent sarcasm target.

Each explanation contains four primary attributes – the source and target of sarcasm, the action word for sarcasm, and an optional description of the satire, as illustrated in Figure 1. In the explanation "Indu implies that Maya is not looking good.", 'Indu' is the sarcasm source, 'Maya' is the target, 'implies' is the action word, while 'is not looking good' forms the description part of the explanation. We collect explanations in code-mixed format to keep consistency with the dialogue language. We split the data into train/val/test sets in an 80:10:10 ratio for our experiments, which results in 1792 dialogues in the train set and 224 dialogues each in the validation and test sets. More information about the annotation process is presented in Appendix A.1. Table 1 and Figure 2 show detailed statistics of WITS.

4 Proposed Methodology

In this section, we present our model and its nuances. The primary goal is to smoothly integrate multimodal knowledge into the GPLM architecture. To this end, we introduce MAF, an adapter-based module that comprises the Multimodal Context-Aware Attention (MCA2) and Global Information Fusion (GIF) mechanisms. Given the textual sarcastic instance along with its audio-video cues, the former aptly introduces multimodal information into the textual representations, while the latter conglomerates the audio-visual-information-infused textual representations. This adapter module can be readily incorporated at multiple layers of BART/mBART to facilitate various levels of multimodal interaction. Figure 3 illustrates our model architecture.

4.1 Multimodal Context-Aware Attention

The traditional dot-product-based cross-modal attention scheme leads to the direct interaction of textual representations with other modalities. Here, the text representations act as the query against the multimodal representations, which serve as the key and value. As each modality comes from a different embedding subspace, a direct fusion of multimodal information might not retain maximum contextual information and can also leak substantial noise into the final representations. Thus, based on the findings of Yang et al. (2019), we propose multimodal fusion through context-aware attention. We first generate multimodal-information-conditioned key and value vectors and then perform the traditional scaled dot-product attention. We elaborate on the process below.

Given the intermediate representation H generated by the GPLM at a specific layer, we calculate the query, key, and value vectors Q, K, and V ∈ R^{n×d}, respectively, as given in Equation 1, where W_Q, W_K, and W_V ∈ R^{d×d} are learnable parameters. Here, n denotes the maximum sequence length of the text, and d denotes the dimensionality of the GPLM-generated vector.



Figure 3: Model architecture for MAF-TAVB. The proposed Modality Aware Fusion (MAF) block captures audio-visual cues using Multimodal Context-Aware Attention (MCA2), which are further fused with the textual representations using the Global Information Fusion (GIF) block.

Q = H W_Q,  K = H W_K,  V = H W_V    (1)

Let C ∈ R^{n_c×d_c} denote the vector obtained from the audio or visual representation. We generate the multimodal-information-informed key and value vectors K̄ and V̄, respectively, as given by Yang et al. (2019). To decide how much information to integrate from the multimodal source and how much information to retain from the textual modality, we learn λ ∈ R^{n×1} (Equation 3). Note that U_k and U_v ∈ R^{d_c×d} are learnable matrices.

K̄ = (1 − λ_k) K + λ_k (C U_k)
V̄ = (1 − λ_v) V + λ_v (C U_v)    (2)

Instead of treating λ_k and λ_v as hyperparameters, we let the model decide their values using a gating mechanism, as computed in Equation 3. The matrices W_{k1}, W_{k2}, W_{v1}, and W_{v2} ∈ R^{d×1} are trained along with the model.

λ_k = σ(K W_{k1} + (C U_k) W_{k2})
λ_v = σ(V W_{v1} + (C U_v) W_{v2})    (3)

Finally, the multimodal-information-infused vectors K̄ and V̄ are used to compute the traditional scaled dot-product attention. For our case, we have two modalities – audio and video. Using the context-aware attention mechanism, we obtain the acoustic-information-infused and visual-information-infused vectors H_a and H_v, respectively (cf. Equations 4 and 5).

H_a = Softmax(Q K̄_a^T / √d_k) V̄_a    (4)
H_v = Softmax(Q K̄_v^T / √d_k) V̄_v    (5)
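To make the mechanism concrete, the following is a minimal PyTorch sketch of one context-aware attention block over a single non-textual modality. The module and variable names are illustrative assumptions rather than the authors' released implementation, and the context sequence is assumed to be aligned (or pooled) to the text length n; the single linear layer over [K ; C U_k] is equivalent to the sum K W_{k1} + (C U_k) W_{k2} of Equation 3.

import math
import torch
import torch.nn as nn

class ContextAwareAttention(nn.Module):
    def __init__(self, d_model: int, d_context: int):
        super().__init__()
        # Eq. 1: Q, K, V projections of the textual hidden states H
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_k = nn.Linear(d_model, d_model, bias=False)
        self.w_v = nn.Linear(d_model, d_model, bias=False)
        # U_k, U_v map the audio/visual context C into the text space
        self.u_k = nn.Linear(d_context, d_model, bias=False)
        self.u_v = nn.Linear(d_context, d_model, bias=False)
        # Eq. 3: gating scalars lambda_k, lambda_v (one scalar per position)
        self.lam_k = nn.Linear(2 * d_model, 1, bias=False)
        self.lam_v = nn.Linear(2 * d_model, 1, bias=False)

    def forward(self, h, context):
        # h: (batch, n, d_model) text states; context: (batch, n, d_context)
        q, k, v = self.w_q(h), self.w_k(h), self.w_v(h)
        ck, cv = self.u_k(context), self.u_v(context)
        lam_k = torch.sigmoid(self.lam_k(torch.cat([k, ck], dim=-1)))   # Eq. 3
        lam_v = torch.sigmoid(self.lam_v(torch.cat([v, cv], dim=-1)))
        k_bar = (1 - lam_k) * k + lam_k * ck                            # Eq. 2
        v_bar = (1 - lam_v) * v + lam_v * cv
        scores = q @ k_bar.transpose(-2, -1) / math.sqrt(h.size(-1))    # Eq. 4 / 5
        return torch.softmax(scores, dim=-1) @ v_bar

Running this block once with the acoustic features and once with the visual features would yield H_a and H_v, respectively.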

4.2 Global Information Fusion

In order to combine the information from both the acoustic and visual modalities, we design the GIF block. We propose two gates, namely the acoustic gate (g_a) and the visual gate (g_v), to control the amount of information transmitted by each modality. They are as follows:

g_a = [H ⊕ H_a] W_a + b_a    (6)
g_v = [H ⊕ H_v] W_v + b_v    (7)

Here, W_a, W_v ∈ R^{2d×d} and b_a, b_v ∈ R^{d×1} are trainable parameters, and ⊕ denotes concatenation. The final multimodal-information-fused representation H̄ is given by Equation 8.

H̄ = H + g_a ⊙ H_a + g_v ⊙ H_v    (8)

This vector H̄ is inserted back into the GPLM for further processing.
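A matching sketch of the GIF block follows; the class and variable names are again illustrative assumptions, and the gates are kept linear exactly as written in Equations 6 and 7.

import torch
import torch.nn as nn

class GlobalInformationFusion(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.gate_a = nn.Linear(2 * d_model, d_model)   # W_a, b_a (Eq. 6)
        self.gate_v = nn.Linear(2 * d_model, d_model)   # W_v, b_v (Eq. 7)

    def forward(self, h, h_a, h_v):
        # h: original text states; h_a / h_v: audio- / video-infused states
        g_a = self.gate_a(torch.cat([h, h_a], dim=-1))
        g_v = self.gate_v(torch.cat([h, h_v], dim=-1))
        return h + g_a * h_a + g_v * h_v                 # Eq. 8, element-wise gating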

5 Experiments, Results and Analysis

We describe our feature extraction strategy and the comparative systems, followed by the results and their analysis. For a quantitative analysis of the generated explanations, we use the standard metrics for generative tasks – ROUGE-1/2/L (Lin, 2004), BLEU-1/2/3/4 (Papineni et al., 2002), and METEOR (Denkowski and Lavie, 2014). To capture semantic similarity, we use the multilingual version of BERTScore (Zhang et al., 2019).
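The paper does not name a specific toolkit for these metrics; as a hedged illustration, the scores could be reproduced with the HuggingFace `evaluate` wrappers, using a multilingual checkpoint for BERTScore since the explanations are code-mixed. The example strings are placeholders.

import evaluate

preds = ["Sahil monisha ki cooking ka mazak udata hai"]
refs = ["Sahil monisha ki cooking ka mazak udata hai"]

rouge = evaluate.load("rouge")
meteor = evaluate.load("meteor")
bleu = evaluate.load("bleu")
bertscore = evaluate.load("bertscore")

print(rouge.compute(predictions=preds, references=refs))        # ROUGE-1/2/L
print(meteor.compute(predictions=preds, references=refs))       # METEOR
for n in range(1, 5):                                            # BLEU-1..4
    print(bleu.compute(predictions=preds, references=[refs], max_order=n))
print(bertscore.compute(predictions=preds, references=refs,
                        model_type="bert-base-multilingual-cased"))  # multilingual BERTScore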

5.1 Feature Extraction

Audio: Acoustic representations for each instance are obtained using the openSMILE Python library³. We use a window size of 25 ms and a window shift of 10 ms to segment each instance into frames. Further, we employ the eGeMAPS model (Eyben et al., 2016) and extract 154-dimensional functional features, such as Mel Frequency Cepstral Coefficients (MFCCs) and loudness, for each frame of the instance. These features are then fed to a Transformer encoder for further processing.
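For illustration, eGeMAPS functionals can be extracted with the openSMILE Python package roughly as follows; the file name is a placeholder, and the exact feature dimensionality depends on the eGeMAPS version bundled with the library.

import opensmile

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,        # eGeMAPS acoustic parameter set
    feature_level=opensmile.FeatureLevel.Functionals,   # functional (segment-level) features
)
# returns a pandas DataFrame with one row of functional features for the file
features = smile.process_file("utterance_0001.wav")
print(features.shape)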

Video: We use a pre-trained action recognition model, ResNeXt-101 (Hara et al., 2018), trained on the Kinetics dataset (Kay et al., 2017), which can recognise 101 different actions. We use a frame rate of 1.5, a resolution of 720 pixels, and a window length of 16 to extract the 2048-dimensional visual features. Similar to the audio feature extraction, we employ a Transformer encoder to capture the sequential dialogue context in the representations.
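A hedged sketch of clip-level feature extraction is shown below. The paper uses the 3D ResNeXt-101 of Hara et al. (2018); torchvision's r3d_18 stands in here purely as an illustrative pretrained 3D CNN, so the 2048-dimensional output reported above is not reproduced exactly.

import torch
from torchvision.models.video import r3d_18

model = r3d_18(weights="DEFAULT")   # pretrained 3D CNN used as a stand-in feature extractor
model.fc = torch.nn.Identity()      # keep the pooled clip representation, drop the classifier
model.eval()

# clip: (batch, channels, frames, height, width); 16-frame windows as described above
clip = torch.randn(1, 3, 16, 112, 112)
with torch.no_grad():
    visual_features = model(clip)   # (1, 512) for r3d_18
print(visual_features.shape)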

5.2 Comparative Systems

To get the best textual representations for the dialogues, we experiment with various sequence-to-sequence (seq2seq) architectures. RNN: We use the OpenNMT⁴ implementation of the RNN seq2seq architecture. Transformer (Vaswani et al., 2017): The standard Transformer encoder and decoder are used to generate explanations in this case. Pointer Generator Network (See et al., 2017): A seq2seq architecture that allows the generation of new words as well as the copying of words from the input text for generating accurate summaries. BART (Lewis et al., 2020): A denoising autoencoder model with a standard machine translation architecture, comprising a bidirectional encoder and an auto-regressive left-to-right decoder. We use its base version. mBART (Liu et al., 2020): Following the same architecture and objective as BART, mBART is trained on large-scale monolingual corpora in different languages⁵.

³ https://audeering.github.io/opensmile-python/
⁴ https://github.com/OpenNMT/OpenNMT-py
⁵ https://huggingface.co/facebook/mbart-large-50-many-to-many-mmt
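As a hedged sketch of the text-only BART baseline, a dialogue (speaker-prefixed utterances concatenated into a single string) can be mapped to its explanation with a standard seq2seq setup; the preprocessing and generation settings below are assumptions, not the authors' exact configuration.

from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

# an example instance from WITS (Table 4c), flattened into one input string
dialogue = ("MONISHA: Ladki ka naam Ajanta kyon rakha? "
            "INDRAVARDHAN: Kyunki uski maa ajanta caves dekh rahi thi jab vo paida hui.")
explanation = "Indravadan Ajanta ke naam ka mazak udata hai"

inputs = tokenizer(dialogue, return_tensors="pt", truncation=True, max_length=512)
labels = tokenizer(explanation, return_tensors="pt", truncation=True).input_ids

loss = model(**inputs, labels=labels).loss                        # training objective (cross-entropy)
generated = model.generate(**inputs, num_beams=4, max_length=64)  # inference
print(tokenizer.decode(generated[0], skip_special_tokens=True))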

Mode | Model | R1 | R2 | RL | B1 | B2 | B3 | B4 | M | BS
Textual | RNN | 29.22 | 7.85 | 27.59 | 22.06 | 8.22 | 4.76 | 2.88 | 18.45 | 73.24
Textual | Transformer | 29.17 | 6.35 | 27.97 | 17.79 | 5.63 | 2.61 | 0.88 | 15.65 | 72.21
Textual | PGN | 23.37 | 4.83 | 17.46 | 17.32 | 6.68 | 1.58 | 0.52 | 23.54 | 71.90
Textual | mBART | 33.66 | 11.02 | 31.50 | 22.92 | 10.56 | 6.07 | 3.39 | 21.03 | 73.83
Textual | BART | 36.88 | 11.91 | 33.49 | 27.44 | 12.23 | 5.96 | 2.89 | 26.65 | 76.03
Multimodal | MAF-TAM | 39.02 | 15.90 | 36.83 | 31.26 | 16.94 | 11.54 | 7.72 | 29.05 | 77.06
Multimodal | MAF-TVM | 39.47 | 16.78 | 37.38 | 32.44 | 17.91 | 12.02 | 7.36 | 29.74 | 77.47
Multimodal | MAF-TAVM | 38.52 | 14.13 | 36.60 | 30.50 | 15.20 | 9.78 | 5.74 | 27.42 | 76.70
Multimodal | MAF-TAB | 38.21 | 14.53 | 35.97 | 30.58 | 15.36 | 9.63 | 5.96 | 27.71 | 77.08
Multimodal | MAF-TVB | 37.48 | 15.38 | 35.64 | 30.28 | 16.89 | 10.33 | 6.55 | 28.24 | 76.95
Multimodal | MAF-TAVB | 39.69 | 17.10 | 37.37 | 33.20 | 18.69 | 12.37 | 8.58 | 30.40 | 77.67

Table 2: Experimental results. (Abbreviations: R1/2/L: ROUGE-1/2/L; B1/2/3/4: BLEU-1/2/3/4; M: METEOR; BS: BERTScore; PGN: Pointer Generator Network.)

5.3 Results

Text Based: As evident from Table 2, BART performs the best across all the metrics for the textual modality, showing an improvement of almost 2-3% on the METEOR and ROUGE scores when compared with the next best baseline. PGN, RNN, and Transformer demonstrate admissible performance considering that they have been trained from scratch. However, it is surprising to see mBART not performing better than BART, as it is trained on multilingual data. We elaborate more on this in Appendix A.2.

Multimodality: Psychological and linguistic literature suggests that there exist distinct paralinguistic cues that aid in comprehending sarcasm and humour (Attardo et al., 2003; Tabacaru and Lemmens, 2014). Thus, we gradually merge the auditory and visual modalities using the MAF module and obtain MAF-TAVB and MAF-TAVM for BART and mBART, respectively. We observe that the inclusion of acoustic signals leads to noticeable gains of 2-3% across the ROUGE, BLEU, and METEOR scores. The rise in BERTScore also suggests that the multimodal variant generates slightly more coherent explanations. As ironic intonations such as mimicry, monotone, flat contour, extremes of pitch, long pauses, and exaggerated pitch (Rockwell, 2007) form a significant component in sarcasm understanding, we conjecture that our model, to some extent, is able to spot such markers and identify the intended sarcasm behind them.

We notice that visual information also contributes to our cause. Significant performance gains are observed for MAF-TVB and MAF-TVM, as all the metrics show a rise of about 3-4%. While MAF-TAB gives marginally better performance than MAF-TVB in terms of R1, RL, and B1, we see that MAF-TVB performs better in terms of the rest of the metrics. Often, sarcasm is depicted through gestural cues such as raised eyebrows, a straight face, or an eye roll (Attardo et al., 2003).

Model | R1 | R2 | RL | B1 | B2 | B3 | B4 | M | BS
MAF-TAVM | 39.69 | 17.10 | 37.37 | 33.20 | 18.69 | 12.37 | 8.58 | 30.40 | 77.67
- MCA2 + CONCAT1 | 37.56 | 14.85 | 34.90 | 30.16 | 15.76 | 10.12 | 6.82 | 28.59 | 76.59
- MAF + CONCAT2 | 17.22 | 1.70 | 14.12 | 13.11 | 2.11 | 0.00 | 0.00 | 9.34 | 66.64
- MCA2 + DPA | 36.43 | 13.04 | 33.75 | 28.73 | 14.02 | 8.00 | 4.89 | 25.60 | 75.58
- GIF | 36.37 | 13.85 | 34.92 | 28.49 | 14.34 | 9.00 | 6.16 | 25.75 | 76.86
MAF-TAVB | 39.69 | 17.10 | 37.37 | 33.20 | 18.69 | 12.37 | 8.58 | 30.40 | 77.67
- MCA2 + CONCAT1 | 36.88 | 13.21 | 34.39 | 29.63 | 14.56 | 8.43 | 4.84 | 26.15 | 76.08
- MAF + CONCAT2 | 21.11 | 2.31 | 19.68 | 12.44 | 2.44 | 0.73 | 0.31 | 9.51 | 69.54
- MCA2 + DPA | 38.84 | 14.76 | 36.96 | 30.23 | 15.95 | 9.88 | 5.83 | 28.04 | 77.20
- GIF | 39.45 | 14.85 | 37.18 | 31.85 | 15.97 | 9.62 | 5.47 | 28.87 | 77.54

Table 3: Ablation results on MAF-TAVM and MAF-TAVB (DPA: Dot Product Attention).

Moreover, when satire is conveyed by mocking someone's looks or physical appearance, it becomes essential to incorporate information expressed through the visual medium. Thus, we can say that, to some extent, our model is able to capture these nuances of non-verbal cues and use them well to normalize the sarcasm in a dialogue. In summary, we conjecture that, whether independent or together, audio-visual signals bring essential information to the table to understand sarcasm.

5.4 Ablation Study

Table 3 reports the ablation study. CONCAT1 represents the case where we perform bimodal concatenation ((T ⊕ A), (T ⊕ V)) instead of the MCA2 mechanism, followed by the GIF module, whereas CONCAT2 represents the simple trimodal concatenation (T ⊕ A ⊕ V) of acoustic, visual, and textual representations, followed by a linear layer for dimensionality reduction. In comparison with MCA2, CONCAT2 reports a below-average performance, with a significant drop of more than 40% for MAF-TAVB and MAF-TAVM. This highlights the need to have deftly crafted multimodal fusion mechanisms. CONCAT1, on the other hand, gives good performance and is competitive with DPA and MAF-TAVB. We speculate that treating the audio and video modalities separately and then merging them to retain the complementary and differential features leads to this performance gain. Our proposed MAF outperforms DPA with gains of 1-3%. This highlights that our unique multimodal fusion strategy is aptly able to capture the contextual information provided by the audio and video signals. Replacing the GIF module with simple addition, we observe a noticeable decline in performance on almost all metrics by about 2-3%. This attests to the benefit of the GIF module over simple addition. We also experiment with fusing multimodal information using MAF before different layers of the BART encoder. The best performance is obtained when the fusion is done before the sixth layer of the architecture (cf. Appendix A.3).

5.5 Result Analysis

We evaluate the generated explanations based on their ability to correctly identify the source and target of a sarcastic comment in a conversation. We report such results for mBART, BART, MAF-TAB, MAF-TVB, and MAF-TAVB in Table 5. BART performs better than mBART for source as well as target identification. We observe that the inclusion of audio (↑ 10%) and video (↑ 8%) information drastically improves the source identification capability of the model. The combination of both these non-verbal cues leads to an improvement of more than 13% for the same. As a result, we infer that multimodal fusion enables the model to incorporate audio-visual peculiarities unique to each speaker, resulting in improved source identification. The performance for target identification, however, drops slightly on the inclusion of multimodality. We encourage future work in this direction.
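The paper does not spell out how these accuracies are scored; one simple, hedged reading is to count an explanation as correct when the gold source (or target) name appears in it, as sketched below.

def identification_accuracy(explanations, gold_names):
    """Percentage of generated explanations that mention the gold speaker name (an assumption)."""
    hits = sum(1 for expl, name in zip(explanations, gold_names)
               if name.lower() in expl.lower())
    return 100.0 * hits / len(explanations)

preds = ["Sahil monisha ki cooking ka mazak udata hai"]
print(identification_accuracy(preds, ["Sahil"]))     # source identification
print(identification_accuracy(preds, ["Monisha"]))   # target identification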

Qualitative Analysis. We analyze the best performing model, MAF-TAVB, and its corresponding unimodal model, BART, and present some examples in Table 4. In Table 4a, we show one instance where the explanations generated by BART as well as MAF-TAVB are neither coherent nor comply with the dialogue context, and leave much scope for improvement. On the other hand, Table 4b illustrates an instance where the explanation generated by MAF-TAVB adheres to the topic of the dialogue, unlike the one generated by its unimodal counterpart. Table 4c depicts a dialogue where MAF-TAVB's explanation captures the satire better than BART's. We further dissect the models based on different modalities in Appendix A.4.

Human Evaluation. Since the proposed SED task is a generative task, it is imperative to manually inspect the generated results. Consequently, we perform a human evaluation on a sample of 30 instances from our test set with the help of 25 evaluators⁶. We ask the evaluators to judge the generated explanation, given the transcripts of the sarcastic dialogues along with a small video clip with audio. Each evaluator has to watch the video clips and then rate the generated explanations on a scale of 0 to 5 based on the following factors⁷:
• Coherence: Measures how well the explanations are organized and structured.

⁶ The evaluators are experts in linguistics and NLP, and their ages range from 20 to 28 years.
⁷ 0 denotes poor performance, while 5 signifies perfect performance.


(a) Incoherent explanation
ROSESH: What nonsense? Mujhe kon hunter marega? (What nonsense? Who will beat me with a hunter?)
INDRAVARDHAN: Me, me marunga. Kyunki 51 saal baad Maya to mar chuki hogi. Me kisi tarah zinda reh lunga or kahunga ki 'le mar jhadhu, mar, mar jhadhu, mar mar' (I will beat you. Because after 51 years, Maya would be dead. I'll somehow survive and say 'sweep here, sweep here, sweep')
Gold: Maya Monisha ko tana marti hai safai ka dhyan na rakhne ke liye (Maya taunts Monisha for not keeping a check on cleanliness)
BART: Maya Monisha ko tumaari burayi nahi karta. (Maya doesn't blame you for Monisha)
MAF-TAVB: Maya implies ki Monisha bohot ghar mein bahar nahi kar sakati. (Maya implies that Monisha very in home cannot do outside.)

(b) Explanation related to dialogue
SAHIL: Ab tumne ghar ki itni saaf safai ki hai and secondly us Karan Verma ke liye pasta, lasagne, caramel custard banaya. (Now you have cleaned the house so much and secondly made pasta, lasagne, caramel custard for that Karan Verma.)
MONISHA: Walnut brownie bhi. (And walnut brownie too.)
SAHIL: Walnut brownie, matlab wo khane wali? (You mean the edible walnut brownie?)
Gold: Sahil monisha ki cooking ka mazak udata hai (Sahil makes fun of Monisha's cooking.)
BART: Monisha sahil ko walnut brownie ki matlab wo khane wali. (Walnut Brownie to Monisha Sahil means she eats)
MAF-TAVB: Sahil monisha ki cooking ka mazak udata hai (Sahil makes fun of Monisha's cooking.)

(c) Explanation related to sarcasm
MONISHA: Ladki ka naam Ajanta kyon rakha? (Why did they name the girl Ajanta?)
INDRAVARDHAN: Kyunki uski maa ajanta caves dekh rahi thi jab vo paida hui haha. (Because her mother must have been watching the Ajanta caves when she was born, haha.)
Gold: Indravadan Ajanta ke naam ka mazak udata hai (Indravardhan makes fun of Ajanta's name)
BART: Indravardhan Monisha ko taunt maarta hai ki uski maa ajanta caves dekh rahi thi jab vo paida hui (Indravardhan taunts Monisha as her mother was watching Ajanta Caves when she was born.)
MAF-TAVB: Indravadan ajanta ke naam ka mazak udata hai (Indravardhan makes fun of Ajanta's name)

Table 4: Actual and generated explanations for sample dialogues from the test set. The last utterance is the sarcastic utterance in each dialogue. English translations are given in parentheses.

Metric | mBART | BART | MAF-TAB | MAF-TVB | MAF-TAVB
Source | 75.00 | 77.23 | 87.94 | 85.71 | 91.07
Target | 45.53 | 52.67 | 43.75 | 43.75 | 46.42

Table 5: Source-target accuracy of the generated explanations for BART-based systems.

• Related to dialogue: Measures whether the generated explanation adheres to the topic of the dialogue.
• Related to sarcasm: Measures whether the explanation talks about something related to the sarcasm present in the dialogue.

Table 6 presents the human evaluation analysis with average scores for each of the aforementioned categories. Our scrutiny suggests that MAF-TAVB generates more syntactically coherent explanations when compared with its textual and bimodal counterparts. Also, MAF-TAVB and MAF-TVB generate explanations that are more focused on the conversation's topic, as we see an increase of 0.55 points in the 'related to dialogue' category. Thus, we re-establish that these models are able to incorporate information that is explicitly absent from the dialogue, such as scene descriptions, facial features, and the looks of the characters. Furthermore, we establish that MAF-TAVB is better able to grasp sarcasm and its normalization, as it shows about 0.6 points of improvement over BART in the 'related to sarcasm' category. Lastly, as none of the metrics in Table 6 exhibit high scores (3.5+), we feel there is still much scope for improvement in terms of generation performance and human evaluation. The research community can further explore the task with our proposed dataset, WITS.

Model | Coherence | Related to dialogue | Related to sarcasm
mBART | 2.57 | 2.66 | 2.15
BART | 2.73 | 2.56 | 2.18
MAF-TAB | 2.95 | 2.91 | 2.51
MAF-TVB | 3.01 | 3.11 | 2.66
MAF-TAVB | 3.03 | 3.11 | 2.77

Table 6: Human evaluation statistics comparing different models. Multimodal models are BART-based.

6 Conclusion

In this work, we proposed the new task of Sarcasm Explanation in Dialogue (SED), which aims to generate a natural language explanation for sarcastic conversations. We curated WITS, a novel multimodal, multi-party, code-mixed dialogue dataset to support the SED task. We experimented with multiple textual and multimodal baselines, which give promising results on the task at hand. Furthermore, we designed a unique multimodal fusion scheme to merge the textual, acoustic, and visual features via the Multimodal Context-Aware Attention (MCA2) and Global Information Fusion (GIF) mechanisms. As hypothesized, the results show that acoustic and visual features support our task, thus providing us with better explanations. We show an extensive qualitative analysis of the explanations obtained from different models and highlight their advantages as well as their pitfalls. We also performed a thorough human evaluation to compare the performance of the models with human understanding. Though the models equipped with the proposed fusion strategy perform better than the rest, the human evaluation suggests that there is still room for improvement, which can be further explored in future studies.


References

Ibrahim Abu Farha and Walid Magdy. 2020. From Arabic sentiment analysis to sarcasm detection: The ArSarcasm dataset. In Proceedings of the 4th Workshop on Open-Source Arabic Corpora and Processing Tools, with a Shared Task on Offensive Language Detection, pages 32–39, Marseille, France. European Language Resource Association.

Salvatore Attardo, Jodi Eisterhold, Jennifer Hay, and Isabella Poggi. 2003. Multimodal markers of irony and sarcasm. Humor: International Journal of Humor Research, 16(2).

Manjot Bedi, Shivani Kumar, Md Shad Akhtar, and Tanmoy Chakraborty. 2021. Multi-modal sarcasm detection and humor classification in code-mixed conversations. IEEE Transactions on Affective Computing, pages 1–1.

Santosh Kumar Bharti, Korra Sathya Babu, and Sanjay Kumar Jena. 2017. Harnessing online news for sarcasm detection in Hindi tweets. In Pattern Recognition and Machine Intelligence, pages 679–686, Cham. Springer International Publishing.

Yitao Cai, Huiyu Cai, and Xiaojun Wan. 2019. Multi-modal sarcasm detection in Twitter with hierarchical fusion model. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2506–2515, Florence, Italy. Association for Computational Linguistics.

Santiago Castro, Devamanyu Hazarika, Verónica Pérez-Rosas, Roger Zimmermann, Rada Mihalcea, and Soujanya Poria. 2019. Towards multimodal sarcasm detection (an _Obviously_ perfect paper). In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4619–4629, Florence, Italy. Association for Computational Linguistics.

Tuhin Chakrabarty, Debanjan Ghosh, Smaranda Muresan, and Nanyun Peng. 2020. R^3: Reverse, retrieve, and rank for sarcasm generation with commonsense knowledge. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7976–7986, Online. Association for Computational Linguistics.

Dushyant Singh Chauhan, Dhanush S R, Asif Ekbal, and Pushpak Bhattacharyya. 2020. Sentiment and emotion help sarcasm? A multi-task learning framework for multi-modal sarcasm, sentiment and emotion analysis. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4351–4360, Online. Association for Computational Linguistics.

Alessandra Teresa Cignarella, Simona Frenda, Valerio Basile, Cristina Bosco, Viviana Patti, Paolo Rosso, et al. 2018. Overview of the EVALITA 2018 task on irony detection in Italian tweets (IronITA). In Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA 2018), volume 2263, pages 1–6. CEUR-WS.

Herbert L. Colston. 1997. Salting a wound or sugaring a pill: The pragmatic functions of ironic criticism. Discourse Processes, 23(1):25–45.

Herbert L. Colston and Shauna B. Keller. 1998. You'll never believe this: Irony and hyperbole in expressing surprise. Journal of Psycholinguistic Research, 27(4):499–513.

Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 376–380, Baltimore, Maryland, USA. Association for Computational Linguistics.

Abhijeet Dubey, Aditya Joshi, and Pushpak Bhattacharyya. 2019. Deep models for converting sarcastic utterances into their non sarcastic interpretation. In Proceedings of the ACM India Joint International Conference on Data Science and Management of Data, CoDS-COMAD '19, pages 289–292, New York, NY, USA. Association for Computing Machinery.

Florian Eyben, Klaus R. Scherer, Björn W. Schuller, Johan Sundberg, Elisabeth André, Carlos Busso, Laurence Y. Devillers, Julien Epps, Petri Laukka, Shrikanth S. Narayanan, and Khiet P. Truong. 2016. The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing. IEEE Transactions on Affective Computing, 7(2):190–202.

Debanjan Ghosh, Alexander R. Fabbri, and Smaranda Muresan. 2018. Sarcasm analysis using conversation context. Computational Linguistics, 44(4):755–792.

Debanjan Ghosh, Alexander Richard Fabbri, and Smaranda Muresan. 2017. The role of conversation context for sarcasm detection in online interactions. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pages 186–196, Saarbrücken, Germany. Association for Computational Linguistics.

Kensho Hara, Hirokatsu Kataoka, and Yutaka Satoh. 2018. Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Md Kamrul Hasan, Sangwu Lee, Wasifur Rahman, Amir Zadeh, Rada Mihalcea, Louis-Philippe Morency, and Ehsan Hoque. 2021. Humor knowledge enriched transformer for understanding multimodal humor. Proceedings of the AAAI Conference on Artificial Intelligence, 35(14):12972–12980.

Stacey L. Ivanko and Penny M. Pexman. 2003. Context incongruity and irony processing. Discourse Processes, 35(3):241–279.

Aditya Joshi, Pushpak Bhattacharyya, and Mark J. Carman. 2017. Automatic sarcasm detection: A survey. ACM Computing Surveys, 50(5).


Aditya Joshi, Vinita Sharma, and Pushpak Bhattacharyya. 2015. Harnessing context incongruity for sarcasm detection. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 757–762, Beijing, China. Association for Computational Linguistics.

Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, Mustafa Suleyman, and Andrew Zisserman. 2017. The Kinetics human action video dataset.

Roger Kreuz and Gina Caucci. 2007. Lexical influences on the perception of sarcasm. In Proceedings of the Workshop on Computational Approaches to Figurative Language, pages 1–4, Rochester, New York. Association for Computational Linguistics.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online. Association for Computational Linguistics.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.

Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. 2020. Multilingual denoising pre-training for neural machine translation. Transactions of the Association for Computational Linguistics, 8:726–742.

Abhijit Mishra, Tarun Tater, and Karthik Sankaranarayanan. 2019. A modular architecture for unsupervised sarcasm generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6144–6154, Hong Kong, China. Association for Computational Linguistics.

Henri Olkoniemi, Henri Ranta, and Johanna K. Kaakinen. 2016. Individual differences in the processing of written sarcasm and metaphor: Evidence from eye movements. Journal of Experimental Psychology: Learning, Memory, and Cognition, 42(3):433.

Shereen Oraby, Vrindavan Harrison, Amita Misra, Ellen Riloff, and Marilyn Walker. 2017. Are you serious?: Rhetorical questions and sarcasm in social media dialog. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pages 310–319, Saarbrücken, Germany. Association for Computational Linguistics.

Reynier Ortega-Bueno, Francisco Rangel, D. Hernández Farías, Paolo Rosso, Manuel Montes-y-Gómez, and José E. Medina Pagola. 2019. Overview of the task on irony detection in Spanish variants. In Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2019), co-located with the 34th Conference of the Spanish Society for Natural Language Processing (SEPLN 2019), CEUR-WS.org, volume 2421, pages 229–256.

Hongliang Pan, Zheng Lin, Peng Fu, Yatao Qi, and Weiping Wang. 2020. Modeling intra and inter-modality incongruity for multi-modal sarcasm detection. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1383–1392, Online. Association for Computational Linguistics.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Lotem Peled and Roi Reichart. 2017. Sarcasm SIGN: Interpreting sarcasm with sentiment based monolingual machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1690–1700, Vancouver, Canada. Association for Computational Linguistics.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.

Richard M. Roberts and Roger J. Kreuz. 1994. Why do people use figurative language? Psychological Science, 5(3):159–163.

Patricia Rockwell. 2007. Vocal features of conversational sarcasm: A comparison of methods. Journal of Psycholinguistic Research, 36(5):361–369.

Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073–1083, Vancouver, Canada. Association for Computational Linguistics.

Himani Srivastava, Vaibhav Varshney, Surabhi Kumari, and Saurabh Srivastava. 2020. A novel hierarchical BERT architecture for sarcasm detection. In Proceedings of the Second Workshop on Figurative Language Processing, pages 93–97, Online. Association for Computational Linguistics.

Sahil Swami, Ankush Khandelwal, Vinay Singh, Syed Sarfaraz Akhtar, and Manish Shrivastava. 2018. A corpus of English-Hindi code-mixed tweets for sarcasm detection. arXiv preprint arXiv:1805.11869.


Sabina Tabacaru and Maarten Lemmens. 2014. Raised eyebrows as gestural triggers in humour: The case of sarcasm and hyper-understanding. The European Journal of Humour Research, 2(2):11–31.

Yi Tay, Anh Tuan Luu, Siu Cheung Hui, and Jian Su. 2018. Reasoning with sarcasm by reading in-between. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1010–1020, Melbourne, Australia. Association for Computational Linguistics.

Oren Tsur, Dmitry Davidov, and Ari Rappoport. 2010. ICWSM - a great catchy name: Semi-supervised recognition of sarcastic sentences in online product reviews. Proceedings of the International AAAI Conference on Web and Social Media, 4(1):162–169.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.

Henry M. Wellman. 2014. Making Minds: How Theory of Mind Develops. Oxford University Press.

Tao Xiong, Peiran Zhang, Hongbo Zhu, and Yihui Yang. 2019. Sarcasm detection with self-matching networks and low-rank bilinear pooling. In The World Wide Web Conference, WWW '19, pages 2115–2124, New York, NY, USA. Association for Computing Machinery.

Nan Xu, Zhixiong Zeng, and Wenji Mao. 2020. Reasoning with multimodal sarcastic tweets via modeling cross-modality contrast and semantic association. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3777–3786, Online. Association for Computational Linguistics.

Baosong Yang, Jian Li, Derek F. Wong, Lidia S. Chao, Xing Wang, and Zhaopeng Tu. 2019. Context-aware self-attention networks. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01):387–394.

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2019. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations.


A Appendix

A.1 Annotation Guidelines

As mentioned in Section 3, we use the MASAC dataset (Bedi et al., 2021), which contains episodes from the series "Sarabhai v/s Sarabhai", and extract all sarcastic instances from it. Each of these instances is associated with a corresponding video, audio, and textual transcript such that the last utterance is sarcastic in nature. We first manually define the number of contextual utterances required to understand the sarcasm present in the last utterance of each instance. Further, we provide each of these sarcastic statements, along with their context, to the annotators, who are asked to generate an explanation for these instances based on the audio, video, and text cues. Two annotators were asked to annotate the entire dataset. The target explanation is selected by calculating the cosine similarity between the two explanations. If the cosine similarity is greater than 90%, the shorter explanation is selected as the target explanation. Otherwise, a third annotator goes through the dialogue along with the explanations and resolves the conflict. The average cosine similarity after the first pass is 87.67%. All the final selected explanations contain the following attributes:

• Sarcasm source: The speaker in the dialogue who is being sarcastic.
• Sarcasm target: The person or thing towards whom the sarcasm is directed.
• Action word: The word used to describe how the sarcasm takes place, e.g., mocks, jokes, etc.
• Description: An optional description of the scene which helps in understanding the sarcasm better.

An example annotation with its attributes is shown in Figure 4.

Figure 4: Example annotations from WITS highlighting the different attributes of the explanation.
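The target-selection step above can be summarised with a short sketch. This is a minimal illustration under assumptions, not our exact pipeline: the 90% threshold and the shorter-explanation rule come from the description above, but the sentence encoder used to embed the explanations is an assumption, since the appendix does not specify it.

```python
# Minimal sketch of the target-explanation selection described above.
# Assumption: a multilingual sentence encoder is used to embed the two
# annotators' explanations; the specific model below is illustrative only.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def select_target_explanation(expl_a: str, expl_b: str, threshold: float = 0.9):
    """Return (target_explanation, needs_third_annotator) for one instance."""
    emb_a, emb_b = encoder.encode([expl_a, expl_b], convert_to_tensor=True)
    similarity = util.cos_sim(emb_a, emb_b).item()
    if similarity > threshold:
        # High agreement: keep the shorter of the two explanations.
        return min(expl_a, expl_b, key=len), False
    # Low agreement: a third annotator resolves the conflict manually.
    return None, True
```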

A.2 Embedding Space for BART and mBART

We compared various text-based unimodal methods for our task. Although BART performs best for SED, it is important to note that BART is pre-trained on English data and benchmarked on English datasets such as GLUE (Wang et al., 2018) and SQuAD (Rajpurkar et al., 2016). In order to explore how the learnt representations transfer to a code-mixed setting, we analyse the embedding space learnt by the model before and after fine-tuning it for our task. We considered three random utterances from WITS and created three copies of each: one in English, one in Hindi (romanised), and one without modification, i.e., code-mixed. Figure 5 shows the PCA plot of the nine utterance representations obtained from BART before and after fine-tuning on our task. It is interesting to note that even before any fine-tuning, the Hindi, English, and code-mixed representations lie close to each other, and they move even closer when we fine-tune the model. This can be explained by the fact that our input is in romanised code-mixed format, so the pre-trained model already captures part of its representation; fine-tuning then helps the model understand the Hindi portion of the input. Table 7 shows the cosine distance between the English-Hindi, English-code-mixed, and code-mixed-Hindi representations of the sample utterances. It can be clearly seen that the distances decrease after fine-tuning.

Figure 5: Embedding space for BART before and after fine-tuning on sarcasm explanation in dialogues. (a) Pre-trained; (b) Fine-tuned.

Example    English-Hindi      English-Code mixed    Code mixed-Hindi
           PT      FT         PT      FT            PT      FT
1          0.183   0.067      0.014   0.006         0.118   0.056
2          0.282   0.093      0.017   0.007         0.197   0.066
3          0.321   0.113      0.065   0.020         0.132   0.057

Table 7: Cosine distance between three random samples from the dataset before and after fine-tuning (PT: pre-trained; FT: fine-tuned).
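For concreteness, the sketch below shows one way the Table 7 style numbers can be produced: mean-pool the BART encoder's last hidden states for each variant of an utterance and compare the pooled vectors with cosine distance. The checkpoint, the pooling choice, and the sample sentences are illustrative assumptions; they are not the actual WITS utterances or our exact configuration.

```python
# Sketch of the analysis behind Table 7: cosine distance between pooled
# BART encoder representations of English / Hindi / code-mixed variants.
import torch
from transformers import BartTokenizer, BartModel

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartModel.from_pretrained("facebook/bart-base").eval()

def pooled_representation(utterance: str) -> torch.Tensor:
    inputs = tokenizer(utterance, return_tensors="pt")
    with torch.no_grad():
        hidden = model.encoder(**inputs).last_hidden_state  # (1, T, d)
    return hidden.mean(dim=1).squeeze(0)                    # (d,)

def cosine_distance(u: str, v: str) -> float:
    sim = torch.nn.functional.cosine_similarity(
        pooled_representation(u), pooled_representation(v), dim=0
    )
    return 1.0 - sim.item()

# Illustrative (hypothetical) utterance variants, not the actual WITS samples.
english = "Did you also think that I stole Monisha's speech?"
hindi = "Kya tumne bhi socha ki maine Monisha ka bhashan churaya?"
code_mixed = "Tum bhi soche ho ki maine Monisha ki speech churai?"

print(cosine_distance(english, hindi),
      cosine_distance(english, code_mixed),
      cosine_distance(code_mixed, hindi))
```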

A.3 Fusion at Different Layers

We fuse the multimodal information from audio and video in the BART encoder by applying the proposed fusion mechanism before different layers of the encoder. Table 8 shows the results we obtain when the fusion happens before each layer. We obtain the best results when the fusion happens before layer 6, i.e., the last layer of the encoder. This can be attributed to the fact that, with only one encoder layer after the fusion, the multimodal information is retained efficiently and thus decoded more accurately.

Fusion before layer #    R1       R2       RL
1                        37.27    13.95    35.24
2                        37.63    14.32    35.57
3                        36.73    13.15    34.63
4                        37.61    14.98    36.04
5                        37.34    13.67    35.48
6                        39.69    17.10    37.37

Table 8: ROUGE scores for fusion before different layers (R1/2/L: ROUGE-1/2/L).
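To make "fusion before layer k" concrete, the following sketch wraps a single BART encoder layer so that a fusion module transforms the hidden states immediately before that layer runs. This is a simplified stand-in, not the actual MAF implementation: the `fusion_module` interface is an assumption, its internals (context-aware attention and global information fusion) are not reproduced, and fixing the audio/video features at construction time is a simplification (in practice they would be supplied per batch).

```python
# Simplified sketch of injecting multimodal fusion before encoder layer k.
# `fusion_module` stands in for the proposed MAF block; its architecture is
# not shown here, and the feature handling is deliberately simplified.
import torch.nn as nn
from transformers import BartModel

class FusionBeforeLayer(nn.Module):
    def __init__(self, layer, fusion_module, audio_feats, video_feats):
        super().__init__()
        self.layer = layer
        self.fusion_module = fusion_module
        self.audio_feats = audio_feats
        self.video_feats = video_feats

    def forward(self, hidden_states, *args, **kwargs):
        # Fuse acoustic and visual context into the textual hidden states,
        # then run the original encoder layer on the fused representation.
        fused = self.fusion_module(hidden_states, self.audio_feats, self.video_feats)
        return self.layer(fused, *args, **kwargs)

def insert_fusion(bart: BartModel, fusion_module, audio_feats, video_feats, k: int = 6):
    """Apply fusion just before encoder layer k (1-indexed; k=6 is best per Table 8)."""
    bart.encoder.layers[k - 1] = FusionBeforeLayer(
        bart.encoder.layers[k - 1], fusion_module, audio_feats, video_feats
    )
    return bart
```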


A.4 More Qualitative Analysis

Table 9a highlights one of many cases where BART captures the essence of the sarcasm better than mBART. While mBART gives us an incorrect and incoherent explanation, BART generates an explanation that essentially means the same as the ground-truth explanation. Adding the audio modality often helps in generating preferable explanations, as shown in Table 9b: MAF-TAB is able to capture the essence of the sarcasm in the dialogue while the unimodal systems are not. Furthermore, the video modality facilitates an even better understanding of sarcasm, as illustrated in Table 9c, where MAF-TVB generates the best explanation while audio may act as noise in this particular example. Finally, Table 9d shows a case where the audio and video modalities together help in generating a better explanation.


MAYA: Sahil, beta tum bhi soche ho ki maine Monisha ki speech churai? (Sahil, do you also think that I stole Monisha's speech?)
INDRAVARDHAN: Haan. (Yes.)
MAYA: Are darling maine to speech ko chua bhi nahin. Chhoti to germs nahin lag jaate? Kyunki Monisha ne mithaai box ki wrapper per likhi thi apni speech, hath mein uthati to makkhiya bhanbhanane lagti. (Darling, I didn't even touch the speech. Would I not have got germs by touching it? Monisha used a sweets wrapper to write her speech; if I had picked it up, there would've been flies buzzing around me.)

Gold: Maya ne Monisha ke speech ka mazak udaya. (Maya makes fun of Monisha's speech.)
mBART: Maya kehti hai ki Monisha ka mazak udata hai. (Maya says that make fun of Monisha.)
BART: Maya monisha ke speech ka mazak udati hai. (Maya makes fun of Monisha's speech.)
MAF-TAB: Maya monisha ke speech ka mazaak udati hai. (Maya makes fun of Monisha's speech.)
MAF-TVB: Maya mocks monisha kyunki wo rhe theek hai. (Maya mocks Monisha because she is okay.)
MAF-TAVB: Maya kehti hai ki uske speech bure hai. (Maya says that she didn't like the speech.)

(a) BART v/s mBART: An example where the explanation generated by BART is better than that of mBART.

SAHIL: Ek minute, kya hai maa ji, humaare naatak mein ek bhi stree patra nahi hai, sare ladke hai. (One minute, what is it ma'am, we don't have any female parts in our play, all are male.)
PRABHAVATI: To uss mein bhi kaunsi badi baat hai, mai ladka ban jaungi. Mere paas pant shirt to hai, moonche aapki de dena! (So what is the big deal in it, I'll play a male. I have a pant and shirt, you give me your mustache!)
INDRAVARDHAN: Cancel! Naatak cancel! Maa ji, huaa aisa ki humaari jo bahu hai, uska ek chota sa accident ho gaya, to iss liye natak cancel! Monisha, le jaao inhe. (Cancel! The play is cancelled! Ma'am, what happened is that our daughter-in-law had a small accident, that is why the play is cancelled. Monisha, take her.)
SAHIL: Aur aate aate apna ek chota sa accident bhi kara ke aao! (And when you come, have a small accident too!)

Gold: Sahil Monisha pe gussa hai as usne Prabhavati as an actress le aya. (Sahil is angry at Monisha because she hired Prabhavati as an actress.)
mBART: Sahil ko Prabhavati ko role offer karne par taunt maarta hai. (Sahil taunts because the role is being offered to Prabhavati.)
BART: Indravardhan Monisha ko taunt maarta hai ki uska ek chota sa accident bhi kara ke aao. (Indravardhan taunts Monisha that she should have a small accident.)
MAF-TAB: Sahil ko Prabhavati ko role offer nahi karna. (Sahil does not want Prabhavati to have this role.)
MAF-TVB: Sahil Indravardhan ko ek accident keh ke uska majaak udaata hai. (Sahil calls Indravardhan an accident and makes fun of him.)
MAF-TAVB: Sahil ko Prabhavati ko role offer nahi karna. (Sahil does not want Prabhavati to have this role.)

(b) Audio helps: An example where the audio modality helps in generating a more fitting explanation.

MAYA: Kshama? You mean Sahil Kshama ko pyaar karta hai!? (Kshama? You mean Sahil loves Kshama?)
SAHIL: Nahi, nahi! Ek minute, ek minute, mai kshama chahata hu. (No, no! One minute, one minute, I want forgiveness ("kshama" in Hindi).)
INDRAVARDHAN: Dekha, Kshama chahata hai! Chahata ka matlab pyaar karna hi hua na!? (See, he wants forgiveness! Wants means love only, no!?)

Gold: Indravardhan Sahil ko tease karta hai ki vo Kshama se pyaar karta hai. (Indravardhan teases Sahil by implying that he loves Kshama (a girl's name; "kshama" in Hindi means forgiveness).)
mBART: Indravardhan implies ki Sahil ek kshama chahata hai. (Indravardhan implies that Sahil wants forgiveness.)
BART: Maya ko kshama chahata hai. (Maya wants forgiveness.)
MAF-TAB: Indravardhan Kshama ko pyaar karne par taunt maarta hai. (Indravardhan taunts that he loves Kshama.)
MAF-TVB: Indravardhan majaak mein kehta hai ki Sahil Kshama ko pyaar karta hai. (Indravardhan jokes that Sahil loves Kshama.)
MAF-TAVB: Indravardhan Rosesh ko Kshama ki matlab pyaar karne par taunt maarta hai. (Indravardhan taunts Rosesh for loving the meaning of forgiveness.)

(c) Video helps: An example where the video modality helps in generating a more fitting explanation.

MAYA: And this time I thought lets have a theme party! Animals! Hum log sab animals banenge! (And this time I thought let's have a theme party! Animals! We will all be animals!)
MONISHA: Walnut brownie bhi. (And walnut brownie too.)
MAYA: Mai hiran, Sahil horse, and Monisha chhipakalee! (I'll be a deer, Sahil a horse, and Monisha a lizard!)

Gold: Maya Monisha ko chhipakalee keha kar uska majaak udaati hai. (Maya makes fun of Monisha by comparing her with a lizard.)
mBART: Maya Monisha ko taunt maarti hai ki use animal themed party. (Maya taunts Monisha for her animal themed party.)
BART: Maya Monisha ko taunt maarti hai. (Maya taunts Monisha.)
MAF-TAB: Maya implies ki vo animal mein theme party ke baare mein nahi banenge. (Maya implies that she won't be in regarding the animal themed party.)
MAF-TVB: Maya Monisha ke animal ke behaviour par taunt maarti hai. (Maya taunts Monisha for her animal behaviour.)
MAF-TAVB: Maya Monisha ko animal kaha ke taunt maarti hai. (Maya taunts Monisha by calling her an animal.)

(d) Audio and video help: An example where the audio and video modalities together help in generating a better explanation.

Table 9: Actual and generated explanations for sample dialogues from the test set. The last utterance is the sarcastic utterance in each dialogue.
