
Multimodal Emoji Prediction

Francesco Barbieri♦ Miguel Ballesteros♠ Francesco Ronzano♥ Horacio Saggion♦

♦ Large Scale Text Understanding Systems Lab, TALN, UPF, Barcelona, Spain  ♠ IBM Research, U.S.

♥ Integrative Biomedical Informatics Group, GRIB, IMIM-UPF, Barcelona, Spain

♦♥{name.surname}@upf.edu, ♠[email protected]

Abstract

Emojis are small images that are commonly included in social media text messages. The combination of visual and textual content in the same message builds up a modern way of communication that automatic systems are not yet accustomed to dealing with. In this paper we extend recent advances in emoji prediction by putting forward a multimodal approach that is able to predict emojis in Instagram posts. Instagram posts are composed of pictures together with texts which sometimes include emojis. We show that these emojis can be predicted not only from the text, but also from the picture. Our main finding is that incorporating the two synergistic modalities in a combined model improves accuracy in an emoji prediction task. This result demonstrates that the two modalities (text and images) encode different information on the use of emojis and can therefore complement each other.

1 Introduction

In the past few years the use of emojis in social media has increased exponentially, changing the way we communicate. The combination of visual and textual content poses new challenges for information systems, which need to deal not only with the semantics of text but also with that of images. Recent work (Barbieri et al., 2017) has shown that textual information can be used to predict emojis associated to text. In this paper we show that in the current context of multimodal communication, where texts and images are combined in social networks, visual information should be combined with texts in order to obtain more accurate emoji-prediction models.

We explore the use of emojis in the social media platform Instagram. We put forward a multimodal approach to predict the emojis associated to an Instagram post, given its picture and text¹. Our task and experimental framework are similar to those of Barbieri et al. (2017); however, we use different data (Instagram instead of Twitter) and, in addition, we rely on images to improve the selection of the most likely emojis to associate to a post. We show that a multimodal approach (textual and visual content of the posts) increases the emoji prediction accuracy compared to one that only uses textual information. This suggests that textual and visual content embed different but complementary features of the use of emojis.

In general, an effective approach to predicting the emoji to be associated with a piece of content may help to improve natural language processing tasks (Novak et al., 2015), such as information retrieval, generation of emoji-enriched social media content, and suggestion of emojis when writing text messages or sharing pictures online. Given that emojis may also mislead humans (Miller et al., 2017), the automated prediction of emojis may help to achieve better language understanding. As a consequence, by modeling the semantics of emojis, we can improve highly subjective tasks like sentiment analysis, emotion recognition and irony detection (Felbo et al., 2017).

2 Dataset and Task

Dataset: We gathered Instagram posts published between July 2016 and October 2016, and geolocalized in the United States of America. We considered only posts that contained a photo together with the related user description of at least 4 words and exactly one emoji.

Moreover, as done by Barbieri et al. (2017), we considered only the posts which include one and only one of the 20 most frequent emojis (the most frequent emojis are shown in Table 3).

¹ In this paper we only utilize the first comment issued by the user who posted the picture.



Our dataset is composed of 299,809 posts, each containing a picture, the text associated with it, and only one emoji. In the experiments we also considered the subsets of the 10 (238,646 posts) and 5 most frequent emojis (184,044 posts), similarly to the approach followed by Barbieri et al. (2017).
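
To make the filtering criteria concrete, the sketch below re-implements them in Python; the regular expression, the placeholder TOP_20 set and the function name are our own illustrative choices, not artifacts released with the paper.

import re

# Rough emoji matcher over common Unicode emoji blocks; a real filter would
# use a complete emoji inventory.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

# Placeholder for the 20 most frequent emojis measured on the corpus
# (the actual inventory is the one reported in Table 3).
TOP_20 = {"\u2764", "\U0001F602", "\U0001F60D"}  # illustrative subset only

def keep_post(caption: str, has_photo: bool) -> bool:
    """Apply the filtering criteria of Section 2: a photo, a caption of at
    least 4 words, and exactly one emoji among the 20 most frequent ones."""
    emojis = EMOJI_RE.findall(caption)
    words = EMOJI_RE.sub(" ", caption).split()
    return has_photo and len(words) >= 4 and len(emojis) == 1 and emojis[0] in TOP_20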

Task: We extend the experimental scheme of Barbieri et al. (2017) by also considering visual information when modeling posts. We cast the emoji prediction problem as a classification task: given an image or a text (or both inputs in the multimodal scenario) we select the most likely emoji that could be added to (and thus used to label) such content. The task for our machine learning models is, given the visual and textual content of a post, to predict the single emoji that appears in the input comment.

3 Models

We present and motivate the models that we use to predict an emoji given an Instagram post composed of a picture and the associated comment.

3.1 ResNets

Deep Residual Networks (ResNets) (He et al., 2016) are Convolutional Neural Networks that have been competitive in several image classification tasks (Russakovsky et al., 2015; Lin et al., 2014) and have proven to be one of the best CNN architectures for image recognition. A ResNet is a feed-forward CNN that exploits “residual learning” by bypassing two or more convolution layers (like similar previous approaches (Sermanet and LeCun, 2011)). We use an implementation of the original ResNet where the scale and aspect ratio augmentation are from Szegedy et al. (2015), the photometric distortions from Howard (2013), and weight decay is applied to all weights and biases (instead of only the weights of the convolution layers). The network we use is composed of 101 layers (ResNet-101), initialized with pretrained parameters learned on ImageNet (Deng et al., 2009). We use this model as a starting point and later finetune it on our emoji classification task. The learning rate was set to 0.0001 and we stopped training early when there was no further improvement on the validation set.
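
As a rough illustration of this finetuning setup (ImageNet initialization, a 20-way output layer, learning rate 0.0001, early stopping on the validation set), the following PyTorch sketch is one plausible implementation; the framework, optimizer choice and exact weight decay value are assumptions rather than details reported in the paper.

import torch
import torch.nn as nn
from torchvision import models

# Load a ResNet-101 pretrained on ImageNet and replace its final fully
# connected layer with a 20-way emoji classifier.
model = models.resnet101(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, 20)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, weight_decay=1e-4)

def train_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    """One optimization step; in practice training is stopped early once
    validation accuracy no longer improves, as described above."""
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()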

3.2 FastText

FastText (Joulin et al., 2017) is a linear model for text classification. We decided to employ FastText as it has been shown that, on specific classification tasks, it can achieve results comparable to complex neural classifiers (RNNs and CNNs) while being much faster. FastText represents a valid approach when dealing with social media content classification, where huge amounts of data need to be processed and new and relevant information is continuously generated. The FastText algorithm is similar to the CBOW algorithm (Mikolov et al., 2013), where the middle word is replaced by the label, in our case the emoji. Given a set of N documents, the loss that the model attempts to minimize is the negative log-likelihood over the labels (in our case, the emojis):

$$\mathrm{loss} = -\frac{1}{N}\sum_{n=1}^{N} e_n \log\big(\mathrm{softmax}(BAx_n)\big)$$

where $e_n$ is the emoji included in the n-th Instagram post, represented as a one-hot vector and used as the label, $A$ and $B$ are affine transformations (weight matrices), and $x_n$ is the unit vector of the bag of features of the n-th document (comment). The bag of features is the average of the input words, represented as vectors via a look-up table.
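
A toy numpy sketch of this objective may help; all dimensions and the random weights are made up for illustration, and a real implementation would simply use the fastText library.

import numpy as np

# x_n is the averaged bag-of-features vector of one comment, A and B are the
# two weight matrices of the linear model, and e_n is the index of the gold emoji.
rng = np.random.default_rng(0)
dim_in, dim_hidden, n_labels = 50, 10, 20          # illustrative sizes only
A = rng.normal(size=(dim_hidden, dim_in))          # input projection
B = rng.normal(size=(n_labels, dim_hidden))        # output projection

def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max()
    return np.exp(z) / np.exp(z).sum()

def nll(x_n: np.ndarray, e_n: int) -> float:
    """Negative log-likelihood of the gold emoji e_n for one document."""
    probs = softmax(B @ (A @ x_n))
    return -np.log(probs[e_n])

# The corpus-level loss is the average of nll(...) over the N training posts.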

3.3 B-LSTM Baseline

Barbieri et al. (2017) propose a recurrent neural network approach for the emoji prediction task. We use this model as a baseline to verify whether FastText achieves comparable performance. They used a bidirectional LSTM with character-based representations of the words (Ling et al., 2015; Ballesteros et al., 2015) to handle orthographic variants (or even spelling errors) of the same word that occur in social media (e.g. cooooool vs. cool).
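
The exact architecture of Barbieri et al. (2017) is not reproduced here, but a minimal PyTorch sketch of the character-level idea, building a word vector with a bidirectional LSTM over its characters, looks roughly as follows (hyperparameters are illustrative).

import torch
import torch.nn as nn

class CharWordEncoder(nn.Module):
    """Builds a word vector from its characters with a bidirectional LSTM,
    so that spelling variants such as 'cooooool' and 'cool' end up close.
    Sizes are illustrative, not those of Barbieri et al. (2017)."""

    def __init__(self, n_chars: int, char_dim: int = 32, hidden: int = 64):
        super().__init__()
        self.embed = nn.Embedding(n_chars, char_dim)
        self.lstm = nn.LSTM(char_dim, hidden, bidirectional=True, batch_first=True)

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        # char_ids: (batch, word_length) indices of the characters of each word
        _, (h, _) = self.lstm(self.embed(char_ids))
        # concatenate the final forward and backward hidden states
        return torch.cat([h[0], h[1]], dim=-1)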

4 Experiments and Evaluation

In order to study the relation between Instagram posts and emojis, we performed two different experiments. In the first experiment (Section 4.2) we compare the FastText model with the state of the art in emoji classification (B-LSTM) by Barbieri et al. (2017). Our second experiment (Section 4.3) evaluates the visual (ResNet) and textual (FastText) models on the emoji prediction task. Moreover, we evaluate a multimodal combination of both models, based respectively on visual and textual inputs. Finally, we discuss the contribution of each modality to the prediction task.


      top-5          top-10         top-20
      P    R    F1   P    R    F1   P    R    F1
BW    61   61   61   45   45   45   34   36   32
BC    63   63   63   48   47   47   42   39   34
FT    61   62   61   47   49   46   38   39   36

Table 1: Comparison of B-LSTM with word modeling (BW), B-LSTM with character modeling (BC), and FastText (FT) on the same Twitter emoji prediction tasks proposed by Barbieri et al. (2017), using the same Twitter dataset.

We use 80% of our dataset (introduced in Section 2) for training, 10% to tune our models, and 10% for testing (selecting the sets randomly).

4.1 Feature Extraction and Classifier

To model visual features we first finetune the ResNet (process described in Section 3.1) on the emoji prediction task, then extract the vectors from the input of the last fully connected layer (before the softmax). The textual embeddings are the bag of features shown in Section 3.2 (the $x_n$ vectors), extracted after training the FastText model on the emoji prediction task.
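
One plausible way to obtain these penultimate-layer visual features in PyTorch is a forward hook on the final fully connected layer; the random batch and the use of a freshly constructed ResNet-101 (rather than the finetuned one) are stand-ins for illustration only.

import torch
import torch.nn as nn
from torchvision import models

# In practice this would be the finetuned ResNet-101 from Section 3.1.
model = models.resnet101()
model.fc = nn.Linear(model.fc.in_features, 20)     # 20-way emoji classifier head
model.eval()

features = {}

def save_fc_input(module, inputs, output):
    # inputs[0] is the (batch, 2048) tensor entering the final fc layer,
    # i.e. the vector used as the visual embedding of each post.
    features["penultimate"] = inputs[0].detach()

model.fc.register_forward_hook(save_fc_input)

images = torch.randn(4, 3, 224, 224)               # stand-in for a preprocessed batch
with torch.no_grad():
    model(images)
visual_embeddings = features["penultimate"]        # shape (4, 2048)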

With respect to the combination of textual and visual modalities, we adopt a middle fusion approach (Kiela and Clark, 2015): we associate to each Instagram post a multimodal embedding obtained by concatenating the unimodal representations of the same post (i.e. the visual and textual embeddings), previously learned. Then, we feed a classifier² with visual (ResNet), textual (FastText), or multimodal feature embeddings, and test the accuracy of the three systems.
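
The fusion step itself can be sketched with scikit-learn as below; the .npy file names are hypothetical placeholders for pre-computed unimodal embeddings, and the regularization strength C is our assumption (footnote 2 only specifies L2-regularized logistic regression).

import numpy as np
from sklearn.linear_model import LogisticRegression

# Pre-computed unimodal embeddings, one row per Instagram post (hypothetical files).
visual = np.load("resnet_penultimate.npy")    # e.g. (n_posts, 2048) ResNet features
textual = np.load("fasttext_bow.npy")         # (n_posts, d) averaged FastText features
labels = np.load("emoji_labels.npy")          # (n_posts,) gold emoji indices

# Multimodal embedding = concatenation of the two unimodal representations.
multimodal = np.concatenate([visual, textual], axis=1)

# L2-regularized logistic regression; C is an illustrative value.
clf = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
clf.fit(multimodal, labels)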

4.2 B-LSTM / FastText Comparison

To compare the FastText model with the word- and character-based B-LSTMs presented by Barbieri et al. (2017), we consider the same three emoji prediction tasks they proposed: the top-5, top-10 and top-20 emojis most frequently used in their Tweet datasets. In this comparison we used the same Twitter datasets. As we can see in Table 1, the FastText model is competitive, and it even outperforms the character-based B-LSTM in one of the emoji prediction tasks (top-20 emojis). This result suggests that we can employ FastText to represent social media short texts (such as Twitter or Instagram) with reasonable accuracy.

² L2-regularized logistic regression.

        top-5                top-10               top-20
        P     R     F1       P     R     F1       P     R     F1
Maj     7.9   20.0  11.3     2.7   10.0  4.2      0.9   5.0   1.5
W.R.    20.1  20.0  20.1     9.8   9.8   9.8      4.6   4.8   4.7
Vis     38.6  31.1  31.0     26.3  20.9  20.5     20.3  17.5  16.1
Tex     56.1  54.4  54.9     41.6  37.5  38.3     36.7  29.9  31.3
Mul     57.4  56.3  56.7     42.3  40.5  41.1     36.6  35.2  35.5
%       2.3   3.5   3.3      1.7   8.0   7.3      -0.3  17.7  13.4

Table 2: Prediction results for the top-5, top-10 and top-20 most frequent emojis in the Instagram dataset: Precision (P), Recall (R), F-measure (F1). Experimental settings: majority baseline (Maj), weighted random (W.R.), visual (Vis), textual (Tex) and multimodal (Mul) systems. The last line reports the percentage improvement of the multimodal system over the textual system.

4.3 Multimodal Emoji Prediction

We present the results of the three emoji classification tasks, using the visual, textual and multimodal features (see Table 2).

The emoji prediction task seems difficult when using only the image of the Instagram post (Visual), even if it largely outperforms the majority baseline³ and the weighted random baseline⁴. We achieve better performance when we use feature embeddings extracted from the text. The most interesting finding is that when we use a multimodal combination of visual and textual features, we get a non-negligible improvement. This suggests that the two modalities embed different representations of the posts, and when used in combination they are synergistic. It is also interesting to note that the more emojis there are to predict, the larger the improvement the multimodal system provides over the text-only system (3.28% for top-5 emojis, 7.31% for top-10 emojis, and 13.42% for the top-20 emojis task).
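
For concreteness, these percentages (and the last row of Table 2) are relative F1 improvements of the multimodal system over the text-only system; for the top-20 task:

$$\frac{F1_{\mathrm{Mul}} - F1_{\mathrm{Tex}}}{F1_{\mathrm{Tex}}} \times 100 = \frac{35.5 - 31.3}{31.3} \times 100 \approx 13.4\%$$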

4.4 Qualitative Analysis

In Table 3 we show the results for each class in the top-20 emojis task.

The emojis with the highest F1 using the textual features are the most frequent one (0.62) and the US flag (0.52). The latter seems easy to predict since it appears in specific contexts: when the word USA/America is used (or when American cities are mentioned, like #NYC).

The hardest emojis to predict with the text-only system are the two gestures (0.12) and (0.13). The first one is often selected when the gold standard emoji is the second one, or it is often mispredicted by wrongly selecting or .

³ Always predict the most frequent emoji.
⁴ Random prediction keeping the label distribution of the training set.


E     %      Tex   Vis   MM      E     %      Tex   Vis   MM
      17.46  0.62  0.35  0.69          3.68   0.22  0.15  0.29
      9.10   0.45  0.30  0.47          3.55   0.20  0.02  0.26
      8.41   0.32  0.15  0.34          3.54   0.13  0.02  0.20
      5.91   0.23  0.08  0.26          3.51   0.26  0.17  0.31
      5.73   0.35  0.17  0.36          3.31   0.43  0.25  0.45
      4.58   0.45  0.24  0.46          3.25   0.12  0.01  0.16
      4.31   0.52  0.23  0.53          3.14   0.12  0.02  0.15
      4.15   0.38  0.26  0.49          3.11   0.34  0.11  0.36
      3.84   0.19  0.10  0.22          2.91   0.36  0.04  0.37
      3.73   0.13  0.03  0.16          2.82   0.45  0.54  0.59

Table 3: F-measure on the test set for the 20 most frequent emojis (column E) using the three different models. “%” indicates the percentage of the class in the test set.

Another relevant confusion scenario related to emoji prediction was spotted by Barbieri et al. (2017): relying on Twitter textual data, they showed that the emoji was hard to predict as it was used similarly to . When we consider Instagram data instead, the emoji is easier to predict (0.23), even if it is often confused with .

When we rely on visual content (the Instagram picture), the emojis that are easily predicted are the ones whose associated photos are similar to each other. For instance, most of the pictures associated with are dog/pet pictures. Similarly, is predicted along with very bright pictures taken outside. is correctly predicted along with pictures related to gym and fitness. The accuracy of is also high since most posts including this emoji are related to fitness (and the pictures are simply either selfies at the gym, weight lifting images, or protein food).

Employing a multimodal approach improves performance. This means that the two modalities are somehow complementary, and adding visual information helps to solve potential ambiguities that arise when relying only on textual content. In Figure 1 we report the confusion matrix of the multimodal model. The emojis are plotted from the most frequent to the least frequent, and we can see that the model tends to mispredict emojis by selecting more frequent emojis (the left part of the matrix is brighter).

4.4.1 Saliency Maps

In order to show the parts of the image most relevant for each class we analyze the global average pooling (Lin et al., 2013) on the convolutional feature maps (Zhou et al., 2016).

Figure 1: Confusion matrix of the multimodal model. The gold labels are plotted on the y-axis and the predicted labels on the x-axis. The matrix is normalized by rows.

By visually observing the image heatmaps of the set of Instagram post pictures, we note that in most cases it is quite difficult to determine a clear association between the emoji used by the user and some particular portion of the image. Detecting the correct emoji given an image is harder than a simple object recognition task, as the emoji choice depends on the subjective emotions of the user who posted the image. In Figure 2 we show the first four predictions of the CNN for three pictures, together with where the network focuses (in red). We can see that in the first example the network selects the smile with sunglasses because of the legs at the bottom of the image, the dog emoji is selected while focusing on the dog in the image, and the smiling emoji while focusing on the person in the back, who is lying on a hammock. In the second example the network selects again the due to the water and part of the kayak, the heart emoji while focusing on the city landscape, and the praying emoji while focusing on the sky. The same “praying” emoji is also selected when focusing on the luxury car in the third example, probably because the same emoji is used to express desire, i.e. “please, I want this awesome car”.
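
A minimal sketch of the class activation mapping computation behind these heatmaps (Zhou et al., 2016): the map for an emoji class is a weighted sum of the last convolutional feature maps, using that class's weights in the final linear layer. Tensor shapes and names are illustrative assumptions, not the authors' code.

import torch

def class_activation_map(feature_maps: torch.Tensor,
                         fc_weights: torch.Tensor,
                         class_idx: int) -> torch.Tensor:
    """Class activation mapping (Zhou et al., 2016).

    feature_maps: (C, H, W) activations of the last convolutional layer for
                  one image (2048 x 7 x 7 for a 224x224 input to ResNet-101).
    fc_weights:   (n_classes, C) weights of the final linear layer that
                  follows global average pooling.
    Returns an (H, W) heatmap for the requested emoji class.
    """
    weights = fc_weights[class_idx]                      # (C,)
    cam = torch.einsum("c,chw->hw", weights, feature_maps)
    cam = torch.relu(cam)
    return cam / (cam.max() + 1e-8)                      # normalize to [0, 1]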

It is interesting to note that images can give context to textual messages, as in the following Instagram posts: (1) “Love my new home ” (associated to a picture of a bright garden, outside) and (2) “I can’t believe it’s the first day of school!!! I love being these boys’ mommy!!!! #myboys #mommy ” (associated to a picture of two boys wearing two blue shirts).


Figure 2: Three test pictures. From left to right, we show the four most likely predicted emojis and their corresponding class activation mapping heatmaps.

In both examples the textual system predicts , while the multimodal system correctly predicts both of them: the blue color in the picture associated with (2) helps to change the color of the heart, and the sunny/bright picture of the garden in (1) helps to correctly predict .

5 Related Work

Modeling the semantics of emojis, and their applications, is a relatively novel research problem with direct applications in any social media task. Since emojis do not have a clear grammar, their role in text messages is not clear. Emojis are considered function words or even affective markers (Na'aman et al., 2017) that can potentially affect the overall semantics of a message (Donato and Paggio, 2017).

Emojis can encode different meanings, and they can be interpreted differently. Emoji interpretation has been explored user-wise (Miller et al., 2017), location-wise, specifically in countries (Barbieri et al., 2016b) and cities (Barbieri et al., 2016a), gender-wise (Chen et al., 2017), and time-wise (Barbieri et al., 2018).

Emoji semantics and usage have been studied with distributional semantics, with models trained on Twitter data (Barbieri et al., 2016c), Twitter data together with the official Unicode descriptions (Eisner et al., 2016), or text from a popular keyboard app (Ai et al., 2017). In the same context, Wijeratne et al. (2017a) propose a platform for exploring emoji semantics. In order to further study emoji semantics, two datasets with pairwise emoji similarity, with human annotations, have been proposed: EmoTwi50 (Barbieri et al., 2016c) and EmoSim508 (Wijeratne et al., 2017b). Emoji similarity has also been used to propose efficient keyboard emoji organization (Pohl et al., 2017). Recently, Barbieri and Camacho-Collados (2018) showed that emoji modifiers (skin tones and gender) can affect the vector-space semantic representation of emojis.

Emojis play an important role in the emotional content of a message. Several sentiment lexicons for emojis have been proposed (Novak et al., 2015; Kimura and Katsurai, 2017; Rodrigues et al., 2018), and studies on emotion and emojis have also been published recently (Wood and Ruder, 2016; Hu et al., 2017).

During the last decade several studies have shown how sentiment analysis improves when we jointly leverage information coming from different modalities (e.g. text, images, audio, video) (Morency et al., 2011; Poria et al., 2015; Tran and Cambria, 2018). In particular, when dealing with social media posts, the presence of both textual and visual content has promoted a number of investigations on sentiment or emotions (Baecchi et al., 2016; You et al., 2016b,a; Yu et al., 2016; Chen et al., 2015) or emojis (Cappallo et al., 2015, 2018).

6 Conclusions

In this work we explored the use of emojis in a multimodal context (Instagram posts). We have shown that using a synergistic approach, i.e. relying on both the textual and visual content of social media posts, we can outperform state-of-the-art unimodal approaches (based only on textual content). As future work, we plan to extend our models by predicting more than one emoji per social media post and by considering a larger number of labels.

Acknowledgments

We thank the anonymous reviewers for their important suggestions. Francesco B. and Horacio S. acknowledge support from the TUNER project (TIN2015-65308-C5-5-R, MINECO/FEDER, UE) and the Maria de Maeztu Units of Excellence Programme (MDM-2015-0502).


References

Wei Ai, Xuan Lu, Xuanzhe Liu, Ning Wang, Gang Huang, and Qiaozhu Mei. 2017. Untangling emoji popularity through semantic embeddings. In ICWSM, pages 2–11.

Claudio Baecchi, Tiberio Uricchio, Marco Bertini, and Alberto Del Bimbo. 2016. A multimodal feature learning approach for sentiment analysis of social network multimedia. Multimedia Tools and Applications 75(5):2507–2525.

Miguel Ballesteros, Chris Dyer, and Noah A. Smith. 2015. Improved transition-based parsing by modeling characters instead of words with LSTMs. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Lisbon, Portugal, pages 349–359.

Francesco Barbieri, Miguel Ballesteros, and Horacio Saggion. 2017. Are emojis predictable? In Proceedings of the 2017 Conference of the European Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, Valencia, Spain.

Francesco Barbieri and Jose Camacho-Collados. 2018. How Gender and Skin Tone Modifiers Affect Emoji Semantics in Twitter. In Proceedings of the 7th Joint Conference on Lexical and Computational Semantics (*SEM 2018), New Orleans, LA, United States.

Francesco Barbieri, Luis Espinosa-Anke, and Horacio Saggion. 2016a. Revealing patterns of Twitter emoji usage in Barcelona and Madrid. Frontiers in Artificial Intelligence and Applications (Artificial Intelligence Research and Development) 288:239–244.

Francesco Barbieri, German Kruszewski, Francesco Ronzano, and Horacio Saggion. 2016b. How Cosmopolitan Are Emojis? Exploring Emojis Usage and Meaning over Different Languages with Distributional Semantics. In Proceedings of the 2016 ACM on Multimedia Conference. ACM, Amsterdam, Netherlands, pages 531–535.

Francesco Barbieri, Luis Marujo, William Brendel, Pradeep Karuturi, and Horacio Saggion. 2018. Exploring Emoji Usage and Prediction Through a Temporal Variation Lens. In 1st International Workshop on Emoji Understanding and Applications in Social Media (at ICWSM 2018).

Francesco Barbieri, Francesco Ronzano, and Horacio Saggion. 2016c. What does this emoji mean? A vector space skip-gram model for Twitter emojis. In LREC.

Spencer Cappallo, Thomas Mensink, and Cees G. M. Snoek. 2015. Image2emoji: Zero-shot emoji prediction for visual media. In Proceedings of the 23rd ACM International Conference on Multimedia. ACM, pages 1311–1314.

Spencer Cappallo, Stacey Svetlichnaya, Pierre Garrigues, Thomas Mensink, and Cees G. M. Snoek. 2018. The new modality: Emoji challenges in prediction, anticipation, and retrieval. arXiv preprint arXiv:1801.10253.

Fuhai Chen, Yue Gao, Donglin Cao, and Rongrong Ji. 2015. Multimodal hypergraph learning for microblog sentiment prediction. In Multimedia and Expo (ICME), 2015 IEEE International Conference on. IEEE, pages 1–6.

Zhenpeng Chen, Xuan Lu, Sheng Shen, Wei Ai, Xuanzhe Liu, and Qiaozhu Mei. 2017. Through a gender lens: An empirical study of emoji usage over large-scale Android users. arXiv preprint arXiv:1705.05546.

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. 2009. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR 2009.

Giulia Donato and Patrizia Paggio. 2017. Investigating redundancy in emoji use: Study on a Twitter-based corpus. In Proceedings of the 8th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 118–126.

Ben Eisner, Tim Rocktäschel, Isabelle Augenstein, Matko Bošnjak, and Sebastian Riedel. 2016. emoji2vec: Learning emoji representations from their description. arXiv preprint arXiv:1609.08359.

Bjarke Felbo, Alan Mislove, Anders Søgaard, Iyad Rahwan, and Sune Lehmann. 2017. Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm. In EMNLP.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778.

Andrew G. Howard. 2013. Some improvements on deep convolutional neural network based image classification. arXiv preprint arXiv:1312.5402.

Tianran Hu, Han Guo, Hao Sun, Thuy-vy Thi Nguyen, and Jiebo Luo. 2017. Spice up Your Chat: The Intentions and Sentiment Effects of Using Emoji. In Proceedings of the International AAAI Conference on Web and Social Media (ICWSM). AAAI.

Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2017. Bag of tricks for efficient text classification. In Proceedings of the 2017 Conference of the European Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, Valencia, Spain.

Douwe Kiela and Stephen Clark. 2015. Multi- and cross-modal semantics beyond vision: Grounding in auditory perception. In EMNLP, pages 2461–2470.


Mayu Kimura and Marie Katsurai. 2017. Automatic construction of an emoji sentiment lexicon. In Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining. ACM, pages 1033–1036.

Min Lin, Qiang Chen, and Shuicheng Yan. 2013. Network in Network. In International Conference on Learning Representations.

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In European Conference on Computer Vision. Springer, pages 740–755.

Wang Ling, Chris Dyer, Alan W. Black, Isabel Trancoso, Ramon Fermandez, Silvio Amir, Luis Marujo, and Tiago Luis. 2015. Finding function in form: Compositional character models for open vocabulary word representation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Lisbon, Portugal, pages 1520–1530.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Hannah Miller, Daniel Kluver, Jacob Thebault-Spieker, Loren Terveen, and Brent Hecht. 2017. Understanding emoji ambiguity in context: The role of text in emoji-related miscommunication. In 11th International Conference on Web and Social Media, ICWSM 2017. AAAI Press.

Louis-Philippe Morency, Rada Mihalcea, and Payal Doshi. 2011. Towards multimodal sentiment analysis: Harvesting opinions from the web. In Proceedings of the 13th International Conference on Multimodal Interfaces. ACM, pages 169–176.

Noa Na'aman, Hannah Provenza, and Orion Montoya. 2017. Varying linguistic purposes of emoji in (Twitter) context. In Proceedings of ACL 2017, Student Research Workshop, pages 136–141.

Petra Kralj Novak, Jasmina Smailović, Borut Sluban, and Igor Mozetič. 2015. Sentiment of emojis. PLoS ONE 10(12):e0144296.

Henning Pohl, Christian Domin, and Michael Rohs. 2017. Beyond just text: Semantic emoji similarity modeling to support expressive communication. ACM Transactions on Computer-Human Interaction (TOCHI) 24(1):6.

Soujanya Poria, Erik Cambria, and Alexander Gelbukh. 2015. Deep convolutional neural network textual features and multiple kernel learning for utterance-level multimodal sentiment analysis. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2539–2544.

David Rodrigues, Marília Prada, Rui Gaspar, Margarida V. Garrido, and Diniz Lopes. 2018. Lisbon Emoji and Emoticon Database (LEED): Norms for emoji and emoticons in seven evaluative dimensions. Behavior Research Methods, pages 392–405.

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. 2015. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115(3):211–252.

Pierre Sermanet and Yann LeCun. 2011. Traffic sign recognition with multi-scale convolutional networks. In Neural Networks (IJCNN), The 2011 International Joint Conference on. IEEE, pages 2809–2813.

Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9.

Ha-Nguyen Tran and Erik Cambria. 2018. Ensemble application of ELM and GPU for real-time multimodal sentiment analysis. Memetic Computing 10(1):3–13.

Sanjaya Wijeratne, Lakshika Balasuriya, Amit Sheth, and Derek Doran. 2017a. EmojiNet: An open service and API for emoji sense discovery. In International AAAI Conference on Web and Social Media (ICWSM 2017), Montreal, Canada.

Sanjaya Wijeratne, Lakshika Balasuriya, Amit Sheth, and Derek Doran. 2017b. A semantics-based measure of emoji similarity. In International Conference on Web Intelligence (Web Intelligence 2017), Leipzig, Germany.

Ian Wood and Sebastian Ruder. 2016. Emoji as emotion tags for tweets. In Emotion and Sentiment Analysis Workshop, LREC.

Quanzeng You, Liangliang Cao, Hailin Jin, and Jiebo Luo. 2016a. Robust visual-textual sentiment analysis: When attention meets tree-structured recursive neural networks. In Proceedings of the 2016 ACM on Multimedia Conference. ACM, pages 1008–1017.

Quanzeng You, Jiebo Luo, Hailin Jin, and Jianchao Yang. 2016b. Cross-modality consistent regression for joint visual-textual sentiment analysis of social multimedia. In Proceedings of the Ninth ACM International Conference on Web Search and Data Mining. ACM, pages 13–22.

Yuhai Yu, Hongfei Lin, Jiana Meng, and Zhehuan Zhao. 2016. Visual and textual sentiment analysis of a microblog using deep convolutional neural networks. Algorithms 9(2):41.


Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. 2016. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2921–2929.