
Proceedings of the 11th International Workshop on Semantic Evaluations (SemEval-2017), pages 747–754, Vancouver, Canada, August 3–4, 2017. © 2017 Association for Computational Linguistics

DataStories at SemEval-2017 Task 4: Deep LSTM with Attention for Message-level and Topic-based Sentiment Analysis

Christos Baziotis, Nikos Pelekis, Christos Doulkeridis
University of Piraeus - Data Science Lab
Piraeus, Greece
[email protected], [email protected], [email protected]

Abstract

In this paper we present two deep-learning systems that competed at SemEval-2017 Task 4 “Sentiment Analysis in Twitter”. We participated in all subtasks for English tweets, involving message-level and topic-based sentiment polarity classification and quantification. We use Long Short-Term Memory (LSTM) networks augmented with two kinds of attention mechanisms, on top of word embeddings pre-trained on a large collection of Twitter messages. We also present a text processing tool suitable for social network messages, which performs tokenization, word normalization, segmentation and spell correction. Moreover, our approach uses no hand-crafted features or sentiment lexicons. We ranked 1st (tie) in Subtask A and achieved very competitive results in the rest of the subtasks. Both the word embeddings and our text processing tool¹ are available to the research community.

1 Introduction

Sentiment analysis is an area of Natural Language Processing (NLP) that studies the identification and quantification of the sentiment expressed in text. Sentiment analysis in Twitter is a particularly challenging task because of the informal and “creative” writing style, with improper use of grammar, figurative language, misspellings and slang.

In previous runs of the task, sentiment analysis was usually tackled with hand-crafted features and/or sentiment lexicons (Mohammad et al., 2013; Kiritchenko et al., 2014; Palogiannidi et al., 2016), fed to classifiers such as Naive Bayes or Support Vector Machines (SVM).

¹ github.com/cbaziotis/ekphrasis

These approaches require a laborious feature-engineering process, which may also need domain-specific knowledge, and usually results in both redundant and missing features. In contrast, artificial neural networks (Deriu et al., 2016; Rouvier and Favre, 2016), which perform feature learning, achieved very good results in last year's task (Nakov et al., 2016), outperforming the competition.

In this paper, we present two deep-learning systems that competed at SemEval-2017 Task 4 (Rosenthal et al., 2017). Our first model addresses message-level sentiment analysis: we employ a 2-layer bidirectional LSTM equipped with an attention mechanism (Rocktäschel et al., 2015). For the topic-based sentiment analysis tasks, we propose a Siamese bidirectional LSTM with a context-aware attention mechanism (Yang et al., 2016). In contrast to the top-performing systems of previous years, we do not rely on hand-crafted features or sentiment lexicons, and we do not use model ensembles. We make the following contributions:

• A text processing tool for text tokenization, word normalization, word segmentation and spell correction, geared towards Twitter.

• A deep learning system for short-text sentiment analysis using an attention mechanism, in order to enforce the contribution of words that determine the sentiment of a message.

• A deep learning system for topic-based sentiment analysis, with a context-aware attention mechanism utilizing the topic information.

2 Overview

Figure 1 provides a high-level overview of our approach, which consists of two main steps and an optional task-dependent third step: (1) the text processing step, where we use our own text processing tool to prepare the data for our neural networks, (2) the learning step, where we train the neural networks, and (3) the quantification step, for estimating the sentiment distribution for each topic.

Figure 1: High-level overview of our approach.

Task definitions. In Subtask A, given a message we must classify whether it expresses positive, negative, or neutral sentiment (3-point scale). In Subtasks B and C (topic-based sentiment polarity classification) we are given a message and a topic and must classify the message on a 2-point scale (Subtask B) or a 5-point scale (Subtask C). In Subtasks D and E (quantification) we are given a set of messages about a set of topics and must estimate the distribution of the tweets across a 2-point scale (Subtask D) or a 5-point scale (Subtask E).
Unlabeled Dataset. We collected a large dataset of 330M English Twitter messages, gathered from 12/2012 to 07/2016, which is used (1) for calculating the word statistics needed by our text processor and (2) for training our word embeddings.
Pre-trained Word Embeddings. Word embeddings are dense vector representations of words (Collobert and Weston, 2008; Mikolov et al., 2013) that capture their semantic and syntactic information. We leverage our large collection of Twitter messages to generate word embeddings, with a vocabulary of 660K words, using GloVe (Pennington et al., 2014). The pre-trained word embeddings are used to initialize the first layer (embedding layer) of our neural networks.
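As a minimal sketch of this initialization step, the following Python snippet loads GloVe-style vectors into a matrix that can seed an embedding layer. It assumes the common text format of one word per line followed by its vector components; the file name used in the comment is hypothetical.

import numpy as np

def load_embeddings(path, dim=300):
    # Read GloVe-style vectors: each line is "token v1 v2 ... v_dim".
    word2idx, vectors = {"<pad>": 0, "<unk>": 1}, [np.zeros(dim), np.zeros(dim)]
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            if len(parts) != dim + 1:
                continue  # skip malformed lines
            word2idx[parts[0]] = len(vectors)
            vectors.append(np.asarray(parts[1:], dtype=np.float32))
    return word2idx, np.vstack(vectors)  # matrix used to initialize the embedding layer

# word2idx, emb_matrix = load_embeddings("datastories.twitter.300d.txt")  # hypothetical file name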

2.1 Text Processor
We developed our own text processing tool in order to utilize most of the information in the text, performing sentiment-aware tokenization, spell correction, word normalization, word segmentation (for splitting hashtags) and word annotation.
Tokenizer. The difficulty in tokenization is to avoid splitting expressions or words that should be kept intact (as one token).

Although there are some tokenizers geared towards Twitter (Potts, 2011; Gimpel et al., 2011) that recognize the Twitter markup and some basic sentiment expressions or simple emoticons, our tokenizer is able to identify most emoticons, emojis, expressions such as dates (e.g. 07/11/2011, April 23rd), times (e.g. 4:30pm, 11:00 am), currencies (e.g. $10, 25mil, 50e), acronyms, censored words (e.g. s**t), words with emphasis (e.g. *very*) and more.
Text Postprocessing. After the tokenization we add an extra postprocessing step, performing modifications on the extracted tokens. This is where we perform spell correction, word normalization and segmentation, and decide which tokens to omit, normalize or annotate (surround or replace with special tags). For the tasks of spell correction (Jurafsky and Martin, 2000) and word segmentation (Segaran and Hammerbacher, 2009) we use the Viterbi algorithm, utilizing word statistics (unigrams and bigrams) from our unlabeled dataset to obtain word probabilities. Moreover, we lowercase all words, and normalize URLs, emails and user handles (@user).
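To make the segmentation idea concrete, here is a minimal dynamic-programming sketch of unigram-based hashtag splitting. It is only an illustration: the actual tool also uses bigram statistics and applies the same machinery to spell correction, and the toy COUNTS dictionary stands in for statistics computed on the unlabeled corpus.

import math

# Toy unigram counts; in practice these come from the 330M-tweet corpus.
COUNTS = {"twin": 500, "peaks": 300, "twinpeaks": 2, "tv": 900, "series": 700}
TOTAL = sum(COUNTS.values())

def logprob(word):
    # Unseen words get a length-dependent penalty so long garbage chunks are avoided.
    return math.log(COUNTS.get(word, 0.1 / len(word)) / TOTAL)

def segment(text, max_len=20):
    # Split a concatenated string (e.g. a hashtag body) into the most probable word sequence.
    best = [(0.0, [])]  # best[i] = (score, words) for text[:i]
    for i in range(1, len(text) + 1):
        candidates = (
            (best[j][0] + logprob(text[j:i]), best[j][1] + [text[j:i]])
            for j in range(max(0, i - max_len), i)
        )
        best.append(max(candidates))
    return best[-1][1]

# segment("twinpeaks") -> ['twin', 'peaks']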

After performing the aforementioned steps we decrease the vocabulary size while keeping information that is usually lost during the tokenization phase. Table 1 shows an example of our text processing pipeline applied to a Twitter message.

2.2 Neural Networks

Last year, most of the top-scoring systems used Convolutional Neural Networks (CNN) (LeCun et al., 1998). Even though CNNs are designed for computer vision, the fact that they are fast and easy to train makes them a popular choice for NLP problems. However, CNNs have no notion of order; thus, when applying them to NLP tasks, the crucial information of word order is lost.

2.2.1 Recurrent Neural Networks
A more natural choice is to use Recurrent Neural Networks (RNN). An RNN processes an input sequentially, in a way that resembles how humans do it. It performs the same operation, h_t = f_W(x_t, h_{t-1}), on every element of a sequence, where h_t is the hidden state at timestep t and W the weights of the network.

original:  The *new* season of #TwinPeaks is coming on May 21, 2017. CANT WAIT \o/ !!! #tvseries #davidlynch :D
processed: the new <emphasis> season of <hashtag> twin peaks </hashtag> is coming on <date> . cant <allcaps> wait <allcaps> <happy> ! <repeated> <hashtag> tv series </hashtag> <hashtag> david lynch </hashtag> <laugh>

Table 1: Example of our text processor. The word annotations help the RNN to learn better features.


The hidden state at each timestep depends on the previous hidden states, which is why the order of the elements (words) is important. This process also enables RNNs to handle inputs of variable length.
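As a minimal illustration of the recurrence h_t = f_W(x_t, h_{t-1}), the sketch below implements a vanilla (Elman-style) RNN cell in NumPy. The model actually used in this work is an LSTM, so this is only meant to make the notation concrete.

import numpy as np

def rnn_forward(X, Wx, Wh, b):
    # X: input sequence (T, input_dim); Wx: (input_dim, hidden_dim); Wh: (hidden_dim, hidden_dim).
    T, hidden_dim = X.shape[0], Wh.shape[0]
    h = np.zeros(hidden_dim)
    states = []
    for t in range(T):
        # h_t = f_W(x_t, h_{t-1}): a tanh of a linear map of the input and the previous state
        h = np.tanh(X[t] @ Wx + h @ Wh + b)
        states.append(h)
    return np.stack(states)  # all hidden states, shape (T, hidden_dim)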

RNNs are difficult to train (Pascanu et al., 2013), because gradients may grow or decay exponentially over long sequences (Bengio et al., 1994; Hochreiter et al., 2001). A way to overcome these problems is to use one of the more sophisticated variants of the regular RNN: the Long Short-Term Memory (LSTM) network (Hochreiter and Schmidhuber, 1997) or the more recently proposed Gated Recurrent Unit (GRU) (Cho et al., 2014). Both variants introduce a gating mechanism that ensures proper gradient propagation through the network. We use the LSTM, as it performed slightly better than the GRU in our experiments.
Attention Mechanism. An RNN updates its hidden state h_i as it processes a sequence, and at the end the hidden state holds a summary of all the processed information. In order to amplify the contribution of important elements in the final representation, we use an attention mechanism (Rocktäschel et al., 2015; Raffel and Ellis, 2015) that aggregates all the hidden states using their relative importance (see Figure 2).

Figure 2: Comparison between the regular RNN (a) and the RNN with attention (b).

2.3 Quantification
For the quantification tasks an obvious approach is Classify & Count (CC) (Forman, 2008), where we simply compute the fraction of a topic's messages that a classifier predicts to belong to a class c. Another approach is Probabilistic Classify & Count (PCC) (Gao and Sebastiani, 2016), in which we first train a classifier that produces a probability distribution over the classes, and then average the estimated probabilities for each class to obtain the final distribution. Let T be the set of tweets about a given topic and p(c|tweet) the (posterior) probability that a tweet belongs to class c, as estimated by the classifier. Then we estimate the expected fraction of the topic's tweets that belong to class c as follows:

p_T(c) = \frac{1}{|T|} \sum_{tweet \in T} p(c|tweet)    (1)
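A minimal sketch of PCC as defined in Eq. (1): it averages, per topic, the class-probability vectors produced by the classifier. The classifier itself and its probability outputs are assumed to be given; the example topics and probabilities are illustrative only.

import numpy as np
from collections import defaultdict

def pcc(topics, probs):
    # topics: topic id per tweet; probs: array (n_tweets, n_classes) of p(c|tweet).
    # Returns {topic: estimated class distribution}, i.e. Eq. (1) applied per topic.
    by_topic = defaultdict(list)
    for topic, p in zip(topics, probs):
        by_topic[topic].append(p)
    return {topic: np.mean(rows, axis=0) for topic, rows in by_topic.items()}

# Example with two topics on a 2-point scale (positive, negative):
# pcc(["iphone", "iphone", "tesla"], np.array([[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]]))
# -> {"iphone": array([0.75, 0.25]), "tesla": array([0.2, 0.8])}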

3 Models Description

We propose two different models: a Message-level Sentiment Analysis (MSA) model for Subtask A (Section 3.1) and a Topic-based Sentiment Analysis (TSA) model (Section 3.2) for Subtasks B, C, D and E.

3.1 MSA Model (message-level)
Our message-level sentiment analysis (MSA) model consists of a 2-layer bidirectional LSTM (BiLSTM) with an attention mechanism for identifying the most informative words.
Embedding Layer. The input to the network is a Twitter message, treated as a sequence of words. We use an embedding layer to project the words X = (x_1, x_2, ..., x_T) to a low-dimensional vector space R^E, where E is the size of the embedding layer and T the number of words in the tweet. We initialize the weights of the embedding layer with our pre-trained word embeddings.
BiLSTM Layers. An LSTM takes as input the words of a tweet and produces the word annotations H = (h_1, h_2, ..., h_T), where h_i is the hidden state of the LSTM at time-step i, summarizing all the information of the sentence up to x_i. We use a bidirectional LSTM (BiLSTM) in order to get word annotations that summarize the information from both directions. A bidirectional LSTM consists of a forward LSTM \overrightarrow{f} that reads the sentence from x_1 to x_T and a backward LSTM \overleftarrow{f} that reads the sentence from x_T to x_1. We obtain the final annotation for a given word x_i by concatenating the annotations from both directions:

h_i = \overrightarrow{h_i} \,\|\, \overleftarrow{h_i}, \quad h_i \in \mathbb{R}^{2L}    (2)

where ‖ denotes the concatenation operation and L the size of each LSTM. We stack two layers of BiLSTMs in order to learn more abstract features.
Attention Layer. Not all words contribute equally to the expression of the sentiment in a message. We use an attention mechanism to find the relative contribution (importance) of each word. The attention mechanism assigns a weight a_i to each word annotation.

Figure 3: The MSA model: a 2-layer bidirectional LSTM with attention over the last layer.

We compute the fixed representation r of the whole message as the weighted sum of all the word annotations. Formally:

e_i = \tanh(W_h h_i + b_h), \quad e_i \in [-1, 1]    (3)

a_i = \frac{\exp(e_i)}{\sum_{t=1}^{T} \exp(e_t)}, \quad \sum_{i=1}^{T} a_i = 1    (4)

r = \sum_{i=1}^{T} a_i h_i, \quad r \in \mathbb{R}^{2L}    (5)

where W_h and b_h are the attention layer's weights, optimized during training to assign larger weights to the most important words of a sentence.
Output Layer. We use the representation r as a feature vector for classification and feed it to a final fully-connected softmax layer, which outputs a probability distribution over all classes.
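A NumPy sketch of the attention pooling in Eqs. (3)-(5), operating on precomputed BiLSTM word annotations; the trainable parameters W_h and b_h are passed in as if already learned, and the shapes are illustrative assumptions.

import numpy as np

def attention_pool(H, W_h, b_h):
    # H: word annotations (T, 2L); W_h: (2L,); b_h: scalar.
    # Returns (r, a): message representation (2L,) and attention weights (T,).
    e = np.tanh(H @ W_h + b_h)        # Eq. (3): one score per word, in [-1, 1]
    e = e - e.max()                   # numerical stability before the softmax
    a = np.exp(e) / np.exp(e).sum()   # Eq. (4): weights sum to 1
    r = a @ H                         # Eq. (5): weighted sum of the annotations
    return r, a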

3.2 TSA Model (topic-based)
For the topic-based sentiment analysis tasks, we propose a Siamese bidirectional LSTM network (two networks with identical configuration whose weights are linked during training) with a different attention mechanism than in MSA. Our model is comparable to the work of (Wang et al., 2016; Tang et al., 2015); however, it differs in the way it incorporates the topic information and in the attention mechanism.
Embedding Layer. The network has two inputs: the sequence of words in the tweet X^{tw} = (x^{tw}_1, x^{tw}_2, ..., x^{tw}_{T_{tw}}), where T_{tw} is the number of words in the tweet, and the sequence of words in the topic X^{to} = (x^{to}_1, x^{to}_2, ..., x^{to}_{T_{to}}), where T_{to} is the number of words in the topic. We project all words to R^E, where E is the size of the embedding layer.
Siamese BiLSTM. We use a bidirectional LSTM with shared weights to map the words of the tweet and the topic to the same vector space, so that a meaningful comparison between the two can be made. The BiLSTM produces the annotations for the words of the tweet, H^{tw} = (h^{tw}_1, h^{tw}_2, ..., h^{tw}_{T_{tw}}), and of the topic, H^{to} = (h^{to}_1, h^{to}_2, ..., h^{to}_{T_{to}}), where each word annotation is the concatenation of its forward and backward annotations:

h^j_i = \overrightarrow{h^j_i} \,\|\, \overleftarrow{h^j_i}, \quad h^j_i \in \mathbb{R}^{2L}, \; j \in \{tw, to\}    (6)

where ‖ denotes the concatenation operation and L the size of each LSTM.
Mean-Pooling Layer. We use a mean-pooling layer over the word annotations of the topic H^{to} to aggregate them into a single annotation. The layer computes the mean over time to produce the topic annotation, h^{to} = \frac{1}{T_{to}} \sum_{i=1}^{T_{to}} h^{to}_i.
Context-Aware Annotations. We append the topic annotation h^{to} to each word annotation to get the final context-aware annotation for each word:

h_i = h^{tw}_i \,\|\, h^{to}, \quad h_i \in \mathbb{R}^{4L}    (7)

Context-Attention Layer. We use a context-aware attention mechanism as in (Yang et al., 2016), in order to strengthen the contribution of words that express sentiment towards the given topic. This is done by adding a context vector u_h, which can be interpreted as a fixed query, like “which words express sentiment towards the given topic”, over the words of the message. Concretely:

e_i = \tanh(W_h h_i + b_h), \quad e_i \in [-1, 1]    (8)

a_i = \frac{\exp(e_i^\top u_h)}{\sum_{t=1}^{T_{tw}} \exp(e_t^\top u_h)}, \quad \sum_{i=1}^{T_{tw}} a_i = 1    (9)

r = \sum_{i=1}^{T_{tw}} a_i h_i, \quad r \in \mathbb{R}^{4L}    (10)

where W_h, b_h and u_h are jointly learned weights.
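A NumPy sketch of the topic-conditioned attention in Eqs. (6)-(10): the topic annotations are mean-pooled, appended to every tweet-word annotation, scored against the learned context vector u_h, and the weighted sum gives r. W_h, b_h and u_h are passed in as if already learned; the shapes are illustrative assumptions.

import numpy as np

def context_attention(H_tw, H_to, W_h, b_h, u_h):
    # H_tw: tweet annotations (T_tw, 2L); H_to: topic annotations (T_to, 2L).
    # W_h: (4L, A); b_h: (A,); u_h: (A,). Returns the tweet representation r of size 4L.
    h_to = H_to.mean(axis=0)                                   # mean-pooled topic annotation
    H = np.hstack([H_tw, np.tile(h_to, (H_tw.shape[0], 1))])   # Eq. (7): context-aware annotations (T_tw, 4L)
    e = np.tanh(H @ W_h + b_h)                                 # Eq. (8)
    scores = e @ u_h                                           # Eq. (9): similarity to the context vector
    a = np.exp(scores - scores.max())
    a = a / a.sum()                                            # attention weights sum to 1
    r = a @ H                                                  # Eq. (10): weighted sum, shape (4L,)
    return r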

Figure 4: The TSA model: a Siamese bidirectional LSTM with context-aware attention.

Maxout Layer. We pass the representation r to a Maxout layer (Goodfellow et al., 2013) to make the final comparison between the tweet aspects and the topic aspects. We selected Maxout because it amplifies the effects of dropout (Section 3.3).
Output Layer. We pass the output of the Maxout layer to a final fully-connected softmax layer, which outputs a probability distribution over all classes.
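For reference, a Maxout unit takes the element-wise maximum over k affine transformations of its input. A minimal NumPy sketch (the number of pieces k is an illustrative choice, not a value reported in the paper):

import numpy as np

def maxout(x, W, b):
    # x: (d_in,); W: (k, d_in, d_out); b: (k, d_out).
    z = np.einsum("i,kio->ko", x, W) + b  # k affine pieces, shape (k, d_out)
    return z.max(axis=0)                  # element-wise max over the pieces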

3.3 Regularization
In both of our models we add Gaussian noise at the embedding layer, which can be interpreted as a random data augmentation technique that makes our models more robust to overfitting. In addition, we use dropout (Srivastava et al., 2014) to randomly turn off neurons in our network. Dropout prevents co-adaptation of neurons and can also be thought of as a form of ensemble learning, because for each training example a sub-part of the whole network is trained. Moreover, we apply dropout to the recurrent connections of the LSTM as in (Gal and Ghahramani, 2016). Furthermore, we add an L2 regularization penalty (weight decay) to the loss function to discourage large weights. Also, we stop training after the validation loss has stopped decreasing (early stopping).

Finally, we do not fine-tune the embedding layers. Words occurring in the training set will be moved in the embedding space, and the classifier will correlate certain regions (in embedding space) with certain sentiments. However, words that appear in the test set but not in the training set will remain at their initial position, which may no longer reflect their “true” sentiment, leading to misclassifications.
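As a hedged sketch, these regularizers could be wired together in Keras (the library the authors report using) roughly as follows. Layer sizes follow the MSA settings of Section 3.5.1, the embedding matrix is a placeholder, the attention layer is replaced by a plain pooling layer for brevity, and argument names may differ across Keras versions.

import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, GaussianNoise, Dropout, Bidirectional, LSTM, GlobalAveragePooling1D, Dense
from keras.regularizers import l2

emb_matrix = np.random.randn(50000, 300).astype("float32")  # placeholder for the pre-trained GloVe matrix

model = Sequential([
    Embedding(input_dim=emb_matrix.shape[0], output_dim=300,
              weights=[emb_matrix], trainable=False),    # frozen pre-trained embeddings (not fine-tuned)
    GaussianNoise(0.2),                                   # Gaussian noise at the embedding layer
    Dropout(0.3),                                         # embedding dropout
    Bidirectional(LSTM(150, return_sequences=True,
                       recurrent_dropout=0.25)),          # dropout on the recurrent connections
    Dropout(0.5),
    Bidirectional(LSTM(150, return_sequences=True,
                       recurrent_dropout=0.25)),
    Dropout(0.5),
    GlobalAveragePooling1D(),                             # stand-in for the attention layer of Section 3.1
    Dense(3, activation="softmax",
          kernel_regularizer=l2(0.0001)),                 # L2 weight decay on the output layer
])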

3.4 Class Weights
In the training data some classes have more training examples than others, introducing bias into our models. In order to deal with this problem we apply class weights to the loss function of our models, penalizing the misclassification of underrepresented classes more heavily. Moreover, we introduce a smoothing factor to smooth out the weights in cases where the imbalances are very strong, which would otherwise lead to extremely large class weights. Let x be the vector of class counts and α the smoothing factor; we obtain the class weights as

w_i = \frac{\max(x)}{x_i + \alpha \cdot \max(x)}

In Subtasks A, B and D we use no smoothing (α = 0), and in Subtasks C and E we set α = 0.1.
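A direct transcription of the class-weight formula; the example counts in the comments are the training counts from Table 2.

import numpy as np

def class_weights(counts, alpha=0.0):
    # counts: training examples per class; alpha: smoothing factor.
    counts = np.asarray(counts, dtype=np.float64)
    return counts.max() / (counts + alpha * counts.max())

# class_weights([19652, 22195, 7723])                         # Subtask A, alpha = 0
# class_weights([1016, 12852, 12888, 3380, 296], alpha=0.1)   # Subtask C, alpha = 0.1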

3.5 Training

We train all of our networks to minimize the cross-entropy loss, using back-propagation with stochastic gradient descent and mini-batches of size 128. We use Adam (Kingma and Ba, 2014) for tuning the learning rate, and we clip the norm of the gradients (Pascanu et al., 2013) at 5, as an extra safety measure against exploding gradients.
Dataset. For training we use the available data from prior years (only tweets). Table 2 shows the statistics of the data we used. Also, we do not use any user information from the tweets (only the text).
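Continuing the earlier Keras sketch, the optimizer, gradient clipping, class weights and early stopping described above could be combined roughly as follows; model, x_train, y_train, x_val, y_val and weights are assumed to be defined elsewhere, and the patience value is an assumption.

from keras.optimizers import Adam
from keras.callbacks import EarlyStopping

model.compile(optimizer=Adam(clipnorm=5),                 # clip the gradient norm at 5
              loss="categorical_crossentropy")
model.fit(x_train, y_train,
          batch_size=128,
          validation_data=(x_val, y_val),
          class_weight={i: w for i, w in enumerate(weights)},        # class weights from Section 3.4
          callbacks=[EarlyStopping(monitor="val_loss", patience=3)])  # early stopping; patience is an assumption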

3.5.1 Hyper-parameters
In order to find good hyper-parameter values in a relatively short time, compared to grid or random search, we adopt the Bayesian optimization approach (Bergstra et al., 2013), performing a “smart” search in the high-dimensional space of all possible values.
MSA Model. The size of the embedding layer is 300, and of the LSTM layers 150 (300 for the BiLSTM). We add Gaussian noise with σ = 0.2 and dropout of 0.3 at the embedding layer, dropout of 0.5 at the LSTM layers and dropout of 0.25 at the recurrent connections of the LSTM. Finally, we add L2 regularization of 0.0001 to the loss function.

TSA Model. The size of the embedding layer is 300, and of the LSTM layers 64 (128 for the BiLSTM). We insert Gaussian noise with σ = 0.2 at the embedding layer of both inputs, and dropout of 0.3 at the embedding layer of the message.


Dataset  Task   2             1               0               -1              -2           Total
Train    A      -             19652 (39.64%)  22195 (44.78%)  7723 (15.58%)   -            49570
Train    B,D    -             14897 (78.85%)  -               3997 (21.15%)   -            18894
Train    C,E    1016 (3.34%)  12852 (42.23%)  12888 (42.35%)  3380 (11.11%)   296 (0.97%)  30432
Test     A      -             2375 (19.33%)   5937 (48.33%)   3972 (32.33%)   -            12284
Test     B,D    -             2463 (39.82%)   -               3722 (60.18%)   -            6185
Test     C,E    131 (1.06%)   2332 (18.84%)   6194 (50.04%)   3545 (28.64%)   177 (1.43%)  12379

Table 2: Dataset statistics (columns 2 and 1 are positive, 0 is neutral, -1 and -2 are negative). Notice the difference in the ratio of positive to negative classes this year.

We also apply dropout of 0.2 at the LSTM layer and at its recurrent connections, and dropout of 0.3 at the attention layer and the Maxout layer. Finally, we add L2 regularization of 0.001 to the loss function.

4 Experiments and Results

SemEval Results. Our official ranking (Rosenthal et al., 2017) is 1/38 (tie) in Subtask A, 2/24 in Subtask B, 2/16 in Subtask C, 2/16 in Subtask D and 11/12 in Subtask E. All of our models performed very well, with the exception of Subtask E. Since the quantification was performed on top of the classifier of Subtask C, which came in 2nd place, we conclude that our quantification approach was the reason for the poor results in Subtask E.
Attention Mechanism. In order to assess the impact of the attention mechanisms, we evaluated the performance of each model with and without attention. We report (Table 3) the average scores of 10 runs for each system on the official test set. The attention-based models performed better, but only by a small margin.

RNN        Subtask A (MSA)     Subtask B (TSA)
           ρ       F1pn        ρ       F1pn
Regular    0.678   0.673       0.856   0.817
Attention  0.682   0.675       0.863   0.820

Table 3: Results of the impact of attention³.

Quantification. To get a better insight into the quantification approaches, we compare the performance of CC and PCC. The literature is inconclusive as to which quantification approach is better: PCC outperformed CC in (Bella et al., 2010) but underperformed CC in (Esuli and Sebastiani, 2015). Following the results of (Gao and Sebastiani, 2016), which are reported on sentiment analysis in Twitter, we decided to use PCC for both of our quantification submissions.

³ ρ is the average recall and F1pn the macro-averaged F1 score of the positive and negative classes.

Table 4 shows the performance of our models. PCC performed better than CC for Subtask D, but far worse than CC for Subtask E. We hypothesize that two possible reasons for the difference in performance between D and E are (1) the difference in the number of classes and (2) the big change in the ratio of positive to negative classes between the training and test sets.

Method   Subtask D                Subtask E
         KLD     AE      RAE      EMD
CC       0.060   0.093   0.608    0.359
PCC      0.048   0.095   0.848    0.595

Table 4: Results of the quantification approaches⁴.

Experimental setup. For developing our models we used Keras (Chollet, 2015) with Theano (Theano Dev Team, 2016) as the backend, and Scikit-learn (Pedregosa et al., 2011). We trained our neural networks on a GTX 750 Ti (4GB). Lastly, we share the source code of our models⁵.

5 Conclusion

In this paper, we present two deep-learning systems for short-text sentiment analysis developed for SemEval-2017 Task 4 “Sentiment Analysis in Twitter”. We use RNNs, utilizing well-established methods in the literature. Additionally, we empower our networks with two different kinds of attention mechanisms, in order to amplify the contribution of the most important words.

Our models achieved excellent results in the classification tasks, but mixed results in the quantification tasks. We would like to work more in this area and explore more quantification techniques in the future. Another interesting approach would be to design models operating at the character level.

⁴ KLD is Kullback-Leibler Divergence, EMD is Earth Mover's Distance, AE is Absolute Error and RAE is Relative Absolute Error. For all metrics, lower is better.

⁵ https://github.com/cbaziotis/datastories-semeval2017-task4


References

Antonio Bella, Cesar Ferri, José Hernández-Orallo, and Maria Jose Ramirez-Quintana. 2010. Quantification via probability estimators. In Proceedings of ICDM. pages 737–742.

Yoshua Bengio, Patrice Simard, and Paolo Frasconi. 1994. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks 5(2):157–166.

James Bergstra, Daniel Yamins, and David D. Cox. 2013. Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. In Proceedings of ICML. pages 115–123.

Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.

Francois Chollet. 2015. Keras. https://github.com/fchollet/keras.

Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of ICML. pages 160–167.

Jan Deriu, Maurice Gonzenbach, Fatih Uzdilli, Aurélien Lucchi, Valeria De Luca, and Martin Jaggi. 2016. SwissCheese at SemEval-2016 Task 4: Sentiment classification using an ensemble of convolutional neural networks with distant supervision. In Proceedings of SemEval. pages 1124–1128.

Andrea Esuli and Fabrizio Sebastiani. 2015. Optimizing text quantifiers for multivariate loss functions. ACM Transactions on Knowledge Discovery from Data 9(4):27:1–27:27.

George Forman. 2008. Quantifying counts and costs via classification. Data Mining and Knowledge Discovery 17(2):164–206.

Yarin Gal and Zoubin Ghahramani. 2016. A theoretically grounded application of dropout in recurrent neural networks. In Proceedings of NIPS. pages 1019–1027.

Wei Gao and Fabrizio Sebastiani. 2016. From classification to quantification in tweet sentiment analysis. Social Network Analysis and Mining 6(1).

Kevin Gimpel, Nathan Schneider, Brendan O'Connor, Dipanjan Das, Daniel Mills, Jacob Eisenstein, Michael Heilman, Dani Yogatama, Jeffrey Flanigan, and Noah A. Smith. 2011. Part-of-speech tagging for Twitter: Annotation, features, and experiments. In Proceedings of ACL. pages 42–47.

Ian J. Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron C. Courville, and Yoshua Bengio. 2013. Maxout networks. In Proceedings of ICML. pages 1319–1327.

Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, and Jürgen Schmidhuber. 2001. Gradient flow in recurrent nets: The difficulty of learning long-term dependencies. A Field Guide to Dynamical Recurrent Neural Networks. IEEE Press.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9(8):1735–1780.

Daniel Jurafsky and James H. Martin. 2000. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall PTR, 1st edition.

Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Svetlana Kiritchenko, Xiaodan Zhu, and Saif M. Mohammad. 2014. Sentiment analysis of short informal texts. Journal of Artificial Intelligence Research 50:723–762.

Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11):2278–2324.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proceedings of NIPS. pages 3111–3119.

Saif M. Mohammad, Svetlana Kiritchenko, and Xiaodan Zhu. 2013. NRC-Canada: Building the state-of-the-art in sentiment analysis of tweets. arXiv preprint arXiv:1308.6242.

Preslav Nakov, Alan Ritter, Sara Rosenthal, Fabrizio Sebastiani, and Veselin Stoyanov. 2016. SemEval-2016 Task 4: Sentiment analysis in Twitter. In Proceedings of SemEval. pages 1–18.

Elisavet Palogiannidi, Athanasia Kolovou, Fenia Christopoulou, Filippos Kokkinos, Elias Iosif, Nikolaos Malandrakis, Haris Papageorgiou, Shrikanth Narayanan, and Alexandros Potamianos. 2016. Tweester at SemEval-2016 Task 4: Sentiment analysis in Twitter using semantic-affective model adaptation. In Proceedings of SemEval. pages 155–163.

Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. 2013. On the difficulty of training recurrent neural networks. In Proceedings of ICML. pages 1310–1318.

Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, and others. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12:2825–2830.


Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of EMNLP. pages 1532–1543.

Christopher Potts. 2011. Sentiment Symposium Tutorial: Tokenizing. http://sentiment.christopherpotts.net/tokenizing.html.

Colin Raffel and Daniel P. W. Ellis. 2015. Feed-forward networks with attention can solve some long-term memory problems. arXiv preprint arXiv:1512.08756.

Tim Rocktäschel, Edward Grefenstette, Karl Moritz Hermann, Tomáš Kočiský, and Phil Blunsom. 2015. Reasoning about entailment with neural attention. arXiv preprint arXiv:1509.06664.

Sara Rosenthal, Noura Farra, and Preslav Nakov. 2017. SemEval-2017 Task 4: Sentiment analysis in Twitter. In Proceedings of SemEval. Vancouver, Canada.

Mickael Rouvier and Benoît Favre. 2016. SENSEI-LIF at SemEval-2016 Task 4: Polarity embedding fusion for robust sentiment analysis. In Proceedings of SemEval. pages 202–208.

Toby Segaran and Jeff Hammerbacher. 2009. Beautiful Data: The Stories Behind Elegant Data Solutions. O'Reilly Media, Inc.

Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1):1929–1958.

Duyu Tang, Bing Qin, Xiaocheng Feng, and Ting Liu. 2015. Target-dependent sentiment classification with long short term memory. CoRR abs/1512.01100.

Theano Dev Team. 2016. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints abs/1605.02688.

Yequan Wang, Minlie Huang, Xiaoyan Zhu, and Li Zhao. 2016. Attention-based LSTM for aspect-level sentiment classification. In Proceedings of EMNLP. pages 606–615.

Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of NAACL-HLT. pages 1480–1489.
