Opinion mining for social media and news items in Romanian

Authors

UNIVERSITY POLITEHNICA OF BUCHAREST

Opinion Mining for Social Media and News Items in Romanian

Claudia Cârdei, Filip Manișor, Traian Rebedea

Overview

• Introduction
• Previous Work
– English
– Romanian
• Proposed Solutions
• Opinionated Corpus
• Results and Comparisons
• Conclusions

13.04.23 Bachelor's Thesis Session - July 2012

Introduction

• Sentiment analysis and opinion mining research has mainly concentrated on English and other major languages (Spanish, Chinese, etc.)
– Various commercial and open-source solutions exist, mainly for English
– Corpora of opinionated texts and databases of affective words (general or domain specific) also exist for these languages

• Objective: develop an opinion mining solution for Romanian texts gathered from a wide range of online sources (mostly social media and news items)

13.04.23 ICSCS 2013, K-TEAMS 2013 Workshop


Introduction

• Popular research domain in recent years
• Sentiment, subjectivity, opinion, publicity
– Related, but somewhat different
• Sentiment or subjectivity in a text:
– Positive, negative or neutral
– Subjective or objective
• Opinionated text
– Opinion author
– Opinion target (subject)
– Opinion (affective) words
– Opinion polarity

E.g. President Obama declared that the US immigration system is broken.
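The four components of an opinionated text can be held in a small record. The sketch below is only an illustration: the class name `Opinion`, its field names, and the way the example sentence is annotated (author, target, affective word, polarity) are assumptions made here, not annotations taken from the slides.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Opinion:
    """One opinion found in a text (hypothetical structure, for illustration)."""
    author: str                      # who holds the opinion
    target: str                      # what the opinion is about (subject)
    affective_words: List[str] = field(default_factory=list)
    polarity: str = "neutral"        # "positive", "negative" or "neutral"


# Assumed reading of the example sentence about the US immigration system.
example = Opinion(
    author="President Obama",
    target="US immigration system",
    affective_words=["broken"],
    polarity="negative",
)
print(example)
```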



Previous Work - English

• Many studies and corpora in different domains
• The movie reviews dataset is very popular
• Initial results using BoW, punctuation, etc.
– Accuracy ≈ 80%
• Improvement: finding relations/dependencies between opinion targets and affective words
– Accuracy ≈ 84%
• Mining frequent dependency subtrees for positive and negative reviews and using an SVM with these subtrees as features
– Accuracy ≈ 88%



Previous Work - Romanian

• Use machine translation to generate English texts, then apply opinion mining
• Translate affective word databases into Romanian (e.g. WordNet Affect)
• Developing new affective word lists
• Training and evaluation on specific corpora in Romanian
• Problems with NER, dependency parsing, affective word scores



Proposed Solutions

• Supervised solution trained for several different opinion subjects (entities)

• Three approaches
– Bag of words
– Affective words and dependency parsing
– N-grams probabilities



Bag of Words

• Bag of words model:
– Tokenization, diacritics restoration, lemmatization
– Distinct lemmas selected as features
– Improvements: POS filter, word n-grams filter
– Used both binary features and TF-IDF
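The two feature variants (binary and TF-IDF) can be sketched in plain Python. This is a minimal illustration, not the authors' pipeline: tokenization is reduced to whitespace splitting, the toy documents are assumed to be already lemmatized, and the TF-IDF weighting (relative term frequency times log of inverse document frequency) is one common variant chosen here as an assumption.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; stands in for real tokenization
    and lemmatization, which are not reproduced here."""
    return text.lower().split()


def build_vocabulary(docs):
    """Distinct tokens (lemmas) across all documents, sorted for stable order."""
    return sorted({tok for doc in docs for tok in tokenize(doc)})


def binary_features(doc, vocab):
    """1 if the vocabulary term occurs in the document, else 0."""
    present = set(tokenize(doc))
    return [1 if term in present else 0 for term in vocab]


def tfidf_features(docs, vocab):
    """Relative term frequency times log(N / document frequency)."""
    n_docs = len(docs)
    tokenized = [tokenize(d) for d in docs]
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append([
            (tf[term] / len(toks)) * math.log(n_docs / df[term]) if toks else 0.0
            for term in vocab
        ])
    return vectors


# Toy, made-up documents ("bun" = good, "slab" = weak, "serviciu" = service).
docs = ["serviciu bun bun", "serviciu slab"]
vocab = build_vocabulary(docs)
binary = [binary_features(d, vocab) for d in docs]
tfidf = tfidf_features(docs, vocab)
```

A term that appears in every document ("serviciu" here) gets a TF-IDF weight of zero, which is why TF-IDF can behave differently from binary features on frequent words.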



Affective Scores & Dependency Parsing

• Compute affective word scores in Romanian:
– Translate all the adjectives and adverbs from the English WordNet into Romanian using Google Translate
– Uses the probability of each translation pair

• Several affective score databases have been translated: SentiWordNet, SenticNet 2 and ANEW

• Used the UAIC Romanian FDG parser to identify dependencies between the subject entity and adjectives or adverbs
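The slides state that translation-pair probabilities are used but do not give the formula. One plausible reading, shown below purely as an assumption, is a probability-weighted average of the English scores over all translations of a Romanian word. The function name, the scores, and the probabilities are all made up for illustration.

```python
def romanian_affective_score(translations, english_scores):
    """Probability-weighted average of English affective scores.

    translations: list of (english_word, probability) pairs for one
    Romanian word. English words without a score are ignored.
    This weighting scheme is an assumption, not the documented method.
    """
    weighted = 0.0
    total_prob = 0.0
    for en_word, prob in translations:
        if en_word in english_scores:
            weighted += prob * english_scores[en_word]
            total_prob += prob
    return weighted / total_prob if total_prob > 0 else 0.0


# Hypothetical English scores and translation probabilities for "bun".
english_scores = {"good": 0.75, "fine": 0.5}
translations_bun = [("good", 0.8), ("fine", 0.2)]
score_bun = romanian_affective_score(translations_bun, english_scores)
print(score_bun)
```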



N-grams Probabilities

• Compute the conditional probability for each n-gram in the corpus given that the document is either positive or negative

• Then use the following score for each n-gram (feature f):

• The score of a new text is computed by summing the scores for each of the n-grams existing in that text
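The scoring formula itself is not preserved in this copy of the slides. The sketch below therefore assumes a common choice: the log-ratio of add-one smoothed conditional probabilities, log P(f | positive) - log P(f | negative), summed over the n-grams of a new text. The function names and toy documents are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def train_scores(pos_docs, neg_docs, n=2):
    """Assumed score per n-gram f: log P(f|pos) - log P(f|neg),
    with add-one smoothing over the joint n-gram vocabulary."""
    pos_counts = Counter(g for d in pos_docs for g in ngrams(d.split(), n))
    neg_counts = Counter(g for d in neg_docs for g in ngrams(d.split(), n))
    vocab = set(pos_counts) | set(neg_counts)
    pos_total = sum(pos_counts.values()) + len(vocab)
    neg_total = sum(neg_counts.values()) + len(vocab)
    return {
        g: math.log((pos_counts[g] + 1) / pos_total)
           - math.log((neg_counts[g] + 1) / neg_total)
        for g in vocab
    }


def score_text(text, scores, n=2):
    """Sum the scores of the known n-grams in the text; unseen n-grams add 0."""
    return sum(scores.get(g, 0.0) for g in ngrams(text.split(), n))


# Toy, made-up training data ("foarte" = very, "bun" = good, "slab" = weak).
pos_docs = ["serviciu foarte bun", "foarte bun produs"]
neg_docs = ["serviciu foarte slab"]
scores = train_scores(pos_docs, neg_docs, n=2)

# Under this assumed score, a sum above 0 leans positive, below 0 negative.
print(score_text("serviciu foarte bun", scores))
print(score_text("serviciu foarte slab", scores))
```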



Opinionated Corpus

• Corpus manually annotated by analysts for their customers (created by Treeworks for their product ZeList, www.zelist.ro)
• ZeList indexes most of the texts published in Romanian on the most popular social networks, blogs, online forums, news websites, etc.
• Used data for seven different entities (companies or brands), ranging from banks and beer brands to web publishers and media corporations
• The names of the entities have been anonymized



Opinionated Corpus

• Problems:
– These texts are very noisy and very heterogeneous, coming from a wide range of sources and with different writing styles (e.g. Twitter vs. news items)
– Some of them might also express positive and negative publicity rather than opinions



Opinionated Corpus

• Data about the first version of the corpus
• Data collection ranged from a couple of months to a couple of years, depending on the entity
• The second version contained a larger export of data for each entity

Entity  Total items  Neutral  Opinionated  Positive  Negative
Ent1           6055     5853          202        29       173
Ent2           2240     1961          279       222        57
Ent3            343      260           83        64        19
Ent4           1168      876          292       120       172
Ent5            539      520           19        17         2
Ent6           1025      570          455       330       125
Ent7           3787     3016          771       593       178

Results - Outline

• Results obtained for the first version of the corpus, for all entities
• Accuracy positive-negative should be more relevant
• Good results for entities with more data, poor results for the ones with a small number of opinionated texts

Entity  Total items  Neutral  Opinionated  Accuracy opinion-neutral  Accuracy positive-negative
Ent1           6055     5853          202                    97.01%                      92.07%
Ent2           2240     1961          279                    91.79%                      87.81%
Ent3            343      260           83                    84.84%                      89.15%
Ent4           1168      876          292                    86.22%                      82.19%
Ent5            539      520           19                    97.40%                      57.89%
Ent6           1025      570          455                    76.20%                      84.17%
Ent7           3787     3016          771                    81.75%                      83.65%
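One way to put these accuracies in context is to compare them with a majority-class baseline (always predicting the most frequent class), computed from the corpus counts given earlier. The baseline is not part of the slides; it is added here only as a reading aid, and the function name is hypothetical.

```python
# (neutral, positive, negative) counts per entity, first corpus version.
corpus = {
    "Ent1": (5853, 29, 173),
    "Ent2": (1961, 222, 57),
    "Ent3": (260, 64, 19),
    "Ent4": (876, 120, 172),
    "Ent5": (520, 17, 2),
    "Ent6": (570, 330, 125),
    "Ent7": (3016, 593, 178),
}


def majority_baselines(neutral, positive, negative):
    """Accuracy of always predicting the most frequent class, for the
    opinion-neutral task and for the positive-negative task."""
    opinionated = positive + negative
    total = neutral + opinionated
    opinion_neutral = max(neutral, opinionated) / total
    positive_negative = max(positive, negative) / opinionated
    return opinion_neutral, positive_negative


for entity, counts in corpus.items():
    on, pn = majority_baselines(*counts)
    print(f"{entity}: opinion-neutral {on:.2%}, positive-negative {pn:.2%}")
```

For Ent1, for example, the neutral class alone covers 5853 of 6055 items (about 96.66%), so the opinion-neutral accuracy is dominated by class imbalance, which is consistent with the remark that the positive-negative accuracy is the more relevant figure. For Ent5, with only 19 opinionated texts, the majority class covers 17 of 19.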

Results - Comparison

• Comparison of the solutions presented above, using the second (larger) version of the corpus
• Only for one entity, by extracting a balanced dataset with 700 positive and 700 negative opinionated texts

Method                                         Accuracy
BoW + POS filter                               81.31%
BoW only adj.                                  70.89%
BoW only adj. & adv.                           76.60%
Frequent bigrams                               80.88%
Frequent trigrams                              76.60%
Affective scores + dependency parsing          52.18%
Affective scores (comparison with 0 decision)  55.35%
Trigrams probabilities                         88.44%
Bigrams probabilities                          72.54%

Conclusions

• Several alternatives for determining the opinion polarity have been evaluated on a corpus manually annotated for different Romanian entities
• Best results obtained at this moment: BoW plus a POS filter or a frequent bigrams approach + SVM classifier
• The Romanian FDG parser does not provide good accuracy for the dependency parsing task, especially for texts from social media
– Texts are somewhat freely written, with little regard for usual form or structure
– Improvements to this method and the affective words database are still possible



Thank you!

• Questions?

• Discussions

13.04.23 CSCS 2013 – Bucharest, Romania
