Opinion Mining for Social Media and News Items in Romanian
Claudia Cârdei, Filip Manișor, Traian Rebedea
University Politehnica of Bucharest
[email protected]
Jun 24, 2015
Overview
• Introduction
• Previous Work
– English
– Romanian
• Proposed Solutions
• Opinionated Corpus
• Results and Comparisons
• Conclusions
Introduction
• Sentiment analysis and opinion mining research has mainly concentrated on English and other major languages (Spanish, Chinese, etc.)
– Various commercial and open-source solutions exist, mainly for English
– Corpora of opinionated texts and databases of affective words (general or domain-specific) also exist for these languages
• Objective: develop an opinion mining solution for Romanian texts gathered from a wide range of online sources (mostly social media and news items)
13.04.23ICSCS 2013 . K-TEAMS 2013 Workshop
Opinion Mining for Social Media and News Items in Romanian 3
Introduction
• Popular research domain in recent years
• Sentiment, subjectivity, opinion, publicity
– Related, but somewhat different
• Sentiment or subjectivity in a text:
– Positive, negative or neutral
– Subjective or objective
• Opinionated text
– Opinion author
– Opinion target (subject)
– Opinion (affective) words
– Opinion polarity
E.g. President Obama declared that the US immigration system is broken.
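The four components of an opinionated text can be pictured as a small record. The sketch below is only an illustration: the class name, the field names and the way the example sentence is annotated are assumptions, not something the slides specify.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Opinion:
    """One opinion extracted from a text (field names are illustrative)."""
    author: str                      # who holds the opinion
    target: str                      # entity or topic the opinion is about
    affective_words: List[str] = field(default_factory=list)
    polarity: str = "neutral"        # "positive", "negative" or "neutral"


# One possible reading of the example sentence (an assumed annotation).
example = Opinion(
    author="President Obama",
    target="US immigration system",
    affective_words=["broken"],
    polarity="negative",
)
print(example)
```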
Previous Work - English
• Many studies and corpora in different domains
• The movie reviews dataset is very popular
• Initial results using BoW, punctuation, etc.
– Accuracy ≈ 80%
• Improvement: finding relations/dependencies between opinion targets and affective words
– Accuracy ≈ 84%
• Mining frequent dependency subtrees for positive and negative reviews and using an SVM with these subtrees as features
– Accuracy ≈ 88%
Previous Work - Romanian
• Use machine translation to generate English texts, then apply opinion mining
• Translate affective word databases into Romanian (e.g. WordNet Affect)
• Develop new affective word lists
• Train and evaluate on specific corpora in Romanian
• Problems with NER, dependency parsing, affective word scores
Proposed Solutions
• Supervised solution trained for several different opinion subjects (entities)
• Three approaches
– Bag of words
– Affective words and dependency parsing
– N-grams probabilities
Bag of Words
• Bag of words model:
– Tokenization, diacritics restoration, lemmatization
– Distinct lemmas selected as features
– Improvements: POS filter, word n-grams filter
– Used both binary features and TF-IDF
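As a minimal sketch of the two feature types named above, the following code builds binary and TF-IDF vectors over distinct lemmas. It assumes the documents are already tokenized and lemmatized; the function names, the toy documents and the particular TF-IDF variant (relative term frequency times log(N / df)) are assumptions made for illustration, not the authors' implementation.

```python
import math
from collections import Counter


def build_vocabulary(docs):
    """Sorted list of the distinct lemmas seen in the corpus."""
    return sorted({lemma for doc in docs for lemma in doc})


def binary_features(doc, vocab):
    """1 if the lemma occurs in the document, else 0."""
    present = set(doc)
    return [1 if lemma in present else 0 for lemma in vocab]


def tfidf_features(doc, vocab, docs):
    """Relative term frequency times log(N / document frequency)."""
    n_docs = len(docs)
    counts = Counter(doc)
    features = []
    for lemma in vocab:
        tf = counts[lemma] / len(doc) if doc else 0.0
        df = sum(1 for d in docs if lemma in d)
        idf = math.log(n_docs / df) if df else 0.0
        features.append(tf * idf)
    return features


# Toy documents, already tokenized and lemmatized (hypothetical data).
docs = [
    ["bere", "bun", "rece"],
    ["bancă", "comision", "mare"],
    ["bere", "scump"],
]
vocab = build_vocabulary(docs)
print(vocab)
print(binary_features(docs[0], vocab))
print(tfidf_features(docs[0], vocab, docs))
```

A POS filter would simply drop lemmas whose part-of-speech tag is not in an allowed set before `build_vocabulary` is called.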
Affective Scores & Dependency Parsing
• Compute affective word scores in Romanian:
– Translate all the adjectives and adverbs from the English WordNet into Romanian using Google Translate
– Use the probability of each translation pair
• Several affective score databases have been translated: SentiWordNet, SenticNet 2 and ANEW
• Used the UAIC Romanian FDG parser to identify dependencies between the subject entity and adjectives or adverbs
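The slides do not say how the translation probabilities are combined with the English scores. One plausible reading, sketched below, gives each Romanian word the probability-weighted mean of the scores of the English words that translate to it. The combination rule, the function name and all scores and probabilities in the example are assumptions for illustration only.

```python
from collections import defaultdict


def project_scores(english_scores, translations):
    """Project English affective scores onto Romanian words.

    english_scores: {english_word: score}
    translations:   [(english_word, romanian_word, probability), ...]
    Each Romanian word gets the probability-weighted mean of the scores
    of the English words that translate to it (an assumed rule).
    """
    weighted_sum = defaultdict(float)
    total_weight = defaultdict(float)
    for en, ro, prob in translations:
        if en not in english_scores:
            continue  # no affective score known for this English word
        weighted_sum[ro] += prob * english_scores[en]
        total_weight[ro] += prob
    return {ro: weighted_sum[ro] / total_weight[ro]
            for ro in weighted_sum if total_weight[ro] > 0}


# Hypothetical scores and translation probabilities.
english_scores = {"good": 0.75, "fine": 0.5, "bad": -0.625}
translations = [
    ("good", "bun", 0.75),
    ("fine", "bun", 0.25),
    ("bad", "rău", 1.0),
]
romanian_scores = project_scores(english_scores, translations)
print(romanian_scores)
```

With scores of this kind, the polarity toward a subject entity could then be estimated from the adjectives and adverbs that the parser links to that entity.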
N-grams Probabilities
• Compute the conditional probability for each n-gram in the corpus given that the document is either positive or negative
• Then use the following score for each n-gram (feature f):
• The score of a new text is computed by summing the scores for each of the n-grams existing in that text
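The per-n-gram score formula did not survive in the extracted slide text, so the sketch below substitutes a common choice and should not be read as the authors' formula: it assumes score(f) = log(P(f | positive) / P(f | negative)), with each conditional probability estimated as the add-one smoothed fraction of documents of that class containing the n-gram. Function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def train_scores(docs, labels, n=2):
    """Return {ngram: log(P(f | positive) / P(f | negative))}.

    P(f | class) is the add-one smoothed fraction of documents of that
    class containing the n-gram (an assumed estimator).
    """
    doc_count = Counter(labels)
    containing = {"positive": Counter(), "negative": Counter()}
    for tokens, label in zip(docs, labels):
        containing[label].update(set(ngrams(tokens, n)))
    vocab = set(containing["positive"]) | set(containing["negative"])
    scores = {}
    for f in vocab:
        p_pos = (containing["positive"][f] + 1) / (doc_count["positive"] + 2)
        p_neg = (containing["negative"][f] + 1) / (doc_count["negative"] + 2)
        scores[f] = math.log(p_pos / p_neg)
    return scores


def score_text(tokens, scores, n=2):
    """Sum the scores of the known n-grams occurring in the text."""
    return sum(scores.get(f, 0.0) for f in ngrams(tokens, n))


# Toy training data (hypothetical).
docs = [
    "servicii foarte bune".split(),
    "bere foarte bună".split(),
    "servicii foarte proaste".split(),
    "comisioane foarte mari".split(),
]
labels = ["positive", "positive", "negative", "negative"]
scores = train_scores(docs, labels, n=2)
print(score_text("bere foarte bună".split(), scores))
```

Under this assumed score, a text would be labeled positive when its summed score is above zero and negative when it is below.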
Opinionated Corpus
• Corpus manually annotated by analysts for their customers (created by Treeworks for their product ZeList, www.zelist.ro)
• ZeList indexes most of the texts published in Romanian on the most popular social networks, blogs, online forums, news websites, etc.
• Used data for seven different entities (companies or brands), ranging from banks and beer brands to web publishers and media corporations
• The names of the entities have been anonymized
Opinionated Corpus
• Problems:
– These texts are very noisy, very heterogeneous, from a wide range of sources and with different writing styles (e.g. Twitter vs. news items)
– Some of them might also express positive and negative publicity rather than opinions
Opinionated Corpus
• Data about the first version of the corpus
• Data collection ranged from a couple of months to a couple of years, depending on the entity
• The second version contained a larger export of data for each entity
Entity  Total items  Neutral  Opinionated  Positive  Negative
Ent1    6055         5853     202          29        173
Ent2    2240         1961     279          222       57
Ent3    343          260      83           64        19
Ent4    1168         876      292          120       172
Ent5    539          520      19           17        2
Ent6    1025         570      455          330       125
Ent7    3787         3016     771          593       178
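The class counts in this table are heavily skewed for some entities, which matters when reading accuracy figures. As a rough aid that is not part of the original slides, the sketch below computes, for each entity, the accuracy a trivial classifier would reach by always predicting the larger class. The counts are copied from the table; the function name is hypothetical.

```python
# Counts copied from the first-version corpus table.
corpus = {
    "Ent1": {"total": 6055, "neutral": 5853, "positive": 29, "negative": 173},
    "Ent2": {"total": 2240, "neutral": 1961, "positive": 222, "negative": 57},
    "Ent3": {"total": 343, "neutral": 260, "positive": 64, "negative": 19},
    "Ent4": {"total": 1168, "neutral": 876, "positive": 120, "negative": 172},
    "Ent5": {"total": 539, "neutral": 520, "positive": 17, "negative": 2},
    "Ent6": {"total": 1025, "neutral": 570, "positive": 330, "negative": 125},
    "Ent7": {"total": 3787, "neutral": 3016, "positive": 593, "negative": 178},
}


def majority_baseline(counts):
    """Accuracy of always predicting the larger class, for both tasks."""
    opinionated = counts["positive"] + counts["negative"]
    opinion_neutral = max(counts["neutral"], opinionated) / counts["total"]
    positive_negative = max(counts["positive"], counts["negative"]) / opinionated
    return opinion_neutral, positive_negative


for name, counts in corpus.items():
    on, pn = majority_baseline(counts)
    print(f"{name}: opinion-neutral {on:.2%}, positive-negative {pn:.2%}")
```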
Results - Outline
• Results obtained for the first version of the corpus, for all entities
• Positive-negative accuracy should be the more relevant measure
• Good results for entities with more data, poor results for the ones with a small number of opinionated texts
Entity  Total items  Neutral  Opinionated  Accuracy opinion-neutral  Accuracy positive-negative
Ent1    6055         5853     202          97.01%                    92.07%
Ent2    2240         1961     279          91.79%                    87.81%
Ent3    343          260      83           84.84%                    89.15%
Ent4    1168         876      292          86.22%                    82.19%
Ent5    539          520      19           97.40%                    57.89%
Ent6    1025         570      455          76.20%                    84.17%
Ent7    3787         3016     771          81.75%                    83.65%
Results - Comparison
• Comparison of the solutions presented above, using the second (larger) version of the corpus
• Only for one entity, by extracting a balanced dataset with 700 positive and 700 negative opinionated texts
Method                                         Accuracy
BoW + POS filter                               81.31%
BoW only adj.                                  70.89%
BoW only adj. & adv.                           76.60%
Frequent bigrams                               80.88%
Frequent trigrams                              76.60%
Affective scores + dependency parsing          52.18%
Affective scores (comparison with 0 decision)  55.35%
Trigrams probabilities                         88.44%
Bigrams probabilities                          72.54%
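The comparison uses a balanced subset of 700 positive and 700 negative texts for one entity. How that subset was drawn is not described; the sketch below assumes a uniform random draw per class with a fixed seed. The function name and the stand-in data are hypothetical.

```python
import random


def balanced_sample(items, per_class, seed=0):
    """Pick `per_class` positive and `per_class` negative items at random.

    items: list of (text, label) pairs, label "positive" or "negative".
    """
    rng = random.Random(seed)
    sample = []
    for label in ("positive", "negative"):
        pool = [item for item in items if item[1] == label]
        if len(pool) < per_class:
            raise ValueError(f"only {len(pool)} {label} items available")
        sample.extend(rng.sample(pool, per_class))
    rng.shuffle(sample)  # avoid having all positives first
    return sample


# Hypothetical stand-in for one entity's opinionated texts.
items = [(f"text {i}", "positive") for i in range(900)]
items += [(f"text {i}", "negative") for i in range(900, 1700)]
subset = balanced_sample(items, per_class=700)
print(len(subset))
```

With a balanced subset, a classifier that always predicts one class scores 50%, which gives a reference point for the accuracies in the table.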
Conclusions
• Several alternatives for determining the opinion polarity have been evaluated on a corpus manually annotated for different Romanian entities
• Best results obtained at this moment: BoW plus a POS filter or a frequent bigrams approach + SVM classifier
• The Romanian FDG parser does not provide good accuracy for the dependency parsing task, especially for texts from social media
– Texts are somewhat freely written, with little regard for usual form or structure
– Improvements to this method and to the affective words database are still possible
Thank you!
• Questions?
• Discussions
13.04.23 CSCS 2013 – Bucharest, Romania 18