Page 1: Opinion mining for social media and news items in Romanian

Authors

UNIVERSITY POLITEHNICA OF BUCHAREST

Opinion Mining for Social Media and News Items in Romanian

Claudia Cârdei, Filip Manișor, Traian Rebedea [email protected]

Page 2: Opinion mining for social media and news items in Romanian

Overview

• Introduction
• Previous Work
  – English
  – Romanian
• Proposed Solutions
• Opinionated Corpus
• Results and Comparisons
• Conclusions

13.04.23 Bachelor's Thesis Session - July 2012

Page 3: Opinion mining for social media and news items in Romanian

Introduction

• Sentiment analysis and opinion mining research has mainly concentrated on English and other major languages (Spanish, Chinese, etc.)
  – Various commercial and open-source solutions exist, mainly for English
  – Corpora of opinionated texts and databases of affective words (general or domain-specific) also exist for these languages

• Objective: develop an opinion mining solution for Romanian texts gathered from a wide range of online sources (mostly social media and news items)

13.04.23 ICSCS 2013 . K-TEAMS 2013 Workshop

Opinion Mining for Social Media and News Items in Romanian 3

Page 4: Opinion mining for social media and news items in Romanian

Introduction

• Popular research domain in recent years
• Sentiment, subjectivity, opinion, publicity
  – Related, but somewhat different
• Sentiment or subjectivity in a text:
  – Positive, negative or neutral
  – Subjective or objective
• Opinionated text
  – Opinion author
  – Opinion target (subject)
  – Opinion (affective) words
  – Opinion polarity

E.g. President Obama declared that the US immigration system is broken.
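
The four components of an opinionated text can be held in a small record type. The sketch below is only an illustration: the `Opinion` class and its field names are hypothetical, and the annotation of the example sentence is one plausible reading of it, not an annotation provided by the authors.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Opinion:
    """Hypothetical container for the parts of an opinionated text."""
    author: str                 # who holds the opinion
    target: str                 # entity or topic the opinion is about
    affective_words: List[str]  # words that carry the sentiment
    polarity: str               # "positive", "negative" or "neutral"


# One plausible reading of the example sentence (an assumption, not gold data).
example = Opinion(
    author="President Obama",
    target="US immigration system",
    affective_words=["broken"],
    polarity="negative",
)
print(example)
```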

Page 5: Opinion mining for social media and news items in Romanian

Previous Work - English

Page 6: Opinion mining for social media and news items in Romanian

Previous Work - English

• Lots of studies and corpora in different domains
• The movie reviews dataset – very popular
• Initial results using BoW, punctuation, etc.
  – Accuracy ≈ 80%
• Improvement: finding relations/dependencies between opinion targets and affective words
  – Accuracy ≈ 84%
• Mining frequent dependency subtrees for positive and negative reviews and using an SVM with these subtrees as features
  – Accuracy ≈ 88%

Page 7: Opinion mining for social media and news items in Romanian

Previous Work - Romanian

• Use machine translation to generate English texts, then apply opinion mining
• Translate affective word databases into Romanian (e.g. WordNet Affect)
• Develop new affective word lists
• Train and evaluate on specific corpora in Romanian
• Problems with NER, dependency parsing, affective word scores

Page 8: Opinion mining for social media and news items in Romanian

Proposed Solutions

• Supervised solution trained for several different opinion subjects (entities)
• Three approaches
  – Bag of words
  – Affective words and dependency parsing
  – N-gram probabilities

Page 9: Opinion mining for social media and news items in Romanian

Bag of Words

• Bag of words model:
  – Tokenization, diacritics restoration, lemmatization
  – Distinct lemmas selected as features
  – Improvements: POS filter, word n-grams filter
  – Used both binary features and TF-IDF
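
A minimal sketch of the two feature encodings, assuming the documents have already been tokenized, had their diacritics restored and been lemmatized upstream. The toy documents, the function names and the exact TF-IDF variant (relative term frequency times log(N/df)) are assumptions made for illustration; the slides do not specify which weighting formula was used.

```python
import math
from collections import Counter


def build_vocab(docs):
    """Distinct lemmas across all documents, in a fixed (sorted) order."""
    return sorted({lemma for doc in docs for lemma in doc})


def binary_features(docs, vocab):
    """1 if the lemma occurs in the document, 0 otherwise."""
    vectors = []
    for doc in docs:
        present = set(doc)
        vectors.append([1 if lemma in present else 0 for lemma in vocab])
    return vectors


def tfidf_features(docs, vocab):
    """Relative term frequency times log(N / document frequency).

    Assumes every document is non-empty.
    """
    n_docs = len(docs)
    df = Counter(lemma for doc in docs for lemma in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([
            (counts[lemma] / len(doc)) * math.log(n_docs / df[lemma])
            for lemma in vocab
        ])
    return vectors


# Toy, already-lemmatized documents (hypothetical data).
docs = [
    ["bere", "bun", "bun"],
    ["bere", "scump"],
    ["serviciu", "bun"],
]
vocab = build_vocab(docs)
binary = binary_features(docs, vocab)
tfidf = tfidf_features(docs, vocab)
print(vocab)
print(binary)
```

Either matrix can then be passed to a classifier; a POS filter would simply drop lemmas with unwanted tags before `build_vocab` is called.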

Page 10: Opinion mining for social media and news items in Romanian

Affective Scores & Dependency Parsing

• Compute affective word scores in Romanian:
  – Translate all the adjectives and adverbs from the English WordNet into Romanian using Google Translate
  – Use the probability of each translation pair
• Several affective score databases have been translated: SentiWordNet, SenticNet 2 and ANEW
• Used the UAIC Romanian FDG parser to identify dependencies between the subject entity and adjectives or adverbs
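
One way the translation probabilities could be combined with English affective scores is a probability-weighted average; this is an assumption, since the slides only state that the translation-pair probabilities are used. In the sketch below all scores, probabilities and function names are hypothetical, and the list of adjectives and adverbs attached to the entity is taken as given (in the described system it would come from the FDG parser).

```python
def romanian_score(translations, english_scores):
    """Probability-weighted average of the English scores of a word's translations.

    translations: list of (english_word, probability) pairs for one Romanian word.
    english_scores: dict mapping an English word to a polarity score in [-1, 1].
    """
    known = [(p, english_scores[en]) for en, p in translations if en in english_scores]
    total_p = sum(p for p, _ in known)
    if total_p == 0:
        return 0.0  # no scored translation available
    return sum(p * s for p, s in known) / total_p


def classify(dependent_words, lexicon):
    """Sum the scores of the words linked to the entity and compare with 0.

    A total of exactly 0 is reported as negative in this sketch.
    """
    total = sum(lexicon.get(word, 0.0) for word in dependent_words)
    return "positive" if total > 0 else "negative"


# Hypothetical English scores and translation probabilities.
english_scores = {"good": 0.75, "fine": 0.5, "bad": -0.75}
lexicon = {
    "bun": romanian_score([("good", 0.8), ("fine", 0.2)], english_scores),
    "prost": romanian_score([("bad", 1.0)], english_scores),
}
print(lexicon)
print(classify(["bun"], lexicon))
```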

Page 11: Opinion mining for social media and news items in Romanian

N-grams Probabilities

• Compute the conditional probability for each n-gram in the corpus given that the document is either positive or negative

• Then use the following score for each n-gram (feature f):

• The score of a new text is computed by summing the scores for each of the n-grams existing in that text
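
A minimal sketch of this scoring scheme. It assumes a smoothed log-ratio, log(P(f|positive) / P(f|negative)), as the per-n-gram score, with P(f|class) estimated as the fraction of documents of that class containing f. This particular formula and the add-one smoothing are assumptions, not necessarily the score used by the authors; the toy documents are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def train_scores(pos_docs, neg_docs, n=2, alpha=1.0):
    """Assumed score: log of the smoothed ratio P(f | positive) / P(f | negative)."""
    pos_df = Counter(g for doc in pos_docs for g in set(ngrams(doc, n)))
    neg_df = Counter(g for doc in neg_docs for g in set(ngrams(doc, n)))
    scores = {}
    for g in set(pos_df) | set(neg_df):
        p_pos = (pos_df[g] + alpha) / (len(pos_docs) + 2 * alpha)
        p_neg = (neg_df[g] + alpha) / (len(neg_docs) + 2 * alpha)
        scores[g] = math.log(p_pos / p_neg)
    return scores


def score_text(tokens, scores, n=2):
    """Sum of the scores of the n-grams found in the text; unseen n-grams add 0."""
    return sum(scores.get(g, 0.0) for g in ngrams(tokens, n))


# Toy training data (hypothetical).
pos_docs = [["foarte", "bun"], ["foarte", "bun", "serviciu"]]
neg_docs = [["foarte", "slab"], ["serviciu", "slab"]]
scores = train_scores(pos_docs, neg_docs, n=2)

# Under this assumed score, a total above 0 suggests a positive text.
print(score_text(["foarte", "bun"], scores))
print(score_text(["foarte", "slab"], scores))
```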

Page 12: Opinion mining for social media and news items in Romanian

Opinionated Corpus

• Corpus manually annotated by analysts for their customers (created by Treeworks for their product ZeList, www.zelist.ro)
• ZeList indexes most of the texts published in Romanian on the most popular social networks, blogs, online forums, news websites, etc.
• Used data for seven different entities (companies or brands), ranging from banks and beer brands to web publishers and media corporations
• The names of the entities have been anonymized

Page 13: Opinion mining for social media and news items in Romanian

Opinionated Corpus

• Problems:
  – These texts are very noisy and very heterogeneous, coming from a wide range of sources and with different writing styles (e.g. Twitter vs. news items)
  – Some of them might also express positive and negative publicity rather than opinions

Page 14: Opinion mining for social media and news items in Romanian

Opinionated Corpus

• Data about the first version of the corpus
• Data collection ranged from a couple of months to a couple of years, depending on the entity
• The second version contained a larger export of data for each entity

Entity  Total items  Neutral  Opinionated  Positive  Negative
Ent1    6055         5853     202          29        173
Ent2    2240         1961     279          222       57
Ent3    343          260      83           64        19
Ent4    1168         876      292          120       172
Ent5    539          520      19           17        2
Ent6    1025         570      455          330       125
Ent7    3787         3016     771          593       178
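
The class imbalance in these counts can be quantified by the accuracy a trivial majority-class classifier would obtain on each task. The sketch below copies the counts from the table; the function name and the idea of reporting a majority baseline are additions for illustration, not part of the original evaluation.

```python
# (total, neutral, opinionated, positive, negative) per entity, from the table.
corpus = {
    "Ent1": (6055, 5853, 202, 29, 173),
    "Ent2": (2240, 1961, 279, 222, 57),
    "Ent3": (343, 260, 83, 64, 19),
    "Ent4": (1168, 876, 292, 120, 172),
    "Ent5": (539, 520, 19, 17, 2),
    "Ent6": (1025, 570, 455, 330, 125),
    "Ent7": (3787, 3016, 771, 593, 178),
}


def majority_baselines(total, neutral, opinionated, positive, negative):
    """Accuracy of always predicting the most frequent class, for both tasks."""
    opinion_neutral = max(neutral, opinionated) / total
    positive_negative = max(positive, negative) / opinionated
    return opinion_neutral, positive_negative


for name, row in corpus.items():
    on_base, pn_base = majority_baselines(*row)
    print(f"{name}: opinion-neutral {on_base:.2%}, positive-negative {pn_base:.2%}")
```

For heavily skewed entities such as Ent1 and Ent5, a high opinion-neutral accuracy is therefore easy to reach, which is worth keeping in mind when reading per-entity accuracies.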

Page 15: Opinion mining for social media and news items in Romanian

Results - Outline

• Results obtained for the first version of the corpus, for all entities
• The positive-negative accuracy should be the more relevant measure
• Good results for entities with more data, poor results for the ones with a small number of opinionated texts

Entity  Total items  Neutral  Opinionated  Accuracy opinion-neutral  Accuracy positive-negative
Ent1    6055         5853     202          97.01%                    92.07%
Ent2    2240         1961     279          91.79%                    87.81%
Ent3    343          260      83           84.84%                    89.15%
Ent4    1168         876      292          86.22%                    82.19%
Ent5    539          520      19           97.40%                    57.89%
Ent6    1025         570      455          76.20%                    84.17%
Ent7    3787         3016     771          81.75%                    83.65%

Page 16: Opinion mining for social media and news items in Romanian

Results - Comparison

• Comparison of the solutions presented above, using the second (larger) version of the corpus
• Only for one entity, by extracting a balanced dataset with 700 positive and 700 negative opinionated texts

Method                                          Accuracy
BoW + POS filter                                81.31%
BoW only adj.                                   70.89%
BoW only adj. & adv.                            76.60%
Frequent bigrams                                80.88%
Frequent trigrams                               76.60%
Affective scores + dependency parsing           52.18%
Affective scores (comparison with 0 decision)   55.35%
Trigrams probabilities                          88.44%
Bigrams probabilities                           72.54%
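
A balanced evaluation set of the kind used for this comparison (700 positive and 700 negative texts) could be drawn as sketched below. Uniform random sampling without replacement is an assumption, since the slides do not say how the 700 texts per class were selected; the function name and the stand-in data are hypothetical.

```python
import random


def balanced_sample(items, per_class=700, seed=0):
    """Draw per_class positive and per_class negative items at random.

    items: list of (text, label) pairs, label in {"positive", "negative"}.
    """
    rng = random.Random(seed)
    sample = []
    for label in ("positive", "negative"):
        pool = [item for item in items if item[1] == label]
        if len(pool) < per_class:
            raise ValueError(f"only {len(pool)} {label} texts available")
        sample.extend(rng.sample(pool, per_class))
    rng.shuffle(sample)  # avoid all positives preceding all negatives
    return sample


# Hypothetical stand-in data: 900 positive and 750 negative texts.
items = [(f"pos text {i}", "positive") for i in range(900)]
items += [(f"neg text {i}", "negative") for i in range(750)]
subset = balanced_sample(items)
print(len(subset))
```

With balanced classes, a majority-class baseline sits at 50%, so the accuracies in the comparison can be read directly against chance.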

Page 17: Opinion mining for social media and news items in Romanian

Conclusions

• Several alternatives for determining the opinion polarity have been evaluated on a corpus manually annotated for different Romanian entities
• Best results obtained at this moment: BoW plus a POS filter or a frequent bigrams approach + SVM classifier
• The Romanian FDG parser does not provide good accuracy for the dependency parsing task, especially for texts from social media
  – Texts are somewhat freely written, with little regard for usual form or structure
  – Improvements to this method and to the affective words database are still possible

Page 18: Opinion mining for social media and news items in Romanian

Thank you!

• Questions?

• Discussions

13.04.23 CSCS 2013 – Bucharest, Romania 18