Automatic term extraction of dynamically updated text collections for sentiment classification into three classes
Yuliya Rubtsova
The A.P. Ershov Institute of Informatics Systems (IIS)

Jun 11, 2015


The work presents an automatic term extraction approach for building a vocabulary that is constantly updated. The prepared dictionary is used for sentiment classification into three classes (positive, neutral, negative). The results of sentiment classification are described, and the accuracy of methods based on various weighting schemes is compared. The work also shows how the computational complexity of generating representations for N dynamic documents depends on the weighting scheme used.
Transcript
Page 1: Automatic term extraction of dynamically updated text collections for sentiment classification into three classes

Automatic term extraction of dynamically updated text collections for sentiment classification into three classes

Yuliya Rubtsova

The A.P. Ershov Institute of Informatics Systems (IIS)

Pages 2-7

Applied problems which can be solved with sentiment classification

analysis of consumer reviews of commercial products for businesses;

recommender systems;

human-machine interfaces that adapt a computer system's behavior to the current emotional state of the person;

psychological and medical diagnosis;

safety control by analyzing the behavior of mass gatherings;

assistance in carrying out investigative measures.

Page 8

Most common sentiment analysis approaches

Supervised machine learning

Dictionaries and rules

Combined method

Page 9

Existing corpora

Corpora of reviews which contain user marks

Each belongs to a single subject domain (movie reviews, book reviews, gadget reviews)

News corpora (few emotional texts)

Page 10

Filtration

Texts containing both positive and negative emotions;

Uninformative tweets (fewer than 40 characters long);

Copied texts and retweets.
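
The three filtering rules above can be sketched as a small Python function. This is only an illustration, not the code used in the work: the emoticon lists, the "RT" prefix test for retweets, and the name `keep_tweet` are assumptions made for the example. The slide itself states only which kinds of texts were filtered and the 40-character threshold.

```python
POSITIVE_MARKERS = (":)", ":-)", ":D")   # hypothetical positive emoticons
NEGATIVE_MARKERS = (":(", ":-(")         # hypothetical negative emoticons
MIN_LENGTH = 40                          # threshold stated on the slide


def keep_tweet(text, seen):
    """Return True if the tweet passes all three filters.

    `seen` is a set of already accepted texts, used to drop copies.
    """
    has_positive = any(m in text for m in POSITIVE_MARKERS)
    has_negative = any(m in text for m in NEGATIVE_MARKERS)
    if has_positive and has_negative:      # both emotions in one text
        return False
    if len(text) < MIN_LENGTH:             # too short to be informative
        return False
    normalized = text.strip().lower()
    if normalized.startswith("rt ") or normalized in seen:  # retweet or copy
        return False
    seen.add(normalized)
    return True
```

The caller keeps one `seen` set per collection pass, so a repeated text is accepted once and rejected afterwards.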

Page 11

Corpus of short texts consists of

114 991 – positive texts

111 923 – negative texts

107 990 – neutral texts

Page 12

Corpus of short texts

Collection type      Number of words   Number of unique words
Positive messages    1 559 176         150 720
Negative messages    1 445 517         191 677
Neutral messages     1 852 995         105 239

Page 13

Distribution of unique terms depending on the number of tweets

Page 14

Uniformity of the collections used

Word frequency distribution

Page 15

Most common approaches used for N-gram extraction

Manually, using a thesaurus.

Term extraction based on the significance of a term for the collection

Page 16

Data set characteristics

The entire data set is known

The entire data set is available

The entire data set is static (can’t change during calculation)

When a new document is added, the document frequency of many terms must be updated and all previously generated term weights need to be recalculated. For N documents in a data stream, the computational complexity is O(N²).
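
The quadratic growth can be illustrated by counting recalculations. The sketch below is a toy model and not a measurement from the work: it assumes that every newly arrived document forces the representations of all documents seen so far to be regenerated once, and the function name `recalculations` is hypothetical.

```python
def recalculations(n_documents):
    """Count representation updates for a stream of documents.

    Toy model: after the k-th document arrives, document frequencies change,
    so the representations of all k documents seen so far are regenerated.
    """
    total = 0
    for k in range(1, n_documents + 1):
        total += k  # regenerate representations of documents 1..k
    return total


for n in (10, 100, 1000):
    print(n, recalculations(n))
```

Under this model the total is N(N + 1) / 2, so a tenfold longer stream costs roughly a hundred times more work.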

Page 17

Human speech is constantly changing => there is a need to update emotional dictionaries

Page 18

Change in vocabulary and topics discussed

Percentage of references to the Olympic theme among all posts: February 12.00%, August 0.50%

Page 19

Change in vocabulary and topics discussed

Percentage of references to the vacation theme among all posts: February 0.06%, August 0.12%

Page 20

Change in vocabulary and topics discussed

Percentage of posts using the term “Sebyashka” (Russian for “selfie”): February 0.00%, August 0.02%

Page 21

Filtration

Punctuation: commas, colons, quotation marks (exclamation marks, question marks and ellipses were retained);

References to significant personalities and events;

Proper names;

Numerals;

All links were replaced with the word "Link" and were treated as a whole;

Runs of dots were replaced with an ellipsis.
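
The mechanical part of this filtering can be sketched with a few regular expressions. This is a minimal illustration under assumptions, not the author's preprocessing code: the patterns, the set of quotation characters, and the name `preprocess` are chosen for the example, and the removal of proper names and of references to personalities and events is left out because it needs linguistic resources the slide does not describe.

```python
import re

URL_RE = re.compile(r"https?://\S+")
DOTS_RE = re.compile(r"\.{2,}")
NUMBER_RE = re.compile(r"\d+")
PUNCT_RE = re.compile(r"[,:\"«»“”]")


def preprocess(text):
    """Apply the link, dot, numeral and punctuation rules to one message."""
    text = URL_RE.sub("Link", text)      # every link becomes the token "Link"
    text = DOTS_RE.sub("\u2026", text)   # a run of dots becomes one ellipsis
    text = NUMBER_RE.sub("", text)       # drop numerals
    text = PUNCT_RE.sub("", text)        # drop commas, colons, quotation marks
    return " ".join(text.split())        # collapse leftover whitespace


print(preprocess('Wow, look: "great" day!!! see http://t.co/abc123 ... 2014'))
```

Links are replaced first because a URL contains a colon and may contain digits, both of which the later rules would otherwise damage. Exclamation marks, question marks and the ellipsis character are never matched, so they are retained.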

Page 22

TF-ICF

C – the number of categories,

cf – the number of categories in which the weighted term is found
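
Only the symbol definitions are given here, so the sketch below assumes one commonly used form of the weight, tf × log(1 + C / cf). The exact formula used in the work may differ, and the function and parameter names are hypothetical.

```python
import math


def tf_icf(tf, n_categories, cf):
    """TF-ICF weight under the assumed form tf * log(1 + C / cf)."""
    if cf <= 0:
        raise ValueError("cf must be at least 1 for a term that occurs")
    return tf * math.log(1 + n_categories / cf)


# A term found in one of three categories outweighs an equally frequent
# term found in all three.
print(tf_icf(5, 3, 1) > tf_icf(5, 3, 3))
```

The category factor changes only when a term appears in a category for the first time, so adding a message to an existing category usually leaves it untouched.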

Page 23

TF-IDF

tf – the frequency of the term's occurrence in the collection (positive or negative tweets),

T – the total number of messages in the collections,

– the number of messages in the positive and negative collections containing the term
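
As an illustration, the sketch below assumes the classic form of the weight, tf × log(T / df), where df stands for the last quantity defined above (the number of messages containing the term). The symbol `df`, the function name, and the exact form of the formula are assumptions, not taken from the slides.

```python
import math


def tf_idf(tf, total_messages, messages_with_term):
    """TF-IDF weight under the assumed classic form tf * log(T / df)."""
    if not 0 < messages_with_term <= total_messages:
        raise ValueError("need 0 < messages_with_term <= total_messages")
    return tf * math.log(total_messages / messages_with_term)


# A term present in every message carries no weight; a rare term does.
print(tf_idf(3, 100, 100), tf_idf(3, 100, 5))
```

Both T and the per-term message count change with every new message, which is why these weights have to be refreshed as the collection grows.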

Page 24

Experiments

Page 25

Corpus of News texts consists of

46 339 – positive news

46 337 – negative news

46 340 – neutral news

Page 26

ROMIP mixed collection consists of

543 – positive blog texts

236 – negative blog texts

103 – neutral blog texts

Reviews of books, movies, or digital cameras from blogs

Page 27

Short text collection

              TF-IDF        TF-ICF
Accuracy, %   95.5981       95.0664
Precision     0.958092631   0.953112184
Recall        0.955204837   0.94984672
F-Measure     0.956646554   0.95147665

News collection

              TF-IDF        TF-ICF
Accuracy, %   69.8619       58.1397
Precision     0.709246342   0.61278022
Recall        0.698624505   0.581402868
F-Measure     0.703895355   0.596679322

ROMIP collection

              TF-IDF        TF-ICF
Accuracy, %   53.9773       57.9545
Precision     0.561341047   0.558902611
Recall        0.5311636     0.535790598
F-Measure     0.545835539   0.547102625
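
The F-measure values reported above agree with the harmonic mean of precision and recall. The check below assumes that definition, F = 2PR / (P + R), which the slides do not state explicitly; `f_measure` is a hypothetical helper name.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (balanced F-measure)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# TF-IDF precision and recall from the table whose F-measure is 0.956646554.
print(round(f_measure(0.958092631, 0.955204837), 6))
```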

Page 28

Results

Page 29

Experimental results in terms of F-measure (%)

Short texts: TF-IDF 95.66, TF-ICF 95.15

News: TF-IDF 70.39, TF-ICF 59.68

ROMIP: TF-IDF 54.58, TF-ICF 54.71

Page 30

The program module makes it possible to

dynamically update the unigram dictionary and recalculate term weights depending on the collection a term belongs to;

take into account lexical changes in speech over time;

investigate new terms entering the active vocabulary.
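
A dynamically updated unigram dictionary of this kind can be sketched as follows. This is an illustration under assumptions, not the program module from the work: the class name `DynamicDictionary`, the whitespace tokenization, and the weight form tf × log(1 + C / cf) are all choices made for the example.

```python
import math
from collections import Counter


class DynamicDictionary:
    """Unigram counts per category, updated one message at a time."""

    def __init__(self, categories):
        self.counts = {c: Counter() for c in categories}

    def add_message(self, category, text):
        # Only the counters of terms in the new message are touched.
        self.counts[category].update(text.lower().split())

    def weight(self, term, category):
        # Term frequency in the category times the assumed category factor.
        tf = self.counts[category][term]
        cf = sum(1 for c in self.counts.values() if term in c)
        if cf == 0:
            return 0.0
        return tf * math.log(1 + len(self.counts) / cf)


d = DynamicDictionary(["positive", "neutral", "negative"])
d.add_message("positive", "great great day")
d.add_message("negative", "awful day")
print(d.weight("great", "positive"), d.weight("day", "positive"))
```

Weights are computed on request from the current counts, so a term that enters the vocabulary later gets a weight as soon as the first message containing it is added.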

Page 31

Thank you!

Yuliya Rubtsova

Presentation: http://www.slideshare.net/mokoron