Page 1

+ TEXT MINING-BASED FORMATION OF DICTIONARIES EXPRESSING OPINIONS IN NATURAL LANGUAGES

František Dařena, Jan Žižka

Department of Informatics, Faculty of Business and Economics, Mendel University in Brno, Czech Republic

Page 2

+ Introduction

Many companies collect opinions expressed by their customers. These opinions can hide valuable knowledge.

Discovering that knowledge manually can sometimes be a very demanding task because:

the opinion database can be very large,
the customers can use different languages,
people may treat the opinions subjectively,
sometimes additional resources (like lists of positive and negative words) might be needed.

Page 3

+ Objective

To automatically extract words significant for positive and negative customers' opinions and to form dictionaries of positive and negative words, including the strength of their positivity and negativity.

Page 4

+ Data description

Processed data included reviews of hotel clients collected from publicly available sources. The reviews were labeled as positive and negative.

Review characteristics:

more than 5,000,000 reviews,
written in more than 25 natural languages,
written only by real customers, based on real experience,
written relatively carefully, but still containing errors that are typical for natural languages.

Page 5

+ Review examples

Positive:

The breakfast and the very clean rooms stood out as the best features of this hotel.
Clean and moden, the great loation near station. Friendly reception!
The rooms are new. The breakfast is also great. We had a really nice stay.
Good location - very quiet and good breakfast.

Negative:

High price charged for internet access which actual cost now is extreamly low.
water in the shower did not flow away
The room was noisy and the room temperature was higher than normal.
The air conditioning wasn't working

Page 6

+ Data preparation

Data collection, cleaning (removing tags and non-letter characters), converting to upper-case.

Transforming into the Bag-of-Words representation, with term frequencies (TF) used as attribute values.

Removing the words with a global frequency below MinTF = 2.
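As an illustration, here is a minimal Python sketch of this preparation step; the function name prepare_reviews and the min_tf parameter are illustrative, not the authors' code:

```python
import re
from collections import Counter

def prepare_reviews(raw_reviews, min_tf=2):
    """A sketch of the Page 6 pipeline: strip tags and non-letter
    characters, convert to upper-case, build per-review term-frequency
    vectors, and drop words with a global frequency below min_tf."""
    tokenized = []
    for text in raw_reviews:
        text = re.sub(r"<[^>]+>", " ", text)     # remove tags
        text = re.sub(r"[\W\d_]+", " ", text)    # keep letters only (any alphabet)
        tokenized.append(text.upper().split())   # convert to upper-case

    global_tf = Counter(w for doc in tokenized for w in doc)
    vocabulary = {w for w, n in global_tf.items() if n >= min_tf}

    # One Bag-of-Words (word -> TF) dictionary per review.
    return [Counter(w for w in doc if w in vocabulary) for doc in tokenized]

print(prepare_reviews([
    "The rooms are <b>new</b>. The breakfast is also great.",
    "Good location - very quiet and good breakfast.",
]))
```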

Page 7

+ Data characteristics

[Figure: number of unique words for different languages (MinTF = 1)]

Page 8

+ Data characteristics

[Figure: number of unique words for different languages, for the negative and positive classes and for words present in both classes (MinTF = 2); series: total, negative, positive, both classes]

Page 9

+ Finding the significant words

Significant words were discovered as the relevant attributes used by a classification algorithm – a decision tree generated by R. Quinlan's entropy-minimization algorithm C5.0.

The goal was not to achieve the best classification accuracy (it was around 90%) but to find the relevant attributes that contribute to assigning a text to a given class.

The significant words appeared in the nodes of the decision tree.
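A hedged sketch of this step on a toy corpus: scikit-learn's DecisionTreeClassifier with criterion="entropy" stands in below for Quinlan's C5.0 (a different implementation), and the toy texts and labels are invented:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for the labeled review corpus.
texts = ["friendly staff and clean rooms",
         "noisy room and the shower did not work",
         "great breakfast and friendly reception",
         "noisy and dirty, high price for internet"]
labels = ["POS", "NEG", "POS", "NEG"]

# Upper-cased Bag-of-Words with raw term frequencies, as on Page 6.
vectorizer = CountVectorizer(lowercase=False)
X = vectorizer.fit_transform(t.upper() for t in texts)

# Entropy-minimizing tree as a stand-in for C5.0.
tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
tree.fit(X, labels)

# The significant words are those tested in the tree's internal nodes.
features = vectorizer.get_feature_names_out()
significant = sorted(features[i] for i in set(tree.tree_.feature) if i >= 0)
print(significant)
```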

Page 10

+ Representing the decision tree using rules

The branches of a decision tree can be converted into rules.

Examples:

f(word1) > 0 AND f(word2) = 0 AND f(word3) = 0 : NEG[N1; I1]
f(word4) = 0 AND f(word5) > 0 AND f(word6) > 0 : NEG[N2; I2]
f(word1) = 0 AND f(word6) > 0 : NEG[N3; I3]

Nx – the number of times the rule was used
Ix – the number of times the rule was used incorrectly

When a word appears in a rule as f(word) > 0, it contributes to classification into the given class and is thus relevant for that class.
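The following hypothetical helper sketches how the branches of a fitted scikit-learn tree (such as the one from the Page 9 sketch) can be printed as rules of this form, counting N and I on the data; it assumes word-presence splits at threshold 0.5 and is not the authors' code:

```python
import numpy as np

def tree_to_rules(tree, feature_names, X, y):
    """Print each root-to-leaf branch of a fitted sklearn tree as a rule,
    with N (times the rule fired on the data) and I (times it fired
    incorrectly). Assumes word-presence splits at threshold 0.5."""
    t = tree.tree_
    leaf_of = tree.apply(X)            # leaf reached by every sample
    pred = tree.predict(X)
    y = np.asarray(y)

    def walk(node, conds):
        if t.children_left[node] == -1:                 # leaf node
            mask = leaf_of == node
            n = int(mask.sum())
            i = int((pred[mask] != y[mask]).sum())
            cls = tree.classes_[t.value[node].argmax()]
            print(" AND ".join(conds) or "TRUE", f": {cls}[N={n}; I={i}]")
        else:
            w = feature_names[t.feature[node]]
            walk(t.children_left[node], conds + [f"f({w}) = 0"])
            walk(t.children_right[node], conds + [f"f({w}) > 0"])

    walk(0, [])

# Continuing the Page 9 sketch: tree_to_rules(tree, features, X, labels)
```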

Page 11

+ One word in multiple paths/rules

The same word (e.g. "friendly") can appear in multiple paths in the decision tree and can contribute to classification into both classes.
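A small sketch of tallying, per word and class, how often the rules containing f(word) > 0 were used and how often incorrectly; the rule tuples below are made-up data in an assumed format:

```python
from collections import defaultdict

# Hypothetical rule format: (words tested as f(word) > 0, class, N, I).
rules = [
    ({"FRIENDLY"}, "POS", 40, 2),
    ({"FRIENDLY", "NOISY"}, "NEG", 5, 3),   # "friendly" also on a NEG path
    ({"NOISY"}, "NEG", 25, 4),
]

# (word, class) -> [times used, times used incorrectly]
counts = defaultdict(lambda: [0, 0])
for words, label, n, i in rules:
    for w in words:
        counts[(w, label)][0] += n
        counts[(w, label)][1] += i

print(dict(counts))   # FRIENDLY appears under both POS and NEG
```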

Page 12

+ Strength of word sentiment

The more often a word appears as relevant in rules that correctly assign the negative (positive) class to a text, the more negative (positive) the word is. However, it is necessary to consider not only the absolute frequency but also the relative accuracy.

For example, a word W1 is used 10 times for a correct and 0 times for an incorrect classification into the negative class, while a word W2 is used 30 times for a correct and 20 times for an incorrect classification into the negative class (50 times in total). The question is which of these two words is 'more negative': W1 was used fewer times but with 100% correctness, while W2 was used five times more often but with only 60% correctness.

Page 13

+ Sentiment strength weight

w_w = (N_C / N_N) × ln(N_C² + N_N²) / ln(N_max)

The weight balances the frequency with which a word was used for classification and the correctness of that classification. The calculated weight then determines the importance of a word in relation to a given category (positive or negative class) – higher numbers mean greater relevancy.
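A sketch computing this weight for the W1/W2 example from the previous page. The reading of the symbols (N_C = correct uses, N_N = total uses, N_max = the highest total for any word) follows the slide's description but is our assumption, and N_max = 50 is chosen only for illustration:

```python
from math import log

def sentiment_weight(n_correct, n_used, n_max):
    """Sentiment strength weight: the correctness ratio N_C / N_N,
    balanced by a log-scaled frequency term normalized by ln(N_max).
    The symbol reading is an assumption based on the slide's text."""
    return (n_correct / n_used) * log(n_correct**2 + n_used**2) / log(n_max)

# The W1/W2 example from Page 12 (N_max = 50 assumed for illustration):
w1 = sentiment_weight(10, 10, 50)   # used 10 times, always correctly
w2 = sentiment_weight(30, 50, 50)   # used 50 times, 60% correctly
print(round(w1, 3), round(w2, 3))   # 1.354 1.247 – W1 outweighs W2
```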

Page 14

+ Results

Page 15

+ Results

Page 16

+ Results

Page 17

+ Conclusions

A procedure for applying computers, machine learning, and natural language processing to automatically find significant words was presented.

From the total number of words (80,000–200,000), only about 200–300 were identified as significant.

The procedure worked well for many languages.

Future research will focus on generating typical short phrases instead of only individual words.

The procedure might be used in marketing research or marketing intelligence, for filtering reviews, generating lists of key words, etc.