Semi-supervised review tweet classification · 2020. 7. 10. · JALT VOF Semi-supervised review tweet classification Bachelor thesis Gijs van der Voort (6191053) 6/12/2012 Supervised

JALT VOF

Semi-supervised review tweet classification

Bachelor thesis

Gijs van der Voort (6191053)

6/12/2012

Supervised by: Manos Tsagkias and Leen Torenvliet

In this paper a method for classifying review tweets and a method for semi-automatically gathering training sets are proposed. Using classic text classification features, quality prediction features, Twitter specific features and linguistic features a Random Forest classifier is capable to correctly classify +/- 83% of the dataset. The proposed method for kick-starting uses very basic search queries to gather data. Too broad search queries consisting of a single hashtag do not work. Extending the single hashtag query with two subject specific keywords creates a training set with which +/- 70% of the original dataset can be correctly classified and results of which can be used to continue training.

1

Table of Contents 1 Introduction ..................................................................................................................................... 2

2 Related work ................................................................................................................................... 2

2.1 Text classification .................................................................................................................... 2

2.2 Social context .......................................................................................................................... 3

2.3 Aggregating information from social media ............................................................................ 4

2.4 Classification methods............................................................................................................. 4

3 Methods .......................................................................................................................................... 4

3.1 Building base dataset .............................................................................................................. 5

3.2 Preparing data for feature vector generation ......................................................................... 5

3.3 Calculating word weights ........................................................................................................ 6

3.4 Creating feature vectors for tweets ........................................................................................ 7

4 Experiments ................................................................................................................................... 10

4.1 Single feature classification ................................................................................................... 10

4.2 N-grams ................................................................................................................................. 13

4.3 Feature combinations ........................................................................................................... 14

4.4 Classification method ............................................................................................................ 15

4.5 Real life classification ............................................................................................................ 15

4.6 Semi-automated training ...................................................................................................... 17

5 Conclusion ..................................................................................................................................... 18

6 References ..................................................................................................................................... 19

2

1 Introduction A common problem that one is faced with when buying a new product is having to determine which

one of the many options is the option to buy. A reasonable solution for this problem would be to ask

a friend, a salesman, or somebody else you know, but are any of these people really able to give you

custom tailored advice? Your friend does not know anything about the type of product you are

interested in, we all know that the salesman is only thinking about his or her sale targets and the rest

of the people have no experience with the product whatsoever. If only we could ask somebody who

has actually experience with and knowledge about the product and does not have personal

intentions.

The internet has made it possible for everybody to exchange their experience with products and

services on a global scale. Discussing the quality of products, “reviewing”, has become so popular

that there are many online communities which solemnly consist of users exchanging reviews and

ratings of products and services. These communities have made it is possible for consumers to use

“the wisdom of crowds” [1] as a guide when buying a new product.

Having reviews on a website can significantly boost the value of a website, this has increased the

overall demand for reviews. Getting reviews on a website has proven to be difficult because people

only tend to write reviews for websites which are already supported by an active review community,

a “chicken or the egg” kind of dilemma. Are there other reviews sources which can be freely used?

Twitter is an online community where people can share short status updates, tweets [2], with the

rest of the world. Most of the tweets contain “pointless babble and self-promotion” [3], but people

also write about their experiences with people, events and with products and services: reviews.

Tweets posted to the public timeline, i.e. for everyone to see, are in the public domain and freely

available for all purposes and can therefore be used to enrich the contents of a website.

The Dutch yellow pages has asked Jalt, a Dutch social media consulting firm, to gather reviews of

restaurants, bars, diners, etc. from Twitter. Jalt requested to research the possibilities of

automatically gathering reviews from Twitter.

Manually filtering these reviews from all other tweets is either very time consuming: manually go

through all available tweets, or limiting in the amount useful reviews because of aggressive filtering.

To overcomes these problems, the following research questions are addressed in this paper:

Is it possible to gather reviews from Twitter using supervised classification?

o Which features are most suitable for this task?

Is it possible to do semi-supervised classification?

o With data gathered using a single hashtag [4]?

o With data gathered using a very basic search query?

2 Related work

2.1 Text classification Labeling documents is a topic that exists for a long time and it is considered “solved”. Most research

to text classification combines a classifier with a BOW (bag-of-words) model of the documents, most

3

of the time reducing the feature space with a form of word weighting and/or word stemming [5] [6]

[7] [8] [9].

Two widely used methods for word weighting are the LLR (log-likelihood ratio or G) test [10] and the

tf*idf ratio test [11]. Both of these tests calculate numerical statistics describing the discriminative

value of words between two corpora.

The log-likelihood ratio test is especially interesting because it is one of the few statistical tests that

does not assume that words are distributed normally:

“When comparing the rates of occurrence of rare events, the assumptions on which these tests are

based break down because texts are composed largely of such rare events. For example, simple word

counts made on moderate-sized corpus show that words that have a frequency of less than one in

50,000 words make up about 20-30% of typical English language news-wire reports. This 'rare'

quarter of English includes many of the content-bearing words and nearly all the technical jargon.”

[12]

Word stemming is the process of reducing words to their stem or base and has proven to be effective

in spam detection [13] and information retrieval [14] [15].

Both methods can be used in combination with N-grams of a corpus [16] [17]. An N-gram is a N

elements long sequence of elements from a given list of elements, in this case a corpus. The theory

behind N-grams is that they contain more context than single words. Because of the simplicity and

the scalability of the algorithm it is possible to easily increase the amount of context stored in the N-

grams by increasing the N-gram size.

2.2 Social context Because Twitter is known as a social media, it is interesting to see if other features are possible based

on the social context of a Tweet.

A topic that has been getting more attention because of the increasing popularity of social media is

quality assessment of social media content [18] [19] [20]. A technique that has been found effective

is exploiting the social graph of a user to predict the quality of the users’ content. By assuming four

social context consistency hypotheses [18] in combination with the users’ social graph, unsupervised

quality prediction is possible. A different approach to predict content quality is to use the authority

of a user as given by the community and the feedback that has been given on a question or answer

[19].

Both methods are can be very effective and would have been used if not for the lack of social context

and interaction [21] [22] on Twitter. It seems that undirected information sharing is the most

common use of Twitter and people mentioning their experience with a product or services very

rarely have any connection at all.

Other methods of quality prediction without the need of social context are available. A commonly

used metric for quality prediction is the lexical density metric [23] [24]. This metric indicates the

amount of information inside a sentence or short piece of text by calculating the distance between

non-stop words. The underlying assumption is that people that have something specific to say try to

4

minimize the amount of unnecessary words. This assumption is especially interesting because of the

limited length of tweets that makes every wasted character count.

The way people write on social media is different from normal written text, content from social

media is full of slang and colloquialism. A typical example of a message from a very excited writer on

twitter:

“OMG!!!!! I saw many countries trend #Happy7thSS501

OMG!! FIGHTING!! we will be No1 for sure!!!! :D” [25]

Features that are interesting to explore are the number of repeating characters, the number of

uppercase/lowercase characters and the number of punctuation characters [26].

2.3 Aggregating information from social media Gathering statistics from social media by aggregating content is another field that is getting increased

attention because of the growth of social media. By harnessing the wisdom of crowds it has been

proven possible to predict the rating of movies [27], predict elections [28] and to train a profitable

stock market trader [29]. Trying to distil information from a single corpus tends to be harder [30] and

has room for further research [31].

Aggregating information from social media is a topic that will not be discussed in this paper but it

gives insight in possible features that can be used for classification. Features that can be used from

tweets are:

The number of mentions [32] in a tweet

The number of hashtags in a tweet

The number of urls in a tweet

The number of geo-locations in a tweet

2.4 Classification methods To be able to run the classification tests, some sort of classification environment is needed. For most

programming languages there are classification libraries available but most of these libraries seem to

focus on either a single classification method or on application integration instead of experimenting.

To be able to switch between classification methods instantly when needed and not having to build a

new experimentation environment, an existing classification experiment environment has been used:

Weka.

Weka is an experimentation environment for data mining tasks. It has support for a large number of

different classification methods like SVM, neural networks, trees, Bayesian models and many more,

but also has support for data clustering, meta-classification, association and feature selection. All of

these methods and algorithms are available through the open Java API and command-line

environment but is also available through a GUI environment which can help with importing datasets

from all sorts of file formats and database providers and has all sorts of visualization like feature

scatterplots and feature distributions [33] [34].

3 Methods The used methodology consists out of the following steps:

5

1. Building base dataset

2. Data cleaning, parsing and normalizing

3. Preparing data for feature vector generation

4. Performance testing of classifier on base dataset

5. Semi-automatic collecting of training data

6. Comparing the semi-automatic classifier with the base classifier

Each of these steps will be discussed in more detail in the following subsections.

3.1 Building base dataset The dataset used for the research consists of tweets collected over a period of three months. To

collect the tweets, the Twitter search API [35] has been used. Although freely available and fast, it

has four limitations:

It is impossible to get all messages in a given language

The search API is rate limited to 150 requests per hour

The allowed complexity of search queries is limited

You can only search a limited time back in history

To be able to overcome these problems a tweet collecting system has been build.

At the core of the tweet collecting system lies the assumption that every review contains words that

have a judging tone. By using as many as possible of these words it is safe to assume that all Dutch

reviews are in the collected dataset.

Because the allowed complexity of search queries is limited, using one big search query is impossible.

However, splitting of the queries is easy because the search queries consist of combinations of

keywords.

By assigning every query a percentage of the maximum allowed number of requests, the API is used

in the most optimal way without risking penalties. This also solves the problem of the limited search

history by collecting the tweets as fast as possible.

Collecting tweets using this system over a period of twelve weeks resulted in +/- 15 million tweets

and after manually filtering this resulted in 6512 review tweets. Not every review about a place

where one could eat or drink has been accepted to this set. The Dutch yellow pages has no interest in

reviews about large chains like McDonald’s and Kentucky Fried Chicken, so these reviews have been

discarded.

For classification experiments an evenly distributed dataset is required, this means that for every

review a non-review must be present in the dataset. Therefore 6512 non-review tweets have been

added to the dataset.

3.2 Preparing data for feature vector generation Before feature vectors can be generated from the tweets the data has to be prepared. For every

collected tweet a record containing the following elements is generated:

The message of the tweet, with newlines replaced by spaces

6

A list of the normalized words of the message

The post date and time of the tweet

A flag indicating if the tweet is a review

The output of a lexical analyzer

The output of a location detection service

The replacement of newlines by spaces is necessary because the lexical analyzer used in this research

interpreted a newline as the end of a document. Frog [36], the used lexical analyzer, is a command-

line tool that is able to tokenize and lemmatize Dutch documents.

To create the list of normalized words the entire message should be converted to the same letter

case and split into words. In this paper a word is defined by the following regular expression, which is

suitable for the Dutch language:

\w+

To search for location information embedded in a message, a location detection service is necessary;

for this paper Yahoo! Placemaker [37] has been used. This service is capable of identifying places in

unstructured data and returning geographic metadata.

Collecting and generating the results of these two services for 13000 records is a slow process.

Therefore the results of both services are stored in the record as is, so information will not be lost

because of preliminary filtering, and collecting this data only has to be done only once.

3.3 Calculating word weights Before being able to create the feature vectors the word weights as defined by both the log-

likelihood ratio test and the tf*idf test for all words in the training set must be calculated. The

training set consists of an evenly distributed review/non-review fraction of the original set which may

not be used for training or testing the classifier.

By joining all tweets of a given class, two large corpora are created. From the two corpora are all

possible n-grams generated, including overlap. To illustrate this, the following example shows the

conversion of a string to all 2-gram combinations:

D = “apples are rather tasty”

2-gram(D) = {“apples are”, “are rather”, “rather tasty”}

Using the N-grams of both corpora the log-likelihood ratio and tf*idf weight can be calculated. The

tf*idf weight is calculated using the following formula:

( ) ( ) ( )

( ) { }

( )

{ }

Where is all documents combined, the number of documents in , a given term and a

document from . For this report consists of both the review and non-review corpus and is the

review corpus.

7

For calculating the log-likelihood ratio weight of a term, first the following table is constructed:

Corpus 1 Corpus 2 Total

Frequency of word

Frequency of other words Total

Table 1

The values a and b are called the observed values, the values c and d are the total number of words

in their respective corpus. For both corpora the expected value can now be calculated using the table

above and the following formulas:

( )

( )

Using the expected values for term in both corpora, the log-likelihood ratio weight can be

calculated:

( ) (

)

From both sets the X most high ranking N-grams and corresponding values are selected, where X is a

number for suitable for the test being executed. The selected words are used for building the feature

vector.

3.4 Creating feature vectors for tweets After having normalized the data and calculating the word weights it is time to create the actual

feature vectors. In this section all the features will be defined and explained where necessary. When

describing the features, “words” refers to the words as found in de data normalization phase and

“message” refers to the content of a tweet in its original form.

3.4.1 Post time ratio

The tweet post time as time of day in milliseconds divided by the total amount of milliseconds in 24

hours.

3.4.2 Message length

The number of characters in the message. This includes all types of characters.

3.4.3 Number of words

The words counted are the words that are found in the data preparation phase.

3.4.4 Unique words ratio ( )

8

The unique words ratio is calculated by dividing the amount of unique words by the total number of

words.

3.4.5 Mentions ratio { }

The number of mentions in a message divided by the number of words of a message. Mentions are

defined by the regular expression:

(?:\s|^)@\w+

Which translates to: whitespace or begin of line followed by a at (@) sign and word characters.

3.4.6 Hashtags ratio { }

The number of hashtags in a message divided by the number of words of a message. Mentions are


(?:\s|^)#\w+

Which translates to: whitespace or begin of line followed by a number sign and word characters.

3.4.7 URLs ratio { }

The number of URLs in a message divided by the number of words of a message. Mentions are


(?:\s|^)(http://|https://)(\S+)

Which translates to: whitespace or begin of line followed by “http://” or “https://” and one or more

non whitespace characters.

3.4.8 Location name ratio { }

The number of location names divided by the number of words in a message. The number of location

names is found by counting the results of the location detection service.

3.4.9 Char class ratios { }

The number of characters from a given class in a message divided by length of the message. The

character sets used are the following sets, as defined by the Python string constants [38]:

Digits

9

Whitespace

Punctuation

Uppercase

Lowercase

3.4.10 Repeating char ratio

∑

The number of repeating, consecutive characters in a message, divided by the total number of

characters in the message.

3.4.11 Part of speech category ratio

{ }

The number of words of a given category in a message divided by the number of words of the

message. The parts of speech categories used the following:

Nouns

Verbs

Articles

Numerals

Prepositions

Adjectives

Adverbs

Conjunctions

The number of occurrences of each class can be extracted from the lexical analyzer output.

3.4.12 Lexical density

( )

∑

( ) ( )

( )

( ) ( )

( ) ( ) ( )

Where is the message and the set of not-stop words. Summing over all keywords: the sum of

the weights of two keywords divided by the distances squared. By dividing the sum of the weights by

the distance squared exponentially decreases the value by increase of distance. The result is then

normalized to the number of keywords. The weight of the keywords is defined by the log-likelihood

ratio of the keyword.

3.4.13 tf*idf score

( ) ∑ (

)

10

The sum of all tf*idf values of every word in message .

3.4.14 Log-likelihood word ratios

After calculating the log-likelihood ratios, the top N words are selected. For every word in the

occurrence ratio is calculated.

( )

4 Experiments The experiments are set up to evaluate the performance of the classifier. The performance is defined

as the percentage correctly classified tweets. This means that a performance of 50% indicates that

the same results can be obtained by randomly classifying 50% of the tweets as review and 50% as

non-review.

4.1 Single feature classification Because of the diversity of the research areas from which features are extracted, testing the

performance of each of these features individually gives a clear insight in the behavior of each of

these features, i.e. the ability to classify reviews based on only that feature.

The classification method used in this experiment is the Random Forest, using 10 trees.

4.1.1 Tweet properties

Figure 1

None of the features seem very discriminative between the two classes. The post time feature

preforms the best with 57%. The length/words and unique words ratio were expected to perform

better because of the results seen in quality prediction research. A possible explanation is that the

limited length of tweets limits the possible variation in length.

57,0639

52,6413 55,5666 54,3535

30

40

50

60

70

Post time Length Words Unique words ratio

Pre

form

ance

Tweet properties

11

4.1.2 Textual statistics

Figure 2

The textual features have very variable performance. The punctuation and repeating char ratio were

expected to perform better based on earlier research. The other features, implemented because of

the little extra needed code, have an unexpected high performance, specifically the whitespace,

letter and uppercase ratio. The reason that these two features perform relatively well is that reviews

possibly tend to be more coherent messages without excessive uppercase usage, i.e. shouting.

4.1.3 Twitter tokens

Figure 3

Both the mentions and the hashtag ratios seem to preform rather well. The URLs ratio is almost non

discriminative. A reference to the service or product that gets discussed is considered normal on

other media and would be expected on Twitter. It looks like people are not aware of the fact that

they write a review , assuming that the reader understands it or looks it up by himself.

52,2113 53,6318

60,4883 58,5458

51,0135

56,8796

52,0117

30

40

50

60

70

Punctuationratio

Digits ratio Whitespaceratio

Uppercaseratio

Lowecaseratio

Letter ratio Repeatingratio

Pre

form

ance

Textual statistics

55,0138

61,4711

51,6278

30

40

50

60

70

Mentions ratio Hashtags ratio Urls ratio

Pre

form

ance

Twitter tokens

12

4.1.4 POS

Figure 4

The POS features have a wide range of different results. The prepositions ratio is the best preforming

feature. Reviews about diners, bars, etc. always contain a reference to where the dinner/etc. took

place. Those references are most of the time preceded by one of the following Dutch prepositions:

“in”, “bij”. Twitter reviews also contain references to the people that have accompanied the

reviewer. The references to these people are often preceded by the Dutch preposition: “met”. The

performance of the verbs and the adjectives are expected because having dinner or having a drink is

an activity and you need an adjective to describe the experience.

4.1.5 Other

Figure 5

These results are unexpected. Lexical density has proven to be a good indication of quality of content

it seems that lexical density, like unique ratio, suffers from the limited message length of tweets. The

idf*tf feature is not as effective as expected this may be because of the summing of all the tf*idf

values. The location ratio isn’t preforming that well either, like the URLs feature, it seems that people

53,317 57,1714

53,125 54,699

63,9128 56,8796 54,8833 53,2095

30

40

50

60

70

Pre

form

ance

POS

54,0831 53,0467 52,3078

30

40

50

60

70

Location ratio Lexical density idf*tf

Pre

form

ance

Other

13

are not referencing the establishment as good as one might expect from a review and very rarely

include the location of the establishment. A different reason for the performance of the location

ratio could be that the location name services used has difficulty with the Dutch language and can

only filter out English location names.

4.1.6 LLR

Figure 6

The LLR ratio feature has been tested on a range of combinations of the following two variables:

The number of words taken as feature

The percentage of the dataset used for determining the LLR values of words

A general trend in all percentages of training data is the peak at the lowest number of words and the

peak around the 200 words. It seems that using more than 200 words decreases the performance

across all sizes of training data. The peak at the beginning can possibly be explained by the small

feature space and the high discriminative value of the top 20 words. The immediate decrease of

performance between 20 and 80 words is more difficult to explain. It is possible that because the

most discriminative words are in the top 20, adding more words with possibly exponentially less

discriminative value decreases the value of the overall performance. By adding more and more

words, more complex models of review/non reviews can be build, resulting in overall improving

performance. The decrease in performance after 200 words is possibly because of the very limited

discriminative value of those extra words.

4.2 N-grams To see how N-grams can improve the performance of the LLR feature, this experiment compares the

performance of different N-gram sizes. The used number of N-grams is 200 and 20% training data has

been used for LLR ratio calculation.

77

78

79

80

81

20 40 60 80 100 120 140 160 180 200 220

Pre

form

ance

Number of words

LLR

10%

20%

30%

40%

50%

14

Figure 7

We can see from the results in Figure 7 that using N-grams does not improve the classification

performance. The bi-grams seems to very slightly outperform the uni-gram, e.g. the normal LLR

feature, but not in a meaningful way. It is possible that the limited length of the tweets play a huge

part in the seen performance. By using a bi-gram, the amount of possible combinations is squared

compared to the uni-grams which results in a more limited amount of shared word combinations

across a single corpus.

4.3 Feature combinations Now we have seen how individual features preform, it is interesting to see how the groups preform

when combining their features. In this experiment we will also see how the classifier preforms when

combining all features from all groups. Although the LLR feature will not be combined in any way, it is

interesting to compare the LLR feature with all the other groups of features.

Figure 8

30

40

50

60

70

80

1-gram 2-gram 3-gram 4-gram 5-gram

Pre

form

ance

N-grams

61,7995 62,905 60,2641

63,7494

53,6417

79,5893 81,2014

40

50

60

70

80

Properties Statistics Tokens POS Other LLR All

Pre

form

ance

Combinations

15

In these results it is clearly visible that even though single features do not perform well, increasing

the feature space gives the classifier the room to find more complex patterns. This goes for every

group except “Other”. It seems that even when combining the three very low preforming features

from this group does not give anything for the classifier to work with. One interesting detail that

stands out is that the LLR feature preforms only two percent less than everything combined,

suggesting that the classification that can be done using the other groups can mostly be done by the

LLR feature.

4.4 Classification method The classification method used in the previous experiments is the Random Forest. A more common

used classifier for text classification is the Naïve Bayes classifier. There are many more classifiers

available and the type of classifier can have significant effect on the overall results. Therefore it is

interesting to see how the different classifiers preform.

Figure 9

Because most classification methods have one or more parameters, the results in Figure 9 are the

results with the best preforming combination of parameters. Random Forest preforms the best of all

the classification methods.

4.5 Real life classification The dataset used in previous experiments is data gathered over the course of 12 weeks, from 46 in

2011 to week 6 in 2012. The following two experiments are meant to find out how time affects the

performance of the classifier

4.5.1.1 Preceding week based classification

This experiment will use week for training and week for LLR calculation for classifying

week . The idea behind this experiment is that the correlation in terms of content may be higher

between two consecutive weeks. When a big event occurs and people mentions these events in their

messages, it is possible that by using the preceding week for classification the classifier can take

advantage of the correlation.

78,7736 74,7049

67,9828

83,799

72,0294

50

60

70

80

90

100

C4.5 Naive Bayes K* Random Forest SVM

Pre

form

ance

Classification methods

16

Figure 10

The results in Figure 10 are very irregular. From the start of week zero, there seems to be a promising

growth in performance, only to fall back after week four. It is possible that because of the actual two

preceding weeks used for training, the scope is too large for actually using events as extra

information. A shorter period may be better to use, but the limited amount of reviews that are

gathered per week (+/- 200) already limits the performance.

4.5.2 Start of a new subject

Classification has only been done for reviews about restaurants, bars etc. but other clients of Jalt

already have expressed their interest in reviews about other subjects. To find out how the classifier

preforms during the start of a new subject, this experiment will classify week by using all preceding

weeks of week for training.

Figure 11

The results in Figure 11 are how one could expect from training with an increasing training set size.

The first couple of weeks shows a very irregular pattern but overall an increasing line until it

stabilizes around 83%.

75

80

85

90

48 49 50 51 52 0 1 2 3 4 5 6

Pre

form

ance

Week number

Preceding week as training set

70

75

80

85

90

46 47 48 49 50 51 52 0 1 2 3 4 5

Pre

form

ance

Week

All preceding weeks a training set

17

4.6 Semi-automated training In these experiments the possibility of semi-automated training has been explored. Instead of using a

part of the original dataset, new data to use for training has been collected. A classifier is trained

with this data and tested on the original dataset. If these experiments prove to be successful, the

required effort to train the classifier will be dramatically reduced because no manual classification is

necessary when starting with a new review subject. In these experiments, both the number of LLR

terms and the percentage used for LLR term generation has been taken in account and tested in

various combinations to find the best possible performance.

4.6.1 Single hashtag

Because Twitter uses hashtags to group content together, using a single hashtag for gathering

trainindata would be ideal. The first hashtag that had been taken into consideration was “#review”.

For the English language this would have been a very usable hashtag but unfortunately unusable for

Dutch content. A common Dutch hashtag that is often used for recommending other is “#aanrader”.

Using only “#aanrader” as search query, a new datasets has been gatherd and used for classification.

Figure 12

Looking at the results in Figure 12 we can see that the performance overall just barely exceeds 53%

percent. Looking at the tweets gathered using this query we see that recommendations for

restaurants, bars, etc. only make up a tiny fraction of the entire dataset probably making the

classifier to broad for the specific type of reviews required to classify.

4.6.2 Single hashtag extended

Because of the poor results of the single hashtag, it is interesting to see if by extending the single

hashtag search query with a limited amount of extra keywords, we can increase the performance of

the classifier. The number of keywords will be limited to two, because the idea of semi-automated

classification is that starting a new classifier for a new subject should take the least amount of

manual input.

Most of the tweets are about eating (it is hard to review the quality of drinks), the keywords chosen

are “eten” and “gegeten” resulting in the search query:

#aanrader AND eten OR gegeten

50

51

52

53

54

10 50 100 150 200

Pre

form

ance

LLR terms

#aanrader

10%

20%

30%

40%

50%

18

Figure 13

The results of the extended search query outperforms the single hashtag query significantly. Again

we see that 200 LLR terms preforms the best. The best performance is seems to be around the 20

and 30 percent of LLR training data. A possible explanation for this being in the lower sizes has

probably to do with the size of the dataset. Using the extended search query only around 6 tweets

per day were found. The set used in this experiment therefore only consists of 200 tweets in contrast

to the 2000 tweets in the single hash tag dataset.

5 Conclusion In this thesis a method for classifying review tweets has been discussed using various different types

of features like the textual contents, special Twitter unique tokens and parts of speech. With such a

system it is possible to automatically filter out reviews about in a certain category which webmasters

can use to enrich their website content. A method for kick-starting the classifier has also been

proposed so that building a classifier for a new category takes very little human effort.

Using the proposed method it is possible to classify review/non-review tweets with 83% certainty.

The LLR method has proven to be most useful in classifying, but only when combined with the other

features it was possible to get beyond the 80%. N-gramming of the words in a message has no effect

on the performance of the classifier.

The best classification method for this problem is the Random Forest. Although other classification

methods can possibly be improved by feature selection, the Radom Forest has the advantage that

training can be very fast in comparison to other methods like SVM.

Kick-starting the classification process by using very simple search queries has also been proven

possible. The performance of the classifier drops when a search query is to broad, but making it

more specific using only two extra keywords can significantly improve performance, even when only

a training set of 200 tweets has been gathered.

Further possible research would be to see what effect the size of the dataset has on the performance

of the kick started classifier. The original dataset is specifically tailored to the wishes of the Dutch

yellow pages, e.g. not having big chains. It can be interesting to see how the classifier preforms when

62

64

66

68

70

72

10 50 100 150 200

Pre

form

ance

LLR terms

#aanrader AND eten OR gegeten

10%

20%

30%

40%

50%

19

big chains are annotated as review instead of non-review. The influence of time on classification has

been tested over a period of three months. It would be interesting to do experiments over a longer

period of time with longer intervals to see if there are seasonal events etc. or with shorter intervals

to see the influence of more short termed events.

6 References

[1] J. Surowiecki, The Wisdom of Crowds: Why the Many are Smarter Than the Few and how Collective Wisdom Shapes Business, Economies, Societies, and Nations, Doubleday, 2004.

[2] "About Tweets (Twitter Updates)," Twitter INC., [Online]. Available: http://support.twitter.com/groups/31-twitter-basics/topics/109-tweets-messages/articles/127856-about-tweets-twitter-updates. [Accessed 06 04 2012].

[3] R. Kelly, "Twitter Study - August 2009," Pear Analytics, San Antonio, 2009.

[4] Twitter INC., "What Are Hashtags ("#" Symbols)?," Twitter INC., [Online]. Available: http://support.twitter.com/articles/49309-what-are-hashtags-symbols. [Accessed 6 June 2012].

[5] T. Joachims, "Text Categorization with Support Vector Machines: Learning with Many relevant Features," Universiät Dortmunt, Dortmunt, 1998.

[6] I. Androutsopoulos, J. Koutsias, K. Chandrinos, G. Paliouras and C. Spyropoulos, "An evaluation of naive bayesian anti-spam filtering," Arxiv, 2000.

[7] L. Manevitz and M. Yousef, "One-class SVMs for document classification," The Journal of Machine Learning Research, vol. 2, pp. 139-154, 2002.

[8] H. Ragas and C. Koster, "Four text classification algorithms compared on a Dutch corpus," Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, pp. 369-370, 1998.

[9] Y. Matsuo and M. Ishizuka, "Keyword extraction from a single document using word co-occurrence statistical information," International Journal on Artificial Intelligence Tools, vol. 13, no. 1, pp. 157-170, 2004.

[10] M. Weintraub, "LVCSR log-likelihood ratio scoring for keyword spotting," in Acoustics, Speech, and Signal Processing, 1995. ICASSP-95., 1995 International Conference on, 1995.

[11] T. Tokunaga and I. Makoto, "Text categorization based on weighted inverse document frequency," in Special Interest Groups and Information Process Society of Japan (SIG-IPSJ), 1994.

[12] T. Dunning, "Accurate methods for the statistics of surprise and coincidence," Computational linguistics, vol. 19, pp. 61-74, 1993.

[13] S. Ahmed and F. Mithun, "Word stemming to enhance spam filtering," in the Conference on Email and Anti-Spam (CEAS’04), 2004.

20

[14] T. Sembok, "Word Stemming Algorithms and Retrieval Effectiveness in Malay and Arabic Documents Retrieval Systems," in Proceeding of World Academy of Science, Engineering and Technology, 2005.

[15] J. Carlberger, H. Dalianis, M. Hassel and O. Knutsson, "Improving precision in information retrieval for Swedish using stemming," in the Proceedings of NODALIDA, 2001.

[16] J. Fürnkranz, "A study using n-gram features for text categorization," Austrian Research Institute for Artifical Intelligence, 1998.

[17] W. Cavnar and J. Trenkle, "N-gram-based text categorization," Ann Arbor MI, vol. 48113, no. 2, pp. 161-175, 1994.

[18] Y. Lu, P. Tsaparas, A. Ntoulas and L. Polanyi, "Exploiting social context for review quality prediction," Proceedings of the 19th international conference on World wide web, pp. 691--700, 2010.

[19] J. Bian, Y. Liu, D. Zhou, E. Agichtein and H. Zha, "Learning to recognize reliable users and content in social media with coupled mutual reinforcement," in Proceedings of the 18th international conference on World wide web, 2009.

[20] M. Bosma, E. Meij and W. Weerkamp, "A Framework for Unsupervised Spam Detection in Social Networking Sites".

[21] H. Kwak, C. Lee, H. Park and S. Moon, "What is Twitter, a social network or a news media?," in Proceedings of the 19th international conference on World wide web, 2010.

[22] S. Perez, "Twitter is NOT a Social Network, Says Twitter Exec," ReadWriteWeb, 14 September 2010. [Online]. Available: http://www.readwriteweb.com/archives/twitter_is_not_a_social_network_says_twitter_exec.php. [Accessed 5 June 2012].

[23] G. Lee, J. Seo, S. Lee, H. Jung, B. Cho, C. Lee, B. Kwak, J. Cha, D. Kim and J. An, "SiteQ: Engineering high performance QA system using lexico-semantic pattern matching and shallow NLP," in Proceedings of the Tenth Text REtrieval Conference (TREC 2001), 2001.

[24] M. a. S. A. a. L. E. Hu, "Comments-oriented document summarization: understanding documents with readers’ feedback," in Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, 2008.

[25] @BlackRose50101, 7 June 2012. [Online]. Available: http://twitter.com/BlackRose50101/status/210752151255924736.

[26] B. Sriram, D. Fuhry, E. Demir, H. Ferhatosmanoglu and M. Demirbas, "Short text classification in twitter to improve information filtering," in Proceeding of the 33rd international ACM SIGIR conference on research and development in information retrieval, 2010.

[27] A. Oghina, M. Breuss, M. Tsagkias and M. de Rijke, "Predicting IMDb Movie Ratings using Social Media," in 34th European Conference on Information Retrieval (ECIR 2012). Springer-Verlag, 2012.

21

[28] A. Tumasjan, T. Sprenger, P. Sandner and I. Welpe, "Predicting elections with twitter: What 140 characters reveal about political sentiment," in Proceedings of the Fourth International AAAI Conference on Weblogs and Social Media, 2010.

[29] E. Ruiz, V. Hristidis, C. Castillo, A. Gionis and A. Jaimes, "Correlating financial time series with micro-blogging activity," in Proceedings of the fifth ACM international conference on Web search and data mining, 2012.

[30] G. Mishne, "Experiments with mood classification in blog posts," in Proceedings of ACM SIGIR 2005 Workshop on Stylistic Analysis of Text for Information Access, 2005.

[31] A. Go, L. Huang and R. Bhayani, "Twitter sentiment analysis," Final Projects from CS224N for Spring, vol. 2009, 2008.

[32] Twitter INC., "What are @Replies and Mentions?," Twitter INC., [Online]. Available: http://support.twitter.com/articles/14023. [Accessed 6 June 2012].

[33] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, Witten and I. H., "The WEKA data mining software: an update," SIGKDD Explor. Newsl., vol. 11, no. 1, pp. 10-18, 2009.

[34] Machine Learning Group at University of Waikato, "Weka 3: Data Mining Software in Java," Machine Learning Group at University of Waikato, [Online]. Available: http://www.cs.waikato.ac.nz/~ml/weka/. [Accessed 7 June 2012].

[35] Twitter INC., "GET search," Twitter INC., 18 April 2012. [Online]. Available: https://dev.twitter.com/docs/api/1/get/search. [Accessed 5 June 2012].

[36] A. v. d. Bosch, "Frog Dutch morpho-syntactic analyzer and dependency parser," ILK Research Group, 24 May 2012. [Online]. Available: http://ilk.uvt.nl/frog/. [Accessed 5 June 2012].

[37] Yahoo!, "Yahoo! Placemaker™ Beta," Yahoo!, [Online]. Available: http://developer.yahoo.com/geo/placemaker/. [Accessed 5 June 2012].

[38] The Python Software Foundation, "7.1. string - Common string operations," The Python Software Foundation, [Online]. Available: http://docs.python.org/library/string.html. [Accessed 6 June 2012].