JALT VOF
Semi-supervised review tweet classification
Bachelor thesis
Gijs van der Voort (6191053)
6/12/2012
Supervised by: Manos Tsagkias and Leen Torenvliet
In this paper, a method for classifying review tweets and a method for semi-automatically gathering training sets are proposed. Using classic text classification features, quality prediction features, Twitter-specific features and linguistic features, a Random Forest classifier is capable of correctly classifying approximately 83% of the dataset. The proposed method for kick-starting uses very basic search queries to gather data. Overly broad search queries consisting of a single hashtag do not work. Extending the single-hashtag query with two subject-specific keywords creates a training set with which approximately 70% of the original dataset can be correctly classified, and whose results can be used to continue training.
Table of Contents
1 Introduction
2 Related work
2.1 Text classification
2.2 Social context
2.3 Aggregating information from social media
Using the N-grams of both corpora, the log-likelihood ratio and tf*idf weight can be calculated. The tf*idf weight is calculated using the following formula:

$$\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D)$$

$$\mathrm{idf}(t, D) = \log \frac{|D|}{|\{d \in D : t \in d\}|}$$

Where $D$ is all documents combined, $|D|$ the number of documents in $D$, $t$ a given term and $d$ a document from $D$. For this report $D$ consists of both the review and non-review corpus and $d$ is the review corpus.
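To make the weighting concrete, here is a minimal Python sketch of the tf*idf calculation when, as in this report, each corpus is treated as a single document. The function and variable names are illustrative assumptions, not the thesis' implementation:

import math
from collections import Counter

def tfidf(term, doc, corpus):
    # doc: token list of one "document" (here: a whole corpus);
    # corpus: list of such token lists (review and non-review corpus)
    tf = Counter(doc)[term]                   # term frequency in doc
    df = sum(1 for d in corpus if term in d)  # documents containing term
    return tf * math.log(len(corpus) / df)    # assumes df > 0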
For calculating the log-likelihood ratio weight of a term, first the following table is constructed:
                          Corpus 1    Corpus 2    Total
Frequency of word         a           b           a + b
Frequency of other words  c − a       d − b       c + d − a − b
Total                     c           d           c + d

Table 1
The values a and b are called the observed values, the values c and d are the total number of words
in their respective corpus. For both corpora the expected value can now be calculated using the table
above and the following formulas:
$$E_1 = c \cdot \frac{a + b}{c + d} \qquad E_2 = d \cdot \frac{a + b}{c + d}$$
Using the expected values for term $t$ in both corpora, the log-likelihood ratio weight can be calculated:

$$\mathrm{LLR}(t) = 2 \left( a \ln \frac{a}{E_1} + b \ln \frac{b}{E_2} \right)$$
From both sets the X highest-ranking N-grams and their corresponding values are selected, where X is a number suitable for the test being executed. The selected words are used for building the feature vector.
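As an illustration of the term weighting and selection described above, the following Python sketch computes the log-likelihood ratio weight of every term, following Table 1 and the formulas in this section, and keeps the X highest-ranking terms. All names are assumptions:

import math
from collections import Counter

def llr_weights(corpus1, corpus2):
    # corpus1, corpus2: token (N-gram) lists of the two corpora
    f1, f2 = Counter(corpus1), Counter(corpus2)
    c, d = len(corpus1), len(corpus2)     # total words per corpus
    weights = {}
    for term in set(f1) | set(f2):
        a, b = f1[term], f2[term]         # observed values
        e1 = c * (a + b) / (c + d)        # expected value, corpus 1
        e2 = d * (a + b) / (c + d)        # expected value, corpus 2
        weights[term] = 2 * ((a * math.log(a / e1) if a else 0.0)
                             + (b * math.log(b / e2) if b else 0.0))
    return weights

# keep the X highest-ranking terms, e.g. X = 200:
# top = sorted(weights, key=weights.get, reverse=True)[:200]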
3.4 Creating feature vectors for tweets
After normalizing the data and calculating the word weights, it is time to create the actual feature vectors. In this section all the features are defined and explained where necessary. When describing the features, "words" refers to the words as found in the data normalization phase and "message" refers to the content of a tweet in its original form.
3.4.1 Post time ratio
The tweet post time, as time of day in milliseconds, divided by the total number of milliseconds in 24 hours.
3.4.2 Message length
The number of characters in the message. This includes all types of characters.
3.4.3 Number of words
The words counted are the words that are found in the data preparation phase.
3.4.4 Unique words ratio

$$\frac{|\mathrm{unique\ words}(m)|}{|\mathrm{words}(m)|}$$

The unique words ratio is calculated by dividing the number of unique words by the total number of words.
3.4.5 Mentions ratio

$$\frac{|\{\text{mentions in } m\}|}{|\mathrm{words}(m)|}$$

The number of mentions in a message divided by the number of words in the message. Mentions are defined by the regular expression:

(?:\s|^)@\w+

which translates to: whitespace or the beginning of the line, followed by an at sign (@) and word characters.
3.4.6 Hashtags ratio

$$\frac{|\{\text{hashtags in } m\}|}{|\mathrm{words}(m)|}$$

The number of hashtags in a message divided by the number of words in the message. Hashtags are defined by the regular expression:

(?:\s|^)#\w+

which translates to: whitespace or the beginning of the line, followed by a number sign (#) and word characters.
3.4.7 URLs ratio

$$\frac{|\{\text{URLs in } m\}|}{|\mathrm{words}(m)|}$$

The number of URLs in a message divided by the number of words in the message. URLs are defined by the regular expression:

(?:\s|^)(http://|https://)(\S+)

which translates to: whitespace or the beginning of the line, followed by "http://" or "https://" and one or more non-whitespace characters.
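All three token ratios follow directly from these regular expressions; a minimal sketch (the capture groups of the URL pattern are made non-capturing here so that findall counts whole matches):

import re

MENTION = re.compile(r'(?:\s|^)@\w+')
HASHTAG = re.compile(r'(?:\s|^)#\w+')
URL = re.compile(r'(?:\s|^)(?:http://|https://)\S+')

def token_ratios(message, words):
    # words: the token list from the data preparation phase
    n = len(words)
    return {'mentions': len(MENTION.findall(message)) / n,
            'hashtags': len(HASHTAG.findall(message)) / n,
            'urls': len(URL.findall(message)) / n}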
3.4.8 Location name ratio { }
The number of location names divided by the number of words in a message. The number of location
names is found by counting the results of the location detection service.
3.4.9 Char class ratios

$$\frac{|\{c \in m : c \in \text{class}\}|}{|m|}$$

The number of characters from a given class in a message divided by the length of the message. The character classes used are the following sets, as defined by the Python string constants [38]:

Digits
Whitespace
Punctuation
Uppercase
Lowercase
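Using the cited Python string constants, the five ratios can be computed in one pass; a minimal sketch:

import string

CHAR_CLASSES = {
    'digits': string.digits,
    'whitespace': string.whitespace,
    'punctuation': string.punctuation,
    'uppercase': string.ascii_uppercase,
    'lowercase': string.ascii_lowercase,
}

def char_class_ratios(message):
    n = len(message)
    # fraction of the message's characters that fall in each class
    return {name: sum(ch in chars for ch in message) / n
            for name, chars in CHAR_CLASSES.items()}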
3.4.10 Repeating char ratio

$$\frac{\sum_{i=1}^{|m|-1} [\,m_i = m_{i+1}\,]}{|m|}$$
The number of repeating, consecutive characters in a message, divided by the total number of
characters in the message.
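One way to read this definition in code, counting every character that equals its predecessor (this reading is an assumption):

def repeating_char_ratio(message):
    # count characters that repeat the directly preceding character
    repeats = sum(1 for prev, cur in zip(message, message[1:]) if prev == cur)
    return repeats / len(message)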
3.4.11 Part of speech category ratio

$$\frac{|\{w \in \mathrm{words}(m) : \mathrm{POS}(w) = \text{category}\}|}{|\mathrm{words}(m)|}$$

The number of words of a given category in a message divided by the number of words of the message. The part-of-speech categories used are the following:
Nouns
Verbs
Articles
Numerals
Prepositions
Adjectives
Adverbs
Conjunctions
The number of occurrences of each class can be extracted from the lexical analyzer output.
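Given one coarse POS category per word, e.g. mapped from the lexical analyzer's output tags (the mapping itself is an assumption), the ratios reduce to counting:

from collections import Counter

POS_CATEGORIES = ['noun', 'verb', 'article', 'numeral',
                  'preposition', 'adjective', 'adverb', 'conjunction']

def pos_ratios(pos_tags):
    # pos_tags: one coarse POS category per word of the message
    counts = Counter(pos_tags)
    n = len(pos_tags)
    return [counts[cat] / n for cat in POS_CATEGORIES]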
3.4.12 Lexical density

$$\mathrm{LD}(m) = \frac{1}{|K|} \sum_{\substack{k_i, k_j \in K \\ i < j}} \frac{w(k_i) + w(k_j)}{d(k_i, k_j)^2}$$
Where $m$ is the message, $K$ the set of non-stop words (keywords) in $m$, $w(k)$ the weight of keyword $k$ and $d(k_i, k_j)$ the distance between two keywords in the message. Summing over all keyword pairs, the sum of the weights of two keywords is divided by their distance squared, so the contribution of a pair decreases quadratically as the distance increases. The result is then normalized by the number of keywords. The weight of a keyword is defined by the log-likelihood ratio of the keyword.
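A sketch of the reconstructed lexical density; since the original formula was lost in extraction, both the pairwise form and the input layout are assumptions based on the description above:

from itertools import combinations

def lexical_density(keyword_hits):
    # keyword_hits: (position, llr_weight) per non-stop word of the
    # message, in order of appearance
    if len(keyword_hits) < 2:
        return 0.0
    total = sum((w1 + w2) / (p2 - p1) ** 2
                for (p1, w1), (p2, w2) in combinations(keyword_hits, 2))
    return total / len(keyword_hits)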
3.4.13 tf*idf score

$$\mathrm{score}(m) = \sum_{t \in m} \mathrm{tfidf}(t, d, D)$$

The sum of the tf*idf values of every word in message $m$.
3.4.14 Log-likelihood word ratios
After calculating the log-likelihood ratios, the top $N$ words are selected. For every word $w$ in the top $N$, the occurrence ratio in the message is calculated:

$$\mathrm{ratio}(w, m) = \frac{f(w, m)}{|\mathrm{words}(m)|}$$
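Put together, the LLR part of a tweet's feature vector could be built as follows (a sketch; names are assumptions):

def llr_word_ratios(words, top_terms):
    # occurrence ratio of each of the top-N LLR terms in one tweet
    n = len(words)
    return [words.count(term) / n for term in top_terms]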
4 Experiments
The experiments are set up to evaluate the performance of the classifier. Performance is defined as the percentage of correctly classified tweets. This means that a performance of 50% is no better than randomly classifying 50% of the tweets as review and 50% as non-review.
4.1 Single feature classification
Because of the diversity of the research areas from which features are drawn, testing the performance of each feature individually gives a clear insight into the behavior of each feature, i.e. the ability to classify reviews based on that feature alone.

The classification method used in this experiment is the Random Forest, using 10 trees.
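The thesis runs its classifiers in WEKA [33]; purely as an illustration, an equivalent single-feature run with a 10-tree Random Forest could look like this in Python with scikit-learn (placeholder data; not the setup used in the thesis):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# placeholder data: replace with the real feature matrix and labels
rng = np.random.default_rng(0)
X = rng.random((400, 14))    # 400 tweets, 14 features
y = rng.integers(0, 2, 400)  # 1 = review, 0 = non-review

clf = RandomForestClassifier(n_estimators=10)
for i in range(X.shape[1]):  # one single-feature run per feature
    score = cross_val_score(clf, X[:, [i]], y).mean()
    print(f'feature {i}: {score:.1%}')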
4.1.1 Tweet properties
Figure 1: Single-feature performance (%) of the tweet properties: post time 57.1, length 52.6, words 55.6, unique words ratio 54.4.
None of the features seems very discriminative between the two classes. The post time feature performs best with 57%. The length, words and unique words ratio features were expected to perform better, given the results seen in quality prediction research. A possible explanation is that the limited length of tweets limits the possible variation in length.
4.1.2 Textual statistics
Figure 2: Single-feature performance (%) of the textual statistics: punctuation ratio 52.2, digits ratio 53.6, whitespace ratio 60.5, uppercase ratio 58.5, lowercase ratio 51.0, letter ratio 56.9, repeating ratio 52.0.
The textual features show highly variable performance. The punctuation and repeating char ratios were expected to perform better based on earlier research. The other features, implemented because they required little extra code, perform unexpectedly well, specifically the whitespace, letter and uppercase ratios. A possible reason these features perform relatively well is that reviews tend to be more coherent messages without excessive uppercase usage, i.e. shouting.
4.1.3 Twitter tokens
Figure 3: Single-feature performance (%) of the Twitter tokens: mentions ratio 55.0, hashtags ratio 61.5, URLs ratio 51.6.
Both the mentions and the hashtags ratios seem to perform rather well. The URLs ratio is almost non-discriminative. A reference to the service or product under discussion is considered normal in other media and would be expected on Twitter. It looks like people are not aware of the fact that they are writing a review, assuming that the reader understands it or will look it up themselves.
4.1.4 POS
Figure 4: Single-feature performance (%) of the POS category ratios: nouns 53.3, verbs 57.2, articles 53.1, numerals 54.7, prepositions 63.9, adjectives 56.9, adverbs 54.9, conjunctions 53.2.
The POS features show a wide range of results. The prepositions ratio is the best performing feature. Reviews about dinners, bars, etc. always contain a reference to where the dinner took place. Those references are usually preceded by one of the Dutch prepositions "in" or "bij" ("at"). Twitter reviews also contain references to the people who accompanied the reviewer; these references are often preceded by the Dutch preposition "met" ("with"). The performance of the verbs and adjectives ratios is as expected, because having dinner or having a drink is an activity and an adjective is needed to describe the experience.
4.1.5 Other
Figure 5: Single-feature performance (%) of the remaining features: location ratio 54.1, lexical density 53.0, idf*tf 52.3.
These results are unexpected. Lexical density has proven to be a good indicator of content quality, but it seems that lexical density, like the unique words ratio, suffers from the limited message length of tweets. The tf*idf feature is not as effective as expected; this may be because all the tf*idf values are summed. The location ratio isn't performing that well either; like the URLs feature, it seems that people
do not reference the establishment as clearly as one might expect from a review, and very rarely include the location of the establishment. A different reason for the performance of the location ratio could be that the location name service used has difficulty with the Dutch language and can only filter out English location names.
4.1.6 LLR
Figure 6: Performance of the LLR feature (roughly 77–81%) against the number of words (20–220), for 10%–50% of the dataset used for LLR calculation.
The LLR ratio feature has been tested on a range of combinations of the following two variables:
The number of words taken as feature
The percentage of the dataset used for determining the LLR values of words
A general trend across all percentages of training data is a peak at the lowest number of words and a second peak around 200 words; using more than 200 words decreases the performance across all sizes of training data. The peak at the beginning can possibly be explained by the small feature space and the high discriminative value of the top 20 words. The immediate decrease in performance between 20 and 80 words is more difficult to explain. It is possible that, because the most discriminative words are in the top 20, adding words with exponentially less discriminative value lowers the overall performance at first. As more and more words are added, more complex models of reviews and non-reviews can be built, resulting in overall improving performance. The decrease in performance after 200 words is possibly due to the very limited discriminative value of those extra words.
4.2 N-grams
To see whether N-grams can improve the performance of the LLR feature, this experiment compares the performance of different N-gram sizes. 200 N-grams are used and 20% of the data has been used for the LLR ratio calculation.
Figure 7: Classification performance of 1-gram through 5-gram LLR features.
We can see from the results in Figure 7 that using N-grams does not improve the classification performance. The bi-grams seem to very slightly outperform the uni-grams, i.e. the normal LLR feature, but not in a meaningful way. It is possible that the limited length of tweets plays a large part in the observed performance. With bi-grams, the number of possible combinations is the square of the number of uni-grams, which results in far fewer shared word combinations within a single corpus.
4.3 Feature combinations
Now that we have seen how individual features perform, it is interesting to see how the groups perform when their features are combined. In this experiment we will also see how the classifier performs when combining all features from all groups. Although the LLR feature will not be combined in any way, it is interesting to compare it with all the other groups of features.
Figure 8: Performance (%) of the feature group combinations: properties 61.8, statistics 62.9, tokens 60.3, POS 63.7, other 53.6, LLR 79.6, all combined 81.2.
These results clearly show that even though single features do not perform well, increasing the feature space gives the classifier room to find more complex patterns. This goes for every group except "Other": it seems that even combining the three very low performing features from this group gives the classifier nothing to work with. One interesting detail that stands out is that the LLR feature performs only about two percentage points worse than everything combined, suggesting that the classification that can be done using the other groups can mostly be done by the LLR feature alone.
4.4 Classification method
The classification method used in the previous experiments is the Random Forest. A more commonly used classifier for text classification is the Naïve Bayes classifier. Many more classifiers are available, and the type of classifier can have a significant effect on the overall results. It is therefore interesting to see how different classifiers perform.
Figure 9: Performance (%) per classification method: C4.5 78.8, Naive Bayes 74.7, K* 68.0, Random Forest 83.8, SVM 72.0.
Because most classification methods have one or more parameters, the results in Figure 9 are the results of the best performing combination of parameters. Random Forest performs best of all the classification methods.
4.5 Real life classification
The dataset used in the previous experiments was gathered over the course of 12 weeks, from week 46 in 2011 to week 6 in 2012. The following two experiments are meant to find out how time affects the performance of the classifier.
4.5.1 Preceding week based classification
This experiment uses week n-1 for training and week n-2 for LLR calculation when classifying week n. The idea behind this experiment is that the correlation in terms of content may be higher between consecutive weeks. When a big event occurs and people mention the event in their messages, using the preceding weeks for classification may let the classifier take advantage of that correlation.
Figure 10: Weekly classification performance (75–90%) for weeks 48 through 6 when training on the preceding weeks.
The results in Figure 10 are very irregular. From week zero onward there seems to be promising growth in performance, only to fall back after week four. It is possible that, with two full preceding weeks used for training, the time window is too large to actually exploit events as extra information. A shorter period might work better, but the limited number of reviews gathered per week (+/- 200) already limits the performance.
4.5.2 Start of a new subject
Classification has only been done for reviews about restaurants, bars, etc., but other clients of Jalt have already expressed interest in reviews about other subjects. To find out how the classifier performs during the start of a new subject, this experiment classifies week n using all weeks preceding week n for training.
Figure 11: Weekly classification performance (70–90%) for weeks 46 through 5 when using all preceding weeks as training set.
The results in Figure 11 are as one would expect when training with an increasing training set size. The first couple of weeks show a very irregular pattern, but overall an increasing line, until it stabilizes around 83%.
4.6 Semi-automated training
In these experiments the possibility of semi-automated training has been explored. Instead of using part of the original dataset, new data has been collected to use for training. A classifier is trained with this data and tested on the original dataset. If these experiments prove successful, the effort required to train the classifier will be dramatically reduced, because no manual classification is necessary when starting with a new review subject. In these experiments, both the number of LLR terms and the percentage used for LLR term generation have been taken into account and tested in various combinations to find the best possible performance.
4.6.1 Single hashtag
Because Twitter uses hashtags to group content together, using a single hashtag for gathering training data would be ideal. The first hashtag taken into consideration was "#review". For the English language this would have been a very usable hashtag, but it is unfortunately unusable for Dutch content. A common Dutch hashtag that is often used for recommending something to others is "#aanrader" ("recommendation"). Using only "#aanrader" as search query, a new dataset has been gathered and used for classification.
Figure 12: Performance (50–54%) of the classifier trained on the "#aanrader" dataset, against the number of LLR terms (10–200), for 10%–50% LLR training data.
Looking at the results in Figure 12, we can see that the performance overall just barely exceeds 53%. Looking at the tweets gathered using this query, we see that recommendations for restaurants, bars, etc. only make up a tiny fraction of the entire dataset, probably making the classifier too broad for the specific type of reviews it is required to classify.
4.6.2 Single hashtag extended
Because of the poor results of the single hashtag, it is interesting to see whether extending the single hashtag search query with a limited number of extra keywords can increase the performance of the classifier. The number of keywords is limited to two, because the idea of semi-automated classification is that starting a new classifier for a new subject should take the least possible manual input.

Since most of the tweets are about eating (it is hard to review the quality of drinks), the chosen keywords are "eten" ("food") and "gegeten" ("eaten"), resulting in the search query:

#aanrader AND eten OR gegeten
Figure 13: Performance (62–72%) of the classifier trained on the "#aanrader AND eten OR gegeten" dataset, against the number of LLR terms (10–200), for 10%–50% LLR training data.
The results of the extended search query significantly outperform the single hashtag query. Again we see that 200 LLR terms performs best. The best performance seems to lie around 20 and 30 percent of LLR training data. A possible explanation for the optimum being at the lower sizes is the size of the dataset: using the extended search query, only around 6 tweets per day were found. The set used in this experiment therefore consists of only 200 tweets, in contrast to the 2000 tweets in the single hashtag dataset.
5 Conclusion
In this thesis a method for classifying review tweets has been discussed, using several types of features such as textual contents, Twitter-specific tokens and parts of speech. With such a system it is possible to automatically filter out reviews in a certain category, which webmasters can use to enrich their website content. A method for kick-starting the classifier has also been proposed, so that building a classifier for a new category takes very little human effort.
Using the proposed method it is possible to classify review/non-review tweets with 83% accuracy. The LLR feature has proven the most useful for classification, but only in combination with the other features was it possible to get beyond 80%. N-gramming the words in a message has no positive effect on the performance of the classifier.
The best classification method for this problem is the Random Forest. Although other classification methods could possibly be improved by feature selection, the Random Forest has the advantage that training is very fast in comparison to other methods like SVM.
Kick-starting the classification process using very simple search queries has also proven possible. The performance of the classifier drops when a search query is too broad, but making it more specific with only two extra keywords can significantly improve performance, even when a training set of only 200 tweets has been gathered.
Possible further research would be to see what effect the size of the dataset has on the performance of the kick-started classifier. The original dataset is specifically tailored to the wishes of the Dutch yellow pages, e.g. excluding big chains. It could be interesting to see how the classifier performs when
big chains are annotated as review instead of non-review. The influence of time on classification has been tested over a period of three months. It would be interesting to run experiments over a longer period of time with longer intervals, to see whether there are seasonal events, or with shorter intervals, to see the influence of more short-term events.
6 References
[1] J. Surowiecki, The Wisdom of Crowds: Why the Many are Smarter Than the Few and how Collective Wisdom Shapes Business, Economies, Societies, and Nations, Doubleday, 2004.
[3] R. Kelly, "Twitter Study - August 2009," Pear Analytics, San Antonio, 2009.
[4] Twitter INC., "What Are Hashtags ("#" Symbols)?," Twitter INC., [Online]. Available: http://support.twitter.com/articles/49309-what-are-hashtags-symbols. [Accessed 6 June 2012].
[5] T. Joachims, "Text Categorization with Support Vector Machines: Learning with Many Relevant Features," Universität Dortmund, Dortmund, 1998.
[6] I. Androutsopoulos, J. Koutsias, K. Chandrinos, G. Paliouras and C. Spyropoulos, "An evaluation of naive bayesian anti-spam filtering," Arxiv, 2000.
[7] L. Manevitz and M. Yousef, "One-class SVMs for document classification," The Journal of Machine Learning Research, vol. 2, pp. 139-154, 2002.
[8] H. Ragas and C. Koster, "Four text classification algorithms compared on a Dutch corpus," Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, pp. 369-370, 1998.
[9] Y. Matsuo and M. Ishizuka, "Keyword extraction from a single document using word co-occurrence statistical information," International Journal on Artificial Intelligence Tools, vol. 13, no. 1, pp. 157-170, 2004.
[10] M. Weintraub, "LVCSR log-likelihood ratio scoring for keyword spotting," in Acoustics, Speech, and Signal Processing, 1995. ICASSP-95., 1995 International Conference on, 1995.
[11] T. Tokunaga and M. Iwayama, "Text categorization based on weighted inverse document frequency," in Special Interest Groups and Information Process Society of Japan (SIG-IPSJ), 1994.
[12] T. Dunning, "Accurate methods for the statistics of surprise and coincidence," Computational linguistics, vol. 19, pp. 61-74, 1993.
[13] S. Ahmed and F. Mithun, "Word stemming to enhance spam filtering," in the Conference on Email and Anti-Spam (CEAS’04), 2004.
[14] T. Sembok, "Word Stemming Algorithms and Retrieval Effectiveness in Malay and Arabic Documents Retrieval Systems," in Proceeding of World Academy of Science, Engineering and Technology, 2005.
[15] J. Carlberger, H. Dalianis, M. Hassel and O. Knutsson, "Improving precision in information retrieval for Swedish using stemming," in the Proceedings of NODALIDA, 2001.
[16] J. Fürnkranz, "A study using n-gram features for text categorization," Austrian Research Institute for Artificial Intelligence, 1998.
[17] W. Cavnar and J. Trenkle, "N-gram-based text categorization," Ann Arbor MI, vol. 48113, no. 2, pp. 161-175, 1994.
[18] Y. Lu, P. Tsaparas, A. Ntoulas and L. Polanyi, "Exploiting social context for review quality prediction," Proceedings of the 19th international conference on World wide web, pp. 691--700, 2010.
[19] J. Bian, Y. Liu, D. Zhou, E. Agichtein and H. Zha, "Learning to recognize reliable users and content in social media with coupled mutual reinforcement," in Proceedings of the 18th international conference on World wide web, 2009.
[20] M. Bosma, E. Meij and W. Weerkamp, "A Framework for Unsupervised Spam Detection in Social Networking Sites".
[21] H. Kwak, C. Lee, H. Park and S. Moon, "What is Twitter, a social network or a news media?," in Proceedings of the 19th international conference on World wide web, 2010.
[22] S. Perez, "Twitter is NOT a Social Network, Says Twitter Exec," ReadWriteWeb, 14 September 2010. [Online]. Available: http://www.readwriteweb.com/archives/twitter_is_not_a_social_network_says_twitter_exec.php. [Accessed 5 June 2012].
[23] G. Lee, J. Seo, S. Lee, H. Jung, B. Cho, C. Lee, B. Kwak, J. Cha, D. Kim and J. An, "SiteQ: Engineering high performance QA system using lexico-semantic pattern matching and shallow NLP," in Proceedings of the Tenth Text REtrieval Conference (TREC 2001), 2001.
[24] M. Hu, A. Sun and E.-P. Lim, "Comments-oriented document summarization: understanding documents with readers' feedback," in Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, 2008.
[25] @BlackRose50101, 7 June 2012. [Online]. Available: http://twitter.com/BlackRose50101/status/210752151255924736.
[26] B. Sriram, D. Fuhry, E. Demir, H. Ferhatosmanoglu and M. Demirbas, "Short text classification in twitter to improve information filtering," in Proceeding of the 33rd international ACM SIGIR conference on research and development in information retrieval, 2010.
[27] A. Oghina, M. Breuss, M. Tsagkias and M. de Rijke, "Predicting IMDb Movie Ratings using Social Media," in 34th European Conference on Information Retrieval (ECIR 2012). Springer-Verlag, 2012.
[28] A. Tumasjan, T. Sprenger, P. Sandner and I. Welpe, "Predicting elections with twitter: What 140 characters reveal about political sentiment," in Proceedings of the Fourth International AAAI Conference on Weblogs and Social Media, 2010.
[29] E. Ruiz, V. Hristidis, C. Castillo, A. Gionis and A. Jaimes, "Correlating financial time series with micro-blogging activity," in Proceedings of the fifth ACM international conference on Web search and data mining, 2012.
[30] G. Mishne, "Experiments with mood classification in blog posts," in Proceedings of ACM SIGIR 2005 Workshop on Stylistic Analysis of Text for Information Access, 2005.
[31] A. Go, L. Huang and R. Bhayani, "Twitter sentiment analysis," Final Projects from CS224N for Spring, vol. 2009, 2008.
[32] Twitter INC., "What are @Replies and Mentions?," Twitter INC., [Online]. Available: http://support.twitter.com/articles/14023. [Accessed 6 June 2012].
[33] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, Witten and I. H., "The WEKA data mining software: an update," SIGKDD Explor. Newsl., vol. 11, no. 1, pp. 10-18, 2009.
[34] Machine Learning Group at University of Waikato, "Weka 3: Data Mining Software in Java," Machine Learning Group at University of Waikato, [Online]. Available: http://www.cs.waikato.ac.nz/~ml/weka/. [Accessed 7 June 2012].
[35] Twitter INC., "GET search," Twitter INC., 18 April 2012. [Online]. Available: https://dev.twitter.com/docs/api/1/get/search. [Accessed 5 June 2012].
[36] A. v. d. Bosch, "Frog Dutch morpho-syntactic analyzer and dependency parser," ILK Research Group, 24 May 2012. [Online]. Available: http://ilk.uvt.nl/frog/. [Accessed 5 June 2012].