Combining Knowledge and Data Mining to Understand Sentiment

WHITE PAPER

Combining Knowledge and Data Mining to Understand Sentiment – A Practical Assessment of Approaches


Table of Contents

Abstract
Introduction
The Elements of Sentiment Analysis
    What Is Sentiment Analysis?
    When Is It Relevant?
    Elements of Sentiment Analysis
Sentiment Analysis Methods
    The Data
    Data Mining Approach
        Benefits of the data mining approach
        Drawback of the data mining approach
    Natural Language Processing Approach
        Step one: taxonomy identification
        Step two: defining objects and attributes
        Step three: defining polarity
        Benefits of the NLP approach
        Drawback of the NLP approach
The Best of Both Worlds
    Data Mining of the Text for the Rule Builder
    Hybrid Approaches
        Polarity scores as additional features
        Stacked models
Results
    Attribute-Level Results
    Overall Results
Other Applications
    Importing Models
    Creating Training Data
    Other Capabilities of SAS® Enterprise Miner™
Conclusions
References

Russell Albright is a Research Statistician Developer at SAS and has been working on SAS® Text Miner algorithms since its initial release more than 10 years ago. He holds a master’s and a doctorate in applied math from Clemson University. Albright has expertise in numerical matrix methods and Bayesian networks, and he has experience applying text mining to many Web-based sources, including Twitter, Yahoo and PubMed.

Praveen Lakkaraju is a Software Developer at SAS and is a member of the SAS Text Analytics research and development team. His areas of experience include sentiment analysis, information retrieval and content categorization. He was instrumental in the launch of the SAS Social Media Analytics solution, and is still actively involved in its development. Lakkaraju holds a master’s in computer science from the University of Kansas, where he specialized in the field of natural language processing.


Abstract

An important application of text analytics is to automatically characterize the sentiment of documents in a variety of domains, whether it is positive, negative or neither. In this paper we explore the benefits of combining domain-specific linguistic rules with data mining methods to improve both the effectiveness of your models and the efficiency of the model builder.

Introduction

Our world has changed drastically in the last 10 years. An individual’s opinions are no longer shared only with his or her immediate family and friends, but instead are capable of influencing the decisions of thousands or even millions of people the individual has never even met. The Internet has given the individual a platform to broadcast grievances and recommendations that can reach across the world. And the existence of social networks gives these opinions the potential to snowball into a viral frenzy that can make your company’s products or services a worldwide boon or a global catastrophe in just a matter of days.

The savvy marketer monitors and evaluates relevant Web content continually to understand consumer sentiment toward products or services from his company – and toward his competitors. This attention to Web content allows the company to respond quickly to customer opinion.

The sheer volume of references related to your company’s products or services makes automating this task essential. Sources such as blogs, product reviews, forums and news articles can all be monitored, scored for relevance against your topics of interest, and then classified according to sentiment.

The Elements of Sentiment Analysis

What Is Sentiment Analysis?

Sentiment analysis is an automatic method that provides feedback to you regarding the opinions and attitudes of your customers. The analysis is based on customers’ electronic written commentaries regarding your products and services and those of your competitors. The feedback can be provided at a very high level with drill-down so that you can explore how opinions differ within groups, subgroups and even at the individual level.


More precisely, sentiment analysis is the process of classifying or rating the opinions or sentiment expressed in a document. The rating may assign the sentiment into one of three categories: positive, negative or neutral; or it may, instead, assign a numeric score. The rating that is assigned is termed polarity. The sentiment may be assessed for the entire document or for particular objects or attributes mentioned in the document.

When Is It Relevant?

Sentiment analysis is relevant in almost every context in which your customers or potential customers express themselves in written form, and possibly spoken form, via different communication channels. These comments may not have been intended for direct consumption by your company. They may have been posted in website forums, tweets, blogs or other Web pages and directed toward your potential customers. On the other hand, some content may have been intentionally directed at your company through e-mail, a company support website, a survey questionnaire, a call center desk, etc.

Automated sentiment analysis is important to implement when you are inundated with relevant, useful feedback through these channels. For many companies, it is impossible for individuals to monitor and understand all that is communicated in these sources due to their sheer volume. The information comes too quickly and from too many channels. Sentiment analysis provides you with an immediate interpretation, not just of every individual comment but also of the global opinions expressed.

Elements of Sentiment Analysis

You cannot implement a comprehensive sentiment analysis solution with a process that merely analyzes the sentiment of a document. Instead, you must coordinate several tasks to maximize the benefits.

1. Data acquisition phase. This phase involves setting up an automated process to obtain a clean set of documents to analyze. You can use SAS software to obtain the documents from the Internet and from local file systems or databases. SAS software can also be used to filter the documents by eliminating any “noise” that is common to Web documents (e.g., filtering spam).

2. Sentiment assignment phase. This phase involves creating a model that can calculate the polarity of the author’s sentiment or opinion toward your topics of interest and applying that model to new, previously unseen documents. SAS technologies can help you derive accurate assessments of sentiment.

3. Summarization and reporting phase. Identifying sentiment within a particular document is interesting in itself, but frequently it will be of more interest to characterize representative populations within your collection. SAS provides techniques for such exploration, which entails answering questions such as:

• Does the age of our customer tend to make a difference in his or her opinion about our service?

• How do the cumulative opinions about our competitor’s product compare with the cumulative opinions about our product?

• Did our customers perceive the changes we made to our outlet stores as beneficial, or not?

4. Repetition phase. The final step in your sentiment analysis project will be to set up a process to automate the entire analysis on a repeated basis. This allows you to monitor sentiment changes, identify important influencers and respond quickly to what you learn.

For this paper we will focus primarily on the sentiment assignment phase. Note that because text is written in natural language and not with a precise quantitative representation, analyzing it effectively for sentiment poses many challenges.

For one, natural language text is full of ambiguities, implicit meaning and subtle nuances. Normally a human reader has the necessary experience to both understand natural language expressions and to comprehend the meaning of the subject area along with the sentiment the author intended to communicate. But automating this process in a computer can be challenging. Such things as slang, pronoun resolution, sarcasm and idioms all make a direct interpretation of the text difficult.

Further, an automatic process will not function at the semantic level of the text at all unless there is a direct mapping of a linguistic rule to semantics. In many instances this can be captured with the rules we will discuss later; but the diversity of ways to express the same meaning can make it difficult to accurately capture all situations with a set of rules.

There are two primary approaches to building models for sentiment analysis. The first, natural language processing, uses a domain expert to build a set of linguistic rules to determine the sentiment polarity of the document’s content. The second, machine learning, uses training data (documents that have the sentiment polarity already assigned to them) to build a predictive model. Predictive models such as decision trees, logistic regressions or neural networks will make this prediction on documents that are outside the training set.

Sentiment Analysis Methods

The Data

We will use two collections of movie review data to demonstrate the techniques presented in this paper. The first collection created by Pang and Lee contains 2,000 movie reviews. The collection is split evenly with 1,000 positive and 1,000 negative reviews.1 The second collection was obtained by retrieving 6,631 movie reviews from Yahoo.2 This collection has both overall ratings for the movie being discussed and also ratings for several attributes of each movie, including the story line, cast, direction and visuals.

Although your data is almost certainly not movie review data, the concepts and techniques demonstrated using this movie data are applicable to most other sentiment-related text data sets.

Data Mining Approach

A data mining approach to sentiment analysis translates an unstructured text problem to one that makes predictions on structured, quantitative data. The approach borrows several techniques from the computational linguistics and information retrieval communities to represent the text numerically, and then applies traditional data mining techniques to this numeric representation. In the end, a target variable is identified and a pattern is discovered from the training data for predicting sentiment polarity. This pattern can then be used to predict the sentiment polarity of new observations.

The first step in creating the numeric representation is to convert the entire training collection into a document-by-term frequency matrix. Each document is parsed into individual terms, or term/part-of-speech pairs. The set of all terms then becomes the variables in the data set, so that each document is represented as a vector whose length equals the number of distinct terms in the collection. These vectors are very sparse, containing mostly zeroes, because any one document contains only a very small percentage of the terms in the collection. Once the documents are represented as vectors, the frequencies in each cell can be weighted with a function that takes into account the distribution of the term across the collection and relative to the levels of the target variable.
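The construction of the document-by-term matrix can be sketched in a few lines of Python. This is only an illustration, not the SAS Text Miner implementation: the naive tokenizer, the three toy reviews and the log-frequency times inverse-document-frequency weighting are all assumptions, since the paper does not name a specific weighting function, and the sketch ignores the target variable.

```python
import math
from collections import Counter

def tokenize(text):
    # Naive tokenizer (assumption): lowercase, split on whitespace, strip punctuation.
    stripped = (t.strip(".,!?\"'") for t in text.lower().split())
    return [t for t in stripped if t]

def term_document_matrix(docs):
    # One Counter of raw term frequencies per document.
    counts = [Counter(tokenize(d)) for d in docs]
    # The sorted vocabulary defines the variables (columns) of the data set.
    vocab = sorted(set(t for c in counts for t in c))
    matrix = [[c.get(t, 0) for t in vocab] for c in counts]
    return vocab, matrix

def weight(matrix):
    # Illustrative weighting (assumption): log-scaled frequency multiplied by
    # inverse document frequency, so widely spread terms count for less.
    n_docs = len(matrix)
    n_terms = len(matrix[0])
    doc_freq = [sum(1 for row in matrix if row[j] > 0) for j in range(n_terms)]
    return [
        [math.log(1 + row[j]) * math.log(n_docs / doc_freq[j]) for j in range(n_terms)]
        for row in matrix
    ]

docs = [
    "A great movie with a great cast.",
    "A dull plot and a weak cast.",
    "Great visuals, dull story.",
]
vocab, tf = term_document_matrix(docs)
weighted = weight(tf)
```

Each row of `tf` is one document vector. Even in this toy collection most cells are zero, which is the sparsity described above.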

After these document vectors are formed, a dimension reduction technique – such as the singular value decomposition (see Taming Text with the SVD, Albright, 2004) – is typically used to represent each document in a reduced-dimensional space of maybe 50 to 100 variables, where each variable is a linear combination of the weighted terms that originally represented each document.
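To make the dimension reduction step concrete, the sketch below approximates the leading right singular vectors of a small weighted matrix by power iteration with deflation and then projects each document onto them. It is a teaching stand-in under stated assumptions: the 4-by-4 matrix is invented, k is 2 rather than 50 to 100, and a production system would call a tested SVD routine instead of this loop.

```python
import math

def top_right_singular_vectors(a, k, iters=200):
    """Approximate the k leading right singular vectors of a (a list of rows)."""
    n = len(a[0])
    found = []
    for i in range(k):
        v = [1.0 / (i + j + 1) for j in range(n)]  # deterministic starting vector
        for _ in range(iters):
            # Deflation: remove components along directions already found.
            for u in found:
                dot = sum(x * y for x, y in zip(v, u))
                v = [x - dot * y for x, y in zip(v, u)]
            # One power-iteration step with A^T A: first A v, then A^T (A v).
            av = [sum(r * x for r, x in zip(row, v)) for row in a]
            w = [sum(a[m][j] * av[m] for m in range(len(a))) for j in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break  # nothing left to find; k exceeds the rank of a
            v = [x / norm for x in w]
        found.append(v)
    return found

def reduce_documents(a, vectors):
    # Each document becomes a short vector of linear combinations of its term weights.
    return [[sum(x * y for x, y in zip(row, v)) for v in vectors] for row in a]

# Toy weighted document-by-term matrix (invented values): 4 documents, 4 terms.
weighted = [
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 3.0],
    [0.0, 0.0, 3.0, 1.0],
]
reduced = reduce_documents(weighted, top_right_singular_vectors(weighted, k=2))
```

In this example the first two documents share one set of terms and the last two share another, so each pair loads on a different reduced dimension.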

Finally, these reduced-dimensional vectors, together with the sentiment variable, can be supplied to a predictive model. The model will attempt to learn from the training data by utilizing patterns in the reduced-dimensional vector. This predictive model will then create a function that will predict the sentiment for any document.

1 The Pang and Lee movie review data is available at: http://www.cs.cornell.edu/People/pabo/movie-review-data

2 Yahoo movie reviews were obtained from: http://movies.yahoo.com



Benefits of the data mining approach

The data mining approach is appealing because it is based on learning patterns that are useful for making automated, efficient predictions. The algorithms are capable of discovering unimagined and complicated patterns that would be beyond what a human could anticipate. Frequently, a data mining approach can beat a rule-based approach in topic classification. Of course, this is dependent on having enough training data to build the model.

Drawback of the data mining approach

The vector-based representation of a document, which is required for data mining techniques, does not maintain information that is potentially important to sentiment classification. For example, the vector representation does not capture whether terms are close to one another in the document, whether one term precedes another, or any other contextual cues. The order of terms in a phrase can significantly affect meaning. Consider the phrases:

“… night for a great movie”

and

“… great night for a movie”

These two phrases convey two different meanings; yet in a vector representation, the phrases have an identical representation.
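A short Python check illustrates the point. It assumes a plain bag-of-words representation with whitespace tokenization; the adjacent word pairs (bigrams) are shown only as one simple way to see the ordering information that the vector discards.

```python
from collections import Counter

def bag_of_words(phrase):
    # Term-frequency vector: word order is thrown away.
    return Counter(phrase.lower().split())

def bigrams(phrase):
    # Adjacent word pairs keep some of the ordering information.
    words = phrase.lower().split()
    return list(zip(words, words[1:]))

first = "night for a great movie"
second = "great night for a movie"

same_vector = bag_of_words(first) == bag_of_words(second)
great_movie_in_first = ("great", "movie") in bigrams(first)
great_movie_in_second = ("great", "movie") in bigrams(second)
```

Here `same_vector` is true, while the pair ("great", "movie") occurs only in the first phrase.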

In addition, most predictive models provide little feedback to the user as to precisely why a particular document was classified as having positive or negative polarity. So when you attempt to understand what positive things people said in a particular document, you frequently have to read the entire document to discover the answer.

As a final drawback, forming the training and validation sets is an essential component of learning a predictive model, but it can be very time-consuming and challenging. A rating needs to be provided for every document, and if there are attributes of documents that you wish to use to measure sentiment, you will need to provide a rating for each of these as well. Another complication is that two different reviewers frequently assign two different sentiment ratings to the same document. This can introduce unexpected errors in building and measuring the performance of your model.

Natural Language Processing Approach

Natural language processing (NLP) is a field of artificial intelligence that deals with automatically extracting meaning from natural language text. As discussed in the introduction of this paper, it’s very challenging to get machines to understand text at the same levels as humans. Doing this with the specific goal of extracting sentiment is even more challenging. For example, consider the text snippet below:


“… with that out of the way, let me say this – this film is bad. This film is really, really bad. Yet somehow, it is strangely enjoyable. …”

If interpreted by a human, the above text would imply a positive sentiment from the author toward the movie. However, it can be very challenging to get the same output from a computer because of the high density of strongly negative words.

The rule-based NLP methods use certain entities and syntactic patterns in the text to understand its meaning. SAS Sentiment Analysis provides all the tools needed for this kind of disambiguation. You can use a combination of language dictionaries, linguistic constructs like parts of speech, and noun phrases along with a range of operators.

The operators fall into a few different categories as shown below:

• Boolean operators. Used to include or exclude different entities (e.g., AND, OR, NOT).

• Frequency operators. Used to measure the specified number of occurrences of certain entities (e.g., MIN, MINOC, MAXOC).

• Context operators. Used to measure the context within which certain entities occur in the text (e.g., DIST, START, END, SENT, PARA).

• Sequence operators. Used to look for the entities in a specific sequence (e.g., ORD, ORDDIST).
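The four operator categories can be illustrated with small Python predicates over a token list. These are not the SAS operators and do not reproduce their syntax or exact semantics; the function names (`all_of`, `min_occurrences`, `within_distance`, `in_order`) and their behavior are hypothetical, chosen only to show the kind of test each category performs.

```python
def positions(tokens, term):
    # Indexes at which a term occurs in the token list.
    return [i for i, t in enumerate(tokens) if t == term]

def all_of(tokens, *terms):
    # Boolean-style test: every listed term is present.
    return all(t in tokens for t in terms)

def min_occurrences(tokens, term, n):
    # Frequency-style test: the term occurs at least n times.
    return tokens.count(term) >= n

def within_distance(tokens, a, b, n):
    # Context-style test: some occurrence of a lies within n tokens of b.
    return any(abs(i - j) <= n for i in positions(tokens, a) for j in positions(tokens, b))

def in_order(tokens, a, b):
    # Sequence-style test: some occurrence of a precedes some occurrence of b.
    pa, pb = positions(tokens, a), positions(tokens, b)
    return bool(pa) and bool(pb) and min(pa) < max(pb)

tokens = "the plot was dull but the visuals were stunning".split()

near = within_distance(tokens, "visuals", "stunning", 2)
far = within_distance(tokens, "plot", "stunning", 2)
ordered = in_order(tokens, "plot", "visuals")
```

Combining such predicates is what lets a rule say, for example, that a positive word counts toward an attribute only when it appears close to that attribute.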

The process of developing rule-based models for sentiment analysis involves a few different steps. These are explained below.

Step one: taxonomy identification

The initial step in the NLP approach is taxonomy identification. Taxonomy here refers to a simple, two-level hierarchy where you specify the different objects and attributes for which you want to extract sentiment. You can either use a predefined taxonomy or you can use text mining to learn the most prominent objects and their attributes in the corpus and then make them part of your taxonomy. Figure 1 shows the predefined taxonomy that we used for extracting sentiment from the movie review data. The discovery-based text mining methods are discussed later in this paper.


Figure 1: Taxonomy for movie reviews.

Step two: defining objects and attributes

The next step is to define the objects and their attributes. A basic approach to defining these is to identify their synonyms or the different ways they may be referred to in the text. Figure 2 shows an example.

Figure 2: Example of defining the visuals attribute.

While this approach captures many cases, in other situations the attribute might be referred to using its co-referent. Consider the example below:

“The movie starred Jennifer Aniston. The plot of the movie was very interesting. Aniston’s performance was commendable. She looks adorable.”


Here the full name of the actress is mentioned only in the first sentence. In the subsequent sentences, the actress is referred to by her last name and by a pronoun. These three expressions are said to be co-referent, and the process of identifying them is called co-reference resolution. The rule-based methods allow you to write rules to handle such cases.

Step three: defining polarity

Polarity is determined by associating predefined positive or negative terms or expressions with the attributes that have been identified. Dictionaries of subjective expressions are available and can be customized to specific domains (see Figure 3).

Figure 3: Example of a generic dictionary of positive keywords.

You could also define multiple classes of subjective expressions to denote different levels of subjectivity.

“incredible,” “stunning” ➔ strong positive

“hate,” “disgust” ➔ strong negative

Assigning the appropriate polarity requires that negations are handled properly. To do this, you can use a combination of part-of-speech tags and dictionaries as shown in Figures 4 and 5.


Figure 4: Example of a class of negated adjectives.

In Figure 4, “NegClass” is a dictionary of expressions that denote a negation (for example, “not,” “will not,” “have not,” etc.), and “:Adv,” “:A” and “:V” represent any adverb, adjective and verb, respectively.

Figure 5: Example of a negation rule.
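The interplay of polarity dictionaries and negation handling can be sketched in Python as follows. This is a simplified illustration, not the rule syntax of SAS Sentiment Analysis: the tiny dictionaries, the numeric strengths (1 for ordinary, 2 for strong terms), the single-word negators and the two-token look-back window are all assumptions.

```python
# Hypothetical dictionaries; strong terms follow the examples in the text.
POSITIVE = {"good": 1, "great": 1, "incredible": 2, "stunning": 2}
NEGATIVE = {"bad": -1, "dull": -1, "hate": -2, "disgust": -2}
NEGATORS = {"not", "never", "no"}

def polarity(tokens, window=2):
    lexicon = {**POSITIVE, **NEGATIVE}
    total = 0
    for i, tok in enumerate(tokens):
        if tok in lexicon:
            score = lexicon[tok]
            # Flip the polarity when a negator appears in the preceding window.
            if any(t in NEGATORS for t in tokens[max(0, i - window):i]):
                score = -score
            total += score
    return total

plain = polarity("the visuals were stunning".split())
negated = polarity("the cast was not good".split())
double = polarity("i did not hate it".split())
```

A window of two tokens lets the sketch handle phrases such as "not very good", where an adverb sits between the negation and the adjective.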

Finally, to extract the sentiment at attribute level, you can write context-based rules as shown in Figure 6, where we used a combination of operators.


Figure 6: Example of an attribute-level sentiment rule.
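As a rough illustration of attribute-level extraction, the Python sketch below credits the polarity of a sentence to every attribute whose synonym appears in that sentence. The attribute synonyms and the lexicon are hypothetical, and using the sentence as the context is an assumption; the actual rule in Figure 6 is not reproduced here.

```python
import re

# Hypothetical attribute synonyms and polarity lexicon.
ATTRIBUTES = {
    "cast": {"cast", "acting", "actor", "actress"},
    "visuals": {"visuals", "effects", "cinematography"},
}
LEXICON = {"good": 1, "wonderful": 1, "stunning": 2, "bad": -1, "dull": -1, "weak": -1}

def attribute_sentiment(text):
    scores = {name: 0 for name in ATTRIBUTES}
    # Treat each sentence as the context in which polarity attaches to an attribute.
    for sentence in re.split(r"[.!?]+", text.lower()):
        tokens = re.findall(r"[a-z']+", sentence)
        sentence_score = sum(LEXICON.get(t, 0) for t in tokens)
        for name, synonyms in ATTRIBUTES.items():
            if synonyms.intersection(tokens):
                scores[name] += sentence_score
    return scores

review = "Wonderful casting job. Good acting. The effects were dull."
scores = attribute_sentiment(review)
```

The first sentence contributes nothing because "casting" is not in the hypothetical synonym list, which shows how much the result depends on the attribute definitions from step two.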

Benefits of the NLP approach

The major advantage of rule-based methods is the amount of control they give rule developers over how the analysis will be performed. Developers can use their knowledge of the domain and the language within it to develop rules that have high precision.

Unlike statistical analysis, the results of rule-based analysis are easily interpretable. This is very important for real-life applications where the analysts need to know exactly why a document or an attribute within a document was tagged as positive or negative. In other words, analysts need to know exactly what sentences, keywords or context within the document triggered the positive or negative sentiment. Figure 7 shows an example of this.

Figure 7: Example showing different entities that were used for rule-based analysis.

Rule-based methods are completely unsupervised; that is, they do not require any training data. This is a big advantage in real-life applications where training data is scarce. The non-availability of training data is more pronounced when it comes to granular sentiment analysis (sentiment derived at the objects and attributes level).

Review text shown in Figure 7: “I think they did a fantastic job this movie. I read the book, I loved the book, and I loved the movie! My only qualm was Javier bardem playing a Brazilian when he is SPANISH! Julia Roberts was perfect and beautfiul. Wonderful casting job (with the exception of Bardem)! Good acting. Some parters were a tad confusing for those who haven’t read the book. But I took my mom, who didn’t read the book, and she really liked it. It’s not just some sappy chick flick. It’s a powerful journey about finding yourself hen you let yourself GO! Empowering. Perfection. = EAT PRAY LOVE! Lovely”


Another advantage of rule-based methods is their ability to refine the rules over time based on the feedback from analysts or subject-matter experts. The more time the rule developer spends on refining the rules, the better the results. Language evolves over time and people start using newer terms to express their sentiments. This is especially true for social media, where the language used changes all the time. In such cases, rule-based methods give you the flexibility needed to adjust your models accordingly.

Drawback of the NLP approach

The disadvantage of rule-based methods is that they require a lot of human involvement in developing the rules. These methods completely rely on the domain knowledge of rule developers. It might take a few weeks to come up with a strong rule-based model for a new domain. However, once you have a strong rule-based model for a domain, you can reuse that model with some minor modifications for different applications within the domain.

The importance of validation data is often underestimated while developing these models. The rules being written must be generic enough so that they are capable of handling all possible cases. Inexperienced rule developers tend to over-fit their rules to the sample data they are working with. Such rules might not work well when tested on different data sets. So, rule developers must make sure they validate the rules on different data sets before considering a model ready to deploy.

The Best of Both Worlds

As we discussed earlier, data mining learns relevant patterns from a numerical representation of the entire collection, and the patterns discovered are derived by analyzing the collection as a whole. The rule builder, on the other hand, relies only on personal experience and knowledge to formulate rules that will be useful for sentiment analysis.

Because they approach the problem so differently, data mining and rule-based systems can complement one another. They can do this in two ways. First, unsupervised data mining can be used as a tool for the rule builder; and second, the supervised data mining model can be combined with the rule-based model in such a way that the strengths of each model are combined, and any possible mistakes made by one model can be corrected by the other.

Data Mining of the Text for the Rule Builder

The challenge of the rule builder is to devise and formulate rules that capture the sentiment contained in the collection. To do this, the rule builder must have some understanding of the content of the documents that are being categorized. For instance, in our movie review collection, are all the reviews about a specific movie or are they about a specific genre of movies? If we know, we can save time by writing rules that are only directed to a particular movie or genre. On the other hand, if the reviews are about movies from many different genres, we must consider how that knowledge affects the rules we write. Otherwise, we might not capture the sentiment accurately.

For instance, when discussing a horror movie, the statement

“The scariest thing I have ever seen”

is typically an indicator that the reviewer enjoyed the movie. But it could be a negative indicator if the reviewer was discussing a children’s movie.

Unsupervised text mining allows you to quickly get a handle on the collection you are examining without spending time reading many individual documents. SAS Text Miner provides a node both for generating topics within a document and for clustering the documents. These approaches are useful for understanding the collection and for revealing significant aspects of the data. Table 1 shows that our collection is quite varied.

ID | Descriptive Terms | Freq. | Pct.
1 | + horror, + killer, + scary, + scream, horror, + reason, last, minutes | 155 | 8%
2 | + animation, adults, animated, disney, voice, children, kids, + feature | 73 | 4%
3 | coen, fargo, money, wife, different, pretty, sequences, guy | 37 | 2%
4 | + war, world, life, love, + sense, + fight, right, + father | 267 | 13%
5 | + comedy, jokes, + funny, funny, fun, script, back, cast | 213 | 11%
6 | earth, effects, special effects, special, star, + action, + people, interesting | 276 | 14%
7 | + action, + fight, sequences, bad, fun, guy, special effects, acting | 177 | 9%
8 | + comedy, mother, + father, woman, funny, love, + family, high | 400 | 20%
9 | performances, mother, performance, love, down, + point, last, different | 117 | 6%
10 | + thriller, case, + action, + killer, wife, + job, performance, script | 285 | 14%

Table 1: Ten clusters from the Pang and Lee data.

The clusters reveal several prominent categories of movies, reminding rule builders that they need to consider how people express sentiment in the following types of movies:

• Horror movies.

• Animation and children’s movies.

• Comedies.

• Science fiction movies.

• Action movies.

• Thrillers.

If you, as the rule builder, had not been thinking of how people express their opinions about movies from these different categories, it could be easy to incorrectly capture the sentiment contained in them.

Further discovery can be done to capture the sentiment of individual attributes within the document. For instance, since the SAS Text Miner filter node allows you to subset documents that contain the visual attribute synonyms displayed in Figure 2, you can subset the collection accordingly. In Figure 8, the search expression has been set to include only those documents that contain at least one of the visual attribute synonyms used in the rule building. The special character “*” implies a wildcard search is to occur, and the quoted input means that only the exact phrase, “special effects,” should match. The filter node can be followed with a clustering or topic node, and then any analysis of this subsetted collection provides you with some potential new ideas for rules.

Figure 8: A search expression to retrieve documents concerned with the visual sentiment attribute.
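The behavior of such a search expression can be approximated in Python. The sketch below is not the SAS Text Miner filter syntax: the pattern list is hypothetical (the actual expression in Figure 8 is not reproduced), “*” is treated as a wildcard inside a single token, and quoted input is treated as an exact phrase.

```python
import re
from fnmatch import fnmatchcase

def matches(doc, patterns):
    text = doc.lower()
    tokens = re.findall(r"[a-z']+", text)
    for p in patterns:
        if p.startswith('"') and p.endswith('"'):
            # Quoted input: only the exact phrase should match.
            phrase = p.strip('"')
            if re.search(r"\b" + re.escape(phrase) + r"\b", text):
                return True
        elif any(fnmatchcase(tok, p) for tok in tokens):
            # "*" acts as a wildcard within a single token.
            return True
    return False

# Hypothetical visual-attribute patterns.
patterns = ["visual*", '"special effects"', "cinematograph*"]

docs = [
    "The special effects were loud but impressive.",
    "A special kind of film with no effects at all.",
    "Visually rich and well acted.",
    "The plot dragged.",
]
subset = [d for d in docs if matches(d, patterns)]
```

Only the first and third toy documents are retained; the second contains both words of the phrase but not the exact phrase.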

This particular subsetted collection revealed discussions around costumes and costume designs, as well as the reviewer’s reaction to the theater setting. Neither of these was an aspect of visual sentiment that we had considered prior to discovering these topics.

At an even finer level, the reports of important terms and phrases (particularly in relation to one another in the concept-linking diagram) provide sentence-level ideas for your rule generation. The diagram in Figure 9 was made in the process of exploring reviewers’ comments on their theater experience. The diagram suggests that the sentiment regarding the music or sound in the movie might be another attribute that could be added to the taxonomy and examined.


Figure 9: A concept link diagram of “music” and “loud.”

Hybrid Approaches

Hybrid approaches involve using a rule-based approach and a data mining approach in combination. In the next sections we will describe two alternative methods. The first method can be used to supplement the features from the traditional data mining model by adding features derived from the linguistic rules that are triggered. The second method shows how to use an ensemble of the results of the two distinct approaches to improve the prediction.

Polarity scores as additional features

One advantage of SAS Text Miner is that it allows additional features associated with the document to be combined with the term features or with the SVD dimensions before training the predictive model. Polarity scores are simply a summary score based on a function of the number of times the positive and the negative rules trigger in a document, or in an attribute of a document. These values can be obtained from SAS Sentiment Analysis.


Once these values are obtained, the logistic function can be applied to the ratio of the weighted positive and negative counts so that a document’s polarity score will be between 0 and 1. A document with more positive sentiment weight will be assigned a score closer to 1, and a document with more negative sentiment weight will be assigned a score closer to 0. This score is then used in combination with the SVD dimensions.

When the document has several attributes that receive a polarity score, each of these scores can be added as features to the text mining model. The hybrid model within SAS Sentiment Analysis software also makes use of this approach.

Stacked models

Another hybrid approach is to stack the models. This means that the rule-based and the data mining models are run separately in the first stage; but a second, predictive model is “stacked” after these two models so that the output of the two (a predictive probability for each document from each model) becomes the input into a second-stage model.

Stacking is an ensemble method that can improve accuracy if the two first-stage models differ in their predictions. Stacking allows for the two models to potentially correct one another where they differ.

In Figure 10, SAS Text Miner is used to build one sentiment model, while the model import node brings in a model from SAS Sentiment Analysis. The output of the two models is post-processed with SAS code and then passed to the second-stage regression for a final prediction.

Figure 10: Stacking models.
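The stacking idea can be sketched outside of SAS as follows. This is a minimal Python illustration, not the flow in Figure 10: it assumes the second-stage model is a logistic regression fitted by plain gradient descent, and the function names, the learning rate and the centering of the probabilities at 0.5 are all choices made for this sketch.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_stacker(first_stage_probs, labels, lr=0.5, epochs=500):
    """Fit a second-stage logistic regression on first-stage outputs.

    first_stage_probs -- list of (p_rules, p_mining) pairs, one per document
    labels            -- list of 0/1 sentiment labels
    Probabilities are centered at 0.5 (an assumption of this sketch) so
    that the intercept starts near a sensible value.
    """
    w = [0.0, 0.0]
    b = 0.0
    n = len(labels)
    for _ in range(epochs):
        grad_w = [0.0, 0.0]
        grad_b = 0.0
        for (p1, p2), y in zip(first_stage_probs, labels):
            x1, x2 = p1 - 0.5, p2 - 0.5
            err = sigmoid(w[0] * x1 + w[1] * x2 + b) - y
            grad_w[0] += err * x1
            grad_w[1] += err * x2
            grad_b += err
        w[0] -= lr * grad_w[0] / n
        w[1] -= lr * grad_w[1] / n
        b -= lr * grad_b / n
    return w, b

def predict_stacked(w, b, p_rules, p_mining):
    """Final probability of positive sentiment from the two first-stage outputs."""
    return sigmoid(w[0] * (p_rules - 0.5) + w[1] * (p_mining - 0.5) + b)
```

In practice the second-stage model should be fitted on first-stage predictions for documents that the first-stage models did not see during training; otherwise it tends to over-trust whichever model overfits more.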



Results

We experimented with the sentiment analysis approaches presented in this paper using the movie review data sets. The Yahoo movie data set was used to analyze sentiment at the attribute level, and the Pang and Lee data set was used for the overall sentiment predictions.

Attribute-Level Results

Table 2 shows the results for the attribute-level sentiment analysis on the Yahoo movie data. The Yahoo data had explicit user ratings for the different attributes, and we compared those ratings with the predictions made by the rule-based model developed with SAS Sentiment Analysis. We spent three days on the rule-development process. The Yahoo data included some reviews where a user rating was available for a particular attribute, but the attribute itself was not discussed in the text of the review. We did not include such reviews in the evaluation of the attribute. We also did not include the general attribute because no user ratings were available for it. A user rating of C+ or higher was considered positive, and C- or lower was considered negative.

Attribute    Num Reviews    Misclass Rate
Story        972            .23
Cast         1272           .14
Direction    243            .17
Visuals      459            .12
Aggregate    2946           .18

Table 2: Attribute-level results.

With just three days of effort on rule development, we achieved an overall accuracy of 82 percent at the attribute level (an aggregate misclassification rate of .18). The misclassification rate for the story attribute was higher than for the other attributes. That is an indication to the rule developer to further refine the rules for that attribute. Rule refinement is an ongoing process, and accuracy can improve over time.

Overall Results

Table 3 shows the results of our comparisons of the Pang and Lee data. For the data mining approach, 1,800 random movie reviews were used for training a model, and 200 reviews were held out to be scored. This process was repeated four times, and the misclassification scores were averaged. For each run, the same set of 200 reviews was analyzed in SAS Sentiment Analysis so that the comparisons were made on the same set of data.

Approach                                                 Misclass Rate
1  SAS Text Miner                                        .144
2  SAS Sentiment Analysis Attribute-Level Rules          .252
3  Add Polarity Scores as Features in SAS Text Miner     .132
4  Blended                                               .139

Table 3: Overall sentiment misclassification results.
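The repeated hold-out evaluation behind the data mining figures can be sketched in Python as below. The paper does not say whether the four held-out sets were disjoint, so this sketch assumes a fresh random split on each repetition. `train_fn` is a hypothetical callable that takes training documents and labels and returns a prediction function; it stands in for whichever model is being assessed.

```python
import random

def repeated_holdout(docs, labels, train_fn, n_holdout=200, repeats=4, seed=0):
    """Average misclassification rate over several random train/hold-out splits.

    Each repetition shuffles the documents, holds out n_holdout of them,
    trains on the rest, and scores the held-out set.
    """
    rng = random.Random(seed)
    indices = list(range(len(docs)))
    rates = []
    for _ in range(repeats):
        rng.shuffle(indices)
        held, train = indices[:n_holdout], indices[n_holdout:]
        predict = train_fn([docs[i] for i in train], [labels[i] for i in train])
        errors = sum(predict(docs[i]) != labels[i] for i in held)
        rates.append(errors / n_holdout)
    return sum(rates) / len(rates)
```

To compare approaches on equal footing, as the paper does, the same held-out documents would be scored by every approach in a given repetition.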

The results obtained with the text mining model were achieved by using a category-specific weighting and by having enough training data. The SAS Sentiment Analysis overall sentiment model was derived from the rules for the individual attributes. Under these conditions, the rule-based model did not perform as well as the SAS Text Miner model. However, combining the models – by using the polarity scores as features in the SAS Text Miner model, or by blending the two models – did improve results.

Other Applications

Importing Models

SAS Sentiment Analysis can build a hybrid model using rules combined with a Naïve Bayes algorithm. However, to leverage all the predictive analysis advantages of SAS® Enterprise Miner™ software, the models from SAS Sentiment Analysis must be imported into SAS Enterprise Miner. This can be done easily by using the SAS Enterprise Miner model import node. Once the output of SAS Sentiment Analysis is imported, models can be combined in various ways and then compared with the model assessment node. Figure 11 shows the receiver operating characteristic (ROC) plot from the model assessment node after a SAS Sentiment Analysis model was imported.



Figure 11: ROC chart of SAS Enterprise Miner models with an imported SAS Sentiment Analysis model (denoted by model import). In this graph, “TM” denotes SAS Text Miner and “RuleIn” refers to using SAS Sentiment Analysis rules in conjunction with SAS Text Miner.

Creating Training Data

As discussed earlier, training data that has the “answers” is an essential part of a text mining approach. Without it, you cannot build a predictive model that makes accurate sentiment predictions. Training data is also important for a rule-based system because it lets you validate how well your rules perform. The feedback tells you whether you need to add, remove or refine specific rules. Unfortunately, training data is not always available, and creating it can be costly and time-consuming.

One approach to creating training data is to use very precise rules that will make a sentiment classification only on the documents you are most sure about. The trade-off is that many documents receive no sentiment category at all; you assign sentiment only to a small subset of documents.


We applied this approach to the movie review data by choosing rules that captured complete phrases that seemed, in our opinion, to indicate the overall sentiment. For instance, we included a set of rules that would trigger a positive score for a review that contained phrases like:

“I thoroughly enjoyed this movie.” or “I totally loved the film.”

When these types of phrases occurred in the document, the polarity was rated positive. Similarly, corresponding precise rules were added for negative polarity.
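A minimal Python sketch of such high-precision labeling is shown below. It uses regular expressions rather than SAS Sentiment Analysis rules. The positive pattern is built from the two example phrases above; the negative pattern and the function name `label_if_certain` are hypothetical, added only to show the symmetric case.

```python
import re

# Pattern generalizing the two example phrases from the text.
POSITIVE = re.compile(
    r"\bi (thoroughly|totally) (enjoyed|loved) (this|the) (movie|film)\b",
    re.IGNORECASE,
)
# Hypothetical negative counterpart; the paper does not list its negative rules.
NEGATIVE = re.compile(
    r"\bi (thoroughly|totally) (hated|disliked) (this|the) (movie|film)\b",
    re.IGNORECASE,
)

def label_if_certain(review):
    """Return 'positive' or 'negative' only when exactly one rule set fires.

    Reviews that trigger neither pattern, or both, are left unlabeled (None),
    trading coverage for precision.
    """
    pos = bool(POSITIVE.search(review))
    neg = bool(NEGATIVE.search(review))
    if pos == neg:
        return None
    return "positive" if pos else "negative"
```

Only the reviews that receive a label would go forward as candidate training data, and each one would still be checked by hand.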

When we applied this approach to our movie review collection, 103 of the 2,000 documents triggered our rules. (While 103 documents is too few for an effective set of training data, with a larger pool of 20,000 reviews we would likely have obtained about 1,000 documents for the training set.) We still confirmed the polarity by reviewing each of the 103 documents. Since SAS Sentiment Analysis highlights the rules in context, it was quick work to check that each trigger was appropriate. Based on our manual review, eight of the 103 documents appeared to be labeled incorrectly, so we corrected the polarity for those so that our training data would be free of errors.

Other Capabilities of SAS® Enterprise Miner™

This paper has primarily focused on combining the rule-based capabilities of SAS Sentiment Analysis with the text mining capabilities of SAS Text Miner, in conjunction with the predictive models available in SAS Enterprise Miner. There is much more functionality in SAS Enterprise Miner that can be used to help you understand the sentiment contained in a collection and to build on the rule models you have developed. Such functionality as sequences and associations, decision trees, SOM-Kohonen self-organizing maps, variable clustering, transformations and sampling, and statistical exploration have all been used in various contexts to supplement textual understanding.

Conclusions

Independently, both the domain knowledge and the data mining approaches to sentiment analysis have their strengths and weaknesses; but hopefully you will not be forced to choose between using one or the other for your analysis. In this paper, we have shown that the two approaches complement one another. So, while the NLP approach leverages the rule builder’s domain knowledge, text mining can also be used by that person to improve, clarify or correct how that knowledge relates to the particular collection being analyzed. Text mining reveals important patterns in the specific collection that assist the rule builder.



On the other hand, the text mining approach allows you to quickly build a sentiment classifier with term frequencies alone. But without any semantic or syntactic indicators, mistakes that would seem elementary to a human can easily occur. We have shown that these linguistic indicators can be captured by a rule-based system and then leveraged in the statistical classifier as additional features, or as a blended model. The end result is a model that is better than either one individually.

References

1. Albright, Russ. Taming Text with the SVD. January 2004. SAS: Cary, NC. Web: http://ftp.sas.com/techsup/download/EMiner/TamingTextwiththeSVD.pdf.

2. Pang et al. “Thumbs Up? Sentiment Classification Using Machine Learning Techniques.” Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). 2002. 79-86.

The authors thank James Cox and Janardhana Punuru from the SAS Text Analytics Research and Development team for their helpful comments and suggestions. They also thank Fiona McNeill from SAS Marketing for encouraging them to work on this paper and providing valuable feedback.


SAS Institute Inc. World Headquarters +1 919 677 8000. To contact your local SAS office, please visit: www.sas.com/offices

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies. Copyright © 2011, SAS Institute Inc. All rights reserved. 105008_S59083.0211
