Automatic Construction of Conjoint Attributes and Levels from Online Customer Reviews
Thomas Y. Lee and Eric T. Bradlow∗
April 2007
University of Pennsylvania, The Wharton School
∗ Thomas Y. Lee is an Assistant Professor of Operations and Information Management and Eric T. Bradlow is the K. P. Chao Professor, Professor of Marketing, Statistics, and Education, and Academic Director of the Wharton Small Business Development Center, both at The Wharton School of the University of Pennsylvania. The authors would like to thank Esther Chen, Ellen Ngai, and Sojeong Hong for helping us with the data coding, and Steven O. Kimbrough, Balaji Padmanabhan, Yoram Wind, Paul E. Green, Abba M. Krieger and attendees of the Utah Winter Information Systems Conference and Florida Decision and Information Sciences Workshop for useful suggestions and comments. Please send all correspondence on this manuscript to: Thomas Y. Lee, 573 JMHH, 3730 Walnut Street, Philadelphia, PA 19104; [email protected], tel. 1(215)898-3266 fax. 1(215)898-3664.
Abstract
Conjoint analysis continues to be an area of active research due to its enormous (and often deliv-
ered) promise of improved marketing decision-making. However, despite much methodological
progress, the literature has remained curiously silent on a fundamental design question: "How
does one choose the attributes and levels in the first place?"
In this paper, we present a method to support conjoint study design by automatically eliciting an
initial set of attributes and levels from online customer reviews. While existing computer science
research aims to learn attributes from reviews, our approach is uniquely motivated by the conjoint
study design challenge: how to identify both attributes and their associated levels. Our proposed
method has at least three advantages. First, we generate attributes and levels using the language
of the consumer rather than that of designers and manufacturers. Second, the approach runs
automatically. Automated analysis supports the trend towards shorter product lifecycles and
rapid prototyping. Third, we support rather than supplant managerial judgment. The method is
parameterized to allow survey designers to vary the number of attributes and/or levels that are
generated. Managers can choose to use our method either in a stand-alone manner or as a point
of departure for the surveys and focus groups used in common practice.
1. Introduction
Conjoint analysis has been universally recognized among academics and practitioners alike as
one of the most celebrated tools in Marketing. In part, this is due to the enormous “promise” that the re-
sults can provide including its use in new product introductions (Wittink and Cattin 1989; Michalek et al.
2005), optimal product repositioning (Moore et al. 1999) and pricing (Goldberg et al. 1984), and segment-
ing customers (Green and Krieger 1991), to name just a few. Such power makes many Marketing aca-
demics and practitioners think of conjoint analysis as the “Gold at the end of the rainbow”; and, some-
times it is.
However, how often have those of us who have either implemented or taught conjoint methods
stated: “Choose the attributes and levels wisely, remember Garbage In – Garbage Out!”? This warning
acknowledges that conjoint is dependent on the representation of a customer’s utility as an agglomeration
of preferences for the underlying attribute levels that have been selected. This dependence holds regard-
less of either format (choice-based, ratings-based, ranking-based, constant sum, or self-explicated) or
method for determining the profiles (Huber and Zwerina 1996; Moore et al. 1998; Toubia et al. 2003; Ev-
geniou et al. 2005).
Surprisingly, despite its universally recognized importance, there is little extant research to guide
attribute and level selection (Wittink et al. 1982). There is some literature on the sensitivity of results to
changing attribute selection, omitting an important attribute, level spacing, etc. (Green and Srinivasan
1978). But, how does one generate these attributes and potential levels in the first place?
Both the academic (in the form of textbook chapters) (Lehmann et al. 1997) and practitioner lit-
erature typically set the initial attributes and levels using some ad-hoc combination of (i) qualitative re-
search such as managerial or customer interviews, (ii) focus groups, and (iii) open-ended surveys (see
Figure 1). Practitioners may iterate over this ad-hoc combination a few times based on pre-tests, and may
attempt to validate their study design using actual current products and share data; however, considerable
uncertainty and trepidation often remain.
Our automated procedure uses free, online customer reviews to address this question. The
impact of customer reviews on consumer behavior has long been a subject of study (Eliashberg and
Shugan 1997; Chevalier and Mayzlin 2003); Dellarocas (2003) and Ghose et al. (2006) explore how
reviews reflect or shape a seller's reputation; and Chen and Xie (2004) study the implications of
customer reviews for marketing strategy. There has been comparatively little work, however, on what
marketers might learn from the same reviews for purposes of experimental design. To elicit conjoint
attributes and levels, we develop a novel approach that derives them from online customer reviews.
We empirically validate our approach on reviews for digital cameras from Epinions.com by com-
paring automatically induced attributes and levels (using our method) to those used in existing print and
online retail buying guides.1 The decision to use digital cameras was not random and reflects their common
use in marketing conjoint studies (Bradlow et al. 2004; Netzer and Srinivasan 2006). This evaluation
highlights three particular characteristics that we believe are worth noting.
1 In situations where many conjoint studies have already been run, one could also validate the results of the automatic procedure by comparing it to past conjoint attributes and levels.
Figure 1: The Conjoint Design Process
[Diagram box labels: Managerial Judgement; Interviews; Focus Groups; Surveys; Online Customer Reviews; Attributes and Levels; Conjoint Study.]
First, we generate attributes and levels using the "language of the consumer" rather than that of
product designers and manufacturers. The language of the consumer extends not only to the attributes but
also the levels (the level of detail) by which customers discuss products. Customer terminology does not
always match expert-generated buying guides. Rather than characterizing the differences as errors, auto-
mated analysis may suggest a managerial opportunity to identify mismatches between product manufac-
turers and their customers.
Second, the approach runs automatically. Both consumer and manufacturer preferences evolve
over time. Automated analysis enables firms to rapidly process large numbers of customer reviews,
possibly from different sources, and for different product categories. Automation also supports the
trend towards shorter product lifecycles (Van den Bulte 2000) by facilitating the rapid updating of
conjoint designs. Our automated approach requires no human training and makes no domain-specific
assumptions about particular products. By contrast, much of the prior work related to learning
concepts from text utilizes supervised machine learning methods that require hand-labeled training
data (Nasukawa and Yi 2003; Hu and Liu 2004; Liu et al. 2005; Popescu and Etzioni 2005).
Third, we support rather than supplant managerial judgment. We do not aim to eliminate the
"human-side" of conjoint attribute and level construction. Indeed, as our results indicate in Section 5,
fully automated processing can lead to attribute and level sets that are excessively large. Managers may
intervene within the process by setting parameters such as thresholds on the number of levels or through
judicious pruning of results. Once completed, the automated results serve as input to conjoint study de-
sign. As shown in Figure 1, managers may use the results as a point of departure for initial survey and
focus group design, bringing the voice of the consumer further into the process.
As a brief overview, our approach is summarized in Figure 2. We begin with the set of all online
reviews in a product category over a specified time frame. For example, in this paper, we consider the
reviews for all digital cameras from Epinions.com. While this data is freely accessible, we acknowledge
that selecting conjoint attributes and levels from reviews of existing products does limit the size and scope
of the attribute and level space. However, as elaborated upon in the discussion below, by aggregating the
reviews of all products in the space, our approach encourages the consideration of previously untried at-
tribute and level combinations; and due to the constant nature of review in-flow, is easily updated. In ad-
dition, selectively expanding the definition of the initial product space further expands the range of re-
views from which prospective attributes and levels are drawn.
In our approach, each online customer review is summarized by the reviewer’s written (submit-
ted) list of Pros and Cons (see Figure 3). Our unique approach exploits the co-occurrences of words
within these list-based summaries. While some review sites do not provide user-authored Pro and Con
summaries (e.g. Amazon.com), many including Epinions.com, BizRate, and CNet do (Liu et al. 2005).
Exploiting the structure provided by Pro and Con lists allows us to avoid numerous complexities of auto-
mated language processing; however, numerous challenges remain. As many of the methods that we util-
ize are unfamiliar to the Marketing literature, we next provide a brief overview of the data collection
methods and challenges that are detailed in Section 3 and related appendices. In addition, a glossary of
terms is included as Appendix 1.
The phrases that comprise each review summary are transformed into vectors of words. Each Pro
or Con list is decomposed into separate phrases where each phrase refers to a single product attribute. For
example, the first (pro) list in the first review of Figure 3 produces the phrases: “Easy to use,” “zoom,”
and “panorama.” The second (con) list yields the single phrase “8 mb SmartMedia.” Simple linguistic
transformations, detailed below, normalize for words in past-tense versus present-tense or plural versus
singular, etc. Each phrase in the Pro/Con list is thus reduced to a vector of words; the vectors from all
reviews are combined into a single phrase × word matrix.
Figure 2: Process Overview: Learning From Reviews
[Diagram labels: Product Reviews; Phrase x Word; Word Graph; Cluster1, C2, …, Cn; Clique1, C2, …, Cm; A1, A2, …, Al; Attribute, Dimensions, Levels; example entries Zoom, Memory, Panorama; Type: SmartMedia; Capacity: 8; Units: mb.]
The phrase vectors from all reviews (each row of the matrix) are then clustered based upon their
Euclidean distance in the vector space of words. For example, the phrase vector consisting of “Easy,”
“to,” and “use” is equally distant from “zoom” as it is distant from “panorama.” By contrast, the vector
for the phrase "ease of use" is quite similar to the vector for "easy to use," where some standard linguistic
transformations equate "ease" and "easy" as different forms of the same root word. Each resulting cluster
of vectors (phrases) is taken to refer to a single product attribute of interest. Managers may then either
pre-select a target number of initial attributes by selecting clustering parameters (as is commonly done in
k-means procedures), or automatically search for a statistically "optimal" number of attribute clusters
(Lehmann et al. 1997). Because the clusters themselves are unnamed, we heuristically name each cluster
using the three most frequent words that define the cluster, similar to what is commonly done in cluster
profiling.
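The distance comparison in this paragraph can be illustrated with a minimal sketch. It assumes simple word-count vectors rather than the weighted matrix described in Section 3, and it assumes that "ease" has already been mapped to the root "easy"; the function names are hypothetical and are not taken from the paper.

```python
import math

def phrase_vector(words, vocabulary):
    """Count how often each vocabulary term occurs in the phrase."""
    return [words.count(term) for term in vocabulary]

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Phrases as word lists; "ease" is assumed to be normalized to "easy".
phrases = {
    "easy to use": ["easy", "to", "use"],
    "ease of use": ["easy", "of", "use"],
    "zoom": ["zoom"],
    "panorama": ["panorama"],
}
vocabulary = sorted({w for words in phrases.values() for w in words})
vectors = {name: phrase_vector(ws, vocabulary) for name, ws in phrases.items()}

d_zoom = euclidean(vectors["easy to use"], vectors["zoom"])
d_panorama = euclidean(vectors["easy to use"], vectors["panorama"])
d_ease = euclidean(vectors["easy to use"], vectors["ease of use"])
```

Under these assumptions "easy to use" is equally far from "zoom" and "panorama" and closer to "ease of use", which is the pattern a distance-based clustering procedure would exploit.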
Having clustered phrases into attributes, the unique challenge posed by conjoint analysis is to
elicit attribute levels, where appropriate. A product attribute defined by phrases such as "ease of use"
may have no levels. Other phrase clusters, however, may actually contain multiple attribute dimensions,
each with a distinct set of levels. For example, the cluster of phrases that includes "8 mb Smart media
card included" and "8 mb SmartMedia" represents the attribute "memory." However, these phrases
include multiple dimensions of memory. There is memory capacity ("8" as opposed to "4" or "16" or "32");
there are the units of memory ("mb" versus "megabyte"); and there are types of memory ("SmartMedia"
versus "CompactFlash"). These few examples reveal some of the many challenges.
Figure 3: Pro/Con Review Summaries from Epinions.com
To elicit levels, we first identify the dimensions of each attribute and then identify the set of val-
ues that each dimension may take. We begin with all of the phrases in a single attribute cluster. For ex-
ample, "8 mb Smart media card included" and "8 mb SmartMedia" are in the same cluster. Phrases from
the reviews for other digital cameras in the same cluster include "Only 8 mb smart media card included"
and "Only a 4 mb card." A novel math programming algorithm then assigns the words (or numbers) in
each phrase to separate dimensions (e.g. "8" and "4" are assigned to the same dimension). In a few
simple cases such as hyphenation, phrases such as "Smart media" and "Smart-media" are all recognized as
instances of "SmartMedia" and therefore assigned together.
A single dimension now consists of a set of words (or numbers). The words (or numbers) as-
signed to a single dimension are organized into levels based upon an a priori, user-specified parameter. If
there are more values than the managerially-specified target number of levels, numerical values are
binned by balancing the range of values in each level. Note that any binning function could be
substituted. Categorical values are clustered based upon the distributions of their related attribute dimensions
(distributional clustering). For example, if there are more digital camera memory types than the target
number of levels, memory types might be categorized based upon their memory capacity.
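As an illustration of the numerical case, the following sketch bins distinct values into at most a target number of levels. It assumes equal-width bins as one possible reading of "balancing the range of values"; the function name and the sample capacities are hypothetical, and any other binning function could be substituted.

```python
def bin_numeric_levels(values, target_levels):
    """Group distinct numeric values into at most target_levels equal-width bins."""
    distinct = sorted(set(values))
    if len(distinct) <= target_levels:
        # Few enough values: each value is its own level.
        return [[v] for v in distinct]
    low, high = distinct[0], distinct[-1]
    width = (high - low) / target_levels
    bins = [[] for _ in range(target_levels)]
    for v in distinct:
        # Clamp so the maximum value falls in the last bin.
        index = min(int((v - low) / width), target_levels - 1)
        bins[index].append(v)
    # Empty bins are dropped, so fewer than target_levels may be returned.
    return [b for b in bins if b]

# Hypothetical memory capacities (in mb) mentioned across reviews.
levels = bin_numeric_levels([4, 8, 16, 32, 64, 128, 8, 16], 3)
```

With skewed values such as memory capacities, equal-width bins place most values in the first level; a quantile-based rule would be a natural alternative if balanced level sizes matter more than balanced ranges.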
In the remainder of this paper, after surveying the related literature in Section 2, we review the
preprocessing and underlying data models in Section 3 and describe our algorithmic process in Section 4.
Our data set and evaluation are detailed in Section 5. A discussion of results and future work conclude in
Section 6.
2. Related Work
The task of learning conjoint attributes and levels from text is closely related to several different
research streams in the data and text-mining literature. Early research in "sentiment analysis" attempted
to classify customer comments based upon their general tenor (e.g. is the customer expressing happiness
or dissatisfaction) (Turney 2002). These models use Natural Language Processing (NLP) to identify word
classes (e.g. positive words) that appear in the text of each review. In a process called supervised learn-
ing, a representative set of reviews is manually labeled with their appropriate classes (e.g. which express
positive sentiment and which are negative). The representative sample is used to tune the model parame-
ters for correctly predicting a review's class label based upon its word composition. Subsequent efforts
have treated the overall sentiment (positive or negative) like a multi-attribute utility and attempted to
identify the opinions paired with each attribute (Turney 2002; Nasukawa and Yi 2003; Hu and Liu 2004;
Ghose et al. 2006).
Our work extends the prior literature on sentiment research in two ways. First, we are interested
in a technique that is easily transferable across multiple product categories. Therefore, we do not rely
upon sophisticated NLP techniques that identify the grammatical parts-of-speech of different words.
Rather than supervised techniques that require manual preparation of training documents, we limit human
intervention to parameter settings where managers may set targets for the number of attributes and levels.
Second, we extend the prior work on learning product attributes with techniques for eliciting levels. This
involves discovering not only the words that name the attribute but clustering the words (and numbers)
that describe each attribute.
"Ontology induction" techniques combine statistical methods with knowledge of grammar rules
to learn the vocabulary that characterizes a specific document collection. An ontology is a structured
vocabulary that contains the words describing a domain of interest as well as the relationships between
those words. Examples of relationships between words include: an "SLR" is-a-type-of "digital camera;"
a "digital camera" is-comprised-of "lens" and "battery" (among other things). Ontology induction is the
process of automatically learning an ontology from text documents that describe the domain. In this
context, our task is to learn the words that describe product attributes and levels by processing customer
reviews about that product.
In ontology induction, a human expert typically begins with a seed ontology: either a general ref-
erence ontology that lists common words and relationships, (Missikoff and Navigli 2002) or a pre-
existing, domain-specific reference ontology (Modica et al. 2001; Cecchini 2005). Techniques for learn-
ing linguistic patterns (Hearst 1992), database integration (Doan et al. 2003), and frequent item sets
(Borgelt and Kruse 2002) are then applied to grow and refine the starting seed in ways that cluster differ-
ent words that describe the same attribute, distinguish between distinct attributes, and identify levels
within a single attribute (Maedche and Staab 2000; Popescu et al. 2004; Popescu and Etzioni 2005).
Our unsupervised approach neither assumes a seed ontology nor leverages explicit structure such
as HTML or grammatical syntax. Indeed Pro/Con summaries are simply lists of phrases with no associ-
ated linguistic context; Liu et al. (2005) demonstrate that even with a training set, (supervised) techniques
to learn relationship-patterns between words perform markedly less well in the context of such review
summaries. Our constrained optimization approach dispenses with a training set and removes the need
for knowledge of grammatical rules (Lee 2005).
Finally, traditional techniques for clustering categorical data make assumptions about the data
structure and sample size that are generally inappropriate for analyzing customer review text. We briefly
review this literature as well as the application of graph-based methods, recognizing that other approaches
which utilize human intervention may also be of use to marketing practitioners; albeit perhaps less scal-
able than our approach.
Where sentiment analysis and ontology induction techniques learn product attributes, categorical
clustering techniques assume that each customer comment about a distinct product attribute is stored in a
separate row of a relational database table. Within a table column, binning strategies (Han and Fu 1994)
can group values into a limited number of levels. Between table columns, one measures how the values
in one column co-occur with values in a second column (Han and Fu 1994), or more general classification
rules (Suryanto and Compton 2000) are then used to infer attribute-property relationships. For example,
memory capacity (a column) might include "4," "8," and "16" and co-occur with the unit property "mb."
Thus, the product attribute memory is decomposed into the properties capacity and units; furthermore,
values of capacity can then be distributed across a user-specified2 maximum number of levels.
The number and content of the attribute levels is then generated by clustering the observed range
of attribute values, in our case from the Pro/Con list. Using the database metaphor, this corresponds to
clustering the values of a single table column. Levels are then hierarchically clustered based upon the
similarity of their respective probability distributions over the other columns within the same table (Baker
and McCallum 1998; Dhillon et al. 2002). For example, we might decompose the product attribute bat-
tery life into duration and a modifying adverb. The duration might include "good," "bad," "short," and
"long." Associated adverbs might include "somewhat," "terribly," and "awfully." Based upon the distri-
bution of their co-occurrences, we would see that "terribly" and "awfully" are clustered as synonymous.
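The idea of grouping values by the similarity of their co-occurrence distributions can be sketched as follows. The counts are invented for illustration, and total variation distance is used here as a simple stand-in for the divergence measures in the cited distributional clustering work; only the first merge step of a hierarchical clustering is shown.

```python
from itertools import combinations

def normalize(counts):
    """Turn raw co-occurrence counts into a probability distribution."""
    total = sum(counts.values())
    return {key: value / total for key, value in counts.items()}

def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

# Hypothetical counts: how often each adverb co-occurs with each duration word.
cooccurrence = {
    "somewhat": {"good": 5, "long": 3, "short": 2},
    "terribly": {"short": 6, "bad": 4},
    "awfully": {"short": 5, "bad": 5},
}
distributions = {word: normalize(c) for word, c in cooccurrence.items()}

# The closest pair is the first candidate for merging into one level.
closest_pair = min(
    combinations(sorted(distributions), 2),
    key=lambda pair: total_variation(distributions[pair[0]], distributions[pair[1]]),
)
```

With these assumed counts the closest pair is "awfully" and "terribly", matching the intuition in the example above.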
Although graph-based methods have not been applied to product attributes and levels within re-
views as we do here, graphs have been used to cluster and manage categorical data (Gibson et al. 1998;
Ganti et al. 1999; Zaki and Peters 2005). In the context of words and phrases in reviews, we might treat
every word as a node in the graph. Edges would represent the relationship between words that appear
together in a Pro or Con review phrase. Edges are weighted based upon the number of reviews in which
the two words appear together. For example, if "8 MB Compact Flash" appeared in three different re-
views and "8 MB Memory Stick" appeared in four different reviews, the edge between "8" and "MB"
would be seven, the edge between "MB" and "Compact" would be three, and the edge between "8" and
"Compact" would also be three.
Rather than working directly from review text, existing graph-based methods assume that all of
the words are preprocessed into a single database table. Every row represents a different phrase and every
column represents a different attribute or level. Furthermore, existing methods assume that the table is
complete (e.g. there are no empty cells in the table). Unfortunately, before analysis, there is no simple
way of determining how many different product attributes and levels customers will mention in their
reviews, meaning there is no way of setting the correct number of table columns. Moreover, a comment
about memory like "Only 8 mb Smart media card included" (see Figure 2) uses different columns than a
comment like "2x digital zoom." Thus, our approach clusters customer comments into separate tables,
one for each product attribute. Discovering the dimensions of each product attribute (e.g. the number of
columns in each attribute table) is a key difference between our approach and existing methods.
Moreover, our approach adjusts for blank cells which appear when customers, commenting on the same
product attribute, do not refer to the same attribute dimensions. For example, one customer might comment
on memory capacity ("8" vs. "16" MB) while another might comment on memory type ("Compact Flash"
vs. "Smart Media"). To the degree that we can learn the different tables and their corresponding columns,
graph-based techniques offer a complementary strategy for learning attribute levels. Having reviewed
some alternative approaches to the problem, we revisit our approach, summarized in the introduction, in
greater detail.
2 A manager-specified maximum number of levels is important for making any approach practically
usable. One can test the approach for a different maximum value.
3. Preprocessing
In this section, we detail the pre-processing and the underlying data models used to manipulate
the words and phrases drawn from customer reviews. By regarding the list of Pros and Cons as a
summary of the corresponding review, we focus only on the phrases in each list of Pros and Cons. We
hypothesize that each phrase comments on a unique attribute. Whether a product attribute is listed as a Pro
or a Con, we process all phrases in the same way.
To extract attribute phrases from the customer input, we assume that each Pro or Con entry is a
list of phrases separated by standard list separators including commas, slashes, and semicolons. Within a
single line of input, we count separators and assume that multiple instances (e.g. two or more commas)
correspond to a list of candidate attribute phrases.
To clean the set of resulting attribute phrases, we discarded those candidate phrases that
contained non-alphanumeric characters or punctuation that was not used as a list separator. Examples of
discarded phrases taken from the digital camera product attribute Computer Requirements include:
"book(tm)," "windows®98 second edition (se)," and "windows 98*." The intuition is that our data set
ranges from several hundred to several thousand phrases depending upon the starting feature concept.
Therefore, we can safely discard outlier phrases. Discarding does raise the question of an optimal sample
size, which we revisit below.
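A minimal sketch of the phrase extraction and cleaning steps follows. It assumes one particular reading of the separator rule (split on the most frequent separator when it occurs at least twice) and treats any character other than a letter, digit, or space as grounds for discarding; both function names are hypothetical.

```python
import re

SEPARATORS = ",/;"  # commas, slashes, and semicolons

def split_phrases(entry):
    """Split a Pro or Con entry on its most frequent separator.

    Assumption: an entry is treated as a list only when that separator
    occurs two or more times; otherwise it is kept as a single phrase.
    """
    best = max(SEPARATORS, key=entry.count)
    if entry.count(best) < 2:
        return [entry.strip()]
    return [part.strip() for part in entry.split(best) if part.strip()]

def is_clean(phrase):
    """Keep only phrases made of letters, digits, and spaces."""
    return re.fullmatch(r"[A-Za-z0-9 ]+", phrase) is not None

pros = split_phrases("Easy to use, zoom, panorama")
cons = split_phrases("8 mb SmartMedia")
kept = [p for p in pros + cons + ["windows 98*", "book(tm)"] if is_clean(p)]
```

In this sketch the two phrases containing stray punctuation are dropped, mirroring the discarded examples above, while the four ordinary phrases survive.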
Each phrase is itself comprised of its component words. As a standard step in text processing and
information retrieval, we prune all stop-words (a standard list of articles, conjunctions, prepositions, etc.)
from each phrase (Salton and McGill 1983). For example, after removing stop words, the phrase "Quality
of Photos" becomes "Quality Photos" and "Only 8 mb Smart media card included" becomes "8 mb Smart
media card included."
Words are then normalized using a standard process called stemming (Salton and McGill 1983).
Stemming attempts to find equivalences between singular, plural, past and present tense forms of the in-
dividual words used by consumers to describe product attributes and levels. Rather than requiring knowl-
edge of grammar or semantics, stemming is a simple, approximate technique for discovering the root
forms of words.
Finally, phrases of normalized words are reduced to their underlying "bag-of-words" representa-
tion, which eliminates word order (Salton and McGill 1983). Eliminating word order allows us to equate
different grammatical permutations of the same pruned, normalized phrases. For example, "Includes an 8
mb Smart media card" and "Only 8 mb Smart media card included" are identical in the pruned, normal-
ized, bag-of-words representation.
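The three normalization steps (stop-word pruning, stemming, and the bag-of-words reduction) can be sketched as below. The stop-word list and the suffix-stripping rule are toy assumptions chosen so that the examples in the text work; they are not the standard list or the stemmer used in the paper.

```python
from collections import Counter

# Toy stop-word list (assumption); the text implies "of" and "only" are pruned.
STOP_WORDS = {"a", "an", "the", "of", "to", "only", "and", "or"}
SUFFIXES = ("ing", "ed", "es", "s")

def crude_stem(word):
    """Strip one common suffix, keeping at least three leading characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def bag_of_words(phrase):
    """Lowercase, prune stop-words, stem, and discard word order."""
    words = [w for w in phrase.lower().split() if w not in STOP_WORDS]
    return Counter(crude_stem(w) for w in words)

bag_a = bag_of_words("Includes an 8 mb Smart media card")
bag_b = bag_of_words("Only 8 mb Smart media card included")
bag_quality = bag_of_words("Quality of Photos")
```

Under these assumptions the two memory phrases reduce to the same bag, so they contribute identical rows to the phrase × word matrix.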
Our process calls for clustering phrases based upon the product attributes that each phrase de-
scribes. To facilitate phrase clustering, we transform the list of phrases in bag-of-words form into a
phrase × word matrix. If i = 1, …, I indexes over phrases and j = 1, …, J indexes over words, every entry
in the matrix(i,j) measures the importance of a word j in characterizing or defining the product attribute
represented by phrase i. Every row of the matrix represents the corresponding phrase in the vector space
of words, the familiar vector-space model used in information retrieval. We determine the importance of
a word to a particular product attribute by using a derivative of the TF-IDF (Term Frequency-Inverse
Document Frequency) metric developed for information retrieval (Salton and McGill 1983).
First, the number of times that a word j appears in a phrase i is multiplied by the number of times
the phrase i appears in the set of all review phrases. The word count is adjusted by the distribution of
word j over all phrases. Words that appear in too many different phrases are less likely to uniquely char-
acterize a single product attribute and hence are discounted more heavily than words that appear in fewer
phrases.
Second, we further adjust a word’s importance by using frequency statistics from a second, unre-
lated product domain. The phrase × word matrix includes some sentiment words such as “good” or
“great” which occur with high frequency in a limited number of phrases, leading to deceptively high im-
portance values. However sentiment words are likely to appear in reviews for unrelated products. Words
characterizing product-specific attributes are less likely to appear in the reviews of unrelated products.
Therefore, we discount our initial importance statistics using the phrase × word matrix constructed from
reviews of an unrelated product. Details on calculating word importance appear in Appendix 2.1.
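The two adjustments can be sketched as follows. The exact formula is given in Appendix 2.1 and is not reproduced here; the logarithmic inverse-frequency term and the linear discount by the word's share of unrelated-domain phrases are assumptions made only to illustrate the direction of each adjustment, and the sample phrases are invented.

```python
import math
from collections import Counter

def word_importance(phrases, unrelated_phrases):
    """Assumed TF-IDF-style weight for each (phrase, word) pair."""
    phrase_counts = Counter(tuple(p) for p in phrases)
    distinct = list(phrase_counts)
    n = len(distinct)
    # Number of distinct phrases in which each word appears.
    doc_freq = Counter(w for p in distinct for w in set(p))
    # Share of unrelated-domain phrases that contain each word.
    unrelated_freq = Counter(w for p in unrelated_phrases for w in set(p))
    unrelated_total = len(unrelated_phrases)
    weights = {}
    for p in distinct:
        for word, tf in Counter(p).items():
            idf = math.log(1 + n / doc_freq[word])
            share = unrelated_freq[word] / unrelated_total if unrelated_total else 0.0
            # Term count x phrase count x rarity, discounted by cross-domain use.
            weights[(p, word)] = tf * phrase_counts[p] * idf * (1 - share)
    return weights

camera = [("great", "zoom"), ("great", "battery", "life"), ("battery", "life"), ("8", "mb")]
unrelated = [("great", "sound"), ("great", "price"), ("comfortable", "fit")]
weights = word_importance(camera, unrelated)
w_zoom = weights[(("great", "zoom"), "zoom")]
w_great = weights[(("great", "zoom"), "great")]
```

In this toy example the sentiment word "great" receives a much lower weight than "zoom" within the same phrase, both because it appears in more camera phrases and because it also appears in the unrelated domain.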
While the phrase × word matrix captures word frequencies within a phrase, it does not fully cap-
ture the co-occurrences of words that reappear in different phrases of different reviews. As an example,
in Figure 4a, we begin with phrases from several different reviews. When considering phrases that apply
to only one product attribute (Figure 4b), it is easy to see how word co-occurrences can help align
individual words (or numbers) into separate dimensions (Figure 4c).
To automatically align words into dimensions, we model the phrases of a particular product at-
tribute in a graph. Every word is a node in the graph and every edge between two nodes denotes the co-
occurrence of the corresponding words within a phrase (see Figure 5a). Because words within the same
dimension never co-occur in a single phrase (e.g. in Figure 4, a memory card is never both 8mb and 4mb),
our word-phrase graph exactly satisfies the definition of an n-partite graph. The n parameter defines the
total number of attribute dimensions. The partite characteristic guarantees that there are no edges be-
tween nodes within the same partition (e.g. no edges between words in the same attribute dimension). By
extension, we reason that the space of all possible attribute-level permutations satisfies a complete
n-partite graph where all nodes in one partition are connected to every other node in every other partition
(see Figure 5b). Details on the graph model and its n-partite property are expanded upon in Appendix
2.2.
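The graph model can be sketched with a small co-occurrence counter and a check of the partite property. The phrases are those of Figure 4b after stop-word pruning; weighting each edge by the number of phrases containing both words is an assumption for illustration, and the check only verifies a proposed assignment of words to dimensions. It does not implement the alignment algorithm of Section 4.2.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_weights(phrases):
    """Weight each word pair by the number of phrases containing both words."""
    weights = Counter()
    for words in phrases:
        for pair in combinations(sorted(set(words)), 2):
            weights[pair] += 1
    return weights

def is_valid_partition(weights, dimensions):
    """True when no edge joins two words assigned to the same dimension."""
    dimension_of = {word: k for k, dim in enumerate(dimensions) for word in dim}
    return all(dimension_of[a] != dimension_of[b] for a, b in weights)

# Memory phrases from Figure 4b with stop-words removed.
memory_phrases = [
    ["8", "mb", "card"],
    ["16", "mb"],
    ["4", "mb", "card"],
    ["8", "mb", "smart", "media", "card", "included"],
]
weights = cooccurrence_weights(memory_phrases)

proposed = [{"4", "8", "16"}, {"mb"}, {"smart"}, {"media"}, {"card"}, {"included"}]
merged = [{"4", "8", "16", "mb"}, {"smart"}, {"media"}, {"card"}, {"included"}]
proposed_ok = is_valid_partition(weights, proposed)
merged_ok = is_valid_partition(weights, merged)
```

The proposed six-dimension assignment passes because capacities never co-occur with one another, while merging "mb" into the capacity dimension fails because "8" and "mb" share an edge.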
4. Analysis
Having established the underlying data preparation and data models, we revisit the steps from
Figure 2: (i) phrases are clustered into product attributes (Section 4.1), (ii) attributes are divided into
their constituent dimensions and the words in each phrase are aligned with their appropriate dimension
(Section 4.2), and (iii) each property is divided into levels (Section 4.3). The section concludes with a
number of refinements that attempt to address noise within the process (Section 4.4).
Figure 4: From Phrases to Attributes to Attribute Dimensions
[Panel a, Phrases From Online Reviews. Olympus: Quality of Photos, …, Battery life (very very good), Only 8 mb Smart media card included. HP: …, only a 4 mb card, virtually no battery life, no AC adapter, poor zoom. Fuji: Great picture quality, 16 mb, battery life, compact, …. Canon: Great feel, good battery life, 12 second video capture, only 8 mb card, ….]
[Panel b, Phrases for the Attribute "Memory": Only 8 mb Smart media card included; Only a 4 mb card; 16 mb; only 8 mb card.]
[Panel c, Dimensions of the Attribute "Memory": a table with columns D1 through D6 aligning the words of each phrase (8, 4, 16, mb, smart, media, card, included).]
Figure 5: Word Co-occurrence as an N-Partite Graph
[Panels a and b: graphs over the nodes 4, 8, 16, mb, card, smart, media, and incl.]
4.1. Clustering phrases into product attributes
In the first step, we begin with the vector-space representation of phrases drawn from the Pro/Con
review summaries. More formally, given the phrase × word matrix (i, j) over the set of phrases I and the
set of words J, we seek to separate phrases into a set C of k mutually exclusive and exhaustive concept
clusters $C = \{c_1, \ldots, c_k\}$ such that $c_1 \cup \cdots \cup c_k = I$ and $c_i \cap c_j = \emptyset$ for all $i \neq j$. Of course, the clustering is necessarily dependent upon and
susceptible to the quality of phrase parsing. Poor parsing (e.g. a single phrase that combines multiple fea-
tures) can introduce noise into the resulting clusters. However, our objective is to capture all phrases cor-
responding to a particular feature in one cluster. It is worth noting that a feature/concept can be distin-
guished at somewhat arbitrary levels of granularity. Thus, it is possible that "digital zoom" and "optical
zoom" could be clustered as distinct features or aggregated as the single concept "zoom." One of the
choices made, therefore, is the degree of granularity desired. "Rougher" granularity would typically lead
to fewer unique concepts (and hence conjoint attributes), but possibly less distinct concepts; vice-versa
for a finer grain. Clustering algorithms are typically distinguished by their means for measuring similar-
ity and their metric for separating clusters. In this work, complementing our matrix representation of
phrases and words, we use the cosine measure of angular distance between vectors to calculate similarity.
The cosine measure is then applied to the phrase × word matrix using the well-studied k-means clustering
algorithm.
The quality, QC, of a k-means clustering, C, is calculated as the sum of the cosine similarities between
each vector in a cluster and that cluster's centroid. Following Zhao and Karypis (2002), for unit-length
vectors this metric is more simply defined as the sum of the lengths of the composite vectors:

$$QC = \sum_{c_i \in C} \sum_{v \in c_i} \cos\big(v, \mathrm{centroid}(c_i)\big) = \sum_{c_i \in C} \lVert \mathrm{composite}(c_i) \rVert, \quad \text{where } \mathrm{composite}(c_i) = \sum_{v \in c_i} v \qquad (1)$$
Pro/Con review summaries    Tokenized phrases         P1         P2    P3      P4
(maximal clique)                                      nice       6x    optic   zoom
Zoom                        zoom                      -          -     -       zoom
Long 6x optical zoom        long 6x optic zoom        long       6x    optic   zoom
standard 3x optical zoom    standard 3x optic zoom    standard   3x    optic   zoom
nice optical zoom           nice optic zoom           nice       -     optic   zoom
6x zoom is nice             6x zoom nice              nice       6x    -       zoom
5x optical zoom             5x optic zoom             -          -     optic   zoom

Table 1: Using a maximal clique for logical assignment
Because k-means is known to be extremely sensitive to its initial conditions, we repeat the algorithm ten
times, beginning with a new, random set of k centers and pick the solution that maximizes QC.
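The clustering step and Eqn 1 can be sketched as follows. This is a minimal, pure-Python illustration of cosine-based (spherical) k-means with random restarts, not the implementation used in this work; the function names, the fixed iteration count and the toy four-phrase matrix are assumptions made for the example.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def composite(vectors, dim):
    """Component-wise sum of the vectors in one cluster."""
    total = [0.0] * dim
    for v in vectors:
        for j, x in enumerate(v):
            total[j] += x
    return total


def quality(clusters, dim):
    """QC from Eqn 1: sum of the lengths of the cluster composite vectors."""
    return sum(
        math.sqrt(sum(x * x for x in composite(c, dim))) for c in clusters if c
    )


def spherical_kmeans(rows, k, rng, iterations=20):
    """One k-means run on unit vectors, using the dot product as cosine similarity."""
    dim = len(rows[0])
    centers = [list(v) for v in rng.sample(rows, k)]
    labels = [0] * len(rows)
    for _ in range(iterations):
        labels = [
            max(range(k), key=lambda c: sum(a * b for a, b in zip(v, centers[c])))
            for v in rows
        ]
        for c in range(k):
            members = [v for v, label in zip(rows, labels) if label == c]
            if members:
                centers[c] = normalize(composite(members, dim))
    clusters = [[v for v, label in zip(rows, labels) if label == c] for c in range(k)]
    return labels, quality(clusters, dim)


def best_of_restarts(matrix, k, restarts=10, seed=0):
    """Repeat k-means from random starting centers and keep the run with maximal QC."""
    rng = random.Random(seed)
    rows = [normalize(r) for r in matrix]
    return max((spherical_kmeans(rows, k, rng) for _ in range(restarts)), key=lambda r: r[1])


# Toy phrase x word matrix over the words [zoom, optic, battery, life].
matrix = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1], [0, 0, 1, 0]]
labels, qc = best_of_restarts(matrix, k=2)
print(labels, round(qc, 3))
```

On this toy matrix the two zoom phrases end up in one cluster and the two battery phrases in the other.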
4.2. Dividing attributes into dimensions
Having generated phrase clusters corresponding to product attributes, our next objective is to
identify those attribute dimensions for which conjoint levels are defined. Recall that we can visualize the
phrases describing a single attribute in a table where attribute properties constitute table columns and
phrases constitute table rows; each word in a phrase is assigned to one column (see Figure 4). To derive
this figure, we generate the n-partite word co-occurrence graph by selecting the number of partitions n
and then assigning the words of each phrase to its appropriate dimension.
Colloquially, to discover the number of columns, we would like to find some combination of cus-
tomer comments or phrases that uses distinct words to make explicit reference to every relevant attribute
dimension. By modeling all words in a graph (see Section 3), we discover this combination of phrases by
heuristically searching for the largest maximal clique (see Appendix 2.2).
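The heuristic itself is described in Appendix 2.2. Purely as an illustration of the idea, the sketch below grows a clique greedily from every starting word and keeps the largest result. The greedy rule, the function names and the token lists (taken from the normalized phrases of Table 1) are assumptions for this example and may differ from the heuristic actually used.

```python
from itertools import combinations


def cooccurrence_graph(phrases):
    """Map every word to the set of words it shares at least one phrase with."""
    graph = {}
    for phrase in phrases:
        words = set(phrase)
        for word in words:
            graph.setdefault(word, set())
        for a, b in combinations(sorted(words), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def greedy_maximal_clique(graph, start):
    """Grow a clique from `start` until no remaining word is adjacent to all members."""
    clique = [start]
    candidates = set(graph[start])
    while candidates:
        # Prefer the candidate that keeps the most other candidates alive.
        best = max(sorted(candidates), key=lambda w: len(graph[w] & candidates))
        clique.append(best)
        candidates &= graph[best]
        candidates.discard(best)
    return clique


def largest_maximal_clique(graph):
    """Try every start word and return the largest maximal clique found."""
    return max((greedy_maximal_clique(graph, w) for w in sorted(graph)), key=len)


phrases = [
    ["zoom"],
    ["long", "6x", "optic", "zoom"],
    ["standard", "3x", "optic", "zoom"],
    ["nice", "optic", "zoom"],
    ["6x", "zoom", "nice"],
    ["5x", "optic", "zoom"],
]
graph = cooccurrence_graph(phrases)
clique = largest_maximal_clique(graph)
print(len(clique), sorted(clique))
```

The result is maximal by construction, but because the search is greedy it is not guaranteed to be a maximum clique.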
Applying this step to several phrases, for the digital camera attribute zoom, is depicted in Table 1.
From the left, the first column lists literal phrases taken from Pro/Con review summaries. The second
column lists the corresponding, normalized word form. A maximal clique is shown in the top row (i.e.
"nice," "6x," "optic," "zoom"). Note that this example illustrates how a maximal clique is constructed
from two or more phrases.
Having identified the number of attribute dimensions, we can align the words in the remaining
phrases with their corresponding columns. We assign words to attribute dimensions subject to the mutual
exclusivity constraints represented by each phrase. No two words in the same phrase may appear in the
same attribute dimension (the same column). Thus, each phrase represents $\binom{m}{2}$ pair-wise constraints
where m is the number of words in the phrase. Pair-wise constraints are consistent with disjoint cluster-
ing and assume that no attribute dimension is described by two or more words and that no single product
can have two or more values for a single dimension (e.g. zoom is not both 2x and 3x).
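A short sketch of how the pair-wise constraints of a single phrase can be enumerated is given below. The helper name is a hypothetical choice, and the phrase is the normalized form of "Long 6x optical zoom" from Table 1.

```python
from itertools import combinations
from math import comb


def pairwise_constraints(phrase):
    """Every unordered word pair in a phrase must occupy different dimensions."""
    return list(combinations(phrase, 2))


phrase = ["long", "6x", "optic", "zoom"]
constraints = pairwise_constraints(phrase)

# The number of constraints equals the binomial coefficient "m choose 2".
print(len(constraints), comb(len(phrase), 2))
```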
We define the assignment problem using the maximal clique. A constraint logic program (CLP)
implements a bounds consistency approach to resolve the problem. In the bounds consistency approach,
we invert each mutual exclusivity constraint and express the complementary constraint as a set of
candidate assignments. If the phrase constraints, taken together, are internally consistent, then the
candidate assignments for a given word are simply the intersection of all candidate assignments as
defined by all phrases in the cluster containing that word.
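The intersection of candidate assignments can be illustrated with a small propagation loop. This is a simplified sketch in plain Python rather than a constraint logic program, and it is not the implementation used in this work: the clique words are assumed to fix the columns, every other word starts with all columns as candidates, and a column is removed whenever a word already assigned to it appears in the same phrase.

```python
def assign_words(phrases, clique):
    """Return (assigned, unresolved) where columns are indexed by clique position."""
    n = len(clique)
    assigned = {word: column for column, word in enumerate(clique)}
    words = {w for phrase in phrases for w in phrase}
    candidates = {w: set(range(n)) for w in words if w not in assigned}

    changed = True
    while changed:
        changed = False
        for phrase in phrases:
            for w in phrase:
                if w in assigned:
                    continue
                # Columns already taken by other words in the same phrase are excluded.
                blocked = {assigned[o] for o in phrase if o != w and o in assigned}
                remaining = candidates[w] - blocked
                if remaining != candidates[w]:
                    candidates[w] = remaining
                    changed = True
                if len(candidates[w]) == 1:
                    assigned[w] = next(iter(candidates[w]))
                    changed = True

    unresolved = {w: c for w, c in candidates.items() if w not in assigned}
    return assigned, unresolved


phrases = [
    ["zoom"],
    ["long", "6x", "optic", "zoom"],
    ["standard", "3x", "optic", "zoom"],
    ["nice", "optic", "zoom"],
    ["6x", "zoom", "nice"],
    ["5x", "optic", "zoom"],
]
clique = ["nice", "6x", "optic", "zoom"]  # columns P1..P4 of Table 1
assigned, unresolved = assign_words(phrases, clique)
print(assigned["long"], sorted(unresolved["5x"]))
```

With only these six phrases, "long" is forced into the first column, while "5x" keeps two candidate columns and therefore stays unresolved. An empty candidate set would signal conflicting constraints.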
Continuing with the example in Table 1, the lower half of the table demonstrates how normalized
phrases from the left are mapped to attribute properties (columns). The interaction between adjacency
constraints from multiple phrases naturally constrains words to a unique assignment. The example also
illustrates two limitations of our strong assumption regarding maximal cliques. First, given a sufficiently
large sample of phrases, we assume that a maximum clique would encompass all (relevant) properties of a
given product attribute. For reasons of computational complexity, however, we use a maximal (not
necessarily a maximum) clique. As a consequence, our representative table row may miss certain
properties, and words describing different properties may be erroneously combined. In Table 1,
"standard" and "nice" are logically forced into the same property/column. Second, certain dimensions or
levels may remain unassigned due to an insufficient number of examples. The word "5x" remains
unassigned in Table 1 because it is under-constrained: whether "5x" is an instance of Property 1 or
Property 2 is ambiguous. We revisit these limitations below.
4.3. Dividing properties into levels
The product review summaries are now reduced to a number of tables where each table represents
a product attribute and each table column is a dimension of the respective attribute. The values in each
column represent levels of the corresponding attribute dimension. For example, '4' and '8' are two levels
of digital camera memory capacity. Likewise, '3x' and '6x' are levels of optical zoom magnification. Un-
fortunately, our CLP algorithm may result in properties with more than five or six values from which a
customer is asked to choose. To limit the number of levels for a given property, we apply distributional
clustering (Pereira et al. 1993) to combine levels until the total number of levels is reduced to a specified
target number (e.g. six) that may easily be modified at the user’s discretion. Further details appear in Ap-
pendix 2.3.
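The exact procedure is given in Appendix 2.3. As a rough illustration only, the sketch below merges levels agglomeratively: at each step it joins the two groups of levels whose co-occurrence distributions are closest under the Jensen-Shannon divergence, until the target number of levels is reached. The divergence measure, the greedy merge order, the function names and the counts are all assumptions made for this example, and every level is assumed to have at least one observed count.

```python
import math


def normalise(counts):
    """Turn raw co-occurrence counts into a probability distribution."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a):
        return sum(a[k] * math.log(a[k] / m[k]) for k in a if a[k] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


def merge_levels(level_counts, target):
    """Merge the most similar level groups until only `target` groups remain."""
    clusters = {(level,): dict(counts) for level, counts in level_counts.items()}
    while len(clusters) > target:
        names = sorted(clusters)
        a, b = min(
            ((x, y) for i, x in enumerate(names) for y in names[i + 1:]),
            key=lambda pair: js_divergence(
                normalise(clusters[pair[0]]), normalise(clusters[pair[1]])
            ),
        )
        merged = dict(clusters.pop(a))
        for word, count in clusters.pop(b).items():
            merged[word] = merged.get(word, 0) + count
        clusters[tuple(sorted(a + b))] = merged
    return sorted(clusters)


# Hypothetical counts of how often each magnification level co-occurs with a zoom type.
level_counts = {
    "2x": {"digital": 4, "optical": 1},
    "3x": {"digital": 3, "optical": 1},
    "6x": {"optical": 5},
    "10x": {"optical": 4},
}
print(merge_levels(level_counts, target=2))
```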
4.4. Filtering the clusters
Both the initial clustering of phrases into product attributes and the subsequent assignment of
words to attribute properties are inherently imperfect. Inconsistencies may emerge for any number of
reasons including: Poor parsing, the legitimate appearance of one word multiple times within a single
phrase (e.g. the phrase ‘digital zoom and optical zoom’ duplicates the word ‘zoom’) or even “inaccura-
cies” by the human reviewers who write the text that is being automatically processed. This could result
in a single attribute property divided over multiple table columns. For example, the reviews from Figure
3 include both "SmartMedia" as a single word and "Smart" and "media" as two separate words.
Alternatively, multiple product attributes may appear in the same cluster. '[C]ompact flash' and 'compact
camera' are clustered together based upon their common use of the word 'compact,' yet refer to distinct
attributes.
To address the problem of robustness in the face of noisy clusters that include references to addi-
tional product attributes or have different properties for the same attributes, we extend our CLP approach
to simultaneously cluster phrases and assign words. Detailed further in Appendix 2.4, the extended CLP
prunes phrases by recursively applying co-occurrence constraints; two phrases in the same review cannot
describe the same attribute just as two words in the same phrase cannot describe the same attribute di-
mension.
Unfortunately, even the extended CLP approach is imperfect. Some of the tables will represent
distinct product attributes. Others will simply constitute random noise. Individual tables are supposed to
represent distinct product attributes, so we assume that meaningful tables should contain minimal word
overlap. With this in mind, we apply a two-stage statistical filter to further filter noisy clusters. Details of
the statistical filter are provided in Appendix 2.5.
5. Evaluation
Early research in ontology induction was limited to “proof of procedure” based upon the subjec-
tive assessments of the researchers themselves and/or subject-matter experts (Missikoff and Navigli
2002). More recently, research in ontology induction and the analysis of customer reviews has begun to
develop more objective metrics. In this section, we report results from the application of our automated
process to a real domain. We select several popular print and online buying guides as the “gold standard”
and compare our automatically generated attributes and levels to that standard.
5.1. Data and metrics
Our data set consists of 8,226 online digital camera reviews downloaded from Epinions.com on
July 5, 2004. The reviews span 575 different products and product bundles that range in price from $45
to more than $1,000. The digital cameras range in resolution from 1MP to more than 6MP and vary in
size from pocket-size to single lens reflex (SLR).
We compare our automatically derived attributes to publicly available, expert-generated attributes
in the form of ten print and online buying guides. The reference sources each list a minimum of 5 product
attributes and a maximum of 26. The average number of product attributes is 14. After processing our
experimental data set, we compare the automatically induced attributes with each reference source. Bor-
rowing from the Information Retrieval literature, we use precision (P) (Salton and McGill 1983) to meas-
ure "how many of the induced attributes and/or properties are actually used in professional reviews and
online buying guides." By contrast, recall (R) asks "how many of the attributes and/or properties used in
practice are automatically induced?" More formally, if X is the set of generated attributes, and Y is the
set of attributes in the reference source,
$$P = \frac{|X \cap Y|}{|X|} \quad \text{and} \quad R = \frac{|X \cap Y|}{|Y|} \qquad (2, 3)$$
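Precision and recall over two attribute sets can be computed directly with set operations, as in the following sketch. The function name and the two example sets are hypothetical and are not drawn from our data.

```python
def precision_recall(generated, reference):
    """Precision and recall of generated attributes X against reference attributes Y."""
    x, y = set(generated), set(reference)
    overlap = len(x & y)
    precision = overlap / len(x) if x else 0.0
    recall = overlap / len(y) if y else 0.0
    return precision, recall


# Hypothetical attribute sets, assumed to be normalized to the same vocabulary.
generated = {"zoom", "memory", "battery", "flash"}
reference = {"zoom", "memory", "resolution", "battery", "lcd"}
p, r = precision_recall(generated, reference)
print(p, r)
```

Three of the four generated attributes appear in the reference set, and three of the five reference attributes are recovered.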
An added complexity to evaluation is the hierarchical nature of the product attribute space. Prod-
uct attributes in one reference source might be automatically extracted as a dimension or level and vice
versa (e.g. see Figure 6: optical zoom and digital zoom appear as independent attributes in the reference
source but as levels of the type dimension of a single zoom attribute). To analyze precision and recall on product
attributes, we define precision and recall containment (P+ and R+) to allow a more specific term to qual-
ify as a positive match for a more general term, provided that the more specific term appears as a dimen-
sion or level, and vice versa.
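One possible reading of the containment rule is sketched below. Each hierarchy is represented as a mapping from an attribute to the set of terms (dimensions or levels) beneath it. An attribute counts as matched if it appears anywhere in the other hierarchy, or if one of its own sub-terms is an attribute in the other hierarchy. This representation, the matching rule and the example hierarchies are assumptions made for illustration; they simplify the comparison, for instance by presuming that labels have already been normalized to identical strings.

```python
def all_terms(hierarchy):
    """Union of the attribute names and every term listed beneath them."""
    terms = set(hierarchy)
    for sub_terms in hierarchy.values():
        terms |= set(sub_terms)
    return terms


def matched(attribute, own, other):
    """Match on the attribute itself or on one of its more specific sub-terms."""
    if attribute in all_terms(other):
        return True
    return any(term in other for term in own[attribute])


def containment_scores(generated, reference):
    """Return (P+, R+) under the assumed containment rule."""
    p_plus = sum(matched(a, generated, reference) for a in generated) / len(generated)
    r_plus = sum(matched(a, reference, generated) for a in reference) / len(reference)
    return p_plus, r_plus


# Hypothetical hierarchies in the spirit of Figure 6.
generated = {
    "zoom": {"optical zoom", "digital zoom", "2x", "3x"},
    "memory": {"8 mb", "16 mb"},
    "feel": {"great"},
}
reference = {
    "optical zoom": {"2x", "3x"},
    "digital zoom": {"2x", "3x"},
    "memory": set(),
    "lcd": set(),
}
p_plus, r_plus = containment_scores(generated, reference)
print(round(p_plus, 3), round(r_plus, 3))
```

Without containment, only "memory" would match exactly. With containment, "zoom" is matched through its sub-terms and both reference zoom attributes are matched as levels of the generated "zoom".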
The problem of containment is particularly acute when evaluating automatically generated levels
because there are so many more possibilities to consider. As a simplifying step, to measure precision and
recall for levels, we collapse the hierarchies into the union of all attributes and levels. In effect, a word
from the automatically generated hierarchy can match anywhere in the reference hierarchy and vice-versa.
The intuition is drawn from Popescu et al. (2004), who compared two hierarchies (ontologies) by compar-
ing all possible permutations of the sub-hierarchies.
5.2. Clustering phrases
Beginning with our set of customer reviews, we parsed the Pro and Con lists as described in Section 3
to produce a phrase × word matrix that is 14,081 phrases by 3,364 words. We then set k = 50 and
iterated k-means clustering 10 times, selecting the best resulting output based upon QC (Eqn 1). The
selection of k = 50 was set by following Popescu et al. (2004) and assuming the union of product attributes
from all of our reference buying guides. Relying upon domain expertise is consistent with practitioners,
who rely upon subjective measures of what is most appropriate for the domain at hand (Tan et al. 2006).
More objective, domain independent measures for determining an optimal value of k are an open research
question.

Figure 6: Comparing Automatically Generated Attributes and Levels to a Reference Source (automatically generated: a single attribute zoom with dimensions type and magnification and levels digital, optical, 2x, 3x; reference source: separate attributes digital zoom and optical zoom, each with levels 2x and 3x)
Given an initial set of 50 clusters (from k-means), our next step is to further filter the initial clus-
ters into database tables. The CLP process produced a total of 672 smaller tables from the 50 initial clus-
ters. Applying a χ2 threshold of 0.001 and further filtering the results using the Spearman Rank test rs
(see Appendix 2.5), we are left with 47 tables or product attributes (see Table 2). Though we might have
expected 50 sub-clusters, one for each of the initial clusters, this is not the case. For some initial clusters,
none of the generated tables passed the statistical filters. In other cases, multiple tables from the same
initial cluster had the same, maximum rs score, delineating multiple product attributes within the same
initial cluster.
In the final step, we apply distributional clustering to elicit levels for every dimension (column)
of every product attribute. Recall that we make two strong assumptions in extracting levels. First, we
treat all levels as categorical, so even domains like memory capacity or megapixel resolution are treated
as finite and discrete. This is consistent with conjoint, where even continuous attributes like price are
treated as categorical so that non-linear utilities may be found. Second, we initially assign the maximum
number of levels to six, following much of the marketing literature (Lehmann et al. 1997). If there are
ing ‘mega’ and ‘pixel’ versus ‘megapixel’). These word sequences are then manifested as multi-word
dimensions or levels. Likewise, insufficient constraints are akin to null values within a database table.
To improve the robustness of the alignment step, we are currently experimenting with a more tra-
ditional mixed integer programming formulation of the assignment problem. The introduction of a pen-
alty function may address the problem of conflicting constraints as well as address under-constrained
words. We are also attempting to identify additional sources of constraints. For example, incorporating
external data such as the associated manufacturer’s product description may prove extremely helpful.
Finally, we face the challenge of generating a semantically meaningful number of coherent levels
using distributional clustering. One perspective on the problem concerns using fewer than the user-
specified threshold of clusters. That a property (column) contains three words does not necessarily mean
that each word represents a unique level. Rather, we could hierarchically cluster the levels of every prop-
erty between one and a user-specified threshold (e.g. 6) and optimize the number of clusters based upon
cluster characteristics such as size or distribution.
Just as we might have fewer than a threshold number of levels, there could be too many. As ob-
served earlier, generating coherent clusters from a large number of levels is problematic if all levels of the
property being clustered share the same distribution over the residual attribute properties. One solution is
to draw upon domain knowledge to form clusters using different meaningful semantics, but that would
likely reduce/eliminate the unsupervised nature of our algorithm. A second alternative that might pre-
serve the domain independence of the technique is to draw in additional sources of data to force the dis-
tributions apart. Additional sources of data might include phrases that co-occur with the levels being
clustered or manufacturer details of the products being reviewed. Manufacturer details are typically also
accessible in conjunction with the product reviews themselves.
6.2. Pragmatic considerations
We need the ability to assess the stability of our clusters and concomitant product features. One
instance of stability is sensitivity to data sample size. Here, we relied upon a large data set to yield the
phrases from which we identify a maximal clique. The large data set is also a boon because we can liber-
ally discard phrases to minimize the effects of naïve parsing. To measure the sensitivity to sample size,
we would cross-validate on smaller sets of review samples. We can plot the trade-off between sample
size and evaluation metrics to identify diminishing returns and attempt to estimate a minimal number of
required reviews. The issue of "when to construct the attributes and levels" for one's conjoint studies is an
important one. Care needs to be taken in ensuring sufficient heterogeneity in the sample selection with
respect to different product features and the corresponding feature attributes.
Finally, while our approach is generalizable across different product domains, our dependence on
sources that provide phrase-like strings is a limitation. At least two factors ameliorate this limitation.
First, there are other domains where phrase-like text-strings apply as opposed to prose. Progress notes in
medical records and online movie reviews (Eliashberg and Shugan 1997) are two such examples. Sec-
ond, recognizing the current limitations of natural language processing tools, more online sources are so-
liciting customer feedback in the form of phrases rather than prose to facilitate automated processing
(Google 2007).
6.3. Future work
In addition to work expanding the conceptual and pragmatic dimensions of our work, there are a
number of ways in which we might enrich the concept relationships that we are learning. For example,
we currently learn both product attributes and attribute properties. However, depending upon their de-
composition, some properties may be disjoint and others not. Most buying guides presented optical zoom
and digital zoom as distinct attributes with properties such as magnification. However, it is also not un-
common to see a single product attribute zoom with properties for both magnification and type. Where
"digital" and "optical" are both instances of the property zoom type, a single camera can take on multiple
values of zoom type. By contrast, the levels of camera type, which include "slr," "standard," and
"compact," are mutually exclusive. From a marketing and recommendation perspective, it might prove useful
to extend our attributes and levels to distinguish between mutually exclusive property levels and those
that are not.
Memory capacity exhibits a second dimension of the relationship between attributes and proper-
ties. Some properties are ordinal in nature. Recognizing order facilitates the related task of aligning or-
derings. For marketing and product design, aligning is critical because different customers may address a
concept using parallel categories. For example, will 32 mb satisfy a customer seeking to store 130
images? Because of the relational assumption underlying our CLP approach, we can apply concept
clustering (Gibson et al. 1998; Ganti et al. 1999) to group words from parallel categories.
There are sources of online customer reviews that provide Pro/Con review summaries other than
Epinions. We would like to integrate knowledge from multiple sources to augment the limited samples
from a single source. One motivation might be to extend traditional recommender systems with user-
driven, needs-based attributes based upon the language used by reviewers (Lee 2004; Adomavicius and
Tuzhilin 2005).
While there are many buying guides that provide recommendations for specific products or ser-
vices, most guides tend to rely upon domain-specific experts. Unfortunately, reliance upon experts is not
scalable. Automated support for managing customer and product data is necessitated by the heterogeneity
among both producers and users as well as the increasing complexity of products. A critical step in pro-
viding automated support lies in simply understanding the language used to describe a particular product
category. In this paper, we present an unsupervised, domain independent approach to learning the ontol-
ogy for specific product categories based upon consumer feedback in the form of online customer re-
views. Reviews are first pre-processed using shallow NLP techniques. The resulting phrases are normal-
ized and then clustered into product attributes by adapting traditional document clustering algorithms.
Further decomposition into attribute properties and levels is enabled by a novel bounds consistency ap-
proach to constraint logic programming; we treat the clustering as an assignment problem and exploit the
co-occurrence structure of Pro/Con review summaries. We applied the method to a set of several
thousand online reviews and evaluated the results against a collection of online buying guides. Interestingly,
though the automatically induced features do not perfectly align with the published guides, this is not
necessarily an indication of poor performance. Indeed even the different buying guides do not agree
among themselves. Because our features are drawn directly from customer comments, the differences
may reveal a significant opportunity for better managing the consumer-producer relationship. Moreover,
as products adapt over time, so must the conjoint analyses that accompany them. We believe that our
research can be an important first step in that direction.
References
Adomavicius, G. and A. Tuzhilin 2005. Toward the Next Generation of Recommender Systems: A Survey of the
State-of-the-art and Possible Extensions. IEEE Transactions on Knowledge and Data Engineering 17(6):
734-749.
Baker, D. and A. McCallum 1998. Distributional Clustering of Words for Text Classification. SIGIR 98.
Borgelt, C. and R. Kruse 2002. Induction of Association Rules: Apriori Implementation. 15th Conf on Computa-
tional Statistics (Compstat).
Bradlow, E., Y. Hu and T.-H. Ho 2004. A Learning-based Model for Imputing Missing Levels in Partial Conjoint
Profiles. Journal of Marketing Research 41(4): 369-381.
Cecchini, M. 2005. Quantifying the Risk of Financial Events Using Kernel Methods and Information Retrieval. De-
cision and Information Sciences, University of Florida. PhD.
Chen, Y. and J. Xie 2004. Online Consumer Review: A New Element of Marketing Communications Mix. Social
Science Research Network, http://ssrn.com/abstract=618782
Chevalier, J. and D. Mayzlin 2003. The Effect of Word of Mouth Online: Online Book Reviews. Working Paper,
Yale School of Management.
Dellarocas, C. 2003. The Digitization of Word of Mouth: Promises and Challenges of Online Feedback Mecha-
nisms. Management Science 49(10): 1401-1424.
Dhillon, I. S., S. Mallela and R. Kumar 2002. Enhanced Word Clustering for Hierarchical Text Classification.
Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
Doan, A., P. Domingos and A. Halevy 2003. Learning to Match the Schemas of Databases: A Multistrategy Ap-
proach. Machine Learning Journal 50: 279-301.
Eliashberg, J. and S. Shugan 1997. Film Critics: Influencers or Predictors? Journal of Marketing 61: 68-78.
Evgeniou, T., C. Boussios and Z. Giorgos 2005. Generalized Robust Conjoint Estimation. Marketing Science 24(3):
415-429.
Ganti, V., J. Gehrke and R. Ramakrishnan 1999. CACTUS - Clustering categorical data using summaries. ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining.
Ghose, A., P. Ipeirotis and A. Sundararajan 2006. The Dimensions of Reputation in Electronic Markets, New York
University: 32.
Gibson, D., J. Kleinberg and P. Raghavan 1998. Clustering categorical data: an approach based on dynamical sys-
tems. 24th International Conference on Very Large Databases (VLDB).
Goldberg, S. M., P. E. Green and Y. Wind 1984. Conjoint Analysis of Price Premiums for Hotel Amenities. Journal
of Business 57(1): 111-132.
Google 2007. About Google Base, Google
Green, P. E. and A. M. Krieger 1991. Segmenting Markets with Conjoint Analysis. Journal of Marketing 55: 20-31.
Green, P. E. and V. Srinivasan 1978. Conjoint Analysis in Consumer Research: Issues and Outlook. Journal of
Consumer Research 5: 103-123.
Han, J. and Y. Fu 1994. Dynamic generation and refinement of concept hierarchies for knowledge discovery in da-
tabases. AAAI 94 Workshop on Knowledge Discovery in Databases (KDD94).
Hartmann, A. and H. Sattler 2002. Commercial Use of Conjoint Analysis in Germany, Austria, and Switzerland.
Research Papers on Marketing and Retailing, University of Hamburg: 14.
Hearst, M. 1992. Automatic acquisition of hyponyms from large text corpora. Fourteenth International Conference
on Computational Linguistics (COLING).
Hu, M. and B. Liu 2004. Mining and Summarizing Customer Reviews. KDD04.
Huber, J. and K. Zwerina 1996. The Importance of Utility Balance in Efficient Choice Designs. Journal of Market-
ing Research 33: 307-317.
Kilgarriff, A. 2001. Comparing Corpora. International Journal of Corpus Linguistics 6(1): 97-133.
Lee, L. 1999. Measures of Distributional Similarity. Association for Computational Linguistics (ACL 99).
Lee, T. 2004. Use-centric mining of customer reviews. Workshop on Information Technology and Systems (WITS).
Lee, T. 2005. Ontology Induction for Mining Experiential Knowledge from Customer Reviews. Utah Winter Infor-
mation Systems Conference.
Lehmann, D. R., S. Gupta and J. H. Steckel 1997. Marketing Research, Prentice Hall.
Liu, B., M. Hu and J. Cheng 2005. Opinion Observer: Analyzing and Comparing Opinions on the Web. WWW 2005.
Maedche, A. and S. Staab 2000. Semi-automatic Engineering of Ontologies from Text. Twelfth International Con-
ference on Software Engineering and Knowledge Engineering (SEKE'2000).
Michalek, J. J., F. M. Feinberg and P. Y. Papalambros 2005. Linking Marketing and Engineering Product Design
Decisions via Analytical Target Cascading. Journal of Product Innovation Management 22: 42-62.
Missikoff, M. and R. Navigli 2002. Integrated approach to Web ontology learning and engineering. IEEE Computer:
54-57.
Modica, G., A. Gal and H. Jamil 2001. The Use of Machine-Generated Ontologies in Dynamic Information Seeking.
CoopIS 2001.
Moore, W. L., J. Gray-Lee and J. J. Louviere 1998. A Cross-Validity Comparison of Conjoint Analysis and Choice
Models at Different Levels of Aggregation. Marketing Letters 9(2): 195-207.
Moore, W. L., J. J. Louviere and R. Verma 1999. Using Conjoint Analysis to Help Design Product Platforms. Jour-
nal of Product Innovation Management 16: 27-39.
Nasukawa, T. and J. Yi 2003. Sentiment Analysis: Capturing Favorability Using Natural Language Processing. K-
CAP`03.
Netzer, O. and V. Srinivasan 2006. Adaptive Self-Explication of Multi-Attribute Preferences. Yale Center for Cus-
tomer Insights.
Pereira, F., N. Tishby and L. Lee 1993. Distributional Clustering of English Words. Association for Computational
Linguistics (ACL93).
Popescu, A.-M. and O. Etzioni 2005. Extracting Product Features and Opinions from Reviews. HLT-EMNLP.
Popescu, A.-M., A. Yates and O. Etzioni 2004. Class extraction from the World Wide Web. AAAI 2004 Workshop
on Adaptive Text Extraction and Mining (ATEM).
Salton, G. and M. McGill 1983. Introduction to modern information retrieval. New York, McGraw-Hill.
Suryanto, H. and P. Compton 2000. Learning classification taxonomies from a classification knowledge based sys-
tem. ECAI 2000 Workshop on Ontology Learning.
Tan, P.-N., M. Steinbach and V. Kumar 2006. Introduction to Data Mining. Boston, Pearson Education, Inc.
Toubia, O., D. I. Simester, J. R. Hauser and E. Dahan 2003. Fast Polyhedral Adaptive Conjoint Estimation.
Marketing Science 22(3): 273-303.
Turney, P. 2002. Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of
Reviews.
Van den Bulte, C. 2000. New Product Diffusion Acceleration: Measurement and Analysis. Marketing Science 19(4):
366-380.
Wittink, D. R. and P. Cattin 1989. Commercial Use of Conjoint Analysis: An Update. Journal of Marketing 53: 91-
96.
Wittink, D. R., L. Krishnamurthi and J. B. Nutter 1982. Comparing derived importance weights across attributes.
Journal of Consumer Research 10(1): 471-74.
Zaki, M. and M. Peters 2005. CLICKS: Mining Subspace Clusters in Categorical Data via K-partite Maximal
Cliques. 21st International Conference on Data Engineering (ICDE05).
Zhao, Y. and G. Karypis 2002. Criterion Functions for Document Clustering: Experiments and Analysis. University
of Minnesota Department of Computer Science/Army HPC Research Center. Minneapolis, MN, University
of Minnesota: 30.
Appendix 1. Glossary of Terms
Bounds consistency approach: A class of solution strategies for solving problems in constrained
optimization. The classic bounds consistency approach interleaves backtracking search
with constraint propagation.
Classifier system: A class of computational algorithms for labeling input instances.
Clique: A subset of the set of nodes in a graph such that all nodes in the sub-graph are mutually
adjacent to one another.
Complete graph: A graph where all nodes are mutually adjacent to one another (i.e. where all
nodes are connected to one another by an edge).
Generic reference ontology: An ontology that contains general terms and relationships common
to many domains of knowledge.
Graph: A set of nodes (also called vertices) and a set of edges. Edges are expressed as node
pairs and define a line between the two constituent nodes.
K-partite: A graph whose nodes can be partitioned into K disjoint sets such that no two nodes
within the same set are adjacent (i.e. there are no edges between nodes within the same
partition).
Maximal clique: A clique is maximal if its set of nodes is not a subset of any clique containing
an additional node.
Maximum clique: A maximal clique for which there exists no larger clique. The task of discovering
the maximum clique in a graph is NP-hard (the associated decision problem is NP-complete).
Ontology: A structured vocabulary that captures the terms used in a particular domain as well as
a set of relationships that hold between those words. Common relationships captured
within an ontology include: Hyponym (A is a hyponym of B if A is a kind of B) and
Meronym (A is a meronym of B if A is a part of B).
Ontology induction: The process of learning an ontology (generally associated with automated
methods for learning an ontology).
Sentiment analysis: A sub-field of natural language processing, content analysis, and computational
linguistics that analyzes the emotional tenor or sentiment of a text passage, for example
whether the writer is happy, sad, pleased, or angry.
Stop-word lists: Commonly used in text processing, a pre-defined list of words that typically
convey little or no semantic information and so are automatically discarded in text
processing. These include articles and prepositions such as: "The, A, Of, From, To …"
Supervised learning: A class of machine learning approaches that learn a task by first being
trained on a training set of representative inputs where the answers are known a priori.
The learning classifier learns or generalizes from the training instances in a systematic
way.
Supervised learning classifier systems: Classifier systems that learn based upon a pre-labeled set
of training instances.
Weighted graph: A graph with weights associated either on the edges (an edge-weighted graph)
or on the nodes (a node-weighted graph).
Appendix 2. Algorithmic Details
In this Appendix, we elaborate on specific details of the algorithmic process described in Sections 3 and
4.
[1] Vector space model and word importance
Borrowing from the information retrieval community, our phrase × word matrix is a representa-
tion of the vector-space model (VSM). More formally, j ∈ J is a word in the set of all words; i ∈ I is a
phrase. A phrase is simply a finite sequence of words and J is a subset of the set of finite word sequences
I = {<j>| j ∈ J}. We define an initial phrase × word matrix as a simple variation on the term-frequency
inverse-document-frequency (TF-IDF) VSM (Salton and McGill 1983):
$$\mathrm{Matrix}(i,j) = TF_{ij} \times IPF_j \qquad (2.1)$$
where the term frequency TFij counts the total number of occurrences of word j in the instances of
phrase i. The inverse phrase frequency IPFj = log(|I|/nj) is a weighting factor for words that are more
helpful in distinguishing between different product attributes because they only appear in a fraction of the
total number of unique phrases. If |I| represents the total number of unique phrases in the review collec-
tion, nj counts the total number of unique phrases containing word j.
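
The weighting in (2.1) can be illustrated with a minimal sketch. The function name `tf_ipf`, the dictionary-based input, the sample phrases, and the use of the natural logarithm are all assumptions made for illustration; the text above does not fix a log base or a data layout.

```python
import math
from collections import Counter

def tf_ipf(phrase_counts):
    """Return {phrase: {word: TF_ij * IPF_j}}.

    phrase_counts maps each unique phrase (a tuple of words) to the number
    of times that phrase occurs in the review collection (hypothetical input).
    """
    num_phrases = len(phrase_counts)          # |I|, number of unique phrases
    phrases_with_word = Counter()             # n_j, unique phrases containing word j
    for phrase in phrase_counts:
        phrases_with_word.update(set(phrase))

    matrix = {}
    for phrase, instances in phrase_counts.items():
        within_phrase = Counter(phrase)
        matrix[phrase] = {
            # TF_ij: occurrences of the word across all instances of the phrase
            word: (count * instances)
            * math.log(num_phrases / phrases_with_word[word])  # IPF_j (natural log assumed)
            for word, count in within_phrase.items()
        }
    return matrix

# Hypothetical phrases with their instance counts.
phrase_counts = {
    ("great", "zoom"): 3,
    ("great", "battery", "life"): 2,
    ("great", "lens"): 1,
}
weights = tf_ipf(phrase_counts)
```

In this toy collection "great" occurs in every unique phrase, so its IPF is log(1) = 0 and its weight vanishes, whereas "zoom" occurs in only one phrase and keeps a positive weight.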
A limitation of the TF-IPF weighting is that some terms (e.g. sentiment words like "great" or "good")
are neither stop words nor product attributes, yet still appear alongside product attributes in
the TF-IPF matrix. As an additional discount factor beyond IPF, we automatically gather words from a
second set of K phrases using online reviews for an unrelated product domain. Intuitively, words
appearing in the reviews for unrelated products are less likely to represent relevant product attributes
for the focal one. For example, words describing digital camera attributes are less likely to also appear
in vacuum cleaner reviews.
Formally, for a set of (I') phrases drawn from the set of finite word sequences over j ∈ J, we cal-
culate rank(j) = rank(TF'ij×IPF'j) where higher weighted frequencies correspond to higher rank. Note that
multiple words may share the same rank; if we define words that do not appear in any phrase as having
IPF'j = 0, then we may say:
$$\mathrm{Matrix}(i,j) = TF_{ij} \times \mathrm{rank}(j) \times \left(IPF_j - IPF'_j\right) \qquad (1a)$$
Thus, we scale TF by the rank of the word in the unrelated product domain and scale the IPF by IPF'
from the unrelated product domain.
[2] Graph representations
To transform clusters of Pro/Con review phrases into individual tables, we model each set of
phrases as a graph. Every word is a node in the graph and every edge labels the co-occurrence of two
words within the same phrase. The graph then captures the assumption that no two words in a phrase refer
to the same attribute dimension. In the same way, we could generate a graph where every phrase is a
node and every edge between two phrases indicates the co-occurrence of two phrases within the same
review. This captures the parallel assumption that no two phrases in a review refer to the same product
attribute. We revisit this parallel assumption when we discuss the filtering of phrase clusters correspond-
ing to a single product attribute.
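
The word graph described above can be sketched as follows. The function name `cooccurrence_graph`, the adjacency-set representation, and the sample phrases are illustrative assumptions rather than the implementation used in this work.

```python
from itertools import combinations

def cooccurrence_graph(phrases):
    """Return {word: set of words that co-occur with it in at least one phrase}."""
    adjacency = {}
    for phrase in phrases:
        words = sorted(set(phrase))
        for word in words:
            adjacency.setdefault(word, set())   # keep single-word phrases as nodes
        for a, b in combinations(words, 2):     # one undirected edge per word pair
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency

# Hypothetical, already-normalized phrases from one cluster.
phrases = [
    ("8", "mb", "compactflash"),
    ("16", "mb", "card"),
    ("compactflash", "card"),
]
graph = cooccurrence_graph(phrases)
```

Here "8" and "16" never share a phrase, so no edge joins them, which is consistent with both words being values of the same attribute dimension. Passing reviews (each a sequence of phrase identifiers) instead of phrases would give the parallel phrase graph.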
More formally, we assume that phrases and words are preprocessed and normalized into words as
before. A graph G = (V,E) is a pair consisting of a set of vertices V and a set of edges E. An edge in E is a
connection between two vertices and may be represented as a pair (vi,vj) with vi, vj ∈ V. Each phrase (word)
represents a vertex v in the graph; edges are defined by phrase pairs within a review (word pairs within a phrase).
An N-partite graph is a connected graph whose vertices are partitioned into N sets V1 … VN such that there are
no edges between vertices within any one set Vi. A clique of size N simulates a schema and can be extended to an
N-partite graph by substituting each vertex vi of the clique with a set of vertices Vi. A database table with
disjoint columns thus represents an N-partite graph.
A maximal-complete-N-partite graph is a complete-N-partite graph not contained in any other such
graph; in other words, the initial clique is maximal. The corresponding database table of phrases repre-
sents the existing product attribute space, and the maximal-complete-N-partite graph includes possibly
novel combinations of previously unpaired attributes and/or attribute properties.
To relate the graph back to customer reviews, we say that a product attribute is constructed from k
dimensions. Each dimension names a domain (D). Each domain D is defined by a finite set of words that
includes the value NULL for review phrases where customers fail to mention one or more attribute
dimension(s). The Cartesian product of domains D1 … Dk is the set of all k-tuples {t1 … tk | ti ∈ Di}. Each
phrase is one such k-tuple, and the set of all phrases in the cluster defines a finite subset of
the Cartesian product. A relational schema is a mapping of attribute properties A1 … Ak to domains
D1 … Dk. Note the strong, implicit assumption that a maximal clique, taken over a word graph, is a proxy
for the proper number of attribute dimensions. Under this assumption, it is easy to see how searching for
cliques within the graph results in a table.
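
To make the clique search concrete, the sketch below enumerates maximal cliques with the basic Bron-Kerbosch recursion. This is a standard textbook algorithm substituted here for illustration; it is not claimed to be the procedure used in this work, and its worst-case running time is exponential. The adjacency data and the name `maximal_cliques` are hypothetical.

```python
def maximal_cliques(adjacency):
    """Enumerate all maximal cliques (basic Bron-Kerbosch, no pivoting)."""
    cliques = []

    def expand(current, candidates, excluded):
        # current: clique built so far; candidates: nodes that can extend it;
        # excluded: nodes already explored, used to detect non-maximal cliques.
        if not candidates and not excluded:
            cliques.append(frozenset(current))
            return
        for node in list(candidates):
            expand(current | {node},
                   candidates & adjacency[node],
                   excluded & adjacency[node])
            candidates = candidates - {node}
            excluded = excluded | {node}

    expand(set(), set(adjacency), set())
    return cliques

# Hypothetical word graph built from the phrases "8 mb compactflash" and "16 mb card".
adjacency = {
    "8": {"mb", "compactflash"},
    "16": {"mb", "card"},
    "mb": {"8", "16", "compactflash", "card"},
    "compactflash": {"8", "mb"},
    "card": {"16", "mb"},
}
cliques = maximal_cliques(adjacency)
schema_width = max(len(clique) for clique in cliques)
```

Under the assumption stated above, the size of the largest clique found (`schema_width`) serves as the proxy for the number of attribute dimensions, i.e. the number of columns in the table.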
[3] Distributional clustering
In distributional clustering, the values of one attribute property are characterized by the joint dis-
tribution over the remaining attribute properties. From the example in Figure 4, levels of memory capac-
ity (e.g. 4, 8, 16) are defined by their joint distribution over form factor and whether the memory is in-
cluded or not. Intuitively, this suggests that certain memory types (e.g. compact flash, smart media, xD)
are generally used with certain memory capacities and not others, as would be common with real prod-
ucts; this enables collapsing.
More precisely, recall that every product attribute is described as a table where table columns rep-
resent properties. We assume that all attribute properties are defined over discrete, categorical domains.
For each column of the table, every unique value is initialized to a distinct level. Levels are defined in
terms of the joint probability space over all other columns in the table. We construct the joint probability
density function (PDF) for each level from the table rows. Each PDF is represented as a sparse vector. If
there are more than a user-specified number of levels in a column, levels are hierarchically clustered
based upon the cosine similarity of their distributions (Lee 1999).
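
The comparison of levels by their distributions might be sketched as follows. The table rows, the helper names `level_distributions` and `cosine`, and the dictionary-based sparse vectors are assumptions for illustration; the hierarchical clustering step itself is omitted.

```python
import math
from collections import Counter, defaultdict

def level_distributions(rows, column):
    """For each value in `column`, estimate its distribution over the other columns."""
    counts = defaultdict(Counter)
    for row in rows:
        context = tuple(value for k, value in enumerate(row) if k != column)
        counts[row[column]][context] += 1
    return {
        level: {ctx: n / sum(ctx_counts.values()) for ctx, n in ctx_counts.items()}
        for level, ctx_counts in counts.items()
    }

def cosine(p, q):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(p[key] * q[key] for key in p.keys() & q.keys())
    norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

# Hypothetical table rows: (capacity, form factor, included?).
rows = [
    ("4", "compactflash", "included"),
    ("16", "compactflash", "included"),
    ("16", "compactflash", "included"),
    ("8", "smartmedia", "not included"),
]
dist = level_distributions(rows, column=0)
sim_4_16 = cosine(dist["4"], dist["16"])
sim_4_8 = cosine(dist["4"], dist["8"])
```

With these rows, "4" and "16" share the same context distribution and would be merged first, while "8" has no context in common with either.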
The examples also illustrate some inherent limitations of distributional clustering applied to this
context. First, the approach is sensitive to relative semantics. In Table 1, levels of optical zoom magnifi-
cation (e.g. 3x, 6x) are defined by their joint distribution over descriptive adjectives like "standard" versus
"long." Thus, all magnifications described as "long" would be clustered together. However, while some
users might consider 3x magnification "standard" today, as technology evolves or depending upon need,
others might describe 3x magnification as "poor." Second, as with the CLP step, distributional clustering
relies upon a sufficiently large, representative sample. The limited sample of phrases in Figure 4 would
treat "4" and "16" as a cluster of memory capacity separate from "8."
[4] Constrained Logic Programming
To align words into their corresponding attribute dimensions, we frame the task as a mathematical
assignment optimization and resolve the problem using a bounds consistency approach. We define the
assignment using the maximal clique that corresponds to the schema for each product attribute table (see
Figure 7). In the bounds consistency approach, we invert the constraints (tok_exclusion) to express the
complementary set of candidate assignments (tok_candidates) for each attribute dimension. If the phrase
constraints, taken together, are internally consistent, then the candidate assignments (tok_assign) for a
given token are simply the intersection of all candidate assignments as defined by all phrases in the
cluster containing that token.
process_phrases(p_list)
    schema = find_maximal_clique(p_list)
    order phrases by length
    for each phrase p:
        # initialize data structures
        tok_exclusion – for each tok, mutually exclusive tokens
        tok_candidates – for each tok, valid candidate assignments
        tok_assign – for each tok, the dimension assignment
        # propagate the constraints for each successive phrase
        tok_candidates, tok_exclusion, tok_assign =
            propagate_bounds(phrase, tok_candidates,
                             tok_exclusion, tok_assign, schema)
Figure 7. Logical Assignment
We transform the mutual exclusivity constraint represented by each phrase into a set of candidate
assignments using the algorithm in Figure 8. Note that we need only propagate the mutual exclusivity of
words that are previously unassigned. Accordingly, for each unassigned token in a given phrase, the set
of candidate assignments is the intersection of the possible assignments based upon the current phrase and
all candidate assignments from earlier phrases containing the same token. We maintain a list of active
tokens (boundary_list) to avoid rescanning the set of all tokens every time the possible assignments for a
given token are updated.
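
The intersection of candidate assignments can be illustrated with a simplified sketch. It is not the algorithm of Figure 8: the function name `propagate`, the integer dimension indices, the seeding of assignments from the schema clique, and the repeated full passes (in place of a boundary list) are all assumptions made to keep the example short.

```python
def propagate(phrases, seed_assignment, num_dims):
    """Narrow each token's candidate dimensions until nothing changes.

    Tokens in the same phrase are mutually exclusive: an unassigned token
    cannot take a dimension already held by another token in that phrase.
    """
    assign = dict(seed_assignment)   # token -> dimension index
    candidates = {}                  # token -> set of still-possible dimensions
    changed = True
    while changed:
        changed = False
        for phrase in phrases:
            taken = {assign[tok] for tok in phrase if tok in assign}
            for tok in phrase:
                if tok in assign:
                    continue
                # Intersect earlier candidates with what this phrase allows.
                allowed = candidates.get(tok, set(range(num_dims))) - taken
                if allowed != candidates.get(tok):
                    candidates[tok] = allowed
                    changed = True
                if len(allowed) == 1:
                    assign[tok] = next(iter(allowed))
                    changed = True
    return assign, candidates

# Hypothetical schema clique ("8", "mb", "compactflash") seeds three dimensions.
seed = {"8": 0, "mb": 1, "compactflash": 2}
phrases = [
    ("8", "mb", "compactflash"),
    ("16", "mb", "compactflash"),
    ("16", "mb", "smartmedia"),
    ("mb", "xd"),
]
assign, candidates = propagate(phrases, seed, num_dims=3)
```

In this example "16" is forced into the same dimension as "8" and "smartmedia" into the same dimension as "compactflash", while "xd" remains ambiguous between two dimensions because the only phrase containing it is too short to pin it down.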
Finally, the k-means clustering used to separate review phrases into distinct product attributes is a
noisy process. The clustering can easily result in the inclusion of spurious phrases. By modeling reviews
as a graph of phrases, we can apply the same CLP in a pre-assignment step to filter a single (noisy) cluster
of phrases. As alluded to in Appendix 2.2, we generate a graph where phrases are nodes, and edges rep-
resent the co-occurrence of two phrases within the same review. The same assignment representation
removes phrases that are not central to the product attribute at the heart of a particular phrase cluster.
[5] Statistical filtering
As noted in Section 4, the clusters that result from the CLP are not necessarily clean. To clean
the resulting tables of product attributes and dimensions, we apply a two-stage statistical filter. First, be-
cause each table itself separates tokens into attribute properties (columns), meaningful tables will not hold
too small a percentage of the overall number of tokens. Second, we assume that meaningful tables com-
prise a (predominately) disjoint token subset. If the tokens in a table appear in no other table, then the
intra-table token frequency should match the frequency of the initial k-means cluster; likewise, the table's