Poznan University of Technology
Institute of Computing Science
Descriptive Clustering as a Method
for Exploring Text Collections
Dawid Weiss
A dissertation submitted to
the Council of the Faculty of Computer Science and Management
in partial fulfillment of the requirements for the degree of Doctor of Philosophy.
Supervisor
Jerzy Stefanowski, PhD Dr Habil.
Poznan, Poland
2006
Politechnika Poznanska (Poznan University of Technology)
Instytut Informatyki (Institute of Computing Science)
Grupowanie opisowe jako metoda eksploracji zbiorów dokumentów tekstowych
(Descriptive clustering as a method for exploring collections of text documents)
Dawid Weiss
Doctoral dissertation
Submitted to the Council of the Faculty of Computer Science and Management
of Poznan University of Technology
Supervisor
dr hab. inz. Jerzy Stefanowski
Poznan 2006
Streszczenie (Abstract)
The dissertation concerns information retrieval systems and the identification of thematically
related subgroups in collections of documents. The main motivation of the thesis is the creation
of readable, grammatically correct and comprehensible descriptions of the discovered groups.
Text clustering algorithms in current use are not suited to automatically producing sufficiently
good descriptions, while new applications in data exploration point to their practical importance.
The dissertation proposes to make the above postulates concerning group descriptions concrete
in the form of a definition of the descriptive clustering problem. It then presents a general
approach to constructing clustering algorithms that attempt to satisfy these requirements,
named Description Comes First (dcf).
In contrast to the classical approach, where only the assignment of documents to groups is
evaluated, dcf treats the group description as one of the key elements of the result of the entire
algorithm and uses this description both during the construction of the model of document
clusters and when creating their descriptions. In the dcf approach, the search for a set of
candidate group descriptions and for a mathematical cluster model proceeds independently.
Among the candidate descriptions, those are then selected which have support in the discovered
cluster model. In the last step, documents are assigned to the selected descriptions.
The thesis presents two algorithms that exemplify practical implementations of the dcf
approach. The first algorithm, Lingo, is applied to clustering the results of Web search engines.
The second algorithm, Descriptive k-Means, serves to cluster large numbers of longer
documents. Both algorithms implement the same general dcf scheme, but the differing
characteristics of the processed data require different concrete solutions: Lingo uses frequent
phrases and dimensionality reduction of the term matrix, whereas Descriptive k-Means uses
frequent phrase extraction, noun phrases and clustering with the k-Means algorithm.
The thesis presents computational experiments for both algorithms. The experimental results
compare the clustering quality (understood as the ability to reconstruct a known assignment
of documents to groups) of Lingo and Descriptive k-Means with their closest counterparts
from the literature: the Suffix Tree Clustering and k-Means algorithms. Another practically
important aspect of the evaluation is the presentation of data collected from the public instance
of the Carrot2 system, available as open source software.
Figure 2.5: A few phrases extracted for three example tag patterns encoded in the heuristic
chunker discussed on page 19. Chunk phrases are in the center column; the left and right
context of each phrase is also shown. We marked in red the phrases we considered invalid
(not of maximum length or ambiguous).
is needed to successfully apply this definition for phrase extraction, but when applicable,
it turns out to be a very effective heuristic. Moreover, any algorithm suitable for finding
frequent sequences of items can perform phrase extraction, and there is a wide selection of
efficient methods to choose from. In this thesis we will use suffix trees (in Descriptive
k-Means) and suffix arrays (in Lingo). We characterize the most important elements of both
methods below.
A suffix tree [95, 27, 51] is a tree in which all suffixes of a given sequence of elements can
be found on the way from the root node to a leaf node. What makes this data structure
efficient is that a single node in the tree may contain more than one element of the
sequence (this feature distinguishes it from another data structure — the suffix trie).
Figure 2.6(a) shows an example suffix tree for the word mississippi.
An interesting property of suffix trees is that any path starting at the root node and end-
ing in an internal node denotes a subsequence of elements that occurred at least twice in
the input (take a look at the node corresponding to character sequence issi in Figure 2.6(a)
on the following page). This observation leads to a simple and effective frequent phrase
extraction algorithm: build a suffix tree such that each element of the input sequence is a
single word and analyze the internal nodes — a path from the root node to each internal
(a) Suffix tree. Highlighted is the internal node and the path to the root for the
subsequence "issi".

(b) Suffix array. Highlighted is the continuous block for the subsequence "issi":

    index   substring
    10      i
    7       ippi
    4       issippi
    1       ississippi
    0       mississippi
    9       pi
    8       ppi
    6       sippi
    3       sissippi
    5       ssippi
    2       ssissippi

Figure 2.6: A suffix tree and a suffix array for the word mississippi.
node is a frequent phrase.
Suffix trees have become very popular mostly due to the low computational cost of their
construction — linear in the number of elements of the input sequence. Their practical
implementation is quite tricky due to the non-locality of tree traversal (a number of
algorithms have been proposed to overcome this issue).
Another data structure permitting frequent phrase detection is a suffix array [57, 61]. A
suffix array is a sorted array of all suffixes of a given input sequence. Figure 2.6(b) illustrates
a suffix array for the same word mississippi we used earlier. Frequent phrase extraction is
implemented by scanning the suffix array top-to-bottom, looking for continuous blocks of
identical prefixes (of maximum length) [109, 19].
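The scan described above can be sketched in a few lines. This is a naive illustration under simplifying assumptions (quadratic-time suffix sorting, counting shared prefixes of adjacent suffixes), not the optimized structures used later in the thesis:

```python
# Naive suffix-array-based frequent phrase extraction: sort all word
# suffixes, then scan adjacent suffixes for shared prefixes. A phrase
# shared by a contiguous block of m suffixes occurs m times in the input.
def frequent_phrases(words, min_count=2):
    n = len(words)
    suffixes = sorted(range(n), key=lambda i: words[i:])
    pair_counts = {}
    for a, b in zip(suffixes, suffixes[1:]):
        # length of the common prefix of two adjacent suffixes
        k = 0
        while a + k < n and b + k < n and words[a + k] == words[b + k]:
            k += 1
        for length in range(1, k + 1):
            phrase = tuple(words[a:a + length])
            pair_counts[phrase] = pair_counts.get(phrase, 0) + 1
    # m adjacent pairs in a block correspond to m + 1 occurrences
    return {p: c + 1 for p, c in pair_counts.items() if c + 1 >= min_count}

print(frequent_phrases("cat ate cheese and cat ate mouse".split()))
# -> {('ate',): 2, ('cat',): 2, ('cat', 'ate'): 2}
```

The sort dominates the cost here; the linear-time construction algorithms cited above avoid comparing whole suffixes.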
Algorithms for building suffix arrays have much better properties (locality of computation,
memory consumption); in fact, the only information we really need to store in a suffix
array is the index of the first element of each suffix. Of course, a straightforward
construction of a suffix array using generic sorting routines would slow the algorithm down
to the order of O(n log n) (assuming, a bit unrealistically, that string comparisons take
O(1) time). More efficient algorithms that preserve the desired properties of suffix arrays
have been suggested in the literature [45, 51].
Summarizing this section, with the help of suffix trees and suffix arrays we can locate fre-
quently recurring subsequences of words in the input. Anticipating further sections, we will
use frequent phrases as document features (for detecting similar documents) and primarily
to construct cluster labels. A number of problems arise from these applications of frequent
phrases.
• A frequent sequence of words is not necessarily a good phrase — it can be a meaning-
less collocation (like: vitally important) or a frequent structural element of a language
(like: out of, it is).
• A frequent sequence can be virtually any junk that just happens to recur in the input.
For example, when searching for frequent phrases in a corpus of mailing list messages, all
messages starting long discussion threads end up as frequent because they are usually
cited in replies.
• Not all phrases are sequences. We have already pointed out that in Polish the same
phrase can be rewritten in many ways with different word order and still be perfectly
comprehensible (as in: sto lat samotnosci, lat sto samotnosci, samotnosci sto lat, samot-
nosci lat sto).
• Essentially the same phrase may be non-continuous (interrupted by other words or
word forms). For example, compare: Franklin D. Roosevelt and Franklin Delano Roosevelt,
or Earvin Johnson and Earvin “Magic” Johnson.
2.2 Text Representation: Vector Space Model
To assess the similarity or dissimilarity of two or more documents, we need a model in which
these operations are defined. The model is usually selected to match a particular task’s re-
quirements and objectives. Several text representation models have been suggested in the
literature; a good overview can be found in [4]. To keep this chapter's size reasonable we
will focus only on the Vector Space Model (vsm) and the elements it consists of: document
indexing, feature weighting and similarity coefficients.
2.2.1 Document Indexing
Vector Space Model2 uses the concepts of linear algebra to address the problem of repre-
senting and comparing textual data.
A document d is represented in the vsm as a document vector [w_{t_0}, w_{t_1}, . . . , w_{t_Ω}],
where t_0, t_1, . . . , t_Ω is the set of words of a given language and w_{t_i} expresses the
weight (importance) of term t_i to document d. Weights in a document vector typically
reflect the distribution of words in that document. In other words, the value w_{t_i} in
a document vector d represents the importance of word t_i to that document.
Components of the document's vector are commonly called its features, because their
collection provides a footprint of the document's contents. Note that we can hardly speak
about the meaning of a document vector anymore, since it is basically a collection of
unrelated terms. For this reason, the vsm is sometimes called a bag-of-words model. The
process of translating input documents into their term vectors is called document indexing.
Given a set of documents, their document vectors can be put together to form a matrix
called a term-document matrix. The value of a single component of this matrix depends on
the strength of the relationship between a document and the given term. An example with
the following input documents (each consisting of a single sentence) demonstrates this.
2 Vector Space Model is usually credited to Gerard Salton, although the concept had already
been known in the literature when Gerard Salton started to use it — [21] is an interesting
essay on the subject.
document content
d0 Large Scale Singular Value Computations
d1 Software for the Sparse Singular Value Decomposition
d2 Introduction to Modern Information Retrieval
d3 Linear Algebra for Intelligent Information Retrieval
d4 Matrix Computations
d5 Singular Value Analysis of Cryptograms
In the first step, we identify all possible terms appearing in the input and build a matrix
where columns correspond to terms and rows correspond to input documents. We exclude
certain terms that we know are not useful for identifying the topic of a document (these
are called stop words) and restrict the presentation to just a few selected terms with at least
one non-zero weight. On the intersection of each column and each row we place the count
(number of occurrences) of the column’s term in the row’s document. For our example in-
put, the term-document matrix looks as shown below.
                Information  Scale  Analysis  Singular  Value  . . .
    d0               0         1       0         1        1
    d1               0         0       0         1        1
    d2               1         0       0         0        0
    d3               1         0       0         0        0
    d4               0         0       0         0        0
    d5               0         0       1         1        1
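The indexing step above can be sketched as follows. This is a toy illustration (dense lists, a hand-picked stop word set); real systems use sparse representations, as discussed next:

```python
# Build a term-document count matrix for the example collection.
docs = [
    "Large Scale Singular Value Computations",
    "Software for the Sparse Singular Value Decomposition",
    "Introduction to Modern Information Retrieval",
    "Linear Algebra for Intelligent Information Retrieval",
    "Matrix Computations",
    "Singular Value Analysis of Cryptograms",
]
stop_words = {"for", "the", "to", "of"}

# collect the vocabulary over all documents, minus stop words
terms = sorted({w.lower() for d in docs for w in d.split()} - stop_words)

# rows correspond to documents, columns to terms; cells hold counts
matrix = [[d.lower().split().count(t) for t in terms] for d in docs]

# the column for "singular" matches the table above: d0, d1 and d5
print([row[terms.index("singular")] for row in matrix])  # -> [1, 1, 0, 0, 0, 1]
```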
Two questions arise from the example. First, an average document will contain just
a small subset of all possible words of a given language, which is additionally an uncon-
strained set. We can solve this problem with the simplest method and store just the indices
of terms for which the document has non-zero weights (as we did in the example). More
advanced techniques encode gaps between indices or other forms of bit packing and cod-
ing [102, 4].
The second problem is related to our prospective application — measuring similarity
between documents. Certain words, such as pronouns, conjunctions or prepositions, occur
frequently in any document and are useless as features. We can ignore certain words like
that (by adding them to a set of stop words), but a better idea is to recalculate weights in
document vectors in a way that highlights words that are more important for a given docu-
ment with respect to others and downplays words that are very common. This task is called
feature weighting and we list a few known weighting methods in the next section.
2.2.2 Feature Weighting
Feature weighting methods can be divided into local (one document’s term count is avail-
able) and global (term counts of all documents are available). We list the weighting schemes
that have application somewhere in this thesis. A full overview of the subject can be found
in [83, 4].
The following notation is used: tf_{ij} — the number of occurrences of term i in document j;
df_i — the number of documents containing term i in the entire collection; w(i,j) — the
weight of term i in document j; N — the number of all documents in the collection.
Term Frequency, Inverse Document Frequency Certainly the most widely known feature
weighting formula, usually abbreviated as tf-idf. Credited to Gerard Salton [82], tf-idf
tries to balance the importance of a word in a document with how common the word is in
the entire collection.
$$ w(i,j) = \mathrm{tf}_{ij} \times \log_2 \frac{N}{\mathrm{df}_i} \tag{2.1} $$
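As an illustration, eq. (2.1) can be applied to a small term-document count matrix (the counts below are made up for the example, not taken from the thesis):

```python
import math

# tf-idf weighting (eq. 2.1): rows are documents, columns are terms.
def tf_idf(counts):
    n_docs = len(counts)
    n_terms = len(counts[0])
    # df_i: number of documents containing term i
    df = [sum(1 for d in counts if d[i] > 0) for i in range(n_terms)]
    return [
        [d[i] * math.log2(n_docs / df[i]) if df[i] else 0.0
         for i in range(n_terms)]
        for d in counts
    ]

counts = [[2, 0, 1], [0, 1, 1], [1, 0, 0]]
weights = tf_idf(counts)
# term 2 occurs once in document 0 and in 2 of 3 documents: 1 * log2(3/2)
print(round(weights[0][2], 4))  # -> 0.585
```

Note how a term occurring in every document would get idf = log2(1) = 0 and thus be ignored, which is exactly the downplaying of common words described above.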
Modified tf-idf This modification of the original tf-idf downplays the count of terms in a
document and contains certain algebraic modifications for faster calculation of w(i , j ) on a
precached index. Implemented in the document retrieval library Lucene [H].
$$ w(i,j) = \sqrt{\mathrm{tf}_{ij}} \times \left( \log_e \frac{N}{\mathrm{df}_i + 1} + 1 \right) \tag{2.2} $$
Pointwise Mutual Information A widely used weighting scheme, although known to be
biased towards infrequent events (terms) — an interesting discussion can be found in [60].
We show practical implications of this property in our experiments in chapter 7 on page 79.
$$ w(i,j) = \log_2 \frac{\mathrm{tf}_{ij} / N}{\left( \sum_k \mathrm{tf}_{ik} / N \right) \times \left( \sum_k \mathrm{tf}_{kj} / N \right)} \tag{2.3} $$
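A direct sketch of eq. (2.3), assuming tf is a term-by-document count matrix and the cell being weighted is non-zero:

```python
import math

# Pointwise mutual information weighting (eq. 2.3).
# tf[i][j]: occurrences of term i in document j; N: number of documents.
def pmi_weight(tf, i, j):
    N = len(tf[0])
    p_ij = tf[i][j] / N
    p_i = sum(tf[i]) / N                  # (sum_k tf_ik) / N
    p_j = sum(row[j] for row in tf) / N   # (sum_k tf_kj) / N
    return math.log2(p_ij / (p_i * p_j))

tf = [[2, 0], [1, 1]]  # two terms, two documents (made-up counts)
print(round(pmi_weight(tf, 0, 0), 4))
```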
Discounted Mutual Information Similar to pointwise mutual information, but multiplied
with a discounting factor to compensate for the problems mentioned above (formula after
[53]).
$$ w(i,j) = \mathrm{mi}_{ij} \times \frac{\mathrm{tf}_{ij}}{\mathrm{tf}_{ij} + 1} \times \frac{\min\left( \sum_k \mathrm{tf}_{kj}, \sum_k \mathrm{tf}_{ik} \right)}{\min\left( \sum_k \mathrm{tf}_{kj}, \sum_k \mathrm{tf}_{ik} \right) + 1} \tag{2.4} $$
2.2.3 Similarity Coefficients
Two documents in the Vector Space Model represent two points in a multidimensional term
space (each term is assumed to be an independent dimension). If we define a notion of
distance in this space, we can compare documents against each other and thus start looking
for similarities or dissimilarities.
Any distance metric defined over a multidimensional vector space can be used, but two
methods are the most common: Euclidean distance and the cosine measure.
A simple Euclidean distance is quite often used, but it requires document vector length
normalization prior to the calculation, or the number of words (the proportion of weights)
in each document will distort the result.
The cosine measure is a more robust technique, stemming from the observation that if two
vectors have approximately the same features then they should “point” in a very similar
direction in the space determined by the term-document matrix, regardless of their
Euclidean distance. To calculate the similarity between two documents we look at the angle
between them, which we can compute using the dot product of their document vectors. To
simplify things even more, we can use the cosine of this angle, which is easier to compute
(it does not require an inverse trigonometric function). We hence define the cosine measure
of similarity between the vector representations of documents d_i and d_j in the term
vector space as:

$$ \mathrm{sim}(d_i, d_j) = \cos(\alpha) = \frac{d_i \cdot d_j}{|d_i| \, |d_j|}, \tag{2.5} $$
where x · y denotes the dot product between vectors x and y and |x| is the norm of vector x.
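A minimal sketch of eq. (2.5) for two document vectors of equal length (the zero-vector convention is our own assumption, not part of the definition):

```python
import math

# Cosine similarity (eq. 2.5) between two document vectors.
def cosine(d_i, d_j):
    dot = sum(a * b for a, b in zip(d_i, d_j))
    norm_i = math.sqrt(sum(a * a for a in d_i))
    norm_j = math.sqrt(sum(b * b for b in d_j))
    if norm_i == 0 or norm_j == 0:
        return 0.0  # convention: an empty document is similar to nothing
    return dot / (norm_i * norm_j)

print(round(cosine([1, 1, 0], [1, 1, 0]), 6))  # identical direction -> 1.0
print(cosine([1, 0, 0], [0, 1, 0]))            # orthogonal vectors  -> 0.0
```

Because the result depends only on the angle, scaling a vector (e.g. repeating a document twice) leaves its similarity to other documents unchanged, which is why no prior length normalization is needed.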
The cosine measure is widely used in text clustering and many other text processing
applications because its definition is quite intuitive and its implementation efficient.
However, it is also known that in highly dimensional spaces any two random vectors are
very likely to be nearly orthogonal. An attempt to solve this problem is to reduce the
dimensionality of the feature space using feature selection, feature construction or
term-document matrix decomposition techniques [4].
2.3 Document Clustering
2.3.1 Introduction
Let us start with a general definition of clustering after Brian Everitt et al. [22]:

Given a number of objects or individuals, each of which is described by a set of
numerical measures, devise a classification scheme for grouping the objects into
a number of classes such that objects within classes are similar in some respect
and unlike those from other classes. The number of classes and the characteristics
of each class are to be determined.
By analogy to the above definition, document clustering, or text clustering, can be defined
as a process of organizing pieces of textual information into groups whose members are
similar in some way, while the groups as a whole are dissimilar to each other. But before we
delve into text clustering, let us take a look at clustering in general.
There are many kinds of clustering algorithms, suitable for different types of input data
and diverse applications. A great deal depends on how we define similarity between ob-
jects. We can measure similarity in terms of objects’ proximity (distance), or as a relation
between the features they exhibit. An intuitive demonstration of this difference is shown in
Figure 2.7 on the following page — the same set of objects is grouped depending on their
relative distance, or feature — shape and color.
Brian Everitt et al. suggest the following classification of clustering methods [22]:
• hierarchical techniques — in which clusters are recursively grouped to form a tree,
Figure 2.7: The same group of objects (1) “clustered” based on their relative distance (2) and
features they exhibit — shape (3) and color (4).
• optimization techniques — where clusters are formed by the optimization of a cluster-
ing criterion,
• density or mode-seeking techniques — in which clusters are formed by searching for
regions containing a relatively dense concentration of entities,
• clumping techniques — in which classes or clumps can overlap,
• others — methods which do not fall clearly into any of the above.
Alternative classifications of clustering algorithms can be suggested, depending on the
aspect we look at. In our opinion it is worthwhile to take a look at several aspects. Looking
at the structure of discovered clusters we can distinguish flat and hierarchical clustering
algorithms. Depending on the type of assignment between documents and clusters we can
have:
• partitioning algorithms — which assign each document to exactly one cluster,
• clumping techniques — described above; note that this type of clustering is natural
and desirable for texts because a single document can be assigned to more than one
topic,
• partial clustering — algorithms which may leave some objects entirely unassigned; in
this thesis we will use the term “others” to refer to a synthetic group of unclustered
objects.
Finally, the classification can be made depending on the strength of relationship between
an object and a cluster:
Figure 2.8: A simplified visual representation of different clustering algorithms:
partitioning, overlapping and partial assignment of objects to groups; hierarchical, binary
and other assignment.
• crisp clustering — with a binary assignment when a document is either assigned to a
cluster or not assigned to it,
• fuzzy clustering — when the degree of assignment is expressed on the scale of “not
associated” to “fully associated”, typically with a number between 0 and 1.
Figure 2.8 depicts different ways of looking at clustering algorithms depending on their char-
acteristics.
As a final element of this section, we should mention another interesting aspect of
clustering, related to our transparency requirement. Mark Sanderson and Bruce Croft [84]
divide clustering methods depending on how many features contributed to the inclusion of
a given object in a cluster. They distinguish monothetic algorithms, which assign objects to
clusters based on a single feature, and polythetic algorithms, which use multiple features.
Our work is a bit of both: we try to make the results monothetic (a transparent relationship
of cluster labels to documents), but at the same time we use polythetic clustering
algorithms for detecting groups of documents.
2.3.2 Overview of Selected Clustering Algorithms
Cluster analysis is a very broad field and the number of available methods and their
variations can be overwhelming. A good introduction to numerical clustering can be found in
Brian Everitt’s Cluster Analysis [22] or in Allan Gordon’s Classification [30]. A more up-to-
date view of clustering in the context of data mining is available in Jiawei Han and Miche-
line Kamber’s Data Mining: Concepts and Techniques [35]. Shorter surveys on the topic
are also available in [6] and [41]. Resources in the Polish language include, for example, a
chapter on clustering methods in Jacek Koronacki and Jan Cwik’s book Statystyczne systemy
uczace sie [46] and a Polish translation of David Hand, Heikki Mannila and Padhraic Smyth’s
Principles of Data Mining (Eksploracja Danych) [36]. We should emphasize again that most
text clustering algorithms attempt to transform the input text into a mathematical
representation directly suitable for use with numerical clustering algorithms, so any book
about cluster analysis will be relevant to the topic of this thesis.
In the following part of this section we describe a few selected clustering algorithms that
are important from the point of view of further chapters.
Partitioning Methods
Partitioning clustering methods divide the input data into disjoint subsets attempting to find
a configuration which maximizes some optimality criterion. Because enumeration of all
possible subsets of the input is usually computationally infeasible, partitioning clustering
employs an iterative improvement procedure which moves objects between clusters until
the optimality criterion can no longer be improved.
The most popular partitioning algorithm is the k-Means algorithm. In k-Means, we define
a global objective function and iteratively move objects between partitions to optimize
this function. The objective function is usually a sum of distances (or a sum of squared
distances) between objects and their clusters' centers, and the objective is to minimize it.
Assuming K is a set of clusters, t_i ∈ K is a cluster (a set of objects), C_{t_i} is the
representation of a cluster's center and d is an element, we try to minimize the following
expression:

$$ \sum_{t_i \in K} \sum_{d \in t_i} \mathrm{distance}(d, C_{t_i}) \tag{2.6} $$
The representation of a cluster can be an average of its elements (its centroid) or a medoid
(the object closest to the centroid of a cluster); in the latter case we call the algorithm
k-Medoids. Given the number of clusters k a priori, a generic k-Means procedure is
implemented in four steps:
1. partition objects into k nonempty subsets (most often randomly),
2. compute representations of the centers of the current clusters,
3. assign each object to the closest cluster,
4. repeat from step 2 until no more reassignments occur.
By moving objects to their closest partition and recalculating partition’s centers in each step
the method eventually converges to a stable state, which is usually a local optimum.
We discuss the computational complexity of k-Means in later sections; for now let us just
note that the entire procedure is efficient in practice and usually converges in just a few
iterations on non-degenerate data. Another thing worth mentioning is that the clusters
created by k-Means are spherical with respect to the distance metric — the algorithm is
known to have problems with non-convex, and in general complex, shapes.
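The four steps above can be sketched as follows. This is a minimal illustration for points in the plane with Euclidean distance (document clustering would use term vectors and a similarity measure from Section 2.2.3), not the implementation used later in the thesis:

```python
import math
import random

# Generic k-Means: steps 1-4 from the procedure described above.
def k_means(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # step 1: pick initial centers
    assignment = None
    for _ in range(max_iter):
        # step 3: assign each point to its closest center
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        if new_assignment == assignment:  # step 4: stop when stable
            break
        assignment = new_assignment
        # step 2: recompute each center as the centroid of its cluster
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = tuple(sum(x) / len(members)
                                   for x in zip(*members))
    return assignment, centers

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
assignment, centers = k_means(points, 2)
# the two nearby pairs end up in the same cluster
print(assignment[0] == assignment[1], assignment[2] == assignment[3])
```

Note that the result depends on the random initial partition in step 1: a different seed may converge to a different local optimum, which is exactly the local-optimum behavior mentioned above.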
Hierarchical Methods
A family of hierarchical clustering methods can be divided into agglomerative and divisive
variants. Agglomerative Hierarchical Clustering (ahc) initially places each object in its own
cluster and then iteratively combines the closest clusters, merging their content. The
clustering process is interrupted at some point, leaving a dendrogram with a hierarchy of
clusters.
Many variants of hierarchical methods exist, depending on the procedure for locating
pairs of clusters to be merged. In the single link method, the distance between clusters is
the minimum distance between any pair of elements drawn from these clusters (one from
each), in the complete link method it is the maximum distance, and in the average link
method it is correspondingly an average distance (a discussion of other merging methods
can be found in [22]). Each of these has a different computational complexity and runtime
behavior. The single link method is known to follow “bridges” of noise and link elements in
distant clusters (a chaining effect). The complete link method is computationally more
demanding, but is known to produce more sensible hierarchies [30, 4]. The average link
method is a trade-off between speed and quality, and efficient algorithms for its
incremental calculation exist, such as in the Buckshot/Fractionation algorithm [13].
Typical problems in hierarchical methods are in finding the stop criterion for the cluster
merging process [20], tuning parameters and finding a method of “flattening” dendrogram
levels to create clusters with more than two subgroups.
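As a sketch, agglomerative clustering with the single link criterion can be written as below. This is a naive O(n³) illustration that stops at a fixed number of clusters (one of the stop criteria mentioned above), not an efficient implementation:

```python
import math

# Agglomerative hierarchical clustering with the single link criterion,
# stopped when `target` clusters remain.
def single_link_ahc(points, target):
    clusters = [[p] for p in points]  # each object starts in its own cluster
    while len(clusters) > target:
        # find the pair of clusters with the smallest single-link distance
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(a, b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))  # merge the closest pair
    return clusters

points = [(0, 0), (0, 1), (5, 5), (5, 6)]
print(sorted(len(c) for c in single_link_ahc(points, 2)))  # -> [2, 2]
```

Recording the sequence of merges (instead of discarding it) would yield the dendrogram; swapping `min` for `max` in the distance computation gives the complete link variant.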
Clustering Based on Phrase Co-occurrence
The Internet brought a new challenge to the task of text clustering: incomplete data. Par-
titioning and hierarchical methods typically used vector space representation and required
verbose input (full documents). Web pages and mailing lists are typically shorter and snip-
pets found in search results — fragments retrieved from documents matching the query —
are an extreme example of incomplete data, often just a few words long. This new data had
to be reflected in novel approaches to clustering.
Algorithms utilizing phrase co-occurrence use frequently recurring sequences of terms
as features of similarity between documents. Assuming that documents discussing related
subjects should use similar vocabulary and phrasing, frequent phrases can be used to iden-
tify documents discussing the same (or related) topics. Note that what really makes the dif-
ference is the use of variable-length features that are later used for describing the discovered
clusters. This idea first appeared in the Suffix Tree Clustering (stc) algorithm, published
in a seminal paper by Oren Zamir and Oren Etzioni [105].
Suffix Tree Clustering works in two phases: first it discovers base clusters (groups of doc-
uments that share a single frequent phrase) and then merges base clusters together to form
the output.
The discovery of base clusters starts from segmenting the input into words and sentences.
Each sentence is essentially a sequence of words and is inserted as such into a generalized
suffix tree. A generalized suffix tree is similar to a suffix tree, but contains suffixes of
more than one input sequence. Internal nodes of the tree also keep pointers to the sequences
a given suffix originated from. This way, each internal node of the tree holds a sequence of
elements that occurred at least twice in the input, together with the sentences it occurred
in.
Figure 2.9: A generalized suffix tree for three sentences: (1) cat ate cheese, (2) mouse ate
cheese too and (3) cat ate mouse too. Paths to internal nodes (circles) contain phrases that
occurred more than once in documents indicated by children nodes (rectangles). Dollar
symbol is used as a unique end-of-sentence marker. Example after: [105].
After all the input sentences have been added to the suffix tree, the algorithm traverses
the tree’s internal nodes looking for phrases that occurred a certain number of times in more
than one document. Any node exceeding the minimal count of documents and phrase fre-
quency immediately becomes a base cluster. Figure 2.9 shows a generalized suffix tree with
three short example phrases.
The strongest element of stc is its use of a proper data structure — suffix tree
construction is linear and it permits very fast and convenient base cluster detection.
Interestingly, after this clever step, stc falls back to a simple single-link merge
procedure in which base clusters that overlap too much are iteratively combined into larger
clusters and the process is repeated. This step is not fully justified and may result in
merging base clusters that should
not be merged. An improved version of the algorithm was suggested by Irmina Masłowska
in [63, 64]. Nonetheless, Suffix Tree Clustering was the first algorithm to emphasize the im-
portance of comprehensible cluster labels and it was very inspiring for other authors. We
list several spin-off algorithms utilizing phrase co-occurrence in Section 2.4 on page 34.
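The base cluster discovery phase can be illustrated on the sentences from Figure 2.9. This toy sketch enumerates word n-grams directly instead of building a generalized suffix tree (the tree achieves the same result in linear time), and omits stc's phrase-frequency threshold and the subsequent merge phase:

```python
# Toy stc-style base cluster discovery: map each phrase (word n-gram)
# to the set of documents it occurs in, and keep phrases shared by at
# least `min_docs` documents.
def base_clusters(docs, min_docs=2, max_len=4):
    phrase_docs = {}
    for doc_id, text in enumerate(docs):
        words = text.lower().split()
        for i in range(len(words)):
            for j in range(i + 1, min(i + 1 + max_len, len(words) + 1)):
                phrase = tuple(words[i:j])
                phrase_docs.setdefault(phrase, set()).add(doc_id)
    return {p: d for p, d in phrase_docs.items() if len(d) >= min_docs}

docs = ["cat ate cheese", "mouse ate cheese too", "cat ate mouse too"]
clusters = base_clusters(docs)
print(clusters[("ate", "cheese")])  # -> {0, 1}
```

Each entry is a base cluster: the phrase is the (human-readable) label and the document set is its content, which is precisely why phrase-based methods yield comprehensible cluster labels.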
Other Clustering Methods
A number of other clustering methods are known in the literature — density-based methods,
model-based and fuzzy clustering, self organizing maps and even biology-inspired algo-
rithms. An interested Reader can find many surveys and books providing comprehensive
information on the subject [22, 30, 6, 36, 46].
2.3.3 Applications of Document Clustering
Applications of text clustering algorithms have changed over time, due to the availability
of unprecedented amounts of data, new ideas and algorithms, and new types of input data.
We summarize a list of major text clustering applications because it nicely outlines the
evolution of clustering methods from a background utility for modelling similarities among
objects to a first-hand element of the user experience.
Improving Document Retrieval Efficiency
The initial application of text clustering was in document retrieval. Keith van Rijsbergen
observed that “closely associated documents tend to be relevant to the same requests” (the
cluster hypothesis). Clustering was applied to a collection of documents prior to searching
to detect similar groups of documents. When the user typed a query, an information
retrieval algorithm retrieved the documents matching the query together with the documents
in their clusters, to improve recall. Note that clusters were never explicitly revealed to
the user, so there was no need to describe them.
Organizing Large Document Collections
Document retrieval focuses on finding documents relevant to a particular query, but it fails
to solve the problem of making sense of a large number of uncategorized documents. The
challenge here is to organize these documents in a taxonomy identical to the one humans
would create given enough time and use it as a browsing interface to the original collection
of documents.
Several large conferences, such as the Text Retrieval Conference (trec), published
reference data with samples of documents clustered manually by humans. The ongoing work
in this area focused primarily on replicating the man-made taxonomy as closely as possible,
maximizing a score calculated as conformity to predefined document-to-cluster assignments.
The comprehensibility of cluster descriptions was usually neglected because it was not a
direct factor affecting the score.
Browsing Document Collections
The observation that clusters alone present a certain value to the user of an information
retrieval system is very important; it was first made by Marti Hearst and Jan Pedersen
in their paper about the Scatter/Gather search system [37]. By scanning the description of a
cluster the user can assess the relevance of the remaining documents in that cluster and
find the interesting information faster (or at least identify the irrelevant clusters and avoid
them). The techniques for extracting cluster descriptions were very simple: selected titles of
documents within the cluster, excerpts of documents and keywords.
Duplicate Content Detection
In many applications there is a need to find duplicates or near-duplicates in a large number
of documents. Clustering is employed for plagiarism detection, for grouping related news
stories and for reordering search results rankings (to ensure higher diversity among the topmost
documents). Note that in such applications the description of clusters is rarely needed.
Integration with Search Engines
Modern Internet and intranet search engines index countless Web pages, documents
and news articles, and can search all this content very fast. Users simply rephrase the
query if the information they need does not show up at the top of the search results, and they
rarely need a full taxonomy of documents to find what they need. However, in some situations,
when the query is ill-defined or the information need is not clear (an overview-type
query, for example), a different type of text clustering may be helpful: search results clustering.
Search results clustering groups each query’s results and presents the
user with an overview of what the result contains. The clusters can be used to filter out
irrelevant hits or to refine the query with other terms.
We mentioned this type of clustering a few times already, but let us underline the key
elements of difficulty again. The input information for the algorithm is very limited: only
the titles and snippets are available. The algorithm must be very fast to avoid slowing down
the user interface of a search engine. Finally, the clusters must be accurately and clearly
described, because the user expects an overview of topics similar to the query and has no
time to guess the meaning of clusters described using keywords, for example.
The concepts presented in this thesis are applicable whenever the description of clusters
needs to be shown to the user. The most likely targets among the applications presented
above are document collection browsing and search results clustering.
2.4 Related Works
The purpose of this section is to present currently available algorithms and methods that
closely correspond to the ideas presented in this thesis.
Clustering
Antonio Gulli and Paolo Ferragina [24, 23] start from the limitations of Grouper (stc’s initial
implementation; see errata E-17) and build an algorithm called SnakeT. SnakeT uses non-contiguous
phrases as features, which the authors call approximate sentences. The criterion forming
a cluster is still (as in stc) the fact of sharing a sufficient number of approximate sentences.
The implementation and the algorithm’s design are far more complex than stc’s and use custom
data structures similar to those for frequent itemset detection in data mining, but the authors
pay attention to cluster label comprehensibility and enrich cluster descriptions with data
extracted from a predefined ontology.
In [48], the authors present a search results clustering algorithm which attempts to associate
documents with a single concept, where labels are chosen so that “they are good indicators
of the documents they contain”. The algorithm uses frequent terms, but also preprocessed
noun phrases. Unfortunately, as the authors put it: “[stems] are not usually very meaningful
for use as node labels, therefore we replace each stemmed term by the most frequently
occurring original term”. Note that such a heuristic would obviously fail for documents in Polish.
The provided screenshots show that the generated cluster labels are mostly based on
single words.
2.4. Related Works 35
Hotho, Staab and Stumme build a Conceptual Clustering system that refines cluster de-
scriptions using concept lattices [39, 38]. Their cluster descriptions are still single words but
they use a large thesaurus and formal concept analysis to avoid repetitions and synonyms
in cluster keywords.
In [107], the authors perform an interesting experiment with supervised training of a cluster
label selection procedure. First, a set of fixed-length word sequences (the article reports
3-grams) is created from the input. Each label is then scored with an aggregative formula
combining several factors: phrase frequency, length, intra-cluster similarity, entropy and
phrase independence. The specific weights and scores for each of these factors are learnt
from examples of manually prioritized cluster labels.
Pantel and Lin [53, 72] present a very interesting clustering algorithm called Clustering
with Committees, which builds clusters around groups of a few strongly associated (most similar)
documents, called committees. Because committees are so strongly related through their
set of features, they usually point to an unambiguous concept (which the authors even evaluate
using semantic relationships from WordNet). The cluster description remains a list of
strong features, but hopefully an unambiguous one. Pantel recently attempted to label the output
classes with more “semantic” labels [73], but this work definitely goes deeper into natural
language processing than information retrieval.
A concept of clustering combined with pattern selection (similar to the dcf approach)
appears in [108], where the authors present a classification system which uses clusters to select
labeled objects and expands the set of labeled objects with elements from within the cluster
to improve classification. Cluster descriptions are not part of the consideration.
Mark Sanderson and Bruce Croft [84] present a completely different, yet related approach
to exploring document collections. Instead of clustering input documents, they start
with salient terms and phrases taken from predefined queries to a document collection and
expand this set with a technique called Local Context Analysis. Once a large enough collection
of terms and phrases is gathered, it is automatically organized into a hierarchy, starting
with the most generic terms at the top and descending to the most detailed ones at the bottom.
The technique used by the authors is very interesting as it involves no clustering techniques, yet
provides a hierarchy of (quite comprehensible) descriptions of groups of documents in the
output. The disadvantage is that the authors bootstrap their method with a predefined set of
queries, which would be unavailable for another collection of documents.
An interesting cluster labeling procedure is also shown in the Weighted Centroid Covering
algorithm [90]. The authors start from a representation of clusters (centroids of document
groups) and then build their (word-based) descriptions by iterative assignment of the highest-scoring
terms to each category, making sure each unique term is assigned only once. Interestingly,
the authors point out that this kind of procedure could be extended to use existing
ontologies and labels, but they provide no experimental results of any kind.
Summarization
Similarities to our work can be found in the field of document summarization, especially
multi-document text summarization. The goal of summarization is to present a concise textual
summary of a single document or a group of documents. What distinguishes summarization
from descriptive clustering is that no document groups are ever shown in the results; the
output consists of exactly one summary, usually longer than a cluster label.
Summarization dates back to 1958, when Hans Peter Luhn published a pioneering paper
called The Automatic Creation of Literature Abstracts [55]. The algorithm works by forming a
definite set of keywords and then looking for sentences containing these keywords, assembling
an abstract from the highest-scoring sentences. The majority of later works in
the field are not far from Luhn’s idea, focusing mostly on refining the procedure of selecting
sentences abbreviating the content of a document. One approach, for instance, is to
take into account lexical clues, boosting the score of sentences in the proximity of words such as
important or significant and decreasing their score in the neighborhood of phrases like might
be or unlikely [80].
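Luhn's keyword-and-sentence scoring scheme can be sketched in a few lines. The following is an illustrative simplification, not Luhn's original procedure: the keyword selection (plain word frequency, no stop-word removal) and all function names are our own.

```python
import re
from collections import Counter

def luhn_abstract(text, n_keywords=10, n_sentences=2):
    """Luhn-style extract: pick frequent words as keywords, score sentences
    by keyword occurrences, and keep the top-scoring sentences."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    words = re.findall(r'[a-z]+', text.lower())
    # A real system would first remove stop words and very rare terms.
    keywords = {w for w, _ in Counter(words).most_common(n_keywords)}
    scored = sorted(sentences,
                    key=lambda s: sum(w in keywords
                                      for w in re.findall(r'[a-z]+', s.lower())),
                    reverse=True)
    top = set(scored[:n_sentences])
    # Reassemble the abstract in the original sentence order for readability.
    return [s for s in sentences if s in top]
```

The key design choice, shared by Luhn and most of his successors, is that the abstract is purely extractive: sentences are selected, never rewritten.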
In [77], the authors describe a summarization engine, mead, which extracts sentences and ranks
them using a clustering method. The highest-ranking sentences are selected for the summary.
A broader overview of summarization and topic segmentation techniques and systems
can be found in [52] or in [58].
Topic Segmentation
A piece of text, such as an article, rarely talks about a single subject. The analysis of how
topics change in a document is the subject of topic segmentation techniques. A large group
of methods in this field is based on theories of discourse modelling, represented by influential
papers by Eduard Skorochod’ko [88], Michael Halliday and Ruqaiya Hasan [34], and
Barbara Grosz and Candace Sidner [32]. An overview of the models of discourse along with
their applications to topic segmentation and summarization can be found in Jeffrey Reynar’s
thesis [79].
Summarization, topic segmentation and clustering have a high degree of overlap in their
motivations and even in the methodology of solving their respective problems. Having said
that, each discipline has its own niche where it fits best. The ideas presented in this work
to a certain extent combine the goals of multi-document text summarization, topic identification
and clustering, although we tend to stay in the field of information retrieval with regard
to the algorithmic solutions.
2.5 Evaluation of Clustering Quality
Experts seem to agree that objective measures of clustering quality are not feasible [59]. Text
clustering depends on such a variety of factors (implementation details, parametrization,
input data, preprocessing) that each experiment becomes quite unique, and deriving conclusions
about the supremacy of one algorithm over another seems far-fetched. Moreover,
there is usually more than one “good” result, and even human experts are rarely consistent
in their choice of the best one [56]. On the other hand, relying on merely anecdotal evidence
of improvement is obviously unsatisfactory.
There are two mainstream clustering evaluation methodologies: user surveys and mea-
sures of distortion from an “ideal” set of clusters. Neither method is perfect.
2.5.1 User Surveys
User surveys are a very common method of evaluating clustering algorithms [106, 63, 23, 85]
and often the only one possible. Unfortunately, a number of arguments speak against this
method of evaluation.
• It is difficult to find a significantly large and representative group of evaluators. When
the users are familiar with the subject (like computer science students or fellow sci-
entists), their judgment is often biased. On the other hand, people not familiar with
clustering and used to regular search engines have difficulty adjusting to a different
type of search interface.
• People are rarely consistent in what they perceive as “good” clusters or clustering. This
has been reported both in the literature [56] and in our past experiments on the Carrot2
framework, and it affects both the preparation of the answer sheets and the analysis of results.
• Experiment results are unique, unreproducible and incomparable. User surveys are
one-shot experiments that are not comparable with each other and are difficult to
perform repeatedly or periodically.
• Human evaluators learn by example, and their judgment and performance are not constant
throughout the experiment. This makes performing subsequent experiments
with the same evaluators impossible (because they have gained experience).
• User surveys usually require considerable time and effort, both in the preparation of the
experiment and in its practical realization.
We used user surveys in a few of our experiments in the past and were usually discouraged
by the results. During the course of work on this thesis we tried to avoid controlled
user studies and instead relied on empirical experiments, numerical investigation of quality
and user feedback collected from an open demonstration of the search results clustering
system Carrot2. We summarize this experience in Section 7.4 on page 107.
2.5.2 Measures of Distortion from Predefined Classes
Another evaluation method is based on defining a mathematical notion of the difference
between the set of clusters and a reference, desirable set of partitions (called a ground truth
set). A clustering algorithm should minimize this difference to mimic the behavior of the
person or algorithm that put together the ground truth set. A few popular data sets are available
for full document clustering, the majority created by mixing documents from thematically
different sources (such as different mailing lists) and, less often, by human selection and
tagging. Interestingly, in spite of a few attempts to create ground truth data sets for search
results clustering [87], no “standard” test collection for this problem exists at the time of
writing.
Let us assume a set of clusters $K = \{k_1, k_2, \dots, k_n\}$ and a set of ideal partitions
$C = \{c_1, c_2, \dots, c_m\}$ with a total of $N$ objects. The following metrics are typically used for measuring
the difference between the clusters and the ground truth set.
F-measure A measure popular in information retrieval: an aggregation of precision and recall,
here adapted to clustering evaluation. Recall that precision is the ratio of the
number of relevant documents to the total number of documents retrieved for a query. Recall
is the ratio of the number of relevant documents retrieved for a query to the total number
of relevant documents in the entire collection. In terms of evaluating clustering, the
f-measure of each single class $c_i$ is:

$$F(c_i) = \max_{j=1,\dots,n} \frac{2 P_j R_j}{P_j + R_j}, \qquad (2.7)$$

where:

$$P_j = \frac{|c_i \cap k_j|}{|k_j|}, \qquad R_j = \frac{|c_i \cap k_j|}{|c_i|}. \qquad (2.8)$$

The final f-measure for the entire set of clusters is:

$$F = \sum_{i=1}^{m} \frac{|c_i|}{N} \, F(c_i). \qquad (2.9)$$
Higher values of the f-measure indicate better clustering.
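Assuming clusters and classes are represented as sets of document identifiers, Equations 2.7–2.9 can be computed directly. The function below is a minimal sketch (the representation and names are ours, not part of the cited formulation):

```python
def clustering_f_measure(classes, clusters, N):
    """F-measure of a clustering against ground-truth classes (Eqs. 2.7-2.9).

    classes, clusters: lists of sets of document identifiers; N: total objects."""
    total = 0.0
    for c in classes:
        best = 0.0
        for k in clusters:
            overlap = len(c & k)
            if overlap == 0:
                continue
            p = overlap / len(k)  # precision P_j
            r = overlap / len(c)  # recall R_j
            best = max(best, 2 * p * r / (p + r))
        # Weight the best match of each class by its relative size.
        total += best * len(c) / N
    return total
```

A perfect reproduction of the ground truth yields 1.0; any mixing of classes lowers the score.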
Shannon’s Entropy Entropy is often used to express the disorder of objects within a cluster
in information-theoretic terms. We can define the entropy of each cluster $k_j$ as [11]:

$$E(k_j) = -\sum_{i=1}^{m} \frac{|c_i \cap k_j|}{|k_j|} \log \frac{|c_i \cap k_j|}{|k_j|}. \qquad (2.10)$$

Defined this way, entropy is not normalized, so we normalize it (after: [16]):

$$E(k_j) = -\frac{1}{\log m} \sum_{i=1}^{m} \frac{|c_i \cap k_j|}{|k_j|} \log \frac{|c_i \cap k_j|}{|k_j|}. \qquad (2.11)$$
Entropy of the entire clustering is the weighted entropy of its clusters:

$$E = \sum_{j=1}^{n} \frac{|k_j|}{N} \, E(k_j). \qquad (2.12)$$

Zero entropy means the cluster is composed entirely of objects from a single class.
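Equations 2.11–2.12 translate directly into code, again assuming clusters and classes as sets of document identifiers (a sketch with our own names; m > 1 classes are assumed so that log m is nonzero):

```python
import math

def cluster_entropy(classes, cluster):
    """Normalized entropy of one cluster (Eq. 2.11)."""
    e = 0.0
    for c in classes:
        p = len(c & cluster) / len(cluster)
        if p > 0:  # the 0*log(0) term is taken as zero
            e -= p * math.log(p)
    return e / math.log(len(classes))  # normalization by log m

def clustering_entropy(classes, clusters, N):
    """Size-weighted entropy of the whole clustering (Eq. 2.12)."""
    return sum(cluster_entropy(classes, k) * len(k) / N for k in clusters)
```

A pure cluster scores 0; a cluster mixing all classes uniformly scores 1.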
Byron E. Dom’s Clustering Entropy This is yet another information-theoretic measure. An
interesting thing about it is that it takes into account the difference between the number of
clusters and the number of classes (if such a difference exists) — a useful property we used
in our previous research on the influence of language properties on the quality of cluster-
ing [89]. We omit the exact formula here because we do not use it in this thesis — details
can be found in Byron Dom’s report [18].
Clustering Purity Purity gives the average ratio of the dominating class in each cluster to the
cluster size and is defined as:

$$P(k_j) = \frac{1}{|k_j|} \max_{i} \left( h(c_i, k_j) \right), \qquad (2.13)$$

where $h(c, k)$ is the number of documents from partition $c \in C$ assigned to cluster $k \in K$.
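Per-cluster purity follows Equation 2.13 directly; aggregating over clusters by size-weighted averaging is a common convention, though the text above defines only the per-cluster form (a sketch, names ours):

```python
def cluster_purity(classes, cluster):
    """Purity of a single cluster (Eq. 2.13): share of the dominating class."""
    return max(len(c & cluster) for c in classes) / len(cluster)

def clustering_purity(classes, clusters, N):
    """Size-weighted average purity (a common aggregation convention)."""
    return sum(cluster_purity(classes, k) * len(k) / N for k in clusters)
```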
Evaluation against a ground truth set is quite reliable and convenient because it yields a
numeric and repeatable result, but it comes with its own issues. First of all, a single number
does not explain what went wrong in the clustering process. It only provides an average
figure that is comparable, but hardly interpretable, and the range of source errors is broad
and varies in “severity” (a mix of two related classes is preferable to a uniform mixture of
documents, for example).
Each cluster validation measure also comes with inconvenient requirements concerning
cluster structure; most require an explicit partitioning into a number of clusters identical to
the number of ground-truth partitions. This requirement is very often hard to meet, especially
when we want our clustering algorithm to adjust the number of clusters freely or to allow
clusters that are pure subsets of the original classes with no additional penalty. In this thesis we
introduce a cluster validation measure similar to entropy but hopefully easier to interpret,
the cluster contamination measure (see Appendix A on page 116). We are also convinced that
visualization methods showing the allocation of documents within clusters help a great deal
when assessing the quality of a clustering algorithm, and we use many such visualizations in
Section 7.
Chapter 3
Descriptive Clustering
In this chapter we outline the differences between the traditional understanding of the document
clustering problem in information retrieval and descriptive clustering. We define descriptive clustering as
a distinct problem with a specific set of requirements, applicable to a certain class of
text browsing problems. Finally, we present a loose association with conceptual clustering
known in machine learning.
3.1 Problem Statement
Let us start by repeating the textbook definition of a clustering problem after [22]:
Given a number of objects or individuals, each of which is described by a set of
numerical measures, devise a classification scheme for grouping the objects into
a number of classes such that objects within classes are similar in some respect
and unlike those from other classes. The number of classes and the characteris-
tics of each class are to be determined.
Note that the above definition does not mention cluster labels at all; the objective is to find
groups of similar objects (documents in our case). Any application that brings clusters to
the user interface will need to find their textual description, an additional requirement
not stated in the definition of the problem. A good clustering algorithm (in terms of the
definition) may appear completely useless from the user’s point of view because it fails to
explain the reasons why clusters were formed. We believe the core difficulty lies in the transition
between the algorithm discovering groups of documents and the method of attaching
descriptions to these groups. Taking the Vector Space Model as an example, it seems almost
impossible to find a way of reconstructing comprehensible cluster labels from a mathematical
bag-of-words model of a group of documents. In our opinion the approaches known
in the literature, such as keyword tagging or the use of frequent phrases as cluster labels, do not
provide satisfactory answers to all the requirements of a cluster browsing application.
The idea presented in this thesis attempts to avoid this difficult phase of cluster labeling
instead of solving it. We do so by slightly relaxing the requirements concerning document
groups and shifting the emphasis to cluster labels. Compare the following definition of the descriptive
clustering problem with the definition of document clustering shown above:
Descriptive clustering is a problem of discovering diverse groups of semantically
related documents described with meaningful, comprehensible and compact
text labels.
Ideally, an algorithm solving the descriptive clustering problem should present docu-
ment groups for which clear and comprehensible descriptions exist. Document clustering is
therefore a step towards the final result, not the ultimate goal.
According to the above definition, we agree to discard clusters without sensible descrip-
tions. It may be disturbing at first, but our decision is deeply rooted in practical experience
gained with the Carrot2 framework and is an outcome of the following observations:
• the user will spend no additional time to figure out the meaning of a cluster label if its
description is unclear,
• the user will not inspect documents of a cluster with an unintuitive label,
• unclear or obscure relationships between the cluster description and documents in-
side it are discouraging and frustrating for the user.
All our further considerations try to take these facts into account, even if it potentially
affects the “ideal quality” of clustering understood as the expected allocation of documents
to groups.
3.2 Requirements
Evaluation of cluster labels presents a great challenge. Initially, inspired by approaches from
natural language processing, we tried to define strict formal requirements concerning clus-
ter labels, based on their grammatical decomposition. This direction turned out to be unre-
alistic — the structure of natural language, especially in Polish, seems to be far too complex
for reliable automatic evaluation.
Unable to specify the requirements formally, we instead define certain expectations
that cannot replace a formal definition but hopefully convey our intuition of what
cluster labels should be like. Therefore, to clarify the terminology, when we speak about
requirements concerning the problem of descriptive clustering, we mean two things:
• expectations concerning cluster labels (naturally imprecise and hard to verify, but pro-
viding certain intuition), and
• traditional requirements concerning clusters (groups of documents) which are taken
directly from the definition of clustering in information retrieval.
We describe these requirements in the following sections of this chapter.
3.2.1 Cluster Labels
We define three requirements concerning cluster labels: comprehensibility, conciseness and
transparency.
Comprehensibility
Zygmunt Saloni and Marek Świdziński make a very interesting observation of how people
perceive elliptical statements:
Jesteśmy przekonani, że każdy użytkownik języka ma stosunkowo jasną (choć
nie wyraźną!) intuicję elipsy, tzn. potrafi określić stopień kompletności danego
wypowiedzenia. ([81], page 56)
We are convinced that every speaker of a given language has a clear (but not explicit!)
intuition of ellipsis, that is, can determine the completeness of a given
pronouncement.
Extending this observation to comprehensibility, we suspect that native speakers of a given
language can easily determine whether a given sequence of words can function as a cluster label,
even though they cannot provide any explicit rules behind this judgment. Instead of defining
a good cluster label, we may pinpoint the negative cases and reject the clearly bad ones. A
list of reasons for rejecting, or at least penalizing, a cluster label, along with some examples, is
shown below. Good cluster labels should not fall into any of these categories.
• Grammatical inconsistency (not a sentence, not a pronouncement or an incomplete
phrase).
– wooden A go if
– byli krzywy noga z do (were crooked leg from to)
– of Computer Science
– z Torunia (from Toruń)
• Internal grammatical or inflectional constraint violated (the phrase is incorrect, words
inside it are not in agreement).
– Europe snowboarding resorts [→European snowboarding resorts]
– Samorządom miasta Poznaniu [→Samorząd miasta Poznania]
• External grammatical or inflectional constraint violated (the phrase is grammatically
correct, but is used in inflected form or lacks the required context).
– Instytucie Informatyki Politechniki Poznańskiej [→Instytut Informatyki Politechniki Poznańskiej]
– Alicji w Krainie Czarów [→Alicja w Krainie Czarów]
• Ellipsis or ambiguity.
– piłem Okocimy (drank Okocim)
– to i tamto (this and that)
Obviously, fully automatic and reliable verification of these constraints is impossible.
Even human assessment is often difficult; e.g., is inspector gadget a meaningful phrase when
it lacks context? A reasonable solution is to minimize the number of potentially bad descriptions
through their careful selection. That means, for instance, allowing only entire sentences
or pronouncements, since such entities should be self-contained and less ambiguous by definition.
Unfortunately, they are also too long to form concise cluster labels (as expressed in
the next requirement), so we decided to use the more fine-grained level of chunks (see Section
2.1.3 on page 16). Chunks should be grammatically consistent, potentially self-contained
and hopefully meaningful when extracted from the text and stripped of their surrounding context,
so they seem like good candidates, not falling into any of the unwanted categories mentioned
above.
Conciseness
Our goal is to show the user a brief, concise view of a structure of topics present in a set of
documents. Cluster labels should be as short as possible to minimize the amount of infor-
mation the user must process, but sufficient to convey the information about the cluster’s
documents. If a word in the description can be removed without sacrificing comprehensi-
bility of the phrase, then it should be.
Anticipating our further discussion, let us mention that this requirement is quite difficult
to realize without linguistic and contextual knowledge. Our algorithms satisfy this requirement
by allowing the user to express the desired length of cluster descriptions. We admit,
however, that this is only a partial solution to the problem.
Transparency
To the user of a clustering algorithm, all its internal elements (the model of text representation,
the similarity measures, the algorithm used for grouping documents) remain a black box
which he or she expects to work flawlessly. Any mistake made by the algorithm, especially
one that manifests itself in cluster descriptions, introduces confusion and decreases the user’s
trust in the entire algorithm.
We believe that the relationship between any document inside a cluster and its descrip-
tion must be clear and evident as in monothetic clustering. Similar clarity must exist in the
other direction — when looking at a description of a cluster, the user must be able to tell
which elements of this description can be found in the cluster’s documents. We will call a
clustering method transparent if the user is able to easily answer the following questions:
• Why was label X selected for documents in cluster Y?
• Why was document X placed in cluster Y?
Cluster keywords: apple
Excerpt: New York reminds us of the warmhearted program of Big Apple Greeter.

Cluster keywords: apache, server
Excerpt: [. . . ] Median hourly earnings of nonrestaurant food servers were $7.95 in May
2004. This figure was even lower among Native American tribes of Zuni, Navajo
or Apache. [. . . ]

Cluster keywords: jacek, placek
Excerpt: [. . . ] Jacek był grubym i nieporadnym chłopcem. [100 pages later] Na stole stał
pachnący placek. [. . . ]

Table 3.1: Examples of cluster labels consisting of keywords and fragments of documents
matching these keywords, but not at all their first common-sense meaning.
This requirement is partially inspired by the history of search engines in information retrieval.
In the beginning, most search engines placed a default Boolean disjunction operator
(or) between query keywords. The result included documents containing any term of the
query. But people soon realized that a default conjunction (and) is less confusing because
there is no guessing which combination of terms made a given document appear in the
result; with the default and, the relationship between the query and the set of retrieved documents
is very clear (or, in our terms: transparent).
Returning to the field of clustering, the algorithms used there employ mechanisms more
complex than vsm-based document retrieval, and the transparency requirement becomes
even more important. For instance, the traditional keyword-based cluster presentation
may lead to mistakes because users will assign the most common meaning to a set of
keywords; consider the examples of cluster keywords and documents not truly relevant to
such clusters shown in Table 3.1.
As for how the transparency requirement can be satisfied, in our opinion a cluster label
containment relationship is a good, clear rule of thumb: every document in a cluster must
contain a phrase from its description. Such a rule is very restrictive — the number of documents
containing an exact copy of a given phrase is likely to be small. Recalling the discussion
in Section 2.1.3 about the loose order of phrases in languages such as Polish, exact phrase
containment may not even be a correct heuristic.
We may relax the above common-sense rule and require the cluster to contain documents
where the label’s phrase appears with possibly reordered words or with other terms injected
inside it. The user should be able to control how much distortion from the cluster
label he or she allows. If such a definition is still too narrow, because the input is large or
larger clusters are needed, the cluster label can ultimately be replaced with a more generic
term. However, there should always be a possibility of expanding the abbreviated cluster
label into low-level, fully transparent elements to explain to the user how
the cluster was formed and what can be found inside it.
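The containment rule and its relaxed variant can be expressed as a simple predicate. The sketch below (function and parameter names are our own illustration) treats the relaxed form as unordered word containment, the simplest reading of the reordering relaxation:

```python
def label_is_transparent(label, documents, allow_reorder=False):
    """Check the label-containment rule: every document in the cluster must
    contain the label phrase, either verbatim or (relaxed) as a set of words."""
    label = label.lower()
    needed = set(label.split())
    for doc in documents:
        text = doc.lower()
        if allow_reorder:
            # Relaxed rule: all label words present, in any order.
            if not needed <= set(text.split()):
                return False
        elif label not in text:
            # Strict rule: the exact phrase must appear.
            return False
    return True
```

A fuller implementation would also bound how many foreign terms may be injected between label words, which is the "distortion" the user should be able to control.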
3.2.2 Document Groups
The problem of descriptive clustering is focused on cluster labels; nonetheless, it essentially
remains a document clustering problem. In this thesis we consider a subset of clustering
algorithms producing flat, overlapping clusters, with the possibility of leaving some
documents unassigned.
Internal Consistency
An algorithm solving the descriptive clustering problem should ensure documents inside a
cluster are similar to each other. We believe that internal consistency, regardless of its math-
ematical definition, corresponds strongly with the concept of cluster label transparency. If
all documents in a cluster have a clear relationship with its description then such a cluster
must appear consistent to the user and therefore fulfills the requirement.
External Consistency
Descriptive clustering should provide an overview of the topics present in the input, so we
search for diverse clusters (different from each other) and varying in size (not only the largest
ones, which might be obvious to the user).
Overlaps and Outliers
We can expect a single document to contain references to many different subjects, so an algorithm
solving descriptive clustering must allow placing it in more than one cluster. Moreover,
we can also expect a situation where a document does not belong to any cluster at all.
Such outlier documents can be abandoned entirely or can form a synthetic group of unrelated
documents. The point is not to force documents into their closest cluster if such a relationship
is not justified.
Note that we have assumed that the structure of clusters is flat. This is partially a consequence
of transparency: if we need a clear relationship between a cluster label and its content, it
would be difficult to come up with a transparent label for a compound cluster in a hierarchical
clustering. On the other hand, hierarchical clusters have a number of desirable features,
most importantly a more compact presentation compared to flat clusters. We consider it an
open question whether hierarchical, transparent clustering is feasible.
3.3 Relationship with Conceptual Clustering
During the work on this thesis it was pointed out to us that the difference between tradi-
tional and descriptive clustering resembles to some extent the ideas introduced earlier in
machine learning. Conceptual clustering was introduced fairly independently by Douglas
Fisher, Ryszard Michalski, Robert Stepp and Joel Martin [40, 25, 66] and implemented in
algorithms such as Cluster/2 or CobWeb.
A conceptual clustering system accepts a tabular list of objects, described using a fixed
set of attributes (events, observations, facts) and produces a classification scheme over the
domain of these attributes. Conceptual clustering algorithms are usually unsupervised and
use some notion of a quality evaluation function to discover classes with “good” descrip-
tions. Evaluation of class quality is performed by looking at summaries (descriptions) of
classes and confronting them with the training set. In other words, conceptual clustering systems measure the adequacy of classification and employ iterative search strategies for its
optimization, keeping in mind that the description of classes is an integral and important
part of an investigation [30].
A class in conceptual clustering is described with a set of attributes (concrete values,
probability distributions or other properties). For example, in Cluster/2, the algorithm cre-
ates descriptions of groups of objects based on conjunctions of simple conditions defined
on attributes of these objects. A description can look as shown below:
[height > 1290 cm] & [eye color = blue or green]
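As an illustration (a sketch under our own naming conventions, not Cluster/2's actual implementation), such a description can be treated as a conjunction of predicates tested against an object's attribute values:

```python
# Sketch: a Cluster/2-style class description as a conjunction of simple
# conditions over object attributes (illustrative names and data).
def matches(obj, description):
    """description: list of (attribute, predicate) pairs, AND-ed together."""
    return all(pred(obj[attr]) for attr, pred in description)

# [height > 1290] & [eye color = blue or green]
description = [
    ("height", lambda v: v > 1290),
    ("eye_color", lambda v: v in ("blue", "green")),
]

giant = {"height": 1500, "eye_color": "green"}
dwarf = {"height": 120, "eye_color": "blue"}
```

Membership in the class is then simply `matches(giant, description)`, which holds for the first object but not the second.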
The shared motivation of conceptual clustering and our problem of descriptive clustering is fairly clear: the class (or cluster) label is the key element driving the rest of the process. Having said that, a straightforward application of conceptual methods to text clustering seems to
be problematic — conceptual clustering is strongly related to a specific type of input data —
tabular lists of objects, each described with a set of attributes (typically nominal). Text repre-
sentation models have a different data characteristic — a great number of numeric features.
Adapting conceptual clustering algorithms to clustering text is of course possible (and has
been done in the past), but seems a bit artificial.
Summarizing, what is similar in conceptual clustering and descriptive clustering is the
motivation, the emphasis on describing the result using concepts understandable to a hu-
man. Their application domain and implementation remain quite different.
Chapter 4
Solving the Descriptive Clustering Task:
Description Comes First Approach
4.1 Introduction
Description Comes First (dcf) is our suggested solution to the problem of descriptive clus-
tering. We perceive dcf as a general method into which different algorithmic components
can be plugged; the two algorithms we present later in this document are its concrete in-
stances. In this chapter we would like to describe the common denominator — a high-level
procedure which helps in overcoming the most difficult problems of cluster labeling and in
our opinion fulfills the requirements of descriptive clustering. We can summarize the De-
scription Comes First approach by the following statement:
Description Comes First approach is a general method for constructing text
clustering algorithms suited to solving the problem of descriptive clustering.
4.2 Anatomy of Description Comes First
The dcf approach consists of several phases, illustrated in Figure 4.1, but the core idea is in
separating the selection of candidate cluster labels from cluster discovery:
• Candidate label discovery (phase 1) is responsible for collecting all phrases potentially
useful as good cluster labels (comprehensible and concise phrases).
• Cluster discovery provides a data model about document groups present in the input
data.
By splitting the process into these two phases, the most difficult element so far — creat-
ing proper cluster descriptions from a mathematical model — is avoided and replaced by a
problem of selection of appropriate cluster labels for each group of related documents found
in the input. The only purpose of cluster discovery (in phase 2) is to build a model of dom-
inant topics — major subjects the documents are about. This model is subsequently used
Figure 4.1: Key elements of the dcf approach.
to select appropriate labels from the set of candidates and is discarded afterwards. The fi-
nal document groups (clusters) are built around the selected cluster labels (called pattern
phrases) to further reduce the “semantic gap” between cluster descriptions and documents
they contain (to fulfill the transparency requirement). The process ends with pruning of
groups that did not collect enough documents and elimination of very similar cluster labels.
In the following sections we discuss the rationale behind each phase of the dcf ap-
proach, provide certain implementation clues and end with an illustrative example.
4.2.1 Phase 1: Cluster Label Candidates
As we mentioned in the introduction, previous research on text clustering shows that finding
cluster labels always encounters great difficulties. We can avoid this problem by preparing
candidate cluster labels prior to the clustering process and then picking only those for which
significant groups of documents exist.1 Because the process of cluster label selection is in-
dependent from clustering, it can fully utilize raw text input to assure comprehensibility and
conciseness described in Section 3.2.1.
An interesting side-effect of making candidate label selection a separate phase is its influence on the efficiency of the entire procedure. Note that cluster label extraction is basically
independent: it can precede clustering or run in parallel. It can be centralized or easily
distributed (each computational unit extracting candidate labels from a single document).
Moreover, a collection of cluster label candidates can be prepared a priori (an ontology) and
reused with no additional computational cost.
1It should be mentioned that this reversed order, labels→clusters instead of the traditional clusters→labels, was
first suggested on the Web site of a commercial clustering search engine, Vivisimo [D]. Obviously, no algorithmic
details had been released, so we do not know if our ideas align in any way with Vivisimo's. Proper credit is due to
Vivisimo's authors for inspiring our further work on the subject.
Implementation Ideas A set of candidate cluster labels can be prepared in several ways.
One possibility is to utilize existing dictionaries or ontologies. This scenario is interesting
because cluster labels are then given a priori, so we can assume they fulfill the requirements
and are comprehensible for end users. The entire dcf process then effectively becomes
a classification task to a set of predefined categories (implied by candidate cluster labels),
where only categories that collect enough documents are shown back to the user.
When candidate cluster labels are not given in advance, we must extract them directly
from the input documents. Several methods can be employed to do this:
• extraction of frequent phrases, much like in the stc algorithm,
• extraction of simple coherent linguistic chunks — noun phrases or other coherent
groups of words,
• full linguistic analysis to extract independent phrases or sentences.
Each one of the above methods has its advantages and disadvantages. Frequent phrase
extraction is a fast and scalable method, but may result in nonsensical candidates in the
output (non-grammatical, incomplete or common clichés). We still use them in both algo-
rithms presented later in this thesis, mostly because their extraction is so efficient, but we
are aware of the shortcomings of this solution. To defend frequent phrases and dcf a bit:
we believe (and our experiments support this belief) that an algorithm following dcf should
be able to deal with certain noise in the set of candidate cluster labels. A noisy cluster label
should not be supported by any dominant topic and should not become a pattern phrase.
Even if this happens, such a pattern phrase should not collect enough documents and be
discarded as a result. These elements are a clear improvement over plain stc for example,
which lacked such a verification step, often permitting junk frequent phrases to become
clusters. We return to this discussion in later sections.
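A minimal sketch of the frequent phrase option can be given with plain word n-gram counting (a toy stand-in for the suffix-based extraction used later in this thesis; the data and thresholds are illustrative):

```python
# Sketch: frequent-phrase candidates found by counting word n-grams
# across documents. A toy stand-in for stc-style suffix-based extraction;
# all parameter values are illustrative.
from collections import Counter

def frequent_phrases(documents, max_len=3, min_freq=2):
    counts = Counter()
    for doc in documents:
        words = doc.lower().split()
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                counts[" ".join(words[i:i + n])] += 1
    # Keep only phrases that recur often enough to be candidate labels.
    return {p for p, c in counts.items() if c >= min_freq}

docs = [
    "salsa music and dance",
    "salsa music festival",
    "tomato salsa recipe",
]
candidates = frequent_phrases(docs)
```

On this toy input the recurring phrase "salsa music" survives the frequency filter, while one-off phrases such as "music festival" are dropped; noisy candidates that do survive are expected to be filtered out in later dcf phases, as discussed above.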
To find better cluster label candidates we need to look at the methods of shallow linguistic processing introduced in Section 2.1.2. The most common way of finding coherent,
sensible groups of words (in English) is to divide the text into chunks. Chunks are the smallest (conciseness) grammatically consistent (comprehensibility) elements that the input text
can be divided into, so they offer much more in terms of our needs compared to frequent
phrases.
Statistical chunkers for English are reasonably efficient and accurate, and we use them
later in this thesis to extract noun phrases in Descriptive k-Means. Note that as with any
automatic method, chunks retrieved using tools based on statistical processing of text are
just an approximation and may still return incorrect results.
For Polish, we assumed an equivalent of an English chunk to be a group as defined
in [81]. Unlike chunks, however, groups may be unordered and distributed throughout the
sentence, so their direct use for cluster label candidates is more complex. We already mentioned our experiments with a simple heuristic automaton for detecting certain tag sequences, but this solution was too immature to be employed in this thesis. As a result, at
the moment of writing we are limited to frequent phrases (and possibly predefined ontologies).
4.2.2 Phase 2: Document Clustering
The intention of this phase is to construct a model of dominant topics2 present in the input.
Each dominant topic consists of a group of documents that are about the same, or closely
related subject. A dominant topic must also have a suitable representation which can be
used later (in the pattern phrase selection phase) to calculate similarity between each dom-
inant topic and phrases from the set of candidate cluster labels.
Note that while we refer to this phase as document clustering, any method producing
a model of dominant topics is actually sufficient. We present two different ap-
proaches to topic approximation in this thesis. In Descriptive k-Means we use a regular
clustering algorithm (k-Means) and assume each cluster’s centroid represents a single dom-
inant topic. In the Lingo algorithm, on the other hand, clustering is replaced by Singular
Value Decomposition (dimensionality reduction) of the term-document matrix. Dominant top-
ics are approximated with base vectors of one of the reduced matrices (we provide details
later).
Another element worth emphasizing is that dominant topics remain an internal artifact
in the process of dcf and never need to be shown to the user explicitly. This implies that
the model used for discovering dominant topics can be arbitrarily complex without hurt-
ing cluster label comprehensibility. This is a clear advantage with documents in Polish, for
instance. We can take into account loose syntax (use the vsm model instead of the phrase
co-occurrence model) and apply destructive text transformations to accommodate inflec-
tion (stemming, diacritic marks removal) without worrying about the problems of cluster
labeling, which are essentially resolved in the next phase of dcf.
Implementation Ideas The most obvious and natural choice of a representation model for
this phase is the Vector Space Model and we use it in combination with cosine measure in
both our algorithms. We suppose that other models of text representation could be used,
but this direction has not been explored.
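To illustrate, a dominant topic in the k-Means variant can be represented as the centroid of its documents' Vector Space Model vectors. The sketch below uses toy bag-of-words dictionaries and is our illustration only, not code from any algorithm in this thesis:

```python
# Sketch: a dominant topic as the centroid of its documents' vsm vectors
# (as in the k-Means variant mentioned above); toy data.
def centroid(vectors):
    terms = {t for v in vectors for t in v}
    return {t: sum(v.get(t, 0.0) for v in vectors) / len(vectors)
            for t in terms}

topic_docs = [
    {"jaguar": 1.0, "car": 1.0},
    {"jaguar": 1.0, "speed": 1.0},
]
topic = centroid(topic_docs)
```

The resulting vector (here dominated by the term shared by both documents) is exactly the kind of internal, never-displayed representation that pattern phrases are later matched against.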
4.2.3 Phase 3: Pattern Phrase Selection and Document Assignment
The role of this step is to pick those cluster label candidates which are most similar to the
representation of previously discovered dominant topics. We will call such labels pattern
2We will use the phrases: dominant topic, dominant concept and abstract concept interchangeably for historic
reasons.
phrases. A different way of looking at this phase is that we approximate the representation
of dominant topics, which we know is traditionally difficult to describe, with existing com-
prehensible labels.
As with any approximation, there are certain risks involved. For example, there is a risk
that no cluster label candidate will match a given topic. This is almost impossible if cluster
candidate labels have been extracted from the input documents, but is much more likely for
a predefined set of labels. While at first it may seem like a disadvantage, this stems from the
intuition of user’s anticipated behavior (Section 3.1 on page 41) — a cluster which cannot
be properly described is useless, even if the documents inside it make sense from the point
of view of the clustering method. Moreover, we rarely encountered this problem in real life
and believe the ability to hide clusters to which no sensible label can be found is actually a
strong point of the approach. The discussion of potential differences and distortions from
an ideal clustering is continued in Section 4.3 on page 53.
Once pattern phrases have been identified, the representation of dominant topics is dis-
carded and pattern phrases replace them as seeds of final document groups. This is a conse-
quence of the transparency requirement — we want a clear relationship between a cluster’s
label and its content. Pattern phrases will become cluster descriptions, so we must use them
directly to find the documents matching the topic they represent.
Note that the document allocation phase fulfills the requirements concerning group over-
lap and partial clustering defined in Section 3.2.2. Documents are assigned to each pattern
phrase independently, so they may belong to more than one group. Documents not rel-
evant to any pattern phrase at all may also exist, obviously, forming a synthetic group of
non-clustered documents.
The last step (pruning) is meant to remove any pattern phrases which failed to collect
enough documents. The following scenarios are possible:
• No documents are assigned to the pattern phrase. A rare, but possible, case when the
pattern phrase was similar to the dominant topic’s model, but does not associate any
documents. Consider the following example: the representation of a dominant topic
contains keywords lemony and snicket. A candidate cluster label Lemony Snicket3 will
be selected, but, unfortunately, no document contains this exact phrase. The group is
(correctly) discarded.
• Very few documents are assigned to the pattern phrase. This may indicate that the
pattern phrase encompasses just a part of the original topic or the dominant topic was
a combination of more than one subject. We can either discard the pattern phrase
or use it for merging with other small groups with overlapping documents (but this,
as we know from the stc algorithm, may lead to problems and is generally against
the transparency as we defined it). The threshold at which we consider the pattern
phrase and its documents irrelevant is a tuning parameter of a concrete algorithm
implementing dcf.
3Lemony Snicket is a pseudonym of Daniel Handler, an American novelist and the author of a series of darkly
comic children’s books known as A Series of Unfortunate Events.
• A significant number of documents is assigned to the pattern phrase. In this case the
pattern phrase and its associated documents become part of the final result, that is
become a cluster described with a pattern phrase and containing the documents allo-
cated to it.
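The pruning rule in the scenarios above can be sketched as a minimum-size filter; the threshold and the toy clusters below are illustrative, and the actual threshold is a tuning parameter of a concrete dcf implementation:

```python
# Sketch: prune pattern phrases that failed to collect enough documents.
# min_size and the example data are illustrative values.
def prune(clusters, min_size=2):
    return {label: docs for label, docs in clusters.items()
            if len(docs) >= min_size}

clusters = {
    "lemony snicket": [],            # matched the topic model, no documents
    "unfortunate events": ["d1"],    # too few documents
    "book series": ["d1", "d2", "d3"],
}
kept = prune(clusters)
```

Only the last group survives; the first two scenarios from the list above are handled by the same filter.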
Implementation Ideas There are two elements of difficulty: pattern phrase selection and
document allocation.
To select pattern phrases we must look for candidate cluster labels similar (or "close") to
the discovered dominant topics. Assuming both cluster label candidates and dominant topics are expressed in the same model (in the same vector space, for example), a simple calculation of similarity between them should suffice.
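A sketch of this selection, assuming both labels and topics live in the same toy vector space (the similarity threshold, data and names are our assumptions):

```python
# Sketch: pattern phrase selection as a nearest-label search in a shared
# vector space; all vectors and the threshold are illustrative.
import math

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def select_pattern_phrase(topic, labels, min_sim=0.3):
    """Return the candidate label closest to the topic vector, or None
    if no label is similar enough (the topic is then dropped)."""
    best = max(labels, key=lambda l: cosine(labels[l], topic))
    return best if cosine(labels[best], topic) >= min_sim else None

topic = {"apache": 0.9, "server": 0.8, "http": 0.4}
labels = {
    "apache http server": {"apache": 1.0, "http": 1.0, "server": 1.0},
    "apache indians": {"apache": 1.0, "indians": 1.0},
}
phrase = select_pattern_phrase(topic, labels)
```

Returning `None` when no label is close enough corresponds to hiding a topic that cannot be sensibly described, as discussed above.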
Document allocation is more tricky since we want to have a clear relationship between
a pattern phrase and the documents allocated to it. We have already discussed several “rule
of thumb” heuristics that could be used for this task when we talked about the transparency
requirement (on page 43). Let us recall them now:
• allocate all documents containing an exact copy of the pattern phrase (strict rule),
• allocate all documents containing a possibly distorted copy of the pattern phrase (re-
ordered words, foreign words injected inside); the user should be able to control the
allowed level of distortion,
• allocate all documents containing the phrase and any synonymous phrases that could
be related to it, but offer the user a possibility of expanding the cluster label to explain
which phrases contributed to the allocated documents.
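The first two heuristics can be sketched as follows; the gap limit and the token-level matching scheme are our illustrative assumptions (a real implementation would work on a token index rather than raw strings):

```python
# Sketch: strict exact-phrase allocation and a relaxed variant tolerating
# a limited number of injected foreign words (illustrative only).
def strict_match(doc, phrase):
    return phrase in doc

def distorted_match(doc, phrase, max_gap=2):
    """Phrase words must occur in order, with at most max_gap foreign
    words injected between consecutive phrase words."""
    words, pattern = doc.split(), phrase.split()
    for start, w in enumerate(words):
        if w != pattern[0]:
            continue
        pos, ok = start + 1, True
        for target in pattern[1:]:
            gap = 0
            while pos < len(words) and words[pos] != target and gap < max_gap:
                pos += 1
                gap += 1
            if pos < len(words) and words[pos] == target:
                pos += 1
            else:
                ok = False
                break
        if ok:
            return True
    return False
```

Here `max_gap` plays the role of the user-controlled distortion level mentioned in the second rule: the strict rule rejects "hillary and bill rodham" for the phrase "hillary rodham", while the relaxed rule accepts it.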
Implementation of pattern phrase selection and document allocation in practice may
become tricky, especially with large problem instances when efficiency of processing be-
comes critical. We show two different implementations of these elements in Lingo and De-
scriptive k-Means — the algorithms presented later in this thesis. Lingo uses a relatively
simple vsm-based retrieval model which results in several problems in document allocation
phase. Learning upon this experience, we improved the document allocation procedure in
dkm to scale to large problem instances and accommodate different document allocation
heuristics we mentioned above.
4.2.4 An Illustrative Example
This section demonstrates the dcf approach on a simple two-dimensional Vector Space
Model example. All referenced illustrations are collected in Figure 4.2 on page 54.
Let us narrow the “language” of documents in our example to only two terms: X and Y .
We can represent all input documents as points in a two-dimensional vector space, where
the horizontal axis represents the weight (importance) of term X and the vertical axis rep-
resents the weight (importance) of term Y . For estimating similarity between documents
represented in the term vector space we will use the cosine measure — the angle between
vectors starting in point (0,0) and ending at each respective document vector's location. In
all subsequent figures we represent objects in the term vector space as small circles cast to
a unit sphere (the angle obviously does not change). A dcf approach to finding clusters in
the input documents would proceed as follows.
In the first step we collect cluster candidate labels. In our example these candidate labels
are already represented as red circles in the term vector space (see Figure 4.2(a)). Angle
vectors are also shown for clarity.
In a concurrent step we parse input documents and represent them in the same vector
space model. Each document is depicted as a faded blue circle with a direction vector going
through it (see Figure 4.2(b)).
We proceed to the second phase and detect dominant topics in the input documents. In
our example we look for groups of documents with a similar angle and note that two clear
groups exist (see Figure 4.2(c)). An average angle of the content of the group is the centroid
vector — the dominant topic’s representation in our model, depicted with larger blue arrows.
In the third phase we select pattern phrases for the dominant topics. Since our example
uses the same model for representing documents and labels, we can simply put everything
in the same space (see Figure 4.2(d)). Selecting pattern phrases is about choosing candidate
cluster labels “close to” topic vectors. Technically, we look for any vectors representing la-
bel candidates that lie within an infinite hypercone around the topic vector’s axis. In our
example’s two-dimensional space, the hypercone is simply an angle around a topic vector
(see Figure 4.2(e)) — we select three pattern phrases for the next phase. Note that the cone’s
opening angle is a tuning parameter of the algorithm.
Finally, we assign documents to pattern phrases in a way similar to the one previously
used to find pattern phrases. For each pattern phrase we look for documents within their
pattern hypercones and assign these documents to the pattern phrase. As shown in Fig-
ure 4.2(f), documents can be assigned to more than one pattern phrase. Each final cluster
contains documents similar to the selected pattern phrase (transparency requirement); the
original dominant topic's vector is no longer directly taken into account.
Note that we use a simple cosine similarity model in the last phase of this example to
convey the idea of how dcf works. In a real algorithm the document assignment phase
would have to take into consideration word order and proximity of terms in the pattern
phrase, something the cosine measure omits entirely.
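The hypercone test from the example can be sketched directly in two dimensions; all numbers below (topic direction, label vectors, the 15-degree half-opening) are illustrative choices of ours, not values from any algorithm in this thesis:

```python
# Sketch of the two-dimensional example: vectors are (X, Y) term weights,
# and a label becomes a pattern phrase when its angle to a topic vector
# lies within the cone's half-opening angle (illustrative numbers).
import math

def angle(v):
    return math.atan2(v[1], v[0])

def within_cone(v, topic, half_opening):
    return abs(angle(v) - angle(topic)) <= half_opening

topic = (1.0, 1.0)                      # dominant topic at 45 degrees
labels = {"a": (1.0, 0.9), "b": (0.1, 1.0), "c": (1.0, 0.0)}
cone = math.radians(15)                 # tuning parameter of the algorithm
patterns = [name for name, v in labels.items()
            if within_cone(v, topic, cone)]
```

Only the label nearly parallel to the topic vector falls inside the cone; widening the opening angle admits more candidates, which is exactly the tuning trade-off mentioned above.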
4.3 Discussion of Clustering Quality
Description Comes First approach avoids the most difficult problems of labeling clusters,
but of course with trade-offs someplace else. There are two potential places where clustering
quality may degrade (compared to an “ideal” clustering):
• when pattern phrases are selected, they are an approximation of dominant topics con-
structed in phase 2,
(a) Candidate labels and their “position” in the
model.
(b) Documents and their “position” in the model.
(c) Concept vectors discovered in the documents. (d) Preparation for merging — cluster label
candidates and topics are represented in the same
model.
(e) Candidate labels close to topics (within cones)
become pattern phrases.
(f) Documents matching pattern phrases (within
cones) are assigned to their groups.
Figure 4.2: Example showing the dcf approach applied to documents and labels in a two-
dimensional term vector space.
• when documents are assigned to selected pattern phrases we use them as the refer-
ence point instead of the original representation of dominant topics.
The first issue seems to be more important as it means we are “cheating” the user a bit
by replacing the original dominant topics with groups of documents created around perfect
candidate labels. We believe this element is a virtue rather than a vice. Dominant topics
represent ideal groups expressed in a model used for clustering, but this model is obscure
and incomprehensible to the user. A “semantic gap” between the cluster’s representation
and its perception by a human is inevitable and introduces just as much confusion. In our
opinion the approximation of dominant topics using pattern phrases is not about “cheating”,
but rather choosing the closest comprehensible image of dominant topics that the user can
fully understand.
The second problem — documents assigned to pattern phrases instead of the original
dominant topic — is a straightforward consequence of the transparency requirement. We
again attempt to minimize the semantic gap, this time between the pattern phrases and
documents inside them. If we used the dominant topics to allocate documents to pattern
phrases, the cluster labels would remain comprehensible, but their content would be rele-
vant to something the user never sees explicitly.
The document assignment step must ensure that there really are documents that match
selected pattern phrases and that the link between cluster labels and documents inside them
is clear. This step is also a safety valve of the entire dcf procedure: even if incorrect (unre-
lated) cluster labels are selected to be pattern phrases they are not likely to collect enough
documents to form final clusters.
4.4 Summary
We believe the strengths of the dcf approach lie in the following properties:
• Candidate phrases can be extracted from the input text automatically based on fre-
quency or other statistics, just as in the stc algorithm. Alternatively, candidate phrases
can come from a completely different source or even a predefined ontology (to guar-
antee they are comprehensible).
• Cluster discovery and candidate phrase extraction are independent and can be easily
parallelized. In fact, the extraction can be done incrementally as the documents are
added to the system.
• Dominant topic detection can use an arbitrarily complex model of text representation
and cluster analysis without making the cluster labeling procedure any more difficult.
• If the method used for detecting dominant topics returns an unclear representation of
a topic then it is less likely to find a matching label in the set of label candidates. Even
if a matching label is found, the document assignment phase provides a second-level
pruning. We thus ensure that groups of documents in the output are really relevant
and well described.
• The final assignment of documents to pattern phrases fulfills the transparency re-
quirements we established for the descriptive clustering problem. All documents in
a final cluster must contain a phrase from its label (possibly distorted), so the rela-
tionship between the cluster and its label should be clear to the user.
Chapter 5
The Lingo Algorithm
The motivation for creating Lingo was to come up with an algorithm for clustering search
results capable of discovering diverse groups of documents and at the same time keeping
cluster labels sensible. The work on Lingo must be credited to Stanisław Osinski, who worked on
the algorithm under supervision of Jerzy Stefanowski [68] and later contributed a great deal
of effort to the Carrot2 framework. The author of this thesis worked with Stanisław on a
number of co-authored papers [70, 68] and this fruitful cooperation gradually resulted in a
conceptual basis for defining descriptive clustering and the dcf approach.
The aim of this section is to show how Lingo fits in the general scheme introduced by
the dcf. The algorithm is an example of dcf’s application to the domain of search results
clustering and several elements of its implementation are designed specifically to deal with
this type of input data.
5.1 Application Domain
Clustering search results differs significantly from other types of document clustering. Each
matching result (hit) in a list of results returned by a search engine contains a resource loca-
tor (url), an optional title, and a short fragment of text called a snippet, which is optional as
well. Modern search engines assemble snippets individually for each query by scanning the
body of a document and looking for short spans of text that contain as much of the query
as possible. Two or three best matching spans are joined and returned as a short block of
text providing insight into the original document for the user. This technique of generating
snippets is called kwic — keyword in context. Figure 5.1 shows a typical snippet.
In the remaining part of this chapter we will use the term document to refer to a single
hit, even though the entire document is obviously not returned with the search result.
The usual number of hits returned by a search engine is anywhere between a few dozen and
a few hundred entries, so the input is relatively small. Moreover, it is not likely to grow sub-
stantially larger because search engines limit the number of hits to a few thousand (Google,
Yahoo and others).
Figure 5.1: A typical “hit” returned by a search engine: document title on top, snippet with
query terms in the middle and an information line (with the document’s address) on the
bottom.
Figure 5.2: Generic elements of dcf and their counterparts in Lingo. svd decomposition
takes place inside the cluster label induction phase; it is extracted here for clarity.
Conclusions are twofold: on one hand, a search results clustering algorithm must work with
incomplete, fragmented data (an extreme example is a document with an empty snippet and
an empty title). On the other, the scalability of the algorithm is not that important as long
as it is unnoticeably fast (for a human user) on typical input data sizes.
5.2 Overview of the Algorithm
Lingo processes the input in four phases: snippets preprocessing, frequent phrase extrac-
tion, cluster label induction and content allocation. The parallels to the generic scheme
introduced in the dcf are illustrated in Figure 5.2. Algorithm 5.1 on the next page contains
full pseudocode of the algorithm; we discuss the details of each step in the sections below.
5.2.1 Input Preprocessing
In the preprocessing phase the input documents (titles and snippets) are tokenized and split
into terms. Lingo is implemented as a component embedded in the Carrot2 framework
1: D ← input documents (or snippets)
/* Preprocessing */
2: for all d ∈D do
3: perform text segmentation of d ; /* Segmentation, stemming. */
4: if language of d recognized then
5: apply stemming and mark stop-words in d ;
6: end if
7: end for
/* Frequent Phrase Extraction */
8: concatenate all documents;
9: Pc ← discover complete phrases;
10: Pf ← {p : p ∈ Pc ∧ frequency(p) > Term Frequency Threshold};
/* Cluster Label Induction */
11: A ← term-document matrix of terms not marked as stop-words and
with frequency higher than the Term Frequency Threshold;
12: S,U ,V ← SVD(A); /* Product of SVD decomposition of A */
13: k ← 0; /* Start with zero clusters */
14: n ← rank(A);
15: repeat
16: k ← k +1;
17: q ← ‖Sk‖F / ‖S‖F ;
18: until q < Candidate Label Threshold;
19: P ← phrase matrix for Pf ;
20: for all columns of Uk^T P do
21: find the largest component mi in the column;
22: add the corresponding phrase to the Cluster Label Candidates set;
23: labelScore← mi ;
24: end for
25: calculate cosine similarities between all pairs of candidate labels;
26: identify groups of labels that exceed the Label Similarity Threshold;
27: for all groups of similar labels do
28: select one label with the highest score; /* cluster description */
29: end for
/* Cluster Content Discovery */
30: for all L ∈ Cluster Label Candidates do
31: create cluster C described with L;
32: add to C all documents whose similarity
to C exceeds the Snippet Assignment Threshold;
33: end for
34: put all unassigned documents in the “Others” group;
/* Final Cluster Formation */
35: for all clusters do
36: clusterScore← labelScore×‖C‖;
37: end for
38: Sort final clusters.
Algorithm 5.1: Pseudo-code of the Lingo algorithm.
and uses its infrastructure to perform certain text preprocessing tasks — stemming, mark-
ing stop words and simple text segmentation heuristics. After tokenization is complete, a
term-document matrix is constructed out of the terms that exceed a predefined term fre-
quency threshold. After that, document vectors are weighted using the tf-idf formula [82].
Terms present in document titles are additionally boosted compared to those appearing in
snippets by a predefined constant because titles are more likely to contain sensible (human-
edited) information.
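The weighting step can be sketched as follows; the logarithmic idf form and the boost constant below are our assumptions for illustration, not the exact values used in Lingo or Carrot2:

```python
# Sketch: tf-idf weighting with an extra boost for title terms.
# The ln-based idf and the boost constant are illustrative assumptions.
import math

def tfidf(term_freq, docs_with_term, n_docs, in_title=False, title_boost=2.5):
    idf = math.log(n_docs / docs_with_term)   # rarer terms weigh more
    w = term_freq * idf
    return w * title_boost if in_title else w

w_snippet = tfidf(3, 10, 100)
w_title = tfidf(3, 10, 100, in_title=True)
```

A term occurring in a title simply receives a constant multiple of its snippet weight, reflecting the assumption that titles carry more carefully edited information.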
5.2.2 Frequent Phrase Extraction
The aim of this step is to discover a set of cluster label candidates — phrases (but also sin-
gle terms) that can potentially become cluster labels later. Lingo extracts frequent phrases
using a modification of an algorithm presented in the shoc algorithm [19]. A word-based
suffix array is constructed and extended with an auxiliary data structure — the lcp (Longest
Common Prefix). This allows the algorithm to identify all frequent complete phrases in O(n)
time, n being the total length of all input snippets.
The frequent phrase extraction algorithm ensures that the discovered labels fulfill the
following conditions:
• appear in the input at least a given number of times (a tuning threshold of the algorithm);
• not cross sentence boundaries; sentence markers indicate a topical shift, therefore a
phrase extending beyond one sentence is unlikely to be meaningful;
• be a complete frequent phrase (the longest possible phrase that is still frequent); com-
pared to partial phrases, complete phrases should allow clearer description of clusters
(compare: “Hillary Rodham” and “Senator Hillary Rodham Clinton”);
• neither begin nor end with a stop word; stop words that appear in the middle of a
phrase should not be discarded.
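The frequency and stop-word conditions above can be expressed as a simple predicate over a candidate phrase (a sketch; the stop-word list and threshold are placeholder values, and completeness, being a longest frequent phrase, is a property of the suffix-array extraction itself rather than a per-phrase test):

```python
STOP_WORDS = {"the", "of", "for", "to", "a", "in"}   # placeholder stop list

def is_label_candidate(phrase, frequency, min_frequency=2):
    """Check the frequency and stop-word conditions for a phrase.

    `phrase` is a token list that does not cross a sentence boundary
    (sentence splitting is assumed to have happened earlier).
    """
    if not phrase or frequency < min_frequency:
        return False
    # Must neither begin nor end with a stop word; inner stop words stay.
    return phrase[0] not in STOP_WORDS and phrase[-1] not in STOP_WORDS
```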
5.2.3 Cluster Label Induction
During the cluster label induction phase, Lingo identifies the abstract concepts (or dominant topics, in the terminology used in dcf) that best describe the input collection of snippets. This proceeds in two steps: abstract concept discovery, followed by phrase matching and label pruning.
In abstract concept discovery, singular value decomposition (svd) is applied to the term-document matrix A, breaking it into three matrices U, S and V in such a way that A = U S V^T. An interesting property of svd is that the first r columns of matrix U, r being the rank of A, form an orthogonal basis for the term space of the input matrix A [29]. It is
commonly believed that base vectors of the decomposed term-document matrix represent
an approximation of “topics” — collections of terms connected with an obscure net of latent
relationships. Although this fact is difficult to prove, singular value decomposition is widely used
in text processing, for example in Latent Semantic Indexing (lsi). From Lingo’s point of view,
basis vectors (column vectors of matrix U ) contain exactly what it has set out to find — a
vector representation of the abstract concepts.
The most significant k base vectors of matrix U are determined by comparing the Frobenius norms of the term-document matrix A and of its k-rank approximation A_k. Let threshold q be a percentage-expressed value that
determines to what extent the k-rank approximation should retain the original information
in matrix A. We hence define k as the minimum value that satisfies the following condition:
‖A_k‖_F / ‖A‖_F ≥ q,
where the symbol ‖X‖_F denotes the Frobenius norm of matrix X. Clearly, the larger the
value of q the more cluster candidates will be induced. The choice of the optimal value for
this parameter ultimately depends on the preferences of users, so we make it one of Lingo’s
control thresholds — Candidate Label Threshold.
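Since the squared Frobenius norm of A_k equals the sum of the k largest squared singular values of A, the smallest k satisfying the condition can be read directly off the singular values produced by the svd. A minimal sketch (the function and parameter names are ours):

```python
import math

def estimate_k(singular_values, q):
    """Smallest k such that ||A_k||_F / ||A||_F >= q.

    singular_values -- non-increasing singular values of A
    q               -- the Candidate Label Threshold, 0 < q <= 1
    """
    total = math.sqrt(sum(s * s for s in singular_values))
    partial = 0.0
    for k, s in enumerate(singular_values, start=1):
        partial += s * s
        if math.sqrt(partial) / total >= q:
            return k
    return len(singular_values)
```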
The phrase matching and label pruning step, where group descriptions are discovered, relies
on an important observation that both abstract concepts and frequent phrases are expressed
in the same vector space — the column space of the original term-document matrix A. This
enables us to use the cosine distance to calculate how “close” a phrase or a single term is to
an abstract concept. Let us denote by P a matrix of size t × (p + t), where t is the number
of frequent terms and p is the number of frequent phrases. P can be easily built by treating
phrases as pseudo-documents and using one of the term weighting schemes.
Having the P matrix and the i-th column vector of the svd's U matrix, a vector m_i of cosines of the angles between the i-th abstract concept vector and the phrase vectors can be calculated as:
m_i = U_i^T P.
The phrase that corresponds to the maximum component of the m_i vector is selected as the human-readable description of the i-th abstract concept. Additionally, the value of the cosine (similarity) becomes the score of the cluster label candidate.
The same process can be extended from a single abstract concept to the entire U_k matrix: a single matrix multiplication M = U_k^T P yields the result for all pairs of abstract concepts and frequent phrases.
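With unit-length columns in both U_k and P, the whole matching step reduces to one matrix product followed by a per-row maximum. A plain-Python sketch with toy data:

```python
def match_labels(concept_vectors, phrase_vectors, labels):
    """For each abstract concept pick the best-matching label.

    concept_vectors -- columns of U_k (unit length)
    phrase_vectors  -- columns of P (unit length)
    labels          -- readable text of each column of P
    Returns a (label, score) pair per concept; the score is the cosine
    between the concept vector and the chosen phrase vector.
    """
    result = []
    for u in concept_vectors:
        scores = [sum(a * b for a, b in zip(u, p)) for p in phrase_vectors]
        best = max(range(len(scores)), key=scores.__getitem__)
        result.append((labels[best], scores[best]))
    return result
```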
The final step of label induction is to prune overlapping labels. Let V be a vector of cluster label candidates and their scores. We create another term-document matrix Z, where cluster label candidates serve as documents. After column length normalization we calculate Z^T Z, which yields a matrix of similarities between cluster labels. For each row we then pick the columns that exceed the Label Similarity Threshold and discard all but the one cluster label candidate with the maximum score, which becomes the description of a future cluster.
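The pruning step can be sketched as follows; the greedy grouping below is one possible reading of the procedure, and the label vectors stand for the length-normalized columns of Z:

```python
def prune_labels(vectors, labels, scores, threshold):
    """Discard near-duplicate labels, keeping the best-scoring one.

    vectors   -- unit-length term vectors of the candidate labels
    threshold -- the Label Similarity Threshold
    """
    discarded = set()
    for i in range(len(labels)):
        if i in discarded:
            continue
        # Group label i with every later, still-live label too similar to it.
        group = [i] + [
            j for j in range(i + 1, len(labels))
            if j not in discarded
            and sum(a * b for a, b in zip(vectors[i], vectors[j])) > threshold
        ]
        best = max(group, key=lambda j: scores[j])
        discarded.update(j for j in group if j != best)
    return [labels[i] for i in range(len(labels)) if i not in discarded]
```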
5.2.4 Cluster Content Allocation
The process of cluster content allocation very much resembles document retrieval based on
the plain vsm model. The only difference is that instead of one query, the input snippets are matched against a series of queries, each of which is a single cluster label. Thus, if for a
certain query-label, the similarity between a document and the label exceeds a predefined
threshold, it will be allocated to the corresponding cluster. Note that from the point of view of dcf, the traditional Vector Space Model used for comparisons is not ideal: the label's word order and term proximity are not taken into account.
Let us define matrix Q , in which each cluster label is represented as a column vector.
Let C = Q^T A, where A is the original term-document matrix for input documents. This way, element c_ij of matrix C indicates the strength of membership of the j-th document in the i-th cluster. A document is added to a cluster if c_ij exceeds the Snippet Assignment
Threshold, yet another control parameter of the algorithm. Documents not assigned to any
cluster end up in an artificial cluster called “Other documents”.
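A sketch of the allocation step in plain Python (document and label vectors correspond to the columns of A and Q; the threshold value used in the example below is illustrative):

```python
def allocate(doc_vectors, label_vectors, threshold):
    """Assign documents to clusters via c_ij = q_i . a_j.

    Returns one list of document indices per label, plus the indices
    of documents that fall into the "Other documents" group.
    """
    clusters = [[] for _ in label_vectors]
    others = []
    for j, doc in enumerate(doc_vectors):
        assigned = False
        for i, label in enumerate(label_vectors):
            if sum(a * b for a, b in zip(label, doc)) > threshold:
                clusters[i].append(j)   # overlapping assignment is allowed
                assigned = True
        if not assigned:
            others.append(j)
    return clusters, others
```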
5.2.5 Final Cluster Formation
Finally, clusters are sorted for display based on their score, calculated using the following
formula:
C_score = labelScore × ‖C‖,
where ‖C‖ is the number of documents assigned to cluster C . The scoring function, al-
though simple, prefers well-described and relatively large groups over smaller ones.
5.3 An Illustrative Example
Let the input collection contain d = 7 documents. We omit the preprocessing
stage and assume t = 5 terms and p = 2 phrases are given (these appear more than once and
thus will be treated as frequent). The input is shown in Figure 5.3.
The t = 5 terms
T1: Information
T2: Singular
T3: Value
T4: Computations
T5: Retrieval
The p = 2 phrases
P1: Singular Value
P2: Information Retrieval
The d = 7 documents
D1: Large Scale Singular Value Computations
D2: Software for the Sparse Singular Value Decomposition
D3: Introduction to Modern Information Retrieval
D4: Linear Algebra for Intelligent Information Retrieval
D5: Matrix Computations
D6: Singular Value Analysis of Cryptograms
D7: Automatic Information Organization
Figure 5.3: Input documents, frequent terms and phrases.
We now preprocess the input term-document matrix: tf-idf weighting and normalization result in matrix Atfidf, and svd decomposition of that matrix yields matrix U containing the abstract concepts.
Atfidf =
| 0     0     0.56  0.56  0     0     1    |
| 0.49  0.71  0     0     0     0.71  0    |
| 0.49  0.71  0     0     0     0.71  0    |
| 0.72  0     0     0     1     0     0    |
| 0     0     0.83  0.83  0     0     0    |

U =
| 0     0.75  0    -0.66  0    |
| 0.65  0    -0.28  0    -0.71 |
| 0.65  0    -0.28  0     0.71 |
| 0.39  0     0.92  0     0    |
| 0     0.66  0     0.75  0    |
Now we look for the value of k — the estimated number of clusters. Let us define quality
threshold q = 0.9. Then the process of estimating k is as follows:
k = 0 → q = 0.62, k = 1 → q = 0.856, k = 2 → q = 0.959
and the number of expected clusters is k = 2.
To find relevant descriptions of our clusters (k = 2 columns of matrix U ), we calculate
similarity between candidate phrases and concept vectors as matrix M = U_k^T P, where P is
a synthetic term-document matrix created out of our frequent phrases and terms (values in
matrix P are again weighted using tf-idf and normalized):
P =
| 0     0.56  1  0  0  0  0 |
| 0.71  0     0  1  0  0  0 |
| 0.71  0     0  0  1  0  0 |
| 0     0     0  0  0  1  0 |
| 0     0.83  0  0  0  0  1 |

M =
| 0.92  0     0     0.65  0.65  0.39  0    |
| 0     0.97  0.75  0     0     0     0.66 |
Rows of matrix M represent clusters, columns — their descriptions. For each row we select
the column with maximum value. The two selected labels are: Singular Value (score: 0.92)
and Information Retrieval (score: 0.97). We skip label pruning as it is not necessary in this
example. Finally, documents are allocated to clusters by applying matrix Q , created out of
cluster labels, back to the original matrix Atf-idf. The final result is shown below. Note the
fifth column in matrix C , representing unassigned document D5.
Q =
| 0     0.56 |
| 0.71  0    |
| 0.71  0    |
| 0     0    |
| 0     0.83 |

C =
| 0.69  1  0  0  0  1  0    |
| 0     0  1  1  0  0  0.56 |
Information Retrieval [score: 1.0]
D3: Introduction to Modern Information Retrieval
D4: Linear Algebra for Intelligent Information Retrieval
D7: Automatic Information Organization
Singular Value [score: 0.95]
D2: Software for the Sparse Singular Value Decomposition
D6: Singular Value Analysis of Cryptograms
D1: Large Scale Singular Value Computations
Other: [unassigned]
D5: Matrix Computations
5.4 Computational Complexity
The time complexity of Lingo is quite high, bound mostly by the cost of the term-document matrix decomposition (recall that a suffix array can be built in time linear with respect to the input size). To the best of our knowledge, svd decomposition can be performed in the order of O(m^2 n + n^3) for an m × n matrix [29]. Moreover, memory requirements are
demanding because of all the matrix transformations.
Note, however, that Lingo has been designed for a very specific application — search
results clustering — and in this setting scalability to large data sets is of no practical importance (the information is more often limited than abundant). In the next chapter we will discuss another algorithm that scales well to large numbers of documents and also implements the dcf approach.
5.5 Summary
Strong Points
• Lingo was the first algorithm implementing cluster description search prior to actual
document allocation.
• Lingo handles fragmented, incomplete input. It discovers a diverse structure of topics
using dimensionality reduction applied to the term-document matrix and subsequent
label search with base vectors of the reduced space.
• Lingo has few tuning parameters that fit well in its application domain. The number
of clusters is determined by taking advantage of a side-product of the singular matrix
decomposition — the accuracy of approximation of the original term vector space.
Weak Points
• The document assignment step breaks the transparency requirement of descriptive
clustering: documents containing subphrases or even isolated words from the cluster
label can become part of that label’s cluster.
• Troublesome scalability to larger problem instances.
• Candidate label discovery is bound to frequent ordered sequences of words in the in-
put. Even though we could try to use an external set of labels for cluster label induc-
tion, this possibility has not been exercised so far.
Fulfillment of Requirements
• Comprehensibility and Conciseness — Lingo extracts candidate cluster labels from a
set of frequent phrases. The danger of selecting frequent, but meaningless labels (as
in stc) exists, but is not an annoyance in practice, for two reasons. First, we select only a subset of all frequent phrases that correspond to dominant topics present
in the input and detected using matrix decomposition techniques. Second, the input
to Lingo is very specific — it is short and contextual with regard to the query (snippets)
and this context is quite likely to contain recurring phrases that denote meanings synonymous with the query, which helps the algorithm select candidate phrases.
• Transparency — Transparency in Lingo suffers from the generic vsm document alloca-
tion procedure, which allows documents that contain only subphrases of the original
cluster label to be added to the group. This often yields unintuitive results.
• Clusters Structure — Cluster diversity is ensured by the use of singular matrix decomposition: the base vectors representing dominant topics are orthogonal, and orthogonal vectors are commonly believed to correspond to different topics. Internal consistency is sometimes broken as a result of the document allocation procedure. The algorithm is able to produce overlapping clusters.
Chapter 6
Descriptive k-Means Algorithm
Our initial experiments with dcf concerned clustering search results and, as we already
mentioned, this is a very specific application domain. A few challenging questions arose:
• Is it possible to create an algorithm implementing dcf that scales well to large num-
bers (tens of thousands) of documents?
• Is it possible to adapt a well-known text clustering algorithm to the dcf approach so that it at least retains the original clustering quality while improving the comprehensibility of cluster labels? What will be the difference in clustering quality between the derived algorithm and the original?
Descriptive k-Means (dkm) is an attempt to provide answers to the above questions. It combines cluster label discovery (we experiment with two techniques: frequent phrase extraction and noun phrase extraction) with the very well known numerical clustering algorithm k-Means, resulting in a novel algorithm that follows the dcf approach.
We are aware that k-Means is very often criticized: it produces spherical clusters (with
respect to the distance metric used), it requires the number of clusters to be given in advance, and it always assigns each object to its closest cluster centroid, regardless of their
actual resemblance. However, we still chose to extend k-Means for a few important reasons.
First of all, we wanted to have a scalable, very fast baseline algorithm. Running times of
k-Means are practically linear with the size of input (the algorithm is interrupted when it is
close enough to convergence) and the procedure scales to very large data sets [16, 50].
Second, encouraged by our very good experience with the diversity of topics detected using svd decomposition, we looked for a similar method that would handle large input data.
Interestingly, cluster centroid vectors created by k-Means are reported to be a close approxi-
mation of singular value decomposition’s base vectors [43, 17]. This made us believe that we
could use k-Means as an efficient and scalable algorithm consistent with the behavior once
observed in the Lingo algorithm.
Finally, k-Means is a widely recognized and very often used numerical clustering algo-
rithm — many researchers use it as a benchmark result for their own achievements, so it
is relatively easy to cross-compare results with others. By choosing k-Means, an algorithm
with a notorious reputation when applied to text clustering, we hoped to demonstrate that
adapting it to the dcf approach can help improve, or at least retain, the clustering quality
and yield more comprehensible cluster labels, consistent with the requirements defined in
descriptive clustering.
6.1 Application Domain
The envisioned application domain for Descriptive k-Means consists of short and medium-length documents such as news stories, Web pages, e-mails and other documents not exceeding
a few pages of text. We expect the input to be real text (not completely random, noisy doc-
uments and not fragments like snippets) written in one language (left to right word order,
available word segmentation heuristic). In this thesis we consider texts written in English
and Polish. The designed algorithm must be able to handle thousands of input documents
for off-line clustering and return results in a reasonable time.
6.2 Overview of the Algorithm
Descriptive k-Means closely follows the dcf approach. The cluster label discovery phase is
implemented in two alternative variants: using frequent phrase extraction and with shallow
linguistic processing for English texts (extraction of noun phrase chunks). Dominant topic
discovery is performed by running a variant of the k-Means algorithm on a sample of input documents. We experimented with various types of features and weighting schemes for document representation and found that, except for pointwise mutual information, which is known to cause problems, all of them gave similar results.
In the pattern phrase selection phase the algorithm uses the Vector Space Model to calculate
similarities between cluster label candidates and dominant topics (represented by cluster
centroids). The document assignment phase uses a mix of vsm and a Boolean model implemented on top of a search engine (and utilizing its data structures) to ensure processing efficiency. Unlike in Lingo, document assignment searches for documents that contain a pattern phrase, but allows certain distortions such as minor word reordering and different words injected inside the phrase. The level of pattern phrase distortion is adjustable and is a parameter of the algorithm.
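For illustration, such tolerant matching could be approximated by a sliding window that accepts the pattern words in any order together with a bounded number of injected words; the window-slack parameterization is our assumption, as dkm itself realizes this with Boolean and proximity queries over the index.

```python
def phrase_matches(pattern, doc_tokens, max_slack=2):
    """True if all words of `pattern` occur, in any order, within some
    window of len(pattern) + max_slack consecutive document tokens."""
    window = len(pattern) + max_slack
    needed = set(pattern)
    for start in range(max(1, len(doc_tokens) - window + 1)):
        if needed <= set(doc_tokens[start:start + window]):
            return True
    return False
```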
The correspondence between the dcf approach and Descriptive k-Means is depicted graphically in Figure 6.1. Algorithm 6.1 on page 68 contains the full pseudocode of the algorithm, and we discuss each major step in the sections below.
6.2.1 Preprocessing
In the preprocessing step we initialize two important data structures: an index of documents
and an index of cluster candidate labels.
An index is a fundamental structure in information retrieval. Each entry added to an index (document or candidate cluster label in our case) is accompanied by a vector of terms
6.2. Overview of the Algorithm 67
Figure 6.1: Generic elements of dcf and their counterparts in Descriptive k-Means.
and their counts appearing in that entry. The index also maintains an associated list con-
taining all unique terms and pointers to entries a given term occurred in (inverted index).
The index allows one to perform queries, that is, to search for entries that contain a given set of terms and sort them according to the weights associated with these terms. In our experiments we use Lucene [H], a document retrieval library that builds and queries such indices.
Indices are essential in dkm to keep the processing efficient. Note that the index of docu-
ments is usually created anyway to allow searching in the collection and the index of cluster
labels may be reused in the future, so the overhead of introducing these two auxiliary data
structures should not be too big.
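A toy inverted index illustrating the structure described above (for exposition only; the Lucene indices used in the experiments are far more elaborate):

```python
from collections import Counter, defaultdict

class InvertedIndex:
    def __init__(self):
        self.entries = {}                  # entry id -> term counts
        self.postings = defaultdict(set)   # term -> ids of entries with it

    def add(self, entry_id, tokens):
        counts = Counter(tokens)
        self.entries[entry_id] = counts
        for term in counts:
            self.postings[term].add(entry_id)

    def query(self, terms):
        """Entries containing all query terms, best matches first."""
        ids = set(self.entries)
        for t in terms:
            ids &= self.postings[t]
        # Rank by the summed counts of the query terms in each entry.
        return sorted(ids, key=lambda i: -sum(self.entries[i][t] for t in terms))
```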
Each incoming document is segmented into tokens using the heuristic implemented in
the Carrot2 framework. A unique identifier is assigned to the document, which is then added to the index ID.
If cluster candidate labels are to be extracted directly from the input documents, this process takes place concurrently with document indexing. Depending on the variant of dkm,
we extract frequent phrases or noun phrases (from English documents). The resulting set of
candidate labels is added to a separate index IP . Each candidate cluster label is indexed as
if it were a single document. To minimize the number of identical index entries, we keep a buffer of unique labels in memory and flush them to the index in batches.
1: D ← a set of input documents
2: k ← number of “topics” expected in the input (used for k-Means).
/* Preprocessing */
3: ID ← empty inverted index of documents;
4: IP ← empty inverted index of candidate cluster labels;
5: for all d ∈D do
6: add d to index ID;
7: if extract candidate labels then
8: T ← the set of noun phrases and/or frequent phrases extracted from d;