Poznan University of Technology
Institute of Computing Science
Descriptive Clustering as a Method
for Exploring Text Collections
Dawid Weiss
A dissertation submitted to
the Council of the Faculty of Computer Science and Management
in partial fulfillment of the requirements for the degree of Doctor of Philosophy.
Supervisor
Jerzy Stefanowski, PhD Dr Habil.
Poznan, Poland
2006
Politechnika Poznanska (Poznan University of Technology)
Instytut Informatyki (Institute of Computing Science)
Grupowanie opisowe jako metoda eksploracji zbiorów dokumentów tekstowych
(Descriptive clustering as a method for exploring collections of text documents)
Dawid Weiss
Doctoral dissertation
Submitted to the Council of the Faculty of Computer Science and Management
of Poznan University of Technology
Supervisor
dr hab. inz. Jerzy Stefanowski
Poznan 2006
Streszczenie (Abstract)
The dissertation concerns information retrieval systems and the identification of thematically
related subgroups in collections of documents. The main motivation of the thesis is the creation
of readable, grammatically correct and comprehensible descriptions of the discovered groups.
Text clustering algorithms in current use are not suited to automatically producing sufficiently
good descriptions, while new applications in data exploration point to their practical importance.
The dissertation proposes to make the above postulates concerning group descriptions concrete
in the form of a definition of the descriptive clustering problem. It then presents a general
approach to constructing clustering algorithms that attempt to satisfy these requirements,
named Description Comes First (dcf).
In contrast to the classical approach, where only the assignment of documents to groups is
evaluated, dcf treats the group description as one of the key elements of the result of the entire
algorithm and uses this description both during the construction of the model of document
clusters and when creating their descriptions. In the dcf approach, the search for a set of
candidate group descriptions and for a mathematical cluster model proceeds independently.
Among the candidate descriptions, those are then selected which have support in the discovered
cluster model. In the last step, documents are assigned to the selected descriptions.
The thesis presents two algorithms that exemplify practical implementations of the dcf
approach. The first algorithm, Lingo, is applied to clustering the results of Web search engines.
The second algorithm, Descriptive k-Means, serves to cluster large numbers of longer
documents. Both algorithms implement the same general dcf scheme, but the differing
characteristics of the processed data require different concrete solutions: Lingo uses frequent
phrases and dimensionality reduction of the term matrix, whereas Descriptive k-Means uses
frequent phrase extraction, noun phrases and clustering with the k-Means algorithm.
The thesis presents computational experiments for both algorithms. The experimental results
compare the clustering quality (understood as the ability to reconstruct a known assignment
of documents to groups) of Lingo and Descriptive k-Means with their closest counterparts
from the literature: the Suffix Tree Clustering and k-Means algorithms. Another practically
important aspect of the evaluation is the presentation of data collected from the public instance
of the Carrot2 system, available as open source software.
Figure 2.5: A few phrases extracted for three example tag patterns encoded in the heuristic
chunker discussed on page 19. Chunk phrases are in the center column; the left and right
context of each phrase is also shown. We marked in red the phrases we considered invalid
(not of maximum length or ambiguous).
is needed to successfully apply this definition for phrase extraction, but when applicable,
it turns out to be a very effective heuristic. Moreover, any algorithm suitable for finding
frequent sequences of items can perform phrase extraction, and there is a wide selection of
efficient methods to choose from. In this thesis we will use suffix trees (in Descriptive
k-Means) and suffix arrays (in Lingo). We characterize the most important elements of both
methods below.
A suffix tree [95, 27, 51] is a tree in which all suffixes of a given sequence of elements can
be found on the way from the root node to a leaf node. What makes this data structure
efficient is that a single node in the tree may contain more than one element of the
sequence (this feature distinguishes it from another data structure — the suffix trie).
Figure 2.6(a) shows an example suffix tree for the word mississippi.
An interesting property of suffix trees is that any path starting at the root node and end-
ing in an internal node denotes a subsequence of elements that occurred at least twice in
the input (take a look at the node corresponding to character sequence issi in Figure 2.6(a)
on the following page). This observation leads to a simple and effective frequent phrase
extraction algorithm: build a suffix tree such that each element of the input sequence is a
single word and analyze the internal nodes — a path from the root node to each internal
(a) Suffix tree. Highlighted is the internal node and the path to the root for the
subsequence "issi".

(b) Suffix array. Highlighted is the continuous block for the subsequence "issi":

    index   substring
    10      i
    7       ippi
    4       issippi
    1       ississippi
    0       mississippi
    9       pi
    8       ppi
    6       sippi
    3       sissippi
    5       ssippi
    2       ssissippi

Figure 2.6: A suffix tree and a suffix array for the word mississippi.
node is a frequent phrase.
Suffix trees have become very popular mostly due to the low computational cost of their
construction — linear in the number of elements of the input sequence. Their practical
implementation is quite tricky due to the non-locality of tree traversal (a number of
algorithms have been proposed to overcome this issue).
Another data structure permitting frequent phrase detection is a suffix array [57, 61]. A
suffix array is a sorted array of all suffixes of a given input sequence. Figure 2.6(b) illustrates
a suffix array for the same word mississippi we used earlier. Frequent phrase extraction is
implemented by scanning the suffix array top-to-bottom, looking for continuous blocks of
identical prefixes (of maximum length) [109, 19].
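The scan described above can be sketched in a few lines. This is a naive illustration under simplifying assumptions (quadratic-time suffix sorting, counting shared prefixes of adjacent suffixes), not the optimized structures used later in the thesis:

```python
# Naive suffix-array-based frequent phrase extraction: sort all word
# suffixes, then scan adjacent suffixes for shared prefixes. A phrase
# shared by a contiguous block of m suffixes occurs m times in the input.
def frequent_phrases(words, min_count=2):
    n = len(words)
    suffixes = sorted(range(n), key=lambda i: words[i:])
    pair_counts = {}
    for a, b in zip(suffixes, suffixes[1:]):
        # length of the common prefix of two adjacent suffixes
        k = 0
        while a + k < n and b + k < n and words[a + k] == words[b + k]:
            k += 1
        for length in range(1, k + 1):
            phrase = tuple(words[a:a + length])
            pair_counts[phrase] = pair_counts.get(phrase, 0) + 1
    # m adjacent pairs in a block correspond to m + 1 occurrences
    return {p: c + 1 for p, c in pair_counts.items() if c + 1 >= min_count}

print(frequent_phrases("cat ate cheese and cat ate mouse".split()))
# -> {('ate',): 2, ('cat',): 2, ('cat', 'ate'): 2}
```

The sort dominates the cost here; the linear-time construction algorithms cited above avoid comparing whole suffixes.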
Algorithms for building suffix arrays have much better properties (locality of computation,
memory consumption); in fact, the only information we really need to store in a suffix
array is the index of the first element of each suffix. Of course, a straightforward
construction of a suffix array using generic sorting routines would slow the algorithm down
to the order of O(n log n) (assuming, a bit unrealistically, that string comparisons take
O(1) time). More efficient algorithms that preserve the desired properties of suffix arrays
have been suggested in the literature [45, 51].
Summarizing this section, with the help of suffix trees and suffix arrays we can locate fre-
quently recurring subsequences of words in the input. Anticipating further sections, we will
use frequent phrases as document features (for detecting similar documents) and primarily
to construct cluster labels. A number of problems arise from these applications of frequent
phrases.
• A frequent sequence of words is not necessarily a good phrase — it can be a meaning-
less collocation (like: vitally important) or a frequent structural element of a language
(like: out of, it is).
• A frequent sequence can be virtually any junk that just happens to recur in the input.
For example, when searching for frequent phrases in a corpus of mailing list messages, all
messages starting long discussion threads end up as frequent because they are usually
cited in replies.
• Not all phrases are sequences. We have already pointed out that in Polish the same
phrase can be rewritten in many ways with different word order and still be perfectly
comprehensible (as in: sto lat samotnosci, lat sto samotnosci, samotnosci sto lat, samot-
nosci lat sto).
• Essentially the same phrase may be non-continuous (interrupted by other words or
word forms). For example, compare: Franklin D. Roosevelt and Franklin Delano Roosevelt,
or Earvin Johnson and Earvin “Magic” Johnson.
2.2 Text Representation: Vector Space Model
To assess the similarity or dissimilarity of two or more documents, we need a model in which
these operations are defined. The model is usually selected to match a particular task’s re-
quirements and objectives. Several text representation models have been suggested in the
literature; a good overview can be found in [4]. To keep this chapter's size reasonable we
will focus only on the Vector Space Model (vsm) and the elements it consists of: document
indexing, feature weighting and similarity coefficients.
2.2.1 Document Indexing
Vector Space Model2 uses the concepts of linear algebra to address the problem of repre-
senting and comparing textual data.
A document d is represented in the vsm as a document vector [w_{t_0}, w_{t_1}, . . . , w_{t_Ω}],
where t_0, t_1, . . . , t_Ω is the set of words of a given language and w_{t_i} expresses the
weight (importance) of term t_i to document d. Weights in a document vector typically
reflect the distribution of words in that document. In other words, the value w_{t_i} in
a document vector d represents the importance of word t_i to that document.
Components of the document's vector are commonly called its features, because their
collection provides a footprint of the document's contents. Note that we can hardly speak
about the meaning of a document vector anymore, since it is basically a collection of
unrelated terms. For this reason, the vsm is sometimes called a bag-of-words model. The
process of translating input documents into their term vectors is called document indexing.
Given a set of documents, their document vectors can be put together to form a matrix
called a term-document matrix. The value of a single component of this matrix depends on
the strength of the relationship between a document and the given term. An example with
the following input documents (each consisting of a single sentence) demonstrates this.
2 Vector Space Model is usually credited to Gerard Salton, although the concept had already
been known in the literature when Gerard Salton started to use it — [21] is an interesting
essay on the subject.
document content
d0 Large Scale Singular Value Computations
d1 Software for the Sparse Singular Value Decomposition
d2 Introduction to Modern Information Retrieval
d3 Linear Algebra for Intelligent Information Retrieval
d4 Matrix Computations
d5 Singular Value Analysis of Cryptograms
In the first step, we identify all possible terms appearing in the input and build a matrix
where columns correspond to terms and rows correspond to input documents. We exclude
certain terms that we know are not useful for identifying the topic of a document (these
are called stop words) and restrict the presentation to just a few selected terms with at least
one non-zero weight. On the intersection of each column and each row we place the count
(number of occurrences) of the column’s term in the row’s document. For our example in-
put, the term-document matrix looks as shown below.
                Information  Scale  Analysis  Singular  Value  . . .
    d0               0         1       0         1        1
    d1               0         0       0         1        1
    d2               1         0       0         0        0
    d3               1         0       0         0        0
    d4               0         0       0         0        0
    d5               0         0       1         1        1
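The indexing step above can be sketched as follows. This is a toy illustration (dense lists, a hand-picked stop word set); real systems use sparse representations, as discussed next:

```python
# Build a term-document count matrix for the example collection.
docs = [
    "Large Scale Singular Value Computations",
    "Software for the Sparse Singular Value Decomposition",
    "Introduction to Modern Information Retrieval",
    "Linear Algebra for Intelligent Information Retrieval",
    "Matrix Computations",
    "Singular Value Analysis of Cryptograms",
]
stop_words = {"for", "the", "to", "of"}

# collect the vocabulary over all documents, minus stop words
terms = sorted({w.lower() for d in docs for w in d.split()} - stop_words)

# rows correspond to documents, columns to terms; cells hold counts
matrix = [[d.lower().split().count(t) for t in terms] for d in docs]

# the column for "singular" matches the table above: d0, d1 and d5
print([row[terms.index("singular")] for row in matrix])  # -> [1, 1, 0, 0, 0, 1]
```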
Two questions arise from the example. First, an average document will contain just
a small subset of all possible words of a given language, which is additionally an uncon-
strained set. We can solve this problem with the simplest method and store just the indices
of terms for which the document has non-zero weights (as we did in the example). More
advanced techniques encode gaps between indices or other forms of bit packing and cod-
ing [102, 4].
The second problem is related to our prospective application — measuring similarity
between documents. Certain words, such as pronouns, conjunctions or prepositions, occur
frequently in any document and are useless as features. We can ignore certain words like
that (by adding them to a set of stop words), but a better idea is to recalculate weights in
document vectors in a way that highlights words that are more important for a given docu-
ment with respect to others and downplays words that are very common. This task is called
feature weighting and we list a few known weighting methods in the next section.
2.2.2 Feature Weighting
Feature weighting methods can be divided into local (one document’s term count is avail-
able) and global (term counts of all documents are available). We list the weighting schemes
that have application somewhere in this thesis. A full overview of the subject can be found
in [83, 4].
The following notation is used: tf_{ij} — the number of occurrences of term i in document j;
df_i — the number of documents containing term i in the entire collection; w(i,j) — the
weight of term i in document j; N — the number of all documents in the collection.
Term Frequency, Inverse Document Frequency Certainly the most widely known feature
weighting formula, usually abbreviated as tf-idf. Credited to Gerard Salton [82], tf-idf
tries to balance the importance of a word in a document with how common the word is in
the entire collection.
$$ w(i,j) = \mathrm{tf}_{ij} \times \log_2 \frac{N}{\mathrm{df}_i} \tag{2.1} $$
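As an illustration, eq. (2.1) can be applied to a small term-document count matrix (the counts below are made up for the example, not taken from the thesis):

```python
import math

# tf-idf weighting (eq. 2.1): rows are documents, columns are terms.
def tf_idf(counts):
    n_docs = len(counts)
    n_terms = len(counts[0])
    # df_i: number of documents containing term i
    df = [sum(1 for d in counts if d[i] > 0) for i in range(n_terms)]
    return [
        [d[i] * math.log2(n_docs / df[i]) if df[i] else 0.0
         for i in range(n_terms)]
        for d in counts
    ]

counts = [[2, 0, 1], [0, 1, 1], [1, 0, 0]]
weights = tf_idf(counts)
# term 2 occurs once in document 0 and in 2 of 3 documents: 1 * log2(3/2)
print(round(weights[0][2], 4))  # -> 0.585
```

Note how a term occurring in every document would get idf = log2(1) = 0 and thus be ignored, which is exactly the downplaying of common words described above.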
Modified tf-idf This modification of the original tf-idf downplays the count of terms in a
document and contains certain algebraic modifications for faster calculation of w(i , j ) on a
precached index. Implemented in the document retrieval library Lucene [H].
$$ w(i,j) = \sqrt{\mathrm{tf}_{ij}} \times \left( \log_e \frac{N}{\mathrm{df}_i + 1} + 1 \right) \tag{2.2} $$
Pointwise Mutual Information A widely used weighting scheme, although known to be
biased towards infrequent events (terms) — an interesting discussion can be found in [60].
We show practical implications of this property in our experiments in chapter 7 on page 79.
$$ w(i,j) = \log_2 \frac{\mathrm{tf}_{ij} / N}{\left( \sum_k \mathrm{tf}_{ik} / N \right) \times \left( \sum_k \mathrm{tf}_{kj} / N \right)} \tag{2.3} $$
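A direct sketch of eq. (2.3), assuming tf is a term-by-document count matrix and the cell being weighted is non-zero:

```python
import math

# Pointwise mutual information weighting (eq. 2.3).
# tf[i][j]: occurrences of term i in document j; N: number of documents.
def pmi_weight(tf, i, j):
    N = len(tf[0])
    p_ij = tf[i][j] / N
    p_i = sum(tf[i]) / N                  # (sum_k tf_ik) / N
    p_j = sum(row[j] for row in tf) / N   # (sum_k tf_kj) / N
    return math.log2(p_ij / (p_i * p_j))

tf = [[2, 0], [1, 1]]  # two terms, two documents (made-up counts)
print(round(pmi_weight(tf, 0, 0), 4))
```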
Discounted Mutual Information Similar to pointwise mutual information, but multiplied
with a discounting factor to compensate for the problems mentioned above (formula after
[53]).
$$ w(i,j) = \mathrm{mi}_{ij} \times \frac{\mathrm{tf}_{ij}}{\mathrm{tf}_{ij} + 1} \times \frac{\min\left( \sum_k \mathrm{tf}_{kj}, \sum_k \mathrm{tf}_{ik} \right)}{\min\left( \sum_k \mathrm{tf}_{kj}, \sum_k \mathrm{tf}_{ik} \right) + 1} \tag{2.4} $$
2.2.3 Similarity Coefficients
Two documents in the Vector Space Model represent two points in a multidimensional term
space (each term is assumed to be an independent dimension). If we define a notion of
distance in this space, we can compare documents against each other and thus start looking
for similarities or dissimilarities.
Any distance metric defined over a multidimensional vector space can be used, but two
methods are the most common: Euclidean distance and the cosine measure.
A simple Euclidean distance is quite often used, but it requires document vector length
normalization prior to the calculation, or the number of words (the proportion of weights)
in each document will distort the result.
The cosine measure is a more robust technique, stemming from the observation that if two
vectors have approximately the same features then they should “point” in a very similar
direction in the space determined by the term-document matrix, regardless of their
Euclidean distance. To calculate the similarity between two documents we look at the angle
between them, which we can compute using the dot product of their document vectors. To
simplify things even more, we can use the cosine of this angle, which is easier to compute
(it does not require an inverse trigonometric function). We hence define the cosine measure
of similarity between the vector representations of documents d_i and d_j in the term
vector space as:

$$ \mathrm{sim}(d_i, d_j) = \cos(\alpha) = \frac{d_i \cdot d_j}{|d_i| \, |d_j|}, \tag{2.5} $$
where x · y denotes the dot product between vectors x and y and |x| is the norm of vector x.
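A minimal sketch of eq. (2.5) for two document vectors of equal length (the zero-vector convention is our own assumption, not part of the definition):

```python
import math

# Cosine similarity (eq. 2.5) between two document vectors.
def cosine(d_i, d_j):
    dot = sum(a * b for a, b in zip(d_i, d_j))
    norm_i = math.sqrt(sum(a * a for a in d_i))
    norm_j = math.sqrt(sum(b * b for b in d_j))
    if norm_i == 0 or norm_j == 0:
        return 0.0  # convention: an empty document is similar to nothing
    return dot / (norm_i * norm_j)

print(round(cosine([1, 1, 0], [1, 1, 0]), 6))  # identical direction -> 1.0
print(cosine([1, 0, 0], [0, 1, 0]))            # orthogonal vectors  -> 0.0
```

Because the result depends only on the angle, scaling a vector (e.g. repeating a document twice) leaves its similarity to other documents unchanged, which is why no prior length normalization is needed.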
The cosine measure is widely used in text clustering and many other text processing
applications because its definition is quite intuitive and its implementation efficient.
However, it is also known that in highly dimensional spaces any two random vectors are
very likely to be nearly orthogonal. An attempt to solve this problem is to reduce the
dimensionality of the feature space using feature selection, feature construction or
term-document matrix decomposition techniques [4].
2.3 Document Clustering
2.3.1 Introduction
Let us start with a general definition of clustering after Brian Everitt et al. [22]:

Given a number of objects or individuals, each of which is described by a set of
numerical measures, devise a classification scheme for grouping the objects into
a number of classes such that objects within classes are similar in some respect
and unlike those from other classes. The number of classes and the characteristics
of each class are to be determined.
By analogy to the above definition, document clustering, or text clustering, can be defined
as a process of organizing pieces of textual information into groups whose members are
similar in some way, while the groups as a whole are dissimilar to each other. But before we
delve into text clustering, let us take a look at clustering in general.
There are many kinds of clustering algorithms, suitable for different types of input data
and diverse applications. A great deal depends on how we define similarity between ob-
jects. We can measure similarity in terms of objects’ proximity (distance), or as a relation
between the features they exhibit. An intuitive demonstration of this difference is shown in
Figure 2.7 on the following page — the same set of objects is grouped depending on their
relative distance, or feature — shape and color.
Brian Everitt et al. suggest the following classification of clustering methods [22]:
• hierarchical techniques — in which clusters are recursively grouped to form a tree,
Figure 2.7: The same group of objects (1) “clustered” based on their relative distance (2) and
features they exhibit — shape (3) and color (4).
• optimization techniques — where clusters are formed by the optimization of a cluster-
ing criterion,
• density or mode-seeking techniques — in which clusters are formed by searching for
regions containing a relatively dense concentration of entities,
• clumping techniques — in which classes or clumps can overlap,
• others — methods which do not fall clearly into any of the above.
Alternative classifications of clustering algorithms can be suggested, depending on the
aspect we look at. In our opinion it is worthwhile to take a look at several aspects. Looking
at the structure of discovered clusters we can distinguish flat and hierarchical clustering
algorithms. Depending on the type of assignment between documents and clusters we can
have:
• partitioning algorithms — which assign each document to exactly one cluster,
• clumping techniques — described above; note that this type of clustering is natural
and desirable for texts because a single document can be assigned to more than one
topic,
• partial clustering — algorithms which may leave some objects entirely unassigned; in
this thesis we will use the term “others” to refer to a synthetic group of unclustered
objects.
Finally, the classification can be made depending on the strength of relationship between
an object and a cluster:
Figure 2.8: A simplified visual representation of different clustering algorithms:
partitioning, overlapping and partial assignment of objects to groups; hierarchical, binary
and other assignment.
• crisp clustering — with a binary assignment when a document is either assigned to a
cluster or not assigned to it,
• fuzzy clustering — when the degree of assignment is expressed on the scale of “not
associated” to “fully associated”, typically with a number between 0 and 1.
Figure 2.8 depicts different ways of looking at clustering algorithms depending on their char-
acteristics.
As a final element of this section, we should mention another interesting aspect of
clustering, related to our transparency requirement. Mark Sanderson and Bruce Croft [84]
divide clustering methods depending on how many features contributed to the inclusion of
a given object in a cluster. They distinguish monothetic algorithms, which assign objects to
clusters based on a single feature, and polythetic algorithms, which use multiple features.
Our work is a bit of both: we try to make the results monothetic (a transparent relationship
of cluster labels to documents), but at the same time we use polythetic clustering
algorithms for detecting groups of documents.
2.3.2 Overview of Selected Clustering Algorithms
Cluster analysis is a very broad field and the number of available methods and their
variations can be overwhelming. A good introduction to numerical clustering can be found in
Brian Everitt’s Cluster Analysis [22] or in Allan Gordon’s Classification [30]. A more up-to-
date view of clustering in the context of data mining is available in Jiawei Han and Miche-
line Kamber’s Data Mining: Concepts and Techniques [35]. Shorter surveys on the topic
are also available in [6] and [41]. Resources in the Polish language include, for example, a
chapter on clustering methods in Jacek Koronacki and Jan Cwik’s book Statystyczne systemy
uczace sie [46] and a Polish translation of David Hand, Heikki Mannila and Padhraic Smyth’s
Principles of Data Mining (Eksploracja Danych) [36]. We should emphasize again that most
text clustering algorithms attempt to transform the input text into a mathematical
representation directly suitable for use with numerical clustering algorithms, so any book
about cluster analysis will be relevant to the topic of this thesis.
In the following part of this section we describe a few selected clustering algorithms that
are important from the point of view of further chapters.
Partitioning Methods
Partitioning clustering methods divide the input data into disjoint subsets attempting to find
a configuration which maximizes some optimality criterion. Because enumeration of all
possible subsets of the input is usually computationally infeasible, partitioning clustering
employs an iterative improvement procedure which moves objects between clusters until
the optimality criterion can no longer be improved.
The most popular partitioning algorithm is the k-Means algorithm. In k-Means, we define
a global objective function and iteratively move objects between partitions to optimize
this function. The objective function is usually a sum of distances (or a sum of squared
distances) between objects and their clusters' centers, and the objective is to minimize it.
Assuming K is a set of clusters, t_i ∈ K is a cluster (a set of objects), C_{t_i} is the
representation of a cluster's center and d is an element, we try to minimize the following
expression:

$$ \sum_{t_i \in K} \sum_{d \in t_i} \mathrm{distance}(d, C_{t_i}) \tag{2.6} $$
The representation of a cluster can be an average of its elements (its centroid) or a medoid
(the object closest to the centroid of a cluster); in the latter case we call the algorithm
k-Medoids. Given the number of clusters k a priori, a generic k-Means procedure is
implemented in four steps:
1. partition objects into k nonempty subsets (most often randomly),
2. compute representations of the centers of the current clusters,
3. assign each object to the closest cluster,
4. repeat from step 2 until no more reassignments occur.
By moving objects to their closest partition and recalculating partition’s centers in each step
the method eventually converges to a stable state, which is usually a local optimum.
We discuss the computational complexity of k-Means in later sections; for now let us just
note that the entire procedure is efficient in practice and usually converges in just a few
iterations on non-degenerate data. Another thing worth mentioning is that the clusters
created by k-Means are spherical with respect to the distance metric — the algorithm is
known to have problems with non-convex, and in general complex, shapes.
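The four steps above can be sketched as follows. This is a minimal illustration for points in the plane with Euclidean distance (document clustering would use term vectors and a similarity measure from Section 2.2.3), not the implementation used later in the thesis:

```python
import math
import random

# Generic k-Means: steps 1-4 from the procedure described above.
def k_means(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # step 1: pick initial centers
    assignment = None
    for _ in range(max_iter):
        # step 3: assign each point to its closest center
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        if new_assignment == assignment:  # step 4: stop when stable
            break
        assignment = new_assignment
        # step 2: recompute each center as the centroid of its cluster
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = tuple(sum(x) / len(members)
                                   for x in zip(*members))
    return assignment, centers

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
assignment, centers = k_means(points, 2)
# the two nearby pairs end up in the same cluster
print(assignment[0] == assignment[1], assignment[2] == assignment[3])
```

Note that the result depends on the random initial partition in step 1: a different seed may converge to a different local optimum, which is exactly the local-optimum behavior mentioned above.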
Hierarchical Methods
A family of hierarchical clustering methods can be divided into agglomerative and divisive
variants. Agglomerative Hierarchical Clustering (ahc) initially places each object in its own
cluster and then iteratively combines the closest clusters, merging their content. The
clustering process is interrupted at some point, leaving a dendrogram with a hierarchy of
clusters.
Many variants of hierarchical methods exist, depending on the procedure for locating
pairs of clusters to be merged. In the single link method, the distance between clusters is
the minimum distance between any pair of elements drawn from these clusters (one from
each), in the complete link method it is the maximum distance, and in the average link
method it is correspondingly an average distance (a discussion of other merging methods
can be found in [22]). Each of these has a different computational complexity and runtime
behavior. The single link method is known to follow “bridges” of noise and link elements in
distant clusters (a chaining effect). The complete link method is computationally more
demanding, but is known to produce more sensible hierarchies [30, 4]. The average link
method is a trade-off between speed and quality, and efficient algorithms for its
incremental calculation exist, such as in the Buckshot/Fractionation algorithm [13].
Typical problems in hierarchical methods are in finding the stop criterion for the cluster
merging process [20], tuning parameters and finding a method of “flattening” dendrogram
levels to create clusters with more than two subgroups.
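As a sketch, agglomerative clustering with the single link criterion can be written as below. This is a naive O(n³) illustration that stops at a fixed number of clusters (one of the stop criteria mentioned above), not an efficient implementation:

```python
import math

# Agglomerative hierarchical clustering with the single link criterion,
# stopped when `target` clusters remain.
def single_link_ahc(points, target):
    clusters = [[p] for p in points]  # each object starts in its own cluster
    while len(clusters) > target:
        # find the pair of clusters with the smallest single-link distance
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(a, b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))  # merge the closest pair
    return clusters

points = [(0, 0), (0, 1), (5, 5), (5, 6)]
print(sorted(len(c) for c in single_link_ahc(points, 2)))  # -> [2, 2]
```

Recording the sequence of merges (instead of discarding it) would yield the dendrogram; swapping `min` for `max` in the distance computation gives the complete link variant.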
Clustering Based on Phrase Co-occurrence
The Internet brought a new challenge to the task of text clustering: incomplete data. Par-
titioning and hierarchical methods typically used vector space representation and required
verbose input (full documents). Web pages and mailing lists are typically shorter and snip-
pets found in search results — fragments retrieved from documents matching the query —
are an extreme example of incomplete data, often just a few words long. This new data had
to be reflected in novel approaches to clustering.
Algorithms utilizing phrase co-occurrence use frequently recurring sequences of terms
as features of similarity between documents. Assuming that documents discussing related
subjects should use similar vocabulary and phrasing, frequent phrases can be used to iden-
tify documents discussing the same (or related) topics. Note that what really makes the dif-
ference is the use of variable-length features that are later used for describing the discovered
clusters. This idea first appeared in the Suffix Tree Clustering (stc) algorithm, published
in a seminal paper by Oren Zamir and Oren Etzioni [105].
Suffix Tree Clustering works in two phases: first it discovers base clusters (groups of doc-
uments that share a single frequent phrase) and then merges base clusters together to form
the output.
The discovery of base clusters starts from segmenting the input into words and sentences.
Each sentence is essentially a sequence of words and is inserted as such into a generalized
suffix tree. A generalized suffix tree is similar to a suffix tree, but contains suffixes of
more than one input sequence. Internal nodes of the tree also keep pointers to the sequences
a given suffix originated from. This way, each internal node of the tree holds a sequence of
elements that occurred at least twice in the input, together with the sentences it occurred
in.
Figure 2.9: A generalized suffix tree for three sentences: (1) cat ate cheese, (2) mouse ate
cheese too and (3) cat ate mouse too. Paths to internal nodes (circles) contain phrases that
occurred more than once in documents indicated by children nodes (rectangles). Dollar
symbol is used as a unique end-of-sentence marker. Example after: [105].
After all the input sentences have been added to the suffix tree, the algorithm traverses
the tree’s internal nodes looking for phrases that occurred a certain number of times in more
than one document. Any node exceeding the minimal count of documents and phrase fre-
quency immediately becomes a base cluster. Figure 2.9 shows a generalized suffix tree with
three short example phrases.
The strongest element of stc is its use of a proper data structure — suffix tree
construction is linear and it permits very fast and convenient base cluster detection.
Interestingly, after this clever step, stc falls back to a simple single-link merge
procedure in which base clusters that overlap too much are iteratively combined into larger
clusters and the process is repeated. This step is not fully justified and may result in
merging base clusters that should
not be merged. An improved version of the algorithm was suggested by Irmina Masłowska
in [63, 64]. Nonetheless, Suffix Tree Clustering was the first algorithm to emphasize the im-
portance of comprehensible cluster labels and it was very inspiring for other authors. We
list several spin-off algorithms utilizing phrase co-occurrence in Section 2.4 on page 34.
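The base cluster discovery phase can be illustrated on the sentences from Figure 2.9. This toy sketch enumerates word n-grams directly instead of building a generalized suffix tree (the tree achieves the same result in linear time), and omits stc's phrase-frequency threshold and the subsequent merge phase:

```python
# Toy stc-style base cluster discovery: map each phrase (word n-gram)
# to the set of documents it occurs in, and keep phrases shared by at
# least `min_docs` documents.
def base_clusters(docs, min_docs=2, max_len=4):
    phrase_docs = {}
    for doc_id, text in enumerate(docs):
        words = text.lower().split()
        for i in range(len(words)):
            for j in range(i + 1, min(i + 1 + max_len, len(words) + 1)):
                phrase = tuple(words[i:j])
                phrase_docs.setdefault(phrase, set()).add(doc_id)
    return {p: d for p, d in phrase_docs.items() if len(d) >= min_docs}

docs = ["cat ate cheese", "mouse ate cheese too", "cat ate mouse too"]
clusters = base_clusters(docs)
print(clusters[("ate", "cheese")])  # -> {0, 1}
```

Each entry is a base cluster: the phrase is the (human-readable) label and the document set is its content, which is precisely why phrase-based methods yield comprehensible cluster labels.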
Other Clustering Methods
A number of other clustering methods are known in the literature — density-based methods,
model-based and fuzzy clustering, self organizing maps and even biology-inspired algo-
rithms. An interested Reader can find many surveys and books providing comprehensive
information on the subject [22, 30, 6, 36, 46].
2.3.3 Applications of Document Clustering
Applications of text clustering algorithms have changed over time, due to the availability
of unprecedented amounts of data, new ideas and algorithms, and new types of input data.
We summarize a list of major text clustering applications because it nicely outlines the
evolution of clustering methods from a background utility for modelling similarities among
objects to a first-hand element of the user experience.
Improving Document Retrieval Efficiency
The initial application of text clustering was in document retrieval. Keith van Rijsbergen
observed that “closely associated documents tend to be relevant to the same requests” (the
cluster hypothesis). Clustering was applied to a collection of documents prior to searching
to detect similar groups of documents. When the user typed a query, an information
retrieval algorithm retrieved the documents matching the query together with the documents
in their clusters, to improve recall. Note that clusters were never explicitly revealed to
the user, so there was no need to describe them.
Organizing Large Document Collections
Document retrieval focuses on finding documents relevant to a particular query, but it fails
to solve the problem of making sense of a large number of uncategorized documents. The
challenge here is to organize these documents in a taxonomy identical to the one humans
would create given enough time and use it as a browsing interface to the original collection
of documents.
Several large conferences, such as the Text Retrieval Conference (trec), published
reference data with samples of documents clustered manually by humans. The ongoing work
in this area focused primarily on replicating the man-made taxonomy as closely as possible,
maximizing a score calculated as conformity to predefined document-to-cluster assignments.
The comprehensibility of cluster descriptions was usually neglected because it was not a
direct factor affecting the score.
Browsing Document Collections
The observation that clusters alone present a certain value to the user of an information
retrieval system is very important; it was first made by Marti Hearst and Jan Pedersen
in their paper about the Scatter/Gather search system [37]. By scanning the description of a
cluster the user can assess the relevance of the remaining documents in that cluster and
find the interesting information faster (or at least identify the irrelevant clusters and avoid
them). The techniques for extracting cluster descriptions were very simple: selected titles of
documents within the cluster, excerpts of documents and keywords.
Duplicate Content Detection
In many applications there is a need to find duplicates or near-duplicates in a large number
of documents. Clustering is employed for plagiarism detection, for grouping related news
stories and for reordering search results rankings (to ensure higher diversity among the topmost
documents). Note that in such applications the description of clusters is rarely needed.
Integration with Search Engines
Modern Internet and intranet search engines index countless Web pages, documents
and news articles, and can search all this content very fast. Users simply rephrase the
query if the information they need does not show up at the top of the search results, and they
rarely need a full taxonomy of documents to find what they need. However, in some situations,
when the query is ill-defined or the information need is not clear (an overview-type
query, for example), a different type of text clustering may be helpful: search results clustering.
Search results clustering groups each query’s results and presents the
user with an overview of what the result contains. The clusters can be used to filter out
irrelevant hits or to refine the query with other terms.
We mentioned this type of clustering a few times already, but let us underline the key
elements of difficulty again. The input information for the algorithm is very limited: only
the titles and snippets are available. The algorithm must be very fast to avoid slowing down
the user interface of a search engine. Finally, the clusters must be accurately and clearly
described, because the user expects an overview of topics similar to the query and has no
time to guess the meaning of clusters described using keywords, for example.
The concepts presented in this thesis are applicable whenever the description of clusters
needs to be shown to the user. The most likely targets among the applications presented
above are document collection browsing and search results clustering.
2.4 Related Works
The purpose of this section is to present currently available algorithms and methods that
closely correspond to the ideas presented in this thesis.
Clustering
Antonio Gulli and Paolo Ferragina [24, 23] start from the limitations of Grouper (stc’s initial
implementation; see errata E-17) and build an algorithm called SnakeT. SnakeT uses non-contiguous
phrases as features, which the authors call approximate sentences. The criterion forming
a cluster is still (as in stc) the fact of sharing a sufficient number of approximate sentences.
The implementation and the algorithm’s design are far more complex than stc’s and use custom
data structures similar to those for frequent itemset detection in data mining, but the authors
pay attention to cluster label comprehensibility and enrich cluster descriptions with data
extracted from a predefined ontology.
In [48], the authors present a search results clustering algorithm which attempts to associate
documents with a single concept, where labels are chosen so that “they are good indicators
of the documents they contain”. The algorithm uses frequent terms, but also preprocessed
noun phrases. Unfortunately, as the authors put it: “[stems] are not usually very meaningful
for use as node labels, therefore we replace each stemmed term by the most frequently
occurring original term”. Note that such a heuristic would obviously fail for documents in Polish.
The provided screenshots show that the generated cluster labels are mostly based on
single words.
2.4. Related Works 35
Hotho, Staab and Stumme build a Conceptual Clustering system that refines cluster de-
scriptions using concept lattices [39, 38]. Their cluster descriptions are still single words but
they use a large thesaurus and formal concept analysis to avoid repetitions and synonyms
in cluster keywords.
In [107], the authors perform an interesting experiment with supervised training of a cluster
label selection procedure. First, a set of fixed-length word sequences (the article reports
3-grams) is created from the input. Each label is then scored with an aggregative formula
combining several factors: phrase frequency, length, intra-cluster similarity, entropy and
phrase independence. The specific weights and scores for each of these factors are learnt
from examples of manually prioritized cluster labels.
Pantel and Lin [53, 72] present a very interesting clustering algorithm called Clustering
with Committees, which builds clusters around groups of a few strongly associated (most similar)
documents, called committees. Because committees are so strongly related through their
set of features, they usually point to an unambiguous concept (which the authors even evaluate
using semantic relationships from WordNet). The cluster description remains a list of
strong features, but hopefully an unambiguous one. Pantel recently attempted to label the output
classes with more “semantic” labels [73], but this work definitely goes deeper into natural
language processing than information retrieval.
A concept of clustering combined with pattern selection (similar to the dcf approach)
appears in [108], where the authors present a classification system which uses clusters to select
labeled objects and expands the set of labeled objects with elements from within the cluster
to improve classification. Cluster descriptions are not part of the consideration.
Mark Sanderson and Bruce Croft [84] present a completely different, yet related approach
to exploring document collections. Instead of clustering input documents, they start
with salient terms and phrases taken from predefined queries to a document collection and
expand this set with a technique called Local Context Analysis. Once a large enough collection
of terms and phrases is gathered, it is automatically organized into a hierarchy, starting
with the most generic terms at the top and descending to the most detailed ones at the bottom.
The technique used by the authors is very interesting as it involves no clustering techniques, yet
provides a hierarchy of (quite comprehensible) descriptions of groups of documents in the
output. The disadvantage is that the authors bootstrap their method with a predefined set of
queries, which would be unavailable for another collection of documents.
An interesting cluster labeling procedure is also shown in the Weighted Centroid Covering
algorithm [90]. The authors start from a representation of clusters (centroids of document
groups) and then build their (word-based) descriptions by iterative assignment of the highest-scoring
terms to each category, making sure each unique term is assigned only once. Interestingly,
the authors point out that this kind of procedure could be extended to use existing
ontologies and labels, but they provide no experimental results of any kind.
Summarization
Similarities to our work can be found in the field of document summarization, especially
multi-document text summarization. The goal of summarization is to present a concise textual
summary of a single document or a group of documents. What distinguishes summarization
from descriptive clustering is that no document groups are ever shown in the results; the
output consists of exactly one summary, usually longer than a cluster label.
Summarization dates back to 1958, when Hans Peter Luhn published a pioneering paper
called The Automatic Creation of Literature Abstracts [55]. The algorithm works by forming a
definite set of keywords and then looking for sentences containing these keywords, assembling
an abstract from the highest-scoring sentences. The majority of later works in
the field are not far from Luhn’s idea, focusing mostly on refining the procedure of selecting
sentences abbreviating the content of a document. One approach, for instance, is to
take into account lexical clues, boosting the score of sentences in the proximity of words such as
important or significant and decreasing their score in the neighborhood of phrases like might
be or unlikely [80].
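Luhn's keyword-and-sentence scoring scheme can be sketched in a few lines. The following is an illustrative simplification, not Luhn's original procedure: the keyword selection (plain word frequency, no stop-word removal) and all function names are our own.

```python
import re
from collections import Counter

def luhn_abstract(text, n_keywords=10, n_sentences=2):
    """Luhn-style extract: pick frequent words as keywords, score sentences
    by keyword occurrences, and keep the top-scoring sentences."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    words = re.findall(r'[a-z]+', text.lower())
    # A real system would first remove stop words and very rare terms.
    keywords = {w for w, _ in Counter(words).most_common(n_keywords)}
    scored = sorted(sentences,
                    key=lambda s: sum(w in keywords
                                      for w in re.findall(r'[a-z]+', s.lower())),
                    reverse=True)
    top = set(scored[:n_sentences])
    # Reassemble the abstract in the original sentence order for readability.
    return [s for s in sentences if s in top]
```

The key design choice, shared by Luhn and most of his successors, is that the abstract is purely extractive: sentences are selected, never rewritten.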
In [77], the authors describe a summarization engine, mead, which extracts sentences and ranks
them using a clustering method. The highest-ranking sentences are selected for the summary.
A broader overview of summarization and topic segmentation techniques and systems
can be found in [52] or in [58].
Topic Segmentation
A piece of text, such as an article, rarely talks about a single subject. The analysis of how
topics change in a document is the subject of topic segmentation techniques. A large group
of methods in this field is based on theories of discourse modelling, represented by influential
papers by Eduard Skorochod’ko [88], Michael Halliday and Ruqaiya Hasan [34], and
Barbara Grosz and Candace Sidner [32]. An overview of the models of discourse along with
their applications to topic segmentation and summarization can be found in Jeffrey Reynar’s
thesis [79].
Summarization, topic segmentation and clustering have a high degree of overlap in their
motivations and even in the methodology of solving their respective problems. Having said
that, each discipline has its own niche where it fits best. The ideas presented in this work
to a certain extent combine the goals of multi-document text summarization, topic identification
and clustering, although we tend to stay in the field of information retrieval with regard
to the algorithmic solutions.
2.5 Evaluation of Clustering Quality
Experts seem to agree that objective measures of clustering quality are not feasible [59]. Text
clustering depends on such a variety of factors (implementation details, parametrization,
input data, preprocessing) that each experiment becomes quite unique, and deriving conclusions
about the supremacy of one algorithm over another seems far-fetched. Moreover,
there is usually more than one “good” result, and even human experts are rarely consistent
in their choice of the best one [56]. On the other hand, relying on merely anecdotal evidence
of improvement is obviously unsatisfactory.
There are two mainstream clustering evaluation methodologies: user surveys and mea-
sures of distortion from an “ideal” set of clusters. Neither method is perfect.
2.5.1 User Surveys
User surveys are a very common method of evaluating clustering algorithms [106, 63, 23, 85]
and often the only one possible. Unfortunately, a number of arguments speak against this
method of evaluation.
• It is difficult to find a significantly large and representative group of evaluators. When
the users are familiar with the subject (like computer science students or fellow sci-
entists), their judgment is often biased. On the other hand, people not familiar with
clustering and used to regular search engines have difficulty adjusting to a different
type of search interface.
• People are rarely consistent in what they perceive as “good” clusters or clustering. This
has been reported both in the literature [56] and in our past experiments on the Carrot2
framework, and it affects both the preparation of the answer sheets and the analysis of results.
• Experiment results are unique, unreproducible and incomparable. User surveys are
one-shot experiments that are not comparable with each other and are difficult to
perform repeatedly or periodically.
• Human evaluators learn by example, and their judgment and performance are not constant
throughout the experiment. This makes performing subsequent experiments
with the same evaluators impossible (because they have gained experience).
• User surveys usually require considerable time and effort, both in the preparation of the
experiment and in its practical realization.
We used user surveys in a few of our experiments in the past and were usually discouraged
by the results. During the course of work on this thesis we tried to avoid controlled
user studies and instead relied on empirical experiments, numerical investigation of quality
and user feedback collected from an open demonstration of the search results clustering
system Carrot2. We summarize this experience in Section 7.4 on page 107.
2.5.2 Measures of Distortion from Predefined Classes
Another evaluation method is based on defining a mathematical notion of the difference
between the set of clusters and a reference, desirable set of partitions (called a ground truth
set). A clustering algorithm should minimize this difference to mimic the behavior of the
person or algorithm that put together the ground truth set. A few popular data sets are available
for full document clustering, the majority created by mixing documents from thematically
different sources (such as different mailing lists) and, less often, by human selection and
tagging. Interestingly, in spite of a few attempts to create ground truth data sets for search
results clustering [87], no “standard” test collection for this problem exists at the time of
writing.
Let us assume a set of clusters $K = \{k_1, k_2, \dots, k_n\}$ and a set of ideal partitions
$C = \{c_1, c_2, \dots, c_m\}$ with a total of $N$ objects. The following metrics are typically used for measuring
the difference between the clusters and the ground truth set.
F-measure A measure popular in information retrieval: an aggregation of precision and recall,
here adapted to clustering evaluation. Recall that precision is the ratio of the
number of relevant documents to the total number of documents retrieved for a query. Recall
is the ratio of the number of relevant documents retrieved for a query to the total number
of relevant documents in the entire collection. In terms of evaluating clustering, the
f-measure of each single class $c_i$ is:

$$F(c_i) = \max_{j=1,\dots,n} \frac{2 P_j R_j}{P_j + R_j}, \qquad (2.7)$$

where:

$$P_j = \frac{|c_i \cap k_j|}{|k_j|}, \qquad R_j = \frac{|c_i \cap k_j|}{|c_i|}. \qquad (2.8)$$

The final f-measure for the entire set of clusters is:

$$F = \sum_{i=1}^{m} \frac{|c_i|}{N} \, F(c_i). \qquad (2.9)$$
Higher values of the f-measure indicate better clustering.
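Assuming clusters and classes are represented as sets of document identifiers, Equations 2.7–2.9 can be computed directly. The function below is a minimal sketch (the representation and names are ours, not part of the cited formulation):

```python
def clustering_f_measure(classes, clusters, N):
    """F-measure of a clustering against ground-truth classes (Eqs. 2.7-2.9).

    classes, clusters: lists of sets of document identifiers; N: total objects."""
    total = 0.0
    for c in classes:
        best = 0.0
        for k in clusters:
            overlap = len(c & k)
            if overlap == 0:
                continue
            p = overlap / len(k)  # precision P_j
            r = overlap / len(c)  # recall R_j
            best = max(best, 2 * p * r / (p + r))
        # Weight the best match of each class by its relative size.
        total += best * len(c) / N
    return total
```

A perfect reproduction of the ground truth yields 1.0; any mixing of classes lowers the score.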
Shannon’s Entropy Entropy is often used to express the disorder of objects within a cluster
in information-theoretic terms. We can define the entropy of each cluster $k_j$ as [11]:

$$E(k_j) = -\sum_{i=1}^{m} \frac{|c_i \cap k_j|}{|k_j|} \log \frac{|c_i \cap k_j|}{|k_j|}. \qquad (2.10)$$

Defined this way, entropy is not normalized, so we normalize it (after: [16]):

$$E(k_j) = -\frac{1}{\log m} \sum_{i=1}^{m} \frac{|c_i \cap k_j|}{|k_j|} \log \frac{|c_i \cap k_j|}{|k_j|}. \qquad (2.11)$$
Entropy of the entire clustering is the weighted entropy of its clusters:

$$E = \sum_{j=1}^{n} \frac{|k_j|}{N} \, E(k_j). \qquad (2.12)$$

Zero entropy means the cluster is composed entirely of objects from a single class.
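Equations 2.11–2.12 translate directly into code, again assuming clusters and classes as sets of document identifiers (a sketch with our own names; m > 1 classes are assumed so that log m is nonzero):

```python
import math

def cluster_entropy(classes, cluster):
    """Normalized entropy of one cluster (Eq. 2.11)."""
    e = 0.0
    for c in classes:
        p = len(c & cluster) / len(cluster)
        if p > 0:  # the 0*log(0) term is taken as zero
            e -= p * math.log(p)
    return e / math.log(len(classes))  # normalization by log m

def clustering_entropy(classes, clusters, N):
    """Size-weighted entropy of the whole clustering (Eq. 2.12)."""
    return sum(cluster_entropy(classes, k) * len(k) / N for k in clusters)
```

A pure cluster scores 0; a cluster mixing all classes uniformly scores 1.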
Byron E. Dom’s Clustering Entropy This is yet another information-theoretic measure. An
interesting thing about it is that it takes into account the difference between the number of
clusters and the number of classes (if such a difference exists) — a useful property we used
in our previous research on the influence of language properties on the quality of cluster-
ing [89]. We omit the exact formula here because we do not use it in this thesis — details
can be found in Byron Dom’s report [18].
Clustering Purity Purity gives the average ratio of the dominating class in each cluster to the
cluster size and is defined as:

$$P(k_j) = \frac{1}{|k_j|} \max_{i} \left( h(c_i, k_j) \right), \qquad (2.13)$$

where $h(c, k)$ is the number of documents from partition $c \in C$ assigned to cluster $k \in K$.
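Per-cluster purity follows Equation 2.13 directly; aggregating over clusters by size-weighted averaging is a common convention, though the text above defines only the per-cluster form (a sketch, names ours):

```python
def cluster_purity(classes, cluster):
    """Purity of a single cluster (Eq. 2.13): share of the dominating class."""
    return max(len(c & cluster) for c in classes) / len(cluster)

def clustering_purity(classes, clusters, N):
    """Size-weighted average purity (a common aggregation convention)."""
    return sum(cluster_purity(classes, k) * len(k) / N for k in clusters)
```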
Evaluation against a ground truth set is quite reliable and convenient because it yields a
numeric and repeatable result, but it comes with its own issues. First of all, a single number
does not explain what went wrong in the clustering process. It only provides an average
figure that is comparable, but hardly interpretable, and the range of source errors is broad
and varies in “severity” (a mix of two related classes is preferable to a uniform mixture of
documents, for example).
Each cluster validation measure also comes with inconvenient requirements concerning
cluster structure; most require an explicit partitioning into a number of clusters identical to
the number of ground-truth partitions. This requirement is very often hard to meet, especially
when we want our clustering algorithm to adjust the number of clusters freely or to allow
clusters that are pure subsets of the original classes with no additional penalty. In this thesis we
introduce a cluster validation measure similar to entropy but hopefully easier to interpret,
the cluster contamination measure (see Appendix A on page 116). We are also convinced that
visualization methods showing the allocation of documents within clusters help a great deal
when assessing the quality of a clustering algorithm, and we use many such visualizations in
Section 7.
Chapter 3
Descriptive Clustering
In this chapter we outline the differences between the traditional understanding of the document
clustering problem in information retrieval and descriptive clustering. We define descriptive clustering as
a distinct problem with a specific set of requirements, applicable to a certain class of
text browsing problems. Finally, we present a loose association with conceptual clustering
known in machine learning.
3.1 Problem Statement
Let us start by repeating the textbook definition of a clustering problem after [22]:
Given a number of objects or individuals, each of which is described by a set of
numerical measures, devise a classification scheme for grouping the objects into
a number of classes such that objects within classes are similar in some respect
and unlike those from other classes. The number of classes and the characteris-
tics of each class are to be determined.
Note that the above definition does not mention cluster labels at all; the objective is to find
groups of similar objects (documents in our case). Any application that brings clusters to
the user interface will need to find their textual description, an additional requirement
not stated in the definition of the problem. A good clustering algorithm (in terms of the
definition) may appear completely useless from the user’s point of view because it fails to
explain the reasons why clusters were formed. We believe the core difficulty lies in the transition
between the algorithm discovering groups of documents and the method of attaching
descriptions to these groups. Taking the Vector Space Model as an example, it seems almost
impossible to find a way of reconstructing comprehensible cluster labels from a mathematical
bag-of-words model of a group of documents. In our opinion the approaches known
in the literature, such as keyword tagging or the use of frequent phrases as cluster labels, do not
provide satisfactory answers to all the requirements of a cluster browsing application.
The idea presented in this thesis attempts to avoid this difficult phase of cluster labeling
instead of solving it. We do so by slightly relaxing the requirements concerning document
groups and shifting the emphasis to cluster labels. Compare the following definition of the descriptive
clustering problem with the definition of document clustering shown above:
Descriptive clustering is a problem of discovering diverse groups of semantically
related documents described with meaningful, comprehensible and compact
text labels.
Ideally, an algorithm solving the descriptive clustering problem should present docu-
ment groups for which clear and comprehensible descriptions exist. Document clustering is
therefore a step towards the final result, not the ultimate goal.
According to the above definition, we agree to discard clusters without sensible descrip-
tions. It may be disturbing at first, but our decision is deeply rooted in practical experience
gained with the Carrot2 framework and is an outcome of the following observations:
• the user will spend no additional time to figure out the meaning of a cluster label if its
description is unclear,
• the user will not inspect documents of a cluster with an unintuitive label,
• unclear or obscure relationships between the cluster description and documents in-
side it are discouraging and frustrating for the user.
All our further considerations try to take these facts into account, even if it potentially
affects the “ideal quality” of clustering understood as the expected allocation of documents
to groups.
3.2 Requirements
Evaluation of cluster labels presents a great challenge. Initially, inspired by approaches from
natural language processing, we tried to define strict formal requirements concerning clus-
ter labels, based on their grammatical decomposition. This direction turned out to be unre-
alistic — the structure of natural language, especially in Polish, seems to be far too complex
for reliable automatic evaluation.
Unable to specify the requirements formally, we instead define certain expectations
that cannot replace a formal definition but hopefully convey our intuition of what
cluster labels should be like. Therefore, to clarify the terminology, when we speak about
requirements concerning the problem of descriptive clustering, we mean two things:
• expectations concerning cluster labels (naturally imprecise and hard to verify, but pro-
viding certain intuition), and
• traditional requirements concerning clusters (groups of documents) which are taken
directly from the definition of clustering in information retrieval.
We describe these requirements in the following sections of this chapter.
3.2.1 Cluster Labels
We define three requirements concerning cluster labels: comprehensibility, conciseness and
transparency.
Comprehensibility
Zygmunt Saloni and Marek Świdziński make a very interesting observation of how people
perceive elliptical statements:
Jesteśmy przekonani, że każdy użytkownik języka ma stosunkowo jasną (choć
nie wyraźną!) intuicję elipsy, tzn. potrafi określić stopień kompletności danego
wypowiedzenia. ([81], page 56)
We are convinced that every speaker of a given language has a clear (but not explicit!)
intuition of ellipsis, that is, can determine the completeness of a given
pronouncement.
Extending this observation to comprehensibility, we suspect that native speakers of a given
language can easily determine whether a given sequence of words can function as a cluster label,
even though they cannot provide any explicit rules behind this judgment. Instead of defining
a good cluster label, we may pinpoint the negative cases and reject the clearly bad ones. A
list of reasons for rejecting, or at least penalizing, a cluster label, along with some examples, is
shown below. Good cluster labels should not fall into any of these categories.
• Grammatical inconsistency (not a sentence, not a pronouncement or an incomplete
phrase).
– wooden A go if
– byli krzywy noga z do (were crooked leg from to)
– of Computer Science
– z Torunia (from Toruń)
• Internal grammatical or inflectional constraint violated (the phrase is incorrect, words
inside it are not in agreement).
– Europe snowboarding resorts [→European snowboarding resorts]
– Samorządom miasta Poznaniu [→Samorząd miasta Poznania]
• External grammatical or inflectional constraint violated (the phrase is grammatically
correct, but is used in inflected form or lacks the required context).
– Instytucie Informatyki Politechniki Poznańskiej [→Instytut Informatyki Politechniki Poznańskiej]
– Alicji w Krainie Czarów [→Alicja w Krainie Czarów]
• Ellipsis or ambiguity.
– piłem Okocimy (drank Okocim)
– to i tamto (this and that)
Obviously, fully automatic and reliable verification of these constraints is impossible.
Even human assessment is often difficult; e.g., is inspector gadget a meaningful phrase when
it lacks context? A reasonable solution is to minimize the number of potentially bad descriptions
through their careful selection. That means, for instance, allowing only entire sentences
or pronouncements, since such entities should be self-contained and less ambiguous by definition.
Unfortunately, they are also too long to form concise cluster labels (as expressed in
the next requirement), so we decided to use the more fine-grained level of chunks (see Section
2.1.3 on page 16). Chunks should be grammatically consistent, potentially self-contained
and hopefully meaningful when extracted from the text and stripped of their surrounding context,
so they seem like good candidates, not falling into any of the unwanted categories mentioned
above.
Conciseness
Our goal is to show the user a brief, concise view of a structure of topics present in a set of
documents. Cluster labels should be as short as possible to minimize the amount of infor-
mation the user must process, but sufficient to convey the information about the cluster’s
documents. If a word in the description can be removed without sacrificing comprehensi-
bility of the phrase, then it should be.
Anticipating our further discussion, let us mention that this requirement is quite difficult
to realize without linguistic and contextual knowledge. Our algorithms satisfy this requirement
by allowing the user to express the desired length of cluster descriptions. We admit,
however, that this is only a partial solution to the problem.
Transparency
To the user of a clustering algorithm, all its internal elements (the model of text representation,
the similarity measures, the algorithm used for grouping documents) remain a black box
which he or she expects to work flawlessly. Any mistake made by the algorithm, especially
one that manifests itself in cluster descriptions, introduces confusion and decreases the user’s
trust in the entire algorithm.
We believe that the relationship between any document inside a cluster and its descrip-
tion must be clear and evident as in monothetic clustering. Similar clarity must exist in the
other direction — when looking at a description of a cluster, the user must be able to tell
which elements of this description can be found in the cluster’s documents. We will call a
clustering method transparent if the user is able to easily answer the following questions:
• Why was label X selected for documents in cluster Y?
• Why was document X placed in cluster Y?
Cluster keywords: apple
Excerpt: New York reminds us of the warmhearted program of Big Apple Greeter.

Cluster keywords: apache, server
Excerpt: [. . . ] Median hourly earnings of nonrestaurant food servers were $7.95 in May
2004. This figure was even lower among Native American tribes of Zuni, Navajo
or Apache. [. . . ]

Cluster keywords: jacek, placek
Excerpt: [. . . ] Jacek był grubym i nieporadnym chłopcem. [100 pages later] Na stole stał
pachnący placek. [. . . ]

Table 3.1: Examples of cluster labels consisting of keywords and fragments of documents
matching these keywords, but not at all their first common-sense meaning.
This requirement is partially inspired by the history of search engines in information retrieval.
In the beginning, most search engines placed a default Boolean disjunction operator
(or) between query keywords. The result included documents containing any term of the
query. But people soon realized that a default conjunction (and) is less confusing because
there is no guessing which combination of terms made a given document appear in the
result; with the default and, the relationship between the query and the set of retrieved documents
is very clear (or, in our terms: transparent).
Returning to the field of clustering, the algorithms used there employ mechanisms more
complex than vsm-based document retrieval, and the transparency requirement becomes
even more important. For instance, the traditional keyword-based cluster presentation
may lead to mistakes because users will assign the most common meaning to a set of
keywords; consider the examples of cluster keywords and documents not truly relevant to
such clusters shown in Table 3.1.
As for how the transparency requirement can be satisfied, in our opinion a cluster label
containment relationship is a good, clear rule of thumb: every document in a cluster must
contain a phrase from its description. Such a rule is very restrictive — the number of documents
containing an exact copy of a given phrase is likely to be small. Recalling the discussion
in Section 2.1.3 about the loose order of phrases in languages such as Polish, exact phrase
containment may not even be a correct heuristic.
We may relax the above common-sense rule and require the cluster to contain documents
where the label’s phrase appears with possibly reordered words or with other terms injected
inside it. The user should be able to control how much distortion from the cluster
label he or she allows. If such a definition is still too narrow, because the input is large or
larger clusters are needed, the cluster label can ultimately be replaced with a more generic
term. However, there should always be a possibility of expanding the abbreviated cluster
label into low-level, fully transparent elements to explain to the user how
the cluster was formed and what can be found inside it.
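The containment rule and its relaxed variant can be expressed as a simple predicate. The sketch below (function and parameter names are our own illustration) treats the relaxed form as unordered word containment, the simplest reading of the reordering relaxation:

```python
def label_is_transparent(label, documents, allow_reorder=False):
    """Check the label-containment rule: every document in the cluster must
    contain the label phrase, either verbatim or (relaxed) as a set of words."""
    label = label.lower()
    needed = set(label.split())
    for doc in documents:
        text = doc.lower()
        if allow_reorder:
            # Relaxed rule: all label words present, in any order.
            if not needed <= set(text.split()):
                return False
        elif label not in text:
            # Strict rule: the exact phrase must appear.
            return False
    return True
```

A fuller implementation would also bound how many foreign terms may be injected between label words, which is the "distortion" the user should be able to control.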
3.2.2 Document Groups
The problem of descriptive clustering is focused on cluster labels; nonetheless, it essentially
remains a document clustering problem. In this thesis we consider a subset of clustering
algorithms producing flat, overlapping clusters, with the possibility of leaving some
documents unassigned.
Internal Consistency
An algorithm solving the descriptive clustering problem should ensure documents inside a
cluster are similar to each other. We believe that internal consistency, regardless of its math-
ematical definition, corresponds strongly with the concept of cluster label transparency. If
all documents in a cluster have a clear relationship with its description then such a cluster
must appear consistent to the user and therefore fulfills the requirement.
External Consistency
Descriptive clustering should provide an overview of the topics present in the input, so we
search for diverse clusters (different from each other) and varying in size (not only the largest
ones, which might be obvious to the user).
Overlaps and Outliers
We can expect a single document to contain references to many different subjects, so an algorithm
solving descriptive clustering must allow placing it in more than one cluster. Moreover,
we can also expect a situation where a document does not belong to any cluster at all.
Such outlier documents can be abandoned entirely or can form a synthetic group of unrelated
documents. The point is not to force documents into their closest cluster if such a relationship
is not justified.
Note that we have assumed that the structure of clusters is flat. This is partially a consequence
of transparency: if we need a clear relationship between a cluster label and its content, it
would be difficult to come up with a transparent label for a compound cluster in a hierarchical
clustering. On the other hand, hierarchical clusters have a number of desirable features,
most importantly a more compact presentation compared to flat clusters. We consider it an
open question whether hierarchical, transparent clustering is feasible.
3.3 Relationship with Conceptual Clustering
During the work on this thesis it was pointed out to us that the difference between tradi-
tional and descriptive clustering resembles to some extent the ideas introduced earlier in
machine learning. Conceptual clustering was introduced fairly independently by Douglas
Fisher, Ryszard Michalski, Robert Stepp and Joel Martin [40, 25, 66] and implemented in
algorithms such as Cluster/2 or CobWeb.
A conceptual clustering system accepts a tabular list of objects, described using a fixed
set of attributes (events, observations, facts) and produces a classification scheme over the
domain of these attributes. Conceptual clustering algorithms are usually unsupervised and
use some notion of a quality evaluation function to discover classes with “good” descrip-
tions. Evaluation of class quality is performed by looking at summaries (descriptions) of
classes and confronting them with the training set. In other words, conceptual clustering systems measure the adequacy of classification and employ iterative search strategies for its
optimization, keeping in mind that the description of classes is an integral and important
part of an investigation [30].
A class in conceptual clustering is described with a set of attributes (concrete values,
probability distributions or other properties). For example, in Cluster/2, the algorithm cre-
ates descriptions of groups of objects based on conjunctions of simple conditions defined
on attributes of these objects. A description can look as shown below:
[height > 1290 cm] & [eye color = blue or green]
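As an illustration (a sketch under our own naming conventions, not Cluster/2's actual implementation), such a description can be treated as a conjunction of predicates tested against an object's attribute values:

```python
# Sketch: a Cluster/2-style class description as a conjunction of simple
# conditions over object attributes (illustrative names and data).
def matches(obj, description):
    """description: list of (attribute, predicate) pairs, AND-ed together."""
    return all(pred(obj[attr]) for attr, pred in description)

# [height > 1290] & [eye color = blue or green]
description = [
    ("height", lambda v: v > 1290),
    ("eye_color", lambda v: v in ("blue", "green")),
]

giant = {"height": 1500, "eye_color": "green"}
dwarf = {"height": 120, "eye_color": "blue"}
```

Membership in the class is then simply `matches(giant, description)`, which holds for the first object but not the second.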
The shared motivation of conceptual clustering and our problem of descriptive clustering is fairly clear: the class (or cluster) label is the key element driving the rest of the process. Having said that, a straightforward application of conceptual methods to text clustering seems to
be problematic — conceptual clustering is strongly related to a specific type of input data —
tabular lists of objects, each described with a set of attributes (typically nominal). Text repre-
sentation models have a different data characteristic — a great number of numeric features.
Adapting conceptual clustering algorithms to clustering text is of course possible (and has
been done in the past), but seems a bit artificial.
Summarizing, what is similar in conceptual clustering and descriptive clustering is the
motivation, the emphasis on describing the result using concepts understandable to a hu-
man. Their application domain and implementation remain quite different.
Chapter 4
Solving the Descriptive Clustering Task:
Description Comes First Approach
4.1 Introduction
Description Comes First (dcf) is our suggested solution to the problem of descriptive clus-
tering. We perceive dcf as a general method into which different algorithmic components
can be plugged; the two algorithms we present later in this document are its concrete in-
stances. In this chapter we would like to describe the common denominator — a high-level
procedure which helps in overcoming the most difficult problems of cluster labeling and in
our opinion fulfills the requirements of descriptive clustering. We can summarize the De-
scription Comes First approach by the following statement:
Description Comes First approach is a general method for constructing text
clustering algorithms suited to solving the problem of descriptive clustering.
4.2 Anatomy of Description Comes First
The dcf approach consists of several phases, illustrated in Figure 4.1, but the core idea is in
separating the selection of candidate cluster labels from cluster discovery:
• Candidate label discovery (phase 1) is responsible for collecting all phrases potentially
useful as good cluster labels (comprehensible and concise phrases).
• Cluster discovery provides a data model about document groups present in the input
data.
By splitting the process into these two phases, the most difficult element so far — creat-
ing proper cluster descriptions from a mathematical model — is avoided and replaced by a
problem of selection of appropriate cluster labels for each group of related documents found
in the input. The only purpose of cluster discovery (in phase 2) is to build a model of dom-
inant topics — major subjects the documents are about. This model is subsequently used
Figure 4.1: Key elements of the dcf approach.
to select appropriate labels from the set of candidates and is discarded afterwards. The fi-
nal document groups (clusters) are built around the selected cluster labels (called pattern
phrases) to further reduce the “semantic gap” between cluster descriptions and documents
they contain (to fulfill the transparency requirement). The process ends with pruning of
groups that did not collect enough documents and elimination of very similar cluster labels.
In the following sections we discuss the rationale behind each phase of the dcf ap-
proach, provide certain implementation clues and end with an illustrative example.
4.2.1 Phase 1: Cluster Label Candidates
As we mentioned in the introduction, previous research on text clustering shows that finding
cluster labels always encounters great difficulties. We can avoid this problem by preparing
candidate cluster labels prior to the clustering process and then picking only those for which
significant groups of documents exist.1 Because the process of cluster label selection is in-
dependent from clustering, it can fully utilize raw text input to assure comprehensibility and
conciseness described in Section 3.2.1.
An interesting side-effect of making candidate label selection a separate phase is its influence on the efficiency of the entire procedure. Note that cluster label extraction is basically
independent: it can precede clustering or run in parallel. It can be centralized or easily
distributed (each computational unit extracting candidate labels from a single document).
Moreover, a collection of cluster label candidates can be prepared a priori (an ontology) and
reused with no additional computational cost.
1It should be mentioned that this reversed order, labels→clusters instead of the traditional clusters→labels, was
first suggested on the Web site of a commercial clustering search engine, Vivisimo [D]. Obviously, no algorithmic
details had been released, so we do not know if our ideas align in any way with Vivisimo's. Proper credit is due to
Vivisimo's authors for inspiring our further work on the subject.
Implementation Ideas A set of candidate cluster labels can be prepared in several ways.
One possibility is to utilize existing dictionaries or ontologies. This scenario is interesting
because cluster labels are then given a priori, so we can assume they fulfill the requirements
and are comprehensible for end users. The entire dcf process then effectively becomes
a classification task to a set of predefined categories (implied by candidate cluster labels),
where only categories that collect enough documents are shown back to the user.
When candidate cluster labels are not given in advance, we must extract them directly
from the input documents. Several methods can be employed to do this:
• extraction of frequent phrases, much like in the stc algorithm,
• extraction of simple coherent linguistic chunks — noun phrases or other coherent
groups of words,
• full linguistic analysis to extract independent phrases or sentences.
Each one of the above methods has its advantages and disadvantages. Frequent phrase
extraction is a fast and scalable method, but may result in nonsensical candidates in the
output (non-grammatical, incomplete or common clichés). We still use them in both algo-
rithms presented later in this thesis, mostly because their extraction is so efficient, but we
are aware of the shortcomings of this solution. To defend frequent phrases and dcf a bit:
we believe (and our experiments support this belief) that an algorithm following dcf should
be able to deal with certain noise in the set of candidate cluster labels. A noisy cluster label
should not be supported by any dominant topic and should not become a pattern phrase.
Even if this happens, such a pattern phrase should not collect enough documents and be
discarded as a result. These elements are a clear improvement over plain stc for example,
which lacked such a verification step, often permitting junk frequent phrases to become
clusters. We return to this discussion in later sections.
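A minimal sketch of the frequent phrase option can be given with plain word n-gram counting (a toy stand-in for the suffix-based extraction used later in this thesis; the data and thresholds are illustrative):

```python
# Sketch: frequent-phrase candidates found by counting word n-grams
# across documents. A toy stand-in for stc-style suffix-based extraction;
# all parameter values are illustrative.
from collections import Counter

def frequent_phrases(documents, max_len=3, min_freq=2):
    counts = Counter()
    for doc in documents:
        words = doc.lower().split()
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                counts[" ".join(words[i:i + n])] += 1
    # Keep only phrases that recur often enough to be candidate labels.
    return {p for p, c in counts.items() if c >= min_freq}

docs = [
    "salsa music and dance",
    "salsa music festival",
    "tomato salsa recipe",
]
candidates = frequent_phrases(docs)
```

On this toy input the recurring phrase "salsa music" survives the frequency filter, while one-off phrases such as "music festival" are dropped; noisy candidates that do survive are expected to be filtered out in later dcf phases, as discussed above.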
To find better cluster label candidates we need to look at the methods of shallow linguistic processing introduced in Section 2.1.2. The most common way of finding coherent,
sensible groups of words (in English) is to divide the text into chunks. Chunks are the smallest (conciseness) grammatically consistent (comprehensibility) elements that the input text
can be divided into, so they offer much more in terms of our needs compared to frequent
phrases.
Statistical chunkers for English are reasonably efficient and accurate, and we use them
later in this thesis to extract noun phrases in Descriptive k-Means. Note that as with any
automatic method, chunks retrieved using tools based on statistical processing of text are
just an approximation and may still return incorrect results.
For Polish, we assumed an equivalent of an English chunk to be a group as defined
in [81]. Unlike chunks, however, groups may be unordered and distributed throughout the
sentence, so their direct use for cluster label candidates is more complex. We already mentioned our experiments with a simple heuristic automaton for detecting certain tag sequences, but this solution was too immature to be employed in this thesis. As a result, at
the moment of writing we are limited to frequent phrases (and possibly predefined ontologies).
4.2.2 Phase 2: Document Clustering
The intention of this phase is to construct a model of dominant topics2 present in the input.
Each dominant topic consists of a group of documents that are about the same, or closely
related subject. A dominant topic must also have a suitable representation which can be
used later (in the pattern phrase selection phase) to calculate similarity between each dom-
inant topic and phrases from the set of candidate cluster labels.
Note that while we refer to this phase as document clustering, any method producing
a model of dominant topics is actually sufficient. We present two different ap-
proaches to topic approximation in this thesis. In Descriptive k-Means we use a regular
clustering algorithm (k-Means) and assume each cluster’s centroid represents a single dom-
inant topic. In the Lingo algorithm, on the other hand, clustering is replaced by Singular
Value Decomposition (dimensionality reduction) of the term-document matrix. Dominant top-
ics are approximated with base vectors of one of the reduced matrices (we provide details
later).
Another element worth emphasizing is that dominant topics remain an internal artifact
in the process of dcf and never need to be shown to the user explicitly. This implies that
the model used for discovering dominant topics can be arbitrarily complex without hurt-
ing cluster label comprehensibility. This is a clear advantage with documents in Polish, for
instance. We can take into account loose syntax (use the vsm model instead of the phrase
co-occurrence model) and apply destructive text transformations to accommodate inflec-
tion (stemming, diacritic marks removal) without worrying about the problems of cluster
labeling, which are essentially resolved in the next phase of dcf.
Implementation Ideas The most obvious and natural choice of a representation model for
this phase is the Vector Space Model and we use it in combination with cosine measure in
both our algorithms. We suppose that other models of text representation could be used,
but this direction has not been explored.
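To illustrate, a dominant topic in the k-Means variant can be represented as the centroid of its documents' Vector Space Model vectors. The sketch below uses toy bag-of-words dictionaries and is our illustration only, not code from any algorithm in this thesis:

```python
# Sketch: a dominant topic as the centroid of its documents' vsm vectors
# (as in the k-Means variant mentioned above); toy data.
def centroid(vectors):
    terms = {t for v in vectors for t in v}
    return {t: sum(v.get(t, 0.0) for v in vectors) / len(vectors)
            for t in terms}

topic_docs = [
    {"jaguar": 1.0, "car": 1.0},
    {"jaguar": 1.0, "speed": 1.0},
]
topic = centroid(topic_docs)
```

The resulting vector (here dominated by the term shared by both documents) is exactly the kind of internal, never-displayed representation that pattern phrases are later matched against.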
4.2.3 Phase 3: Pattern Phrase Selection and Document Assignment
The role of this step is to pick those cluster label candidates which are most similar to the
representation of previously discovered dominant topics. We will call such labels pattern
2We will use the phrases: dominant topic, dominant concept and abstract concept interchangeably for historic
reasons.
phrases. A different way of looking at this phase is that we approximate the representation
of dominant topics, which we know is traditionally difficult to describe, with existing com-
prehensible labels.
As with any approximation, there are certain risks involved. For example, there is a risk
that no cluster label candidate will match a given topic. This is almost impossible if cluster
candidate labels have been extracted from the input documents, but is much more likely for
a predefined set of labels. While at first it may seem like a disadvantage, this stems from the
intuition of user’s anticipated behavior (Section 3.1 on page 41) — a cluster which cannot
be properly described is useless, even if the documents inside it make sense from the point
of view of the clustering method. Moreover, we rarely encountered this problem in real life
and believe the ability to hide clusters to which no sensible label can be found is actually a
strong point of the approach. The discussion of potential differences and distortions from
an ideal clustering is continued in Section 4.3 on page 53.
Once pattern phrases have been identified, the representation of dominant topics is dis-
carded and pattern phrases replace them as seeds of final document groups. This is a conse-
quence of the transparency requirement — we want a clear relationship between a cluster’s
label and its content. Pattern phrases will become cluster descriptions, so we must use them
directly to find the documents matching the topic they represent.
Note that the document allocation phase fulfills the requirements concerning group over-
lap and partial clustering defined in Section 3.2.2. Documents are assigned to each pattern
phrase independently, so they may belong to more than one group. Documents not rel-
evant to any pattern phrase at all may also exist, obviously, forming a synthetic group of
non-clustered documents.
The last step (pruning) is meant to remove any pattern phrases which failed to collect
enough documents. The following scenarios are possible:
• No documents are assigned to the pattern phrase. A rare, but possible, case when the
pattern phrase was similar to the dominant topic’s model, but does not associate any
documents. Consider the following example: the representation of a dominant topic
contains keywords lemony and snicket. A candidate cluster label Lemony Snicket3 will
be selected, but, unfortunately, no document contains this exact phrase. The group is
(correctly) discarded.
• Very few documents are assigned to the pattern phrase. This may indicate that the
pattern phrase encompasses just a part of the original topic or the dominant topic was
a combination of more than one subject. We can either discard the pattern phrase
or use it for merging with other small groups with overlapping documents (but this,
as we know from the stc algorithm, may lead to problems and is generally against
the transparency as we defined it). The threshold at which we consider the pattern
phrase and its documents irrelevant is a tuning parameter of a concrete algorithm
implementing dcf.
3Lemony Snicket is a pseudonym of Daniel Handler, an American novelist and the author of a series of darkly
comic children’s books known as A Series of Unfortunate Events.
• A significant number of documents is assigned to the pattern phrase. In this case the
pattern phrase and its associated documents become part of the final result, that is
become a cluster described with a pattern phrase and containing the documents allo-
cated to it.
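The pruning rule in the scenarios above can be sketched as a minimum-size filter; the threshold and the toy clusters below are illustrative, and the actual threshold is a tuning parameter of a concrete dcf implementation:

```python
# Sketch: prune pattern phrases that failed to collect enough documents.
# min_size and the example data are illustrative values.
def prune(clusters, min_size=2):
    return {label: docs for label, docs in clusters.items()
            if len(docs) >= min_size}

clusters = {
    "lemony snicket": [],            # matched the topic model, no documents
    "unfortunate events": ["d1"],    # too few documents
    "book series": ["d1", "d2", "d3"],
}
kept = prune(clusters)
```

Only the last group survives; the first two scenarios from the list above are handled by the same filter.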
Implementation Ideas There are two elements of difficulty: pattern phrase selection and
document allocation.
To select pattern phrases we must look for candidate cluster labels similar (or "close") to
the discovered dominant topics. Assuming both cluster label candidates and dominant topics are expressed in the same model (in the same vector space, for example), a simple calculation of similarity between them should suffice.
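A sketch of this selection, assuming both labels and topics live in the same toy vector space (the similarity threshold, data and names are our assumptions):

```python
# Sketch: pattern phrase selection as a nearest-label search in a shared
# vector space; all vectors and the threshold are illustrative.
import math

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def select_pattern_phrase(topic, labels, min_sim=0.3):
    """Return the candidate label closest to the topic vector, or None
    if no label is similar enough (the topic is then dropped)."""
    best = max(labels, key=lambda l: cosine(labels[l], topic))
    return best if cosine(labels[best], topic) >= min_sim else None

topic = {"apache": 0.9, "server": 0.8, "http": 0.4}
labels = {
    "apache http server": {"apache": 1.0, "http": 1.0, "server": 1.0},
    "apache indians": {"apache": 1.0, "indians": 1.0},
}
phrase = select_pattern_phrase(topic, labels)
```

Returning `None` when no label is close enough corresponds to hiding a topic that cannot be sensibly described, as discussed above.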
Document allocation is more tricky since we want to have a clear relationship between
a pattern phrase and the documents allocated to it. We have already discussed several “rule
of thumb” heuristics that could be used for this task when we talked about the transparency
requirement (on page 43). Let us recall them now:
• allocate all documents containing an exact copy of the pattern phrase (strict rule),
• allocate all documents containing a possibly distorted copy of the pattern phrase (re-
ordered words, foreign words injected inside); the user should be able to control the
allowed level of distortion,
• allocate all documents containing the phrase and any synonymous phrases that could
be related to it, but offer the user a possibility of expanding the cluster label to explain
which phrases contributed to the allocated documents.
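The first two heuristics can be sketched as follows; the gap limit and the token-level matching scheme are our illustrative assumptions (a real implementation would work on a token index rather than raw strings):

```python
# Sketch: strict exact-phrase allocation and a relaxed variant tolerating
# a limited number of injected foreign words (illustrative only).
def strict_match(doc, phrase):
    return phrase in doc

def distorted_match(doc, phrase, max_gap=2):
    """Phrase words must occur in order, with at most max_gap foreign
    words injected between consecutive phrase words."""
    words, pattern = doc.split(), phrase.split()
    for start, w in enumerate(words):
        if w != pattern[0]:
            continue
        pos, ok = start + 1, True
        for target in pattern[1:]:
            gap = 0
            while pos < len(words) and words[pos] != target and gap < max_gap:
                pos += 1
                gap += 1
            if pos < len(words) and words[pos] == target:
                pos += 1
            else:
                ok = False
                break
        if ok:
            return True
    return False
```

Here `max_gap` plays the role of the user-controlled distortion level mentioned in the second rule: the strict rule rejects "hillary and bill rodham" for the phrase "hillary rodham", while the relaxed rule accepts it.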
Implementation of pattern phrase selection and document allocation in practice may
become tricky, especially with large problem instances when efficiency of processing be-
comes critical. We show two different implementations of these elements in Lingo and De-
scriptive k-Means — the algorithms presented later in this thesis. Lingo uses a relatively
simple vsm-based retrieval model which results in several problems in document allocation
phase. Learning upon this experience, we improved the document allocation procedure in
dkm to scale to large problem instances and accommodate different document allocation
heuristics we mentioned above.
4.2.4 An Illustrative Example
This section demonstrates the dcf approach on a simple two-dimensional Vector Space
Model example. All referenced illustrations are collected in Figure 4.2 on page 54.
Let us narrow the “language” of documents in our example to only two terms: X and Y .
We can represent all input documents as points in a two-dimensional vector space, where
the horizontal axis represents the weight (importance) of term X and the vertical axis rep-
resents the weight (importance) of term Y . For estimating similarity between documents
represented in the term vector space we will use the cosine measure — the angle between
vectors starting in point (0,0) and ending at each respective document vector's location. In
all subsequent figures we represent objects in the term vector space as small circles cast to
a unit sphere (the angle obviously does not change). A dcf approach to finding clusters in
the input documents would proceed as follows.
In the first step we collect cluster candidate labels. In our example these candidate labels
are already represented as red circles in the term vector space (see Figure 4.2(a)). Angle
vectors are also shown for clarity.
In a concurrent step we parse input documents and represent them in the same vector
space model. Each document is depicted as a faded blue circle with a direction vector going
through it (see Figure 4.2(b)).
We proceed to the second phase and detect dominant topics in the input documents. In
our example we look for groups of documents with a similar angle and note that two clear
groups exist (see Figure 4.2(c)). An average angle of the content of the group is the centroid
vector — the dominant topic’s representation in our model, depicted with larger blue arrows.
In the third phase we select pattern phrases for the dominant topics. Since our example
uses the same model for representing documents and labels, we can simply put everything
in the same space (see Figure 4.2(d)). Selecting pattern phrases is about choosing candidate
cluster labels “close to” topic vectors. Technically, we look for any vectors representing la-
bel candidates that lie within an infinite hypercone around the topic vector’s axis. In our
example’s two-dimensional space, the hypercone is simply an angle around a topic vector
(see Figure 4.2(e)) — we select three pattern phrases for the next phase. Note that the cone’s
opening angle is a tuning parameter of the algorithm.
Finally, we assign documents to pattern phrases in a way similar to the one previously
used to find pattern phrases. For each pattern phrase we look for documents within their
pattern hypercones and assign these documents to the pattern phrase. As shown in Fig-
ure 4.2(f), documents can be assigned to more than one pattern phrase. Each final cluster
contains documents similar to the selected pattern phrase (transparency requirement); the
original dominant topic's vector is no longer directly taken into account.
Note that we use a simple cosine similarity model in the last phase of this example to
convey the idea of how dcf works. In a real algorithm the document assignment phase
would have to take into consideration word order and proximity of terms in the pattern
phrase, something the cosine measure omits entirely.
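The hypercone test from the example can be sketched directly in two dimensions; all numbers below (topic direction, label vectors, the 15-degree half-opening) are illustrative choices of ours, not values from any algorithm in this thesis:

```python
# Sketch of the two-dimensional example: vectors are (X, Y) term weights,
# and a label becomes a pattern phrase when its angle to a topic vector
# lies within the cone's half-opening angle (illustrative numbers).
import math

def angle(v):
    return math.atan2(v[1], v[0])

def within_cone(v, topic, half_opening):
    return abs(angle(v) - angle(topic)) <= half_opening

topic = (1.0, 1.0)                      # dominant topic at 45 degrees
labels = {"a": (1.0, 0.9), "b": (0.1, 1.0), "c": (1.0, 0.0)}
cone = math.radians(15)                 # tuning parameter of the algorithm
patterns = [name for name, v in labels.items()
            if within_cone(v, topic, cone)]
```

Only the label nearly parallel to the topic vector falls inside the cone; widening the opening angle admits more candidates, which is exactly the tuning trade-off mentioned above.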
4.3 Discussion of Clustering Quality
Description Comes First approach avoids the most difficult problems of labeling clusters,
but of course with trade-offs someplace else. There are two potential places where clustering
quality may degrade (compared to an “ideal” clustering):
• when pattern phrases are selected, they are an approximation of dominant topics con-
structed in phase 2,
(a) Candidate labels and their “position” in the
model.
(b) Documents and their “position” in the model.
(c) Concept vectors discovered in the documents. (d) Preparation for merging — cluster label
candidates and topics are represented in the same
model.
(e) Candidate labels close to topics (within cones)
become pattern phrases.
(f) Documents matching pattern phrases (within
cones) are assigned to their groups.
Figure 4.2: Example showing the dcf approach applied to documents and labels in a two-
dimensional term vector space.
• when documents are assigned to selected pattern phrases we use them as the refer-
ence point instead of the original representation of dominant topics.
The first issue seems to be more important as it means we are “cheating” the user a bit
by replacing the original dominant topics with groups of documents created around perfect
candidate labels. We believe this element is a virtue rather than a vice. Dominant topics
represent ideal groups expressed in a model used for clustering, but this model is obscure
and incomprehensible to the user. A “semantic gap” between the cluster’s representation
and its perception by a human is inevitable and introduces just as much confusion. In our
opinion the approximation of dominant topics using pattern phrases is not about “cheating”,
but rather choosing the closest comprehensible image of dominant topics that the user can
fully understand.
The second problem — documents assigned to pattern phrases instead of the original
dominant topic — is a straightforward consequence of the transparency requirement. We
again attempt to minimize the semantic gap, this time between the pattern phrases and
documents inside them. If we used the dominant topics to allocate documents to pattern
phrases, the cluster labels would remain comprehensible, but their content would be rele-
vant to something the user never sees explicitly.
The document assignment step must ensure that there really are documents that match
selected pattern phrases and that the link between cluster labels and documents inside them
is clear. This step is also a safety valve of the entire dcf procedure: even if incorrect (unre-
lated) cluster labels are selected to be pattern phrases they are not likely to collect enough
documents to form final clusters.
4.4 Summary
We believe the strengths of the dcf approach lie in the following properties:
• Candidate phrases can be extracted from the input text automatically based on fre-
quency or other statistics, just as in the stc algorithm. Alternatively, candidate phrases
can come from a completely different source or even a predefined ontology (to guar-
antee they are comprehensible).
• Cluster discovery and candidate phrase extraction are independent and can be easily
parallelized. In fact, the extraction can be done incrementally as the documents are
added to the system.
• Dominant topic detection can use an arbitrarily complex model of text representation
and cluster analysis without making the cluster labeling procedure any more difficult.
• If the method used for detecting dominant topics returns an unclear representation of
a topic then it is less likely to find a matching label in the set of label candidates. Even
if a matching label is found, the document assignment phase provides a second-level
pruning. We thus ensure that groups of documents in the output are really relevant
and well described.
• The final assignment of documents to pattern phrases fulfills the transparency re-
quirements we established for the descriptive clustering problem. All documents in
a final cluster must contain a phrase from its label (possibly distorted), so the rela-
tionship between the cluster and its label should be clear to the user.
Chapter 5
The Lingo Algorithm
The motivation for creating Lingo was to come up with an algorithm for clustering search
results capable of discovering diverse groups of documents and at the same time keeping
cluster labels sensible. The work on Lingo must be credited to Stanisław Osinski, who worked on
the algorithm under supervision of Jerzy Stefanowski [68] and later contributed a great deal
of effort to the Carrot2 framework. The author of this thesis worked with Stanisław on a
number of co-authored papers [70, 68] and this fruitful cooperation gradually resulted in a
conceptual basis for defining descriptive clustering and the dcf approach.
The aim of this section is to show how Lingo fits in the general scheme introduced by
the dcf. The algorithm is an example of dcf’s application to the domain of search results
clustering and several elements of its implementation are designed specifically to deal with
this type of input data.
5.1 Application Domain
Clustering search results differs significantly from other types of document clustering. Each
matching result (hit) in a list of results returned by a search engine contains a resource loca-
tor (url), an optional title, and a short fragment of text called a snippet, which is optional as
well. Modern search engines assemble snippets individually for each query by scanning the
body of a document and looking for short spans of text that contain as much of the query
as possible. Two or three best matching spans are joined and returned as a short block of
text providing insight into the original document for the user. This technique of generating
snippets is called kwic — keyword in context. Figure 5.1 shows a typical snippet.
In the remaining part of this chapter we will use the term document to refer to a single
hit, even though the entire document is obviously not returned with the search result.
The usual number of hits returned by a search engine is anywhere between a few dozen and
a few hundred entries, so the input is relatively small. Moreover, it is not likely to grow sub-
stantially larger because search engines limit the number of hits to a few thousand (Google,
Yahoo and others).
Figure 5.1: A typical “hit” returned by a search engine: document title on top, snippet with
query terms in the middle and an information line (with the document’s address) on the
bottom.
Figure 5.2: Generic elements of dcf and their counterparts in Lingo. svd decomposition
takes place inside the cluster label induction phase; it is extracted here for clarity.
Conclusions are twofold: on one hand, a search results clustering algorithm must work with
incomplete, fragmented data (an extreme example is a document with an empty snippet and
an empty title). On the other, the scalability of the algorithm is not that important as long
as it is unnoticeably fast (for a human user) on typical input data sizes.
5.2 Overview of the Algorithm
Lingo processes the input in four phases: snippets preprocessing, frequent phrase extrac-
tion, cluster label induction and content allocation. The parallels to the generic scheme
introduced in the dcf are illustrated in Figure 5.2. Algorithm 5.1 on the next page contains
full pseudocode of the algorithm; we discuss the details of each step in the sections below.
5.2.1 Input Preprocessing
In the preprocessing phase the input documents (titles and snippets) are tokenized and split
into terms. Lingo is implemented as a component embedded in the Carrot2 framework
1: D ← input documents (or snippets)
/* Preprocessing */
2: for all d ∈D do
3: perform text segmentation of d ; /* Segmentation, stemming. */
4: if language of d recognized then
5: apply stemming and mark stop-words in d ;
6: end if
7: end for
/* Frequent Phrase Extraction */
8: concatenate all documents;
9: Pc ← discover complete phrases;
10: Pf ← {p : p ∈ Pc ∧ frequency(p) > Term Frequency Threshold};
/* Cluster Label Induction */
11: A ← term-document matrix of terms not marked as stop-words and
with frequency higher than the Term Frequency Threshold;
12: S,U ,V ← SVD(A); /* Product of SVD decomposition of A */
13: k ← 0; /* Start with zero clusters */
14: n ← rank(A);
15: repeat
16: k ← k +1;
17: q ← ‖Sk‖F / ‖S‖F ;
18: until q < Candidate Label Threshold;
19: P ← phrase matrix for Pf ;
20: for all columns of Uk^T P do
21: find the largest component mi in the column;
22: add the corresponding phrase to the Cluster Label Candidates set;
23: labelScore← mi ;
24: end for
25: calculate cosine similarities between all pairs of candidate labels;
26: identify groups of labels that exceed the Label Similarity Threshold;
27: for all groups of similar labels do
28: select one label with the highest score; /* cluster description */
29: end for
/* Cluster Content Discovery */
30: for all L ∈ Cluster Label Candidates do
31: create cluster C described with L;
32: add to C all documents whose similarity
to C exceeds the Snippet Assignment Threshold;
33: end for
34: put all unassigned documents in the “Others” group;
/* Final Cluster Formation */
35: for all clusters do
36: clusterScore← labelScore×‖C‖;
37: end for
38: Sort final clusters.
Algorithm 5.1: Pseudo-code of the Lingo algorithm.
and uses its infrastructure to perform certain text preprocessing tasks — stemming, mark-
ing stop words and simple text segmentation heuristics. After tokenization is complete, a
term-document matrix is constructed out of the terms that exceed a predefined term fre-
quency threshold. After that, document vectors are weighted using the tf-idf formula [82].
Terms present in document titles are additionally boosted compared to those appearing in
snippets by a predefined constant because titles are more likely to contain sensible (human-
edited) information.
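The weighting step can be sketched as follows; the logarithmic idf form and the boost constant below are our assumptions for illustration, not the exact values used in Lingo or Carrot2:

```python
# Sketch: tf-idf weighting with an extra boost for title terms.
# The ln-based idf and the boost constant are illustrative assumptions.
import math

def tfidf(term_freq, docs_with_term, n_docs, in_title=False, title_boost=2.5):
    idf = math.log(n_docs / docs_with_term)   # rarer terms weigh more
    w = term_freq * idf
    return w * title_boost if in_title else w

w_snippet = tfidf(3, 10, 100)
w_title = tfidf(3, 10, 100, in_title=True)
```

A term occurring in a title simply receives a constant multiple of its snippet weight, reflecting the assumption that titles carry more carefully edited information.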
5.2.2 Frequent Phrase Extraction
The aim of this step is to discover a set of cluster label candidates — phrases (but also sin-
gle terms) that can potentially become cluster labels later. Lingo extracts frequent phrases
using a modification of an algorithm presented in the shoc algorithm [19]. A word-based
suffix array is constructed and extended with an auxiliary data structure — the lcp (Longest
Common Prefix). This allows the algorithm to identify all frequent complete phrases in O(n)
time, n being the total length of all input snippets.
The frequent phrase extraction algorithm ensures that the discovered labels fulfill the
following conditions:
• appear in the input at least a given number of times (a tuning threshold of the algorithm);
• not cross sentence boundaries; sentence markers indicate a topical shift, therefore a
phrase extending beyond one sentence is unlikely to be meaningful;
• be a complete frequent phrase (the longest possible phrase that is still frequent); com-
pared to partial phrases, complete phrases should allow clearer description of clusters
(compare: “Hillary Rodham” and “Senator Hillary Rodham Clinton”);
• neither begin nor end with a stop word; stop words that appear in the middle of a
phrase should not be discarded.
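The frequency and stop-word conditions above can be expressed as a simple predicate over a candidate phrase (a sketch; the stop-word list and threshold are placeholder values, and completeness, being a longest frequent phrase, is a property of the suffix-array extraction itself rather than a per-phrase test):

```python
STOP_WORDS = {"the", "of", "for", "to", "a", "in"}   # placeholder stop list

def is_label_candidate(phrase, frequency, min_frequency=2):
    """Check the frequency and stop-word conditions for a phrase.

    `phrase` is a token list that does not cross a sentence boundary
    (sentence splitting is assumed to have happened earlier).
    """
    if not phrase or frequency < min_frequency:
        return False
    # Must neither begin nor end with a stop word; inner stop words stay.
    return phrase[0] not in STOP_WORDS and phrase[-1] not in STOP_WORDS
```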
5.2.3 Cluster Label Induction
During the cluster label induction phase, Lingo identifies the abstract concepts (or dominant topics, in the terminology used in dcf) that best describe the input collection of snippets. This proceeds in two steps: abstract concept discovery, followed by phrase matching and label pruning.
In abstract concept discovery, singular value decomposition (svd) is applied to the term-document matrix A, breaking it into three matrices U, S and V in such a way that A = U S V^T. An interesting property of svd is that the first r columns of matrix U, r being the rank of A, form an orthogonal basis for the term space of the input matrix A [29]. It is
commonly believed that base vectors of the decomposed term-document matrix represent
an approximation of “topics” — collections of terms connected with an obscure net of latent
relationships. Although this fact is difficult to prove, singular value decomposition is widely used
in text processing, for example in Latent Semantic Indexing (lsi). From Lingo’s point of view,
basis vectors (column vectors of matrix U ) contain exactly what it has set out to find — a
vector representation of the abstract concepts.
The most significant k base vectors of matrix U are determined by comparing the Frobenius norms of the term-document matrix A and of its k-rank approximation A_k. Let threshold q be a percentage-expressed value that
determines to what extent the k-rank approximation should retain the original information
in matrix A. We hence define k as the minimum value that satisfies the following condition:
‖A_k‖_F / ‖A‖_F ≥ q,
where the symbol ‖X‖_F denotes the Frobenius norm of matrix X. Clearly, the larger the
value of q the more cluster candidates will be induced. The choice of the optimal value for
this parameter ultimately depends on the preferences of users, so we make it one of Lingo’s
control thresholds — Candidate Label Threshold.
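Since the squared Frobenius norm of A_k equals the sum of the k largest squared singular values of A, the smallest k satisfying the condition can be read directly off the singular values produced by the svd. A minimal sketch (the function and parameter names are ours):

```python
import math

def estimate_k(singular_values, q):
    """Smallest k such that ||A_k||_F / ||A||_F >= q.

    singular_values -- non-increasing singular values of A
    q               -- the Candidate Label Threshold, 0 < q <= 1
    """
    total = math.sqrt(sum(s * s for s in singular_values))
    partial = 0.0
    for k, s in enumerate(singular_values, start=1):
        partial += s * s
        if math.sqrt(partial) / total >= q:
            return k
    return len(singular_values)
```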
The phrase matching and label pruning step, where group descriptions are discovered, relies
on an important observation that both abstract concepts and frequent phrases are expressed
in the same vector space — the column space of the original term-document matrix A. This
enables us to use the cosine distance to calculate how “close” a phrase or a single term is to
an abstract concept. Let us denote by P a matrix of size t × (p + t), where t is the number
of frequent terms and p is the number of frequent phrases. P can be easily built by treating
phrases as pseudo-documents and using one of the term weighting schemes.
Having the P matrix and the i-th column vector of the svd's U matrix, a vector m_i of cosines of the angles between the i-th abstract concept vector and the phrase vectors can be calculated as:
m_i = U_i^T P.
The phrase that corresponds to the maximum component of the m_i vector is selected as the human-readable description of the i-th abstract concept. Additionally, the value of the cosine (similarity) becomes the score of the cluster label candidate.
The same process can be extended from a single abstract concept to the entire U_k matrix: a single matrix multiplication M = U_k^T P yields the result for all pairs of abstract concepts and frequent phrases.
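With unit-length columns in both U_k and P, the whole matching step reduces to one matrix product followed by a per-row maximum. A plain-Python sketch with toy data:

```python
def match_labels(concept_vectors, phrase_vectors, labels):
    """For each abstract concept pick the best-matching label.

    concept_vectors -- columns of U_k (unit length)
    phrase_vectors  -- columns of P (unit length)
    labels          -- readable text of each column of P
    Returns a (label, score) pair per concept; the score is the cosine
    between the concept vector and the chosen phrase vector.
    """
    result = []
    for u in concept_vectors:
        scores = [sum(a * b for a, b in zip(u, p)) for p in phrase_vectors]
        best = max(range(len(scores)), key=scores.__getitem__)
        result.append((labels[best], scores[best]))
    return result
```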
The final step of label induction is to prune overlapping labels. Let V be a vector of cluster label candidates and their scores. We create another term-document matrix Z, where cluster label candidates serve as documents. After column length normalization we calculate Z^T Z, which yields a matrix of similarities between cluster labels. For each row we then pick the columns that exceed the Label Similarity Threshold and discard all but the one cluster label candidate with the maximum score, which becomes the description of a future cluster.
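The pruning step can be sketched as follows; the greedy grouping below is one possible reading of the procedure, and the label vectors stand for the length-normalized columns of Z:

```python
def prune_labels(vectors, labels, scores, threshold):
    """Discard near-duplicate labels, keeping the best-scoring one.

    vectors   -- unit-length term vectors of the candidate labels
    threshold -- the Label Similarity Threshold
    """
    discarded = set()
    for i in range(len(labels)):
        if i in discarded:
            continue
        # Group label i with every later, still-live label too similar to it.
        group = [i] + [
            j for j in range(i + 1, len(labels))
            if j not in discarded
            and sum(a * b for a, b in zip(vectors[i], vectors[j])) > threshold
        ]
        best = max(group, key=lambda j: scores[j])
        discarded.update(j for j in group if j != best)
    return [labels[i] for i in range(len(labels)) if i not in discarded]
```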
5.2.4 Cluster Content Allocation
The process of cluster content allocation very much resembles document retrieval based on
the plain vsm model. The only difference is that instead of one query, the input snippets are matched against a series of queries, each of which is a single cluster label. Thus, if for a
certain query-label, the similarity between a document and the label exceeds a predefined
threshold, it will be allocated to the corresponding cluster. Note that from the point of view of dcf, the traditional Vector Space Model used for comparisons is not ideal: the label's word order and term proximity are not taken into account.
Let us define matrix Q , in which each cluster label is represented as a column vector.
Let C = Q^T A, where A is the original term-document matrix for input documents. This way, element c_ij of matrix C indicates the strength of membership of the j-th document in the i-th cluster. A document is added to a cluster if c_ij exceeds the Snippet Assignment
Threshold, yet another control parameter of the algorithm. Documents not assigned to any
cluster end up in an artificial cluster called “Other documents”.
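A sketch of the allocation step in plain Python (document and label vectors correspond to the columns of A and Q; the threshold value used in the example below is illustrative):

```python
def allocate(doc_vectors, label_vectors, threshold):
    """Assign documents to clusters via c_ij = q_i . a_j.

    Returns one list of document indices per label, plus the indices
    of documents that fall into the "Other documents" group.
    """
    clusters = [[] for _ in label_vectors]
    others = []
    for j, doc in enumerate(doc_vectors):
        assigned = False
        for i, label in enumerate(label_vectors):
            if sum(a * b for a, b in zip(label, doc)) > threshold:
                clusters[i].append(j)   # overlapping assignment is allowed
                assigned = True
        if not assigned:
            others.append(j)
    return clusters, others
```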
5.2.5 Final Cluster Formation
Finally, clusters are sorted for display based on their score, calculated using the following
formula:
C_score = labelScore × ‖C‖,
where ‖C‖ is the number of documents assigned to cluster C . The scoring function, al-
though simple, prefers well-described and relatively large groups over smaller ones.
5.3 An Illustrative Example
Let the input collection contain d = 7 documents. We omit the preprocessing
stage and assume t = 5 terms and p = 2 phrases are given (these appear more than once and
thus will be treated as frequent). The input is shown in Figure 5.3.
The t = 5 terms
T1: Information
T2: Singular
T3: Value
T4: Computations
T5: Retrieval
The p = 2 phrases
P1: Singular Value
P2: Information Retrieval
The d = 7 documents
D1: Large Scale Singular Value Computations
D2: Software for the Sparse Singular Value Decomposition
D3: Introduction to Modern Information Retrieval
D4: Linear Algebra for Intelligent Information Retrieval
D5: Matrix Computations
D6: Singular Value Analysis of Cryptograms
D7: Automatic Information Organization
Figure 5.3: Input documents, frequent terms and phrases.
We now preprocess the input term-document matrix: tf-idf weighting and normalization result in matrix Atfidf, and svd decomposition of that matrix yields matrix U containing the abstract concepts.
Atfidf =
| 0     0     0.56  0.56  0     0     1    |
| 0.49  0.71  0     0     0     0.71  0    |
| 0.49  0.71  0     0     0     0.71  0    |
| 0.72  0     0     0     1     0     0    |
| 0     0     0.83  0.83  0     0     0    |

U =
| 0     0.75  0    -0.66  0    |
| 0.65  0    -0.28  0    -0.71 |
| 0.65  0    -0.28  0     0.71 |
| 0.39  0     0.92  0     0    |
| 0     0.66  0     0.75  0    |
Now we look for the value of k — the estimated number of clusters. Let us define quality
threshold q = 0.9. Then the process of estimating k is as follows:
k = 0 → q = 0.62, k = 1 → q = 0.856, k = 2 → q = 0.959
and the number of expected clusters is k = 2.
To find relevant descriptions of our clusters (k = 2 columns of matrix U ), we calculate
similarity between candidate phrases and concept vectors as matrix M = U_k^T P, where P is
a synthetic term-document matrix created out of our frequent phrases and terms (values in
matrix P are again weighted using tf-idf and normalized):
P =
| 0     0.56  1  0  0  0  0 |
| 0.71  0     0  1  0  0  0 |
| 0.71  0     0  0  1  0  0 |
| 0     0     0  0  0  1  0 |
| 0     0.83  0  0  0  0  1 |

M =
| 0.92  0     0     0.65  0.65  0.39  0    |
| 0     0.97  0.75  0     0     0     0.66 |
Rows of matrix M represent clusters, columns — their descriptions. For each row we select
the column with maximum value. The two selected labels are: Singular Value (score: 0.92)
and Information Retrieval (score: 0.97). We skip label pruning as it is not necessary in this
example. Finally, documents are allocated to clusters by applying matrix Q , created out of
cluster labels, back to the original matrix Atf-idf. The final result is shown below. Note the
fifth column in matrix C , representing unassigned document D5.
Q =
| 0     0.56 |
| 0.71  0    |
| 0.71  0    |
| 0     0    |
| 0     0.83 |

C =
| 0.69  1  0  0  0  1  0    |
| 0     0  1  1  0  0  0.56 |
Information Retrieval [score: 1.0]
D3: Introduction to Modern Information Retrieval
D4: Linear Algebra for Intelligent Information Retrieval
D7: Automatic Information Organization
Singular Value [score: 0.95]
D2: Software for the Sparse Singular Value Decomposition
D6: Singular Value Analysis of Cryptograms
D1: Large Scale Singular Value Computations
Other: [unassigned]
D5: Matrix Computations
5.4 Computational Complexity
The time complexity of Lingo is quite high, bound mostly by the cost of the term-document matrix decomposition (recall that a suffix array can be built in time linear with respect to the input size). To the best of our knowledge, svd decomposition can be performed in the order of O(m^2 n + n^3) for an m × n matrix [29]. Moreover, memory requirements are
demanding because of all the matrix transformations.
Note, however, that Lingo has been designed for a very specific application — search
results clustering — and in this setting scalability to large data sets is of no practical importance (the information is more often limited than abundant). In the next chapter we will discuss another algorithm that scales well to large numbers of documents and also implements the dcf approach.
5.5 Summary
Strong Points
• Lingo was the first algorithm implementing cluster description search prior to actual
document allocation.
• Lingo handles fragmented, incomplete input. It discovers a diverse structure of topics
using dimensionality reduction applied to the term-document matrix and subsequent
label search with base vectors of the reduced space.
• Lingo has few tuning parameters that fit well in its application domain. The number
of clusters is determined by taking advantage of a side-product of the singular matrix
decomposition — the accuracy of approximation of the original term vector space.
Weak Points
• The document assignment step breaks the transparency requirement of descriptive
clustering: documents containing subphrases or even isolated words from the cluster
label can become part of that label’s cluster.
• Troublesome scalability to larger problem instances.
• Candidate label discovery is bound to frequent ordered sequences of words in the in-
put. Even though we could try to use an external set of labels for cluster label induc-
tion, this possibility has not been exercised so far.
Fulfillment of Requirements
• Comprehensibility and Conciseness — Lingo extracts candidate cluster labels from a
set of frequent phrases. The danger of selecting frequent, but meaningless labels (as
in stc) exists, but is not an annoyance in practice, for two reasons. First, we select only a subset of all frequent phrases that correspond to dominant topics present
in the input and detected using matrix decomposition techniques. Second, the input
to Lingo is very specific — it is short and contextual with regard to the query (snippets)
and this context is quite likely to contain recurring phrases that denote meanings synonymous with the query, which helps the algorithm select candidate phrases.
• Transparency — Transparency in Lingo suffers from the generic vsm document alloca-
tion procedure, which allows documents that contain only subphrases of the original
cluster label to be added to the group. This often yields unintuitive results.
• Clusters Structure — Cluster diversity is ensured by the use of singular matrix decomposition: the base vectors representing dominant topics are orthogonal, and orthogonal vectors are commonly believed to correspond to different topics. Internal consistency is sometimes broken as a result of the document allocation procedure. The algorithm is able to produce overlapping clusters.
Chapter 6
Descriptive k-Means Algorithm
Our initial experiments with dcf concerned clustering search results and, as we already
mentioned, this is a very specific application domain. A few challenging questions arose:
• Is it possible to create an algorithm implementing dcf that scales well to large num-
bers (tens of thousands) of documents?
• Is it possible to adapt a well-known text clustering algorithm to the dcf approach so that it at least retains the original clustering quality while improving the comprehensibility of cluster labels? What will be the difference in clustering quality between the derived algorithm and the original?
Descriptive k-Means (dkm) is an attempt to provide answers to the above questions. It combines cluster label discovery (we experiment with two techniques: frequent phrase extraction and noun phrase extraction) with the very well known numerical clustering algorithm k-Means, resulting in a novel algorithm that follows the dcf approach.
We are aware that k-Means is very often criticized: it produces spherical clusters (with
respect to the distance metric used), it requires the number of clusters to be given in advance, and it always assigns each object to its closest cluster centroid, regardless of their
actual resemblance. However, we still chose to extend k-Means for a few important reasons.
First of all, we wanted to have a scalable, very fast baseline algorithm. Running times of
k-Means are practically linear with the size of input (the algorithm is interrupted when it is
close enough to convergence) and the procedure scales to very large data sets [16, 50].
Second, encouraged by our very good experience with the diversity of topics detected using svd decomposition, we looked for a similar method that would handle large input data.
Interestingly, cluster centroid vectors created by k-Means are reported to be a close approxi-
mation of singular value decomposition’s base vectors [43, 17]. This made us believe that we
could use k-Means as an efficient and scalable algorithm consistent with the behavior once
observed in the Lingo algorithm.
Finally, k-Means is a widely recognized and very often used numerical clustering algo-
rithm — many researchers use it as a benchmark result for their own achievements, so it
is relatively easy to cross-compare results with others. By choosing k-Means, an algorithm
with a notorious reputation when applied to text clustering, we hoped to demonstrate that
adapting it to the dcf approach can help improve, or at least retain, the clustering quality
and yield more comprehensible cluster labels, consistent with the requirements defined in
descriptive clustering.
6.1 Application Domain
The envisioned application domain for Descriptive k-Means consists of short and medium-length documents such as news stories, Web pages, e-mails and other documents not exceeding
a few pages of text. We expect the input to be real text (not completely random, noisy doc-
uments and not fragments like snippets) written in one language (left to right word order,
available word segmentation heuristic). In this thesis we consider texts written in English
and Polish. The designed algorithm must be able to handle thousands of input documents
for off-line clustering and return results in a reasonable time.
6.2 Overview of the Algorithm
Descriptive k-Means closely follows the dcf approach. The cluster label discovery phase is
implemented in two alternative variants: using frequent phrase extraction and with shallow
linguistic processing for English texts (extraction of noun phrase chunks). Dominant topic
discovery is performed by running a variant of the k-Means algorithm on a sample of input documents. We experimented with various types of features and weighting schemes for document representation and found that, except for pointwise mutual information, which is known to cause problems, all of them gave similar results.
In the pattern phrase selection phase the algorithm uses the Vector Space Model to calculate
similarities between cluster label candidates and dominant topics (represented by cluster
centroids). The document assignment phase uses a mix of vsm and a Boolean model implemented on top of a search engine (and utilizing its data structures) to ensure processing efficiency. Unlike in Lingo, document assignment searches for documents that contain a pattern phrase, but allows certain distortions such as minor word reordering and different words injected inside the phrase. The level of pattern phrase distortion is adjustable and is a parameter of the algorithm.
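For illustration, such tolerant matching could be approximated by a sliding window that accepts the pattern words in any order together with a bounded number of injected words; the window-slack parameterization is our assumption, as dkm itself realizes this with Boolean and proximity queries over the index.

```python
def phrase_matches(pattern, doc_tokens, max_slack=2):
    """True if all words of `pattern` occur, in any order, within some
    window of len(pattern) + max_slack consecutive document tokens."""
    window = len(pattern) + max_slack
    needed = set(pattern)
    for start in range(max(1, len(doc_tokens) - window + 1)):
        if needed <= set(doc_tokens[start:start + window]):
            return True
    return False
```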
The correspondence between the dcf approach and Descriptive k-Means is depicted graphically in Figure 6.1. Algorithm 6.1 on page 68 contains the full pseudocode of the algorithm, and we discuss each major step in the sections below.
6.2.1 Preprocessing
In the preprocessing step we initialize two important data structures: an index of documents
and an index of cluster candidate labels.
An index is a fundamental structure in information retrieval. Each entry added to an index (document or candidate cluster label in our case) is accompanied by a vector of terms
6.2. Overview of the Algorithm 67
Figure 6.1: Generic elements of dcf and their counterparts in Descriptive k-Means.
and their counts appearing in that entry. The index also maintains an associated list con-
taining all unique terms and pointers to entries a given term occurred in (inverted index).
The index allows one to perform queries, that is, to search for entries that contain a given set of terms and sort them according to the weights associated with these terms. In our experiments we use Lucene [H], a document retrieval library that builds and queries such indices.
Indices are essential in dkm to keep the processing efficient. Note that the index of docu-
ments is usually created anyway to allow searching in the collection and the index of cluster
labels may be reused in the future, so the overhead of introducing these two auxiliary data
structures should not be too big.
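A toy inverted index illustrating the structure described above (for exposition only; the Lucene indices used in the experiments are far more elaborate):

```python
from collections import Counter, defaultdict

class InvertedIndex:
    def __init__(self):
        self.entries = {}                  # entry id -> term counts
        self.postings = defaultdict(set)   # term -> ids of entries with it

    def add(self, entry_id, tokens):
        counts = Counter(tokens)
        self.entries[entry_id] = counts
        for term in counts:
            self.postings[term].add(entry_id)

    def query(self, terms):
        """Entries containing all query terms, best matches first."""
        ids = set(self.entries)
        for t in terms:
            ids &= self.postings[t]
        # Rank by the summed counts of the query terms in each entry.
        return sorted(ids, key=lambda i: -sum(self.entries[i][t] for t in terms))
```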
Each incoming document is segmented into tokens using the heuristic implemented in
the Carrot2 framework. A unique identifier is assigned to the document, which is then added to the index ID.
If cluster candidate labels are to be extracted directly from the input documents, this process takes place concurrently with document indexing. Depending on the variant of dkm,
we extract frequent phrases or noun phrases (from English documents). The resulting set of
candidate labels is added to a separate index IP . Each candidate cluster label is indexed as
if it were a single document. To minimize the number of identical index entries, we keep a buffer of unique labels in memory and flush them to the index in batches.
1: D ← a set of input documents
2: k ← number of “topics” expected in the input (used for k-Means).
/* Preprocessing */
3: ID ← empty inverted index of documents;
4: IP ← empty inverted index of candidate cluster labels;
5: for all d ∈D do
6: add d to index ID;
7: if extract candidate labels then
8: T ← the set of noun phrases and/or frequent phrases extracted from d;